DLP: Display count and links to the papers which cite some version of the dandiset #1669

Open
yarikoptic opened this issue Jul 28, 2023 · 9 comments
Labels
enhancement New feature or request

Comments

@yarikoptic
Member

Came up in dandi/helpdesk#105. A naive implementation could just search for the DOI on Google Scholar, e.g. https://scholar.google.com/scholar?hl=en&as_sdt=0%2C46&q=dandi.000055%2F0.220127.0436&btnG= but I bet there are better options; I didn't check.
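
A minimal sketch of how the naive lookup could be scripted. It only builds the Google Scholar search URL for a DOI suffix, mirroring the example link above. The helper name `scholar_search_url` is made up for illustration, and dropping the `as_sdt` and `btnG` parameters assumes they are optional.

```python
from urllib.parse import urlencode

SCHOLAR_SEARCH = "https://scholar.google.com/scholar"


def scholar_search_url(doi_suffix: str) -> str:
    """Build a Google Scholar search URL for a DOI suffix (hypothetical helper)."""
    # urlencode percent-encodes the "/" in the DOI as %2F, as in the example link
    query = urlencode({"hl": "en", "q": doi_suffix})
    return f"{SCHOLAR_SEARCH}?{query}"


print(scholar_search_url("dandi.000055/0.220127.0436"))
```

This only produces a link to open or display; it does not fetch or parse any results.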

@satra
Member

satra commented Jul 28, 2023

there are two things that can be done with datacite dois. see:

i'm not sure many people are citing using DOIs. and the dandi citation box should be updated to support different citation formats as well. finally the dandi url also does not plug in easily into paperpile and other formats.

@yarikoptic
Member Author

Cool! I didn't try to set up a report (we would probably want one for all DANDI DOIs at once), but there seems to be nothing for that sample DOI:

❯ curl --silent https://api.datacite.org/events?doi=dandi.000055/0.220127.0436 | jq .
{
  "data": [],
  "meta": {
    "total": 0,
    "total-pages": 0,
    "page": 1
  },
  "links": {
    "self": "https://api.datacite.org/events?doi=dandi.000055/0.220127.0436"
  }
}
❯ curl --silent  https://api.datacite.org/dois/dandi.000055/0.220127.0436 | jq .
{
  "errors": [
    {
      "status": "404",
      "title": "The resource you are looking for doesn't exist."
    }
  ]
}

It also seems that DANDI could even contribute to the usage reports, e.g. view counts of the DLPs for any given release DOI. And if we start minting an overall dandiset DOI (as Zenodo does, and we probably should), we could provide overall counts as well: https://support.datacite.org/docs/contributing

@satra
Copy link
Member

satra commented Jul 28, 2023

the doi needs the prefix as well.

curl --silent 'https://api.datacite.org/events?doi=10.48324/dandi.000055/0.220127.0436' | jq
{
  "data": [
    {
      "id": "bbb655d0-5d76-481e-b6f1-b2cb2b457380",
      "type": "events",
      "attributes": {
        "subj-id": "https://doi.org/10.1038/s41597-022-01280-y",
        "obj-id": "https://doi.org/10.48324/dandi.000055/0.220127.0436",
        "source-id": "crossref",
        "relation-type-id": "references",
        "total": 1,
        "message-action": "add",
        "source-token": "36c35e23-8757-4a9d-aacf-345e9b7eb50d",
        "license": "https://creativecommons.org/publicdomain/zero/1.0/",
        "occurred-at": "2022-04-21T10:45:13.000Z",
        "timestamp": "2022-04-23T03:38:18.173Z"
      },
      "relationships": {
        "subj": {
          "data": {
            "id": "https://doi.org/10.1038/s41597-022-01280-y",
            "type": "objects"
          }
        },
        "obj": {
          "data": {
            "id": "https://doi.org/10.48324/dandi.000055/0.220127.0436",
            "type": "objects"
          }
        }
      }
    }
  ],
  "meta": {
    "total": 1,
    "total-pages": 1,
    "page": 1,
    "sources": [
      {
        "id": "crossref",
        "title": "Crossref to DataCite",
        "count": 1
      }
    ],
    "occurred": [
      {
        "id": "2022",
        "title": "2022",
        "count": 1
      }
    ],
    "prefixes": [
      {
        "id": "10.1038",
        "title": "10.1038",
        "count": 1
      },
      {
        "id": "10.48324",
        "title": "10.48324",
        "count": 1
      }
    ],
    "citation-types": [
      {
        "id": "Dataset-ScholarlyArticle",
        "title": "Dataset-ScholarlyArticle",
        "count": 1,
        "year-months": [
          {
            "id": "2022-04",
            "title": "2022-04",
            "sum": 1
          }
        ]
      }
    ],
    "relation-types": [
      {
        "id": "references",
        "title": "references",
        "count": 1,
        "year-months": [
          {
            "id": "2022-04",
            "title": "2022-04",
            "sum": 1
          }
        ]
      }
    ],
    "registrants": [
      {
        "id": "crossref.297",
        "title": "crossref.297",
        "count": 1,
        "years": [
          {
            "id": "2022",
            "title": "2022",
            "sum": 1
          }
        ]
      },
      {
        "id": "datacite.dartlib.dandi",
        "title": "datacite.dartlib.dandi",
        "count": 1,
        "years": [
          {
            "id": "2022",
            "title": "2022",
            "sum": 1
          }
        ]
      }
    ]
  },
  "links": {
    "self": "https://api.datacite.org/events?doi=10.48324/dandi.000055/0.220127.0436"
  }
}
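
To avoid the missing-prefix mistake from the earlier queries, a small helper could normalize the DOI before building the request URL. This is only a sketch: `full_doi` and `events_url` are hypothetical names, and the default prefix is simply the `10.48324` seen in the working query above.

```python
from urllib.parse import urlencode

DANDI_PREFIX = "10.48324"  # prefix seen in the working query above
EVENTS_API = "https://api.datacite.org/events"


def full_doi(doi: str, prefix: str = DANDI_PREFIX) -> str:
    """Return the DOI with its registrant prefix, adding it if absent."""
    doi = doi.strip()
    # Drop a resolver URL if one was pasted in
    for resolver in ("https://doi.org/", "http://doi.org/"):
        if doi.startswith(resolver):
            doi = doi[len(resolver):]
    # DOIs always start with "10."; anything else is treated as a bare suffix
    return doi if doi.startswith("10.") else f"{prefix}/{doi}"


def events_url(doi: str) -> str:
    """Build the DataCite events query URL for one DOI."""
    # safe="/" keeps the slashes unescaped, matching the curl command above
    query = urlencode({"doi": full_doi(doi)}, safe="/")
    return f"{EVENTS_API}?{query}"


print(events_url("dandi.000055/0.220127.0436"))
```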

@yarikoptic
Member Author

d'oh and Great! so might be a matter of a "cron job" to collate all such references and render them nicely ;-)
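
A rough sketch of what such a collating job might do, using only the standard library. The function names are hypothetical. It also assumes that further result pages are advertised under `links["next"]`, which the responses in this thread do not show (they only contain `links["self"]`).

```python
import json
from typing import Callable, Iterable, Iterator
from urllib.request import urlopen


def fetch_json(url: str) -> dict:
    """Fetch and decode one page of JSON (plain stdlib, no retries)."""
    with urlopen(url) as response:
        return json.load(response)


def iter_events(url: str, fetch: Callable[[str], dict] = fetch_json) -> Iterator[dict]:
    """Yield event records page by page.

    Assumption: a following page, if any, is given in links["next"].
    """
    while url:
        page = fetch(url)
        yield from page.get("data", [])
        url = page.get("links", {}).get("next")


def citing_works(events: Iterable[dict]) -> dict:
    """Map each referenced DOI URL to the sorted list of works referencing it."""
    cited = {}
    for event in events:
        attrs = event["attributes"]
        if attrs["relation-type-id"] == "references":
            cited.setdefault(attrs["obj-id"], set()).add(attrs["subj-id"])
    return {doi: sorted(works) for doi, works in cited.items()}
```

A scheduled run could then call something like `citing_works(iter_events("https://api.datacite.org/events?prefix=10.48324"))` and store the result for rendering. Passing a different `fetch` callable makes it possible to try the logic on saved responses without touching the network.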

@tuanpham96

I did some preliminary runs with this and found that datacite can still miss things. For example, in this one the authors included the DOI https://doi.org/10.48324/dandi.000404/0.230605.2024 in the "Data availability" section but not in "References", and I assume datacite only picks up DOIs listed in "References"?

Using datacite API doesn't return anything. Hopefully I got the syntax right?

$ curl -s 'https://api.datacite.org/events?doi=10.48324/dandi.000404/0.230605.2024' | jq
{
  "data": [],
  "meta": {
    "total": 0,
    "total-pages": 0,
    "page": 1
  },
  "links": {
    "self": "https://api.datacite.org/events?doi=10.48324/dandi.000404/0.230605.2024"
  }
}

In fact, doing it for the doi prefix

$ curl "https://api.datacite.org/events?prefix=10.48324,10.80507" -o dandiset-datacite-query.json

then some cleaning

datacite attributes records
$ jq -r '.data[].attributes' dandiset-datacite-query.json 
{
  "subj-id": "https://doi.org/10.1038/s41597-022-01280-y",
  "obj-id": "https://doi.org/10.48324/dandi.000055/0.220127.0436",
  "source-id": "crossref",
  "relation-type-id": "references",
  "total": 1,
  "message-action": "add",
  "source-token": "36c35e23-8757-4a9d-aacf-345e9b7eb50d",
  "license": "https://creativecommons.org/publicdomain/zero/1.0/",
  "occurred-at": "2022-04-21T10:45:13.000Z",
  "timestamp": "2022-04-23T03:38:18.173Z"
}
{
  "subj-id": "https://doi.org/10.1038/s41597-022-01728-1",
  "obj-id": "https://doi.org/10.48324/dandi.000231/0.220904.1554",
  "source-id": "crossref",
  "relation-type-id": "references",
  "total": 1,
  "message-action": "add",
  "source-token": "36c35e23-8757-4a9d-aacf-345e9b7eb50d",
  "license": "https://creativecommons.org/publicdomain/zero/1.0/",
  "occurred-at": "2022-10-13T13:45:22.000Z",
  "timestamp": "2022-10-14T08:55:30.912Z"
}
{
  "subj-id": "https://doi.org/10.48324/dandi.000252/0.230408.2207",
  "obj-id": "https://doi.org/10.1101/2022.12.07.22283227",
  "source-id": "datacite-crossref",
  "relation-type-id": "is-described-by",
  "total": 1,
  "message-action": "create",
  "source-token": "28276d12-b320-41ba-9272-bb0adc3466ff",
  "license": "https://creativecommons.org/publicdomain/zero/1.0/",
  "occurred-at": "2023-04-08T22:07:55.000Z",
  "timestamp": "2023-04-08T22:07:57.238Z"
}
{
  "subj-id": "https://doi.org/10.48324/dandi.000252/0.230408.2207",
  "obj-id": "https://orcid.org/0000-0002-8040-8844",
  "source-id": "datacite-orcid-auto-update",
  "relation-type-id": "is-authored-by",
  "total": 1,
  "message-action": "create",
  "source-token": "7b09eda9-0024-4e26-9f01-6d8d5a1028d7",
  "license": "https://creativecommons.org/publicdomain/zero/1.0/",
  "occurred-at": "2023-04-08T22:07:55.000Z",
  "timestamp": "2023-04-08T22:07:59.828Z"
}
{
  "subj-id": "https://doi.org/10.48324/dandi.000252/0.230408.2207",
  "obj-id": "https://orcid.org/0000-0002-0101-2455",
  "source-id": "datacite-orcid-auto-update",
  "relation-type-id": "is-authored-by",
  "total": 1,
  "message-action": "create",
  "source-token": "7b09eda9-0024-4e26-9f01-6d8d5a1028d7",
  "license": "https://creativecommons.org/publicdomain/zero/1.0/",
  "occurred-at": "2023-04-08T22:07:55.000Z",
  "timestamp": "2023-04-08T22:07:59.899Z"
}
{
  "subj-id": "https://doi.org/10.48324/dandi.000252/0.230408.2207",
  "obj-id": "https://orcid.org/0000-0002-8765-7253",
  "source-id": "datacite-orcid-auto-update",
  "relation-type-id": "is-authored-by",
  "total": 1,
  "message-action": "create",
  "source-token": "7b09eda9-0024-4e26-9f01-6d8d5a1028d7",
  "license": "https://creativecommons.org/publicdomain/zero/1.0/",
  "occurred-at": "2023-04-08T22:07:55.000Z",
  "timestamp": "2023-04-08T22:07:59.973Z"
}

| | subj-id | obj-id | source-id | relation-type-id | total | message-action |
| --- | --- | --- | --- | --- | --- | --- |
| 0 | https://doi.org/10.1038/s41597-022-01280-y | https://doi.org/10.48324/dandi.000055/0.220127.0436 | crossref | references | 1 | add |
| 1 | https://doi.org/10.1038/s41597-022-01728-1 | https://doi.org/10.48324/dandi.000231/0.220904.1554 | crossref | references | 1 | add |
| 2 | https://doi.org/10.48324/dandi.000252/0.230408.2207 | https://doi.org/10.1101/2022.12.07.22283227 | datacite-crossref | is-described-by | 1 | create |
| 3 | https://doi.org/10.48324/dandi.000252/0.230408.2207 | https://orcid.org/0000-0002-8040-8844 | datacite-orcid-auto-update | is-authored-by | 1 | create |
| 4 | https://doi.org/10.48324/dandi.000252/0.230408.2207 | https://orcid.org/0000-0002-0101-2455 | datacite-orcid-auto-update | is-authored-by | 1 | create |
| 5 | https://doi.org/10.48324/dandi.000252/0.230408.2207 | https://orcid.org/0000-0002-8765-7253 | datacite-orcid-auto-update | is-authored-by | 1 | create |

Not sure whether datacite-orcid-auto-update should be counted. It seems there are 3 crossref related events, yet none refers to dandi.000404/0.230605.2024
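
One possible way to separate paper links from the ORCID bookkeeping events, sketched under assumptions: the set of relation types below is based only on the two kinds seen in the records above, and `paper_link` is a hypothetical helper. Whether an `is-described-by` link (declared in the dandiset's own metadata) should count as a citation remains a judgment call.

```python
from typing import Optional, Tuple

DANDI_DOI = "https://doi.org/10.48324/dandi."
# Relation types treated as a paper-to-dataset link; an assumption based
# only on the two kinds seen in the records above
PAPER_RELATIONS = {"references", "is-described-by"}


def paper_link(attrs: dict) -> Optional[Tuple[str, str]]:
    """Return (dandiset DOI URL, paper URL) for a paper link, else None."""
    if attrs["relation-type-id"] not in PAPER_RELATIONS:
        return None  # e.g. is-authored-by events pointing at ORCID iDs
    subj, obj = attrs["subj-id"], attrs["obj-id"]
    # The dandiset can sit on either side depending on who reported the link
    if obj.startswith(DANDI_DOI):
        return obj, subj
    if subj.startswith(DANDI_DOI):
        return subj, obj
    return None
```

Applied to the six records above, this would keep the two `references` events and the one `is-described-by` event, and drop the three `is-authored-by` events.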

So I went to Google Scholar and tried to look for all dandiset versions that have a DOI using scholarly

simplified code
import re
import pandas as pd
from tqdm import tqdm
from scholarly import scholarly, ProxyGenerator

pg = ProxyGenerator()

# gather dandiset IDs
# assume `doi_list` is all DOIs of all Dandisets (defined elsewhere)
# keep only the part starting at "dandi", e.g. dandi.000055/0.220127.0436
dandiset_IDs = [
    re.search(r'dandi.+', x).group()
    for x in doi_list
]

# use scholarly to query
pubs = dict()

for dandiset in tqdm(dandiset_IDs):
    if dandiset in pubs:
        # in case need to restart
        continue

    # pick a fresh free proxy for every query
    pg.FreeProxies()
    scholarly.use_proxy(pg)
    query_results = scholarly.search_pubs(dandiset)

    # exhaust the result iterator instead of calling next() total_results
    # times, so a shorter result list cannot raise StopIteration
    pubs[dandiset] = pd.DataFrame(list(query_results)).assign(dandiset = dandiset)

# save data
# assume `out_path` is the output JSON file path (defined elsewhere)
df_pubs = pd.concat(list(pubs.values()), ignore_index=True)
df_pubs.to_json(out_path, orient='records')

| | dandiset | pub_url | pub_title | pub_year |
| --- | --- | --- | --- | --- |
| 0 | dandi.000019/0.220126.2148 | https://elifesciences.org/articles/78362 | The neurodata without borders ecosystem for neurophysiological data science | 2022 |
| 1 | dandi.000037/0.230426.0054 | https://www.nature.com/articles/s41597-023-02214-y | Responses of pyramidal cell somata and apical dendrites in mouse visual cortex over multiple days | 2023 |
| 2 | dandi.000055/0.220127.0436 | https://www.nature.com/articles/s41597-022-01280-y | AJILE12: Long-term naturalistic human intracranial neural recordings and pose | 2022 |
| 3 | dandi.000055/0.220127.0436 | https://arxiv.org/abs/2302.08643 | Fast Temporal Wavelet Graph Neural Networks | 2023 |
| 4 | dandi.000165/0.211118.1526 | https://www.cell.com/cell-reports/pdf/S2211-1247(21)01655-7.pdf | Dentate gyrus and CA3 GABAergic interneurons bidirectionally modulate signatures of internal and external drive to CA1 | 2021 |
| 5 | dandi.000207/0.220216.0323 | https://www.nature.com/articles/s41593-022-01020-w | Neurons detect cognitive boundaries to structure episodic memories in humans | 2022 |
| 6 | dandi.000230/0.220506.1516 | https://www.cell.com/cell-reports-methods/pdf/S2667-2375(22)00084-4.pdf | All-viral tracing of monosynaptic inputs to single birthdate-defined neurons in the intact brain | 2022 |
| 7 | dandi.000231/0.220904.1554 | https://www.nature.com/articles/s41597-022-01728-1 | A detailed behavioral, videographic, and neural dataset on object recognition in mice | 2022 |
| 8 | dandi.000292/0.220708.1652 | https://academic.oup.com/gigascience/article-abstract/doi/10.1093/gigascience/giac108/6827564 | An in vitro whole-cell electrophysiology dataset of human cortical neurons | 2022 |
| 9 | dandi.000293/0.220708.1652 | https://academic.oup.com/gigascience/article-abstract/doi/10.1093/gigascience/giac108/6827564 | An in vitro whole-cell electrophysiology dataset of human cortical neurons | 2022 |
| 10 | dandi.000404/0.230605.2024 | https://www.cell.com/current-biology/pdf/S0960-9822(23)00778-9.pdf | Invariant neural dynamics drive commands to control different movements | 2023 |
| 11 | dandi.000447/0.230316.2133 | https://www.sciencedirect.com/science/article/pii/S266616672300480X | Protocol for geometric transformation of cognitive maps for generalization across hippocampal-prefrontal circuits | 2023 |
| 12 | dandi.000458/0.230317.0039 | https://elifesciences.org/articles/84630 | Cortico-thalamo-cortical interactions modulate electrically evoked EEG responses in mice | 2023 |
| 13 | dandi.000473/0.230417.1502 | https://www.nature.com/articles/s41593-023-01367-8 | Esr1+ hypothalamic-habenula neurons shape aversive states | 2023 |
| 14 | dandi.000488/0.230602.2022 | https://www.biorxiv.org/content/10.1101/2023.06.02.543483.abstract | Differential encoding of temporal context and expectation under representational drift across hierarchically connected areas | 2023 |

That's actually how I was able to find dandi.000404. Another example where the dandiset DOI appears in the "Data availability" section but not in "References" is dandi.000473/0.230417.1502, which is also missing from the datacite API results above.
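
To make the gap between the two sources concrete, the sets of version DOIs could be compared directly. The sketch below uses only a few DOIs copied from the tables above (not the full lists), and `coverage_gap` is a hypothetical helper name.

```python
def coverage_gap(scholar: set, datacite: set) -> dict:
    """Split version DOIs by which source reported a paper for them."""
    return {
        "both": sorted(scholar & datacite),
        "scholar_only": sorted(scholar - datacite),
        "datacite_only": sorted(datacite - scholar),
    }


# A few version DOIs copied from the tables above, not the full lists
scholar_hits = {
    "dandi.000055/0.220127.0436",
    "dandi.000231/0.220904.1554",
    "dandi.000404/0.230605.2024",
    "dandi.000473/0.230417.1502",
}
datacite_hits = {
    "dandi.000055/0.220127.0436",
    "dandi.000231/0.220904.1554",
    "dandi.000252/0.230408.2207",
}

gap = coverage_gap(scholar_hits, datacite_hits)
print(gap["scholar_only"])
```

For this subset, `scholar_only` holds the two DOIs mentioned above (000404 and 000473), while `datacite_only` holds 000252, whose preprint link comes from the dandiset's own metadata.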

@yarikoptic yarikoptic added the enhancement New feature or request label Sep 20, 2023
@yarikoptic
Member Author

Hi @tuanpham96 -- thanks for going through it! I wanted to take advantage of all your work, so I installed bleeding-edge scholarly, tuned up the script, and was ready to profit, but ran into raise MaxTriesExceededException("Cannot Fetch from Google Scholar.") so maybe there is just no luck going through Google Scholar any longer?

FWIW: here is the adjusted version which gets the dandiset dois from API
import re
from collections import defaultdict

import pandas as pd
from tqdm import tqdm
from scholarly import scholarly, ProxyGenerator
from dandi.dandiapi import DandiAPIClient

pg = ProxyGenerator()

# Group DOIs per dandiset in case we decide to report per dandiset
dois = defaultdict(list)

with DandiAPIClient.for_dandi_instance("dandi") as client:
    for dandiset in client.get_dandisets():
        if dandiset.most_recent_published_version is None:
            continue
        # fetch each version's metadata to get the full DOI
        # (only the dandi.* part is used in the search below)
        for version in dandiset.get_versions():
            if version.identifier != 'draft':
                dois[dandiset.identifier].append(
                    dandiset.for_version(version.identifier).get_raw_metadata()['doi']
                )
        # NOTE: this stops after the first published dandiset (handy for a
        # quick test); remove it to collect the DOIs of all dandisets
        break

# use scholarly to query
pubs = dict()

for dandiset, dandiset_dois in tqdm(dois.items()):
    if dandiset in pubs:
        # in case need to restart
        continue
    # collect results for every version DOI instead of keeping only the last
    frames = []
    for doi in dandiset_dois:
        pg.FreeProxies()
        scholarly.use_proxy(pg)
        query_results = scholarly.search_pubs(re.search(r'dandi.+', doi).group())
        frames.append(pd.DataFrame(list(query_results)).assign(dandiset=dandiset, doi=doi))
    pubs[dandiset] = pd.concat(frames, ignore_index=True)

# save data
# assume `out_path` is the output JSON file path (defined elsewhere)
df_pubs = pd.concat(list(pubs.values()), ignore_index=True)
df_pubs.to_json(out_path, orient='records')

@tuanpham96

@yarikoptic thanks for trying that out. And sorry I forgot to update you: this has been broken for some time now. I was able to try SerpAPI, but the free account only allows ~100 searches per month, which is neither sufficient nor sustainable. So I guess crossref / datacite would be the only way to go, and Google Scholar is out the window, as far as I'm aware.

@yarikoptic
Member Author

Ha -- didn't know about SerpAPI. In principle we could potentially cover the cost of querying if it came to a reasonable amount, but I agree that relying on such a service is not sustainable overall ;-) Maybe we could make it "modular": get everything from crossref/datacite, then try to complement it with findings from Google, e.g. via SerpAPI on some "round robin" schedule to start with, and who knows what other means (e.g. scraping paper texts from PubMed?... found https://github.com/jannisborn/paperscraper )
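
The "round robin" part could look roughly like the sketch below: each run picks the next slice of DOIs so that a small search quota (such as the ~100 free searches per month mentioned above) is spread across all dandisets over time. It covers only the scheduling; no SerpAPI calls are shown, and `round_robin_batch` is a hypothetical helper.

```python
def round_robin_batch(dois: list, run_index: int, batch_size: int) -> list:
    """Pick the DOIs to query on a given run, cycling through the full list."""
    if not dois or batch_size <= 0:
        return []
    start = (run_index * batch_size) % len(dois)
    count = min(batch_size, len(dois))
    # Wrap around the end of the list so every DOI is eventually revisited
    return [dois[(start + i) % len(dois)] for i in range(count)]
```

With five DOIs and a batch size of two, runs 0, 1 and 2 would cover items 1-2, 3-4, and then 5 plus 1 again. The run index would have to be persisted between runs, e.g. derived from the date.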

@tuanpham96

oh that's a cool idea. I quickly tried it with dandi.000404/0.230605.2024 but it doesn't seem to return the relevant current biology paper. Not sure if one would need to get the Pubmed datadump somehow. Any thoughts?

from paperscraper.pubmed import get_and_dump_pubmed_papers

# returns non-relevant papers
query=[["dandi.000404/0.230605.2024"]]
get_and_dump_pubmed_papers(query, output_filepath='test.jsonl')

# returns non-relevant papers
query=["dandi.000404/0.230605.2024"]
get_and_dump_pubmed_papers(query, output_filepath='test.jsonl')

# returns nothing
query=[["dandi"],["000404"],["0.230605.2024"]]
get_and_dump_pubmed_papers(query, output_filepath='test.jsonl')

# returns nothing
query=[["dandi.000404"],["0.230605.2024"]]
get_and_dump_pubmed_papers(query, output_filepath='test.jsonl')
