# DLP: Display count and links to the papers which cite some version of the dandiset #1669

Came up in dandi/helpdesk#105. A naive implementation could just search for the DOI on Google Scholar, e.g. https://scholar.google.com/scholar?hl=en&as_sdt=0%2C46&q=dandi.000055%2F0.220127.0436&btnG= but I bet there are better options; I didn't check.
---

There are two things that can be done with DataCite DOIs. See: …

I'm not sure many people are citing using DOIs, and the DANDI citation box should be updated to support different citation formats as well. Finally, the DANDI URL also does not plug in easily into Paperpile and other reference managers.
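For reference, DataCite DOIs already support content negotiation (https://citation.crosscite.org/docs.html), so a citation box could in principle offer multiple formats by asking doi.org for them directly. A minimal sketch using `requests` (the DOI is the sample one from this thread; the `Accept` values are the documented content-negotiation media types):

```python
import requests

doi_url = "https://doi.org/10.48324/dandi.000055/0.220127.0436"

# BibTeX entry via DOI content negotiation
bibtex = requests.get(doi_url, headers={"Accept": "application/x-bibtex"})
print(bibtex.text)

# Formatted citation in a specific CSL style (here APA)
apa = requests.get(doi_url, headers={"Accept": "text/x-bibliography; style=apa"})
print(apa.text)
```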
---

Cool! I didn't try to establish a report (we would probably want one for all DANDI DOIs at once), but that sample DOI seems to have nothing:

```
❯ curl --silent https://api.datacite.org/events?doi=dandi.000055/0.220127.0436 | jq .
{
"data": [],
"meta": {
"total": 0,
"total-pages": 0,
"page": 1
},
"links": {
"self": "https://api.datacite.org/events?doi=dandi.000055/0.220127.0436"
}
}
❯ curl --silent https://api.datacite.org/dois/dandi.000055/0.220127.0436 | jq .
{
"errors": [
{
"status": "404",
"title": "The resource you are looking for doesn't exist."
}
]
}
```

It also seems that DANDI could even contribute to the usage reports, e.g. view counts of the DLPs for any given release DOI; or, if we start minting an overall dandiset DOI (as Zenodo does, and probably we should), we could provide overall counts as well: https://support.datacite.org/docs/contributing
---

The DOI needs the prefix as well:

```
curl --silent 'https://api.datacite.org/events?doi=10.48324/dandi.000055/0.220127.0436' | jq
{
"data": [
{
"id": "bbb655d0-5d76-481e-b6f1-b2cb2b457380",
"type": "events",
"attributes": {
"subj-id": "https://doi.org/10.1038/s41597-022-01280-y",
"obj-id": "https://doi.org/10.48324/dandi.000055/0.220127.0436",
"source-id": "crossref",
"relation-type-id": "references",
"total": 1,
"message-action": "add",
"source-token": "36c35e23-8757-4a9d-aacf-345e9b7eb50d",
"license": "https://creativecommons.org/publicdomain/zero/1.0/",
"occurred-at": "2022-04-21T10:45:13.000Z",
"timestamp": "2022-04-23T03:38:18.173Z"
},
"relationships": {
"subj": {
"data": {
"id": "https://doi.org/10.1038/s41597-022-01280-y",
"type": "objects"
}
},
"obj": {
"data": {
"id": "https://doi.org/10.48324/dandi.000055/0.220127.0436",
"type": "objects"
}
}
}
}
],
"meta": {
"total": 1,
"total-pages": 1,
"page": 1,
"sources": [
{
"id": "crossref",
"title": "Crossref to DataCite",
"count": 1
}
],
"occurred": [
{
"id": "2022",
"title": "2022",
"count": 1
}
],
"prefixes": [
{
"id": "10.1038",
"title": "10.1038",
"count": 1
},
{
"id": "10.48324",
"title": "10.48324",
"count": 1
}
],
"citation-types": [
{
"id": "Dataset-ScholarlyArticle",
"title": "Dataset-ScholarlyArticle",
"count": 1,
"year-months": [
{
"id": "2022-04",
"title": "2022-04",
"sum": 1
}
]
}
],
"relation-types": [
{
"id": "references",
"title": "references",
"count": 1,
"year-months": [
{
"id": "2022-04",
"title": "2022-04",
"sum": 1
}
]
}
],
"registrants": [
{
"id": "crossref.297",
"title": "crossref.297",
"count": 1,
"years": [
{
"id": "2022",
"title": "2022",
"sum": 1
}
]
},
{
"id": "datacite.dartlib.dandi",
"title": "datacite.dartlib.dandi",
"count": 1,
"years": [
{
"id": "2022",
"title": "2022",
"sum": 1
}
]
}
]
},
"links": {
"self": "https://api.datacite.org/events?doi=10.48324/dandi.000055/0.220127.0436"
}
}
```
---

D'oh, and great! So it might be a matter of a "cron job" to collate all such references and render them nicely ;-)
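A minimal sketch of what such a collation job could look like, assuming the events API shape shown above (the prefix query matches the `curl` call later in this thread; the `page[cursor]`/`links.next` pagination follows DataCite's pagination docs; error handling omitted):

```python
import requests
from collections import defaultdict

citing = defaultdict(set)
url = "https://api.datacite.org/events"
params = {"prefix": "10.48324", "page[size]": 100, "page[cursor]": 1}

while url:
    resp = requests.get(url, params=params).json()
    params = None  # subsequent "next" links carry their own query string
    for event in resp.get("data", []):
        attrs = event["attributes"]
        # keep only events where a dandiset DOI is the cited object
        if attrs["relation-type-id"] == "references" and "/dandi." in attrs["obj-id"]:
            citing[attrs["obj-id"]].add(attrs["subj-id"])
    url = resp.get("links", {}).get("next")

for doi, papers in sorted(citing.items()):
    print(f"{doi}: cited by {len(papers)} paper(s)")
    for paper in sorted(papers):
        print(f"  - {paper}")
```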
---

I did some preliminary runs with this and found that DataCite can still miss things. For example, in this one the authors included the DOI, yet querying the DataCite API doesn't return anything. Hopefully I got the syntax right?

```
$ curl -s 'https://api.datacite.org/events?doi=10.48324/dandi.000404/0.230605.2024' | jq
{
"data": [],
"meta": {
"total": 0,
"total-pages": 0,
"page": 1
},
"links": {
"self": "https://api.datacite.org/events?doi=10.48324/dandi.000404/0.230605.2024"
}
}
```

In fact, doing it for the whole DOI prefix:

```
$ curl "https://api.datacite.org/events?prefix=10.48324,10.80507" -o dandiset-datacite-query.json
```

then some cleaning to pull out the DataCite attributes records:

```
$ jq -r '.data[].attributes' dandiset-datacite-query.json
{
"subj-id": "https://doi.org/10.1038/s41597-022-01280-y",
"obj-id": "https://doi.org/10.48324/dandi.000055/0.220127.0436",
"source-id": "crossref",
"relation-type-id": "references",
"total": 1,
"message-action": "add",
"source-token": "36c35e23-8757-4a9d-aacf-345e9b7eb50d",
"license": "https://creativecommons.org/publicdomain/zero/1.0/",
"occurred-at": "2022-04-21T10:45:13.000Z",
"timestamp": "2022-04-23T03:38:18.173Z"
}
{
"subj-id": "https://doi.org/10.1038/s41597-022-01728-1",
"obj-id": "https://doi.org/10.48324/dandi.000231/0.220904.1554",
"source-id": "crossref",
"relation-type-id": "references",
"total": 1,
"message-action": "add",
"source-token": "36c35e23-8757-4a9d-aacf-345e9b7eb50d",
"license": "https://creativecommons.org/publicdomain/zero/1.0/",
"occurred-at": "2022-10-13T13:45:22.000Z",
"timestamp": "2022-10-14T08:55:30.912Z"
}
{
"subj-id": "https://doi.org/10.48324/dandi.000252/0.230408.2207",
"obj-id": "https://doi.org/10.1101/2022.12.07.22283227",
"source-id": "datacite-crossref",
"relation-type-id": "is-described-by",
"total": 1,
"message-action": "create",
"source-token": "28276d12-b320-41ba-9272-bb0adc3466ff",
"license": "https://creativecommons.org/publicdomain/zero/1.0/",
"occurred-at": "2023-04-08T22:07:55.000Z",
"timestamp": "2023-04-08T22:07:57.238Z"
}
{
"subj-id": "https://doi.org/10.48324/dandi.000252/0.230408.2207",
"obj-id": "https://orcid.org/0000-0002-8040-8844",
"source-id": "datacite-orcid-auto-update",
"relation-type-id": "is-authored-by",
"total": 1,
"message-action": "create",
"source-token": "7b09eda9-0024-4e26-9f01-6d8d5a1028d7",
"license": "https://creativecommons.org/publicdomain/zero/1.0/",
"occurred-at": "2023-04-08T22:07:55.000Z",
"timestamp": "2023-04-08T22:07:59.828Z"
}
{
"subj-id": "https://doi.org/10.48324/dandi.000252/0.230408.2207",
"obj-id": "https://orcid.org/0000-0002-0101-2455",
"source-id": "datacite-orcid-auto-update",
"relation-type-id": "is-authored-by",
"total": 1,
"message-action": "create",
"source-token": "7b09eda9-0024-4e26-9f01-6d8d5a1028d7",
"license": "https://creativecommons.org/publicdomain/zero/1.0/",
"occurred-at": "2023-04-08T22:07:55.000Z",
"timestamp": "2023-04-08T22:07:59.899Z"
}
{
"subj-id": "https://doi.org/10.48324/dandi.000252/0.230408.2207",
"obj-id": "https://orcid.org/0000-0002-8765-7253",
"source-id": "datacite-orcid-auto-update",
"relation-type-id": "is-authored-by",
"total": 1,
"message-action": "create",
"source-token": "7b09eda9-0024-4e26-9f01-6d8d5a1028d7",
"license": "https://creativecommons.org/publicdomain/zero/1.0/",
"occurred-at": "2023-04-08T22:07:55.000Z",
"timestamp": "2023-04-08T22:07:59.973Z"
}
```

Not sure whether … So I went to Google Scholar and tried to look for all dandiset versions that have a DOI, using this simplified code:

```python
import glob, os, re, json
import pandas as pd
from tqdm import tqdm
from scholarly import scholarly, ProxyGenerator

pg = ProxyGenerator()

# gather dandiset IDs
# assume `doi_list` is all DOIs of all dandisets
dandiset_IDs = [
    re.search('dandi.+', x).group()
    for x in doi_list
]

# use scholarly to query
pubs = dict()
for dandiset in tqdm(dandiset_IDs):
    if dandiset in pubs:
        # in case we need to restart
        continue
    pg.FreeProxies()
    scholarly.use_proxy(pg)
    query_results = scholarly.search_pubs(dandiset)
    pubs[dandiset] = pd.DataFrame([
        next(query_results) for _ in range(query_results.total_results)
    ]).assign(dandiset=dandiset)

# save data
df_pubs = pd.concat(list(pubs.values()), ignore_index=True)
df_pubs.to_json(out_path, orient='records')
```
That's actually how I was able to find the example above.
---

Hi @tuanpham96 -- thanks for going through it! I thought to take advantage of all your work, installed bleeding-edge scholarly, tuned up the script, and was ready to profit, but ran into … FWIW, here is the adjusted version which gets the dandiset DOIs from the API:

```python
import glob, os, re, json
import pandas as pd
from collections import defaultdict
from itertools import chain
from tqdm import tqdm
from scholarly import scholarly, ProxyGenerator
from dandi.dandiapi import DandiAPIClient

pg = ProxyGenerator()

# in case we decide to group per dandiset
dois = defaultdict(list)
with DandiAPIClient.for_dandi_instance("dandi") as client:
    for dandiset in client.get_dandisets():
        if dandiset.most_recent_published_version is None:
            continue
        # to actually get full DOIs, but we do not need them all
        for version in dandiset.get_versions():
            if version.identifier != 'draft':
                dois[dandiset.identifier].append(
                    dandiset.for_version(version.identifier).get_raw_metadata()['doi']
                )
                break

# use scholarly to query
pubs = dict()
for dandiset, dandiset_dois in tqdm(dois.items()):
    if dandiset in pubs:
        # in case we need to restart
        continue
    for doi in dandiset_dois:
        pg.FreeProxies()
        scholarly.use_proxy(pg)
        query_results = scholarly.search_pubs(re.search('dandi.+', doi).group())
        pubs[dandiset] = pd.DataFrame([
            next(query_results) for _ in range(query_results.total_results)
        ]).assign(dandiset=dandiset)

# save data
df_pubs = pd.concat(list(pubs.values()), ignore_index=True)
df_pubs.to_json(out_path, orient='records')
```
---

@yarikoptic thanks for trying that out. And sorry, I forgot to update you: this has been broken for some time now. I was able to try SerpAPI, but the free account only affords ~100 searches per month, which is neither sufficient nor sustainable. So I guess Crossref / DataCite would be the only way to go, and Google Scholar is out the window, as far as I'm aware.
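On the Crossref side, one concrete option is its Event Data API, which exposes links between Crossref-registered papers and other objects. A hedged sketch (endpoint and parameters are from the Crossref Event Data user guide; the `mailto` address is a placeholder, and the response parsing assumes the `message.events` shape documented there):

```python
import requests

# Crossref Event Data: events whose object is the dandiset DOI
resp = requests.get(
    "https://api.eventdata.crossref.org/v1/events",
    params={
        "obj-id": "10.48324/dandi.000055/0.220127.0436",
        "mailto": "info@example.org",  # placeholder contact address
    },
).json()

for event in resp.get("message", {}).get("events", []):
    print(event["subj_id"], event["relation_type_id"])
```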
---

Ha -- I didn't know about SerpAPI. In principle we could cover the cost of querying if it came to a reasonable amount, but indeed I do not think such a service is sustainable overall ;-) Maybe we could make it "modular" -- get everything from Crossref/DataCite, then complement that with findings from Google, e.g. via SerpAPI on some "round robin" schedule to start with, and who knows what other means (e.g. scraping paper texts from PubMed? … found https://github.com/jannisborn/paperscraper)
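A minimal sketch of that modular idea, assuming each "source" is just a callable from a DOI to a set of citing identifiers (`datacite_events` matches the events API shown earlier; `serpapi_search` is a hypothetical placeholder for a paid, rate-limited source):

```python
from typing import Callable
import requests

def datacite_events(doi: str) -> set[str]:
    """Citing identifiers according to the DataCite events API."""
    resp = requests.get("https://api.datacite.org/events", params={"doi": doi}).json()
    return {ev["attributes"]["subj-id"] for ev in resp.get("data", [])}

def serpapi_search(doi: str) -> set[str]:
    """Hypothetical placeholder for a SerpAPI / Google Scholar source."""
    return set()

SOURCES: list[Callable[[str], set[str]]] = [datacite_events, serpapi_search]

def citations_for(doi: str) -> set[str]:
    # union over all sources; each source can fail or be throttled independently
    found: set[str] = set()
    for source in SOURCES:
        found |= source(doi)
    return found

print(citations_for("10.48324/dandi.000055/0.220127.0436"))
```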
---

Oh, that's a cool idea. I quickly tried it with paperscraper's `get_and_dump_pubmed_papers`:

```python
from paperscraper.pubmed import get_and_dump_pubmed_papers

# non-relevant papers
query = [["dandi.000404/0.230605.2024"]]
get_and_dump_pubmed_papers(query, output_filepath='test.jsonl')

# non-relevant papers
query = ["dandi.000404/0.230605.2024"]
get_and_dump_pubmed_papers(query, output_filepath='test.jsonl')

# empty
query = [["dandi"], ["000404"], ["0.230605.2024"]]
get_and_dump_pubmed_papers(query, output_filepath='test.jsonl')

# empty
query = [["dandi.000404"], ["0.230605.2024"]]
get_and_dump_pubmed_papers(query, output_filepath='test.jsonl')
```