DLP: Display count and links to the papers which cite some version of the dandiset #1669

yarikoptic · 2023-07-28T12:47:47Z

This came up in dandi/helpdesk#105. A naive implementation could just search for the DOI on Google Scholar, e.g. https://scholar.google.com/scholar?hl=en&as_sdt=0%2C46&q=dandi.000055%2F0.220127.0436&btnG= but I bet there are better options; I didn't check.

satra · 2023-07-28T13:01:50Z

there are two things that can be done with datacite dois. see:

https://support.datacite.org/docs/views-and-downloads (embedding into webpage)
https://support.datacite.org/docs/citations-and-references (using rest to get info)

i'm not sure many people are citing using DOIs. and the dandi citation box should be updated to support different citation formats as well. finally the dandi url also does not plug in easily into paperpile and other formats.
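On the citation-format point: DOIs registered with DataCite can be resolved through DOI content negotiation, where the `Accept` header selects the output format (for example BibTeX). A minimal sketch of preparing such a request, assuming the standard `https://doi.org/` resolver; `citation_request` is a hypothetical helper name, and the request is only built here, not sent:

```python
from urllib.request import Request


def citation_request(doi: str, media_type: str = "application/x-bibtex") -> Request:
    """Prepare a content-negotiation request for one DOI (hypothetical helper)."""
    # The Accept header tells the resolver which citation format to return.
    return Request(f"https://doi.org/{doi}", headers={"Accept": media_type})


req = citation_request("10.48324/dandi.000055/0.220127.0436")
print(req.full_url, req.get_header("Accept"))
```

Passing the prepared request to `urllib.request.urlopen` would fetch the formatted citation; other media types such as `text/x-bibliography; style=apa` can be swapped in the same way.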

yarikoptic · 2023-07-28T15:14:41Z

Cool! I didn't try to set up a report (we would probably want one for all DANDI DOIs at once), but for that sample DOI there seems to be nothing:

❯ curl --silent https://api.datacite.org/events?doi=dandi.000055/0.220127.0436 | jq .
{
  "data": [],
  "meta": {
    "total": 0,
    "total-pages": 0,
    "page": 1
  },
  "links": {
    "self": "https://api.datacite.org/events?doi=dandi.000055/0.220127.0436"
  }
}
❯ curl --silent  https://api.datacite.org/dois/dandi.000055/0.220127.0436 | jq .
{
  "errors": [
    {
      "status": "404",
      "title": "The resource you are looking for doesn't exist."
    }
  ]
}

It also seems that DANDI could contribute to the usage reports, e.g. view counts of the DLPs for any given release DOI. If we start minting an overall dandiset DOI (as Zenodo does, and we probably should), we could provide overall counts as well: https://support.datacite.org/docs/contributing

satra · 2023-07-28T15:23:45Z

the doi needs the prefix as well.

curl --silent 'https://api.datacite.org/events?doi=10.48324/dandi.000055/0.220127.0436' | jq
{
  "data": [
    {
      "id": "bbb655d0-5d76-481e-b6f1-b2cb2b457380",
      "type": "events",
      "attributes": {
        "subj-id": "https://doi.org/10.1038/s41597-022-01280-y",
        "obj-id": "https://doi.org/10.48324/dandi.000055/0.220127.0436",
        "source-id": "crossref",
        "relation-type-id": "references",
        "total": 1,
        "message-action": "add",
        "source-token": "36c35e23-8757-4a9d-aacf-345e9b7eb50d",
        "license": "https://creativecommons.org/publicdomain/zero/1.0/",
        "occurred-at": "2022-04-21T10:45:13.000Z",
        "timestamp": "2022-04-23T03:38:18.173Z"
      },
      "relationships": {
        "subj": {
          "data": {
            "id": "https://doi.org/10.1038/s41597-022-01280-y",
            "type": "objects"
          }
        },
        "obj": {
          "data": {
            "id": "https://doi.org/10.48324/dandi.000055/0.220127.0436",
            "type": "objects"
          }
        }
      }
    }
  ],
  "meta": {
    "total": 1,
    "total-pages": 1,
    "page": 1,
    "sources": [
      {
        "id": "crossref",
        "title": "Crossref to DataCite",
        "count": 1
      }
    ],
    "occurred": [
      {
        "id": "2022",
        "title": "2022",
        "count": 1
      }
    ],
    "prefixes": [
      {
        "id": "10.1038",
        "title": "10.1038",
        "count": 1
      },
      {
        "id": "10.48324",
        "title": "10.48324",
        "count": 1
      }
    ],
    "citation-types": [
      {
        "id": "Dataset-ScholarlyArticle",
        "title": "Dataset-ScholarlyArticle",
        "count": 1,
        "year-months": [
          {
            "id": "2022-04",
            "title": "2022-04",
            "sum": 1
          }
        ]
      }
    ],
    "relation-types": [
      {
        "id": "references",
        "title": "references",
        "count": 1,
        "year-months": [
          {
            "id": "2022-04",
            "title": "2022-04",
            "sum": 1
          }
        ]
      }
    ],
    "registrants": [
      {
        "id": "crossref.297",
        "title": "crossref.297",
        "count": 1,
        "years": [
          {
            "id": "2022",
            "title": "2022",
            "sum": 1
          }
        ]
      },
      {
        "id": "datacite.dartlib.dandi",
        "title": "datacite.dartlib.dandi",
        "count": 1,
        "years": [
          {
            "id": "2022",
            "title": "2022",
            "sum": 1
          }
        ]
      }
    ]
  },
  "links": {
    "self": "https://api.datacite.org/events?doi=10.48324/dandi.000055/0.220127.0436"
  }
}
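Since the bare suffix returns nothing while the prefixed DOI works, identifiers could be normalized before querying. A minimal sketch, assuming the `10.48324` prefix used in the working query above; `full_doi` and `events_url` are hypothetical names, not part of any DANDI or DataCite client:

```python
from urllib.parse import urlencode

DANDI_PREFIX = "10.48324"  # assumption: the prefix seen in the working query
EVENTS_ENDPOINT = "https://api.datacite.org/events"


def full_doi(doi: str, prefix: str = DANDI_PREFIX) -> str:
    """Return the DOI with its registrant prefix, adding the prefix if missing."""
    doi = doi.strip()
    # Drop a resolver URL if one was pasted in.
    for resolver in ("https://doi.org/", "http://doi.org/"):
        if doi.startswith(resolver):
            doi = doi[len(resolver):]
    return doi if doi.startswith("10.") else f"{prefix}/{doi}"


def events_url(doi: str) -> str:
    """Build the DataCite events query URL for one DOI."""
    # safe="/" keeps the slashes readable, matching the URLs in this thread.
    return f"{EVENTS_ENDPOINT}?{urlencode({'doi': full_doi(doi)}, safe='/')}"


print(events_url("dandi.000055/0.220127.0436"))
```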

yarikoptic · 2023-07-28T22:16:58Z

d'oh and Great! so might be a matter of a "cron job" to collate all such references and render them nicely ;-)
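The collation step of such a job could look roughly like the following. This is a sketch only: it assumes the event layout shown in the response above, counts only `references` events, and groups by dandiset ID so that citations of any version are pooled; `collate` and the regular expression are illustrative, not an existing implementation.

```python
import re
from collections import defaultdict

# Matches DANDI version DOIs such as 10.48324/dandi.000055/0.220127.0436
DANDI_DOI = re.compile(r"10\.\d+/dandi\.(?P<dandiset>\d{6})/(?P<version>[\d.]+)")


def collate(events):
    """Map dandiset ID -> sorted list of works referencing any of its versions."""
    citing = defaultdict(set)
    for event in events:
        attrs = event["attributes"]
        if attrs["relation-type-id"] != "references":
            continue
        match = DANDI_DOI.search(attrs["obj-id"])
        if match:
            citing[match["dandiset"]].add(attrs["subj-id"])
    return {dandiset: sorted(works) for dandiset, works in citing.items()}


# One event, trimmed from the response above.
sample = [{"attributes": {
    "subj-id": "https://doi.org/10.1038/s41597-022-01280-y",
    "obj-id": "https://doi.org/10.48324/dandi.000055/0.220127.0436",
    "relation-type-id": "references",
}}]
print(collate(sample))
```

The length of each list gives the count to display on the DLP, and the entries give the links.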

tuanpham96 · 2023-08-19T20:28:56Z

I did some preliminary runs with this and found that DataCite can still miss things. For example, in this one the authors included the DOI https://doi.org/10.48324/dandi.000404/0.230605.2024 in the "Data availability" section but not in "References", and I assume DataCite only picks up DOIs from the reference list?

Querying the DataCite API doesn't return anything. Hopefully I got the syntax right?

$ curl -s 'https://api.datacite.org/events?doi=10.48324/dandi.000404/0.230605.2024' | jq
{
  "data": [],
  "meta": {
    "total": 0,
    "total-pages": 0,
    "page": 1
  },
  "links": {
    "self": "https://api.datacite.org/events?doi=10.48324/dandi.000404/0.230605.2024"
  }
}

In fact, doing it for the doi prefix

$ curl "https://api.datacite.org/events?prefix=10.48324,10.80507" -o dandiset-datacite-query.json

then some cleaning

datacite attributes records

$ jq -r '.data[].attributes' dandiset-datacite-query.json

{
  "subj-id": "https://doi.org/10.1038/s41597-022-01280-y",
  "obj-id": "https://doi.org/10.48324/dandi.000055/0.220127.0436",
  "source-id": "crossref",
  "relation-type-id": "references",
  "total": 1,
  "message-action": "add",
  "source-token": "36c35e23-8757-4a9d-aacf-345e9b7eb50d",
  "license": "https://creativecommons.org/publicdomain/zero/1.0/",
  "occurred-at": "2022-04-21T10:45:13.000Z",
  "timestamp": "2022-04-23T03:38:18.173Z"
}
{
  "subj-id": "https://doi.org/10.1038/s41597-022-01728-1",
  "obj-id": "https://doi.org/10.48324/dandi.000231/0.220904.1554",
  "source-id": "crossref",
  "relation-type-id": "references",
  "total": 1,
  "message-action": "add",
  "source-token": "36c35e23-8757-4a9d-aacf-345e9b7eb50d",
  "license": "https://creativecommons.org/publicdomain/zero/1.0/",
  "occurred-at": "2022-10-13T13:45:22.000Z",
  "timestamp": "2022-10-14T08:55:30.912Z"
}
{
  "subj-id": "https://doi.org/10.48324/dandi.000252/0.230408.2207",
  "obj-id": "https://doi.org/10.1101/2022.12.07.22283227",
  "source-id": "datacite-crossref",
  "relation-type-id": "is-described-by",
  "total": 1,
  "message-action": "create",
  "source-token": "28276d12-b320-41ba-9272-bb0adc3466ff",
  "license": "https://creativecommons.org/publicdomain/zero/1.0/",
  "occurred-at": "2023-04-08T22:07:55.000Z",
  "timestamp": "2023-04-08T22:07:57.238Z"
}
{
  "subj-id": "https://doi.org/10.48324/dandi.000252/0.230408.2207",
  "obj-id": "https://orcid.org/0000-0002-8040-8844",
  "source-id": "datacite-orcid-auto-update",
  "relation-type-id": "is-authored-by",
  "total": 1,
  "message-action": "create",
  "source-token": "7b09eda9-0024-4e26-9f01-6d8d5a1028d7",
  "license": "https://creativecommons.org/publicdomain/zero/1.0/",
  "occurred-at": "2023-04-08T22:07:55.000Z",
  "timestamp": "2023-04-08T22:07:59.828Z"
}
{
  "subj-id": "https://doi.org/10.48324/dandi.000252/0.230408.2207",
  "obj-id": "https://orcid.org/0000-0002-0101-2455",
  "source-id": "datacite-orcid-auto-update",
  "relation-type-id": "is-authored-by",
  "total": 1,
  "message-action": "create",
  "source-token": "7b09eda9-0024-4e26-9f01-6d8d5a1028d7",
  "license": "https://creativecommons.org/publicdomain/zero/1.0/",
  "occurred-at": "2023-04-08T22:07:55.000Z",
  "timestamp": "2023-04-08T22:07:59.899Z"
}
{
  "subj-id": "https://doi.org/10.48324/dandi.000252/0.230408.2207",
  "obj-id": "https://orcid.org/0000-0002-8765-7253",
  "source-id": "datacite-orcid-auto-update",
  "relation-type-id": "is-authored-by",
  "total": 1,
  "message-action": "create",
  "source-token": "7b09eda9-0024-4e26-9f01-6d8d5a1028d7",
  "license": "https://creativecommons.org/publicdomain/zero/1.0/",
  "occurred-at": "2023-04-08T22:07:55.000Z",
  "timestamp": "2023-04-08T22:07:59.973Z"
}

	subj-id	obj-id	source-id	relation-type-id	total	message-action
0	https://doi.org/10.1038/s41597-022-01280-y	https://doi.org/10.48324/dandi.000055/0.220127.0436	crossref	references	1	add
1	https://doi.org/10.1038/s41597-022-01728-1	https://doi.org/10.48324/dandi.000231/0.220904.1554	crossref	references	1	add
2	https://doi.org/10.48324/dandi.000252/0.230408.2207	https://doi.org/10.1101/2022.12.07.22283227	datacite-crossref	is-described-by	1	create
3	https://doi.org/10.48324/dandi.000252/0.230408.2207	https://orcid.org/0000-0002-8040-8844	datacite-orcid-auto-update	is-authored-by	1	create
4	https://doi.org/10.48324/dandi.000252/0.230408.2207	https://orcid.org/0000-0002-0101-2455	datacite-orcid-auto-update	is-authored-by	1	create
5	https://doi.org/10.48324/dandi.000252/0.230408.2207	https://orcid.org/0000-0002-8765-7253	datacite-orcid-auto-update	is-authored-by	1	create

Not sure whether datacite-orcid-auto-update should be counted. It seems there are 3 crossref related events, yet none refers to dandi.000404/0.230605.2024
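One way to sidestep the ORCID question is to keep only DOI-to-DOI events and to accept the DANDI DOI on either side of the relation. A minimal sketch over the attribute records listed above; `dandi_citation_pairs` is a hypothetical helper, and whether `is-described-by` should count as a citation remains a judgment call:

```python
def dandi_citation_pairs(records):
    """Yield (dandi_doi_url, other_doi_url, relation) for DOI-to-DOI events only."""
    doi_url = "https://doi.org/"
    for rec in records:
        subj, obj = rec["subj-id"], rec["obj-id"]
        # Skip links to non-DOI objects such as ORCID profiles.
        if not (subj.startswith(doi_url) and obj.startswith(doi_url)):
            continue
        # The DANDI DOI can sit on either side of the relation.
        if "/dandi." in obj:
            yield obj, subj, rec["relation-type-id"]
        elif "/dandi." in subj:
            yield subj, obj, rec["relation-type-id"]


# Three records trimmed from the listing above.
records = [
    {"subj-id": "https://doi.org/10.1038/s41597-022-01728-1",
     "obj-id": "https://doi.org/10.48324/dandi.000231/0.220904.1554",
     "relation-type-id": "references"},
    {"subj-id": "https://doi.org/10.48324/dandi.000252/0.230408.2207",
     "obj-id": "https://doi.org/10.1101/2022.12.07.22283227",
     "relation-type-id": "is-described-by"},
    {"subj-id": "https://doi.org/10.48324/dandi.000252/0.230408.2207",
     "obj-id": "https://orcid.org/0000-0002-8040-8844",
     "relation-type-id": "is-authored-by"},
]
for pair in dandi_citation_pairs(records):
    print(pair)
```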

So I went to Google Scholar and tried to look for all dandiset versions that have a DOI using scholarly

simplified code

import re
from itertools import islice

import pandas as pd
from tqdm import tqdm
from scholarly import scholarly, ProxyGenerator

pg = ProxyGenerator()

# gather dandiset IDs
# assume `doi_list` holds the DOIs of all published dandiset versions
# keep only the suffix, e.g. 'dandi.000055/0.220127.0436'
dandiset_IDs = [re.search(r'dandi.+', x).group() for x in doi_list]

# use scholarly to query
pubs = dict()

for dandiset in tqdm(dandiset_IDs):
    if dandiset in pubs:
        # already queried; lets the loop resume after a restart
        continue

    # pick a fresh free proxy for each query
    pg.FreeProxies()
    scholarly.use_proxy(pg)
    query_results = scholarly.search_pubs(dandiset)

    # islice stops cleanly if fewer results arrive than `total_results` reports
    pubs[dandiset] = pd.DataFrame(
        list(islice(query_results, query_results.total_results))
    ).assign(dandiset=dandiset)

# save data; assume `out_path` is the destination JSON file
df_pubs = pd.concat(list(pubs.values()), ignore_index=True)
df_pubs.to_json(out_path, orient='records')

	dandiset	pub_url	pub_title	pub_year
0	dandi.000019/0.220126.2148	https://elifesciences.org/articles/78362	The neurodata without borders ecosystem for neurophysiological data science	2022
1	dandi.000037/0.230426.0054	https://www.nature.com/articles/s41597-023-02214-y	Responses of pyramidal cell somata and apical dendrites in mouse visual cortex over multiple days	2023
2	dandi.000055/0.220127.0436	https://www.nature.com/articles/s41597-022-01280-y	AJILE12: Long-term naturalistic human intracranial neural recordings and pose	2022
3	dandi.000055/0.220127.0436	https://arxiv.org/abs/2302.08643	Fast Temporal Wavelet Graph Neural Networks	2023
4	dandi.000165/0.211118.1526	https://www.cell.com/cell-reports/pdf/S2211-1247(21)01655-7.pdf	Dentate gyrus and CA3 GABAergic interneurons bidirectionally modulate signatures of internal and external drive to CA1	2021
5	dandi.000207/0.220216.0323	https://www.nature.com/articles/s41593-022-01020-w	Neurons detect cognitive boundaries to structure episodic memories in humans	2022
6	dandi.000230/0.220506.1516	https://www.cell.com/cell-reports-methods/pdf/S2667-2375(22)00084-4.pdf	All-viral tracing of monosynaptic inputs to single birthdate-defined neurons in the intact brain	2022
7	dandi.000231/0.220904.1554	https://www.nature.com/articles/s41597-022-01728-1	A detailed behavioral, videographic, and neural dataset on object recognition in mice	2022
8	dandi.000292/0.220708.1652	https://academic.oup.com/gigascience/article-abstract/doi/10.1093/gigascience/giac108/6827564	An in vitro whole-cell electrophysiology dataset of human cortical neurons	2022
9	dandi.000293/0.220708.1652	https://academic.oup.com/gigascience/article-abstract/doi/10.1093/gigascience/giac108/6827564	An in vitro whole-cell electrophysiology dataset of human cortical neurons	2022
10	dandi.000404/0.230605.2024	https://www.cell.com/current-biology/pdf/S0960-9822(23)00778-9.pdf	Invariant neural dynamics drive commands to control different movements	2023
11	dandi.000447/0.230316.2133	https://www.sciencedirect.com/science/article/pii/S266616672300480X	Protocol for geometric transformation of cognitive maps for generalization across hippocampal-prefrontal circuits	2023
12	dandi.000458/0.230317.0039	https://elifesciences.org/articles/84630	Cortico-thalamo-cortical interactions modulate electrically evoked EEG responses in mice	2023
13	dandi.000473/0.230417.1502	https://www.nature.com/articles/s41593-023-01367-8	Esr1+ hypothalamic-habenula neurons shape aversive states	2023
14	dandi.000488/0.230602.2022	https://www.biorxiv.org/content/10.1101/2023.06.02.543483.abstract	Differential encoding of temporal context and expectation under representational drift across hierarchically connected areas	2023

That's actually how I was able to find dandi.000404. Another example where the dandiset DOI appears in the "Data availability" section but not in "References" is dandi.000473/0.230417.1502, which is also absent from the DataCite API results above.
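The gap between the two sources can be made explicit with a set difference over DOI suffixes. A small sketch using a few entries from the results above (not the full lists); `suffix` is a hypothetical helper:

```python
def suffix(doi_or_url: str) -> str:
    """Reduce a DOI or DOI URL to its 'dandi.NNNNNN/version' suffix."""
    return doi_or_url[doi_or_url.index("dandi."):]


# Dandiset versions that appear in the DataCite events above.
datacite_hits = {suffix(d) for d in [
    "https://doi.org/10.48324/dandi.000055/0.220127.0436",
    "https://doi.org/10.48324/dandi.000231/0.220904.1554",
    "https://doi.org/10.48324/dandi.000252/0.230408.2207",
]}

# A subset of the dandiset versions found through Google Scholar.
scholar_hits = {
    "dandi.000055/0.220127.0436",
    "dandi.000231/0.220904.1554",
    "dandi.000404/0.230605.2024",
    "dandi.000473/0.230417.1502",
}

# Versions cited in papers that DataCite does not report.
missed = sorted(scholar_hits - datacite_hits)
print(missed)
```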

yarikoptic · 2024-10-28T19:59:29Z

Hi @tuanpham96 -- thanks for going through it! I thought to take advantage of all your work: installed bleeding-edge scholarly, tuned up the script, and was ready to profit, but ran into raise MaxTriesExceededException("Cannot Fetch from Google Scholar.") so we might just be out of luck going through Google Scholar any longer?

FWIW: here is the adjusted version, which gets the dandiset DOIs from the API

import re
from collections import defaultdict
from itertools import islice

import pandas as pd
from tqdm import tqdm
from scholarly import scholarly, ProxyGenerator

from dandi.dandiapi import DandiAPIClient

pg = ProxyGenerator()

# keep DOIs grouped per dandiset in case we decide to report per dandiset
dois = defaultdict(list)

with DandiAPIClient.for_dandi_instance("dandi") as client:
    for dandiset in client.get_dandisets():
        if dandiset.most_recent_published_version is None:
            continue
        # collect the full DOI of every published version;
        # only the 'dandi...' suffix is used in the search below
        for version in dandiset.get_versions():
            if version.identifier != 'draft':
                dois[dandiset.identifier].append(
                    dandiset.for_version(version.identifier).get_raw_metadata()['doi']
                )
        # NOTE: stops after the first published dandiset (looks like a
        # debugging leftover); remove to process all dandisets
        break

# use scholarly to query
pubs = dict()

for dandiset, dandiset_dois in tqdm(dois.items()):
    if dandiset in pubs:
        # already queried; lets the loop resume after a restart
        continue
    frames = []
    for doi in dandiset_dois:
        pg.FreeProxies()
        scholarly.use_proxy(pg)
        query_results = scholarly.search_pubs(re.search(r'dandi.+', doi).group())

        # islice stops cleanly if fewer results arrive than `total_results` reports
        frames.append(pd.DataFrame(
            list(islice(query_results, query_results.total_results))
        ).assign(dandiset=dandiset))
    # store once per dandiset so results for earlier versions are not overwritten
    pubs[dandiset] = pd.concat(frames, ignore_index=True)

# save data; assume `out_path` is the destination JSON file
df_pubs = pd.concat(list(pubs.values()), ignore_index=True)
df_pubs.to_json(out_path, orient='records')

tuanpham96 · 2024-10-29T15:54:08Z

@yarikoptic thanks for trying that out. And sorry I forgot to update you: this has been broken for some time now. I was able to try SerpAPI, but the free account only allows ~100 searches per month, which is neither sufficient nor sustainable. So I guess crossref / datacite would be the only way to go, and Google Scholar is out the window, as far as I'm aware.

yarikoptic · 2024-10-29T16:34:55Z

ha -- didn't know about SerpAPI. In principle we could cover the cost of querying if it came to a reasonable amount, but I agree that relying on such a service is not sustainable overall ;-) Maybe we could make it "modular" -- get everything from crossref/datacite, then complement it with findings from Google, e.g. via SerpAPI on some "round robin" schedule to start with, and who knows what other means (e.g. scraping paper texts from PubMed?... found https://github.com/jannisborn/paperscraper )
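A "round robin" schedule under a fixed search budget could be as simple as remembering where the previous run stopped. A minimal sketch, assuming the DOI list is kept in a stable order between runs; `next_batch` and the budget value are illustrative only:

```python
def next_batch(dois, start, budget):
    """Return (batch, new_start): up to `budget` DOIs from `start`, wrapping around."""
    if not dois or budget <= 0:
        return [], start
    n = len(dois)
    size = min(budget, n)  # never query the same DOI twice in one run
    batch = [dois[(start + i) % n] for i in range(size)]
    return batch, (start + size) % n


dois = ["doi-a", "doi-b", "doi-c", "doi-d", "doi-e"]
batch, cursor = next_batch(dois, start=4, budget=2)
print(batch, cursor)
```

Persisting the returned cursor (for example in a small state file) lets each scheduled run continue from where the last one ended, so every DOI gets re-checked eventually without exceeding the quota.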

tuanpham96 · 2024-10-30T16:37:04Z

oh that's a cool idea. I quickly tried it with dandi.000404/0.230605.2024 but it doesn't seem to return the relevant Current Biology paper. Not sure if one would need to get the PubMed data dump somehow. Any thoughts?

from paperscraper.pubmed import get_and_dump_pubmed_papers

# returns non-relevant papers
query=[["dandi.000404/0.230605.2024"]]
get_and_dump_pubmed_papers(query, output_filepath='test.jsonl')

# returns non-relevant papers
query=["dandi.000404/0.230605.2024"]
get_and_dump_pubmed_papers(query, output_filepath='test.jsonl')

# returns nothing
query=[["dandi"],["000404"],["0.230605.2024"]]
get_and_dump_pubmed_papers(query, output_filepath='test.jsonl')

# returns nothing
query=[["dandi.000404"],["0.230605.2024"]]
get_and_dump_pubmed_papers(query, output_filepath='test.jsonl')

yarikoptic mentioned this issue Jul 28, 2023

usage stats on dataset pages dandi/helpdesk#105

Open

yarikoptic added the enhancement New feature or request label Sep 20, 2023
