Discussing Priorities & Direction #10

Open
mekarpeles opened this issue Feb 10, 2016 · 7 comments

@mekarpeles
Collaborator

Should we prioritize:

  • inventory: attempt to collaboratively orchestrate an inventory of academic documents (across institutions)?
  • crawlers: focus on organizing information on existing tools & crawlers + ways to contribute?
  • standards/apis: explore more interoperable standards / promote sharing?
  • decentralize: push for a paradigm shift toward distributing & decentralizing storage?
  • classify: raise awareness about a plan to classify open-access works?
  • end-user interfaces: for better navigating research (and unifying disparate sources)?
@davidar
Collaborator

davidar commented Feb 12, 2016

My top three:

  1. decentralise, or at least make content and metadata easy to mirror (single points of failure are bad, geographic redundancy within a single organisation isn't enough)
  2. standard lossless metadata format (Dublin Core is too lossy, nonstandard XML schemas are difficult to work with)
  3. crawling infrastructure for existing repositories (to ease migration to the above two points)

I think the interface related stuff can run in parallel to these.

@wetneb
Collaborator

wetneb commented Feb 12, 2016

Thinking about protocols and metadata formats would be very interesting indeed, especially since many people from different backgrounds have joined. What would be the scope of it? Designing our own decentralized storage and metadata format for our own use? Or designing a better OAI-PMH (say) that we would like content providers to adopt? The latter is a very long shot (but exciting), and has a heavy political component. People at OpenAIRE+ have been trying to do this (basically they promote their own enhanced version of oai_dc, and are gaining momentum: https://www.mail-archive.com/[email protected]/msg11122.html).

@mekarpeles
Collaborator Author

@wetneb I'll try to invite someone from OpenAIRE to the community. I can see how a better OAI-PMH could be useful (something with less friction for pub/sub that can also handle callbacks). At the same time, BASE and CORE have demonstrated well that the very existence of OAI-PMH has let us apply the Pareto principle (80% of the value for 20% of the work, in this case). Perhaps we can identify which remaining sources don't use OAI-PMH at all?
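For anyone who hasn't harvested over OAI-PMH before, here is a rough sketch of what a harvester does per page. It only builds the `ListRecords` request URL and parses one response; it does no network I/O. The helper names (`list_records_url`, `parse_page`) and the `example.org` endpoint are made up for illustration, and real endpoints vary in which metadata prefixes they support.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# XML namespace used by OAI-PMH 2.0 responses.
OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}


def list_records_url(base_url, metadata_prefix="oai_dc", resumption_token=None):
    """Build a ListRecords request URL for an OAI-PMH endpoint (hypothetical helper)."""
    if resumption_token:
        # When resuming, resumptionToken is sent as the only argument besides the verb.
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    return base_url + "?" + urlencode(params)


def parse_page(xml_text):
    """Return (record identifiers, next resumption token or None) for one response page."""
    root = ET.fromstring(xml_text)
    ids = [e.text for e in root.findall(".//oai:header/oai:identifier", OAI_NS)]
    token = root.find(".//oai:resumptionToken", OAI_NS)
    # An empty or missing token means the list is complete.
    next_token = token.text if token is not None and token.text else None
    return ids, next_token
```

A harvester would loop: fetch `list_records_url(base)`, call `parse_page`, then keep requesting with the returned token until it comes back as `None`. The polling loop is the friction that a pub/sub or callback mechanism would remove.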

Redundancy. As @davidar suggests, I do think having a policy for redundancy is important, e.g. if a project like BASE were able to determine where else a paper lived. I think IPFS could alleviate a lot of the contention over who owns what moving forward.

Dissem.in and CiteSeerX are in really interesting spaces -- tools and crawlers for collecting and classifying papers. I think raising awareness about tools and doing more research to create a coherent narrative between these tools can have a big impact. For instance EIFL and Dissem.in have a lot in common but likely aren't leveraging each other as much as they could be.

I think doing a survey and determining what projects are out there, what their goals and needs are, and then writing a paper on results could be a good way to determine what's next. Also, perhaps we can work together to create a website for discovering the right tools, like Thomas Crouzier has done: http://connectedresearchers.com/online-tools-for-researchers/

@aeschylus
Collaborator

On the topic of decentralisation, what do people think about IPFS for a "mirror" of the content, and Mediachain, which is based on IPLD, for storing metadata? This would make it easy for anyone to contribute to guaranteed access by "pinning" the relevant files, and keep the metadata representations synchronised across systems. One major problem with OAI over the years has been synchronising repository representations, which they do the same way every other library system tries to do anything: explicitly describe each change as yet another publication.

Something like IPFS/IPLD/Mediachain, or even just torrents, would give a deeper guarantee at the "computer science" level of protocols.
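To make the "deeper guarantee" concrete, here is a minimal sketch of content addressing: the identifier is derived from the bytes themselves, so a copy fetched from any mirror can be checked without trusting that mirror. This is a simplification. IPFS uses multihash-based identifiers over chunked data rather than a bare SHA-256 hex digest, and the function names here are invented for illustration.

```python
import hashlib


def content_address(data: bytes) -> str:
    """Derive an identifier from the content itself (simplified: plain SHA-256 hex)."""
    return hashlib.sha256(data).hexdigest()


def verify_mirror_copy(expected_address: str, data: bytes) -> bool:
    """Check that bytes fetched from an untrusted mirror match the expected address."""
    return content_address(data) == expected_address


paper = b"%PDF-1.4 placeholder bytes standing in for a paper"
address = content_address(paper)
```

Any mirror that serves different bytes for `address` fails `verify_mirror_copy`, which is why it matters less who hosts a given copy.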

@aeschylus
Collaborator

Just noticing that Mek has already mentioned IPFS. I'll just +1 it. What would be a next step pilot for the redundancy goal? This would let us evaluate if IPFS/IPLD/Mediachain are a good choice.

@mekarpeles
Collaborator Author

So, something fairly monumental is in the works. I just spoke with @jjjake and @wumpus at the Archive about running a Pilot Program to distribute + decentralize Open Access publications across all OpenJournal partners using IPFS.

@jbenet, @aeschylus, @MikeTaylor, @mwojnars, @davidar, @gdamdam, @pietsch, @cleegiles, @wetneb -- the plan is to start w/ a source like BASE (or CORE, DOAJ, paperity.org). I will use the Internet Archive's infrastructure to upload the first 10,000 papers in the collection as items into Archive.org and then take these 10,000 items and put them in the Internet Archive Labs' IPFS node. We'd like to encourage BASE, DOAJ, CORE, PLOS, and all our other able partners to do the same thing -- contributing a pilot IPFS node w/ the next 10,000 contiguous blocks of papers.

In order for this to work, the Internet Archive will need a "registry" database for itself which will map Archive.org-specific item identifiers (sha256 hash) to their corresponding IPFS hashes. We imagine other institutions will need something similar (a mapping between their ID space & IPFS hash space). Please let me know if your institution needs help.
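As a starting point for institutions wondering what such a registry might look like, here is a minimal sketch using SQLite. It is not the Archive's actual design: the table name `registry`, the columns `item_id` and `ipfs_hash`, and the helper functions are all assumptions, and the values in the usage note are placeholders, not real identifiers.

```python
import sqlite3


def open_registry(path=":memory:"):
    """Open (or create) a registry mapping local item IDs to IPFS hashes."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS registry ("
        " item_id TEXT PRIMARY KEY,"   # institution-specific identifier
        " ipfs_hash TEXT NOT NULL)"    # hash returned when the item was added to IPFS
    )
    return conn


def register(conn, item_id, ipfs_hash):
    """Record (or update) the IPFS hash for an item."""
    conn.execute(
        "INSERT OR REPLACE INTO registry (item_id, ipfs_hash) VALUES (?, ?)",
        (item_id, ipfs_hash),
    )
    conn.commit()


def lookup(conn, item_id):
    """Return the IPFS hash for an item, or None if it has not been registered."""
    row = conn.execute(
        "SELECT ipfs_hash FROM registry WHERE item_id = ?", (item_id,)
    ).fetchone()
    return row[0] if row else None
```

Usage would be along the lines of `register(conn, "placeholder-item-id", "placeholder-ipfs-hash")` after each upload, then `lookup(conn, "placeholder-item-id")` when resolving an item. Keying on the local ID keeps each institution's ID space independent of the others.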

¡Viva la Revolución!

@davidar
Collaborator

davidar commented Feb 13, 2016

@wetneb I'd say leading by example would be a good first step, and others can adopt it if we can show it works well

@aeschylus yeah, IPLD for metadata, with format based on something like citeproc-json (already being used by crossref et al), or maybe one of the schema.org types, was what I had in mind
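Roughly what that could look like, as a sketch only: a citeproc-json style record, serialised deterministically so that the same metadata always hashes to the same value. The record values are placeholders, sorted-key JSON plus SHA-256 is a stand-in for whatever encoding and hashing IPLD would actually apply, and the helper names are hypothetical.

```python
import hashlib
import json

# Placeholder record using citeproc-json (CSL-JSON) field names.
record = {
    "type": "article-journal",
    "title": "Placeholder title",
    "author": [{"family": "Doe", "given": "Jane"}],
    "container-title": "Placeholder Journal",
    "issued": {"date-parts": [[2016, 2, 13]]},
}


def canonical_bytes(rec):
    """Serialise with sorted keys and fixed separators so equal records give equal bytes."""
    return json.dumps(
        rec, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    ).encode("utf-8")


def metadata_digest(rec):
    """Hash the canonical form (stand-in for a content-addressed metadata ID)."""
    return hashlib.sha256(canonical_bytes(rec)).hexdigest()
```

Because the structured fields survive a round trip through `canonical_bytes`, nothing is flattened the way it would be when squeezing authors and dates into Dublin Core strings.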

@mekarpeles awesome :D
