PubMed-based collaborator recommendation system: suggesting potential collaborators within the biomedical scientific community, based on the network of past collaborations and the research topics of each author.
This is my final graduation project for the Data Science Retreat ML bootcamp. The goal was to create a preliminary recommendation system that finds good author matches, i.e. candidates for future collaboration, based on the network of past collaborations and the research topics of each author.
Due to time and resource constraints, I worked on a limited dataset of 500 authors, all of whom have at some point published a paper with Jennifer Doudna ('Doudna JA'), a renowned genetic engineering scientist and Nobel Prize recipient. The papers and their metadata were collected from PubMed. Whenever a paper had no keywords describing its topic, I generated them from the title and abstract using the GPT-3 API (paid). Then, based on each author's contribution to each paper, I derived the main research topics of each author. In parallel, I built a graph of the authors' past collaborations and generated node embeddings from it. Combining the word embeddings of the topics with the node embeddings of the collaboration graph, I created a pairwise similarity matrix of the authors. This serves as an indirect indication of future collaborations ('the more similar two authors are, the more likely it is that they will, and should, collaborate'). Finally, I deployed the system with Flask for showcasing purposes.
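To make the last step more concrete, here is a minimal sketch of how topic embeddings and node embeddings could be combined into a pairwise author similarity matrix. It is an illustration under stated assumptions, not necessarily the exact method used in this project: the weighted concatenation, the `topic_weight` parameter, the function names and the toy vectors are all hypothetical.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length (all-zero rows are left as zeros)."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return x / np.maximum(norms, eps)


def author_similarity(topic_emb, node_emb, topic_weight=0.5):
    """Pairwise similarity from two embedding views of the same authors.

    Assumption: both views are L2-normalized and concatenated with
    square-root weights, so each entry of the result equals
    topic_weight * cos(topic) + (1 - topic_weight) * cos(graph).
    """
    t = l2_normalize(np.asarray(topic_emb, dtype=float))
    g = l2_normalize(np.asarray(node_emb, dtype=float))
    combined = np.hstack([np.sqrt(topic_weight) * t,
                          np.sqrt(1.0 - topic_weight) * g])
    return combined @ combined.T


def recommend(sim, authors, name, k=3):
    """Return the k authors most similar to `name`, excluding `name` itself."""
    i = authors.index(name)
    scores = sim[i].copy()
    scores[i] = -np.inf  # never recommend an author to themselves
    top = np.argsort(scores)[::-1][:k]
    return [(authors[j], float(scores[j])) for j in top]


# Toy data: three placeholder authors with 2-d embeddings per view.
authors = ["Author A", "Author B", "Author C"]
topic_emb = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]  # A and B share topics
node_emb = [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]   # B and C are close in the graph

sim = author_similarity(topic_emb, node_emb)
print(recommend(sim, authors, "Author A", k=2))
```

In this toy example, Author B ranks first for Author A because the two share topics even though they are far apart in the graph. Raising or lowering `topic_weight` shifts the balance between topical similarity and closeness in the collaboration network.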
Disclaimer: This tool should be used as a starting point for exploring potential research partners rather than as a definitive guide for selecting collaborators. The main goal was to familiarize myself with various ML techniques, not to build a complete guide. The recommendations might be biased towards more established researchers with a larger number of publications and collaborations, which could inadvertently overlook early-career researchers or researchers from underrepresented groups. I plan to introduce fairness and diversity metrics in the future to mitigate such biases and avoid homophily. This is an initial approach that I hope to develop into a stronger recommendation system that takes more parameters into account.