diff --git a/tutorials/tutorial-data-access-from-java.ipynb b/tutorials/tutorial-data-access-from-java.ipynb new file mode 100644 index 0000000..d8fe55a --- /dev/null +++ b/tutorials/tutorial-data-access-from-java.ipynb @@ -0,0 +1,635 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# IR Lab Tutorial: Data Access from Java (or any other language)\n", + "\n", + "This tutorial demonstrates how to access [TIRA](https://www.tira.io)/[TIREx](https://www.tira.io) components by loading their outputs." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Basics: Access to Documents and Queries" + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "INFO:root:No settings given in /root/.tira/.tira-settings.json. I will use defaults.\n", + "INFO:root:No settings given in /root/.tira/.tira-settings.json. I will use defaults.\n", + "/root/.tira/extracted_datasets/None/longeval-tiny-train-20240315-training/input-data\n" + ] + } + ], + "source": [ + "!tira-cli download --dataset longeval-tiny-train-20240315-training" + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "documents.jsonl.gz metadata.json queries.jsonl queries.xml\n" + ] + } + ], + "source": [ + "!ls /root/.tira/extracted_datasets/None/longeval-tiny-train-20240315-training/input-data" + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "{\"docno\": \"doc062200109610\", \"text\": \"\\n\\nEDF\\n-\\nGDF School-Valentine (25480)\\n- Opening of electricity and gas meter Opening of your electricity or gas meter at \\u00c9cole-Valentin on the Enedis/ErDF or GrDF network with papernest Free and non-binding service Announcement\\n- papernest is not a 
partner of EDF.\\nThank you.\\nYour request has been taken into account A counsellor will call you back to the I understood\\nIt seems that there is an error with our service Try again Opening your electricity or gas meter at \\u00c9cole-Valentin on the Enedis/ErDF or GrDF network\\nwith agence-france-electricite.fr Call the Me to call back Simple and quick: 5 minutes is enough No commitment or cancellation fee On 13 users Announcement\\n- agency-france-electricite.fr\\nis not a partner of Edf Contacts and rates of Engie gas offers to \\u00c9cole-Valentin\\nEngie\\n, formerly SFM\\nSuez, is one of the main suppliers of energy in Franche-Comt\\u00e9 and throughout France.\\nThe company emerged from the merger between Suez and GDF (Gaz de France) in the summer of 2008\\nbefore changing its name to Engie in the spring of 2015.\\nThe company is with EDF for electricity the historical gas supplier in the 25th and presents gas at the regulated rate.\\nHowever, Engie is indeed an alternative supplier of electricity and offers \\u00c9cole-Valentin fixed price offers and 100% green offers.\\nOn the 25th (Doubs), the number to reach their customer service is 09 69 39 99 93.\\nEnergy Suppliers to \\u00c9cole-Valentin\\nEngie offers gas to \\u00c9cole-Valentin\\nEngie (ex GDF\\nSuez\\n) is the result of the merger of the Gaz de France group and the Suez group.\\nThe alliance between these two gas giants makes ENGIE one of the world's largest energy groups.\\nThe company is the historical gas supplier in Franche\\n-\\nCounty and throughout France,\\nas EDF for electricity\\n.\\nThe gas rates offered by ENGIE to \\u00c9cole-Valentin in 25 (Doubs) are at the regulated rate.\\nAs an alternative electricity supplier, Engie offers its customers in the town of \\u00c9cole-Valentin fixed price offers and 100% green offers.\\nTo contact Engie customer service on the 25th (Doubs), please call 09 69 39 99 93.\\nEverything you need to know about EDF at \\u00c9cole-Valentin At 
\\u00c9cole-Valentin, EDF is the historic supplier of electricity but also markets natural gas for professionals, individuals and local communities.\\nIn the Franche-Comt\\u00e9 region, EDF offers Ecovaliens the kWh of electricity at the regulated rate.\\nAt School\\n-\\nValentin and throughout France, EDF is not the historic supplier of natural gas\\nand can therefore freely choose its tariffs\\n.\\nFor more information about EDF's offers at \\u00c9cole-Valentin or if you have any questions about your EDF bill, you can contact their customer service on 09 69 32 15 or go to the supplier's website : edf.fr.\\nGood to know:\\nThere is no longer an EDF agency in France.\\nThese are permanently closed.\\nDirect\\nEnergy at School-Valentine: everything you need to know Direct\\nEnergie is the leading alternative supplier of natural gas and electricity in France.\\nPrivate individuals and professionals at \\u00c9cole-Valentin have the choice of 3 types of offers: the Classic offer, the Green offer and the 100% online offer.\\nThese offers are available for gas, electricity or both.\\nThe price offered by Direct Energie is lower than the regulated tariff.\\nResidents of the Franche-Comt\\u00e9 region are very satisfied with Direct Energie customer service, which has just been named the best customer service of the year for the ninth consecutive time.\\nTo reach Direct Energy customer service on the 25th (Doubs), here is the corresponding number: 30 99.\\nTotal Spring (ex Lampiris):\\nAlternative supplier of electricity and gas to \\u00c9cole-Valentin Total Spring (formerly Lampiris) is one of the new suppliers of gas and electricity available in France since the opening of the energy market to competition in 2007.\\nThe Belgian company, which was bought by Total in 2016, offers the inhabitants of the town of \\u00c9cole-Valentin in the Franche-Comt\\u00e9 region several non-binding offers, in particular a 100% green offer, another at a fixed price for one year and a dual 
offer (gas and electricity).\\nTo contact Total Spring customer service on 25 (Doubs), you can do so by calling the following number: 09 70 25 02 50.\\nContact information for Engie and EDF at \\u00c9cole-Valentin\\nEngie my client account at \\u00c9cole-Valentin\\nIf you are a customer\\nat Engie (formerly GDF)\\nyou have the right to have a personal space online.\\nYou will therefore have access to your bills and even data relating to your personal gas consumption in the town of \\u00c9cole-Valentin.\\nOn each of your invoices a code of 9 to 10 digits allows you to be identified as a customer.\\nFind out more\\n-when creating your account on the site: https://particuliers.engie.fr/creation-espace-client.html.\\nYou will then just have to confirm the creation of your client account with the code received by sms.\\nContact of EDF and SFM at \\u00c9cole-Valentin\\nWhether for EDF or Engie there are many ways to contact your suppliers.\\nFor EDF: By telephone to the free number 09 69 32 15 15 (open from 8 to 20 h) By mail: EDF Service Client TSA 20012 41975 BLOIS\\nCedex 9\\nWith your customer space For Engie:\\nBy telephone free of charge 09 69 39 99 93\\nOnline on your space\\ncustomer You are therefore able to take out an electricity or gas contract thanks to these contacts.\\nBeware of Engie, we advise you to read our dedicated page since the contact number can change according to your request.\\n\\\" How to open your meter at \\u00c9cole-Valentin?\\nSteps to open your gas meter at \\u00c9cole-Valentin At \\u00c9cole-Valentin and throughout France, the cost of opening the gas meter varies according to the length of time you are prepared to wait.\\nEcovalians who want to open their gas meter should contact the supplier of their choice to take out a gas contract.\\nYou also need to arrange an appointment with a GRDF technician to activate your meter, as it is the GRDF network manager who is responsible for the commissioning of the gas meter at \\u00c9cole-Valentin 
in the Franche-Comt\\u00e9 region (25).\\nOn the day of the technician's visit to your new accommodation, you will not have to pay anything as the fee will be added directly to your next invoice.\\nIf the gas has not been turned off on the day of your move to \\u00c9cole-Valentin, then the technician's pass is no longer necessary but you will still be charged 18.58\\u20ac including tax.\\nIf, on the other hand, the gas no longer works, then the Ecovalians will have the choice between 3 services: type of commissioning Timeframes 2 working days on the same day 143.01\\nYou have just moved into a new house in \\u00c9cole-Valentin in the Franche-Comt\\u00e9 region (25)?\\nRemember to connect your new home to natural gas and ask for a first commissioning.\\nThis will cost you \\u20ac18.26 for a 10 working day response time.\\nOpen your meter at School-Valentine\\nThe steps to open his meter at \\u00c9cole-Valentin are the same for gas as for electricity.\\nThe only parameters that will change are the contact numbers for Ecovalians and the rates.\\nWhat is the procedure for opening an electricity or gas meter?\\n1\\n.\\nFirstly, the inhabitants of 25 (Doubs) should check whether the housing meter in \\u00c9cole-Valentin is properly connected to the Enedis (for electricity) or GrDF (for gas) network;\\n2.\\nIf the \\u00c9cole-Valentin accommodation is not connected, it is necessary to contact the energy distributor and manager concerned; 3.\\nIf it is, Ecovalians can contact the supplier they want, be it the history (EDF for electricity and Engie for gas) or an alternative supplier (Eni, Direct Energie, etc...); 4.\\nDuring this stage, if the electricity at \\u00c9cole-Valentin is functional (so it has not been turned off in the accommodation), you can proceed to the suite.\\nIf the opposite is the case, it is a matter of switching the meter into service in the 25th (Doubs).\\nEcovalians must make an appointment with Enedis or GrDF for a technician to come.\\nIt will not 
be necessary to see an Ecovalian technician arrive if the meter is a smart meter like Linky or Gazpar.\\n5.\\nYou have electricity at \\u00c9cole-Valentin.\\nSharing the article Send To learn more about our policy on controlling, processing and publishing notices:\\nclick here Comment sent !\\nThank you, your comment has been taken into account and will be subject to moderation.\\nTo learn more about our policy on the control, processing and publication of notices:\\nclick here\\n\\n\\n\"}\n", + "{\"docno\": \"doc062203902005\", \"text\": \"\\n\\nPeriungual Bowenoid Papulosis Due to\\nHuman\\nPapillomavirus\\nType 42 932 CASE AND\\nRESEARCH LETTERS\\nPeriungual Bowenoid Papulosis Due to\\nHuman Papillomavirus Type\\n42\\nPapulosis bowenoide periungueal por virus del papiloma humano 42\\nTo the Editor: A 28-year-old man with no relevant past medical history consulted for a progressive, slow-growing exophytic lesion at the edge of the nail of the middle finger on the left hand.\\nThe lesion had been present for 2 years.\\nThe patient reported that he had had multiple sexual partners up to 6 years earlier.\\nPhysical examination revealed a red\\n-grayish exophytic periungual lesion measuring 0.7 cm in diameter with\\na hyperkeratotic surface (Fig.\\n1\\n).\\nHistopathologic examina-\\ntion showed\\na hyperplastic epidermis with loss of normal cell\\narchitecture, moderate cellular atypia, dyskeratosis, and koilocytes in the epidermis (Fig.\\nOverexpression\\nof the tumor suppressor protein p16 was also observed (Fig.\\n3\\n).\\nBased\\non a diagnosis of extragenital bowenoid papulosis,\\nthe patient was re-questioned, but he reported that nei- ther he nor his current partner had had genital or anal warts.\\nLymphocyte counts (including CD4 and CD8 counts) were nor- mal and serologic testing for human immunodeficiency virus (HIV) was negative.\\nHuman papillomavirus (HPV) DNA test- ing using the hybrid capture assay detected HPV 42.\\nIt was decided to excise the 
entire lesion and schedule periodic Figure 1\\nRed\\n-\\ngrayish exophytic periungual lesion with a hyperkeratotic surface.\\nfollow\\n-up visits\\nfor the\\npatient and his partner.\\nThe patient was asymptomatic a year after diagnosis.\\nThe term bowenoid papulosis was introduced to describe multiple wart\\n-like papules in the genital region that are histologically similar to Bowen disease and clinically simi-\\nlar to genital warts.\\nThe condition typically affects young, sexually active individuals and mainly involves the genital, crural, and perianal regions.\\nHowever, there have also been reports of extragenital bowenoid papulosis with\\nor without concomitant genital\\nlesions in both immunocompetent and immunodeficient patients.1 Bowenoid papulosis is associated with\\nHPV infection.\\nWhile there is a close link with HPV 16 infection,1,2 types 18, 31-35, 39, 42, 48, and 51-54 have also been implicated.1---4 Please cite this article as: G\\u00f3mez V\\u00e1zquez M, Navarra Amayuelas R. 
Papulosis bowenoide periungueal por virus del papiloma humano 42.\\nActas Dermosifiliogr.\\n2013;104:932\\n-\\n--934\\n.\\nFigure\\n2 Hypoplastic epidermis with loss of normal cell archi- tecture, moderate cellular atypia, dyskeratosis, and koilocytes (hematoxylin-eosin, original magnification \\u00d740).\\nHPV 16, 18, and 33 are considered to have the greatest oncogenic potential.\\nBowenoid papulosis is considered to be a squamous cell carcinoma in situ and has an estimated risk of malignant transformation of 2.6%.2\\nThe oncogenic mechanism is prob- ably initiated by HPV-induced genetic changes in infected cells.\\nHigh\\n-risk serotypes produce HPV oncoproteins E6 and E7, which are capable of inactivating the Rb and p53 tumor\\nsuppressor proteins, respectively, giving rise to uncontrolled cell\\nproliferation.4 HPV has been detected in bowenoid papulosis lesions and adjacent healthy skin, indicating that HPV infection\\nis a necessary\\nbut not sufficient factor in the development of bowenoid papulosis.\\nOther alterations, such as additional genetic mutations in the host cell, may be necessary.3,4 Bowen papulosis is histologically similar to Bowen dis- ease, but exhibits more focal and less intense changes.1,5,6\\nTo distinguish\\nbetween\\nthe\\n2 entities,\\nit is always necessary to correlate clinical and histologic findings.\\nThe prevalence of Bowenoid papulosis is unknown, as the lesions are commonly confused with other wart-like lesions and are frequently destroyed without histologi- cal examination.7 Since the lesions tend to recur and have malignant potential, patients and their sexual part- ners should be referred for periodic follow-up and an immunological study in the case of persistent or recurrent lesions.7\\nThe variable, with some lesions dx.doi.org/ http://refhub.elsevier.com/S1578-2190(13)00244-8/sbref0005 
http://refhub.elsevier.com/S1578-2190(13)00244.sesevier.com\\nhttp://refhub.elsevier.com/S1578-2190(13)00244-8/sbref0010\\nhttp://refhub.elsevier.com/S1578-2190(13)00244-8/sbref0010\\nhttp://refhub.elsevier.com/S1578-2190(13)00244-8/sbref0010\\nhttp://refhub.elsevier.com/S1578-2190(13)00244-8/sbref0010\\nhttp://refhub.elsevier.com/S1578-2190(13)00244-8/sbref0010\\nhttp://refhub.elsevier.com/S1578-2190(13)00244-8/sbref0010\\nhttp://refhub.elsevier.com/S1578-2190(13)00244-8/sbref0010\\nhttp://refhub.elsevier.com/S1578-2190(13)00244-8/sbref0010\\nhttp://refhub.elsevier.com/S1578-2190(13)00244-8/sbref0010\\nhttp://refhub.elsevier.com/S1578-2190(13)00244-8/sbref0010\\nhttp://refhub.elsevier.com/S1578-2190(13)00244-8/sbref0010\\nhttp://refhub.elsevier.com/S1578-2190(13)00244-8/sbref0010\\nhttp://refhub.elsevier.com/S1578-2190(13)00244-8/sbref0010\\nhttp://refhub.elsevier.com/S1578-2190(13)00244-8/sbref0010\\nhttp://refhub.elsevier.com/S1578-2190(13)00244-8/sbref0010\\nhttp://refhub.elsevier.com/S1578-2190(13)00244-8/sbref0010\\nhttp://refhub.elsevier.com/S1578-2190(13)00244-8/sbref0010\\nhttp://refhub.elsevier.com/S1578-2190(13)00244-8/sbref0010\\nhttp://refhub.elsevier.com/S1578-2190(13)00244-8/sbref0010\\nhttp://refhub.elsevier.com/S1578-2190(13)00244-8/sbref0010\\nhttp://refhub.elsevier.com/S1578-2190(13)00244-8/sbref0010\\nhttp://refhub.elsevier.com/S1578-2190(13)00244-8/sbref0010\\nhttp://refhub.elsevier.com/S1578-2190(13)00244-8/sbref0010\\nhttp://refhub.elsevier.com/S1578-2190(13)00244-8/sbref0010\\nhttp://refhub.elsevier.com/S1578-2190(13)00244-8/sbref0010\\nhttp://refhub.elsevier.com/S1578-2190(13)00244-8/sbref0010\\nhttp://refhub.elsevier.com/S1578-2190(13)00244-8/sbref0010\\nhttp://refhub.elsevier.com/S1578-2190(13)00244-8/sbref0015 http://refhub.elsevier.com/S1578-2190(13)00244-8/hubsesesevier.com\\nhttp://refhub.elsevier.com/S1578-2190(13)00244-8/sbref0020 http://refhub.elsevier.com/S1578-2190(13)00244-8/hubsesesevier.com\\n\\n\\n\"}\n", + "\n", + "gzip: 
stdout: Broken pipe\n" + ] + } + ], + "source": [ + "!zcat /root/.tira/extracted_datasets/None/longeval-tiny-train-20240315-training/input-data/documents.jsonl.gz|head -2 " + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Advanced: Query Performance Prediction\n", + "\n", + "Paper: [An Enhanced Evaluation Framework for\n", + "Query Performance Prediction](https://iiia.dei.unipd.it/research/papers/2021/ECIR2021-FZCFS.pdf)" + ] + }, + { + "cell_type": "code", + "execution_count": 10, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "INFO:root:No settings given in /root/.tira/.tira-settings.json. I will use defaults.\n", + "INFO:root:No settings given in /root/.tira/.tira-settings.json. I will use defaults.\n", + "Download from the Incubator: https://files.webis.de/data-in-production/data-research/tira-zenodo-dump-preparation/query-processors-in-progress/qpptk-all-predictors-clef-labs.zip\n", + "\tThis is only used for last spot checks before archival to Zenodo.\n", + "Download: 100%|█████████████████████████████| 969k/969k [00:00<00:00, 3.82MiB/s]\n", + "Download finished. 
Extract...\n", + "Extraction finished: /root/.tira/extracted_runs/ir-benchmarks/longeval-train-20230513-training/qpptk\n", + "/root/.tira/extracted_runs/ir-benchmarks/longeval-train-20230513-training/qpptk/2024-02-27-21-19-19/output\n" + ] + } + ], + "source": [ + "!tira-cli download --dataset longeval-train-20230513-training --approach ir-benchmarks/qpptk/all-predictors" + ] + }, + { + "cell_type": "code", + "execution_count": 13, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "{\"qid\":\"q06223196\",\"max-idf\":4.0958698531,\"avg-idf\":3.1749690458,\"scq\":80.9333491421,\"max-scq\":47.6725992603,\"avg-scq\":40.4666745711,\"var\":3.9709926948,\"max-var\":2.0144139575,\"avg-var\":1.9854963474,\"wig+5\":6.182578941,\"nqc+5\":0.0088258559,\"smv+5\":0.0066671207,\"clarity+5+100\":4.5015133219,\"wig+10\":5.9957913108,\"nqc+10\":0.0187822357,\"smv+10\":0.0156152256,\"clarity+10+100\":4.4665091851,\"wig+20\":5.4968548214,\"nqc+20\":0.0439798072,\"smv+20\":0.0408074922,\"clarity+20+100\":4.4414408844,\"wig+50\":5.0234189643,\"nqc+50\":0.0423631289,\"smv+50\":0.0329943025,\"clarity+50+100\":4.2113685018,\"wig+100\":4.7201972152,\"nqc+100\":0.0396683849,\"smv+100\":0.0257729994,\"clarity+100+100\":4.0947997525,\"wig+1000\":3.4686269255,\"nqc+1000\":0.0437217743,\"smv+1000\":0.0317818778,\"clarity+1000+100\":3.7476998003}\n", + 
"{\"qid\":\"q062228\",\"max-idf\":3.4870977938,\"avg-idf\":3.4870977938,\"scq\":44.6419616273,\"max-scq\":44.6419616273,\"avg-scq\":44.6419616273,\"var\":2.5749288345,\"max-var\":2.5749288345,\"avg-var\":2.5749288345,\"wig+5\":5.992612212,\"nqc+5\":0.0323327842,\"smv+5\":0.0246336859,\"clarity+5+100\":6.3435267912,\"wig+10\":5.776255412,\"nqc+10\":0.0348595014,\"smv+10\":0.0238581332,\"clarity+10+100\":5.8166571193,\"wig+20\":5.563106012,\"nqc+20\":0.0347385042,\"smv+20\":0.0264756052,\"clarity+20+100\":5.0992195654,\"wig+50\":5.314391312,\"nqc+50\":0.0324713026,\"smv+50\":0.0219021972,\"clarity+50+100\":4.6399090314,\"wig+100\":5.107497382,\"nqc+100\":0.0333128376,\"smv+100\":0.023725479,\"clarity+100+100\":4.4063334649,\"wig+1000\":4.298347591,\"nqc+1000\":0.0416533156,\"smv+1000\":0.0316292003,\"clarity+1000+100\":3.7469415446}\n" + ] + } + ], + "source": [ + "!head -2 /root/.tira/extracted_runs/ir-benchmarks/longeval-train-20230513-training/qpptk/2024-02-27-21-19-19/output/queries.jsonl" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Advanced: Query Segmentation\n", + "\n", + "Paper: [Query Segmentation Revisited](https://webis.de/publications.html?q=segmentation#hagen_2011a)" + ] + }, + { + "cell_type": "code", + "execution_count": 39, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "INFO:root:No settings given in /root/.tira/.tira-settings.json. I will use defaults.\n", + "INFO:root:No settings given in /root/.tira/.tira-settings.json. I will use defaults.\n", + "Download from the Incubator: https://files.webis.de/data-in-production/data-research/tira-zenodo-dump-preparation/query-processors-in-progress/ows-query-segmentation-hyb-a-clef-labs.zip\n", + "\tThis is only used for last spot checks before archival to Zenodo.\n", + "Download: 100%|█████████████████████████████| 275k/275k [00:00<00:00, 2.49MiB/s]\n", + "Download finished. 
Extract...\n", + "Extraction finished: /root/.tira/extracted_runs/ir-benchmarks/longeval-train-20230513-training/ows\n", + "/root/.tira/extracted_runs/ir-benchmarks/longeval-train-20230513-training/ows/2024-02-25-08-12-47/output\n" + ] + } + ], + "source": [ + "!tira-cli download --dataset longeval-train-20230513-training --approach ir-benchmarks/ows/query-segmentation-hyb-a" + ] + }, + { + "cell_type": "code", + "execution_count": 42, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "{\"qid\":\"q062214880\",\"originalQuery\":\"papillomavirus\",\"segmentationApproach\":\"hyb-a\",\"segmentation\":[\"papillomavirus\"]}\n", + "{\"qid\":\"q06225490\",\"originalQuery\":\"weight of a car\",\"segmentationApproach\":\"hyb-a\",\"segmentation\":[\"weight of\",\"a car\"]}\n", + "{\"qid\":\"q06225371\",\"originalQuery\":\"solar panel self-consumption\",\"segmentationApproach\":\"hyb-a\",\"segmentation\":[\"solar panel\",\"self-consumption\"]}\n", + "{\"qid\":\"q062213796\",\"originalQuery\":\"Potato patty\",\"segmentationApproach\":\"hyb-a\",\"segmentation\":[\"potato\",\"patty\"]}\n", + "{\"qid\":\"q062214645\",\"originalQuery\":\"my job centre\",\"segmentationApproach\":\"hyb-a\",\"segmentation\":[\"my\",\"job centre\"]}\n" + ] + } + ], + "source": [ + "!head -5 /root/.tira/extracted_runs/ir-benchmarks/longeval-train-20230513-training/ows/2024-02-25-08-12-47/output/queries.jsonl" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Advanced: Query Intent\n", + "\n", + "Paper: [ORCAS-I: Queries Annotated with Intent using Weak Supervision](https://dl.acm.org/doi/abs/10.1145/3477495.3531737)" + ] + }, + { + "cell_type": "code", + "execution_count": 43, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "INFO:root:No settings given in /root/.tira/.tira-settings.json. I will use defaults.\n", + "INFO:root:No settings given in /root/.tira/.tira-settings.json. 
I will use defaults.\n", + "Download from the Incubator: https://files.webis.de/data-in-production/data-research/tira-zenodo-dump-preparation/query-processors-in-progress/dossier-pre-retrieval-query-intent-clef-labs.zip\n", + "\tThis is only used for last spot checks before archival to Zenodo.\n", + "Download: 100%|█████████████████████████████| 272k/272k [00:00<00:00, 2.59MiB/s]\n", + "Download finished. Extract...\n", + "Extraction finished: /root/.tira/extracted_runs/ir-benchmarks/longeval-train-20230513-training/dossier\n", + "/root/.tira/extracted_runs/ir-benchmarks/longeval-train-20230513-training/dossier/2024-02-26-19-27-33/output\n" + ] + } + ], + "source": [ + "!tira-cli download --dataset longeval-train-20230513-training --approach ir-benchmarks/dossier/pre-retrieval-query-intent" + ] + }, + { + "cell_type": "code", + "execution_count": 49, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "{\"qid\":\"q06223196\",\"intent_prediction\":\"Abstain\"}\n", + "{\"qid\":\"q062228\",\"intent_prediction\":\"Abstain\"}\n", + "{\"qid\":\"q062287\",\"intent_prediction\":\"Abstain\"}\n", + "{\"qid\":\"q06223261\",\"intent_prediction\":\"Transactional\"}\n", + "{\"qid\":\"q062291\",\"intent_prediction\":\"Abstain\"}\n" + ] + } + ], + "source": [ + "!head -5 /root/.tira/extracted_runs/ir-benchmarks/longeval-train-20230513-training/dossier/2024-02-26-19-27-33/output/queries.jsonl" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Advanced: Corpus Graph\n", + "\n", + "Paper: [Adaptive Re-Ranking with a Corpus Graph](https://arxiv.org/pdf/2208.08942.pdf)" + ] + }, + { + "cell_type": "code", + "execution_count": 14, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "INFO:root:No settings given in /root/.tira/.tira-settings.json. I will use defaults.\n", + "INFO:root:No settings given in /root/.tira/.tira-settings.json. 
I will use defaults.\n", + "/root/.tira/extracted_runs/ir-benchmarks/longeval-train-20230513-training/seanmacavaney/2024-03-21-12-46-50/output\n" + ] + } + ], + "source": [ + "!tira-cli download --dataset longeval-train-20230513-training --approach ir-benchmarks/seanmacavaney/corpus-graph" + ] + }, + { + "cell_type": "code", + "execution_count": 17, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "{\"doc_id\": \"doc062209106001\", \"neighbors\": [\"doc062210211350\", \"doc062210406947\", \"doc062210607383\", \"doc062210507204\", \"doc062210607043\", \"doc062210612025\", \"doc062210300409\", \"doc062210413453\", \"doc062210414356\", \"doc062210402464\", \"doc062210407620\", \"doc062210503771\", \"doc062210405214\", \"doc062210700385\", \"doc062210204782\"]}\n", + "{\"doc_id\": \"doc062209106002\", \"neighbors\": [\"doc062208706086\", \"doc062206408053\", \"doc062208906995\", \"doc062209009751\", \"doc062208807503\", \"doc062208805517\", \"doc062208704530\", \"doc062208906844\", \"doc062209008454\", \"doc062206304488\", \"doc062206511806\", \"doc062208900657\", \"doc062206305100\", \"doc062209201327\", \"doc062208908335\"]}\n", + "\n", + "gzip: stdout: Broken pipe\n" + ] + } + ], + "source": [ + "!zcat /root/.tira/extracted_runs/ir-benchmarks/longeval-train-20230513-training/seanmacavaney/2024-03-21-12-46-50/output/documents.jsonl.gz|head -2" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Advanced: Query Expansion with LLMs\n", + "\n", + "Approaches:\n", + "\n", + "- ir-benchmarks/tu-dresden-03/qe-gpt3.5-cot\n", + "- ir-benchmarks/tu-dresden-03/qe-gpt3.5-sq-zs\n", + "- ir-benchmarks/tu-dresden-03/qe-gpt3.5-sq-fs\n", + "- ir-benchmarks/tu-dresden-03/qe-llama-cot\n", + "- ir-benchmarks/tu-dresden-03/qe-llama-sq-zs\n", + "- ir-benchmarks/tu-dresden-03/qe-llama-sq-fs\n", + "- ir-benchmarks/tu-dresden-03/qe-flan-ul2-cot\n", + "- ir-benchmarks/tu-dresden-03/qe-flan-ul2-sq-zs\n", + "- 
ir-benchmarks/tu-dresden-03/qe-flan-ul2-sq-fs" + ] + }, + { + "cell_type": "code", + "execution_count": 36, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "INFO:root:No settings given in /root/.tira/.tira-settings.json. I will use defaults.\n", + "INFO:root:No settings given in /root/.tira/.tira-settings.json. I will use defaults.\n", + "Download from the Incubator: https://files.webis.de/data-in-production/data-research/tira-zenodo-dump-preparation/query-processors-in-progress/tu-dresden-03-qe-gpt3.5-cot-clef-labs.zip\n", + "\tThis is only used for last spot checks before archival to Zenodo.\n", + "Download: 100%|█████████████████████████████| 620k/620k [00:00<00:00, 4.01MiB/s]\n", + "Download finished. Extract...\n", + "Extraction finished: /root/.tira/extracted_runs/ir-benchmarks/longeval-train-20230513-training/tu-dresden-03\n", + "/root/.tira/extracted_runs/ir-benchmarks/longeval-train-20230513-training/tu-dresden-03/2024-03-10-19-13-34/output\n" + ] + } + ], + "source": [ + "!tira-cli download --dataset longeval-train-20230513-training --approach ir-benchmarks/tu-dresden-03/qe-gpt3.5-cot" + ] + }, + { + "cell_type": "code", + "execution_count": 37, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "{\"qid\":\"q06223196\",\"query\":\"A car shelter is a structure designed to provide protection and coverage for vehicles, such as cars, trucks, and motorcycles. It is typically made of materials like metal, wood, or fabric and can come in various forms such as carports, garages, or portable shelters.\\n\\nThe rationale for using a car shelter is to protect vehicles from various elements and environmental factors that can cause damage. This includes protection from sunlight, rain, snow, hail, and wind, which can lead to fading of paint, rusting, corrosion, and other forms of damage. 
Additionally, a car shelter can also provide security by keeping the vehicle out of sight and reducing the risk of theft or vandalism.\\n\\nOverall, a car shelter helps to extend the lifespan of vehicles, maintain their appearance, and provide a safe and secure storage space.\"}\n", + "{\"qid\":\"q062228\",\"query\":\"The term \\\"airport\\\" refers to a facility where aircraft can take off and land, as well as receive services such as fueling, maintenance, and passenger handling. Airports are essential for the operation of air transportation, allowing people and goods to travel quickly and efficiently over long distances. They also play a crucial role in the economy by facilitating trade, tourism, and business activities. Overall, airports are vital infrastructure that connects people and businesses around the world.\"}\n", + "\n", + "gzip: stdout: Broken pipe\n" + ] + } + ], + "source": [ + "!zcat /root/.tira/extracted_runs/ir-benchmarks/longeval-train-20230513-training/tu-dresden-03/2024-03-10-19-13-34/output/queries.jsonl.gz|head -2" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Advanced: DocT5Query\n", + "\n", + "Paper: [From doc2query to docTTTTTquery](https://cs.uwaterloo.ca/~jimmylin/publications/Nogueira_Lin_2019_docTTTTTquery-v2.pdf)" + ] + }, + { + "cell_type": "code", + "execution_count": 18, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "INFO:root:No settings given in /root/.tira/.tira-settings.json. I will use defaults.\n", + "INFO:root:No settings given in /root/.tira/.tira-settings.json. I will use defaults.\n", + "Download from the Incubator: https://files.webis.de/data-in-production/data-research/tira-zenodo-dump-preparation/doc-t5-query/2024-03-19-19-46-01.zip\n", + "\tThis is only used for last spot checks before archival to Zenodo.\n", + "Download: 100%|███████████████████████████| 60.8M/60.8M [00:03<00:00, 16.6MiB/s]\n", + "Download finished. 
Extract...\n", + "Extraction finished: /root/.tira/extracted_runs/ir-benchmarks/longeval-train-20230513-training/seanmacavaney\n", + "/root/.tira/extracted_runs/ir-benchmarks/longeval-train-20230513-training/seanmacavaney/2024-03-19-19-46-01/output\n" + ] + } + ], + "source": [ + "!tira-cli download --dataset longeval-train-20230513-training --approach ir-benchmarks/seanmacavaney/DocT5Query" + ] + }, + { + "cell_type": "code", + "execution_count": 19, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "{\"doc_id\": \"doc062211608898\", \"querygen\": \"when is fiesta in dominicana?\\nwhen is fiesta dominicana\\nwhen is fiesta dinner\"}\n", + "{\"doc_id\": \"doc062214401851\", \"querygen\": \"when is spectacle coming\\nwhen does spectacle show start\\nwhat is the name of the three little pigs circus\"}\n", + "\n", + "gzip: stdout: Broken pipe\n" + ] + } + ], + "source": [ + "!zcat /root/.tira/extracted_runs/ir-benchmarks/longeval-train-20230513-training/seanmacavaney/2024-03-19-19-46-01/output/documents.jsonl.gz|head -2" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Advanced: Query Entity Linking\n", + "\n", + "Paper: [Query Interpretations from Entity-Linked Segmentations](https://webis.de/publications.html?q=Query#kasturia_2022)" + ] + }, + { + "cell_type": "code", + "execution_count": 26, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "INFO:root:No settings given in /root/.tira/.tira-settings.json. I will use defaults.\n", + "INFO:root:No settings given in /root/.tira/.tira-settings.json. 
I will use defaults.\n", + "Download from the Incubator: https://files.webis.de/data-in-production/data-research/tira-zenodo-dump-preparation/query-processors-in-progress/marcel-gohsen-entity-linking-clef-labs.zip\n", + "\tThis is only used for last spot checks before archival to Zenodo.\n", + "Download: 100%|███████████████████████████| 1.82M/1.82M [00:00<00:00, 7.86MiB/s]\n", + "Download finished. Extract...\n", + "Extraction finished: /root/.tira/extracted_runs/ir-benchmarks/longeval-train-20230513-training/marcel-gohsen\n", + "/root/.tira/extracted_runs/ir-benchmarks/longeval-train-20230513-training/marcel-gohsen/2024-02-22-05-05-35/output\n" + ] + } + ], + "source": [ + "!tira-cli download --dataset longeval-train-20230513-training --approach ir-benchmarks/marcel-gohsen/entity-linking" + ] + }, + { + "cell_type": "code", + "execution_count": 28, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "{\"qid\":\"q06223196\",\"query\":\"Car shelter\",\"entities\":[{\"begin\":4,\"end\":11,\"mention\":\"shelter\",\"url\":\"https://en.wikipedia.org/wiki/Shelter_(charity)\",\"score\":0.19759679572763686},{\"begin\":4,\"end\":11,\"mention\":\"shelter\",\"url\":\"https://en.wikipedia.org/wiki/Shelter_(band)\",\"score\":0.13618157543391188},{\"begin\":4,\"end\":11,\"mention\":\"shelter\",\"url\":\"https://en.wikipedia.org/wiki/Shelter_(building)\",\"score\":0.11615487316421896},{\"begin\":0,\"end\":3,\"mention\":\"car\",\"url\":\"https://en.wikipedia.org/wiki/car\",\"score\":0.09153180278509597},{\"begin\":4,\"end\":11,\"mention\":\"shelter\",\"url\":\"https://en.wikipedia.org/wiki/Shelter_Records\",\"score\":0.06675567423230974},{\"begin\":4,\"end\":11,\"mention\":\"shelter\",\"url\":\"https://en.wikipedia.org/wiki/Shelter_(2007_film)\",\"score\":0.04672897196261682},{\"begin\":4,\"end\":11,\"mention\":\"shelter\",\"url\":\"https://en.wikipedia.org/wiki/Shelter\",\"score\":0.044058744993324434},{\"begin\":4,\"end\":11,\"mentio
n\":\"shelter\",\"url\":\"https://en.wikipedia.org/wiki/Shelter_(Porter_Robinson_and_Madeon_song)\",\"score\":0.030707610146862484},{\"begin\":4,\"end\":11,\"mention\":\"shelter\",\"url\":\"https://en.wikipedia.org/wiki/Homeless_shelter\",\"score\":0.02403204272363151},{\"begin\":4,\"end\":11,\"mention\":\"shelter\",\"url\":\"https://en.wikipedia.org/wiki/Shelter_(Lone_Justice_album)\",\"score\":0.02403204272363151},{\"begin\":4,\"end\":11,\"mention\":\"shelter\",\"url\":\"https://en.wikipedia.org/wiki/Shelter_(2010_film)\",\"score\":0.022696929238985315},{\"begin\":4,\"end\":11,\"mention\":\"shelter\",\"url\":\"https://en.wikipedia.org/wiki/Animal_shelter\",\"score\":0.022696929238985315},{\"begin\":4,\"end\":11,\"mention\":\"shelter\",\"url\":\"https://en.wikipedia.org/wiki/Shelter_(2014_film)\",\"score\":0.018691588785046728},{\"begin\":4,\"end\":11,\"mention\":\"shelter\",\"url\":\"https://en.wikipedia.org/wiki/Shelter_(Alcest_album)\",\"score\":0.018691588785046728},{\"begin\":4,\"end\":11,\"mention\":\"shelter\",\"url\":\"https://en.wikipedia.org/wiki/Shelter_(video_game)\",\"score\":0.018691588785046728},{\"begin\":4,\"end\":11,\"mention\":\"shelter\",\"url\":\"https://en.wikipedia.org/wiki/Shelter_(Brand_New_Heavies_album)\",\"score\":0.014686248331108143},{\"begin\":4,\"end\":11,\"mention\":\"shelter\",\"url\":\"https://en.wikipedia.org/wiki/Women's_shelter\",\"score\":0.014686248331108143},{\"begin\":4,\"end\":11,\"mention\":\"shelter\",\"url\":\"https://en.wikipedia.org/wiki/Shelter_(The_xx_song)\",\"score\":0.012016021361815754},{\"begin\":4,\"end\":11,\"mention\":\"shelter\",\"url\":\"https://en.wikipedia.org/wiki/Shelter_(automobile)\",\"score\":0.006675567423230975},{\"begin\":0,\"end\":3,\"mention\":\"car\",\"url\":\"https://en.wikipedia.org/wiki/Car_(magazine)\",\"score\":0.0060971019947309},{\"begin\":0,\"end\":3,\"mention\":\"car\",\"url\":\"https://en.wikipedia.org/wiki/Central_African_Republic\",\"score\":0.005043281896876176},{\"begin\":0,\"end
\":3,\"mention\":\"car\",\"url\":\"https://en.wikipedia.org/wiki/Cordillera_Administrative_Region\",\"score\":0.004968009032743696},{\"begin\":4,\"end\":11,\"mention\":\"shelter\",\"url\":\"https://en.wikipedia.org/wiki/Shelter_(1998_film)\",\"score\":0.004005340453938585},{\"begin\":0,\"end\":3,\"mention\":\"car\",\"url\":\"https://en.wikipedia.org/wiki/Car_language\",\"score\":0.002107640195709447},{\"begin\":0,\"end\":3,\"mention\":\"car\",\"url\":\"https://en.wikipedia.org/wiki/Action_Committee_for_Renewal\",\"score\":0.0018065487391795258},{\"begin\":0,\"end\":3,\"mention\":\"car\",\"url\":\"https://en.wikipedia.org/wiki/Railroad_car\",\"score\":0.0014301844185171245},{\"begin\":4,\"end\":11,\"mention\":\"shelter\",\"url\":\"https://en.wikipedia.org/wiki/Shelter_(Rasa_album)\",\"score\":0.0013351134846461949},{\"begin\":4,\"end\":11,\"mention\":\"shelter\",\"url\":\"https://en.wikipedia.org/wiki/Shelter_(2012_film)\",\"score\":0.0013351134846461949},{\"begin\":4,\"end\":11,\"mention\":\"shelter\",\"url\":\"https://en.wikipedia.org/wiki/Chernobyl_Nuclear_Power_Plant_sarcophagus\",\"score\":0.0013351134846461949},{\"begin\":0,\"end\":3,\"mention\":\"car\",\"url\":\"https://en.wikipedia.org/wiki/Carolina_Panthers\",\"score\":0.001279638690252164},{\"begin\":0,\"end\":3,\"mention\":\"car\",\"url\":\"https://en.wikipedia.org/wiki/Constitutive_androstane_receptor\",\"score\":9.032743695897629E-4},{\"begin\":0,\"end\":3,\"mention\":\"car\",\"url\":\"https://en.wikipedia.org/wiki/Car,_Azerbaijan\",\"score\":6.02182913059842E-4},{\"begin\":0,\"end\":3,\"mention\":\"car\",\"url\":\"https://en.wikipedia.org/wiki/Canadian_Atlantic_Railway\",\"score\":3.01091456529921E-4},{\"begin\":0,\"end\":3,\"mention\":\"car\",\"url\":\"https://en.wikipedia.org/wiki/CAR_and_CDR\",\"score\":2.2581859239744072E-4},{\"begin\":0,\"end\":3,\"mention\":\"car\",\"url\":\"https://en.wikipedia.org/wiki/Carina_(constellation)\",\"score\":1.505457282649605E-4},{\"begin\":0,\"end\":3,\"mention\":\"
car\",\"url\":\"https://en.wikipedia.org/wiki/Car_(Greek_myth)\",\"score\":1.505457282649605E-4},{\"begin\":0,\"end\":3,\"mention\":\"car\",\"url\":\"https://en.wikipedia.org/wiki/Canada_Atlantic_Railway\",\"score\":1.505457282649605E-4},{\"begin\":0,\"end\":3,\"mention\":\"car\",\"url\":\"https://en.wikipedia.org/wiki/Car_(disambiguation)\",\"score\":1.505457282649605E-4},{\"begin\":0,\"end\":3,\"mention\":\"car\",\"url\":\"https://en.wikipedia.org/wiki/Coxsackievirus_and_adenovirus_receptor\",\"score\":7.527286413248025E-5},{\"begin\":0,\"end\":3,\"mention\":\"car\",\"url\":\"https://en.wikipedia.org/wiki/Car_(surname)\",\"score\":7.527286413248025E-5},{\"begin\":0,\"end\":3,\"mention\":\"car\",\"url\":\"https://en.wikipedia.org/wiki/Car_of_Caria\",\"score\":7.527286413248025E-5},{\"begin\":0,\"end\":3,\"mention\":\"car\",\"url\":\"https://en.wikipedia.org/wiki/Peter_Gabriel_(1977_album)\",\"score\":7.527286413248025E-5},{\"begin\":0,\"end\":3,\"mention\":\"car\",\"url\":\"https://en.wikipedia.org/wiki/Central_apparatus_room\",\"score\":7.527286413248025E-5},{\"begin\":0,\"end\":3,\"mention\":\"car\",\"url\":\"https://en.wikipedia.org/wiki/There's_Nothing_Wrong_with_Love\",\"score\":7.527286413248025E-5},{\"begin\":0,\"end\":3,\"mention\":\"car\",\"url\":\"https://en.wikipedia.org/wiki/Computer-assisted_reviewing\",\"score\":7.527286413248025E-5}]}\n", + 
"{\"qid\":\"q062228\",\"query\":\"airport\",\"entities\":[{\"begin\":0,\"end\":7,\"mention\":\"airport\",\"url\":\"https://en.wikipedia.org/wiki/airport\",\"score\":0.7456366828462253},{\"begin\":0,\"end\":7,\"mention\":\"airport\",\"url\":\"https://en.wikipedia.org/wiki/Airport_(1970_film)\",\"score\":0.028916658060518435},{\"begin\":0,\"end\":7,\"mention\":\"airport\",\"url\":\"https://en.wikipedia.org/wiki/Airport_(TV_series)\",\"score\":0.00371785603635237},{\"begin\":0,\"end\":7,\"mention\":\"airport\",\"url\":\"https://en.wikipedia.org/wiki/Airport_(1953_film)\",\"score\":0.002581844469689146},{\"begin\":0,\"end\":7,\"mention\":\"airport\",\"url\":\"https://en.wikipedia.org/wiki/Airport_(novel)\",\"score\":0.00247857069090158},{\"begin\":0,\"end\":7,\"mention\":\"airport\",\"url\":\"https://en.wikipedia.org/wiki/Airport_(1993_film)\",\"score\":0.0016523804606010533},{\"begin\":0,\"end\":7,\"mention\":\"airport\",\"url\":\"https://en.wikipedia.org/wiki/Airport_(song)\",\"score\":0.0013425591242383558},{\"begin\":0,\"end\":7,\"mention\":\"airport\",\"url\":\"https://en.wikipedia.org/wiki/Airport,_California\",\"score\":9.294640090880925E-4},{\"begin\":0,\"end\":7,\"mention\":\"airport\",\"url\":\"https://en.wikipedia.org/wiki/Airport,_Roanoke,_Virginia\",\"score\":5.163688939378292E-4},{\"begin\":0,\"end\":7,\"mention\":\"airport\",\"url\":\"https://en.wikipedia.org/wiki/Airport_(film_series)\",\"score\":2.0654755757513167E-4},{\"begin\":0,\"end\":7,\"mention\":\"airport\",\"url\":\"https://en.wikipedia.org/wiki/Delhi_Airport_metro_station\",\"score\":1.0327377878756583E-4},{\"begin\":0,\"end\":7,\"mention\":\"airport\",\"url\":\"https://en.wikipedia.org/wiki/Airport_station_(MARTA)\",\"score\":1.0327377878756583E-4},{\"begin\":0,\"end\":7,\"mention\":\"airport\",\"url\":\"https://en.wikipedia.org/wiki/Airport_station_(Ottawa)\",\"score\":1.0327377878756583E-4},{\"begin\":0,\"end\":7,\"mention\":\"airport\",\"url\":\"https://en.wikipedia.org/wiki/Airport_(EP)\",
\"score\":1.0327377878756583E-4}]}\n", + "{\"qid\":\"q062287\",\"query\":\"antivirus comparison\",\"entities\":[{\"begin\":0,\"end\":9,\"mention\":\"antivirus\",\"url\":\"https://en.wikipedia.org/wiki/Antivirus_software\",\"score\":0.3641304347826087},{\"begin\":10,\"end\":20,\"mention\":\"comparison\",\"url\":\"https://en.wikipedia.org/wiki/Comparison_(grammar)\",\"score\":0.08602150537634409},{\"begin\":10,\"end\":20,\"mention\":\"comparison\",\"url\":\"https://en.wikipedia.org/wiki/Inequality_(mathematics)\",\"score\":0.010752688172043012},{\"begin\":10,\"end\":20,\"mention\":\"comparison\",\"url\":\"https://en.wikipedia.org/wiki/Social_comparison_theory\",\"score\":0.005376344086021506},{\"begin\":10,\"end\":20,\"mention\":\"comparison\",\"url\":\"https://en.wikipedia.org/wiki/Pairwise_comparison\",\"score\":0.005376344086021506},{\"begin\":10,\"end\":20,\"mention\":\"comparison\",\"url\":\"https://en.wikipedia.org/wiki/Relational_operator\",\"score\":0.005376344086021506}]}\n", + "{\"qid\":\"q06223261\",\"query\":\"free antivirus\",\"entities\":[{\"begin\":5,\"end\":14,\"mention\":\"antivirus\",\"url\":\"https://en.wikipedia.org/wiki/Antivirus_software\",\"score\":0.3641304347826087}]}\n", + "cat: write error: Broken pipe\n" + ] + } + ], + "source": [ + "!cat /root/.tira/extracted_runs/ir-benchmarks/longeval-train-20230513-training/marcel-gohsen/2024-02-22-05-05-35/output/queries.jsonl|head -4" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Advanced: Genre Classification\n", + "\n", + "Paper: [Web Genre Analysis: Use Cases, Retrieval Models, and Implementation Issues](https://webis.de/publications.html?q=genre#stein_2010b)" + ] + }, + { + "cell_type": "code", + "execution_count": 30, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "INFO:root:No settings given in /root/.tira/.tira-settings.json. 
I will use defaults.\n", + "INFO:root:No settings given in /root/.tira/.tira-settings.json. I will use defaults.\n", + "Download: 2.51MiB [00:00, 5.04MiB/s]\n", + "Download finished. Extract...\n", + "Extraction finished: /root/.tira/extracted_runs/ir-benchmarks/longeval-train-20230513-training/tu-dresden-01\n", + "/root/.tira/extracted_runs/ir-benchmarks/longeval-train-20230513-training/tu-dresden-01/2024-03-18-18-34-17/output\n" + ] + } + ], + "source": [ + "!tira-cli download --dataset longeval-train-20230513-training --approach ir-benchmarks/tu-dresden-01/genre-mlp" + ] + }, + { + "cell_type": "code", + "execution_count": 32, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "{\"docno\":\"doc062200602177\",\"predicted_label\":\"Shop\",\"probability_Discussion\":0.0344156198,\"probability_Shop\":0.4675025358,\"probability_Download\":0.0412818462,\"probability_Articles\":0.037098362,\"probability_Help\":0.0942853201,\"probability_Linklists\":0.0314797933,\"probability_Porttrait private\":0.0184404896,\"probability_Protrait non private\":0.2754960331}\n", + "{\"docno\":\"doc062200206592\",\"predicted_label\":\"Help\",\"probability_Discussion\":0.0556134054,\"probability_Shop\":0.0598814937,\"probability_Download\":0.0131974471,\"probability_Articles\":0.0504586852,\"probability_Help\":0.5047823773,\"probability_Linklists\":0.0353957292,\"probability_Porttrait private\":0.0223725693,\"probability_Protrait non private\":0.2582982928}\n", + "{\"docno\":\"doc062210912628\",\"predicted_label\":\"Help\",\"probability_Discussion\":0.0549635234,\"probability_Shop\":0.0319441333,\"probability_Download\":0.0316361181,\"probability_Articles\":0.1685541632,\"probability_Help\":0.3903540373,\"probability_Linklists\":0.0427212461,\"probability_Porttrait private\":0.0415189797,\"probability_Protrait non private\":0.2383077988}\n", + 
"{\"docno\":\"doc062200201629\",\"predicted_label\":\"Help\",\"probability_Discussion\":0.0577689717,\"probability_Shop\":0.0532979111,\"probability_Download\":0.0074223426,\"probability_Articles\":0.1361994421,\"probability_Help\":0.3213793774,\"probability_Linklists\":0.1144573251,\"probability_Porttrait private\":0.0339900171,\"probability_Protrait non private\":0.275484613}\n", + "\n", + "gzip: stdout: Broken pipe\n" + ] + } + ], + "source": [ + "!zcat /root/.tira/extracted_runs/ir-benchmarks/longeval-train-20230513-training/tu-dresden-01/2024-03-18-18-34-17/output/documents.jsonl.gz|head -4" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Advanced: Text Features, e.g., readability, coherence, etc.\n", + "\n", + "Spacy" + ] + }, + { + "cell_type": "code", + "execution_count": 33, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "INFO:root:No settings given in /root/.tira/.tira-settings.json. I will use defaults.\n", + "INFO:root:No settings given in /root/.tira/.tira-settings.json. I will use defaults.\n", + "Download: 16.2MiB [00:01, 8.54MiB/s]\n", + "Download finished. 
Extract...\n", + "Extraction finished: /root/.tira/extracted_runs/ir-benchmarks/longeval-train-20230513-training/tu-dresden-04\n", + "/root/.tira/extracted_runs/ir-benchmarks/longeval-train-20230513-training/tu-dresden-04/2024-03-18-18-16-47/output\n" + ] + } + ], + "source": [ + "!tira-cli download --dataset longeval-train-20230513-training --approach ir-benchmarks/tu-dresden-04/spacy-document-features" + ] + }, + { + "cell_type": "code", + "execution_count": 34, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "{\"docno\":\"doc062200602177\",\"entropy\":19.989747459,\"perplexity\":480216431.4089455605,\"per_word_perplexity\":589222.6152257001,\"first_order_coherence\":0.4858336708,\"second_order_coherence\":0.4434301508,\"flesch_reading_ease\":50.059375,\"flesch_kincaid_grade\":13.5746187943,\"smog\":14.2530751775,\"gunning_fog\":16.9131205674,\"automated_readability_index\":15.6784361702,\"coleman_liau_index\":11.1828085106,\"lix\":54.9069148936,\"rix\":7.5,\"pos_prop_ADJ\":0.0429447853,\"pos_prop_ADP\":0.082208589,\"pos_prop_ADV\":0.0122699387,\"pos_prop_AUX\":0.0196319018,\"pos_prop_CCONJ\":0.0159509202,\"pos_prop_DET\":0.0588957055,\"pos_prop_INTJ\":0.0,\"pos_prop_NOUN\":0.226993865,\"pos_prop_NUM\":0.0355828221,\"pos_prop_PART\":0.0147239264,\"pos_prop_PRON\":0.0429447853,\"pos_prop_PROPN\":0.1521472393,\"pos_prop_PUNCT\":0.1263803681,\"pos_prop_SCONJ\":0.0061349693,\"pos_prop_SYM\":0.0036809816,\"pos_prop_VERB\":0.0625766871,\"pos_prop_X\":0.0049079755,\"token_length_mean\":4.7602836879,\"token_length_median\":4.0,\"token_length_std\":3.3800599985,\"sentence_length_mean\":29.375,\"sentence_length_median\":25.5,\"sentence_length_std\":19.4332457831,\"syllables_per_token_mean\":1.5007092199,\"syllables_per_token_median\":1.0,\"syllables_per_token_std\":0.9288516859,\"n_tokens\":705,\"n_unique_tokens\":303,\"proportion_unique_tokens\":0.429787234,\"n_characters\":3473,\"n_sentences\":24,\"dependency_distance_mean
\":2.7472426064,\"dependency_distance_std\":0.610754567,\"prop_adjacent_dependency_relation_mean\":0.4925269127,\"prop_adjacent_dependency_relation_std\":0.066383583,\"passed_quality_check\":false,\"n_stop_words\":211.0,\"alpha_ratio\":0.7337423313,\"mean_word_length\":4.2625766871,\"doc_length\":815.0,\"symbol_to_word_ratio_#\":0.0,\"proportion_ellipsis\":0.0131578947,\"proportion_bullet_points\":0.0789473684,\"contains_lorem ipsum\":0.0,\"duplicate_line_chr_fraction\":0.0568462679,\"duplicate_paragraph_chr_fraction\":0.0,\"duplicate_ngram_chr_fraction_5\":0.2901631241,\"duplicate_ngram_chr_fraction_6\":0.2748393475,\"duplicate_ngram_chr_fraction_7\":0.2748393475,\"duplicate_ngram_chr_fraction_8\":0.2525951557,\"duplicate_ngram_chr_fraction_9\":0.2268907563,\"duplicate_ngram_chr_fraction_10\":0.2034107761,\"top_ngram_chr_fraction_2\":0.0088976767,\"top_ngram_chr_fraction_3\":0.0434997528,\"top_ngram_chr_fraction_4\":0.037073653,\"oov_ratio\":null}\n", + "{\"docno\":\"doc062200206592\",\"entropy\":29.315342668,\"perplexity\":5388793675820.052734375,\"per_word_perplexity\":5314392185.2268762589,\"first_order_coherence\":0.4980747735,\"second_order_coherence\":0.4931992651,\"flesch_reading_ease\":62.5887949453,\"flesch_kincaid_grade\":10.4721038156,\"smog\":12.2091607464,\"gunning_fog\":13.7928434949,\"automated_readability_index\":11.6141236755,\"coleman_liau_index\":9.283520352,\"lix\":45.4832088472,\"rix\":5.1578947368,\"pos_prop_ADJ\":0.0374753452,\"pos_prop_ADP\":0.1193293886,\"pos_prop_ADV\":0.016765286,\"pos_prop_AUX\":0.057199211,\"pos_prop_CCONJ\":0.0098619329,\"pos_prop_DET\":0.1055226824,\"pos_prop_INTJ\":0.0,\"pos_prop_NOUN\":0.2199211045,\"pos_prop_NUM\":0.0088757396,\"pos_prop_PART\":0.0285996055,\"pos_prop_PRON\":0.0877712032,\"pos_prop_PROPN\":0.0433925049,\"pos_prop_PUNCT\":0.100591716,\"pos_prop_SCONJ\":0.0216962525,\"pos_prop_SYM\":0.0,\"pos_prop_VERB\":0.0956607495,\"pos_prop_X\":0.0059171598,\"token_length_mean\":4.4763476348,\"token_length_median
\":4.0,\"token_length_std\":2.6894123736,\"sentence_length_mean\":23.9210526316,\"sentence_length_median\":21.5,\"sentence_length_std\":14.1037165295,\"syllables_per_token_mean\":1.4180418042,\"syllables_per_token_median\":1.0,\"syllables_per_token_std\":0.7557219313,\"n_tokens\":909,\"n_unique_tokens\":317,\"proportion_unique_tokens\":0.3487348735,\"n_characters\":4176,\"n_sentences\":38,\"dependency_distance_mean\":2.6271047121,\"dependency_distance_std\":0.7258526827,\"prop_adjacent_dependency_relation_mean\":0.4648816373,\"prop_adjacent_dependency_relation_std\":0.0575750452,\"passed_quality_check\":true,\"n_stop_words\":467.0,\"alpha_ratio\":0.8461538462,\"mean_word_length\":4.1183431953,\"doc_length\":1014.0,\"symbol_to_word_ratio_#\":0.0,\"proportion_ellipsis\":0.0227272727,\"proportion_bullet_points\":0.0,\"contains_lorem ipsum\":0.0,\"duplicate_line_chr_fraction\":0.0,\"duplicate_paragraph_chr_fraction\":0.0,\"duplicate_ngram_chr_fraction_5\":0.0518,\"duplicate_ngram_chr_fraction_6\":0.0148,\"duplicate_ngram_chr_fraction_7\":0.0148,\"duplicate_ngram_chr_fraction_8\":0.0,\"duplicate_ngram_chr_fraction_9\":0.0,\"duplicate_ngram_chr_fraction_10\":0.0,\"top_ngram_chr_fraction_2\":0.0096,\"top_ngram_chr_fraction_3\":0.032,\"top_ngram_chr_fraction_4\":0.0102,\"oov_ratio\":null}\n", + 
"{\"docno\":\"doc062210912628\",\"entropy\":18.9317371909,\"perplexity\":166705141.6404820979,\"per_word_perplexity\":206318.2446045571,\"first_order_coherence\":0.4547418656,\"second_order_coherence\":0.3897152896,\"flesch_reading_ease\":57.5173669124,\"flesch_kincaid_grade\":10.8076167076,\"smog\":13.4158895425,\"gunning_fog\":14.7534807535,\"automated_readability_index\":11.8872968878,\"coleman_liau_index\":10.4762702703,\"lix\":45.9377559378,\"rix\":5.2727272727,\"pos_prop_ADJ\":0.0643564356,\"pos_prop_ADP\":0.0804455446,\"pos_prop_ADV\":0.0334158416,\"pos_prop_AUX\":0.0544554455,\"pos_prop_CCONJ\":0.0284653465,\"pos_prop_DET\":0.073019802,\"pos_prop_INTJ\":0.0024752475,\"pos_prop_NOUN\":0.2141089109,\"pos_prop_NUM\":0.0297029703,\"pos_prop_PART\":0.0247524752,\"pos_prop_PRON\":0.0754950495,\"pos_prop_PROPN\":0.0581683168,\"pos_prop_PUNCT\":0.077970297,\"pos_prop_SCONJ\":0.0061881188,\"pos_prop_SYM\":0.0024752475,\"pos_prop_VERB\":0.1101485149,\"pos_prop_X\":0.0,\"token_length_mean\":4.6932432432,\"token_length_median\":4.0,\"token_length_std\":2.9590370549,\"sentence_length_mean\":22.4242424242,\"sentence_length_median\":22.0,\"sentence_length_std\":11.3672317545,\"syllables_per_token_mean\":1.4959459459,\"syllables_per_token_median\":1.0,\"syllables_per_token_std\":0.8581783329,\"n_tokens\":740,\"n_unique_tokens\":300,\"proportion_unique_tokens\":0.4054054054,\"n_characters\":3543,\"n_sentences\":33,\"dependency_distance_mean\":2.4855785109,\"dependency_distance_std\":0.5500287371,\"prop_adjacent_dependency_relation_mean\":0.4784050336,\"prop_adjacent_dependency_relation_std\":0.0694216406,\"passed_quality_check\":true,\"n_stop_words\":309.0,\"alpha_ratio\":0.823019802,\"mean_word_length\":4.3849009901,\"doc_length\":808.0,\"symbol_to_word_ratio_#\":0.0,\"proportion_ellipsis\":0.0,\"proportion_bullet_points\":0.0178571429,\"contains_lorem 
ipsum\":0.0,\"duplicate_line_chr_fraction\":0.003109304,\"duplicate_paragraph_chr_fraction\":0.0,\"duplicate_ngram_chr_fraction_5\":0.1171968429,\"duplicate_ngram_chr_fraction_6\":0.1009327912,\"duplicate_ngram_chr_fraction_7\":0.0801243722,\"duplicate_ngram_chr_fraction_8\":0.0722315236,\"duplicate_ngram_chr_fraction_9\":0.0722315236,\"duplicate_ngram_chr_fraction_10\":0.048792155,\"top_ngram_chr_fraction_2\":0.0110021526,\"top_ngram_chr_fraction_3\":0.0057402535,\"top_ngram_chr_fraction_4\":0.0078928486,\"oov_ratio\":null}\n", + "{\"docno\":\"doc062200201629\",\"entropy\":27.2774941773,\"perplexity\":702207075817.8988037109,\"per_word_perplexity\":683080813.0524307489,\"first_order_coherence\":0.4907128261,\"second_order_coherence\":0.4864410173,\"flesch_reading_ease\":60.6745425978,\"flesch_kincaid_grade\":10.3550209497,\"smog\":12.2516242395,\"gunning_fog\":13.5086592179,\"automated_readability_index\":11.3077234637,\"coleman_liau_index\":9.7805586592,\"lix\":44.9448324022,\"rix\":5.05,\"pos_prop_ADJ\":0.0904669261,\"pos_prop_ADP\":0.1245136187,\"pos_prop_ADV\":0.0116731518,\"pos_prop_AUX\":0.0291828794,\"pos_prop_CCONJ\":0.0252918288,\"pos_prop_DET\":0.0904669261,\"pos_prop_INTJ\":0.0,\"pos_prop_NOUN\":0.2645914397,\"pos_prop_NUM\":0.0243190661,\"pos_prop_PART\":0.0126459144,\"pos_prop_PRON\":0.03307393,\"pos_prop_PROPN\":0.0418287938,\"pos_prop_PUNCT\":0.1215953307,\"pos_prop_SCONJ\":0.0077821012,\"pos_prop_SYM\":0.0038910506,\"pos_prop_VERB\":0.0642023346,\"pos_prop_X\":0.0009727626,\"token_length_mean\":4.5754189944,\"token_length_median\":4.0,\"token_length_std\":2.6427214722,\"sentence_length_mean\":22.375,\"sentence_length_median\":21.5,\"sentence_length_std\":14.3521557614,\"syllables_per_token_mean\":1.4592178771,\"syllables_per_token_median\":1.0,\"syllables_per_token_std\":0.7609619409,\"n_tokens\":895,\"n_unique_tokens\":390,\"proportion_unique_tokens\":0.4357541899,\"n_characters\":4228,\"n_sentences\":40,\"dependency_distance_mean\":2.7274302658,\"
dependency_distance_std\":1.0390001988,\"prop_adjacent_dependency_relation_mean\":0.4991816439,\"prop_adjacent_dependency_relation_std\":0.0653745593,\"passed_quality_check\":true,\"n_stop_words\":367.0,\"alpha_ratio\":0.7947470817,\"mean_word_length\":4.1128404669,\"doc_length\":1028.0,\"symbol_to_word_ratio_#\":0.0,\"proportion_ellipsis\":0.0175438596,\"proportion_bullet_points\":0.0350877193,\"contains_lorem ipsum\":0.0,\"duplicate_line_chr_fraction\":0.0106,\"duplicate_paragraph_chr_fraction\":0.0,\"duplicate_ngram_chr_fraction_5\":0.0656,\"duplicate_ngram_chr_fraction_6\":0.061,\"duplicate_ngram_chr_fraction_7\":0.0394,\"duplicate_ngram_chr_fraction_8\":0.0214,\"duplicate_ngram_chr_fraction_9\":0.0214,\"duplicate_ngram_chr_fraction_10\":0.0214,\"top_ngram_chr_fraction_2\":0.012,\"top_ngram_chr_fraction_3\":0.015,\"top_ngram_chr_fraction_4\":0.0114,\"oov_ratio\":null}\n", + "\n", + "gzip: stdout: Broken pipe\n" + ] + } + ], + "source": [ + "!zcat /root/.tira/extracted_runs/ir-benchmarks/longeval-train-20230513-training/tu-dresden-04/2024-03-18-18-16-47/output/documents.jsonl.gz|head -4" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Advanced: Query Interpretation\n", + "\n", + "Paper: [Query Interpretations from Entity-Linked Segmentations](https://webis.de/publications.html?q=Query#kasturia_2022)" + ] + }, + { + "cell_type": "code", + "execution_count": 20, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "INFO:root:No settings given in /root/.tira/.tira-settings.json. I will use defaults.\n", + "INFO:root:No settings given in /root/.tira/.tira-settings.json. 
I will use defaults.\n", + "Download from the Incubator: https://files.webis.de/data-in-production/data-research/tira-zenodo-dump-preparation/query-processors-in-progress/marcel-gohsen-query-interpretation-clef-labs.zip\n", + "\tThis is only used for last spot checks before archival to Zenodo.\n", + "Download: 100%|█████████████████████████████| 191k/191k [00:00<00:00, 2.40MiB/s]\n", + "Download finished. Extract...\n", + "Extraction finished: /root/.tira/extracted_runs/ir-benchmarks/longeval-train-20230513-training/marcel-gohsen\n", + "/root/.tira/extracted_runs/ir-benchmarks/longeval-train-20230513-training/marcel-gohsen/2024-02-23-07-19-23/output\n" + ] + } + ], + "source": [ + "!tira-cli download --dataset longeval-train-20230513-training --approach ir-benchmarks/marcel-gohsen/query-interpretation" + ] + }, + { + "cell_type": "code", + "execution_count": 24, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "{\"qid\":\"q06223196\",\"query\":\"Car shelter\",\"interpretations\":[{\"id\":0,\"interpretation\":[\"car shelter\"],\"relevance\":0.0,\"containedEntities\":[],\"contextWords\":[\"shelter\",\"car\"],\"score\":0.0}]}\n", + "{\"qid\":\"q062228\",\"query\":\"airport\",\"interpretations\":[{\"id\":0,\"interpretation\":[\"https://en.wikipedia.org/wiki/airport\"],\"relevance\":0.7456366828462253,\"containedEntities\":[\"https://en.wikipedia.org/wiki/airport\"],\"contextWords\":[],\"score\":0.7456366828462253}]}\n", + "{\"qid\":\"q062287\",\"query\":\"antivirus comparison\",\"interpretations\":[{\"id\":0,\"interpretation\":[\"antivirus comparison\"],\"relevance\":0.0,\"containedEntities\":[],\"contextWords\":[\"comparison\",\"antivirus\"],\"score\":0.0}]}\n", + "{\"qid\":\"q06223261\",\"query\":\"free antivirus\",\"interpretations\":[{\"id\":0,\"interpretation\":[\"free antivirus\"],\"relevance\":0.0,\"containedEntities\":[],\"contextWords\":[\"antivirus\",\"free\"],\"score\":0.0}]}\n", + "cat: write error: Broken 
pipe\n" + ] + } + ], + "source": [ + "!cat /root/.tira/extracted_runs/ir-benchmarks/longeval-train-20230513-training/marcel-gohsen/2024-02-23-07-19-23/output/queries.jsonl|head -4" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.10.12" + } + }, + "nbformat": 4, + "nbformat_minor": 2 +} diff --git a/tutorials/tutorial-query-performance-prediction.ipynb b/tutorials/tutorial-query-performance-prediction.ipynb index 28b6814..b828a40 100644 --- a/tutorials/tutorial-query-performance-prediction.ipynb +++ b/tutorials/tutorial-query-performance-prediction.ipynb @@ -18,73 +18,9 @@ }, { "cell_type": "code", - "execution_count": 2, + "execution_count": null, "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Requirement already satisfied: python-terrier in /usr/local/lib/python3.10/dist-packages (0.9.2)\n", - "Requirement already satisfied: tira in /usr/local/lib/python3.10/dist-packages (0.0.77)\n", - "Requirement already satisfied: more-itertools in /usr/local/lib/python3.10/dist-packages (from python-terrier) (10.1.0)\n", - "Requirement already satisfied: ir-measures>=0.3.1 in /usr/local/lib/python3.10/dist-packages (from python-terrier) (0.3.3)\n", - "Requirement already satisfied: dill in /usr/local/lib/python3.10/dist-packages (from python-terrier) (0.3.7)\n", - "Requirement already satisfied: requests in /usr/local/lib/python3.10/dist-packages (from python-terrier) (2.31.0)\n", - "Requirement already satisfied: tqdm in /usr/local/lib/python3.10/dist-packages (from python-terrier) (4.66.1)\n", - "Requirement already satisfied: pandas in /usr/local/lib/python3.10/dist-packages (from python-terrier) (2.1.1)\n", 
- "Requirement already satisfied: matchpy in /usr/local/lib/python3.10/dist-packages (from python-terrier) (0.5.5)\n", - "Requirement already satisfied: statsmodels in /usr/local/lib/python3.10/dist-packages (from python-terrier) (0.14.0)\n", - "Requirement already satisfied: ir-datasets>=0.3.2 in /usr/local/lib/python3.10/dist-packages (from python-terrier) (0.5.5)\n", - "Requirement already satisfied: joblib in /usr/local/lib/python3.10/dist-packages (from python-terrier) (1.3.2)\n", - "Requirement already satisfied: wget in /usr/local/lib/python3.10/dist-packages (from python-terrier) (3.2)\n", - "Requirement already satisfied: deprecated in /usr/local/lib/python3.10/dist-packages (from python-terrier) (1.2.14)\n", - "Requirement already satisfied: scipy in /usr/local/lib/python3.10/dist-packages (from python-terrier) (1.11.3)\n", - "Requirement already satisfied: pyjnius>=1.4.2 in /usr/local/lib/python3.10/dist-packages (from python-terrier) (1.6.0)\n", - "Requirement already satisfied: jinja2 in /usr/local/lib/python3.10/dist-packages (from python-terrier) (3.1.2)\n", - "Requirement already satisfied: numpy in /usr/local/lib/python3.10/dist-packages (from python-terrier) (1.26.1)\n", - "Requirement already satisfied: chest in /usr/local/lib/python3.10/dist-packages (from python-terrier) (0.2.3)\n", - "Requirement already satisfied: nptyping==1.4.4 in /usr/local/lib/python3.10/dist-packages (from python-terrier) (1.4.4)\n", - "Requirement already satisfied: scikit-learn in /usr/local/lib/python3.10/dist-packages (from python-terrier) (1.3.1)\n", - "Requirement already satisfied: pytrec-eval-terrier>=0.5.3 in /usr/local/lib/python3.10/dist-packages (from python-terrier) (0.5.6)\n", - "Requirement already satisfied: typish>=1.7.0 in /usr/local/lib/python3.10/dist-packages (from nptyping==1.4.4->python-terrier) (1.9.3)\n", - "Requirement already satisfied: docker==6.*,>=6.0.0 in /usr/local/lib/python3.10/dist-packages (from tira) (6.1.3)\n", - "Requirement already 
satisfied: packaging>=14.0 in /usr/local/lib/python3.10/dist-packages (from docker==6.*,>=6.0.0->tira) (23.2)\n", - "Requirement already satisfied: websocket-client>=0.32.0 in /usr/local/lib/python3.10/dist-packages (from docker==6.*,>=6.0.0->tira) (1.6.4)\n", - "Requirement already satisfied: urllib3>=1.26.0 in /usr/local/lib/python3.10/dist-packages (from docker==6.*,>=6.0.0->tira) (2.0.7)\n", - "Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.10/dist-packages (from requests->python-terrier) (2023.7.22)\n", - "Requirement already satisfied: idna<4,>=2.5 in /usr/local/lib/python3.10/dist-packages (from requests->python-terrier) (3.4)\n", - "Requirement already satisfied: charset-normalizer<4,>=2 in /usr/local/lib/python3.10/dist-packages (from requests->python-terrier) (3.3.1)\n", - "Requirement already satisfied: trec-car-tools>=2.5.4 in /usr/local/lib/python3.10/dist-packages (from ir-datasets>=0.3.2->python-terrier) (2.6)\n", - "Requirement already satisfied: lz4>=3.1.10 in /usr/local/lib/python3.10/dist-packages (from ir-datasets>=0.3.2->python-terrier) (4.3.2)\n", - "Requirement already satisfied: pyyaml>=5.3.1 in /usr/local/lib/python3.10/dist-packages (from ir-datasets>=0.3.2->python-terrier) (6.0.1)\n", - "Requirement already satisfied: beautifulsoup4>=4.4.1 in /usr/local/lib/python3.10/dist-packages (from ir-datasets>=0.3.2->python-terrier) (4.12.2)\n", - "Requirement already satisfied: ijson>=3.1.3 in /usr/local/lib/python3.10/dist-packages (from ir-datasets>=0.3.2->python-terrier) (3.2.3)\n", - "Requirement already satisfied: pyautocorpus>=0.1.1 in /usr/local/lib/python3.10/dist-packages (from ir-datasets>=0.3.2->python-terrier) (0.1.12)\n", - "Requirement already satisfied: warc3-wet>=0.2.3 in /usr/local/lib/python3.10/dist-packages (from ir-datasets>=0.3.2->python-terrier) (0.2.3)\n", - "Requirement already satisfied: inscriptis>=2.2.0 in /usr/local/lib/python3.10/dist-packages (from ir-datasets>=0.3.2->python-terrier) 
(2.3.2)\n", - "Requirement already satisfied: zlib-state>=0.1.3 in /usr/local/lib/python3.10/dist-packages (from ir-datasets>=0.3.2->python-terrier) (0.1.6)\n", - "Requirement already satisfied: warc3-wet-clueweb09>=0.2.5 in /usr/local/lib/python3.10/dist-packages (from ir-datasets>=0.3.2->python-terrier) (0.2.5)\n", - "Requirement already satisfied: unlzw3>=0.2.1 in /usr/local/lib/python3.10/dist-packages (from ir-datasets>=0.3.2->python-terrier) (0.2.2)\n", - "Requirement already satisfied: lxml>=4.5.2 in /usr/local/lib/python3.10/dist-packages (from ir-datasets>=0.3.2->python-terrier) (4.9.3)\n", - "Requirement already satisfied: cwl-eval>=1.0.10 in /usr/local/lib/python3.10/dist-packages (from ir-measures>=0.3.1->python-terrier) (1.0.12)\n", - "Requirement already satisfied: heapdict in /usr/local/lib/python3.10/dist-packages (from chest->python-terrier) (1.0.1)\n", - "Requirement already satisfied: wrapt<2,>=1.10 in /usr/local/lib/python3.10/dist-packages (from deprecated->python-terrier) (1.15.0)\n", - "Requirement already satisfied: MarkupSafe>=2.0 in /usr/local/lib/python3.10/dist-packages (from jinja2->python-terrier) (2.1.3)\n", - "Requirement already satisfied: multiset<3.0,>=2.0 in /usr/local/lib/python3.10/dist-packages (from matchpy->python-terrier) (2.1.1)\n", - "Requirement already satisfied: pytz>=2020.1 in /usr/local/lib/python3.10/dist-packages (from pandas->python-terrier) (2023.3.post1)\n", - "Requirement already satisfied: python-dateutil>=2.8.2 in /usr/local/lib/python3.10/dist-packages (from pandas->python-terrier) (2.8.2)\n", - "Requirement already satisfied: tzdata>=2022.1 in /usr/local/lib/python3.10/dist-packages (from pandas->python-terrier) (2023.3)\n", - "Requirement already satisfied: threadpoolctl>=2.0.0 in /usr/local/lib/python3.10/dist-packages (from scikit-learn->python-terrier) (3.2.0)\n", - "Requirement already satisfied: patsy>=0.5.2 in /usr/local/lib/python3.10/dist-packages (from statsmodels->python-terrier) (0.5.3)\n", - 
"Requirement already satisfied: soupsieve>1.2 in /usr/local/lib/python3.10/dist-packages (from beautifulsoup4>=4.4.1->ir-datasets>=0.3.2->python-terrier) (2.5)\n", - "Requirement already satisfied: six in /usr/local/lib/python3.10/dist-packages (from patsy>=0.5.2->statsmodels->python-terrier) (1.16.0)\n", - "Requirement already satisfied: cbor>=1.0.0 in /usr/local/lib/python3.10/dist-packages (from trec-car-tools>=2.5.4->ir-datasets>=0.3.2->python-terrier) (1.0.0)\n", - "\u001b[33mWARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv\u001b[0m\u001b[33m\n", - "\u001b[0m" - ] - } - ], + "outputs": [], "source": [ "# This is only needed in Google Colab, in a dev container, everything should be installed already\n", "!pip3 install python-terrier tira" @@ -92,33 +28,9 @@ }, { "cell_type": "code", - "execution_count": 1, + "execution_count": 2, "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Start PyTerrier with version=5.7, helper_version=0.0.7, no_download=True\n" - ] - }, - { - "name": "stderr", - "output_type": "stream", - "text": [ - "PyTerrier 0.9.2 has loaded Terrier 5.7 (built by craigm on 2022-11-10 18:30) and terrier-helper 0.0.7\n", - "\n", - "No etc/terrier.properties, using terrier.default.properties for bootstrap configuration.\n" - ] - }, - { - "name": "stdout", - "output_type": "stream", - "text": [ - "No settings given in /root/.tira/.tira-settings.json. 
I will use defaults.\n" - ] - } - ], + "outputs": [], "source": [ "import pyterrier as pt\n", "from tira.third_party_integrations import ensure_pyterrier_is_loaded\n", @@ -141,13 +53,13 @@ }, { "cell_type": "code", - "execution_count": 29, + "execution_count": 7, "metadata": {}, "outputs": [], "source": [ "def qpp_correlation_to_ground_truth(bm25, qpp, dataset, eval_metrics):\n", " import pandas as pd\n", - " topics = dataset.get_topics(variant='title')\n", + " topics = dataset.get_topics()\n", " df_eval = pt.Experiment([bm25], topics=topics, qrels=dataset.get_qrels(), eval_metrics=eval_metrics, perquery=True, names=['BM25'])\n", " df_predictions = qpp(topics)\n", " df_joined = pd.merge(df_eval, df_predictions, on=['qid'])\n", @@ -161,14 +73,21 @@ }, { "cell_type": "code", - "execution_count": 21, + "execution_count": 8, "metadata": {}, "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "There are multiple query fields available: ('title', 'description', 'narrative'). To use with pyterrier, provide variant or modify dataframe to add query column.\n" + ] + }, { "name": "stderr", "output_type": "stream", "text": [ - "/usr/local/lib/python3.10/dist-packages/pyterrier/pipelines.py:107: UserWarning: 1 topic(s) not found in qrels. Scores for these topics are given as NaN and should not contribute to averages.\n", + "/usr/local/lib/python3.10/dist-packages/pyterrier/pipelines.py:129: UserWarning: 1 topic(s) not found in qrels. Scores for these topics are given as NaN and should not contribute to averages.\n", " warn(f'{backfill_count} topic(s) not found in qrels. 
Scores for these topics are given as NaN and should not contribute to averages.')\n" ] }, @@ -222,13 +141,6 @@ " 0.345846\n", " \n", " \n", - " 11\n", - " clarity+1000+100\n", - " 0.330306\n", - " 0.263366\n", - " 0.385582\n", - " \n", - " \n", " 7\n", " avg-var\n", " 0.302321\n", @@ -289,29 +201,28 @@ "" ], "text/plain": [ - " QPP Method Pearson Correlation Kendall Spearman\n", - "9 nqc+100 0.411153 0.317898 0.451018\n", - "10 smv+100 0.391963 0.302243 0.429123\n", - "8 wig+10 0.342660 0.240992 0.345846\n", - "11 clarity+1000+100 0.330306 0.263366 0.385582\n", - "7 avg-var 0.302321 0.256582 0.369490\n", - "6 max-var 0.276160 0.233053 0.338973\n", - "3 max-scq 0.264181 0.221495 0.323365\n", - "1 avg-idf 0.248902 0.168261 0.243340\n", - "4 avg-scq 0.220149 0.157302 0.229591\n", - "0 max-idf 0.214187 0.175756 0.255930\n", - "5 var 0.163394 0.163238 0.235878\n", - "2 scq 0.015893 0.011904 0.018180" + " QPP Method Pearson Correlation Kendall Spearman\n", + "9 nqc+100 0.411153 0.317898 0.451018\n", + "10 smv+100 0.391963 0.302243 0.429123\n", + "8 wig+10 0.342660 0.240992 0.345846\n", + "7 avg-var 0.302321 0.256582 0.369490\n", + "6 max-var 0.276160 0.233053 0.338973\n", + "3 max-scq 0.264181 0.221495 0.323365\n", + "1 avg-idf 0.248902 0.168261 0.243340\n", + "4 avg-scq 0.220149 0.157302 0.229591\n", + "0 max-idf 0.214187 0.175756 0.255930\n", + "5 var 0.163394 0.163238 0.235878\n", + "2 scq 0.015893 0.011904 0.018180" ] }, - "execution_count": 21, + "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dataset = pt.get_dataset(\"irds:disks45/nocr/trec-robust-2004\")\n", - "qpp = tira.pt.transform_queries('ir-benchmarks/qpptk/all', dataset)\n", + "qpp = tira.pt.transform_queries('ir-benchmarks/qpptk/all-predictors', dataset)\n", "bm25 = tira.pt.from_submission('ir-benchmarks/tira-ir-starter/BM25 Re-Rank (tira-ir-starter-pyterrier)', dataset)\n", "\n", "qpp_correlation_to_ground_truth(bm25, qpp, dataset, ['ndcg_cut_10'])" @@ -319,9 
+230,25 @@ }, { "cell_type": "code", - "execution_count": 26, + "execution_count": 12, "metadata": {}, "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "Download: 199kiB [00:00, 1.94MiB/s]\n" + ] + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Download finished. Extract...\n", + "Extraction finished: /root/.tira/extracted_datasets/ir-benchmarks/msmarco-passage-trec-dl-2019-judged-20230107-training/\n", + "There are multiple query fields available: ('text', 'title', 'query', 'description', 'narrative'). To use with pyterrier, provide variant or modify dataframe to add query column.\n" + ] + }, { "data": { "text/html": [ @@ -446,33 +373,76 @@ "4 avg-scq -0.198025 -0.103219 -0.118642" ] }, - "execution_count": 26, + "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ - "dataset = pt.get_dataset(\"irds:msmarco-passage/trec-dl-2019/judged\")\n", - "qpp = tira.pt.transform_queries('ir-benchmarks/qpptk/all', dataset)\n", - "bm25 = tira.pt.from_submission('ir-benchmarks/tira-ir-starter/BM25 Re-Rank (tira-ir-starter-pyterrier)', dataset)\n", + "dataset = pt.get_dataset(\"irds:ir-benchmarks/msmarco-passage-trec-dl-2019-judged-20230107-training\")\n", + "qpp = tira.pt.transform_queries('ir-benchmarks/qpptk/all-predictors', 'msmarco-passage-trec-dl-2019-judged-20230107-training')\n", + "bm25 = tira.pt.from_submission('ir-benchmarks/tira-ir-starter/BM25 Re-Rank (tira-ir-starter-pyterrier)', 'msmarco-passage-trec-dl-2019-judged-20230107-training')\n", "\n", "qpp_correlation_to_ground_truth(bm25, qpp, dataset, ['ndcg_cut_10'])" ] }, { "cell_type": "code", - "execution_count": 27, + "execution_count": 14, "metadata": {}, "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Download from the Incubator: https://files.webis.de/data-in-production/data-research/tira-zenodo-dump-preparation/query-processors-in-progress/qpptk-all-predictors-trec-recent.zip\n", + "\tThis is 
only used for last spot checks before archival to Zenodo.\n" + ] + }, { "name": "stderr", "output_type": "stream", "text": [ - "[INFO] [starting] https://trec.nist.gov/data/deep/2020qrels-pass.txt\n", - "[INFO] [finished] https://trec.nist.gov/data/deep/2020qrels-pass.txt: [00:00] [219kB] [308kB/s]\n", - "[INFO] [starting] https://msmarco.blob.core.windows.net/msmarcoranking/msmarco-test2020-queries.tsv.gz\n", - "[INFO] [finished] https://msmarco.blob.core.windows.net/msmarcoranking/msmarco-test2020-queries.tsv.gz: [00:00] [4.13kB] [23.9MB/s]\n", - " \r" + "Download: 100%|██████████| 223k/223k [00:00<00:00, 2.41MiB/s]\n" + ] + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Download finished. Extract...\n", + "Extraction finished: /root/.tira/extracted_runs/ir-benchmarks/msmarco-passage-trec-dl-2020-judged-20230107-training/qpptk\n" + ] + }, + { + "name": "stderr", + "output_type": "stream", + "text": [ + "Download: 642kiB [00:00, 4.76MiB/s]\n" + ] + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Download finished. Extract...\n", + "Extraction finished: /root/.tira/extracted_runs/ir-benchmarks/msmarco-passage-trec-dl-2020-judged-20230107-training/tira-ir-starter\n" + ] + }, + { + "name": "stderr", + "output_type": "stream", + "text": [ + "Download: 235kiB [00:00, 2.30MiB/s]\n" + ] + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Download finished. Extract...\n", + "Extraction finished: /root/.tira/extracted_datasets/ir-benchmarks/msmarco-passage-trec-dl-2020-judged-20230107-training/\n", + "There are multiple query fields available: ('text', 'title', 'query', 'description', 'narrative'). 
To use with pyterrier, provide variant or modify dataframe to add query column.\n" ] }, { @@ -599,28 +569,29 @@ "3 max-scq -0.131996 -0.108241 -0.150106" ] }, - "execution_count": 27, + "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ - "dataset = pt.get_dataset(\"irds:msmarco-passage/trec-dl-2020/judged\")\n", - "qpp = tira.pt.transform_queries('ir-benchmarks/qpptk/all', dataset)\n", - "bm25 = tira.pt.from_submission('ir-benchmarks/tira-ir-starter/BM25 Re-Rank (tira-ir-starter-pyterrier)', dataset)\n", + "dataset = pt.get_dataset(\"irds:ir-benchmarks/msmarco-passage-trec-dl-2020-judged-20230107-training\")\n", + "qpp = tira.pt.transform_queries('ir-benchmarks/qpptk/all-predictors', 'msmarco-passage-trec-dl-2020-judged-20230107-training')\n", + "bm25 = tira.pt.from_submission('ir-benchmarks/tira-ir-starter/BM25 Re-Rank (tira-ir-starter-pyterrier)', 'msmarco-passage-trec-dl-2020-judged-20230107-training')\n", "\n", "qpp_correlation_to_ground_truth(bm25, qpp, dataset, ['ndcg_cut_10'])" ] }, { "cell_type": "code", - "execution_count": 35, + "execution_count": 16, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ + "There are multiple query fields available: ('title', 'description', 'narrative'). 
To use with pyterrier, provide variant or modify dataframe to add query column.\n", "/root/.ir_datasets/touche/2020/task-1/qrels.qrels\n" ] }, @@ -748,14 +719,14 @@ "5 var -0.028511 -0.026485 -0.022410" ] }, - "execution_count": 35, + "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dataset = pt.get_dataset(\"irds:argsme/2020-04-01/touche-2020-task-1\")\n", - "qpp = tira.pt.transform_queries('ir-benchmarks/qpptk/all', dataset)\n", + "qpp = tira.pt.transform_queries('ir-benchmarks/qpptk/all-predictors', dataset)\n", "bm25 = tira.pt.from_submission('ir-benchmarks/tira-ir-starter/BM25 Re-Rank (tira-ir-starter-pyterrier)', dataset)\n", "\n", "qpp_correlation_to_ground_truth(bm25, qpp, dataset, ['ndcg_cut_10'])" @@ -763,22 +734,60 @@ }, { "cell_type": "code", - "execution_count": 32, + "execution_count": 18, "metadata": {}, "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Download from the Incubator: https://files.webis.de/data-in-production/data-research/tira-zenodo-dump-preparation/query-processors-in-progress/qpptk-all-predictors-clef-labs.zip\n", + "\tThis is only used for last spot checks before archival to Zenodo.\n" + ] + }, + { + "name": "stderr", + "output_type": "stream", + "text": [ + "Download: 100%|██████████| 969k/969k [00:00<00:00, 6.39MiB/s]\n" + ] + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Download finished. Extract...\n", + "Extraction finished: /root/.tira/extracted_runs/ir-benchmarks/argsme-touche-2021-task-1-20230209-training/qpptk\n" + ] + }, + { + "name": "stderr", + "output_type": "stream", + "text": [ + "Download: 1.03MiB [00:00, 1.16MiB/s]\n", + "[INFO] [starting] opening zip file\n" + ] + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Download finished. 
Extract...\n", + "Extraction finished: /root/.tira/extracted_runs/ir-benchmarks/argsme-touche-2021-task-1-20230209-training/tira-ir-starter\n" + ] + }, { "name": "stderr", "output_type": "stream", "text": [ - "[INFO] [starting] opening zip file\n", "[INFO] [starting] https://zenodo.org/record/6798216/files/topics-task-1-only-titles-2021.zip\n", - "[INFO] [finished] https://zenodo.org/record/6798216/files/topics-task-1-only-titles-2021.zip: [00:00] [1.35kB] [2.88MB/s]\n", - "[INFO] [finished] opening zip file [633ms] \n", + "[INFO] [finished] https://zenodo.org/record/6798216/files/topics-task-1-only-titles-2021.zip: [00:00] [1.35kB] [4.40MB/s]\n", + "[INFO] [finished] opening zip file [344ms] \n", "[INFO] [starting] https://zenodo.org/record/6798216/files/touche-task1-51-100-relevance.qrels\n", - "[INFO] [finished] https://zenodo.org/record/6798216/files/touche-task1-51-100-relevance.qrels: [00:00] [100kB] [1.43MB/s]\n", + "[INFO] [finished] https://zenodo.org/record/6798216/files/touche-task1-51-100-relevance.qrels: [00:00] [100kB] [1.02MB/s]\n", "[INFO] [starting] https://zenodo.org/record/6798216/files/touche-task1-51-100-quality.qrels \n", - "[INFO] [finished] https://zenodo.org/record/6798216/files/touche-task1-51-100-quality.qrels: [00:00] [99.7kB] [1.46MB/s]\n", - " \r" + "[INFO] [finished] https://zenodo.org/record/6798216/files/touche-task1-51-100-quality.qrels: [00:00] [99.7kB] [828kB/s]\n", + " \r" ] }, { @@ -912,14 +921,14 @@ "0 max-idf -0.201394 -0.153051 -0.202617" ] }, - "execution_count": 32, + "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dataset = pt.get_dataset(\"irds:argsme/2020-04-01/touche-2021-task-1\")\n", - "qpp = tira.pt.transform_queries('ir-benchmarks/qpptk/all', dataset)\n", + "qpp = tira.pt.transform_queries('ir-benchmarks/qpptk/all-predictors', dataset)\n", "bm25 = tira.pt.from_submission('ir-benchmarks/tira-ir-starter/BM25 Re-Rank (tira-ir-starter-pyterrier)', dataset)\n", "\n", 
"qpp_correlation_to_ground_truth(bm25, qpp, dataset, ['ndcg_cut_10'])"