From e92261ab5a64ee8fc754bec5a2882ab83bcd0384 Mon Sep 17 00:00:00 2001 From: oborchers Date: Sun, 10 Apr 2022 21:42:55 +0200 Subject: [PATCH] Removed autogenerated docs --- README.md | 1 - docs/MODULES.md | 17 -- docs/README.md | 318 --------------------- docs/fse/index.md | 27 -- docs/fse/inputs.md | 438 ----------------------------- docs/fse/models/average.md | 119 -------- docs/fse/models/base_s2v.md | 279 ------------------ docs/fse/models/index.md | 12 - docs/fse/models/sentencevectors.md | 350 ----------------------- docs/fse/models/sif.md | 28 -- docs/fse/models/usif.md | 28 -- docs/fse/models/utils.md | 89 ------ docs/fse/vectors.md | 73 ----- release.sh | 4 +- 14 files changed, 1 insertion(+), 1782 deletions(-) delete mode 100644 docs/MODULES.md delete mode 100644 docs/README.md delete mode 100644 docs/fse/index.md delete mode 100644 docs/fse/inputs.md delete mode 100644 docs/fse/models/average.md delete mode 100644 docs/fse/models/base_s2v.md delete mode 100644 docs/fse/models/index.md delete mode 100644 docs/fse/models/sentencevectors.md delete mode 100644 docs/fse/models/sif.md delete mode 100644 docs/fse/models/usif.md delete mode 100644 docs/fse/models/utils.md delete mode 100644 docs/fse/vectors.md diff --git a/README.md b/README.md index bb55b3c..6513972 100644 --- a/README.md +++ b/README.md @@ -247,7 +247,6 @@ Changelog 1.0.0: - Added support for gensim>=4. This library is no longer compatible with gensim<4. For migration, see the [README](https://github.com/RaRe-Technologies/gensim/wiki/Migrating-from-Gensim-3.x-to-4). - `size` argument is now `vector_size` -- Added docs 0.2.0: - Added `Vectors` and `FTVectors` class and hub support by `from_pretrained` diff --git a/docs/MODULES.md b/docs/MODULES.md deleted file mode 100644 index 2ee9ca5..0000000 --- a/docs/MODULES.md +++ /dev/null @@ -1,17 +0,0 @@ -# Fast_sentence_embeddings Modules - -> Auto-generated documentation modules index. 
- -Full list of [Fast_sentence_embeddings](README.md#fast_sentence_embeddings-index) project modules. - -- [Fast_sentence_embeddings Index](README.md#fast_sentence_embeddings-index) -- [Fse](fse/index.md#fse) - - [Inputs](fse/inputs.md#inputs) - - [Models](fse/models/index.md#models) - - [Average](fse/models/average.md#average) - - [Base S2v](fse/models/base_s2v.md#base-s2v) - - [SentenceVectors](fse/models/sentencevectors.md#sentencevectors) - - [SIF](fse/models/sif.md#sif) - - [uSIF](fse/models/usif.md#usif) - - [Utils](fse/models/utils.md#utils) - - [Vectors](fse/vectors.md#vectors) diff --git a/docs/README.md b/docs/README.md deleted file mode 100644 index 6c30ba4..0000000 --- a/docs/README.md +++ /dev/null @@ -1,318 +0,0 @@ -# Fast_sentence_embeddings Index - -> Auto-generated documentation index. - -

-<!-- badge residue (images stripped): Build Status, Coverage Status, Downloads, Language grade: Python, Code style: black, License: GPL3 -->
-<!-- logo: fse -->
-
-Fast Sentence Embeddings
-==================================
-
-Fast Sentence Embeddings is a Python library that serves as an addition to Gensim. This library is intended to compute *sentence vectors* for large collections of sentences or documents with as little hassle as possible:
-
-```
-from fse import Vectors, Average, IndexedList
-
-vecs = Vectors.from_pretrained("glove-wiki-gigaword-50")
-model = Average(vecs)
-
-sentences = [["cat", "say", "meow"], ["dog", "say", "woof"]]
-
-model.train(IndexedList(sentences))
-
-model.sv.similarity(0,1)
-```
-
-If you want to support fse, take a quick [survey](https://forms.gle/8uSU323fWUVtVwcAA) to improve it.
-
-Audience
-------------
-
-This package builds upon Gensim and is intended to compute sentence/paragraph vectors for large databases. Use this package if:
-- (Sentence) Transformers are too slow
-- Your dataset is too large for existing solutions (spaCy)
-- Using GPUs is not an option.
-
-The average (online) inference time for a well-optimized (and batched) sentence-transformer is around 1ms-10ms per sentence. If that is not enough and you are willing to sacrifice a bit of quality, this is your package.
-
-Features
-------------
-
-Find the corresponding blog post(s) here (code may be outdated):
-
-Full Fast_sentence_embeddings project documentation can be found in [Modules](MODULES.md#fast_sentence_embeddings-modules)
-
-- [Visualizing 100,000 Amazon Products](https://towardsdatascience.com/vis-amz-83dea6fcb059)
-- [Sentence Embeddings. Fast, please!](https://towardsdatascience.com/fse-2b1ffa791cf9)
-  - [Fast_sentence_embeddings Modules](MODULES.md#fast_sentence_embeddings-modules)
-
-**fse** implements three algorithms for sentence embeddings. You can choose
-between *unweighted sentence averages*, *smooth inverse frequency averages*, and *unsupervised smooth inverse frequency averages*.
-
-Key features of **fse** are:
-
-**[X]** Up to 500,000 sentences / second (1)
-
-**[X]** Provides HUB access to various pre-trained models for convenience
-
-**[X]** Supports Average, SIF, and uSIF Embeddings
-
-**[X]** Full support for Gensim's Word2Vec and all other compatible classes
-
-**[X]** Full support for Gensim's FastText with out-of-vocabulary words
-
-**[X]** Induction of word frequencies for pre-trained embeddings
-
-**[X]** Incredibly fast Cython core routines
-
-**[X]** Dedicated input file formats for easy usage (including disk streaming)
-
-**[X]** RAM-to-disk training for large corpora
-
-**[X]** Disk-to-disk training for even larger corpora
-
-**[X]** Many fail-safe checks for easy usage
-
-**[X]** Simple interface for developing your own models
-
-**[X]** Extensive documentation of all functions
-
-**[X]** Optimized input classes
-
-(1) May vary significantly from system to system (e.g. when swap memory is used) and with preprocessing.
-I regularly observe 300k-500k sentences/s for preprocessed data on my MacBook (2016).
-Visit **Tutorial.ipynb** for an example.
-
-Installation
-------------
-
-This software depends on NumPy, SciPy, scikit-learn, Gensim, and Wordfreq.
-You must have them installed prior to installing fse.
-
-As with Gensim, it is also recommended that you install a BLAS library before installing fse.
-
-The simplest way to install **fse** is:
-
-    pip install -U fse
-
-In case you want to build from source, just run:
-
-    python setup.py install
-
-If building the Cython extension fails (you will be notified), try:
-
-    pip install -U git+https://github.com/oborchers/Fast_Sentence_Embeddings
-
-Usage
-------------
-
-Using pre-trained models with **fse** is easy. Simply download them from the hub;
-they will be stored locally so you can re-use them later.
-
-```
-from fse import Vectors, Average, IndexedList
-vecs = Vectors.from_pretrained("glove-wiki-gigaword-50")
-model = Average(vecs)
-
-sentences = [["cat", "say", "meow"], ["dog", "say", "woof"]]
-
-model.train(IndexedList(sentences))
-
-model.sv.similarity(0,1)
-```
-
-If your vectors are large and you don't have a lot of RAM, you can supply the `mmap` argument as follows to read the vectors from disk instead of loading them into RAM:
-
-```
-Vectors.from_pretrained("glove-wiki-gigaword-50", mmap="r")
-```
-
-To check which vectors are on the hub, please check: https://huggingface.co/fse. For example, you will find:
-- glove-twitter-25
-- glove-twitter-50
-- glove-twitter-100
-- glove-twitter-200
-- glove-wiki-gigaword-100
-- glove-wiki-gigaword-300
-- word2vec-google-news-300
-- paragram-25
-- paranmt-300
-- paragram-300-sl999
-- paragram-300-ws353
-- fasttext-wiki-news-subwords-300
-- fasttext-crawl-subwords-300 (Use with `FTVectors`)
-
-In order to use **fse** with a custom model you must first train a Gensim model which contains a
-gensim.models.keyedvectors.BaseKeyedVectors class, for example *Word2Vec* or *FastText*. Then you can proceed to compute sentence embeddings for a corpus as follows:
-
-```
-from gensim.models import FastText
-sentences = [["cat", "say", "meow"], ["dog", "say", "woof"]]
-ft = FastText(sentences, min_count=1, vector_size=10)
-
-from fse import Average, IndexedList
-model = Average(ft)
-model.train(IndexedList(sentences))
-
-model.sv.similarity(0,1)
-```
-
-fse offers multi-thread support out of the box. However, for most applications a *single thread will most likely be sufficient*.
-
-Additional Information
--------------
-
-Within the folder notebooks you can find the following guides:
-
-**Tutorial.ipynb** offers a detailed walk-through of some of the most important functions fse has to offer.
-
-**STS-Benchmarks.ipynb** contains an example of how to use the library with pre-trained models to
-replicate the STS Benchmark results [4] reported in the papers.
-
-**Speed Comparison.ipynb** compares the speed between the NumPy and the Cython routines.
-
-In order to use the **fse** model, you first need a pre-trained Gensim
-word embedding model, which is then used by **fse** to compute the sentence embeddings.
-
-After computing sentence embeddings, you can use them in supervised or
-unsupervised NLP applications, as they serve as a formidable baseline.
-
-The models presented are based on
-- Deep-averaging embeddings [1]
-- Smooth inverse frequency embeddings [2]
-- Unsupervised smooth inverse frequency embeddings [3]
-
-Credits to Radim Řehůřek and all contributors for the **awesome** library
-and code that [Gensim](https://github.com/RaRe-Technologies/gensim) provides. A whole lot of the code found in this lib is based on Gensim.
-
-To install **fse** on Colab, check out: https://colab.research.google.com/drive/1qq9GBgEosG7YSRn7r6e02T9snJb04OEi
-
-Results
-------------
-
-Model | Vectors | params | [STS Benchmark](http://ixa2.si.ehu.es/stswiki/index.php/STSbenchmark#Results)
-:---: | :---: | :---: | :---:
-`CBOW` | `paranmt-300` | | 79.82
-`uSIF` | `paranmt-300` | length=11 | 79.00
-`SIF-10` | `paranmt-300` | components=10 | 76.72
-`SIF-10` | `paragram-300-sl999` | components=10 | 74.21
-`SIF-10` | `paragram-300-ws353` | components=10 | 74.03
-`SIF-10` | `fasttext-crawl-subwords-300` | components=10 | 73.38
-`uSIF` | `paragram-300-sl999` | length=11 | 73.04
-`SIF-10` | `fasttext-wiki-news-subwords-300` | components=10 | 72.29
-`uSIF` | `paragram-300-ws353` | length=11 | 71.84
-`SIF-10` | `glove-twitter-200` | components=10 | 71.62
-`SIF-10` | `glove-wiki-gigaword-300` | components=10 | 71.35
-`SIF-10` | `word2vec-google-news-300` | components=10 | 71.12
-`SIF-10` | `glove-wiki-gigaword-200` | components=10 | 70.62
-`SIF-10` | `glove-twitter-100` | components=10 | 69.65
-`uSIF` | `fasttext-crawl-subwords-300` | length=11 | 69.40
-`uSIF` | `fasttext-wiki-news-subwords-300` | length=11 | 68.63
-`SIF-10` | `glove-wiki-gigaword-100` | components=10 | 68.34
-`uSIF` | `glove-wiki-gigaword-300` | length=11 | 67.60
-`uSIF` | `glove-wiki-gigaword-200` | length=11 | 67.11
-`uSIF` | `word2vec-google-news-300` | length=11 | 66.99
-`uSIF` | `glove-twitter-200` | length=11 | 66.67
-`SIF-10` | `glove-twitter-50` | components=10 | 65.52
-`uSIF` | `glove-wiki-gigaword-100` | length=11 | 65.33
-`uSIF` | `paragram-25` | length=11 | 64.22
-`uSIF` | `glove-twitter-100` | length=11 | 64.13
-`SIF-10` | `glove-wiki-gigaword-50` | components=10 | 64.11
-`uSIF` | `glove-wiki-gigaword-50` | length=11 | 62.06
-`CBOW` | `word2vec-google-news-300` | | 61.54
-`uSIF` | `glove-twitter-50` | length=11 | 60.41
-`SIF-10` | `paragram-25` | components=10 | 59.07
-`uSIF` | `glove-twitter-25` | length=11 | 55.06
-`CBOW` | `paragram-300-ws353` | | 54.72
-`SIF-10` | `glove-twitter-25` | components=10 | 54.16
-`CBOW` | `paragram-300-sl999` | | 51.46
-`CBOW` | `fasttext-crawl-subwords-300` | | 48.49
-`CBOW` | `glove-wiki-gigaword-300` | | 44.46
-`CBOW` | `glove-wiki-gigaword-200` | | 42.40
-`CBOW` | `paragram-25` | | 40.13
-`CBOW` | `glove-wiki-gigaword-100` | | 38.12
-`CBOW` | `glove-wiki-gigaword-50` | | 37.47
-`CBOW` | `glove-twitter-200` | | 34.94
-`CBOW` | `glove-twitter-100` | | 33.81
-`CBOW` | `glove-twitter-50` | | 30.78
-`CBOW` | `glove-twitter-25` | | 26.15
-`CBOW` | `fasttext-wiki-news-subwords-300` | | 26.08
-
-Changelog
--------------
-
-1.0.0:
-- Added support for gensim>=4. This library is no longer compatible with gensim<4. For migration, see the [README](https://github.com/RaRe-Technologies/gensim/wiki/Migrating-from-Gensim-3.x-to-4).
-- `size` argument is now `vector_size`
-- Added docs
-
-0.2.0:
-- Added `Vectors` and `FTVectors` class and hub support by `from_pretrained`
-- Extended benchmark
-- Fixed zero division bug for uSIF
-- Moved tests out of the main folder
-- Moved sts out of the main folder
-
-0.1.17:
-- Fixed a dependency issue that prevented fse from installing properly
-- Updated readme
-- Updated travis python versions (3.6, 3.9)
-
-0.1.15 from 0.1.11:
-- Fixed major FT Ngram computation bug
-- Rewrote the input class. Turns out NamedTuple was pretty slow.
-- Added further unittests
-- Added documentation
-- Major speed improvements
-- Fixed division by zero for empty sentences
-- Fixed overflow when infer method is used with too many sentences
-- Fixed similar_by_sentence bug
-
-Literature
--------------
-
-1. Iyyer M, Manjunatha V, Boyd-Graber J, Daumé III H (2015) Deep Unordered
-Composition Rivals Syntactic Methods for Text Classification. Proc. 53rd Annu.
-Meet. Assoc. Comput. Linguist. 7th Int. Jt. Conf. Nat. Lang. Process., 1681–1691.
-
-2. Arora S, Liang Y, Ma T (2017) A Simple but Tough-to-Beat Baseline for Sentence
-Embeddings. Int. Conf. Learn. Represent. (Toulon, France), 1–16.
-
-3. Ethayarajh K (2018) Unsupervised Random Walk Sentence Embeddings: A Strong but Simple Baseline.
-Proceedings of the 3rd Workshop on Representation Learning for NLP. (Toulon, France), 91–100.
-
-4. Eneko Agirre, Daniel Cer, Mona Diab, Iñigo Lopez-Gazpio, Lucia Specia. Semeval-2017 Task 1: Semantic Textual Similarity Multilingual and Crosslingual Focused Evaluation. Proceedings of SemEval 2017.
-
-Copyright
--------------
-
-**Disclaimer**: I am working full time. Unfortunately, I have yet to find time to add all the features I'd like to see. Especially the API needs some overhaul.
-
-I am looking for active contributors to keep this package alive. Please feel free to ping me if you are interested.
- -Author: Oliver Borchers - -Copyright (C) 2022 Oliver Borchers - -Citation -------------- - -If you found this software useful, please cite it in your publication. - - @misc{Borchers2019, - author = {Borchers, Oliver}, - title = {Fast sentence embeddings}, - year = {2019}, - publisher = {GitHub}, - journal = {GitHub Repository}, - howpublished = {\url{https://github.com/oborchers/Fast_Sentence_Embeddings}}, - } diff --git a/docs/fse/index.md b/docs/fse/index.md deleted file mode 100644 index 6fe89a1..0000000 --- a/docs/fse/index.md +++ /dev/null @@ -1,27 +0,0 @@ -# Fse - -> Auto-generated documentation for [fse](../../fse/__init__.py) module. - -- [Fast_sentence_embeddings](../README.md#fast_sentence_embeddings-index) / [Modules](../MODULES.md#fast_sentence_embeddings-modules) / Fse - - [NullHandler](#nullhandler) - - [NullHandler().emit](#nullhandleremit) - - Modules - - [Inputs](inputs.md#inputs) - - [Models](models/index.md#models) - - [Vectors](vectors.md#vectors) - -## NullHandler - -[[find in source code]](../../fse/__init__.py#L19) - -```python -class NullHandler(logging.Handler): -``` - -### NullHandler().emit - -[[find in source code]](../../fse/__init__.py#L20) - -```python -def emit(record): -``` diff --git a/docs/fse/inputs.md b/docs/fse/inputs.md deleted file mode 100644 index 1e7a29e..0000000 --- a/docs/fse/inputs.md +++ /dev/null @@ -1,438 +0,0 @@ -# Inputs - -> Auto-generated documentation for [fse.inputs](../../fse/inputs.py) module. 
- -- [Fast_sentence_embeddings](../README.md#fast_sentence_embeddings-index) / [Modules](../MODULES.md#fast_sentence_embeddings-modules) / [Fse](index.md#fse) / Inputs - - [BaseIndexedList](#baseindexedlist) - - [BaseIndexedList().\_\_delitem\_\_](#baseindexedlist__delitem__) - - [BaseIndexedList().\_\_getitem\_\_](#baseindexedlist__getitem__) - - [BaseIndexedList().\_\_len\_\_](#baseindexedlist__len__) - - [BaseIndexedList().\_\_setitem\_\_](#baseindexedlist__setitem__) - - [BaseIndexedList().\_\_str\_\_](#baseindexedlist__str__) - - [BaseIndexedList().append](#baseindexedlistappend) - - [BaseIndexedList().extend](#baseindexedlistextend) - - [BaseIndexedList().insert](#baseindexedlistinsert) - - [CIndexedList](#cindexedlist) - - [CIndexedList().\_\_getitem\_\_](#cindexedlist__getitem__) - - [CIndexedList().append](#cindexedlistappend) - - [CIndexedList().extend](#cindexedlistextend) - - [CIndexedList().insert](#cindexedlistinsert) - - [CSplitCIndexedList](#csplitcindexedlist) - - [CSplitCIndexedList().\_\_getitem\_\_](#csplitcindexedlist__getitem__) - - [CSplitCIndexedList().append](#csplitcindexedlistappend) - - [CSplitCIndexedList().extend](#csplitcindexedlistextend) - - [CSplitCIndexedList().insert](#csplitcindexedlistinsert) - - [CSplitIndexedList](#csplitindexedlist) - - [CSplitIndexedList().\_\_getitem\_\_](#csplitindexedlist__getitem__) - - [IndexedLineDocument](#indexedlinedocument) - - [IndexedLineDocument().\_\_getitem\_\_](#indexedlinedocument__getitem__) - - [IndexedLineDocument().\_\_iter\_\_](#indexedlinedocument__iter__) - - [IndexedList](#indexedlist) - - [IndexedList().\_\_getitem\_\_](#indexedlist__getitem__) - - [SplitCIndexedList](#splitcindexedlist) - - [SplitCIndexedList().\_\_getitem\_\_](#splitcindexedlist__getitem__) - - [SplitCIndexedList().append](#splitcindexedlistappend) - - [SplitCIndexedList().extend](#splitcindexedlistextend) - - [SplitCIndexedList().insert](#splitcindexedlistinsert) - - [SplitIndexedList](#splitindexedlist) - - 
[SplitIndexedList().\_\_getitem\_\_](#splitindexedlist__getitem__)
-
-## BaseIndexedList
-
-[[find in source code]](../../fse/inputs.py#L15)
-
-```python
-class BaseIndexedList(MutableSequence):
-    def __init__(*args: List[Union[list, set, ndarray]]):
-```
-
-### BaseIndexedList().\_\_delitem\_\_
-
-[[find in source code]](../../fse/inputs.py#L81)
-
-```python
-def __delitem__(i: int):
-```
-
-Delete an item.
-
-### BaseIndexedList().\_\_getitem\_\_
-
-[[find in source code]](../../fse/inputs.py#L71)
-
-```python
-def __getitem__(i: int) -> tuple:
-```
-
-Getitem method.
-
-Returns
--------
-tuple ([str], int)
-    Returns the core object, a tuple, for every sentence embedding model.
-
-### BaseIndexedList().\_\_len\_\_
-
-[[find in source code]](../../fse/inputs.py#L51)
-
-```python
-def __len__():
-```
-
-List length.
-
-Returns
--------
-int
-    Length of the IndexedList
-
-### BaseIndexedList().\_\_setitem\_\_
-
-[[find in source code]](../../fse/inputs.py#L85)
-
-```python
-def __setitem__(i: int, item: str):
-```
-
-Sets an item.
-
-### BaseIndexedList().\_\_str\_\_
-
-[[find in source code]](../../fse/inputs.py#L61)
-
-```python
-def __str__():
-```
-
-Human-readable representation of the object's state, used for debugging.
-
-Returns
--------
-str
-    Human-readable representation of the object's state (words and tags).
-
-### BaseIndexedList().append
-
-[[find in source code]](../../fse/inputs.py#L95)
-
-```python
-def append(item: str):
-```
-
-Appends item at last position.
-
-### BaseIndexedList().extend
-
-[[find in source code]](../../fse/inputs.py#L100)
-
-```python
-def extend(arg: Union[list, set, ndarray]):
-```
-
-Extends the list.
-
-### BaseIndexedList().insert
-
-[[find in source code]](../../fse/inputs.py#L90)
-
-```python
-def insert(i: int, item: str):
-```
-
-Inserts an item at a position.
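The `(words, index)` tuple contract documented above can be illustrated with a minimal, self-contained sketch. `TinyIndexedList` below is a hypothetical stand-in written only for this example (it is not fse's implementation); it mimics how `BaseIndexedList` subclasses return the tokenized sentence together with its row index.

```python
from typing import List, Tuple

# Hypothetical, simplified sketch of the contract described above: indexing
# returns a (words, index) tuple per sentence. Names are illustrative only.
class TinyIndexedList:
    def __init__(self, sentences: List[List[str]]):
        self.items = list(sentences)

    def __len__(self) -> int:
        return len(self.items)

    def __getitem__(self, i: int) -> Tuple[List[str], int]:
        # The core object consumed by the sentence-embedding models:
        # the tokenized sentence plus the row it maps to in sv.vectors.
        return (self.items[i], i)

sentences = [["cat", "say", "meow"], ["dog", "say", "woof"]]
indexed = TinyIndexedList(sentences)
```

fse's real classes add type checks, `append`/`extend`/`insert` mutation, and custom splitting on top of this core indexing behavior.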
- -## CIndexedList - -[[find in source code]](../../fse/inputs.py#L133) - -```python -class CIndexedList(BaseIndexedList): - def __init__( - custom_index: Union[list, ndarray], - *args: Union[list, set, ndarray], - ): -``` - -#### See also - -- [BaseIndexedList](#baseindexedlist) - -### CIndexedList().\_\_getitem\_\_ - -[[find in source code]](../../fse/inputs.py#L156) - -```python -def __getitem__(i: int) -> tuple: -``` - -Getitem method. - -Returns -------- -tuple - Returns the core object, tuple, for every sentence embedding model. - -### CIndexedList().append - -[[find in source code]](../../fse/inputs.py#L175) - -```python -def append(item: str): -``` - -### CIndexedList().extend - -[[find in source code]](../../fse/inputs.py#L178) - -```python -def extend(arg: Union[list, set, ndarray]): -``` - -### CIndexedList().insert - -[[find in source code]](../../fse/inputs.py#L172) - -```python -def insert(i: int, item: str): -``` - -## CSplitCIndexedList - -[[find in source code]](../../fse/inputs.py#L280) - -```python -class CSplitCIndexedList(BaseIndexedList): - def __init__( - custom_split: callable, - custom_index: Union[list, ndarray], - *args: Union[list, set, ndarray], - ): -``` - -#### See also - -- [BaseIndexedList](#baseindexedlist) - -### CSplitCIndexedList().\_\_getitem\_\_ - -[[find in source code]](../../fse/inputs.py#L309) - -```python -def __getitem__(i: int) -> tuple: -``` - -Getitem method. - -Returns -------- -tuple - Returns the core object, tuple, for every sentence embedding model. 
- -### CSplitCIndexedList().append - -[[find in source code]](../../fse/inputs.py#L328) - -```python -def append(item: str): -``` - -### CSplitCIndexedList().extend - -[[find in source code]](../../fse/inputs.py#L331) - -```python -def extend(arg: Union[list, set, ndarray]): -``` - -### CSplitCIndexedList().insert - -[[find in source code]](../../fse/inputs.py#L325) - -```python -def insert(i: int, item: str): -``` - -## CSplitIndexedList - -[[find in source code]](../../fse/inputs.py#L254) - -```python -class CSplitIndexedList(BaseIndexedList): - def __init__(custom_split: callable, *args: Union[list, set, ndarray]): -``` - -#### See also - -- [BaseIndexedList](#baseindexedlist) - -### CSplitIndexedList().\_\_getitem\_\_ - -[[find in source code]](../../fse/inputs.py#L269) - -```python -def __getitem__(i: int) -> tuple: -``` - -Getitem method. - -Returns -------- -tuple - Returns the core object, tuple, for every sentence embedding model. - -## IndexedLineDocument - -[[find in source code]](../../fse/inputs.py#L335) - -```python -class IndexedLineDocument(object): - def __init__(path, get_able=True): -``` - -### IndexedLineDocument().\_\_getitem\_\_ - -[[find in source code]](../../fse/inputs.py#L367) - -```python -def __getitem__(i): -``` - -Returns the line indexed by i. Primarily used for. - -:meth:`~fse.models.sentencevectors.SentenceVectors.most_similar` - -Parameters ----------- -i : int - The line index used to index the file - -Returns -------- -str - line at the current index - -### IndexedLineDocument().\_\_iter\_\_ - -[[find in source code]](../../fse/inputs.py#L393) - -```python -def __iter__(): -``` - -Iterate through the lines in the source. 
- -Yields ------- -tuple : (list[str], int) - Tuple of list of string and index - -## IndexedList - -[[find in source code]](../../fse/inputs.py#L110) - -```python -class IndexedList(BaseIndexedList): - def __init__(*args: Union[list, set, ndarray]): -``` - -#### See also - -- [BaseIndexedList](#baseindexedlist) - -### IndexedList().\_\_getitem\_\_ - -[[find in source code]](../../fse/inputs.py#L122) - -```python -def __getitem__(i: int) -> tuple: -``` - -Getitem method. - -Returns -------- -tuple - Returns the core object, tuple, for every sentence embedding model. - -## SplitCIndexedList - -[[find in source code]](../../fse/inputs.py#L205) - -```python -class SplitCIndexedList(BaseIndexedList): - def __init__( - custom_index: Union[list, ndarray], - *args: Union[list, set, ndarray], - ): -``` - -#### See also - -- [BaseIndexedList](#baseindexedlist) - -### SplitCIndexedList().\_\_getitem\_\_ - -[[find in source code]](../../fse/inputs.py#L228) - -```python -def __getitem__(i: int) -> tuple: -``` - -Getitem method. - -Returns -------- -tuple - Returns the core object, tuple, for every sentence embedding model. - -### SplitCIndexedList().append - -[[find in source code]](../../fse/inputs.py#L247) - -```python -def append(item: str): -``` - -### SplitCIndexedList().extend - -[[find in source code]](../../fse/inputs.py#L250) - -```python -def extend(arg: Union[list, set, ndarray]): -``` - -### SplitCIndexedList().insert - -[[find in source code]](../../fse/inputs.py#L244) - -```python -def insert(i: int, item: str): -``` - -## SplitIndexedList - -[[find in source code]](../../fse/inputs.py#L182) - -```python -class SplitIndexedList(BaseIndexedList): - def __init__(*args: Union[list, set, ndarray]): -``` - -#### See also - -- [BaseIndexedList](#baseindexedlist) - -### SplitIndexedList().\_\_getitem\_\_ - -[[find in source code]](../../fse/inputs.py#L194) - -```python -def __getitem__(i: int) -> tuple: -``` - -Getitem method. 
-
-Returns
--------
-tuple
-    Returns the core object, tuple, for every sentence embedding model.
diff --git a/docs/fse/models/average.md b/docs/fse/models/average.md
deleted file mode 100644
index 2276fa2..0000000
--- a/docs/fse/models/average.md
+++ /dev/null
@@ -1,119 +0,0 @@
-# Average
-
-> Auto-generated documentation for [fse.models.average](../../../fse/models/average.py) module.
-
-This module implements the base class to compute average representations for sentences, using highly optimized C routines,
-data streaming and Pythonic interfaces.
-
-- [Fast_sentence_embeddings](../../README.md#fast_sentence_embeddings-index) / [Modules](../../MODULES.md#fast_sentence_embeddings-modules) / [Fse](../index.md#fse) / [Models](index.md#models) / Average
-    - [Average](#average)
-    - [train_average_np](#train_average_np)
-
-The implementation is based on Iyyer et al. (2015): Deep Unordered Composition Rivals Syntactic Methods for Text Classification.
-For more information, see .
-
-The training algorithm is based on the Gensim implementation of Word2Vec, FastText, and Doc2Vec.
-For more information, see: class `gensim.models.word2vec.Word2Vec`, class `gensim.models.fasttext.FastText`, or
-class `gensim.models.doc2vec.Doc2Vec`.
-
-Initialize and train a `fse.models.sentence2vec.Sentence2Vec` model:
-
-```python
->>> from gensim.models.word2vec import Word2Vec
->>> sentences = [["cat", "say", "meow"], ["dog", "say", "woof"]]
->>> model = Word2Vec(sentences, min_count=1, vector_size=20)
-```
-
-```python
->>> from fse.models.average import Average
->>> avg = Average(model)
->>> avg.train([(s, i) for i, s in enumerate(sentences)])
->>> avg.sv.vectors.shape
-(2, 20)
-```
-
-## Average
-
-[[find in source code]](../../../fse/models/average.py#L187)
-
-```python
-class Average(BaseSentence2VecModel):
-    def __init__(
-        model: KeyedVectors,
-        sv_mapfile_path: str = None,
-        wv_mapfile_path: str = None,
-        workers: int = 1,
-        lang_freq: str = None,
-        **kwargs,
-    ):
-```
-
-Train, use and evaluate averaged sentence vectors.
-
-The model can be stored/loaded via its :meth:`~fse.models.average.Average.save` and
-:meth:`~fse.models.average.Average.load` methods.
-
-Some important attributes are the following:
-
-Attributes
-----------
-wv : class `gensim.models.keyedvectors.KeyedVectors`
-    This object essentially contains the mapping between words and embeddings. After training, it can be used
-    directly to query those embeddings in various ways. See the module level docstring for examples.
-
-sv : class `fse.models.sentencevectors.SentenceVectors`
-    This object contains the sentence vectors inferred from the training data. There will be one such vector
-    for each unique sentence supplied during training. They may be individually accessed using the index.
-
-prep : class `fse.models.base_s2v.BaseSentence2VecPreparer`
-    The prep object is used to transform and initialize the sv.vectors. Additionally, it can be used
-    to move the vectors to disk for training with memmap.
-
-#### See also
-
-- [BaseSentence2VecModel](base_s2v.md#basesentence2vecmodel)
-
-## train_average_np
-
-[[find in source code]](../../../fse/models/average.py#L56)
-
-```python
-def train_average_np(
-    model: BaseSentence2VecModel,
-    indexed_sentences: List[tuple],
-    target: ndarray,
-    memory: ndarray,
-) -> Tuple[int, int]:
-```
-
-Train on a sequence of sentences and update the target ndarray.
-
-Called internally from :meth:`~fse.models.average.Average._do_train_job`.
-
-Warnings
---------
-This is the non-optimized, pure Python version. If you have a C compiler,
-fse will use an optimized code path from :mod:`fse.models.average_inner` instead.
-
-Parameters
-----------
-model : class `fse.models.base_s2v.BaseSentence2VecModel`
-    The BaseSentence2VecModel model instance.
-indexed_sentences : iterable of tuple
-    The sentences used to train the model.
-target : ndarray
-    The target ndarray. We use the index from indexed_sentences
-    to write into the corresponding row of target.
-memory : ndarray
-    Private memory for each working thread
-
-Returns
--------
-int, int
-    Number of effective sentences (non-zero) and effective words in the vocabulary used
-    during training of the sentence embedding.
-
-#### See also
-
-- [BaseSentence2VecModel](base_s2v.md#basesentence2vecmodel)
diff --git a/docs/fse/models/base_s2v.md b/docs/fse/models/base_s2v.md
deleted file mode 100644
index f868bf9..0000000
--- a/docs/fse/models/base_s2v.md
+++ /dev/null
@@ -1,279 +0,0 @@
-# Base S2v
-
-> Auto-generated documentation for [fse.models.base_s2v](../../../fse/models/base_s2v.py) module.
-
-Base class containing common methods for training, using & evaluating sentence embeddings.
-A lot of the code is based on Gensim. I have to thank Radim Rehurek and the whole team
-for the outstanding library which I used for a lot of my research.
-
-- [Fast_sentence_embeddings](../../README.md#fast_sentence_embeddings-index) / [Modules](../../MODULES.md#fast_sentence_embeddings-modules) / [Fse](../index.md#fse) / [Models](index.md#models) / Base S2v
-    - [BaseSentence2VecModel](#basesentence2vecmodel)
-        - [BaseSentence2VecModel().\_\_str\_\_](#basesentence2vecmodel__str__)
-        - [BaseSentence2VecModel().estimate_memory](#basesentence2vecmodelestimate_memory)
-        - [BaseSentence2VecModel().infer](#basesentence2vecmodelinfer)
-        - [BaseSentence2VecModel.load](#basesentence2vecmodelload)
-        - [BaseSentence2VecModel().save](#basesentence2vecmodelsave)
-        - [BaseSentence2VecModel().scan_sentences](#basesentence2vecmodelscan_sentences)
-        - [BaseSentence2VecModel().train](#basesentence2vecmodeltrain)
-    - [BaseSentence2VecPreparer](#basesentence2vecpreparer)
-        - [BaseSentence2VecPreparer().prepare_vectors](#basesentence2vecpreparerprepare_vectors)
-        - [BaseSentence2VecPreparer().reset_vectors](#basesentence2vecpreparerreset_vectors)
-        - [BaseSentence2VecPreparer().update_vectors](#basesentence2vecpreparerupdate_vectors)
-
-Attributes
-----------
-wv : class `gensim.models.keyedvectors.KeyedVectors`
-    This object essentially contains the mapping between words and embeddings. After training, it can be used
-    directly to query those embeddings in various ways. See the module level docstring for examples.
-
-sv : class `fse.models.sentencevectors.SentenceVectors`
-    This object contains the sentence vectors inferred from the training data. There will be one such vector
-    for each unique sentence supplied during training. They may be individually accessed using the index.
-
-prep : class `fse.models.base_s2v.BaseSentence2VecPreparer`
-    The prep object is used to transform and initialize the sv.vectors. Additionally, it can be used
-    to move the vectors to disk for training with memmap.
-
-See Also
---------
-class `fse.models.average.Average`.
-    Average sentence model.
-class `fse.models.sif.SIF`.
- Smooth inverse frequency weighted model. -class `fse.models.usif.uSIF`. - Unsupervised Smooth inverse frequency weighted model. - -## BaseSentence2VecModel - -[[find in source code]](../../../fse/models/base_s2v.py#L82) - -```python -class BaseSentence2VecModel(SaveLoad): - def __init__( - model: KeyedVectors, - sv_mapfile_path: str = None, - wv_mapfile_path: str = None, - workers: int = 1, - lang_freq: str = None, - fast_version: int = 0, - batch_words: int = 10000, - batch_ngrams: int = 40, - **kwargs, - ): -``` - -### BaseSentence2VecModel().\_\_str\_\_ - -[[find in source code]](../../../fse/models/base_s2v.py#L163) - -```python -def __str__() -> str: -``` - -Human readable representation of the model's state. - -Returns -------- -str - Human readable representation of the model's state. - -### BaseSentence2VecModel().estimate_memory - -[[find in source code]](../../../fse/models/base_s2v.py#L652) - -```python -def estimate_memory( - max_index: int, - report: dict = None, - **kwargs, -) -> Dict[str, int]: -``` - -Estimate the size of the sentence embedding - -Parameters ----------- -max_index : int - Maximum index found during the initial scan -report : dict - Report of subclasses - -Returns -------- -dict - Dictionary of estimated memory sizes - -### BaseSentence2VecModel().infer - -[[find in source code]](../../../fse/models/base_s2v.py#L768) - -```python -def infer(sentences: List[tuple] = None, use_norm=False) -> ndarray: -``` - -Secondary routine to train an embedding. This method is essential for small batches of sentences, -which require little computation. Note: This method does not apply post-training transformations, -only post inference calls (such as removing principal components). 
-
-Parameters
-----------
-sentences : (list, iterable)
-    An iterable consisting of tuple objects
-use_norm : bool
-    If True, the sentence vectors will be L2-normalized (unit Euclidean length)
-
-Returns
--------
-ndarray
-    Computed sentence vectors
-
-### BaseSentence2VecModel.load
-
-[[find in source code]](../../../fse/models/base_s2v.py#L540)
-
-```python
-@classmethod
-def load(*args, **kwargs):
-```
-
-Load a previously saved class `fse.models.base_s2v.BaseSentence2VecModel`.
-
-Parameters
-----------
-fname : str
-    Path to the saved file.
-
-Returns
--------
-class `fse.models.base_s2v.BaseSentence2VecModel`
-    Loaded model.
-
-### BaseSentence2VecModel().save
-
-[[find in source code]](../../../fse/models/base_s2v.py#L567)
-
-```python
-def save(*args, **kwargs):
-```
-
-Save the model.
-This saved model can be loaded again using :func:`~fse.models.base_s2v.BaseSentence2VecModel.load`
-
-Parameters
-----------
-fname : str
-    Path to the file.
-
-### BaseSentence2VecModel().scan_sentences
-
-[[find in source code]](../../../fse/models/base_s2v.py#L582)
-
-```python
-def scan_sentences(
-    sentences: List[tuple] = None,
-    progress_per: int = 5,
-) -> Dict[str, int]:
-```
-
-Performs an initial scan of the data and reports all corresponding statistics
-
-Parameters
-----------
-sentences : (list, iterable)
-    An iterable consisting of tuple objects
-progress_per : int
-    Number of seconds to pass before reporting the scan progress
-
-Returns
--------
-dict
-    Dictionary containing the scan statistics
-
-### BaseSentence2VecModel().train
-
-[[find in source code]](../../../fse/models/base_s2v.py#L700)
-
-```python
-def train(
-    sentences: List[tuple] = None,
-    update: bool = False,
-    queue_factor: int = 2,
-    report_delay: int = 5,
-) -> Tuple[int, int]:
-```
-
-Main routine to train an embedding. This method writes all sentence vectors into sv.vectors and is
-used for computing embeddings for large chunks of data. 
This method also handles post-training transformations,
-such as computing the SVD of the sentence vectors.
-
-Parameters
-----------
-sentences : (list, iterable)
-    An iterable consisting of tuple objects
-update : bool
-    If True, the sentence vector matrix will be updated in size (even with memmap)
-queue_factor : int
-    Multiplier for size of queue -> size = number of workers * queue_factor.
-report_delay : int
-    Number of seconds between two consecutive progress report messages in the logger.
-
-Returns
--------
-int, int
-    Count of effective sentences and words encountered
-
-## BaseSentence2VecPreparer
-
-[[find in source code]](../../../fse/models/base_s2v.py#L979)
-
-```python
-class BaseSentence2VecPreparer(SaveLoad):
-```
-
-Contains helper functions to prepare the weights for the training of BaseSentence2VecModel
-
-### BaseSentence2VecPreparer().prepare_vectors
-
-[[find in source code]](../../../fse/models/base_s2v.py#L982)
-
-```python
-def prepare_vectors(
-    sv: SentenceVectors,
-    total_sentences: int,
-    update: bool = False,
-):
-```
-
-Build tables and model weights based on final vocabulary settings.
- -#### See also - -- [SentenceVectors](sentencevectors.md#sentencevectors) - -### BaseSentence2VecPreparer().reset_vectors - -[[find in source code]](../../../fse/models/base_s2v.py#L991) - -```python -def reset_vectors(sv: SentenceVectors, total_sentences: int): -``` - -Initialize all sentence vectors to zero and overwrite existing files - -#### See also - -- [SentenceVectors](sentencevectors.md#sentencevectors) - -### BaseSentence2VecPreparer().update_vectors - -[[find in source code]](../../../fse/models/base_s2v.py#L1008) - -```python -def update_vectors(sv: SentenceVectors, total_sentences: int): -``` - -Given existing sentence vectors, append new ones - -#### See also - -- [SentenceVectors](sentencevectors.md#sentencevectors) diff --git a/docs/fse/models/index.md b/docs/fse/models/index.md deleted file mode 100644 index b915fd5..0000000 --- a/docs/fse/models/index.md +++ /dev/null @@ -1,12 +0,0 @@ -# Models - -> Auto-generated documentation for [fse.models](../../../fse/models/__init__.py) module. - -- [Fast_sentence_embeddings](../../README.md#fast_sentence_embeddings-index) / [Modules](../../MODULES.md#fast_sentence_embeddings-modules) / [Fse](../index.md#fse) / Models - - Modules - - [Average](average.md#average) - - [Base S2v](base_s2v.md#base-s2v) - - [SentenceVectors](sentencevectors.md#sentencevectors) - - [SIF](sif.md#sif) - - [uSIF](usif.md#usif) - - [Utils](utils.md#utils) diff --git a/docs/fse/models/sentencevectors.md b/docs/fse/models/sentencevectors.md deleted file mode 100644 index fc41392..0000000 --- a/docs/fse/models/sentencevectors.md +++ /dev/null @@ -1,350 +0,0 @@ -# SentenceVectors - -> Auto-generated documentation for [fse.models.sentencevectors](../../../fse/models/sentencevectors.py) module. 
-
-- [Fast_sentence_embeddings](../../README.md#fast_sentence_embeddings-index) / [Modules](../../MODULES.md#fast_sentence_embeddings-modules) / [Fse](../index.md#fse) / [Models](index.md#models) / SentenceVectors
-    - [SentenceVectors](#sentencevectors)
-        - [SentenceVectors().\_\_getitem\_\_](#sentencevectors__getitem__)
-        - [SentenceVectors().distance](#sentencevectorsdistance)
-        - [SentenceVectors().get_vector](#sentencevectorsget_vector)
-        - [SentenceVectors().init_sims](#sentencevectorsinit_sims)
-        - [SentenceVectors.load](#sentencevectorsload)
-        - [SentenceVectors().most_similar](#sentencevectorsmost_similar)
-        - [SentenceVectors().save](#sentencevectorssave)
-        - [SentenceVectors().similar_by_sentence](#sentencevectorssimilar_by_sentence)
-        - [SentenceVectors().similar_by_vector](#sentencevectorssimilar_by_vector)
-        - [SentenceVectors().similar_by_word](#sentencevectorssimilar_by_word)
-        - [SentenceVectors().similarity](#sentencevectorssimilarity)
-
-## SentenceVectors
-
-[[find in source code]](../../../fse/models/sentencevectors.py#L40)
-
-```python
-class SentenceVectors(utils.SaveLoad):
-    def __init__(vector_size: int, mapfile_path: str = None):
-```
-
-### SentenceVectors().\_\_getitem\_\_
-
-[[find in source code]](../../../fse/models/sentencevectors.py#L53)
-
-```python
-def __getitem__(entities: int) -> ndarray:
-```
-
-Get vector representation of `entities`.
-
-Parameters
-----------
-entities : {int, list of int}
-    Index or sequence of entities.
-
-Returns
--------
-numpy.ndarray
-    Vector representation for `entities` (1D if `entities` is int, otherwise
-    2D).
-
-### SentenceVectors().distance
-
-[[find in source code]](../../../fse/models/sentencevectors.py#L202)
-
-```python
-def distance(d1: int, d2: int) -> float:
-```
-
-Compute cosine distance between two sentences from the training set.
-
-Parameters
-----------
-d1 : int
-    index of sentence
-d2 : int
-    index of sentence
-
-Returns
--------
-float
-    The cosine distance between the vectors of the two sentences.
-
-### SentenceVectors().get_vector
-
-[[find in source code]](../../../fse/models/sentencevectors.py#L133)
-
-```python
-def get_vector(index: int, use_norm: bool = False) -> ndarray:
-```
-
-Get sentence representations in vector space, as a 1D numpy array.
-
-Parameters
-----------
-index : int
-    Input index
-use_norm : bool, optional
-    If True - resulting vector will be L2-normalized (unit Euclidean length).
-
-Returns
--------
-numpy.ndarray
-    Vector representation of index.
-
-Raises
-------
-KeyError
-    If index out of bounds.
-
-### SentenceVectors().init_sims
-
-[[find in source code]](../../../fse/models/sentencevectors.py#L165)
-
-```python
-def init_sims(replace: bool = False):
-```
-
-Precompute L2-normalized vectors.
-
-Parameters
-----------
-replace : bool, optional
-    If True - forget the original vectors and only keep the normalized ones = saves lots of memory!
-
-### SentenceVectors.load
-
-[[find in source code]](../../../fse/models/sentencevectors.py#L123)
-
-```python
-@classmethod
-def load(fname_or_handle, **kwargs):
-```
-
-### SentenceVectors().most_similar
-
-[[find in source code]](../../../fse/models/sentencevectors.py#L220)
-
-```python
-def most_similar(
-    positive: Union[int, ndarray] = None,
-    negative: Union[int, ndarray] = None,
-    indexable: Union[IndexedList, IndexedLineDocument] = None,
-    topn: int = 10,
-    restrict_size: Union[int, Tuple[int, int]] = None,
-) -> List[Tuple[int, float]]:
-```
-
-Find the top-N most similar sentences.
-Positive sentences contribute positively towards the similarity, negative sentences negatively.
-
-This method computes cosine similarity between a simple mean of the projection
-weight vectors of the given sentences and the vectors for each sentence in the model.
-
-Parameters
-----------
-positive : list of int, optional
-    List of indices that contribute positively.
-negative : list of int, optional
-    List of indices that contribute negatively.
-indexable: list, IndexedList, IndexedLineDocument
-    Provides an indexable object from where the most similar sentences are read
-topn : int or None, optional
-    Number of top-N similar sentences to return, when `topn` is int. When `topn` is None,
-    then similarities for all sentences are returned.
-restrict_size : int or Tuple(int,int), optional
-    Optional integer which limits the range of vectors which
-    are searched for most-similar values. For example, restrict_size=10000 would
-    only check the first 10000 sentence vectors.
-    restrict_size=(500, 1000) would search the sentence vectors with indices between
-    500 and 1000.
-
-Returns
--------
-list of (int, float) or list of (str, int, float)
-    A sequence of (index, similarity) is returned.
-    When an indexable is provided, returns (str, index, similarity)
-    When `topn` is None, then similarities for all sentences are returned as a
-    one-dimensional numpy array with one entry per sentence vector.
-
-#### See also
-
-- [IndexedLineDocument](../inputs.md#indexedlinedocument)
-- [IndexedList](../inputs.md#indexedlist)
-
-### SentenceVectors().save
-
-[[find in source code]](../../../fse/models/sentencevectors.py#L101)
-
-```python
-def save(*args, **kwargs):
-```
-
-Save object.
-
-Parameters
-----------
-fname : str
-    Path to the output file.
-
-See Also
---------
-:meth:`~gensim.models.keyedvectors.Doc2VecKeyedVectors.load`
-    Load object.
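The similarity lookups documented here all reduce to ranking stored sentence vectors by cosine similarity against a query vector. A minimal numpy sketch of that computation (illustrative names and a toy 2-d corpus, not fse's actual implementation):

```python
import numpy as np

def most_similar_by_vector(query, sent_vectors, topn=3):
    # Normalize the stored sentence vectors and the query to unit length,
    # then rank by cosine similarity (dot product of unit vectors).
    sv = sent_vectors / np.linalg.norm(sent_vectors, axis=1, keepdims=True)
    q = query / np.linalg.norm(query)
    sims = sv @ q
    best = np.argsort(-sims)[:topn]
    return [(int(i), float(sims[i])) for i in best]

# Three toy 2-d "sentence vectors"; the query points almost along the first.
vecs = np.array([[1.0, 0.0], [0.0, 1.0], [0.8, 0.6]])
hits = most_similar_by_vector(np.array([1.0, 0.1]), vecs, topn=2)
```

`restrict_size` can be thought of as slicing `sent_vectors` before this ranking step.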
-
-### SentenceVectors().similar_by_sentence
-
-[[find in source code]](../../../fse/models/sentencevectors.py#L373)
-
-```python
-def similar_by_sentence(
-    sentence: List[str],
-    model,
-    indexable: Union[IndexedList, IndexedLineDocument] = None,
-    topn: int = 10,
-    restrict_size: Union[int, Tuple[int, int]] = None,
-) -> List[Tuple[int, float]]:
-```
-
-Find the top-N most similar sentences to a given sentence.
-
-Parameters
-----------
-sentence : list of str
-    Sentence as list of strings
-model : class `fse.models.base_s2v.BaseSentence2VecModel`
-    This object essentially provides the infer method used to transform the sentence into a vector.
-indexable: list, IndexedList, IndexedLineDocument
-    Provides an indexable object from where the most similar sentences are read
-topn : int or None, optional
-    Number of top-N similar sentences to return, when `topn` is int. When `topn` is None,
-    then similarities for all sentences are returned.
-restrict_size : int or Tuple(int,int), optional
-    Optional integer which limits the range of vectors which
-    are searched for most-similar values. For example, restrict_size=10000 would
-    only check the first 10000 sentence vectors.
-    restrict_size=(500, 1000) would search the sentence vectors with indices between
-    500 and 1000.
-
-Returns
--------
-list of (int, float) or list of (str, int, float)
-    A sequence of (index, similarity) is returned.
-    When an indexable is provided, returns (str, index, similarity)
-    When `topn` is None, then similarities for all sentences are returned as a
-    one-dimensional numpy array with one entry per sentence vector.
-
-#### See also
-
-- [IndexedLineDocument](../inputs.md#indexedlinedocument)
-- [IndexedList](../inputs.md#indexedlist)
-
-### SentenceVectors().similar_by_vector
-
-[[find in source code]](../../../fse/models/sentencevectors.py#L422)
-
-```python
-def similar_by_vector(
-    vector: ndarray,
-    indexable: Union[IndexedList, IndexedLineDocument] = None,
-    topn: int = 10,
-    restrict_size: Union[int, Tuple[int, int]] = None,
-) -> List[Tuple[int, float]]:
-```
-
-Find the top-N most similar sentences to a given vector.
-
-Parameters
-----------
-vector : ndarray
-    Query vector
-indexable: list, IndexedList, IndexedLineDocument
-    Provides an indexable object from where the most similar sentences are read
-topn : int or None, optional
-    Number of top-N similar sentences to return, when `topn` is int. When `topn` is None,
-    then similarities for all sentences are returned.
-restrict_size : int or Tuple(int,int), optional
-    Optional integer which limits the range of vectors which
-    are searched for most-similar values. For example, restrict_size=10000 would
-    only check the first 10000 sentence vectors.
-    restrict_size=(500, 1000) would search the sentence vectors with indices between
-    500 and 1000.
-
-Returns
--------
-list of (int, float) or list of (str, int, float)
-    A sequence of (index, similarity) is returned.
-    When an indexable is provided, returns (str, index, similarity)
-    When `topn` is None, then similarities for all sentences are returned as a
-    one-dimensional numpy array with one entry per sentence vector.
-
-#### See also
-
-- [IndexedLineDocument](../inputs.md#indexedlinedocument)
-- [IndexedList](../inputs.md#indexedlist)
-
-### SentenceVectors().similar_by_word
-
-[[find in source code]](../../../fse/models/sentencevectors.py#L328)
-
-```python
-def similar_by_word(
-    word: str,
-    wv: KeyedVectors,
-    indexable: Union[IndexedList, IndexedLineDocument] = None,
-    topn: int = 10,
-    restrict_size: Union[int, Tuple[int, int]] = None,
-) -> List[Tuple[int, float]]:
-```
-
-Find the top-N most similar sentences to a given word.
-
-Parameters
-----------
-word : str
-    Word
-wv : class `gensim.models.keyedvectors.KeyedVectors`
-    This object essentially contains the mapping between words and embeddings.
-indexable: list, IndexedList, IndexedLineDocument
-    Provides an indexable object from where the most similar sentences are read
-topn : int or None, optional
-    Number of top-N similar sentences to return, when `topn` is int. When `topn` is None,
-    then similarities for all sentences are returned.
-restrict_size : int or Tuple(int,int), optional
-    Optional integer which limits the range of vectors which
-    are searched for most-similar values. For example, restrict_size=10000 would
-    only check the first 10000 sentence vectors.
-    restrict_size=(500, 1000) would search the sentence vectors with indices between
-    500 and 1000.
-
-Returns
--------
-list of (int, float) or list of (str, int, float)
-    A sequence of (index, similarity) is returned.
-    When an indexable is provided, returns (str, index, similarity)
-    When `topn` is None, then similarities for all sentences are returned as a
-    one-dimensional numpy array with one entry per sentence vector.
- -#### See also - -- [IndexedLineDocument](../inputs.md#indexedlinedocument) -- [IndexedList](../inputs.md#indexedlist) - -### SentenceVectors().similarity - -[[find in source code]](../../../fse/models/sentencevectors.py#L184) - -```python -def similarity(d1: int, d2: int) -> float: -``` - -Compute cosine similarity between two sentences from the training set. - -Parameters ----------- -d1 : int - index of sentence -d2 : int - index of sentence - -Returns -------- -float - The cosine similarity between the vectors of the two sentences. diff --git a/docs/fse/models/sif.md b/docs/fse/models/sif.md deleted file mode 100644 index 1c36bb0..0000000 --- a/docs/fse/models/sif.md +++ /dev/null @@ -1,28 +0,0 @@ -# SIF - -> Auto-generated documentation for [fse.models.sif](../../../fse/models/sif.py) module. - -- [Fast_sentence_embeddings](../../README.md#fast_sentence_embeddings-index) / [Modules](../../MODULES.md#fast_sentence_embeddings-modules) / [Fse](../index.md#fse) / [Models](index.md#models) / SIF - - [SIF](#sif) - -## SIF - -[[find in source code]](../../../fse/models/sif.py#L19) - -```python -class SIF(Average): - def __init__( - model: KeyedVectors, - alpha: float = 0.001, - components: int = 1, - cache_size_gb: float = 1.0, - sv_mapfile_path: str = None, - wv_mapfile_path: str = None, - workers: int = 1, - lang_freq: str = None, - ): -``` - -#### See also - -- [Average](average.md#average) diff --git a/docs/fse/models/usif.md b/docs/fse/models/usif.md deleted file mode 100644 index a16006a..0000000 --- a/docs/fse/models/usif.md +++ /dev/null @@ -1,28 +0,0 @@ -# uSIF - -> Auto-generated documentation for [fse.models.usif](../../../fse/models/usif.py) module. 
- -- [Fast_sentence_embeddings](../../README.md#fast_sentence_embeddings-index) / [Modules](../../MODULES.md#fast_sentence_embeddings-modules) / [Fse](../index.md#fse) / [Models](index.md#models) / uSIF - - [uSIF](#usif) - -## uSIF - -[[find in source code]](../../../fse/models/usif.py#L23) - -```python -class uSIF(Average): - def __init__( - model: KeyedVectors, - length: int = None, - components: int = 5, - cache_size_gb: float = 1.0, - sv_mapfile_path: str = None, - wv_mapfile_path: str = None, - workers: int = 1, - lang_freq: str = None, - ): -``` - -#### See also - -- [Average](average.md#average) diff --git a/docs/fse/models/utils.md b/docs/fse/models/utils.md deleted file mode 100644 index ee4bdaf..0000000 --- a/docs/fse/models/utils.md +++ /dev/null @@ -1,89 +0,0 @@ -# Utils - -> Auto-generated documentation for [fse.models.utils](../../../fse/models/utils.py) module. - -- [Fast_sentence_embeddings](../../README.md#fast_sentence_embeddings-index) / [Modules](../../MODULES.md#fast_sentence_embeddings-modules) / [Fse](../index.md#fse) / [Models](index.md#models) / Utils - - [compute_principal_components](#compute_principal_components) - - [remove_principal_components](#remove_principal_components) - - [set_madvise_for_mmap](#set_madvise_for_mmap) - -## compute_principal_components - -[[find in source code]](../../../fse/models/utils.py#L56) - -```python -def compute_principal_components( - vectors: ndarray, - components: int = 1, - cache_size_gb: float = 1.0, -) -> Tuple[ndarray, ndarray]: -``` - -Method used to compute the first singular vectors of a given (sub)matrix - -Parameters ----------- -vectors : ndarray - (Sentence) vectors to compute the truncated SVD on -components : int, optional - Number of singular values/vectors to compute -cache_size_gb : float, optional - Cache size for computing the principal components in GB - -Returns -------- -ndarray, ndarray - Singular values and singular vectors - -## remove_principal_components - -[[find in source 
code]](../../../fse/models/utils.py#L99)
-
-```python
-def remove_principal_components(
-    vectors: ndarray,
-    svd_res: Tuple[ndarray, ndarray],
-    weights: ndarray = None,
-    inplace: bool = True,
-) -> ndarray:
-```
-
-Method used to remove the first singular vectors of a given matrix
-
-Parameters
-----------
-vectors : ndarray
-    (Sentence) vectors to remove components from
-svd_res : (ndarray, ndarray)
-    Tuple consisting of the singular values and components to remove from the vectors
-weights : ndarray, optional
-    Weights to be used to weigh the components which are removed from the vectors
-inplace : bool, optional
-    If true, removes the components from the vectors inplace (memory efficient)
-
-Returns
--------
-ndarray
-    Vectors with the principal components removed
-
-## set_madvise_for_mmap
-
-[[find in source code]](../../../fse/models/utils.py#L26)
-
-```python
-def set_madvise_for_mmap(return_madvise: bool = False) -> object:
-```
-
-Method used to set madvise parameters.
-This method addresses the memmap issue raised in https://github.com/numpy/numpy/issues/13172
-The issue does not apply on Windows
-
-Parameters
-----------
-return_madvise : bool
-    Returns the madvise object for unittests, see test_utils.py
-
-Returns
--------
-object
-    madvise object
diff --git a/docs/fse/vectors.md b/docs/fse/vectors.md
deleted file mode 100644
index 98284bd..0000000
--- a/docs/fse/vectors.md
+++ /dev/null
@@ -1,73 +0,0 @@
-# Vectors
-
-> Auto-generated documentation for [fse.vectors](../../fse/vectors.py) module.
-
-Class to obtain BaseKeyedVector from.
-
-- [Fast_sentence_embeddings](../README.md#fast_sentence_embeddings-index) / [Modules](../MODULES.md#fast_sentence_embeddings-modules) / [Fse](index.md#fse) / Vectors
-    - [FTVectors](#ftvectors)
-        - [FTVectors.from_pretrained](#ftvectorsfrom_pretrained)
-    - [Vectors](#vectors)
-        - [Vectors.from_pretrained](#vectorsfrom_pretrained)
-
-## FTVectors
-
-[[find in source code]](../../fse/vectors.py#L51)
-
-```python
-class FTVectors(FastTextKeyedVectors):
-```
-
-Class to instantiate FT vectors from pretrained models.
-
-### FTVectors.from_pretrained
-
-[[find in source code]](../../fse/vectors.py#L54)
-
-```python
-@classmethod
-def from_pretrained(model: str, mmap: str = None):
-```
-
-Method to load vectors from a pre-trained model.
-
-Parameters
-----------
-model : str
-    Name of the model to load from the hub. For example: "glove-wiki-gigaword-50"
-mmap : str
-    Whether to load the vectors in mmap mode.
-
-Returns
--------
-Vectors
-    An object of pretrained vectors.
-
-## Vectors
-
-[[find in source code]](../../fse/vectors.py#L20)
-
-```python
-class Vectors(KeyedVectors):
-```
-
-Class to instantiate vectors from pretrained models.
-
-### Vectors.from_pretrained
-
-[[find in source code]](../../fse/vectors.py#L23)
-
-```python
-@classmethod
-def from_pretrained(model: str, mmap: str = None):
-```
-
-Method to load vectors from a pre-trained model.
-
-Parameters
-----------
-model : str
-    Name of the model to load from the hub. For example: "glove-wiki-gigaword-50"
-mmap : str
-    Whether to load the vectors in mmap mode.
-
-Returns
--------
-Vectors
-    An object of pretrained vectors.
diff --git a/release.sh b/release.sh
index e959d6b..1ea6892 100644
--- a/release.sh
+++ b/release.sh
@@ -3,6 +3,4 @@
 docformatter --in-place **/*.py --wrap-summaries 88 --wrap-descriptions 88
 isort --atomic **/*.py
 black .
-pytest -v --cov=fse --cov-report=term-missing
-
-handsdown
\ No newline at end of file
+pytest -v --cov=fse --cov-report=term-missing
\ No newline at end of file
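The SIF and uSIF docs removed in this patch describe models that weight each word by smooth inverse frequency before averaging. A minimal sketch of that weighting rule, weight(w) = alpha / (alpha + p(w)), where p(w) is the word's unigram probability (names are illustrative, not fse's API):

```python
def sif_weights(word_probs, alpha=1e-3):
    # SIF down-weights frequent words: weight(w) = alpha / (alpha + p(w)).
    # word_probs maps word -> unigram probability (an illustrative input).
    return {w: alpha / (alpha + p) for w, p in word_probs.items()}

# A frequent function word receives a far smaller weight than a rare word.
weights = sif_weights({"the": 0.05, "embedding": 1e-5})
```

The weighted average of word vectors, followed by removal of the leading principal component(s), yields the sentence embedding these models produce.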