This project aims to develop a platform for detecting plagiarism in English, French, and Arabic texts using Artificial Intelligence and Natural Language Processing (NLP) algorithms. It takes two or more reports as input and applies string-processing algorithms and NLP techniques to produce a result indicating the level of plagiarism between the reports.
The project uses various algorithms and methods for detecting plagiarism, including:
- Token Count Vectorizer
- Term Frequency-Inverse Document Frequency
- Similar_Text Algorithm
- Levenshtein Distance Algorithm
- Jaccard Index Algorithm
- Cosine Similarity Algorithm
- Longest Common Subsequence Algorithm
- Dice Coefficient Algorithm
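
To make a few of the measures listed above concrete, here is a minimal JavaScript sketch of token count vectors with cosine similarity, the Jaccard index, the Dice coefficient, and Levenshtein distance. It is an illustration under stated assumptions, not the project's PHP implementation: every function name below (`tokenize`, `countVector`, and so on) is hypothetical, and the project may tokenize or weight terms differently.

```javascript
// Illustrative sketch only; all helper names are hypothetical.

// Split text into lowercase word tokens (Unicode-aware: letters, marks, digits).
function tokenize(text) {
  return text.toLowerCase().match(/[\p{L}\p{M}\p{N}]+/gu) || [];
}

// Token count vector: maps each token to its number of occurrences.
function countVector(tokens) {
  const counts = new Map();
  for (const t of tokens) counts.set(t, (counts.get(t) || 0) + 1);
  return counts;
}

// Cosine similarity between two count vectors: dot(a, b) / (|a| * |b|).
function cosineSimilarity(a, b) {
  let dot = 0, normA = 0, normB = 0;
  for (const [token, n] of a) {
    dot += n * (b.get(token) || 0);
    normA += n * n;
  }
  for (const n of b.values()) normB += n * n;
  return normA && normB ? dot / Math.sqrt(normA * normB) : 0;
}

// Number of distinct tokens shared by two sets.
function sharedCount(A, B) {
  let shared = 0;
  for (const t of A) if (B.has(t)) shared++;
  return shared;
}

// Jaccard index: |A ∩ B| / |A ∪ B| over the sets of distinct tokens.
function jaccardIndex(tokensA, tokensB) {
  const A = new Set(tokensA), B = new Set(tokensB);
  const shared = sharedCount(A, B);
  const union = A.size + B.size - shared;
  return union ? shared / union : 0;
}

// Dice coefficient: 2|A ∩ B| / (|A| + |B|) over the sets of distinct tokens.
function diceCoefficient(tokensA, tokensB) {
  const A = new Set(tokensA), B = new Set(tokensB);
  const total = A.size + B.size;
  return total ? (2 * sharedCount(A, B)) / total : 0;
}

// Levenshtein distance: minimum number of single-character insertions,
// deletions, and substitutions turning s into t (two-row dynamic programming).
function levenshtein(s, t) {
  let prev = Array.from({ length: t.length + 1 }, (_, j) => j);
  for (let i = 1; i <= s.length; i++) {
    const curr = [i];
    for (let j = 1; j <= t.length; j++) {
      const cost = s[i - 1] === t[j - 1] ? 0 : 1;
      curr[j] = Math.min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost);
    }
    prev = curr;
  }
  return prev[t.length];
}

const a = tokenize("The cat sat on the mat");
const b = tokenize("The cat lay on the rug");
console.log(cosineSimilarity(countVector(a), countVector(b))); // 0.75
console.log(jaccardIndex(a, b));                               // 3/7, about 0.43
console.log(diceCoefficient(a, b));                            // 0.6
console.log(levenshtein("kitten", "sitting"));                 // 3
```

The set-based measures ignore how often a token repeats, while cosine similarity over count vectors does not, which is one reason the same pair of texts can score differently under each measure.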
The project also uses various preprocessing techniques, including normalization, Arabic tashkil removal, stop word removal, lemmatization, stemming, and tokenization into N-grams.
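
The preprocessing steps can be sketched along the following lines. This is a simplified JavaScript illustration, not the project's code: the function names and the tiny stop word lists are hypothetical, the tashkil range covers only the common Arabic diacritics (U+064B to U+0652, plus U+0670), and stemming and lemmatization are left out because the project delegates them to the libraries it depends on.

```javascript
// Illustrative sketch only; names and stop word lists are hypothetical.

// Deliberately tiny stop word lists; real lists are much longer.
const STOP_WORDS = {
  en: new Set(["the", "a", "of", "and", "in", "on"]),
  fr: new Set(["le", "la", "les", "de", "et", "un", "une"]),
  ar: new Set(["في", "من", "على"]),
};

// Remove Arabic tashkil: fathatan through sukun (U+064B-U+0652)
// and the superscript alef (U+0670).
function removeTashkil(text) {
  return text.replace(/[\u064B-\u0652\u0670]/g, "");
}

// Normalize: Unicode NFKC, strip tashkil, lowercase,
// replace punctuation with spaces, collapse whitespace.
function normalize(text) {
  return removeTashkil(text.normalize("NFKC"))
    .toLowerCase()
    .replace(/[^\p{L}\p{N}\s]+/gu, " ")
    .replace(/\s+/g, " ")
    .trim();
}

// Drop tokens found in the stop word list of the given language code.
function removeStopWords(tokens, lang) {
  const stop = STOP_WORDS[lang] || new Set();
  return tokens.filter((t) => !stop.has(t));
}

// Word n-grams: every run of n consecutive tokens, joined by a space.
function ngrams(tokens, n) {
  const out = [];
  for (let i = 0; i + n <= tokens.length; i++) {
    out.push(tokens.slice(i, i + n).join(" "));
  }
  return out;
}

// Full pipeline: normalize, split into words, remove stop words, build n-grams.
function preprocess(text, lang, n = 2) {
  const normalized = normalize(text);
  const tokens = normalized ? normalized.split(" ") : [];
  return ngrams(removeStopWords(tokens, lang), n);
}

console.log(preprocess("The cat sat on the mat.", "en", 2)); // [ 'cat sat', 'sat mat' ]
console.log(removeTashkil("مُحَمَّد")); // محمد
```

Tashkil is stripped before punctuation handling so that the diacritics do not split Arabic words into fragments.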
The project is written in PHP, JavaScript, Bootstrap, HTML5, CSS3, and SQL, and relies on the following libraries:
- Lemmatizer: https://github.com/writecrow/lemmatizer
- PHP-ML: https://php-ml.readthedocs.io/en/latest
- PHP-LCS: https://packagist.org/packages/eloquent/lcs
- NLP-Tools: http://php-nlp-tools.com/documentation/
- php-stemmer: https://github.com/amaccis/php-stemmer
This repository contains the report and source code for the project, along with the database file. The report is organized into five chapters, covering introduction and context, preprocessing, similarity calculation algorithms, user interface design, implementation, experimentation, and discussion.
This project demonstrates the potential of Artificial Intelligence and Natural Language Processing algorithms for detecting plagiarism in documents. Combining several similarity measures, from edit distance to vector-based metrics, gives a broader basis for identifying plagiarized documents than any single measure; the experimentation and discussion parts of the report cover the results.