Turn any document directory into a prompt-friendly text ingest for LLMs, with a focus on compliance and comprehensive context generation.
- Multi-Format Document Support
  - Ingests PDF, DOCX, Markdown, TXT files
  - Automatic encoding detection
  - Intelligent file type handling
  - NEW: Extended support for .xlsx, .xls, .pptx, .json, .csv, .xml
- Compliance-Focused Ingestion
  - Pre-configured Compliance Officer prompt
  - Customizable AI agent context
  - Designed with compliance in mind
- Smart File Processing
  - Skips system and configuration files
  - Handles temporary and hidden files
  - Supports complex directory structures
- Metadata and Reporting
  - Generates comprehensive directory structure tree
  - Counts total files and tokens
  - Provides summary statistics
- Semantic Compression (NEW)
  - Intelligently reduce document size while maintaining core meaning
  - Configurable compression levels
  - Preserves full original content
  - Optional compressed view for AI processing
- Flexible Usage
  - Command-line interface
  - Importable as a Python package
  - Configurable output options
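To illustrate the metadata and reporting features, here is a minimal sketch of building a directory tree and counting files and tokens. The helper names (`build_tree`, `estimate_tokens`, `summarize`) are hypothetical and are not part of the docsingest API, and the four-characters-per-token heuristic is an assumption rather than the tool's actual tokenizer.

```python
from pathlib import Path


def build_tree(root: Path, prefix: str = "") -> list[str]:
    """Return the lines of an indented tree for everything under root."""
    lines = []
    # Directories first, then files, each group sorted by name.
    entries = sorted(root.iterdir(), key=lambda p: (p.is_file(), p.name))
    for index, entry in enumerate(entries):
        last = index == len(entries) - 1
        lines.append(f"{prefix}{'└── ' if last else '├── '}{entry.name}")
        if entry.is_dir():
            # Children of the last entry need no vertical guide line.
            lines.extend(build_tree(entry, prefix + ("    " if last else "│   ")))
    return lines


def estimate_tokens(text: str) -> int:
    """Rough heuristic (assumption): about four characters per token."""
    return max(1, len(text) // 4) if text else 0


def summarize(root: Path) -> dict:
    """Count the files under root and estimate their combined token total."""
    files = [p for p in root.rglob("*") if p.is_file()]
    tokens = sum(estimate_tokens(p.read_text(errors="ignore")) for p in files)
    return {"files": len(files), "tokens": tokens}
```

Calling `"\n".join(build_tree(Path("/path/to/documents")))` yields a printable tree, and `summarize` returns the counts as a dictionary. A real tokenizer would give more accurate totals than the character heuristic.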
pip install docsingest
# Clone the repository
git clone https://github.com/marc-shade/docsingest.git
# Navigate to the directory
cd docsingest
# Install the package
pip install -e .
# Ingest documents with default Compliance Officer prompt
docsingest /path/to/documents
# Enable semantic compression
docsingest /path/to/documents --compress
# Custom compression level
docsingest /path/to/documents --compress --compression-level 0.7
# Custom AI agent prompt
docsingest /path/to/documents --agent "Financial Auditor" -o financial_report.md
from docsingest import ingest
# Basic usage
summary, tree, content = ingest("/path/to/documents")
# Custom agent prompt
summary, tree, content = ingest(
    "/path/to/documents",
    agent_prompt="Specialized Compliance Analyst"
)
- Microsoft Word (.docx)
- Microsoft Excel (.xlsx, .xls)
- Microsoft PowerPoint (.pptx)
- Markdown (.md)
- Plain Text (.txt)
- CSV
- XML
- JSON
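As a rough illustration of file type handling and encoding detection, the sketch below dispatches on file extension and tries a short list of encodings. It covers only the text-based formats from the list above using the standard library; Office formats would need third-party parsers. The names `LOADERS`, `load_document`, and `read_text_with_fallback` are hypothetical, and the encoding fallback order is an assumption, not docsingest's actual detection logic.

```python
import csv
import io
import json
from pathlib import Path


def read_text_with_fallback(path: Path, encodings=("utf-8", "cp1252")) -> str:
    """Try each encoding in turn; fall back to latin-1, which never fails."""
    data = path.read_bytes()
    for encoding in encodings:
        try:
            return data.decode(encoding)
        except UnicodeDecodeError:
            continue
    return data.decode("latin-1")


def load_json(path: Path) -> str:
    # Re-serialize so the output is consistently indented.
    return json.dumps(json.loads(read_text_with_fallback(path)), indent=2)


def load_csv(path: Path) -> str:
    # Render each row as pipe-separated cells.
    rows = csv.reader(io.StringIO(read_text_with_fallback(path)))
    return "\n".join(" | ".join(row) for row in rows)


# Hypothetical registry mapping lowercase extensions to loader functions.
LOADERS = {
    ".txt": read_text_with_fallback,
    ".md": read_text_with_fallback,
    ".xml": read_text_with_fallback,
    ".json": load_json,
    ".csv": load_csv,
}


def load_document(path: Path) -> str:
    """Pick a loader by extension and return the document as text."""
    loader = LOADERS.get(path.suffix.lower())
    if loader is None:
        raise ValueError(f"No loader registered for {path.suffix!r}")
    return loader(path)
```

A registry keyed by extension keeps each format's parsing isolated, so adding a format means adding one function and one dictionary entry.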
- .DS_Store
- Temporary Office files (~$)
- Temporary files (.tmp)
- Log files
- Git-related files and directories
- IDE configuration directories
- Python cache and virtual environment files
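The skip rules above could be expressed as a small path filter. This is a sketch under stated assumptions: the specific names, suffixes, and directory names in the sets below are illustrative guesses at what each category covers, not docsingest's actual configuration, and `should_skip` and `iter_documents` are hypothetical helpers.

```python
from pathlib import Path

# Assumed rule set derived from the categories listed above.
SKIP_NAMES = {".DS_Store", ".gitignore", ".gitattributes"}
SKIP_PREFIXES = ("~$",)                    # temporary Office lock files
SKIP_SUFFIXES = (".tmp", ".log", ".pyc")   # temp files, logs, Python bytecode
SKIP_DIRS = {".git", ".idea", ".vscode", "__pycache__", ".venv", "venv"}


def should_skip(path: Path) -> bool:
    """Return True if any path component or the file name matches a skip rule."""
    if any(part in SKIP_DIRS for part in path.parts):
        return True
    name = path.name
    return (
        name in SKIP_NAMES
        or name.startswith(SKIP_PREFIXES)
        or name.endswith(SKIP_SUFFIXES)
    )


def iter_documents(root: Path):
    """Yield files under root that survive the skip rules, in sorted order."""
    for path in sorted(root.rglob("*")):
        # Check the path relative to root so parent directories of root
        # never trigger a skip.
        if path.is_file() and not should_skip(path.relative_to(root)):
            yield path
```

Checking every path component, not only the file name, is what excludes everything nested under directories such as `.git` or `__pycache__`.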
DocsIngest provides a robust, multi-layered approach to regulatory compliance and document risk management:
- Multi-Jurisdiction Support: Designed to handle compliance requirements across various regulatory landscapes
- Adaptive Compliance Scanning: Intelligent detection of sensitive information and potential regulatory risks
- Configurable Compliance Profiles: Customizable settings for different industry standards and regulations
- Document Ingestion Analysis
  - Automatic classification of document types
  - Identification of sensitive and regulated content
  - Contextual risk scoring
- Compliance Risk Evaluation
  - Detect potential regulatory violations
  - Flag documents with high-risk content
  - Generate detailed compliance reports
- Proactive Monitoring
  - Continuous document scanning
  - Real-time alerts for compliance breaches
  - Audit trail generation
- GDPR (General Data Protection Regulation)
- HIPAA (Health Insurance Portability and Accountability Act)
- CCPA (California Consumer Privacy Act)
- SOX (Sarbanes-Oxley Act)
- PCI DSS (Payment Card Industry Data Security Standard)
- NIST Framework
- ISO 27001 Information Security Management
- Advanced PII Detection
  - Identify sensitive personal information
  - Support for multiple PII categories:
    - Names
    - Email addresses
    - Phone numbers
    - Social Security Numbers
    - Credit card numbers
- Intelligent Redaction
  - Automatic masking of sensitive information
  - Configurable redaction levels
- Comprehensive Compliance Reporting
  - Detailed risk assessment
  - Actionable compliance recommendations
- Multi-Regulation Support
  - Compliance checks for GDPR, FERPA, COPPA
  - Proactive regulatory alignment
- Document Ingestion
- Automated PII Scanning
- Risk Assessment and Scoring
- Compliance Reporting
- Optional Redaction
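The scanning and redaction steps above can be sketched with regular expressions. This is a simplified illustration, not docsingest's detection logic: the patterns are US-centric assumptions, `find_pii` and `redact` are hypothetical names, and names of people are omitted because they cannot be matched reliably with a regex alone.

```python
import re

# Illustrative patterns only (assumptions): deliberately simple and US-centric.
PII_PATTERNS = {
    "email": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "credit_card": re.compile(r"\b(?:\d{4}[ -]?){3}\d{4}\b"),
    "phone": re.compile(r"(?<!\d)(?:\(\d{3}\)\s?|\d{3}[-.])\d{3}[-.]\d{4}(?!\d)"),
}


def find_pii(text: str) -> dict[str, list[str]]:
    """Return every match per category, omitting categories with no hits."""
    hits = {name: pattern.findall(text) for name, pattern in PII_PATTERNS.items()}
    return {name: found for name, found in hits.items() if found}


def redact(text: str) -> str:
    """Replace each match with a category placeholder such as [REDACTED:ssn]."""
    for name, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[REDACTED:{name}]", text)
    return text
```

The number of hits per category from `find_pii` could feed a simple risk score, while `redact` produces a masked copy and leaves the original text untouched. Production use would need validation beyond pattern matching, such as a Luhn check for card numbers.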
Note: While DocsIngest provides powerful compliance tools, it is not a substitute for professional legal or compliance advice. Always consult with compliance experts for your specific regulatory requirements.
# Clone the repository
git clone https://github.com/marc-shade/docsingest.git
cd docsingest
# Create virtual environment
python3 -m venv .venv
source .venv/bin/activate
# Install dependencies
pip install -r requirements.txt
pip install -r requirements-dev.txt
# Run tests
pytest tests/
Please read CONTRIBUTING.md for details on our code of conduct and the process for submitting pull requests.
MIT License
- Support more file types
- Enhanced token estimation
- Web interface
- Cloud storage integration
- Advanced AI prompt customization