Gogs committed 2 days ago
commit 6db66cb907
100 files changed, with 10,400 additions and 0 deletions
  1. py/.dockerignore (+20 -0)
  2. py/Dockerfile (+47 -0)
  3. py/README.md (+125 -0)
  4. py/all_possible_config.toml (+295 -0)
  5. py/all_possible_config.toml.bak (+285 -0)
  6. py/core/__init__.py (+176 -0)
  7. py/core/agent/__init__.py (+36 -0)
  8. py/core/agent/base.py (+1484 -0)
  9. py/core/agent/rag.py (+326 -0)
  10. py/core/agent/research.py (+707 -0)
  11. py/core/base/__init__.py (+126 -0)
  12. py/core/base/abstractions/__init__.py (+147 -0)
  13. py/core/base/agent/__init__.py (+13 -0)
  14. py/core/base/agent/agent.py (+298 -0)
  15. py/core/base/agent/tools/built_in/get_file_content.py (+82 -0)
  16. py/core/base/agent/tools/built_in/search_file_descriptions.py (+67 -0)
  17. py/core/base/agent/tools/built_in/search_file_knowledge.py (+82 -0)
  18. py/core/base/agent/tools/built_in/tavily_extract.py (+109 -0)
  19. py/core/base/agent/tools/built_in/tavily_search.py (+123 -0)
  20. py/core/base/agent/tools/built_in/web_scrape.py (+92 -0)
  21. py/core/base/agent/tools/built_in/web_search.py (+64 -0)
  22. py/core/base/agent/tools/registry.py (+195 -0)
  23. py/core/base/api/models/__init__.py (+208 -0)
  24. py/core/base/parsers/__init__.py (+5 -0)
  25. py/core/base/parsers/base_parser.py (+12 -0)
  26. py/core/base/providers/__init__.py (+69 -0)
  27. py/core/base/providers/auth.py (+231 -0)
  28. py/core/base/providers/base.py (+135 -0)
  29. py/core/base/providers/crypto.py (+120 -0)
  30. py/core/base/providers/database.py (+208 -0)
  31. py/core/base/providers/email.py (+96 -0)
  32. py/core/base/providers/embedding.py (+169 -0)
  33. py/core/base/providers/file.py (+110 -0)
  34. py/core/base/providers/ingestion.py (+188 -0)
  35. py/core/base/providers/llm.py (+233 -0)
  36. py/core/base/providers/ocr.py (+120 -0)
  37. py/core/base/providers/orchestration.py (+70 -0)
  38. py/core/base/providers/scheduler.py (+39 -0)
  39. py/core/base/utils/__init__.py (+39 -0)
  40. py/core/configs/full.toml (+21 -0)
  41. py/core/configs/full_azure.toml (+46 -0)
  42. py/core/configs/full_lm_studio.toml (+55 -0)
  43. py/core/configs/full_ollama.toml (+61 -0)
  44. py/core/configs/gemini.toml (+19 -0)
  45. py/core/configs/lm_studio.toml (+40 -0)
  46. py/core/configs/ollama.toml (+46 -0)
  47. py/core/configs/r2r_azure.toml (+23 -0)
  48. py/core/configs/r2r_azure_with_test_limits.toml (+37 -0)
  49. py/core/configs/r2r_with_auth.toml (+8 -0)
  50. py/core/configs/tavily.toml (+30 -0)
  51. py/core/examples/__init__.py (+0 -0)
  52. py/core/examples/data/DeepSeek_R1.pdf (BIN)
  53. py/core/examples/data/aristotle.txt (+430 -0)
  54. py/core/examples/data/aristotle_v2.txt (+9 -0)
  55. py/core/examples/data/aristotle_v3.txt (+29 -0)
  56. py/core/examples/data/got.txt (+80 -0)
  57. py/core/examples/data/graphrag.pdf (BIN)
  58. py/core/examples/data/lyft_2021.pdf (BIN)
  59. py/core/examples/data/pg_essay_1.html (+19 -0)
  60. py/core/examples/data/pg_essay_2.html (+19 -0)
  61. py/core/examples/data/pg_essay_3.html (+19 -0)
  62. py/core/examples/data/pg_essay_4.html (+19 -0)
  63. py/core/examples/data/pg_essay_5.html (+19 -0)
  64. py/core/examples/data/sample.mp3 (BIN)
  65. py/core/examples/data/sample2.mp3 (BIN)
  66. py/core/examples/data/screen_shot.png (BIN)
  67. py/core/examples/data/test.txt (+1 -0)
  68. py/core/examples/data/uber_2021.pdf (BIN)
  69. py/core/examples/data/yc_companies.txt (+999 -0)
  70. py/core/examples/hello_r2r.ipynb (+114 -0)
  71. py/core/examples/hello_r2r.py (+23 -0)
  72. py/core/examples/supported_file_types/bmp.bmp (BIN)
  73. py/core/examples/supported_file_types/css.css (+126 -0)
  74. py/core/examples/supported_file_types/csv.csv (+11 -0)
  75. py/core/examples/supported_file_types/doc.doc (BIN)
  76. py/core/examples/supported_file_types/docx.docx (BIN)
  77. py/core/examples/supported_file_types/eml.eml (+61 -0)
  78. py/core/examples/supported_file_types/epub.epub (BIN)
  79. py/core/examples/supported_file_types/heic.heic (BIN)
  80. py/core/examples/supported_file_types/html.html (+69 -0)
  81. py/core/examples/supported_file_types/jpeg.jpeg (BIN)
  82. py/core/examples/supported_file_types/jpg.jpg (BIN)
  83. py/core/examples/supported_file_types/js.js (+43 -0)
  84. py/core/examples/supported_file_types/json.json (+58 -0)
  85. py/core/examples/supported_file_types/md.md (+310 -0)
  86. py/core/examples/supported_file_types/msg.msg (BIN)
  87. py/core/examples/supported_file_types/odt.odt (BIN)
  88. py/core/examples/supported_file_types/org.org (+153 -0)
  89. py/core/examples/supported_file_types/p7s.p7s (+50 -0)
  90. py/core/examples/supported_file_types/pdf.pdf (BIN)
  91. py/core/examples/supported_file_types/png.png (BIN)
  92. py/core/examples/supported_file_types/ppt.ppt (BIN)
  93. py/core/examples/supported_file_types/pptx.pptx (BIN)
  94. py/core/examples/supported_file_types/py.py (+32 -0)
  95. py/core/examples/supported_file_types/rst.rst (+86 -0)
  96. py/core/examples/supported_file_types/rtf.rtf (+5 -0)
  97. py/core/examples/supported_file_types/tiff.tiff (BIN)
  98. py/core/examples/supported_file_types/ts.ts (+247 -0)
  99. py/core/examples/supported_file_types/tsv.tsv (+11 -0)
  100. py/core/examples/supported_file_types/txt.txt (+21 -0)

+ 20 - 0
py/.dockerignore

@@ -0,0 +1,20 @@
+__pycache__
+*.pyc
+*.pyo
+*.pyd
+.Python
+env
+pip-log.txt
+pip-delete-this-directory.txt
+.tox
+.coverage
+.coverage.*
+.cache
+nosetests.xml
+coverage.xml
+*.cover
+*.log
+.git
+.mypy_cache
+.pytest_cache
+.hypothesis

+ 47 - 0
py/Dockerfile

@@ -0,0 +1,47 @@
+FROM python:3.12-slim AS builder
+
+# Install system dependencies
+RUN apt-get update && apt-get install -y --no-install-recommends \
+    gcc g++ musl-dev curl libffi-dev gfortran libopenblas-dev \
+    poppler-utils \
+    && apt-get clean && rm -rf /var/lib/apt/lists/* \
+    && curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh -s -- -y
+
+# Add Rust to PATH
+ENV PATH="/root/.cargo/bin:${PATH}"
+
+# Create the /app/py directory
+RUN mkdir -p /app/py
+WORKDIR /app/py
+COPY pyproject.toml ./
+RUN pip install --default-timeout=1000 -i https://pypi.tuna.tsinghua.edu.cn/simple/ -e ".[core]"
+
+# gunicorn, uvicorn, and pydantic are not listed under [project] in
+# `pyproject.toml`, so install them explicitly here (once, rather than
+# in two separate layers):
+RUN pip install --no-cache-dir gunicorn uvicorn pydantic
+
+# Create the final image
+FROM python:3.12-slim
+
+# Minimal runtime deps
+RUN apt-get update && apt-get install -y --no-install-recommends \
+    curl poppler-utils \
+    && apt-get clean && rm -rf /var/lib/apt/lists/*
+
+# Copy the built environment from builder to final image
+# (If you want a fully self-contained environment, copy /usr/local)
+COPY --from=builder /usr/local /usr/local
+
+WORKDIR /app
+
+# Copy the rest of your source code
+COPY . /app
+
+# Expose environment variables and port
+ARG R2R_PORT=8000 R2R_HOST=0.0.0.0
+ENV R2R_PORT=$R2R_PORT R2R_HOST=$R2R_HOST
+EXPOSE $R2R_PORT
+
+# Launch the app
+CMD ["sh", "-c", "uvicorn core.main.app_entry:app --host $R2R_HOST --port $R2R_PORT --workers 40"]
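The `CMD` above resolves the bind address at container start from the `R2R_HOST` and `R2R_PORT` environment variables declared with `ARG`/`ENV`. A minimal sketch of reading the same variables on the Python side (the variable names come from the Dockerfile; the `bind_address` helper is illustrative, not part of R2R):

```python
import os

def bind_address() -> tuple[str, int]:
    """Read host/port from the environment, falling back to the Dockerfile defaults."""
    host = os.environ.get("R2R_HOST", "0.0.0.0")
    port = int(os.environ.get("R2R_PORT", "8000"))
    if not 0 < port < 65536:
        raise ValueError(f"invalid port: {port}")
    return host, port

host, port = bind_address()
print(f"binding to {host}:{port}")
```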

+ 125 - 0
py/README.md

@@ -0,0 +1,125 @@
+<img width="1217" alt="Screenshot 2025-03-27 at 6 35 02 AM" src="https://github.com/user-attachments/assets/10b530a6-527f-4335-b2e4-ceaa9fc1219f" />
+
+<h3 align="center">
+The most advanced AI retrieval system.
+
+Agentic Retrieval-Augmented Generation (RAG) with a RESTful API.
+</h3>
+
+<div align="center">
+   <div>
+      <a href="https://r2r-docs.sciphi.ai/"><strong>Docs</strong></a> ·
+      <a href="https://github.com/SciPhi-AI/R2R/issues/new?assignees=&labels=&projects=&template=bug_report.md&title="><strong>Report Bug</strong></a> ·
+      <a href="https://github.com/SciPhi-AI/R2R/issues/new?assignees=&labels=&projects=&template=feature_request.md&title="><strong>Feature Request</strong></a> ·
+      <a href="https://discord.gg/p6KqD2kjtB"><strong>Discord</strong></a>
+   </div>
+   <br />
+   <p align="center">
+    <a href="https://r2r-docs.sciphi.ai"><img src="https://img.shields.io/badge/docs.sciphi.ai-3F16E4" alt="Docs"></a>
+    <a href="https://discord.gg/p6KqD2kjtB"><img src="https://img.shields.io/discord/1120774652915105934?style=social&logo=discord" alt="Discord"></a>
+    <a href="https://github.com/SciPhi-AI"><img src="https://img.shields.io/github/stars/SciPhi-AI/R2R" alt="Github Stars"></a>
+    <a href="https://github.com/SciPhi-AI/R2R/pulse"><img src="https://img.shields.io/github/commit-activity/w/SciPhi-AI/R2R" alt="Commits-per-week"></a>
+    <a href="https://opensource.org/licenses/MIT"><img src="https://img.shields.io/badge/License-MIT-purple.svg" alt="License: MIT"></a>
+  </p>
+</div>
+
+# About
+R2R is an advanced AI retrieval system supporting Retrieval-Augmented Generation (RAG) with production-ready features. Built around a RESTful API, R2R offers multimodal content ingestion, hybrid search, knowledge graphs, and comprehensive document management.
+
+R2R also includes a **Deep Research API**, a multi-step reasoning system that fetches relevant data from your knowledgebase and/or the internet to deliver richer, context-aware answers for complex queries.
+
+# Usage
+
+```python
+# Basic search
+results = client.retrieval.search(query="What is DeepSeek R1?")
+
+# RAG with citations
+response = client.retrieval.rag(query="What is DeepSeek R1?")
+
+# Deep Research RAG Agent
+response = client.retrieval.agent(
+  message={"role":"user", "content": "What does deepseek r1 imply? Think about market, societal implications, and more."},
+  rag_generation_config={
+    "model": "anthropic/claude-3-7-sonnet-20250219",
+    "extended_thinking": True,
+    "thinking_budget": 4096,
+    "temperature": 1,
+    "top_p": None,
+    "max_tokens_to_sample": 16000,
+  },
+)
+```
+
+
+
+## Getting Started
+```bash
+# Quick install and run in light mode
+pip install r2r
+export OPENAI_API_KEY=sk-...
+python -m r2r.serve
+
+# Or run in full mode with Docker
+# git clone git@github.com:SciPhi-AI/R2R.git && cd R2R
+# export R2R_CONFIG_NAME=full OPENAI_API_KEY=sk-...
+# docker compose -f compose.full.yaml --profile postgres up -d
+```
+
+For detailed self-hosting instructions, see the [self-hosting docs](https://r2r-docs.sciphi.ai/self-hosting/installation/overview).
+
+## Demo
+https://github.com/user-attachments/assets/173f7a1f-7c0b-4055-b667-e2cdcf70128b
+
+## Using the API
+
+### 1. Install SDK & Setup
+
+```bash
+# Install SDK
+pip install r2r  # Python
+# or
+npm i r2r-js    # JavaScript
+```
+
+### 2. Client Initialization
+
+```python
+from r2r import R2RClient
+client = R2RClient(base_url="http://localhost:7272")
+```
+
+```javascript
+const { r2rClient } = require('r2r-js');
+const client = new r2rClient("http://localhost:7272");
+```
+
+### 3. Document Operations
+
+```python
+# Ingest sample or your own document
+client.documents.create(file_path="/path/to/file")
+
+# List documents
+client.documents.list()
+```
+
+
+## Key Features
+
+- **📁 Multimodal Ingestion**: Parse `.txt`, `.pdf`, `.json`, `.png`, `.mp3`, and more
+- **🔍 Hybrid Search**: Semantic + keyword search with reciprocal rank fusion
+- **🔗 Knowledge Graphs**: Automatic entity & relationship extraction
+- **🤖 Agentic RAG**: Reasoning agent integrated with retrieval
+- **🔐 User & Access Management**: Complete authentication & collection system
+
+## Community & Contributing
+
+- [Join our Discord](https://discord.gg/p6KqD2kjtB) for support and discussion
+- Submit [feature requests](https://github.com/SciPhi-AI/R2R/issues/new?assignees=&labels=&projects=&template=feature_request.md&title=) or [bug reports](https://github.com/SciPhi-AI/R2R/issues/new?assignees=&labels=&projects=&template=bug_report.md&title=)
+- Open PRs for new features, improvements, or documentation
+
+### Our Contributors
+<a href="https://github.com/SciPhi-AI/R2R/graphs/contributors">
+  <img src="https://contrib.rocks/image?repo=SciPhi-AI/R2R" />
+</a>
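The README's "Hybrid Search" feature credits reciprocal rank fusion (RRF) for merging semantic and keyword results. The diff does not show R2R's actual implementation, but the technique itself is a simple rank-based merge: each document scores `1 / (k + rank)` per list, summed across lists. A self-contained sketch with made-up document IDs (`k = 60` is the constant from the original RRF paper):

```python
def rrf(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists of document IDs by reciprocal rank."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(scores, key=scores.get, reverse=True)

semantic = ["doc_a", "doc_b", "doc_c"]  # e.g. vector-similarity order
keyword = ["doc_b", "doc_c", "doc_a"]   # e.g. full-text-search order
print(rrf([semantic, keyword]))  # doc_b ranks first: high in both lists
```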

+ 295 - 0
py/all_possible_config.toml

@@ -0,0 +1,295 @@
+[app]
+# App settings are globally available, e.g. `r2r_config.agent.app`
+# project_name = "r2r_default" # optional, can also set with `R2R_PROJECT_NAME` env var
+default_max_documents_per_user = 1_000_000
+default_max_chunks_per_user = 10_000_000
+default_max_collections_per_user = 1_000_000
+
+# Set the default max upload size to 2 GB for local testing
+default_max_upload_size = 2147483648  # 2 GB for anything not explicitly listed
+
+# LLM used for internal operations, like deriving conversation names
+fast_llm = "openai/gpt-4o-mini"
+
+# LLM used for user-facing output, like RAG replies
+quality_llm = "openai/gpt-4o-mini"
+
+# LLM used for ingesting visual inputs
+vlm = "openai/gpt-4o-mini"
+
+# LLM used for transcription
+audio_lm = "openai/whisper-1"
+
+# Reasoning model, used for `research` agent
+reasoning_llm = "openai/o3-mini"
+# Planning model, used for `research` agent
+planning_llm = "anthropic/claude-3-7-sonnet-20250219"
+
+
+
+  [app.max_upload_size_by_type]
+    txt  = 2000000
+    md   = 2000000
+    tsv  = 2000000
+    csv  = 5000000
+    html = 5000000
+    doc  = 10000000
+    docx = 10000000
+    ppt  = 20000000
+    pptx = 20000000
+    xls  = 10000000
+    xlsx = 10000000
+    odt  = 5000000
+    pdf  = 30000000
+    eml  = 5000000
+    msg  = 5000000
+    p7s  = 5000000
+    bmp  = 5000000
+    heic = 5000000
+    jpeg = 5000000
+    jpg  = 5000000
+    png  = 5000000
+    tiff = 5000000
+    epub = 10000000
+    rtf  = 5000000
+    rst  = 5000000
+    org  = 5000000
+
+[agent]
+rag_agent_static_prompt = "static_rag_agent"
+rag_agent_dynamic_prompt = "dynamic_rag_agent"
+#tools = ["search_file_knowledge", "content"]
+rag_tools = ["search_file_descriptions", "search_file_knowledge", "get_file_content"] # can add  "web_search" | "web_scrape"
+# The following tools are available to the `research` agent
+research_tools = ["rag", "reasoning", "critique", "python_executor"]
+
+
+# tool_names = ["local_search", "web_search"] # uncomment to enable web search
+#tool_names = ["local_search"]
+
+#  [agent.generation_config]
+#  model = "openai/gpt-4o"
+
+[auth]
+provider = "r2r"
+access_token_lifetime_in_minutes = 6000 # set a very high default value, for easier testing
+refresh_token_lifetime_in_days = 7
+require_authentication = false
+require_email_verification = false
+default_admin_email = "xujiawei@cocorobo.cc"
+default_admin_password = "usestudio-1"
+
+[completion]
+provider = "r2r"
+concurrent_request_limit = 256000
+#fast_llm = "openai/gpt-4o-mini"
+
+  [completion.generation_config]
+  #model = "openai/gpt-4o-mini"
+  temperature = 0.1
+  top_p = 1.0
+  max_tokens_to_sample = 1024
+  stream = false
+  add_generation_kwargs = { }
+
+[crypto]
+provider = "bcrypt"
+
+[database]
+provider = "postgres"
+default_collection_name = "Default"
+default_collection_description = "Your default collection."
+# collection_summary_system_prompt = 'default_system'
+# collection_summary_task_prompt = 'default_collection_summary'
+
+# KG settings
+batch_size = 6400
+
+collection_summary_system_prompt = "system"
+collection_summary_prompt = "collection_summary"
+disable_create_extension = false
+kg_store_path = ""
+
+  # PostgreSQL tuning settings
+  [database.postgres_configuration_settings]
+    checkpoint_completion_target = 0.7
+    default_statistics_target = 100
+    effective_io_concurrency = 4
+    effective_cache_size = 5242880
+    huge_pages = "try"
+    maintenance_work_mem = 655360
+    max_connections = 2560
+    max_parallel_workers_per_gather = 16
+    max_parallel_workers = 4
+    max_parallel_maintenance_workers = 4
+    max_wal_size = 102400
+    max_worker_processes = 8
+    min_wal_size = 80
+    shared_buffers = 163840
+    statement_cache_size = 1000
+    random_page_cost = 1.1
+    wal_buffers = 2560
+    work_mem = 409600
+
+  # Graph creation settings
+  [database.graph_creation_settings]
+    graph_entity_description_prompt = "graph_entity_description"
+    graph_extraction_prompt = "graph_extraction"
+    entity_types = []
+    relation_types = []
+    automatic_deduplication = false
+
+  # Graph enrichment settings
+  [database.graph_enrichment_settings]
+    graph_communities_prompt = "graph_communities"
+
+  # Rate limiting settings
+  [database.limits]
+    global_per_min = 60
+    route_per_min = 20
+    monthly_limit = 10000
+
+  # Route-specific limits (empty by default)
+  [database.route_limits]
+    # e.g., "/api/search" = { global_per_min = 30, route_per_min = 10, monthly_limit = 5000 }
+
+  # User-specific limits (empty by default)
+  [database.user_limits]
+    # e.g., "user_uuid_here" = { global_per_min = 20, route_per_min = 5, monthly_limit = 2000 }
+
+  [database.maintenance]
+    vacuum_schedule = "0 3 * * *"  # Run at 3:00 AM daily
+
+
+[embedding]
+provider = "litellm"
+
+# For basic applications, use `openai/text-embedding-3-small` with `base_dimension = 512`
+
+# RECOMMENDED - For advanced applications,
+# use `openai/text-embedding-3-large` with `base_dimension = 3072` and binary quantization
+#base_model = "openai/text-embedding-3-small"
+#base_dimension = 512
+
+base_model = "openai/text-embedding-3-large"
+
+#base_model = "/text-embedding-v3"
+
+base_dimension = 256
+
+rerank_model = ""
+rerank_url = ""
+
+# rerank_model = "huggingface/mixedbread-ai/mxbai-rerank-large-v1" # reranking model
+
+batch_size = 32
+prefixes = {}   # Provide prefix overrides here if needed
+add_title_as_prefix = false
+concurrent_request_limit = 2560
+max_retries = 3
+initial_backoff = 1.0
+max_backoff = 64.0
+# Deprecated fields (if still used)
+rerank_dimension = 0
+rerank_transformer_type = ""
+
+
+  # Vector quantization settings for embeddings
+  [embedding.quantization_settings]
+    quantization_type = "FP32"
+    # (Additional quantization parameters can be added here)
+
+[completion_embedding]
+# Generally this should be the same as the embedding config, but advanced users may want to run with a different provider to reduce latency
+provider = "litellm"
+base_model = "openai/text-embedding-3-large"
+#base_model = "dashscope/text-embedding-v3"
+base_dimension = 256
+batch_size = 128
+add_title_as_prefix = false
+concurrent_request_limit = 256
+
+[file]
+provider = "postgres"
+# If using S3
+bucket_name = ""
+endpoint_url = ""
+region_name = ""
+aws_access_key_id = ""
+aws_secret_access_key = ""
+
+[ingestion]
+provider = "r2r"
+chunking_strategy = "recursive"
+chunk_size = 800
+chunk_overlap = 400
+excluded_parsers = ["mp4"]
+
+
+
+# Ingestion-time document summary parameters
+# skip_document_summary = False
+# document_summary_system_prompt = 'default_system'
+# document_summary_task_prompt = 'default_summary'
+# chunks_for_document_summary = 128
+document_summary_model = "openai/gpt-4o-mini"
+vision_img_model = "openai/gpt-4o-mini"
+vision_pdf_model = "openai/gpt-4o-mini"
+automatic_extraction = false # enable automatic extraction of entities and relations
+vlm_batch_size=20
+vlm_max_tokens_to_sample=1024
+max_concurrent_vlm_tasks=20
+vlm_ocr_one_page_per_chunk = true
+# Audio transcription and vision model settings
+audio_transcription_model = ""
+skip_document_summary = false
+document_summary_system_prompt = "system"
+document_summary_task_prompt = "summary"
+document_summary_max_length = 100000
+chunks_for_document_summary = 128
+# document_summary_model = ""  # duplicate of the key set above; repeating a key in the same table is invalid TOML
+parser_overrides = {}
+
+
+  # Chunk enrichment settings
+  [ingestion.chunk_enrichment_settings]
+    chunk_enrichment_prompt = "chunk_enrichment"
+    enable_chunk_enrichment = false
+    n_chunks = 2
+
+#  [ingestion.chunk_enrichment_settings]
+#    enable_chunk_enrichment = false # disabled by default
+#    n_chunks = 2 # the number of chunks (both preceding and succeeding) to use in enrichment
+#    generation_config = { model = "openai/gpt-4o-mini" }
+
+  [ingestion.extra_parsers]
+    pdf = ["ocr", "zerox"]
+    #pdf = "ocr"
+
+[logging]
+provider = "r2r"
+log_table = "logs"
+log_info_table = "log_info"
+
+[ocr]
+provider = "mistral"
+model = "mistral-ocr-latest"
+
+################################################################################
+# Orchestration Settings (OrchestrationConfig)
+################################################################################
+[orchestration]
+provider = "no"
+#max_runs = 2048
+#kg_creation_concurrency_limit = 32
+#ingestion_concurrency_limit = 16
+#kg_concurrency_limit = 4
+
+[prompt]
+provider = "r2r"
+
+[email]
+provider = "console_mock"
+
+[scheduler]
+provider = "apscheduler"

+ 285 - 0
py/all_possible_config.toml.bak

@@ -0,0 +1,285 @@
+[app]
+# App settings are globally available, e.g. `r2r_config.agent.app`
+# project_name = "r2r_default" # optional, can also set with `R2R_PROJECT_NAME` env var
+default_max_documents_per_user = 1_000_000
+default_max_chunks_per_user = 10_000_000
+default_max_collections_per_user = 1_000_000
+
+# Set the default max upload size to 2 GB for local testing
+default_max_upload_size = 2147483648  # 2 GB for anything not explicitly listed
+
+# LLM used for internal operations, like deriving conversation names
+fast_llm = "openai/gpt-4o-mini"
+
+# LLM used for user-facing output, like RAG replies
+quality_llm = "openai/gpt-4o-mini"
+
+# LLM used for ingesting visual inputs
+vlm = "openai/gpt-4o-mini"
+
+# LLM used for transcription
+audio_lm = "openai/whisper-1"
+
+# Reasoning model, used for `research` agent
+reasoning_llm = "openai/o3-mini"
+# Planning model, used for `research` agent
+planning_llm = "anthropic/claude-3-7-sonnet-20250219"
+
+
+
+  [app.max_upload_size_by_type]
+    # Common text-based formats
+    txt   = 2147483648  # 2 GB
+    md    = 2147483648
+    tsv   = 2147483648
+    csv   = 2147483648
+    xml   = 2147483648
+    html  = 2147483648
+
+    # Office docs
+    doc   = 2147483648
+    docx  = 2147483648
+    ppt   = 2147483648
+    pptx  = 2147483648
+    xls   = 2147483648
+    xlsx  = 2147483648
+    odt   = 2147483648
+
+    # PDFs
+    pdf   = 2147483648
+
+    # E-mail
+    eml   = 2147483648
+    msg   = 2147483648
+    p7s   = 2147483648
+
+    # Images
+    bmp   = 2147483648
+    heic  = 2147483648
+    jpeg  = 2147483648
+    jpg   = 2147483648
+    png   = 2147483648
+    tiff  = 2147483648
+
+    # E-books and other formats
+    epub  = 2147483648
+    rtf   = 2147483648
+    rst   = 2147483648
+    org   = 2147483648
+
+[agent]
+rag_agent_static_prompt = "static_rag_agent"
+rag_agent_dynamic_prompt = "dynamic_rag_agent"
+#tools = ["search_file_knowledge", "content"]
+rag_tools = ["search_file_descriptions", "search_file_knowledge", "get_file_content"] # can add  "web_search" | "web_scrape"
+# The following tools are available to the `research` agent
+research_tools = ["rag", "reasoning", "critique", "python_executor"]
+
+
+# tool_names = ["local_search", "web_search"] # uncomment to enable web search
+#tool_names = ["local_search"]
+
+#  [agent.generation_config]
+#  model = "openai/gpt-4o"
+
+[auth]
+provider = "r2r"
+access_token_lifetime_in_minutes = 600000 # set a very high default value, for easier testing
+refresh_token_lifetime_in_days = 70
+require_authentication = false
+require_email_verification = false
+default_admin_email = "xujiawei@cocorobo.cc"
+default_admin_password = "usestudio-1"
+
+[completion]
+provider = "r2r"
+concurrent_request_limit = 256
+#fast_llm = "openai/gpt-4o-mini"
+
+  [completion.generation_config]
+  #model = "openai/gpt-4o-mini"
+  temperature = 0.1
+  top_p = 1.0
+  max_tokens_to_sample = 1024
+  stream = false
+  add_generation_kwargs = { }
+
+[crypto]
+provider = "bcrypt"
+
+[database]
+provider = "postgres"
+default_collection_name = "Default"
+default_collection_description = "Your default collection."
+# collection_summary_system_prompt = 'default_system'
+# collection_summary_task_prompt = 'default_collection_summary'
+
+# KG settings
+batch_size = 64
+
+
+  # PostgreSQL tuning settings
+  [database.postgres_configuration_settings]
+    checkpoint_completion_target = 0.9
+    default_statistics_target = 100
+    effective_io_concurrency = 1
+    effective_cache_size = 524288
+    huge_pages = "try"
+    maintenance_work_mem = 65536
+    max_connections = 256
+    max_parallel_workers_per_gather = 2
+    max_parallel_workers = 8
+    max_parallel_maintenance_workers = 2
+    max_wal_size = 1024
+    max_worker_processes = 8
+    min_wal_size = 80
+    shared_buffers = 16384
+    statement_cache_size = 100
+    random_page_cost = 4.0
+    wal_buffers = 256
+    work_mem = 4096
+
+  # Graph creation settings
+  [database.graph_creation_settings]
+    graph_entity_description_prompt = "graph_entity_description"
+    graph_extraction_prompt = "graph_extraction"
+    entity_types = []
+    relation_types = []
+    automatic_deduplication = false
+
+  # Graph enrichment settings
+  [database.graph_enrichment_settings]
+    graph_communities_prompt = "graph_communities"
+
+  # (Optional) Graph search settings – add fields as needed
+  [database.graph_search_settings]
+    # e.g., search_mode = "default"
+
+  # Rate limiting settings
+  [database.limits]
+    global_per_min = 60
+    route_per_min = 20
+    monthly_limit = 10000
+
+  # Route-specific limits (empty by default)
+  [database.route_limits]
+    # e.g., "/api/search" = { global_per_min = 30, route_per_min = 10, monthly_limit = 5000 }
+
+  # User-specific limits (empty by default)
+  [database.user_limits]
+    # e.g., "user_uuid_here" = { global_per_min = 20, route_per_min = 5, monthly_limit = 2000 }
+
+  [database.maintenance]
+    vacuum_schedule = "0 3 * * *"  # Run at 3:00 AM daily
+
+
+[embedding]
+provider = "litellm"
+
+# For basic applications, use `openai/text-embedding-3-small` with `base_dimension = 512`
+
+# RECOMMENDED - For advanced applications,
+# use `openai/text-embedding-3-large` with `base_dimension = 3072` and binary quantization
+#base_model = "openai/text-embedding-3-small"
+#base_dimension = 512
+
+#base_model = "openai/text-embedding-3-large"
+
+base_model = "openai/text-embedding-v3"
+
+base_dimension = 256
+
+rerank_model = ""
+rerank_url = ""
+
+# rerank_model = "huggingface/mixedbread-ai/mxbai-rerank-large-v1" # reranking model
+
+batch_size = 32
+prefixes = {}   # Provide prefix overrides here if needed
+add_title_as_prefix = false
+concurrent_request_limit = 2560
+max_retries = 3
+initial_backoff = 1.0
+max_backoff = 64.0
+# Deprecated fields (if still used)
+rerank_dimension = 0
+rerank_transformer_type = ""
+
+
+  # Vector quantization settings for embeddings
+  [embedding.quantization_settings]
+    quantization_type = "FP32"
+    # (Additional quantization parameters can be added here)
+
+[completion_embedding]
+# Generally this should be the same as the embedding config, but advanced users may want to run with a different provider to reduce latency
+provider = "litellm"
+base_model = "openai/text-embedding-v3"
+base_dimension = 256
+batch_size = 128
+add_title_as_prefix = false
+concurrent_request_limit = 256
+
+[file]
+provider = "postgres"
+
+[ingestion]
+provider = "r2r"
+chunking_strategy = "recursive"
+chunk_size = 800
+chunk_overlap = 400
+excluded_parsers = ["mp4"]
+
+
+
+# Ingestion-time document summary parameters
+# skip_document_summary = False
+# document_summary_system_prompt = 'default_system'
+# document_summary_task_prompt = 'default_summary'
+# chunks_for_document_summary = 128
+document_summary_model = "openai/gpt-4o-mini"
+vision_img_model = "openai/gpt-4o-mini"
+vision_pdf_model = "openai/gpt-4o-mini"
+automatic_extraction = false # enable automatic extraction of entities and relations
+parser_overrides = {}
+
+
+  # Chunk enrichment settings
+  [ingestion.chunk_enrichment_settings]
+    chunk_enrichment_prompt = "chunk_enrichment"
+    enable_chunk_enrichment = false
+    n_chunks = 2
+
+#  [ingestion.chunk_enrichment_settings]
+#    enable_chunk_enrichment = false # disabled by default
+#    n_chunks = 2 # the number of chunks (both preceding and succeeding) to use in enrichment
+#    generation_config = { model = "openai/gpt-4o-mini" }
+
+  [ingestion.extra_parsers]
+    pdf = ["ocr", "zerox"]
+    #pdf = "ocr"
+
+[logging]
+provider = "r2r"
+log_table = "logs"
+log_info_table = "log_info"
+
+[ocr]
+provider = "mistral"
+model = "mistral-ocr-latest"
+
+[orchestration]
+provider = "no"
+#max_runs = 2048
+#kg_creation_concurrency_limit = 32
+#ingestion_concurrency_limit = 16
+#kg_concurrency_limit = 4
+
+[prompt]
+provider = "r2r"
+
+[email]
+provider = "console_mock"
+
+[scheduler]
+provider = "apscheduler"

+ 176 - 0
py/core/__init__.py

@@ -0,0 +1,176 @@
+import logging
+
+# Keep '*' imports for enhanced development velocity
+from .agent import *
+from .base import *
+from .main import *
+from .parsers import *
+from .providers import *
+
+logger = logging.getLogger()
+logger.setLevel(logging.INFO)
+
+# Create a console handler and set the level to info
+ch = logging.StreamHandler()
+ch.setLevel(logging.INFO)
+
+# Create a formatter and set it for the handler
+formatter = logging.Formatter(
+    "%(asctime)s - %(levelname)s - %(name)s - %(message)s"
+)
+ch.setFormatter(formatter)
+
+# Add the handler to the logger
+logger.addHandler(ch)
+
+# Optional: Prevent propagation to the root logger
+logger.propagate = False
+
+logging.getLogger("httpx").setLevel(logging.WARNING)
+logging.getLogger("LiteLLM").setLevel(logging.WARNING)
+
+__all__ = [
+    "ThinkingEvent",
+    "ToolCallEvent",
+    "ToolResultEvent",
+    "CitationEvent",
+    "Citation",
+    "R2RAgent",
+    "SearchResultsCollector",
+    "R2RRAGAgent",
+    "R2RXMLToolsRAGAgent",
+    "R2RStreamingRAGAgent",
+    "R2RXMLToolsStreamingRAGAgent",
+    "AsyncSyncMeta",
+    "syncable",
+    "MessageType",
+    "Document",
+    "DocumentChunk",
+    "DocumentResponse",
+    "IngestionStatus",
+    "GraphExtractionStatus",
+    "GraphConstructionStatus",
+    "DocumentType",
+    "R2RDocumentProcessingError",
+    "R2RException",
+    "Entity",
+    "GraphExtraction",
+    "Relationship",
+    "GenerationConfig",
+    "LLMChatCompletion",
+    "LLMChatCompletionChunk",
+    "RAGCompletion",
+    "Prompt",
+    "AggregateSearchResult",
+    "WebSearchResult",
+    "GraphSearchResult",
+    "ChunkSearchSettings",
+    "GraphSearchSettings",
+    "ChunkSearchResult",
+    "WebPageSearchResult",
+    "SearchSettings",
+    "select_search_filters",
+    "SearchMode",
+    "HybridSearchSettings",
+    "Token",
+    "TokenData",
+    "Vector",
+    "VectorEntry",
+    "VectorType",
+    "IndexConfig",
+    "Agent",
+    "AgentConfig",
+    "Conversation",
+    "Message",
+    "TokenResponse",
+    "User",
+    "AppConfig",
+    "Provider",
+    "ProviderConfig",
+    "AuthConfig",
+    "AuthProvider",
+    "CryptoConfig",
+    "CryptoProvider",
+    "EmailConfig",
+    "EmailProvider",
+    "LimitSettings",
+    "DatabaseConfig",
+    "DatabaseProvider",
+    "EmbeddingConfig",
+    "EmbeddingProvider",
+    "CompletionConfig",
+    "CompletionProvider",
+    "RecursiveCharacterTextSplitter",
+    "TextSplitter",
+    "generate_id",
+    "validate_uuid",
+    "yield_sse_event",
+    "convert_nonserializable_objects",
+    "num_tokens",
+    "num_tokens_from_messages",
+    "SearchResultsCollector",
+    "R2RProviders",
+    "R2RApp",
+    "R2RBuilder",
+    "R2RConfig",
+    "R2RProviderFactory",
+    "AuthService",
+    "IngestionService",
+    "MaintenanceService",
+    "ManagementService",
+    "RetrievalService",
+    "GraphService",
+    "AudioParser",
+    "BMPParser",
+    "DOCParser",
+    "DOCXParser",
+    "ImageParser",
+    "ODTParser",
+    "OCRPDFParser",
+    "VLMPDFParser",
+    "BasicPDFParser",
+    "PDFParserUnstructured",
+    "PPTParser",
+    "PPTXParser",
+    "RTFParser",
+    "CSVParser",
+    "CSVParserAdvanced",
+    "EMLParser",
+    "EPUBParser",
+    "JSONParser",
+    "MSGParser",
+    "ORGParser",
+    "P7SParser",
+    "RSTParser",
+    "TSVParser",
+    "XLSParser",
+    "XLSXParser",
+    "XLSXParserAdvanced",
+    "MDParser",
+    "HTMLParser",
+    "TextParser",
+    "PythonParser",
+    "JavaScriptParser",
+    "TypeScriptParser",
+    "CSSParser",
+    "SupabaseAuthProvider",
+    "R2RAuthProvider",
+    "JwtAuthProvider",
+    "ClerkAuthProvider",
+    # Email
+    # Crypto
+    "BCryptCryptoProvider",
+    "BcryptCryptoConfig",
+    "NaClCryptoConfig",
+    "NaClCryptoProvider",
+    "PostgresDatabaseProvider",
+    "LiteLLMEmbeddingProvider",
+    "OpenAIEmbeddingProvider",
+    "OllamaEmbeddingProvider",
+    "OpenAICompletionProvider",
+    "R2RCompletionProvider",
+    "LiteLLMCompletionProvider",
+    "UnstructuredIngestionProvider",
+    "R2RIngestionProvider",
+    "ChunkingStrategy",
+]

+ 36 - 0
py/core/agent/__init__.py

@@ -0,0 +1,36 @@
+# FIXME: Once the agent is properly type annotated, remove the type: ignore comments
+from .base import (  # type: ignore
+    R2RAgent,
+    R2RStreamingAgent,
+    R2RXMLStreamingAgent,
+)
+from .rag import (  # type: ignore
+    R2RRAGAgent,
+    R2RStreamingRAGAgent,
+    R2RXMLToolsRAGAgent,
+    R2RXMLToolsStreamingRAGAgent,
+)
+
+# Import the concrete implementations
+from .research import (
+    R2RResearchAgent,
+    R2RStreamingResearchAgent,
+    R2RXMLToolsResearchAgent,
+    R2RXMLToolsStreamingResearchAgent,
+)
+
+__all__ = [
+    # Base
+    "R2RAgent",
+    "R2RStreamingAgent",
+    "R2RXMLStreamingAgent",
+    # RAG Agents
+    "R2RRAGAgent",
+    "R2RXMLToolsRAGAgent",
+    "R2RStreamingRAGAgent",
+    "R2RXMLToolsStreamingRAGAgent",
+    "R2RResearchAgent",
+    "R2RStreamingResearchAgent",
+    "R2RXMLToolsResearchAgent",
+    "R2RXMLToolsStreamingResearchAgent",
+]

+ 1484 - 0
py/core/agent/base.py

@@ -0,0 +1,1484 @@
+import asyncio
+import json
+import logging
+import re
+from abc import ABCMeta
+from typing import AsyncGenerator, Optional, Tuple
+
+from core.base import AsyncSyncMeta, LLMChatCompletion, Message, syncable
+from core.base.agent import Agent, Conversation
+from core.utils import (
+    CitationTracker,
+    SearchResultsCollector,
+    SSEFormatter,
+    convert_nonserializable_objects,
+    dump_obj,
+    find_new_citation_spans,
+)
+
+logger = logging.getLogger()
+
+
+class CombinedMeta(AsyncSyncMeta, ABCMeta):
+    pass
+
+
+def sync_wrapper(async_gen):
+    loop = asyncio.get_event_loop()
+
+    def wrapper():
+        try:
+            while True:
+                try:
+                    yield loop.run_until_complete(async_gen.__anext__())
+                except StopAsyncIteration:
+                    break
+        finally:
+            loop.run_until_complete(async_gen.aclose())
+
+    return wrapper()
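A standalone sketch of the sync-over-async bridging pattern that `sync_wrapper` above relies on: drive an async generator from synchronous code by pumping each item through an event loop. The `sync_iter` / `count_up` names here are illustrative, not part of the codebase.

```python
import asyncio


def sync_iter(async_gen, loop):
    """Yield items from `async_gen` synchronously using `loop`."""
    try:
        while True:
            try:
                # Advance the async generator one step inside the loop
                yield loop.run_until_complete(async_gen.__anext__())
            except StopAsyncIteration:
                break
    finally:
        # Always close the async generator to run its cleanup code
        loop.run_until_complete(async_gen.aclose())


async def count_up(n):
    for i in range(n):
        await asyncio.sleep(0)  # simulate an await point
        yield i


loop = asyncio.new_event_loop()
items = list(sync_iter(count_up(3), loop))
loop.close()
# items == [0, 1, 2]
```

Unlike the module-level `sync_wrapper`, this sketch creates and closes its own event loop, which avoids depending on `asyncio.get_event_loop()` behavior that varies across Python versions.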
+
+
+class R2RAgent(Agent, metaclass=CombinedMeta):
+    def __init__(self, *args, **kwargs):
+        self.search_results_collector = SearchResultsCollector()
+        super().__init__(*args, **kwargs)
+        self._reset()
+
+    async def _generate_llm_summary(self, iterations_count: int) -> str:
+        """
+        Generate a summary of the conversation using the LLM when max iterations are exceeded.
+
+        Args:
+            iterations_count: The number of iterations that were completed
+
+        Returns:
+            A string containing the LLM-generated summary
+        """
+        try:
+            # Get all messages in the conversation
+            all_messages = await self.conversation.get_messages()
+
+            # Create a prompt for the LLM to summarize
+            summary_prompt = {
+                "role": "user",
+                "content": (
+                    f"The conversation has reached the maximum limit of {iterations_count} iterations "
+                    f"without completing the task. Please provide a concise summary of: "
+                    f"1) The key information you've gathered that's relevant to the original query, "
+                    f"2) What you've attempted so far and why it's incomplete, and "
+                    f"3) A specific recommendation for how to proceed. "
+                    f"Keep your summary brief (3-4 sentences total) and focused on the most valuable insights. If it is possible to answer the original user query, then do so now instead."
+                    f"Start with '⚠️ **Maximum iterations exceeded**'"
+                ),
+            }
+
+            # Create a new message list with just the conversation history and summary request
+            summary_messages = all_messages + [summary_prompt]
+
+            # Get a completion for the summary
+            generation_config = self.get_generation_config(summary_prompt)
+            response = await self.llm_provider.aget_completion(
+                summary_messages,
+                generation_config,
+            )
+
+            return response.choices[0].message.content
+        except Exception as e:
+            logger.error(f"Error generating LLM summary: {str(e)}")
+            # Fall back to basic summary if LLM generation fails
+            return (
+                "⚠️ **Maximum iterations exceeded**\n\n"
+                "The agent reached the maximum iteration limit without completing the task. "
+                "Consider breaking your request into smaller steps or refining your query."
+            )
+
+    def _reset(self):
+        self._completed = False
+        self.conversation = Conversation()
+
+    @syncable
+    async def arun(
+        self,
+        messages: list[Message],
+        system_instruction: Optional[str] = None,
+        *args,
+        **kwargs,
+    ) -> list[dict]:
+        self._reset()
+        await self._setup(system_instruction)
+
+        if messages:
+            for message in messages:
+                await self.conversation.add_message(message)
+        iterations_count = 0
+        while (
+            not self._completed
+            and iterations_count < self.config.max_iterations
+        ):
+            iterations_count += 1
+            messages_list = await self.conversation.get_messages()
+            generation_config = self.get_generation_config(messages_list[-1])
+            response = await self.llm_provider.aget_completion(
+                messages_list,
+                generation_config,
+            )
+            logger.debug(f"R2RAgent response: {response}")
+            await self.process_llm_response(response, *args, **kwargs)
+
+        if not self._completed:
+            # Generate a summary of the conversation using the LLM
+            summary = await self._generate_llm_summary(iterations_count)
+            await self.conversation.add_message(
+                Message(role="assistant", content=summary)
+            )
+
+        # Return final content
+        all_messages: list[dict] = await self.conversation.get_messages()
+        all_messages.reverse()
+
+        output_messages = []
+        for message_2 in all_messages:
+            if message_2.get("content") != messages[-1].content:
+                output_messages.append(message_2)
+            else:
+                break
+        output_messages.reverse()
+
+        return output_messages
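The trimming step above can be sketched in isolation: walk the full history from newest to oldest, keep messages until the last user input is reached, then restore chronological order. The function and variable names here are illustrative only.

```python
def trim_to_new_messages(history: list[dict], last_input_content: str) -> list[dict]:
    """Return only the messages produced after the last user input."""
    out = []
    # Scan newest-to-oldest; stop at the message matching the last input
    for msg in reversed(history):
        if msg.get("content") == last_input_content:
            break
        out.append(msg)
    # Restore chronological order
    out.reverse()
    return out


history = [
    {"role": "system", "content": "You are helpful."},
    {"role": "user", "content": "What is 2+2?"},
    {"role": "assistant", "content": "4"},
]
new_msgs = trim_to_new_messages(history, "What is 2+2?")
# new_msgs == [{"role": "assistant", "content": "4"}]
```

Note that, like the original, this matches on message content rather than role, so it assumes the last input's content does not reappear verbatim in a later assistant message.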
+
+    async def process_llm_response(
+        self, response: LLMChatCompletion, *args, **kwargs
+    ) -> None:
+        if not self._completed:
+            message = response.choices[0].message
+            finish_reason = response.choices[0].finish_reason
+
+            if finish_reason == "stop":
+                self._completed = True
+
+            # Determine which provider we're using
+            using_anthropic = (
+                "anthropic" in self.rag_generation_config.model.lower()
+            )
+
+            # OPENAI HANDLING
+            if not using_anthropic:
+                if message.tool_calls:
+                    assistant_msg = Message(
+                        role="assistant",
+                        content="",
+                        tool_calls=[msg.dict() for msg in message.tool_calls],
+                    )
+                    await self.conversation.add_message(assistant_msg)
+
+                    # If there are multiple tool_calls, call them sequentially here
+                    for tool_call in message.tool_calls:
+                        await self.handle_function_or_tool_call(
+                            tool_call.function.name,
+                            tool_call.function.arguments,
+                            *args,
+                            tool_id=tool_call.id,
+                            **kwargs,
+                        )
+                else:
+                    await self.conversation.add_message(
+                        Message(role="assistant", content=message.content)
+                    )
+                    self._completed = True
+
+            else:
+                # First handle thinking blocks if present
+                if (
+                    hasattr(message, "structured_content")
+                    and message.structured_content
+                ):
+                    # Check if structured_content contains any tool_use blocks
+                    has_tool_use = any(
+                        block.get("type") == "tool_use"
+                        for block in message.structured_content
+                    )
+
+                    if not has_tool_use and message.tool_calls:
+                        # structured_content has thinking but no tool_use blocks,
+                        # yet tool_calls exist: add the thinking as its own message
+                        assistant_msg = Message(
+                            role="assistant",
+                            structured_content=message.structured_content,  # Use structured_content field
+                        )
+                        await self.conversation.add_message(assistant_msg)
+
+                        # Add explicit tool_use blocks in a separate message
+                        tool_uses = []
+                        for tool_call in message.tool_calls:
+                            # Safely parse arguments if they're a string
+                            try:
+                                if isinstance(
+                                    tool_call.function.arguments, str
+                                ):
+                                    input_args = json.loads(
+                                        tool_call.function.arguments
+                                    )
+                                else:
+                                    input_args = tool_call.function.arguments
+                            except json.JSONDecodeError:
+                                logger.error(
+                                    f"Failed to parse tool arguments: {tool_call.function.arguments}"
+                                )
+                                input_args = {
+                                    "_raw": tool_call.function.arguments
+                                }
+
+                            tool_uses.append(
+                                {
+                                    "type": "tool_use",
+                                    "id": tool_call.id,
+                                    "name": tool_call.function.name,
+                                    "input": input_args,
+                                }
+                            )
+
+                        # Add tool_use blocks as a separate assistant message with structured content
+                        if tool_uses:
+                            await self.conversation.add_message(
+                                Message(
+                                    role="assistant",
+                                    structured_content=tool_uses,
+                                    content="",
+                                )
+                            )
+                    else:
+                        # If it already has tool_use or no tool_calls, preserve original structure
+                        assistant_msg = Message(
+                            role="assistant",
+                            structured_content=message.structured_content,
+                        )
+                        await self.conversation.add_message(assistant_msg)
+
+                elif message.content:
+                    # For regular text content
+                    await self.conversation.add_message(
+                        Message(role="assistant", content=message.content)
+                    )
+
+                    # If there are tool calls, add them as structured content
+                    if message.tool_calls:
+                        tool_uses = []
+                        for tool_call in message.tool_calls:
+                            # Same safe parsing as above
+                            try:
+                                if isinstance(
+                                    tool_call.function.arguments, str
+                                ):
+                                    input_args = json.loads(
+                                        tool_call.function.arguments
+                                    )
+                                else:
+                                    input_args = tool_call.function.arguments
+                            except json.JSONDecodeError:
+                                logger.error(
+                                    f"Failed to parse tool arguments: {tool_call.function.arguments}"
+                                )
+                                input_args = {
+                                    "_raw": tool_call.function.arguments
+                                }
+
+                            tool_uses.append(
+                                {
+                                    "type": "tool_use",
+                                    "id": tool_call.id,
+                                    "name": tool_call.function.name,
+                                    "input": input_args,
+                                }
+                            )
+
+                        await self.conversation.add_message(
+                            Message(
+                                role="assistant", structured_content=tool_uses
+                            )
+                        )
+
+                # NEW CASE: Handle tool_calls with no content or structured_content
+                elif message.tool_calls:
+                    # Create tool_uses for the message with only tool_calls
+                    tool_uses = []
+                    for tool_call in message.tool_calls:
+                        try:
+                            if isinstance(tool_call.function.arguments, str):
+                                input_args = json.loads(
+                                    tool_call.function.arguments
+                                )
+                            else:
+                                input_args = tool_call.function.arguments
+                        except json.JSONDecodeError:
+                            logger.error(
+                                f"Failed to parse tool arguments: {tool_call.function.arguments}"
+                            )
+                            input_args = {"_raw": tool_call.function.arguments}
+
+                        tool_uses.append(
+                            {
+                                "type": "tool_use",
+                                "id": tool_call.id,
+                                "name": tool_call.function.name,
+                                "input": input_args,
+                            }
+                        )
+
+                    # Add tool_use blocks as a message before processing tools
+                    if tool_uses:
+                        await self.conversation.add_message(
+                            Message(
+                                role="assistant",
+                                structured_content=tool_uses,
+                            )
+                        )
+
+                # Process the tool calls
+                if message.tool_calls:
+                    for tool_call in message.tool_calls:
+                        await self.handle_function_or_tool_call(
+                            tool_call.function.name,
+                            tool_call.function.arguments,
+                            *args,
+                            tool_id=tool_call.id,
+                            **kwargs,
+                        )
+
+
+class R2RStreamingAgent(R2RAgent):
+    """
+    Base class for all streaming agents with core streaming functionality.
+    Supports emitting messages, tool calls, and results as SSE events.
+    """
+
+    # These two regexes will detect bracket references and then find short IDs.
+    BRACKET_PATTERN = re.compile(r"\[([^\]]+)\]")
+    SHORT_ID_PATTERN = re.compile(
+        r"[A-Za-z0-9]{7,8}"
+    )  # matches 7-8 character alphanumeric short IDs
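A quick illustration of how the two class-level regexes above work together: `BRACKET_PATTERN` pulls bracketed references out of generated text, and `SHORT_ID_PATTERN` then extracts the 7-8 character alphanumeric short IDs from each reference. The sample text and IDs are made up for the demonstration.

```python
import re

BRACKET_PATTERN = re.compile(r"\[([^\]]+)\]")
SHORT_ID_PATTERN = re.compile(r"[A-Za-z0-9]{7,8}")

text = "As shown in [abc1234] and [ref: Zx9k8Qw2], the claim holds."

# First find bracketed references, then scan each for short IDs
short_ids = [
    sid
    for bracket in BRACKET_PATTERN.findall(text)
    for sid in SHORT_ID_PATTERN.findall(bracket)
]
# short_ids == ["abc1234", "Zx9k8Qw2"]
```

Tokens shorter than 7 characters (like `ref` above) are ignored, which keeps ordinary bracketed words from being mistaken for citation IDs.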
+
+    def __init__(self, *args, **kwargs):
+        # Force streaming on
+        if hasattr(kwargs.get("config", {}), "stream"):
+            kwargs["config"].stream = True
+        super().__init__(*args, **kwargs)
+
+    async def arun(
+        self,
+        system_instruction: str | None = None,
+        messages: list[Message] | None = None,
+        *args,
+        **kwargs,
+    ) -> AsyncGenerator[str, None]:
+        """
+        Main streaming entrypoint: returns an async generator of SSE lines.
+        """
+        self._reset()
+        await self._setup(system_instruction)
+
+        if messages:
+            for m in messages:
+                await self.conversation.add_message(m)
+
+        # Initialize citation tracker for this run
+        citation_tracker = CitationTracker()
+
+        # Dictionary to store citation payloads by ID
+        citation_payloads = {}
+
+        # Track all citations emitted during streaming for final persistence
+        self.streaming_citations: list[dict] = []
+
+        async def sse_generator() -> AsyncGenerator[str, None]:
+            pending_tool_calls = {}
+            partial_text_buffer = ""
+            iterations_count = 0
+
+            try:
+                # Keep streaming until we complete
+                while (
+                    not self._completed
+                    and iterations_count < self.config.max_iterations
+                ):
+                    iterations_count += 1
+                    # 1) Get current messages
+                    msg_list = await self.conversation.get_messages()
+                    gen_cfg = self.get_generation_config(
+                        msg_list[-1], stream=True
+                    )
+
+                    accumulated_thinking = ""
+                    thinking_signatures = {}  # Map thinking content to signatures
+
+                    # 2) Start streaming from LLM
+                    llm_stream = self.llm_provider.aget_completion_stream(
+                        msg_list, gen_cfg
+                    )
+                    async for chunk in llm_stream:
+                        delta = chunk.choices[0].delta
+                        finish_reason = chunk.choices[0].finish_reason
+
+                        if hasattr(delta, "thinking") and delta.thinking:
+                            # Accumulate thinking for later use in messages
+                            accumulated_thinking += delta.thinking
+
+                            # Emit SSE "thinking" event
+                            async for (
+                                line
+                            ) in SSEFormatter.yield_thinking_event(
+                                delta.thinking
+                            ):
+                                yield line
+
+                        # Add this new handler for thinking signatures
+                        if hasattr(delta, "thinking_signature"):
+                            thinking_signatures[accumulated_thinking] = (
+                                delta.thinking_signature
+                            )
+                            accumulated_thinking = ""
+
+                        # 3) If new text, accumulate it
+                        if delta.content:
+                            partial_text_buffer += delta.content
+
+                            # (a) Now emit the newly streamed text as a "message" event
+                            async for line in SSEFormatter.yield_message_event(
+                                delta.content
+                            ):
+                                yield line
+
+                            # (b) Find new citation spans in the accumulated text
+                            new_citation_spans = find_new_citation_spans(
+                                partial_text_buffer, citation_tracker
+                            )
+
+                            # Process each new citation span
+                            for cid, spans in new_citation_spans.items():
+                                for span in spans:
+                                    # Check if this is the first time we've seen this citation ID
+                                    is_new_citation = (
+                                        citation_tracker.is_new_citation(cid)
+                                    )
+
+                                    # Get payload if it's a new citation
+                                    payload = None
+                                    if is_new_citation:
+                                        source_obj = self.search_results_collector.find_by_short_id(
+                                            cid
+                                        )
+                                        if source_obj:
+                                            # Store payload for reuse
+                                            payload = dump_obj(source_obj)
+                                            citation_payloads[cid] = payload
+
+                                    # Create citation event payload
+                                    citation_data = {
+                                        "id": cid,
+                                        "object": "citation",
+                                        "is_new": is_new_citation,
+                                        "span": {
+                                            "start": span[0],
+                                            "end": span[1],
+                                        },
+                                    }
+
+                                    # Only include full payload for new citations
+                                    if is_new_citation and payload:
+                                        citation_data["payload"] = payload
+
+                                    # Add to streaming citations for final answer
+                                    self.streaming_citations.append(
+                                        citation_data
+                                    )
+
+                                    # Emit the citation event
+                                    async for (
+                                        line
+                                    ) in SSEFormatter.yield_citation_event(
+                                        citation_data
+                                    ):
+                                        yield line
+
+                        if delta.tool_calls:
+                            for tc in delta.tool_calls:
+                                idx = tc.index
+                                if idx not in pending_tool_calls:
+                                    pending_tool_calls[idx] = {
+                                        "id": tc.id,
+                                        "name": tc.function.name or "",
+                                        "arguments": tc.function.arguments
+                                        or "",
+                                    }
+                                else:
+                                    # Accumulate partial name/arguments
+                                    if tc.function.name:
+                                        pending_tool_calls[idx]["name"] = (
+                                            tc.function.name
+                                        )
+                                    if tc.function.arguments:
+                                        pending_tool_calls[idx][
+                                            "arguments"
+                                        ] += tc.function.arguments
+
+                        # 5) If the stream signals we should handle "tool_calls"
+                        if finish_reason == "tool_calls":
+                            # Handle thinking if present
+                            await self._handle_thinking(
+                                thinking_signatures, accumulated_thinking
+                            )
+
+                            calls_list = []
+                            for idx in sorted(pending_tool_calls.keys()):
+                                cinfo = pending_tool_calls[idx]
+                                calls_list.append(
+                                    {
+                                        "tool_call_id": cinfo["id"]
+                                        or f"call_{idx}",
+                                        "name": cinfo["name"],
+                                        "arguments": cinfo["arguments"],
+                                    }
+                                )
+
+                            # (a) Emit SSE "tool_call" events
+                            for c in calls_list:
+                                tc_data = self._create_tool_call_data(c)
+                                async for (
+                                    line
+                                ) in SSEFormatter.yield_tool_call_event(
+                                    tc_data
+                                ):
+                                    yield line
+
+                            # (b) Add an assistant message capturing these calls
+                            await self._add_tool_calls_message(
+                                calls_list, partial_text_buffer
+                            )
+
+                            # (c) Execute each tool call in parallel
+                            await asyncio.gather(
+                                *[
+                                    self.handle_function_or_tool_call(
+                                        c["name"],
+                                        c["arguments"],
+                                        tool_id=c["tool_call_id"],
+                                    )
+                                    for c in calls_list
+                                ]
+                            )
+
+                            # Reset buffer & calls
+                            pending_tool_calls.clear()
+                            partial_text_buffer = ""
+
+                        elif finish_reason == "stop":
+                            # Handle thinking if present
+                            await self._handle_thinking(
+                                thinking_signatures, accumulated_thinking
+                            )
+
+                            # 6) The LLM is done. If we have any leftover partial text,
+                            #    finalize it in the conversation
+                            if partial_text_buffer:
+                                # Create the final message with metadata including citations
+                                final_message = Message(
+                                    role="assistant",
+                                    content=partial_text_buffer,
+                                    metadata={
+                                        "citations": self.streaming_citations
+                                    },
+                                )
+
+                                # Add it to the conversation
+                                await self.conversation.add_message(
+                                    final_message
+                                )
+
+                            # (a) Prepare final answer with optimized citations
+                            consolidated_citations = []
+                            # Group citations by ID with all their spans
+                            for (
+                                cid,
+                                spans,
+                            ) in citation_tracker.get_all_spans().items():
+                                if cid in citation_payloads:
+                                    consolidated_citations.append(
+                                        {
+                                            "id": cid,
+                                            "object": "citation",
+                                            "spans": [
+                                                {"start": s[0], "end": s[1]}
+                                                for s in spans
+                                            ],
+                                            "payload": citation_payloads[cid],
+                                        }
+                                    )
+
+                            # Create final answer payload
+                            final_evt_payload = {
+                                "id": "msg_final",
+                                "object": "agent.final_answer",
+                                "generated_answer": partial_text_buffer,
+                                "citations": consolidated_citations,
+                            }
+
+                            # Emit final answer event
+                            async for (
+                                line
+                            ) in SSEFormatter.yield_final_answer_event(
+                                final_evt_payload
+                            ):
+                                yield line
+
+                            # (b) Signal the end of the SSE stream
+                            yield SSEFormatter.yield_done_event()
+                            self._completed = True
+                            break
+
+                # If we exit the while loop due to hitting max iterations
+                if not self._completed:
+                    # Generate a summary using the LLM
+                    summary = await self._generate_llm_summary(
+                        iterations_count
+                    )
+
+                    # Send the summary as a message event
+                    async for line in SSEFormatter.yield_message_event(
+                        summary
+                    ):
+                        yield line
+
+                    # Add summary to conversation with citations metadata
+                    await self.conversation.add_message(
+                        Message(
+                            role="assistant",
+                            content=summary,
+                            metadata={"citations": self.streaming_citations},
+                        )
+                    )
+
+                    # Create and emit a final answer payload with the summary.
+                    # Use the citations collected during streaming here:
+                    # `consolidated_citations` is only built in the
+                    # normal-completion branch above and would be unbound
+                    # when max iterations are hit.
+                    final_evt_payload = {
+                        "id": "msg_final",
+                        "object": "agent.final_answer",
+                        "generated_answer": summary,
+                        "citations": self.streaming_citations,
+                    }
+
+                    async for line in SSEFormatter.yield_final_answer_event(
+                        final_evt_payload
+                    ):
+                        yield line
+
+                    # Signal the end of the SSE stream
+                    yield SSEFormatter.yield_done_event()
+                    self._completed = True
+
+            except Exception as e:
+                logger.error(f"Error in streaming agent: {str(e)}")
+                # Emit error event for client
+                async for line in SSEFormatter.yield_error_event(
+                    f"Agent error: {str(e)}"
+                ):
+                    yield line
+                # Send done event to close the stream
+                yield SSEFormatter.yield_done_event()
+
+        # Finally, drive the generator and re-yield each SSE line
+        async for line in sse_generator():
+            yield line
+
+    async def _handle_thinking(
+        self, thinking_signatures, accumulated_thinking
+    ):
+        """Process any accumulated thinking content"""
+        if accumulated_thinking:
+            structured_content = [
+                {
+                    "type": "thinking",
+                    "thinking": accumulated_thinking,
+                    # Anthropic will validate this in their API
+                    "signature": "placeholder_signature",
+                }
+            ]
+
+            assistant_msg = Message(
+                role="assistant",
+                structured_content=structured_content,
+            )
+            await self.conversation.add_message(assistant_msg)
+
+        elif thinking_signatures:
+            for (
+                accumulated_thinking,
+                thinking_signature,
+            ) in thinking_signatures.items():
+                structured_content = [
+                    {
+                        "type": "thinking",
+                        "thinking": accumulated_thinking,
+                        # Anthropic will validate this in their API
+                        "signature": thinking_signature,
+                    }
+                ]
+
+                assistant_msg = Message(
+                    role="assistant",
+                    structured_content=structured_content,
+                )
+                await self.conversation.add_message(assistant_msg)
+
+    async def _add_tool_calls_message(self, calls_list, partial_text_buffer):
+        """Add a message with tool calls to the conversation"""
+        assistant_msg = Message(
+            role="assistant",
+            content=partial_text_buffer or "",
+            tool_calls=[
+                {
+                    "id": c["tool_call_id"],
+                    "type": "function",
+                    "function": {
+                        "name": c["name"],
+                        "arguments": c["arguments"],
+                    },
+                }
+                for c in calls_list
+            ],
+        )
+        await self.conversation.add_message(assistant_msg)
+
+    def _create_tool_call_data(self, call_info):
+        """Create tool call data structure from call info"""
+        return {
+            "tool_call_id": call_info["tool_call_id"],
+            "name": call_info["name"],
+            "arguments": call_info["arguments"],
+        }
+
+    def _create_citation_payload(self, short_id, payload):
+        """Create citation payload for a short ID"""
+        # This will be overridden in RAG subclasses
+        # Normalize the payload to a plain dict if it exposes a converter
+        if hasattr(payload, "as_dict"):
+            payload = payload.as_dict()
+        elif hasattr(payload, "dict"):
+            payload = payload.dict()
+        elif hasattr(payload, "to_dict"):
+            payload = payload.to_dict()
+
+        return {
+            "id": f"{short_id}",
+            "object": "citation",
+            "payload": dump_obj(payload),  # Will be populated in RAG agents
+        }
+
+    def _create_final_answer_payload(self, answer_text, citations):
+        """Create the final answer payload"""
+        # This will be extended in RAG subclasses
+        return {
+            "id": "msg_final",
+            "object": "agent.final_answer",
+            "generated_answer": answer_text,
+            "citations": citations,
+        }
+
+
+class R2RXMLStreamingAgent(R2RStreamingAgent):
+    """
+    A streaming agent that parses XML-formatted responses with special handling for:
+     - <think> or <Thought> blocks for chain-of-thought reasoning
+     - <Action>, <ToolCalls>, <ToolCall> blocks for tool execution
+    """
+
+    # We treat <think> and <Thought> as interchangeable thought-block delimiters
+    THOUGHT_OPEN = re.compile(r"<(Thought|think)>", re.IGNORECASE)
+    THOUGHT_CLOSE = re.compile(r"</(Thought|think)>", re.IGNORECASE)
+
+    # Regexes to parse out <Action>, <ToolCalls>, <ToolCall>, <Name>, <Parameters>, <Response>
+    ACTION_PATTERN = re.compile(
+        r"<Action>(.*?)</Action>", re.IGNORECASE | re.DOTALL
+    )
+    TOOLCALLS_PATTERN = re.compile(
+        r"<ToolCalls>(.*?)</ToolCalls>", re.IGNORECASE | re.DOTALL
+    )
+    TOOLCALL_PATTERN = re.compile(
+        r"<ToolCall>(.*?)</ToolCall>", re.IGNORECASE | re.DOTALL
+    )
+    NAME_PATTERN = re.compile(r"<Name>(.*?)</Name>", re.IGNORECASE | re.DOTALL)
+    PARAMS_PATTERN = re.compile(
+        r"<Parameters>(.*?)</Parameters>", re.IGNORECASE | re.DOTALL
+    )
+    RESPONSE_PATTERN = re.compile(
+        r"<Response>(.*?)</Response>", re.IGNORECASE | re.DOTALL
+    )
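+
+    # The patterns above target model output shaped like the following
+    # (illustrative sketch; the tool name and parameters are examples only):
+    #   <Action>
+    #     <ToolCalls>
+    #       <ToolCall>
+    #         <Name>web_search</Name>
+    #         <Parameters>{"query": "..."}</Parameters>
+    #       </ToolCall>
+    #     </ToolCalls>
+    #   </Action>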
+
+    async def arun(
+        self,
+        system_instruction: str | None = None,
+        messages: list[Message] | None = None,
+        *args,
+        **kwargs,
+    ) -> AsyncGenerator[str, None]:
+        """
+        Main streaming entrypoint: returns an async generator of SSE lines.
+        """
+        self._reset()
+        await self._setup(system_instruction)
+
+        if messages:
+            for m in messages:
+                await self.conversation.add_message(m)
+
+        # Initialize citation tracker for this run
+        citation_tracker = CitationTracker()
+
+        # Dictionary to store citation payloads by ID
+        citation_payloads = {}
+
+        # Track all citations emitted during streaming for final persistence
+        self.streaming_citations: list[dict] = []
+
+        async def sse_generator() -> AsyncGenerator[str, None]:
+            iterations_count = 0
+            # Initialized up front so the max-iterations fallback below can
+            # reference it even when no final-answer branch has run
+            consolidated_citations: list[dict] = []
+
+            try:
+                # Keep streaming until we complete
+                while (
+                    not self._completed
+                    and iterations_count < self.config.max_iterations
+                ):
+                    iterations_count += 1
+                    # 1) Get current messages
+                    msg_list = await self.conversation.get_messages()
+                    gen_cfg = self.get_generation_config(
+                        msg_list[-1], stream=True
+                    )
+
+                    # 2) Start streaming from LLM
+                    llm_stream = self.llm_provider.aget_completion_stream(
+                        msg_list, gen_cfg
+                    )
+
+                    # Create state variables for each iteration
+                    iteration_buffer = ""
+                    yielded_first_event = False
+                    in_action_block = False
+                    is_thinking = False
+                    accumulated_thinking = ""
+                    thinking_signatures = {}
+
+                    async for chunk in llm_stream:
+                        delta = chunk.choices[0].delta
+                        finish_reason = chunk.choices[0].finish_reason
+
+                        # Handle thinking if present
+                        if hasattr(delta, "thinking") and delta.thinking:
+                            # Accumulate thinking for later use in messages
+                            accumulated_thinking += delta.thinking
+
+                            # Emit SSE "thinking" event
+                            async for (
+                                line
+                            ) in SSEFormatter.yield_thinking_event(
+                                delta.thinking
+                            ):
+                                yield line
+
+                        # Capture any signature emitted for the thinking text
+                        if (
+                            hasattr(delta, "thinking_signature")
+                            and delta.thinking_signature
+                        ):
+                            thinking_signatures[accumulated_thinking] = (
+                                delta.thinking_signature
+                            )
+                            accumulated_thinking = ""
+
+                        # 3) If new text, accumulate it
+                        if delta.content:
+                            iteration_buffer += delta.content
+
+                            # Check if we have accumulated enough text for a `<Thought>` block
+                            if len(iteration_buffer) < len("<Thought>"):
+                                continue
+
+                            # Check if we have yielded the first event
+                            if not yielded_first_event:
+                                # Emit the first chunk
+                                if self.THOUGHT_OPEN.findall(iteration_buffer):
+                                    is_thinking = True
+                                    async for (
+                                        line
+                                    ) in SSEFormatter.yield_thinking_event(
+                                        iteration_buffer
+                                    ):
+                                        yield line
+                                else:
+                                    async for (
+                                        line
+                                    ) in SSEFormatter.yield_message_event(
+                                        iteration_buffer
+                                    ):
+                                        yield line
+
+                                # Mark as yielded
+                                yielded_first_event = True
+                                continue
+
+                            # Check if we are in a thinking block
+                            if is_thinking:
+                                # Still thinking, so keep yielding thinking events
+                                if not self.THOUGHT_CLOSE.findall(
+                                    iteration_buffer
+                                ):
+                                    # Emit SSE "thinking" event
+                                    async for (
+                                        line
+                                    ) in SSEFormatter.yield_thinking_event(
+                                        delta.content
+                                    ):
+                                        yield line
+
+                                    continue
+                                # Done thinking, so emit the last thinking event
+                                else:
+                                    is_thinking = False
+                                    thought_text = delta.content.split(
+                                        "</Thought>"
+                                    )[0].split("</think>")[0]
+                                    async for (
+                                        line
+                                    ) in SSEFormatter.yield_thinking_event(
+                                        thought_text
+                                    ):
+                                        yield line
+                                    post_thought_text = delta.content.split(
+                                        "</Thought>"
+                                    )[-1].split("</think>")[-1]
+                                    delta.content = post_thought_text
+
+                            # Find new citation spans in the accumulated text
+                            new_citation_spans = find_new_citation_spans(
+                                iteration_buffer, citation_tracker
+                            )
+
+                            # Process each new citation span
+                            for cid, spans in new_citation_spans.items():
+                                for span in spans:
+                                    # Check if this is the first time we've seen this citation ID
+                                    is_new_citation = (
+                                        citation_tracker.is_new_citation(cid)
+                                    )
+
+                                    # Get payload if it's a new citation
+                                    payload = None
+                                    if is_new_citation:
+                                        source_obj = self.search_results_collector.find_by_short_id(
+                                            cid
+                                        )
+                                        if source_obj:
+                                            # Store payload for reuse
+                                            payload = dump_obj(source_obj)
+                                            citation_payloads[cid] = payload
+
+                                    # Create citation event payload
+                                    citation_data = {
+                                        "id": cid,
+                                        "object": "citation",
+                                        "is_new": is_new_citation,
+                                        "span": {
+                                            "start": span[0],
+                                            "end": span[1],
+                                        },
+                                    }
+
+                                    # Only include full payload for new citations
+                                    if is_new_citation and payload:
+                                        citation_data["payload"] = payload
+
+                                    # Add to streaming citations for final answer
+                                    self.streaming_citations.append(
+                                        citation_data
+                                    )
+
+                                    # Emit the citation event
+                                    async for (
+                                        line
+                                    ) in SSEFormatter.yield_citation_event(
+                                        citation_data
+                                    ):
+                                        yield line
+
+                            # Now prepare to emit the newly streamed text as a "message" event
+                            if (
+                                iteration_buffer.count("<")
+                                and not in_action_block
+                            ):
+                                in_action_block = True
+
+                            if (
+                                in_action_block
+                                and len(
+                                    self.ACTION_PATTERN.findall(
+                                        iteration_buffer
+                                    )
+                                )
+                                < 2
+                            ):
+                                continue
+
+                            elif in_action_block:
+                                in_action_block = False
+                                # Emit any text following the action block
+                                post_action_text = iteration_buffer.split(
+                                    "</Action>"
+                                )[-1]
+                                if post_action_text:
+                                    async for (
+                                        line
+                                    ) in SSEFormatter.yield_message_event(
+                                        post_action_text
+                                    ):
+                                        yield line
+
+                            else:
+                                async for (
+                                    line
+                                ) in SSEFormatter.yield_message_event(
+                                    delta.content
+                                ):
+                                    yield line
+
+                        elif finish_reason == "stop":
+                            break
+
+                    # Process any accumulated thinking
+                    await self._handle_thinking(
+                        thinking_signatures, accumulated_thinking
+                    )
+
+                    # 4) The LLM is done. If we have any leftover partial
+                    #    text, finalize it in the conversation
+                    if iteration_buffer:
+                        # Create the final message with metadata including citations
+                        final_message = Message(
+                            role="assistant",
+                            content=iteration_buffer,
+                            metadata={"citations": self.streaming_citations},
+                        )
+
+                        # Add it to the conversation
+                        await self.conversation.add_message(final_message)
+
+                    # 5) Process any <Action>/<ToolCalls> blocks, or mark completed
+                    action_matches = self.ACTION_PATTERN.findall(
+                        iteration_buffer
+                    )
+
+                    if len(action_matches) > 0:
+                        # Process each ToolCall
+                        xml_toolcalls = "<ToolCalls>"
+
+                        for action_block in action_matches:
+                            tool_calls_text = []
+                            # Look for ToolCalls wrapper, or use the raw action block
+                            calls_wrapper = self.TOOLCALLS_PATTERN.findall(
+                                action_block
+                            )
+                            if calls_wrapper:
+                                for tw in calls_wrapper:
+                                    tool_calls_text.append(tw)
+                            else:
+                                tool_calls_text.append(action_block)
+
+                            for calls_region in tool_calls_text:
+                                calls_found = self.TOOLCALL_PATTERN.findall(
+                                    calls_region
+                                )
+                                for tc_block in calls_found:
+                                    tool_name, tool_params = (
+                                        self._parse_single_tool_call(tc_block)
+                                    )
+                                    if tool_name:
+                                        # Emit SSE event for tool call
+                                        tool_call_id = (
+                                            f"call_{abs(hash(tc_block))}"
+                                        )
+                                        call_evt_data = {
+                                            "tool_call_id": tool_call_id,
+                                            "name": tool_name,
+                                            "arguments": json.dumps(
+                                                tool_params
+                                            ),
+                                        }
+                                        async for line in (
+                                            SSEFormatter.yield_tool_call_event(
+                                                call_evt_data
+                                            )
+                                        ):
+                                            yield line
+
+                                        try:
+                                            tool_result = await self.handle_function_or_tool_call(
+                                                tool_name,
+                                                json.dumps(tool_params),
+                                                tool_id=tool_call_id,
+                                                save_messages=False,
+                                            )
+                                            result_content = tool_result.llm_formatted_result
+                                        except Exception as e:
+                                            result_content = f"Error in tool '{tool_name}': {str(e)}"
+
+                                        xml_toolcalls += (
+                                            f"<ToolCall>"
+                                            f"<Name>{tool_name}</Name>"
+                                            f"<Parameters>{json.dumps(tool_params)}</Parameters>"
+                                            f"<Result>{result_content}</Result>"
+                                            f"</ToolCall>"
+                                        )
+
+                                        # Emit the SSE tool result event
+                                        result_data = {
+                                            "tool_call_id": tool_call_id,
+                                            "role": "tool",
+                                            "content": json.dumps(
+                                                convert_nonserializable_objects(
+                                                    result_content
+                                                )
+                                            ),
+                                        }
+                                        async for line in SSEFormatter.yield_tool_result_event(
+                                            result_data
+                                        ):
+                                            yield line
+
+                        xml_toolcalls += "</ToolCalls>"
+                        pre_action_text = iteration_buffer[
+                            : iteration_buffer.find(action_block)
+                        ]
+                        post_action_text = iteration_buffer[
+                            iteration_buffer.find(action_block)
+                            + len(action_block) :
+                        ]
+                        iteration_text = (
+                            pre_action_text + xml_toolcalls + post_action_text
+                        )
+
+                        # Update the conversation with tool results
+                        await self.conversation.add_message(
+                            Message(
+                                role="assistant",
+                                content=iteration_text,
+                                metadata={
+                                    "citations": self.streaming_citations
+                                },
+                            )
+                        )
+                    else:
+                        # (a) Prepare final answer with optimized citations
+                        consolidated_citations = []
+                        # Group citations by ID with all their spans
+                        for (
+                            cid,
+                            spans,
+                        ) in citation_tracker.get_all_spans().items():
+                            if cid in citation_payloads:
+                                consolidated_citations.append(
+                                    {
+                                        "id": cid,
+                                        "object": "citation",
+                                        "spans": [
+                                            {"start": s[0], "end": s[1]}
+                                            for s in spans
+                                        ],
+                                        "payload": citation_payloads[cid],
+                                    }
+                                )
+
+                        # Create final answer payload
+                        final_evt_payload = {
+                            "id": "msg_final",
+                            "object": "agent.final_answer",
+                            "generated_answer": iteration_buffer,
+                            "citations": consolidated_citations,
+                        }
+
+                        # Emit final answer event
+                        async for (
+                            line
+                        ) in SSEFormatter.yield_final_answer_event(
+                            final_evt_payload
+                        ):
+                            yield line
+
+                        # (b) Signal the end of the SSE stream
+                        yield SSEFormatter.yield_done_event()
+                        self._completed = True
+
+                # If we exit the while loop due to hitting max iterations
+                if not self._completed:
+                    # Generate a summary using the LLM
+                    summary = await self._generate_llm_summary(
+                        iterations_count
+                    )
+
+                    # Send the summary as a message event
+                    async for line in SSEFormatter.yield_message_event(
+                        summary
+                    ):
+                        yield line
+
+                    # Add summary to conversation with citations metadata
+                    await self.conversation.add_message(
+                        Message(
+                            role="assistant",
+                            content=summary,
+                            metadata={"citations": self.streaming_citations},
+                        )
+                    )
+
+                    # Create and emit a final answer payload with the summary
+                    final_evt_payload = {
+                        "id": "msg_final",
+                        "object": "agent.final_answer",
+                        "generated_answer": summary,
+                        "citations": consolidated_citations,
+                    }
+
+                    async for line in SSEFormatter.yield_final_answer_event(
+                        final_evt_payload
+                    ):
+                        yield line
+
+                    # Signal the end of the SSE stream
+                    yield SSEFormatter.yield_done_event()
+                    self._completed = True
+
+            except Exception as e:
+                logger.error(f"Error in streaming agent: {str(e)}")
+                # Emit error event for client
+                async for line in SSEFormatter.yield_error_event(
+                    f"Agent error: {str(e)}"
+                ):
+                    yield line
+                # Send done event to close the stream
+                yield SSEFormatter.yield_done_event()
+
+        # Finally, drive the generator and re-yield each SSE line
+        async for line in sse_generator():
+            yield line
+
+    def _parse_single_tool_call(
+        self, toolcall_text: str
+    ) -> tuple[Optional[str], dict]:
+        """
+        Parse a ToolCall block to extract the name and parameters.
+
+        Args:
+            toolcall_text: The text content of a ToolCall block
+
+        Returns:
+            Tuple of (tool_name, tool_parameters)
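+
+        Example (illustrative; ``agent`` is any instance of this class)::
+
+            >>> agent._parse_single_tool_call(
+            ...     '<Name>web_search</Name>'
+            ...     '<Parameters>{"query": "r2r"}</Parameters>'
+            ... )
+            ('web_search', {'query': 'r2r'})
+
+        Parameters that fail JSON parsing fall back to a plain value::
+
+            >>> agent._parse_single_tool_call(
+            ...     '<Name>echo</Name><Parameters>hello</Parameters>'
+            ... )
+            ('echo', {'value': 'hello'})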
+        """
+        name_match = self.NAME_PATTERN.search(toolcall_text)
+        if not name_match:
+            return None, {}
+        tool_name = name_match.group(1).strip()
+
+        params_match = self.PARAMS_PATTERN.search(toolcall_text)
+        if not params_match:
+            return tool_name, {}
+
+        raw_params = params_match.group(1).strip()
+        try:
+            # Handle potential JSON parsing issues
+            # First try direct parsing
+            tool_params = json.loads(raw_params)
+        except json.JSONDecodeError:
+            # If that fails, try to clean up the JSON string
+            try:
+                # Replace escaped quotes that might cause issues
+                cleaned_params = raw_params.replace('\\"', '"')
+                # Try again with the cleaned string
+                tool_params = json.loads(cleaned_params)
+            except json.JSONDecodeError:
+                # If all else fails, treat as a plain string value
+                tool_params = {"value": raw_params}
+
+        return tool_name, tool_params
+
+
+class R2RXMLToolsAgent(R2RAgent):
+    """
+    A non-streaming agent that:
+     - parses <think> or <Thought> blocks as chain-of-thought
+     - filters out XML tags related to tool calls and actions
+     - processes <Action><ToolCalls><ToolCall> blocks
+     - properly extracts citations when they appear in the text
+    """
+
+    # We treat <think> and <Thought> as interchangeable thought-block delimiters
+    THOUGHT_OPEN = re.compile(r"<(Thought|think)>", re.IGNORECASE)
+    THOUGHT_CLOSE = re.compile(r"</(Thought|think)>", re.IGNORECASE)
+
+    # Regexes to parse out <Action>, <ToolCalls>, <ToolCall>, <Name>, <Parameters>, <Response>
+    ACTION_PATTERN = re.compile(
+        r"<Action>(.*?)</Action>", re.IGNORECASE | re.DOTALL
+    )
+    TOOLCALLS_PATTERN = re.compile(
+        r"<ToolCalls>(.*?)</ToolCalls>", re.IGNORECASE | re.DOTALL
+    )
+    TOOLCALL_PATTERN = re.compile(
+        r"<ToolCall>(.*?)</ToolCall>", re.IGNORECASE | re.DOTALL
+    )
+    NAME_PATTERN = re.compile(r"<Name>(.*?)</Name>", re.IGNORECASE | re.DOTALL)
+    PARAMS_PATTERN = re.compile(
+        r"<Parameters>(.*?)</Parameters>", re.IGNORECASE | re.DOTALL
+    )
+    RESPONSE_PATTERN = re.compile(
+        r"<Response>(.*?)</Response>", re.IGNORECASE | re.DOTALL
+    )
+
+    async def process_llm_response(self, response, *args, **kwargs):
+        """
+        Override the base process_llm_response to handle XML structured responses
+        including thoughts and tool calls.
+        """
+        if self._completed:
+            return
+
+        message = response.choices[0].message
+        finish_reason = response.choices[0].finish_reason
+
+        if not message.content:
+            # If there's no content, let the parent class handle the normal tool_calls flow
+            return await super().process_llm_response(
+                response, *args, **kwargs
+            )
+
+        # Get the response content
+        content = message.content
+
+        # HACK for Gemini: strip code fences it may emit around actions
+        content = content.replace("```action", "")
+        content = content.replace("```tool_code", "")
+        content = content.replace("```", "")
+
+        if (
+            not content.startswith("<")
+            and "deepseek" in self.rag_generation_config.model
+        ):  # HACK: prepend the opening <think> tag that DeepSeek omits
+            content = "<think>" + content
+
+        # Process any tool calls in the content
+        action_matches = self.ACTION_PATTERN.findall(content)
+        if action_matches:
+            xml_toolcalls = "<ToolCalls>"
+            for action_block in action_matches:
+                tool_calls_text = []
+                # Look for ToolCalls wrapper, or use the raw action block
+                calls_wrapper = self.TOOLCALLS_PATTERN.findall(action_block)
+                if calls_wrapper:
+                    for tw in calls_wrapper:
+                        tool_calls_text.append(tw)
+                else:
+                    tool_calls_text.append(action_block)
+
+                # Process each ToolCall
+                for calls_region in tool_calls_text:
+                    calls_found = self.TOOLCALL_PATTERN.findall(calls_region)
+                    for tc_block in calls_found:
+                        tool_name, tool_params = self._parse_single_tool_call(
+                            tc_block
+                        )
+                        if tool_name:
+                            tool_call_id = f"call_{abs(hash(tc_block))}"
+                            try:
+                                tool_result = (
+                                    await self.handle_function_or_tool_call(
+                                        tool_name,
+                                        json.dumps(tool_params),
+                                        tool_id=tool_call_id,
+                                        save_messages=False,
+                                    )
+                                )
+
+                                # Add tool result to XML
+                                xml_toolcalls += (
+                                    f"<ToolCall>"
+                                    f"<Name>{tool_name}</Name>"
+                                    f"<Parameters>{json.dumps(tool_params)}</Parameters>"
+                                    f"<Result>{tool_result.llm_formatted_result}</Result>"
+                                    f"</ToolCall>"
+                                )
+
+                            except Exception as e:
+                                logger.error(f"Error in tool call: {str(e)}")
+                                # Add error to XML
+                                xml_toolcalls += (
+                                    f"<ToolCall>"
+                                    f"<Name>{tool_name}</Name>"
+                                    f"<Parameters>{json.dumps(tool_params)}</Parameters>"
+                                    f"<Result>Error: {str(e)}</Result>"
+                                    f"</ToolCall>"
+                                )
+
+            xml_toolcalls += "</ToolCalls>"
+            pre_action_text = content[: content.find(action_block)]
+            post_action_text = content[
+                content.find(action_block) + len(action_block) :
+            ]
+            iteration_text = pre_action_text + xml_toolcalls + post_action_text
+
+            # Create the assistant message
+            await self.conversation.add_message(
+                Message(role="assistant", content=iteration_text)
+            )
+        else:
+            # Create an assistant message with the content as-is
+            await self.conversation.add_message(
+                Message(role="assistant", content=content)
+            )
+
+        # Only mark as completed when the finish_reason is "stop"; otherwise
+        # the agent continues the conversation after tool calls are processed
+        if finish_reason == "stop":
+            self._completed = True
+
+    def _parse_single_tool_call(
+        self, toolcall_text: str
+    ) -> Tuple[Optional[str], dict]:
+        """
+        Parse a ToolCall block to extract the name and parameters.
+
+        Args:
+            toolcall_text: The text content of a ToolCall block
+
+        Returns:
+            Tuple of (tool_name, tool_parameters)
+        """
+        name_match = self.NAME_PATTERN.search(toolcall_text)
+        if not name_match:
+            return None, {}
+        tool_name = name_match.group(1).strip()
+
+        params_match = self.PARAMS_PATTERN.search(toolcall_text)
+        if not params_match:
+            return tool_name, {}
+
+        raw_params = params_match.group(1).strip()
+        try:
+            # Handle potential JSON parsing issues
+            # First try direct parsing
+            tool_params = json.loads(raw_params)
+        except json.JSONDecodeError:
+            # If that fails, try to clean up the JSON string
+            try:
+                # Replace escaped quotes that might cause issues
+                cleaned_params = raw_params.replace('\\"', '"')
+                # Try again with the cleaned string
+                tool_params = json.loads(cleaned_params)
+            except json.JSONDecodeError:
+                # If all else fails, treat as a plain string value
+                tool_params = {"value": raw_params}
+
+        return tool_name, tool_params
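The `_parse_single_tool_call` logic above can be exercised in isolation. The sketch below is a standalone approximation: the `NAME_PATTERN` and `PARAMS_PATTERN` regexes are assumptions mirroring the class-level patterns referenced (but not shown) in the diff.

```python
import json
import re
from typing import Optional, Tuple

# Hypothetical patterns mirroring the class-level NAME_PATTERN / PARAMS_PATTERN
NAME_PATTERN = re.compile(r"<Name>(.*?)</Name>", re.DOTALL)
PARAMS_PATTERN = re.compile(r"<Parameters>(.*?)</Parameters>", re.DOTALL)


def parse_single_tool_call(toolcall_text: str) -> Tuple[Optional[str], dict]:
    """Extract (tool_name, params) from one <ToolCall> block, with JSON fallbacks."""
    name_match = NAME_PATTERN.search(toolcall_text)
    if not name_match:
        return None, {}
    tool_name = name_match.group(1).strip()

    params_match = PARAMS_PATTERN.search(toolcall_text)
    if not params_match:
        return tool_name, {}

    raw = params_match.group(1).strip()
    try:
        return tool_name, json.loads(raw)
    except json.JSONDecodeError:
        try:
            # Second chance: un-escape quotes that commonly break model output
            return tool_name, json.loads(raw.replace('\\"', '"'))
        except json.JSONDecodeError:
            # Last resort: wrap the raw text as a plain string value
            return tool_name, {"value": raw}


block = '<Name>web_search</Name><Parameters>{"query": "R2R agents"}</Parameters>'
print(parse_single_tool_call(block))  # ('web_search', {'query': 'R2R agents'})
```

Note the three-tier fallback: valid JSON, un-escaped JSON, then a `{"value": ...}` wrapper, so a malformed block degrades gracefully instead of raising.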

--- /dev/null
+++ b/py/core/agent/rag.py
@@ -0,0 +1,326 @@
+# type: ignore
+import logging
+from typing import Callable, Optional
+
+from core.base import (
+    format_search_results_for_llm,
+)
+from core.base.abstractions import (
+    AggregateSearchResult,
+    GenerationConfig,
+    SearchSettings,
+)
+from core.base.agent.tools.registry import ToolRegistry
+from core.base.providers import DatabaseProvider
+from core.providers import (
+    AnthropicCompletionProvider,
+    LiteLLMCompletionProvider,
+    OpenAICompletionProvider,
+    R2RCompletionProvider,
+)
+from core.utils import (
+    SearchResultsCollector,
+    num_tokens,
+)
+
+from ..base.agent.agent import RAGAgentConfig
+
+# Import the base classes from the refactored base file
+from .base import (
+    R2RAgent,
+    R2RStreamingAgent,
+    R2RXMLStreamingAgent,
+    R2RXMLToolsAgent,
+)
+
+logger = logging.getLogger(__name__)
+
+
+class RAGAgentMixin:
+    """
+    A Mixin for adding search_file_knowledge, web_search, and content tools
+    to your R2R Agents. This allows your agent to:
+      - call knowledge_search_method (semantic/hybrid search)
+      - call content_method (fetch entire doc/chunk structures)
+      - call an external web search API
+    """
+
+    def __init__(
+        self,
+        *args,
+        search_settings: SearchSettings,
+        knowledge_search_method: Callable,
+        content_method: Callable,
+        file_search_method: Callable,
+        max_tool_context_length=10_000,
+        max_context_window_tokens=512_000,
+        tool_registry: Optional[ToolRegistry] = None,
+        **kwargs,
+    ):
+        # Save references to the retrieval logic
+        self.search_settings = search_settings
+        self.knowledge_search_method = knowledge_search_method
+        self.content_method = content_method
+        self.file_search_method = file_search_method
+        self.max_tool_context_length = max_tool_context_length
+        self.max_context_window_tokens = max_context_window_tokens
+        self.search_results_collector = SearchResultsCollector()
+        self.tool_registry = tool_registry or ToolRegistry()
+
+        super().__init__(*args, **kwargs)
+
+    def _register_tools(self):
+        """
+        Register all requested tools from self.config.rag_tools using the ToolRegistry.
+        """
+        if not self.config.rag_tools:
+            logger.warning(
+                "No RAG tools requested. Skipping tool registration."
+            )
+            return
+
+        # Make sure tool_registry exists
+        if not hasattr(self, "tool_registry") or self.tool_registry is None:
+            self.tool_registry = ToolRegistry()
+
+        format_function = self.format_search_results_for_llm
+
+        for tool_name in set(self.config.rag_tools):
+            # Try to get the tools from the registry
+            if tool_instance := self.tool_registry.create_tool_instance(
+                tool_name, format_function, context=self
+            ):
+                logger.debug(
+                    f"Successfully registered tool from registry: {tool_name}"
+                )
+                self._tools.append(tool_instance)
+            else:
+                logger.warning(f"Unknown tool requested: {tool_name}")
+
+        logger.debug(f"Registered {len(self._tools)} RAG tools.")
+
+    def format_search_results_for_llm(
+        self, results: AggregateSearchResult
+    ) -> str:
+        context = format_search_results_for_llm(results)
+        context_tokens = num_tokens(context) + 1
+        frac_to_return = self.max_tool_context_length / (context_tokens)
+
+        if frac_to_return > 1:
+            return context
+        else:
+            return context[: int(frac_to_return * len(context))]
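The truncation in `format_search_results_for_llm` keeps a character prefix proportional to the token budget, assuming tokens are spread roughly evenly over characters. A minimal standalone sketch of the same idea (the `+ 1` guards against a zero token count):

```python
def truncate_to_token_budget(context: str, context_tokens: int, max_tokens: int) -> str:
    """Return the context unchanged if it fits the token budget; otherwise keep
    a proportional character prefix (assumes tokens are evenly distributed)."""
    frac = max_tokens / (context_tokens + 1)  # +1 guards against division by zero
    if frac > 1:
        return context
    return context[: int(frac * len(context))]
```

This avoids a second tokenizer pass at the cost of precision: the kept prefix may land slightly over or under the budget, which is acceptable for a soft context limit.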
+
+
+class R2RRAGAgent(RAGAgentMixin, R2RAgent):
+    """
+    Non-streaming RAG Agent that supports search_file_knowledge, content, web_search.
+    """
+
+    def __init__(
+        self,
+        database_provider: DatabaseProvider,
+        llm_provider: (
+            AnthropicCompletionProvider
+            | LiteLLMCompletionProvider
+            | OpenAICompletionProvider
+            | R2RCompletionProvider
+        ),
+        config: RAGAgentConfig,
+        search_settings: SearchSettings,
+        rag_generation_config: GenerationConfig,
+        knowledge_search_method: Callable,
+        content_method: Callable,
+        file_search_method: Callable,
+        tool_registry: Optional[ToolRegistry] = None,
+        max_tool_context_length: int = 20_000,
+    ):
+        # Initialize base R2RAgent
+        R2RAgent.__init__(
+            self,
+            database_provider=database_provider,
+            llm_provider=llm_provider,
+            config=config,
+            rag_generation_config=rag_generation_config,
+        )
+        self.tool_registry = tool_registry or ToolRegistry()
+        # Initialize the RAGAgentMixin
+        RAGAgentMixin.__init__(
+            self,
+            database_provider=database_provider,
+            llm_provider=llm_provider,
+            config=config,
+            search_settings=search_settings,
+            rag_generation_config=rag_generation_config,
+            max_tool_context_length=max_tool_context_length,
+            knowledge_search_method=knowledge_search_method,
+            file_search_method=file_search_method,
+            content_method=content_method,
+            tool_registry=tool_registry,
+        )
+
+        self._register_tools()
+
+
+class R2RXMLToolsRAGAgent(RAGAgentMixin, R2RXMLToolsAgent):
+    """
+    Non-streaming RAG Agent that parses XML-formatted tool calls and supports
+    search_file_knowledge, content, and web_search.
+    """
+
+    def __init__(
+        self,
+        database_provider: DatabaseProvider,
+        llm_provider: (
+            AnthropicCompletionProvider
+            | LiteLLMCompletionProvider
+            | OpenAICompletionProvider
+            | R2RCompletionProvider
+        ),
+        config: RAGAgentConfig,
+        search_settings: SearchSettings,
+        rag_generation_config: GenerationConfig,
+        knowledge_search_method: Callable,
+        content_method: Callable,
+        file_search_method: Callable,
+        tool_registry: Optional[ToolRegistry] = None,
+        max_tool_context_length: int = 20_000,
+    ):
+        # Initialize base R2RAgent
+        R2RXMLToolsAgent.__init__(
+            self,
+            database_provider=database_provider,
+            llm_provider=llm_provider,
+            config=config,
+            rag_generation_config=rag_generation_config,
+        )
+        self.tool_registry = tool_registry or ToolRegistry()
+        # Initialize the RAGAgentMixin
+        RAGAgentMixin.__init__(
+            self,
+            database_provider=database_provider,
+            llm_provider=llm_provider,
+            config=config,
+            search_settings=search_settings,
+            rag_generation_config=rag_generation_config,
+            max_tool_context_length=max_tool_context_length,
+            knowledge_search_method=knowledge_search_method,
+            file_search_method=file_search_method,
+            content_method=content_method,
+            tool_registry=tool_registry,
+        )
+
+        self._register_tools()
+
+
+class R2RStreamingRAGAgent(RAGAgentMixin, R2RStreamingAgent):
+    """
+    Streaming-capable RAG Agent that supports search_file_knowledge, content, web_search,
+    and emits citations as [abc1234] short IDs if the LLM includes them in brackets.
+    """
+
+    def __init__(
+        self,
+        database_provider: DatabaseProvider,
+        llm_provider: (
+            AnthropicCompletionProvider
+            | LiteLLMCompletionProvider
+            | OpenAICompletionProvider
+            | R2RCompletionProvider
+        ),
+        config: RAGAgentConfig,
+        search_settings: SearchSettings,
+        rag_generation_config: GenerationConfig,
+        knowledge_search_method: Callable,
+        content_method: Callable,
+        file_search_method: Callable,
+        tool_registry: Optional[ToolRegistry] = None,
+        max_tool_context_length: int = 10_000,
+    ):
+        # Force streaming on
+        config.stream = True
+
+        # Initialize base R2RStreamingAgent
+        R2RStreamingAgent.__init__(
+            self,
+            database_provider=database_provider,
+            llm_provider=llm_provider,
+            config=config,
+            rag_generation_config=rag_generation_config,
+        )
+        self.tool_registry = tool_registry or ToolRegistry()
+        # Initialize the RAGAgentMixin
+        RAGAgentMixin.__init__(
+            self,
+            database_provider=database_provider,
+            llm_provider=llm_provider,
+            config=config,
+            search_settings=search_settings,
+            rag_generation_config=rag_generation_config,
+            max_tool_context_length=max_tool_context_length,
+            knowledge_search_method=knowledge_search_method,
+            content_method=content_method,
+            file_search_method=file_search_method,
+            tool_registry=tool_registry,
+        )
+
+        self._register_tools()
+
+
+class R2RXMLToolsStreamingRAGAgent(RAGAgentMixin, R2RXMLStreamingAgent):
+    """
+    A streaming agent that:
+     - treats <think> or <Thought> blocks as chain-of-thought
+       and emits them incrementally as SSE "thinking" events.
+     - accumulates user-visible text outside those tags as SSE "message" events.
+     - filters out all XML tags related to tool calls and actions.
+     - upon finishing each iteration, it parses <Action><ToolCalls><ToolCall> blocks,
+       calls the appropriate tool, and emits SSE "tool_call" / "tool_result".
+     - properly emits citations when they appear in the text
+    """
+
+    def __init__(
+        self,
+        database_provider: DatabaseProvider,
+        llm_provider: (
+            AnthropicCompletionProvider
+            | LiteLLMCompletionProvider
+            | OpenAICompletionProvider
+            | R2RCompletionProvider
+        ),
+        config: RAGAgentConfig,
+        search_settings: SearchSettings,
+        rag_generation_config: GenerationConfig,
+        knowledge_search_method: Callable,
+        content_method: Callable,
+        file_search_method: Callable,
+        tool_registry: Optional[ToolRegistry] = None,
+        max_tool_context_length: int = 10_000,
+    ):
+        # Force streaming on
+        config.stream = True
+
+        # Initialize base R2RXMLStreamingAgent
+        R2RXMLStreamingAgent.__init__(
+            self,
+            database_provider=database_provider,
+            llm_provider=llm_provider,
+            config=config,
+            rag_generation_config=rag_generation_config,
+        )
+        self.tool_registry = tool_registry or ToolRegistry()
+        # Initialize the RAGAgentMixin
+        RAGAgentMixin.__init__(
+            self,
+            database_provider=database_provider,
+            llm_provider=llm_provider,
+            config=config,
+            search_settings=search_settings,
+            rag_generation_config=rag_generation_config,
+            max_tool_context_length=max_tool_context_length,
+            knowledge_search_method=knowledge_search_method,
+            content_method=content_method,
+            file_search_method=file_search_method,
+            tool_registry=tool_registry,
+        )
+
+        self._register_tools()
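Each agent class above initializes its bases explicitly (`R2RAgent.__init__(...)`, then `RAGAgentMixin.__init__(...)`) instead of relying on a single cooperative `super().__init__()` chain, because the mixin and the base take different keyword arguments. A minimal sketch of this pattern with stand-in classes (all names here are illustrative, not the real R2R types):

```python
class BaseAgent:
    def __init__(self, config, **kwargs):
        self.config = config


class SearchMixin:
    def __init__(self, *args, search_settings=None, **kwargs):
        self.search_settings = search_settings
        super().__init__(*args, **kwargs)  # forward the rest down the MRO


class RAGAgent(SearchMixin, BaseAgent):
    def __init__(self, config, search_settings):
        # Mirrors R2RRAGAgent above: the base is initialized first, then the
        # mixin, which re-forwards the shared kwargs down the MRO.
        BaseAgent.__init__(self, config=config)
        SearchMixin.__init__(self, config=config, search_settings=search_settings)


agent = RAGAgent(config={"stream": False}, search_settings={"limit": 5})
print(agent.config, agent.search_settings)
```

The trade-off: explicit calls make the initialization order obvious, but `BaseAgent.__init__` runs twice (once directly, once via the mixin's `super()` forwarding), which is harmless here because it only assigns attributes.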

--- /dev/null
+++ b/py/core/agent/research.py
@@ -0,0 +1,707 @@
+import logging
+import os
+import subprocess
+import sys
+import tempfile
+from copy import copy
+from typing import Any, Callable, Optional
+
+from core.base import AppConfig
+from core.base.abstractions import GenerationConfig, Message, SearchSettings
+from core.base.providers import DatabaseProvider
+from core.providers import (
+    AnthropicCompletionProvider,
+    LiteLLMCompletionProvider,
+    OpenAICompletionProvider,
+    R2RCompletionProvider,
+)
+from core.utils import extract_citations
+from shared.abstractions.tool import Tool
+
+from ..base.agent.agent import RAGAgentConfig  # type: ignore
+
+# Import the RAG agents we'll leverage
+from .rag import (  # type: ignore
+    R2RRAGAgent,
+    R2RStreamingRAGAgent,
+    R2RXMLToolsRAGAgent,
+    R2RXMLToolsStreamingRAGAgent,
+    RAGAgentMixin,
+)
+
+logger = logging.getLogger(__name__)
+
+
+class ResearchAgentMixin(RAGAgentMixin):
+    """
+    A mixin that extends RAGAgentMixin to add research capabilities to any R2R agent.
+
+    This mixin provides all RAG capabilities plus additional research tools:
+    - A RAG tool for knowledge retrieval (which leverages the underlying RAG capabilities)
+    - A Python execution tool for code execution and computation
+    - A reasoning tool for complex problem solving
+    - A critique tool for analyzing conversation history
+    """
+
+    def __init__(
+        self,
+        *args,
+        app_config: AppConfig,
+        search_settings: SearchSettings,
+        knowledge_search_method: Callable,
+        content_method: Callable,
+        file_search_method: Callable,
+        max_tool_context_length=10_000,
+        **kwargs,
+    ):
+        # Store the app configuration needed for research tools
+        self.app_config = app_config
+
+        # Call the parent RAGAgentMixin's __init__ with explicitly passed parameters
+        super().__init__(
+            *args,
+            search_settings=search_settings,
+            knowledge_search_method=knowledge_search_method,
+            content_method=content_method,
+            file_search_method=file_search_method,
+            max_tool_context_length=max_tool_context_length,
+            **kwargs,
+        )
+
+        # Register our research-specific tools
+        self._register_research_tools()
+
+    def _register_research_tools(self):
+        """
+        Register research-specific tools to the agent.
+        This is called by the mixin's __init__ after the parent class initialization.
+        """
+        # Add our research tools to whatever tools are already registered
+        research_tools = []
+        for tool_name in set(self.config.research_tools):
+            if tool_name == "rag":
+                research_tools.append(self.rag_tool())
+            elif tool_name == "reasoning":
+                research_tools.append(self.reasoning_tool())
+            elif tool_name == "critique":
+                research_tools.append(self.critique_tool())
+            elif tool_name == "python_executor":
+                research_tools.append(self.python_execution_tool())
+            else:
+                logger.warning(f"Unknown research tool: {tool_name}")
+                raise ValueError(f"Unknown research tool: {tool_name}")
+
+        logger.debug(f"Registered research tools: {research_tools}")
+        self.tools = research_tools
+
+    def rag_tool(self) -> Tool:
+        """Tool that provides access to the RAG agent's search capabilities."""
+        return Tool(
+            name="rag",
+            description=(
+                "Search for information using RAG (Retrieval-Augmented Generation). "
+                "This tool searches across relevant sources and returns comprehensive information. "
+                "Use this tool when you need to find specific information on any topic. Pose your query as a comprehensive, self-contained question."
+            ),
+            results_function=self._rag,
+            llm_format_function=self._format_search_results,
+            parameters={
+                "type": "object",
+                "properties": {
+                    "query": {
+                        "type": "string",
+                        "description": "The search query to find information.",
+                    }
+                },
+                "required": ["query"],
+            },
+            context=self,
+        )
+
+    def reasoning_tool(self) -> Tool:
+        """Tool that provides access to a strong reasoning model."""
+        return Tool(
+            name="reasoning",
+            description=(
+                "A dedicated reasoning system that excels at solving complex problems through step-by-step analysis. "
+                "This tool connects to a separate AI system optimized for deep analytical thinking.\n\n"
+                "USAGE GUIDELINES:\n"
+                "1. Formulate your request as a complete, standalone question to a reasoning expert.\n"
+                "2. Clearly state the problem/question at the beginning.\n"
+                "3. Provide all relevant context, data, and constraints.\n\n"
+                "IMPORTANT: This system has no memory of previous interactions or context from your conversation.\n\n"
+                "STRENGTHS: Mathematical reasoning, logical analysis, evaluating complex scenarios, "
+                "solving multi-step problems, and identifying potential errors in reasoning."
+            ),
+            results_function=self._reason,
+            llm_format_function=self._format_search_results,
+            parameters={
+                "type": "object",
+                "properties": {
+                    "query": {
+                        "type": "string",
+                        "description": "A complete, standalone question with all necessary context, appropriate for a dedicated reasoning system.",
+                    }
+                },
+                "required": ["query"],
+            },
+        )
+
+    def critique_tool(self) -> Tool:
+        """Tool that provides critical analysis of the reasoning done so far in the conversation."""
+        return Tool(
+            name="critique",
+            description=(
+                "Analyzes the conversation history to identify potential flaws, biases, and alternative "
+                "approaches to the reasoning presented so far.\n\n"
+                "Use this tool to get a second opinion on your reasoning, find overlooked considerations, "
+                "identify biases or fallacies, explore alternative hypotheses, and improve the robustness "
+                "of your conclusions."
+            ),
+            results_function=self._critique,
+            llm_format_function=self._format_search_results,
+            parameters={
+                "type": "object",
+                "properties": {
+                    "query": {
+                        "type": "string",
+                        "description": "A specific aspect of the reasoning you want critiqued, or leave empty for a general critique.",
+                    },
+                    "focus_areas": {
+                        "type": "array",
+                        "items": {"type": "string"},
+                        "description": "Optional specific areas to focus the critique (e.g., ['logical fallacies', 'methodology'])",
+                    },
+                },
+                "required": ["query"],
+            },
+        )
+
+    def python_execution_tool(self) -> Tool:
+        """Tool that provides Python code execution capabilities."""
+        return Tool(
+            name="python_executor",
+            description=(
+                "Executes Python code and returns the results, output, and any errors. "
+                "Use this tool for complex calculations, statistical operations, or algorithmic implementations.\n\n"
+                "The execution environment includes common libraries such as numpy, pandas, sympy, scipy, statsmodels, biopython, etc.\n\n"
+                "USAGE:\n"
+                "1. Send complete, executable Python code as a string.\n"
+                "2. Use print statements for output you want to see.\n"
+                "3. Assign to the 'result' variable for values you want to return.\n"
+                "4. Do not use input() or plotting (matplotlib). Output is text-based."
+            ),
+            results_function=self._execute_python_with_process_timeout,
+            llm_format_function=self._format_python_results,
+            parameters={
+                "type": "object",
+                "properties": {
+                    "code": {
+                        "type": "string",
+                        "description": "Python code to execute.",
+                    }
+                },
+                "required": ["code"],
+            },
+        )
+
+    async def _rag(
+        self,
+        query: str,
+        *args,
+        **kwargs,
+    ) -> str:
+        """Execute a search using an internal RAG agent."""
+        # Create a copy of the current configuration for the RAG agent
+        config_copy = copy(self.config)
+        config_copy.max_iterations = 10  # Could be configurable
+
+        # Always include critical web search tools
+        default_tools = ["web_search", "web_scrape"]
+
+        # Get the configured RAG tools from the original config
+        configured_tools = set(self.config.rag_tools or default_tools)
+
+        # Combine default tools with all configured tools, ensuring no duplicates
+        config_copy.rag_tools = list(
+            set(default_tools + list(configured_tools))
+        )
+
+        logger.debug(f"Using RAG tools: {config_copy.rag_tools}")
+
+        # Create a generation config for the RAG agent
+        generation_config = GenerationConfig(
+            model=self.app_config.quality_llm,
+            max_tokens_to_sample=16000,
+        )
+
+        # Create a new RAG agent - we'll use the non-streaming variant for consistent results
+        rag_agent = R2RRAGAgent(
+            database_provider=self.database_provider,
+            llm_provider=self.llm_provider,
+            config=config_copy,
+            search_settings=self.search_settings,
+            rag_generation_config=generation_config,
+            knowledge_search_method=self.knowledge_search_method,
+            content_method=self.content_method,
+            file_search_method=self.file_search_method,
+            max_tool_context_length=self.max_tool_context_length,
+        )
+
+        # Run the RAG agent with the query
+        user_message = Message(role="user", content=query)
+        response = await rag_agent.arun(messages=[user_message])
+
+        # Get the content from the response
+        structured_content = response[-1].get("structured_content")
+        if structured_content:
+            possible_text = structured_content[-1].get("text")
+            content = response[-1].get("content") or possible_text
+        else:
+            content = response[-1].get("content")
+
+        # Extract citations and transfer search results from RAG agent to research agent
+        short_ids = extract_citations(content)
+        if short_ids:
+            logger.info(f"Found citations in RAG response: {short_ids}")
+
+            for short_id in short_ids:
+                result = rag_agent.search_results_collector.find_by_short_id(
+                    short_id
+                )
+                if result:
+                    self.search_results_collector.add_result(result)
+
+            # Log confirmation for successful transfer
+            logger.info(
+                "Transferred search results from RAG agent to research agent for citations"
+            )
+        return content
+
+    async def _reason(
+        self,
+        query: str,
+        *args,
+        **kwargs,
+    ) -> str:
+        """Execute a reasoning query using a specialized reasoning LLM."""
+        msg_list = await self.conversation.get_messages()
+
+        # Create a specialized generation config for reasoning
+        gen_cfg = self.get_generation_config(msg_list[-1], stream=False)
+        gen_cfg.model = self.app_config.reasoning_llm
+        gen_cfg.top_p = None
+        gen_cfg.temperature = 0.1
+        gen_cfg.max_tokens_to_sample = 64000
+        gen_cfg.stream = False
+        gen_cfg.tools = None
+        gen_cfg.functions = None
+        gen_cfg.reasoning_effort = "high"
+        gen_cfg.add_generation_kwargs = None
+
+        # Call the LLM with the reasoning request
+        response = await self.llm_provider.aget_completion(
+            [{"role": "user", "content": query}], gen_cfg
+        )
+        return response.choices[0].message.content
+
+    async def _critique(
+        self,
+        query: str,
+        focus_areas: Optional[list] = None,
+        *args,
+        **kwargs,
+    ) -> str:
+        """Critique the conversation history."""
+        msg_list = await self.conversation.get_messages()
+        if not focus_areas:
+            focus_areas = []
+        # Build the critique prompt
+        critique_prompt = (
+            "You are a critical reasoning expert. Your task is to analyze the following conversation "
+            "and critique the reasoning. Look for:\n"
+            "1. Logical fallacies or inconsistencies\n"
+            "2. Cognitive biases\n"
+            "3. Overlooked questions or considerations\n"
+            "4. Alternative approaches\n"
+            "5. Improvements in rigor\n\n"
+        )
+
+        if focus_areas:
+            critique_prompt += f"Focus areas: {', '.join(focus_areas)}\n\n"
+
+        if query.strip():
+            critique_prompt += f"Specific question: {query}\n\n"
+
+        critique_prompt += (
+            "Structure your critique:\n"
+            "1. Summary\n"
+            "2. Key strengths\n"
+            "3. Potential issues\n"
+            "4. Alternatives\n"
+            "5. Recommendations\n\n"
+        )
+
+        # Add the conversation history to the prompt
+        conversation_text = "\n--- CONVERSATION HISTORY ---\n\n"
+        for msg in msg_list:
+            role = msg.get("role", "")
+            content = msg.get("content", "")
+            if content and role in ["user", "assistant", "system"]:
+                conversation_text += f"{role.upper()}: {content}\n\n"
+
+        final_prompt = critique_prompt + conversation_text
+
+        # Use the reasoning tool to process the critique
+        return await self._reason(final_prompt, *args, **kwargs)
+
+    async def _execute_python_with_process_timeout(
+        self, code: str, timeout: int = 10, *args, **kwargs
+    ) -> dict[str, Any]:
+        """
+        Executes Python code in a separate subprocess with a timeout.
+        This provides isolation and prevents re-importing the current agent module.
+
+        Parameters:
+          code (str): Python code to execute.
+          timeout (int): Timeout in seconds (default: 10).
+
+        Returns:
+          dict[str, Any]: Dictionary containing stdout, stderr, return code, etc.
+        """
+        # Write user code to a temporary file
+        with tempfile.NamedTemporaryFile(
+            mode="w", suffix=".py", delete=False
+        ) as tmp_file:
+            tmp_file.write(code)
+            script_path = tmp_file.name
+
+        try:
+            # Run the script in a fresh subprocess
+            result = subprocess.run(
+                [sys.executable, script_path],
+                capture_output=True,
+                text=True,
+                timeout=timeout,
+            )
+
+            return {
+                "result": None,  # We'll parse from stdout if needed
+                "stdout": result.stdout,
+                "stderr": result.stderr,
+                "error": (
+                    None
+                    if result.returncode == 0
+                    else {
+                        "type": "SubprocessError",
+                        "message": f"Process exited with code {result.returncode}",
+                        "traceback": "",
+                    }
+                ),
+                "locals": {},  # No direct local var capture in a separate process
+                "success": (result.returncode == 0),
+                "timed_out": False,
+                "timeout": timeout,
+            }
+        except subprocess.TimeoutExpired as e:
+            return {
+                "result": None,
+                "stdout": e.output or "",
+                "stderr": e.stderr or "",
+                "error": {
+                    "type": "TimeoutError",
+                    "message": f"Execution exceeded {timeout} second limit.",
+                    "traceback": "",
+                },
+                "locals": {},
+                "success": False,
+                "timed_out": True,
+                "timeout": timeout,
+            }
+        finally:
+            # Clean up the temp file
+            if os.path.exists(script_path):
+                os.remove(script_path)
+
+    def _format_python_results(self, results: dict[str, Any]) -> str:
+        """Format Python execution results for display."""
+        output = []
+
+        # Timeout notification
+        if results.get("timed_out", False):
+            output.append(
+                f"⚠️ **Execution Timeout**: Code exceeded the {results.get('timeout', 10)} second limit."
+            )
+            output.append("")
+
+        # Stdout
+        if results.get("stdout"):
+            output.append("## Output:")
+            output.append("```")
+            output.append(results["stdout"].rstrip())
+            output.append("```")
+            output.append("")
+
+        # If there's a 'result' variable to display
+        if results.get("result") is not None:
+            output.append("## Result:")
+            output.append("```")
+            output.append(str(results["result"]))
+            output.append("```")
+            output.append("")
+
+        # Error info
+        if not results.get("success", True):
+            output.append("## Error:")
+            output.append("```")
+            stderr_out = results.get("stderr", "").rstrip()
+            if stderr_out:
+                output.append(stderr_out)
+
+            err_obj = results.get("error")
+            if err_obj and err_obj.get("message"):
+                output.append(err_obj["message"])
+            output.append("```")
+
+        # Return formatted output
+        return (
+            "\n".join(output)
+            if output
+            else "Code executed with no output or result."
+        )
+
+    def _format_search_results(self, results) -> str:
+        """Simple pass-through formatting for RAG search results."""
+        return results
+
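The subprocess-isolation pattern in `_execute_python_with_process_timeout` above can be sketched standalone. This is a minimal illustration, not the R2R implementation: the code runs in a fresh interpreter, so a timeout or crash cannot corrupt the caller's process, and the temp file is always cleaned up.

```python
# Minimal sketch of the subprocess-isolation pattern: user code runs in a
# separate Python process with a hard timeout, and failures are returned
# as data rather than raised.
import os
import subprocess
import sys
import tempfile


def run_sandboxed(code: str, timeout: int = 10) -> dict:
    """Run `code` in a separate Python process; never raises on timeout."""
    with tempfile.NamedTemporaryFile(mode="w", suffix=".py", delete=False) as f:
        f.write(code)
        path = f.name
    try:
        proc = subprocess.run(
            [sys.executable, path],
            capture_output=True,
            text=True,
            timeout=timeout,
        )
        return {
            "stdout": proc.stdout,
            "success": proc.returncode == 0,
            "timed_out": False,
        }
    except subprocess.TimeoutExpired:
        return {"stdout": "", "success": False, "timed_out": True}
    finally:
        os.remove(path)
```

Note the `finally` clause: the temp file is removed on success, failure, and timeout alike, mirroring the cleanup in the method above.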
+
+class R2RResearchAgent(ResearchAgentMixin, R2RRAGAgent):
+    """
+    A non-streaming research agent that uses the standard R2R agent as its base.
+
+    This agent combines research capabilities with the non-streaming RAG agent,
+    providing tools for deep research through tool-based interaction.
+    """
+
+    def __init__(
+        self,
+        app_config: AppConfig,
+        database_provider: DatabaseProvider,
+        llm_provider: (
+            AnthropicCompletionProvider
+            | LiteLLMCompletionProvider
+            | OpenAICompletionProvider
+            | R2RCompletionProvider
+        ),
+        config: RAGAgentConfig,
+        search_settings: SearchSettings,
+        rag_generation_config: GenerationConfig,
+        knowledge_search_method: Callable,
+        content_method: Callable,
+        file_search_method: Callable,
+        max_tool_context_length: int = 20_000,
+    ):
+        # Set a higher max iterations for research tasks
+        config.max_iterations = config.max_iterations or 15
+
+        # Initialize the RAG agent first
+        R2RRAGAgent.__init__(
+            self,
+            database_provider=database_provider,
+            llm_provider=llm_provider,
+            config=config,
+            search_settings=search_settings,
+            rag_generation_config=rag_generation_config,
+            knowledge_search_method=knowledge_search_method,
+            content_method=content_method,
+            file_search_method=file_search_method,
+            max_tool_context_length=max_tool_context_length,
+        )
+
+        # Then initialize the ResearchAgentMixin
+        ResearchAgentMixin.__init__(
+            self,
+            app_config=app_config,
+            database_provider=database_provider,
+            llm_provider=llm_provider,
+            config=config,
+            search_settings=search_settings,
+            rag_generation_config=rag_generation_config,
+            max_tool_context_length=max_tool_context_length,
+            knowledge_search_method=knowledge_search_method,
+            file_search_method=file_search_method,
+            content_method=content_method,
+        )
+
+
+class R2RStreamingResearchAgent(ResearchAgentMixin, R2RStreamingRAGAgent):
+    """
+    A streaming research agent that uses the streaming RAG agent as its base.
+
+    This agent combines research capabilities with streaming text generation,
+    providing real-time responses while still offering research tools.
+    """
+
+    def __init__(
+        self,
+        app_config: AppConfig,
+        database_provider: DatabaseProvider,
+        llm_provider: (
+            AnthropicCompletionProvider
+            | LiteLLMCompletionProvider
+            | OpenAICompletionProvider
+            | R2RCompletionProvider
+        ),
+        config: RAGAgentConfig,
+        search_settings: SearchSettings,
+        rag_generation_config: GenerationConfig,
+        knowledge_search_method: Callable,
+        content_method: Callable,
+        file_search_method: Callable,
+        max_tool_context_length: int = 10_000,
+    ):
+        # Force streaming on
+        config.stream = True
+        config.max_iterations = config.max_iterations or 15
+
+        # Initialize the streaming RAG agent first
+        R2RStreamingRAGAgent.__init__(
+            self,
+            database_provider=database_provider,
+            llm_provider=llm_provider,
+            config=config,
+            search_settings=search_settings,
+            rag_generation_config=rag_generation_config,
+            knowledge_search_method=knowledge_search_method,
+            content_method=content_method,
+            file_search_method=file_search_method,
+            max_tool_context_length=max_tool_context_length,
+        )
+
+        # Then initialize the ResearchAgentMixin
+        ResearchAgentMixin.__init__(
+            self,
+            app_config=app_config,
+            database_provider=database_provider,
+            llm_provider=llm_provider,
+            config=config,
+            search_settings=search_settings,
+            rag_generation_config=rag_generation_config,
+            max_tool_context_length=max_tool_context_length,
+            knowledge_search_method=knowledge_search_method,
+            content_method=content_method,
+            file_search_method=file_search_method,
+        )
+
+
+class R2RXMLToolsResearchAgent(ResearchAgentMixin, R2RXMLToolsRAGAgent):
+    """
+    A non-streaming research agent that uses XML tool formatting.
+
+    This agent combines research capabilities with the XML-based tool calling format,
+    which might be more appropriate for certain LLM providers.
+    """
+
+    def __init__(
+        self,
+        app_config: AppConfig,
+        database_provider: DatabaseProvider,
+        llm_provider: (
+            AnthropicCompletionProvider
+            | LiteLLMCompletionProvider
+            | OpenAICompletionProvider
+            | R2RCompletionProvider
+        ),
+        config: RAGAgentConfig,
+        search_settings: SearchSettings,
+        rag_generation_config: GenerationConfig,
+        knowledge_search_method: Callable,
+        content_method: Callable,
+        file_search_method: Callable,
+        max_tool_context_length: int = 20_000,
+    ):
+        # Set higher max iterations
+        config.max_iterations = config.max_iterations or 15
+
+        # Initialize the XML Tools RAG agent first
+        R2RXMLToolsRAGAgent.__init__(
+            self,
+            database_provider=database_provider,
+            llm_provider=llm_provider,
+            config=config,
+            search_settings=search_settings,
+            rag_generation_config=rag_generation_config,
+            knowledge_search_method=knowledge_search_method,
+            content_method=content_method,
+            file_search_method=file_search_method,
+            max_tool_context_length=max_tool_context_length,
+        )
+
+        # Then initialize the ResearchAgentMixin
+        ResearchAgentMixin.__init__(
+            self,
+            app_config=app_config,
+            database_provider=database_provider,
+            llm_provider=llm_provider,
+            config=config,
+            search_settings=search_settings,
+            rag_generation_config=rag_generation_config,
+            max_tool_context_length=max_tool_context_length,
+            knowledge_search_method=knowledge_search_method,
+            content_method=content_method,
+            file_search_method=file_search_method,
+        )
+
+
+class R2RXMLToolsStreamingResearchAgent(
+    ResearchAgentMixin, R2RXMLToolsStreamingRAGAgent
+):
+    """
+    A streaming research agent that uses XML tool formatting.
+
+    This agent combines research capabilities with streaming and XML-based tool calling,
+    providing real-time responses in a format suitable for certain LLM providers.
+    """
+
+    def __init__(
+        self,
+        app_config: AppConfig,
+        database_provider: DatabaseProvider,
+        llm_provider: (
+            AnthropicCompletionProvider
+            | LiteLLMCompletionProvider
+            | OpenAICompletionProvider
+            | R2RCompletionProvider
+        ),
+        config: RAGAgentConfig,
+        search_settings: SearchSettings,
+        rag_generation_config: GenerationConfig,
+        knowledge_search_method: Callable,
+        content_method: Callable,
+        file_search_method: Callable,
+        max_tool_context_length: int = 10_000,
+    ):
+        # Force streaming on
+        config.stream = True
+        config.max_iterations = config.max_iterations or 15
+
+        # Initialize the XML Tools Streaming RAG agent first
+        R2RXMLToolsStreamingRAGAgent.__init__(
+            self,
+            database_provider=database_provider,
+            llm_provider=llm_provider,
+            config=config,
+            search_settings=search_settings,
+            rag_generation_config=rag_generation_config,
+            knowledge_search_method=knowledge_search_method,
+            content_method=content_method,
+            file_search_method=file_search_method,
+            max_tool_context_length=max_tool_context_length,
+        )
+
+        # Then initialize the ResearchAgentMixin
+        ResearchAgentMixin.__init__(
+            self,
+            app_config=app_config,
+            database_provider=database_provider,
+            llm_provider=llm_provider,
+            config=config,
+            search_settings=search_settings,
+            rag_generation_config=rag_generation_config,
+            max_tool_context_length=max_tool_context_length,
+            knowledge_search_method=knowledge_search_method,
+            content_method=content_method,
+            file_search_method=file_search_method,
+        )

+ 126 - 0
py/core/base/__init__.py

@@ -0,0 +1,126 @@
+from .abstractions import *
+from .agent import *
+from .api.models import *
+from .parsers import *
+from .providers import *
+from .utils import *
+
+__all__ = [
+    "ThinkingEvent",
+    "ToolCallEvent",
+    "ToolResultEvent",
+    "CitationEvent",
+    "Citation",
+    ## ABSTRACTIONS
+    # Base abstractions
+    "AsyncSyncMeta",
+    "syncable",
+    # Completion abstractions
+    "MessageType",
+    # Document abstractions
+    "Document",
+    "DocumentChunk",
+    "DocumentResponse",
+    "IngestionStatus",
+    "GraphExtractionStatus",
+    "GraphConstructionStatus",
+    "DocumentType",
+    # Exception abstractions
+    "R2RDocumentProcessingError",
+    "R2RException",
+    # Graph abstractions
+    "Entity",
+    "GraphExtraction",
+    "Relationship",
+    "Community",
+    "GraphCreationSettings",
+    "GraphEnrichmentSettings",
+    # LLM abstractions
+    "GenerationConfig",
+    "LLMChatCompletion",
+    "LLMChatCompletionChunk",
+    "RAGCompletion",
+    # Prompt abstractions
+    "Prompt",
+    # Search abstractions
+    "AggregateSearchResult",
+    "WebSearchResult",
+    "GraphSearchResult",
+    "GraphSearchSettings",
+    "ChunkSearchSettings",
+    "ChunkSearchResult",
+    "WebPageSearchResult",
+    "SearchSettings",
+    "select_search_filters",
+    "SearchMode",
+    "HybridSearchSettings",
+    # User abstractions
+    "Token",
+    "TokenData",
+    # Vector abstractions
+    "Vector",
+    "VectorEntry",
+    "VectorType",
+    "StorageResult",
+    "IndexConfig",
+    ## AGENT
+    # Agent abstractions
+    "Agent",
+    "AgentConfig",
+    "Conversation",
+    "Message",
+    ## API
+    # Auth Responses
+    "TokenResponse",
+    "User",
+    ## PARSERS
+    # Base parser
+    "AsyncParser",
+    ## PROVIDERS
+    # Base provider classes
+    "AppConfig",
+    "Provider",
+    "ProviderConfig",
+    # Auth provider
+    "AuthConfig",
+    "AuthProvider",
+    # Crypto provider
+    "CryptoConfig",
+    "CryptoProvider",
+    # Database providers
+    "LimitSettings",
+    "DatabaseConfig",
+    "DatabaseProvider",
+    "Handler",
+    "PostgresConfigurationSettings",
+    # Email provider
+    "EmailConfig",
+    "EmailProvider",
+    # Embedding provider
+    "EmbeddingConfig",
+    "EmbeddingProvider",
+    # File provider
+    "FileConfig",
+    "FileProvider",
+    # Ingestion provider
+    "IngestionConfig",
+    "IngestionProvider",
+    "ChunkingStrategy",
+    # LLM provider
+    "CompletionConfig",
+    "CompletionProvider",
+    ## UTILS
+    "RecursiveCharacterTextSplitter",
+    "TextSplitter",
+    "format_search_results_for_llm",
+    "validate_uuid",
+    # ID generation
+    "generate_id",
+    "generate_document_id",
+    "generate_extraction_id",
+    "generate_default_user_collection_id",
+    "generate_user_id",
+    "yield_sse_event",
+    "dump_collector",
+    "dump_obj",
+]

+ 147 - 0
py/core/base/abstractions/__init__.py

@@ -0,0 +1,147 @@
+from shared.abstractions.base import AsyncSyncMeta, R2RSerializable, syncable
+from shared.abstractions.document import (
+    ChunkEnrichmentSettings,
+    Document,
+    DocumentChunk,
+    DocumentResponse,
+    DocumentType,
+    GraphConstructionStatus,
+    GraphExtractionStatus,
+    IngestionStatus,
+    RawChunk,
+    UnprocessedChunk,
+    UpdateChunk,
+)
+from shared.abstractions.exception import (
+    R2RDocumentProcessingError,
+    R2RException,
+)
+from shared.abstractions.graph import (
+    Community,
+    Entity,
+    Graph,
+    GraphCommunitySettings,
+    GraphCreationSettings,
+    GraphEnrichmentSettings,
+    GraphExtraction,
+    Relationship,
+    StoreType,
+)
+from shared.abstractions.llm import (
+    GenerationConfig,
+    LLMChatCompletion,
+    LLMChatCompletionChunk,
+    Message,
+    MessageType,
+    RAGCompletion,
+)
+from shared.abstractions.prompt import Prompt
+from shared.abstractions.search import (
+    AggregateSearchResult,
+    ChunkSearchResult,
+    ChunkSearchSettings,
+    GraphCommunityResult,
+    GraphEntityResult,
+    GraphRelationshipResult,
+    GraphSearchResult,
+    GraphSearchResultType,
+    GraphSearchSettings,
+    HybridSearchSettings,
+    SearchMode,
+    SearchSettings,
+    WebPageSearchResult,
+    WebSearchResult,
+    select_search_filters,
+)
+from shared.abstractions.user import Token, TokenData, User
+from shared.abstractions.vector import (
+    IndexArgsHNSW,
+    IndexArgsIVFFlat,
+    IndexConfig,
+    IndexMeasure,
+    IndexMethod,
+    StorageResult,
+    Vector,
+    VectorEntry,
+    VectorQuantizationSettings,
+    VectorQuantizationType,
+    VectorTableName,
+    VectorType,
+)
+
+__all__ = [
+    # Base abstractions
+    "R2RSerializable",
+    "AsyncSyncMeta",
+    "syncable",
+    # Completion abstractions
+    "MessageType",
+    # Document abstractions
+    "Document",
+    "DocumentChunk",
+    "DocumentResponse",
+    "DocumentType",
+    "IngestionStatus",
+    "GraphExtractionStatus",
+    "GraphConstructionStatus",
+    "RawChunk",
+    "UnprocessedChunk",
+    "UpdateChunk",
+    # Exception abstractions
+    "R2RDocumentProcessingError",
+    "R2RException",
+    # Graph abstractions
+    "Entity",
+    "Graph",
+    "Community",
+    "StoreType",
+    "GraphExtraction",
+    "Relationship",
+    # Index abstractions
+    "IndexConfig",
+    # LLM abstractions
+    "GenerationConfig",
+    "LLMChatCompletion",
+    "LLMChatCompletionChunk",
+    "Message",
+    "RAGCompletion",
+    # Prompt abstractions
+    "Prompt",
+    # Search abstractions
+    "WebSearchResult",
+    "AggregateSearchResult",
+    "GraphSearchResult",
+    "GraphSearchResultType",
+    "GraphEntityResult",
+    "GraphRelationshipResult",
+    "GraphCommunityResult",
+    "GraphSearchSettings",
+    "ChunkSearchSettings",
+    "ChunkSearchResult",
+    "WebPageSearchResult",
+    "SearchSettings",
+    "select_search_filters",
+    "SearchMode",
+    "HybridSearchSettings",
+    # Graph abstractions
+    "GraphCreationSettings",
+    "GraphEnrichmentSettings",
+    "GraphCommunitySettings",
+    # User abstractions
+    "Token",
+    "TokenData",
+    "User",
+    # Vector abstractions
+    "Vector",
+    "VectorEntry",
+    "VectorType",
+    "IndexMeasure",
+    "IndexMethod",
+    "VectorTableName",
+    "IndexArgsHNSW",
+    "IndexArgsIVFFlat",
+    "VectorQuantizationSettings",
+    "VectorQuantizationType",
+    "StorageResult",
+    "ChunkEnrichmentSettings",
+]

+ 13 - 0
py/core/base/agent/__init__.py

@@ -0,0 +1,13 @@
+# FIXME: Once the agent is properly type annotated, remove the type: ignore comments
+from .agent import (  # type: ignore
+    Agent,
+    AgentConfig,
+    Conversation,
+)
+
+__all__ = [
+    # Agent abstractions
+    "Agent",
+    "AgentConfig",
+    "Conversation",
+]

+ 298 - 0
py/core/base/agent/agent.py

@@ -0,0 +1,298 @@
+# type: ignore
+import asyncio
+import json
+import logging
+from abc import ABC, abstractmethod
+from datetime import datetime
+from json import JSONDecodeError
+from typing import Any, AsyncGenerator, Optional, Type
+
+from pydantic import BaseModel
+
+from core.base.abstractions import (
+    GenerationConfig,
+    LLMChatCompletion,
+    Message,
+)
+from core.base.providers import CompletionProvider, DatabaseProvider
+from shared.abstractions.tool import Tool, ToolResult
+
+logger = logging.getLogger()
+
+
+class Conversation:
+    def __init__(self):
+        self.messages: list[Message] = []
+        self._lock = asyncio.Lock()
+
+    async def add_message(self, message):
+        async with self._lock:
+            self.messages.append(message)
+
+    async def get_messages(self) -> list[dict[str, Any]]:
+        async with self._lock:
+            return [
+                {**msg.model_dump(exclude_none=True), "role": str(msg.role)}
+                for msg in self.messages
+            ]
+
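The `Conversation` class above guards its message list with an `asyncio.Lock` so that concurrent `add_message` calls (e.g. from parallel tool handlers) serialize cleanly. A minimal sketch of that pattern, with hypothetical names:

```python
# Sketch of a lock-guarded message list: concurrent add_message calls
# serialize on the asyncio.Lock, so no append is lost or interleaved.
import asyncio


class MiniConversation:
    def __init__(self):
        self.messages: list[dict] = []
        self._lock = asyncio.Lock()

    async def add_message(self, message: dict) -> None:
        async with self._lock:
            self.messages.append(message)

    async def get_messages(self) -> list[dict]:
        async with self._lock:
            # Return a copy so callers cannot mutate internal state.
            return list(self.messages)


async def demo() -> int:
    conv = MiniConversation()
    # 50 concurrent writers; the lock serializes each append.
    await asyncio.gather(
        *(conv.add_message({"role": "user", "content": str(i)}) for i in range(50))
    )
    return len(await conv.get_messages())
```

`get_messages` hands back a copy, matching the real class's habit of returning freshly built dicts rather than internal objects.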
+
+# TODO - Move agents to provider pattern
+class AgentConfig(BaseModel):
+    rag_rag_agent_static_prompt: str = "static_rag_agent"
+    rag_agent_dynamic_prompt: str = "dynamic_reasoning_rag_agent_prompted"
+    stream: bool = False
+    include_tools: bool = True
+    max_iterations: int = 10
+
+    @classmethod
+    def create(cls: Type["AgentConfig"], **kwargs: Any) -> "AgentConfig":
+        base_args = cls.model_fields.keys()
+        filtered_kwargs = {
+            k: v if v != "None" else None
+            for k, v in kwargs.items()
+            if k in base_args
+        }
+        return cls(**filtered_kwargs)  # type: ignore
+
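`AgentConfig.create` above does two things: it drops any kwargs that are not declared model fields, and it normalizes the literal string `"None"` (as it can arrive from TOML or environment config) to a real `None`. A dependency-free sketch of that filtering, with a hypothetical field set:

```python
# Sketch of the kwargs filtering in AgentConfig.create: unknown keys are
# dropped and the string "None" is normalized to None before construction.
KNOWN_FIELDS = {"stream", "include_tools", "max_iterations"}  # hypothetical


def filter_config_kwargs(kwargs: dict) -> dict:
    return {
        k: (None if v == "None" else v)
        for k, v in kwargs.items()
        if k in KNOWN_FIELDS
    }
```

In the real class, `KNOWN_FIELDS` comes from `cls.model_fields.keys()`, so subclasses that add fields (like `RAGAgentConfig`) automatically widen the filter.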
+
+class Agent(ABC):
+    def __init__(
+        self,
+        llm_provider: CompletionProvider,
+        database_provider: DatabaseProvider,
+        config: AgentConfig,
+        rag_generation_config: GenerationConfig,
+    ):
+        self.llm_provider = llm_provider
+        self.database_provider: DatabaseProvider = database_provider
+        self.config = config
+        self.conversation = Conversation()
+        self._completed = False
+        self._tools: list[Tool] = []
+        self.tool_calls: list[dict] = []
+        self.rag_generation_config = rag_generation_config
+        # self._register_tools()
+
+    @abstractmethod
+    def _register_tools(self):
+        pass
+
+    async def _setup(
+        self, system_instruction: Optional[str] = None, *args, **kwargs
+    ):
+        await self.conversation.add_message(
+            Message(
+                role="system",
+                content=system_instruction
+                or (
+                    await self.database_provider.prompts_handler.get_cached_prompt(
+                        self.config.rag_rag_agent_static_prompt,
+                        inputs={
+                            "date": str(datetime.now().strftime("%m/%d/%Y"))
+                        },
+                    )
+                    + f"\nNote, you only have {self.config.max_iterations} iterations or tool calls to reach a conclusion before your operation terminates."
+                ),
+            )
+        )
+
+    @property
+    def tools(self) -> list[Tool]:
+        return self._tools
+
+    @tools.setter
+    def tools(self, tools: list[Tool]):
+        self._tools = tools
+
+    @abstractmethod
+    async def arun(
+        self,
+        system_instruction: Optional[str] = None,
+        messages: Optional[list[Message]] = None,
+        *args,
+        **kwargs,
+    ) -> list[LLMChatCompletion] | AsyncGenerator[LLMChatCompletion, None]:
+        pass
+
+    @abstractmethod
+    async def process_llm_response(
+        self,
+        response: Any,
+        *args,
+        **kwargs,
+    ) -> None | AsyncGenerator[str, None]:
+        pass
+
+    async def execute_tool(self, tool_name: str, *args, **kwargs) -> str:
+        if tool := next((t for t in self.tools if t.name == tool_name), None):
+            return await tool.results_function(*args, **kwargs)
+        else:
+            return f"Error: Tool {tool_name} not found."
+
+    def get_generation_config(
+        self, last_message: dict, stream: bool = False
+    ) -> GenerationConfig:
+        if (
+            (
+                last_message["role"] in ["tool", "function"]
+                and last_message["content"] != ""
+                and "ollama" in self.rag_generation_config.model
+            )
+            or not self.config.include_tools
+        ):
+            return GenerationConfig(
+                **self.rag_generation_config.model_dump(
+                    exclude={"functions", "tools", "stream"}
+                ),
+                stream=stream,
+            )
+
+        return GenerationConfig(
+            **self.rag_generation_config.model_dump(
+                exclude={"functions", "tools", "stream"}
+            ),
+            # FIXME: Use tools instead of functions
+            # TODO - Investigate why `tools` fails with OpenAI+LiteLLM
+            tools=(
+                [
+                    {
+                        "function": {
+                            "name": tool.name,
+                            "description": tool.description,
+                            "parameters": tool.parameters,
+                        },
+                        "type": "function",
+                        "name": tool.name,
+                    }
+                    for tool in self.tools
+                ]
+                if self.tools
+                else None
+            ),
+            stream=stream,
+        )
+
+    async def handle_function_or_tool_call(
+        self,
+        function_name: str,
+        function_arguments: str,
+        tool_id: Optional[str] = None,
+        save_messages: bool = True,
+        *args,
+        **kwargs,
+    ) -> ToolResult:
+        logger.debug(
+            f"Calling function: {function_name}, args: {function_arguments}, tool_id: {tool_id}"
+        )
+        if tool := next(
+            (t for t in self.tools if t.name == function_name), None
+        ):
+            try:
+                function_args = json.loads(function_arguments)
+
+            except JSONDecodeError:
+                error_message = f"Calling the requested tool '{function_name}' with arguments {function_arguments} failed with `JSONDecodeError`."
+                if save_messages:
+                    await self.conversation.add_message(
+                        Message(
+                            role="tool" if tool_id else "function",
+                            content=error_message,
+                            name=function_name,
+                            tool_call_id=tool_id,
+                        )
+                    )
+                return ToolResult(
+                    raw_result=error_message,
+                    llm_formatted_result=error_message,
+                )
+
+            merged_kwargs = {**kwargs, **function_args}
+            try:
+                raw_result = await tool.execute(*args, **merged_kwargs)
+                llm_formatted_result = tool.llm_format_function(raw_result)
+            except Exception as e:
+                raw_result = f"Calling the requested tool '{function_name}' with arguments {function_arguments} failed with an exception: {e}."
+                logger.error(raw_result)
+                llm_formatted_result = raw_result
+
+            tool_result = ToolResult(
+                raw_result=raw_result,
+                llm_formatted_result=llm_formatted_result,
+            )
+            if tool.stream_function:
+                tool_result.stream_result = tool.stream_function(raw_result)
+
+            if save_messages:
+                await self.conversation.add_message(
+                    Message(
+                        role="tool" if tool_id else "function",
+                        content=str(tool_result.llm_formatted_result),
+                        name=function_name,
+                        tool_call_id=tool_id,
+                    )
+                )
+                # HACK - to fix issues with claude thinking + tool use [https://github.com/anthropics/anthropic-cookbook/blob/main/extended_thinking/extended_thinking_with_tool_use.ipynb]
+                logger.debug(
+                    f"Extended thinking - Claude needs a particular message continuation which, however, breaks other models. Model in use: {self.rag_generation_config.model}"
+                )
+                is_anthropic = (
+                    self.rag_generation_config.model
+                    and "anthropic/" in self.rag_generation_config.model
+                )
+                if (
+                    self.rag_generation_config.extended_thinking
+                    and is_anthropic
+                ):
+                    await self.conversation.add_message(
+                        Message(
+                            role="user",
+                            content="Continue...",
+                        )
+                    )
+
+            self.tool_calls.append(
+                {
+                    "name": function_name,
+                    "args": function_arguments,
+                }
+            )
+            return tool_result
+        error_message = f"Requested tool '{function_name}' not found."
+        logger.error(error_message)
+        return ToolResult(
+            raw_result=error_message,
+            llm_formatted_result=error_message,
+        )
+
+
+# TODO - Move agents to provider pattern
+class RAGAgentConfig(AgentConfig):
+    rag_rag_agent_static_prompt: str = "static_rag_agent"
+    rag_agent_dynamic_prompt: str = "dynamic_reasoning_rag_agent_prompted"
+    stream: bool = False
+    include_tools: bool = True
+    max_iterations: int = 10
+    # tools: list[str] = [] # HACK - unused variable.
+
+    # Default RAG tools
+    rag_tools: list[str] = [
+        "search_file_descriptions",
+        "search_file_knowledge",
+        "get_file_content",
+        # Web search tools - disabled by default
+        # "web_search",
+        # "web_scrape",
+        # "tavily_search",
+        # "tavily_extract",
+    ]
+
+    # Default Research tools
+    research_tools: list[str] = [
+        "rag",
+        "reasoning",
+        # Enabled by default; remove from this list to disable
+        "critique",
+        "python_executor",
+    ]
+
+    @classmethod
+    def create(cls: Type["RAGAgentConfig"], **kwargs: Any) -> "RAGAgentConfig":
+        base_args = cls.model_fields.keys()
+        filtered_kwargs = {
+            k: v if v != "None" else None
+            for k, v in kwargs.items()
+            if k in base_args
+        }
+        filtered_kwargs["tools"] = kwargs.get("tools", None) or kwargs.get(
+            "tool_names", None
+        )
+        return cls(**filtered_kwargs)  # type: ignore

+ 82 - 0
py/core/base/agent/tools/built_in/get_file_content.py

@@ -0,0 +1,82 @@
+import logging
+from typing import Any, Optional
+from uuid import UUID
+
+from shared.abstractions.tool import Tool
+
+logger = logging.getLogger(__name__)
+
+
+class GetFileContentTool(Tool):
+    """
+    A tool to fetch entire documents from the local database.
+
+    Typically used if the agent needs deeper or more structured context
+    from documents, not just chunk-level hits.
+    """
+
+    def __init__(self):
+        # Initialize with all required fields for the Pydantic model
+        super().__init__(
+            name="get_file_content",
+            description=(
+                "Fetches the complete contents of all user documents from the local database. "
+                "Can be used alongside filter criteria (e.g. doc IDs, collection IDs, etc.) to restrict the query. "
+                "For instance, a single document can be returned with a filter like so: "
+                "{'document_id': {'$eq': '...'}}."
+            ),
+            parameters={
+                "type": "object",
+                "properties": {
+                    "document_id": {
+                        "type": "string",
+                        "description": "The unique UUID of the document to fetch.",
+                    },
+                },
+                "required": ["document_id"],
+            },
+            results_function=self.execute,
+            llm_format_function=None,
+        )
+
+    async def execute(
+        self,
+        document_id: str,
+        options: Optional[dict[str, Any]] = None,
+        *args,
+        **kwargs,
+    ):
+        """
+        Calls the content_method from context to fetch doc+chunk structures.
+        """
+        from core.base.abstractions import AggregateSearchResult
+
+        # Use either provided context or stored context
+        context = self.context
+
+        # Check if context has necessary method
+        if not context or not hasattr(context, "content_method"):
+            logger.error("No content_method provided in context")
+            return AggregateSearchResult(document_search_results=[])
+
+        try:
+            doc_uuid = UUID(document_id)
+            filters = {"id": {"$eq": doc_uuid}}
+        except ValueError:
+            logger.error(f"Invalid document_id format received: {document_id}")
+            return AggregateSearchResult(document_search_results=[])
+
+        options = options or {}
+
+        try:
+            content = await context.content_method(filters, options)
+        except Exception as e:
+            logger.error(f"Error calling content_method: {e}")
+            return AggregateSearchResult(document_search_results=[])
+
+        result = AggregateSearchResult(document_search_results=content)
+
+        if hasattr(context, "search_results_collector"):
+            context.search_results_collector.add_aggregate_result(result)
+
+        return result

+ 67 - 0
py/core/base/agent/tools/built_in/search_file_descriptions.py

@@ -0,0 +1,67 @@
+import logging
+
+from shared.abstractions.tool import Tool
+
+logger = logging.getLogger(__name__)
+
+
+class SearchFileDescriptionsTool(Tool):
+    """
+    A tool to search over high-level document data (titles, descriptions, etc.)
+    """
+
+    def __init__(self):
+        super().__init__(
+            name="search_file_descriptions",
+            description=(
+                "Semantic search over AI-generated summaries of stored documents. "
+                "This does NOT retrieve chunk-level contents or knowledge-graph relationships. "
+                "Use this when you need a broad overview of which documents (files) might be relevant."
+            ),
+            parameters={
+                "type": "object",
+                "properties": {
+                    "query": {
+                        "type": "string",
+                        "description": "Query string for a semantic search over available files, e.g. 'list documents about XYZ'.",
+                    }
+                },
+                "required": ["query"],
+            },
+            results_function=self.execute,
+            llm_format_function=None,
+        )
+
+    async def execute(self, query: str, *args, **kwargs):
+        """
+        Calls the file_search_method from context.
+        """
+        from core.base.abstractions import AggregateSearchResult
+
+        context = self.context
+
+        # Check if context has necessary method
+        if not context or not hasattr(context, "file_search_method"):
+            logger.error("No file_search_method provided in context")
+            return AggregateSearchResult(document_search_results=[])
+
+        # Get the file_search_method from context
+        file_search_method = context.file_search_method
+
+        # Call the file_search_method from the context
+        try:
+            doc_results = await file_search_method(
+                query=query,
+                settings=context.search_settings,
+            )
+        except Exception as e:
+            logger.error(f"Error calling file_search_method: {e}")
+            return AggregateSearchResult(document_search_results=[])
+
+        result = AggregateSearchResult(document_search_results=doc_results)
+
+        # Add to results collector if context has it
+        if hasattr(context, "search_results_collector"):
+            context.search_results_collector.add_aggregate_result(result)
+
+        return result
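All of these tools resolve their collaborator from a loosely typed context via `hasattr`, degrading to an empty result when it is missing. A stripped-down sketch of that duck-typing pattern (`safe_file_search` and the fake context are hypothetical, not R2R APIs):

```python
import asyncio
from types import SimpleNamespace


async def safe_file_search(context, query: str) -> list:
    """Degrade to an empty result when the context lacks the collaborator."""
    if not context or not hasattr(context, "file_search_method"):
        return []
    try:
        return await context.file_search_method(query=query)
    except Exception:
        return []


async def fake_search(query: str) -> list:
    # Stand-in for a real semantic-search backend.
    return [f"summary matching {query!r}"]


ctx = SimpleNamespace(file_search_method=fake_search)
hits = asyncio.run(safe_file_search(ctx, "budgets"))
misses = asyncio.run(safe_file_search(None, "budgets"))
```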

+ 82 - 0
py/core/base/agent/tools/built_in/search_file_knowledge.py

@@ -0,0 +1,82 @@
+import logging
+
+from shared.abstractions.tool import Tool
+
+logger = logging.getLogger(__name__)
+
+
+class SearchFileKnowledgeTool(Tool):
+    """
+    A tool to do a semantic/hybrid search on the local knowledge base.
+    """
+
+    def __init__(self):
+        super().__init__(
+            name="search_file_knowledge",
+            description=(
+                "Search your local knowledge base using the R2R system. "
+                "Use this when you want relevant text chunks or knowledge graph data."
+            ),
+            parameters={
+                "type": "object",
+                "properties": {
+                    "query": {
+                        "type": "string",
+                        "description": "User query to search in the local DB.",
+                    },
+                },
+                "required": ["query"],
+            },
+            results_function=self.execute,
+            llm_format_function=None,
+        )
+
+    async def execute(self, query: str, *args, **kwargs):
+        """
+        Calls the knowledge_search_method from context.
+        """
+        from core.base.abstractions import AggregateSearchResult
+
+        context = self.context
+
+        # Check if context has necessary method
+        if not context or not hasattr(context, "knowledge_search_method"):
+            logger.error("No knowledge_search_method provided in context")
+            return AggregateSearchResult(document_search_results=[])
+
+        # Get the knowledge_search_method from context
+        knowledge_search_method = context.knowledge_search_method
+
+        # Call the knowledge_search_method from the context
+        try:
+            """
+            FIXME: This is going to fail, as it requires an embedding, NOT a query.
+            I've moved 'search_settings' to 'settings'; the mismatch had been
+            failing silently and producing null content in the Message object.
+            """
+            results = await knowledge_search_method(
+                query=query,
+                search_settings=context.search_settings,
+            )
+
+            # FIXME: Normalize the return type upstream instead of coercing here
+            if isinstance(results, AggregateSearchResult):
+                agg = results
+            else:
+                agg = AggregateSearchResult(
+                    chunk_search_results=results.get(
+                        "chunk_search_results", []
+                    ),
+                    graph_search_results=results.get(
+                        "graph_search_results", []
+                    ),
+                )
+        except Exception as e:
+            logger.error(f"Error calling knowledge_search_method: {e}")
+            return AggregateSearchResult(document_search_results=[])
+
+        # Add to results collector if context has it
+        if hasattr(context, "search_results_collector"):
+            context.search_results_collector.add_aggregate_result(agg)
+
+        return agg
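The coercion branch above accepts either an `AggregateSearchResult` or a plain dict from the search method. The same normalization in isolation (the `Aggregate` dataclass is a stand-in for the R2R type, carrying only the two fields used here):

```python
from dataclasses import dataclass, field


@dataclass
class Aggregate:
    # Stand-in for AggregateSearchResult.
    chunk_search_results: list = field(default_factory=list)
    graph_search_results: list = field(default_factory=list)


def to_aggregate(results) -> Aggregate:
    """Accept either an Aggregate or a plain dict and return an Aggregate."""
    if isinstance(results, Aggregate):
        return results
    return Aggregate(
        chunk_search_results=results.get("chunk_search_results", []),
        graph_search_results=results.get("graph_search_results", []),
    )
```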

+ 109 - 0
py/core/base/agent/tools/built_in/tavily_extract.py

@@ -0,0 +1,109 @@
+import logging
+
+from core.utils import (
+    generate_id,
+)
+from shared.abstractions.tool import Tool
+
+logger = logging.getLogger(__name__)
+
+
+class TavilyExtractTool(Tool):
+    """
+    Uses the Tavily API to extract content from a specific URL.
+    """
+
+    def __init__(self):
+        super().__init__(
+            name="tavily_extract",
+            description=(
+                "Use Tavily to extract and retrieve the contents of a specific webpage. "
+                "This is useful when you want to get clean, structured content from a URL. "
+                "Use this when you need to analyze the full content of a specific webpage."
+            ),
+            parameters={
+                "type": "object",
+                "properties": {
+                    "url": {
+                        "type": "string",
+                        "description": (
+                            "The absolute URL of the webpage you want to extract content from. "
+                            "Example: 'https://www.example.com/article'"
+                        ),
+                    }
+                },
+                "required": ["url"],
+            },
+            results_function=self.execute,
+            llm_format_function=None,
+        )
+
+    async def execute(self, url: str, *args, **kwargs):
+        """
+        Calls Tavily's extract API asynchronously.
+        """
+        import asyncio
+        import os
+
+        from core.base.abstractions import (
+            AggregateSearchResult,
+            WebPageSearchResult,
+        )
+
+        context = self.context
+
+        try:
+            from tavily import TavilyClient
+
+            # Get API key from environment variables
+            api_key = os.environ.get("TAVILY_API_KEY")
+            if not api_key:
+                logger.warning("TAVILY_API_KEY environment variable not set")
+                return AggregateSearchResult()
+
+            # Initialize Tavily client
+            tavily_client = TavilyClient(api_key=api_key)
+
+            # Perform the URL extraction asynchronously
+            extracted_content = await asyncio.get_event_loop().run_in_executor(
+                None,  # Uses the default executor
+                lambda: tavily_client.extract(url, extract_depth="advanced"),
+            )
+
+            web_page_search_results = []
+            for extracted in extracted_content.results:
+                content = extracted.raw_content
+                if len(content) > 100_000:
+                    content = (
+                        f"{content[:100000]}...FURTHER CONTENT TRUNCATED..."
+                    )
+
+                web_result = WebPageSearchResult(
+                    title=extracted.url,
+                    link=extracted.url,
+                    snippet=content,
+                    position=0,
+                    id=generate_id(extracted.url),
+                    type="tavily_extract",
+                )
+                web_page_search_results.append(web_result)
+
+            result = AggregateSearchResult(
+                web_page_search_results=web_page_search_results
+            )
+
+            # Add to results collector if context is provided
+            if context and hasattr(context, "search_results_collector"):
+                context.search_results_collector.add_aggregate_result(result)
+
+            return result
+        except ImportError:
+            logger.error(
+                "The 'tavily-python' package is not installed. Please install it with 'pip install tavily-python'"
+            )
+            # Return empty results in case Tavily is not installed
+            return AggregateSearchResult()
+        except Exception as e:
+            logger.error(f"Error during Tavily extract: {e}")
+            # Return empty results in case of any other error
+            return AggregateSearchResult()
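Both extract tools cap raw page content at 100k characters before handing it to the LLM, appending a marker so the model knows text was dropped. The guard on its own (`truncate_content` is an illustrative helper, not an R2R function):

```python
TRUNCATION_LIMIT = 100_000
MARKER = "...FURTHER CONTENT TRUNCATED..."


def truncate_content(content: str, limit: int = TRUNCATION_LIMIT) -> str:
    """Cap oversized page content, signalling the truncation explicitly."""
    if len(content) > limit:
        return f"{content[:limit]}{MARKER}"
    return content
```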

+ 123 - 0
py/core/base/agent/tools/built_in/tavily_search.py

@@ -0,0 +1,123 @@
+import logging
+
+from core.utils import (
+    generate_id,
+)
+from shared.abstractions.tool import Tool
+
+logger = logging.getLogger(__name__)
+
+
+class TavilySearchTool(Tool):
+    """
+    Uses the Tavily Search API, a specialized search engine designed for
+    Large Language Models (LLMs) and AI agents.
+    """
+
+    def __init__(self):
+        super().__init__(
+            name="tavily_search",
+            description=(
+                "Use the Tavily search engine to perform an internet-based search and retrieve results. Useful when you need "
+                "to search the internet for specific information. The query should be no more than 400 characters."
+            ),
+            parameters={
+                "type": "object",
+                "properties": {
+                    "query": {
+                        "type": "string",
+                        "description": "The query to search using Tavily that should be no more than 400 characters.",
+                    },
+                    "kwargs": {
+                        "type": "object",
+                        "description": (
+                            "Dictionary of additional parameters to pass to Tavily, such as max_results, include_domains and exclude_domains, "
+                            'e.g. {"max_results": 10, "include_domains": ["example.com"], "exclude_domains": ["example2.com"]}'
+                        ),
+                    },
+                },
+                "required": ["query"],
+            },
+            results_function=self.execute,
+            llm_format_function=None,
+        )
+
+    async def execute(self, query: str, *args, **kwargs):
+        """
+        Calls Tavily's search API asynchronously.
+        """
+        import asyncio
+        import os
+
+        from core.base.abstractions import (
+            AggregateSearchResult,
+            WebSearchResult,
+        )
+
+        context = self.context
+
+        # Check if query is too long and truncate if necessary. Tavily recommends under 400 chars.
+        if len(query) > 400:
+            logger.warning(
+                f"Tavily query is {len(query)} characters long, which exceeds the recommended 400 character limit. Consider breaking into smaller queries for better results."
+            )
+            query = query[:400]
+
+        try:
+            from tavily import TavilyClient
+
+            # Get API key from environment variables
+            api_key = os.environ.get("TAVILY_API_KEY")
+            if not api_key:
+                logger.warning("TAVILY_API_KEY environment variable not set")
+                return AggregateSearchResult()
+
+            # Initialize Tavily client
+            tavily_client = TavilyClient(api_key=api_key)
+
+            # Perform the search asynchronously
+            raw_results = await asyncio.get_event_loop().run_in_executor(
+                None,  # Uses the default executor
+                lambda: tavily_client.search(
+                    query=query,
+                    search_depth="advanced",
+                    include_raw_content=False,
+                    include_domains=kwargs.get("include_domains", []),
+                    exclude_domains=kwargs.get("exclude_domains", []),
+                    max_results=kwargs.get("max_results", 10),
+                ),
+            )
+
+            # Extract the results from the response
+            results = raw_results.get("results", [])
+
+            # Process the raw results into a format compatible with AggregateSearchResult
+            search_results = [
+                WebSearchResult(  # type: ignore
+                    title=result.get("title", "Untitled"),
+                    link=result.get("url", ""),
+                    snippet=result.get("content", ""),
+                    position=index,
+                    id=generate_id(result.get("url", "")),
+                    type="tavily_search",
+                )
+                for index, result in enumerate(results)
+            ]
+
+            result = AggregateSearchResult(web_search_results=search_results)
+
+            # Add to results collector if context is provided
+            if context and hasattr(context, "search_results_collector"):
+                context.search_results_collector.add_aggregate_result(result)
+
+            return result
+        except ImportError:
+            logger.error(
+                "The 'tavily-python' package is not installed. Please install it with 'pip install tavily-python'"
+            )
+            # Return empty results in case Tavily is not installed
+            return AggregateSearchResult()
+        except Exception as e:
+            logger.error(f"Error during Tavily search: {e}")
+            # Return empty results in case of any other error
+            return AggregateSearchResult()
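Every network call in these tools wraps a synchronous client in `run_in_executor` so the blocking HTTP request does not stall the event loop. The pattern, isolated with a stand-in blocking function:

```python
import asyncio
import time


def blocking_search(query: str) -> dict:
    time.sleep(0.01)  # stand-in for a synchronous HTTP client call
    return {"results": [query.upper()]}


async def search_async(query: str) -> dict:
    loop = asyncio.get_running_loop()
    # Offload the blocking call to the default thread-pool executor so the
    # event loop stays responsive while the request is in flight.
    return await loop.run_in_executor(None, lambda: blocking_search(query))


response = asyncio.run(search_async("hello"))
```

Inside a coroutine, `asyncio.get_running_loop()` is the modern spelling of the `get_event_loop()` call used in the diff; on Python 3.9+, `asyncio.to_thread(blocking_search, query)` is shorter still.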

+ 92 - 0
py/core/base/agent/tools/built_in/web_scrape.py

@@ -0,0 +1,92 @@
+import logging
+
+from core.utils import (
+    generate_id,
+)
+from shared.abstractions.tool import Tool
+
+logger = logging.getLogger(__name__)
+
+
+class WebScrapeTool(Tool):
+    """
+    A web scraping tool that uses Firecrawl to scrape a single URL and return
+    its contents in an LLM-friendly format (e.g. markdown).
+    """
+
+    def __init__(self):
+        super().__init__(
+            name="web_scrape",
+            description=(
+                "Use Firecrawl to scrape a single webpage and retrieve its contents "
+                "as clean markdown. Useful when you need the entire body of a page, "
+                "not just a quick snippet or standard web search result."
+            ),
+            parameters={
+                "type": "object",
+                "properties": {
+                    "url": {
+                        "type": "string",
+                        "description": (
+                            "The absolute URL of the webpage you want to scrape. "
+                            "Example: 'https://docs.firecrawl.dev/getting-started'"
+                        ),
+                    }
+                },
+                "required": ["url"],
+            },
+            results_function=self.execute,
+            llm_format_function=None,
+        )
+
+    async def execute(self, url: str, *args, **kwargs):
+        """
+        Performs the Firecrawl scrape asynchronously.
+        """
+        import asyncio
+
+        from firecrawl import FirecrawlApp
+
+        from core.base.abstractions import (
+            AggregateSearchResult,
+            WebPageSearchResult,
+        )
+
+        context = self.context
+        app = FirecrawlApp()
+        logger.debug(f"[Firecrawl] Scraping URL={url}")
+
+        response = await asyncio.get_event_loop().run_in_executor(
+            None,  # Uses the default executor
+            lambda: app.scrape_url(
+                url=url,
+                formats=["markdown"],
+            ),
+        )
+
+        markdown_text = response.markdown or ""
+        metadata = response.metadata or {}
+        page_title = metadata.get("title", "Untitled page")
+
+        if len(markdown_text) > 100_000:
+            markdown_text = (
+                f"{markdown_text[:100000]}...FURTHER CONTENT TRUNCATED..."
+            )
+
+        # Wrap the page in a single WebPageSearchResult (HACK - TODO: fix)
+        web_result = WebPageSearchResult(
+            title=page_title,
+            link=url,
+            snippet=markdown_text,
+            position=0,
+            id=generate_id(markdown_text),
+            type="firecrawl",
+        )
+
+        result = AggregateSearchResult(web_page_search_results=[web_result])
+
+        # Add to results collector if context is provided
+        if context and hasattr(context, "search_results_collector"):
+            context.search_results_collector.add_aggregate_result(result)
+
+        return result
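`web_scrape` keys each result by a deterministic id derived from the scraped markdown via R2R's `generate_id`, whose implementation is not shown in this diff. A plausible stdlib sketch of deterministic content ids (`uuid5` is an assumed equivalent, not necessarily what `generate_id` does):

```python
import uuid

# Any fixed namespace works; what matters is that equal content maps to equal ids.
_NAMESPACE = uuid.NAMESPACE_URL


def content_id(text: str) -> uuid.UUID:
    """Same input always yields the same UUID, so re-scrapes dedupe cleanly."""
    return uuid.uuid5(_NAMESPACE, text)
```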

+ 64 - 0
py/core/base/agent/tools/built_in/web_search.py

@@ -0,0 +1,64 @@
+from shared.abstractions.tool import Tool
+
+
+class WebSearchTool(Tool):
+    """
+    A web search tool that uses Serper to perform Google searches and returns
+    the most relevant results.
+    """
+
+    def __init__(self):
+        super().__init__(
+            name="web_search",
+            description=(
+                "Search for information on the web - use this tool when the user "
+                "query needs LIVE or recent data from the internet."
+            ),
+            parameters={
+                "type": "object",
+                "properties": {
+                    "query": {
+                        "type": "string",
+                        "description": "The query to search with an external web API.",
+                    },
+                },
+                "required": ["query"],
+            },
+            results_function=self.execute,
+            llm_format_function=None,
+        )
+
+    async def execute(self, query: str, *args, **kwargs):
+        """
+        Implementation of web search functionality.
+        """
+        import asyncio
+
+        from core.base.abstractions import (
+            AggregateSearchResult,
+            WebSearchResult,
+        )
+        from core.utils.serper import SerperClient
+
+        context = self.context
+
+        serper_client = SerperClient()
+
+        raw_results = await asyncio.get_event_loop().run_in_executor(
+            None,
+            lambda: serper_client.get_raw(query),
+        )
+
+        web_response = await asyncio.get_event_loop().run_in_executor(
+            None, lambda: WebSearchResult.from_serper_results(raw_results)
+        )
+
+        result = AggregateSearchResult(
+            web_search_results=[web_response],
+        )
+
+        # Add to results collector if context is provided
+        if context and hasattr(context, "search_results_collector"):
+            context.search_results_collector.add_aggregate_result(result)
+
+        return result

+ 195 - 0
py/core/base/agent/tools/registry.py

@@ -0,0 +1,195 @@
+import importlib
+import inspect
+import logging
+import os
+import pkgutil
+import sys
+from typing import Callable, Optional, Type
+
+from shared.abstractions.tool import Tool
+
+logger = logging.getLogger(__name__)
+
+
+class ToolRegistry:
+    """
+    Registry for discovering and managing tools from both
+    built-in sources and user-defined extensions.
+    """
+
+    def __init__(
+        self,
+        built_in_path: str | None = None,
+        user_tools_path: str | None = None,
+    ):
+        self.built_in_path = built_in_path or os.path.join(
+            os.path.dirname(os.path.abspath(__file__)), "built_in"
+        )
+        self.user_tools_path = (
+            user_tools_path
+            or os.getenv("R2R_USER_TOOLS_PATH")
+            or "../docker/user_tools"
+        )
+
+        # Tool storage
+        self._built_in_tools: dict[str, Type[Tool]] = {}
+        self._user_tools: dict[str, Type[Tool]] = {}
+
+        # Discover tools
+        self._discover_built_in_tools()
+        if os.path.exists(self.user_tools_path):
+            self._discover_user_tools()
+        else:
+            logger.warning(
+                f"User tools directory not found: {self.user_tools_path}"
+            )
+
+    def _discover_built_in_tools(self):
+        """Load all built-in tools from the built_in directory."""
+        if not os.path.exists(self.built_in_path):
+            logger.warning(
+                f"Built-in tools directory not found: {self.built_in_path}"
+            )
+            return
+
+        # Add to Python path if needed
+        if self.built_in_path not in sys.path:
+            sys.path.append(os.path.dirname(self.built_in_path))
+
+        # Import the built_in package
+        try:
+            importlib.import_module("built_in")
+        except ImportError:
+            logger.error("Failed to import built_in tools package")
+            return
+
+        # Discover all modules in the package
+        for _, module_name, is_pkg in pkgutil.iter_modules(
+            [self.built_in_path]
+        ):
+            if is_pkg:  # Skip subpackages
+                continue
+
+            try:
+                module = importlib.import_module(f"built_in.{module_name}")
+
+                # Find all tool classes in the module
+                for name, obj in inspect.getmembers(module, inspect.isclass):
+                    if (
+                        issubclass(obj, Tool)
+                        and obj.__module__ == module.__name__
+                        and obj != Tool
+                    ):
+                        try:
+                            tool_instance = obj()
+                            self._built_in_tools[tool_instance.name] = obj
+                            logger.debug(
+                                f"Loaded built-in tool: {tool_instance.name}"
+                            )
+                        except Exception as e:
+                            logger.error(
+                                f"Error instantiating built-in tool {name}: {e}"
+                            )
+            except Exception as e:
+                logger.error(
+                    f"Error loading built-in tool module {module_name}: {e}"
+                )
+
+    def _discover_user_tools(self):
+        """Scan the user tools directory for custom tools."""
+        # Add user_tools directory to Python path if needed
+        if self.user_tools_path not in sys.path:
+            sys.path.append(os.path.dirname(self.user_tools_path))
+
+        user_tools_pkg_name = os.path.basename(self.user_tools_path)
+
+        # Check all Python files in user_tools directory
+        for filename in os.listdir(self.user_tools_path):
+            if (
+                not filename.endswith(".py")
+                or filename.startswith("_")
+                or filename.startswith(".")
+            ):
+                continue
+
+            module_name = filename[:-3]  # Remove .py extension
+
+            try:
+                # Import the module
+                module = importlib.import_module(
+                    f"{user_tools_pkg_name}.{module_name}"
+                )
+
+                # Find all tool classes in the module
+                for name, obj in inspect.getmembers(module, inspect.isclass):
+                    if (
+                        issubclass(obj, Tool)
+                        and obj.__module__ == module.__name__
+                        and obj != Tool
+                    ):
+                        try:
+                            tool_instance = obj()
+                            self._user_tools[tool_instance.name] = obj
+                            logger.debug(
+                                f"Loaded user tool: {tool_instance.name}"
+                            )
+                        except Exception as e:
+                            logger.error(
+                                f"Error instantiating user tool {name}: {e}"
+                            )
+            except Exception as e:
+                logger.error(
+                    f"Error loading user tool module {module_name}: {e}"
+                )
+
+    def get_tool_class(self, tool_name: str):
+        """Get a tool class by name."""
+        if tool_name in self._user_tools:
+            return self._user_tools[tool_name]
+
+        return self._built_in_tools.get(tool_name)
+
+    def list_available_tools(
+        self, include_built_in=True, include_user=True
+    ) -> list[str]:
+        """
+        List all available tool names.
+        Optionally filter by built-in or user-defined tools.
+        """
+        tools: set[str] = set()
+
+        if include_built_in:
+            tools.update(self._built_in_tools.keys())
+
+        if include_user:
+            tools.update(self._user_tools.keys())
+
+        return sorted(tools)
+
+    def create_tool_instance(
+        self, tool_name: str, format_function: Callable, context=None
+    ) -> Optional[Tool]:
+        """
+        Create, configure, and return an instance of the specified tool.
+        Returns None if the tool doesn't exist or instantiation fails.
+        """
+        tool_class = self.get_tool_class(tool_name)
+        if not tool_class:
+            logger.warning(f"Tool class not found for '{tool_name}'")
+            return None
+
+        try:
+            tool_instance = tool_class()
+            if hasattr(tool_instance, "llm_format_function"):
+                tool_instance.llm_format_function = format_function
+
+            # Set the context on the specific tool instance
+            tool_instance.set_context(context)
+
+            return tool_instance
+
+        except Exception as e:
+            logger.error(
+                f"Error creating or setting context for tool instance '{tool_name}': {e}"
+            )
+            return None
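The registry's discovery loop boils down to: put the directory on `sys.path`, import each module, and collect classes defined in that module that subclass `Tool` (excluding `Tool` itself). A self-contained sketch of that filter against a throwaway module written to a temp directory (file and class names here are invented for the demo):

```python
import importlib
import inspect
import sys
import tempfile
from pathlib import Path

# Write a throwaway tool module to disk, standing in for a user_tools directory.
tools_dir = Path(tempfile.mkdtemp())
(tools_dir / "demo_user_tool.py").write_text(
    "class Tool:\n"
    "    name = 'base'\n"
    "\n"
    "class EchoTool(Tool):\n"
    "    name = 'echo'\n"
)
sys.path.insert(0, str(tools_dir))

module = importlib.import_module("demo_user_tool")
base = module.Tool

# Same filter as the registry: Tool subclasses defined in this module,
# excluding the base class itself.
discovered = {
    obj.name: obj
    for _, obj in inspect.getmembers(module, inspect.isclass)
    if issubclass(obj, base)
    and obj is not base
    and obj.__module__ == module.__name__
}
```

The `obj.__module__ == module.__name__` check is what keeps re-exported or imported classes from being registered twice.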

+ 208 - 0
py/core/base/api/models/__init__.py

@@ -0,0 +1,208 @@
+from shared.api.models.auth.responses import (
+    TokenResponse,
+    WrappedTokenResponse,
+)
+from shared.api.models.base import (
+    GenericBooleanResponse,
+    GenericMessageResponse,
+    PaginatedR2RResult,
+    R2RResults,
+    WrappedBooleanResponse,
+    WrappedGenericMessageResponse,
+)
+from shared.api.models.graph.responses import (  # TODO: Need to review anything above this
+    Community,
+    Entity,
+    GraphResponse,
+    Relationship,
+    WrappedCommunitiesResponse,
+    WrappedCommunityResponse,
+    WrappedEntitiesResponse,
+    WrappedEntityResponse,
+    WrappedGraphResponse,
+    WrappedGraphsResponse,
+    WrappedRelationshipResponse,
+    WrappedRelationshipsResponse,
+)
+from shared.api.models.ingestion.responses import (
+    IngestionResponse,
+    UpdateResponse,
+    VectorIndexResponse,
+    VectorIndicesResponse,
+    WrappedIngestionResponse,
+    WrappedMetadataUpdateResponse,
+    WrappedUpdateResponse,
+    WrappedVectorIndexResponse,
+    WrappedVectorIndicesResponse,
+)
+from shared.api.models.management.responses import (  # Document Responses; Prompt Responses; Chunk Responses; Conversation Responses; User Responses; TODO: anything below this hasn't been reviewed
+    ChunkResponse,
+    CollectionResponse,
+    ConversationResponse,
+    MessageResponse,
+    PromptResponse,
+    ServerStats,
+    SettingsResponse,
+    User,
+    WrappedAPIKeyResponse,
+    WrappedAPIKeysResponse,
+    WrappedChunkResponse,
+    WrappedChunksResponse,
+    WrappedCollectionResponse,
+    WrappedCollectionsResponse,
+    WrappedConversationMessagesResponse,
+    WrappedConversationResponse,
+    WrappedConversationsResponse,
+    WrappedDocumentResponse,
+    WrappedDocumentsResponse,
+    WrappedLimitsResponse,
+    WrappedLoginResponse,
+    WrappedMessageResponse,
+    WrappedMessagesResponse,
+    WrappedPromptResponse,
+    WrappedPromptsResponse,
+    WrappedServerStatsResponse,
+    WrappedSettingsResponse,
+    WrappedUserResponse,
+    WrappedUsersResponse,
+)
+from shared.api.models.retrieval.responses import (
+    AgentEvent,
+    AgentResponse,
+    Citation,
+    CitationData,
+    CitationEvent,
+    Delta,
+    DeltaPayload,
+    FinalAnswerData,
+    FinalAnswerEvent,
+    MessageData,
+    MessageDelta,
+    MessageEvent,
+    RAGEvent,
+    RAGResponse,
+    SearchResultsData,
+    SearchResultsEvent,
+    SSEEventBase,
+    ThinkingData,
+    ThinkingEvent,
+    ToolCallData,
+    ToolCallEvent,
+    ToolResultData,
+    ToolResultEvent,
+    UnknownEvent,
+    WrappedAgentResponse,
+    WrappedCompletionResponse,
+    WrappedDocumentSearchResponse,
+    WrappedEmbeddingResponse,
+    WrappedLLMChatCompletion,
+    WrappedRAGResponse,
+    WrappedSearchResponse,
+    WrappedVectorSearchResponse,
+)
+
+__all__ = [
+    # Auth Responses
+    "TokenResponse",
+    "WrappedTokenResponse",
+    "WrappedGenericMessageResponse",
+    # Ingestion Responses
+    "IngestionResponse",
+    "WrappedIngestionResponse",
+    "WrappedUpdateResponse",
+    "WrappedMetadataUpdateResponse",
+    "WrappedVectorIndexResponse",
+    "WrappedVectorIndicesResponse",
+    "UpdateResponse",
+    "VectorIndexResponse",
+    "VectorIndicesResponse",
+    # Knowledge Graph Responses
+    "Entity",
+    "Relationship",
+    "Community",
+    "WrappedEntityResponse",
+    "WrappedEntitiesResponse",
+    "WrappedRelationshipResponse",
+    "WrappedRelationshipsResponse",
+    "WrappedCommunityResponse",
+    "WrappedCommunitiesResponse",
+    # TODO: Need to review anything above this
+    "GraphResponse",
+    "WrappedGraphResponse",
+    "WrappedGraphsResponse",
+    # Management Responses
+    "PromptResponse",
+    "ServerStats",
+    "SettingsResponse",
+    "ChunkResponse",
+    "CollectionResponse",
+    "WrappedServerStatsResponse",
+    "WrappedSettingsResponse",
+    "WrappedDocumentResponse",
+    "WrappedDocumentsResponse",
+    "WrappedCollectionResponse",
+    "WrappedCollectionsResponse",
+    # Conversation Responses
+    "ConversationResponse",
+    "WrappedConversationMessagesResponse",
+    "WrappedConversationResponse",
+    "WrappedConversationsResponse",
+    # Prompt Responses
+    "WrappedPromptResponse",
+    "WrappedPromptsResponse",
+    # Conversation Responses
+    "MessageResponse",
+    "WrappedMessageResponse",
+    "WrappedMessagesResponse",
+    # Chunk Responses
+    "WrappedChunkResponse",
+    "WrappedChunksResponse",
+    # User Responses
+    "User",
+    "WrappedUserResponse",
+    "WrappedUsersResponse",
+    "WrappedAPIKeyResponse",
+    "WrappedLimitsResponse",
+    "WrappedAPIKeysResponse",
+    "WrappedLoginResponse",
+    # Base Responses
+    "PaginatedR2RResult",
+    "R2RResults",
+    "GenericBooleanResponse",
+    "GenericMessageResponse",
+    "WrappedBooleanResponse",
+    "WrappedGenericMessageResponse",
+    # Retrieval Responses
+    "SSEEventBase",
+    "SearchResultsData",
+    "SearchResultsEvent",
+    "MessageDelta",
+    "MessageData",
+    "MessageEvent",
+    "DeltaPayload",
+    "Delta",
+    "CitationData",
+    "CitationEvent",
+    "FinalAnswerData",
+    "FinalAnswerEvent",
+    "ToolCallData",
+    "ToolCallEvent",
+    "ToolResultData",
+    "ToolResultEvent",
+    "ThinkingData",
+    "ThinkingEvent",
+    "RAGEvent",
+    "AgentEvent",
+    "UnknownEvent",
+    "RAGResponse",
+    "Citation",
+    "AgentResponse",
+    "WrappedDocumentSearchResponse",
+    "WrappedSearchResponse",
+    "WrappedVectorSearchResponse",
+    "WrappedCompletionResponse",
+    "WrappedRAGResponse",
+    "WrappedAgentResponse",
+    "WrappedLLMChatCompletion",
+    "WrappedEmbeddingResponse",
+]

+ 5 - 0
py/core/base/parsers/__init__.py

@@ -0,0 +1,5 @@
+from .base_parser import AsyncParser
+
+__all__ = [
+    "AsyncParser",
+]

+ 12 - 0
py/core/base/parsers/base_parser.py

@@ -0,0 +1,12 @@
+"""Abstract base class for parsers."""
+
+from abc import ABC, abstractmethod
+from typing import AsyncGenerator, Generic, TypeVar
+
+T = TypeVar("T")
+
+
+class AsyncParser(ABC, Generic[T]):
+    @abstractmethod
+    async def ingest(self, data: T, **kwargs) -> AsyncGenerator[str, None]:
+        pass
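The `AsyncParser` interface above can be exercised with a short, self-contained sketch (not part of the commit; `TextParser` and its chunking behavior are hypothetical illustrations of one possible subclass):

```python
import asyncio
from abc import ABC, abstractmethod
from typing import AsyncGenerator, Generic, TypeVar

T = TypeVar("T")


class AsyncParser(ABC, Generic[T]):
    @abstractmethod
    async def ingest(self, data: T, **kwargs) -> AsyncGenerator[str, None]: ...


class TextParser(AsyncParser[bytes]):
    """Hypothetical parser: decode bytes and yield fixed-size text chunks."""

    async def ingest(self, data: bytes, **kwargs) -> AsyncGenerator[str, None]:
        text = data.decode("utf-8")
        chunk_size = kwargs.get("chunk_size", 1024)
        for i in range(0, len(text), chunk_size):
            yield text[i : i + chunk_size]


async def _demo() -> list[str]:
    parser = TextParser()
    return [chunk async for chunk in parser.ingest(b"hello world", chunk_size=5)]


chunks = asyncio.run(_demo())
```

Because `ingest` is an async generator, callers consume it with `async for`, which lets parsers stream large files without materializing the full text.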

+ 69 - 0
py/core/base/providers/__init__.py

@@ -0,0 +1,69 @@
+from .auth import AuthConfig, AuthProvider
+from .base import AppConfig, Provider, ProviderConfig
+from .crypto import CryptoConfig, CryptoProvider
+from .database import (
+    DatabaseConfig,
+    DatabaseConnectionManager,
+    DatabaseProvider,
+    Handler,
+    LimitSettings,
+    PostgresConfigurationSettings,
+)
+from .email import EmailConfig, EmailProvider
+from .embedding import EmbeddingConfig, EmbeddingProvider
+from .file import FileConfig, FileProvider
+from .ingestion import (
+    ChunkingStrategy,
+    IngestionConfig,
+    IngestionProvider,
+)
+from .llm import CompletionConfig, CompletionProvider
+from .ocr import OCRConfig, OCRProvider
+from .orchestration import OrchestrationConfig, OrchestrationProvider, Workflow
+from .scheduler import SchedulerConfig, SchedulerProvider
+
+__all__ = [
+    # Auth provider
+    "AuthConfig",
+    "AuthProvider",
+    # Base provider classes
+    "AppConfig",
+    "Provider",
+    "ProviderConfig",
+    # Crypto provider
+    "CryptoConfig",
+    "CryptoProvider",
+    # Database providers
+    "DatabaseConnectionManager",
+    "DatabaseConfig",
+    "LimitSettings",
+    "PostgresConfigurationSettings",
+    "DatabaseProvider",
+    "Handler",
+    # Email provider
+    "EmailConfig",
+    "EmailProvider",
+    # Embedding provider
+    "EmbeddingConfig",
+    "EmbeddingProvider",
+    # File provider
+    "FileConfig",
+    "FileProvider",
+    # Ingestion provider
+    "IngestionConfig",
+    "IngestionProvider",
+    "ChunkingStrategy",
+    # LLM provider
+    "CompletionConfig",
+    "CompletionProvider",
+    # OCR provider
+    "OCRConfig",
+    "OCRProvider",
+    # Orchestration provider
+    "OrchestrationConfig",
+    "OrchestrationProvider",
+    "Workflow",
+    # Scheduler provider
+    "SchedulerConfig",
+    "SchedulerProvider",
+]

+ 231 - 0
py/core/base/providers/auth.py

@@ -0,0 +1,231 @@
+import logging
+from abc import ABC, abstractmethod
+from datetime import datetime
+from typing import TYPE_CHECKING, Optional
+
+from fastapi import Security
+from fastapi.security import (
+    APIKeyHeader,
+    HTTPAuthorizationCredentials,
+    HTTPBearer,
+)
+
+from ..abstractions import R2RException, Token, TokenData
+from ..api.models import User
+from .base import Provider, ProviderConfig
+from .crypto import CryptoProvider
+from .email import EmailProvider
+
+logger = logging.getLogger()
+
+if TYPE_CHECKING:
+    from core.providers.database import PostgresDatabaseProvider
+
+api_key_header = APIKeyHeader(name="X-API-Key", auto_error=False)
+
+
+class AuthConfig(ProviderConfig):
+    secret_key: Optional[str] = None
+    require_authentication: bool = False
+    require_email_verification: bool = False
+    default_admin_email: str = "admin@example.com"
+    default_admin_password: str = "change_me_immediately"
+    access_token_lifetime_in_minutes: Optional[int] = None
+    refresh_token_lifetime_in_days: Optional[int] = None
+
+    @property
+    def supported_providers(self) -> list[str]:
+        return ["r2r"]
+
+    def validate_config(self) -> None:
+        pass
+
+
+class AuthProvider(Provider, ABC):
+    security = HTTPBearer(auto_error=False)
+    crypto_provider: CryptoProvider
+    email_provider: EmailProvider
+    database_provider: "PostgresDatabaseProvider"
+
+    def __init__(
+        self,
+        config: AuthConfig,
+        crypto_provider: CryptoProvider,
+        database_provider: "PostgresDatabaseProvider",
+        email_provider: EmailProvider,
+    ):
+        if not isinstance(config, AuthConfig):
+            raise ValueError(
+                "AuthProvider must be initialized with an AuthConfig"
+            )
+        self.config = config
+        self.admin_email = config.default_admin_email
+        self.admin_password = config.default_admin_password
+        self.crypto_provider = crypto_provider
+        self.database_provider = database_provider
+        self.email_provider = email_provider
+        super().__init__(config)
+        self.config: AuthConfig = config
+        self.database_provider: "PostgresDatabaseProvider" = database_provider
+
+    async def _get_default_admin_user(self) -> User:
+        return await self.database_provider.users_handler.get_user_by_email(
+            self.admin_email
+        )
+
+    @abstractmethod
+    def create_access_token(self, data: dict) -> str:
+        pass
+
+    @abstractmethod
+    def create_refresh_token(self, data: dict) -> str:
+        pass
+
+    @abstractmethod
+    async def decode_token(self, token: str) -> TokenData:
+        pass
+
+    @abstractmethod
+    async def user(self, token: str) -> User:
+        pass
+
+    @abstractmethod
+    def get_current_active_user(self, current_user: User) -> User:
+        pass
+
+    @abstractmethod
+    async def register(self, email: str, password: str) -> User:
+        pass
+
+    @abstractmethod
+    async def send_verification_email(
+        self, email: str, user: Optional[User] = None
+    ) -> tuple[str, datetime]:
+        pass
+
+    @abstractmethod
+    async def verify_email(
+        self, email: str, verification_code: str
+    ) -> dict[str, str]:
+        pass
+
+    @abstractmethod
+    async def login(self, email: str, password: str) -> dict[str, Token]:
+        pass
+
+    @abstractmethod
+    async def refresh_access_token(
+        self, refresh_token: str
+    ) -> dict[str, Token]:
+        pass
+
+    def auth_wrapper(
+        self,
+        public: bool = False,
+    ):
+        async def _auth_wrapper(
+            auth: Optional[HTTPAuthorizationCredentials] = Security(
+                self.security
+            ),
+            api_key: Optional[str] = Security(api_key_header),
+        ) -> User:
+            # If authentication is not required and no credentials are provided, return the default admin user
+            if (
+                ((not self.config.require_authentication) or public)
+                and auth is None
+                and api_key is None
+            ):
+                return await self._get_default_admin_user()
+            if not auth and not api_key:
+                raise R2RException(
+                    message="No credentials provided. Create an account at https://app.sciphi.ai and set your API key using `r2r configure key` OR change your base URL to a custom deployment.",
+                    status_code=401,
+                )
+            if auth and api_key:
+                raise R2RException(
+                    message="Cannot have both Bearer token and API key",
+                    status_code=400,
+                )
+            # 1. Try JWT if `auth` is present (Bearer token)
+            if auth is not None:
+                credentials = auth.credentials
+                try:
+                    token_data = await self.decode_token(credentials)
+                    user = await self.database_provider.users_handler.get_user_by_email(
+                        token_data.email
+                    )
+                    if user is not None:
+                        return user
+                except R2RException:
+                    # JWT decoding failed for logical reasons (invalid token)
+                    pass
+                except Exception as e:
+                    # JWT decoding failed unexpectedly, log and continue
+                    logger.debug(f"JWT verification failed: {e}")
+
+                # 2. If JWT failed, try API key from Bearer token
+                # Expected format: key_id.raw_api_key
+                if "." in credentials:
+                    key_id, raw_api_key = credentials.split(".", 1)
+                    api_key_record = await self.database_provider.users_handler.get_api_key_record(
+                        key_id
+                    )
+                    if api_key_record is not None:
+                        hashed_key = api_key_record["hashed_key"]
+                        if self.crypto_provider.verify_api_key(
+                            raw_api_key, hashed_key
+                        ):
+                            user = await self.database_provider.users_handler.get_user_by_id(
+                                api_key_record["user_id"]
+                            )
+                            if user is not None and user.is_active:
+                                return user
+
+            # 3. If no Bearer token worked, try the X-API-Key header
+            if api_key is not None and "." in api_key:
+                key_id, raw_api_key = api_key.split(".", 1)
+                api_key_record = await self.database_provider.users_handler.get_api_key_record(
+                    key_id
+                )
+                if api_key_record is not None:
+                    hashed_key = api_key_record["hashed_key"]
+                    if self.crypto_provider.verify_api_key(
+                        raw_api_key, hashed_key
+                    ):
+                        user = await self.database_provider.users_handler.get_user_by_id(
+                            api_key_record["user_id"]
+                        )
+                        if user is not None and user.is_active:
+                            return user
+
+            # If we reach here, both JWT and API key auth failed
+            raise R2RException(
+                message="Invalid token or API key",
+                status_code=401,
+            )
+
+        return _auth_wrapper
+
+    @abstractmethod
+    async def change_password(
+        self, user: User, current_password: str, new_password: str
+    ) -> dict[str, str]:
+        pass
+
+    @abstractmethod
+    async def request_password_reset(self, email: str) -> dict[str, str]:
+        pass
+
+    @abstractmethod
+    async def confirm_password_reset(
+        self, reset_token: str, new_password: str
+    ) -> dict[str, str]:
+        pass
+
+    @abstractmethod
+    async def logout(self, token: str) -> dict[str, str]:
+        pass
+
+    @abstractmethod
+    async def send_reset_email(self, email: str) -> dict[str, str]:
+        pass
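The credential precedence inside `auth_wrapper` above (try JWT first, then fall back to treating the Bearer token as `key_id.raw_api_key`) can be sketched in isolation. This is illustrative only; `decode_jwt` and `lookup_api_key` are hypothetical stand-ins for the provider's `decode_token` and API-key lookup:

```python
from typing import Callable, Optional


def resolve_bearer(
    credentials: str,
    decode_jwt: Callable[[str], str],
    lookup_api_key: Callable[[str, str], Optional[str]],
) -> Optional[str]:
    # 1. Try to interpret the Bearer token as a JWT.
    try:
        return decode_jwt(credentials)
    except ValueError:
        pass
    # 2. Fall back to the API-key format: "key_id.raw_api_key".
    if "." in credentials:
        key_id, raw_key = credentials.split(".", 1)
        return lookup_api_key(key_id, raw_key)
    return None


# Stub callbacks for demonstration:
def decode_jwt(token: str) -> str:
    if token == "valid.jwt.token":
        return "jwt-user"
    raise ValueError("bad JWT")


def lookup_api_key(key_id: str, raw_key: str) -> Optional[str]:
    return "key-user" if (key_id, raw_key) == ("kid", "secret") else None
```

Note the `split(".", 1)`: only the first dot separates the key id, so raw keys may themselves contain dots.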

+ 135 - 0
py/core/base/providers/base.py

@@ -0,0 +1,135 @@
+from abc import ABC, abstractmethod
+from typing import Any, Optional, Type
+
+from pydantic import BaseModel
+
+
+class InnerConfig(BaseModel, ABC):
+    """A base provider configuration class."""
+
+    extra_fields: dict[str, Any] = {}
+
+    class Config:
+        populate_by_name = True
+        arbitrary_types_allowed = True
+        ignore_extra = True
+
+    @classmethod
+    def create(cls: Type["InnerConfig"], **kwargs: Any) -> "InnerConfig":
+        base_args = cls.model_fields.keys()
+        filtered_kwargs = {
+            k: v if v != "None" else None
+            for k, v in kwargs.items()
+            if k in base_args
+        }
+        instance = cls(**filtered_kwargs)  # type: ignore
+        for k, v in kwargs.items():
+            if k not in base_args:
+                instance.extra_fields[k] = v
+        return instance
+
+
+class AppConfig(InnerConfig):
+    project_name: Optional[str] = None
+    user_tools_path: Optional[str] = None
+    default_max_documents_per_user: Optional[int] = 100
+    default_max_chunks_per_user: Optional[int] = 10_000
+    default_max_collections_per_user: Optional[int] = 5
+    default_max_upload_size: int = 2_000_000  # e.g. ~2 MB
+    quality_llm: Optional[str] = None
+    fast_llm: Optional[str] = None
+    vlm: Optional[str] = None
+    audio_lm: Optional[str] = None
+    reasoning_llm: Optional[str] = None
+    planning_llm: Optional[str] = None
+
+    # File extension to max-size mapping
+    # These are examples; adjust sizes as needed.
+    max_upload_size_by_type: dict[str, int] = {
+        # Common text-based formats
+        "txt": 2_000_000,
+        "md": 2_000_000,
+        "tsv": 2_000_000,
+        "csv": 5_000_000,
+        "html": 5_000_000,
+        # Office docs
+        "doc": 10_000_000,
+        "docx": 10_000_000,
+        "ppt": 20_000_000,
+        "pptx": 20_000_000,
+        "xls": 10_000_000,
+        "xlsx": 10_000_000,
+        "odt": 5_000_000,
+        # PDFs can expand quite a bit when converted to text
+        "pdf": 30_000_000,
+        # E-mail
+        "eml": 5_000_000,
+        "msg": 5_000_000,
+        "p7s": 5_000_000,
+        # Images
+        "bmp": 5_000_000,
+        "heic": 5_000_000,
+        "jpeg": 5_000_000,
+        "jpg": 5_000_000,
+        "png": 5_000_000,
+        "tiff": 5_000_000,
+        # Others
+        "epub": 10_000_000,
+        "rtf": 5_000_000,
+        "rst": 5_000_000,
+        "org": 5_000_000,
+    }
+
+
+class ProviderConfig(BaseModel, ABC):
+    """A base provider configuration class."""
+
+    app: Optional[AppConfig] = None  # Shared application-level configuration
+    extra_fields: dict[str, Any] = {}
+    provider: Optional[str] = None
+
+    class Config:
+        populate_by_name = True
+        arbitrary_types_allowed = True
+        ignore_extra = True
+
+    @abstractmethod
+    def validate_config(self) -> None:
+        pass
+
+    @classmethod
+    def create(cls: Type["ProviderConfig"], **kwargs: Any) -> "ProviderConfig":
+        base_args = cls.model_fields.keys()
+        filtered_kwargs = {
+            k: v if v != "None" else None
+            for k, v in kwargs.items()
+            if k in base_args
+        }
+        instance = cls(**filtered_kwargs)  # type: ignore
+        for k, v in kwargs.items():
+            if k not in base_args:
+                instance.extra_fields[k] = v
+        return instance
+
+    @property
+    @abstractmethod
+    def supported_providers(self) -> list[str]:
+        """Define a list of supported providers."""
+        pass
+
+    @classmethod
+    def from_dict(
+        cls: Type["ProviderConfig"], data: dict[str, Any]
+    ) -> "ProviderConfig":
+        """Create a new instance of the config from a dictionary."""
+        return cls.create(**data)
+
+
+class Provider(ABC):
+    """A base provider class to provide a common interface for all
+    providers."""
+
+    def __init__(self, config: ProviderConfig, *args, **kwargs):
+        if config:
+            config.validate_config()
+        self.config = config
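The `create` classmethods above route known kwargs into model fields (mapping the literal string `"None"` to `None`) and stash unknown kwargs in `extra_fields`. A plain-Python sketch of that pattern, with pydantic omitted and `known_fields` standing in for `model_fields` (everything here is a hypothetical illustration):

```python
from typing import Any


class ConfigSketch:
    known_fields = {"name", "provider"}

    def __init__(self, **kwargs: Any):
        self.name = kwargs.get("name")
        self.provider = kwargs.get("provider")
        self.extra_fields: dict[str, Any] = {}

    @classmethod
    def create(cls, **kwargs: Any) -> "ConfigSketch":
        # Keep known fields (coercing the string "None" to None); stash the rest.
        filtered = {
            k: (None if v == "None" else v)
            for k, v in kwargs.items()
            if k in cls.known_fields
        }
        instance = cls(**filtered)
        for k, v in kwargs.items():
            if k not in cls.known_fields:
                instance.extra_fields[k] = v
        return instance


cfg = ConfigSketch.create(name="None", provider="postgres", custom_flag=True)
```

The `"None"` → `None` coercion matters because TOML/env-sourced values often arrive as strings; the `extra_fields` dict keeps forward-compatible settings from being silently dropped.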

+ 120 - 0
py/core/base/providers/crypto.py

@@ -0,0 +1,120 @@
+from abc import ABC, abstractmethod
+from datetime import datetime
+from typing import Optional, Tuple
+
+from .base import Provider, ProviderConfig
+
+
+class CryptoConfig(ProviderConfig):
+    provider: Optional[str] = None
+
+    @property
+    def supported_providers(self) -> list[str]:
+        return ["bcrypt", "nacl"]
+
+    def validate_config(self) -> None:
+        if self.provider not in self.supported_providers:
+            raise ValueError(f"Unsupported crypto provider: {self.provider}")
+
+
+class CryptoProvider(Provider, ABC):
+    def __init__(self, config: CryptoConfig):
+        if not isinstance(config, CryptoConfig):
+            raise ValueError(
+                "CryptoProvider must be initialized with a CryptoConfig"
+            )
+        super().__init__(config)
+
+    @abstractmethod
+    def get_password_hash(self, password: str) -> str:
+        """Hash a plaintext password using a secure password hashing algorithm
+        (e.g., Argon2i)."""
+        pass
+
+    @abstractmethod
+    def verify_password(
+        self, plain_password: str, hashed_password: str
+    ) -> bool:
+        """Verify that a plaintext password matches the given hashed
+        password."""
+        pass
+
+    @abstractmethod
+    def generate_verification_code(self, length: int = 32) -> str:
+        """Generate a random code for email verification or reset tokens."""
+        pass
+
+    @abstractmethod
+    def generate_signing_keypair(self) -> Tuple[str, str, str]:
+        """Generate a new Ed25519 signing keypair for request signing.
+
+        Returns:
+            A tuple of (key_id, private_key, public_key).
+            - key_id: A unique identifier for this keypair.
+            - private_key: Base64 encoded Ed25519 private key.
+            - public_key: Base64 encoded Ed25519 public key.
+        """
+        pass
+
+    @abstractmethod
+    def sign_request(self, private_key: str, data: str) -> str:
+        """Sign request data with an Ed25519 private key, returning the
+        signature."""
+        pass
+
+    @abstractmethod
+    def verify_request_signature(
+        self, public_key: str, signature: str, data: str
+    ) -> bool:
+        """Verify a request signature using the corresponding Ed25519 public
+        key."""
+        pass
+
+    @abstractmethod
+    def generate_api_key(self) -> Tuple[str, str]:
+        """Generate a new API key for a user.
+
+        Returns:
+            A tuple (key_id, raw_api_key):
+            - key_id: A unique identifier for the API key.
+            - raw_api_key: The plaintext API key to provide to the user.
+        """
+        pass
+
+    @abstractmethod
+    def hash_api_key(self, raw_api_key: str) -> str:
+        """Hash a raw API key for secure storage in the database.
+
+        Use strong parameters suitable for long-term secrets.
+        """
+        pass
+
+    @abstractmethod
+    def verify_api_key(self, raw_api_key: str, hashed_key: str) -> bool:
+        """Verify that a provided API key matches the stored hashed version."""
+        pass
+
+    @abstractmethod
+    def generate_secure_token(self, data: dict, expiry: datetime) -> str:
+        """Generate a secure, signed token (e.g., JWT) embedding claims.
+
+        Args:
+            data: The claims to include in the token.
+            expiry: A datetime at which the token expires.
+
+        Returns:
+            A JWT string signed with a secret key.
+        """
+        pass
+
+    @abstractmethod
+    def verify_secure_token(self, token: str) -> Optional[dict]:
+        """Verify a secure token (e.g., JWT).
+
+        Args:
+            token: The token string to verify.
+
+        Returns:
+            The token payload if valid, otherwise None.
+        """
+        pass
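The `generate_api_key` / `hash_api_key` / `verify_api_key` lifecycle above can be sketched with stdlib stand-ins. This is illustrative only: the real providers are bcrypt/NaCl-backed, and a production `hash_api_key` would use a slow KDF rather than plain SHA-256:

```python
import hashlib
import secrets


def generate_api_key() -> tuple[str, str]:
    # (key_id, raw_api_key) — the raw key is shown to the user exactly once.
    return secrets.token_hex(8), secrets.token_urlsafe(32)


def hash_api_key(raw_api_key: str) -> str:
    # Stand-in hash for illustration; see docstring caveat above.
    return hashlib.sha256(raw_api_key.encode()).hexdigest()


def verify_api_key(raw_api_key: str, hashed_key: str) -> bool:
    # Constant-time comparison avoids leaking prefix matches via timing.
    return secrets.compare_digest(hash_api_key(raw_api_key), hashed_key)


key_id, raw = generate_api_key()
stored = hash_api_key(raw)  # only the hash is persisted
```

The split into `key_id` (stored in the clear, used for lookup) and `raw_api_key` (only its hash stored) is what lets `auth_wrapper` resolve `key_id.raw_api_key` Bearer tokens without keeping plaintext secrets in the database.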

+ 208 - 0
py/core/base/providers/database.py

@@ -0,0 +1,208 @@
+"""Base classes for database providers."""
+
+import logging
+from abc import ABC, abstractmethod
+from typing import Any, Optional, Sequence, cast
+from uuid import UUID
+
+from pydantic import BaseModel
+
+from core.base.abstractions import (
+    GraphCreationSettings,
+    GraphEnrichmentSettings,
+    GraphSearchSettings,
+)
+from core.utils.context import get_current_project_schema
+
+from .base import Provider, ProviderConfig
+
+logger = logging.getLogger()
+
+
+class DatabaseConnectionManager(ABC):
+    @abstractmethod
+    def execute_query(
+        self,
+        query: str,
+        params: Optional[dict[str, Any] | Sequence[Any]] = None,
+        isolation_level: Optional[str] = None,
+    ):
+        pass
+
+    @abstractmethod
+    async def execute_many(self, query, params=None, batch_size=1000):
+        pass
+
+    @abstractmethod
+    def fetch_query(
+        self,
+        query: str,
+        params: Optional[dict[str, Any] | Sequence[Any]] = None,
+    ):
+        pass
+
+    @abstractmethod
+    def fetchrow_query(
+        self,
+        query: str,
+        params: Optional[dict[str, Any] | Sequence[Any]] = None,
+    ):
+        pass
+
+    @abstractmethod
+    async def initialize(self, pool: Any):
+        pass
+
+
+class Handler(ABC):
+    def __init__(
+        self,
+        project_name: str,
+        connection_manager: DatabaseConnectionManager,
+    ):
+        self.project_name = project_name
+        self.connection_manager = connection_manager
+
+    def _get_table_name(self, base_name: str) -> str:
+        """Get the full qualified table name with the current project schema."""
+        return f'"{get_current_project_schema() or self.project_name}"."{base_name}"'
+
+    @abstractmethod
+    def create_tables(self):
+        pass
+
+
+class PostgresConfigurationSettings(BaseModel):
+    """Configuration settings with defaults matching the pgvector Docker
+    image.
+
+    These settings are helpful in managing the connections to the database. To
+    tune these settings for a specific deployment, see
+    https://pgtune.leopard.in.ua/
+    """
+
+    checkpoint_completion_target: Optional[float] = 0.9
+    default_statistics_target: Optional[int] = 100
+    effective_io_concurrency: Optional[int] = 1
+    effective_cache_size: Optional[int] = 524288
+    huge_pages: Optional[str] = "try"
+    maintenance_work_mem: Optional[int] = 65536
+    max_connections: Optional[int] = 256
+    max_parallel_workers_per_gather: Optional[int] = 2
+    max_parallel_workers: Optional[int] = 8
+    max_parallel_maintenance_workers: Optional[int] = 2
+    max_wal_size: Optional[int] = 1024
+    max_worker_processes: Optional[int] = 8
+    min_wal_size: Optional[int] = 80
+    shared_buffers: Optional[int] = 16384
+    statement_cache_size: Optional[int] = 100
+    random_page_cost: Optional[float] = 4
+    wal_buffers: Optional[int] = 512
+    work_mem: Optional[int] = 4096
+
+
+class LimitSettings(BaseModel):
+    global_per_min: Optional[int] = None
+    route_per_min: Optional[int] = None
+    monthly_limit: Optional[int] = None
+
+    def merge_with_defaults(
+        self, defaults: "LimitSettings"
+    ) -> "LimitSettings":
+        return LimitSettings(
+            global_per_min=self.global_per_min or defaults.global_per_min,
+            route_per_min=self.route_per_min or defaults.route_per_min,
+            monthly_limit=self.monthly_limit or defaults.monthly_limit,
+        )
+
+
+class MaintenanceSettings(BaseModel):
+    vacuum_schedule: str = "0 3 * * *"  # Run at 3 AM every day by default
+    vacuum_analyze: bool = True
+    vacuum_full: bool = False
+
+
+class DatabaseConfig(ProviderConfig):
+    """A base database configuration class."""
+
+    provider: str = "postgres"
+    user: Optional[str] = None
+    password: Optional[str] = None
+    host: Optional[str] = None
+    port: Optional[int] = None
+    db_name: Optional[str] = None
+    project_name: Optional[str] = None
+    postgres_configuration_settings: Optional[
+        PostgresConfigurationSettings
+    ] = None
+    default_collection_name: str = "Default"
+    default_collection_description: str = "Your default collection."
+    collection_summary_system_prompt: str = "system"
+    collection_summary_prompt: str = "collection_summary"
+    disable_create_extension: bool = False
+
+    # Graph settings
+    batch_size: Optional[int] = 1
+    graph_search_results_store_path: Optional[str] = None
+    graph_enrichment_settings: GraphEnrichmentSettings = (
+        GraphEnrichmentSettings()
+    )
+    graph_creation_settings: GraphCreationSettings = GraphCreationSettings()
+    graph_search_settings: GraphSearchSettings = GraphSearchSettings()
+
+    # Rate limits
+    limits: LimitSettings = LimitSettings(
+        global_per_min=60, route_per_min=20, monthly_limit=10000
+    )
+
+    # Maintenance settings
+    maintenance: MaintenanceSettings = MaintenanceSettings()
+    route_limits: dict[str, LimitSettings] = {}
+    user_limits: dict[UUID, LimitSettings] = {}
+
+    def validate_config(self) -> None:
+        if self.provider not in self.supported_providers:
+            raise ValueError(f"Provider '{self.provider}' is not supported.")
+
+    @property
+    def supported_providers(self) -> list[str]:
+        return ["postgres"]
+
+    @classmethod
+    def from_dict(cls, data: dict[str, Any]) -> "DatabaseConfig":
+        instance = cls.create(**data)
+
+        instance = cast(DatabaseConfig, instance)
+
+        limits_data = data.get("limits", {})
+        default_limits = LimitSettings(
+            global_per_min=limits_data.get("global_per_min", 60),
+            route_per_min=limits_data.get("route_per_min", 20),
+            monthly_limit=limits_data.get("monthly_limit", 10000),
+        )
+
+        instance.limits = default_limits
+
+        route_limits_data = limits_data.get("routes", {})
+        for route_str, route_cfg in route_limits_data.items():
+            instance.route_limits[route_str] = LimitSettings(**route_cfg)
+
+        return instance
+
+
+class DatabaseProvider(Provider):
+    connection_manager: DatabaseConnectionManager
+    config: DatabaseConfig
+    project_name: str
+
+    def __init__(self, config: DatabaseConfig):
+        logger.info(f"Initializing DatabaseProvider with config {config}.")
+        super().__init__(config)
+
+    @abstractmethod
+    async def __aenter__(self):
+        pass
+
+    @abstractmethod
+    async def __aexit__(self, exc_type, exc, tb):
+        pass
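The `LimitSettings.merge_with_defaults` logic above (per-route overrides falling back to instance-wide defaults field by field) can be shown with a dataclass stand-in for the pydantic model:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class LimitSettings:
    global_per_min: Optional[int] = None
    route_per_min: Optional[int] = None
    monthly_limit: Optional[int] = None

    def merge_with_defaults(self, defaults: "LimitSettings") -> "LimitSettings":
        # Any field left unset (None) falls back to the default's value.
        return LimitSettings(
            global_per_min=self.global_per_min or defaults.global_per_min,
            route_per_min=self.route_per_min or defaults.route_per_min,
            monthly_limit=self.monthly_limit or defaults.monthly_limit,
        )


defaults = LimitSettings(global_per_min=60, route_per_min=20, monthly_limit=10000)
route = LimitSettings(route_per_min=5)  # only overrides the per-route rate
merged = route.merge_with_defaults(defaults)
```

One caveat of the `or` idiom: an explicit `0` is falsy and would also fall back to the default, so "unlimited" cannot be expressed as `0` here.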

+ 96 - 0
py/core/base/providers/email.py

@@ -0,0 +1,96 @@
+import logging
+import os
+from abc import ABC, abstractmethod
+from typing import Optional
+
+from .base import Provider, ProviderConfig
+
+
+class EmailConfig(ProviderConfig):
+    smtp_server: Optional[str] = None
+    smtp_port: Optional[int] = None
+    smtp_username: Optional[str] = None
+    smtp_password: Optional[str] = None
+    from_email: Optional[str] = None
+    use_tls: Optional[bool] = True
+    sendgrid_api_key: Optional[str] = None
+    mailersend_api_key: Optional[str] = None
+    verify_email_template_id: Optional[str] = None
+    reset_password_template_id: Optional[str] = None
+    password_changed_template_id: Optional[str] = None
+    frontend_url: Optional[str] = None
+    sender_name: Optional[str] = None
+
+    @property
+    def supported_providers(self) -> list[str]:
+        return [
+            "smtp",
+            "console",
+            "sendgrid",
+            "mailersend",
+        ]  # Could add more providers like AWS SES, Mailgun, etc.
+
+    def validate_config(self) -> None:
+        if (
+            self.provider == "sendgrid"
+            and not self.sendgrid_api_key
+            and not os.getenv("SENDGRID_API_KEY")
+        ):
+            raise ValueError(
+                "SendGrid API key is required when using SendGrid provider"
+            )
+
+        if (
+            self.provider == "mailersend"
+            and not self.mailersend_api_key
+            and not os.getenv("MAILERSEND_API_KEY")
+        ):
+            raise ValueError(
+                "MailerSend API key is required when using MailerSend provider"
+            )
+
+
+logger = logging.getLogger(__name__)
+
+
+class EmailProvider(Provider, ABC):
+    def __init__(self, config: EmailConfig):
+        if not isinstance(config, EmailConfig):
+            raise ValueError(
+                "EmailProvider must be initialized with an EmailConfig"
+            )
+        super().__init__(config)
+        self.config: EmailConfig = config
+
+    @abstractmethod
+    async def send_email(
+        self,
+        to_email: str,
+        subject: str,
+        body: str,
+        html_body: Optional[str] = None,
+        *args,
+        **kwargs,
+    ) -> None:
+        pass
+
+    @abstractmethod
+    async def send_verification_email(
+        self, to_email: str, verification_code: str, *args, **kwargs
+    ) -> None:
+        pass
+
+    @abstractmethod
+    async def send_password_reset_email(
+        self, to_email: str, reset_token: str, *args, **kwargs
+    ) -> None:
+        pass
+
+    @abstractmethod
+    async def send_password_changed_email(
+        self,
+        to_email: str,
+        *args,
+        **kwargs,
+    ) -> None:
+        pass

+ 169 - 0
py/core/base/providers/embedding.py

@@ -0,0 +1,169 @@
+import asyncio
+import logging
+import random
+import time
+from abc import abstractmethod
+from enum import Enum
+from typing import Any, Optional
+
+from litellm import AuthenticationError
+
+from core.base.abstractions import VectorQuantizationSettings
+
+from ..abstractions import (
+    ChunkSearchResult,
+)
+from .base import Provider, ProviderConfig
+
+logger = logging.getLogger()
+
+
+class EmbeddingConfig(ProviderConfig):
+    provider: str
+    base_model: str
+    base_dimension: int | float
+    rerank_model: Optional[str] = None
+    rerank_url: Optional[str] = None
+    batch_size: int = 1
+    concurrent_request_limit: int = 256
+    max_retries: int = 3
+    initial_backoff: float = 1
+    max_backoff: float = 64.0
+    quantization_settings: VectorQuantizationSettings = (
+        VectorQuantizationSettings()
+    )
+
+    def validate_config(self) -> None:
+        if self.provider not in self.supported_providers:
+            raise ValueError(f"Provider '{self.provider}' is not supported.")
+
+    @property
+    def supported_providers(self) -> list[str]:
+        return ["litellm", "openai", "ollama"]
+
+
+class EmbeddingProvider(Provider):
+    class Step(Enum):
+        BASE = 1
+        RERANK = 2
+
+    def __init__(self, config: EmbeddingConfig):
+        if not isinstance(config, EmbeddingConfig):
+            raise ValueError(
+                "EmbeddingProvider must be initialized with an `EmbeddingConfig`."
+            )
+        logger.info(f"Initializing EmbeddingProvider with config {config}.")
+
+        super().__init__(config)
+        self.config: EmbeddingConfig = config
+        self.semaphore = asyncio.Semaphore(config.concurrent_request_limit)
+        self.current_requests = 0
+
+    async def _execute_with_backoff_async(self, task: dict[str, Any]):
+        retries = 0
+        backoff = self.config.initial_backoff
+        while retries < self.config.max_retries:
+            try:
+                async with self.semaphore:
+                    return await self._execute_task(task)
+            except AuthenticationError:
+                raise
+            except Exception as e:
+                logger.warning(
+                    f"Request failed (attempt {retries + 1}): {str(e)}"
+                )
+                retries += 1
+                if retries == self.config.max_retries:
+                    raise
+                await asyncio.sleep(random.uniform(0, backoff))
+                backoff = min(backoff * 2, self.config.max_backoff)
+
+    def _execute_with_backoff_sync(self, task: dict[str, Any]):
+        retries = 0
+        backoff = self.config.initial_backoff
+        while retries < self.config.max_retries:
+            try:
+                return self._execute_task_sync(task)
+            except AuthenticationError:
+                raise
+            except Exception as e:
+                logger.warning(
+                    f"Request failed (attempt {retries + 1}): {str(e)}"
+                )
+                retries += 1
+                if retries == self.config.max_retries:
+                    raise
+                time.sleep(random.uniform(0, backoff))
+                backoff = min(backoff * 2, self.config.max_backoff)
+
+    @abstractmethod
+    async def _execute_task(self, task: dict[str, Any]):
+        pass
+
+    @abstractmethod
+    def _execute_task_sync(self, task: dict[str, Any]):
+        pass
+
+    async def async_get_embedding(
+        self,
+        text: str,
+        stage: Step = Step.BASE,
+    ):
+        task = {
+            "text": text,
+            "stage": stage,
+        }
+        return await self._execute_with_backoff_async(task)
+
+    def get_embedding(
+        self,
+        text: str,
+        stage: Step = Step.BASE,
+    ):
+        task = {
+            "text": text,
+            "stage": stage,
+        }
+        return self._execute_with_backoff_sync(task)
+
+    async def async_get_embeddings(
+        self,
+        texts: list[str],
+        stage: Step = Step.BASE,
+    ):
+        task = {
+            "texts": texts,
+            "stage": stage,
+        }
+        return await self._execute_with_backoff_async(task)
+
+    def get_embeddings(
+        self,
+        texts: list[str],
+        stage: Step = Step.BASE,
+    ) -> list[list[float]]:
+        task = {
+            "texts": texts,
+            "stage": stage,
+        }
+        return self._execute_with_backoff_sync(task)
+
+    @abstractmethod
+    def rerank(
+        self,
+        query: str,
+        results: list[ChunkSearchResult],
+        stage: Step = Step.RERANK,
+        limit: int = 10,
+    ):
+        pass
+
+    @abstractmethod
+    async def arerank(
+        self,
+        query: str,
+        results: list[ChunkSearchResult],
+        stage: Step = Step.RERANK,
+        limit: int = 10,
+    ):
+        pass

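The retry loops in the embedding provider above implement capped exponential backoff with full jitter (`random.uniform(0, backoff)`, then `backoff = min(backoff * 2, max_backoff)`). A standalone sketch of the sleep schedule those loops draw from — the function name and defaults are illustrative, not part of this commit:

```python
import random


def backoff_schedule(max_retries=3, initial_backoff=1.0, max_backoff=64.0, seed=0):
    """Return the sleep intervals the retry loop would draw before each re-attempt."""
    rng = random.Random(seed)
    backoff = initial_backoff
    sleeps = []
    for _ in range(max_retries - 1):  # the final failed attempt re-raises, so no sleep
        # full jitter: sleep anywhere in [0, backoff) to de-synchronize clients
        sleeps.append(rng.uniform(0, backoff))
        backoff = min(backoff * 2, max_backoff)
    return sleeps
```

Full jitter (rather than sleeping exactly `backoff`) spreads concurrent retries out in time, which matters at the default `concurrent_request_limit` of 256.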
+ 110 - 0
py/core/base/providers/file.py

@@ -0,0 +1,110 @@
+import logging
+import os
+from abc import ABC, abstractmethod
+from datetime import datetime
+from io import BytesIO
+from typing import BinaryIO, Optional
+from uuid import UUID
+
+from .base import Provider, ProviderConfig
+
+logger = logging.getLogger()
+
+
+class FileConfig(ProviderConfig):
+    """
+    Configuration for file storage providers.
+    """
+
+    provider: Optional[str] = None
+
+    # S3-specific configuration
+    bucket_name: Optional[str] = None
+    aws_access_key_id: Optional[str] = None
+    aws_secret_access_key: Optional[str] = None
+    region_name: Optional[str] = None
+    endpoint_url: Optional[str] = None
+
+    @property
+    def supported_providers(self) -> list[str]:
+        """
+        List of supported file storage providers.
+        """
+        return [
+            "postgres",
+            "s3",
+        ]
+
+    def validate_config(self) -> None:
+        if self.provider not in self.supported_providers:
+            raise ValueError(f"Unsupported file provider: {self.provider}")
+
+        if self.provider == "s3" and (
+            not self.bucket_name and not os.getenv("S3_BUCKET_NAME")
+        ):
+            raise ValueError(
+                "S3 bucket name is required when using S3 provider"
+            )
+
+
+class FileProvider(Provider, ABC):
+    """
+    Base abstract class for file storage providers.
+    """
+
+    def __init__(self, config: FileConfig):
+        if not isinstance(config, FileConfig):
+            raise ValueError(
+                "FileProvider must be initialized with a `FileConfig`."
+            )
+        super().__init__(config)
+        self.config: FileConfig = config
+
+    @abstractmethod
+    async def initialize(self) -> None:
+        """Initialize the file provider."""
+        pass
+
+    @abstractmethod
+    async def store_file(
+        self,
+        document_id: UUID,
+        file_name: str,
+        file_content: BytesIO,
+        file_type: Optional[str] = None,
+    ) -> None:
+        """Store a file."""
+        pass
+
+    @abstractmethod
+    async def retrieve_file(
+        self, document_id: UUID
+    ) -> Optional[tuple[str, BinaryIO, int]]:
+        """Retrieve a file."""
+        pass
+
+    @abstractmethod
+    async def retrieve_files_as_zip(
+        self,
+        document_ids: Optional[list[UUID]] = None,
+        start_date: Optional[datetime] = None,
+        end_date: Optional[datetime] = None,
+    ) -> tuple[str, BinaryIO, int]:
+        """Retrieve multiple files as a zip."""
+        pass
+
+    @abstractmethod
+    async def delete_file(self, document_id: UUID) -> bool:
+        """Delete a file."""
+        pass
+
+    @abstractmethod
+    async def get_files_overview(
+        self,
+        offset: int,
+        limit: int,
+        filter_document_ids: Optional[list[UUID]] = None,
+        filter_file_names: Optional[list[str]] = None,
+    ) -> list[dict]:
+        """Get an overview of stored files."""
+        pass

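The `FileProvider` contract above is async and backed by Postgres or S3 in the shipped providers. As a rough illustration of the `(file_name, stream, size)` tuple shape it returns, here is a dict-backed, synchronous sketch — class and method bodies are hypothetical, only the interface shape comes from the diff:

```python
from io import BytesIO
from uuid import UUID, uuid4


class InMemoryFileStore:
    """Dict-backed sketch mirroring the FileProvider contract (sync for brevity)."""

    def __init__(self) -> None:
        self._files: dict[UUID, tuple[str, bytes]] = {}

    def store_file(self, document_id: UUID, file_name: str, file_content: BytesIO) -> None:
        self._files[document_id] = (file_name, file_content.getvalue())

    def retrieve_file(self, document_id: UUID):
        entry = self._files.get(document_id)
        if entry is None:
            return None
        name, data = entry
        # mirror the (file_name, stream, size) tuple of the abstract interface
        return name, BytesIO(data), len(data)

    def delete_file(self, document_id: UUID) -> bool:
        return self._files.pop(document_id, None) is not None
```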
+ 188 - 0
py/core/base/providers/ingestion.py

@@ -0,0 +1,188 @@
+import logging
+from abc import ABC
+from enum import Enum
+from typing import TYPE_CHECKING, Any, ClassVar, Optional
+
+from pydantic import Field
+
+from core.base.abstractions import ChunkEnrichmentSettings
+
+from .base import AppConfig, Provider, ProviderConfig
+from .llm import CompletionProvider
+
+logger = logging.getLogger()
+
+if TYPE_CHECKING:
+    from core.providers.database import PostgresDatabaseProvider
+
+
+class ChunkingStrategy(str, Enum):
+    RECURSIVE = "recursive"
+    CHARACTER = "character"
+    BASIC = "basic"
+    BY_TITLE = "by_title"
+
+
+class IngestionConfig(ProviderConfig):
+    _defaults: ClassVar[dict] = {
+        "app": AppConfig(),
+        "provider": "r2r",
+        "excluded_parsers": [],
+        "chunking_strategy": "recursive",
+        "chunk_size": 1024,
+        "chunk_overlap": 512,
+        "chunk_enrichment_settings": ChunkEnrichmentSettings(),
+        "extra_parsers": {},
+        "audio_transcription_model": None,
+        "vlm": None,
+        "vlm_batch_size": 5,
+        "vlm_max_tokens_to_sample": 1_024,
+        "max_concurrent_vlm_tasks": 5,
+        "vlm_ocr_one_page_per_chunk": True,
+        "skip_document_summary": False,
+        "document_summary_system_prompt": "system",
+        "document_summary_task_prompt": "summary",
+        "document_summary_max_length": 100_000,
+        "chunks_for_document_summary": 128,
+        "document_summary_model": None,
+        "parser_overrides": {},
+        "extra_fields": {},
+        "automatic_extraction": False,
+    }
+
+    provider: str = Field(
+        default_factory=lambda: IngestionConfig._defaults["provider"]
+    )
+    excluded_parsers: list[str] = Field(
+        default_factory=lambda: IngestionConfig._defaults["excluded_parsers"]
+    )
+    chunking_strategy: str | ChunkingStrategy = Field(
+        default_factory=lambda: IngestionConfig._defaults["chunking_strategy"]
+    )
+    chunk_size: int = Field(
+        default_factory=lambda: IngestionConfig._defaults["chunk_size"]
+    )
+    chunk_overlap: int = Field(
+        default_factory=lambda: IngestionConfig._defaults["chunk_overlap"]
+    )
+    chunk_enrichment_settings: ChunkEnrichmentSettings = Field(
+        default_factory=lambda: IngestionConfig._defaults[
+            "chunk_enrichment_settings"
+        ]
+    )
+    extra_parsers: dict[str, Any] = Field(
+        default_factory=lambda: IngestionConfig._defaults["extra_parsers"]
+    )
+    audio_transcription_model: Optional[str] = Field(
+        default_factory=lambda: IngestionConfig._defaults[
+            "audio_transcription_model"
+        ]
+    )
+    vlm: Optional[str] = Field(
+        default_factory=lambda: IngestionConfig._defaults["vlm"]
+    )
+    vlm_batch_size: int = Field(
+        default_factory=lambda: IngestionConfig._defaults["vlm_batch_size"]
+    )
+    vlm_max_tokens_to_sample: int = Field(
+        default_factory=lambda: IngestionConfig._defaults[
+            "vlm_max_tokens_to_sample"
+        ]
+    )
+    max_concurrent_vlm_tasks: int = Field(
+        default_factory=lambda: IngestionConfig._defaults[
+            "max_concurrent_vlm_tasks"
+        ]
+    )
+    vlm_ocr_one_page_per_chunk: bool = Field(
+        default_factory=lambda: IngestionConfig._defaults[
+            "vlm_ocr_one_page_per_chunk"
+        ]
+    )
+    skip_document_summary: bool = Field(
+        default_factory=lambda: IngestionConfig._defaults[
+            "skip_document_summary"
+        ]
+    )
+    document_summary_system_prompt: str = Field(
+        default_factory=lambda: IngestionConfig._defaults[
+            "document_summary_system_prompt"
+        ]
+    )
+    document_summary_task_prompt: str = Field(
+        default_factory=lambda: IngestionConfig._defaults[
+            "document_summary_task_prompt"
+        ]
+    )
+    chunks_for_document_summary: int = Field(
+        default_factory=lambda: IngestionConfig._defaults[
+            "chunks_for_document_summary"
+        ]
+    )
+    document_summary_model: Optional[str] = Field(
+        default_factory=lambda: IngestionConfig._defaults[
+            "document_summary_model"
+        ]
+    )
+    parser_overrides: dict[str, str] = Field(
+        default_factory=lambda: IngestionConfig._defaults["parser_overrides"]
+    )
+    automatic_extraction: bool = Field(
+        default_factory=lambda: IngestionConfig._defaults[
+            "automatic_extraction"
+        ]
+    )
+    document_summary_max_length: int = Field(
+        default_factory=lambda: IngestionConfig._defaults[
+            "document_summary_max_length"
+        ]
+    )
+
+    @classmethod
+    def set_default(cls, **kwargs):
+        for key, value in kwargs.items():
+            if key in cls._defaults:
+                cls._defaults[key] = value
+            else:
+                raise AttributeError(
+                    f"No default attribute '{key}' in IngestionConfig"
+                )
+
+    @property
+    def supported_providers(self) -> list[str]:
+        return ["r2r", "unstructured_local", "unstructured_api"]
+
+    def validate_config(self) -> None:
+        if self.provider not in self.supported_providers:
+            raise ValueError(
+                f"Provider {self.provider} is not supported, must be one of {self.supported_providers}"
+            )
+
+    @classmethod
+    def get_default(cls, mode: str, app) -> "IngestionConfig":
+        """Return default ingestion configuration for a given mode."""
+        if mode == "hi-res":
+            return cls(app=app, parser_overrides={"pdf": "zerox"})
+        if mode == "ocr":
+            return cls(app=app, parser_overrides={"pdf": "ocr"})
+        if mode == "fast":
+            return cls(app=app, skip_document_summary=True)
+        return cls(app=app)
+
+
+class IngestionProvider(Provider, ABC):
+    config: IngestionConfig
+    database_provider: "PostgresDatabaseProvider"
+    llm_provider: CompletionProvider
+
+    def __init__(
+        self,
+        config: IngestionConfig,
+        database_provider: "PostgresDatabaseProvider",
+        llm_provider: CompletionProvider,
+    ):
+        super().__init__(config)
+        self.config: IngestionConfig = config
+        self.llm_provider = llm_provider
+        self.database_provider: "PostgresDatabaseProvider" = database_provider

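`IngestionConfig` above routes every field through `default_factory=lambda: IngestionConfig._defaults[...]`, so `set_default()` changes the value new instances pick up without touching instances that already exist. A minimal reduction of that pattern, without pydantic (class name and fields are illustrative):

```python
from typing import Any, ClassVar


class ConfigWithDefaults:
    """Fields are read from a class-level _defaults dict at construction time,
    so set_default() only affects instances created afterwards."""

    _defaults: ClassVar[dict[str, Any]] = {"chunk_size": 1024, "chunk_overlap": 512}

    def __init__(self, **overrides: Any) -> None:
        for key, value in {**self._defaults, **overrides}.items():
            setattr(self, key, value)

    @classmethod
    def set_default(cls, **kwargs: Any) -> None:
        for key, value in kwargs.items():
            if key not in cls._defaults:
                raise AttributeError(f"No default attribute '{key}'")
            cls._defaults[key] = value
```

This is why the pydantic fields use `default_factory` rather than `default=`: a plain default is evaluated once at class-definition time and would never see later `set_default()` calls.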
+ 233 - 0
py/core/base/providers/llm.py

@@ -0,0 +1,233 @@
+import asyncio
+import logging
+import random
+import time
+from abc import abstractmethod
+from concurrent.futures import ThreadPoolExecutor
+from typing import Any, AsyncGenerator, Generator, Optional
+
+from litellm import AuthenticationError
+
+from core.base.abstractions import (
+    GenerationConfig,
+    LLMChatCompletion,
+    LLMChatCompletionChunk,
+)
+
+from .base import Provider, ProviderConfig
+
+logger = logging.getLogger()
+
+
+class CompletionConfig(ProviderConfig):
+    provider: Optional[str] = None
+    generation_config: Optional[GenerationConfig] = None
+    concurrent_request_limit: int = 256
+    max_retries: int = 3
+    initial_backoff: float = 1.0
+    max_backoff: float = 64.0
+    request_timeout: float = 15.0
+
+    def validate_config(self) -> None:
+        if not self.provider:
+            raise ValueError("Provider must be set.")
+        if self.provider not in self.supported_providers:
+            raise ValueError(f"Provider '{self.provider}' is not supported.")
+
+    @property
+    def supported_providers(self) -> list[str]:
+        return ["anthropic", "litellm", "openai", "r2r"]
+
+
+class CompletionProvider(Provider):
+    def __init__(self, config: CompletionConfig) -> None:
+        if not isinstance(config, CompletionConfig):
+            raise ValueError(
+                "CompletionProvider must be initialized with a `CompletionConfig`."
+            )
+        logger.info(f"Initializing CompletionProvider with config: {config}")
+        super().__init__(config)
+        self.config: CompletionConfig = config
+        self.semaphore = asyncio.Semaphore(config.concurrent_request_limit)
+        self.thread_pool = ThreadPoolExecutor(
+            max_workers=config.concurrent_request_limit
+        )
+
+    async def _execute_with_backoff_async(
+        self,
+        task: dict[str, Any],
+        apply_timeout: bool = False,
+    ):
+        retries = 0
+        backoff = self.config.initial_backoff
+        while retries < self.config.max_retries:
+            try:
+                # A semaphore allows us to limit concurrent requests
+                async with self.semaphore:
+                    if not apply_timeout:
+                        return await self._execute_task(task)
+
+                    try:  # Use asyncio.wait_for to set a timeout for the request
+                        return await asyncio.wait_for(
+                            self._execute_task(task),
+                            timeout=self.config.request_timeout,
+                        )
+                    except asyncio.TimeoutError as e:
+                        raise TimeoutError(
+                            f"Request timed out after {self.config.request_timeout} seconds"
+                        ) from e
+            except AuthenticationError:
+                raise
+            except Exception as e:
+                logger.warning(
+                    f"Request failed (attempt {retries + 1}): {str(e)}"
+                )
+                retries += 1
+                if retries == self.config.max_retries:
+                    raise
+                await asyncio.sleep(random.uniform(0, backoff))
+                backoff = min(backoff * 2, self.config.max_backoff)
+
+    async def _execute_with_backoff_async_stream(
+        self, task: dict[str, Any]
+    ) -> AsyncGenerator[Any, None]:
+        retries = 0
+        backoff = self.config.initial_backoff
+        while retries < self.config.max_retries:
+            try:
+                async with self.semaphore:
+                    async for chunk in await self._execute_task(task):
+                        yield chunk
+                return  # Successful completion of the stream
+            except AuthenticationError:
+                raise
+            except Exception as e:
+                logger.warning(
+                    f"Streaming request failed (attempt {retries + 1}): {str(e)}"
+                )
+                retries += 1
+                if retries == self.config.max_retries:
+                    raise
+                await asyncio.sleep(random.uniform(0, backoff))
+                backoff = min(backoff * 2, self.config.max_backoff)
+
+    def _execute_with_backoff_sync(
+        self,
+        task: dict[str, Any],
+        apply_timeout: bool = False,
+    ):
+        retries = 0
+        backoff = self.config.initial_backoff
+        while retries < self.config.max_retries:
+            try:
+                if not apply_timeout:
+                    return self._execute_task_sync(task)
+
+                future = self.thread_pool.submit(
+                    self._execute_task_sync, task
+                )
+                try:
+                    # concurrent.futures.TimeoutError is an alias of the
+                    # builtin TimeoutError on Python 3.11+
+                    return future.result(timeout=self.config.request_timeout)
+                except TimeoutError as e:
+                    raise TimeoutError(
+                        f"Request timed out after {self.config.request_timeout} seconds"
+                    ) from e
+            except AuthenticationError:
+                raise
+            except Exception as e:
+                logger.warning(
+                    f"Request failed (attempt {retries + 1}): {str(e)}"
+                )
+                retries += 1
+                if retries == self.config.max_retries:
+                    raise
+                time.sleep(random.uniform(0, backoff))
+                backoff = min(backoff * 2, self.config.max_backoff)
+
+    def _execute_with_backoff_sync_stream(
+        self, task: dict[str, Any]
+    ) -> Generator[Any, None, None]:
+        retries = 0
+        backoff = self.config.initial_backoff
+        while retries < self.config.max_retries:
+            try:
+                yield from self._execute_task_sync(task)
+                return  # Successful completion of the stream
+            except AuthenticationError:
+                raise
+            except Exception as e:
+                logger.warning(
+                    f"Streaming request failed (attempt {retries + 1}): {str(e)}"
+                )
+                retries += 1
+                if retries == self.config.max_retries:
+                    raise
+                time.sleep(random.uniform(0, backoff))
+                backoff = min(backoff * 2, self.config.max_backoff)
+
+    @abstractmethod
+    async def _execute_task(self, task: dict[str, Any]):
+        pass
+
+    @abstractmethod
+    def _execute_task_sync(self, task: dict[str, Any]):
+        pass
+
+    async def aget_completion(
+        self,
+        messages: list[dict],
+        generation_config: GenerationConfig,
+        apply_timeout: bool = False,
+        **kwargs,
+    ) -> LLMChatCompletion:
+        task = {
+            "messages": messages,
+            "generation_config": generation_config,
+            "kwargs": kwargs,
+        }
+        response = await self._execute_with_backoff_async(
+            task=task, apply_timeout=apply_timeout
+        )
+        return LLMChatCompletion(**response.dict())
+
+    async def aget_completion_stream(
+        self,
+        messages: list[dict],
+        generation_config: GenerationConfig,
+        **kwargs,
+    ) -> AsyncGenerator[LLMChatCompletionChunk, None]:
+        generation_config.stream = True
+        task = {
+            "messages": messages,
+            "generation_config": generation_config,
+            "kwargs": kwargs,
+        }
+        async for chunk in self._execute_with_backoff_async_stream(task):
+            if isinstance(chunk, dict):
+                yield LLMChatCompletionChunk(**chunk)
+                continue
+
+            if chunk.choices and len(chunk.choices) > 0:
+                chunk.choices[0].finish_reason = (
+                    chunk.choices[0].finish_reason
+                    if chunk.choices[0].finish_reason != ""
+                    else None
+                )  # handle error output conventions
+                chunk.choices[0].finish_reason = (
+                    chunk.choices[0].finish_reason
+                    if chunk.choices[0].finish_reason != "eos"
+                    else "stop"
+                )  # hardcode `eos` to `stop` for consistency
+                try:
+                    yield LLMChatCompletionChunk(**(chunk.dict()))
+                except Exception as e:
+                    logger.error(f"Error parsing chunk: {e}")
+                    yield LLMChatCompletionChunk(**(chunk.as_dict()))
+
+    def get_completion_stream(
+        self,
+        messages: list[dict],
+        generation_config: GenerationConfig,
+        **kwargs,
+    ) -> Generator[LLMChatCompletionChunk, None, None]:
+        generation_config.stream = True
+        task = {
+            "messages": messages,
+            "generation_config": generation_config,
+            "kwargs": kwargs,
+        }
+        for chunk in self._execute_with_backoff_sync_stream(task):
+            yield LLMChatCompletionChunk(**chunk.dict())

+ 120 - 0
py/core/base/providers/ocr.py

@@ -0,0 +1,120 @@
+import asyncio
+import logging
+import random
+import time
+from abc import abstractmethod
+from concurrent.futures import ThreadPoolExecutor
+from typing import Any, Optional
+
+from litellm import AuthenticationError
+
+from .base import Provider, ProviderConfig
+
+logger = logging.getLogger()
+
+
+class OCRConfig(ProviderConfig):
+    provider: Optional[str] = None
+    model: Optional[str] = None
+    concurrent_request_limit: int = 256
+    max_retries: int = 3
+    initial_backoff: float = 1.0
+    max_backoff: float = 64.0
+
+    def validate_config(self) -> None:
+        if not self.provider:
+            raise ValueError("Provider must be set.")
+        if self.provider not in self.supported_providers:
+            raise ValueError(f"Provider '{self.provider}' is not supported.")
+
+    @property
+    def supported_providers(self) -> list[str]:
+        return ["mistral"]
+
+
+class OCRProvider(Provider):
+    def __init__(self, config: OCRConfig) -> None:
+        if not isinstance(config, OCRConfig):
+            raise ValueError(
+                "OCRProvider must be initialized with a `OCRConfig`."
+            )
+        logger.info(f"Initializing OCRProvider with config: {config}")
+        super().__init__(config)
+        self.config: OCRConfig = config
+        self.semaphore = asyncio.Semaphore(config.concurrent_request_limit)
+        self.thread_pool = ThreadPoolExecutor(
+            max_workers=config.concurrent_request_limit
+        )
+
+    async def _execute_with_backoff_async(self, task: dict[str, Any]):
+        retries = 0
+        backoff = self.config.initial_backoff
+        while retries < self.config.max_retries:
+            try:
+                async with self.semaphore:
+                    return await self._execute_task(task)
+            except AuthenticationError:
+                raise
+            except Exception as e:
+                logger.warning(
+                    f"Request failed (attempt {retries + 1}): {str(e)}"
+                )
+                retries += 1
+                if retries == self.config.max_retries:
+                    raise
+                await asyncio.sleep(random.uniform(0, backoff))
+                backoff = min(backoff * 2, self.config.max_backoff)
+
+    def _execute_with_backoff_sync(self, task: dict[str, Any]):
+        retries = 0
+        backoff = self.config.initial_backoff
+        while retries < self.config.max_retries:
+            try:
+                return self._execute_task_sync(task)
+            except AuthenticationError:
+                raise
+            except Exception as e:
+                logger.warning(
+                    f"Request failed (attempt {retries + 1}): {str(e)}"
+                )
+                retries += 1
+                if retries == self.config.max_retries:
+                    raise
+                time.sleep(random.uniform(0, backoff))
+                backoff = min(backoff * 2, self.config.max_backoff)
+
+    @abstractmethod
+    async def _execute_task(self, task: dict[str, Any]):
+        pass
+
+    @abstractmethod
+    def _execute_task_sync(self, task: dict[str, Any]):
+        pass
+
+    @abstractmethod
+    async def upload_file(
+        self,
+        file_path: str | None = None,
+        file_content: bytes | None = None,
+        file_name: str | None = None,
+    ) -> Any:
+        pass
+
+    @abstractmethod
+    async def process_file(
+        self, file_id: str, include_image_base64: bool = False
+    ) -> Any:
+        pass
+
+    @abstractmethod
+    async def process_url(
+        self,
+        url: str,
+        is_image: bool = False,
+        include_image_base64: bool = False,
+    ) -> Any:
+        pass
+
+    @abstractmethod
+    async def process_pdf(
+        self, file_path: str | None = None, file_content: bytes | None = None
+    ) -> Any:
+        pass

+ 70 - 0
py/core/base/providers/orchestration.py

@@ -0,0 +1,70 @@
+from abc import abstractmethod
+from enum import Enum
+from typing import Any
+
+from .base import Provider, ProviderConfig
+
+
+class Workflow(Enum):
+    INGESTION = "ingestion"
+    GRAPH = "graph"
+
+
+class OrchestrationConfig(ProviderConfig):
+    provider: str
+    max_runs: int = 2_048
+    graph_search_results_creation_concurrency_limit: int = 32
+    ingestion_concurrency_limit: int = 16
+    graph_search_results_concurrency_limit: int = 8
+
+    def validate_config(self) -> None:
+        if self.provider not in self.supported_providers:
+            raise ValueError(f"Provider {self.provider} is not supported.")
+
+    @property
+    def supported_providers(self) -> list[str]:
+        return ["hatchet", "simple"]
+
+
+class OrchestrationProvider(Provider):
+    def __init__(self, config: OrchestrationConfig):
+        super().__init__(config)
+        self.config = config
+        self.worker = None
+
+    @abstractmethod
+    async def start_worker(self):
+        pass
+
+    @abstractmethod
+    def get_worker(self, name: str, max_runs: int) -> Any:
+        pass
+
+    @abstractmethod
+    def step(self, *args, **kwargs) -> Any:
+        pass
+
+    @abstractmethod
+    def workflow(self, *args, **kwargs) -> Any:
+        pass
+
+    @abstractmethod
+    def failure(self, *args, **kwargs) -> Any:
+        pass
+
+    @abstractmethod
+    def register_workflows(
+        self, workflow: Workflow, service: Any, messages: dict
+    ) -> None:
+        pass
+
+    @abstractmethod
+    async def run_workflow(
+        self,
+        workflow_name: str,
+        parameters: dict,
+        options: dict,
+        *args,
+        **kwargs,
+    ) -> dict[str, str]:
+        pass

+ 39 - 0
py/core/base/providers/scheduler.py

@@ -0,0 +1,39 @@
+from abc import abstractmethod
+
+from .base import Provider, ProviderConfig
+
+
+class SchedulerConfig(ProviderConfig):
+    """Configuration for scheduler provider"""
+
+    provider: str = "apscheduler"
+
+    def validate_config(self):
+        if self.provider not in self.supported_providers:
+            raise ValueError(
+                f"Scheduler provider {self.provider} is not supported."
+            )
+
+    @property
+    def supported_providers(self) -> list[str]:
+        return ["apscheduler"]
+
+
+class SchedulerProvider(Provider):
+    """Base class for scheduler providers"""
+
+    def __init__(self, config: SchedulerConfig):
+        super().__init__(config)
+        self.config = config
+
+    @abstractmethod
+    async def add_job(self, func, trigger, **kwargs):
+        pass
+
+    @abstractmethod
+    async def start(self):
+        pass
+
+    @abstractmethod
+    async def shutdown(self):
+        pass

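`SchedulerProvider` above defines only the `add_job`/`start`/`shutdown` surface; the supported provider wraps APScheduler. An asyncio-only sketch of that surface, with a fixed-interval job loop — the class and its `interval`/`runs` parameters are illustrative, not the APScheduler API:

```python
import asyncio


class MiniScheduler:
    """asyncio-only sketch of the SchedulerProvider contract."""

    def __init__(self) -> None:
        self._tasks: list[asyncio.Task] = []

    async def start(self) -> None:
        pass  # nothing to do: jobs run on the ambient event loop

    async def add_job(self, func, interval: float, runs: int) -> None:
        async def _loop() -> None:
            for _ in range(runs):
                await asyncio.sleep(interval)
                await func()

        self._tasks.append(asyncio.create_task(_loop()))

    async def shutdown(self) -> None:
        for task in self._tasks:
            task.cancel()
```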
+ 39 - 0
py/core/base/utils/__init__.py

@@ -0,0 +1,39 @@
+from shared.utils import (
+    RecursiveCharacterTextSplitter,
+    TextSplitter,
+    _decorate_vector_type,
+    _get_vector_column_str,
+    deep_update,
+    dump_collector,
+    dump_obj,
+    format_search_results_for_llm,
+    generate_default_prompt_id,
+    generate_default_user_collection_id,
+    generate_document_id,
+    generate_entity_document_id,
+    generate_extraction_id,
+    generate_id,
+    generate_user_id,
+    validate_uuid,
+    yield_sse_event,
+)
+
+__all__ = [
+    "format_search_results_for_llm",
+    "generate_id",
+    "generate_default_user_collection_id",
+    "generate_document_id",
+    "generate_extraction_id",
+    "generate_user_id",
+    "generate_entity_document_id",
+    "generate_default_prompt_id",
+    "RecursiveCharacterTextSplitter",
+    "TextSplitter",
+    "validate_uuid",
+    "deep_update",
+    "_decorate_vector_type",
+    "_get_vector_column_str",
+    "yield_sse_event",
+    "dump_collector",
+    "dump_obj",
+]

+ 21 - 0
py/core/configs/full.toml

@@ -0,0 +1,21 @@
+[completion]
+provider = "r2r"
+concurrent_request_limit = 12800
+
+[ingestion]
+provider = "unstructured_local"
+strategy = "auto"
+chunking_strategy = "by_title"
+new_after_n_chars = 2_048
+max_characters = 4_096
+combine_under_n_chars = 1_024
+overlap = 1_024
+
+    [ingestion.extra_parsers]
+    pdf = ["zerox", "ocr"]
+
+[orchestration]
+provider = "hatchet"
+kg_creation_concurrency_limit = 32
+ingestion_concurrency_limit = 16
+kg_concurrency_limit = 8

+ 46 - 0
py/core/configs/full_azure.toml

@@ -0,0 +1,46 @@
+[app]
+# LLM used for internal operations, like deriving conversation names
+fast_llm = "azure/gpt-4.1-mini"
+
+# LLM used for user-facing output, like RAG replies
+quality_llm = "azure/gpt-4.1"
+
+# LLM used for ingesting visual inputs
+vlm = "azure/gpt-4.1"
+
+# LLM used for transcription
+audio_lm = "azure/whisper-1"
+
+# Reasoning model, used for `research` agent
+reasoning_llm = "azure/o3-mini"
+# Planning model, used for `research` agent
+planning_llm = "azure/o3-mini"
+
+[embedding]
+base_model = "azure/text-embedding-3-small"
+
+[completion_embedding]
+base_model = "azure/text-embedding-3-small"
+
+[ingestion]
+provider = "unstructured_local"
+strategy = "auto"
+chunking_strategy = "by_title"
+new_after_n_chars = 2_048
+max_characters = 4_096
+combine_under_n_chars = 1_024
+overlap = 1_024
+document_summary_model = "azure/gpt-4.1-mini"
+automatic_extraction = true # enable automatic extraction of entities and relations
+
+  [ingestion.extra_parsers]
+    pdf = ["zerox", "ocr"]
+
+  [ingestion.chunk_enrichment_settings]
+    generation_config = { model = "azure/gpt-4.1-mini" }
+
+[orchestration]
+provider = "hatchet"
+kg_creation_concurrency_limit = 32
+ingestion_concurrency_limit = 4
+kg_concurrency_limit = 8

+ 55 - 0
py/core/configs/full_lm_studio.toml

@@ -0,0 +1,55 @@
+[app]
+# LLM used for internal operations, like deriving conversation names
+fast_llm = "lm_studio/llama-3.2-3b-instruct"
+
+# LLM used for user-facing output, like RAG replies
+quality_llm = "lm_studio/llama-3.2-3b-instruct"
+
+# LLM used for ingesting visual inputs
+vlm = "lm_studio/llama3.2-vision" # TODO - Replace with viable candidate
+
+# LLM used for transcription
+audio_lm = "lm_studio/llama-3.2-3b-instruct" # TODO - Replace with viable candidate
+
+[embedding]
+provider = "litellm"
+base_model = "lm_studio/text-embedding-nomic-embed-text-v1.5"
+base_dimension = nan
+batch_size = 128
+concurrent_request_limit = 2
+
+[completion_embedding]
+# Generally this should be the same as the embedding config, but advanced users may want to run with a different provider to reduce latency
+provider = "litellm"
+base_model = "lm_studio/text-embedding-nomic-embed-text-v1.5"
+base_dimension = nan
+batch_size = 128
+concurrent_request_limit = 2
+
+[agent]
+tools = ["search_file_knowledge"]
+
+[completion]
+provider = "litellm"
+concurrent_request_limit = 1
+
+  [completion.generation_config]
+  temperature = 0.1
+  top_p = 1
+  max_tokens_to_sample = 1_024
+  stream = false
+
+[ingestion]
+provider = "unstructured_local"
+strategy = "auto"
+chunking_strategy = "by_title"
+new_after_n_chars = 512
+max_characters = 1_024
+combine_under_n_chars = 128
+overlap = 20
+chunks_for_document_summary = 16
+document_summary_model = "lm_studio/llama-3.2-3b-instruct"
+automatic_extraction = false
+
+[orchestration]
+provider = "hatchet"

+ 61 - 0
py/core/configs/full_ollama.toml

@@ -0,0 +1,61 @@
+[app]
+# LLM used for internal operations, like deriving conversation names
+fast_llm = "ollama/llama3.1"
+
+# LLM used for user-facing output, like RAG replies
+quality_llm = "ollama/llama3.1"
+
+# LLM used for ingesting visual inputs
+vlm = "ollama/llama3.1" # TODO - Replace with viable candidate
+
+# LLM used for transcription
+audio_lm = "ollama/llama3.1" # TODO - Replace with viable candidate
+
+
+# Reasoning model, used for `research` agent
+reasoning_llm = "ollama/llama3.1"
+# Planning model, used for `research` agent
+planning_llm = "ollama/llama3.1"
+
+[embedding]
+provider = "ollama"
+base_model = "mxbai-embed-large"
+base_dimension = 1_024
+batch_size = 128
+concurrent_request_limit = 2
+
+[completion_embedding]
+provider = "ollama"
+base_model = "mxbai-embed-large"
+base_dimension = 1_024
+batch_size = 128
+concurrent_request_limit = 2
+
+[agent]
+tools = ["search_file_knowledge"]
+
+[completion]
+provider = "litellm"
+concurrent_request_limit = 1
+
+  [completion.generation_config]
+  temperature = 0.1
+  top_p = 1
+  max_tokens_to_sample = 1_024
+  stream = false
+  api_base = "http://host.docker.internal:11434"
+
+[ingestion]
+provider = "unstructured_local"
+strategy = "auto"
+chunking_strategy = "by_title"
+new_after_n_chars = 512
+max_characters = 1_024
+combine_under_n_chars = 128
+overlap = 20
+chunks_for_document_summary = 16
+document_summary_model = "ollama/llama3.1"
+automatic_extraction = false
+
+[orchestration]
+provider = "hatchet"

+ 19 - 0
py/core/configs/gemini.toml

@@ -0,0 +1,19 @@
+[app]
+fast_llm = "gemini/gemini-2.0-flash-lite"
+quality_llm = "gemini/gemini-2.0-flash"
+vlm = "gemini/gemini-2.0-flash"
+audio_lm = "gemini/gemini-2.0-flash-lite"
+
+[embedding]
+provider = "litellm"
+base_model = "gemini/text-embedding-004"
+base_dimension = nan
+batch_size = 128
+concurrent_request_limit = 2
+
+[completion_embedding]
+provider = "litellm"
+base_model = "gemini/text-embedding-004"
+base_dimension = nan
+batch_size = 128
+concurrent_request_limit = 2

+ 40 - 0
py/core/configs/lm_studio.toml

@@ -0,0 +1,40 @@
+[app]
+# LLM used for internal operations, like deriving conversation names
+fast_llm = "lm_studio/llama-3.2-3b-instruct"
+
+# LLM used for user-facing output, like RAG replies
+quality_llm = "lm_studio/llama-3.2-3b-instruct"
+
+# LLM used for ingesting visual inputs
+vlm = "lm_studio/llama3.2-vision" # TODO - Replace with viable candidate
+
+# LLM used for transcription
+audio_lm = "lm_studio/llama-3.2-3b-instruct" # TODO - Replace with viable candidate
+
+[embedding]
+provider = "litellm"
+base_model = "lm_studio/text-embedding-nomic-embed-text-v1.5"
+base_dimension = nan
+batch_size = 128
+concurrent_request_limit = 2
+
+[completion_embedding]
+# Generally this should be the same as the embedding config, but advanced users may want to run with a different provider to reduce latency
+provider = "litellm"
+base_model = "lm_studio/text-embedding-nomic-embed-text-v1.5"
+base_dimension = nan
+batch_size = 128
+concurrent_request_limit = 2
+
+[agent]
+tools = ["search_file_knowledge"]
+
+[completion]
+provider = "litellm"
+concurrent_request_limit = 1
+
+  [completion.generation_config]
+  temperature = 0.1
+  top_p = 1
+  max_tokens_to_sample = 1_024
+  stream = false

+ 46 - 0
py/core/configs/ollama.toml

@@ -0,0 +1,46 @@
+[app]
+# LLM used for internal operations, like deriving conversation names
+fast_llm = "ollama/llama3.1" ### NOTE - RECOMMENDED TO USE `openai` with `api_base = "http://localhost:11434/v1"` for best results, otherwise `ollama` with `litellm` is acceptable
+
+# LLM used for user-facing output, like RAG replies
+quality_llm = "ollama/llama3.1"
+
+# LLM used for ingesting visual inputs
+vlm = "ollama/llama3.1" # TODO - Replace with viable candidate
+
+# LLM used for transcription
+audio_lm = "ollama/llama3.1" # TODO - Replace with viable candidate
+
+
+# Reasoning model, used for `research` agent
+reasoning_llm = "ollama/llama3.1"
+# Planning model, used for `research` agent
+planning_llm = "ollama/llama3.1"
+
+[embedding]
+provider = "ollama"
+base_model = "mxbai-embed-large"
+base_dimension = 1_024
+batch_size = 128
+concurrent_request_limit = 2
+
+[completion_embedding]
+provider = "ollama"
+base_model = "mxbai-embed-large"
+base_dimension = 1_024
+batch_size = 128
+concurrent_request_limit = 2
+
+[agent]
+tools = ["search_file_knowledge"]
+
+[completion]
+provider = "litellm"
+concurrent_request_limit = 1
+
+  [completion.generation_config]
+  temperature = 0.1
+  top_p = 1
+  max_tokens_to_sample = 1_024
+  stream = false
+  api_base = "http://localhost:11434/v1"

+ 23 - 0
py/core/configs/r2r_azure.toml

@@ -0,0 +1,23 @@
+[app]
+# LLM used for internal operations, like deriving conversation names
+fast_llm = "azure/gpt-4.1-mini"
+
+# LLM used for user-facing output, like RAG replies
+quality_llm = "azure/gpt-4.1"
+
+# LLM used for ingesting visual inputs
+vlm = "azure/gpt-4.1"
+
+# LLM used for transcription
+audio_lm = "azure/whisper-1"
+
+# Reasoning model, used for `research` agent
+reasoning_llm = "azure/o3-mini"
+# Planning model, used for `research` agent
+planning_llm = "azure/o3-mini"
+
+[embedding]
+base_model = "azure/text-embedding-3-small"
+
+[completion_embedding]
+base_model = "azure/text-embedding-3-small"

+ 37 - 0
py/core/configs/r2r_azure_with_test_limits.toml

@@ -0,0 +1,37 @@
+[app]
+# LLM used for internal operations, like deriving conversation names
+fast_llm = "azure/gpt-4.1-mini"
+
+# LLM used for user-facing output, like RAG replies
+quality_llm = "azure/gpt-4.1"
+
+# LLM used for ingesting visual inputs
+vlm = "azure/gpt-4.1"
+
+# LLM used for transcription
+audio_lm = "azure/whisper-1"
+
+
+# Reasoning model, used for `research` agent
+reasoning_llm = "azure/o3-mini"
+# Planning model, used for `research` agent
+planning_llm = "azure/o3-mini"
+
+[embedding]
+base_model = "openai/text-embedding-3-small"
+base_dimension = 512
+
+[completion_embedding]
+base_model = "openai/text-embedding-3-small"
+
+[database]
+  [database.limits]
+  global_per_min = 10  # Small enough to test quickly
+  monthly_limit = 20  # Small enough to test in one run
+
+  [database.route_limits]
+  "/v3/retrieval/search" = { route_per_min = 5, monthly_limit = 10 }
+
+  [database.user_limits."47e53676-b478-5b3f-a409-234ca2164de5"]
+  global_per_min = 2
+  route_per_min = 1

+ 8 - 0
py/core/configs/r2r_with_auth.toml

@@ -0,0 +1,8 @@
+[auth]
+provider = "r2r"
+access_token_lifetime_in_minutes = 60
+refresh_token_lifetime_in_days = 7
+require_authentication = true
+require_email_verification = false
+default_admin_email = "admin@example.com"
+default_admin_password = "change_me_immediately"

+ 30 - 0
py/core/configs/tavily.toml

@@ -0,0 +1,30 @@
+[completion]
+provider = "r2r"
+concurrent_request_limit = 128
+
+[ingestion]
+provider = "unstructured_local"
+strategy = "auto"
+chunking_strategy = "by_title"
+new_after_n_chars = 2_048
+max_characters = 4_096
+combine_under_n_chars = 1_024
+overlap = 1_024
+    [ingestion.extra_parsers]
+    pdf = "zerox"
+
+[orchestration]
+provider = "hatchet"
+kg_creation_concurrency_limit = 32
+ingestion_concurrency_limit = 16
+kg_concurrency_limit = 8
+
+[agent]
+# Enable the Tavily search and extraction tools
+rag_tools = [
+    "search_file_descriptions",
+    "search_file_knowledge",
+    "get_file_content",
+    "tavily_search",
+    "tavily_extract"
+]

+ 0 - 0
py/core/examples/__init__.py


BIN
py/core/examples/data/DeepSeek_R1.pdf


+ 430 - 0
py/core/examples/data/aristotle.txt

@@ -0,0 +1,430 @@
+Aristotle[A] (Greek: Ἀριστοτέλης Aristotélēs, pronounced [aristotélɛːs]; 384–322 BC) was an Ancient Greek philosopher and polymath. His writings cover a broad range of subjects spanning the natural sciences, philosophy, linguistics, economics, politics, psychology, and the arts. As the founder of the Peripatetic school of philosophy in the Lyceum in Athens, he began the wider Aristotelian tradition that followed, which set the groundwork for the development of modern science.
+
+Little is known about Aristotle's life. He was born in the city of Stagira in northern Greece during the Classical period. His father, Nicomachus, died when Aristotle was a child, and he was brought up by a guardian. At 17 or 18, he joined Plato's Academy in Athens and remained there until the age of 37 (c. 347 BC). Shortly after Plato died, Aristotle left Athens and, at the request of Philip II of Macedon, tutored his son Alexander the Great beginning in 343 BC. He established a library in the Lyceum, which helped him to produce many of his hundreds of books on papyrus scrolls.
+
+Though Aristotle wrote many elegant treatises and dialogues for publication, only around a third of his original output has survived, none of it intended for publication. Aristotle provided a complex synthesis of the various philosophies existing prior to him. His teachings and methods of inquiry have had a significant impact across the world, and remain a subject of contemporary philosophical discussion.
+
+Aristotle's views profoundly shaped medieval scholarship. The influence of his physical science extended from late antiquity and the Early Middle Ages into the Renaissance, and was not replaced systematically until the Enlightenment and theories such as classical mechanics were developed. He influenced Judeo-Islamic philosophies during the Middle Ages, as well as Christian theology, especially the Neoplatonism of the Early Church and the scholastic tradition of the Catholic Church.
+
+Aristotle was revered among medieval Muslim scholars as "The First Teacher", and among medieval Christians like Thomas Aquinas as simply "The Philosopher", while the poet Dante called him "the master of those who know". His works contain the earliest known formal study of logic, and were studied by medieval scholars such as Peter Abelard and Jean Buridan. Aristotle's influence on logic continued well into the 19th century. In addition, his ethics, although always influential, gained renewed interest with the modern advent of virtue ethics.
+
+Life
+In general, the details of Aristotle's life are not well-established. The biographies written in ancient times are often speculative and historians only agree on a few salient points.[B]
+
+Aristotle was born in 384 BC[C] in Stagira, Chalcidice,[2] about 55 km (34 miles) east of modern-day Thessaloniki.[3][4] His father, Nicomachus, was the personal physician to King Amyntas of Macedon. While he was young, Aristotle learned about biology and medical information, which was taught by his father.[5] Both of Aristotle's parents died when he was about thirteen, and Proxenus of Atarneus became his guardian.[6] Although little information about Aristotle's childhood has survived, he probably spent some time within the Macedonian palace, making his first connections with the Macedonian monarchy.[7]
+
+
+School of Aristotle in Mieza, Macedonia, Greece.
+At the age of seventeen or eighteen, Aristotle moved to Athens to continue his education at Plato's Academy.[8] He probably experienced the Eleusinian Mysteries as he wrote when describing the sights one viewed at the Eleusinian Mysteries, "to experience is to learn" [παθείν μαθεĩν].[9] Aristotle remained in Athens for nearly twenty years before leaving in 348/47 BC. The traditional story about his departure records that he was disappointed with the Academy's direction after control passed to Plato's nephew Speusippus, although it is possible that he feared the anti-Macedonian sentiments in Athens at that time and left before Plato died.[10] Aristotle then accompanied Xenocrates to the court of his friend Hermias of Atarneus in Asia Minor. After the death of Hermias, Aristotle travelled with his pupil Theophrastus to the island of Lesbos, where together they researched the botany and zoology of the island and its sheltered lagoon. While in Lesbos, Aristotle married Pythias, either Hermias's adoptive daughter or niece. They had a daughter, whom they also named Pythias. In 343 BC, Aristotle was invited by Philip II of Macedon to become the tutor to his son Alexander.[11][12]
+
+
+"Aristotle tutoring Alexander" by Jean Leon Gerome Ferris.
+Aristotle was appointed as the head of the royal Academy of Macedon. During Aristotle's time in the Macedonian court, he gave lessons not only to Alexander but also to two other future kings: Ptolemy and Cassander.[13] Aristotle encouraged Alexander toward eastern conquest, and Aristotle's own attitude towards Persia was unabashedly ethnocentric. In one famous example, he counsels Alexander to be "a leader to the Greeks and a despot to the barbarians, to look after the former as after friends and relatives, and to deal with the latter as with beasts or plants".[13] By 335 BC, Aristotle had returned to Athens, establishing his own school there known as the Lyceum. Aristotle conducted courses at the school for the next twelve years. While in Athens, his wife Pythias died and Aristotle became involved with Herpyllis of Stagira. They had a son whom Aristotle named after his father, Nicomachus. If the Suda – an uncritical compilation from the Middle Ages – is accurate, he may also have had an erômenos, Palaephatus of Abydus.[14]
+
+
+Portrait bust of Aristotle; an Imperial Roman (1st or 2nd century AD) copy of a lost bronze sculpture made by Lysippos.
+This period in Athens, between 335 and 323 BC, is when Aristotle is believed to have composed many of his works.[12] He wrote many dialogues, of which only fragments have survived. Those works that have survived are in treatise form and were not, for the most part, intended for widespread publication; they are generally thought to be lecture aids for his students. His most important treatises include Physics, Metaphysics, Nicomachean Ethics, Politics, On the Soul and Poetics. Aristotle studied and made significant contributions to "logic, metaphysics, mathematics, physics, biology, botany, ethics, politics, agriculture, medicine, dance, and theatre."[15]
+
+Near the end of his life, Alexander and Aristotle became estranged over Alexander's relationship with Persia and Persians. A widespread tradition in antiquity suspected Aristotle of playing a role in Alexander's death, but the only evidence of this is an unlikely claim made some six years after the death.[16] Following Alexander's death, anti-Macedonian sentiment in Athens was rekindled. In 322 BC, Demophilus and Eurymedon the Hierophant reportedly denounced Aristotle for impiety,[17] prompting him to flee to his mother's family estate in Chalcis, on Euboea, at which occasion he was said to have stated: "I will not allow the Athenians to sin twice against philosophy"[18][19][20] – a reference to Athens's trial and execution of Socrates. He died in Chalcis, Euboea[2][21][15] of natural causes later that same year, having named his student Antipater as his chief executor and leaving a will in which he asked to be buried next to his wife.[22]
+
+Theoretical philosophy
+Logic
+Main article: Term logic
+Further information: Non-Aristotelian logic
+With the Prior Analytics, Aristotle is credited with the earliest study of formal logic,[23] and his conception of it was the dominant form of Western logic until 19th-century advances in mathematical logic.[24] Kant stated in the Critique of Pure Reason that with Aristotle, logic reached its completion.[25]
+
+Organon
+Main article: Organon
+
+Plato (left) and Aristotle in Raphael's 1509 fresco, The School of Athens. Aristotle holds his Nicomachean Ethics and gestures to the earth, representing his view in immanent realism, whilst Plato gestures to the heavens, indicating his Theory of Forms, and holds his Timaeus.[26][27]
+Most of Aristotle's work is probably not in its original form, because it was most likely edited by students and later lecturers. The logical works of Aristotle were compiled into a set of six books called the Organon around 40 BC by Andronicus of Rhodes or others among his followers.[28] The books are:
+
+Categories
+On Interpretation
+Prior Analytics
+Posterior Analytics
+Topics
+On Sophistical Refutations
+The order of the books (or the teachings from which they are composed) is not certain, but this list was derived from analysis of Aristotle's writings. It goes from the basics, the analysis of simple terms in the Categories, the analysis of propositions and their elementary relations in On Interpretation, to the study of more complex forms, namely, syllogisms (in the Analytics)[29][30] and dialectics (in the Topics and Sophistical Refutations). The first three treatises form the core of the logical theory stricto sensu: the grammar of the language of logic and the correct rules of reasoning. The Rhetoric is not conventionally included, but it states that it relies on the Topics.[31]
+
+One of Aristotle's types of syllogism[D]
+
+    In words                      In terms[E]
+    All men are mortal.           M a P
+    All Greeks are men.           S a M
+    ∴ All Greeks are mortal.      S a P
+
+(The "In equations"[F] column of the original table is not reproducible in plain text.)
+What is today called Aristotelian logic with its types of syllogism (methods of logical argument),[32] Aristotle himself would have labelled "analytics". The term "logic" he reserved to mean dialectics.
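Read as set inclusion, the syllogism tabulated above (M a P, S a M, therefore S a P) is transitivity of the subset relation; a small sketch with invented example sets:

```python
# The syllogism above, read as set inclusion:
# if S ⊆ M (all Greeks are men) and M ⊆ P (all men are mortal),
# then S ⊆ P (all Greeks are mortal). Element names are illustrative.
men = {"Socrates", "Plato", "Aristotle", "Cyrus"}
mortals = men | {"Bucephalus"}               # all men are mortal:   M ⊆ P
greeks = {"Socrates", "Plato", "Aristotle"}  # all Greeks are men:   S ⊆ M

assert greeks <= men and men <= mortals      # the two premises hold
print(greeks <= mortals)  # True — the conclusion follows
```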
+
+Metaphysics
+Main article: Metaphysics (Aristotle)
+The word "metaphysics" appears to have been coined by the first century AD editor who assembled various small selections of Aristotle's works to the treatise we know by the name Metaphysics.[34] Aristotle called it "first philosophy", and distinguished it from mathematics and natural science (physics) as the contemplative (theoretikē) philosophy which is "theological" and studies the divine. He wrote in his Metaphysics (1026a16):
+
+if there were no other independent things besides the composite natural ones, the study of nature would be the primary kind of knowledge; but if there is some motionless independent thing, the knowledge of this precedes it and is first philosophy, and it is universal in just this way, because it is first. And it belongs to this sort of philosophy to study being as being, both what it is and what belongs to it just by virtue of being.[35]
+
+Substance
+Further information: Hylomorphism
+Aristotle examines the concepts of substance (ousia) and essence (to ti ên einai, "the what it was to be") in his Metaphysics (Book VII), and he concludes that a particular substance is a combination of both matter and form, a philosophical theory called hylomorphism. In Book VIII, he distinguishes the matter of the substance as the substratum, or the stuff of which it is composed. For example, the matter of a house is the bricks, stones, timbers, etc., or whatever constitutes the potential house, while the form of the substance is the actual house, namely 'covering for bodies and chattels' or any other differentia that let us define something as a house. The formula that gives the components is the account of the matter, and the formula that gives the differentia is the account of the form.[36][34]
+
+Immanent realism
+Main article: Aristotle's theory of universals
+
+Plato's forms exist as universals, like the ideal form of an apple. For Aristotle, both matter and form belong to the individual thing (hylomorphism).
+Like his teacher Plato, Aristotle's philosophy aims at the universal. Aristotle's ontology places the universal (katholou) in particulars (kath' hekaston), things in the world, whereas for Plato the universal is a separately existing form which actual things imitate. For Aristotle, "form" is still what phenomena are based on, but is "instantiated" in a particular substance.[34]
+
+Plato argued that all things have a universal form, which could be either a property or a relation to other things. When one looks at an apple, for example, one sees an apple, and one can also analyse a form of an apple. In this distinction, there is a particular apple and a universal form of an apple. Moreover, one can place an apple next to a book, so that one can speak of both the book and apple as being next to each other. Plato argued that there are some universal forms that are not a part of particular things. For example, it is possible that there is no particular good in existence, but "good" is still a proper universal form. Aristotle disagreed with Plato on this point, arguing that all universals are instantiated at some period of time, and that there are no universals that are unattached to existing things. In addition, Aristotle disagreed with Plato about the location of universals. Where Plato spoke of the forms as existing separately from the things that participate in them, Aristotle maintained that universals exist within each thing on which each universal is predicated. So, according to Aristotle, the form of apple exists within each apple, rather than in the world of the forms.[34][37]
+
+Potentiality and actuality
+Concerning the nature of change (kinesis) and its causes, as he outlines in his Physics and On Generation and Corruption (319b–320a), he distinguishes coming-to-be (genesis, also translated as 'generation') from:
+
+growth and diminution, which is change in quantity;
+locomotion, which is change in space; and
+alteration, which is change in quality.
+
+Aristotle argued that a capability like playing the flute could be acquired – the potential made actual – by learning.
+Coming-to-be is a change where the substrate of the thing that has undergone the change has itself changed. In that particular change he introduces the concept of potentiality (dynamis) and actuality (entelecheia) in association with the matter and the form. Referring to potentiality, this is what a thing is capable of doing or being acted upon if the conditions are right and it is not prevented by something else. For example, the seed of a plant in the soil is potentially (dynamei) a plant, and if it is not prevented by something, it will become a plant. Potentially, beings can either 'act' (poiein) or 'be acted upon' (paschein), which can be either innate or learned. For example, the eyes possess the potentiality of sight (innate – being acted upon), while the capability of playing the flute can be possessed by learning (exercise – acting). Actuality is the fulfilment of the end of the potentiality. Because the end (telos) is the principle of every change, and potentiality exists for the sake of the end, actuality, accordingly, is the end. Referring then to the previous example, it can be said that an actuality is when a plant does one of the activities that plants do.[34]
+
+For that for the sake of which (to hou heneka) a thing is, is its principle, and the becoming is for the sake of the end; and the actuality is the end, and it is for the sake of this that the potentiality is acquired. For animals do not see in order that they may have sight, but they have sight that they may see.[38]
+
+In summary, the matter used to make a house has potentiality to be a house and both the activity of building and the form of the final house are actualities, which is also a final cause or end. Then Aristotle proceeds and concludes that the actuality is prior to potentiality in formula, in time and in substantiality. With this definition of the particular substance (i.e., matter and form), Aristotle tries to solve the problem of the unity of the beings, for example, "what is it that makes a man one"? Since, according to Plato there are two Ideas: animal and biped, how then is man a unity? However, according to Aristotle, the potential being (matter) and the actual one (form) are one and the same.[34][39]
+
+Epistemology
+Aristotle's immanent realism means his epistemology is based on the study of things that exist or happen in the world, and rises to knowledge of the universal, whereas for Plato epistemology begins with knowledge of universal Forms (or ideas) and descends to knowledge of particular imitations of these.[31] Aristotle uses induction from examples alongside deduction, whereas Plato relies on deduction from a priori principles.[31]
+
+Natural philosophy
+Aristotle's "natural philosophy" spans a wide range of natural phenomena including those now covered by physics, biology and other natural sciences.[40] In Aristotle's terminology, "natural philosophy" is a branch of philosophy examining the phenomena of the natural world, and includes fields that would be regarded today as physics, biology and other natural sciences. Aristotle's work encompassed virtually all facets of intellectual inquiry. Aristotle makes philosophy in the broad sense coextensive with reasoning, which he also would describe as "science". However, his use of the term science carries a different meaning than that covered by the term "scientific method". For Aristotle, "all science (dianoia) is either practical, poetical or theoretical" (Metaphysics 1025b25). His practical science includes ethics and politics; his poetical science means the study of fine arts including poetry; his theoretical science covers physics, mathematics and metaphysics.[40]
+
+Physics
+
+The four classical elements (fire, air, water, earth) of Empedocles and Aristotle illustrated with a burning log. The log releases all four elements as it is destroyed.
+Main article: Aristotelian physics
+Five elements
+Main article: Classical element
+In his On Generation and Corruption, Aristotle related each of the four elements proposed earlier by Empedocles, earth, water, air, and fire, to two of the four sensible qualities, hot, cold, wet, and dry. In the Empedoclean scheme, all matter was made of the four elements, in differing proportions. Aristotle's scheme added the heavenly aether, the divine substance of the heavenly spheres, stars and planets.[41]
+
+Aristotle's elements[41]
+    Element   Hot/Cold            Wet/Dry   Motion                 Modern state of matter
+    Earth     Cold                Dry       Down                   Solid
+    Water     Cold                Wet       Down                   Liquid
+    Air       Hot                 Wet       Up                     Gas
+    Fire      Hot                 Dry       Up                     Plasma
+    Aether    (divine substance)  —         Circular (in heavens)  Vacuum
+Motion
+Further information: History of classical mechanics
+Aristotle describes two kinds of motion: "violent" or "unnatural motion", such as that of a thrown stone, in the Physics (254b10), and "natural motion", such as of a falling object, in On the Heavens (300a20). In violent motion, as soon as the agent stops causing it, the motion stops also: in other words, the natural state of an object is to be at rest,[42][G] since Aristotle does not address friction.[43] With this understanding, it can be observed that, as Aristotle stated, heavy objects (on the ground, say) require more force to make them move; and objects pushed with greater force move faster.[44][H] This would imply the equation[44]
+
+    F = mv,
+incorrect in modern physics.[44]
+
+Natural motion depends on the element concerned: the aether naturally moves in a circle around the heavens,[I] while the 4 Empedoclean elements move vertically up (like fire, as is observed) or down (like earth) towards their natural resting places.[45][43][J]
+
+
+Aristotle's laws of motion. In Physics he states that objects fall at a speed proportional to their weight and inversely proportional to the density of the fluid they are immersed in.[43] This is a correct approximation for objects in Earth's gravitational field moving in air or water.[45]
+In the Physics (215a25), Aristotle effectively states a quantitative law: that the speed, v, of a falling body is proportional (say, with constant c) to its weight, W, and inversely proportional to the density,[K] ρ, of the fluid in which it is falling:[45][43]
+
+    v = c W / ρ
+Aristotle implies that in a vacuum the speed of fall would become infinite, and concludes from this apparent absurdity that a vacuum is not possible.[45][43] Opinions have varied on whether Aristotle intended to state quantitative laws. Henri Carteron held the "extreme view"[43] that Aristotle's concept of force was basically qualitative,[46] but other authors reject this.[43]
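The relation v = cW/ρ can be read off numerically; a small sketch with an arbitrary illustrative constant c, showing that under Aristotle's law heavier bodies fall faster and the same body falls more slowly in a denser medium:

```python
# Numerical reading of Aristotle's relation v = c·W/ρ.
# The constant c and the density values are illustrative only.
def aristotle_fall_speed(weight: float, medium_density: float, c: float = 1.0) -> float:
    # Speed proportional to weight, inversely proportional to medium density;
    # as medium_density → 0 (a vacuum), the speed diverges — the absurdity
    # from which Aristotle concluded a vacuum is impossible.
    return c * weight / medium_density

air, water = 1.2, 1000.0  # rough densities in kg/m^3

assert aristotle_fall_speed(10, air) > aristotle_fall_speed(1, air)     # heavier -> faster
assert aristotle_fall_speed(10, water) < aristotle_fall_speed(10, air)  # denser medium -> slower
```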
+
+Archimedes corrected Aristotle's theory that bodies move towards their natural resting places; metal boats can float if they displace enough water; floating depends in Archimedes' scheme on the mass and volume of the object, not, as Aristotle thought, its elementary composition.[45]
+
+Aristotle's writings on motion remained influential until the Early Modern period. John Philoponus (in Late antiquity) and Galileo (in Early modern period) are said to have shown by experiment that Aristotle's claim that a heavier object falls faster than a lighter object is incorrect.[40] A contrary opinion is given by Carlo Rovelli, who argues that Aristotle's physics of motion is correct within its domain of validity, that of objects in the Earth's gravitational field immersed in a fluid such as air. In this system, heavy bodies in steady fall indeed travel faster than light ones (whether friction is ignored, or not[45]), and they do fall more slowly in a denser medium.[44][L]
+
+Newton's "forced" motion corresponds to Aristotle's "violent" motion with its external agent, but Aristotle's assumption that the agent's effect stops immediately it stops acting (e.g., the ball leaves the thrower's hand) has awkward consequences: he has to suppose that surrounding fluid helps to push the ball along to make it continue to rise even though the hand is no longer acting on it, resulting in the Medieval theory of impetus.[45]
+
+Four causes
+Main article: Four causes
+
+Aristotle argued by analogy with woodwork that a thing takes its form from four causes: in the case of a table, the wood used (material cause), its design (formal cause), the tools and techniques used (efficient cause), and its decorative or practical purpose (final cause).[47]
+Aristotle suggested that the reason for anything coming about can be attributed to four different types of simultaneously active factors. His term aitia is traditionally translated as "cause", but it does not always refer to temporal sequence; it might be better translated as "explanation", but the traditional rendering will be employed here.[48][49]
+
+Material cause describes the material out of which something is composed. Thus the material cause of a table is wood. It is not about action. It does not mean that one domino knocks over another domino.[48]
+The formal cause is its form, i.e., the arrangement of that matter. It tells one what a thing is, that a thing is determined by the definition, form, pattern, essence, whole, synthesis or archetype. It embraces the account of causes in terms of fundamental principles or general laws, as the whole (i.e., macrostructure) is the cause of its parts, a relationship known as the whole-part causation. Plainly put, the formal cause is the idea in the mind of the sculptor that brings the sculpture into being. A simple example of the formal cause is the mental image or idea that allows an artist, architect, or engineer to create a drawing.[48]
+The efficient cause is "the primary source", or that from which the change under consideration proceeds. It identifies 'what makes of what is made and what causes change of what is changed' and so suggests all sorts of agents, non-living or living, acting as the sources of change or movement or rest. Representing the current understanding of causality as the relation of cause and effect, this covers the modern definitions of "cause" as either the agent or agency or particular events or states of affairs. In the case of two dominoes, when the first is knocked over it causes the second also to fall over.[48] In the case of animals, this agency is a combination of how it develops from the egg, and how its body functions.[50]
+The final cause (telos) is its purpose, the reason why a thing exists or is done, including both purposeful and instrumental actions and activities. The final cause is the purpose or function that something is supposed to serve. This covers modern ideas of motivating causes, such as volition.[48] In the case of living things, it implies adaptation to a particular way of life.[50]
+Optics
+Further information: History of optics
+Aristotle describes experiments in optics using a camera obscura in Problems, book 15. The apparatus consisted of a dark chamber with a small aperture that let light in. With it, he saw that whatever shape he made the hole, the sun's image always remained circular. He also noted that increasing the distance between the aperture and the image surface magnified the image.[51]
+
+Chance and spontaneity
+Further information: Accident (philosophy)
+According to Aristotle, spontaneity and chance are causes of some things, distinguishable from other types of cause such as simple necessity. Chance as an incidental cause lies in the realm of accidental things, "from what is spontaneous". There is also a more specific kind of chance, which Aristotle names "luck", that applies only to people's moral choices.[52][53]
+
+Astronomy
+Further information: History of astronomy
+In astronomy, Aristotle refuted Democritus's claim that the Milky Way was made up of "those stars which are shaded by the earth from the sun's rays," pointing out partly correctly that if "the size of the sun is greater than that of the earth and the distance of the stars from the earth many times greater than that of the sun, then... the sun shines on all the stars and the earth screens none of them."[54] He also wrote descriptions of comets, including the Great Comet of 371 BC.[55]
+
+Geology and natural sciences
+Further information: History of geology
+
+Aristotle noted that the ground level of the Aeolian islands changed before a volcanic eruption.
+Aristotle was one of the first people to record any geological observations. He stated that geological change was too slow to be observed in one person's lifetime.[56][57] The geologist Charles Lyell noted that Aristotle described such change, including "lakes that had dried up" and "deserts that had become watered by rivers", giving as examples the growth of the Nile delta since the time of Homer, and "the upheaving of one of the Aeolian islands, previous to a volcanic eruption."[58]
+
+Meteorologica lends its name to the modern study of meteorology, but its modern usage diverges from the content of Aristotle's ancient treatise on meteors. The ancient Greeks did use the term for a range of atmospheric phenomena, but also for earthquakes and volcanic eruptions. Aristotle proposed that the cause of earthquakes was a gas or vapor (anathymiaseis) that was trapped inside the earth and trying to escape, following earlier Greek authors such as Anaxagoras, Empedocles, and Democritus.[59]
+
+Aristotle also made many observations about the hydrologic cycle. For example, he made some of the earliest observations about desalination: he observed early – and correctly – that when seawater is heated, freshwater evaporates and that the oceans are then replenished by the cycle of rainfall and river runoff ("I have proved by experiment that salt water evaporated forms fresh and the vapor does not when it condenses condense into sea water again.")[60]
+
+Biology
+Main article: Aristotle's biology
+
+Among many pioneering zoological observations, Aristotle described the reproductive hectocotyl arm of the octopus (bottom left).
+Empirical research
+Aristotle was the first person to study biology systematically,[61] and biology forms a large part of his writings. He spent two years observing and describing the zoology of Lesbos and the surrounding seas, including in particular the Pyrrha lagoon in the centre of Lesbos.[62][63] His data in History of Animals, Generation of Animals, Movement of Animals, and Parts of Animals are assembled from his own observations,[64] statements given by people with specialized knowledge, such as beekeepers and fishermen, and less accurate accounts provided by travellers from overseas.[65] His apparent emphasis on animals rather than plants is a historical accident: his works on botany have been lost, but two books on plants by his pupil Theophrastus have survived.[66]
+
+Aristotle reports on the sea-life visible from observation on Lesbos and the catches of fishermen. He describes the catfish, electric ray, and frogfish in detail, as well as cephalopods such as the octopus and paper nautilus. His description of the hectocotyl arm of cephalopods, used in sexual reproduction, was widely disbelieved until the 19th century.[67] He gives accurate descriptions of the four-chambered fore-stomachs of ruminants,[68] and of the ovoviviparous embryological development of the hound shark.[69]
+
+He notes that an animal's structure is well matched to function so birds like the heron (which live in marshes with soft mud and live by catching fish) have a long neck, long legs, and a sharp spear-like beak, whereas ducks that swim have short legs and webbed feet.[70] Darwin, too, noted these sorts of differences between similar kinds of animal, but unlike Aristotle used the data to come to the theory of evolution.[71] Aristotle's writings can seem to modern readers close to implying evolution, but while Aristotle was aware that new mutations or hybridizations could occur, he saw these as rare accidents. For Aristotle, accidents, like heat waves in winter, must be considered distinct from natural causes. He was thus critical of Empedocles's materialist theory of a "survival of the fittest" origin of living things and their organs, and ridiculed the idea that accidents could lead to orderly results.[72] To put his views into modern terms, he nowhere says that different species can have a common ancestor, or that one kind can change into another, or that kinds can become extinct.[73]
+
+Scientific style
+
+Aristotle inferred growth laws from his observations on animals, including that brood size decreases with body mass, whereas gestation period increases. He was correct in these predictions, at least for mammals: data are shown for mouse and elephant.
+Aristotle did not do experiments in the modern sense.[74] He used the ancient Greek term pepeiramenoi to mean observations, or at most investigative procedures like dissection.[75] In Generation of Animals, he finds a fertilized hen's egg of a suitable stage and opens it to see the embryo's heart beating inside.[76][77]
+
+Instead, he practiced a different style of science: systematically gathering data, discovering patterns common to whole groups of animals, and inferring possible causal explanations from these.[78][79] This style is common in modern biology when large amounts of data become available in a new field, such as genomics. It does not result in the same certainty as experimental science, but it sets out testable hypotheses and constructs a narrative explanation of what is observed. In this sense, Aristotle's biology is scientific.[78]
+
+From the data he collected and documented, Aristotle inferred quite a number of rules relating the life-history features of the live-bearing tetrapods (terrestrial placental mammals) that he studied. Among these correct predictions are the following. Brood size decreases with (adult) body mass, so that an elephant has fewer young (usually just one) per brood than a mouse. Lifespan increases with gestation period, and also with body mass, so that elephants live longer than mice, have a longer period of gestation, and are heavier. As a final example, fecundity decreases with lifespan, so long-lived kinds like elephants have fewer young in total than short-lived kinds like mice.[80]
+
+Classification of living things
+Further information: Scala naturae
+
+Aristotle recorded that the embryo of a dogfish was attached by a cord to a kind of placenta (the yolk sac), like a higher animal; this formed an exception to the linear scale from highest to lowest.[81]
+Aristotle distinguished about 500 species of animals,[82][83] arranging these in the History of Animals in a graded scale of perfection, a nonreligious version of the scala naturae, with man at the top. His system had eleven grades of animal, from highest potential to lowest, expressed in their form at birth: the highest gave live birth to hot and wet creatures, the lowest laid cold, dry mineral-like eggs. Animals came above plants, and these in turn were above minerals.[84][85] He grouped what the modern zoologist would call vertebrates as the hotter "animals with blood", and below them the colder invertebrates as "animals without blood". Those with blood were divided into the live-bearing (mammals), and the egg-laying (birds, reptiles, fish). Those without blood were insects, crustacea (non-shelled – cephalopods, and shelled) and the hard-shelled molluscs (bivalves and gastropods). He recognised that animals did not exactly fit into a linear scale, and noted various exceptions, such as that sharks had a placenta like the tetrapods. To a modern biologist, the explanation, not available to Aristotle, is convergent evolution.[86] Philosophers of science have generally concluded that Aristotle was not interested in taxonomy,[87][88] but zoologists who studied this question in the early 21st century think otherwise.[89][90][91] He believed that purposive final causes guided all natural processes; this teleological view justified his observed data as an expression of formal design.[92]
+
+Aristotle's Scala naturae (highest to lowest)
+Group	Examples (given by Aristotle)	Blood	Legs	Souls (R = Rational, S = Sensitive, V = Vegetative)	Qualities (Hot–Cold, Wet–Dry)
+Man	Man	with blood	2 legs	R, S, V	Hot, Wet
+Live-bearing tetrapods	Cat, hare	with blood	4 legs	S, V	Hot, Wet
+Cetaceans	Dolphin, whale	with blood	none	S, V	Hot, Wet
+Birds	Bee-eater, nightjar	with blood	2 legs	S, V	Hot, Wet, except Dry eggs
+Egg-laying tetrapods	Chameleon, crocodile	with blood	4 legs	S, V	Cold, Wet except scales, eggs
+Snakes	Water snake, Ottoman viper	with blood	none	S, V	Cold, Wet except scales, eggs
+Egg-laying fishes	Sea bass, parrotfish	with blood	none	S, V	Cold, Wet, including eggs
+(Among the egg-laying fishes): placental selachians	Shark, skate	with blood	none	S, V	Cold, Wet, but placenta like tetrapods
+Crustaceans	Shrimp, crab	without	many legs	S, V	Cold, Wet except shell
+Cephalopods	Squid, octopus	without	tentacles	S, V	Cold, Wet
+Hard-shelled animals	Cockle, trumpet snail	without	none	S, V	Cold, Dry (mineral shell)
+Larva-bearing insects	Ant, cicada	without	6 legs	S, V	Cold, Dry
+Spontaneously generating	Sponges, worms	without	none	S, V	Cold, Wet or Dry, from earth
+Plants	Fig	without	none	V	Cold, Dry
+Minerals	Iron	without	none	none	Cold, Dry
+Psychology
+Soul
+Further information: On the Soul
+
+Aristotle proposed a three-part structure for souls of plants, animals, and humans, making humans unique in having all three types of soul.
+Aristotle's psychology, given in his treatise On the Soul (peri psychēs), posits three kinds of soul ("psyches"): the vegetative soul, the sensitive soul, and the rational soul. Humans have all three. The vegetative soul is concerned with growth and nourishment. The sensitive soul experiences sensations and movement. The unique part of the human, rational soul is its ability to receive forms of other things and to compare them using the nous (intellect) and logos (reason).[93]
+
+For Aristotle, the soul is the form of a living being. Because all beings are composites of form and matter, the form of living beings is that which endows them with what is specific to living beings, e.g. the ability to initiate movement (or in the case of plants, growth and transformations, which Aristotle considers types of movement).[11] In contrast to earlier philosophers, but in accordance with the Egyptians, he placed the rational soul in the heart, rather than the brain.[94] Notable is Aristotle's division of sensation and thought, which generally differed from the concepts of previous philosophers, with the exception of Alcmaeon.[95]
+
+In On the Soul, Aristotle famously criticizes Plato's theory of the soul and develops his own in response. The first criticism is against Plato's view of the soul in the Timaeus that the soul takes up space and is able to come into physical contact with bodies.[96] 20th-century scholarship overwhelmingly opposed Aristotle's interpretation of Plato and maintained that he had misunderstood him.[97] Today's scholars have tended to re-assess Aristotle's interpretation and been more positive about it.[98] Aristotle's other criticism is that Plato's view of reincarnation entails that it is possible for a soul and its body to be mis-matched; in principle, Aristotle alleges, any soul can go with any body, according to Plato's theory.[99] Aristotle's claim that the soul is the form of a living being eliminates that possibility and thus rules out reincarnation.[100]
+
+Memory
+According to Aristotle in On the Soul, memory is the ability to hold a perceived experience in the mind and to distinguish between the internal "appearance" and an occurrence in the past.[101] In other words, a memory is a mental picture (phantasm) that can be recovered. Aristotle believed an impression is left on a semi-fluid bodily organ that undergoes several changes in order to make a memory. A memory occurs when stimuli such as sights or sounds are so complex that the nervous system cannot receive all the impressions at once. These changes are the same as those involved in the operations of sensation, Aristotelian 'common sense', and thinking.[102][103]
+
+Aristotle uses the term 'memory' for the actual retaining of an experience in the impression that can develop from sensation, and for the intellectual anxiety that comes with the impression because it is formed at a particular time and processes specific contents. Memory is of the past, prediction is of the future, and sensation is of the present. Retrieval of impressions cannot be performed suddenly. A transitional channel is needed and located in past experiences, both for previous experience and present experience.[104]
+
+Because Aristotle believes people receive all kinds of sense perceptions and perceive them as impressions, people are continually weaving together new impressions of experiences. To search for these impressions, people search the memory itself.[105] Within the memory, if one experience is offered instead of a specific memory, that person will reject this experience until they find what they are looking for. Recollection occurs when one retrieved experience naturally follows another. If the chain of "images" is needed, one memory will stimulate the next. When people recall experiences, they stimulate certain previous experiences until they reach the one that is needed.[106] Recollection is thus the self-directed activity of retrieving the information stored in a memory impression.[107] Only humans can remember impressions of intellectual activity, such as numbers and words. Animals that have perception of time can retrieve memories of their past observations. Remembering involves only perception of the things remembered and of the time passed.[108]
+
+
+Senses, perception, memory, dreams, action in Aristotle's psychology. Impressions are stored in the sensorium (the heart), linked by his laws of association (similarity, contrast, and contiguity).
+Aristotle believed the chain of thought, which ends in recollection of certain impressions, was connected systematically in relationships such as similarity, contrast, and contiguity, described in his laws of association. Aristotle believed that past experiences are hidden within the mind. A force operates to awaken the hidden material to bring up the actual experience. According to Aristotle, association is the power innate in a mental state, which operates upon the unexpressed remains of former experiences, allowing them to rise and be recalled.[109][110]
+
+Dreams
+Further information: Dream § Other
+Aristotle describes sleep in On Sleep and Wakefulness.[111] Sleep takes place as a result of overuse of the senses[112] or of digestion,[113] so it is vital to the body.[112] While a person is asleep, the critical activities, which include thinking, sensing, recalling and remembering, do not function as they do during wakefulness. Since a person cannot sense during sleep, they cannot have desire, which is the result of sensation. However, the senses are able to work during sleep,[114] albeit differently,[111] unless they are weary.[112]
+
+Dreams do not involve actually sensing a stimulus. In dreams, sensation is still involved, but in an altered manner.[112] Aristotle explains that when a person stares at a moving stimulus such as the waves in a body of water, and then looks away, the next thing they look at appears to have a wavelike motion. When a person perceives a stimulus and the stimulus is no longer the focus of their attention, it leaves an impression.[111] When the body is awake and the senses are functioning properly, a person constantly encounters new stimuli to sense and so the impressions of previously perceived stimuli are ignored.[112] However, during sleep the impressions made throughout the day are noticed as there are no new distracting sensory experiences.[111] So, dreams result from these lasting impressions. Since impressions are all that are left and not the exact stimuli, dreams do not resemble the actual waking experience.[115] During sleep, a person is in an altered state of mind. Aristotle compares a sleeping person to a person who is overtaken by strong feelings toward a stimulus. For example, a person who has a strong infatuation with someone may begin to think they see that person everywhere because they are so overtaken by their feelings. Since a person sleeping is in a suggestible state and unable to make judgements, they become easily deceived by what appears in their dreams, like the infatuated person.[111] This leads the person to believe the dream is real, even when the dreams are absurd in nature.[111] In De Anima iii 3, Aristotle ascribes the ability to create, to store, and to recall images in the absence of perception to the faculty of imagination, phantasia.[11]
+
+One component of Aristotle's theory of dreams disagrees with previously held beliefs. He claimed that dreams are not foretelling and not sent by a divine being. Aristotle reasoned naturalistically that instances in which dreams do resemble future events are simply coincidences.[116] Aristotle claimed that a dream is first established by the fact that the person is asleep when they experience it. If a person has an image appear for a moment after waking up, or sees something in the dark, it is not considered a dream because they were awake when it occurred. Secondly, any sensory experience that is perceived while a person is asleep does not qualify as part of a dream. For example, if, while a person is sleeping, a door shuts and in their dream they hear a door shut, this sensory experience is not part of the dream. Lastly, the images of dreams must be a result of lasting impressions of waking sensory experiences.[115]
+
+Practical philosophy
+Aristotle's practical philosophy covers areas such as ethics, politics, economics, and rhetoric.[40]
+
+Virtues and their accompanying vices[15]
+Too little	Virtuous mean	Too much
+Humbleness	High-mindedness	Vainglory
+Lack of purpose	Right ambition	Over-ambition
+Spiritlessness	Good temper	Irascibility
+Rudeness	Civility	Obsequiousness
+Cowardice	Courage	Rashness
+Insensibility	Self-control	Intemperance
+Sarcasm	Sincerity	Boastfulness
+Boorishness	Wit	Buffoonery
+Shamelessness	Modesty	Shyness
+Callousness	Just resentment	Spitefulness
+Pettiness	Generosity	Vulgarity
+Meanness	Liberality	Wastefulness
+Ethics
+Main article: Aristotelian ethics
+Aristotle considered ethics to be a practical rather than theoretical study, i.e., one aimed at becoming good and doing good rather than knowing for its own sake. He wrote several treatises on ethics, most notably the Nicomachean Ethics.[117]
+
+Aristotle taught that virtue has to do with the proper function (ergon) of a thing. An eye is only a good eye in so much as it can see, because the proper function of an eye is sight. Aristotle reasoned that humans must have a function specific to humans, and that this function must be an activity of the psuchē (soul) in accordance with reason (logos). Aristotle identified such an optimum activity (the virtuous mean, between the accompanying vices of excess or deficiency[15]) of the soul as the aim of all human deliberate action, eudaimonia, generally translated as "happiness" or sometimes "well-being". To have the potential of ever being happy in this way necessarily requires a good character (ēthikē aretē), often translated as moral or ethical virtue or excellence.[118]
+
+Aristotle taught that achieving a virtuous and potentially happy character requires a first stage of having the fortune to be habituated, not deliberately but by teachers and experience, leading to a later stage in which one consciously chooses to do the best things. When the best people come to live life this way, their practical wisdom (phronesis) and their intellect (nous) can develop with each other towards the highest possible human virtue, the wisdom of an accomplished theoretical or speculative thinker, or in other words, a philosopher.[119]
+
+Politics
+Main article: Politics (Aristotle)
+In addition to his works on ethics, which address the individual, Aristotle addressed the city in his work titled Politics. Aristotle considered the city to be a natural community. Moreover, he considered the city to be prior in importance to the family, which in turn is prior to the individual, "for the whole must of necessity be prior to the part".[120] He famously stated that "man is by nature a political animal" and argued that humanity's defining factor among others in the animal kingdom is its rationality.[121] Aristotle conceived of politics as being like an organism rather than like a machine, and as a collection of parts none of which can exist without the others. Aristotle's conception of the city is organic, and he is considered one of the first to conceive of the city in this manner.[122]
+
+
+Aristotle's classifications of political constitutions.
+The common modern understanding of a political community as a modern state is quite different from Aristotle's understanding. Although he was aware of the existence and potential of larger empires, the natural community according to Aristotle was the city (polis) which functions as a political "community" or "partnership" (koinōnia). The aim of the city is not just to avoid injustice or for economic stability, but rather to allow at least some citizens the possibility to live a good life, and to perform beautiful acts: "The political partnership must be regarded, therefore, as being for the sake of noble actions, not for the sake of living together." This is distinguished from modern approaches, beginning with social contract theory, according to which individuals leave the state of nature because of "fear of violent death" or its "inconveniences".[M]
+
+In Protrepticus, the character 'Aristotle' states:[123]
+
+For we all agree that the most excellent man should rule, i.e., the supreme by nature, and that the law rules and alone is authoritative; but the law is a kind of intelligence, i.e. a discourse based on intelligence. And again, what standard do we have, what criterion of good things, that is more precise than the intelligent man? For all that this man will choose, if the choice is based on his knowledge, are good things and their contraries are bad. And since everybody chooses most of all what conforms to their own proper dispositions (a just man choosing to live justly, a man with bravery to live bravely, likewise a self-controlled man to live with self-control), it is clear that the intelligent man will choose most of all to be intelligent; for this is the function of that capacity. Hence it's evident that, according to the most authoritative judgment, intelligence is supreme among goods.[123]
+
+As Plato's disciple, Aristotle was rather critical of democracy and, following the outline of certain ideas from Plato's Statesman, developed a coherent theory of integrating various forms of power into a so-called mixed state:
+
+It is … constitutional to take … from oligarchy that offices are to be elected, and from democracy that this is not to be on a property-qualification. This then is the mode of the mixture; and the mark of a good mixture of democracy and oligarchy is when it is possible to speak of the same constitution as a democracy and as an oligarchy.
+
+— Aristotle. Politics, Book 4, 1294b.10–18
+Aristotle's views on women influenced later Western philosophers, who quoted him as an authority until the end of the Middle Ages, but these views have been controversial in modern times. Aristotle's analysis of procreation describes an active, ensouling masculine element bringing life to an inert, passive female element. The biological differences are a result of the fact that the female body is well-suited for reproduction, which changes her body temperature, which in turn makes her, in Aristotle's view, incapable of participating in political life.[124] On this ground, proponents of feminist metaphysics have accused Aristotle of misogyny[125] and sexism.[126] However, Aristotle gave equal weight to women's happiness as he did to men's, and commented in his Rhetoric that the things that lead to happiness need to be in women as well as men.[N]
+
+Economics
+Main article: Politics (Aristotle)
+Aristotle made substantial contributions to economic thought, especially to thought in the Middle Ages.[128] In Politics, Aristotle addresses the city, property, and trade. His response to criticisms of private property, in Lionel Robbins's view, anticipated later proponents of private property among philosophers and economists, as it related to the overall utility of social arrangements.[128] Aristotle believed that although communal arrangements may seem beneficial to society, and that although private property is often blamed for social strife, such evils in fact come from human nature. In Politics, Aristotle offers one of the earliest accounts of the origin of money.[128] Money came into use because people became dependent on one another, importing what they needed and exporting the surplus. For the sake of convenience, people then agreed to deal in something that is intrinsically useful and easily applicable, such as iron or silver.[129]
+
+Aristotle's discussions of retail and interest were a major influence on economic thought in the Middle Ages. He had a low opinion of retail, believing that, unlike using money to procure the things one needs in managing the household, retail trade seeks to make a profit. It thus uses goods as a means to an end, rather than as an end in itself. He believed that retail trade was in this way unnatural. Similarly, Aristotle considered making a profit through interest unnatural, as it makes a gain out of the money itself, and not from its use.[129]
+
+Aristotle gave a summary of the function of money that was perhaps remarkably precocious for his time. He wrote that because it is impossible to determine the value of every good through a count of the number of other goods it is worth, the necessity arises of a single universal standard of measurement. Money thus allows for the association of different goods and makes them "commensurable".[129] He goes on to state that money is also useful for future exchange, making it a sort of security. That is, "if we do not want a thing now, we shall be able to get it when we do want it".[129]
+
+Rhetoric
+Main article: Rhetoric (Aristotle)
+Aristotle's Rhetoric proposes that a speaker can use three basic kinds of appeals to persuade his audience: ethos (an appeal to the speaker's character), pathos (an appeal to the audience's emotion), and logos (an appeal to logical reasoning).[130] He also categorizes rhetoric into three genres: epideictic (ceremonial speeches dealing with praise or blame), forensic (judicial speeches over guilt or innocence), and deliberative (speeches calling on an audience to make a decision on an issue).[131] Aristotle also outlines two kinds of rhetorical proofs: enthymeme (proof by syllogism) and paradeigma (proof by example).[132]
+
+Poetics
+Main article: Poetics (Aristotle)
+Aristotle writes in his Poetics that epic poetry, tragedy, comedy, dithyrambic poetry, painting, sculpture, music, and dance are all fundamentally acts of mimesis ("imitation"), each varying in imitation by medium, object, and manner.[133][134] He applies the term mimesis both as a property of a work of art and also as the product of the artist's intention[133] and contends that the audience's realisation of the mimesis is vital to understanding the work itself.[133] Aristotle states that mimesis is a natural instinct of humanity that separates humans from animals[133][135] and that all human artistry "follows the pattern of nature".[133] Because of this, Aristotle believed that each of the mimetic arts possesses what Stephen Halliwell calls "highly structured procedures for the achievement of their purposes."[136] For example, music imitates with the media of rhythm and harmony, whereas dance imitates with rhythm alone, and poetry with language. The forms also differ in their object of imitation. Comedy, for instance, is a dramatic imitation of men worse than average; whereas tragedy imitates men slightly better than average. Lastly, the forms differ in their manner of imitation – through narrative or character, through change or no change, and through drama or no drama.[137]
+
+
+The Blind Oedipus Commending his Children to the Gods (1784) by Bénigne Gagneraux. In his Poetics, Aristotle uses the tragedy Oedipus Tyrannus by Sophocles as an example of how the perfect tragedy should be structured, with a generally good protagonist who starts the play prosperous, but loses everything through some hamartia (fault).[138]
+While it is believed that Aristotle's Poetics originally comprised two books – one on comedy and one on tragedy – only the portion that focuses on tragedy has survived. Aristotle taught that tragedy is composed of six elements: plot-structure, character, style, thought, spectacle, and lyric poetry.[139] The characters in a tragedy are merely a means of driving the story; and the plot, not the characters, is the chief focus of tragedy. Tragedy is the imitation of action arousing pity and fear, and is meant to effect the catharsis of those same emotions. Aristotle concludes Poetics with a discussion on which, if either, is superior: epic or tragic mimesis. He suggests that because tragedy possesses all the attributes of an epic, possibly possesses additional attributes such as spectacle and music, is more unified, and achieves the aim of its mimesis in shorter scope, it can be considered superior to epic.[140] Aristotle was a keen systematic collector of riddles, folklore, and proverbs; he and his school had a special interest in the riddles of the Delphic Oracle and studied the fables of Aesop.[141]
+
+Transmission
+Further information: List of writers influenced by Aristotle
+More than 2300 years after his death, Aristotle remains one of the most influential people who ever lived.[142][143][144] He contributed to almost every field of human knowledge then in existence, and he was the founder of many new fields. According to the philosopher Bryan Magee, "it is doubtful whether any human being has ever known as much as he did".[145]
+
+Among countless other achievements, Aristotle was the founder of formal logic,[146] pioneered the study of zoology, and left every future scientist and philosopher in his debt through his contributions to the scientific method.[2][147][148] Taneli Kukkonen observes that his achievement in founding two sciences is unmatched, and his reach in influencing "every branch of intellectual enterprise", including Western ethical and political theory, theology, rhetoric, and literary analysis, is equally long. As a result, Kukkonen argues, any analysis of reality today "will almost certainly carry Aristotelian overtones ... evidence of an exceptionally forceful mind."[148] Jonathan Barnes wrote that "an account of Aristotle's intellectual afterlife would be little less than a history of European thought".[149]
+
+Aristotle has been called the father of logic, biology, political science, zoology, embryology, natural law, scientific method, rhetoric, psychology, realism, criticism, individualism, teleology, and meteorology.[151]
+
+The scholar Taneli Kukkonen notes that "in the best 20th-century scholarship Aristotle comes alive as a thinker wrestling with the full weight of the Greek philosophical tradition."[148] What follows is an overview of the transmission and influence of his texts and ideas into the modern era.
+
+His successor, Theophrastus
+Main articles: Theophrastus and Historia Plantarum (Theophrastus)
+
+Frontispiece to a 1644 version of Theophrastus's Historia Plantarum, originally written around 300 BC.
+Aristotle's pupil and successor, Theophrastus, wrote the History of Plants, a pioneering work in botany. Some of his technical terms remain in use, such as carpel from carpos, fruit, and pericarp, from pericarpion, seed chamber.[152] Theophrastus was much less concerned with formal causes than Aristotle was, instead pragmatically describing how plants functioned.[153][154]
+
+Later Greek philosophy
+Further information: Peripatetic school
+The immediate influence of Aristotle's work was felt as the Lyceum grew into the Peripatetic school. Aristotle's students included Aristoxenus, Dicaearchus, Demetrius of Phalerum, Eudemos of Rhodes, Harpalus, Hephaestion, Mnason of Phocis, Nicomachus, and Theophrastus. Aristotle's influence over Alexander the Great is seen in the latter's bringing with him on his expedition a host of zoologists, botanists, and researchers. He had also learned a great deal about Persian customs and traditions from his teacher. Although his respect for Aristotle was diminished as his travels made it clear that much of Aristotle's geography was wrong, when the old philosopher released his works to the public, Alexander complained "Thou hast not done well to publish thy acroamatic doctrines; for in what shall I surpass other men if those doctrines wherein I have been trained are to be all men's common property?"[155]
+
+Hellenistic science
+Further information: Ancient Greek medicine
+After Theophrastus, the Lyceum failed to produce any original work. Though interest in Aristotle's ideas survived, they were generally taken unquestioningly.[156] It is not until the age of Alexandria under the Ptolemies that advances in biology can again be found.
+
+The first medical teacher at Alexandria, Herophilus of Chalcedon, corrected Aristotle, placing intelligence in the brain, and connected the nervous system to motion and sensation. Herophilus also distinguished between veins and arteries, noting that the latter pulse while the former do not.[157] Though a few ancient atomists such as Lucretius challenged the teleological viewpoint of Aristotelian ideas about life, teleology (and after the rise of Christianity, natural theology) would remain central to biological thought essentially until the 18th and 19th centuries. Ernst Mayr states that there was "nothing of any real consequence in biology after Lucretius and Galen until the Renaissance."[158]
+
+Revival
+In the slumbering centuries following the decline of the Roman Empire, Aristotle's vast philosophical and scientific corpus lay largely dormant in the West. But in the burgeoning intellectual heartland of the Abbasid Caliphate, his works underwent a remarkable revival.[159] Translated into Arabic alongside other Greek classics, Aristotle's logic, ethics, and natural philosophy ignited the minds of early Islamic scholars.[160]
+
+Through meticulous commentaries and critical engagements, figures like Al-Farabi and Ibn Sina (Avicenna) breathed new life into Aristotle's ideas. They harmonized his logic with Islamic theology, employed his scientific methodologies to explore the natural world, and even reinterpreted his ethics within the framework of Islamic morality. This revival was not mere imitation. Islamic thinkers embraced Aristotle's rigorous methods while simultaneously challenging his conclusions where they diverged from their own religious beliefs.[161]
+
+Byzantine scholars
+See also: Commentaries on Aristotle and Byzantine Aristotelianism
+Greek Christian scribes played a crucial role in the preservation of Aristotle by copying all the extant Greek language manuscripts of the corpus. The first Greek Christians to comment extensively on Aristotle were Philoponus, Elias, and David in the sixth century, and Stephen of Alexandria in the early seventh century.[162] John Philoponus stands out for having attempted a fundamental critique of Aristotle's views on the eternity of the world, movement, and other elements of Aristotelian thought.[163] Philoponus questioned Aristotle's teaching of physics, noting its flaws and introducing the theory of impetus to explain his observations.[164]
+
+After a hiatus of several centuries, formal commentary by Eustratius and Michael of Ephesus reappeared in the late eleventh and early twelfth centuries, apparently sponsored by Anna Comnena.[165]
+
+Medieval Islamic world
+Further information: Logic in Islamic philosophy and Transmission of the Greek Classics
+
+Islamic portrayal of Aristotle (right) in the Kitāb naʿt al-ḥayawān, c. 1220.[166]
+Aristotle was one of the most revered Western thinkers in early Islamic theology. Most of the still extant works of Aristotle,[167] as well as a number of the original Greek commentaries, were translated into Arabic and studied by Muslim philosophers, scientists and scholars. Averroes, Avicenna and Alpharabius, who wrote on Aristotle in great depth, also influenced Thomas Aquinas and other Western Christian scholastic philosophers. Alkindus greatly admired Aristotle's philosophy,[168] and Averroes spoke of Aristotle as the "exemplar" for all future philosophers.[169] Medieval Muslim scholars regularly described Aristotle as the "First Teacher".[167] The title was later used by Western philosophers (as in the famous poem of Dante) who were influenced by the tradition of Islamic philosophy.[170]
+
+Medieval Europe
+Further information: Aristotelianism and Syllogism § Medieval
+With the loss of the study of ancient Greek in the early medieval Latin West, Aristotle was practically unknown there from c. CE 600 to c. 1100 except through the Latin translation of the Organon made by Boethius. In the twelfth and thirteenth centuries, interest in Aristotle revived and Latin Christians had translations made, both from Arabic translations, such as those by Gerard of Cremona,[171] and from the original Greek, such as those by James of Venice and William of Moerbeke.
+
+After the Scholastic Thomas Aquinas wrote his Summa Theologica, working from Moerbeke's translations and calling Aristotle "The Philosopher",[172] the demand for Aristotle's writings grew, and the Greek manuscripts returned to the West, stimulating a revival of Aristotelianism in Europe that continued into the Renaissance.[173] These thinkers blended Aristotelian philosophy with Christianity, bringing the thought of Ancient Greece into the Middle Ages. Scholars such as Boethius, Peter Abelard, and John Buridan worked on Aristotelian logic.[174]
+
+According to scholar Roger Theodore Lafferty, Dante built up the philosophy of the Comedy with the works of Aristotle as a foundation, just as the scholastics used Aristotle as the basis for their thinking. Dante knew Aristotle directly from Latin translations of his works and indirectly through quotations in the works of Albertus Magnus.[175] Dante even acknowledges Aristotle's influence explicitly in the poem, specifically when Virgil justifies the Inferno's structure by citing the Nicomachean Ethics.[176] Dante famously refers to him as "he / Who is acknowledged Master of those who know".[177][178]
+
+Medieval Judaism
+Moses Maimonides (considered to be the foremost intellectual figure of medieval Judaism)[179] adopted Aristotelianism from the Islamic scholars and based his Guide for the Perplexed on it; that work became the basis of Jewish scholastic philosophy. Maimonides also considered Aristotle to be the greatest philosopher who ever lived, and styled him the "chief of the philosophers".[180][181][182] In his letter to Samuel ibn Tibbon, Maimonides observes that there is no need for Samuel to study the writings of philosophers who preceded Aristotle because the works of the latter are "sufficient by themselves and [superior] to all that were written before them. His intellect, Aristotle's is the extreme limit of human intellect, apart from him upon whom the divine emanation has flowed forth to such an extent that they reach the level of prophecy, there being no level higher".[183]
+
+Early Modern science
+
+William Harvey's De Motu Cordis, 1628, showed that the blood circulated, contrary to classical era thinking.
+In the Early Modern period, scientists such as William Harvey in England and Galileo Galilei in Italy reacted against the theories of Aristotle and other classical era thinkers like Galen, establishing new theories based to some degree on observation and experiment. Harvey demonstrated the circulation of the blood, establishing that the heart functioned as a pump rather than being the seat of the soul and the controller of the body's heat, as Aristotle thought.[184] Galileo used more doubtful arguments to displace Aristotle's physics, proposing that bodies all fall at the same speed whatever their weight.[185]
+
+18th and 19th-century science
+The English mathematician George Boole fully accepted Aristotle's logic, but decided "to go under, over, and beyond" it with his system of algebraic logic in his 1854 book The Laws of Thought. This gives logic a mathematical foundation with equations, enables it to solve equations as well as check validity, and allows it to handle a wider class of problems by expanding propositions of any number of terms, not just two.[186]
+
+Charles Darwin regarded Aristotle as the most important contributor to the subject of biology. In an 1882 letter he wrote that "Linnaeus and Cuvier have been my two gods, though in very different ways, but they were mere schoolboys to old Aristotle".[187][188] In later editions of On the Origin of Species, Darwin traced evolutionary ideas as far back as Aristotle;[189] the text he cites is a summary by Aristotle of the ideas of the earlier Greek philosopher Empedocles.[190]
+
+Present science
+The philosopher Bertrand Russell claims that "almost every serious intellectual advance has had to begin with an attack on some Aristotelian doctrine". Russell calls Aristotle's ethics "repulsive", and labels his logic "as definitely antiquated as Ptolemaic astronomy". Russell states that these errors make it difficult to do historical justice to Aristotle, until one remembers what an advance he made upon all of his predecessors.[191]
+
+The Dutch historian of science Eduard Jan Dijksterhuis writes that Aristotle and his predecessors showed the difficulty of science by "proceed[ing] so readily to frame a theory of such a general character" on limited evidence from their senses.[192] In 1985, the biologist Peter Medawar could still state in "pure seventeenth century"[193] tones that Aristotle had assembled "a strange and generally speaking rather tiresome farrago of hearsay, imperfect observation, wishful thinking and credulity amounting to downright gullibility".[193][194]
+
+Zoologists have frequently mocked Aristotle for errors and unverified secondhand reports. However, modern observation has confirmed several of his more surprising claims.[195][196][197] Aristotle's work remains largely unknown to modern scientists, though zoologists sometimes mention him as the father of biology[150] or in particular of marine biology.[198] Practising zoologists are unlikely to adhere to Aristotle's chain of being, but its influence is still perceptible in the use of the terms "lower" and "upper" to designate taxa such as groups of plants.[199] The evolutionary biologist Armand Marie Leroi has reconstructed Aristotle's biology,[200] while Niko Tinbergen's four questions, based on Aristotle's four causes, are used to analyse animal behaviour; they examine function, phylogeny, mechanism, and ontogeny.[201][202] The concept of homology began with Aristotle;[203] the evolutionary developmental biologist Lewis I. Held commented that he would be interested in the concept of deep homology.[204]
+
+Surviving works
+Corpus Aristotelicum
+Main article: Works of Aristotle
+
+First page of a 1566 edition of the Nicomachean Ethics in Greek and Latin.
+The works of Aristotle that have survived from antiquity through medieval manuscript transmission are collected in the Corpus Aristotelicum. These texts, as opposed to Aristotle's lost works, are technical philosophical treatises from within Aristotle's school.[205] Reference to them is made according to the organization of Immanuel Bekker's Royal Prussian Academy edition (Aristotelis Opera edidit Academia Regia Borussica, Berlin, 1831–1870), which in turn is based on ancient classifications of these works.[206]
+
+Loss and preservation
+Further information: Transmission of the Greek Classics
+Aristotle wrote his works on papyrus scrolls, the common writing medium of that era.[O] His writings are divisible into two groups: the "exoteric", intended for the public, and the "esoteric", for use within the Lyceum school.[208][P][209] Aristotle's "lost" works stray considerably in characterization from the surviving Aristotelian corpus. Whereas the lost works appear to have been originally written with a view to subsequent publication, the surviving works mostly resemble lecture notes not intended for publication.[210][208] Cicero's description of Aristotle's literary style as "a river of gold" must have applied to the published works, not the surviving notes.[Q] A major question in the history of Aristotle's works is how the exoteric writings were all lost, and how the ones now possessed came to be found.[212] The consensus is that Andronicus of Rhodes collected the esoteric works of Aristotle's school which existed in the form of smaller, separate works, distinguished them from those of Theophrastus and other Peripatetics, edited them, and finally compiled them into the more cohesive, larger works as they are known today.[213][214]
+
+According to Strabo and Plutarch, after Aristotle's death, his library and writings went to Theophrastus (Aristotle's successor as head of the Lyceum and the Peripatetic school).[215] After the death of Theophrastus, the peripatetic library went to Neleus of Scepsis.[216]: 5 
+
+Some time later, the Kingdom of Pergamon began conscripting books for a royal library, and the heirs of Neleus hid their collection in a cellar to prevent it from being seized for that purpose. The library was stored there for about a century and a half, in conditions that were not ideal for document preservation. On the death of Attalus III, which also ended the royal library ambitions, the existence of the Aristotelian library was disclosed, and it was purchased by Apellicon and returned to Athens in about 100 BC.[216]: 5–6 
+
+Apellicon sought to recover the texts, many of which were seriously degraded at this point due to the conditions in which they were stored. He had them copied out into new manuscripts, and used his best guesswork to fill in the gaps where the originals were unreadable.[216]: 5–6 
+
+When Sulla seized Athens in 86 BC, he seized the library and transferred it to Rome. There, Andronicus of Rhodes organized the texts into the first complete edition of Aristotle's works (and works attributed to him).[217] The Aristotelian texts we have today are based on these.[216]: 6–8 
+
+Depictions in art
+Paintings
+Aristotle has been depicted by major artists including Lucas Cranach the Elder,[218] Justus van Gent, Raphael, Paolo Veronese, Jusepe de Ribera,[219] Rembrandt,[220] and Francesco Hayez over the centuries. Among the best-known depictions is Raphael's fresco The School of Athens, in the Vatican's Apostolic Palace, where the figures of Plato and Aristotle are central to the image, at the architectural vanishing point, reflecting their importance.[221] Rembrandt's Aristotle with a Bust of Homer, too, is a celebrated work, showing the knowing philosopher and the blind Homer from an earlier age: as the art critic Jonathan Jones writes, "this painting will remain one of the greatest and most mysterious in the world, ensnaring us in its musty, glowing, pitch-black, terrible knowledge of time."[222][223]

+ 9 - 0
py/core/examples/data/aristotle_v2.txt

@@ -0,0 +1,9 @@
+Aristotle[A] (Greek: Ἀριστοτέλης Aristotélēs, pronounced [aristotélɛːs]; 384–322 BC) was an Ancient Greek philosopher and polymath. His writings cover a broad range of subjects spanning the natural sciences, philosophy, linguistics, economics, politics, psychology, and the arts. As the founder of the Peripatetic school of philosophy in the Lyceum in Athens, he began the wider Aristotelian tradition that followed, which set the groundwork for the development of modern science.
+
+Little is known about Aristotle's life. He was born in the city of Stagira in northern Greece during the Classical period. His father, Nicomachus, died when Aristotle was a child, and he was brought up by a guardian. At 17 or 18, he joined Plato's Academy in Athens and remained there until the age of 37 (c. 347 BC). Shortly after Plato died, Aristotle left Athens and, at the request of Philip II of Macedon, tutored his son Alexander the Great beginning in 343 BC. He established a library in the Lyceum, which helped him to produce many of his hundreds of books on papyrus scrolls.
+
+Though Aristotle wrote many elegant treatises and dialogues for publication, only around a third of his original output has survived, none of it intended for publication. Aristotle provided a complex synthesis of the various philosophies existing prior to him. His teachings and methods of inquiry have had a significant impact across the world, and remain a subject of contemporary philosophical discussion.
+
+Aristotle's views profoundly shaped medieval scholarship. The influence of his physical science extended from late antiquity and the Early Middle Ages into the Renaissance, and was not replaced systematically until the Enlightenment and theories such as classical mechanics were developed. He influenced Judeo-Islamic philosophies during the Middle Ages, as well as Christian theology, especially the Neoplatonism of the Early Church and the scholastic tradition of the Catholic Church.
+
+Aristotle was revered among medieval Muslim scholars as "The First Teacher", and among medieval Christians like Thomas Aquinas as simply "The Philosopher", while the poet Dante called him "the master of those who know". His works contain the earliest known formal study of logic, and were studied by medieval scholars such as Peter Abelard and Jean Buridan. Aristotle's influence on logic continued well into the 19th century. In addition, his ethics, although always influential, gained renewed interest with the modern advent of virtue ethics.

+ 29 - 0
py/core/examples/data/aristotle_v3.txt

@@ -0,0 +1,29 @@
+
+Aristotle proposed a three-part structure for souls of plants, animals, and humans, making humans unique in having all three types of soul.
+Aristotle's psychology, given in his treatise On the Soul (peri psychēs), posits three kinds of soul ("psyches"): the vegetative soul, the sensitive soul, and the rational soul. Humans have all three. The vegetative soul is concerned with growth and nourishment. The sensitive soul experiences sensations and movement. The unique part of the human, rational soul is its ability to receive forms of other things and to compare them using the nous (intellect) and logos (reason).[93]
+
+For Aristotle, the soul is the form of a living being. Because all beings are composites of form and matter, the form of living beings is that which endows them with what is specific to living beings, e.g. the ability to initiate movement (or in the case of plants, growth and transformations, which Aristotle considers types of movement).[11] In contrast to earlier philosophers, but in accordance with the Egyptians, he placed the rational soul in the heart, rather than the brain.[94] Notable is Aristotle's division of sensation and thought, which generally differed from the concepts of previous philosophers, with the exception of Alcmaeon.[95]
+
+In On the Soul, Aristotle famously criticizes Plato's theory of the soul and develops his own in response. The first criticism is against Plato's view of the soul in the Timaeus that the soul takes up space and is able to come into physical contact with bodies.[96] 20th-century scholarship overwhelmingly opposed Aristotle's interpretation of Plato and maintained that he had misunderstood him.[97] Today's scholars have tended to re-assess Aristotle's interpretation and been more positive about it.[98] Aristotle's other criticism is that Plato's view of reincarnation entails that it is possible for a soul and its body to be mis-matched; in principle, Aristotle alleges, any soul can go with any body, according to Plato's theory.[99] Aristotle's claim that the soul is the form of a living being eliminates that possibility and thus rules out reincarnation.[100]
+
+Memory
+According to Aristotle in On the Soul, memory is the ability to hold a perceived experience in the mind and to distinguish between the internal "appearance" and an occurrence in the past.[101] In other words, a memory is a mental picture (phantasm) that can be recovered. Aristotle believed an impression is left on a semi-fluid bodily organ that undergoes several changes in order to make a memory. A memory occurs when stimuli such as sights or sounds are so complex that the nervous system cannot receive all the impressions at once. These changes are the same as those involved in the operations of sensation, Aristotelian 'common sense', and thinking.[102][103]
+
+Aristotle uses the term 'memory' for the actual retaining of an experience in the impression that can develop from sensation, and for the intellectual anxiety that comes with the impression because it is formed at a particular time and processing specific contents. Memory is of the past, prediction is of the future, and sensation is of the present. Retrieval of impressions cannot be performed suddenly. A transitional channel is needed and located in past experiences, both for previous experience and present experience.[104]
+
+Because Aristotle believes people receive all kinds of sense perceptions and perceive them as impressions, people are continually weaving together new impressions of experiences. To search for these impressions, people search the memory itself.[105] Within the memory, if one experience is offered instead of a specific memory, that person will reject this experience until they find what they are looking for. Recollection occurs when one retrieved experience naturally follows another. If the chain of "images" is needed, one memory will stimulate the next. When people recall experiences, they stimulate certain previous experiences until they reach the one that is needed.[106] Recollection is thus the self-directed activity of retrieving the information stored in a memory impression.[107] Only humans can remember impressions of intellectual activity, such as numbers and words. Animals that have perception of time can retrieve memories of their past observations. Remembering involves only perception of the things remembered and of the time passed.[108]
+
+
+Senses, perception, memory, dreams, action in Aristotle's psychology. Impressions are stored in the sensorium (the heart), linked by his laws of association (similarity, contrast, and contiguity).
+Aristotle believed the chain of thought, which ends in recollection of certain impressions, was connected systematically in relationships such as similarity, contrast, and contiguity, described in his laws of association. Aristotle believed that past experiences are hidden within the mind. A force operates to awaken the hidden material to bring up the actual experience. According to Aristotle, association is the power innate in a mental state, which operates upon the unexpressed remains of former experiences, allowing them to rise and be recalled.[109][110]
+
+Dreams
+Further information: Dream § Other
+Aristotle describes sleep in On Sleep and Wakefulness.[111] Sleep takes place as a result of overuse of the senses[112] or of digestion,[113] so it is vital to the body.[112] While a person is asleep, the critical activities, which include thinking, sensing, recalling and remembering, do not function as they do during wakefulness. Since a person cannot sense during sleep, they cannot have desire, which is the result of sensation. However, the senses are able to work during sleep,[114] albeit differently,[111] unless they are weary.[112]
+
+Dreams do not involve actually sensing a stimulus. In dreams, sensation is still involved, but in an altered manner.[112] Aristotle explains that when a person stares at a moving stimulus such as the waves in a body of water, and then looks away, the next thing they look at appears to have a wavelike motion. When a person perceives a stimulus and the stimulus is no longer the focus of their attention, it leaves an impression.[111] When the body is awake and the senses are functioning properly, a person constantly encounters new stimuli to sense and so the impressions of previously perceived stimuli are ignored.[112] However, during sleep the impressions made throughout the day are noticed as there are no new distracting sensory experiences.[111] So, dreams result from these lasting impressions. Since impressions are all that are left and not the exact stimuli, dreams do not resemble the actual waking experience.[115] During sleep, a person is in an altered state of mind. Aristotle compares a sleeping person to a person who is overtaken by strong feelings toward a stimulus. For example, a person who has a strong infatuation with someone may begin to think they see that person everywhere because they are so overtaken by their feelings. Since a person sleeping is in a suggestible state and unable to make judgements, they become easily deceived by what appears in their dreams, like the infatuated person.[111] This leads the person to believe the dream is real, even when the dreams are absurd in nature.[111] In De Anima iii 3, Aristotle ascribes the ability to create, to store, and to recall images in the absence of perception to the faculty of imagination, phantasia.[11]
+
+One component of Aristotle's theory of dreams disagrees with previously held beliefs. He claimed that dreams are not foretelling and not sent by a divine being. Aristotle reasoned naturalistically that instances in which dreams do resemble future events are simply coincidences.[116] Aristotle claimed that a dream is first established by the fact that the person is asleep when they experience it. If a person had an image appear for a moment after waking up or if they see something in the dark it is not considered a dream because they were awake when it occurred. Secondly, any sensory experience that is perceived while a person is asleep does not qualify as part of a dream. For example, if, while a person is sleeping, a door shuts and in their dream they hear a door is shut, this sensory experience is not part of the dream. Lastly, the images of dreams must be a result of lasting impressions of waking sensory experiences.[115]
+
+Practical philosophy
+Aristotle's practical philosophy covers areas such as ethics, politics, economics, and rhetoric.[40]

+ 80 - 0
py/core/examples/data/got.txt

@@ -0,0 +1,80 @@
+Eddard (Ned) Stark
+The Lord of Winterfell and new Hand of the King. A devoted father and dutiful lord, he is best characterized by his strong sense of honor, and he strives to always do what is right, regardless of his personal feelings.
+Catelyn (Cat) Tully
+Ned’s wife and Lady Stark of Winterfell. She is intelligent, strong, and fiercely devoted to her family, leading her to seek out the person responsible for trying to kill her son Bran.
+Daenerys Stormborn Targaryen
+The Dothraki khaleesi (queen) and Targaryen princess. She and her brother are the only surviving members of the Targaryen family, and she grows from a frightened girl to a confident ruler, while still maintaining her kindness, over the course of the novel.
+Jon Snow
+Ned Stark’s bastard son. Since Catelyn is not his mother, he is not a proper member of the Stark family, and he often feels himself an outsider. He is also a highly capable swordsman and thinker, with a knack for piercing observations.
+Tyrion (The Imp) Lannister
+A small man with a giant intellect and sharp tongue. Tyrion does not pity himself but rather accepts his shortcomings as a little person and turns them to his advantage. He loves his family but recognizes their greed and ambition.
+Bran Stark
+One of the youngest of the Stark children. Bran is fascinated by stories of knights and adventure, but when he is paralyzed in a fall and realizes he is no longer able to become a knight, he is forced to reconsider his life.
+Sansa Stark
+The elder Stark daughter and a beautiful, but extremely naïve, young girl. The twelve-year-old Sansa imagines her life as though it were a storybook, ignoring cruel realities around her and concerning herself only with marrying Joffrey Baratheon.
+Arya Stark
+The youngest Stark girl and a wild, willful, but very intelligent child. What the ten-year-old Arya lacks in her sister’s refinement, she makes up for with skill in swordfighting and riding. Arya rejects the idea of a woman’s role being to marry and have babies.
+Cersei Lannister
+Queen of the realm and wife of Robert Baratheon. She despises Robert (as well as most other people it seems), and she is cunning and extremely ambitious.
+Ser Jaime (The Kingslayer) Lannister
+Brother to Tyrion and Cersei, as well as Cersei’s lover. Jaime is arrogant, short-tempered, and rash, but he’s also a gifted swordsman. He is widely mistrusted and called Kingslayer because he murdered the previous king.
+Petyr (Littlefinger) Baelish
+The Red Keep’s master of coin. He is shrewd, conniving, and selfish, and he keeps informed about everything that goes on in King’s Landing. He holds a grudge against the Starks because he wanted to marry Catelyn when he was younger.
+Varys (The Spider)
+The Red Keep’s master of whispers and a eunuch. His role in the court is to run a network of spies and keep the king informed, and he often uses what he knows to manipulate those around him, including the king.
+Robert Baratheon
+The corpulent king of Westeros. He loves to fight, drink, and sleep with women, and he hates the duties of ruling. He and Ned are long-time friends, and he was engaged to Ned’s sister until she died.
+Ser Jorah Mormont
+An exiled knight who serves unofficially as Daenerys’s chief advisor. Though he was exiled by Ned Stark for selling slaves, he is intelligent, valiant, and a great fighter. He swears allegiance to Viserys as true king of Westeros, but he also feeds information about the Targaryens back to Varys.
+Viserys Targaryen
+Brother of Daenerys and son of the murdered King Aerys Targaryen. Having lived in exile for many years, earning him the nickname of The Beggar King, he wants to return to Westeros and retake the throne. He is arrogant, cruel, easily angered, and foolish.
+Khal Drogo
+A powerful khal (king) among the Dothraki people and the husband of Daenerys Targaryen. Stoic and brave, Drogo is an exceptional warrior who shows his enemies no mercy. He controls a massive nomadic tribe, or khalasar.
+Prince Joffrey (Joff) Baratheon
+The repulsive prince of Westeros. The twelve-year-old Joff is the eldest child of Cersei and Robert, and he is spoiled, impulsive, and cruel when using his power as prince and heir to the throne.
+Sandor (The Hound) Clegane
+Prince Joff’s unofficial bodyguard. Proud that he is not a knight, The Hound appears to have no scruples whatsoever and does what Joffrey orders, however cruel or unjust, without question. His face is scarred on one side by extensive burning inflicted by his brother, Gregor.
+Robb Stark
+The eldest Stark son and thus heir to Ned Stark. Though just fourteen, he is mature beyond his age as well as being brave and dutiful like his father.
+Maester Luwin
+Counselor to Ned, Catelyn, and Robb. Luwin is old and wise, and his advice proves indispensable to the Starks.
+Theon Greyjoy
+The Starks’s ward and Robb’s best friend. Ned Stark took the young Theon, now nineteen, as a ward after putting down a rebellion led by the Greyjoy family, and Theon consequently grew up with the Stark children as something like a brother.
+Ser Rodrik Cassel
+Winterfell’s master-at-arms. He escorts and defends Catelyn on her journey to King’s Landing and to the Eyrie, tugging anxiously or thoughtfully at his whiskers the whole way.
+Tywin Lannister
+The calculating lord of Casterly Rock and the richest man in the realm. A fierce general, Tywin will go to great ends to protect the honor of the Lannister name.
+Bronn
+A sellsword, or mercenary, who saves Tyrion’s life many times over. Bronn is smart and skilled, and he knows a good deal when he sees one. Though he is an unscrupulous mercenary, he develops something of a friendship with Tyrion.
+Lysa Arryn
+The inconstant and irrational ruler of the Eyrie and sister of Catelyn Stark. Her paranoid, obsessive care of her only son, Robert, consumes her after her husband, Jon Arryn, the former Hand of the King, is murdered. Though she grew up with Catelyn, the two are now very different.
+Jeor Mormont (Commander Mormont)
+Lord Commander of the Night’s Watch at Castle Black. Commander Mormont is tough, old, and wise, and his men call him “The Old Bear.”
+Maester Aemon
+The chief man of learning at Castle Black. Despite his blind white eyes, Maester Aemon sees and speaks the truth in cryptic ways. Though few people realize it, Aemon is one of the few surviving members of the Targaryen family, but he has always put his vows to the Night’s Watch ahead of any family loyalties.
+Samwell (Sam) Tarly
+A new recruit to the Night’s Watch who is fat and cowardly but very smart. Sam loves to read and eat but hates to fight, and he quickly becomes one of Jon Snow’s closest companions at the Wall.
+Ser Allister Thorne
+Castle Black’s resentful master-at-arms. He is hard on the new recruits to the Night’s Watch and seems to enjoy making them suffer, causing Jon to rebel against him. During Robert’s rebellion against the former king, he was a Targaryen loyalist.
+Illyrio Mopatis
+An obese merchant from the Free Cities who helps Daenerys and Viserys Targaryen. Illyrio is very rich and very well-informed. He is quick to please, especially when there is a possibility that his kindness will help him avoid trouble or gain greater fortune in the future.
+Ser Barristan Selmy
+Lord Commander of the Kingsguard. He has served kings Jaehaerys, Aerys II, and Robert. Though he has grown old, Barristan “The Bold” is a formidable fighter. He is, and has always been, an honorable knight.
+Renly Baratheon
+The youngest of the three Baratheon brothers. Renly is lighthearted, opportunistic, and unexpectedly ambitious. He serves on Robert’s royal council.
+Stannis Baratheon
+The middle brother of the three Baratheons. Stannis does not appear in A Game of Thrones, but as the brother of the king, he is a potential heir to the throne. Stannis does not seem to be well-liked.
+Ser Ilyn Payne
+The King’s Justice, meaning executioner. He has a frightful appearance, and he cannot speak since Aerys had his tongue ripped out with hot pincers. Though he is the king’s executioner, his family is loyal to House Lannister.
+Ser Gregor Clegane
+The Hound’s older brother and a knight of the court. Called The Mountain that Rides, Ser Gregor is even larger and crueler than the Hound himself. He is also a sore loser and a marginal commander in battle.
+Osha
+A wildling woman who becomes a ward of the Starks after trying to kidnap Bran. She is tough and strong, and she takes care of Bran after her capture, telling him stories about life in the wild and warning him about what is happening north of the Wall.
+Rickon Stark
+The youngest of the Stark children. Three-year-old Rickon is wild and undisciplined, as is his pet direwolf.
+Aerys II Targaryen
+King of Westeros before Robert Baratheon. He was known as The Mad King because of his cruelty. Aerys murdered Ned’s older brother, Brandon Stark, in the Red Keep’s throne room. At the end of the war that followed, Jaime Lannister slew Aerys in the same room.
+Rhaegar Targaryen
+The heir to Aerys and older brother of Daenerys and Viserys. Rhaegar kidnapped Lyanna Stark, Robert’s betrothed, helping to set in motion the events that led to Robert’s Rebellion. The war effectively ended when Robert slew Rhaegar with his warhammer on the Trident River.
+Jon Arryn
+The recently deceased Lord of the Eyrie and Hand of the King. Jon Arryn fostered Ned Stark and Robert Baratheon at the Eyrie. When Robert became king, Jon Arryn served as his Hand until his murder.

BIN
py/core/examples/data/graphrag.pdf


BIN
py/core/examples/data/lyft_2021.pdf


These changes were discarded because the diff is too large
+ 19 - 0
py/core/examples/data/pg_essay_1.html


These changes were discarded because the diff is too large
+ 19 - 0
py/core/examples/data/pg_essay_2.html


These changes were discarded because the diff is too large
+ 19 - 0
py/core/examples/data/pg_essay_3.html


These changes were discarded because the diff is too large
+ 19 - 0
py/core/examples/data/pg_essay_4.html


These changes were discarded because the diff is too large
+ 19 - 0
py/core/examples/data/pg_essay_5.html


BIN
py/core/examples/data/sample.mp3


BIN
py/core/examples/data/sample2.mp3


BIN
py/core/examples/data/screen_shot.png


+ 1 - 0
py/core/examples/data/test.txt

@@ -0,0 +1 @@
+this is a test text

BIN
py/core/examples/data/uber_2021.pdf


+ 999 - 0
py/core/examples/data/yc_companies.txt

@@ -0,0 +1,999 @@
+https://www.ycombinator.com/companies/airbnb
+https://www.ycombinator.com/companies/dawn
+https://www.ycombinator.com/companies/vendah
+https://www.ycombinator.com/companies/rippling
+https://www.ycombinator.com/companies/unriddle
+https://www.ycombinator.com/companies/talc
+https://www.ycombinator.com/companies/sola
+https://www.ycombinator.com/companies/manaflow
+https://www.ycombinator.com/companies/dragoneye
+https://www.ycombinator.com/companies/deepnight
+https://www.ycombinator.com/companies/shiboleth
+https://www.ycombinator.com/companies/axflow
+https://www.ycombinator.com/companies/quill-ai
+https://www.ycombinator.com/companies/wallbit
+https://www.ycombinator.com/companies/infinity
+https://www.ycombinator.com/companies/airfront
+https://www.ycombinator.com/companies/upstream
+https://www.ycombinator.com/companies/piramidal
+https://www.ycombinator.com/companies/plivo
+https://www.ycombinator.com/companies/codeparrot-ai
+https://www.ycombinator.com/companies/fivetran
+https://www.ycombinator.com/companies/garage-2
+https://www.ycombinator.com/companies/narrative
+https://www.ycombinator.com/companies/y-combinator
+https://www.ycombinator.com/companies/ego
+https://www.ycombinator.com/companies/fazeshift
+https://www.ycombinator.com/companies/driver-ai
+https://www.ycombinator.com/companies/envelope
+https://www.ycombinator.com/companies/double-2
+https://www.ycombinator.com/companies/invopop
+https://www.ycombinator.com/companies/decipher-ai
+https://www.ycombinator.com/companies/meru
+https://www.ycombinator.com/companies/prosights
+https://www.ycombinator.com/companies/gemnote
+https://www.ycombinator.com/companies/flexport
+https://www.ycombinator.com/companies/quartzy
+https://www.ycombinator.com/companies/agentsforce
+https://www.ycombinator.com/companies/pandasai
+https://www.ycombinator.com/companies/sciphi
+https://www.ycombinator.com/companies/honeylove
+https://www.ycombinator.com/companies/circuithub
+https://www.ycombinator.com/companies/gauge
+https://www.ycombinator.com/companies/lifestylerx
+https://www.ycombinator.com/companies/choppy
+https://www.ycombinator.com/companies/relari
+https://www.ycombinator.com/companies/campfire-2
+https://www.ycombinator.com/companies/inbuild
+https://www.ycombinator.com/companies/readme
+https://www.ycombinator.com/companies/osium-ai
+https://www.ycombinator.com/companies/shekel-mobility
+https://www.ycombinator.com/companies/ubicloud
+https://www.ycombinator.com/companies/shipbob
+https://www.ycombinator.com/companies/coperniq
+https://www.ycombinator.com/companies/empower
+https://www.ycombinator.com/companies/focal
+https://www.ycombinator.com/companies/monzo-bank
+https://www.ycombinator.com/companies/lightski
+https://www.ycombinator.com/companies/spark
+https://www.ycombinator.com/companies/swift-2
+https://www.ycombinator.com/companies/makrwatch
+https://www.ycombinator.com/companies/stellar-sleep
+https://www.ycombinator.com/companies/proprise
+https://www.ycombinator.com/companies/lawdingo
+https://www.ycombinator.com/companies/dagworks-inc
+https://www.ycombinator.com/companies/ezdubs
+https://www.ycombinator.com/companies/cakework
+https://www.ycombinator.com/companies/snapdocs
+https://www.ycombinator.com/companies/flint-2
+https://www.ycombinator.com/companies/health-harbor
+https://www.ycombinator.com/companies/optimizely
+https://www.ycombinator.com/companies/basalt-tech
+https://www.ycombinator.com/companies/fynt-ai
+https://www.ycombinator.com/companies/commodityai
+https://www.ycombinator.com/companies/intrinsic
+https://www.ycombinator.com/companies/icepanel
+https://www.ycombinator.com/companies/scale-ai
+https://www.ycombinator.com/companies/olio-labs
+https://www.ycombinator.com/companies/clad
+https://www.ycombinator.com/companies/martin
+https://www.ycombinator.com/companies/rivet
+https://www.ycombinator.com/companies/ruuf
+https://www.ycombinator.com/companies/slicker
+https://www.ycombinator.com/companies/retailready
+https://www.ycombinator.com/companies/tableflow
+https://www.ycombinator.com/companies/human-interest
+https://www.ycombinator.com/companies/continue
+https://www.ycombinator.com/companies/metal-2
+https://www.ycombinator.com/companies/mth-sense
+https://www.ycombinator.com/companies/raz
+https://www.ycombinator.com/companies/magic-hour
+https://www.ycombinator.com/companies/amplitude
+https://www.ycombinator.com/companies/circuitlab
+https://www.ycombinator.com/companies/shepherd-2
+https://www.ycombinator.com/companies/bitesight
+https://www.ycombinator.com/companies/kontractify
+https://www.ycombinator.com/companies/suretynow
+https://www.ycombinator.com/companies/numo
+https://www.ycombinator.com/companies/hegel-ai
+https://www.ycombinator.com/companies/magnaplay
+https://www.ycombinator.com/companies/drip-capital
+https://www.ycombinator.com/companies/presto
+https://www.ycombinator.com/companies/meadow
+https://www.ycombinator.com/companies/protocol-labs
+https://www.ycombinator.com/companies/clarum
+https://www.ycombinator.com/companies/wild-moose
+https://www.ycombinator.com/companies/atomwise
+https://www.ycombinator.com/companies/greenboard
+https://www.ycombinator.com/companies/dailype
+https://www.ycombinator.com/companies/berriai
+https://www.ycombinator.com/companies/partnerstack
+https://www.ycombinator.com/companies/mux
+https://www.ycombinator.com/companies/foundation-2
+https://www.ycombinator.com/companies/fortuna-health
+https://www.ycombinator.com/companies/magicbus
+https://www.ycombinator.com/companies/interana
+https://www.ycombinator.com/companies/attunement
+https://www.ycombinator.com/companies/soundboks
+https://www.ycombinator.com/companies/lifelike
+https://www.ycombinator.com/companies/kopia
+https://www.ycombinator.com/companies/fiber
+https://www.ycombinator.com/companies/xendit
+https://www.ycombinator.com/companies/rubber-ducky-labs
+https://www.ycombinator.com/companies/somn
+https://www.ycombinator.com/companies/centralize
+https://www.ycombinator.com/companies/ginkgo-bioworks
+https://www.ycombinator.com/companies/flip
+https://www.ycombinator.com/companies/lytix
+https://www.ycombinator.com/companies/aedilic
+https://www.ycombinator.com/companies/eligible
+https://www.ycombinator.com/companies/greentoe
+https://www.ycombinator.com/companies/type
+https://www.ycombinator.com/companies/teleport
+https://www.ycombinator.com/companies/radar
+https://www.ycombinator.com/companies/chaldal
+https://www.ycombinator.com/companies/bright
+https://www.ycombinator.com/companies/chow-central-inc
+https://www.ycombinator.com/companies/terrakotta
+https://www.ycombinator.com/companies/langdock
+https://www.ycombinator.com/companies/bankjoy
+https://www.ycombinator.com/companies/fabius
+https://www.ycombinator.com/companies/inquery-2
+https://www.ycombinator.com/companies/mercoa
+https://www.ycombinator.com/companies/asklio
+https://www.ycombinator.com/companies/conduit
+https://www.ycombinator.com/companies/her
+https://www.ycombinator.com/companies/structured
+https://www.ycombinator.com/companies/anneal
+https://www.ycombinator.com/companies/panora
+https://www.ycombinator.com/companies/tegon
+https://www.ycombinator.com/companies/metoro
+https://www.ycombinator.com/companies/vitalize-care
+https://www.ycombinator.com/companies/finex
+https://www.ycombinator.com/companies/scritch
+https://www.ycombinator.com/companies/roe-ai
+https://www.ycombinator.com/companies/inkeep
+https://www.ycombinator.com/companies/taylor-ai
+https://www.ycombinator.com/companies/scope-ar
+https://www.ycombinator.com/companies/empirical-health
+https://www.ycombinator.com/companies/lattice
+https://www.ycombinator.com/companies/docsum
+https://www.ycombinator.com/companies/zidisha
+https://www.ycombinator.com/companies/mtailor
+https://www.ycombinator.com/companies/inlet-2
+https://www.ycombinator.com/companies/inri
+https://www.ycombinator.com/companies/cardinal-gray
+https://www.ycombinator.com/companies/parea
+https://www.ycombinator.com/companies/asseta
+https://www.ycombinator.com/companies/nowadays
+https://www.ycombinator.com/companies/watto-ai
+https://www.ycombinator.com/companies/quivr
+https://www.ycombinator.com/companies/tremor
+https://www.ycombinator.com/companies/artos
+https://www.ycombinator.com/companies/patchwork
+https://www.ycombinator.com/companies/maven-bio
+https://www.ycombinator.com/companies/theorem
+https://www.ycombinator.com/companies/ninite
+https://www.ycombinator.com/companies/kiosk
+https://www.ycombinator.com/companies/marblism
+https://www.ycombinator.com/companies/proglix
+https://www.ycombinator.com/companies/snapmagic
+https://www.ycombinator.com/companies/echo
+https://www.ycombinator.com/companies/fume
+https://www.ycombinator.com/companies/redcarpetup
+https://www.ycombinator.com/companies/shasta-health
+https://www.ycombinator.com/companies/glass-health
+https://www.ycombinator.com/companies/baserun
+https://www.ycombinator.com/companies/ten
+https://www.ycombinator.com/companies/emailio
+https://www.ycombinator.com/companies/giga-ml
+https://www.ycombinator.com/companies/bilanc
+https://www.ycombinator.com/companies/koywe
+https://www.ycombinator.com/companies/tusk
+https://www.ycombinator.com/companies/trendup
+https://www.ycombinator.com/companies/mixpanel
+https://www.ycombinator.com/companies/contour
+https://www.ycombinator.com/companies/sweetspot
+https://www.ycombinator.com/companies/plutis
+https://www.ycombinator.com/companies/submittable
+https://www.ycombinator.com/companies/meticulate
+https://www.ycombinator.com/companies/kivo-health
+https://www.ycombinator.com/companies/wordware
+https://www.ycombinator.com/companies/ocular-ai
+https://www.ycombinator.com/companies/invitris
+https://www.ycombinator.com/companies/apollo
+https://www.ycombinator.com/companies/diligent
+https://www.ycombinator.com/companies/doordash
+https://www.ycombinator.com/companies/delve
+https://www.ycombinator.com/companies/betterbasket
+https://www.ycombinator.com/companies/sohar-health
+https://www.ycombinator.com/companies/byterat
+https://www.ycombinator.com/companies/elyos-energy
+https://www.ycombinator.com/companies/cedalio
+https://www.ycombinator.com/companies/diffuse-bio
+https://www.ycombinator.com/companies/maia
+https://www.ycombinator.com/companies/circleback
+https://www.ycombinator.com/companies/abel
+https://www.ycombinator.com/companies/flightfox
+https://www.ycombinator.com/companies/sonauto
+https://www.ycombinator.com/companies/safetykit
+https://www.ycombinator.com/companies/instawork
+https://www.ycombinator.com/companies/scentbird
+https://www.ycombinator.com/companies/cartage
+https://www.ycombinator.com/companies/newfront-insurance
+https://www.ycombinator.com/companies/hippo-scribe
+https://www.ycombinator.com/companies/ssoready
+https://www.ycombinator.com/companies/dgi-apparel
+https://www.ycombinator.com/companies/corefin
+https://www.ycombinator.com/companies/shred-video
+https://www.ycombinator.com/companies/obento-health
+https://www.ycombinator.com/companies/datacurve
+https://www.ycombinator.com/companies/ruby-card
+https://www.ycombinator.com/companies/schemeflow
+https://www.ycombinator.com/companies/zentail
+https://www.ycombinator.com/companies/truemetrics
+https://www.ycombinator.com/companies/granza-bio
+https://www.ycombinator.com/companies/cloudchipr
+https://www.ycombinator.com/companies/promptarmor
+https://www.ycombinator.com/companies/the-human-utility
+https://www.ycombinator.com/companies/dianahr
+https://www.ycombinator.com/companies/healia
+https://www.ycombinator.com/companies/whatnot
+https://www.ycombinator.com/companies/tokenowl
+https://www.ycombinator.com/companies/crowdvolt
+https://www.ycombinator.com/companies/pivot-robots
+https://www.ycombinator.com/companies/kite
+https://www.ycombinator.com/companies/9gag
+https://www.ycombinator.com/companies/remy
+https://www.ycombinator.com/companies/sanvivo
+https://www.ycombinator.com/companies/reform
+https://www.ycombinator.com/companies/senso
+https://www.ycombinator.com/companies/suger
+https://www.ycombinator.com/companies/weave
+https://www.ycombinator.com/companies/podium
+https://www.ycombinator.com/companies/tile
+https://www.ycombinator.com/companies/prodtrace
+https://www.ycombinator.com/companies/outerbase
+https://www.ycombinator.com/companies/escape
+https://www.ycombinator.com/companies/wave
+https://www.ycombinator.com/companies/arctic-capture
+https://www.ycombinator.com/companies/blacksmith
+https://www.ycombinator.com/companies/octolane-ai
+https://www.ycombinator.com/companies/gitlab
+https://www.ycombinator.com/companies/trieve
+https://www.ycombinator.com/companies/sid
+https://www.ycombinator.com/companies/alai
+https://www.ycombinator.com/companies/anarchy-labs
+https://www.ycombinator.com/companies/go1
+https://www.ycombinator.com/companies/flaviar
+https://www.ycombinator.com/companies/faire
+https://www.ycombinator.com/companies/briefer
+https://www.ycombinator.com/companies/kino-ai
+https://www.ycombinator.com/companies/ally
+https://www.ycombinator.com/companies/transcriptic
+https://www.ycombinator.com/companies/justpaid-io
+https://www.ycombinator.com/companies/lollipuff
+https://www.ycombinator.com/companies/intercept
+https://www.ycombinator.com/companies/pylon-2
+https://www.ycombinator.com/companies/font-awesome
+https://www.ycombinator.com/companies/pointwise
+https://www.ycombinator.com/companies/meesho
+https://www.ycombinator.com/companies/ryse
+https://www.ycombinator.com/companies/hazel-2
+https://www.ycombinator.com/companies/ellipsis
+https://www.ycombinator.com/companies/feather-3
+https://www.ycombinator.com/companies/upsolve-ai
+https://www.ycombinator.com/companies/spire-health
+https://www.ycombinator.com/companies/sudocode
+https://www.ycombinator.com/companies/constant
+https://www.ycombinator.com/companies/ariglad
+https://www.ycombinator.com/companies/kips-health
+https://www.ycombinator.com/companies/respaid
+https://www.ycombinator.com/companies/berry
+https://www.ycombinator.com/companies/democracy-earth
+https://www.ycombinator.com/companies/celest
+https://www.ycombinator.com/companies/dalmatian
+https://www.ycombinator.com/companies/mezmo
+https://www.ycombinator.com/companies/picnichealth
+https://www.ycombinator.com/companies/twine
+https://www.ycombinator.com/companies/cambioml
+https://www.ycombinator.com/companies/littio
+https://www.ycombinator.com/companies/orchid
+https://www.ycombinator.com/companies/onward
+https://www.ycombinator.com/companies/mem0
+https://www.ycombinator.com/companies/dealwise
+https://www.ycombinator.com/companies/pierre
+https://www.ycombinator.com/companies/zenflow
+https://www.ycombinator.com/companies/offdeal
+https://www.ycombinator.com/companies/oddsview
+https://www.ycombinator.com/companies/numeral
+https://www.ycombinator.com/companies/zinc
+https://www.ycombinator.com/companies/corgea
+https://www.ycombinator.com/companies/trayd
+https://www.ycombinator.com/companies/fiddlecube
+https://www.ycombinator.com/companies/moxion-power-co
+https://www.ycombinator.com/companies/innkeeper
+https://www.ycombinator.com/companies/dropbox
+https://www.ycombinator.com/companies/poplarml
+https://www.ycombinator.com/companies/apriora
+https://www.ycombinator.com/companies/fastgen
+https://www.ycombinator.com/companies/retell-ai
+https://www.ycombinator.com/companies/play
+https://www.ycombinator.com/companies/phospho
+https://www.ycombinator.com/companies/parasale
+https://www.ycombinator.com/companies/persana-ai
+https://www.ycombinator.com/companies/automorphic
+https://www.ycombinator.com/companies/thrive-agritech
+https://www.ycombinator.com/companies/zener
+https://www.ycombinator.com/companies/open
+https://www.ycombinator.com/companies/guesty
+https://www.ycombinator.com/companies/tensorfuse
+https://www.ycombinator.com/companies/rigetti-computing
+https://www.ycombinator.com/companies/strikingly
+https://www.ycombinator.com/companies/rainmaker
+https://www.ycombinator.com/companies/coil-inc
+https://www.ycombinator.com/companies/clearspace
+https://www.ycombinator.com/companies/hadrius
+https://www.ycombinator.com/companies/double-coding-copilot
+https://www.ycombinator.com/companies/chequpi
+https://www.ycombinator.com/companies/backerkit
+https://www.ycombinator.com/companies/resonance
+https://www.ycombinator.com/companies/finni-health
+https://www.ycombinator.com/companies/cratejoy
+https://www.ycombinator.com/companies/cleva
+https://www.ycombinator.com/companies/squack
+https://www.ycombinator.com/companies/petcube
+https://www.ycombinator.com/companies/malibou
+https://www.ycombinator.com/companies/stacksync
+https://www.ycombinator.com/companies/yenmo
+https://www.ycombinator.com/companies/crew-2
+https://www.ycombinator.com/companies/infinity-ai
+https://www.ycombinator.com/companies/mio
+https://www.ycombinator.com/companies/tab
+https://www.ycombinator.com/companies/axoni
+https://www.ycombinator.com/companies/padlet
+https://www.ycombinator.com/companies/fluently
+https://www.ycombinator.com/companies/leya
+https://www.ycombinator.com/companies/qventus
+https://www.ycombinator.com/companies/zelos-cloud
+https://www.ycombinator.com/companies/ambition
+https://www.ycombinator.com/companies/maihem
+https://www.ycombinator.com/companies/leaders-in-tech
+https://www.ycombinator.com/companies/edgetrace
+https://www.ycombinator.com/companies/topo
+https://www.ycombinator.com/companies/sage-ai
+https://www.ycombinator.com/companies/pledge-health
+https://www.ycombinator.com/companies/xylem-ai
+https://www.ycombinator.com/companies/shape-shapescale
+https://www.ycombinator.com/companies/x-zell
+https://www.ycombinator.com/companies/mantlebio
+https://www.ycombinator.com/companies/certainly-health
+https://www.ycombinator.com/companies/vista-space
+https://www.ycombinator.com/companies/magicflow
+https://www.ycombinator.com/companies/heroic-labs
+https://www.ycombinator.com/companies/codeant-ai
+https://www.ycombinator.com/companies/benchling
+https://www.ycombinator.com/companies/forfeit
+https://www.ycombinator.com/companies/tetrascience
+https://www.ycombinator.com/companies/newsblur
+https://www.ycombinator.com/companies/webflow
+https://www.ycombinator.com/companies/cheetah
+https://www.ycombinator.com/companies/tandem-2
+https://www.ycombinator.com/companies/haplotype-labs
+https://www.ycombinator.com/companies/wuri
+https://www.ycombinator.com/companies/mbx
+https://www.ycombinator.com/companies/agentic-labs-2
+https://www.ycombinator.com/companies/claimsorted
+https://www.ycombinator.com/companies/reactwise
+https://www.ycombinator.com/companies/preloop
+https://www.ycombinator.com/companies/soundry-ai
+https://www.ycombinator.com/companies/forge
+https://www.ycombinator.com/companies/reducto
+https://www.ycombinator.com/companies/ohmic-biosciences
+https://www.ycombinator.com/companies/automat
+https://www.ycombinator.com/companies/apoxy
+https://www.ycombinator.com/companies/onesignal
+https://www.ycombinator.com/companies/aiflow
+https://www.ycombinator.com/companies/watsi
+https://www.ycombinator.com/companies/movley
+https://www.ycombinator.com/companies/heypurple
+https://www.ycombinator.com/companies/pointhound
+https://www.ycombinator.com/companies/reworkd
+https://www.ycombinator.com/companies/shoobs
+https://www.ycombinator.com/companies/strada
+https://www.ycombinator.com/companies/sweep
+https://www.ycombinator.com/companies/terminal
+https://www.ycombinator.com/companies/sante
+https://www.ycombinator.com/companies/sprx
+https://www.ycombinator.com/companies/sails-co
+https://www.ycombinator.com/companies/dyspatch
+https://www.ycombinator.com/companies/orbio-earth
+https://www.ycombinator.com/companies/epsilon
+https://www.ycombinator.com/companies/new-story
+https://www.ycombinator.com/companies/hatchet-2
+https://www.ycombinator.com/companies/epsilla
+https://www.ycombinator.com/companies/resend
+https://www.ycombinator.com/companies/teamnote
+https://www.ycombinator.com/companies/thread-2
+https://www.ycombinator.com/companies/zeplin
+https://www.ycombinator.com/companies/simbie-health
+https://www.ycombinator.com/companies/pincites
+https://www.ycombinator.com/companies/k-scale-labs
+https://www.ycombinator.com/companies/arroyo
+https://www.ycombinator.com/companies/goldenbasis
+https://www.ycombinator.com/companies/dill
+https://www.ycombinator.com/companies/gocardless
+https://www.ycombinator.com/companies/smartasset
+https://www.ycombinator.com/companies/taiki
+https://www.ycombinator.com/companies/toma
+https://www.ycombinator.com/companies/inari
+https://www.ycombinator.com/companies/candoriq
+https://www.ycombinator.com/companies/holacasa
+https://www.ycombinator.com/companies/hyperpad
+https://www.ycombinator.com/companies/hona
+https://www.ycombinator.com/companies/velorum-therapeutics
+https://www.ycombinator.com/companies/launchflow
+https://www.ycombinator.com/companies/guide-labs
+https://www.ycombinator.com/companies/stealth-worker
+https://www.ycombinator.com/companies/embark-trucks
+https://www.ycombinator.com/companies/omnistrate
+https://www.ycombinator.com/companies/navier-ai
+https://www.ycombinator.com/companies/confident-lims
+https://www.ycombinator.com/companies/craftwork
+https://www.ycombinator.com/companies/oway
+https://www.ycombinator.com/companies/pocketpod
+https://www.ycombinator.com/companies/triply
+https://www.ycombinator.com/companies/trueclaim
+https://www.ycombinator.com/companies/isono-health
+https://www.ycombinator.com/companies/basepilot
+https://www.ycombinator.com/companies/screenleap-inc
+https://www.ycombinator.com/companies/gbatteries
+https://www.ycombinator.com/companies/constructable
+https://www.ycombinator.com/companies/highlight-io
+https://www.ycombinator.com/companies/baselit
+https://www.ycombinator.com/companies/dili
+https://www.ycombinator.com/companies/yondu
+https://www.ycombinator.com/companies/fragment
+https://www.ycombinator.com/companies/flock-safety
+https://www.ycombinator.com/companies/zapier
+https://www.ycombinator.com/companies/openmeter
+https://www.ycombinator.com/companies/tennr
+https://www.ycombinator.com/companies/aptdeco
+https://www.ycombinator.com/companies/tamarind-bio
+https://www.ycombinator.com/companies/assembly
+https://www.ycombinator.com/companies/codestory
+https://www.ycombinator.com/companies/goat-group
+https://www.ycombinator.com/companies/verge-genomics
+https://www.ycombinator.com/companies/keep
+https://www.ycombinator.com/companies/flair-health
+https://www.ycombinator.com/companies/hylight
+https://www.ycombinator.com/companies/polo
+https://www.ycombinator.com/companies/starlight-charging
+https://www.ycombinator.com/companies/true-link
+https://www.ycombinator.com/companies/poll-everywhere
+https://www.ycombinator.com/companies/0pass
+https://www.ycombinator.com/companies/trainy
+https://www.ycombinator.com/companies/reddit
+https://www.ycombinator.com/companies/wevorce
+https://www.ycombinator.com/companies/labdoor
+https://www.ycombinator.com/companies/estimote-inc
+https://www.ycombinator.com/companies/astro-mechanica
+https://www.ycombinator.com/companies/7cups
+https://www.ycombinator.com/companies/transformity
+https://www.ycombinator.com/companies/pico
+https://www.ycombinator.com/companies/speck
+https://www.ycombinator.com/companies/metal
+https://www.ycombinator.com/companies/truewind
+https://www.ycombinator.com/companies/uptrain-ai
+https://www.ycombinator.com/companies/panorama-education
+https://www.ycombinator.com/companies/serra
+https://www.ycombinator.com/companies/1stcollab
+https://www.ycombinator.com/companies/buildscience
+https://www.ycombinator.com/companies/healthtech-1
+https://www.ycombinator.com/companies/getaccept
+https://www.ycombinator.com/companies/streak
+https://www.ycombinator.com/companies/groww
+https://www.ycombinator.com/companies/agilemd
+https://www.ycombinator.com/companies/syntheticfi
+https://www.ycombinator.com/companies/cargo
+https://www.ycombinator.com/companies/common-paper
+https://www.ycombinator.com/companies/cleanly
+https://www.ycombinator.com/companies/oma-care
+https://www.ycombinator.com/companies/goodcourse
+https://www.ycombinator.com/companies/datashare
+https://www.ycombinator.com/companies/menza
+https://www.ycombinator.com/companies/nectar
+https://www.ycombinator.com/companies/etleap
+https://www.ycombinator.com/companies/skygaze
+https://www.ycombinator.com/companies/kabilah
+https://www.ycombinator.com/companies/linc
+https://www.ycombinator.com/companies/vocode
+https://www.ycombinator.com/companies/brex
+https://www.ycombinator.com/companies/devcycle
+https://www.ycombinator.com/companies/hockeystack
+https://www.ycombinator.com/companies/healthsherpa
+https://www.ycombinator.com/companies/heartbyte
+https://www.ycombinator.com/companies/stripe
+https://www.ycombinator.com/companies/athina-ai
+https://www.ycombinator.com/companies/serial
+https://www.ycombinator.com/companies/sunfarmer
+https://www.ycombinator.com/companies/draftaid
+https://www.ycombinator.com/companies/venta
+https://www.ycombinator.com/companies/pair-ai
+https://www.ycombinator.com/companies/dream3d
+https://www.ycombinator.com/companies/bellabeat
+https://www.ycombinator.com/companies/superkalam
+https://www.ycombinator.com/companies/mathgpt-pro
+https://www.ycombinator.com/companies/aglide
+https://www.ycombinator.com/companies/mano-health
+https://www.ycombinator.com/companies/pando-bioscience
+https://www.ycombinator.com/companies/truebill
+https://www.ycombinator.com/companies/converge
+https://www.ycombinator.com/companies/hackerrank
+https://www.ycombinator.com/companies/assembly-2
+https://www.ycombinator.com/companies/deasie
+https://www.ycombinator.com/companies/renderlet
+https://www.ycombinator.com/companies/daily
+https://www.ycombinator.com/companies/recipeui
+https://www.ycombinator.com/companies/eggnog
+https://www.ycombinator.com/companies/dealpage
+https://www.ycombinator.com/companies/odo
+https://www.ycombinator.com/companies/aidy
+https://www.ycombinator.com/companies/circle-medical
+https://www.ycombinator.com/companies/nimblerx
+https://www.ycombinator.com/companies/autotab
+https://www.ycombinator.com/companies/bitmovin
+https://www.ycombinator.com/companies/chatter
+https://www.ycombinator.com/companies/hamming-ai
+https://www.ycombinator.com/companies/khoj
+https://www.ycombinator.com/companies/peerdb
+https://www.ycombinator.com/companies/unbabel
+https://www.ycombinator.com/companies/central
+https://www.ycombinator.com/companies/lantern-2
+https://www.ycombinator.com/companies/picktrace
+https://www.ycombinator.com/companies/bodyport
+https://www.ycombinator.com/companies/finny-ai
+https://www.ycombinator.com/companies/finta
+https://www.ycombinator.com/companies/mathdash
+https://www.ycombinator.com/companies/booth-ai
+https://www.ycombinator.com/companies/elodin
+https://www.ycombinator.com/companies/human-dx
+https://www.ycombinator.com/companies/yuma-ai
+https://www.ycombinator.com/companies/warp
+https://www.ycombinator.com/companies/deepgram
+https://www.ycombinator.com/companies/pushbullet
+https://www.ycombinator.com/companies/powder
+https://www.ycombinator.com/companies/cair-health
+https://www.ycombinator.com/companies/milio
+https://www.ycombinator.com/companies/airhelp
+https://www.ycombinator.com/companies/openfoundry
+https://www.ycombinator.com/companies/cloudcruise
+https://www.ycombinator.com/companies/ion-design
+https://www.ycombinator.com/companies/influxdata
+https://www.ycombinator.com/companies/kobalt-labs
+https://www.ycombinator.com/companies/tovala
+https://www.ycombinator.com/companies/tara-ai
+https://www.ycombinator.com/companies/razorpay
+https://www.ycombinator.com/companies/konstructly
+https://www.ycombinator.com/companies/voicepanel
+https://www.ycombinator.com/companies/onegrep
+https://www.ycombinator.com/companies/studdy
+https://www.ycombinator.com/companies/bronco-ai
+https://www.ycombinator.com/companies/kapa-ai
+https://www.ycombinator.com/companies/letter-ai
+https://www.ycombinator.com/companies/coinbase
+https://www.ycombinator.com/companies/skyvern
+https://www.ycombinator.com/companies/atri-labs
+https://www.ycombinator.com/companies/cocrafter
+https://www.ycombinator.com/companies/one-month
+https://www.ycombinator.com/companies/shortloop
+https://www.ycombinator.com/companies/danswer
+https://www.ycombinator.com/companies/nowhouse
+https://www.ycombinator.com/companies/maitai
+https://www.ycombinator.com/companies/glasskube
+https://www.ycombinator.com/companies/outschool
+https://www.ycombinator.com/companies/wattson-health
+https://www.ycombinator.com/companies/ebrandvalue
+https://www.ycombinator.com/companies/cambly
+https://www.ycombinator.com/companies/gusto
+https://www.ycombinator.com/companies/frigade
+https://www.ycombinator.com/companies/happenstance
+https://www.ycombinator.com/companies/pythagora-gpt-pilot
+https://www.ycombinator.com/companies/adagy-robotics
+https://www.ycombinator.com/companies/vendora
+https://www.ycombinator.com/companies/vector
+https://www.ycombinator.com/companies/reprompt
+https://www.ycombinator.com/companies/branch8
+https://www.ycombinator.com/companies/oklo
+https://www.ycombinator.com/companies/inspectmind-ai
+https://www.ycombinator.com/companies/hiro-systems
+https://www.ycombinator.com/companies/upwave
+https://www.ycombinator.com/companies/cedana
+https://www.ycombinator.com/companies/noora-health
+https://www.ycombinator.com/companies/aether-energy
+https://www.ycombinator.com/companies/swishjam
+https://www.ycombinator.com/companies/quantierra
+https://www.ycombinator.com/companies/branch-ai
+https://www.ycombinator.com/companies/selera-medical
+https://www.ycombinator.com/companies/pirros
+https://www.ycombinator.com/companies/edgebit
+https://www.ycombinator.com/companies/unbound-security
+https://www.ycombinator.com/companies/42
+https://www.ycombinator.com/companies/lucira-health
+https://www.ycombinator.com/companies/helion-energy
+https://www.ycombinator.com/companies/bluebirds
+https://www.ycombinator.com/companies/scanbase
+https://www.ycombinator.com/companies/egress-health
+https://www.ycombinator.com/companies/saatvy
+https://www.ycombinator.com/companies/magic-loops
+https://www.ycombinator.com/companies/manifold-freight
+https://www.ycombinator.com/companies/unhaze
+https://www.ycombinator.com/companies/tenjin
+https://www.ycombinator.com/companies/greenlite
+https://www.ycombinator.com/companies/tempo-labs
+https://www.ycombinator.com/companies/caremessage
+https://www.ycombinator.com/companies/opencall-ai
+https://www.ycombinator.com/companies/openpipe
+https://www.ycombinator.com/companies/ironclad
+https://www.ycombinator.com/companies/equipmentshare
+https://www.ycombinator.com/companies/algolia
+https://www.ycombinator.com/companies/akido-labs
+https://www.ycombinator.com/companies/simplyinsured
+https://www.ycombinator.com/companies/glade
+https://www.ycombinator.com/companies/yarn-2
+https://www.ycombinator.com/companies/deel
+https://www.ycombinator.com/companies/magic
+https://www.ycombinator.com/companies/revamp
+https://www.ycombinator.com/companies/electric-air-previously-helios-climate
+https://www.ycombinator.com/companies/priime
+https://www.ycombinator.com/companies/turntable
+https://www.ycombinator.com/companies/centauri-ai
+https://www.ycombinator.com/companies/eight-sleep
+https://www.ycombinator.com/companies/metricwire
+https://www.ycombinator.com/companies/222
+https://www.ycombinator.com/companies/atla
+https://www.ycombinator.com/companies/fileforge
+https://www.ycombinator.com/companies/floworks
+https://www.ycombinator.com/companies/momentic
+https://www.ycombinator.com/companies/accend
+https://www.ycombinator.com/companies/science-exchange
+https://www.ycombinator.com/companies/synsorybio
+https://www.ycombinator.com/companies/speccheck
+https://www.ycombinator.com/companies/technician
+https://www.ycombinator.com/companies/level-frames
+https://www.ycombinator.com/companies/pier
+https://www.ycombinator.com/companies/80-000-hours
+https://www.ycombinator.com/companies/noya-software
+https://www.ycombinator.com/companies/mason
+https://www.ycombinator.com/companies/propexo
+https://www.ycombinator.com/companies/bluedot
+https://www.ycombinator.com/companies/fountain
+https://www.ycombinator.com/companies/humanlike
+https://www.ycombinator.com/companies/versive
+https://www.ycombinator.com/companies/zenfetch
+https://www.ycombinator.com/companies/microhealth
+https://www.ycombinator.com/companies/alchemy
+https://www.ycombinator.com/companies/camelqa
+https://www.ycombinator.com/companies/zepto
+https://www.ycombinator.com/companies/grubmarket
+https://www.ycombinator.com/companies/spotangels
+https://www.ycombinator.com/companies/clipboard-health
+https://www.ycombinator.com/companies/brainbase
+https://www.ycombinator.com/companies/apten
+https://www.ycombinator.com/companies/metalware
+https://www.ycombinator.com/companies/experiment
+https://www.ycombinator.com/companies/surface-labs
+https://www.ycombinator.com/companies/virtualmin
+https://www.ycombinator.com/companies/synch
+https://www.ycombinator.com/companies/metofico
+https://www.ycombinator.com/companies/drymerge
+https://www.ycombinator.com/companies/front
+https://www.ycombinator.com/companies/givemetap
+https://www.ycombinator.com/companies/industrial-microbes
+https://www.ycombinator.com/companies/neptyne
+https://www.ycombinator.com/companies/atopile
+https://www.ycombinator.com/companies/fintool
+https://www.ycombinator.com/companies/roundtable
+https://www.ycombinator.com/companies/trigo
+https://www.ycombinator.com/companies/micsi
+https://www.ycombinator.com/companies/theya
+https://www.ycombinator.com/companies/bujeti
+https://www.ycombinator.com/companies/forge-rewards
+https://www.ycombinator.com/companies/medisearch
+https://www.ycombinator.com/companies/billforward
+https://www.ycombinator.com/companies/keywords-ai
+https://www.ycombinator.com/companies/loula
+https://www.ycombinator.com/companies/craftos
+https://www.ycombinator.com/companies/ply-health
+https://www.ycombinator.com/companies/giveffect
+https://www.ycombinator.com/companies/catx
+https://www.ycombinator.com/companies/refine
+https://www.ycombinator.com/companies/buster
+https://www.ycombinator.com/companies/every
+https://www.ycombinator.com/companies/superagent
+https://www.ycombinator.com/companies/svbtle
+https://www.ycombinator.com/companies/eden-care
+https://www.ycombinator.com/companies/mantys
+https://www.ycombinator.com/companies/sizeless
+https://www.ycombinator.com/companies/opencurriculum
+https://www.ycombinator.com/companies/wefunder
+https://www.ycombinator.com/companies/shortbread
+https://www.ycombinator.com/companies/iliad
+https://www.ycombinator.com/companies/leaping
+https://www.ycombinator.com/companies/gumloop
+https://www.ycombinator.com/companies/radmate-ai
+https://www.ycombinator.com/companies/scribd
+https://www.ycombinator.com/companies/glimmer
+https://www.ycombinator.com/companies/nuanced-inc
+https://www.ycombinator.com/companies/gradientj
+https://www.ycombinator.com/companies/silimate
+https://www.ycombinator.com/companies/titan-2
+https://www.ycombinator.com/companies/quack-ai
+https://www.ycombinator.com/companies/the-ticket-fairy
+https://www.ycombinator.com/companies/permutive
+https://www.ycombinator.com/companies/million
+https://www.ycombinator.com/companies/saphira-ai
+https://www.ycombinator.com/companies/truevault
+https://www.ycombinator.com/companies/happyrobot
+https://www.ycombinator.com/companies/trellis
+https://www.ycombinator.com/companies/yardbook
+https://www.ycombinator.com/companies/per-vices
+https://www.ycombinator.com/companies/risotto
+https://www.ycombinator.com/companies/untether-labs
+https://www.ycombinator.com/companies/helicone
+https://www.ycombinator.com/companies/subsets
+https://www.ycombinator.com/companies/flexwash
+https://www.ycombinator.com/companies/precip
+https://www.ycombinator.com/companies/tower
+https://www.ycombinator.com/companies/anaphero
+https://www.ycombinator.com/companies/one-degree
+https://www.ycombinator.com/companies/usergems
+https://www.ycombinator.com/companies/glide-2
+https://www.ycombinator.com/companies/coba
+https://www.ycombinator.com/companies/clueso
+https://www.ycombinator.com/companies/hostai
+https://www.ycombinator.com/companies/fancave
+https://www.ycombinator.com/companies/teclada
+https://www.ycombinator.com/companies/gluetrail
+https://www.ycombinator.com/companies/elythea
+https://www.ycombinator.com/companies/buxfer
+https://www.ycombinator.com/companies/rex
+https://www.ycombinator.com/companies/sirum
+https://www.ycombinator.com/companies/openmart
+https://www.ycombinator.com/companies/gleam
+https://www.ycombinator.com/companies/matterport
+https://www.ycombinator.com/companies/momentus
+https://www.ycombinator.com/companies/buildzoom
+https://www.ycombinator.com/companies/hive
+https://www.ycombinator.com/companies/artie
+https://www.ycombinator.com/companies/shadeform
+https://www.ycombinator.com/companies/tesorio
+https://www.ycombinator.com/companies/answergrid
+https://www.ycombinator.com/companies/dioxus-labs
+https://www.ycombinator.com/companies/infinia
+https://www.ycombinator.com/companies/crux
+https://www.ycombinator.com/companies/parabolic
+https://www.ycombinator.com/companies/casehopper
+https://www.ycombinator.com/companies/rove
+https://www.ycombinator.com/companies/lucite
+https://www.ycombinator.com/companies/cofactor-genomics
+https://www.ycombinator.com/companies/givefront
+https://www.ycombinator.com/companies/octavewealth
+https://www.ycombinator.com/companies/just-words
+https://www.ycombinator.com/companies/aptible
+https://www.ycombinator.com/companies/peeba
+https://www.ycombinator.com/companies/haven-2
+https://www.ycombinator.com/companies/click-and-grow
+https://www.ycombinator.com/companies/mashgin
+https://www.ycombinator.com/companies/aqua-voice
+https://www.ycombinator.com/companies/xpay
+https://www.ycombinator.com/companies/sync-labs
+https://www.ycombinator.com/companies/extend
+https://www.ycombinator.com/companies/nowports
+https://www.ycombinator.com/companies/moonrepo
+https://www.ycombinator.com/companies/instaclass
+https://www.ycombinator.com/companies/model-ml
+https://www.ycombinator.com/companies/chatfuel
+https://www.ycombinator.com/companies/sonia
+https://www.ycombinator.com/companies/cleartax
+https://www.ycombinator.com/companies/pointone
+https://www.ycombinator.com/companies/duckie
+https://www.ycombinator.com/companies/luca
+https://www.ycombinator.com/companies/storyboarder
+https://www.ycombinator.com/companies/modulari-t
+https://www.ycombinator.com/companies/silogy
+https://www.ycombinator.com/companies/clerky
+https://www.ycombinator.com/companies/greptile
+https://www.ycombinator.com/companies/tiptap
+https://www.ycombinator.com/companies/firebender
+https://www.ycombinator.com/companies/muffin-data
+https://www.ycombinator.com/companies/repaint
+https://www.ycombinator.com/companies/browser-buddy
+https://www.ycombinator.com/companies/sfox
+https://www.ycombinator.com/companies/nextui
+https://www.ycombinator.com/companies/ncompass-technologies
+https://www.ycombinator.com/companies/salvy
+https://www.ycombinator.com/companies/pretzel-ai
+https://www.ycombinator.com/companies/piinpoint
+https://www.ycombinator.com/companies/pardes-bio
+https://www.ycombinator.com/companies/fleetworks
+https://www.ycombinator.com/companies/smobi
+https://www.ycombinator.com/companies/paradedb
+https://www.ycombinator.com/companies/corgi-labs
+https://www.ycombinator.com/companies/parcelbio
+https://www.ycombinator.com/companies/edge
+https://www.ycombinator.com/companies/carma
+https://www.ycombinator.com/companies/partnerhq
+https://www.ycombinator.com/companies/honeydew
+https://www.ycombinator.com/companies/creatorml
+https://www.ycombinator.com/companies/alguna
+https://www.ycombinator.com/companies/aminoanalytica
+https://www.ycombinator.com/companies/reach-labs
+https://www.ycombinator.com/companies/lumina-2
+https://www.ycombinator.com/companies/flower
+https://www.ycombinator.com/companies/vooma
+https://www.ycombinator.com/companies/capi-money
+https://www.ycombinator.com/companies/nanograb
+https://www.ycombinator.com/companies/can-of-soup
+https://www.ycombinator.com/companies/xeol
+https://www.ycombinator.com/companies/aisdr
+https://www.ycombinator.com/companies/opsberry-ai
+https://www.ycombinator.com/companies/mattermost
+https://www.ycombinator.com/companies/pure
+https://www.ycombinator.com/companies/radical
+https://www.ycombinator.com/companies/codecombat
+https://www.ycombinator.com/companies/nunu-ai
+https://www.ycombinator.com/companies/index-1
+https://www.ycombinator.com/companies/resolve
+https://www.ycombinator.com/companies/flex
+https://www.ycombinator.com/companies/buildjet
+https://www.ycombinator.com/companies/markprompt
+https://www.ycombinator.com/companies/inventive-ai
+https://www.ycombinator.com/companies/vectorshift
+https://www.ycombinator.com/companies/roame
+https://www.ycombinator.com/companies/intelliga-voice
+https://www.ycombinator.com/companies/ragas
+https://www.ycombinator.com/companies/feanix-biotechnologies
+https://www.ycombinator.com/companies/hona-2
+https://www.ycombinator.com/companies/easypost
+https://www.ycombinator.com/companies/vizly
+https://www.ycombinator.com/companies/miden
+https://www.ycombinator.com/companies/fern
+https://www.ycombinator.com/companies/marr-labs
+https://www.ycombinator.com/companies/glaze
+https://www.ycombinator.com/companies/rappi
+https://www.ycombinator.com/companies/omniai
+https://www.ycombinator.com/companies/thorntale
+https://www.ycombinator.com/companies/replika
+https://www.ycombinator.com/companies/vaultpay
+https://www.ycombinator.com/companies/roomstorm
+https://www.ycombinator.com/companies/lob
+https://www.ycombinator.com/companies/blue-frog-gaming
+https://www.ycombinator.com/companies/kyber
+https://www.ycombinator.com/companies/focal-systems
+https://www.ycombinator.com/companies/alacrity
+https://www.ycombinator.com/companies/keeling-labs
+https://www.ycombinator.com/companies/andy-ai
+https://www.ycombinator.com/companies/argon-ai-inc
+https://www.ycombinator.com/companies/spine-ai
+https://www.ycombinator.com/companies/mixerbox
+https://www.ycombinator.com/companies/second
+https://www.ycombinator.com/companies/paradigm
+https://www.ycombinator.com/companies/vastrm
+https://www.ycombinator.com/companies/pagerduty
+https://www.ycombinator.com/companies/linkgrep
+https://www.ycombinator.com/companies/rainforest
+https://www.ycombinator.com/companies/phonely
+https://www.ycombinator.com/companies/intently
+https://www.ycombinator.com/companies/cleverdeck
+https://www.ycombinator.com/companies/outset
+https://www.ycombinator.com/companies/tempo
+https://www.ycombinator.com/companies/ecliptor
+https://www.ycombinator.com/companies/affinity
+https://www.ycombinator.com/companies/yoneda-labs
+https://www.ycombinator.com/companies/markhor
+https://www.ycombinator.com/companies/ofone
+https://www.ycombinator.com/companies/alaan
+https://www.ycombinator.com/companies/odeko
+https://www.ycombinator.com/companies/fundersclub
+https://www.ycombinator.com/companies/reebee
+https://www.ycombinator.com/companies/twenty
+https://www.ycombinator.com/companies/decohere
+https://www.ycombinator.com/companies/ottimate
+https://www.ycombinator.com/companies/povio
+https://www.ycombinator.com/companies/telophase
+https://www.ycombinator.com/companies/codenow
+https://www.ycombinator.com/companies/spaceium-inc
+https://www.ycombinator.com/companies/arcane
+https://www.ycombinator.com/companies/veles
+https://www.ycombinator.com/companies/waza
+https://www.ycombinator.com/companies/hemingway
+https://www.ycombinator.com/companies/artisan
+https://www.ycombinator.com/companies/rescuetime
+https://www.ycombinator.com/companies/trench
+https://www.ycombinator.com/companies/benchmark
+https://www.ycombinator.com/companies/flirtey
+https://www.ycombinator.com/companies/immunity-project
+https://www.ycombinator.com/companies/tracecat
+https://www.ycombinator.com/companies/sevn
+https://www.ycombinator.com/companies/goldbelly
+https://www.ycombinator.com/companies/shoptiques
+https://www.ycombinator.com/companies/arini
+https://www.ycombinator.com/companies/givecampus
+https://www.ycombinator.com/companies/defog-ai
+https://www.ycombinator.com/companies/boundary
+https://www.ycombinator.com/companies/vellum
+https://www.ycombinator.com/companies/instacart
+https://www.ycombinator.com/companies/zaymo
+https://www.ycombinator.com/companies/distro
+https://www.ycombinator.com/companies/cleancard
+https://www.ycombinator.com/companies/solve-intelligence
+https://www.ycombinator.com/companies/pandan
+https://www.ycombinator.com/companies/leafpress
+https://www.ycombinator.com/companies/sorted
+https://www.ycombinator.com/companies/mango-health
+https://www.ycombinator.com/companies/vectorview
+https://www.ycombinator.com/companies/cascading-ai
+https://www.ycombinator.com/companies/quary
+https://www.ycombinator.com/companies/revideo
+https://www.ycombinator.com/companies/chart
+https://www.ycombinator.com/companies/junction-bioscience
+https://www.ycombinator.com/companies/keyval
+https://www.ycombinator.com/companies/backpack
+https://www.ycombinator.com/companies/synaptiq
+https://www.ycombinator.com/companies/governgpt
+https://www.ycombinator.com/companies/vaero
+https://www.ycombinator.com/companies/bayes-impact
+https://www.ycombinator.com/companies/airgoods
+https://www.ycombinator.com/companies/infobot
+https://www.ycombinator.com/companies/sirdab
+https://www.ycombinator.com/companies/zep-ai
+https://www.ycombinator.com/companies/bird
+https://www.ycombinator.com/companies/upfront
+https://www.ycombinator.com/companies/amber-ai
+https://www.ycombinator.com/companies/nango
+https://www.ycombinator.com/companies/lugg
+https://www.ycombinator.com/companies/creo
+https://www.ycombinator.com/companies/carousel-technologies
+https://www.ycombinator.com/companies/guac
+https://www.ycombinator.com/companies/unstatiq
+https://www.ycombinator.com/companies/notable-labs
+https://www.ycombinator.com/companies/agentive
+https://www.ycombinator.com/companies/lumona
+https://www.ycombinator.com/companies/blume-benefits
+https://www.ycombinator.com/companies/quantic
+https://www.ycombinator.com/companies/persist-ai
+https://www.ycombinator.com/companies/homeflow
+https://www.ycombinator.com/companies/andromeda-surgical
+https://www.ycombinator.com/companies/salient
+https://www.ycombinator.com/companies/zeitview
+https://www.ycombinator.com/companies/kater-ai
+https://www.ycombinator.com/companies/flowiseai
+https://www.ycombinator.com/companies/hyperbound
+https://www.ycombinator.com/companies/cercli
+https://www.ycombinator.com/companies/dime-2
+https://www.ycombinator.com/companies/medmonk
+https://www.ycombinator.com/companies/cosine
+https://www.ycombinator.com/companies/double-robotics
+https://www.ycombinator.com/companies/adventris-pharmaceuticals
+https://www.ycombinator.com/companies/sherloq
+https://www.ycombinator.com/companies/checkr
+https://www.ycombinator.com/companies/speedybrand
+https://www.ycombinator.com/companies/stralis-aircraft
+https://www.ycombinator.com/companies/platzi
+https://www.ycombinator.com/companies/fiber-ai
+https://www.ycombinator.com/companies/coldreach
+https://www.ycombinator.com/companies/univerbal
+https://www.ycombinator.com/companies/arcimus
+https://www.ycombinator.com/companies/decoda-health
+https://www.ycombinator.com/companies/zerodev
+https://www.ycombinator.com/companies/texel-ai
+https://www.ycombinator.com/companies/teabot
+https://www.ycombinator.com/companies/stack-4
+https://www.ycombinator.com/companies/superapi
+https://www.ycombinator.com/companies/berilium
+https://www.ycombinator.com/companies/eris-biotech
+https://www.ycombinator.com/companies/shasqi
+https://www.ycombinator.com/companies/vetrec
+https://www.ycombinator.com/companies/langfuse
+https://www.ycombinator.com/companies/entangl

+ 114 - 0
py/core/examples/hello_r2r.ipynb

@@ -0,0 +1,114 @@
+{
+ "cells": [
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import os\n",
+    "\n",
+    "from r2r import R2RClient\n",
+    "\n",
+    "# Create an account at SciPhi Cloud https://app.sciphi.ai and set an R2R_API_KEY environment variable\n",
+    "# or set the base URL to your instance. E.g. R2RClient(\"http://localhost:7272\")\n",
+    "os.environ[\"R2R_API_KEY\"] = \"your-api-key\"\n",
+    "\n",
+    "# Create a client\n",
+    "client = R2RClient()"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 16,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "{'results': {'message': 'Ingest files task queued successfully.', 'task_id': 'd14004c5-09b7-4d15-acd6-6708ad394908', 'document_id': '96090824-0b1b-5459-a9e1-da0c781d5e71'}}\n"
+     ]
+    }
+   ],
+   "source": [
+    "import os\n",
+    "import tempfile\n",
+    "\n",
+    "import requests\n",
+    "\n",
+    "# Download the content from GitHub\n",
+    "url = \"https://raw.githubusercontent.com/SciPhi-AI/R2R/refs/heads/main/py/core/examples/data/aristotle.txt\"\n",
+    "response = requests.get(url)\n",
+    "\n",
+    "# Create a temporary file to store the content\n",
+    "with tempfile.NamedTemporaryFile(\n",
+    "    delete=False, mode=\"w\", suffix=\".txt\"\n",
+    ") as temp_file:\n",
+    "    temp_file.write(response.text)\n",
+    "    temp_path = temp_file.name\n",
+    "\n",
+    "# Ingest the file\n",
+    "ingestion_response = client.documents.create(file_path=temp_path)\n",
+    "print(ingestion_response)\n",
+    "\n",
+    "# Clean up the temporary file\n",
+    "os.unlink(temp_path)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 22,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Performing RAG...\n",
+      "The nature of the soul, according to Aristotle, is multifaceted and can be understood through his three-part structure of the soul, which includes the vegetative soul, the sensitive soul, and the rational soul. Each type of soul has distinct functions:\n",
+      "\n",
+      "1. **Vegetative Soul**: This is concerned with growth and nourishment, and is present in all living beings, including plants [1], [2], [3].\n",
+      "2. **Sensitive Soul**: This experiences sensations and movement, and is present in animals [1], [2], [3].\n",
+      "3. **Rational Soul**: Unique to humans, this soul has the ability to receive forms of other things and to compare them using intellect (nous) and reason (logos) [1], [2], [3].\n",
+      "\n",
+      "For Aristotle, the soul is the form of a living being, which means it is the essence that gives life to the body and enables it to perform its specific functions. The soul is what endows living beings with the ability to initiate movement, growth, and transformations [1], [2], [3]. Aristotle also placed the rational soul in the heart, contrasting with earlier philosophers who located it in the brain [1], [2], [3].\n",
+      "\n",
+      "In contrast, the Hermetic perspective, as seen in the \"Corpus Hermeticum,\" views the soul as an immortal aspect of humanity that undergoes a transformative journey through various states of existence in pursuit of divine knowledge and enlightenment. The soul's journey emphasizes the importance of wisdom and virtue in achieving a higher understanding of existence and connecting with the divine [4], [5], [6], [7], [8], [9].\n",
+      "\n",
+      "Thus, the nature of the soul can be seen as both a vital essence that animates living beings and a divine entity that seeks knowledge and enlightenment through a transformative journey.\n"
+     ]
+    }
+   ],
+   "source": [
+    "print(\"Performing RAG...\")\n",
+    "rag_response = client.retrieval.rag(\n",
+    "    query=\"What is the nature of the soul?\",\n",
+    ")\n",
+    "\n",
+    "print(rag_response[\"results\"][\"completion\"])"
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "r2r-giROgG2W-py3.12",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.12.3"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 2
+}

+ 23 - 0
py/core/examples/hello_r2r.py

@@ -0,0 +1,23 @@
+from r2r import R2RClient
+
+client = R2RClient()
+
+with open("test.txt", "w") as file:
+    file.write("John is a person that works at Google.")
+
+client.ingest_files(file_paths=["test.txt"])
+
+# Call RAG through the R2R client
+rag_response = client.rag(
+    query="Who is john",
+    rag_generation_config={"model": "gpt-4.1-mini", "temperature": 0.0},
+)
+results = rag_response["results"]
+print(f"Search Results:\n{results['search_results']}")
+print(f"Completion:\n{results['completion']}")
+
+# RAG Results:
+# Search Results:
+# AggregateSearchResult(chunk_search_results=[ChunkSearchResult(id=2d71e689-0a0e-5491-a50b-4ecb9494c832, score=0.6848798582029441, metadata={'text': 'John is a person that works at Google.', 'version': 'v0', 'chunk_order': 0, 'document_id': 'ed76b6ee-dd80-5172-9263-919d493b439a', 'id': '1ba494d7-cb2f-5f0e-9f64-76c31da11381', 'associatedQuery': 'Who is john'})], graph_search_results=None)
+# Completion:
+# ChatCompletion(id='chatcmpl-9g0HnjGjyWDLADe7E2EvLWa35cMkB', choices=[Choice(finish_reason='stop', index=0, logprobs=None, message=ChatCompletionMessage(content='John is a person that works at Google [1].', role='assistant', function_call=None, tool_calls=None))], created=1719797903, model='gpt-4o-mini', object='chat.completion', service_tier=None, system_fingerprint=None, usage=CompletionUsage(completion_tokens=11, prompt_tokens=145, total_tokens=156))

BIN
py/core/examples/supported_file_types/bmp.bmp


+ 126 - 0
py/core/examples/supported_file_types/css.css

@@ -0,0 +1,126 @@
+@layer components {
+    .fern-search-hit-title {
+        display: block;
+        overflow: hidden;
+        text-overflow: ellipsis;
+    }
+
+    .fern-search-hit-title.deprecated {
+        opacity: .7;
+        text-decoration: line-through;
+    }
+
+    .fern-search-hit-breadcrumb,.fern-search-hit-endpoint-path,.fern-search-hit-snippet {
+        color: var(--grayscale-a11);
+        display: block;
+        overflow: hidden;
+        overflow-wrap: break-word;
+        text-overflow: ellipsis;
+        white-space: nowrap;
+    }
+
+    .fern-search-hit-highlighted {
+        font-weight: 600;
+    }
+
+    .fern-search-hit-snippet {
+        font-size: .875rem;
+        line-height: 1.375;
+    }
+
+    .fern-search-hit-breadcrumb,.fern-search-hit-endpoint-path {
+        font-size: .75rem;
+    }
+
+    .fern-search-hit-endpoint-path {
+        font-family: var(--font-mono);
+    }
+
+    #fern-search-mobile-command[data-cmdk-root] {
+        overflow: hidden;
+    }
+
+    #fern-search-mobile-command[data-cmdk-root] [data-cmdk-fern-header] {
+        display: flex;
+        gap: .5rem;
+        padding: 0 .5rem;
+    }
+
+    #fern-search-mobile-command[data-cmdk-root] [data-cmdk-list] {
+        overflow: auto;
+        overscroll-behavior: contain;
+    }
+
+    #fern-search-mobile-command[data-cmdk-root] [data-cmdk-list]:focus {
+        outline: none;
+    }
+
+    #fern-search-mobile-command[data-cmdk-root] [data-cmdk-list-sizer] {
+        display: flex;
+        flex-direction: column;
+        gap: .5rem;
+    }
+
+    #fern-search-mobile-command[data-cmdk-root] [data-cmdk-item] {
+        border-radius: calc(.5rem - 2px);
+        cursor: default;
+        display: flex;
+        gap: .5rem;
+        margin-left: .5rem;
+        margin-right: .5rem;
+        padding: .5rem;
+        scroll-margin: .75rem 0;
+        text-align: left;
+    }
+
+    #fern-search-mobile-command[data-cmdk-root] [data-cmdk-item] svg:first-child {
+        flex-shrink: 0;
+        height: 1rem;
+        margin: .25rem 0;
+        opacity: .6;
+        pointer-events: none;
+        width: 1rem;
+    }
+
+    #fern-search-mobile-command[data-cmdk-root] [data-cmdk-item] mark {
+        background: transparent!important;
+        color: inherit;
+    }
+}
+
+@layer components {
+    @media (hover: hover) and (pointer: fine) {
+        #fern-search-mobile-command[data-cmdk-root] [data-cmdk-item][data-selected=true] {
+            background-color: var(--accent-a3);
+            color: var(--accent-a11);
+        }
+
+        #fern-search-mobile-command[data-cmdk-root] [data-cmdk-item][data-selected=true] .fern-search-hit-breadcrumb,
+        #fern-search-mobile-command[data-cmdk-root] [data-cmdk-item][data-selected=true] .fern-search-hit-endpoint-path,
+        #fern-search-mobile-command[data-cmdk-root] [data-cmdk-item][data-selected=true] .fern-search-hit-snippet {
+            color: var(--accent-a11);
+            opacity: .8;
+        }
+    }
+
+    #fern-search-mobile-command[data-cmdk-root] [data-cmdk-empty] {
+        color: var(--grayscale-a9);
+        hyphens: auto;
+        overflow-wrap: break-word;
+        padding: 2rem;
+        text-align: center;
+    }
+
+    #fern-search-mobile-command[data-cmdk-root] [data-cmdk-group-heading] {
+        color: var(--grayscale-a9);
+        font-size: .75rem;
+        font-weight: 600;
+        margin-bottom: .5rem;
+        padding: 0 1rem;
+    }
+
+    #fern-search-mobile-command[data-cmdk-root] .fern-search-hit-snippet {
+        line-clamp: 2;
+        -webkit-line-clamp: 2;
+    }
+}

+ 11 - 0
py/core/examples/supported_file_types/csv.csv

@@ -0,0 +1,11 @@
+Date,Customer ID,Product,Quantity,Unit Price,Total
+2024-01-15,C1001,Laptop Pro X,2,999.99,1999.98
+2024-01-15,C1002,Wireless Mouse,5,29.99,149.95
+2024-01-16,C1003,External SSD 1TB,3,159.99,479.97
+2024-01-16,C1001,USB-C Cable,4,19.99,79.96
+2024-01-17,C1004,"Monitor 27""",1,349.99,349.99
+2024-01-17,C1005,Keyboard Elite,2,129.99,259.98
+2024-01-18,C1002,Headphones Pro,1,199.99,199.99
+2024-01-18,C1006,Webcam HD,3,79.99,239.97
+2024-01-19,C1007,Power Bank,2,49.99,99.98
+2024-01-19,C1003,Phone Case,5,24.99,124.95

BIN
py/core/examples/supported_file_types/doc.doc


BIN
py/core/examples/supported_file_types/docx.docx


+ 61 - 0
py/core/examples/supported_file_types/eml.eml

@@ -0,0 +1,61 @@
+From: sender@example.com
+To: recipient@example.com
+Subject: Meeting Summary - Q4 Planning
+Date: Mon, 16 Dec 2024 10:30:00 -0500
+Content-Type: multipart/mixed; boundary="boundary123"
+
+--boundary123
+Content-Type: text/plain; charset="utf-8"
+Content-Transfer-Encoding: quoted-printable
+
+Hi Team,
+
+Here's a summary of our Q4 planning meeting:
+
+Key Points:
+1. Revenue targets increased by 15%
+2. New product launch scheduled for November
+3. Marketing budget approved for expansion
+
+Action Items:
+- Sarah: Prepare detailed product roadmap
+- Mike: Contact vendors for pricing
+- Jennifer: Update financial projections
+
+Please review and let me know if you have any questions.
+
+Best regards,
+Alex
+
+--boundary123
+Content-Type: text/html; charset="utf-8"
+Content-Transfer-Encoding: quoted-printable
+
+<html>
+<body>
+<p>Hi Team,</p>
+
+<p>Here's a summary of our Q4 planning meeting:</p>
+
+<h3>Key Points:</h3>
+<ul>
+<li>Revenue targets increased by 15%</li>
+<li>New product launch scheduled for November</li>
+<li>Marketing budget approved for expansion</li>
+</ul>
+
+<h3>Action Items:</h3>
+<ul>
+<li><strong>Sarah:</strong> Prepare detailed product roadmap</li>
+<li><strong>Mike:</strong> Contact vendors for pricing</li>
+<li><strong>Jennifer:</strong> Update financial projections</li>
+</ul>
+
+<p>Please review and let me know if you have any questions.</p>
+
+<p>Best regards,<br>
+Alex</p>
+</body>
+</html>
+
+--boundary123--
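A multipart message like the one above can be parsed with Python's stdlib `email` package; a short sketch using a trimmed copy of the message (bodies abbreviated, quoted-printable headers omitted):

```python
import email
from email import policy

RAW = """From: sender@example.com
To: recipient@example.com
Subject: Meeting Summary - Q4 Planning
Date: Mon, 16 Dec 2024 10:30:00 -0500
Content-Type: multipart/mixed; boundary="boundary123"

--boundary123
Content-Type: text/plain; charset="utf-8"

Hi Team, here's a summary of our Q4 planning meeting.

--boundary123
Content-Type: text/html; charset="utf-8"

<html><body><p>Hi Team,</p></body></html>

--boundary123--
"""

msg = email.message_from_string(RAW, policy=policy.default)
print(msg["Subject"])  # Meeting Summary - Q4 Planning

# Leaf parts only; the multipart container itself is skipped.
parts = [p.get_content_type() for p in msg.walk() if not p.is_multipart()]
print(parts)  # ['text/plain', 'text/html']
```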

BIN
py/core/examples/supported_file_types/epub.epub


BIN
py/core/examples/supported_file_types/heic.heic


+ 69 - 0
py/core/examples/supported_file_types/html.html

@@ -0,0 +1,69 @@
+<!DOCTYPE html>
+<html lang="en">
+<head>
+    <meta charset="UTF-8">
+    <meta name="viewport" content="width=device-width, initial-scale=1.0">
+    <title>Product Dashboard</title>
+    <style>
+        body {
+            font-family: Arial, sans-serif;
+            margin: 20px;
+            background-color: #f5f5f5;
+        }
+        .dashboard {
+            max-width: 800px;
+            margin: 0 auto;
+            padding: 20px;
+            background-color: white;
+            border-radius: 8px;
+            box-shadow: 0 2px 4px rgba(0,0,0,0.1);
+        }
+        .header {
+            text-align: center;
+            margin-bottom: 30px;
+        }
+        .metrics {
+            display: grid;
+            grid-template-columns: repeat(auto-fit, minmax(200px, 1fr));
+            gap: 20px;
+            margin-bottom: 30px;
+        }
+        .metric-card {
+            padding: 15px;
+            background-color: #f8f9fa;
+            border-radius: 4px;
+            text-align: center;
+        }
+    </style>
+</head>
+<body>
+    <div class="dashboard">
+        <div class="header">
+            <h1>Product Performance Dashboard</h1>
+            <p>Real-time metrics and analytics</p>
+        </div>
+        <div class="metrics">
+            <div class="metric-card">
+                <h3>Active Users</h3>
+                <p>1,234</p>
+            </div>
+            <div class="metric-card">
+                <h3>Revenue</h3>
+                <p>$45,678</p>
+            </div>
+            <div class="metric-card">
+                <h3>Conversion Rate</h3>
+                <p>2.34%</p>
+            </div>
+        </div>
+        <div class="recent-activity">
+            <h2>Recent Activity</h2>
+            <ul>
+                <li>New feature deployed: Enhanced search</li>
+                <li>Bug fix: Mobile navigation issue</li>
+                <li>Performance improvement: Cache optimization</li>
+            </ul>
+        </div>
+    </div>
+</body>
+</html>

BIN
py/core/examples/supported_file_types/jpeg.jpeg


BIN
py/core/examples/supported_file_types/jpg.jpg


+ 43 - 0
py/core/examples/supported_file_types/js.js

@@ -0,0 +1,43 @@
+const path = require('path');
+const { r2rClient } = require("r2r-js");
+
+// Create an account at SciPhi Cloud https://app.sciphi.ai and set an R2R_API_KEY environment variable
+// or set the base URL to your instance. E.g. r2rClient("http://localhost:7272")
+const client = new r2rClient();
+
+async function main() {
+  const filePath = path.resolve(__dirname, "data/raskolnikov.txt");
+
+  console.log("Ingesting file...");
+  const ingestResult = await client.documents.create({
+    file: {
+      path: filePath,
+      name: "raskolnikov.txt"
+    },
+    metadata: { author: "Dostoevsky" },
+  });
+  console.log("Ingest result:", JSON.stringify(ingestResult, null, 2));
+
+  console.log("Waiting for the file to be ingested...");
+  await new Promise((resolve) => setTimeout(resolve, 10000));
+
+  console.log("Performing RAG...");
+  const ragResponse = await client.retrieval.rag({
+    query: "To whom was Raskolnikov desperately in debt?",
+  });
+
+  console.log("Search Results:");
+  ragResponse.results.searchResults.chunkSearchResults.forEach(
+    (result, index) => {
+      console.log(`\nResult ${index + 1}:`);
+      console.log(`Text: ${result.text.substring(0, 100)}...`);
+      console.log(`Score: ${result.score}`);
+    },
+  );
+
+  console.log("\nCompletion:");
+  console.log(ragResponse.results.completion);
+}
+
+main();

+ 58 - 0
py/core/examples/supported_file_types/json.json

@@ -0,0 +1,58 @@
+{
+    "dashboard": {
+        "name": "Product Performance Dashboard",
+        "lastUpdated": "2024-12-16T10:30:00Z",
+        "metrics": {
+            "activeUsers": {
+                "current": 1234,
+                "previousPeriod": 1156,
+                "percentChange": 6.75
+            },
+            "revenue": {
+                "current": 45678.90,
+                "previousPeriod": 41234.56,
+                "percentChange": 10.78,
+                "currency": "USD"
+            },
+            "conversionRate": {
+                "current": 2.34,
+                "previousPeriod": 2.12,
+                "percentChange": 10.38,
+                "unit": "percent"
+            }
+        },
+        "recentActivity": [
+            {
+                "type": "deployment",
+                "title": "Enhanced search",
+                "description": "New feature deployed: Enhanced search functionality",
+                "timestamp": "2024-12-15T15:45:00Z",
+                "status": "successful"
+            },
+            {
+                "type": "bugfix",
+                "title": "Mobile navigation",
+                "description": "Bug fix: Mobile navigation issue resolved",
+                "timestamp": "2024-12-14T09:20:00Z",
+                "status": "successful"
+            },
+            {
+                "type": "performance",
+                "title": "Cache optimization",
+                "description": "Performance improvement: Cache optimization completed",
+                "timestamp": "2024-12-13T11:15:00Z",
+                "status": "successful"
+            }
+        ],
+        "settings": {
+            "refreshInterval": 300,
+            "timezone": "UTC",
+            "theme": "light",
+            "notifications": {
+                "email": true,
+                "slack": true,
+                "inApp": true
+            }
+        }
+    }
+}
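The `percentChange` fields in the metrics above can be recomputed from `current` and `previousPeriod`; a short stdlib sketch over the three metric objects copied from the JSON:

```python
import json

doc = json.loads("""
{
  "activeUsers":    {"current": 1234,     "previousPeriod": 1156,     "percentChange": 6.75},
  "revenue":        {"current": 45678.90, "previousPeriod": 41234.56, "percentChange": 10.78},
  "conversionRate": {"current": 2.34,     "previousPeriod": 2.12,     "percentChange": 10.38}
}
""")

def pct_change(current: float, previous: float) -> float:
    """Percent change, rounded to two decimals as in the dashboard JSON."""
    return round((current - previous) / previous * 100, 2)

checks = {name: pct_change(m["current"], m["previousPeriod"]) == m["percentChange"]
          for name, m in doc.items()}
print(checks)  # all three metrics check out
```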

+ 310 - 0
py/core/examples/supported_file_types/md.md

@@ -0,0 +1,310 @@
+# Markdown: Syntax
+
+*   [Overview](#overview)
+    *   [Philosophy](#philosophy)
+    *   [Inline HTML](#html)
+    *   [Automatic Escaping for Special Characters](#autoescape)
+*   [Block Elements](#block)
+    *   [Paragraphs and Line Breaks](#p)
+    *   [Headers](#header)
+    *   [Blockquotes](#blockquote)
+    *   [Lists](#list)
+    *   [Code Blocks](#precode)
+    *   [Horizontal Rules](#hr)
+*   [Span Elements](#span)
+    *   [Links](#link)
+    *   [Emphasis](#em)
+    *   [Code](#code)
+    *   [Images](#img)
+*   [Miscellaneous](#misc)
+    *   [Backslash Escapes](#backslash)
+    *   [Automatic Links](#autolink)
+
+
+**Note:** This document is itself written using Markdown; you
+can [see the source for it by adding '.text' to the URL](/projects/markdown/syntax.text).
+
+----
+
+## Overview
+
+### Philosophy
+
+Markdown is intended to be as easy-to-read and easy-to-write as is feasible.
+
+Readability, however, is emphasized above all else. A Markdown-formatted
+document should be publishable as-is, as plain text, without looking
+like it's been marked up with tags or formatting instructions. While
+Markdown's syntax has been influenced by several existing text-to-HTML
+filters -- including [Setext](http://docutils.sourceforge.net/mirror/setext.html), [atx](http://www.aaronsw.com/2002/atx/), [Textile](http://textism.com/tools/textile/), [reStructuredText](http://docutils.sourceforge.net/rst.html),
+[Grutatext](http://www.triptico.com/software/grutatxt.html), and [EtText](http://ettext.taint.org/doc/) -- the single biggest source of
+inspiration for Markdown's syntax is the format of plain text email.
+
+## Block Elements
+
+### Paragraphs and Line Breaks
+
+A paragraph is simply one or more consecutive lines of text, separated
+by one or more blank lines. (A blank line is any line that looks like a
+blank line -- a line containing nothing but spaces or tabs is considered
+blank.) Normal paragraphs should not be indented with spaces or tabs.
+
+The implication of the "one or more consecutive lines of text" rule is
+that Markdown supports "hard-wrapped" text paragraphs. This differs
+significantly from most other text-to-HTML formatters (including Movable
+Type's "Convert Line Breaks" option) which translate every line break
+character in a paragraph into a `<br />` tag.
+
+When you *do* want to insert a `<br />` break tag using Markdown, you
+end a line with two or more spaces, then type return.
+
+### Headers
+
+Markdown supports two styles of headers, [Setext] [1] and [atx] [2].
+
+Optionally, you may "close" atx-style headers. This is purely
+cosmetic -- you can use this if you think it looks better. The
+closing hashes don't even need to match the number of hashes
+used to open the header. (The number of opening hashes
+determines the header level.)
+
+
+### Blockquotes
+
+Markdown uses email-style `>` characters for blockquoting. If you're
+familiar with quoting passages of text in an email message, then you
+know how to create a blockquote in Markdown. It looks best if you hard
+wrap the text and put a `>` before every line:
+
+> This is a blockquote with two paragraphs. Lorem ipsum dolor sit amet,
+> consectetuer adipiscing elit. Aliquam hendrerit mi posuere lectus.
+> Vestibulum enim wisi, viverra nec, fringilla in, laoreet vitae, risus.
+>
+> Donec sit amet nisl. Aliquam semper ipsum sit amet velit. Suspendisse
+> id sem consectetuer libero luctus adipiscing.
+
+Markdown allows you to be lazy and only put the `>` before the first
+line of a hard-wrapped paragraph:
+
+> This is a blockquote with two paragraphs. Lorem ipsum dolor sit amet,
+consectetuer adipiscing elit. Aliquam hendrerit mi posuere lectus.
+Vestibulum enim wisi, viverra nec, fringilla in, laoreet vitae, risus.
+
+> Donec sit amet nisl. Aliquam semper ipsum sit amet velit. Suspendisse
+id sem consectetuer libero luctus adipiscing.
+
+Blockquotes can be nested (i.e. a blockquote-in-a-blockquote) by
+adding additional levels of `>`:
+
+> This is the first level of quoting.
+>
+> > This is a nested blockquote.
+>
+> Back to the first level.
+
+Blockquotes can contain other Markdown elements, including headers, lists,
+and code blocks:
+
+> ## This is a header.
+>
+> 1.   This is the first list item.
+> 2.   This is the second list item.
+>
+> Here's some example code:
+>
+>     return shell_exec("echo $input | $markdown_script");
+
+Any decent text editor should make email-style quoting easy. For
+example, with BBEdit, you can make a selection and choose Increase
+Quote Level from the Text menu.
+
+
+### Lists
+
+Markdown supports ordered (numbered) and unordered (bulleted) lists.
+
+Unordered lists use asterisks, pluses, and hyphens -- interchangeably
+-- as list markers:
+
+*   Red
+*   Green
+*   Blue
+
+is equivalent to:
+
++   Red
++   Green
++   Blue
+
+and:
+
+-   Red
+-   Green
+-   Blue
+
+Ordered lists use numbers followed by periods:
+
+1.  Bird
+2.  McHale
+3.  Parish
+
+It's important to note that the actual numbers you use to mark the
+list have no effect on the HTML output Markdown produces. The HTML
+Markdown produces from the above list is:
+
+    <ol>
+    <li>Bird</li>
+    <li>McHale</li>
+    <li>Parish</li>
+    </ol>
+
+If you instead wrote the list in Markdown like this:
+
+1.  Bird
+1.  McHale
+1.  Parish
+
+or even:
+
+3. Bird
+1. McHale
+8. Parish
+
+you'd get the exact same HTML output. The point is, if you want to,
+you can use ordinal numbers in your ordered Markdown lists, so that
+the numbers in your source match the numbers in your published HTML.
+But if you want to be lazy, you don't have to.
+
+To make lists look nice, you can wrap items with hanging indents:
+
+*   Lorem ipsum dolor sit amet, consectetuer adipiscing elit.
+    Aliquam hendrerit mi posuere lectus. Vestibulum enim wisi,
+    viverra nec, fringilla in, laoreet vitae, risus.
+*   Donec sit amet nisl. Aliquam semper ipsum sit amet velit.
+    Suspendisse id sem consectetuer libero luctus adipiscing.
+
+But if you want to be lazy, you don't have to:
+
+*   Lorem ipsum dolor sit amet, consectetuer adipiscing elit.
+Aliquam hendrerit mi posuere lectus. Vestibulum enim wisi,
+viverra nec, fringilla in, laoreet vitae, risus.
+*   Donec sit amet nisl. Aliquam semper ipsum sit amet velit.
+Suspendisse id sem consectetuer libero luctus adipiscing.
+
+List items may consist of multiple paragraphs. Each subsequent
+paragraph in a list item must be indented by either 4 spaces
+or one tab:
+
+1.  This is a list item with two paragraphs. Lorem ipsum dolor
+    sit amet, consectetuer adipiscing elit. Aliquam hendrerit
+    mi posuere lectus.
+
+    Vestibulum enim wisi, viverra nec, fringilla in, laoreet
+    vitae, risus. Donec sit amet nisl. Aliquam semper ipsum
+    sit amet velit.
+
+2.  Suspendisse id sem consectetuer libero luctus adipiscing.
+
+It looks nice if you indent every line of the subsequent
+paragraphs, but here again, Markdown will allow you to be
+lazy:
+
+*   This is a list item with two paragraphs.
+
+    This is the second paragraph in the list item. You're
+only required to indent the first line. Lorem ipsum dolor
+sit amet, consectetuer adipiscing elit.
+
+*   Another item in the same list.
+
+To put a blockquote within a list item, the blockquote's `>`
+delimiters need to be indented:
+
+*   A list item with a blockquote:
+
+    > This is a blockquote
+    > inside a list item.
+
+To put a code block within a list item, the code block needs
+to be indented *twice* -- 8 spaces or two tabs:
+
+*   A list item with a code block:
+
+        <code goes here>
+
+### Code Blocks
+
+Pre-formatted code blocks are used for writing about programming or
+markup source code. Rather than forming normal paragraphs, the lines
+of a code block are interpreted literally. Markdown wraps a code block
+in both `<pre>` and `<code>` tags.
+
+To produce a code block in Markdown, simply indent every line of the
+block by at least 4 spaces or 1 tab.
+
+This is a normal paragraph:
+
+    This is a code block.
+
+Here is an example of AppleScript:
+
+    tell application "Foo"
+        beep
+    end tell
+
+A code block continues until it reaches a line that is not indented
+(or the end of the article).
+
+Within a code block, ampersands (`&`) and angle brackets (`<` and `>`)
+are automatically converted into HTML entities. This makes it very
+easy to include example HTML source code using Markdown -- just paste
+it and indent it, and Markdown will handle the hassle of encoding the
+ampersands and angle brackets. For example, this:
+
+    <div class="footer">
+        &copy; 2004 Foo Corporation
+    </div>
+
+Regular Markdown syntax is not processed within code blocks. E.g.,
+asterisks are just literal asterisks within a code block. This means
+it's also easy to use Markdown to write about Markdown's own syntax.
+
+```
+tell application "Foo"
+    beep
+end tell
+```
+
+## Span Elements
+
+### Links
+
+Markdown supports two styles of links: *inline* and *reference*.
+
+In both styles, the link text is delimited by [square brackets].
+
+To create an inline link, use a set of regular parentheses immediately
+after the link text's closing square bracket. Inside the parentheses,
+put the URL where you want the link to point, along with an *optional*
+title for the link, surrounded in quotes. For example:
+
+This is [an example](http://example.com/) inline link.
+
+[This link](http://example.net/) has no title attribute.
+
+### Emphasis
+
+Markdown treats asterisks (`*`) and underscores (`_`) as indicators of
+emphasis. Text wrapped with one `*` or `_` will be wrapped with an
+HTML `<em>` tag; double `*`'s or `_`'s will be wrapped with an HTML
+`<strong>` tag. E.g., this input:
+
+*single asterisks*
+
+_single underscores_
+
+**double asterisks**
+
+__double underscores__
+
+### Code
+
+To indicate a span of code, wrap it with backtick quotes (`` ` ``).
+Unlike a pre-formatted code block, a code span indicates code within a
+normal paragraph. For example:
+
+Use the `printf()` function.
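The emphasis and code-span rules described in the file above can be illustrated with a minimal regex converter; this is a deliberately incomplete sketch (no escaping, no nesting rules), not a spec-compliant Markdown implementation:

```python
import re

def emphasize(text: str) -> str:
    """Toy converter for the emphasis and code-span rules described above."""
    # Code spans first, so their contents are left alone by the later rules.
    text = re.sub(r"`([^`]+)`", r"<code>\1</code>", text)
    # Double delimiters -> <strong>, then single delimiters -> <em>.
    text = re.sub(r"\*\*(.+?)\*\*|__(.+?)__",
                  lambda m: f"<strong>{m.group(1) or m.group(2)}</strong>", text)
    text = re.sub(r"\*(.+?)\*|_(.+?)_",
                  lambda m: f"<em>{m.group(1) or m.group(2)}</em>", text)
    return text

print(emphasize("*single asterisks*"))            # <em>single asterisks</em>
print(emphasize("__double underscores__"))        # <strong>double underscores</strong>
print(emphasize("Use the `printf()` function."))  # Use the <code>printf()</code> function.
```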

BIN
py/core/examples/supported_file_types/msg.msg


BIN
py/core/examples/supported_file_types/odt.odt


+ 153 - 0
py/core/examples/supported_file_types/org.org

@@ -0,0 +1,153 @@
+#+title: Modern Org Example
+#+author: Daniel Mendler
+#+filetags: :example:org:
+
+This example Org file demonstrates the Org elements,
+which are styled by =org-modern=.
+
+-----
+
+* Headlines
+** Second level
+*** Third level
+**** Fourth level
+***** Fifth level
+
+* Task Lists [1/3]
+  - [X] Write =org-modern=
+  - [-] Publish =org-modern=
+  - [ ] Fix all the bugs
+
+* List Bullets
+  - Dash
+  + Plus
+  * Asterisk
+
+* Timestamps
+DEADLINE:  <2022-03-01 Tue>
+SCHEDULED: <2022-02-25 10:00>
+DRANGE:    [2022-03-01]--[2022-04-01]
+DRANGE:    <2022-03-01>--<2022-04-01>
+TRANGE:    [2022-03-01 Tue 10:42-11:00]
+TIMESTAMP: [2022-02-21 Mon 13:00]
+DREPEATED: <2022-02-26 Sat .+1d/2d +3d>
+TREPEATED: <2022-02-26 Sat 10:00 .+1d/2d>
+
+* Blocks
+
+#+begin_src emacs-lisp
+  ;; Taken from the well-structured Emacs config by @oantolin.
+  ;; Take a look at https://github.com/oantolin/emacs-config!
+  (defun command-of-the-day ()
+    "Show the documentation for a random command."
+    (interactive)
+    (let ((commands))
+      (mapatoms (lambda (s)
+                  (when (commandp s) (push s commands))))
+      (describe-function
+       (nth (random (length commands)) commands))))
+#+end_src
+
+#+begin_src calc
+  taylor(sin(x),x=0,3)
+#+end_src
+
+#+results:
+: pi x / 180 - 2.85779606768e-8 pi^3 x^3
+
+#+BEGIN_SRC C
+  printf("a|b\nc|d\n");
+#+END_SRC
+
+#+results:
+| a | b |
+| c | d |
+
+
+* Todo Labels and Tags
+** DONE Write =org-modern= :emacs:foss:coding:
+** TODO Publish =org-modern=
+** WAIT Fix all the bugs
+
+* Priorities
+** DONE [#A] Most important
+** TODO [#B] Less important
+** CANCEL [#C] Not that important
+** DONE [100%] [#A] Everything combined :tag:test:
+  * [X] First
+  * [X] Second
+  * [X] Third
+
+* Tables
+
+| N | N^2 | N^3 | N^4 | sqrt(N) | sqrt[4](N) |
+|---+----+----+----+---------+------------|
+| 2 |  4 |  8 | 16 |  1.4142 |     1.1892 |
+| 3 |  9 | 27 | 81 |  1.7321 |     1.3161 |
+
+|---+----+----+----+---------+------------|
+| N | N^2 | N^3 | N^4 | sqrt(N) | sqrt[4](N) |
+|---+----+----+----+---------+------------|
+| 2 |  4 |  8 | 16 |  1.4142 |     1.1892 |
+| 3 |  9 | 27 | 81 |  1.7321 |     1.3161 |
+|---+----+----+----+---------+------------|
+
+#+begin_example
+| a | b | c |
+| a | b | c |
+| a | b | c |
+#+end_example
+
+* Special Links
+
+Test numeric footnotes[fn:1] and named footnotes[fn:foo].
+
+<<This is an internal link>>
+
+<<<radio link>>>
+
+[[This is an internal link]]
+
+radio link
+
+[fn:1] This is footnote 1
+[fn:foo] This is the footnote
+
+* Progress bars
+
+- quotient [1/13]
+- quotient [2/13]
+- quotient [3/13]
+- quotient [4/13]
+- quotient [5/13]
+- quotient [6/13]
+- quotient [7/13]
+- quotient [8/13]
+- quotient [9/13]
+- quotient [10/13]
+- quotient [11/13]
+- quotient [12/13]
+- quotient [13/13]
+
+- percent [0%]
+- percent [1%]
+- percent [2%]
+- percent [5%]
+- percent [10%]
+- percent [20%]
+- percent [30%]
+- percent [40%]
+- percent [50%]
+- percent [60%]
+- percent [70%]
+- percent [80%]
+- percent [90%]
+- percent [100%]
+
+- overflow [110%]
+- overflow [20/10]

+ 50 - 0
py/core/examples/supported_file_types/p7s.p7s

@@ -0,0 +1,50 @@
+MIME-Version: 1.0
+Content-Type: multipart/signed; protocol="application/x-pkcs7-signature"; micalg="sha-256"; boundary="----2234CCF759A742BD58A8D9D012C3BC23"
+
+This is an S/MIME signed message
+
+------2234CCF759A742BD58A8D9D012C3BC23
+Hello World
+
+------2234CCF759A742BD58A8D9D012C3BC23
+Content-Type: application/x-pkcs7-signature; name="smime.p7s"
+Content-Transfer-Encoding: base64
+Content-Disposition: attachment; filename="smime.p7s"
+
+MIIGiwYJKoZIhvcNAQcCoIIGfDCCBngCAQExDzANBglghkgBZQMEAgEFADALBgkq
+hkiG9w0BBwGgggOpMIIDpTCCAo2gAwIBAgIUNUBhVZGwKQ9d8VLtLZLNvEwWnXUw
+DQYJKoZIhvcNAQELBQAwezELMAkGA1UEBhMCVVMxEzARBgNVBAgMCkNhbGlmb3Ju
+aWExFjAUBgNVBAcMDVNhbiBGcmFuY2lzY28xDzANBgNVBAoMBlNjaVBoaTEOMAwG
+A1UEAwwFTm9sYW4xHjAcBgkqhkiG9w0BCQEWD25vbGFuQHNjaXBoaS5haTAeFw0y
+NDEyMTYyMDIxMjJaFw0yNTEyMTYyMDIxMjJaMHsxCzAJBgNVBAYTAlVTMRMwEQYD
+VQQIDApDYWxpZm9ybmlhMRYwFAYDVQQHDA1TYW4gRnJhbmNpc2NvMQ8wDQYDVQQK
+DAZTY2lQaGkxDjAMBgNVBAMMBU5vbGFuMR4wHAYJKoZIhvcNAQkBFg9ub2xhbkBz
+Y2lwaGkuYWkwggEiMA0GCSqGSIb3DQEBAQUAA4IBDwAwggEKAoIBAQCcBfnCPjDl
+SBzauhd/Q0z2lQc1smO6eDmaly3CsHvFMvINQrX9adnQt9PQW35oV+lzikDfEfpv
+W60pYLQR1iZEDu6ELS5iGjHFtnQvj8BYm23CKdDY+NGlZYJXgw9J1Ezz0wgqruYU
+yduy2Tdp3uWxMXkEnR681u1PEPAFqMx3qYpTzEkdu6tmIF5QYHLle4qKyxknV1Yu
+RZYc7OVpBfKlpt9Ya+i+gugNZoSwPgouLxdZkM5XBGgS2iMD7X2C5819DAmXzdm5
+l95VsCISQ5bjpmXiS8LHdFaTEqtvgeqw8nmlcU8994t0PpfdKFr0lL8NoiDYXht7
+v1mLmEmrtAoTAgMBAAGjITAfMB0GA1UdDgQWBBQZW3RPHHKH4MsjXsdwNtI0BQDu
+DzANBgkqhkiG9w0BAQsFAAOCAQEAEqYqqM/8BgB6LfHdj+vo7S9kHauh2bhLOZnm
+ecZu+N/Dg1WwIaCtGL6L5UmLkcQ28pJNgnUyr5eQZxtOa7y1CfDFxO6bnY8oeAcU
+0PqLi6sdUtLTjLlt47rOysCnIx8MjscQRfopH3sUD5eKYk3yMGVcTAVLBUMSgaUJ
+a+tYhk9UEcIFtKrmRmNE+kW8+t/UKSv4xT4aDvmiiIQgel88YMgu3ADv1WWDjbd9
+u96blAHOR4FpfJzuEJ/4YVOND//A4Skqv4r82lu6ZoQx0u1CJd4UOZVcGF2itRgI
+OSm2hgEG/UpmWKdIwskBQM1dwdFpSzMtYWnDAcPB3S5onmE4OjGCAqYwggKiAgEB
+MIGTMHsxCzAJBgNVBAYTAlVTMRMwEQYDVQQIDApDYWxpZm9ybmlhMRYwFAYDVQQH
+DA1TYW4gRnJhbmNpc2NvMQ8wDQYDVQQKDAZTY2lQaGkxDjAMBgNVBAMMBU5vbGFu
+MR4wHAYJKoZIhvcNAQkBFg9ub2xhbkBzY2lwaGkuYWkCFDVAYVWRsCkPXfFS7S2S
+zbxMFp11MA0GCWCGSAFlAwQCAQUAoIHkMBgGCSqGSIb3DQEJAzELBgkqhkiG9w0B
+BwEwHAYJKoZIhvcNAQkFMQ8XDTI0MTIxNjIwMjEyOVowLwYJKoZIhvcNAQkEMSIE
+ILCAItMVzx6xLSZlve0OavQGU8CgvpdSMvtJvL0CHPw2MHkGCSqGSIb3DQEJDzFs
+MGowCwYJYIZIAWUDBAEqMAsGCWCGSAFlAwQBFjALBglghkgBZQMEAQIwCgYIKoZI
+hvcNAwcwDgYIKoZIhvcNAwICAgCAMA0GCCqGSIb3DQMCAgFAMAcGBSsOAwIHMA0G
+CCqGSIb3DQMCAgEoMA0GCSqGSIb3DQEBAQUABIIBAAFj405qE8q1KSpxckUqUwrp
+HFnkySyQnxHykeTrC3IwbwerL3lA9KBaP9F+yuweXro4dCKAMx/I0ajCJqiMWgDq
+6Gctn+RQURgP1ZEUViAonCOFMJ9a5bQs351DgH13qB48J8PnRmVQsoZNsjI+0atk
+2f5WBXrbv+onrUemFA5DdKOmb7ZWX6LmuJWg92JZQYuA56hdal0OZMBWvtZxLPaG
+z8CJSscfcbMEJhSDHSodnj4JpS0TkNW8LtqCaKnCFVYWOBsUPI/L6g7kPZ02BAy+
+XjtEf3BlXNq3nTZlppXN21y0thKrp0IMkwKrfLeEzY3ir1XrjkTy99gIz+lw++w=
+
+------2234CCF759A742BD58A8D9D012C3BC23--

BIN
py/core/examples/supported_file_types/pdf.pdf


BIN
py/core/examples/supported_file_types/png.png


BIN
py/core/examples/supported_file_types/ppt.ppt


BIN
py/core/examples/supported_file_types/pptx.pptx


+ 32 - 0
py/core/examples/supported_file_types/py.py

@@ -0,0 +1,32 @@
+# type: ignore
+from typing import AsyncGenerator
+
+from bs4 import BeautifulSoup
+
+from core.base.parsers.base_parser import AsyncParser
+from core.base.providers import (
+    CompletionProvider,
+    DatabaseProvider,
+    IngestionConfig,
+)
+
+
+class HTMLParser(AsyncParser[str | bytes]):
+    """A parser for HTML data."""
+
+    def __init__(
+        self,
+        config: IngestionConfig,
+        database_provider: DatabaseProvider,
+        llm_provider: CompletionProvider,
+    ):
+        self.database_provider = database_provider
+        self.llm_provider = llm_provider
+        self.config = config
+
+    async def ingest(
+        self, data: str | bytes, *args, **kwargs
+    ) -> AsyncGenerator[str, None]:
+        """Ingest HTML data and yield text."""
+        soup = BeautifulSoup(data, "html.parser")
+        yield soup.get_text()
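The `BeautifulSoup(...).get_text()` call at the heart of the parser above can be approximated without third-party dependencies using the stdlib `html.parser`; a sketch of the same strip-tags-keep-text behavior:

```python
from html.parser import HTMLParser as _StdHTMLParser

class TextExtractor(_StdHTMLParser):
    """Stdlib approximation of the BeautifulSoup get_text() call above."""

    def __init__(self) -> None:
        super().__init__()
        self.chunks: list[str] = []

    def handle_data(self, data: str) -> None:
        # Collect every text node; tags themselves are discarded.
        self.chunks.append(data)

    def text(self) -> str:
        return "".join(self.chunks)

p = TextExtractor()
p.feed("<html><body><p>Hi Team,</p><p>Q4 planning</p></body></html>")
print(p.text())  # Hi Team,Q4 planning
```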

+ 86 - 0
py/core/examples/supported_file_types/rst.rst

@@ -0,0 +1,86 @@
+Header 1
+========
+--------
+Subtitle
+--------
+
+Example text.
+
+.. contents:: Table of Contents
+
+Header 2
+--------
+
+1. Blah blah ``code`` blah
+
+2. More ``code``, hooray
+
+3. Somé UTF-8°
+
+The UTF-8 quote character in this table used to cause python to go boom. Now docutils just silently ignores it.
+
+.. csv-table:: Things that are Awesome (on a scale of 1-11)
+	:quote: ”
+
+	Thing,Awesomeness
+	Icecream, 7
+	Honey Badgers, 10.5
+	Nickelback, -2
+	Iron Man, 10
+	Iron Man 2, 3
+	Tabular Data, 5
+	Made up ratings, 11
+
+.. code::
+
+	A block of code
+
+.. code:: python
+
+	python.code('hooray')
+
+.. code:: javascript
+
+	export function ƒ(ɑ, β) {}
+
+.. doctest:: ignored
+
+	>>> some_function()
+	'result'
+
+>>> some_function()
+'result'
+
+==============  ==========================================================
+Travis          http://travis-ci.org/tony/pullv
+Docs            http://pullv.rtfd.org
+API             http://pullv.readthedocs.org/en/latest/api.html
+Issues          https://github.com/tony/pullv/issues
+Source          https://github.com/tony/pullv
+==============  ==========================================================
+
+
+.. image:: https://scan.coverity.com/projects/621/badge.svg
+	:target: https://scan.coverity.com/projects/621
+	:alt: Coverity Scan Build Status
+
+.. image:: https://scan.coverity.com/projects/621/badge.svg
+	:alt: Coverity Scan Build Status
+
+Field list
+----------
+
+:123456789 123456789 123456789 123456789 123456789 1: Uh-oh! This name is too long!
+:123456789 123456789 123456789 123456789 1234567890: this is a long name,
+	but no problem!
+:123456789 12345: this is not so long, but long enough for the default!
+:123456789 1234: this should work even with the default :)
+
+someone@somewhere.org
+
+Press :kbd:`Ctrl+C` to quit
+
+
+.. raw:: html
+
+    <p><strong>RAW HTML!</strong></p><style> p {color:blue;} </style>

+ 5 - 0
py/core/examples/supported_file_types/rtf.rtf

@@ -0,0 +1,5 @@
+{\rtf1\ansi\deff0
+{\fonttbl{\f0\froman\fcharset0 Times New Roman;}}
+\viewkind4\uc1\pard\f0\fs24
+Lorem ipsum dolor sit amet, consectetur adipiscing elit. Sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.\par
+}

BIN
py/core/examples/supported_file_types/tiff.tiff


+ 247 - 0
py/core/examples/supported_file_types/ts.ts

@@ -0,0 +1,247 @@
+import axios, {
+  AxiosInstance,
+  Method,
+  AxiosResponse,
+  AxiosRequestConfig,
+// @ts-ignore: Ignore module declaration error for axios
+} from "axios";
+// @ts-ignore: Ignore module declaration error for ./utils
+import { ensureCamelCase } from "./utils";
+
+let fs: any;
+if (typeof window === "undefined") {
+  // @ts-ignore: This is only for the GitHub flow build, not the dev environment
+  fs = require("fs");
+}
+
+function handleRequestError(response: AxiosResponse): void {
+  if (response.status < 400) {
+    return;
+  }
+
+  let message: string;
+  const errorContent = ensureCamelCase(response.data);
+
+  if (typeof errorContent === "object" && errorContent !== null) {
+    message =
+      errorContent.message ||
+      (errorContent.detail && errorContent.detail.message) ||
+      (typeof errorContent.detail === "string" && errorContent.detail) ||
+      JSON.stringify(errorContent);
+  } else {
+    message = String(errorContent);
+  }
+
+  throw new Error(`Status ${response.status}: ${message}`);
+}
+
+export abstract class BaseClient {
+  protected axiosInstance: AxiosInstance;
+  protected baseUrl: string;
+  protected accessToken?: string | null;
+  protected apiKey?: string | null;
+  protected refreshToken: string | null;
+  protected anonymousTelemetry: boolean;
+  protected enableAutoRefresh: boolean;
+
+  constructor(
+    baseURL: string = "http://localhost:7272",
+    prefix: string = "",
+    anonymousTelemetry = true,
+    enableAutoRefresh = false,
+  ) {
+    this.baseUrl = `${baseURL}${prefix}`;
+    this.accessToken = null;
+    // @ts-ignore: This is only for the GitHub flow build, not the dev environment
+    this.apiKey = process.env.R2R_API_KEY || null;
+    this.refreshToken = null;
+    this.anonymousTelemetry = anonymousTelemetry;
+
+    this.enableAutoRefresh = enableAutoRefresh;
+
+    this.axiosInstance = axios.create({
+      baseURL: this.baseUrl,
+      headers: {
+        "Content-Type": "application/json",
+      },
+    });
+  }
+
+  protected async _makeRequest<T = any>(
+    method: Method,
+    endpoint: string,
+    options: any = {},
+    version: "v3" = "v3",
+  ): Promise<T> {
+    const url = `/${version}/${endpoint}`;
+    const config: AxiosRequestConfig = {
+      method,
+      url,
+      headers: { ...options.headers },
+      params: options.params,
+      ...options,
+      responseType: options.responseType || "json",
+    };
+
+    config.headers = config.headers || {};
+
+    if (options.params) {
+      config.paramsSerializer = (params) => {
+        return Object.entries(params)
+          .map(([key, value]) => {
+            if (Array.isArray(value)) {
+              return value
+                .map(
+                  (v) => `${encodeURIComponent(key)}=${encodeURIComponent(v)}`,
+                )
+                .join("&");
+            }
+            return `${encodeURIComponent(key)}=${encodeURIComponent(
+              String(value),
+            )}`;
+          })
+          .join("&");
+      };
+    }
+
+    if (options.data) {
+      if (typeof FormData !== "undefined" && options.data instanceof FormData) {
+        config.data = options.data;
+        delete config.headers["Content-Type"];
+      } else if (typeof options.data === "object") {
+        if (
+          config.headers["Content-Type"] === "application/x-www-form-urlencoded"
+        ) {
+          config.data = Object.keys(options.data)
+            .map(
+              (key) =>
+                `${encodeURIComponent(key)}=${encodeURIComponent(
+                  options.data[key],
+                )}`,
+            )
+            .join("&");
+        } else {
+          // Both branches of the former method check were identical, so
+          // JSON-encode regardless of the HTTP method.
+          config.data = JSON.stringify(options.data);
+          config.headers["Content-Type"] = "application/json";
+        }
+      } else {
+        config.data = options.data;
+      }
+    }
+
+    if (this.accessToken && this.apiKey) {
+      throw new Error("Cannot have both access token and api key.");
+    }
+
+    if (
+      this.apiKey &&
+      !["register", "login", "verify_email", "health"].includes(endpoint)
+    ) {
+      config.headers["x-api-key"] = this.apiKey;
+    } else if (
+      this.accessToken &&
+      !["register", "login", "verify_email", "health"].includes(endpoint)
+    ) {
+      config.headers.Authorization = `Bearer ${this.accessToken}`;
+    }
+
+    if (options.responseType === "stream") {
+      return this.handleStreamingRequest<T>(method, version, endpoint, config);
+    }
+
+    try {
+      const response = await this.axiosInstance.request(config);
+
+      if (options.responseType === "blob") {
+        return response.data as T;
+      } else if (options.responseType === "arraybuffer") {
+        if (options.returnFullResponse) {
+          return response as unknown as T;
+        }
+        return response.data as T;
+      }
+
+      const responseData = options.returnFullResponse
+        ? { ...response, data: ensureCamelCase(response.data) }
+        : ensureCamelCase(response.data);
+
+      return responseData as T;
+    } catch (error) {
+      if (axios.isAxiosError(error) && error.response) {
+        handleRequestError(error.response);
+      }
+      throw error;
+    }
+  }
+
+  private async handleStreamingRequest<T>(
+    method: Method,
+    version: string,
+    endpoint: string,
+    config: AxiosRequestConfig,
+  ): Promise<T> {
+    const fetchHeaders: Record<string, string> = {};
+
+    // Convert Axios headers to Fetch headers
+    Object.entries(config.headers || {}).forEach(([key, value]) => {
+      if (typeof value === "string") {
+        fetchHeaders[key] = value;
+      }
+    });
+
+    try {
+      const response = await fetch(`${this.baseUrl}/${version}/${endpoint}`, {
+        method,
+        headers: fetchHeaders,
+        body: config.data,
+      });
+
+      if (!response.ok) {
+        const errorData = await response.json().catch(() => ({}));
+        throw new Error(
+          `HTTP error! status: ${response.status}: ${
+            ensureCamelCase(errorData).message || "Unknown error"
+          }`,
+        );
+      }
+
+      // Create a TransformStream to process the response
+      const transformStream = new TransformStream({
+        transform(chunk, controller) {
+          // Process each chunk here if needed
+          controller.enqueue(chunk);
+        },
+      });
+
+      // Pipe the response through the transform stream
+      const streamedResponse = response.body?.pipeThrough(transformStream);
+
+      if (!streamedResponse) {
+        throw new Error("No response body received from stream");
+      }
+
+      return streamedResponse as unknown as T;
+    } catch (error) {
+      console.error("Streaming request failed:", error);
+      throw error;
+    }
+  }
+
+  protected _ensureAuthenticated(): void {
+    if (!this.accessToken) {
+      throw new Error("Not authenticated. Please login first.");
+    }
+  }
+
+  setTokens(accessToken: string, refreshToken: string): void {
+    this.accessToken = accessToken;
+    this.refreshToken = refreshToken;
+  }
+}

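The credential logic in the request method above follows a simple precedence rule: holding both an API key and an access token is an error, auth-related endpoints are sent without credentials, and otherwise the API key (as `x-api-key`) wins over the bearer token. A minimal standalone sketch of that rule (function and type names here are hypothetical, not part of the SDK):

```typescript
type Creds = { apiKey?: string; accessToken?: string };

// Endpoints that must be callable without credentials, per the code above.
const UNAUTHENTICATED = ["register", "login", "verify_email", "health"];

function authHeaders(endpoint: string, creds: Creds): Record<string, string> {
  // Supplying both credential kinds is rejected outright.
  if (creds.apiKey && creds.accessToken) {
    throw new Error("Cannot have both access token and api key.");
  }
  const headers: Record<string, string> = {};
  // Auth endpoints are sent bare so login/registration can bootstrap.
  if (UNAUTHENTICATED.includes(endpoint)) {
    return headers;
  }
  if (creds.apiKey) {
    headers["x-api-key"] = creds.apiKey;
  } else if (creds.accessToken) {
    headers["Authorization"] = `Bearer ${creds.accessToken}`;
  }
  return headers;
}
```

For example, `authHeaders("documents", { accessToken: "t" })` yields a `Bearer t` Authorization header, while `authHeaders("login", { apiKey: "k" })` yields no headers at all.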
+ 11 - 0
py/core/examples/supported_file_types/tsv.tsv

@@ -0,0 +1,11 @@
+Region	Year	Quarter	Sales	Employees	Growth Rate
+North America	2024	Q1	1250000	45	5.2
+Europe	2024	Q1	980000	38	4.8
+Asia Pacific	2024	Q1	1450000	52	6.1
+South America	2024	Q1	580000	25	3.9
+Africa	2024	Q1	320000	18	4.2
+North America	2024	Q2	1380000	47	5.5
+Europe	2024	Q2	1050000	40	4.9
+Asia Pacific	2024	Q2	1520000	54	5.8
+South America	2024	Q2	620000	27	4.1
+Africa	2024	Q2	350000	20	4.4

+ 21 - 0
py/core/examples/supported_file_types/txt.txt

@@ -0,0 +1,21 @@
+Quod equidem non reprehendo;
+Lorem ipsum dolor sit amet, consectetur adipiscing elit. Quibus natura iure responderit non esse verum aliunde finem beate vivendi, a se principia rei gerendae peti; Quae enim adhuc protulisti, popularia sunt, ego autem a te elegantiora desidero. Duo Reges: constructio interrete. Tum Lucius: Mihi vero ista valde probata sunt, quod item fratri puto. Bestiarum vero nullum iudicium puto. Nihil enim iam habes, quod ad corpus referas; Deinde prima illa, quae in congressu solemus: Quid tu, inquit, huc? Et homini, qui ceteris animantibus plurimum praestat, praecipue a natura nihil datum esse dicemus?
+
+Iam id ipsum absurdum, maximum malum neglegi. Quod ea non occurrentia fingunt, vincunt Aristonem; Atqui perspicuum est hominem e corpore animoque constare, cum primae sint animi partes, secundae corporis. Fieri, inquam, Triari, nullo pacto potest, ut non dicas, quid non probes eius, a quo dissentias. Equidem e Cn. An dubium est, quin virtus ita maximam partem optineat in rebus humanis, ut reliquas obruat?
+
+Quis istum dolorem timet?
+Summus dolor plures dies manere non potest? Dicet pro me ipsa virtus nec dubitabit isti vestro beato M. Tubulum fuisse, qua illum, cuius is condemnatus est rogatione, P. Quod si ita sit, cur opera philosophiae sit danda nescio.
+
+Ex eorum enim scriptis et institutis cum omnis doctrina liberalis, omnis historia.
+Quod si ita est, sequitur id ipsum, quod te velle video, omnes semper beatos esse sapientes. Cum enim fertur quasi torrens oratio, quamvis multa cuiusque modi rapiat, nihil tamen teneas, nihil apprehendas, nusquam orationem rapidam coerceas. Ita redarguitur ipse a sese, convincunturque scripta eius probitate ipsius ac moribus. At quanta conantur! Mundum hunc omnem oppidum esse nostrum! Incendi igitur eos, qui audiunt, vides. Vide, ne magis, inquam, tuum fuerit, cum re idem tibi, quod mihi, videretur, non nova te rebus nomina inponere. Qui-vere falsone, quaerere mittimus-dicitur oculis se privasse; Si ista mala sunt, in quae potest incidere sapiens, sapientem esse non esse ad beate vivendum satis. At vero si ad vitem sensus accesserit, ut appetitum quendam habeat et per se ipsa moveatur, quid facturam putas?
+
+Quem si tenueris, non modo meum Ciceronem, sed etiam me ipsum abducas licebit.
+Stulti autem malorum memoria torquentur, sapientes bona praeterita grata recordatione renovata delectant.
+Esse enim quam vellet iniquus iustus poterat inpune.
+Quae autem natura suae primae institutionis oblita est?
+Verum tamen cum de rebus grandioribus dicas, ipsae res verba rapiunt;
+Hoc est non modo cor non habere, sed ne palatum quidem.
+Voluptatem cum summum bonum diceret, primum in eo ipso parum vidit, deinde hoc quoque alienum; Sed tu istuc dixti bene Latine, parum plane. Nam haec ipsa mihi erunt in promptu, quae modo audivi, nec ante aggrediar, quam te ab istis, quos dicis, instructum videro. Fatebuntur Stoici haec omnia dicta esse praeclare, neque eam causam Zenoni desciscendi fuisse. Non autem hoc: igitur ne illud quidem. Ratio quidem vestra sic cogit. Cum audissem Antiochum, Brute, ut solebam, cum M. An quod ita callida est, ut optime possit architectari voluptates?
+
+Idemne, quod iucunde?
+Haec mihi videtur delicatior, ut ita dicam, molliorque ratio, quam virtutis vis gravitasque postulat. Sed quoniam et advesperascit et mihi ad villam revertendum est, nunc quidem hactenus; Cuius ad naturam apta ratio vera illa et summa lex a philosophis dicitur. Neque solum ea communia, verum etiam paria esse dixerunt. Sed nunc, quod agimus; A mene tu?

Some files were not shown because too many files changed in this diff