Organize, ingest, and leverage your organization's content with vector-powered datasets. Ground AI responses in your specific sources for accurate, contextually relevant results.
Ask Sage Datasets are organized collections of your organization's content, including text, images, and audio, that you ingest into the platform to ground AI-generated responses in your specific sources. Datasets enable you to ingest data once and reuse it across multiple prompts, models, and team members, ensuring consistent, accurate, and contextually relevant results.
Accuracy & Relevance
Generate responses grounded in your specific materials rather than generic web content
Efficiency & Reuse
Ingest content once and reuse it across different prompts, models, and use cases
Team Collaboration
Share datasets to establish a single source of truth across your organization
Advanced Search
Use the Search Datasets plugin to quickly locate specific facts and information
Model Flexibility
Use datasets with any GenAI model; you are never locked into a single provider
CUI Support
Classify datasets as Unclassified or CUI to handle controlled information
Getting Started
Selecting Datasets in Prompt Settings
1. Access Data & Settings
Navigate to Data & Settings to select datasets that will provide context for your prompt:
Click the Data & Settings button or the Folder Icon below the prompt window
Select the dataset(s) you want to reference
Choose multiple datasets or select None if no dataset context is needed
Selected datasets will appear under the prompt window for easy identification
Understanding Attachments vs. Datasets
Chat Attachments
Persistence: One-time use only
Scope: Single conversation
Sharing: Not shareable
Use Case: Quick, ad-hoc analysis
Token Type: Inference tokens
vs
Datasets
Persistence: Permanent storage for reuse
Scope: Available across all prompts
Sharing: Can be shared with your team
Use Case: Recurring reference material
Token Type: Training + Inference tokens
Important: Files you attach in a chat are for one-off use and are not automatically saved to a dataset unless you explicitly ingest them.
Creating and Ingesting Datasets
2. Create a New Dataset
Click Prompt Tools → Data & Settings → Upload New Files
Click Create New Dataset
Enter a dataset name (alphanumeric characters and hyphens only, e.g., my-dataset-2025)
Classify the dataset as Unclassified or CUI (Controlled Unclassified Information)
Click Create Dataset
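The naming rule above (alphanumeric characters and hyphens only) can be checked before you create a dataset. A minimal sketch; the helper name and regex are illustrative, not part of Ask Sage:

```python
import re

# Mirrors the documented rule: names may contain only letters,
# digits, and hyphens (e.g., my-dataset-2025).
DATASET_NAME_RE = re.compile(r"^[A-Za-z0-9-]+$")

def is_valid_dataset_name(name: str) -> bool:
    """Return True if `name` uses only alphanumerics and hyphens."""
    return bool(DATASET_NAME_RE.match(name))
```

Spaces, underscores, and other punctuation would be rejected under this rule.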
CUI Classification: Classifying a dataset as CUI requires a CAC/PIV card or special activation. Contact support@asksage.ai to request this feature, or reach out to your organization's Administrator.
3. Upload Files to Dataset
Select your dataset from the dropdown list
Drag and drop files into the designated box, or click to browse your local machine
Review the file list and remove any unwanted files using the garbage bin icon
Click Ingest Files to begin the upload process
Look for the white checkmark and "Successfully Imported" message for each file
Supported File Formats (not exhaustive; additional formats are supported)
Documents: .pdf, .doc, .docx
Spreadsheets: .xls, .xlsx, .csv
Presentations: .ppt, .pptx
Images: .jpg, .png, .gif, .svg
Text Files: .txt, .rtf, .log
Markup & Data: .html, .xml, .json, .yaml
Maximum File Size: 50MB per file
Image Handling: Images embedded in text documents will not be extracted automatically; upload them separately. If you don't see a file type you need supported, email support@asksage.ai.
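A quick pre-upload check against the limits above can save a failed ingestion. This is an illustrative sketch, not an official API; the extension list is the subset of formats named in this section:

```python
import os

MAX_FILE_SIZE = 50 * 1024 * 1024  # documented limit: 50MB per file
SUPPORTED_EXTENSIONS = {
    ".pdf", ".doc", ".docx", ".xls", ".xlsx", ".csv", ".ppt", ".pptx",
    ".jpg", ".png", ".gif", ".svg", ".txt", ".rtf", ".log",
    ".html", ".xml", ".json", ".yaml",
}

def check_file(path: str, size_bytes: int) -> list[str]:
    """Return a list of problems; an empty list means the file looks ingestible."""
    problems = []
    ext = os.path.splitext(path)[1].lower()
    if ext not in SUPPORTED_EXTENSIONS:
        problems.append(f"unsupported extension: {ext or '(none)'}")
    if size_bytes > MAX_FILE_SIZE:
        problems.append(f"file exceeds 50MB limit ({size_bytes} bytes)")
    return problems
```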
Managing and Sharing Datasets
Share Datasets
Enter teammates' email addresses and confirm to grant access. Sharing ensures your team works from the same source of truth.
Delete Datasets
Remove datasets you no longer need to maintain organization and optimize token usage.
Copy Datasets
Duplicate dataset files for different use cases or teams without re-ingesting content.
View Details
See ingested files, file counts, and dataset metadata at a glance.
Understanding Tokens and Usage
Training Tokens
Purpose: Ingest data into datasets
When Consumed: When uploading and processing files
Use Case: Converting content into vector embeddings
Inference Tokens
Purpose: Generate AI responses
When Consumed: When submitting prompts and generating text
Use Case: Querying datasets and producing content
Monitoring Token Usage
Navigate to Settings → Tokens to view:
Current subscription plan
Inference tokens used and remaining
Training tokens used and remaining
Monthly reset date
Important: Tokens reset on the first day of each month and do not roll over. Plan your usage accordingly to maximize value.
Best Practices for Dataset Creation
Avoid Bloated Datasets
Why It Matters
Overloading datasets with too many files can confuse models. RAG is limited by the model's context window, so only a finite amount of information can be processed per prompt.
Keep datasets focused and purposeful
Only include files directly relevant to the dataset's intended use case
Regularly review and remove outdated or unnecessary content
Consider context window limitations when determining dataset size
Maintain Specificity
Why It Matters
Mixing unrelated data can confuse the model and lead to irrelevant or incorrect responses.
Bad Practice
Creating a dataset with both tank manuals and airplane manuals
Problem: If you ask "Tell me how to fix Model XY," and both a tank and a plane share that model number, the AI might pull information about the wrong vehicle.
Good Practice
Create separate datasets: tank-manuals-2025 and aircraft-manuals-2025
Create separate datasets for different product lines, departments, or subject areas
Use clear, descriptive dataset names that reflect specific content
Removing unnecessary content ensures the model focuses on relevant information and improves retrieval accuracy.
Preprocessing Steps
Remove cover pages, table of contents, appendices, or sections without substantive information
Clean up formatting issues that might interfere with text extraction
Ensure images are clear and properly labeled if they contain important information
Verify text is machine-readable (avoid scanned documents with poor OCR quality)
For Large Documents
Split large files into smaller chunks (e.g., a 100-page document → five 20-page sections)
Use the Summarization Plugin to condense content before ingestion
For technical manuals, create separate datasets for different sections
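The splitting advice above can be sketched as a simple character-based chunker. Character counts are only a rough proxy for tokens (roughly 4 characters per token for English text), and all names here are illustrative:

```python
def chunk_text(text: str, max_chars: int = 4000, overlap: int = 200) -> list[str]:
    """Split text into overlapping chunks for separate ingestion.

    A small overlap preserves context across chunk boundaries; tune
    max_chars to your model's context window.
    """
    if overlap >= max_chars:
        raise ValueError("overlap must be smaller than max_chars")
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + max_chars, len(text))
        chunks.append(text[start:end])
        if end == len(text):
            break
        start = end - overlap  # step back so chunks share some context
    return chunks
```

In practice you would split on section or paragraph boundaries rather than raw character offsets, but the overlap idea carries over.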
Prioritize Quality Over Quantity
Quality Guidelines
Prioritize high-quality, authoritative sources over volume
Verify documents are current and accurate before ingestion
Remove duplicate or redundant information
Ensure text is machine-readable and well-formatted
Maintain Dataset Hygiene
Ongoing Maintenance
Regularly audit datasets for outdated information
Update datasets when source materials change
Document what each dataset contains and its intended use case
Use consistent naming conventions across your organization
Archive or delete datasets that are no longer needed
Understanding RAG Technology
What Are Vector Datasets?
Ask Sage Datasets are vector databases that store your content as embeddings: numerical representations that capture the semantic meaning of your data. Unlike traditional databases that store raw files, vector databases enable:
Semantic Search
Find information based on meaning, not just keyword matching
Rapid Retrieval
Query large volumes of data efficiently
Context-Aware Responses
Generate answers that understand relationships between concepts
Important: Ask Sage datasets store embeddings, not the original files. This design optimizes for search and retrieval rather than file storage.
How Embeddings Work
1. Tokenization
Your text is broken into tokens (units of text ranging from single characters to whole words)
Example: "I love programming!" is typically split into four tokens: ["I", " love", " programming", "!"] (exact splits vary by tokenizer)
2. Embedding Generation
Each token is mapped to a numerical vector that captures its semantic meaning
These vectors represent relationships and context between words and concepts
3. Vector Storage
Embeddings are stored in the vector database, optimized for similarity search
This process consumes Training Tokens based on content volume
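The three steps above can be illustrated with a toy pipeline. The hash-based embed() below is only a stand-in for a real trained embedding model (it captures word overlap, not meaning); it exists to show the text → vector → store flow, and every name in it is illustrative:

```python
import hashlib
import math
import re

def embed(text: str, dims: int = 256) -> list[float]:
    """Toy embedding: hash each word into a bucket, count hits, L2-normalize.

    A real system uses a trained embedding model; this only sketches
    the shape of the pipeline.
    """
    vec = [0.0] * dims
    for token in re.findall(r"[a-z0-9]+", text.lower()):
        bucket = int(hashlib.md5(token.encode()).hexdigest(), 16) % dims
        vec[bucket] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

# "Vector storage": keep (passage, embedding) pairs for later similarity search.
vector_store: list[tuple[str, list[float]]] = []

def ingest(passage: str) -> None:
    vector_store.append((passage, embed(passage)))

ingest("The warranty covers parts and labor for two years.")
ingest("Model X ships with a quick-start guide.")
```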
What Is RAG (Retrieval Augmented Generation)?
Retrieval Augmented Generation (RAG) is the core technology that makes Ask Sage Datasets powerful. RAG enhances AI responses by combining your ingested data with the model's capabilities through a two-step process:
Step 1: Retrieve Relevant Context
When you submit a prompt with a dataset selected:
1. Your query is converted into an embedding vector
2. The system searches the vector database for semantically similar content
3. The most relevant passages, facts, and information are retrieved
4. This context is ranked by relevance to your specific query
Step 2: Augment the Prompt with Retrieved Data
The retrieved context is integrated with your original prompt:
1. Your original prompt is combined with relevant dataset excerpts
2. This augmented prompt provides the AI model with specific, grounded information
3. The model generates a response based on both its training and your retrieved data
4. The result is a contextually accurate answer grounded in your sources
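The two RAG steps can be sketched end-to-end over a toy in-memory store. The hash-based embedding stands in for a real embedding model, the passages are invented, and in a real system the augmented prompt would be sent on to the generation model:

```python
import hashlib
import math
import re

def embed(text: str, dims: int = 256) -> list[float]:
    """Toy word-overlap embedding; a real system uses a trained model."""
    vec = [0.0] * dims
    for token in re.findall(r"[a-z0-9]+", text.lower()):
        vec[int(hashlib.md5(token.encode()).hexdigest(), 16) % dims] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def cosine(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))  # inputs are unit vectors

store = [(p, embed(p)) for p in [
    "Model X warranty: two years, parts and labor.",
    "Model X setup: connect power, then pair the remote.",
    "Office hours are 9am to 5pm Eastern.",
]]

def retrieve(query: str, k: int = 2) -> list[str]:
    """Step 1: rank stored passages by similarity to the query embedding."""
    q = embed(query)
    ranked = sorted(store, key=lambda item: cosine(q, item[1]), reverse=True)
    return [passage for passage, _ in ranked[:k]]

def augment(query: str) -> str:
    """Step 2: combine the retrieved context with the original prompt."""
    context = "\n".join(retrieve(query))
    return f"Context:\n{context}\n\nQuestion: {query}"
```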
Benefits of RAG with Vector Datasets
Up-to-date Information
Overcome model training cutoff dates by referencing your current data
Domain Expertise
Provide specialized knowledge not present in general AI training
Reduced Hallucinations
Ground responses in verifiable sources rather than model speculation
Transparency
Use explainability features to see which dataset content informed each response
Flexibility
Use the same dataset with different models without re-ingesting data
RAG vs. Traditional Approaches
No Dataset (Base Model)
How It Works: Model relies only on training data
Limitations: Outdated information, no organization-specific knowledge
Attachments Only
How It Works: Files processed per conversation
Limitations: No reusability, inefficient for recurring needs
RAG with Datasets (Recommended)
How It Works: Semantic search retrieves relevant context
Limitations: Requires initial ingestion, consumes training tokens
Practical RAG Example
Scenario
You've ingested your company's product documentation into a dataset called product-docs-2025.
Without RAG (No Dataset)
Prompt:
"What are the warranty terms for our Model X product?"
Response:
Generic information about typical warranties, possibly inaccurate or irrelevant to your specific product.
With RAG (Dataset Selected)
Prompt:
"What are the warranty terms for our Model X product?"
Process:
Vector search retrieves relevant warranty sections from your documentation
Response:
Specific warranty terms from your actual product documentation, with citations
Explainability:
Shows which document sections were referenced
This demonstrates how RAG transforms generic AI into a knowledgeable assistant grounded in your organization's specific information.
Technical Considerations
Vector Database Optimization
Best Use Cases
Vector databases excel at semantic search and retrieval
Ideal for unstructured text, documents, and narrative content
Perfect for finding conceptually similar information
Limitations
Not designed for large tabular data (spreadsheets)
For tabular data analysis, attach spreadsheets directly to prompts; Ask Sage will use Python libraries to analyze them
Token Efficiency Tips
Selecting "None" for datasets saves inference tokens when dataset context isn't needed
Only select relevant datasets to optimize token usage and response quality
Use the Show Explainability feature to verify which dataset content was used
Monitor token usage regularly in Settings → Tokens
CUI Compliance
Important: The Live feature is not CUI compliant and cannot be used with CUI-labeled datasets. Ensure proper classification when creating datasets containing sensitive information.