Organize, ingest, and leverage your organization's content with vector-powered datasets. Ground AI responses in your specific sources for accurate, contextually relevant results.
Ask Sage Datasets are organized collections of your organization's content—including text, images, and audio—that you ingest into the platform to ground AI-generated responses in your specific sources. Datasets enable you to ingest data once and reuse it across multiple prompts, models, and team members, ensuring consistent, accurate, and contextually relevant results.
Accuracy & Relevance
Generate responses grounded in your specific materials rather than generic web content
Efficiency & Reuse
Ingest content once and reuse across different prompts, models, and use cases
Team Collaboration
Share datasets to establish a single source of truth across your organization
Advanced Search
Use the Search Datasets plugin to quickly locate specific facts and information
Model Flexibility
Use datasets with any GenAI model—never locked into a single provider
CUI Support
Classify datasets as Unclassified or CUI for controlled information
Getting Started
Selecting Datasets in Prompt Settings
1. Access Data & Settings: Click the Data & Settings button below your prompt window
2. Select Dataset(s): Choose one or multiple datasets for context
3. Submit Prompt: Send your prompt with dataset context
Pro Tip: Selected datasets appear under your prompt window, making it easy to see what context you're using for each query.
Attachments vs. Datasets
Understanding the difference helps you choose the right approach:
Chat Attachments
One-time use only
Single conversation
Not shareable
Quick, ad-hoc analysis
Datasets
Permanent reusable storage
Available across all prompts
Shareable with team
Recurring reference material
Important: Files you attach in a chat are for one-off use only. To save them permanently, you must explicitly ingest them into a dataset.
Creating and Ingesting Datasets
1. Create Dataset: Click Prompt Tools → Data & Settings → Create New Dataset
2. Name & Classify: Enter a name using alphanumeric characters and hyphens, then classify the dataset as Unclassified or CUI
3. Upload Files: Drag and drop files, or browse to select them (max 50MB each)
4. Ingest: Click Ingest Files and wait for the success confirmation
Example: A dataset name like product-docs-2025 or research-papers-q1 helps organize your content effectively.
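If you generate dataset names in a script, you can check them against the naming rule before creating the dataset. The sketch below is a minimal illustration, assuming names may contain only ASCII letters, digits, and hyphens; this guide does not state length limits or other rules, and `is_valid_dataset_name` is a hypothetical helper, not part of Ask Sage.

```python
import re

# Assumption: names may contain only ASCII letters, digits, and hyphens.
# Length limits and other platform rules are not covered here.
NAME_PATTERN = re.compile(r"[A-Za-z0-9-]+")


def is_valid_dataset_name(name: str) -> bool:
    """Return True if the name uses only letters, digits, and hyphens."""
    return NAME_PATTERN.fullmatch(name) is not None


if __name__ == "__main__":
    for candidate in ["product-docs-2025", "research-papers-q1", "my dataset!"]:
        print(candidate, is_valid_dataset_name(candidate))
```

Names containing spaces or punctuation, such as "my dataset!", fail this check and would need to be renamed first.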
CUI Classification: Classifying a dataset as CUI requires a CAC/PIV card or special activation. Contact support@asksage.ai to request CUI access.
Supported File Formats
Ask Sage supports a wide range of file types. Common formats include:
Documents: .pdf, .doc, .docx (Word documents)
Spreadsheets: .xls, .xlsx, .csv (tabular data)
Presentations: .ppt, .pptx (slide decks)
Images & Media: .jpg, .png, .gif, .svg (visual content)
Data Formats: .json, .xml, .html, .yaml (markup languages)
Text Files: .txt, .rtf, .log (plain text)
Maximum File Size: 50MB per file. Images embedded in documents won't be extracted—upload them separately.
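Before uploading a large batch, it can help to flag files that exceed the size limit or fall outside the common formats listed above. The following is a rough local pre-check, not an Ask Sage tool. It assumes "50MB" means 50 × 1024 × 1024 bytes, and it uses only the extensions named in this guide, so a flagged extension may still be supported. `check_file` and `check_folder` are hypothetical helper names.

```python
from pathlib import Path

# Assumption: "50MB" is read as 50 * 1024 * 1024 bytes; the platform may
# count megabytes differently.
MAX_BYTES = 50 * 1024 * 1024

# Only the extensions this guide lists as common formats; other types
# may also be supported.
COMMON_EXTENSIONS = {
    ".pdf", ".doc", ".docx", ".xls", ".xlsx", ".csv", ".ppt", ".pptx",
    ".jpg", ".png", ".gif", ".svg", ".json", ".xml", ".html", ".yaml",
    ".txt", ".rtf", ".log",
}


def check_file(name: str, size_bytes: int) -> list[str]:
    """Return a list of problems found for one file (empty if none)."""
    problems = []
    if Path(name).suffix.lower() not in COMMON_EXTENSIONS:
        problems.append("extension not in the common-formats list")
    if size_bytes > MAX_BYTES:
        problems.append("larger than 50MB")
    return problems


def check_folder(folder: str) -> dict[str, list[str]]:
    """Check every file in a folder; return only the files with problems."""
    report = {}
    for path in Path(folder).iterdir():
        if path.is_file():
            problems = check_file(path.name, path.stat().st_size)
            if problems:
                report[path.name] = problems
    return report
```

Calling `check_folder("to_ingest")` on a local folder returns a dictionary of file names and the reasons each one was flagged, which you can review before uploading.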
Managing Your Datasets
Share
Collaborate by sharing datasets with team members
Copy
Duplicate for different use cases without re-ingesting
View Details
See metadata, file counts, and ingestion status
Delete
Remove datasets you no longer need
Understanding Tokens and Usage
Training vs. Inference Tokens
Training Tokens
Used when ingesting data
Converting to embeddings
Building your datasets
Inference Tokens
Used when querying
Generating responses
Daily interactions
Monitor Usage: Check Settings → Tokens to track your consumption and remaining quota. Tokens reset monthly and do not roll over.
Best Practices for Dataset Creation
Key Principles
Follow these best practices to get the most value from your datasets:
Stay Focused
Create purpose-built datasets for specific use cases rather than large "catch-all" collections
Clear dataset purpose
Relevant content only
Easier to manage
Preprocess Files
Clean and prepare your documents before ingestion for better results
Remove unnecessary pages
Fix formatting issues
Verify OCR quality
Prioritize Quality
Use high-quality, authoritative sources rather than maximizing quantity
Remove duplicates
Current information
Verified accuracy
Maintain Hygiene
Keep your datasets fresh and well-organized over time
Regular audits
Update outdated content
Consistent naming
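One way to act on the "Remove duplicates" guidance is to compare file contents by hash before ingestion. The sketch below is a local helper, not an Ask Sage feature. `group_duplicates` is a name made up for illustration, and it only catches byte-identical files, not near-duplicates such as two exports of the same document.

```python
import hashlib


def group_duplicates(contents: dict[str, bytes]) -> list[list[str]]:
    """Group file names whose contents are byte-identical.

    `contents` maps a file name to that file's raw bytes.
    """
    by_digest: dict[str, list[str]] = {}
    for name, data in contents.items():
        digest = hashlib.sha256(data).hexdigest()
        by_digest.setdefault(digest, []).append(name)
    # Keep only groups with more than one file, sorted for stable output.
    return sorted(sorted(names) for names in by_digest.values() if len(names) > 1)
```

To run it on a folder, build the input with something like `{p.name: p.read_bytes() for p in Path("to_ingest").iterdir() if p.is_file()}` (with `from pathlib import Path`), then keep one file from each group.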
Split Large Documents: For documents over 100 pages, consider splitting them into smaller chunks (e.g., 20-page sections) for better retrieval and accuracy.
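To plan a split, you can compute the page ranges for each section first and then export those ranges with whatever PDF or document tool you already use. This is a minimal sketch; the 20-page default comes from the example above, and `page_ranges` is a hypothetical helper name.

```python
def page_ranges(total_pages: int, section_size: int = 20) -> list[tuple[int, int]]:
    """Return 1-based, inclusive (first, last) page ranges covering the document."""
    if total_pages < 0 or section_size < 1:
        raise ValueError("total_pages must be >= 0 and section_size >= 1")
    return [
        (start, min(start + section_size - 1, total_pages))
        for start in range(1, total_pages + 1, section_size)
    ]


if __name__ == "__main__":
    # A 45-page document splits into pages 1-20, 21-40, and 41-45.
    print(page_ranges(45))
```

The last section is simply shorter when the page count is not a multiple of the section size.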
Understanding RAG Technology
What Is RAG (Retrieval Augmented Generation)?
RAG is the core technology powering Ask Sage Datasets. It retrieves the passages from your ingested data that are most relevant to a prompt and supplies them to the AI model as context, so the response is grounded in your sources.
1. Retrieve: Your query is converted to an embedding and used to search the dataset
2. Rank: Retrieved passages are ranked by relevance to the query
3. Augment: The top-ranked passages are combined with your prompt
4. Generate: The AI model produces a response grounded in that context
Result: Instead of generic answers, you get responses specifically grounded in your organization's data and knowledge.
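The four steps above can be sketched in a few lines of Python. This is a toy illustration of the general RAG pattern, not Ask Sage's implementation. It assumes a word-count "embedding" in place of a real embedding model, the sample passages are made up, and the final generation step is left as a comment because it depends on the model you choose.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding: lowercase word counts (a stand-in for a real model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve_and_rank(query: str, passages: list[str], top_k: int = 2) -> list[str]:
    """Steps 1-2: embed the query, score every passage, keep the best top_k."""
    q = embed(query)
    ranked = sorted(passages, key=lambda p: cosine(q, embed(p)), reverse=True)
    return ranked[:top_k]


def augment(query: str, context: list[str]) -> str:
    """Step 3: combine the retrieved passages with the user's prompt."""
    joined = "\n".join(f"- {c}" for c in context)
    return f"Answer using only this context:\n{joined}\n\nQuestion: {query}"


# Made-up sample passages standing in for an ingested dataset.
passages = [
    "The VPN client must be updated every 90 days.",
    "Expense reports are due on the fifth business day of each month.",
    "Password resets are handled by the IT help desk.",
]
query = "When are expense reports due?"
prompt = augment(query, retrieve_and_rank(query, passages, top_k=1))
print(prompt)  # Step 4 would send this prompt to the chosen model.
```

A production system replaces `embed` with a learned embedding model and the sorted scan with a vector index, but the flow of retrieve, rank, augment, and generate stays the same.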
Why RAG with Datasets Wins
Reduce Hallucinations
Responses draw on the content you've provided, which reduces the chance of invented facts
Stay Current
Supplement the model's training data with your latest information, including material newer than its training cutoff
Add Expertise
Inject your domain-specific knowledge into responses
Full Transparency
See exactly which sources informed each response
Technical Note: Datasets store embeddings (numerical representations), not original files. This enables fast semantic search across your entire collection.
Technical Considerations
Vector Database Optimization
Best For
Semantic search
Unstructured text
Document retrieval
Not For
Tabular data analysis
SQL queries
Structured databases
Workaround: For spreadsheets, attach them directly to prompts—Ask Sage will use Python to analyze the data.
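To see why tabular questions suit code better than semantic search, consider a question like "What is total revenue per region?" It requires exact arithmetic over every row, not the retrieval of a few similar passages. The sketch below shows that kind of analysis on made-up sample data; it illustrates the general approach and is not the code Ask Sage generates.

```python
import csv
import io
from collections import defaultdict

# Made-up sample data standing in for an attached spreadsheet export.
SAMPLE_CSV = """region,quarter,revenue
East,Q1,120
East,Q2,135
West,Q1,90
West,Q2,110
"""


def total_by(rows, key: str, value: str) -> dict[str, float]:
    """Sum a numeric column grouped by another column."""
    totals: dict[str, float] = defaultdict(float)
    for row in rows:
        totals[row[key]] += float(row[value])
    return dict(totals)


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
print(total_by(rows, "region", "revenue"))
```

Every row contributes to the answer here, which is what an embedding search over text chunks cannot guarantee.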