Intelligent Data Management

Ask Sage Datasets

Organize, ingest, and leverage your organization's content with vector-powered datasets. Ground AI responses in your own sources for more accurate, contextually relevant results.

Table of Contents
  1. Overview
  2. Getting Started
  3. Understanding Tokens and Usage
  4. Best Practices for Dataset Creation
  5. Understanding RAG Technology
  6. Technical Considerations

Overview


Getting Started

Selecting Datasets in Prompt Settings

  1. Access Data & Settings: Click the Data & Settings button below your prompt window.
  2. Select Dataset(s): Choose one or multiple datasets for context.
  3. Submit Prompt: Send your prompt with dataset context.

Pro Tip: Selected datasets appear under your prompt window, making it easy to see what context you're using for each query.

Attachments vs. Datasets

Understanding the difference helps you choose the right approach:

Chat Attachments

One-time use only
Single conversation
Not shareable
Quick, ad-hoc analysis

Datasets

Permanent reusable storage
Available across all prompts
Shareable with team
Recurring reference material

Important: Files you attach in a chat are for one-off use only. To save them permanently, you must explicitly ingest them into a dataset.

Creating and Ingesting Datasets

  1. Create Dataset: Click Prompt Tools → Data & Settings → Create New Dataset.
  2. Name & Classify: Enter a name (letters, digits, and hyphens only) and classify the dataset as Unclassified or CUI.
  3. Upload Files: Drag and drop, or browse to select files (max 50MB each).
  4. Ingest: Click Ingest Files and wait for the success confirmation.

Example: A dataset name like product-docs-2025 or research-papers-q1 helps organize your content effectively.
CUI Classification: Requires CAC/PIV card or special activation. Contact support@asksage.ai for CUI access.
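
The naming rule above can be checked before you create a dataset. The following is a minimal sketch, assuming "alphanumeric/hyphens" means ASCII letters, digits, and hyphens only. The helper name `is_valid_dataset_name` is hypothetical and not part of Ask Sage; the platform may enforce additional rules (such as length limits) that are not described here.

```python
import re

# Assumption: only ASCII letters, digits, and hyphens are allowed.
# Any further rules Ask Sage applies (length, leading hyphens) are not covered.
_NAME_PATTERN = re.compile(r"[A-Za-z0-9-]+")


def is_valid_dataset_name(name: str) -> bool:
    """Return True if the name contains only letters, digits, and hyphens."""
    return _NAME_PATTERN.fullmatch(name) is not None


for candidate in ["product-docs-2025", "research-papers-q1", "my dataset", "notes_v2"]:
    print(candidate, is_valid_dataset_name(candidate))
```

Names containing spaces or underscores fail this check, so renaming them with hyphens before creation avoids a rejected form.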

Supported File Formats

Ask Sage supports a wide range of file types. Common formats include:

Documents

.pdf
.doc, .docx
Word documents

Spreadsheets

.xls, .xlsx
.csv
Tabular data

Presentations

.ppt
.pptx
Slide decks

Images & Media

.jpg, .png
.gif, .svg
Visual content

Data Formats

.json, .xml
.html, .yaml
Markup languages

Text Files

.txt, .rtf
.log files
Plain text

Maximum File Size: 50MB per file. Images embedded in documents won't be extracted, so upload them separately.
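
If you prepare uploads in bulk, a local pre-check can catch files that are likely to be rejected. This is a sketch under stated assumptions: 50MB is read as 50 * 1024 * 1024 bytes, the extension set contains only the formats named above (Ask Sage may accept others), and `check_file` is a hypothetical helper, not an Ask Sage function.

```python
from pathlib import Path

# Assumption: 50MB is interpreted as 50 * 1024 * 1024 bytes; the platform may count differently.
MAX_FILE_BYTES = 50 * 1024 * 1024

# Only the extensions named in this section; other formats may also be supported.
LISTED_EXTENSIONS = {
    ".pdf", ".doc", ".docx", ".xls", ".xlsx", ".csv", ".ppt", ".pptx",
    ".jpg", ".png", ".gif", ".svg", ".json", ".xml", ".html", ".yaml",
    ".txt", ".rtf", ".log",
}


def check_file(filename: str, size_bytes: int) -> list[str]:
    """Return a list of problems; an empty list means both checks passed."""
    problems = []
    suffix = Path(filename).suffix.lower()
    if suffix not in LISTED_EXTENSIONS:
        problems.append(f"extension {suffix or '(none)'} is not in the listed formats")
    if size_bytes > MAX_FILE_BYTES:
        problems.append(f"size {size_bytes} bytes exceeds the 50MB limit")
    return problems


print(check_file("handbook.pdf", 12_000_000))
print(check_file("archive.zip", 60 * 1024 * 1024))
```

For files on disk, `Path(path).stat().st_size` gives the size in bytes to pass as the second argument.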

Managing Your Datasets

Share

Collaborate by sharing datasets with team members

Copy

Duplicate for different use cases without re-ingesting

View Details

See metadata, file counts, and ingestion status

Delete

Remove datasets you no longer need


Understanding Tokens and Usage

Training vs. Inference Tokens

Training Tokens

Used when ingesting data
Converting to embeddings
Building your datasets

Inference Tokens

Used when querying
Generating responses
Daily interactions

Monitor Usage: Check Settings → Tokens to track your consumption and remaining quota. Tokens reset monthly and do not roll over.
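
To make the two token types and the monthly reset concrete, here is an illustrative ledger. It is a sketch only: the `TokenBudget` class, the separate quotas for each token type, and the quota figures are assumptions made for the example, and the authoritative numbers are the ones shown in Settings → Tokens.

```python
from dataclasses import dataclass


@dataclass
class TokenBudget:
    """Hypothetical monthly budget; the quota figures are made up for illustration."""
    training_quota: int
    inference_quota: int
    training_used: int = 0
    inference_used: int = 0

    def record_ingest(self, tokens: int) -> None:
        # Ingesting files into a dataset consumes training tokens.
        self.training_used += tokens

    def record_query(self, tokens: int) -> None:
        # Prompts and generated responses consume inference tokens.
        self.inference_used += tokens

    def remaining(self) -> tuple[int, int]:
        """Return (training tokens left, inference tokens left)."""
        return (self.training_quota - self.training_used,
                self.inference_quota - self.inference_used)

    def start_new_month(self) -> None:
        # Usage resets to zero; unused tokens are not carried over.
        self.training_used = 0
        self.inference_used = 0


budget = TokenBudget(training_quota=1000, inference_quota=500)
budget.record_ingest(300)
budget.record_query(120)
print(budget.remaining())
budget.start_new_month()
print(budget.remaining())
```

Because nothing rolls over, the quotas after a reset equal the original quotas regardless of how many tokens were left unused.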

Best Practices for Dataset Creation

Key Principles

Follow these best practices to get the most value from your datasets:

Stay Focused

Create purpose-built datasets for specific use cases rather than large "catch-all" collections

Clear dataset purpose
Relevant content only
Easier to manage

Preprocess Files

Clean and prepare your documents before ingestion for better results

Remove unnecessary pages
Fix formatting issues
Verify OCR quality

Prioritize Quality

Use high-quality, authoritative sources rather than maximizing quantity

Remove duplicates
Current information
Verified accuracy

Maintain Hygiene

Keep your datasets fresh and well-organized over time

Regular audits
Update outdated content
Consistent naming

Split Large Documents: For documents over 100 pages, consider splitting them into smaller chunks (e.g., 20-page sections) for better retrieval and accuracy.
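
The splitting guideline can be planned with a few lines of arithmetic before you touch the file. This sketch only computes page ranges; it does not split a PDF. The function name `split_page_ranges` and its defaults (20-page sections, 100-page threshold) are illustrative choices taken from the guideline above, not fixed platform limits.

```python
def split_page_ranges(total_pages: int, section_size: int = 20, threshold: int = 100):
    """Return 1-based inclusive (start, end) page ranges.

    Documents at or under the threshold are returned as a single range.
    """
    if total_pages < 1 or section_size < 1:
        raise ValueError("total_pages and section_size must be positive")
    if total_pages <= threshold:
        return [(1, total_pages)]
    return [
        (start, min(start + section_size - 1, total_pages))
        for start in range(1, total_pages + 1, section_size)
    ]


print(split_page_ranges(45))
print(split_page_ranges(130))
```

The resulting ranges can be fed to whichever PDF tool you already use to export each section as its own file. Splitting on chapter or section boundaries, where they exist, usually keeps related content together better than fixed page counts.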

Understanding RAG Technology

Why RAG with Datasets Wins

Reduce Hallucinations

Responses are grounded in the content you've provided, which reduces the chance of invented facts

Stay Current

Supplement the model's training cutoff with your latest information

Add Expertise

Inject your domain-specific knowledge into responses

Full Transparency

See exactly which sources informed each response

Technical Note: Datasets store embeddings (numerical representations), not original files. This enables fast semantic search across your entire collection.
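
To illustrate what semantic search over embeddings means, here is a toy example that ranks stored vectors by cosine similarity to a query vector. It is a sketch for intuition only: the vectors are hand-written, the chunk names are hypothetical, and cosine similarity is a common choice rather than a statement of which metric or index Ask Sage uses internally.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_matches(query_vec, store, k=2):
    """Rank stored chunk ids by similarity to the query vector, best first."""
    scored = [(cosine_similarity(query_vec, vec), chunk_id) for chunk_id, vec in store.items()]
    scored.sort(reverse=True)
    return [chunk_id for _, chunk_id in scored[:k]]


# Made-up 3-dimensional vectors. Real embeddings have hundreds or thousands of
# dimensions and are produced by an embedding model, not written by hand.
store = {
    "vacation-policy": [0.9, 0.1, 0.0],
    "expense-policy": [0.7, 0.6, 0.1],
    "server-runbook": [0.0, 0.2, 0.9],
}
query = [0.8, 0.2, 0.0]
print(top_matches(query, store))
```

The query vector points in nearly the same direction as the two policy vectors and away from the runbook vector, so the policies rank first. Matching by direction in vector space, rather than by shared keywords, is what lets retrieval find passages that are phrased differently from the prompt.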

Technical Considerations

Vector Database Optimization

Best For

Semantic search
Unstructured text
Document retrieval

Not For

Tabular data analysis
SQL queries
Structured databases

Workaround: For spreadsheets, attach them directly to prompts. Ask Sage will use Python to analyze the data.
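
The kind of question that suits direct analysis rather than vector retrieval is one with an exact numeric answer, such as a total per category. The sketch below shows that style of computation on a small, made-up CSV. The data and the function name `revenue_by_region` are hypothetical, and this is not the code Ask Sage generates.

```python
import csv
import io
from collections import defaultdict

# Hypothetical spreadsheet contents, exported as CSV for the example.
CSV_TEXT = """region,quarter,revenue
East,Q1,1200
West,Q1,950
East,Q2,1350
West,Q2,1100
"""


def revenue_by_region(csv_text: str) -> dict[str, int]:
    """Sum the revenue column per region."""
    totals = defaultdict(int)
    for row in csv.DictReader(io.StringIO(csv_text)):
        totals[row["region"]] += int(row["revenue"])
    return dict(totals)


print(revenue_by_region(CSV_TEXT))
```

An aggregation like this reads every row, whereas semantic search returns only the chunks that look most similar to the prompt, which is why tabular analysis is listed under "Not For".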

Token Efficiency Tips

Optimize Usage

Select only relevant datasets
Use "None" when not needed
Monitor in Settings → Tokens

CUI Compliance

The Live feature is not CUI-safe
Classify datasets correctly
Contact support for CUI access (CAC/PIV)


Copyright © 2026 Ask Sage Inc. All Rights Reserved. Ask Sage is a BigBear.ai company.