πŸ“Š Intelligent Data Management

Ask Sage Datasets

Organize, ingest, and leverage your organization's content with vector-powered datasets. Ground AI responses in your specific sources for accurate, contextually relevant results.


πŸ“‘ Table of Contents
  1. Overview
  2. Getting Started
    1. Selecting Datasets in Prompt Settings
    2. Understanding Attachments vs. Datasets
    3. Creating and Ingesting Datasets
    4. Supported File Formats (Not Exhaustive)
    5. Managing and Sharing Datasets
  3. Understanding Tokens and Usage
  4. Best Practices for Dataset Creation
    1. Avoid Bloated Datasets
    2. Maintain Specificity
    3. Preprocess Your Files
    4. Prioritize Quality Over Quantity
    5. Maintain Dataset Hygiene
  5. Understanding RAG Technology
    1. What Are Vector Datasets?
    2. How Embeddings Work
    3. What Is RAG (Retrieval Augmented Generation)?
    4. Benefits of RAG with Vector Datasets
    5. RAG vs. Traditional Approaches
    6. Practical RAG Example
  6. Technical Considerations

Overview

Ask Sage Datasets are organized collections of your organization's content (text, images, and audio) that you ingest into the platform to ground AI-generated responses in your specific sources. Datasets enable you to ingest data once and reuse it across multiple prompts, models, and team members, ensuring consistent, accurate, and contextually relevant results.

🎯

Accuracy & Relevance

Generate responses grounded in your specific materials rather than generic web content

⚑

Efficiency & Reuse

Ingest content once and reuse across different prompts, models, and use cases

🀝

Team Collaboration

Share datasets to establish a single source of truth across your organization

πŸ”

Advanced Search

Use Search Datasets plugin to quickly locate specific facts and information

πŸ”„

Model Flexibility

Use datasets with any GenAI model, never locked into a single provider

πŸ›‘οΈ

CUI Support

Classify datasets as Unclassified or CUI for controlled information


Getting Started

Selecting Datasets in Prompt Settings

Step 1: Access Data & Settings

Navigate to Data & Settings to select datasets that will provide context for your prompt:

  • Click the Data & Settings button or the Folder Icon below the prompt window
  • Select the dataset(s) you want to reference
  • Choose multiple datasets or select None if no dataset context is needed
  • Selected datasets will appear under the prompt window for easy identification

Understanding Attachments vs. Datasets

πŸ“Ž

Chat Attachments

Persistence: One-time use only
Scope: Single conversation
Sharing: Not shareable
Use Case: Quick, ad-hoc analysis
Token Type: Inference tokens
vs
πŸ“Š

Datasets

Persistence: Permanent storage for reuse
Scope: Available across all prompts
Sharing: Can be shared with team
Use Case: Recurring reference material
Token Type: Training + Inference tokens
πŸ’‘
Important: Files you attach in a chat are for one-off use and are not automatically saved to a dataset unless you explicitly ingest them.

Creating and Ingesting Datasets

Step 2: Create a New Dataset

  1. Click Prompt Tools → Data & Settings → Upload New Files
  2. Click Create New Dataset
  3. Enter a dataset name (alphanumeric characters and hyphens only, e.g., my-dataset-2025)
  4. Classify the dataset as Unclassified or CUI (Controlled Unclassified Information)
  5. Click Create Dataset
⚠️
CUI Classification: CUI classification requires a CAC/PIV card or special activation. Contact support@asksage.ai to request this feature or reach out to your organization's Administrator.
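As a sanity check before creating a dataset, the naming rule above can be validated with a short pattern. This is an illustrative client-side check, not part of the Ask Sage platform; the function name and pattern are assumptions based on the rule as stated (alphanumeric characters and hyphens only).

```python
import re

# Illustrative check for the naming rule described above: alphanumeric
# characters and hyphens only (e.g., my-dataset-2025). Not an official
# Ask Sage validator.
DATASET_NAME_PATTERN = re.compile(r"^[A-Za-z0-9]+(?:-[A-Za-z0-9]+)*$")

def is_valid_dataset_name(name: str) -> bool:
    """Return True if the name uses only letters, digits, and hyphens."""
    return bool(DATASET_NAME_PATTERN.match(name))
```

For example, `is_valid_dataset_name("my-dataset-2025")` passes, while names containing spaces or underscores do not.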
Step 3: Upload Files to Dataset

  1. Select your dataset from the dropdown list
  2. Drag and drop files into the designated box, or click to browse your local machine
  3. Review the file list and remove any unwanted files using the garbage bin icon
  4. Click Ingest Files to begin the upload process
  5. Look for the white checkmark and "Successfully Imported" message for each file

Supported File Formats (Not Exhaustive)

πŸ“„

Documents

.pdf .doc .docx
πŸ“Š

Spreadsheets

.xls .xlsx .csv
πŸ“½οΈ

Presentations

.ppt .pptx
πŸ–ΌοΈ

Images

.jpg .png .gif .svg
πŸ“

Text Files

.txt .rtf .log
πŸ”§

Markup & Data

.html .xml .json .yaml
πŸ“
Maximum File Size: 50MB per file
πŸ“Έ
Image Handling: Images embedded in text documents will not be extracted automatically; upload images separately. If you don't see a file type you need supported, email support@asksage.ai.

Managing and Sharing Datasets

🀝

Share Datasets

Enter teammates' email addresses and confirm to grant access. Sharing ensures your team works from the same source of truth.

πŸ—‘οΈ

Delete Datasets

Remove datasets you no longer need to maintain organization and optimize token usage.

πŸ“‹

Copy Datasets

Duplicate dataset files for different use cases or teams without re-ingesting content.

πŸ“Š

View Details

See ingested files, file counts, and dataset metadata at a glance.


Understanding Tokens and Usage

πŸŽ“

Training Tokens

Purpose: Ingest data into datasets
When Consumed: When uploading and processing files
Use Case: Converting content into vector embeddings
⚑

Inference Tokens

Purpose: Generate AI responses
When Consumed: When submitting prompts and generating text
Use Case: Querying datasets and producing content
πŸ“ˆ

Monitoring Token Usage

Navigate to Settings → Tokens to view:

πŸ“Š Current subscription plan
⚑ Inference tokens used and remaining
πŸŽ“ Training tokens used and remaining
πŸ“… Monthly reset date
⏰
Important: Tokens reset on the first day of each month and do not roll over. Plan your usage accordingly to maximize value.

Best Practices for Dataset Creation

Avoid Bloated Datasets

⚠️

Why It Matters

Overloading datasets with too many files can confuse models. RAG is limited by the model's context window, so only a finite amount of information can be processed per prompt.

βœ“ Keep datasets focused and purposeful
βœ“ Only include files directly relevant to the dataset's intended use case
βœ“ Regularly review and remove outdated or unnecessary content
βœ“ Consider context window limitations when determining dataset size

Maintain Specificity

🎯

Why It Matters

Mixing unrelated data can confuse the model and lead to irrelevant or incorrect responses.

❌ Bad Practice

Creating a dataset with both tank manuals and airplane manuals

Problem: If you ask "Tell me how to fix Model XY," and both a tank and a plane share that model number, the AI might pull information about the wrong vehicle.
βœ… Good Practice

Create separate datasets: tank-manuals-2025 and aircraft-manuals-2025

βœ“ Create separate datasets for different product lines, departments, or subject areas
βœ“ Use clear, descriptive dataset names that reflect specific content
βœ“ Avoid "catch-all" datasets covering multiple unrelated topics

Preprocess Your Files

πŸ”§

Why It Matters

Removing unnecessary content ensures the model focuses on relevant information and improves retrieval accuracy.

Preprocessing Steps
βœ“ Remove cover pages, table of contents, appendices, or sections without substantive information
βœ“ Clean up formatting issues that might interfere with text extraction
βœ“ Ensure images are clear and properly labeled if they contain important information
βœ“ Verify text is machine-readable (avoid scanned documents with poor OCR quality)
For Large Documents
βœ“ Split large files into smaller chunks (e.g., 100-page document → five 20-page sections)
βœ“ Use the Summarization Plugin to condense content before ingestion
βœ“ For technical manuals, create separate datasets for different sections
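One way to split a large file before ingestion is a simple overlapping chunker. A minimal sketch, assuming character-based chunks; the chunk size and overlap values are arbitrary illustrative choices, not Ask Sage defaults:

```python
def chunk_text(text: str, chunk_size: int = 2000, overlap: int = 200) -> list[str]:
    """Split text into overlapping character chunks for separate ingestion.

    The overlap keeps sentences that straddle a split point retrievable
    from at least one chunk.
    """
    if chunk_size <= overlap:
        raise ValueError("chunk_size must be larger than overlap")
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap
    return chunks
```

In practice you would chunk at natural boundaries (sections, pages, paragraphs) rather than raw character counts, but the overlap idea carries over.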

Prioritize Quality Over Quantity

⭐

Quality Guidelines

βœ“ Prioritize high-quality, authoritative sources over volume
βœ“ Verify documents are current and accurate before ingestion
βœ“ Remove duplicate or redundant information
βœ“ Ensure text is machine-readable and well-formatted

Maintain Dataset Hygiene

🧹

Ongoing Maintenance

βœ“ Regularly audit datasets for outdated information
βœ“ Update datasets when source materials change
βœ“ Document what each dataset contains and its intended use case
βœ“ Use consistent naming conventions across your organization
βœ“ Archive or delete datasets that are no longer needed

Understanding RAG Technology

What Are Vector Datasets?

Ask Sage Datasets are vector databases that store your content as embeddings: numerical representations that capture the semantic meaning of your data. Unlike traditional databases that store raw files, vector databases enable:

πŸ”

Semantic Search

Find information based on meaning, not just keyword matching

⚑

Rapid Retrieval

Query large volumes of data efficiently

🧠

Context-Aware Responses

Generate answers that understand relationships between concepts

πŸ’‘
Important: Ask Sage datasets store embeddings, not the original files. This design optimizes for search and retrieval rather than file storage.

How Embeddings Work

Step 1: Tokenization

Your text is broken into tokens (units of text ranging from single characters to whole words)

Example: "I love programming!" is typically split into 4 tokens: ["I", " love", " programming", "!"] (exact tokenization varies by model)
Step 2: Embedding Generation

Each token is mapped to a numerical vector that captures its semantic meaning

These vectors represent relationships and context between words and concepts

Step 3: Vector Storage

Embeddings are stored in the vector database, optimized for similarity search

This process consumes Training Tokens based on content volume
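The three steps above can be illustrated with a toy similarity search. The three-dimensional vectors below are made-up stand-ins; real embeddings are produced by an embedding model and have hundreds or thousands of dimensions.

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Similarity of two embedding vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy "vector store": text mapped to made-up 3-D embeddings.
store = {
    "tank maintenance schedule": [0.9, 0.1, 0.2],
    "aircraft engine overhaul": [0.1, 0.9, 0.3],
}

# Pretend this is the embedding of the query "how do I service the tank?"
query_vector = [0.85, 0.15, 0.25]

# Retrieval = pick the stored entry whose vector is most similar.
best = max(store, key=lambda text: cosine_similarity(store[text], query_vector))
# best == "tank maintenance schedule"
```

The query vector lands closest to the tank entry even though the query and the stored text share no exact keywords, which is the point of semantic search.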

What Is RAG (Retrieval Augmented Generation)?

Retrieval Augmented Generation (RAG) is the core technology that makes Ask Sage Datasets powerful. RAG enhances AI responses by combining your ingested data with the model's capabilities through a two-step process:

πŸ”

Step 1: Retrieve Relevant Context

When you submit a prompt with a dataset selected:

  1. Your query is converted into an embedding vector
  2. The system searches the vector database for semantically similar content
  3. The most relevant passages, facts, and information are retrieved
  4. This context is ranked by relevance to your specific query
✨

Step 2: Augment Prompt with Retrieved Data

The retrieved context is integrated with your original prompt:

  1. Your original prompt is combined with relevant dataset excerpts
  2. This augmented prompt provides the AI model with specific, grounded information
  3. The model generates a response based on both its training and your retrieved data
  4. The result is a contextually accurate answer grounded in your sources
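Putting both steps together, a minimal retrieve-then-augment loop might look like the sketch below. The keyword-count embedder and the prompt template are illustrative stand-ins, not Ask Sage internals.

```python
import math

def embed(text: str) -> list[float]:
    """Stand-in embedder: real systems call an embedding model here.
    This toy version counts keyword hits so the sketch stays runnable."""
    keywords = ["warranty", "model", "tank"]
    return [float(text.lower().count(k)) for k in keywords]

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a)) or 1.0
    nb = math.sqrt(sum(x * x for x in b)) or 1.0
    return dot / (na * nb)

# Step 1: retrieve the stored passage most similar to the query.
passages = [
    "Model X warranty: 24 months, parts and labor included.",
    "Tank track tension must be checked every 500 km.",
]
query = "What are the warranty terms for Model X?"
query_vec = embed(query)
ranked = sorted(passages, key=lambda p: cosine(embed(p), query_vec), reverse=True)
context = ranked[0]

# Step 2: augment the original prompt with the retrieved context
# before sending it to the model (the template is illustrative).
augmented_prompt = f"Context:\n{context}\n\nQuestion: {query}"
```

The model then answers from the augmented prompt, which is why the response can cite your specific warranty terms rather than generic knowledge.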

Benefits of RAG with Vector Datasets

πŸ“…
Up-to-date Information
Overcome model training cutoff dates by referencing your current data
πŸŽ“
Domain Expertise
Provide specialized knowledge not present in general AI training
πŸ›‘οΈ
Reduced Hallucinations
Ground responses in verifiable sources rather than model speculation
πŸ”
Transparency
Use explainability features to see which dataset content informed each response
πŸ”„
Flexibility
Use the same dataset with different models without re-ingesting data

RAG vs. Traditional Approaches

No Dataset (Base Model)

How It Works: Model relies only on training data
Limitations: Outdated information, no organization-specific knowledge

Attachments Only

How It Works: Files processed per conversation
Limitations: No reusability, inefficient for recurring needs
RAG with Datasets (Recommended)

How It Works: Semantic search retrieves relevant context
Limitations: Requires initial ingestion, consumes training tokens

Practical RAG Example

πŸ“‹

Scenario

You've ingested your company's product documentation into a dataset called product-docs-2025.

❌ Without RAG (No Dataset)
Prompt:

"What are the warranty terms for our Model X product?"

Response:

Generic information about typical warranties, possibly inaccurate or irrelevant to your specific product.

βœ… With RAG (Dataset Selected)
Prompt:

"What are the warranty terms for our Model X product?"

Process:

Vector search retrieves relevant warranty sections from your documentation

Response:

Specific warranty terms from your actual product documentation, with citations

Explainability:

Shows which document sections were referenced

πŸ’‘
This demonstrates how RAG transforms generic AI into a knowledgeable assistant grounded in your organization's specific information.

Technical Considerations

🎯

Vector Database Optimization

Best Use Cases
βœ“ Vector databases excel at semantic search and retrieval
βœ“ Ideal for unstructured text, documents, and narrative content
βœ“ Perfect for finding conceptually similar information
Limitations
⚠️ Not designed for large tabular data (spreadsheets)
πŸ’‘ For tabular data analysis, attach spreadsheets directly to prompts; Ask Sage will use Python libraries to analyze them
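To illustrate why tabular questions suit direct analysis better than vector retrieval, a group-by total over a small, hypothetical CSV takes a few lines of standard-library Python (Ask Sage's own analysis tooling may differ):

```python
import csv
import io

# Hypothetical spreadsheet contents; in practice this is the .csv file
# you attach directly to the prompt.
csv_data = """region,units
North,120
South,80
North,40
"""

# Exact aggregation, no retrieval step involved.
totals: dict[str, int] = {}
for row in csv.DictReader(io.StringIO(csv_data)):
    totals[row["region"]] = totals.get(row["region"], 0) + int(row["units"])
# totals == {"North": 160, "South": 80}
```

An exact sum like this cannot be guaranteed by semantic retrieval, which returns similar passages rather than computed answers.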
⚑

Token Efficiency Tips

βœ“ Selecting "None" for datasets saves inference tokens when dataset context isn't needed
βœ“ Only select relevant datasets to optimize token usage and response quality
βœ“ Use the Show Explainability feature to verify which dataset content was used
βœ“ Monitor token usage regularly in Settings → Tokens
πŸ›‘οΈ

CUI Compliance

⚠️
Important: The Live feature is not CUI compliant and cannot be used with CUI-labeled datasets. Ensure proper classification when creating datasets containing sensitive information.


Copyright © 2025 Ask Sage Inc. All Rights Reserved.