# Data Overview

Apify datasets store the output of actor runs as structured collections of items — essentially rows of scraped or processed data. The exact fields in each item depend on the actor that produced them, but Coupler.io supports four entities that let you access both metadata and the raw item content.

## Entities and what they return

| Entity                                   | Description                                                                                                 |
| ---------------------------------------- | ----------------------------------------------------------------------------------------------------------- |
| Dataset collections                      | Returns a list of all datasets in your Apify account, including IDs, names, and creation dates              |
| Datasets                                 | Returns metadata for a specific dataset: item count, size, creation date, and last modified time            |
| Item collections                         | Returns the full list of items (rows) stored in a specific dataset — fields vary by actor                   |
| Item collection website content crawlers | Returns items from website content crawlers with standardized fields like URL, page title, and crawled text |

## Available fields

#### Dataset collections fields

| Field            | Description                                          |
| ---------------- | ---------------------------------------------------- |
| `id`             | Unique identifier for the dataset                    |
| `name`           | Dataset name (if set)                                |
| `createdAt`      | Timestamp when the dataset was created               |
| `modifiedAt`     | Timestamp of the last modification                   |
| `accessedAt`     | Timestamp of the last access                         |
| `itemCount`      | Number of items stored in the dataset                |
| `cleanItemCount` | Number of items excluding empty or duplicate records |

#### Dataset metadata fields

| Field            | Description                                         |
| ---------------- | --------------------------------------------------- |
| `id`             | Dataset ID                                          |
| `name`           | Dataset name                                        |
| `userId`         | ID of the Apify user who owns the dataset           |
| `createdAt`      | Creation timestamp                                  |
| `modifiedAt`     | Last modified timestamp                             |
| `itemCount`      | Total item count                                    |
| `cleanItemCount` | Clean item count                                    |
| `actId`          | ID of the actor that created this dataset           |
| `actRunId`       | ID of the specific actor run that produced the data |

#### Item collection fields

Fields vary depending on the actor that produced the dataset. Common fields include:

| Field         | Description                               |
| ------------- | ----------------------------------------- |
| `url`         | The URL that was scraped                  |
| `title`       | Page or record title                      |
| `description` | Short description or meta description     |
| `text`        | Extracted text content                    |
| `price`       | Price (e-commerce actors)                 |
| `imageUrl`    | Image URL                                 |
| Custom fields | Any additional fields output by the actor |
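Because item schemas differ between actors, a common first step when working with raw item collections is to normalize every item to one shared set of columns, filling gaps with nulls. A minimal sketch, with made-up items standing in for real actor output:

```python
# Illustrative items from two different (hypothetical) actors; note the
# fields don't fully overlap, which is typical for Item collections.
items = [
    {"url": "https://example.com/a", "title": "Page A", "price": 19.99},
    {"url": "https://example.com/b", "title": "Page B", "text": "Hello"},
]

# Normalize to the common columns from the table above; missing fields
# become None so every row has the same shape.
columns = ["url", "title", "description", "text", "price", "imageUrl"]
rows = [{col: item.get(col) for col in columns} for item in items]

print(rows[0]["price"])  # 19.99
print(rows[1]["price"])  # None
```

Spreadsheet and warehouse destinations expect a stable column set, which is why this kind of normalization matters before loading.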

#### Item collection website content crawler fields

| Field      | Description                                |
| ---------- | ------------------------------------------ |
| `url`      | Crawled page URL                           |
| `title`    | Page title                                 |
| `text`     | Full extracted text from the page          |
| `markdown` | Page content in Markdown format            |
| `metadata` | Additional page metadata                   |
| `crawl`    | Crawl metadata (depth, referrer URL, etc.) |
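Website content crawler items nest some of their fields: page metadata sits under `metadata` and crawl context under `crawl`. A sketch of one such item, with illustrative values:

```python
# Hypothetical website content crawler item; values are invented, but the
# top-level field names follow the table above.
item = {
    "url": "https://example.com/docs",
    "title": "Docs",
    "text": "Plain extracted text from the page.",
    "markdown": "# Docs\n\nPlain extracted text from the page.",
    "metadata": {"description": "Example docs page"},
    "crawl": {"depth": 1, "referrerUrl": "https://example.com/"},
}

# For AI destinations, the markdown field is usually the best payload
# because it preserves the page's structure.
content = item["markdown"]
crawl_depth = item["crawl"]["depth"]
```

Accessing nested fields like `crawl.depth` this way is useful when you want to filter items by how deep in the site they were found.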

## Common field combinations

* **Content audits**: `url` + `title` + `text` from Item collections to review all scraped pages
* **Dataset monitoring**: `itemCount` + `cleanItemCount` + `modifiedAt` from Datasets to track actor run output over time
* **AI analysis**: `url` + `markdown` from Item collection website content crawlers, piped into ChatGPT or Claude for summarization
* **Cross-dataset comparison**: Use **Append** transformation to stack item collections from multiple actor runs into one unified table
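The last combination, stacking item collections from multiple runs, can be pictured like this. The snippet is a rough sketch of what an Append transformation produces, not Coupler.io's actual implementation, and the run IDs and items are invented:

```python
# Two (hypothetical) actor runs, each with its own item collection.
runs = {
    "run_2024_05_01": [{"url": "https://example.com/a", "price": 10.0}],
    "run_2024_05_08": [{"url": "https://example.com/a", "price": 12.0}],
}

# Append = stack all rows into one table, tagging each row with its
# source run so the history stays traceable.
combined = [
    {**item, "sourceRun": run_id}
    for run_id, items in runs.items()
    for item in items
]

print(len(combined))  # 2
```

Tagging each row with a source identifier is what makes the unified table usable for trend analysis, e.g. tracking a price across runs.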

## Use cases by role

{% tabs %}
{% tab title="Marketers" %}

* Pull competitor pricing data scraped by an Apify actor into Google Sheets for weekly tracking
* Send crawled website content to ChatGPT or Gemini for automated content gap analysis
* Append item collections from multiple scraping runs to build a historical dataset of SERP results
  {% endtab %}

{% tab title="Data teams" %}

* Load raw item collections into BigQuery for transformation and analysis at scale
* Join dataset metadata with item collections to enrich records with actor run context (run ID, creation date)
* Schedule syncs to keep a data warehouse table updated after each actor run
  {% endtab %}

{% tab title="Developers" %}

* Use Dataset collections to programmatically audit all datasets stored in an account
* Export item collections to Excel or Looker Studio for stakeholder-ready reporting without writing custom export scripts
* Pipe website content crawler output into Cursor or Claude for AI-assisted code or content generation workflows
  {% endtab %}
  {% endtabs %}

## Platform-specific notes

* Item fields are actor-dependent — the schema of Item collection data will differ between actors (e.g., an e-commerce scraper vs. a news scraper)
* The `cleanItemCount` may be lower than `itemCount` if the actor produced empty or duplicate records
* Apify datasets are tied to a specific actor run; if you re-run an actor, it creates a new dataset with a new ID
* For website content crawlers, the `markdown` field is the most useful for feeding content into AI destinations
* Very large datasets (millions of items) may require pagination — Coupler.io handles this automatically
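On the pagination point: Coupler.io handles this for you, but conceptually it amounts to fetching pages with an offset and limit until a short page signals the end. In the sketch below, `fetch_page` is a stand-in for the real HTTP call, and the dataset contents are faked:

```python
# Pretend dataset contents; a real dataset could hold millions of items.
ALL_ITEMS = [{"id": i} for i in range(250)]

def fetch_page(offset, limit):
    # Stand-in for the real items request; in practice this would be an
    # HTTP call with offset/limit query parameters.
    return ALL_ITEMS[offset:offset + limit]

def fetch_all(limit=100):
    items, offset = [], 0
    while True:
        page = fetch_page(offset, limit)
        items.extend(page)
        if len(page) < limit:  # a short page means we've reached the end
            return items
        offset += limit

print(len(fetch_all()))  # 250
```

The short-page termination check avoids an extra empty request when the total item count happens to be an exact multiple of the page size minus one; either way, the loop always terminates once a page comes back smaller than `limit`.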
