Intermediate · data · ai · automation

Scrape → CSV

AI Workflow

Scrape a website, extract structured data with an LLM, and export the results as a clean CSV. A three-step data pipeline with no manual parsing.

Prerequisites

Environment variables

OPENAI_KEY
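The scaffold further down reads `OPENAI_KEY` with a non-null assertion, so a missing variable would only surface once the first LLM call fails. A minimal sketch of an early check follows; `requireEnv` is a hypothetical helper written for illustration, not part of Radzor.

```typescript
// Hypothetical helper (not a Radzor API): fail fast when a required
// environment variable is missing or blank.
function requireEnv(name: string, env: Record<string, string | undefined>): string {
  const value = env[name]
  if (value === undefined || value.trim() === "") {
    throw new Error(`Missing required environment variable: ${name}`)
  }
  return value
}
```

In the scaffold, `requireEnv("OPENAI_KEY", process.env)` could stand in for `process.env.OPENAI_KEY!`.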

Respect robots.txt and rate limits when scraping third-party sites.
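As a rough illustration of the robots.txt check, the sketch below decides whether a path is disallowed for all crawlers. It is deliberately simplified: it only reads `User-agent: *` groups and treats each `Disallow` value as a plain path prefix, ignoring `Allow`, wildcards, and `Crawl-delay`. The function names are hypothetical and nothing here is confirmed to be how the WebScraper component behaves.

```typescript
// Simplified robots.txt reading (assumption: prefix matching only,
// no Allow rules, no wildcards). Hypothetical helpers, not Radzor APIs.
function disallowedPrefixes(robotsTxt: string, userAgent: string = "*"): string[] {
  const prefixes: string[] = []
  let groupAgents: string[] = []
  let lastWasAgent = false
  for (const rawLine of robotsTxt.split(/\r?\n/)) {
    const line = rawLine.replace(/#.*$/, "").trim() // drop comments
    const colon = line.indexOf(":")
    if (colon === -1) continue
    const field = line.slice(0, colon).trim().toLowerCase()
    const value = line.slice(colon + 1).trim()
    if (field === "user-agent") {
      // Consecutive User-agent lines share one group; otherwise start a new one.
      if (!lastWasAgent) groupAgents = []
      groupAgents.push(value.toLowerCase())
      lastWasAgent = true
    } else {
      lastWasAgent = false
      const applies = groupAgents.indexOf(userAgent.toLowerCase()) !== -1
      // An empty Disallow value means "nothing is disallowed".
      if (field === "disallow" && value !== "" && applies) prefixes.push(value)
    }
  }
  return prefixes
}

function isPathAllowed(robotsTxt: string, path: string): boolean {
  return !disallowedPrefixes(robotsTxt).some((prefix) => path.indexOf(prefix) === 0)
}
```

A caller would fetch the site's `/robots.txt` once, then skip any URL whose path fails `isPathAllowed` before handing it to the scraper.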

Install

$ npx radzor@latest recipe add scrape-to-csv

AI Prompt

“Run `npx radzor@latest add web-scraper structured-output csv-export` to install 3 Radzor components. Then read components/radzor/web-scraper/radzor.manifest.json, components/radzor/structured-output/radzor.manifest.json, components/radzor/csv-export/radzor.manifest.json and each component's llm/integration.md. Wire them together to scrape a website, extract structured data with an LLM, and export the results as a clean CSV. A three-step data pipeline with no manual parsing. Use the manifest's inputs (check envVar for required environment variables), outputs (check fields for object shapes), composability (check mapField for field extraction), and actions — don't invent custom interfaces.”

Paste this into Claude Code, Cursor, Windsurf, or any AI coding agent.

Pipeline

WebScraper: Fetches raw HTML from target URLs

↓ HTML

StructuredOutput: Extracts structured data via LLM

↓ typed records

CsvExport: Generates the final CSV file

Scaffolded Code

scrape-to-csv-recipe.ts

// npx radzor@latest add web-scraper structured-output csv-export
import { WebScraper }       from "./components/radzor/web-scraper"
import { StructuredOutput } from "./components/radzor/structured-output"
import { CsvExport }        from "./components/radzor/csv-export"

const scraper = new WebScraper({ timeout: 15000, rateLimit: 2000 })
const extractor = new StructuredOutput({ provider: "openai", apiKey: process.env.OPENAI_KEY!, model: "gpt-4o", temperature: 0 })
const csv = new CsvExport({ delimiter: ",", includeHeaders: true })

const urls = [
  "https://example.com/products/1",
  "https://example.com/products/2",
  "https://example.com/products/3",
]

const schema = { name: "string", price: "number", inStock: "boolean" }
const rows: Record<string, unknown>[] = []

for (const url of urls) {
  // Fetch the raw HTML for this page
  const html = await scraper.fetchHtml(url)
  // Ask the LLM to extract one record matching the schema
  const product = await extractor.extract(html, schema)
  rows.push(product)
}

// Export to CSV file
await csv.toFile("./products.csv", rows)
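The loop above pushes whatever `extractor.extract` returns straight into `rows`. LLM output can drift from the requested shape, so a runtime check before export may be worthwhile. The sketch below assumes the schema format used in the scaffold (field name mapped to `"string"`, `"number"`, or `"boolean"`); `validateRecord` is a hypothetical helper, and whether StructuredOutput already validates its output is not stated here.

```typescript
type FieldType = "string" | "number" | "boolean"
type Schema = Record<string, FieldType>

// Hypothetical helper (not a Radzor API): returns a list of problems,
// or an empty list when the record matches the schema.
function validateRecord(record: Record<string, unknown>, schema: Schema): string[] {
  const problems: string[] = []
  for (const field of Object.keys(schema)) {
    const expected = schema[field]
    const value = record[field]
    if (value === undefined || value === null) {
      problems.push(`${field}: missing`)
    } else if (typeof value !== expected) {
      problems.push(`${field}: expected ${expected}, got ${typeof value}`)
    } else if (typeof value === "number" && !isFinite(value)) {
      // NaN and Infinity pass the typeof check but are rarely wanted in a CSV.
      problems.push(`${field}: not a finite number`)
    }
  }
  return problems
}
```

Inside the loop, a record could be pushed only when `validateRecord(product, schema)` returns an empty array, with the problems logged alongside the URL otherwise.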

Components used

WebScraper: Fetches raw HTML from target URLs

StructuredOutput: Extracts structured data via LLM

CsvExport: Generates the final CSV file
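To make the final step concrete, here is a minimal sketch of turning rows into CSV text with RFC 4180-style quoting: fields containing the delimiter, a double quote, or a line break are wrapped in quotes, and embedded quotes are doubled. The `toCsv` and `escapeCsvField` names are hypothetical, and this is not a description of how CsvExport is implemented. Column order is assumed to come from the keys of the first row.

```typescript
// Hypothetical helpers (not Radzor APIs) illustrating CSV serialization.
function escapeCsvField(value: unknown, delimiter: string = ","): string {
  const text = value === null || value === undefined ? "" : String(value)
  const needsQuotes = text.indexOf(delimiter) !== -1 || /["\r\n]/.test(text)
  // Double any embedded quotes, then wrap the whole field in quotes.
  return needsQuotes ? '"' + text.replace(/"/g, '""') + '"' : text
}

function toCsv(rows: Record<string, unknown>[], delimiter: string = ","): string {
  const first = rows[0]
  if (first === undefined) return ""
  // Assumption: the first row's keys define the header and column order.
  const headers = Object.keys(first)
  const lines = [headers.map((h) => escapeCsvField(h, delimiter)).join(delimiter)]
  for (const row of rows) {
    lines.push(headers.map((h) => escapeCsvField(row[h], delimiter)).join(delimiter))
  }
  return lines.join("\r\n") + "\r\n"
}
```

A product name containing a comma, for example, ends up quoted so it stays in a single column when the file is opened in a spreadsheet.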

LLM tip

Pass all three radzor.manifest.json files to your agent at once. It can then read the outputs of each step and match them against the inputs of the next, wiring the full pipeline without extra instructions.

