
Web Crawler: Convert Websites to AI-Ready Markdown in Google Sheets
Category: ⚙️ Automation
Nodes used: n8n-nodes-base.set, n8n-nodes-base.html, n8n-nodes-base.filter, n8n-nodes-base.switch, n8n-nodes-base.markdown, n8n-nodes-base.splitOut, n8n-nodes-base.aggregate, n8n-nodes-base.stickyNote
Price: Free
Views: 0
Last updated: 11/28/2025
workflow.json
{
"meta": {
"instanceId": "3d7eb9567ae690bf8c9bba1cb43396e6e40c18e15eb5889cf9673ed1713da6db",
"templateCredsSetupCompleted": true
},
"nodes": [
{
"id": "349e50cf-75b8-432c-818e-63f1ff3ead34",
"name": "Overview Note",
"type": "n8n-nodes-base.stickyNote",
"position": [
1696,
3104
],
"parameters": {
"color": 4,
"width": 600,
"height": 1112,
"content": "# Automated Website Crawler for AI Knowledge Bases\n\n## 📋 What This Template Does\nThis workflow crawls a website's homepage to extract all sublinks, filters images from content pages, scrapes and converts textual content to Markdown, then aggregates everything into Google Sheets—ideal for building AI-ready knowledge bases or company dossiers.\n\n## 🔧 Prerequisites\n- Google account with Sheets access\n- n8n instance\n\n## 🔑 Required Credentials\n\n### Google Sheets OAuth2 API Setup\n1. Go to console.cloud.google.com → APIs & Services → Credentials\n2. Create OAuth client ID for Web application\n3. Add n8n redirect URI: https://your-n8n-instance.com/rest/oauth2-credential/callback\n4. Add to n8n as Google Sheets OAuth2 API and grant Sheets scopes\n\n## ⚙️ Configuration Steps\n1. Import JSON into n8n\n2. Set target URL in Set Website node\n3. Assign Google credential to Sheet nodes\n4. Update documentId and sheetName to your spreadsheet\n5. Ensure sheet has columns: Website, Links, Scraped Content, Images\n6. Test manually\n\n## 🎯 Use Cases\n- Crawl company sites for knowledge base building\n- Extract content for AI agent training datasets\n- Gather competitor intel for market analysis\n- Archive dynamic sites for compliance\n\n## ⚠️ Troubleshooting\n- No links: Check homepage <a> tags and test URL\n- Sheet errors: Verify columns and permissions\n- Truncated content: Adjust slice limit or split rows\n- Rate limits: Add Wait node after scraping"
},
"typeVersion": 1
},
{
"id": "eb43d67c-01fc-4d83-bb2c-099938a57468",
"name": "Note: Trigger and Setup",
"type": "n8n-nodes-base.stickyNote",
"position": [
2512,
3072
],
"parameters": {
"color": 6,
"width": 556,
"height": 176,
"content": "## 🖱️ Trigger & Setup Nodes\n\n**Purpose:** Manual Trigger starts the workflow; Set Website configures the target URL.\n\n**Note:** Update website_url in Set Website for your site; use Schedule Trigger for automation."
},
"typeVersion": 1
},
{
"id": "3c8581cb-46cd-4f25-af5a-c52bc2f463c6",
"name": "Set Website",
"type": "n8n-nodes-base.set",
"position": [
2688,
3296
],
"parameters": {
"options": {},
"assignments": {
"assignments": [
{
"id": "a652f57e-210e-421e-b20b-781d6f4dc240",
"name": "website_url",
"type": "string",
"value": "https://example.com"
}
]
}
},
"typeVersion": 3.4
},
{
"id": "18201858-7764-4a14-9f6b-12e36eaf158b",
"name": "Manual Trigger",
"type": "n8n-nodes-base.manualTrigger",
"position": [
2496,
3296
],
"parameters": {},
"typeVersion": 1
},
{
"id": "b7435481-bed3-439f-933c-1c5e0142ad5c",
"name": "Scrape Homepage",
"type": "n8n-nodes-base.httpRequest",
"onError": "continueRegularOutput",
"position": [
2880,
3296
],
"parameters": {
"url": "={{ $json.website_url }}",
"options": {
"redirect": {
"redirect": {}
},
"allowUnauthorizedCerts": false
}
},
"executeOnce": false,
"typeVersion": 4.2,
"alwaysOutputData": false
},
{
"id": "ce13710d-24ca-47d4-a25c-8890c1592947",
"name": "Note: Homepage Scraping",
"type": "n8n-nodes-base.stickyNote",
"position": [
3168,
3488
],
"parameters": {
"color": 5,
"width": 396,
"height": 192,
"content": "## 🌐 Homepage Scraping Nodes\n\n**Purpose:** Scrape Homepage fetches HTML; Extract Links pulls hrefs from <a> tags; Split Links breaks array into items.\n\n**Note:** Handles redirects; targets all links for discovery."
},
"typeVersion": 1
},
{
"id": "61a60f2c-f032-4b46-83ba-405df0ce05df",
"name": "Extract Links from HTML",
"type": "n8n-nodes-base.html",
"position": [
3088,
3296
],
"parameters": {
"options": {
"trimValues": true,
"cleanUpText": true
},
"operation": "extractHtmlContent",
"extractionValues": {
"values": [
{
"key": "links",
"attribute": "href",
"cssSelector": "a",
"returnArray": true,
"returnValue": "attribute"
}
]
}
},
"typeVersion": 1.2
},
{
"id": "582eeae0-fec0-4548-9c78-7c05ac5aaebc",
"name": "Split Links",
"type": "n8n-nodes-base.splitOut",
"position": [
3296,
3296
],
"parameters": {
"options": {},
"fieldToSplitOut": "links"
},
"typeVersion": 1
},
{
"id": "17d59531-4d51-4494-8ae9-e91b81851a0b",
"name": "Remove Duplicate Links",
"type": "n8n-nodes-base.removeDuplicates",
"position": [
3520,
3296
],
"parameters": {
"options": {}
},
"typeVersion": 2
},
{
"id": "d50fa2a9-1a58-4dad-8bd0-cfbd31aeae91",
"name": "Filter Real Hyperlinks",
"type": "n8n-nodes-base.filter",
"position": [
3696,
3296
],
"parameters": {
"options": {},
"conditions": {
"options": {
"version": 2,
"leftValue": "",
"caseSensitive": true,
"typeValidation": "strict"
},
"combinator": "and",
"conditions": [
{
"id": "bd6c6da6-8af7-4809-b6cd-01a38d71953b",
"operator": {
"type": "string",
"operation": "startsWith"
},
"leftValue": "={{ $json.links }}",
"rightValue": "https://"
}
]
}
},
"typeVersion": 2.2
},
{
"id": "cb121b70-a14a-4cbd-a54c-e55c6fc235b7",
"name": "Note: Link Processing",
"type": "n8n-nodes-base.stickyNote",
"position": [
3216,
3056
],
"parameters": {
"color": 2,
"width": 556,
"height": 224,
"content": "## 🔄 Link Processing Nodes\n\n**Purpose:** Remove Duplicate Links cleans list; Filter Real Hyperlinks keeps HTTPS; Separate Images and Links routes via regex.\n\n**Note:** Switch output 0: Images, 1: Content links; adjust regex for custom extensions."
},
"typeVersion": 1
},
{
"id": "d69c0dc2-2c4c-474b-ba11-3d79e1390b12",
"name": "Separate Images and Links",
"type": "n8n-nodes-base.switch",
"position": [
2480,
3680
],
"parameters": {
"rules": {
"values": [
{
"outputKey": "Images",
"conditions": {
"options": {
"version": 2,
"leftValue": "",
"caseSensitive": true,
"typeValidation": "strict"
},
"combinator": "and",
"conditions": [
{
"id": "16724958-4eea-489d-b494-3d76a3ba2562",
"operator": {
"type": "string",
"operation": "regex"
},
"leftValue": "={{ $json.links }}",
"rightValue": "=^https?:\\/\\/.*\\.(?:png|jpe?g|gif|webp|bmp|svg|ico)(?:\\?.*)?$"
}
]
},
"renameOutput": true
},
{
"outputKey": "Links",
"conditions": {
"options": {
"version": 2,
"leftValue": "",
"caseSensitive": true,
"typeValidation": "strict"
},
"combinator": "and",
"conditions": [
{
"id": "816392f0-96db-4134-8bee-4b74688ff929",
"operator": {
"type": "string",
"operation": "notRegex"
},
"leftValue": "={{ $json.links }}",
"rightValue": "=^https?:\\/\\/.*\\.(?:png|jpe?g|gif|webp|bmp|svg|ico)(?:\\?.*)?$"
}
]
},
"renameOutput": true
}
]
},
"options": {}
},
"typeVersion": 3.2
},
{
"id": "23896343-575e-4956-8e95-3b5e6e4c8ae7",
"name": "Aggregate Images",
"type": "n8n-nodes-base.aggregate",
"position": [
2736,
3504
],
"parameters": {
"options": {},
"fieldsToAggregate": {
"fieldToAggregate": [
{
"fieldToAggregate": "links"
}
]
}
},
"typeVersion": 1
},
{
"id": "fcad347b-60d7-4fa2-9b02-e96c2f27116d",
"name": "Aggregate Links",
"type": "n8n-nodes-base.aggregate",
"position": [
2736,
3696
],
"parameters": {
"options": {},
"fieldsToAggregate": {
"fieldToAggregate": [
{
"fieldToAggregate": "links"
}
]
}
},
"typeVersion": 1
},
{
"id": "fc5d6ce1-1765-4768-a9c7-de3677e8109d",
"name": "Scrape Content Links",
"type": "n8n-nodes-base.httpRequest",
"position": [
2736,
3872
],
"parameters": {
"url": "={{ $json.links }}",
"options": {}
},
"typeVersion": 4.2
},
{
"id": "0d4b6a4e-b6cb-4e6c-9a22-bd0dc6a72027",
"name": "Note: Content Scraping",
"type": "n8n-nodes-base.stickyNote",
"position": [
2320,
3984
],
"parameters": {
"color": 5,
"width": 428,
"height": 224,
"content": "## 📄 Content Scraping & Aggregation Nodes\n\n**Purpose:** Scrape Content Links fetches pages; Convert to Markdown formats HTML; Aggregate Images/Links/Content combines outputs.\n\n**Note:** Markdown preserves structure for AI; slice content if exceeding sheet limits."
},
"typeVersion": 1
},
{
"id": "349e5f7c-c81b-467b-a59b-ea40a47226f0",
"name": "Convert to Markdown",
"type": "n8n-nodes-base.markdown",
"position": [
2944,
3872
],
"parameters": {
"html": "={{ $json.data }}",
"options": {}
},
"typeVersion": 1
},
{
"id": "24f22a31-03a3-4faf-81f4-3c38c0956ee4",
"name": "Aggregate Scraped Content",
"type": "n8n-nodes-base.aggregate",
"position": [
3136,
3872
],
"parameters": {
"options": {},
"fieldsToAggregate": {
"fieldToAggregate": [
{
"fieldToAggregate": "data"
}
]
}
},
"typeVersion": 1
},
{
"id": "a4d34aab-1af2-4196-85f5-1a2d832969dd",
"name": "Add Images to Sheet",
"type": "n8n-nodes-base.googleSheets",
"position": [
2944,
3504
],
"parameters": {
"columns": {
"value": {
"Images": "={{ $json.links.join('\\n\\n') }}",
"Website": "={{ $('Set Website').item.json.website_url }}"
},
"schema": [
{
"id": "Website",
"type": "string",
"display": true,
"removed": false,
"required": false,
"displayName": "Website",
"defaultMatch": false,
"canBeUsedToMatch": true
},
{
"id": "Links",
"type": "string",
"display": true,
"removed": true,
"required": false,
"displayName": "Links",
"defaultMatch": false,
"canBeUsedToMatch": true
},
{
"id": "Scraped Content",
"type": "string",
"display": true,
"removed": true,
"required": false,
"displayName": "Scraped Content",
"defaultMatch": false,
"canBeUsedToMatch": true
},
{
"id": "Images",
"type": "string",
"display": true,
"required": false,
"displayName": "Images",
"defaultMatch": false,
"canBeUsedToMatch": true
}
],
"mappingMode": "defineBelow",
"matchingColumns": [
"Website"
],
"attemptToConvertTypes": false,
"convertFieldsToString": false
},
"options": {},
"operation": "appendOrUpdate",
"sheetName": "your-sheet-name",
"documentId": "your-document-id"
},
"credentials": {
"googleSheetsOAuth2Api": {
"id": "ZVbWK0SlohYDlZYO",
"name": "Ewere"
}
},
"typeVersion": 4.7
},
{
"id": "6afbfad8-b80f-4a0d-81b4-9138cc2af46a",
"name": "Add Links to Sheet",
"type": "n8n-nodes-base.googleSheets",
"position": [
2944,
3696
],
"parameters": {
"columns": {
"value": {
"Links": "={{ $json.links.join('\\n\\n') }}",
"Website": "={{ $('Set Website').item.json.website_url }}"
},
"schema": [
{
"id": "Website",
"type": "string",
"display": true,
"removed": false,
"required": false,
"displayName": "Website",
"defaultMatch": false,
"canBeUsedToMatch": true
},
{
"id": "Links",
"type": "string",
"display": true,
"removed": false,
"required": false,
"displayName": "Links",
"defaultMatch": false,
"canBeUsedToMatch": true
},
{
"id": "Scraped Content",
"type": "string",
"display": true,
"removed": true,
"required": false,
"displayName": "Scraped Content",
"defaultMatch": false,
"canBeUsedToMatch": true
},
{
"id": "Images",
"type": "string",
"display": true,
"removed": true,
"required": false,
"displayName": "Images",
"defaultMatch": false,
"canBeUsedToMatch": true
}
],
"mappingMode": "defineBelow",
"matchingColumns": [
"Website"
],
"attemptToConvertTypes": false,
"convertFieldsToString": false
},
"options": {},
"operation": "appendOrUpdate",
"sheetName": "your-sheet-name",
"documentId": "your-document-id"
},
"credentials": {
"googleSheetsOAuth2Api": {
"id": "ZVbWK0SlohYDlZYO",
"name": "Ewere"
}
},
"typeVersion": 4.7
},
{
"id": "35ae2c30-a93a-4fd2-82b6-07d2f4c56c88",
"name": "Add Scraped Content to Sheet",
"type": "n8n-nodes-base.googleSheets",
"position": [
3344,
3872
],
"parameters": {
"columns": {
"value": {
"Website": "={{ $('Set Website').item.json.website_url }}",
"Scraped Content": "={{ $json.data.join('\\n\\n').slice(0, 50000) }}"
},
"schema": [
{
"id": "Website",
"type": "string",
"display": true,
"removed": false,
"required": false,
"displayName": "Website",
"defaultMatch": false,
"canBeUsedToMatch": true
},
{
"id": "Links",
"type": "string",
"display": true,
"removed": true,
"required": false,
"displayName": "Links",
"defaultMatch": false,
"canBeUsedToMatch": true
},
{
"id": "Scraped Content",
"type": "string",
"display": true,
"removed": false,
"required": false,
"displayName": "Scraped Content",
"defaultMatch": false,
"canBeUsedToMatch": true
},
{
"id": "Images",
"type": "string",
"display": true,
"removed": true,
"required": false,
"displayName": "Images",
"defaultMatch": false,
"canBeUsedToMatch": true
}
],
"mappingMode": "defineBelow",
"matchingColumns": [
"Website"
],
"attemptToConvertTypes": false,
"convertFieldsToString": false
},
"options": {},
"operation": "appendOrUpdate",
"sheetName": "your-sheet-name",
"documentId": "your-document-id"
},
"credentials": {
"googleSheetsOAuth2Api": {
"id": "ZVbWK0SlohYDlZYO",
"name": "Ewere"
}
},
"typeVersion": 4.7
},
{
"id": "c3f7b022-db11-400c-baaa-77392acfb991",
"name": "Note: Sheet Integration",
"type": "n8n-nodes-base.stickyNote",
"position": [
3232,
4048
],
"parameters": {
"color": 3,
"width": 444,
"height": 176,
"content": "## 📊 Sheet Integration Nodes\n\n**Purpose:** Add Images/Links/Scraped Content to Sheet appends aggregated data to Google Sheets.\n\n**Note:** Matches on 'Website' column; update documentId/sheetName for your sheet."
},
"typeVersion": 1
}
],
"pinData": {},
"connections": {
"Set Website": {
"main": [
[
{
"node": "Scrape Homepage",
"type": "main",
"index": 0
}
]
]
},
"Split Links": {
"main": [
[
{
"node": "Remove Duplicate Links",
"type": "main",
"index": 0
}
]
]
},
"Manual Trigger": {
"main": [
[
{
"node": "Set Website",
"type": "main",
"index": 0
}
]
]
},
"Aggregate Links": {
"main": [
[
{
"node": "Add Links to Sheet",
"type": "main",
"index": 0
}
]
]
},
"Scrape Homepage": {
"main": [
[
{
"node": "Extract Links from HTML",
"type": "main",
"index": 0
}
]
]
},
"Aggregate Images": {
"main": [
[
{
"node": "Add Images to Sheet",
"type": "main",
"index": 0
}
]
]
},
"Convert to Markdown": {
"main": [
[
{
"node": "Aggregate Scraped Content",
"type": "main",
"index": 0
}
]
]
},
"Scrape Content Links": {
"main": [
[
{
"node": "Convert to Markdown",
"type": "main",
"index": 0
}
]
]
},
"Filter Real Hyperlinks": {
"main": [
[
{
"node": "Separate Images and Links",
"type": "main",
"index": 0
}
]
]
},
"Remove Duplicate Links": {
"main": [
[
{
"node": "Filter Real Hyperlinks",
"type": "main",
"index": 0
}
]
]
},
"Extract Links from HTML": {
"main": [
[
{
"node": "Split Links",
"type": "main",
"index": 0
}
]
]
},
"Aggregate Scraped Content": {
"main": [
[
{
"node": "Add Scraped Content to Sheet",
"type": "main",
"index": 0
}
]
]
},
"Separate Images and Links": {
"main": [
[
{
"node": "Aggregate Images",
"type": "main",
"index": 0
}
],
[
{
"node": "Aggregate Links",
"type": "main",
"index": 0
},
{
"node": "Scrape Content Links",
"type": "main",
"index": 0
}
]
]
}
}
}
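
The link-processing branch of the workflow (Remove Duplicate Links, Filter Real Hyperlinks, Separate Images and Links) and the cell expressions in the Sheet nodes can be hard to follow from the JSON alone. The following is a minimal Python sketch of that logic for offline experimentation, not the way n8n executes it. The regex is copied from the Switch node; the names `process_links`, `to_cell`, `IMAGE_URL`, and `MAX_CELL_CHARS` are hypothetical and do not appear in the workflow.

```python
import re

# Same pattern as the Switch node's regex rule, written as a Python raw string.
# Matching is case-sensitive, as in the workflow ("caseSensitive": true).
IMAGE_URL = re.compile(
    r"^https?:\/\/.*\.(?:png|jpe?g|gif|webp|bmp|svg|ico)(?:\?.*)?$"
)

# Mirrors .slice(0, 50000) in "Add Scraped Content to Sheet".
# Assumption: Python counts code points while the JavaScript expression counts
# UTF-16 code units, so the cut-off can differ slightly for some text.
MAX_CELL_CHARS = 50_000


def process_links(hrefs):
    """Deduplicate hrefs, keep https:// links, and split images from content links."""
    seen = set()
    images, content = [], []
    for href in hrefs:
        href = href.strip()               # the HTML node trims extracted values
        if href in seen:                  # Remove Duplicate Links
            continue
        seen.add(href)
        if not href.startswith("https://"):  # Filter Real Hyperlinks
            continue
        if IMAGE_URL.search(href):        # Switch output 0: Images
            images.append(href)
        else:                             # Switch output 1: Links
            content.append(href)
    return images, content


def to_cell(parts):
    """Join aggregated values with blank lines and truncate to the cell limit."""
    return "\n\n".join(parts)[:MAX_CELL_CHARS]


if __name__ == "__main__":
    imgs, pages = process_links([
        "https://example.com/about",
        "https://example.com/about",
        "/contact",
        "https://example.com/logo.png?v=2",
    ])
    print(to_cell(imgs))
    print(to_cell(pages))
```

Running the sketch against a saved list of hrefs shows which links the workflow would scrape and which it would drop; relative paths such as `/contact` never reach the scraping step because of the `https://` prefix filter.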