Web Crawler: Convert Websites to AI-Ready Markdown in Google Sheets

by daniel-automates • 0 views

Description

Categories

⚙️ Automation

Nodes Used

n8n-nodes-base.set, n8n-nodes-base.html, n8n-nodes-base.filter, n8n-nodes-base.switch, n8n-nodes-base.markdown, n8n-nodes-base.splitOut, n8n-nodes-base.aggregate, n8n-nodes-base.aggregate, n8n-nodes-base.aggregate, n8n-nodes-base.stickyNote
Price: Free
Views: 0
Last Updated: 11/28/2025
workflow.json
{
  "meta": {
    "instanceId": "3d7eb9567ae690bf8c9bba1cb43396e6e40c18e15eb5889cf9673ed1713da6db",
    "templateCredsSetupCompleted": true
  },
  "nodes": [
    {
      "id": "349e50cf-75b8-432c-818e-63f1ff3ead34",
      "name": "Overview Note",
      "type": "n8n-nodes-base.stickyNote",
      "position": [
        1696,
        3104
      ],
      "parameters": {
        "color": 4,
        "width": 600,
        "height": 1112,
        "content": "# Automated Website Crawler for AI Knowledge Bases\n\n## 📋 What This Template Does\nThis workflow crawls a website's homepage to extract all sublinks, separates image links from content pages, scrapes the content pages and converts them to Markdown, then aggregates everything into Google Sheets. The result is suited to building AI-ready knowledge bases or company dossiers.\n\n## 🔧 Prerequisites\n- Google account with Sheets access\n- n8n instance\n\n## 🔑 Required Credentials\n\n### Google Sheets OAuth2 API Setup\n1. Go to console.cloud.google.com → APIs & Services → Credentials\n2. Create OAuth client ID for Web application\n3. Add n8n redirect URI: https://your-n8n-instance.com/rest/oauth2-credential/callback\n4. Add to n8n as Google Sheets OAuth2 API and grant Sheets scopes\n\n## ⚙️ Configuration Steps\n1. Import JSON into n8n\n2. Set target URL in Set Website node\n3. Assign Google credential to Sheet nodes\n4. Update documentId and sheetName to your spreadsheet\n5. Ensure sheet has columns: Website, Links, Scraped Content, Images\n6. Test manually\n\n## 🎯 Use Cases\n- Crawl company sites for knowledge base building\n- Extract content for AI agent training datasets\n- Gather competitor intel for market analysis\n- Archive dynamic sites for compliance\n\n## ⚠️ Troubleshooting\n- No links: Check homepage <a> tags and test URL\n- Sheet errors: Verify columns and permissions\n- Truncated content: Adjust slice limit or split rows\n- Rate limits: Add Wait node after scraping"
      },
      "typeVersion": 1
    },
    {
      "id": "eb43d67c-01fc-4d83-bb2c-099938a57468",
      "name": "Note: Trigger and Setup",
      "type": "n8n-nodes-base.stickyNote",
      "position": [
        2512,
        3072
      ],
      "parameters": {
        "color": 6,
        "width": 556,
        "height": 176,
        "content": "## 🖱️ Trigger & Setup Nodes\n\n**Purpose:** Manual Trigger starts the workflow; Set Website configures the target URL.\n\n**Note:** Update website_url in Set Website for your site; use Schedule Trigger for automation."
      },
      "typeVersion": 1
    },
    {
      "id": "3c8581cb-46cd-4f25-af5a-c52bc2f463c6",
      "name": "Set Website",
      "type": "n8n-nodes-base.set",
      "position": [
        2688,
        3296
      ],
      "parameters": {
        "options": {},
        "assignments": {
          "assignments": [
            {
              "id": "a652f57e-210e-421e-b20b-781d6f4dc240",
              "name": "website_url",
              "type": "string",
              "value": "https://example.com"
            }
          ]
        }
      },
      "typeVersion": 3.4
    },
    {
      "id": "18201858-7764-4a14-9f6b-12e36eaf158b",
      "name": "Manual Trigger",
      "type": "n8n-nodes-base.manualTrigger",
      "position": [
        2496,
        3296
      ],
      "parameters": {},
      "typeVersion": 1
    },
    {
      "id": "b7435481-bed3-439f-933c-1c5e0142ad5c",
      "name": "Scrape Homepage",
      "type": "n8n-nodes-base.httpRequest",
      "onError": "continueRegularOutput",
      "position": [
        2880,
        3296
      ],
      "parameters": {
        "url": "={{ $json.website_url }}",
        "options": {
          "redirect": {
            "redirect": {}
          },
          "allowUnauthorizedCerts": false
        }
      },
      "executeOnce": false,
      "typeVersion": 4.2,
      "alwaysOutputData": false
    },
    {
      "id": "ce13710d-24ca-47d4-a25c-8890c1592947",
      "name": "Note: Homepage Scraping",
      "type": "n8n-nodes-base.stickyNote",
      "position": [
        3168,
        3488
      ],
      "parameters": {
        "color": 5,
        "width": 396,
        "height": 192,
        "content": "## 🌐 Homepage Scraping Nodes\n\n**Purpose:** Scrape Homepage fetches HTML; Extract Links pulls hrefs from <a> tags; Split Links breaks array into items.\n\n**Note:** Handles redirects; targets all links for discovery."
      },
      "typeVersion": 1
    },
    {
      "id": "61a60f2c-f032-4b46-83ba-405df0ce05df",
      "name": "Extract Links from HTML",
      "type": "n8n-nodes-base.html",
      "position": [
        3088,
        3296
      ],
      "parameters": {
        "options": {
          "trimValues": true,
          "cleanUpText": true
        },
        "operation": "extractHtmlContent",
        "extractionValues": {
          "values": [
            {
              "key": "links",
              "attribute": "href",
              "cssSelector": "a",
              "returnArray": true,
              "returnValue": "attribute"
            }
          ]
        }
      },
      "typeVersion": 1.2
    },
    {
      "id": "582eeae0-fec0-4548-9c78-7c05ac5aaebc",
      "name": "Split Links",
      "type": "n8n-nodes-base.splitOut",
      "position": [
        3296,
        3296
      ],
      "parameters": {
        "options": {},
        "fieldToSplitOut": "links"
      },
      "typeVersion": 1
    },
    {
      "id": "17d59531-4d51-4494-8ae9-e91b81851a0b",
      "name": "Remove Duplicate Links",
      "type": "n8n-nodes-base.removeDuplicates",
      "position": [
        3520,
        3296
      ],
      "parameters": {
        "options": {}
      },
      "typeVersion": 2
    },
    {
      "id": "d50fa2a9-1a58-4dad-8bd0-cfbd31aeae91",
      "name": "Filter Real Hyperlinks",
      "type": "n8n-nodes-base.filter",
      "position": [
        3696,
        3296
      ],
      "parameters": {
        "options": {},
        "conditions": {
          "options": {
            "version": 2,
            "leftValue": "",
            "caseSensitive": true,
            "typeValidation": "strict"
          },
          "combinator": "and",
          "conditions": [
            {
              "id": "bd6c6da6-8af7-4809-b6cd-01a38d71953b",
              "operator": {
                "type": "string",
                "operation": "startsWith"
              },
              "leftValue": "={{ $json.links }}",
              "rightValue": "https://"
            }
          ]
        }
      },
      "typeVersion": 2.2
    },
    {
      "id": "cb121b70-a14a-4cbd-a54c-e55c6fc235b7",
      "name": "Note: Link Processing",
      "type": "n8n-nodes-base.stickyNote",
      "position": [
        3216,
        3056
      ],
      "parameters": {
        "color": 2,
        "width": 556,
        "height": 224,
        "content": "## 🔄 Link Processing Nodes\n\n**Purpose:** Remove Duplicate Links cleans the list; Filter Real Hyperlinks keeps only links that start with https://; Separate Images and Links routes via regex.\n\n**Note:** Switch output 0: Images, 1: Content links; adjust regex for custom extensions."
      },
      "typeVersion": 1
    },
    {
      "id": "d69c0dc2-2c4c-474b-ba11-3d79e1390b12",
      "name": "Separate Images and Links",
      "type": "n8n-nodes-base.switch",
      "position": [
        2480,
        3680
      ],
      "parameters": {
        "rules": {
          "values": [
            {
              "outputKey": "Images",
              "conditions": {
                "options": {
                  "version": 2,
                  "leftValue": "",
                  "caseSensitive": true,
                  "typeValidation": "strict"
                },
                "combinator": "and",
                "conditions": [
                  {
                    "id": "16724958-4eea-489d-b494-3d76a3ba2562",
                    "operator": {
                      "type": "string",
                      "operation": "regex"
                    },
                    "leftValue": "={{ $json.links }}",
                    "rightValue": "=^https?:\\/\\/.*\\.(?:png|jpe?g|gif|webp|bmp|svg|ico)(?:\\?.*)?$"
                  }
                ]
              },
              "renameOutput": true
            },
            {
              "outputKey": "Links",
              "conditions": {
                "options": {
                  "version": 2,
                  "leftValue": "",
                  "caseSensitive": true,
                  "typeValidation": "strict"
                },
                "combinator": "and",
                "conditions": [
                  {
                    "id": "816392f0-96db-4134-8bee-4b74688ff929",
                    "operator": {
                      "type": "string",
                      "operation": "notRegex"
                    },
                    "leftValue": "={{ $json.links }}",
                    "rightValue": "=^https?:\\/\\/.*\\.(?:png|jpe?g|gif|webp|bmp|svg|ico)(?:\\?.*)?$"
                  }
                ]
              },
              "renameOutput": true
            }
          ]
        },
        "options": {}
      },
      "typeVersion": 3.2
    },
    {
      "id": "23896343-575e-4956-8e95-3b5e6e4c8ae7",
      "name": "Aggregate Images",
      "type": "n8n-nodes-base.aggregate",
      "position": [
        2736,
        3504
      ],
      "parameters": {
        "options": {},
        "fieldsToAggregate": {
          "fieldToAggregate": [
            {
              "fieldToAggregate": "links"
            }
          ]
        }
      },
      "typeVersion": 1
    },
    {
      "id": "fcad347b-60d7-4fa2-9b02-e96c2f27116d",
      "name": "Aggregate Links",
      "type": "n8n-nodes-base.aggregate",
      "position": [
        2736,
        3696
      ],
      "parameters": {
        "options": {},
        "fieldsToAggregate": {
          "fieldToAggregate": [
            {
              "fieldToAggregate": "links"
            }
          ]
        }
      },
      "typeVersion": 1
    },
    {
      "id": "fc5d6ce1-1765-4768-a9c7-de3677e8109d",
      "name": "Scrape Content Links",
      "type": "n8n-nodes-base.httpRequest",
      "position": [
        2736,
        3872
      ],
      "parameters": {
        "url": "={{ $json.links }}",
        "options": {}
      },
      "typeVersion": 4.2
    },
    {
      "id": "0d4b6a4e-b6cb-4e6c-9a22-bd0dc6a72027",
      "name": "Note: Content Scraping",
      "type": "n8n-nodes-base.stickyNote",
      "position": [
        2320,
        3984
      ],
      "parameters": {
        "color": 5,
        "width": 428,
        "height": 224,
        "content": "## 📄 Content Scraping & Aggregation Nodes\n\n**Purpose:** Scrape Content Links fetches pages; Convert to Markdown formats HTML; Aggregate Images/Links/Content combines outputs.\n\n**Note:** Markdown preserves structure for AI; slice content if exceeding sheet limits."
      },
      "typeVersion": 1
    },
    {
      "id": "349e5f7c-c81b-467b-a59b-ea40a47226f0",
      "name": "Convert to Markdown",
      "type": "n8n-nodes-base.markdown",
      "position": [
        2944,
        3872
      ],
      "parameters": {
        "html": "={{ $json.data }}",
        "options": {}
      },
      "typeVersion": 1
    },
    {
      "id": "24f22a31-03a3-4faf-81f4-3c38c0956ee4",
      "name": "Aggregate Scraped Content",
      "type": "n8n-nodes-base.aggregate",
      "position": [
        3136,
        3872
      ],
      "parameters": {
        "options": {},
        "fieldsToAggregate": {
          "fieldToAggregate": [
            {
              "fieldToAggregate": "data"
            }
          ]
        }
      },
      "typeVersion": 1
    },
    {
      "id": "a4d34aab-1af2-4196-85f5-1a2d832969dd",
      "name": "Add Images to Sheet",
      "type": "n8n-nodes-base.googleSheets",
      "position": [
        2944,
        3504
      ],
      "parameters": {
        "columns": {
          "value": {
            "Images": "={{ $json.links.join('\\n\\n') }}",
            "Website": "={{ $('Set Website').item.json.website_url }}"
          },
          "schema": [
            {
              "id": "Website",
              "type": "string",
              "display": true,
              "removed": false,
              "required": false,
              "displayName": "Website",
              "defaultMatch": false,
              "canBeUsedToMatch": true
            },
            {
              "id": "Links",
              "type": "string",
              "display": true,
              "removed": true,
              "required": false,
              "displayName": "Links",
              "defaultMatch": false,
              "canBeUsedToMatch": true
            },
            {
              "id": "Scraped Content",
              "type": "string",
              "display": true,
              "removed": true,
              "required": false,
              "displayName": "Scraped Content",
              "defaultMatch": false,
              "canBeUsedToMatch": true
            },
            {
              "id": "Images",
              "type": "string",
              "display": true,
              "required": false,
              "displayName": "Images",
              "defaultMatch": false,
              "canBeUsedToMatch": true
            }
          ],
          "mappingMode": "defineBelow",
          "matchingColumns": [
            "Website"
          ],
          "attemptToConvertTypes": false,
          "convertFieldsToString": false
        },
        "options": {},
        "operation": "appendOrUpdate",
        "sheetName": "your-sheet-name",
        "documentId": "your-document-id"
      },
      "credentials": {
        "googleSheetsOAuth2Api": {
          "id": "ZVbWK0SlohYDlZYO",
          "name": "Ewere"
        }
      },
      "typeVersion": 4.7
    },
    {
      "id": "6afbfad8-b80f-4a0d-81b4-9138cc2af46a",
      "name": "Add Links to Sheet",
      "type": "n8n-nodes-base.googleSheets",
      "position": [
        2944,
        3696
      ],
      "parameters": {
        "columns": {
          "value": {
            "Links": "={{ $json.links.join('\\n\\n') }}",
            "Website": "={{ $('Set Website').item.json.website_url }}"
          },
          "schema": [
            {
              "id": "Website",
              "type": "string",
              "display": true,
              "removed": false,
              "required": false,
              "displayName": "Website",
              "defaultMatch": false,
              "canBeUsedToMatch": true
            },
            {
              "id": "Links",
              "type": "string",
              "display": true,
              "removed": false,
              "required": false,
              "displayName": "Links",
              "defaultMatch": false,
              "canBeUsedToMatch": true
            },
            {
              "id": "Scraped Content",
              "type": "string",
              "display": true,
              "removed": true,
              "required": false,
              "displayName": "Scraped Content",
              "defaultMatch": false,
              "canBeUsedToMatch": true
            },
            {
              "id": "Images",
              "type": "string",
              "display": true,
              "removed": true,
              "required": false,
              "displayName": "Images",
              "defaultMatch": false,
              "canBeUsedToMatch": true
            }
          ],
          "mappingMode": "defineBelow",
          "matchingColumns": [
            "Website"
          ],
          "attemptToConvertTypes": false,
          "convertFieldsToString": false
        },
        "options": {},
        "operation": "appendOrUpdate",
        "sheetName": "your-sheet-name",
        "documentId": "your-document-id"
      },
      "credentials": {
        "googleSheetsOAuth2Api": {
          "id": "ZVbWK0SlohYDlZYO",
          "name": "Ewere"
        }
      },
      "typeVersion": 4.7
    },
    {
      "id": "35ae2c30-a93a-4fd2-82b6-07d2f4c56c88",
      "name": "Add Scraped Content to Sheet",
      "type": "n8n-nodes-base.googleSheets",
      "position": [
        3344,
        3872
      ],
      "parameters": {
        "columns": {
          "value": {
            "Website": "={{ $('Set Website').item.json.website_url }}",
            "Scraped Content": "={{ $json.data.join('\\n\\n').slice(0, 50000) }}"
          },
          "schema": [
            {
              "id": "Website",
              "type": "string",
              "display": true,
              "removed": false,
              "required": false,
              "displayName": "Website",
              "defaultMatch": false,
              "canBeUsedToMatch": true
            },
            {
              "id": "Links",
              "type": "string",
              "display": true,
              "removed": true,
              "required": false,
              "displayName": "Links",
              "defaultMatch": false,
              "canBeUsedToMatch": true
            },
            {
              "id": "Scraped Content",
              "type": "string",
              "display": true,
              "removed": false,
              "required": false,
              "displayName": "Scraped Content",
              "defaultMatch": false,
              "canBeUsedToMatch": true
            },
            {
              "id": "Images",
              "type": "string",
              "display": true,
              "removed": true,
              "required": false,
              "displayName": "Images",
              "defaultMatch": false,
              "canBeUsedToMatch": true
            }
          ],
          "mappingMode": "defineBelow",
          "matchingColumns": [
            "Website"
          ],
          "attemptToConvertTypes": false,
          "convertFieldsToString": false
        },
        "options": {},
        "operation": "appendOrUpdate",
        "sheetName": "your-sheet-name",
        "documentId": "your-document-id"
      },
      "credentials": {
        "googleSheetsOAuth2Api": {
          "id": "ZVbWK0SlohYDlZYO",
          "name": "Ewere"
        }
      },
      "typeVersion": 4.7
    },
    {
      "id": "c3f7b022-db11-400c-baaa-77392acfb991",
      "name": "Note: Sheet Integration",
      "type": "n8n-nodes-base.stickyNote",
      "position": [
        3232,
        4048
      ],
      "parameters": {
        "color": 3,
        "width": 444,
        "height": 176,
        "content": "## 📊 Sheet Integration Nodes\n\n**Purpose:** Add Images/Links/Scraped Content to Sheet appends or updates a row of aggregated data in Google Sheets.\n\n**Note:** Matches on 'Website' column; update documentId/sheetName for your sheet."
      },
      "typeVersion": 1
    }
  ],
  "pinData": {},
  "connections": {
    "Set Website": {
      "main": [
        [
          {
            "node": "Scrape Homepage",
            "type": "main",
            "index": 0
          }
        ]
      ]
    },
    "Split Links": {
      "main": [
        [
          {
            "node": "Remove Duplicate Links",
            "type": "main",
            "index": 0
          }
        ]
      ]
    },
    "Manual Trigger": {
      "main": [
        [
          {
            "node": "Set Website",
            "type": "main",
            "index": 0
          }
        ]
      ]
    },
    "Aggregate Links": {
      "main": [
        [
          {
            "node": "Add Links to Sheet",
            "type": "main",
            "index": 0
          }
        ]
      ]
    },
    "Scrape Homepage": {
      "main": [
        [
          {
            "node": "Extract Links from HTML",
            "type": "main",
            "index": 0
          }
        ]
      ]
    },
    "Aggregate Images": {
      "main": [
        [
          {
            "node": "Add Images to Sheet",
            "type": "main",
            "index": 0
          }
        ]
      ]
    },
    "Convert to Markdown": {
      "main": [
        [
          {
            "node": "Aggregate Scraped Content",
            "type": "main",
            "index": 0
          }
        ]
      ]
    },
    "Scrape Content Links": {
      "main": [
        [
          {
            "node": "Convert to Markdown",
            "type": "main",
            "index": 0
          }
        ]
      ]
    },
    "Filter Real Hyperlinks": {
      "main": [
        [
          {
            "node": "Separate Images and Links",
            "type": "main",
            "index": 0
          }
        ]
      ]
    },
    "Remove Duplicate Links": {
      "main": [
        [
          {
            "node": "Filter Real Hyperlinks",
            "type": "main",
            "index": 0
          }
        ]
      ]
    },
    "Extract Links from HTML": {
      "main": [
        [
          {
            "node": "Split Links",
            "type": "main",
            "index": 0
          }
        ]
      ]
    },
    "Aggregate Scraped Content": {
      "main": [
        [
          {
            "node": "Add Scraped Content to Sheet",
            "type": "main",
            "index": 0
          }
        ]
      ]
    },
    "Separate Images and Links": {
      "main": [
        [
          {
            "node": "Aggregate Images",
            "type": "main",
            "index": 0
          }
        ],
        [
          {
            "node": "Aggregate Links",
            "type": "main",
            "index": 0
          },
          {
            "node": "Scrape Content Links",
            "type": "main",
            "index": 0
          }
        ]
      ]
    }
  }
}
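
The link-processing chain in the workflow above (Extract Links from HTML, Remove Duplicate Links, Filter Real Hyperlinks, Separate Images and Links) can be approximated outside n8n for quick experiments. The following is a minimal Python sketch, not the workflow's own implementation: `HrefCollector` and `split_links` are hypothetical names, and the regex is the Switch node's pattern rewritten for Python. One thing it makes visible is that relative links such as `/contact` are dropped, because the filter only keeps values starting with `https://`.

```python
import re
from html.parser import HTMLParser

# The Switch node's image pattern, written as a Python raw string.
IMAGE_URL = re.compile(
    r"^https?://.*\.(?:png|jpe?g|gif|webp|bmp|svg|ico)(?:\?.*)?$"
)


class HrefCollector(HTMLParser):
    """Collects the href of every <a> tag, in document order (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value.strip())


def split_links(html):
    """Return (image_urls, page_urls) following the same steps as the workflow."""
    parser = HrefCollector()
    parser.feed(html)
    # Remove duplicates while preserving first-seen order.
    unique = list(dict.fromkeys(parser.hrefs))
    # Keep only absolute https:// links; relative and mailto: links are dropped.
    https_only = [url for url in unique if url.startswith("https://")]
    images = [url for url in https_only if IMAGE_URL.search(url)]
    pages = [url for url in https_only if not IMAGE_URL.search(url)]
    return images, pages
```

If relative links matter for a given site, resolving them against the homepage URL (for example with `urllib.parse.urljoin`) before the `https://` check would be one way to keep them; the workflow as published does not do this.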

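The Add Scraped Content to Sheet node joins all page Markdown with blank lines and cuts the result at 50,000 characters, which presumably reflects Google Sheets' 50,000-character limit per cell. The sketch below mirrors that expression in Python and adds the "split rows" alternative mentioned in the troubleshooting note. The function names and the `CELL_LIMIT` constant are assumptions made for illustration; note also that JavaScript's `slice` counts UTF-16 code units while Python slicing counts code points, so the cut point can differ for text containing emoji.

```python
CELL_LIMIT = 50_000  # same bound as slice(0, 50000) in the sheet expression


def join_and_truncate(pages, limit=CELL_LIMIT):
    """Mirror of data.join('\\n\\n').slice(0, 50000): anything past the limit is lost."""
    return "\n\n".join(pages)[:limit]


def join_and_chunk(pages, limit=CELL_LIMIT):
    """Alternative: split the joined text into pieces that each fit one cell."""
    text = "\n\n".join(pages)
    return [text[start:start + limit] for start in range(0, len(text), limit)]
```

Each chunk returned by `join_and_chunk` could be written to its own row. Because the sheet nodes match on the Website column with `appendOrUpdate`, writing several rows for one site would also need a different matching key or a plain append, which is a design change rather than a drop-in fix.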