What is robots.txt? Allowing AI crawlers, blocking malicious bots, and an optimized robots.txt file

Hello. I'm No Tuggeun, and besides this blog, I run various other sites. I host everything—my company homepage, personal blog, and even WordPress sites I've built for clients—on a single AWS Lightsail instance.

Running everything on a single instance keeps costs low.

However, there's a downside. With dynamic sites and static pages all on one server, what happens when traffic spikes? The entire setup can go down. (If even one static site goes down, it takes the company homepage and all of the client sites I host down with it.)

That's why I regularly check server traffic and spend time blocking "malicious bots."

For WordPress, you can keep server traffic under control and run things relatively stably by combining two basic measures: Wordfence plugin configuration and robots.txt rules that block malicious bots.

This article covers robots.txt and shares the optimized robots.txt file I've developed through experience.

What is robots.txt? (Concept Overview)

robots.txt is a file for communicating with robots, meaning search engine and AI crawlers. Crawlers (robots) are programs, like those from Google, Naver, and GPT, that scan websites.

The robots.txt file uses simple directives to separate the content you want shared from the content you don't, telling crawlers what to crawl and what not to crawl.

  • Example:
    • To have your homepage appear in Google search results 👉 Robots must be able to read your content
    • But what if they crawl login pages or admin screens? ❌ That's risky, so we must tell them not to crawl those pages.

So we use a file called robots.txt to tell them, "You can crawl this / Don't crawl this."
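
For example, a minimal robots.txt following this idea might look like the sketch below (the /admin/ path is just a hypothetical example of a page you would not want crawled):

# Let all bots crawl the site, but keep them out of the admin area (hypothetical path)
User-agent: *
Disallow: /admin/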

  • ❌ What if there's no robots.txt?
    • Wasted site traffic + security risk
    • Most bots crawl every page by default
    • Malicious bots can scrape and scan admin pages too

Where should the robots.txt file be located?

The robots.txt file must always be located in the domain's root directory.

https://내사이트주소.com/robots.txt

Accessing the above address allows both bots and humans to view the robots.txt file.

robots.txt Basic Syntax Reference Table

| Syntax | Meaning | Example | Description |
| --- | --- | --- | --- |
| User-agent: | Specifies the target robots | User-agent: * | Applies to all robots (crawlers): Googlebot, Bingbot, etc. |
| Disallow: | Sets paths to block | Disallow: /private/ | Prevents robots (crawlers) from crawling the specified path |
| Allow: | Sets paths to allow | Allow: /public/ | Allows robots (crawlers) to crawl the specified path |
| Sitemap: | Specifies the sitemap location | Sitemap: https://example.com/sitemap.xml | Points to the site structure to aid search engine optimization |
  • User-agent: Specifies who the instruction is for. Example: * = all bots, Googlebot = Google only
  • Disallow: "Don't look here." Example: /private/
  • Allow: "You may crawl this." Example: /wp-admin/admin-ajax.php
  • Sitemap: "The site structure is here." Used to tell search engines where the sitemap is located

"User-agent:"
This syntax tells you "who it's talking to." For example, User-agent: *If you write it like this, it applies to all bots, whether Googlebot or Naverbot. If you want to
tell only a specific bot, User-agent: Googlebot Write it like this:

"Disallow:"
This is a command prohibiting access: "Do not look at this path!" For example, Disallow: /private/ If you write `/index.html`, the robot will not read the content below. example.com/private/ will not read the content below.

"Allow:"
Conversely, this is permission saying "You can crawl here!" It's mainly used Disallow:" when you block everything and only open exceptions within it.

"Sitemap:" It tells
search engines, "Here's the map of our house!" Having a sitemap file helps search engines understand your site better and expose it more.
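
Putting the four directives together, a minimal robots.txt sketch might look like this (the /private/ paths and the example.com domain are hypothetical):

# Rules for every crawler
User-agent: *
# Keep robots out of the private area
Disallow: /private/
# ...but allow this one page inside it
Allow: /private/open-page.html
# Point search engines at the sitemap
Sitemap: https://example.com/sitemap.xml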

Frequently used robots.txt configurations

1. Allow access to entire site: All robots can crawl everything!

User-agent: *
Disallow:

2. Block entire site: Absolutely no access. Not visible to search engines either.

User-agent: *
Disallow: /

3. Block specific crawlers (e.g., AhrefsBot): Block backlink scanning bots like Ahrefs that generate traffic

User-agent: AhrefsBot
Disallow: /

4. Block a specific folder: Access to everything under /private/ is prohibited

User-agent: *
Disallow: /private/

You cannot "block humans" with robots.txt

robots.txt only applies to robots. Humans accessing directly via browser will see everything.

To block humans, you can:

  • Redirect them to a login page
  • Implement a member authentication system
  • Use server-side User-Agent filtering to block visitors from a page or redirect them to the login page

WordPress Default robots.txt

The code below is the default robots.txt file that WordPress generates automatically on installation.

# WordPress default settings
User-agent: *
Disallow: /wp-admin/
Disallow: /wp-login.php
Allow: /wp-admin/admin-ajax.php

Sitemap: https://사이트주소.com/sitemap_index.xml

📌 If you run with the default WordPress robots.txt, you may experience server downtime from increased crawler traffic. (Server downtime in WordPress can stem from various causes, for example low-cost hosting, robots.txt issues, server crashes, and plugin conflicts.)

Nowadays, beyond simple search engine crawlers, AI crawlers are becoming increasingly common.

GPTBot, ClaudeBot, Applebot, Perplexity… While some AI bots are welcome, others are malicious bots that only generate traffic and scrape your content.

For the bots that can be put to good use (excluding the malicious ones), I've organized the robots.txt file so they can still crawl the site properly.

AI Crawler Control + Malicious Bot Blocking Version (2025.05.23)

The robots.txt file I created follows these principles:

| Item | Setting | Purpose |
| --- | --- | --- |
| WordPress default security | Disallow rules | Block the login page |
| AI crawlers | Crawl-delay | Allow positive exposure but control crawl speed |
| Malicious bots | Disallow: / | Block traffic-heavy content and data scraping |
| Search engines | Allow + Sitemap | Maintain SEO optimization |
  • 1. Allow AI crawlers, but throttle their speed
    • Set Crawl-delay: 30 for GPTBot, Gemini, Applebot, and the like
    • "You may scrape our content, but come slowly."
  • 2. Block malicious bots outright
    • Backlink-analysis bots such as Ahrefs, Semrush, and MJ12: completely blocked
    • DataForSeoBot, barkrowler, and other unidentified data-scraping bots: out
  • 3. Block suspicious crawlers based in Russia/China
    • Yandex, PetalBot, MauiBot, and the like are handled with Disallow: /

You can apply this robots.txt in two ways: download the file and upload it directly to your root folder, or copy and paste the robots.txt code provided below.

robots.txt file distribution methods

🔹 Method 1: Directly download the robots.txt file and upload it to the root

🔹 Method 2: Copy + Paste the code below

**WordPress robots.txt optimization code (AI bot control + malicious bot blocking)**

# == WordPress defaults ==
User-agent: *
Disallow: /wp-admin/
Disallow: /wp-login.php
Allow: /wp-admin/admin-ajax.php

# ==============================================
# 🤖 AI & SEO crawler control settings - by No Tuggeun
# Manages traffic-generating crawlers such as GPTBot, Ahrefs, and Baidu
# robots.txt v2025.05.23
# ==============================================

# ====================================
# 🧠 Korean AI crawlers
# ====================================

# Naver's CLOVA AI crawler
User-agent: CLOVA
Crawl-delay: 30

# Kakao's AI and search crawler
User-agent: KakaoBot
Crawl-delay: 30

# ====================================
# 🌎 Global AI crawlers - allowed, with a delay only
# ====================================

# OpenAI's crawler for ChatGPT (official)
User-agent: GPTBot
Crawl-delay: 30

# Google's Gemini (Bard) AI-related crawler (presumed)
User-agent: Gemini
Crawl-delay: 30

# Microsoft's Copilot (integrates with VS Code, etc.)
User-agent: Copilot
Crawl-delay: 30

# Generic User-agent for Anthropic's Claude AI (no separate official confirmation)
User-agent: Claude
Crawl-delay: 30

# Perplexity AI's search-oriented LLM bot
User-agent: Perplexity
Crawl-delay: 30

# General user requests made through ChatGPT (when an unofficial User-agent is used)
User-agent: ChatGPT-User
Crawl-delay: 30

# ====================================
# 🍏 Apple & Microsoft AI crawlers - allowed, with a delay only
# ====================================

# 🍏 For Apple's Siri/Spotlight
User-agent: Applebot
Crawl-delay: 30

# Apple's extended crawler for AI training
User-agent: Applebot-Extended
Crawl-delay: 30

# Bing AI-based bot (tied to Copilot)
User-agent: Bing AI
Crawl-delay: 30

# ====================================
# 🌐 Global translation/search/conversational AI
# ====================================

# Crawler tied to the DeepL translation service
User-agent: DeepL
Crawl-delay: 30

# Character-based conversational AI service (Character.AI)
User-agent: Character.AI
Crawl-delay: 30

# Quora's Poe AI or related crawler
User-agent: Quora
Crawl-delay: 30

# Microsoft's experimental conversational model DialoGPT (presumed User-agent)
User-agent: DialoGPT
Crawl-delay: 30

# Otter.ai meeting transcription and voice analysis service
User-agent: Otter
Crawl-delay: 30

# Socratic, a study Q&A AI app for students (owned by Google)
User-agent: Socratic
Crawl-delay: 30

# ====================================
# ✍️ AI content auto-generation tools
# ====================================

# Writesonic (ChatGPT-alternative AI copywriter/editor)
User-agent: Writesonic
Crawl-delay: 30

# CopyAI (copywriting AI aimed at startups)
User-agent: CopyAI
Crawl-delay: 30

# Jasper (professional marketing/blog AI)
User-agent: Jasper
Crawl-delay: 30

# ELSA speaking/English conversation coaching AI
User-agent: ELSA
Crawl-delay: 30

# Codium (code automation AI), Git integration
User-agent: Codium
Crawl-delay: 30

# TabNine (VS Code-based coding AI)
User-agent: TabNine
Crawl-delay: 30

# Vaiv (Korean AI startup, NLP services)
User-agent: Vaiv
Crawl-delay: 30

# Bagoodex (origin unknown, presumed data-collection crawler)
User-agent: Bagoodex
Crawl-delay: 30

# You.com's YouChat AI bot
User-agent: YouChat
Crawl-delay: 30

# China-based iAsk AI search/QA bot
User-agent: iAsk
Crawl-delay: 30

# Komo.ai - privacy-focused AI search
User-agent: Komo
Crawl-delay: 30

# Hix AI - AI specialized in content generation
User-agent: Hix
Crawl-delay: 30

# ThinkAny - ChatGPT-based AI platform
User-agent: ThinkAny
Crawl-delay: 30

# AI summaries/search built on the Brave search engine
User-agent: Brave
Crawl-delay: 30

# Lilys - presumed AI recommendation engine/chatbot
User-agent: Lilys
Crawl-delay: 30

# Sidetrade Indexer Bot - crawler from an AI sales CRM
User-agent: Sidetrade Indexer Bot
Crawl-delay: 30

# AI training bot based on Common Crawl
User-agent: CCBot
Crawl-delay: 30

# Placeholder for registering custom AI crawlers later
User-agent: AI-Bot-Name
Crawl-delay: 30

# ====================================
# 🧠 Other major AI/web crawlers (including ones added earlier)
# ====================================

# Anthropic's official Claude crawler
User-agent: ClaudeBot
Crawl-delay: 30

# Claude's web-only crawler
User-agent: Claude-Web
Crawl-delay: 30

# Google's crawler for AI training
User-agent: Google-Extended
Crawl-delay: 30

# Other Google crawlers
User-agent: GoogleOther
Crawl-delay: 30

# Google Search Console inspection tool crawler
User-agent: Google-InspectionTool
Crawl-delay: 30

# Google Cloud Vertex AI crawler
User-agent: Google-CloudVertexBot
Crawl-delay: 30

# DuckDuckGo's AI summary support bot
User-agent: DuckAssistBot
Crawl-delay: 30

# Diffbot, which turns web pages into structured data
User-agent: Diffbot
Crawl-delay: 30

# Kagi search engine's advanced AI summary crawler
User-agent: Teclis
Crawl-delay: 30

# ====================================
# 🔍 Other unnecessary crawlers - delay only
# ====================================

# Chinese search engine Baidu - unnecessary for Korean sites
User-agent: Baiduspider
Crawl-delay: 300

# 📊 Marketing analytics/advertising bots - can cause excessive traffic
User-agent: BomboraBot
Crawl-delay: 300

User-agent: Buck
Crawl-delay: 300

User-agent: startmebot
Crawl-delay: 300

# ==============================
# ❌ Crawlers that must be fully blocked
# ==============================

# 🦾 Backlink analysis tools - scrape every page
User-agent: MJ12bot
Disallow: /

User-agent: AhrefsBot
Disallow: /

User-agent: SemrushBot
Disallow: /

# 🛑 Block Chinese/Russian/advertising bots used for traffic and data analysis
User-agent: PetalBot
Disallow: /

User-agent: MediaMathbot
Disallow: /

User-agent: Bidswitchbot
Disallow: /

User-agent: barkrowler
Disallow: /

User-agent: DataForSeoBot
Disallow: /

User-agent: DotBot
Disallow: /

User-agent: Bytespider
Disallow: /

User-agent: CensysInspect
Disallow: /

User-agent: rss2tg bot
Disallow: /

User-agent: proximic
Disallow: /

User-agent: Yandex
Disallow: /

User-agent: MauiBot
Disallow: /

User-agent: AspiegelBot
Disallow: /

Sitemap: https://사이트주소.com/sitemap_index.xml

robots.txt Management Tips

  • Use the robots.txt inspection feature in Google Search Console
  • When server traffic spikes, check the crawl logs and immediately register any new bots in robots.txt (see the snippet below)
  • Even static pages can crash your server if bots scrape them heavily, so always keep monitoring
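
For example, if an unfamiliar crawler (here the hypothetical "NewScraperBot") keeps showing up in the access logs, you can append a block for it above the Sitemap line:

# Hypothetical example: a newly spotted crawler found in the server logs
User-agent: NewScraperBot
Disallow: /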
