About Assaybot

Assaybot

Information for web publishers on Index Exchange’s site crawler bot.

Assaybot is Index Exchange’s automated content analysis crawler designed to ensure brand safety across our advertising exchange. It uses a multi-stage AI classification pipeline to analyze web page content, detect potential brand safety concerns, and help maintain high-quality inventory standards that protect both advertiser and publisher interests.

Purpose

Assaybot operates as part of Index Exchange’s quality assurance infrastructure. The system:

  • Analyzes publisher page content for brand safety compliance using industry-leading AI models
  • Identifies potential concerns, including adult content, hate speech, violence, child sexual abuse material (CSAM), and other material that may affect advertiser confidence
  • Helps publishers maintain and grow advertiser demand by ensuring inventory meets brand safety standards
  • Operates entirely outside of the ad serving path — it has zero impact on ad delivery latency or page load performance

How It Benefits Publishers

Brand safety is a shared priority. When advertisers trust the quality of your inventory, it drives stronger demand and better monetization outcomes. Assaybot helps by:

  • Proactively identifying content issues before they affect your revenue
  • Providing transparent, consistent quality assessments across Index Exchange
  • Reducing the need for manual review processes that can delay issue resolution
  • Ensuring your inventory remains eligible for premium advertiser demand

Assaybot does not affect your site’s search engine rankings or visibility. It does not index content for public search, and it does not redistribute your content in any form. It is exclusively used for content quality assessment within Index Exchange’s advertising ecosystem.


This documentation is maintained by Index Exchange and reflects the current state of the Assaybot system. Publishers will be notified of significant changes to crawl behavior or capabilities.

User Agent and Network

Assaybot identifies itself using the following HTTP user-agent request header:

Mozilla/5.0 (compatible; Assaybot/0.1; +http://www.indexexchange.dev/bot.html)

Assaybot always sends this user-agent string with every request. It does not attempt to disguise itself as a browser or any other client.

Important Security Note: The HTTP user-agent request header can be spoofed by other crawlers. For verification purposes, publishers should validate requests using IP address verification rather than relying solely on the user-agent string.

Allowing Assaybot in Your robots.txt File

If your robots.txt file contains a global Disallow: rule, add the following line to the group that lists your allowed crawlers so that Assaybot is not caught by that rule:

User-agent: Assaybot

Assaybot identifies itself with the product token Assaybot and follows RFC 9309 / Google robots.txt semantics, so it will pick up the authorization from that group.
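As an illustration, the sketch below uses Python's standard urllib.robotparser module to show the effect of listing Assaybot in an allowed group. The robots.txt contents (including the Googlebot entry) and the publisher.example URL are hypothetical. Python's parser applies the first matching rule rather than the longest-match rule of RFC 9309, which makes no difference for a file this simple.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: one allowed-crawlers group plus a global Disallow.
ROBOTS_TXT = """\
User-agent: Googlebot
User-agent: Assaybot
Allow: /

User-agent: *
Disallow: /
"""


def is_allowed(robots_txt: str, product_token: str, url: str) -> bool:
    """Return True if robots_txt lets product_token fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(product_token, url)


if __name__ == "__main__":
    url = "https://publisher.example/article"
    print(is_allowed(ROBOTS_TXT, "Assaybot", url))      # listed in the allowed group
    print(is_allowed(ROBOTS_TXT, "SomeOtherBot", url))  # falls through to the global Disallow
```

Pass the bare product token (Assaybot) rather than the full user-agent string, because urllib.robotparser matches on the token only.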

Authorized IP Address CIDR

Any request that claims to be Assaybot but originates outside this address space can be treated as a spoofed user-agent request.

192.139.80.0/24

Verification Recommendations

For CDN operators and network administrators who want to verify Assaybot traffic:

  1. IP Verification: Confirm the source IP falls within the authorized CIDR range shown above
  2. User-Agent Check: Verify the user-agent string matches the format shown above
  3. Behavioral Pattern: Assaybot makes only standard HTTP GET requests, respects robots.txt directives, and does not attempt to bypass authentication or access controls

If you need additional assistance with verification or allowlisting, contact your Index Exchange account representative.
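The first two checks above can be sketched with Python's standard ipaddress module. This is a minimal illustration, not an official verification tool: the function name and the three labels it returns are hypothetical, and it assumes you pass the real connection IP rather than a client-supplied header value.

```python
import ipaddress

# Values taken from this page.
AUTHORIZED_NETWORK = ipaddress.ip_network("192.139.80.0/24")
PRODUCT_TOKEN = "Assaybot"


def classify_request(source_ip: str, user_agent: str) -> str:
    """Classify a request as 'assaybot', 'spoofed', or 'other'."""
    # Requests that never claim to be Assaybot are out of scope here.
    if PRODUCT_TOKEN.lower() not in user_agent.lower():
        return "other"
    try:
        in_range = ipaddress.ip_address(source_ip) in AUTHORIZED_NETWORK
    except ValueError:
        # An unparseable address cannot be verified, so treat it as spoofed.
        in_range = False
    return "assaybot" if in_range else "spoofed"
```

For example, a request carrying the documented user-agent from 192.139.80.17 is classified as assaybot, while the same user-agent from 203.0.113.5 is classified as spoofed.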

Crawl Behavior

Access Frequency

Assaybot is designed to minimize impact on publisher infrastructure:

  • Deduplication: Assaybot maintains a multi-layer deduplication system to prevent redundant requests, so each URL is analyzed at most once within a 30-day rolling window. A short-term cache prevents duplicate requests within the same day, and a long-term filter keeps URLs from being re-crawled for up to 30 days
  • Per-Domain Concurrency: Assaybot limits the number of simultaneous requests to any single domain, ensuring no individual site experiences excessive load
  • Timeout Period: Each page request has a 30-second timeout — if your server does not respond within that window, Assaybot moves on
  • Retry Logic: Failed requests (server errors or rate limit responses) are retried up to 3 times with exponential backoff, increasing the delay between each attempt to avoid adding pressure to an already-strained server
  • Rate Limit Compliance: If your server returns a 429 Too Many Requests response, Assaybot will back off automatically and retry with increasing delays
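To give a feel for what the retry behavior above means for server load, here is a rough sketch of an exponential backoff schedule. Only the limit of 3 retries comes from this page. The 1-second base delay and the doubling factor are assumptions chosen for illustration, since the actual delays are not documented.

```python
def backoff_delays(max_retries: int = 3,
                   base_seconds: float = 1.0,
                   factor: float = 2.0) -> list:
    """Delay in seconds to wait before each retry attempt.

    base_seconds and factor are illustrative assumptions, not documented values.
    """
    return [base_seconds * factor ** attempt for attempt in range(max_retries)]


def max_requests_per_url(max_retries: int = 3) -> int:
    """One initial request plus every permitted retry."""
    return 1 + max_retries
```

With these assumed defaults, backoff_delays() returns [1.0, 2.0, 4.0], and a URL whose server keeps failing would receive at most 4 requests in total.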

What Assaybot Accesses

Assaybot analyzes URLs that appear in ad request traffic flowing through Index Exchange. The system:

  • Processes page URLs and referrer URLs observed in ad request data
  • Makes a single HTTP GET request per URL to retrieve the page content (failed requests may be retried, as described under Access Frequency)
  • Extracts visible text content for brand safety analysis
  • Stores analysis results internally for quality assurance reporting
  • Does not index content for public search or external redistribution
  • Does not execute JavaScript, submit forms, or interact with page elements
  • Does not follow links on the page to discover new URLs — it only visits URLs already observed in ad traffic

Content Analysis Method

Assaybot uses a straightforward content retrieval approach:

  • Makes standard HTTP GET requests using the documented user-agent string
  • Extracts visible text content from the HTML response
  • Strips scripts, styles, navigation elements, and other non-visible content
  • Follows redirects automatically (up to 5 hops)
  • Timeout: 30 seconds per request
  • Cookies are disabled — Assaybot does not send or store cookies

The extracted text is then passed through a multi-stage AI classification pipeline to assess brand safety. No images, videos, or other media are downloaded or analyzed as of the last update to this guide.
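The extraction step can be approximated with Python's standard html.parser module, which may help when estimating what text a page exposes to this kind of analysis. This is a sketch under assumptions, not Assaybot's actual extractor: the class and function names are hypothetical, and treating the nav element as the "navigation elements" to strip is a guess.

```python
from html.parser import HTMLParser

# Assumption: these tags stand in for "scripts, styles, navigation elements".
SKIPPED_TAGS = {"script", "style", "nav"}


class VisibleTextExtractor(HTMLParser):
    """Collects text that is not nested inside a skipped element."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0   # > 0 while inside a skipped element
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self._chunks.append(data.strip())

    def text(self) -> str:
        return " ".join(self._chunks)


def extract_visible_text(html: str) -> str:
    """Return the text of html with skipped elements removed."""
    parser = VisibleTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()
```

Because no JavaScript is executed, text that a page renders only on the client side would not appear in output produced this way.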

Domain Safe List and Block List

Assaybot maintains curated domain lists to optimize system resources and focus analysis where it is most needed:

  • Safe List: Well-known, trusted publisher domains (such as major news outlets) are automatically classified as safe and are not crawled, saving resources for both Assaybot and the publisher
  • Block List: Domains that are already known to be non-compliant are excluded from crawling

These lists are maintained by Index Exchange and updated on a regular basis. If you believe your domain has been incorrectly categorized, please contact your account representative.

Data Collection & Privacy

Information Collected

For each analyzed page, Assaybot records:

  • URL and Domain: The full URL and root domain of the analyzed page
  • Publisher ID: Internal Index Exchange identifier linking to your account
  • Extracted Text: Visible text content extracted from the HTML page (scripts, styles, and non-visible elements are excluded)
  • HTTP Status: The response code returned by your server
  • Brand Safety Verdict: The classification result (safe or unsafe) along with the confidence score
  • Processing Metadata: Timestamps, response latency, and which classification stage produced the verdict

Assaybot does not collect:

  • Personally identifiable information (PII) from page visitors
  • Cookies or session data
  • Form data or user-submitted content
  • Images, videos, or other media files
  • Information from password-protected or authenticated pages

Data Storage and Retention

  • Analysis results are stored in compressed columnar format partitioned by date
  • Automated lifecycle policies manage data retention and archival
  • Data is accessible only to authorized Index Exchange personnel and relevant publisher account teams

Data Usage

Analysis results are used exclusively for:

  • Brand safety quality assurance across Index Exchange’s supply network
  • Publisher account management and content quality reporting
  • Advertiser protection and inventory curation
  • System performance monitoring and optimization
  • Regulatory compliance reporting

Regulations

Assaybot’s content analysis is designed to comply with:

  • GDPR: No personal data is intentionally collected; analysis focuses exclusively on publicly available published content
  • CCPA: Text content analysis falls under business operations exemptions
  • Industry Standards: Aligned with IAB brand safety guidelines and frameworks

Publishers with specific privacy concerns should contact their Index Exchange account representative.

Technical Specifications

Request Characteristics

  • Protocol: HTTPS only
  • HTTP Method: GET (read-only; Assaybot never POSTs data to publisher sites)
  • Connection: Keep-alive
  • Accept-Encoding: gzip
  • Accept: text/html, application/xhtml+xml
  • Accept-Language: en-US,en;
  • DNT: 1 (Do Not Track enabled)
  • Cookies: Disabled — Assaybot does not send or store cookies
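To see how your own stack responds to a request shaped like the one described above, you could build a comparable request with Python's standard urllib. The header values are copied from this page; the function name and the publisher.example URL are hypothetical. A request sent from your own machine will not originate from the authorized IP range, so it exercises user-agent rules only, not IP allowlists.

```python
import urllib.request

# Header values as documented on this page. Connection is omitted because
# urllib always sends "Connection: close", so keep-alive is not reproduced.
ASSAYBOT_HEADERS = {
    "User-Agent": "Mozilla/5.0 (compatible; Assaybot/0.1; +http://www.indexexchange.dev/bot.html)",
    "Accept": "text/html, application/xhtml+xml",
    "Accept-Encoding": "gzip",
    "Accept-Language": "en-US,en;",
    "DNT": "1",
}


def build_test_request(url: str) -> urllib.request.Request:
    """Build (but do not send) a GET request with Assaybot-like headers."""
    if not url.lower().startswith("https://"):
        raise ValueError("Assaybot is documented as HTTPS only")
    return urllib.request.Request(url, headers=ASSAYBOT_HEADERS, method="GET")


if __name__ == "__main__":
    request = build_test_request("https://publisher.example/article")
    # Sending is left commented out so the sketch has no network side effects:
    # with urllib.request.urlopen(request, timeout=30) as response:
    #     print(response.status)
    print(request.get_method(), request.full_url)
```

No cookie handling is configured, which matches the documented behavior of neither sending nor storing cookies.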

Content Processing

  • HTML Processing: Assaybot processes the full HTML response, extracting only visible text content
  • Text Extraction: Scripts, styles, navigation elements, and non-visible markup are stripped before analysis
  • Media: Images, videos, and other media files are not downloaded

HTTP Status Handling

  • 2xx Success: Content is extracted and analyzed normally
  • 3xx Redirects: Followed automatically (up to 5 redirects per request)
  • 4xx Client Errors (other than 429): Logged and not retried, because Assaybot respects access restrictions
  • 429 Too Many Requests: Retried with exponential backoff (automatically backs off to reduce load)
  • 5xx Server Errors: Retried up to 3 times with exponential backoff, then logged as failed
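The status handling above amounts to a small decision table, sketched below. The function name and action labels are hypothetical. Applying the 3-retry limit to 429 responses follows the Retry Logic description under Access Frequency, and the 5-redirect limit is not modeled.

```python
MAX_RETRIES = 3  # documented retry limit


def next_action(status: int, retries_so_far: int) -> str:
    """Return 'analyze', 'follow_redirect', 'retry', or 'log_failure'."""
    if 200 <= status < 300:
        return "analyze"
    if 300 <= status < 400:
        return "follow_redirect"
    if status == 429 or 500 <= status < 600:
        # Rate limiting and server errors are retried with backoff until
        # the retry budget is spent.
        return "retry" if retries_so_far < MAX_RETRIES else "log_failure"
    # Remaining 4xx responses (for example 401, 403, 404) are not retried.
    return "log_failure"
```

For example, a 503 on the first attempt yields retry, a 503 after three retries yields log_failure, and a 403 yields log_failure immediately.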

robots.txt Compliance

Assaybot fully respects the Robots Exclusion Standard. In exceptional cases, if your inventory is classified as higher risk, the team may request that a robots.txt exception be applied to one or more of your domains, although this is extremely rare.

Before crawling any page, Assaybot checks the site’s robots.txt file and honors all applicable directives, including:

  • Rules in a User-agent: Assaybot group (checked first)
  • Rules in the User-agent: * wildcard group (used as fallback)
  • Disallow and Allow directives
  • Crawl-delay specifications

robots.txt responses are cached so that your server is not repeatedly queried for the same file.

ℹ️ Publisher Recommendation: Publishers may add robots.txt rules for Assaybot at any time. Note: restricting access may impact eligibility to transact on Index Exchange for certain inventory. Exceptions to the robots.txt policy will be handled on a case-by-case basis through your account representative.

Managing Assaybot Access

To ensure optimal brand safety monitoring and maintain good standing in Index Exchange’s supply network, we recommend allowing Assaybot full access to your publicly available content.

Benefits of allowing access:

  • Proactive identification of potential content issues before they affect your revenue
  • Faster resolution of brand safety concerns with automated, consistent analysis
  • Maintained eligibility for premium advertiser demand across Index Exchange
  • Transparent, data-driven content quality assessments available through your account team

Assaybot is designed to be a good citizen on your infrastructure. It respects robots.txt, limits concurrent requests per domain, backs off automatically when rate-limited, and will never crawl the same URL more than once within a 30-day window.

Restricting or Blocking Access

Assaybot fully supports robots.txt directives, giving you granular control over what it can access. Publishers who choose to restrict or block Assaybot should be aware:

  • Quality Assurance Impact: Content that cannot be analyzed may require manual review processes, potentially causing delays in brand safety assessments
  • Demand Eligibility: Blocking may impact eligibility to transact on Index Exchange for certain inventory, as automated brand safety verification cannot be completed
  • Account Coordination: Significant restrictions may require additional coordination with your account team

To block Assaybot entirely, add the following to your robots.txt file:

User-agent: Assaybot
Disallow: /

To allow access to most of your site while restricting specific sections:

User-agent: Assaybot
Disallow: /private/
Disallow: /admin/
Allow: /

To set a crawl delay (seconds between requests):

User-agent: Assaybot
Crawl-delay: 10

If you have questions about how access restrictions may affect your account, please contact your Index Exchange account representative.
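Before deploying rules like those above, you can check them locally with Python's standard urllib.robotparser. The sketch below combines the section-restriction and crawl-delay examples into one hypothetical file. The function name and the publisher.example host are illustrative. Python's parser applies the first matching rule, so the result matches longest-match semantics here only because the Disallow lines come before Allow: /.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical file combining two of the examples from this section.
ROBOTS_TXT = """\
User-agent: Assaybot
Disallow: /private/
Disallow: /admin/
Allow: /
Crawl-delay: 10
"""


def summarize(robots_txt: str, paths, product_token: str = "Assaybot"):
    """Return ({path: allowed}, crawl_delay) for product_token."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    access = {
        path: parser.can_fetch(product_token, "https://publisher.example" + path)
        for path in paths
    }
    return access, parser.crawl_delay(product_token)
```

Calling summarize(ROBOTS_TXT, ["/news/story", "/private/x", "/admin/"]) reports /news/story as allowed, the other two paths as disallowed, and a crawl delay of 10.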

Troubleshooting & Common Issues

High Request Volume

If you notice unexpectedly high request volume from Assaybot:

  1. Verify Authenticity: First, confirm the requests are genuinely from Assaybot by checking the user-agent string and verifying the source IP against Index Exchange’s authorized IP range (192.139.80.0/24). Requests from outside this range using the Assaybot user-agent are spoofed
  2. Check Deduplication: Assaybot should not request the same URL more than once within a 30-day period. If you are seeing repeated requests to the same URL, the traffic may not be from Assaybot
  3. Use robots.txt: You can set a Crawl-delay directive in your robots.txt file to control how frequently Assaybot makes requests to your site
  4. Contact Support: If the issue persists after verification, reach out to your account representative with sample request logs (timestamps, URLs, source IPs) and Index Exchange will investigate
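The deduplication check in step 2 can be run against your own access logs. The sketch below assumes you have already filtered the log to requests carrying the Assaybot user-agent and parsed each entry into a (timestamp, url, status) tuple; that record shape and the function name are hypothetical. Requests answered with 429 or 5xx are ignored, since those may legitimately be retried.

```python
from collections import defaultdict
from datetime import timedelta

WINDOW = timedelta(days=30)  # documented deduplication window


def find_repeated_urls(records):
    """Map each URL to pairs of requests made less than 30 days apart.

    records: iterable of (timestamp: datetime, url: str, status: int).
    """
    by_url = defaultdict(list)
    for timestamp, url, status in records:
        # Skip responses that the crawler is documented to retry.
        if status == 429 or 500 <= status < 600:
            continue
        by_url[url].append(timestamp)

    repeats = {}
    for url, stamps in by_url.items():
        stamps.sort()
        close_pairs = [(a, b) for a, b in zip(stamps, stamps[1:]) if b - a < WINDOW]
        if close_pairs:
            repeats[url] = close_pairs
    return repeats
```

Any URL that appears in the result is a candidate for the spoofing check in step 1 and is worth including in the sample logs you send to your account representative.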

WAF and CDN Configuration

If Assaybot is being blocked by your Web Application Firewall (WAF) or CDN:

  • Allowlist by IP: Add Index Exchange’s authorized IP range (192.139.80.0/24) to your WAF/CDN allowlist
  • Allowlist by User-Agent: Add Assaybot to your bot allowlist. Note that IP verification is more secure than user-agent matching alone
  • Rate Limiting: If your CDN applies rate limits, ensure they are not so restrictive that legitimate crawl traffic is blocked. Assaybot limits its own per-domain concurrency and respects 429 responses with automatic backoff
  • Bot Management: If you use a bot management solution (e.g., Cloudflare Bot Management, Akamai Bot Manager), you may need to add Assaybot to your verified bot list or create an exception rule

Access Errors

If Assaybot encounters repeated access errors on your site (403, 401, etc.):

  • Authentication Walls: Assaybot can only access publicly available pages. Ensure the pages that appear in ad traffic are accessible without authentication
  • Geo-Restrictions: If your site restricts access by geography, ensure that Index Exchange’s IP range is permitted
  • IP Allowlisting: Use the authorized CIDR range documented in the User Agent and Network section

Content Analysis Issues

If you believe Assaybot is incorrectly flagging content:

  1. Review Flagged Content: Your account representative can provide specific examples of content that was flagged and the reason for the classification
  2. Understand Criteria: Brand safety assessment covers categories including explicit sexual content, hate speech, violence, illegal activity, and other material that may affect advertiser confidence. Classifications follow IAB brand safety guidelines
  3. Request Review: Contact your account representative to request a manual review of specific flagged URLs
  4. Appeal Process: Work with the Exchange Quality team for remediation guidance. Incorrectly flagged content can be reviewed and reclassified