Citizenly Corpus Crawler
Last updated: 2026-05-11
This page describes the automated crawler operated by Citizenly for indexing public U.S. immigration policy documents.
What it does
The crawler periodically polls a small allowlist of public-information pages on official U.S. government immigration sites. It compares each page's main article body against the previous fetch using a content hash. When the body changes substantively, a human reviewer at Citizenly checks the change before any indexed copy is updated.
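As an illustration only, the content-hash comparison could look roughly like the sketch below. The function names, the choice of SHA-256, and the whitespace normalization are assumptions made for the example; this page does not specify the actual hashing scheme or what counts as a substantive change.

```python
import hashlib
from typing import Optional


def body_hash(article_body: str) -> str:
    """Hash the article body after collapsing runs of whitespace.

    Assumption: whitespace-only differences should not count as a change.
    """
    normalized = " ".join(article_body.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def has_changed(previous_hash: Optional[str], article_body: str) -> bool:
    """Return True when the body differs from the previous fetch.

    Assumption: a page with no stored hash (first fetch) is treated as changed.
    """
    return previous_hash is None or body_hash(article_body) != previous_hash
```

In a sketch like this, a `True` result would only queue the page for human review; it would not update the indexed copy by itself.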
The crawler does not: republish government content, archive entire sites, index personal data, scrape user accounts, or aggregate beyond the public policy pages listed below.
User-Agent
Every request the crawler issues carries the following User-Agent string:
Citizenly Corpus Crawler (https://citizenly.ai/crawler-info)
If you operate one of the source sites and see this User-Agent in your logs, those requests came from us.
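For reference, here is a minimal sketch of how a request carrying this User-Agent could be built, and how a site operator might test a logged User-Agent value against it. The helper names `build_request` and `is_citizenly_crawler` are hypothetical and exist only for this example.

```python
from urllib.request import Request

USER_AGENT = "Citizenly Corpus Crawler (https://citizenly.ai/crawler-info)"


def build_request(url: str) -> Request:
    """Build (but do not send) a request that carries the crawler's User-Agent."""
    return Request(url, headers={"User-Agent": USER_AGENT})


def is_citizenly_crawler(logged_user_agent: str) -> bool:
    """Return True if a logged User-Agent value matches the string above exactly."""
    return logged_user_agent.strip() == USER_AGENT
```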
Sites the crawler accesses
- www.uscis.gov: U.S. Citizenship and Immigration Services
- egov.uscis.gov: USCIS, e-government services (processing times)
- travel.state.gov: U.S. Department of State, Travel
- www.justice.gov: U.S. Department of Justice, EOIR
- www.dhs.gov: U.S. Department of Homeland Security
- www.cbp.gov: U.S. Customs and Border Protection
- www.ice.gov: U.S. Immigration and Customs Enforcement
- studyinthestates.dhs.gov: DHS, Study in the States
- ohss.dhs.gov: DHS Office of Homeland Security Statistics
- www.federalregister.gov: Federal Register (via API)
- api.congress.gov: Congress.gov (via official API)
The Federal Register and Congress.gov sources use official structured APIs (federalregister.gov/api/v1 and api.congress.gov/v3), not page scraping.
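To show what API access rather than page scraping means in practice, the sketch below builds request URLs against the two API roots named above. The specific endpoints (`documents.json`, `bill`), the search term, and the `YOUR_API_KEY` placeholder are assumptions chosen for the example; this page does not state which endpoints or queries the crawler actually uses.

```python
from urllib.parse import urlencode

FEDERAL_REGISTER_API = "https://www.federalregister.gov/api/v1"
CONGRESS_API = "https://api.congress.gov/v3"


def federal_register_search_url(term: str, per_page: int = 20) -> str:
    """Build a document-search URL for the Federal Register API (illustrative query)."""
    query = urlencode({"conditions[term]": term, "per_page": per_page})
    return f"{FEDERAL_REGISTER_API}/documents.json?{query}"


def congress_bills_url(api_key: str) -> str:
    """Build a bill-listing URL for the Congress.gov API, which requires an API key."""
    query = urlencode({"api_key": api_key, "format": "json"})
    return f"{CONGRESS_API}/bill?{query}"


if __name__ == "__main__":
    print(federal_register_search_url("asylum"))
    print(congress_bills_url("YOUR_API_KEY"))  # placeholder, not a real key
```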
Politeness
- The crawler respects robots.txt on every domain. Disallowed paths are skipped and not retried for at least 24 hours after the first disallow.
- Per-domain request rate is limited (default: at most one request every five seconds per source).
- Conditional requests (If-None-Match, If-Modified-Since) are sent whenever the previous response included an ETag or Last-Modified header, so unchanged pages return 304 Not Modified and consume minimal upstream bandwidth.
- Repeated upstream errors pause polling for that source until a human acknowledges the situation.
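The first three politeness rules above could be sketched roughly as follows, using only the Python standard library. The names `robots_allows`, `DomainRateLimiter`, and `conditional_headers` are hypothetical and do not describe the crawler's actual code. The 24-hour retry hold and the pause on repeated errors are not covered here.

```python
import time
from typing import Optional
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser

USER_AGENT = "Citizenly Corpus Crawler (https://citizenly.ai/crawler-info)"
MIN_INTERVAL_SECONDS = 5.0  # default stated above: one request per five seconds per source


def robots_allows(robots_txt: str, url: str, user_agent: str = USER_AGENT) -> bool:
    """Return True if an already-downloaded robots.txt body permits fetching url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


class DomainRateLimiter:
    """Remembers the last request time per host and reports how long to wait."""

    def __init__(self, min_interval: float = MIN_INTERVAL_SECONDS, clock=time.monotonic):
        self.min_interval = min_interval
        self.clock = clock  # injectable so the limiter can be exercised without sleeping
        self._last_request = {}

    def seconds_until_allowed(self, url: str) -> float:
        """Seconds remaining before another request to this URL's host is allowed."""
        last = self._last_request.get(urlsplit(url).netloc)
        if last is None:
            return 0.0
        return max(0.0, self.min_interval - (self.clock() - last))

    def record_request(self, url: str) -> None:
        """Note that a request to this URL's host was just issued."""
        self._last_request[urlsplit(url).netloc] = self.clock()


def conditional_headers(etag: Optional[str] = None,
                        last_modified: Optional[str] = None) -> dict:
    """Build request headers, adding validators only if the previous response had them."""
    headers = {"User-Agent": USER_AGENT}
    if etag:
        headers["If-None-Match"] = etag
    if last_modified:
        headers["If-Modified-Since"] = last_modified
    return headers
```

A caller would check `robots_allows` first, sleep for `seconds_until_allowed`, send the request with `conditional_headers`, and treat a 304 response as "unchanged" without re-hashing the body.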
Contact
If you operate one of the sites above and would like the crawler to slow down, change its access pattern, or stop accessing your site entirely, please email crawler@citizenly.ai. We honor opt-out requests within 48 hours.