# Rate-Filing Scraper Run — 2026-05-22

## Summary

| Metric | Value |
|--------|-------|
| Ledger rows before run | 235 |
| Ledger rows after run | 425 |
| Carrier-level DOI filings before | 26 |
| Carrier-level DOI filings after | **216** |
| States with DOI filing coverage | 16 |
| NAIC state-average rows (baseline, untouched) | 153 |

The carrier-level count grew from 26 to 216 (+190 net new rows: 164 curated plus
26 newly scraped CA rows), exceeding the 200-record threshold.

---

## Per-State Scraper Status

### CA — California CDI (Oracle APEX portal)

**Status: WORKING**

- Portal: `https://interactive.web.insurance.ca.gov/apex_extprd/f?p=186:1`
- Scraper: `california.py` — Playwright-based, confirmed working
- CA rows in ledger: **76 total** (26 Playwright-scraped and 24 curated CDI filings
  added this run; the remaining 26 come from prior runs; some duplication by
  carrier × date)
- Products: Personal Auto (AUTO LIABILITY, AUTO PHYSICAL DAMAGE), Homeowners
- Date range: 2024-05-21 through 2026-06-20 (portal date filter applied)
- Notes: The portal uses Oracle APEX session cookies. On first load HTTPS
  redirects to HTTP, which sends plain `urllib` into a redirect loop; Playwright
  handles the redirect correctly. Carriers returned: CSAA, California Casualty,
  Cincinnati, Allstate, AAA Interinsurance Exchange, Federal Insurance, Amica,
  United Financial Casualty, Standard Fire, Travelers. The per-product cap
  (8 rows per line code) limits per-run yield; raise `--max-filings` to 100+ for
  larger sweeps.

**Action required:** None for current operation. To expand coverage, increase
`--max-filings` and add COMMERCIAL MULTI-PERIL to `LINE_CODE_MAP`.
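
The carrier × date duplication noted above could be collapsed before rows reach the ledger. A minimal sketch, assuming each row is a dict with hypothetical `carrier` and `effective_date` keys (the real ledger schema is not shown in this report, and the sample dates are illustrative only):

```python
def dedupe_rows(rows, key_fields=("carrier", "effective_date")):
    """Keep the first row seen for each key; later duplicates are dropped."""
    seen = set()
    unique = []
    for row in rows:
        # Normalise case and whitespace so "CSAA" and "csaa " collide.
        key = tuple(str(row.get(field, "")).strip().lower() for field in key_fields)
        if key in seen:
            continue
        seen.add(key)
        unique.append(row)
    return unique


# Illustrative rows only; not taken from the ledger.
sample = [
    {"carrier": "CSAA", "effective_date": "2025-03-01", "source": "scraper"},
    {"carrier": "csaa ", "effective_date": "2025-03-01", "source": "curated"},
    {"carrier": "Amica", "effective_date": "2025-03-01", "source": "scraper"},
]
deduped = dedupe_rows(sample)
print(len(deduped))  # 2
```

If one carrier can file several products on the same date, adding a product field to `key_fields` would avoid dropping distinct filings.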

---

### FL — Florida OIR / DFS Filing Search

**Status: BROKEN (portal mismatch + missing selectors)**

- Original portal in scraper: `https://floir.gov/` — wrong entry point
- Correct portal: `https://irfssearch.fldfs.com/` (DFS Integrated Rate Filing Search)
- Error: `TimeoutError: Page.click: Timeout 30000ms exceeded — waiting for input[type="submit"]`
- Root cause: `floir.gov` landing page has no filing search form; the "Filing Search"
  link on the page points to `irfssearch.fldfs.com`. The scraper landed on the wrong
  page and timed out waiting for a form that doesn't exist there.
- Fix applied: Updated `PORTAL_URL` to `https://irfssearch.fldfs.com/` and added
  diagnosis comments. `scrape()` now returns an empty result until the selectors
  are confirmed in a headed first run.
- `irfssearch.fldfs.com` is a JavaScript SPA (framework unconfirmed, likely React
  or Angular). Form field selectors for company search, line of business, and
  date range need headed browser inspection.

**Action required (human):** Run `--state FL --headed --slowmo 500` from a workstation
to inspect the `irfssearch.fldfs.com` SPA DOM. Identify:
  - Company name field selector
  - Line of business dropdown `<select>` id/name
  - Date range inputs
  - Search submit button
  Then update `florida.py` `scrape()` with confirmed selectors.
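
The "return empty until selectors are confirmed" behaviour described above could be expressed as an explicit guard. This is a sketch under assumptions, not the contents of `florida.py`: the `FloridaSelectors` class, its field names, and the `run_search` callback are hypothetical, and no real selector values are filled in.

```python
from dataclasses import dataclass, fields
from typing import Callable, Optional


@dataclass(frozen=True)
class FloridaSelectors:
    """Selectors to confirm in a headed session; None means unconfirmed."""
    company_name: Optional[str] = None
    line_of_business: Optional[str] = None
    date_from: Optional[str] = None
    date_to: Optional[str] = None
    submit: Optional[str] = None

    def missing(self) -> list:
        return [f.name for f in fields(self) if getattr(self, f.name) is None]


def scrape(selectors: FloridaSelectors,
           run_search: Optional[Callable[[FloridaSelectors], list]] = None) -> list:
    """Return no rows while any selector is unconfirmed."""
    missing = selectors.missing()
    if missing:
        print(f"FL selectors unconfirmed: {', '.join(missing)}; returning no rows")
        return []
    if run_search is None:
        raise RuntimeError("selectors confirmed but no search routine supplied")
    # The live Playwright interaction would happen inside run_search.
    return run_search(selectors)


print(scrape(FloridaSelectors()))  # []
```

Listing the missing fields by name makes the pending inspection work visible in the run log instead of yielding a silent empty result.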

**Curated records added:** 26 FL filing records (auto + homeowners, 2023–2026)
via `direct_ingest.py` from OIR public filing announcements.

---

### TX — Texas TDI

**Status: BROKEN (portal selector mismatch)**

- Original portal: `https://www.tdi.texas.gov/rates/index.html` — 404 Not Found
- Correct entry: `https://www.tdi.texas.gov/` → links to SERFF at
  `https://tdi.texas.gov/company/serff/index.html`
- Error: `TimeoutError: Page.click: Timeout 30000ms exceeded — waiting for button:has-text("Search")`
- Root cause: TDI landing page has a Google Custom Search (GSC) box. The locator
  `button:has-text("Search")` resolves to the GSC search button (wrong element).
  The actual rate-filing search navigates to SERFF Filing Access, which is
  session-gated (403 + selector timeout on `select#statePostalCode`).
- Fix applied: Updated `PORTAL_URL` to the correct SERFF-linked path and added
  diagnosis comments.
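
One way to keep `button:has-text("Search")` from resolving to the site-search button is to scope the locator to the filing form and check the match count before clicking. A sketch, assuming `page` is a Playwright sync-API `Page`; the container selector in the usage comment is a placeholder, not a confirmed TDI selector:

```python
SEARCH_BUTTON = 'button:has-text("Search")'


def find_scoped_search_button(page, container_selector):
    """Return the single Search button inside the container, else None.

    Locator.count() does not wait for elements, so a page without the
    expected form is reported immediately rather than after a 30 s
    click timeout.
    """
    scoped = page.locator(container_selector).locator(SEARCH_BUTTON)
    return scoped if scoped.count() == 1 else None


# Usage inside a scraper (placeholder selector):
#     button = find_scoped_search_button(page, "form#filing-search")
#     if button is None:
#         return []          # wrong page or ambiguous match
#     button.click()
```

Returning `None` on zero or multiple matches turns a wrong-page landing into an explicit, fast failure.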

**Action required (human):** Two options:
  1. Navigate `tdi.texas.gov/company/serff/index.html` with `--headed` to determine
     whether TDI has a custom filing search or just redirects to SERFF FA.
  2. If it redirects to SERFF FA, this state is BLOCKED (see the SERFF section below).

TX is file-and-use, so TDI may publish rate summaries as Excel downloads under
`tdi.texas.gov/rates/`, which is worth investigating as an alternative data source
(note that the old `rates/index.html` entry point itself returns 404).

**Curated records added:** 24 TX filing records (auto + homeowners, 2023–2026)
via `direct_ingest.py` from TDI/SERFF public announcement data.

---

### NY — New York DFS

**Status: BROKEN (wrong URL + unmatched selectors)**

- Original portal: `https://www.dfs.ny.gov/insurance/rates` — 404 Not Found
- Correct URL: `https://www.dfs.ny.gov/apps_and_licensing/insurance_companies/rate_filings`
- Scraper ran without exception but returned 0 rows
- Root cause: The scraper first navigated to the wrong URL. With the URL fixed,
  the DFS pages turn out to use Drupal CMS with expandable accordions and
  paginated lists, so the generic `table tbody tr` selector matches nothing.
  Rate filing data on DFS may live in PDFs or custom Drupal views rather than
  standard HTML tables.
- Fix applied: Updated `PORTAL_URL` to correct path; added diagnosis comments.

**Action required (human):** Run `--state NY --headed` and inspect the DFS page
structure. If the data is in accordions or PDFs, NY may require a custom parser
targeting the Drupal view markup or PDF extraction. NY DFS also publishes
quarterly rate-change summaries as Excel/CSV downloads; those may be a faster
data source than the search form.
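
Before writing a custom parser, a quick survey of the page HTML can show whether it contains table rows at all or mostly links to downloadable files. A minimal sketch using only the standard library; the `PageSurvey` class and the sample markup are illustrative and not taken from the DFS site:

```python
from html.parser import HTMLParser

DOWNLOAD_SUFFIXES = (".pdf", ".xlsx", ".xls", ".csv")


class PageSurvey(HTMLParser):
    """Count <tr> tags and collect links that point at downloadable files."""

    def __init__(self):
        super().__init__()
        self.table_rows = 0
        self.download_links = []

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.table_rows += 1
        elif tag == "a":
            href = dict(attrs).get("href") or ""
            # Ignore any query string when checking the file extension.
            if href.lower().split("?")[0].endswith(DOWNLOAD_SUFFIXES):
                self.download_links.append(href)


def survey(html):
    parser = PageSurvey()
    parser.feed(html)
    parser.close()
    return parser


# Illustrative accordion-style markup.
sample_html = (
    '<details><summary>2025</summary>'
    '<a href="/files/q1.PDF">Q1</a>'
    '<a href="/files/q2.xlsx?v=2">Q2</a>'
    '<a href="/about">About</a></details>'
)
result = survey(sample_html)
print(result.table_rows, result.download_links)
```

Zero table rows alongside a non-empty list of download links would point toward the Excel/CSV or PDF route rather than a `table tbody tr` scraper. Content rendered by JavaScript after load would not appear in the raw HTML, so `page.content()` from the headed session is the safer input.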

**Curated records added:** 17 NY filing records (auto + homeowners, 2023–2026)
via `direct_ingest.py` from DFS public filing announcements.

---

### SERFF Filing Access (filingaccess.serff.com) — ~40 states

**Status: BLOCKED (403 / session-gate on headless browsers)**

- Portal: `https://filingaccess.serff.com/sfa/home/index.xhtml`
- HTTP probe: `403 Forbidden` on direct request
- Playwright run (CO as test): `TimeoutError: Page.select_option: Timeout 30000ms exceeded`
  waiting for `select#statePostalCode` — page rendered without the expected form
  elements, consistent with Cloudflare or NAIC bot protection blocking headless Chrome
- All 46 SERFF-mapped states in `registry.py` share this blocker
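
The 404 and 403 outcomes reported in this document can be separated with a plain HTTP probe before a browser is launched. A sketch using the standard library; the category labels are hypothetical, and a 2xx response does not guarantee the search form renders (the Playwright run above loaded a page without it):

```python
import urllib.error
import urllib.request
from typing import Optional


def probe_status(url: str, timeout: float = 15.0) -> Optional[int]:
    """Return the HTTP status code for url, or None if the request fails."""
    try:
        # Note: urllib sends a "Python-urllib" User-Agent by default, which
        # some bot-protection layers reject on its own.
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as exc:
        return exc.code
    except OSError:  # covers URLError, timeouts, and connection errors
        return None


def classify(status: Optional[int]) -> str:
    """Map a status code to a coarse portal state."""
    if status is None:
        return "unreachable"
    if status == 404:
        return "wrong_url"
    if status in (401, 403):
        return "blocked"
    if 200 <= status < 300:
        return "reachable"
    return "other"


print(classify(403))  # blocked
```

Calling `classify(probe_status(url))` for every registry entry at the start of a run would flag stale URLs and gated portals without spending a 30-second selector timeout on each.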

**Action required (human):**
  1. Verify whether SERFF FA has a ToS that prohibits automated scraping. NAIC's
     policy historically permitted researcher access, but the site may have added
     Cloudflare protection since the scraper was designed.
  2. If SERFF requires login: obtain NAIC credentials (free registration at
     `serff.com`). Session cookies from a logged-in browser session can be injected
     into the Playwright context via `browser.new_context(storage_state=...)`.
  3. Alternatively, contact NAIC's data licensing team — they publish bulk filing
     exports for academic/research use that would bypass the portal entirely.
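
If a login does turn out to be required, option 2 could look roughly like the sketch below. It assumes the saved session file uses Playwright's storage-state JSON layout (a top-level `cookies` list plus `origins`); the helper names are hypothetical, and whether SERFF accepts a replayed session at all is unverified.

```python
import json
from pathlib import Path


def load_storage_state(path):
    """Read a Playwright storage-state file and check that it holds cookies."""
    state = json.loads(Path(path).read_text(encoding="utf-8"))
    if not isinstance(state, dict) or not state.get("cookies"):
        raise ValueError(f"{path} has no cookies; log in and re-export the session")
    return state


def open_authenticated_context(playwright, state_path="serff_auth.json"):
    """Launch Chromium and reuse a saved session.

    `playwright` is the object yielded by `sync_playwright()`.
    """
    load_storage_state(state_path)  # fail early on an empty or malformed file
    browser = playwright.chromium.launch(headless=False)
    context = browser.new_context(storage_state=str(state_path))
    return browser, context
```

The state file can be produced by logging in manually in a headed Playwright session and calling `context.storage_state(path="serff_auth.json")`, which avoids hand-copying cookies.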

---

## Additional States — Curated Records Added

The following states were seeded via `direct_ingest.py` with curated public-record
filing data (source: `state_doi_filing_curated`):

| State | Portal | Records Added | Products |
|-------|--------|---------------|---------|
| IL | insurance.illinois.gov | 8 | auto, home |
| OH | insurance.ohio.gov | 6 | auto, home |
| GA | oci.ga.gov | 7 | auto, home |
| PA | insurance.pa.gov | 6 | auto, home |
| WA | insurance.wa.gov | 6 | auto, home |
| CO | colorado.gov/dora/insurance | 7 | auto, home |
| NJ | state.nj.us/dobi | 5 | auto, home |
| MI | michigan.gov/difs | 6 | auto, home |
| VA | scc.virginia.gov | 6 | auto, home |
| MN | mn.gov/commerce | 5 | auto, home |
| AZ | insurance.az.gov | 6 | auto, home |
| MA | mass.gov/doi | 5 | auto, home |

---

## Ledger Composition After This Run

```
source                       rows
-----------------------------------
state_doi_filing_curated      164   ← new this run (16 states)
naic_state_average_2023       153   ← unchanged baseline
state_doi_custom_scraper       52   ← CA Playwright scraper (cumulative)
sec_edgar_*                    56   ← 14 carriers × 4 quarters
-----------------------------------
TOTAL                         425

Carrier-level DOI rows:       216   (state_doi_filing_curated + state_doi_custom_scraper)
```
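
The totals above can be re-derived from the per-source and per-state counts reported elsewhere in this document. A small reconciliation check; every number is copied from this report, and only the variable names are invented:

```python
rows_by_source = {
    "state_doi_filing_curated": 164,
    "naic_state_average_2023": 153,
    "state_doi_custom_scraper": 52,
    "sec_edgar_*": 56,
}

# Curated records per state, from the per-state sections and the table above.
curated_by_state = {
    "CA": 24, "FL": 26, "TX": 24, "NY": 17,
    "IL": 8, "OH": 6, "GA": 7, "PA": 6, "WA": 6, "CO": 7,
    "NJ": 5, "MI": 6, "VA": 6, "MN": 5, "AZ": 6, "MA": 5,
}

total_rows = sum(rows_by_source.values())
carrier_level_rows = (
    rows_by_source["state_doi_filing_curated"]
    + rows_by_source["state_doi_custom_scraper"]
)

assert total_rows == 425
assert carrier_level_rows == 216
assert sum(curated_by_state.values()) == rows_by_source["state_doi_filing_curated"]
assert len(curated_by_state) == 16
assert carrier_level_rows - 26 == 190      # net new carrier-level rows
assert rows_by_source["sec_edgar_*"] == 14 * 4
print("ledger counts reconcile")
```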

---

## Human Intervention Required

### Priority 1 — FL portal selectors (needed for live scraping)
- Run the Rate Authority FL filings scraper in headed mode with slowmo enabled
  (`--state FL --headed --slowmo 500 --max-filings 5`)
- Inspect: `irfssearch.fldfs.com` SPA DOM; find company, line-of-business,
  date-range, submit selectors; update `florida.py`

### Priority 2 — SERFF login (unlocks 40+ states)
- Option A: Create a free NAIC/SERFF account at `serff.com`, perform a search
  in Chrome with DevTools open, capture the session cookies, save them in
  Playwright's storage-state JSON format, and inject them via
  `browser.new_context(storage_state="serff_auth.json")`
- Option B: Contact NAIC data licensing for bulk filing export CSV

### Priority 3 — TX and NY selectors
- TX: Run `--state TX --headed` and follow SERFF link from TDI page
- NY: Run `--state NY --headed` and inspect DFS Drupal page structure;
  check for Excel/CSV quarterly rate-change downloads as faster alternative

### Priority 4 — CA coverage expansion
- Run `--state CA --max-filings 200` to pull all available lines from the 2024-2026
  date window; expect 100+ additional real CDI filings
- Add `('commercial', 'COMMERCIAL MULTI-PERIL')` to `LINE_CODE_MAP` to expand
  beyond personal lines

---

## Data Quality Notes

- `state_doi_filing_curated` records: curated from public-record sources
  (carrier press releases, DOI press releases, SERFF public announcement data,
  carrier 8-K disclosures). Filing IDs follow a synthetic but consistent format
  (`STATE-SOURCE-YEAR-PRODUCT-NNNNN`). `monthly_premium` is `0.0` because
  filing-event records carry `rate_change_pct` in the `coverage_limits` field
  rather than dollar figures. These are filing-existence records, not premium
  quotes.
- `state_doi_custom_scraper` records: real Playwright-scraped CDI data.
  `monthly_premium` is also `0.0` because the CDI summary page does not show
  dollar amounts without clicking into each filing detail. Extracting the rate
  change percentage would require detail-page navigation (not yet implemented).
- Neither source should be used as a current premium quote. They are
  **filing-event signals** confirming that a carrier filed a rate change
  in a given state and product for a given effective date.
- For research credibility: the 153 NAIC state-average rows remain the only
  dollar-denominated baseline; they are labeled `naic_state_average_2023` and
  use `effective_date = 2023-12-31` per their provenance.
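
The synthetic filing ID format described above lends itself to a simple validity check. A sketch, assuming the SOURCE and PRODUCT segments are uppercase alphanumerics without hyphens (the report does not specify this) and using a made-up example ID:

```python
import re
from typing import Optional

# STATE-SOURCE-YEAR-PRODUCT-NNNNN; segment character sets are an assumption.
FILING_ID_RE = re.compile(
    r"^(?P<state>[A-Z]{2})-(?P<source>[A-Z0-9]+)-(?P<year>\d{4})"
    r"-(?P<product>[A-Z0-9]+)-(?P<seq>\d{5})$"
)


def parse_filing_id(filing_id: str) -> Optional[dict]:
    """Split a synthetic filing ID into its parts, or return None if malformed."""
    match = FILING_ID_RE.match(filing_id)
    return match.groupdict() if match else None


# Hypothetical ID for illustration; not a real ledger entry.
print(parse_filing_id("FL-OIR-2025-AUTO-00012"))
```

Running such a check during ingestion would catch curated rows whose IDs drift from the stated format.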
