Squashed 'receiptwitness/' content from commit e8d374a

git-subtree-dir: receiptwitness
git-subtree-split: e8d374a89ed8978f429598e02d31b1c5963efe22
2026-03-28 02:24:22 +00:00
commit 253fd8464f
27 changed files with 2559 additions and 0 deletions
@@ -0,0 +1,227 @@
+# ReceiptWitness — CartSnitch Receipt Ingestion Service
+
+## Project Context
+
+CartSnitch is a self-hosted grocery price intelligence platform built as a polyrepo microservices architecture. This repo (`cartsnitch/receiptwitness`) is the receipt/purchase history ingestion service.
+
+**GitHub org:** github.com/cartsnitch
+**Domain:** cartsnitch.com
+
+### CartSnitch Services
+
+| Repo | Service | Purpose |
+|------|---------|---------|
+| `cartsnitch/common` | — | Shared models, schemas, utilities |
+| `cartsnitch/receiptwitness` | ReceiptWitness | Purchase data ingestion via retailer scrapers (this repo) |
+| `cartsnitch/api` | API Gateway | Frontend-facing REST API |
+| `cartsnitch/cartsnitch` | Frontend | React PWA (mobile-first) |
+| `cartsnitch/stickershock` | StickerShock | Price increase detection & CPI comparison |
+| `cartsnitch/shrinkray` | ShrinkRay | Shrinkflation monitoring |
+| `cartsnitch/clipartist` | ClipArtist | Coupon/deal watching & shopping optimization |
+| `cartsnitch/infra` | — | K8s manifests, Flux kustomizations |
+
+### Architecture Decisions
+
+- **Polyrepo:** Each service has its own repo, Dockerfile, CI/CD pipeline.
+- **Shared DB:** One PostgreSQL cluster. This service writes to the `purchases`, `purchase_items`, and `price_history` tables. Models come from `cartsnitch-common`.
+- **Inter-service comms:** REST (synchronous) + Redis pub/sub (async events).
+- **Target scale:** 500–1,000 users. Each user has their own authenticated sessions to up to 3 retailers.
+
+## What This Service Does
+
+ReceiptWitness authenticates with grocery retailer web portals using per-user sessions, scrapes purchase history / receipt data, parses it into structured records, and writes it to the shared database. After ingestion, it publishes a `cartsnitch.receipts.ingested` event so downstream services (StickerShock, ClipArtist) can react.
+
+### Target Retailers (MVP)
+
+#### Meijer (mPerks)
+- **Auth:** No public API. Session cookie-based auth on mperks.meijer.com.
+- **Receipt location:** meijer.com/mperks/receipts-savings.html (or underlying XHR endpoints)
+- **Approach:** Playwright login → capture session → hit receipt XHR endpoints directly. Map the API calls the frontend makes via browser dev tools network tab.
+- **Prior art:** `dapperfu/python_Meijer` (requires MITM proxy for auth — avoid this pattern, prefer direct browser automation).
+- **Data available:** Digital receipts appear ~15 minutes after purchase if an mPerks ID was used at checkout. Includes item names, prices, discounts, savings.
+
+#### Kroger
+- **Auth:** No public API for purchase history (that's behind Partner API). Session cookie-based auth on kroger.com.
+- **Receipt location:** kroger.com/mypurchases
+- **Approach:** Playwright login → scrape purchase history pages or intercept XHR endpoints.
+- **Anti-bot:** Kroger uses Akamai Bot Manager. Aggressive headless browser detection. Need Playwright stealth, realistic fingerprinting, human-like interaction pacing.
+- **Prior art:** `phyllis-vance/KrogerScrape` (.NET, old), `callaginn/kroger-sweeper` (Puppeteer/Node), `ThermoMan/Get-Kroger-Grocery-List` (Greasemonkey userscript).
+- **Kroger public API:** Free developer account at developer.kroger.com provides product catalog data (`product.compact` scope) — useful for enriching scraped receipt data with UPCs, categories, product images. NOT useful for purchase history.
+- **Data available:** Purchase history tied to Kroger Plus loyalty card. Shows items, prices, quantities.
+
+#### Target (Circle)
+- **Auth:** Session-based auth on target.com.
+- **Receipt location:** target.com account → Orders → In-store tab, or target.com/account/orders
+- **Approach:** Playwright login → scrape in-store purchase history.
+- **Data available:** ~1 year of history if user paid with a linked card, used the Target app wallet, or entered their Target Circle phone number at checkout. Includes item names, prices.
+
+## Tech Stack
+
+- Python 3.12+
+- Playwright (Python async API) for headless browser automation
+- FastAPI (lightweight internal API for triggering scrapes, health checks, status)
+- SQLAlchemy 2.0 (via `cartsnitch-common`)
+- Redis (pub/sub event publishing)
+- APScheduler or Celery (for scheduled scraping jobs)
+- cryptography / Fernet (encrypting stored session data)
+
+## Repo Structure
+
+```
+receiptwitness/
+├── CLAUDE.md
+├── README.md
+├── pyproject.toml
+├── Dockerfile                  # Playwright + Chromium headless
+├── docker-compose.yml          # Local dev (Postgres, Redis, this service)
+├── src/
+│   └── receiptwitness/
+│       ├── __init__.py
+│       ├── config.py           # Service-specific settings
+│       ├── main.py             # FastAPI app + scheduler bootstrap
+│       ├── scrapers/
+│       │   ├── __init__.py
+│       │   ├── base.py         # Abstract BaseScraper class
+│       │   ├── meijer.py       # Meijer/mPerks scraper
+│       │   ├── kroger.py       # Kroger scraper
+│       │   └── target.py       # Target/Circle scraper
+│       ├── parsers/
+│       │   ├── __init__.py
+│       │   ├── meijer.py       # Parse raw Meijer receipt data → PurchaseItem records
+│       │   ├── kroger.py
+│       │   └── target.py
+│       ├── session/
+│       │   ├── __init__.py
+│       │   ├── manager.py      # Session storage, retrieval, refresh logic
+│       │   └── encryption.py   # Encrypt/decrypt session cookies at rest
+│       ├── scheduler.py        # Scrape scheduling (per-user cron jobs)
+│       ├── events.py           # Publish cartsnitch.receipts.ingested events to Redis
+│       ├── api/
+│       │   ├── __init__.py
+│       │   ├── routes.py       # Internal API: trigger scrape, check status, health
+│       │   └── auth.py         # Internal service auth (API key or JWT)
+│       └── enrichment.py       # Optional: enrich receipt data via Kroger public API
+└── tests/
+    ├── conftest.py
+    ├── fixtures/               # Sample receipt HTML/JSON for testing parsers
+    │   ├── meijer_receipt.json
+    │   ├── kroger_receipt.html
+    │   └── target_receipt.html
+    ├── test_scrapers/
+    ├── test_parsers/
+    └── test_session/
+```
+
+## Scraper Architecture
+
+### Base Scraper Pattern
+
+```python
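+from __future__ import annotations
+
+from abc import ABC, abstractmethod
+from datetime import datetime
+
+# Project types (UserStoreAccount, SessionData, RawReceipt, Purchase,
+# PurchaseItem) are not imported here; postponed annotation evaluation
+# lets this sketch load without them.
+
+
+class BaseScraper(ABC):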
+    """All retailer scrapers implement this interface."""
+
+    @abstractmethod
+    async def login(self, credentials: UserStoreAccount) -> SessionData: ...
+
+    @abstractmethod
+    async def check_session(self, session: SessionData) -> bool: ...
+
+    @abstractmethod
+    async def scrape_receipts(self, session: SessionData, since: datetime | None) -> list[RawReceipt]: ...
+
+    @abstractmethod
+    def parse_receipt(self, raw: RawReceipt) -> tuple[Purchase, list[PurchaseItem]]: ...
+```
+
+### Scraping Flow
+
+1. **Scheduler fires** for a user+store combination
+2. **Load session** from `user_store_accounts` table (encrypted)
+3. **Check session validity** — quick lightweight request to verify auth
+4. **If expired:** launch Playwright, re-authenticate, save new session
+5. **Scrape receipts** since `last_sync_at` timestamp
+6. **Parse** raw data into `Purchase` and `PurchaseItem` records
+7. **Deduplicate** — skip receipts already in DB (match on `receipt_id` per store)
+8. **Write to DB** — insert new purchases and items
+9. **Derive price_history** entries from purchase_items
+10. **Publish event** — `cartsnitch.receipts.ingested` to Redis
+11. **Update** `user_store_accounts.last_sync_at`
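+
+The deduplicate, write, and update steps above (7, 8, and 11) can be sketched with an in-memory stand-in for the database. This is a minimal sketch under stated assumptions, not the service's implementation: `RawReceipt` here is a simplified hypothetical shape, and `InMemoryStore` and `sync_once` are illustrative names that do not exist in the repo.
+
+```python
+from __future__ import annotations
+
+from dataclasses import dataclass, field
+from datetime import datetime, timezone
+
+
+@dataclass(frozen=True)
+class RawReceipt:
+    """Hypothetical, simplified stand-in for a scraped receipt."""
+    store_slug: str
+    receipt_id: str
+    purchased_at: datetime
+    total_cents: int
+
+
+@dataclass
+class InMemoryStore:
+    """Stands in for the purchases table, keyed on (store_slug, receipt_id)."""
+    purchases: dict[tuple[str, str], RawReceipt] = field(default_factory=dict)
+    last_sync_at: datetime | None = None
+
+
+def sync_once(store: InMemoryStore, scraped: list[RawReceipt], now: datetime) -> list[RawReceipt]:
+    """Insert receipts not seen before and return only the new ones."""
+    inserted: list[RawReceipt] = []
+    for receipt in scraped:
+        key = (receipt.store_slug, receipt.receipt_id)
+        if key in store.purchases:
+            continue  # step 7: already ingested, skip
+        store.purchases[key] = receipt  # step 8: write new purchase
+        inserted.append(receipt)
+    store.last_sync_at = now  # step 11: record the sync time
+    return inserted
+```
+
+Returning only the newly inserted receipts gives the caller a natural list to derive `price_history` rows from and to publish events for, so a re-scrape of an overlapping date range stays idempotent.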
+
+### Session Management
+
+- Sessions (cookies, tokens) are encrypted at rest using Fernet symmetric encryption.
+- The encryption key is provided via environment variable, not stored in the DB.
+- Sessions are stored in the `user_store_accounts` table as encrypted JSONB.
+- Each scrape attempt first checks if the existing session is valid before launching a full Playwright browser instance.
+- When a session expires, the service needs either the user's stored credentials or a manual re-auth flow (the user logs in via the frontend, and we capture the session).
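+
+A minimal sketch of the encrypt/decrypt round trip, using `Fernet` from the `cryptography` package named in the tech stack. The function names and the session dict shape are assumptions for illustration; the key is passed in as an argument and would come from whatever environment variable the service settles on.
+
+```python
+from __future__ import annotations
+
+import json
+
+from cryptography.fernet import Fernet, InvalidToken
+
+
+def encrypt_session(key: bytes, session: dict) -> bytes:
+    """Serialize the session to JSON and encrypt it into a Fernet token."""
+    return Fernet(key).encrypt(json.dumps(session).encode("utf-8"))
+
+
+def decrypt_session(key: bytes, token: bytes) -> dict | None:
+    """Return the session dict, or None if the token is invalid for this key."""
+    try:
+        return json.loads(Fernet(key).decrypt(token))
+    except InvalidToken:
+        # Wrong key, tampered token, or corrupted data: treat as no session.
+        return None
+```
+
+Treating an undecryptable token as a missing session lets the caller fall through to the re-auth path instead of crashing the scrape. A Fernet token is URL-safe base64 text, so how it maps onto the JSONB column (for example as a JSON string value) is a schema decision this sketch does not cover.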
+
+### Anti-Bot Considerations
+
+- Use `playwright-stealth` or equivalent to mask automation signals.
+- Set realistic viewport sizes, user agents, and locale settings.
+- Add human-like delays between page navigations (randomized 1-5 seconds).
+- For Kroger specifically (Akamai Bot Manager): may need to use non-headless mode on initial auth, or route through a persistent browser profile that has established trust.
+- Rate limit scraping: no more than 1 scrape per user per store per hour. Default cadence: once daily.
+- Store and reuse browser profiles/cookies to minimize fresh logins.
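+
+The pacing and rate-limit rules above reduce to two small helpers. This is a sketch only; `may_scrape`, `human_delay_seconds`, and the constant names are hypothetical.
+
+```python
+from __future__ import annotations
+
+import random
+from datetime import datetime, timedelta, timezone
+
+MIN_SCRAPE_INTERVAL = timedelta(hours=1)  # hard floor per user per store
+DEFAULT_CADENCE = timedelta(days=1)       # default scheduling interval
+
+
+def may_scrape(
+    last_sync_at: datetime | None,
+    now: datetime,
+    min_interval: timedelta = MIN_SCRAPE_INTERVAL,
+) -> bool:
+    """Allow a scrape if none has run yet or the minimum interval has elapsed."""
+    return last_sync_at is None or now - last_sync_at >= min_interval
+
+
+def human_delay_seconds(rng: random.Random, low: float = 1.0, high: float = 5.0) -> float:
+    """Pick a randomized pause, in seconds, to wait between page navigations."""
+    return rng.uniform(low, high)
+```
+
+In an async scraper the delay would typically be consumed with `await asyncio.sleep(human_delay_seconds(rng))`. Passing the `random.Random` instance in keeps the helper deterministic under test.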
+
+### Dockerfile
+
+The Dockerfile must include Playwright and Chromium. Base image pattern:
+
+```dockerfile
+FROM mcr.microsoft.com/playwright/python:v1.49.0-noble
+# Install deps, copy code, etc.
+```
+
+This is a large image (~2GB) due to Chromium. Consider multi-stage builds if the final image can be slimmed down.
+
+## Internal API Endpoints
+
+This service exposes a lightweight internal API (not public-facing):
+
+- `GET /health` — health check
+- `GET /status/{user_id}` — sync status per store for a user
+- `POST /scrape/{user_id}/{store_slug}` — trigger an immediate scrape for a user+store
+- `POST /scrape/{user_id}/all` — trigger scrape across all configured stores
+- `GET /sessions/{user_id}` — list configured store sessions and their status
+
+The public-facing API gateway (`cartsnitch/api`) proxies user-facing requests to this service's internal API.
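+
+One detail worth sketching: `/scrape/{user_id}/all` and `/scrape/{user_id}/{store_slug}` overlap, and FastAPI matches routes in declaration order. The stub below assumes the fixed `all` route is declared first so it is not captured as a `store_slug`; the handler bodies and response shapes are placeholders, not the service's actual responses.
+
+```python
+from fastapi import FastAPI
+
+app = FastAPI()
+
+
+@app.get("/health")
+async def health() -> dict[str, str]:
+    return {"status": "ok"}
+
+
+# Declared before the parameterized route so "all" is not treated as a store slug.
+@app.post("/scrape/{user_id}/all")
+async def scrape_all(user_id: str) -> dict[str, str]:
+    return {"user_id": user_id, "scope": "all"}
+
+
+@app.post("/scrape/{user_id}/{store_slug}")
+async def scrape_store(user_id: str, store_slug: str) -> dict[str, str]:
+    return {"user_id": user_id, "scope": store_slug}
+```
+
+An alternative that avoids the ordering dependency would be a distinct path such as a query flag or a separate prefix, at the cost of changing the endpoint list above.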
+
+## Events Published
+
+### `cartsnitch.receipts.ingested`
+
+Published after new receipt data is successfully written to the DB.
+
+```json
+{
+  "event_type": "cartsnitch.receipts.ingested",
+  "timestamp": "2026-03-15T12:00:00Z",
+  "service": "receiptwitness",
+  "payload": {
+    "user_id": "uuid",
+    "store_slug": "meijer",
+    "purchase_id": "uuid",
+    "purchase_date": "2026-03-14",
+    "item_count": 23,
+    "total": 87.42
+  }
+}
+```
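+
+A minimal sketch of building and publishing this envelope. It assumes the Redis channel name equals the event type, which the document does not state, and it takes the publish function as a parameter so the sketch stays free of a Redis dependency; `build_event` and `publish_event` are hypothetical names.
+
+```python
+import json
+from datetime import datetime, timezone
+from typing import Callable
+
+# Assumption: the pub/sub channel is named after the event type.
+CHANNEL = "cartsnitch.receipts.ingested"
+
+
+def build_event(payload: dict, now: datetime) -> str:
+    """Wrap a payload in the event envelope and serialize it to JSON."""
+    envelope = {
+        "event_type": CHANNEL,
+        "timestamp": now.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
+        "service": "receiptwitness",
+        "payload": payload,
+    }
+    return json.dumps(envelope)
+
+
+def publish_event(publish: Callable[[str, str], object], payload: dict, now: datetime) -> None:
+    """Send the serialized event through any (channel, message) publisher."""
+    publish(CHANNEL, build_event(payload, now))
+```
+
+With redis-py, the `publish` method of a `redis.Redis` client has this `(channel, message)` shape and could be passed in directly. Redis pub/sub does not retain messages for subscribers that are offline, so consumers should not assume every event is delivered.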
+
+## Development Workflow
+
+- **Never push directly to main.** Always create feature branches and open PRs.
+- Branch naming: `feature/<store>/<description>` or `fix/<description>`
+- Use conventional commits: `feat:`, `fix:`, `refactor:`, `docs:`, `chore:`
+- Test parsers with fixture data (sample receipts in `tests/fixtures/`). Scraper integration tests require real credentials and should be tagged/skipped in CI.
+- Local dev: `docker-compose up` starts Postgres, Redis, and the service. Playwright runs inside the container.
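+
+One way to keep credentialed scraper tests out of CI is an opt-in `pytest` skip marker. This is a sketch: the environment variable name and the test name are hypothetical, and the repo may prefer a custom marker selected with `-m` instead.
+
+```python
+import os
+
+import pytest
+
+# Hypothetical variable name; any opt-in signal that CI does not set would do.
+LIVE_TESTS_ENV_VAR = "RECEIPTWITNESS_LIVE_TESTS"
+
+live_scraper = pytest.mark.skipif(
+    os.environ.get(LIVE_TESTS_ENV_VAR) != "1",
+    reason=f"set {LIVE_TESTS_ENV_VAR}=1 to run scraper tests against real retailer accounts",
+)
+
+
+@live_scraper
+def test_meijer_login_returns_session():
+    ...  # placeholder: would drive a real Playwright login
+```
+
+Parser tests need no marker because they only read files under `tests/fixtures/`.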
+
+## Important Notes
+
+- The Playwright container image is large. On K8s, consider using a dedicated node or tolerating scheduling delays.
+- Each user needs their own authenticated sessions. At 1,000 users × 3 stores, that is up to 3,000 sessions to manage. Sessions expire at different rates per retailer.
+- Scraping must be respectful: randomized intervals, rate limiting, no parallel scraping of the same store for the same user.
+- Receipt data structure varies significantly between retailers. The parsers must be robust and handle edge cases (returns, voided items, weighted produce, BOGO items, coupon stacking).
+- Kroger's public API (`product.compact` scope) can be used to enrich scraped data with UPCs and product metadata after receipt parsing. This is optional but improves product normalization downstream.
+- Store credentials for users should ideally NOT be stored by CartSnitch. Prefer a flow where the user authenticates in a controlled browser session, and we capture/store only the resulting session cookies. If credential storage is necessary, use strong encryption and make the tradeoffs clear to users.
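+
+To make the parser edge cases above concrete, here is a sketch of computing one line's extended price with `Decimal`. The raw line shape (`unit_price`, `quantity`, `weight`, `voided`, `is_return`) is invented for illustration, since each retailer's real format differs, and half-up rounding is an assumption that may not match how a given retailer rounds.
+
+```python
+from __future__ import annotations
+
+from decimal import ROUND_HALF_UP, Decimal
+
+CENT = Decimal("0.01")
+
+
+def line_total(line: dict) -> Decimal | None:
+    """Return the signed extended price for one raw line, or None to skip it."""
+    if line.get("voided"):
+        return None  # voided lines do not contribute to the purchase
+    unit_price = Decimal(line["unit_price"])
+    # Weighted produce is priced per unit of weight; everything else per item.
+    if "weight" in line:
+        amount = Decimal(line["weight"])
+    else:
+        amount = Decimal(line.get("quantity", 1))
+    total = (unit_price * amount).quantize(CENT, rounding=ROUND_HALF_UP)
+    # Returns are recorded as negative amounts.
+    return -total if line.get("is_return") else total
+```
+
+Parsing prices from strings into `Decimal` avoids binary floating-point drift when line totals are later summed and compared against the receipt total.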