ReceiptWitness — CartSnitch Receipt Ingestion Service
Project Context
CartSnitch is a self-hosted grocery price intelligence platform built as a polyrepo microservices architecture. This repo (cartsnitch/receiptwitness) is the receipt/purchase history ingestion service.
GitHub org: github.com/cartsnitch
Domain: cartsnitch.com
CartSnitch Services
| Repo | Service | Purpose |
|---|---|---|
| cartsnitch/common | n/a | Shared models, schemas, utilities (extracted into individual service repos) |
| cartsnitch/receiptwitness | ReceiptWitness | Purchase data ingestion via retailer scrapers (this repo) |
| cartsnitch/api | API Gateway | Frontend-facing REST API |
| cartsnitch/cartsnitch | Frontend | React PWA (mobile-first) |
| cartsnitch/stickershock | StickerShock | Price increase detection & CPI comparison |
| cartsnitch/shrinkray | ShrinkRay | Shrinkflation monitoring |
| cartsnitch/clipartist | ClipArtist | Coupon/deal watching & shopping optimization |
| cartsnitch/infra | n/a | K8s manifests, Flux kustomizations |
Architecture Decisions
- Polyrepo: Each service has its own repo, Dockerfile, CI/CD pipeline.
- Shared DB: One PostgreSQL cluster. This service writes to the purchases, purchase_items, and price_history tables. Models are inlined under src/receiptwitness/shared/ (extracted from cartsnitch-common during the CAR-724 migration).
- Inter-service comms: REST (synchronous) + Redis pub/sub (async events).
- Target scale: 500–1,000 users. Each user has their own authenticated sessions to up to 3 retailers.
What This Service Does
ReceiptWitness authenticates with grocery retailer web portals using per-user sessions, scrapes purchase history / receipt data, parses it into structured records, and writes it to the shared database. After ingestion, it publishes a cartsnitch.receipts.ingested event so downstream services (StickerShock, ClipArtist) can react.
Target Retailers (MVP)
Meijer (mPerks)
- Auth: No public API. Session cookie-based auth on mperks.meijer.com.
- Receipt location: meijer.com/mperks/receipts-savings.html (or underlying XHR endpoints)
- Approach: Playwright login → capture session → hit receipt XHR endpoints directly. Map the API calls the frontend makes via browser dev tools network tab.
- Prior art: dapperfu/python_Meijer (requires a MITM proxy for auth; avoid this pattern and prefer direct browser automation).
- Data available: Digital receipts appear ~15 minutes after purchase if mPerks ID was used at checkout. Includes item names, prices, discounts, savings.
Kroger
- Auth: No public API for purchase history (that's behind Partner API). Session cookie-based auth on kroger.com.
- Receipt location: kroger.com/mypurchases
- Approach: Playwright login → scrape purchase history pages or intercept XHR endpoints.
- Anti-bot: Kroger uses Akamai Bot Manager, which detects headless browsers aggressively. Scraping needs Playwright stealth, realistic fingerprinting, and human-like interaction pacing.
- Prior art: phyllis-vance/KrogerScrape (.NET, old), callaginn/kroger-sweeper (Puppeteer/Node), ThermoMan/Get-Kroger-Grocery-List (Greasemonkey userscript).
- Kroger public API: A free developer account at developer.kroger.com provides product catalog data (product.compact scope), which is useful for enriching scraped receipt data with UPCs, categories, and product images. It is NOT useful for purchase history.
- Data available: Purchase history tied to Kroger Plus loyalty card. Shows items, prices, quantities.
Target (Circle)
- Auth: Session-based auth on target.com.
- Receipt location: target.com account → Orders → In-store tab, or target.com/account/orders
- Approach: Playwright login → scrape in-store purchase history.
- Data available: ~1 year of history if user paid with a linked card, used the Target app wallet, or entered their Target Circle phone number at checkout. Includes item names, prices.
Tech Stack
- Python 3.12+
- Playwright (Python async API) for headless browser automation
- FastAPI (lightweight internal API for triggering scrapes, health checks, status)
- SQLAlchemy 2.0 (models inlined under src/receiptwitness/shared/)
- Redis (pub/sub event publishing)
- APScheduler or Celery (for scheduled scraping jobs)
- cryptography / Fernet (encrypting stored session data)
Repo Structure
receiptwitness/
├── CLAUDE.md
├── README.md
├── pyproject.toml
├── Dockerfile                  # Playwright + Chromium headless
├── docker-compose.yml          # Local dev (Postgres, Redis, this service)
├── src/
│   └── receiptwitness/
│       ├── __init__.py
│       ├── config.py           # Service-specific settings
│       ├── main.py             # FastAPI app + scheduler bootstrap
│       ├── scrapers/
│       │   ├── __init__.py
│       │   ├── base.py         # Abstract BaseScraper class
│       │   ├── meijer.py       # Meijer/mPerks scraper
│       │   ├── kroger.py       # Kroger scraper
│       │   └── target.py       # Target/Circle scraper
│       ├── parsers/
│       │   ├── __init__.py
│       │   ├── meijer.py       # Parse raw Meijer receipt data → PurchaseItem records
│       │   ├── kroger.py
│       │   └── target.py
│       ├── session/
│       │   ├── __init__.py
│       │   ├── manager.py      # Session storage, retrieval, refresh logic
│       │   └── encryption.py   # Encrypt/decrypt session cookies at rest
│       ├── scheduler.py        # Scrape scheduling (per-user cron jobs)
│       ├── events.py           # Publish receipt.ingested events to Redis
│       ├── api/
│       │   ├── __init__.py
│       │   ├── routes.py       # Internal API: trigger scrape, check status, health
│       │   └── auth.py         # Internal service auth (API key or JWT)
│       └── enrichment.py       # Optional: enrich receipt data via Kroger public API
└── tests/
    ├── conftest.py
    ├── fixtures/               # Sample receipt HTML/JSON for testing parsers
    │   ├── meijer_receipt.json
    │   ├── kroger_receipt.html
    │   └── target_receipt.html
    ├── test_scrapers/
    ├── test_parsers/
    └── test_session/
Scraper Architecture
Base Scraper Pattern
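from __future__ import annotations  # model types are defined elsewhere in the package

from abc import ABC, abstractmethod
from datetime import datetime


class BaseScraper(ABC):
    """All retailer scrapers implement this interface."""

    # Authenticate with the retailer and return the captured session.
    @abstractmethod
    async def login(self, credentials: UserStoreAccount) -> SessionData: ...
    # Lightweight check that a stored session is still authenticated.
    @abstractmethod
    async def check_session(self, session: SessionData) -> bool: ...
    # Fetch raw receipts, optionally only those newer than `since`.
    @abstractmethod
    async def scrape_receipts(self, session: SessionData, since: datetime | None) -> list[RawReceipt]: ...
    # Convert one raw receipt into a Purchase and its PurchaseItem records.
    @abstractmethod
    def parse_receipt(self, raw: RawReceipt) -> tuple[Purchase, list[PurchaseItem]]: ...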
Scraping Flow
- Scheduler fires for a user+store combination
- Load session from user_store_accounts table (encrypted)
- Check session validity: quick lightweight request to verify auth
- If expired: launch Playwright, re-authenticate, save new session
- Scrape receipts since last_sync_at timestamp
- Parse raw data into Purchase and PurchaseItem records
- Deduplicate: skip receipts already in DB (match on receipt_id per store)
- Write to DB: insert new purchases and items
- Derive price_history entries from purchase_items
- Publish event: cartsnitch.receipts.ingested to Redis
- Update user_store_accounts.last_sync_at
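The flow above can be sketched end to end with in-memory stand-ins. This is a minimal illustration, not the service's implementation: `Account`, `FakeScraper`, `sync_account`, the `known_ids` set (standing in for the database lookup), and the `publish` callback are all hypothetical names. Parsing and price_history derivation are left out to keep it short.

```python
from __future__ import annotations

import asyncio
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class Account:
    """Hypothetical stand-in for a user_store_accounts row."""
    store_slug: str
    session: dict | None = None
    last_sync_at: datetime | None = None


@dataclass
class FakeScraper:
    """Hypothetical scraper exposing the same method names as BaseScraper."""
    receipts: list[dict] = field(default_factory=list)

    async def login(self, account: Account) -> dict:
        return {"cookie": "fresh"}

    async def check_session(self, session: dict) -> bool:
        return session.get("cookie") == "fresh"

    async def scrape_receipts(self, session: dict, since: datetime | None) -> list[dict]:
        return [r for r in self.receipts if since is None or r["purchased_at"] > since]


async def sync_account(account: Account, scraper: FakeScraper, known_ids: set, publish) -> list[dict]:
    # Reuse the stored session; re-authenticate only when it is missing or invalid.
    if account.session is None or not await scraper.check_session(account.session):
        account.session = await scraper.login(account)
    # Scrape only receipts newer than the last sync.
    raw = await scraper.scrape_receipts(account.session, account.last_sync_at)
    # Deduplicate on (store, receipt_id).
    new = [r for r in raw if (account.store_slug, r["receipt_id"]) not in known_ids]
    for receipt in new:
        known_ids.add((account.store_slug, receipt["receipt_id"]))  # stands in for the DB insert
        publish("cartsnitch.receipts.ingested", receipt)
    account.last_sync_at = datetime.now(timezone.utc)
    return new
```

Calling `asyncio.run(sync_account(account, scraper, known_ids, publish))` twice in a row returns nothing the second time, because every receipt is already in `known_ids`.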
Session Management
- Sessions (cookies, tokens) are encrypted at rest using Fernet symmetric encryption.
- The encryption key is provided via environment variable, not stored in the DB.
- Sessions are stored in the user_store_accounts table as encrypted JSONB.
- Each scrape attempt first checks if the existing session is valid before launching a full Playwright browser instance.
- When a session expires, the service needs the user's stored credentials OR a manual re-auth flow (the user logs in via the frontend, and we capture the session).
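As a rough sketch of the encryption step, session data can be serialized to JSON and wrapped with Fernet from the cryptography package. The helper names `encrypt_session` and `decrypt_session` and the shape of the session dict are assumptions made for illustration. In the service the key would come from an environment variable rather than being generated inline.

```python
from __future__ import annotations

import json

from cryptography.fernet import Fernet, InvalidToken


def encrypt_session(key: bytes, session: dict) -> bytes:
    """Serialize session data to JSON and encrypt it with Fernet."""
    return Fernet(key).encrypt(json.dumps(session).encode("utf-8"))


def decrypt_session(key: bytes, token: bytes) -> dict | None:
    """Return the decrypted session, or None if the token does not verify under this key."""
    try:
        return json.loads(Fernet(key).decrypt(token))
    except InvalidToken:
        return None


# Generated inline only to keep the sketch self-contained.
key = Fernet.generate_key()
token = encrypt_session(key, {"cookies": [{"name": "sid", "value": "abc"}]})
```

Returning None on `InvalidToken` lets the caller treat an undecryptable session the same way as an expired one and fall back to re-authentication.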
Anti-Bot Considerations
- Use playwright-stealth or equivalent to mask automation signals.
- Set realistic viewport sizes, user agents, and locale settings.
- Add human-like delays between page navigations (randomized 1-5 seconds).
- For Kroger specifically (Akamai Bot Manager): may need to use non-headless mode on initial auth, or route through a persistent browser profile that has established trust.
- Rate limit scraping: no more than 1 scrape per user per store per hour. Default cadence: once daily.
- Store and reuse browser profiles/cookies to minimize fresh logins.
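Two of the rules above, the hourly rate limit and the randomized 1-5 second navigation delay, are simple enough to sketch with the standard library. The function names `may_scrape` and `navigation_delay` are hypothetical.

```python
from __future__ import annotations

import random
from datetime import datetime, timedelta, timezone

# "No more than 1 scrape per user per store per hour."
MIN_SCRAPE_INTERVAL = timedelta(hours=1)


def may_scrape(last_scrape_at: datetime | None, now: datetime) -> bool:
    """True if this user+store pair has never been scraped or the hour has elapsed."""
    return last_scrape_at is None or now - last_scrape_at >= MIN_SCRAPE_INTERVAL


def navigation_delay(rng: random.Random | None = None) -> float:
    """Seconds to wait before the next page navigation, uniform between 1 and 5."""
    return (rng or random).uniform(1.0, 5.0)
```

Inside an async scraper, the delay would typically be applied with `await asyncio.sleep(navigation_delay())` between navigations.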
Dockerfile
The Dockerfile must include Playwright and Chromium. Base image pattern:
FROM mcr.microsoft.com/playwright/python:v1.49.0-noble
# Install deps, copy code, etc.
This is a large image (~2GB) due to Chromium. Consider multi-stage builds if the final image can be slimmed down.
Internal API Endpoints
This service exposes a lightweight internal API (not public-facing):
- GET /health: health check
- GET /status/{user_id}: sync status per store for a user
- POST /scrape/{user_id}/{store_slug}: trigger an immediate scrape for a user+store
- POST /scrape/{user_id}/all: trigger scrape across all configured stores
- GET /sessions/{user_id}: list configured store sessions and their status
The public-facing API gateway (cartsnitch/api) proxies user-facing requests to this service's internal API.
Events Published
cartsnitch.receipts.ingested
Published after new receipt data is successfully written to the DB.
{
  "event_type": "cartsnitch.receipts.ingested",
  "timestamp": "2026-03-15T12:00:00Z",
  "service": "receiptwitness",
  "payload": {
    "user_id": "uuid",
    "store_slug": "meijer",
    "purchase_id": "uuid",
    "purchase_date": "2026-03-14",
    "item_count": 23,
    "total": 87.42
  }
}
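One way to build this envelope in Python is sketched below. The function name `build_receipts_ingested_event` and its parameters are hypothetical. It assumes `now` is timezone-aware and converts the total to a float to match the example payload.

```python
from __future__ import annotations

import json
from datetime import date, datetime, timezone
from decimal import Decimal
from uuid import UUID, uuid4


def build_receipts_ingested_event(
    user_id: UUID,
    store_slug: str,
    purchase_id: UUID,
    purchase_date: date,
    item_count: int,
    total: Decimal,
    now: datetime | None = None,
) -> str:
    """Return the JSON message for one ingested purchase."""
    # Normalize to UTC so the trailing "Z" is accurate.
    now = (now or datetime.now(timezone.utc)).astimezone(timezone.utc)
    event = {
        "event_type": "cartsnitch.receipts.ingested",
        "timestamp": now.strftime("%Y-%m-%dT%H:%M:%SZ"),
        "service": "receiptwitness",
        "payload": {
            "user_id": str(user_id),
            "store_slug": store_slug,
            "purchase_id": str(purchase_id),
            "purchase_date": purchase_date.isoformat(),
            "item_count": item_count,
            "total": float(total),
        },
    }
    return json.dumps(event)


message = build_receipts_ingested_event(
    uuid4(), "meijer", uuid4(), date(2026, 3, 14), 23, Decimal("87.42"),
    now=datetime(2026, 3, 15, 12, 0, tzinfo=timezone.utc),
)
```

With redis-py, the resulting string could then be sent with `client.publish(channel, message)`. Using the event type as the channel name is an assumption, since the channel is not specified here.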
Development Workflow
- Never push directly to main. Always create feature branches and open PRs.
- Branch naming: feature/<store>/<description> or fix/<description>
- Use conventional commits: feat:, fix:, refactor:, docs:, chore:
- Test parsers with fixture data (sample receipts in tests/fixtures/). Scraper integration tests require real credentials and should be tagged/skipped in CI.
- Local dev: docker-compose up starts Postgres, Redis, and the service. Playwright runs inside the container.
Important Notes
- The Playwright container image is large. On K8s, consider using a dedicated node or tolerating scheduling delays.
- Each user needs their own authenticated sessions. At 1,000 users × 3 stores = 3,000 sessions to manage. Sessions expire at different rates per retailer.
- Scraping must be respectful: randomized intervals, rate limiting, no parallel scraping of the same store for the same user.
- Receipt data structure varies significantly between retailers. The parsers must be robust and handle edge cases (returns, voided items, weighted produce, BOGO items, coupon stacking).
- Kroger's public API (product.compact scope) can be used to enrich scraped data with UPCs and product metadata after receipt parsing. This is optional but improves product normalization downstream.
- Store credentials for users should ideally NOT be stored by CartSnitch. Prefer a flow where the user authenticates in a controlled browser session, and we capture/store only the resulting session cookies. If credential storage is necessary, use strong encryption and make the tradeoffs clear to users.
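To put the session count in perspective: 3,000 user+store pairs at the default once-daily cadence average out to one scrape roughly every 29 seconds if spread evenly across the day. One possible way to get that spread, shown here purely as an assumption and not as the scheduler's actual design, is to derive a stable per-pair offset from a hash. `daily_offset_seconds` is a hypothetical helper.

```python
import hashlib

SECONDS_PER_DAY = 24 * 60 * 60


def daily_offset_seconds(user_id: str, store_slug: str) -> int:
    """Stable pseudo-random second-of-day for a user+store pair."""
    digest = hashlib.sha256(f"{user_id}:{store_slug}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % SECONDS_PER_DAY


# 1,000 users x 3 stores, scraped once daily.
sessions = 1_000 * 3
average_gap_seconds = SECONDS_PER_DAY / sessions  # 28.8 seconds between scrapes on average
```

Because the offset depends only on the pair, the same user+store lands at the same time each day. A small random jitter on top would be needed to satisfy the randomized-interval requirement.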