Squashed 'receiptwitness/' content from commit e8d374a
git-subtree-dir: receiptwitness
git-subtree-split: e8d374a89ed8978f429598e02d31b1c5963efe22
@@ -0,0 +1,12 @@
__pycache__/
*.pyc
.pytest_cache/
*.egg-info/
dist/
.venv/
.env
.git/
.github/
tests/
*.md
renovate.json
@@ -0,0 +1,7 @@
__pycache__/
*.pyc
.pytest_cache/
*.egg-info/
dist/
.venv/
.env
@@ -0,0 +1,227 @@
# ReceiptWitness — CartSnitch Receipt Ingestion Service

## Project Context

CartSnitch is a self-hosted grocery price intelligence platform built as a polyrepo microservices architecture. This repo (`cartsnitch/receiptwitness`) is the receipt/purchase history ingestion service.

**GitHub org:** github.com/cartsnitch
**Domain:** cartsnitch.com

### CartSnitch Services

| Repo | Service | Purpose |
|------|---------|---------|
| `cartsnitch/common` | — | Shared models, schemas, utilities |
| `cartsnitch/receiptwitness` | ReceiptWitness | Purchase data ingestion via retailer scrapers (this repo) |
| `cartsnitch/api` | API Gateway | Frontend-facing REST API |
| `cartsnitch/cartsnitch` | Frontend | React PWA (mobile-first) |
| `cartsnitch/stickershock` | StickerShock | Price increase detection & CPI comparison |
| `cartsnitch/shrinkray` | ShrinkRay | Shrinkflation monitoring |
| `cartsnitch/clipartist` | ClipArtist | Coupon/deal watching & shopping optimization |
| `cartsnitch/infra` | — | K8s manifests, Flux kustomizations |

### Architecture Decisions

- **Polyrepo:** Each service has its own repo, Dockerfile, CI/CD pipeline.
- **Shared DB:** One PostgreSQL cluster. This service writes to `purchases`, `purchase_items`, `price_history` tables. Models come from `cartsnitch-common`.
- **Inter-service comms:** REST (synchronous) + Redis pub/sub (async events).
- **Target scale:** 500–1,000 users. Each user has their own authenticated sessions to up to 3 retailers.

## What This Service Does

ReceiptWitness authenticates with grocery retailer web portals using per-user sessions, scrapes purchase history / receipt data, parses it into structured records, and writes it to the shared database. After ingestion, it publishes a `cartsnitch.receipts.ingested` event so downstream services (StickerShock, ClipArtist) can react.

### Target Retailers (MVP)

#### Meijer (mPerks)
- **Auth:** No public API. Session cookie-based auth on mperks.meijer.com.
- **Receipt location:** meijer.com/mperks/receipts-savings.html (or underlying XHR endpoints)
- **Approach:** Playwright login → capture session → hit receipt XHR endpoints directly. Map the API calls the frontend makes via browser dev tools network tab.
- **Prior art:** `dapperfu/python_Meijer` (requires MITM proxy for auth — avoid this pattern, prefer direct browser automation).
- **Data available:** Digital receipts appear ~15 minutes after purchase if mPerks ID was used at checkout. Includes item names, prices, discounts, savings.

#### Kroger
- **Auth:** No public API for purchase history (that's behind Partner API). Session cookie-based auth on kroger.com.
- **Receipt location:** kroger.com/mypurchases
- **Approach:** Playwright login → scrape purchase history pages or intercept XHR endpoints.
- **Anti-bot:** Kroger uses Akamai Bot Manager. Aggressive headless browser detection. Need Playwright stealth, realistic fingerprinting, human-like interaction pacing.
- **Prior art:** `phyllis-vance/KrogerScrape` (.NET, old), `callaginn/kroger-sweeper` (Puppeteer/Node), `ThermoMan/Get-Kroger-Grocery-List` (Greasemonkey userscript).
- **Kroger public API:** Free developer account at developer.kroger.com provides product catalog data (`product.compact` scope) — useful for enriching scraped receipt data with UPCs, categories, product images. NOT useful for purchase history.
- **Data available:** Purchase history tied to Kroger Plus loyalty card. Shows items, prices, quantities.

#### Target (Circle)
- **Auth:** Session-based auth on target.com.
- **Receipt location:** target.com account → Orders → In-store tab, or target.com/account/orders
- **Approach:** Playwright login → scrape in-store purchase history.
- **Data available:** ~1 year of history if user paid with a linked card, used the Target app wallet, or entered their Target Circle phone number at checkout. Includes item names, prices.

## Tech Stack

- Python 3.12+
- Playwright (Python async API) for headless browser automation
- FastAPI (lightweight internal API for triggering scrapes, health checks, status)
- SQLAlchemy 2.0 (via `cartsnitch-common`)
- Redis (pub/sub event publishing)
- APScheduler or Celery (for scheduled scraping jobs)
- cryptography / Fernet (encrypting stored session data)

## Repo Structure

```
receiptwitness/
├── CLAUDE.md
├── README.md
├── pyproject.toml
├── Dockerfile                  # Playwright + Chromium headless
├── docker-compose.yml          # Local dev (Postgres, Redis, this service)
├── src/
│   └── receiptwitness/
│       ├── __init__.py
│       ├── config.py           # Service-specific settings
│       ├── main.py             # FastAPI app + scheduler bootstrap
│       ├── scrapers/
│       │   ├── __init__.py
│       │   ├── base.py         # Abstract BaseScraper class
│       │   ├── meijer.py       # Meijer/mPerks scraper
│       │   ├── kroger.py       # Kroger scraper
│       │   └── target.py       # Target/Circle scraper
│       ├── parsers/
│       │   ├── __init__.py
│       │   ├── meijer.py       # Parse raw Meijer receipt data → PurchaseItem records
│       │   ├── kroger.py
│       │   └── target.py
│       ├── session/
│       │   ├── __init__.py
│       │   ├── manager.py      # Session storage, retrieval, refresh logic
│       │   └── encryption.py   # Encrypt/decrypt session cookies at rest
│       ├── scheduler.py        # Scrape scheduling (per-user cron jobs)
│       ├── events.py           # Publish receipts.ingested events to Redis
│       ├── api/
│       │   ├── __init__.py
│       │   ├── routes.py       # Internal API: trigger scrape, check status, health
│       │   └── auth.py         # Internal service auth (API key or JWT)
│       └── enrichment.py       # Optional: enrich receipt data via Kroger public API
└── tests/
    ├── conftest.py
    ├── fixtures/               # Sample receipt HTML/JSON for testing parsers
    │   ├── meijer_receipt.json
    │   ├── kroger_receipt.html
    │   └── target_receipt.html
    ├── test_scrapers/
    ├── test_parsers/
    └── test_session/
```

## Scraper Architecture

### Base Scraper Pattern

```python
from __future__ import annotations  # defer evaluation of the annotations below

from abc import ABC, abstractmethod
from datetime import datetime

# UserStoreAccount, SessionData, RawReceipt, Purchase, and PurchaseItem are
# project types defined elsewhere; they appear here only in annotations.


class BaseScraper(ABC):
    """All retailer scrapers implement this interface."""

    @abstractmethod
    async def login(self, credentials: UserStoreAccount) -> SessionData: ...

    @abstractmethod
    async def check_session(self, session: SessionData) -> bool: ...

    @abstractmethod
    async def scrape_receipts(self, session: SessionData, since: datetime | None) -> list[RawReceipt]: ...

    @abstractmethod
    def parse_receipt(self, raw: RawReceipt) -> tuple[Purchase, list[PurchaseItem]]: ...
```

### Scraping Flow

1. **Scheduler fires** for a user+store combination
2. **Load session** from `user_store_accounts` table (encrypted)
3. **Check session validity** — quick lightweight request to verify auth
4. **If expired:** launch Playwright, re-authenticate, save new session
5. **Scrape receipts** since `last_sync_at` timestamp
6. **Parse** raw data into `Purchase` and `PurchaseItem` records
7. **Deduplicate** — skip receipts already in DB (match on `receipt_id` per store)
8. **Write to DB** — insert new purchases and items
9. **Derive price_history** entries from purchase_items
10. **Publish event** — `cartsnitch.receipts.ingested` to Redis
11. **Update** `user_store_accounts.last_sync_at`
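
The deduplication step (step 7) can be sketched in plain Python. This is only an illustration under stated assumptions: `ParsedReceipt` and `filter_new_receipts` are hypothetical names, not part of `cartsnitch-common`, and the set of existing keys is assumed to have been loaded from the database beforehand.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ParsedReceipt:
    """Hypothetical stand-in for a parsed purchase (not the cartsnitch-common model)."""

    store_slug: str
    receipt_id: str
    total: str


def filter_new_receipts(parsed, existing_keys):
    """Return receipts whose (store_slug, receipt_id) pair is not stored yet.

    Duplicates inside the same batch are dropped too; the first occurrence wins.
    """
    seen = set(existing_keys)
    fresh = []
    for receipt in parsed:
        key = (receipt.store_slug, receipt.receipt_id)
        if key in seen:
            continue
        seen.add(key)
        fresh.append(receipt)
    return fresh


batch = [
    ParsedReceipt("meijer", "12345", "87.42"),
    ParsedReceipt("meijer", "12346", "10.00"),
    ParsedReceipt("meijer", "12346", "10.00"),  # repeated within the batch
]
already_stored = {("meijer", "12345")}  # assumed to come from a DB query
new_receipts = filter_new_receipts(batch, already_stored)
```

A unique constraint on the store and receipt ID columns would still be a sensible backstop, since two concurrent scrapes could both pass an in-memory check like this one.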

### Session Management

- Sessions (cookies, tokens) are encrypted at rest using Fernet symmetric encryption.
- The encryption key is provided via environment variable, not stored in the DB.
- Sessions are stored in the `user_store_accounts` table as encrypted JSONB.
- Each scrape attempt first checks if the existing session is valid before launching a full Playwright browser instance.
- When a session expires, the service needs the user's stored credentials OR a manual re-auth flow (the user logs in via the frontend, and we capture the session).
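
The "check before launching a browser" rule above might look roughly like the following. `FakeScraper` and `ensure_session` are hypothetical names used only for this sketch; a real scraper would make a lightweight authenticated request in `check_session` and drive Playwright in `login`.

```python
import asyncio


class FakeScraper:
    """Hypothetical scraper double; it never touches a browser or the network."""

    def __init__(self, valid_tokens):
        self.valid_tokens = set(valid_tokens)
        self.logins = 0

    async def check_session(self, session):
        # Stands in for a cheap authenticated request against the retailer.
        return session is not None and session.get("token") in self.valid_tokens

    async def login(self, credentials):
        # Stands in for the expensive Playwright login flow.
        self.logins += 1
        token = f"token-{self.logins}"
        self.valid_tokens.add(token)
        return {"token": token}


async def ensure_session(scraper, stored_session, credentials):
    """Reuse the stored session while it is valid; otherwise log in again.

    Returns (session, refreshed) so the caller knows whether to persist it.
    """
    if await scraper.check_session(stored_session):
        return stored_session, False
    return await scraper.login(credentials), True


scraper = FakeScraper(valid_tokens={"good"})
reused, reused_refreshed = asyncio.run(ensure_session(scraper, {"token": "good"}, None))
renewed, renewed_refreshed = asyncio.run(ensure_session(scraper, {"token": "stale"}, None))
```

Returning the `refreshed` flag keeps the decision to re-encrypt and store the session with the caller, so the scraper itself stays free of persistence concerns.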

### Anti-Bot Considerations

- Use `playwright-stealth` or equivalent to mask automation signals.
- Set realistic viewport sizes, user agents, and locale settings.
- Add human-like delays between page navigations (randomized 1-5 seconds).
- For Kroger specifically (Akamai Bot Manager): may need to use non-headless mode on initial auth, or route through a persistent browser profile that has established trust.
- Rate limit scraping: no more than 1 scrape per user per store per hour. Default cadence: once daily.
- Store and reuse browser profiles/cookies to minimize fresh logins.
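
The pacing and rate-limit rules above can be expressed with the standard library alone. The helper names (`next_delay`, `may_scrape`) and constants are assumptions for illustration; the 1 to 5 second range and the one-hour minimum gap come from the list above.

```python
import random
from datetime import datetime, timedelta, timezone

MIN_DELAY_S = 1.0
MAX_DELAY_S = 5.0
MIN_SCRAPE_GAP = timedelta(hours=1)


def next_delay(rng=random):
    """Pick a randomized pause between page navigations, in seconds."""
    return rng.uniform(MIN_DELAY_S, MAX_DELAY_S)


def may_scrape(last_scrape_at, now):
    """Allow a scrape only if at least one hour has passed since the last one."""
    return last_scrape_at is None or now - last_scrape_at >= MIN_SCRAPE_GAP


now = datetime(2026, 3, 15, 12, 0, tzinfo=timezone.utc)
delay = next_delay(random.Random(0))  # seeded generator, e.g. for tests
```

Inside an async scraper the delay would typically be applied with `await asyncio.sleep(next_delay())` between navigations, and `may_scrape` would be checked per user and store before any browser is launched.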

### Dockerfile

The Dockerfile must include Playwright and Chromium. Base image pattern:

```dockerfile
FROM mcr.microsoft.com/playwright/python:v1.49.0-noble
# Install deps, copy code, etc.
```

This is a large image (~2GB) due to Chromium. Consider multi-stage builds if the final image can be slimmed down.

## Internal API Endpoints

This service exposes a lightweight internal API (not public-facing):

- `GET /health` — health check
- `GET /status/{user_id}` — sync status per store for a user
- `POST /scrape/{user_id}/{store_slug}` — trigger an immediate scrape for a user+store
- `POST /scrape/{user_id}/all` — trigger scrape across all configured stores
- `GET /sessions/{user_id}` — list configured store sessions and their status

The public-facing API gateway (`cartsnitch/api`) proxies user-facing requests to this service's internal API.

## Events Published

### `cartsnitch.receipts.ingested`

Published after new receipt data is successfully written to the DB.

```json
{
  "event_type": "cartsnitch.receipts.ingested",
  "timestamp": "2026-03-15T12:00:00Z",
  "service": "receiptwitness",
  "payload": {
    "user_id": "uuid",
    "store_slug": "meijer",
    "purchase_id": "uuid",
    "purchase_date": "2026-03-14",
    "item_count": 23,
    "total": 87.42
  }
}
```
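
On the consuming side, a downstream service could validate and unpack this payload along the following lines. This is a sketch, not the actual StickerShock or ClipArtist code: `ReceiptIngested` and `parse_receipt_ingested` are hypothetical names, and the message is assumed to arrive as a JSON string shaped like the example above.

```python
import json
from dataclasses import dataclass
from datetime import date
from decimal import Decimal


@dataclass(frozen=True)
class ReceiptIngested:
    """Hypothetical typed view of the event payload."""

    user_id: str
    store_slug: str
    purchase_id: str
    purchase_date: date
    item_count: int
    total: Decimal


def parse_receipt_ingested(message: str) -> ReceiptIngested:
    """Decode one pub/sub message, rejecting unexpected event types."""
    # parse_float=Decimal keeps the total exact instead of passing through float.
    event = json.loads(message, parse_float=Decimal)
    if event.get("event_type") != "cartsnitch.receipts.ingested":
        raise ValueError(f"unexpected event type: {event.get('event_type')!r}")
    payload = event["payload"]
    return ReceiptIngested(
        user_id=payload["user_id"],
        store_slug=payload["store_slug"],
        purchase_id=payload["purchase_id"],
        purchase_date=date.fromisoformat(payload["purchase_date"]),
        item_count=int(payload["item_count"]),
        total=Decimal(str(payload["total"])),
    )


sample = json.dumps(
    {
        "event_type": "cartsnitch.receipts.ingested",
        "timestamp": "2026-03-15T12:00:00Z",
        "service": "receiptwitness",
        "payload": {
            "user_id": "uuid",
            "store_slug": "meijer",
            "purchase_id": "uuid",
            "purchase_date": "2026-03-14",
            "item_count": 23,
            "total": 87.42,
        },
    }
)
parsed_event = parse_receipt_ingested(sample)
```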

## Development Workflow

- **Never push directly to main.** Always create feature branches and open PRs.
- Branch naming: `feature/<store>/<description>` or `fix/<description>`
- Use conventional commits: `feat:`, `fix:`, `refactor:`, `docs:`, `chore:`
- Test parsers with fixture data (sample receipts in `tests/fixtures/`). Scraper integration tests require real credentials and should be tagged/skipped in CI.
- Local dev: `docker-compose up` starts Postgres, Redis, and the service. Playwright runs inside the container.

## Important Notes

- The Playwright container image is large. On K8s, consider using a dedicated node or tolerating scheduling delays.
- Each user needs their own authenticated sessions. At 1,000 users × 3 stores = 3,000 sessions to manage. Sessions expire at different rates per retailer.
- Scraping must be respectful: randomized intervals, rate limiting, no parallel scraping of the same store for the same user.
- Receipt data structure varies significantly between retailers. The parsers must be robust and handle edge cases (returns, voided items, weighted produce, BOGO items, coupon stacking).
- Kroger's public API (`product.compact` scope) can be used to enrich scraped data with UPCs and product metadata after receipt parsing. This is optional but improves product normalization downstream.
- Store credentials for users should ideally NOT be stored by CartSnitch. Prefer a flow where the user authenticates in a controlled browser session, and we capture/store only the resulting session cookies. If credential storage is necessary, use strong encryption and make the tradeoffs clear to users.
@@ -0,0 +1,67 @@
# Stage 1: Build dependencies
FROM python:3.12-slim AS build

WORKDIR /app

# git is required to install cartsnitch-common from GitHub; build-essential and
# libpq-dev are needed to compile any C-extension wheels (e.g. psycopg2 fallback)
RUN apt-get update && apt-get install -y --no-install-recommends \
    git \
    libpq-dev \
    build-essential \
    && rm -rf /var/lib/apt/lists/*

COPY pyproject.toml ./
COPY src/ ./src/

# cartsnitch-common is not on PyPI — install it directly from GitHub, then
# install the rest of the package dependencies in a single resolver pass so
# pip can satisfy the cartsnitch-common>=0.1.0 constraint declared in
# pyproject.toml without hitting PyPI for it.
RUN pip install --no-cache-dir --prefix=/install \
    "cartsnitch-common @ git+https://github.com/cartsnitch/common.git@76685ed0384103228cd670b477b967e7752ebe6b" \
    .

# Stage 2: Production image with Playwright + Chromium
FROM python:3.12-slim AS prod

WORKDIR /app

# Install Playwright system dependencies for Chromium
RUN apt-get update && apt-get install -y --no-install-recommends \
    libnss3 \
    libatk1.0-0 \
    libatk-bridge2.0-0 \
    libcups2 \
    libdrm2 \
    libxkbcommon0 \
    libxcomposite1 \
    libxdamage1 \
    libxrandr2 \
    libgbm1 \
    libpango-1.0-0 \
    libcairo2 \
    libasound2 \
    libxshmfence1 \
    libx11-xcb1 \
    libxcb-dri3-0 \
    fonts-liberation \
    && rm -rf /var/lib/apt/lists/*

RUN adduser --system --group --uid 1000 app

COPY --from=build /install /usr/local
COPY src/ ./src/

# Install Playwright Chromium browser (runs as root; /opt/playwright is world-readable)
RUN PLAYWRIGHT_BROWSERS_PATH=/opt/playwright playwright install chromium

ENV PLAYWRIGHT_BROWSERS_PATH=/opt/playwright

USER 1000
EXPOSE 8000

HEALTHCHECK --interval=30s --timeout=3s \
    CMD python -c "import urllib.request; urllib.request.urlopen('http://localhost:8000/health')"

CMD ["uvicorn", "receiptwitness.main:app", "--host", "0.0.0.0", "--port", "8000"]
@@ -0,0 +1,4 @@
{
  "$schema": "https://docs.renovatebot.com/renovate-schema.json",
  "extends": ["local>cartsnitch/.github:renovate-config"]
}
@@ -0,0 +1 @@
"""ReceiptWitness — CartSnitch receipt ingestion service."""
@@ -0,0 +1 @@
"""Internal API for ReceiptWitness service."""
@@ -0,0 +1,10 @@
"""Internal API routes for triggering scrapes and checking status."""

from fastapi import APIRouter

router = APIRouter()


@router.get("/health")
async def health():
    return {"status": "ok", "service": "receiptwitness"}
@@ -0,0 +1,26 @@
"""Service-specific configuration for ReceiptWitness."""

from pydantic_settings import BaseSettings


class ReceiptWitnessSettings(BaseSettings):
    model_config = {"env_prefix": "RW_"}

    # Inherited from cartsnitch-common
    database_url: str = "postgresql+asyncpg://cartsnitch:cartsnitch@localhost:5432/cartsnitch"
    redis_url: str = "redis://localhost:6379/0"

    # Session encryption
    session_encryption_key: str = ""

    # Scraping defaults
    scrape_interval_seconds: int = 86400  # 24 hours
    min_request_delay_ms: int = 1000
    max_request_delay_ms: int = 5000

    # Playwright
    headless: bool = True
    browser_timeout_ms: int = 60000


settings = ReceiptWitnessSettings()
@@ -0,0 +1,75 @@
"""Publish receipt ingestion events to Redis/DragonflyDB pub/sub."""

import json
import logging
from datetime import UTC, datetime
from decimal import Decimal

import redis.asyncio as aioredis

from receiptwitness.config import settings

logger = logging.getLogger(__name__)

CHANNEL_RECEIPTS_INGESTED = "cartsnitch.receipts.ingested"

# Module-level connection pool — shared across all publish calls
_pool: aioredis.ConnectionPool | None = None


class _DecimalEncoder(json.JSONEncoder):
    def default(self, o):
        if isinstance(o, Decimal):
            return float(o)
        return super().default(o)


def _get_pool() -> aioredis.ConnectionPool:
    """Get or create the shared Redis connection pool."""
    global _pool
    if _pool is None:
        _pool = aioredis.ConnectionPool.from_url(
            settings.redis_url, decode_responses=True, max_connections=10
        )
    return _pool


async def get_redis_client() -> aioredis.Redis:
    """Create an async Redis/DragonflyDB client with connection pooling."""
    return aioredis.Redis(connection_pool=_get_pool())


async def publish_receipt_ingested(
    user_id: str,
    store_slug: str,
    purchase_id: str,
    purchase_date: str,
    item_count: int,
    total: Decimal | float,
) -> None:
    """Publish a cartsnitch.receipts.ingested event after successful ingestion."""
    event = {
        "event_type": CHANNEL_RECEIPTS_INGESTED,
        "timestamp": datetime.now(UTC).isoformat(),
        "service": "receiptwitness",
        "payload": {
            "user_id": user_id,
            "store_slug": store_slug,
            "purchase_id": purchase_id,
            "purchase_date": purchase_date,
            "item_count": item_count,
            "total": float(total) if isinstance(total, Decimal) else total,
        },
    }

    try:
        client = await get_redis_client()
        await client.publish(CHANNEL_RECEIPTS_INGESTED, json.dumps(event, cls=_DecimalEncoder))
        logger.info(
            "Published %s event for purchase %s",
            CHANNEL_RECEIPTS_INGESTED,
            purchase_id,
        )
    except aioredis.ConnectionError:
        logger.error("Failed to publish event — Redis/DragonflyDB connection error")
        raise
@@ -0,0 +1,8 @@
"""FastAPI app entrypoint for ReceiptWitness."""

from fastapi import FastAPI

from receiptwitness.api.routes import router

app = FastAPI(title="ReceiptWitness", version="0.1.0")
app.include_router(router)
@@ -0,0 +1 @@
"""Receipt parsers for each retailer."""
@@ -0,0 +1,148 @@
"""Kroger receipt parser.

Transforms raw Kroger receipt JSON into the common PurchaseCreate schema.
Kroger receipt data uses different field names than Meijer — this parser
handles Kroger-specific naming conventions and receipt structure.
"""

import logging
from decimal import Decimal, InvalidOperation

from receiptwitness.scrapers.base import RawReceipt

logger = logging.getLogger(__name__)


def _to_decimal(value, default: str = "0") -> Decimal:
    """Safely convert a value to Decimal."""
    if value is None:
        return Decimal(default)
    try:
        return Decimal(str(value))
    except (InvalidOperation, ValueError, TypeError):
        return Decimal(default)


def _parse_item(item: dict) -> dict:
    """Parse a single line item from a Kroger receipt.

    Kroger items typically include fields like:
    - description / itemDescription / productName
    - upc / krogerProductId
    - quantity / qty
    - basePrice / unitPrice / price
    - totalPrice / extendedAmount / lineTotal
    - regularPrice / originalPrice
    - salePrice / promoPrice
    - couponAmount / couponSavings
    - loyaltyDiscount / fuelPointsDiscount / plusCardSavings
    - department / category / aisle
    """
    description = (
        item.get("description")
        or item.get("itemDescription")
        or item.get("productName")
        or item.get("name")
        or "UNKNOWN ITEM"
    )

    quantity = _to_decimal(item.get("quantity", item.get("qty", item.get("quantitySold", 1))), "1")
    unit_price = _to_decimal(item.get("basePrice", item.get("unitPrice", item.get("price", 0))))
    extended_price = _to_decimal(
        item.get("totalPrice", item.get("extendedAmount", item.get("lineTotal")))
    )

    # Compute extended_price if not provided
    if extended_price == Decimal("0") and unit_price != Decimal("0"):
        extended_price = unit_price * quantity

    regular_price = item.get("regularPrice", item.get("originalPrice"))
    sale_price = item.get("salePrice", item.get("promoPrice"))
    coupon_discount = item.get(
        "couponAmount", item.get("couponSavings", item.get("couponDiscount"))
    )
    loyalty_discount = item.get(
        "plusCardSavings",
        item.get("loyaltyDiscount", item.get("fuelPointsDiscount")),
    )

    # UPC handling — Kroger may use krogerProductId or upc
    upc = item.get("upc", item.get("UPC", item.get("krogerProductId")))
    if upc:
        upc = str(upc).strip().lstrip("0") or None

    category = item.get("department", item.get("category", item.get("aisle")))

    # Weight info for produce/deli items
    # NOTE: `extra` is collected here but is not included in the returned dict.
    weight = item.get("weight", item.get("netWeight"))
    extra = {}
    if weight is not None:
        extra["weight"] = str(weight)
        weight_uom = item.get("weightUom", item.get("unitOfMeasure"))
        if weight_uom:
            extra["weight_uom"] = weight_uom

    result = {
        "product_name_raw": description.strip(),
        "upc": upc,
        "quantity": quantity,
        "unit_price": unit_price,
        "extended_price": extended_price,
        "regular_price": (_to_decimal(regular_price) if regular_price is not None else None),
        "sale_price": (_to_decimal(sale_price) if sale_price is not None else None),
        "coupon_discount": (_to_decimal(coupon_discount) if coupon_discount is not None else None),
        "loyalty_discount": (
            _to_decimal(loyalty_discount) if loyalty_discount is not None else None
        ),
        "category_raw": category.strip() if category else None,
    }

    return result


def parse_kroger_receipt(raw: RawReceipt) -> dict:
    """Parse a RawReceipt from Kroger into a PurchaseCreate-compatible dict."""
    data = raw.raw_data
    detail = data.get("detail", {})

    # Parse items — Kroger uses "items" or "lineItems" or "receiptItems"
    raw_items = detail.get("items", detail.get("lineItems", detail.get("receiptItems", [])))
    items = []
    for raw_item in raw_items:
        # Skip voided / returned items
        if raw_item.get("voided") or raw_item.get("status") in (
            "VOIDED",
            "RETURNED",
        ):
            logger.debug("Skipping voided/returned item: %s", raw_item.get("description"))
            continue
        if raw_item.get("returnFlag") or raw_item.get("isReturn"):
            logger.debug("Skipping returned item: %s", raw_item.get("description"))
            continue
        items.append(_parse_item(raw_item))

    # Parse totals — Kroger uses various field names
    total = _to_decimal(
        detail.get(
            "total",
            data.get("total", data.get("orderTotal", data.get("grandTotal", 0))),
        )
    )
    subtotal = detail.get("subtotal", data.get("subtotal", data.get("subTotal")))
    tax = detail.get("tax", data.get("tax", data.get("salesTax")))
    savings = detail.get(
        "totalSavings",
        data.get("savings", data.get("totalDiscount", data.get("youSaved"))),
    )

    return {
        "receipt_id": raw.receipt_id,
        "purchase_date": raw.purchase_date,
        "total": total,
        "subtotal": _to_decimal(subtotal) if subtotal is not None else None,
        "tax": _to_decimal(tax) if tax is not None else None,
        "savings_total": _to_decimal(savings) if savings is not None else None,
        "source_url": raw.source_url,
        "raw_data": data,
        "items": items,
    }
@@ -0,0 +1,138 @@
|
|||||||
|
"""Parse raw Meijer mPerks receipt data into PurchaseCreate-compatible dicts.
|
||||||
|
|
||||||
|
The mPerks receipt JSON structure (reverse-engineered from their SPA)
|
||||||
|
typically looks like:
|
||||||
|
|
||||||
|
Transaction listing:
|
||||||
|
{
|
||||||
|
"transactions": [
|
||||||
|
{
|
||||||
|
"transactionId": "12345",
|
||||||
|
"transactionDate": "2026-03-10T14:30:00Z",
|
||||||
|
"storeNumber": "123",
|
||||||
|
"total": 87.42,
|
||||||
|
"savings": 12.50
|
||||||
|
}
|
||||||
|
]
|
||||||
|
}
|
||||||
|
|
||||||
|
Receipt detail:
|
||||||
|
{
|
||||||
|
"receiptId": "12345",
|
||||||
|
"items": [
|
||||||
|
{
|
||||||
|
"description": "ORGANIC BANANAS",
|
||||||
|
"upc": "0000000004011",
|
||||||
|
"quantity": 1,
|
||||||
|
"price": 0.69,
|
||||||
|
"extendedPrice": 0.69,
|
||||||
|
"regularPrice": 0.79,
|
||||||
|
"salePrice": 0.69,
|
||||||
|
"couponDiscount": 0.0,
|
||||||
|
"mperksDiscount": 0.10,
|
||||||
|
"category": "PRODUCE"
|
||||||
|
}
|
||||||
|
],
|
||||||
|
"subtotal": 74.92,
|
||||||
|
"tax": 5.24,
|
||||||
|
"total": 87.42,
|
||||||
|
"totalSavings": 12.50
|
||||||
|
}
|
||||||
|
"""
|
||||||
|
|
||||||
|
import logging
|
||||||
|
from decimal import Decimal, InvalidOperation
|
||||||
|
|
||||||
|
from receiptwitness.scrapers.base import RawReceipt
|
||||||
|
|
||||||
|
logger = logging.getLogger(__name__)
|
||||||
|
|
||||||
|
|
||||||
|
def _to_decimal(value, default: str = "0") -> Decimal:
|
||||||
|
"""Safely convert a value to Decimal."""
|
||||||
|
if value is None:
|
||||||
|
return Decimal(default)
|
||||||
|
try:
|
||||||
|
return Decimal(str(value))
|
||||||
|
except (InvalidOperation, ValueError, TypeError):
|
||||||
|
return Decimal(default)
|
||||||
|
|
||||||
|
|
||||||
|
def _parse_item(item: dict) -> dict:
|
||||||
|
"""Parse a single line item from Meijer receipt detail."""
|
||||||
|
description = (
|
||||||
|
item.get("description") or item.get("itemDescription") or item.get("name") or "UNKNOWN ITEM"
|
||||||
|
)
|
||||||
|
|
||||||
|
quantity = _to_decimal(item.get("quantity", item.get("qty", 1)), "1")
|
||||||
|
unit_price = _to_decimal(item.get("price", item.get("unitPrice", 0)))
|
||||||
|
extended_price = _to_decimal(item.get("extendedPrice", item.get("totalPrice")))
|
||||||
|
|
||||||
|
# If extended_price wasn't provided, compute it
|
||||||
|
if extended_price == Decimal("0") and unit_price != Decimal("0"):
|
||||||
|
extended_price = unit_price * quantity
|
||||||
|
|
||||||
|
regular_price = item.get("regularPrice")
|
||||||
|
sale_price = item.get("salePrice")
|
||||||
|
coupon_discount = item.get("couponDiscount", item.get("couponSavings"))
|
||||||
|
loyalty_discount = item.get("mperksDiscount", item.get("loyaltyDiscount"))
|
||||||
|
|
||||||
|
upc = item.get("upc", item.get("UPC"))
|
||||||
|
if upc:
|
||||||
|
upc = str(upc).strip().lstrip("0") or None
|
||||||
|
|
||||||
|
category = item.get("category", item.get("departmentDescription"))
|
||||||
|
|
||||||
|
return {
|
||||||
|
"product_name_raw": description.strip(),
|
||||||
|
"upc": upc,
|
||||||
|
"quantity": quantity,
|
||||||
|
"unit_price": unit_price,
|
||||||
|
"extended_price": extended_price,
|
||||||
|
"regular_price": _to_decimal(regular_price) if regular_price is not None else None,
|
||||||
|
"sale_price": _to_decimal(sale_price) if sale_price is not None else None,
|
||||||
|
"coupon_discount": (_to_decimal(coupon_discount) if coupon_discount is not None else None),
|
||||||
|
"loyalty_discount": (
|
||||||
|
_to_decimal(loyalty_discount) if loyalty_discount is not None else None
|
||||||
|
),
|
||||||
|
"category_raw": category.strip() if category else None,
|
||||||
|
}
|
||||||
|
|
||||||
|
|
||||||
|
def parse_meijer_receipt(raw: RawReceipt) -> dict:
    """Parse a RawReceipt from Meijer into a PurchaseCreate-compatible dict.

    Returns a dict with keys matching PurchaseCreate schema fields.
    The caller is responsible for setting store_id and store_location_id
    from the store registry.
    """
    data = raw.raw_data
    detail = data.get("detail", {})

    # Parse items from the detail response
    raw_items = detail.get("items", detail.get("lineItems", []))
    items = []
    for raw_item in raw_items:
        # Skip voided items
        if raw_item.get("voided") or raw_item.get("status") == "VOIDED":
            logger.debug("Skipping voided item: %s", raw_item.get("description"))
            continue
        items.append(_parse_item(raw_item))

    # Parse totals
    total = _to_decimal(detail.get("total", data.get("total", data.get("transactionTotal", 0))))
    subtotal = detail.get("subtotal", data.get("subtotal"))
    tax = detail.get("tax", data.get("tax"))
    savings = detail.get("totalSavings", data.get("savings", data.get("totalDiscount")))

    return {
        "receipt_id": raw.receipt_id,
        "purchase_date": raw.purchase_date,
        "total": total,
        "subtotal": _to_decimal(subtotal) if subtotal is not None else None,
        "tax": _to_decimal(tax) if tax is not None else None,
        "savings_total": _to_decimal(savings) if savings is not None else None,
        "source_url": raw.source_url,
        "raw_data": data,
        "items": items,
    }
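# Illustrative sketch (not part of the module): how the voided-item filter and
# the nested .get() fallbacks in parse_meijer_receipt behave on a small payload.
# The `sample` dict is hypothetical; only the key names come from the parser
# above. Real Meijer responses may carry more fields or a different shape.

```python
from decimal import Decimal

# Hypothetical payload: items live under "lineItems", the total sits at the
# top level rather than under "detail".
sample = {
    "total": "12.50",
    "detail": {
        "lineItems": [
            {"description": "MILK", "unitPrice": 3.89},
            {"description": "EGGS", "voided": True},
            {"description": "BREAD", "status": "VOIDED"},
        ],
    },
}

detail = sample.get("detail", {})

# "items" is preferred; "lineItems" is only consulted when "items" is absent.
raw_items = detail.get("items", detail.get("lineItems", []))

# Same skip rule as the parser: a truthy "voided" flag or status == "VOIDED".
kept = [
    item["description"]
    for item in raw_items
    if not (item.get("voided") or item.get("status") == "VOIDED")
]

# Lookup order for the total: detail.total, then total, then transactionTotal.
total = Decimal(
    str(detail.get("total", sample.get("total", sample.get("transactionTotal", 0))))
)

print(kept, total)
```

# Note that dict.get() falls back only when a key is missing, so a key that is
# present with a None value would stop the chain at None.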
@@ -0,0 +1,191 @@
"""Target Circle receipt parser.

Transforms raw Target in-store receipt JSON into the common PurchaseCreate schema.
Target receipt data includes Circle pricing, BOGO deals, and Circle rewards
discounts that need special handling.

Target receipt detail structure (reverse-engineered from target.com SPA):

    {
        "orderId": "TGT-2026-0315-7890",
        "items": [
            {
                "description": "GOOD & GATHER WHOLE MILK GAL",
                "tcin": "14767459",
                "upc": "0085239100123",
                "quantity": 1,
                "unitPrice": 3.89,
                "totalPrice": 3.89,
                "regularPrice": 4.19,
                "circlePrice": 3.89,
                "couponDiscount": 0.0,
                "circleRewardsDiscount": 0.30,
                "promoDescription": "Circle offer: Save 30c",
                "department": "GROCERY"
            }
        ],
        "subtotal": 78.32,
        "tax": 4.89,
        "total": 83.21,
        "totalSavings": 11.45
    }
"""

import logging
from decimal import Decimal, InvalidOperation

from receiptwitness.scrapers.base import RawReceipt

logger = logging.getLogger(__name__)


def _to_decimal(value, default: str = "0") -> Decimal:
    """Safely convert a value to Decimal."""
    if value is None:
        return Decimal(default)
    try:
        return Decimal(str(value))
    except (InvalidOperation, ValueError, TypeError):
        return Decimal(default)


def _parse_item(item: dict) -> dict:
    """Parse a single line item from a Target receipt.

    Target items may include fields like:
    - description / itemDescription / productName
    - tcin (Target internal product ID) / upc / dpci
    - quantity / qty
    - unitPrice / price
    - totalPrice / extendedPrice / lineTotal
    - regularPrice / originalPrice
    - circlePrice / salePrice / promoPrice
    - couponDiscount / couponSavings
    - circleRewardsDiscount / circleDiscount / loyaltyDiscount
    - promoDescription / offerDescription (e.g. "BOGO 50% off", "Circle offer")
    - department / category
    """
    description = (
        item.get("description")
        or item.get("itemDescription")
        or item.get("productName")
        or item.get("name")
        or "UNKNOWN ITEM"
    )

    quantity = _to_decimal(item.get("quantity", item.get("qty", item.get("quantitySold", 1))), "1")
    unit_price = _to_decimal(item.get("unitPrice", item.get("price", item.get("basePrice", 0))))
    extended_price = _to_decimal(
        item.get("totalPrice", item.get("extendedPrice", item.get("lineTotal")))
    )

    # Compute extended_price if not provided
    if extended_price == Decimal("0") and unit_price != Decimal("0"):
        extended_price = unit_price * quantity

    regular_price = item.get("regularPrice", item.get("originalPrice"))
    # Target Circle pricing: circlePrice takes precedence over generic salePrice
    sale_price = item.get("circlePrice", item.get("salePrice", item.get("promoPrice")))
    coupon_discount = item.get(
        "couponDiscount", item.get("couponSavings", item.get("couponAmount"))
    )
    # Circle rewards / loyalty discount
    loyalty_discount = item.get(
        "circleRewardsDiscount",
        item.get("circleDiscount", item.get("loyaltyDiscount")),
    )

    # UPC handling: Target may use tcin, upc, or dpci
    upc = item.get("upc", item.get("UPC"))
    if upc:
        upc = str(upc).strip().lstrip("0") or None

    # Target also has TCIN (Target.com Item Number) and DPCI (Department/Class/Item)
    tcin = item.get("tcin", item.get("TCIN"))
    dpci = item.get("dpci", item.get("DPCI"))

    category = item.get("department", item.get("category"))

    # Capture promo/deal description for BOGO and Circle offers
    promo_description = item.get("promoDescription", item.get("offerDescription"))

    # Weight info for produce/deli items
    weight = item.get("weight", item.get("netWeight"))
    extra: dict = {}
    if weight is not None:
        extra["weight"] = str(weight)
        weight_uom = item.get("weightUom", item.get("unitOfMeasure"))
        if weight_uom:
            extra["weight_uom"] = weight_uom
    if tcin:
        extra["tcin"] = str(tcin)
    if dpci:
        extra["dpci"] = str(dpci)
    if promo_description:
        extra["promo_description"] = promo_description

    # NOTE: `extra` (weight, TCIN, DPCI, promo description) is collected above
    # but is not attached to the returned dict.
    result: dict = {
        "product_name_raw": description.strip(),
        "upc": upc,
        "quantity": quantity,
        "unit_price": unit_price,
        "extended_price": extended_price,
        "regular_price": _to_decimal(regular_price) if regular_price is not None else None,
        "sale_price": _to_decimal(sale_price) if sale_price is not None else None,
        "coupon_discount": (_to_decimal(coupon_discount) if coupon_discount is not None else None),
        "loyalty_discount": (
            _to_decimal(loyalty_discount) if loyalty_discount is not None else None
        ),
        "category_raw": category.strip() if category else None,
    }

    return result


def parse_target_receipt(raw: RawReceipt) -> dict:
    """Parse a RawReceipt from Target into a PurchaseCreate-compatible dict."""
    data = raw.raw_data
    detail = data.get("detail", {})

    # Parse items: Target uses "items" or "lineItems"
    raw_items = detail.get("items", detail.get("lineItems", []))
    items = []
    for raw_item in raw_items:
        # Skip voided / returned items
        if raw_item.get("voided") or raw_item.get("status") in (
            "VOIDED",
            "RETURNED",
            "CANCELLED",
        ):
            logger.debug("Skipping voided/returned item: %s", raw_item.get("description"))
            continue
        if raw_item.get("returnFlag") or raw_item.get("isReturn"):
            logger.debug("Skipping returned item: %s", raw_item.get("description"))
            continue
        items.append(_parse_item(raw_item))

    # Parse totals
    total = _to_decimal(
        detail.get(
            "total",
            data.get("total", data.get("orderTotal", data.get("grandTotal", 0))),
        )
    )
    subtotal = detail.get("subtotal", data.get("subtotal", data.get("subTotal")))
    tax = detail.get("tax", data.get("tax", data.get("salesTax")))
    savings = detail.get(
        "totalSavings",
        data.get("savings", data.get("totalDiscount", data.get("circleSavings"))),
    )

    return {
        "receipt_id": raw.receipt_id,
        "purchase_date": raw.purchase_date,
        "total": total,
        "subtotal": _to_decimal(subtotal) if subtotal is not None else None,
        "tax": _to_decimal(tax) if tax is not None else None,
        "savings_total": _to_decimal(savings) if savings is not None else None,
        "source_url": raw.source_url,
        "raw_data": data,
        "items": items,
    }
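# Illustrative sketch (not part of the module): the UPC cleanup and the
# extended-price fallback used by _parse_item, pulled out as two small
# functions. The names normalize_upc and resolve_extended_price are
# hypothetical; the logic is intended to mirror the parser above.

```python
from decimal import Decimal


def normalize_upc(raw_upc):
    """Trim whitespace and drop leading zeros; an empty result becomes None."""
    if not raw_upc:
        return None
    return str(raw_upc).strip().lstrip("0") or None


def resolve_extended_price(unit_price, quantity, provided=None):
    """Use the provided line total, or unit_price * quantity when it is zero/missing."""
    extended = Decimal(str(provided)) if provided is not None else Decimal("0")
    if extended == Decimal("0") and unit_price != Decimal("0"):
        extended = unit_price * quantity
    return extended


upc_from_docstring = normalize_upc("0085239100123")
all_zero_upc = normalize_upc("0000")
computed_total = resolve_extended_price(Decimal("3.89"), Decimal("2"))
provided_total = resolve_extended_price(Decimal("3.89"), Decimal("2"), provided="5.00")

print(upc_from_docstring, all_zero_upc, computed_total, provided_total)
```

# One consequence of this rule: a line total that is legitimately zero is
# indistinguishable from a missing one, so it is recomputed from the unit price.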
@@ -0,0 +1,30 @@
"""Receipt & product matching pipeline: receipt normalization and product dedup."""

from receiptwitness.pipeline.matching import (
    ConfidenceLevel,
    ProductMatcher,
    match_purchase_item,
)
from receiptwitness.pipeline.normalization import (
    MatchMethod,
    MatchResult,
    clean_name,
    extract_size_info,
    jaccard_similarity,
    normalize_product,
)
from receiptwitness.pipeline.receipt import normalize_receipt, parse_meijer_item

__all__ = [
    "ConfidenceLevel",
    "MatchMethod",
    "MatchResult",
    "ProductMatcher",
    "clean_name",
    "extract_size_info",
    "jaccard_similarity",
    "match_purchase_item",
    "normalize_product",
    "normalize_receipt",
    "parse_meijer_item",
]
@@ -0,0 +1,136 @@
"""Product matching & dedup: UPC primary, fuzzy name fallback, confidence scoring.

Wraps the Phase 1 normalization module with confidence-level classification
and batch matching for purchase ingestion.
"""

import uuid
from dataclasses import dataclass

from cartsnitch_common.constants import MatchConfidence
from cartsnitch_common.models.product import NormalizedProduct
from cartsnitch_common.schemas.purchase import PurchaseItemCreate
from sqlalchemy.orm import Session

from receiptwitness.pipeline.normalization import (
    MatchMethod,
    MatchResult,
    extract_size_info,
    normalize_product,
)

# Re-export for convenience
ConfidenceLevel = MatchConfidence


@dataclass(frozen=True)
class MatchOutcome:
    """Result of matching a single purchase item to a normalized product."""

    item_index: int
    match: MatchResult | None
    confidence_level: MatchConfidence
    created_new: bool = False


def classify_confidence(score: float, method: MatchMethod) -> MatchConfidence:
    """Classify a match score into high/medium/low confidence."""
    if method == MatchMethod.UPC:
        return MatchConfidence.HIGH
    # Name-based matching thresholds
    if score >= 0.8:
        return MatchConfidence.HIGH
    if score >= 0.5:
        return MatchConfidence.MEDIUM
    return MatchConfidence.LOW


def _create_product_from_item(
    session: Session,
    item: PurchaseItemCreate,
) -> NormalizedProduct:
    """Create a new NormalizedProduct from a purchase item that had no match."""
    size_info = extract_size_info(item.product_name_raw)
    product = NormalizedProduct(
        id=uuid.uuid4(),
        canonical_name=item.product_name_raw,
        size=size_info[0] if size_info else None,
        size_unit=size_info[1] if size_info else None,
        upc_variants=[item.upc] if item.upc else [],
    )
    session.add(product)
    session.flush()
    return product


class ProductMatcher:
    """Batch product matcher for purchase ingestion.

    Usage:
        matcher = ProductMatcher(session)
        outcomes = matcher.match_items(items)
    """

    def __init__(
        self,
        session: Session,
        name_threshold: float = 0.4,
        auto_create: bool = True,
    ):
        self.session = session
        self.name_threshold = name_threshold
        self.auto_create = auto_create

    def match_single(
        self,
        item: PurchaseItemCreate,
    ) -> tuple[NormalizedProduct | None, MatchResult | None, MatchConfidence]:
        """Match a single purchase item to a normalized product.

        Returns (product, match_result, confidence_level).
        If auto_create is True and no match found, creates a new product.
        """
        result = normalize_product(
            self.session,
            item.product_name_raw,
            upc=item.upc,
            name_threshold=self.name_threshold,
        )

        if result:
            confidence = classify_confidence(result.confidence, result.method)
            return result.product, result, confidence

        if self.auto_create:
            product = _create_product_from_item(self.session, item)
            return product, None, MatchConfidence.LOW

        return None, None, MatchConfidence.LOW

    def match_items(self, items: list[PurchaseItemCreate]) -> list[MatchOutcome]:
        """Match a batch of purchase items. Returns outcomes in order."""
        outcomes: list[MatchOutcome] = []
        for idx, item in enumerate(items):
            product, result, confidence = self.match_single(item)
            created = result is None and product is not None
            outcomes.append(
                MatchOutcome(
                    item_index=idx,
                    match=result,
                    confidence_level=confidence,
                    created_new=created,
                )
            )
        return outcomes


def match_purchase_item(
    session: Session,
    item: PurchaseItemCreate,
    name_threshold: float = 0.4,
    auto_create: bool = True,
) -> tuple[NormalizedProduct | None, MatchConfidence]:
    """Convenience function: match a single item, return (product, confidence)."""
    matcher = ProductMatcher(session, name_threshold=name_threshold, auto_create=auto_create)
    product, _, confidence = matcher.match_single(item)
    return product, confidence
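# Illustrative sketch (not part of the module): the confidence thresholds from
# classify_confidence, reproduced with stand-in enums so the snippet runs
# without cartsnitch_common. Confidence and Method are hypothetical names for
# MatchConfidence and MatchMethod; the 0.8 / 0.5 cut-offs come from the code above.

```python
from enum import Enum


class Confidence(Enum):
    HIGH = "high"
    MEDIUM = "medium"
    LOW = "low"


class Method(Enum):
    UPC = "upc"
    NAME = "name"


def classify(score: float, method: Method) -> Confidence:
    """UPC matches are always HIGH; name matches are bucketed by score."""
    if method is Method.UPC:
        return Confidence.HIGH
    if score >= 0.8:
        return Confidence.HIGH
    if score >= 0.5:
        return Confidence.MEDIUM
    return Confidence.LOW


# Boundary values for name-based scores.
buckets = {score: classify(score, Method.NAME) for score in (0.45, 0.5, 0.79, 0.8)}
for score, level in buckets.items():
    print(score, level.name)
```

# With the matcher's default name_threshold of 0.4, a name match scoring in
# [0.4, 0.5) is still accepted but lands in the LOW bucket.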
@@ -0,0 +1,155 @@
"""Product normalization, Phase 1: UPC matching + fuzzy name matching.

Matches products across retailers by:
1. Exact UPC match (highest confidence)
2. Fuzzy name matching via token-based Jaccard similarity (lower confidence)
"""

import re
from dataclasses import dataclass
from enum import StrEnum

from cartsnitch_common.models.product import NormalizedProduct
from sqlalchemy import select
from sqlalchemy.orm import Session


class MatchMethod(StrEnum):
    """How a product match was determined."""

    UPC = "upc"
    NAME = "name"


@dataclass(frozen=True)
class MatchResult:
    """Result of a product normalization attempt."""

    product: NormalizedProduct
    confidence: float
    method: MatchMethod


# Noise words stripped during name cleaning
_NOISE_WORDS = frozenset(
    {
        "the",
        "a",
        "an",
        "and",
        "or",
        "of",
        "with",
        "in",
        "for",
        "to",
        "brand",
        "original",
        "classic",
        "new",
        "improved",
    }
)

# Regex for extracting size info (e.g., "16 oz", "1.5 lb", "12 ct")
_SIZE_PATTERN = re.compile(
    r"(\d+(?:\.\d+)?)\s*(oz|fl\s*oz|lb|lbs|g|kg|ml|l|ct|pk|count|pack)\b",
    re.IGNORECASE,
)


def clean_name(name: str) -> str:
    """Normalize a product name for comparison.

    - Lowercase
    - Remove size info (e.g., "16 oz")
    - Strip noise words
    - Collapse whitespace
    """
    cleaned = name.lower()
    cleaned = _SIZE_PATTERN.sub("", cleaned)
    cleaned = re.sub(r"[^\w\s]", " ", cleaned)
    tokens = cleaned.split()
    tokens = [t for t in tokens if t not in _NOISE_WORDS]
    return " ".join(tokens)


def extract_size_info(name: str) -> tuple[str, str] | None:
    """Extract (size, unit) from a product name, if present."""
    match = _SIZE_PATTERN.search(name)
    if match:
        return match.group(1), match.group(2).lower().replace(" ", "_")
    return None


def jaccard_similarity(a: str, b: str) -> float:
    """Token-based Jaccard similarity between two cleaned names."""
    tokens_a = set(a.split())
    tokens_b = set(b.split())
    if not tokens_a or not tokens_b:
        return 0.0
    intersection = tokens_a & tokens_b
    union = tokens_a | tokens_b
    return len(intersection) / len(union)


def match_by_upc(session: Session, upc: str) -> MatchResult | None:
    """Find a normalized product by exact UPC match.

    Loads products with upc_variants and checks membership in Python
    for cross-database compatibility (works on both PostgreSQL and SQLite).
    """
    # TODO: Use PostgreSQL JSON containment query (@>) for production.
    # Current approach loads all products into memory. That is acceptable for
    # tests and small datasets, but will not scale.
    stmt = select(NormalizedProduct).where(NormalizedProduct.upc_variants.is_not(None))
    products = session.execute(stmt).scalars().all()
    for product in products:
        if product.upc_variants and upc in product.upc_variants:
            return MatchResult(product=product, confidence=1.0, method=MatchMethod.UPC)
    return None


def match_by_name(
    session: Session,
    name: str,
    threshold: float = 0.5,
) -> MatchResult | None:
    """Find the best normalized product by fuzzy name matching.

    Loads all normalized products and computes Jaccard similarity.
    Returns the best match above the threshold, or None.
    """
    # TODO: Use pg_trgm similarity index for production.
    # Current approach loads all products into memory. That is acceptable for
    # tests and small datasets, but will not scale.
    cleaned = clean_name(name)
    stmt = select(NormalizedProduct)
    products = session.execute(stmt).scalars().all()

    best_match: NormalizedProduct | None = None
    best_score = 0.0

    for product in products:
        score = jaccard_similarity(cleaned, clean_name(product.canonical_name))
        if score > best_score and score >= threshold:
            best_score = score
            best_match = product

    if best_match:
        return MatchResult(product=best_match, confidence=best_score, method=MatchMethod.NAME)
    return None


def normalize_product(
    session: Session,
    name: str,
    upc: str | None = None,
    name_threshold: float = 0.5,
) -> MatchResult | None:
    """Full normalization pipeline: UPC first, then fuzzy name fallback."""
    if upc:
        result = match_by_upc(session, upc)
        if result:
            return result
    return match_by_name(session, name, threshold=name_threshold)
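# Worked example (not part of the module): name cleaning followed by token
# Jaccard similarity on two retailer spellings of the same product. The
# helpers are copied from the module above so the snippet runs without the
# database-backed imports; the two product names are made up for illustration.

```python
import re

_NOISE_WORDS = frozenset(
    {"the", "a", "an", "and", "or", "of", "with", "in", "for", "to",
     "brand", "original", "classic", "new", "improved"}
)
_SIZE_PATTERN = re.compile(
    r"(\d+(?:\.\d+)?)\s*(oz|fl\s*oz|lb|lbs|g|kg|ml|l|ct|pk|count|pack)\b",
    re.IGNORECASE,
)


def clean_name(name: str) -> str:
    """Lowercase, drop size info and punctuation, strip noise words."""
    cleaned = _SIZE_PATTERN.sub("", name.lower())
    cleaned = re.sub(r"[^\w\s]", " ", cleaned)
    return " ".join(t for t in cleaned.split() if t not in _NOISE_WORDS)


def jaccard_similarity(a: str, b: str) -> float:
    """|A ∩ B| / |A ∪ B| over whitespace-separated tokens."""
    tokens_a, tokens_b = set(a.split()), set(b.split())
    if not tokens_a or not tokens_b:
        return 0.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


name_a = "Kraft Original Mac & Cheese 7.25 oz"
name_b = "KRAFT MAC AND CHEESE DINNER 7.25OZ"

cleaned_a = clean_name(name_a)  # size, "&" and "original" are removed
cleaned_b = clean_name(name_b)  # size and "and" are removed
score = jaccard_similarity(cleaned_a, cleaned_b)

print(repr(cleaned_a), repr(cleaned_b), score)
```

# Here three tokens are shared out of four distinct tokens, so the score is
# 3 / 4 = 0.75: above the 0.5 default threshold of match_by_name, and in the
# MEDIUM bucket under the 0.8 / 0.5 cut-offs used by the matching module.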
@@ -0,0 +1,144 @@
"""Receipt normalization: parse raw Meijer scraper output into purchase records.

Maps raw receipt fields, cleans product names, extracts quantities/units.
"""

import re
from datetime import date
from decimal import Decimal, InvalidOperation

from cartsnitch_common.schemas.purchase import PurchaseCreate, PurchaseItemCreate


def _clean_product_name(raw: str) -> str:
    """Clean raw product name from scraper output."""
    cleaned = raw.strip()
    # Remove leading/trailing non-alphanumeric chars
    cleaned = re.sub(r"^\W+|\W+$", "", cleaned)
    # Collapse internal whitespace
    cleaned = re.sub(r"\s+", " ", cleaned)
    return cleaned


def _safe_decimal(
    value: str | float | int | Decimal | None,
    default: Decimal = Decimal("0"),
) -> Decimal:
    """Safely convert a value to Decimal."""
    if value is None:
        return default
    try:
        return Decimal(str(value))
    except (InvalidOperation, ValueError):
        return default


def parse_meijer_item(raw_item: dict) -> PurchaseItemCreate:
    """Parse a single Meijer scraper line item into a PurchaseItemCreate.

    Expected raw_item keys (from Meijer scraper):
    - description / name: product name
    - upc / upcCode: UPC barcode
    - quantity / qty: number of units
    - unitPrice / price: per-unit price
    - extendedPrice / totalPrice: line total
    - regularPrice: shelf price before discounts
    - salePrice: sale price if applicable
    - couponAmount / couponDiscount: coupon savings
    - loyaltyAmount / loyaltyDiscount: loyalty savings
    - category / department: raw category
    """
    name = raw_item.get("description") or raw_item.get("name") or ""
    cleaned_name = _clean_product_name(name)

    upc = raw_item.get("upc") or raw_item.get("upcCode")
    if upc:
        upc = str(upc).strip().lstrip("0") or str(upc).strip()

    qty = _safe_decimal(
        raw_item.get("quantity") or raw_item.get("qty"),
        default=Decimal("1"),
    )

    unit_price = _safe_decimal(raw_item.get("unitPrice") or raw_item.get("price"))
    extended = _safe_decimal(raw_item.get("extendedPrice") or raw_item.get("totalPrice"))
    if extended == Decimal("0") and unit_price > 0:
        extended = unit_price * qty

    regular = raw_item.get("regularPrice")
    sale = raw_item.get("salePrice")
    coupon = raw_item.get("couponAmount") or raw_item.get("couponDiscount")
    loyalty = raw_item.get("loyaltyAmount") or raw_item.get("loyaltyDiscount")
    category = raw_item.get("category") or raw_item.get("department")

    return PurchaseItemCreate(
        product_name_raw=cleaned_name,
        upc=upc,
        quantity=qty,
        unit_price=unit_price,
        extended_price=extended,
        regular_price=_safe_decimal(regular) if regular is not None else None,
        sale_price=_safe_decimal(sale) if sale is not None else None,
        coupon_discount=_safe_decimal(coupon) if coupon is not None else None,
        loyalty_discount=_safe_decimal(loyalty) if loyalty is not None else None,
        category_raw=str(category).strip() if category else None,
    )


def normalize_receipt(
    raw_receipt: dict,
    user_id: str,
    store_id: str,
) -> PurchaseCreate:
    """Parse a complete Meijer raw receipt into a PurchaseCreate.

    Expected raw_receipt keys:
    - receiptId / receipt_id / id: unique receipt identifier
    - date / purchaseDate / purchase_date: purchase date (YYYY-MM-DD or similar)
    - total / totalAmount: receipt total
    - subtotal: pre-tax subtotal
    - tax / taxAmount: tax amount
    - savings / totalSavings: total discount savings
    - items: list of raw line item dicts
    """
    import uuid

    receipt_id = str(
        raw_receipt.get("receiptId")
        or raw_receipt.get("receipt_id")
        or raw_receipt.get("id")
        or uuid.uuid4()
    )

    raw_date = (
        raw_receipt.get("date")
        or raw_receipt.get("purchaseDate")
        or raw_receipt.get("purchase_date")
    )
    if isinstance(raw_date, str):
        purchase_date = date.fromisoformat(raw_date[:10])
    elif isinstance(raw_date, date):
        purchase_date = raw_date
    else:
        purchase_date = date.today()

    total = _safe_decimal(raw_receipt.get("total") or raw_receipt.get("totalAmount"))
    subtotal = raw_receipt.get("subtotal")
    tax = raw_receipt.get("tax") or raw_receipt.get("taxAmount")
    savings = raw_receipt.get("savings") or raw_receipt.get("totalSavings")

    raw_items = raw_receipt.get("items") or []
    items = [parse_meijer_item(item) for item in raw_items]

    return PurchaseCreate(
        user_id=uuid.UUID(user_id) if isinstance(user_id, str) else user_id,
        store_id=uuid.UUID(store_id) if isinstance(store_id, str) else store_id,
        receipt_id=receipt_id,
        purchase_date=purchase_date,
        total=total,
        subtotal=_safe_decimal(subtotal) if subtotal is not None else None,
        tax=_safe_decimal(tax) if tax is not None else None,
        savings_total=_safe_decimal(savings) if savings is not None else None,
        raw_data=raw_receipt,
        items=items,
    )
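# Illustrative sketch (not part of the module): the purchase-date coercion
# used by normalize_receipt, isolated as a function. The name
# coerce_purchase_date is hypothetical; the branches are intended to mirror
# the code above. It shows which inputs parse and which raise.

```python
from datetime import date, datetime


def coerce_purchase_date(raw_date):
    """Strings are parsed from their first 10 characters; dates pass through."""
    if isinstance(raw_date, str):
        return date.fromisoformat(raw_date[:10])
    if isinstance(raw_date, date):
        return raw_date
    return date.today()


# An ISO timestamp is truncated to its YYYY-MM-DD prefix before parsing.
from_timestamp = coerce_purchase_date("2026-03-15T14:22:05Z")

# datetime is a subclass of date, so it is returned unchanged (time included).
from_datetime = coerce_purchase_date(datetime(2026, 3, 15, 9, 30))

# A non-ISO string such as MM/DD/YYYY is not handled and raises ValueError.
try:
    coerce_purchase_date("03/15/2026")
    us_format_failed = False
except ValueError:
    us_format_failed = True

print(from_timestamp, type(from_datetime).__name__, us_format_failed)
```

# A missing or non-string, non-date value silently becomes today's date, which
# callers may want to detect before persisting the record.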
@@ -0,0 +1 @@
"""Retailer scrapers."""
@@ -0,0 +1,72 @@
"""Abstract base scraper interface for all retailer scrapers."""

import asyncio
import random
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from datetime import datetime

from receiptwitness.config import settings


@dataclass
class SessionData:
    """Holds session cookies and metadata for a retailer login."""

    cookies: list[dict]
    user_agent: str
    created_at: datetime
    expires_at: datetime | None = None
    extra: dict = field(default_factory=dict)


@dataclass
class RawReceipt:
    """Raw receipt data before parsing."""

    receipt_id: str
    purchase_date: str
    store_number: str | None = None
    raw_data: dict = field(default_factory=dict)
    source_url: str | None = None


class BaseScraper(ABC):
    """All retailer scrapers implement this interface.

    Provides common functionality: human-like delays, rate limiting guards,
    and the abstract methods each retailer scraper must implement.
    """

    @abstractmethod
    async def login(self, username: str, password: str) -> SessionData:
        """Authenticate with the retailer portal and return session data."""
        ...

    @abstractmethod
    async def check_session(self, session: SessionData) -> bool:
        """Verify if an existing session is still valid."""
        ...

    @abstractmethod
    async def scrape_receipts(
        self, session: SessionData, since: datetime | None = None
    ) -> list[RawReceipt]:
        """Scrape receipt data from the retailer portal."""
        ...

    @abstractmethod
    def parse_receipt(self, raw: RawReceipt) -> dict:
        """Parse a raw receipt into structured data.

        Returns a dict with keys matching PurchaseCreate schema fields,
        including an 'items' list matching PurchaseItemCreate fields.
        """
        ...

    async def human_delay(self, min_ms: int | None = None, max_ms: int | None = None) -> None:
        """Sleep for a randomized human-like interval."""
        # Fall back to the configured bounds only when an argument is omitted,
        # so an explicit 0 is honored instead of being treated as "unset".
        lo = min_ms if min_ms is not None else settings.min_request_delay_ms
        hi = max_ms if max_ms is not None else settings.max_request_delay_ms
        delay = random.randint(lo, hi) / 1000.0
        await asyncio.sleep(delay)
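# Illustrative sketch (not part of the module): a randomized delay in the
# style of BaseScraper.human_delay, runnable without receiptwitness.config.
# DEFAULT_MIN_MS and DEFAULT_MAX_MS are hypothetical stand-ins for the
# settings values, and this variant falls back to them only when an argument
# is None.

```python
import asyncio
import random

# Hypothetical defaults; the real bounds come from the service settings.
DEFAULT_MIN_MS = 5
DEFAULT_MAX_MS = 15


def pick_delay_seconds(min_ms=None, max_ms=None):
    """Pick a uniformly random whole-millisecond delay and return it in seconds."""
    lo = DEFAULT_MIN_MS if min_ms is None else min_ms
    hi = DEFAULT_MAX_MS if max_ms is None else max_ms
    # random.randint is inclusive on both ends and raises ValueError if lo > hi.
    return random.randint(lo, hi) / 1000.0


async def human_delay(min_ms=None, max_ms=None):
    """Sleep for the chosen interval and return it for inspection."""
    delay = pick_delay_seconds(min_ms, max_ms)
    await asyncio.sleep(delay)
    return delay


waited = asyncio.run(human_delay(1, 3))
default_pick = pick_delay_seconds()
print(waited, default_pick)
```

# Returning the chosen delay is only for demonstration; the method above
# returns None.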
@@ -0,0 +1,344 @@
"""Kroger loyalty portal scraper using Playwright.

Kroger uses Akamai Bot Manager for aggressive headless browser detection.
This scraper uses enhanced stealth measures including playwright-stealth,
realistic fingerprinting, and human-like interaction pacing.
"""

import logging
from datetime import UTC, datetime, timedelta
from typing import cast

from playwright.async_api import BrowserContext, Page, Playwright, async_playwright

from receiptwitness.config import settings
from receiptwitness.scrapers.base import BaseScraper, RawReceipt, SessionData

logger = logging.getLogger(__name__)

# Kroger endpoints
KROGER_BASE = "https://www.kroger.com"
KROGER_LOGIN_PAGE = f"{KROGER_BASE}/signin"
KROGER_PURCHASE_HISTORY = f"{KROGER_BASE}/mypurchases"
KROGER_RECEIPT_API = f"{KROGER_BASE}/atlas/v1/purchase-history/api"
KROGER_RECEIPT_DETAIL_API = f"{KROGER_BASE}/atlas/v1/receipt/api"
KROGER_ACCOUNT_PAGE = f"{KROGER_BASE}/account/dashboard"

# Realistic browser fingerprint: Chrome on Windows (matches Kroger's typical audience)
DEFAULT_USER_AGENT = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
    "AppleWebKit/537.36 (KHTML, like Gecko) "
    "Chrome/131.0.0.0 Safari/537.36"
)
DEFAULT_VIEWPORT = {"width": 1920, "height": 1080}
DEFAULT_LOCALE = "en-US"
DEFAULT_TIMEZONE = "America/New_York"


class KrogerScraper(BaseScraper):
    """Scraper for Kroger loyalty purchase history.

    Kroger uses Akamai Bot Manager which aggressively detects headless
    browsers. This scraper employs enhanced stealth measures:
    - Masks webdriver/automation signals
    - Sets realistic browser fingerprint
    - Uses human-like interaction pacing
    - Preserves browser context across sessions
    """

    async def _create_stealth_context(
        self, playwright_instance: Playwright, cookies: list[dict] | None = None
    ) -> BrowserContext:
        """Create a browser context with enhanced stealth for Akamai evasion."""
        browser = await playwright_instance.chromium.launch(
            headless=settings.headless,
            args=[
                "--disable-blink-features=AutomationControlled",
                "--no-sandbox",
                "--disable-dev-shm-usage",
                "--disable-infobars",
                "--window-size=1920,1080",
            ],
        )
        context = await browser.new_context(
            user_agent=DEFAULT_USER_AGENT,
            viewport=DEFAULT_VIEWPORT,  # type: ignore[arg-type]
            locale=DEFAULT_LOCALE,
            timezone_id=DEFAULT_TIMEZONE,
            java_script_enabled=True,
            bypass_csp=False,
            color_scheme="light",
            has_touch=False,
        )

# Enhanced stealth script targeting Akamai Bot Manager detection vectors
|
||||||
|
await context.add_init_script(
|
||||||
|
"""
|
||||||
|
// Mask webdriver flag
|
||||||
|
Object.defineProperty(navigator, 'webdriver', {
|
||||||
|
get: () => undefined
|
||||||
|
});
|
||||||
|
|
||||||
|
// Chrome runtime object
|
||||||
|
window.chrome = {
|
||||||
|
runtime: {},
|
||||||
|
loadTimes: function() {},
|
||||||
|
csi: function() {},
|
||||||
|
app: { isInstalled: false }
|
||||||
|
};
|
||||||
|
|
||||||
|
// Realistic plugin array
|
||||||
|
Object.defineProperty(navigator, 'plugins', {
|
||||||
|
get: () => [1, 2, 3, 4, 5]
|
||||||
|
});
|
||||||
|
|
||||||
|
// Languages
|
||||||
|
Object.defineProperty(navigator, 'languages', {
|
||||||
|
get: () => ['en-US', 'en']
|
||||||
|
});
|
||||||
|
|
||||||
|
// Platform
|
||||||
|
Object.defineProperty(navigator, 'platform', {
|
||||||
|
get: () => 'Win32'
|
||||||
|
});
|
||||||
|
|
||||||
|
// Hardware concurrency
|
||||||
|
Object.defineProperty(navigator, 'hardwareConcurrency', {
|
||||||
|
get: () => 8
|
||||||
|
});
|
||||||
|
|
||||||
|
// Device memory
|
||||||
|
Object.defineProperty(navigator, 'deviceMemory', {
|
||||||
|
get: () => 8
|
||||||
|
});
|
||||||
|
|
||||||
|
// Permissions query override (Akamai checks this)
|
||||||
|
const originalQuery = window.navigator.permissions.query;
|
||||||
|
window.navigator.permissions.query = (parameters) =>
|
||||||
|
parameters.name === 'notifications'
|
||||||
|
? Promise.resolve({ state: Notification.permission })
|
||||||
|
: originalQuery(parameters);
|
||||||
|
|
||||||
|
// WebGL vendor/renderer (avoid "Google Inc." / "ANGLE" tells)
|
||||||
|
const getParameter = WebGLRenderingContext.prototype.getParameter;
|
||||||
|
WebGLRenderingContext.prototype.getParameter = function(parameter) {
|
||||||
|
if (parameter === 37445) return 'Intel Inc.';
|
||||||
|
if (parameter === 37446) return 'Intel Iris OpenGL Engine';
|
||||||
|
return getParameter.call(this, parameter);
|
||||||
|
};
|
||||||
|
"""
|
||||||
|
)
|
||||||
|
|
||||||
|
if cookies:
|
||||||
|
await context.add_cookies(cookies) # type: ignore[arg-type]
|
||||||
|
|
||||||
|
return cast(BrowserContext, context)
|
||||||
|
|
||||||
|
    async def login(self, username: str, password: str) -> SessionData:
        """Log in to Kroger and capture session cookies."""
        async with async_playwright() as p:
            context = await self._create_stealth_context(p)
            page = await context.new_page()
            try:
                return await self._perform_login(page, context, username, password)
            finally:
                if context.browser:
                    await context.browser.close()

    async def _perform_login(
        self, page: Page, context: BrowserContext, username: str, password: str
    ) -> SessionData:
        """Execute the Kroger login flow."""
        logger.info("Navigating to Kroger sign-in page")
        await page.goto(KROGER_LOGIN_PAGE, wait_until="networkidle")
        await self.human_delay(2000, 4000)

        # Kroger login form: email/username field
        email_input = page.locator(
            'input[id="SignIn-emailInput"], '
            'input[name="email"], '
            'input[type="email"], '
            'input[data-testid="SignIn-emailInput"]'
        )
        await email_input.wait_for(state="visible", timeout=settings.browser_timeout_ms)
        await email_input.click()
        await self.human_delay(300, 700)
        await email_input.fill(username)
        await self.human_delay(800, 1500)

        # Password field
        password_input = page.locator(
            'input[id="SignIn-passwordInput"], '
            'input[name="password"], '
            'input[type="password"], '
            'input[data-testid="SignIn-passwordInput"]'
        )
        await password_input.wait_for(state="visible", timeout=settings.browser_timeout_ms)
        await password_input.click()
        await self.human_delay(300, 700)
        await password_input.fill(password)
        await self.human_delay(1000, 2000)

        # Sign-in button
        sign_in_btn = page.locator(
            'button[id="SignIn-submitButton"], '
            'button[data-testid="SignIn-submitButton"], '
            'button[type="submit"]:has-text("Sign In")'
        )
        await sign_in_btn.click()

        # Wait for redirect away from sign-in page
        await page.wait_for_url(
            lambda url: "signin" not in url.lower(),
            timeout=settings.browser_timeout_ms,
        )
        await self.human_delay(1500, 3000)

        # Capture cookies
        raw_cookies = await context.cookies()
        cookies = [dict(c) for c in raw_cookies]
        now = datetime.now(UTC)

        logger.info("Kroger login successful, captured %d cookies", len(cookies))
        return SessionData(
            cookies=cookies,
            user_agent=DEFAULT_USER_AGENT,
            created_at=now,
            expires_at=now + timedelta(hours=2),
            extra={"retailer": "kroger"},
        )

    async def check_session(self, session: SessionData) -> bool:
        """Check if the Kroger session is still valid."""
        if session.expires_at and datetime.now(UTC) > session.expires_at:
            logger.info("Kroger session expired based on timestamp")
            return False

        async with async_playwright() as p:
            context = await self._create_stealth_context(p, cookies=session.cookies)
            page = await context.new_page()
            try:
                response = await page.goto(KROGER_ACCOUNT_PAGE, wait_until="networkidle")
                current_url = page.url.lower()
                is_valid = "signin" not in current_url and response is not None and response.ok
                logger.info("Kroger session check: valid=%s (url=%s)", is_valid, page.url)
                return is_valid
            except Exception:
                logger.exception("Kroger session check failed")
                return False
            finally:
                if context.browser:
                    await context.browser.close()

    async def scrape_receipts(
        self, session: SessionData, since: datetime | None = None
    ) -> list[RawReceipt]:
        """Scrape purchase history from Kroger."""
        async with async_playwright() as p:
            context = await self._create_stealth_context(p, cookies=session.cookies)
            page = await context.new_page()
            try:
                return await self._fetch_receipts(page, since)
            finally:
                if context.browser:
                    await context.browser.close()

    async def _fetch_receipts(self, page: Page, since: datetime | None) -> list[RawReceipt]:
        """Fetch receipt list and details from Kroger purchase history."""
        # Navigate to purchase history to establish context
        await page.goto(KROGER_PURCHASE_HISTORY, wait_until="networkidle")
        await self.human_delay(1500, 3000)

        receipts: list[RawReceipt] = []

        # Kroger purchase history API endpoint
        api_response = await page.request.get(KROGER_RECEIPT_API)
        if not api_response.ok:
            logger.warning(
                "Kroger purchase history request failed: %d %s",
                api_response.status,
                api_response.status_text,
            )
            return []

        response = await api_response.json()
        if not isinstance(response, dict):
            logger.warning("Unexpected purchase history response type: %s", type(response))
            return []

        # Handle Kroger's response structure
        orders = response.get("orders", response.get("purchases", []))
        if not isinstance(orders, list):
            logger.warning("No orders found in Kroger purchase history response")
            return []

        logger.info("Found %d orders in Kroger purchase history", len(orders))

        for order in orders:
            raw_id = order.get("orderId") or order.get("receiptId") or order.get("id") or ""
            order_id = str(raw_id)
            purchase_date = order.get(
                "purchaseDate", order.get("transactionDate", order.get("date", ""))
            )

            # Filter by date if 'since' is provided
            if since and purchase_date:
                try:
                    txn_dt = datetime.fromisoformat(purchase_date.replace("Z", "+00:00"))
                    if txn_dt < since:
                        continue
                except (ValueError, TypeError):
                    # Unparseable date, or a naive/aware mismatch with `since`:
                    # keep the receipt rather than drop it.
                    pass

            if not order_id:
                continue

            await self.human_delay(1000, 2500)

            # Fetch receipt detail
            detail = await self._fetch_receipt_detail(page, order_id)

            raw_store = (
                order.get("storeNumber")
                or order.get("divisionNumber")
                or order.get("storeId")
                or ""
            )
            store_number = str(raw_store)

            receipts.append(
                RawReceipt(
                    receipt_id=order_id,
                    purchase_date=purchase_date,
                    store_number=store_number,
                    raw_data={**order, "detail": detail},
                    source_url=f"{KROGER_RECEIPT_DETAIL_API}?orderId={order_id}",
                )
            )

        logger.info("Scraped %d receipts from Kroger", len(receipts))
        return receipts

    async def _fetch_receipt_detail(self, page: Page, order_id: str) -> dict:
        """Fetch detailed receipt data for a single Kroger order."""
        try:
            url = f"{KROGER_RECEIPT_DETAIL_API}?orderId={order_id}"
            api_response = await page.request.get(url)
            if not api_response.ok:
                logger.warning(
                    "Kroger receipt detail request failed for %s: %d",
                    order_id,
                    api_response.status,
                )
                return {}
            detail = await api_response.json()
            return detail if isinstance(detail, dict) else {}
        except Exception:
            logger.exception("Failed to fetch Kroger receipt detail for %s", order_id)
            return {}

    def parse_receipt(self, raw: RawReceipt) -> dict:
        """Parse raw Kroger receipt into structured purchase data."""
        from receiptwitness.parsers.kroger import parse_kroger_receipt

        return parse_kroger_receipt(raw)
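A note on the `since` filter in `_fetch_receipts` above: comparing a timezone-naive datetime with an aware one raises `TypeError`, which the `except (ValueError, TypeError)` clause swallows, so such receipts are kept rather than filtered. The sketch below shows one possible way to normalize both sides before comparing. The helper names `parse_purchase_date` and `is_before` are hypothetical (not part of receiptwitness), and treating offset-less timestamps as UTC is an assumption; the timestamp format Kroger actually returns is not confirmed here.

```python
from datetime import datetime, timezone
from typing import Optional


def parse_purchase_date(value: object) -> Optional[datetime]:
    """Parse an ISO 8601 timestamp into an aware UTC datetime, or None."""
    if not isinstance(value, str) or not value:
        return None
    try:
        parsed = datetime.fromisoformat(value.replace("Z", "+00:00"))
    except ValueError:
        return None
    if parsed.tzinfo is None:
        # Assumption: timestamps without an offset are treated as UTC.
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed.astimezone(timezone.utc)


def is_before(purchase_date: object, since: datetime) -> bool:
    """Return True only when the date parses and is strictly older than `since`."""
    parsed = parse_purchase_date(purchase_date)
    if parsed is None:
        # Keep receipts whose date cannot be parsed (same outcome as the scraper).
        return False
    if since.tzinfo is None:
        # Assumption: a naive `since` is also interpreted as UTC.
        since = since.replace(tzinfo=timezone.utc)
    return parsed < since
```

With a helper like this, the loop could skip an order when `since and is_before(purchase_date, since)`, and non-string dates would no longer reach `.replace()`.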
@@ -0,0 +1,301 @@
"""Meijer mPerks scraper using Playwright.

Meijer has no public API. We reverse-engineer the XHR endpoints the mPerks
web app uses to pull purchase history and receipt data. The flow:

1. Launch stealth Playwright browser
2. Navigate to mPerks login page and authenticate
3. Capture session cookies after successful login
4. Use those cookies to hit the mPerks receipt API endpoints directly
5. Parse receipt JSON into structured PurchaseCreate records

Key endpoints (reverse-engineered from mPerks SPA):
- Login: POST https://www.meijer.com/bin/meijer/account/login
- Receipts: GET https://www.meijer.com/bin/meijer/profile/purchasehistory
- Receipt detail: GET https://www.meijer.com/bin/meijer/profile/receipt?receiptId=...
"""

import logging
from datetime import UTC, datetime, timedelta
from typing import cast

from playwright.async_api import BrowserContext, Page, Playwright, async_playwright

from receiptwitness.config import settings
from receiptwitness.scrapers.base import BaseScraper, RawReceipt, SessionData

logger = logging.getLogger(__name__)

# Meijer mPerks URLs
MEIJER_BASE = "https://www.meijer.com"
MEIJER_LOGIN_PAGE = f"{MEIJER_BASE}/shopping/login.html"
MEIJER_LOGIN_API = f"{MEIJER_BASE}/bin/meijer/account/login"
MEIJER_PURCHASE_HISTORY = f"{MEIJER_BASE}/bin/meijer/profile/purchasehistory"
MEIJER_RECEIPT_DETAIL = f"{MEIJER_BASE}/bin/meijer/profile/receipt"
MEIJER_MPERKS_HOME = f"{MEIJER_BASE}/mperks.html"

# Realistic browser fingerprint
DEFAULT_USER_AGENT = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
    "AppleWebKit/537.36 (KHTML, like Gecko) "
    "Chrome/131.0.0.0 Safari/537.36"
)
DEFAULT_VIEWPORT = {"width": 1920, "height": 1080}
DEFAULT_LOCALE = "en-US"
DEFAULT_TIMEZONE = "America/Detroit"  # Meijer HQ is in Grand Rapids, MI


class MeijerScraper(BaseScraper):
    """Scraper for Meijer mPerks purchase history."""

    async def _create_stealth_context(
        self, playwright_instance: Playwright, cookies: list[dict] | None = None
    ) -> BrowserContext:
        """Create a browser context with stealth settings."""
        browser = await playwright_instance.chromium.launch(
            headless=settings.headless,
            args=[
                "--disable-blink-features=AutomationControlled",
                "--no-sandbox",
            ],
        )
        context = await browser.new_context(
            user_agent=DEFAULT_USER_AGENT,
            viewport=DEFAULT_VIEWPORT,  # type: ignore[arg-type]
            locale=DEFAULT_LOCALE,
            timezone_id=DEFAULT_TIMEZONE,
            java_script_enabled=True,
            bypass_csp=False,
        )
        # Mask webdriver flag
        await context.add_init_script(
            """
            Object.defineProperty(navigator, 'webdriver', {
                get: () => undefined
            });
            // Mask chrome automation indicators
            window.chrome = { runtime: {} };
            Object.defineProperty(navigator, 'plugins', {
                get: () => [1, 2, 3, 4, 5]
            });
            Object.defineProperty(navigator, 'languages', {
                get: () => ['en-US', 'en']
            });
            """
        )
        if cookies:
            await context.add_cookies(cookies)  # type: ignore[arg-type]
        return cast(BrowserContext, context)

    async def login(self, username: str, password: str) -> SessionData:
        """Log in to Meijer mPerks and capture session cookies.

        The mPerks login flow:
        1. Navigate to login page
        2. Fill email and password fields
        3. Click sign-in button
        4. Wait for redirect to mPerks dashboard
        5. Extract session cookies
        """
        async with async_playwright() as p:
            context = await self._create_stealth_context(p)
            page = await context.new_page()
            try:
                return await self._perform_login(page, context, username, password)
            finally:
                if context.browser:
                    await context.browser.close()

    async def _perform_login(
        self, page: Page, context: BrowserContext, username: str, password: str
    ) -> SessionData:
        """Execute the login flow on the mPerks portal."""
        logger.info("Navigating to Meijer login page")
        await page.goto(MEIJER_LOGIN_PAGE, wait_until="networkidle")
        await self.human_delay(1500, 3000)

        # Fill email field
        email_input = page.locator('input[type="email"], input[name="email"], #email')
        await email_input.wait_for(state="visible", timeout=settings.browser_timeout_ms)
        await email_input.click()
        await self.human_delay(200, 500)
        await email_input.fill(username)
        await self.human_delay(500, 1000)

        # Fill password field
        password_input = page.locator('input[type="password"], input[name="password"], #password')
        await password_input.wait_for(state="visible", timeout=settings.browser_timeout_ms)
        await password_input.click()
        await self.human_delay(200, 500)
        await password_input.fill(password)
        await self.human_delay(500, 1500)

        # Click sign-in button
        sign_in_btn = page.locator(
            'button[type="submit"], button:has-text("Sign In"), button:has-text("Log In")'
        )
        await sign_in_btn.click()

        # Wait for navigation after login
        await page.wait_for_url(
            lambda url: "login" not in url.lower(),
            timeout=settings.browser_timeout_ms,
        )
        await self.human_delay(1000, 2000)

        # Capture cookies
        raw_cookies = await context.cookies()
        cookies = [dict(c) for c in raw_cookies]
        now = datetime.now(UTC)

        logger.info("Meijer login successful, captured %d cookies", len(cookies))
        return SessionData(
            cookies=cookies,
            user_agent=DEFAULT_USER_AGENT,
            created_at=now,
            expires_at=now + timedelta(hours=4),
        )

    async def check_session(self, session: SessionData) -> bool:
        """Check if the mPerks session is still valid.

        Returns False immediately if the stored expiry has passed. Otherwise
        loads the mPerks home page in a fresh browser context and checks
        whether we get redirected to login (session expired) or not.
        """
        if session.expires_at and datetime.now(UTC) > session.expires_at:
            logger.info("Meijer session expired based on timestamp")
            return False

        async with async_playwright() as p:
            context = await self._create_stealth_context(p, cookies=session.cookies)
            page = await context.new_page()
            try:
                response = await page.goto(MEIJER_MPERKS_HOME, wait_until="networkidle")
                current_url = page.url.lower()
                is_valid = "login" not in current_url and response is not None and response.ok
                logger.info("Meijer session check: valid=%s (url=%s)", is_valid, page.url)
                return is_valid
            except Exception:
                logger.exception("Meijer session check failed")
                return False
            finally:
                if context.browser:
                    await context.browser.close()

    async def scrape_receipts(
        self, session: SessionData, since: datetime | None = None
    ) -> list[RawReceipt]:
        """Scrape purchase history from Meijer mPerks.

        Uses the XHR endpoints the mPerks SPA calls to fetch receipt data.
        The purchase history endpoint returns a list of recent transactions,
        and we can fetch individual receipt details for line items.
        """
        async with async_playwright() as p:
            context = await self._create_stealth_context(p, cookies=session.cookies)
            page = await context.new_page()
            try:
                return await self._fetch_receipts(page, since)
            finally:
                if context.browser:
                    await context.browser.close()

    async def _fetch_receipts(self, page: Page, since: datetime | None) -> list[RawReceipt]:
        """Fetch receipt list and detail via mPerks XHR endpoints.

        Uses Playwright's page.request API (APIRequestContext) instead of
        page.evaluate(fetch(...)) for better observability: requests show up
        in Playwright traces and can be intercepted by route handlers.
        """
        # Navigate to mPerks to establish context (cookies need domain context)
        await page.goto(MEIJER_MPERKS_HOME, wait_until="networkidle")
        await self.human_delay(1000, 2000)

        receipts: list[RawReceipt] = []

        # Fetch purchase history listing via page.request (APIRequestContext)
        api_response = await page.request.get(MEIJER_PURCHASE_HISTORY)
        if not api_response.ok:
            logger.warning(
                "Purchase history request failed: %d %s",
                api_response.status,
                api_response.status_text,
            )
            return []

        response = await api_response.json()

        if not isinstance(response, dict):
            logger.warning("Unexpected purchase history response type: %s", type(response))
            return []

        transactions = response.get("transactions", response.get("purchaseHistory", []))
        if not isinstance(transactions, list):
            logger.warning("No transactions found in purchase history response")
            return []

        logger.info("Found %d transactions in Meijer purchase history", len(transactions))

        for txn in transactions:
            # `or` chaining also falls through on null values, so a JSON null
            # never becomes the truthy string "None".
            receipt_id = str(txn.get("transactionId") or txn.get("receiptId") or "")
            purchase_date = txn.get("transactionDate", txn.get("purchaseDate", ""))

            # Filter by date if 'since' is provided
            if since and purchase_date:
                try:
                    txn_dt = datetime.fromisoformat(purchase_date.replace("Z", "+00:00"))
                    if txn_dt < since:
                        continue
                except (ValueError, TypeError):
                    # Unparseable date, or a naive/aware mismatch with `since`:
                    # keep the receipt rather than drop it.
                    pass

            if not receipt_id:
                continue

            await self.human_delay(800, 2000)

            # Fetch receipt detail
            detail = await self._fetch_receipt_detail(page, receipt_id)

            receipts.append(
                RawReceipt(
                    receipt_id=receipt_id,
                    purchase_date=purchase_date,
                    store_number=str(txn.get("storeNumber") or txn.get("storeId") or ""),
                    raw_data={**txn, "detail": detail},
                    source_url=f"{MEIJER_RECEIPT_DETAIL}?receiptId={receipt_id}",
                )
            )

        logger.info("Scraped %d receipts from Meijer", len(receipts))
        return receipts

    async def _fetch_receipt_detail(self, page: Page, receipt_id: str) -> dict:
        """Fetch detailed receipt data for a single transaction.

        Uses Playwright's page.request API for traceability.
        """
        try:
            url = f"{MEIJER_RECEIPT_DETAIL}?receiptId={receipt_id}"
            api_response = await page.request.get(url)
            if not api_response.ok:
                logger.warning(
                    "Receipt detail request failed for %s: %d",
                    receipt_id,
                    api_response.status,
                )
                return {}
            detail = await api_response.json()
            return detail if isinstance(detail, dict) else {}
        except Exception:
            logger.exception("Failed to fetch receipt detail for %s", receipt_id)
            return {}

    def parse_receipt(self, raw: RawReceipt) -> dict:
        """Parse raw Meijer receipt into structured purchase data.

        Delegates to the dedicated parser module.
        """
        from receiptwitness.parsers.meijer import parse_meijer_receipt

        return parse_meijer_receipt(raw)
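The scrapers read identifiers from payload keys whose names vary (`orderId`, `receiptId`, `transactionId`, `storeNumber`, `storeId`). One thing to watch with nested `dict.get` defaults is that the fallback only applies when a key is absent; a key that is present with a JSON null yields `None`, and `str(None)` is the non-empty string `"None"`. The following is a minimal sketch of a lookup that skips null and empty values. The name `first_present` is hypothetical and not part of receiptwitness.

```python
from typing import Any, Mapping


def first_present(payload: Mapping[str, Any], *keys: str) -> str:
    """Return the first non-null, non-empty value among `keys` as a string.

    Returns "" when none of the keys carries a usable value. Unlike an
    `a or b` chain, a legitimate falsy value such as 0 is preserved.
    """
    for key in keys:
        value = payload.get(key)
        if value is None or value == "":
            continue
        return str(value)
    return ""
```

For example, `first_present(txn, "transactionId", "receiptId")` would return `""` for a transaction whose IDs are all null, so an `if not receipt_id: continue` guard still skips it.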
@@ -0,0 +1,326 @@
"""Target Circle scraper using Playwright.

Target stores ~1 year of in-store purchase history tied to Circle accounts.
Purchases appear when the user pays with a linked card, uses the Target app
wallet, or enters their Circle phone number at checkout.

Key endpoints (reverse-engineered from target.com SPA):
- Login: POST https://gsp.target.com/gsp/authentications/v1/auth_codes
- Order history: GET https://api.target.com/order_history/v1/orders (in-store tab)
- Receipt detail: GET https://api.target.com/order_history/v1/orders/{orderId}
"""

import logging
from datetime import UTC, datetime, timedelta
from typing import cast

from playwright.async_api import BrowserContext, Page, Playwright, async_playwright

from receiptwitness.config import settings
from receiptwitness.scrapers.base import BaseScraper, RawReceipt, SessionData

logger = logging.getLogger(__name__)

# Target endpoints
TARGET_BASE = "https://www.target.com"
TARGET_LOGIN_PAGE = f"{TARGET_BASE}/login"
TARGET_ACCOUNT_PAGE = f"{TARGET_BASE}/account"
TARGET_ORDER_HISTORY = f"{TARGET_BASE}/account/orders"
TARGET_ORDER_API = "https://api.target.com/order_history/v1/orders"
TARGET_RECEIPT_API = "https://api.target.com/order_history/v1/orders"

# Realistic browser fingerprint: Chrome on Windows
DEFAULT_USER_AGENT = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
    "AppleWebKit/537.36 (KHTML, like Gecko) "
    "Chrome/131.0.0.0 Safari/537.36"
)
DEFAULT_VIEWPORT = {"width": 1920, "height": 1080}
DEFAULT_LOCALE = "en-US"
DEFAULT_TIMEZONE = "America/Detroit"  # SE Michigan coverage


class TargetScraper(BaseScraper):
    """Scraper for Target Circle in-store purchase history.

    Target's order history SPA loads purchase data from internal API
    endpoints. This scraper authenticates via the web login flow,
    captures session cookies, and uses those to hit the order history
    API for in-store receipt data.
    """

    async def _create_stealth_context(
        self, playwright_instance: Playwright, cookies: list[dict] | None = None
    ) -> BrowserContext:
        """Create a browser context with stealth settings for Target."""
        browser = await playwright_instance.chromium.launch(
            headless=settings.headless,
            args=[
                "--disable-blink-features=AutomationControlled",
                "--no-sandbox",
                "--disable-dev-shm-usage",
            ],
        )
        context = await browser.new_context(
            user_agent=DEFAULT_USER_AGENT,
            viewport=DEFAULT_VIEWPORT,  # type: ignore[arg-type]
            locale=DEFAULT_LOCALE,
            timezone_id=DEFAULT_TIMEZONE,
            java_script_enabled=True,
            bypass_csp=False,
            color_scheme="light",
            has_touch=False,
        )
        # Mask webdriver and automation signals
        await context.add_init_script(
            """
            Object.defineProperty(navigator, 'webdriver', {
                get: () => undefined
            });

            window.chrome = {
                runtime: {},
                loadTimes: function() {},
                csi: function() {},
                app: { isInstalled: false }
            };

            Object.defineProperty(navigator, 'plugins', {
                get: () => [1, 2, 3, 4, 5]
            });

            Object.defineProperty(navigator, 'languages', {
                get: () => ['en-US', 'en']
            });

            Object.defineProperty(navigator, 'platform', {
                get: () => 'Win32'
            });

            Object.defineProperty(navigator, 'hardwareConcurrency', {
                get: () => 8
            });

            Object.defineProperty(navigator, 'deviceMemory', {
                get: () => 8
            });
            """
        )
        if cookies:
            await context.add_cookies(cookies)  # type: ignore[arg-type]
        return cast(BrowserContext, context)

    async def login(self, username: str, password: str) -> SessionData:
        """Log in to Target and capture session cookies."""
        async with async_playwright() as p:
            context = await self._create_stealth_context(p)
            page = await context.new_page()
            try:
                return await self._perform_login(page, context, username, password)
            finally:
                if context.browser:
                    await context.browser.close()

    async def _perform_login(
        self, page: Page, context: BrowserContext, username: str, password: str
    ) -> SessionData:
        """Execute the Target login flow."""
        logger.info("Navigating to Target sign-in page")
        await page.goto(TARGET_LOGIN_PAGE, wait_until="networkidle")
        await self.human_delay(2000, 4000)

        # Target login form: email/username field
        email_input = page.locator(
            'input[id="username"], '
            'input[name="username"], '
            'input[type="email"], '
            'input[data-test="username"]'
        )
        await email_input.wait_for(state="visible", timeout=settings.browser_timeout_ms)
        await email_input.click()
        await self.human_delay(300, 700)
        await email_input.fill(username)
        await self.human_delay(800, 1500)

        # Password field
        password_input = page.locator(
            'input[id="password"], '
            'input[name="password"], '
            'input[type="password"], '
            'input[data-test="password"]'
        )
        await password_input.wait_for(state="visible", timeout=settings.browser_timeout_ms)
        await password_input.click()
        await self.human_delay(300, 700)
        await password_input.fill(password)
        await self.human_delay(1000, 2000)

        # Sign-in button
        sign_in_btn = page.locator(
            'button[id="login"], '
            'button[data-test="login-button"], '
            'button[type="submit"]:has-text("Sign in")'
        )
        await sign_in_btn.click()

        # Wait for redirect away from login page
        await page.wait_for_url(
            lambda url: "login" not in url.lower(),
            timeout=settings.browser_timeout_ms,
        )
        await self.human_delay(1500, 3000)

        # Capture cookies
        raw_cookies = await context.cookies()
        cookies = [dict(c) for c in raw_cookies]
        now = datetime.now(UTC)

        logger.info("Target login successful, captured %d cookies", len(cookies))
        return SessionData(
            cookies=cookies,
            user_agent=DEFAULT_USER_AGENT,
            created_at=now,
            expires_at=now + timedelta(hours=2),
            extra={"retailer": "target"},
        )

    async def check_session(self, session: SessionData) -> bool:
        """Check if the Target session is still valid."""
        if session.expires_at and datetime.now(UTC) > session.expires_at:
            logger.info("Target session expired based on timestamp")
            return False

        async with async_playwright() as p:
            context = await self._create_stealth_context(p, cookies=session.cookies)
            page = await context.new_page()
            try:
                response = await page.goto(TARGET_ACCOUNT_PAGE, wait_until="networkidle")
                current_url = page.url.lower()
                is_valid = "login" not in current_url and response is not None and response.ok
                logger.info("Target session check: valid=%s (url=%s)", is_valid, page.url)
                return is_valid
            except Exception:
                logger.exception("Target session check failed")
                return False
            finally:
                if context.browser:
                    await context.browser.close()

    async def scrape_receipts(
        self, session: SessionData, since: datetime | None = None
    ) -> list[RawReceipt]:
        """Scrape in-store purchase history from Target Circle."""
        async with async_playwright() as p:
            context = await self._create_stealth_context(p, cookies=session.cookies)
            page = await context.new_page()
            try:
                return await self._fetch_receipts(page, since)
            finally:
                if context.browser:
                    await context.browser.close()

    async def _fetch_receipts(self, page: Page, since: datetime | None) -> list[RawReceipt]:
        """Fetch receipt list and details from Target order history.

        Target's order history page has separate tabs for online and in-store
        purchases. We target the in-store tab which shows Circle-linked
        transactions.
        """
        # Navigate to order history to establish context
        await page.goto(TARGET_ORDER_HISTORY, wait_until="networkidle")
        await self.human_delay(1500, 3000)

        receipts: list[RawReceipt] = []

        # Target order history API — filter for in-store purchases
        api_response = await page.request.get(
            TARGET_ORDER_API,
            params={"channel": "in_store", "limit": "50"},
        )
        if not api_response.ok:
            logger.warning(
                "Target order history request failed: %d %s",
                api_response.status,
                api_response.status_text,
            )
            return []

        response = await api_response.json()
        if not isinstance(response, dict):
            logger.warning("Unexpected order history response type: %s", type(response))
            return []

        # Target uses "orders" key for in-store purchase list
        orders = response.get("orders", response.get("transactions", []))
        if not isinstance(orders, list):
            logger.warning("No orders found in Target order history response")
            return []

        logger.info("Found %d in-store orders in Target history", len(orders))

        for order in orders:
            raw_id = order.get("orderId") or order.get("transactionId") or order.get("id") or ""
            order_id = str(raw_id)
            purchase_date = order.get(
                "purchaseDate",
                order.get("transactionDate", order.get("date", "")),
            )

            # Filter by date if 'since' is provided
            if since and purchase_date:
                try:
                    txn_dt = datetime.fromisoformat(purchase_date.replace("Z", "+00:00"))
                    if txn_dt < since:
                        continue
                except (ValueError, TypeError):
                    pass

            if not order_id:
                continue

            await self.human_delay(1000, 2500)

            # Fetch receipt detail
            detail = await self._fetch_receipt_detail(page, order_id)

            raw_store = (
                order.get("storeNumber") or order.get("storeId") or order.get("locationId") or ""
            )
            store_number = str(raw_store)

            receipts.append(
                RawReceipt(
                    receipt_id=order_id,
                    purchase_date=purchase_date,
                    store_number=store_number,
                    raw_data={**order, "detail": detail},
                    source_url=f"{TARGET_RECEIPT_API}/{order_id}",
                )
            )

        logger.info("Scraped %d receipts from Target", len(receipts))
        return receipts

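The `since` filter in `_fetch_receipts` can be exercised in isolation. The sketch below is illustrative only: `is_older_than` is a hypothetical helper, not part of the repository, and it assumes purchase dates arrive as ISO 8601 strings. It shows that an order is skipped only when its date parses and falls before `since`; unparseable dates and naive-versus-aware comparisons fall through, so the order is kept.

```python
from datetime import datetime, timezone


def is_older_than(purchase_date: str, since: datetime) -> bool:
    """Hypothetical helper: True only when the date parses and is before `since`."""
    try:
        # Older Python versions reject a trailing "Z", hence the replace()
        txn_dt = datetime.fromisoformat(purchase_date.replace("Z", "+00:00"))
        return txn_dt < since
    except (ValueError, TypeError):
        # ValueError: unparseable string; TypeError: naive vs. aware comparison.
        # In both cases the order is kept rather than silently dropped.
        return False


since = datetime(2024, 3, 1, tzinfo=timezone.utc)
print(is_older_than("2024-02-15T10:30:00Z", since))  # True  (skipped)
print(is_older_than("2024-03-05T08:00:00Z", since))  # False (kept)
print(is_older_than("not-a-date", since))            # False (kept)
print(is_older_than("2024-02-15T10:30:00", since))   # False (naive date, kept)
```

Keeping orders whose dates cannot be compared errs on the side of ingesting a duplicate rather than losing a receipt.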
    async def _fetch_receipt_detail(self, page: Page, order_id: str) -> dict:
        """Fetch detailed receipt data for a single Target order."""
        try:
            url = f"{TARGET_RECEIPT_API}/{order_id}"
            api_response = await page.request.get(url)
            if not api_response.ok:
                logger.warning(
                    "Target receipt detail request failed for %s: %d",
                    order_id,
                    api_response.status,
                )
                return {}
            detail = await api_response.json()
            return detail if isinstance(detail, dict) else {}
        except Exception:
            logger.exception("Failed to fetch Target receipt detail for %s", order_id)
            return {}

    def parse_receipt(self, raw: RawReceipt) -> dict:
        """Parse raw Target receipt into structured purchase data."""
        from receiptwitness.parsers.target import parse_target_receipt

        return parse_target_receipt(raw)
@@ -0,0 +1 @@
"""Session management — encrypted cookie storage and refresh logic."""
@@ -0,0 +1,52 @@
"""Fernet-based encryption for session cookies at rest.

Session data (cookies, tokens) is encrypted before writing to the database
and decrypted only when needed for a scrape. The encryption key is provided
via the RW_SESSION_ENCRYPTION_KEY environment variable — it is never stored
in the database or logged.
"""

import json
import logging

from cryptography.fernet import Fernet, InvalidToken

from receiptwitness.config import settings

logger = logging.getLogger(__name__)


def _get_fernet() -> Fernet:
    """Get a Fernet instance using the configured encryption key."""
    key = settings.session_encryption_key
    if not key:
        raise ValueError(
            "RW_SESSION_ENCRYPTION_KEY is not set. "
            "Generate one with: "
            "python -c 'from cryptography.fernet import Fernet; "
            "print(Fernet.generate_key().decode())'"
        )
    return Fernet(key.encode() if isinstance(key, str) else key)


def encrypt_session_data(data: dict) -> str:
    """Encrypt session data dict to a Fernet token string.

    The data is JSON-serialized, then encrypted. The result is a
    URL-safe base64-encoded string suitable for storing in JSONB.
    """
    f = _get_fernet()
    plaintext = json.dumps(data, default=str).encode("utf-8")
    return f.encrypt(plaintext).decode("utf-8")


def decrypt_session_data(encrypted: str) -> dict:
    """Decrypt a Fernet token string back to a session data dict."""
    f = _get_fernet()
    try:
        plaintext = f.decrypt(encrypted.encode("utf-8"))
        result: dict = json.loads(plaintext)
        return result
    except InvalidToken:
        logger.error("Failed to decrypt session data — invalid token or wrong key")
        raise
@@ -0,0 +1,81 @@
"""Session storage, retrieval, and refresh logic.

Manages the lifecycle of retailer session data:
- Load encrypted session from DB
- Check validity via scraper
- Re-authenticate if expired
- Save new session back (encrypted)
"""

import logging
from dataclasses import asdict
from datetime import UTC, datetime

from receiptwitness.scrapers.base import BaseScraper, SessionData
from receiptwitness.session.encryption import decrypt_session_data, encrypt_session_data

logger = logging.getLogger(__name__)


def session_from_db_record(session_data_encrypted: str | None) -> SessionData | None:
    """Deserialize and decrypt a session from the database.

    The session_data column in user_store_accounts stores the Fernet-encrypted
    JSON of the SessionData fields.
    """
    if not session_data_encrypted:
        return None

    try:
        data = decrypt_session_data(session_data_encrypted)
        return SessionData(
            cookies=data["cookies"],
            user_agent=data["user_agent"],
            created_at=datetime.fromisoformat(data["created_at"]),
            expires_at=(
                datetime.fromisoformat(data["expires_at"]) if data.get("expires_at") else None
            ),
            extra=data.get("extra", {}),
        )
    except Exception:
        logger.exception("Failed to load session from DB record")
        return None


def session_to_db_value(session: SessionData) -> str:
    """Serialize and encrypt a session for database storage."""
    data = asdict(session)
    # Convert datetime objects to ISO strings for JSON serialization
    data["created_at"] = session.created_at.isoformat()
    if session.expires_at:
        data["expires_at"] = session.expires_at.isoformat()
    return encrypt_session_data(data)


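The serialization pair above relies on `isoformat()` and `fromisoformat()` being inverses. A minimal sketch of that round trip, with the encryption step left out: `SessionSketch`, `to_json`, and `from_json` are hypothetical stand-ins whose field names are taken from the `SessionData(...)` calls in this diff, not from the real class definition.

```python
import json
from dataclasses import asdict, dataclass, field
from datetime import datetime, timedelta, timezone
from typing import Optional


@dataclass
class SessionSketch:
    """Hypothetical stand-in for SessionData (assumed field layout)."""

    cookies: list
    user_agent: str
    created_at: datetime
    expires_at: Optional[datetime] = None
    extra: dict = field(default_factory=dict)


def to_json(session: SessionSketch) -> str:
    data = asdict(session)
    # datetime is not JSON-serializable, so store ISO 8601 strings instead
    data["created_at"] = session.created_at.isoformat()
    if session.expires_at:
        data["expires_at"] = session.expires_at.isoformat()
    return json.dumps(data)


def from_json(payload: str) -> SessionSketch:
    data = json.loads(payload)
    return SessionSketch(
        cookies=data["cookies"],
        user_agent=data["user_agent"],
        created_at=datetime.fromisoformat(data["created_at"]),
        expires_at=(
            datetime.fromisoformat(data["expires_at"]) if data.get("expires_at") else None
        ),
        extra=data.get("extra", {}),
    )


created = datetime(2024, 3, 1, 12, 0, tzinfo=timezone.utc)
original = SessionSketch(
    cookies=[{"name": "sid", "value": "abc"}],
    user_agent="example-agent",
    created_at=created,
    expires_at=created + timedelta(hours=2),
    extra={"retailer": "target"},
)
restored = from_json(to_json(original))
print(restored == original)  # True
```

Because timezone-aware datetimes keep their offset through `isoformat()`, the restored `expires_at` can still be compared against `datetime.now(UTC)`.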
async def get_valid_session(
    scraper: BaseScraper,
    session_data_encrypted: str | None,
    username: str,
    password: str,
) -> tuple[SessionData, bool]:
    """Get a valid session, re-authenticating if needed.

    Returns:
        A tuple of (session, was_refreshed). If was_refreshed is True,
        the caller should persist the new session to the database.
    """
    # Try existing session first
    existing = session_from_db_record(session_data_encrypted)
    if existing:
        if existing.expires_at and datetime.now(UTC) > existing.expires_at:
            logger.info("Session expired by timestamp, re-authenticating")
        elif await scraper.check_session(existing):
            logger.info("Existing session is valid")
            return existing, False
        else:
            logger.info("Session check failed, re-authenticating")

    # Need to re-authenticate
    logger.info("Performing fresh login")
    new_session = await scraper.login(username, password)
    return new_session, True
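The decision order in `get_valid_session` (timestamp check, then live check, then fresh login) can be traced without a browser. The following is a sketch under stated assumptions: `FakeScraper` and `resolve_session` are hypothetical, sessions are plain dicts rather than `SessionData`, and decryption is omitted.

```python
import asyncio
from datetime import datetime, timedelta, timezone


class FakeScraper:
    """Hypothetical stub exposing only the two methods the flow relies on."""

    def __init__(self, session_ok: bool) -> None:
        self.session_ok = session_ok

    async def check_session(self, session: dict) -> bool:
        return self.session_ok

    async def login(self, username: str, password: str) -> dict:
        now = datetime.now(timezone.utc)
        return {"cookies": [], "expires_at": now + timedelta(hours=2)}


async def resolve_session(scraper, existing, username, password):
    """Same branch order as get_valid_session, on plain dicts."""
    if existing:
        expires_at = existing.get("expires_at")
        if expires_at and datetime.now(timezone.utc) > expires_at:
            pass  # expired by timestamp: skip the (expensive) live check
        elif await scraper.check_session(existing):
            return existing, False
    # No session, expired session, or failed live check: log in again
    return await scraper.login(username, password), True


async def demo() -> list:
    now = datetime.now(timezone.utc)
    fresh = {"cookies": [], "expires_at": now + timedelta(hours=1)}
    stale = {"cookies": [], "expires_at": now - timedelta(minutes=1)}
    cases = [(fresh, True), (fresh, False), (stale, True), (None, True)]
    flags = []
    for existing, session_ok in cases:
        _, refreshed = await resolve_session(FakeScraper(session_ok), existing, "user", "pw")
        flags.append(refreshed)
    return flags


refresh_flags = asyncio.run(demo())
print(refresh_flags)  # [False, True, True, True]
```

Only the first case reuses the stored session; in every other case the second tuple element is `True`, which is the caller's signal to persist the new session.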