
Building a Job Board Scraper That Doesn't Suck

I got tired of manually refreshing career pages, so I built a configurable tool to track job postings and notify me when something changes.

TL;DR

Pauper is a configurable job listing tracker that scrapes career pages, detects changes, and sends notifications when new positions appear. Point it at an API or HTML page, define what to extract, and let it run on a schedule. The code is open source at github.com/devstark03/pauper.

The Problem

So... what's up with the goofy name?

Imagine for me, if you will, a sad little puppy with his puppy dog friends. We'll call him Reginald.

Well, Reginald grew tired of shotgunning dozens of jobs on Puppindeed. Not enough time for naps!

So Reginald decided to focus on applying to fewer jobs, better, and to check for openings at specific companies directly on their career pages. But Reggie here encountered yet another critical problem: there was still no time for naps!

So Reggie drew up a plan, and did exactly what little puppies naturally do in their spare time: automation software engineering. Here was his plan:

  • I am going to make an automated .NET application to read from the job boards of specific companies
  • It's going to read the listings available and email me details if anything has changed
  • It's going to run once a day at 09:00
  • And it's going to give me free treats
  • Or, uh... scratch that last one. Outside of scope and all

Reginald was a genius! It was perfect. Or... perhaps there were some caveats. Let's find out how Pauper works.

The Schema Approach

Here's the thing about job APIs: they're all different. And I mean all different.

Some return JSON like this:

{
    "data": {
        "jobs": [
            { "id": "12345", "title": "Software Engineer", "location": "Remote" }
        ]
    }
}
JSON

Others nest it three levels deep. Some call it title, some call it position, some call it job_name because why not. Pagination might use page, offset, or some cursor-based token thing that makes me want to lie down.

I really didn't want to write a custom scraper for every single site. That's just trading one tedious task for another tedious task. So instead, I built a schema system:

{
    "_metadata": {
        "url": "https://api.example.com/jobs?page={page}",
        "requestDelayMs": 500,
        "pagination": {
            "variable": "page",
            "start": 1,
            "max": 10
        },
        "extract": {
            "array": "data.jobs",
            "fields": {
                "id": "id",
                "title": "title",
                "company": "company.name",
                "location": "location",
                "url": "apply_url"
            }
        }
    }
}
JSON

The schema tells Pauper everything it needs to know:

  • Where to fetch data (with pagination baked in)
  • How fast to make requests (don't wanna get rate limited)
  • Where to find the job array in the response
  • How to map their weird field names to my standardized structure

One JSON file per site. No code changes. Reggie approves.
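
To make those dot paths concrete, here's roughly how a mapping like company.name gets resolved against a JSON response with System.Text.Json. This is a sketch of the idea, not Pauper's actual extractor code, and responseBody stands in for whatever the HTTP request returned:

using System.Text.Json;

// Walk a dot path like "data.jobs" or "company.name" through a JsonElement.
// Returns null if any segment is missing.
static JsonElement? ResolvePath(JsonElement root, string path) {
    var current = root;
    foreach (var segment in path.Split('.')) {
        if (current.ValueKind != JsonValueKind.Object ||
            !current.TryGetProperty(segment, out var next))
            return null;
        current = next;
    }
    return current;
}

var doc = JsonDocument.Parse(responseBody);   // responseBody: the raw API response
var jobs = ResolvePath(doc.RootElement, "data.jobs");

if (jobs?.ValueKind == JsonValueKind.Array) {
    foreach (var job in jobs.Value.EnumerateArray()) {
        var title   = ResolvePath(job, "title")?.GetString();
        var company = ResolvePath(job, "company.name")?.GetString();
        // ...map the remaining fields and build a standardized listing
    }
}
C#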

Generating Schemas (Because I'm Lazy)

Writing schemas by hand is boring. You have to inspect the API response, figure out the structure, map all the fields... ugh.

So I added a generator:

pauper --url https://api.example.com/jobs --generate-schema --output jobs.json
Bash

It fetches the URL, looks at what comes back, and asks you some questions:

=== Schema Metadata Configuration ===

URL [https://api.example.com/jobs]: 

Does this API use pagination? (y/n) [n]: y
Pagination variable name [page]: page
Starting page number [1]: 1
Maximum pages (leave empty for unlimited): 

--- Field Mappings ---
ID field path (unique identifier): id
Title field path: title
URL field path (optional): apply_url
Company field path (optional): company.name

Answer a few prompts, get a working schema. Point it at a new site and you're scraping in under a minute. This is the kind of laziness that actually pays off.
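
For the curious: one way a generator can pre-fill a sensible default for the array path is to walk the sample response and grab the first array of objects it finds. I'm not promising this is exactly what Pauper does internally, but the idea looks something like this:

using System.Text.Json;

// Find a likely listing array: the first JSON array whose elements are objects.
static string? GuessArrayPath(JsonElement element, string path = "") {
    if (element.ValueKind == JsonValueKind.Array &&
        element.GetArrayLength() > 0 &&
        element[0].ValueKind == JsonValueKind.Object)
        return path;

    if (element.ValueKind == JsonValueKind.Object) {
        foreach (var prop in element.EnumerateObject()) {
            var child = path.Length == 0 ? prop.Name : $"{path}.{prop.Name}";
            var found = GuessArrayPath(prop.Value, child);
            if (found != null) return found;
        }
    }

    return null;
}

var doc = JsonDocument.Parse(sampleResponse);   // sampleResponse: one fetched page
Console.WriteLine(GuessArrayPath(doc.RootElement) ?? "(no array found)");
// For the response shown earlier, this prints "data.jobs".
C#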

Handling Pagination

Most job APIs paginate their results. You can't just hit one endpoint and get everything. You have to loop through pages until you run out.

My first implementation was naive:

for (int page = 1; page <= maxPages; page++) {
    // swap the page number into the schema's {page} placeholder
    var response = await httpClient.GetStringAsync(url.Replace("{page}", page.ToString()));
    // process...
}
C#

This works, but it doesn't know when to stop. What if there are only 3 pages but maxPages is 10? You're making 7 unnecessary requests.

Better approach—stop when results dry up:

var allListings = new List<Listing>();
var consecutiveEmptyPages = 0;
int page = 1;   // or wherever the schema's pagination.start points

for (int i = 0; i < maxPages; i++, page++) {
    var listings = await FetchPage(page);

    if (listings.Count == 0) {
        consecutiveEmptyPages++;
        if (consecutiveEmptyPages >= 1) {   // bail as soon as a page comes back empty
            Console.WriteLine("Reached end of results.");
            break;
        }
    }
    else {
        consecutiveEmptyPages = 0;
        allListings.AddRange(listings);
    }
}
C#

Now it stops gracefully when there's nothing left instead of hammering empty pages.
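
FetchPage is doing the quiet work in that loop: swap the page number into the schema's {page} placeholder, respect the configured requestDelayMs, fetch, extract. Something along these lines, with schema and ExtractListings standing in for whatever Pauper actually calls them:

// Rough shape of a per-page fetch. Property and helper names are placeholders.
async Task<List<Listing>> FetchPage(int page) {
    // "https://api.example.com/jobs?page={page}" -> "https://api.example.com/jobs?page=3"
    var url = schema.Url.Replace("{page}", page.ToString());

    // Be polite: honor the schema's requestDelayMs between requests.
    await Task.Delay(schema.RequestDelayMs);

    var body = await httpClient.GetStringAsync(url);
    return ExtractListings(body);   // map fields per the schema's extract block
}
C#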

Retry Logic (Because the Internet is Flaky)

APIs fail. Networks hiccup. Rate limits happen. The first version of Pauper would just crash on any HTTP error, which is... not ideal for something that's supposed to run while Reginald is catching some Z's.

Enter exponential backoff:

public static async Task<string> GetWithRetryAsync(HttpClient client, string url, RetryConfig config) {
    int attempt = 0;
    int delayMs = config.InitialDelayMs;

    while (true) {
        try {
            attempt++;
            return await client.GetStringAsync(url);
        }
        catch (HttpRequestException ex) when (attempt < config.MaxRetries) {
            var code = (int?)ex.StatusCode;
            
            // Don't retry client errors (4xx) except rate limits (429)
            if (code >= 400 && code < 500 && code != 429)
                throw;

            Console.WriteLine($"Request failed (attempt {attempt}/{config.MaxRetries}): {ex.Message}");
            Console.WriteLine($"Retrying in {delayMs}ms...");

            await Task.Delay(delayMs);
            delayMs = (int)(delayMs * config.BackoffMultiplier);
        }
    }
}
C#

Transient failures get retried with increasing delays. Permanent failures (404, 403) fail fast. Rate limits (429) get the patience they deserve. All configurable per-schema if you need it.
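
For reference, the config that helper leans on is just a few knobs. The property names come from the snippet above; the defaults here are mine, not necessarily Pauper's:

public class RetryConfig {
    public int MaxRetries { get; set; } = 3;             // give up after this many attempts
    public int InitialDelayMs { get; set; } = 1000;       // wait before the first retry
    public double BackoffMultiplier { get; set; } = 2.0;  // 1s, 2s, 4s, ...
}
C#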

Change Detection

Scraping is only half the battle. The real value is knowing when something changes.

Pauper keeps a JSON log of every run:

{
    "entries": [
        {
            "timestamp": "2025-01-15T09:00:00Z",
            "url": "https://api.example.com/jobs",
            "listings": [...],
            "newListings": [
                { "id": "789", "title": "Senior Developer", "company": "Example Corp" }
            ],
            "removedListings": [],
            "totalCount": 12
        }
    ]
}
JSON

Each run compares against the previous entry. New IDs = new postings. Missing IDs = removed postings. The console output makes it obvious:

[2025-01-15 09:00:00 UTC]
Total listings: 12

NEW LISTINGS (1):
  + Senior Developer @ Example Corp
      Location: Remote
      URL: https://example.com/apply/789

No removed listings.

Log saved to: pauper_api.example.com.json

No more "wait, was that posting there yesterday?" Just cold, hard diffs.

Notifications

Detecting changes is useless if I'm not staring at the terminal. Which I'm not. Because it runs at 9 AM. And I'm not a morning person.

Webhook support to the rescue:

pauper --schema jobs.json --notify-on-changes --webhook https://my-server.com/api/notifications
Bash

The webhook receives a JSON payload with the new and removed listings. I wired mine to send an email, but you could trigger a Slack message, a Discord ping, a carrier pigeon... whatever works!
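
On the receiving end, anything that accepts a POST will do. Here's a toy receiver using ASP.NET Core minimal APIs, just to show the shape of it; swap the Console.WriteLine for whatever notification channel you like:

var app = WebApplication.CreateBuilder(args).Build();

// Pauper POSTs the change payload here; forward it wherever you want.
app.MapPost("/api/notifications/pauper", async (HttpRequest request) => {
    using var reader = new StreamReader(request.Body);
    var payload = await reader.ReadToEndAsync();
    Console.WriteLine(payload);   // or send an email, ping Slack, dispatch a pigeon
    return Results.Ok();
});

app.Run("http://localhost:4444");
C#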

Info

I wanted something I controlled. No rate limits, no monthly fees, no third-party service deciding to deprecate the one feature I actually use. A cron job and a webhook are forever.

Making It Extensible

The first version only handled JSON APIs. But some career pages are just HTML. No API, no structured data, just a bunch of <div class="job-listing"> elements sitting there.

I added an HTML scraper. Then I realized I was gonna keep bolting on format-specific code forever. So I did the Reggie thing and extracted an interface to save myself time later:

public interface IExtractor {
    ExtractorType Type { get; }
    List<Listing> Extract(string content, ExtractionConfig config);
}
C#

Now there's JsonExtractor, HtmlExtractor, and a stub for XmlExtractor that I'll probably never need but hey, it's there. Adding a new format is just implementing one interface and registering it:

ExtractorFactory.Register(ExtractorType.Yaml, new YamlExtractor());
C#

The rest of the pipeline doesn't care what format the source is. As it should be.
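
From the pipeline's point of view, picking an extractor is just a lookup. I'm guessing at the method and property names here, but the shape is roughly:

// Assumed factory lookup; the point is the pipeline never branches on format.
var extractor = ExtractorFactory.Get(schema.ExtractorType);   // Json, Html, Xml, ...
var listings  = extractor.Extract(responseBody, schema.Extraction);
C#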

What It Actually Looks Like

My real setup is pretty simple. I have a schema file for each company I'm tracking:

schemas/
├── doofenshmirtz-evil-inc.json
├── github.json
└── example-corp.json

A cron job runs every morning:

0 9 * * * /usr/local/bin/pauper --schema /home/reginald/schemas/doofenshmirtz-evil-inc.json --notify-on-changes --webhook http://localhost:4444/api/notifications/pauper
Bash

If a new developer position shows up on Dr. Doofenshmirtz's careers page, I get an email before I've even changed out of my pajamas. Reginald finally gets his naps.

Key Takeaways

Design for configuration, not code changes. Every new site I want to track is a JSON file, not a code commit.

Handle failure gracefully. Retry transient errors, fail fast on permanent ones, log everything.

Diff against history. Knowing what's new is way more valuable than knowing what exists.

Build for your actual use case. I didn't need a distributed scraping platform. I needed a script that runs once a day and sends me an email. Simple wins.


The code is available at github.com/devstark03/pauper. MIT licensed. Use it, fork it, tell me what breaks.

And give Reggie a treat for me.