Building with the Knowledge of the Crowd: How to Leverage Wikipedia's Pageviews API and HN Data for Smarter Content Discovery

We’ve all been there. It’s 11:30 PM, you have a production deployment running smoothly in the background, and you open Hacker News "just for five minutes." Two hours later, you’re deep in a rabbit hole reading about the history of the Lisp machine, Byzantine fault tolerance, or how the Roman aqueducts relate to modern microservices architectures.

As developers, we have a unique, insatiable appetite for niche, highly technical, and historically rich information. Wikipedia is often the destination for these deep dives. But how do we separate the signal from the noise? How do we discover the specific Wikipedia pages that actually capture the collective curiosity of the software engineering community?

A new open-source project, "Discover Wikipedia articles popular on Hacker News," recently climbed to the top of Show HN. It solves this exact problem by intersecting the Hacker News Firebase API with Wikipedia’s public pageview statistics. Beyond being an incredibly addictive tool for self-education, the architectural pattern behind it is a masterclass in lightweight, high-performance data pipeline engineering.

In this post, we’re going to dissect how to build a high-performance content discovery pipeline. We will explore how to query the Hacker News API, parse massive text datasets for Wikipedia references, enrich that data using the Wikimedia Pageviews API, and build an efficient caching layer so your application doesn't get rate-limited into oblivion. Let's dive in!

The Architecture of a Tech-Curiosity Pipeline

To build a system that identifies which Wikipedia articles are trending in our community, we can't just scrape the HN homepage every five minutes. We need a reliable, structured pipeline. The architecture relies on three primary phases:

  1. Ingestion: Consuming the Hacker News API to monitor new stories, comments, and Show HN posts.
  2. Extraction & Filtering: Parsing the URLs and text bodies using regex to extract valid Wikipedia article identifiers.
  3. Aggregation & Enrichment: Querying the Wikimedia Pageviews API to correlate HN mentions with global traffic spikes, then caching the results.

Here is a conceptual flow of how data moves through this lightweight pipeline:


[ HN Firebase API ] ---> [ Ingestion Engine (Go/Node) ]
                                |
                                v
                   [ Regex Extraction Layer ] (Extracts wiki slugs)
                                |
                                v
                   [ Wikimedia Pageviews API ] (Fetches traffic metrics)
                                |
                                v
               [ Redis Cache / SQLite DB ] ---> [ Frontend UI ]

Let's look at how we can implement each of these components using modern, clean backend JavaScript (Node.js/TypeScript), keeping performance and rate limits in mind.

Step 1: Mining the Hacker News API

Hacker News provides a fantastic, free, near-real-time API hosted on Firebase. While it doesn't support complex querying out of the box, it is perfect for streaming item IDs. We can fetch the top stories or watch for updates in real time.

First, let's write a utility to fetch the latest top stories and filter them for external links pointing to Wikipedia.


import axios from 'axios';

const HN_BASE_URL = 'https://hacker-news.firebaseio.com/v0';

// Fetch the IDs of the current top stories (the endpoint returns up to 500)
async function getTopStoryIds() {
  const response = await axios.get(`${HN_BASE_URL}/topstories.json`);
  return response.data; // Returns an array of integers
}

// Fetch details for a single item
async function getItemDetails(id) {
  const response = await axios.get(`${HN_BASE_URL}/item/${id}.json`);
  return response.data;
}
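
The two helpers above fetch one thing at a time, so the filtering step still has to fan out over as many as 500 IDs. Here is a minimal sketch of one way to do that without opening 500 connections at once. The names mapWithConcurrency and isWikipediaStory are my own illustrations, not part of the original project, and the limit of 10 is an arbitrary starting point.

```javascript
// Hypothetical helper: run `worker` over `items` with at most `limit`
// calls in flight at once. Results keep the input order.
async function mapWithConcurrency(items, limit, worker) {
  const results = new Array(items.length);
  let next = 0;

  async function runner() {
    while (next < items.length) {
      const index = next++; // no await between the check and the increment
      results[index] = await worker(items[index], index);
    }
  }

  const runnerCount = Math.min(limit, items.length);
  await Promise.all(Array.from({ length: runnerCount }, () => runner()));
  return results;
}

// Hypothetical predicate: true when a story's `url` field points at Wikipedia.
function isWikipediaStory(item) {
  if (!item || typeof item.url !== 'string') return false;
  try {
    return new URL(item.url).hostname.endsWith('.wikipedia.org');
  } catch {
    return false; // not a parseable URL
  }
}

// Usage sketch, wiring in the helpers defined above:
// const ids = await getTopStoryIds();
// const items = await mapWithConcurrency(ids, 10, getItemDetails);
// const wikiStories = items.filter(isWikipediaStory);
```

If one request fails, Promise.all rejects and the whole batch fails with it. For a long-running job you would probably want the worker to catch its own errors and return null instead.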

While stories are the most obvious source of links, some of the best Wikipedia rabbit holes are found deep within HN comment threads. To capture these, we need to parse the text field of comments as well as the url field of stories.
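
One catch with comments: the HN API documents the text field as HTML, and in the responses I have seen, characters such as the forward slash arrive entity-encoded (https:&#x2F;&#x2F;...), so a URL pattern will not match the raw string. Below is a small sketch that decodes the common entities before any URL matching happens. The function names decodeHnEntities and linkSourcesFromItem are hypothetical, and the entity table is deliberately tiny rather than a full HTML decoder.

```javascript
// Guard against out-of-range numeric entities, which would make
// String.fromCodePoint throw a RangeError.
function codePointToString(n) {
  return n <= 0x10ffff ? String.fromCodePoint(n) : '';
}

// Decode numeric entities plus the few named ones that matter for URLs.
function decodeHnEntities(html) {
  return html
    .replace(/&#x([0-9a-f]+);/gi, (_, hex) => codePointToString(parseInt(hex, 16)))
    .replace(/&#(\d+);/g, (_, dec) => codePointToString(parseInt(dec, 10)))
    .replace(/&quot;/g, '"')
    .replace(/&lt;/g, '<')
    .replace(/&gt;/g, '>')
    .replace(/&amp;/g, '&'); // last, so "&amp;lt;" is not decoded twice
}

// Hypothetical helper: collect the strings worth scanning from one HN item.
// Stories carry a plain `url`; comments and text posts carry HTML in `text`.
function linkSourcesFromItem(item) {
  const sources = [];
  if (item && typeof item.url === 'string') sources.push(item.url);
  if (item && typeof item.text === 'string') sources.push(decodeHnEntities(item.text));
  return sources;
}
```

After decoding, a linked URL usually appears twice in a comment, once in the href attribute and once as the visible link text, so expect duplicates per item and de-duplicate before counting.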

Step 2: Parsing Wikipedia Slugs with Precision

URLs on Wikipedia follow a highly predictable structure, but we must account for different language subdomains (e.g., en.wikipedia.org, de.wikipedia.org) and mobile formats (en.m.wikipedia.org). Our goal is to extract the unique article slug (the title key used by the Wikimedia API).

Here is a robust regular expression and helper function to extract both the language code and the article slug from any raw text or URL:


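function extractWikipediaSlug(text) {
  if (!text) return null;

  // Match Wikipedia article URLs and capture the language subdomain + slug.
  // Handles http/https, desktop and mobile hosts, longer language codes
  // (e.g. "simple"), and titles with parentheses or commas.
  const wikiRegex = /https?:\/\/([a-z][a-z-]{1,11})\.(?:m\.)?wikipedia\.org\/wiki\/([^\s#?<>"]+)/gi;

  const matches = [];
  let match;

  while ((match = wikiRegex.exec(text)) !== null) {
    // Heuristic: drop sentence punctuation that trails the URL in prose
    const rawSlug = match[2].replace(/[.,;:]+$/, '');
    if (!rawSlug) continue;

    let slug;
    try {
      slug = decodeURIComponent(rawSlug);
    } catch {
      slug = rawSlug; // Malformed percent-encoding: keep the raw form
    }

    matches.push({ lang: match[1].toLowerCase(), slug });
  }

  return matches.length > 0 ? matches : null;
}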

// Quick test
const sampleText = "Check out https://en.wikipedia.org/wiki/Byzantine_fault and https://de.m.wikipedia.org/wiki/Vorgespannter_Beton";
console.log(extractWikipediaSlug(sampleText));
/* Output:
[
  { lang: 'en', slug: 'Byzantine_fault' },
  { lang: 'de', slug: 'Vorgespannter_Beton' }
]
*/
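
Extraction gives us raw mentions, but the same article can show up under slightly different spellings and several times within one thread. Here is a rough sketch of the aggregation step, assuming mention records shaped like { itemId, lang, slug } (my own shape, not the original project's). The normalisation rule is a simplification: it treats spaces and underscores as equivalent and upper-cases the first character, which matches how most Wikipedia editions treat titles but is not a full title canonicaliser.

```javascript
// Hypothetical normaliser: fold common spelling variants of the same title.
function normalizeSlug(slug) {
  const underscored = slug.trim().replace(/ /g, '_');
  return underscored.charAt(0).toUpperCase() + underscored.slice(1);
}

// Count how many distinct HN items mention each article.
// `mentions` is an array of { itemId, lang, slug } records (an assumed shape).
function countMentions(mentions) {
  const itemsByArticle = new Map();

  for (const { itemId, lang, slug } of mentions) {
    const key = `${lang}:${normalizeSlug(slug)}`;
    if (!itemsByArticle.has(key)) itemsByArticle.set(key, new Set());
    itemsByArticle.get(key).add(itemId); // a Set ignores repeat mentions in one item
  }

  return [...itemsByArticle]
    .map(([article, items]) => ({ article, mentions: items.size }))
    .sort((a, b) => b.mentions - a.mentions); // most-mentioned first
}
```

Counting distinct items rather than raw matches keeps one enthusiastic commenter from inflating an article's score.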

Step 3: Enriching with the Wikimedia Pageviews API

Once we have the article slug (for example, Byzantine_fault), we want to check whether the article is actually trending or was just mentioned in passing. This is where the Wikimedia Pageviews API comes in. It is one of the most underrated public, developer-friendly APIs on the internet. It requires no API key, just a well-behaved User-Agent header.

Let's write a function to fetch the daily pageviews for a specific article over the last 30 days. This allows us to calculate if there was a "traffic spike" coinciding with the Hacker News post.


async function getArticlePageviews(lang, slug) {
  // Format dates to YYYYMMDD
  const today = new Date();
  const thirtyDaysAgo = new Date(Date.now() - 30 * 24 * 60 * 60 * 1000);
  
  const formatDate = (date) => date.toISOString().split('T')[0].replace(/-/g, '');
  
  const start = formatDate(thirtyDaysAgo);
  const end = formatDate(today);
  
  // The slug was URI-decoded during extraction, so re-encode it for the path
  const url = `https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article/${lang}.wikipedia/all-access/all-agents/${encodeURIComponent(slug)}/daily/${start}/${end}`;
  
  try {
    const response = await axios.get(url, {
      headers: {
        'User-Agent': 'CodingWithAlexWikiBot/1.0 (https://sysseder.com; alex@sysseder.com) Axios/1.0'
      }
    });
    
    return response.data.items.map(day => ({
      date: day.timestamp.substring(0, 8),
      views: day.views
    }));
  } catch (error) {
    console.error(`Failed to fetch pageviews for ${slug}:`, error.message);
    return [];
  }
}

Pro-Tip: Always include a descriptive User-Agent with contact details (a URL or an email address) when querying Wikimedia's APIs. Under the Wikimedia User-Agent policy, requests with a missing or generic User-Agent may be throttled or blocked entirely.
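
Even with a good User-Agent, a burst of requests can still earn you a 429. The sketch below shows one common way to back off and retry. The helper name withBackoff and its defaults are my own assumptions, and it reads error.response.status, which is the shape axios uses for HTTP errors. Treat it as a starting point, not a tuned retry policy.

```javascript
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Hypothetical helper: retry `task` when it fails with HTTP 429 or a 5xx,
// doubling the wait each time. Any other failure is rethrown immediately.
async function withBackoff(task, { retries = 3, baseDelayMs = 500 } = {}) {
  for (let attempt = 0; ; attempt++) {
    try {
      return await task();
    } catch (error) {
      const status = error && error.response ? error.response.status : undefined;
      const retryable = status === 429 || (status >= 500 && status < 600);
      if (!retryable || attempt >= retries) throw error;
      await sleep(baseDelayMs * 2 ** attempt); // 500ms, 1s, 2s with the defaults
    }
  }
}

// Usage sketch: wrap the raw request, not a function that swallows errors.
// const response = await withBackoff(() => axios.get(url, { headers }));
```

Note that getArticlePageviews above catches errors and returns an empty array, so the retry has to wrap the axios.get call inside it to have any effect.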

Designing a High-Performance Caching Layer

If you run this pipeline against all 500 top stories and their comments in real-time on every page load, you will rapidly exceed rate limits and destroy your application’s performance. To make this production-ready, we need a caching layer.

Redis and a simple SQLite database both work well here. For a lightweight, self-contained system, SQLite with write-ahead logging (WAL) enabled lets readers keep reading while a write is in progress, with no extra infrastructure to run.
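
Whichever store you pick, the read path looks much the same: serve a fresh cached value if there is one, otherwise call the upstream API and remember the answer. Here is a minimal sketch of that read-through pattern, with a plain Map standing in for Redis or SQLite. The function name createReadThroughCache, the one-hour default TTL, and the injectable clock are all assumptions of mine for illustration.

```javascript
// Hypothetical read-through cache. `store` only needs get/set, so a Map is
// enough for a sketch; both calls are awaited so an async store also works.
function createReadThroughCache(loader, options = {}) {
  const { ttlMs = 60 * 60 * 1000, store = new Map(), now = Date.now } = options;

  return async function get(key) {
    const hit = await store.get(key);
    if (hit && now() - hit.storedAt < ttlMs) {
      return hit.value; // fresh: no upstream call
    }
    const value = await loader(key); // miss or stale: go to the upstream API
    await store.set(key, { value, storedAt: now() });
    return value;
  };
}

// Usage sketch, keyed by "lang:slug":
// const cachedPageviews = createReadThroughCache((key) => {
//   const [lang, ...rest] = key.split(':');
//   return getArticlePageviews(lang, rest.join(':'));
// });
```

Because pageview data is published daily, a TTL measured in hours is a reasonable first guess; anything much shorter mostly re-fetches identical numbers.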

Here’s how you can structure your local schema to cache these metrics:


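CREATE TABLE IF NOT EXISTS wikipedia_mentions (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    hn_item_id INTEGER NOT NULL,
    article_slug TEXT NOT NULL,
    language_code TEXT NOT NULL,
    hn_score INTEGER,
    created_at DATETIME DEFAULT CURRENT_TIMESTAMP,
    -- One HN item can link several articles, so uniqueness is per (item, article)
    UNIQUE (hn_item_id, language_code, article_slug)
);

CREATE TABLE IF NOT EXISTS pageview_cache (
    language_code TEXT NOT NULL,
    article_slug TEXT NOT NULL,
    total_views_30d INTEGER,
    spike_percentage REAL,
    last_updated DATETIME DEFAULT CURRENT_TIMESTAMP,
    -- The same slug can exist on several language editions
    PRIMARY KEY (language_code, article_slug)
);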

By keeping the HN item score and comparing the 30-day average pageviews against the views on the day of the HN post, you can calculate a "Spike Index." If an obscure article like "Byzantine_fault" usually gets 200 views a day, but jumps to 15,000 views on the day it was posted on HN, your algorithm flags it as highly trending.
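
To make that concrete, here is a small sketch of the calculation over the array that getArticlePageviews returns. The function name computeSpike is hypothetical, and leaving the post day out of the baseline is my own choice: including it would let the spike inflate the very average it is being compared against.

```javascript
// `days` is [{ date: 'YYYYMMDD', views }], as returned by getArticlePageviews.
// `postDate` is the HN post date in the same YYYYMMDD form.
function computeSpike(days, postDate) {
  const postDay = days.find((d) => d.date === postDate);
  const baselineDays = days.filter((d) => d.date !== postDate);
  if (!postDay || baselineDays.length === 0) return null; // not enough data

  const totalViews = baselineDays.reduce((sum, d) => sum + d.views, 0);
  const baseline = totalViews / baselineDays.length;
  if (baseline === 0) return null; // avoid dividing by zero for unseen articles

  return {
    baseline,
    postDayViews: postDay.views,
    spikePercentage: ((postDay.views - baseline) / baseline) * 100,
  };
}
```

With the numbers from the example above (a steady 200 views a day, then 15,000 on the day of the post), the baseline is 200 and spikePercentage comes out at 7400. Where to set the "trending" threshold is a judgment call that depends on how noisy your articles are.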

Why Tools Like This Make Us Better Developers

Building pipelines like this isn't just a great weekend project; it changes how we consume information. Algorithms on traditional social media platforms are optimized for outrage and quick engagement. In contrast, tools that track what developers are reading on Wikipedia optimize for curiosity, history, and deep technical knowledge.

When you look at the data generated by this tool, you find gems that you’d never encounter in your daily work: obscure networking protocols from the 1970s that inspired modern distributed databases, biographies of pioneer computer scientists, or deep dives into physical infrastructure security. It encourages "slow learning" in an industry obsessed with fast-paced framework updates.

Conclusion: Build Your Own Discovery Engine

The "Discover Wikipedia articles popular on HN" project reminds us that the best developer tools don't have to be incredibly complex. By combining simple REST APIs, robust regex parsing, and clean caching strategies, you can build a highly engaging platform that cuts through the noise of the modern web.

What are you waiting for? Try building your own version of this pipeline, customize it to track specific subreddits, tech blogs, or GitHub trending repos, and curate your own ultimate technical reading list.

What is the most interesting Wikipedia rabbit hole you’ve found through Hacker News? Let me know in the comments below, or share your thoughts on Twitter/X using the button below!

Until next time, happy coding!
