Global Health Sentiment Similarity (GDELT + PySpark)

Large-scale analysis of global health-news sentiment using GDELT, distributed MapReduce (PySpark), and vector-based NLP similarity methods.

Data Science
PythonPySparkMapReduceGDELT 2.0Cosine SimilarityMatplotlibNLP-ishDistributed Computing
Global Health Sentiment Similarity (GDELT + PySpark)
Sentiment progression + similarity results (replace with your exported plot PDFs/SVGs if you want).

Research question

How similar is the sentiment of health-related news coverage between New Zealand and other countries over time, and how can that similarity be computed efficiently over a large, distributed news dataset?

The emphasis is both analytical (sentiment similarity) and computational: designing a pipeline that scales to hundreds of country–month datasets using parallel processing.

Big Data source

This project uses the GDELT 2.0 global event database, which continuously aggregates and annotates news articles from across the world. I queried GDELT’s document API to retrieve health-related sentiment tone charts for multiple countries over a multi-year period.

Scale characteristics

  • 222 CSV files (6 countries × 37 months)
  • Each file aggregates thousands of news articles into tone-distribution buckets
  • Data volume and structure motivated a distributed processing approach

While the per-file CSVs are modest in size, the overall workload reflects a common Big Data pattern: many independent data partitions requiring identical transformations.

Method: distributed sentiment aggregation

I implemented the analysis using PySpark to emphasize distributed computation and parallelisation. Each country–month dataset was processed using a MapReduce-style aggregation to compute an average sentiment score from tone buckets.

Parallel MapReduce design

  • Map step: emit (count) and (tone × count) pairs per bucket
  • Reduce step: aggregate sums across partitions
  • Final step: compute weighted average sentiment per month
  • Executed using Spark RDDs to allow parallel execution across nodes

NLP-inspired similarity

  • Each country represented as a sentiment time-series vector
  • Vectors compared using cosine similarity
  • Discrete tone scaling applied to stabilize similarity metrics

Although this is not transformer-based NLP, the approach reflects a classic NLP workflow: transforming unstructured text into numerical representations and comparing documents (here, countries over time) in vector space.

Results

The sentiment progression was highly similar across all countries studied. New Zealand’s closest match was China (S ≈ 0.983) and the least similar was the United States (S ≈ 0.916), but all similarities remained high.

Key takeaway

The alignment of peaks/troughs across countries suggests a broadly consistent global media tone around health events during the COVID-era time window.

Parallelisation & performance analysis

A core goal of this project was to evaluate the effectiveness of parallelisation for this workload. I ran the pipeline across multiple Spark configurations to measure scaling behavior.

Engineering insight

This result is itself instructive: parallel frameworks do not guarantee speedup. Effective Big Data systems require batching, minimizing driver-side logic, and structuring transformations to amortize scheduling and I/O costs.

What I’d improve next

Engineering

  • Reduce small-RDD overhead by batching CSV ingestion
  • Minimize Python-side loops; push aggregation into Spark transformations
  • Cache/persist where appropriate; profile stages explicitly

Analysis

  • Broaden country set and extend timeline beyond COVID-era
  • Compare multiple keyword filters (e.g., “vaccine”, “pandemic”, “hospital”)
  • Add uncertainty estimates (bootstrap months / resample sources)

Beyond the specific sentiment results, this project demonstrates end-to-end data science systems thinking: working with a global-scale dataset, designing parallel MapReduce-style aggregations in PySpark, applying NLP-inspired vector representations, and critically evaluating when distributed computation meaningfully improves performance — and when overhead dominates.