newsfresh Command Reference

Rust 2024 Edition · CLI + Library · Tantivy BM25 Search · Polars DataFrames · TeaLeaf data format

A fast CLI tool for querying, filtering, and analyzing GDELT Global Knowledge Graph (GKG) v2.1 data — the world's largest open dataset of global news events, updated every 15 minutes.

Global Options

These options are available on every subcommand.

| Flag | Description |
| --- | --- |
| -v, --verbose | Increase logging verbosity. Stackable: -v (info), -vv (debug), -vvv (trace) |
| -q, --quiet | Suppress non-error output |
| -h, --help | Print help for any command |

$ newsfresh --help
Query and analyze GDELT GKG v2.1 data

Usage: newsfresh [OPTIONS] <COMMAND>

Commands:
  fetch    Download GKG data (latest or historical)
  parse    Parse a local GKG file and output records
  query    Fetch + parse + filter in one step
  schema   Print GKG type definitions
  analyze  NL search + analyze GKG records
  help     Print this message or the help of the given subcommand(s)
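
The stacking behaviour of -v and the effect of -q can be pictured as a small mapping from flag counts to a log level. The sketch below is illustrative only: `Level` and `level_from_flags` are hypothetical names, and both the `Warn` default (no flags) and the rule that -q wins over -v are assumptions, not documented newsfresh behaviour.

```rust
/// Log levels named in the flag table, plus an assumed `Warn` default.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Level {
    Error,
    Warn,
    Info,
    Debug,
    Trace,
}

/// Map the number of `-v` occurrences and the `-q` flag to a level.
/// Hypothetical helper: the real precedence rules may differ.
fn level_from_flags(verbose_count: u8, quiet: bool) -> Level {
    if quiet {
        return Level::Error; // -q: only errors are shown (assumed to win over -v)
    }
    match verbose_count {
        0 => Level::Warn, // assumed default when no -v is given
        1 => Level::Info,
        2 => Level::Debug,
        _ => Level::Trace, // -vvv and beyond
    }
}

fn main() {
    for (v, q) in [(0, false), (1, false), (2, false), (3, false), (2, true)] {
        println!("-v x{v}, quiet={q} -> {:?}", level_from_flags(v, q));
    }
}
```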

fetch — Download GKG Data

Downloads a GKG data file (latest 15-minute update or historical) from GDELT servers. Automatically extracts the CSV from the ZIP archive.

Arguments

| Flag | Type | Description | Default |
| --- | --- | --- | --- |
| --latest | bool | Fetch the latest 15-minute update | true |
| --date <DATE> | string | Fetch a specific historical file (YYYYMMDDHHMMSS) | |
| --translation | bool | Fetch non-English (translation) variant | false |
| -o, --output <DIR> | path | Output directory | ./data |
| --keep-zip | bool | Keep the .zip file after extraction | false |

Examples

# Latest 15-minute update (default)
$ newsfresh fetch

# Historical file by date
$ newsfresh fetch --date 20250217150000

# Non-English variant, custom output directory, keep zip
$ newsfresh fetch --translation -o ./my-data --keep-zip

Sample Output

Fetching: http://data.gdeltproject.org/gdeltv2/20260216054500.gkg.csv.zip
Extracted: data/20260216054500.gkg.csv
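
The URL in the sample output suggests how a --date value maps to a download location. The following is a minimal sketch of that mapping, not the tool's actual code: `is_valid_slot` and `gkg_url` are hypothetical helpers, the 15-minute boundary check is an assumption based on the update cadence, and the `.translation` infix follows GDELT's published file naming rather than anything stated in this reference.

```rust
/// Base URL as it appears in the sample output.
const BASE_URL: &str = "http://data.gdeltproject.org/gdeltv2";

/// Check that `ts` has the shape YYYYMMDDHHMMSS and sits on a 15-minute boundary.
/// Only the shape is validated; calendar correctness (e.g. February 30) is not.
fn is_valid_slot(ts: &str) -> bool {
    if ts.len() != 14 || !ts.bytes().all(|b| b.is_ascii_digit()) {
        return false;
    }
    let num = |range: std::ops::Range<usize>| ts[range].parse::<u32>().unwrap_or(u32::MAX);
    let (month, day) = (num(4..6), num(6..8));
    let (hour, minute, second) = (num(8..10), num(10..12), num(12..14));
    (1..=12).contains(&month)
        && (1..=31).contains(&day)
        && hour < 24
        && minute < 60
        && minute % 15 == 0
        && second == 0
}

/// Build the download URL for a timestamp, or `None` if the timestamp is malformed.
fn gkg_url(ts: &str, translation: bool) -> Option<String> {
    if !is_valid_slot(ts) {
        return None;
    }
    // Assumed naming for the non-English variant.
    let variant = if translation { ".translation" } else { "" };
    Some(format!("{}/{}{}.gkg.csv.zip", BASE_URL, ts, variant))
}

fn main() {
    let inputs = [
        ("20260216054500", false),
        ("20250217150000", true),
        ("20250217150700", false),
    ];
    for (ts, translation) in inputs {
        match gkg_url(ts, translation) {
            Some(url) => println!("{url}"),
            None => println!("{ts}: not a valid 15-minute slot"),
        }
    }
}
```

Validating the timestamp before making a request gives a clearer error than a failed download for a file that cannot exist.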

parse — Parse a Local GKG File

Parses a local .csv or .csv.zip GKG file. Supports all filters, output formats, field projection, and offset/limit pagination.

Arguments

| Flag | Type | Description | Default |
| --- | --- | --- | --- |
| <FILE> | path | Path to a local .csv or .csv.zip GKG file | required |
| -f, --format | enum | Output format: json, json-compact, tealeaf, tealeaf-compact | json |
| -o, --output | path | Output file (stdout if omitted) | stdout |
| --limit <N> | int | Maximum number of records to output | all |
| --offset <N> | int | Skip first N records | 0 |
| --fields <LIST> | string | Comma-separated field names for projection | all fields |

All filter options are also accepted (see Filter Options).

Examples

# Parse a CSV file with filters
$ newsfresh parse data/20250217150000.gkg.csv \
  --country US --person "Trump" --limit 10

# Parse directly from a zip file
$ newsfresh parse data/20250217150000.gkg.csv.zip -f json

# Output specific fields only
$ newsfresh parse data/gkg.csv \
  -f json --fields document_identifier,source_common_name,tone

Sample Output — JSON with field projection

[
{
  "document_identifier": "https://www.washingtonpost.com/politics/2025/02/17/congress-budget-...",
  "source_common_name": "washingtonpost.com",
  "tone": {
    "tone": -1.82,
    "positive_score": 3.12,
    "negative_score": 4.94,
    "polarity": 8.06,
    "activity_ref_density": 15.43,
    "self_group_ref_density": 0.22,
    "word_count": 612
  }
}
]
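
How --offset and --limit interact with filters can be sketched with ordinary iterator adapters. This is an illustration under one stated assumption: that the offset counts records that already passed the filters. The reference does not say whether newsfresh skips before or after filtering, and `paginate` is a hypothetical helper.

```rust
/// Apply a filter, then skip `offset` matches and keep at most `limit` of the rest.
/// `limit = None` models the "all" default.
fn paginate<T, I, F>(records: I, keep: F, offset: usize, limit: Option<usize>) -> Vec<T>
where
    I: IntoIterator<Item = T>,
    F: Fn(&T) -> bool,
{
    records
        .into_iter()
        .filter(|r| keep(r))
        .skip(offset)
        .take(limit.unwrap_or(usize::MAX))
        .collect()
}

fn main() {
    // Stand-in for records: the numbers 1 to 20, keeping only the even ones.
    let page = paginate(1..=20, |n| n % 2 == 0, 2, Some(3));
    println!("{:?}", page); // third, fourth, and fifth even numbers
}
```

Because the adapters are lazy, the iterator stops reading input as soon as the limit is reached, which matters when the source is a large file.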

query — Fetch + Parse + Filter

Downloads a GKG data file, parses it, applies filters, and outputs results — all in one step. Combines fetch + parse.

Arguments

| Flag | Type | Description | Default |
| --- | --- | --- | --- |
| --latest | bool | Fetch the latest 15-minute update | false |
| --date <DATE> | string | Fetch a specific historical file (YYYYMMDDHHMMSS) | |
| --translation | bool | Fetch non-English variant | false |
| --persist-data-file | bool | Persist downloaded data to persisted-storage/ | false |
| -f, --format | enum | Output format | json |
| -o, --output | path | Output file | stdout |
| --limit <N> | int | Maximum records | all |
| --offset <N> | int | Skip first N records | 0 |
| --fields <LIST> | string | Comma-separated field projection | all fields |

All filter options are also accepted (see Filter Options).

Examples

# Fetch latest and filter by theme
$ newsfresh query --latest --country US --theme "CLIMATE_CHANGE" --limit 5

# Historical data with tone filter
$ newsfresh query --date 20250201120000 --tone-min=-10 --tone-max=-2

# Persist downloaded files for reuse
$ newsfresh query --latest --persist-data-file --country US --limit 20

# Output in compact TeaLeaf format
$ newsfresh query --latest --country UK --has-quote -f tealeaf-compact

analyze — Full-Text Search & Statistics

Builds an in-memory Tantivy full-text index with BM25 ranking over GKG records, then runs natural language search queries. Optionally computes aggregate statistics using Polars DataFrames.
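
To give a feel for what BM25 ranking does, here is a small standalone scorer. It is a sketch, not Tantivy's implementation: the whitespace tokenizer, the function names, and the parameter values k1 = 1.2 and b = 0.75 (common defaults) are assumptions, and the IDF uses the Lucene-style `ln(1 + (N - n + 0.5) / (n + 0.5))` form.

```rust
use std::collections::HashMap;

const K1: f64 = 1.2; // term-frequency saturation (assumed default)
const B: f64 = 0.75; // document-length normalization (assumed default)

/// Lowercase whitespace tokenizer; a real index would use a richer analyzer.
fn tokenize(text: &str) -> Vec<String> {
    text.split_whitespace().map(|w| w.to_lowercase()).collect()
}

/// Score every document against `query`; higher means more relevant.
fn bm25_scores(docs: &[&str], query: &str) -> Vec<f64> {
    let tokenized: Vec<Vec<String>> = docs.iter().map(|d| tokenize(d)).collect();
    let n_docs = tokenized.len() as f64;
    let total_len: usize = tokenized.iter().map(|d| d.len()).sum();
    let avg_len = if total_len > 0 { total_len as f64 / n_docs } else { 1.0 };

    // Document frequency: in how many documents each term appears.
    let mut doc_freq: HashMap<&str, f64> = HashMap::new();
    for doc in &tokenized {
        let mut unique: Vec<&str> = doc.iter().map(|w| w.as_str()).collect();
        unique.sort_unstable();
        unique.dedup();
        for term in unique {
            *doc_freq.entry(term).or_insert(0.0) += 1.0;
        }
    }

    let query_terms = tokenize(query);
    tokenized
        .iter()
        .map(|doc| {
            let len_norm = 1.0 - B + B * doc.len() as f64 / avg_len;
            query_terms
                .iter()
                .map(|term| {
                    let tf = doc.iter().filter(|w| *w == term).count() as f64;
                    let n = doc_freq.get(term.as_str()).copied().unwrap_or(0.0);
                    let idf = (1.0 + (n_docs - n + 0.5) / (n + 0.5)).ln();
                    idf * tf * (K1 + 1.0) / (tf + K1 * len_norm)
                })
                .sum::<f64>()
        })
        .collect()
}

fn main() {
    let docs = [
        "senate passes budget bill",
        "budget budget talks stall",
        "wildfire spreads across region",
    ];
    for (doc, score) in docs.iter().zip(bm25_scores(&docs, "budget bill")) {
        println!("{score:.3}  {doc}");
    }
}
```

Rare query terms contribute more through the IDF factor, while repeated occurrences of a term give diminishing returns, which is why a multi-word query such as those in the examples tends to favour records that cover several of its terms.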

Search Enrichment

| Enrichment | Example |
| --- | --- |
| FIPS country codes expanded to full names (240+ countries) | Searching "United States" matches code US |
| ADM1 state/province codes expanded to readable names | Searching "California" matches US06 |
| Theme code canonicalization | TAX_FNCACT_PRESIDENT becomes searchable as "President" |
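
The enrichments above amount to adding human-readable text next to the raw codes before indexing. A rough sketch of the idea follows. The lookup tables hold only the codes that appear in this reference, the `TAX_FNCACT_` prefix rule is a guess generalized from the single example, and every function name here is hypothetical.

```rust
/// Tiny illustrative table; the tool is described as covering 240+ countries.
fn country_name(fips: &str) -> Option<&'static str> {
    match fips {
        "US" => Some("United States"),
        "UK" => Some("United Kingdom"),
        "CH" => Some("China"),
        "AS" => Some("Australia"),
        _ => None,
    }
}

/// ADM1 code to readable name; only the documented example is included.
fn adm1_name(code: &str) -> Option<&'static str> {
    match code {
        "US06" => Some("California"),
        _ => None,
    }
}

/// Turn a theme code into searchable words: drop an assumed taxonomy prefix,
/// then replace underscores with spaces and lowercase the result.
fn canonical_theme(code: &str) -> String {
    let stripped = code.strip_prefix("TAX_FNCACT_").unwrap_or(code);
    stripped.replace('_', " ").to_lowercase()
}

/// Build the extra text that would be indexed alongside a record's raw codes.
fn enrich(country: &str, adm1: &str, themes: &[&str]) -> String {
    let mut parts: Vec<String> = Vec::new();
    parts.extend(country_name(country).map(String::from));
    parts.extend(adm1_name(adm1).map(String::from));
    parts.extend(themes.iter().map(|t| canonical_theme(t)));
    parts.join(" ")
}

fn main() {
    let text = enrich("US", "US06", &["TAX_FNCACT_PRESIDENT", "GENERAL_GOVERNMENT"]);
    println!("{text}");
}
```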

Arguments

| Flag | Type | Description | Default |
| --- | --- | --- | --- |
| [FILE] | path | Local .csv or .csv.zip file (optional; use --latest or --date instead) | |
| --search <QUERY> | string | Natural language search query | required |
| --latest | bool | Fetch the latest 15-minute update | false |
| --date <DATE> | string | Fetch historical file (YYYYMMDDHHMMSS) | |
| --translation | bool | Non-English variant | false |
| --persist-data-file | bool | Persist downloaded data | false |
| --limit <N> | int | Maximum number of results | 20 |
| --stats | bool | Show aggregate statistics instead of records | false |
| --stats-top-n <N> | int | Number of top entries per frequency table | 10 |
| -f, --format | enum | Output format (when not using --stats) | tealeaf |
| -o, --output | path | Output file | stdout |
| --fields <LIST> | string | Comma-separated field projection | all fields |

All filter options are also accepted (see Filter Options).

Examples — Record Output

# Search the latest GDELT data with natural language
$ newsfresh analyze --latest \
  --search "elections Congress US economy" --limit 20

# Search with additional structured filters
$ newsfresh analyze --latest \
  --search "climate carbon emissions policy" \
  --country US --limit 10

# From a local file
$ newsfresh analyze data/gkg.csv \
  --search "Ukraine Russia ceasefire negotiations" --limit 15

# Compact TeaLeaf output for LLM consumption (~47% fewer tokens)
$ newsfresh analyze --latest \
  --search "AI regulation technology" --limit 10 -f tealeaf-compact

Examples — Aggregate Statistics (--stats)

# Aggregate statistics with Polars DataFrames
$ newsfresh analyze --latest \
  --search "US politics" --limit 50 --stats

# Top 5 per category
$ newsfresh analyze data/gkg.csv \
  --search "climate change" --stats --stats-top-n 5 --limit 100

Sample --stats Output

=== GDELT Analysis Stats (50 records) ===

--- Top Themes ---
   1. GENERAL GOVERNMENT            24  (2.0%)
   2. LEADER                        24  (2.0%)
   3. GENERAL1                      23  (2.0%)
   4. GOVERNMENT                    21  (1.8%)
   5. UNGP FORESTS RIVERS OCEANS    20  (1.7%)

--- Top Countries ---
   1. United States (US)                48  (31.2%)
   2. United Kingdom (UK)                9  (5.8%)
   3. Canada (CA)                        7  (4.5%)
   4. China (CH)                         7  (4.5%)
   5. Australia (AS)                     6  (3.9%)

--- Tone ---
  Mean: -0.82  Std: 3.45  Range: [-8.12, 4.91]
  Most positive: [4.91] https://www.nytimes.com/.../economy-jobs-report...
  Most negative: [-8.12] https://www.washingtonpost.com/.../congress-budget-crisis...

--- Top Persons ---
   1. donald trump         9  (4.9%)
   2. elon musk            5  (2.7%)
   3. kamala harris        3  (1.6%)
   4. marco rubio          3  (1.6%)
   5. jerome powell        3  (1.6%)

--- Top Organizations ---
   1. congress                         3  (1.7%)
   2. federal reserve                  3  (1.7%)
   3. microsoft                        3  (1.7%)
   4. pentagon                         2  (1.2%)
   5. state department                 2  (1.2%)

--- Top Sources ---
   1. nytimes.com           16  (32.0%)
   2. washingtonpost.com     6  (12.0%)
   3. cnn.com               6  (12.0%)
   4. foxnews.com           5  (10.0%)
   5. politico.com          4  (8.0%)
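
The frequency tables and the tone summary in this output are simple aggregations. The sketch below reproduces their shape with the standard library only, as an illustration rather than the Polars-based implementation. `frequency_table` and `mean_std` are hypothetical names, the percentage is taken over all counted values, and the standard deviation is the population form; the tool may use different denominators.

```rust
use std::collections::HashMap;

/// Count occurrences and return the `n` most frequent values with their
/// count and share of all counted values (in percent).
fn frequency_table<'a>(values: &[&'a str], n: usize) -> Vec<(&'a str, usize, f64)> {
    let mut counts: HashMap<&'a str, usize> = HashMap::new();
    for v in values {
        *counts.entry(*v).or_insert(0) += 1;
    }
    let total = values.len() as f64;
    let mut rows: Vec<(&str, usize)> = counts.into_iter().collect();
    // Highest count first; ties broken alphabetically for stable output.
    rows.sort_by(|a, b| b.1.cmp(&a.1).then(a.0.cmp(b.0)));
    rows.into_iter()
        .take(n)
        .map(|(value, count)| (value, count, 100.0 * count as f64 / total))
        .collect()
}

/// Mean and population standard deviation of tone scores.
fn mean_std(tones: &[f64]) -> Option<(f64, f64)> {
    if tones.is_empty() {
        return None;
    }
    let n = tones.len() as f64;
    let mean = tones.iter().sum::<f64>() / n;
    let variance = tones.iter().map(|t| (t - mean).powi(2)).sum::<f64>() / n;
    Some((mean, variance.sqrt()))
}

fn main() {
    let sources = ["nytimes.com", "cnn.com", "nytimes.com", "bbc.com"];
    for (rank, (name, count, pct)) in frequency_table(&sources, 2).into_iter().enumerate() {
        println!("{:>4}. {:<20} {:>3}  ({:.1}%)", rank + 1, name, count, pct);
    }
    if let Some((mean, std)) = mean_std(&[1.0, 3.0]) {
        println!("Mean: {mean:.2}  Std: {std:.2}");
    }
}
```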

schema — Print GKG Type Definitions

Prints the complete GKG v2.1 record schema in TeaLeaf or JSON Schema format.

Arguments

| Flag | Type | Description | Default |
| --- | --- | --- | --- |
| -f, --format | enum | tealeaf or json-schema | tealeaf |

Examples

# TeaLeaf schema format
$ newsfresh schema

# JSON Schema format
$ newsfresh schema -f json-schema

Filter Options

Available on parse, query, and analyze. All filters compose with AND logic.

| Flag | Description | Example |
| --- | --- | --- |
| --person | Person name (case-insensitive substring) | --person "Trump" |
| --org | Organization name | --org "United Nations" |
| --theme | GKG theme code | --theme "TAX_POLICY" |
| --location | Location name | --location "Washington" |
| --country | FIPS country code | --country US |
| --tone-min / --tone-max | Tone score range | --tone-min=-5 --tone-max=5 |
| --date-from / --date-to | Date range (YYYYMMDD) | --date-from 20250201 |
| --source | Source name | --source "bbc" |
| --has-image | Only records with a sharing image | --has-image |
| --has-quote | Only records with quotations | --has-quote |
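
AND composition means a record is kept only if every active filter accepts it. A minimal sketch of that idea with boxed closures is shown below. It does not reproduce the crate's RecordFilter trait: `Record`, `Predicate`, and the helper functions are hypothetical, and treating a record with no tone as failing the tone filter is an assumption.

```rust
/// Minimal stand-in for a parsed record; field names are illustrative.
struct Record {
    persons: Vec<String>,
    country_codes: Vec<String>,
    tone: Option<f64>,
    date: i64, // YYYYMMDDHHMMSS
}

type Predicate = Box<dyn Fn(&Record) -> bool>;

/// Case-insensitive substring match on person names.
fn person_contains(needle: &str) -> Predicate {
    let needle = needle.to_lowercase();
    Box::new(move |r: &Record| {
        r.persons.iter().any(|p| p.to_lowercase().contains(needle.as_str()))
    })
}

/// Exact match on a FIPS country code.
fn country_is(code: &str) -> Predicate {
    let code = code.to_string();
    Box::new(move |r: &Record| r.country_codes.iter().any(|c| *c == code))
}

/// Inclusive tone range; records without a tone are rejected (assumption).
fn tone_between(min: f64, max: f64) -> Predicate {
    Box::new(move |r: &Record| r.tone.map_or(false, |t| t >= min && t <= max))
}

/// Lower date bound given as YYYYMMDD; HHMMSS is dropped before comparing.
fn date_from(yyyymmdd: i64) -> Predicate {
    Box::new(move |r: &Record| r.date / 1_000_000 >= yyyymmdd)
}

/// AND composition: every filter must accept the record.
fn matches_all(filters: &[Predicate], record: &Record) -> bool {
    filters.iter().all(|f| f(record))
}

fn main() {
    let record = Record {
        persons: vec!["Donald Trump".to_string()],
        country_codes: vec!["US".to_string()],
        tone: Some(-3.5),
        date: 20250217150000,
    };
    let filters = vec![
        person_contains("trump"),
        country_is("US"),
        tone_between(-5.0, 5.0),
        date_from(20250201),
    ];
    println!("kept: {}", matches_all(&filters, &record));
}
```

An empty filter list accepts every record, which matches the behaviour one would expect when no filter flags are passed.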

Output Formats

Controlled via -f, --format. The --fields flag enables field projection (JSON only).

| Format | Flag | Description |
| --- | --- | --- |
| JSON (pretty) | -f json | Pretty-printed JSON array (default for parse and query; analyze defaults to tealeaf) |
| JSON (compact) | -f json-compact | Minified single-line JSON |
| TeaLeaf | -f tealeaf | Schema-driven format, ~47% fewer tokens than JSON |
| TeaLeaf (compact) | -f tealeaf-compact | Minified TeaLeaf, additional ~21% savings |

GKG v2.1 Record Fields

Each record contains up to 27 tab-delimited fields. The GkgRecord struct maps all of them.

| # | Field | Type | Description |
| --- | --- | --- | --- |
| 0 | gkg_record_id | String | Unique record identifier |
| 1 | date | i64 | Publication date (YYYYMMDDHHMMSS) |
| 2 | source_collection_id | i32 | Source type (1=Web, 2=Citation, 3=Core, ...) |
| 3 | source_common_name | String | Human-readable source name |
| 4 | document_identifier | String | Article URL |
| 5 | v1_counts | Vec<CountV1> | Event counts (protests, arrests, etc.) |
| 6 | v21_counts | Vec<CountV21> | V2.1 counts with character offsets |
| 7 | v1_themes | Vec<String> | Theme codes (e.g., TAX_POLICY) |
| 8 | v2_enhanced_themes | Vec<EnhancedTheme> | Themes with character offsets |
| 9 | v1_locations | Vec<LocationV1> | Geocoded locations (country, lat/lon) |
| 10 | v2_enhanced_locations | Vec<EnhancedLocation> | V2 locations with ADM2 codes |
| 11 | v1_persons | Vec<String> | Person names mentioned |
| 12 | v2_enhanced_persons | Vec<EnhancedEntity> | Persons with character offsets |
| 13 | v1_organizations | Vec<String> | Organization names |
| 14 | v2_enhanced_organizations | Vec<EnhancedEntity> | Organizations with offsets |
| 15 | tone | Option<Tone> | Sentiment (tone, polarity, pos/neg, word count) |
| 16 | v21_enhanced_dates | Vec<EnhancedDate> | Dates mentioned in the article |
| 17 | gcam | Vec<GcamEntry> | GCAM content-analysis dimension scores |
| 18 | sharing_image | Option<String> | Primary sharing image URL |
| 19 | related_images | Vec<String> | Related image URLs |
| 20 | social_image_embeds | Vec<String> | Social media image embeds |
| 21 | social_video_embeds | Vec<String> | Social media video embeds |
| 22 | quotations | Vec<Quotation> | Direct quotes with attribution verbs |
| 23 | all_names | Vec<NameEntry> | All named entities |
| 24 | amounts | Vec<AmountEntry> | Monetary/numerical amounts |
| 25 | translation_info | Option<TranslationInfo> | Translation source language and engine |
| 26 | extras_xml | Option<String> | Extra XML content |
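
As an illustration of how one tab-delimited line maps onto these fields, the sketch below splits a line into columns and parses column 15 into a tone value. The `Tone` struct here mirrors the JSON keys shown in the parse sample output but is not the crate's own type. `split_fields` and `parse_tone` are hypothetical helpers, and the comma-separated layout of the tone column is taken from the GKG codebook rather than from this reference.

```rust
#[derive(Debug, PartialEq)]
struct Tone {
    tone: f64,
    positive_score: f64,
    negative_score: f64,
    polarity: f64,
    activity_ref_density: f64,
    self_group_ref_density: f64,
    word_count: u32,
}

const FIELD_COUNT: usize = 27;
const TONE_INDEX: usize = 15;

/// Split one line into at least 27 columns, padding short lines with "".
fn split_fields(line: &str) -> Vec<&str> {
    let trimmed = line.trim_end_matches(|c| c == '\r' || c == '\n');
    let mut fields: Vec<&str> = trimmed.split('\t').collect();
    let target = FIELD_COUNT.max(fields.len());
    fields.resize(target, "");
    fields
}

/// Parse the comma-separated tone column; `None` if it is empty or malformed.
fn parse_tone(raw: &str) -> Option<Tone> {
    let parts: Vec<&str> = raw.split(',').collect();
    if parts.len() < 7 {
        return None;
    }
    let float = |i: usize| parts[i].trim().parse::<f64>().ok();
    Some(Tone {
        tone: float(0)?,
        positive_score: float(1)?,
        negative_score: float(2)?,
        polarity: float(3)?,
        activity_ref_density: float(4)?,
        self_group_ref_density: float(5)?,
        word_count: parts[6].trim().parse().ok()?,
    })
}

fn main() {
    // Fifteen empty leading columns, then a tone column, then nothing else.
    let line = format!("{}-1.82,3.12,4.94,8.06,15.43,0.22,612\n", "\t".repeat(15));
    let fields = split_fields(&line);
    println!("{} columns, tone = {:?}", fields.len(), parse_tone(fields[TONE_INDEX]));
}
```

Padding short lines keeps index-based access safe, which fits the "up to 27" wording: trailing columns may be missing and then behave like empty fields.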

Module Overview

The crate is organized into 9 public modules.

| Module | Description |
| --- | --- |
| cli | CLI argument definitions (clap derive): structs for all 5 subcommands |
| error | NewsfreshError enum covering HTTP, I/O, parse, ZIP, JSON, and Polars errors |
| fetch | HTTP client for downloading GKG data, ZIP extraction, lastupdate.txt parsing |
| filter | RecordFilter trait and 10 composable predicate filters |
| model | Complete GKG v2.1 data model: GkgRecord with 27 fields and 14 sub-types |
| output | OutputFormatter trait with JSON and TeaLeaf formatters, field projection, schema printing |
| parse | Streaming GKG parser: GkgReader iterator, tab-delimited field parsing |
| search | Tantivy full-text search: SearchEngine trait, BM25, FIPS/ADM1/theme enrichment |
| stats | Polars DataFrame aggregation: theme/country/person/org/source frequency, tone statistics |
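
The streaming design mentioned for the parse module can be pictured as an iterator that yields one row at a time instead of loading the whole file. The following sketch shows that pattern with the standard library. `RowReader` is a hypothetical type and is not the crate's GkgReader, whose actual signature this reference does not show.

```rust
use std::io::{self, BufRead, Cursor};

/// Hypothetical streaming reader: yields one tab-split row per input line.
struct RowReader<R: BufRead> {
    lines: io::Lines<R>,
}

impl<R: BufRead> RowReader<R> {
    fn new(reader: R) -> Self {
        RowReader { lines: reader.lines() }
    }
}

impl<R: BufRead> Iterator for RowReader<R> {
    type Item = io::Result<Vec<String>>;

    fn next(&mut self) -> Option<Self::Item> {
        loop {
            match self.lines.next()? {
                // Skip blank lines rather than yielding empty rows.
                Ok(line) if line.is_empty() => continue,
                Ok(line) => {
                    return Some(Ok(line.split('\t').map(str::to_owned).collect()));
                }
                // Surface I/O or UTF-8 errors to the caller.
                Err(e) => return Some(Err(e)),
            }
        }
    }
}

fn main() -> io::Result<()> {
    // An in-memory stand-in for a GKG file.
    let data = Cursor::new("id1\t20250217150000\t1\n\nid2\t20250217151500\t2\n");
    for row in RowReader::new(data) {
        let row = row?;
        println!("{} columns, id = {}", row.len(), row[0]);
    }
    Ok(())
}
```

Because the reader is an ordinary iterator, filters and pagination can be layered on with adapters, and memory use stays proportional to one line rather than the whole file.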