# newsfresh

A fast CLI tool for querying, filtering, and analyzing GDELT Global Knowledge Graph (GKG) v2.1 data — the world's largest open dataset of global news events, updated every 15 minutes.
## Global Flags

These options are available on every subcommand.
| Flag | Description |
|---|---|
| -v, --verbose | Increase logging verbosity. Stackable: -v (info), -vv (debug), -vvv (trace) |
| -q, --quiet | Suppress non-error output |
| -h, --help | Print help for any command |
```
$ newsfresh --help
Query and analyze GDELT GKG v2.1 data

Usage: newsfresh [OPTIONS] <COMMAND>

Commands:
  fetch    Download GKG data (latest or historical)
  parse    Parse a local GKG file and output records
  query    Fetch + parse + filter in one step
  schema   Print GKG type definitions
  analyze  NL search + analyze GKG records
  help     Print this message or the help of the given subcommand(s)
```
## fetch — Download GKG Data

Downloads a GKG data file (latest 15-minute update or historical) from GDELT servers. Automatically extracts the CSV from the ZIP archive.
| Flag | Type | Description | Default |
|---|---|---|---|
| --latest | bool | Fetch the latest 15-minute update | true |
| --date <DATE> | string | Fetch a specific historical file (YYYYMMDDHHMMSS) | — |
| --translation | bool | Fetch non-English (translation) variant | false |
| -o, --output <DIR> | path | Output directory | ./data |
| --keep-zip | bool | Keep the .zip file after extraction | false |
```
# Latest 15-minute update (default)
$ newsfresh fetch

# Historical file by date
$ newsfresh fetch --date 20250217150000

# Non-English variant, custom output directory, keep zip
$ newsfresh fetch --translation -o ./my-data --keep-zip
```

```
Fetching: http://data.gdeltproject.org/gdeltv2/20260216054500.gkg.csv.zip
Extracted: data/20260216054500.gkg.csv
```
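For orientation, the URL pattern fetch requests can be sketched in a few lines of Python. This is an illustration of GDELT's published file-naming scheme, not newsfresh code; in particular, the `.translation` infix for the non-English variant is an assumption based on GDELT's documented naming.

```python
# Illustrative sketch (not newsfresh code): forming a GDELT GKG v2.1
# download URL from a YYYYMMDDHHMMSS timestamp.
BASE = "http://data.gdeltproject.org/gdeltv2"

def gkg_url(timestamp: str, translation: bool = False) -> str:
    """Return the .gkg.csv.zip URL for one 15-minute GKG file."""
    if len(timestamp) != 14 or not timestamp.isdigit():
        raise ValueError("timestamp must be YYYYMMDDHHMMSS")
    # Assumed naming: non-English files carry a ".translation" infix.
    infix = ".translation" if translation else ""
    return f"{BASE}/{timestamp}{infix}.gkg.csv.zip"

print(gkg_url("20260216054500"))
# http://data.gdeltproject.org/gdeltv2/20260216054500.gkg.csv.zip
```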
## parse — Parse a Local GKG File

Parses a local .csv or .csv.zip GKG file. Supports all filters, output formats, field projection, and offset/limit pagination.
| Flag | Type | Description | Default |
|---|---|---|---|
| <FILE> | path | Path to a local .csv or .csv.zip GKG file | required |
| -f, --format | enum | Output format: json, json-compact, tealeaf, tealeaf-compact | json |
| -o, --output | path | Output file (stdout if omitted) | stdout |
| --limit <N> | int | Maximum number of records to output | all |
| --offset <N> | int | Skip first N records | 0 |
| --fields <LIST> | string | Comma-separated field names for projection | all fields |
| + all filter options | | | |
```
# Parse a CSV file with filters
$ newsfresh parse data/20250217150000.gkg.csv \
    --country US --person "Trump" --limit 10

# Parse directly from a zip file
$ newsfresh parse data/20250217150000.gkg.csv.zip -f json

# Output specific fields only
$ newsfresh parse data/gkg.csv \
    -f json --fields document_identifier,source_common_name,tone
```
```json
[
  {
    "document_identifier": "https://www.washingtonpost.com/politics/2025/02/17/congress-budget-...",
    "source_common_name": "washingtonpost.com",
    "tone": {
      "tone": -1.82,
      "positive_score": 3.12,
      "negative_score": 4.94,
      "polarity": 8.06,
      "activity_ref_density": 15.43,
      "self_group_ref_density": 0.22,
      "word_count": 612
    }
  }
]
```
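Conceptually, --fields projection just keeps the named top-level keys of each record, as a minimal sketch shows (the field names mirror the example output above; this is an illustration, not the tool's implementation):

```python
# Hedged sketch of field projection: drop every top-level key
# that was not requested.
def project(records: list[dict], fields: list[str]) -> list[dict]:
    keep = set(fields)
    return [{k: v for k, v in rec.items() if k in keep} for rec in records]

records = [{"document_identifier": "https://example.com/a",
            "source_common_name": "example.com",
            "tone": {"tone": -1.82},
            "v1_persons": ["donald trump"]}]

print(project(records, ["document_identifier", "tone"]))
# [{'document_identifier': 'https://example.com/a', 'tone': {'tone': -1.82}}]
```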
## query — Fetch + Parse + Filter

Downloads a GKG data file, parses it, applies filters, and outputs results in one step. Combines fetch and parse.
| Flag | Type | Description | Default |
|---|---|---|---|
| --latest | bool | Fetch the latest 15-minute update | false |
| --date <DATE> | string | Fetch a specific historical file (YYYYMMDDHHMMSS) | — |
| --translation | bool | Fetch non-English variant | false |
| --persist-data-file | bool | Persist downloaded data to persisted-storage/ | false |
| -f, --format | enum | Output format | json |
| -o, --output | path | Output file | stdout |
| --limit <N> | int | Maximum records | all |
| --offset <N> | int | Skip first N records | 0 |
| --fields <LIST> | string | Comma-separated field projection | all fields |
| + all filter options | | | |
```
# Fetch latest and filter by theme
$ newsfresh query --country US --theme "CLIMATE_CHANGE" --limit 5

# Historical data with tone filter
$ newsfresh query --date 20250201120000 --tone-min=-10 --tone-max=-2

# Persist downloaded files for reuse
$ newsfresh query --persist-data-file --country US --limit 20

# Output in compact TeaLeaf format
$ newsfresh query --country UK --has-quote -f tealeaf-compact
```
## analyze — Full-Text Search & Statistics

Builds an in-memory Tantivy full-text index with BM25 ranking over GKG records, then runs natural language search queries. Optionally computes aggregate statistics using Polars DataFrames.

Before indexing, records are enriched so plain-English search terms can match GDELT's internal codes:
| Enrichment | Example |
|---|---|
| FIPS country codes expanded to full names (240+ countries) | Searching "United States" matches code US |
| ADM1 state/province codes expanded to readable names | Searching "California" matches US06 |
| Theme code canonicalization | TAX_FNCACT_PRESIDENT becomes searchable as "President" |
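The enrichment idea can be sketched generically: expand opaque codes into readable text before indexing, so a full-text query matches either form. The tiny lookup tables below are stand-ins for the full FIPS (240+ countries) and ADM1 tables, and the theme heuristic is an illustration, not the tool's actual canonicalization rules.

```python
# Illustrative enrichment sketch: append human-readable expansions of
# codes to the indexed text. Lookup tables here are tiny stand-ins.
FIPS = {"US": "United States", "UK": "United Kingdom"}
ADM1 = {"US06": "California", "US48": "Texas"}

def canonical_theme(code: str) -> str:
    # Heuristic: TAX_FNCACT_PRESIDENT -> "President" (last segment, title-cased).
    return code.rsplit("_", 1)[-1].title()

def enrich(text: str, country: str, adm1: str, theme: str) -> str:
    """Return text augmented with readable code expansions for indexing."""
    parts = [text, FIPS.get(country, country),
             ADM1.get(adm1, adm1), canonical_theme(theme)]
    return " ".join(parts)

print(enrich("budget vote", "US", "US06", "TAX_FNCACT_PRESIDENT"))
# budget vote United States California President
```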
| Flag | Type | Description | Default |
|---|---|---|---|
| [FILE] | path | Local .csv or .csv.zip file (optional — use --latest or --date instead) | — |
| --search <QUERY> | string | Natural language search query | required |
| --latest | bool | Fetch the latest 15-minute update | false |
| --date <DATE> | string | Fetch historical file (YYYYMMDDHHMMSS) | — |
| --translation | bool | Non-English variant | false |
| --persist-data-file | bool | Persist downloaded data | false |
| --limit <N> | int | Maximum number of results | 20 |
| --stats | bool | Show aggregate statistics instead of records | false |
| --stats-top-n <N> | int | Number of top entries per frequency table | 10 |
| -f, --format | enum | Output format (when not using --stats) | tealeaf |
| -o, --output | path | Output file | stdout |
| --fields <LIST> | string | Comma-separated field projection | all fields |
| + all filter options | | | |
```
# Search the latest GDELT data with natural language
$ newsfresh analyze --latest \
    --search "elections Congress US economy" --limit 20

# Search with additional structured filters
$ newsfresh analyze --latest \
    --search "climate carbon emissions policy" \
    --country US --limit 10

# From a local file
$ newsfresh analyze data/gkg.csv \
    --search "Ukraine Russia ceasefire negotiations" --limit 15

# Compact TeaLeaf output for LLM consumption (~47% fewer tokens)
$ newsfresh analyze --latest \
    --search "AI regulation technology" --limit 10 -f tealeaf-compact
```
### Statistics Mode (--stats)

```
# Aggregate statistics with Polars DataFrames
$ newsfresh analyze --latest \
    --search "US politics" --limit 50 --stats

# Top 5 per category
$ newsfresh analyze data/gkg.csv \
    --search "climate change" --stats --stats-top-n 5 --limit 100
```
### Example --stats Output

```
=== GDELT Analysis Stats (50 records) ===

--- Top Themes ---
1. GENERAL GOVERNMENT 24 (2.0%)
2. LEADER 24 (2.0%)
3. GENERAL1 23 (2.0%)
4. GOVERNMENT 21 (1.8%)
5. UNGP FORESTS RIVERS OCEANS 20 (1.7%)

--- Top Countries ---
1. United States (US) 48 (31.2%)
2. United Kingdom (UK) 9 (5.8%)
3. Canada (CA) 7 (4.5%)
4. China (CH) 7 (4.5%)
5. Australia (AS) 6 (3.9%)

--- Tone ---
Mean: -0.82 Std: 3.45 Range: [-8.12, 4.91]
Most positive: [4.91] https://www.nytimes.com/.../economy-jobs-report...
Most negative: [-8.12] https://www.washingtonpost.com/.../congress-budget-crisis...

--- Top Persons ---
1. donald trump 9 (4.9%)
2. elon musk 5 (2.7%)
3. kamala harris 3 (1.6%)
4. marco rubio 3 (1.6%)
5. jerome powell 3 (1.6%)

--- Top Organizations ---
1. congress 3 (1.7%)
2. federal reserve 3 (1.7%)
3. microsoft 3 (1.7%)
4. pentagon 2 (1.2%)
5. state department 2 (1.2%)

--- Top Sources ---
1. nytimes.com 16 (32.0%)
2. washingtonpost.com 6 (12.0%)
3. cnn.com 6 (12.0%)
4. foxnews.com 5 (10.0%)
5. politico.com 4 (8.0%)
```
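The aggregation itself is conceptually simple. A minimal stdlib sketch of what --stats computes (the real tool uses Polars DataFrames; the toy records below are invented for illustration):

```python
# Hedged sketch of the --stats aggregation: top-N frequency tables plus
# tone mean/std/range over a batch of records.
from collections import Counter
from statistics import mean, pstdev

records = [
    {"themes": ["LEADER", "GOVERNMENT"], "tone": -1.8},
    {"themes": ["LEADER"], "tone": 2.4},
    {"themes": ["GOVERNMENT", "CLIMATE"], "tone": -4.1},
]

# Frequency table over a list-valued field, like "Top Themes".
theme_counts = Counter(t for r in records for t in r["themes"])
print(theme_counts.most_common(2))

# Tone summary, like the "--- Tone ---" block.
tones = [r["tone"] for r in records]
print(f"Mean: {mean(tones):.2f} Std: {pstdev(tones):.2f} "
      f"Range: [{min(tones)}, {max(tones)}]")
```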
## schema — Print GKG Type Definitions

Prints the complete GKG v2.1 record schema in TeaLeaf or JSON Schema format.
| Flag | Type | Description | Default |
|---|---|---|---|
| -f, --format | enum | tealeaf or json-schema | tealeaf |
```
# TeaLeaf schema format
$ newsfresh schema

# JSON Schema format
$ newsfresh schema -f json-schema
```
## Filter Options

Available on parse, query, and analyze. All filters compose with AND logic.
| Flag | Description | Example |
|---|---|---|
| --person | Person name (case-insensitive substring) | --person "Trump" |
| --org | Organization name | --org "United Nations" |
| --theme | GKG theme code | --theme "TAX_POLICY" |
| --location | Location name | --location "Washington" |
| --country | FIPS country code | --country US |
| --tone-min / --tone-max | Tone score range | --tone-min=-5 --tone-max=5 |
| --date-from / --date-to | Date range (YYYYMMDD) | --date-from 20250201 |
| --source | Source name | --source "bbc" |
| --has-image | Only records with a sharing image | --has-image |
| --has-quote | Only records with quotations | --has-quote |
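The AND composition above can be sketched as a list of predicates where a record passes only if every active filter accepts it. The filter constructors below are illustrative, not the crate's RecordFilter API:

```python
# Hedged sketch of AND-composed predicate filters over record dicts.
def person_filter(name: str):
    """Case-insensitive substring match against the persons list."""
    needle = name.lower()
    return lambda r: any(needle in p.lower() for p in r.get("v1_persons", []))

def tone_min_filter(threshold: float):
    return lambda r: r.get("tone", 0.0) >= threshold

filters = [person_filter("trump"), tone_min_filter(-5.0)]
record = {"v1_persons": ["Donald Trump"], "tone": -1.8}

# AND logic: every filter must accept the record.
passes = all(f(record) for f in filters)
print(passes)  # True
```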
## Output Formats

Controlled via -f, --format. The --fields flag enables field projection (JSON only).
| Format | Flag | Description |
|---|---|---|
| JSON (pretty) | -f json | Pretty-printed JSON array (default) |
| JSON (compact) | -f json-compact | Minified single-line JSON |
| TeaLeaf | -f tealeaf | Schema-driven format, ~47% fewer tokens than JSON |
| TeaLeaf (compact) | -f tealeaf-compact | Minified TeaLeaf, additional ~21% savings |
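As a rough illustration of why compact output is smaller, the same record can be serialized pretty versus minified. (The ~47% TeaLeaf figure comes from its schema-driven encoding and is not reproduced by this JSON-only sketch.)

```python
# Pretty vs. minified JSON for one record: same data, fewer bytes.
import json

record = {"source_common_name": "example.com", "tone": {"tone": -1.82}}
pretty = json.dumps([record], indent=2)                 # like -f json
compact = json.dumps([record], separators=(",", ":"))   # like -f json-compact

print(len(pretty), len(compact))
assert len(compact) < len(pretty)
```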
## GKG Record Fields

Each record contains up to 27 tab-delimited fields. The GkgRecord struct maps all of them.
| # | Field | Type | Description |
|---|---|---|---|
| 0 | gkg_record_id | String | Unique record identifier |
| 1 | date | i64 | Publication date (YYYYMMDDHHMMSS) |
| 2 | source_collection_id | i32 | Source type (1=Web, 2=Citation, 3=Core, ...) |
| 3 | source_common_name | String | Human-readable source name |
| 4 | document_identifier | String | Article URL |
| 5 | v1_counts | Vec<CountV1> | Event counts (protests, arrests, etc.) |
| 6 | v21_counts | Vec<CountV21> | V2.1 counts with character offsets |
| 7 | v1_themes | Vec<String> | Theme codes (e.g., TAX_POLICY) |
| 8 | v2_enhanced_themes | Vec<EnhancedTheme> | Themes with character offsets |
| 9 | v1_locations | Vec<LocationV1> | Geocoded locations (country, lat/lon) |
| 10 | v2_enhanced_locations | Vec<EnhancedLocation> | V2 locations with ADM2 codes |
| 11 | v1_persons | Vec<String> | Person names mentioned |
| 12 | v2_enhanced_persons | Vec<EnhancedEntity> | Persons with character offsets |
| 13 | v1_organizations | Vec<String> | Organization names |
| 14 | v2_enhanced_organizations | Vec<EnhancedEntity> | Organizations with offsets |
| 15 | tone | Option<Tone> | Sentiment (tone, polarity, pos/neg, word count) |
| 16 | v21_enhanced_dates | Vec<EnhancedDate> | Dates mentioned in the article |
| 17 | gcam | Vec<GcamEntry> | GCAM content-analysis dimension scores |
| 18 | sharing_image | Option<String> | Primary sharing image URL |
| 19 | related_images | Vec<String> | Related image URLs |
| 20 | social_image_embeds | Vec<String> | Social media image embeds |
| 21 | social_video_embeds | Vec<String> | Social media video embeds |
| 22 | quotations | Vec<Quotation> | Direct quotes with attribution verbs |
| 23 | all_names | Vec<NameEntry> | All named entities |
| 24 | amounts | Vec<AmountEntry> | Monetary/numerical amounts |
| 25 | translation_info | Option<TranslationInfo> | Translation source language and engine |
| 26 | extras_xml | Option<String> | Extra XML content |
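A minimal sketch of how one tab-delimited GKG line maps onto named fields. GDELT v2.1 delimits list-valued fields (themes, persons, organizations) with semicolons; only a few positions from the table above are shown, and the parser here is an illustration, not the crate's streaming GkgReader:

```python
# Hedged sketch: split one GKG line on tabs, name the leading columns,
# and split one semicolon-delimited list field (v1_persons, index 11).
FIELD_NAMES = ["gkg_record_id", "date", "source_collection_id",
               "source_common_name", "document_identifier"]

def parse_line(line: str) -> dict:
    cols = line.rstrip("\n").split("\t")
    rec = dict(zip(FIELD_NAMES, cols))
    if len(cols) > 11:
        rec["v1_persons"] = [p for p in cols[11].split(";") if p]
    return rec

line = "\t".join(["20250217150000-1", "20250217150000", "1",
                  "example.com", "https://example.com/a",
                  "", "", "", "", "", "", "Donald Trump;Elon Musk"])
print(parse_line(line)["v1_persons"])  # ['Donald Trump', 'Elon Musk']
```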
## Modules

The crate is organized into 9 public modules.
| Module | Description |
|---|---|
| cli | CLI argument definitions (clap derive) — structs for all 5 subcommands |
| error | NewsfreshError enum covering HTTP, I/O, parse, ZIP, JSON, and Polars errors |
| fetch | HTTP client for downloading GKG data, ZIP extraction, lastupdate.txt parsing |
| filter | RecordFilter trait and 10 composable predicate filters |
| model | Complete GKG v2.1 data model — GkgRecord with 27 fields and 14 sub-types |
| output | OutputFormatter trait with JSON and TeaLeaf formatters, field projection, schema printing |
| parse | Streaming GKG parser — GkgReader iterator, tab-delimited field parsing |
| search | Tantivy full-text search — SearchEngine trait, BM25, FIPS/ADM1/theme enrichment |
| stats | Polars DataFrame aggregation — theme/country/person/org/source frequency, tone statistics |