The R Data Scientist 06-01-2026

January 6, 2026

2025 wraps, data infrastructures, mapping trends, fitness data

🌍 Open Science & Community

2025, With Notes (yabellini.netlify.app). Reflections on 2025 across travel, speaking, teaching, writing, and community building in open science with rOpenSci, The Carpentries, R-Ladies, and LatinR

on the future of ISBA meetings (and others) (xianblog.wordpress.com). The future of ISBA conferences: mirrors, multi-hubs, sustainability, inclusion, a survey, Venezia 2024, and researchers unable to travel

Q&A with ASA Founder Mark Glickman (magazine.amstat.org). Mark Glickman discusses statistics, teaching, and open-source tools used in data analysis and measurement

An Open Letter to the BMJ Editorial Board (deevybee.blogspot.com). Open letter urging BMJ to retract Attar et al PREVENT-TAHA8 due to data concerns and alleged fabrication

🔄 Interop & Data Infra

R + Python: From polyglot to crosspolination (emilyriederer.com). Diversity in open source, cross-pollination of R and Python tooling, and sessions from posit::conf(2025) with Rich Iannone, Michael Chow, and others

Drop #749 (2025-12-31): 2025 — Dropped (dailydrop.hrbrmstr.dev). Year-end retrospective on 2025: DuckDB, CLI tools, fonts, RSS, and data wrangling in a digitally hoarded year

pytest-r-snapshot: Verifying Python code against R outputs at scale (nanx.me). pytest-r-snapshot enables Python test snapshots against R outputs, recording with R and replaying in CI across environments

🧰 R Tooling & Publishing

Paired Ends Wrapped: Top 10 Posts From 2025 (blog.stephenturner.us). Top 10 2025 posts on R, AI, RAG with Zotero, Quarto books, and Positron tooling by Stephen Turner

Testing the R-universe build workflow from your own GitHub repository (ropensci.org). How to test the R-universe build workflow from a GitHub repo, using a reusable CI workflow that mirrors R-universe on Linux, Windows, and macOS

Weekly Recap (January 2, 2026) (blog.stephenturner.us). NSF reorg, Genomics in 2026, Claude Code course, AI labor shift, uv speed, and 2025 LLM recap across R+Python and R Data Scientist

Multiplet Function Now Handles I > 1/2 (chemospec.org). Bryan Hanson updates the Multiplet function in R to support nuclei with spin greater than 1/2 using SpecHelpers

ME:: tl;dr-ing: How to make a static website on Codeberg using R and Quarto ; Ken Butler:: Quarto websites and Codeberg pages (part 1) (rolandtanglao.com). How to publish a static Quarto site on Codeberg Pages using R and Quarto, with guidance for beginners

🎨 Data Viz Craft

Learning data viz from the best: New America and Datawrapper (danielroelfs.com). Learning data viz with New America and Datawrapper using ggplot2 and tidyverse in R

60% started Geralt’s journey. Only 22% finished it. (stevenponce.netlify.app). A data-visualization post by Steven Ponce using R, ggplot2, and custom themes to display Witcher 3 progression via Steam achievements

Interesting thoughts about aesthetics aes() in ggplot2 (joshuamarie.com). Explores aes() in ggplot2, its implementation challenges, and a comparison with Python’s plotnine from an R-centric perspective

Data Strips: Quintiles vs. Box Plots (rawdatastudies.com). Compares quintile area strip plots and related views with box plots for revealing skewness and data density in large biological datasets

What is the most “middle” name? (erdavis.com). Analyzes most common middle names using voter data, cleaning with name grouping and R-based processing

🗺️ Mapping & Place Data

Interactive Dashboard for Mapping Police Violence Data (ianadamsresearch.com). An interactive Shiny dashboard in R for Mapping Police Violence data, exploring temporal, demographic, and geographic patterns

Analyzing Police Violence in America: Updated Data Through 2025 (ianadamsresearch.com). Updated MPV data through 2025 analyzed with R, showing demographic, temporal, and geographic patterns in police violence

Visualizing the Los Angeles Microclimate (conormclaughlin.net). A data-driven look at LA microclimates using Open-Meteo hourly data and R visualizations

New Caledonia's nickel exports (freerangestats.info). Tracking New Caledonia nickel exports (ore and metal) 2008–2025 with R, dplyr, readxl, and FRED nickel prices

🏃 Fitness Data in R

Running Around: an R package to analyse Garmin running data (quantixed.org). R package GarminCSVr analyzes Garmin activity data with R, offering annual summaries and year-over-year comparisons

Mapping runkeeper data (blog.djnavarro.net). Data science blog post using R to parse GPX files from Runkeeper and make maps with leaflet and Stadia Maps

Running Around: 2025 running dataviz in R (quantixed.org). Stephen Royle uses R to visualise and recap 2025 running data from Garmin, exploring distances, training load and marathons

📐 Stats & Modeling Notes

Testing Super Learner's Coverage - A Note To Myself (kenkoonwong.com). Explores SuperLearner with TMLE in R, comparing XGBoost, Random Forest, GLM, NNLS, and parallel computation

Forecasting benchmark: Dynrmf (a new serious competitor in town) vs Theta Method on M-Competitions and Tourism competition (thierrymoudiki.github.io). A benchmarking study comparing Dynrmf and Theta Method on M3, M1, and Tourism datasets using R, parallel processing, and standard accuracy metrics

Why does a least squares fit appear to have a bias when applied to simple data? (stats.stackexchange.com). Explains why ordinary least squares can appear biased on bivariate data and contrasts with TLS, PCA, and orthogonal regression

📚 Academic Research

friends.test: rank-based method for feature selection in interaction matrices (arxiv:q-bio). friends.test detects specific interactions in large heterogeneous matrices using rank profiles and mixture breakpoints. Fast O(nk log n) R implementation aids omics feature selection analysis

Benchmarking Preprocessing and Integration Methods in Single-Cell Genomics (arxiv:q-bio). Benchmark compares normalization, integration, and dimensionality-reduction pipelines for multimodal single-cell data. Highlights Seurat/Harmony plus UMAP tradeoffs, guiding reproducible R workflows and visualization at scale

Matrix Decomposition-Based Approach to Estimate the STARTS Model (arxiv:stat). New two-stage eigenvalue-based estimation for STARTS structural equation models reduces improper solutions. Useful for longitudinal SEM in R, guiding sensitivity analyses without specifying Bayesian priors

Robust reduced rank regression under heavy-tailed noise and missing data via non-convex penalization (arxiv:stat). Robust reduced-rank regression with Huber loss and SCAD/MCP spectral penalties handles outliers and missing responses. Includes the rrpackrobust R package for more accurate high-dimensional multivariate prediction
