The R Data Scientist 17-12-2025
R strategies for FDA submissions, Bioconductor in Africa, DuckDB data pipelines
🌍 Open Science & Pharma
Johnson & Johnson’s Success Story: A Hybrid R/SAS Strategy for FDA Submission (r-consortium​.org). Hybrid R/SAS workflow enables FDA submission with reproducible ADRG setup, leveraging R for visuals and SAS for ADaM derivation
Bridging Gaps and Breaking Barriers: A Year of R Innovation with the China PharmaRUG (r-consortium​.org). Hybrid PharmaRUG events and APAC R/Pharma track showcase open-source tooling, AI governance, and validated clinical reporting in R
Further co-creation of the Practical Guide for Research Funding Organizations (sfdora​.org). Funder groups co-create a Practical Guide for Implementing RRA for Research Funding Organizations, exploring AI, open science, and ethics in quarterly sessions
🧬 Bioconductor Genomics
Set Operations with freeCount (morphoscape​.wordpress​.com). Set operations on gene lists using freeCount in RStudio, with Venn diagrams and online Posit Cloud options
Bioconductor in Africa: Highlights from our first workshop in West Africa - Benin (blog​.bioconductor​.org). Bioconductor trains 25 participants in Benin with French bilingual support, DESeq2, SummarizedExperiment, and Carpentries capacity building across West Africa
Outreachy June 2025 Interns with Bioconductor (blog​.bioconductor​.org). Bioconductor interns share their Outreachy June 2025 experiences working on BugSigDB, microbiome study curation, R programming, mentoring, and community collaboration
📦 Modeling Package News
Counterfactual Scenario Analysis with ahead::ridge2f (thierrymoudiki​.github​.io). Counterfactual scenario analysis in R using ahead::ridge2f with insuranceQuotes data—scenarios, training, forecasting, and comparison
tidypredict 1.0.0 (tidyverse​.org). tidypredict 1.0.0 enables SQL-based predictions inside databases for tree models, glmnet support, and faster parsing in R
nlmixr2 5.0 (blog​.nlmixr2​.org). nlmixr2 5.0 updates, serialization changes, qs2 adoption, and alternative data formats for rxode2/nlmixr2 in R
tidymodels & xgboost (tidyverse​.org). Tidymodels integrates with xgboost 3.x and 2.x, updating parsnip and embed components for CRAN users and migration guidance from the xgboost team
🤖 LLMs for R
Which LLM writes the best R code? (posit​.co). Insights from Posit authors Sara and Simon on LLMs for R and evaluation tools
chores 0.3.0 and local LLMs (simonpcouch​.com). Local LLMs power chores helpers for R, with Qwen3-4B-2507, LM Studio, Ollama, and GPT-4.1/Claude prompts guiding roxygen2 templating
Interact from R with an AI that knows your data, files, and packages (bastianoleah​.netlify​.app). Explore btw to interact from R with an AI that knows your data, files, and packages, using tools and context
Weekly Recap (December 12, 2025) (blog​.stephenturner​.us). GPT-5.2 release, DARPA GO program, OpenAI benchmarks, and AI in science with R tooling and newsletters
🧹 R Dev Quality
How to Assess Usage of your Package (ropensci​.org). Guidance on measuring package usage with downloads, dependencies, GitHub activity, citations, surveys, and telemetry in R packages
Better Code, Without Any Effort, Without Even AI (ropensci​.org). Four tools (lintr, Air, jarl, and flir) improve R code formatting, linting, and safety without AI
R the Software Engineering Way: Introduction and Chapter Zero (deadsimpletech​.com). Introduction to R software engineering with Linux CLI, Git, and Docker-based development environments
A new way to make error messages in R actually make sense (rfortherestofus​.com). Plain-language error explanations and auto-fixes for R errors using Positron with a live coding demo
🦆 DuckDB Pipelines
Fedora Magazine: Creating Data Analysis Pipelines using DuckDB and RStudio (fedoramagazine​.org). Data analysis pipelines with DuckDB and RStudio on Fedora, using ELT/ETL, Python and R, in a Lakehouse approach
Advent of SQL 2025 with DuckDB and R (francoismichonneau​.net). François Michonneau, PhD, solves the Advent of SQL 2025 puzzles in DuckDB via R
This Month in the DuckDB Ecosystem: December 2025 (motherduck​.com). DuckDB ecosystem update: data-at-rest encryption, DuckLake, Gaggle, osmextract, and spatial data with Ibis, Marimo, and dlt in a December 2025 roundup
A Practical Dive Into Late Materialization in arrow-rs Parquet Reads (arrow​.apache​.org). Late materialization in arrow-rs Parquet reads enables predicate-first, column-projected queries using LM-pipelining and adaptive RowSelection in Rust
Simplicity of a Database, but the Speed of a Cache: OLAP Caches for DuckDB (ssp​.sh). Caches for DuckDB speed up on-the-fly analytics with QuackStore, cache_httpfs, and DiskCache, enabling sub-second queries in browser and cloud environments
🎨 Visualization Storytelling
Broken Chart: discover 9 visualization alternatives (dominicroye​.github​.io). 9 visualization alternatives for temperature data using R, ggplot2, and related packages
Winner of the 2025 Plotnine Plotting Contest (posit​.co). Plotnine contest winner and Posit highlights data visualization, Python and R tools, Shiny apps, and open-source collaboration
Telling a Story with Data (select-statistics​.co​.uk). Using ggplot in R for data visualisation to tell stories with climate stripes and PM2.5 data across cities
🗺️ Spatial Analyses in R
Small Nations Lead in Roundabout Adoption (stevenponce​.netlify​.app). New Zealand and Sweden lead per-capita roundabout adoption using R (tidyverse) visuals
Tour de France Route Analysis (datannery​.com). Plot Tour de France stages on a single map using R with tidyverse, sf, leaflet, rnaturalearth; GPX data from CyclingStage
k-means clusters of Barcelona Nightlife (jmsallan​.netlify​.app). K-means clustering on Barcelona nightlife venues using R, sf, broom, and kselection to identify five neighborhoods and interpret clusters
Elephant(s) in the room: Graph neural networks, embeddings, and foundation models in spatial data science (jakubnowosad​.com). GNNs, embeddings, and foundation models in spatial data science with R demonstrations and real-world tasks
Global language endangerment: scale and geographic concentration (stevenponce​.netlify​.app). Global language endangerment analyzed with Glottolog data using R (tidyverse, treemapify) by Steven Ponce
CfP Decolonising Research in Transport Geography (urbandemographics​.blogspot​.com). Decolonising transport geography research using qualitative methods, R, and open data to critique mobility inequities
🧮 Statistical Inference
Un-debunking the GAMLSS myth (zeileis​.org). R code, GAMLSS vs segmented regression, worm plots, and replication in spirometry with Zavorsky data using R packages
Smoothed ROC Curves, Calculus and Curvature (rworks​.dev). Smoothed ROC curves in R with monoH.FC splines, calculus, and curvature concepts for AUC and arc length
mostly Monte Carlo [last session of 2025] (xianblog​.wordpress​.com). Last Mostly Monte Carlo seminar of 2025: Least squares variational inference and extended SUN models for Bayesian classification
📚 Academic Research
All Emulators are Wrong, Many are Useful, and Some are More Useful Than Others: A Reproducible Comparison of Computer Model Surrogates (arxiv:stat). duqling provides a reproducible R framework to benchmark surrogate/emulator methods. This paper compares 29 emulators on 100 datasets, guiding practitioners’ choices with transparent code bases
Neural Network-based Partial-Linear Single-Index Models for Environmental Mixtures Analysis (arxiv:cs). NeuralPLSI builds an interpretable exposure index with a learnable projection, then models outcomes via a neural network. Bootstrap inference and R-friendly software support mixture studies
Autotune: fast, accurate, and automatic tuning parameter selection for LASSO (arxiv:stat). Autotune selects LASSO penalties by alternating likelihood optimization over coefficients and noise variance. Faster tuning improves prediction and sparsity checks; R/C++ package provided on GitHub
dtreg: Describing Data Analysis in Machine-Readable Format in Python and R (arxiv:cs). dtreg lets you record statistical tests and ML workflows as linked-data metadata from R/Python. It boosts FAIR, reproducibility, and Quarto-friendly reporting for open science teams
-
Hi Alastair, I am a medical oncologist passionate about data science. I was wondering whether it would be possible to include my recent work in your newsletter. You can find it here: https://www.nature.com/articles/s41698-025-01231-x and the associated R package here https://github.com/fedenichetti/prophets_package
It's a little contribution that may help spreading the knowledge about it. Thanks, and congrats for your work, I always read and learn from your newsletter.