The R Data Scientist 26-08-2025
👥 R Community & Research Applications
Epiverse community engagement and software sustainability for research software (epiverse-trace.github.io). Epiverse-TRACE refactors ringbp R package: modularized scenario_sim, customizable incubation and delay distributions, presymptomatic transmission options, roxygen2 inheritance, vignettes, bug fixes, and improved user experience
How R Powers Cancer Research and Community Teaching in Austria (r-consortium.org). Ekaterina Akimova-Höpner (R-Ladies Salzburg) discusses using R for cancer genomics, Bioconductor, Tidyverse pipelines, RUGS grants, and building a Western Austria R community with workshops and Mindful Doctorate course ideas
ANES 2024 is Out! How to Analyze the Data with R (rworks.dev). ANES 2024 data analyzed in R using survey and srvyr; weights, strata, and clusters; in-person/web samples; recoding V241229; design object with V240101b; population adjustments via tidycensus and ACS 2023; population estimates for 18+ citizens; GQ adjustments; code snippets and data loading
El rol del software de investigación a lo largo del ciclo de vida investigativo [The role of research software throughout the research life cycle] (ropensci.org). Webinars on research software and data: the role of Yanina Bellini Saibene, rOpenSci, and contributions to open research software throughout the research life cycle, with registration via Zoom
📈 Data Analysis & Visualization
Step count versus city walkability (rawdatastudies.com). Step-count data from thousands of movers linked to city walkability; code and data on GitHub; critique of data sharing, non-linear fits, and raw data availability in a countrywide natural experiment
Men's domestic chores and fertility rates - Part II, technical notes by @ellis2013nz (freerangestats.info). Technical notes on drawing directed graphs with coloured edges using ggdag, dagify, and tidy_dagitty; UN SDG data access via curl and httr, SDG Series DataCSV; data wrangling for time-use by sex, age, location; model notes with country random effects
Closing my tabs (Aug 22, 2025) (blog.stephenturner.us). Digest of AI, genomics, and data science topics: European Heart Journal study on vascular ageing post-COVID, Ground Truths by Topol, R Weekly Positron/Shiny, nf-core advisories, Julia for R users, mcptools, AI hallucinations, Bluesky for science, Moonshots in biology, and parallelism in R/Python with mall 0.2.0
Deploying a Golem Shiny App to ShinyApps.io (pacha.dev). Troubleshooting dependencies, renv, CRAN/GitHub sources, golem::add_shinyappsio_file, remotes::install_github, .rscignore, and rsconnect deployment
Replicating Hansen’s Econometrics using Armadillo (pacha.dev). Replication of the exercise datasets from Hansen's Econometrics with a C++/Armadillo implementation and tidy data principles, by Mauricio Vargas S., August 2025
🤖 AI & LLMs in R
ragnar 0.2 (tidyverse.org). Ragnar 0.2 introduces a tidy R package for building trustworthy RAG pipelines, embedding with OpenAI models, creating a duckdb store, and retrieving via semantic and BM25 scoring
AI vs Manual Scatterplots in R: ggplot2 Workflows for the AI Era (datavizpyr.com). AI vs manual ggplot2 scatterplots in R: three workflows (manual artisan, AI-assisted, and hybrid) using Palmer Penguins, ggplot2, dplyr, custom color palettes, geom_point, geom_smooth, labs, theme_minimal, and ggsave
Prejudicial Peer Review with AI (blog.stephenturner.us). PLANES framework evaluates plausibility of infectious disease forecasts using modular heuristic components (repeat, taper, shape, trend) with rplanes R package, validated on FluSight data and correlated with WIS
genAI Day 2025 (rinpharma.com). GenAI Day 2025 showcases pharma-focused GenAI use cases: LLM-driven Shiny apps, clinical trial applications, ADaM AI pair programming, CDISC data automation, multi-agent frameworks, RAG, and Shiny tooling with Posit, Roche, Pfizer, Eli Lilly, Biogen, Novo Nordisk, Appsilon, Formation Bio, A2-AI
Setting up local LLMs for R and Python (posit.co). Setting up local LLMs for R and Python with Posit's Positron, local inference workflows, RStudio/Jupyter/VS Code integration, CRAN/PyPI/Bioconductor package management, Shiny apps, and AI workflow enhancements
📊 Statistical Methods & Probability
The birthday problem (r.iresmi.net). Birthday problem simulation in R: maximum group size of 60, 1e5 iterations, multi-core with furrr, simulated birthday collisions with 365 or 366 days, plot of probability vs group size, and the group size at which the collision probability reaches 50%
Probability Density Function (PDF) (statisticalaid.com). Overview of Probability Density Function (PDF): continuous variables, PMF vs PDF, Uniform and Normal examples, CDF relation, area under curve equals 1, practical computation, applications in data science, physics, finance, engineering, healthcare, and visualization
2-Sample Median Bootstrap Test Calculator (statisticsbyjim.com). Bootstrap-based 2-sample median test, bootstrap confidence intervals, R or Python implementation, medians comparison, nonparametric inference, effect size, p-values, permutation ideas, sample sizes, robust statistics
Fact and fiction in statistics (larspsyll.wordpress.com). Frequentism and Bayesianism incomplete; causal justifications for data-generating processes; DeFinetti on idealizations; critique of model misspecification and physical constraints in health, medical, and social sciences
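For readers who want to try the idea behind the birthday problem item above, here is a minimal sketch in Python (not the linked post's R/furrr code). It computes the exact collision probability, checks it against a small seeded simulation, and looks for the smallest group size that reaches a 50% chance. The function names, the seed, and the reduced iteration count are assumptions made for illustration.

```python
import random


def collision_probability_exact(group_size, days=365):
    """Exact probability that at least two people share a birthday,
    assuming birthdays are uniform over `days` days."""
    p_no_match = 1.0
    for k in range(group_size):
        p_no_match *= (days - k) / days
    return 1.0 - p_no_match


def collision_probability_simulated(group_size, days=365, iterations=10_000, seed=0):
    """Monte Carlo estimate of the same probability (seeded for repeatability)."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(iterations):
        birthdays = [rng.randrange(days) for _ in range(group_size)]
        # A collision exists when the set of birthdays is smaller than the group.
        if len(set(birthdays)) < group_size:
            hits += 1
    return hits / iterations


def smallest_group_reaching(threshold=0.5, days=365, max_group_size=60):
    """Smallest group size whose exact collision probability reaches `threshold`,
    or None if no size up to `max_group_size` does."""
    for n in range(1, max_group_size + 1):
        if collision_probability_exact(n, days) >= threshold:
            return n
    return None


if __name__ == "__main__":
    n = smallest_group_reaching()
    print("smallest group size reaching 50%:", n)
    print("exact probability:", round(collision_probability_exact(n), 4))
    print("simulated probability:", collision_probability_simulated(n))
```

Passing days=366 to any of the functions gives the leap-year variant mentioned in the summary.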
🧠 Bayesian Statistics & Advanced Modeling
“Surprises” in BLS Jobs Revisions Became More Frequent After 2020 (medium.com/@baogorek). BLS job revisions, 2-distribution model, mclust in R, quantile residuals, surprise proportion ~11.3%, post-2020 revision patterns, GFC, dot-com era, February vs June revisions, Groshen interview
ベイズ構造時系列モデル(bstsパッケージ)の個人メモ [Personal notes on Bayesian structural time series models (the bsts package)] (watagusa.hatenablog.com). Bayesian structural time series with bsts in R: generate synthetic data, spike variables, local linear trend, seasonal components, model fitting via MCMC, and posterior summaries
Stop Guessing at Priors: R2D2’s Automated Approach to Bayesian Modeling (dspn.substack.com). Explores R2D2 Bayesian priors, R2D2M2 extensions for GLMs and multilevel models, variance allocation on R², Dirichlet decomposition, and practical code for hierarchical data
Bayes' Theorem as universe ratios (scyy.fi). Bayes’ Theorem reinterpreted as universe ratios, using prior odds, likelihoods, and universe-shares to illustrate hypothesis updating with an illustrative dinner invitation example
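The odds form of Bayes' theorem mentioned in the last item above (posterior odds = prior odds × likelihood ratio) can be sketched in a few lines of Python. This is a generic illustration, not the linked post's dinner invitation example; the function name and the numbers are hypothetical.

```python
def posterior_probability(prior, likelihood_if_true, likelihood_if_false):
    """Update a prior probability using the odds form of Bayes' theorem.

    prior               -- P(H), strictly between 0 and 1
    likelihood_if_true  -- P(evidence | H)
    likelihood_if_false -- P(evidence | not H), must be > 0
    """
    prior_odds = prior / (1.0 - prior)
    likelihood_ratio = likelihood_if_true / likelihood_if_false
    posterior_odds = prior_odds * likelihood_ratio
    # Convert odds back to a probability.
    return posterior_odds / (1.0 + posterior_odds)


if __name__ == "__main__":
    # Hypothetical numbers: prior 20%, evidence three times as likely under H.
    print(posterior_probability(0.2, 0.9, 0.3))
```

With these made-up inputs the prior odds are 1:4 and the likelihood ratio is 3, so the posterior odds are 3:4, a probability of 3/7.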
📚 Academic Research
CSTEapp: An interactive R-Shiny application of the covariate-specific treatment effect curve for visualizing individualized treatment rule (arxiv:stat). First-ever Shiny app for estimating individualized treatment rules in precision medicine through point-and-click interface. Essential for R users building interactive dashboards for causal inference and medical applications
piCurve: an R package for modeling photosynthesis-irradiance curves (arxiv:stat). Comprehensive R package with 24 models, uncertainty quantification, and tidy workflows for reproducible biological research. Demonstrates best practices in R package development and scientific computing
Novel Knockoff Generation and Importance Measures with Heterogeneous Data via Conditional Residuals and Local Gradients (arxiv:stat). Advanced variable selection method with rangerKnockoff R package for mixed data types and nonlinear models. Critical for data scientists working with complex datasets requiring rigorous statistical inference
Multinomial probit model based on joint quantile regression (arxiv:stat). Novel Bayesian quantile regression approach using Gibbs sampling for multinomial choice data analysis. Valuable for Stan users and researchers applying advanced statistical modeling techniques