Stuck with R! (loops and linear models)

I'm trying to write a loop (or something else that can do this) that fits a linear model of the natural log of cases against year, separately for each country in my data, so that I can take the slope from each model and plot the slopes as a histogram.
I'm very new to R and struggling to work out how to do this; below is a rough snapshot of what my data look like. There are 197 different countries in total, covering the years 1997 - 2019.
[snapshot of the data omitted]
Any help on how to do this would be greatly appreciated, thank you.

Based on your question, one approach is to split the data by country and fit a model to each piece with purrr.
Let's say your data are in a data frame called df; then you could do:
library(dplyr)
library(purrr)

slopes <- df %>%
  split(.$country) %>%                         # one data frame per country
  map(~ lm(log(cases) ~ year, data = .x)) %>%  # fit log(cases) ~ year in each
  map_dbl(~ coef(.x)[["year"]])                # extract the slope on year

hist(slopes)   # histogram of the per-country slopes
Let me know if you need more help with this.
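If you prefer to keep everything in one data frame, here is a hedged alternative (not from the original answer) using dplyr and broom; it assumes df has columns named country, year, and cases:
# A sketch, not a tested answer: per-country slopes via group_modify() and broom::tidy().
library(dplyr)
library(broom)

slopes_df <- df %>%
  group_by(country) %>%
  group_modify(~ tidy(lm(log(cases) ~ year, data = .x))) %>%
  ungroup() %>%
  filter(term == "year")           # keep only the slope on year

hist(slopes_df$estimate,
     main = "Per-country slopes of log(cases) ~ year",
     xlab = "slope")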

Related

Understanding the mixed_model() function in R

Even though I already have some experience with R, I would still consider myself a beginner. For my current research project, I need to run a zero-inflated negative binomial regression with fixed effects. The mixed_model() function (from the GLMMadaptive package) seems to be the only way to do this in R. However, I find the function's manual very challenging, so I am looking for some help and explanations here in the community.
I want to run a regression with violent_events as the dependent variable and project_sum as the independent variable. Additionally, I want to control for gdp, population_size and education. My units of analysis are different districts, which can be identified via the district_id variable. For each district, I have data for the years 2004 to 2010.
My initial attempt looked like this:
gm1 <- mixed_model(violent_events ~ sproject_sum + gdp + population_size + education,
                   random = year | district_id, data = DF,
                   family = zi.negative.binomial(), zi_fixed = ~ district_id)
Of course, the code is not working. I would be grateful for suggestions, and particularly for explanations that help me understand the mixed_model() function better.
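No answer was recorded for this one, but as an untested sketch (assuming the GLMMadaptive package and the column names from the question; not a verified solution), the call might look roughly like this, with the random part written as a one-sided formula and an intercept-only zero part as a starting point:
# Untested sketch: zero-inflated negative binomial with a random intercept
# per district, via GLMMadaptive::mixed_model().
library(GLMMadaptive)

gm1 <- mixed_model(
  fixed  = violent_events ~ project_sum + gdp + population_size + education,
  random = ~ 1 | district_id,        # random intercept for each district
  data   = DF,
  family = zi.negative.binomial(),   # zero-inflated negative binomial family
  zi_fixed = ~ 1                     # intercept-only model for the zero part
)
summary(gm1)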

R - How to execute multiple stats functions per group?

I would like to ask for tips on how to execute multiple statistical tests (e.g. t-test, F-test, KS-test) for multiple groups. Basically, I want to run each test once per group in my data and collect the results in a single output. In this case, I am trying to do a t-test to compare current- and previous-year data grouped by a variable (let's say dealer code).
I have data similar to the below (stress) and would like to know whether there is a generic approach for this. I managed to use rstatix for the t-test, but it doesn't have equivalent functions for the F-test and KS-test.
Any help is appreciated. Thank you!
library(datarium)
library(rstatix)
data("stress", package = "datarium")
set.seed(123)
stress %>% sample_n_by(size = 60)

stat.test <- stress %>%
  group_by(exercise) %>%
  t_test(score ~ treatment) %>%
  add_significance()
stat.test
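No answer was recorded above, but one hedged, generic pattern (an untested sketch, not from the thread) is to group the data with dplyr and run the base-R tests inside group_modify(); it assumes the same stress data, where treatment has two levels within each exercise group:
# Sketch: run a t-test, F-test and KS-test once per exercise group.
library(dplyr)
library(datarium)

data("stress", package = "datarium")

results <- stress %>%
  group_by(exercise) %>%
  group_modify(~ {
    g <- split(.x$score, .x$treatment)          # the two treatment groups
    tibble(
      t_p  = t.test(g[[1]], g[[2]])$p.value,    # Welch t-test
      f_p  = var.test(g[[1]], g[[2]])$p.value,  # F-test of equal variances
      ks_p = ks.test(g[[1]], g[[2]])$p.value    # Kolmogorov-Smirnov test
    )
  })
results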

getting p-values for a pairwise correlation (dplyr)

I am using the code below to get correlations between my dependent variable and a questionnaire response (for different levels of different conditions).
BREAK %>%
  group_by(condition, valence) %>%
  summarize(COR = cor(rt, positive_focused_cognitiveER)) %>%
  ungroup()
It gives me the correlations and their directions (+/-).
I would like to know, however, if those correlations are significant.
Is there a way to simply add a line to the code I already have to get the p-values?
Or another easy code? (I don't need fancy stuff, just the numbers)
The only fitting post I found for my problem was "Getting p values for groupwise correlation using the dplyr package", but the answer did not help me.
Thanks in advance for any tips! :)
You can compute p-values with stats::cor.test:
BREAK %>%
  group_by(condition, valence) %>%
  summarize(COR  = stats::cor.test(rt, positive_focused_cognitiveER)$estimate,
            pval = stats::cor.test(rt, positive_focused_cognitiveER)$p.value) %>%
  ungroup()
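If you would rather run cor.test only once per group, a hedged variant (not from the original answer) is to tidy the test object with broom:
# Sketch: one cor.test per group, keeping the estimate and the p-value.
library(dplyr)
library(broom)

BREAK %>%
  group_by(condition, valence) %>%
  group_modify(~ tidy(cor.test(.x$rt, .x$positive_focused_cognitiveER))) %>%
  ungroup() %>%
  select(condition, valence, estimate, p.value)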

Fitting a model to a high frequency timeseries and forecasting on the short term using fable

I have a timeseries of a value with a fairly high frequency (15 minutes). The timeseries has no missing values and shows some daily and weekly periodic components.
I'm trying to model it using fable in R, but I can't seem to find any decent result, and I wonder if I'm doing something wrong.
Here's my code, using an example dataset that can be downloaded:
library(tsibble)
library(fable)
library(dplyr)
library(lubridate)
download.file("https://srv-file7.gofile.io/download/9yo0cg/so_data.csv",
              destfile = "so_data.csv", method = "wget")

csv <- read.csv("so_data.csv") %>%
  mutate(time = ymd_hms(time)) %>%
  as_tsibble(index = time)

# Take a look
csv %>% summary()
csv %>% autoplot()
This is the timeseries:
As you can see it is pretty regular, with good daily periodicity. Let's try to model it using the default settings for a few models:
csv %>%
  model(
    ets    = ETS(value),
    arima  = ARIMA(value),
    snaive = SNAIVE(value)
  ) %>%
  forecast(h = "1 week") %>%
  autoplot(csv)
All of them fail spectacularly:
My limited understanding of this process is clearly at fault here, and the default values are not useful in this situation. I tried tuning them, but unfortunately I was unable to capture anything better. As I am a novice in the field, I cannot tell whether this is due to:
me not setting proper default parameters (I should dive much deeper into fable's reference book)
the limited data I have available (short time series, only a few months)
approach not suitable to fast-varying data (daily and weekly recurring patterns)
issues in my code
Your 15-minute frequency data exhibit multiple seasonal patterns. The models you tried are producing poor-quality forecasts because they are not designed to capture these patterns (and so they do not).
Your code looks good and (visually) the data appears to have strong patterns that an appropriate model should capture.
There are currently two more sophisticated models which work with fable that should be able to capture multiple seasonal patterns to give you better forecasts. They are:
FASSTER (https://github.com/tidyverts/fasster)
Prophet (https://github.com/facebook/prophet) with the fable interface (https://github.com/mitchelloharawild/fable.prophet/)
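For example, an untested sketch (not part of the original answer) of a prophet model with explicit daily and weekly seasonal terms, assuming the fable.prophet package and the csv tsibble built in the question:
# Sketch: prophet via fable.prophet with daily and weekly seasonality.
library(fable.prophet)

csv %>%
  model(
    prophet = prophet(value ~ growth("linear") +
                        season(period = "day", order = 10) +
                        season(period = "week", order = 5))
  ) %>%
  forecast(h = "1 week") %>%
  autoplot(csv)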

lm() saving residuals with group_by in R - confused SPSS user

This is a complete re-edit of my original question.
Let's assume I'm working on RT data gathered in a repeated-measures experiment. As part of my usual routine I always transform RT to natural logarithms and then compute a Z score for each RT within each participant, adjusting for trial number. This is typically done with a simple regression in SPSS syntax:
split file by subject.
REGRESSION
/MISSING LISTWISE
/STATISTICS COEFF OUTS R ANOVA
/CRITERIA=PIN(.05) POUT(.10)
/NOORIGIN
/DEPENDENT rtLN
/METHOD=ENTER trial
/SAVE ZRESID.
split file off.
To reproduce the same procedure in R, generate data:
# load libraries
library(dplyr); library(magrittr)

# generate data
ob    <- c(1,1,1,1,1,1, 2,2,2,2,2,2, 3,3,3,3,3,3)
ob    <- factor(ob)
trial <- c(1,2,3,4,5,6, 1,2,3,4,5,6, 1,2,3,4,5,6)
rt    <- c(300,305,290,315,320,320, 350,355,330,365,370,370,
           560,565,570,575,560,570)
cond  <- c("first","first","first","snd","snd","snd",
           "first","first","first","snd","snd","snd",
           "first","first","first","snd","snd","snd")

# Following variable is what I would get after using the SPSS code
ZreSPSS <- c( 0.4207,  0.44871, -1.7779,  0.47787,  0.47958, -0.04897,
              0.45954, 0.45487, -1.7962,  0.43034,  0.41075,  0.0407,
             -0.6037,  0.0113,   0.61928, 1.22038, -1.32533,  0.07806)

sym <- data.frame(ob, trial, rt, cond, ZreSPSS)
I could apply a formula (a blend of Mark's and Daniel's solutions) to compute residuals from an lm(log(rt) ~ trial) regression, but for some reason group_by is not working here:
sym %<>%
  group_by(ob) %>%
  mutate(z = residuals(lm(log(rt) ~ trial)),
         obM = mean(rt), obSd = sd(rt), zRev = z * obSd + obM)
Resulting values clearly show that grouping hasn't kicked in.
Any idea why it didn't work out?
Using dplyr and magrittr, you should be able to calculate z-scores within each individual with this code (it breaks the data into the groups you tell it to, then calculates within each group):
experiment %<>%
  group_by(subject) %>%
  mutate(rtLN = log(rt),
         ZRE1 = scale(rtLN))
You should then be able to use that in your model. However, one thing that may help your shift to R thinking is that you can likely build your model directly, instead of having to make all of these columns ahead of time. For example, using lme4 to treat subject as a random variable:
library(lme4)

withRandVar <- lmer(log(rt) ~ cond + (1 | as.factor(subject)),
                    data = experiment)
Then, the residuals should already be on the correct scale. Further, if you use the z-scores, you probably should be plotting on that scale. I am not actually sure what running with the z-scores as the response gains you -- it seems like you would lose information about the degree of difference between the groups.
That is, if the groups are tight, but the difference between them varies by subject, a z-score may always show them as a similar number of z-scores away. Imagine, for example, that you have two subjects, one scores (1,1,1) on condition A and (3,3,3) on condition B, and a second subject that scores (1,1,1) and (5,5,5) -- both will give z-scores of (-.9,-.9,-.9) vs (.9,.9,.9) -- losing the information that the difference between A and B is larger in subject 2.
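A quick numeric check of that example (not in the original answer) confirms the point:
# Both subjects give roughly -0.9 and 0.9, even though the raw A-B
# difference is 2 for the first subject and 4 for the second.
scale(c(1, 1, 1, 3, 3, 3))[, 1]
scale(c(1, 1, 1, 5, 5, 5))[, 1]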
If, however, you really want to convert back, you can probably use this to store the subject means and sds, then multiply the residuals by subjSD and add subjMean.
experiment %<>%
  group_by(subject) %>%
  mutate(rtLN = log(rt),
         ZRE1 = scale(rtLN),
         subjMean = mean(rtLN),
         subjSD = sd(rtLN))
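For illustration, a hypothetical continuation of that idea, reversing the z-scoring with the stored columns:
# Undo the z-scoring: multiply by the per-subject SD and add the mean back.
experiment %<>%
  mutate(rtLN_restored = ZRE1 * subjSD + subjMean)   # should equal rtLN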
mylm <- lm(x ~ y)
rstandard(mylm)
This returns the standardized residuals of the fitted model. To assign these to a variable you can do:
zresid <- rstandard(mylm)
EXAMPLE:
a <- rnorm(10, mean = 10)   # 10 random values with mean 10
b <- rnorm(10, mean = 10)
mylm <- lm(a ~ b)
mylm.zresid <- rstandard(mylm)
See also summary(mylm) and the components of the fitted lm object:
mylm$coefficients
mylm$fitted.values
mylm$xlevels
mylm$residuals
mylm$assign
mylm$call
mylm$effects
mylm$qr
mylm$terms
mylm$rank
mylm$df.residual
mylm$model
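Bringing the two answers together, a hedged sketch (not from the thread): rstandard() can be applied per participant with group_by(), using the sym data frame generated in the question. Whether its scaling exactly matches SPSS's ZRESID is worth checking against the pasted ZreSPSS values.
# Per-participant standardized residuals of log(rt) ~ trial.
library(dplyr)

sym <- sym %>%
  group_by(ob) %>%
  mutate(zre = rstandard(lm(log(rt) ~ trial))) %>%
  ungroup()

head(as.data.frame(sym))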
