I can't figure out why my loop isn't working.
I have a database (36rows x 51columns, its name is "Seleccio") consisting of 3 factors (first 3 columns: Animal (12 animals), Diet (3 diets) and Period (3 periods)) and 48 variables (many clinical parameters) with 36 observations per column. It is a 3x3 crossover design so I want to implement a mixed model to include the Animal random effect and also Period and Diet fixed effects and the interaction between them.
A sample of the data (but with fewer rows and columns):
Animal Diet Period Var1 Var2 Var3 Var4 Var5 Var6
<chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 A A A 11 55 1.2 0.023 22 3
2 B A A 13 34 1.6 0.04 23 4
3 C B A 15 13 1.1 0.052 22 2
4 A B B 10 22 1.5 0.067 27 4
5 B C B 9 45 1.4 0.012 24 2
6 C C B 13 32 1.5 0.014 23 3
> dput(sample[1:9,])
structure(list(Animal = c("A", "B", "C", "A", "B", "C", NA, NA,
NA), Diet = c("A", "A", "B", "B", "C", "C", NA, NA, NA), Period = c("A",
"A", "A", "B", "B", "B", NA, NA, NA), Var1 = c(11, 13, 15, 10,
9, 13, NA, NA, NA), Var2 = c(55, 34, 13, 22, 45, 32, NA, NA,
NA), Var3 = c(1.2, 1.6, 1.1, 1.5, 1.4, 1.5, NA, NA, NA), Var4 = c(0.023,
0.04, 0.052, 0.067, 0.012, 0.014, NA, NA, NA), Var5 = c(22, 23,
22, 27, 24, 23, NA, NA, NA), Var6 = c(3, 4, 2, 4, 2, 3, NA, NA,
NA)), row.names = c(NA, -9L), class = c("tbl_df", "tbl", "data.frame"
))
I want to run a descriptive analysis (normality testing and checking for outliers) of each variable grouped by Diet (which is the treatment), and also fit a mixed model and run an ANOVA and a Tukey test for the fixed effects.
I can do the analysis one variable at a time, but it takes a lot of time. I have tried several times to write a for loop to automate the analysis for all the variables, but it didn't work (I'm pretty new to R).
What I got so far:
sink("output.txt") # to store the output in a file, as it would be too large to be shown in the console
vars <-as.data.frame(Seleccio[,c(4:51)])
fact <-Seleccio$Diet
dim(vars)
for (i in 1:length(vars)) {
variable <- vars[,i]
lme_cer <- lme(variable ~ Period*Diet, random = ~1 | Animal, data = Seleccio) # the model
cat("\n---------\n\n")
cat(colnames(Seleccio)[i]) # the name of each variable, so I don't get lost in the text file
cat("\n")
print(boxplot(vars[,i]~fact)$out) #checking for outliers
print(summary(lme_cer))
print(anova(lme_cer))
print(lsmeans(lme_cer, pairwise~Diet, adjust="tukey"))
}
sink()
This code runs but doesn't do the job, as it gives me wrong results for each variable because they are different from the results that I get when I analyse each variable one by one.
I would also like to add to the loop this normality testing sorted by Diet (Treatment) code. I wonder if it would be possible.
aggregate(formula = VARIABLENAME ~ Diet,
data = Seleccio,
FUN = function(x) {y <- shapiro.test(x); c(y$statistic, y$p.value)})
Thank you very much in advance to all of those who are willing to lend me a hand, any help will be greatly appreciated
I don't think I can fit the model with only 6 observations, so I couldn't work out why your loop doesn't return the same results as running each variable one by one. Maybe the problem is with cat(colnames(Seleccio)[i]): you only want the Var names, but for i = 1, 2 and 3 that code returns "Animal", "Diet" and "Period", so every block of output is labelled with the wrong name, which throws off the comparison. Using cat(colnames(vars)[i]) might correct that. If you can share more observations of Seleccio, I might be able to help more.
I would suggest creating a list to store the output:
vars <- as.data.frame(Seleccio[,c(4:51)])
fact <- Seleccio$Diet
dim(vars)
output = list() #Create empty list
for (i in 1:length(vars)) {
var = colnames(vars)[i]
output[[var]] = list() #Create one entry for each variable
variable <- vars[,i]
lme_cer <- lme(variable ~ Period*Diet, random = ~1 | Animal, data = Seleccio) # the model
#Fill that entry with each statistics:
output[[var]]$boxplot = boxplot(vars[,i]~fact)$out #checking for outliers
output[[var]]$summary = summary(lme_cer)
output[[var]]$anova = anova(lme_cer)
output[[var]]$lsmeans = lsmeans(lme_cer, pairwise~Diet, adjust="tukey")
output[[var]]$shapiro = aggregate(formula = variable ~ Diet, data = Seleccio,
FUN = function(x) {y <- shapiro.test(x); c(y$statistic, y$p.value)})
}
This way you have the results in your R environment and better options for viewing them: run output$Var1 to get all the results for Var1, which should fit in the console; run for(i in output){print(i$summary)} to get all of the summaries; etc.
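A variation on the same idea, as a minimal sketch: instead of relying on a free-standing variable vector, the formula can be built from each column name so that lme() looks the response up inside data. The data frame sim below is simulated and purely hypothetical (12 animals, 3 periods, diets rotated across periods); it only stands in for Seleccio, and the Tukey step is left out to keep the example to nlme and base R.

```r
library(nlme)  # recommended package shipped with standard R installations

# Simulated stand-in for Seleccio (assumption: 12 animals x 3 periods,
# diets rotated so that every Period x Diet combination occurs)
set.seed(1)
sim <- expand.grid(Animal = factor(1:12), Period = factor(1:3))
sim$Diet <- factor(LETTERS[(as.integer(sim$Animal) + as.integer(sim$Period)) %% 3 + 1])
animal_effect <- rnorm(12)
sim$Var1 <- 10 + animal_effect[as.integer(sim$Animal)] + rnorm(36)
sim$Var2 <- 50 + 5 * animal_effect[as.integer(sim$Animal)] + rnorm(36, sd = 3)

# Every column that is not a design factor is treated as a response
var_names <- setdiff(names(sim), c("Animal", "Diet", "Period"))

results <- lapply(var_names, function(v) {
  # Build e.g. "Var1 ~ Period * Diet" so the response is found inside `data`
  fml <- as.formula(paste(v, "~ Period * Diet"))
  fit <- lme(fml, random = ~ 1 | Animal, data = sim)
  list(
    anova = anova(fit),
    # Shapiro-Wilk statistic and p-value within each diet
    shapiro = tapply(sim[[v]], sim$Diet, function(x) {
      s <- shapiro.test(x)
      c(W = unname(s$statistic), p = s$p.value)
    })
  )
})
names(results) <- var_names
```

Because the results are named by column, results$Var1$anova and results$Var1$shapiro always refer to the same variable, which avoids the labelling mix-up described above.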
So I have data looking a little something like this:
Data:
Area  Al     Cd   Cu
A     10000  0.2  30
A     15000  0.5  25
A     NA     NA   NA
B     8000   1.1  55
B     11000  0.2  40
B     13000  0.1  40
etc.
And I want to do a Mann Whitney U test between group A and B separately for each element/column.
I have managed to do this manually for each element individually according to this:
#Data is the above dataframe
Area_A <- subset(Data, Group %in% c("A"))
Area_B <- subset(Data, Group %in% c("B"))
WhitneyU_Al <- wilcox.test(Area_A$Al, Area_B$Al, na.rm = TRUE, paired = FALSE, exact = FALSE)
(I couldn't figure out how to do it based on the rows in the column "Areas" in one data frame, which is why I divided it into two subsets).
Now, I have a lot more columns than just these three (43 to be exact), and I was wondering if there was some way to do this across all columns without changing it manually each time?
I tried a few variations of this:
WhitneyU <- wilcox.test(Area_A, Area_B, na.rm = TRUE, paired = FALSE, exact = FALSE)
#OR
WhitneyU <- wilcox.test(Area_A[2:43], Area_B[2:43], na.rm = TRUE, paired = FALSE, exact = FALSE)
But they both return the error that "'x' must be numeric".
I suspect the answer isn't this easy and that I am barking up the wrong tree? Either that, or the question/answer is too obvious and I am just not seeing it.
When I tried looking up multiple tests most answers deal with how to do multiple tests if you have multiple "groups" (as in, they have area A, B, C and D). Sorry if this has been asked before and I didn't find it (or I didn't understand it). I did look.
Any help is appreciated.
Edit:
Upon request, using dput() on part of my data it looks a bit like this:
structure(list(Group = c("A", "A", "A", "A",
"A", "B", "B", "B", "B", "B", "B"
), Al = c(NA, NA, NA, 18100, 18400, 32500, 33200, 31200,
17400, 13900, 14400), As = c(NA, NA, NA, 16.9, 14.6, 8.83, 8.59,
8.42, 13.4, 13.5, 13.7), B = c(NA, NA, NA, 18, 16, 14, 14, 11,
53, 87, 58), Bi = c(NA, NA, NA, 0.13, 0.12, 0.57, 0.55, 0.52, 0.22,
0.18, 0.21), Ca = c(NA, NA, NA, 5950, 5480, 6220, 6230, 5950,
6850, 8170, 7000), Cd = c(NA, NA, NA, 0.2, 0.2, 0.2, 0.2, 0.18,
0.31, 0.36, 0.46)), row.names = c(1L, 2L, 3L, 4L, 5L, 40L, 41L,
42L, 43L, 44L, 45L), class = c("tbl_df", "tbl", "data.frame"))
wilcox.test requires its first input (x) to be a numeric vector, so passing a whole data frame such as Area_A[2:43] fails with "'x' must be numeric". Instead, you can convert the group variable in your data frame df to a factor and use the formula interface, looping over the remaining columns one at a time. This should work to perform the test across all other columns:
df$Group <- as.factor(df$Group)
lapply(df[-1], function(x){
wilcox.test(x ~ df$Group)
})
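If only the p-values are needed, the same per-column test can be collected into a named vector. This is a sketch on a cut-down copy of the dput() data from the question (three of the element columns); the multiple-testing adjustment with p.adjust() is an addition of mine, not something the question asked for.

```r
# Small data frame modelled on the dput() output in the question
df <- data.frame(
  Group = c("A", "A", "A", "A", "A", "B", "B", "B", "B", "B", "B"),
  Al = c(NA, NA, NA, 18100, 18400, 32500, 33200, 31200, 17400, 13900, 14400),
  As = c(NA, NA, NA, 16.9, 14.6, 8.83, 8.59, 8.42, 13.4, 13.5, 13.7),
  Cd = c(NA, NA, NA, 0.2, 0.2, 0.2, 0.2, 0.18, 0.31, 0.36, 0.46)
)

element_cols <- setdiff(names(df), "Group")

# One Mann-Whitney U test per element column; the formula interface
# drops rows where the element is NA
p_values <- sapply(element_cols, function(col) {
  wilcox.test(df[[col]] ~ df$Group, exact = FALSE)$p.value
})

# Optional: adjust for the number of tests carried out
p_adjusted <- p.adjust(p_values, method = "holm")
```

p_values is named by element, so p_values["Al"] gives the result for aluminium without having to dig through the full test objects.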
I have a linear mixed effects model that determines change in grass based on both the previous year's grass and several environmental variables (and their interaction) at different distinct sites over time.
Using this mixed effects model and established, projected future environmental variables, I want to predict change in grass density. Each year's prediction thus depends on the previous year's density, located on the row above it in my dataframe. We begin with a real value from the present year, and then predict into the future.
library(tidyverse); library(lme4)
#data we have from the past, where each site has annual ChlA/Sal/Temp as well as grass density. our formula, change.mod, predicts grass.change, based on these env variables AND last year's grass coverage (grass.y1)
ThePast = tibble(
year = c(2017, 2018, 2019, 2020, 2021, 2017, 2018, 2019, 2020, 2021,2017, 2018, 2019, 2020, 2021),
site = c("A", "A", "A", "A", "A", "B", "B", "B", "B", "B", "C", "C", "C", "C", "C"),
ChlA = c(50, 210, 190, 101, 45, 20, 20, 80, 5, 40, 25, 12, 11, 5, 20),
Sal= c(1, 4, 5, 0.1, 10, 18, 14, 17, 10, 21, 30, 28, 25, 20, 22),
Temp = c(28, 21, 24, 25, 22, 19, 20, 17, 18, 15, 18, 16, 19, 20, 20),
grass = c(.5, .3, .1, .4, .1, .25, .33, .43, .44, .08, .75, .54, .69, .4, .6)) %>%
group_by(site) %>%
mutate(grass.y1 = lag(grass, order_by = year)) %>% #last year's grass
mutate(grass.change = grass - grass.y1) %>% #calculate change
ungroup()
#the ME model
change.mod = lmer(grass.change ~ grass.y1 + log10(ChlA) + log10(Sal) + grass.y1:log10(Temp) + grass.y1:log10(Sal) + (1|site), data = ThePast)
#Future environmental data per site per year, to be used to predict grass.
TheDistantFuture <- tibble(
year = c(2022, 2022, 2022, 2023, 2023, 2023, 2024, 2024, 2024),
site = c( "A", "B", "C","A", "B", "C", "A", "B", "C"),
ChlA = c(40, 200, 10, 95, 10, 4, 149, 10, 15),
Sal= c(12, 11, 15, 16, 21, 32, 21, 21, 22),
Temp = c(24, 22, 26, 28, 29, 32, 31, 20, 18))
#The final dataframe should look like this, where both of the grass columns are predicted out into the future. could have the grass.y1 column in here if we wanted
PredictedFuture <- tibble(
year = c(2022, 2022, 2022, 2023, 2023, 2023, 2024, 2024, 2024),
site = c( "A", "B", "C","A", "B", "C", "A", "B", "C"),
ChlA = c(40, 200, 10, 95, 10, 4, 149, 10, 15),
Sal= c(12, 11, 15, 16, 21, 32, 21, 21, 22),
Temp = c(24, 22, 26, 28, 29, 32, 31, 20, 18),
grass = c(0.237, 0.335, 0.457, 0.700, 0.151, 0.361, 0.176, 0.380, 0.684),
grass.change = c(0.1368, 0.2550, -0.1425, -0.1669, -0.18368, -0.0962, 0.106, 0.229, 0.323 ))
Right now, I can generate the next year's (2022) correct predictions using group_by() and predict(), referencing last year's grass density with a lag function.
#How do we get to PredictedFuture?? Here is what I'm trying:
FutureIsNow = ThePast %>%
filter(year == 2021) %>% #take last year of real data to have baseline starting grass density
bind_rows(TheDistantFuture) %>% #bind future data
arrange(site, year) %>% #arrange by site then year
group_by(site) %>% #maybe this should be rowwise?
mutate(grass.change = predict(change.mod, newdata = data.frame(
grass.y1 = lag(grass, n = 1, order_by = year),
ChlA = ChlA, Sal = Sal, Temp = Temp, site = site))) %>% #this correctly predicts 2022 grass change
mutate(grass = grass.change + lag(grass, n = 1)) #this also works to calculate grass in 2022
This df looks like this:
> FutureIsNow
# A tibble: 12 × 7
# Groups: site [3]
year site ChlA Sal Temp grass grass.change
<dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 2021 A 45 10 22 NA NA
2 2022 A 40 12 24 0.237 0.137
3 2023 A 95 16 28 NA NA
4 2024 A 149 21 31 NA NA
5 2021 B 40 21 15 NA NA
6 2022 B 200 11 22 0.335 0.255
7 2023 B 10 21 29 NA NA
8 2024 B 10 21 20 NA NA
9 2021 C 20 22 20 NA NA
10 2022 C 10 15 26 0.457 -0.143
11 2023 C 4 32 32 NA NA
12 2024 C 15 22 18 NA NA
Close, but not really repeatable...
Any ideas for predicting grass change for 2023, 2024, down the rows? I prefer working in tidyverse, though it may be possible to solve this more easily with nested for loops. Potential solutions include a rowwise data structure, or maybe to nest_by(station), but I don't know how to then reference the grass.y1 column. Maybe the solution could be via a rolling prediction with rollify, but I am not sure!
Thank you in advance for your help! Long time reader, first time asker!
So, let's go with a simpler example here for a reprex to show how purrr::accumulate2() can work for you here.
Let's set up a discrete time population model where there is also some covariate C that affects the population at each time step
$N_t = 1.5N_{t-1} + C$
Simple! Heck, we can even use accumulate2 to simulate a population, and then add some noise.
library(tidyverse)
# ok, let's make a population from a simple discrete time growth model
# but, with a covariate!
covariate <- runif(5, 5, 10)
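# use accumulate2 with the covariate to generate a population timeseries
# (the function receives the running value, the time index, and the covariate, in that order)
pop <- accumulate2(1:5, covariate, function(n_prev, t, cov) n_prev*1.5 + cov, .init = 0) %>% unlist()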
pop <- pop[-1]
pop_obs <- rnorm(5, pop, 1) #add some noise
Great! Now, turn it into data and fit a model
# the data ####
dat <- tibble(
time = 1:5,
covariate = covariate,
pop_obs = pop_obs,
lag_pop = lag(pop_obs)
)
# the model ####
mod <- lm(pop_obs ~ covariate + lag_pop, data = dat)
# does this look reasonable?
coef(mod)
My coefficients looked reasonable, but, set a seed and see!
Now we will need some data we want to simulate for - new covariates, but, we will need to incorporate the lag.
# now, simulation data ####
simdat <- tibble(
time = 6:10,
covariate = runif(5, 15,20),
lag_pop = dat$pop_obs[5] #the last lagged value!
)
Great! To make this work, we'll need a function that takes arguments of the lagged value and covariate and runs a prediction. Note, here the second argument is just a numeric. But, you could pass an element of a list - a row of a data frame, if you will. This might be accomplished later with some rowwise nesting or somesuch. For you to work out!
# OK, now we need to get predictions for pop at each step in time! ####
sim_pred <- function(lag_pop, covariate){
newdat <- tibble(covariate = covariate,
lag_pop = lag_pop)
predict(mod, newdata = newdat)
}
With this in hand, we can simulate forward using lag_pop to generate a new population. Note, we'll need to use .init to make sure our first value is correct, and then strip off the first element of the result, which is just the .init value itself.
# and let her rip!
# note, we have to init with the first value and
# for multiple covariates, make a rowwise list -
# each element of the list is
# one row of the data and the sim_pred function takes it apart
simdat %>%
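mutate(pop = accumulate2(lag_pop,
covariate,
# arguments arrive as: running value, element of lag_pop (unused), covariate
function(prev, lag0, cov) sim_pred(prev, cov),
.init = lag_pop[1]) %>% `[`(-1) %>% unlist())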
That should do!
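For comparison, here is the same rolling prediction written as a plain for loop in base R. It is only a sketch under the same toy model (the seed, future_cov and the other names are made up for illustration), but it makes the recursion explicit: each step feeds the previous prediction back in as lag_pop.

```r
set.seed(42)

# Simulate the same kind of series: N_t = 1.5 * N_{t-1} + C_t, plus noise
covariate <- runif(5, 5, 10)
pop <- numeric(5)
prev <- 0
for (t in 1:5) {
  pop[t] <- 1.5 * prev + covariate[t]
  prev <- pop[t]
}
pop_obs <- rnorm(5, pop, 1)

# Fit the model on the observed series and its one-step lag
dat <- data.frame(covariate = covariate,
                  pop_obs = pop_obs,
                  lag_pop = c(NA, head(pop_obs, -1)))
mod <- lm(pop_obs ~ covariate + lag_pop, data = dat)

# Roll the prediction forward: each step uses the previous step's prediction
future_cov <- runif(5, 15, 20)
pred <- numeric(length(future_cov))
last_pop <- pop_obs[5]  # the last observed value seeds the recursion
for (t in seq_along(future_cov)) {
  newdat <- data.frame(covariate = future_cov[t], lag_pop = last_pop)
  pred[t] <- predict(mod, newdata = newdat)
  last_pop <- pred[t]
}
pred
```

The loop is longer than the accumulate2() call, but it extends naturally to several covariates per row and, with an outer loop or a split by site, to grouped data.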
I apologize if this has already been asked, but the solutions I came across didn't seem to work for me.
I have a data set that was initially multiple Excel sheets containing different variables for the same subjects. I was able to import the data into R and combine it into a single data frame using:
x1_data <- "/data.xlsx"
excel_sheets(path = x1_data)
tab_names <- excel_sheets(path = x1_data)
list_all <- lapply(tab_names, function(x)
read_excel(path = x1_data, sheet = x))
str(list_all)
df <- rbind.fill(list_all)
df <- as_tibble(df)
However, I now have many duplicate rows for each subject, as each sheet was essentially added beneath the preceding sheet. Something like this:
Sheet 1
ID: 1,2
Age: 32, 29
Sex: M, F
Sheet 2
ID: 1, 2
Weight: 75, 89
Height: 157, 146
Combined
ID: 1, 2, 1, 2
Age: 32, 29, NA, NA
Sex: M, F, NA, NA
Weight: NA, NA, 75, 89
Height: NA, NA, 157, 146
I can't seem to figure out how to delete the duplicate ID rows without losing the data in the columns that belong to those rows. I tried aggregate and group_by without success. What I am after is this:
Combined
ID: 1, 2
Age: 32, 29
Sex: M, F
Weight: 75, 89
Height: 157, 146
Any help would be appreciated. Thanks.
Here's a possible solution:
library(tidyverse)
df <- tibble(ID = c(1, 2, 1, 2),
Age = c(32, 29, NA, NA),
Sex = c("M", "F", NA, NA),
Weight = c(NA, NA, 75, 89),
Height = c(NA, NA, 157, 146))
df1 <- df %>% filter(is.na(Age)) %>% select(ID, Weight, Height)
df2 <- df %>% filter(!is.na(Age)) %>% select(ID, Age, Sex)
df.merged <- df2 %>% left_join(df1, by = "ID")
For future questions, please provide an already formatted sample of your data, which makes it much easier to work with.
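An alternative that avoids the stacked data frame altogether is to join the sheets on ID straight from the list returned by lapply(). The sketch below uses base R's merge() with Reduce(); sheet1 and sheet2 are hypothetical stand-ins for what read_excel() would return for each sheet.

```r
# Two sheets as they might come out of read_excel() (values from the question)
sheet1 <- data.frame(ID = c(1, 2), Age = c(32, 29), Sex = c("M", "F"))
sheet2 <- data.frame(ID = c(1, 2), Weight = c(75, 89), Height = c(157, 146))
list_all <- list(sheet1, sheet2)

# Full outer join of every sheet on ID, one pair at a time
combined <- Reduce(function(a, b) merge(a, b, by = "ID", all = TRUE), list_all)
combined
```

With all = TRUE, a subject missing from one sheet is kept and gets NA in that sheet's columns, and the approach works for any number of sheets without filtering on a particular column.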
Does anyone know if it is possible to calculate a weighted mean in R when values are missing, and when values are missing, the weights for the existing values are scaled upward proportionately?
To convey this clearly, I created a hypothetical scenario. This describes the root of the question, where the scalar needs to be adjusted for each row, depending on which values are missing.
Image: Weighted Mean Calculation
File: Weighted Mean Calculation in Excel
Using weighted.mean from the base stats package with the argument na.rm = TRUE should get you the result you need. Here is a tidyverse way this could be done:
library(tidyverse)
scores <- tribble(
~student, ~test1, ~test2, ~test3,
"Mark", 90, 91, 92,
"Mike", NA, 79, 98,
"Nick", 81, NA, 83)
weights <- tribble(
~test, ~weight,
"test1", 0.2,
"test2", 0.4,
"test3", 0.4)
scores %>%
gather(test, score, -student) %>%
left_join(weights, by = "test") %>%
group_by(student) %>%
summarise(result = weighted.mean(score, weight, na.rm = TRUE))
#> # A tibble: 3 x 2
#> student result
#> <chr> <dbl>
#> 1 Mark 91.20000
#> 2 Mike 88.50000
#> 3 Nick 82.33333
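To see what na.rm = TRUE is doing, the rescaling can be written out by hand: for each student, drop the missing tests and divide by the sum of the weights that remain. This is a base R sketch using the same scores and the same fixed weights (0.2, 0.4, 0.4) as above.

```r
scores <- rbind(Mark = c(90, 91, 92),
                Mike = c(NA, 79, 98),
                Nick = c(81, NA, 83))
weights <- c(0.2, 0.4, 0.4)  # fixed weight per test, as in the tribble above

# For each row, keep only the non-missing scores and renormalise the
# remaining weights so that they sum to 1 again
result <- apply(scores, 1, function(x) {
  ok <- !is.na(x)
  sum(x[ok] * weights[ok]) / sum(weights[ok])
})
result
```

For Mike, for example, the two remaining weights of 0.4 each are effectively scaled up to 0.5, giving (79 + 98) / 2 = 88.5, which matches the weighted.mean() result.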
The best way to post an example dataset is to use dput(head(dat, 20)), where dat is the name of a dataset. Graphic images are a really bad choice for that.
DATA.
dat <-
structure(list(Test1 = c(90, NA, 81), Test2 = c(91, 79, NA),
Test3 = c(92, 98, 83)), .Names = c("Test1", "Test2", "Test3"
), row.names = c("Mark", "Mike", "Nick"), class = "data.frame")
w <-
structure(list(Test1 = c(18, NA, 27), Test2 = c(36.4, 39.5, NA
), Test3 = c(36.8, 49, 55.3)), .Names = c("Test1", "Test2", "Test3"
), row.names = c("Mark", "Mike", "Nick"), class = "data.frame")
CODE.
You can use the function weighted.mean from the base package stats, together with sapply, for this. Note that if your datasets of scores and weights are R objects of class matrix, you will not need unlist.
sapply(seq_len(nrow(dat)), function(i){
weighted.mean(unlist(dat[i,]), unlist(w[i, ]), na.rm = TRUE)
})
I've been searching for answers and trying all I can think of, but nothing works:
I want to write a function to add the values across rows in a dataframe. It's easiest to write a function since I have so many columns and don't always have to add the same ones. Here is an example of a dataframe:
ExampleData <- data.frame(Participant = 1:7,
Var1 = c(2, NA, 13, 15, 0, 2, NA),
Var2 = c(NA, NA, 1, 0, NA, 4, 2),
Var3 = c(6, NA, 1, 0, 1, 5, 3),
Var4 = c(12, NA, NA, 4, 10, 1, 4),
Var5 = c(10, NA, 3, 5, NA, 4, 4))
The conditions: If all values across a row are NA, the sum should be NA. If there is at least one value across a row that is a number (>= 0, or not NA), then rowSums should ignore NA's and add the values.
The best solution I've reached so far is:
addition <- function(x) {
if(all(is.na(x))){
NA
}else{
rowSums(x, na.rm = TRUE)
}
}
addition(ExampleData[, c("Var1", "Var2", "Var3")])
The output is: [1] 8 0 15 15 1 11 5
But it should be: [1] 8 NA 15 15 1 11 5
Does anyone know how to do this?
Thank you.
The reason is that all(is.na(x)) checks whether all the elements of the whole dataset are NA, rather than checking row by row. The output of is.na(x) is a logical matrix, and a matrix is basically a vector with dimension attributes, so wrapping it in all() returns a single TRUE or FALSE for the entire dataset.
For example,
all(is.na(matrix(c(1:9, NA), 5, 2)))
#[1] FALSE
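We can change the function to the following. It relies on NA^0 being 1 and NA^1 being NA in R, so only the rows with no non-missing values get multiplied by NA: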
addition <- function(x) {
rowSums(x, na.rm = TRUE) * NA^(rowSums(!is.na(x))==0)
}
addition(ExampleData[, c("Var1", "Var2", "Var3")])
#[1] 8 NA 15 15 1 11 5
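If the NA^ trick feels too terse, the same logic can be spelled out in two steps. This is just a sketch of an equivalent, more explicit version; addition2 is a name made up here to keep it apart from the function above.

```r
ExampleData <- data.frame(Participant = 1:7,
                          Var1 = c(2, NA, 13, 15, 0, 2, NA),
                          Var2 = c(NA, NA, 1, 0, NA, 4, 2),
                          Var3 = c(6, NA, 1, 0, 1, 5, 3),
                          Var4 = c(12, NA, NA, 4, 10, 1, 4),
                          Var5 = c(10, NA, 3, 5, NA, 4, 4))

addition2 <- function(x) {
  total <- rowSums(x, na.rm = TRUE)        # NA-ignoring sum per row
  all_missing <- rowSums(!is.na(x)) == 0   # TRUE where a row has no values at all
  total[all_missing] <- NA                 # those rows should be NA, not 0
  total
}

addition2(ExampleData[, c("Var1", "Var2", "Var3")])
```

Both versions count the non-missing values per row with rowSums(!is.na(x)); the only difference is how the all-NA rows are then turned into NA.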