I would like to run a Kruskal-Wallis (KW) test over certain numerical variables from a data frame, using one grouping variable. I'd prefer to do this in a loop instead of typing out all the tests, as there are many variables (more than in the example below).
Simulated data:
library(dplyr)
set.seed(123)
Data <- tbl_df(
data.frame(
muttype = as.factor(rep(c("missense", "frameshift", "nonsense"), each = 80)),
ados.tsc = runif(240, 0, 10),
ados.sa = runif(240, 0, 10),
ados.rrb = runif(240, 0, 10))
) %>%
group_by(muttype)
ados.sim <- as.data.frame(Data)
The following code works just fine outside of the loop.
kruskal.test(formula (paste((colnames(ados.sim)[2]), "~ muttype")), data =
ados.sim)
But it doesn't inside the loop:
for(i in names(ados.sim[,2:4])){
ados.mtp <- kruskal.test(formula (paste((colnames(ados.sim)[i]), "~ muttype")),
data = ados.sim)
}
I get the error:
Error in terms.formula(formula, data = data) :
invalid term in model formula
Anybody who knows how to solve this?
Much appreciated!!
Try:
results <- list()
for(i in names(ados.sim[,2:4])){
results[[i]] <- kruskal.test(formula(paste(i, "~ muttype")), data = ados.sim)
}
This also saves your results in a list and avoids overwriting your results as ados.mtp in every iteration, which I think is not what you intended to do.
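Once the tests are stored in a list, the p-values can be pulled out in one go. The following is a minimal sketch in base R only; it re-creates the simulated data under the illustrative name `sim`, and `p_values` is likewise just a name chosen for this example:

```r
# Re-create data shaped like the question's example (names are illustrative)
set.seed(123)
sim <- data.frame(
  muttype  = factor(rep(c("missense", "frameshift", "nonsense"), each = 80)),
  ados.tsc = runif(240, 0, 10),
  ados.sa  = runif(240, 0, 10),
  ados.rrb = runif(240, 0, 10)
)

# One Kruskal-Wallis test per numeric column, stored under the column name
results <- list()
for (i in names(sim)[2:4]) {
  results[[i]] <- kruskal.test(formula(paste(i, "~ muttype")), data = sim)
}

# Collect the p-values into a named numeric vector
p_values <- sapply(results, function(res) res$p.value)
p_values
```

Because the list is named by column, `results[["ados.sa"]]` still gives access to the full test object.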
Note the following:
for(i in names(ados.sim[,2:4])){
print(i)
}
[1] "ados.tsc"
[1] "ados.sa"
[1] "ados.rrb"
That is, i already gives you the name of the column. The problem in your code was that you used it to index colnames(ados.sim) as if it were an integer position. colnames() returns an unnamed character vector, so looking up an element by name returns NA:
for(i in names(ados.sim[,2:4])){
print(paste((colnames(ados.sim)[i]), "~ muttype"))
}
[1] "NA ~ muttype"
[1] "NA ~ muttype"
[1] "NA ~ muttype"
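The NA can be reproduced without any data frame at all. This is a small sketch, with `cols` standing in for the result of `colnames(ados.sim)`; it also shows `reformulate()` from base R as one possible alternative to pasting the formula together by hand:

```r
# Stand-in for colnames(ados.sim): an unnamed character vector
cols <- c("muttype", "ados.tsc", "ados.sa", "ados.rrb")

# Indexing by a string looks for an element *named* "ados.tsc"; there is none
by_name <- cols["ados.tsc"]

# Indexing by position returns the column name as expected
by_position <- cols[2]

# reformulate() builds a formula from character strings
f <- reformulate("muttype", response = "ados.tsc")
f
```

Here `by_name` is NA while `by_position` is "ados.tsc", which is why the loop variable can be used directly in the formula.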
And just for reference, all of this could also be done in the following two ways, which I often prefer since they make subsequent analysis slightly easier:
First, store all test objects in a dataframe:
library(tidyr)
df <- ados.sim %>% gather(key, value, -muttype) %>%
group_by(key) %>%
do(test = kruskal.test(x= .$value, g = .$muttype))
You can then subset the dataframe to get the test outcomes:
df[df$key == "ados.rrb",]$test
[[1]]
Kruskal-Wallis rank sum test
data: .$value and .$muttype
Kruskal-Wallis chi-squared = 2.2205, df = 2, p-value = 0.3295
Alternatively, get all results directly in a dataframe, without storing the test objects:
library(broom)
df2 <- ados.sim %>% gather(key, value, -muttype) %>%
group_by(key) %>%
do(tidy(kruskal.test(x= .$value, g = .$muttype)))
df2
# A tibble: 3 x 5
# Groups: key [3]
key statistic p.value parameter method
<chr> <dbl> <dbl> <int> <fctr>
1 ados.rrb 2.2205031 0.3294761 2 Kruskal-Wallis rank sum test
2 ados.sa 0.1319554 0.9361517 2 Kruskal-Wallis rank sum test
3 ados.tsc 0.3618102 0.8345146 2 Kruskal-Wallis rank sum test
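In current tidyr releases gather() is superseded by pivot_longer(), so the same idea can be written as below. This is only a sketch under the assumption that dplyr and tidyr are installed; `sim` and `p_table` are illustrative names, and only the p-value is kept:

```r
library(dplyr)
library(tidyr)

# Re-create data shaped like the question's example (names are illustrative)
set.seed(123)
sim <- data.frame(
  muttype  = factor(rep(c("missense", "frameshift", "nonsense"), each = 80)),
  ados.tsc = runif(240, 0, 10),
  ados.sa  = runif(240, 0, 10),
  ados.rrb = runif(240, 0, 10)
)

# Long format: one row per (variable, observation), then one test per variable
p_table <- sim %>%
  pivot_longer(-muttype, names_to = "key", values_to = "value") %>%
  group_by(key) %>%
  summarise(p.value = kruskal.test(value, muttype)$p.value)
p_table
```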
Related
I would like to create a for loop to repeat the same function for 150 variables. I am new to R and I am a bit stuck.
To give you an example of some commands I need to repeat:
N <- table(df$ var1 ==0)["TRUE"]
n <- table(df$ var1 ==1)["TRUE"]
PREV95 <- (svyciprop(~ var1 ==1, level=0.95, design= design, deff= "replace")*100)
I need to run the same functions for 150 columns. I know that I need to put all my cols in one vector = x but then I don't know how to write the loop to repeat the same command for all my variables.
Can anyone help me to write a loop?
A word in advance: loops in R can in most cases be replaced with a faster, more R-ish alternative (various flavours of apply, mapping, walking ...).
Applying a function to the columns of a data frame df:
a)
with base R, example dataset cars
my_function <- function(xs) max(xs)
lapply(cars, my_function)
b)
tidyverse-style:
library(dplyr)
cars %>%
summarise_all(my_function)
An anecdotal example: I came across an R-script which took about half an hour to complete and made abundant use of for-loops. Replacing the loops with vectorized functions and members of the apply family cut the execution time down to about 3 minutes. So while for-loops and related constructs might be more familiar when coming from another language, they might soon get in your way with R.
This chapter of Hadley Wickham's R for data science gives an introduction into iterating "the R-way".
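Applied to the counting part of the question, the same idea might look like the sketch below. The data frame `dat` and the vector `vars` are hypothetical stand-ins for the real 150 columns, and the svyciprop() step is left out because it needs a survey design object that is not shown in the question:

```r
# Hypothetical stand-in for the real data: 0/1 indicator columns
set.seed(1)
dat <- data.frame(
  var1 = sample(0:1, 20, replace = TRUE),
  var2 = sample(0:1, 20, replace = TRUE),
  var3 = sample(0:1, 20, replace = TRUE)
)
vars <- c("var1", "var2", "var3")

# One row per variable: N counts the zeros, n counts the ones
count_table <- data.frame(
  variable = vars,
  N = vapply(vars, function(v) sum(dat[[v]] == 0), integer(1)),
  n = vapply(vars, function(v) sum(dat[[v]] == 1), integer(1)),
  row.names = NULL
)
count_table
```

vapply() is used instead of sapply() so that each result is checked to be a single integer.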
Here is an approach that doesn't use loops. I've created a data set called df with three factor variables to represent your dataset as you described it, and a function eval() that does all the work and takes your original data frame as input.
First, the function selects just the factors. It then converts the factors to numeric variables so that the values are summed as 0 and 1; if we summed the factors directly, the sums would be based on the internal codes 1 and 2. Inside the function I define a helper neg() that gives the number of negative values by subtracting the sum of the 1s from the total length of the vector.
The function then creates the data frames "n" (sum of the positives), "N" (sum of the negatives), and PREV95. I used pivot_longer to get the data into a long format so that each statistic ends up in its own column when the pieces are merged, and left_join to combine the data frames and return "results".
Note that I had to leave PREV95 out because I do not have a 'design' object to pass to the survey function. That part is commented out, as is the version of the join that includes PREV95; remove the comment markers to add them back in. I think the logic for PREV95 should work, but I cannot check it without a 'design' object. The function returns a data frame, not a list, which you'll likely find easier to work with.
library(dplyr)
library(tidyr)
set.seed(100)
df <- data.frame(Var1 = factor(sample(c(0,1), 10, TRUE)),
Var2 = factor(sample(c(0,1), 10, TRUE)),
Var3 = factor(sample(c(0,1), 10, TRUE)))
eval <- function(df){
df1 <- df %>%
select_if(is.factor) %>%
mutate_all(function(x) as.numeric(as.character(x)))
neg <- function(x){
length(x) - sum(x)
}
n<- df1 %>%
summarize(across(where(is.numeric), sum)) %>%
pivot_longer(everything(), names_to = "Var", values_to = "n")
N <- df1 %>%
summarize(across(where(is.numeric), function(x) neg(x))) %>%
pivot_longer(everything(), names_to = "Var", values_to = "N")
#PREV95 <- df1 %>%
# summarize(across(where(is.numeric), function(x) survey::svyciprop(~x == 1, design = design, level = 0.95, deff = "replace")*100)) %>%
# pivot_longer(everything(), names_to = "Var", values_to = "PREV95")
results <- n %>%
left_join(N, by = "Var")
#results <- n %>%
# left_join(N, by = "Var") %>%
# left_join(PREV95, by = "Var")
return(results)
}
eval(df)
Var n N
<chr> <dbl> <dbl>
1 Var1 2 8
2 Var2 5 5
3 Var3 4 6
If you really wanted to use a for loop, here is how to make it work. Again, I've left out the survey function due to a lack of info on the parameters to make it work.
set.seed(100)
df <- data.frame(Var1 = factor(sample(c(0,1), 10, TRUE)),
Var2 = factor(sample(c(0,1), 10, TRUE)),
Var3 = factor(sample(c(0,1), 10, TRUE)))
VarList <- names(df %>% select_if(is.factor))
results <- list()
for (var in VarList){
results[[var]][["n"]] <- sum(df[[var]] == 1)
results[[var]][["N"]] <- sum(df[[var]] == 0)
}
unlist(results)
Var1.n Var1.N Var2.n Var2.N Var3.n Var3.N
2 8 5 5 4 6
I am struggling to compute a t-test between 2 groups in a data frame in R. The sample code below produces a data frame with 2 columns: variable and value. The variable column takes 2 values: "M" and "F".
data <- data.frame(variable = c("M", "F", "F"), value = c(10,5,6))
I need to show that the values for M and F are statistically different from each other. In other words, that 10 is statistically different from the mean of 5 and 6. I need to add another column to this data frame that shows the p-value. When I run the code below, it gives the following error:
result <- data %>% mutate(newcolumn = t.test(value~variable))
Error in t.test.default(x = c(5, 6), y = 10) :
not enough 'y' observations
I don't understand the question.
The test itself could be run as a one sample t test for the mean. It would be
t.test(x = c(5, 6) - 10)
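Subtracting 10 and testing against zero is the same as testing the two F values against a hypothesised mean of 10 through the mu argument of t.test(). A short sketch to show the equivalence (with only two observations the test has a single degree of freedom, so the result should be read with caution):

```r
x <- c(5, 6)

shifted <- t.test(x - 10)      # H0: mean(x - 10) equals 0
direct  <- t.test(x, mu = 10)  # H0: mean(x) equals 10

# Both formulations give the same p-value
c(shifted = shifted$p.value, direct = direct$p.value)
```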
If you want to run the test inside a dplyr pipe:
library(dplyr)
fun_t_test <- function(x){
tryCatch(t.test(x)$p.value, error = function(e) NA)
}
data %>%
mutate(newvalue = value - mean(value[variable == "M"])) %>%
group_by(variable) %>%
summarise(p.value = fun_t_test(newvalue))
## A tibble: 2 x 2
# variable p.value
# <fct> <dbl>
#1 F 0.0704
#2 M NA
Today I began working with purrr functions so I can try and use R from a more functional approach. I currently have a dataframe that contains a response variable with a lot of other variables. My goal is to split the dataframe by the levels in the response column, and then run shapiro.test() on all of the split dataframes.
For example, this code works:
# fake data
df = data.frame(y = c(rep(1,10), rep(2, 10)),
a = rnorm(20),
b = runif(20),
c = rnorm(20))
df$y <- factor(df$y)
df %>%
select(y, a) %>%
split(.$y) %>%
map(~shapiro.test(.x$a))
And this returns:
$`1`
Shapiro-Wilk normality test
data: .x$a
W = 0.93455, p-value = 0.4941
$`2`
Shapiro-Wilk normality test
data: .x$a
W = 0.7861, p-value = 0.009822
So this works as I want it to on an individual column, but I would like it to run on a given vector of any columns. My thinking right now is to create a vector of the column names I want to run and use that in a map(). I think I'm pretty close to having this right, but I'm just a little stuck.
# Function that splits the df into two groups based on y levels and run shapiro test on the split dfs
shapiro <- function(var) {
df_list = df %>%
select(y, var) %>%
split(.$y) %>%
map(~shapiro.test(.x$var))
return(df_list)
}
This fails:
> shapiro(a)
Error in .f(.x[[i]], ...) : object 'a' not found
Which makes sense, since a is not saved in the environment. This is sort of the direction I envision it going, but I don't know if there's a better way to go about it.
# the column names I want the function to take
columns = c(a, b, c)
# map it
map(columns, shapiro)
However, this gives an error since the column names aren't in the environment. Does anyone have suggestions on how to fix this or improve it?
Thanks!
Here is a tidyverse way with three corrections/improvements:
In your example call shapiro(a), you provide the column as a symbol, so we need to make sure that a is properly quoted and then later unquoted to adhere to dplyr's non-standard evaluation.
Instead of split a more tidyverse-consistent approach is to use nest.
Lastly, I would recommend making df a function argument of shapiro, thereby avoiding the dependence on a global variable.
This is the improved version
shapiro <- function(df, var) {
var <- enquo(var)
df_list <- df %>%
select(y, !!var) %>%
group_by(y) %>%
nest() %>%
mutate(test = map(setNames(data, y), ~shapiro.test(.x[[1]]))) %>%
pull(test)
return(df_list)
}
So for column df$a
shapiro(df, a)
#$`1`
#
# Shapiro-Wilk normality test
#
#data: .x[[1]]
#W = 0.93049, p-value = 0.4527
#
#
#$`2`
#
# Shapiro-Wilk normality test
#
#data: .x[[1]]
#W = 0.9268, p-value = 0.4171
and for column df$b
shapiro(df, b)
#$`1`
#
# Shapiro-Wilk normality test
#
#data: .x[[1]]
#W = 0.90313, p-value = 0.237
#
#
#$`2`
#
# Shapiro-Wilk normality test
#
#data: .x[[1]]
#W = 0.88552, p-value = 0.1509
If you want to do this with a function, you'll likely need to get into tidyeval, as in @MauritsEvers's answer. For a relatively small task like this, you could instead get away with a couple of map calls. Map over the list of data frames created by splitting by y, then use map_at to apply the test to the columns of your choice.
In the first method, you end up with some excess—any columns not in the map_at are just hanging there. The cleaner way is to select the columns you want, and then map over all columns to apply the test.
library(tidyverse)
test_list1 <- df %>%
split(.$y) %>%
map(function(split_by_y) {
split_by_y %>%
map_at(vars(a, b, c), shapiro.test)
})
test_list2 <- df %>%
split(.$y) %>%
map(function(split_by_y) {
split_by_y %>%
select(a, b, c) %>%
map(shapiro.test)
})
test_list2[[2]]$a
#>
#> Shapiro-Wilk normality test
#>
#> data: .x[[i]]
#> W = 0.95281, p-value = 0.7018
Created on 2019-03-05 by the reprex package (v0.2.1)
You can append the results to a list using a for loop:
shapiro <- function(var) {
myList = list()
for (i in seq_along(var)) {
myList[[i]] = df %>%
select(y, var = var[i]) %>%
split(.$y) %>%
map(~shapiro.test(.x$var))
}
return(myList)
}
Just make sure to use a character vector for the columns:
shapiro(c("a", "b"))
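The same character-vector idea can be written without a pipe or an explicit loop. The following is a base R sketch, not the approach of any answer above; `dat`, `shapiro_by_group`, and `tests` are illustrative names, and the grouping column is assumed to be called y as in the question:

```r
# Fake data shaped like the question's example
set.seed(42)
dat <- data.frame(
  y = factor(rep(1:2, each = 10)),
  a = rnorm(20),
  b = runif(20),
  c = rnorm(20)
)

# Run shapiro.test() on one column, separately for each level of `group`
shapiro_by_group <- function(data, var, group = "y") {
  lapply(split(data[[var]], data[[group]]), shapiro.test)
}

# Column names are passed as strings, so nothing has to exist in the environment
columns <- c("a", "b", "c")
tests <- lapply(setNames(columns, columns), function(v) shapiro_by_group(dat, v))

# Nested list: first by column, then by level of y
tests$a$`1`$p.value
```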
I have a very large dataset that I'm breaking down into smaller data frames based on one of the factors: state. Unfortunately, for some states, I have very little data (Alaska, for example). When I run my basic model on the smaller data frames I get problems with one of the factors (a gender variable that's only 'M' or 'F').
I'm using a loop to set each state's data frame. I was planning on building an if statement that would only run the model if it didn't have a 1-level factor. But I don't know how to build that if.
states_list<-c("AK", ... "WY") # shortened for brevity
resultsList<-list()
j<-1
for (i in states_list){
temp_data<-raw[raw$state==i,]
fac <- min(factor(temp_data) # <- Part I don't have right
if(fac > 1){
model<-lm(y_var~gender,data=temp_data)
resultsList[[j]]<-summary(model)
} else {
print(i)
print("doesn't have enough data points")
}
j=j+1
}
Thanks
-W
You don't need to use for loops, and I'd strongly recommend using the broom package to save your model output as a data frame, so you can access any value you need.
library(dplyr)
library(broom)
# example dataframe
dt = data.frame(state = c(rep("AA",20), rep("BB",15)),
gender = c(rep("M",10), rep("F",10), rep("M",15)),
value = rnorm(35, 100, 5), stringsAsFactors = F)
dt %>%
group_by(state) %>% # for each state
mutate(NumUniqueGenders = n_distinct(gender)) %>% # count how many unique values of gender you have (and add it for each row)
filter(NumUniqueGenders == 2) %>% # keep only rows that belong to a state with both M and F
do(tidy(lm(value ~ gender, data=.))) %>% # run model and save output as a dataframe
ungroup # forget the grouping
# # A tibble: 2 x 6
# state term estimate std.error statistic p.value
# <chr> <chr> <dbl> <dbl> <dbl> <dbl>
# 1 AA (Intercept) 99.9643476 1.092526 91.4983355 1.787784e-25
# 2 AA genderM 0.6236803 1.545066 0.4036594 6.912182e-01
So, in the end you'll get a dataframe that has 2 rows for each state. One for the intercept and one for gender.
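If you would rather keep the original loop, the missing condition can be written by counting the distinct gender values in each subset. A minimal sketch in base R, using a made-up `raw` data frame with the column names from the question (state, gender, y_var):

```r
# Made-up data: state "BB" contains only one gender
set.seed(1)
raw <- data.frame(
  state  = c(rep("AA", 20), rep("BB", 15)),
  gender = c(rep("M", 10), rep("F", 10), rep("M", 15)),
  y_var  = rnorm(35, 100, 5),
  stringsAsFactors = FALSE
)

resultsList <- list()
for (s in unique(raw$state)) {
  temp_data <- raw[raw$state == s, ]
  # Fit the model only if both genders are present in this state
  if (length(unique(temp_data$gender)) > 1) {
    resultsList[[s]] <- summary(lm(y_var ~ gender, data = temp_data))
  } else {
    message(s, " has only one gender level; skipping")
  }
}
names(resultsList)
```

Storing each summary under the state code, instead of a running counter j, makes it easy to see which states were skipped.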
I have a dataframe that looks something like this:
time id trialNum trialType accX gravX
1 1 6 7 low -0.38876217 10.185266
2 2 1 6 low 0.68254705 10.741545
3 3 3 15 high -0.21906854 9.466929
4 4 2 15 none -0.03370001 9.490829
5 5 4 1 high 0.16511542 10.986796
6 6 9 2 none -0.10441621 9.915561
You can generate something similar using this:
testDF <- data.frame(time = 1:50,
id = sample(1:10, size=50, replace=T),
trialNum = sample(1:15, size = 50, replace=T),
trialType = sample(c("none", "low", "high"),
size = 50, replace=T),
accX = sin(seq(1,50,1)),
gravX = 0.1)
And a function to calculate the average time between peaks in a filtered signal (returning mean time, and variance of the time differences):
library(dplyr)
library(signal)
library(quantmod)
calcStepTime <- function(df){
bf <- butter(1, c(0.03,0.05), type="pass")
filtered <- filtfilt(bf, df$accX - df$gravX)
peaks <- findPeaks(filtered)
peakValue <- filtered[peaks]
peakTime <- df$time[peaks]
timeDifferences <- diff(peakTime)
meanStepTime <- mean(timeDifferences)
varianceStepTime <- var(timeDifferences)
return(c(meanStepTime, varianceStepTime))
}
What I'm trying to do is apply this function to each combination of id, trialNum, and trialType using group_by:
tempTrial <-
group_by(testDF, id, trialNum, trialType) %>%
summarise(meanTime = calcStepTime(.)[1],
varianceTime= calcStepTime(.)[2])
The problem is that in the output data frame (tempTrial) every row of meanTime and varianceTime is identical.
In this toy dataset, sometimes the columns all show NA (this doesn't happen in my actual dataset).
Am I doing something incorrectly to cause each row to be identical for the 2 columns? It should be taking each combination of id, trialNum and trialType, and calculating peak times for each of those separately. However, it seems it's only storing a single value for each combination?
The chain is working properly in the sense that . refers to the grouped data frame group_by(testDF, id, trialNum, trialType). Since your defined function has no way of using the group information in ., the results are what you see (i.e. the function applied to the whole data frame).
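The effect is easy to see on a toy example. In this sketch (assuming dplyr is installed; `toy` and `out` are illustrative names), nrow(.) reports the size of the whole data frame for every group, while n() respects the grouping:

```r
library(dplyr)

toy <- data.frame(g = c("a", "a", "a", "b"), x = 1:4)

out <- toy %>%
  group_by(g) %>%
  summarise(
    rows_in_dot   = nrow(.),  # `.` is the whole piped data frame, not the group
    rows_in_group = n()       # n() is evaluated once per group
  )
out
```

Here rows_in_dot is 4 for both groups, while rows_in_group is 3 for "a" and 1 for "b".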
So your problem here is the incorrect use of summarise. Latrunculia's answer shows you that the proper way to use summarise in the way you expect is to apply the function to combinations of columns in your data frame, in which case the function applies by group in each variable.
dplyr has a do function for applications where you wish to apply a function to the data frame subset implied by group_by. Simply replace your summarise with do:
tempTrial <- group_by(testDF, id, trialNum, trialType) %>% do(meanTime = calcStepTime(.)[1], varianceTime= calcStepTime(.)[2])
The documentation for do is not terribly clear, but this post describes the application very well.
What you get right now is the result of calcStepTime applied on the whole (ungrouped) data frame for each group.
Try rewriting the function such that it depends on the variables, but not on the data frame.
calcStepTime <- function(var1, var2, var3){
bf <- butter(1, c(0.03,0.05), type="pass")
filtered <- filtfilt(bf, var1 - var2)
peaks <- findPeaks(filtered)
peakValue <- filtered[peaks]
peakTime <- var3[peaks]
timeDifferences <- diff(peakTime)
meanStepTime <- mean(timeDifferences)
varianceStepTime <- var(timeDifferences)
return(c(meanStepTime, varianceStepTime))
}
testDF %>% group_by(id, trialNum, trialType) %>%
summarise(meanTime = calcStepTime( accX, gravX, time)[1],
varianceTime= calcStepTime(accX, gravX, time)[2])
It gives the right result if you just pipe the testDF data frame into it. It breaks for the grouped DF, but I can't tell whether that's because the function is not defined for the subsets or whether it's a problem with the function itself.
Let me know if it works for the full data.
As noted by yourself and Latrunculia, calcStepTime is very likely to return NaN/NA on the 50 observation datasets. This occurs when either no peak or a single peak was found within a group of observations. You may want to defend against this in your analysis code. I used this for testing:
testDF <- data.frame(time = 1:200,
id = sample(1:2, size=200, replace=T),
trialNum = sample(1:1, size = 200, replace=T),
trialType = sample(c("low"), size = 200, replace=T),
accX = sin(seq(1,200,1)),
gravX = 0.1)
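One way to defend against groups with too few peaks is to return NA explicitly before calling diff(), mean() and var(). The helper below is a hypothetical sketch: `step_stats` is not part of the original function, and it takes the vector of peak times directly so that it runs without the signal and quantmod packages:

```r
# Hypothetical helper: summary statistics for the gaps between peak times
step_stats <- function(peak_times) {
  # Fewer than two peaks means there is no time difference to summarise
  if (length(peak_times) < 2) {
    return(c(meanStepTime = NA_real_, varianceStepTime = NA_real_))
  }
  gaps <- diff(peak_times)
  c(
    meanStepTime = mean(gaps),
    # The variance needs at least two gaps, i.e. three peaks
    varianceStepTime = if (length(gaps) > 1) var(gaps) else NA_real_
  )
}

step_stats(c(10, 50, 95))  # three peaks: both statistics are defined
step_stats(10)             # a single peak: both are NA
```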
If you change the return type of your function to a data_frame (tibble), like so:
calcStepTime <- function(df){
bf <- butter(1, c(0.03,0.05), type="pass")
filtered <- filtfilt(bf, df$accX - df$gravX)
peaks <- findPeaks(filtered)
peakValue <- filtered[peaks]
peakTime <- df$time[peaks]
timeDifferences <- diff(peakTime)
meanStepTime <- mean(timeDifferences)
varianceStepTime <- var(timeDifferences)
return (data_frame("meanStepTime" = meanStepTime,
"varianceStepTime" = varianceStepTime))
}
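Then you can take advantage of purrr::by_slice() for a fairly elegant solution (note that by_slice() has since been moved out of purrr into the purrrlyr package, so with a current purrr you would load purrrlyr instead):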
library(purrr)
testDF %>%
group_by(id, trialNum, trialType) %>%
by_slice(calcStepTime, .collate="cols")
I got this from my test sample:
# A tibble: 2 x 5
id trialNum trialType meanStepTime1 varianceStepTime1
<int> <int> <fctr> <dbl> <dbl>
1 1 1 low 42.75 802.2500
2 2 1 low 39.75 616.9167
Note that .collate="cols" is the important argument that tells by_slice() to create the named columns for the results in the output. I'm a little curious myself as to why the "1" has been appended to the names we set in the data_frame returned by your function.
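With a recent dplyr, group_modify() covers the same use case of applying a function that returns a data frame to each group. The sketch below is an assumption-laden illustration, not a drop-in replacement: `summarise_gaps` is a hypothetical stand-in for calcStepTime() so that the example runs without the signal and quantmod packages, and `toy` is made-up data:

```r
library(dplyr)

# Hypothetical stand-in for calcStepTime(): returns a one-row data frame
summarise_gaps <- function(d) {
  gaps <- diff(d$time)
  data.frame(meanGap = mean(gaps), varianceGap = var(gaps))
}

toy <- data.frame(
  id   = rep(1:2, each = 4),
  time = c(1, 3, 6, 10, 2, 4, 6, 8)
)

# group_modify() calls the function once per group and binds the results,
# keeping the grouping columns alongside the returned columns
per_group <- toy %>%
  group_by(id) %>%
  group_modify(~ summarise_gaps(.x)) %>%
  ungroup()
per_group
```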