I have a very large dataset that I'm breaking down into smaller data frames based on one of the factors: state. Unfortunately, for some states, I have very little data (Alaska, for example). When I run my basic model on the smaller data frames I get problems with one of the factors (a gender variable that's only 'M' or 'F').
I'm using a loop to set each state's data frame. I was planning on building an if statement that would only run the model if it didn't have a 1-level factor. But I don't know how to build that if.
states_list<-c("AK", ... "WY") # shortened for brevity
resultsList<-list()
j<-1
for (i in states_list){
  temp_data <- raw[raw$state == i, ]
  fac <- min(factor(temp_data))  # <- Part I don't have right
  if (fac > 1) {
    model <- lm(y_var ~ gender, data = temp_data)
    resultsList[[j]] <- summary(model)
  } else {
    print(i)
    print("doesn't have enough data points")
  }
  j <- j + 1
}
Thanks
-W
You don't need to use for loops, and I'd strongly recommend using the broom package to save your model output as a dataframe, so you can access any value you need.
library(dplyr)
library(broom)
# example dataframe
dt = data.frame(state = c(rep("AA",20), rep("BB",15)),
gender = c(rep("M",10), rep("F",10), rep("M",15)),
value = rnorm(35, 100, 5), stringsAsFactors = F)
dt %>%
group_by(state) %>% # for each state
mutate(NumUniqueGenders = n_distinct(gender)) %>% # count how many unique values of gender you have (and add it for each row)
filter(NumUniqueGenders == 2) %>% # keep only rows that belong to a state with both M and F
do(tidy(lm(value ~ gender, data=.))) %>% # run model and save output as a dataframe
ungroup # forget the grouping
# # A tibble: 2 x 6
# state term estimate std.error statistic p.value
# <chr> <chr> <dbl> <dbl> <dbl> <dbl>
# 1 AA (Intercept) 99.9643476 1.092526 91.4983355 1.787784e-25
# 2 AA genderM 0.6236803 1.545066 0.4036594 6.912182e-01
So, in the end you'll get a dataframe with 2 rows for each state that survives the filter: one for the intercept and one for gender.
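If you do want to keep the per-state loop from the question, the missing check is just a count of distinct gender values. This is a sketch I'm adding (not part of the original answer), reusing states_list, raw, y_var, and gender from the question:
resultsList <- list()
for (i in states_list) {
  temp_data <- raw[raw$state == i, ]
  # fit the model only if both gender levels actually occur in this state
  if (length(unique(temp_data$gender)) > 1) {
    resultsList[[i]] <- summary(lm(y_var ~ gender, data = temp_data))
  } else {
    message(i, " doesn't have enough data points")
  }
}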
I have a data frame with hundreds of names and hundreds of values per name. Now I want to filter some of the values based on a mathematical rule applied only to a certain subset of the data. A simplified example would be filtering out the max value for each name.
I can hard-code it as shown below, but would love to avoid that.
library(dplyr)
##
names <- c('A', 'A', 'B', 'B')
values <- c(1,2,3,4)
df <- data.frame(names, values)
##
df %>% filter(names != 'A' | values != max(subset(df, names == 'A')$values),
              names != 'B' | values != max(subset(df, names == 'B')$values))
Desired output:
names values
1 A 1
2 B 3
I would consider creating a loop within a dplyr filter that calculates the max value per name and then applies both conditions within the filter, if possible.
Filtering out the max value for each name:
df %>%
group_by(names) %>%
filter(values != max(values))
# # A tibble: 2 x 2
# # Groups: names [2]
# names values
# <chr> <dbl>
# 1 A 1
# 2 B 3
Or if you mean removing the max values per name from the entire data frame, whenever they occur:
df %>%
group_by(names) %>%
slice_max(values) %>%
select(values) %>%
anti_join(df, ., by = "values")
# # A tibble: 2 x 2
# # Groups: names [2]
# names values
# <chr> <dbl>
# 1 A 1
# 2 B 3
An option in base R:
subset(df, values != ave(values, names, FUN = max))
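To see what this does with the example df above (a quick illustration I'm adding, not part of the original answer): ave() returns each name's maximum aligned to every row, so the inequality drops exactly the per-name maxima.
ave(df$values, df$names, FUN = max)
# [1] 2 2 4 4
subset(df, values != ave(values, names, FUN = max))
#   names values
# 1     A      1
# 3     B      3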
For example:
df <- data.frame("Treatment" = c(rep("A", 2), rep("B", 2)), "Price" = 1:4, "Cost" = 2:5)
I want to summarize the data by treatments for all the variables I have, and put them together, so I define a function to do this for each variable first, and then rbind them later on.
SummarizeFn <- function(x,y,z) {
  df1 <- x %>% group_by(Treatment) %>%
    summarize(n = n(), Mean = mean(y), SD = sd(y))
  df1$Var <- z # add a column to show which variable those statistics belong to.
  df1
}
SumPrice <- SummarizeFn(df, df$Price, "Price")
However, the results are:
Treatment n Mean SD Var
<fct> <int> <dbl> <dbl> <chr>
1 A 2 2.5 1.29 Price
2 B 2 2.5 1.29 Price
They are the mean and sd of all the observations, but not the grouped observations by Treatment. What is the problem here?
If I take the code out of the function environment, it works totally fine. Please help, thanks.
If you have a better way to achieve my purpose, that would be great! Thanks!
When you refer to columns with $ inside dplyr pipes, they do not respect grouping and behave as if applied to the entire dataframe. Instead, you can use {{ }} (curly-curly) to pass column names into the function.
library(dplyr)
SummarizeFn <- function(x, y, z) {
  x %>%
    group_by(Treatment) %>%
    summarize(n = n(), Mean = mean({{y}}), SD = sd({{y}}), Var = z)
}
SummarizeFn(df, Price, "Price")
# Treatment n Mean SD Var
# <fct> <int> <dbl> <dbl> <chr>
#1 A 2 1.5 0.707 Price
#2 B 2 3.5 0.707 Price
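To also cover the "rbind them later" part of the question, here is a sketch I'm adding (not part of the original answer) that passes each column name as a string via the .data pronoun and stacks the results:
vars <- c("Price", "Cost")
# SummarizeFn is the {{ }} version above; .data[[v]] selects the column named by the string v
bind_rows(lapply(vars, function(v) SummarizeFn(df, .data[[v]], v)))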
This is related to the question of non-standard evaluation. Funnily enough, I just wrote an article on the subject. It is quite hard to pass string column names with dplyr. If you need to do that, use rlang::sym (or rlang::syms) together with !! (or !!!).
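For example, a string-taking version of the function above could look like this (a sketch I'm adding to illustrate the rlang approach; the function name is mine):
library(rlang)
SummarizeFnStr <- function(x, col) {
  col_sym <- sym(col) # turn the string "Price" into the symbol Price
  x %>%
    group_by(Treatment) %>%
    summarize(n = n(), Mean = mean(!!col_sym), SD = sd(!!col_sym), Var = col)
}
SummarizeFnStr(df, "Price")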
Regarding your problem, I think data.table offers you a concise solution:
library(data.table)
dt <- as.data.table(mtcars)
output <- dt[,lapply(.SD, function(d) return(list(.N,mean(d),sd(d)))),
.SDcols = c("mpg","qsec")]
output[,'stat' := c("observations","mean","sd")]
output
# output
# mpg qsec stat
# 1: 32 32 observations
# 2: 20.09062 17.84875 mean
# 3: 6.026948 1.786943 sd
I used an anonymous function with lapply, but you could use a more sophisticated function defined before the summary step. Change .SDcols to include more variables if needed.
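For instance, a named helper defined ahead of time might look like this (a sketch I'm adding, not from the original answer; the extra "hp" column only shows how .SDcols extends, and length(d) stands in for .N, which is only visible inside the data.table call):
summary_stats <- function(d) list(length(d), mean(d), sd(d))
output <- dt[, lapply(.SD, summary_stats), .SDcols = c("mpg", "qsec", "hp")]
output[, stat := c("observations", "mean", "sd")]
output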
I am struggling to compute a t-test between 2 groups in a data frame in R. The sample code below produces a data frame with 2 columns: variable and value. The variable column has 2 groups: "M" and "F".
data <- data.frame(variable = c("M", "F", "F"), value = c(10,5,6))
I need to show that the values for M and F are statistically different from each other. In other words, that 10 is statistically different from the mean of 5 and 6. I need to add another column to this data frame that shows the p-value. When I run the code below, it gives the following error:
result <- data %>% mutate(newcolumn = t.test(value~variable))
Error in t.test.default(x = c(5, 6), y = 10) :
not enough 'y' observations
I don't understand the question.
The test itself could be run as a one-sample t-test for the mean. It would be:
t.test(x = c(5, 6) - 10)
If you want to run it within a dplyr pipe:
library(dplyr)
fun_t_test <- function(x){
  tryCatch(t.test(x)$p.value, error = function(e) NA)
}
data %>%
mutate(newvalue = value - mean(value[variable == "M"])) %>%
group_by(variable) %>%
summarise(p.value = fun_t_test(newvalue))
## A tibble: 2 x 2
# variable p.value
# <fct> <dbl>
#1 F 0.0704
#2 M NA
I would like to run a Kruskal-Wallis test (KW-test) over certain numerical variables from a data frame, using one grouping variable. I'd prefer to do this in a loop instead of typing out all the tests, as there are many variables (more than in the example below).
Simulated data:
library(dplyr)
set.seed(123)
Data <- tbl_df(
data.frame(
muttype = as.factor(rep(c("missense", "frameshift", "nonsense"), each = 80)),
ados.tsc = runif(240, 0, 10),
ados.sa = runif(240, 0, 10),
ados.rrb = runif(240, 0, 10))
) %>%
group_by(muttype)
ados.sim <- as.data.frame(Data)
The following code works just fine outside of the loop.
kruskal.test(formula (paste((colnames(ados.sim)[2]), "~ muttype")), data =
ados.sim)
But it doesn't inside the loop:
for(i in names(ados.sim[,2:4])){
ados.mtp <- kruskal.test(formula (paste((colnames(ados.sim)[i]), "~ muttype")),
data = ados.sim)
}
I get the error:
Error in terms.formula(formula, data = data) :
invalid term in model formula
Does anybody know how to solve this?
Much appreciated!!
Try:
results <- list()
for(i in names(ados.sim[,2:4])){
  results[[i]] <- kruskal.test(formula(paste(i, "~ muttype")), data = ados.sim)
}
This also saves your results in a list and avoids overwriting them as ados.mtp in every iteration, which is presumably not what you intended.
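Once the loop has run, you can pull whatever you need out of the stored htest objects; for example (a small addition of mine, not part of the original answer), the p-values for all three variables:
sapply(results, function(res) res$p.value)
# returns a named vector with one p-value per variable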
Note the following:
for(i in names(ados.sim[,2:4])){
print(i)
}
[1] "ados.tsc"
[1] "ados.sa"
[1] "ados.rrb"
That is, i already gives you the name of the column. The problem in your code was that you tried to use it like an integer for subsetting, which turned the outcome into NA.
for(i in names(ados.sim[,2:4])){
print(paste((colnames(ados.sim)[i]), "~ muttype"))
}
[1] "NA ~ muttype"
[1] "NA ~ muttype"
[1] "NA ~ muttype"
And just for reference, all of this could also be done in the following two ways, which I often prefer since they make subsequent analysis slightly easier:
First, store all test objects in a dataframe:
library(tidyr)
df <- ados.sim %>% gather(key, value, -muttype) %>%
group_by(key) %>%
do(test = kruskal.test(x= .$value, g = .$muttype))
You can then subset the dataframe to get the test outcomes:
df[df$key == "ados.rrb",]$test
[[1]]
Kruskal-Wallis rank sum test
data: .$value and .$muttype
Kruskal-Wallis chi-squared = 2.2205, df = 2, p-value = 0.3295
Alternatively, get all results directly in a dataframe, without storing the test objects:
library(broom)
df2 <- ados.sim %>% gather(key, value, -muttype) %>%
group_by(key) %>%
do(tidy(kruskal.test(x= .$value, g = .$muttype)))
df2
# A tibble: 3 x 5
# Groups: key [3]
key statistic p.value parameter method
<chr> <dbl> <dbl> <int> <fctr>
1 ados.rrb 2.2205031 0.3294761 2 Kruskal-Wallis rank sum test
2 ados.sa 0.1319554 0.9361517 2 Kruskal-Wallis rank sum test
3 ados.tsc 0.3618102 0.8345146 2 Kruskal-Wallis rank sum test
I have a dataframe that looks something like this:
time id trialNum trialType accX gravX
1 1 6 7 low -0.38876217 10.185266
2 2 1 6 low 0.68254705 10.741545
3 3 3 15 high -0.21906854 9.466929
4 4 2 15 none -0.03370001 9.490829
5 5 4 1 high 0.16511542 10.986796
6 6 9 2 none -0.10441621 9.915561
You can generate something similar using this:
testDF <- data.frame(time = 1:50,
id = sample(1:10, size=50, replace=T),
trialNum = sample(1:15, size = 50, replace=T),
trialType = sample(c("none", "low", "high"),
size = 50, replace=T),
accX = sin(seq(1,50,1)),
gravX = 0.1)
And a function to calculate the average time between peaks in a filtered signal (returning mean time, and variance of the time differences):
library(dplyr)
library(signal)
library(quantmod)
calcStepTime <- function(df){
  bf <- butter(1, c(0.03, 0.05), type = "pass")
  filtered <- filtfilt(bf, df$accX - df$gravX)
  peaks <- findPeaks(filtered)
  peakValue <- filtered[peaks]
  peakTime <- df$time[peaks]
  timeDifferences <- diff(peakTime)
  meanStepTime <- mean(timeDifferences)
  varianceStepTime <- var(timeDifferences)
  return(c(meanStepTime, varianceStepTime))
}
What I'm trying to do is apply this function to each combination of id, trialNum, and trialType using group_by:
tempTrial <-
group_by(testDF, id, trialNum, trialType) %>%
summarise(meanTime = calcStepTime(.)[1],
varianceTime= calcStepTime(.)[2])
The problem is that in the output dataframe (tempTrial) every row of meanTime and varianceTime is identical.
In this toy dataset, sometimes the columns all show NA (this doesn't happen in my actual dataset).
Am I doing something incorrectly to cause each row to be identical for the 2 columns? It should be taking each combination of id, trialNum, and trialType, and calculating peak times for each of those separately. However, it seems it's only storing a single value for each combination?
The chain is working properly in the sense that . refers to the grouped data frame group_by(testDF, id, trialNum, trialType). Since your defined function has no way of using the group information in ., the results are what you see (i.e. the function applied to the whole data frame).
So your problem here is the incorrect use of summarise. Latrunculia's answer shows you that the proper way to use summarise in the way you expect is to apply the function to combinations of columns in your data frame, in which case the function applies by group in each variable.
dplyr has a do function for applications where you wish to apply a function to the data frame subset implied by group_by. Simply replace your summarise with do:
tempTrial <- group_by(testDF, id, trialNum, trialType) %>%
  do(meanTime = calcStepTime(.)[1], varianceTime = calcStepTime(.)[2])
The documentation for do is not terribly clear, but this post describes the application very well.
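One caveat (my addition, not from the original answer): do() with named arguments stores its results as list-columns, so if you want plain numeric columns you can unlist them afterwards:
tempTrial %>%
  ungroup() %>%
  mutate(meanTime = unlist(meanTime), varianceTime = unlist(varianceTime))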
What you get right now is the result of calcStepTime applied on the whole (ungrouped) data frame for each group.
Try rewriting the function such that it depends on the variables, but not on the data frame.
calcStepTime <- function(var1, var2, var3){
  bf <- butter(1, c(0.03, 0.05), type = "pass")
  filtered <- filtfilt(bf, var1 - var2)
  peaks <- findPeaks(filtered)
  peakValue <- filtered[peaks]
  peakTime <- var3[peaks]
  timeDifferences <- diff(peakTime)
  meanStepTime <- mean(timeDifferences)
  varianceStepTime <- var(timeDifferences)
  return(c(meanStepTime, varianceStepTime))
}
testDF %>%
  group_by(id, trialNum, trialType) %>%
  summarise(meanTime = calcStepTime(accX, gravX, time)[1],
            varianceTime = calcStepTime(accX, gravX, time)[2])
It gives the right result if you just pipe the testDF data frame into it. It breaks for the grouped DF, but I can't tell if that's because the function is not defined for the subsets or if it's a problem with the function itself.
Let me know if it works for the full data.
As noted by yourself and Latrunculia, calcStepTime is very likely to return NaN/NA on the 50 observation datasets. This occurs when either no peak or a single peak was found within a group of observations. You may want to defend against this in your analysis code. I used this for testing:
testDF <- data.frame(time = 1:200,
id = sample(1:2, size=200, replace=T),
trialNum = sample(1:1, size = 200, replace=T),
trialType = sample(c("low"), size = 200, replace=T),
accX = sin(seq(1,200,1)),
gravX = 0.1)
If you change the return type of your function to a data_frame (tibble), like so:
calcStepTime <- function(df){
  bf <- butter(1, c(0.03, 0.05), type = "pass")
  filtered <- filtfilt(bf, df$accX - df$gravX)
  peaks <- findPeaks(filtered)
  peakValue <- filtered[peaks]
  peakTime <- df$time[peaks]
  timeDifferences <- diff(peakTime)
  meanStepTime <- mean(timeDifferences)
  varianceStepTime <- var(timeDifferences)
  return(data_frame("meanStepTime" = meanStepTime,
                    "varianceStepTime" = varianceStepTime))
}
Then you can take advantage of purrr::by_slice() for a fairly elegant solution:
library(purrr)
testDF %>%
group_by(id, trialNum, trialType) %>%
by_slice(calcStepTime, .collate="cols")
I got this from my test sample:
# A tibble: 2 x 5
id trialNum trialType meanStepTime1 varianceStepTime1
<int> <int> <fctr> <dbl> <dbl>
1 1 1 low 42.75 802.2500
2 2 1 low 39.75 616.9167
Note that .collate="cols" is the important argument that tells by_slice() to create the named columns for the results in the output. I'm a little curious myself as to why the "1" has been appended to the names we set in the data_frame returned by your function.