At the moment I am trying to apply GLM predict on a dataframe. The dataframe is quite large, so I want to apply predict in chunks.
I have found a solution, but it is quite unhandy: I first create an empty dataframe and then rbind to it in a loop. Is there a more efficient way of doing this?
df <- data[c(), ]
for (x in split(data, factor(sort(rank(row.names(data)) %% 10)))) {
  x["prediction"] <- predict(model, x, type = "response")
  df <- rbind(df, x)
}
As the comments mention, an example of what you want your output dataframe to look like would be very helpful.
But I think you can achieve what you want by making a grouping variable first and then using group_by(), something like this:
df <- data %>%
  mutate(group = rep(1:10, length.out = nrow(.))) %>% # make an arbitrary grouping variable for this example
  group_by(group) %>% # group by whatever your grouping factor is
  mutate(prediction = predict(model, cur_data(), type = "response")) # cur_data() is the current chunk; mutate keeps all columns (summarise would return only the predictions)
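If you want to keep explicit chunking (for example to limit memory use), a minimal alternative sketch is to split the rows once, predict on each chunk, and bind the results a single time at the end instead of growing the data frame with rbind() inside the loop. This assumes the model and data objects from the question:
chunks <- split(data, cut(seq_len(nrow(data)), breaks = 10, labels = FALSE)) # ~10 contiguous chunks
df <- do.call(rbind, lapply(chunks, function(chunk) {
  chunk$prediction <- predict(model, chunk, type = "response") # predict on this chunk only
  chunk
}))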
I am trying to create a table which provides the weighted means of a list of variables by categories of another list of variables. I want to iterate over the second list of variables, appending each iteration's output to the previous one. I think this is supposed to involve imap_dfr from purrr, but I can't quite get the code right. I want to use the tidyverse for my code.
I'll use the illinois dataset from the pollster package for my example.
require(pollster)
# rv and voter are dummy variables that I want to recode to 1 and 0
# so that I can get the percent of people who are 1s in each variable.
# Here I recode them.
voter_vars <- c("rv", "voter")
df2 <- illinois %>%
  mutate_at(voter_vars, ~ recode(.x, "1" = 0, "2" = 1)) %>%
  mutate_at(voter_vars, ~ as.numeric(.x))
So those are the variables I want as the columns in my table. To get the weighted means for these two variables I write a function
news_summary <- function(var1) {
  var1 <- ensym(var1)
  df3 <- df2 %>%
    group_by(!!var1) %>%
    summarise_at(vars(voter_vars),
                 funs(weighted.mean(., weight, na.rm = TRUE)))
  return(df3)
}
This creates a data frame output if I run it for one variable in the dataset
news_summary(educ6)
But what I want to do is run it for three variables in the dataset, rowbinding each output to the previous output so I have a table with all of the weighted means together.
demographic_vars <- c("educ6", "raceethnic", "maritalstatus")
However, I don't quite understand how to put this into imap_dfr (which I think is what I am supposed to use here) to make it work. I tried this, based on code I found elsewhere, but it doesn't work:
purrr::imap_dfr(demographic_vars ~ news_summary(!!.x))
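One sketch that should work (untested against the original data, and the helper name news_summary_chr is just illustrative) is to sidestep ensym() by letting the helper take the column name as a string, convert it with rlang::sym(), and then map over the character vector; map_dfr() row-binds the outputs, filling the non-shared grouping columns with NA:
news_summary_chr <- function(var1) {
  var1 <- rlang::sym(var1) # string -> symbol
  df2 %>%
    group_by(!!var1) %>%
    summarise_at(vars(voter_vars),
                 funs(weighted.mean(., weight, na.rm = TRUE)))
}
purrr::map_dfr(demographic_vars, news_summary_chr)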
I need to check my data for outliers and I have 67 different variables, so I don't want to do it by hand. This is my code for checking one variable by hand (I have three grouping factors: voiceID, gender and VP), but I don't know how to change it into a loop that iterates over the columns.
features %>%
group_by(voiceID, gender, VP) %>%
identify_outliers(meanF0)
The values are all numbers. The output should tell me which rows are outliers, and for which combinations of the factors. Thanks for the help.
The output of identify_outliers is a tibble with multiple columns, and it takes a single variable at a time; the variable name can be either quoted or unquoted. So we can group_split the data by the grouping variables, then loop over the columns of interest and apply identify_outliers to each:
library(dplyr)
library(purrr)
library(rstatix)
nm1 <- c("score", "score2")
demo.data %>%
group_split(gender) %>%
map(~ map(nm1, function(x) .x %>%
identify_outliers(x)))
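To apply the same pattern to the data described in the question (a sketch, assuming a features data frame with grouping columns voiceID, gender and VP plus the 67 numeric measurement columns), the grouping columns can be excluded from the set of variables to check:
check_vars <- setdiff(names(features), c("voiceID", "gender", "VP")) # the 67 measurement columns
features %>%
  group_split(voiceID, gender, VP) %>%
  map(~ map(check_vars, function(x) .x %>%
        identify_outliers(x)))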
If we want to count the outliers,
features %>%
group_by(voiceID, gender, VP) %>%
summarise(across(everything(), ~ length(boxplot(., plot = FALSE)$out)))
I am trying to do t-tests comparing the values in time1/2/3... against threshold. Here is my data frame:
time.df1<-data.frame("condition" =c("A","B","C","A","C","B"),
"time1" = c(1,3,2,6,2,3) ,
"time2" = c(1,1,2,8,2,9) ,
"time3" = c(-2,12,4,1,0,6),
"time4" = c(-8,3,2,1,9,6),
"threshold" = c(-2,3,8,1,9,-3))
and I tried to compare each pair of values with:
time.df1%>%
select_if(is.numeric) %>%
purrr::map_df(~ broom::tidy(t.test(. ~ threshold)))
However, I got this error message
Error in eval(predvars, data, env) : object 'threshold' not found
So, I tried another way (maybe it is wrong)
time.df2<-time.df1%>%gather(TF,value,time1:time4)
time.df2%>% group_by(condition) %>% do(tidy(t.test(value~TF, data=.)))
Sadly, I got this error even when I limited condition to only two levels (A, B):
Error in t.test.formula(value ~ TF, data = .) : grouping factor must have exactly 2 levels
I want to loop a t-test over each time column against the threshold column, per condition, and then use broom::tidy to get the results in a tidy format. My approaches apparently aren't working; any advice on improving my code is much appreciated.
An alternative route would be to define a function with the required options for t.test() up front, then create a data frame for each pair of variables (i.e. each combination of 'time*' and 'threshold'), nest them into list columns, and use map() combined with the relevant functions from 'broom' to simplify the outputs.
library(tidyverse)
library(broom)
ttestfn <- function(data, ...){
# amend this function to include required options for t.test
res = t.test(data$resp, data$threshold)
return(res)
}
df2 <-
time.df1 %>%
gather(time, "resp", - threshold, -condition) %>%
group_by(time) %>%
nest() %>%
mutate(ttests = map(data, ttestfn),
glances = map(ttests, glance))
# df2 has data frames, t-test objects and glance summaries
# as separate list columns
Now it's easy to query this object to extract what you want
df2 %>%
unnest(glances, .drop=TRUE)
However, it's unclear to me what you want to do with 'condition', so I'm wondering if it is more straightforward to reframe the question in terms of a GLM (as camille suggested in the comments: ANOVA is part of the GLM family).
Reshape the data, define 'threshold' as the reference level of the 'time' factor and the default 'treatment' contrasts used by R will compare each time to 'threshold':
time.df2 <-
time.df1 %>%
gather(key = "time", value = "resp", -condition) %>%
mutate(time = fct_relevel(time, "threshold")) # define 'threshold' as baseline
fit.aov <- aov(resp ~ condition * time, data = time.df2)
summary(fit.aov)
summary.lm(fit.aov) # coefficients and p-values
Of course this assumes that all subjects are independent (i.e. there are no repeated measures). If not, then you'll need to move on to more complicated procedures. Anyway, moving to appropriate GLMs for the study design should help minimise the pitfalls of doing multiple t-tests on the same data set.
We could remove threshold from the select and then reintroduce it by creating a small data.frame that pairs each time column with the threshold column, then run t.test on those two columns:
library(tidyverse)
time.df1 %>%
  select_if(is.numeric) %>%
  select(-threshold) %>%
  map_df(~ with(data.frame(time = .x, time.df1['threshold']),
                broom::tidy(t.test(time, threshold))))
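If you also want the output to record which time column each row of results came from, map_df()'s .id argument can add a label column (the name time_var below is just illustrative):
time.df1 %>%
  select_if(is.numeric) %>%
  select(-threshold) %>%
  map_df(~ with(data.frame(time = .x, time.df1['threshold']),
                broom::tidy(t.test(time, threshold))),
         .id = "time_var")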
Background
I am working with a large dataset from a repeated measures clinical trial in R, where I want to do some data manipulations for each subject. This could be extraction of the max value in column x for each subject or the mean of column y for each subject.
Problem
I am fond of using the dplyr package and pipes, which led me to the group_by function. But when I apply it, the values I want to extract do not seem to be computed by subject as they are supposed to be; instead they are computed across the entire dataset.
Code
This is what I have done so far:
data <- read.csv(file="group_by_question.csv", header=TRUE, sep=",")
library(dplyr)
library(plyr)
data <- tbl_df(data)
test <- data %>%
filter(!is.na(wght)) %>%
dplyr::group_by(subject_id) %>%
mutate(maxwght=max(wght),meanwght=mean(wght)) %>%
ungroup()
Sample of the test dataframe:
Find a .csv sample of my dataset here:
https://drive.google.com/file/d/1wGkSQyJXqSswThiNsqC26qaP7d3catyX/view?usp=sharing
Is this what you want? In my example below, the output shows the max value for the maxwght column by subject id. You could replace max() with mean, for example, if you require the mean value for maxwght for each subject id.
library(dplyr)
data <- read.csv(file="group_by_question.csv", header=TRUE, sep=",")
test <- data %>%
  filter(!is.na(wght)) %>%
  group_by(subject_id) %>% # group first so max()/mean() are computed per subject
  mutate(maxwght = max(wght), meanwght = mean(wght)) %>%
  summarise(value = max(maxwght)) %>%
  ungroup()
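If the goal is simply one row per subject_id containing both statistics, a single grouped summarise may be all that's needed (a minimal sketch, assuming the same wght and subject_id columns from the shared .csv):
data %>%
  filter(!is.na(wght)) %>%
  group_by(subject_id) %>%
  summarise(maxwght = max(wght),   # per-subject maximum of wght
            meanwght = mean(wght)) # per-subject mean of wght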