At the moment I am trying to apply GLM predict to a data frame. The data frame is quite large, so I want to run predict in chunks.
I have found a solution, but it is quite unwieldy: I first create an empty data frame and then use rbind. Is there a more efficient way of doing this?
df=data[c(),]
for (x in split(data, factor(sort(rank(row.names(data))%%10)))) {
x["prediction"]=predict(model, x, type="response")
df=rbind(df,x)
}
As the comments mention, an example of what you want your output dataframe to look like would be very helpful.
But I think you can achieve what you want by making a grouping variable first then using 'group_by', something like this:
library(dplyr)
df <- data %>%
mutate(group = rep(1:10, length.out = n())) %>% # make an arbitrary grouping factor for this example
group_by(group) %>% # group by whatever your grouping factor is
# predict on each group's rows (.x), not on an undefined x, and keep the original columns
group_modify(~ mutate(.x, prediction = predict(model, .x, type = "response"))) %>%
ungroup()
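If you would rather stay in base R, one way to avoid growing a data frame with rbind is to preallocate a prediction vector and fill it chunk by chunk. The following is only a sketch: mtcars, the logistic model, and chunk_size are stand-ins chosen for illustration, not part of the original question.

```r
# Stand-in data and model (assumptions for illustration only)
data <- mtcars
model <- glm(am ~ wt, data = data, family = binomial)

chunk_size <- 10  # arbitrary; pick whatever fits in memory
# Assign each row to a chunk: 1,1,...,2,2,...
chunk_id <- ceiling(seq_len(nrow(data)) / chunk_size)

# Preallocate the result and fill it chunk by chunk (no rbind needed)
prediction <- numeric(nrow(data))
for (idx in split(seq_len(nrow(data)), chunk_id)) {
  prediction[idx] <- predict(model, data[idx, , drop = FALSE], type = "response")
}
data$prediction <- prediction
```

Because only row indices are split, the original row order is preserved and no intermediate data frames are copied.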
Related
I am trying to create a for loop that calculates the mean of an existing variable. The data frames are titled "mali2013", "mali2014", "mali2015", "mali2016", and "mali2017", and the variable is prop_AFR. I am trying to calculate the mean of this variable for each data frame.
I tried
for (i in 2014:2017) {
variable = paste0("mali", Year, "$prop_AFR")
M_mean_AFR_data <- mean(as.numeric(variable), na.rm = TRUE)
assign(paste0("Mali_prop_AFR_", i), M_mean_AFR_data)
}
but it kept yielding NaN. Is there any way to put this in a loop, or should I just do it manually?
It looks like Stata style code to me. In R, there might be several simpler ways to do it without looping. I would try this:
library(dplyr)
df <- bind_rows(mali2013, mali2014, mali2015, mali2016, mali2017)
df %>% group_by(Year) %>%
summarize(prop_AFR = mean(prop_AFR, na.rm = TRUE))
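As for why the loop yields NaN: paste0() only builds a character string such as "mali2014$prop_AFR", as.numeric() turns that string into NA, and the mean of nothing but NA with na.rm = TRUE is NaN. If the data frames have to stay separate, a sketch like the one below looks them up by name with mget(). The toy data frames and their values are made up for illustration.

```r
# Toy stand-ins for the yearly data frames (values are made up)
mali2013 <- data.frame(prop_AFR = c(0.10, 0.20, NA))
mali2014 <- data.frame(prop_AFR = c(0.30, 0.50))
mali2015 <- data.frame(prop_AFR = c(0.40, NA))

years <- 2013:2015
# mget() looks the objects up by name and returns them as a named list
frames <- mget(paste0("mali", years))
# One mean per data frame, returned as a named numeric vector
mean_prop_AFR <- sapply(frames, function(d) mean(as.numeric(d$prop_AFR), na.rm = TRUE))
```

The result is a single named vector, which is usually easier to work with than one assign()-created object per year.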
I am trying to create a table which provides the weighted means of a list of variables by categories of another list of variables. I want to iterate over the second list of variables with each iteration appending the dataframe to the previous dataframe. I think this is supposed to involve imap_dfr from purrr but I can't quite get the code right. I want to use tidyverse for my code.
I'll use the illinois dataset from the pollster package for my example.
require(pollster)
# rv and voter are dummy variables that I want to recode to 1 and 0
# so that I can get the percent of people who are 1s in each variable.
# Here I recode them.
voter_vars <- c("rv", "voter")
df2 <- illinois %>%
mutate_at(
voter_vars, ~
recode(.x,
"1" = 0,
"2" = 1)) %>%
mutate_at(
voter_vars, ~
as.numeric(.x))
So those are the variables I want as the columns in my table. To get the weighted means for these two variables I write a function
news_summary <- function(var1){
var1 <- ensym(var1)
df3 <- df2 %>%
group_by(!!var1) %>%
summarise_at(vars(voter_vars),
funs(weighted.mean(., weight, na.rm=TRUE)))
return(df3)
}
This creates a data frame output if I run it for one variable in the dataset
news_summary(educ6)
But what I want to do is run it for three variables in the dataset, rowbinding each output to the previous output so I have a table with all of the weighted means together.
demographic_vars <- c("educ6", "raceethnic", "maritalstatus")
However, I don't quite understand how to put this into imap_dfr (which I think is what I am supposed to use to do this) to make it work. I tried this based on code I found elsewhere. But it doesn't work.
purrr::imap_dfr(demographic_vars ~ news_summary(!!.x))
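One possible approach, sketched here on a toy data frame rather than the illinois data, is to pass the grouping variable as a string, look it up with the .data pronoun, and give every output the same column names so the pieces row-bind cleanly. Plain map_dfr is enough for this; imap_dfr is only needed when the index or name of each element is used as well. The toy data, the helper summary_by, and the columns variable and category are all hypothetical.

```r
library(dplyr)
library(purrr)

# Toy stand-in for the survey data (hypothetical values)
toy <- data.frame(
  educ   = c("hs", "hs", "college", "college"),
  race   = c("a", "b", "a", "b"),
  rv     = c(1, 0, 1, 1),
  voter  = c(0, 0, 1, 1),
  weight = c(1, 1, 2, 2)
)

# Weighted means of value_vars within the categories of one grouping variable
summary_by <- function(data, group_var, value_vars) {
  data %>%
    group_by(category = as.character(.data[[group_var]])) %>%
    summarise(across(all_of(value_vars),
                     ~ weighted.mean(.x, weight, na.rm = TRUE)),
              .groups = "drop") %>%
    mutate(variable = group_var, .before = 1)
}

# Run the summary once per grouping variable and row-bind the results
result <- map_dfr(c("educ", "race"), ~ summary_by(toy, .x, c("rv", "voter")))
```

Converting the categories to character and storing them in one shared column avoids the mismatched column names that would otherwise appear when each grouping variable keeps its own name.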
I'm trying to use a function that calls on the pROC package in R to calculate the area under the curve for a number of different outcomes.
# Function used to compute area under the curve
proc_auc <- function(outcome_var, predictor_var) {
pROC::auc(outcome_var, predictor_var)}
To do this, I am intending to refer to outcome names in a vector (much like below).
# Create a vector of outcome names
outcome <- c('outcome_1', 'outcome_2')
However, I am having problems defining variables to input into this function. When I do this, I generate the error: "Error in roc.default(response, predictor, auc = TRUE, ...): 'response' must have two levels". However, I can't work out why, as I reckon I only have two levels...
I would be so happy if anyone could help me!
Here is a reproducible example using the iris dataset in R.
library(pROC)
library(datasets)
library(dplyr)
# Use iris dataset to generate binary variables needed for function
df <- iris %>% dplyr::mutate(outcome_1 = as.numeric(ntile(Sepal.Length, 4)==4),
outcome_2 = as.numeric(ntile(Petal.Length, 4)==4))%>%
dplyr::rename(predictor_1 = Petal.Width)
# Inspect binary outcome variables
df %>% group_by(outcome_1) %>% summarise(n = n()) %>% mutate(Freq = n/sum(n))
df %>% group_by(outcome_2) %>% summarise(n = n()) %>% mutate(Freq = n/sum(n))
# Function used to compute area under the curve
proc_auc <- function(outcome_var, predictor_var) {
pROC::auc(outcome_var, predictor_var)}
# Create a vector of outcome names
outcome <- c('outcome_1', 'outcome_2')
# Define variables to go into function
outcome_var <- df %>% dplyr::select(outcome[[1]])
predictor_var <- df %>% dplyr::select(predictor_1)
# Use function - first line works but not last line!
proc_auc(df$outcome_1, df$predictor_1)
proc_auc(outcome_var, predictor_var)
outcome_var and predictor_var are data frames with one column each, not vectors, so they cannot be passed directly as arguments to the auc function.
Extract the columns by name and it will work.
proc_auc(outcome_var$outcome_1, predictor_var$predictor_1)
You'll have to familiarize yourself with dplyr's non-standard evaluation, which makes it pretty hard to program with. In particular, you need to realize that passing a variable name is an indirection, and that there is a special syntax for it.
If you want to stay with the pipes / non-standard evaluation, you can use the roc_ function, which follows dplyr's earlier naming convention (a trailing underscore) for functions that take variable names as character strings instead of unquoted column names.
proc_auc2 <- function(data, outcome_var, predictor_var) {
pROC::auc(pROC::roc_(data, outcome_var, predictor_var))
}
At this point you can pass the actual column names to this new function:
proc_auc2(df, outcome[[1]], "predictor_1")
# or equivalently:
df %>% proc_auc2(outcome[[1]], "predictor_1")
That being said, for most use cases you probably want to follow @druskacik's answer and use standard R evaluation.
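To run this over the whole vector of outcome names with standard evaluation, a sketch along the following lines may help. It extracts each column as a plain vector with df[[name]]. The outcomes here are built with base R quantile() instead of ntile(), purely to keep the example free of dplyr, so the numbers will differ slightly from the question's.

```r
library(pROC)

# Binary outcomes built with base R (top quartile = 1); a stand-in for the ntile() version
df <- iris
df$outcome_1 <- as.numeric(df$Sepal.Length > quantile(df$Sepal.Length, 0.75))
df$outcome_2 <- as.numeric(df$Petal.Length > quantile(df$Petal.Length, 0.75))
df$predictor_1 <- df$Petal.Width

outcome <- c("outcome_1", "outcome_2")
# df[[name]] extracts the column as a plain vector, which is what auc() expects
auc_by_outcome <- sapply(outcome, function(name) {
  as.numeric(pROC::auc(df[[name]], df$predictor_1))
})
```

The result is a numeric vector named after the outcomes, one AUC per outcome.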
I have created a list of dataframes with split like so:
dataframes_list <- split(df, f = df$variable3)
Each of the 131 data frames is in long format and has the same variables and structure. I want to apply pivot_wider to all of them at once.
I have been struggling with some functions of the apply family, but could not get it done:
First I reduced the number of variables within each dataframe selecting only those that should be used for pivoting
dataframes_list_2 <- lapply(dataframes_list, function (x) select(x, variable1, variable2))
Then I tried pivot_wider
dataframes_list_3 <- lapply(dataframes_list_2, function(x) pivot_wider(x, names_from = variable1, values_from = variable2))
What I obtain in this way is the list with dataframes that contain 1 observation per variable, each of them being a vector of (in this case) 12 values. What I want instead is this:
Because there was a warning telling me that my observations were not uniquely identified, I varied the code above including such variable. But what I got was this:
Can someone give me some answer to this issue?
Thank you
Each dataframe in the list has this aspect:
I had the same problem and I solved it this way:
df_list <- lapply(seq_along(my_list),
function(x) pivot_wider(my_list[[x]], names_from = names, values_from = values))
bind_rows(df_list)
You will get what you needed! Hope it helps!
You could try:
map(my_list, ~ (pivot_wider(.x, names_from=1,values_from= 2)))
Here 1 and 2 are the positions of the columns in my tibbles. You can also use map_dfr, or combine the data sets afterwards with unnest or bind_rows.
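The warning about values not being uniquely identified usually means pivot_wider has no column telling it which values belong in the same row, so it packs them into list-columns. One common workaround, sketched below on made-up data, is to add a per-name row counter before pivoting. The column obs and the toy values are assumptions; variable1, variable2 and variable3 mirror the names used in the question.

```r
library(dplyr)
library(tidyr)

# Toy long-format data (made-up values), split into a list as in the question
long <- data.frame(
  variable3 = rep(c("a", "b"), each = 4),
  variable1 = rep(c("x", "y"), times = 4),
  variable2 = 1:8
)

wide_list <- lapply(split(long, long$variable3), function(d) {
  d %>%
    group_by(variable1) %>%
    mutate(obs = row_number()) %>%  # counter that identifies each output row
    ungroup() %>%
    pivot_wider(id_cols = obs, names_from = variable1, values_from = variable2)
})
```

Each element of wide_list then has one row per observation and one ordinary column per value of variable1, rather than a single row of list-columns.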
With the robCompositions package, I need to impute missing values on a group basis. For example, with the iris dataset.
library(robCompositions)
library(dplyr)
data(iris)
# Insert random NAs
for (i in 1:4) {
n_NA = sample(0:10, 1)
index_NA = sample(1:nrow(iris), n_NA)
iris[index_NA, i] = NA
}
This is where I have no idea which manip to use...
impfunc <- function(x) x %.%
regroup(list(...)) %.%
mutate(impKNNa(x[,-5], k=6, metric="Euclidean"))
impfunc(iris, "Species")
iris %.% group_by(Species) %.% mutate(impKNNa(iris[,-5], k=6, metric="Euclidean"))
Any idea?
Thanks.
Use the do() function. It allows you to apply any arbitrary function to a grouped data frame.
You'll also want to extract not just the output from impKNNa but specifically impKNNa(...)$xImp, which is the imputed data frame.
The other issue is that impKNNa doesn't want any variables except the numeric variables of interest, and do() won't remove the categorical variables. So perhaps a solution is to write a wrapper function for impKNNa that removes the categorical variables and returns xImp, and use do() to apply that to a grouped data frame.
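The wrapper pattern can be sketched as follows. To keep the example runnable without robCompositions, it uses a hypothetical stand-in imputer, impute_stub, which fills NAs with column means and returns a list with an xImp element. In real use that call would be swapped for impKNNa with the arguments from the question. It also uses group_modify rather than do(), on the assumption of a reasonably recent dplyr, because group_modify hands each group to the function without the grouping column.

```r
library(dplyr)

# Hypothetical stand-in for impKNNa: fills NAs with column means and
# returns a list with an xImp element, mimicking the shape described above
impute_stub <- function(x) {
  x[] <- lapply(x, function(col) {
    col[is.na(col)] <- mean(col, na.rm = TRUE)
    col
  })
  list(xImp = x)
}

# Insert a few NAs into a copy of iris
set.seed(1)
iris_na <- iris
iris_na[sample(nrow(iris_na), 5), 1] <- NA

# .x holds only the numeric columns of the current group (Species is dropped)
imputed <- iris_na %>%
  group_by(Species) %>%
  group_modify(~ as.data.frame(impute_stub(.x)$xImp)) %>%
  ungroup()
```

group_modify adds the Species column back to the result, so no manual re-binding of the categorical variable is needed.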