I'd like to use plyr to calculate multiple empirical cumulative distribution functions using ecdf(), and then apply those functions appropriately to entries in a data frame. For instance:
# Use the diamonds dataset in ggplot2
library(diamonds)
library(plyr)
# Calculate an ecdf for each combination of cut and color
all_ecdfs <- dlply(diamonds, c("cut", "color"), function(x) ecdf(x$carat))
# Make a dataset of specific diamonds, which I want to compare to the larger set
# My particular subset of diamonds
my_diamonds <- ddply(diamonds, c("cut", "color"), summarise,
my.carat=runif(n=1, min=0.5, max=1))
If I were to do this manually, it would look something like this:
# Use the ecdf for the first entry: cut=="Fair" and color=="D"
my_diamonds$percentile <- NA
my_diamonds$percentile[my_diamonds$cut=="Fair" & my_diamonds$color=="D"] <-
all_ecdfs[["Fair.D"]](my_diamonds$my.carat[my_diamonds$cut=="Fair" & my_diamonds$color=="D"])
Seems like there should be some way to use ldply or lapply to do this automatically, but I can't figure it out.
Here's how I would do it using dplyr to make the ecdfs, and vectorizing to get the values for your data.
#get ecdfs
library(dplyr)
z <- diamonds %>% group_by(cut, color) %>%
summarise(x = list(ecdf(carat)))
Now you have a dataframe z with the functions in a list in column x.
Call the function on our data. We go by row, and get the matching cut and color, then call the function on carat:
z$x[z$cut == my_diamonds$cut & z$color == my_diamonds$color][[1]](my_diamonds$my.carat)
Related
At the moment I am trying to apply GLM predict on a dataframe. The dataframe is quite large therefore I want to apply predict by chunks.
I have found a solution but it is quite unhandy. I first create an empty dataframe and then use rbind. Is there a more efficient way of doing this?
df=data[c(),]
for (x in split(data, factor(sort(rank(row.names(data))%%10)))) {
x["prediction"]=predict(model, x, type="response")
df=rbind(df,x)
}
As the comments mention, an example of what you want your output dataframe to look like would be very helpful.
But I think you can achieve what you want by making a grouping variable first then using 'group_by', something like this:
df <- data %>%
mutate(group = rep(1:10, times = nrow(.)/10)) %>% # make an arbitrary grouping factor for this example
group_by(group) %>% # group by whatever your grouping factor is
summarise(predictions = predict(model, x, type = 'response')) # summarise could be replaced by mutate
I am trying to create a table which provides the weighted means of a list of variables by categories of another list of variables. I want to iterate over the second list of variables with each iteration appending the dataframe to the previous dataframe. I think this is supposed to involve imap_dfr from purrr but I can't quite get the code right. I want to use tidyverse for my code.
I'll use the illinois dataset from the pollster package for my example.
require(pollster)
# rv and voter dummy variables that I want to recode to 1
# and 0 so that I can get the percent of people who are 1s # in each variable. Here I recode them.
voter_vars <- c("rv", "voter")
df2 <- illinois %>%
mutate_at(
voter_vars, ~
recode(.x,
"1" = 0,
"2" = 1)) %>%
mutate_at(
voter_vars, ~
as.numeric(.x))
So those are the variables I want as the columns in my table. To get the weighted means for these two variables I write a function
news_summary <- function(var1){
var1 <- ensym(var1)
df3 <- df2 %>%
group_by(!!var1) %>%
summarise_at(vars(voter_vars),
funs(weighted.mean(., weight, na.rm=TRUE)))
return(df3)
}
This creates a data frame output if I run it for one variable in the dataset
news_summary(educ6)
But what I want to do is run it for three variables in the dataset, rowbinding each output to the previous output so I have a table with all of the weighted means together.
demographic_vars <- c("educ6", "raceethnic", "maritalstatus")
However, I don't quite understand how to put this into imap_dfr (which I think is what I am supposed to use to do this) to make it work. I tried this based on code I found elsewhere. But it doesn't work.
purrr::imap_dfr(demographic_vars ~ news_summary(!!.x))
Let's say I have a dataframe with 10+1 columns and 10 rows, and every value has the same units except for one column (the "grouping" column A).
I'm trying to accomplish the following: given a grouping of the data frames based on the last column, how do I compute the standard deviation of the whole block as a single, monolithic variable.
Let's say I do the grouping (in reality it's a cut in intervals):
df %>% group_by(A)
From what I have gathered trhoughout this site, you can use aggregate or other dplyr methods to calculate variance per column, i.e.:
this (SO won't let me embed if I have <10 rep).
In that picture we can see the grouping as colors, but by using aggregate I would get 1 standard deviation per specified column (I know you can use cbind to get more than 1 variable, for example aggregate(cbind(V1,V2)~A, df, sd)) and per group (and similar methods using dplyr and %>%, with summarise(..., FUN=sd) appended at the end).
However what I want is this: just like in Matlab when you do
group1 = df(row_group,:) % row_group would be df(:,end)==1 in this case
stdev(group1(:)) % operator (:) is key here
% iterate for every group
I have my reasons for wanting it that specific way, and of course the real dataframe is bigger than this mock example.
Minimum working example:
df <- data.frame(cbind(matrix(rnorm(100),10,10),c(1,2,1,1,2,2,3,3,3,1)))
colnames(df) <- c(paste0("V",seq(1,10)),"A")
df %>% group_by(A) %>% summarise_at(vars(V1), funs(sd(.))) # no good
aggregate(V1~A, data=df, sd) # no good
aggregate(cbind(V1,V2,V3,V4,V5,V6,V7,V8,V9,V10)~A, data=df, sd) # nope
df %>% group_by(A) %>% summarise_at(vars(V1,V2,V3,V4,V5,V6,V7,V8,V9,V10), funs(sd(.))) # same as above...
Result should be 3 doubles, each with the sd of the group (which should be close to 1 if enough columns are added).
If you want a base R solution, try the following.
sp <- split(df[-1], cut(df$A, breaks=c(2.1)))
lapply(sp, function(x) var(unlist(x)))
#$`(0.998,2]`
#[1] 0.848707
#
#$`(2,3]`
#[1] 1.80633
I have coded it in two lines to make it clearer but you can avoid the creation of sp and write the one-liner
lapply(split(df[-1], cut(df$A, breaks=c(2.1))), function(x) var(unlist(x)))
Or, for a result in another form,
sapply(sp, function(x) var(unlist(x)))
#(0.998,2] (2,3]
# 0.848707 1.806330
DATA
set.seed(6322) # make the results reproducible
df <- data.frame(cbind(matrix(rnorm(100),10,10),c(1,2,1,1,2,2,3,3,3,1)))
colnames(df) <- c(paste0("V",seq(1,10)),"A")
I need to to use ddply to apply multiple functions on multiple columns of my data frame. When I use the column name (RV in the example below), my split variables (Group and Round below) work (I get a mean value for each combination of Round and Group).
I need to do this on 20 columns and I was thinking of creating a for loop and pass column indexes.
When I use the column index (for example df[[1]] which is "RV" in my data frame), Group and Round are ignored and the grand mean is returned for all combinations of Round and Group.
I tried to pass the column name, in new.df3 but Round and Group are ignored again.
df <- data.frame("RV" = 1:5, "Group" = c("a","b","b","b","a"), "Round" = c("2","1","1","2","1"))
# this works and a separate mean for each combination of "Group" and "Round" is calculated
new.df <- ddply(df, c("Group", "Round"), summarise,
mean= mean(RV))
# this does not work and the grand mean is returned for all combinations of "Group" and "Round"
new.df2 <- ddply(df, c("Group", "Round"), summarise,
mean= mean(df[[1]]))
# this does not work and the grand mean is returned for all combinations of "Group" and "Round"
new.df3 <- ddply(df, c("Group", "Round"), summarise,
mean= mean(df[,colnames(df[1])]))
I tried "lapply" and the same issue exists. Any suggestion why this happens and how I can fix it?
As great a package as plyr is, you would do well here to update to it's newest iteration, dplyr. There, the code would be
v <- vars(RV) # add all your variables here
new.df <- df %>%
group_by(Group, Round) %>%
summarize_at(v, funs(mean))
So using this method, you plug in all your variables into v, and you'll get a mean for all of them, for each combination of Group and Round. The pipe operator (%>%) looks weird when you first see it, but it helps streamline your code. It takes the output of the previous function and sets it to be the first argument of the next function. It makes it easy to see that we're taking df, grouping by Group and Round, then summarizing them.
If you really want to stick with plyr, we can get a solution there too:
new.df <- ddply(df, c("Group", "Round"), summarise,
RV_mean = mean(RV),
var2_mean = mean(var2) # add a more variables just like this
)
We can also work from your list approaches:
new.df2 <- ddply(df, .(Group, Round), function(data_subset) { # note alternative way to reference Group and Round
as.data.frame(llply(data_subset[,c("RV"), drop = FALSE], mean)) # add your variables here
})
Note that within ddply, I always refer to the subset of the data frame within my function calls, I never refer to df. df always refers to the original data frame - not the subset you are trying to work with.
With the robCompositions package, I need to impute missing values on a group basis. For example, with the iris dataset.
library(robCompositions)
library(dplyr)
data(iris)
# Insert random NAs
for (i in 1:4) {
n_NA = sample(0:10, 1)
index_NA = sample(1:nrow(iris), n_NA)
iris[index_NA, i] = NA
}
This is where I have no idea which manip to use...
impfunc <- function(x) x %.%
regroup(list(...)) %.%
mutate(impKNNa(x[,-5], k=6, metric="Euclidean"))
impfunc(iris, "Species")
iris %.% group_by(Species) %.% mutate(impKNNa(iris[,-5], k=6, metric="Euclidean"))
Any idea?
Thanks.
Use the the do() function. It allows you to apply any arbitrary function to a grouped data frame.
You'll also want to extract not just the output from impKNNa but specifically impKNNA$xImp which is the altered data frame.
The other issue is that impKNNA doesn't want any variables except the numeric variables of interest and do() won't remove the categorical variables. So perhaps a solution is to write a wrapper function for impKNNA that will remove categorical variables and return xIMP, and use do() to apply that to a grouped data frame.