Variance of a complete group of a dataframe in R

Let's say I have a dataframe with 10+1 columns and 10 rows, and every value has the same units except for one column (the "grouping" column A).
I'm trying to accomplish the following: given a grouping of the data frame based on the last column, how do I compute the standard deviation of the whole block as a single, monolithic variable?
Let's say I do the grouping (in reality it's a cut in intervals):
df %>% group_by(A)
From what I have gathered throughout this site, you can use aggregate or other dplyr methods to calculate variance per column, i.e.:
this (SO won't let me embed if I have <10 rep).
In that picture the colors mark the groups. By using aggregate I would get one standard deviation per specified column and per group (I know you can use cbind to get more than one variable, for example aggregate(cbind(V1,V2)~A, df, sd)); similar methods using dplyr and %>%, with summarise(..., FUN=sd) appended at the end, behave the same way.
However what I want is this: just like in Matlab when you do
group1 = df(row_group,:) % row_group would be df(:,end)==1 in this case
std(group1(:)) % operator (:) is key here
% iterate for every group
I have my reasons for wanting it that specific way, and of course the real dataframe is bigger than this mock example.
Minimum working example:
df <- data.frame(cbind(matrix(rnorm(100),10,10),c(1,2,1,1,2,2,3,3,3,1)))
colnames(df) <- c(paste0("V",seq(1,10)),"A")
df %>% group_by(A) %>% summarise_at(vars(V1), funs(sd(.))) # no good
aggregate(V1~A, data=df, sd) # no good
aggregate(cbind(V1,V2,V3,V4,V5,V6,V7,V8,V9,V10)~A, data=df, sd) # nope
df %>% group_by(A) %>% summarise_at(vars(V1,V2,V3,V4,V5,V6,V7,V8,V9,V10), funs(sd(.))) # same as above...
Result should be 3 doubles, each with the sd of the group (which should be close to 1 if enough columns are added).

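If you want a base R solution, try the following. Given a single number as breaks, cut() splits the range of A into that many equal-width intervals (two here). Note that df[-1] drops the first column (V1), not the grouping column A, so the values of A enter the variance; use df[names(df) != "A"] to exclude A instead.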
sp <- split(df[-1], cut(df$A, breaks=c(2.1)))
lapply(sp, function(x) var(unlist(x)))
#$`(0.998,2]`
#[1] 0.848707
#
#$`(2,3]`
#[1] 1.80633
I have coded it in two lines to make it clearer, but you can skip creating sp and write it as a one-liner:
lapply(split(df[-1], cut(df$A, breaks=c(2.1))), function(x) var(unlist(x)))
Or, for a result in another form,
sapply(sp, function(x) var(unlist(x)))
#(0.998,2] (2,3]
# 0.848707 1.806330
DATA
set.seed(6322) # make the results reproducible
df <- data.frame(cbind(matrix(rnorm(100),10,10),c(1,2,1,1,2,2,3,3,3,1)))
colnames(df) <- c(paste0("V",seq(1,10)),"A")
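To get one standard deviation per value of A (three numbers for the mock data) with the grouping column excluded, a minimal base R sketch could look like the following. The names value_cols and group_sd are illustrative and not part of the original answer; the idea is the same split-then-unlist step, with sd in place of var.

```r
set.seed(6322) # make the results reproducible
df <- data.frame(cbind(matrix(rnorm(100), 10, 10), c(1, 2, 1, 1, 2, 2, 3, 3, 3, 1)))
colnames(df) <- c(paste0("V", seq(1, 10)), "A")

# Every column except the grouping column A (hypothetical helper name)
value_cols <- setdiff(names(df), "A")

# Split the value columns by A, flatten each block into one vector,
# and take a single standard deviation per block
group_sd <- sapply(split(df[value_cols], df$A), function(block) sd(unlist(block)))
group_sd
```

The result is a named numeric vector with one entry per level of A. Replacing df$A with a cut() of it would give one value per interval instead.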

Related

Trying to iterate over a list and append dataframes of weighted means in dplyr

I am trying to create a table which provides the weighted means of a list of variables by categories of another list of variables. I want to iterate over the second list of variables with each iteration appending the dataframe to the previous dataframe. I think this is supposed to involve imap_dfr from purrr but I can't quite get the code right. I want to use tidyverse for my code.
I'll use the illinois dataset from the pollster package for my example.
require(pollster)
# rv and voter are dummy variables that I want to recode to 1 and 0
# so that I can get the percent of people who are 1s in each variable.
# Here I recode them.
voter_vars <- c("rv", "voter")
df2 <- illinois %>%
  mutate_at(
    voter_vars, ~
      recode(.x,
             "1" = 0,
             "2" = 1)) %>%
  mutate_at(
    voter_vars, ~
      as.numeric(.x))
So those are the variables I want as the columns in my table. To get the weighted means for these two variables I write a function
news_summary <- function(var1){
  var1 <- ensym(var1)
  df3 <- df2 %>%
    group_by(!!var1) %>%
    summarise_at(vars(voter_vars),
                 funs(weighted.mean(., weight, na.rm=TRUE)))
  return(df3)
}
This creates a data frame output if I run it for one variable in the dataset
news_summary(educ6)
But what I want to do is run it for three variables in the dataset, rowbinding each output to the previous output so I have a table with all of the weighted means together.
demographic_vars <- c("educ6", "raceethnic", "maritalstatus")
However, I don't quite understand how to put this into imap_dfr (which I think is what I am supposed to use to do this) to make it work. I tried this based on code I found elsewhere. But it doesn't work.
purrr::imap_dfr(demographic_vars ~ news_summary(!!.x))
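One possible way to get the stacked table, sketched on a small made-up data frame rather than the illinois data: pass the grouping variable as a string, give the grouping column a common name so the pieces line up, and bind the per-variable results. The data frame toy, the function weighted_summary, and the column names variable and category are all assumptions made for illustration.

```r
library(dplyr)
library(purrr)

# Hypothetical stand-in for the recoded survey data
toy <- data.frame(
  educ   = c("hs", "hs", "college", "college"),
  race   = c("a", "b", "a", "b"),
  rv     = c(1, 0, 1, 1),
  voter  = c(0, 0, 1, 1),
  weight = c(1, 3, 2, 2)
)

# Weighted means of value_vars within each category of one grouping variable.
# group_var is a string; the grouping column is renamed to "category" so that
# the results for different grouping variables can be row-bound cleanly.
weighted_summary <- function(data, group_var, value_vars) {
  data %>%
    group_by(category = as.character(.data[[group_var]])) %>%
    summarise(across(all_of(value_vars),
                     ~ weighted.mean(.x, weight, na.rm = TRUE)),
              .groups = "drop") %>%
    mutate(variable = group_var, .before = 1)
}

# One summary per grouping variable, stacked into a single table
result <- map(c("educ", "race"), ~ weighted_summary(toy, .x, c("rv", "voter"))) %>%
  bind_rows()
result
```

With the real data, the vector of grouping names would be demographic_vars and the value columns would be voter_vars.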

How to add simulated values from a poisson distribution for each row and add them into the dataframe

I am trying to expand a dataframe by including, for each row, 500 simulated values from a Poisson distribution whose parameter Theta (count_mean) is already stored in the dataframe. Below I only provide a small example dataframe, since my real data consists of more than 50,000 rows (i.e. ids).
example.data <- data.frame(id=c("4008", "4118", "5330"),
count_mean=c(2, 25, 11)
)
So for each row, I know I have to generate the simulated values by:
rpois(500, example.data$count_mean)
How can I introduce these values into the same dataframe, in which each new column presents one simulated value for each row?
You can use sapply to simulate the numbers and then use cbind to bind your data together:
simdata <- t(sapply(example.data$count_mean, function(x) rpois(500, x)))
colnames(simdata) <- paste0("sim_", 1:500)
cbind(example.data, simdata)
However, I would encourage you to work with a different data format: maybe a long table would be more appropriate in this situation than the current wide table.
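As a rough illustration of the long layout, here is a base R sketch with one row per id and simulation. The names n_sim, long, sim and value are made up for this example.

```r
set.seed(1) # only to make the sketch reproducible
example.data <- data.frame(id = c("4008", "4118", "5330"),
                           count_mean = c(2, 25, 11))
n_sim <- 500

# rpois() is vectorised over lambda, so repeating each mean n_sim times
# draws n_sim values per id in a single call
long <- data.frame(
  id    = rep(example.data$id, each = n_sim),
  sim   = rep(seq_len(n_sim), times = nrow(example.data)),
  value = rpois(n_sim * nrow(example.data),
                rep(example.data$count_mean, each = n_sim))
)
head(long)
```

This keeps the table at three columns no matter how many simulations are drawn, which tends to be easier to summarise or plot by id.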
Another option using dplyr and tidyr:
library(dplyr)
library(tidyr)
example.data %>%
  rowwise() %>%
  mutate(poisson = list(rpois(500, count_mean))) %>% # 500 draws per row, stored as a list column
  unnest(poisson) %>%
  group_by(id) %>%
  mutate(count=row_number()) %>% # simulation index within each id
  pivot_wider(names_from="count", names_prefix="sim_", values_from="poisson")

R ddply ignores split factors when column index is used

I need to use ddply to apply multiple functions on multiple columns of my data frame. When I use the column name (RV in the example below), my split variables (Group and Round below) work (I get a mean value for each combination of Round and Group).
I need to do this on 20 columns and I was thinking of creating a for loop and pass column indexes.
When I use the column index (for example df[[1]] which is "RV" in my data frame), Group and Round are ignored and the grand mean is returned for all combinations of Round and Group.
I also tried to pass the column name (new.df3 below), but Round and Group are ignored again.
df <- data.frame("RV" = 1:5, "Group" = c("a","b","b","b","a"), "Round" = c("2","1","1","2","1"))
# this works and a separate mean for each combination of "Group" and "Round" is calculated
new.df <- ddply(df, c("Group", "Round"), summarise,
mean= mean(RV))
# this does not work and the grand mean is returned for all combinations of "Group" and "Round"
new.df2 <- ddply(df, c("Group", "Round"), summarise,
mean= mean(df[[1]]))
# this does not work and the grand mean is returned for all combinations of "Group" and "Round"
new.df3 <- ddply(df, c("Group", "Round"), summarise,
mean= mean(df[,colnames(df[1])]))
I tried "lapply" and the same issue exists. Any suggestion why this happens and how I can fix it?
As great a package as plyr is, you would do well here to move to its successor, dplyr. There, the code would be
library(dplyr)
v <- c("RV") # add all your variables here
new.df <- df %>%
  group_by(Group, Round) %>%
  summarize(across(all_of(v), mean)) # funs() and summarize_at() are superseded by across()
So using this method, you plug in all your variables into v, and you'll get a mean for all of them, for each combination of Group and Round. The pipe operator (%>%) looks weird when you first see it, but it helps streamline your code. It takes the output of the previous function and sets it to be the first argument of the next function. It makes it easy to see that we're taking df, grouping by Group and Round, then summarizing them.
If you really want to stick with plyr, we can get a solution there too:
new.df <- ddply(df, c("Group", "Round"), summarise,
RV_mean = mean(RV),
var2_mean = mean(var2) # add more variables just like this (var2 is a placeholder)
)
We can also work from your list approaches:
new.df2 <- ddply(df, .(Group, Round), function(data_subset) { # note alternative way to reference Group and Round
as.data.frame(llply(data_subset[,c("RV"), drop = FALSE], mean)) # add your variables here
})
Note that within ddply, I always refer to the subset of the data frame within my function calls, I never refer to df. df always refers to the original data frame - not the subset you are trying to work with.
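If the goal is to select columns by index, one option that sidesteps the scoping problem is base R's aggregate, which receives the columns themselves rather than an expression evaluated per group. A minimal sketch, where cols and means are illustrative names:

```r
df <- data.frame("RV" = 1:5,
                 "Group" = c("a", "b", "b", "b", "a"),
                 "Round" = c("2", "1", "1", "2", "1"))

cols <- 1 # indexes of the columns to summarise; extend as needed, e.g. c(1, 4, 5)

# df[cols] keeps the selected columns as a data frame, and the `by` list
# supplies the split variables, so FUN is applied within each Group/Round cell
means <- aggregate(df[cols], by = df[c("Group", "Round")], FUN = mean)
means
```

Because the split happens inside aggregate, mean only ever sees the values of one Group and Round combination at a time.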

How to compute a column that depends on a function that uses the value of a variable of each row?

This is a mock-up based on mtcars of what I would like to do:
compute a column that counts the number of cars that have less displacement (disp) than the current row within the same gear type category (am)
expected column is the values I would like to get
try1 is one try with the findInterval function, the problem is that I cannot make it count across the subsets that depend on the category (am)
I have tried solutions with *apply but I am somehow never able to make the function called work only on a subset that depends on the value of a variable of the row that is processed (hope this makes sense).
x = mtcars[1:6,c("disp","am")]
# expected values are the number of cars that have less disp while having the same am
x$expected = c(1,1,0,1,2,0)
#this ordered table is for findInterval
a = x[order(x$disp),]
a
# I use the findInterval function to get the number of values and I try subsetting the call
# -0.1 is to deal with the closed interval
x$try1 = findInterval(x$disp-0.1, a$disp[a$am==x$am])
x
# try1 values are not computed depending on the subsetting of a
Any solution will do; the use of the findInterval function is not mandatory.
I'd rather have a more general solution enabling a column value to be computed by calling a function that takes values from the current row to compute the expected value.
As pointed out by @dimitris_ps, the previous solution mishandles duplicated values. The following fixes that.
library(dplyr)
x %>%
group_by(am) %>%
mutate(expected=findInterval(disp, sort(disp) + 0.0001))
or
library(data.table)
setDT(x)[, expected:=findInterval(disp, sort(disp) + 0.0001), by=am]
Based on @Khashaa's logic, this is my approach:
library(dplyr)
mtcars %>%
group_by(am) %>%
mutate(expected=match(disp, sort(disp))-1)
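For the more general request in the question (a function that uses values from the current row), a plain base R sketch is to loop over row indexes and compare each row against the whole table. It is quadratic in the number of rows, so it is meant as an illustration rather than a fast solution; n_smaller is a made-up column name.

```r
x <- mtcars[1:6, c("disp", "am")]

# For row i, count the rows with the same am and a strictly smaller disp.
# Any other per-row rule can be swapped into the function body.
x$n_smaller <- sapply(seq_len(nrow(x)), function(i) {
  sum(x$am == x$am[i] & x$disp < x$disp[i])
})
x
```

On these six rows the new column matches the expected values listed in the question.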

Dividing columns by group (Grouping in data frame)

I would like to calculate relative response values by dividing each response/column by its group mean.
I have managed to produce an exhaustive (and thus unsatisfying) method. My data set is very large and contains multiple groups and responses.
###############
# example
# used packages
require(plyr)
# sample data
group <- c(rep("alpha", 3), rep("beta", 3), rep("gamma", 3))
a <- rnorm(9, 10,1) #some random data as response
b <- rnorm(9, 10,1)
df <- data.frame(group, a, b)
# my approach
# means for each group and response
df.means <- ddply(df, "group", colwise(mean))
# clunky method
df$rel.a[df$group=="alpha"] <-
df$a[df$group=="alpha"]/df.means$a[df.means$group=="alpha"]
df$rel.a[df$group=="beta"] <-
df$a[df$group=="beta"]/df.means$a[df.means$group=="beta"]
# ... etc
df$rel.b[df$group=="gamma"] <-
df$b[df$group=="gamma"]/df.means$b[df.means$group=="gamma"]
#desired outcome (well, perhaps with no missing values)
df
###############
I have been using R for a while now, but I still struggle with trivial data handling procedures. I believe I must be missing something. How can I better address these group(s)?
This is quite easy to express with the dplyr package, the successor to plyr for data frames:
library(dplyr)
df %>% group_by(group) %>% mutate(across(everything(), ~ .x / mean(.x)))
Here .x stands for the data in each column (by group). across(everything()) applies the function to every column except the grouping variables, so each value is divided by its group mean. (Older dplyr versions wrote this as mutate_each(funs(./mean(.))), which is now deprecated.)
With data.table package you can do this whole thing fast and easy in one line (without creating the df.means at all), simply
library(data.table)
setDT(df)[, paste0("rel.", names(df)[-1]) :=
lapply(.SD, function(x) x/mean(x)),
group]
This will run over all the columns within df (except group) by group and divide each value by the group mean.
Edit: If you want to overwrite the original columns (like in the dplyr answer), you can do this with a small modification (remove the paste0 part):
setDT(df)[, names(df)[-1] := lapply(.SD, function(x) x/mean(x)), group]
If I understand you correctly, you can also do this easily in dplyr. Given the above data
library(dplyr)
df %>% group_by(group) %>% mutate(aresp = a/ mean(a), bresp= b/mean(b))
returns:
group a b aresp bresp
1 alpha 10.052847 8.076405 1.0132828 0.8288214
2 alpha 10.002243 11.447665 1.0081822 1.1747888
3 alpha 9.708111 9.709265 0.9785350 0.9963898
4 beta 10.732693 7.483065 0.9751125 0.8202278
5 beta 11.719656 11.270522 1.0647824 1.2353754
6 beta 10.567513 8.615878 0.9601051 0.9443968
7 gamma 10.221040 11.181763 1.0035630 0.9723315
8 gamma 10.302611 11.286443 1.0115721 0.9814341
9 gamma 10.030605 12.031643 0.9848649 1.0462344
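For comparison, the same division can be sketched in base R with ave(), which returns the group mean aligned to every row. The vector resp and the rel. prefix are illustrative choices, and the seed is only there to make the sketch reproducible.

```r
set.seed(42)
group <- c(rep("alpha", 3), rep("beta", 3), rep("gamma", 3))
df <- data.frame(group, a = rnorm(9, 10, 1), b = rnorm(9, 10, 1))

resp <- c("a", "b") # response columns to rescale

# ave(col, df$group) repeats each group's mean for every row of that group,
# so the division is done row by row without any per-group subsetting
df[paste0("rel.", resp)] <- lapply(df[resp], function(col) col / ave(col, df$group))
df
```

By construction, the mean of each rel. column within a group is 1.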
