Dividing columns by group (Grouping in data frame) - r

I would like to calculate relative response values by dividing each response/column by its group mean.
I have managed to produce a laborious (and thus unsatisfying) method. My data set is very large and contains multiple groups and responses.
###############
# example
# used packages
require(plyr)
# sample data
group <- c(rep("alpha", 3), rep("beta", 3), rep("gamma", 3))
a <- rnorm(9, 10,1) #some random data as response
b <- rnorm(9, 10,1)
df <- data.frame(group, a, b)
# my approach
# means for each group and response
df.means <- ddply(df, "group", colwise(mean))
# clunky method
df$rel.a[df$group=="alpha"] <-
df$a[df$group=="alpha"]/df.means$a[df.means$group=="alpha"]
df$rel.a[df$group=="beta"] <-
df$a[df$group=="beta"]/df.means$a[df.means$group=="beta"]
# ... etc
df$rel.b[df$group=="gamma"] <-
df$b[df$group=="gamma"]/df.means$b[df.means$group=="gamma"]
#desired outcome (well, perhaps with no missing values)
df
###############
I have been using R for a while now, but I still struggle with trivial data handling procedures. I believe I must be missing something. How can I better address these group(s)?

This is easy to do, and easy to read, with the dplyr package, the successor to plyr for data frames:
library(dplyr)
df %>% group_by(group) %>% mutate_each(funs(./mean(.)))
The . represents the data in each column (by group). mutate_each is used to modify each column except the grouping variables. You specify inside the funs argument which functions should be applied to each column.
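In recent dplyr releases (1.0.0 and later), mutate_each() and funs() are deprecated in favour of across(). A minimal sketch of the same idea, assuming a recent dplyr is installed; the seed and the "rel." prefix are my own choices for illustration:

```r
library(dplyr)

set.seed(1)  # assumed seed, only to make the sketch reproducible
df <- data.frame(
  group = rep(c("alpha", "beta", "gamma"), each = 3),
  a = rnorm(9, 10, 1),
  b = rnorm(9, 10, 1)
)

# Divide every non-grouping column by its group mean.
# .names keeps the original columns and adds rel.a, rel.b next to them.
rel <- df %>%
  group_by(group) %>%
  mutate(across(everything(), ~ .x / mean(.x), .names = "rel.{.col}")) %>%
  ungroup()

print(rel)
```

Dropping the .names argument overwrites a and b in place, which matches the behaviour of the mutate_each() call above.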

With the data.table package you can do the whole thing fast and in one line (without creating df.means at all):
library(data.table)
setDT(df)[, paste0("rel.", names(df)[-1]) :=
lapply(.SD, function(x) x/mean(x)),
group]
This runs over all the columns in df (except group), by group, and divides each value by its group mean.
Edit: If you want to overwrite the original columns (as in the dplyr answer), you can do this with a small modification (remove the paste0 part):
setDT(df)[, names(df)[-1] := lapply(.SD, function(x) x/mean(x)), group]

If I understand you correctly, you can also do this easily in dplyr. Given the above data:
library(dplyr)
df %>% group_by(group) %>% mutate(aresp = a/ mean(a), bresp= b/mean(b))
returns:
group a b aresp bresp
1 alpha 10.052847 8.076405 1.0132828 0.8288214
2 alpha 10.002243 11.447665 1.0081822 1.1747888
3 alpha 9.708111 9.709265 0.9785350 0.9963898
4 beta 10.732693 7.483065 0.9751125 0.8202278
5 beta 11.719656 11.270522 1.0647824 1.2353754
6 beta 10.567513 8.615878 0.9601051 0.9443968
7 gamma 10.221040 11.181763 1.0035630 0.9723315
8 gamma 10.302611 11.286443 1.0115721 0.9814341
9 gamma 10.030605 12.031643 0.9848649 1.0462344

Related

Variance of a complete group of a dataframe in R

Let's say I have a dataframe with 10+1 columns and 10 rows, and every value has the same units except for one column (the "grouping" column A).
I'm trying to accomplish the following: given a grouping of the data frame based on the last column, how do I compute the standard deviation of the whole block as a single, monolithic variable?
Let's say I do the grouping (in reality it's a cut in intervals):
df %>% group_by(A)
From what I have gathered throughout this site, you can use aggregate or other dplyr methods to calculate variance per column, i.e.:
(image not shown; SO won't let me embed if I have <10 rep).
In that picture we can see the grouping as colors, but by using aggregate I would get 1 standard deviation per specified column (I know you can use cbind to get more than 1 variable, for example aggregate(cbind(V1,V2)~A, df, sd)) and per group (and similar methods using dplyr and %>%, with summarise(..., FUN=sd) appended at the end).
However what I want is this: just like in Matlab when you do
group1 = df(row_group,:) % row_group would be df(:,end)==1 in this case
std(group1(:)) % operator (:) is key here
% iterate for every group
I have my reasons for wanting it that specific way, and of course the real dataframe is bigger than this mock example.
Minimum working example:
df <- data.frame(cbind(matrix(rnorm(100),10,10),c(1,2,1,1,2,2,3,3,3,1)))
colnames(df) <- c(paste0("V",seq(1,10)),"A")
df %>% group_by(A) %>% summarise_at(vars(V1), funs(sd(.))) # no good
aggregate(V1~A, data=df, sd) # no good
aggregate(cbind(V1,V2,V3,V4,V5,V6,V7,V8,V9,V10)~A, data=df, sd) # nope
df %>% group_by(A) %>% summarise_at(vars(V1,V2,V3,V4,V5,V6,V7,V8,V9,V10), funs(sd(.))) # same as above...
Result should be 3 doubles, each with the sd of the group (which should be close to 1 if enough columns are added).
If you want a base R solution, try the following.
sp <- split(df[-1], cut(df$A, breaks=c(2.1)))
lapply(sp, function(x) var(unlist(x)))
#$`(0.998,2]`
#[1] 0.848707
#
#$`(2,3]`
#[1] 1.80633
I have coded it in two lines to make it clearer but you can avoid the creation of sp and write the one-liner
lapply(split(df[-1], cut(df$A, breaks=c(2.1))), function(x) var(unlist(x)))
Or, for a result in another form,
sapply(sp, function(x) var(unlist(x)))
#(0.998,2] (2,3]
# 0.848707 1.806330
DATA
set.seed(6322) # make the results reproducible
df <- data.frame(cbind(matrix(rnorm(100),10,10),c(1,2,1,1,2,2,3,3,3,1)))
colnames(df) <- c(paste0("V",seq(1,10)),"A")
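Note that, as written, df[-1] drops V1 (the first column) rather than the grouping column A, and the cut() call produces two intervals, so groups 1 and 2 are pooled together in the output shown. If the goal is one standard deviation per value of A, as the question asks, a variant along these lines may be closer. It is only a sketch: value_cols and pooled_sd are names introduced here for illustration.

```r
set.seed(6322)  # same seed as the DATA block above
df <- data.frame(cbind(matrix(rnorm(100), 10, 10), c(1, 2, 1, 1, 2, 2, 3, 3, 3, 1)))
colnames(df) <- c(paste0("V", seq(1, 10)), "A")

# Every column except the grouping column A
value_cols <- setdiff(names(df), "A")

# Split the value columns by A, flatten each block to one vector
# (the equivalent of Matlab's group1(:)), and take its standard deviation
pooled_sd <- sapply(split(df[value_cols], df$A),
                    function(block) sd(unlist(block)))

print(pooled_sd)
```

With interval groups, as in the real data, the second argument of split() can be replaced by cut(df$A, breaks = ...) with explicit break points.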

R ddply ignores split factors when column index is used

I need to use ddply to apply multiple functions on multiple columns of my data frame. When I use the column name (RV in the example below), my split variables (Group and Round below) work (I get a mean value for each combination of Round and Group).
I need to do this on 20 columns and I was thinking of creating a for loop and pass column indexes.
When I use the column index (for example df[[1]] which is "RV" in my data frame), Group and Round are ignored and the grand mean is returned for all combinations of Round and Group.
I tried to pass the column name, in new.df3 but Round and Group are ignored again.
df <- data.frame("RV" = 1:5, "Group" = c("a","b","b","b","a"), "Round" = c("2","1","1","2","1"))
# this works and a separate mean for each combination of "Group" and "Round" is calculated
new.df <- ddply(df, c("Group", "Round"), summarise,
mean= mean(RV))
# this does not work and the grand mean is returned for all combinations of "Group" and "Round"
new.df2 <- ddply(df, c("Group", "Round"), summarise,
mean= mean(df[[1]]))
# this does not work and the grand mean is returned for all combinations of "Group" and "Round"
new.df3 <- ddply(df, c("Group", "Round"), summarise,
mean= mean(df[,colnames(df[1])]))
I tried "lapply" and the same issue exists. Any suggestion why this happens and how I can fix it?
As great a package as plyr is, you would do well here to update to its newest iteration, dplyr. There, the code would be
v <- vars(RV) # add all your variables here
new.df <- df %>%
group_by(Group, Round) %>%
summarize_at(v, funs(mean))
So using this method, you plug in all your variables into v, and you'll get a mean for all of them, for each combination of Group and Round. The pipe operator (%>%) looks weird when you first see it, but it helps streamline your code. It takes the output of the previous function and sets it to be the first argument of the next function. It makes it easy to see that we're taking df, grouping by Group and Round, then summarizing them.
If you really want to stick with plyr, we can get a solution there too:
new.df <- ddply(df, c("Group", "Round"), summarise,
RV_mean = mean(RV),
var2_mean = mean(var2) # add more variables just like this (var2 is a placeholder)
)
We can also work from your list approaches:
new.df2 <- ddply(df, .(Group, Round), function(data_subset) { # note alternative way to reference Group and Round
as.data.frame(llply(data_subset[,c("RV"), drop = FALSE], mean)) # add your variables here
})
Note that within ddply, I always refer to the subset of the data frame within my function calls, I never refer to df. df always refers to the original data frame - not the subset you are trying to work with.
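To make the point concrete: mean(df[[1]]) is evaluated against the full df on every call, which is why each Group/Round combination gets the grand mean (3 for this data). If you would rather pass a vector of column names than type each one, a base R sketch with aggregate() could look like this. The second column RV2 is hypothetical, added only so there is more than one response to summarise.

```r
df <- data.frame(RV = 1:5,
                 Group = c("a", "b", "b", "b", "a"),
                 Round = c("2", "1", "1", "2", "1"))
df$RV2 <- df$RV * 10  # hypothetical second response column, for illustration

# Columns to summarise, given by name (could also be names(df)[some_index])
cols <- c("RV", "RV2")

# aggregate() splits df[cols] by the grouping columns and applies FUN
# to each column of each subset, so the split is respected
new.df <- aggregate(df[cols], by = df[c("Group", "Round")], FUN = mean)

print(new.df)
```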

Calculate fraction of complete/not missing values of variables in a data frame for output in a long format [duplicate]

This question already has answers here:
How to find the percentage of NAs in a data.frame?
I've got a data frame (df1) with four variables, a, b, c, and d.
I'd like to get the completeness (!is.na(x)) for each variable in the data frame. I'd like the output to be in long format (df2).
The problem's that I can't get the nrow() part of my code to work (therefore I don't know if it works overall). Or is there a dplyr+tidyr way of doing it?
Any help would be much appreciated.
Starting point (df1):
df1 <- data.frame(a=c(1,2,3,NA),b=c(1,2,NA,NA),c=c(1,2,3,4),d=c(NA,NA,NA,NA),stringsAsFactors = TRUE)
Current code:
sapply(df1, function(x) sum(!is.na(df1$x)) / nrow(df1$x))
Desired outcome (df2):
df2 <- data.frame(nameofvar=c("a","b","c","d"),completeness=c(75,50,100,0))
As you wanted the answer to be in the long format, here’s how:
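library(dplyr)
library(tidyr)
# stack all columns into name/value pairs, then average the non-missing flag per variable
df2 = df1 %>%
gather(NameOfVar, Value) %>%
group_by(NameOfVar) %>%
summarize(Completeness = mean(! is.na(Value)) * 100)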
As for why your (base R) code isn’t working:
When sapplying over a data.frame, the argument to your function (x) is the column data itself. So instead of df1$x you need to use just x, and instead of nrow you now need to use length, since each column x is a vector. [1]
[1] In addition, $-subsetting with a variable never works, so even if x were a column name/index, df1$x wouldn’t work anyway. You’d have to use df1[[x]] instead.
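Applied to the original attempt, those two corrections give something like the following. This is a sketch; the column names of df2 are taken from the desired outcome in the question.

```r
df1 <- data.frame(a = c(1, 2, 3, NA),
                  b = c(1, 2, NA, NA),
                  c = c(1, 2, 3, 4),
                  d = c(NA, NA, NA, NA))

# x is the column itself, so use x directly and length(x) instead of nrow()
completeness <- sapply(df1, function(x) sum(!is.na(x)) / length(x) * 100)

# Reshape the named vector into the long format asked for
df2 <- data.frame(nameofvar = names(completeness),
                  completeness = unname(completeness))

print(df2)
```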
Try the purrr package, part of the tidyverse:
df1 %>%
map_df(~ sum(!is.na(.)) / length(.) * 100)
With data.table:
library(data.table)
dt1 <- as.data.table(df1)
dt1[, sapply(.SD, function(x) {sum(!is.na(x)) / .N}), .SDcols = names(dt1)]
Or very simply with base R:
colSums(!is.na(df1)) / nrow(df1) * 100
Using only dplyr package:
library(dplyr)
df1 <- data.frame(a=c(1,2,3,NA),
b=c(1,2,NA,NA),
c=c(1,2,3,4),
d=c(NA,NA,NA,NA),
stringsAsFactors = TRUE)
# get percentage of non NA values
df1 %>% summarise_all(function(x) mean(! is.na(x)))
# a b c d
# 1 0.75 0.5 1 0

rowDiffs type function, keeping "row 1" as the reference row per group

Say I have this simple data frame with a grouping variable, and three xs per group:
df<-data.frame(grp=rep(letters[1:3],each=3),
x=rnorm(9))
grp x
1 a 1.9561455
2 a -2.3916438
3 a 0.7267603
4 b -0.8794693
5 b -0.3089820
6 b -1.7228825
7 c -0.3964017
8 c -0.6237301
9 c -0.1522535
I want to, per group, take the initial row as a reference row, and get the difference between x and this reference x (first row) for all rows, such that the outcome is:
grp x xdiff
1 a 1.9561455 0.0000000
2 a -2.3916438 -4.3477893
3 a 0.7267603 -1.2293853
4 b -0.8794693 0.0000000
5 b -0.3089820 0.5704873
6 b -1.7228825 -0.8434132
7 c -0.3964017 0.0000000
8 c -0.6237301 -0.2273284
9 c -0.1522535 0.2441482
I was able to do it this way:
rowOne<-df %>% group_by(grp) %>% filter(row_number()==1)
names(rowOne)[2]<-"x_initial"
df %>% left_join(rowOne) %>% mutate(xdiff=x-x_initial)
But I'm hoping there is a simpler way to do it, that doesn't require creating new datasets, merging and subtracting.
I have a dozen or so columns I need to do this for, and I'd like to be able to just do something like:
df %>% group_by(grp) %>% mutate(xdiff=rowDiffs(x))
But, obviously, this is not the correct function. Is there a function out there I haven't come across, or an easier way to program R to do this task?
Thanks!
Subtracting the first value of a column within each group (defined by another column) can be done with data.table, dplyr, or base R.
If we are doing this for a single column, the compact data.table method is one option. We convert the data.frame to a data.table (setDT(df)), group by the grouping column ('grp'), and take the difference between the column ('x') and the first value in that column (x[1L]). Note that I used the integer literal 1L; x[1] works as well, and in some cases the integer form might be a bit faster.
library(data.table)
setDT(df)[, xdiff:=x-x[1L] , by = grp]
A similar option with dplyr pipes (%>%) the steps from left to right, i.e. take the dataset ('df'), group by 'grp', and create a new column using mutate. Note that dplyr has a first function to select the first observation; it takes other arguments as well (?first).
library(dplyr)
df %>%
group_by(grp) %>%
mutate(xdiff= x- first(x))
Or a base R option suggested by @David Arenburg
df$xdiff <- with(df, ave(x, grp, FUN = function(x) x - x[1L]))
If you have many columns, we can use mutate_each (from dplyr) after the grouping step, change the column names with setNames (NOTE: if there are multiple functions, i.e. >1, we could change the names within mutate_each itself), and bind the original columns with bind_cols.
df1 %>%
group_by(grp) %>%
mutate_each(funs(.-first(.))) %>%
setNames(., c(names(df1)[1L], paste0(names(df1)[-1L], 'diff'))) %>%
ungroup() %>%
select(-grp) %>%
bind_cols(df1, .)
Or using data.table, we can create new columns by assigning (:=). Here, we loop the columns under consideration with lapply (.SD is the Subset of DataTable) and get the difference grouped by 'grp'.
nm1 <- setdiff(names(df1), 'grp')
setDT(df1)[, paste0(nm1, 'diff') :=lapply(.SD, function(x) x-x[1L]), grp]
data
set.seed(24)
df1 <- cbind(df, y= rnorm(9))
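In dplyr 1.0.0 and later, where mutate_each() is deprecated, the many-column case can be written more directly with across() and its .names argument. A sketch, assuming a recent dplyr; df1 is rebuilt here so the block runs on its own, and its values will differ from the listing in the question.

```r
library(dplyr)

set.seed(24)
df1 <- data.frame(grp = rep(letters[1:3], each = 3),
                  x = rnorm(9),
                  y = rnorm(9))

# For each listed column, subtract the first value within the group.
# .names appends "diff" to the source column name (xdiff, ydiff),
# so the setNames/bind_cols steps are no longer needed.
out <- df1 %>%
  group_by(grp) %>%
  mutate(across(c(x, y), ~ .x - first(.x), .names = "{.col}diff")) %>%
  ungroup()

print(out)
```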

Subsetting data.frame upon two constraints

Say I want to subset using 2 constraints:
1. the values in the first column are identical,
2. and at the same time, the values in the second column are the same.
For example, I have a data frame
a <- rep(1:5)
b <- c(1,2,2,2,1,1,1,2,2,2)
data <- data.frame(a,b)
Say a is the pair identification number and b represents the gender.
Now we want to subset to create a dataset where we have a matched pair ID and gender.
Would one create a loop using the while command, or use the duplicated function?
The expected result is the subset of data that was highlighted in green (image not shown).
You can try
data[with(data, !!ave(b, a, FUN=function(x)
length(unique(x))==1)),]
Or
library(dplyr)
data %>%
group_by(a) %>%
filter(n_distinct(b)==1)
Or
library(data.table)
setDT(data)[,.(b=b[length(unique(b))==1]) , a]
Or another data.table solution provided by @David Arenburg
setDT(data)[, if (length(unique(b)) == 1) .SD, a]
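Because the highlighted image from the question is not available, here is a small base R sketch of what the matched subset looks like on the sample data. The names keep and matched are introduced here for illustration, and rep(1:5, times = 2) spells out the recycling that data.frame() does implicitly in the question.

```r
a <- rep(1:5, times = 2)              # pair ID; each ID appears twice
b <- c(1, 2, 2, 2, 1, 1, 1, 2, 2, 2)  # gender
data <- data.frame(a, b)

# For each pair ID, flag all of its rows with 1 if b takes a single value, else 0
keep <- ave(data$b, data$a, FUN = function(x) length(unique(x)) == 1) == 1

matched <- data[keep, ]
print(matched)
```

On this data, pairs 1, 3 and 4 have the same b in both rows, so six rows are kept.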
