Is it possible in R to merge and group a data frame (by a column) with a single function or in a single step?

I am new to R programming, so I wanted to learn whether it is possible to merge and group data with a single function or within a single step in R.

I'm not sure if I've understood your question correctly. It's possible to group data and collapse each group to a summary value via the aggregate function:
df <- data.frame(a=1:40, b=rbinom(40, 10, 0.5), n=rnorm(40), p=rpois(40, lambda=4), group=gl(4,10), even=rep(c(1,2),20))
# aggregate() is part of base R (package stats), so no extra package needs to be loaded
aggregate(b ~ group, df, sum) #aggregate/sum over group
aggregate(b ~ group + even, df, sum) #aggregate/sum over group & even
Results (b is drawn at random, so your sums will differ):
> aggregate(b ~ group, df, sum)
group b
1 1 51
2 2 49
3 3 49
4 4 47
> aggregate(b ~ group + even, df, sum)
group even b
1 1 1 27
2 2 1 23
3 3 1 25
4 4 1 23
5 1 2 24
6 2 2 26
7 3 2 24
8 4 2 24
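The aggregate() calls above group a single, existing data frame. If the data lives in two data frames, the merge can be nested inside the same call. A minimal sketch, assuming two hypothetical data frames orders and customers (not from the question) that share an id column:

```r
# Hypothetical example data (not from the question)
orders <- data.frame(id = 1:4, amount = c(10, 20, 30, 40))
customers <- data.frame(id = 1:4, region = c("east", "west", "east", "west"))

# merge() joins the two data frames on id;
# aggregate() then sums amount within each region
res <- aggregate(amount ~ region,
                 data = merge(orders, customers, by = "id"),
                 FUN = sum)
print(res)
```

Strictly speaking this is still two functions, just nested into a single expression.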

Related

Add data from a data table to another using values of a column

I know the question is confusing, but I hope the example will make it simple.
I have two tables:
x y
1 23
2 34
3 76
4 31
&
x y
1 78
3 51
5 54
I need to add up the y columns based on matching x values. I can do it using loops, but I don't want to. I would prefer a solution that uses base, dplyr, or data.table functions, as I am most familiar with those; the apply family of functions is fine as well. The output should look like this:
x y
1 101
2 34
3 127
4 31
5 54
The basic idea is to combine the two datasets, group by x, and summarize y with sum. There are a couple of ways to do it:
data.table:
rbind(dtt1, dtt2)[, .(y = sum(y)), by = x]
# x y
# 1: 1 101
# 2: 2 34
# 3: 3 127
# 4: 4 31
# 5: 5 54
base R aggregate:
aggregate(y ~ x, rbind(dtt1, dtt2), FUN = sum)
dplyr:
library(dplyr) # provides %>%, group_by and summarize
rbind(dtt1, dtt2) %>% group_by(x) %>% summarize(y = sum(y))
The data:
library(data.table)
dtt1 <- fread('x y
1 23
2 34
3 76
4 31')
dtt2 <- fread('x y
1 78
3 51
5 54')
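The answers above all stack the two tables and then group. An alternative in base R is a full outer merge on x followed by a row-wise sum. This is only a sketch: it rebuilds the data as plain data frames named d1 and d2 (hypothetical names, used here so the example does not depend on data.table).

```r
# The two tables from the question, rebuilt as plain data frames
d1 <- data.frame(x = c(1, 2, 3, 4), y = c(23, 34, 76, 31))
d2 <- data.frame(x = c(1, 3, 5), y = c(78, 51, 54))

# all = TRUE keeps x values that appear in only one table;
# the two y columns are renamed y.x and y.y by merge()
m <- merge(d1, d2, by = "x", all = TRUE)

# Missing values (x present in one table only) are skipped by na.rm = TRUE
out <- data.frame(x = m$x, y = rowSums(m[, c("y.x", "y.y")], na.rm = TRUE))
print(out)
```

The rbind-then-aggregate route generalizes more easily to more than two tables; the merge route keeps the per-table values visible in m if they are needed later.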

Difference between aggregate and table functions

Age <- c(90,56,51,64,67,59,51,55,48,50,43,57,44,55,60,39,62,66,49,61,58,55,45,47,54,56,52,54,50,62,48,52,50,65,59,68,55,78,62,56)
Tenure <- c(2,2,3,4,3,3,2,2,2,3,3,2,4,3,2,4,1,3,4,2,2,4,3,4,1,2,2,3,3,1,3,4,3,2,2,2,2,3,1,1)
df <- data.frame(Age, Tenure)
I'm trying to count how often each unique value of Tenure occurs, so I've used the table() function to look at the frequencies:
table(df$Tenure)
1 2 3 4
5 15 13 7
However, I'm curious to know what the aggregate() function is showing:
aggregate(Age~Tenure , df, function(x) length(unique(x)))
Tenure Age
1 1 3
2 2 13
3 3 11
4 4 7
What's the difference between these two outputs?
The difference comes from including unique in the aggregate call. You are counting the number of distinct Ages per Tenure, not the number of rows per Tenure. To get output analogous to table(), try:
aggregate(Age~Tenure , df, length)
Tenure Age
1 1 5
2 2 15
3 3 13
4 4 7
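To see the three counts side by side, the results can be collected into one data frame. This is a small sketch using the question's data; the object names n_rows, n_distinct and comparison are made up for illustration.

```r
Age <- c(90,56,51,64,67,59,51,55,48,50,43,57,44,55,60,39,62,66,49,61,
         58,55,45,47,54,56,52,54,50,62,48,52,50,65,59,68,55,78,62,56)
Tenure <- c(2,2,3,4,3,3,2,2,2,3,3,2,4,3,2,4,1,3,4,2,
            2,4,3,4,1,2,2,3,3,1,3,4,3,2,2,2,2,3,1,1)
df <- data.frame(Age, Tenure)

# Rows per Tenure (same numbers as table(df$Tenure))
n_rows <- aggregate(Age ~ Tenure, df, length)
# Distinct Age values per Tenure (duplicated ages are counted once)
n_distinct <- aggregate(Age ~ Tenure, df, function(x) length(unique(x)))

comparison <- data.frame(
  Tenure      = n_rows$Tenure,
  table_count = as.vector(table(df$Tenure)),
  n_rows      = n_rows$Age,
  n_distinct  = n_distinct$Age
)
print(comparison)
```

The n_distinct column is smaller than n_rows wherever several people with the same Tenure share an Age.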

Perform operations on a data frame based on a factor

I'm having a hard time describing this, so it's best explained with an example (as can probably be seen from the poor question title).
Using dplyr, I have a data frame that is the result of a group_by and summarize, and I want to do some further manipulation on it by factor.
As an example, here's a data frame that looks like the result of my dplyr operations:
> df <- data.frame(run=as.factor(c(rep(1,3), rep(2,3))),
group=as.factor(rep(c("a","b","c"),2)),
sum=c(1,8,34,2,7,33))
> df
run group sum
1 1 a 1
2 1 b 8
3 1 c 34
4 2 a 2
5 2 b 7
6 2 c 33
I want to divide sum by a value that depends on run. For example, if I have:
> total <- data.frame(run=as.factor(c(1,2)),
total=c(45,47))
> total
run total
1 1 45
2 2 47
Then my final data frame will look like this:
> df
run group sum percent
1 1 a 1 1/45
2 1 b 8 8/45
3 1 c 34 34/45
4 2 a 2 2/47
5 2 b 7 7/47
6 2 c 33 33/47
I filled in the fractions in the percent column by hand to show the operation I want to perform.
I know there is probably some dplyr way to do this with mutate but I can't seem to figure it out right now. How would this be accomplished?
(In base R)
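You can use total as a look-up table to get the total for each run in df. Indexing with the factor df$run uses its integer codes, so this assumes the rows of total are in the same order as the levels of run: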
total[df$run,'total']
[1] 45 45 45 47 47 47
And you simply use it to divide the sum and assign the result to a new column:
df$percent <- df$sum / total[df$run,'total']
run group sum percent
1 1 a 1 0.02222222
2 1 b 8 0.17777778
3 1 c 34 0.75555556
4 2 a 2 0.04255319
5 2 b 7 0.14893617
6 2 c 33 0.70212766
If your "run" values are 1, 2, ..., n, then this will also work:
divisor <- c(45,47) # c(45,47,...up to n divisors)
df$percent <- df$sum/divisor[df$run]
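Both base R approaches above index by the position of each run. If the look-up table might be in a different order, or the run labels are not 1, 2, ..., n, matching on the label itself is safer. A minimal sketch using match(); the reversed row order of total is a deliberate assumption here to show that the order no longer matters.

```r
df <- data.frame(run   = factor(c(1, 1, 1, 2, 2, 2)),
                 group = factor(rep(c("a", "b", "c"), 2)),
                 sum   = c(1, 8, 34, 2, 7, 33))

# Look-up table with rows deliberately in reverse order
total <- data.frame(run = factor(c(2, 1)), total = c(47, 45))

# For each row of df, find the row of total with the same run label
idx <- match(as.character(df$run), as.character(total$run))
df$percent <- df$sum / total$total[idx]
print(df)
```

A run that is missing from total yields NA in idx, and therefore NA in percent, rather than silently picking up the wrong divisor.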
First, merge the total values into your df:
df2 <- merge(df, total, by = "run")
Then you can call mutate (%<>% is the assignment pipe from magrittr):
library(dplyr)    # mutate
library(magrittr) # %<>%
df2 %<>% mutate(percent = sum / total)
Convert to data.table in-place, then merge and add new column, again in-place:
library(data.table)
setDT(df)[total, on = 'run', percent := sum/total]
df
# run group sum percent
#1: 1 a 1 0.02222222
#2: 1 b 8 0.17777778
#3: 1 c 34 0.75555556
#4: 2 a 2 0.04255319
#5: 2 b 7 0.14893617
#6: 2 c 33 0.70212766

Working out relative abundances with dplyr

I have my data:
library(dplyr)
Sample.no <- c(1,1,1,2,2,1,1,1,1,2,2)
Group <-c('a','b','c','a','b','a','b','c','d','a','c')
Abundance <- c(Sample.no*c(3,1,4,7,2))
df<-data.frame(Sample.no,Group,Abundance)
giving
Sample.no Group Abundance
1 1 a 3
2 1 b 1
3 1 c 4
4 2 a 14
5 2 b 4
6 1 a 3
7 1 b 1
8 1 c 4
9 1 d 7
10 2 a 4
11 2 c 6
I want to create a summary similar to this:
df<-group_by(df,Sample.no)
df<-summarise(df,number=n(),total=sum(Abundance))
Sample.no number total
1 1 7 23
2 2 4 28
However, I'd also like a column with the total Abundance of 'a's in each sample, in order to work out relative abundance. I've tried custom functions with no success. Is there an easy way to do it in dplyr?
Here's one way using data.table:
require(data.table) # v1.9.6
setDT(df)[, c(list(num = .N, tot = sum(Abundance)),
tapply(Abundance, Group, sum)),
by = Sample.no]
# Sample.no num tot a b c d
# 1: 1 7 23 6 2 8 7
# 2: 2 4 28 18 4 6 NA
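I use tapply() instead of joins using .SD since we need a named list here, and tapply()'s output format makes it very convenient. Note that this relies on Group being a factor (the data.frame() default before R 4.0), so that every Sample.no returns the same set of columns.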
Using aggregate and xtabs:
total <- aggregate(Abundance ~ Sample.no, data=df,
FUN = function(x) c(num = length(x), total = sum(x)))
group <- as.data.frame.matrix(xtabs(Abundance ~ Sample.no + Group, df))
cbind(total, group)
Output:
Sample.no Abundance.num Abundance.total a b c d
1 1 7 23 6 2 8 7
2 2 4 28 18 4 6 0
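Since the stated goal is relative abundance, the per-group totals can be turned into proportions directly. A sketch in base R, assuming that "relative abundance" means each group's share of the sample total; the data frame is rebuilt with the Abundance values written out as shown in the question.

```r
df <- data.frame(
  Sample.no = c(1, 1, 1, 2, 2, 1, 1, 1, 1, 2, 2),
  Group     = c("a", "b", "c", "a", "b", "a", "b", "c", "d", "a", "c"),
  Abundance = c(3, 1, 4, 14, 4, 3, 1, 4, 7, 4, 6)
)

# Total Abundance per sample (rows) and group (columns)
xt <- xtabs(Abundance ~ Sample.no + Group, data = df)

# Divide every row by its row total, so each row sums to 1
rel <- prop.table(xt, margin = 1)
print(round(rel, 3))
```

Here rel["1", "a"] is 6/23 and rel["2", "a"] is 18/28, matching the a and total columns in the outputs above.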

adding row/column total data when aggregating data using plyr and reshape2 package in R

I create aggregate tables most of the time during my work using the flow below:
library(plyr)     # ddply
library(reshape2) # dcast
set.seed(1)
temp.df <- data.frame(var1=sample(letters[1:5],100,replace=TRUE),
var2=sample(11:15,100,replace=TRUE))
temp.output <- ddply(temp.df,
c("var1","var2"),
function(df) {
data.frame(count=nrow(df))
})
temp.output.all <- ddply(temp.df,
c("var2"),
function(df) {
data.frame(var1="all",
count=nrow(df))
})
temp.output <- rbind(temp.output,temp.output.all)
temp.output[,"var1"] <- factor(temp.output[,"var1"],levels=c(letters[1:5],"all"))
temp.output <- dcast(temp.output,formula=var2~var1,value.var="count",fill=0)
I'm starting to feel silly writing this "boilerplate" code to include the row/column totals every time I create a new aggregate table. Is there some way to skip it?
Looking at your desired output (now that I'm in front of a computer), perhaps you should look at the margins argument of dcast:
library(reshape2)
dcast(temp.df, var2 ~ var1, value.var = "var2",
fun.aggregate=length, margins = "var1")
# var2 a b c d e (all)
# 1 11 3 1 6 4 2 16
# 2 12 1 3 6 5 5 20
# 3 13 5 9 3 6 1 24
# 4 14 4 7 3 6 2 22
# 5 15 0 5 1 5 7 18
Also look into the addmargins function in base R.
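For the base R route mentioned above, a minimal sketch with table() and addmargins(), reusing the temp.df construction from the question. The names tab and with_total are made up for illustration, and no counts are shown because sample() can give different draws across R versions even with the same seed.

```r
set.seed(1)
temp.df <- data.frame(var1 = sample(letters[1:5], 100, replace = TRUE),
                      var2 = sample(11:15, 100, replace = TRUE))

# Cross-tabulate counts: one row per var2 value, one column per var1 value
tab <- table(var2 = temp.df$var2, var1 = temp.df$var1)

# margin = 2 appends a "Sum" column holding each row's total
with_total <- addmargins(tab, margin = 2)
print(with_total)
```

Calling addmargins(tab) without the margin argument adds both a total row and a total column.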
