Aggregate/collapse data frame - r

Is there an "all-in-one" convenience function in R that can collapse/aggregate a data frame to resolve the many-to-many problem? The motivation is to reduce many-to-many relationships so that two or more tables can be joined on some primary key (a column with unique identifier values). To elucidate, consider a data frame like:
set.seed(1) # for reproducibility
df <- data.frame(id = sort(rep(seq(1,3),4)), # primary key
geo_loc = state.abb[sample(seq(1,length(state.name)), # state abbreviations
size=length(sort(rep(seq(1,3),4))),
replace = TRUE)],
revenue = c(sample(seq(0,50),size=3), sample(c(seq(101,200)),size=3),
sample(seq(201,300),size=4), sample(seq(301,1000),size=2)),
prod_id = sample(LETTERS[c(seq(1,4))],size=12, replace=TRUE),
quant = c(sample(seq(0,5),size=4), sample(c(seq(3,8)),size=4),
sample(seq(6,11),size=2), sample(seq(9,14),size=2))) ; df
id geo_loc revenue prod_id quant
1 1 MN 47 D 0
2 1 MA 29 B 3
3 1 SD 50 B 4
4 1 NM 174 A 1
5 2 NC 136 D 6
6 2 LA 143 B 5
7 2 IN 215 C 8
8 2 WY 202 A 4
9 3 NY 271 A 10
10 3 HI 211 C 9
11 3 CT 613 C 10
12 3 MS 748 A 14
Does a function already exist that will collapse this table such that there is only one row per unique id? It would have to convert the geo_loc and prod_id columns to k levels - 1 dummy columns. It would also be nice if such a function could allow automatic clustering of the revenue into a number of blocks based on perhaps quantiles.

Only aggregate when you have a proper grouping variable. It would be more logical to aggregate by prod_id, for example.
To perform these data tidying and aggregating operations I would personally recommend spread() and gather() from the tidyr package (superseded by pivot_wider() and pivot_longer() in tidyr 1.0.0 and later) and summarise() and group_by() from the dplyr package.
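As a rough, package-free sketch of what "one row per id" could look like, the base R code below sums a numeric column, counts each prod_id level per id, and bins the summed revenue by quantile. The small data frame, the n_prod_ column prefix, and the choice of three bins are assumptions made for illustration. They are not part of the question's data.

```r
# Small made-up data frame reusing the question's column names
df <- data.frame(
  id      = c(1, 1, 2, 2, 3, 3),
  revenue = c(10, 20, 100, 150, 300, 700),
  prod_id = c("A", "B", "A", "A", "C", "B")
)

# One row per id: sum the numeric column
totals <- aggregate(revenue ~ id, data = df, FUN = sum)

# One count column per prod_id level (hypothetical n_prod_ prefix)
counts <- as.data.frame.matrix(table(df$id, df$prod_id))
names(counts) <- paste0("n_prod_", names(counts))
counts$id <- as.numeric(rownames(counts))

collapsed <- merge(totals, counts, by = "id")

# Bin the summed revenue into three quantile-based blocks
breaks <- quantile(collapsed$revenue, probs = seq(0, 1, length.out = 4))
collapsed$revenue_bin <- cut(collapsed$revenue, breaks = breaks,
                             include.lowest = TRUE, labels = FALSE)
collapsed
```

Counts are used instead of k - 1 dummy columns because an id can contain the same prod_id more than once. Dropping one of the count columns would give the k - 1 form if it is needed for modelling.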

Related

Repeat a vector on a dataframe

I want to add a column to my data.table based on a vector. However, my data.table has 20 rows and my vector has 7 values. I want the data.table to be repeated 7 times so that each block of 20 rows gets one value from the vector. This might be simple, but I can't work out how to do it.
Some sample data -
library(data.table)
set.seed(9901)
#### create the sample variables for creating the data
group <- c(1:7)
brn <- sample(1:10,20,replace = TRUE)
period <- c(101:120)
df1 <- data.table(cbind(brn,period))
So in this case I want to add a column group. The data.table would then have 140 rows: 20 rows for group 1, then 20 rows for group 2, and so on.
Apparently you want this:
df1[CJ(group, period), on = .(period)]
# brn period group
# 1: 3 101 1
# 2: 9 102 1
# 3: 9 103 1
# 4: 5 104 1
# 5: 5 105 1
# ---
#136: 9 116 7
#137: 7 117 7
#138: 10 118 7
#139: 2 119 7
#140: 7 120 7
CJ creates a data.table resulting from the cartesian join of the vectors passed to it. This data.table is then joined with df1 based on the column specified by on.
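For readers without data.table, the same cross-join-then-join idea can be sketched in base R with expand.grid() and merge(). This is an illustrative equivalent, not the code from the answer above; the names grid and res are made up here.

```r
group  <- 1:7
period <- 101:120
set.seed(9901)
df1 <- data.frame(brn = sample(1:10, 20, replace = TRUE), period = period)

# Every (period, group) combination: 20 * 7 = 140 rows
grid <- expand.grid(period = period, group = group)

# Attach the original rows to each combination by matching on period
res <- merge(df1, grid, by = "period")

# Order as 20 rows for group 1, then 20 rows for group 2, and so on
res <- res[order(res$group, res$period), ]
nrow(res)
```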
I would (1) bind each value in group to a copy of df1, collecting the results in a list, and
(2) row-bind them:
AllLists<-apply(as.data.frame(group),1,function(x) cbind(x,df1))
do.call( "rbind",AllLists)
A solution with data.table. Is that what you are looking for?
library(data.table)
df2 <- df1[rep(1:nrow(df1), times = 7),
][,group := rep(group, each = 20)]
But Roland's solution in the comments is definitely more elegant.

Perform operations on a data frame based on a factor

I'm having a hard time describing this, so it's best explained with an example (as can probably be seen from the poor question title).
Using dplyr, I have a data frame that is the result of a group_by and summarize, and I want to do some further manipulation on it by factor.
As an example, here's a data frame that looks like the result of my dplyr operations:
> df <- data.frame(run=as.factor(c(rep(1,3), rep(2,3))),
group=as.factor(rep(c("a","b","c"),2)),
sum=c(1,8,34,2,7,33))
> df
run group sum
1 1 a 1
2 1 b 8
3 1 c 34
4 2 a 2
5 2 b 7
6 2 c 33
I want to divide sum by a value that depends on run. For example, if I have:
> total <- data.frame(run=as.factor(c(1,2)),
total=c(45,47))
> total
run total
1 1 45
2 2 47
Then my final data frame will look like this:
> df
run group sum percent
1 1 a 1 1/45
2 1 b 8 8/45
3 1 c 34 34/45
4 2 a 2 2/47
5 2 b 7 7/47
6 2 c 33 33/47
Here I inserted the fractions in the percent column by hand to show the operation I want to do.
I know there is probably some dplyr way to do this with mutate but I can't seem to figure it out right now. How would this be accomplished?
(In base R)
You can use total as a look-up table where you get a total for each run of df :
total[df$run,'total']
[1] 45 45 45 47 47 47
And you simply use it to divide the sum and assign the result to a new column:
df$percent <- df$sum / total[df$run,'total']
run group sum percent
1 1 a 1 0.02222222
2 1 b 8 0.17777778
3 1 c 34 0.75555556
4 2 a 2 0.04255319
5 2 b 7 0.14893617
6 2 c 33 0.70212766
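One caveat about the look-up above: indexing with a factor uses the factor's integer codes, so total[df$run,'total'] gives the right answer here only because the rows of total are in the same order as the levels of run. A minimal sketch of a more defensive variant uses match() on the run labels. The reversed row order of total is an assumption made here to show the difference.

```r
df <- data.frame(run   = as.factor(c(rep(1, 3), rep(2, 3))),
                 group = as.factor(rep(c("a", "b", "c"), 2)),
                 sum   = c(1, 8, 34, 2, 7, 33))

# Same totals as in the question, but with the rows deliberately reversed
total <- data.frame(run = as.factor(c(2, 1)), total = c(47, 45))

# Look up each row's total by label rather than by position
idx <- match(as.character(df$run), as.character(total$run))
df$percent <- df$sum / total$total[idx]
df
```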
If your "run" values are 1,2...n then this will work
divisor <- c(45,47) # c(45,47,...up to n divisors)
df$percent <- df$sum/divisor[df$run]
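First, merge the total values into your df:
df2 <- merge(df, total, by = "run")
Then you can call mutate (the %<>% compound-assignment pipe comes from magrittr, not dplyr, so a plain assignment is used here):
library(dplyr)
df2 <- df2 %>% mutate(percent = sum / total)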
Convert to data.table in-place, then merge and add new column, again in-place:
library(data.table)
setDT(df)[total, on = 'run', percent := sum/total]
df
# run group sum percent
#1: 1 a 1 0.02222222
#2: 1 b 8 0.17777778
#3: 1 c 34 0.75555556
#4: 2 a 2 0.04255319
#5: 2 b 7 0.14893617
#6: 2 c 33 0.70212766

Merge, cbind: How to merge better? [duplicate]

I want to merge multiple vectors into a data frame. Two variables, city and id, will be used to match the vectors to the data frame.
df <- data.frame(array(NA, dim =c(10*50, 2)))
names(df)<-c("city", "id")
df[,1]<-rep(1:50, each=10)
df[,2]<-rep(1:10, 50)
I created a data frame like this. I want to merge 50 vectors into this data frame, one per city. The problem is that each city has only 6 observations, so each city will have 4 NAs.
To give you an example, city 1 data looks like this:
set.seed(1234)
cbind(city=1,id=sample(1:10,6),obs=rnorm(6))
I have 50 cities data and I want to merge them to one column in df. I have tried the following code:
for(i in 1:50){
citydata<-cbind(city=i,id=sample(1:10,6),obs=rnorm(6)) # each city data
df<-merge(df,citydata, by=c("city", "id"), all=TRUE)} # merge to df
But if I run this, the loop will show warnings like this:
In merge.data.frame(df, citydata, by = c("city", "id"), ... :
column names ‘obs.x’, ‘obs.y’ are duplicated in the result
and it will create 50 columns, instead of one long column.
How can I merge cbind(city=i,id=sample(1:10,6),obs=rnorm(6)) to df in a one nice and long column? It seems both cbind and merge are not ways to go.
If there are 50 citydata objects (each with 6 rows), I can rbind them into one long data set and use the data.table approach or the expand.grid + merge approach, as Philip and Jaap suggested.
I wonder whether I can merge each citydata in a loop, one by one, instead of rbinding them first and then merging the result into df.
data.table is good for this:
library(data.table)
df <- data.table(df)
> df
city id
1: 1 1
2: 1 2
3: 1 3
4: 1 4
5: 1 5
---
496: 50 6
497: 50 7
498: 50 8
499: 50 9
500: 50 10
I'm using CJ instead of your for loop to make some dummy data. CJ cross-joins its arguments, so it builds a two-column table containing every possible pair of city and id values. The [,obs:=rnorm(.N)] step then adds a third column of random draws (one per row, rather than being recycled as they would be if passed inside the CJ); .N means "number of rows of this table" in this context.
citydata <- CJ(city=1:50,id=1:6)[,obs:=rnorm(.N)]
> citydata
city id obs
1: 1 1 0.19168335
2: 1 2 0.35753229
3: 1 3 1.35707865
4: 1 4 1.91871907
5: 1 5 -0.56961647
---
296: 50 2 0.30592659
297: 50 3 -0.44989646
298: 50 4 0.05359738
299: 50 5 -0.57494269
300: 50 6 0.09565473
setkey(df,city,id)
setkey(citydata,city,id)
As these two tables have the same key columns the following looks up rows of df by the key columns in citydata, then defines obs in df by looking up the value in citydata. Therefore the resulting object is the original df but with obs defined wherever it was defined in citydata:
df[citydata,obs:=i.obs]
> df
city id obs
1: 1 1 0.19168335
2: 1 2 0.35753229
3: 1 3 1.35707865
4: 1 4 1.91871907
5: 1 5 -0.56961647
---
496: 50 6 0.09565473
497: 50 7 NA
498: 50 8 NA
499: 50 9 NA
500: 50 10 NA
In base R you can do this with a combination of expand.grid and merge:
citydata <- expand.grid(city=1:50,id=1:6)
citydata$obs <- rnorm(nrow(citydata))
res <- merge(df, citydata, by = c("city","id"), all.x = TRUE)
which gives:
> head(res,12)
city id obs
1: 1 1 -0.3121133
2: 1 2 -1.3554576
3: 1 3 -0.9056468
4: 1 4 -0.6511869
5: 1 5 -1.0447499
6: 1 6 1.5939187
7: 1 7 NA
8: 1 8 NA
9: 1 9 NA
10: 1 10 NA
11: 2 1 0.5423479
12: 2 2 -2.3663335
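If the per-city pieces arrive as separate objects, as in the question's loop, one option is to collect them in a list, row-bind them once, and merge once. The sketch below uses base R only; the names pieces and res are made up for illustration, and the random draws stand in for real per-city data.

```r
df <- data.frame(city = rep(1:50, each = 10), id = rep(1:10, times = 50))

set.seed(1234)
# One small data frame per city, each with 6 of the 10 ids
pieces <- lapply(1:50, function(i) {
  data.frame(city = i, id = sample(1:10, 6), obs = rnorm(6))
})

# Stack the pieces into one long table, then merge a single time
citydata <- do.call(rbind, pieces)
res <- merge(df, citydata, by = c("city", "id"), all.x = TRUE)

dim(res)
```

Merging once avoids the obs.x / obs.y columns that appear when a table that already has an obs column is merged again inside the loop.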
A similar approach with dplyr and tidyr:
library(dplyr)
library(tidyr)
res <- crossing(city=1:50,id=1:6) %>%
mutate(obs = rnorm(n())) %>%
right_join(., df, by = c("city","id"))
which gives:
> res
Source: local data frame [500 x 3]
city id obs
(int) (int) (dbl)
1 1 1 -0.5335660
2 1 2 1.0582001
3 1 3 -1.3888310
4 1 4 1.8519262
5 1 5 -0.9971686
6 1 6 1.3508046
7 1 7 NA
8 1 8 NA
9 1 9 NA
10 1 10 NA
.. ... ... ...

Loop or apply for sum of rows based on multiple conditions in R dataframe

I've hacked together a quick solution to my problem, but I have a feeling it's quite obtuse. Moreover, it uses for loops, which from what I've gathered, should be avoided at all costs in R. Any and all advice to tidy up this code is appreciated. I'm still pretty new to R, but I fear I'm making a relatively simple problem much too convoluted.
I have a dataset as follows:
id count group
2 6 A
2 8 A
2 6 A
8 5 A
8 6 A
8 3 A
10 6 B
10 6 B
10 6 B
11 5 B
11 6 B
11 7 B
16 6 C
16 2 C
16 0 C
18 6 C
18 1 C
18 6 C
I would like to create a new dataframe that contains, for each unique ID, the sum of the first two counts of that ID (e.g. 6+8=14 for ID 2). I also want to attach the correct group identifier.
In general you might need to do this when you measure a value on consecutive days for different subjects and treatments, and you want to compute the total for each subject for the first x days of measurement.
This is what I've come up with:
id <- c(rep(c(2,8,10,11,16,18),each=3))
count <- c(6,8,6,5,6,3,6,6,6,5,6,7,6,2,0,6,1,6)
group <- c(rep(c("A","B","C"),each=6))
df <- data.frame(id,count,group)
newid<-c()
newcount<-c()
newgroup<-c()
for (i in 1:length(unique(df$"id"))) {
newid[i] <- unique(df$"id")[i]
newcount[i]<-sum(df[df$"id"==unique(df$"id")[i],2][1:2])
newgroup[i] <- as.character(df$"group"[df$"id"==newid[i]][1])
}
newdf<-data.frame(newid,newcount,newgroup)
Some possible improvements/alternatives I'm not sure about:
For loops vs apply functions
Can I create a dataframe directly inside a for loop or should I stick to creating vectors I can later assign to a dataframe?
More consistent approaches to accessing/subsetting vectors/columns ($, [], [[]], subset?)
You could do this using data.table
setDT(df)[, list(newcount = sum(count[1:2])), by = .(id, group)]
# id group newcount
#1: 2 A 14
#2: 8 A 11
#3: 10 B 12
#4: 11 B 11
#5: 16 C 8
#6: 18 C 7
You could use dplyr:
library(dplyr)
df %>% group_by(id,group) %>% slice(1:2) %>% summarise(newcount=sum(count))
The pipe syntax makes it easy to read: group your data by id and group, take the first two rows for each group, then sum the counts.
You can try to use a self-defined function in aggregate
sum1sttwo<-function (x){
return(x[1]+x[2])
}
aggregate(count~id+group, data=df,sum1sttwo)
and the output is:
id group count
1 2 A 14
2 8 A 11
3 10 B 12
4 11 B 11
5 16 C 8
6 18 C 7
04/2015 edit: dplyr and data.table are definitely better choices when your data set is large, because base R data frame operations such as aggregate can be slow at that scale. However, if you just need to aggregate a very simple/small data set, the aggregate function in base R serves the purpose.
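To cover the "first x days" case from the question, the helper can take the number of rows as a parameter. The sketch below is one possible generalisation; sum_first_n is a hypothetical name. It uses head(), which returns fewer elements when a group has fewer than n rows, whereas x[1]+x[2] would give NA in that case.

```r
id    <- rep(c(2, 8, 10, 11, 16, 18), each = 3)
count <- c(6, 8, 6, 5, 6, 3, 6, 6, 6, 5, 6, 7, 6, 2, 0, 6, 1, 6)
group <- rep(c("A", "B", "C"), each = 6)
df    <- data.frame(id, count, group)

# Sum the first n values of x (or all of them if x is shorter than n)
sum_first_n <- function(x, n = 2) sum(head(x, n))

# Extra arguments after FUN are passed on to sum_first_n
res <- aggregate(count ~ id + group, data = df, FUN = sum_first_n, n = 2)
res
```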
library(plyr)
# Keep the first 2 counts for each id and group (returned as columns V1 and V2)
df2 <- ddply(df, c("id","group"), function (x) x$count[1:2])
# Sum the two counts within each id and group
df3 <- ddply(df2, c("id", "group"), summarize, count=V1+V2)
df3
id group count
1 2 A 14
2 8 A 11
3 10 B 12
4 11 B 11
5 16 C 8
6 18 C 7

Extract rows from data frame based on multiple identifiers in another data frame

I would like to extract a selection of rows from a data frame based on multiple identifying variables contained in another data frame. Consider the following illustrative data set:
df <- data.frame(id=c(1,2,2,3,4,4,4,4,5), ref=c("A","B","C","D","E","F","F","G","H"), amount=c(10,15,20,25,30,35,-35,40,45))
required <- data.frame(id=c(2,3,4,4), ref=c("B","D","E","F"))
I would like the output in a data frame with id, ref and amount as follows:
id ref amount
2 B 15
3 D 25
4 E 30
4 F 35
4 F -35
Note in particular that id 4 and ref F have two matches from the df with amounts 35 and -35.
You want to merge:
merge(df, required)
## id ref amount
## 1 2 B 15
## 2 3 D 25
## 3 4 E 30
## 4 4 F 35
## 5 4 F -35
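merge() with no by argument joins on all columns the two data frames share, which here are id and ref. A slightly more explicit sketch names the key columns and guards against duplicated rows in required, which would otherwise duplicate the matches. The variable name out is made up for illustration.

```r
df <- data.frame(id     = c(1, 2, 2, 3, 4, 4, 4, 4, 5),
                 ref    = c("A", "B", "C", "D", "E", "F", "F", "G", "H"),
                 amount = c(10, 15, 20, 25, 30, 35, -35, 40, 45))
required <- data.frame(id = c(2, 3, 4, 4), ref = c("B", "D", "E", "F"))

# Name the keys explicitly; unique() keeps each required pair only once
out <- merge(df, unique(required), by = c("id", "ref"))
out
```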
