I have a dataset with millions of rows and I need to apply a 'group by' operation to it using R.
The data is of the form
V1 V2 V3
a u 1
a v 2
b w 3
b x 4
c y 5
c z 6
Using a 'group by' in R, I want to add up the values in column 3 and concatenate the values in column 2, like this:
V1 V2 V3
a uv 3
b wx 7
c yz 11
I have tried doing the operation in Excel, but there are too many rows for Excel to handle. I am new to R, so any help would be appreciated.
There are many possible ways to solve this; here are two:
library(data.table)
setDT(df)[, .(V2 = paste(V2, collapse = ""), V3 = sum(V3)), by = V1]
# V1 V2 V3
# 1: a uv 3
# 2: b wx 7
# 3: c yz 11
Or
library(dplyr)
df %>%
group_by(V1) %>%
summarise(V2 = paste(V2, collapse = ""), V3 = sum(V3))
# Source: local data table [3 x 3]
#
# V1 V2 V3
# 1 a uv 3
# 2 b wx 7
# 3 c yz 11
Data
df <- structure(list(V1 = structure(c(1L, 1L, 2L, 2L, 3L, 3L), .Label = c("a",
"b", "c"), class = "factor"), V2 = structure(1:6, .Label = c("u",
"v", "w", "x", "y", "z"), class = "factor"), V3 = 1:6), .Names = c("V1",
"V2", "V3"), class = "data.frame", row.names = c(NA, -6L))
Another option, using aggregate
# Group column 2
ag.2 <- aggregate(df$V2, by=list(df$V1), FUN = paste0, collapse = "")
# Group column 3
ag.3 <- aggregate(df$V3, by=list(df$V1), FUN = sum)
# Merge the two
res <- cbind(ag.2, ag.3[,-1])
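The cbind result above ends up with generic column names (such as Group.1 and x). A minimal base R sketch, assuming data shaped like the question's df, that keeps the original names by using aggregate's formula interface and merging on V1 (ag.V2, ag.V3 and res are names chosen here for illustration):

```r
# Sample data with the same shape as in the question (assumed, character columns)
df <- data.frame(V1 = c("a", "a", "b", "b", "c", "c"),
                 V2 = c("u", "v", "w", "x", "y", "z"),
                 V3 = 1:6,
                 stringsAsFactors = FALSE)

# Concatenate V2 within each V1 group
ag.V2 <- aggregate(V2 ~ V1, data = df, FUN = function(x) paste(x, collapse = ""))
# Sum V3 within each V1 group
ag.V3 <- aggregate(V3 ~ V1, data = df, FUN = sum)

# Join the two summaries on the grouping column
res <- merge(ag.V2, ag.V3, by = "V1")
res
```

Merging on V1 rather than binding columns also avoids relying on both aggregate calls returning the groups in the same order.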
Another option with sqldf
library(sqldf)
sqldf('select V1,
group_concat(V2,"") as V2,
sum(V3) as V3
from df
group by V1')
# V1 V2 V3
#1 a uv 3
#2 b wx 7
#3 c yz 11
Or using base R
do.call(rbind,lapply(split(df, df$V1), function(x)
with(x, data.frame(V1=V1[1L], V2= paste(V2, collapse=''), V3= sum(V3)))))
Using ddply
library(plyr)
ddply(df, .(V1), summarize, V2 = paste(V2, collapse=''), V3 = sum(V3))
# V1 V2 V3
#1 a uv 3
#2 b wx 7
#3 c yz 11
You could also just use the groupBy function in the 'caroline' package:
x <- cbind.data.frame(V1=rep(letters[1:3],each=2), V2=letters[21:26], V3=1:6, stringsAsFactors=FALSE)
groupBy(df=x, clmns=c('V2','V3'),by='V1',aggregation=c('paste','sum'),collapse='')
Given two dataframes df1 and df2 as follows:
df1:
df1 <- structure(list(A = 1L, B = 2L, C = 3L, D = 4L, G = 5L), class = "data.frame", row.names = c(NA,
-1L))
Out:
A B C D G
1 1 2 3 4 5
df2:
df2 <- structure(list(Col1 = c("A", "B", "C", "D", "X"), Col2 = c("E",
"Q", "R", "Z", "Y")), class = "data.frame", row.names = c(NA,
-5L))
Out:
Col1 Col2
1 A E
2 B Q
3 C R
4 D Z
5 X Y
I need to rename the columns of df1 using df2, except column G, since it is not in df2's Col1.
I use df2$Col2[match(names(df1), df2$Col1)] based on the answer from here, but it returns "E" "Q" "R" "Z" NA; as you can see, column G becomes NA. I would like it to keep its original name.
The expected result:
E Q R Z G
1 1 2 3 4 5
How could I deal with this issue? Thanks.
Using na.omit (it's a little bit messy):
colnames(df1)[na.omit(match(names(df1), df2$Col1))] <- df2$Col2[na.omit(match(names(df1), df2$Col1))]
df1
E Q R Z G
1 1 2 3 4 5
I was able to reproduce your error with
df2 <- data.frame(
Col1 = c("H","I","K","A","B","C","D"),
Col2 = c("a1","a2","a3","E","Q","R","Z")
)
The problem is the order of df2$Col1 and names(df1) in match.
na.omit(match(names(df1), df2$Col1))
gives [1] 4 5 6 7. These are positions in df2, and 6 and 7 do not exist in df1, which has only 5 columns.
To index df1, we should swap the arguments to match: na.omit(match(df2$Col1, names(df1))) gives [1] 1 2 3 4
colnames(df1)[na.omit(match(df2$Col1, names(df1)))] <- df2$Col2[na.omit(match(names(df1), df2$Col1))]
This will work.
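The two na.omit calls above line up only when the matching names appear in the same relative order in df1 and df2. A minimal sketch that does not depend on that order, assuming a df2 whose rows are deliberately shuffled (idx and found are illustrative names, not from the original answers):

```r
df1 <- data.frame(A = 1L, B = 2L, C = 3L, D = 4L, G = 5L)
# Lookup table in a different order than the columns of df1 (assumed for illustration)
df2 <- data.frame(Col1 = c("X", "D", "C", "B", "A"),
                  Col2 = c("Y", "Z", "R", "Q", "E"),
                  stringsAsFactors = FALSE)

# For each column of df1, the row of df2 that holds its new name (NA if absent)
idx <- match(names(df1), df2$Col1)
found <- !is.na(idx)

# Rename only the columns that have an entry in df2; the rest keep their names
names(df1)[found] <- df2$Col2[idx[found]]
names(df1)
```

Computing match once and reusing the same logical mask on both sides keeps each old name paired with its own replacement.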
A solution using the rename_with function from the dplyr package.
library(dplyr)
df3 <- df2 %>%
filter(Col1 %in% names(df1))
df4 <- df1 %>%
rename_with(.cols = df3$Col1, .fn = function(x) df3$Col2[match(x, df3$Col1)])
df4
# E Q R Z G
# 1 1 2 3 4 5
Example df:
index name V1 V2 etc
1 x 2 1
2 y 1 2
3 z 3 4
4 w 4 3
I would like to replace the values in columns V1 and V2 with the value from the name column of the row whose index they refer to. The output should look like this:
index name V1 V2 etc
1 x y x
2 y x y
3 z z w
4 w w z
I have tried multiple merge statements in a loop, but I am not sure how to replace the values instead of creating new columns, and I also got a duplicate name error.
k<-2 # number of V columns
names<-c()
for (i in 1:k){names[[i]]<-paste0('V',i)}
lookup_table<-df[,c('index','name'),drop=FALSE] # it's at unique index level
for(col in names){
df<- merge(df,lookup_table,by.x=col,by.y="index",all.x = TRUE)
}
We can do
df[3:4] <- lapply(df[3:4], function(x) df$name[x])
Or without looping
df[3:4] <- df$name[as.matrix(df[3:4])]
df
# index name V1 V2
#1 1 x y x
#2 2 y x y
#3 3 z z w
#4 4 w w z
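Both versions above use the values in V1 and V2 as row positions, which works because index runs from 1 to the number of rows. A hedged sketch for the case where index holds arbitrary keys, using match to find the right row (the data and the cols vector are made up for illustration):

```r
# Assumed example where index is not simply 1..n
df <- data.frame(index = c(10L, 20L, 30L, 40L),
                 name  = c("x", "y", "z", "w"),
                 V1    = c(20L, 10L, 30L, 40L),
                 V2    = c(10L, 20L, 40L, 30L),
                 stringsAsFactors = FALSE)

cols <- c("V1", "V2")  # columns holding index references
# Look up each reference in df$index and take the name from that row
df[cols] <- lapply(df[cols], function(x) df$name[match(x, df$index)])
df
```

References that have no matching index come back as NA instead of silently picking the wrong row.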
data
df <- structure(list(index = 1:4, name = c("x", "y", "z", "w"), V1 = c(2L,
1L, 3L, 4L), V2 = c(1L, 2L, 4L, 3L)), .Names = c("index", "name",
"V1", "V2"), class = "data.frame", row.names = c(NA, -4L))
I have a dataframe of 3 columns
A B 1
A B 1
A C 1
B A 1
I want to aggregate it such that it considers combinations A-B and B-A to be the same, resulting in
A B 3
A C 1
How do I go about this?
Use pmin and pmax on the first two columns and then do the group-by-count:
library(dplyr)
df %>% group_by(G1 = pmin(V1, V2), G2 = pmax(V1, V2)) %>% summarise(Count = sum(V3))
Source: local data frame [2 x 3]
Groups: G1 [?]
G1 G2 Count
(chr) (chr) (int)
1 A B 3
2 A C 1
Corresponding data.table solution would be:
library(data.table)
setDT(df)
df[, .(Count = sum(V3)), .(G1 = pmin(V1, V2), G2 = pmax(V1, V2))]
G1 G2 Count
1: A B 3
2: A C 1
Data:
df <- structure(list(V1 = c("A", "A", "A", "B"), V2 = c("B", "B", "C",
"A"), V3 = c(1L, 1L, 1L, 1L)), .Names = c("V1", "V2", "V3"), row.names = c(NA,
-4L), class = "data.frame")
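The same pmin/pmax idea can be sketched in base R with aggregate, assuming V1 and V2 are character columns as in the data above (key and res are illustrative names):

```r
df <- data.frame(V1 = c("A", "A", "A", "B"),
                 V2 = c("B", "B", "C", "A"),
                 V3 = c(1L, 1L, 1L, 1L),
                 stringsAsFactors = FALSE)

# Order-independent pair: the smaller label goes to G1, the larger to G2
key <- data.frame(G1 = pmin(df$V1, df$V2),
                  G2 = pmax(df$V1, df$V2),
                  stringsAsFactors = FALSE)

# Sum V3 within each unordered pair
res <- aggregate(df$V3, by = key, FUN = sum)
names(res)[3] <- "Count"
res
```

If V1 and V2 are factors, they would need to be converted with as.character first, since pmin and pmax compare the character values here.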
I have a data.frame mydata like this
V1 V2 V3 V4 V5
1 a b a
2 a b c
3 a b d
4 x y h
5 x y k e
I want to group it by the columns V1 and V2, and remove the "" strings in the other columns.
The result should look like this
V1 V2 V3 V4 V5
1 a b a c d
2 x y h k e
Is there an efficient way to do this using the dplyr package? Thank you very much.
Using base R, if that's of interest
x <- data.frame(V1 = c(rep("a", 3), "x", "x"),
V2 = c(rep("b", 3), "y", "y"),
V3= c("a", "", "", "h", ""),
V4 = c("", "c", "", "", "k"),
V5 = c(rep("", 2), "d", "", "e"))
temp <- lapply(x[], function(y) as.character(unique(y[y != ""])))
data.frame(do.call(cbind,temp))
V1 V2 V3 V4 V5
1 a b a c d
2 x y h k e
We can use dplyr/tidyr. We reshape the data from 'wide' to 'long' using gather, remove the blank elements in the 'Val' column with filter, and reshape it back to 'wide' format with spread.
library(dplyr)
library(tidyr)
gather(mydata, Var, Val, V3:V5) %>%
filter(Val!='') %>%
spread(Var, Val)
# V1 V2 V3 V4 V5
#1 a b a c d
#2 x y h k e
Or another approach using only dplyr (if the number of non-blank values is the same across groups) would be to group by 'V1', 'V2', and use summarise_each to select only the elements that are not blank (.[.!=''])
mydata %>%
group_by(V1, V2) %>%
summarise_each(funs(.[.!='']))
# V1 V2 V3 V4 V5
#1 a b a c d
#2 x y h k e
We can also use data.table to do this. We convert the 'data.frame' to 'data.table' (setDT(mydata)), grouped by 'V1', 'V2', we loop through the other columns (lapply(.SD, ...)) and subset the elements that are not blank.
library(data.table)
setDT(mydata)[,lapply(.SD, function(x) x[x!='']) ,.(V1, V2)]
# V1 V2 V3 V4 V5
#1: a b a c d
#2: x y h k e
Similar approach using aggregate from base R is
aggregate(.~V1+V2, mydata, FUN=function(x) x[x!=''])
# V1 V2 V3 V4 V5
#1 a b a c d
#2 x y h k e
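The subsetting x[x!=''] returns one value per group only when each group has exactly one non-blank entry per column. A hedged base R sketch that pastes the non-blank values together, so a group with several (or no) non-blank entries still yields a single row; the "," separator is an arbitrary choice made here:

```r
mydata <- data.frame(V1 = c("a", "a", "a", "x", "x"),
                     V2 = c("b", "b", "b", "y", "y"),
                     V3 = c("a", "", "", "h", ""),
                     V4 = c("", "c", "", "", "k"),
                     V5 = c("", "", "d", "", "e"),
                     stringsAsFactors = FALSE)

# Group by V1 and V2; keep the non-blank values of every other column,
# collapsed into one string per group
res <- aggregate(mydata[c("V3", "V4", "V5")],
                 by = mydata[c("V1", "V2")],
                 FUN = function(x) paste(x[x != ""], collapse = ","))
res
```

For this sample data the result matches the output shown above, because every group has exactly one non-blank value in each column.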
data
mydata <- structure(list(V1 = c("a", "a", "a", "x", "x"),
V2 = c("b", "b",
"b", "y", "y"), V3 = c("a", "", "", "h", ""), V4 = c("", "c",
"", "", "k"), V5 = c("", "", "d", "", "e")), .Names = c("V1",
"V2", "V3", "V4", "V5"), class = "data.frame", row.names = c("1",
"2", "3", "4", "5"))
So here is my challenge. I am trying to collapse rows of data that would be better organized as columns. The original data set looks like
1|1|a
2|3|b
2|5|c
1|4|d
1|2|e
10|10|f
And the end result desired is
1 |1,2,4 |a| e d
2 |3,5 |b| c
10|10 |f| NA
The reshaping is based on the minimum value of Col 2 within each group of Col 1: the new column 3 holds the value from the row with the group's minimum Col 2, and the new column 4 collapses the values from the remaining rows. Some of the approaches tried include:
newTable[min(newTable[,(1%o%2)]),] ## returns the minimum of both COL 1 and 2 only
ddply(newTable,"V1", summarize, newCol = paste(V7,collapse = " ")) ## collapses all values by Col 1 and creates a new column nicely.
Attempts to combine these lines of code into a single line have not worked, in part due to my limited knowledge. Those attempts are not included here.
Try:
library(dplyr)
library(tidyr)
dat %>%
group_by(V1) %>%
summarise_each(funs(paste(sort(.), collapse=","))) %>%
extract(V3, c("V3", "V4"), "(.),?(.*)")
gives the output
# V1 V2 V3 V4
#1 1 1,2,4 a d,e
#2 2 3,5 b c
#3 10 10 f
Or using aggregate and str_split_fixed
res1 <- aggregate(.~ V1, data=dat, FUN=function(x) paste(sort(x), collapse=","))
library(stringr)
res1[, paste0("V", 3:4)] <- as.data.frame(str_split_fixed(res1$V3, ",", 2),
stringsAsFactors=FALSE)
If you need NA for missing values
res1[res1==''] <- NA
res1
# V1 V2 V3 V4
#1 1 1,2,4 a d,e
#2 2 3,5 b c
#3 10 10 f <NA>
data
dat <- structure(list(V1 = c(1L, 2L, 2L, 1L, 1L, 10L), V2 = c(1L, 3L,
5L, 4L, 2L, 10L), V3 = c("a", "b", "c", "d", "e", "f")), .Names = c("V1",
"V2", "V3"), class = "data.frame", row.names = c(NA, -6L))
Here's an approach using data.table, with data from @akrun's post:
It might be useful to store the columns as list instead of pasting them together.
require(data.table) ## 1.9.2+
setDT(dat)[order(V1, V2), list(V2=list(V2), V3=V3[1L], V4=list(V3[-1L])), by=V1]
# V1 V2 V3 V4
# 1: 1 1,2,4 a e,d
# 2: 2 3,5 b c
# 3: 10 10 f
setDT(dat) converts the data.frame to data.table, by reference (without copying it). Then, we sort it by columns V1,V2 and group by V1 column on the sorted data, and for each group, we create the columns V2, V3 and V4 as shown.
V2 and V4 will be of type list here. If you'd rather have a character column where all entries are pasted together, just replace list(.) with paste(., collapse=...).
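For comparison, the same sort-then-split logic can be sketched in base R with pasted character columns instead of list columns. This is only an illustration under the same dat; the separators ("," for V2, a space for V4) and the NA for a group with no remaining values follow the question's desired output:

```r
dat <- data.frame(V1 = c(1L, 2L, 2L, 1L, 1L, 10L),
                  V2 = c(1L, 3L, 5L, 4L, 2L, 10L),
                  V3 = c("a", "b", "c", "d", "e", "f"),
                  stringsAsFactors = FALSE)

# Sort so that the row with the smallest V2 comes first within each V1 group
sorted <- dat[order(dat$V1, dat$V2), ]

res <- do.call(rbind, lapply(split(sorted, sorted$V1), function(g) {
  rest <- g$V3[-1L]  # values from the rows that are not the group minimum
  data.frame(V1 = g$V1[1L],
             V2 = paste(g$V2, collapse = ","),
             V3 = g$V3[1L],
             V4 = if (length(rest)) paste(rest, collapse = " ") else NA_character_,
             stringsAsFactors = FALSE)
}))
rownames(res) <- NULL
res
```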
HTH