Aggregating multiple columns from data frame [duplicate] - r

This question already has answers here:
Reshaping a data.frame so a column containing multiple features becomes multiple binary columns
(4 answers)
Closed 5 years ago.
I have a data frame in which some cells contain several values joined by commas. It looks something like:
df <- data.frame(
  c(2012, 2012, 2012, 2013, 2013, 2013, 2014, 2014, 2014),
  c("a,b,c", "d,e,f", "a,c,d,c", "a,a,a", "b", "c,a,d", "g", "a,b,e", "g,h,i")
)
names(df) <- c("year", "type")
I want to reshape it into the kind of form dcast produces, with year, a, b, c, etc. as the columns and the frequency counts across the data frame in the cells of the result. I first tried colsplit on df and then dcast, but that only seems to let me aggregate on one of the split columns rather than all of them.
df2 <- data.frame( df$year, colsplit(df$type, ',' , c('v1','v2','v3','v4','v5')) )
df3 <- dcast(df2, df.year ~ v1)
This only gives me counts for the first column produced by colsplit, not all of them. Am I close to a solution, or should I be using a different approach entirely?

Here is a one-line base R option: split the 'type' column with strsplit, set the names of the resulting list to 'year', stack it into a single data.frame, and get the frequency counts with table.
table(stack(setNames(strsplit(as.character(df$type), ","), df$year))[2:1])
# values
#ind a b c d e f g h i
# 2012 2 1 3 2 1 1 0 0 0
# 2013 4 1 1 1 0 0 0 0 0
# 2014 1 1 0 0 1 0 2 1 1
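If a regular data.frame is preferred over the table object, the result can be converted afterwards, e.g. (a small follow-up sketch, not part of the original answer):
# same table as above, then flattened to a wide data.frame
tab <- table(stack(setNames(strsplit(as.character(df$type), ","), df$year))[2:1])
res <- data.frame(year = rownames(tab), as.data.frame.matrix(tab), row.names = NULL)
res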

You are close to the solution; you just need one more step. Melt all the split values into a single column before the dcast. See the example below.
require(reshape2)
df <- data.frame(c(2012,2012,2012,2013,2013,2013,2014,2014,2014),
c("a,b,c","d,e,f","a,c,d,c","a,a,a","b","c,a,d","g","a,b,e","g,h,i"))
names(df) <- c("year", "type")
df
df2 <- data.frame(df$year, colsplit(df$type, ',', c('v1','v2','v3','v4','v5')))
df2
df3 <- melt(df2, id.vars = "df.year", na.rm = T)
df3
df4 <- dcast(df3[df3$value != "", ], df.year ~ value, fun.aggregate = length)
df4

Here's a data.table approach:
library(data.table)
setDT(df)
dcast(df[, .(unlist(strsplit(as.character(type), ",", fixed=TRUE))), by = year],
year ~ V1, value.var = "V1", fun.aggregate = length)
# year a b c d e f g h i
#1: 2012 2 1 3 2 1 1 0 0 0
#2: 2013 4 1 1 1 0 0 0 0 0
#3: 2014 1 1 0 0 1 0 2 1 1
We first split the type column on commas within each year group to get a long format, then dcast to wide using length as the aggregate function.
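To inspect the intermediate long table that feeds the dcast (a quick check only, not needed for the final result):
# after the setDT(df) above: one row per (year, individual type value)
long <- df[, .(unlist(strsplit(as.character(type), ",", fixed = TRUE))), by = year]
head(long)   # the unnamed expression gets data.table's default column name V1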

Maybe, something like this could work?
# extract unique values and years
vals <- unique(do.call(c, strsplit(x = as.vector(df$type), "[[:punct:]]")))
years <- unique(df$year)
# count
df4 <- data.frame(sapply(vals, function(vl) {
  sapply(years, function(ye) {
    sum(do.call(c, strsplit(as.vector(df$type[df$year == ye]), "[[:punct:]]")) == vl)
  })
}))
df4 <- cbind(years, df4)
df4
#result
years a b c d e f g h i
1 2012 2 1 3 2 1 1 0 0 0
2 2013 4 1 1 1 0 0 0 0 0
3 2014 1 1 0 0 1 0 2 1 1
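A tidyverse route gives the same counts, in case a non-reshape2 alternative is useful (a sketch, assuming a reasonably recent tidyr, roughly 1.1+):
library(dplyr)
library(tidyr)
df %>%
  mutate(type = as.character(type)) %>%   # in case type was read as a factor
  separate_rows(type, sep = ",") %>%      # one row per (year, single value)
  count(year, type) %>%                   # frequency per combination
  pivot_wider(names_from = type, values_from = n, values_fill = 0)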

Related

Create new columns based on data based on factor values in another column in R [duplicate]

This question already has an answer here:
Split a column into multiple binary dummy columns [duplicate]
(1 answer)
Closed 5 years ago.
I'm interested in taking a data.frame column whose values are pipe-delimited and creating dummy variables from those delimited values.
For example:
Let's say we start with
df = data.frame(a = c("Ben|Chris|Jim", "Ben|Greg|Jim|", "Jim|Steve|Ben"))
> df
a
1 Ben|Chris|Jim
2 Ben|Greg|Jim
3 Jim|Steve|Ben
I'm interested in ending up with:
df2 = data.frame(Ben = c(1, 1, 1), Chris = c(1, 0, 0), Jim = c(1, 1, 1), Greg = c(0, 1, 0),
Steve = c(0, 0, 1))
> df2
Ben Chris Jim Greg Steve
1 1 1 1 0 0
2 1 0 1 1 0
3 1 0 1 0 1
I don't know in advance how many potential values there are within the field. In the example above, the variable "a" can include 1 value or 10 values. Assume it is a reasonable number (i.e., < 100 possible values).
Any good ways to do this?
Another way is to use cSplit_e from the splitstackshape package: split the data frame on column a, fill missing entries with 0, and drop the original column.
library(splitstackshape)
cSplit_e(df, "a", "|", type = "character", fill = 0, drop = T)
# a_Ben a_Chris a_Greg a_Jim a_Steve
#1 1 1 0 1 0
#2 1 0 1 1 0
#3 1 0 0 1 1
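If the a_ prefix in the column names is not wanted, it can be stripped afterwards, for example (a small follow-up sketch, not part of the original answer):
library(splitstackshape)
res <- cSplit_e(df, "a", "|", type = "character", fill = 0, drop = TRUE)
names(res) <- sub("^a_", "", names(res))   # Ben, Chris, Greg, Jim, Steve
res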
Here is one option using dplyr and tidyr:
library(dplyr)
library(tidyr)
df %>% tibble::rownames_to_column(var = "id") %>%
mutate(a = strsplit(as.character(a), "\\|")) %>%
unnest() %>% table()
# a
# id Ben Chris Greg Jim Steve
# 1 1 1 0 1 0
# 2 1 0 1 1 0
# 3 1 0 0 1 1
The analogue in base R is:
df$a <- as.character(df$a)
s <- strsplit(df$a, "|", fixed=TRUE)
table(id = rep(1:nrow(df), lengths(s)), v = unlist(s))
Data:
df = data.frame(a = c("Ben|Chris|Jim", "Ben|Greg|Jim", "Jim|Steve|Ben"))
We can use mtabulate from qdapTools after splitting the 'a' column
library(qdapTools)
mtabulate(strsplit(as.character(df$a), "|", fixed = TRUE))
# Ben Chris Greg Jim Steve
#1 1 1 0 1 0
#2 1 0 1 1 0
#3 1 0 0 1 1
Here is a method in base R
# get unique set of names
myNames <- unique(unlist(strsplit(as.character(df$a), split="\\|")))
# get indicator data.frame
setNames(data.frame(lapply(myNames, function(i) as.integer(grepl(i, df$a)))), myNames)
which returns
Ben Chris Jim Greg Steve
1 1 1 1 0 0
2 1 0 1 1 0
3 1 0 1 0 1
The first line uses strsplit to produce a list of names split on the pipe "|"; unlist and unique then produce a vector of unique names. The second line loops over these names with lapply and uses grepl to test each row for the name, with as.integer converting the logical result to 0/1. The returned list is converted into a data.frame and given column names with setNames.
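One caveat: grepl matches substrings, so a name contained in another (say "Ben" inside "Benny", which does not happen in this example) would be counted for both. If that can occur, a stricter variant splits first and tests exact membership (a sketch reusing myNames from above):
s <- strsplit(as.character(df$a), "|", fixed = TRUE)
# exact token match instead of substring match
setNames(data.frame(lapply(myNames, function(i)
  vapply(s, function(x) as.integer(i %in% x), integer(1)))), myNames)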

Summing columns from multiple data frames in R

I have multiple data frames that I created in a for loop. They all have the same three columns: XLocs, YLocs, and PatchStatus. XLocs and YLocs contain the same coordinates in every data frame. PatchStatus is either 0 or 1 depending on how the model ran. Data frame 1 looks like:
print(listofdfs[1])
allPoints.xLocs allPoints.yLocs allPoints.patchStatus
1 73.5289654 8.8633913 0
2 21.0795393 44.4840248 0
3 51.5969348 21.7864016 0
4 61.9007129 32.4763183 1
5 62.3447741 41.0651838 1
6 16.9311605 6.3765206 0
And data frame 2 looks like:
print(listofdfs[2])
allPoints.xLocs allPoints.yLocs allPoints.patchStatus
1 73.5289654 8.8633913 0
2 21.0795393 44.4840248 1
3 51.5969348 21.7864016 0
4 61.9007129 32.4763183 1
5 62.3447741 41.0651838 1
6 16.9311605 6.3765206 0
I'm hoping to end up with one resulting data frame that has XLocs, YLocs, and the SUM of PatchStatus (note I plan on combining 15 data frames, so the summed PatchStatus can be between 0 and 15).
I posted this answer along with a heatmap (Plot 3) here: https://stackoverflow.com/a/60974584/1691723
library('data.table')
df2 <- rbindlist(l = listofdfs)
df2 <- df2[, .(sum_patch = sum(allPoints.patchStatus)), by = .(allPoints.xLocs, allPoints.yLocs)]
We can bind the datasets together and do a group-by sum:
library(dplyr)
bind_rows(listofdfs) %>%
group_by( allPoints.xLocs, allPoints.yLocs) %>%
summarise(allPoints.patchStatus = sum(allPoints.patchStatus))
Or using rbind and aggregate from base R
aggregate(allPoints.patchStatus ~ ., do.call(rbind, listofdfs), FUN = sum)
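For a quick self-contained check, a small list of data frames with shared coordinates can be faked like this (hypothetical toy data, only to exercise the answers above; all three approaches should agree on the sums):
set.seed(1)
pts <- data.frame(allPoints.xLocs = runif(3, 0, 100),
                  allPoints.yLocs = runif(3, 0, 100))
listofdfs <- lapply(1:3, function(i)
  cbind(pts, allPoints.patchStatus = sample(0:1, 3, replace = TRUE)))
# e.g. the base R version:
aggregate(allPoints.patchStatus ~ ., do.call(rbind, listofdfs), FUN = sum)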

Counting occurrences by row

Imagine I have a data.frame (or matrix) with a few distinct values, such as this:
test <- data.frame(replicate(10,sample(c(-1,0,1),20, replace=T, prob=c(0.2,0.2,0.6))))
test2 <- test
If I want to add extra columns with counts I could do:
test2$good <- apply(test,1, function(x) sum(x==1))
test2$bad <- apply(test,1, function(x) sum(x==-1))
test2$neutral <- apply(test,1, function(x) sum(x==0))
But if I had many possible values I would have to write many such lines, which wouldn't be elegant.
I've tried table(), but the output is not easy to use:
apply(test, 1, function(x) table(x))
and there is a bigger problem: if a row doesn't contain any occurrence of some value, the result produced by table() has a different length for that row, so the results can't be bound together.
Is there a way to force table() to take that value into account and report zero occurrences?
Then I've thought of using do.call or lapply and merge but it's too difficult for me.
I've also read about dplyr count but I have no clue on how to do it.
Could anyone provide a solution with dplyr or tidyr?
P.S.: What about a data.table solution?
We could melt the dataset to long format after converting it to a matrix, get the frequencies using table, and cbind with the original dataset.
library(reshape2)
cbind(test2, as.data.frame.matrix(table(melt(as.matrix(test2))[-2])))
Or use mtabulate on the transpose of 'test2' and cbind with the original dataset.
library(qdapTools)
cbind(test2, mtabulate(as.data.frame(t(test2))))
Or we can use gather/spread from tidyr after creating row id with add_rownames from dplyr
library(dplyr)
library(tidyr)
add_rownames(test2) %>%
gather(Var, Val, -rowname) %>%
group_by(rn= as.numeric(rowname), Val) %>%
summarise(N=n()) %>%
spread(Val, N, fill=0) %>%
bind_cols(test2, .)
You can use rowSums():
test2 <- cbind(test2, sapply(c(-1, 0, 1), function(x) rowSums(test == x)))
Similar to the code in the comment from etienne, but without the call to apply().
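To get the readable column names from the question (good/bad/neutral) rather than the default numeric ones, the counts can be named before binding (a small extension of the same rowSums idea):
vals <- c(-1, 0, 1)
counts <- sapply(vals, function(x) rowSums(test == x))
colnames(counts) <- c("bad", "neutral", "good")   # in the same order as vals
test2 <- cbind(test, counts)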
Here is an answer using base R.
test <- data.frame(replicate(10,sample(c(-1,0,1),20, replace=T, prob=c(0.2,0.2,0.6))))
testCopy <- test
# find all unique values, note that data frame is a list
uniqVal <- unique(unlist(test))
# the new column names start with Y
for (val in uniqVal) {
test[paste0("Y",val)] <- apply(testCopy, 1, function(x) sum(x == val))
}
head(test)
# X1 X2 X3 X4 X5 X6 X7 X8 X9 X10 Y-1 Y1 Y0
# 1 -1 0 1 1 1 0 -1 -1 1 1 3 5 2
# 2 1 -1 0 1 1 -1 -1 0 0 1 3 4 3
# 3 -1 0 1 0 1 1 1 1 -1 1 2 6 2
# 4 1 1 1 1 0 1 1 0 1 0 0 7 3
# 5 0 -1 1 -1 -1 0 0 1 0 0 3 2 5
# 6 1 1 0 1 1 1 1 1 1 1 0 9 1
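Regarding the asker's point that table() gives results of different lengths per row: fixing the factor levels forces every value to appear, with zeros where a value is absent, so the per-row results line up (a sketch, not from the original answers):
lv <- c(-1, 0, 1)
# every row now reports counts for all three values, zeros included
counts <- t(apply(test, 1, function(x) table(factor(x, levels = lv))))
head(counts)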

Reshaping a data.frame so a column containing multiple features becomes multiple binary columns

I have a dataframe like this
df <-data.frame(id = c(1,2),
value = c(25,24),
features = c("A,B,D,F","C,B,E"))
print(df)
id,value,features
1,25,"A,B,D,F"
2,24,"C,B,E"
I want to reshape it into this:
id,value,A,B,C,D,E,F
1,25,1,1,0,1,0,1
2,24,0,1,1,0,1,0
I'm guessing that the first step would be to identify the unique values in the df$features column, but once I have that list, I'm not sure what an efficient (i.e. vectorized) way to create the final dataset would be.
This feels like an operation for dplyr or reshape2 but I'm not sure how to approach this.
This is yet another use case for merge after suitable transformation.
library(reshape2)
f <- with(df, stack(setNames(strsplit(as.character(features), ","), id)))
d <- dcast(f, ind ~ values, length, value.var = "ind")
out <- merge(df[, 1:2], d, by.x = "id", by.y = "ind")
print(out)
id value A B C D E F
1 1 25 1 1 0 1 0 1
2 2 24 0 1 1 0 1 0
This can also be done using only default libraries (without reshape2) in a variety of slightly messier ways. In the above, you can substitute the d and out lines with the following instead:
d <- xtabs(count ~ ind + values, transform(f, count = 1))
out <- merge(df[, 1:2], as.data.frame.matrix(d), by.x = "id", by.y = "row.names")
You can do:
library(splitstackshape)
library(qdapTools)
df1 = data.frame(cSplit(df, 'features', sep=',', type.convert=F))
cbind(df1[1:2], mtabulate(as.data.frame(t(df1[-c(1,2)]))))
# id value A B C D E F
#1: 1 25 1 1 0 1 0 1
#2: 2 24 0 1 1 0 1 0
Another one using splitstackshape and data.table (v1.9.5+):
require(splitstackshape)
require(data.table) # v1.9.5+
ans <- cSplit(df, 'features', sep = ',', 'long')
dcast(ans, id + value ~ features, fun.aggregate = length)
# id value A B C D E F
# 1: 1 25 1 1 0 1 0 1
# 2: 2 24 0 1 1 0 1 0
If you're using data.table v1.9.4, then replace dcast with dcast.data.table.
Alternatively, you can use cSplit_e, like this:
cSplit_e(df, "features", ",", type = "character", fill = 0)
## id value features features_A features_B features_C features_D features_E features_F
## 1 1 25 A,B,D,F 1 1 0 1 0 1
## 2 2 24 C,B,E 0 1 1 0 1 0
A dplyr/tidyr solution
library(dplyr)
library(tidyr)
separate(df,features,1:4,",",extra="merge") %>%
gather(key,letter,-id,-value) %>%
filter(!is.na(letter)) %>%
select(-key) %>%
mutate(n=1) %>%
spread(letter,n) %>%
mutate_each(funs(ifelse(is.na(.),0,1)),A:F)
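With more recent tidyr versions (roughly 1.1+), the same reshape can be written more compactly with separate_rows() and pivot_wider() (a sketch, not one of the original answers):
library(dplyr)
library(tidyr)
df %>%
  mutate(features = as.character(features)) %>%   # in case features was read as a factor
  separate_rows(features, sep = ",") %>%          # one row per (id, feature)
  mutate(present = 1L) %>%                        # indicator column
  pivot_wider(names_from = features, values_from = present, values_fill = 0L)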
