Reshaping data from long to wide with both sums and counts in R

I am trying to reshape data from long to wide format in R. I would like to get both counts of occurrences of a type variable by ID and sums of the values of a second variable (val) by ID and type as in the example below.
I was able to find answers for reshaping with either counts or sums but not for both simultaneously.
This is the original example data:
> df <- data.frame(id = c(1, 1, 1, 2, 2, 2),
+ type = c("A", "A", "B", "A", "B", "C"),
+ val = c(0, 1, 2, 0, 0, 4))
> df
id type val
1 1 A 0
2 1 A 1
3 1 B 2
4 2 A 0
5 2 B 0
6 2 C 4
The output I would like to obtain is the following:
id A.count B.count C.count A.sum B.sum C.sum
1 1 2 1 0 1 2 0
2 2 1 1 1 0 0 4
where the count columns display the number of occurrences of type A, B and C and the sum columns the sum of the values by type.
To achieve the counts I can, as suggested in this answer, use reshape2::dcast with the default aggregation function, length:
> require(reshape2)
> df.c <- dcast(df, id ~ type, value.var = "type", fun.aggregate = length)
> df.c
id A B C
1 1 2 1 0
2 2 1 1 1
Similarly, as suggested in this answer, I can also perform the reshape with the sums as output, this time using the sum aggregation function in dcast:
> df.s <- dcast(df, id ~ type, value.var = "val", fun.aggregate = sum)
> df.s
id A B C
1 1 1 2 0
2 2 0 0 4
I could merge the two:
> merge(x = df.c, y = df.s, by = "id", all = TRUE)
id A.x B.x C.x A.y B.y C.y
1 1 2 1 0 1 2 0
2 2 1 1 1 0 0 4
but is there a way of doing it all in one go (not necessarily with dcast or reshape2)?

As of data.table v1.9.6, dcast can cast multiple value.var columns and apply multiple fun.aggregate functions in a single call. See below:
library(data.table)
df <- data.table(df)
dcast(df, id ~ type, fun.aggregate = list(length, sum), value.var = "val")
id val_length_A val_length_B val_length_C val_sum_A val_sum_B val_sum_C
1: 1 2 1 0 1 2 0
2: 2 1 1 1 0 0 4
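If you want the columns named like the desired output (A.count, ..., A.sum, ...), one option is to rename afterwards with setnames — a small sketch that assumes the default val_length_* / val_sum_* names shown above:
out <- dcast(df, id ~ type, fun.aggregate = list(length, sum), value.var = "val")
setnames(out, old = names(out)[-1],
         new = sub("^val_length_(.*)$", "\\1.count",
                   sub("^val_sum_(.*)$", "\\1.sum", names(out)[-1])))
out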

Here is an approach with the tidyverse:
library(tidyverse)
df %>%
  group_by(id, type) %>%
  summarise(count = n(), Sum = sum(val)) %>%
  gather(key, val, count:Sum) %>%
  unite(typen, type, key, sep = ".") %>%
  spread(typen, val, fill = 0)
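In current tidyr, gather/spread are superseded; a rough pivot_wider equivalent (a sketch assuming tidyr >= 1.0 — the columns come out named n_A, val_A, etc. rather than A.count, A.sum) could be:
library(dplyr)
library(tidyr)
df %>%
  mutate(n = val) %>%  # duplicate val so one copy can be counted and one summed
  pivot_wider(id_cols = id, names_from = type,
              values_from = c(n, val),
              values_fn = list(n = length, val = sum),
              values_fill = 0)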

The data.table solution suggested is probably better, but if you prefer dcast and have many value.var/fun.aggregate combinations, you could also do:
library(purrr)
cols <- c('type', 'val')
funs <- c(length, sum)
map2(cols, funs, ~ dcast(df, id ~ type, value.var = .x, fun.aggregate = .y)) %>%
  reduce(left_join, by = 'id', suffix = c('.count', '.sum'))
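Note that the suffix argument of left_join makes the joined columns come out directly as A.count, B.count, C.count, A.sum, B.sum, C.sum, matching the desired output.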

Related

Calculate row sums by variable names

What's the easiest way to calculate row-wise sums? For example, what if I wanted to calculate the sum of all variables whose names start with "txt_"? (See the example below.)
df <- data.frame(var1 = c(1, 2, 3),
                 txt_1 = c(1, 1, 0),
                 txt_2 = c(1, 0, 0),
                 txt_3 = c(1, 0, 0))
Base R
We can first use grepl to find the column names that contain "txt_", then use rowSums on that subset.
rowSums(df[, grepl("txt_", names(df))])
[1] 3 1 0
If you want the result alongside the original variables, we can bind it back to the original data frame:
cbind(df, sums = rowSums(df[, grepl("txt_", names(df))]))
var1 txt_1 txt_2 txt_3 sums
1 1 1 1 1 3
2 2 1 0 0 1
3 3 0 0 0 0
Tidyverse
library(tidyverse)
df %>%
  mutate(sum = rowSums(across(starts_with("txt_"))))
var1 txt_1 txt_2 txt_3 sum
1 1 1 1 1 3
2 2 1 0 0 1
3 3 0 0 0 0
Or if you want just the vector, then we can use pull:
df %>%
  mutate(sum = rowSums(across(starts_with("txt_")))) %>%
  pull(sum)
[1] 3 1 0
data.table
Here is a data.table option as well:
library(data.table)
dt <- as.data.table(df)
dt[, sum := rowSums(.SD), .SDcols = grep("txt_", names(dt))]
dt[["sum"]]
# [1] 3 1 0
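A variant that stays within data.table idiom without building a matrix (a sketch; patterns() in .SDcols assumes a reasonably recent data.table version):
dt[, sum := Reduce(`+`, .SD), .SDcols = patterns("^txt_")]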
Another dplyr option:
df %>%
  rowwise() %>%
  mutate(sum = sum(c_across(starts_with("txt"))))
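Note that rowwise() evaluates the sum one row at a time, so on large data the vectorized rowSums(across(...)) version above will generally be much faster.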

Detect a sequence by group and compute a new variable for the subset

I need to detect a sequence by group in a data.frame and compute a new variable for it.
Consider I have this following data.frame:
df1 <- data.frame(ID = c(1,1,1,1,1,1,1,2,2,2,3,3,3,3),
                  seqs = c(1,2,3,4,5,6,7,1,2,3,1,2,3,4),
                  count = c(2,1,3,1,1,2,3,1,2,1,3,1,4,1),
                  product = c("A", "B", "C", "C", "A,B", "A,B,C", "D", "A", "B", "A", "A", "A,B,C", "D", "D"),
                  stock = c("A", "A,B", "A,B,C", "A,B,C", "A,B,C", "A,B,C", "A,B,C,D", "A", "A,B", "A,B", "A", "A,B,C", "A,B,C,D", "A,B,C,D"))
df1
> df1
ID seqs count product stock
1 1 1 2 A A
2 1 2 1 B A,B
3 1 3 3 C A,B,C
4 1 4 1 C A,B,C
5 1 5 1 A,B A,B,C
6 1 6 2 A,B,C A,B,C
7 1 7 3 D A,B,C,D
8 2 1 1 A A
9 2 2 2 B A,B
10 2 3 1 A A,B
11 3 1 3 A A
12 3 2 1 A,B,C A,B,C
13 3 3 4 D A,B,C,D
14 3 4 1 D A,B,C,D
I want to compute a measure for IDs that follow this sequence of counts:
- count == 1
- count > 1
- count == 1
In the example this is true for:
- rows 2, 3, 4 for `ID==1`
- rows 8, 9, 10 for `ID==2`
- rows 12, 13, 14 for `ID==3`
For these IDs and rows, I need to compute a measure called new: it takes the value of the product of the last row of the sequence if that product appears in the product of the second row of the sequence and NOT in the stock of the first row.
The desired outcome is shown below:
> output
ID seq1 seq2 seq3 new
1 1 2 3 4 C
2 2 1 2 3
3 3 2 3 4 D
Note:
Within a detected sequence, no new products are added to the stock.
In the original data there are many IDs that do not have any qualifying sequence.
Some IDs have multiple qualifying sequences; all should be recorded.
count is always 1 or greater.
The original data holds millions of IDs with up to 1500 sequences each.
How would you write an efficient piece of code to get this output?
Here's a data.table option:
library(data.table)
char_cols <- c("product", "stock")
setDT(df1)[, (char_cols) := lapply(.SD, as.character),
           .SDcols = char_cols]  # in case they're factors
df1[, c1 := (count == 1) &
            (shift(count) > 1) &
            (shift(count, 2L) == 1),
    by = ID]  # condition 1: the 1, >1, 1 count pattern
df1[, pat := paste0("(", gsub(",", "|", product), ")")]  # regex pattern from product
df1[, c2 := mapply(grepl, pat, shift(product)) &
            !mapply(grepl, pat, shift(stock, 2L)),
    by = ID]  # condition 2: product in previous row's product, not in first row's stock
df1[(c1), new := ifelse(c2, product, "")]  # create the new column
df1[, paste0("seq", 1:3) := shift(seqs, 2:0)]  # create the seq columns
df1[(c1), .(ID, seq1, seq2, seq3, new)]  # result
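On the example data, this should match the desired output from the question:
#    ID seq1 seq2 seq3 new
# 1:  1    2    3    4   C
# 2:  2    1    2    3
# 3:  3    2    3    4   D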
Here's another approach using the tidyverse; however, I think lag and lead make this solution a bit slow. I included comments within the code to make it more legible. I spent enough time on it to post it anyway.
library(tidyverse)
df1 %>% group_by(ID) %>%
  # find rows with count > 1 where the counts of the rows
  # immediately before and after both equal 1
  mutate(test = (count > 1 & c(F, lag(count == 1)[-1]) & c(lead(count == 1)[-n()], F))) %>%
  # flag the full three-row chunk around each match so we can filter on it
  mutate(test2 = test | c(F, lag(test)[-1]) | c(lead(test)[-n()], F)) %>%
  filter(test2) %>% ungroup() %>%
  # group each three occurrences in case an ID has multiple qualifying sequences
  group_by(G = trunc(3:(n() + 2) / 3)) %>% group_by(ID, G) %>%
  # create the new column with string-extraction techniques
  # (assuming those columns are characters)
  mutate(new =
           str_remove_all(
             as.character(regmatches(stock[2], gregexpr(product[3], stock[2]))),
             stock[1])) %>%
  # select the desired columns and add an index for the long-to-wide conversion
  select(ID, G, seqs, new) %>% mutate(times = 1:n()) %>% ungroup() %>%
  # long-to-wide conversion using tidyr (part of the tidyverse)
  gather(key, value, -ID, -G, -new, -times) %>%
  unite(col, key, times) %>% spread(col, value) %>%
  # put the columns in the desired order
  select(-G, -new, new) %>% as.data.frame()
# ID seqs_1 seqs_2 seqs_3 new
# 1 1 2 3 4 C
# 2 2 1 2 3
# 3 3 2 3 4 D

Count number of non-NA values greater than 0 by group

Here is an example of a data set df:
Name L1 L2 L3 L4
Carl 1 NA 0 2
Carl 0 1 4 1
Joe 3 0 3 1
Joe 2 2 1 0
I would like to create a function that would be able to tally up the number of values in columns L2, L3, and L4 that are greater than 0 as a function of some name. For example:
someFunction(Joe)
# 4
However, I have some NAs in my columns.
I have tried using complete.cases to remove the NAs, but I do not want to remove entire rows. I would like to use aggregate, but I am not exactly sure how. Thanks for your help.
We can use
colSums(df[c("L2", "L3", "L4")] > 0, na.rm = TRUE)
Or you may want a sum per person:
m <- rowsum((df[c("L2", "L3", "L4")] > 0) + 0, df[["Name"]], na.rm = TRUE)
# L2 L3 L4
#Carl 1 1 2
#Joe 1 2 1
There is something fun here: df[c("L2", "L3", "L4")] > 0 is a logical matrix (with NAs). Although colSums can work with it without trouble, rowsum cannot, so one fix is to add 0 to the matrix to cast it to a 0-1 numeric matrix.
When adding this 0, we must write (df[c("L2", "L3", "L4")] > 0) + 0, not df[c("L2", "L3", "L4")] > 0 + 0: operator precedence in R means + binds tighter than >. Try this toy example:
5 > 4 + 0 ## FALSE
(5 > 4) + 0 ## 1
So we need the parentheses to force > to be evaluated first, then +.
If you want the result to be a data frame, just cast the resulting matrix into a data frame by:
data.frame(m)
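To get the single total per person that the question asks for, row-sum this matrix:
rowSums(m)
# Carl  Joe
#    4    4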
Follow-up
People stopped responding because your specific question about wrapping this in a function is less interesting than producing the summary dataset.
Well, if you still take my approach, I would define such a function as:
extract <- function(person) {
  m <- rowsum((df[c("L2", "L3", "L4")] > 0) + 0, df[["Name"]], na.rm = TRUE)
  rowSums(m)[[person]]
}
Then you can call
extract("Joe")
# 4
extract("Carl")
# 4
Note, this is obviously not the most efficient way to write such a function: if you only want the sum for one person, there is no need to process all the data. We can do:
extract2 <- function(person) {
  ## subset data
  sub <- subset(df, df$Name == person, select = c("L2", "L3", "L4"))
  ## get sum
  sum(sub > 0, na.rm = TRUE)
}
Then you can call
extract2("Joe")
# 4
extract2("Carl")
# 4
With aggregate, you'll need to set both the na.rm parameter of sum and the na.action parameter of aggregate itself. After that, it's easy to add the three columns:
df_sums <- aggregate(. ~ Name, df, FUN = function(x) {
  sum(x > 0, na.rm = TRUE)
}, na.action = na.pass)
df_sums$sum_L2_L3_L4 <- with(df_sums, L2 + L3 + L4)
df_sums
## Name L1 L2 L3 L4 sum_L2_L3_L4
## 1 Carl 1 1 1 2 4
## 2 Joe 2 1 2 1 4
or in dplyr,
library(dplyr)
df %>% group_by(Name) %>%
  summarise_all(funs(sum(. > 0, na.rm = TRUE))) %>%
  mutate(sum_L2_L3_L4 = L2 + L3 + L4)
## # A tibble: 2 × 6
## Name L1 L2 L3 L4 sum_L2_L3_L4
## <fctr> <int> <int> <int> <int> <int>
## 1 Carl 1 1 1 2 4
## 2 Joe 2 1 2 1 4
or directly,
df %>% group_by(Name) %>% summarise(sum = sum(cbind(L2, L3, L4) > 0, na.rm = TRUE))
## # A tibble: 2 × 2
## Name sum
## <fctr> <int>
## 1 Carl 4
## 2 Joe 4
or data.table
library(data.table)
setDT(df)[, lapply(.SD, function(x) sum(x > 0, na.rm = TRUE)), by = Name
          ][, sum_L2_L3_L4 := L2 + L3 + L4, by = Name][]
## Name L1 L2 L3 L4 sum_L2_L3_L4
## 1: Carl 1 1 1 2 4
## 2: Joe 2 1 2 1 4
or directly,
setDT(df)[, .(sum = sum(cbind(L2, L3, L4) > 0, na.rm = TRUE)), by = Name]
## Name sum
## 1: Carl 4
## 2: Joe 4
We can use aggregate with rowSums to get the output
aggregate(cbind(Total = rowSums(df[3:5] > 0, na.rm = TRUE)) ~ cbind(Name = df$Name),
          FUN = sum)
# Name Total
#1 Carl 4
#2 Joe 4
Or using data.table: convert the 'data.frame' to a 'data.table' (setDT(df)), group by 'Name', specify the columns of interest in .SDcols, unlist the subset of the data.table (.SD), convert it to a logical vector (> 0), and sum the TRUE values to create the summarised 'Total' column.
library(data.table)
setDT(df)[, .(Total = sum(unlist(.SD)>0, na.rm = TRUE)), Name, .SDcols = L2:L4]
# Name Total
#1: Carl 4
#2: Joe 4
Or another option is dplyr/tidyr: select the columns of interest, gather into 'long' format, filter only the elements greater than 0, then group by 'Name' and count the rows (n()).
library(dplyr)
library(tidyr)
df %>%
  select(-L1) %>%
  gather(Var, Val, -Name) %>%
  filter(Val > 0) %>%
  group_by(Name) %>%
  summarise(Total = n())
# A tibble: 2 × 2
# Name Total
# <chr> <int>
#1 Carl 4
#2 Joe 4
With plyr you could do:
library(plyr)
nonZeroDF = ddply(df[, -2], "Name", .fun = function(x)
  data.frame(nonZeroObs = sum((x[, -1]) > 0, na.rm = TRUE)))
# Name nonZeroObs
#1 Carl 4
#2 Joe 4

Count strings with a certain condition

I have the following dataset
#mydata
Factors Transactions
a,c 2
b 0
c 0
d,a 0
a 1
a 0
b 1
I'd like to count the factors that had transactions. For example, "a" occurs twice with a transaction. I can write code that gives my desired outcome for each variable separately; the following is for "a":
nrow(subset(mydata, Transactions > 0 & grepl("a", Factors)))
But I have too many variables and do not want to repeat the code for all of them. There should be a way to get the results for all of the variables at once. I wish to have the following output:
#Output
a 2
b 1
c 1
d 0
A data.table option:
library(data.table)
setDT(df)[, .(Factors = unlist(strsplit(as.character(Factors), ","))),
          by = Transactions][, .(Transactions = sum(Transactions > 0)), by = Factors]
# Factors Transactions
#1: a 2
#2: c 1
#3: b 1
#4: d 0
You could create a table using the unique values of the Factor column as the levels. Consider df to be your data set.
s <- strsplit(as.character(df$Factors), ",", fixed = TRUE)
table(factor(unlist(s[df$Transactions > 0]), levels = unique(unlist(s))))
#
# a c b d
# 2 1 1 0
Wrap in as.data.frame() for data frame output.
with(df, {
  s <- strsplit(as.character(Factors), ",", fixed = TRUE)
  f <- factor(unlist(s[Transactions > 0]), levels = unique(unlist(s)))
  as.data.frame(table(Factors = f))
})
# Factors Freq
# 1 a 2
# 2 c 1
# 3 b 1
# 4 d 0
With tidyverse packages, assuming your data is strings/factors and numbers,
library(tidyr)
library(dplyr)
# separate factors with two elements
df %>% separate_rows(Factors) %>%
  # set grouping for aggregation
  group_by(Factors) %>%
  # for each group, count how many transactions are greater than 0
  summarise(Transactions = sum(Transactions > 0))
## # A tibble: 4 x 2
## Factors Transactions
## <chr> <int>
## 1 a 2
## 2 b 1
## 3 c 1
## 4 d 0
You could also avoid dplyr by using xtabs, though some cleaning is necessary to get to the same arrangement:
library(tidyr)
df %>% separate_rows(Factors) %>%
  xtabs(Transactions > 0 ~ Factors, data = .) %>%
  as.data.frame() %>%
  setNames(names(df))
## Factors Transactions
## 1 a 2
## 2 b 1
## 3 c 1
## 4 d 0
A full base R equivalent:
df2 <- do.call(rbind,
               Map(function(f, t) {
                 data.frame(Factors = strsplit(as.character(f), ",")[[1]],
                            Transactions = t)
               },
               df$Factors, df$Transactions))
df3 <- as.data.frame(xtabs(Transactions > 0 ~ Factors, data = df2))
names(df3) <- names(df)
df3
## Factors Transactions
## 1 a 2
## 2 b 1
## 3 c 1
## 4 d 0
We can use cSplit from splitstackshape to split 'Factors' into 'long' format, then, grouped by 'Factors', take the sum of the logical column (`Transactions > 0`).
library(splitstackshape)
cSplit(df1, "Factors", ",", "long")[, .(Transactions = sum(Transactions > 0)), .(Factors)]
# Factors Transactions
#1: a 2
#2: c 1
#3: b 1
#4: d 0
Or using base R
with(df1, table(factor(unlist(strsplit(Factors[Transactions > 0], ",")),
                       levels = letters[1:4])))
# a b c d
# 2 1 1 0
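(The hard-coded letters[1:4] works here only because the factor levels happen to be a-d; to avoid that, derive the levels from the data as in the table() answer above, e.g. levels = sort(unique(unlist(strsplit(Factors, ",")))).)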
data
df1 <- structure(list(Factors = c("a,c", "b", "c", "d,a", "a", "a",
"b"), Transactions = c(2L, 0L, 0L, 0L, 1L, 0L, 1L)), .Names = c("Factors",
"Transactions"), class = "data.frame", row.names = c(NA, -7L))

Summarize all group values and a conditional subset in the same call

I'll illustrate my question with an example.
Sample data:
df <- data.frame(ID = c(1, 1, 2, 2, 3, 5), A = c("foo", "bar", "foo", "foo", "bar", "bar"), B = c(1, 5, 7, 23, 54, 202))
df
ID A B
1 1 foo 1
2 1 bar 5
3 2 foo 7
4 2 foo 23
5 3 bar 54
6 5 bar 202
What I want to do is to summarize, by ID, the sum of B and the sum of B when A is "foo". I can do this in a couple steps like:
require(magrittr)
require(dplyr)
df1 <- df %>%
  group_by(ID) %>%
  summarize(sumB = sum(B))
df2 <- df %>%
  filter(A == "foo") %>%
  group_by(ID) %>%
  summarize(sumBfoo = sum(B))
left_join(df1, df2)
ID sumB sumBfoo
1 1 6 1
2 2 30 30
3 3 54 NA
4 5 202 NA
However, I'm looking for a more elegant/faster way, as I'm dealing with 10gb+ of out-of-memory data in sqlite.
require(sqldf)
my_db <- src_sqlite("my_db.sqlite3", create = T)
df_sqlite <- copy_to(my_db, df)
I thought of using mutate to define a new Bfoo column:
df_sqlite %>%
  mutate(Bfoo = ifelse(A == "foo", B, 0))
Unfortunately, this doesn't work on the database end of things.
Error in sqliteExecStatement(conn, statement, ...) :
RS-DBI driver: (error in statement: no such function: IFELSE)
You can do both sums in a single dplyr statement. Note that sum(B[A == "foo"]) returns 0 rather than NA for IDs with no "foo" rows, since sum(numeric(0)) is 0:
df1 <- df %>%
  group_by(ID) %>%
  summarize(sumB = sum(B),
            sumBfoo = sum(B[A == "foo"]))
And here is a data.table version:
library(data.table)
dt = setDT(df)
dt1 = dt[, .(sumB = sum(B),
             sumBfoo = sum(B[A == "foo"])),
         by = ID]
dt1
ID sumB sumBfoo
1: 1 6 1
2: 2 30 30
3: 3 54 0
4: 5 202 0
Writing up @hadley's comment as an answer:
df_sqlite %>%
  group_by(ID) %>%
  mutate(Bfoo = if (A == "foo") B else 0) %>%
  summarize(sumB = sum(B),
            sumBfoo = sum(Bfoo)) %>%
  collect
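(On current dplyr backends via dbplyr, ifelse() is translated to SQL CASE WHEN, so the mutate(Bfoo = ifelse(...)) attempt from the question should also work against the database nowadays.)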
If you want to do counting instead of summarizing, then the answer is somewhat different. The change in code is small, especially in the conditional counting part.
df1 <- df %>%
  group_by(ID) %>%
  summarize(countB = n(),
            countBfoo = sum(A == "foo"))
df1
Source: local data frame [4 x 3]
ID countB countBfoo
1 1 2 1
2 2 2 2
3 3 1 0
4 5 1 0
If you wanted to count the rows instead of summing them, can you pass a condition to the function?
df1 <- df %>%
  group_by(ID) %>%
  summarize(RowCountB = n(),
            RowCountBfoo = n(A == "foo"))
I get an error with both n() and nrow(). (n() takes no arguments, which is why n(A == "foo") fails; count matching rows with sum(A == "foo"), as in the answer above.)
