Here is an example of a data set df:
Name  L1  L2  L3  L4
Carl   1  NA   0   2
Carl   0   1   4   1
Joe    3   0   3   1
Joe    2   2   1   0
I would like to create a function that, given a name, tallies the number of values in columns L2, L3, and L4 that are greater than 0. For example:
someFunction(Joe)
# 4
However, I have some NAs in my columns.
I have tried using complete.cases to remove the NAs, but I do not want to remove entire rows. I would like to use aggregate, but I am not exactly sure how. Thanks for your help.
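For reference, here is the example data as code (with the NA placed as shown above):
df <- data.frame(
  Name = c("Carl", "Carl", "Joe", "Joe"),
  L1 = c(1, 0, 3, 2),
  L2 = c(NA, 1, 0, 2),
  L3 = c(0, 4, 3, 1),
  L4 = c(2, 1, 1, 0)
)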
We can use
colSums(df[c("L2", "L3", "L4")] > 0, na.rm = TRUE)
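With the example data this gives the per-column totals (counted from the table above):
# L2 L3 L4
#  2  3  3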
Or you may want a sum per person:
m <- rowsum((df[c("L2", "L3", "L4")] > 0) + 0, df[["Name"]], na.rm = TRUE)
# L2 L3 L4
#Carl 1 1 2
#Joe 1 2 1
There is something fun here. df[c("L2", "L3", "L4")] > 0 is a logical matrix (with NAs).
Although colSums can work with it without trouble, rowsum cannot. So the fix is to add 0 to this matrix, casting it to a 0-1 numeric matrix.
When adding this 0, we must write (df[c("L2", "L3", "L4")] > 0) + 0, not df[c("L2", "L3", "L4")] > 0 + 0: in R's operator precedence, + binds more tightly than >, so the latter is just df[c("L2", "L3", "L4")] > 0 again, still a logical matrix. Try this toy example:
5 > 4 + 0   ## TRUE  (evaluated as 5 > (4 + 0); the result stays logical)
(5 > 4) + 0 ## 1     (the logical is cast to numeric)
So we need the parentheses to make > evaluate first, then +.
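A quick sketch of the difference (rowsum() requires a numeric x, so a logical matrix is rejected; the name logi is mine):
logi <- df[c("L2", "L3", "L4")] > 0
colSums(logi, na.rm = TRUE)                   # fine: logical is coerced
# rowsum(logi, df[["Name"]], na.rm = TRUE)    # error: 'x' must be numeric
rowsum(logi + 0, df[["Name"]], na.rm = TRUE)  # works after the 0-1 cast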
If you want the result to be a data frame, just cast the resulting matrix into a data frame by:
data.frame(m)
Follow-up
People have stopped responding, probably because your specific question about writing a function is less interesting than producing the summary dataset.
Well, if you stick with my approach, I would define such a function as:
extract <- function(person) {
  m <- rowsum((df[c("L2", "L3", "L4")] > 0) + 0, df[["Name"]], na.rm = TRUE)
  rowSums(m)[[person]]
}
Then you can call
extract("Joe")
# 4
extract("Carl")
# 4
Note that this is obviously not the most efficient way to write such a function: if you only want the sum for one person, there is no need to process all the data. We can do:
extract2 <- function(person) {
  ## subset data
  sub <- subset(df, df$Name == person, select = c("L2", "L3", "L4"))
  ## get sum
  sum(sub > 0, na.rm = TRUE)
}
Then you can call
extract2("Joe")
# 4
extract2("Carl")
# 4
With aggregate, you'll need to set both the na.rm argument of sum and the na.action argument of aggregate itself. After that, it's easy to add the three columns:
df_sums <- aggregate(. ~ Name, df, FUN = function(x) {
  sum(x > 0, na.rm = TRUE)
}, na.action = na.pass)
df_sums$sum_L2_L3_L4 <- with(df_sums, L2 + L3 + L4)
df_sums
## Name L1 L2 L3 L4 sum_L2_L3_L4
## 1 Carl 1 1 1 2 4
## 2 Joe 2 1 2 1 4
or in dplyr,
library(dplyr)
df %>% group_by(Name) %>%
summarise_all(funs(sum(. > 0, na.rm = TRUE))) %>%
mutate(sum_L2_L3_L4 = L2 + L3 + L4)
## # A tibble: 2 × 6
## Name L1 L2 L3 L4 sum_L2_L3_L4
## <fctr> <int> <int> <int> <int> <int>
## 1 Carl 1 1 1 2 4
## 2 Joe 2 1 2 1 4
or directly,
df %>% group_by(Name) %>% summarise(sum = sum(cbind(L2, L3, L4) > 0, na.rm = TRUE))
## # A tibble: 2 × 2
## Name sum
## <fctr> <int>
## 1 Carl 4
## 2 Joe 4
or data.table
library(data.table)
setDT(df)[, lapply(.SD, function(x){sum(x > 0, na.rm = TRUE)}), by = Name
][, sum_L2_L3_L4 := L2 + L3 + L4, by = Name][]
## Name L1 L2 L3 L4 sum_L2_L3_L4
## 1: Carl 1 1 1 2 4
## 2: Joe 2 1 2 1 4
or directly,
setDT(df)[, .(sum = sum(cbind(L2, L3, L4) > 0, na.rm = TRUE)), by = Name]
## Name sum
## 1: Carl 4
## 2: Joe 4
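Side note: funs() is deprecated in current dplyr; a sketch of the summarise_all pipeline rewritten with across() (assuming dplyr >= 1.0.0):
df %>%
  group_by(Name) %>%
  summarise(across(everything(), ~ sum(. > 0, na.rm = TRUE))) %>%
  mutate(sum_L2_L3_L4 = L2 + L3 + L4)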
We can use aggregate with rowSums to get the output:
aggregate(cbind(Total = rowSums(df[3:5] > 0, na.rm = TRUE)) ~ cbind(Name = df$Name),
          FUN = sum)
# Name Total
#1 Carl 4
#2 Joe 4
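The inline cbind() calls only serve to name the columns; an equivalent sketch that precomputes the row sums first (the name totals is mine):
totals <- data.frame(Name = df$Name,
                     Total = rowSums(df[3:5] > 0, na.rm = TRUE))
aggregate(Total ~ Name, data = totals, FUN = sum)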
Or using data.table: convert the data.frame to a data.table (setDT(df)), group by 'Name', specify the columns of interest in .SDcols, unlist the Subset of Data.table (.SD), turn it into a logical vector (> 0), and sum the TRUE values to create the summarised 'Total' column:
library(data.table)
setDT(df)[, .(Total = sum(unlist(.SD)>0, na.rm = TRUE)), Name, .SDcols = L2:L4]
# Name Total
#1: Carl 4
#2: Joe 4
Another option is dplyr/tidyr. We select the columns of interest, gather them into 'long' format, filter for the elements that are greater than 0, then group by 'Name' and get the total number of rows (n()):
library(dplyr)
library(tidyr)
df %>%
select(-L1) %>%
gather(Var, Val, -Name) %>%
filter(Val>0) %>%
group_by(Name) %>%
summarise(Total = n())
# A tibble: 2 × 2
# Name Total
# <chr> <int>
#1 Carl 4
#2 Joe 4
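gather() is superseded in current tidyr; a sketch of the same pipeline with pivot_longer() (assuming tidyr >= 1.0.0):
df %>%
  pivot_longer(c(L2, L3, L4), names_to = "Var", values_to = "Val") %>%
  filter(Val > 0) %>%
  group_by(Name) %>%
  summarise(Total = n())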
With plyr you could do:
library(plyr)
nonZeroDF <- ddply(df[, -2], "Name", .fun = function(x)
  data.frame(nonZeroObs = sum(x[, -1] > 0, na.rm = TRUE)))
# Name nonZeroObs
#1 Carl 4
#2 Joe 4
Related
I am trying to reshape data from long to wide format in R. I would like to get both counts of occurrences of a type variable by ID and sums of the values of a second variable (val) by ID and type as in the example below.
I was able to find answers for reshaping with either counts or sums but not for both simultaneously.
This is the original example data:
> df <- data.frame(id = c(1, 1, 1, 2, 2, 2),
+ type = c("A", "A", "B", "A", "B", "C"),
+ val = c(0, 1, 2, 0, 0, 4))
> df
id type val
1 1 A 0
2 1 A 1
3 1 B 2
4 2 A 0
5 2 B 0
6 2 C 4
The output I would like to obtain is the following:
id A.count B.count C.count A.sum B.sum C.sum
1 1 2 1 0 1 2 0
2 2 1 1 1 0 0 4
where the count columns display the number of occurrences of type A, B and C and the sum columns the sum of the values by type.
To achieve the counts I can, as suggested in this answer, use reshape2::dcast with the default aggregation function, length:
> require(reshape2)
> df.c <- dcast(df, id ~ type, value.var = "type", fun.aggregate = length)
> df.c
id A B C
1 1 2 1 0
2 2 1 1 1
Similarly, as suggested in this answer, I can also perform the reshape with the sums as output, this time using the sum aggregation function in dcast:
> df.s <- dcast(df, id ~ type, value.var = "val", fun.aggregate = sum)
> df.s
id A B C
1 1 1 2 0
2 2 0 0 4
I could merge the two:
> merge(x = df.c, y = df.s, by = "id", all = TRUE)
id A.x B.x C.x A.y B.y C.y
1 1 2 1 0 1 2 0
2 2 1 1 1 0 0 4
but is there a way of doing it all in one go (not necessarily with dcast or reshape2)?
Since data.table v1.9.6 it is possible to cast multiple value.var columns and also to provide multiple fun.aggregate functions. See below:
library(data.table)
df <- data.table(df)
dcast(df, id ~ type, fun = list(length, sum), value.var = c("val"))
id val_length_A val_length_B val_length_C val_sum_A val_sum_B val_sum_C
1: 1 2 1 0 1 2 0
2: 2 1 1 1 0 0 4
Here is an approach with the tidyverse:
library(tidyverse)
df %>%
group_by(id, type) %>%
summarise(count = n(), Sum = sum(val)) %>%
gather(key, val, count:Sum) %>%
unite(typen, type, key, sep=".") %>%
spread(typen, val, fill = 0)
The data.table solution suggested above is probably better, but if you prefer dcast and have many value.var/fun.aggregate combinations, you could also do:
library(purrr)
cols <- c('type', 'val')
funs <- c(length, sum)
map2(cols, funs, ~ dcast(df, id~type, value.var = .x, fun.aggregate = .y)) %>%
reduce(left_join, by='id', suffix=c('.count', '.sum'))
I would like to combine a set of data frames into a single data frame by summing columns that have matching variables (instead of appending columns).
For example, given
df1 <- data.frame(A = c(0,0,1,1,1,2,2), B = c(1,2,1,2,3,1,5), x = c(2,3,1,5,3,7,0))
df2 <- data.frame(A = c(0,1,1,2,2,2), B = c(1,1,3,2,4,5), x = c(4,8,4,1,0,3))
df3 <- data.frame(A = c(0,1,2), B = c(5,4,2), x = c(5,3,1))
I want to match by "A" and "B" and sum the values of "x". For this example, I can get the desired result as follows:
library(plyr)
library(dplyr)
# rename columns so that join_all preserves them all:
colnames(df1)[3] <- "x1"
colnames(df2)[3] <- "x2"
colnames(df3)[3] <- "x3"
# join the data frames by matching "A" and "B" values:
res <- join_all(list(df1, df2, df3), by = c("A", "B"), type = "full")
# get the sums and drop superfluous columns:
arrange(res, A, B) %>%
rowwise() %>%
mutate(x = sum(x1, x2, x3, na.rm = TRUE)) %>%
select(A, B, x)
Result:
A B x
<dbl> <dbl> <dbl>
1 0 1 6
2 0 2 3
3 0 5 5
4 1 1 9
5 1 2 5
6 1 3 7
7 1 4 3
8 2 1 7
9 2 2 2
10 2 4 0
11 2 5 3
A more general solution is:
library(dplyr)
# function to get the desired result for two data frames:
my_merge <- function(df1, df2)
{
m1 <- merge(df1, df2, by = c("A", "B"), all = TRUE)
m1 <- rowwise(res) %>%
mutate(x = sum(x.x, x.y, na.rm = TRUE)) %>%
select(A, B, x)
return(m1)
}
l1 <- list(df2, df3) # omit the first data frame
res <- df1 # initial value of the result
for(df in l1) res <- my_merge(res, df) # call the function repeatedly
Is there a more efficient option for combining a large set of data frames? Ideally it should be recursive (i.e. it's better not to join all data frames into one massive data frame before calculating the sums).
An easier option is to bind the rows of the datasets, then group by the columns of interest and get the summarised output by taking the sum of 'x':
library(tidyverse)
bind_rows(df1, df2, df3) %>%
group_by(A, B) %>%
summarise(x = sum(x))
# A tibble: 11 x 3
# Groups: A [?]
# A B x
# <dbl> <dbl> <dbl>
# 1 0 1 6
# 2 0 2 3
# 3 0 5 5
# 4 1 1 9
# 5 1 2 5
# 6 1 3 7
# 7 1 4 3
# 8 2 1 7
# 9 2 2 2
#10 2 4 0
#11 2 5 3
If there are many objects in the global environment with the pattern "df" followed by some digits:
mget(ls(pattern= "^df\\d+")) %>%
bind_rows %>%
group_by(A, B) %>%
summarise(x = sum(x))
As the OP mentioned memory constraints, it would be more efficient to do the join first and then use rowSums (or + with reduce):
mget(ls(pattern= "^df\\d+")) %>%
reduce(full_join, by = c("A", "B")) %>%
transmute(A, B, x = rowSums(.[3:5], na.rm = TRUE)) %>%
arrange(A, B)
# A B x
#1 0 1 6
#2 0 2 3
#3 0 5 5
#4 1 1 9
#5 1 2 5
#6 1 3 7
#7 1 4 3
#8 2 1 7
#9 2 2 2
#10 2 4 0
#11 2 5 3
This could also be done with data.table
library(data.table)
rbindlist(mget(ls(pattern= "^df\\d+")))[, .(x = sum(x)), by = .(A, B)]
Ideally it should be recursive (i.e. it's better not to join all data frames into one massive data frame before calculating the sums).
If you're memory constrained and willing to sacrifice speed (vs #akrun's data.table approach), use one table at a time in a loop:
library(data.table)
tabs = c("df1", "df2", "df3")
# enumerate all combos for the results table
# initializing sum to 0
res = CJ(A = 0:2, B = 1:5, x = 0)
# loop over tabs, adding on
for (i in seq_along(tabs)){
tab = get(tabs[[i]])
res[tab, on=.(A, B), x := x + i.x][]
rm(tab)
}
If you need to read the tables from disk, change tabs to a vector of file names and change get to fread (or whatever read function applies).
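For example, a sketch of the disk-based variant (the file names here are hypothetical):
files <- c("df1.csv", "df2.csv", "df3.csv")  # hypothetical file names
res <- CJ(A = 0:2, B = 1:5, x = 0)
for (f in files) {
  tab <- fread(f)                       # read one table at a time
  res[tab, on = .(A, B), x := x + i.x]
  rm(tab)
}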
I am skeptical that you could fit all the tables in memory and yet be unable to also fit an rbind-ed copy of them.
Similarly (thanks to #akrun's comment), use his approach pairwise:
res = data.table(get(tabs[[1]]))[0L]
for (i in seq_along(tabs)){
tab = get(tabs[[i]])
res = rbind(res, tab)[, .(x = sum(x)), by=.(A,B)]
rm(tab)
}
I would like to ask if there is a way of removing groups from a data frame using dplyr (or any other way, for that matter). Let's say I have a data frame of the following form, grouped by Variable 1:
Variable 1   Variable 2
1            a
1            b
2            a
2            a
2            b
3            a
3            c
3            a
...          ...
I would like to remove only the groups that have two consecutive identical values in Variable 2. In the table above, that would remove group 2 (its values are a, a, b) but not group 3 (a, c, a). So I would get the table below:
Variable 1   Variable 2
1            a
1            b
3            a
3            c
3            a
...          ...
To test for consecutive identical values, you can compare each value to the previous value in its column. In dplyr, this is possible with lag. (You could do the same thing by comparing to the next value, using lead; the result comes out the same.)
Group the data by variable1, get the lag of variable2, then add up how many of these duplicates there are in that group. Then filter for just the groups with no duplicates. After that, feel free to remove the dupesInGroup column.
library(tidyverse)
df %>%
group_by(variable1) %>%
mutate(dupesInGroup = sum(variable2 == lag(variable2), na.rm = T)) %>%
filter(dupesInGroup == 0)
#> # A tibble: 5 x 3
#> # Groups: variable1 [2]
#> variable1 variable2 dupesInGroup
#> <int> <chr> <int>
#> 1 1 a 0
#> 2 1 b 0
#> 3 3 a 0
#> 4 3 c 0
#> 5 3 a 0
Created on 2018-05-10 by the reprex package (v0.2.0).
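As noted above, lag() can be swapped for lead() with the same result; a sketch:
df %>%
  group_by(variable1) %>%
  mutate(dupesInGroup = sum(variable2 == lead(variable2), na.rm = T)) %>%
  filter(dupesInGroup == 0)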
prepare data frame:
df <- data.frame("Variable 1" = c(1, 1, 2, 2, 2, 3, 3, 3), "Variable 2" = unlist(strsplit("abaabaca", "")))
write functions to test whether consecutive repetitions are present:
any.consecutive.p <- function(v) {
  for (i in 1:(length(v) - 1)) {
    if (v[i] == v[i + 1]) {
      return(TRUE)
    }
  }
  return(FALSE)
}

any.consecutive.in.col.p <- function(df, col) {
  any.consecutive.p(df[, col])
}
any.consecutive.p() returns TRUE as soon as it finds the first consecutive repetition in a vector v.
any.consecutive.in.col.p() looks for consecutive repetitions in a given column of a data frame.
split data frame by values of Variable.1
df.l <- split(df, df$Variable.1)
df.l
$`1`
Variable.1 Variable.2
1 1 a
2 1 b
$`2`
Variable.1 Variable.2
3 2 a
4 2 a
5 2 b
$`3`
Variable.1 Variable.2
6 3 a
7 3 c
8 3 a
Finally, go over this list of data frames and test each one for consecutive duplicates in the Variable.2 column.
If any are found, don't collect that data frame.
Bind the collected data frames by rows:
Reduce(rbind, lapply(df.l, function(df) if(!any.consecutive.in.col.p(df, "Variable.2")) {df}))
Variable.1 Variable.2
1 1 a
2 1 b
6 3 a
7 3 c
8 3 a
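As an aside, the loop in any.consecutive.p can be replaced by a vectorized one-liner (a sketch; the name is mine):
any.consecutive.vec.p <- function(v) any(head(v, -1) == tail(v, -1))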
Say you want to remove all groups of df, grouped by a, in which column b has consecutively repeated values. You can do that as below.
set.seed(0)
df <- data.frame(a = rep(1:3, rep(3, 3)), b = sample(1:5, 9, T))
# dplyr
library(dplyr)
df %>%
group_by(a) %>%
filter(all(b != lag(b), na.rm = T))
#data.table
library(data.table)
setDT(df)
df[, if(all(b != shift(b), na.rm = T)) .SD, by = a]
A benchmark shows data.table is faster:
#Results
# Unit: milliseconds
# expr min lq mean median uq max neval
# use_dplyr() 141.46819 165.03761 201.0975 179.48334 205.82301 539.5643 100
# use_DT() 36.27936 50.23011 64.9218 53.87114 66.73943 345.2863 100
# Method
library(microbenchmark)
set.seed(0)
df <- data.table(a = rep(1:2000, rep(1e3, 2000)), b = sample(1:1e3, 2e6, T))
use_dplyr <- function(x){
df %>%
group_by(a) %>%
filter(all(b != lag(b), na.rm = T))
}
use_DT <- function(x){
df[, if (all(b != shift(b), na.rm = T)) .SD, a]
}
microbenchmark(use_dplyr(), use_DT())
I want to efficiently sum the entries of two data frames, though the data frames are not guaranteed to have the same dimensions or column names. Merge isn't really what I'm after here. Instead I want to create an output object with all of the row and column names that belong to either of the added data frames. In each position of that output, I want to use the following logic for the computed value:
If a row/column pairing belongs to both input data frames I want the output to include their sum
If a row/column pairing belongs to just one input data frame I want to include that value in the output
If a row/column pairing does not belong to either input data frame I want to have 0 in that position in the output.
As an example, consider the following input data frames:
df1 = data.frame(x = c(1,2,3), y = c(4,5,6))
rownames(df1) = c("a", "b", "c")
df2 = data.frame(x = c(7,8), z = c(9,10), w = c(2, 3))
rownames(df2) = c("a", "d")
> df1
x y
a 1 4
b 2 5
c 3 6
> df2
x z w
a 7 9 2
d 8 10 3
I want the final result to be:
x y z w
a 8 4 9 2
b 2 5 0 0
c 3 6 0 0
d 8 0 10 3
What I've done so far:
bind_rows / bind_cols in dplyr can throw the following:
"Error: incompatible number of rows (3, expecting 2)"
The data frames share column names, so merge isn't working for my purposes either; it returns an empty df for some reason.
Seems like you could merge on the rownames, then take care of the sums and conversion of NA to zero with some additional munging:
library(dplyr)
df.new = df1 %>% add_rownames %>%
full_join(df2 %>% add_rownames, by="rowname") %>%
mutate_each(funs(replace(., which(is.na(.)), 0))) %>%
mutate(x = x.x + x.y) %>%
select(rowname,x,y,z,w)
Or, with #DavidArenburg's much more elegant and extensible solution:
df.new = df1 %>% add_rownames %>%
full_join(df2 %>% add_rownames) %>%
group_by(rowname) %>%
summarise_each(funs(sum(., na.rm = TRUE)))
df.new
rowname x y z w
1 a 8 4 9 2
2 b 2 5 0 0
3 c 3 6 0 0
4 d 8 0 10 3
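add_rownames(), mutate_each() and summarise_each() are all deprecated in current dplyr; a sketch of the second pipeline in modern form (assuming dplyr >= 1.0.0, with tibble::rownames_to_column in place of add_rownames):
library(tibble)
df.new <- df1 %>% rownames_to_column() %>%
  full_join(df2 %>% rownames_to_column()) %>%
  group_by(rowname) %>%
  summarise(across(everything(), ~ sum(., na.rm = TRUE)))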
This seems like a simple merge on the common column names (plus row names) followed by a simple aggregation; here is how I would tackle it:
library(data.table)
merge(setDT(df1, keep.rownames = TRUE), # Convert to data.table + keep rows
setDT(df2, keep.rownames = TRUE), # Convert to data.table + keep rows
by = intersect(names(df1), names(df2)), # merge on common column names
all = TRUE)[, lapply(.SD, sum, na.rm = TRUE), by = rn] # Sum all columns by group
# rn x y z w
# 1: a 8 4 9 2
# 2: b 2 5 0 0
# 3: c 3 6 0 0
# 4: d 8 0 10 3
A pretty straightforward base R solution:
df1$rn <- row.names(df1)
df2$rn <- row.names(df2)
res <- merge(df1, df2, all = TRUE)
rowsum(res[setdiff(names(res), "rn")], res[, "rn"], na.rm = TRUE)
# x y z w
# a 8 4 9 2
# b 2 5 0 0
# c 3 6 0 0
# d 8 0 10 3
First, I would grab the names of all the rows and columns of the new entity:
(all.rows <- unique(c(row.names(df1), row.names(df2))))
# [1] "a" "b" "c" "d"
(all.cols <- unique(c(names(df1), names(df2))))
# [1] "x" "y" "z" "w"
Then I would construct an output matrix with those row and column names (data initialized to all 0s), adding df1 and df2 to the relevant parts of that matrix.
out <- matrix(0, nrow=length(all.rows), ncol=length(all.cols))
rownames(out) <- all.rows
colnames(out) <- all.cols
out[row.names(df1),names(df1)] <- unlist(df1)
out[row.names(df2),names(df2)] <- out[row.names(df2),names(df2)] + unlist(df2)
out
# x y z w
# a 8 4 9 2
# b 2 5 0 0
# c 3 6 0 0
# d 8 0 10 3
Using xtabs on melted / stacked data frames:
out <- rbind(cbind(rn=rownames(df1),stack(df1)), cbind(rn=rownames(df2),stack(df2)))
as.data.frame.matrix(xtabs(values ~ rn + ind, data=out))
# x y w z
#a 8 4 2 9
#b 2 5 0 0
#c 3 6 0 0
#d 8 0 3 10
I'm not convinced the accepted (or the alternative merge) method is best: it gives incorrect results if the inputs have rows in common, because those rows are joined rather than summed.
This can be shown trivially by changing df2 to:
df2 = data.frame(x = c(1,2), y = c(4,5), z = c(9,10), w = c(2, 3))
rownames(df2) = c("a", "d")
expected results:
rn x y z w
1: a 2 8 9 2
2: b 2 5 0 0
3: c 3 6 0 0
4: d 2 5 10 3
actual results:
merge(setDT(df1, keep.rownames = TRUE),
setDT(df2, keep.rownames = TRUE),
by = intersect(names(df1), names(df2)),
all = TRUE)[, lapply(.SD, sum, na.rm = TRUE), by = rn]
rn x y z w
1: a 1 4 9 2
2: b 2 5 0 0
3: c 3 6 0 0
4: d 2 5 10 3
To fix this you would need to combine the outer join with an inner join (or left/right joins, i.e. merge with all = TRUE / all = FALSE). Alternatively, use plyr's rbind.fill:
base R solution
res <- rbind.fill(df1,df2)
rowsum(res[setdiff(names(res), "rn")], res[, "rn"], na.rm = TRUE)
data.table solution
as.data.table(rbind.fill(
setDT(df1, keep.rownames = TRUE),
setDT(df2, keep.rownames = TRUE)
))[, lapply(.SD, sum, na.rm = TRUE), by = rn]
I prefer the rbind.fill method as you can "merge" > 2 data frames using the same syntax.
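For instance, with a hypothetical third data frame df3 prepared the same way (row names stored in an rn column), the call keeps the same shape:
res <- rbind.fill(df1, df2, df3)
rowsum(res[setdiff(names(res), "rn")], res[, "rn"], na.rm = TRUE)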
Let's say I have the following data frame in R:
df1 <- data.frame(Item_Name = c("test1","test2","test3"), D_1=c(1,0,1),
D_2=c(1,1,1), D_3=c(11,3,1))
I would like to create a function that deletes columns with no variance
(e.g. in this case it would remove column D_2, because it contains only one distinct value).
I know that I could check by hand, but in reality my data is very large and I would like to automate this. Any ideas?
Filter is a useful function here. I will keep only the columns that have more than one unique value, i.e.
Filter(function(x)(length(unique(x))>1), df1)
## Item_Name D_1 D_3
## 1 test1 1 11
## 2 test2 0 3
## 3 test3 1 1
You can do:
df1[c(TRUE, lapply(df1[-1], var, na.rm = TRUE) != 0)]
# Item_Name D_1 D_3
# 1 test1 1 11
# 2 test2 0 3
# 3 test3 1 1
where the lapply piece tells you what variables have some variance:
lapply(df1[-1], var, na.rm = TRUE) != 0
# D_1 D_2 D_3
# TRUE FALSE TRUE
In dplyr, we can use n_distinct to count unique values and select with where (or select_if in older versions) to keep columns:
library(dplyr)
df1 %>% select(where(~n_distinct(.) > 1))
#For dplyr < 1.0.0
#df1 %>% select_if(~n_distinct(.) > 1)
# Item_Name D_1 D_3
#1 test1 1 11
#2 test2 0 3
#3 test3 1 1
We can use the same logic with purrr's keep and discard
purrr::keep(df1, ~n_distinct(.) > 1)
purrr::discard(df1, ~n_distinct(.) == 1)
Apart from that, the data.table way of doing it could be:
library(data.table)
setDT(df1)
df1[, lapply(df1, uniqueN) > 1, with = FALSE]
Or, probably smarter/better:
df1[, .SD, .SDcols=lapply(df1, uniqueN) > 1]
In all the approaches above you could replace n_distinct/uniqueN with var or sd after subsetting to numeric columns only.
For example,
df1[-1] %>% select_if(~sd(.) != 0)
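If the data frame mixes column types, a sketch that keeps the non-numeric columns and drops only the zero-variance numeric ones (assuming dplyr >= 1.0.0):
df1 %>% select(where(~ !is.numeric(.) || sd(.) != 0))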