Combine tables of aggregated values with summarised variables from 'parent' data set - r

I have a data set along these lines:
df <- data.frame(sp = c(100, 100, 100, 101, 101, 101, 102, 102, 102),
                 type = c("C", "C", "C", "H", "H", "H", "C", "C", "C"),
                 country = c("A", "A", "A", "B", "B", "B", "C", "C", "C"),
                 vals = c(1, 2, 3, 4, 5, 6, 7, 8, 9))
I want to aggregate df$vals and bring the other variables through as well.
At the moment I'm doing it like this:
multi.func <- function(x) {
  c(n = length(x),
    min = min(x, na.rm = TRUE),
    max = max(x, na.rm = TRUE),
    mean = mean(x, na.rm = TRUE))
}
aggVals <- as.data.frame(do.call(rbind, by(df$vals, df$sp, FUN = multi.func, simplify = TRUE)))
aggVals$sp <- row.names(aggVals)
aggDescrip <- aggregate(cbind(as.character(type), as.character(country)) ~ sp, data = df, FUN = unique)
result <- merge(aggDescrip, aggVals)
This works well enough but I wondered if there's an easier way.
Thanks

Perhaps you should look into the data.table package.
library(data.table)
DT <- data.table(df, key="sp")
DT[, list(type = unique(as.character(type)),
          country = unique(as.character(country)),
          n = .N, min = min(vals), max = max(vals),
          mean = mean(vals)), by = key(DT)]
# sp type country n min max mean
# 1: 100 C A 3 1 3 2
# 2: 101 H B 3 4 6 5
# 3: 102 C C 3 7 9 8
If you want to stick with base R, here is another approach that might be of use (though aggregate is probably more common):
unique(within(df, {
  mean <- ave(vals, sp, FUN = mean)
  max <- ave(vals, sp, FUN = max)
  min <- ave(vals, sp, FUN = min)
  n <- ave(vals, sp, FUN = length)
  rm(vals)
}))
# sp type country n min max mean
# 1 100 C A 3 1 3 2
# 4 101 H B 3 4 6 5
# 7 102 C C 3 7 9 8
Update: A variation on your initial attempt
I would suggest sticking with data.table if possible, because the resulting code is easy to follow and the process of aggregation is quick.
However, with a little bit of modification, you can have (yet another) base R approach that is somewhat more direct.
First, modify your function so that instead of using c(), use data.frame. Also, add an argument that specifies which column needs to be aggregated.
multi.func <- function(x, value_column) {
  data.frame(n = length(x[[value_column]]),
             min = min(x[[value_column]], na.rm = TRUE),
             max = max(x[[value_column]], na.rm = TRUE),
             mean = mean(x[[value_column]], na.rm = TRUE))
}
Second, use lapply on your dataset, split up by your grouping variable, merge the output with your original dataset, and return the unique values.
unique(merge(df[-4],
             do.call(rbind, lapply(split(df, df$sp),
                                   multi.func, value_column = "vals")),
             by.x = "sp", by.y = "row.names"))

Using just aggregate:
result <- aggregate(vals ~ type + sp + country, df,
                    function(x) c(length(x), min(x), max(x), mean(x)))
result
type sp country vals.1 vals.2 vals.3 vals.4
1 C 100 A 3 1 3 2
2 H 101 B 3 4 6 5
3 C 102 C 3 7 9 8
colnames(result)
[1] "type" "sp" "country" "vals"
The above seems to create a weird "multi-value" column. But summaryBy from the doBy package is similar to aggregate but will allow an output with multiple columns:
library(doBy)
result <- summaryBy(vals ~ type + sp + country, df,
                    FUN = function(x) c(n = length(x), min = min(x), max = max(x), mean = mean(x)))
result
type sp country vals.n vals.min vals.max vals.mean
1 C 100 A 3 1 3 2
2 C 102 C 3 7 9 8
3 H 101 B 3 4 6 5
colnames(result)
[1] "type" "sp" "country" "vals.n" "vals.min" "vals.max"
[7] "vals.mean"

Cartesian product with dplyr

I'm trying to find the dplyr function for a Cartesian product. I have two simple data.frames with no common variable:
x <- data.frame(x = c("a", "b", "c"))
y <- data.frame(y = c(1, 2, 3))
I would like to reproduce the result of
merge(x, y)
x y
1 a 1
2 b 1
3 c 1
4 a 2
5 b 2
6 c 2
7 a 3
8 b 3
9 c 3
I've already looked for this (for example here or here) without finding anything useful.
Use crossing from the tidyr package:
x <- data.frame(x=c("a","b","c"))
y <- data.frame(y=c(1,2,3))
crossing(x, y)
Result:
x y
1 a 1
2 a 2
3 a 3
4 b 1
5 b 2
6 b 3
7 c 1
8 c 2
9 c 3
When x and y are database tbls (tbl_dbi / tbl_sql) you can now also do:
full_join(x, y, by = character())
This was added to dplyr at the end of 2017, and it also gets translated to a CROSS JOIN in the DB world. It saves the nastiness of having to introduce fake variables.
I'm seeing comments now (Nov 2022) that this also works on standard data frames! Great news!
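For the database case, here is a minimal sketch of checking the generated SQL (this assumes dbplyr is installed along with RSQLite, and uses dbplyr's memdb_frame() helper to copy the frames into an in-memory SQLite database):
library(dplyr)
library(dbplyr)

# copy the two frames into an in-memory SQLite database as lazy tbls
x_db <- memdb_frame(x = c("a", "b", "c"))
y_db <- memdb_frame(y = c(1, 2, 3))

# full_join() with an empty `by` should be rendered as a CROSS JOIN by the backend
full_join(x_db, y_db, by = character()) %>%
  show_query()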
If we need a tidyverse output, we can use expand from tidyr
library(tidyverse)
y %>%
  expand(y, x = x$x) %>%
  select(x, y)
# A tibble: 9 × 2
# x y
# <fctr> <dbl>
#1 a 1
#2 b 1
#3 c 1
#4 a 2
#5 b 2
#6 c 2
#7 a 3
#8 b 3
#9 c 3
When faced with this problem, I tend to do something like this:
x <- data.frame(x=c("a","b","c"))
y <- data.frame(y=c(1,2,3))
x %>%
  mutate(temp = 1) %>%
  inner_join(y %>% mutate(temp = 1), by = "temp") %>%
  dplyr::select(-temp)
If x and y are multi-column data frames and I want every combination of a row of x with a row of y, this is neater than any expand.grid() option I can come up with, such as:
expand.grid(x=c("a","b","c"),y=c(1,2,3))
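For the multi-column case, here is a quick sketch of the same temp-column trick; df_a and df_b are made-up two-column frames used only for illustration:
library(dplyr)

df_a <- data.frame(x = c("a", "b"), x2 = c(10, 20))
df_b <- data.frame(y = c(1, 2), y2 = c("u", "v"))

# every row of df_a paired with every row of df_b (4 rows in total)
df_a %>% mutate(temp = 1) %>%
  inner_join(df_b %>% mutate(temp = 1), by = "temp") %>%
  dplyr::select(-temp)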
Edit: consider also the following elegant solution from "Y T" for n more complex data.frames:
https://stackoverflow.com/a/21911221/5350791
in short:
expand.grid.df <- function(...) Reduce(function(...) merge(..., by=NULL), list(...))
expand.grid.df(df1, df2, df3)
This is a continuation of dsz's comment. Idea came from: http://jarrettmeyer.com/2018/07/10/cross-join-dplyr.
tbl_1$fake <- 1
tbl_2$fake <- 1
my_cross_join <- full_join(tbl_1, tbl_2, by = "fake") %>%
select(-fake)
I tested this on four columns of data ranging in size from 4 to 640 obs, and it took about 1.08 seconds.
Comparing two of the answers above, full_join() with by = character() seems to be faster than crossing():
library(tidyverse)
library(microbenchmark)
df <- data.frame(blah = 1:10)
microbenchmark(diamonds %>% crossing(df))
Unit: milliseconds
expr min lq mean median uq max neval
diamonds %>% crossing(df) 21.70086 22.63943 23.72622 23.01447 24.25333 30.3367 100
microbenchmark(diamonds %>% full_join(df, by = character()))
Unit: milliseconds
expr min lq mean median uq max neval
diamonds %>% full_join(df, by = character()) 9.814783 10.23155 10.76592 10.44343 11.18464 15.71868 100

Counting amount of zeros within a "melted" data frame

Hi, I'm learning R and I'm trying to count how many zeros there are within melted data. That is, I want to know how many zeros correspond to columns b and c and print both results out.
I generated an example:
library(reshape)
library(plyr)
library(dplyr)
id = c(1,2,3,4,5,6,7,8,9,10)
b = c(0,0,5,6,3,7,2,8,1,8)
c = c(0,4,9,87,0,87,0,4,5,0)
test = data.frame(id,b,c)
test_melt = melt(test, id.vars = "id")
test_melt
I imagine I should create an if statement for that, something like
if (test$value == 0) {print()}, but how can I tell R to count zeros for columns that have been melted?
With your data:
test_melt %>%
  group_by(variable) %>%
  summarize(zeroes = sum(value == 0))
# # A tibble: 2 x 2
# variable zeroes
# <fctr> <int>
# 1 b 2
# 2 c 4
Base R:
aggregate(test_melt$value, by = list(variable = test_melt$variable),
FUN = function(x) sum(x == 0))
# variable x
# 1 b 2
# 2 c 4
... and for curiosity:
library(microbenchmark)
microbenchmark(
dplyr = group_by(test_melt, variable) %>% summarize(zeroes = sum(value == 0)),
base1 = aggregate(test_melt$value, by = list(variable = test_melt$variable), FUN = function(x) sum(x == 0)),
# #PankajKaundal's suggested "formula" notation reads easier
base2 = aggregate(value ~ variable, test_melt, function(x) sum(x == 0))
)
# Unit: microseconds
# expr min lq mean median uq max neval
# dplyr 916.421 986.985 1069.7000 1022.1760 1094.7460 2272.636 100
# base1 647.658 682.302 783.2065 715.3045 765.9940 1905.411 100
# base2 813.219 867.737 950.3247 897.0930 959.8175 2017.001 100
sum(test_melt$value==0)
This should do it.
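That gives the total number of zeros across both melted columns; to split the count by melted variable, a minimal base R sketch along these lines works:
# count zeros per melted variable (b and c)
tapply(test_melt$value == 0, test_melt$variable, sum)
# b c
# 2 4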
This might help. Is this what you're looking for?
> test_melt[4] <- 1
> test_melt2 <- aggregate(V4 ~ value + variable, test_melt, sum)
> test_melt2
value variable V4
1 0 b 2
2 1 b 1
3 2 b 1
4 3 b 1
5 5 b 1
6 6 b 1
7 7 b 1
8 8 b 2
9 0 c 4
10 4 c 2
11 5 c 1
12 9 c 1
13 87 c 2
V4 is the count

Grouping of R dataframe by connected values

I didn't find a solution for this common grouping problem in R:
This is my original dataset
ID State
1 A
2 A
3 B
4 B
5 B
6 A
7 A
8 A
9 C
10 C
This should be my grouped resulting dataset
State min(ID) max(ID)
A 1 2
B 3 5
A 6 8
C 9 10
So the idea is to sort the dataset first by the ID column (or a timestamp column). Then all connected states with no gaps should be grouped together and the min and max ID value should be returned. It's related to the rle method, but this doesn't allow the calculation of min, max values for the groups.
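For illustration, a plain rle call on the State column returns only run lengths and values, not the ID boundaries I'm after:
rle(c("A", "A", "B", "B", "B", "A", "A", "A", "C", "C"))
# Run Length Encoding
#   lengths: int [1:4] 2 3 3 2
#   values : chr [1:4] "A" "B" "A" "C"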
Any ideas?
You could try:
library(dplyr)
df %>%
  mutate(rleid = cumsum(State != lag(State, default = ""))) %>%
  group_by(rleid) %>%
  summarise(State = first(State), min = min(ID), max = max(ID)) %>%
  select(-rleid)
Or, as mentioned by #alistaire in the comments, you can actually mutate within group_by() with the same syntax, combining the first two steps. Stealing data.table::rleid() and using summarise_all() to simplify:
df %>%
  group_by(State, rleid = data.table::rleid(State)) %>%
  summarise_all(funs(min, max)) %>%
  select(-rleid)
Which gives:
## A tibble: 4 × 3
# State min max
# <fctr> <int> <int>
#1 A 1 2
#2 B 3 5
#3 A 6 8
#4 C 9 10
Here is a method that uses the rle function in base R for the data set you provided.
# get the run length encoding
temp <- rle(df$State)
# construct the data.frame
newDF <- data.frame(State = temp$values,
                    min.ID = c(1, head(cumsum(temp$lengths) + 1, -1)),
                    max.ID = cumsum(temp$lengths))
which returns
newDF
State min.ID max.ID
1 A 1 2
2 B 3 5
3 A 6 8
4 C 9 10
Note that rle requires a character vector rather than a factor, so I use the as.is argument below.
As #cryo111 notes in the comments below, the data set might be unordered timestamps that do not correspond to the lengths calculated in rle. For this method to work, you would need to first convert the timestamps to a date-time format, with a function like as.POSIXct, use df <- df[order(df$ID),], and then employ a slight alteration of the method above:
# get the run length encoding
temp <- rle(df$State)
# construct the data.frame
newDF <- data.frame(State = temp$values,
                    min.ID = df$ID[c(1, head(cumsum(temp$lengths) + 1, -1))],
                    max.ID = df$ID[cumsum(temp$lengths)])
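The timestamp step mentioned above might look something like this sketch; it assumes a hypothetical character ID column holding timestamps, and the format string is only an example of how they might be stored:
# hypothetical: ID holds timestamps as character strings
df$ID <- as.POSIXct(df$ID, format = "%Y-%m-%d %H:%M:%S")
# sort by the parsed timestamps before applying rle to State
df <- df[order(df$ID), ]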
data
df <- read.table(header=TRUE, as.is=TRUE, text="ID State
1 A
2 A
3 B
4 B
5 B
6 A
7 A
8 A
9 C
10 C")
An idea with data.table:
require(data.table)
dt <- fread("ID State
1 A
2 A
3 B
4 B
5 B
6 A
7 A
8 A
9 C
10 C")
dt[,rle := rleid(State)]
dt2<-dt[,list(min=min(ID),max=max(ID)),by=c("rle","State")]
which gives:
rle State min max
1: 1 A 1 2
2: 2 B 3 5
3: 3 A 6 8
4: 4 C 9 10
The idea is to identify sequences with rleid and then get the min and max of ID by the tuple (rle, State).
You can remove the rle column with
dt2[,rle:=NULL]
Chained:
dt2<-dt[,list(min=min(ID),max=max(ID)),by=c("rle","State")][,rle:=NULL]
You can shorten the above code even more by using rleid inside by directly:
dt2 <- dt[, .(min=min(ID),max=max(ID)), by=.(State, rleid(State))][, rleid:=NULL]
Here is another attempt using rle and aggregate from base R:
rl <- rle(df$State)
newdf <- data.frame(ID=df$ID, State=rep(1:length(rl$lengths),rl$lengths))
newdf <- aggregate(ID~State, newdf, FUN = function(x) c(minID=min(x), maxID=max(x)))
newdf$State <- rl$values
# State ID.minID ID.maxID
# 1 A 1 2
# 2 B 3 5
# 3 A 6 8
# 4 C 9 10
data
df <- structure(list(ID = 1:10, State = c("A", "A", "B", "B", "B",
"A", "A", "A", "C", "C")), .Names = c("ID", "State"), class = "data.frame",
row.names = c(NA,
-10L))

Count unique values of a column by pairwise combinations of another column in R

Let's say I have the following data frame:
ID Code
1 1 A
2 1 B
3 1 C
4 2 B
5 2 C
6 2 D
7 3 C
8 3 A
9 3 D
10 3 B
11 4 D
12 4 B
I would like to get the count of unique values of the column "ID" by pairwise combinations of the column "Code":
Code.Combinations Count.of.ID
1 A, B 2
2 A, C 2
3 A, D 1
4 B, C 3
5 B, D 3
6 C, D 2
I have searched for solutions online but so far haven't been able to achieve the desired result.
Any help would be appreciated. Thanks!
Here is a data.table way to solve the problem. Use combn function to pick up all possible combinations of Code and then count ID for each unique CodeComb:
library(data.table)
setDT(df)[, .(CodeComb = sapply(combn(Code, 2, simplify = FALSE),
                                function(cmb) paste(sort(cmb), collapse = ", "))),
          .(ID)                              # list all combinations of Code for each ID
          ][, .(IdCount = .N), .(CodeComb)]  # count number of unique id for each code combination
# CodeComb IdCount
# 1: A, B 2
# 2: A, C 2
# 3: B, C 3
# 4: B, D 3
# 5: C, D 2
# 6: A, D 1
Assuming your data.frame is named df and using dplyr
df %>%
  full_join(df, by = "ID") %>%
  group_by(Code.x, Code.y) %>%
  summarise(length(unique(ID))) %>%
  filter(Code.x != Code.y)
Join the df with itself and then count by the groups
Below makes use of combinations from the gtools package as well as count from the plyr package.
library(gtools)
library(plyr)
PairWiseCombo <- function(df) {
  myID <- df$ID
  BreakDown <- rle(myID)
  Unis <- BreakDown$values
  numUnis <- BreakDown$lengths
  Len <- length(Unis)
  e <- cumsum(numUnis)
  s <- c(1L, e + 1L)
  ## more efficient to generate outside of the "do.call(c, lapply(.."
  ## below. This allows me to reference a particular combination
  ## rather than re-generating the same combination multiple times
  myCombs <- lapply(2:max(numUnis), function(x) combinations(x, 2L))
  tempDF <- plyr::count(do.call(c, lapply(1:Len, function(i) {
    myRange <- s[i]:e[i]
    combs <- myCombs[[numUnis[i] - 1L]]
    vapply(1:nrow(combs), function(j)
      paste(sort(df$Code[myRange[combs[j, ]]]), collapse = ","), "A,D")
  })))
  names(tempDF) <- c("Code.Combinations", "Count.of.ID")
  tempDF
}
Below are some metrics. I didn't test the solution by #Carl as it was giving different results than the other solutions.
set.seed(537)
ID <- do.call(c, lapply(1:100, function(x) rep(x, sample(2:26,1))))
temp <- rle(ID)
Code <- do.call(c, lapply(1:100, function(x) LETTERS[sample(temp$lengths[x])]))
TestDF <- data.frame(ID, Code, stringsAsFactors = FALSE)
system.time(t1 <- Noah(TestDF))
user system elapsed
97.05 0.31 97.42
system.time(t2 <- DTSolution(TestDF))
user system elapsed
0.43 0.00 0.42
system.time(t3 <- PairWiseCombo(TestDF))
user system elapsed
0.42 0.00 0.42
identical(sort(t3[,2]),sort(t2$IdCount))
TRUE
identical(sort(t3[,2]),sort(t1[,2]))
TRUE
Using microbenchmark we have:
library(microbenchmark)
microbenchmark(Joseph = PairWiseCombo(TestDF), Psidom = DTSolution(TestDF), times = 10L)
Unit: milliseconds
expr min lq mean median uq max neval
Joseph 420.1090 433.9471 442.0133 446.4880 450.4420 452.7852 10
Psidom 396.8444 413.4933 416.3315 418.5573 420.9669 423.6303 10
Overall, the data.table solution provided by #Psidom was the fastest (not surprisingly). Both my solution and the data.table solution performed similarly on really large examples. However, the solution provided from #Noah is extremely memory intensive and couldn't be tested on larger data frames.
sessionInfo()
R version 3.3.0 (2016-05-03)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1
Update
After tweaking #Carl's solution, the dplyr approach is by far the fastest. Below is the code (you will see what parts I altered):
DPLYRSolution <- function(df) {
  df <- df %>%
    full_join(df, by = "ID") %>%
    group_by(Code.x, Code.y) %>%
    summarise(length(unique(ID))) %>%
    filter(Code.x != Code.y)
  ## These two lines were added by me to remove "duplicate" rows
  df <- mutate(df, Code = ifelse(Code.x < Code.y,
                                 paste(Code.x, Code.y),
                                 paste(Code.y, Code.x)))
  df[which(!duplicated(df$Code)), ]
}
Below are the new metrics:
system.time(t4 <- DPLYRSolution(TestDF))
user system elapsed
0.03 0.00 0.03 ### Wow!!! really fast
microbenchmark(Joseph = PairWiseCombo(TestDF), Psidom = DTSolution(TestDF),
Carl = DPLYRSolution(TestDF), times = 10L)
Unit: milliseconds
expr min lq mean median uq max neval
Joseph 437.87235 442.7348 450.91085 452.77204 457.09465 461.85035 10
Psidom 407.81519 416.9444 422.62793 425.26041 429.02064 434.38881 10
Carl 44.33698 44.8066 48.39051 45.35073 54.06513 59.35653 10
## Equality Check
identical(sort(c(t4[,3])[[1]]), sort(t1[,2]))
[1] TRUE
Using base only:
df <- data.frame(ID = c(1, 1, 1, 2, 2, 2, 3, 3, 3, 3, 4, 4),
                 code = c("A", "B", "C", "B", "C", "D", "C", "A", "D", "B", "D", "B"),
                 stringsAsFactors = FALSE)
# Create data.frame of unique combinations of codes
e <- expand.grid(df$code, df$code)
e <- e[e[, 1] != e[, 2], ]
e1 <- as.data.frame(unique(t(apply(e, 1, sort))), stringsAsFactors = FALSE)
# Count the occurrence of each code combination across IDs
e1$count <- apply(e1, 1, function(y)
  sum(sapply(unique(df$ID), function(x)
    sum(y[1] %in% df$code[df$ID == x] & y[2] %in% df$code[df$ID == x]))))
# Turn the codes into a string and print output
out <- data.frame(Code.Combinations = do.call(paste, c(e1[, 1:2], sep = ", ")),
                  Count.of.ID = e1$count, stringsAsFactors = FALSE)
out
# Code.Combinations Count.of.ID
# 1 A, B 2
# 2 A, C 2
# 3 A, D 1
# 4 B, C 3
# 5 B, D 3
# 6 C, D 2

Passing a value from a function into ddply

I've got ddply constructing a data.frame along these lines:
out <- ddply(data, .(names), varA = sum(value > 10))
That works fine, so I've tried to place it into a function
func <- function(val.in) {
  out <- ddply(data, .(names), varA = sum(value > val.in))
}
func(10)
This doesn't work - it looks like ddply can't find 'val.in'
Error in eval(expr, envir, enclos) : object 'val.in' not found
Anyone know why?
If not enough background, let me know and I'll update.
I've tried to recreate your problem using some sample data from the examples under ddply.
First, some sample data:
dfx <- data.frame(
  group = c(rep('A', 8), rep('B', 15), rep('C', 6)),
  sex = sample(c("M", "F"), size = 29, replace = TRUE),
  age = runif(n = 29, min = 18, max = 54)
)
head(dfx)
# group sex age
# 1 A F 53.08787
# 2 A M 30.47225
# 3 A F 26.78341
# 4 A F 26.46841
# 5 A F 34.65360
# 6 A M 21.26691
Here's what you might try that would work (I assume you meant to use summarize in your question).
library(plyr)
ddply(dfx, .(group, sex), summarize, varA = sum(age > 25))
# group sex varA
# 1 A F 5
# 2 A M 1
# 3 B F 6
# 4 B M 4
# 5 C F 3
# 6 C M 2
We might then try to use it in a function as follows:
func <- function(val.in) {
  out <- ddply(dfx, .(group, sex), summarize, varA = sum(age > val.in))
  out
}
func(25)
# Error in eval(expr, envir, enclos) : object 'val.in' not found
^^ There's your error ^^
The most straightforward solution is to use here (which helps ddply figure out where to look for things):
func <- function(val.in) {
  out <- ddply(dfx, .(group, sex), here(summarize), varA = sum(age > val.in))
  out
}
func(25)
# group sex varA
# 1 A F 5
# 2 A M 1
# 3 B F 6
# 4 B M 4
# 5 C F 3
# 6 C M 2
Update
This doesn't seem to be a problem in "dplyr" as far as I can tell:
library(dplyr)
myFun <- function(val.in) {
  dfx %>% group_by(group, sex) %>% summarise(varA = sum(age > val.in))
}
myFun(10)
# Source: local data frame [6 x 3]
# Groups: group
#
# group sex varA
# 1 A F 5
# 2 A M 3
# 3 B F 7
# 4 B M 8
# 5 C F 2
# 6 C M 4
Seems like you want to write an anonymous function and pass in the second argument:
func <- function(val.in) {
  ddply(data, .(names), function(value, val.in) data.frame(varA = sum(value > val.in)), val.in)
}
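A minimal sketch of the same idea applied to the dfx sample data from the earlier answer; note that the anonymous function here indexes the age column of each piece explicitly, which is an assumption about the intended data layout:
# the extra argument after the anonymous function is forwarded to it by ddply
func2 <- function(val.in) {
  ddply(dfx, .(group, sex),
        function(piece, cutoff) data.frame(varA = sum(piece$age > cutoff)),
        val.in)
}
func2(25)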
