I've got ddply constructing a data.frame along these lines:
out <- ddply(data, .(names), varA = sum(value > 10))
That works fine, so I've tried to place it into a function
func <- function(val.in) {
out <- ddply(data, .(names), varA = sum(value > val.in))
}
func(10)
This doesn't work - it looks like ddply can't find 'val.in'
Error in eval(expr, envir, enclos) : object 'val.in' not found
Anyone know why?
If that's not enough background, let me know and I'll update.
I've tried to recreate your problem using some sample data from the examples under ddply.
First, some sample data:
dfx <- data.frame(
group = c(rep('A', 8), rep('B', 15), rep('C', 6)),
sex = sample(c("M", "F"), size = 29, replace = TRUE),
age = runif(n = 29, min = 18, max = 54)
)
head(dfx)
# group sex age
# 1 A F 53.08787
# 2 A M 30.47225
# 3 A F 26.78341
# 4 A F 26.46841
# 5 A F 34.65360
# 6 A M 21.26691
Here's what you might try that would work (I assume you meant to use summarize in your question).
library(plyr)
ddply(dfx, .(group, sex), summarize, varA = sum(age > 25))
# group sex varA
# 1 A F 5
# 2 A M 1
# 3 B F 6
# 4 B M 4
# 5 C F 3
# 6 C M 2
We might then try to use it in a function as follows:
func <- function(val.in) {
out <- ddply(dfx, .(group, sex), summarize, varA = sum(age > val.in))
out
}
func(25)
# Error in eval(expr, envir, enclos) : object 'val.in' not found
^^ There's your error ^^
The most straightforward solution is to use here (which helps ddply figure out where to look for things):
func <- function(val.in) {
out <- ddply(dfx, .(group, sex), here(summarize), varA = sum(age > val.in))
out
}
func(25)
# group sex varA
# 1 A F 5
# 2 A M 1
# 3 B F 6
# 4 B M 4
# 5 C F 3
# 6 C M 2
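To see why the lookup fails, here is a simplified base R model of the scoping issue. It is a sketch for illustration only, not plyr's actual code: the helper names and the use of baseenv() are assumptions chosen to mimic "an environment that does not contain val.in". As far as I understand it, summarize evaluates its expressions in the data first and then falls back to an enclosing environment, and inside ddply that fallback is not the body of your function; here() pins it back to the place where you wrote the call.

```r
# Stand-in for one group's data
ages <- list(age = c(20, 30, 40))

lookup_wrong_env <- function(val.in) {
  # enclos = baseenv(): after the data, names are searched in base R only,
  # so the function's local val.in is invisible (like the ddply error).
  eval(quote(sum(age > val.in)), ages, baseenv())
}

lookup_caller_env <- function(val.in) {
  # enclos = environment(): this function's own frame is searched after
  # the data, so val.in is found.
  eval(quote(sum(age > val.in)), ages, environment())
}

broken <- try(lookup_wrong_env(25), silent = TRUE)
fixed  <- lookup_caller_env(25)

inherits(broken, "try-error")  # the "object 'val.in' not found" case
fixed                          # number of ages above 25
```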
Update
This doesn't seem to be a problem in "dplyr" as far as I can tell:
library(dplyr)
myFun <- function(val.in) {
dfx %>% group_by(group, sex) %>% summarise(varA = sum(age > val.in))
}
myFun(10)
# Source: local data frame [6 x 3]
# Groups: group
#
# group sex varA
# 1 A F 5
# 2 A M 3
# 3 B F 7
# 4 B M 8
# 5 C F 2
# 6 C M 4
Seems like you want to write an anonymous function and pass in the second argument:
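func <- function(val.in) {
  # ddply passes each group's data frame as the first argument; val.in is forwarded through ...
  ddply(data, .(names), function(d, val.in) data.frame(varA = sum(d$value > val.in)), val.in = val.in)
}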
Related
I would like to ask if there is a way of removing a group from a data frame using dplyr (or any other way, for that matter) in the following way. Let's say I have a data frame in the following form, grouped by Variable 1:
Variable 1 Variable 2
1 a
1 b
2 a
2 a
2 b
3 a
3 c
3 a
... ...
I would like to remove only the groups that have two consecutive identical values in Variable 2. In the table above, that would remove group 2, because its values are a,a,b, but not group 3, where they are a,c,a. So I would get the table below:
Variable 1 Variable 2
1 a
1 b
3 a
3 c
3 a
... ...
To test for consecutive identical values, you can compare a value to the previous value in that column. In dplyr, this is possible with lag. (You could do the same thing with comparing to the next value, using lead. Result comes out the same.)
Group the data by variable1, get the lag of variable2, then add up how many of these duplicates there are in that group. Then filter for just the groups with no duplicates. After that, feel free to remove the dupesInGroup column.
library(tidyverse)
df %>%
group_by(variable1) %>%
mutate(dupesInGroup = sum(variable2 == lag(variable2), na.rm = T)) %>%
filter(dupesInGroup == 0)
#> # A tibble: 5 x 3
#> # Groups: variable1 [2]
#> variable1 variable2 dupesInGroup
#> <int> <chr> <int>
#> 1 1 a 0
#> 2 1 b 0
#> 3 3 a 0
#> 4 3 c 0
#> 5 3 a 0
Created on 2018-05-10 by the reprex package (v0.2.0).
prepare data frame:
df <- data.frame("Variable 1" = c(1, 1, 2, 2, 2, 3, 3, 3), "Variable 2" = unlist(strsplit("abaabaca", "")))
write functions to test if consecutive repetitions are there or not:
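any.consecutive.p <- function(v) {
  # seq_len() gives an empty sequence for vectors shorter than 2, whereas 1:0 would iterate over c(1, 0)
  for (i in seq_len(max(length(v) - 1, 0))) {
    if (v[i] == v[i + 1]) {
      return(TRUE)
    }
  }
  return(FALSE)
}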
any.consecutive.in.col.p <- function(df, col) {
any.consecutive.p(df[, col])
}
any.consecutive.p() returns TRUE as soon as it finds two identical adjacent values in a vector (v), and FALSE otherwise.
any.consecutive.in.col.p() applies the same test to one column of a data frame.
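The loop in any.consecutive.p can also be written without explicit indexing using base R's rle(), which collapses runs of equal adjacent values. The sketch below is one possible variant, not a replacement the original answer proposed; the helper name any_consecutive_rle is made up for illustration.

```r
# Hypothetical helper: TRUE if any value is immediately repeated.
any_consecutive_rle <- function(v) {
  # A run length greater than 1 means two equal neighbours exist.
  any(rle(as.character(v))$lengths > 1)
}

df <- data.frame(Variable.1 = c(1, 1, 2, 2, 2, 3, 3, 3),
                 Variable.2 = unlist(strsplit("abaabaca", "")),
                 stringsAsFactors = FALSE)

# One logical per group: does Variable.2 contain a consecutive repeat?
has_repeat <- vapply(split(df$Variable.2, df$Variable.1),
                     any_consecutive_rle, logical(1))

# Keep only rows whose group has no consecutive repeat.
res <- df[df$Variable.1 %in% names(has_repeat)[!has_repeat], ]
res
```

This also behaves sensibly for one-element groups, since a single run of length 1 is never flagged.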
split data frame by values of Variable.1
df.l <- split(df, df$Variable.1)
df.l
$`1`
Variable.1 Variable.2
1 1 a
2 1 b
$`2`
Variable.1 Variable.2
3 2 a
4 2 a
5 2 b
$`3`
Variable.1 Variable.2
6 3 a
7 3 c
8 3 a
Finally, go over this list of data frames and test whether each one contains consecutive duplicates in its Variable.2 column.
If it does, don't collect it.
Bind the collected data frames by rows.
Reduce(rbind, lapply(df.l, function(df) if(!any.consecutive.in.col.p(df, "Variable.2")) {df}))
Variable.1 Variable.2
1 1 a
2 1 b
6 3 a
7 3 c
8 3 a
Say you want to remove all groups of df, grouped by a, where the column b has repeated values. You can do that as below.
set.seed(0)
df <- data.frame(a = rep(1:3, rep(3, 3)), b = sample(1:5, 9, T))
# dplyr
library(dplyr)
df %>%
group_by(a) %>%
filter(all(b != lag(b), na.rm = T))
#data.table
library(data.table)
setDT(df)
df[, if(all(b != shift(b), na.rm = T)) .SD, by = a]
Benchmark shows data.table is faster
#Results
# Unit: milliseconds
# expr min lq mean median uq max neval
# use_dplyr() 141.46819 165.03761 201.0975 179.48334 205.82301 539.5643 100
# use_DT() 36.27936 50.23011 64.9218 53.87114 66.73943 345.2863 100
# Method
set.seed(0)
df <- data.table(a = rep(1:2000, rep(1e3, 2000)), b = sample(1:1e3, 2e6, T))
use_dplyr <- function(x){
df %>%
group_by(a) %>%
filter(all(b != lag(b), na.rm = T))
}
use_DT <- function(x){
df[, if (all(b != shift(b), na.rm = T)) .SD, a]
}
microbenchmark(use_dplyr(), use_DT())
Hi, I'm learning R and I'm trying to count how many zeros I have in melted data. I want to know how many zeros correspond to each of the columns b and c, and print the two results.
I generated an example:
library(reshape)
library(plyr)
library(dplyr)
id = c(1,2,3,4,5,6,7,8,9,10)
b = c(0,0,5,6,3,7,2,8,1,8)
c = c(0,4,9,87,0,87,0,4,5,0)
test = data.frame(id,b,c)
test_melt = melt(test, id.vars = "id")
test_melt
I imagine I should create an if statement for that, something like
if (test$value == 0){print()}, but how can I tell R to count zeros for columns that have been melted?
With your data:
test_melt %>%
group_by(variable) %>%
summarize(zeroes = sum(value == 0))
# # A tibble: 2 x 2
# variable zeroes
# <fctr> <int>
# 1 b 2
# 2 c 4
Base R:
aggregate(test_melt$value, by = list(variable = test_melt$variable),
FUN = function(x) sum(x == 0))
# variable x
# 1 b 2
# 2 c 4
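Two more base R sketches of the same count, in case they help. The first works on the wide data before melting; the second uses stack() as a rough base R stand-in for melt() (an assumption made so the example needs no packages; stack() names its columns values and ind rather than value and variable).

```r
test <- data.frame(id = 1:10,
                   b = c(0, 0, 5, 6, 3, 7, 2, 8, 1, 8),
                   c = c(0, 4, 9, 87, 0, 87, 0, 4, 5, 0))

# Wide data: compare every cell with 0, then count TRUEs per column.
zeros_wide <- colSums(test[, c("b", "c")] == 0)

# Long data: stack() yields columns `values` and `ind`.
test_long  <- stack(test[, c("b", "c")])
zeros_long <- tapply(test_long$values == 0, test_long$ind, sum)

zeros_wide
zeros_long
```

No if statement is needed: the comparison value == 0 is already vectorised, and summing a logical vector counts its TRUE entries.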
... and for curiosity:
library(microbenchmark)
microbenchmark(
dplyr = group_by(test_melt, variable) %>% summarize(zeroes = sum(value == 0)),
base1 = aggregate(test_melt$value, by = list(variable = test_melt$variable), FUN = function(x) sum(x == 0)),
# @PankajKaundal's suggested "formula" notation reads easier
base2 = aggregate(value ~ variable, test_melt, function(x) sum(x == 0))
)
# Unit: microseconds
# expr min lq mean median uq max neval
# dplyr 916.421 986.985 1069.7000 1022.1760 1094.7460 2272.636 100
# base1 647.658 682.302 783.2065 715.3045 765.9940 1905.411 100
# base2 813.219 867.737 950.3247 897.0930 959.8175 2017.001 100
sum(test_melt$value==0)
This should do it.
This might help. Is this what you're looking for?
> test_melt[4] <- 1
> test_melt2 <- aggregate(V4 ~ value + variable, test_melt, sum)
> test_melt2
value variable V4
1 0 b 2
2 1 b 1
3 2 b 1
4 3 b 1
5 5 b 1
6 6 b 1
7 7 b 1
8 8 b 2
9 0 c 4
10 4 c 2
11 5 c 1
12 9 c 1
13 87 c 2
V4 is the count
I have a data frame that looks as follows:
> df <- data_frame(g = c('A', 'A', 'B', 'B', 'B', 'C'), x = c(7, 3, 5, 9, 2, 4))
> df
Source: local data frame [6 x 2]
g x
1 A 7
2 A 3
3 B 5
4 B 9
5 B 2
6 C 4
I know how to add a column with the maximum x value for each group g:
> df %>% group_by(g) %>% mutate(x_max = max(x))
Source: local data frame [6 x 3]
Groups: g
g x x_max
1 A 7 7
2 A 3 7
3 B 5 9
4 B 9 9
5 B 2 9
6 C 4 4
But what I would like to get is the maximum x value for each group g, excluding the x value in each row.
For the given example, the desired output would look like this:
Source: local data frame [6 x 3]
Groups: g
g x x_max x_max_exclude
1 A 7 7 3
2 A 3 7 7
3 B 5 9 9
4 B 9 9 5
5 B 2 9 9
6 C 4 4 NA
I thought I might be able to use row_number() to remove particular elements and take the max of what remained, but hit warning messages and got incorrect -Inf output:
> df %>% group_by(g) %>% mutate(x_max = max(x), r = row_number(), x_max_exclude = max(x[-r]))
Source: local data frame [6 x 5]
Groups: g
g x x_max r x_max_exclude
1 A 7 7 1 -Inf
2 A 3 7 2 -Inf
3 B 5 9 1 -Inf
4 B 9 9 2 -Inf
5 B 2 9 3 -Inf
6 C 4 4 1 -Inf
Warning messages:
1: In max(c(4, 9, 2)[-1:3]) :
no non-missing arguments to max; returning -Inf
2: In max(c(4, 9, 2)[-1:3]) :
no non-missing arguments to max; returning -Inf
3: In max(c(4, 9, 2)[-1:3]) :
no non-missing arguments to max; returning -Inf
What is the most {readable, concise, efficient} way to get this output in dplyr? Any insight into why my attempt using row_number() doesn't work would also be much appreciated. Thanks for the help.
You could try:
df %>%
group_by(g) %>%
arrange(desc(x)) %>%
mutate(max = ifelse(x == max(x), x[2], max(x)))
Which gives:
#Source: local data frame [6 x 3]
#Groups: g
#
# g x max
#1 A 7 3
#2 A 3 7
#3 B 9 5
#4 B 5 9
#5 B 2 9
#6 C 4 NA
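As for why the row_number() attempt in the question returns -Inf: inside a grouped mutate(), row_number() yields the whole vector of positions for the group, so x[-r] drops every element at once instead of one element per row. The base R sketch below reproduces that for group B; the variable names are only for illustration.

```r
x <- c(5, 9, 2)      # the x values of group B
r <- seq_along(x)    # what row_number() returns within that group: 1 2 3

# Negative indexing with all positions removes everything.
remaining <- x[-r]
all_dropped <- suppressWarnings(max(remaining))  # max of nothing

# Dropping one position at a time gives the intended result.
per_row <- vapply(seq_along(x), function(i) max(x[-i]), numeric(1))

length(remaining)
all_dropped
per_row
```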
Benchmark
I've benchmarked the solutions posted so far:
df <- data.frame(g = sample(LETTERS, 10e5, replace = TRUE),
x = sample(1:10, 10e5, replace = TRUE))
library(microbenchmark)
mbm <- microbenchmark(
steven = df %>%
group_by(g) %>%
arrange(desc(x)) %>%
mutate(max = ifelse(x == max(x), x[2], max(x))),
eric = df %>%
group_by(g) %>%
mutate(x_max = max(x),
x_max2 = sort(x, decreasing = TRUE)[2],
x_max_exclude = ifelse(x == x_max, x_max2, x_max)) %>%
select(-x_max2),
arun = setDT(df)[order(x), x_max_exclude := c(rep(x[.N], .N-1L), x[.N-1L]), by=g],
times = 50
)
#Arun's data.table solution is the fastest:
# Unit: milliseconds
# expr min lq mean median uq max neval cld
# steven 158.58083 163.82669 197.28946 210.54179 212.1517 260.1448 50 b
# eric 223.37877 228.98313 262.01623 274.74702 277.1431 284.5170 50 c
# arun 44.48639 46.17961 54.65824 47.74142 48.9884 102.3830 50 a
Interesting problem. Here's one way using data.table:
require(data.table)
setDT(df)[order(x), x_max_exclude := c(rep(x[.N], .N-1L), x[.N-1L]), by=g]
The idea is to order by column x and, on those indices, group by g. Because the rows within each group are then in ascending order of x, the max for the first .N-1 rows is the value at row .N, and for row .N itself it is the value at row .N-1.
.N is a special variable that holds the number of observations in each group.
I'll leave it to you and/or the dplyr experts to translate this (or answer with another approach).
This is the best I've come up with so far. Not sure if there's a better way.
df %>%
group_by(g) %>%
mutate(x_max = max(x),
x_max2 = sort(x, decreasing = TRUE)[2],
x_max_exclude = ifelse(x == x_max, x_max2, x_max)) %>%
select(-x_max2)
Another way with a functional:
df %>% group_by(g) %>% mutate(x_max_exclude = max_exclude(x))
Source: local data frame [6 x 3]
Groups: g
g x x_max_exclude
1 A 7 3
2 A 3 7
3 B 5 9
4 B 9 5
5 B 2 9
6 C 4 NA
We write a function called max_exclude that does the operation that you describe.
max_exclude <- function(v) {
res <- c()
for(i in seq_along(v)) {
res[i] <- suppressWarnings(max(v[-i]))
}
res <- ifelse(!is.finite(res), NA, res)
as.numeric(res)
}
It works with base R too:
df$x_max_exclude <- with(df, ave(x, g, FUN=max_exclude))
Source: local data frame [6 x 3]
g x x_max_exclude
1 A 7 3
2 A 3 7
3 B 5 9
4 B 9 5
5 B 2 9
6 C 4 NA
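If the loop in max_exclude turns out to be too slow, a loop-free variant is possible. This is only a sketch: the name max_exclude_vec is made up, and it assumes the vector has no NA values. It relies on the fact that every row sees the group maximum except the rows that hold it, which see the runner-up.

```r
# Hypothetical vectorised variant of max_exclude (assumes no NAs in v).
max_exclude_vec <- function(v) {
  if (length(v) < 2) return(rep(NA_real_, length(v)))
  top <- sort(v, decreasing = TRUE)[1:2]
  # Rows holding the maximum get the runner-up; all others get the maximum.
  # With a tied maximum, top[2] equals top[1], which is still correct.
  as.numeric(ifelse(v == top[1], top[2], top[1]))
}

df <- data.frame(g = c("A", "A", "B", "B", "B", "C"),
                 x = c(7, 3, 5, 9, 2, 4))
df$x_max_exclude <- ave(df$x, df$g, FUN = max_exclude_vec)
df
```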
Benchmark
Here's a lesson kids, beware of for loops!
big.df <- data.frame(g=rep(LETTERS[1:4], each=1e3), x=sample(10, 4e3, replace=T))
microbenchmark(
plafort_dplyr = big.df %>% group_by(g) %>% mutate(x_max_exclude = max_exclude(x)),
plafort_ave = big.df$x_max_exclude <- with(big.df, ave(x, g, FUN=max_exclude)),
StevenB = (big.df %>%
group_by(g) %>%
mutate(max = ifelse(row_number(desc(x)) == 1, x[row_number(desc(x)) == 2], max(x)))
),
Eric = df %>%
group_by(g) %>%
mutate(x_max = max(x),
x_max2 = sort(x, decreasing = TRUE)[2],
x_max_exclude = ifelse(x == x_max, x_max2, x_max)) %>%
select(-x_max2),
Arun = setDT(df)[order(x), x_max_exclude := c(rep(x[.N], .N-1L), x[.N-1L]), by=g]
)
Unit: milliseconds
expr min lq mean median uq max neval
plafort_dplyr 75.219042 85.207442 89.247409 88.203225 90.627663 179.553166 100
plafort_ave 75.907798 84.604180 87.136122 86.961251 89.431884 104.884294 100
StevenB 4.436973 4.699226 5.207548 4.931484 5.364242 11.893306 100
Eric 7.233057 8.034092 8.921904 8.414720 9.060488 15.946281 100
Arun 1.789097 2.037235 2.410915 2.226988 2.423638 9.326272 100
I've got two data frames in which the unique identifiers common to both frames differ in the number of observations. I would like to create a dataframe from both in which the observations from each frame are taken if they have more observations for a common identifier. For example:
f1 <- data.frame(x = c("a", "a", "b", "c", "c", "c"), y = c(1,1,2,3,3,3))
f2 <- data.frame(x = c("a","b", "b", "c", "c"), y = c(4,5,5,6,6))
I would like this to generate a merge based on the longer x such that it produces:
x y
a 1
a 1
b 5
b 5
c 3
c 3
c 3
Any and all thoughts would be great.
Here's a solution using split
dd<-rbind(cbind(f1, s="f1"), cbind(f2, s="f2"))
keep<-unsplit(lapply(split(dd$s, dd$x), FUN=function(x) {
y<-table(x)
x == names(y[which.max(y)])
}), dd$x)
dd <- dd[keep,]
Normally I'd prefer to use the ave function here, but because I'm changing data types from a factor to a logical, it wasn't as appropriate, so I basically copied the idea that ave uses and used split.
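The same comparison can also be sketched by counting rows per key in each frame and taking each key from whichever frame has more. This is an alternative base R outline, not the method above; the tie rule (f1 wins when both frames have the same count) is an assumption, since the question doesn't say what should happen then.

```r
f1 <- data.frame(x = c("a", "a", "b", "c", "c", "c"), y = c(1, 1, 2, 3, 3, 3),
                 stringsAsFactors = FALSE)
f2 <- data.frame(x = c("a", "b", "b", "c", "c"), y = c(4, 5, 5, 6, 6),
                 stringsAsFactors = FALSE)

keys <- union(f1$x, f2$x)
# Rows per key in each frame; factor() keeps absent keys with a count of 0.
n1 <- as.vector(table(factor(f1$x, levels = keys)))
n2 <- as.vector(table(factor(f2$x, levels = keys)))

# Assumption: on a tie, take the rows from f1.
from_f1 <- keys[n1 >= n2]

longer <- rbind(f1[f1$x %in% from_f1, ], f2[!(f2$x %in% from_f1), ])
longer <- longer[order(longer$x), ]
rownames(longer) <- NULL
longer
```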
dplyr solution
library(dplyr)
First we combine the data:
with rbind() and introduce a new variable called ref to know where each observation came from:
both <- rbind( f1, f2 )
both$ref <- rep( c( "f1", "f2" ) , c( nrow(f1), nrow(f2) ) )
then count the observations:
make another new variable that contains how many observations for each ref and x combination:
both_with_counts <- both %>%
group_by( ref ,x ) %>%
mutate( counts = n() )
then filter for the largest count:
both_with_counts %>% group_by( x ) %>% filter( counts==max(counts) )
note: you could also select only the x and y cols with select(x,y)...
this gives:
## Source: local data frame [7 x 4]
## Groups: x
##
## x y ref counts
## 1 a 1 f1 2
## 2 a 1 f1 2
## 3 c 3 f1 3
## 4 c 3 f1 3
## 5 c 3 f1 3
## 6 b 5 f2 2
## 7 b 5 f2 2
Altogether now...
what_I_want <-
rbind(cbind(f1,ref = "f1"),cbind(f2,ref = "f2")) %>%
group_by(ref,x) %>%
mutate(counts = n()) %>%
group_by( x ) %>%
filter( counts==max(counts) ) %>%
select( x, y )
and thus:
> what_I_want
# Source: local data frame [7 x 2]
# Groups: x
#
# x y
# 1 a 1
# 2 a 1
# 3 c 3
# 4 c 3
# 5 c 3
# 6 b 5
# 7 b 5
Not an elegant answer, but it still gives the desired result. Hope this helps.
f1table <- data.frame(table(f1$x))
colnames(f1table) <- c("x","freq")
f1new <- merge(f1,f1table)
f2table <- data.frame(table(f2$x))
colnames(f2table) <- c("x","freq")
f2new <- merge(f2,f2table)
table <- rbind(f1table, f2table)
table <- table[with(table, order(x,-freq)), ]
table <- table[!duplicated(table$x), ]
data <-rbind(f1new, f2new)
merge(data, table, by=c("x","freq"))[,c(1,3)]
x y
1 a 1
2 a 1
3 b 5
4 b 5
5 c 3
6 c 3
7 c 3
I have a data set along these lines:
df<-data.frame(sp=c(100, 100, 100, 101, 101, 101, 102, 102, 102),
type=c("C","C","C","H","H","H","C","C","C"),
country=c("A", "A", "A", "B", "B", "B", "C", "C", "C"),
vals=c(1,2,3,4,5,6,7,8,9)
)
I want to aggregate df$vals and carry the other variables through as well.
At the moment I'm doing it like this:
multi.func<- function(x){
c(
n = length(x),
min = min(x, na.rm=TRUE),
max = max(x, na.rm=TRUE),
mean = mean(x, na.rm=TRUE)
)}
aggVals<-as.data.frame(do.call(rbind, by(df$vals, df$sp, FUN=multi.func, simplify=TRUE)))
aggVals$sp<-row.names(aggVals)
aggDescrip<-aggregate(cbind(as.character(type), as.character(country)) ~ sp, data=df, FUN=unique)
result<-merge(aggDescrip,aggVals)
This works well enough but I wondered if there's an easier way.
Thanks
Perhaps you should look into the data.table package.
library(data.table)
DT <- data.table(df, key="sp")
DT[, list(type = unique(as.character(type)),
country = unique(as.character(country)),
n = .N, min = min(vals), max = max(vals),
mean = mean(vals)), by=key(DT)]
# sp type country n min max mean
# 1: 100 C A 3 1 3 2
# 2: 101 H B 3 4 6 5
# 3: 102 C C 3 7 9 8
If you want to stick with base R, here is another approach that might be of use (though aggregate is probably more common):
unique(within(df, {
mean <- ave(vals, sp, FUN=mean)
max <- ave(vals, sp, FUN=max)
min <- ave(vals, sp, FUN=min)
n <- ave(vals, sp, FUN=length)
rm(vals)
}))
# sp type country n min max mean
# 1 100 C A 3 1 3 2
# 4 101 H B 3 4 6 5
# 7 102 C C 3 7 9 8
Update: A variation on your initial attempt
I would suggest sticking with data.table if possible, because the resulting code is easy to follow and the process of aggregation is quick.
However, with a little bit of modification, you can have (yet another) base R approach that is somewhat more direct.
First, modify your function so that it returns a data.frame instead of a vector built with c(). Also, add an argument that specifies which column needs to be aggregated.
multi.func <- function(x, value_column) {
data.frame(
n = length(x[[value_column]]),
min = min(x[[value_column]], na.rm=TRUE),
max = max(x[[value_column]], na.rm=TRUE),
mean = mean(x[[value_column]], na.rm=TRUE))
}
Second, use lapply on your dataset, split up by your grouping variable, merge the output with your original dataset, and return the unique values.
unique(merge(df[-4],
do.call(rbind, lapply(split(df, df$sp),
multi.func, value_column = "vals")),
by.x = "sp", by.y = "row.names"))
Using just aggregate:
result <- aggregate(vals ~ type + sp + country, df,
function(x) c(length(x), min(x), max(x), mean(x))
)
result
type sp country vals.1 vals.2 vals.3 vals.4
1 C 100 A 3 1 3 2
2 H 101 B 3 4 6 5
3 C 102 C 3 7 9 8
colnames(result)
[1] "type" "sp" "country" "vals"
The above stores all four statistics in a single matrix column named vals, which is why colnames() reports only four names. summaryBy from the doBy package is similar to aggregate but returns a separate column for each statistic:
library(doBy)
result <- summaryBy(vals ~ type + sp + country, df,
FUN=function(x) c(n=length(x), min=min(x), max=max(x), mean=mean(x))
)
result
type sp country vals.n vals.min vals.max vals.mean
1 C 100 A 3 1 3 2
2 C 102 C 3 7 9 8
3 H 101 B 3 4 6 5
colnames(result)
[1] "type" "sp" "country" "vals.n" "vals.min" "vals.max"
[7] "vals.mean"
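If you'd rather stay with aggregate and avoid the extra package, the matrix column it produces can be spread into ordinary columns with do.call(data.frame, ...). A minimal sketch on the question's data; naming the elements of the returned vector is my addition so that the flattened columns get readable names.

```r
df <- data.frame(sp = c(100, 100, 100, 101, 101, 101, 102, 102, 102),
                 type = c("C", "C", "C", "H", "H", "H", "C", "C", "C"),
                 country = c("A", "A", "A", "B", "B", "B", "C", "C", "C"),
                 vals = c(1, 2, 3, 4, 5, 6, 7, 8, 9))

result <- aggregate(vals ~ type + sp + country, df,
                    function(x) c(n = length(x), min = min(x),
                                  max = max(x), mean = mean(x)))

# result$vals is a matrix; data.frame() expands it into vals.n, vals.min, ...
flat <- do.call(data.frame, result)

ncol(result)
colnames(flat)
```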