I have a grouped data frame, in which the grouping variable is SEED. I want to take the groups defined by the values of SEED, set the seed to the value of SEED for each group, and then shuffle the rows of each group using dplyr::sample_frac. However, I cannot replicate my results, which indicates that the seed isn't being set correctly.
To do this in a dplyr-ish way, I wrote the following function:
> library(dplyr)
> ss_sampleseed <- function(df, seed.){
> set.seed(df$seed.)
> sample_frac(df, 1)
> }
I then use this function on my data:
> dg <- structure(list(Gene = c("CAMK1", "ARPC4", "CIDEC", "CAMK1", "ARPC4",
> "CIDEC"), GENESEED = c(1, 1, 1, 2, 2, 2)), class = c("tbl_df",
> "tbl", "data.frame"), row.names = c(NA, -6L), .Names = c("Gene",
> "GENESEED"))
> dg2 <- dg %>%
> group_by(GENESEED) %>%
> ss_sampleseed(GENESEED)
> dg2
Source: local data frame [6 x 2]
Groups: GENESEED
Gene GENESEED
1 ARPC4 1
2 CIDEC 1
3 CAMK1 1
4 CIDEC 2
5 ARPC4 2
6 CAMK1 2
However, when I repeat the above code, I cannot replicate my results.
> dg2
Source: local data frame [6 x 2]
Groups: GENESEED
Gene GENESEED
1 ARPC4 1
2 CAMK1 1
3 CIDEC 1
4 CAMK1 2
5 ARPC4 2
6 CIDEC 2
The problem here is that the dollar sign does not substitute the argument you pass; it looks up the literal name written after it. See this minimal example:
df <- data.frame(x = "x", GENESEED = "GENESEED")
h <- function(df,x){
df$x
}
h(df, GENESEED)
[1] x
Levels: x
See that h returns x even though you asked for GENESEED. In the same way, your function is actually trying to get df$seed., which does not exist, so it returns NULL.
But there is another problem. Even correcting this and passing directly the seed, it seems that it would not work as you want, because, if you look at the code of sample_frac, dplyr will eventually run the following line:
sampled <- lapply(index, sample_group, frac = TRUE, tbl = tbl,
size = size, replace = replace, weight = weight, .env = .env)
Notice that lapply runs over all the groups after you have set the seed once, so you will not have set a different seed for each group according to GENESEED, as you wanted.
Taking this into consideration, I came up with this solution, using sample.int and do:
ss_sampleseed <- function(x){
set.seed(unique(x$GENESEED))
x[sample.int(nrow(x)), ]
}
dg %>% group_by(GENESEED) %>% do(ss_sampleseed(.))
This seems to be working as you want.
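One way to check this is to run the per-group shuffle twice and compare the two results. The following is only a minimal sketch in base R, not the dplyr code above: split() and lapply() stand in for group_by() and do(), and the helper name shuffle_by_seed is hypothetical.

```r
# Same data as in the question, as a plain data frame
dg <- data.frame(
  Gene = rep(c("CAMK1", "ARPC4", "CIDEC"), times = 2),
  GENESEED = rep(c(1, 2), each = 3)
)

# Hypothetical helper: shuffle the rows of each group, seeding the RNG
# with that group's own value of `seed_col` before sampling
shuffle_by_seed <- function(df, seed_col) {
  pieces <- lapply(split(df, df[[seed_col]]), function(g) {
    set.seed(g[[seed_col]][1])            # one seed per group
    g[sample.int(nrow(g)), , drop = FALSE]
  })
  out <- do.call(rbind, pieces)
  rownames(out) <- NULL
  out
}

run1 <- shuffle_by_seed(dg, "GENESEED")
run2 <- shuffle_by_seed(dg, "GENESEED")
identical(run1, run2)  # TRUE: both runs give the same row order
```

Because the seed is set inside the per-group function, each group's shuffle depends only on its own seed value and not on the state left behind by the previous group.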
I think the main issue here is the way you use $ inside your function. I certainly had to learn this the hard way. See also:
library(fortunes)
fortune(312)
fortune(343)
Take the simple function from @Carlos Cinelli and try to use it outside of any dplyr functions.
h = function(df, seed.){
df$seed.
}
h(dg, GENESEED)
NULL
It's those darn dollar signs. Now change the function to use [[ instead.
h2 = function(df, seed.){
df[[seed.]]
}
h2(dg, "GENESEED")
[1] 1 1 1 2 2 2
That's more like it, although you did have to put quotes around the variable name in the function.
So where does that leave your original function? You can go two ways. First, you could just change to [[ and use quotes around the variable name in your function.
ss_sampleseed = function(df, seed.){
set.seed(df[[seed.]])
sample_frac(df, 1)
}
dg %>%
group_by(GENESEED) %>%
ss_sampleseed("GENESEED")
Source: local data frame [6 x 2]
Groups: GENESEED
Gene GENESEED
1 CAMK1 1
2 CIDEC 1
3 ARPC4 1
4 CIDEC 2
5 CAMK1 2
6 ARPC4 2
The other option is to use deparse(substitute(seed.)) inside your function to allow for non-standard evaluation. You'll still need [[, though.
ss_sampleseed2 = function(df, seed.){
set.seed(df[[deparse(substitute(seed.))]])
sample_frac(df, 1)
}
dg %>%
group_by(GENESEED) %>%
ss_sampleseed2(GENESEED)
Source: local data frame [6 x 2]
Groups: GENESEED
Gene GENESEED
1 CAMK1 1
2 CIDEC 1
3 ARPC4 1
4 CIDEC 2
5 CAMK1 2
6 ARPC4 2
I get replicated results with either of these, although I didn't check if the seed is specifically set to what you want it to be.
I want to remove duplicate rows from a dataframe, for specific columns only. That can be obtained with distinct:
data <- tibble(a = c(1, 1, 2, 2), b = c(3, 3, 3, 4), z = c(5,4,5,5))
filtered_data <- data %>% distinct(a, b, .keep_all = T)
dim(filtered_data)
# [1] 3 3
This is (almost) what I need. Yet, my problem is that the column names I need to use with distinct will change. So I have a character vector gen that contains the names of the columns I want to use with the distinct function. They need to be unquoted to be useful in the pipe. I found suggestions to use as.name() or eval(parse()). This, however, gives me a different result:
gen <- c("a", "b")
filtered_data <- data %>% distinct(eval(parse(text = gen)), .keep_all = T)
dim(filtered_data)
# [1] 2 4
The eval seems to do something funny with the amount of times the data is filtered. (and, adds an extra column. I could live with that, though...) So, how to obtain a similar result, as if I had used a,b, but by using a variable instead?
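A possible explanation for the odd result, shown as a small base R sketch outside dplyr (the variables a and b below are stand-ins, not the tibble columns): parse() produces one expression per string, and eval() on such an expression vector returns only the value of the last one.

```r
a <- "value of a"
b <- "value of b"

# parse() turns each string into its own expression ...
exprs <- parse(text = c("a", "b"))
length(exprs)          # 2

# ... and eval() on an expression vector evaluates them in order and
# returns only the value of the last one
result <- eval(exprs)
result                 # "value of b"
```

If the same thing happens inside distinct(), the rows would be deduplicated on b alone, which would be consistent with getting 2 rows and one extra, oddly named column.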
additional information
I actually obtain gen by reading the column names of a data frame: gen <- colnames(data)[1:2]. The solution suggested by @gymbrane would be perfect if I had a way to transform gen into c(a, b). The whole point is to avoid hardcoding the column names. I tried things like gen <- noquote(gen), which does not give an error in the rm_dup_rows function suggested below, but it does give a different result, with the same sort of repeated filtering as I started with...
fixed
I think I got it working. It might be inelegant, and I'm not sure if every step is necessary for the result, but it seems to work by combining the function provided by @gymbrane below with ensym and quos in a for loop while adding to a list in GlobalEnv (edit: GlobalEnv isn't necessary):
unquote_string <- function(string) {
out <- list()
i <- 1
for (s in string) {
t <- ensym(s)
out[i] <-dplyr::quos(!!t)
i <- i+1
}
return(out)
}
gen_quo <- unquote_string(gen)
filtered_data <- rm_dup_rows(data, gen_quo)
dim(filtered_data)
# [1] 3 3
How about creating a function and using quosures? Perhaps something like this is what you are looking for:
rm_dup_rows <- function(data, ...){
vars = dplyr::quos(...)
data %>% distinct(!!! vars, .keep_all = T)
}
I believe this returns what you are asking for
rm_dup_rows(data = data, a, b)
# A tibble: 3 x 3
a b z
<dbl> <dbl> <dbl>
1 3 5
2 3 5
2 4 5
rm_dup_rows(data, b, z)
# A tibble: 3 x 3
a b z
<dbl> <dbl> <dbl>
1 3 5
1 3 4
2 4 5
Additional
You could modify rm_dup_rows just slightly and construct your vector with quos. Something like this:
rm_dup_rows <- function(data, vars){
data %>% distinct(!!! vars, .keep_all = T)
}
# quos your column name vector
gen <- quos(a,z)
rm_dup_rows(data, gen)
# A tibble: 3 x 3
a b z
<dbl> <dbl> <dbl>
1 3 5
1 3 4
2 3 5
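For the case where the column names only exist as strings, here is a minimal sketch under stated assumptions. The base R part uses duplicated() as a swapped-in technique, not distinct(); the dplyr part assumes dplyr and rlang are installed and converts the strings to symbols with rlang::syms() before splicing them in with !!!. The names filtered_base and filtered_dplyr are made up for illustration.

```r
data <- data.frame(a = c(1, 1, 2, 2), b = c(3, 3, 3, 4), z = c(5, 4, 5, 5))
gen <- colnames(data)[1:2]   # c("a", "b"), never hardcoded

# Base R: keep the first row of every distinct (a, b) combination
filtered_base <- data[!duplicated(data[gen]), , drop = FALSE]
dim(filtered_base)           # 3 3

# dplyr route, only run when the packages are available:
# turn the strings into symbols, then splice them into distinct()
if (requireNamespace("dplyr", quietly = TRUE) &&
    requireNamespace("rlang", quietly = TRUE)) {
  filtered_dplyr <- dplyr::distinct(data, !!!rlang::syms(gen), .keep_all = TRUE)
  print(dim(filtered_dplyr))
}
```

Either route avoids eval(parse()) and works directly on the character vector returned by colnames().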
I would like to use dplyr's mutate_if() function to convert list-columns to data-frame-columns, but run into a puzzling error when I try to do so. I am using dplyr 0.5.0, purrr 0.2.2, R 3.3.0.
The basic setup looks like this: I have a data frame d, some of whose columns are lists:
d <- dplyr::data_frame(
A = list(
list(list(x = "a", y = 1), list(x = "b", y = 2)),
list(list(x = "c", y = 3), list(x = "d", y = 4))
),
B = LETTERS[1:2]
)
I would like to convert the column of lists (in this case, d$A) to a column of data frames using the following function:
tblfy <- function(x) {
x %>%
purrr::transpose() %>%
purrr::simplify_all() %>%
dplyr::as_data_frame()
}
That is, I would like the list-column d$A to be replaced by the list lapply(d$A, tblfy), which is
[[1]]
# A tibble: 2 x 2
x y
<chr> <dbl>
1 a 1
2 b 2
[[2]]
# A tibble: 2 x 2
x y
<chr> <dbl>
1 c 3
2 d 4
Of course, in this simple case, I could just do a simple reassignment. The point, however, is that I would like to do this programmatically, ideally with dplyr, in a generally applicable way that could deal with any number of list-columns.
Here's where I stumble: When I try to convert the list-columns to data-frame-columns using the following application
d %>% dplyr::mutate_if(is.list, funs(tblfy))
I get an error message that I don't know how to interpret:
Error: Each variable must be named.
Problem variables: 1, 2
Why does mutate_if() fail? How can I properly apply it to get the desired result?
Remark
A commenter has pointed out that the function tblfy() should be vectorized. That is a reasonable suggestion. But, unless I have vectorized incorrectly, that does not seem to get at the root of the problem. Plugging in a vectorized version of tblfy(),
tblfy_vec <- Vectorize(tblfy)
into mutate_if() fails with the error
Error: wrong result size (4), expected 2 or 1
Update
After gaining some experience with purrr, I now find the following approach natural, if somewhat long-winded:
d %>%
map_if(is.list, ~ map(., ~ map_df(., identity))) %>%
as_data_frame()
This is more or less identical to @alistaire's solution, below, but uses map_if() and map() in place of mutate_if() and Vectorize(), respectively.
The original tblfy function errors out for me (even when its elements are chained directly), so let's rebuild it a bit, adding vectorization as well, which lets us avoid an otherwise-necessary prior rowwise() call:
tblfy <- Vectorize(function(x){x %>% purrr::map_df(identity) %>% list()})
Now we can use mutate_if nicely:
d %>% mutate_if(purrr::is_list, tblfy)
## Source: local data frame [2 x 2]
##
## A B
## <list> <chr>
## 1 <tbl_df [2,2]> A
## 2 <tbl_df [2,2]> B
...and if we unnest to see what's there,
d %>% mutate_if(purrr::is_list, tblfy) %>% tidyr::unnest()
## Source: local data frame [4 x 3]
##
## B x y
## <chr> <chr> <dbl>
## 1 A a 1
## 2 A b 2
## 3 B c 3
## 4 B d 4
A couple notes:
map_df(identity) seems to be more efficient at building a tibble than any of the alternative formulations. I know the identity call seems unnecessary, but most everything else breaks.
I'm not sure how widely useful tblfy will be, as it's somewhat dependent on the structure of the lists in the list column, which can vary enormously. If you have a lot with a similar structure, I suppose it's useful, though.
There may be a way to do this with pmap instead of Vectorize, but I can't get it to work with some cursory tries.
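For comparison, the same conversion can be sketched in base R without purrr or dplyr. This is only an illustration under assumptions: d is rebuilt as a plain data frame, and records_to_df is a hypothetical stand-in for tblfy() that relies on every record having the same fields.

```r
# Same shape as `d` in the question, built as a plain data frame
d <- data.frame(B = LETTERS[1:2], stringsAsFactors = FALSE)
d$A <- list(
  list(list(x = "a", y = 1), list(x = "b", y = 2)),
  list(list(x = "c", y = 3), list(x = "d", y = 4))
)

# Hypothetical base R counterpart of tblfy(): one row per record
records_to_df <- function(records) {
  rows <- lapply(records, as.data.frame, stringsAsFactors = FALSE)
  do.call(rbind, rows)
}

# Convert every list-column element by element; other columns are untouched
for (nm in names(d)) {
  if (is.list(d[[nm]])) {
    d[[nm]] <- lapply(d[[nm]], records_to_df)
  }
}

d$A[[2]]   # a 2-row data frame with columns x and y
```

The key point is the inner lapply(): the conversion function is applied to each element of the list-column rather than to the column as a whole, which is also what Vectorize() and map() achieve in the answers above.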
In-place conversion without any copying:
library(data.table)
for (col in d) if (is.list(col)) lapply(col, setDF)
d
#Source: local data frame [2 x 2]
#
# A B
#1 <S3:data.frame> A
#2 <S3:data.frame> B
I am trying to group a column of my data.frame/data.table into three groups, all with equal sums.
The data is first ordered from smallest to largest, such that group one would be made up of a large number of rows with small values, and group three would have a small number of rows with large values. This is accomplished in spirit with:
test <- data.frame(x = as.numeric(1:100000))
store <- 0
total <- sum(test$x)
for(i in 1:100000){
store <- store + test$x[i]
if(store < total/3){
test$y[i] <- 1
} else {
if(store < 2*total/3){
test$y[i] <- 2
} else {
test$y[i] <- 3
}
}
}
While successful, I feel like there must be a better way (and maybe a very obvious solution that I am missing).
I never like resorting to loops, especially with nested ifs, when a vectorized approach is available; even with 100,000+ records this code becomes quite slow.
This method would become impossibly complex to code for a larger number of groups (not necessarily the looping, but the ifs).
Requires pre-ordering of the column. Might not be able to get around this one.
As a nuance (not that it makes a difference) but the data to be summed would not always (or ever) be consecutive integers.
Maybe with cumsum:
test$z <- cumsum(test$x) %/% (ceiling(sum(test$x) / 3)) + 1
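The cumulative-sum idea extends to any number of groups. The following is a minimal sketch, assuming non-negative values; the helper name equal_sum_groups is hypothetical. It follows the rule of the loop in the question (an element belongs to group 1 while the running total is below total/k, and so on), using findInterval() instead of nested ifs.

```r
# Hypothetical helper: label each element with a group 1..k so that the
# groups, taken in order, have approximately equal sums.
# Assumes x is non-negative; sort it first if the groups should run
# from many small values to few large ones.
equal_sum_groups <- function(x, k) {
  x <- as.numeric(x)                        # avoids integer overflow in cumsum()
  running <- cumsum(x)
  cut_points <- sum(x) * seq_len(k - 1) / k
  findInterval(running, cut_points) + 1L    # number of cut points reached, plus 1
}

test <- data.frame(x = as.numeric(1:100000))
test$grp <- equal_sum_groups(test$x, 3)
group_sums <- tapply(test$x, test$grp, sum)
group_sums
```

Since a group can only overshoot or undershoot its target by part of one element, each group sum should differ from sum(x)/k by at most the largest single value.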
This is more or less a bin-packing problem.
Use the binPack function from the BBmisc package:
library(BBmisc)
test$bins <- binPack(test$x, sum(test$x)/3+1)
The sums of the 3 bins are nearly identical:
tapply(test$x, test$bins, sum)
1 2 3
1666683334 1666683334 1666683332
I thought that the cumsum/integer-division approach was very elegant, but it does return a somewhat irregular allocation:
> tapply(test$x, test$z, sum)
1 2 3
1666636245 1666684180 1666729575
> sum(test$x)/3
[1] 1666683333
So I thought I would first create a random permutation and offer something similar:
test$x <- sample(test$x)
test$z2 <- cumsum(test$x)[ findInterval(cumsum(test$x),
c(0, 1666683333*(1:2), sum(test$x)+1))]
> tapply(test$x, test$z2, sum)
91099 116379 129539
1666676164 1666686837 1666686999
This also achieves a more even distribution of counts:
> table(test$z2)
91099 116379 129539
33245 33235 33520
> table(test$z)
1 2 3
57734 23915 18351
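I must admit to puzzlement regarding the naming of the entries in z2; it appears to come from using the findInterval() result as an index into cumsum(test$x), so each group is labeled with one of the first three cumulative sums rather than with 1, 2, 3.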
Or you can just cut on the cumsum
test$z <- cut(cumsum(test$x), breaks = 3, labels = 1:3)
or use ggplot2::cut_interval instead of cut:
test$z <- cut_interval(cumsum(test$x), n = 3, labels = 1:3)
You can use fold() from groupdata2 to create numerically balanced groups with an almost equal number of elements per group:
# Create data frame
test <- data.frame(x = as.numeric(1:100000))
# Use fold() to create 3 numerically balanced groups
test <- groupdata2::fold(test, k = 3, num_col = "x")
# Show the first 10 rows
head(test, 10)
## # A tibble: 10 x 2
## # Groups: .folds [3]
## x .folds
## <dbl> <fct>
## 1 1 1
## 2 2 3
## 3 3 2
## 4 4 1
## 5 5 2
## 6 6 2
## 7 7 1
## 8 8 3
## 9 9 2
## 10 10 3
# Check the sum and number of elements per group
library(dplyr)  # provides %>%
test %>%
dplyr::group_by(.folds) %>%
dplyr::summarize(sum_ = sum(x),
n_members = dplyr::n())
## # A tibble: 3 x 3
## .folds sum_ n_members
## <fct> <dbl> <int>
## 1 1 1666690952 33333
## 2 2 1666716667 33334
## 3 3 1666642381 33333
When I need to apply multiple functions to multiple columns sequentially and aggregate by multiple columns and want the results to be bound into a data frame I usually use aggregate() in the following manner:
# bogus functions
foo1 <- function(x){mean(x)*var(x)}
foo2 <- function(x){mean(x)/var(x)}
# for illustration purposes only
npk$block <- as.numeric(npk$block)
subdf <- aggregate(npk[,c("yield", "block")],
by = list(N = npk$N, P = npk$P),
FUN = function(x){c(col1 = foo1(x), col2 = foo2(x))})
Having the results in a nicely ordered data frame is achieved by using:
df <- do.call(data.frame, subdf)
Can I avoid the call to do.call() by somehow using aggregate() smarter in this scenario or shorten the whole process by using another base R solution from the start?
As @akrun suggested, dplyr's summarise_each is well-suited to the task.
library(dplyr)
npk %>%
group_by(N, P) %>%
summarise_each(funs(foo1, foo2), yield, block)
# Source: local data frame [4 x 6]
# Groups: N
#
# N P yield_foo2 block_foo2 yield_foo1 block_foo1
# 1 0 0 2.432390 1 1099.583 12.25
# 2 0 1 1.245831 1 2205.361 12.25
# 3 1 0 1.399998 1 2504.727 12.25
# 4 1 1 2.172399 1 1451.309 12.25
You can use
df=data.frame(as.list(aggregate(...)))
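To see why a flattening step is needed at all, here is a minimal sketch using the built-in npk data (copied to a local variable dat, a name chosen here so the original is left alone). When FUN returns more than one value, aggregate() stores the results for each variable in a single matrix column, and do.call(data.frame, ...) spreads those matrices into ordinary columns.

```r
foo1 <- function(x) mean(x) * var(x)
foo2 <- function(x) mean(x) / var(x)

dat <- datasets::npk
dat$block <- as.numeric(dat$block)

subdf <- aggregate(dat[, c("yield", "block")],
                   by = list(N = dat$N, P = dat$P),
                   FUN = function(x) c(col1 = foo1(x), col2 = foo2(x)))

# aggregate() keeps the two results per variable in one matrix column
dim(subdf)              # 4 4
is.matrix(subdf$yield)  # TRUE

# do.call(data.frame, ...) spreads each matrix into ordinary columns
flat <- do.call(data.frame, subdf)
dim(flat)               # 4 6
```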
So I have a bunch of data frames in a list object. Frames are organised such as
ID Category Value
2323 Friend 23.40
3434 Foe -4.00
And I got them into a list by following this topic. I can also run simple functions on them as shown in this topic.
Now I am trying to run a conditional function with lapply, and I'm running into trouble. In some tables the 'ID' column has a different name (say, 'recnum'), and I need to tell lapply to go through each data frame, check if there is a column named 'recnum', and change its name to 'ID', as in
colnr <- which(names(x) == "recnum")
if (length(colnr > 0)) {names(x)[colnr] <- "ID"}
But I'm running into trouble with local scope and who knows what. Any ideas?
Use the rename function from plyr; it renames by name, not position:
x <- data.frame(ID = 1:2,z=1:2)
y <- data.frame('recnum' = 1:2,z=3:4)
.list <- list(x,y)
library(plyr)
lapply(.list, rename, replace = c('recnum' = 'ID'))
[[1]]
ID z
1 1 1
2 2 2
[[2]]
ID z
1 1 3
2 2 4
Your original code works fine:
foo <- function(x){
colnr <- which(names(x) == "recnum")
if (length(colnr > 0)) {names(x)[colnr] <- "ID"}
x
}
.list <- list(x,y)
lapply(.list, foo)
Not sure what your problem was.
If you look at the second part of mnel's answer, you can see that the function foo evaluates x as its last expression. Without that, if you try to change the names of the data.frames in your list directly from within the anonymous function passed to lapply, it will likely not work.
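The difference can be sketched as follows, using the same x and y as above (the names frames, fixed and broken are made up for illustration). When the assignment is the last expression in the function, lapply() collects the value of the assignment, which is just the string "ID", instead of the renamed data frame.

```r
x <- data.frame(ID = 1:2, z = 1:2)
y <- data.frame(recnum = 1:2, z = 3:4)
frames <- list(x, y)

# Returns the renamed copy: the data frame is the last expression
fixed <- lapply(frames, function(df) {
  names(df)[names(df) == "recnum"] <- "ID"
  df
})

# Pitfall: the last expression is the assignment, whose value is just "ID"
broken <- lapply(frames, function(df) {
  names(df)[names(df) == "recnum"] <- "ID"
})

names(fixed[[2]])  # "ID" "z"
broken[[2]]        # "ID"
```

Note that neither call changes the data frames stored in frames; lapply() works on copies, so the result has to be assigned back to the list if the renamed versions are to be kept.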
Just as an alternative, you could use gsub and avoid loading an additional package (although plyr is a nice package):
xx <- list(data.frame("recnum" = 1:3, "recnum2" = 1:3),
data.frame("ID" = 4:6, "hat" = 4:6))
lapply(xx, function(x){
names(x) <- gsub("^recnum$", "ID", names(x))
return(x)
})
# [[1]]
# ID recnum2
# 1 1 1
# 2 2 2
# 3 3 3
# [[2]]
# ID hat
# 1 4 4
# 2 5 5
# 3 6 6