I want to remove duplicate rows from a dataframe, for specific columns only. That can be obtained with distinct:
data <- tibble(a = c(1, 1, 2, 2), b = c(3, 3, 3, 4), z = c(5,4,5,5))
filtered_data <- data %>% distinct(a, b, .keep_all = T)
dim(filtered_data)
# [1] 3 3
This is (almost) what I need. My problem is that the column names I need to pass to distinct will change. So I have a character vector gen that contains the names of the columns I want to use with the distinct function. They need to be unquoted to be useful in the pipe. I found suggestions to use as.name() or eval(parse()). This, however, gives me a different result:
gen <- c("a", "b")
filtered_data <- data %>% distinct(eval(parse(text = gen)), .keep_all = T)
dim(filtered_data)
# [1] 2 4
The eval seems to do something odd with the number of times the data is filtered (and it adds an extra column, though I could live with that). So, how do I obtain the same result as if I had used a, b, but by using a variable instead?
additional information
I actually obtain gen by reading the column names of a dataframe: gen <- colnames(data)[1:2]. The solution suggested by @gymbrane would be perfect if I had a way to transform gen into c(a, b). The whole point is to avoid hardcoding the column names. I tried things like gen <- noquote(gen), which does not give an error in the rm_dup_rows function suggested below, but it does give a different result, with the same sort of repeated filtering as I started with...
fixed
I think I got it working. It might be inelegant, and I'm not sure if every step is necessary for the result, but it seems to work by combining the function provided by @gymbrane below with ensym and quos in a for loop while adding to a list in GlobalEnv (edit: GlobalEnv isn't necessary):
unquote_string <- function(string) {
  out <- list()
  i <- 1
  for (s in string) {
    t <- ensym(s)
    out[i] <- dplyr::quos(!!t)
    i <- i + 1
  }
  return(out)
}
gen_quo <- unquote_string(gen)
filtered_data <- rm_dup_rows(data, gen_quo)
dim(filtered_data)
# [1] 3 3
How about creating a function and using quosures? Perhaps something like this is what you are looking for...
rm_dup_rows <- function(data, ...){
  vars <- dplyr::quos(...)
  data %>% distinct(!!! vars, .keep_all = TRUE)
}
I believe this returns what you are asking for
rm_dup_rows(data = data, a, b)
# A tibble: 3 x 3
a b z
<dbl> <dbl> <dbl>
1 3 5
2 3 5
2 4 5
rm_dup_rows(data, b, z)
# A tibble: 3 x 3
a b z
<dbl> <dbl> <dbl>
1 3 5
1 3 4
2 4 5
Additional
You could modify rm_dup_rows just slightly and construct your vector with quos. Something like this...
rm_dup_rows <- function(data, vars){
  data %>% distinct(!!! vars, .keep_all = TRUE)
}
# quos your column name vector
gen <- quos(a,z)
rm_dup_rows(data, gen)
# A tibble: 3 x 3
a b z
<dbl> <dbl> <dbl>
1 3 5
1 3 4
2 3 5
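If the column names arrive as a character vector, as with gen <- colnames(data)[1:2] in the question, one possible route is to convert the strings to symbols with rlang::syms() and splice them into distinct() with !!!. The following is a minimal sketch, assuming a dplyr version that supports tidy evaluation (0.7 or later) and that rlang is installed; the base R line at the end is only there as a cross-check and is not part of the answers above.

```r
library(dplyr)

data <- tibble(a = c(1, 1, 2, 2), b = c(3, 3, 3, 4), z = c(5, 4, 5, 5))

# Column names held as strings, e.g. taken from colnames()
gen <- colnames(data)[1:2]

# rlang::syms() turns each string into a symbol; !!! splices them into distinct()
filtered_data <- data %>% distinct(!!!rlang::syms(gen), .keep_all = TRUE)
dim(filtered_data)
# [1] 3 3

# Base R cross-check: keep rows whose values in the `gen` columns are not repeats
base_filtered <- data[!duplicated(data[gen]), ]
```

Nothing is hardcoded here, so changing gen changes the columns that define a duplicate.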
Related
I have a function which needs a long time to run. So, I want to know how many rows of my data frame are processed. Usually, we can define a variable in for loop to deal with this easily. But I do not know how to do it in dplyr.
Let's say the code is:
library(tidyverse)
myFUN <- function(x) {
  x + 1
}
a <- tibble(id=c(1:3),x=c(3,5,1))
a1 <- a %>%
rowwise() %>%
mutate(y=myFUN(x))
I hope that somewhere in the code I can define a variable i whose value increases by 1 every time a row is processed and is then printed to the console, like:
1
2
3
Can you pass another variable to the function, which would be the row number of the dataframe, and print it in the function? Something like:
myFUN <- function(x, y) {
  message(y)
  x + 1
}
and then use
library(dplyr)
a %>% mutate(y = purrr::map2_dbl(x, row_number(), myFUN))
#1
#2
#3
# A tibble: 3 x 3
# id x y
# <int> <dbl> <dbl>
#1 1 3 4
#2 2 5 6
#3 3 1 2
If your function is vectorized, you can drop map2_dbl and do
a %>% mutate(y= myFUN(x, seq_len(n())))
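If changing the signature of myFUN is not an option, the counter can also live in a wrapper. The sketch below uses only base R; with_counter is a hypothetical helper name, not something from dplyr or purrr. It keeps a call count in a closure and prints it with message() before delegating to the wrapped function.

```r
# Hypothetical helper: wraps f so that each call reports how many calls have run
with_counter <- function(f) {
  i <- 0
  function(...) {
    i <<- i + 1   # update the counter stored in the enclosing environment
    message(i)    # progress goes to stderr, like message() in the answer above
    f(...)
  }
}

myFUN <- function(x) x + 1
a <- data.frame(id = 1:3, x = c(3, 5, 1))

counted <- with_counter(myFUN)
# One call per element, so the counter advances once per row
a$y <- vapply(a$x, counted, numeric(1))
```

The count only tracks rows when the function is called once per row; a single vectorized call would advance it by one.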
I have a dataset which contains a categorical variable. Depending on the value of this variable, I want to run a different function for each such value. All the possible functions have the same return type. I might wish to run say, sin() if category is 'A', cos() if category is 'B', and tan() if category is 'C'.
The real application for this is in simulating populations, where outcomes depend on the values of categories, but sometimes in very different ways.
Toy example
library(dplyr)
category=c('A','B','C')
N <- 100
pop <- data.frame(ID = seq_len(N))
pop <- as_tibble(pop)
pop$Category <- sample(category,N,replace=TRUE)
pop$score <- runif(N)
pop
tf <- function(x, EXPR) {
  switch(EXPR,
    A = cos(x),
    B = sin(x),
    C = tan(x)
  )
}
pop$results <- tf(pop$score, pop$Category)
This code fails, reasonably enough, with the error message
Error in switch(EXPR, A = cos(x), B = sin(x), C = tan(x)) : EXPR must be a length 1 vector
I have looked, carefully, at dplyr and do, and I can easily see how to run the same function for each category separately. However, I need a function which depends on the category value.
Suggestions greatly appreciated.
The rowwise function is what you need to force it to evaluate row by row...
pop<-data.frame(ID=1:100,
category = sample(c("A", "B", "C"),100,replace=TRUE),
score = runif(100))
exprs <- function(category, score){
  if(category=="A")
    ret <- sin(score)
  if(category=="B")
    ret <- cos(score)
  if(category=="C")
    ret <- tan(score)
  ret
}
pop %>%
rowwise %>%
mutate(answer = exprs(category, score))
Source: local data frame [100 x 4]
Groups:
# A tibble: 100 × 4
ID category score answer
<int> <fctr> <dbl> <dbl>
1 1 C 0.5219332 0.5751317
2 2 C 0.9266336 1.3314972
3 3 B 0.2729260 0.9629863
4 4 B 0.6575110 0.7915158
5 5 B 0.0910481 0.9958580
6 6 C 0.9968752 1.5467554
7 7 A 0.3429183 0.3362369
8 8 A 0.9101669 0.7896062
9 9 B 0.9291849 0.5984872
10 10 C 0.8913347 1.2379742
# ... with 90 more rows
You can use Vectorize():
set.seed(42)
category=c('A','B','C')
N <- 10
pop <- data.frame(ID=seq(1:N), Category=sample(category,N,replace=TRUE), score=runif(N), stringsAsFactors = FALSE)
tf <- function(x, EXPR) switch(EXPR,
'A' = cos(x),
'B' = sin(x),
'C' = tan(x))
TF <- Vectorize(tf)
pop$result <- TF(pop$score, pop$Category)
or (thx to @42 for the comment)
pop$result <- mapply(tf, pop$score, pop$Category)
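Because cos(), sin() and tan() are themselves vectorized, another option is to call each function once on the rows of its own category instead of once per row. This is a rough sketch in base R under that assumption; fun_by_category is a hypothetical lookup list, and the mapping follows tf() above (A = cos, B = sin, C = tan).

```r
pop <- data.frame(
  ID = 1:6,
  Category = c("A", "B", "C", "A", "B", "C"),
  score = c(0.1, 0.2, 0.3, 0.4, 0.5, 0.6),
  stringsAsFactors = FALSE
)

# One function per category value
fun_by_category <- list(A = cos, B = sin, C = tan)

pop$result <- NA_real_
for (k in names(fun_by_category)) {
  idx <- pop$Category == k
  # Each function is called once on the whole subset rather than once per row
  pop$result[idx] <- fun_by_category[[k]](pop$score[idx])
}
```

Rows whose category has no entry in the list keep NA, which makes unexpected category values easy to spot.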
The error appears because you are passing the complete vector instead of one record at a time. I used lapply to call your function for each row, and it works:
library(dplyr)
category=c('A','B','C')
N <- 100
pop <- data.frame(ID = seq(1:N))
pop$Category <- sample(category,N,replace=TRUE)
pop$Category <- as.factor(pop$Category)
pop$score <- runif(N)
tf <- function(x, EXPR) {
  switch(EXPR,
    A = cos(x),
    B = sin(x),
    C = tan(x)
  )
}
## call tf for every row in the dataframe
pop$results <- lapply(seq_len(nrow(pop)), function(i) {
  # switch() matches names on a string, so convert the factor value first
  tf(pop$score[i], as.character(pop$Category[i]))
}) %>% unlist
Thanks
I would like to use dplyr's mutate_if() function to convert list-columns to data-frame-columns, but run into a puzzling error when I try to do so. I am using dplyr 0.5.0, purrr 0.2.2, R 3.3.0.
The basic setup looks like this: I have a data frame d, some of whose columns are lists:
d <- dplyr::data_frame(
A = list(
list(list(x = "a", y = 1), list(x = "b", y = 2)),
list(list(x = "c", y = 3), list(x = "d", y = 4))
),
B = LETTERS[1:2]
)
I would like to convert the column of lists (in this case, d$A) to a column of data frames using the following function:
tblfy <- function(x) {
  x %>%
    purrr::transpose() %>%
    purrr::simplify_all() %>%
    dplyr::as_data_frame()
}
That is, I would like the list-column d$A to be replaced by the list lapply(d$A, tblfy), which is
[[1]]
# A tibble: 2 x 2
x y
<chr> <dbl>
1 a 1
2 b 2
[[2]]
# A tibble: 2 x 2
x y
<chr> <dbl>
1 c 3
2 d 4
Of course, in this simple case, I could just do a simple reassignment. The point, however, is that I would like to do this programmatically, ideally with dplyr, in a generally applicable way that could deal with any number of list-columns.
Here's where I stumble: When I try to convert the list-columns to data-frame-columns using the following application
d %>% dplyr::mutate_if(is.list, funs(tblfy))
I get an error message that I don't know how to interpret:
Error: Each variable must be named.
Problem variables: 1, 2
Why does mutate_if() fail? How can I properly apply it to get the desired result?
Remark
A commenter has pointed out that the function tblfy() should be vectorized. That is a reasonable suggestion. But — unless I have vectorized incorrectly — that does not seem to get at the root of the problem. Plugging in a vectorized version of tblfy(),
tblfy_vec <- Vectorize(tblfy)
into mutate_if() fails with the error
Error: wrong result size (4), expected 2 or 1
Update
After gaining some experience with purrr, I now find the following approach natural, if somewhat long-winded:
d %>%
map_if(is.list, ~ map(., ~ map_df(., identity))) %>%
as_data_frame()
This is more or less identical to @alistaire's solution below, but uses map_if() and map() in place of mutate_if() and Vectorize(), respectively.
The original tblfy function errors out for me (even when its elements are chained directly), so let's rebuild it a bit, adding vectorization as well, which lets us avoid an otherwise-necessary prior rowwise() call:
tblfy <- Vectorize(function(x){x %>% purrr::map_df(identity) %>% list()})
Now we can use mutate_if nicely:
d %>% mutate_if(purrr::is_list, tblfy)
## Source: local data frame [2 x 2]
##
## A B
## <list> <chr>
## 1 <tbl_df [2,2]> A
## 2 <tbl_df [2,2]> B
...and if we unnest to see what's there,
d %>% mutate_if(purrr::is_list, tblfy) %>% tidyr::unnest()
## Source: local data frame [4 x 3]
##
## B x y
## <chr> <chr> <dbl>
## 1 A a 1
## 2 A b 2
## 3 B c 3
## 4 B d 4
A couple notes:
map_df(identity) seems to be more efficient at building a tibble than any of the alternative formulations. I know the identity call seems unnecessary, but most everything else breaks.
I'm not sure how widely useful tblfy will be, as it's somewhat dependent on the structure of the lists in the list column, which can vary enormously. If you have a lot with a similar structure, I suppose it's useful, though.
There may be a way to do this with pmap instead of Vectorize, but I can't get it to work with some cursory tries.
In-place conversion without any copying:
library(data.table)
for (col in d) if (is.list(col)) lapply(col, setDF)
d
#Source: local data frame [2 x 2]
#
# A B
#1 <S3:data.frame> A
#2 <S3:data.frame> B
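For comparison, the same conversion can be sketched without purrr or data.table. The version below is an illustration in base R only, assuming every cell of a list-column is a list of records with the same fields; records_to_df is a hypothetical helper name. It detects list-columns with is.list(), as in the question, and replaces each cell with a data frame.

```r
d <- data.frame(B = LETTERS[1:2], stringsAsFactors = FALSE)
d$A <- list(
  list(list(x = "a", y = 1), list(x = "b", y = 2)),
  list(list(x = "c", y = 3), list(x = "d", y = 4))
)

# Hypothetical helper: turn one cell (a list of records) into a data frame
records_to_df <- function(records) {
  rows <- lapply(records, as.data.frame, stringsAsFactors = FALSE)
  do.call(rbind, rows)
}

# Apply the helper to every cell of every list-column
for (nm in names(d)[vapply(d, is.list, logical(1))]) {
  d[[nm]] <- lapply(d[[nm]], records_to_df)
}
```

Unlike the setDF approach, this copies the data, so it is meant as a readable reference rather than an efficient one.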
I have a grouped data frame, in which the grouping variable is SEED. I want to take the groups defined by the values of SEED, set the seed to the value of SEED for each group, and then shuffle the rows of each group using dplyr::sample_frac. However, I cannot replicate my results, which indicates that the seed isn't being set correctly.
To do this in a dplyr-ish way, I wrote the following function:
library(dplyr)
ss_sampleseed <- function(df, seed.){
  set.seed(df$seed.)
  sample_frac(df, 1)
}
I then use this function on my data:
dg <- structure(list(Gene = c("CAMK1", "ARPC4", "CIDEC", "CAMK1", "ARPC4",
"CIDEC"), GENESEED = c(1, 1, 1, 2, 2, 2)), class = c("tbl_df",
"tbl", "data.frame"), row.names = c(NA, -6L), .Names = c("Gene",
"GENESEED"))
dg2 <- dg %>%
  group_by(GENESEED) %>%
  ss_sampleseed(GENESEED)
dg2
Source: local data frame [6 x 2]
Groups: GENESEED
Gene GENESEED
1 ARPC4 1
2 CIDEC 1
3 CAMK1 1
4 CIDEC 2
5 ARPC4 2
6 CAMK1 2
However, when I repeat the above code, I cannot replicate my results.
dg2
Source: local data frame [6 x 2]
Groups: GENESEED
Gene GENESEED
1 ARPC4 1
2 CAMK1 1
3 CIDEC 1
4 CAMK1 2
5 ARPC4 2
6 CIDEC 2
The problem here is that the dollar sign does not substitute the parameter you are passing; it takes the name after $ literally. See this minimal example:
df <- data.frame(x = "x", GENESEED = "GENESEED")
h <- function(df, x){
  df$x
}
h(df, GENESEED)
[1] x
Levels: x
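See that h returns x even though you asked for GENESEED. So your function is actually trying to get df$seed., which does not exist, so it returns NULL. Calling set.seed(NULL) re-initializes the generator instead of fixing it, which is why the results differ between runs.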
But there is another problem. Even correcting this and passing directly the seed, it seems that it would not work as you want, because, if you look at the code of sample_frac, dplyr will eventually run the following line:
sampled <- lapply(index, sample_group, frac = TRUE, tbl = tbl,
size = size, replace = replace, weight = weight, .env = .env)
Notice that the lapply over the groups runs after you set the seed, so the seed is set once for the whole call rather than separately for each group according to GENESEED, as you wanted.
Taking this into consideration, I came up with this solution, using sample.int and do:
ss_sampleseed <- function(x){
  set.seed(unique(x$GENESEED))
  x[sample.int(nrow(x)), ]
}
dg %>% group_by(GENESEED) %>% do(ss_sampleseed(.))
This seems to be working as you want.
I think the main thing going on here is the use of $ inside your function. I certainly had to learn this the hard way. See also:
library(fortunes)
fortune(312)
fortune(343)
Take the simple function from @Carlos Cinelli and try to use it outside of any dplyr functions.
h = function(df, seed.){
  df$seed.
}
h(dg, GENESEED)
NULL
It's those darn dollar signs. Now change the function to use [[ instead.
h2 = function(df, seed.){
  df[[seed.]]
}
h2(dg, "GENESEED")
[1] 1 1 1 2 2 2
That's more like it, although you did have to put quotes around the variable name in the function.
So where does that leave your original function? You can go two ways. First, you could just change to [[ and use quotes around the variable name in your function.
ss_sampleseed = function(df, seed.){
  set.seed(df[[seed.]])
  sample_frac(df, 1)
}
dg %>%
group_by(GENESEED) %>%
ss_sampleseed("GENESEED")
Source: local data frame [6 x 2]
Groups: GENESEED
Gene GENESEED
1 CAMK1 1
2 CIDEC 1
3 ARPC4 1
4 CIDEC 2
5 CAMK1 2
6 ARPC4 2
The other option is to use deparse(substitute(seed.)) inside your function to allow for non-standard evaluation. You'll still need [[, though.
ss_sampleseed2 = function(df, seed.){
  set.seed(df[[deparse(substitute(seed.))]])
  sample_frac(df, 1)
}
dg %>%
group_by(GENESEED) %>%
ss_sampleseed2(GENESEED)
Source: local data frame [6 x 2]
Groups: GENESEED
Gene GENESEED
1 CAMK1 1
2 CIDEC 1
3 ARPC4 1
4 CIDEC 2
5 CAMK1 2
6 ARPC4 2
I get replicated results with either of these, although I didn't check if the seed is specifically set to what you want it to be.
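To make the per-group seeding explicit, here is a small sketch in base R that sets the seed inside each group before shuffling it. It assumes the seed column holds a single value per group; shuffle_by_seed is a hypothetical helper name. Note that set.seed() changes the global random number state as a side effect.

```r
dg <- data.frame(
  Gene = c("CAMK1", "ARPC4", "CIDEC", "CAMK1", "ARPC4", "CIDEC"),
  GENESEED = c(1, 1, 1, 2, 2, 2),
  stringsAsFactors = FALSE
)

shuffle_by_seed <- function(df, seed_col) {
  groups <- split(df, df[[seed_col]])
  shuffled <- lapply(groups, function(g) {
    set.seed(g[[seed_col]][1])              # seed taken from this group's own value
    g[sample.int(nrow(g)), , drop = FALSE]  # reorder the rows of this group only
  })
  do.call(rbind, shuffled)
}

first_run  <- shuffle_by_seed(dg, "GENESEED")
second_run <- shuffle_by_seed(dg, "GENESEED")
identical(first_run, second_run)
# [1] TRUE
```

The column name is passed as a string and looked up with [[, which avoids the $ problem described above.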
So I have a bunch of data frames in a list object. Frames are organised such as
ID Category Value
2323 Friend 23.40
3434 Foe -4.00
And I got them into a list by following this topic. I can also run simple functions on them as shown in this topic.
Now I am trying to run a conditional function with lapply, and I'm running into trouble. In some tables the 'ID' column has a different name (say, 'recnum'), and I need to tell lapply to go through each data frame, check if there is a column named 'recnum', and change its name to 'ID', as in
colnr <- which(names(x) == "recnum")
if (length(colnr) > 0) {names(x)[colnr] <- "ID"}
But I'm running into trouble with local scope and who knows what. Any ideas?
Use the rename function from plyr; it renames by name, not position:
x <- data.frame(ID = 1:2,z=1:2)
y <- data.frame('recnum' = 1:2,z=3:4)
.list <- list(x,y)
library(plyr)
lapply(.list, rename, replace = c('recnum' = 'ID'))
[[1]]
ID z
1 1 1
2 2 2
[[2]]
ID z
1 1 3
2 2 4
Your original code works fine:
foo <- function(x){
  colnr <- which(names(x) == "recnum")
  if (length(colnr) > 0) {names(x)[colnr] <- "ID"}
  x
}
.list <- list(x,y)
lapply(.list, foo)
Not sure what your problem was.
If you look at the second part of mnel's answer, you can see that the function foo evaluates x as its last expression. Without that, if you try to change the names of the data.frames in your list directly from within the anonymous function passed to lapply, it will likely not work.
Just as an alternative, you could use gsub and avoid loading an additional package (although plyr is a nice package):
xx <- list(data.frame("recnum" = 1:3, "recnum2" = 1:3),
data.frame("ID" = 4:6, "hat" = 4:6))
lapply(xx, function(x){
  names(x) <- gsub("^recnum$", "ID", names(x))
  return(x)
})
# [[1]]
# ID recnum2
# 1 1 1
# 2 2 2
# 3 3 3
# [[2]]
# ID hat
# 1 4 4
# 2 5 5
# 3 6 6