I have a data frame from which I want to extract rows that have a specific value in one column paired with another specific value in a second column. I want to go from df to df_final as in the code below:
library(dplyr)

x <- c(25,25,25,25,44,44,44,44,44,76,76)
y <- c(1,2,3,4,1,4,6,7,9,4,3)
df <- data.frame(x, y)

df1 <- df %>% filter(x == 25 & y == 2)
df2 <- df %>% filter(x == 44 & y == 6)
df3 <- df %>% filter(x == 76 & y == 3)
df_final <- rbind(df1, df2, df3)
# desired result:
print(df_final)
x y
1 25 2
2 44 6
3 76 3
I have tried using filter() with dplyr, but the two %in% conditions are evaluated independently, so any x in the set can pair with any y in the set, resulting in too many rows, as demonstrated below:
df_final <- df %>% filter(x %in% c(25,44,76) & y %in% c(2,6,3))
# non-desired result:
print(df_final)
x y
1 25 2
2 25 3
3 44 6
4 76 3
Is there a way to match pairwise using dplyr? That is, matching the first value in the x set with the first value in the y set, the second with the second, and so on... (Packages other than dplyr are also fine, of course.)
You can build a single condition string and then use eval() and parse():
library(purrr)
library(stringr)

x_cond <- c(25, 44, 76)
y_cond <- c(2, 6, 3)

# Build "x == 25 & y == 2 | x == 44 & y == 6 | ..." as one string
final_cond <- map2_chr(x_cond, y_cond, ~ str_c("x == ", .x, " & y == ", .y)) %>%
  str_c(collapse = " | ")

df %>% filter(eval(parse(text = final_cond)))
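If you would rather avoid eval()/parse(), a semi-join against a small lookup table of the valid pairs achieves the same pairwise matching in plain dplyr. A minimal sketch (pairs is just an illustrative name):
library(dplyr)

# Lookup table holding exactly the (x, y) pairs to keep
pairs <- data.frame(x = c(25, 44, 76), y = c(2, 6, 3))

# semi_join() keeps only the rows of df that match a row in pairs
df %>% semi_join(pairs, by = c("x", "y"))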
I'm trying to put a dplyr pipeline in a function, but after reading the programming vignette multiple times, as well as the tidy evaluation guide (https://tidyeval.tidyverse.org/dplyr.html), I still can't get it to work...
#Sample data:
dat <- read.table(text = "A ID B
1 X 83
2 X NA
3 X NA
4 Y NA
5 X 2
6 Y 2
12 Y 10
7 Y 18
8 Y 85", header = TRUE)
# What I'm trying to do (note: count() here appears to be plyr::count(), which returns a freq column):
x <- dat %>% filter(!is.na(B)) %>% count('ID') %>% filter(freq > 3)
x$ID
# Now in a function:
n_occurences <- function(df, n, column){
  # Group by ID and return IDs with number of non-NA > n in column
  column <- enquo(column)
  x <- df %>%
    filter(!is.na(!!column)) %>%
    count('ID') %>%
    filter(freq > n)
  x$ID
}
# Let's try:
col <- 'B'
n_occurences(dat, n=3, column = col)
There is no error, but the output is wrong. This has something to do with the tidy evaluation, but I just can't get my head around it.
With rlang 0.4.0, we can do this much more easily by using the {{...}} (curly-curly) operator:
library(rlang)
library(dplyr)
n_occurences <- function(df, n1, column){
  df %>%
    filter(!is.na({{column}})) %>%
    count(ID) %>%
    filter(n > n1) %>%
    pull(ID)
}
n_occurences(dat, n1 = 3, column = B)
#[1] Y
#Levels: X Y
If we intend to pass a quoted string, convert it to a symbol (sym()) and then unquote it (!!):
n_occurences <- function(df, n1, column){
  column <- rlang::sym(column)
  df %>%
    filter(!is.na(!!column)) %>%
    count(ID) %>%
    filter(n > n1) %>%
    pull(ID)
}
col <- 'B'
n_occurences(dat, n1=3, column = col)
#[1] Y
#Levels: X Y
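In recent versions of dplyr there is also the .data pronoun, which handles string input without the sym()/!! conversion. A minimal sketch of the same function, assuming a reasonably current dplyr:
n_occurences <- function(df, n1, column){
  df %>%
    filter(!is.na(.data[[column]])) %>%  # .data[[...]] looks the string up as a column
    count(ID) %>%
    filter(n > n1) %>%
    pull(ID)
}
col <- 'B'
n_occurences(dat, n1 = 3, column = col)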
I'm attempting to replace empty values in column z based on the values in column x.
I've used filter() to narrow down to the rows of interest and applied mutate() afterwards, but the mutated values are not written back to the original data frame. I can store the result as a new data frame, but merging afterwards would be a considerable headache, as this is happening across dozens of conditionals.
Make some dummy data:
xx <- data.frame(x = c(1,2,3), y = c("a","","c"), z = c(5,5,""))

xx %>%
  filter(x == 3) %>%                  # filter to the value of interest
  filter(z == "") %>%                 # filter to the empty values to be replaced
  mutate(z = replace(z, z == "", 5))  # replace the empty value
If I do:
xx <- xx %>%
  filter(x == 3) %>%                  # filter to the value of interest
  filter(z == "") %>%                 # filter to the empty values to be replaced
  mutate(z = replace(z, z == "", 5))  # replace the empty value
then only the single row is stored...
I'm looking for a way to keep all of the other data in the data frame while replacing the mutated values.
Feels like it should be a quick fix, but I've been stuck on it for a while.
You can use an ifelse() statement within dplyr::mutate().
library(dplyr)

df <- data.frame(x = sample(1:10, 100, replace = TRUE),
                 y = sample(c(NA, 1:5), 100, replace = TRUE))

df %>% mutate(y = ifelse(is.na(y), x, y))
x y
1 7 7
2 10 3
3 7 1
4 7 1
5 10 4
6 3 3
...
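Applied to the xx data from the question, the same pattern rewrites only the cells that match both conditions while leaving every other row intact. A sketch (note z is a character column because of the empty strings, so the replacement is "5"):
xx %>%
  mutate(z = ifelse(x == 3 & z == "", "5", z))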
I'm trying to apply a function over the rows of a data frame and return a value based on the value of each element in a column. I'd prefer to pass the whole data frame instead of naming each variable, as the actual code has many variables; this is a simple example.
I've tried purrr map_dbl and rowwise but can't get either to work. Any suggestions please?
# Sample df
df <- data.frame(Y = c("A","B","B","A","B"),
                 X = c(1,5,8,23,31))

# Required result
Res <- data.frame(Y = c("A","B","B","A","B"),
                  X = c(1,5,8,23,31),
                  NewVal = c(10,500,800,230,3100))
# Use mutate and map or rowwise etc.
Res <- df %>%
  mutate(NewVal = map_dbl(.x = ., .f = FnAdd(.)))

Res <- df %>%
  rowwise() %>%
  mutate(NewVal = FnAdd(.))
# Sample fn
FnAdd <- function(Data){
  if(Data$Y == "A"){
    X = Data$X * 10
  }
  if(Data$Y == "B"){
    X = Data$X * 100
  }
  return(X)
}
If there are multiple values, it is better to have a key/val dataset, join, and then do the multiplication:
keyVal <- data.frame(Y = c("A", "B"), NewVal = c(10, 100))

df %>%
  left_join(keyVal) %>%
  mutate(NewVal = X * NewVal)
# Y X NewVal
#1 A 1 10
#2 B 5 500
#3 B 8 800
#4 A 23 230
#5 B 31 3100
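If some Y values are missing from keyVal, the join leaves NewVal as NA; a coalesce() fallback keeps the original X in that case, mirroring the TRUE ~ X default used below. A sketch:
df %>%
  left_join(keyVal, by = "Y") %>%           # naming the key also silences the join message
  mutate(NewVal = coalesce(X * NewVal, X))  # fall back to X where there was no match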
It is not clear how many unique values there are in the 'Y' column of the actual dataset. If we have only a few values, then case_when can be used:
FnAdd <- function(Data){
  Data %>%
    mutate(NewVal = case_when(Y == "A" ~ X * 10,
                              Y == "B" ~ X * 100,
                              TRUE ~ X))
}
FnAdd(df)
FnAdd(df)
# Y X NewVal
#1 A 1 10
#2 B 5 500
#3 B 8 800
#4 A 23 230
#5 B 31 3100
You were originally looking for a solution using dplyr's rowwise() function, so here is that solution. The nice thing about this approach is that you don't need to create a separate function.
Here's the version using ifelse():
df %>%
  rowwise() %>%
  mutate(NewVal = ifelse(Y == "A", X * 10,
                         ifelse(Y == "B", X * 100, NA_real_)))  # ifelse() needs a third argument; NA covers any other Y
and here's the version using case_when():
df %>%
  rowwise() %>%
  mutate(NewVal = case_when(Y == "A" ~ X * 10,
                            Y == "B" ~ X * 100))
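Since both ifelse() and case_when() are vectorized, rowwise() is not strictly required here; it just makes the row-by-row intent explicit. A minimal sketch without it:
df %>%
  mutate(NewVal = case_when(Y == "A" ~ X * 10,
                            Y == "B" ~ X * 100))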
This question is similar to selecting the top N values within a group by column here.
However, I want to select the last N values per group, with N depending on the value of a corresponding count column. The count represents the number of occurrences of a specific name. If the count is 3 or more, I only want the last three entries, but if it is less than 3, I only want the last entry.
# Sample data
df <- data.frame(Name = c("x","x","x","x","y","y","y","z","z"),
                 Value = c(1,2,3,4,5,6,7,8,9))

# Obtain count for each name
count <- df %>%
  group_by(Name) %>%
  summarise(Count = n_distinct(Value))

# Merge dataframe with count
merge(df, count, by = c("Name"))
# Delete the first entry for x and the first entry for z
# Desired output
data.frame(Name = c("x","x","x","y","y","y","z"), Value = c(2,3,4,5,6,7,9))
Another dplyrish way:
df %>%
  group_by(Name) %>%
  slice(tail(row_number(),
             if (n_distinct(Value) < 3) 1 else 3))  # last 1 row if fewer than 3 distinct values, else last 3
# A tibble: 7 x 2
# Groups: Name [3]
Name Value
<fctr> <dbl>
1 x 2
2 x 3
3 x 4
4 y 5
5 y 6
6 y 7
7 z 9
The analogue in data.table is...
library(data.table)
setDT(df)
df[, tail(.SD, if (uniqueN(Value) < 3) 1 else 3), by=Name]
The closest thing in base R is...
with(df, {
  len = tapply(Value, Name, FUN = length)
  nv = tapply(Value, Name, FUN = function(x) length(unique(x)))
  df[sequence(len) > rep(nv - ifelse(nv < 3, 1, 3), len), ]
})
... which is way more difficult to come up with than it should be.
Another possibility:
library(tidyverse)

df %>%
  split(.$Name) %>%
  map_df(~ if (n_distinct(.x) >= 3) tail(.x, 3) else tail(.x, 1))  # n_distinct() on the sub-data-frame counts its distinct rows
Which gives:
# Name Value
#1 x 2
#2 x 3
#3 x 4
#4 y 5
#5 y 6
#6 y 7
#7 z 9
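A more recent tidyverse spelling of the same idea uses dplyr::group_split() instead of base split(); a sketch, assuming dplyr >= 0.8:
df %>%
  group_split(Name) %>%
  map_df(~ if (n_distinct(.x$Value) >= 3) tail(.x, 3) else tail(.x, 1))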
In base R, split the df by df$Name first. Then, for each subgroup, check number of rows and extract last 3 or last 1 row conditionally.
do.call(rbind, lapply(split(df, df$Name), function(a)
  a[tail(sequence(NROW(a)), c(3,1)[(NROW(a) < 3) + 1]), ]))  # c(3,1)[...] picks 3 when NROW(a) >= 3, else 1
Or
do.call(rbind, lapply(split(df, df$Name), function(a)
  a[tail(sequence(NROW(a)), ifelse(NROW(a) < 3, 1, 3)), ]))
# Name Value
#x.2 x 2
#x.3 x 3
#x.4 x 4
#y.5 y 5
#y.6 y 6
#y.7 y 7
#z z 9
For three conditional values:
do.call(rbind, lapply(split(df, df$Name), function(a)
  a[tail(sequence(NROW(a)), ifelse(NROW(a) >= 6, 6, ifelse(NROW(a) >= 3, 3, 1))), ]))
If you're already using dplyr, the natural approach is:
library(dplyr)

# Sample data
df <- data.frame(Name = c("x","x","x","x","y","y","y","z","z"),
                 Value = c(1,2,3,4,5,6,7,8,9))

df %>%
  group_by(Name) %>%
  mutate(Count = n_distinct(Value),
         Rank = dense_rank(desc(Value))) %>%
  filter((Count >= 3 & Rank <= 3) | (Rank == 1)) %>%
  select(-c(Count, Rank))
There's no need for a merge since you are just counting and ranking on groups defined by Name. Then, you apply a filter on your count and rank requirements, and (optionally, for clean-up) drop the counts and ranks.
Starting with data containing multiple observations for each group, like this:
set.seed(1)
my.df <- data.frame(
  timepoint = rep(c(0, 1, 2), each = 3),
  counts = round(rnorm(9, 50, 10), 0)
)
> my.df
timepoint counts
1 0 44
2 0 52
3 0 42
4 1 66
5 1 53
6 1 42
7 2 55
8 2 57
9 2 56
To perform a summary calculation at each timepoint relative to timepoint == 0, for each group I need to pass a vector of counts for timepoint == 0 and a vector of counts for the group (e.g. timepoint == 1) to an arbitrary function, e.g.
NonsenseFunction <- function(x, y){
  (mean(x) - mean(y)) / (1 - mean(y))
}
I can get the required output from this table, either with dplyr:
library(dplyr)

my.df %>%
  group_by(timepoint) %>%
  mutate(rep = paste0("r", 1:n())) %>%
  left_join(x = ., y = filter(., timepoint == 0), by = "rep") %>%
  group_by(timepoint.x) %>%
  summarise(result = NonsenseFunction(counts.x, counts.y))
or data.table:
library(data.table)
my.dt <- data.table(my.df)
my.dt[, rep := paste0("r", 1:length(counts)), by = timepoint]
merge(my.dt, my.dt[timepoint == 0], by = "rep", all = TRUE)[
, NonsenseFunction(counts.x, counts.y), by = timepoint.x]
This only works if the number of observations between groups is the same. Anyway, the observations aren't matched, so using the temporary rep variable seems hacky.
For a more general case, where I need to pass vectors of the baseline values and the group's values to an arbitrary (more complicated) function, is there an idiomatic data.table or dplyr way of doing so with a grouped operation for all groups?
Here's the straightforward data.table approach (writing f for your arbitrary function, NonsenseFunction above):
my.dt[, f(counts, my.dt[timepoint==0, counts]), by=timepoint]
This probably grabs my.dt[timepoint==0, counts] again and again, for each group. You could instead save that value ahead of time:
v = my.dt[timepoint==0, counts]
my.dt[, f(counts, v), by=timepoint]
... or if you don't want to add v to the environment, maybe
with(list(v = my.dt[timepoint==0, counts]),
my.dt[, f(counts, v), by=timepoint]
)
You can supply the baseline vector (the counts at timepoint == 0) as the second argument, treating it as a constant:
my.df %>%
  group_by(timepoint) %>%
  mutate(response = NonsenseFunction(counts, my.df$counts[my.df$timepoint == 0]))
Or if you want to create it beforehand:
constant <- my.df$counts[my.df$timepoint == 0]

my.df %>%
  group_by(timepoint) %>%
  mutate(response = NonsenseFunction(counts, constant))
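The same baseline can also be extracted with dplyr verbs, which reads more consistently inside a tidyverse script; a minimal sketch:
constant <- my.df %>%
  filter(timepoint == 0) %>%
  pull(counts)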
You can try:
library(dplyr)
my.df %>%
  mutate(new = mean(counts[timepoint == 0])) %>%
  group_by(timepoint) %>%
  summarise(result = NonsenseFunction(counts, new))
# A tibble: 3 × 2
# timepoint result
# <dbl> <dbl>
#1 0 0.0000000
#2 1 0.1398601
#3 2 0.2097902