Filtering data based on multiple variables - r

I am trying to create a new column based other column criteria where my data looks like the following:
ID Column 1 Column 2 Column 3
1 2 Y "2013-10-22T10:09"
1 2 Y "2013-10-23T10:09"
2 3 N "2013-10-23T10:09"
3 0 Y "2013-10-23T10:09"
For each ID, I would like to keep only the earliest date/time as long as column 1 is greater than 0 and column 2 is not N. The results would look like this:
ID Column 1 Column 2 Column 3 Column 4
1 2 Y "2013-10-22T10:09" 2013-10-22
I currently tried this but I was wondering how to do it and if there is an elegant way of doing it:
library(dplyr)
ifelse(Column 1 >0 and Column 2 !="N",
(new %>%
group_by(ID) %>%
arrange(Column 3) %>%
slice(1L)))
Column 4 <- as.Date(Column 3, format='%Y-%m-%dT%H:%M')

library(dplyr)
df %>%
filter(Column1 > 0 & Column2 != 'N') %>% # filter out non-matching rows
group_by(ID) %>%
top_n(-1, Column3) %>% # select only the row with the earliest date-time
mutate(Date = as.Date(Column3)) # create date column
#
# # A tibble: 1 x 5
# # Groups: ID [1]
# ID Column1 Column2 Column3 Date
# <int> <int> <chr> <chr> <date>
# 1 1 2 Y 2013-10-22T10:09 2013-10-22

rm(list = ls())
df <- data.frame(id = c(1,1,2,3),column_1 = c(2,2,3,0),
column_2 = c("Y","Y","N","Y"),
column_3 = as.Date(c("2013-10-22","2013-10-23","2013-10-23","2013-10-23"),format = "%Y-%m-%d"))
n <- unique(df$id)
datalist <- list()
for(i in 1:n)
{
z <- df[df$id == i & df$column_1 > 0 & df$column_2 != "N" & df$column_3 == min(df$column_3),]
datalist[[i]] <- z
}
do.call(rbind,datalist)
This function will help you.
But the constraints for each column were made constant.
You can change it as per your convenience.
Thanks

Related

summarize_all rows by grouping and define which value should be kept

I have a data frame in which several data sources are merged. This creates rows with the same id. Now I want to define which values from which row should be kept.
So far I have been using dplyr with group_by and summarize all to keep the first value if it is not NA.
Here's an example:
# function f for summarizing
f <- function(x) {
x <- na.omit(x)
if (length(x) > 0) first(x) else NA
}
# test data
test <- data.frame(id = c(1,2,1,2), value1 = c("a",NA,"b","c"), value2 = c(0:4))
id value1 value2
1 a 0
2 <NA> 1
1 b 2
2 c 3
The following result is obtained when merging
test <- test %>% group_by(id) %>% summarise_all(funs(f))
id value1 value2
1 a 0
2 c 1
Now the question: that NA (na.omit) be replaced already works, but how can I define that not the numerical value 0, but the value not equal to 0 is accepted. So the expected result looks like this:
id value1 value2
1 a 2
2 c 1
You can just modify your f function by subsetting the vector where it is different from zero
f <- function(x) {
x <- na.omit(x)
x <- x[x != 0]
if (length(x) > 0) first(x) else NA
}
Sidenote: as of dplyr 0.8.0, funs is deprecated. You should a lambda, a list of functions or a list of lambdas. In this case I used a single lambda:
test %>%
group_by(id) %>%
summarise_all(~f(.))
# A tibble: 2 x 3
id value1 value2
<dbl> <chr> <int>
1 1 a 2
2 2 c 1
You can write f function as :
library(dplyr)
f <- function(x) x[!is.na(x) & x != 0][1]
test %>% group_by(id) %>% summarise(across(.fns = f))
# id value1 value2
# <dbl> <chr> <int>
#1 1 a 2
#2 2 c 1
Using [1] would return NA automatically if there are no non-zero or non-NA value in your data.
As a sidenote to the sidenote of #RicS, as of dplyr v1+, summarise_all() is deprecated (superseded). You should rather use across():
test %>%
group_by(id) %>%
summarise(across(.f=f))

Filter a dataset based on a grouped condition

It may be an stupid question, but I can not figure out how to filter df to keep the rows in which the id match the condition of being present in all the levels of factor_A:
df = data.frame(id = c(1,1,1,2,2,3,3),
factor_A = c(1,2,3,1,2,1,3))
The desired df1 would keep only the rows containing id=1, since it is present in factor_A=1,2 and 3:
id factor_A
1 1 1
2 1 2
3 1 3
this should do it
library(dplyr)
df = data.frame(id = c(1,1,1,2,2,3,3),
factor_A = c(1,2,3,1,2,1,3))
df %>% group_by(id) %>%
filter(length(unique(factor_A)) == length(unique(df$factor_A)))
I would suggest a dplyr approach. You can count the number of levels for each id and then filter. As your factor variable has 3 levels you will keep those rows with Flag equals to 3:
library(dplyr)
#Data
df = data.frame(id = c(1,1,1,2,2,3,3),
factor_A = c(1,2,3,1,2,1,3))
#Create flag
df %>% group_by(id) %>%
#Count levels
mutate(Flag=n_distinct(factor_A)) %>%
#Filter only rows with 3
filter(Flag==3) %>% select(-Flag)
Output:
# A tibble: 3 x 2
# Groups: id [1]
id factor_A
<dbl> <dbl>
1 1 1
2 1 2
3 1 3
We can use base R
subset(df, id %in% names(which(!rowSums(!table(df) > 0))))
# id factor_A
#1 1 1
#2 1 2
#3 1 3

Filter group only when both levels are present

This feels like it should be more straightforward and I'm just missing something. The goal is to filter the data into a new df where both var values 1 & 2 are represented in the group
here's some toy data:
grp <- c(rep("A", 3), rep("B", 2), rep("C", 2), rep("D", 1), rep("E",2))
var <- c(1,1,2,1,1,2,1,2,2,2)
id <- c(1:10)
df <- as.data.frame(cbind(id, grp, var))
only grp A and C should be present in the new data because they are the only ones where var 1 & 2 are present.
I tried dplyr, but obviously '&' won't work since it's not row based and '|' just returns the same df:
df.new <- df %>% group_by(grp) %>% filter(var==1 & var==2) #returns no rows
Here is another dplyr method. This can work for more than two factor levels in var.
library(dplyr)
df2 <- df %>%
group_by(grp) %>%
filter(all(levels(var) %in% var)) %>%
ungroup()
df2
# # A tibble: 5 x 3
# id grp var
# <fct> <fct> <fct>
# 1 1 A 1
# 2 2 A 1
# 3 3 A 2
# 4 6 C 2
# 5 7 C 1
We can condition on there being at least one instance of var == 1 and at least one instance of var == 2 by doing the following:
library(tidyverse)
df1 <- data_frame(grp, var, id) # avoids coercion to character/factor
df1 %>%
group_by(grp) %>%
filter(sum(var == 1) > 0 & sum(var == 2) > 0)
grp var id
<chr> <dbl> <int>
1 A 1 1
2 A 1 2
3 A 2 3
4 C 2 6
5 C 1 7

How to join two dataframes using dplyr in order to agregate values of the same column?

Is there a simple and elegant way to left join (with dplyr) a "b" table in an "a" table when both contains the same column, but the first has NA's and the second table has the missing values? Here folows an example:
# Tables A and B
a <- tibble(
"ID" = c(1,2,3),
"x" = c(NA,5, NA)
)
b <- tibble(
"ID" = c(1,3),
"x" = c(7, 4)
)
# Table I want as result
c <- tibble(
"ID" = c(1,2,3),
"x" = c(7,5,4)
)
You could use the coalesce function in the dplyr package to match together a complete vector from missing pieces. This is inspired by the sql COALESCE function.
left_join(a,b, by='ID') %>%
mutate(col = coalesce(x.x, x.y)) %>%
select(ID, col)
# A tibble: 3 x 2
ID col
<dbl> <dbl>
1 1 7
2 2 5
3 3 4
Joining and then removing rows with an NA should do it. If an ID has non-NA values of x in both tables, then this code will have 2 rows for that ID, but that is probably the behavior you'd want
library(dplyr)
full_join(a,b, by = c('ID', 'x')) %>%
na.omit()
# A tibble: 3 x 2
ID x
<dbl> <dbl>
1 2 5
2 1 7
3 3 4

How can I create a column that cumulatively adds the sum of two previous rows based on conditions?

I tried asking this question before but was it was poorly stated. This is a new attempt cause I haven't solved it yet.
I have a dataset with winners, losers, date, winner_points and loser_points.
For each row, I want two new columns, one for the winner and one for the loser that shows how many points they have scored so far (as both winners and losers).
Example data:
winner <- c(1,2,3,1,2,3,1,2,3)
loser <- c(3,1,1,2,1,1,3,1,2)
date <- c("2017-10-01","2017-10-02","2017-10-03","2017-10-04","2017-10-05","2017-10-06","2017-10-07","2017-10-08","2017-10-09")
winner_points <- c(2,1,2,1,2,1,2,1,2)
loser_points <- c(1,0,1,0,1,0,1,0,1)
test_data <- data.frame(winner, loser, date = as.Date(date), winner_points, loser_points)
I want the output to be:
winner_points_sum <- c(0, 0, 1, 3, 1, 3, 5, 3, 5)
loser_points_sum <- c(0, 2, 2, 1, 4, 5, 4, 7, 4)
test_data <- data.frame(winner, loser, date = as.Date(date), winner_points, loser_points, winner_points_sum, loser_points_sum)
How I've solved it thus far is to do a for loop such as:
library(dplyr)
test_data$winner_points_sum_loop <- 0
test_data$loser_points_sum_loop <- 0
for(i in row.names(test_data)) {
test_data[i,]$winner_points_sum_loop <-
(
test_data %>%
dplyr::filter(winner == test_data[i,]$winner & date < test_data[i,]$date) %>%
dplyr::summarise(points = sum(winner_points, na.rm = TRUE))
+
test_data %>%
dplyr::filter(loser == test_data[i,]$winner & date < test_data[i,]$date) %>%
dplyr::summarise(points = sum(loser_points, na.rm = TRUE))
)
}
test_data$winner_points_sum_loop <- unlist(test_data$winner_points_sum_loop)
Any suggestions how to tackle this problem? The queries take quite some time when the row numbers add up. I've tried elaborating with the AVE function, I can do it for one column to sum a players point as winner but can't figure out how to add their points as loser.
winner <- c(1,2,3,1,2,3,1,2,3)
loser <- c(3,1,1,2,1,1,3,1,2)
date <- c("2017-10-01","2017-10-02","2017-10-03","2017-10-04","2017-10-05","2017-10-06","2017-10-07","2017-10-08","2017-10-09")
winner_points <- c(2,1,2,1,2,1,2,1,2)
loser_points <- c(1,0,1,0,1,0,1,0,1)
test_data <- data.frame(winner, loser, date = as.Date(date), winner_points, loser_points)
library(dplyr)
library(tidyr)
test_data %>%
unite(winner, winner, winner_points) %>% # unite winner columns
unite(loser, loser, loser_points) %>% # unite loser columns
gather(type, pl_pts, winner, loser, -date) %>% # reshape
separate(pl_pts, c("player","points"), convert = T) %>% # separate columns
arrange(date) %>% # order dates (in case it's not)
group_by(player) %>% # for each player
mutate(sum_points = cumsum(points) - points) %>% # get points up to that date
ungroup() %>% # forget the grouping
unite(pl_pts_sumpts, player, points, sum_points) %>% # unite columns
spread(type, pl_pts_sumpts) %>% # reshape
separate(loser, c("loser", "loser_points", "loser_points_sum"), convert = T) %>% # separate columns and give appropriate names
separate(winner, c("winner", "winner_points", "winner_points_sum"), convert = T) %>%
select(winner, loser, date, winner_points, loser_points, winner_points_sum, loser_points_sum) # select the order you prefer
# # A tibble: 9 x 7
# winner loser date winner_points loser_points winner_points_sum loser_points_sum
# * <int> <int> <date> <int> <int> <int> <int>
# 1 1 3 2017-10-01 2 1 0 0
# 2 2 1 2017-10-02 1 0 0 2
# 3 3 1 2017-10-03 2 1 1 2
# 4 1 2 2017-10-04 1 0 3 1
# 5 2 1 2017-10-05 2 1 1 4
# 6 3 1 2017-10-06 1 0 3 5
# 7 1 3 2017-10-07 2 1 5 4
# 8 2 1 2017-10-08 1 0 3 7
# 9 3 2 2017-10-09 2 1 5 4
I finally understood what you want. And I took an approach of getting cumulative points of each player at each point in time and then joining it to the original test_data data frame.
winner <- c(1,2,3,1,2,3,1,2,3)
loser <- c(3,1,1,2,1,1,3,1,2)
date <- c("2017-10-01","2017-10-02","2017-10-03","2017-10-04","2017-10-05","2017-10-06","2017-10-07","2017-10-08","2017-10-09")
winner_points <- c(2,1,2,1,2,1,2,1,2)
loser_points <- c(1,0,1,0,1,0,1,0,1)
test_data <- data.frame(winner, loser, date = as.Date(date), winner_points, loser_points)
library(dplyr)
library(tidyr)
cum_points <- test_data %>%
gather(end_game_status, player_id, winner, loser) %>%
gather(which_point, how_many_points, winner_points, loser_points) %>%
filter(
(end_game_status == "winner" & which_point == "winner_points") |
(end_game_status == "loser" & which_point == "loser_points")) %>%
arrange(date = as.Date(date)) %>%
group_by(player_id) %>%
mutate(cumulative_points = cumsum(how_many_points)) %>%
mutate(cumulative_points_sofar = lag(cumulative_points, default = 0))
select(player_id, date, cumulative_points)
output <- test_data %>%
left_join(cum_points, by = c('date', 'winner' = 'player_id')) %>%
rename(winner_points_sum = cumulative_points_sofar) %>%
left_join(cum_points, by = c('date', 'loser' = 'player_id')) %>%
rename(loser_points_sum = cumulative_points_sofar)
output
The difference to the previous question of the OP is that the OP is now asking for the cumulative sum of points each player has scored so far, i.e., before the actual date. Furthermore, the sample data set now contains a date column which uniquely identifies each row.
So, my previous approach can be used here as well, with some modifications. The solution below reshapes the data from wide to long format whereby two value variables are reshaped simultaneously, computes the cumulative sums for each player id , and finally reshapes from long back to wide format, again. In order to sum only points scored before the actual date, the rows are lagged by one.
It is important to note that the winner and loser columns contain the respective player ids.
library(data.table)
cols <- c("winner", "loser")
setDT(test_data)[
# reshape multiple value variables simultaneously from wide to long format
, melt(.SD, id.vars = "date",
measure.vars = list(cols, paste0(cols, "_points")),
value.name = c("id", "points"))][
# rename variable column
, variable := forcats::lvls_revalue(variable, cols)][
# order by date and cumulate the lagged points by id
order(date), points_sum := cumsum(shift(points, fill = 0)), by = id][
# reshape multiple value variables simultaneously from long to wide format
, dcast(.SD, date ~ variable, value.var = c("id", "points", "points_sum"))]
date id_winner id_loser points_winner points_loser points_sum_winner points_sum_loser
1: 2017-10-01 1 3 2 1 0 0
2: 2017-10-02 2 1 1 0 0 2
3: 2017-10-03 3 1 2 1 1 2
4: 2017-10-04 1 2 1 0 3 1
5: 2017-10-05 2 1 2 1 1 4
6: 2017-10-06 3 1 1 0 3 5
7: 2017-10-07 1 3 2 1 5 4
8: 2017-10-08 2 1 1 0 3 7
9: 2017-10-09 3 2 2 1 5 4

Resources