Matching values of two data frames based on multiple conditions in R

I have two datasets:
cycle <- c(160, 160, 150, 158, 180)
split1 <- c(2, 2,4, 6, 8)
split2 <- c(10,10, 12, 14, 16)
df1 <- data.frame(cycle, split1, split2)
df1
cycle split1 split2
1 160 2 10
2 160 2 10
3 150 4 12
4 158 6 14
5 180 8 16
cycle <- c(160,150,190,180,161,150,140,179)
split1 <- c(2,4,12,8,2,4,32,8)
split2 <- c(10, 12, 18, 16, 10, 12, 21, 16)
df2 <- data.frame(cycle, split1, split2)
df2
cycle split1 split2
1 160 2 10
2 150 4 12
3 190 12 18
4 180 8 16
5 161 2 10
6 150 4 12
7 140 32 21
8 179 8 16
I want to match the values of df1 and df2 and label the df2 values based on two conditions:
1- If the values of all three columns (cycle, split1, and split2) are exactly the same, label the row "Same".
2- If only the cycle value differs between df1 and df2, by +1 or -1, and the rest of the row values are the same, also label the row "Same".
Any row that meets neither condition should be labeled "Different". The output should look like this:
cycle split1 split2 Type
1 160 2 10 Same
2 150 4 12 Same
3 190 12 18 Different
4 180 8 16 Same
5 161 2 10 Same
6 150 4 12 Same
7 140 32 21 Different
8 179 8 16 Same
I managed to implement the first condition as follows (using a separator between the pasted values so that different rows cannot collapse into the same key):
df1 <- df1 %>% mutate(key = paste(cycle, split1, split2, sep = "_"))
df2 <- df2 %>% mutate(key = paste(cycle, split1, split2, sep = "_"))
df2 %>%
  mutate(Type = ifelse(key %in% df1$key, 'same', 'different')) %>%
  select(-key)
cycle split1 split2 Type
1 160 2 10 same
2 150 4 12 same
3 190 12 18 different
4 180 8 16 same
5 161 2 10 different
6 150 4 12 same
7 140 32 21 different
8 179 8 16 different
However, I am having trouble implementing the second condition.
Any idea how to do this efficiently?
Thank you in advance.

Based on your original df1 and df2 (without the new key column), you could use
library(dplyr)

df2 %>%
  mutate(rn = row_number()) %>%   # remember the original df2 row
  left_join(df1, by = c("split1", "split2"), suffix = c("", ".y")) %>%
  mutate(
    type = coalesce(
      ifelse(abs(cycle - cycle.y) <= 1, "same", "different"),
      "different")                # rows with no df1 partner get NA, so default to "different"
  ) %>%
  group_by(rn) %>%
  distinct() %>%                  # drop duplicates created by repeated df1 rows
  ungroup() %>%
  select(-rn, -cycle.y)
This returns
# A tibble: 8 x 4
cycle split1 split2 type
<dbl> <dbl> <dbl> <chr>
1 160 2 10 same
2 150 4 12 same
3 190 12 18 different
4 180 8 16 same
5 161 2 10 same
6 150 4 12 same
7 140 32 21 different
8 179 8 16 same
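Both conditions collapse into a single rule: a df2 row is "Same" if some df1 row has identical split1 and split2 values and a cycle within 1 of its own. A minimal base R sketch of that rule, assuming the df1 and df2 from the question (the helper name has_match is made up for illustration):

```r
# Data from the question
df1 <- data.frame(cycle  = c(160, 160, 150, 158, 180),
                  split1 = c(2, 2, 4, 6, 8),
                  split2 = c(10, 10, 12, 14, 16))
df2 <- data.frame(cycle  = c(160, 150, 190, 180, 161, 150, 140, 179),
                  split1 = c(2, 4, 12, 8, 2, 4, 32, 8),
                  split2 = c(10, 12, 18, 16, 10, 12, 21, 16))

# has_match (hypothetical helper): TRUE if any df1 row has the same
# split1/split2 and a cycle that differs by at most 1
has_match <- function(cy, s1, s2) {
  any(df1$split1 == s1 & df1$split2 == s2 & abs(df1$cycle - cy) <= 1)
}

# Apply the rule to every df2 row
matched <- mapply(has_match, df2$cycle, df2$split1, df2$split2)
df2$Type <- ifelse(matched, "Same", "Different")
df2
```

This compares every df2 row with every df1 row, so it is best suited to small tables; the join-based answer above should scale better.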

Related

Removing all the observations except for observations from day 10 or day 20

I want to keep only the observations from day 10 or day 20 in data(ChickWeight) and remove all the others. I would like to do this with logical operators in R, i.e. & (and) or | (or). Below is my code, but I get an error:
ChickWeight %>% slice(10|20)
We could concatenate the indexes into a vector with c() and use - to remove those observations in slice, since slice requires a numeric index:
library(dplyr)
ChickWeight %>%
slice(-c(10, 20))
filter, in contrast, expects a logical vector:
ChickWeight %>%
filter(!row_number() %in% c(10, 20))
If this is based on the 'Time' column, use either of the options below:
ChickWeight %>%
slice(-which(Time %in% c(10, 20)))
ChickWeight %>%
filter(! Time %in% c(10, 20))
Here is another option using filter:
ChickWeight %>%
filter(row_number() != 10 &
row_number() != 20)
# A tibble: 576 × 4
weight Time Chick Diet
<dbl> <dbl> <ord> <fct>
1 42 0 1 1
2 51 2 1 1
3 59 4 1 1
4 64 6 1 1
5 76 8 1 1
6 93 10 1 1
7 106 12 1 1
8 125 14 1 1
9 149 16 1 1
10 199 20 1 1
You can use subset,
ChickWeight |> subset(Time == 10 | Time == 20)
or with (same result)
ChickWeight[with(ChickWeight, Time == 10 | Time == 20), ]
# weight Time Chick Diet
# 6 93 10 1 1
# 11 199 20 1 1
# 18 103 10 2 1
# 23 209 20 2 1
# 30 99 10 3 1
# 35 198 20 3 1
# ...
or, likewise, build a sequence of row numbers if you want to select by row position:
ChickWeight |> subset({m <- seq_len(nrow(ChickWeight)); m == 10 | m == 20})
ChickWeight[{m <- seq_len(nrow(ChickWeight)); m == 10 | m == 20}, ]
# weight Time Chick Diet
# 10 171 18 1 1
# 20 138 14 2 1
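The slice(10|20) call in the question does not do what was intended because 10 | 20 is a single logical value (TRUE), not a set of row numbers. To keep only the day 10 and day 20 observations with dplyr, as the question asks, the comparison can be made against the Time column inside filter. A short sketch, assuming dplyr is installed (kept and kept_in are illustrative names):

```r
library(dplyr)

# `|` treats non-zero numbers as TRUE, so this evaluates to TRUE
10 | 20

# Keep only the rows measured on day 10 or day 20
kept <- ChickWeight %>%
  filter(Time == 10 | Time == 20)

# Equivalent, and easier to extend to more days
kept_in <- ChickWeight %>%
  filter(Time %in% c(10, 20))
```

Negating either condition with ! gives the opposite selection, i.e. every row except those from day 10 or day 20.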

R: in for loop, using mutate to compute the difference between two variables dynamically

Purpose
Suppose I have four variables: two are original variables and the other two are predictions of them. (The actual data has more original variables.)
I want to use a for loop and mutate to create columns that compute the difference between each original variable and its prediction. The sample data and my current approach are as follows:
Sample data
set.seed(10000)
id <- sample(1:20, 100, replace=T)
set.seed(10001)
dv.1 <- sample(1:20, 100, replace=T)
set.seed(10002)
dv.2 <- sample(1:20, 100, replace=T)
set.seed(10003)
pred_dv.1 <- sample(1:20, 100, replace=T)
set.seed(10004)
pred_dv.2 <- sample(1:20, 100, replace=T)
d <-
data.frame(id, dv.1, dv.2, pred_dv.1, pred_dv.2)
Current approach (with Error)
original <- d %>% select(starts_with('dv.')) %>% names(.)
pred <- d %>% select(starts_with('pred_dv.')) %>% names(.)
for (i in 1:length(original)){
d <-
d %>%
mutate(diff = original[i] - pred[i])
l <- length(d)
colnames(d[l]) <- paste0(original[i], '.diff')
}
Error: Problem with mutate() input diff.
x non-numeric argument to binary operator
ℹ Input diff is original[i] - pred[i].
You can avoid the loop with across():
d %>%
  mutate(
    across(
      .cols = starts_with("dv"),
      .fns = ~ . - get(paste0("pred_", cur_column())),
      .names = "diff_{.col}"
    )
  )
# A tibble: 100 x 7
id dv.1 dv.2 pred_dv.1 pred_dv.2 diff_dv.1 diff_dv.2
<int> <int> <int> <int> <int> <int> <int>
1 15 5 1 5 15 0 -14
2 13 4 4 5 11 -1 -7
3 12 20 13 6 13 14 0
4 20 11 8 13 3 -2 5
5 9 11 10 7 13 4 -3
6 13 3 3 6 17 -3 -14
7 3 12 19 6 17 6 2
8 19 6 7 11 4 -5 3
9 6 7 12 19 6 -12 6
10 13 10 15 6 7 4 8
# ... with 90 more rows
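The loop in the question fails because original[i] and pred[i] are character strings (column names), so mutate tries to subtract one string from another. If a loop is still preferred, one possible fix is to look the columns up by name with the .data pronoun and to set the new column name with !! and :=. A minimal sketch on a small made-up data frame (the values are illustrative, not the question's sample; o, p and new_name are hypothetical helper variables):

```r
library(dplyr)

# Small illustrative data frame with the same column layout as the question
d <- data.frame(dv.1      = c(5, 4, 20),
                dv.2      = c(1, 4, 13),
                pred_dv.1 = c(5, 5, 6),
                pred_dv.2 = c(15, 11, 13))

original <- names(select(d, starts_with("dv.")))
pred     <- paste0("pred_", original)

for (i in seq_along(original)) {
  o <- original[i]                  # name of the original column
  p <- pred[i]                      # name of the matching prediction column
  new_name <- paste0(o, ".diff")    # name of the column to create
  d <- d %>%
    mutate(!!new_name := .data[[o]] - .data[[p]])
}
d
```

Building pred from original with paste0 keeps the two name vectors aligned even if the columns appear in a different order in the data frame.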
Subtraction can be applied on dataframes directly.
So you can create a vector of original column names and another vector of prediction column names and subtract them creating new columns.
orig_var <- grep('^dv', names(d), value = TRUE)
pred_var <- grep('pred', names(d), value = TRUE)
d[paste0(orig_var, '.diff')] <- d[orig_var] - d[pred_var]
d
# id dv.1 dv.2 pred_dv.1 pred_dv.2 dv.1.diff dv.2.diff
#1 15 5 1 5 15 0 -14
#2 13 4 4 5 11 -1 -7
#3 12 20 13 6 13 14 0
#4 20 11 8 13 3 -2 5
#5 9 11 10 7 13 4 -3
#...
#...

Replace several values and keep others same efficiently in R

I have a dataframe like the following:
combo_2 combo_4 combo_7 combo_9
12 23 14 17
21 32 41 71
2 3 1 7
1 2 4 1
21 23 14 71
2 32 1 7
Each column has two single-digit values and two double-digit values composed of the single-digit values in each possible order.
I am trying to determine how to replace certain values in the dataframe so that there is only one version of the double-digit value. For example, all values of 21 in the first column should be 12. All values of 32 in the second column should become 23.
I know I can do something like this using the following code:
df <- df %>%
mutate_at(vars(combo_2, combo_4, combo_7, combo_9), function(x)
case_when(x == 21 ~ 12, x == 32 ~ 23, x == 41 ~ 14, x == 71 ~ 17))
The problem with this is that it gives me a dataframe that contains the correct values when specified but leaves all the other values as NA. The resulting dataframe only contains values where 21, 32, 41, and 71 were. I know I could address this by specifying each value, like x == 1 ~ 1. However, I have many values and would prefer to only specify the ones that I am trying to change.
How can I replace several values in a dataframe without all the other values becoming NA? Is there a way for me to replace the values I want to replace while holding the other values the same without directly specifying those values?
You can use TRUE ~ x at the end of your case_when() sequence:
df %>%
mutate_at(vars(combo_2, combo_4, combo_7, combo_9), function(x)
case_when(x == 21 ~ 12, x == 32 ~ 23, x == 41 ~ 14, x == 71 ~ 17, TRUE ~ x))
combo_2 combo_4 combo_7 combo_9
1 12 23 14 17
2 12 23 14 17
3 2 3 1 7
4 1 2 4 1
5 12 23 14 17
6 2 23 1 7
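The NA values in the question arise because case_when returns NA for any element that matches none of its conditions; TRUE ~ x supplies the fallback. When the list of replacements grows, a lookup with base R's match may be easier to maintain. A sketch, assuming the df from the question (old, new and replace_values are illustrative names, not part of any package):

```r
df <- read.table(header = TRUE, text = "combo_2 combo_4 combo_7 combo_9
12 23 14 17
21 32 41 71
2 3 1 7
1 2 4 1
21 23 14 71
2 32 1 7")

old <- c(21, 32, 41, 71)  # values to replace
new <- c(12, 23, 14, 17)  # their replacements, in the same order

replace_values <- function(x) {
  idx <- match(x, old)       # position in `old`, or NA if the value is not listed
  hit <- !is.na(idx)
  x[hit] <- new[idx[hit]]    # overwrite only the matched elements
  x
}

# Apply the lookup to every column while keeping the data frame structure
df[] <- lapply(df, replace_values)
df
```

Values that do not appear in old are left untouched, so only the pairs that should change need to be listed.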
Another option that may be more efficient would be data.table's fcase() function.
Data:
df = read.table(header = TRUE, text = "combo_2 combo_4 combo_7 combo_9
12 23 14 17
21 32 41 71
2 3 1 7
1 2 4 1
21 23 14 71
2 32 1 7")
df[] = lapply(df, as.double) # side-note: tidyverse has become very strict about types
One dplyr and stringi option may be:
df %>%
mutate(across(everything(),
~ if_else(. %in% c(21, 32, 41, 71), as.integer(stri_reverse(.)), .)))
combo_2 combo_4 combo_7 combo_9
1 12 23 14 17
2 12 23 14 17
3 2 3 1 7
4 1 2 4 1
5 12 23 14 17
6 2 23 1 7
Using mapply:
df1[] <- mapply(function(d, x1, x2){ ifelse(d == x1, x2, d) },
d = df1,
x1 = c(21, 32, 41, 71),
x2 = c(12, 23, 14, 17))
df1
# combo_2 combo_4 combo_7 combo_9
# 1 12 23 14 17
# 2 12 23 14 17
# 3 2 3 1 7
# 4 1 2 4 1
# 5 12 23 14 17
# 6 2 23 1 7

Count rows satisfying "less than" filter for sequence of values

I have a dataset with a bunch of times. Let's say I want to create a summary table that counts the number of rows satisfying a "less than" filter for a sequence of values, say [number of rows with time < 6, number of rows with time < 7, etc.]
Example dataset:
data.frame(personId = c("2009ZEMD01", "2012PARK03", "2017VILL41", "2010WEYE01", "2016KOLA02", "2012PONC02"),
average = c(553, 559, 598, 606, 612, 613))
This was my solution using sapply:
tibble(time = 6:15,
count = sapply(time, function(t) best_3x3_solvers %>% filter(average/100 < t) %>% nrow))
The result:
> solvers_under
# A tibble: 10 x 2
time count
<int> <int>
1 6 3
2 7 48
3 8 274
4 9 840
5 10 1952
6 11 3792
7 12 6269
8 13 9459
9 14 13204
10 15 17274
The code is not too long but is there a method using more tidyverse tools without *apply? Maybe summarize with n().
One dplyr and purrr option could be:
map_dfr(.x = 6:15,
~ df %>%
group_by(time = .x) %>%
summarise(count = sum(average/100 < .x)))
time count
<int> <int>
1 6 3
2 7 6
3 8 6
4 9 6
5 10 6
6 11 6
7 12 6
8 13 6
9 14 6
10 15 6
Here's one way :
library(dplyr)
library(purrr)
map_df(6:15, ~df %>% summarise(time = .x, count = sum(average/100 < .x)))
# time count
# 1 6 3
# 2 7 6
# 3 8 6
# 4 9 6
# 5 10 6
# 6 11 6
# 7 12 6
# 8 13 6
# 9 14 6
#10 15 6
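You can also pair every threshold with every row and then use group_by and summarise. Note that df has no time column, so the thresholds have to be supplied explicitly:
library(dplyr)
library(tidyr)
expand_grid(time = 6:15, df) %>%
  group_by(time) %>%
  summarise(count = sum(average / 100 < time))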
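Because each count is just the number of averages below a threshold, the whole table can also be computed in one vectorised step by comparing every average with every threshold. A base R sketch, assuming the six-row example data is stored in df (thresholds, below and result are illustrative names):

```r
df <- data.frame(
  personId = c("2009ZEMD01", "2012PARK03", "2017VILL41",
               "2010WEYE01", "2016KOLA02", "2012PONC02"),
  average  = c(553, 559, 598, 606, 612, 613)
)

thresholds <- 6:15

# outer() builds a logical matrix: one row per solver, one column per threshold
below <- outer(df$average / 100, thresholds, "<")

# Column sums count how many solvers fall below each threshold
result <- data.frame(time = thresholds, count = colSums(below))
result
```

For the six example rows this gives 3 for time 6 and 6 for every later threshold, matching the purrr answers above. The matrix has one cell per row and threshold pair, so memory use grows with both.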

Find the smallest value under conditions about the index (NA output possible)

Question:
I am using dplyr to do data analysis in R, and I have come across the following problem.
My data frame is like this:
item day val
1 A 1 90
2 A 2 100
3 A 3 110
4 A 5 80
5 A 8 70
6 B 1 75
7 B 3 65
The data frame is already arranged by item and day. Now I want to add a new column with mutate, where each row holds the smallest val from the same item group among the rows whose day falls within the next 2 days.
For the example above, I want the resulting data frame to be:
item day val output
1 A 1 90 100 # the smaller of 100 and 110
2 A 2 100 110 # the only value within 2 days
3 A 3 110 80 # the only value within 2 days
4 A 5 80 NA # there is no data within 2 days
5 A 8 70 NA # there is no data within 2 days
6 B 1 75 65 # the only value within 2 days
7 B 3 65 NA # there is no data within 2 days
I understand that I will probably use group_by and mutate, but how to write the inside function in order to achieve my desired result?
Any help is greatly appreciated. Let me know if you need me to clarify anything. Thank you!
Try this:
df %>%
  # arrange(item, day) %>% # if not already arranged
  # take note of the next two values & corresponding difference in days
  group_by(item) %>%
  mutate(val.1 = lead(val),
         day.1 = lead(day) - day,
         val.2 = lead(val, 2),
         day.2 = lead(day, 2) - day) %>%
  ungroup() %>%
  # if the value is associated with a day more than 2 days away, change it to NA
  mutate(val.1 = ifelse(day.1 %in% c(1, 2), val.1, NA),
         val.2 = ifelse(day.2 %in% c(1, 2), val.2, NA)) %>%
  # calculate output normally
  group_by(item, day) %>%
  mutate(output = min(val.1, val.2, na.rm = TRUE)) %>%
  ungroup() %>%
  # arrange results
  select(item, day, val, output) %>%
  mutate(output = ifelse(output == Inf, NA, output)) %>%
  arrange(item, day)
# A tibble: 7 x 4
item day val output
<fctr> <int> <int> <dbl>
1 A 1 90 100
2 A 2 100 110
3 A 3 110 80.0
4 A 5 80 NA
5 A 8 70 NA
6 B 1 75 65.0
7 B 3 65 NA
Data:
df <- read.table(text = " item day val
1 A 1 90
2 A 2 100
3 A 3 110
4 A 5 80
5 A 8 70
6 B 1 75
7 B 3 65", header = TRUE)
We can use complete from the tidyr package to complete the dataset by day, and then use lead from dplyr and rollapply from zoo to find the minimum of the next two days.
library(dplyr)
library(tidyr)
library(zoo)
DF2 <- DF %>%
  group_by(item) %>%
  complete(day = full_seq(day, period = 1)) %>%
  mutate(output = rollapply(lead(val), width = 2, FUN = min, na.rm = TRUE,
                            fill = NA, align = "left")) %>%
  drop_na(val) %>%
  ungroup() %>%
  mutate(output = ifelse(output == Inf, NA, output))
DF2
# # A tibble: 7 x 4
# item day val output
# <chr> <dbl> <int> <dbl>
# 1 A 1.00 90 100
# 2 A 2.00 100 110
# 3 A 3.00 110 80.0
# 4 A 5.00 80 NA
# 5 A 8.00 70 NA
# 6 B 1.00 75 65.0
# 7 B 3.00 65 NA
DATA
DF <- read.table(text = "item day val
1 A 1 90
2 A 2 100
3 A 3 110
4 A 5 80
5 A 8 70
6 B 1 75
7 B 3 65",
header = TRUE, stringsAsFactors = FALSE)
We'll create a copy of the dataset with shifted day values, so that we can left join it onto the original dataset and keep only the minimum value.
df %>%
  left_join(
    bind_rows(mutate(., day = day - 1), mutate(., day = day - 2)) %>%
      rename(output = val)) %>%
  group_by(item, day, val) %>%
  summarize_at("output", min) %>%
  ungroup()
# # A tibble: 7 x 4
# item day val output
# <fctr> <dbl> <int> <dbl>
# 1 A 1 90 100
# 2 A 2 100 110
# 3 A 3 110 80
# 4 A 5 80 NA
# 5 A 8 70 NA
# 6 B 1 75 65
# 7 B 3 65 NA
data
df <- read.table(text = " item day val
1 A 1 90
2 A 2 100
3 A 3 110
4 A 5 80
5 A 8 70
6 B 1 75
7 B 3 65", header = TRUE)
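The rule in the question (the smallest val within the next 2 days, per item) can also be written out directly, which may help when checking the answers above. A minimal base R sketch, assuming the question's data; next_min is a hypothetical helper and window is an assumed parameter name:

```r
df <- data.frame(item = c("A", "A", "A", "A", "A", "B", "B"),
                 day  = c(1, 2, 3, 5, 8, 1, 3),
                 val  = c(90, 100, 110, 80, 70, 75, 65))

# next_min (hypothetical helper): for each row, the smallest val among rows
# whose day lies strictly after this day and at most `window` days later
next_min <- function(day, val, window = 2) {
  vapply(seq_along(day), function(i) {
    in_window <- day > day[i] & day <= day[i] + window
    if (any(in_window)) min(val[in_window]) else NA_real_
  }, numeric(1))
}

# Apply the helper within each item, then restore the original row order
per_item <- lapply(split(df, df$item), function(g) next_min(g$day, g$val))
df$output <- unsplit(per_item, df$item)
df
```

Returning NA_real_ explicitly when no row falls in the window avoids the Inf that min produces on an empty input, so no clean-up step is needed afterwards.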
