R - Compare two columns from different datasets and get new dataset

I have two datasets. In each dataset, every two columns form a pair, with *a as the 1st column and *b as the 2nd column.
E.g. in df1, pair A1 consists of df1$A1a (1st column) and df1$A1b (2nd column).
Likewise, in df2, pair B1 consists of df2$B1a (1st column) and df2$B1b (2nd column).
df1:
ID A1a A1b A2a A2b
1 2 3 2 3
2 3 1 2 1
3 1 3 1 2
4 2 2 3 3
5 1 2 2 1
df2:
ID B1a B1b B2a B2b
1 1 2 2 3
2 3 2 1 1
3 2 3 2 2
4 3 2 2 3
5 2 2 3 1
The final data (df3) should look like this:
ID C1a C1b C2a C2b
1 1 2 2 3
2 3 1 1 1
3 1 3 1 2
4 2 2 2 3
5 1 2 2 1
I would like to do the following:
First, compare the 1st column of each pair between df1 and df2 and identify the lower value. E.g. for ID=1, compare df1$A1a = 2 with df2$B1a = 1; since df2$B1a has the lower value, mutate new columns with the pair from df2, i.e. df3$C1a = 1, df3$C1b = 2.
If the 1st columns of a pair are equal, use the 2nd columns to determine which pair of values to take. E.g. for ID=2, the 1st columns show df1$A1a = 3 and df2$B1a = 3, so the 2nd columns decide: since df1$A1b = 1 and df2$B1b = 2, the values should come from df1, i.e. df3$C1a = 3 and df3$C1b = 1.
If the pairs from df1 and df2 are identical, just use those values. E.g. for ID=1, the 1st columns df1$A2a = 2 and df2$B2a = 2 are the same, and the 2nd columns df1$A2b = 3 and df2$B2b = 3 are the same, so the new columns should be df3$C2a = 2 and df3$C2b = 3.
I am hoping to automate this so that the code compares every pair from df1 with df2 by itself, rather than me comparing the pairs individually (e.g. A1 vs B1 first, then A2 vs B2, etc.). Thank you for any help!
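For a single pair, the rule boils down to one logical test; here is a minimal base R sketch (column names taken from the example data above; the answer below generalizes this to every pair):
take_df2 <- df2$B1a < df1$A1a | (df2$B1a == df1$A1a & df2$B1b < df1$A1b)
df3 <- data.frame(
  ID  = df1$ID,
  C1a = ifelse(take_df2, df2$B1a, df1$A1a), # take df2's pair where it "wins"
  C1b = ifelse(take_df2, df2$B1b, df1$A1b)  # otherwise keep df1's pair
)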

This is a bit lengthy, but it does the trick.
library(dplyr)
library(tidyr)
library(purrr)
library(stringr)

map2(
  list(df1, df2), c("A", "B"),
  function(df, df_chr) {
    # pivot long and strip the dataset letter, so "A1a" and "B1a" both become "1a"
    df %>%
      pivot_longer(cols = -ID, values_to = df_chr) %>%
      mutate(name = str_replace(name, df_chr, ""))
  }
) %>%
  reduce(left_join, by = c("ID", "name")) %>%
  # split "1a" into pair number (id1) and a/b position (id2)
  mutate(name = str_split(name, ""),
         id1 = map_chr(name, ~ .[[1]]),
         id2 = map_chr(name, ~ .[[2]]),
         .after = ID) %>%
  select(-name) %>%
  nest(data = -c(ID, id1)) %>%
  # per ID and pair: compare the "a" row first; on a tie, compare the "b" row
  mutate(data = map(data, function(data) {
    if ((data %>% slice(1) %>% .$A) != (data %>% slice(1) %>% .$B)) {
      min_col_num <- (data %>% slice(1) %>% select(-id2) %>% which.min() %>% unname()) + 1
    } else {
      min_col_num <- (data %>% slice(2) %>% select(-id2) %>% which.min() %>% unname()) + 1
    }
    data %>% select(id2, value = min_col_num)
  })) %>%
  unnest(cols = data) %>%
  mutate(name = str_c("C", id1, id2), .after = ID) %>%
  select(-c(id1, id2)) %>%
  pivot_wider()
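The trick here is pivoting both data frames into long format so that every pair becomes its own group: one comparison rule can then be applied across all pairs at once, and pivot_wider() restores the original shape with the new C columns.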

Related

R: conditionally mutate a variable when columns match in different dataframes

I am attempting to write some R code that assesses whether or not two dataframes have any matches in their columns. If there are matches, one of the columns in the second dataframe should assign a "link" (via the links variable) to the first dataframe using the id column of the first dataframe.
In the event that there are multiple matches, I am trying to get the "link" variable to randomly select one of the matching id's.
Some reproducible code:
library(dplyr)
df1 = data.frame(ids = c(1:5),
                 var = c("a","a","c","b","b"))
df2 = data.frame(var = c('c','a','b','b','d'),
                 links = 0)
Ideally, I would like a resulting dataframe that looks like:
var links
1 c 3
2 a 1 or 2
3 b 4 or 5
4 b 4 or 5
5 d 0
where observations in the links column randomly select ids from df1 when df1$var matches df2$var. In the dataframe above, this is denoted by "or".
Note 1: The links column should be a numeric, I only made it character to allow to write the word "or".
Note 2: If there is not a match between df1$var and df2$var, the links column should remain a 0.
So far, I've gone this route, but I'm unsure about what to put after the ~:
linked_df = df2 %>%
  mutate(links = case_when(
    links == 0 & var %in% df1$var ~
      sample(c(df1$ids), n(), replace = T), # unsure about this line
    TRUE ~ links
  ))
I think this is what you want. I've left the ids column in the result, but it can be removed once the sampling is complete.
library(dplyr)
library(tidyr)

# collapse df1 to one row per var, with all matching ids in a list-column
df1_nest = df1 %>%
  group_by(var) %>%
  summarize(ids = list(ids))

# sample(x) on a length-one numeric x would draw from 1:x, so guard against that
safe_sample = function(x, ...) {
  if (length(x) == 1) return(x)
  sample(x, ...)
}

set.seed(47)
df2 %>%
  left_join(df1_nest) %>%
  mutate(
    links = sapply(ids, \(x) if (is.null(x)) 0L else safe_sample(x, size = 1))
  )
# Joining, by = "var"
# var links ids
# 1 c 3 3
# 2 a 1 1, 2
# 3 b 4 4, 5
# 4 b 5 4, 5
# 5 d 0 NULL
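Nesting the matching ids into a list-column is the key design choice here: each df2 row carries all of its candidate links, and the rowwise sapply() then draws one of them (or returns 0 when there is no match).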
Something like this could do the trick: just a map of a filter of the first dataframe (purrr and tidyr are needed alongside dplyr):
library(dplyr)
library(purrr)
library(tidyr)

df2 %>%
  as_tibble() %>%
  # for each var, shuffle all matching ids from df1 into a list-column
  mutate(links = map(var, ~ sample(filter(df1, var == .)$ids)),
         index = row_number()) %>%
  unnest(links, keep_empty = TRUE) %>%
  group_by(index) %>%
  slice_sample(n = 1) %>% # keep one of the matching ids at random
  ungroup() %>%
  select(-index)
# # A tibble: 5 × 2
# var links
# <chr> <int>
# 1 c 1
# 2 a 1
# 3 b 4
# 4 b 5
# 5 d NA
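Note that the c row links to 1 rather than 3 here: sample() on the length-one vector 3 draws from 1:3, which is exactly the pitfall the safe_sample() helper in the previous answer guards against.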

Replace column values in a df with matching index with new values in R

I have df, containing 2 variables, df and val. The df column contains the numbers 1-255 and val holds generated numbers. I also have new_vals, a vector of 255 different values.
df = seq(1, 255, by = 1)
df = as.data.frame(df)
df$val = seq(0, 1, length.out = 255)
new_vals = df$val + 1
new_vals = as.data.frame(new_vals)
I want to replace the values in df: each number 1-255 in df$df corresponds to one of the 255 numbers in new_vals, so where the index matches, df$val should be replaced with the value from new_vals at that index.
dataframe df
df val
1 0.000000000
2 0.003937008
3 0.007874016
dataframe new_vals (these are the values at indices 1, 2, 3)
new_vals
<dbl>
1.000000
1.003937
1.007874
Expected output of dataframe df after replacing values at the matching indices
df val
1 1.000000
2 1.003937
3 1.007874
What is the easiest way I could do this?
Edit: I realize that in this example I could just replace the whole column, but imagine that df$df's order of 1-255 were randomized or that there were more rows.
If I'm understanding correctly, here's a way to match indices with dplyr:
library(dplyr)

new_vals %>%
  mutate(index = row_number()) %>%
  left_join(df, by = c("index" = "df"), keep = TRUE)
Which gives us:
new_vals index df val
1 1.000000 1 1 0.000000000
2 1.003937 2 2 0.003937008
3 1.007874 3 3 0.007874016
The proposed solution, keeping only the relevant columns, would be:
new_vals %>%
  mutate(index = row_number()) %>%
  left_join(df, by = c("index" = "df"), keep = TRUE) %>%
  select(df, val = new_vals)
Which gives us:
df val
1 1 1.000000
2 2 1.003937
3 3 1.007874
4 4 1.011811
5 5 1.015748
6 6 1.019685
7 7 1.023622
8 8 1.027559
9 9 1.031496
10 10 1.035433
If you are sure that df$df contains exactly the numbers 1-255 in order, then:
df$val[which(df$df %in% c(1:255))] <- new_vals$new_vals
In addition, a for loop gives you more control and checks the index accurately, even when the order is shuffled:
for (row in df$df) {
  df$val[df$df == row] <- new_vals$new_vals[row]
}
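A compact base R alternative (a sketch, assuming every value in df$df is a valid row number of new_vals) indexes new_vals directly, which also copes with a shuffled df$df:
# the values in df$df act as row indices into new_vals
df$val <- new_vals$new_vals[df$df]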

Separate rows with conditions

I have this dataframe separate_on_condition with two columns:
separate_on_condition <- data.frame(first = 'a3,b1,c2', second = '1,2,3,4,5,6')
# first second
# 1 a3,b1,c2 1,2,3,4,5,6
How can I turn it to:
# A tibble: 6 x 2
first second
<chr> <chr>
1 a 1
2 a 2
3 a 3
4 b 4
5 c 5
6 c 6
where:
a3 will be separated into 3 rows
b1 into 1 row
c2 into 2 rows
Is there a better way of achieving this than using rep() on the first column and separate_rows() on the second column?
Any help would be much appreciated!
Create a row number column to account for multiple rows.
Split the second column on ',' into separate rows.
For each row, extract the letters to repeat along with the number of times each needs to be repeated.
library(dplyr)
library(tidyr)
library(stringr)
separate_on_condition %>%
  mutate(row = row_number()) %>%
  separate_rows(second, sep = ',') %>%
  group_by(row) %>%
  mutate(first = rep(str_extract_all(first(first), '[a-zA-Z]+')[[1]],
                     str_extract_all(first(first), '\\d+')[[1]])) %>%
  ungroup %>%
  select(-row)
# first second
# <chr> <chr>
#1 a 1
#2 a 2
#3 a 3
#4 b 4
#5 c 5
#6 c 6
You can use the following base R option
with(
  separate_on_condition,
  data.frame(
    first = unlist(sapply(
      unlist(strsplit(first, ",")),
      function(x) rep(gsub("\\d", "", x), as.numeric(gsub("\\D", "", x)))
    ), use.names = FALSE),
    second = eval(str2lang(sprintf("c(%s)", second)))
  )
)
which gives
first second
1 a 1
2 a 2
3 a 3
4 b 4
5 c 5
6 c 6
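The str2lang() call does the heavy lifting for second: sprintf("c(%s)", second) builds the string "c(1,2,3,4,5,6)", str2lang() parses it into a call, and eval() turns that call into the numeric vector 1 to 6.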
Here is an alternative approach:
add NA entries to first to get the same number of elements as second
use separate_rows to bring each element to its own row
use extract with a regex to split first into first and a helper count column
group and slice by the values in helper
do some tweaking
library(tidyr)
library(dplyr)
library(stringr)

separate_on_condition %>%
  mutate(first = str_c(first, ",NA,NA,NA")) %>%
  separate_rows(first, second, sep = "[^[:alnum:].]+", convert = TRUE) %>%
  extract(first, into = c("first", "helper"), "(.{1})(.{1})", remove = FALSE) %>%
  group_by(second) %>%
  slice(rep(1:n(), each = helper)) %>%
  ungroup() %>%
  drop_na() %>%
  mutate(second = row_number()) %>%
  select(first, second)
first second
<chr> <int>
1 a 1
2 a 2
3 a 3
4 b 4
5 c 5
6 c 6

rollsumr with window-length>1: filling missing values

My data frame looks something like the first two columns of the following. I want to add a third column equal to the sum of the ID-group's last three observations of VAL.
Using the following command, I managed to get the output below:
library(dplyr)
library(zoo)

df %>%
  group_by(ID) %>%
  mutate(SUM = rollsumr(VAL, k = 3, fill = NA)) %>%
  ungroup()
ID VAL SUM
1 2 NA
1 1 NA
1 3 6
1 4 8
...
I am now hoping to be able to fill the NAs that result for the group's cells in the first two rows.
ID VAL SUM
1 2 2
1 1 3
1 3 6
1 4 8
...
How do I do that?
I have tried doing the following
df %>%
  group_by(ID) %>%
  mutate(SUM = rollsumr(VAL, k = min(3, row_number()))) %>%
  ungroup()
and
df %>%
  group_by(ID) %>%
  mutate(SUM = rollsumr(VAL, k = 3, fill = "extend")) %>%
  ungroup()
But both give me the same error, because I have groups of size <= 2:
Evaluation error: need at least two non-NA values to interpolate.
What do I do?
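For reference, here is a minimal data frame matching the four rows shown above (an assumption for reproducibility; the rows elided with "..." are omitted):
df <- data.frame(ID = c(1L, 1L, 1L, 1L), VAL = c(2L, 1L, 3L, 4L))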
Alternatively, you can use rollapply() from the same package (zoo):
df %>%
  group_by(ID) %>%
  mutate(SUM = rollapply(VAL, width = 3, FUN = sum, partial = TRUE, align = "right"))
ID VAL SUM
<int> <int> <int>
1 1 2 2
2 1 1 3
3 1 3 6
4 1 4 8
Thanks to the argument partial = TRUE, rows with fewer than the desired window of three observations are summed as well.
Not a direct answer, but one way would be to replace the values that are NA with the cumulative sum of VAL:
library(dplyr)
library(zoo)

df %>%
  group_by(ID) %>%
  mutate(SUM = rollsumr(VAL, k = 3, fill = NA),
         SUM = ifelse(is.na(SUM), cumsum(VAL), SUM))
# ID VAL SUM
# <int> <int> <int>
#1 1 2 2
#2 1 1 3
#3 1 3 6
#4 1 4 8
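This works because, for the rows before the window is full, the sum of all observations so far (cumsum(VAL)) is exactly the partial window sum that rollsumr leaves as NA.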
Or, since you know the window size beforehand, you could check with row_number() as well:
df %>%
  group_by(ID) %>%
  mutate(SUM = rollsumr(VAL, k = 3, fill = NA),
         SUM = ifelse(row_number() < 3, cumsum(VAL), SUM))

Filter and return all rows of a group where specific row fulfills one condition

I am looking to filter and retrieve all rows from all groups where a specific row meets a condition; in my example, groups where value is greater than 3 on the highest day. This is obviously simplified, but it breaks the problem down to its essentials.
# Dummy data
id = rep(letters[1:3], each = 3)
day = rep(1:3, 3)
value = c(2,3,4,2,3,3,1,2,4)
my_data = data.frame(id, day, value, stringsAsFactors = FALSE)
My approach works, but it seems somewhat clumsy:
require(dplyr)

foo <- my_data %>%
  group_by(id) %>%
  slice(which.max(day)) %>% # keep each group's row with the highest day
  filter(value > 3)         # keep only those rows with value > 3

# a semi_join with the original data frame gives the required result:
semi_join(my_data, foo, by = 'id')
id day value
1 a 1 2
2 a 2 3
3 a 3 4
4 c 1 1
5 c 2 2
6 c 3 4
Is there a more succinct way to do this?
my_data %>% group_by(id) %>% filter(value[which.max(day)] > 3)
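This works because value[which.max(day)] > 3 evaluates to a single TRUE or FALSE per group, which filter() recycles to the group's length, keeping or dropping all of the group's rows at once and reproducing the semi_join result above. Add ungroup() if you need an ungrouped result.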
