How to perform complex multicolumn match in R

I wish to match two dataframes based on conditions on more than one column but cannot figure out how. These are my data sets:
df1 <- data.frame(lower=c(0,5,10,15,20), upper=c(4,9,14,19,24), x=c(12,45,67,89,10))
df2 <- data.frame(age=c(12, 14, 5, 2, 9, 19, 22, 18, 23))
I wish to match each age in df2 to the row of df1 whose range from lower to upper contains it, and then add an extra column to df2 holding the value of x from that row. I.e. I want df2 to look like
age x
12 67
14 67
5 45
....etc.
How can I achieve such a match?

I would go with a simple sapply and an "anded" condition in the df1$x selection, like this:
df2$x <- sapply( df2$age, function(x) { df1$x[ x >= df1$lower & x <= df1$upper ] })
which gives:
> df2
age x
1 12 67
2 14 67
3 5 45
4 2 12
5 9 45
6 19 89
7 22 10
8 18 89
9 23 10
For age 12 for example the selection inside the brackets gives:
> 12 >= df1$lower & 12 <= df1$upper
[1] FALSE FALSE TRUE FALSE FALSE
So getting df1$x by this logical vector is easy, as your ranges don't overlap: exactly one element is TRUE for each age.
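If an age could fall outside every range, the selection would be empty for that age and sapply would return a list rather than a vector. A minimal sketch of a guarded variant, assuming you want NA in that case (the helper name lookup_x and the out-of-range age 30 are made up for illustration):

```r
df1 <- data.frame(lower = c(0, 5, 10, 15, 20),
                  upper = c(4, 9, 14, 19, 24),
                  x     = c(12, 45, 67, 89, 10))

# 30 is a hypothetical age that lies outside every range in df1
ages <- c(12, 2, 30)

# Return the x of the first matching range, or NA when nothing matches
lookup_x <- function(a) {
  hit <- which(a >= df1$lower & a <= df1$upper)
  if (length(hit) == 0) NA_real_ else df1$x[hit[1]]
}

# vapply guarantees a numeric vector with one value per age
res <- vapply(ages, lookup_x, numeric(1))
res
```

Using vapply instead of sapply makes the return type explicit, so a surprise in the data shows up as an error rather than as a silently changed result shape.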

Using foverlaps from data.table is what you are looking for:
library(data.table)
setDT(df1)
setDT(df2)[,age2:=age]
setkey(df1,lower,upper)
foverlaps(df2, df1, by.x = names(df2),by.y=c("lower","upper"))[,list(age,x)]
# age x
# 1: 12 67
# 2: 14 67
# 3: 5 45
# 4: 2 12
# 5: 9 45
# 6: 19 89
# 7: 22 10
# 8: 18 89
# 9: 23 10
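A non-equi update join is another data.table route, and it avoids the helper column age2. This is only a sketch, assuming a data.table version recent enough to support non-equi conditions in on=; the data are rebuilt here so the block runs on its own:

```r
library(data.table)

df1 <- data.table(lower = c(0, 5, 10, 15, 20),
                  upper = c(4, 9, 14, 19, 24),
                  x     = c(12, 45, 67, 89, 10))
df2 <- data.table(age = c(12, 14, 5, 2, 9, 19, 22, 18, 23))

# For every row of df1, find the df2 rows whose age lies in [lower, upper]
# and write df1's x (referred to as i.x) into a new column x of df2, by reference
df2[df1, x := i.x, on = .(age >= lower, age <= upper)]
df2
```

Rows of df2 that match no range would simply keep NA in x, and the original row order of df2 is preserved.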

Here's another vectorized approach using findInterval on a melted data set
library(data.table)
df2$x <- melt(setDT(df1), "x")[order(value), x[findInterval(df2$age, value)]]
# age x
# 1 12 67
# 2 14 67
# 3 5 45
# 4 2 12
# 5 9 45
# 6 19 89
# 7 22 10
# 8 18 89
# 9 23 10
The idea here is to:
First, tidy up your data so that lower and upper end up in the same column, with x holding the corresponding value for each entry of that new column.
Then, sort the data by these range boundaries (necessary for findInterval).
Finally, use the result of findInterval to index the x column and pick out the matching values.
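The three steps above can also be sketched in base R without any packages. The intermediate name long is made up for illustration, and the sketch assumes the ranges are non-overlapping, as in the question:

```r
df1 <- data.frame(lower = c(0, 5, 10, 15, 20),
                  upper = c(4, 9, 14, 19, 24),
                  x     = c(12, 45, 67, 89, 10))
df2 <- data.frame(age = c(12, 14, 5, 2, 9, 19, 22, 18, 23))

# Step 1: stack lower and upper into one column, repeating x alongside
long <- data.frame(value = c(df1$lower, df1$upper),
                   x     = rep(df1$x, 2))

# Step 2: sort by the boundary values (findInterval needs sorted breaks)
long <- long[order(long$value), ]

# Step 3: locate each age among the boundaries and pick the matching x
df2$x <- long$x[findInterval(df2$age, long$value)]
df2
```

One caveat of this technique: an age that falls in a gap between two ranges (for example 4.5) is not reported as unmatched but is assigned the x of the range just below it.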
And here's a possible dplyr/tidyr version
library(tidyr)
library(dplyr)
df1 %>%
gather(variable, value, -x) %>%
arrange(value) %>%
do(data.frame(x = .$x[findInterval(df2$age, .$value)])) %>%
cbind(df2, .)
# age x
# 1 12 67
# 2 14 67
# 3 5 45
# 4 2 12
# 5 9 45
# 6 19 89
# 7 22 10
# 8 18 89
# 9 23 10

Related

Replace several values and keep others same efficiently in R

I have a dataframe like the following:
combo_2 combo_4 combo_7 combo_9
12 23 14 17
21 32 41 71
2 3 1 7
1 2 4 1
21 23 14 71
2 32 1 7
Each column has two single-digit values and two double-digit values composed of the single-digit values in each possible order.
I am trying to determine how to replace certain values in the dataframe so that there is only one version of the double-digit value. For example, all values of 21 in the first column should be 12. All values of 32 in the second column should become 23.
I know I can do something like this using the following code:
df <- df %>%
mutate_at(vars(combo_2, combo_4, combo_7, combo_9), function(x)
case_when(x == 21 ~ 12, x == 32 ~ 23, x == 41 ~ 14, x == 71 ~ 17))
The problem with this is that it gives me a dataframe that contains the correct values when specified but leaves all the other values as NA. The resulting dataframe only contains values where 21, 32, 41, and 71 were. I know I could address this by specifying each value, like x == 1 ~ 1. However, I have many values and would prefer to only specify the ones that I am trying to change.
How can I replace several values in a dataframe without all the other values becoming NA? Is there a way for me to replace the values I want to replace while holding the other values the same without directly specifying those values?
You can use TRUE ~ x at the end of your case_when() sequence:
df %>%
mutate_at(vars(combo_2, combo_4, combo_7, combo_9), function(x)
case_when(x == 21 ~ 12, x == 32 ~ 23, x == 41 ~ 14, x == 71 ~ 17, TRUE ~ x))
combo_2 combo_4 combo_7 combo_9
1 12 23 14 17
2 12 23 14 17
3 2 3 1 7
4 1 2 4 1
5 12 23 14 17
6 2 23 1 7
Another option that may be more efficient would be data.table's fcase() function.
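A sketch of what the fcase() route could look like, assuming all columns are doubles (fcase requires every value to have the same type). The helper name recode and the two-column data frame are made up for illustration. The final rep(TRUE, ...) condition plays the role of TRUE ~ x and keeps every other value unchanged:

```r
library(data.table)

# Small stand-in for the question's data, stored as doubles
df <- data.frame(combo_2 = c(12, 21, 2, 1, 21, 2),
                 combo_4 = c(23, 32, 3, 2, 23, 32))

# Replace only the listed values; the last pair is the catch-all
recode <- function(x) {
  fcase(x == 21, 12,
        x == 32, 23,
        x == 41, 14,
        x == 71, 17,
        rep(TRUE, length(x)), x)
}

# df[] keeps the data.frame structure while replacing every column
df[] <- lapply(df, recode)
df
```

If the columns were read in as integers, they would need converting with as.double first, or the replacement values would need an L suffix (12L, 23L, and so on).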
Data:
df = read.table(header = TRUE, text = "combo_2 combo_4 combo_7 combo_9
12 23 14 17
21 32 41 71
2 3 1 7
1 2 4 1
21 23 14 71
2 32 1 7")
df[] = lapply(df, as.double) # side-note: tidyverse has become very strict about types
One dplyr and stringi option may be:
df %>%
mutate(across(everything(),
~ if_else(. %in% c(21, 32, 41, 71), as.numeric(stri_reverse(.)), .))) # as.numeric keeps both branches double
combo_2 combo_4 combo_7 combo_9
1 12 23 14 17
2 12 23 14 17
3 2 3 1 7
4 1 2 4 1
5 12 23 14 17
6 2 23 1 7
Using mapply:
df[] <- mapply(function(d, x1, x2){ ifelse(d == x1, x2, d) },
d = df,
x1 = c(21, 32, 41, 71),
x2 = c(12, 23, 14, 17))
df
# combo_2 combo_4 combo_7 combo_9
# 1 12 23 14 17
# 2 12 23 14 17
# 3 2 3 1 7
# 4 1 2 4 1
# 5 12 23 14 17
# 6 2 23 1 7

Assigning values to a new column based on a condition between two dataframes

I have two dataframes. I need to add the value of one column to every row in the other dataframe where the values of a particular column meet a condition from the first dataframe.
df1:
a b
x 23
s 34
v 15
g 05
k 69
df2:
x y z
1 0 10
2 10 20
3 20 30
4 30 40
5 40 50
6 50 60
7 60 70
Desired output:
a b n
x 23 3
s 34 4
v 15 2
g 05 1
k 69 7
In my dataset the intervals are large, and it's unlikely that a value from df1 is exactly on the boundary of a df2 interval.
Essentially for every row in df1 I need to assign the number which corresponds to which range it fits into in df2. So if df1$b is between df2$y and df2$z, then assign the value of output$n as the corresponding value of df2$x. This is quite a wordy question, so please ask if I need to clarify.
df1 = read.table(text = "
a b
x 23
s 34
v 15
g 05
k 69
", header=T, stringsAsFactors=F)
df2 = read.table(text = "
x y z
1 0 10
2 10 20
3 20 30
4 30 40
5 40 50
6 50 60
7 60 70
", header=T, stringsAsFactors=F)
# function: return the df2$x label of the first interval that contains x
f = function(x) df2$x[min(which(x >= df2$y & x <= df2$z))]
f = Vectorize(f)
# apply function
df1$n = f(df1$b)
# check updated dataset
df1
# a b n
# 1 x 23 3
# 2 s 34 4
# 3 v 15 2
# 4 g 5 1
# 5 k 69 7
You can try:
library(tidyverse)
df1 %>%
rowwise() %>%
mutate(n=df2[ b > df2$y & b <= df2$z,1]) %>%
ungroup()
# A tibble: 5 x 3
a b n
<chr> <int> <int>
1 x 23 3
2 s 34 4
3 v 15 2
4 g 5 1
5 k 69 7
As already commented, you have to change < or > to <= or >= according to your needs.
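When the intervals in df2 are sorted and contiguous, as in the example, findInterval gives a vectorized base R alternative. This is a sketch under that assumption. Note that findInterval treats each interval as closed on the left, [y, z), and that a value above the last z would still be assigned to the last row:

```r
df1 <- data.frame(a = c("x", "s", "v", "g", "k"),
                  b = c(23, 34, 15, 5, 69),
                  stringsAsFactors = FALSE)
df2 <- data.frame(x = 1:7,
                  y = seq(0, 60, by = 10),
                  z = seq(10, 70, by = 10))

# findInterval returns, for each b, the index of the last y that is <= b;
# that index is then used to look up the label stored in df2$x
df1$n <- df2$x[findInterval(df1$b, df2$y)]
df1
```

Because nothing is evaluated row by row, this scales better than the Vectorize and rowwise versions when df1 is large.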

selecting middle n rows in R

I have a data.table in R, say df.
row.number <- c(1:20)
a <- c(rep("A", 10), rep("B", 10))
b <- c(sample(c(0:100), 20, replace = TRUE))
df <-data.table(row.number,a,b)
df
row.number a b
1 1 A 14
2 2 A 59
3 3 A 39
4 4 A 22
5 5 A 75
6 6 A 89
7 7 A 11
8 8 A 88
9 9 A 22
10 10 A 6
11 11 B 37
12 12 B 42
13 13 B 39
14 14 B 8
15 15 B 74
16 16 B 67
17 17 B 18
18 18 B 12
19 19 B 56
20 20 B 21
I want to take n rows (say 10) from the middle after sorting the records in increasing order of column b.
Use setorder to sort and .N to filter (with 20 rows and n = 10 this selects rows 6 to 15, leaving 5 rows on each side):
setorder(df, b)[(.N/2 - 10/2 + 1):(.N/2 + 10/2), ]
row.number a b
1: 11 B 36
2: 5 A 38
3: 8 A 41
4: 18 B 43
5: 1 A 50
6: 12 B 51
7: 15 B 54
8: 3 A 55
9: 20 B 59
10: 4 A 60
You could use the following code
library(data.table)
set.seed(9876) # for reproducibility
# your data
row.number <- c(1:20)
a <- c(rep("A", 10), rep("B", 10))
b <- c(sample(c(0:100), 20, replace = TRUE))
df <- data.table(row.number,a,b)
df
# define how many to select and store in n
n <- 10
# calculate how many to cut off at start and end
n_not <- (nrow(df) - n )/2
# use data.tables setorder to arrange based on column b
setorder(df, b)
# select the rows wanted based on n
df[ (n_not+1):(nrow(df)-n_not), ]
Please let me know whether this is what you want.
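The same selection can be wrapped in a small base R helper. This is a sketch only: the function name middle_rows is made up, and it assumes n is no larger than the number of rows. When nrow(d) - n is odd, the extra row is dropped from the end:

```r
# Return the n middle rows of d after sorting by the column named col
middle_rows <- function(d, col, n) {
  d <- d[order(d[[col]]), , drop = FALSE]
  skip <- floor((nrow(d) - n) / 2)   # rows to leave out at the start
  d[(skip + 1):(skip + n), , drop = FALSE]
}

# Deterministic toy data so the result is easy to check by eye
d <- data.frame(id = 1:20, b = 20:1)
m <- middle_rows(d, "b", 10)
m
```

Here the sorted values of b run from 1 to 20, so the 10 middle rows are those with b from 6 to 15.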

How can we apply a function to a column vector from every set of contiguously matching rows of a data frame

For example, using column 1 as the matching criterion, let's call replicate(length(v), sum(v)) on the column 2 vector, v, of every set of contiguous, matching rows of the data frame A (including sets of size 1).
A v
a 12
a 43
b 8
a 4
b 12
c 5
c 9
d 21
->
55, 55, 8, 4, 12, 14, 14, 21
The operation can return a vector or a list of vectors that we can coerce to a vector with unlist().
Here's a simple solution using data.table, simply because of its built-in rleid function and because it handles factors seamlessly:
library(data.table)
setDT(df)[, res := sum(v), by = rleid(A)]
df
# A v res
# 1: a 12 55
# 2: a 43 55
# 3: b 8 8
# 4: a 4 4
# 5: b 12 12
# 6: c 5 14
# 7: c 9 14
# 8: d 21 21
If we want base R we could either recreate rleid or just combine cumsum with ave
with(df, ave(v, cumsum(c(TRUE, head(A, -1) != tail(A, -1))), FUN = sum))
# [1] 55 55 8 4 12 14 14 21
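For the other base R route mentioned above, recreating rleid, a minimal sketch with rle could look like this (the name run_id is made up for illustration, and A is assumed to be a character vector, since rle does not accept factors):

```r
A <- c("a", "a", "b", "a", "b", "c", "c", "d")
v <- c(12, 43, 8, 4, 12, 5, 9, 21)

# rle gives the length of each run of identical consecutive values;
# repeating the run number by its length labels every row with its run
runs   <- rle(A)
run_id <- rep(seq_along(runs$lengths), runs$lengths)

# Sum v within each run and broadcast the sum back to every row of the run
res <- ave(v, run_id, FUN = sum)
res
```

The second "a" run (row 4) gets its own id, so it is not merged with the first two rows even though the value is the same.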
Here is an option using dplyr
library(dplyr)
df1 %>%
group_by(A1 = cumsum(A!= dplyr::lag(A, default=A[1]))) %>%
mutate(res = sum(v)) %>%
ungroup() %>%
select(-A1)
# A v res
# (chr) (int) (int)
#1 a 12 55
#2 a 43 55
#3 b 8 8
#4 a 4 4
#5 b 12 12
#6 c 5 14
#7 c 9 14
#8 d 21 21

subset data frame on vector sequence

I have the data frame df and I want to subset df based on a number sequence within a categorical.
x <- c(1,2,3,4,5,7,9,11,13)
x2 <- x+77
df <- data.frame(x=c(x,x2),y= c(rep("A",9),rep("B",9)))
df
x y
1 1 A
2 2 A
3 3 A
4 4 A
5 5 A
6 7 A
7 9 A
8 11 A
9 13 A
10 78 B
11 79 B
12 80 B
13 81 B
14 82 B
15 84 B
16 86 B
17 88 B
18 90 B
I want only the rows where x increments by 1 and not the rows where x increases by two: e.g.
x y
1 1 A
2 2 A
3 3 A
4 4 A
5 5 A
10 78 B
11 79 B
12 80 B
13 81 B
14 82 B
I figured I have to do some sort of subtraction between elements and check if the difference is >1 and combine this with a ddply, but this seems cumbersome. Is there a sort of sequence function I am missing?
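Using diff within each level of y (a plain diff over the whole column would drop row 10, because the jump from 13 to 78 across the group boundary is not 1):
df[with(df, ave(x, y, FUN = function(v) c(1, diff(v)))) == 1, ]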
Your example seems to behave well and can be nicely handled by @agstudy's answer. Should your data act up one day, though...
myfun <- function(d, whichDiff = 1) {
# d is the data.frame you'd like to subset, containing the variable 'x'
# whichDiff is the difference between values of x you're looking for
theWh <- which(!as.logical(diff(d$x) - whichDiff))
# Take the diff of x, subtract whichDiff to get the desired values equal to 0
# Coerce this to a logical vector and take the inverse (!)
# which() gets the indexes that are TRUE.
# allWh <- sapply(theWh, "+", 1)
# Since the desired rows may be disjoint, use sapply to get each index + 1
# Seriously? sapply to add 1 to a numeric vector? Not even on a Friday.
allWh <- theWh + 1
return(d[sort(unique(c(theWh, allWh))), ])
}
> library(plyr)
>
> ddply(df, .(y), myfun)
x y
1 1 A
2 2 A
3 3 A
4 4 A
5 5 A
6 78 B
7 79 B
8 80 B
9 81 B
10 82 B
