Using 3 different data sets to output a 4th dataframe - r

I’m having trouble combining three sets of data (df1, df2, vec1) to produce a new dataframe, df3. I have two dataframes, df1 and df2. In df1, each letter in X1 corresponds to a value in X2. In df2, X3 holds a numerical value found in vec1, and X4 holds one or more letters from df1$X1. For each row of df2, I want to find the run of consecutive values in vec1 that contains df2$X3, take its length N, and remove any letters in df2$X4 whose value in df1$X2 exceeds N.
For example, in df2[1, ], the letters are “A, B, D” and the value is 3. Looking at vec1, the longest run of consecutive values that includes 3 is “2, 3, 4, 5”, which has length 4. B needs 8 according to df1, so df2[1, 2] should become “A, D” instead of “A, B, D”. The final output should look like df3. Any ideas would be greatly appreciated.
df1 <- data.frame(c("A", "B", "C", "D"), c(4, 8, 1, 3))
colnames(df1) <- c("X1", "X2")
df2 <- data.frame(c(3, 21, 27, 34, 35, 46), c("A, B, D", "A, C", NA, "B", "B, D", "C"))
colnames(df2) <- c("X3", "X4")
vec1 <- c(2, 3, 4, 5, 21, 22, 23, 27, 33, 34, 35, 36, 37, 38, 39, 46)
df3 <- data.frame(c(3, 21, 27, 34, 35, 46), c("A, D", "C", NA, NA, "D", NA))

This is not elegant but it may do what you need it to do.
First, split vec1 into a list of runs of consecutive integers:
vec1_seq <- split(vec1, cumsum(c(0, diff(vec1) > 1)))
$`0`
[1] 2 3 4 5
$`1`
[1] 21 22 23
$`2`
[1] 27
$`3`
[1] 33 34 35 36 37 38 39
$`4`
[1] 46
Then, do the following. For each row, find the list element that contains X3 and take its length. Then keep only those letters whose df1$X2 value does not exceed that length:
cbind(df2,
      X5 = apply(df2, 1, function(x) {
        # length of the run of consecutive integers that contains X3
        l <- length(unlist(vec1_seq[sapply(seq_along(vec1_seq), function(i) {
          as.numeric(x[["X3"]]) %in% vec1_seq[[i]]
        })]))
        # keep only the letters whose df1$X2 value is at most that length
        toString(na.omit(as.vector(sapply(trimws(unlist(strsplit(x[["X4"]], ","))), function(i) {
          ifelse(i == df1[["X1"]] & df1[["X2"]] <= l, i, NA)
        }))))
      }))
It seems that "C" should remain for row 6; if that is incorrect let me know.
Output
  X3      X4   X5
1  3 A, B, D A, D
2 21    A, C    C
3 27    <NA>
4 34       B
5 35    B, D    D
6 46       C    C
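The same two steps can also be written without apply() over the rows. The following is only a minimal sketch, assuming every X3 value occurs in vec1; the names run_id, run_len, len_for_row and keep_letters are illustrative and not part of the original answer.

```r
df1 <- data.frame(X1 = c("A", "B", "C", "D"), X2 = c(4, 8, 1, 3),
                  stringsAsFactors = FALSE)
df2 <- data.frame(X3 = c(3, 21, 27, 34, 35, 46),
                  X4 = c("A, B, D", "A, C", NA, "B", "B, D", "C"),
                  stringsAsFactors = FALSE)
vec1 <- c(2, 3, 4, 5, 21, 22, 23, 27, 33, 34, 35, 36, 37, 38, 39, 46)

# Label each run of consecutive integers, then record the run length per value
run_id  <- cumsum(c(0, diff(vec1) > 1))
run_len <- ave(vec1, run_id, FUN = length)

# Run length that applies to each row of df2 (NA if X3 is not in vec1)
len_for_row <- run_len[match(df2$X3, vec1)]

# Hypothetical helper: keep letters whose df1$X2 value is at most max_len
keep_letters <- function(letters_chr, max_len) {
  if (is.na(letters_chr) || is.na(max_len)) return("")
  parts    <- trimws(strsplit(letters_chr, ",")[[1]])
  required <- df1$X2[match(parts, df1$X1)]
  toString(parts[!is.na(required) & required <= max_len])
}

df2$X5 <- mapply(keep_letters, df2$X4, len_for_row, USE.NAMES = FALSE)
df2
```

As in the answer above, rows with no surviving letters get an empty string rather than NA.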

Related

Look up/match values within the same dataframe column in R

Given data.frame(code=c(10, 20, 21, 22, 23, 31, 32, 40, 50), label=c("a", "b", "c", "d", "e", "f", "g", "h", "i")), I'd like c("", "", "b", "b", "b", "", "", "", "").
If the value is not a multiple of 10, assign the label of the immediately previous multiple of 10 if it is listed. If the immediately previous multiple of 10 is not listed, assign blank. If the value is a multiple of 10, assign blank. (Unlike this dummy example, multiple sequences of non-multiples of 10 may occur in the data and the values may not be ordered.)
Ideally, I'd like to do this as a vector operation in base R, for speed and parsimony.
EDIT: I was trying to simplify my question as much as possible but maybe it was misleading so here is the final output I'm aiming for: data.frame(code=c(10, 20, 21, 22, 23, 31, 32, 40, 50), label=c("a", "b", "b c", "b d", "b e", "f", "g", "h", "i")). That is: prepend the intermediate output to the label column.
This looks like overkill, but it seems to work:
library(dplyr)
library(tidyr)
df %>%
  # arrange the data based on value
  arrange(code) %>%
  # get closest multiple of 10
  mutate(multiple10 = floor(code/10) * 10,
         # if completely divisible by 10 assign label, else NA
         result = ifelse(code %% 10 == 0, label, NA)) %>%
  # for each multiple of 10
  group_by(multiple10) %>%
  # fill NA by most recent non-NA in the group
  fill(result) %>%
  ungroup %>%
  # turn NA to blank along with numbers which are completely divisible by 10
  mutate(result = replace(result, code == multiple10 | is.na(result), ''))
# code label multiple10 result
# <dbl> <chr> <dbl> <chr>
#1 10 a 10 ""
#2 20 b 20 ""
#3 21 c 20 "b"
#4 22 d 20 "b"
#5 23 e 20 "b"
#6 31 f 30 ""
#7 32 g 30 ""
#8 40 h 40 ""
#9 50 i 50 ""
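Since the question asked for a vectorised base R approach, here is one possible sketch using match(). It assumes codes are unique and label is a character column; base10, idx, prefix and combined are illustrative names, not from the answer above.

```r
df <- data.frame(code  = c(10, 20, 21, 22, 23, 31, 32, 40, 50),
                 label = c("a", "b", "c", "d", "e", "f", "g", "h", "i"),
                 stringsAsFactors = FALSE)

# Multiple of 10 at or below each code
base10 <- floor(df$code / 10) * 10

# Row holding that multiple of 10 (NA if it is not listed); order does not matter
idx <- match(base10, df$code)

# Blank for multiples of 10 and for codes whose multiple of 10 is missing
prefix <- ifelse(df$code %% 10 == 0 | is.na(idx), "", df$label[idx])

# Final output from the EDIT: prepend the prefix to the label
df$combined <- trimws(paste(prefix, df$label))
df
```

Because match() looks the value up directly, no sorting or grouping is needed, which also covers the unordered case mentioned in the question.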

How to scale segments of a column in an R data frame?

I have a data frame with a numeric value and a category. I need to scale the numeric value, but only with respect to those observations of its own category (hopefully without splitting up the dataframe into pieces and then using rbind to stitch it back up).
Here is the example:
df <- data.frame(x = c(1, 2, 3, 4, 5, 20, 22, 24, 25, 27, 12, 13, 12, 15, 17),
y = c("A", "A", "A", "A", "A", "B", "B", "B", "B", "B", "C", "C", "C", "C", "C"))
This function would give me a scale of the whole column, but I want the scales to be in relation only to the same category (ie A, B, and C).
df$z <- scale(df$x)
Appreciate the help!
Apply the same function (scale) by group.
In base R
df$z <- with(df, ave(x, y, FUN = scale))
df
# x y z
#1 1 A -1.26491
#2 2 A -0.63246
#3 3 A 0.00000
#4 4 A 0.63246
#5 5 A 1.26491
#6 20 B -1.33242
#7 22 B -0.59219
#8 24 B 0.14805
#9 25 B 0.51816
#10 27 B 1.25840
#11 12 C -0.83028
#12 13 C -0.36901
#13 12 C -0.83028
#14 15 C 0.55352
#15 17 C 1.47605
Using dplyr
library(dplyr)
df %>% group_by(y) %>% mutate(z = as.numeric(scale(x)))  # scale() returns a one-column matrix; as.numeric() flattens it
Or data.table
library(data.table)
setDT(df)[, z:= scale(x), y]
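To see what scaling by group does, the standardisation can be spelled out by hand and checked per category. This is only an illustrative sketch; the anonymous function stands in for scale() and is not part of the answer above.

```r
df <- data.frame(x = c(1, 2, 3, 4, 5, 20, 22, 24, 25, 27, 12, 13, 12, 15, 17),
                 y = rep(c("A", "B", "C"), each = 5),
                 stringsAsFactors = FALSE)

# Subtract the group mean and divide by the group standard deviation
df$z <- ave(df$x, df$y, FUN = function(v) (v - mean(v)) / sd(v))

# Within each category the scaled values should have mean 0 and sd 1
group_means <- tapply(df$z, df$y, mean)
group_sds   <- tapply(df$z, df$y, sd)
group_means
group_sds
```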

What's the easiest way for the multiple if-else conditions?

I have a data set like this:
library(tibble)
df_1 <- tribble(
~A, ~B, ~C,
10, 10, NA,
NA, 34, 15,
40, 23, NA,
4, 12, 18,
)
Now, I want to compare A, B, and C for each row and add a new column that shows which of them holds the minimum. Here is what the desired data looks like:
df_2 <- tribble(
~A, ~B, ~C, ~Winner,
10, 10, NA, "Same",
NA, 34, 15, "C",
40, 23, NA, "B",
4, 12, 18, "A",
)
There are four possible outcomes: Same, A wins, B wins, or C wins.
How would you code to get this result?
Thanks in advance.
Here is something:
foo <- function(x) {
  # positions of the row minimum, ignoring NAs
  rmin <- which(x == min(x, na.rm = TRUE))
  if (length(rmin) > 1) "same" else names(rmin)
}
apply(df_1, 1, foo)
[1] "same" "C" "B" "A"
You can add this as a column to your data.frame with:
df_1$winner <- apply(df_1, 1, foo)
# A tibble: 4 x 4
A B C winner
<dbl> <dbl> <dbl> <chr>
1 10 10 NA same
2 NA 34 15 C
3 40 23 NA B
4 4 12 18 A
If you have more variables and only want to use some you can use a character vector:
vars <- c("A", "B", "C")
apply(df_1[vars], 1, foo)
Another option, using dplyr:
library(dplyr)
df_1 <- tribble(
~A, ~B, ~C,
10, 10, NA,
NA, 34, 15,
40, 23, NA,
4, 12, 18,
)
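df_1 %>%
  mutate(
    winner = colnames(df_1)[apply(df_1, 1, which.min)],
    # note: this flags 'same' whenever any two columns are equal, even if the tie is not at the minimum
    winner = if_else(A == B | B == C | A == C, 'same', winner, missing = winner))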
# A tibble: 4 x 4
A B C winner
<dbl> <dbl> <dbl> <chr>
1 10 10 NA same
2 NA 34 15 C
3 40 23 NA B
4 4 12 18 A
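For larger data, the row-wise apply() can be avoided. The sketch below is one possible vectorised variant in base R, assuming every row has at least one non-missing value; cols, m, row_min and is_min are illustrative names, not from the answers above.

```r
df_1 <- data.frame(A = c(10, NA, 40, 4),
                   B = c(10, 34, 23, 12),
                   C = c(NA, 15, NA, 18))
cols <- c("A", "B", "C")
m <- as.matrix(df_1[cols])

# Row-wise minimum across the chosen columns, ignoring NAs
row_min <- do.call(pmin, c(as.list(df_1[cols]), na.rm = TRUE))

# TRUE where a cell equals its row minimum (NA cells count as FALSE)
is_min <- !is.na(m) & m == row_min

# More than one minimum in a row means a tie; otherwise take the column name
winner <- ifelse(rowSums(is_min) > 1,
                 "Same",
                 colnames(m)[max.col(is_min * 1, ties.method = "first")])
df_1$Winner <- winner
df_1
```

Unlike a pairwise equality check, this reports a tie only when the tied values are the row minimum.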

Sorting a column based on the order of another column in R

The R script below creates a data frame a123 with three columns. Column a1 has three distinct values occurring at different positions, each with corresponding a2 and a3 values.
a1 = c("A", "B", "C", "A", "B", "B", "A", "C", "A", "C", "B")
a2 = c( 10, 8, 11 , 6 , 4 , 7 , 9 , 1 , 3 , 2, 7)
a3 = c( 55, 34, 33, 23, 78, 33, 123, 34, 85, 76, 74)
a123 = data.frame(a1, a2, a3)
Within each a1 group, I want the a3 values arranged in ascending order according to the order of the a2 values. If tied a2 values are encountered, the corresponding a3 values should also be arranged in ascending order. For example, the value "A" in column a1 has the following values in a2 and a3:
a2 = c(10, 6, 9, 3)
a3 = c(55, 23, 123, 85)
The values can be like:
a3 = c(123, 23, 85, 55)
Expected Outcome:
a1 = c("A", "B", "C", "A", "B", "B", "A", "C", "A", "C", "B")
a2 = c( 10, 8, 11, 6, 4, 7, 9, 1, 3, 2, 7)
a3 = c( 123, 78, 76, 23, 33, 34, 85, 33, 55, 34, 74)
a123 = data.frame(a1, a2, a3)
Thanks, and please help. Note: please try to avoid loops and conditions, as they might slow the computation on large data.
A solution using dplyr, sort, and rank. I do not fully understand your logic, but this is probably what you are looking for. Notice that I assume the elements of a3 in group A are 123, 55, 85, 23.
library(dplyr)
a123_r <- a123 %>%
  group_by(a1) %>%
  mutate(a3 = sort(a3, decreasing = TRUE)[rank(-a2, ties.method = "last")]) %>%
  ungroup() %>%
  as.data.frame()
a123_r
# a1 a2 a3
# 1 A 10 123
# 2 B 8 78
# 3 C 11 76
# 4 A 6 55
# 5 B 4 33
# 6 B 7 34
# 7 A 9 85
# 8 C 1 33
# 9 A 3 23
# 10 C 2 34
# 11 B 7 74
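The same sort-and-rank idea can be expressed in base R with split() and unsplit(). This is a rough sketch that simply mirrors the dplyr pipeline above under the same assumption about the intended ordering; a3_new is an illustrative name.

```r
a1 <- c("A", "B", "C", "A", "B", "B", "A", "C", "A", "C", "B")
a2 <- c(10, 8, 11, 6, 4, 7, 9, 1, 3, 2, 7)
a3 <- c(55, 34, 33, 23, 78, 33, 123, 34, 85, 76, 74)
a123 <- data.frame(a1, a2, a3, stringsAsFactors = FALSE)

grp <- factor(a123$a1)

# Per group: sort a3 in decreasing order, then hand the values out
# according to the descending rank of a2 (ties resolved with "last")
a3_new <- unsplit(
  lapply(split(a123, grp), function(g) {
    sort(g$a3, decreasing = TRUE)[rank(-g$a2, ties.method = "last")]
  }),
  grp
)

a123_r <- transform(a123, a3 = a3_new)
a123_r
```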

Compare and merge two dataframes

I have the following two dataframes in R:
df1 = data.frame(c("A", "A", "A", "B", "B"), c(1, 11, 21, 35, 45), c(6, 20, 30, 40, 60), c(1, 2, 3, 4, 5))
colnames(df1) = c("X", "Y", "Z", "score")
df1
X Y Z score
1 A 1 6 1
2 A 11 20 2
3 A 21 30 3
4 B 35 40 4
5 B 45 60 5
df2 = data.frame(c("A", "A", "A", "A", "B", "B", "B", "C"), c(1, 6, 21, 50, 20, 31, 50, 10), c(5, 20, 30, 60, 30, 40, 60, 20), c("x1", "x2", "x3", "x4", "x5", "x6", "x7", "x8"))
colnames(df2) = c("X", "Y", "Z", "out")
df2
X Y Z out
1 A 1 5 x1
2 A 6 20 x2
3 A 21 30 x3
4 A 50 60 x4
5 B 20 30 x5
6 B 31 40 x6
7 B 50 60 x7
8 C 10 20 x8
For every row in df1, I want to check:
is there a match with the value in 'X' and any other 'X' value from df2
if the above is true: I want to check if the values from 'Y' and 'Z' are in the range of the values 'Y' and 'Z' from df2
if both are true: then I want to add the value from 'out' to df1.
This is what the output should look like:
output = data.frame(c("A", "A", "A", "B", "B"), c(1, 11, 21, 35, 45), c(6, 20, 30, 40, 60), c(1, 2, 3, 4, 5), c("x1, x2", "x2", "x3", "x6", "x7"))
colnames(output) = c("X", "Y", "Z", "score", "out")
X Y Z score out
1 A 1 6 1 x1, x2
2 A 11 20 2 x2
3 A 21 30 3 x3
4 B 35 40 4 x6
5 B 45 60 5 x7
The original df1 is kept with an extra column 'out' that is added.
Line 1 of 'output' contains 'x1, x2' in column 'out' because the values in column 'X' match and the range 1 to 6 overlaps with lines 1 and 2 of df2.
I've asked this question before (Compare values from two dataframes and merge) where it is suggested to use the foverlaps function. However because of the different columns between df1 and df2 and the extra rows in df2, I cannot make it work.
Here are two possible ways: a) using the newly implemented non-equi joins feature, and b) foverlaps, as you specifically mentioned it.
a) non-equi joins
library(data.table)
dt1 <- as.data.table(df1)
dt2 <- as.data.table(df2)
dt2[dt1, on=.(X, Z>=Y, Y<=Z),
    .(score, out=paste(out, collapse=",")),
    by=.EACHI]
where dt1 and dt2 are data.tables corresponding to df1 and df2. Note that you'll have to swap the column names Z and Y in the result (since the column names come from dt2 but the values come from dt1).
The rows of dt2 that match each row of dt1 are found based on the condition provided to the on argument, and .() is evaluated for each set of matching rows (because of by=.EACHI).
b) foverlaps
setkey(dt1, X, Y, Z)
olaps <- foverlaps(dt2, dt1, type="any", nomatch=0L)
olaps[, .(score=score[1L], out=paste(out, collapse=",")), by=.(X,Y,Z)]
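The overlap condition behind both approaches (a df2 range overlaps a df1 range when each one starts no later than the other ends) can be checked in base R. The following is only a sketch for small data, since it first builds every same-X pair; cand, hits, collapsed and result are illustrative names.

```r
df1 <- data.frame(X = c("A", "A", "A", "B", "B"),
                  Y = c(1, 11, 21, 35, 45),
                  Z = c(6, 20, 30, 40, 60),
                  score = c(1, 2, 3, 4, 5),
                  stringsAsFactors = FALSE)
df2 <- data.frame(X = c("A", "A", "A", "A", "B", "B", "B", "C"),
                  Y = c(1, 6, 21, 50, 20, 31, 50, 10),
                  Z = c(5, 20, 30, 60, 30, 40, 60, 20),
                  out = c("x1", "x2", "x3", "x4", "x5", "x6", "x7", "x8"),
                  stringsAsFactors = FALSE)

# All pairs of rows sharing the same X; df2's range columns become Y.2 and Z.2
cand <- merge(df1, df2, by = "X", suffixes = c("", ".2"))

# Keep pairs whose ranges overlap
hits <- cand[cand$Y <= cand$Z.2 & cand$Y.2 <= cand$Z, ]

# Collapse the matching 'out' values for each original df1 row
collapsed <- aggregate(out ~ X + Y + Z + score, data = hits,
                       FUN = function(v) toString(sort(v)))

# Left join back so df1 rows without any overlap would keep NA in 'out'
result <- merge(df1, collapsed, all.x = TRUE)
result
```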
library(dplyr)
df1 = data.frame(c("A", "A", "A", "B", "B"), c(1, 11, 21, 35, 45),
                 c(6, 20, 30, 40, 60), c(1, 2, 3, 4, 5), stringsAsFactors = FALSE)
colnames(df1) = c("X", "Y", "Z", "score")
df2 = data.frame(c("A", "A", "A", "A", "B", "B", "B", "C"), c(1, 6, 21, 50, 20, 31, 50, 10),
                 c(5, 20, 30, 60, 30, 40, 60, 20),
                 c("x1", "x2", "x3", "x4", "x5", "x6", "x7", "x8"), stringsAsFactors = FALSE)
colnames(df2) = c("X", "Y", "Z", "out")
df1 %>%
left_join(df2, by="X") %>% # join on main column
rowwise() %>% # for each row
mutate(counter = sum(seq(Y.x, Z.x) %in% seq(Y.y, Z.y))) %>% # get how many elements of those ranges overlap
filter(counter > 0) %>% # keep rows with overlap
group_by(X, Y.x, Z.x, score) %>% # for each combination of those columns
summarise(out = paste(out, collapse=", ")) %>% # combine out column
ungroup() %>%
rename(Y = Y.x,
Z = Z.x)
# # A tibble: 5 × 5
# X Y Z score out
# <chr> <dbl> <dbl> <dbl> <chr>
# 1 A 1 6 1 x1, x2
# 2 A 11 20 2 x2
# 3 A 21 30 3 x3
# 4 B 35 40 4 x6
# 5 B 45 60 5 x7
The above process is based on the dplyr package and involves a join and some grouping and filtering. If your initial datasets (df1, df2) are extremely large, the join will create an even bigger dataset that will take some time to build.
Also, note that this process works with character, not factor, variables. The process might convert factor variables to character if it tries to join factor variables with different levels.
I'd suggest you run the chained commands step by step to see how it works and to spot anything I missed that might lead to bugs in the code.
Here is another option using sqldf
library(sqldf)
xx=sqldf('select t1.*,t2.out from df1 t1 left join df2 t2 on t1.X=t2.X and ((t2.Y between t1.Y and t1.Z) or (t2.Z between t1.Y and t1.Z))')
aggregate(xx[ncol(xx)], xx[-ncol(xx)], FUN = function(X) paste(unique(X), collapse=", "))
