Matching column in dataframe by nearest values in column of other dataframe - r

Hello, I have a question about matching two data.frames.
Consider I have two datasets:
Dataframe 1:
"A" "B"
91 1
92 3
93 11
94 4
95 10
96 6
97 7
98 8
99 9
100 2
structure(list(A = 91:100, B = c(1, 3, 11, 4, 10, 6, 7, 8, 9,
2)), .Names = c("A", "B"), row.names = c(NA, -10L), class = "data.frame")
Dataframe 2:
"C" "D"
91.12 1
92.34 3
93.65 11
94.23 4
92.14 10
96.98 6
97.22 7
98.11 8
93.15 9
100.67 2
91.25 1
96.45 3
83.78 11
84.66 4
100 10
structure(list(C = c(91.12, 92.34, 93.65, 94.23, 92.14, 96.98,
97.22, 98.11, 93.15, 100.67, 91.25, 96.45, 83.78, 84.66, 100),
D = c(1, 3, 11, 4, 10, 6, 7, 8, 9, 2, 1, 3, 11, 4, 10)), .Names = c("C",
"D"), row.names = c(NA, -15L), class = "data.frame")
Now I want to find the rounded matches between columns A and C (i.e. where round(C) equals A) and replace column D with the corresponding value from column B of Dataframe 1. Where there is no corresponding value (no rounded match between A and C), the new column D should contain NaN.
result:
"C" "newD"
91.12 1
92.34 3
93.65 4
94.23 4
92.14 3
96.98 7
97.22 7
98.11 8
93.15 11
100.67 NaN
91.25 1
96.45 6
83.78 NaN
84.66 NaN
100 2
structure(list(C = c(91.12, 92.34, 93.65, 94.23, 92.14, 96.98,
97.22, 98.11, 93.15, 100.67, 91.25, 96.45, 83.78, 84.66, 100),
D = c(1, 3, 4, 4, 3, 7, 7, 8, 11, NaN, 1, 6, NaN, NaN, 2)), .Names = c("C",
"D"), row.names = c(NA, -15L), class = "data.frame")
Does anybody know how to do this, especially for large datasets?
Thanks a lot!

Making an update join with data.table:
library(data.table)
setDT(DF1); setDT(DF2)
DF2[, A := round(C)]
DF2[, D := DF1[DF2, on=.(A), x.B] ]
# alternatively, chain the two steps together:
DF2[, A := round(C)][, D := DF1[DF2, on=.(A), x.B] ]
This gives NA in unmatched rows. To switch those to NaN, as the question asks, use DF2[is.na(D), D := NaN].
To drop the new DF2$A column, use DF2[, A := NULL].
As for doing this efficiently on large datasets: this modifies DF2 in place (instead of making a new table, as a vanilla join like Mike's answer does), so it should be fairly efficient for large tables. It may perform better if A is stored as an integer rather than a float in both tables.
On data.table 1.9.6, use on="A", B instead of on=.(A), x.B. Thanks to Mike H for checking this.
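Putting the pieces together, here is a self-contained sketch of the whole update join on the question's data, including the integer-key suggestion (DF1 and DF2 stand for Dataframe 1 and Dataframe 2):
library(data.table)

DF1 <- data.table(A = 91:100, B = c(1, 3, 11, 4, 10, 6, 7, 8, 9, 2))
DF2 <- data.table(C = c(91.12, 92.34, 93.65, 94.23, 92.14, 96.98, 97.22,
                        98.11, 93.15, 100.67, 91.25, 96.45, 83.78, 84.66, 100))

DF2[, A := as.integer(round(C))]      # integer helper key, per the note above
DF2[, D := DF1[DF2, on = .(A), x.B]]  # update join: pull B from DF1
DF2[is.na(D), D := NaN]               # unmatched rows become NaN, as requested
DF2[, A := NULL]                      # drop the helper key column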

You can create a lookup table where the values in A are used to look up the values in B.
Lookup = df1$B
names(Lookup) = df1$A  # named vector: the names are the values of A
df3 = data.frame(C = df2$C, newD = Lookup[as.character(round(df2$C))])
df3$newD[is.na(df3$newD)] = NaN  # failed lookups return NA; convert to NaN

For these types of merges I like SQL:
library(sqldf)
res <- sqldf("SELECT l.C, r.B
FROM df2 as l
LEFT JOIN df1 as r
on round(l.C) = round(r.A)")
res
# C B
#1 91.12 1
#2 92.34 3
#3 93.65 4
#4 94.23 4
#5 92.14 3
#6 96.98 7
#7 97.22 7
#8 98.11 8
#9 93.15 11
#10 100.67 NA
#11 91.25 1
#12 96.45 6
#13 83.78 NA
#14 84.66 NA
#15 100.00 2
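Note that the SQL left join yields NA rather than NaN in the unmatched rows. If you specifically need NaN, as in the question's expected output, a small post-processing step does it:
res$B[is.na(res$B)] <- NaN  # turn the unmatched NAs into NaN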

Related

How to run a function over all the values of a column/variable for multiple columns/variables

I'm new to R, so I'd be grateful if someone could help me here; I've tried a lot of things myself without success and I'm getting frustrated!
I have a big dataset that I have manipulated into two types of dataframe layouts, where the variables of interest (A, B, C...) are either unique rows or unique columns. The variables (A, B, C...) are categorical, and their values are integers.
LAYOUT 1
A, 1, 6, 11...
B, 2, 7, 12...
C, 3, 8, 13...
D, 4, 9, 14...
E, 5, 10, 15...
LAYOUT 2
A, B, C, D, E...
1, 2, 3, 4, 5...
6, 7, 8, 9, 10...
11, 12, 13, 14, 15...
I want to run a number of math functions like mean() over each variable (A, B, C...) and record the outcomes in a new dataframe that shows the result of each function for each variable, i.e.
X, mean_X, mode_X, sd_X...
A, mean(A), mode(A), sd(A)...
B, mean(B), mode(B), sd(B)...
C, mean(C), mode(C), sd(C)...
D, mean(D), mode(D), sd(D)...
E, mean(E), mode(E), sd(E)...
However, because the dataset is big, I can't do this manually by selecting each variable, and I can't figure out how to do it on either of the layouts.
I'm happy with whichever layout you choose, but is there a way to do this simply, preferably using just base, dplyr, and tidyr?
Thank you in advance!
It seems to me you are looking for apply():
A = c(1, 6, 11)
B = c(2, 7, 12)
C = c(3, 8, 13)
D = c(4, 9, 14)
df <- cbind.data.frame(A, B, C, D)
df$mean <- apply(df, 1, mean)  # MARGIN = 1 applies the function along rows, 2 along columns
df$sum <- apply(df[, c("A", "B", "C", "D")], 1, sum)  # sum only the original columns, not the new mean column
df
   A  B  C  D mean sum
1  1  2  3  4  2.5  10
2  6  7  8  9  7.5  30
3 11 12 13 14 12.5  50
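If your variables sit in the columns instead (the question's layout 2), summarise per column rather than per row. A minimal base R sketch of the per-variable table the question asks for, reusing df from above:
stats <- data.frame(X = names(df)[1:4],
                    mean = sapply(df[1:4], mean),  # sapply loops over the columns
                    sd = sapply(df[1:4], sd),
                    sum = sapply(df[1:4], sum))
stats
  X mean sd sum
A A    6  5  18
B B    7  5  21
C C    8  5  24
D D    9  5  27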
You can get the data in long format so that it is easier to apply multiple functions.
If you have Layout 1 like this :
layout1 <- structure(list(V1 = c("A", "B", "C", "D", "E"), V2 = 1:5, V3 = 6:10,
V4 = 11:15), class = "data.frame", row.names = c(NA, -5L))
layout1
# V1 V2 V3 V4
#1 A 1 6 11
#2 B 2 7 12
#3 C 3 8 13
#4 D 4 9 14
#5 E 5 10 15
You can do:
library(dplyr)
library(tidyr)
layout1 %>%
  pivot_longer(cols = where(is.numeric)) %>%
  group_by(V1) %>%
  summarise(mean = mean(value),
            sd = sd(value),
            sum = sum(value))
# V1 mean sd sum
# <chr> <dbl> <dbl> <int>
#1 A 6 5 18
#2 B 7 5 21
#3 C 8 5 24
#4 D 9 5 27
#5 E 10 5 30
If you have data in the form of layout 2:
layout2 <- structure(list(A = c(1L, 6L, 11L), B = c(2L, 7L, 12L), C = c(3L,
8L, 13L), D = c(4L, 9L, 14L), E = c(5L, 10L, 15L)),
class = "data.frame", row.names = c(NA, -3L))
layout2
# A B C D E
#1 1 2 3 4 5
#2 6 7 8 9 10
#3 11 12 13 14 15
You can apply the functions using across():
layout2 %>%
  summarise(across(everything(),
                   list(mean = mean, sd = sd, sum = sum),
                   .names = '{col}_{fn}')) %>%
  pivot_longer(cols = everything(),
               names_to = c('X', '.value'),
               names_sep = '_')
# A tibble: 5 x 4
# X mean sd sum
# <chr> <dbl> <dbl> <int>
#1 A 6 5 18
#2 B 7 5 21
#3 C 8 5 24
#4 D 9 5 27
#5 E 10 5 30

Replace values outside range with NA using replace_with_na function

I have the following dataset:
structure(list(a = c(2, 1, 9, 2, 9, 8), b = c(4, 5, 1, 9, 12,
NA), c = c(50, 34, 77, 88, 33, 60)), class = "data.frame", row.names = c(NA,
-6L))
a b c
1 2 4 50
2 1 5 34
3 9 1 77
4 2 9 88
5 9 12 33
6 8 NA 60
From column b I only want values between 4 and 9, and from column c values between 50 and 80, replacing the values outside those ranges with NA. This should result in:
structure(list(a = c(2, 1, 9, 2, 9, 8), b = c(4, 5, NA, 9, NA,
NA), c = c(50, NA, 77, NA, NA, 60)), class = "data.frame", row.names = c(NA,
-6L))
a b c
1 2 4 50
2 1 5 NA
3 9 NA 77
4 2 9 NA
5 9 NA NA
6 8 NA 60
I've tried several things with the replace_with_na_at function, of which this seemed the most logical:
test <- replace_with_na_at(data = test, .vars = "c",
                           condition = ~.x < 2 & ~.x > 2)
However, nothing I tried works. Does somebody know why? Thanks in advance! :)
You can subset with a logical vector testing your conditions.
x$b[x$b < 4 | x$b > 9] <- NA
x$c[x$c < 50 | x$c > 80] <- NA
x
# a b c
#1 2 4 50
#2 1 5 NA
#3 9 NA 77
#4 2 9 NA
#5 9 NA NA
#6 8 NA 60
Data:
x <- structure(list(a = c(2, 1, 9, 2, 9, 8), b = c(4, 5, 1, 9, 12,
NA), c = c(50, 34, 77, 88, 33, 60)), class = "data.frame", row.names = c(NA,
-6L))
Yet another base R solution, this time with the replacement function is.na<-:
is.na(test$b) <- with(test, b < 4 | b > 9)
is.na(test$c) <- with(test, c < 50 | c > 80)
A naniar solution with a pipe could be:
library(naniar)
library(magrittr)
test %>%
  replace_with_na_at(
    .vars = 'b',
    condition = ~(.x < 4 | .x > 9)
  ) %>%
  replace_with_na_at(
    .vars = 'c',
    condition = ~(.x < 50 | .x > 80)
  )
You should mention the packages you are using. From googling, I'm guessing you are using naniar. The problem appears to be that you did not properly specify the condition; the following should work:
library(naniar)
test <- structure(list(a = c(2, 1, 9, 2, 9, 8),
b = c(4, 5, 1, 9, 12, NA),
c = c(50, 34, 77, 88, 33, 60)),
class = "data.frame",
row.names = c(NA, -6L))
replace_with_na_at(test, "c", ~.x < 50 | .x > 80)
#> a b c
#> 1 2 4 50
#> 2 1 5 NA
#> 3 9 1 77
#> 4 2 9 NA
#> 5 9 12 NA
#> 6 8 NA 60
Created on 2020-06-02 by the reprex package (v0.3.0)
You could simply use Map to replace your values with NA.
dat[2:3] <- Map(function(x, y) {x[!x %in% y] <- NA; x}, dat[2:3], list(4:9, 50:80))
dat
# a b c
# 1 2 4 50
# 2 1 5 NA
# 3 9 NA 77
# 4 2 9 NA
# 5 9 NA NA
# 6 8 NA 60
Data:
dat <- structure(list(a = c(2, 1, 9, 2, 9, 8), b = c(4, 5, 1, 9, 12,
NA), c = c(50, 34, 77, 88, 33, 60)), class = "data.frame", row.names = c(NA,
-6L))
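Note that the %in% trick only works here because the example values happen to be whole numbers; for general numeric data a range comparison is safer. A sketch of the same idea with explicit bounds:
dat[2:3] <- Map(function(x, rng) replace(x, x < rng[1] | x > rng[2], NA),
                dat[2:3], list(c(4, 9), c(50, 80)))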
We can use map2:
library(purrr)
library(dplyr)
df1[c('b', 'c')] <- map2(df1 %>% select(b, c),
                         list(c(4, 9), c(50, 80)),
                         ~ replace(.x, .x < .y[1] | .x > .y[2], NA))
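An equivalent sketch without purrr, using replace() inside a plain dplyr mutate (with the ranges hard-coded per column):
library(dplyr)

df1 %>%
  mutate(b = replace(b, b < 4 | b > 9, NA),
         c = replace(c, c < 50 | c > 80, NA))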

Dataframe: Divide each group by a vector corresponding to each group in R?

I have a data frame like this:
df1 <- structure(list(user_id = c(1, 1, 1, 2, 2, 2, 3, 3, 3), param_a = c(123,
2.3, -9, 1, -0.03333, 4, -41, -12, 0.89)), .Names = c("user_id",
"param_a"), row.names = c(NA, -9L), class = c("tbl_df", "tbl",
"data.frame"))
and another dataframe of vectors:
df2 <- structure(list(user_id = c(1, 2, 3), param_b = c(34, 12, -0.89
)), .Names = c("user_id", "param_b"), row.names = c(NA, -3L), class = c("tbl_df",
"tbl", "data.frame"))
Now I want to divide each group in df1 by the corresponding value in df2.
For example, for user 1's group, divide each param_a value by the first param_b:
user_id param_a
1 123/34
1 2.3/34
1 -9/34
2 1/12
2 -0.03333/12
2 4/12
....
For user 2, divide each row by the second param_b, and so on.
Please advise: how can I divide a dataframe grouped by user by the vector belonging to each group?
P.S.
If df1 were extended to have param_a, param_k, param_p,
and df2 were extended accordingly with param_b, param_l, param_r,
how could I perform this kind of operation? @nicola suggested a very nice solution, but I want to extend it.
Something like this?
library(dplyr)

df1 %>%
  left_join(df2) %>%
  mutate(result = param_a / param_b)
Joining, by = "user_id"
# A tibble: 9 x 4
user_id param_a param_b result
<dbl> <dbl> <dbl> <dbl>
1 1 123 34 3.62
2 1 2.3 34 0.0676
3 1 -9 34 -0.265
4 2 1 12 0.0833
5 2 -0.0333 12 -0.00278
6 2 4 12 0.333
7 3 -41 -0.89 46.1
8 3 -12 -0.89 13.5
9 3 0.89 -0.89 -1
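For the extended case in the P.S. (several param columns), the same join-then-divide idea generalises; a sketch, where param_k/param_p and param_l/param_r are the hypothetical extra columns named in the question:
library(dplyr)

df1 %>%
  left_join(df2, by = "user_id") %>%
  mutate(result_a = param_a / param_b,
         result_k = param_k / param_l,
         result_p = param_p / param_r)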

Aggregate using different functions for each column

I have a data.table similar to the one below, but with around 3 million rows and a lot more columns.
key1 price qty status category
1: 1 9.26 3 5 B
2: 1 14.64 1 5 B
3: 1 16.66 3 5 A
4: 1 18.27 1 5 A
5: 2 2.48 1 7 A
6: 2 0.15 2 7 C
7: 2 6.29 1 7 B
8: 3 7.06 1 2 A
9: 3 24.42 1 2 A
10: 3 9.16 2 2 C
11: 3 32.21 2 2 B
12: 4 20.00 2 9 B
Here's the dput() string:
dados = structure(list(key1 = c(1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3, 4),
price = c(9.26, 14.64, 16.66, 18.27, 2.48, 0.15, 6.29, 7.06,
24.42, 9.16, 32.21, 20), qty = c(3, 1, 3, 1, 1, 2, 1, 1,
1, 2, 2, 2), status = c(5, 5, 5, 5, 7, 7, 7, 2, 2, 2, 2,
9), category = c("B", "B", "A", "A", "A", "C", "B", "A",
"A", "C", "B", "B")), .Names = c("key1", "price", "qty",
"status", "category"), row.names = c(NA, -12L), class = c("data.table",
"data.frame"))
# (the .internal.selfref pointer printed by dput() is omitted here so the code parses)
I need to transform this data so that I have one entry for each key, and in the process I need to create some additional variables. So far I have been using this:
Mode <- function(x) {  # returns the most frequent value of x
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}
key.aggregate = function(x){
  return(data.table(
    key1 = Mode(x$key1),
    perc.A = sum(x$price[x$category == "A"], na.rm = TRUE) / sum(x$price),
    perc.B = sum(x$price[x$category == "B"], na.rm = TRUE) / sum(x$price),
    perc.C = sum(x$price[x$category == "C"], na.rm = TRUE) / sum(x$price),
    status = Mode(x$status),
    qty = sum(x$qty),
    price = sum(x$price)
  ))
}
new_data = split(dados, by = "key1")  # Runs out of RAM here
results = rbindlist(lapply(new_data, key.aggregate))
And expecting the following output:
> results
key1 perc.A perc.B perc.C status qty price
1: 1 0.5937447 0.4062553 0.00000000 5 8 58.83
2: 2 0.2780269 0.7051570 0.01681614 7 4 8.92
3: 3 0.4321208 0.4421414 0.12573782 2 6 72.85
4: 4 0.0000000 1.0000000 0.00000000 9 2 20.00
But I always run out of RAM when splitting the data by keys. I've tried using only a third of the data, and then only a sixth of it, but I still get the same Error: cannot allocate vector of size 593 Kb.
I suspect this approach is very inefficient; what would be the best way to get this result?
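One possible direction, sketched on the assumption that dados is kept as a data.table: grouping with by = inside [] computes the aggregates per key without ever materialising the split list, which is where the memory goes (Mode() is the helper defined above):
library(data.table)
setDT(dados)  # make sure dados is a proper data.table

results <- dados[, .(
  perc.A = sum(price[category == "A"], na.rm = TRUE) / sum(price),
  perc.B = sum(price[category == "B"], na.rm = TRUE) / sum(price),
  perc.C = sum(price[category == "C"], na.rm = TRUE) / sum(price),
  status = Mode(status),
  qty = sum(qty),
  price = sum(price)
), by = key1]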

How to merge multiple columns into one in R

I have a data frame called mydf with hundreds of paired columns (value1 to valueX and rec1 to recX). I want to combine all these paired columns, ordered by their values, into single value and rec columns as shown in the result below. How can I do this in R?
mydf<-structure(list(samples = structure(1:3, .Label = c("A", "B",
"c"), class = "factor"), value1 = c(1, 8, 7), value2 = c(2, 5,
9), rec1 = c(7158, 6975, 6573), rec2 = c(1122, 2235, 229)), .Names = c("samples",
"value1", "value2", "rec1", "rec2"), row.names = c(NA, -3L), class = "data.frame")
result
sample value rec
A 1 7158
A 2 1122
B 5 2235
C 7 6573
B 8 6975
C 9 229
You could solve this quickly using data.table's melt method, which allows you to specify regex patterns within the measure.vars argument:
library(data.table) # v >= 1.9.6
melt(setDT(mydf), measure = patterns("value", "rec"), value.name = c("value", "rec"))
# samples variable value rec
# 1: A 1 1 7158
# 2: B 1 8 6975
# 3: c 1 7 6573
# 4: A 2 2 1122
# 5: B 2 5 2235
# 6: c 2 9 229
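If you also want the rows ordered by value with the helper variable column dropped, as in the desired result, a small follow-up sketch:
res <- melt(setDT(mydf), measure = patterns("value", "rec"),
            value.name = c("value", "rec"))
res[, variable := NULL]  # drop the index column that melt creates
setorder(res, value)     # order rows by value, as in the desired output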
