How to merge two dataframes based on a range of values in one table - r

DF1
SIC Value
350 100
460 500
140 200
290 400
506 450
DF2
SIC1 AREA
100-200 Forest
201-280 Hospital
281-350 Education
351-450 Government
451-550 Land
Note: SIC1 is of class character, so it needs to be converted to a numeric range.
I am trying to get the output shown below.
Desired output:
DF3
SIC Value AREA
350 100 Education
460 500 Land
140 200 Forest
290 400 Education
506 450 Land
I first tried to convert SIC1 from character to numeric and then tried to merge, but with no luck. Can someone guide me on this?

An option is to use tidyr::separate along with sqldf to join the two tables on a range of values.
library(sqldf)
library(tidyr)
DF2 <- separate(DF2, "SIC1", c("Start", "End"), sep = "-", convert = TRUE) # convert = TRUE makes Start/End numeric
sqldf("select DF1.*, DF2.AREA from DF1, DF2
WHERE DF1.SIC between DF2.Start AND DF2.End")
# SIC Value AREA
# 1 350 100 Education
# 2 460 500 Land
# 3 140 200 Forest
# 4 290 400 Education
# 5 506 450 Land
Data:
DF1 <- read.table(text =
"SIC Value
350 100
460 500
140 200
290 400
506 450",
header = TRUE, stringsAsFactors = FALSE)
DF2 <- read.table(text =
"SIC1 AREA
100-200 Forest
201-280 Hospital
281-350 Education
351-450 Government
451-550 Land",
header = TRUE, stringsAsFactors = FALSE)
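A base-R alternative for the same range lookup is findInterval; a minimal sketch on the original DF2 (i.e. before the separate() step above), assuming every SIC falls inside one of the SIC1 ranges:
# lower bounds of each range: 100, 201, 281, 351, 451
lower <- as.numeric(sub("-.*", "", DF2$SIC1))
# index of the range each SIC falls into, then pick the matching AREA
DF1$AREA <- DF2$AREA[findInterval(DF1$SIC, lower)]
DF1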

We could do a non-equi join. Split (tstrsplit) the 'SIC1' column in 'DF2' into numeric columns and then do a non-equi join with the first dataset.
library(data.table)
setDT(DF2)[, c('start', 'end') := tstrsplit(SIC1, '-', type.convert = TRUE)]
DF2[, -1, with = FALSE][DF1, on = .(start <= SIC, end >= SIC),
mult = 'last'][, .(SIC = start, Value, AREA)]
# SIC Value AREA
#1: 350 100 Education
#2: 460 500 Land
#3: 140 200 Forest
#4: 290 400 Education
#5: 506 450 Land
Or, as @Frank mentioned, we can do a rolling join to extract the 'AREA' and update it on the first dataset (this reuses the numeric 'start' column created on DF2 above):
setDT(DF1)[, AREA := DF2[DF1, on=.(start = SIC), roll=TRUE, x.AREA]]
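Printing DF1 after the update gives the same result as the non-equi join above:
DF1
#    SIC Value      AREA
#1:  350   100 Education
#2:  460   500      Land
#3:  140   200    Forest
#4:  290   400 Education
#5:  506   450      Land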
data
DF1 <- structure(list(SIC = c(350L, 460L, 140L, 290L, 506L), Value = c(100L,
500L, 200L, 400L, 450L)), .Names = c("SIC", "Value"),
class = "data.frame", row.names = c(NA, -5L))
DF2 <- structure(list(SIC1 = c("100-200", "201-280", "281-350", "351-450",
"451-550"), AREA = c("Forest", "Hospital", "Education", "Government",
"Land")), .Names = c("SIC1", "AREA"), class = "data.frame",
row.names = c(NA, -5L))

Related

How to match two column values in df1 and extract corresponding values in R

Table 1:
Pos  Samples
129  ERR5678
460  ERR7890
568  ERR7689
Table 2:
Pos  ERR5678  ERR7890  ERR7689
129  67890    76879    67894
460  56782    123478   678390
568  78926    890765   345678
Result Table:
Pos  Samples  Dp_value
129  ERR5678  67890
460  ERR7890  123478
568  ERR7689  345678
Table 1 contains the list of positions and their corresponding samples, and the other table contains the position and depth values for each sample. Using R, I read both tables into data.table and then used: df1[df1$Pos %in% df2$Pos, ]
That extracted the positions. Could someone kindly tell me how to match both Pos and Samples in df2 to get the result table?
Reshape the second dataset to 'long' format and do an inner_join:
library(dplyr)
library(tidyr)
df2 %>%
pivot_longer(cols = starts_with("ERR"),
names_to = "Samples", values_to = "Dp_value") %>%
inner_join(df1)
-output
# A tibble: 3 × 3
Pos Samples Dp_value
<int> <chr> <int>
1 129 ERR5678 67890
2 460 ERR7890 123478
3 568 ERR7689 345678
data
df1 <- structure(list(Pos = c(129L, 460L, 568L), Samples = c("ERR5678",
"ERR7890", "ERR7689")), class = "data.frame", row.names = c(NA,
-3L))
df2 <- structure(list(Pos = c(129L, 460L, 568L), ERR5678 = c(67890L,
56782L, 78926L), ERR7890 = c(76879L, 123478L, 890765L), ERR7689 = c(67894L,
678390L, 345678L)), class = "data.frame", row.names = c(NA, -3L
))
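Since the question mentions reading the tables into data.table, a minimal sketch of the same reshape-and-join idea with data.table::melt (assuming the df1/df2 shown under data) would be:
library(data.table)
# reshape df2 to long form: one row per (Pos, sample) pair
long2 <- melt(as.data.table(df2), id.vars = "Pos",
              variable.name = "Samples", value.name = "Dp_value",
              variable.factor = FALSE)
# join on both Pos and Samples to keep only the matching depth values
long2[as.data.table(df1), on = .(Pos, Samples)]
#    Pos Samples Dp_value
#1:  129 ERR5678    67890
#2:  460 ERR7890   123478
#3:  568 ERR7689   345678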

Add values of columns based on condition of another variable in R

I want to create a variable that adds the values from the other columns based on the value of the variable YEAR. That is, if YEAR = 2013, then add columns YR_2006, YR_2007, YR_2008, YR_2009, YR_2010 and YR_2011. So for group A the sum would be 12,793.
GROUP YEAR YR_2006 YR_2007 YR_2008 YR_2009 YR_2010 YR_2011
A     2013      NA     636    3653    4759    3745      NA
B     2019    1417    2176    3005    2045    2088    1849
C     2007    4218    3622    4651    4574    4122    4711
E     2017    5956    6031    6032    4885    5400     582
Here is an option with apply and MARGIN = 1 to loop over the rows: get the index where 'YEAR' matches the column names, build a sequence from the 2nd element up to that index, subset the values and take the sum.
df1$Sum <- apply(df1[-1], 1, function(x)
sum(x[2:c(grep(as.character(x[1]), names(x)[-1]) +1,
length(x))[1]], na.rm = TRUE))
df1$Sum
#[1] 12793 12580 7840 28886
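To see what the row-wise index does, take group C (YEAR 2007) as an example: grep locates the matching column name, and the sum then runs from the first 'YR_' column up to that position (a quick check on that row, using the df1 shown under data):
x <- unlist(df1[3, -1])
grep(as.character(x[1]), names(x)[-1]) + 1
# [1] 3
sum(x[2:3], na.rm = TRUE)
# [1] 7840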
Or we can use a vectorized option with rowSums after using replace to set some of the elements in each row to NA, based on matching the 'YEAR' column against the column names that start with 'YR_':
i1 <- startsWith(names(df1), "YR_")
i2 <- match(df1$YEAR, sub("YR_", "", names(df1)[i1]), nomatch = sum(i1))
rowSums(replace(df1[i1], col(df1[i1]) > i2[row(df1[i1])], NA), na.rm = TRUE)
#[1] 12793 12580 7840 28886
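To make the masking step concrete, with the df1 shown under data the match gives one cut-off column per row, where 6 (the nomatch value) means keep all YR_ columns:
i2
# [1] 6 6 2 6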
Or, using tidyverse, reshape to 'long' format with pivot_longer and then do a group_by sum after using slice to keep the rows up to the matching year:
library(dplyr)
library(tidyr)
df1 %>%
pivot_longer(cols = starts_with("YR_"), values_drop_na = TRUE) %>%
group_by(GROUP) %>%
slice(seq(match(first(YEAR), readr::parse_number(name), nomatch = n()))) %>%
summarise(Sum = sum(value)) %>%
left_join(df1, .)
GROUP YEAR YR_2006 YR_2007 YR_2008 YR_2009 YR_2010 YR_2011 Sum
1 A 2013 NA 636 3653 4759 3745 NA 12793
2 B 2019 1417 2176 3005 2045 2088 1849 12580
3 C 2007 4218 3622 4651 4574 4122 4711 7840
4 E 2017 5956 6031 6032 4885 5400 582 28886
data
df1 <- structure(list(GROUP = c("A", "B", "C", "E"), YEAR = c(2013L,
2019L, 2007L, 2017L), YR_2006 = c(NA, 1417L, 4218L, 5956L), YR_2007 = c(636L,
2176L, 3622L, 6031L), YR_2008 = c(3653L, 3005L, 4651L, 6032L),
YR_2009 = c(4759L, 2045L, 4574L, 4885L), YR_2010 = c(3745L,
2088L, 4122L, 5400L), YR_2011 = c(NA, 1849L, 4711L, 582L)),
class = "data.frame", row.names = c(NA,
-4L))

Merge and Aggregate by multiple columns in r?

I have 2 tables. Below are the sample tables and the desired output.
Table1:
Start Date End Date Country
2017-01-04 2017-01-06 id
2017-02-13 2017-02-15 ng
Table2:
Transaction Date Country Cost Product
2017-01-04 id 111 21
2017-01-05 id 200 34
2017-02-14 ng 213 45
2017-02-15 ng 314 32
2017-02-18 ng 515 26
Output:
Start Date End Date Country Cost Product
2017-01-04 2017-01-06 id 311 55
2017-02-13 2017-02-15 ng 527 77
The problem is to merge the two tables when the transaction date lies between the start date and end date and the country matches, and then add up the values of Cost and Product.
This calls for a fuzzy join. Below are two examples.
Using the dplyr and fuzzyjoin packages:
library(dplyr)
library(fuzzyjoin)
fuzzy_left_join(df1, df2,
c("Country" = "Country",
"Start_Date" = "Transaction_Date",
"End_Date" = "Transaction_Date"),
list(`==`, `<=`,`>=`)) %>%
group_by(Country.x, Start_Date, End_Date) %>%
summarise(Cost = sum(Cost),
Product = sum(Product))
# A tibble: 2 x 5
# Groups: Country.x, Start_Date [?]
Country.x Start_Date End_Date Cost Product
<chr> <date> <date> <int> <int>
1 id 2017-01-04 2017-01-06 311 55
2 ng 2017-02-13 2017-02-15 527 77
Using data.table:
library(data.table)
dt1 <- data.table(df1)
dt2 <- data.table(df2)
dt2[dt1, on=.(Country = Country,
Transaction_Date >= Start_Date,
Transaction_Date <= End_Date),
.(Cost = sum(Cost), Product = sum(Product)),
by=.EACHI]
data:
df1 <- structure(list(Start_Date = structure(c(17170, 17210), class = "Date"),
End_Date = structure(c(17172, 17212), class = "Date"), Country = c("id",
"ng")), row.names = c(NA, -2L), class = "data.frame")
df2 <- structure(list(Transaction_Date = structure(c(17170, 17171, 17211,
17212, 17215), class = "Date"), Country = c("id", "id", "ng",
"ng", "ng"), Cost = c(111L, 200L, 213L, 314L, 515L), Product = c(21L,
34L, 45L, 32L, 26L)), row.names = c(NA, -5L), class = "data.frame")
Not sure if you can use any of the merge operations here, but one way using mapply is to subset the rows based on the condition and take the sums of the Cost and Product columns.
df1[c("Cost", "Product")] <- t(mapply(function(x, y, z) {
inds <- df2$Transaction_Date >= x & df2$Transaction_Date <= y & df2$Country == z
c(sum(df2$Cost[inds]), sum(df2$Product[inds]))
},df1$Start_Date, df1$End_Date, df1$Country))
df1
# Start_Date End_Date Country Cost Product
#1 2017-01-04 2017-01-06 id 311 55
#2 2017-02-13 2017-02-15 ng 527 77

Subsetting rows based on multiple columns using data.table - fastest way

I was wondering if there is a more elegant, less clunky and faster way to do this. I have millions of rows of clinical data with ICD coding; a short example is provided below. I want to subset the dataset based on either of the columns meeting a specific set of diagnosis codes. The code below works but takes ages in R, and I was wondering if there is a faster way.
structure(list(eid = 1:10, mc1 = structure(c(4L, 3L, 5L, 2L,
1L, 1L, 1L, 1L, 1L, 1L), .Label = c("345", "410", "413.9", "I20.1",
"I23.4"), class = "factor"), oc1 = c(350, 323, 12, 35, 413.1,
345, 345, 345, 345, 345), oc2 = structure(c(5L, 6L, 4L, 1L, 1L,
2L, 2L, 2L, 3L, 2L), .Label = c("", "345", "I20.3", "J23.6",
"K50.1", "K51.4"), class = "factor")), .Names = c("eid", "mc1",
"oc1", "oc2"), class = c("data.table", "data.frame"), row.names = c(NA,
-10L), .internal.selfref = <pointer: 0x102812578>)
The code below subsets all rows where any of the columns has a code starting with either "I20" or "413" (this would include, for example, codes recorded as "I20.4" or "413.9").
dat2 <- dat [substr(dat$mc1,1,3)== "413"|
substr(dat$oc1,1,3)== "413"|
substr(dat$oc2,1,3)== "413"|
substr(dat$mc1,1,3)== "I20"|
substr(dat$oc1,1,3)== "I20"|
substr(dat$oc2,1,3)== "I20"]
Is there a faster way to do this? For example, can I loop through each of the columns looking for the specific codes "I20" or "413" and subset those rows?
We can specify the columns of interest in .SDcols, loop through the Subset of Data.table (.SD), get the first 3 characters with substr, check whether they are %chin% a vector of values and Reduce the result to a single logical vector for subsetting the rows:
dat[dat[,Reduce(`|`, lapply(.SD, function(x)
substr(x, 1, 3) %chin% c('413', 'I20'))), .SDcols = 2:4]]
# eid mc1 oc1 oc2
#1: 1 I20.1 350.0 K50.1
#2: 2 413.9 323.0 K51.4
#3: 5 345 413.1
#4: 9 345 345.0 I20.3
For larger data it could help if we don't check all rows:
minem <- function(dt, colsID = 2:4) {
cols <- colnames(dt)[colsID]
x <- c('413', 'I20')
set(dt, j = "inn", value = F)
for (i in cols) {
dt[inn == F, inn := substr(get(i), 1, 3) %chin% x]
}
dt[inn == T][, inn := NULL][]
}
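# benchmark setup: 'dts' is the 10-row example data.table from the question, and akrun() wraps the .SD/Reduce approach shown above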
n <- 1e7
set.seed(13)
dt <- dts[sample(.N, n, replace = T)]
dt <- cbind(dt, dts[sample(.N, n, replace = T), 2:4])
setnames(dt, make.names(colnames(dt), unique = T))
dt
# eid mc1 oc1 oc2 mc1.1 oc1.1 oc2.1
# 1: 8 345 345.0 345 345 345 345
# 2: 3 I23.4 12.0 J23.6 413.9 323 K51.4
# 3: 4 410 35.0 413.9 323 K51.4
# 4: 1 I20.1 350.0 K50.1 I23.4 12 J23.6
# 5: 10 345 345.0 345 345 345 345
# ---
# 9999996: 3 I23.4 12.0 J23.6 I20.1 350 K50.1
# 9999997: 5 345 413.1 I20.1 350 K50.1
# 9999998: 4 410 35.0 345 345 345
# 9999999: 4 410 35.0 410 35
# 10000000: 10 345 345.0 345 345 345 I20.3
system.time(r1 <- akrun(dt, 2:ncol(dt))) # 22.88 sec
system.time(r2 <- minem(dt, 2:ncol(dt))) # 17.72 sec
all.equal(r1, r2)
# [1] TRUE
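Another way to express the same filter, shown only as a sketch (it reshapes the diagnosis columns to long form first, so it is not necessarily faster on very large data):
library(data.table)
# melt the three diagnosis columns, find eids with a matching code prefix, then subset
m <- melt(dat, id.vars = "eid", measure.vars = 2:4)
keep <- unique(m[substr(value, 1, 3) %chin% c("413", "I20"), eid])
dat[eid %in% keep]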

R comparing 2 dfs to sum data between values

I have 2 dataframes in R, one with start (column 1) and end (column 2) coordinates...
df1
2500 3499
3500 4499
4500 5499
5500 6499
And one with point coordinates (column 1) and associated values (column 2)...
df2
2657 17
2895 33
3875 12
4448 42
5122 3
5633 65
5781 12
I would like to find a vectorized approach to sum the values from df2 column 2 where the df2 column 1 coordinates are between the start and stop coordinates of df1. With this data the result should look like this...
df3
2500 3499 50
3500 4499 54
4500 5499 3
5500 6499 77
The dfs contain 100,000+ rows. I can achieve this easily using loops, but in R that is slow and not the best approach.
What is the best way to do this? A flexible solution that can be adapted to functions other than simply summing the data would also be good to know.
Here's a possible data.table::foverlaps solution. As you haven't specified column names, I'm assuming that they are called V1 and V2 in both data sets
Solution
library(data.table)
setDT(df1)[, `:=`(start = V1, end = V2)]
setDT(df2)[, `:=`(start = V1, end = V1)]
setkey(df1, start, end)
foverlaps(df2, df1)[, list(SumV2 = sum(i.V2)), by = list(V1, V2)]
# V1 V2 SumV2
# 1: 2500 3499 50
# 2: 3500 4499 54
# 3: 4500 5499 3
# 4: 5500 6499 77
Explanation
Here we converted both data sets to data.table objects and specified the start/end values to overlap on. Then, we keyed the data set that we want to join against. Finally we ran the foverlaps function and then aggregated the matched values of V2 from df2 by the desired columns in df1
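Since a solution adaptable to functions other than sum was asked for, only the aggregation expression needs to change; a sketch reusing the objects prepared above:
ov <- foverlaps(df2, df1)
# swap sum() for any other summary of the matched values from df2
ov[, .(SumV2 = sum(i.V2), MaxV2 = max(i.V2), N = .N), by = .(V1, V2)]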
Data
df1 <- structure(list(V1 = c(2500L, 3500L, 4500L, 5500L), V2 = c(3499L,
4499L, 5499L, 6499L)), .Names = c("V1", "V2"), class = "data.frame", row.names = c(NA,
-4L))
df2 <- structure(list(V1 = c(2657L, 2895L, 3875L, 4448L, 5122L, 5633L,
5781L), V2 = c(17L, 33L, 12L, 42L, 3L, 65L, 12L)), .Names = c("V1",
"V2"), class = "data.frame", row.names = c(NA, -7L))
