Intersecting two data frames in R - r

I have two data frames, each with different columns.
The columns in data frame 1 include:
GID_2
MEX.15.1_1
MEX.15.1_2
MEX.15.1_3
MEX.15.1_4
MEX.15.1_5
MEX.15.1_6
The columns in data frame 2 include:
ID_MUNICIPIO  B    C    D
1             500  200  100
2             200  300  100
3             100  600  400
4             200  400  700
5             600  100  800
6             700  100  200
I want to merge them like this:
GID_2       X
MEX.15.1_1  500
MEX.15.1_2  300
MEX.15.1_3  600
MEX.15.1_4  700
MEX.15.1_5  800
MEX.15.1_6  700
Sorry if this is a rookie question; I am fairly new to R.

The logic is to find the max in each row of the value columns (dropping the ID column so it cannot be picked as the maximum).
Then we can use cbind:
cbind(df1, X = apply(df2[-1], 1, max, na.rm = TRUE))
GID_2 X
1 MEX.15.1_1 500
2 MEX.15.1_2 300
3 MEX.15.1_3 600
4 MEX.15.1_4 700
5 MEX.15.1_5 800
6 MEX.15.1_6 700
data:
> dput(df1)
structure(list(GID_2 = c("MEX.15.1_1", "MEX.15.1_2", "MEX.15.1_3",
"MEX.15.1_4", "MEX.15.1_5", "MEX.15.1_6")), class = "data.frame", row.names = c(NA,
-6L))
> dput(df2)
structure(list(ID_MUNICIPIO = 1:6, B = c(500L, 200L, 100L, 200L,
600L, 700L), C = c(200L, 300L, 600L, 400L, 100L, 100L), D = c(100L,
100L, 400L, 700L, 800L, 200L)), class = "data.frame", row.names = c(NA,
-6L))

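You can use the intersect function on the two key columns. Note that it only returns values that appear in both columns in exactly the same form:
common_rows <- generics::intersect(df1$GID_2, df2$ID_MUNICIPIO)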

More information is clearly needed. I'll assume that the last digit in GID_2 is a unique key that can be used for a merge with ID_MUNICIPIO in dataset 2. That is a big assumption.
The pseudo-code to solve this:
Create a new column in dataset 1 called "ID_MUNICIPIO".
"ID_MUNICIPIO" will equal the last character in GID_2.
Merge dataset 1 and dataset 2 on "ID_MUNICIPIO".
Find the max in each row of the newly merged data set (see @TarJae's suggestion).
At least that's how I think it should go. But this is predicated on my understanding of GID_2.
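The steps above could be sketched in base R roughly as follows. This is only a sketch under the same assumption, namely that the suffix of GID_2 matches ID_MUNICIPIO. It takes everything after the underscore rather than only the last character, so that IDs of 10 or more would still work, and the names merged and result are made up for illustration.

```r
df1 <- data.frame(GID_2 = paste0("MEX.15.1_", 1:6))
df2 <- data.frame(ID_MUNICIPIO = 1:6,
                  B = c(500L, 200L, 100L, 200L, 600L, 700L),
                  C = c(200L, 300L, 600L, 400L, 100L, 100L),
                  D = c(100L, 100L, 400L, 700L, 800L, 200L))

# Assumption: the part of GID_2 after the underscore is the municipality ID
df1$ID_MUNICIPIO <- as.integer(sub(".*_", "", df1$GID_2))

# Join on the derived key instead of relying on row order
merged <- merge(df1, df2, by = "ID_MUNICIPIO")

# Row-wise maximum over the value columns only (the key is excluded)
merged$X <- pmax(merged$B, merged$C, merged$D, na.rm = TRUE)

result <- merged[c("GID_2", "X")]
result
```

Unlike cbind, this does not depend on both data frames being in the same row order, which matters if the real data are not sorted.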

Related

How do I select all rows where one column has the same value but another column has different values? [duplicate]

I am trying to extract rows from my R dataframe where the ID column has the same value and the pt column has different values.
For example, if my data frame looks like this:
ID pt
600 DC90
600 DC90
612 DC18
612 DC02
612 DC02
630 DC30
645 DC16
645 DC16
645 DC16
my desired output would look like this:
ID pt
612 DC18
612 DC02
612 DC02
because ID 612 has two different pt numbers
We could group over the ID, and filter IDs where the number of distinct elements in 'pt' is greater than 1
library(dplyr)
df1 %>%
group_by(ID) %>%
filter(n_distinct(pt) > 1)
-output
# A tibble: 3 x 2
# Groups: ID [1]
# ID pt
# <int> <chr>
#1 612 DC18
#2 612 DC02
#3 612 DC02
If instead every 'pt' value within an ID must be different from the others:
df1 %>%
group_by(ID) %>%
filter(n_distinct(pt) == n())
data
df1 <- structure(list(ID = c(600L, 600L, 612L, 612L, 612L, 630L, 645L,
645L, 645L), pt = c("DC90", "DC90", "DC18", "DC02", "DC02", "DC30",
"DC16", "DC16", "DC16")), class = "data.frame", row.names = c(NA,
-9L))
A data.table option using uniqueN, grouped by ID
> setDT(df)[, .SD[uniqueN(pt) > 1], ID]
ID pt
1: 612 DC18
2: 612 DC02
3: 612 DC02
Data
> dput(df)
structure(list(ID = c(600L, 600L, 612L, 612L, 612L, 630L, 645L,
645L, 645L), pt = c("DC90", "DC90", "DC18", "DC02", "DC02", "DC30",
"DC16", "DC16", "DC16")), class = "data.frame", row.names = c(NA,
-9L))
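For completeness, the same grouping logic can be sketched without extra packages. This is an illustrative base R alternative, not one of the answers above; the helper vector n_pt is a made-up name. ave() computes, for every row, the number of distinct pt values within that row's ID.

```r
df <- data.frame(
  ID = c(600L, 600L, 612L, 612L, 612L, 630L, 645L, 645L, 645L),
  pt = c("DC90", "DC90", "DC18", "DC02", "DC02", "DC30", "DC16", "DC16", "DC16")
)

# For each row, count the distinct 'pt' values among the rows sharing its ID.
# Row indices are passed to ave() so the result stays numeric.
n_pt <- ave(seq_along(df$pt), df$ID,
            FUN = function(i) length(unique(df$pt[i])))

# Keep only the IDs that have more than one distinct 'pt'
res <- df[n_pt > 1, ]
res
```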

How to find out minimum value from various columns in data frame with R? [duplicate]

My data frame is:
Account id  Fcast 1  Fcast 2  Fcast 3  Diff 1  Diff 2  Diff 3
101         4000     2000     1000     1000    3000    4000
201         2900     3300     5000     100     300     2000
301         -100     5500     -800     1700    7300    1000
401         5000     8000     7100     2500    500     400
501         9000     12000    2000     15000   12000   22000
The required result is the minimum value across the columns labeled Diff:
Account id  Min
101         1000
201         100
301         1000
401         400
501         12000
Ideally, I also need another column that holds the name of the column from which the minimum value was taken.
We can use apply in row mode here:
data.frame(AccountId=df$AccountId,
Min=apply(df[names(df)[grepl("^Diff\\d", names(df))]], 1, FUN=min))
AccountId Min
1 101 1000
2 201 100
3 301 1000
4 401 400
5 501 12000
Data:
df <- data.frame(AccountId=c(101, 201, 301, 401, 501),
Fcast1=c(4000, 2900, -100, 5000, 9000),
Fcast2=c(2000, 3300, 5500, 8000, 12000),
Fcast3=c(1000, 5000, -800, 7100, 2000),
Diff1=c(1000, 100, 1700, 2500, 15000),
Diff2=c(3000, 300, 7300, 500, 12000),
Diff3=c(4000, 2000, 1000, 400, 22000))
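Another option is to use the apply function, restricted to the Diff columns so that the Fcast values are not included in the minimum:
data.frame(AccountId = df$AccountId, Min = apply(df[grep("^Diff", names(df))], 1, min))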
Using dplyr:
library(dplyr)
cols <- grep('Diff', names(df), value = TRUE)
df %>%
rowwise() %>%
mutate(Min = min(c_across(all_of(cols))),
Min_name = cols[which.min(c_across(all_of(cols)))]) %>%
ungroup() %>%
select(Accountid, Min, Min_name)
# Accountid Min Min_name
# <int> <int> <chr>
#1 101 1000 Diff1
#2 201 100 Diff1
#3 301 1000 Diff3
#4 401 400 Diff3
#5 501 12000 Diff2
data
df <- structure(list(Accountid = c(101L, 201L, 301L, 401L, 501L),
Fcast1 = c(4000L, 2900L, -100L, 5000L, 9000L), Fcast2 = c(2000L, 3300L, 5500L,
8000L, 12000L), Fcast3 = c(1000L, 5000L, -800L, 7100L, 2000L),
Diff1 = c(1000L, 100L, 1700L, 2500L, 15000L), Diff2 = c(3000L,
300L, 7300L, 500L, 12000L), Diff3 = c(4000L, 2000L, 1000L,
400L, 22000L)), class = "data.frame", row.names = c(NA, -5L))
A solution using data.table:
library(data.table)
dt <- as.data.table(df)
dt[, `:=`(min_val = apply(.SD, 1, min),
min_col = names(.SD)[apply(.SD, 1, which.min)]), .SDcols = names(dt) %like% 'Diff']
Here, .SDcols chooses the subset of columns to work with, in this case the columns with the word Diff in their name. Hence the use of %like%, which is case-sensitive.
.SD then behaves as a subsetted data.table containing only the Diff columns.
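The row-wise apply calls can also be avoided. The following is a sketch of a vectorised base R variant, not taken from the answers above: max.col() on the negated matrix gives the position of the minimum in each row, and that position is then used both to pull out the value and to look up the column name. The names m, idx and out are illustrative.

```r
df <- data.frame(AccountId = c(101, 201, 301, 401, 501),
                 Fcast1 = c(4000, 2900, -100, 5000, 9000),
                 Diff1  = c(1000, 100, 1700, 2500, 15000),
                 Diff2  = c(3000, 300, 7300, 500, 12000),
                 Diff3  = c(4000, 2000, 1000, 400, 22000))

diff_cols <- grep("^Diff", names(df), value = TRUE)
m <- as.matrix(df[diff_cols])

# Column position of the row minimum: the max of -m is the min of m.
# ties.method = "first" makes ties deterministic.
idx <- max.col(-m, ties.method = "first")

out <- data.frame(AccountId = df$AccountId,
                  Min       = m[cbind(seq_len(nrow(m)), idx)],
                  Min_name  = diff_cols[idx])
out
```

Note that max.col() does not handle missing values gracefully, so this sketch assumes the Diff columns contain no NA.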

Create new variable based on the Look up table

I want to create a new variable in a data frame using a lookup table. I have a data frame df1 with the columns Amount and Term, and I need to create a new variable "Premium" whose values come from the lookup table.
I tried the ifelse function, but it's too tedious.
Below is an example:
df1 <- data.frame(Amount, Term)
df1
# Amount Term
# 1 2500 23
# 2 3600 30
# 3 7000 45
# 4 12000 50
# 5 16000 38
I need to create the new variable 'Premium' using the Premium lookup table below.
                 Term
Amount           0-24 Mos  25-36 Mos  37-48 Mos  49-60 Mos
0 - 5,000        133       163        175        186
5,001 - 10,000   191       213        229        249
10,001 - 15,000  229       252        275        306
15,001 - 20,000  600       615        625        719
20,001 - 25,000  635       645        675        786
So the output for premium should be.
df1
# Amount Term Premium
# 1 2500 23 133
# 2 3600 30 163
# 3 7000 45 229
# 4 12000 50 306
# 5 16000 38 625
Data
df1 <- structure(list(Amount = c(2500L, 3600L, 7000L, 12000L, 16000L),
Term = c(23L, 30L, 45L, 50L, 38L)),
class = "data.frame",
row.names = c(NA, -5L))
lkp <- structure(c(133L, 191L, 229L, 600L, 635L,
163L, 213L, 252L, 615L, 645L,
175L, 229L, 275L, 625L, 675L,
186L, 249L, 306L, 719L, 786L),
.Dim = 5:4,
.Dimnames = list(Amount = c("0 - 5,000", "5,001 - 10,000",
"10,001 - 15,000", "15,001 - 20,000",
"20,001 - 25,000"),
Term = c("0-24 Mos", "25-36 Mos", "37-48 Mos",
"49-60 Mos")))
Code
Create first the upper limits for month and amount using regular expressions from the column and row names (you did not post your data in a reproducible way, so this regex may need adaptation based on your real lookup table structure):
(month <- c(0, as.numeric(sub("\\d+-(\\d+) Mos$",
"\\1",
colnames(lkp)))))
# [1] 0 24 36 48 60
(amt <- c(0, as.numeric(sub("^\\d+,*\\d* - (\\d+),(\\d+)$",
"\\1\\2",
rownames(lkp)))))
# [1] 0 5000 10000 15000 20000 25000
Get the positions for each element of df1 using findInterval:
(rows <- findInterval(df1$Amount, amt))
# [1] 1 1 2 3 4
(cols <- findInterval(df1$Term, month))
# [1] 1 2 3 4 3
Use these indices to subset the lookup matrix:
df1$Premium <- lkp[cbind(rows, cols)]
df1
# Amount Term Premium
# 1 2500 23 133
# 2 3600 30 163
# 3 7000 45 229
# 4 12000 50 306
# 5 16000 38 625
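One detail worth checking in the findInterval approach is what happens exactly on a band boundary. By default findInterval treats intervals as closed on the left, so an Amount of 5000 or a Term of 24 falls into the next band, whereas the lookup table labels ("0 - 5,000", "0-24 Mos") suggest the upper limit belongs to the lower band. The sketch below assumes that reading of the table and uses left.open = TRUE to get right-closed intervals; the test values are made up for illustration.

```r
lkp <- matrix(c(133L, 191L, 229L, 600L, 635L,
                163L, 213L, 252L, 615L, 645L,
                175L, 229L, 275L, 625L, 675L,
                186L, 249L, 306L, 719L, 786L),
              nrow = 5)

amt   <- c(0, 5000, 10000, 15000, 20000, 25000)
month <- c(0, 24, 36, 48, 60)

# Hypothetical inputs sitting on or just past the band limits
test <- data.frame(Amount = c(5000, 5001, 20000),
                   Term   = c(24, 25, 60))

# left.open = TRUE makes each interval (lower, upper], so the upper
# limit of a band is assigned to that band and not to the next one
rows <- findInterval(test$Amount, amt,   left.open = TRUE)
cols <- findInterval(test$Term,   month, left.open = TRUE)

test$Premium <- lkp[cbind(rows, cols)]
test
```

With the default left-closed intervals, the last row would index outside the matrix, since 60 would fall past the final band.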
To get to what you want you need to organise the table and categorise the data. I have provided a potential workflow to handle such situations. Hope this is helpful:
library(tidyverse)
df1 <- data.frame(
Amount = c(2500L, 3600L, 7000L, 12000L, 16000L),
Term = c(23L, 30L, 45L, 50L, 38L)
)
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~#
# functions for analysis ####
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~#
amount_tier_function <- function(x){
case_when(x <= 5000 ~ "Tier_5000",
x <= 10000 ~ "Tier_10000",
x <= 15000 ~ "Tier_15000",
x <= 20000 ~ "Tier_20000",
TRUE ~ "Tier_25000")
}
month_tier_function <- function(x){
case_when(x <= 24 ~ "Tier_24",
x <= 36 ~ "Tier_36",
x <= 48 ~ "Tier_48",
TRUE ~ "Tier_60")
}
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~#
# Recut lookup table headings ####
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~#
lookup_df <- data.frame(stringsAsFactors=FALSE,
amount_tier = c("Tier_5000", "Tier_10000", "Tier_15000", "Tier_20000",
"Tier_25000"),
Tier_24 = c(133L, 191L, 229L, 600L, 635L),
Tier_36 = c(163L, 213L, 252L, 615L, 645L),
Tier_48 = c(175L, 229L, 275L, 625L, 675L),
Tier_60 = c(186L, 249L, 306L, 719L, 786L)
)
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~#
# Join everything together ####
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~#
lookup_df_tidy <- lookup_df %>%
gather(mth_tier, Premium, - amount_tier)
df1 %>%
mutate(amount_tier = amount_tier_function(Amount),
mth_tier = month_tier_function(Term)) %>%
left_join(lookup_df_tidy, by = c("amount_tier", "mth_tier")) %>%
select(-amount_tier, -mth_tier)

How to merge specific rows that match a grep pattern

I have a dataframe as follows:
Jen Rptname freq
AKT bilb1 23
AKT bilb1 234
DFF bilb22 987
DFF bilf34 7
DFF jhs23 623
AKT j45 53
JFG jhs98 65
I know how to group the whole dataframe based on individual columns, but how do I merge individual rows based on a grep pattern (in this case bilb.* and jhs.*)?
I want to merge the rows matching bilb* (adding their frequencies together) and, separately, the rows matching jhs*, so that I end up with
AKT bilb 257
DFF bilb 987
DFF bilf34 7
DFF jhs 623
AKT j45 53
JFG jhs 65
The aggregation should be by Jen and Rptname, so I can see how many of the same Rptnames are in each Jen.
We can use grep to get the index of 'Rptname' elements that have 'bilb' or 'jhs', remove the numeric part with sub, and use aggregate to get the sum of 'freq' by 'Rptname'
indx <- grep('bilb|jhs', df1$Rptname)
df1$Rptname[indx] <- sub('\\d+', '', df1$Rptname[indx])
aggregate(freq~Rptname, df1, FUN=sum)
# Rptname freq
#1 bilb 1244
#2 bilf34 7
#3 j45 53
#4 jhs 688
Update
Suppose your dataset is 'df2'
df2$grp <- gsub("([A-Z]+|[a-z]+)[^A-Z]+", "\\1", df2$Rptname)
aggregate(freq~grp+Jen, df2, FUN=sum)
data
df1 <- structure(list(Rptname = c("bilb1", "bilb1", "bilb22",
"bilf34",
"jhs23", "j45", "jhs98"), freq = c(23L, 234L, 987L, 7L, 623L,
53L, 65L)), .Names = c("Rptname", "freq"), class = "data.frame",
row.names = c(NA, -7L))
df2 <- structure(list(Jen = c("AKT", "AKT", "AKT", "DFF", "DFF",
"DFF",
"DFF", "DFF", "DFF", "AKT", "JFG", "JFG", "JFG"), Rptname = c("bilb1",
"bilb1", "bilb22", "bilb22", "bilb1", "BTBy", "bilf34", "BTBx",
"jhs23", "j45", "jhs98", "BTBfd", "BTBx"), freq = c(23L, 234L,
22L, 987L, 18L, 18L, 7L, 9L, 623L, 53L, 65L, 19L, 14L)),
.Names = c("Jen",
"Rptname", "freq"), class = "data.frame", row.names = c(NA, -13L))
Similar to akrun's and I like his use of aggregate better than my creation of an intermediate vector:
> inter <- tapply(dat$freq, sub("^(bilb|jhs)(.+)$", "\\1", dat$Rptname) ,sum)
> final <- data.frame( nams = names(inter), sums = inter)
> final
nams sums
bilb bilb 1244
bilf34 bilf34 7
j45 j45 53
jhs jhs 688
My pattern requires that 'bilb' and 'jhs' be at the beginning of the value. If that was not intended, remove the "^"; in that case, add a "(.*)" in front of the group and switch to "\\2" in the replacement.
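To reproduce the exact output asked for in the question (grouped by both Jen and Rptname), the prefix collapse and the aggregation can be combined. This is a minimal sketch built on the seven example rows from the question; the data frame name dat and the result name res are illustrative.

```r
dat <- data.frame(
  Jen     = c("AKT", "AKT", "DFF", "DFF", "DFF", "AKT", "JFG"),
  Rptname = c("bilb1", "bilb1", "bilb22", "bilf34", "jhs23", "j45", "jhs98"),
  freq    = c(23L, 234L, 987L, 7L, 623L, 53L, 65L)
)

# Collapse names that start with 'bilb' or 'jhs' to the bare prefix;
# names that do not match the pattern are left unchanged
dat$Rptname <- sub("^(bilb|jhs).*$", "\\1", dat$Rptname)

# Sum the frequencies within each Jen / Rptname combination
res <- aggregate(freq ~ Jen + Rptname, data = dat, FUN = sum)
res
```

The row order of the result follows aggregate's sorting of the grouping columns, so it may differ from the order shown in the question.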

Remove duplicates based on specific criteria

I have a dataset that looks something like this:
df <- structure(list(Claim.Num = c(500L, 500L, 600L, 600L, 700L, 700L,
100L, 200L, 300L), Amount = c(NA, 1000L, NA, 564L, 0L, 200L,
NA, 0L, NA), Company = structure(c(NA, 1L, NA, 4L, 2L, 3L, NA,
3L, NA), .Label = c("ATT", "Boeing", "Petco", "T Mobile"), class = "factor")), .Names =
c("Claim.Num", "Amount", "Company"), class = "data.frame", row.names = c(NA,
-9L))
I want to remove duplicate rows based on Claim.Num values, but when a Claim.Num is duplicated, the row to remove is the one matching the following criteria: df$Company == 'NA' | df$Amount == 0
In other words, remove records 1, 3, and 5.
I've gotten this far: df <- df[!duplicated(df$Claim.Num[which(df$Amount == 0 | df$Company == 'NA')]),]
The code runs without errors, but doesn't actually remove duplicate rows based on the required criteria. I think that's because I'm telling it to remove any duplicate Claim.Nums that match those criteria, rather than to remove any duplicate Claim.Num while treating certain Amounts and Companies preferentially for removal. Please note that I can't simply filter the dataset on the specified values, as there are other records with 0 or NA values that need to be kept (e.g. records 8 and 9 shouldn't be excluded because their Claim.Nums are not duplicated).
If you order your data frame first, then you can make sure duplicated keeps the ones you want:
df.tmp <- with(df, df[order(ifelse(is.na(Company) | Amount == 0, 1, 0)), ])
df.tmp[!duplicated(df.tmp$Claim.Num), ]
# Claim.Num Amount Company
# 2 500 1000 ATT
# 4 600 564 T Mobile
# 6 700 200 Petco
# 7 100 NA <NA>
# 8 200 0 Petco
# 9 300 NA <NA>
Slightly different approach
r <- merge(df,
aggregate(df$Amount,by=list(Claim.Num=df$Claim.Num),length),
by="Claim.Num")
result <-r[!(r$x>1 & (is.na(r$Company) | (r$Amount==0))),-ncol(r)]
result
# Claim.Num Amount Company
# 1 100 NA <NA>
# 2 200 0 Petco
# 3 300 NA <NA>
# 5 500 1000 ATT
# 7 600 564 T Mobile
# 9 700 200 Petco
This adds a column x to indicate which rows have Claim.Num present more than once, then filters the result based on your criteria. The use of -ncol(r) just removes the column x at the end.
Another way based on subset and logical indices:
subset(dat, !(duplicated(Claim.Num) | duplicated(Claim.Num, fromLast = TRUE)) |
(!is.na(Amount) & Amount))
Claim.Num Amount Company
2 500 1000 ATT
4 600 564 T Mobile
6 700 200 Petco
7 100 NA <NA>
8 200 0 Petco
9 300 NA <NA>

Resources