Match and replace values using 2 data frames (R)

I have two data frames. I need to match details$Name with info$Name and replace the corresponding values in details$Salary. The details data frame should retain all of its rows and there should be no NAs: if a match is found, replace the value; if not, leave it as it is.
details <- data.frame(Name = c("Aks","Bob","Caty","David","Enya","Fredrick","Gaby","Hema","Isac","Jaby","Katy"),
                      Age = c(12,22,33,43,24,67,41,19,25,24,32),
                      Gender = c("f","m","m","f","m","f","m","f","m","m","m"),
                      Salary = c(1500,2000,3.6,8500,1.2,1400,2300,2.5,5.2,2000,1265))
info <- data.frame(Name = c("caty","Enya","Dadi","Enta","Billu","Viku","situ","Hema","Ignu","Isac"),
                   income = c(2500,5600,3200,1522,2421,3121,4122,5211,1000,3500))
Expected Result :
Name Age Gender Salary
Aks 12 f 1500
Bob 22 m 2000
Caty 33 m 2500
David 43 f 8500
Enya 24 m 5600
Fredrick 67 f 1400
Gaby 41 m 2300
Hema 19 f 5211
Isac 25 m 3500
Jaby 24 m 2000
Katy 32 m 1265
None of the following gives the expected result:
dplyr::left_join(details, info, by = "Name")
dplyr::right_join(details, info, by = "Name")
dplyr::inner_join(details, info, by = "Name") # for other match-and-replace tasks this works fine, but not here
dplyr::full_join(details, info, by = "Name")
All of these produce NAs. I also tried the match function, but it did not give the desired result either. Any help would be highly appreciated.

You have Name in different cases in the two data frames, so we first need to bring them into the same case, then do a left_join and use coalesce to take income where a match was found and keep the original Salary otherwise.
library(dplyr)
details %>%
  mutate(Name = stringr::str_to_title(Name)) %>%
  left_join(info %>% mutate(Name = stringr::str_to_title(Name)), by = "Name") %>%
  mutate(Salary = coalesce(income, Salary)) %>%
  select(names(details))
# Name Age Gender Salary
#1 Aks 12 f 1500
#2 Bob 22 m 2000
#3 Caty 33 m 2500
#4 David 43 f 8500
#5 Enya 24 m 5600
#6 Fredrick 67 f 1400
#7 Gaby 41 m 2300
#8 Hema 19 f 5211
#9 Isac 25 m 3500
#10 Jaby 24 m 2000
#11 Katy 32 m 1265
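If you are on a recent dplyr (rows_update() gained its unmatched argument in version 1.1.0), roughly the same result could be sketched with an update join instead of coalesce; the key column below is just a temporary helper and not part of the original answer:
library(dplyr)
details %>%
  mutate(key = tolower(Name)) %>%
  rows_update(info %>% transmute(key = tolower(Name), Salary = income),
              by = "key", unmatched = "ignore") %>%  # names that appear only in info are skipped
  select(-key)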

A base R solution:
# positions of details$Name in info$Name, ignoring case (NA where there is no match)
matches <- match(tolower(details$Name), tolower(info$Name))
found <- !is.na(matches)
# replace Salary only where a match was found
details$Salary[found] <- info$income[matches[found]]
#Result
Name Age Gender Salary
1 Aks 12 f 1500
2 Bob 22 m 2000
3 Caty 33 m 2500
4 David 43 f 8500
5 Enya 24 m 5600
6 Fredrick 67 f 1400
7 Gaby 41 m 2300
8 Hema 19 f 5211
9 Isac 25 m 3500
10 Jaby 24 m 2000
11 Katy 32 m 1265
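For completeness, if data.table is an option, the same replacement can be sketched with an update join. This is a sketch starting from the original details and info, using a temporary lower-case key column:
library(data.table)
setDT(details)[, key := tolower(Name)]
setDT(info)[, key := tolower(Name)]
details[info, Salary := i.income, on = "key"]  # only matching rows are updated, no rows are added
details[, key := NULL]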

Related

Merge different dataset

I have a question: I need to merge two different datasets into one, but they have different classes. How can I do it? rbind doesn't work. Any ideas?
nycounties <- rgdal::readOGR("https://raw.githubusercontent.com/openpolis/geojson-italy/master/geojson/limits_IT_provinces.geojson")
city <- c("Novara", "Milano","Torino","Bari")
dimension <- c("150000", "5000000","30000","460000")
df <- cbind(city, dimension)
total <- rbind(nycounties,df)
Are you looking for something like this?
nycounties@data = data.frame(nycounties@data,
                             df[match(nycounties@data[, "prov_name"],
                                      df[, "city"]), ])
Output
nycounties@data[!is.na(nycounties@data$dimension), ]
prov_name prov_istat_code_num prov_acr reg_name reg_istat_code reg_istat_code_num prov_istat_code city dimension
0 Torino 1 TO Piemonte 01 1 001 Torino 30000
2 Novara 3 NO Piemonte 01 1 003 Novara 150000
12 Milano 15 MI Lombardia 03 3 015 Milano 5000000
81 Bari 72 BA Puglia 16 16 072 Bari 460000
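As an aside, the sp package also provides a merge() method for Spatial*DataFrame objects, so starting from the freshly read nycounties a similar join could be sketched like this (df from the cbind() above is a character matrix, hence the conversion):
library(sp)
df <- as.data.frame(df, stringsAsFactors = FALSE)
nycounties <- sp::merge(nycounties, df,
                        by.x = "prov_name", by.y = "city", all.x = TRUE)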

Convert values using a conversion table R

I am currently running statistical models on ACT and SAT scores. To help clean my data, I want to convert the ACT scores into its SAT equivalent. I found the following table online:
ACT SAT
<dbl> <dbl>
1 36 1590
2 35 1540
3 34 1500
4 33 1460
5 32 1430
6 31 1400
7 30 1370
8 29 1340
9 28 1310
10 27 1280
I want to replace the column ACT_Composite with the number in the SAT column of the conversion table. For instance, if one row displays an ACT_Composite score of 35, I want to input 1540.
If anyone has ideas on how to accomplish this, I would greatly appreciate it.
In base R you can use merge directly:
#Reading score table
df <- read.table(header = TRUE, text ="ACT SAT
36 1590
35 1540
34 1500
33 1460
32 1430
31 1400
30 1370
29 1340
28 1310
27 1280")
#Setting seed to reproduce df1
set.seed(1234)
# Create a data.frame with 50 sample scores
df1 <- data.frame(ACT_Composite = sample(27:36, 50, replace = TRUE))
# left-join df1 with df with keys ACT_Composite and ACT
result <- merge(df1, df,
                by.x = "ACT_Composite",
                by.y = "ACT",
                all.x = TRUE,
                sort = FALSE)
#The first 6 values of result
head(result)
ACT_Composite SAT
1 31 1400
2 31 1400
3 31 1400
4 31 1400
5 31 1400
6 36 1590
In data.table you can use merge in the same way:
library(data.table)
#Setting seed to reproduce df1
set.seed(1234)
# Create a data.table with 50 sample scores
df1 <- data.table(ACT_Composite = sample(27:36, 50, replace = TRUE))
# left-join df1 with df with keys ACT_Composite and ACT
result <- merge(df1, df,
                by.x = "ACT_Composite",
                by.y = "ACT",
                all.x = TRUE,
                sort = FALSE)
#The first 6 values of result
head(result)
ACT_Composite SAT
1: 36 1590
2: 32 1430
3: 31 1400
4: 35 1540
5: 31 1400
6: 32 1430
Alternatively, in data.table you can also try:
df1 <- data.table(ACT_Composite = sample(27:36, 50, replace = TRUE))
setDT(df) # you need to convert your look-up table df into a data.table
result <- df[df1, on = c(ACT = "ACT_Composite")]
head(result)
ACT_Composite SAT
1: 36 1590
2: 32 1430
3: 31 1400
4: 35 1540
5: 31 1400
6: 32 1430
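One caveat with merge(), even with sort = FALSE, is that it does not guarantee the original row order of df1. If preserving that order matters, a match()-based lookup (the same idea as in the first question above) is a simple sketch that keeps the rows exactly as they are:
# base R lookup that preserves df1's row order; unmatched scores would become NA
df1$SAT <- df$SAT[match(df1$ACT_Composite, df$ACT)]
head(df1)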

R - Regex to extract numbers prior to keyword with varying formatting

I need to extract numbers prior to their respective units from a string. Unfortunately the inputs sometimes vary and this is giving me trouble.
Sample data:
df <- data.frame(id = c(1, 2, 3, 4),
                 targets = c("1800 kcal 75 g", "2000kcal 80g", "1900 kcal,87g", "2035kcal,80g"))
> df
id targets
1 1 1800 kcal 75 g
2 2 2000kcal 80g
3 3 1900 kcal,87g
4 4 2035kcal,80g
Desired output:
df <- data.frame(id = c(1, 2, 3, 4),
                 targets = c("1800 kcal 75 g", "2000kcal 80g", "1900 kcal,87g", "2035kcal,80g"),
                 kcal_target = c("1800", "2000", "1900", "2035"),
                 protein_target = c("75", "80", "87", "80"))
> df
id targets kcal_target protein_target
1 1 1800 kcal 75 g 1800 75
2 2 2000kcal 80g 2000 80
3 3 1900 kcal,87g 1900 87
4 4 2035kcal,80g 2035 80
I got as far as this, but it breaks down when there is a space between the number and the unit keyword, or a comma after the unit.
library(dplyr)
library(stringr)
df <- df %>%
  mutate(calorie_target = str_extract_all(targets, regex("\\d+(?=kcal)|\\d+(?=kcal,)"))) %>%
  mutate(protein_target = str_extract_all(targets, regex("\\d+(?=g)")))
> df
id targets calorie_target protein_target
1 1 1800 kcal 75 g
2 2 2000kcal 80g 2000 80
3 3 1900 kcal,87g 87
4 4 2035kcal,80g 2035 80
edit: removed portion of code I'm not trying to capture
Base R with strcapture:
strcapture("(\\d+)\\D+(\\d+)", df$targets, list(calorie=0L, protein=0L))
# calorie protein
# 1 1800 75
# 2 2000 80
# 3 1900 87
# 4 2035 80
You can cbind this to the original:
cbind(df, strcapture("(\\d+)\\D+(\\d+)", df$targets, list(calorie=0L, protein=0L)))
# id targets calorie protein
# 1 1 1800 kcal 75 g 1800 75
# 2 2 2000kcal 80g 2000 80
# 3 3 1900 kcal,87g 1900 87
# 4 4 2035kcal,80g 2035 80
If you wanted to put this in a dplyr pipe, then
library(dplyr)
df %>%
bind_cols(strcapture("(\\d+)\\D+(\\d+)", .$targets, list(calorie=0L, protein=0L)))
# id targets calorie protein
# 1 1 1800 kcal 75 g 1800 75
# 2 2 2000kcal 80g 2000 80
# 3 3 1900 kcal,87g 1900 87
# 4 4 2035kcal,80g 2035 80
Note that strcapture uses regexec and regmatches under the hood, so this is similar to @ThomasIsCoding's answer in that respect.
For the regex,
\\d any digit (including unicode); this is similar to [0-9] but also includes other numerals (see https://stackoverflow.com/a/16621778/3358272);
\\D any non-digit
+ is one or more (of the preceding character/class)
A good reference if you need it is https://stackoverflow.com/a/22944075/3358272.
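If you are curious what that looks like without the strcapture wrapper, here is a rough sketch of the same capture done directly with regexec and regmatches, starting from the original df (it assumes every row matches, as in the sample data):
m <- regmatches(df$targets, regexec("(\\d+)\\D+(\\d+)", df$targets))
caps <- do.call(rbind, m)[, -1, drop = FALSE]  # column 1 is the full match; keep the two capture groups
cbind(df, calorie = as.integer(caps[, 1]), protein = as.integer(caps[, 2]))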
Here is a data.table option using regmatches + transpose
setDT(df)[, setNames(transpose(regmatches(targets, gregexpr("\\d+", targets))),
                     c("kcal_target", "protein_target")), id]
which gives
id kcal_target protein_target
1: 1 1800 75
2: 2 2000 80
3: 3 1900 87
4: 4 2035 80
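For completeness, the original stringr lookahead approach can also be made to work by allowing optional whitespace before the unit and using str_extract() (each target holds a single value); a rough sketch, with column names following the desired output rather than the original attempt:
library(dplyr)
library(stringr)
df %>%
  mutate(kcal_target    = str_extract(targets, "\\d+(?=\\s*kcal)"),
         protein_target = str_extract(targets, "\\d+(?=\\s*g\\b)"))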

Joining rows with columns to create vertical table

I am trying to figure out how to join 2 data frames to create a vertical table of the data. Here is some sample data:
people <- data.frame(person = c("John","David","Peter"), company = c("A", "B", "C"))
grades <- data.frame(person1=c(10, 40, 50, 60), person2=c(60,70,80, 100), person3=c(33,44,55, 75))
NOTE: The order of the columns in grades is the same as the order of the person column in the people data frame.
I would like to get a data frame like the following but can't think of how to get there. Would prefer a solution using base R (am using an older version of R so some packages don't work for me):
person | company | grade
-------------------------
John | A | 10
John | A | 40
John | A | 50
John | A | 60
David | B | 60
David | B | 70
David | B | 80
David | B | 100
Peter | C | 33
Peter | C | 44
Peter | C | 55
Peter | C | 75
We change the column names of 'grades' to the 'person' column from 'people', gather into 'long' format and then do a left_join:
library(tidyverse)
setNames(grades, people$person) %>%
  gather(person, grade) %>%
  left_join(people)
# person grade company
#1 John 10 A
#2 John 40 A
#3 John 50 A
#4 John 60 A
#5 David 60 B
#6 David 70 B
#7 David 80 B
#8 David 100 B
#9 Peter 33 C
#10 Peter 44 C
#11 Peter 55 C
#12 Peter 75 C
Or using base R with merge
merge(stack(setNames(grades, people$person)),
      people, all.x = TRUE, by.x = 'ind', by.y = 'person')
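Since gather() has since been superseded, the same reshaping could also be sketched with tidyr::pivot_longer(); note that pivot_longer() walks the data row by row rather than column by column, so an arrange() is added to restore the column-wise order of the expected output:
library(dplyr)
library(tidyr)
setNames(grades, people$person) %>%
  pivot_longer(everything(), names_to = "person", values_to = "grade") %>%
  left_join(people, by = "person") %>%
  arrange(match(person, people$person))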
A base R option using cbind would be
idx <- rep(seq_along(people$person), each = dim(grades)[1])
cbind(people[idx,], stack(unlist(grades))["values"])
Result
# person company values
#1 John A 10
#1.1 John A 40
#1.2 John A 50
#1.3 John A 60
#2 David B 60
#2.1 David B 70
#2.2 David B 80
#2.3 David B 100
#3 Peter C 33
#3.1 Peter C 44
#3.2 Peter C 55
#3.3 Peter C 75
Use unlist and stack on grades to get
stack(unlist(grades))
values ind
1 10 john_grades1
2 40 john_grades2
3 50 john_grades3
4 60 john_grades4
5 60 david1
6 70 david2
7 80 david3
8 100 david4
9 33 pj1
10 44 pj2
11 55 pj3
12 75 pj4
Since "The order of the columns in grades is the same as the order of the person column in the people data frame." we can use cbind next, after we expanded people to have the correct number of rows.
(idx <- rep(seq_along(people$person), each = dim(grades)[1]))
# [1] 1 1 1 1 2 2 2 2 3 3 3 3
Another option, probably a little faster would be
cbind(people[idx,], data.frame(grade = unlist(grades, use.names = FALSE)))

Find matching intervals in data frame by range of two column values

I have a data frame of time related events.
Here is an example:
Name Event Order Sequence start_event end_event duration Group
JOHN 1 A 0 19 19 ID1
JOHN 2 A 60 112 52 ID1
JOHN 3 A 392 429 37 ID1
JOHN 4 B 282 329 47 ID1
JOHN 5 C 147 226 79 ID1
JOHN 6 C 566 611 45 ID1
ADAM 1 A 19 75 56 ID2
ADAM 2 A 384 407 23 ID2
ADAM 3 B 0 79 79 ID2
ADAM 4 B 505 586 81 ID2
ADAM 5 C 140 205 65 ID2
ADAM 6 C 522 599 77 ID2
There are essentially two different groups, ID1 and ID2. For each of those groups, there are 18 different names. Each of those people appears in 3 different sequences, A-C. They have active time periods during those sequences, and I mark the start/end events and calculate the duration.
I'd like to isolate each person and find when they have matching time intervals with people in both the opposite and same group ID.
Using the example data above, I want to find when John and Adam appear during the same sequence at the same time. I then want to compare John to the other 17 names in ID1/ID2.
I do not need to match the exact amount of shared 'active' time; I just hope to isolate the rows that overlap.
I'm most comfortable with dplyr, but I can't crack this yet. I looked around and saw some similar examples with adjacency matrices, but those work with precise, exact data points. I can't figure out the strategy for a range/interval.
Thank you!
UPDATE:
Here is the example of the desired result
Name Event Order Sequence start_event end_event duration Group
JOHN 3 A 392 429 37 ID1
JOHN 5 C 147 226 79 ID1
JOHN 6 C 566 611 45 ID1
ADAM 2 A 384 407 23 ID2
ADAM 5 C 140 205 65 ID2
ADAM 6 C 522 599 77 ID2
I'm thinking you'd isolate each event row for John, mark its start/end time frame, and then iterate through every name and event in the remainder of the data frame to find time points that fall, first, within the same sequence and, second, within John's benchmarked start/end time frame.
As I understand it, you want to return any row where an event for John with a particular sequence number overlaps an event for anybody else with the same sequence value. To achieve this, you could use split-apply-combine to split by sequence, identify the overlapping rows, and then re-combine:
overlap <- function(start1, end1, start2, end2) pmin(end1, end2) > pmax(start2, start1)
do.call(rbind, lapply(split(dat, dat$Sequence), function(x) {
  jpos <- which(x$Name == "JOHN")
  njpos <- which(x$Name != "JOHN")
  over <- outer(jpos, njpos, function(a, b) {
    overlap(x$start_event[a], x$end_event[a], x$start_event[b], x$end_event[b])
  })
  x[c(jpos[rowSums(over) > 0], njpos[colSums(over) > 0]), ]
}))
# Name EventOrder Sequence start_event end_event duration Group
# A.2 JOHN 2 A 60 112 52 ID1
# A.3 JOHN 3 A 392 429 37 ID1
# A.7 ADAM 1 A 19 75 56 ID2
# A.8 ADAM 2 A 384 407 23 ID2
# C.5 JOHN 5 C 147 226 79 ID1
# C.6 JOHN 6 C 566 611 45 ID1
# C.11 ADAM 5 C 140 205 65 ID2
# C.12 ADAM 6 C 522 599 77 ID2
Note that my output includes two additional rows that are not shown in the question -- sequence A for John from time range [60, 112], which overlaps sequence A for Adam from time range [19, 75].
This could be pretty easily mapped into dplyr language:
library(dplyr)
overlap <- function(start1, end1, start2, end2) pmin(end1, end2) > pmax(start2, start1)
sliceRows <- function(name, start, end) {
  jpos <- which(name == "JOHN")
  njpos <- which(name != "JOHN")
  over <- outer(jpos, njpos, function(a, b) overlap(start[a], end[a], start[b], end[b]))
  c(jpos[rowSums(over) > 0], njpos[colSums(over) > 0])
}
dat %>%
  group_by(Sequence) %>%
  slice(sliceRows(Name, start_event, end_event))
# Source: local data frame [8 x 7]
# Groups: Sequence [3]
#
# Name EventOrder Sequence start_event end_event duration Group
# (fctr) (int) (fctr) (int) (int) (int) (fctr)
# 1 JOHN 2 A 60 112 52 ID1
# 2 JOHN 3 A 392 429 37 ID1
# 3 ADAM 1 A 19 75 56 ID2
# 4 ADAM 2 A 384 407 23 ID2
# 5 JOHN 5 C 147 226 79 ID1
# 6 JOHN 6 C 566 611 45 ID1
# 7 ADAM 5 C 140 205 65 ID2
# 8 ADAM 6 C 522 599 77 ID2
If you wanted to be able to compute the overlaps for a specified pair of users, this could be done by wrapping the operation into a function that specifies the pair of users to be processed:
overlap <- function(start1, end1, start2, end2) pmin(end1, end2) > pmax(start2, start1)
pair.overlap <- function(dat, user1, user2) {
  dat <- dat[dat$Name %in% c(user1, user2), ]
  do.call(rbind, lapply(split(dat, dat$Sequence), function(x) {
    jpos <- which(x$Name == user1)
    njpos <- which(x$Name == user2)
    over <- outer(jpos, njpos, function(a, b) {
      overlap(x$start_event[a], x$end_event[a], x$start_event[b], x$end_event[b])
    })
    x[c(jpos[rowSums(over) > 0], njpos[colSums(over) > 0]), ]
  }))
}
You could use pair.overlap(dat, "JOHN", "ADAM") to get the previous output. Generating the overlaps for every pair of users can now be done with combn and apply:
apply(combn(unique(as.character(dat$Name)), 2), 2, function(x) pair.overlap(dat, x[1], x[2]))
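Since apply() over the columns of combn() returns a list with one data frame per pair, you may want to name the elements and bind them into a single data frame afterwards; a rough sketch (pairs and res are just illustrative names):
pairs <- combn(unique(as.character(dat$Name)), 2)
res <- apply(pairs, 2, function(x) pair.overlap(dat, x[1], x[2]))
names(res) <- apply(pairs, 2, paste, collapse = "_")
res <- res[vapply(res, NROW, integer(1)) > 0]  # drop pairs with no overlapping events
do.call(rbind, res)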
