Merge dataframes based on regex condition - r

This problem involves R. I have two dataframes, represented by this minimal reproducible example:
a <- data.frame(geocode_selector = c("36005", "36047", "36061", "36081", "36085"), county_name = c("Bronx", "Kings", "New York", "Queens", "Richmond"))
b <- data.frame(geocode = c("360050002001002", "360850323001019"), jobs = c("4", "204"))
An example to help communicate the very specific operation I am trying to perform: the geocode_selector column in dataframe a contains the FIPS county codes of the five boroughs of NY. The geocode column in dataframe b is the 15-digit ID of a specific Census block. The first five digits of a geocode match a more general geocode_selector, indicating which county the Census block is located in. I want to add a column to b specifying which county each census block falls under, based on which geocode_selector each geocode in b matches with.
Generally, I'm trying to merge dataframes based on a regex condition. Ideally, I'd like to perform a full merge carrying all of the columns of a over to b and not just the county_name.
I tried something along the lines of:
b[, "county_name"] <- NA
for (i in 1:nrow(b)) {
  for (j in 1:nrow(a)) {
    if (grepl(a$geocode_selector[j], b$geocode[i]) == TRUE) {
      b$county_name[i] <- a$county_name[j]
    }
  }
}
but it took an extremely long time for the large datasets I am actually processing and the finished product was not what I wanted.
Any insight on how to merge dataframes conditionally based on a regex condition would be much appreciated.

You could do this...
b$geocode_selector <- substr(b$geocode,1,5)
b2 <- merge(b, a, all.x=TRUE) #by default it will merge on common column names
b2
geocode_selector geocode jobs county_name
1 36005 360050002001002 4 Bronx
2 36085 360850323001019 204 Richmond
If you wish, you can delete the geocode_selector column from b2 with b2[,1] <- NULL
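If the selectors do not all have the same length, the fixed-width substr() step no longer applies. A minimal base R sketch of a prefix match for that case follows; the helper name first_prefix is hypothetical, and the sketch assumes each geocode should take the first selector that is a prefix of it:

```r
a <- data.frame(
  geocode_selector = c("36005", "36047", "36061", "36081", "36085"),
  county_name = c("Bronx", "Kings", "New York", "Queens", "Richmond"),
  stringsAsFactors = FALSE
)
b <- data.frame(
  geocode = c("360050002001002", "360850323001019"),
  jobs = c("4", "204"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: index of the first selector that is a prefix of
# `code`, or NA when no selector matches
first_prefix <- function(code, selectors) {
  which(startsWith(code, selectors))[1]
}

idx <- vapply(b$geocode, first_prefix, integer(1),
              selectors = a$geocode_selector, USE.NAMES = FALSE)

# Carry every column of a over to b; unmatched rows would be filled with NA
b_joined <- cbind(b, a[idx, , drop = FALSE])
rownames(b_joined) <- NULL
b_joined
```

This still compares every geocode against every selector, so for fixed-width codes the substr() plus merge() approach is likely the faster choice.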

We can use sub to create the 'geocode_selector' and then do the join
library(data.table)
setDT(a)[as.data.table(b)[, geocode_selector := sub('^(.{5}).*', '\\1', geocode)],
on = .(geocode_selector)]
# geocode_selector county_name geocode jobs
#1: 36005 Bronx 360050002001002 4
#2: 36085 Richmond 360850323001019 204

This is a great opportunity to use dplyr. I also tend to like the string handling functions in stringr, such as str_sub.
library(dplyr)
library(stringr)
# tibble() replaces the deprecated data_frame()
a <- tibble(geocode_selector = c("36005", "36047", "36061", "36081", "36085"),
            county_name = c("Bronx", "Kings", "New York", "Queens", "Richmond"))
b <- tibble(geocode = c("360050002001002", "360850323001019"),
            jobs = c("4", "204"))
b %>%
mutate(geocode_selector = str_sub(geocode, end = 5)) %>%
inner_join(a, by = "geocode_selector")
#> # A tibble: 2 x 4
#> geocode jobs geocode_selector county_name
#> <chr> <chr> <chr> <chr>
#> 1 360050002001002 4 36005 Bronx
#> 2 360850323001019 204 36085 Richmond

Related

Separate a string of multiple dates and names in R

I have a dataframe with 2 columns, where the first column lists companies and the second column contains strings of multiple dates and company names, as follows:
data=data.frame('Company'=(c("A","B","C")),
'Bank'=c("1/13/2020 Bank A 5/12/2020 Bank H C 11/9/2020 HelloBank",
"2/14/2020 HopeBank 1/9/2020 Liberty Bank SA",
"10/18/2020 Securities"))
I would like to separate column "Bank" into multiple columns of Dates and Bank Names, such that:
data=data.frame('Company'=(c("A","B","C")),
"Date1"=(c("1/13/2020","2/14/2020","10/18/2020")),
'Bank1'=c("Bank A", "HopeBank","Securities"),
"Date2"=(c("5/12/2020","1/9/2020",NA)),
'Bank2'=c("Bank H C", "Liberty Bank SA",NA),
"Date3"=(c("11/9/2020",NA,NA)),
'Bank3'=c("HelloBank", NA,NA))
I have tried using library(stringr) but the formats of the dates are not consistent. Also, I do not know how many variables I will need in the final dataframe, and some of the strings in the "Bank" column are very long (up to 824 nchar).
I have also tried using separate from tidyr but without success.
Here is a base R option using strsplit:
v <- strsplit(data$Bank, "\\s(?=(\\d+\\/))|(?<=\\d)\\s", perl = TRUE)
data <- cbind(
data[1],
`colnames<-`(
do.call(rbind, lapply(v, `length<-`, max(lengths(v)))),
paste0(c("Date", "Bank"), rep(1:(max(lengths(v)) / 2), each = 2))
)
)
which gives
> data
Company Date1 Bank1 Date2 Bank2 Date3 Bank3
1 A 1/13/2020 Bank A 5/12/2020 Bank H C 11/9/2020 HelloBank
2 B 2/14/2020 HopeBank 1/9/2020 Liberty Bank SA <NA> <NA>
3 C 10/18/2020 Securities <NA> <NA> <NA> <NA>
If you don't know how many banks there might be in each row, you are better off creating a dataframe in long format. Something like this will do it, using the tidyverse...
library(tidyverse)
data_long <- data %>%
mutate(Bank = str_replace_all(Bank, "( \\d+/)", "#\\1"), #add markers between banks
Bank = str_split(Bank, "#")) %>% #split at markers
unnest(Bank) %>% #convert to one row per entry
mutate(Bank = str_squish(Bank)) %>% #trim white space
separate(Bank, into = c("Date", "BankName"), sep = " ", extra = "merge")
data_long
Company Date BankName
<chr> <chr> <chr>
1 A 1/13/2020 Bank A
2 A 5/12/2020 Bank H C
3 A 11/9/2020 HelloBank
4 B 2/14/2020 HopeBank
5 B 1/9/2020 Liberty Bank SA
6 C 10/18/2020 Securities
You might then want to convert Date into date format.
If you really want it in wide format, use pivot_wider.
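As a minimal sketch of the date conversion, assuming every date follows the month/day/year pattern seen in the sample data (the vector name dates_chr is only illustrative):

```r
# Character dates as they appear in the Date column of the long data
dates_chr <- c("1/13/2020", "5/12/2020", "11/9/2020")

# %m and %d accept one- or two-digit months and days
dates <- as.Date(dates_chr, format = "%m/%d/%Y")
dates

# Strings that do not fit the format become NA rather than raising an error,
# which makes inconsistent entries easy to find afterwards
bad <- as.Date("not a date", format = "%m/%d/%Y")
is.na(bad)
```

Since the question mentions inconsistent date formats, checking is.na() on the converted column is one way to locate the rows that need separate handling.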

Compare a variable by state abbreviations

How can I compare a variable by state abbreviations?
My data set has 5 variables currently. One of them is Location, and it is written as: "Raleigh, NC"
I need to create a variable that contains the two-character state abbreviation for each observation, and afterward another to group them by state. Each observation is a college, with its classification (private/public), in-state/out-of-state tuition, and location.
This should work for you, if I understood your issue correctly.
Note: Please always share sample data using dput(your_dataset) or dput(head(your_dataset))
library(tidyverse)
d<- tibble(id = 1:3,
Location = c("Newyork, NY", "Raleigh, NC", "Delhi, IN"))
# split on the comma only, so multi-word city names stay intact
d %>% separate(Location, into = c("city", "state"), sep = ",") %>%
  mutate_at(vars("city", "state"), str_trim)
# A tibble: 3 x 3
id city state
<int> <chr> <chr>
1 1 Newyork NY
2 2 Raleigh NC
3 3 Delhi IN
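The question also asks for a comparison by state. A minimal base R sketch of that second step follows; the tuition column and its values are invented for illustration, and the sketch assumes the state abbreviation is everything after the last comma:

```r
# Hypothetical sample data: the tuition values are made up
d <- data.frame(
  Location = c("Raleigh, NC", "Durham, NC", "New York, NY"),
  tuition = c(10000, 12000, 20000),
  stringsAsFactors = FALSE
)

# Keep only what follows the last comma and any whitespace after it
d$state <- sub("^.*,\\s*", "", d$Location)

# Compare a variable by state, here the mean tuition per state
by_state <- aggregate(tuition ~ state, data = d, FUN = mean)
by_state
```

Any other summary function can be passed as FUN in place of mean.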

Cleaning Origin and Destination data with duplicates but different factor level

I have some GIS data with origins and destinations (OD) and information about the time of day of each OD. I intend to make a map of this, and to color the ODs by the time of day.
One issue is that some ODs appear in the data set with both day and night, possibly with origin and destination in a different order. I would like to mark those differently, e.g. "Day/Night".
Is there an easy way to do this? My MWE has just one such OD, but I would need to identify it among several others. I can find the duplicates regardless of order, but I don't know how to find out whether both time cases are present, or how to replace them with "Day/Night".
library(data.table)
Origin<-c("London", "Paris", "Lisbon", "Madrid", "Berlin", "London")
Destination<-c("Paris", "London", "Berlin","Lisbon", "Lisbon", "Paris")
Time=factor(c("Day", "Night", "Day", "Day/Night","Day", "Day/Night"))
dt<-data.table(Origin=Origin, Destination=Destination, Time=Time)
#duplicates regardless of order
dat.sort = t(apply(dt[,.(Origin,Destination)], 1, sort))
dt[duplicated(dat.sort) | duplicated(dat.sort, fromLast=TRUE),]
You can do that using the dplyr package as follows.
Feel free to change the conditions to fit your needs.
library(data.table)
library(dplyr)
# Creating data
dt <-
data.table(
Origin = c("London", "Paris", "Italy", "Spain", "Portugal", "Poland"),
Destination = c("Paris", "London", "Norway", "Portugal", "Spain", "Spain"),
Time = c("Day", "Night", "Day", NA_character_, NA_character_, NA_character_)
)
dt
# Origin Destination Time
# London Paris Day
# Paris London Night
# Italy Norway Day
# Spain Portugal <NA>
# Portugal Spain <NA>
# Poland Spain <NA>
dt %>%
  # pmin and pmax are used to sort the 2 columns
  # in order to group by them regardless of their order
  group_by(Origin2 = pmin(Origin, Destination),
           Destination2 = pmax(Origin, Destination)) %>%
  mutate(count = n(), # To check if Origin/Destination are repeated or not
         row = row_number(), # Placeholder to know if it was the first or the second occurrence
         # If not repeated then make Time = Day
         # If repeated and first occurrence then Time = Day
         # If repeated and second occurrence then Time = Night
         Time = case_when(count == 1 ~ "Day",
                          count == 2 & row == 1 ~ "Day",
                          count == 2 & row == 2 ~ "Night")) %>%
ungroup() %>%
select(Origin, Destination, Time)
# Origin Destination Time
# <chr> <chr> <chr>
# 1 London Paris Day
# 2 Paris London Night
# 3 Italy Norway Day
# 4 Spain Portugal Day
# 5 Portugal Spain Night
# 6 Poland Spain Day
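The pmin/pmax idea for an order-independent key can also be applied directly to the data from the question. The following base R sketch is an illustration rather than part of the answer above; the helper name covers_both is hypothetical, and Time is kept as a character vector instead of a factor:

```r
Origin <- c("London", "Paris", "Lisbon", "Madrid", "Berlin", "London")
Destination <- c("Paris", "London", "Berlin", "Lisbon", "Lisbon", "Paris")
Time <- c("Day", "Night", "Day", "Day/Night", "Day", "Day/Night")

# Order-independent key: alphabetically smaller city first
Pair <- paste(pmin(Origin, Destination), pmax(Origin, Destination), sep = "-")

# Hypothetical helper: TRUE when a pair is served both by day and by night
covers_both <- function(t) {
  all(c("Day", "Night") %in% t) || "Day/Night" %in% t
}

# One logical per pair, named by the pair key
both <- vapply(split(Time, Pair), covers_both, logical(1))

# Relabel every row that belongs to such a pair
Time[both[Pair]] <- "Day/Night"
data.frame(Origin, Destination, Time, Pair)
```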
Thanks to @Nareman Darwisch for the dplyr solution, which inspired my own solution with data.table.
I am creating a new variable as a unique ID for each Origin-Destination pair:
dat.sort = t(apply(dt[,.(Origin,Destination)], 1, sort))
dt.temp<-data.table(dat.sort)
dt.temp[,unique.name:=paste(V1,V2)]
dt$unique.name<-factor(dt.temp$unique.name)
Then I can either count the unique values of the factor within each group, or check whether more than one of the 3 levels occurs in the group. Based on this I can recode the labels to the "Day/Night" level whenever the count is > 1 or the other condition is TRUE:
dt[,No.levels:=length(unique(c(Time))), by=unique.name]
dt[,No.levels.logi:=sum(c(Time) %in% c(1:3))>1 , by=unique.name]
What I would like to understand is how I could use a logical condition that looks at the levels within each group and compares them with the cases I want:
dt[,No.levels.logi:=sum(levels(Time) %in% c("Day", "Night"))>1 , by=unique.name]
But I guess the levels command always gives me all three levels.
If I understand correctly, the OP wants to:
1. identify city pairs regardless of the order of origin and destination, e.g. London-Paris belongs to the same city pair as Paris-London;
2. collapse separate rows if a city pair is operated Day and Night or Day/Night, or update the original dataset.
This is what I would do:
library(data.table)
dt <- data.table(Origin, Destination, Time)
# add city pair as unique grouping variable
dt[, Pair := paste(pmin(Origin, Destination), pmax(Origin, Destination), sep = "-")][]
# identify city pairs which are operated day and night
pairs_DN <- dt[, all(c("Day", "Night") %in% Time) | "Day/Night" %in% Time, by = Pair][(V1), .(Pair)]
# update original dataset by an update join
dt[pairs_DN, on = "Pair", Time := "Day/Night"][]
Origin Destination Time Pair
1: London Paris Day/Night London-Paris
2: Paris London Day/Night London-Paris
3: Lisbon Berlin Day Berlin-Lisbon
4: Madrid Lisbon Day/Night Lisbon-Madrid
5: Berlin Lisbon Day Berlin-Lisbon
6: London Paris Day/Night London-Paris
The key point is to identify the city pairs which fulfil the second requirement:
dt[, all(c("Day", "Night") %in% Time) | "Day/Night" %in% Time, by = Pair]
Pair V1
1: London-Paris TRUE
2: Berlin-Lisbon FALSE
3: Lisbon-Madrid TRUE
So, there is no need to deal with factor levels. BTW, factor levels are an attribute of the whole column and do not change when subsetting or grouping. What does change is which of the levels are used in a subset or group.
pairs_DN contains the unique key of those city pairs
Pair
1: London-Paris
2: Lisbon-Madrid
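The remark about factor levels can be checked with a short base R sketch (the variable names are only illustrative): levels() reports every level of the column even after subsetting, while droplevels() or unique() reflect the values actually present in the subset.

```r
Time <- factor(c("Day", "Night", "Day", "Day/Night"))

# Subsetting keeps all levels of the original factor
sub_time <- Time[Time == "Day"]
levels(sub_time)

# droplevels() removes the levels that are unused in the subset
levels(droplevels(sub_time))

# unique() on the values shows what is actually present
as.character(unique(sub_time))
```

This is why a grouped condition should test the values (for example with %in% on Time) rather than levels(Time).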

How do I extract part of the data from a column and paste it in another column using R?

I want to extract part of the data from a column and paste it in another column using R:
My data looks like this:
names <- c("Sia","Ryan","J","Ricky")
country <- c("London +1234567890","Paris", "Sydney +0123458796", "Delhi")
mobile <- c(NULL,+3579514862,NULL,+5554848123)
data <- data.frame(names,country,mobile)
data
> data
names country mobile
1 Sia London +1234567890 NULL
2 Ryan Paris +3579514862
3 J Sydney +0123458796 NULL
4 Ricky Delhi +5554848123
I would like to separate phone number from country column wherever applicable and put it into mobile.
The output should be,
> data
names country mobile
1 Sia London +1234567890
2 Ryan Paris +3579514862
3 J Sydney +0123458796
4 Ricky Delhi +5554848123
You can use the tidyverse package which has a separate function.
Note that I use NA rather than NULL inside the mobile vector, and the numbers are stored as strings.
Also, I use the option stringsAsFactors = F when creating the dataframe to avoid working with factors.
names <- c("Sia","Ryan","J","Ricky")
country <- c("London +1234567890","Paris", "Sydney +0123458796", "Delhi")
mobile <- c(NA, "+3579514862", NA, "+5554848123")
data <- data.frame(names,country,mobile, stringsAsFactors = F)
library(tidyverse)
data %>% as_tibble() %>%
separate(country, c("country", "number"), sep = " ", fill = "right") %>%
mutate(mobile = coalesce(mobile, number)) %>%
select(-number)
# A tibble: 4 x 3
names country mobile
<chr> <chr> <chr>
1 Sia London +1234567890
2 Ryan Paris +3579514862
3 J Sydney +0123458796
4 Ricky Delhi +5554848123
EDIT
If you want to do this without the pipes (which I would not recommend because the code becomes much harder to read) you can do this:
select(mutate(separate(as_tibble(data), country, c("country", "number"), sep = " ", fill = "right"), mobile = coalesce(mobile, number)), -number)
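Splitting on a single space assumes one-word city names. As a hedged alternative, here is a base R sketch that looks for a trailing "+digits" token instead; it assumes phone numbers always start with "+" and sit at the end of the string:

```r
names <- c("Sia", "Ryan", "J", "Ricky")
country <- c("London +1234567890", "Paris", "Sydney +0123458796", "Delhi")
mobile <- c(NA, "+3579514862", NA, "+5554848123")

# Which entries end in a phone number such as "+1234567890"?
has_num <- grepl("\\+\\d+$", country)

# Pull the number out where present, NA otherwise
number <- ifelse(has_num, sub("^.*\\s(\\+\\d+)$", "\\1", country), NA_character_)

# Fill missing mobile values from the extracted numbers
mobile <- ifelse(is.na(mobile), number, mobile)

# Remove the number (and the space before it) from the country column
country <- trimws(sub("\\s*\\+\\d+$", "", country))

data <- data.frame(names, country, mobile, stringsAsFactors = FALSE)
data
```

Because only the trailing number is removed, a value such as "New York +123" would keep "New York" intact.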

A fast way to merge named vectors of different length into a data frame (preserving name information as column name) in R

I have a list L of named vectors. For example, 1st element:
> L[[1]]
$event
[1] "EventA"
$time
[1] "1416355303"
$city
[1] "Los Angeles"
$region
[1] "California"
$Locale
[1] "en-GB"
When I unlist each element of the list, the resulting vectors look like this (for the first 3 elements):
> unlist(L[[1]])
event time city region Locale
"EventA" "1416355303" "Los Angeles" "California" "en-GB"
> unlist(L[[2]])
event time Locale
"EventB" "1416417567" "en-GB"
> unlist(L[[3]])
event properties.time
"EventM" "1416417569"
I have over 0.5 million elements in the list and each one has up to 42 of these features/names. I have to merge them into a data frame, taking into account their names and the fact that not all of them have the same number of features or names (in the example above, V2 has no information for region and city). At the moment, I loop through the whole list:
df1 <- merge(stack(unlist(L[[1]])), stack(unlist(L[[2]])),
by = "ind", all = TRUE)
suppressWarnings(for (i in 3:length(L)){
df1 <- merge(df1, stack(unlist(L[[i]])), by = "ind", all = TRUE)
})
df1 <- as.data.frame(t(df1))
For the example above this returns:
V1 V2 V3 V4 V5
ind city event Locale region time
values.x Los Angeles EventA en-GB California 1416355303
values.y <NA> EventB en-GB <NA> 1416417567
values <NA> EventM <NA> <NA> 1416417569
which is what I want. However, because the list is long and every time the command
df1 <- merge(df1, stack(unlist(L[[i]])), by = "ind", all = TRUE)
runs it loads the entire data frame (df1), the loop takes a very long time. Therefore, I was wondering if anyone knows a better/faster way to code this. In other words: given a long list of named vectors with different lengths, is there a fast way to merge them into a data frame like the one described above?
For example, is there a way of doing this using foreach and %dopar%? In any case, any faster approach is welcome.
I've heard the data.table package is pretty fast. And rbindlist is perfect for this list.
library(data.table)
rbindlist(L, fill=TRUE)
# event time city region Locale
# 1: EventA 1416355303 Los Angeles California en-GB
# 2: EventB 1416417567 NA NA en-GB
# 3: EventM 1416417569 NA NA NA
I'm not sure why you use merge. It seems to me like you should simply rbind.
L <- list(list(event = "EventA", time = 1416355303,
city = "Los Angeles", region = "California",
Locale = "en-GB"),
list(event = "EventB", time = 1416417567,
Locale = "en-GB"),
list(event = "EventM", time = 1416417569))
library(plyr)
do.call(rbind.fill, lapply(L, as.data.frame))
# event time city region Locale
#1 EventA 1416355303 Los Angeles California en-GB
#2 EventB 1416417567 <NA> <NA> en-GB
#3 EventM 1416417569 <NA> <NA> <NA>
Here's a compact solution to consider:
library(reshape2)
dcast(melt(L), L1 ~ L2, value.var = "value")
# L1 city event Locale region time
# 1 1 Los Angeles EventA en-GB California 1416355303
# 2 2 <NA> EventB en-GB <NA> 1416417567
# 3 3 <NA> EventM <NA> <NA> 1416417569
The original post is about merging named vectors. Define the first two given in the example above as vectors:
C1 <- c(event = "EventA", time = 1416355303,
        city = "Los Angeles", region = "California",
        Locale = "en-GB")
C2 <- c(event = "EventB", time = 1416417567,
        Locale = "en-GB")
If you want to merge them and are OK with giving up the extra data in the longer vector, then you can index the longer vector by the names in the shorter vector:
C1 <- C1[names(C2)]
Then just use rbind or cbind. Example with rbind:
C1_C2 <- rbind(C1,C2)
C1_C2
event time Locale
C1 "EventA" "1416355303" "en-GB"
C2 "EventB" "1416417567" "en-GB"
You can combine the final two steps but will lose the name of the first vector if you do that
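If dropping the extra fields is not acceptable, each vector can instead be padded with NA for the names it lacks before binding. A minimal base R sketch follows; the helper name pad is hypothetical, and the sketch assumes every element is a named character vector:

```r
L <- list(
  c(event = "EventA", time = "1416355303", city = "Los Angeles",
    region = "California", Locale = "en-GB"),
  c(event = "EventB", time = "1416417567", Locale = "en-GB"),
  c(event = "EventM", time = "1416417569")
)

# Union of all names, in order of first appearance
all_names <- unique(unlist(lapply(L, names)))

# Hypothetical helper: reorder a vector to all_names, NA where a name is missing
pad <- function(x) setNames(x[all_names], all_names)

# Bind the padded vectors row-wise into a character matrix, then a data frame
m <- do.call(rbind, lapply(L, pad))
df <- as.data.frame(m, stringsAsFactors = FALSE)
df
```

All columns come out as character, so numeric fields such as time would need converting afterwards.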
