Separate a string of multiple dates and names in R

I have a dataframe with 2 columns, where the first column lists companies and the second column are strings of multiple dates and company names as follows:
data=data.frame('Company'=(c("A","B","C")),
'Bank'=c("1/13/2020 Bank A 5/12/2020 Bank H C 11/9/2020 HelloBank",
"2/14/2020 HopeBank 1/9/2020 Liberty Bank SA",
"10/18/2020 Securities"))
I would like to separate column "Bank" into multiple columns of Dates and Bank Names, such that:
data=data.frame('Company'=(c("A","B","C")),
"Date1"=(c("1/13/2020","2/14/2020","10/18/2020")),
'Bank1'=c("Bank A", "HopeBank","Securities"),
"Date2"=(c("5/12/2020","1/9/2020",NA)),
'Bank2'=c("Bank H C", "Liberty Bank SA",NA),
"Date3"=(c("11/9/2020 ",NA,NA)),
'Bank3'=c("HelloBank", NA,NA))
I have tried using library(stringr) but the formats of the dates are not consistent. Also, I do not know how many variables I will need in the final dataframe, and some of the strings in the "Bank" column are very long (up to 824 nchar).
I have also tried using separate from tidyr but without success.

Here is a base R option using strsplit:
v <- strsplit(data$Bank, "\\s(?=(\\d+\\/))|(?<=\\d)\\s", perl = TRUE)
data <- cbind(
data[1],
`colnames<-`(
do.call(rbind, lapply(v, `length<-`, max(lengths(v)))),
paste0(c("Date", "Bank"), rep(1:(max(lengths(v)) / 2), each = 2))
)
)
which gives
> data
Company Date1 Bank1 Date2 Bank2 Date3 Bank3
1 A 1/13/2020 Bank A 5/12/2020 Bank H C 11/9/2020 HelloBank
2 B 2/14/2020 HopeBank 1/9/2020 Liberty Bank SA <NA> <NA>
3 C 10/18/2020 Securities <NA> <NA> <NA> <NA>
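To see what the split pattern does, it can help to run it on a single string. The following is a minimal sketch using the first value from the question's data; the names `s`, `parts`, `dates` and `banks` are only illustrative.

```r
# One value from the question's Bank column
s <- "1/13/2020 Bank A 5/12/2020 Bank H C 11/9/2020 HelloBank"

# Split on a space that is followed by digits and a slash (start of a date),
# or on a space that comes directly after a digit (end of a date)
parts <- strsplit(s, "\\s(?=(\\d+\\/))|(?<=\\d)\\s", perl = TRUE)[[1]]

# Dates land in the odd positions, bank names in the even positions
dates <- parts[c(TRUE, FALSE)]
banks <- parts[c(FALSE, TRUE)]

print(dates)
print(banks)
```

One caveat under this pattern: a bank name that itself contains a digit followed by a space (a hypothetical "Bank 24 Ltd", for instance) would also be split at that point, so it is worth checking that every row yields an even number of pieces.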

If you don't know how many banks there might be in each row, you are better off creating a dataframe in long format. Something like this will do it, using the tidyverse...
library(tidyverse)
data_long <- data %>%
mutate(Bank = str_replace_all(Bank, "( \\d+/)", "#\\1"), #add markers between banks
Bank = str_split(Bank, "#")) %>% #split at markers
unnest(Bank) %>% #convert to one row per entry
mutate(Bank = str_squish(Bank)) %>% #trim white space
separate(Bank, into = c("Date", "BankName"), sep = " ", extra = "merge")
data_long
Company Date BankName
<chr> <chr> <chr>
1 A 1/13/2020 Bank A
2 A 5/12/2020 Bank H C
3 A 11/9/2020 HelloBank
4 B 2/14/2020 HopeBank
5 B 1/9/2020 Liberty Bank SA
6 C 10/18/2020 Securities
You might then want to convert Date into date format.
If you really want it in wide format, use pivot_wider.
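Both follow-up steps could look roughly like the sketch below. It rebuilds a small table shaped like data_long so that it runs on its own, and it assumes every date is in month/day/year order; the question mentions inconsistent date formats, so the `format` string may need adjusting. The helper column `n` and the name `data_wide` are illustrative.

```r
library(dplyr)
library(tidyr)

# A small long-format table shaped like data_long above
data_long <- tibble(
  Company  = c("A", "A", "B"),
  Date     = c("1/13/2020", "5/12/2020", "2/14/2020"),
  BankName = c("Bank A", "Bank H C", "HopeBank")
)

data_wide <- data_long %>%
  # assumes month/day/year; unparseable strings become NA
  mutate(Date = as.Date(Date, format = "%m/%d/%Y")) %>%
  group_by(Company) %>%
  # number the entries within each company: 1, 2, ...
  mutate(n = row_number()) %>%
  ungroup() %>%
  # one Date<n> and one BankName<n> column per entry number
  pivot_wider(names_from = n, values_from = c(Date, BankName), names_sep = "")

print(data_wide)
```

Companies with fewer entries than the maximum get NA in the trailing columns, which matches the layout requested in the question.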

Related

Combine every two rows of data in R

I have a csv file that I have read in, but I now need to combine every two rows. There are 2000 rows in total and I need to reduce them to 1000. Every pair of rows has the same account number in one column and the address split across the two rows in another. Each observation takes up two rows, and I want to combine the two address rows into one. For example, rows 1 and 2 are Acct# 1234 and have 123 Hollywood Blvd and LA California 90028 on their own lines respectively.
Using the tidyverse, you can group_by the Acct number and summarise with str_c:
library(tidyverse)
df %>%
group_by(Acct) %>%
summarise(Address = str_c(Address, collapse = " "))
# A tibble: 2 × 2
Acct Address
<dbl> <chr>
1 1234 123 Hollywood Blvd LA California 90028
2 4321 55 Park Avenue NY New York State 6666
Data:
df <- data.frame(
Acct = c(1234, 1234, 4321, 4321),
Address = c("123 Hollywood Blvd", "LA California 90028",
"55 Park Avenue", "NY New York State 6666")
)
It can be fairly simple with the data.table package:
# assuming `dataset` is the name of your dataset, the account-number column is called 'actN' and the address column is 'adr'
library(data.table)
dataset2 <- data.table(dataset)[, .(whole = paste0(adr, collapse = ", ")), by = .(actN)]
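If no extra package is wanted, the same grouping idea can be sketched in base R with aggregate. This is an alternative to the answers above rather than a restatement of them; it reuses the sample df from the first answer and assumes the two address rows of each account already appear in the right order. The name `combined` is illustrative.

```r
# Sample data from the tidyverse answer above
df <- data.frame(
  Acct = c(1234, 1234, 4321, 4321),
  Address = c("123 Hollywood Blvd", "LA California 90028",
              "55 Park Avenue", "NY New York State 6666"),
  stringsAsFactors = FALSE
)

# Paste the address pieces of each account into one string;
# `collapse` is passed through aggregate() to paste()
combined <- aggregate(Address ~ Acct, data = df, FUN = paste, collapse = " ")

print(combined)
```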

R - pivoting duplicate rows into multiple column with unknown number of columns [duplicate]

I have a data frame created with this:
companies = c("ABC Ltd", "ABC Ltd", "ABC Ltd", "Derwent plc", "Derwent plc")
sic = c("12345", "24155", "31231", "55346", "34234")
df = data.frame(companies, sic)
As you can see, the companies column is duplicated due to the SIC code.
I want to pivot wider so that each SIC code has its own column and that it is only 1 company per row.
Something like the following where I don't know how many columns there might be (i.e. there could be some companies with 20 sic codes).
I have tried pivoting it using pivot_wider but I cannot get it to do what I need it to do.
Any help is much appreciated.
With the packages dplyr and tidyr you can use
library(dplyr)
library(tidyr)
df %>%
group_by(companies) %>%
mutate(row_n = row_number()) %>%
pivot_wider(id_cols = companies, names_from = row_n, values_from = sic, names_glue = "sic.{row_n}")
Output
# A tibble: 2 x 4
# Groups: companies [2]
# companies sic.1 sic.2 sic.3
# <chr> <chr> <chr> <chr>
# 1 ABC Ltd 12345 24155 31231
# 2 Derwent plc 55346 34234 NA
You can split sic by companies, call [ with 1:max(lengths(x)) and rbind the result.
x <- split(df$sic, df$companies)
do.call(rbind, lapply(x, "[", 1:max(lengths(x))))
# [,1] [,2] [,3]
#ABC Ltd "12345" "24155" "31231"
#Derwent plc "55346" "34234" NA
You have problems because there's not yet a "time" variable that differentiates the measures for each ID. You could use ave to make one and use reshape.
res <- reshape(transform(df, t=ave(companies, companies, FUN=seq)),
idvar="companies", timevar="t", direction="wide")
res
# companies sic.1 sic.2 sic.3
# 1 ABC Ltd 12345 24155 31231
# 4 Derwent plc 55346 34234 <NA>
However, you may want to check whether the n-th measurement of one ID really corresponds to the n-th measurement of another, because the sequence created here only reflects row order.

How do I extract part of the data from a column and paste it in another column using R?

I want to extract part of the data from a column and paste it in another column using R:
My data looks like this:
names <- c("Sia","Ryan","J","Ricky")
country <- c("London +1234567890","Paris", "Sydney +0123458796", "Delhi")
mobile <- c(NULL,+3579514862,NULL,+5554848123)
data <- data.frame(names,country,mobile)
data
> data
names country mobile
1 Sia London +1234567890 NULL
2 Ryan Paris +3579514862
3 J Sydney +0123458796 NULL
4 Ricky Delhi +5554848123
I would like to separate phone number from country column wherever applicable and put it into mobile.
The output should be,
> data
names country mobile
1 Sia London +1234567890
2 Ryan Paris +3579514862
3 J Sydney +0123458796
4 Ricky Delhi +5554848123
You can use the separate function from tidyr, which is loaded with the tidyverse.
Note that I use NA rather than NULL inside the mobile vector, because c() silently drops NULL elements.
Also, I set stringsAsFactors = F when creating the data frame to avoid working with factors.
names <- c("Sia","Ryan","J","Ricky")
country <- c("London +1234567890","Paris", "Sydney +0123458796", "Delhi")
mobile <- c(NA, "+3579514862", NA, "+5554848123")
data <- data.frame(names,country,mobile, stringsAsFactors = F)
library(tidyverse)
data %>% as_tibble() %>%
separate(country, c("country", "number"), sep = " ", fill = "right") %>%
mutate(mobile = coalesce(mobile, number)) %>%
select(-number)
# A tibble: 4 x 3
names country mobile
<chr> <chr> <chr>
1 Sia London +1234567890
2 Ryan Paris +3579514862
3 J Sydney +0123458796
4 Ricky Delhi +5554848123
EDIT
If you want to do this without the pipes (which I would not recommend because the code becomes much harder to read) you can do this:
select(mutate(separate(as_tibble(data), country, c("country", "number"), sep = " ", fill = "right"), mobile = coalesce(mobile, number)), -number)
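For comparison, a base R sketch of the same idea is shown below. It is not the answer's method: it assumes the phone number is always a trailing token that starts with "+" and contains only digits, and, like coalesce above, it fills mobile only where it is missing. The names `pattern`, `has_number` and `fill` are illustrative.

```r
data <- data.frame(
  names   = c("Sia", "Ryan", "J", "Ricky"),
  country = c("London +1234567890", "Paris", "Sydney +0123458796", "Delhi"),
  mobile  = c(NA, "+3579514862", NA, "+5554848123"),
  stringsAsFactors = FALSE
)

# A trailing "+digits" token preceded by whitespace
pattern <- "\\s+(\\+[0-9]+)$"

# Rows whose country value ends in a phone number
has_number <- grepl(pattern, data$country)

# Fill mobile only where it is currently missing
fill <- has_number & is.na(data$mobile)
data$mobile[fill] <- sub(paste0("^.*", pattern), "\\1", data$country[fill])

# Strip the number (and the space before it) from country
data$country <- sub(pattern, "", data$country)

print(data)
```

Because the pattern is anchored at the end of the string, a multi-word place name followed by a number would keep all of its words, which splitting on the first space would not do.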

How can I convert this long format dataframe into a wide format?

I am using RStudio for data analysis in R. I currently have a dataframe which is in a long format. I want to convert it into the wide format.
An extract of the dataframe (df1) is shown below. I have converted the first column into a factor.
Extract:
df1 <- read.csv("test1.csv", stringsAsFactors = FALSE, header = TRUE)
df1$Respondent <- factor(df1$Respondent)
df1
Respondent Question CS Imp LOS Type Hotel
1 1 Q1 Fully Applied High 12 SML ABC
2 1 Q2 Optimized Critical 12 SML ABC
I want a new dataframe (say, df2) to look like this:
Respondent Q1CS Q1Imp Q2CS Q2Imp LOS Type Hotel
1 Fully Applied High Optimized Critical 12 SML ABC
How can I do this in R?
Additional notes: I have tried looking at the tidyr package and its spread() function but I am having a hard time implementing it to this specific problem.
This can be achieved with a gather-unite-spread approach:
library(dplyr)
library(tidyr)
df %>%
group_by(Respondent) %>%
gather(k, v, CS, Imp) %>%
unite(col, Question, k, sep = "") %>%
spread(col, v)
# Respondent LOS Type Hotel Q1CS Q1Imp Q2CS Q2Imp
#1 1 12 SML ABC Fully Applied High Optimized Critical
Sample data
df <- read.table(text =
" Respondent Question CS Imp LOS Type Hotel
1 1 Q1 'Fully Applied' High 12 SML ABC
2 1 Q2 'Optimized' Critical 12 SML ABC", header = T)
In data.table, this can be done in a one-liner....
dcast(DT, Respondent ~ Question, value.var = c("CS", "Imp"), sep = "")[DT, `:=`(LOS = i.LOS, Type = i.Type, Hotel = i.Hotel), on = "Respondent"][]
Respondent CSQ1 CSQ2 ImpQ1 ImpQ2 LOS Type Hotel
1: 1 Fully Applied Optimized High Critical 12 SML ABC
Explained step by step.
Load data.table and create the sample data:
library(data.table)
DT <- fread("Respondent Question CS Imp LOS Type Hotel
1 Q1 'Fully Applied' High 12 SML ABC
1 Q2 'Optimized' Critical 12 SML ABC", quote = '\'')
Cast a part of the datatable to desired format by question
colnames might not be what you want... you can always change them using setnames().
dcast(DT, Respondent ~ Question, value.var = c("CS", "Imp"), sep = "")
# Respondent CSQ1 CSQ2 ImpQ1 ImpQ2
# 1: 1 Fully Applied Optimized High Critical
Then join by reference on the original DT to get the rest of the columns you need (result.from.dcast stands for the output of the dcast call above):
result.from.dcast[DT, `:=`(LOS = i.LOS, Type = i.Type, Hotel = i.Hotel), on = "Respondent"]

Merge dataframes based on regex condition

This problem involves R. I have two dataframes, represented by this minimal reproducible example:
a <- data.frame(geocode_selector = c("36005", "36047", "36061", "36081", "36085"), county_name = c("Bronx", "Kings", "New York", "Queens", "Richmond"))
b <- data.frame(geocode = c("360050002001002", "360850323001019"), jobs = c("4", "204"))
An example to help communicate the very specific operation I am trying to perform: the geocode_selector column in dataframe a contains the FIPS county codes of the five boroughs of NY. The geocode column in dataframe b is the 15-digit ID of a specific Census block. The first five digits of a geocode match a more general geocode_selector, indicating which county the Census block is located in. I want to add a column to b specifying which county each census block falls under, based on which geocode_selector each geocode in b matches with.
Generally, I'm trying to merge dataframes based on a regex condition. Ideally, I'd like to perform a full merge carrying all of the columns of a over to b and not just the county_name.
I tried something along the lines of:
b[, "county_name"] <- NA
for (i in 1:nrow(b)) {
for (j in 1:nrow(a)) {
if (grepl(a$geocode_selector[j], b$geocode[i]) == TRUE) {
b$county_name[i] <- a$county_name[j]
}
}
}
but it took an extremely long time for the large datasets I am actually processing and the finished product was not what I wanted.
Any insight on how to merge dataframes conditionally based on a regex condition would be much appreciated.
You could do this...
b$geocode_selector <- substr(b$geocode,1,5)
b2 <- merge(b, a, all.x=TRUE) #by default it will merge on common column names
b2
geocode_selector geocode jobs county_name
1 36005 360050002001002 4 Bronx
2 36085 360850323001019 204 Richmond
If you wish, you can delete the geocode_selector column from b2 with b2[,1] <- NULL
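Since the nested loop in the question was slow, a vectorised lookup may also be worth trying. The sketch below uses base R's match on the 5-character prefix and carries every column of a over to b while keeping b's row order. It assumes each prefix appears at most once in a; the names `idx` and `b_matched` are illustrative.

```r
a <- data.frame(
  geocode_selector = c("36005", "36047", "36061", "36081", "36085"),
  county_name = c("Bronx", "Kings", "New York", "Queens", "Richmond"),
  stringsAsFactors = FALSE
)
b <- data.frame(
  geocode = c("360050002001002", "360850323001019"),
  jobs = c("4", "204"),
  stringsAsFactors = FALSE
)

# Row of `a` that matches each block's 5-digit county prefix
# (NA when no county matches)
idx <- match(substr(b$geocode, 1, 5), a$geocode_selector)

# Bind all columns of `a` onto `b`, keeping the row order of `b`
b_matched <- cbind(b, a[idx, , drop = FALSE])
rownames(b_matched) <- NULL

print(b_matched)
```

Blocks without a matching county get NA in the columns taken from a, which corresponds to the all.x = TRUE behaviour of the merge above.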
We can use sub to create the 'geocode_selector' and then do the join
library(data.table)
setDT(a)[as.data.table(b)[, geocode_selector := sub('^(.{5}).*', '\\1', geocode)],
on = .(geocode_selector)]
# geocode_selector county_name geocode jobs
#1: 36005 Bronx 360050002001002 4
#2: 36085 Richmond 360850323001019 204
This is a great opportunity to use dplyr. I also tend to like the string handling functions in stringr, such as str_sub.
library(dplyr)
library(stringr)
a <- tibble(geocode_selector = c("36005", "36047", "36061", "36081", "36085"),
county_name = c("Bronx", "Kings", "New York", "Queens", "Richmond"))
b <- tibble(geocode = c("360050002001002", "360850323001019"),
jobs = c("4", "204"))
b %>%
mutate(geocode_selector = str_sub(geocode, end = 5)) %>%
inner_join(a, by = "geocode_selector")
#> # A tibble: 2 x 4
#> geocode jobs geocode_selector county_name
#> <chr> <chr> <chr> <chr>
#> 1 360050002001002 4 36005 Bronx
#> 2 360850323001019 204 36085 Richmond

Resources