Matching strings across columns of two dataframes in R - r

My apologies, in my haste to post my question I forgot to follow the basic rules of posting. I have edited my post in line with these rules:
R experts,
I appreciate that a similar question has been raised before but I am unable to adapt the solutions suggested before to my specific data problem. I basically have a dataframe (call it df1), where one of the columns is a string of sentences, part of which contains a city name and a country name. As an example, dataframe df1 has a column called bus_desc with the following data:
bus_desc
Company ABCD has a base capital of USD 5 million. Company ABCD is based in Mobile, Alabama, US. It is also known to have issued 10 million shares .....
Company XYZ has a history of producing bolts. Company XYZ operates out of Delhi, India. Its directors decided to....
In another dataframe (call it df2), I have two columns of data (named city and country) where each row contains a city name and the corresponding country as follows:
city    country
MOBILE  US
DELHI   INDIA
LONDON  UK
I want R to search the string of sentences of each row for that column in dataframe df1, and match the city (and the same for country) against the city name (and also the country name) from the relevant column in dataframe df2. If there is a match, I want to create a column called city in dataframe df1 and extract the city name from dataframe df2 and assign it to the row in dataframe df1. My final output should look like this:
bus_desc    city    country
Company ABCD has a base capital of USD 5 million. Company ABCD is based in Mobile, Alabama, US. It is also known to have issued 10 million shares .....    MOBILE    US
Company XYZ has a history of producing bolts. Company XYZ operates out of Delhi, India. Its directors decided to....    DELHI    INDIA
Can anyone please suggest a straightforward solution for this if it exists? I tried the below but it does not work
df1 <- df1 %>% rowwise() %>%
mutate(city=ifelse(grepl(toupper(bus_desc),df2$city),df2$city,df1))
Many thanks for your solutions and help on this.
Regards,
Dev

You can use str_extract to extract the city and country values listed in df2.
library(dplyr)
library(stringr)
df1 %>%
mutate(city = str_extract(toupper(bus_desc), str_c(df2$city, collapse = '|')),
country = str_extract(toupper(bus_desc), str_c(df2$country, collapse = '|')))
bus_desc
#1 Company ABCD has a base capital of USD 5 million. Company ABCD is based in Mobile, Alabama, US. It is also known to have issued 10 million shares
#2 Company XYZ has a history of producing bolts. Company XYZ operates out of Delhi, India. Its directors decided to
# city country
#1 MOBILE US
#2 DELHI INDIA
data
It is easier to help if you provide data in a reproducible format -
df1 <- data.frame(bus_desc = c('Company ABCD has a base capital of USD 5 million. Company ABCD is based in Mobile, Alabama, US. It is also known to have issued 10 million shares',
'Company XYZ has a history of producing bolts. Company XYZ operates out of Delhi, India. Its directors decided to'))
df2 <- data.frame(city = c('MOBILE', 'DELHI', 'LONDON'),
country = c('US', 'INDIA', 'UK'))
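One caveat with the alternation pattern used in the answer: a short code such as US also matches inside longer words (for example the "US" in "USD"), so the result can be right by accident. The following is a minimal base R sketch of a whole-word variant, not the answerer's method. The helper name extract_whole_word is hypothetical, and the sketch assumes the names in df2 contain no regex metacharacters.

```r
# Sample data copied from the answer above.
df1 <- data.frame(bus_desc = c(
  "Company ABCD has a base capital of USD 5 million. Company ABCD is based in Mobile, Alabama, US. It is also known to have issued 10 million shares",
  "Company XYZ has a history of producing bolts. Company XYZ operates out of Delhi, India. Its directors decided to"
), stringsAsFactors = FALSE)
df2 <- data.frame(city = c("MOBILE", "DELHI", "LONDON"),
                  country = c("US", "INDIA", "UK"),
                  stringsAsFactors = FALSE)

# Hypothetical helper: return the first whole-word match per string, or NA.
# Assumes the entries of `words` contain no regex metacharacters.
extract_whole_word <- function(text, words) {
  # \\b anchors the alternation to word boundaries, so "US" skips "USD".
  pattern <- paste0("\\b(", paste(words, collapse = "|"), ")\\b")
  upper <- toupper(text)
  hit <- regexpr(pattern, upper, perl = TRUE)
  out <- rep(NA_character_, length(text))
  # regmatches() returns one value per string that matched.
  out[as.integer(hit) > 0] <- regmatches(upper, hit)
  out
}

df1$city <- extract_whole_word(df1$bus_desc, df2$city)
df1$country <- extract_whole_word(df1$bus_desc, df2$country)
```

Rows without any listed city or country get NA in the new columns.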

Related

How to search for county names in a description column with multiple strings - R

I have a donation dataset with a field in it called "Description", where the donor described what they gave their gift for. This field has multiple words or strings in it (sometimes a full sentence), and several rows list specific counties where they wanted their donation to be designated.
I would like to identify which rows in this field have a county name in them, and indicate that somehow in a new field. I have a dataframe with the county names from the two states I need, but I'm struggling to work out which code lets me use the county field in the county dataframe as a basis for identifying county names within the Description field.
I'm still at a low level in R but I'll try to give some sample code. I have over 1000 rows so it will take too long for me to search for specific counties in a string - it will be more helpful to use a list of counties as my basis for searching.
`df <- tibble(`Donor Type` = c("Single Donation", "Grant", "Recurring Donation"), Amount = c("10", "50", "100"), Description = c("This is for Person County", "Books for Beaufort County", "Brews for Books"))`
`Donor Type` Amount Description
<chr> <chr> <chr>
1 Single Donation 10 This is for Person County
2 Grant 50 Books for Beaufort County
3 Recurring Donation 100 Brews for Books
I have a dataframe with county names in two states (named Carolina.Counties below)- what code should I use to make an additional column in my donor dataframe indicating which descriptions were limited to a specific county? I've been playing around with the following - but am not getting the right results.
Df <-
apply(Df, 1, function(x)
ifelse(any(Df$Description %in% Carolina.Counties$county), 'yes','no'))
%in% would look for an exact match. You may need some sort of regex match which can be achieved with the help of grepl.
df$result <- ifelse(grepl(paste0(Carolina.Counties$county, collapse = '|'),
df$Description), 'Yes', 'No')
paste0(Carolina.Counties$county, collapse = '|') creates a single regex pattern that looks for all the counties at once. We search for this pattern in the Description column; if it is found we assign "Yes", otherwise "No".
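As a runnable illustration of the grepl approach, here is a small sketch using the Description values from the question. The contents of Carolina.Counties below are a hypothetical stand-in (the real data frame is not shown), and the sketch assumes its county column holds bare names such as "Person" rather than "Person County". The extra county column, which records which name matched, is an addition and not part of the answer.

```r
# Descriptions taken from the question's sample data.
df <- data.frame(
  Description = c("This is for Person County",
                  "Books for Beaufort County",
                  "Brews for Books"),
  stringsAsFactors = FALSE
)

# Hypothetical stand-in for the asker's county lookup table.
Carolina.Counties <- data.frame(county = c("Person", "Beaufort", "Wake"),
                                stringsAsFactors = FALSE)

# One alternation pattern covering every county name.
pattern <- paste0(Carolina.Counties$county, collapse = "|")

# Flag rows whose description mentions any county.
df$result <- ifelse(grepl(pattern, df$Description), "Yes", "No")

# Optionally record the first county name found in each description.
hit <- regexpr(pattern, df$Description)
df$county <- NA_character_
df$county[as.integer(hit) > 0] <- regmatches(df$Description, hit)
```

The match is case-sensitive; adding ignore.case = TRUE to grepl and regexpr would relax that.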

how to select country name and remove special characters from a csv file in R

I have a data set that looks like this:
I was wondering how I can select only the country name from this column. As you can see, the words are separated by commas; sometimes the country name is the first word, sometimes the second, and sometimes the third. How can I create another column with the country names only? The data set also has special characters. Is there a way to remove special characters from a csv file in R?
If someone could help me figure this out, I would really appreciate it
thank you!
To start, you may want to try something like this. First use countrycode to get a list of countries. Then, pick up the country names. For special characters, you may want to try countrycode::codelist$country.name.en.regex instead.
I just learned the paste(country_list, collapse="|") trick from @akrun's previous post, "Search in character string with list of strings and return match".
library(tidyverse)
library(countrycode)
df <- data.frame(
messy_address = c("1 street, district, China", "2 road, city, Australia", "3 Road Canada"))
country_list<-countrycode::codelist$country.name.en
# messy_address is a column of df, so it has to be referenced as df$messy_address
df$new_country <- str_extract(df$messy_address, paste(country_list, collapse="|"))
df
#> df
#> messy_address new_country
#> 1 1 street, district, China China
#> 2 2 road, city, Australia Australia
#> 3 3 Road Canada Canada
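The question also asks about removing special characters, which the answer only touches on. A minimal base R sketch follows, assuming "special characters" means anything other than letters, digits, whitespace, and commas. The sample values are hypothetical, since the question's data is not reproduced here.

```r
# Hypothetical sample values with stray symbols.
messy_address <- c("1 street*, district, China!", "2 road, city, Australia#")

# Drop every character that is not alphanumeric, whitespace, or a comma.
clean_address <- gsub("[^[:alnum:][:space:],]", "", messy_address)
```

The character class can be widened (for example by adding a hyphen or apostrophe) if the real data contains punctuation that should be kept.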

Issues with copying row data and Paste -> R

I have an ASCII file that contains one week of data. This data is a text file and does not have header names. I have nearly completed a smaller task using R, and have made some attempts with Python as well. Being a pro at neither, it's been a steep learning curve. Here is my data, along with the R code I wrote to paste rows together based on a specific sequence of chr, which is not working.
Each column holds different data, but the row data is what matters most. for example:
column 1 column 2 column 3 column 4
Row 1 Name Age YR Birth Date
Row 2 Middle Name School name siblings # of siblings
Row 3 Last Name street number street address
Row 4 Name Age YR Birth Date
Row 5 Middle Name School name siblings # of siblings
Row 6 Last Name street number street address
Row 7 Name Age YR Birth Date
Row 8 Middle Name School name siblings # of siblings
Row 9 Last Name street number street address
I have a folder to iterate or loop over that some files hold 100's of rows, and others hold 1000's. I have a code written that drops all the rows I don't need, and writes to a new .csv however, any pasting and/or merging isn't producing the desirable results.
What I need is a code to select only the Name and Last name rows (and their adjacent data) from the entire file and paste the last name row beside the end of the name row. Each file has the same amount of columns but different rows.
I have the file to a data frame, and have tried merging/pasting/binding (r and c) the rows/columns, and the result is still just shy of what I need. Rbind works the best thus far, but instead of producing the data with the rows pasted one after another on the same line, they are pasted beside each other in columns like this:
ie:
Name Last Name Name Last Name Name Last Name
Age Street Num Age Street Num Age Street Num
YR Street address YR Street address YR Street address
Birth NA Birth NA Birth NA
Date NA Date NA Date NA
I have tried to rbind them or family[c(Name, Age, YR Birth...)] and I am not successful. I have looked at how many columns I have and tried to add more columns to account for the paste, and instead it populates with the data from row 1.
I'm really at a loss here and if anyone can provide some insight I'd really appreciate it. I'm newer than some, but not as new as others. The results I am achieving look like:
Name Age YR Birth date Last Name Street Num Street Address NA NA
Name Age YR Birth date Last Name Street Num Street Address NA NA
Name Age YR Birth date Last Name Street Num Street Address NA NA
codes tried:
rowData <- rbind(name$Name, name$Age, name$YRBirth, name$Date)
colData <- cbind(name$V1 == "Name", name$V1 == "Last Name")
merge and paste also do not work. I have tried to create each variable as new data frames and am still not achieving the results I am looking for. Does anyone have any insight?
Ok, so if I understand your situation correctly, you want to first slice your data and pull out every third row starting with the 1st row, and then pull out every third row starting with the 3rd row. I'd do it like this (assuming your data is in df):
df1 <- df[3*(1:(nrow(df)/3)) - 2,]
df2 <- df[3*(1:(nrow(df)/3)),]
Once you have these, you can just slap them together, but instead of using rbind you want to use cbind. Then you can drop the NA columns and rename the rest.
df3 <- cbind(df1,df2)
df3 <- df3[1:7]
colnames(df3) <- c("Name", "Age", "YR", "Birth date", "Last Name", "Street Num", "Street Address")
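The same slicing can be written with seq(), which some readers may find easier to follow. This is a sketch on hypothetical sample data with four unnamed columns (V1 to V4) and records that span exactly three rows each; the values are invented for illustration.

```r
# Hypothetical data: each record spans three rows (name, middle name, last name).
df <- data.frame(
  V1 = c("Ann", "Marie", "Smith", "Bob", "Lee", "Jones"),
  V2 = c("30", "Central High", "12", "41", "West High", "7"),
  V3 = c("1990", "siblings", "Oak Street", "1979", "siblings", "Elm Road"),
  V4 = c("1990-05-01", "2", NA, "1979-02-11", "1", NA),
  stringsAsFactors = FALSE
)

# Rows 1, 4, 7, ... hold the name; rows 3, 6, 9, ... hold the last name.
name_rows <- df[seq(1, nrow(df), by = 3), ]
last_rows <- df[seq(3, nrow(df), by = 3), ]

# Put each last-name row beside its name row and keep the 7 populated columns.
combined <- cbind(name_rows, last_rows)[, 1:7]
colnames(combined) <- c("Name", "Age", "YR", "Birth date",
                        "Last Name", "Street Num", "Street Address")
```

This assumes the number of rows is a multiple of three; a file with a trailing partial record would need to be trimmed first.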

Group by in R to represent counts per county on a map?

picture of data
I have the data above, in which I need to represent the companies in the last column for every US county on a map. The idea is to be able to hover over a county and have it say the company names. It came from an Excel pivot table which I collapsed down to a csv. My strategy is to add a column that summarizes the company counts per county so I can map that one variable. I'm not sure of the best way to do that. I'm assuming a column value that reads "Alabama Power Company (4) Wetterhorn Wireless L.L.C. (3)" or "Alabama Power Company Alabama Power Company Alabama Power Company Alabama Power Company Wetterhorn Wireless L.L.C. Wetterhorn Wireless L.L.C. Wetterhorn Wireless L.L.C." or something like that. Would I use a group by to do that? What's the best way to summarize this pivot table on a map?
You can get the counts very easily if the data is loaded as a data.table. Just use .N along with 'by' to group by country and company.
library(data.table)
dt=data.table(data)
dt[,count:=.N,by=.(country,company)]
Note:
data should be your dataframe you load from your csv
Replace country and company with names of your column of country and company from the data.table
I finally figured out how to represent it using aggregate:
summary = aggregate(dt$company, list(dt$ccounty), paste, collapse=" ")
this yields all the names of the winners
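To get a label of the form "Company (n)" per county, as described in the question, the two ideas above can be combined. The following is a base R sketch on hypothetical sample data. The column names county and company and the county values are assumptions, since the original data is only shown as a picture.

```r
# Hypothetical sample: one row per company record in a county.
data <- data.frame(
  county  = c("Autauga", "Autauga", "Autauga", "Baldwin"),
  company = c("Alabama Power Company", "Alabama Power Company",
              "Wetterhorn Wireless L.L.C.", "Alabama Power Company"),
  stringsAsFactors = FALSE
)

# Count records per county/company pair and drop pairs that never occur.
counts <- as.data.frame(table(county = data$county, company = data$company),
                        stringsAsFactors = FALSE)
counts <- counts[counts$Freq > 0, ]

# Build "Company (n)" labels, then collapse them into one string per county.
counts$label <- paste0(counts$company, " (", counts$Freq, ")")
county_summary <- aggregate(label ~ county, data = counts,
                            FUN = paste, collapse = " ")
```

The resulting county_summary has one row per county, with a single label column that could be used as hover text.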

R Data frame only with matching rows

Hello, I'm trying to create a data frame inside a loop. This data frame should contain only matching values. I'm trying to implement logic like this:
names<- unique(list(data$costumers))
for (i in 1:length(names)) {
city <- data$city where data$costumers == names[i]
costumer <- data$costumers where data$costumers == names[i]
df <- data.frame(costumer,city)
}
Basically I'm trying to make a data frame for each unique name in the list. I don't know how to do the comparison in a data frame; I've tried an if statement but couldn't get it to work.
An example input dataframe would be something like this:
costumer city
Joseph WS
Edward WS
Joseph NY
so the output dataframe would be like this:
costumer city
Joseph WS
Joseph NY
and the second dataframe output would be like that:
costumer city
Edward WS
In conclusion, I'm trying to get a single data frame for every unique name in the list, and that data frame should have all the rows that include that name.
you can use split:
split(data, data$customer)
$Edward
customer city
2 Edward WS
$Joseph
customer city
1 Joseph WS
3 Joseph NY
you can either refer to the dataframes from this list or even
list2env(split(data, data$customer))
and now just call the dataframes by the customer's name
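For completeness, here is a small runnable sketch of the split approach using the sample rows from the question. It assumes the column is named customer, as in the answer's output.

```r
# Sample rows from the question, with the column named as in the answer.
data <- data.frame(customer = c("Joseph", "Edward", "Joseph"),
                   city = c("WS", "WS", "NY"),
                   stringsAsFactors = FALSE)

# split() returns a named list holding one data frame per unique customer.
by_customer <- split(data, data$customer)

# Each element can be pulled out by name.
joseph <- by_customer[["Joseph"]]
```

Keeping the data frames in the list and looping over it with lapply() is usually easier to manage than creating one global variable per customer with list2env().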
