How to reformat similar text for merging in R?

I am working with NYC Open Data, and I want to merge two data frames based on community board. The issue is that the two data frames represent this in slightly different ways. I have provided an example of the two formats below.
CommunityBoards <- data.frame(
  FormatOne = c("01 BRONX", "05 QUEENS", "15 BROOKLYN", "03 STATEN ISLAND"),
  FormatTwo = c("BRONX COMMUNITY BOARD #1", "QUEENS COMMUNITY BOARD #5",
                "BROOKLYN COMMUNITY BOARD #15", "STATEN ISLAND COMMUNITY BD #3"))
Along with the different placement of the numbers and the "#", the second data frame shortens "COMMUNITY BOARD" to "COMMUNITY BD", but only for Staten Island. I don't have a strong preference for what the string looks like, so long as I can discern the borough and the community board number. What would be the easiest way to reformat one or both of these strings so I could merge the two sets?
Thank you for any and all help!

You can use regex to get out just the district numbers. For the first format, the only thing that matters is the beginning of the string before the space, hence you could do
districtsNrs1 <- as.numeric(gsub("(\\d+) .*","\\1",CommunityBoards$FormatOne))
For the second, I assume that the formats look like "something HASHTAG number", hence you could do
districtsNrs2 <- as.numeric(gsub(".* #(\\d+)","\\1",CommunityBoards$FormatTwo))
to get the pure district numbers.
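For the sample data, both extractions give the same vector:
districtsNrs1
#[1]  1  5 15  3
districtsNrs2
#[1]  1  5 15  3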
Now you know how to extract the district numbers. With that information, you can name/reformat the district-names how you want.
To know which district number is which district, you can create a translation data.frame between the districts and numbers like
districtNumberTranslations <- data.frame(
  districtNumber = districtsNrs2,
  districtName = sapply(strsplit(CommunityBoards$FormatTwo, " COMMUNITY "), "[[", 1)
)
giving
# districtNumber districtName
#1 1 BRONX
#2 5 QUEENS
#3 15 BROOKLYN
#4 3 STATEN ISLAND
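From here, a minimal merge sketch (my own addition, assuming the borough name is everything after the leading number in FormatOne): build a common "BOROUGH NUMBER" key for both formats and join on it.
# Key from format one: borough name plus district number
key1 <- paste(sub("^\\d+ ", "", CommunityBoards$FormatOne), districtsNrs1)
# Key from format two, reusing the pieces extracted above
key2 <- paste(sapply(strsplit(CommunityBoards$FormatTwo, " COMMUNITY "), "[[", 1),
              districtsNrs2)
# Both give e.g. "BRONX 1", so with real data frames df1 and df2:
# merged <- merge(df1, df2, by.x = "key1", by.y = "key2")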

Related

Finding the Matching Characters after Pattern in R DataFrame

I am fairly new to string manipulation, and I am stuck on a problem regarding string and character data in an R dataframe. I am attempting to extract numeric values from a long string after a pattern and then store the result as a new column in my dataframe. I have a fairly large dataset, and I am attempting to get out some useful information stored in a column called "notes".
For instance, the strings I am interested in always follow this pattern (there is nothing significant about the tasks):
df$notes[1] <- "On 5 June, some people walked down the street in this area. [size=around 5]"
df$notes[2] <- "On 6 June, some people rode bikes down the street in this area. [size= nearly 4]"
df$notes[3] <- "On 7 June, some people walked into a grocery store in this area. [size= about 100]"
In some rows, we do not get a numeric value, and that is a problem I can deal with after I get a solution to this one. Those rows follow something similar to this:
df$notes[4] <- "On 10 July, an hundreds of people drank water from this fountain [size=hundreds]"
df$notes[5] <- "on 15 August, an unreported amount of people drove their cars down the street. [size= no report]"
I am trying to extract the entire match after "size= (some quantifier)", and store the value into an appended column of my dataframe.
Eventually, I need to write a loop that goes through this column (call it "notes") in my dataframe and stores the values "5, 4, 100" into a new column (call it "est_size").
Ideally, my new column will look like this:
df$est_size[1] <- "around 5"
df$est_size[2] <- "nearly 4"
df$est_size[3] <- "about 100"
df$est_size[4] <- "hundreds"
df$est_size[5] <- "no report"
Code that I have tried / stuck on:
stringr::str_extract(df$notes[1], "\\w[size=]\\d")
but all I get back is "size=" and not the value after it.
Thank you in advance for helping!
We can use a regex lookaround to match one or more characters that are not a closing square bracket (]) after the size=:
library(dplyr)
library(stringr)
df <- df %>%
  mutate(est_size = trimws(str_extract(notes, '(?<=size=)[^\\]]+')))
-output
df
#                                                                                              notes  est_size
#1 On 5 June, some people walked down the street in this area. [size=around 5] around 5
#2 On 6 June, some people rode bikes down the street in this area. [size= nearly 4] nearly 4
#3 On 7 June, some people walked into a grocery store in this area. [size= about 100] about 100
#4 On 10 July, an hundreds of people drank water from this fountain [size=hundreds] hundreds
#5 on 15 August, an unreported amount of people drove their cars down the street. [size= no report] no report
data
df <- structure(list(notes = c("On 5 June, some people walked down the street in this area. [size=around 5]",
"On 6 June, some people rode bikes down the street in this area. [size= nearly 4]",
"On 7 June, some people walked into a grocery store in this area. [size= about 100]",
"On 10 July, an hundreds of people drank water from this fountain [size=hundreds]",
"on 15 August, an unreported amount of people drove their cars down the street. [size= no report]"
)), class = "data.frame", row.names = c(NA, -5L))
Using str_extract:
library(stringr)
trimws(str_extract(df$notes, "(?<=size=)[\\w\\s]+"))
[1] "around 5" "nearly 4" "about 100" "hundreds" "no report"
Here, we use a positive lookbehind (?<=...) to assert an accompanying pattern for what we want to extract: we want to extract the alphanumeric string(s) that follow after size=, so we put size= into the lookbehind expression and extract whatever alphanumeric chars (\\w) and whitespace chars (\\s) (but not special chars such as ]!) come after it.
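If the end goal is the bare numbers ("5, 4, 100"), a small follow-up sketch (my addition, building on the est_size column created above) pulls out the digits and leaves NA where none were reported:
# Extract the first run of digits; "hundreds" and "no report" become NA
df$est_num <- as.numeric(str_extract(df$est_size, "\\d+"))
df$est_num
#[1]   5   4 100  NA  NA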

Separating geographical data strings in R

I'm working with QECW data from BLS and would like to make the geographical data included more useful. I want to split the column "area_title" into different columns - one with the area's name, one with the level of the area, and one with the state.
I got a good start using separate:
qecw <- qecw %>% separate(area_title, c("county", "geography level", "state"))
The problem is that there's a variety of ways the geographical data are arranged into strings that makes them not uniform enough to cleanly separate. The area_title column includes names in formats that separate pretty cleanly, like:
area_title
Alabama -- Statewide
Autauga County, Alabama
which splits pretty well into
county     geography level    state
Alabama    Statewide          NA
Autauga    County             Alabama
but this breaks down for cases like:
area_title
Aleutians West Census Area, Alaska
Chattanooga-Cleveland-Dalton TN-GA-AL CSA
U.S. Combined statistical Areas, combined
as well as any states, counties or other place names that have more than one word.
I can go case-by-case to fix these, but I would appreciate a more efficient solution.
The exact data I'm using is "2019.q1-q3 10 10 Total, all industries," available at the link under "Current year quarterly data grouped by industry".
Thanks!
So far I came up with this:
I can get a place name by selecting a substring of area_title with everything to the left of the first comma:
qecw <- qecw %>% mutate(location = sub(",.*","", qecw$area_title))
Then I have a series of nested if_else statements to create a location type:
qecw <- qecw %>%
  mutate(`Location Type` =
if_else(str_detect(area_title, "Statewide"), "State",
if_else(str_detect(area_title, "County"), "County",
if_else(str_detect(area_title, "CSA"), "CSA",
if_else(str_detect(area_title, "MSA"), "MSA",
if_else(str_detect(area_title, "MicroSA"), "MicroSA",
if_else(str_detect(area_title, "Undefined"), "Undefined",
"other")))))))
This isn't a complete answer; I think I'm still missing some location types, and I haven't come up with a good way to extract state names yet.
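A tidier sketch of the same idea (my addition, assuming the qecw data frame above): case_when() flattens the nested if_else chain, and a rough guess at the state takes whatever follows the last comma, where one exists.
library(dplyr)
library(stringr)

qecw <- qecw %>%
  mutate(
    `Location Type` = case_when(
      str_detect(area_title, "Statewide") ~ "State",
      str_detect(area_title, "County")    ~ "County",
      str_detect(area_title, "CSA")       ~ "CSA",
      str_detect(area_title, "MSA")       ~ "MSA",
      str_detect(area_title, "MicroSA")   ~ "MicroSA",
      str_detect(area_title, "Undefined") ~ "Undefined",
      TRUE                                ~ "other"),
    # Rough state guess: text after the last comma; NA when there is no comma
    state = if_else(str_detect(area_title, ","),
                    str_trim(str_extract(area_title, "[^,]+$")),
                    NA_character_))
This still mislabels rows like "Chattanooga-Cleveland-Dalton TN-GA-AL CSA", where the states are embedded in the name rather than set off by a comma.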

Pairing a gsub function and text file for corpus cleaning

I have a large sample of tweets that I am trying to clean up before I analyze them. I have the tweets in a dataframe where each cell has the contents of one tweet (e.g. "i love san francisco" and "proud member of the air force"). However, there are some words in each bio that should be combined when I analyze the text in a network visualization. I want to also combine common two-word phrases (e.g. "new york", "san francisco", and "air force"). I have already compiled the list of terms that need to be combined, and have used gsub to combine a few of them with this line of code:
twitterdata_cleaning$bio = gsub('air force','airforce',twitterdata_cleaning$bio)
The line of code above turns "proud member of the air force" into "proud member of the airforce". I have been able to successfully do this with dozens of two-word phrases.
However, I have hundreds of two-word phrases in the bios, and I want to keep better track of them, so I've moved all of these terms into two columns of an Excel file. I would like a way to use the above formula with a txt or Excel file: identify terms in the dataframe that look like those in the first column of the file and change them to look like those in the second column.
For example, I have xlsx and txt files that look like this:
column1             column2
san francisco sanfrancisco
new york newyork
las vegas lasvegas
san diego sandiego
new hampshire newhampshire
good bye goodbye
air force airforce
video game videogame
high school school
middle school school
elementary school school
I would like to use the gsub command in a formula that searches the dataframe for all the terms in column 1 and turns them into the terms in column 2, using something like this:
twitterdata_df$tweet = gsub('textfile$column1','textfile$columnb',twitterdata_df$tweet)
to get something like this in the cells:
i love sanfrancisco
can not wait to go to newyork
what happens in lasvegas stays there
at the beach in sandiego
can beat the autumn leave in newhampshire
so done with all the drama goodbye
proud member of the airforce
love this videogame so much
playing at the school tonight
so sick of school
school was the best and i miss it
Any help would be very greatly appreciated.
Generalized Solution
You can feed in a named vector to str_replace_all() from package stringr to accomplish this. In my example df has a column with old values to be replaced by new values. This I assume is what you mean by having an Excel file to track them.
library(stringr)
df <- data.frame(old = c("five", "six", "seven"),
                 new = as.character(5:7),
                 stringsAsFactors = FALSE)
text <- c("I am a vector with numbers six and other text five",
          "another vector seven six text five")
str_replace_all(text, setNames(df$new, df$old))
Result:
[1] "I am a vector with numbers 6 and other text 5" "another vector 7 6 text 5"
Specific Example
Data
Read in the text file with the replacements.
textfile <- read.csv(textConnection("column1,column2
san francisco,sanfrancisco
new york,newyork
las vegas,lasvegas
san diego,sandiego
new hampshire,newhampshire
good bye,goodbye
air force,airforce
video game,videogame
high school,school
middle school,school
elementary school,school"), stringsAsFactors = FALSE)
Load a data frame with tweets in the column tweet.
twitterdata_df <- data.frame(id = 1:11)
twitterdata_df$tweet <- c("i love san francisco",
"can not wait to go to new york",
"what happens in las vegas stays there",
"at the beach in san diego",
"can beat the autumn leave in new hampshire",
"so done with all the drama goodbye",
"proud member of the air force",
"love this video game so much",
"playing at the high school tonight",
"so sick of middle school",
"elementary school was the best and i miss it")
Replace
twitterdata_df$tweet2 <- str_replace_all(twitterdata_df$tweet, setNames(textfile$column2, textfile$column1))
Result
As you can see, the replacements were made in tweet2.
id tweet tweet2
1 1 i love san francisco i love sanfrancisco
2 2 can not wait to go to new york can not wait to go to newyork
3 3 what happens in las vegas stays there what happens in lasvegas stays there
4 4 at the beach in san diego at the beach in sandiego
5 5 can beat the autumn leave in new hampshire can beat the autumn leave in newhampshire
6 6 so done with all the drama goodbye so done with all the drama goodbye
7 7 proud member of the air force proud member of the airforce
8 8 love this video game so much love this videogame so much
9 9 playing at the high school tonight playing at the school tonight
10 10 so sick of middle school so sick of school
11 11 elementary school was the best and i miss it school was the best and i miss it
Thanks for your help, but I found out how to do it. I decided to use a loop, which goes through my table of two columns, searches for each set of terms in the first column, and replaces them with the word in the second column.
for (i in 1:nrow(compoundterms)) {
  twitterdata_df$tweet = gsub(compoundterms[i, 1], compoundterms[i, 2],
                              twitterdata_df$tweet)
}
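One caveat with this loop (my note): gsub() treats each entry of compoundterms[i, 1] as a regular expression, so a phrase containing regex metacharacters (., (, +, ...) could misbehave. Passing fixed = TRUE keeps the match literal:
for (i in 1:nrow(compoundterms)) {
  twitterdata_df$tweet = gsub(compoundterms[i, 1], compoundterms[i, 2],
                              twitterdata_df$tweet, fixed = TRUE)
}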

Error reading a text file into new columns of a dataframe using some text editing

I have a text file (0001.txt) which contains the data as below:
<DOC>
<DOCNO>1100101_business_story_11931012.utf8</DOCNO>
<TEXT>
The Telegraph - Calcutta (Kolkata) | Business | Local firms go global
6 Local firms go global
JAYANTA ROY CHOWDHURY
New Delhi, Dec. 31: Indian companies are stepping out of their homes to try their luck on foreign shores.
Corporate India invested $2.7 billion abroad in the first quarter of 2009-2010 on top of $15.9 billion in 2008-09.
Though the first-quarter investment was 15 per cent lower than what was invested in the same period last year, merchant banker Sudipto Bose said, It marks a confidence in a new world order where Indian businesses see themselves as equal to global players.
According to analysts, confidence in global recovery, cheap corporate buys abroad and easier rules governing investment overseas had spurred flow of capital and could see total investment abroad top $12 billion this year and rise to $18-20 billion next fiscal.
For example, Titagarh Wagons plans to expand abroad on the back of the proposed Asian railroad project.
We plan to travel all around the world with the growth of the railroads, said Umesh Chowdhury of Titagarh Wagons.
India is full of opportunities, but we are all also looking at picks abroad, said Gautam Mitra, managing director of Indian Structurals Engineering Company.
Mitra plans to open a holding company in Switzerland to take his business in structurals to other Asian and African countries.
Indian companies created 3 lakh jobs in the US, while contributing $105 billion to the US economy between 2004 and 2007, according to commerce ministry statistics. During 2008-09, Singapore, the Netherlands, Cyprus, the UK, the US and Mauritius together accounted for 81 per cent of the total outward investment.
Bose said, And not all of it is organic growth. Much of our investment abroad reflects takeovers and acquisitions.
In the last two years, Suzlon acquired Portugal's Martifer's stake in German REpower Systems for $122 million. McNally Bharat Engineering has bought the coal and minerals processing business of KHD Humboldt Wedag. ONGC bought out Imperial Energy for $2 billion.
India's foreign assets and liabilities today add up to more than 60 per cent of its gross domestic product. By the end of 2008-09, total foreign investment was $67 billion, more than double of that at the end of March 2007.
</TEXT>
</DOC>
Above, all the text data sits inside the markup tags <TEXT> and </TEXT>.
I want to read it into an R dataframe in a way that there will be four columns and the data should be read as:
Title Author Date Text
The Telegraph - Calcutta (Kolkata) JAYANTA ROY CHOWDHURY Dec. 31 Indian companies are stepping out of their homes to try their luck on foreign shores. Corporate India invested $2.7 billion abroad in the first quarter of 2009-2010 on top of $15.9 billion in 2008-09. Though the first-quarter investment was 15 percent lower than what was invested in the same period last year, merchant banker Sudipto Bose said, It marks a confidence in a new world order where Indian businesses see themselves as equal to global players.
What I was trying to read using dplyr and as shown below:
# read text file
library(dplyr)
library(readr)
dat <- read_csv("0001.txt") %>% slice(-8)
# print part of data frame
head(dat, n=2)
In above code, I tried to skip first few lines (which are not important) from the text file that contains the above text and then read it into dataframe.
But I could not get what I was looking for, and I am confused about what I am doing wrong.
Could someone please help?
To be able to read data into R as a data frame or table, the data needs to have a consistent structure maintained by separators. One of the most common formats is a file with comma separated values (CSV).
The data you're working with doesn't have separators though. It's essentially a string with minimally enforced structure. Because of this, it sounds like the question is more related to regular expressions (regex) and data mining than it is to reading text files into R. So I'd recommend looking into those two things if you do this task often.
That aside, to do what you're wanting in this example, I'd recommend reading the text file into R as a single string of text first. Then you can parse the data you want using regex. Here's a basic, rough draft of how to do that:
fileName <- "Path/to/your/data/0001.txt"
string <- readChar(fileName, file.info(fileName)$size)
df <- data.frame(
Title=sub("\\s+[|]+(.*)","",string),
Author=gsub("(.*)+?([A-Z]{2,}.*[A-Z]{2,})+(.*)","\\2",string),
Date=gsub("(.*)+([A-Z]{1}[a-z]{2}\\.\\s[0-9]{1,2})+(.*)","\\2",string),
Text=gsub("(.*)+([A-Z]{1}[a-z]{2}\\.\\s[0-9]{1,2})+[: ]+(.*)","\\3",string))
Output:
str(df)
'data.frame': 1 obs. of 4 variables:
$ Title : chr "The Telegraph - Calcutta (Kolkata)"
$ Author: chr "JAYANTA ROY CHOWDHURY"
$ Date : chr "Dec. 31"
$ Text : chr "Indian companies are stepping out of their homes to"| __truncated__
The reason why regex can be useful is that it allows for very specific patterns in strings. The downside is when you're working with strings that keep changing formats. That will likely mean some slight adjustments to the regex used.
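An alternative rough sketch (my addition, not the answer's method): since <TEXT> and </TEXT> each sit alone on a line in the sample, you can read the file line by line, slice out what sits between the tags, and pick the fields positionally.
lines <- readLines("Path/to/your/data/0001.txt")
body  <- lines[(grep("^<TEXT>$", lines) + 1):(grep("^</TEXT>$", lines) - 1)]
df <- data.frame(
  Title  = sub("\\s*\\|.*$", "", body[1]),   # text before the first "|"
  Author = body[3],                          # the byline is the third line here
  Date   = sub("^[^,]+, ([[:alpha:]]+\\. [0-9]+):.*$", "\\1", body[4]),
  Text   = sub("^[^:]+: ", "", paste(body[-(1:3)], collapse = " ")))
This is brittle in the same way as the regex approach: it depends on the article keeping this exact layout.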
read.table(file = ..., sep = "|") will solve your issue.

Creating a function in R

I'm trying to create code that looks at two CSV files: one is a world list of all bird species and their ranges, and the other is a file of all the birds in the Himalayas. I need to check each species in the Himalayan CSV file against the matching species on the IOC world list and see if the bird is actually in range (meaning it would say either "India" or "himalayas" or "s e Asia" under the Range column). I want to create a function that takes both data sets as input, finds where the names match, checks whether the Range contains those words, and returns the rows where it does NOT, so I can check those birds specifically. Here is what I have so far (I'm using RStudio):
myfunc <- function() {
if ((bird_data$Scientific.name == ioc$Scientific.name) & (ioc$Scientific.name!=("Himalayas" | "se Asia" | "India")) {
print(eval(bird_data$Common.Name[bird_data$Scientific.name == ioc$Scientific.name) & (ioc$Scientific.name!=("Himalayas" | "se Asia" | "India")]))
}
}
save("myfunc", file = "myfunc.Rdata")
source("myfunc.Rdata")
I think I'm messed up in not having inputs. So I'm trying a new approach with:
compare = function(data1, data2) {
....
}
But for the above, I don't know how to let the function recognize the appropriate subsets of data (like I can't say data1$Scientific.name).
It's difficult to answer this question without a minimal reproducible example - without any knowledge of the two dataframes you are comparing it is hard to formulate a solution - see the link in the comment by alistaire above for how to provide this.
I suggest you change your question title to make it more informative - "Creating a function in R" suggests you want to know the syntax required for a function in R - I would recommend "Subsetting a dataframe with Grep and then filtering results in R" - which is what I think you are actually trying to do.
Assuming you obtained your IOC world list data from the International Ornithological Committee website, I am unsure whether the approach you describe in your function would work, as the data in the column Breeding Range-Subregion(s) is very messy. For example:
w Himalayas to s Siberia and w Mongolia
Himalayas to c China
e Afghanistan to nw India and w Nepal
e Afghanistan to w Tibetan plateau and n India
Africa south of the Sahara, s and se Asia
None of these values is identical to "India" or "himalayas" or "SE Asia" and none will be returned by your function which looks for an exact match. You would need to use grep to find the substring present within your data.
Let's create a toy data set.
bird_data <- data.frame(
Scientific.name=c(
"Chicken Little",
"Woodstock",
"Woody Woodpecker",
"Donald Duck",
"Daffy Duck",
"Big Bird",
"Tweety Pie",
"Foghorn Leghorn",
"The Road Runner",
"Angry Birds"))
ioc_data <- data.frame(
Scientific.name=c(
"Chicken Little",
"Woodstock",
"Woody Woodpecker",
"Donald Duck",
"Daffy Duck",
"Big Bird",
"Tweety Pie",
"The Road Runner",
"Angry Birds"),
subrange=c(
"Australia, New Zealand",
"w Himalayas to s Siberia and w Mongolia",
"Himalayas to c China",
"e Afghanistan to nw India and w Nepal",
"e Afghanistan to w Tibetan plateau and n India",
"Africa south of the Sahara, s and se Asia",
"Amazonia to n Argentina",
"n Eurasia",
"n North America"))
I would break what you are attempting to do into two steps.
Step 1
Use grep to subset the ioc_data dataframe based upon whether your search terms are found in the subrange column:
searchTerms <- c("India", "himalayas", "SE Asia")
#Then we use grep to return the indexes of matching rows:
matchingIndexes <- grep(paste(searchTerms, collapse="|"),
                        ioc_data$subrange,
                        ignore.case=TRUE) # Important so a search for "SE Asia" will match "se asia"
#We can then use our matching indexes to subset our ioc_data dataframe producing
#a subset of data corresponding to our range of interest:
ioc_data_subset <- ioc_data[matchingIndexes,]
Step 2
If I understand your question correctly, you now want to extract the rows from bird_data that ARE NOT present in ioc_data_subset (i.e. which rows in bird_data are for birds that ARE NOT recorded as inhabiting the subranges "India", "SE Asia", and "Himalayas" in the IOC data).
I would use Hadley Wickham's dplyr package for this - a good cheat sheet can be found here. After installing dplyr:
library(dplyr)
#Create a merged dataframe containing all the data in one place.
merged_data <- dplyr::left_join(bird_data,
ioc_data,
by = "Scientific.name")
#Use an anti_join to select any rows in merged_data that are NOT present in
#ioc_data_subset
results <- dplyr::anti_join(merged_data,
ioc_data_subset,
by = "Scientific.name")
The left_join is required first because otherwise we would not have the subrange column in our final database. Note that any species in bird_data not in IOC_data will return NA in the subrange column to indicate no data found.
results
Scientific.name subrange
1 Angry Birds n North America
2 The Road Runner n Eurasia
3 Foghorn Leghorn <NA>
4 Tweety Pie Amazonia to n Argentina
5 Chicken Little Australia, New Zealand
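To fold this back into the function shape asked about (my sketch, assuming both frames share a Scientific.name column and the IOC frame has a subrange column):
library(dplyr)

compare <- function(bird_data, ioc_data,
                    searchTerms = c("India", "himalayas", "SE Asia")) {
  # Step 1: subset the IOC data to rows whose subrange matches a search term
  in_range <- ioc_data[grep(paste(searchTerms, collapse = "|"),
                            ioc_data$subrange, ignore.case = TRUE), ]
  # Step 2: join, then keep only the birds NOT in the matching subset
  merged <- dplyr::left_join(bird_data, ioc_data, by = "Scientific.name")
  dplyr::anti_join(merged, in_range, by = "Scientific.name")
}

results <- compare(bird_data, ioc_data)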
