Using regex on a vector list in R - r

I have a vector with character strings in it. The vector has over 6000 rivers in it but we can use the following as an example:
Names <- c("Baker R", "Colorado R", "Missouri R")
I am then matching these river names to a list that contains their full names. As an example, the other list contains names such as:
station_nm <- c("North Creek River", "Baker River at Wentworth", "Lostine River at Baker Road", "Colorado River at North Street", "Missouri River")
In order to find the full names of the stations for the river names in "Names" I have:
station_nm <- grep(paste(Names, collapse = "|"), ALLsites$station_nm, ignore.case = TRUE, perl = TRUE, value = TRUE)
Continuing with the example, this returns: Baker River at Wentworth, Lostine River at Baker Road, Colorado River at North Street, Missouri River. It does not return North Creek River, as this is not listed in the "Names" vector. This is what I want.
However, I want to restrict the rivers that it returns to only Baker River at Wentworth, Colorado River at North Street, Missouri River. I don't want to include names for which there is something before it, i.e. Lostine River at Baker Road.
I believe this should involve some sort of negative look behind but I don't know how to write this with the vector "Names".
Thank you for any help!

You just have to prepend each value in Names with a ^, meaning "has to start with" (the code below also appends "iver", so that "Baker R" becomes "Baker River"):
grep(paste0("^", Names, "iver", collapse = "|"), station_nm,
ignore.case = TRUE, value = TRUE)
# [1] "Baker River at Wentworth" "Colorado River at North Street" "Missouri River"
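
To make the effect of the anchor concrete, here is a minimal, self-contained sketch that runs both patterns on the example data from the question. It wraps the alternation in one group and anchors it once instead of appending "iver"; that variation is an assumption of this sketch, not part of the answer above, and it assumes the names contain no regex metacharacters.

```r
# Example data from the question
Names <- c("Baker R", "Colorado R", "Missouri R")
station_nm <- c("North Creek River", "Baker River at Wentworth",
                "Lostine River at Baker Road",
                "Colorado River at North Street", "Missouri River")

# Unanchored: matches a name anywhere in the string,
# so "Lostine River at Baker Road" is returned too
unanchored <- grep(paste(Names, collapse = "|"), station_nm,
                   ignore.case = TRUE, value = TRUE)

# Anchored: the whole alternation is grouped and must start the string
anchored <- grep(paste0("^(", paste(Names, collapse = "|"), ")"), station_nm,
                 ignore.case = TRUE, value = TRUE)

print(unanchored)
print(anchored)
```

Grouping the alternatives matters here: without the parentheses, the ^ would apply only to the first name in the pattern.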

Related

Changing the value in one column based on a subset of values of another column in r

I have a dataset that contains columns city and country. Some of the country values are mislabelled as 'Other'. I know this because some of the city values contain labels like saddle lake (Canada). Is there a way to search for a substring of the city value and use it to change the value in country? I.e. search for any city value containing the word 'Canada' and change country to 'Canada'. I'd like to do this for multiple countries, including the USA and UK, which might mean my search would need an 'or' element to cover usa, US, USA, etc.
Current dataset:
City - Country
Saddle(Canada) - Other
Dublin - Other
Detroit - USA
Vancouver - Canada
NYC: US - Other
Output:
Saddle(Canada) - Canada
Dublin - Other
Detroit - USA
Vancouver - Canada
NYC: US - USA
I've played around with if statements using grep() but no success.
Edit: some code I have tried:
for (i in Data$city){
if (Data$city == '.*canada*.'){
Data$country = Canada
}}
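
One possible approach, sketched under assumptions: the data frame and column names (Data, city, country) follow the question, the toy rows are copied from its example, and the `patterns` vector is a hypothetical helper introduced here. The idea is to use the vectorised grepl() instead of a loop with ==, and to restrict updates to rows currently labelled 'Other'.

```r
# Toy data mirroring the example in the question
Data <- data.frame(
  city    = c("Saddle(Canada)", "Dublin", "Detroit", "Vancouver", "NYC: US"),
  country = c("Other", "Other", "USA", "Canada", "Other"),
  stringsAsFactors = FALSE
)

# Hypothetical lookup: one regular expression per target country.
# \\b (word boundary) keeps "us" from matching inside longer words.
patterns <- c(Canada = "canada", USA = "\\b(usa|us)\\b")

for (target in names(patterns)) {
  # Rows labelled 'Other' whose city text matches this country's pattern
  hit <- Data$country == "Other" &
    grepl(patterns[[target]], Data$city, ignore.case = TRUE, perl = TRUE)
  Data$country[hit] <- target
}

print(Data)
```

More countries (for example the UK) could be handled by adding further named entries to `patterns`.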

How to remove the first few characters in a column in R?

My data (csv file) has a column that contains uninformative characters (e.g. special characters, random lowercase letters), and I want to remove them.
df <- data.frame(Affiliation = c(". Biotechnology Centre, Malaysia Agricultural Research and Development Institute (MARDI), Serdang, Malaysia","**Institute for Research in Molecular Medicine (INFORMM), Universiti Sains Malaysia, Pulau Pinang, Malaysia","aas Massachusetts General Hospital and Harvard Medical School, Center for Human Genetic Research and Department of Neurology , Boston , MA , USA","ac Albert Einstein College of Medicine , Department of Pathology , Bronx , NY , USA"))
The number of characters I want to remove (e.g. ".","**","aas","ac") per line is indefinite as shown above.
Expected output:
df <- data.frame(Affiliation = c("Biotechnology Centre, Malaysia Agricultural Research and Development Institute (MARDI), Serdang, Malaysia","Institute for Research in Molecular Medicine (INFORMM), Universiti Sains Malaysia, Pulau Pinang, Malaysia","Massachusetts General Hospital and Harvard Medical School, Center for Human Genetic Research and Department of Neurology , Boston , MA , USA","Albert Einstein College of Medicine , Department of Pathology , Bronx , NY , USA"))
I was thinking of using dplyr's mutate function, but I'm not sure how to go about it.
If we assume that the valid text starts from the first uppercase onwards, the following works:
library(tidyverse)
df %>%
mutate(Affiliation = str_extract(Affiliation, "[:upper:].+"))
Base R regex solution:
df$cleaned_str <- gsub("^\\w+ |^\\*+|^\\. ", "", df$Affiliation)
Tidyverse regex solution:
library(tidyverse)
df %>%
mutate(Affiliation = str_replace(Affiliation, "^\\w+ |^\\*+|^\\. ", ""))
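
A base R variant of the "first uppercase letter" idea, as a minimal sketch: it assumes, like the first answer, that the valid text always starts at the first uppercase letter, and it shortens the affiliation strings from the question for readability.

```r
# Shortened versions of the example strings from the question
df <- data.frame(
  Affiliation = c(". Biotechnology Centre, Malaysia",
                  "**Institute for Research in Molecular Medicine (INFORMM)",
                  "aas Massachusetts General Hospital",
                  "ac Albert Einstein College of Medicine"),
  stringsAsFactors = FALSE
)

# Drop every leading character that is not an uppercase letter
df$Affiliation <- sub("^[^[:upper:]]*", "", df$Affiliation)

print(df$Affiliation)
```

Unlike the `^\\w+ ` alternative, this pattern leaves a string untouched when it already starts with an uppercase letter.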

How to split address column in R

I have an address column in a dataframe like below:
Address
101 Marietta Street NorthWest Atlanta GA 30303
Now I want to split it into 4 diff columns like
Address City State Zip
101 Marietta Street NorthWest Atlanta GA 30303
It is guaranteed that the last value in address column will be zip code, second last will be state, third last will be city and remaining will be address. So I am thinking, I can split address column values with space and extract values from rear.
How can I do this?
We can use tidyr::extract to capture the last 3 words in separate columns and keep the remaining text as Address:
tidyr::extract(df, Address, c("Address", "City", "State", "Zip"),
regex = "(.+) (\\w+) (\\w+) (\\w+)")
#                         Address    City State   Zip
#1 101 Marietta Street NorthWest Atlanta    GA 30303
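
If tidyr is not available, the same capture-group idea can be sketched in base R with regexec() and regmatches(). This is an illustrative alternative, not the answer's method; it assumes every address matches the pattern (at least four space-separated words).

```r
df <- data.frame(
  Address = "101 Marietta Street NorthWest Atlanta GA 30303",
  stringsAsFactors = FALSE
)

# Same regex as above, anchored at both ends; each match is the full
# match followed by the four captured groups
m <- regmatches(df$Address,
                regexec("^(.+) (\\w+) (\\w+) (\\w+)$", df$Address))

# Drop the full match and stack the groups into a character matrix
parts <- do.call(rbind, lapply(m, function(x) x[-1]))

out <- data.frame(Address = parts[, 1], City = parts[, 2],
                  State = parts[, 3], Zip = parts[, 4],
                  stringsAsFactors = FALSE)
print(out)
```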

Creating Tidy Text

I am using R for text analysis. I used the 'readtext' function to pull in text from a pdf. However, as you can imagine, it is pretty messy. I used 'gsub' to replace text for different purposes. The general goal is to use one type of delimiter '%%%%%' to split records into rows, and another delimiter '#' to split fields into columns. I accomplished the first but am at a loss as to how to accomplish the latter. A sample of the data found in the dataframe is as follows:
895 "The ambulatory case-mix development project\n#Published:: June 6, 1994#Authors: Baker A, Honigfeld S, Lieberman R, Tucker AM, Weiner JP#Country: United States #Journal:Project final report. Baltimore, MD, USA: Johns Hopkins University and Aetna Health Plans. Johns Hopkins\nUniversity and Aetna Health Plans, USA As the US […"
896 "Ambulatory Care Groups: an evaluation for military health care use#Published:: June 6, 1994#Authors: Bolling DR, Georgoulakis JM, Guillen AC#Country: United States #Journal:Fort Sam Houston, TX, USA: United States Army Center for Healthcare Education and Studies, publication #HR 94-\n004. United States Army Center for Healthcare Education and […]#URL: http://oai.dtic.mil/oai/oai?verb=getRecord&metadataPrefix=html&identifier=ADA27804"
I want to take this data and split the #Published, #Authors, #Journal, #URL into columns -- c("Published", "Authors", "Journal", "URL").
Any suggestions?
Thanks in advance!
This seems to work OK:
dfr <- data.frame(TEXT=c("The ambulatory case-mix development project\n#Published:: June 6, 1994#Authors: Baker A, Honigfeld S, Lieberman R, Tucker AM, Weiner JP#Country: United States #Journal:Project final report. Baltimore, MD, USA: Johns Hopkins University and Aetna Health Plans. Johns Hopkins\nUniversity and Aetna Health Plans, USA As the US […",
"Ambulatory Care Groups: an evaluation for military health care use#Published:: June 6, 1994#Authors: Bolling DR, Georgoulakis JM, Guillen AC#Country: United States #Journal:Fort Sam Houston, TX, USA: United States Army Center for Healthcare Education and Studies, publication #HR 94-\n004. United States Army Center for Healthcare Education and […]#URL: http://oai.dtic.mil/oai/oai?verb=getRecord&metadataPrefix=html&identifier=ADA27804"),
stringsAsFactors = FALSE)
library(magrittr)
do.call(rbind, strsplit(dfr$TEXT, "#Published::|#Authors:|#Country:|#Journal:")) %>%
as.data.frame %>%
setNames(nm = c("Preamble","Published","Authors","Country","Journal"))
Basically, this splits the text on one of four field markers (note the double :: after Published!), row-binds the resulting pieces, converts them to a dataframe, and assigns column names.
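
The question also asks for a URL column, which only some records carry. A minimal sketch of one way to handle that, under assumptions: the records below are simplified stand-ins for the real text (the URL is a placeholder), fields always appear in the same order, and only trailing fields may be missing. Shorter records are padded with NA before row-binding.

```r
# Simplified stand-in records; the second one has a trailing #URL field
x <- c("Title A#Published:: June 6, 1994#Authors: Baker A#Country: United States #Journal:Report one",
       "Title B#Published:: June 6, 1994#Authors: Bolling DR#Country: United States #Journal:Report two#URL: http://example.org/b")

# Split on any of the five markers, allowing one or more colons
pieces <- strsplit(x, "#(Published|Authors|Country|Journal|URL):+ *")

# Pad shorter records with NA so every row has the same number of fields
n <- max(lengths(pieces))
padded <- lapply(pieces, function(p) c(p, rep(NA_character_, n - length(p))))

out <- as.data.frame(do.call(rbind, padded), stringsAsFactors = FALSE)
names(out) <- c("Preamble", "Published", "Authors", "Country", "Journal", "URL")
print(out)
```

A '#' that appears inside a field but is not followed by one of the five markers (such as "#HR 94-" in the real data) is left alone by this pattern.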

Creating a function in R

I'm trying to create code that looks at two CSV files: one is a world list of all bird species and their ranges, and the other is a file of all the birds in the Himalayas. I need to check each species in the CSV file with the matching species on the IOC world list one and see if the bird is actually in range (meaning it would say either "India" or "himalayas" or "s e Asia" under the Range column). I want to create a function that can input both data sets, find where names match, check if range contains those words and returns where it does NOT, so I can check those birds specifically. Here is what I have so far (I'm using RStudio):
myfunc <- function() {
if ((bird_data$Scientific.name == ioc$Scientific.name) & (ioc$Scientific.name!=("Himalayas" | "se Asia" | "India")) {
print(eval(bird_data$Common.Name[bird_data$Scientific.name == ioc$Scientific.name) & (ioc$Scientific.name!=("Himalayas" | "se Asia" | "India")]))
}
}
save("myfunc", file = "myfunc.Rdata")
source("myfunc.Rdata")
I think I'm messed up in not having inputs. So I'm trying a new approach with:
compare = function(data1, data2) {
....
}
But for the above, I don't know how to let the function recognize the appropriate subsets of data (like I can't say data1$Scientific.name).
It's difficult to answer this question without a minimal reproducible example - without any knowledge of the two dataframes you are comparing it is hard to formulate a solution - see the link in the comment by alistaire above for how to provide this.
I suggest you change your question title to make it more informative - "Creating a function in R" suggests you want to know the syntax required for a function in R - I would recommend "Subsetting a dataframe with Grep and then filtering results in R" - which is what I think you are actually trying to do.
Assuming you obtained your IOC world list data from the International Ornithological Committee website, I am unsure whether the approach you describe in your function would work, as the data in the column Breeding Range-Subregion(s) is very messy. For example:
w Himalayas to s Siberia and w Mongolia
Himalayas to c China
e Afghanistan to nw India and w Nepal
e Afghanistan to w Tibetan plateau and n India
Africa south of the Sahara, s and se Asia
None of these values is identical to "India" or "himalayas" or "SE Asia" and none will be returned by your function which looks for an exact match. You would need to use grep to find the substring present within your data.
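
The difference between an exact comparison and a substring search can be seen in a short sketch using three of the range strings quoted in this answer:

```r
subrange <- c("w Himalayas to s Siberia and w Mongolia",
              "Himalayas to c China",
              "Amazonia to n Argentina")

# Exact comparison: no element is identical to the search term
exact <- subrange == "Himalayas"

# Substring search, ignoring case
partial <- grepl("himalayas", subrange, ignore.case = TRUE)

print(exact)
print(partial)
```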
Let's create a toy data set.
bird_data <- data.frame(
Scientific.name=c(
"Chicken Little",
"Woodstock",
"Woody Woodpecker",
"Donald Duck",
"Daffy Duck",
"Big Bird",
"Tweety Pie",
"Foghorn Leghorn",
"The Road Runner",
"Angry Birds"))
ioc_data <- data.frame(
Scientific.name=c(
"Chicken Little",
"Woodstock",
"Woody Woodpecker",
"Donald Duck",
"Daffy Duck",
"Big Bird",
"Tweety Pie",
"The Road Runner",
"Angry Birds"),
subrange=c(
"Australia, New Zealand",
"w Himalayas to s Siberia and w Mongolia",
"Himalayas to c China",
"e Afghanistan to nw India and w Nepal",
"e Afghanistan to w Tibetan plateau and n India",
"Africa south of the Sahara, s and se Asia",
"Amazonia to n Argentina",
"n Eurasia",
"n North America"))
I would break what you are attempting to do into two steps.
Step 1
Use grep to subset the ioc_data dataframe based upon whether your search terms are found in the subrange column:
searchTerms <- c("India", "himalayas", "SE Asia")
#Then we use grep to return the indexes of matching rows:
matchingIndexes <- grep(paste(searchTerms, collapse="|"),
ioc_data$subrange,
ignore.case=TRUE) #Important so search such as "SE Asia" will match "se asia"
#We can then use our matching indexes to subset our ioc_data dataframe producing
#a subset of data corresponding to our range of interest:
ioc_data_subset <- ioc_data[matchingIndexes,]
Step 2
If I understand your question correctly, you now want to extract the rows from bird_data that ARE NOT present in the ioc_data_subset (i.e. which rows in bird_data are for birds that ARE NOT recorded as inhabiting the subrange "India", "SE Asia", or "Himalayas" in the IOC data).
I would use Hadley Wickham's dplyr package for this - a good cheat sheet can be found here. After installing dplyr:
library(dplyr)
#Create a merged dataframe containing all the data in one place.
merged_data <- dplyr::left_join(bird_data,
ioc_data,
by = "Scientific.name")
#Use an anti_join to select any rows in merged_data that are NOT present in
#ioc_data_subset
results <- dplyr::anti_join(merged_data,
ioc_data_subset,
by = "Scientific.name")
The left_join is required first because otherwise we would not have the subrange column in our final dataframe. Note that any species in bird_data that is not in ioc_data will have NA in the subrange column, indicating that no data was found.
results
  Scientific.name                subrange
1     Angry Birds         n North America
2 The Road Runner               n Eurasia
3 Foghorn Leghorn                    <NA>
4      Tweety Pie Amazonia to n Argentina
5  Chicken Little  Australia, New Zealand
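
Since the question asks for a function that takes both data sets as inputs, the two steps can be wrapped up as in the following sketch. It uses base R only; the function name `find_out_of_range` and its arguments are hypothetical, and it assumes both data frames have a Scientific.name column and the IOC data has a subrange column, as in the toy data (a smaller version of which is used here).

```r
# Hypothetical helper: returns the birds whose IOC range does NOT mention
# any of the search terms, or which are missing from the IOC data
find_out_of_range <- function(birds, ioc,
                              terms = c("India", "himalayas", "SE Asia")) {
  pattern <- paste(terms, collapse = "|")
  # Species whose range mentions one of the terms
  in_range <- ioc$Scientific.name[grepl(pattern, ioc$subrange,
                                        ignore.case = TRUE)]
  flagged <- birds[!(birds$Scientific.name %in% in_range), , drop = FALSE]
  # Attach the recorded range (NA when the species is not in the IOC data)
  flagged$subrange <- ioc$subrange[match(flagged$Scientific.name,
                                         ioc$Scientific.name)]
  flagged
}

birds <- data.frame(
  Scientific.name = c("Chicken Little", "Woodstock", "Donald Duck",
                      "Foghorn Leghorn"),
  stringsAsFactors = FALSE
)
ioc <- data.frame(
  Scientific.name = c("Chicken Little", "Woodstock", "Donald Duck"),
  subrange = c("Australia, New Zealand",
               "w Himalayas to s Siberia and w Mongolia",
               "e Afghanistan to nw India and w Nepal"),
  stringsAsFactors = FALSE
)

out <- find_out_of_range(birds, ioc)
print(out)
```

Inside the function the inputs are referred to by their argument names (birds$Scientific.name, ioc$subrange), which is how a function "recognizes" the columns of whatever data frames are passed in.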
