lapply instead of a for loop in R

I want to write a function rankall that takes two arguments: an outcome name (outcome) and a hospital ranking (num). The function reads the outcome-of-care-measures.csv file and returns a 2-column data frame containing the hospital in each state that has the ranking specified in num.
rankall <- function(outcome, num = "best") {
    ## Read outcome data
    ## Check that state and outcome are valid
    ## For each state, find the hospital of the given rank
    ## Return a data frame with the hospital names and the
    ## (abbreviated) state name
}
head(rankall("heart attack", 20), 10)
hospital state
AK <NA> AK
AL D W MCMILLAN MEMORIAL HOSPITAL AL
AR ARKANSAS METHODIST MEDICAL CENTER AR
AZ JOHN C LINCOLN DEER VALLEY HOSPITAL AZ
CA SHERMAN OAKS HOSPITAL CA
CO SKY RIDGE MEDICAL CENTER CO
CT MIDSTATE MEDICAL CENTER CT
DC <NA> DC
DE <NA> DE
FL SOUTH FLORIDA BAPTIST HOSPITAL FL
My function works correctly, but I built the last step (formatting the 2-column data frame) with the following loop:
new_data <- vector()
for (i in sort(unique(d$State))) {
    new_data <- rbind(new_data, cbind(d$Hospital.Name[which(d$State == i)][num], i))
}
new_data <- as.data.frame(new_data)
It works, but I know the same loop can be written with the lapply function.
My attempt is wrong:
lapply(d,function(x) x <-rbind(x,d$Hospital.Name[which(d$State == i)][num]))
How can I fix that?

I'm assuming your d data is already sorted:
new_data <- do.call(rbind,
                    lapply(unique(d$State),
                           function(state) {
                               data.frame(State = state,
                                          Hospital.Name = d$Hospital.Name[which(d$State == state)][num],
                                          stringsAsFactors = FALSE)
                           }))
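A variant of the same idea uses split() to group the hospital names by state in one pass (a sketch; it assumes d is sorted as above and that num is numeric):
by_state <- split(d$Hospital.Name, d$State)
new_data <- data.frame(State = names(by_state),
                       Hospital.Name = vapply(by_state, `[`, character(1), num),
                       stringsAsFactors = FALSE)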

Related

Converting a nested for-loop for string search for optimization in R

I am relatively new to R. I have written the following code, but because it uses a for-loop it is slow, and I am not familiar enough with the packages that could turn it into a more efficient solution (the apply functions?).
What my code does is extract country names from a variable, based on another dataframe that lists all countries.
For instance, this is what data looks like:
country Institution
edmonton general hospital
ontario, canada
miyazaki, japan
department of head
This is what countries looks like:
Name Code
algeria dz
canada ca
japan jp
kenya ke
# string match the countries
for (i in 1:nrow(data)) {
    for (j in 1:nrow(countries)) {
        data$country[i] <- ifelse(str_detect(string = data$Institution[i],
                                             pattern = paste0("\\b", countries$Name[j], "\\b")),
                                  countries$Name[j],
                                  data$country[i])
    }
}
The above code changes data so that it looks like this:
country Institution
edmonton general hospital
canada ontario, canada
japan miyazaki, japan
department of head
How can I convert my for-loop while preserving the same behavior?
Thanks.
You can do a one-liner with str_extract. We'll paste the country names together with word boundaries and concatenate them with a regex | or operator.
library(stringr)
data$country = str_extract(data$Institution, paste0(
"\\b", country$Name, "\\b", collapse = "|"
))
data
# Institution country
# 1 edmonton general hospital <NA>
# 2 ontario, canada canada
# 3 miyazaki, japan japan
# 4 department of head <NA>
Using this data:
country <- read.table(text = " Name Code
algeria dz
canada ca
japan jp
kenya ke",
stringsAsFactors = FALSE, header = TRUE)
data <- data.frame(Institution = c("edmonton general hospital",
                                   "ontario, canada",
                                   "miyazaki, japan",
                                   "department of head"),
                   stringsAsFactors = FALSE)
The data:
countries <- setDT(read.table(text = " Name Code
algeria dz
canada ca
japan jp
kenya ke",
stringsAsFactors = FALSE, header = TRUE))
data <- setDT(list(country = array(dim = 2),
                   Institution = c("edmonton general hospital ontario, canada",
                                   "miyazaki, japan department of head")))
I use data.table for syntactic convenience, but you can certainly do it otherwise; the main idea is to use just one loop and grepl:
data[, country := as.character(country)]
for (x in unique(countries$Name)) {
    data[grepl(x, data$Institution), country := x]
}
> data
country Institution
1: canada edmonton general hospital ontario, canada
2: japan miyazaki, japan department of head
You could add tolower to avoid case problems: grepl(tolower(x), tolower(data$Institution)).
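Equivalently, grepl has a built-in ignore.case argument, which avoids the double tolower call:
for (x in unique(countries$Name)) {
    data[grepl(x, data$Institution, ignore.case = TRUE), country := x]
}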

Extract cities from each row in excel and export to its respective row using R

I have extracted tweets in .csv format and the data looks like this:
(row 1) The latest The Admin Resources Daily! Thanks to #officerenegade #roberthalf #elliottdotorg #airfare #jobsearch
(row 2) RT #airfarewatchdog: Los Angeles #LAX to Cabo #SJD $312 nonstop on #AmericanAir for summer travel. #airfare
(row 3) RT #TheFlightDeal: #Airfare Deal: [AA] New York - Mexico City, Mexico. $270 r/t. Details:
(row 4) The latest The Nasir Muhammad Daily! Thanks to #Matt_Revel #Roddee #JaeKay #lefforum #airfare
(row 5) RT #BarefootNomads: So cool! <U+2708> <U+2764><U+FE0F> #airfare deals w #Skyscanner Facebook Messenger Bot #traveldeals #cheapflights ht…
(row 6) Flights to #Oranjestad #Aruba are £169 for a 15 day trip departing Tue, Jun 7th. #airfare via #hitlist_app"
I have used an NLP technique to extract city names from the tweets, but the output is a single list of cities, one per row: it simply identifies every city name in the whole file and lists them.
Output:
1 Los Angeles
2 New York
3 Mexico City
4 Mexico
5 Tue
6 London
7 New York
8 Fort Lauderdale
9 Los Angeles
10 Paris
I want the output to be something like:
1 Los Angeles Cabo (from the first tweet in row 2)
2 New York Mexico City Mexico (from the second tweet in row 3)
Code:
#Named Entity Recognition (NER)
bio <- readLines("C:\\xyz\\tweets.csv")
print(bio)
install.packages(c("NLP", "openNLP", "RWeka", "qdap"))
install.packages("openNLPmodels.en",
repos = "http://datacube.wu.ac.at/",
type = "source")
library(NLP)
library(openNLP)
library(RWeka)
library(qdap)
library(openNLPmodels.en)
library(magrittr)
bio <- as.String(bio)
word_ann <- Maxent_Word_Token_Annotator()
sent_ann <- Maxent_Sent_Token_Annotator()
bio_annotations <- annotate(bio, list(sent_ann, word_ann))
class(bio_annotations)
head(bio_annotations)
bio_doc <- AnnotatedPlainTextDocument(bio, bio_annotations)
sents(bio_doc) %>% head(2)
words(bio_doc) %>% head(10)
location_ann <- Maxent_Entity_Annotator(kind = "location")
pipeline <- list(sent_ann,
                 word_ann,
                 location_ann)
bio_annotations <- annotate(bio, pipeline)
bio_doc <- AnnotatedPlainTextDocument(bio, bio_annotations)
entities <- function(doc, kind) {
    s <- doc$content
    a <- annotations(doc)[[1]]
    if (hasArg(kind)) {
        k <- sapply(a$features, `[[`, "kind")
        s[a[k == kind]]
    } else {
        s[a[a$type == "entity"]]
    }
}
entities(bio_doc, kind = "location")
cities <- entities(bio_doc, kind = "location")
library(xlsx)
write.xlsx(cities, "C:\\xyz\\xyz.xlsx")
Also, is there a way to further separate the cities into origin and destination, i.e. by classifying cities that appear before 'to' or '-' as origins and the rest as destinations?
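One way to keep the extracted cities grouped per tweet is to annotate each line separately instead of collapsing the whole file into a single string. A sketch follows (untested; it reuses the pipeline and the entities() helper defined above):
tweets <- readLines("C:\\xyz\\tweets.csv")
cities_by_tweet <- lapply(tweets, function(tw) {
    s <- as.String(tw)
    a <- annotate(s, list(sent_ann, word_ann, location_ann))
    doc <- AnnotatedPlainTextDocument(s, a)
    entities(doc, kind = "location")  # locations found in this tweet only
})
# one comma-separated string of cities per tweet
sapply(cities_by_tweet, paste, collapse = ", ")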

R: Using plyr to perform fuzzy string matching between matching subsets of two data sources

Say I have a list of counties with varying amounts of spelling errors or other issues that differentiate them from the 2010 FIPS dataset (code to create fips dataframe below), but the states in which the misspelled counties reside are entered correctly. Here's a sample of 21 random observations from my full dataset:
tomatch <- structure(list(county = c("Beauregard", "De Soto", "Dekalb", "Webster",
"Saint Joseph", "West Feliciana", "Ketchikan Gateway", "Evangeline",
"Richmond City", "Saint Mary", "Saint Louis City", "Mclean",
"Union", "Bienville", "Covington City", "Martinsville City",
"Claiborne", "King And Queen", "Mclean", "Mcminn", "Prince Georges"
), state = c("LA", "LA", "GA", "LA", "IN", "LA", "AK", "LA", "VA",
"LA", "MO", "KY", "LA", "LA", "VA", "VA", "LA", "VA", "ND", "TN",
"MD")), .Names = c("county", "state"), class = c("tbl_df", "data.frame"
), row.names = c(NA, -21L))
county state
1 Beauregard LA
2 De Soto LA
3 Dekalb GA
4 Webster LA
5 Saint Joseph IN
6 West Feliciana LA
7 Ketchikan Gateway AK
8 Evangeline LA
9 Richmond City VA
10 Saint Mary LA
11 Saint Louis City MO
12 Mclean KY
13 Union LA
14 Bienville LA
15 Covington City VA
16 Martinsville City VA
17 Claiborne LA
18 King And Queen VA
19 Mclean ND
20 Mcminn TN
21 Prince Georges MD
I've used adist to create a fuzzy string matching algorithm that matches around 80% of my counties to the county names in fips. However, sometimes it will match two counties with similar spellings but from different states (e.g., "Webster, LA" gets matched to "Webster, GA" rather than "Webster Parish, LA").
distance <- adist(tomatch$county,
                  fips$countyname,
                  partial = TRUE)
min.name <- apply(distance, 1, min)
matchedcounties <- NULL
for (i in 1:nrow(distance)) {
    s2.i <- match(min.name[i], distance[i, ])
    s1.i <- i
    matchedcounties <- rbind(data.frame(s2.i = s2.i,
                                        s1.i = s1.i,
                                        s1name = tomatch[s1.i, ]$county,
                                        s2name = fips[s2.i, ]$countyname,
                                        adist = min.name[i]),
                             matchedcounties)
}
Therefore, I want to restrict fuzzy string matching of county to the correctly spelled versions with matching state.
My current algorithm makes one big matrix which calculates standard Levenshtein distances between both sources and then selects the value with the minimum distance.
To solve my problem, I'm guessing I'd need to create a function that could be applied to each 'state' group by ddply, but I'm confused about how to indicate that the group value in the ddply function should match another data frame. A dplyr solution, or one using any other package, would be appreciated as well.
Code to create FIPS dataset:
download.file('http://www2.census.gov/geo/docs/reference/codes/files/national_county.txt',
              './nationalfips.txt')
fips <- read.csv('./nationalfips.txt',
                 stringsAsFactors = FALSE, colClasses = 'character', header = FALSE)
names(fips) <- c('state', 'statefips', 'countyfips', 'countyname', 'classfips')
# remove 'County' from countyname
fips$countyname <- sub('County', '', fips$countyname, fixed = TRUE)
fips$countyname <- stringr::str_trim(fips$countyname)
Here's a way with dplyr. I first join the tomatch data.frame with the FIPS names by state (allowing only in-state matches):
require(dplyr)
df <- tomatch %>%
  left_join(fips, by = "state")
Next, I noticed that a lot of counties don't have 'Saint' but 'St.' in the FIPS dataset. Cleaning that up first should improve the results obtained.
df <- df %>%
  mutate(county_clean = gsub("Saint", "St.", county))
Then, group this data.frame by county, and calculate the distance with adist:
df <- df %>%
  group_by(county_clean) %>% # Calculate the distance per county
  mutate(dist = diag(adist(county_clean, countyname, partial = TRUE))) %>%
  arrange(county, dist) # Used this for visual inspection.
Note that I took the diagonal from the resulting matrix as adist returns an n x m matrix with n representing the x vector and m representing the y vector (it calculates all of the combinations).
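A tiny illustration of that point (toy strings, not from the dataset):
m <- adist(c("Dekalb", "Mclean"), c("DeKalb", "McLean"))
dim(m)  # 2 x 2: adist computes every combination of the two vectors
diag(m) # c(1, 1): "Dekalb" vs "DeKalb" and "Mclean" vs "McLean"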
Optionally, you could add the agrep result:
df <- df %>%
  rowwise() %>% # 'group_by' a single row.
  mutate(agrep_result = agrepl(county_clean, countyname, max.distance = 0.3)) %>%
  ungroup() # Always a good idea to remove 'groups' after you're done.
Then filter as you did before, take the minimum distance:
df <- df %>%
  group_by(county_clean) %>% # Causes it to calculate the 'min' per group
  filter(dist == min(dist)) %>%
  ungroup()
Note that this could result in more than one row returned for each of the input rows in tomatch.
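If you need exactly one row back per county, one way to break the ties is to keep the first match within each group (a sketch; it assumes an arbitrary tie-break is acceptable for your data):
df <- df %>%
  group_by(county_clean) %>%
  filter(dist == min(dist)) %>%
  slice(1) %>%
  ungroup()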
Alternatively, do it all in one run (I usually change code to this format once I'm confident it's doing what it's supposed to do):
df <- tomatch %>%
  # Join on all names in the relevant state and clean 'St.'
  left_join(fips, by = "state") %>%
  mutate(county_clean = gsub("Saint", "St.", county)) %>%
  # Calculate the distances, per original county name.
  group_by(county_clean) %>%
  mutate(dist = diag(adist(county_clean, countyname, partial = TRUE))) %>%
  # Append the agrepl result
  rowwise() %>%
  mutate(string_agrep = agrepl(county_clean, countyname, max.distance = 0.3)) %>%
  ungroup() %>%
  # Only retain minimum distances
  group_by(county_clean) %>%
  filter(dist == min(dist))
The result in both cases:
county county_clean state countyname dist string_agrep
1 Beauregard Beauregard LA Beauregard Parish 0 TRUE
2 De Soto De Soto LA De Soto Parish 0 TRUE
3 Dekalb Dekalb GA DeKalb 1 TRUE
4 Webster Webster LA Webster Parish 0 TRUE
5 Saint Joseph St. Joseph IN St. Joseph 0 TRUE
6 West Feliciana West Feliciana LA West Feliciana Parish 0 TRUE
7 Ketchikan Gateway Ketchikan Gateway AK Ketchikan Gateway Borough 0 TRUE
8 Evangeline Evangeline LA Evangeline Parish 0 TRUE
9 Richmond City Richmond City VA Richmond city 1 TRUE
10 Saint Mary St. Mary LA St. Mary Parish 0 TRUE
11 Saint Louis City St. Louis City MO St. Louis city 1 TRUE
12 Mclean Mclean KY McLean 1 TRUE
13 Union Union LA Union Parish 0 TRUE
14 Bienville Bienville LA Bienville Parish 0 TRUE
15 Covington City Covington City VA Covington city 1 TRUE
16 Martinsville City Martinsville City VA Martinsville city 1 TRUE
17 Claiborne Claiborne LA Claiborne Parish 0 TRUE
18 King And Queen King And Queen VA King and Queen 1 TRUE
19 Mclean Mclean ND McLean 1 TRUE
20 Mcminn Mcminn TN McMinn 1 TRUE
21 Prince Georges Prince Georges MD Prince George's 1 TRUE
I don't have example data, but try something using agrep instead of adist, searching only the names in that state:
sapply(seq_len(nrow(df_tomatch)), function(i)
    agrep(df_tomatch$county[i],
          df_matchby[df_matchby$state == df_tomatch$state[i], 'county'],
          value = TRUE))
You can use the max.distance argument in agrep to vary how close they need to match. Also, setting value=TRUE returns the value of the matched string rather than the location of the match.
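A small toy illustration (hypothetical data, not from the question):
agrep("Webster", c("Webster Parish", "Webb", "Winn Parish"),
      max.distance = 0.2, value = TRUE)
# [1] "Webster Parish"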

Making a new variable column using if/else statements

I have a dataset that contains a column for the state in which a particular office is located. I would like to take that column and make a new column denoting the region of the US in which that office is located. The state column has the postal abbreviations for each state (i.e., NY stands for New York), and I am using the US Census Bureau's regions.
Here's a mock example of the data. I don't have a Region column, but I want to create it:
Store State Region
A FL South
B NY Northeast
C CA West
D IL Midwest
E MA Northeast
To keep it simple, let's just say I want to flag only offices in the Northeast. I used the following syntax:
stores$Northeast<-if(
stores$state=="ME"|"NH"|"VT"|"MA"|"RI"|"CT"|"NY"|"PA"|"NJ"){
print("Northeast")
} else{print("Non-northeast")
}
but I get an error message saying that the | operation doesn't work on characters. Is there a different function I should be using instead?
I'm posting in the interest of saving people's typing time. There are already two vectors available as part of the base R installation that can be used to do this very efficiently: state.abb and state.region. If you have a named vector it can be indexed via the names as a look-up facility. They both need to be converted from factor to character (and the index needs to be de-factorized as well):
# Do read `?state`. Hey, S was invented in the US, but why not some European constants?
mock <- read.table(text = "Store State
A FL
B NY
C CA
D IL
E MA ", header = TRUE)
> stat <- as.character(state.region)
> names(stat) <- as.character(state.abb)
> mock$Region <- stat[as.character(mock$State)]
> mock
Store State Region
1 A FL South
2 B NY Northeast
3 C CA West
4 D IL North Central
5 E MA Northeast
If you want to "edit" the regional assignments, do this:
> stat["IL"] <- "Midwest"
> mock$Region <- stat[as.character(mock$State)]
> mock
Store State Region
1 A FL South
2 B NY Northeast
3 C CA West
4 D IL Midwest
5 E MA Northeast
You should probably use the %in% operator here:
NE = c("ME","NH","VT","MA","RI","CT","NY","PA","NJ")
if (stores$state %in% NE) {
print("Northeast")
} else {
print("Non-northeast")
}
You can also define a new variable this way, especially if you are going to go on to define other regions:
stores$region = "Non-northeast"
stores$region[stores$state %in% NE] = "Northeast"
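The same pattern extends to all four regions. A sketch follows using the Census Bureau's state groupings (double-check these lists against your own reference):
NE <- c("CT", "ME", "MA", "NH", "RI", "VT", "NJ", "NY", "PA")
MW <- c("IL", "IN", "MI", "OH", "WI", "IA", "KS", "MN", "MO", "NE", "ND", "SD")
S  <- c("DE", "FL", "GA", "MD", "NC", "SC", "VA", "DC", "WV",
        "AL", "KY", "MS", "TN", "AR", "LA", "OK", "TX")
W  <- c("AZ", "CO", "ID", "MT", "NV", "NM", "UT", "WY",
        "AK", "CA", "HI", "OR", "WA")
stores$region <- "Other"
stores$region[stores$state %in% NE] <- "Northeast"
stores$region[stores$state %in% MW] <- "Midwest"
stores$region[stores$state %in% S]  <- "South"
stores$region[stores$state %in% W]  <- "West"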
You need the %in% operator!
stores$Northeast <- ifelse(stores$state %in% c("ME", "NH", "VT", "MA", "RI", "CT", "NY", "PA", "NJ"), "Northeast", "Non-northeast")
cheers

How to separate a row in a CSV and generate another CSV file from it in R?

I have a CSV file like
AdvertiserName,CampaignName
Wells Fargo,Gary IN MetroChicago IL Metro
EMC,Los Angeles CA MetroBoston MA Metro
Apple,Cupertino CA Metro
Desired Output in R
AdvertiserName,City,State
Wells Fargo,Gary,IN
Wells Fargo,Chicago,IL
EMC,Los Angeles,CA
EMC,Boston,MA
Apple,Cupertino,CA
I have tried it like this:
record <- read.csv("C:/Users/Administrator/Downloads/Campaignname.csv",header=TRUE)
ad <- record$AdvertiserName
camp <- record$CampaignName
read.table(text=gsub('Metro', '\n', c), col.names=c('City', 'State'))
It throws an error.
How can I get the desired result?
Thanks in advance.
You can do this, for example:
## read the csv file; replace the text argument with your file name
xx <- read.table(text ='AdvertiserName,CampaignName
Wells Fargo,Gary INMetro Chicago IL Metro
EMC,Los Angeles CAMetro Boston MA Metro',sep=',',header=TRUE)
## use regular expression to create city and state variables
## rows are separated by ":"
## columns are separated by a comma ","
res <- gsub('(.*) ([A-Z]{2})*Metro (.*) ([A-Z]{2}) .*',
            '\\1,\\2:\\3,\\4',
            xx$CampaignName)
## Use strsplit to extract rows and columns
## (this is compact code!)
yy <- Map(function(x, y)
              cbind.data.frame(y, do.call(rbind, strsplit(x, ','))),
          strsplit(res, ':'), xx$AdvertiserName)
## create the final data.frame and set names
res <- do.call(rbind,yy)
setNames(res, c('AdvertiserName','City','State'))
AdvertiserName City State
1 Wells Fargo Gary IN
2 Wells Fargo Chicago IL
3 EMC Los Angeles CA
4 EMC Boston MA
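For comparison, here is a base-R sketch of the same reshaping written against the question's original data (it assumes each campaign entry is a run of 'City ST' chunks, each terminated by the literal word Metro):
record <- read.table(text = 'AdvertiserName,CampaignName
Wells Fargo,Gary IN MetroChicago IL Metro
EMC,Los Angeles CA MetroBoston MA Metro
Apple,Cupertino CA Metro', sep = ',', header = TRUE, stringsAsFactors = FALSE)
# split each campaign string on the literal "Metro": one "City ST" chunk per metro
chunks <- strsplit(record$CampaignName, "Metro", fixed = TRUE)
out <- do.call(rbind, Map(function(adv, ch) {
    ch <- trimws(ch)
    ch <- ch[ch != ""]
    data.frame(AdvertiserName = adv,
               City  = sub(" [A-Z]{2}$", "", ch),        # drop the trailing state code
               State = sub(".* ([A-Z]{2})$", "\\1", ch), # keep only the state code
               stringsAsFactors = FALSE)
}, record$AdvertiserName, chunks))
out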
