I have a data frame (df) in which one column holds US states as two-letter abbreviations: 'AK','AL','AR','AZ','CA', ..., 'WV','WY'.
I want to create a new column that reads the 'df$state' column and assigns a region: West, Midwest, Northeast, Southeast, Southwest.
I have the regions broken down into lists, for example:
list_southwest <- c('TX','AZ','NM','OK')
I duplicated the 'df$state' column and renamed it 'df$region'. What I want to do is replace the two-letter state elements with regions and not do it state-by-state.
I have been successful with the code: df$region[df$region == 'TX'] <- "Southwest"
But I'd like to go faster, so I tried: df$region[df$region == 'list_west'] <- "Southwest"
in an attempt to check the column for all the two-letter strings in "list_west", but nothing gets replaced and I don't receive an error of any kind.
I've also tried the tedious:
df$region[df$region == 'TX', 'AZ', ... but R doesn't seem to like that. I've tried replacing the commas with |, &&, and ||, with no luck.
I was thinking there might be a way to add a for loop and case_when(), and a lot of other things, but I'm stuck. Any help would be greatly appreciated!
Here's what I'm hoping for without having to run a line of code per each individual state:
state  region
AK     West
AL     South
AR     South
AZ     West
CA     West
CO     West
CT     NorthEast
SOLVED!!
Here's how the code looks after a comment suggested using %in% instead of ==:
df$region[df$region %in% list_west] <- "West"
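The %in% fix still needs one line per region. A minimal sketch of doing all regions in one step with a named lookup vector follows. Only the Southwest membership comes from the question; the West entry is an illustrative subset, and the names region_lists and lookup are hypothetical.

```r
# Region membership as a named list. Only Southwest is taken from the question;
# the West vector is an illustrative subset, not a full region definition.
region_lists <- list(
  Southwest = c("TX", "AZ", "NM", "OK"),
  West      = c("CA", "OR", "WA")
)

# Flatten into a lookup vector: names are state codes, values are region names.
lookup <- setNames(
  rep(names(region_lists), lengths(region_lists)),
  unlist(region_lists, use.names = FALSE)
)

df <- data.frame(state = c("TX", "CA", "AZ", "WA", "NY"),
                 stringsAsFactors = FALSE)

# Index the lookup by state code; states missing from every list become NA.
df$region <- unname(lookup[df$state])
print(df)
```

Because the lookup is indexed by the state column directly, there is no need to duplicate 'df$state' first, and any state missing from the lists shows up as NA rather than keeping its two-letter code.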
I'm working with QCEW data from BLS and would like to make the included geographical data more useful. I want to split the column "area_title" into different columns: one with the area's name, one with the level of the area, and one with the state.
I got a good start using separate:
qecw <- qecw %>% separate(area_title, c("county", "geography level", "state"))
The problem is that there's a variety of ways the geographical data are arranged into strings that makes them not uniform enough to cleanly separate. The area_title column includes names in formats that separate pretty cleanly, like:
area_title
Alabama -- Statewide
Autauga County, Alabama
which splits pretty well into
county    geography level    state
Alabama   Statewide          NA
Autauga   County             Alabama
but this breaks down for cases like:
area_title
Aleutians West Census Area, Alaska
Chattanooga-Cleveland-Dalton TN-GA-AL CSA
U.S. Combined statistical Areas, combined
as well as any states, counties or other place names that have more than one word.
I can go case-by-case to fix these, but I would appreciate a more efficient solution.
The exact data I'm using is "2019.q1-q3 10 10 Total, all industries," available at the link under "Current year quarterly data grouped by industry".
Thanks!
So far I came up with this:
I can get a place name by selecting a substring of area_title with everything to the left of the first comma:
qecw <- qecw %>% mutate(location = sub(",.*", "", area_title))
Then I have a series of nested if_else statements to create a location type:
mutate(`Location Type` =
  if_else(str_detect(area_title, "Statewide"), "State",
  if_else(str_detect(area_title, "County"), "County",
  if_else(str_detect(area_title, "CSA"), "CSA",
  if_else(str_detect(area_title, "MSA"), "MSA",
  if_else(str_detect(area_title, "MicroSA"), "MicroSA",
  if_else(str_detect(area_title, "Undefined"), "Undefined",
  "other")))))))
This isn't a complete answer; I think I'm still missing some location types, and I haven't come up with a good way to extract state names yet.
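A base R sketch of both steps, using the example titles from the question. It assumes titles either end in ", <State>" or have the form "<State> -- Statewide"; anything else gets NA for the state. The names type_patterns, location_type and candidate are hypothetical, and state.name is R's built-in vector of the 50 state names (so DC and territories would need to be added by hand).

```r
area_title <- c("Alabama -- Statewide",
                "Autauga County, Alabama",
                "Aleutians West Census Area, Alaska",
                "Chattanooga-Cleveland-Dalton TN-GA-AL CSA",
                "U.S. Combined statistical Areas, combined")

# Ordered patterns: earlier entries take priority, like the nested if_else.
type_patterns <- c(State = "Statewide", County = "County", CSA = "CSA",
                   MSA = "MSA", MicroSA = "MicroSA", Undefined = "Undefined")

# Assign from lowest to highest priority so earlier patterns overwrite later ones.
location_type <- rep("other", length(area_title))
for (type in rev(names(type_patterns))) {
  hit <- grepl(type_patterns[[type]], area_title, fixed = TRUE)
  location_type[hit] <- type
}

# Candidate state: the text after the last comma, or before " -- Statewide".
candidate <- sub(" -- Statewide$", "", sub("^.*, ", "", area_title))

# Keep the candidate only if it is a real state name.
state <- ifelse(candidate %in% state.name, candidate, NA_character_)

print(data.frame(area_title, location_type, state))
```

Adding a location type is then a matter of adding one entry to type_patterns rather than another level of nesting.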
I am trying to write a function that has 2 arguments: column name and ranking number. The function will read a CSV file that has hospitals from every state. The function should return a data frame with the hospital name that was at the specified rank.
My solution has been to split the main CSV file by state, order each data frame by the desired column, loop through each state's data frame, grab the row (where row number = rank number), store each state's hospital name into a vector, then create a dataframe using the vector from the loop.
When I test each part of my function in the console, I am able to receive the results I need. However, when I run the function altogether, it isn't storing the hospital names as desired.
Here's what I have:
rankall <- function(outcome, num = "best") {
  outcomedf <- read.csv("outcome-of-care-measures.csv")
  #using this as a test
  outcomedf <- outcomedf[order(outcomedf[, 11], outcomedf[, 2]), ]
  #create empty vectors for hospital name and state
  hospital <- c()
  state <- c()
  #split the read dataframe
  splitdf <- split(outcomedf, outcomedf$State)
  #for loop through each split df
  for (i in 1:length(splitdf)) {
    #store the ranked hospital name into hospital vector
    hospital[i] <- as.character(splitdf[[i]][num, 2])
    #store the ranked hospital state into state vector
    state[i] <- as.character(splitdf[[i]][, 7])
  }
  #create a df with hospital and state
  rankdf <- data.frame(hospital, state)
  return(rankdf)
}
When I run the function altogether, I receive NA in my 'hospital' column, but when I run each part of the function individually, I am able to receive the desired hospital names. I'm a little confused as to why I am able to run each individual part of this function outside of the function and it returns the results I want, but not when I run the function as a whole. Thank you.
This question is related to Programming Assignment 3 in the Johns Hopkins University R Programming course on Coursera. As such, we can't provide a "corrected version" of the code because doing so would violate the Coursera Honor Code.
When run with the default settings, your implementation of the rankall() function fails because num is passed straight to the extract operator and there is no ["best",2] row, so the lookup returns NA. If you run your code as head(rankall("heart attack",1)) it produces the following output.
> head(rankall("heart attack",1))
  hospital                             state
1 PROVIDENCE ALASKA MEDICAL CENTER     AK
2 CRESTWOOD MEDICAL CENTER             AL
3 ARKANSAS HEART HOSPITAL              AR
4 MAYO CLINIC HOSPITAL                 AZ
5 GLENDALE ADVENTIST MEDICAL CENTER    CA
6 ST MARYS HOSPITAL AND MEDICAL CENTER CO
To completely correct your function, you'll need to make the following changes.
1. Add code to sort the data by the desired outcome column: heart attack, heart failure, or pneumonia.
2. Add code to handle the condition when a state does not have enough hospitals to return a valid result (e.g. the 43rd ranked hospital for pneumonia in Puerto Rico).
3. Add code to handle "best" and "worst" as inputs to the num argument.
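The NA behaviour described above can be reproduced on a toy data frame, independent of the assignment data. This is only an illustration of how the extract operator treats a row label or position that does not exist; the data frame toy and its columns are made up for the example.

```r
# A made-up data frame with default row names "1", "2", "3".
toy <- data.frame(name = c("A", "B", "C"),
                  score = c(3, 1, 2),
                  stringsAsFactors = FALSE)

# A numeric row index selects by position.
by_position <- toy[2, "name"]

# A character row index is matched against row names; no row is named "best",
# so the result is NA rather than an error.
by_missing_label <- toy["best", "name"]

# A position past the last row also yields NA rather than an error.
past_the_end <- toy[10, "name"]

print(by_position)
print(by_missing_label)
print(past_the_end)
```

Neither case raises an error or a warning, which is why the function runs to completion and quietly fills the hospital column with NA.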
Hopefully, this is a fairly straightforward question. I am using R to subset some data that I am working with. Below is the print() of some of that data. I am trying to create a subset() of the data based on JobCode. As you can see, JobCode follows a pattern (00-0000) where the first 2 digits are the same for a specific industry.
ID State StateName JobCode
1  AL    Alabama   51-9199
2  AL    Alabama   27-3011
4  AL    Alabama   49-9043
5  AL    Alabama   49-2097
My current attempt is test <- subset(data, data$State == "AL" & data$JobCode == ("15-####")) (where # is a placeholder for the remaining 4 digits) to subset for JobCode values beginning with "15-". Is there any way to tell subset() to match any value for those remaining 4 digits?
I'm sorry for the poor formatting as I am new to StackOverflow and I am also quite inexperienced with R. Thank you for your help.
There are no wildcard characters in string equality. You need to use a function. You could use substr() to extract the first three characters:
test <- subset(data, State == "AL" & substr(JobCode,1,3) == ("15-"))
Also note that you don't need to use data$ inside the subset() call. Variables are evaluated in the context of the data frame for that function.
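Two other base R ways to express the same prefix test are startsWith() and an anchored regular expression with grepl(). The sketch below rebuilds the four rows printed in the question and filters on the "49-" prefix, since none of the rows shown start with "15-".

```r
# The four rows shown in the question.
data <- data.frame(
  ID        = c(1, 2, 4, 5),
  State     = c("AL", "AL", "AL", "AL"),
  StateName = c("Alabama", "Alabama", "Alabama", "Alabama"),
  JobCode   = c("51-9199", "27-3011", "49-9043", "49-2097"),
  stringsAsFactors = FALSE
)

# startsWith() compares the leading characters of each string with the prefix.
by_prefix <- subset(data, State == "AL" & startsWith(JobCode, "49-"))

# grepl() with "^" anchors the pattern to the start of the string.
by_regex <- subset(data, State == "AL" & grepl("^49-", JobCode))

print(by_prefix)
```

Note that startsWith() expects character input, so a factor column would need as.character() first.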
You can use the %like% operator from the data.table package:
library(data.table)
setDT(df)
df[ State == "AL" & JobCode %like% "15-" ]
I have a column of a data frame that I want to categorize.
> df$orgName
[1] "Hank Rubber" "United Steel of Chicago"
[3] "Muddy Lakes Solar" "West cable"
I want to categorize the column using the categories list below that contains a list of subcategories.
metallurgy <- c('steel', 'iron', 'mining', 'aluminum', 'metal', 'copper' ,'geolog')
energy <- c('petroleum', 'coal', 'oil', 'power', 'petrol', 'solar', 'nuclear')
plastics <- c('plastic', 'rubber')
wiring <- c('wire', 'cable')
categories = list(metallurgy, energy, plastics, wiring)
So far I've been able to use a series of nested ifelse statements to categorize the column as shown below, but the number of categories and subcategories keeps increasing.
df$commSector <-
  ifelse(grepl(paste(metallurgy, collapse="|"), df$orgName, ignore.case=TRUE), 'metallurgy',
  ifelse(grepl(paste(energy, collapse="|"), df$orgName, ignore.case=TRUE), 'energy',
  ifelse(grepl(paste(plastics, collapse="|"), df$orgName, ignore.case=TRUE), 'plastics',
  ifelse(grepl(paste(wiring, collapse="|"), df$orgName, ignore.case=TRUE), 'wiring', ''))))
I've thought about using a set of nested lapply statements, but I'm not too sure how to execute it.
Lastly, does anyone know of any R libraries that have functions to do this?
Thanks a lot for everyone's time.
Cheers.
One option is to collect the vectors into a named list with mget, paste the elements of each vector together with "|" (as the OP showed), use grep to find the indices of the elements of 'orgName' that match (or use value = TRUE to extract those elements directly), and then stack the result to create a data.frame.
res <- setNames(stack(lapply(mget(c("metallurgy", "energy", "plastics", "wiring")),
                             function(x) df$orgName[grep(paste(x, collapse="|"),
                                                         tolower(df$orgName))])),
                c("orgName", "commSector"))
res
#                  orgName commSector
#1 United Steel of Chicago metallurgy
#2       Muddy Lakes Solar     energy
#3             Hank Rubber   plastics
#4              West cable     wiring
If we have other columns in 'df', do a merge
merge(df, res, by = "orgName")
#                  orgName commSector
#1             Hank Rubber   plastics
#2       Muddy Lakes Solar     energy
#3 United Steel of Chicago metallurgy
#4              West cable     wiring
data
df <- data.frame(orgName = c("Hank Rubber", "United Steel of Chicago",
                             "Muddy Lakes Solar", "West cable"),
                 stringsAsFactors=FALSE)
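If the goal is simply a new column on the original data frame, the nested ifelse chain from the question can also be collapsed into a loop over a named list. This is a sketch rather than a replacement for the stack/merge approach: it assumes the list order encodes priority (earlier categories win, as in the nested ifelse), and it leaves unmatched names as ''.

```r
df <- data.frame(orgName = c("Hank Rubber", "United Steel of Chicago",
                             "Muddy Lakes Solar", "West cable"),
                 stringsAsFactors = FALSE)

# Naming the list elements lets the loop recover the category label.
categories <- list(
  metallurgy = c("steel", "iron", "mining", "aluminum", "metal", "copper", "geolog"),
  energy     = c("petroleum", "coal", "oil", "power", "petrol", "solar", "nuclear"),
  plastics   = c("plastic", "rubber"),
  wiring     = c("wire", "cable")
)

# Start with the fallback value, then fill from lowest to highest priority
# so that earlier categories overwrite later ones.
df$commSector <- ""
for (sector in rev(names(categories))) {
  pattern <- paste(categories[[sector]], collapse = "|")
  df$commSector[grepl(pattern, df$orgName, ignore.case = TRUE)] <- sector
}

print(df)
```

New categories or keywords then only require editing the list, not the matching code.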
I have a shapefile of the UK: https://geoportal.statistics.gov.uk/Docs/Boundaries/Local_authority_district_(GB)_2014_Boundaries_(Generalised_Clipped).zip
I've read the shapefile into a variable, UK
>UK <- readOGR(dsn = ".....")
>England <- UK
I'd like to only display English Local Authority regions. They are specified in LAD_DEC_2014_GB_BGC.dbf, where LAD14CD starts with "E".
>UK@data
    LAD14CD   LAD14NM              LAD14NMW
0   E06000001 Hartlepool           <NA>
1   E06000002 Middlesbrough        <NA>
2   E06000003 Redcar and Cleveland <NA>
371 W06000015 Cardiff              Caerdydd
>#filter UK@data and replace England@data with only English regions
>England@data <- UK@data$LAD14CD[c(grep("^E", UK$LAD14CD))]
>plot(England)
But the grep command appears to change the shapefile into a factor, meaning the plot looks like this:
With this command:
England <- UK@data$LAD14CD[c(grep("^E", UK$LAD14CD))]
...you are subsetting just one column from the data slot, not the whole shapefile and assigning that to England.
This ought to do the job:
England <- UK[grep("^E", UK@data$LAD14CD),]
Note that you need the trailing comma in there! Also, you don't need to wrap the grep statement in c(). It doesn't hurt; it's just unnecessary.
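The difference the trailing comma makes can be seen on an ordinary data frame, used here as a stand-in for the spatial object so the example runs without rgdal or sp. The rows are the ones printed in the question; the variable names lad, keep, one_column and whole_rows are hypothetical.

```r
# Stand-in for UK@data, built from the rows shown in the question.
lad <- data.frame(
  LAD14CD = c("E06000001", "E06000002", "E06000003", "W06000015"),
  LAD14NM = c("Hartlepool", "Middlesbrough", "Redcar and Cleveland", "Cardiff"),
  stringsAsFactors = FALSE
)

# grep() returns the positions of the codes that start with "E".
keep <- grep("^E", lad$LAD14CD)

# Indexing a single column gives back just a vector of codes.
one_column <- lad$LAD14CD[keep]

# Indexing the object itself with a trailing comma keeps whole rows.
whole_rows <- lad[keep, ]

str(one_column)
str(whole_rows)
```

On the spatial object, row indexing of the form UK[rows, ] plays the same role: it keeps the matching features together with their attribute rows, which is what the answer above relies on.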
I ended up using dplyr and grepl instead to make things simpler:
library('rgdal')
library('dplyr')
UK <- readOGR(dsn="LAD_DEC_2014_GB_BGC.shp", layer="LAD_DEC_2014_GB_BGC") %>%
  subset(grepl("^E", LAD14CD))
plot(UK)