How to subset a row from list based on condition - r

I have a data.table called outcome, which has a column called hospital and a column called state. It has already been sorted. Now I want to pick the nth hospital from each state (if a state has no nth hospital, it should return NA for that state). I tried to solve it as shown below. (Since this is homework, I only show the third branch, which is the one that fails.)
rankall <- function(out, num = "best"){
outcome <- readdata(outcome = out) # returns a data.table sorted by rate
...
outcome <- lapply(outcome, function(x) ifelse(num <= nrow(x), x[num,], c(NA,NA)))
outcome <- rbindlist(outcome)
}
The original outcome is like
> data
hospital state
1: NYU HOSPITALS CENTER NY
2: DOYLESTOWN HOSPITAL PA
3: AVERA HEART HOSPITAL OF SOUTH DAKOTA LLC SD
4: GLENDALE ADVENTIST MEDICAL CENTER CA
5: WATERBURY HOSPITAL CT
---
2716: DESERT SPRINGS HOSPITAL NV
2717: THREE RIVERS COMMUNITY HOSPITAL OR
2718: ROBERT WOOD JOHNSON UNIVERSITY HOSPITAL AT RAHWAY NJ
2719: LAREDO MEDICAL CENTER TX
2720: MEDICAL CENTER SOUTH ARKANSAS AR
The first and second branches produce the right result, which looks like
> head (data)
hospital state
1: NA AK
2: CRESTWOOD MEDICAL CENTER AL
3: ARKANSAS HEART HOSPITAL AR
4: MAYO CLINIC HOSPITAL AZ
5: GLENDALE ADVENTIST MEDICAL CENTER CA
6: ST MARYS HOSPITAL AND MEDICAL CENTER CO
> nrow(data)
[1] 54
However, the third branch does not work. It produces the error
Error in rbindlist(outcome) :
Item 1 of list input is not a data.frame, data.table or list
After debugging, I found that the result of the lapply call in the third branch looks like this (which is what causes the error in the last step)
$AK
[1] NA
$AL
$AL[[1]]
[1] "HIGHLANDS MEDICAL CENTER"
This differs from the first two branches, which give something like
> head(data,2)
$AK
hospital state
1: PROVIDENCE ALASKA MEDICAL CENTER AK
$AL
hospital state
1: CRESTWOOD MEDICAL CENTER AL
So I wonder what's wrong with the third branch.
Could anyone help me out? Thank you very much!
By the way, I wonder whether I can refer to a variable that has the same name as another one. For example, when I call readdata I need to pass an argument called outcome, which stops me from using that name for the argument of the rankall function (I use out instead). I know that in Java this.outcome would help, so what about R?

Thanks to Vivek's help, I have figured it out now.
The first point is the failure of the third branch. It works properly if I first convert num to a number using as.numeric(num). I think the cause is that num is treated as a character string (since its possible values include "best" and "worst").
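A small sketch of why the type of num matters, assuming num arrives as a string such as "20" (the values here are made up): when one side of a comparison is a character string, R compares both sides as strings.

```r
num <- "20"      # hypothetical rank passed as a string
n_rows <- 100    # hypothetical number of hospitals in one state

# One operand is character, so both sides are compared as strings.
as_text <- num <= n_rows

# Converting first gives the intended numeric comparison.
as_number <- as.numeric(num) <= n_rows
```

Here as_text is FALSE even though 20 is less than 100, while as_number is TRUE.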
The second point, about the name clash, is stranger. It did not work in my own test, but it does work now that Vivek has answered my question. So we can simply use the following code, and R gives the right result.
rankall <- function(outcome, num = "best"){
outcome <- readdata(outcome = outcome)
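
A minimal sketch of why the same name can be reused, with a made-up readdata standing in for the real one (its body is an assumption for illustration only). In outcome = outcome, the left side names the parameter of the function being called and the right side is evaluated in the caller, so the two do not clash.

```r
# Hypothetical stand-in for readdata(); the real one reads and sorts a file.
readdata <- function(outcome) {
  toupper(outcome)
}

rankall <- function(outcome, num = "best") {
  # Left: readdata's parameter name. Right: rankall's own argument value.
  outcome <- readdata(outcome = outcome)
  outcome
}

res <- rankall("heart attack")
```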

The output below is a warning sign: the first element of outcome is just NA, an atomic vector, which rbindlist is asked to bind with the second element, a list ($AL[[1]]).
$AK
[1] NA
$AL
$AL[[1]]
[1] "HIGHLANDS MEDICAL CENTER"
Possible solution:
One way would be to return a data.frame or list in all three branches of the condition. For example, with data.frame, once num is numeric, the following should work in all cases (best, worst, and an intermediate row).
lapply(outcome, function(x) data.frame(state=x[num,"state"],hospital=x[num,"hospital"]))
For states with no match, the hospital column should then contain NA. Could you please check whether this works?
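
To illustrate the point on toy data, here is a hedged sketch. The hospital names are made up, and base R's split, lapply and do.call(rbind, ...) are used in place of data.table and rbindlist so the example runs on its own. It shows that ifelse keeps only the first element of the row, while a plain if/else returns a full one-row data frame for every state.

```r
# Toy data standing in for the sorted hospital table (names are made up).
toy <- data.frame(
  hospital = c("A", "B", "C", "D"),
  state    = c("AK", "AL", "AL", "AL"),
  stringsAsFactors = FALSE
)
by_state <- split(toy, toy$state)
num <- 2

# ifelse() returns something shaped like its test (length 1 here),
# so only the first element of the selected row survives.
broken <- ifelse(num <= nrow(by_state$AL), by_state$AL[num, ], c(NA, NA))

# A plain if/else keeps the whole row; short states get an NA hospital.
pick_nth <- function(x, n) {
  if (n <= nrow(x)) {
    x[n, c("hospital", "state")]
  } else {
    data.frame(hospital = NA_character_, state = x$state[1],
               stringsAsFactors = FALSE)
  }
}
result <- do.call(rbind, lapply(by_state, pick_nth, n = num))
```

Every element returned by pick_nth has the same two columns, so the pieces bind together without the error from the question.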

Related

Create data frame of names of 50 states in R

I'm working on a problem where I'm trying to map each state to a region for some data analysis. It seems the first thing I need to do is create a dataframe containing the names of all 50 states. Is there a way to do this without explicitly naming each state and inputting it into a row in the dataframe?
Sample data:
region_key <- as.data.frame("")
colnames(region_key) <- c("state")
region_key$region <- ""
region_key$state <- "AL"
I create an empty data frame, create a "state" and "region" column, then populate the state two letter abbreviations in the above fashion. Is there a way to both populate the data frame with the state abbreviations and classify by region (e.g. Alabama would be "South")?
Expected output:
head(region_key)
state region
1 AL South
Thanks in advance for your help!
Figured out my problem based on the comment from @alistair, thank you.
Sample data:
region_key <- data.frame(state.abb, state.region)
head(region_key)
state.abb state.region
1 AL South
2 AK West
3 AZ West
4 AR South
5 CA West
6 CO West
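
If the column names from the expected output are wanted, they can be set when the data frame is built. A small sketch using the built-in state.abb and state.region vectors (al_region is just an illustrative lookup):

```r
# Name the columns directly instead of keeping state.abb / state.region.
region_key <- data.frame(state = state.abb, region = state.region)

# Look up the region for one state.
al_region <- as.character(region_key$region[region_key$state == "AL"])
```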

R Storing Values from For Loop to Data Frame

I am trying to write a function that has 2 arguments: column name and ranking number. The function will read a CSV file that has hospitals from every state. The function should return a data frame with the hospital name that was at the specified rank.
My solution has been to split the main CSV file by state, order each data frame by the desired column, loop through each state's data frame, grab the row (where row number = rank number), store each state's hospital name into a vector, then create a dataframe using the vector from the loop.
When I test each part of my function in the console, I am able to receive the results I need. However, when I run the function altogether, it isn't storing the hospital names as desired.
Here's what I have:
rankall <- function(outcome, num = "best") {
outcomedf <- read.csv("outcome-of-care-measures.csv")
#using this as a test
outcomedf <- outcomedf[order(outcomedf[, 11], outcomedf[, 2]), ]
#create empty vectors for hospital name and state
hospital <- c()
state <- c()
#split the read dataframe
splitdf <- split(outcomedf, outcomedf$State)
#for loop through each split df
for (i in 1:length(splitdf)) {
#store the ranked hospital name into hospital vector
hospital[i] <- as.character(splitdf[[i]][num, 2])
#store the ranked hospital state into state vector
state[i] <- as.character(splitdf[[i]][, 7])
}
#create a df with hospital and state
rankdf <- data.frame(hospital, state)
return(rankdf)
}
When I run the function altogether, I receive NA in my 'hospital' column, but when I run each part of the function individually, I am able to receive the desired hospital names. I'm a little confused as to why I am able to run each individual part of this function outside of the function and it returns the results I want, but not when I run the function as a whole. Thank you.
This question is related to Programming Assignment 3 in the Johns Hopkins University R Programming course on Coursera. As such, we can't provide a "corrected version" of the code because doing so would violate the Coursera Honor code.
When run with the default settings, your implementation of the rankall() function fails because when you use num in the extract operator there is no ["best",2] row. If you run your code as head(rankall("heart attack",1)) it produces the following output.
> head(rankall("heart attack",1))
hospital state
1 PROVIDENCE ALASKA MEDICAL CENTER AK
2 CRESTWOOD MEDICAL CENTER AL
3 ARKANSAS HEART HOSPITAL AR
4 MAYO CLINIC HOSPITAL AZ
5 GLENDALE ADVENTIST MEDICAL CENTER CA
6 ST MARYS HOSPITAL AND MEDICAL CENTER CO
To completely correct your function, you'll need to make the following changes.
Add code to sort the data by the desired outcome column: heart attack, heart failure, or pneumonia
Add code to handle the condition when a state does not have enough hospitals to return a valid result (e.g. the 43rd ranked hospital for pneumonia in Puerto Rico)
Add code to handle "best" and "worst" as inputs to the num argument
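
The NA result described above can be reproduced on a toy data frame, without touching the assignment data. This is only an illustration of how a character row index behaves, with made-up values; it is not a fix for the function.

```r
# Toy data frame (made-up values) with default row names "1", "2", "3".
toy <- data.frame(id = 1:3,
                  name = c("HOSPITAL A", "HOSPITAL B", "HOSPITAL C"),
                  stringsAsFactors = FALSE)

# A character row index is looked up among the row names; "best" is not one.
by_label <- toy["best", 2]

# A numeric row index selects by position.
by_position <- toy[1, 2]
```

Here by_label is NA because no row is named "best", while by_position is "HOSPITAL A".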

Categorizing a data frame column in R using grep on a list of a list

I have a column of a data frame that I want to categorize.
> df$orgName
[1] "Hank Rubber" "United Steel of Chicago"
[3] "Muddy Lakes Solar" "West cable"
I want to categorize the column using the categories list below that contains a list of subcategories.
metallurgy <- c('steel', 'iron', 'mining', 'aluminum', 'metal', 'copper' ,'geolog')
energy <- c('petroleum', 'coal', 'oil', 'power', 'petrol', 'solar', 'nuclear')
plastics <- c('plastic', 'rubber')
wiring <- c('wire', 'cable')
categories = list(metallurgy, energy, plastics, wiring)
So far I've been able to use a series of nested ifelse statements to categorize the column as shown below, but the number of categories and subcategories keeps increasing.
df$commSector <-
ifelse(grepl(paste(metallurgy,collapse="|"),df$orgName,ignore.case=TRUE), 'metallurgy',
ifelse(grepl(paste(energy,collapse="|"),df$orgName,ignore.case=TRUE), 'energy',
ifelse(grepl(paste(plastics,collapse="|"),df$orgName,ignore.case=TRUE), 'plastics',
ifelse(grepl(paste(wiring,collapse="|"),df$orgName,ignore.case=TRUE), 'wiring',''))))
I've thought about using a set of nested lapply statements, but I'm not too sure how to execute it.
Lastly does anyone know of any R Libraries that may have functions to do this.
Thanks a lot for everyone's time.
Cheers.
One option would be to get the vectors as a named list using mget, paste the elements of each vector together (as shown by the OP), use grep to find the indices of the elements of 'orgName' that match (or use value = TRUE to extract those elements directly), and then stack the result to create a data.frame.
res <- setNames(stack(lapply(mget(c("metallurgy", "energy", "plastics", "wiring")),
function(x) df$orgName[grep(paste(x, collapse="|"),
tolower(df$orgName))])), c("orgName", "commSector"))
res
# orgName commSector
#1 United Steel of Chicago metallurgy
#2 Muddy Lakes Solar energy
#3 Hank Rubber plastics
#4 West cable wiring
If we have other columns in 'df', do a merge
merge(df, res, by = "orgName")
# orgName commSector
#1 Hank Rubber plastics
#2 Muddy Lakes Solar energy
#3 United Steel of Chicago metallurgy
#4 West cable wiring
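An alternative sketch, not part of the answer above, that keeps one row per organisation and mirrors the priority order of the nested ifelse from the question. It assumes the keyword vectors are stored in a named list; the loop fills in a category only where none has been assigned yet.

```r
df <- data.frame(orgName = c("Hank Rubber", "United Steel of Chicago",
                             "Muddy Lakes Solar", "West cable"),
                 stringsAsFactors = FALSE)

# Same keyword vectors as in the question, stored as a named list.
categories <- list(
  metallurgy = c("steel", "iron", "mining", "aluminum", "metal", "copper", "geolog"),
  energy     = c("petroleum", "coal", "oil", "power", "petrol", "solar", "nuclear"),
  plastics   = c("plastic", "rubber"),
  wiring     = c("wire", "cable")
)

df$commSector <- ""
for (sector in names(categories)) {
  pattern <- paste(categories[[sector]], collapse = "|")
  hit <- grepl(pattern, df$orgName, ignore.case = TRUE)
  # Only fill rows that are still unlabelled, so earlier categories win.
  df$commSector[hit & df$commSector == ""] <- sector
}
```

Adding a category then only means adding one entry to the list; rows that match nothing keep the empty string, as in the original ifelse chain.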
data
df <- data.frame(orgName = c("Hank Rubber", "United Steel of Chicago",
"Muddy Lakes Solar", "West cable"), stringsAsFactors=FALSE)

Filter by first two digit of a number in R

My data looks like this.
AK ALASKA DEPT OF PUBLIC SAFETY 1005-00-073-9421 RIFLE,5.56 MILLIMETER
AK ALASKA DEPT OF PUBLIC SAFETY 1005-00-073-9421 RIFLE,5.56 MILLIMETER
I am looking to filter the data in multiple different ways. For example, I filter by the type of equipment, such as column 4, with the code
rifle.off <- city.data[[i]][city.data[[i]][,4]=="RIFLE,5.56 MILLIMETER",]
Where city.data is a list of matrices with data from 31 cities (so I iterate through a for loop to isolate the rifle data for each city). I would like to also filter by the number in the third column. Specifically, I only need to filter by the first two digits, i.e. I would like to isolate all line items where the number in column 3 begins with '10'. How would I modify my above code to isolate only the first two digits but let all the other digits be anything?
Edit: Providing an example of the city.data matrix as requested. First off city.data is a list made with:
city.data <- list(albuq, austin, baltimore, charlotte, columbus, dallas, dc, denver, detroit)
where each city name is a matrix. Each individual matrix is isolated by police department using:
phoenix <- vector()
for (i in 1:nrow(gun.mat)){
if (gun.mat[i,2]=="PHOENIX DEPT OF PUBLIC SAFETY"){
phoenix <- rbind(gun.mat[i,],phoenix)
}
}
where gun.mat is just the original matrix containing all observations. phoenix looks like
state police.dept nsn type quantity price date.shipped name
AZ PHOENIX DEPT OF PUBLIC SAFETY 1240-01-411-1265 SIGHT,REFLEX 1 331 1 3/29/13 OPTICAL SIGHTING AND RANGING EQUIPMENT
AZ PHOENIX DEPT OF PUBLIC SAFETY 1240-01-411-1265 SIGHT,REFLEX 1 331 1 3/29/13 OPTICAL SIGHTING AND RANGING EQUIPMENT
AZ PHOENIX DEPT OF PUBLIC SAFETY 1240-01-411-1265 SIGHT,REFLEX 1 331 1 3/29/13 OPTICAL SIGHTING AND RANGING EQUIPMENT
Try this:
Start from the original data in the first block of the question and subset it. (Here data is assumed to be a data frame, and column3 and column4 stand for whatever the third and fourth columns are called.)
Rifle556 <- subset(data, column4 == "RIFLE,5.56 MILLIMETER")
After that, subset the data again to keep only the rows whose value in column 3 starts with "10"
s <- "^10"
Rifle55610 <- subset(Rifle556, grepl(s, column3))
This way you have the data subset according to your condition.
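
Since the question works with matrices rather than data frames, here is a hedged sketch of the same idea using plain matrix indexing. The rows are made up, and column 3 is assumed to hold the number as a string.

```r
# Made-up rows in the shape of the question's matrices (column 3 is the number).
gear <- matrix(
  c("AK", "DEPT A", "1005-00-073-9421", "RIFLE,5.56 MILLIMETER",
    "AZ", "DEPT B", "1240-01-411-1265", "SIGHT,REFLEX",
    "AZ", "DEPT B", "2310-00-000-0000", "TRUCK"),
  ncol = 4, byrow = TRUE
)

# TRUE where column 3 begins with "10"; the later digits can be anything.
starts_10 <- startsWith(gear[, 3], "10")

# drop = FALSE keeps the result a matrix even if only one row matches.
gear_10 <- gear[starts_10, , drop = FALSE]
```

The third row contains "10" but does not begin with it, so it is left out. The same logical vector can be combined with the equipment test from the question using &.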

What's the smart way to aggregate data?

Suppose there is a dataset of different regions, each region a subset of a state, and some outcome variable:
regions <- c("Michigan, Eastern",
"Michigan, Western",
"Minnesota",
"Mississippi, Northern",
"Mississippi, Southern",
"Missouri, Eastern",
"Missouri, Western")
set.seed(123)
outcome <- rpois(7, 12)
testset <- data.frame(regions,outcome)
regions outcome
1 Michigan, Eastern 10
2 Michigan, Western 11
3 Minnesota 17
4 Mississippi, Northern 12
5 Mississippi, Southern 12
6 Missouri, Eastern 17
7 Missouri, Western 13
A useful tool would aggregate each region and add, or take the mean or maximum, etc. of outcome by region and generate a new data frame for state. A sum, for example, would output this:
state outcome
1 Michigan 21
3 Minnesota 17
4 Mississippi 24
6 Missouri 30
The aggregate() function won't solve this problem. Is there something else in R that is built for this? It seems like grep could be used to generate the new column "states" as part of an application specific program. Seems like this would already be out there somewhere though.
The reason this isn't straightforward is that the structure of your data is not consistent, so you couldn't build a library function just for it.
Your state, region column is basically an index column, and you want to index across part of it. tapply is designed for this, but there's no reason for a built-in function to do it automatically for this specific scenario. You could do it without creating the column, though:
tapply(testset$outcome, gsub(",.*$", "", testset$regions), sum)
The gsub call just strips the comma and everything after it, leaving the state name as the index.
PS: you have a slight typo in your example, your data.frame should be
testset <- data.frame(regions,outcome)
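
For what it is worth, aggregate() can also be used once the state name has been pulled out into its own column. A sketch under that assumption, with the outcome values copied from the printed table in the question rather than regenerated with rpois:

```r
regions <- c("Michigan, Eastern", "Michigan, Western", "Minnesota",
             "Mississippi, Northern", "Mississippi, Southern",
             "Missouri, Eastern", "Missouri, Western")
outcome <- c(10, 11, 17, 12, 12, 17, 13)  # values from the printed table
testset <- data.frame(regions, outcome, stringsAsFactors = FALSE)

# Strip the comma and everything after it to get the state name.
testset$state <- sub(",.*$", "", testset$regions)

# Sum the outcome within each state; FUN could also be mean or max.
state_totals <- aggregate(outcome ~ state, data = testset, FUN = sum)
```

This returns a data frame with one row per state (21, 17, 24 and 30 for the four states), matching the sums shown in the question.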
