R Storing Values from For Loop to Data Frame - r

I am trying to write a function that has 2 arguments: column name and ranking number. The function will read a CSV file that has hospitals from every state. The function should return a data frame with the hospital name that was at the specified rank.
My solution has been to split the main CSV file by state, order each data frame by the desired column, loop through each state's data frame, grab the row (where row number = rank number), store each state's hospital name into a vector, then create a dataframe using the vector from the loop.
When I test each part of my function in the console, I am able to receive the results I need. However, when I run the function altogether, it isn't storing the hospital names as desired.
Here's what I have:
rankall <- function(outcome, num = "best") {
outcomedf <- read.csv("outcome-of-care-measures.csv")
#using this as a test
outcomedf <- outcomedf[order(outcomedf[, 11], outcomedf[, 2]), ]
#create empty vectors for hospital name and state
hospital <- c()
state <- c()
#split the read dataframe
splitdf <- split(outcomedf, outcomedf$State)
#for loop through each split df
for (i in 1:length(splitdf)) {
#store the ranked hospital name into hospital vector
hospital[i] <- as.character(splitdf[[i]][num, 2])
#store the ranked hospital state into state vector
state[i] <- as.character(splitdf[[i]][, 7])
}
#create a df with hospital and state
rankdf <- data.frame(hospital, state)
return(rankdf)
}
When I run the function altogether, I receive NA in my 'hospital' column, but when I run each part of the function individually, I am able to receive the desired hospital names. I'm a little confused as to why I am able to run each individual part of this function outside of the function and it returns the results I want, but not when I run the function as a whole. Thank you.

This question is related to Programming Assignment 3 in the Johns Hopkins University R Programming course on Coursera. As such, we can't provide a "corrected version" of the code because doing so would violate the Coursera Honor code.
When run with the default settings, your implementation of the rankall() function fails because num defaults to "best", and when num is used in the extract operator there is no row named "best", so ["best",2] returns NA. If you run your code with a numeric rank, as in head(rankall("heart attack",1)), it produces the following output.
> head(rankall("heart attack",1))
hospital state
1 PROVIDENCE ALASKA MEDICAL CENTER AK
2 CRESTWOOD MEDICAL CENTER AL
3 ARKANSAS HEART HOSPITAL AR
4 MAYO CLINIC HOSPITAL AZ
5 GLENDALE ADVENTIST MEDICAL CENTER CA
6 ST MARYS HOSPITAL AND MEDICAL CENTER CO
To completely correct your function, you'll need to make the following changes.
Add code to sort the data by the desired outcome column: heart attack, heart failure, or pneumonia
Add code to handle the condition when a state does not have enough hospitals to return a valid result (e.g. the 43rd ranked hospital for pneumonia in Puerto Rico)
Add code to handle "best" and "worst" as inputs to the num argument
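The NA described above is easy to reproduce on a toy data frame. The data and variable names below are made up for illustration and are not taken from the assignment; this only demonstrates the indexing behaviour, not a corrected rankall().

```r
# Toy data frame standing in for one state's hospitals (hypothetical values)
df <- data.frame(id = 1:3, name = c("A", "B", "C"), stringsAsFactors = FALSE)

# A character row index is matched against row *names*; no row is named "best"
by_name <- df["best", 2]

# A numeric row index selects by position
by_pos <- df[1, 2]

# A position beyond nrow(df) also yields NA (the "not enough hospitals" case)
out_of_range <- df[10, 2]

print(list(by_name = by_name, by_pos = by_pos, out_of_range = out_of_range))
```

Running each line in the console with a number substituted for num hides the problem, which is probably why the pieces appeared to work individually.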

Related

Change distribution of fields in a categorical variable by given distribution

Let's say we have a data.frame such as the one below:
City
NYC
Boston
NYC
NYC
Providence
Boston
NYC
I want to write the simplest possible function
redistribute <- function(data, column, unique_value, decrease_by) {
#data = dataframe provided by user
#column = column of the respective dataframe
#unique_value = fields contained within the respective column of the respective dataframe
#decrease_by = the desired "portion" or "distribution" of the unique_value within column.
}
Edit:
I will rephrase the question, as it seems to be slightly confusing.
I need to calculate the frequency of the (argument unique_value) within the column. For example, that would be 4/7 or 0.57 for NYC in the City column.
Decrease the number of occurrences of the unique_value so that the frequency reaches the one provided by the user in the function argument (decrease_by). For example, from 0.57 to 0.10 for NYC.
Replace the fields originally occupied by the unique_value with other values from the column, chosen at random. For example, we remove the first occurrence of 'NYC' to reduce the overall frequency of the unique value 'NYC' from 0.57 towards 0.1, and replace it with some random city, 'Boston' for example.
So the expected outcome would be:
City
NYC
Boston
Boston
Providence
Boston
Providence
Boston
I'd like to avoid doing a dozen transformations. I'm looking for the most logical/efficient approach.
What I think you're trying to do is really just putting a few things together into a function. Using your example, let's assume new_level is the proportion of that factor level that you want in the new data.
city = c("NYC", "Boston", "NYC", "NYC", "Providence", "Boston", "NYC")
data = data.frame(city=city)
redistribute <- function(data, column, unique_value, new_level){
## Names of factors and size of data
fac_names <- levels(factor(data[,column]))
size <- nrow(data)
## Make new list using rep and sample with desired ratio
new_col <- c(rep(unique_value,
floor(new_level*size)),
sample(fac_names[which(fac_names!=unique_value)],
size=(size-floor(new_level*size)),
replace=TRUE))
## Mix up and assign to data frame
data[,column] <- sample(new_col)
return(data)
}
redistribute(data, column="city",
unique_value="NYC",
new_level=0.3)
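Note that the function above regenerates the whole column, so rows that never held unique_value can change too. If only some occurrences of unique_value should be overwritten, as the question's steps suggest, a variant along these lines might do. This is a sketch under that assumption; redistribute_in_place is a hypothetical name, and the target count is floor(new_level * nrow(data)) as in the function above.

```r
redistribute_in_place <- function(data, column, unique_value, new_level) {
  values <- as.character(data[[column]])
  idx <- which(values == unique_value)        # rows currently holding unique_value
  keep <- floor(new_level * length(values))   # how many occurrences should remain
  n_replace <- length(idx) - keep
  if (n_replace <= 0) return(data)            # already at or below the target
  others <- setdiff(unique(values), unique_value)
  if (length(others) == 0) stop("no other values available as replacements")
  # Index via sample.int() to avoid sample()'s surprise when given a single number
  replace_rows <- idx[sample.int(length(idx), n_replace)]
  values[replace_rows] <- others[sample.int(length(others), n_replace, replace = TRUE)]
  data[[column]] <- values
  data
}

set.seed(1)  # only to make the illustration repeatable
city <- c("NYC", "Boston", "NYC", "NYC", "Providence", "Boston", "NYC")
result <- redistribute_in_place(data.frame(city = city, stringsAsFactors = FALSE),
                                column = "city", unique_value = "NYC", new_level = 0.3)
print(result)
```

With seven rows and new_level = 0.3, two NYC rows are kept and two are replaced, while the Boston and Providence rows stay where they were.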

error in order() argument lengths differ R

I have looked through all the posts I could find on dplyr::arrange() or order() "argument lengths differ" errors, but have not found an explanation.
I'm trying to write a function best() that returns the lowest-rated value from a data frame of hospital outcomes (dfout). When I copy the code straight into R it runs without an error, returning the hospital name with the lowest mortality rate.
Only when I call it as a function does it say "Error in order(State, outcome, Hospital) : argument lengths differ"
The function: (note that I used capitalized names for column names and non-capitalized names for function variables)
best <- function(state, outcome){
colnames(dfout) <- c("Hospital", "State", "Heartattack", "Heartfailure", "Pneumonia")
##Return hospital name with lowest 30 day mortality rate
arranged <- arrange(dfout, State, outcome, Hospital) ## arrange hospitals by state, mortality rate in the specified outcome in best() and alphabetically for the ties.
arranged1 <- arranged[arranged$State == state,] ## take the part of the ordered list where state = the state specified in best()
arranged1$Hospital[1]
}
Now if I call best("TX", Heartattack) I get "Error in order(State, outcome, Hospital) : argument lengths differ",
but if I simply run the code and replace state and outcome with "TX" and Heartattack I get a hospital, like this
##Return hospital name with lowest 30 day mortality rate
arranged <- arrange(dfout, State, Heartattack, Hospital) ## arrange hospitals by state, mortality rate in the specified outcome in best() and alphabetically for the ties.
arranged1 <- arranged[arranged$State == "TX",] ## take the part of the ordered list where state = the state specified in best()
arranged1$Hospital[1]
[1] "CYPRESS FAIRBANKS MEDICAL CENTER"
My question is really: how can the function fail when the same code, with the values substituted in, works at the command line?
You need to evaluate the outcome parameter inside the function, so that R interprets it as a column (variable) of the data frame rather than as text. Pass the column name as a string:
arranged <- arrange(dfout, State, eval(parse(text=outcome)), Hospital)
Now
# > best("TX","Heartattack")
# [1] CYPRESS FAIRBANKS MEDICAL CENTER
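An alternative that avoids eval(parse()) is to look the column up by name with [[ ]]. The sketch below uses base R order() instead of arrange(), and a small made-up dfout (the hospital names and rates are hypothetical), so it is only an illustration of the idea, not the asker's data.

```r
# Hypothetical stand-in for the asker's dfout
dfout <- data.frame(
  Hospital = c("B HOSPITAL", "A HOSPITAL", "C HOSPITAL", "D HOSPITAL"),
  State = c("TX", "TX", "TX", "CA"),
  Heartattack = c(10.1, 10.1, 12.5, 9.0),
  stringsAsFactors = FALSE
)

best <- function(state, outcome) {
  # Keep only the requested state
  in_state <- dfout[dfout$State == state, ]
  # outcome is a string, so in_state[[outcome]] fetches that column;
  # ties on the rate are broken alphabetically by hospital name
  ordered <- in_state[order(in_state[[outcome]], in_state$Hospital), ]
  ordered$Hospital[1]
}

best_tx <- best("TX", "Heartattack")
print(best_tx)
```

Because in_state[[outcome]] is a full column, every argument passed to order() has the same length, which is the condition the original error message complained about.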

Filter by first two digit of a number in R

My data looks like this.
AK ALASKA DEPT OF PUBLIC SAFETY 1005-00-073-9421 RIFLE,5.56 MILLIMETER
AK ALASKA DEPT OF PUBLIC SAFETY 1005-00-073-9421 RIFLE,5.56 MILLIMETER
I am looking to filter the data in multiple different ways. For example, I filter by the type of equipment, such as column 4, with the code
rifle.off <- city.data[[i]][city.data[[i]][,4]=="RIFLE,5.56 MILLIMETER",]
Where city.data is a list of matrices with data from 31 cities (so I iterate through a for loop to isolate the rifle data for each city). I would like to also filter by the number in the third column. Specifically, I only need to filter by the first two digits, i.e. I would like to isolate all line items where the number in column 3 begins with '10'. How would I modify my above code to isolate only the first two digits but let all the other digits be anything?
Edit: Providing an example of the city.data matrix as requested. First off city.data is a list made with:
city.data <- list(albuq, austin, baltimore, charlotte, columbus, dallas, dc, denver, detroit)
where each city name is a matrix. Each individual matrix is isolated by police department using:
phoenix <- vector()
for (i in 1:nrow(gun.mat)){
if (gun.mat[i,2]=="PHOENIX DEPT OF PUBLIC SAFETY"){
phoenix <- rbind(gun.mat[i,],phoenix)
}
}
where gun.mat is just the original matrix containing all observations. phoenix looks like
state police.dept nsn type quantity price date.shipped name
AZ PHOENIX DEPT OF PUBLIC SAFETY 1240-01-411-1265 SIGHT,REFLEX 1 331 1 3/29/13 OPTICAL SIGHTING AND RANGING EQUIPMENT
AZ PHOENIX DEPT OF PUBLIC SAFETY 1240-01-411-1265 SIGHT,REFLEX 1 331 1 3/29/13 OPTICAL SIGHTING AND RANGING EQUIPMENT
AZ PHOENIX DEPT OF PUBLIC SAFETY 1240-01-411-1265 SIGHT,REFLEX 1 331 1 3/29/13 OPTICAL SIGHTING AND RANGING EQUIPMENT
Try this:
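Take the original data from the first block in the question, assuming it is in a data frame called data with columns named column3 and column4. Subset it by equipment type.
Rifle556 <- subset(data, column4 == "RIFLE,5.56 MILLIMETER")
After that, subset the result again to keep only the rows whose column 3 starts with "10". The ^ anchors the pattern to the start of the string, and grepl() returns the logical vector that subset() expects.
s <- '^10'
Rifle55610 <- subset(Rifle556, grepl(s, column3))
This way you have the data subset according to your condition.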
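Since the question says each element of city.data is a matrix rather than a data frame, the same filter can also be written with plain matrix indexing. A minimal sketch, using a two-row matrix built from the rows shown in the question (city_mat is a hypothetical name):

```r
# Small character matrix shaped like the first four columns in the question
city_mat <- matrix(
  c("AK", "ALASKA DEPT OF PUBLIC SAFETY", "1005-00-073-9421", "RIFLE,5.56 MILLIMETER",
    "AZ", "PHOENIX DEPT OF PUBLIC SAFETY", "1240-01-411-1265", "SIGHT,REFLEX"),
  ncol = 4, byrow = TRUE
)

# TRUE where column 3 begins with "10"; the remaining characters can be anything
starts_10 <- grepl("^10", city_mat[, 3])

# drop = FALSE keeps the result a matrix even when only one row matches
filtered <- city_mat[starts_10, , drop = FALSE]

# The two conditions can be combined with &
rifles_10 <- city_mat[starts_10 & city_mat[, 4] == "RIFLE,5.56 MILLIMETER", , drop = FALSE]
print(filtered)
```

Inside the loop from the question, city_mat would be replaced by city.data[[i]].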

Julia DataFrames: Problems with Split-Apply-Combine strategy

I have some data (from an R course assignment, but that doesn't matter) on which I want to use the split-apply-combine strategy, but I'm having some problems. The data is in a DataFrame called outcome, and each row represents a hospital. Each column holds a piece of information about that hospital, such as name, location, rates, etc.
My objective is to obtain the Hospital with the lowest "Mortality by Heart Attack Rate" of each State.
I was playing around with some strategies, and got a problem using the by function:
best_heart_rate(df) = sort(df, cols = :Mortality)[end,:]
best_hospitals = by(hospitals, :State, best_heart_rate)
The idea was to split the hospitals DataFrame by State, sort each of the SubDataFrames by Mortality Rate, get the lowest one, and combine the lines in a new DataFrame
But when I used this strategy, I got:
ERROR: no method nrow(SubDataFrame{Array{Int64,1}})
in sort at /home/paulo/.julia/v0.3/DataFrames/src/dataframe/sort.jl:311
in sort at /home/paulo/.julia/v0.3/DataFrames/src/dataframe/sort.jl:296
in f at none:1
in based_on at /home/paulo/.julia/v0.3/DataFrames/src/groupeddataframe/grouping.jl:144
in by at /home/paulo/.julia/v0.3/DataFrames/src/groupeddataframe/grouping.jl:202
I suppose the nrow function is not implemented for SubDataFrames, so I got an error. So I used a nastier code:
best_heart_rate(df) = (df[sortperm(df[:,:Mortality] , rev=true), :])[1,:]
best_hospitals = by(hospitals, :State, best_heart_rate)
Seems to work. But now there is a NA problem: how can I remove the rows from the SubDataFrames that have NA on the Mortality column? Is there a better strategy to accomplish my objective?
I think this might work, if I've understood you correctly:
# Let me make up some data about hospitals in states
hospitals = DataFrame(State=sample(["CA", "MA", "PA"], 10), mortality=rand(10), hospital=split("abcdefghij", ""))
hospitals[3, :mortality] = NA
# You can use the indmax function to find the index of the maximum element
by(hospitals[complete_cases(hospitals), :], :State, df -> df[indmax(df[:mortality]), [:mortality, :hospital]])
State mortality hospital
1 CA 0.9469632421111882 j
2 MA 0.7137144590022733 f
3 PA 0.8811901895164764 e

How to subset a row from list based on condition

I have a data.table outcome, which has a column called hospital and a column called state. The outcome has already been sorted. Now I want to take the Nth hospital from each state (if there is no Nth hospital, return NA for that state). I tried to solve it as shown below. (Since this is homework, I only show the third branch, which is the one that goes wrong.)
rankall <- function(out, num = "best"){
outcome <- readdata(outcome = out) # returns a data.table sorted by rate
...
outcome <- lapply(outcome, function(x) ifelse(num <= nrow(x), x[num,], c(NA,NA)))
outcome <- rbindlist(outcome)
}
The original outcome is like
> data
hospital state
1: NYU HOSPITALS CENTER NY
2: DOYLESTOWN HOSPITAL PA
3: AVERA HEART HOSPITAL OF SOUTH DAKOTA LLC SD
4: GLENDALE ADVENTIST MEDICAL CENTER CA
5: WATERBURY HOSPITAL CT
---
2716: DESERT SPRINGS HOSPITAL NV
2717: THREE RIVERS COMMUNITY HOSPITAL OR
2718: ROBERT WOOD JOHNSON UNIVERSITY HOSPITAL AT RAHWAY NJ
2719: LAREDO MEDICAL CENTER TX
2720: MEDICAL CENTER SOUTH ARKANSAS AR
And the first and second branch could produce the right result, which is like
> head (data)
hospital state
1: NA AK
2: CRESTWOOD MEDICAL CENTER AL
3: ARKANSAS HEART HOSPITAL AR
4: MAYO CLINIC HOSPITAL AZ
5: GLENDALE ADVENTIST MEDICAL CENTER CA
6: ST MARYS HOSPITAL AND MEDICAL CENTER CO
> nrow(data)
[1] 54
However, the third branch does not work. It produces the error
Error in rbindlist(outcome) :
Item 1 of list input is not a data.frame, data.table or list
After debugging, I found that outcome after the condition looks something like this (which is what causes the error in the last step):
$AK
[1] NA
$AL
$AL[[1]]
[1] "HIGHLANDS MEDICAL CENTER"
This differs from the first two branches, where it looks like this:
> head(data,2)
$AK
hospital state
1: PROVIDENCE ALASKA MEDICAL CENTER AK
$AL
hospital state
1: CRESTWOOD MEDICAL CENTER AL
So I wonder what's wrong with the third branch.
Could anyone help me out, thank you very much!!!
By the way, I wonder whether I can refer to a variable that has the same name as another one. For example, when I call readdata I need to pass an argument called outcome, which prevents me from using that name for the argument of the rankall function (I use out instead). I know that in Java this.outcome would help; what about R?
Thank you for Vivek's help, I have figured it out now.
The first point is the misbehaviour of the third branch. It works properly if I first convert num to a number using as.numeric(num). I think the mistake arose because num was treated as a character value (since its possible values include "best" and "worst").
The second point, about the name space, is strange. It did not work in my own test, but it does work now that Vivek has answered my question. So we can simply use the following code, and R will give the right result.
rankall <- function(outcome, num = "best"){
outcome <- readdata(outcome = outcome)
The output below is a warning sign: the first element of outcome is just NA, an atomic vector, which rbindlist() is asked to bind with the second element, $AL, which is a list.
$AK
[1] NA
$AL
$AL[[1]]
[1] "HIGHLANDS MEDICAL CENTER"
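For what it's worth, this shape is what ifelse() produces when it is given a single TRUE/FALSE test: the result has the length of the test, so only the first element of the selected row survives. A small base R illustration with made-up data (x, via_ifelse and via_if are hypothetical names; a plain data.frame stands in for the data.table):

```r
# Made-up stand-in for one state's rows
x <- data.frame(hospital = c("H1", "H2"), state = c("AL", "AL"),
                stringsAsFactors = FALSE)
num <- 1

# ifelse() is vectorised over the test, which has length 1 here,
# so the row is cut down to a one-element list
via_ifelse <- ifelse(num <= nrow(x), x[num, ], c(NA, NA))

# A plain if/else returns whichever object is selected, unchanged
via_if <- if (num <= nrow(x)) x[num, ] else x[NA_integer_, ]

str(via_ifelse)
str(via_if)
```

The if/else form keeps a one-row data frame in both branches, which is the kind of input a row-binding function expects.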
Possible solution:
One way would be to return a data.frame or list from all three branches of the condition. For example, with data.frame, once num is numeric the following should work in all cases: best, worst, and an intermediate row. lapply() is used rather than sapply() so that the result stays a list of data frames that rbindlist() can combine.
lapply(outcome, function(x) data.frame(state=x[num,"state"],hospital=x[num,"hospital"]))
For cases with no match, the hospital column should contain NA. Could you please check whether this works?
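The idea can be sketched in base R as follows. This is an illustration under assumptions, not the asker's code: it uses a plain data.frame and do.call(rbind, ...) in place of data.table and rbindlist(), and the three-row outcome table is made up. The state is taken from the group name so that it is still filled in when the requested rank does not exist.

```r
# Made-up data, already sorted by rate within each state
outcome <- data.frame(
  hospital = c("H1", "H2", "H3"),
  state = c("AL", "AL", "AK"),
  stringsAsFactors = FALSE
)

by_state <- split(outcome, outcome$state)  # one data frame per state
num <- 2                                   # numeric rank to extract

picked <- lapply(names(by_state), function(st) {
  x <- by_state[[st]]
  # Indexing past the end of a vector gives NA, so states with too few
  # hospitals get NA without any special-casing
  data.frame(hospital = x$hospital[num], state = st, stringsAsFactors = FALSE)
})

ranked <- do.call(rbind, picked)
print(ranked)
```

Every element of picked is a one-row data frame, so the final bind never sees a bare NA.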

Resources