I have looked through all the posts I could find on dplyr::arrange() or order() "argument lengths differ" errors, but have not found an explanation.
I'm trying to write a function best() that returns the lowest-rated hospital from a data frame of hospital outcomes (dfout). When I paste the code straight into R it runs without error and returns the name of the hospital with the lowest mortality rate.
Only when I call it as a function does it say "Error in order(State, outcome, Hospital) : argument lengths differ".
The function (note that I used capitalized names for column names and lowercase names for function arguments):
best <- function(state, outcome){
colnames(dfout) <- c("Hospital", "State", "Heartattack", "Heartfailure", "Pneumonia")
##Return hospital name with lowest 30 day mortality rate
arranged <- arrange(dfout, State, outcome, Hospital) ## arrange hospitals by state, mortality rate in the specified outcome in best() and alphabetically for the ties.
arranged1 <- arranged[arranged$State == state,] ## take the part of the ordered list where state = the state specified in best()
arranged1$Hospital[1]
}
Now if I call best("TX", Heartattack) I get "Error in order(State, outcome, Hospital) : argument lengths differ",
but if I simply run the code with state and outcome replaced by "TX" and Heartattack, I get a hospital, like this:
##Return hospital name with lowest 30 day mortality rate
arranged <- arrange(dfout, State, Heartattack, Hospital) ## arrange hospitals by state, mortality rate in the specified outcome in best() and alphabetically for the ties.
arranged1 <- arranged[arranged$State == "TX",] ## take the part of the ordered list where state = the state specified in best()
arranged1$Hospital[1]
[1] "CYPRESS FAIRBANKS MEDICAL CENTER"
My question is really: how can the function fail when the same code, pasted into the command line with the variables filled in, works?
You need to evaluate the outcome parameter inside the function, so that R interprets it as a column of the data frame rather than as a character string. Pass the column name as a string and evaluate it:
arranged <- arrange(dfout, State, eval(parse(text=outcome)), Hospital)
Now
# > best("TX","Heartattack")
# [1] CYPRESS FAIRBANKS MEDICAL CENTER
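An alternative that avoids eval(parse()) is to look the column up by name with [[ ]]. The following is only a sketch in base R on a made-up stand-in for dfout (the real data is not shown in the question), and it assumes the outcome is passed as a string:

```r
# Toy stand-in for dfout; the hospital names and rates are invented.
dfout <- data.frame(
  Hospital = c("B HOSPITAL", "A HOSPITAL", "C HOSPITAL", "D HOSPITAL"),
  State = c("TX", "TX", "TX", "NY"),
  Heartattack = c(10.1, 10.1, 12.5, 9.0),
  stringsAsFactors = FALSE
)

best <- function(state, outcome) {
  # Keep only the requested state, then sort by the outcome column
  # (looked up by its name as a string) with Hospital as the tie-breaker.
  in_state <- dfout[dfout$State == state, ]
  ordered <- in_state[order(in_state[[outcome]], in_state$Hospital), ]
  ordered$Hospital[1]
}

best("TX", "Heartattack")
```

Because in_state[[outcome]] always returns a full column, order() receives vectors of equal length.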
I just started learning R last month and I am learning the aggregate functions.
To start off, I have a data set called property_data and I am trying to get the mean price per city.
I first used the formula method of aggregate:
mean_price_per_city_1 <- aggregate(PRICE ~ PROPERTYCITY,
property_data, mean)
The results are as follow (just the head):
PROPERTYCITY    PRICE
                1.00
ALLISON PARK    193814.08
AMBRIDGE        62328.92
ASPINWALL       226505.50
BADEN           400657.52
BAIRDFORD       59337.37
Then I decided to try the data frame method:
mean_price_per_city_2 <- aggregate(list(property_data$PRICE),
by = list(property_data$PROPERTYCITY),
FUN = mean)
The results are as follow (just the head):
Group.1         c.12000L.. 1783L..4643L..
                1.00
ALLISON PARK    NA
AMBRIDGE        62328.92
ASPINWALL       226505.50
BADEN           400657.52
BAIRDFORD       59337.37
I thought that the two methods would return the same results. However, I noticed that when I used the data frame method, there are NAs in the second column.
I tried checking whether there are NAs in the PRICE column, but found none. So I am lost as to why the two methods don't return the same values.
You have two issues. First aggregate(list(property_data$PRICE), by = list(property_data$PROPERTYCITY), FUN = mean) should just have property_data$PRICE without the list. Only the by= argument must be a list. That is why your column name is so strange. Second, as documented in the manual page (?aggregate), the formula method has a default value of na.action=na.omit, but the method for class data.frame does not. Since you have at least one missing value in the ALLISON PARK group, the formula command deleted that value, but the second command did not so the result for ALLISON PARK is NA.
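The difference can be reproduced on a few made-up rows. This is a sketch only: the toy property_data below stands in for the real data, and passing na.rm = TRUE through to mean() is one possible way to make the data frame method ignore missing prices.

```r
# Toy stand-in for property_data with one missing PRICE.
property_data <- data.frame(
  PROPERTYCITY = c("ALLISON PARK", "ALLISON PARK", "AMBRIDGE", "AMBRIDGE"),
  PRICE = c(100000, NA, 50000, 70000)
)

# Formula method: rows with NA are dropped first (na.action = na.omit).
by_formula <- aggregate(PRICE ~ PROPERTYCITY, property_data, mean)

# Data frame method without NA handling: the group containing NA yields NA.
with_na <- aggregate(property_data$PRICE,
                     by = list(PROPERTYCITY = property_data$PROPERTYCITY),
                     FUN = mean)

# Same call, but na.rm = TRUE is forwarded to mean().
# The value column is named "x" because a bare vector was passed.
without_na <- aggregate(property_data$PRICE,
                        by = list(PROPERTYCITY = property_data$PROPERTYCITY),
                        FUN = mean, na.rm = TRUE)
```

With na.rm = TRUE the group means in without_na$x agree with by_formula$PRICE.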
I am trying to write a function that has 2 arguments: column name and ranking number. The function will read a CSV file that has hospitals from every state. The function should return a data frame with the hospital name that was at the specified rank.
My solution has been to split the main CSV file by state, order each data frame by the desired column, loop through each state's data frame, grab the row (where row number = rank number), store each state's hospital name into a vector, then create a dataframe using the vector from the loop.
When I test each part of my function in the console, I am able to receive the results I need. However, when I run the function altogether, it isn't storing the hospital names as desired.
Here's what I have:
rankall <- function(outcome, num = "best") {
outcomedf <- read.csv("outcome-of-care-measures.csv")
#using this as a test
outcomedf <- outcomedf[order(outcomedf[, 11], outcomedf[, 2]), ]
#create empty vectors for hospital name and state
hospital <- c()
state <- c()
#split the read dataframe
splitdf <- split(outcomedf, outcomedf$State)
#for loop through each split df
for (i in 1:length(splitdf)) {
#store the ranked hospital name into hospital vector
hospital[i] <- as.character(splitdf[[i]][num, 2])
#store the ranked hospital state into state vector
state[i] <- as.character(splitdf[[i]][, 7])
}
#create a df with hospital and state
rankdf <- data.frame(hospital, state)
return(rankdf)
}
When I run the function altogether, I receive NA in my 'hospital' column, but when I run each part of the function individually, I am able to receive the desired hospital names. I'm a little confused as to why I am able to run each individual part of this function outside of the function and it returns the results I want, but not when I run the function as a whole. Thank you.
This question is related to Programming Assignment 3 in the Johns Hopkins University R Programming course on Coursera. As such, we can't provide a "corrected version" of the code because doing so would violate the Coursera Honor code.
When run with the default settings, your implementation of the rankall() function fails because when you use num in the extract operator there is no ["best",2] row. If you run your code as head(rankall("heart attack",1)) it produces the following output.
> head(rankall("heart attack",1))
hospital state
1 PROVIDENCE ALASKA MEDICAL CENTER AK
2 CRESTWOOD MEDICAL CENTER AL
3 ARKANSAS HEART HOSPITAL AR
4 MAYO CLINIC HOSPITAL AZ
5 GLENDALE ADVENTIST MEDICAL CENTER CA
6 ST MARYS HOSPITAL AND MEDICAL CENTER CO
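Why the default num = "best" yields NA can be seen on a toy data frame. This sketch uses invented hospital names and is not a corrected version of the assignment:

```r
toy <- data.frame(State = c("AK", "AK"),
                  Hospital = c("HOSPITAL ONE", "HOSPITAL TWO"),
                  stringsAsFactors = FALSE)

# A character row index is matched against the row names ("1" and "2" here),
# so "best" matches nothing and the result is NA.
by_name <- toy["best", 2]

# A numeric row index selects by position.
by_position <- toy[1, 2]
```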
To completely correct your function, you'll need to make the following changes.
Add code to sort the data by the desired outcome column: heart attack, heart failure, or pneumonia.
Add code to handle the condition when a state does not have enough hospitals to return a valid result (e.g. the 43rd ranked hospital for pneumonia in Puerto Rico).
Add code to handle "best" and "worst" as inputs to the num argument.
I'm trying to write an R function that calculates whether a data subject is eligible for subsidies based on their income (X_INCOMG), the size of their household (household calculated from CHILDREN and NUMADULT), and the federal poverty limit for their household size (fpl_matrix). I use a number of if statements to evaluate whether the record is eligible, but for some reason my code is labeling everyone as eligible, even though I know that's not true. Could someone else take a look at my code?
Note that the coding for the variable X_INCOMG denotes income categories (less than $15000, 25-35000, etc).
#Create a sample data set
sampdf=data.frame(NUMADULT=sample(3,1000,replace=T),CHILDREN=sample(0:5,1000,replace=T),X_INCOMG=sample(5,1000,replace=T))
#Introducing some "impurities" into the data so its more realistic
sampdf[sample(1000,3),'CHILDREN']=13
sampdf[sample(1000,3),'CHILDREN']=NA
sampdf[sample(1000,3),'X_INCOMG']=9
#this is just a matrix of the federal poverty limit, which is based on household size
fpl_2004=matrix(c(
1,9310,
2,12490,
3,15670,
4,18850,
5,22030,
6,25210,
7,28390,
8,31570,
9,34750,
10,37930,
11,41110),byrow=T,ncol=2)
##################here is the function I'm trying to create
fpl250=function(data,fpl_matrix,add_limit){ #add_limit is the money you add on for every extra person beyond a household size of 11
data[which(is.na(data$CHILDREN)),'CHILDREN']=99 #This code wasn't liking NAs so I'm coding NA as 99
data$household=data$CHILDREN+data$NUMADULT #calculate household size
for(i in seq(nrow(data))){
if(data$household[i]<=11){data$bcccp_cutoff[i]=2.5*fpl_matrix[data$household[i],2]} #this calculates what the subsidy cutoff should be, which is 250% of the FPL
else{data$bcccp_cutoff[i]=2.5*((data$household[i]-11)*add_limit+fpl_matrix[11,2])}}
data$incom_elig='yes' #setting the default value as 'yes', then changing each record to 'no' if the income is definitely more than the eligibility cutoff
for(i in seq(nrow(data))){
if(data$X_INCOMG[i]=='1' | data$X_INCOMG[i]=='9'){data$incom_elig='yes'} #This is the lowest income category and almost all of these people will qualify
if(data$X_INCOMG[i]=='2' & data$bcccp_cutoff[i]<15000){data$incom_elig[i]='no'}
if(data$X_INCOMG[i]=='3' & data$bcccp_cutoff[i]<25000){data$incom_elig[i]='no'}
if(data$X_INCOMG[i]=='4' & data$bcccp_cutoff[i]<35000){data$incom_elig[i]='no'}
if(data$X_INCOMG[i]=='5' & data$bcccp_cutoff[i]<50000){data$incom_elig[i]='no'}
if(data$household[i]>90){data$incom_elig[i]='no'}
}
return(data)
}
dd=fpl250(sampdf,fpl_2004,3180)
with(dd,table(incom_elig)) #it's coding all except one as eligible
I know this is a lot of code to digest, but I appreciate whatever help you have to offer!
I find it easier to get the logic working well outside of a function first, then wrap it in a function once it is all working well. My code below does this.
I think one issue was you had the literal comparisons to X_INCOMG as strings (data$X_INCOMG[i]=='1'). That field is a numeric in your sample code, so remove the quotes. Try using a coded factor for X_INCOMG as well. This will make your code easier to manage later.
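One thing worth checking on toy data: R coerces the number to a string in 1 == '1', so the quoted comparisons still match. The line in the question's loop that assigns data$incom_elig='yes' without [i] looks like the more likely culprit, since it overwrites the whole column every time it runs. The names elig, whole and single below are made up for illustration.

```r
# Numeric-to-string comparison: the number is coerced to "1" first.
quoted_matches <- (1 == '1')

# Toy column standing in for data$incom_elig.
elig <- data.frame(incom_elig = c('no', 'no', 'yes'), stringsAsFactors = FALSE)

whole <- elig
whole$incom_elig = 'yes'      # no [i]: every row becomes 'yes'

single <- elig
single$incom_elig[1] = 'yes'  # with [i]: only row 1 changes
```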
There is no need to loop over each row in the data frame.
#put the poverty level data in a data frame for merging
fpl_2004.df<- as.data.frame(fpl_2004)
names(fpl_2004.df)<-c("household","pov.limit")
#Include cutoffs
fpl_2004.df$cutoff = 2.5 * fpl_2004.df$pov.limit
add_limit=3180
#compute household size (if NA's this will skip them)
sampdf$household = numeric(nrow(sampdf))
cc<-which(complete.cases(sampdf))
sampdf$household[cc] = sampdf$NUMADULT[cc] + sampdf$CHILDREN[cc]
#get max household and fill fpl_2004 frame
max.hh<-max(sampdf$household,na.rm=TRUE)
#get the 11 person poverty limit
fpl11=subset(fpl_2004.df,household==11)$pov.limit
#rows to fill out the data frame
append<-data.frame(household=12:max.hh,pov.limit=numeric(max.hh-12+1),
cutoff=2.5 *(((12:max.hh)-11)*add_limit+fpl11))
fpl_2004.df<- rbind(fpl_2004.df,append)
#merge the two data frames
sampdf<- merge(sampdf,fpl_2004.df, by="household",all.x=TRUE)
#Add a logical variable to hold the eligibility
sampdf$elig <- logical(nrow(sampdf))
#compute eligibility
sampdf[!is.na(sampdf$X_INCOMG) & sampdf$X_INCOMG == 1,"elig"] = TRUE
sampdf[!is.na(sampdf$X_INCOMG) & sampdf$X_INCOMG == 9,"elig"] = TRUE
#for clarity define variable of what to subset
lvl2 <-!is.na(sampdf$X_INCOMG) & sampdf$X_INCOMG == 2
lvl2 <- lvl2 & !is.na(sampdf$cutoff) & sampdf$cutoff>=15000
#set the eligibility (note the initial value was false thus cutoff logic reversed)
sampdf[lvl2,"elig"] = TRUE
#continue computing these
lvl3 <-!is.na(sampdf$X_INCOMG) & sampdf$X_INCOMG == 3
lvl3 <- lvl3 & !is.na(sampdf$cutoff) & sampdf$cutoff>=25000
sampdf[lvl3,"elig"] = TRUE
Alternatively, you could load a small data frame with the cutoff comparison values (15000, 25000, 35000, etc.) and the matching X_INCOMG codes. Then merge by X_INCOMG, as I did with the household size, and set all the values in one go as shown below. You may need to use complete.cases again.
sampdf$elig = sampdf$cutoff >= sampdf$comparison.value
With this one-line comparison, incomplete cases end up with elig == NA (with the step-by-step code above they stay FALSE); either way they will need further investigation.
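The lookup-table idea could look roughly like this. It is a sketch under assumptions: income_lookup and its comparison.value column are hypothetical, the cutoff values are made up, and categories 1 and 9 are given a comparison value of 0 so that they always pass, as in the question.

```r
# Toy rows; cutoff stands in for the 250% FPL value computed earlier.
sampdf <- data.frame(X_INCOMG = c(1, 2, 3, 5, 9),
                     cutoff = c(23275, 31225, 23275, 47125, 39175))

# Hypothetical lookup table: lower bound of each income category.
income_lookup <- data.frame(
  X_INCOMG = c(1, 2, 3, 4, 5, 9),
  comparison.value = c(0, 15000, 25000, 35000, 50000, 0)
)

# Attach the comparison value to every row, then compare in one step.
merged <- merge(sampdf, income_lookup, by = "X_INCOMG", all.x = TRUE)
merged$elig <- merged$cutoff >= merged$comparison.value
```

Keeping the thresholds in a table means a new income category only needs a new row, not another block of subsetting code.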
Using the below data frame called 'data', I'm able to directly assign values to two variables, 'state' and 'measure', and identify the school with the lowest score within the subset:
Create dataframe 'data':
school<-c("NYU", "BYU", "USC", "FIT", "Oswego","UCLA","USF","Columbia")
state<-c("NY","UT","CA","NY","NY","CA", "CA","NY")
measure<-c("MSAT","MSAT","GPA","MSAT","MSAT","GPA","GPA","GPA")
score<-c(590, 490, 2.9, 759, 550, 1.2, 3.1, 3.2)
data<-data.frame(school,state, measure,score)
Subset on a 'state' and 'measure':
answer<-subset(data,subset=(state=="NY" & measure=="MSAT"))
order.answer<-order(answer$score,answer$school) #answer$school is tie-breaker
answer1<-as.matrix(answer[order.answer,])
answer1[1,1]
This is the correct answer:
[1] "Oswego"
My problem is that when I create a function to accomplish the same thing, I get an incorrect result:
lowest <- function(state, measure){
answer<-subset(data,subset=(state==state & measure==measure))
order.answer<-order(answer$score,answer$school)
answer1<-as.matrix(answer[order.answer,])
answer1[1,1]
}
lowest("NY","MSAT")
Incorrect answer:
[1] "UCLA"
The problem seems to be that the variables 'state' and 'measure' don't take on the values of the arguments "NY" and "MSAT" in the subset line of the function. I've experimented with '=' instead of '==' and also tried subset(data,subset=(state=="state" & measure=="measure")), but can't find a solution.
Something seems to go awry in your function because your function arguments have the same names as your data columns; this version works:
lowest <- function(State, Measure){
answer<-subset(data,subset=(state==State & measure==Measure))
order.answer<-order(answer$score,answer$school)
answer1<-as.matrix(answer[order.answer,])
answer1[1,1]
}
##
> lowest("NY","MSAT")
[1] "Oswego"
UPDATE: After mulling this over a bit more, I think I can offer a little more detail about what is going on internally. Notice that your first (manual) subset worked correctly above:
> subset(data,subset=(state=="NY" & measure=="MSAT"))
school state measure score
1 NYU NY MSAT 590
4 FIT NY MSAT 759
5 Oswego NY MSAT 550
However, notice that if we create two objects in the global environment, state <- "NY" and measure <- "MSAT", this does not work
state <- "NY"
measure <- "MSAT"
> subset(data,state==state & measure==measure)
school state measure score
1 NYU NY MSAT 590.0
2 BYU UT MSAT 490.0
3 USC CA GPA 2.9
4 FIT NY MSAT 759.0
5 Oswego NY MSAT 550.0
6 UCLA CA GPA 1.2
7 USF CA GPA 3.1
8 Columbia NY GPA 3.2
The reason (I believe) has to do with R's scoping rules and how they operate within functions. When a function is called in R, the call creates a new, temporary environment, and the objects in that function reside in this local frame. They take priority: if you have a variable x in the global environment and a function defined as foo <- function(x){do something interesting to x}, then "do something interesting" acts on the object passed into foo via the argument x, not on the object x in the global environment. R uses lexical scoping, meaning that if an object is not found in the local frame, the interpreter searches the enclosing environments (starting with the one in which the function was defined) until it finds the object. So if instead of foo <- function(x){do something interesting to x} you had foo <- function(z){do something interesting to x}, but you still had x defined in the global environment, then rather than raising an error the function would find the object x in the global environment and "do something interesting" to it.
In the example directly above, the reason subset(data,state==state & measure==measure) did not produce the desired results and subset(data,state=="NY" & measure=="MSAT") did is that subset() evaluates the condition with the columns of data in the local scope, i.e. the columns state and measure take priority over the objects state and measure in the global environment. The subsetting condition state==state & measure==measure therefore compared each column with itself and evaluated to TRUE for each row of data, so the "subset" returned was the entirety of data. Now, if we do
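The lookup order can be seen in a small sketch. The functions foo, bar and caller are made up for illustration:

```r
x <- "global x"

# The argument x shadows the global x inside the function.
foo <- function(x) paste("using", x)

# No local x here, so R looks in the enclosing environment,
# which is the global environment where bar was defined.
bar <- function(z) paste("using", x)

# The lookup follows where bar was defined, not who called it,
# so the x local to caller() is not seen by bar().
caller <- function() {
  x <- "caller's x"
  bar("ignored")
}

foo("argument x")  # "using argument x"
bar("ignored")     # "using global x"
caller()           # "using global x"
```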
State <- "NY"
Measure <- "MSAT"
> subset(data, state==State & measure==Measure)
school state measure score
1 NYU NY MSAT 590
4 FIT NY MSAT 759
5 Oswego NY MSAT 550
this works fine: since State and Measure are not found among the columns of data, the interpreter keeps searching through environments until it first encounters these objects (in this case, it finds them in the global environment). This is why your function lowest produced the desired results when I changed the arguments to State and Measure (and made the respective changes in the function body). Really, you could change them to just about anything, as long as the names do not clash with the column names of data; capitalizing their first letter was a quick fix.
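Another way to sidestep the name clash, shown here only as a sketch, is to skip subset() and index with data$ explicitly, so that the column and the argument can never be confused even when they share a name:

```r
school  <- c("NYU", "BYU", "USC", "FIT", "Oswego", "UCLA", "USF", "Columbia")
state   <- c("NY", "UT", "CA", "NY", "NY", "CA", "CA", "NY")
measure <- c("MSAT", "MSAT", "GPA", "MSAT", "MSAT", "GPA", "GPA", "GPA")
score   <- c(590, 490, 2.9, 759, 550, 1.2, 3.1, 3.2)
data <- data.frame(school, state, measure, score, stringsAsFactors = FALSE)

lowest <- function(state, measure) {
  # data$state is always the column; the bare name state is the argument.
  keep <- data$state == state & data$measure == measure
  answer <- data[keep, ]
  # Sort by score, with school as the tie-breaker.
  answer <- answer[order(answer$score, answer$school), ]
  answer$school[1]
}

lowest("NY", "MSAT")
```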
I have some data (from an R course assignment, but that doesn't matter) on which I want to use the split-apply-combine strategy, but I'm having some problems. The data is in a DataFrame called outcome, and each row represents a hospital. Each column holds a piece of information about that hospital, such as name, location, rates, etc.
My objective is to obtain the Hospital with the lowest "Mortality by Heart Attack Rate" of each State.
I was playing around with some strategies, and got a problem using the by function:
best_heart_rate(df) = sort(df, cols = :Mortality)[end,:]
best_hospitals = by(hospitals, :State, best_heart_rate)
The idea was to split the hospitals DataFrame by State, sort each of the SubDataFrames by Mortality Rate, take the lowest one, and combine the rows into a new DataFrame.
But when I used this strategy, I got:
ERROR: no method nrow(SubDataFrame{Array{Int64,1}})
in sort at /home/paulo/.julia/v0.3/DataFrames/src/dataframe/sort.jl:311
in sort at /home/paulo/.julia/v0.3/DataFrames/src/dataframe/sort.jl:296
in f at none:1
in based_on at /home/paulo/.julia/v0.3/DataFrames/src/groupeddataframe/grouping.jl:144
in by at /home/paulo/.julia/v0.3/DataFrames/src/groupeddataframe/grouping.jl:202
I suppose the nrow function is not implemented for SubDataFrames, hence the error. So I used some nastier code:
best_heart_rate(df) = (df[sortperm(df[:,:Mortality] , rev=true), :])[1,:]
best_hospitals = by(hospitals, :State, best_heart_rate)
Seems to work. But now there is a NA problem: how can I remove the rows from the SubDataFrames that have NA on the Mortality column? Is there a better strategy to accomplish my objective?
I think this might work, if I've understood you correctly:
# Let me make up some data about hospitals in states
hospitals = DataFrame(State=sample(["CA", "MA", "PA"], 10), mortality=rand(10), hospital=split("abcdefghij", ""))
hospitals[3, :mortality] = NA
# You can use the indmax function to find the index of the maximum element (indmin gives the minimum)
by(hospitals[complete_cases(hospitals), :], :State, df -> df[indmax(df[:mortality]), [:mortality, :hospital]])
State mortality hospital
1 CA 0.9469632421111882 j
2 MA 0.7137144590022733 f
3 PA 0.8811901895164764 e