Rank a sorted dataset using apply function - r

My dataframe looks like this:
head(temp$HName)
[1] "UNIVERSITY OF TEXAS HEALTH SCIENCE CENTER AT TYLER"
[2] "METHODIST HOSPITAL,THE"
[3] "TOMBALL REGIONAL MEDICAL CENTER"
[4] "METHODIST SUGAR LAND HOSPITAL"
[5] "GULF COAST MEDICAL CENTER"
[6] "VHS HARLINGEN HOSPITAL COMPANY LLC"
head(temp$Rate)
[1] 7.3 8.3 8.7 8.7 8.8 8.9
76 Levels: 7.3 8.3 8.7 8.8 8.9 9 9.1 9.2 9.3 9.4 9.5 9.6 ... 17.1
> head(temp$Rank)
[1] NA NA NA NA NA NA
The temp$Rate is sorted. I am trying to write a function assignRank which gives me a new column temp$Rank which has values as 1, 2, 3, 3, 4, 5
My code is as below:
tapply(temp$Rank,temp$Rate, assignRank)
where :
assignRank<- function(r=1){
temp$Rank <- r
r <- r + 1
return(r)
}
I get following error when running tapply
tapply(temp$Rank,temp$Rate, assignRank)
Error in `$<-.data.frame`(`*tmp*`, "Rank", value = c(NA, NA)) :
replacement has 2 rows, data has 301
Please advise where I am going wrong?

I use data.table for tasks like this, because both sorting and ranking are efficient and the syntax is simple
library(data.table)
setkey(setDT(temp), Rate) # This will sort your data set by Rate in case it's not yet sorted
temp[, Rank := .GRP, by = Rate]
temp
# HName Rate Rank
# 1: UNIVERSITY OF TEXAS HEALTH SCIENCE CENTER AT TYLER 7.3 1
# 2: METHODIST HOSPITAL,THE 8.3 2
# 3: TOMBALL REGIONAL MEDICAL CENTER 8.7 3
# 4: METHODIST SUGAR LAND HOSPITAL 8.7 3
# 5: GULF COAST MEDICAL CENTER 8.8 4
# 6: VHS HARLINGEN HOSPITAL COMPANY LLC 8.9 5
Or you could easily do the same in base R (assuming your data is sorted by Rate):
as.numeric(factor(temp$Rate))
## [1] 1 2 3 3 4 5
Or could also use dense_rank function from dplyr package (which will not require sorting the data set)
library(dplyr)
temp %>%
mutate(Rank = dense_rank(Rate))
# HName Rate Rank
# 1 UNIVERSITY OF TEXAS HEALTH SCIENCE CENTER AT TYLER 7.3 1
# 2 METHODIST HOSPITAL,THE 8.3 2
# 3 TOMBALL REGIONAL MEDICAL CENTER 8.7 3
# 4 METHODIST SUGAR LAND HOSPITAL 8.7 3
# 5 GULF COAST MEDICAL CENTER 8.8 4
# 6 VHS HARLINGEN HOSPITAL COMPANY LLC 8.9 5

Other options (if the data is ordered)
with(temp, cumsum(ave(Rate, Rate, FUN=function(x) c(1,x[-1]!=x[-length(x)]))))
#[1] 1 2 3 3 4 5
with(temp, match(Rate, unique(Rate)) )
#[1] 1 2 3 3 4 5
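For reference, the original tapply call fails because tapply hands assignRank the Rank values of one Rate group at a time. For the tied 8.7 group that is a vector of length 2, which cannot be assigned to a 301-row column (hence "replacement has 2 rows, data has 301"). Below is a minimal base R sketch that also works on unsorted data and on a factor Rate. The small data frame is made up for illustration and is not the asker's data.

```r
# Hypothetical stand-in for the asker's data; Rate is a factor, as in the question
temp <- data.frame(
  HName = c("A", "B", "C", "D", "E", "F"),
  Rate  = factor(c("8.7", "7.3", "8.9", "8.7", "8.3", "8.8")),
  stringsAsFactors = FALSE
)

# Convert the factor to its numeric values (not to its internal level codes)
rate_num <- as.numeric(as.character(temp$Rate))

# Dense rank: position of each value among the sorted distinct values
temp$Rank <- match(rate_num, sort(unique(rate_num)))

temp[order(temp$Rank), ]
```

Because the rank comes from the sorted distinct values, tied rates share a rank and no gaps appear afterwards.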

Related

Convert string to symbol accepted by dplyr in function

My data frame looks like:
> str(b)
'data.frame': 2720 obs. of 3 variables:
$ Hospital.Name: chr "SOUTHEAST ALABAMA MEDICAL CENTER" "MARSHALL MEDICAL CENTER SOUTH" "ELIZA COFFEE MEMORIAL HOSPITAL" "ST VINCENT'S EAST" ...
$ State : chr "AL" "AL" "AL" "AL" ...
$ heart attack : num 14.3 18.5 18.1 17.7 18 15.9 19.6 17.3 17.8 17.5 ...
I want to group it by State, sort them by State and Heart Attack, and then add a column that return row number within each group. The ideal result would look like:
# A tibble: 2,720 x 4
# Groups: State [54]
Hospital.Name State `heart attack` rank
<chr> <chr> <dbl> <int>
1 PROVIDENCE ALASKA MEDICAL CENTER AK 13.4 1
2 ALASKA REGIONAL HOSPITAL AK 14.5 2
3 FAIRBANKS MEMORIAL HOSPITAL AK 15.5 3
4 ALASKA NATIVE MEDICAL CENTER AK 15.7 4
5 MAT-SU REGIONAL MEDICAL CENTER AK 17.7 5
6 CRESTWOOD MEDICAL CENTER AL 13.3 1
7 BAPTIST MEDICAL CENTER EAST AL 14.2 2
8 SOUTHEAST ALABAMA MEDICAL CENTER AL 14.3 3
9 GEORGIANA HOSPITAL AL 14.5 4
10 PRATTVILLE BAPTIST HOSPITAL AL 14.6 5
# ... with 2,710 more rows
so my code is:
outcome<-"heart attack"
c<-arrange(b,State,sym(outcome))%>%
group_by(State)%>%
mutate(rank=row_number(sym(outcome)))
but I got this error:
Error in arrange_impl(.data, dots) : object 'heart attack' not found
When I ran sym(outcome) independently and copied the results into my code, it works:
sym(outcome)
`heart attack`
c<-arrange(b,State,`heart attack`)%>%
+ group_by(State)%>%
+ mutate(rank=rank(`heart attack`))
> c
# A tibble: 2,720 x 4
# Groups: State [54]
Hospital.Name State `heart attack` rank
<chr> <chr> <chr> <dbl>
1 PROVIDENCE ALASKA MEDICAL CENTER AK 13.4 1
2 ALASKA REGIONAL HOSPITAL AK 14.5 2
3 FAIRBANKS MEMORIAL HOSPITAL AK 15.5 3
4 ALASKA NATIVE MEDICAL CENTER AK 15.7 4
5 MAT-SU REGIONAL MEDICAL CENTER AK 17.7 5
6 CRESTWOOD MEDICAL CENTER AL 13.3 1
7 BAPTIST MEDICAL CENTER EAST AL 14.2 2
8 SOUTHEAST ALABAMA MEDICAL CENTER AL 14.3 3
9 GEORGIANA HOSPITAL AL 14.5 4
10 PRATTVILLE BAPTIST HOSPITAL AL 14.6 5
# ... with 2,710 more rows
This is a part of a function, so the 'outcome' needs to be a string. Therefore I tried to convert a string to a symbol so that I can reference the column in dplyr.
can anyone tell me what's happening here?
are there any good ways to achieve my goal?
You need to unquote the symbol with !!:
arrange(b, State, !!sym(outcome))
Or UQ:
arrange(b, State, UQ(sym(outcome)))
Similarly for mutate:
mutate(rank=row_number(!!sym(outcome))) # or mutate(rank=row_number(UQ(sym(outcome))))
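If the column name only ever arrives as a string, another option is to skip symbols entirely and index the column by name with [[ ]]. The following is a base R sketch, not part of the answer above; rank_within_state and the toy data frame are hypothetical names made up for illustration.

```r
# Hypothetical helper: sort by State and the named column, then rank within State
rank_within_state <- function(b, outcome) {
  b <- b[order(b$State, b[[outcome]]), ]
  # ave() applies the ranking function separately inside each State
  b$rank <- ave(b[[outcome]], b$State,
                FUN = function(x) rank(x, ties.method = "first"))
  rownames(b) <- NULL
  b
}

# Toy data; check.names = FALSE keeps the space in the column name
b <- data.frame(
  Hospital.Name  = c("H1", "H2", "H3", "H4", "H5"),
  State          = c("AL", "AK", "AL", "AK", "AL"),
  `heart attack` = c(14.3, 14.5, 13.3, 13.4, 14.2),
  check.names = FALSE, stringsAsFactors = FALSE
)

res <- rank_within_state(b, "heart attack")
res
```

Here the string stays a string throughout, so no unquoting is needed.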
If you are only trying to name the column then you will want to use the backtick (`). (It is typically paired with the ~ on the top left of your keyboard just below the ESC key.) Please note that is not the same as the single quotation mark (').
Variable names often end up written like this after importing header names that contain spaces into tibbles. Any header name that has a space in it gets wrapped in backticks. You need to refer to those columns by also wrapping them in backticks, or else R does not recognize that you are referring to an object in memory; it will treat the quoted name as a plain string instead. R will, however, happily store an object with a space in its name if you assign to a name quoted with " or '.
See the demonstration of the issue below:
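`tidy time` <- 4
'tidy time' <- 5     # a quoted name on the left of <- still assigns to the object `tidy time`
"tidy time" <- 6     # same object again; its value is now 6
print('tidy time')   # prints the string "tidy time"
print("tidy time")   # prints the string "tidy time"
print(`tidy time`)   # prints the object's value, 6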
This is the cause for R's error message.
Hopefully understanding all that will spare you from having to call on the sym function. In any case, if you remove the space in the name the problem will also go away and you can save the backticks for another day.
To learn more about !! and unquoting variables (which psidom was referring to in his answer), and also learn about the related issues that occur in writing functions that rely on referencing objects with non-standard evaluation in dplyr please see here: https://rpubs.com/hadley/dplyr-programming

R: split data based on a factor, add a ranking column and extract

I still haven't figured out how to access the different elements of split data. Here is my problem:
I have a data set that I want to split based on a column (State). I want to have a ranking column added to my data for each subset. This is part of a function I'm writing.
My data set has 3 columns: Hospital, State, Outcome. For each state, I want to add a 'Rank' column that ranks the data based on Outcome; the lowest Outcome will be ranked 1 and the highest Outcome will be ranked last.
How can I use split, sapply/lapply to do this? Is there a better way, like using "arrange"?
My main problem is that when I use either of these methods, I do not know how to access each element of the split or arranged data.
Here's what my data set looks like (the row numbers are not important here):
Hospital State Outcome
1 SOUTHEAST ALABAMA MEDICAL CENTER AL 14.3
2 MARSHALL MEDICAL CENTER SOUTH AL 18.5
3 ELIZA COFFEE MEMORIAL HOSPITAL TX 18.1
7 ST VINCENT'S EAST TX 17.7
8 DEKALB REGIONAL MEDICAL CENTER AL 18.0
9 SHELBY BAPTIST MEDICAL CENTER AL 15.9
The desired outcome would be
Hospital State Outcome Rank
1 SOUTHEAST ALABAMA MEDICAL CENTER AL 14.3 1
2 SHELBY BAPTIST MEDICAL CENTER AL 15.9 2
3 DEKALB REGIONAL MEDICAL CENTER AL 18.0 3
4 MARSHALL MEDICAL CENTER SOUTH AL 18.5 4
5 ST VINCENT'S EAST TX 17.7 1
6 ELIZA COFFEE MEMORIAL HOSPITAL TX 18.1 2
Thanks in advance.
The dplyr package provides a very elegant solution for this type of problem. I'm using the mtcars data as an example:
library(dplyr)
mtcars %>%
group_by(cyl) %>%
mutate(rank = row_number(mpg))
The OP's example is hard to read into R because of all the spaces in the string variable.
Here's a simpler example:
set.seed(1)
DF <- data.frame(id=rep(LETTERS[1:2],sample(5,2))); DF$v <- runif(nrow(DF))*100
# id v
# 1 A 57.28534
# 2 A 90.82078
# 3 B 20.16819
# 4 B 89.83897
# 5 B 94.46753
# 6 B 66.07978
# 7 B 62.91140
Here's a solution without using any packages:
DF$r <- ave(DF$v,DF$id,FUN=rank)
# id v r
# 1 A 57.28534 1
# 2 A 90.82078 2
# 3 B 20.16819 1
# 4 B 89.83897 4
# 5 B 94.46753 5
# 6 B 66.07978 3
# 7 B 62.91140 2
Finally, to order by ranking within each group:
DF[order(DF$id,DF$r),]
# id v r
# 1 A 57.28534 1
# 2 A 90.82078 2
# 3 B 20.16819 1
# 7 B 62.91140 2
# 6 B 66.07978 3
# 4 B 89.83897 4
# 5 B 94.46753 5
If you have ties in the column you're ranking, read the documentation for rank and decide how you want the ties treated. The dplyr and data.table packages (mentioned in the other answers) also have nice functionality for dealing with ties, like the notion of a "dense rank."
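As a quick illustration of how the ties.method argument of base R's rank changes the result, here is a small sketch on a made-up vector. The last line is one common base R idiom for a dense rank and is not taken from the answers above.

```r
# Made-up values with one tie (8.7 appears twice)
x <- c(8.7, 8.1, 8.7, 8.5)

r_average <- rank(x)                         # 3.5 1.0 3.5 2.0 (default: ties share the mean rank)
r_min     <- rank(x, ties.method = "min")    # 3 1 3 2 (ties share the lowest rank)
r_first   <- rank(x, ties.method = "first")  # 3 1 4 2 (ties broken by position)
r_dense   <- match(x, sort(unique(x)))       # 3 1 3 2 (dense: no gaps after ties)
```

With a longer vector the difference between "min" and a dense rank shows up after the tie: "min" skips the next rank, a dense rank does not.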
You could try this
library(data.table)
setDT(dat)[, myrank := rank(Outcome), by = State]
dat[,.SD[order(myrank)], by=State]
# State Hospital Outcome myrank
#1: AL SOUTHEAST ALABAMA MEDICAL CENTER 14.3 1
#2: AL SHELBY BAPTIST MEDICAL CENTER 15.9 2
#3: AL DEKALB REGIONAL MEDICAL CENTER 18.0 3
#4: AL MARSHALL MEDICAL CENTER SOUTH 18.5 4
#5: TX ST VINCENT EAST 17.7 1
#6: TX ELIZA COFFEE MEMORIAL HOSPITAL 18.1 2
Or using ddply
library(plyr)
ddply(dat, .(State), function(x){x$myrank = rank(x$Outcome); x[order(x$myrank),]})
# Hospital State Outcome myrank
#1 SOUTHEAST ALABAMA MEDICAL CENTER AL 14.3 1
#2 SHELBY BAPTIST MEDICAL CENTER AL 15.9 2
#3 DEKALB REGIONAL MEDICAL CENTER AL 18.0 3
#4 MARSHALL MEDICAL CENTER SOUTH AL 18.5 4
#5 ST VINCENT EAST TX 17.7 1
#6 ELIZA COFFEE MEMORIAL HOSPITAL TX 18.1 2
You can use by:
do.call(
rbind,
by(d, list(State = d$State), function(x) { x$Rank <- rank(x$Outcome); x[order(x$Rank), ] }))
where d is your raw data.
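Since the question asks specifically about split and lapply, here is a sketch of that route on the six rows shown in the question. The object names (pieces, ranked, result) are made up here. Note that rank() gives each row's position in sorted order, whereas order() gives the permutation that sorts the rows, so the two are not interchangeable.

```r
# The six example rows from the question
d <- data.frame(
  Hospital = c("SOUTHEAST ALABAMA MEDICAL CENTER", "MARSHALL MEDICAL CENTER SOUTH",
               "ELIZA COFFEE MEMORIAL HOSPITAL", "ST VINCENT'S EAST",
               "DEKALB REGIONAL MEDICAL CENTER", "SHELBY BAPTIST MEDICAL CENTER"),
  State    = c("AL", "AL", "TX", "TX", "AL", "AL"),
  Outcome  = c(14.3, 18.5, 18.1, 17.7, 18.0, 15.9),
  stringsAsFactors = FALSE
)

# split() returns a named list with one data frame per state
pieces <- split(d, d$State)

# Add the rank to each piece and sort the piece by it
ranked <- lapply(pieces, function(x) {
  x$Rank <- rank(x$Outcome, ties.method = "first")
  x[order(x$Rank), ]
})

ranked$TX        # one element of the list, accessed by state name
ranked[["AL"]]   # equivalent access with [[ ]]

# Stack the pieces back into a single data frame
result <- do.call(rbind, ranked)
rownames(result) <- NULL
result
```

Each list element is an ordinary data frame, so anything that works on d also works on ranked$TX.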

How to sort alphabetically rows of a data frame? [duplicate]

I am trying to sort c alphabetically if x[i] == x[i+1]. I used the order() function, but it changes the x column as well. I want to order the entire row:
best <- function(state){
HospitalName<-vector()
StateName<-vector()
HeartAttack<-vector()
k<-1
outcome<-read.csv("outcome-of-care-measures.csv",colClasses= "character")
temp<-(outcome[,c(2,7,11,17,23)])
for (i in 1:nrow(temp)){
if(identical(state,temp[i,2])==TRUE){
HospitalName[k]<-temp[i,1]
StateName[k]<-temp[i,2]
HeartAttack[k]<-as.numeric(temp[i,4])
k<-k+1
}}
frame<-data.frame(cbind(HospitalName,StateName,HeartAttack))
library(dplyr)
frame %>%
group_by(as.numeric(as.character(frame[,3]))) %>%
arrange(frame[,1])
}
Output:
HospitalName StateName HeartAttack
1 FORT DUNCAN MEDICAL CENTER TX 8.1
2 TOMBALL REGIONAL MEDICAL CENTER TX 8.5
3 CYPRESS FAIRBANKS MEDICAL CENTER TX 8.7
4 DETAR HOSPITAL NAVARRO TX 8.7
5 METHODIST HOSPITAL,THE TX 8.8
6 MISSION REGIONAL MEDICAL CENTER TX 8.8
7 BAYLOR ALL SAINTS MEDICAL CENTER AT FW TX 8.9
8 SCOTT & WHITE HOSPITAL-ROUND ROCK TX 8.9
9 THE HEART HOSPITAL BAYLOR PLANO TX 9
10 UT SOUTHWESTERN UNIVERSITY HOSPITAL TX 9
.. ... ... ...
Variables not shown: as.numeric(as.character(frame[, 3])) (dbl)
The output does not contain the HeartAttack column, and I do not understand why.
One solution with dplyr:
library(dplyr)
df %>%
group_by(x) %>%
arrange(c)
Or, as @Akrun mentions in the comments below, just
df %>%
arrange(x,c)
if you are not interested in grouping. Depends on what you want.
Output:
Source: local data frame [5 x 2]
Groups: x
x c
1 2 A
2 2 D
3 3 B
4 3 C
5 5 E
There is another solution in base R but it will only work if your x column is ordered as is, or if you don't mind changing the order it has:
> df[order(df$x, df$c), , drop = FALSE]
x c
2 2 A
1 2 D
4 3 B
3 3 C
5 5 E
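The answer does not show how df was built. One data frame consistent with the printed output, inferred here from the row names 2, 1, 4, 3, 5 (an assumption, not the answerer's actual data), would be:

```r
# Reconstructed example data; the original rows are assumed, not given in the answer
df <- data.frame(
  x = c(2, 2, 3, 3, 5),
  c = c("D", "A", "C", "B", "E"),
  stringsAsFactors = FALSE
)

# Sort by x first, then alphabetically by c within equal x; whole rows move together
sorted <- df[order(df$x, df$c), , drop = FALSE]
sorted
```

The row names in the result record where each row sat in the original data frame.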

Issue with sorting one column after rank is assigned

*****This is to deal with the question asked in Coursera and hence I may not be able to reveal the complete code*****
hi,
below is my data frame (outcome_H)
Hospital_Name H_A H_F PN
ABC 4.5 5 6
CDE 4.5 1 3
EFG 5 2 1
1) I need to rank the column provided in the function call (it could be one of H_A ,H_F,PN)
2) A rank will also be provided in the call. I need to match that rank with the rank calculated above and return the respective Hospital_Name
I used ties.method="first" to solve the tie problem. However, when I look at the final output, the hospital names are not sorted.
Example: if I give rank = 2, I expect CDE to be printed, but due to some problem (which I am not aware of) ABC gets printed for rank=2 and CDE is printed for rank=1.
Below are some parts of code for better understanding:
H_A <- as.numeric(outcome_H$H_A)
HA <- H_A[order(H_A)] # newly added piece to order the value
df <- data.frame(HA,round(rank(HA,ties.method="first")),outcome_H$Hospital_Name)
rowss <- df[order(df$round.rank.HA..),]
Before ordering Output:
HA round.rank.HA.. outcome_H.Hospital.Name
42 8.1 1 FORT DUNCAN MEDICAL CENTER
192 8.5 2 TOMBALL REGIONAL MEDICAL CENTER
61 8.7 4 DETAR HOSPITAL NAVARRO
210 8.7 4 CYPRESS FAIRBANKS MEDICAL CENTER
69 8.8 6 MISSION REGIONAL MEDICAL CENTER
117 8.8 6 METHODIST HOSPITAL,THE
After Ordering output:
HA round.rank.HA..ties.method....first... outcome_H.Hospital.Name
1 8.1 1 PROVIDENCE MEMORIAL HOSPITAL
2 8.5 2 MEMORIAL HERMANN BAPTIST ORANGE HOSPITAL
3 8.7 3 PETERSON REGIONAL MEDICAL CENTER
4 8.7 4 CHILDREN'S HOSPITAL -SCOTT & WHITE HEALTHCARE
5 8.8 5 UNITED REGIONAL HEALTH CARE SYSTEM
6 8.8 6 ST JOSEPH REGIONAL HEALTH CENTER
As you can see, the data with hospital names are completely incorrect.
Any help is very much appreciated.
Thanks,
Pravellika J
You could try H_A <- as.numeric(as.character(outcome_H$H_A))
Output
HA round.rank.HA..ties.method....first... outcome_H.Hospital_Name
1 4.5 1 ABC
2 4.5 2 CDE
3 5.0 3 EFG
I figured it out myself. I had initially assigned HA only one of the three columns (H_A, H_F, PN). Now I combined it with Hospital_Name and ordered it based on both attributes.
Thanks,
Pravellika J
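The underlying problem is that sorting H_A on its own breaks its row alignment with Hospital_Name. A sketch of one way to keep them together is to compute a single ordering from both columns and apply it to the names. The function hospital_at_rank is a hypothetical helper made up here, and the values are stored as strings on the assumption that the file was read as character.

```r
# The example data from the question, with outcome columns assumed to be character
outcome_H <- data.frame(
  Hospital_Name = c("ABC", "CDE", "EFG"),
  H_A = c("4.5", "4.5", "5"),
  H_F = c("5", "1", "2"),
  PN  = c("6", "3", "1"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: name of the hospital at position `num` when sorted by `column`
hospital_at_rank <- function(data, column, num) {
  values <- as.numeric(as.character(data[[column]]))
  # One ordering for whole rows: by value, ties broken alphabetically by name
  ord <- order(values, data$Hospital_Name)
  data$Hospital_Name[ord][num]
}

hospital_at_rank(outcome_H, "H_A", 2)  # "CDE"
hospital_at_rank(outcome_H, "H_F", 1)  # "CDE"
```

Because the ordering is applied to the name column directly, names and values can never drift apart.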

Order a data frame using character and numeric columns

I have a dataframe:
df <- data.frame(name = c("FORT DUNCAN", "DETAR HOSPITAL", "CYPRESS FAIRBANKS","MISSION REGIONAL", "Test"), rate = c(8.0,8.7,8.7,8.1,8.9))
colnames(df) = c("name","rate")
ordered_df <- df[order(df[,2]),]
name rate
1 FORT DUNCAN 8.0
4 MISSION REGIONAL 8.1
2 DETAR HOSPITAL 8.7
3 CYPRESS FAIRBANKS 8.7
5 Test 8.9
I can clearly order the dataframe by the rate variable. However, if two rates are equal, I want to order by name. For example, Detar Hospital and Cypress Fairbanks have the same rate of 8.7, so I want Cypress Fairbanks to move up and Detar Hospital to move down, while Test should remain in its place (the last place according to the rate).
Any ideas???
Cheers
I think I fixed it by:
ordered_df <- df[order(df$rate, df$name),]
Cheers
Since order accepts many variables via ... you can do the following:
> df[order(df[,2],df[,1] ),]
name rate
1 FORT DUNCAN 8.0
4 MISSION REGIONAL 8.1
3 CYPRESS FAIRBANKS 8.7
2 DETAR HOSPITAL 8.7
5 Test 8.9
