Convert string to symbol accepted by dplyr in function - r

My data frame looks like:
> str(b)
'data.frame': 2720 obs. of 3 variables:
$ Hospital.Name: chr "SOUTHEAST ALABAMA MEDICAL CENTER" "MARSHALL MEDICAL CENTER SOUTH" "ELIZA COFFEE MEMORIAL HOSPITAL" "ST VINCENT'S EAST" ...
$ State : chr "AL" "AL" "AL" "AL" ...
$ heart attack : num 14.3 18.5 18.1 17.7 18 15.9 19.6 17.3 17.8 17.5 ...
I want to group it by State, sort it by State and heart attack, and then add a column that returns the row number within each group. The ideal result would look like:
# A tibble: 2,720 x 4
# Groups: State [54]
Hospital.Name State `heart attack` rank
<chr> <chr> <dbl> <int>
1 PROVIDENCE ALASKA MEDICAL CENTER AK 13.4 1
2 ALASKA REGIONAL HOSPITAL AK 14.5 2
3 FAIRBANKS MEMORIAL HOSPITAL AK 15.5 3
4 ALASKA NATIVE MEDICAL CENTER AK 15.7 4
5 MAT-SU REGIONAL MEDICAL CENTER AK 17.7 5
6 CRESTWOOD MEDICAL CENTER AL 13.3 1
7 BAPTIST MEDICAL CENTER EAST AL 14.2 2
8 SOUTHEAST ALABAMA MEDICAL CENTER AL 14.3 3
9 GEORGIANA HOSPITAL AL 14.5 4
10 PRATTVILLE BAPTIST HOSPITAL AL 14.6 5
# ... with 2,710 more rows
so my code is:
outcome<-"heart attack"
c<-arrange(b,State,sym(outcome))%>%
group_by(State)%>%
mutate(rank=row_number(sym(outcome)))
but I got this error:
Error in arrange_impl(.data, dots) : object 'heart attack' not found
When I ran sym(outcome) independently and copied the results into my code, it works:
sym(outcome)
`heart attack`
c<-arrange(b,State,`heart attack`)%>%
+ group_by(State)%>%
+ mutate(rank=rank(`heart attack`))
> c
# A tibble: 2,720 x 4
# Groups: State [54]
Hospital.Name State `heart attack` rank
<chr> <chr> <chr> <dbl>
1 PROVIDENCE ALASKA MEDICAL CENTER AK 13.4 1
2 ALASKA REGIONAL HOSPITAL AK 14.5 2
3 FAIRBANKS MEMORIAL HOSPITAL AK 15.5 3
4 ALASKA NATIVE MEDICAL CENTER AK 15.7 4
5 MAT-SU REGIONAL MEDICAL CENTER AK 17.7 5
6 CRESTWOOD MEDICAL CENTER AL 13.3 1
7 BAPTIST MEDICAL CENTER EAST AL 14.2 2
8 SOUTHEAST ALABAMA MEDICAL CENTER AL 14.3 3
9 GEORGIANA HOSPITAL AL 14.5 4
10 PRATTVILLE BAPTIST HOSPITAL AL 14.6 5
# ... with 2,710 more rows
This is part of a function, so 'outcome' needs to be a string. That is why I tried converting the string to a symbol, so that I can reference the column in dplyr.
Can anyone tell me what's happening here?
Are there any good ways to achieve my goal?

You need to unquote the symbol with !!:
arrange(b, State, !!sym(outcome))
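Or UQ (the function form of !!; it was soft-deprecated in later rlang releases, so prefer !! in new code):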
arrange(b, State, UQ(sym(outcome)))
Similarly for mutate:
mutate(rank=row_number(!!sym(outcome))) # or mutate(rank=row_number(UQ(sym(outcome))))
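Putting the two calls together inside a function might look like the sketch below. The helper name rank_by_outcome and the four-row toy data frame are made up for illustration; only the column names come from the question.

```r
library(dplyr)
library(rlang)

# Hypothetical helper (name not from the original post): ranks hospitals
# within each State by the column whose name is passed as a string.
rank_by_outcome <- function(data, outcome) {
  data %>%
    arrange(State, !!sym(outcome)) %>%           # unquote the symbol with !!
    group_by(State) %>%
    mutate(rank = row_number(!!sym(outcome))) %>%
    ungroup()
}

# Toy data with the same column names as the question (values are invented)
b <- data.frame(
  Hospital.Name = c("H1", "H2", "H3", "H4"),
  State = c("AL", "AL", "AK", "AK"),
  `heart attack` = c(18.5, 14.3, 15.5, 13.4),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

ranked <- rank_by_outcome(b, "heart attack")
print(ranked)
```

In current dplyr versions, .data[[outcome]] can be used in place of !!sym(outcome) in both arrange() and mutate().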

If you are only trying to name the column, you will want to use the backtick (`). (It typically shares a key with ~ at the top left of the keyboard, just below the ESC key.) Note that it is not the same as the single quotation mark (').
Variables often end up written like this when header names containing spaces are imported into tibbles: any header name with a space in it is displayed wrapped in backticks. You need to wrap those names in backticks when you refer to the columns, or else R does not recognize that you mean an object in memory that it can work with; it will treat a quoted name as a plain string instead. R will, however, happily store an object with a space in its name if you assign to it using " or '.
See the demonstration below:
`tidy time` <- 4
'tidy time' <- 5    # also assigns to the object named tidy time
"tidy time" <- 6    # and so does this; its value is now 6
print('tidy time')  # prints the string "tidy time"
print("tidy time")  # prints the string "tidy time"
print(`tidy time`)  # prints the object's value: 6
This is the cause of R's error message.
Hopefully understanding all that will spare you from having to call the sym function. In any case, if you remove the space from the name, the problem also goes away and you can save the backticks for another day.
To learn more about !! and unquoting variables (which psidom was referring to in his answer), and about the related issues that come up when writing functions that reference objects with non-standard evaluation in dplyr, please see here: https://rpubs.com/hadley/dplyr-programming

Related

How can I filter (dplyr) on the same dataset twice in a 'for' loop? R

I have a dataset that looks like this:
Hospital.Name State heart attack
1 SOUTHEAST ALABAMA MEDICAL CENTER AL 14.3
2 MARSHALL MEDICAL CENTER SOUTH AL 18.5
3 ELIZA COFFEE MEMORIAL HOSPITAL AL 18.1
4 MIZELL MEMORIAL HOSPITAL AL Not Available
5 CRENSHAW COMMUNITY HOSPITAL AL Not Available
6 MARSHALL MEDICAL CENTER NORTH AL Not Available
7 ST VINCENT'S EAST AL 17.7
8 DEKALB REGIONAL MEDICAL CENTER AL 18.0
9 SHELBY BAPTIST MEDICAL CENTER AL 15.9
10 CALLAHAN EYE FOUNDATION HOSPITAL AL Not Available
11 HELEN KELLER MEMORIAL HOSPITAL AL 19.6
12 DALE MEDICAL CENTER AL 17.3
13 CHEROKEE MEDICAL CENTER AL Not Available
14 BAPTIST MEDICAL CENTER SOUTH AL 17.8
15 JACKSON HOSPITAL & CLINIC INC AL 17.5
16 GEORGE H. LANIER MEMORIAL HOSPITAL AL 15.4
17 ELBA GENERAL HOSPITAL AL Not Available
18 EAST ALABAMA MEDICAL CENTER AND SNF AL 16.3
19 WEDOWEE HOSPITAL AL Not Available
20 UNIVERSITY OF ALABAMA HOSPITAL AL 15.0
The goal is to retrieve the hospital name, for a given rank of hospital on 'heart attack' for every state. For example, here I am trying to retrieve the hospital name for the lowest score (rank=1) in the heart attack column, for every state in a data frame.
This is my attempt:
stateVec <- unique(df$State)
outcome <- 'heart attack'
name <- c()
st <- c()
stateVec <- c()
rank <- 1
for (i in 1:length(stateVec)) {
k <- stateVec[i]
df1 <- dplyr::filter(df, State==k)
rankVec <- unique(df[[outcome]])
rankVec <- sort(rankVec[rankVec != 'Not Available'])
key <- rankVec[rank]
df1 <- dplyr::filter(df1, get(outcome, envir = as.environment(df))==key)
df1 <- df1[order(df$Hospital.Name), , drop = F]
d <- df1[1,]
name <- d$Hospital.Name
st <- k
return(data.frame(st, name))
}
I receive the following error:
Error in filter_impl(.data, quo) : Result must have length 98, not 4706
I've tried recreating the problem with the mtcars dataset, and don't get the same error. Any help would be appreciated :)
I think this is what you are looking for.
desired_rank <- 1
df %>%
filter(!is.na(heart.attack)) %>%
group_by(State) %>%
arrange(heart.attack) %>%
slice(desired_rank) %>%
ungroup()
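It removes NA values for heart.attack;
Then groups by State;
Then sorts ascending on heart.attack;
Then returns the hospital at position desired_rank within each state (with desired_rank <- 1, the hospital with the lowest heart.attack value).
The output is a data.frame.
Note that this assumes the column is named heart.attack and that the 'Not Available' entries have already been converted to NA.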
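If the column still has its original name ('heart attack') and still contains 'Not Available' strings, as in the question, a variant along the following lines may work. This is only a sketch: the toy data, the helper column value, and the tie-break on Hospital.Name are assumptions made for illustration.

```r
library(dplyr)

# Toy data in the question's layout (hospital names and values are invented)
df <- data.frame(
  Hospital.Name = c("A", "B", "C", "D", "E"),
  State = c("AL", "AL", "AL", "TX", "TX"),
  `heart attack` = c("14.3", "Not Available", "13.0", "17.7", "16.1"),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

outcome <- "heart attack"
desired_rank <- 1

result <- df %>%
  # "Not Available" becomes NA; the coercion warning is expected here
  mutate(value = suppressWarnings(as.numeric(.data[[outcome]]))) %>%
  filter(!is.na(value)) %>%
  group_by(State) %>%
  # sort within each state; ties are broken alphabetically (an assumption)
  arrange(value, Hospital.Name, .by_group = TRUE) %>%
  slice(desired_rank) %>%
  ungroup() %>%
  select(State, Hospital.Name)

print(result)
```

No loop over states is needed, because group_by() makes slice() pick the row at desired_rank within each state.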

R: split data based on a factor, add a ranking column and extract

I still haven't figured out how to access the different elements of split data. Here is my problem:
I have a data set that I want to split based on a column (State). I want to have a ranking column added to my data for each subset. This is part of a function I'm writing.
My data set has 3 columns: Hospital, State, Outcome. For each state, I want to add a 'Rank' column that ranks the data based on Outcome; the lowest Outcome will be ranked 1 and the highest Outcome will be ranked last.
How can I use split, sapply/lapply to do this? Is there a better way, like using "arrange"?
My main problem is that when I use either of these methods, I do not know how to access each element of the split or arranged data.
Here's what my data set looks like (the row numbers are not important here):
Hospital State Outcome
1 SOUTHEAST ALABAMA MEDICAL CENTER AL 14.3
2 MARSHALL MEDICAL CENTER SOUTH AL 18.5
3 ELIZA COFFEE MEMORIAL HOSPITAL TX 18.1
7 ST VINCENT'S EAST TX 17.7
8 DEKALB REGIONAL MEDICAL CENTER AL 18.0
9 SHELBY BAPTIST MEDICAL CENTER AL 15.9
The desired outcome would be
Hospital State Outcome Rank
1 SOUTHEAST ALABAMA MEDICAL CENTER AL 14.3 1
2 SHELBY BAPTIST MEDICAL CENTER AL 15.9 2
3 DEKALB REGIONAL MEDICAL CENTER AL 18.0 3
4 MARSHALL MEDICAL CENTER SOUTH AL 18.5 4
5 ST VINCENT'S EAST TX 17.7 1
6 ELIZA COFFEE MEMORIAL HOSPITAL TX 18.1 2
Thanks in advance.
The dplyr package provides a very elegant solution for this type of problem. I'm using the mtcars data as an example:
library(dplyr)
mtcars %>%
group_by(cyl) %>%
mutate(rank = row_number(mpg))
The OP's example is hard to read into R because of all the spaces in the string variable.
Here's a simpler example:
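set.seed(1) # the output below is from R < 3.6.0; newer versions use a different default sample() algorithm
DF <- data.frame(id=rep(c("A","B"),sample(5,2))); DF$v <- runif(nrow(DF))*100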
# id v
# 1 A 57.28534
# 2 A 90.82078
# 3 B 20.16819
# 4 B 89.83897
# 5 B 94.46753
# 6 B 66.07978
# 7 B 62.91140
Here's a solution without using any packages:
DF$r <- ave(DF$v,DF$id,FUN=rank)
# id v r
# 1 A 57.28534 1
# 2 A 90.82078 2
# 3 B 20.16819 1
# 4 B 89.83897 4
# 5 B 94.46753 5
# 6 B 66.07978 3
# 7 B 62.91140 2
Finally, to order by ranking within state:
DF[order(DF$id,DF$r),]
# id v r
# 1 A 57.28534 1
# 2 A 90.82078 2
# 3 B 20.16819 1
# 7 B 62.91140 2
# 6 B 66.07978 3
# 4 B 89.83897 4
# 5 B 94.46753 5
If you have ties in the column you're ranking, read the documentation for rank and decide how you want the ties treated. The dplyr and data.table packages (mentioned in the other answers) also have nice functionality for dealing with ties, like the notion of a "dense rank."
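To make the tie handling concrete, here is a small sketch on a made-up vector with one tie. It uses only base R; the "dense" variant via match() is one possible way to get dense ranks without a package.

```r
x <- c(10, 20, 20, 30)

r_average <- rank(x)                        # default: tied values share the average rank
r_min     <- rank(x, ties.method = "min")   # tied values share the lowest rank
r_first   <- rank(x, ties.method = "first") # ties are broken by order of appearance
r_dense   <- match(x, sort(unique(x)))      # dense rank: no gaps after ties

print(rbind(r_average, r_min, r_first, r_dense))
```

With ave(), a particular method can be chosen by passing a wrapper, for example FUN = function(v) rank(v, ties.method = "min").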
You could try this
library(data.table)
setDT(dat)[, myrank := rank(Outcome), by = State]
dat[,.SD[order(myrank)], by=State]
# State Hospital Outcome myrank
#1: AL SOUTHEAST ALABAMA MEDICAL CENTER 14.3 1
#2: AL SHELBY BAPTIST MEDICAL CENTER 15.9 2
#3: AL DEKALB REGIONAL MEDICAL CENTER 18.0 3
#4: AL MARSHALL MEDICAL CENTER SOUTH 18.5 4
#5: TX ST VINCENT EAST 17.7 1
#6: TX ELIZA COFFEE MEMORIAL HOSPITAL 18.1 2
Or using ddply
library(plyr)
ddply(dat, .(State), function(x){x$myrank = rank(x$Outcome); x[order(x$myrank),]})
# Hospital State Outcome myrank
#1 SOUTHEAST ALABAMA MEDICAL CENTER AL 14.3 1
#2 SHELBY BAPTIST MEDICAL CENTER AL 15.9 2
#3 DEKALB REGIONAL MEDICAL CENTER AL 18.0 3
#4 MARSHALL MEDICAL CENTER SOUTH AL 18.5 4
#5 ST VINCENT EAST TX 17.7 1
#6 ELIZA COFFEE MEMORIAL HOSPITAL TX 18.1 2
You can use by:
do.call(
rbind,
by(d, list(State = d$State), function(x) { x$Rank <- rank(x$Outcome); x[order(x$Rank), ] }))
where d is your raw data.
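Since the question asks specifically about split and lapply, the same idea can be sketched that way too. The data below is the sample from the question; the object names pieces, ranked_pieces and al_second are made up for illustration.

```r
d <- data.frame(
  Hospital = c("SOUTHEAST ALABAMA MEDICAL CENTER", "MARSHALL MEDICAL CENTER SOUTH",
               "ELIZA COFFEE MEMORIAL HOSPITAL", "ST VINCENT'S EAST",
               "DEKALB REGIONAL MEDICAL CENTER", "SHELBY BAPTIST MEDICAL CENTER"),
  State = c("AL", "AL", "TX", "TX", "AL", "AL"),
  Outcome = c(14.3, 18.5, 18.1, 17.7, 18.0, 15.9),
  stringsAsFactors = FALSE
)

# split() returns a named list with one data frame per state
pieces <- split(d, d$State)

# lapply() hands each per-state data frame to the function in turn
ranked_pieces <- lapply(pieces, function(x) {
  x$Rank <- rank(x$Outcome, ties.method = "first")
  x[order(x$Rank), ]
})

# Access a single element by name, e.g. the 2nd-ranked hospital in AL
al_second <- ranked_pieces[["AL"]]$Hospital[2]

# Stack the pieces back into one data frame
result <- do.call(rbind, ranked_pieces)
rownames(result) <- NULL
print(result)
```

Each element of the list is an ordinary data frame, so pieces[["AL"]] or pieces$AL can be indexed like any other data frame.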

Issue with sorting one column after rank is assigned

This question relates to a Coursera assignment, so I may not be able to reveal the complete code.
Hi,
Below is my data frame (outcome_H):
Hospital_Name H_A H_F PN
ABC 4.5 5 6
CDE 4.5 1 3
EFG 5 2 1
1) I need to rank the column provided in the function call (it could be one of H_A, H_F, PN).
2) A rank will also be provided in the call. I need to match that rank with the rank calculated above and return the respective Hospital_Name.
I used ties.method="first" to solve the tie problem. However, when I look at the final output, the hospital names are not sorted.
Example: if I give rank=2, I expect CDE to be printed, but due to some problem (which I am not aware of) ABC gets printed for rank=2 and CDE is printed for rank=1.
Below are some parts of the code for better understanding:
H_A <- as.numeric(outcome_H$H_A)
HA <- H_A[order(H_A)] # newly added piece to order the value
df <- data.frame(HA,round(rank(HA,ties.method="first")),outcome_H$Hospital_Name)
rowss <- df[order(df$round.rank.HA..),]
Before ordering Output:
HA round.rank.HA.. outcome_H.Hospital.Name
42 8.1 1 FORT DUNCAN MEDICAL CENTER
192 8.5 2 TOMBALL REGIONAL MEDICAL CENTER
61 8.7 4 DETAR HOSPITAL NAVARRO
210 8.7 4 CYPRESS FAIRBANKS MEDICAL CENTER
69 8.8 6 MISSION REGIONAL MEDICAL CENTER
117 8.8 6 METHODIST HOSPITAL,THE
After Ordering output:
HA round.rank.HA..ties.method....first... outcome_H.Hospital.Name
1 8.1 1 PROVIDENCE MEMORIAL HOSPITAL
2 8.5 2 MEMORIAL HERMANN BAPTIST ORANGE HOSPITAL
3 8.7 3 PETERSON REGIONAL MEDICAL CENTER
4 8.7 4 CHILDREN'S HOSPITAL -SCOTT & WHITE HEALTHCARE
5 8.8 5 UNITED REGIONAL HEALTH CARE SYSTEM
6 8.8 6 ST JOSEPH REGIONAL HEALTH CENTER
As you can see, the data with hospital names are completely incorrect.
Any help is very much appreciated.
Thanks,
Pravellika J
You could try H_A <- as.numeric(as.character(outcome_H$H_A))
Output
HA round.rank.HA..ties.method....first... outcome_H.Hospital_Name
1 4.5 1 ABC
2 4.5 2 CDE
3 5.0 3 EFG
I figured it out myself. I had initially assigned HA from only one of the three columns (H_A, H_F, PN), so it was sorted separately from the hospital names. Now I combine it with Hospital_Name and order by both attributes.
Thanks,
Pravellika J
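The fix described above can be sketched as follows, using the small example frame from the question. The variable names col, wanted_rank and ranked are made up for illustration, and the values are stored as strings on the assumption that the column was read in as text.

```r
outcome_H <- data.frame(
  Hospital_Name = c("ABC", "CDE", "EFG"),
  H_A = c("4.5", "4.5", "5"),
  stringsAsFactors = FALSE
)

col <- "H_A"        # column chosen in the function call
wanted_rank <- 2    # rank requested in the function call

vals <- as.numeric(as.character(outcome_H[[col]]))

# Order the rows themselves, so names and values stay together;
# ties in the value are broken by hospital name
ord <- order(vals, outcome_H$Hospital_Name)
ranked <- outcome_H[ord, ]
ranked$Rank <- seq_len(nrow(ranked))

answer <- ranked$Hospital_Name[ranked$Rank == wanted_rank]
print(answer)
```

Sorting only the value vector, as in HA <- H_A[order(H_A)], breaks the row-wise link to Hospital_Name, which would explain the mismatched names in the output above.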

Rank a sorted dataset using apply function

My dataframe looks like this:
head(temp$HName)
[1] "UNIVERSITY OF TEXAS HEALTH SCIENCE CENTER AT TYLER"
[2] "METHODIST HOSPITAL,THE"
[3] "TOMBALL REGIONAL MEDICAL CENTER"
[4] "METHODIST SUGAR LAND HOSPITAL"
[5] "GULF COAST MEDICAL CENTER"
[6] "VHS HARLINGEN HOSPITAL COMPANY LLC"
head(temp$Rate)
[1] 7.3 8.3 8.7 8.7 8.8 8.9
76 Levels: 7.3 8.3 8.7 8.8 8.9 9 9.1 9.2 9.3 9.4 9.5 9.6 ... 17.1
> head(temp$Rank)
[1] NA NA NA NA NA NA
The temp$Rate is sorted. I am trying to write a function assignRank which gives me a new column temp$Rank which has values as 1, 2, 3, 3, 4, 5
My code is as below:
tapply(temp$Rank,temp$Rate, assignRank)
where :
assignRank<- function(r=1){
temp$Rank <- r
r <- r + 1
return(r)
}
I get following error when running tapply
tapply(temp$Rank,temp$Rate, assignRank)
Error in `$<-.data.frame`(`*tmp*`, "Rank", value = c(NA, NA)) :
replacement has 2 rows, data has 301
Please advise where I am going wrong?
I use data.table for things like this, because both sorting and ranking are very efficient and the syntax is simple
library(data.table)
setkey(setDT(temp), Rate) # This will sort your data set by Rate in case it's not yet sorted
temp[, Rank := .GRP, by = Rate]
temp
# HName Rate Rank
# 1: UNIVERSITY OF TEXAS HEALTH SCIENCE CENTER AT TYLER 7.3 1
# 2: METHODIST HOSPITAL,THE 8.3 2
# 3: TOMBALL REGIONAL MEDICAL CENTER 8.7 3
# 4: METHODIST SUGAR LAND HOSPITAL 8.7 3
# 5: GULF COAST MEDICAL CENTER 8.8 4
# 6: VHS HARLINGEN HOSPITAL COMPANY LLC 8.9 5
Or you could easily do the same using base R (assuming your data is sorted by Rate); just do
as.numeric(factor(temp$Rate))
## [1] 1 2 3 3 4 5
Or you could also use the dense_rank function from the dplyr package (which does not require sorting the data set)
library(dplyr)
temp %>%
mutate(Rank = dense_rank(Rate))
# HName Rate Rank
# 1 UNIVERSITY OF TEXAS HEALTH SCIENCE CENTER AT TYLER 7.3 1
# 2 METHODIST HOSPITAL,THE 8.3 2
# 3 TOMBALL REGIONAL MEDICAL CENTER 8.7 3
# 4 METHODIST SUGAR LAND HOSPITAL 8.7 3
# 5 GULF COAST MEDICAL CENTER 8.8 4
# 6 VHS HARLINGEN HOSPITAL COMPANY LLC 8.9 5
Other options (if the data is ordered)
with(temp, cumsum(ave(Rate, Rate, FUN=function(x) c(1,x[-1]!=x[-length(x)]))))
#[1] 1 2 3 3 4 5
with(temp, match(Rate, unique(Rate)) )
#[1] 1 2 3 3 4 5
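As for why the original attempt fails, as far as I can tell: tapply passes each group of temp$Rank values to assignRank as r, so temp$Rank <- r tries to put a short vector (here of length 2) into a 301-row data frame, and the assignment would only touch a local copy of temp anyway. A base R sketch that avoids the helper function entirely is shown below; the six rows are the head of the data from the question, and converting the factor via as.character is an assumption about how Rate is stored.

```r
temp <- data.frame(
  HName = c("UNIVERSITY OF TEXAS HEALTH SCIENCE CENTER AT TYLER",
            "METHODIST HOSPITAL,THE",
            "TOMBALL REGIONAL MEDICAL CENTER",
            "METHODIST SUGAR LAND HOSPITAL",
            "GULF COAST MEDICAL CENTER",
            "VHS HARLINGEN HOSPITAL COMPANY LLC"),
  Rate = factor(c("7.3", "8.3", "8.7", "8.7", "8.8", "8.9")),
  stringsAsFactors = FALSE
)

# Convert the factor to its numeric values (not its internal level codes)
rate_num <- as.numeric(as.character(temp$Rate))

# Dense rank: position of each value among the sorted distinct values
temp$Rank <- match(rate_num, sort(unique(rate_num)))
print(temp$Rank)
```

Because the distinct values are sorted explicitly, this version does not depend on the rows being ordered by Rate.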

Application of 'apply' functions in R

I have got the following list which was generated by using split function with state as index.
$AK
Hospital_Name State Mortality_Rate
99 PROVIDENCE ALASKA MEDICAL CENTER AK 13.4
100 MAT-SU REGIONAL MEDICAL CENTER AK 17.7
102 FAIRBANKS MEMORIAL HOSPITAL AK 15.5
$AL
Hospital_Name State Mortality_Rate
1 SOUTHEAST ALABAMA MEDICAL CENTER AL 14.3
2 MARSHALL MEDICAL CENTER SOUTH AL 18.5
3 ELIZA COFFEE MEMORIAL HOSPITAL AL 18.1
$AR
Hospital_Name State Mortality_Rate
193 SILOAM SPRINGS MEMORIAL HOSPITAL AR 15.6
194 JOHNSON REGIONAL MEDICAL CENTER AR 16.9
195 WASHINGTON REGIONAL MED CTR AT NORTH HILLS AR 15.2
I want to select the Hospital Name in the 2nd row for each of these states. Can somebody help with an apply function here? I was trying to use sapply the following way (not working):
x <- sapply(test.case3,function(i) test.case3[[i]][2,1])
so that, by varying 'i', I can get results as follows:
> test.case3[[1]][2,1]
[1] "MAT-SU REGIONAL MEDICAL CENTER"
> test.case3[[2]][2,1]
[1] "MARSHALL MEDICAL CENTER SOUTH"
> test.case3[[3]][2,1]
[1] "JOHNSON REGIONAL MEDICAL CENTER"
Kindly advise.
I am required to produce the final report in the following format (ignore the data in the example below):
Hospital_Name State
D W MCMILLAN MEMORIAL HOSPITAL AL
ARKANSAS METHODIST MEDICAL CENTER AR
JOHN C LINCOLN DEER VALLEY HOSPITAL AZ
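One possible way to do this, sketched under the assumption that test.case3 is a named list of data frames as shown above: sapply passes each list element (the data frame itself, not its index) to the function, so the function can index that data frame directly. The object names hospitals, second and report are made up for illustration.

```r
# Rebuild the list from the question's sample rows
hospitals <- data.frame(
  Hospital_Name = c("PROVIDENCE ALASKA MEDICAL CENTER", "MAT-SU REGIONAL MEDICAL CENTER",
                    "FAIRBANKS MEMORIAL HOSPITAL", "SOUTHEAST ALABAMA MEDICAL CENTER",
                    "MARSHALL MEDICAL CENTER SOUTH", "ELIZA COFFEE MEMORIAL HOSPITAL",
                    "SILOAM SPRINGS MEMORIAL HOSPITAL", "JOHNSON REGIONAL MEDICAL CENTER",
                    "WASHINGTON REGIONAL MED CTR AT NORTH HILLS"),
  State = rep(c("AK", "AL", "AR"), each = 3),
  Mortality_Rate = c(13.4, 17.7, 15.5, 14.3, 18.5, 18.1, 15.6, 16.9, 15.2),
  stringsAsFactors = FALSE
)
test.case3 <- split(hospitals, hospitals$State)

# Each call receives one state's data frame as d
second <- sapply(test.case3, function(d) d$Hospital_Name[2])

# Assemble the report; the list names supply the State column
report <- data.frame(
  Hospital_Name = unname(second),
  State = names(second),
  stringsAsFactors = FALSE
)
print(report)
```

If a state has fewer than two rows, d$Hospital_Name[2] is NA for that state, which may or may not be the desired behaviour for the report.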
