How can I filter (dplyr) on the same dataset twice in a 'for' loop? R

I have a dataset that looks like this:
Hospital.Name State heart attack
1 SOUTHEAST ALABAMA MEDICAL CENTER AL 14.3
2 MARSHALL MEDICAL CENTER SOUTH AL 18.5
3 ELIZA COFFEE MEMORIAL HOSPITAL AL 18.1
4 MIZELL MEMORIAL HOSPITAL AL Not Available
5 CRENSHAW COMMUNITY HOSPITAL AL Not Available
6 MARSHALL MEDICAL CENTER NORTH AL Not Available
7 ST VINCENT'S EAST AL 17.7
8 DEKALB REGIONAL MEDICAL CENTER AL 18.0
9 SHELBY BAPTIST MEDICAL CENTER AL 15.9
10 CALLAHAN EYE FOUNDATION HOSPITAL AL Not Available
11 HELEN KELLER MEMORIAL HOSPITAL AL 19.6
12 DALE MEDICAL CENTER AL 17.3
13 CHEROKEE MEDICAL CENTER AL Not Available
14 BAPTIST MEDICAL CENTER SOUTH AL 17.8
15 JACKSON HOSPITAL & CLINIC INC AL 17.5
16 GEORGE H. LANIER MEMORIAL HOSPITAL AL 15.4
17 ELBA GENERAL HOSPITAL AL Not Available
18 EAST ALABAMA MEDICAL CENTER AND SNF AL 16.3
19 WEDOWEE HOSPITAL AL Not Available
20 UNIVERSITY OF ALABAMA HOSPITAL AL 15.0
The goal is to retrieve the hospital name, for a given rank of hospital on 'heart attack' for every state. For example, here I am trying to retrieve the hospital name for the lowest score (rank=1) in the heart attack column, for every state in a data frame.
This is my attempt:
stateVec <- unique(df$State)
outcome <- 'heart attack'
name <- c()
st <- c()
stateVec <- c()
rank <- 1
for (i in 1:length(stateVec)) {
k <- stateVec[i]
df1 <- dplyr::filter(df, State==k)
rankVec <- unique(df[[outcome]])
rankVec <- sort(rankVec[rankVec != 'Not Available'])
key <- rankVec[rank]
df1 <- dplyr::filter(df1, get(outcome, envir = as.environment(df))==key)
df1 <- df1[order(df$Hospital.Name), , drop = F]
d <- df1[1,]
name <- d$Hospital.Name
st <- k
return(data.frame(st, name))
}
I receive the following error:
Error in filter_impl(.data, quo) : Result must have length 98, not 4706
I've tried recreating the problem with the mtcars dataset, and don't get the same error. Any help would be appreciated :)

I think this is what you are looking for.
library(dplyr)
desired_rank <- 1
df %>%
  filter(!is.na(heart.attack)) %>%  # assumes heart.attack is numeric, with NA for missing values
  group_by(State) %>%
  arrange(heart.attack) %>%
  slice(desired_rank) %>%
  ungroup()
It removes NA values for heart.attack;
Then groups by State;
Then sorts ascending on heart.attack;
Then returns the hospital at position desired_rank within each state (for desired_rank <- 1, the hospital with the lowest heart.attack value).
The output is a data frame.
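Note that in the data shown in the question the outcome column contains the string "Not Available", so it is read as character and is.na() alone would not drop those rows. Below is a minimal sketch of the same pipeline on toy data. It assumes the column is named heart.attack and that missing values are coded as "Not Available"; the hospital names and values are made up for illustration, and the tie-break on Hospital.Name is an assumption taken from the ordering step in the question's loop.

```r
library(dplyr)

# Toy data (hypothetical values); the outcome column arrives as character
df <- data.frame(
  Hospital.Name = c("A", "B", "C", "D", "E"),
  State         = c("AL", "AL", "AL", "TX", "TX"),
  heart.attack  = c("14.3", "Not Available", "13.9", "17.7", "16.1"),
  stringsAsFactors = FALSE
)

desired_rank <- 1

result <- df %>%
  # "Not Available" cannot be parsed as a number and becomes NA
  mutate(heart.attack = suppressWarnings(as.numeric(heart.attack))) %>%
  filter(!is.na(heart.attack)) %>%
  group_by(State) %>%
  # sort within each state; ties are broken alphabetically by hospital name
  arrange(heart.attack, Hospital.Name, .by_group = TRUE) %>%
  slice(desired_rank) %>%
  ungroup()

print(result)
```

As for the original error, the most likely cause is that get(outcome, envir = as.environment(df)) looks the column up in the full df, so filter() receives a condition with 4706 values for a df1 that has only 98 rows.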

Related

Convert string to symbol accepted by dplyr in function

My data frame looks like:
> str(b)
'data.frame': 2720 obs. of 3 variables:
$ Hospital.Name: chr "SOUTHEAST ALABAMA MEDICAL CENTER" "MARSHALL MEDICAL CENTER SOUTH" "ELIZA COFFEE MEMORIAL HOSPITAL" "ST VINCENT'S EAST" ...
$ State : chr "AL" "AL" "AL" "AL" ...
$ heart attack : num 14.3 18.5 18.1 17.7 18 15.9 19.6 17.3 17.8 17.5 ...
I want to group it by State, sort it by State and Heart Attack, and then add a column that returns the row number within each group. The ideal result would look like:
# A tibble: 2,720 x 4
# Groups: State [54]
Hospital.Name State `heart attack` rank
<chr> <chr> <dbl> <int>
1 PROVIDENCE ALASKA MEDICAL CENTER AK 13.4 1
2 ALASKA REGIONAL HOSPITAL AK 14.5 2
3 FAIRBANKS MEMORIAL HOSPITAL AK 15.5 3
4 ALASKA NATIVE MEDICAL CENTER AK 15.7 4
5 MAT-SU REGIONAL MEDICAL CENTER AK 17.7 5
6 CRESTWOOD MEDICAL CENTER AL 13.3 1
7 BAPTIST MEDICAL CENTER EAST AL 14.2 2
8 SOUTHEAST ALABAMA MEDICAL CENTER AL 14.3 3
9 GEORGIANA HOSPITAL AL 14.5 4
10 PRATTVILLE BAPTIST HOSPITAL AL 14.6 5
# ... with 2,710 more rows
So my code is:
outcome<-"heart attack"
c<-arrange(b,State,sym(outcome))%>%
group_by(State)%>%
mutate(rank=row_number(sym(outcome)))
But I got this error:
Error in arrange_impl(.data, dots) : object 'heart attack' not found
When I ran sym(outcome) independently and copied the results into my code, it works:
sym(outcome)
`heart attack`
c<-arrange(b,State,`heart attack`)%>%
+ group_by(State)%>%
+ mutate(rank=rank(`heart attack`))
> c
# A tibble: 2,720 x 4
# Groups: State [54]
Hospital.Name State `heart attack` rank
<chr> <chr> <chr> <dbl>
1 PROVIDENCE ALASKA MEDICAL CENTER AK 13.4 1
2 ALASKA REGIONAL HOSPITAL AK 14.5 2
3 FAIRBANKS MEMORIAL HOSPITAL AK 15.5 3
4 ALASKA NATIVE MEDICAL CENTER AK 15.7 4
5 MAT-SU REGIONAL MEDICAL CENTER AK 17.7 5
6 CRESTWOOD MEDICAL CENTER AL 13.3 1
7 BAPTIST MEDICAL CENTER EAST AL 14.2 2
8 SOUTHEAST ALABAMA MEDICAL CENTER AL 14.3 3
9 GEORGIANA HOSPITAL AL 14.5 4
10 PRATTVILLE BAPTIST HOSPITAL AL 14.6 5
# ... with 2,710 more rows
This is a part of a function, so the 'outcome' needs to be a string. Therefore I tried to convert a string to a symbol so that I can reference the column in dplyr.
Can anyone tell me what's happening here?
Are there any good ways to achieve my goal?
You need to unquote the symbol with !!:
arrange(b, State, !!sym(outcome))
Or UQ (the older functional form; recent versions of rlang deprecate it in favor of !!):
arrange(b, State, UQ(sym(outcome)))
Similarly for mutate:
mutate(rank=row_number(!!sym(outcome))) # or mutate(rank=row_number(UQ(sym(outcome))))
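Since the asker needs this inside a function that takes the column name as a string, here is a minimal sketch of how the pieces could fit together. The function names rank_within_state and rank_within_state2 and the toy data are hypothetical; the second variant uses the .data pronoun, which is an alternative to !!sym() and not something the answer above mentions.

```r
library(dplyr)
library(rlang)

# Hypothetical helper: rank hospitals within each state on a column given as a string
rank_within_state <- function(data, outcome) {
  col <- sym(outcome)                     # string -> symbol
  data %>%
    arrange(State, !!col) %>%             # !! unquotes the symbol inside the verb
    group_by(State) %>%
    mutate(rank = row_number(!!col)) %>%
    ungroup()
}

# Same idea using the .data pronoun instead of sym() and !!
rank_within_state2 <- function(data, outcome) {
  data %>%
    arrange(State, .data[[outcome]]) %>%
    group_by(State) %>%
    mutate(rank = row_number(.data[[outcome]])) %>%
    ungroup()
}

# Toy data (made-up values); check.names = FALSE keeps the space in the column name
b <- data.frame(
  Hospital.Name  = c("H1", "H2", "H3", "H4"),
  State          = c("AL", "AK", "AL", "AK"),
  `heart attack` = c(14.3, 15.5, 13.3, 13.4),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

ranked  <- rank_within_state(b, "heart attack")
ranked2 <- rank_within_state2(b, "heart attack")
print(ranked)
```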
If you are only trying to name the column then you will want to use the backtick (`). (It is typically paired with the ~ on the top left of your keyboard just below the ESC key.) Please note that is not the same as the single quotation mark (').
The reason you will often see a variable written like this is that header names containing spaces were imported into a tibble. Any header name that has a space in it is displayed wrapped in backticks. You need to refer to those columns by also wrapping them in backticks, or else R does not recognize that you are referring to an object in memory that it can work with; it will just think you are referring to the string. R will, however, happily store an object with a space in its name if you assign to it using " or '.
See the demonstration below:
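`tidy time` <- 4
'tidy time' <- 5   # also assigns to the variable `tidy time`
"tidy time" <- 6   # and so does this; `tidy time` is now 6
print('tidy time')   # prints the string "tidy time"
print("tidy time")   # prints the string "tidy time"
print(`tidy time`)   # prints the value of the variable: 6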
This is the cause for R's error message.
Hopefully understanding all that will spare you from having to call on the sym function. In any case, if you remove the space in the name the problem will also go away and you can save the backticks for another day.
To learn more about !! and unquoting variables (which psidom was referring to in his answer), and also learn about the related issues that occur in writing functions that rely on referencing objects with non-standard evaluation in dplyr please see here: https://rpubs.com/hadley/dplyr-programming

R: split data based on a factor, add a ranking column and extract

I still haven't figured out how to access the individual elements of split data. Here is my problem:
I have a data set that I want to split based on a column (State). I want to add a ranking column to my data for each subset. This is part of a function I'm writing.
My data set has 3 columns: Hospital, State, Outcome. For each state, I want to add a 'Rank' column that ranks the data based on Outcome; the lowest Outcome will be ranked 1 and the highest Outcome will be ranked last.
How can I use split, sapply/lapply to do this? Is there a better way, like using "arrange"?
My main problem is that when I use either of these methods, I do not know how to access each element of the split or arranged data.
Here's what my data set looks like (the row numbers are not important here):
Hospital State Outcome
1 SOUTHEAST ALABAMA MEDICAL CENTER AL 14.3
2 MARSHALL MEDICAL CENTER SOUTH AL 18.5
3 ELIZA COFFEE MEMORIAL HOSPITAL TX 18.1
7 ST VINCENT'S EAST TX 17.7
8 DEKALB REGIONAL MEDICAL CENTER AL 18.0
9 SHELBY BAPTIST MEDICAL CENTER AL 15.9
The desired outcome would be
Hospital State Outcome Rank
1 SOUTHEAST ALABAMA MEDICAL CENTER AL 14.3 1
2 SHELBY BAPTIST MEDICAL CENTER AL 15.9 2
3 DEKALB REGIONAL MEDICAL CENTER AL 18.0 3
4 MARSHALL MEDICAL CENTER SOUTH AL 18.5 4
5 ST VINCENT'S EAST TX 17.7 1
6 ELIZA COFFEE MEMORIAL HOSPITAL TX 18.1 2
Thanks in advance.
The dplyr package provides a very elegant solution for this type of problem. I'm using the mtcars data as an example:
library(dplyr)
mtcars %>%
group_by(cyl) %>%
mutate(rank = row_number(mpg))
The OP's example is hard to read into R because of all the spaces in the string variable.
Here's a simpler example:
set.seed(1)
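DF <- data.frame(id=rep(c("A","B"),sample(5,2))); DF$v <- runif(nrow(DF))*100
# (output below is from R < 3.6.0; later versions changed the default sample() algorithm, so the draws differ)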
# id v
# 1 A 57.28534
# 2 A 90.82078
# 3 B 20.16819
# 4 B 89.83897
# 5 B 94.46753
# 6 B 66.07978
# 7 B 62.91140
Here's a solution without using any packages:
DF$r <- ave(DF$v,DF$id,FUN=rank)
# id v r
# 1 A 57.28534 1
# 2 A 90.82078 2
# 3 B 20.16819 1
# 4 B 89.83897 4
# 5 B 94.46753 5
# 6 B 66.07978 3
# 7 B 62.91140 2
Finally, to order by ranking within state:
DF[order(DF$id,DF$r),]
# id v r
# 1 A 57.28534 1
# 2 A 90.82078 2
# 3 B 20.16819 1
# 7 B 62.91140 2
# 6 B 66.07978 3
# 4 B 89.83897 4
# 5 B 94.46753 5
If you have ties in the column you're ranking, read the documentation for rank and decide how you want the ties treated. The dplyr and data.table packages (mentioned in the other answers) also have nice functionality for dealing with ties, like the notion of a "dense rank."
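To make the tie-handling options concrete, here is a small base R sketch on made-up values. The variable names are illustrative only, and the "dense rank" line is one possible base R equivalent rather than what dplyr or data.table do internally.

```r
v <- c(10, 20, 20, 30)

avg_rank   <- rank(v)                          # default: tied values share the average rank
min_rank   <- rank(v, ties.method = "min")     # tied values share the lowest rank; the next rank is skipped
first_rank <- rank(v, ties.method = "first")   # ties broken by order of appearance
dense_rank <- match(v, sort(unique(v)))        # dense rank: no gaps after ties

# The same choice can be passed through ave() for per-group ranking
DF2 <- data.frame(id = c("A", "A", "B", "B", "B"), v = c(5, 5, 1, 3, 3))
DF2$r <- ave(DF2$v, DF2$id, FUN = function(x) rank(x, ties.method = "min"))
print(DF2)
```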
You could try this
library(data.table)
setDT(dat)[, myrank := rank(Outcome), by = State]
dat[,.SD[order(myrank)], by=State]
# State Hospital Outcome myrank
#1: AL SOUTHEAST ALABAMA MEDICAL CENTER 14.3 1
#2: AL SHELBY BAPTIST MEDICAL CENTER 15.9 2
#3: AL DEKALB REGIONAL MEDICAL CENTER 18.0 3
#4: AL MARSHALL MEDICAL CENTER SOUTH 18.5 4
#5: TX ST VINCENT EAST 17.7 1
#6: TX ELIZA COFFEE MEMORIAL HOSPITAL 18.1 2
Or using ddply
library(plyr)
ddply(dat, .(State), function(x){x$myrank = rank(x$Outcome); x[order(x$myrank),]})
# Hospital State Outcome myrank
#1 SOUTHEAST ALABAMA MEDICAL CENTER AL 14.3 1
#2 SHELBY BAPTIST MEDICAL CENTER AL 15.9 2
#3 DEKALB REGIONAL MEDICAL CENTER AL 18.0 3
#4 MARSHALL MEDICAL CENTER SOUTH AL 18.5 4
#5 ST VINCENT EAST TX 17.7 1
#6 ELIZA COFFEE MEMORIAL HOSPITAL TX 18.1 2
You can use by:
do.call(
rbind,
by(d, list(State = d$State), function(x) { x$Rank <- rank(x$Outcome); x[order(x$Rank), ] }))
where d is your raw data.
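The question also asks how to access the pieces produced by split. A minimal base R sketch, using the six rows from the question (the object names pieces, ranked and result are just illustrative): split() returns a named list of data frames, so each piece can be reached by name with [[ ]] or $, and lapply() applies a function to every piece.

```r
d <- data.frame(
  Hospital = c("SOUTHEAST ALABAMA MEDICAL CENTER", "MARSHALL MEDICAL CENTER SOUTH",
               "ELIZA COFFEE MEMORIAL HOSPITAL", "ST VINCENT'S EAST",
               "DEKALB REGIONAL MEDICAL CENTER", "SHELBY BAPTIST MEDICAL CENTER"),
  State    = c("AL", "AL", "TX", "TX", "AL", "AL"),
  Outcome  = c(14.3, 18.5, 18.1, 17.7, 18.0, 15.9),
  stringsAsFactors = FALSE
)

pieces <- split(d, d$State)   # named list: one data frame per state
al <- pieces[["AL"]]          # access a single piece by name (pieces$AL also works)

# Add a Rank column to each piece and sort the piece by it
ranked <- lapply(pieces, function(x) {
  x$Rank <- rank(x$Outcome)
  x[order(x$Rank), ]
})

# Stack the pieces back into one data frame
result <- do.call(rbind, ranked)
rownames(result) <- NULL
print(result)
```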

Empty rows in list as NA values in data.frame in R

I have a dataframe as follows:
hospital <- c("PROVIDENCE ALASKA MEDICAL CENTER", "ALASKA REGIONAL HOSPITAL", "FAIRBANKS MEMORIAL HOSPITAL",
"CRESTWOOD MEDICAL CENTER", "BAPTIST MEDICAL CENTER EAST", "ARKANSAS HEART HOSPITAL",
"MEDICAL CENTER NORTH LITTLE ROCK", "CRITTENDEN MEMORIAL HOSPITAL")
state <- c("AK", "AK", "AK", "AL", "AL", "AR", "AR", "AR")
rank <- c(1,2,3,1,2,1,2,3)
df <- data.frame(hospital, state, rank)
df
hospital state rank
1 PROVIDENCE ALASKA MEDICAL CENTER AK 1
2 ALASKA REGIONAL HOSPITAL AK 2
3 FAIRBANKS MEMORIAL HOSPITAL AK 3
4 CRESTWOOD MEDICAL CENTER AL 1
5 BAPTIST MEDICAL CENTER EAST AL 2
6 ARKANSAS HEART HOSPITAL AR 1
7 MEDICAL CENTER NORTH LITTLE ROCK AR 2
8 CRITTENDEN MEMORIAL HOSPITAL AR 3
I would like to create a function, rankall, that takes rank as an argument and returns the hospitals of that rank for each state, with NAs returned if the state does not have a hospital that matches the given rank. For example, I want output of rankall(rank=3) to look like this:
hospital state
AK FAIRBANKS MEMORIAL HOSPITAL AK
AL <NA> AL
AR CRITTENDEN MEMORIAL HOSPITAL AR
I've tried:
rankall <- function(rank) {
split_by_state <- split(df, df$state)
ranked_hospitals <- lapply(split_by_state, function (x) {
x[(x$rank==rank), ]
})
combined_ranked_hospitals <- do.call(rbind, ranked_hospitals)
return(combined_ranked_hospitals[ ,1:2])
}
But rankall(rank=3) returns:
hospital state
AK FAIRBANKS MEMORIAL HOSPITAL AK
AR CRITTENDEN MEMORIAL HOSPITAL AR
This leaves out the NA values that I need to keep track of. Is there a way for R to recognize the empty rows in my list object within my function as NAs, rather than as empty rows? Is there another function besides lapply that would be more useful for this task?
[ Note: This dataframe is from the Coursera R Programming course. This is also my first post on Stackoverflow, and my first time learning programming. Thank you to all who offered solutions and advice, this forum is fantastic. ]
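You just need an if/else in your function:
rankall <- function(rank) {
  split_by_state <- split(df, df$state)
  ranked_hospitals <- lapply(split_by_state, function(x) {
    indx <- x$rank == rank
    if (any(indx)) {
      return(x[indx, ])
    } else {
      # No hospital with this rank: keep the state, set the hospital to NA
      out <- x[1, ]
      out$hospital <- NA
      return(out)
    }
  })
  combined_ranked_hospitals <- do.call(rbind, ranked_hospitals)
  return(combined_ranked_hospitals[, 1:2])
}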
Here's an alternative approach:
rankall <- function(rank) {
do.call(rbind, lapply(split(df, df$state), function(df) {
tmp <- df[df$rank == rank, 1:2]
if (!nrow(tmp)) return(transform(df[1, 1:2], hospital = NA)) else return(tmp)
}))
}
rankall(3)
# hospital state
# AK FAIRBANKS MEMORIAL HOSPITAL AK
# AL <NA> AL
# AR CRITTENDEN MEMORIAL HOSPITAL AR
Here is another dplyr approach.
fun1 <- function(x) {
group_by(df, state) %>%
summarise(hospital = hospital[x],
rank = nth(rank, x))
}
# fun1(3)
#Source: local data frame [3 x 3]
#
# state hospital rank
#1 AK FAIRBANKS MEMORIAL HOSPITAL 3
#2 AL NA NA
#3 AR CRITTENDEN MEMORIAL HOSPITAL 3
I think this is a good use of dplyr. Only thing that's weird is summarize complains when I use NA instead of "NA". Anyone have thoughts on why?
library(dplyr)
rankall <- function(chosen_rank){
group_by(df, state) %>%
summarize(hospital = ifelse(length(hospital[rank==chosen_rank])!=0,
as.character(hospital[rank==chosen_rank]), "NA"),
rank = chosen_rank)
}
rankall(1)
rankall(2)
rankall(3)
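On the NA versus "NA" question above, a likely explanation (an assumption, since the exact message is not shown) is that a bare NA is logical, so groups without a match produce a logical value while the other groups produce a character value, and summarise() refuses to combine the two types. The typed missing value NA_character_ avoids this. A minimal sketch, using the data frame from the question with stringsAsFactors = FALSE:

```r
library(dplyr)

df <- data.frame(
  hospital = c("PROVIDENCE ALASKA MEDICAL CENTER", "ALASKA REGIONAL HOSPITAL",
               "FAIRBANKS MEMORIAL HOSPITAL", "CRESTWOOD MEDICAL CENTER",
               "BAPTIST MEDICAL CENTER EAST", "ARKANSAS HEART HOSPITAL",
               "MEDICAL CENTER NORTH LITTLE ROCK", "CRITTENDEN MEMORIAL HOSPITAL"),
  state = c("AK", "AK", "AK", "AL", "AL", "AR", "AR", "AR"),
  rank  = c(1, 2, 3, 1, 2, 1, 2, 3),
  stringsAsFactors = FALSE
)

rankall <- function(chosen_rank) {
  df %>%
    group_by(state) %>%
    summarise(
      # a real (character) missing value instead of the string "NA"
      hospital = if (any(rank == chosen_rank)) hospital[rank == chosen_rank][1] else NA_character_
    )
}

res <- rankall(3)
print(res)
```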

subsetting by a variable name of a column r

row.names Hospital State Heart Attack Heart Failure
1 2275 PROVIDENCE MEMORIAL HOSPITAL TX 16.1 9.1
2 2276 MEMORIAL HERMANN BAPTIST ORANGE HOSPITAL TX 16.3 14.3
4 2278 UNITED REGIONAL HEALTH CARE SYSTEM TX 17.4 15.1
5 2279 ST JOSEPH REGIONAL HEALTH CENTER TX 15.7 15.6
6 2280 PARKLAND HEALTH AND HOSPITAL SYSTEM TX 12.9 11.2
7 2281 UNIVERSITY OF TEXAS MEDICAL BRANCH GAL TX 17.4 11.8
Hello R peeps, I need to get the row.name of the row where a given column is at its minimum. The column name is an input variable (Heart Attack or Heart Failure). In the example above, if I input "Heart Failure" it needs to return
[1] 2275
which is the row name in the first row. So far I have this:
inds<-subset(wfperstate, wfperstate[[outname]]==min)
where wfperstate is my data frame and outname is my input. Please help!
To transform my last comment into a function :
get_min_rowname <-
function(dat,col)
dat[which.min(dat[[col]]),"row.names"]
Then you apply it :
get_min_rowname(wfperstate, "Heart Attack")
get_min_rowname(wfperstate, "Heart Failure")
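A self-contained sketch of the function in use, with the numbers from the question. It assumes, as the answer does, that the identifiers live in an ordinary column literally named "row.names"; the hospital names are omitted for brevity, and the second helper (get_min_rowname2, a hypothetical name) shows the variant for the case where the identifiers are the data frame's actual row names.

```r
wfperstate <- data.frame(
  id              = c(2275, 2276, 2278, 2279, 2280, 2281),
  State           = "TX",
  `Heart Attack`  = c(16.1, 16.3, 17.4, 15.7, 12.9, 17.4),
  `Heart Failure` = c(9.1, 14.3, 15.1, 15.6, 11.2, 11.8),
  check.names = FALSE
)
# data.frame() treats a row.names argument specially, so rename the column afterwards
names(wfperstate)[1] <- "row.names"

get_min_rowname <- function(dat, col) {
  dat[which.min(dat[[col]]), "row.names"]
}

# Variant for identifiers stored as real row names rather than a column
get_min_rowname2 <- function(dat, col) {
  rownames(dat)[which.min(dat[[col]])]
}

min_attack  <- get_min_rowname(wfperstate, "Heart Attack")
min_failure <- get_min_rowname(wfperstate, "Heart Failure")
print(c(min_attack, min_failure))
```

which.min() needs a numeric column; if the outcome was read as character (for example because of "Not Available" entries), convert it with as.numeric() first.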

Application of 'apply' functions in R

I have got the following list which was generated by using split function with state as index.
$AK
Hospital_Name State Mortality_Rate
99 PROVIDENCE ALASKA MEDICAL CENTER AK 13.4
100 MAT-SU REGIONAL MEDICAL CENTER AK 17.7
102 FAIRBANKS MEMORIAL HOSPITAL AK 15.5
$AL
Hospital_Name State Mortality_Rate
1 SOUTHEAST ALABAMA MEDICAL CENTER AL 14.3
2 MARSHALL MEDICAL CENTER SOUTH AL 18.5
3 ELIZA COFFEE MEMORIAL HOSPITAL AL 18.1
$AR
Hospital_Name State Mortality_Rate
193 SILOAM SPRINGS MEMORIAL HOSPITAL AR 15.6
194 JOHNSON REGIONAL MEDICAL CENTER AR 16.9
195 WASHINGTON REGIONAL MED CTR AT NORTH HILLS AR 15.2
I want to select the Hospital Name in the 2nd row for each of these states. Can somebody help with an apply function here? I was trying to use sapply in the following way (not working):
x <- sapply(test.case3,function(i) test.case3[[i]][2,1])
so that, by varying 'i', I can get result as follows -
> test.case3[[1]][2,1]
[1] "MAT-SU REGIONAL MEDICAL CENTER"
> test.case3[[2]][2,1]
[1] "MARSHALL MEDICAL CENTER SOUTH"
> test.case3[[3]][2,1]
[1] "JOHNSON REGIONAL MEDICAL CENTER"
Kindly advise.
I am required to produce the final report in the following format (ignore the data in the example below):
Hospital_Name State
D W MCMILLAN MEMORIAL HOSPITAL AL
ARKANSAS METHODIST MEDICAL CENTER AR
JOHN C LINCOLN DEER VALLEY HOSPITAL AZ
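One possible approach, sketched under the assumption that test.case3 is the list produced by split() as shown above: sapply() passes each list element (a data frame) to the function, not its index, so the function can index that data frame directly. The object names second and report are illustrative.

```r
hospitals <- data.frame(
  Hospital_Name = c("PROVIDENCE ALASKA MEDICAL CENTER", "MAT-SU REGIONAL MEDICAL CENTER",
                    "FAIRBANKS MEMORIAL HOSPITAL", "SOUTHEAST ALABAMA MEDICAL CENTER",
                    "MARSHALL MEDICAL CENTER SOUTH", "ELIZA COFFEE MEMORIAL HOSPITAL",
                    "SILOAM SPRINGS MEMORIAL HOSPITAL", "JOHNSON REGIONAL MEDICAL CENTER",
                    "WASHINGTON REGIONAL MED CTR AT NORTH HILLS"),
  State = rep(c("AK", "AL", "AR"), each = 3),
  Mortality_Rate = c(13.4, 17.7, 15.5, 14.3, 18.5, 18.1, 15.6, 16.9, 15.2),
  stringsAsFactors = FALSE
)
test.case3 <- split(hospitals, hospitals$State)

# x is the data frame for one state; take row 2 of the Hospital_Name column
second <- sapply(test.case3, function(x) x[2, "Hospital_Name"])

# Assemble the report in the requested two-column layout
report <- data.frame(Hospital_Name = unname(second),
                     State = names(second),
                     stringsAsFactors = FALSE)
print(report)
```

To loop over positions instead, sapply(seq_along(test.case3), function(i) test.case3[[i]][2, 1]) gives the same hospital names. A state with fewer than two rows yields NA.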
