Empty rows in list as NA values in data.frame in R - r

I have a dataframe as follows:
hospital <- c("PROVIDENCE ALASKA MEDICAL CENTER", "ALASKA REGIONAL HOSPITAL", "FAIRBANKS MEMORIAL HOSPITAL",
"CRESTWOOD MEDICAL CENTER", "BAPTIST MEDICAL CENTER EAST", "ARKANSAS HEART HOSPITAL",
"MEDICAL CENTER NORTH LITTLE ROCK", "CRITTENDEN MEMORIAL HOSPITAL")
state <- c("AK", "AK", "AK", "AL", "AL", "AR", "AR", "AR")
rank <- c(1,2,3,1,2,1,2,3)
df <- data.frame(hospital, state, rank)
df
hospital state rank
1 PROVIDENCE ALASKA MEDICAL CENTER AK 1
2 ALASKA REGIONAL HOSPITAL AK 2
3 FAIRBANKS MEMORIAL HOSPITAL AK 3
4 CRESTWOOD MEDICAL CENTER AL 1
5 BAPTIST MEDICAL CENTER EAST AL 2
6 ARKANSAS HEART HOSPITAL AR 1
7 MEDICAL CENTER NORTH LITTLE ROCK AR 2
8 CRITTENDEN MEMORIAL HOSPITAL AR 3
I would like to create a function, rankall, that takes rank as an argument and returns the hospitals of that rank for each state, with NAs returned if the state does not have a hospital that matches the given rank. For example, I want output of rankall(rank=3) to look like this:
hospital state
AK FAIRBANKS MEMORIAL HOSPITAL AK
AL <NA> AL
AR CRITTENDEN MEMORIAL HOSPITAL AR
I've tried:
rankall <- function(rank) {
split_by_state <- split(df, df$state)
ranked_hospitals <- lapply(split_by_state, function (x) {
x[(x$rank==rank), ]
})
combined_ranked_hospitals <- do.call(rbind, ranked_hospitals)
return(combined_ranked_hospitals[ ,1:2])
}
But rankall(rank=3) returns:
hospital state
AK FAIRBANKS MEMORIAL HOSPITAL AK
AR CRITTENDEN MEMORIAL HOSPITAL AR
This leaves out the NA values that I need to keep track of. Is there a way for R to recognize the empty rows in my list object within my function as NAs, rather than as empty rows? Is there another function besides lapply that would be more useful for this task?
[ Note: This dataframe is from the Coursera R Programming course. This is also my first post on Stackoverflow, and my first time learning programming. Thank you to all who offered solutions and advice, this forum is fantastic. ]

You just need an in/else in your function:
rankall <- function(rank) {
split_by_state <- split(df, df$state)
ranked_hospitals <- lapply(split_by_state, function (x) {
indx <- x$rank==rank
if(any(indx)){
return(x[indx, ])
else{
out = x[1, ]
out$hospital = NA
return(out)
}
}
}

Here's an alternative approach:
rankall <- function(rank) {
do.call(rbind, lapply(split(df, df$state), function(df) {
tmp <- df[df$rank == rank, 1:2]
if (!nrow(tmp)) return(transform(df[1, 1:2], hospital = NA)) else return(tmp)
}))
}
rankall(3)
# hospital state
# AK FAIRBANKS MEMORIAL HOSPITAL AK
# AL <NA> AL
# AR CRITTENDEN MEMORIAL HOSPITAL AR

Here is another dplyr approach.
fun1 <- function(x) {
group_by(df, state) %>%
summarise(hospital = hospital[x],
rank = nth(rank, x))
}
# fun1(3)
#Source: local data frame [3 x 3]
#
# state hospital rank
#1 AK FAIRBANKS MEMORIAL HOSPITAL 3
#2 AL NA NA
#3 AR CRITTENDEN MEMORIAL HOSPITAL 3

I think this is a good use of dplyr. Only thing that's weird is summarize complains when I use NA instead of "NA". Anyone have thoughts on why?
library(dplyr)
rankall <- function(chosen_rank){
group_by(df, state) %>%
summarize(hospital = ifelse(length(hospital[rank==chosen_rank])!=0,
as.character(hospital[rank==chosen_rank]), "NA"),
rank = chosen_rank)
}
rankall(1)
rankall(2)
rankall(3)

Related

Finding rows that have the minimum of a specific factor group

I'm am attempting to find the minimum incomes from the state.x77 dataset based on the state.region variable.
df1 <- data.frame(state.region,state.x77,row.names = state.name)
tapply(state.x77,state.region,min)
I am trying to get it to output which state has the lowest income for X region eg for south Alabama would be the lowest income. Im trying to use tapply but I keep getting an error saying
Error in tapply(state.x77, state.region, min) :
arguments must have same length
What is the issue?
Here is a solution. First get the vector of incomes and make of it a named vector. Then use tapply to get the names of the minima incomes.
state <- setNames(state.x77[, "Income"], rownames(state.x77))
tapply(state, state.region, function(x) names(x)[which.min(x)])
# Northeast South North Central West
# "Maine" "Mississippi" "South Dakota" "New Mexico"
The following, more complicated, code will output state names, regions and incomes.
df1 <- data.frame(
State = rownames(state.x77),
Income = state.x77[, "Income"],
Region = state.region
)
merge(aggregate(Income ~ Region, df1, min), df1)[c(3, 1, 2)]
# State Region Income
#1 South Dakota North Central 4167
#2 Maine Northeast 3694
#3 Mississippi South 3098
#4 New Mexico West 3601
And another solution with aggregate but avoiding merge.
agg <- aggregate(Income ~ Region, df1, min)
i <- match(agg$Income, df1$Income)
data.frame(
State = df1$State[i],
Region = df1$Region[i],
Income = df1$Income[i]
)
# State Region Income
#1 Maine Northeast 3694
#2 Mississippi South 3098
#3 South Dakota North Central 4167
#4 New Mexico West 3601
You can also use this solution:
library(dplyr)
library(tibble)
state2 %>%
rownames_to_column() %>%
bind_cols(state.region) %>%
rename(State = rowname,
Region = ...10) %>%
group_by(Region, State) %>%
summarise(Income = sum(Income)) %>% arrange(desc(Income)) %>%
slice_tail(n = 1)
# A tibble: 4 x 3
# Groups: Region [4]
Region State Income
<fct> <chr> <dbl>
1 Northeast Maine 3694
2 South Mississippi 3098
3 North Central South Dakota 4167
4 West New Mexico 3601

Need to ID states from mixed names /IDs in location data

Need to ID states from mixed location data
Need to search for 50 states abbreviations & 50 states full names, and return state abbreviation
N <- 1:10
Loc <- c("Los Angeles, CA", "Manhattan, NY", "Florida, USA", "Chicago, IL" , "Houston, TX",
+ "Texas, USA", "Corona, CA", "Georgia, USA", "WV NY NJ", "qwerty uy PO DOPL JKF" )
df <- data.frame(N, Loc)
> # Objective create variable state such
> # state contains abbreviated names of states from Loc:
> # for "Los Angeles, CA", state = CA
> # for "Florida, USA", sate = FL
> # for "WV NY NJ", state = NA
> # for "qwerty NJuy PO DOPL JKF", sate = NA (inspite of containing the srting NJ, it is not wrapped in spaces)
>
# End result should be Newdf
State <- c("CA", "NY", "FL", "IL", "TX","TX", "CA", "GA", NA, NA)
Newdf <- data.frame(N, Loc, State)
> Newdf
N Loc State
1 1 Los Angeles, CA CA
2 2 Manhattan, NY NY
3 3 Florida, USA FL
4 4 Chicago, IL IL
5 5 Houston, TX TX
6 6 Texas, USA TX
7 7 Corona, CA CA
8 8 Georgia, USA GA
9 9 WV NY NJ <NA>
10 10 qwerty uy PO DOPL JKF <NA>
Is there a package? or can a loop be written? Even if the schema could be demonstrated with a few states, that would be sufficient - I will post the full solution when I get to it. Btw, this is for a Twitter dataset downloaded using rtweet package, and the variable is: place_full_name
There are default constants in R, state.abb and state.name which can be used.
vars <- stringr::str_extract(df$Loc, paste0('\\b',c(state.abb, state.name),
'\\b', collapse = '|'))
#[1] "CA" "NY" "Florida" "IL" "TX" "Texas" "CA" "Georgia" "WV" NA
If you want everything as abbreviations, we can go further and do :
inds <- vars %in% state.name
vars[inds] <- state.abb[match(vars[inds], state.name)]
vars
#[1] "CA" "NY" "FL" "IL" "TX" "TX" "CA" "GA" "WV" NA
However, we can see that in 9th row you expect output as NA but here it returns "WV" because it is a state name. In such cases, you need to prepare rules which are strict enough so that it only extracts state names and nothing else.
Utilising the built-in R constants, state.abb and state.name, we can try to extract these from the Loc with regular expressions.
state.abbs <- sub('.+, ([A-Z]{2})', '\\1', df$Loc)
state.names <- sub('^(.+),.+', '\\1', df$Loc)
Now if the state abbreviations are not in any of the built-in ones, then we can use match to find the positions of our state.names that are in any of the items in the built-in state.name vector, and use that to index state.abb, else keep what we already have. Those that don't match either return NA.
df$state.abb <- ifelse(!state.abbs %in% state.abb,
state.abb[match(state.names, state.name)], state.abbs)
df
N Loc state.abb
1 1 Los Angeles, CA CA
2 2 Manhattan, NY NY
3 3 Florida, USA FL
4 4 Chicago, IL IL
5 5 Houston, TX TX
6 6 Texas, USA TX
7 7 Corona, CA CA
8 8 Georgia, USA GA
9 9 WV NY NJ <NA>
10 10 qwerty uy PO DOPL JKF <NA>

How can I filter (dplyr) on the same dataset twice in a 'for' loop? R

I have a dataset that looks like this:
Hospital.Name State heart attack
1 SOUTHEAST ALABAMA MEDICAL CENTER AL 14.3
2 MARSHALL MEDICAL CENTER SOUTH AL 18.5
3 ELIZA COFFEE MEMORIAL HOSPITAL AL 18.1
4 MIZELL MEMORIAL HOSPITAL AL Not Available
5 CRENSHAW COMMUNITY HOSPITAL AL Not Available
6 MARSHALL MEDICAL CENTER NORTH AL Not Available
7 ST VINCENT'S EAST AL 17.7
8 DEKALB REGIONAL MEDICAL CENTER AL 18.0
9 SHELBY BAPTIST MEDICAL CENTER AL 15.9
10 CALLAHAN EYE FOUNDATION HOSPITAL AL Not Available
11 HELEN KELLER MEMORIAL HOSPITAL AL 19.6
12 DALE MEDICAL CENTER AL 17.3
13 CHEROKEE MEDICAL CENTER AL Not Available
14 BAPTIST MEDICAL CENTER SOUTH AL 17.8
15 JACKSON HOSPITAL & CLINIC INC AL 17.5
16 GEORGE H. LANIER MEMORIAL HOSPITAL AL 15.4
17 ELBA GENERAL HOSPITAL AL Not Available
18 EAST ALABAMA MEDICAL CENTER AND SNF AL 16.3
19 WEDOWEE HOSPITAL AL Not Available
20 UNIVERSITY OF ALABAMA HOSPITAL AL 15.0
The goal is to retrieve the hospital name, for a given rank of hospital on 'heart attack' for every state. For example, here I am trying to retrieve the hospital name for the lowest score (rank=1) in the heart attack column, for every state in a data frame.
This is my attempt:
stateVec <- unique(df$State)
outcome <- 'heart attack'
name <- c()
st <- c()
stateVec <- c()
rank <- 1
for (i in 1:length(stateVec)) {
k <- stateVec[i]
df1 <- dplyr::filter(df, State==k)
rankVec <- unique(df[[outcome]])
rankVec <- sort(rankVec[rankVec != 'Not Available'])
key <- rankVec[rank]
df1 <- dplyr::filter(df1, get(outcome, envir = as.environment(df))==key)
df1 <- df1[order(df$Hospital.Name), , drop = F]
d <- df1[1,]
name <- d$Hospital.Name
st <- k
return(data.frame(st, name))
}
I receive the following error:
Error in filter_impl(.data, quo) : Result must have length 98, not 4706
I've tried recreating the problem with the mtcars dataset, and don't get the same error. Any help would be appreciated :)
I think this is what you are looking for.
desired_rank <- 1
df %>%
filter(!is.na(heart.attack)) %>%
group_by(State) %>%
arrange(heart.attack) %>%
slice(desired_rank) %>%
ungroup()
It remove's NA values for heart.attack;
Then groups by State;
Then sorts ascending on heart.attack;
Then returns the first hospital (so the hospital with lowest heart.attack value).
The output is a data.frame.

Insert NA values in a data frame R

I want an empty data frame and later add row values to it. The way I create a data frame is the following:
result_df <- data.frame("Hospital" = character(), "State" = character(), stringsAsFactors = FALSE)
Then I add the first row:
result_df <- rbind(result_df, list("D W MCMILLAN MEMORIAL HOSPITAL", "AL"))
Just as extra information I show you the result of the following command:
str(result_df)
'data.frame': 1 obs. of 2 variables:
$ X.D.W.MCMILLAN.MEMORIAL.HOSPITAL.: Factor w/ 1 level "D W MCMILLAN MEMORIAL HOSPITAL": 1
$ X.AL. : Factor w/ 1 level "AL": 1
Then I add the next row to the data frame
result_df <- rbind(result_df, list("ARKANSAS METHODIST MEDICAL CENTER", "TX"))
and this is what I get:
Warning messages:
1: In `[<-.factor`(`*tmp*`, ri, value = "ARKANSAS METHODIST MEDICAL CENTER") :
invalid factor level, NA generated
2: In `[<-.factor`(`*tmp*`, ri, value = "TX") :
invalid factor level, NA generated
When I type result_df to see the content of the data frame this is the result:
X.D.W.MCMILLAN.MEMORIAL.HOSPITAL. X.AL.
1 D W MCMILLAN MEMORIAL HOSPITAL AL
2 <NA> <NA>
I guess this could be solved using stringAsFactors = FALSE, does any one have an idea about this problem?
The rbind function needs to have the same column names. If you created the data frame with the same column names, you can combine these data frames without NA.
result_df <- rbind(result_df, data.frame(Hospital = "D W MCMILLAN MEMORIAL HOSPITAL",
state = "AL",
stringsAsFactors = FALSE))
result_df <- rbind(result_df, data.frame(Hospital = "ARKANSAS METHODIST MEDICAL CENTER",
state = "TX",
stringsAsFactors = FALSE))
Here is the final output.
print(result_df)
Hospital state
1 D W MCMILLAN MEMORIAL HOSPITAL AL
2 ARKANSAS METHODIST MEDICAL CENTER TX
We can use rbindlist from data.table
library(data.table)
rbindlist(list(result_df, list("D W MCMILLAN MEMORIAL HOSPITAL", "AL")))
# Hospital State
#1: D W MCMILLAN MEMORIAL HOSPITAL AL

Application of 'apply' functions in R

I have got the following list which was generated by using split function with state as index.
$AK
Hospital_Name State Mortality_Rate
99 PROVIDENCE ALASKA MEDICAL CENTER AK 13.4
100 MAT-SU REGIONAL MEDICAL CENTER AK 17.7
102 FAIRBANKS MEMORIAL HOSPITAL AK 15.5
$AL
Hospital_Name State Mortality_Rate
1 SOUTHEAST ALABAMA MEDICAL CENTER AL 14.3
2 MARSHALL MEDICAL CENTER SOUTH AL 18.5
3 ELIZA COFFEE MEMORIAL HOSPITAL AL 18.1
$AR
Hospital_Name State Mortality_Rate
193 SILOAM SPRINGS MEMORIAL HOSPITAL AR 15.6
194 JOHNSON REGIONAL MEDICAL CENTER AR 16.9
195 WASHINGTON REGIONAL MED CTR AT NORTH HILLS AR 15.2
I want to select Hospital Name of 2nd row from each of these states. Can somebody help with an apply function here? I was trying to use sapply the following way(not working) -
x <- sapply(test.case3,function(i) test.case3[[i]][2,1])
so that, by varying 'i', I can get result as follows -
> test.case3[[1]][2,1]
[1] "MAT-SU REGIONAL MEDICAL CENTER"
> test.case3[[2]][2,1]
[1] "MARSHALL MEDICAL CENTER SOUTH"
> test.case3[[3]][2,1]
[1] "JOHNSON REGIONAL MEDICAL CENTER"
Kindly advise.
I required to produce the final report in the following format.(ignore the data, in below example)
Hospital_Name State
D W MCMILLAN MEMORIAL HOSPITAL AL
ARKANSAS METHODIST MEDICAL CENTER AR
JOHN C LINCOLN DEER VALLEY HOSPITAL AZ

Resources