R data frame unsplit: invalid factor level, NAs generated

I have an array that contains a data frame at each entry (each data frame has the columns 'hospital' and 'state'):
> groups
$AK
hospital state
AK NA AK
$AL
hospital state
AL D W MCMILLAN MEMORIAL HOSPITAL AL
$AR
hospital state
AR ARKANSAS METHODIST MEDICAL CENTER AR
When I call unsplit, I don't get a correctly recombined data frame:
> unsplit(groups, dimnames(groups) )
hospital state
AK <NA> AK
AL D W MCMILLAN MEMORIAL HOSPITAL <NA>
AR ARKANSAS METHODIST MEDICAL CENTER <NA>
Warning messages:
1: In `[<-.factor`(`*tmp*`, iseq, value = "AL") :
: invalid factor level, NAs generated
2: In `[<-.factor`(`*tmp*`, iseq, value = "AR") :
: invalid factor level, NAs generated
This has been driving me nuts for hours. How can I recombine my data frames correctly?
Thanks
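A minimal sketch of one way to recombine such a list, assuming groups is a plain list of single-row data frames whose columns are factors (the groups object below is a hypothetical reconstruction, not the asker's real data). rbind on data frames expands factor levels as needed, which assigning into an existing factor does not:

```r
# Hypothetical reconstruction of `groups`: one single-row data frame per state,
# with factor columns (as produced by stringsAsFactors = TRUE).
groups <- list(
  AK = data.frame(hospital = NA_character_, state = "AK",
                  stringsAsFactors = TRUE),
  AL = data.frame(hospital = "D W MCMILLAN MEMORIAL HOSPITAL", state = "AL",
                  stringsAsFactors = TRUE),
  AR = data.frame(hospital = "ARKANSAS METHODIST MEDICAL CENTER", state = "AR",
                  stringsAsFactors = TRUE)
)

# Stack all data frames; rbind merges the factor levels of each column.
combined <- do.call(rbind, groups)
print(combined)
```

Whether this matches the real structure of groups is an assumption; if groups came from by() or tapply(), it may need to be converted to a plain list first.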

Related

How to Merge Uneven Data Frames With Real Data

Problem:
I have two data sets of different sizes that I would like to merge, without dropping rows or inserting NAs. To compare this to an Excel situation: you would have five columns, and you would drag down three of them to fill the blank space left by the rows inserted when adding your data to the 4th and 5th columns.
Example Data Set
zipcode = a, step3 = b in my later brainstorming code to solve my problem
> head(zipcode_joincsv)
zip city abv latitude longitude median mean pop
226 01749 Hudson AL 42.38981 -71.55791 76500 85689 18081
227 01752 Marlborough AL 42.35091 -71.54753 71835 89002 36273
228 01754 Maynard AL 42.43078 -71.45594 76228 82167 10414
229 01756 Mendon AL 42.09201 -71.54474 102625 117692 5257
230 01757 Milford AL 42.14918 -71.52149 68565 82206 26877
231 01760 Natick AL 42.29076 -71.35368 90673 113933 31763
> head(step3_df)
tolower.state.name. state.abb
1 alabama AL
2 alaska AK
3 arizona AZ
4 arkansas AR
5 california CA
6 colorado CO
Desired Result:
One DF where each zipcode/city combination is combined with its state's population and
income. The column the two data sets have in common is the abbreviation column.
tolower.state.name. zip city abv latitude longitude median mean pop
1 alabama 01749 Hudson AL 42.38981 -71.55791 76500 85689 18081
2 alabama 01752 Marlborough AL 42.35091 -71.54753 71835 89002 36273
3 alabama 01754 Maynard AL 42.43078 -71.45594 76228 82167 10414
4 alabama 01756 Mendon AL 42.09201 -71.54474 102625 117692 5257
5 alabama 01757 Milford AL 42.14918 -71.52149 68565 82206 26877
6 alabama 01760 Natick AL 42.29076 -71.35368 90673 113933 31763
7 alaska data from these rows
8 arizona data from these rows
9 arkansas data from these rows
10 california data from these rows
11 colorado data from these rows
I've contemplated using something like
sqldf ("SELECT a.Zip, a.City, a.State Abv, a.Lat, a.Long, a.median, a.mean, a.pop, b.state.name, b.states.abb, b.pop, b.income
FROM a a
LEFT JOIN b b using (abv)")
I know that is probably not going to work as written. Even if it did, every row of A without a match would get an NA, whereas what I would like is for the state's average income and total population to be copied down the line for every row with an abv of NY, then for every AR, every AL, and so on, until the two data sets are one and a ggplot using all of the data can be created.
dplyr::left_join(a, b, by = c("abv" = "state.abb")) should work, since the abbreviation column is named abv in a and state.abb in b.
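A minimal base R sketch of the same left join, assuming the key column is called abv in a and state.abb in b (as in the head() output in the question). The third row of a is hypothetical, added only to show a second state; merge() stands in for dplyr::left_join here so the example needs no packages:

```r
# Toy versions of the two tables; the "Sometown" row is made up for illustration.
a <- data.frame(
  zip  = c("01749", "01752", "00000"),
  city = c("Hudson", "Marlborough", "Sometown"),
  abv  = c("AL", "AL", "AK"),
  stringsAsFactors = FALSE
)
b <- data.frame(
  tolower.state.name. = c("alabama", "alaska", "arizona"),
  state.abb           = c("AL", "AK", "AZ"),
  stringsAsFactors = FALSE
)

# Left join: keep every row of `a`; the matching state row is repeated
# for each zip code that shares the abbreviation.
joined <- merge(a, b, by.x = "abv", by.y = "state.abb", all.x = TRUE)
print(joined)
```

Each state-level value is copied onto every matching zip code row, which is the "drag down" behaviour described in the question. Note that merge() sorts the result by the key column.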

How can I filter (dplyr) on the same dataset twice in a 'for' loop? R

I have a dataset that looks like this:
Hospital.Name State heart attack
1 SOUTHEAST ALABAMA MEDICAL CENTER AL 14.3
2 MARSHALL MEDICAL CENTER SOUTH AL 18.5
3 ELIZA COFFEE MEMORIAL HOSPITAL AL 18.1
4 MIZELL MEMORIAL HOSPITAL AL Not Available
5 CRENSHAW COMMUNITY HOSPITAL AL Not Available
6 MARSHALL MEDICAL CENTER NORTH AL Not Available
7 ST VINCENT'S EAST AL 17.7
8 DEKALB REGIONAL MEDICAL CENTER AL 18.0
9 SHELBY BAPTIST MEDICAL CENTER AL 15.9
10 CALLAHAN EYE FOUNDATION HOSPITAL AL Not Available
11 HELEN KELLER MEMORIAL HOSPITAL AL 19.6
12 DALE MEDICAL CENTER AL 17.3
13 CHEROKEE MEDICAL CENTER AL Not Available
14 BAPTIST MEDICAL CENTER SOUTH AL 17.8
15 JACKSON HOSPITAL & CLINIC INC AL 17.5
16 GEORGE H. LANIER MEMORIAL HOSPITAL AL 15.4
17 ELBA GENERAL HOSPITAL AL Not Available
18 EAST ALABAMA MEDICAL CENTER AND SNF AL 16.3
19 WEDOWEE HOSPITAL AL Not Available
20 UNIVERSITY OF ALABAMA HOSPITAL AL 15.0
The goal is to retrieve the hospital name, for a given rank of hospital on 'heart attack' for every state. For example, here I am trying to retrieve the hospital name for the lowest score (rank=1) in the heart attack column, for every state in a data frame.
This is my attempt:
stateVec <- unique(df$State)
outcome <- 'heart attack'
name <- c()
st <- c()
stateVec <- c()
rank <- 1
for (i in 1:length(stateVec)) {
k <- stateVec[i]
df1 <- dplyr::filter(df, State==k)
rankVec <- unique(df[[outcome]])
rankVec <- sort(rankVec[rankVec != 'Not Available'])
key <- rankVec[rank]
df1 <- dplyr::filter(df1, get(outcome, envir = as.environment(df))==key)
df1 <- df1[order(df$Hospital.Name), , drop = F]
d <- df1[1,]
name <- d$Hospital.Name
st <- k
return(data.frame(st, name))
}
I receive the following error:
Error in filter_impl(.data, quo) : Result must have length 98, not 4706
I've tried recreating the problem with the mtcars dataset, and don't get the same error. Any help would be appreciated :)
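I think this is what you are looking for.
library(dplyr)
desired_rank <- 1
df %>%
  # "Not Available" becomes NA here; harmless if the column is already numeric
  mutate(heart.attack = suppressWarnings(as.numeric(as.character(heart.attack)))) %>%
  filter(!is.na(heart.attack)) %>%
  group_by(State) %>%
  arrange(heart.attack, Hospital.Name) %>%
  slice(desired_rank) %>%
  ungroup()
It converts heart.attack to numeric, which turns 'Not Available' into NA;
Then removes the NA values for heart.attack;
Then groups by State;
Then sorts ascending on heart.attack, breaking ties by hospital name;
Then returns the hospital at the desired rank within each state (rank 1 is the hospital with the lowest heart.attack value).
The output is a data frame.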
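For comparison, here is a base R sketch of the same ranking idea, under the assumption that the outcome column is read in as text with 'Not Available' marking missing values. The toy data and the rank_hospital helper are hypothetical, not part of the original answer:

```r
# Toy data (made up); heart.attack is character, as it would be after read.csv
# when some entries are the string "Not Available".
df <- data.frame(
  Hospital.Name = c("B HOSPITAL", "A HOSPITAL", "C HOSPITAL",
                    "D HOSPITAL", "E HOSPITAL"),
  State         = c("AL", "AL", "AL", "AK", "AK"),
  heart.attack  = c("14.3", "14.3", "Not Available", "18.5", "17.0"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: hospital at a given rank per state, ties broken by name.
rank_hospital <- function(df, rank = 1) {
  # Non-numeric strings become NA (the coercion warning is expected).
  df$heart.attack <- suppressWarnings(as.numeric(df$heart.attack))
  df <- df[!is.na(df$heart.attack), ]
  df <- df[order(df$State, df$heart.attack, df$Hospital.Name), ]
  picked <- lapply(split(df, df$State), function(d) {
    # Indexing past the last row yields NA, i.e. "no hospital at this rank".
    data.frame(State = d$State[1], Hospital.Name = d$Hospital.Name[rank],
               stringsAsFactors = FALSE)
  })
  do.call(rbind, picked)
}

best <- rank_hospital(df, rank = 1)
print(best)
```

Splitting by state and indexing each piece avoids the explicit for loop from the question and returns NA for states that have fewer hospitals than the requested rank.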

Having trouble merging/joining two datasets on two variables in R

I realize there have already been many asked and answered questions about merging datasets here, but I've been unable to find one that addresses my issue.
What I'm trying to do is merge two datasets using two variables while keeping all data from each. I've tried merge and all of the join operations from dplyr, as well as cbind, and have not gotten the result I want. Usually one column from one of the datasets gets overwritten with NAs. The other thing that happens, when I use full_join in dplyr or all = TRUE in merge, is that I get double the number of rows.
Here's my data:
Primary_State Primary_County n
<fctr> <fctr> <int>
1 AK 12
2 AK Aleutians West 1
3 AK Anchorage 961
4 AK Bethel 1
5 AK Fairbanks North Star 124
6 AK Haines 1
Primary_County Primary_State Population
1 Autauga AL 55416
2 Baldwin AL 208563
3 Barbour AL 25965
4 Bibb AL 22643
5 Blount AL 57704
6 Bullock AL 10362
So I want to merge or join based on Primary_State and Primary_County, which is necessary because there are a lot of duplicate county names in the U.S. and retain the data from both n and Population. From there I can then divide the Population by n and get a per capita figure for each county. I just can't figure out how to do it and keep all of the data, so any help would be appreciated. Thanks in advance!
EDIT: Adding code examples of what I've already described above.
This code (as well as left_join):
countyPerCap <- merge(countyLicense, countyPops, all.x = TRUE)
Produces this:
Primary_State Primary_County n Population
1 AK 12 NA
2 AK Aleutians West 1 NA
3 AK Anchorage 961 NA
4 AK Bethel 1 NA
5 AK Fairbanks North Star 124 NA
6 AK Haines 1 NA
This code:
countyPerCap <- right_join(countyLicense, countyPops)
Produces this:
Primary_State Primary_County n Population
<chr> <chr> <int> <int>
1 AL Autauga NA 55416
2 AL Baldwin NA 208563
3 AL Barbour NA 25965
4 AL Bibb NA 22643
5 AL Blount NA 57704
6 AL Bullock NA 10362
Hope that's helpful.
EDIT: This is what happens with the following code:
countyPerCap <- merge(countyLicense, countyPops, all = TRUE)
Primary_State Primary_County n Population
1 AK 12 NA
2 AK Aleutians East NA 3296
3 AK Aleutians West 1 NA
4 AK Aleutians West NA 5647
5 AK Anchorage 961 NA
6 AK Anchorage NA 298192
It duplicates state and county and then adds n to one record and Population in another. Is there a way to deduplicate the dataset and remove the NAs?
You can specify the key columns explicitly with the by argument of merge:
merge(x, y, by = c("col1", "col2"))
I figured it out. There were trailing whitespaces in the Census data's county names, so they weren't matching with the other dataset's county names. (Note to self: Always check that factors match when trying to merge datasets!)
trim.trailing <- function (x) sub("\\s+$", "", x)
countyPops$Primary_County <- trim.trailing(countyPops$Primary_County)
countyPerCap <- full_join(countyLicense, countyPops,
by=c("Primary_State", "Primary_County"), copy=TRUE)
Those three lines did the trick. Thanks everyone!
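A small base R sketch of why trailing whitespace produces the doubled rows shown earlier. The two toy tables reuse the column names from the question; the Bethel population is a made-up value for illustration, and trimws() is used in place of the trim.trailing helper:

```r
countyLicense <- data.frame(
  Primary_State  = c("AK", "AK"),
  Primary_County = c("Anchorage", "Bethel"),
  n              = c(961L, 1L),
  stringsAsFactors = FALSE
)
countyPops <- data.frame(
  Primary_State  = c("AK", "AK"),
  Primary_County = c("Anchorage ", "Bethel  "),  # note the trailing spaces
  Population     = c(298192L, 100L),             # 100 is a made-up value
  stringsAsFactors = FALSE
)

keys <- c("Primary_State", "Primary_County")

# "Anchorage" != "Anchorage ", so nothing matches and every row appears twice.
before <- merge(countyLicense, countyPops, by = keys, all = TRUE)

# Strip trailing whitespace from the key, then merge again.
countyPops$Primary_County <- trimws(countyPops$Primary_County, which = "right")
after <- merge(countyLicense, countyPops, by = keys, all = TRUE)

print(nrow(before))
print(after)
```

Comparing the row count of a full join with the row counts of its inputs, as done here, is a quick way to spot keys that look identical but do not match.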

Insert NA values in a data frame R

I want an empty data frame and later add row values to it. The way I create a data frame is the following:
result_df <- data.frame("Hospital" = character(), "State" = character(), stringsAsFactors = FALSE)
Then I add the first row:
result_df <- rbind(result_df, list("D W MCMILLAN MEMORIAL HOSPITAL", "AL"))
Just as extra information I show you the result of the following command:
str(result_df)
'data.frame': 1 obs. of 2 variables:
$ X.D.W.MCMILLAN.MEMORIAL.HOSPITAL.: Factor w/ 1 level "D W MCMILLAN MEMORIAL HOSPITAL": 1
$ X.AL. : Factor w/ 1 level "AL": 1
Then I add the next row to the data frame
result_df <- rbind(result_df, list("ARKANSAS METHODIST MEDICAL CENTER", "TX"))
and this is what I get:
Warning messages:
1: In `[<-.factor`(`*tmp*`, ri, value = "ARKANSAS METHODIST MEDICAL CENTER") :
invalid factor level, NA generated
2: In `[<-.factor`(`*tmp*`, ri, value = "TX") :
invalid factor level, NA generated
When I type result_df to see the content of the data frame this is the result:
X.D.W.MCMILLAN.MEMORIAL.HOSPITAL. X.AL.
1 D W MCMILLAN MEMORIAL HOSPITAL AL
2 <NA> <NA>
I guess this could be solved using stringsAsFactors = FALSE. Does anyone have an idea about this problem?
rbind matches data frames by column name, so add each new row as a data frame with the same column names (and stringsAsFactors = FALSE) rather than as a bare list. Rows added this way combine without generating NAs.
result_df <- rbind(result_df, data.frame(Hospital = "D W MCMILLAN MEMORIAL HOSPITAL",
state = "AL",
stringsAsFactors = FALSE))
result_df <- rbind(result_df, data.frame(Hospital = "ARKANSAS METHODIST MEDICAL CENTER",
state = "TX",
stringsAsFactors = FALSE))
Here is the final output.
print(result_df)
Hospital state
1 D W MCMILLAN MEMORIAL HOSPITAL AL
2 ARKANSAS METHODIST MEDICAL CENTER TX
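If rows are produced one at a time (for example inside a loop), one common pattern is to collect them in a list and bind once at the end. This is a sketch of that alternative, not the method used in the answer above; the rows list is a hypothetical name:

```r
# Collect one-row data frames in a list, then bind them in a single call.
rows <- list()
rows[[length(rows) + 1]] <- data.frame(
  Hospital = "D W MCMILLAN MEMORIAL HOSPITAL", State = "AL",
  stringsAsFactors = FALSE
)
rows[[length(rows) + 1]] <- data.frame(
  Hospital = "ARKANSAS METHODIST MEDICAL CENTER", State = "TX",
  stringsAsFactors = FALSE
)

result_df <- do.call(rbind, rows)
str(result_df)
```

Because every piece is a data frame with named character columns, the result keeps the intended column names and contains no factors.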
We can use rbindlist from data.table
library(data.table)
rbindlist(list(result_df, list("D W MCMILLAN MEMORIAL HOSPITAL", "AL")))
# Hospital State
#1: D W MCMILLAN MEMORIAL HOSPITAL AL

invalid factor level, NA generated when pasting in a dataframe in r

I cannot add the correct data to a data frame using rbind. Here is the problem:
results <- data.frame()
value stores the hospital name that meets the selection criteria, and y[1,2] is the name of the state.
Here is what I get when I try to add the results to the data frame results:
class(results)
[1] "data.frame"
value
[1] "JOHN C LINCOLN DEER VALLEY HOSPITAL"
y[1,2]
[1] "AZ"
class(value)
[1] "character"
class(y[1,2])
[1] "character"
results <- rbind(results,as.list(c(value,y[1,2])))
Warning messages:
1: In `[<-.factor`(`*tmp*`, ri, value = "JOHN C LINCOLN DEER VALLEY HOSPITAL") :
invalid factor level, NA generated
2: In `[<-.factor`(`*tmp*`, ri, value = "AZ") :
invalid factor level, NA generated
results
X.ARKANSAS.METHODIST.MEDICAL.CENTER. X.AR.
1 ARKANSAS METHODIST MEDICAL CENTER AR
2 <NA> <NA>
3 <NA> <NA>
How to solve this?
Many thanks
You have factors where you want characters. Run str() on your data frame to identify the columns that are factors. Then, supposing your data frame is called Mydf and the factor columns are columns 3 and 5:
Mydf[c(3, 5)] <- lapply(Mydf[c(3, 5)], as.character)
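When the positions of the factor columns are not known in advance, they can be detected instead of hard-coded. A minimal sketch, assuming a small data frame with factor columns (the contents of Mydf here are illustrative):

```r
# A data frame whose columns are factors, as in the question.
Mydf <- data.frame(
  hospital = "ARKANSAS METHODIST MEDICAL CENTER", state = "AR",
  stringsAsFactors = TRUE
)

# Find the factor columns and convert only those to character.
is_fac <- vapply(Mydf, is.factor, logical(1))
Mydf[is_fac] <- lapply(Mydf[is_fac], as.character)

# New values can now be appended without "invalid factor level" warnings.
Mydf <- rbind(Mydf, data.frame(
  hospital = "JOHN C LINCOLN DEER VALLEY HOSPITAL", state = "AZ",
  stringsAsFactors = FALSE
))
print(Mydf)
```

lapply is used rather than sapply because it always returns a list of columns, which assigns back into the data frame predictably even when there is only one row.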

Resources