Insert NA values in a data frame R - r

I want to create an empty data frame and add rows to it later. This is how I create the data frame:
result_df <- data.frame("Hospital" = character(), "State" = character(), stringsAsFactors = FALSE)
Then I add the first row:
result_df <- rbind(result_df, list("D W MCMILLAN MEMORIAL HOSPITAL", "AL"))
Just as extra information I show you the result of the following command:
str(result_df)
'data.frame': 1 obs. of 2 variables:
$ X.D.W.MCMILLAN.MEMORIAL.HOSPITAL.: Factor w/ 1 level "D W MCMILLAN MEMORIAL HOSPITAL": 1
$ X.AL. : Factor w/ 1 level "AL": 1
Then I add the next row to the data frame
result_df <- rbind(result_df, list("ARKANSAS METHODIST MEDICAL CENTER", "TX"))
and this is what I get:
Warning messages:
1: In `[<-.factor`(`*tmp*`, ri, value = "ARKANSAS METHODIST MEDICAL CENTER") :
invalid factor level, NA generated
2: In `[<-.factor`(`*tmp*`, ri, value = "TX") :
invalid factor level, NA generated
When I type result_df to see the content of the data frame this is the result:
X.D.W.MCMILLAN.MEMORIAL.HOSPITAL. X.AL.
1 D W MCMILLAN MEMORIAL HOSPITAL AL
2 <NA> <NA>
I guess this could be solved using stringsAsFactors = FALSE. Does anyone have an idea about this problem?

rbind matches the columns of data frames by name, so every row you append should be a data frame with the same column names. If you build each new row that way (with stringsAsFactors = FALSE), the rows combine without producing NA.
result_df <- rbind(result_df, data.frame(Hospital = "D W MCMILLAN MEMORIAL HOSPITAL",
state = "AL",
stringsAsFactors = FALSE))
result_df <- rbind(result_df, data.frame(Hospital = "ARKANSAS METHODIST MEDICAL CENTER",
state = "TX",
stringsAsFactors = FALSE))
Here is the final output.
print(result_df)
Hospital state
1 D W MCMILLAN MEMORIAL HOSPITAL AL
2 ARKANSAS METHODIST MEDICAL CENTER TX
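The same idea can be wrapped in a small helper so that the column names are written only once. This is a minimal sketch in base R; the function name add_hospital is hypothetical and not part of the original answer, and it uses the column names Hospital and State from the question.

```r
# Start from an empty data frame with character columns, as in the question.
result_df <- data.frame(Hospital = character(), State = character(),
                        stringsAsFactors = FALSE)

# Hypothetical helper: wraps the new values in a one-row data frame whose
# column names match the target, then binds it to the existing rows.
add_hospital <- function(df, hospital, state) {
  new_row <- data.frame(Hospital = hospital, State = state,
                        stringsAsFactors = FALSE)
  rbind(df, new_row)
}

result_df <- add_hospital(result_df, "D W MCMILLAN MEMORIAL HOSPITAL", "AL")
result_df <- add_hospital(result_df, "ARKANSAS METHODIST MEDICAL CENTER", "TX")
str(result_df)
```

Growing a data frame one row at a time is fine for a handful of rows; for many rows it is usually cheaper to collect the rows in a list and bind them once at the end.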

We can use rbindlist from data.table
library(data.table)
rbindlist(list(result_df, list("D W MCMILLAN MEMORIAL HOSPITAL", "AL")))
# Hospital State
#1: D W MCMILLAN MEMORIAL HOSPITAL AL
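For context, the warning in the question appears because the columns ended up as factors (the str() output shows this; before R 4.0, strings were converted to factors by default). A factor only accepts values that are already among its levels, so any new string becomes NA. The sketch below reproduces that in isolation; the variable names are illustrative only.

```r
# A factor with a single level, like the State column in the question.
state <- factor("AL")

# Assigning a value that is not an existing level emits
# "invalid factor level, NA generated" and stores NA.
bad <- state
bad[2] <- "TX"
is.na(bad[2])

# Converting to character first avoids the problem.
good <- as.character(state)
good[2] <- "TX"
good
```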

Related

How to remove values in a column based on other column values equaling the column values above it?

I am working in R and merged two data frames so that all the information is in one place, but the merge repeated the value in the "Cost" column on every row (because the last three columns hold unique values). I want the cost of 100 to appear only in the first row and to be blanked out in every other row where "State", "Market", "Date", and "Cost" are the same as in the row above. I attached what the data frame looks like now and what I want it to look like. Thank you!
What it currently looks like
What it should look like
Please use index like in this example:
name_of_your_dataset[nrow_init:nrow_fin, ncol] <- NA
In your case, assuming the name of your dataset as 'data'
data[2:4,4]<- NA
Here is a solution using duplicated with your dataframe (df)
State Market Date Cost Word format Type
1 AZ Phoenix 10-20-2020 100 HELLO AM Sports related
2 AZ Phoenix 10-21-2020 100 GOODBYE PM Non Sports related
3 AZ Phoenix 10-22-2020 100 YES FM Country
4 AZ Phoenix 10-23-2020 100 NONE CM Rock
Set duplicates to NA
df$Cost[duplicated(df$Cost)] <- NA
Output:
State Market Date Cost Word format Type
1 AZ Phoenix 10-20-2020 100 HELLO AM Sports related
2 AZ Phoenix 10-21-2020 NA GOODBYE PM Non Sports related
3 AZ Phoenix 10-22-2020 NA YES FM Country
4 AZ Phoenix 10-23-2020 NA NONE CM Rock
The Date column differs from row to row, so I think you want to replace duplicated Cost values within each combination of State and Market.
library(dplyr)
df <- df %>%
group_by(State, Market) %>%
mutate(Cost = replace(Cost, duplicated(Cost), NA)) %>%
ungroup
df
# State Market Date Cost Word format Type
# <chr> <chr> <chr> <dbl> <chr> <chr> <chr>
#1 AZ Phoenix 10-20-2020 100 HELLO AM Sports related
#2 AZ Phoenix 10-21-2020 NA GOODBYE PM Non Sports related
#3 AZ Phoenix 10-22-2020 NA YES FM Country
#4 AZ Phoenix 10-23-2020 NA NONE CM Rock
data
It is easier to help if you provide data in a reproducible format
df <- structure(list(State = c("AZ", "AZ", "AZ", "AZ"), Market = c("Phoenix",
"Phoenix", "Phoenix", "Phoenix"), Date = c("10-20-2020", "10-21-2020",
"10-22-2020", "10-23-2020"), Cost = c(100, 100, 100, 100), Word = c("HELLO",
"GOODBYE", "YES", "NONE"), format = c("AM", "PM", "FM", "CM"),
Type = c("Sports related", "Non Sports related", "Country",
"Rock")), row.names = c(NA, -4L), class = "data.frame")
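The grouped approach can also be written in base R by passing several columns to duplicated(), which then flags rows whose combination of values has already appeared. The sketch below uses a small made-up data frame (the Tucson and Fresno rows are illustrative and not from the question) so that the effect of the grouping is visible.

```r
# Illustrative data: two Phoenix rows share a cost, the other markets do not.
df <- data.frame(
  State  = c("AZ", "AZ", "AZ", "CA"),
  Market = c("Phoenix", "Phoenix", "Tucson", "Fresno"),
  Date   = c("10-20-2020", "10-21-2020", "10-22-2020", "10-23-2020"),
  Cost   = c(100, 100, 100, 100),
  stringsAsFactors = FALSE
)

# TRUE for every row whose State/Market/Cost combination was seen earlier.
dup <- duplicated(df[c("State", "Market", "Cost")])

# Keep the first cost per combination and blank out the repeats.
df$Cost[dup] <- NA
df
```

Only the second Phoenix row is set to NA here; the Tucson and Fresno rows keep their cost because their State/Market combination is new.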

Need to ID states from mixed names /IDs in location data

Need to ID states from mixed location data
Need to search for 50 states abbreviations & 50 states full names, and return state abbreviation
N <- 1:10
Loc <- c("Los Angeles, CA", "Manhattan, NY", "Florida, USA", "Chicago, IL" , "Houston, TX",
"Texas, USA", "Corona, CA", "Georgia, USA", "WV NY NJ", "qwerty uy PO DOPL JKF" )
df <- data.frame(N, Loc)
# Objective: create a variable state such that
# state contains the abbreviated names of states from Loc:
# for "Los Angeles, CA", state = CA
# for "Florida, USA", state = FL
# for "WV NY NJ", state = NA
# for "qwerty NJuy PO DOPL JKF", state = NA (in spite of containing the string NJ, it is not wrapped in spaces)
# End result should be Newdf
State <- c("CA", "NY", "FL", "IL", "TX","TX", "CA", "GA", NA, NA)
Newdf <- data.frame(N, Loc, State)
> Newdf
N Loc State
1 1 Los Angeles, CA CA
2 2 Manhattan, NY NY
3 3 Florida, USA FL
4 4 Chicago, IL IL
5 5 Houston, TX TX
6 6 Texas, USA TX
7 7 Corona, CA CA
8 8 Georgia, USA GA
9 9 WV NY NJ <NA>
10 10 qwerty uy PO DOPL JKF <NA>
Is there a package? or can a loop be written? Even if the schema could be demonstrated with a few states, that would be sufficient - I will post the full solution when I get to it. Btw, this is for a Twitter dataset downloaded using rtweet package, and the variable is: place_full_name
There are default constants in R, state.abb and state.name which can be used.
vars <- stringr::str_extract(df$Loc, paste0('\\b',c(state.abb, state.name),
'\\b', collapse = '|'))
#[1] "CA" "NY" "Florida" "IL" "TX" "Texas" "CA" "Georgia" "WV" NA
If you want everything as abbreviations, we can go further and do :
inds <- vars %in% state.name
vars[inds] <- state.abb[match(vars[inds], state.name)]
vars
#[1] "CA" "NY" "FL" "IL" "TX" "TX" "CA" "GA" "WV" NA
However, for the 9th row you expect NA, but this returns "WV" because WV is a valid state abbreviation. To handle such cases you need rules that are strict enough to extract only genuine state references and nothing else.
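One possible stricter rule, sketched in base R under an assumption that is not stated in the question: a state is accepted only as a two-letter code after the last comma ("City, ST") or as a full state name directly before ", USA". The function name extract_state is hypothetical.

```r
Loc <- c("Los Angeles, CA", "Manhattan, NY", "Florida, USA", "Chicago, IL",
         "Houston, TX", "Texas, USA", "Corona, CA", "Georgia, USA",
         "WV NY NJ", "qwerty uy PO DOPL JKF")

# Hypothetical helper implementing the assumed "text, suffix" rule.
extract_state <- function(loc) {
  has_comma <- grepl(",", loc)
  # Text after the last comma and text before the first comma;
  # entries without a comma are treated as unparseable.
  suffix <- ifelse(has_comma, sub("^.*,\\s*", "", loc), NA_character_)
  prefix <- ifelse(has_comma, sub("\\s*,.*$", "", loc), NA_character_)

  # Case 1: the suffix is itself a state abbreviation.
  from_abb <- state.abb[match(suffix, state.abb)]
  # Case 2: the suffix is "USA" and the prefix is a full state name.
  from_name <- ifelse(suffix == "USA",
                      state.abb[match(prefix, state.name)],
                      NA_character_)

  ifelse(is.na(from_abb), from_name, from_abb)
}

extract_state(Loc)
```

With this rule "WV NY NJ" yields NA because it contains no comma. Real Twitter place names may follow other patterns, so the rule would need adjusting to the actual data.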
Utilising the built-in R constants, state.abb and state.name, we can try to extract these from the Loc with regular expressions.
state.abbs <- sub('.+, ([A-Z]{2})', '\\1', df$Loc)
state.names <- sub('^(.+),.+', '\\1', df$Loc)
If an extracted abbreviation is not one of the built-in state.abb values, we use match to look up the extracted name in the built-in state.name vector and use that position to index state.abb; otherwise we keep the abbreviation we already have. Entries that match neither return NA.
df$state.abb <- ifelse(!state.abbs %in% state.abb,
state.abb[match(state.names, state.name)], state.abbs)
df
N Loc state.abb
1 1 Los Angeles, CA CA
2 2 Manhattan, NY NY
3 3 Florida, USA FL
4 4 Chicago, IL IL
5 5 Houston, TX TX
6 6 Texas, USA TX
7 7 Corona, CA CA
8 8 Georgia, USA GA
9 9 WV NY NJ <NA>
10 10 qwerty uy PO DOPL JKF <NA>

R data frame unsplit: invalid factor level, NAs generated

I have an array which contains a data frame at each entry (data frame has column 'hospital' and 'state'):
> groups
$AK
hospital state
AK NA AK
$AL
hospital state
AL D W MCMILLAN MEMORIAL HOSPITAL AL
$AR
hospital state
AR ARKANSAS METHODIST MEDICAL CENTER AR
When I call unsplit, I don't get a correctly combined data frame:
> unsplit(groups, dimnames(groups) )
hospital state
AK <NA> AK
AL D W MCMILLAN MEMORIAL HOSPITAL <NA>
AR ARKANSAS METHODIST MEDICAL CENTER <NA>
Warning messages:
1: In `[<-.factor`(`*tmp*`, iseq, value = "AL") :
: invalid factor level, NAs generated
2: In `[<-.factor`(`*tmp*`, iseq, value = "AR") :
: invalid factor level, NAs generated
It's driving me NUTS for hours. How can I have my data frame correctly reunited?
Thanks

Empty rows in list as NA values in data.frame in R

I have a dataframe as follows:
hospital <- c("PROVIDENCE ALASKA MEDICAL CENTER", "ALASKA REGIONAL HOSPITAL", "FAIRBANKS MEMORIAL HOSPITAL",
"CRESTWOOD MEDICAL CENTER", "BAPTIST MEDICAL CENTER EAST", "ARKANSAS HEART HOSPITAL",
"MEDICAL CENTER NORTH LITTLE ROCK", "CRITTENDEN MEMORIAL HOSPITAL")
state <- c("AK", "AK", "AK", "AL", "AL", "AR", "AR", "AR")
rank <- c(1,2,3,1,2,1,2,3)
df <- data.frame(hospital, state, rank)
df
hospital state rank
1 PROVIDENCE ALASKA MEDICAL CENTER AK 1
2 ALASKA REGIONAL HOSPITAL AK 2
3 FAIRBANKS MEMORIAL HOSPITAL AK 3
4 CRESTWOOD MEDICAL CENTER AL 1
5 BAPTIST MEDICAL CENTER EAST AL 2
6 ARKANSAS HEART HOSPITAL AR 1
7 MEDICAL CENTER NORTH LITTLE ROCK AR 2
8 CRITTENDEN MEMORIAL HOSPITAL AR 3
I would like to create a function, rankall, that takes rank as an argument and returns the hospitals of that rank for each state, with NAs returned if the state does not have a hospital that matches the given rank. For example, I want output of rankall(rank=3) to look like this:
hospital state
AK FAIRBANKS MEMORIAL HOSPITAL AK
AL <NA> AL
AR CRITTENDEN MEMORIAL HOSPITAL AR
I've tried:
rankall <- function(rank) {
split_by_state <- split(df, df$state)
ranked_hospitals <- lapply(split_by_state, function (x) {
x[(x$rank==rank), ]
})
combined_ranked_hospitals <- do.call(rbind, ranked_hospitals)
return(combined_ranked_hospitals[ ,1:2])
}
But rankall(rank=3) returns:
hospital state
AK FAIRBANKS MEMORIAL HOSPITAL AK
AR CRITTENDEN MEMORIAL HOSPITAL AR
This leaves out the NA values that I need to keep track of. Is there a way for R to recognize the empty rows in my list object within my function as NAs, rather than as empty rows? Is there another function besides lapply that would be more useful for this task?
[ Note: This dataframe is from the Coursera R Programming course. This is also my first post on Stackoverflow, and my first time learning programming. Thank you to all who offered solutions and advice, this forum is fantastic. ]
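You just need an if/else in your function:
rankall <- function(rank) {
  split_by_state <- split(df, df$state)
  ranked_hospitals <- lapply(split_by_state, function(x) {
    indx <- x$rank == rank
    if (any(indx)) {
      return(x[indx, ])
    } else {
      # No hospital with this rank: keep the state, set the hospital to NA
      out <- x[1, ]
      out$hospital <- NA
      return(out)
    }
  })
  # Combine the per-state results as in the original function
  combined_ranked_hospitals <- do.call(rbind, ranked_hospitals)
  return(combined_ranked_hospitals[, 1:2])
}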
Here's an alternative approach:
rankall <- function(rank) {
do.call(rbind, lapply(split(df, df$state), function(df) {
tmp <- df[df$rank == rank, 1:2]
if (!nrow(tmp)) return(transform(df[1, 1:2], hospital = NA)) else return(tmp)
}))
}
rankall(3)
# hospital state
# AK FAIRBANKS MEMORIAL HOSPITAL AK
# AL <NA> AL
# AR CRITTENDEN MEMORIAL HOSPITAL AR
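Another way to keep states without a match is a left join: build a table of all states and merge the filtered rows onto it with all.x = TRUE, so unmatched states get NA. This is a base R sketch, not taken from the answers above; the name rankall_merge is hypothetical and the data frame is the one from the question.

```r
hospital <- c("PROVIDENCE ALASKA MEDICAL CENTER", "ALASKA REGIONAL HOSPITAL",
              "FAIRBANKS MEMORIAL HOSPITAL", "CRESTWOOD MEDICAL CENTER",
              "BAPTIST MEDICAL CENTER EAST", "ARKANSAS HEART HOSPITAL",
              "MEDICAL CENTER NORTH LITTLE ROCK", "CRITTENDEN MEMORIAL HOSPITAL")
state <- c("AK", "AK", "AK", "AL", "AL", "AR", "AR", "AR")
rank <- c(1, 2, 3, 1, 2, 1, 2, 3)
df <- data.frame(hospital, state, rank, stringsAsFactors = FALSE)

# Hypothetical helper: left-join the hospitals of the chosen rank onto
# the full list of states, so states without a match are kept with NA.
rankall_merge <- function(data, chosen_rank) {
  all_states <- data.frame(state = unique(data$state),
                           stringsAsFactors = FALSE)
  hits <- data[data$rank == chosen_rank, c("hospital", "state")]
  out <- merge(all_states, hits, by = "state", all.x = TRUE)
  # Order by state explicitly and return the columns in the requested order.
  out[order(out$state), c("hospital", "state")]
}

rankall_merge(df, 3)
```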
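Here is another dplyr approach. Note that it picks the x-th row within each state, so it returns the hospital of rank x only if the rows of each state are already sorted by rank.
library(dplyr)
fun1 <- function(x) {
  group_by(df, state) %>%
    summarise(hospital = hospital[x],
              rank = nth(rank, x))
}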
# fun1(3)
#Source: local data frame [3 x 3]
#
# state hospital rank
#1 AK FAIRBANKS MEMORIAL HOSPITAL 3
#2 AL NA NA
#3 AR CRITTENDEN MEMORIAL HOSPITAL 3
I think this is a good use of dplyr. Only thing that's weird is summarize complains when I use NA instead of "NA". Anyone have thoughts on why?
library(dplyr)
rankall <- function(chosen_rank){
group_by(df, state) %>%
summarize(hospital = ifelse(length(hospital[rank==chosen_rank])!=0,
as.character(hospital[rank==chosen_rank]), "NA"),
rank = chosen_rank)
}
rankall(1)
rankall(2)
rankall(3)
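A possible explanation for the NA versus "NA" question, offered as an assumption rather than a confirmed diagnosis: a bare NA is logical, so ifelse() returns a logical value for states without a match and a character value for the others, and some dplyr versions reject a summary column whose type differs between groups. The base R sketch below shows the type difference; the hospital name is a placeholder.

```r
# With a TRUE test, ifelse() returns the character "yes" value.
type_hit <- class(ifelse(TRUE, "SOME HOSPITAL", NA))

# With a FALSE test, it returns the bare NA, which is logical.
type_miss <- class(ifelse(FALSE, "SOME HOSPITAL", NA))

# NA_character_ is a missing value that is already of type character.
type_miss_chr <- class(ifelse(FALSE, "SOME HOSPITAL", NA_character_))

c(type_hit, type_miss, type_miss_chr)
```

Using NA_character_ in place of "NA" would keep the column character while storing a real missing value instead of the string "NA".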

R make new data frame from current one

I'm trying to calculate the best goal differentials in the group stage of the 2014 world cup.
football <- read.csv(
file="http://pastebin.com/raw.php?i=iTXdPvGf",
header = TRUE,
strip.white = TRUE
)
football <- head(football,n=48L)
football[which(max(abs(football$home_score - football$away_score)) == abs(football$home_score - football$away_score)),]
Results in
home home_continent home_score away away_continent away_score result
4 Cameroon Africa 0 Croatia Europe 4 l
7 Spain Europe 1 Netherlands Europe 5 l
37 Germany
So those are the games with the highest goal differential, but now I need to make a new data frame that has a team name and abs(football$home_score-football$away_score)
football$score_diff <- abs(football$home_score - football$away_score)
football$winner <- ifelse(football$home_score > football$away_score, as.character(football$home),
ifelse(football$result == "d", NA, as.character(football$away)))
You could save some typing this way. First compute the score difference and the winner. When result is w, the home team is the winner, so you do not need to compare the scores at all. Once both columns are added, keep the rows whose score difference equals max().
mydf <- read.csv(file="http://pastebin.com/raw.php?i=iTXdPvGf",
header = TRUE, strip.white = TRUE)
mydf <- head(mydf,n = 48L)
library(dplyr)
mutate(mydf, scorediff = abs(home_score - away_score),
winner = ifelse(result == "w", as.character(home),
ifelse(result == "l", as.character(away), "draw"))) %>%
filter(scorediff == max(scorediff))
# home home_continent home_score away away_continent away_score result scorediff winner
#1 Cameroon Africa 0 Croatia Europe 4 l 4 Croatia
#2 Spain Europe 1 Netherlands Europe 5 l 4 Netherlands
#3 Germany Europe 4 Portugal Europe 0 w 4 Germany
Here is another option that creates the "winner" column without ifelse, using row/column index pairs. The numeric column index is created by matching the result column against its possible values (match(football$result, ..)), and the row index is just 1:nrow(football). Subset the "football" dataset to the columns 'home' and 'away', and cbind it with an additional column 'draw' filled with NA, so that the 'd' elements in "result" map to NA.
football$score_diff <- abs(football$home_score - football$away_score)
football$winner <- cbind(football[c('home', 'away')],draw=NA)[
cbind(1:nrow(football), match(football$result, c('w', 'l', 'd')))]
football[with(football, score_diff==max(score_diff)),]
# home home_continent home_score away away_continent away_score result
#60 Brazil South America 1 Germany Europe 7 l
# score_diff winner
#60 6 Germany
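The row/column indexing used here may be easier to follow on a tiny example. In the sketch below the team names A, B, C, X, Y, Z are placeholders, not real data: each row of the index matrix is a (row, column) pair, and subsetting with that matrix picks exactly one cell per row.

```r
# Placeholder teams; the 'draw' column holds NA for drawn games.
teams <- data.frame(home = c("A", "B", "C"),
                    away = c("X", "Y", "Z"),
                    draw = NA,
                    stringsAsFactors = FALSE)
result <- c("w", "l", "d")

# Column 1 = home for 'w', column 2 = away for 'l', column 3 = draw for 'd'.
idx <- cbind(seq_len(nrow(teams)), match(result, c("w", "l", "d")))

# One cell per row: home of row 1, away of row 2, draw (NA) of row 3.
winner <- teams[idx]
winner
```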
If the dataset is very big, you could speed up the match by using chmatch from library(data.table)
library(data.table)
chmatch(as.character(football$result), c('w', 'l', 'd'))
NOTE: I used the full dataset in the link
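For the original goal of a separate data frame holding just the team name and the goal differential, a base R sketch could look like the following. It uses only the three games shown in the output above instead of the downloaded file, and the names diff and best are illustrative.

```r
# The three group-stage games listed in the output above.
football <- data.frame(
  home       = c("Cameroon", "Spain", "Germany"),
  home_score = c(0, 1, 4),
  away       = c("Croatia", "Netherlands", "Portugal"),
  away_score = c(4, 5, 0),
  stringsAsFactors = FALSE
)

# Positive means the home team won, negative means the away team won.
diff <- football$home_score - football$away_score

# New data frame with the winning team (NA for a draw) and the margin.
best <- data.frame(
  team = ifelse(diff > 0, football$home,
                ifelse(diff < 0, football$away, NA_character_)),
  score_diff = abs(diff),
  stringsAsFactors = FALSE
)

# Keep only the games with the largest margin.
best <- best[best$score_diff == max(best$score_diff), ]
best
```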
