R - Coalesce dataframe while creating a new variable with column names [duplicate] - r

This question already has answers here:
Coalesce columns and create another column to specify source
(4 answers)
Closed 2 years ago.
I am using the dplyr::coalesce and dplyr::mutate to find all first non-missing values and stuff that into a new variable. However, I would like to also create a new variable with the information on which variable is used to infill the new variable.
Here is an example:
df <- data.frame(
St1 = c(1, NA, NA, NA),
St2 = c(NA, 3, NA, NA),
St3 = c(NA, NA, 12, NA),
St4 = c(NA, NA, NA, 4))
What I do :
df <- df %>%
mutate(df.coalesce = coalesce(St1, St2, St3, St4)) %>%
select(df.coalesce)
Result:
df.coalesce
1
3
12
4
Desired result:
Station df.coalesce
St.1 1
St.2 3
St.3 12
St.4 4
Is there a way to do that using the tidyverse grammar?
Thanks!

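You can use max.col to get the name of the column holding the non-NA value in each row, and do.call with coalesce to apply coalesce across all the columns. Note that max.col on the zero-filled data picks the column with the largest value, so it agrees with coalesce only when each row has a single non-missing, positive value, as in the example.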
library(dplyr)
df %>%
transmute(Station = names(df)[max.col(replace(., is.na(.), 0))],
df.coalesce = do.call(coalesce, .))
# Station df.coalesce
#1 St1 1
#2 St2 3
#3 St3 12
#4 St4 4
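If a row can hold more than one non-missing value, the largest value is not necessarily the first one. A minimal sketch of a variant that follows coalesce more closely, assuming the question's data plus one made-up extra value (the 7 in the first row is hypothetical, added only to show the difference):

```r
library(dplyr)

# Example data from the question, plus one hypothetical extra value (7)
# so that the first row has two non-missing entries.
df <- data.frame(
  St1 = c(1, NA, NA, NA),
  St2 = c(7, 3, NA, NA),
  St3 = c(NA, NA, 12, NA),
  St4 = c(NA, NA, NA, 4)
)

# Column index of the first non-missing value in each row.
# Caveat: a row that is entirely NA would get index 1.
first_idx <- max.col(!is.na(df), ties.method = "first")

result <- data.frame(
  Station     = names(df)[first_idx],
  # unname() so the columns are passed to coalesce() as positional arguments
  df.coalesce = do.call(coalesce, unname(as.list(df)))
)
result
```

Here the first row reports St1 and the value 1, which is what coalesce returns, rather than St2 and 7.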

You can identify every column that contains NA values and then drop those columns.
train <- read.csv(file = "file", sep = ",", na.strings = c("NA"))
id_na_Cols <- sapply(train, function(x) any(is.na(x)))
trainData <- train[, !id_na_Cols]
write.table(trainData, file = "file_new", sep = ",")
Afterwards you can load the new file for further analysis.
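The same column filter can be tried without any files. This is a small base R sketch on made-up data (the train data frame and its columns are hypothetical):

```r
# Hypothetical in-memory data standing in for the CSV file.
train <- data.frame(
  id = 1:4,
  a  = c(1, NA, 3, 4),
  b  = c("x", "y", "z", "w")
)

# TRUE for every column that contains at least one NA.
has_na <- vapply(train, anyNA, logical(1))

# Keep only the complete columns; drop = FALSE keeps a data frame
# even if a single column survives.
train_complete <- train[, !has_na, drop = FALSE]
names(train_complete)
```

vapply is used instead of sapply only so that the result is guaranteed to be a logical vector.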

Related

Using a numlist loop when renaming variables

I'm trying to rename two types of variables in R using tidyverse/dplyr. The first type, "var_a_year", I want to rename as "sample_year". The second type, "var_b_7", I want to rename as "index_year".
The second variable, "var_b", starts at 7 for the first year, 2004, and increases by 2 for each year. So for 2005, the second type of variable is called "var_b_9", as shown.
I would like to use a loop so I can make this faster instead of writing a line for each year.
Many thanks in advance!
df <- df %>%
rename(
sample_2004 = var_a_2004, index_2004 = var_b_7,
sample_2005 = var_a_2005, index_2005 = var_b_9,
sample_2006 = var_a_2006, index_2006 = var_b_11,
sample_2007 = var_a_2007, index_2007 = var_b_13,
...
sample_2020 = var_a_2020, index_2020 = var_b_39)
There's no need to use a loop. rename_with will do the trick:
library(dplyr)
df <- tibble(var_a_2004=NA, var_b_7=NA, var_a_2005=NA, var_b_9=NA)
renameA <- function(x) {
return(paste0("sample_", stringr::str_sub(x, -4)))
}
df %>% rename_with(renameA, starts_with("var_a"))
Gives
# A tibble: 1 x 4
sample_2004 var_b_7 sample_2005 var_b_9
<lgl> <lgl> <lgl> <lgl>
1 NA NA NA NA
I'll leave you to work out how to code the corresponding function for your var_b_XXXX columns.
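One possible way to finish the exercise, as a sketch: since the suffix starts at 7 for 2004 and grows by 2 per year, the year can be recovered as 2004 + (n - 7) / 2. The helper names rename_a and rename_b are made up here, and base sub() is used in place of stringr.

```r
library(dplyr)

df <- tibble(var_a_2004 = NA, var_b_7 = NA, var_a_2005 = NA, var_b_9 = NA)

# Hypothetical helper: var_a_<year> -> sample_<year>
rename_a <- function(x) sub("^var_a_", "sample_", x)

# Hypothetical helper: var_b_<n> -> index_<year>, assuming n = 7 in 2004
# and an increase of 2 per year, as described in the question.
rename_b <- function(x) {
  n <- as.integer(sub("^var_b_", "", x))
  paste0("index_", 2004 + (n - 7) / 2)
}

out <- df %>%
  rename_with(rename_a, starts_with("var_a_")) %>%
  rename_with(rename_b, starts_with("var_b_"))
names(out)
```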
In addition to the answer of Limey:
#sample data
df <- structure(list(var_a_2004 = NA, var_b_7 = NA, var_a_2005 = NA,
var_b_9 = NA), row.names = c(NA, -1L), class = "data.frame")
#load data.table package
library(data.table)
#set df to data.table
dt <- as.data.table(df)
#convert var_a in columnnames to sample_
colnames(dt) <- gsub("var_a_", "sample_", colnames(dt))
#use a loop to replace var_b to index_
for(i in 2004:2005){
year <- i
nr <- 2* i -4001
setnames(dt, old = paste0("var_b_", nr), new = paste0("index_", year))
}
This loop currently covers the years 2004:2005 to match the sample data. You can change it to 2004:2020 for your dataset.
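Because setnames accepts vectors of old and new names, the loop can also be written as two vectorised calls. A sketch on the same sample columns:

```r
library(data.table)

dt <- data.table(var_a_2004 = NA, var_b_7 = NA, var_a_2005 = NA, var_b_9 = NA)

years <- 2004:2005  # change to 2004:2020 for the full dataset

# var_a_<year> -> sample_<year>
setnames(dt, old = paste0("var_a_", years), new = paste0("sample_", years))

# var_b_<2 * year - 4001> -> index_<year>  (7 for 2004, 9 for 2005, ...)
setnames(dt, old = paste0("var_b_", 2 * years - 4001), new = paste0("index_", years))

names(dt)
```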

Fill NAs with 0 if the column is numeric and empty string '' if the column is a factor using R [duplicate]

This question already has answers here:
How to replace NA values in a table for selected columns
(12 answers)
Closed 2 years ago.
I'm trying to replace all the NAs in columns of integer type with 0, and the NAs in columns of factor type with the empty string "". The code below is what I'm using, but it doesn't seem to work:
for(i in 1:ncol(credits)){
if(sapply(credits[i], class) == 'integer'){
credits[is.na(credits[,i]), i] <- 0
}
else if(sapply(credits[i], class) == 'factor'){
credits[is.na(credits[,i]), i] <- ''
}
}
You can use across in dplyr to replace column values by class :
library(dplyr)
df %>%
mutate(across(where(is.factor), ~replace(as.character(.), is.na(.), '')),
across(where(is.numeric), ~replace(., is.na(.), 0)))
# a b
#1 1 a
#2 2 b
#3 0 c
#4 4 d
#5 5
b column is of class "character" now, if you need it as factor, you can add factor outside replace like :
across(where(is.factor), ~factor(replace(as.character(.), is.na(.), ''))),
data
df <- data.frame(a = c(1, 2, NA, 4:5), b = c(letters[1:4], NA),
stringsAsFactors = TRUE)
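For comparison, here is a base R sketch of the loop from the question. A likely reason the original loop fails for factors is that "" is not one of the factor's levels, so assigning it produces NA again; adding the level first avoids that. The credits data below is made up for illustration.

```r
# Hypothetical stand-in for the credits data from the question.
credits <- data.frame(
  a = c(1L, 2L, NA, 4L, 5L),
  b = c(letters[1:4], NA),
  stringsAsFactors = TRUE
)

for (i in seq_along(credits)) {
  col <- credits[[i]]
  if (is.numeric(col)) {
    col[is.na(col)] <- 0
  } else if (is.factor(col)) {
    # "" has to exist as a level before it can be assigned
    levels(col) <- c(levels(col), "")
    col[is.na(col)] <- ""
  }
  credits[[i]] <- col
}
credits
```

This version keeps b as a factor, with "" as an extra level.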
Another way of achieving the same, creating new columns instead of overwriting the originals:
library(dplyr)
# Dataframe
df <- data.frame(x = c(1, 2, NA, 4:5), y = c('a',NA, 'd','e','f'),
stringsAsFactors = TRUE)
# Creating new columns
df_final <- df %>%
mutate(new_x = ifelse(is.na(x), 0, x)) %>%
# convert the factor to character first; ifelse() would otherwise return its integer codes
mutate(new_y = ifelse(is.na(y), "", as.character(y)))
# Printing the output
df_final

Mutate a column and name it after the input variable for a function in R

I have a data frame in R that is 89 columns wide and 500,000 rows long. Each column contains multiple 4-digit numeric codes, and a given code can appear in any column. I want to create a function that scans each row to see whether a code exists, labelling the row 1 if it does and 0 if not. The new column must be named after the code searched for, or something very similar (an appended letter, etc.); rinse and repeat for ~450 such codes, like the 3669 column below.
c1 c2 c3 3669
1 2255 3669 NA 1
2 NA 5555 6598 0
3 NA NA 1245 0
I have attempted to do this using mutate, and rowSums see below, which works for an individual code, but I cannot get to work when using the sapply function. It just creates a single column called "x"
a <- function(x) {
SR2 <<- SR2 %>% mutate(x = ifelse(rowSums(SR2 == x, na.rm = TRUE) > 0, 1, 0))
}
The x in this function is a list of codes, so "3369", "2255" etc.
What am I missing here?
Use quo_name with !! to get the correct column name, and map_dfc to collect the output into a data frame:
library(purrr)
library(dplyr)
df_out <- map_dfc(c('2255','5555'),
~transmute(df,!!quo_name(.x) := ifelse(rowSums(df == .x, na.rm = TRUE) > 0, 1, 0)))
bind_cols(df,df_out)
Data
df <- structure(list(c1 = c(2255L, NA, NA), c2 = c(3669L, 5555L, NA), c3 = c(NA, 6598L, 1245L),
`3369` = c(1L, 0L, 0L)), class = "data.frame", row.names = c("1", "2", "3"))
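A base R sketch of the same idea, in case a plain matrix of flags is enough. The "code_" prefix is an assumption (the question allows names similar to the code), and the two codes are just examples:

```r
df <- data.frame(
  c1 = c(2255L, NA, NA),
  c2 = c(3669L, 5555L, NA),
  c3 = c(NA, 6598L, 1245L)
)

codes <- c(2255L, 3669L)  # in practice, the ~450 codes to search for

# One 0/1 column per code: 1 if the code appears anywhere in the row.
flags <- sapply(codes, function(code) {
  as.integer(rowSums(df == code, na.rm = TRUE) > 0)
})
colnames(flags) <- paste0("code_", codes)

out <- cbind(df, flags)
out
```

The search is done on df rather than on out, so previously created flag columns are never scanned.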

How to get ONLY columns with NA values and the amount of NAs

I have a dataset and some of the columns have NA values. I need to display only the column names that have NA values as well as the total number of NA values in each of those columns.
I've been able to get different pieces of the problem working but not both things at once.
This gives me only the column names of the columns containing NA values. But I want the NA totals to show under each column name.
nacol<- colnames(df)[colSums(is.na(df)) > 0]
This gives me exactly what I want but it also displays the zero totals of the other columns in the dataframe and I don't want those to be displayed.
df %>% summarise_all(funs(sum(is.na(.))))
I'm obviously a complete beginner. I realize this is an extremely easy problem to fix but I've been trying for hours and I'm just getting frustrated. Please help. Thank you!
We can use Filter with colSums to remove 0 values
Filter(function(x) x > 0, colSums(is.na(df)))
#a c
#2 1
Or select_if in dplyr
library(dplyr)
df %>%
summarise_all(~(sum(is.na(.)))) %>%
select_if(. > 0)
We can also first select column with any NA values and then count them.
df %>%
select_if(~any(is.na(.))) %>%
summarise_all(~(sum(is.na(.))))
data
df <- data.frame(a = c(2, 3, NA, NA, 1), b = 1:5, c = c(1, 3, 4, NA, 1))
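In dplyr 1.0 and later, summarise_all and select_if are superseded by across() and where(). A sketch of the equivalent call on the same data, assuming such a dplyr version is installed:

```r
library(dplyr)

df <- data.frame(a = c(2, 3, NA, NA, 1), b = 1:5, c = c(1, 3, 4, NA, 1))

# Select only columns that contain an NA, then count the NAs in each.
na_counts <- df %>%
  summarise(across(where(anyNA), ~ sum(is.na(.x))))
na_counts
```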
A possible alternative using purrr, with dplyr for the pipe (using airquality for reproducibility):
library(dplyr)
library(purrr)
airquality %>%
keep(~anyNA(.x)) %>%
map_dbl(~sum(is.na(.x)))
Ozone Solar.R
37 7
Using data from @Ronak Shah's answer:
df %>%
keep(~anyNA(.x)) %>%
map_dbl(~sum(is.na(.x)))
a c
2 1
Using data.table (there might be a way to make it more compact):
library(data.table)
setDT(df)
df[,Filter(anyNA,.SD)][,lapply(.SD, function(x) sum(is.na(x)))]
a c
1: 2 1
Data:
df <- structure(list(a = c(2, 3, NA, NA, 1), b = 1:5, c = c(1, 3, 4,
NA, 1)), class = "data.frame", row.names = c(NA, -5L))
airquality is a built-in dataset.
We can do (na_if comes from dplyr)
na.omit(na_if(colSums(is.na(df)), 0))
# a c
# 2 1
Or using summarise_if
library(dplyr)
df %>%
summarise_if(~ any(is.na(.)), ~sum(is.na(.)))
# a c
#1 2 1
data
df <- data.frame(a = c(2, 3, NA, NA, 1), b = 1:5, c = c(1, 3, 4, NA, 1))

Delete columns in an R loop

I have a dataframe where I want to replace the variables
age_1 with values of variable age1_corr_1 if age1_corr_1 is not NA
age_2 with values of variable age1_corr_2 if age1_corr_2 is not NA, ...,
age_n with values of variable age1_corr_n if age1_corr_n is not NA.
Then I'd like to delete the variables age1_corr_1, age1_corr_2, ..., age1_corr_n. I have figured out how to do the first part (change the values) in a loop but couldn't figure out how to delete the variables after. Any suggestion?
Sample data
y <- data.frame("age_1" = c(5,1,1,10), "age1_corr_1" = c(1,NA,NA,0), "age_2" = c(1,2,3,4), "age1_corr_2" = c(NA, NA, 10, 9),
"age_3" = c(4,3,2,5), "age1_corr_3" = c(NA,NA,NA,6), "age_4" = c(1,4,2,7), "age1_corr_4" = c(NA, NA, NA,NA))
The code that will change values of age_n based on age1_corr_n
for(i in 1:4){
cname1 <- paste0("age_",i)
cname2 <- paste0("age1_corr_",i)
y[,cname1] <- ifelse(!is.na(y[,cname2]), y[,cname2], y[,cname1])
}
The output I'd like to have is
age_1 age_2 age_3 age_4
1 1 1 4 1
2 1 2 3 4
3 1 10 2 2
4 0 9 6 7
You have several options if there is a pattern to the columns you want to remove (or conversely, the ones you want to keep).
Here's the data you provided:
y <- data.frame("age_1" = c(5,1,1,10), "age1_corr_1" = c(1,NA,NA,0), "age_2" = c(1,2,3,4), "age1_corr_2" = c(NA, NA, 10, 9),
"age_3" = c(4,3,2,5), "age1_corr_3" = c(NA,NA,NA,6), "age_4" = c(1,4,2,7), "age1_corr_4" = c(NA, NA, NA,NA))
Here's a dplyr example of how to get only those columns that follow the pattern age_N, where N is 1, 2, 3, or 4:
library(dplyr)
x <- select(y, paste("age", 1:4, sep = "_"))
Alternatively, you could choose the pattern for the columns you DON'T want:
x <- select(y, -grep("_corr_", current_vars()))
This uses the following strategy:
* you can select for everything BUT a column or set of columns by adding a minus sign first.
* current_vars() is a helper function in dplyr that evaluates to all the variable names for the data (here, y)
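current_vars() may not be available in newer dplyr releases. A sketch of the same negative selection using the contains() selection helper, which matches a literal substring in the column names:

```r
library(dplyr)

y <- data.frame(
  age_1 = c(5, 1, 1, 10), age1_corr_1 = c(1, NA, NA, 0),
  age_2 = c(1, 2, 3, 4),  age1_corr_2 = c(NA, NA, 10, 9),
  age_3 = c(4, 3, 2, 5),  age1_corr_3 = c(NA, NA, NA, 6),
  age_4 = c(1, 4, 2, 7),  age1_corr_4 = c(NA, NA, NA, NA)
)

# Drop every column whose name contains "_corr_".
x <- select(y, -contains("_corr_"))
names(x)
```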
Do the real work with dplyr::coalesce() (description: "Given a set of vectors, coalesce() finds the first non-missing value at each position."). Then drop the columns with dplyr::select(), using a negative sign in front of the columns you don't need anymore.
library(magrittr)
y %>%
dplyr::mutate(
age1_corr_4 = as.numeric(age1_corr_4), # Delete this line if it's already a numeric/floating data type.
age_1 = dplyr::coalesce(age1_corr_1, age_1),
age_2 = dplyr::coalesce(age1_corr_2, age_2),
age_3 = dplyr::coalesce(age1_corr_3, age_3),
age_4 = dplyr::coalesce(age1_corr_4, age_4)
) %>%
dplyr::select(
-age1_corr_1, -age1_corr_2, -age1_corr_3, -age1_corr_4
)
Produces
age_1 age_2 age_3 age_4
1 1 1 4 1
2 1 2 3 4
3 1 10 2 2
4 0 9 6 7
Edit: I apologize, I focused on the coalesce part of the task and ignored the n part of the task.
Here are two other approaches that can handle an arbitrary number of columns. For this specific example dataset, make sure that the age1_corr_4 column is correctly represented as a float with y$age1_corr_4 <- as.numeric(y$age1_corr_4).
Like Dan Hall's response, one approach keeps the columns you want...
library(magrittr)
coalesce_corr1 <- function( index ) {
name_age <- paste0("age_" , index)
name_corr <- paste0("age1_corr_", index)
y %>%
dplyr::mutate(
!!name_age := dplyr::coalesce(.data[[name_corr]], .data[[name_age]])
) %>%
dplyr::select(!!name_age)
}
1:4 %>%
purrr::map(coalesce_corr1) %>%
dplyr::bind_cols()
...and the other drops the columns you don't want.
z <- y
coalesce_corr2 <- function( index ) {
name_age <- paste0( "age_" , index)
name_corr <- paste0( "age1_corr_", index)
z <<- z %>%
dplyr::mutate(
!!name_age := dplyr::coalesce(.data[[!!name_corr]], .data[[!!name_age]])
)
z[[name_corr]] <<- NULL
}
1:4 %>%
purrr::walk(coalesce_corr2)
z
I wish this last one didn't require a global variable (that uses <<-), and for this reason, I actually recommend Dan's approaches, but I wanted to try out quosures for output variables.
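One way to avoid the global variable, sketched under the assumption that a plain loop is acceptable: pass the data frame into a function, modify the local copy, and return it. The helper name apply_corrections is made up for this example.

```r
library(dplyr)

y <- data.frame(
  age_1 = c(5, 1, 1, 10), age1_corr_1 = c(1, NA, NA, 0),
  age_2 = c(1, 2, 3, 4),  age1_corr_2 = c(NA, NA, 10, 9),
  age_3 = c(4, 3, 2, 5),  age1_corr_3 = c(NA, NA, NA, 6),
  age_4 = c(1, 4, 2, 7),  age1_corr_4 = c(NA, NA, NA, NA)
)

# Hypothetical helper: works on a local copy of the data and returns it.
apply_corrections <- function(data, n) {
  for (i in seq_len(n)) {
    name_age  <- paste0("age_", i)
    name_corr <- paste0("age1_corr_", i)
    # as.numeric() guards against an all-NA (logical) correction column.
    data[[name_age]] <- coalesce(as.numeric(data[[name_corr]]), data[[name_age]])
    data[[name_corr]] <- NULL  # drop the correction column once it is used
  }
  data
}

z <- apply_corrections(y, 4)
z
```

The original y is left untouched, and the loop also answers the question as asked, since each correction column is deleted right after it is used.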
