Replace NA's using data from Multiple Columns - r

I have a data-frame that looks as such:
ID col2 col3 col4
1 5 NA NA
2 NA NA 1
3 5 NA NA
4 19 NA 1
If col2 has a value, that cell should not change (even if columns 3 and 4 contain values). However, if col2 contains an "NA" value, I would like to return any non-NA value from col3 or col4, if one exists.
Desired output shown below, notice how row 2 has the "1" there now.
ID col2 col3 col4
1 5 NA NA
2 1 NA 1
3 5 NA NA
4 19 NA 1
I know this can be done manually by referencing each column using $ or [], but how can this be done using a for-loop or apply?
Thanks

We can do this with ifelse
df1$col2 <- with(df1, ifelse(is.na(col2), pmax(col3, col4, na.rm = TRUE), col2))
df1$col2
#[1] 5 1 5 19
Or create a logical index to replace the values
i1 <- is.na(df1$col2)
df1$col2[i1] <- do.call(pmax, c(df1[i1, 3:4], na.rm = TRUE))
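Since the question asks for a for-loop, here is a minimal sketch of that variant. One difference worth noting: pmax returns the larger value when both col3 and col4 are non-NA, whereas this loop takes the first non-NA in column order. The data frame below is rebuilt from the question's table, and the name filled is only illustrative.

```r
# Data rebuilt from the question's example
df1 <- data.frame(ID = 1:4,
                  col2 = c(5, NA, 5, 19),
                  col3 = rep(NA_real_, 4),
                  col4 = c(NA, 1, NA, 1))

filled <- df1$col2
for (nm in c("col3", "col4")) {
  gap <- is.na(filled)            # rows still missing a value
  filled[gap] <- df1[[nm]][gap]   # take the candidate column's value there
}
df1$col2 <- filled
df1$col2
```

Adding more fallback columns only means extending the vector of names in the loop.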

Related

Summing up a col and storing the value at the last index in a new column in R [duplicate]

This question already has answers here:
How to sum data.frame column values?
(5 answers)
Closed 2 years ago.
Suppose I have a dataframe df:
col1
1
2
3
4
How do I get the following in R?
col1 col2
1
2
3
4 10 (total of col1)
You can do :
df <- data.frame(col1 = 1:4)
df$col2 <- NA
df$col2[nrow(df)] <- sum(df$col1, na.rm = TRUE)
df
# col1 col2
#1 1 NA
#2 2 NA
#3 3 NA
#4 4 10
Keeping other values in col2 as NA instead of blanks since blanks would turn the column to character.
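A small sketch of why NA is used rather than blanks: a vector holds a single type, so mixing empty strings with a number coerces everything to character. The variable names here are only illustrative.

```r
num_col <- c(NA, NA, NA, 10)    # NA fits in a numeric vector
chr_col <- c("", "", "", 10)    # "" forces the whole vector to character

class(num_col)
class(chr_col)
sum(num_col, na.rm = TRUE)      # arithmetic still works on the NA version
```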

missing values filling in R [duplicate]

This question already has answers here:
How to implement coalesce efficiently in R
(9 answers)
nested ifelse() is the worst; what's the best? [duplicate]
(3 answers)
Closed 5 years ago.
I want help on R programming to fill col4
col4=col1, in case col1 is NA then col4=col2, in case col1 and col2 are NA then col4=col3
id col1 col2 col3
1 10 NA NA
2 NA 12 NA
3 NA NA 13
4 NA NA 1
5 2 3 NA
Answer:
id col4
1 10
2 12
3 13
4 1
5 2
Easily done with coalesce from dplyr. This solution works for N number of columns:
library(dplyr)
data %>%
mutate(col4 = coalesce(!!!data[-1]))
Result:
id col1 col2 col3 col4
1 1 10 NA NA 10
2 2 NA 12 NA 12
3 3 NA NA 13 13
4 4 NA NA 1 1
5 5 2 3 NA 2
Data:
data = read.table(text = "id col1 col2 col3
1 10 NA NA
2 NA 12 NA
3 NA NA 13
4 NA NA 1
5 2 3 NA", header = T)
Notes:
!!! should not be confused with the negation operator ! (an understandable confusion). It is the splice operator from rlang, which dplyr also supports, and it enables explicit splicing.
What this means is that instead of inputting the entire data frame into coalesce (coalesce(data[-1])), I am separating the columns of data[-1] (or elements of the list) and have each element as an input to coalesce. So this:
coalesce(!!!data[-1])
is actually equivalent to this:
coalesce(col1, col2, col3)
The advantage of writing it this way is that you don't have to know the column names nor how many columns there are to begin with.
Using dplyr::coalesce, or any of the answers at How to implement coalesce in R?:
xx$col4 = with(xx, coalesce(col1, col2, col3))
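For readers who prefer not to load dplyr, the same "first non-NA across columns" idea can be sketched in base R with Reduce. This is an alternative to the answers above, not their method, and first_non_na is a hypothetical helper name.

```r
data <- read.table(text = "id col1 col2 col3
1 10 NA NA
2 NA 12 NA
3 NA NA 13
4 NA NA 1
5 2 3 NA", header = TRUE)

# Hypothetical helper: keep a where it is present, otherwise fall back to b
first_non_na <- function(a, b) ifelse(is.na(a), b, a)

# Fold the helper over every column except id, left to right
data$col4 <- Reduce(first_non_na, data[-1])
data$col4
```

As with the splicing version, the column names and the number of columns do not need to be known in advance.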

Replacing NAs between two rows with identical values in a specific column

I have a dataframe with multiple columns and I want to replace NAs in one column if they are between two rows with an identical number. Here is my data:
v1 v2
1 2
NA 3
NA 2
1 1
NA 7
NA 2
3 1
I basically want to start from the beginning of the data frame and replace NAs in column v1 with the previous non-NA value if the next non-NA value matches the previous one. That being said, I want the result to look like this:
v1 v2
1 2
1 3
1 2
1 1
NA 7
NA 2
3 1
As you can see, rows 2 and 3 are replaced with the number "1" because rows 1 and 4 have an identical number, but rows 5 and 6 stay the same because the non-NA values in rows 4 and 7 are not identical. I have been tweaking a lot but so far no luck. Thanks
Here is an idea using the zoo package. We basically fill NAs in both directions and set to NA the values that are not equal between those directions.
library(zoo)
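# na.rm = FALSE keeps leading/trailing NAs, so both vectors stay as long as df$v1
ind1 <- na.locf(df$v1, fromLast = TRUE, na.rm = FALSE)
df$v1 <- na.locf(df$v1, na.rm = FALSE)
df$v1[df$v1 != ind1] <- NA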
which gives,
v1 v2
1 1 2
2 1 3
3 1 2
4 1 1
5 NA 7
6 NA 2
7 3 1
Here is a similar approach in tidyverse using fill
library(tidyverse)
df1 %>%
mutate(vNew = v1) %>%
fill(vNew, .direction = 'up') %>%
fill(v1) %>%
mutate(v1 = replace(v1, v1 != vNew, NA)) %>%
select(-vNew)
# v1 v2
#1 1 2
#2 1 3
#3 1 2
#4 1 1
#5 NA 7
#6 NA 2
#7 3 1
Here is a base R solution, the logic is almost the same as Sotos's one:
replace_na <- function(x){
f <- function(x) ave(x, cumsum(!is.na(x)), FUN = function(x) x[1])
y <- f(x)
yp <- rev(f(rev(x)))
ifelse(!is.na(y) & y == yp, y, x)
}
df$v1 <- replace_na(df$v1)
test:
> replace_na(c(1, NA, NA, 1, NA, NA, 3))
[1] 1 1 1 1 NA NA 3
I was able to do this with the na.locf function from the zoo package. First, I use na.locf to replace each NA with the most recent previous non-NA value and store the result in one column. Then I use the same function with fromLast = TRUE, so that each NA is replaced with the next non-NA value, and store that in another column. Finally, I compare the two columns and, wherever they do not match, replace the value with NA.
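The two-direction fill described above can be traced step by step without zoo. The sketch below uses a hypothetical helper, locf, written only for illustration; it is not part of any package.

```r
# Hypothetical helper: carry the last non-NA value forward, keeping leading NAs
locf <- function(x) {
  seen <- cumsum(!is.na(x))        # number of non-NA values seen so far (0 = none yet)
  c(NA, x[!is.na(x)])[seen + 1]    # slot 1 is the NA used before any value appears
}

v1  <- c(1, NA, NA, 1, NA, NA, 3)
fwd <- locf(v1)                    # each NA filled from the previous non-NA
bwd <- rev(locf(rev(v1)))          # each NA filled from the next non-NA
result <- ifelse(fwd == bwd, fwd, NA)   # keep a fill only where both directions agree
result
```

Printing fwd and bwd side by side shows where the two directions disagree (rows 5 and 6 in this data), which are exactly the rows that stay NA.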

Maintain NA's after aggregation R

I have a data frame as follows
test_df<-data.frame(col1=c(1,NA,NA,4,5),col2=c(3,NA,NA,5,6),col3=c("a","b","c","d","c"))
test_df
col1 col2 col3
1 3 a
NA NA b
NA NA c
4 5 d
5 6 c
I am aggregating data based on col3
agg_test<-aggregate(list(test_df$col1,test_df$col2),by=list(test_df$col3),sum,na.rm=T)
agg_test
Col3 col1 col2
a 1 3
b 0 0
c 5 6
d 4 5
From what I know, for the summation to be correct we need to state explicitly what is to be done with NAs. In this case I have specified that NAs are to be removed from the summation; I guess internally R converts all NAs to 0 and sums up according to the by condition. I need to treat the NAs and 0s in my data differently and therefore have to keep the NAs that are valid (in this case the observations for b are NAs and not 0). How can I achieve this?
Expected output
Col3 col1 col2
a 1 3
b NA NA
c 5 6
d 4 5
library(data.table)
unique(setDT(test_df)[, lapply(.SD, function(x)
replace(x, !all(is.na(x)), sum(x, na.rm=TRUE))) , by=col3])
# col3 col1 col2
#1: a 1 3
#2: b NA NA
#3: c 5 6
#4: d 4 5
test_df1 <- test_df
test_df1$col2[2] <- 2
unique(setDT(test_df1)[, lapply(.SD, function(x)
replace(x, !all(is.na(x)), sum(x, na.rm=TRUE))) , by=col3])
# col3 col1 col2
#1: a 1 3
#2: b NA 2
#3: c 5 6
#4: d 4 5
Update
Or using the compact code suggested by @Arun
test_df1$col2[5] <- NA
setDT(test_df1)[, lapply(.SD,
function(x) sum(x,na.rm= !all(is.na(x)))), by=col3]
# col3 col1 col2
#1: a 1 3
#2: b NA 2
#3: c 5 NA
#4: d 4 5
It sounds like (based on your comments in response to requests for clarification) you want to aggregate your groups so that you get NA if all the values are missing, and otherwise the sum of the non-missing values. You can pass aggregate a user-defined function that has this behavior:
aggregate(list(test_df$col1,test_df$col2), by=list(test_df$col3),
function(x) ifelse(all(is.na(x)), NA, sum(x, na.rm=T)))
# Group.1 c.1..NA..NA..4..5. c.3..NA..NA..5..6.
# 1 a 1 3
# 2 b NA NA
# 3 c 5 6
# 4 d 4 5
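On the question's guess about NAs becoming 0: with na.rm = TRUE the NAs are dropped, and the sum of what remains (nothing, for group b) is 0. The sketch below shows that, and also a variant of the aggregate call that passes named columns so the output keeps readable column names. sum_or_na is a hypothetical helper name.

```r
test_df <- data.frame(col1 = c(1, NA, NA, 4, 5),
                      col2 = c(3, NA, NA, 5, 6),
                      col3 = c("a", "b", "c", "d", "c"))

# The sum of an empty set is 0, so an all-NA group becomes 0 under na.rm = TRUE
sum(c(NA, NA), na.rm = TRUE)

# Hypothetical helper: NA for an all-NA group, otherwise the sum of non-missing values
sum_or_na <- function(x) if (all(is.na(x))) NA_real_ else sum(x, na.rm = TRUE)

# Named inputs give the result the names col3, col1, col2
agg <- aggregate(test_df[c("col1", "col2")],
                 by = list(col3 = test_df$col3),
                 FUN = sum_or_na)
agg
```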

Keeping rows if any column matches one of a set of values

I have a simple question about subsetting using R; I think I am close but can't quite get it. Basically, I have 25 columns of interest and about 100 values. I want to keep any row that has ANY of those values in at least one of the columns. Simple example:
Values <- c(1,2,5)
col1 <- c(2,6,8,1,3,5)
col2 <- c(1,4,5,9,0,0)
col3 <- c('dog', 'cat', 'cat', 'pig', 'chicken', 'cat')
df <- cbind.data.frame(col1, col2, col3)
df1 <- subset(df, col1%in%Values)
(Note that the third column is to indicate that there are additional columns but I don't need to match the values to those; the rows retained only depend upon columns 1 and 2). I know that in this trivial case I could just add
| col2%in%Values
to get the additional rows from column 2, but with 25 columns I don't want to add an OR statement for every single one. I tried
file2011_test <- file2011[file2011[,9:33]%in%CO_codes] #real names of values
but it didn't work. (And yes I know this is mixing subsetting types; I find subset() easier to understand but I don't think it can help me with what I need?)
Maybe you can try:
df[Reduce(`|`, lapply(as.data.frame(df), function(x) x %in% Values)),]
# col1 col2
#[1,] 2 1
#[2,] 8 5
#[3,] 1 9
#[4,] 5 0
Or (this assumes df is a numeric matrix, as the row labels in the output suggest)
indx <- df %in% Values
dim(indx) <- dim(df)
df[!!rowSums(indx),]
# col1 col2
# [1,] 2 1
# [2,] 8 5
# [3,] 1 9
# [4,] 5 0
Update
Using the new dataset
df[Reduce(`|`, lapply(df[sapply(df, is.numeric)], function(x) x %in% Values)),]
# col1 col2 col3
#1 2 1 dog
#3 8 5 cat
#4 1 9 pig
#6 5 0 cat
Take a look at the data.table package. Its syntax is concise, and it is often much faster than base R on large data.
library(data.table)
df <- data.table(col1, col2, col3)
df[col1%in%Values | col2%in%Values]
# col1 col2 col3
#1: 2 1 dog
#2: 8 5 cat
#3: 1 9 pig
#4: 5 0 cat
If you want to do this for all columns, you can use:
df[rowSums(sapply(df, '%in%', Values) )>0]
# col1 col2 col3
#1: 2 1 dog
#2: 8 5 cat
#3: 1 9 pig
#4: 5 0 cat
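For the original case of a fixed block of columns (9:33 in the question's real data), a base-R sketch along the same lines is shown below. The question's attempt probably failed for two reasons: %in% applied to a whole data frame does not test it cell by cell, and the row index was missing its trailing comma. The names cols, hits and kept are only illustrative.

```r
Values <- c(1, 2, 5)
df <- data.frame(col1 = c(2, 6, 8, 1, 3, 5),
                 col2 = c(1, 4, 5, 9, 0, 0),
                 col3 = c("dog", "cat", "cat", "pig", "chicken", "cat"))

cols <- c("col1", "col2")                  # columns to search; could also be positions
hits <- sapply(df[cols], `%in%`, Values)   # logical matrix: one column per searched column
kept <- df[rowSums(hits) > 0, ]            # note the comma: this indexes rows
kept
```

Restricting the search to cols also avoids comparing Values against columns such as col3 that are not meant to be matched.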
