Select first non-NA value by row [duplicate] - r

This question already has answers here:
How to implement coalesce efficiently in R
(9 answers)
Closed 1 year ago.
I have data like this:
df <- data.frame(id=c(1, 2, 3, 4), A=c(6, NA, NA, 4), B=c(3, 2, NA, NA), C=c(4, 3, 5, NA), D=c(4, 3, 1, 2))
id A B C D
1 1 6 3 4 4
2 2 NA 2 3 3
3 3 NA NA 5 1
4 4 4 NA NA 2
For each row: If the row has non-NA values in column "A", I want that value to be entered into a new column 'E'. If it doesn't, I want to move on to column "B", and that value entered into E. And so on. Thus, the new column would be E = c(6, 2, 5, 4).
I wanted to use the ifelse function, but I am not quite sure how to do this.

tidyverse
library(dplyr)
mutate(df, E = coalesce(A, B, C, D))
# id A B C D E
# 1 1 6 3 4 4 6
# 2 2 NA 2 3 3 2
# 3 3 NA NA 5 1 5
# 4 4 4 NA NA 2 4
coalesce returns, element by element, the first non-NA value across the vectors it is given. It is the R counterpart of SQL's COALESCE.
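To make the element-by-element behaviour concrete, here is a minimal base R sketch of the same idea, built on the ifelse function the question mentions. The helper name first_non_na is hypothetical; it is not part of dplyr or base R.

```r
df <- data.frame(id = c(1, 2, 3, 4), A = c(6, NA, NA, 4), B = c(3, 2, NA, NA),
                 C = c(4, 3, 5, NA), D = c(4, 3, 1, 2))

# Hypothetical helper: keep x where it is present, otherwise fall back to y.
first_non_na <- function(x, y) ifelse(is.na(x), y, x)

# Fold the helper over the columns from left to right: A, then B, then C, then D.
E <- Reduce(first_non_na, df[c("A", "B", "C", "D")])
E
# 6 2 5 4
```

Because the fold works on whole columns, it stays vectorised no matter how many rows the data frame has.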
base R
df$E <- apply(df[,-1], 1, function(z) na.omit(z)[1])
df
# id A B C D E
# 1 1 6 3 4 4 6
# 2 2 NA 2 3 3 2
# 3 3 NA NA 5 1 5
# 4 4 4 NA NA 2 4
na.omit removes all of the NA values, and [1] returns just the first of the values that remain. The advantage of [1] over (say) head(., 1) is that head returns a zero-length vector when a row has no non-NA elements, whereas .[1] always returns one element, NA in that case (indicating to you that NA was the only option).
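The difference between [1] and head(., 1) only shows up when a row is entirely NA. A small sketch with a made-up two-row matrix (the names all_na, m, with_index and with_head are illustrative only):

```r
all_na <- c(NA_real_, NA_real_)

na.omit(all_na)[1]                # NA: still a length-one result
length(head(na.omit(all_na), 1))  # 0: nothing is returned

# Second row is entirely NA.
m <- rbind(c(1, NA), c(NA, NA))
with_index <- apply(m, 1, function(z) na.omit(z)[1])
with_head  <- apply(m, 1, function(z) head(na.omit(z), 1))

is.numeric(with_index)  # TRUE: one value per row, so apply returns a vector
is.list(with_head)      # TRUE: results of unequal length come back as a list
```

A list result could not be assigned cleanly to a new numeric column, which is why the [1] form is the safer choice here.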

Related

Replacing NA's with numbers based on the numbers preceding them in R data frame

I have the following two columns:
ID Number
A 1
A 1
B NA
C NA
C NA
D 3
D 3
D 3
F NA
G 6
H NA
I want the NAs in the Number column to be replaced by the next integer number following the Non-NA number that precedes them. That new number should then stay the same as long as the ID in the ID column doesn't change.
So for example, using the example columns above, if the number associated with ID "A" is 1, and ID "B" below it has NA's, I want those NA's replaced with the number 2. Then, if ID "C" has NA's, they should be replaced with 3. We move down to ID "D". This ID has the number 3 in the Number column, so nothing changes. ID "F" below it has NA's so they get replaced with 4, etc.
Here is what my data frame should look like:
ID Number
A 1
A 1
B 2
C 3
C 3
D 3
D 3
D 3
F 4
G 6
H 7
How would I be able to code this in R, preferably using dplyr?
I came up with the following, though I am not totally sure if the logic is correct, so please test it:
df <- data.frame(ID=c('A', 'A', 'B', 'C', 'C', 'D', 'D', 'D', 'F', 'G', 'H'),
Number=c(1, 1, NA, NA, NA, 3, 3, 3, NA, 6, NA))
library(zoo)
a <- as.integer(factor(df$ID))
b <- zoo::na.locf(a - df$Number)
df$Number <- a - b
Resulting in:
ID Number
1 A 1
2 A 1
3 B 2
4 C 3
5 C 3
6 D 3
7 D 3
8 D 3
9 F 4
10 G 6
11 H 7
Some explanation:
a simply relabels the groups with ascending integers: 1 1 2 3 3 4 4 4 5 6 7. This is almost the desired result, but we have to account for cases where the values in df$Number get ahead of/behind this running integer label.
b tracks the difference between a and df$Number with a forward fill (na.locf): 0 0 0 0 0 1 1 1 1 0 0. Places with a non-zero indicate the correction that should be applied to a to "reset" the running labels, based on the values observed in df$Number.
a - b applies the correction alluded to in the above point: 1 1 2 3 3 3 3 3 4 6 7.
One hiccup I noted is when the values start with NA: na.locf drops leading NAs by default, so it returns a vector shorter than the number of rows in the data frame. The fix I came up with was to manually prepend a value (0), forward fill, then remove the value, but G. Grothendieck in the comments provided a nicer way to fix this 😊:
# Note: the first 5 values are NA
df <- data.frame(ID=c('A', 'A', 'B', 'C', 'C', 'D', 'D', 'D', 'F', 'G', 'H'),
Number=c(NA, NA, NA, NA, NA, 3, 3, 3, NA, 6, NA))
library(zoo)
a <- as.integer(factor(df$ID))
b <- na.fill(na.locf0(a - df$Number), 0)
df$Number <- a - b
Result:
ID Number
1 A 1
2 A 1
3 B 2
4 C 3
5 C 3
6 D 3
7 D 3
8 D 3
9 F 4
10 G 6
11 H 7
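For readers without zoo, the same a and b logic can be sketched in base R. The helper ffill below is hypothetical: it is a hand-rolled forward fill that keeps leading NAs, standing in for na.locf0, and is not zoo's implementation.

```r
df <- data.frame(ID = c("A", "A", "B", "C", "C", "D", "D", "D", "F", "G", "H"),
                 Number = c(NA, NA, NA, NA, NA, 3, 3, 3, NA, 6, NA))

# Hypothetical helper: carry the last non-NA value forward, keeping leading NAs.
ffill <- function(x) {
  idx <- cummax(ifelse(is.na(x), 0L, seq_along(x)))  # position of the last non-NA so far
  c(rep(NA, sum(idx == 0L)), x[idx])                 # index 0 selects nothing, so pad with NA
}

a <- as.integer(factor(df$ID))  # running group label
b <- ffill(a - df$Number)       # forward-filled offset between label and Number
b[is.na(b)] <- 0                # no offset before the first observed Number
filled <- a - b
filled
# 1 1 2 3 3 3 3 3 4 6 7
```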
Data
ID <- c("A","A","B","C","C","D","D","D","F","G","H")
Number <- c(1,1,NA_real_,NA_real_,NA_real_,3,3,3,NA_real_,6,NA_real_)
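df <- data.frame(ID, Number, stringsAsFactors = FALSE)
dplyr approach
library(dplyr)
library(tidyr)  # for fill()
# Work on the first row of each ID
df2 <- df[!duplicated(df$ID), ] %>%
  mutate(Number2 = ifelse(is.na(Number), 1, 0)) %>%  # 1 marks an ID with a missing Number
  group_by(grp = cumsum(Number2 == 0)) %>%           # each known Number starts a new group
  mutate(cumulative = cumsum(Number2)) %>%           # missing IDs seen since that Number
  ungroup() %>%
  fill(Number) %>%                                   # carry the last known Number forward
  mutate(Number = Number + cumulative) %>%
  select(ID, Number)
# Map the per-ID result back onto every row of the original data
base::merge(df %>% select(-Number), df2, by = "ID", all.x = TRUE)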
ID Number
1 A 1
2 A 1
3 B 2
4 C 3
5 C 3
6 D 3
7 D 3
8 D 3
9 F 4
10 G 6
11 H 7
or in one LONG line:
df%>%select(-Number)%>%merge(df[!df$ID%>%duplicated(),]%>%
mutate(Number2=ifelse(is.na(Number),1,0))%>%
group_by(grp=cumsum(Number2==0))%>%
mutate(cumulative=cumsum(Number2))%>%
ungroup%>%
fill(Number)%>%
mutate(Number=Number+cumulative)%>%
select(ID,Number),
by = "ID", all.x = TRUE)
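original answer (lag() here is dplyr::lag, so dplyr must be loaded):
df2 <- df[!duplicated(df$ID), ]
# Each pass fills the NAs that directly follow a known value; repeat until none remain.
# Note: this never terminates if the first Number is NA.
while (anyNA(df2$Number)) {
  df2$Number[is.na(df2$Number)] <- (lag(df2$Number) + 1)[is.na(df2$Number)]
}
base::merge(df %>% select(-Number), df2, by = "ID", all.x = TRUE)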
ID Number
1 A 1
2 A 1
3 B 2
4 C 3
5 C 3
6 D 3
7 D 3
8 D 3
9 F 4
10 G 6
11 H 7
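The per-ID idea behind the loop above can also be written as a single pass in base R. This is only a sketch: the names ids and num are illustrative, and it assumes that a leading NA should start the sequence at 1, which the question does not specify.

```r
df <- data.frame(ID = c("A", "A", "B", "C", "C", "D", "D", "D", "F", "G", "H"),
                 Number = c(1, 1, NA, NA, NA, 3, 3, 3, NA, 6, NA))

ids <- unique(df$ID)
num <- df$Number[match(ids, df$ID)]  # first Number seen for each ID

for (i in seq_along(num)) {
  if (is.na(num[i])) {
    # Assumption: a leading NA starts the sequence at 1.
    num[i] <- if (i == 1) 1 else num[i - 1] + 1
  }
}

df$Number <- num[match(df$ID, ids)]  # expand back to one value per row
df$Number
# 1 1 2 3 3 3 3 3 4 6 7
```

The loop runs once per distinct ID rather than once per row, and it keeps the original row order because no merge is involved.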

Iterate over a column ignoring but retaining NA values in R

I have a time series data frame in R that has a column, V1, which consists of integers with a few NAs interspersed throughout. I want to iterate over this column and subtract V1 from itself one time step previously. However, I want to ignore the NA values in V1 and use the last non-NA value in the subtraction. If the current value of V1 is NA, then the difference should return NA. See below for an example
V1 <- c(1, 3, 4, NA, NA, 6, 9, NA, 10)
time <- 1:length(V1)
dat <- data.frame(time = time,
V1 = V1)
lag_diff <- c(NA, 2, 1, NA, NA, 2, 3, NA, 1) # The result I want
diff(dat$V1) # Not the result I want
I'd prefer not to do this with loops because I have hundreds of data frames, each with >10,000 rows.
My first thought to solve this was to filter out the NA rows, perform the iterative difference calculation and then reinsert the rows that were filtered out but I can't think of a way to do that. It doesn't seem very "tidy" to do it that way either and I'm not sure it would be faster than looping. Any help is appreciated, bonus points if the solution uses tidyverse functions.
dat[!is.na(dat$V1), 'lag_diff'] <- c(NA, diff(dat[!is.na(dat$V1), 'V1']))
# time V1 lag_diff
# 1 1 1 NA
# 2 2 3 2
# 3 3 4 1
# 4 4 NA NA
# 5 5 NA NA
# 6 6 6 2
# 7 7 9 3
# 8 8 NA NA
# 9 9 10 1
Or with data.table (same result)
library(data.table)
setDT(dat)
dat[!is.na(V1), lag_diff := V1 - shift(V1)]
# time V1 lag_diff
# 1: 1 1 NA
# 2: 2 3 2
# 3: 3 4 1
# 4: 4 NA NA
# 5: 5 NA NA
# 6: 6 6 2
# 7: 7 9 3
# 8: 8 NA NA
# 9: 9 10 1
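A tidyverse version, just in case. It does need a filter, though:
library(dplyr)
dat %>%
  filter(!is.na(V1)) %>%
  mutate(diff = V1 - lag(V1)) %>%
  right_join(dat, by = c("time", "V1")) %>%  # bring the NA rows back
  arrange(time)                              # the join may place the re-added rows last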
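Since the question mentions hundreds of data frames, the base R one-liner above can be wrapped in a function and applied over a list. This is a sketch only: the function name lag_diff_skip_na and the second data frame are made up for illustration.

```r
# Difference from the last non-NA value; NA wherever the current value is NA.
lag_diff_skip_na <- function(v) {
  out <- rep(NA_real_, length(v))
  ok <- !is.na(v)
  out[ok] <- c(NA, diff(v[ok]))  # diff over the non-NA values only
  out
}

V1 <- c(1, 3, 4, NA, NA, 6, 9, NA, 10)
dats <- list(
  first  = data.frame(time = seq_along(V1), V1 = V1),
  second = data.frame(time = 1:3, V1 = c(5, NA, 8))  # hypothetical extra data frame
)

# Add the lag_diff column to every data frame in the list.
dats <- lapply(dats, function(d) {
  d$lag_diff <- lag_diff_skip_na(d$V1)
  d
})
dats$first$lag_diff
# NA  2  1 NA NA  2  3 NA  1
```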

Replicate rows with missing values and replace missing values by vector

I have a dataframe in which a column has some missing values.
I would like to replicate the rows with the missing values N times, where N is the length of a vector which contains replacements for the missing values.
I first define a replacement vector, then my starting data.frame, then my desired result and finally my attempt to solve it. Unfortunately that didn't work...
> replace_values <- c('A', 'B', 'C')
> data.frame(value = c(3, 4, NA, NA), result = c(5, 3, 1,2))
value result
1 3 5
2 4 3
3 NA 1
4 NA 2
> data.frame(value = c(3, 4, replace_values, replace_values), result = c(5, 3, rep(1, 3),rep(2, 3)))
value result
1 3 5
2 4 3
3 A 1
4 B 1
5 C 1
6 A 2
7 B 2
8 C 2
> t <- data.frame(value = c(3, 4, NA, NA), result = c(5, 3, 1,2))
> mutate(t, value = ifelse(is.na(value), replace_values, value))
value result
1 3 5
2 4 3
3 C 1
4 A 2
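You can try a tidyverse solution (d is the starting data.frame, called t in the question):
library(dplyr)
library(tidyr)  # for separate_rows()
d <- data.frame(value = c(3, 4, NA, NA), result = c(5, 3, 1, 2))
d %>%
  mutate(value = ifelse(is.na(value), paste0(replace_values, collapse = ","), value)) %>%
  separate_rows(value, sep = ",") %>%
  select(value, everything())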
value result
1 3 5
2 4 3
3 A 1
4 B 1
5 C 1
6 A 2
7 B 2
8 C 2
The idea is to replace each NA with the comma-collapsed 'replace_values', then split the collapsed values into separate rows using tidyr's separate_rows function. Finally, select() puts the columns in the order of your expected output.
We can do an rbind here using base R (df1 is the starting data.frame). Create a logical vector that is TRUE where 'value' is NA ('i1') and count the NA elements by summing it ('n'). Then build a data.frame that repeats 'replace_values' 'n' times and repeats each 'result' element belonging to an NA 'value' length(replace_values) times, and 'rbind' it with the rows of the dataset where 'value' is not NA.
i1 <- is.na(df1$value)
n <- sum(i1)
rbind(df1[!i1,],
data.frame(value = rep(replace_values, n),
result = rep(df1$result[i1], each = length(replace_values))))
# value result
#1 3 5
#2 4 3
#3 A 1
#4 B 1
#5 C 1
#6 A 2
#7 B 2
#8 C 2
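The rbind approach places all expanded rows after the complete ones, which matches the example because the NA rows are already last. If the NA rows were interleaved and the original order mattered, one option is to expand by row index instead. A sketch, using a made-up df1 in which the NA rows are not at the end:

```r
replace_values <- c("A", "B", "C")
# Made-up variant of the example data with the NA rows interleaved.
df1 <- data.frame(value = c(3, NA, 4, NA), result = c(5, 1, 3, 2))

# Repeat each NA row once per replacement value, every other row once.
times <- ifelse(is.na(df1$value), length(replace_values), 1)
out <- df1[rep(seq_len(nrow(df1)), times), ]

# Fill the repeated NA slots; this coerces the column to character.
out$value[is.na(out$value)] <- rep(replace_values, sum(is.na(df1$value)))
rownames(out) <- NULL
out
```

As with the other answers, the value column ends up as character, because it mixes numbers and letters.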

Reduce two rows into a single row in R [duplicate]

This question already has answers here:
Collapsing rows where some are all NA, others are disjoint with some NAs
(5 answers)
Closed 6 years ago.
I have a situation like this:
df<-data.frame(A=c(1, NA), B=c(NA, 2), C=c(3, NA), D=c(4, NA), E=c(NA, 5))
df
A B C D E
1 1 NA 3 4 NA
2 NA 2 NA NA 5
What I want, on the condition that every column has exactly one non-NA value, is to reduce df to:
df
A B C D E
1 1 2 3 4 5
As long as every column has the same number of non-NA values, you can use:
dfNew <- do.call(data.frame, lapply(df, function(i) i[!is.na(i)]))
which results in
dfNew
A B C D E
1 1 2 3 4 5
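If you would rather have the code fail loudly when that condition does not hold, a guarded variant might look like the sketch below. It assumes numeric columns with exactly one non-NA value each; the name vals is illustrative.

```r
df <- data.frame(A = c(1, NA), B = c(NA, 2), C = c(3, NA), D = c(4, NA), E = c(NA, 5))

# Guard: every column must contain exactly one non-NA value.
stopifnot(all(colSums(!is.na(df)) == 1))

# vapply also errors if any column yields anything other than a single number.
vals <- vapply(df, function(col) col[!is.na(col)], numeric(1))
dfNew <- as.data.frame(as.list(vals))
dfNew
#   A B C D E
# 1 1 2 3 4 5
```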

Removing rows with NA in R [duplicate]

This question already has answers here:
Remove rows with all or some NAs (missing values) in data.frame
(18 answers)
Closed 5 years ago.
I have a dataframe with 2500 rows. A few of the rows have NAs (an excessive number of NAs), and I want to remove those rows.
I've searched the SO archives, and come up with this as the most likely solution:
df2 <- df[df[, 12] != NA,]
But when I run it and look at df2, all I see is a screen full of NAs (and <NA>s).
Any suggestions?
Depending on what you're looking for, one of the following should help you on your way:
Some sample data to start with:
mydf <- data.frame(A = c(1, 2, NA, 4), B = c(1, NA, 3, 4),
C = c(1, NA, 3, 4), D = c(NA, 2, 3, 4),
E = c(NA, 2, 3, 4))
mydf
# A B C D E
# 1 1 1 1 NA NA
# 2 2 NA NA 2 2
# 3 NA 3 3 3 3
# 4 4 4 4 4 4
If you wanted to remove rows just according to a few specific columns, you can use complete.cases or the solution suggested by @SimonO101 in the comments. Here, I'm removing rows which have an NA in the first column.
mydf[complete.cases(mydf$A), ]
# A B C D E
# 1 1 1 1 NA NA
# 2 2 NA NA 2 2
# 4 4 4 4 4 4
mydf[!is.na(mydf[, 1]), ]
# A B C D E
# 1 1 1 1 NA NA
# 2 2 NA NA 2 2
# 4 4 4 4 4 4
If, instead, you wanted to set a threshold, as in "keep only the rows that have fewer than 2 NA values" (without caring which columns the NA values are in), you can try something like this:
mydf[rowSums(is.na(mydf)) < 2, ]
# A B C D E
# 3 NA 3 3 3 3
# 4 4 4 4 4 4
On the other extreme, if you want to delete all rows that have any NA values, just use complete.cases:
mydf[complete.cases(mydf), ]
# A B C D E
# 4 4 4 4 4 4
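As for why the attempt in the question produced a screen full of NAs: any comparison with NA is itself NA, and indexing rows with NA yields all-NA rows. A small sketch with a made-up data frame (toy, broken and fixed are illustrative names):

```r
x <- c(10, NA, 30)

x != NA    # NA NA NA: comparing anything with NA gives NA
is.na(x)   # FALSE TRUE FALSE: the test to use instead

toy <- data.frame(x = x, y = c("a", "b", "c"))

broken <- toy[toy$x != NA, ]    # every index is NA, so every returned row is NA
fixed  <- toy[!is.na(toy$x), ]  # keeps the two rows where x is present

nrow(broken)  # 3
nrow(fixed)   # 2
```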
