Removing rows with NA in R [duplicate]

This question already has answers here:
Remove rows with all or some NAs (missing values) in data.frame
(18 answers)
Closed 5 years ago.
I have a dataframe with 2500 rows. A few of the rows have NAs (an excessive number of them, actually), and I want to remove those rows.
I've searched the SO archives, and come up with this as the most likely solution:
df2 <- df[df[, 12] != NA,]
But when I run it and look at df2, all I see is a screen full of NAs (and <NA>s).
Any suggestions?

Depending on what you're looking for, one of the following should help you on your way:
Some sample data to start with:
mydf <- data.frame(A = c(1, 2, NA, 4), B = c(1, NA, 3, 4),
                   C = c(1, NA, 3, 4), D = c(NA, 2, 3, 4),
                   E = c(NA, 2, 3, 4))
mydf
# A B C D E
# 1 1 1 1 NA NA
# 2 2 NA NA 2 2
# 3 NA 3 3 3 3
# 4 4 4 4 4 4
If you wanted to remove rows according to just a few specific columns, you can use complete.cases or the solution suggested by @SimonO101 in the comments. Here, I'm removing rows which have an NA in the first column.
mydf[complete.cases(mydf$A), ]
# A B C D E
# 1 1 1 1 NA NA
# 2 2 NA NA 2 2
# 4 4 4 4 4 4
mydf[!is.na(mydf[, 1]), ]
# A B C D E
# 1 1 1 1 NA NA
# 2 2 NA NA 2 2
# 4 4 4 4 4 4
If, instead, you wanted to set a threshold, as in "keep only the rows that have fewer than 2 NA values" (not caring which columns the NA values are in), you can try something like this:
mydf[rowSums(is.na(mydf)) < 2, ]
# A B C D E
# 3 NA 3 3 3 3
# 4 4 4 4 4 4
On the other extreme, if you want to delete all rows that have any NA values, just use complete.cases:
mydf[complete.cases(mydf), ]
# A B C D E
# 4 4 4 4 4 4
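As for why the original attempt returned a screen full of NAs: comparing anything to NA with == or != yields NA, never TRUE or FALSE, so the row index in df[df[, 12] != NA, ] is all-NA. A quick illustration (df[, 12] stands in for whichever column you filter on):
c(1, NA, 3) != NA
# [1] NA NA NA
df2 <- df[!is.na(df[, 12]), ]  # use is.na() instead of != NA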

Related

Replacing NA's with numbers based on the numbers preceding them in R data frame

I have the following two columns:
ID Number
A 1
A 1
B NA
C NA
C NA
D 3
D 3
D 3
F NA
G 6
H NA
I want the NAs in the Number column to be replaced by the next integer after the non-NA number that precedes them. That new number should then stay the same as long as the ID in the ID column doesn't change.
So, for example, using the columns above: the number associated with ID "A" is 1, and ID "B" below it has NAs, so those NAs get replaced with the number 2. ID "C" also has NAs, so they get replaced with 3. We move down to ID "D". This ID has the number 3 in the Number column, so nothing changes. ID "F" below it has NAs, so they get replaced with 4, etc.
Here is what my data frame should look like:
ID Number
A 1
A 1
B 2
C 3
C 3
D 3
D 3
D 3
F 4
G 6
H 7
How would I be able to code this in R, preferably using dplyr?
I came up with the following, though I am not totally sure if the logic is correct, so please test it:
df <- data.frame(ID = c('A', 'A', 'B', 'C', 'C', 'D', 'D', 'D', 'F', 'G', 'H'),
                 Number = c(1, 1, NA, NA, NA, 3, 3, 3, NA, 6, NA))
library(zoo)
a <- as.integer(factor(df$ID))
b <- zoo::na.locf(a - df$Number)
df$Number <- a - b
Resulting in:
ID Number
1 A 1
2 A 1
3 B 2
4 C 3
5 C 3
6 D 3
7 D 3
8 D 3
9 F 4
10 G 6
11 H 7
Some explanation:
a simply relabels the groups with ascending integers: 1 1 2 3 3 4 4 4 5 6 7. This is almost the desired result, but we have to account for cases where the values in df$Number get ahead of/behind this running integer label.
b tracks the difference between a and df$Number with a forward fill (na.locf): 0 0 0 0 0 1 1 1 1 0 0. Non-zero entries indicate the correction that should be applied to a to "reset" the running labels, based on the values observed in df$Number.
a - b applies the correction alluded to in the above point: 1 1 2 3 3 3 3 3 4 6 7.
One hiccup I noted is if the values start with NA; in that case using na.locf will return something smaller than the length of the dataframe. The fix I came up with was to manually prepend a value (0), forward fill, then remove the value, but G. Grothendieck in the comments provided a nicer way to fix this 😊:
# Note: the first 5 values are NA
df <- data.frame(ID = c('A', 'A', 'B', 'C', 'C', 'D', 'D', 'D', 'F', 'G', 'H'),
                 Number = c(NA, NA, NA, NA, NA, 3, 3, 3, NA, 6, NA))
library(zoo)
a <- as.integer(factor(df$ID))
b <- na.fill(na.locf0(a - df$Number), 0)
df$Number <- a - b
Result:
ID Number
1 A 1
2 A 1
3 B 2
4 C 3
5 C 3
6 D 3
7 D 3
8 D 3
9 F 4
10 G 6
11 H 7
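Since the question asks for dplyr where possible, the same trick drops straight into a mutate call. A minimal sketch, not part of the original answer, assuming dplyr and zoo are attached (grp is a helper column name introduced here):
library(dplyr)
library(zoo)
df %>%
  mutate(grp = as.integer(factor(ID)),                          # running integer label per ID
         Number = grp - na.fill(na.locf0(grp - Number), 0)) %>% # same correction as above
  select(-grp)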
Data
ID <- c("A", "A", "B", "C", "C", "D", "D", "D", "F", "G", "H")
Number <- c(1, 1, NA_real_, NA_real_, NA_real_, 3, 3, 3, NA_real_, 6, NA_real_)
df <- data.frame(ID, Number, stringsAsFactors = FALSE)
dplyr approach
library(dplyr)
library(tidyr)  # fill() comes from tidyr
df2 <- df[!df$ID %>% duplicated(), ] %>%
  mutate(Number2 = ifelse(is.na(Number), 1, 0)) %>%
  group_by(grp = cumsum(Number2 == 0)) %>%
  mutate(cumulative = cumsum(Number2)) %>%
  ungroup %>%
  fill(Number) %>%
  mutate(Number = Number + cumulative) %>%
  select(ID, Number)
base::merge(df %>% select(-Number), df2, by = "ID", all.x = TRUE)
ID Number
1 A 1
2 A 1
3 B 2
4 C 3
5 C 3
6 D 3
7 D 3
8 D 3
9 F 4
10 G 6
11 H 7
or in one LONG line:
df %>% select(-Number) %>%
  merge(df[!df$ID %>% duplicated(), ] %>%
          mutate(Number2 = ifelse(is.na(Number), 1, 0)) %>%
          group_by(grp = cumsum(Number2 == 0)) %>%
          mutate(cumulative = cumsum(Number2)) %>%
          ungroup %>%
          fill(Number) %>%
          mutate(Number = Number + cumulative) %>%
          select(ID, Number),
        by = "ID", all.x = TRUE)
original answer:
df2 <- df[!df$ID %>% duplicated(), ]
# repeatedly fill each NA with (previous value + 1) until none remain
while (sum(is.na(df2$Number)) != 0) {
  df2$Number[is.na(df2$Number)] <- c(lag(df2$Number) + 1)[is.na(df2$Number)]
}
base::merge(df %>% select(-Number), df2, by = "ID", all.x = TRUE)
ID Number
1 A 1
2 A 1
3 B 2
4 C 3
5 C 3
6 D 3
7 D 3
8 D 3
9 F 4
10 G 6
11 H 7

Select first non-NA value by row [duplicate]

This question already has answers here:
How to implement coalesce efficiently in R
(9 answers)
Closed 1 year ago.
I have data like this:
df <- data.frame(id=c(1, 2, 3, 4), A=c(6, NA, NA, 4), B=c(3, 2, NA, NA), C=c(4, 3, 5, NA), D=c(4, 3, 1, 2))
id A B C D
1 1 6 3 4 4
2 2 NA 2 3 3
3 3 NA NA 5 1
4 4 4 NA NA 2
For each row: if the row has a non-NA value in column "A", I want that value entered into a new column E. If it doesn't, I want to move on to column "B" and enter that value into E. And so on. Thus, the new column would be E = c(6, 2, 5, 4).
I wanted to use the ifelse function, but I am not quite sure how to do this.
tidyverse
library(dplyr)
mutate(df, E = coalesce(A, B, C, D))
# id A B C D E
# 1 1 6 3 4 4 6
# 2 2 NA 2 3 3 2
# 3 3 NA NA 5 1 5
# 4 4 4 NA NA 2 4
coalesce effectively returns the first non-NA value, element-wise, across its vector arguments; it is R's equivalent of SQL's COALESCE.
base R
df$E <- apply(df[,-1], 1, function(z) na.omit(z)[1])
df
# id A B C D E
# 1 1 6 3 4 4 6
# 2 2 NA 2 3 3 2
# 3 3 NA NA 5 1 5
# 4 4 4 NA NA 2 4
na.omit removes the NA values, and [1] makes sure we always return just the first of the remaining values. The advantage of [1] over (say) head(., 1) is that head returns a zero-length vector if there are no non-NA elements, whereas [1] will always return at least an NA (indicating to you that NA was the only option).
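A quick way to see that edge case, using a hypothetical all-NA row (the sample data has none):
z <- c(NA, NA)
na.omit(z)[1]        # NA -- indexing past the end still yields a single NA
head(na.omit(z), 1)  # logical(0) -- zero-length, which would break apply()'s simplification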

How can I insert blank rows every 3 existing rows in a data frame?

After a web scraping process I get a dataframe with the information I need, however the final excel format requires that I add a blank row every 3 rows. I have searched the web for help but have not found a solution yet.
With hypothetical data, the structure of my data frame is as follows:
mi_df <- data.frame(
  "ID" = rep(1:3, c(3, 3, 3)),
  "X" = as.character(c("a", "a", "a", "b", "b", "b", "c", "c", "c")),
  "Y" = seq(1, 18, by = 2)
)
mi_df
ID X Y
1 1 a 1
2 1 a 3
3 1 a 5
4 2 b 7
5 2 b 9
6 2 b 11
7 3 c 13
8 3 c 15
9 3 c 17
The result I hope for is something like this
ID X Y
1 1 a 1
2 1 a 3
3 1 a 5
4
5 2 b 7
6 2 b 9
7 2 b 11
8
9 3 c 13
10 3 c 15
11 3 c 17
If the row indices used to subset a data frame contain NA, the output will contain NA rows. So my goal is to create a vector like 1 2 3 NA 4 5 6 NA ... and use it to index mi_df.
cut <- rep(1:(nrow(mi_df)/3), each = 3)
mi_df[sapply(split(1:nrow(mi_df), cut), c, NA), ]
# ID X Y
# 1 1 a 1
# 2 1 a 3
# 3 1 a 5
# NA NA <NA> NA
# 4 2 b 7
# 5 2 b 9
# 6 2 b 11
# NA.1 NA <NA> NA
# 7 3 c 13
# 8 3 c 15
# 9 3 c 17
# NA.2 NA <NA> NA
If nrow(mi_df) is not a multiple of 3, then the following is a general solution:
# Version 1
cut <- rep(1:ceiling(nrow(mi_df)/3), each = 3, len = nrow(mi_df))
mi_df[Reduce(c, lapply(split(1:nrow(mi_df), cut), c, NA)), ]
# Version 2
cut <- rep(1:ceiling(nrow(mi_df)/3), each = 3, len = nrow(mi_df))
mi_df[Reduce(function(x, y) c(x, NA, y), split(1:nrow(mi_df), cut)), ]
Don't mind the NA rows in the output, because some functions that write data to an Excel file have an optional argument controlling whether NA values are converted to strings or left as empty cells. E.g.
library(openxlsx)
write.xlsx(df, "test.xlsx", keepNA = FALSE) # defaults to FALSE
tmp <- split(mi_df, rep(1:(nrow(mi_df) / 3), each = 3))
# or split(mi_df, ggplot2::cut_width(seq_len(nrow(mi_df)), 3, center = 2))
# assigning to row 4 of each 3-row block appends an all-NA row
do.call(rbind, lapply(tmp, function(x) { x[4, ] <- NA; x }))
ID X Y
1.1 1 a 1
1.2 1 a 3
1.3 1 a 5
1.4 NA <NA> NA
2.4 2 b 7
2.5 2 b 9
2.6 2 b 11
2.4.1 NA <NA> NA
3.7 3 c 13
3.8 3 c 15
3.9 3 c 17
3.4 NA <NA> NA
You can make empty rows like you show by assigning an empty character vector ("") instead of NA, but this will convert your columns to character, and I wouldn't recommend it.
My recommendation is somewhat different from all the other answers: don't make a mess of your dataset inside R. Use the existing packages to write to designated rows in an Excel workbook. For example, with the package XLConnect, the method writeWorksheet (called from writeWorksheetToFile) includes these arguments:
object: The workbook to write to
data: Data to write
sheet: The name or index of the sheet to write to
startRow: Index of the first row to write to. The default is startRow = 1.
startCol: Index of the first column to write to. The default is startCol = 1.
So if you simply set up a loop that writes 3 rows of your data file at a time, then moves the row index down by 4 and writes the next 3 rows, etc., you're all set.
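A minimal sketch of that loop, assuming the XLConnect package (the file and sheet names here are placeholders):
library(XLConnect)
blocks <- split(seq_len(nrow(mi_df)), ceiling(seq_len(nrow(mi_df)) / 3))
for (k in seq_along(blocks)) {
  # each block of 3 data rows starts 4 rows after the previous one,
  # leaving one blank row in between
  writeWorksheetToFile("mi_df.xlsx", mi_df[blocks[[k]], ],
                       sheet = "Sheet1",
                       startRow = (k - 1) * 4 + 1,
                       header = FALSE)
}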
Here's one method: split into a list by ID, add an empty row to each piece, then bind the list back into a data frame.
mi_df2 <- do.call(rbind, Map(rbind, split(mi_df, mi_df$ID), rep("", 3)))
rownames(mi_df2) <- NULL
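Note that rbind-ing "" rather than NA pads each block with empty strings, which coerces all three columns to character (the caveat mentioned above), so check the column types if they matter downstream.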

Cumulative product in R across columns

I have a dataframe in the following format
> x <- data.frame("a" = c(1,1),"b" = c(2,2),"c" = c(3,4))
> x
a b c
1 1 2 3
2 1 2 4
I'd like to add 3 new columns forming a cumulative product of the columns a, b, c; however, I need a reverse cumulative product, i.e. the output should be
row 1:
result_d = 1*2*3 = 6, result_e = 2*3 = 6, result_f = 3
and similarly for row 2
The end result will be
a b c result_d result_e result_f
1 1 2 3 6 6 3
2 1 2 4 8 8 4
The column names do not matter; this is just an example. Does anyone have any idea how to do this?
Edit: as per my comment, is it possible to do this on a subset of columns, e.g. only for columns b and c, to return:
a b c results_e results_f
1 1 2 3 6 3
2 1 2 4 8 4
so that column "a" is effectively ignored?
One option is to loop through the rows, apply cumprod over the reversed elements, and then reverse the result:
nm1 <- paste0("result_", c("d", "e", "f"))
x[nm1] <- t(apply(x, 1, function(z) rev(cumprod(rev(z)))))
x
# a b c result_d result_e result_f
#1 1 2 3 6 6 3
#2 1 2 4 8 8 4
Or a vectorized option is rowCumprods:
library(matrixStats)
# (applied to the original three-column x)
x[nm1] <- rowCumprods(as.matrix(x[ncol(x):1]))[, ncol(x):1]
Another option uses Reduce with accumulate = TRUE (again on the original three-column x):
temp <- data.frame(Reduce("*", x[NCOL(x):1], accumulate = TRUE))
setNames(cbind(x, temp[NCOL(temp):1]),
         c(names(x), c("res_d", "res_e", "res_f")))
# a b c res_d res_e res_f
#1 1 2 3 6 6 3
#2 1 2 4 8 8 4
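As for the follow-up about restricting this to a subset of columns: apply the same row-wise reverse cumprod to just those columns. A sketch, not from the original answers, using the results_e/results_f names from the question:
nm2 <- paste0("results_", c("e", "f"))
x[nm2] <- t(apply(x[c("b", "c")], 1, function(z) rev(cumprod(rev(z)))))
x[c("a", "b", "c", nm2)]
#   a b c results_e results_f
# 1 1 2 3         6         3
# 2 1 2 4         8         4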

Reduce two rows into a single row in R [duplicate]

This question already has answers here:
Collapsing rows where some are all NA, others are disjoint with some NAs
(5 answers)
Closed 6 years ago.
I have a situation like this:
df <- data.frame(A = c(1, NA), B = c(NA, 2), C = c(3, NA), D = c(4, NA), E = c(NA, 5))
df
A B C D E
1 1 NA 3 4 NA
2 NA 2 NA NA 5
What I want is, given that every column has exactly one non-NA value (i.e. sum(!is.na(df$x)) == 1 for each column x), to reduce df to:
df
A B C D E
1 1 2 3 4 5
As long as each column contributes the same number of non-NA values (here, exactly one, so the resulting columns line up into rows of equal length), you can use:
dfNew <- do.call(data.frame, lapply(df, function(i) i[!is.na(i)]))
which results in
dfNew
A B C D E
1 1 2 3 4 5
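An equivalent one-liner under the same one-non-NA-per-column assumption, using na.omit (a variant, not from the original answer):
dfNew <- as.data.frame(lapply(df, na.omit))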
