R - Replacing Specific Columns' Data - r

I'm connecting to my Vertica database and retrieving a huge amount of data. There are NAs in every column of the dataset, but I only want to find the NAs in specific columns and replace them with 0.
How should I do that?
Thanks !

To expand on my comment and make it into an answer, here's a minimal reproducible example:
set.seed(1)
mydf <- as.data.frame(matrix(sample(c(1:2, NA), 50, replace = TRUE), ncol = 10))
mydf
# V1 V2 V3 V4 V5 V6 V7 V8 V9 V10
# 1 1 NA 1 2 NA 2 2 NA NA NA
# 2 2 NA 1 NA 1 1 2 NA 2 1
# 3 2 2 NA NA 2 2 2 1 NA 2
# 4 NA 2 2 2 1 NA 1 NA 2 NA
# 5 1 1 NA NA 1 2 NA 2 2 NA
Now, if we wanted to replace NA with 0, but only in columns 1, 3, 7, and 8, we could use:
mydf[c(1, 3, 7, 8)][is.na(mydf[c(1, 3, 7, 8)])] <- 0
mydf
# V1 V2 V3 V4 V5 V6 V7 V8 V9 V10
# 1 1 NA 1 2 NA 2 2 0 NA NA
# 2 2 NA 1 NA 1 1 2 0 2 1
# 3 2 2 0 NA 2 2 2 1 NA 2
# 4 0 2 2 2 1 NA 1 0 2 NA
# 5 1 1 0 NA 1 2 0 2 2 NA
Instead of numeric column positions, you can use a vector of column names, which is safer because it keeps working if the column order changes. Your code will also be easier to read if the column names or positions you're working on are stored in a separate vector. Both ideas are demonstrated below, where we replace NA values in variables "V2", "V4" and "V5" with -999.
changeMe <- c("V2", "V4", "V5")
mydf[changeMe][is.na(mydf[changeMe])] <- -999
mydf
# V1 V2 V3 V4 V5 V6 V7 V8 V9 V10
# 1 1 -999 1 2 -999 2 2 0 NA NA
# 2 2 -999 1 -999 1 1 2 0 2 1
# 3 2 2 0 -999 2 2 2 1 NA 2
# 4 0 2 2 2 1 NA 1 0 2 NA
# 5 1 1 0 -999 1 2 0 2 2 NA
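The same replacement can also be written column by column with lapply() and replace(). This is only a minimal sketch on made-up data: the data frame df and the column names in cols are assumptions for illustration, not part of the original example.

```r
# Hypothetical example data: three numeric columns with some missing values
df <- data.frame(a = c(1, NA, 3), b = c(NA, 2, NA), c = c(NA, NA, 9))

# Columns in which NA should be replaced (assumed names)
cols <- c("a", "b")

# Replace NA with 0 one column at a time; all other columns are left untouched
df[cols] <- lapply(df[cols], function(x) replace(x, is.na(x), 0))
df
```

Working one column at a time avoids building a logical matrix for the whole selection, which may matter when the data set is large.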

Related

r - Lag a data.frame by the number of NAs

In other words, I am trying to lag a data.frame that looks like this:
V1 V2 V3 V4 V5 V6
1 1 1 1 1 1
2 2 2 2 2 NA
3 3 3 3 NA NA
4 4 4 NA NA NA
5 5 NA NA NA NA
6 NA NA NA NA NA
To something that looks like this:
V1 V2 V3 V4 V5 V6
1 NA NA NA NA NA
2 1 NA NA NA NA
3 2 1 NA NA NA
4 3 2 1 NA NA
5 4 3 2 1 NA
6 5 4 3 2 1
So far, I have used a function that counts the number of NAs, and have tried to lag each column in my data.frame by the corresponding number of NAs in that column.
V1 <- c(1,2,3,4,5,6)
V2 <- c(1,2,3,4,5,NA)
V3 <- c(1,2,3,4,NA,NA)
V4 <- c(1,2,3,NA,NA,NA)
V5 <- c(1,2,NA,NA,NA,NA)
V6 <- c(1,NA,NA,NA,NA,NA)
mydata <- cbind(V1,V2,V3,V4,V5,V6)
na.count <- colSums(is.na(mydata))
lag.by <- function(mydata, na.count){lag(mydata, k = na.count)}
lagged.df <- apply(mydata, 2, lag.by)
But this code just lags the entire data.frame by one...
One option would be to loop through the columns with apply and, for each column, put the NA elements first (subset with is.na) followed by the non-NA elements (subset with the negated logical vector, !is.na):
apply(mydata, 2, function(x) c(x[is.na(x)], x[!is.na(x)]))
# V1 V2 V3 V4 V5 V6
#[1,] 1 NA NA NA NA NA
#[2,] 2 1 NA NA NA NA
#[3,] 3 2 1 NA NA NA
#[4,] 4 3 2 1 NA NA
#[5,] 5 4 3 2 1 NA
#[6,] 6 5 4 3 2 1
You could use the sort function with option na.last = FALSE like this:
edit:
Akrun's comment is a valid one. If the values need to stay in the order in which they appear in the data.frame, then Akrun's answer is the best. sort will put everything in order from low to high, with the NAs in front.
library(purrr)
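# mydata is a matrix in the question (built with cbind), so convert it to a data frame first
map_df(as.data.frame(mydata), sort, na.last = FALSE)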
# A tibble: 6 x 6
V1 V2 V3 V4 V5 V6
<int> <int> <int> <int> <int> <int>
1 1 NA NA NA NA NA
2 2 1 NA NA NA NA
3 3 2 1 NA NA NA
4 4 3 2 1 NA NA
5 5 4 3 2 1 NA
6 6 5 4 3 2 1
Or apply:
apply(mydata, 2, sort , na.last = FALSE)
V1 V2 V3 V4 V5 V6
[1,] 1 NA NA NA NA NA
[2,] 2 1 NA NA NA NA
[3,] 3 2 1 NA NA NA
[4,] 4 3 2 1 NA NA
[5,] 5 4 3 2 1 NA
[6,] 6 5 4 3 2 1
edit2:
As nicolo commented, order can preserve the original order of the values within each column.
mydata[,3] <- c(4, 3, 1, 2, NA, NA)
map_df(as.data.frame(mydata), function(x) x[order(!is.na(x))])
# A tibble: 6 x 6
V1 V2 V3 V4 V5 V6
<int> <int> <dbl> <int> <int> <int>
1 1 NA NA NA NA NA
2 2 1 NA NA NA NA
3 3 2 4 NA NA NA
4 4 3 3 1 NA NA
5 5 4 1 2 1 NA
6 6 5 2 3 2 1
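The order-based approach also works in base R without purrr. The sketch below uses a small made-up data frame (df and its columns a and b are assumptions for illustration) and relies on order() leaving ties in their original order.

```r
# Hypothetical data frame; column b holds values that are not sorted
df <- data.frame(a = c(1, 2, NA, NA), b = c(4, NA, 3, 1))

# !is.na(x) is FALSE for NAs and TRUE otherwise, so ordering on it puts the
# NAs first; ties keep their original order, so the non-NA values are not sorted
shifted <- as.data.frame(lapply(df, function(x) x[order(!is.na(x))]))
shifted
```

Unlike sort, this keeps 4, 3, 1 in column b in the order in which they appeared.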

Count number of NAs between 2 values by row in R

My data looks something like this:
db <- as.data.frame(matrix(ncol=10, nrow=3,
c(3,NA,NA,4,5,NA,7,NA,NA,NA,NA,NA,7,NA,8,9,NA,NA,4,6,NA,NA,7,8,11,5,10,NA,NA,NA), byrow = TRUE))
db
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10
1 3 NA NA 4 5 NA 7 NA NA NA
2 NA NA 7 NA 8 9 NA NA 4 6
3 NA NA 7 8 11 5 10 NA NA NA
For each row, I'm trying to count the number of NAs that appear between the first and last non-NA elements (my data contain both numbers and characters).
The output should be something like this:
db$na.tot <- c(3, 3, 0)
db
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 na.tot
1 3 NA NA 4 5 NA 7 NA NA NA 3
2 NA NA 7 NA 8 9 NA NA 4 6 3
3 NA NA 7 8 11 5 10 NA NA NA 0
Where na.tot represents the number of NAs observed between the first and last non-NA elements by row (between 3 and 7, 7 and 6 and 7 and 10 in rows 1, 2 and 3 respectively).
Does anyone have a simple solution?
Thanks!
Try this:
require(data.table)
z<-as.data.table(which(!is.na(db),arr.ind=TRUE))
setkey(z,row,col)
z[,list(NAs=last(col)-first(col)-.N+1),by=row]
# row NAs
#1: 1 3
#2: 2 3
#3: 3 0
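A base R alternative is to apply a small helper to each row. This is a sketch only: count_inner_na is a hypothetical helper name, and returning NA for a row with no non-NA values is an assumption, since the question does not say what should happen in that case.

```r
db <- as.data.frame(matrix(
  c(3, NA, NA, 4, 5, NA, 7, NA, NA, NA,
    NA, NA, 7, NA, 8, 9, NA, NA, 4, 6,
    NA, NA, 7, 8, 11, 5, 10, NA, NA, NA),
  ncol = 10, nrow = 3, byrow = TRUE))

# Hypothetical helper: count the NAs between the first and last non-NA value
count_inner_na <- function(x) {
  idx <- which(!is.na(x))
  if (length(idx) == 0) return(NA_integer_)  # assumption: all-NA rows give NA
  sum(is.na(x[min(idx):max(idx)]))
}

na.tot <- unname(apply(db, 1, count_inner_na))
na.tot
```

Because the helper only tests for NA, it should behave the same way on character data.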

Remove rows with missing data conditionally [duplicate]

I have a dataframe with some missing values, displayed as NA.
For example:
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10
1 4 3 6 7 2 1 2 3 4 1
2 5 5 4 3 2 1 3 7 6 7
3 6 6 NA NA NA NA NA NA NA NA
4 5 2 2 1 7 NA NA NA NA NA
5 7 NA NA NA NA NA NA NA NA NA
I would like to remove rows that contain at least 80% missing data. In this example these are clearly rows 3 and 5. I know how to remove rows manually, but I would like some help with the code because my original dataframe contains 480 variables and more than 1000 rows, so code that automatically identifies and removes rows with at least 80% NA data would be extremely useful.
Thanking you in advance
You could use rowMeans:
df = read.table(text=' V1 V2 V3 V4 V5 V6 V7 V8 V9 V10
1 4 3 6 7 2 1 2 3 4 1
2 5 5 4 3 2 1 3 7 6 7
3 6 6 NA NA NA NA NA NA NA NA
4 5 2 2 1 7 NA NA NA NA NA
5 7 NA NA NA NA NA NA NA NA NA')
df[rowMeans(is.na(df))<.8,]
Output:
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10
1 4 3 6 7 2 1 2 3 4 1
2 5 5 4 3 2 1 3 7 6 7
4 5 2 2 1 7 NA NA NA NA NA
Hope this helps!
We can use rowSums on the logical matrix
df1[rowSums(is.na(df1))/ncol(df1) < 0.8,]
# V1 V2 V3 V4 V5 V6 V7 V8 V9 V10
#1 4 3 6 7 2 1 2 3 4 1
#2 5 5 4 3 2 1 3 7 6 7
#4 5 2 2 1 7 NA NA NA NA NA
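Both answers can be wrapped in a small function with an adjustable threshold. The following is a sketch under stated assumptions: drop_sparse_rows, its threshold argument and the toy data frame are made up for illustration.

```r
# Hypothetical helper: keep rows whose share of NA values is below `threshold`
drop_sparse_rows <- function(df, threshold = 0.8) {
  na_share <- rowMeans(is.na(df))           # fraction of NA values per row
  df[na_share < threshold, , drop = FALSE]  # rows at or above the threshold go
}

# Toy data: the rows contain 0%, 80%, 40% and 100% missing values
toy <- data.frame(
  a = c(1, 2, 3, NA),
  b = c(1, NA, 3, NA),
  c = c(1, NA, NA, NA),
  d = c(1, NA, NA, NA),
  e = c(1, NA, 3, NA)
)

kept <- drop_sparse_rows(toy)
rownames(kept)
```

With the strict comparison, a row with exactly 80% missing values is removed, which matches "at least 80%" in the question.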

Calculations in R with Missing Values

In the below test data, v4 is calculated out of v1, v2 and v3 as follows:
test$v4 <- (test$v1 + test$v2 + test$v3) / 3
As expected, any row with a missing value returns an NA result for v4:
v1 v2 v3 v4
1 1 1 2 1.333333
2 1 1 2 1.333333
3 1 2 NA NA
4 0 1 NA NA
5 NA NA 0 NA
6 NA 1 0 NA
7 1 2 NA NA
However, I want R to return an NA only when there are two or three NA values. If there is only one NA, I want R to calculate the mean of the two available values.
Can you please advise as to how I can do that?
Thank you.
You can use ifelse and rowSums(is.na()) to have differing formula on different rows:
dat <- read.table(text= "v1 v2 v3 v4
1 1 1 2 1.333333
2 1 1 2 1.333333
3 1 2 NA NA
4 0 1 NA NA
5 NA NA 0 NA
6 NA 1 0 NA
7 1 2 NA NA")
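# use only the input columns, so that the existing v4 column is not counted
vars <- c("v1", "v2", "v3")
# if a row has two or more NAs, return NA; otherwise the mean ignoring NAs
dat$v4 <- ifelse(rowSums(is.na(dat[vars])) >= 2, NA, rowMeans(dat[vars], na.rm = TRUE))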
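One detail to watch is which columns enter the NA count and the mean. The sketch below rebuilds the question's three input columns by hand (the data frame name test comes from the question, while vars and n_missing are names assumed here) and restricts both calculations to v1, v2 and v3.

```r
test <- data.frame(
  v1 = c(1, 1, 1, 0, NA, NA, 1),
  v2 = c(1, 1, 2, 1, NA, 1, 2),
  v3 = c(2, 2, NA, NA, 0, 0, NA)
)

vars <- c("v1", "v2", "v3")              # input columns only
n_missing <- rowSums(is.na(test[vars]))  # number of NAs per row

# NA when two or three inputs are missing, otherwise the mean of what is there
test$v4 <- ifelse(n_missing >= 2, NA, rowMeans(test[vars], na.rm = TRUE))
test
```

Rows with a single missing value now get the mean of the two available values, and only row 5, which has two NAs, stays NA.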

Dropping all left NAs in a dataframe and left shifting the cleaned rows

I have the following dataframe dat, which presents a row-specific number of NAs at the beginning of some of its rows:
dat <- as.data.frame(rbind(c(NA,NA,1,3,5,NA,NA,NA), c(NA,1:3,6:8,NA), c(1,NA,2:6,NA)))
dat
# V1 V2 V3 V4 V5 V6 V7 V8
# NA NA 1 3 5 NA NA NA
# NA 1 2 3 6 7 8 NA
# 1 NA 2 3 4 5 6 NA
My aim is to delete all the NAs at the beginning of each row and to left shift the row values (adding NAs at the end of the shifted rows accordingly, in order to keep their length constant).
The following code works as expected:
for (i in 1:nrow(dat)) {
  if (is.na(dat[i,1])==TRUE) {
    dat1 <- dat[i, min(which(!is.na(dat[i,]))):length(dat[i,])]
    dat[i,] <- data.frame( dat1, t(rep(NA, ncol(dat)-length(dat1))) )
  }
}
dat
returning:
# V1 V2 V3 V4 V5 V6 V7 V8
# 1 3 5 NA NA NA NA NA
# 1 2 3 6 7 8 NA NA
# 1 NA 2 3 4 5 6 NA
I was wondering whether there is a more direct way to do this without a for-loop, using the tail function.
On this last point, min(which(!is.na(dat[1,]))) returns 3, as expected. But if I then type tail(dat[1,],min(which(!is.na(dat[1,])))), the result is the same initial row, and I don't understand why.
Thank you very much for any suggestions.
If you just want all NAs to be pushed to the end, you could try:
dat <- as.data.frame(rbind(c(NA,NA,1,3,5,NA,NA,NA), c(NA,1:3,6:8,NA), c(1:7,NA)))
dat[3,2] <- NA
> dat
V1 V2 V3 V4 V5 V6 V7 V8
1 NA NA 1 3 5 NA NA NA
2 NA 1 2 3 6 7 8 NA
3 1 NA 3 4 5 6 7 NA
dat.new<-do.call(rbind,lapply(1:nrow(dat),function(x) t(matrix(dat[x,order(is.na(dat[x,]))])) ))
colnames(dat.new)<-colnames(dat)
> dat.new
V1 V2 V3 V4 V5 V6 V7 V8
[1,] 1 3 5 NA NA NA NA NA
[2,] 1 2 3 6 7 8 NA NA
[3,] 1 3 4 5 6 7 NA NA
I don't think you can do this without a loop.
dat <- as.data.frame(rbind(c(NA,NA,1,3,5,NA,NA,NA), c(NA,1:3,6:8,NA), c(1:7,NA)))
dat[3,2] <- NA
# V1 V2 V3 V4 V5 V6 V7 V8
# 1 NA NA 1 3 5 NA NA NA
# 2 NA 1 2 3 6 7 8 NA
# 3 1 NA 3 4 5 6 7 NA
t(apply(dat, 1, function(x) {
  if (is.na(x[1])) {
    y <- x[-seq_len(which.min(is.na(x))-1)]
    length(y) <- length(x)
    y
  } else x
}))
# [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
#[1,] 1 3 5 NA NA NA NA NA
#[2,] 1 2 3 6 7 8 NA NA
#[3,] 1 NA 3 4 5 6 7 NA
Then turn the matrix into a data.frame if you must.
Here is an answer that uses the tail function:
dat <- as.data.frame(rbind(c(NA,NA,1,3,5,NA,NA,NA), c(NA,1:3,6:8,NA), c(1:7,NA)))
dat
for (i in 1:nrow(dat)) {
  if (is.na(dat[i,1])==TRUE) {
    # drops initial NAs of the row (if the sequence starts with NAs)
    dat1 <- tail(as.integer(dat[i,]), -min(which(!is.na(dat[i,]))-1))
    # adds final NAs to keep the row length constant (i.e. conformable with 'dat')
    length(dat1) <- ncol(dat)
    dat[i,] <- dat1
  }
}
dat
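On the question of why tail(dat[1,], 3) returns the whole row: tail() on a data frame selects rows, and dat[1,] is a one-row data frame, so asking for its last 3 rows gives back that single row. Converting the row to a plain vector first makes tail() work on the elements. A minimal sketch, where one_row and shift_left are hypothetical names introduced for illustration:

```r
one_row <- data.frame(V1 = NA, V2 = NA, V3 = 1, V4 = 3, V5 = 5, V6 = NA)

# tail() on a data frame works on rows, so the one-row data frame comes back whole
nrow(tail(one_row, 3))
ncol(tail(one_row, 3))

# Hypothetical helper: drop leading NAs from a vector and pad the end with NA
shift_left <- function(x) {
  first <- which(!is.na(x))[1]                # position of the first non-NA value
  if (is.na(first) || first == 1) return(x)  # all NA, or nothing to drop
  y <- tail(x, -(first - 1))                  # drop the leading NAs
  length(y) <- length(x)                      # pad back to the original length
  y
}

x <- unlist(one_row, use.names = FALSE)       # the row as a plain numeric vector
shift_left(x)
```

The early return matters because tail(x, -0) is the same as tail(x, 0), which returns an empty vector rather than x.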
