Calculations in R with Missing Values - r

In the test data below, v4 is calculated from v1, v2 and v3 as follows:
test$v4 <- (test$v1 + test$v2 + test$v3) / 3
As expected, any row with a missing value returns an NA result for v4:
v1 v2 v3 v4
1 1 1 2 1.333333
2 1 1 2 1.333333
3 1 2 NA NA
4 0 1 NA NA
5 NA NA 0 NA
6 NA 1 0 NA
7 1 2 NA NA
However, I want R to return an NA only when there are two or three NA values. If there is only one NA, I want R to calculate the mean of the two available values.
Can you please advise as to how I can do that?
Thank you.

You can use ifelse and rowSums(is.na()) to apply a different formula to different rows:
dat <- read.table(text= "v1 v2 v3 v4
1 1 1 2 1.333333
2 1 1 2 1.333333
3 1 2 NA NA
4 0 1 NA NA
5 NA NA 0 NA
6 NA 1 0 NA
7 1 2 NA NA")
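# use only v1..v3: the existing v4 column would otherwise add its own NA to the count
vars <- c("v1", "v2", "v3")
# NA if two or more inputs are missing, otherwise the mean of the available values
dat$v4 <- ifelse(rowSums(is.na(dat[vars])) >= 2, NA, rowMeans(dat[vars], na.rm = TRUE))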
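If the threshold needs to change later, the same idea can be wrapped in a small function. This is only a sketch: row_mean_max_na and its max_na argument are names made up for illustration, not part of base R.

```r
# Hypothetical helper: row means that tolerate at most `max_na` missing values.
row_mean_max_na <- function(df, cols, max_na = 1) {
  x <- as.matrix(df[cols])
  n_missing <- rowSums(is.na(x))
  out <- rowMeans(x, na.rm = TRUE)   # NaN where every value is missing
  out[n_missing > max_na] <- NA      # blank out rows with too many NAs
  out
}

# The v1..v3 values from the question
test <- data.frame(v1 = c(1, 1, 1, 0, NA, NA, 1),
                   v2 = c(1, 1, 2, 1, NA, 1, 2),
                   v3 = c(2, 2, NA, NA, 0, 0, NA))
test$v4 <- row_mean_max_na(test, c("v1", "v2", "v3"))
# rows 3, 4, 6 and 7 get the mean of their two available values; row 5 stays NA
test
```

With max_na = 0 this should reproduce the original all-or-nothing behaviour.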

Related

Conditionally creating matrix or dataframe in R

I have two objects; let's call them 1 and 2. Each can take either 1 or 2 as the value of its x variable, and depending on that, its (binary) y values are determined as depicted in the image.
For example, if x=1 then only yA can be 1. But if x=2, any of yA, yB and yC for that object can be 1. The constraint is that for each object at most one y can be 1. In the image, blue columns are for object 1 and green columns are for object 2.
Is there an efficient way to do this, given that the number of variables in the original problem is much higher?
EDIT: The objective is to find all the possible combinations of y variables as depicted in the image. The image is only meant to give an idea of the expected outcome.
A bit of a brute-force generation.
First, creating the basic frame of all y* columns:
dat <- data.frame(yA=c(1,NA,NA),yB=c(NA,1,NA),yC=c(NA,NA,1),ign=1)
dat <- merge(dat, dat, by="ign")
names(dat)[-1] <- c("y1A", "y1B", "y1C", "y2A", "y2B", "y2C")
dat
# ign y1A y1B y1C y2A y2B y2C
# 1 1 1 NA NA 1 NA NA
# 2 1 1 NA NA NA 1 NA
# 3 1 1 NA NA NA NA 1
# 4 1 NA 1 NA 1 NA NA
# 5 1 NA 1 NA NA 1 NA
# 6 1 NA 1 NA NA NA 1
# 7 1 NA NA 1 1 NA NA
# 8 1 NA NA 1 NA 1 NA
# 9 1 NA NA 1 NA NA 1
Merge (outer/cartesian) with a frame of x*:
alldat <- merge(data.frame(x1=c(1,1,2),x2=c(1,2,2),ign=1), dat, by="ign")
subset(alldat, (!is.na(y1B) | x1 > 1) & (!is.na(y2B) | x2 > 1), select = -ign)
# x1 x2 y1A y1B y1C y2A y2B y2C
# 5 1 1 NA 1 NA NA 1 NA
# 13 1 2 NA 1 NA 1 NA NA
# 14 1 2 NA 1 NA NA 1 NA
# 15 1 2 NA 1 NA NA NA 1
# 19 2 2 1 NA NA 1 NA NA
# 20 2 2 1 NA NA NA 1 NA
# 21 2 2 1 NA NA NA NA 1
# 22 2 2 NA 1 NA 1 NA NA
# 23 2 2 NA 1 NA NA 1 NA
# 24 2 2 NA 1 NA NA NA 1
# 25 2 2 NA NA 1 1 NA NA
# 26 2 2 NA NA 1 NA 1 NA
# 27 2 2 NA NA 1 NA NA 1
The ign column is merely to force/enable merge to do a cartesian/outer join.
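As an aside, merge() also returns the Cartesian product when by has length zero, so the helper column can be skipped. A minimal sketch, assuming the same y and x frames as above (the names ys, y1, y2, xs and alldat2 are only for illustration; the row order may differ from the output shown earlier):

```r
ys <- data.frame(yA = c(1, NA, NA), yB = c(NA, 1, NA), yC = c(NA, NA, 1))
y1 <- setNames(ys, c("y1A", "y1B", "y1C"))
y2 <- setNames(ys, c("y2A", "y2B", "y2C"))
xs <- data.frame(x1 = c(1, 1, 2), x2 = c(1, 2, 2))

# by = NULL asks merge() for the Cartesian product directly (3 x 3 x 3 rows)
alldat2 <- merge(xs, merge(y1, y2, by = NULL), by = NULL)
dim(alldat2)

# same filter as in the answer above, without needing select = -ign
kept <- subset(alldat2, (!is.na(y1B) | x1 > 1) & (!is.na(y2B) | x2 > 1))
nrow(kept)
```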

calculating sum and dealing with NAs

I am having a problem with the rowSums function. Rows that contain only NAs are being summed to 0, and I don't want that. Here is my data:
V1 V2 V3 V4
1 0 0 1
0 1 NA 1
NA NA NA NA
Here is what is happening:
V1 V2 V3 V4 SUM
1 0 0 1 2
0 1 NA 1 2
NA NA NA NA 0
I want this:
V1 V2 V3 V4 SUM
1 0 0 1 2
0 1 NA 1 2
NA NA NA NA NA
I've looked on several websites and I have tried so many different iterations of code and I keep getting the same thing. This is the most basic piece of code I have used, although I tried using dplyr. Can someone please help me?
df$sum <- rowSums(df, na.rm = T)
We can take advantage of the fact that
NA ^ 0
#[1] 1
NA ^ 1
#[1] NA
Using it with rowSums, we can do:
rowSums(df, na.rm = TRUE) * NA^(rowSums(!is.na(df)) == 0)
#[1] 2 2 NA
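If the NA^ trick feels too clever, the same result can be spelled out in two steps. A small sketch using the data from the question (row_total and all_missing are just illustrative names):

```r
df <- data.frame(V1 = c(1, 0, NA), V2 = c(0, 1, NA),
                 V3 = c(0, NA, NA), V4 = c(1, 1, NA))

row_total   <- rowSums(df, na.rm = TRUE)      # all-NA rows come out as 0 here
all_missing <- rowSums(!is.na(df)) == 0       # TRUE only where every value is NA
row_total[all_missing] <- NA                  # overwrite those rows with NA

# the one-liner from above, for comparison
trick <- rowSums(df, na.rm = TRUE) * NA^all_missing
```

Both vectors should agree; compute them before adding the sum column to df, otherwise the new column is included in the next rowSums call.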

r - Lag a data.frame by the number of NAs

In other words, I am trying to lag a data.frame that looks like this:
V1 V2 V3 V4 V5 V6
1 1 1 1 1 1
2 2 2 2 2 NA
3 3 3 3 NA NA
4 4 4 NA NA NA
5 5 NA NA NA NA
6 NA NA NA NA NA
To something that looks like this:
V1 V2 V3 V4 V5 V6
1 NA NA NA NA NA
2 1 NA NA NA NA
3 2 1 NA NA NA
4 3 2 1 NA NA
5 4 3 2 1 NA
6 5 4 3 2 1
So far, I have used a function that counts the number of NAs, and have tried to lag each column in my data.frame by the corresponding number of NAs in that column.
V1 <- c(1,2,3,4,5,6)
V2 <- c(1,2,3,4,5,NA)
V3 <- c(1,2,3,4,NA,NA)
V4 <- c(1,2,3,NA,NA,NA)
V5 <- c(1,2,NA,NA,NA,NA)
V6 <- c(1,NA,NA,NA,NA,NA)
mydata <- cbind(V1,V2,V3,V4,V5,V6)
na.count <- colSums(is.na(mydata))
lag.by <- function(mydata, na.count){lag(mydata, k = na.count)}
lagged.df <- apply(mydata, 2, lag.by)
But this code just lags the entire data.frame by one...
One option would be to loop through the columns with apply and, for each column, put the NA elements first (selected with is.na), followed by the non-NA elements (selected by negating is.na):
apply(mydata, 2, function(x) c(x[is.na(x)], x[!is.na(x)]))
# V1 V2 V3 V4 V5 V6
#[1,] 1 NA NA NA NA NA
#[2,] 2 1 NA NA NA NA
#[3,] 3 2 1 NA NA NA
#[4,] 4 3 2 1 NA NA
#[5,] 5 4 3 2 1 NA
#[6,] 6 5 4 3 2 1
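To see that this keeps the non-missing values in their original order (and does not need the NAs to sit at the bottom), here is a small check on made-up data. The helper name na_first and the matrix m are assumptions for illustration only.

```r
# Hypothetical helper: push NAs to the top, keep the other values in order.
na_first <- function(x) c(x[is.na(x)], x[!is.na(x)])

m <- cbind(A = c(4, 3, 1, 2, NA, NA),   # unsorted values, NAs at the end
           B = c(1, NA, 2, NA, 3, NA))  # NAs scattered through the column
res <- apply(m, 2, na_first)
res
# column A becomes NA NA 4 3 1 2, column B becomes NA NA NA 1 2 3
```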
You could use the sort function with the option na.last = FALSE, like this:
edit:
Akrun's comment is a valid one. If the values need to stay in the order in which they appear in the data.frame, then Akrun's answer is the best. sort will put everything in order from low to high, with the NAs in front.
library(purrr)
# map_df iterates over columns, so it needs a data.frame; cbind() in the question builds a matrix
map_df(as.data.frame(mydata), sort, na.last = FALSE)
# A tibble: 6 x 6
# V1 V2 V3 V4 V5 V6
# <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 1 NA NA NA NA NA
# 2 2 1 NA NA NA NA
# 3 3 2 1 NA NA NA
# 4 4 3 2 1 NA NA
# 5 5 4 3 2 1 NA
# 6 6 5 4 3 2 1
Or apply:
apply(mydata, 2, sort , na.last = FALSE)
# V1 V2 V3 V4 V5 V6
#[1,] 1 NA NA NA NA NA
#[2,] 2 1 NA NA NA NA
#[3,] 3 2 1 NA NA NA
#[4,] 4 3 2 1 NA NA
#[5,] 5 4 3 2 1 NA
#[6,] 6 5 4 3 2 1
edit2:
As nicolo commented, order can preserve the original order of the values:
mydata[,3] <- c(4, 3, 1, 2, NA, NA)
# as above, map_df needs a data.frame rather than the matrix built by cbind()
map_df(as.data.frame(mydata), function(x) x[order(!is.na(x))])
# A tibble: 6 x 6
# V1 V2 V3 V4 V5 V6
# <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 1 NA NA NA NA NA
# 2 2 1 NA NA NA NA
# 3 3 2 4 NA NA NA
# 4 4 3 3 1 NA NA
# 5 5 4 1 2 1 NA
# 6 6 5 2 3 2 1

Merge and replace values from overlapping matrices

I have two overlapping matrices with some shared columns and rows:
m.1 = matrix(c(NA,NA,1,NA,NA,NA,1,1,1,NA,1,1,1,1,1,NA,1,1,1,NA,NA,NA,1,NA,NA), ncol=5)
colnames(m.1) <- c("-2","-1","0","1","2")
rownames(m.1) <- c("-2","-1","0","1","2")
## -2 -1 0 1 2
## -2 NA NA 1 NA NA
## -1 NA 1 1 1 NA
## 0 1 1 1 1 1
## 1 NA 1 1 1 NA
## 2 NA NA 1 NA NA
m.2 = matrix(c(NA,2,NA,2,2,2,NA,2,NA), ncol=3)
colnames(m.2) <- c("-1","0","1")
rownames(m.2) <- c("-1","0","1")
## -1 0 1
## -1 NA 2 NA
## 0 2 2 2
## 1 NA 2 NA
Now I want to pass the element-wise maximum of m.1 and m.2 (matching cells by row and column name) to a new matrix m.max, which should look like this:
## -2 -1 0 1 2
## -2 NA NA 1 NA NA
## -1 NA 1 2 1 NA
## 0 1 2 2 2 1
## 1 NA 1 2 1 NA
## 2 NA NA 1 NA NA
Based on previous threads, I have meddled with merge(), replace() and match() but cannot get the desired result at all, e.g.
m.max<- merge(m.1,m.2, by = "row.names", all=TRUE, sort = TRUE)
## Row.names -2 -1.x 0.x 1.x 2 -1.y 0.y 1.y
## 1 -1 NA 1 1 1 NA NA 2 NA
## 2 -2 NA NA 1 NA NA NA NA NA
## 3 0 1 1 1 1 1 2 2 2
## 4 1 NA 1 1 1 NA NA 2 NA
## 5 2 NA NA 1 NA NA NA NA NA
Please help! Am I completely on the wrong track? Does this operation require a different kind of object than matrix? For example, I also tried to convert the matrices into raster objects and do cell statistics, but ran into problems because of the unequal dimensions of m.1 and m.2.
Importantly, the answer should also work for much larger objects, and regardless of whether I want to calculate the maximum, minimum or sum.
You can use pmax:
#we create a new matrix as big as m.1 with the values of m.2 in it
mres<-array(NA,dim(m.1),dimnames(m.1))
mres[rownames(m.2),colnames(m.2)]<-m.2
#Then we use pmax
pmax(m.1,mres,na.rm=TRUE)
# -2 -1 0 1 2
#-2 NA NA 1 NA NA
#-1 NA 1 2 1 NA
#0 1 2 2 2 1
#1 NA 1 2 1 NA
#2 NA NA 1 NA NA
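Since the question also asks about the minimum and the sum, here is a sketch of how the same padding step might be reused. pmin is the direct counterpart of pmax; for the sum there is no built-in parallel function with na.rm, so this sketch treats a single NA as 0 and keeps NA where both cells are missing (that rule is an assumption about what is wanted; m.min, m.sum and both_na are illustrative names).

```r
rn <- c("-2", "-1", "0", "1", "2")
m.1 <- matrix(c(NA,NA,1,NA,NA,NA,1,1,1,NA,1,1,1,1,1,NA,1,1,1,NA,NA,NA,1,NA,NA),
              ncol = 5, dimnames = list(rn, rn))
m.2 <- matrix(c(NA,2,NA,2,2,2,NA,2,NA), ncol = 3,
              dimnames = list(c("-1", "0", "1"), c("-1", "0", "1")))

# pad m.2 out to the shape of m.1, matching cells by row/column name
mres <- array(NA, dim(m.1), dimnames(m.1))
mres[rownames(m.2), colnames(m.2)] <- m.2

# element-wise minimum, ignoring NAs
m.min <- pmin(m.1, mres, na.rm = TRUE)

# element-wise sum: a single NA counts as 0, two NAs stay NA
both_na <- is.na(m.1) & is.na(mres)
m.sum <- ifelse(is.na(m.1), 0, m.1) + ifelse(is.na(mres), 0, mres)
m.sum[both_na] <- NA
```

In this particular example m.min ends up equal to m.1, because every non-missing cell of m.2 overlaps a 1 in m.1.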

R - Replacing Specific Columns' Data

I'm connecting to my Vertica database and retrieving a huge amount of data. There are NAs in all columns of the dataset, but I want to find the NAs in specific columns only and replace them with 0.
How should I do that?
Thanks!
To expand on my comment and make it into an answer, here's a minimal reproducible example:
set.seed(1)
mydf <- as.data.frame(matrix(sample(c(1:2, NA), 50, replace = TRUE), ncol = 10))
mydf
# V1 V2 V3 V4 V5 V6 V7 V8 V9 V10
# 1 1 NA 1 2 NA 2 2 NA NA NA
# 2 2 NA 1 NA 1 1 2 NA 2 1
# 3 2 2 NA NA 2 2 2 1 NA 2
# 4 NA 2 2 2 1 NA 1 NA 2 NA
# 5 1 1 NA NA 1 2 NA 2 2 NA
Now, to replace NA with 0, but only in columns 1, 3, 7, and 8, you can use:
mydf[c(1, 3, 7, 8)][is.na(mydf[c(1, 3, 7, 8)])] <- 0
mydf
# V1 V2 V3 V4 V5 V6 V7 V8 V9 V10
# 1 1 NA 1 2 NA 2 2 0 NA NA
# 2 2 NA 1 NA 1 1 2 0 2 1
# 3 2 2 0 NA 2 2 2 1 NA 2
# 4 0 2 2 2 1 NA 1 0 2 NA
# 5 1 1 0 NA 1 2 0 2 2 NA
Instead of column numeric index positions, you can use a vector of column names (which will be safer than the numeric positions). Additionally, your code might be easier if the vector of column names or index positions you're working on were stored in a separate vector. Both of those concepts are demonstrated below, where we replace NA values in variables "V2", "V4" and "V5" with "-999".
changeMe <- c("V2", "V4", "V5")
mydf[changeMe][is.na(mydf[changeMe])] <- -999
mydf
# V1 V2 V3 V4 V5 V6 V7 V8 V9 V10
# 1 1 -999 1 2 -999 2 2 0 NA NA
# 2 2 -999 1 -999 1 1 2 0 2 1
# 3 2 2 0 -999 2 2 2 1 NA 2
# 4 0 2 2 2 1 NA 1 0 2 NA
# 5 1 1 0 -999 1 2 0 2 2 NA
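An equivalent way to write this, which some may find easier to read, is to loop over the chosen columns with lapply and replace. This is a sketch on a small made-up data frame (the data and the cols vector are hypothetical, not the Vertica data from the question):

```r
mydf <- data.frame(V1 = c(1, NA, 2), V2 = c(NA, 2, NA), V3 = c(NA, 1, 1))
cols <- c("V1", "V3")   # hypothetical choice of columns to fill

# replace NA with 0 column by column; columns not listed in `cols` are untouched
mydf[cols] <- lapply(mydf[cols], function(x) replace(x, is.na(x), 0))
mydf
```

Here V1 and V3 have their NAs replaced with 0, while V2 keeps its NAs.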
