Subtract multiple columns ignoring NA - r

I'm fairly new to R and have run into an issue with NA's. This question may have been answered elsewhere but I can't seem to find the answer. I'm trying to do sort of the opposite of rowSums() in that I'm trying to subtract x2 and x3 from x1 in order to generate x4 without NA's. The code I'm currently using is as follows:
> x <- data.frame(x1 = 3, x2 = c(4:1, 2:5), x3=c(1,NA))
> x$x4=x$x1-x$x2-x$x3
> x
x1 x2 x3 x4
1 3 4 1 -2
2 3 3 NA NA
3 3 2 1 0
4 3 1 NA NA
5 3 2 1 0
6 3 3 NA NA
7 3 4 1 -2
8 3 5 NA NA
In other words, I want to ignore the NA's, similar to how rowSums allows the na.rm=TRUE argument, so that I get this result:
x1 x2 x3 x4
1 3 4 1 -2
2 3 3 NA 0
3 3 2 1 0
4 3 1 NA 2
5 3 2 1 0
6 3 3 NA 0
7 3 4 1 -2
8 3 5 NA -2
Any help is greatly appreciated.

You can use something like this if any of the columns may contain NAs:
x$x4 <- ifelse(is.na(x$x1), 0, x$x1) - ifelse(is.na(x$x2), 0, x$x2) - ifelse(is.na(x$x3), 0, x$x3)
This treats NAs as 0. If you want a different fill value, replace the 0s in the formula above with the value you need.
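As a minimal sketch of the "replace the 0s" remark, the repeated ifelse() calls can be wrapped in a small helper. The name na_to and its value argument are made up for illustration; they are not part of base R.

```r
x <- data.frame(x1 = 3, x2 = c(4:1, 2:5), x3 = c(1, NA))

# Hypothetical helper: swap NA for a chosen fill value (0 by default)
na_to <- function(v, value = 0) ifelse(is.na(v), value, v)

# Same arithmetic as the one-liner above, with the NA handling factored out
x$x4 <- na_to(x$x1) - na_to(x$x2) - na_to(x$x3)
x$x4
```

Passing a different value (for example na_to(x$x3, value = 1)) changes what a missing entry contributes to the subtraction.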

Just use rowSums:
> x$x4 <- x$x1 - rowSums(x[,2:3], na.rm=TRUE)
> x
x1 x2 x3 x4
1 3 4 1 -2
2 3 3 NA 0
3 3 2 1 0
4 3 1 NA 2
5 3 2 1 0
6 3 3 NA 0
7 3 4 1 -2
8 3 5 NA -2
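If there are more than two columns to subtract, the same rowSums idea can be driven by a vector of column names. This is only a sketch; subtract_ignoring_na is a hypothetical helper name, and it assumes the column being subtracted from has no NAs (an NA there still yields NA).

```r
x <- data.frame(x1 = 3, x2 = c(4:1, 2:5), x3 = c(1, NA))

# Hypothetical helper: subtract the row-wise total of `cols` from `from`,
# treating NA in the subtracted columns as 0
subtract_ignoring_na <- function(df, from, cols) {
  df[[from]] - rowSums(df[cols], na.rm = TRUE)
}

x$x4 <- subtract_ignoring_na(x, "x1", c("x2", "x3"))
x$x4
```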

Related

Calculating the cumulative sum of the columns in a dataframe with NAs

I have a dataframe that has some NAs in it, and want to create a new set of columns with the cumulative sum of a subset of the original columns, with NAs being ignored. A minimal example follows:
x = data.frame(X1 = c(NA, NA, 1,2,3),
X2 = 1:5)
> x
X1 X2
1 NA 1
2 NA 2
3 1 3
4 2 4
5 3 5
If I now write
> cumsum(x)
X1 X2
1 NA 1
2 NA 3
3 NA 6
4 NA 10
5 NA 15
I tried using ifelse
> cumsum(ifelse(is.na(x), 0, x))
Error: 'list' object cannot be coerced to type 'double'
but I have no difficulty working with one column at a time
> cumsum(ifelse(is.na(x$X1), 0, x$X1))
[1] 0 0 1 3 6
I suppose I could loop through the columns in my chosen subset, create a cumulative sum for each one, and then assign it to a new column in the dataframe, but this seems tedious. If I have a vector with the names of the columns whose cumulative sum I want to compute, how can I do so while ignoring the NAs (i.e., treating them as 0), and add the resulting set of cumulative sums to the dataframe with new names?
Sincerely
Thomas Philips
We could do
library(dplyr)
x %>%
  mutate(across(everything(),
                ~ replace(.x, complete.cases(.x), cumsum(.x[complete.cases(.x)]))))
-output
X1 X2
1 NA 1
2 NA 3
3 1 6
4 3 10
5 6 15
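The same idea (accumulate only the non-missing values and leave each NA in place) can be sketched in base R without dplyr. The helper name cumsum_keep_na is an assumption made for this example.

```r
x <- data.frame(X1 = c(NA, NA, 1, 2, 3), X2 = 1:5)

# Hypothetical helper: cumulative sum over the non-missing values only,
# leaving every NA where it was
cumsum_keep_na <- function(v) {
  ok <- !is.na(v)
  v[ok] <- cumsum(v[ok])
  v
}

# x[] <- keeps the data frame structure while replacing every column
x[] <- lapply(x, cumsum_keep_na)
x
```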
Or more compactly with fcumsum from collapse
library(collapse)
fcumsum(x)
X1 X2
1 NA 1
2 NA 3
3 1 6
4 3 10
5 6 15
Or using base R with replace
cumsum(replace(x, is.na(x), 0))
X1 X2
1 0 1
2 0 3
3 1 6
4 3 10
5 6 15
library(dplyr)
mutate(x, across(everything(), ~cumsum(coalesce(.x, 0))))
X1 X2
1 0 1
2 0 3
3 1 6
4 3 10
5 6 15
Or
x[is.na(x)] <- 0
cumsum(x)
# but we lose the NA's
X1 X2
1 0 1
2 0 3
3 1 6
4 3 10
5 6 15
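The question also asks for the cumulative sums to be added as new columns, driven by a vector of column names. A minimal base R sketch of that part follows; the "_cumsum" suffix is an assumed naming scheme, not something from the question.

```r
x <- data.frame(X1 = c(NA, NA, 1, 2, 3), X2 = 1:5)

cols <- c("X1", "X2")                  # columns to accumulate
new_names <- paste0(cols, "_cumsum")   # assumed naming scheme for the new columns

# Treat NA as 0 inside each column, then append the results under the new names
x[new_names] <- lapply(x[cols], function(v) cumsum(replace(v, is.na(v), 0)))
x
```

The original columns are left untouched, so the NAs in X1 are not lost.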

Subtract from values in a dataframe by condition

I want to select some columns from my dataframe, and subtract a number from all values meeting a condition. In my case, I want to select columns 5:10 of my data, and subtract 10 from all values >5, while keeping all other values the same, and then save this dataframe.
The solution I have tried (below) just subtracts 10 from all the values. How can I do this? Any help much appreciated.
data <- data.frame(replicate(10,sample(-1:10,1000,rep=TRUE))) #generate random data
# what i have tried so far
(data[, 5:10] > 5) - 10
In base R you may use lapply, assigning the result back to the selected columns:
data[5:10] <- lapply(data[5:10], function(x) ifelse(x > 5, x - 10, x))
In dplyr you can do
data <- data.frame(replicate(10,sample(-1:10,1000,rep=TRUE)))
library(dplyr, warn.conflicts = FALSE)
data %>%
  mutate(across(5:10, ~ ifelse(. > 5, . - 10, .)))
X1 X2 X3 X4 X5 X6 X7 X8 X9 X10
1 9 3 3 5 -2 -4 1 4 -4 -1
2 1 0 7 7 -2 3 2 -1 4 -3
3 2 -1 8 1 1 1 -4 0 0 3
4 9 9 4 6 -2 -3 3 0 0 0
5 7 -1 9 5 0 1 1 -1 -1 2
6 4 9 4 7 4 1 0 -1 -3 -1
.
.
.
.
You can use -
cols <- 5:10
data[cols] <- data[cols] - 10 * +(data[cols] > 5)
+(data[cols] > 5) gives 1/0 values, which are then multiplied by 10. So you get 10 for values greater than 5 and 0 otherwise. These values are subtracted from the selected columns of the dataframe.
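To see the masking step in isolation, here is a small sketch on a tiny hand-made data frame (d, mask and d_new are names chosen only for this example):

```r
d <- data.frame(a = c(3, 6, 10), b = c(5, 7, -1))

# Unary + turns the logical matrix into 1/0 integers
mask <- +(d > 5)

# Entries above 5 lose 10; everything else loses 0
d_new <- d - 10 * mask
d_new
```

Here a becomes 3, -4, 0 and b becomes 5, -3, -1: only the entries that were above 5 change.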
I would use dplyr and base subsetting here.
library(dplyr)
data %>% mutate(across(5:10, ~{.x[.x>5]<-.x[.x>5]-10; .x}))
We can also substitute the whole subsetted dataframe in place, without loops or lapply, which can be done with not-so-beautiful but potentially very fast code:
data[,5:10][data[,5:10]>5]<-data[,5:10][data[,5:10]>5]-10
output
X1 X2 X3 X4 X5 X6 X7 X8 X9 X10
1 8 -1 0 3 1 -1 -2 0 1 -1
2 5 6 5 4 4 4 -1 3 2 -3
3 10 4 4 4 4 0 -3 -3 4 -1
4 1 7 5 5 -2 0 -3 5 1 5
5 0 6 7 1 0 -3 0 -1 -1 3
6 8 4 7 4 -3 5 0 -4 1 2
7 -1 5 9 7 0 1 0 2 4 4
8 9 8 5 3 -1 5 -3 -1 -4 -1
9 9 9 8 8 4 2 1 -1 1 3
10 8 8 9 5 2 -4 2 -3 -3 -1
......
[ reached 'max' / getOption("max.print") -- omitted 900 rows ]
Using vapply():
colIndices <- seq(5, 10)
# `data` is the data frame from the question
data[, colIndices] <- vapply(
  data[, colIndices],
  function(x) {
    ifelse(x > 5, x - 10, x)
  },
  numeric(nrow(data))
)

How to filter multiple columns using a single criterion

I have
4 5 6 7
1 3 3 3 3
2 1 2 2 1
3 2 1 1 NA
4 2 7 1 NA
5 1 1 1 1
I want to filter rows with either 2 or 3 in columns 1 to 4 so I only get rows 1,2,4
I tried
df1%>%filter_at(vars(4:7), all_vars(c(2,3)) -> df2
which returns
Error in filter_impl(.data, quo) : Result must have length 413, not 2
and
filter(d1[4:7]%in%c(1,3))
which returns
Error in filter_impl(.data, quo) : Result must have length 413, not 4
I want to avoid using
df1%>%filter(rowname1%in%c(1,3)|rowname1%in%c(1,3)| ...)
I don't get the syntax. Thanks
We can use any_vars and %in% to achieve this task.
library(dplyr)
df1 %>% filter_at(vars(1:4), any_vars(. %in% c(2, 3)))
# X4 X5 X6 X7
# 1 3 3 3 3
# 2 1 2 2 1
# 3 2 1 1 NA
# 4 2 7 1 NA
Or use == with |.
df1 %>% filter_at(vars(1:4), any_vars(. == 2 | . == 3))
# X4 X5 X6 X7
# 1 3 3 3 3
# 2 1 2 2 1
# 3 2 1 1 NA
# 4 2 7 1 NA
DATA
df1 <- read.table(text = " 4 5 6 7
1 3 3 3 3
2 1 2 2 1
3 2 1 1 NA
4 2 7 1 NA
5 1 1 1 1",
header = TRUE, stringsAsFactors = FALSE)
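For comparison, here is a base R sketch of the same "any column matches" filter that needs no packages. In current dplyr, filter_at() and any_vars() are superseded in favour of if_any(), so this is offered only as one alternative; hits and df2 are names picked for the example.

```r
df1 <- read.table(text = " 4 5 6 7
1 3 3 3 3
2 1 2 2 1
3 2 1 1 NA
4 2 7 1 NA
5 1 1 1 1",
header = TRUE, stringsAsFactors = FALSE)

# Logical matrix: TRUE where a cell is 2 or 3 (NA cells come out FALSE under %in%)
hits <- sapply(df1, `%in%`, c(2, 3))

# Keep the rows with at least one hit
df2 <- df1[rowSums(hits) > 0, ]
df2
```

As with the any_vars() answers, this keeps rows 1 to 4 and drops row 5, which contains only 1s.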

Exclude a Specific Value from a Unique Value Counter

I am trying to count how many different responses a person gives during a trial of an experiment, but there is a catch.
There are supposed to be 6 possible responses (1,2,3,4,5,6) BUT sometimes 0 is recorded as a response (it's a glitch / flaw in design).
I need to count the number of different responses they give, BUT ONLY counting unique values within the range 1-6. This helps us calculate their accuracy.
Is there a way to exclude the value 0 from contributing to a unique value counter? Any other work-arounds?
Currently I am trying this method below, but it includes 0, NA, and I think any other entry in a cell in the Unique Value Counter Column (I have named "Span6"), which makes me sad.
# My Span6 calculator:
ASixImageTrials <- data.frame(eSOPT_831$T8.RESP, eSOPT_831$T9.RESP, eSOPT_831$T10.RESP, eSOPT_831$T11.RESP, eSOPT_831$T12.RESP, eSOPT_831$T13.RESP)
ASixImageTrials$Span6 = apply(ASixImageTrials, 1, function(x) length(unique(x)))
Use na.omit inside unique and sum the resulting logical vector, as below:
df$res = apply(df, 1, function(x) sum(unique(na.omit(x)) > 0))
df
Output:
X1 X2 X3 X4 X5 res
1 2 1 1 2 1 2
2 3 0 1 1 2 3
3 3 NA 1 1 3 2
4 3 3 3 4 NA 2
5 1 1 0 NA 3 2
6 3 NA NA 1 1 2
7 2 0 2 3 0 2
8 0 2 2 2 1 2
9 3 2 3 0 NA 2
10 0 2 3 2 2 2
11 2 2 1 2 1 2
12 0 2 2 2 NA 1
13 0 1 4 3 2 4
14 2 2 1 1 NA 2
15 3 NA 2 2 NA 2
16 2 2 NA 3 NA 2
17 2 3 2 2 2 2
18 2 NA 3 2 2 2
19 NA 4 5 1 3 4
20 3 1 2 1 NA 3
Data:
set.seed(752)
mat <- matrix(rbinom(100, 10, .2), nrow = 20)
mat[sample(1:100, 15)] = NA
data.frame(mat) -> df
df$res = apply(df, 1, function(x) sum(unique(na.omit(x)) > 0))
Could you edit your question and clarify why this doesn't solve your problem?
# here is a numeric vector with a bunch of numbers
mtcars$carb
# here is how to limit that vector to only 1-6
mtcars$carb[ mtcars$carb %in% 1:6 ]
# here is how to tabulate that result
table( mtcars$carb[ mtcars$carb %in% 1:6 ] )
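Putting the %in% 1:6 filter together with a per-row unique count might look like the sketch below. The data frame resp and its column names are invented for illustration; they stand in for the T8.RESP ... T13.RESP columns of the question.

```r
# Made-up responses; 0 and 7 stand in for glitch values outside 1-6
resp <- data.frame(T8  = c(1, 0, 3),
                   T9  = c(2, 2, NA),
                   T10 = c(2, 6, 3),
                   T11 = c(0, 7, 3))

trial_cols <- names(resp)

# Per row: keep only values in 1:6 (this drops 0, NA and anything else),
# then count how many distinct values remain
resp$Span <- apply(resp[trial_cols], 1, function(r) length(unique(r[r %in% 1:6])))
resp
```

Unlike a "> 0" test, the %in% 1:6 filter also excludes stray values above 6.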

Number of maximums in each row and more

My dataset contains four numerical variables X1, X2, X3, X4 and an ID column.
ID <- c(1,2,3,4,5,6,7,8,9,10)
X1 <- c(3,1,1,1,2,1,2,1,3,4)
X2 <- c(1,2,1,3,2,2,4,1,2,4)
X3 <- c(1,1,1,3,2,3,3,2,1,4)
X4 <- c(1,4,1,1,1,4,3,1,4,4)
Mydata <- data.frame(ID, X1,X2,X3,X4)
I need to create two more columns: 1) Max, and 2) Var
1) Max column: For each row that has ONLY ONE maximum, I need to save this 'max' value in the Max variable. If the row has more than one maximum, the Max value should be 999.
2) Var column: For the rows with only one maximum, I need to know whether it was X1, X2, X3, or X4.
For the above dataset, here is the output:
ID X1 X2 X3 X4 Max Var
1 3 1 1 1 3 X1
2 1 2 1 4 4 X4
3 1 1 1 1 999 NA
4 1 3 3 1 999 NA
5 2 2 2 1 999 NA
6 1 2 3 4 4 X4
7 2 4 3 3 4 X2
8 1 1 2 1 2 X3
9 3 2 1 4 4 X4
10 4 4 4 4 999 NA
We could get the column names of the 'Mydata' for the maximum value in each row (excluding the 'ID' column) using max.col ('Var'), and the maximum value per row with pmax ('Max'). Create a logical index for rows that have more than one maximum value ('indx') and use it with ifelse to get the expected output.
Var <- names(Mydata[-1])[max.col(Mydata[-1])]
Max <- do.call(pmax,Mydata[-1])
indx <- rowSums(Mydata[-1]==Max)>1
transform(Mydata, Var= ifelse(indx, NA, Var), Max=ifelse(indx, 999, Max))
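A self-contained version of this approach is sketched below. One assumption is added: ties.method = "first" is passed to max.col(), whose default breaks ties at random. The result does not change, because tied rows are overwritten with NA anyway, but every intermediate value becomes reproducible. The names vals and tied are chosen for this example.

```r
Mydata <- data.frame(ID = 1:10,
                     X1 = c(3, 1, 1, 1, 2, 1, 2, 1, 3, 4),
                     X2 = c(1, 2, 1, 3, 2, 2, 4, 1, 2, 4),
                     X3 = c(1, 1, 1, 3, 2, 3, 3, 2, 1, 4),
                     X4 = c(1, 4, 1, 1, 1, 4, 3, 1, 4, 4))

vals <- Mydata[-1]                    # value columns only
Max  <- do.call(pmax, vals)           # row-wise maximum
tied <- rowSums(vals == Max) > 1      # TRUE where the maximum occurs more than once

Mydata$Max <- ifelse(tied, 999, Max)
Mydata$Var <- ifelse(tied, NA, names(vals)[max.col(vals, ties.method = "first")])
Mydata
```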
Here's another possible apply solution
MyFunc <- function(x){
  Max <- max(x)
  if(sum(x == Max) > 1L) {
    Max <- 999
    Var <- NA
  } else {
    Var <- which.max(x)
  }
  c(Max, Var)
}
Mydata[c("Max", "Var")] <- t(apply(Mydata[-1], 1, MyFunc))
# ID X1 X2 X3 X4 Max Var
# 1 1 3 1 1 1 3 1
# 2 2 1 2 1 4 4 4
# 3 3 1 1 1 1 999 NA
# 4 4 1 3 3 1 999 NA
# 5 5 2 2 2 1 999 NA
# 6 6 1 2 3 4 4 4
# 7 7 2 4 3 3 4 2
# 8 8 1 1 2 1 2 3
# 9 9 3 2 1 4 4 4
# 10 10 4 4 4 4 999 NA
I would break this down into some small steps, which may not be the most efficient but would at least give you a starting point to work from if efficiency were an issue for your real problem.
First, compute the row maxes:
maxs <- apply(Mydata[, -1], 1, max)
> maxs
[1] 3 4 1 3 2 4 4 2 4 4
Next, compute which values in each row equal the maximum:
wMax <- apply(Mydata[, -1], 1, function(x) which(x == max(x)))
This gives a list, which we can sapply() over to get the number of values equalling the maximum:
nMax <- sapply(wMax, length)
> nMax
[1] 1 1 4 2 3 1 1 1 1 4
Now add the Max & Var columns:
Mydata$Max <- ifelse(nMax > 1L, 999, maxs)
Mydata$Var <- ifelse(nMax > 1L, NA, sapply(wMax, `[[`, 1))
> Mydata
ID X1 X2 X3 X4 Max Var
1 1 3 1 1 1 3 1
2 2 1 2 1 4 4 4
3 3 1 1 1 1 999 NA
4 4 1 3 3 1 999 NA
5 5 2 2 2 1 999 NA
6 6 1 2 3 4 4 4
7 7 2 4 3 3 4 2
8 8 1 1 2 1 2 3
9 9 3 2 1 4 4 4
10 10 4 4 4 4 999 NA
This isn't going to win any prizes for elegant use of the language, but it works and you can build off of it.
(That last line creating Var needs a little explanation: wMax is actually a list. We want the first element of each component of that list (because those will be the only maximums), and the sapply() call produces that.)
Now we can write a function that incorporates all the steps for you:
MaxVar <- function(x, na.rm = FALSE) {
  ## compute `max`
  maxx <- max(x, na.rm = na.rm)
  ## which equal the max (reuse maxx so that na.rm is respected)
  wmax <- which(x == maxx)
  ## how many equal the max
  nmax <- length(wmax)
  ## return
  out <- if(nmax > 1L) {
    c(999, NA)
  } else {
    c(maxx, wmax)
  }
  out
}
And use it like this:
> new <- apply(Mydata[, -1], 1, MaxVar)
> new
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
[1,] 3 4 999 999 999 4 4 2 4 999
[2,] 1 4 NA NA NA 4 2 3 4 NA
> Mydata <- cbind(Mydata, Max = new[1, ], Var = new[2, ])
> Mydata
ID X1 X2 X3 X4 Max Var
1 1 3 1 1 1 3 1
2 2 1 2 1 4 4 4
3 3 1 1 1 1 999 NA
4 4 1 3 3 1 999 NA
5 5 2 2 2 1 999 NA
6 6 1 2 3 4 4 4
7 7 2 4 3 3 4 2
8 8 1 1 2 1 2 3
9 9 3 2 1 4 4 4
10 10 4 4 4 4 999 NA
Again, not the most elegant or efficient of code, but it works and it's easy to see what it is doing.
Yet another way to do this using apply
Mydata$Max = apply(Mydata[,-1], 1,
function(x){ m = max(x); ifelse(m != max(x[duplicated(x)]), m, 999)})
Mydata$Var = apply(Mydata[,-1], 1,
function(x){ index = which.max(x); ifelse(index != 5, names(x)[index], NA)})
#> Mydata
#ID X1 X2 X3 X4 Max Var
#1 1 3 1 1 1 3 X1
#2 2 1 2 1 4 4 X4
#3 3 1 1 1 1 999 <NA>
#4 4 1 3 3 1 999 <NA>
#5 5 2 2 2 1 999 <NA>
#6 6 1 2 3 4 4 X4
#7 7 2 4 3 3 4 X2
#8 8 1 1 2 1 2 X3
#9 9 3 2 1 4 4 X4
#10 10 4 4 4 4 999 <NA>
