How to reference all other columns in R?

I am working with data similar to the data below:
ID <- c("A", "B", "C", "D", "E")
x1 <- c(1,1,1,1,0)
x2 <- c(0,0,1,2,2)
x3 <- c(0,0,0,0,0)
x4 <- c(0,0,0,0,0)
df <- data.frame(ID, x1, x2, x3, x4)
It looks like:
> df
  ID x1 x2 x3 x4
1  A  1  0  0  0
2  B  1  0  0  0
3  C  1  1  0  0
4  D  1  2  0  0
5  E  0  2  0  0
I want to create a new column based on a conditional statement: if x1 == 1 and all the other columns are equal to 0, then the row is coded "Positive".
How can I reference all the other columns besides x1 without having to write out the rest of the columns in the conditional statement?

Base R:
df$new <- ifelse(df$x1 == 1 &                  ## check x1 condition
                 rowSums(df[, 3:5] != 0) == 0, ## all other columns are 0
                 "Positive",
                 "not_Positive")
The second line is a little tricky.
df[,3:5] (or df[,-(1:2)]) selects all the columns except the first two. You could also use subset(df,select=x2:x4) here (although ?subset says "Warning: This is a convenience function intended for use interactively ...")
!=0 tests whether the values are 0 or not, returning TRUE or FALSE
rowSums() adds up the values by row (FALSE → 0, TRUE → 1)
the row sum is zero only if every logical value in that row is FALSE, i.e. none of the values differ from zero
If there might be NA values, you'll need na.rm = TRUE in your rowSums() call
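For example, a minimal sketch of the NA-safe variant (df2 is a hypothetical copy of df with one missing value introduced):
df2 <- df
df2$x3[2] <- NA  # hypothetical missing value
df2$new <- ifelse(df2$x1 == 1 &
                  rowSums(df2[, 3:5] != 0, na.rm = TRUE) == 0,
                  "Positive",
                  "not_Positive")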

With select(), we have multiple options. The one below uses a range (:): the columns selected run from 'x2' to 'x4', in that order. If we want to select based on a pattern instead, we can use matches("^x[2-9]$") (see the variant after the output below).
The code below creates a logical condition on the single column 'x1', combines it via & with a rowSums() check over the remaining columns, and passes the result to case_when() as the lhs of a two-sided formula, with the replacement value as the rhs of the ~. By default, all elements that don't satisfy the condition become NA.
library(dplyr)
df %>%
  mutate(new = case_when(x1 == 1 &
    rowSums(select(., x2:x4) != 0) == 0 ~ 'Positive'))
# ID x1 x2 x3 x4 new
#1 A 1 0 0 0 Positive
#2 B 1 0 0 0 Positive
#3 C 1 1 0 0 <NA>
#4 D 1 2 0 0 <NA>
#5 E 0 2 0 0 <NA>
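If the columns follow a naming pattern instead of sitting in a contiguous range, the same logic works with the matches() helper mentioned above (a sketch, assuming the same df):
df %>%
  mutate(new = case_when(x1 == 1 &
    rowSums(select(., matches("^x[2-9]$")) != 0) == 0 ~ 'Positive'))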

Related

How can I use vectorisation in R to change a DF value based on a condition?

Suppose I have the following DF:
C1 C2
 0  0
 1  1
 1  1
 0  0
 .  .
 .  .
I now want to apply these following conditions on the Dataframe:
The value for C1 should be 1
A random integer between 0 and 5 should be less than 2
If both these conditions are true, I change the C1 and C2 values for that row to 2
I understand this can be done by using the apply function, and I have used the following:
C1 <- c(0, 1,1,0,1,0,1,0,1,0,1)
C2 <- c(0, 1,1,0,1,0,1,0,1,0,1)
df <- data.frame(C1, C2)
fun <- function(x){
  if (sample(0:5, 1) < 2){
    x[1:2] <- 2
  }
  return(x)
}
index <- df$C1 == 1                            # first condition
processed_Df <- t(apply(df[index, ], 1, fun))  # applies second condition
df[index,] <- processed_Df
Output:
C1 C2
 0  0
 2  2
 1  1
 0  0
 .  .
 .  .
Some rows have both conditions met, some don't (this is the main functionality I would like to achieve).
Now I want to achieve the same using vectorisation, without loops or the apply function. The only confusion I have is: if I don't use apply, won't each row get the same result, based on a single draw of the condition? For example:
df$C1 <- ifelse(df$C1==1 & sample(0:5, 1) < 5, 2, df$C1)
This changes all the rows in my DF with C1 == 1 to 2, when there should possibly be many 1's left.
Is there a way to get different results for the second condition for each row without using the apply function? Hopefully my question makes sense.
Thanks
You need to sample the values nrow(df) times, once per row. Try this method:
set.seed(167814)
df[df$C1 == 1 & sample(0:5, nrow(df), replace = TRUE) < 2, ] <- 2
df
# C1 C2
#1 0 0
#2 2 2
#3 2 2
#4 0 0
#5 1 1
#6 0 0
#7 2 2
#8 0 0
#9 1 1
#10 0 0
#11 1 1
Here is a fully vectorized way. Create the logical index index just like in the question. Then sample all random integers r in one call to sample. Replace in place based on the conjunction of the index and the condition r < 2.
x <- 'C1 C2
0 0
1 1
1 1
0 0'
df1 <- read.table(textConnection(x), header = TRUE)
set.seed(1)
index <- df1$C1 == 1
r <- sample(0:5, length(index), TRUE)
df1[index & r < 2, c("C1", "C2")] <- 2
df1
#> C1 C2
#> 1 0 0
#> 2 1 1
#> 3 2 2
#> 4 0 0
Created on 2022-05-11 by the reprex package (v2.0.1)
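For completeness, a hedged dplyr variant of the same idea (assumes dplyr is available; hit is a hypothetical helper column name). The key point is unchanged: the sample is drawn once per row, and the row index is computed before any column is modified:
library(dplyr)
set.seed(1)
df1 %>%
  mutate(hit = C1 == 1 & sample(0:5, n(), replace = TRUE) < 2,  # one draw per row
         across(c(C1, C2), ~ ifelse(hit, 2, .x))) %>%
  select(-hit)  # drop the helper column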

Changing values in dataframe iterating over all rows and multiple columns

I need to change some values in my dataframe iterating over rows. For each row, if there is a 1 in some column I need to change 0 values in other columns to NA.
I have a code that works, but is super slow when using a bigger dataset.
data = data.frame(id=c("A","B","C"),V1=c(1,0,0),V2=c(0,0,0),V3=c(1,0,1))
cols = names(data)[2:4]
for (i in 1:nrow(data)){
  if(any(data[i, cols] == 1)){
    data[i, cols][data[i, cols] == 0] <- NA
  }
}
I have an example data set
data
id V1 V2 V3
1 A 1 0 1
2 B 0 0 0
3 C 0 0 1
and the expected (and the actual) result is
data
id V1 V2 V3
1 A 1 NA 1
2 B 0 0 0
3 C NA NA 1
How can I write this in a more optimal way?
A one-liner can be:
data[rowSums(data[-1]) > 0,] <- replace(data[rowSums(data[-1]) > 0,],
data[rowSums(data[-1]) > 0,] == 0,
NA)
data
# id V1 V2 V3
#1 A 1 NA 1
#2 B 0 0 0
#3 C NA NA 1
To avoid evaluating the same expression over and over again, we can define it first, i.e.
v1 <- rowSums(data[-1]) > 0
data[v1,] <- replace(data[v1,],
data[v1,] == 0,
NA)
It is easy with dplyr, assuming you want to change values in the V1 and V2 columns based on values in V3. We specify the columns whose values we want to change in mutate_at() and, in the funs() argument, the condition under which to change them (a modern across() variant follows the output below).
library(dplyr)
data %>% mutate_at(vars(V1:V2), funs(replace(., V3 == 1 & . == 0, NA)))
# id V1 V2 V3
#1 A 1 NA 1
#2 B 0 0 0
#3 C NA NA 1
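In newer dplyr versions, funs() is deprecated; an equivalent sketch with across() (dplyr >= 1.0, same logic):
data %>% mutate(across(V1:V2, ~ replace(.x, V3 == 1 & .x == 0, NA)))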
We can do this in base R by creating a logical vector with rowSums and then updating the numeric columns based on this index:
i1 <- rowSums(data[-1] == 1) > 0
data[-1][i1,] <- NA^ !data[-1][i1,]
data
# id V1 V2 V3
#1 A 1 NA 1
#2 B 0 0 0
#3 C NA NA 1
If the index needs to be based on a single column, say 'V3', change 'i1' to
i1 <- data$V3 == 1
and update the other numeric columns after subsetting the rows with 'i1'. The negation (!) creates a logical matrix that is TRUE for 0 values and FALSE for everything else; NA^ on that logical matrix then returns NA for TRUE and 1 for FALSE. As there are only binary values here, this updates the data correctly (see the small demonstration below):
data[i1, 2:3] <- NA^!data[i1, 2:3]
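A tiny demonstration of the NA^! trick on a vector (same idea as the matrix case):
NA^!c(0, 1, 1, 0)
#[1] NA  1  1 NA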

Selecting columns based on row values in multiple columns using dplyr

I am trying to select columns where at least one row equals 1, only if the same row also has a certain value in a second column. I would prefer to achieve this using dplyr, but any computationally efficient solution is welcome.
Example:
Select columns among a1, a2, a3 containing at least one row where the value is 1 AND where column b=="B"
Example data:
rand <- function(S) {set.seed(S); sample(x = c(0,1),size = 3, replace=T)}
df <- data.frame(a1=rand(1),a2=rand(2),a3=rand(3),b=c("A","B","A"))
Input data:
a1 a2 a3 b
1 0 0 0 A
2 0 1 1 B
3 1 1 0 A
Desired output:
a2 a3
1 0 0
2 1 1
3 1 0
I managed to obtain the correct output with the following code; however, this is a very inefficient solution, and I need to run it on a very large dataframe (365,000 rows × 314 columns).
df %>% select_if(function(x) any(paste0(x,.$b) == '1B'))
A solution, not using dplyr:
df[sapply(df[df$b == "B", ], function(x) 1 %in% x)]  # keep columns containing a 1 in the b == "B" rows
Here is my dplyr solution:
ids <- df %>%
  reshape2::melt(id.vars = "b") %>%
  filter(value == 1 & b == "B") %>%
  select(variable)
df[, unlist(ids)]
# a2 a3
#1 0 0
#2 1 1
#3 1 0
As suggested by @docendo-discimus, it is easier to convert to long format.
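On the question's select_if() approach: select_if() is superseded in newer dplyr; an equivalent sketch uses where() with the same predicate (dplyr >= 1.0 assumed):
df %>% select(where(function(x) is.numeric(x) && any(x == 1 & df$b == "B")))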

How to add an aggregated variable to an existing dataset in R

How do you add a variable to a dataset using the aggregate and by commands? For example, I have:
num x1
1 1
1 0
2 0
2 0
And I'm looking to create a variable that flags every num group in which any x1 is 1, for example:
num x1 x2
1 1 1
1 0 1
2 0 0
2 0 0
or
num x1 x2
1 1 TRUE
1 0 TRUE
2 0 FALSE
2 0 FALSE
I've tried to use
df$x2 <- aggregate(df$x1, by = list(df$num), FUN = sum)
But I'm getting an error that says the replacement has a different number of rows than the data. Can anyone help?
This can be done by grouping by 'num' and checking whether any element of 'x1' is 1. ave from base R is convenient for this instead of aggregate:
df1$x2 <- with(df1, ave(x1==1, num, FUN = any))
df1$x2
#[1] 1 1 0 0
Or using dplyr: we group by 'num' and create 'x2' by checking whether any 'x1' equals 1. It will be a logical vector unless we wrap it in as.integer to convert to binary (expected output shown below):
library(dplyr)
df1 %>%
  group_by(num) %>%
  mutate(x2 = as.integer(any(x1 == 1)))
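For the example data above (num = 1,1,2,2; x1 = 1,0,0,0), this should return:
# num    x1    x2
#   1     1     1
#   1     0     1
#   2     0     0
#   2     0     0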

Counting occurrences by row

Imagine I have a data.frame (or matrix) with few different values such as this
test <- data.frame(replicate(10,sample(c(-1,0,1),20, replace=T, prob=c(0.2,0.2,0.6))))
test2 <- test
If I want to add extra columns with counts I could do:
test2$good <- apply(test,1, function(x) sum(x==1))
test2$bad <- apply(test,1, function(x) sum(x==-1))
test2$neutral <- apply(test,1, function(x) sum(x==0))
But if I had many possible values instead, I would have to write many lines; it wouldn't be elegant.
I've tried with table(), but the output is not easily usable:
apply(test, 1, function(x) table(x))
and there is a big problem: if a row doesn't contain any occurrence of some factor, the result generated by table() has a different length, so the results can't be bound together.
Is there a way to force table() to take that value into account, reporting zero occurrences?
Then I've thought of using do.call or lapply and merge but it's too difficult for me.
I've also read about dplyr count but I have no clue on how to do it.
Could anyone provide a solution with dplyr or tidyr?
PS: What about a data.table solution?
We could melt the dataset to long format after converting to matrix, get the frequency using table and cbind with the original dataset.
library(reshape2)
cbind(test2, as.data.frame.matrix(table(melt(as.matrix(test2))[-2])))
Or use mtabulate on the transpose of 'test2' and cbind with the original dataset.
library(qdapTools)
cbind(test2, mtabulate(as.data.frame(t(test2))))
Or we can use gather/spread from tidyr after creating row id with add_rownames from dplyr
library(dplyr)
library(tidyr)
add_rownames(test2) %>%
  gather(Var, Val, -rowname) %>%
  group_by(rn = as.numeric(rowname), Val) %>%
  summarise(N = n()) %>%
  spread(Val, N, fill = 0) %>%
  bind_cols(test2, .)
You can use rowSums():
test2 <- cbind(test2, sapply(c(-1, 0, 1), function(x) rowSums(test == x)))
Similar to the code in the comment from etienne, but without the call to apply().
Here is the answer using base R.
test <- data.frame(replicate(10,sample(c(-1,0,1),20, replace=T, prob=c(0.2,0.2,0.6))))
testCopy <- test
# find all unique values, note that data frame is a list
uniqVal <- unique(unlist(test))
# the new column names start with Y
for (val in uniqVal) {
  test[paste0("Y", val)] <- apply(testCopy, 1, function(x) sum(x == val))
}
head(test)
# X1 X2 X3 X4 X5 X6 X7 X8 X9 X10 Y-1 Y1 Y0
# 1 -1 0 1 1 1 0 -1 -1 1 1 3 5 2
# 2 1 -1 0 1 1 -1 -1 0 0 1 3 4 3
# 3 -1 0 1 0 1 1 1 1 -1 1 2 6 2
# 4 1 1 1 1 0 1 1 0 1 0 0 7 3
# 5 0 -1 1 -1 -1 0 0 1 0 0 3 2 5
# 6 1 1 0 1 1 1 1 1 1 1 0 9 1
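And since the question asks about data.table, a possible sketch (assumes the data.table package; rid is a hypothetical row-id helper): melt to long format, count each value per row with dcast, and bind the counts back. With fun.aggregate = length, missing row/value combinations are filled with 0 automatically:
library(data.table)
dt <- as.data.table(test)
dt[, rid := .I]                      # row id to group the long data by
counts <- dcast(melt(dt, id.vars = "rid"), rid ~ value, fun.aggregate = length)
cbind(test, counts[, !"rid"])        # drop the helper id before binding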
