R - work on data frame rows based on condition

I'm trying to understand how I can work on the rows of a data frame based on a condition.
Given a data frame like this
> d<-data.frame(x=c(0,1,2,3), y=c(1,1,1,0))
> d
x y
1 0 1
2 1 1
3 2 1
4 3 0
How can I add 1 to every value in all rows that contain a zero (note that zeros can be found in any column), so that the result looks like this:
x y
1 1 2
2 1 1
3 2 1
4 4 1
The following code seems to do part of the job, but is just printing the rows where the action was taken, the number of times it was taken (2)...
> for(i in 1:nrow(d)){
+ d[d[i,]==0,]<-d[i,]+1
+ }
> d
x y
1 1 2
2 4 1
3 1 2
4 4 1
I'm sure there is a simple solution for this, maybe an apply function, but I'm not getting there.
Thanks.

Some possibilities:
# 1
idx <- which(d == 0, arr.ind = TRUE)[, 1]
d[idx, ] <- d[idx, ] + 1
# 2
t(apply(d, 1, function(x) x + any(x == 0)))
# 3
d + apply(d == 0, 1, max)
The usage of which for vectors, e.g. which(1:3 > 2), is quite common, whereas it is used less for matrices: by specifying arr.ind = TRUE what we get is array indices, i.e. coordinates of every 0:
which(d == 0, arr.ind = TRUE)
row col
[1,] 1 1
[2,] 4 2
Since we are interested only in rows where zeros occur, I take the first column of which(d == 0, arr.ind = TRUE) and add 1 to all the elements in these rows by d[idx, ] <- d[idx, ] + 1.
Regarding the second approach, apply(d, 1, function(x) x) would be simply going row by row and returning the same row without any modifications. By any(x == 0) we check whether there are any zeros in a particular row and get TRUE or FALSE. However, by writing x + any(x == 0) we transform TRUE or FALSE to 1 or 0, respectively, as required.
Now the third approach. d == 0 is a logical matrix, and we use apply to go over its rows. Then when applying max to a particular row, we again transform TRUE, FALSE to 1, 0 and find a maximal element. This element is 1 if and only if there are any zeros in that row. Hence, apply(d == 0, 1, max) returns a vector of zeros and ones. The final point is that when we write A + b, where A is a matrix and b is a vector, the addition is column-wise. In this way, by writing d + apply(d == 0, 1, max) we add apply(d == 0, 1, max) to every column of d, as needed.
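As a further sketch in the same spirit (not one of the three approaches above), the row condition can also be computed with rowSums on the logical matrix d == 0 and then used as a logical row index. The name has_zero is only illustrative, and the sketch assumes d contains no NA values.

```r
# Rebuild the example data frame from the question
d <- data.frame(x = c(0, 1, 2, 3), y = c(1, 1, 1, 0))

# TRUE for every row that contains at least one zero
has_zero <- rowSums(d == 0) > 0

# Add 1 to all values in those rows only
d[has_zero, ] <- d[has_zero, ] + 1
d
```

Unlike the apply-based versions, this modifies d in place and keeps it a data frame.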

Related

How can I extract (values of yes) in vector from ifelse function

I want to extract, as a vector, only the values that come from the yes branch of the ifelse function
for example
X=rnorm(6,1,1)
y<- ifelse(X>0,yes=1,no=2)
#I get
1 1 2 1 1 1
#Also if I use for loop
X<-matrix(rcauchy(15*5,0,1),5)
p<-vector()
for(i in 1:5){
p[i]<- ifelse (shapiro.test(X[i,])$p.value>=0.05,yes=t.test(X[i,], alternative = "two.sided")$p.value, no= wilcox.test(X[i,], mu = 0, alternative = "two.sided")$p.value)
}
p
How can I extract (values of yes) from the total result?
Maybe you want something like:
set.seed(7)
X <- matrix(rcauchy(15*5,0,1),5)
i <- apply(X, 1, \(y) shapiro.test(y)$p.value>=0.05)
apply(X[i,], 1, \(y) t.test(y, alternative = "two.sided")$p.value)
#[1] 0.5011835 0.6214762 0.0983801
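If the full vector of p-values is still needed, one possible variation is to store the Shapiro-Wilk decision separately and use it both to choose the test and to pick out the "yes" results afterwards. This is only a sketch: the names is_normal and p_yes are made up for illustration, and the 0.05 threshold is taken from the question.

```r
set.seed(7)
X <- matrix(rcauchy(15 * 5, 0, 1), 5)

# One logical per row: TRUE where the "yes" branch (t-test) applies
is_normal <- apply(X, 1, function(r) shapiro.test(r)$p.value >= 0.05)

# Compute the p-value for every row, choosing the test by that flag
p <- numeric(nrow(X))
for (i in seq_len(nrow(X))) {
  p[i] <- if (is_normal[i]) {
    t.test(X[i, ], alternative = "two.sided")$p.value
  } else {
    wilcox.test(X[i, ], mu = 0, alternative = "two.sided")$p.value
  }
}

# Only the values that came from the "yes" branch
p_yes <- p[is_normal]
```

A plain if/else is used inside the loop because the condition is a single value per iteration; ifelse is meant for vectors.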
Try.
set.seed(42)
X <- rnorm(6)
1[+(X > 0)]
#[1] 1 1 1 1
rep(1, sum(X>0))
#[1] 1 1 1 1
To get the values.
X[X > 0]
#[1] 0.7212112 0.8666787 1.6359504 0.7157471
To get values from another vector.
seq_along(X)[X > 0]
#[1] 2 3 4 5
Some also use which.
which(X>0)
#[1] 2 3 4 5
X[which(X>0)]
#[1] 0.7212112 0.8666787 1.6359504 0.7157471
To keep your ifelse statement, you can use it as an index by comparing its result to 1, which creates a logical vector. Then use [ to subset the original vector, i.e.
X[ifelse(X>0,yes=1,no=2) == 1]
#[1] 1.8418504 1.0860513 2.5771326 0.5809096 1.1458737 1.5731607

Flag rows in matrix that contain the same set of values

I have a matrix of integers
m <- rbind(c(1,2),
c(3,6),
c(5,1),
c(2,1),
c(6,3))
and I am looking for a function that takes this matrix as input and outputs a vector flag with length(flag) == nrow(m) that assigns the same unique (let's say integer) value to all rows that contain the same set of integers.
For the above example, the desired output would be:
flag <- c(1, 2, 3, 1, 2)
So rows 1 and 4 in m get the same flag 1, because they both contain the same set of integers, in this case {1, 2}. Similarly, rows 2 and 5 get the same flag.
The solution should work for any number of columns.
The only thing I could come up with is the following approach ...
FlagSymmetric <- function(x) {
vec_sim <- rep(NA, nrow(x)) # object containing flags
ind_ord <- ncol(x)
counter <- 1
for(i in 1:nrow(x)) {
if(is.na(vec_sim[i])) { # if that row is not flagged yet, proceed ...
vec_sim[i] <- counter # ... and give the next free flag
for(j in (i+1):nrow(x)) {
if( (i+1) > nrow(x) ) next # in case of tiny matrices
ind <- x[j, ] %in% x[i, ]
if(sum(ind)==ind_ord) vec_sim[j] <- counter # if the same, assign flag
}
counter <- counter + 1
}
}
return(vec_sim)
}
... which does what I want:
> FlagSymmetric(m)
[1] 1 2 3 1 2
If n = nrow(m) this needs 1/2 n^2 operations. Of course, I could make it much quicker by writing this in C++, but this only alleviates my problem to some extent, because I am working with matrices with a potentially huge number of rows.
I guess there must be a smarter way of doing this.
EDIT:
Additional, more general example (sorting row and pasting to character string not possible):
m2 <- rbind(c(1,112),
c(11,12),
c(12,11),
c(112,1),
c(6,3))
flag2 <- c(1, 2, 2, 1, 3) # desired output
FlagSymmetric(m2) # works
[1] 1 2 2 1 3
This assumes you only have numeric data in your matrix.
First, convert the matrix to a data frame:
m <- data.frame(m)
We can sort every row and paste its values together, then convert the result to a factor and then to numeric to get a unique number for every combination:
m$flag <- as.numeric(factor(apply(m, 1, function(x) paste0(sort(x), collapse = ""))))
m
# X1 X2 flag
#1 1 2 1
#2 3 6 3
#3 5 1 2
#4 2 1 1
#5 6 3 3
EDIT
The above solution does not work for every combination, as the new example shows: with an empty separator, c(1, 112) and c(11, 12) both collapse to "1112". To keep the numbers distinct, as @d.b commented, we can use any non-empty collapse argument. For the updated example,
as.numeric(factor(apply(m2, 1, function(x) paste0(sort(x), collapse = "-"))))
#[1] 1 2 2 1 3
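Note that factor numbers the groups in the sorted order of the pasted strings, which is why the first example gives 1 3 2 1 3 rather than the 1 2 3 1 2 asked for. If flags in order of first appearance are wanted, a possible sketch is to replace factor with match against the unique keys. The names key and flag are illustrative, and sort() treats each row as a multiset; use sort(unique(r)) if repeated values within a row should count once.

```r
m <- rbind(c(1, 2),
           c(3, 6),
           c(5, 1),
           c(2, 1),
           c(6, 3))

# One order-independent key per row, with a separator to avoid collisions
key <- apply(m, 1, function(r) paste(sort(r), collapse = "-"))

# Number the keys in order of first appearance
flag <- match(key, unique(key))
flag
#[1] 1 2 3 1 2
```

This works for any number of columns and avoids the pairwise row comparison of the loop in the question.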

R: Compute number of rows in data frame that have 0 colSums for specific columns using a function

I have a data frame with n rows and m columns where m > 30.
My first column is an age variable and the rest are medical conditions that are either on or off (binary).
Now I would like to compute the number of observations where none of the medical conditions is switched on i.e. the number of healthy patients. I thought I could use the rowSums function to count observations wherever the row sum is zero (of course excluding the age variable) but I tried some functions and did not succeed.
Here is an example how it could work but always involving a lot of AND / OR statements which is not practical. I was looking for a non-loop solution.
example <- as.data.frame(matrix(data=c(40,1,1,1,36,1,0,1,56,0,0,1,43,0,0,0), nrow=4, ncol=4,
byrow=TRUE, dimnames=list(c("row1","row2","row3", "row4"),c("Age","x","y","z"))))
Two impractical alternatives to arrive at desired outcome:
nrow(subset(example, x==0 & y==0 & z==0))
table(example$x==0 & example$y==0 & example$z==0)
What I actually wanted is something like this:
nrow(example[rowSums(example[,2:ncol(example)])==0])
You can use
apply(example[, -1], MARGIN = 1, FUN = function(x) all(x == 0))
## row1 row2 row3 row4
## FALSE FALSE FALSE TRUE
Here you are applying FUN to every row of example[, -1]. It gives you a logical vector indicating which rows satisfy the condition that all of the variables in that row are equal to 0. This is done by calling all inside the function passed as FUN.
You can use this result to get the rows for all healthy patients, or those with at least one unhealthy condition.
example[apply(example[, -1], MARGIN = 1, FUN = function(x) all(x == 0)), ]
## Age x y z
## row4 43 0 0 0
example[!apply(example[, -1], MARGIN = 1, FUN = function(x) all(x == 0)), ]
## Age x y z
## row1 40 1 1 1
## row2 36 1 0 1
## row3 56 0 0 1
You can then count the healthy rows, or the others, as below
# healthy rows
sum(apply(example[, -1], MARGIN = 1, FUN = function(x) all(x == 0)))
## [1] 1
# rows with at least one unhealthy condition
sum(!apply(example[, -1], MARGIN = 1, FUN = function(x) all(x == 0)))
## [1] 3
You just want the total numbers of observations/rows that satisfy this condition right? Then you can use -
nrow(example[example$x==0 & example$y==0 & example$z==0,])
Else, if you want to use rowSums, this will work -
nrow(example[rowSums(example[,2:4])==0,])
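Since only the count is needed, the row subsetting can be skipped and the logical vector summed directly. The sketch below rebuilds the example data in a slightly different but equivalent way; it assumes the condition columns are coded 0/1 and contain no NA values, and the name healthy is illustrative.

```r
example <- data.frame(Age = c(40, 36, 56, 43),
                      x   = c(1, 1, 0, 0),
                      y   = c(1, 0, 0, 0),
                      z   = c(1, 1, 1, 0),
                      row.names = c("row1", "row2", "row3", "row4"))

# Drop the Age column by position so any number of condition columns works
healthy <- rowSums(example[, -1]) == 0

# TRUE counts as 1, so the sum is the number of healthy patients
sum(healthy)
## [1] 1
```

Using example[, -1] instead of 2:4 avoids hard-coding the number of condition columns.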

R set value in dependency of another value

For each row of my dataframe, I want to calculate a value from numbers taken from columns of this dataframe. If the calculated value is above 2, I want to set another columns value to 0, else to 1.
x=(df$firstnumber+df$secondnumber)/2
if(x>2){
df$binaryValue=0}
else{ df$binaryValue=1}
this throws the error
the condition has length > 1 and only the first element will be used
because x is a vector
How can I solve this? One way would be to write this as a function and to apply it to the dataframe - are there any other options?
Also, how could I write this to work with apply()?
Thanks in advance
You could simply do...
df$BinaryValue <- ifelse( x > 2 , 0 , 1 )
So you get...
df <- data.frame( x = 1:5 , y = -2:2 )
x <- df$x + df$y
df$BinaryValue <- ifelse( x > 2 , 0 , 1 )
df
# x y BinaryValue
# 1 1 -2 1
# 2 2 -1 1
# 3 3 0 0
# 4 4 1 0
# 5 5 2 0
Alternatively:
transform(df, BinaryValue = as.numeric(firstnumber + secondnumber <= 4))
There's no need to divide by two in the first place: the mean is above 2 exactly when the sum is above 4. Since you want 0 in that case and 1 otherwise, test whether the sum is at most four. The function as.numeric converts the logical result to numeric values (TRUE to 1, FALSE to 0).
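To check the direction of the comparison, here is a small sketch on made-up data (the values are illustrative, the column names follow the question). It computes the flag with ifelse, with a comparison on the sum, and with apply, which the question also asked about; all three should agree.

```r
df <- data.frame(firstnumber = c(1, 3, 5), secondnumber = c(2, 1, 4))

# Vectorised version: 0 when the mean is above 2, otherwise 1
x <- (df$firstnumber + df$secondnumber) / 2
df$binaryValue <- ifelse(x > 2, 0, 1)

# Same result without dividing: mean > 2 is the same as sum > 4
via_sum <- as.numeric(df$firstnumber + df$secondnumber <= 4)

# Row-wise version with apply(), restricted to the two input columns
via_apply <- apply(df[, c("firstnumber", "secondnumber")], 1,
                   function(r) if (mean(r) > 2) 0 else 1)

df$binaryValue
# [1] 1 1 0
```

The apply version works because each call sees a single row, so the condition inside if has length one; the vectorised forms are usually preferable.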

Finding the column number and value of the second highest value in a row

I am trying to write some code which identifies the greatest two values for each row and provides their column number and value.
df = data.frame( car = c (2,1,1,1,0), bus = c (0,2,0,1,0),
walk = c (0,3,2,0,0), bike = c(0,4,0,0,1))
I've managed to get it to do this for the maximum value using the max and max.col functions.
df$max = max.col(df,ties.method="first")
df$val = apply(df[ ,1:4], 1, max)
As far as I know there are no equivalent functions for the second highest value so doing this has made things a little trickier. Using this code provides the second highest value but (importantly) not in situations with ties. Also it looks risky.
sec.fun <- function (x) {
max( x[x!=max(x)] )
}
df$val2 <- apply(df[ ,1:4], 1, sec.fun)
Ideally the solution would not involve removing any original data and could be used to find the third, fourth... highest value but neither of these are essential requirements.
try this:
# a function that returns the position of n-th largest
maxn <- function(n) function(x) order(x, decreasing = TRUE)[n]
This is a closure, so you can use it like this:
> # position of the largest
> apply(df, 1, maxn(1))
[1] 1 4 3 1 4
> # position of the 2nd largest
> apply(df, 1, maxn(2))
[1] 2 3 1 2 1
>
> # value of the largest
> apply(df, 1, function(x)x[maxn(1)(x)])
[1] 2 4 2 1 1
> # value of the 2nd largest
> apply(df, 1, function(x)x[maxn(2)(x)])
[1] 0 3 1 1 0
Updated
Why use a closure here?
One reason is that you can define functions such as:
max2 <- maxn(2)
max3 <- maxn(3)
and then use them:
> apply(df, 1, max2)
[1] 2 3 1 2 1
> apply(df, 1, max3)
[1] 3 2 2 3 2
I'm not sure the advantage is obvious, but I like this approach since it is more functional in style.
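If both the column number and the value are wanted in one pass, a possible variation is a helper that returns the pair for the n-th largest entry. This is a sketch rather than part of the answer above: nth_largest is a hypothetical name, and ties are resolved by column order because order() keeps tied elements in their original order.

```r
df <- data.frame(car  = c(2, 1, 1, 1, 0), bus  = c(0, 2, 0, 1, 0),
                 walk = c(0, 3, 2, 0, 0), bike = c(0, 4, 0, 0, 1))

# Position and value of the n-th largest element of a vector
nth_largest <- function(x, n) {
  pos <- order(x, decreasing = TRUE)[n]
  c(col = pos, val = x[[pos]])
}

# One row per row of df, with columns "col" and "val"
second <- t(apply(df, 1, nth_largest, n = 2))
second[, "col"]
# [1] 2 3 1 2 1
second[, "val"]
# [1] 0 3 1 1 0
```

Passing n = 3 gives the third largest in the same way, and df itself is left unchanged.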
