Using IFELSE function across multiple columns [closed] - r

I want to create a new column based on multiple columns of different data types:
Names 1   2   3
A     000 NA  030
B     100 DDD NA
C     XXX 000 050
Based on columns 1-3, I want to add another column with the condition: if any value is >= 30 then 1, else 0.
Output will be:
Names 1   2   3   4
A     000 NA  030 1
B     100 DDD NA  1
C     XXX 000 015 0
Note: There are 36 such columns (1-36) to which I want to apply the condition before creating the new column.
Adding some more details:
These variables are extracted from one long string like "030060000XXX010", which was split into 030, 060, 000, XXX, 010. Using the IFELSE condition, if any of the numeric-looking values is >= 30, the result should be 1, else 0.
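The extraction described above can be sketched in base R. This is a minimal sketch that assumes every field is exactly three characters wide, as in the example string; the variable names are illustrative and not from the original post.

```r
s <- "030060000XXX010"

# Start position of each 3-character field (fixed width is an assumption)
starts <- seq(1, nchar(s), by = 3)
chunks <- substring(s, starts, starts + 2)

# Non-numeric fields such as "XXX" become NA; the coercion warning is expected
nums <- suppressWarnings(as.integer(chunks))

# 1 if any numeric-looking field is >= 30, else 0
flag <- as.integer(any(nums >= 30, na.rm = TRUE))
```

For this string the fields are 030, 060, 000, XXX and 010, so the flag is 1.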

Consider using if_any. Loop over the columns other than 'Names', convert each to integer and build the logical condition, replace NA with FALSE, and coerce the logical output of if_any to binary with the unary +.
library(dplyr)
library(tidyr)
df1 %>%
mutate(new = +(if_any(-Names, ~ replace_na(as.integer(.) >= 30, FALSE) ) ))
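For readers without dplyr and tidyr, roughly the same logic can be written in base R. This is a sketch rather than the answer's own method; it rebuilds the question's expected-output table (with hypothetical column names X1 to X3) so that it runs on its own.

```r
# Example data from the question's expected output (column names are assumed)
df1 <- data.frame(
  Names = c("A", "B", "C"),
  X1 = c("000", "100", "XXX"),
  X2 = c(NA, "DDD", "000"),
  X3 = c("030", NA, "015"),
  stringsAsFactors = FALSE
)

# Convert every column except Names to integer; text such as "XXX" becomes NA
vals <- suppressWarnings(sapply(df1[-1], as.integer))

# A row is flagged when at least one non-missing value is >= 30
df1$new <- +(rowSums(vals >= 30, na.rm = TRUE) > 0)
df1
```

Rows A and B are flagged (30 and 100), row C is not, which matches the output requested in the question.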

Since you want to group by 3, one way is to split.default the columns by 3, operate on one three-pack at a time, then combine them later.
I'll demonstrate on the data but repeating the three data columns so that we can show the iteration.
dat <- structure(list(Names = c("A", "B", "C"), X1 = c("000", "100", "XXX"), X2 = c(NA, "DDD", "000"), X3 = c(30L, NA, 50L), X1 = c("000", "100", "XXX"), X2 = c(NA, "DDD", "000"), X3 = c(30L, NA, 50L)), class = "data.frame", row.names = c(NA, -3L))
split.default(dat[,-1], (seq_along(dat)[-1]-2) %/% 3)
# $`0`
# X1 X2 X3
# 1 000 <NA> 30
# 2 100 DDD NA
# 3 XXX 000 50
# $`1`
# X1.1 X2.1 X3.1
# 1 000 <NA> 30
# 2 100 DDD NA
# 3 XXX 000 50
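The grouping index passed to split.default may be easier to follow when computed on its own. The sketch below assumes six data columns after 'Names', as in `dat`; `n_data_cols`, `grp` and `grp_alt` are illustrative names.

```r
n_data_cols <- 6                       # data columns after 'Names' (assumed)

# Mirrors seq_along(dat)[-1]: positions 2..7 of the data columns
idx <- seq_len(n_data_cols + 1)[-1]

# Integer division maps positions 2,3,4 to group 0 and 5,6,7 to group 1
grp <- (idx - 2) %/% 3

# An equivalent, arguably more readable, way to build the same index
grp_alt <- rep(seq_len(n_data_cols / 3) - 1, each = 3)
```

Both vectors are 0 0 0 1 1 1, so each consecutive block of three columns lands in its own list element.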
With this, we'll work on one three-pack at a time.
func <- function(x, lim = 30) {
x <- as.matrix(x)
x <- `dim<-`(suppressWarnings(as.numeric(x)), dim(x))
cbind(x,(+(rowSums(x >= lim, na.rm = TRUE) > 0)))
}
lapply(split.default(dat[,-1], (seq_along(dat)[-1]-2) %/% 3), func)
# $`0`
# [,1] [,2] [,3] [,4]
# [1,] 0 NA 30 1
# [2,] 100 NA NA 1
# [3,] NA 0 50 1
# $`1`
# [,1] [,2] [,3] [,4]
# [1,] 0 NA 30 1
# [2,] 100 NA NA 1
# [3,] NA 0 50 1
Now we just need to recombine them all again:
do.call(cbind, c(list(dat[,1,drop=FALSE]), lapply(split.default(dat[,-1], (seq_along(dat)[-1]-2) %/% 3), func)))
# Names 0.1 0.2 0.3 0.4 1.1 1.2 1.3 1.4
# 1 A 0 NA 30 1 0 NA 30 1
# 2 B 100 NA NA 1 100 NA NA 1
# 3 C NA 0 50 1 NA 0 50 1

Related

Rank order row values in R while keeping NA values

I'm trying to convert values in a data frame to rank order values by row. So take this:
df = data.frame(A = c(10, 20, NA), B = c(NA, 10, 20), C = c(20, NA, 10))
When I do this:
t(apply(df, 1, rank))
I get this:
[1,] 1 3 2
[2,] 2 1 3
[3,] 3 2 1
But I want the NA values to continue showing as NA, like so:
[1,] 1 NA 2
[2,] 2 1 NA
[3,] NA 2 1
Try setting the na.last argument to "keep":
t(apply(df, 1, rank, na.last='keep'))
Output:
A B C
[1,] 1 NA 2
[2,] 2 1 NA
[3,] NA 2 1
As mentioned in the documentation of rank:
na.last:
for controlling the treatment of NAs. If TRUE, missing values in the data are put last; if FALSE, they are put first; if NA, they are removed; if "keep" they are kept with rank NA.
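A small illustration of the four na.last settings quoted above, on a hypothetical three-element vector:

```r
x <- c(10, NA, 20)

r_last  <- rank(x, na.last = TRUE)    # NA ranked last:  1 3 2
r_first <- rank(x, na.last = FALSE)   # NA ranked first: 2 1 3
r_drop  <- rank(x, na.last = NA)      # NA removed:      1 2
r_keep  <- rank(x, na.last = "keep")  # NA kept as NA:   1 NA 2
```

Only "keep" preserves both the length of the input and the position of the missing value, which is why it suits the row-wise case in the question.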
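Here is a dplyr approach. Note that it ranks within each column rather than within each row; for this example data the result happens to coincide with the row-wise ranks.
Libraries
library(dplyr)
Data
df <- tibble(A = c(10, 20, NA), B = c(NA, 10, 20), C = c(20, NA, 10))
Code
df %>%
mutate(across(everything(), ~ rank(x = ., na.last = "keep")))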
Output
# A tibble: 3 x 3
A B C
<dbl> <dbl> <dbl>
1 1 NA 2
2 2 1 NA
3 NA 2 1

How to verify if when a column is NA the other is not?

I have a data frame with two columns. I need to check whether, wherever one column is NA, the other is not. Thanks
Edited.
I would like to know, for each row of the data frame, whether both columns are non-NA.
You can use the following code to check which rows have no NA values:
df <- data.frame(x = c(1, NA),
y = c(2, NA))
which(rowSums(is.na(df)) == 0)
Output:
[1] 1
As you can see, the first row has no NA values, so both of its columns are non-NA.
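Both readings of the question (rows where both columns are present, and rows where exactly one column is NA) can be checked directly. This is a sketch on a hypothetical four-row data frame that covers every combination.

```r
df <- data.frame(x = c(1, NA, 3, NA),
                 y = c(2, 5, NA, NA))

# TRUE where neither column is NA
both_present <- complete.cases(df)

# TRUE where exactly one of the two columns is NA
exactly_one_na <- xor(is.na(df$x), is.na(df$y))

# If every NA in x is paired with a non-NA y (and vice versa),
# no row may have both columns missing
no_row_all_na <- !any(is.na(df$x) & is.na(df$y))
```

Here only row 1 has both values, rows 2 and 3 have exactly one NA, and row 4 makes `no_row_all_na` FALSE.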
Here's some simple code to add a column with the NA count for each row (no seed is set, so the values differ between runs; only the first rows of the output are shown):
x <- sample(c(1, NA), 25, replace = TRUE)
y <- sample(c(1, NA), 25, replace = TRUE)
df <- data.frame(x, y)
df$NA_Count <- apply(df, 1, function(x) sum(is.na(x)))
df
x y NA_Count
1 NA 1 1
2 NA NA 2
3 1 NA 1
4 1 NA 1
5 NA NA 2
6 1 NA 1
7 1 1 0
8 1 1 0
9 1 1 0
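The per-row count can also be computed without apply. The sketch below compares the two approaches; the seed is an assumption added only to make the run reproducible.

```r
set.seed(42)  # assumed seed, not part of the original answer
x <- sample(c(1, NA), 25, replace = TRUE)
y <- sample(c(1, NA), 25, replace = TRUE)
df <- data.frame(x, y)

# Row-by-row count, as in the answer above
count_apply <- apply(df, 1, function(r) sum(is.na(r)))

# Vectorized alternative: is.na() gives a logical matrix, rowSums() counts TRUEs
count_rowsums <- unname(rowSums(is.na(df)))

# Both approaches should agree on every row
all(count_apply == count_rowsums)
```

On larger data frames rowSums is usually the faster of the two, since it avoids calling an R function once per row.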

how to add a new row with extra column in R?

I was trying to add the results of a for loop to a data frame as new rows, but I get an error when a new result has more columns than the original data frame. How can I add a result with extra columns, so that the extra column names are added to the original data frame?
e.g.
original dataframe:
A B C
x1 1 1 1
x2 2 2 2
x3 3 3 3
I want to get
A B C D
x1 1 1 1 NA
x2 2 2 2 NA
x3 3 3 3 NA
x4 4 4 4 4
I tried rbind (Error in rbind(deparse.level, ...) : numbers of columns of arguments do not match),
rbind.fill (Error: All inputs to rbind.fill must be data.frames),
and bind_rows (Argument 2 must have names).
In base R, this can be done by creating a new column 'D' filled with NA and then assigning the value 4 to a new row 'x4'.
df1$D <- NA
df1['x4', ] <- 4
Output:
> df1
A B C D
x1 1 1 1 NA
x2 2 2 2 NA
x3 3 3 3 NA
x4 4 4 4 4
Or in a single line
rbind(cbind(df1, D = NA), x4 = 4)
A B C D
x1 1 1 1 NA
x2 2 2 2 NA
x3 3 3 3 NA
x4 4 4 4 4
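When the extra column names are not known in advance (for example, inside a loop), the missing columns can be added on both sides before calling rbind. This is a sketch with a hypothetical helper, `add_missing`, and it assumes each new result arrives as a one-row data frame.

```r
df1 <- data.frame(A = 1:3, B = 1:3, C = 1:3,
                  row.names = c("x1", "x2", "x3"))
new_row <- data.frame(A = 4, B = 4, C = 4, D = 4, row.names = "x4")

# Hypothetical helper: add any column in `cols` that `d` lacks, filled with NA,
# and return the columns in the order given by `cols`
add_missing <- function(d, cols) {
  for (nm in setdiff(cols, names(d))) d[[nm]] <- NA
  d[cols]
}

all_cols <- union(names(df1), names(new_row))
out <- rbind(add_missing(df1, all_cols), add_missing(new_row, all_cols))
out
```

The result has rows x1 to x4 and columns A to D, with NA in column D for the three original rows.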
Regarding the error in bind_rows, it happens when the for loop output is not a named vector
library(dplyr)
> vec1 <- c(4, 4, 4, 4)
> bind_rows(df1, vec1)
Error: Argument 2 must have names.
Run `rlang::last_error()` to see where the error occurred.
If it is a named vector, then it should work
> vec1 <- c(A = 4, B = 4, C = 4, D = 4)
> bind_rows(df1, vec1)
A B C D
x1 1 1 1 NA
x2 2 2 2 NA
x3 3 3 3 NA
...4 4 4 4 4
data
df1 <- structure(list(A = 1:3, B = 1:3, C = 1:3),
class = "data.frame", row.names = c("x1",
"x2", "x3"))
You probably have something like this, if you list the elements of your for loop.
(l <- list(x1, x2, x3, x4, x5))
# [[1]]
# [1] 1 1 1
#
# [[2]]
# [1] 2 2 2 2
#
# [[3]]
# [1] 3 3
#
# [[4]]
# [1] 4
#
# [[5]]
# NULL
Multiple elements can be row-bound using a do.call(rbind, .) approach; your problem is how to rbind multiple elements that differ in length.
There's a `length<-` function with which you may adjust the length of a vector. To find the target length, another function, lengths, gives the length of each list element; you are interested in the maximum.
I include the special case of an element that is NULL (our 5th element of l); since the length of NULL cannot be changed, replace those elements with NA.
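A short look at the two building blocks in isolation, using hypothetical inputs, may help before they are combined:

```r
# `length<-` pads a vector with NA when the new length is larger
x <- c(1, 1, 1)
padded <- `length<-`(x, 5)

# lengths() reports the length of every list element; NULL counts as 0
l_demo <- list(c(1, 1, 1), c(2, 2, 2, 2), NULL)
lens <- lengths(l_demo)

# The longest element determines the number of columns after rbind
target <- max(lens)
```

Here `padded` is 1 1 1 NA NA, `lens` is 3 4 0, and `target` is 4.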
So altogether you may do:
do.call(rbind, lapply(replace(l, lengths(l) == 0L, NA), `length<-`, max(lengths(l))))
# [,1] [,2] [,3] [,4]
# [1,] 1 1 1 NA
# [2,] 2 2 2 2
# [3,] 3 3 NA NA
# [4,] 4 NA NA NA
# [5,] NA NA NA NA
Or, since you probably want a data frame with pretty row and column names:
ml <- max(lengths(l))
do.call(rbind, lapply(replace(l, lengths(l) == 0L, NA), `length<-`, ml)) |>
as.data.frame() |> `dimnames<-`(list(paste0('x', 1:length(l)), LETTERS[1:ml]))
# A B C D
# x1 1 1 1 NA
# x2 2 2 2 2
# x3 3 3 NA NA
# x4 4 NA NA NA
# x5 NA NA NA NA
Note: R >= 4.1 is required for the native pipe |>.
Data:
x1 <- rep(1, 3); x2 <- rep(2, 4); x3 <- rep(3, 2); x4 <- rep(4, 1); x5 <- NULL

Loop through matrix in R and calculate measurement difference between all users

I have a matrix that is 10 rows by 4 columns. Each row represents a user, and each column a measurement. Some users only have one measurement, while others may have the full 4 measurements.
The goals I want to accomplish with this matrix are threefold:
1. To subtract the user's measurements from their own measurements (across columns);
2. To subtract the user's measurement from other users' measurement points (all included, across rows);
3. To create a final matrix that counts the number of "matches" (comparisons) each user has against themselves and others.
Within a threshold of 2.0 units, I have tried to measure each user's measurement against their own measurement and other users by obtaining the difference with a nested for-loop.
Below is an example of what the clean_data matrix looks like, and this matrix was used for all three goals:
M1 M2 M3 M4
U1 148.2 148.4 155.6 155.7
U2 149.5 150.1 150.1 153.9
U3 148.4 154.2 NA NA
U4 154.5 NA NA NA
U5 151.1 156.9 157.1 NA
For Goal #3, the output should look something akin to this matrix:
U1 U2 U3 U4 U5
U1 2 8 4 2 3
U2 8 3 2 1 4
U3 4 2 0 1 0
U4 2 1 1 0 0
U5 3 4 0 0 1
For example: User 1 has 2 matches with themselves because, with all 4 of their measurements, 2 differences were less than a value of 2.0 units. User 1 also has 8 matches with User 2. Each of User 1's measurements were subtracted iteratively from User 2's measurements (stored as an absolute value), and those differences that were below a value of 2 were considered a "match."
I have tried using the following nested for-loop, however I believe it is only counting the number of elements in my matrix instead of adding the differences.
# Set the time_threshold.
time_threshold <- 2.000
# Create an empty matrix the same dimensions as the number of users present.
matrix_a<-matrix(nrow = nrow(clean_data), ncol = nrow(clean_data))
# Use a nested for-loop to calculate the intra-user
# and inter-user time differences, adding values below
# the threshold up for those user-comparisons.
for (i in 1:nrow(clean_data)) {
for (j in 1:nrow(clean_data)) {
matrix_a[i, j] <-
round(sum(!is.na(abs((clean_data[i, 2:dim(clean_data)[2]]) -
(clean_data[j, 2:dim(clean_data)[2]])
) <= time_threshold)) / 2)
}
}
# Dividing by 2 and rounding has proven that this code only counts the
# number of vectors that are not NA, not the values below by time_threshold (2.000).
Is there a way that can calculate the differences I outlined above, and is also more efficient than a nested for-loop?
Note: The structure of these data is only relevant insofar as differences can be calculated for individuals across rows and columns. Missing values in this example are represented as NA and should not be included in the calculation. Alternatively, I have set them to -0.01, which still has not changed the outcome of my for-loop.
You could write a function to do the loop for you:
fun <- function(index, dat){
i <- index[1]
j <- index[2]
m <- if(i==j) combn(dat[i,],2, function(x)diff(x))
else do.call("-", expand.grid(dat[i, ], dat[j, ]))
sum(abs(m)<2, na.rm = TRUE)
}
dist_fun <- function(dat){
dat <- as.matrix(dat)
result <- diag(0, nrow(dat))
mat_index <- which(lower.tri(result, TRUE), TRUE)
result[mat_index] <- apply(mat_index, 1, fun, dat = dat)
result[mat_index[,2:1]] <- result[mat_index]
result
}
dist_fun(df) # df: data frame holding only the numeric measurement columns (M1 to M4)
[,1] [,2] [,3] [,4] [,5]
[1,] 2 8 4 2 4
[2,] 8 3 4 1 3
[3,] 4 4 0 1 0
[4,] 2 1 1 0 0
[5,] 4 3 0 0 1
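The counting step can be checked by hand for a single pair of users. The sketch below uses outer instead of the answer's expand.grid and combn calls, with the U1 and U2 measurements from the question; the variable names are illustrative.

```r
u1 <- c(148.2, 148.4, 155.6, 155.7)
u2 <- c(149.5, 150.1, 150.1, 153.9)
threshold <- 2

# Between users: every measurement of U1 against every measurement of U2
d_between <- abs(outer(u1, u2, "-"))
matches_u1_u2 <- sum(d_between < threshold, na.rm = TRUE)

# Within a user: each unordered pair once (upper triangle, diagonal excluded)
d_within <- abs(outer(u1, u1, "-"))
matches_u1_u1 <- sum(d_within[upper.tri(d_within)] < threshold, na.rm = TRUE)
```

This gives 8 matches between U1 and U2 and 2 matches within U1, in line with the first row of the output above.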
Here's one tidyverse approach. I convert the data to longer format, then join it to itself by User (across) and by time point (down), each time counting the number of matches. Then I combine the two and convert to wide format again.
library(tidyverse)
my_data2 <- my_data %>% pivot_longer(-User)
left_join(my_data2, my_data2, by = "User") %>%
filter(name.x < name.y, abs(value.y - value.x) <= 2) %>% # EDIT
count(User) %>%
select(User.x = User, User.y = User, n) -> compare_across
my_data3 <- my_data2 %>% mutate(dummy = 1) # EDIT
inner_join(my_data3, my_data3, by = "dummy") %>% # EDIT
filter(abs(value.x - value.y) <=2, User.x != User.y) %>%
count(User.x, User.y) -> compare_down
bind_rows(compare_across, compare_down) %>%
arrange(User.x, User.y) %>%
pivot_wider(names_from = User.y, values_from = n, values_fill = list(n = 0))
# A tibble: 5 x 6
User.x U1 U2 U3 U4 U5
<chr> <int> <int> <int> <int> <int>
1 U1 2 8 4 2 4
2 U2 8 3 4 1 3
3 U3 4 4 0 1 0
4 U4 2 1 1 0 0
5 U5 4 3 0 0 1
source data:
my_data <- data.frame(
stringsAsFactors = FALSE,
User = c("U1", "U2", "U3", "U4", "U5"),
M1 = c(148.2, 149.5, 148.4, 154.5, 151.1),
M2 = c(148.4, 150.1, 154.2, NA, 156.9),
M3 = c(155.6, 150.1, NA, NA, 157.1),
M4 = c(155.7, 153.9, NA, NA, NA)
)

R: Merging data frames: Exclude specific column value, but keep skipped rows

I want to merge two data frames, skipping rows based on a specific column value, but still keep the skipped rows in the final merged data frame. I can manage the first part (skipping), but not the second.
Here are the data frames:
# Data frame 1 values
ids1 <- c(1:3)
x1 <- c(100, 101, 102)
doNotMerge <- c(1, 0, 0)
# Data frame 2 values
ids2 <- c(1:3)
x2 <- c(200, 201, 202)
# Creating the data frames
df1 <- as.data.frame(matrix(c(ids1, x1, doNotMerge),
nrow = 3,
ncol = 3,
dimnames = list(c(),c("ID", "X1", "DoNotMerge"))))
df2 <- as.data.frame(matrix(c(ids2, x2),
nrow = 3,
ncol = 2,
dimnames = list(c(),c("ID", "X2"))))
# df1 contents:
# ID X1 DoNotMerge
# 1 1 100 1
# 2 2 101 0
# 3 3 102 0
# df2 contents:
# ID X2
# 1 1 200
# 2 2 201
# 3 3 202
I used merge:
merged <- merge(df1[df1$DoNotMerge != 1,], df2, by = "ID", all = T)
# merged contents:
# ID X1 DoNotMerge X2
# 1 1 NA NA 200
# 2 2 101 0 201
# 3 3 102 0 202
The skipping part I was able to do, but what I actually want is to keep the df1 row where DoNotMerge == 1, like so:
# ID X1 DoNotMerge X2
# 1 1 NA NA 200
# 2 1 100 1 NA
# 3 2 101 0 201
# 4 3 102 0 202
Can anyone please help? Thanks.
Update: I found the solution while writing the question (via another question I ran into), so I'm posting it in case someone else encounters this problem:
library(plyr)
rbind.fill(merged, df1[df1$DoNotMerge == 1,])
# Result:
# ID X1 DoNotMerge X2
# 1 1 NA NA 200
# 2 2 101 0 201
# 3 3 102 0 202
# 4 1 100 1 NA
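The same result can be obtained in base R without plyr by padding the skipped rows with the missing columns before rbind. This is a sketch, not the poster's method; it rebuilds the two data frames directly and sorts by ID to reproduce the row order requested in the question.

```r
df1 <- data.frame(ID = 1:3, X1 = c(100, 101, 102), DoNotMerge = c(1, 0, 0))
df2 <- data.frame(ID = 1:3, X2 = c(200, 201, 202))

# Merge only the rows that are allowed to merge; keep unmatched rows from both sides
merged <- merge(df1[df1$DoNotMerge != 1, ], df2, by = "ID", all = TRUE)

# Rows that were held back: add the columns they lack, filled with NA
skipped <- df1[df1$DoNotMerge == 1, ]
skipped[setdiff(names(merged), names(skipped))] <- NA

# Stack in matching column order, then sort so skipped rows sit next to their ID
result <- rbind(merged, skipped[names(merged)])
result <- result[order(result$ID), ]
result
```

The row names of `result` will differ from those in the question, but the four rows and their order by ID match the desired output.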
