Removing a different value from each column of a data frame - r

I have the following items
A<-data.frame(replicate(5,c(1,2,3,4)))
A= X1 X2 X3 X4 X5
1 1 1 1 1
2 2 2 2 2
3 3 3 3 3
4 4 4 4 4
B<-c(1,2,3,4,1)
B = 1 2 3 4 1
I want to find a way of removing the first element of B from the first column of A, the second element of B from the second column of A and so on so I obtain the following result
A= X1 X2 X3 X4 X5
2 1 1 1 2
3 3 2 2 3
4 4 4 3 4

Using mapply we can iterate over the columns of A and the elements of B in parallel, keeping from each column only the values that differ from the corresponding element of B
mapply(function(x, y) x[x != y], A, B)
# X1 X2 X3 X4 X5
#[1,] 2 1 1 1 2
#[2,] 3 3 2 2 3
#[3,] 4 4 4 3 4
PS - Make sure that ncol(A) and length(B) are the same; otherwise the shorter argument is recycled, which can give unexpected results.
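A minimal sketch of the two caveats, assuming the same A and B as in the question. The helper name drop_value and the second vector B2 are made up for illustration. It checks the lengths up front and shows that mapply only returns a matrix when every column loses the same number of values:

```r
A <- data.frame(replicate(5, c(1, 2, 3, 4)))
B <- c(1, 2, 3, 4, 1)

# Guard against silent recycling when the lengths differ.
stopifnot(ncol(A) == length(B))

# Keep the values of x that differ from the single value y.
drop_value <- function(x, y) x[x != y]

# Every column loses exactly one value here, so the result simplifies to a matrix.
res <- mapply(drop_value, A, B)

# Hypothetical variant: 9 never occurs in X5, so that column keeps all four
# values and mapply() can no longer simplify; it returns a list instead.
B2 <- c(1, 2, 3, 4, 9)
res2 <- mapply(drop_value, A, B2)
```

If a list is the wanted result in every case, Map(drop_value, A, B) or mapply(..., SIMPLIFY = FALSE) avoids the switch between matrix and list.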

A purrr solution:
A<-data.frame(replicate(5,c(1,2,3,4)))
# X1 X2 X3 X4 X5
# 1 1 1 1 1 1
# 2 2 2 2 2 2
# 3 3 3 3 3 3
# 4 4 4 4 4 4
B<-c(1,2,3,4,1)
# [1] 1 2 3 4 1
purrr::map2_df(A, B, ~.x[.x != .y]) # function(x,y) x[x != y]
# # A tibble: 3 x 5
# X1 X2 X3 X4 X5
# <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 2 1 1 1 2
# 2 3 3 2 2 3
# 3 4 4 4 3 4

Related

How to count the number of occurrences of a given value for each row?

I'm sure this is a really easy fix but I can't seem to find the answer... I am trying to create a column at the end of my dataframe that is a sum of the number of times a specific value (say "1") appears across that row. So for example, if I started with the following dataframe:
id <- 1:6
X1 <- c(5,1,7,8,1,5)
X2 <- c(5,0,0,2,3,7)
X3 <- c(6,2,3,4,1,7)
X4 <- c(1,1,5,2,1,7)
df <- data.frame(id,X1,X2,X3,X4)
id X1 X2 X3 X4
1 1 5 5 6 1
2 2 1 0 2 1
3 3 7 0 3 5
4 4 8 2 4 2
5 5 1 3 1 1
6 6 5 7 7 7
and I was trying to identify how many times the value "1" appears across each row, I would want the output to look like this:
id X1 X2 X3 X4 one_appears
1 1 5 5 6 1 2
2 2 1 0 2 1 2
3 3 7 0 3 5 0
4 4 8 2 4 2 0
5 5 1 3 1 1 3
6 6 5 7 7 7 0
Thanks very much in advance!
library(tidyverse)
df %>%
  mutate(
    one = rowSums(across(everything(), ~ .x == 1))
  )
# A tibble: 6 × 6
id X1 X2 X3 X4 one
<int> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 5 5 6 1 2
2 2 1 0 2 1 2
3 3 7 0 3 5 0
4 4 8 2 4 2 0
5 5 1 3 1 1 3
6 6 5 7 7 7 0
EDIT: to count only within the X columns (leaving out id), select them explicitly:
df %>%
  mutate(
    one = rowSums(across(starts_with("X"), ~ .x == 1))
  )
or, equivalently:
df %>%
  mutate(
    one = rowSums(across(X1:X4, ~ .x == 1))
  )
We can use rowSums on a logical matrix
df$one_appears <- rowSums(df == 1, na.rm = TRUE)
-output
> df
id X1 X2 X3 X4 one_appears
1 1 5 5 6 1 2
2 2 1 0 2 1 2
3 3 7 0 3 5 0
4 4 8 2 4 2 0
5 5 1 3 1 1 3
6 6 5 7 7 7 0
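One thing to watch: comparing the whole data frame against 1 also counts the id column, so row 1 is credited with an extra match just because its id is 1. A minimal base R sketch, assuming only the columns whose names start with "X" should be counted (the names x_cols, with_id and without_id are illustrative):

```r
id <- 1:6
X1 <- c(5, 1, 7, 8, 1, 5)
X2 <- c(5, 0, 0, 2, 3, 7)
X3 <- c(6, 2, 3, 4, 1, 7)
X4 <- c(1, 1, 5, 2, 1, 7)
df <- data.frame(id, X1, X2, X3, X4)

# Counting over every column includes id, so row 1 gets 2 (id == 1 and X4 == 1).
with_id <- rowSums(df == 1, na.rm = TRUE)

# Restrict the comparison to the measurement columns only.
x_cols <- grep("^X", names(df), value = TRUE)
without_id <- rowSums(df[x_cols] == 1, na.rm = TRUE)

df$one_appears <- without_id
```

Here the two counts differ only in row 1, but with other id values the difference could show up in any row.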
Another option using apply with sum:
id <- c(1:6)
X1 <- c(5,1,7,8,1,5)
X2 <- c(5,0,0,2,3,7)
X3 <- c(6,2,3,4,1,7)
X4 <- c(1,1,5,2,1,7)
df <- data.frame(id,X1,X2,X3,X4)
df$one_appear = apply(df, 1, \(x) sum(x == 1))
df
#> id X1 X2 X3 X4 one_appear
#> 1 1 5 5 6 1 2
#> 2 2 1 0 2 1 2
#> 3 3 7 0 3 5 0
#> 4 4 8 2 4 2 0
#> 5 5 1 3 1 1 3
#> 6 6 5 7 7 7 0
Created on 2023-01-18 with reprex v2.0.2
This may not be the best approach, but here is an alternative that I tried and thought I would share:
library(dplyr)
X1 <- c(5,1,7,8,1,5)
X2 <- c(5,0,0,2,3,7)
X3 <- c(6,2,3,4,1,7)
X4 <- c(1,1,5,2,1,7)
df <- data.frame(X1,X2,X3,X4) %>%
  rowwise() %>%
  # helper columns Y_X1..Y_X4 hold 1 where the value is 1 and NA elsewhere
  mutate(across(starts_with('X'), function(x) ifelse(x==1,1,NA), .names = 'Y_{col}'),
         one_appears = sum(across(starts_with('Y')), na.rm = TRUE)
  )

Split dataset in R so that all columns with same name are split into two equal parts?

I have a data frame which looks like something like this:
A A B C C C D D
X1 1 2 1 1 2 3 1 2
X2 1 2 1 1 2 3 1 2
X3 1 2 1 1 2 3 1 2
I want to split it into two roughly equal parts in such a way that the columns sharing each name (e.g. A) are split equally as well. For example, the two A columns would go into different dataframes (same for D, because it can be split equally), the single B column would go to a random dataframe, and of the three C columns, two would go into one dataframe and one into the other. This would be an acceptable result:
A B C C D
X1 1 1 1 2 1
X2 1 1 1 2 1
X3 1 1 1 2 1
A C D
X1 2 3 2
X2 2 3 2
X3 2 3 2
I hope it makes sense. What would be the best way to solve this?
How about every-other?
# if the data.frame is already sorted by column name, skip the next 2 lines
idx <- order(names(df))
df <- setNames(df[,idx], names(df)[idx])  # `[` makes duplicated names unique, so restore them
idx <- rep(c(TRUE, FALSE), ceiling(length(df)/2))[1:length(df)]
(df1 <- setNames(df[,idx], names(df)[idx]))
#> A B C D
#> X1 1 1 2 1
#> X2 1 1 2 1
#> X3 1 1 2 1
(df2 <- setNames(df[,!idx], names(df)[!idx]))
#> A C C D
#> X1 2 1 3 2
#> X2 2 1 3 2
#> X3 2 1 3 2
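The every-other idea can be wrapped in a small function. This is a sketch only: split_alternate is a hypothetical helper name, and the example data is rebuilt from the table in the question. After sorting by name, columns with the same name sit next to each other, so alternating between the two outputs gives each output half of every group, give or take one column.

```r
# Example data with duplicated column names, as in the question.
df <- as.data.frame(matrix(rep(c(1, 2, 1, 1, 2, 3, 1, 2), each = 3), nrow = 3))
names(df) <- c("A", "A", "B", "C", "C", "C", "D", "D")
rownames(df) <- c("X1", "X2", "X3")

# split_alternate() is a hypothetical helper, not part of any package.
split_alternate <- function(d) {
  ord <- order(names(d))               # column positions sorted by name
  nms <- names(d)[ord]
  first <- seq_along(ord) %% 2 == 1    # TRUE for the 1st, 3rd, 5th, ... sorted column
  pick <- function(keep) {
    out <- d[ord[keep]]                # `[` makes duplicated names unique ...
    names(out) <- nms[keep]            # ... so put the original names back
    out
  }
  list(first = pick(first), second = pick(!first))
}

parts <- split_alternate(df)
```

With this input, parts$first holds columns A, B, C, D and parts$second holds A, C, C, D.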

Subset a dataframe based on multiple columns simultaneously

Suppose I have two dataframes like
> r1 <- data.frame(replicate(5, sample(1:3, 5, replace = TRUE)))
> r2 <- data.frame(replicate(5, sample(1:3, 5, replace = TRUE)))
> r1
X1 X2 X3 X4 X5
1 2 3 1 3 1
2 1 3 1 1 3
3 3 2 3 3 3
4 1 1 1 2 3
5 1 1 3 2 3
> r2
X1 X2 X3 X4 X5
1 1 3 3 3 2
2 3 1 2 1 2
3 1 1 1 2 2
4 2 3 1 2 2
5 2 2 1 2 3
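I would like to subset r1 so that the result contains only the rows whose combination of X1, X2 and X3 also occurs together in some row of r2 (that is, r1$X1 == r2$X1 AND r1$X2 == r2$X2 AND r1$X3 == r2$X3 for the same row of r2), e.g.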
r1_example
X1 X2 X3 X4 X5
1 2 3 1 3 1
4 1 1 1 2 3
If I use subset with %in%, I get more rows than I want, because each column is matched against the corresponding r2 column separately.
> r1_sub <- subset(r1, X1 %in% r2$X1 & X2 %in% r2$X2 & X3 %in% r2$X3)
> r1_sub
X1 X2 X3 X4 X5
1 2 3 1 3 1
2 1 3 1 1 3
3 3 2 3 3 3
4 1 1 1 2 3
5 1 1 3 2 3
I can figure a workaround like
> r1$concat <- paste(r1$X1, '&', r1$X2, '&', r1$X3)
> r2$concat <- paste(r2$X1, '&', r2$X2, '&', r2$X3)
> r1_concat <- subset(r1, concat %in% r2$concat)
> r1_concat
X1 X2 X3 X4 X5 concat
1 2 3 1 3 1 2 & 3 & 1
4 1 1 1 2 3 1 & 1 & 1
But that's crude to say the least. Is there a more elegant solution?
You can keep only the columns X1 to X3 in r2 and merge the data :
merge(r1, r2[c('X1', 'X2', 'X3')])
X1 X2 X3 X4 X5
#1 1 1 1 2 3
#2 2 3 1 3 1
In dplyr:
library(dplyr)
r2 %>% select(X1:X3) %>% inner_join(r1, by = c("X1", "X2", "X3"))
Using semi_join, which returns the rows of r1 that have a match in r2:
library(tidyverse)
r1 <- read.table(text = " X1 X2 X3 X4 X5
1 2 3 1 3 1
2 1 3 1 1 3
3 3 2 3 3 3
4 1 1 1 2 3
5 1 1 3 2 3", header = T)
r2 <- read.table(text = " X1 X2 X3 X4 X5
1 1 3 3 3 2
2 3 1 2 1 2
3 1 1 1 2 2
4 2 3 1 2 2
5 2 2 1 2 3", header = T)
semi_join(r1, r2, by = c("X1", "X2", "X3"))
#> X1 X2 X3 X4 X5
#> 1 2 3 1 3 1
#> 4 1 1 1 2 3
Created on 2021-03-12 by the reprex package (v1.0.0)
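One difference between the approaches is worth checking on real data: merge and inner_join return one row per matching pair, so a key combination that occurs twice in r2 duplicates the matching row of r1, while a semi join returns it once. A base R sketch of this, where r2_dup is a hypothetical variation of r2 with one row repeated, and unique() is used to get semi-join-like behaviour from merge:

```r
r1 <- data.frame(X1 = c(2, 1, 3, 1, 1), X2 = c(3, 3, 2, 1, 1), X3 = c(1, 1, 3, 1, 3),
                 X4 = c(3, 1, 3, 2, 2), X5 = c(1, 3, 3, 3, 3))
r2 <- data.frame(X1 = c(1, 3, 1, 2, 2), X2 = c(3, 1, 1, 3, 2), X3 = c(3, 2, 1, 1, 1),
                 X4 = c(3, 1, 2, 2, 2), X5 = c(2, 2, 2, 2, 3))
keys <- c("X1", "X2", "X3")

# Hypothetical variation: repeat one matching row of r2 to create a duplicate key.
r2_dup <- rbind(r2, r2[3, ])

# One output row per matching pair: the (1, 1, 1) row of r1 now appears twice.
merged_dup <- merge(r1, r2_dup[keys], by = keys)

# De-duplicating the keys first returns each matching row of r1 once.
merged_unique <- merge(r1, unique(r2_dup[keys]), by = keys)
```

Note that merge sorts the result by the key columns and resets the row names, unlike the semi_join output above.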

repeat list in to a data frame in R

I have a list let's say
k<-c(1,2,3,4)
I want to create a dataframe with let's say 5 rows using the same list in each row as shown below.
X1 X2 X3 X4
1 1 2 3 4
2 1 2 3 4
3 1 2 3 4
4 1 2 3 4
5 1 2 3 4
I tried doing:-
> rep(k, each = 5)
[1] 1 1 1 1 1 2 2 2 2 2 3 3 3 3 3 4 4 4 4 4
However I am not able to get intended result. Any suggestions?
data.frame(t(replicate(5, k)))
#OR
data.frame(matrix(rep(k, each = 5), 5))
#OR
data.frame(t(sapply(1:5, function(x) k)))
# X1 X2 X3 X4
#1 1 2 3 4
#2 1 2 3 4
#3 1 2 3 4
#4 1 2 3 4
#5 1 2 3 4
Here is one option: convert the vector to a list with as.list, change it to a data.frame (as.data.frame) and replicate the rows
as.data.frame(as.list(k))[rep(1, 5),]
# X1 X2 X3 X4
#1 1 2 3 4
#1.1 1 2 3 4
#1.2 1 2 3 4
#1.3 1 2 3 4
#1.4 1 2 3 4
Or another option is to take the transpose of the vector to get a row matrix, replicate the rows and convert to data.frame
as.data.frame(t(k)[rep(1, 5),])
In tidyverse, one option is to convert to tibble and then uncount
library(dplyr)
library(tidyr)
library(stringr)
library(purrr)  # provides set_names()
as.list(k) %>%
  set_names(str_c("X", seq_along(k))) %>%
  as_tibble %>%
  uncount(5)
# A tibble: 5 x 4
# X1 X2 X3 X4
# <dbl> <dbl> <dbl> <dbl>
#1 1 2 3 4
#2 1 2 3 4
#3 1 2 3 4
#4 1 2 3 4
#5 1 2 3 4
purrr::map_dfc(k, rep, 5)
# # A tibble: 5 x 4
# V1 V2 V3 V4
# <dbl> <dbl> <dbl> <dbl>
# 1 1 2 3 4
# 2 1 2 3 4
# 3 1 2 3 4
# 4 1 2 3 4
# 5 1 2 3 4
Using data.table:
library(data.table)
k = c(1,2,3,4)
n = 5 # Number of rows
# one column per element of k, each repeated n times; unnamed columns become V1..V4
df = as.data.table(lapply(k, rep, n))
> df
V1 V2 V3 V4
1: 1 2 3 4
2: 1 2 3 4
3: 1 2 3 4
4: 1 2 3 4
5: 1 2 3 4
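For comparison, a base R sketch that is not among the answers above: filling a matrix row by row recycles k across the rows directly, so no transpose or row indexing is needed. The variable names n and res are illustrative.

```r
k <- c(1, 2, 3, 4)
n <- 5  # number of rows wanted

# byrow = TRUE fills each row with k; k is recycled until the matrix is full.
res <- as.data.frame(matrix(k, nrow = n, ncol = length(k), byrow = TRUE))

# Optional: use the X1..X4 names from the question instead of the default V1..V4.
names(res) <- paste0("X", seq_along(k))
```

The original attempt, rep(k, each = 5), produces the right values but as a flat vector, which is why wrapping it in matrix(..., 5) as in the first answer also works.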

Number of maximums in each row and more

My dataset contains four numerical variables X1, X2, X3, X4 and an ID column.
ID <- c(1,2,3,4,5,6,7,8,9,10)
X1 <- c(3,1,1,1,2,1,2,1,3,4)
X2 <- c(1,2,1,3,2,2,4,1,2,4)
X3 <- c(1,1,1,3,2,3,3,2,1,4)
X4 <- c(1,4,1,1,1,4,3,1,4,4)
Mydata <- data.frame(ID, X1,X2,X3,X4)
I need to create two more columns: 1) Max, and 2) Var
1) Max column: For each row that has ONLY ONE maximum, I need to save this 'max' value in the Max variable. And if the row has more than one, then the Max value should be 999.
2) Var column: For the rows with only one maximum, I need to know whether it was X1, X2, X3, or X4.
For the above dataset, here is the output:
ID X1 X2 X3 X4 Max Var
1 3 1 1 1 3 X1
2 1 2 1 4 4 X4
3 1 1 1 1 999 NA
4 1 3 3 1 999 NA
5 2 2 2 1 999 NA
6 1 2 3 4 4 X4
7 2 4 3 3 4 X2
8 1 1 2 1 2 X3
9 3 2 1 4 4 X4
10 4 4 4 4 999 NA
We could get the column names of the 'Mydata' for the maximum value in each row (excluding the 'ID' column) using max.col ('Var'), and the maximum value per row with pmax ('Max'). Create a logical index for rows that have more than one maximum value ('indx') and use it with ifelse to get the expected output.
Var <- names(Mydata[-1])[max.col(Mydata[-1])]
Max <- do.call(pmax,Mydata[-1])
indx <- rowSums(Mydata[-1]==Max)>1
transform(Mydata, Var= ifelse(indx, NA, Var), Max=ifelse(indx, 999, Max))
Here's another possible apply solution (here Var holds the position of the maximum among X1..X4 rather than the column name)
MyFunc <- function(x){
  Max <- max(x)
  if(sum(x == Max) > 1L) {
    Max <- 999
    Var <- NA
  } else {
    Var <- which.max(x)
  }
  c(Max, Var)
}
Mydata[c("Max", "Var")] <- t(apply(Mydata[-1], 1, MyFunc))
# ID X1 X2 X3 X4 Max Var
# 1 1 3 1 1 1 3 1
# 2 2 1 2 1 4 4 4
# 3 3 1 1 1 1 999 NA
# 4 4 1 3 3 1 999 NA
# 5 5 2 2 2 1 999 NA
# 6 6 1 2 3 4 4 4
# 7 7 2 4 3 3 4 2
# 8 8 1 1 2 1 2 3
# 9 9 3 2 1 4 4 4
# 10 10 4 4 4 4 999 NA
I would break this down into some small steps, which may not be the most efficient but would at least give you a starting point to work from if efficiency were an issue for your real problem.
First, compute the row maxes:
maxs <- apply(Mydata[, -1], 1, max)
> maxs
[1] 3 4 1 3 2 4 4 2 4 4
Next compute which values in each row equal the maximum
wMax <- apply(Mydata[, -1], 1, function(x) which(x == max(x)))
This gives a list, which we can sapply() over to get the number of values equalling the maximum:
nMax <- sapply(wMax, length)
> nMax
[1] 1 1 4 2 3 1 1 1 1 4
Now add the Max & Var columns:
Mydata$Max <- ifelse(nMax > 1L, 999, maxs)
Mydata$Var <- ifelse(nMax > 1L, NA, sapply(wMax, `[[`, 1))
> Mydata
ID X1 X2 X3 X4 Max Var
1 1 3 1 1 1 3 1
2 2 1 2 1 4 4 4
3 3 1 1 1 1 999 NA
4 4 1 3 3 1 999 NA
5 5 2 2 2 1 999 NA
6 6 1 2 3 4 4 4
7 7 2 4 3 3 4 2
8 8 1 1 2 1 2 3
9 9 3 2 1 4 4 4
10 10 4 4 4 4 999 NA
This isn't going to win any prizes for elegant use of the language, but it works and you can build off of it.
(That last line creating Var needs a little explanation: wMax is actually a list. We want the first element of each component of that list (because those will be the only maximums), and the sapply() call produces that.)
Now we can write a function that incorporates all the steps for you:
MaxVar <- function(x, na.rm = FALSE) {
  ## compute `max`
  maxx <- max(x, na.rm = na.rm)
  ## which equal the max (reuse maxx so that na.rm is respected)
  wmax <- which(x == maxx)
  ## how many equal the max
  nmax <- length(wmax)
  ## return
  out <- if(nmax > 1L) {
    c(999, NA)
  } else {
    c(maxx, wmax)
  }
  out
}
And use it like this:
> new <- apply(Mydata[, -1], 1, MaxVar)
> new
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
[1,] 3 4 999 999 999 4 4 2 4 999
[2,] 1 4 NA NA NA 4 2 3 4 NA
> Mydata <- cbind(Mydata, Max = new[1, ], Var = new[2, ])
> Mydata
ID X1 X2 X3 X4 Max Var
1 1 3 1 1 1 3 1
2 2 1 2 1 4 4 4
3 3 1 1 1 1 999 NA
4 4 1 3 3 1 999 NA
5 5 2 2 2 1 999 NA
6 6 1 2 3 4 4 4
7 7 2 4 3 3 4 2
8 8 1 1 2 1 2 3
9 9 3 2 1 4 4 4
10 10 4 4 4 4 999 NA
Again, not the most elegant or efficient of code, but it works and it's easy to see what it is doing.
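Yet another way to do this using apply. The Var step must run after the Max step, because it relies on the new Max column being the fifth column of Mydata[,-1]; rows without any repeated value also trigger a warning from max() on an empty vector, which can be ignored here.
Mydata$Max = apply(Mydata[,-1], 1,
  # a tied maximum equals the largest duplicated value in the row
  function(x){ m = max(x); ifelse(m != max(x[duplicated(x)]), m, 999)})
Mydata$Var = apply(Mydata[,-1], 1,
  # when Max is 999 it is the row maximum itself (position 5), which signals a tie
  function(x){ index = which.max(x); ifelse(index != 5, names(x)[index], NA)})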
#> Mydata
#ID X1 X2 X3 X4 Max Var
#1 1 3 1 1 1 3 X1
#2 2 1 2 1 4 4 X4
#3 3 1 1 1 1 999 <NA>
#4 4 1 3 3 1 999 <NA>
#5 5 2 2 2 1 999 <NA>
#6 6 1 2 3 4 4 X4
#7 7 2 4 3 3 4 X2
#8 8 1 1 2 1 2 X3
#9 9 3 2 1 4 4 X4
#10 10 4 4 4 4 999 <NA>
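A variant that avoids both the hard-coded column position and the empty-vector warning, written as a sketch under the question's definitions. The helper name summarise_row and the vector x_cols are illustrative, and 999 is the tie marker the question asks for.

```r
ID <- 1:10
X1 <- c(3, 1, 1, 1, 2, 1, 2, 1, 3, 4)
X2 <- c(1, 2, 1, 3, 2, 2, 4, 1, 2, 4)
X3 <- c(1, 1, 1, 3, 2, 3, 3, 2, 1, 4)
X4 <- c(1, 4, 1, 1, 1, 4, 3, 1, 4, 4)
Mydata <- data.frame(ID, X1, X2, X3, X4)

x_cols <- c("X1", "X2", "X3", "X4")

# For one row (a named numeric vector), return the unique maximum and its
# column name, or 999 / NA when the maximum occurs more than once.
summarise_row <- function(x) {
  m <- max(x)
  at_max <- which(x == m)
  if (length(at_max) == 1L) {
    list(Max = m, Var = names(x)[at_max])
  } else {
    list(Max = 999, Var = NA_character_)
  }
}

res <- lapply(seq_len(nrow(Mydata)),
              function(i) summarise_row(unlist(Mydata[i, x_cols])))

Mydata$Max <- vapply(res, function(r) r$Max, numeric(1))
Mydata$Var <- vapply(res, function(r) r$Var, character(1))
```

Because the X columns are selected by name, the two assignments can be rerun in any order without the new Max and Var columns leaking into the calculation.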
