I have a data frame, df, that includes NA values.
df <- data.frame( X1= c(NA, 1, 4, NA),
X2 = c(34, 75, 1, 4),
X3= c(2,9,3,5))
My ideal outcome looks like this:
X1 X2 X3 Min
1 NA 34 2 X3
2 1 75 9 X1
3 4 1 3 X2
4 NA 4 5 X2
I have tried
df$Min <- colnames(df)[apply(df,1,which.min, na.rm=TRUE)]
but this one didn't work
You don't need the na.rm=TRUE when using which.min() – try this instead:
df$Min <- colnames(df)[apply(df,1,which.min)]
Output:
X1 X2 X3 Min
1 NA 34 2 X3
2 1 75 9 X1
3 4 1 3 X2
4 NA 4 5 X2
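To see why the original call failed: which.min() takes a single argument and already skips NA values, so the extra na.rm = TRUE is passed on by apply() and rejected as an unused argument. A minimal sketch on one row of the example data (the vector x and the tryCatch wrapper are only for illustration):

```r
# One row of the example data, with a missing value in the first position
x <- c(X1 = NA, X2 = 34, X3 = 2)

# which.min() ignores NA by itself and returns the position of the minimum
pos <- which.min(x)
names(pos)

# Passing na.rm fails, because which.min() has no such argument
res <- tryCatch(which.min(x, na.rm = TRUE), error = function(e) "error")

# Caveat: a row that is entirely NA yields integer(0), so apply() would
# then return a list instead of a vector and the column lookup would break
all_na <- which.min(c(NA_real_, NA_real_))
length(all_na)
```

If rows consisting only of NA can occur in the real data, they would need separate handling before indexing into colnames(df).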
Code:
foo <- names(df)
df$Min <- apply(df, 1, function(x) foo[which.min(x)])
df
Output:
X1 X2 X3 Min
1 NA 34 2 X3
2 1 75 9 X1
3 4 1 3 X2
4 NA 4 5 X2
Here's an idea that will likely be faster and does not require any looping. You could replace NA with Inf, negate the data, and then find the column holding the maximum in each row via max.col().
names(df)[max.col(-replace(df, is.na(df), Inf))]
# [1] "X3" "X1" "X2" "X2"
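One thing to keep in mind with this approach is that max.col() breaks ties at random by default. A sketch of the same idea with ties.method = "first", which always returns the first matching column (the intermediate matrix m is only for illustration):

```r
df <- data.frame(X1 = c(NA, 1, 4, NA),
                 X2 = c(34, 75, 1, 4),
                 X3 = c(2, 9, 3, 5))

# Work on a numeric matrix and push NA out of the way by turning it into Inf
m <- as.matrix(df)
m[is.na(m)] <- Inf

# The row-wise minimum of m is the row-wise maximum of -m;
# ties.method = "first" makes the result reproducible when values tie
min_col <- names(df)[max.col(-m, ties.method = "first")]
min_col
```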
Also, not to forget, here is a data.table solution, given that dt <- as.data.table(df):
dt[ , Min:=names(dt)[match(min(.SD, na.rm=TRUE), .SD)], by=1:nrow(dt)][]
# X1 X2 X3 Min
#1: NA 34 2 X3
#2: 1 75 9 X1
#3: 4 1 3 X2
#4: NA 4 5 X2
Not much simpler than the solutions above, just extending the choices here.
I was trying to add the results of a for loop to a data frame as new rows, but I get an error when a new result has more columns than the original data frame. How can I add the new result, including its extra columns, so that the extra column names are added to the original data frame?
e.g.
original dataframe:
   A B C
x1 1 1 1
x2 2 2 2
x3 3 3 3
I want to get
   A B C D
x1 1 1 1 NA
x2 2 2 2 NA
x3 3 3 3 NA
x4 4 4 4 4
I tried rbind (Error in rbind(deparse.level, ...) :
numbers of columns of arguments do not match),
rbind.fill (Error: All inputs to rbind.fill must be data.frames),
and bind_rows (Argument 2 must have names).
In base R, this can be done by creating a new column 'D' filled with NA and then assigning a new row with the value 4.
df1$D <- NA
df1['x4', ] <- 4
-output
> df1
A B C D
x1 1 1 1 NA
x2 2 2 2 NA
x3 3 3 3 NA
x4 4 4 4 4
Or in a single line
rbind(cbind(df1, D = NA), x4 = 4)
A B C D
x1 1 1 1 NA
x2 2 2 2 NA
x3 3 3 3 NA
x4 4 4 4 4
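For use inside a loop, where the set of new columns is not known in advance, the same idea can be wrapped in a small function. This is only a sketch: add_row_fill is a hypothetical helper name, and it assumes each loop result is a named vector.

```r
df1 <- data.frame(A = 1:3, B = 1:3, C = 1:3,
                  row.names = c("x1", "x2", "x3"))

# Hypothetical helper: append a named vector as a new row, adding any
# columns that are missing on either side and filling them with NA
add_row_fill <- function(df, row, name) {
  new <- as.data.frame(as.list(row), row.names = name)
  for (col in setdiff(names(new), names(df))) df[[col]] <- NA
  for (col in setdiff(names(df), names(new))) new[[col]] <- NA
  rbind(df, new[names(df)])   # same column order on both sides
}

out <- add_row_fill(df1, c(A = 4, B = 4, C = 4, D = 4), "x4")
out
```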
Regarding the error in bind_rows, it happens when the for loop output is not a named vector
library(dplyr)
> vec1 <- c(4, 4, 4, 4)
> bind_rows(df1, vec1)
Error: Argument 2 must have names.
Run `rlang::last_error()` to see where the error occurred.
If it is a named vector, then it should work
> vec1 <- c(A = 4, B = 4, C = 4, D = 4)
> bind_rows(df1, vec1)
A B C D
x1 1 1 1 NA
x2 2 2 2 NA
x3 3 3 3 NA
...4 4 4 4 4
data
df1 <- structure(list(A = 1:3, B = 1:3, C = 1:3),
class = "data.frame", row.names = c("x1",
"x2", "x3"))
You probably have something like this, if you list the elements of your for loop.
(l <- list(x1, x2, x3, x4, x5))
# [[1]]
# [1] 1 1 1
#
# [[2]]
# [1] 2 2 2 2
#
# [[3]]
# [1] 3 3
#
# [[4]]
# [1] 4
#
# [[5]]
# NULL
Multiple elements can be row-bound using a do.call(rbind, .) approach. Your problem is how to rbind multiple elements that differ in length.
There's a `length<-` function with which you may adjust the length of a vector. To know to which length, there's another function, lengths, that gives you the lengths of each list element, where you are interested in the maximum.
I include the special case where an element is NULL (our 5th element of l); since the length of NULL cannot be changed, those elements are replaced with NA.
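A quick illustration of the two building blocks on small, made-up vectors (v and ex are hypothetical names used only here):

```r
# `length<-` pads a vector with NA when it is made longer
v <- c(3, 3)
length(v) <- 4
v

# lengths() reports the length of every list element; NULL counts as 0
ex <- list(c(1, 1, 1), c(2, 2, 2, 2), NULL)
lens <- lengths(ex)
max(lens)
```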
So altogether you may do:
do.call(rbind, lapply(replace(l, lengths(l) == 0L, NA), `length<-`, max(lengths(l))))
# [,1] [,2] [,3] [,4]
# [1,] 1 1 1 NA
# [2,] 2 2 2 2
# [3,] 3 3 NA NA
# [4,] 4 NA NA NA
# [5,] NA NA NA NA
Or, since you probably want a data frame with pretty row and column names:
ml <- max(lengths(l))
do.call(rbind, lapply(replace(l, lengths(l) == 0L, NA), `length<-`, ml)) |>
as.data.frame() |> `dimnames<-`(list(paste0('x', 1:length(l)), LETTERS[1:ml]))
# A B C D
# x1 1 1 1 NA
# x2 2 2 2 2
# x3 3 3 NA NA
# x4 4 NA NA NA
# x5 NA NA NA NA
Note: R >= 4.1 used.
Data:
x1 <- rep(1, 3); x2 <- rep(2, 4); x3 <- rep(3, 2); x4 <- rep(4, 1); x5 <- NULL
I am struggling to solve the captioned problem.
My dataframe is like:
X1 X2 X3 X4 X5
1 1 2 3 4 5
2 6 7 8 9 10
3 11 12 13 14 15
What I am trying to do is randomly select 3 elements from the third and fourth columns and replace their values with 0. So the manipulated data frame could look like
X1 X2 X3 X4 X5
1 1 2 3 4 5
2 6 7 0 0 10
3 11 12 13 0 15
I saw in the question "Random number selection from a data-frame" that it could be easier if I convert the data frame into a matrix, so I tried
mat <- data.frame(rbind(rep(1:5, 1), rep(6:10, 1), rep(11:15, 1)))
mat_matrix <- as.matrix(mat)
mat_matrix[sample(mat_matrix[, 3:4], 3)] <- 0
But it just randomly picked 3 elements across all columns and rows in the matrix and turned them into 0.
Can anyone help me out?
You can use slice.index and sample from that.
mat_matrix[sample(slice.index(mat_matrix, 1:2)[,3:4], 3)] <- 0
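The original attempt went wrong because sample() was given the values stored in columns 3 and 4, and those values were then used as positions in the whole matrix. What needs to be sampled is the positions themselves. A sketch of an alternative way to get them, using col() and which() instead of slice.index (the seed is arbitrary and only there for reproducibility):

```r
m <- matrix(1:15, nrow = 3, byrow = TRUE)

# Linear (column-major) positions of all cells in columns 3 and 4
pos <- which(col(m) %in% 3:4)
pos

# Sample three of those positions and overwrite them
set.seed(1)
m[sample(pos, 3)] <- 0
m
```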
Nothing wrong with a for loop in this case. Perhaps like this:
mat <- data.frame(rbind(rep(1:5, 1), rep(6:10, 1), rep(11:15, 1)))
cols <- c(3,4)
n <- nrow(mat)*length(cols)
v <- sample( x=1:n, size=3 )
m <- matrix(FALSE, ncol=length(cols), nrow=nrow(mat))
m[v] <- TRUE
for( i in seq_along(cols) ) {
mat[ m[,i], cols[i] ] <- 0
}
Just create a two column "index matrix" that you sample on and use to replace back into your data.
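A minimal sketch of that index-matrix idea, assuming the example data from the question; idx and pick are hypothetical names, and the seed is arbitrary:

```r
mat <- data.frame(rbind(1:5, 6:10, 11:15))
cols <- 3:4

# Every (row, column) pair that is a candidate for replacement
idx <- as.matrix(expand.grid(row = seq_len(nrow(mat)), col = cols))

# Pick three pairs at random and use them as a two-column index
set.seed(42)
pick <- idx[sample(nrow(idx), 3), , drop = FALSE]
mat[pick] <- 0
mat
```

Because the pairs are sampled without replacement, exactly three distinct cells are set to 0, and only in the chosen columns.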
Here is one way using replace
cols <- c("X3", "X4")
N <- 3
df[cols] <- replace(as.matrix(df[cols]), sample(length(unlist(df[cols])), N), 0)
such that
> df
X1 X2 X3 X4 X5
1 1 2 3 0 5
2 6 7 8 0 10
3 11 12 0 14 15
I'm trying to make a new matrix using values from another matrix in R. I'm trying to match the row and column names while importing the values. This is what I am trying to do:
I have two matrices;
X1 X2 X3 X4
X1 0 9 8 0
X2 1 2 3 5
X4 6 1 2 4
X1 X2 X3 X4
X1 NA NA NA NA
X2 NA NA NA NA
X3 NA NA NA NA
X4 NA NA NA NA
I want to do
X1 X2 X3 X4
X1 0 9 8 0
X2 1 2 3 5
X3 NA NA NA NA
X4 6 1 2 4
These matrices are just simple examples of my dataset, my real data is more complicated.
Many thanks,
Checking for rownames and colnames matches in both matrices will prevent a subscript out of bounds error. See below.
mat2[rownames(mat2) %in% rownames(mat1),
colnames(mat2) %in% colnames(mat1)] <- mat1[rownames(mat1) %in% rownames(mat2),
colnames(mat1) %in% colnames(mat2)]
mat2
# X1 X2 X3 X4
# X1 0 9 8 0
# X2 1 2 3 5
# X3 NA NA NA NA
# X4 6 1 2 4
Data:
mat1 <- read.table(text = ' X1 X2 X3 X4
X1 0 9 8 0
X2 1 2 3 5
X4 6 1 2 4', header = TRUE)
mat1 <- as.matrix(mat1)
mat2 <- matrix(NA, nrow = 4, ncol = 4, dimnames = list(paste0("X", 1:4),
paste0("X", 1:4)))
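The logical %in% subsetting above relies on the shared names appearing in the same order in both matrices. If that is not guaranteed, indexing by the names themselves may be more robust. A sketch under that assumption, with r and k as hypothetical names for the shared row and column labels:

```r
mat1 <- matrix(c(0, 1, 6, 9, 2, 1, 8, 3, 2, 0, 5, 4), nrow = 3,
               dimnames = list(c("X1", "X2", "X4"), paste0("X", 1:4)))
mat2 <- matrix(NA_real_, nrow = 4, ncol = 4,
               dimnames = list(paste0("X", 1:4), paste0("X", 1:4)))

# Names present in both matrices
r <- intersect(rownames(mat1), rownames(mat2))
k <- intersect(colnames(mat1), colnames(mat2))

# Character indexing matches cells by name, regardless of position
mat2[r, k] <- mat1[r, k]
mat2
```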
If I understood your question you can do this:
# Building your matrices
mat1 <- matrix(runif(12), nrow = 3, ncol = 4)
mat2 <- matrix(NA, nrow = 4, ncol = 4)
labs <- paste0("x", 1:4)
colnames(mat1) <- colnames(mat2) <- labs
rownames(mat2) <- labs
rownames(mat1) <- labs[c(1:2, 4)]
#
rows <- sort(unique(c(rownames(mat1), rownames(mat2))))
result <- matrix(NA, nrow = length(rows), ncol = ncol(mat1))
result[match(rownames(mat1), rows), ] <- mat1
The data contains four fields: id, x1, x2, and x3.
id <- c(1,2,3,4,5,6,7,8,9,10)
x1 <- c(2,4,5,3,6,4,3,6,7,7)
x2 <- c(0,1,2,6,7,6,0,8,2,2)
x3 <- c(5,3,4,5,8,3,4,2,5,6)
DF <- data.frame(id, x1,x2,x3)
Before I ask the question, let me create a new field (minX) which is the min of (x1,x2,x3)
DF$minX <- pmin(DF$x1, DF$x2, DF$x3)
I need to create a new field, y, that is defined as follows
if min(x1,x2,x3) = x1, then y = "x1"
if min(x1,x2,x3) = x2, then y = "x2"
if min(x1,x2,x3) = x3, then y = "x3"
Note: we assume no ties.
As a simple solution, do:
VARS <- colnames(DF)[-1]
y <- VARS[apply(DF[, -1], MARGIN = 1, FUN = which.min)]
DF$y <- y
The function which.min returns the index of the minimum. If the minimum is not unique it returns the first one. Since you guarantee that there is no tie, this is not an issue here.
Finally, you should be familiar with apply, right? MARGIN = 1 means applying the function FUN row-wise, while MARGIN = 2 means applying FUN column-wise. This is a useful function for avoiding a for loop when dealing with a matrix. Since your data frame contains only numeric/integer values, it behaves like a matrix, hence we can use apply.
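A small illustration of both points, on a made-up 2 x 3 matrix (m, row_min_name, col_min and tie are names used only for this sketch):

```r
m <- matrix(c(2, 0, 5,
              4, 6, 3), nrow = 2, byrow = TRUE,
            dimnames = list(NULL, c("x1", "x2", "x3")))

# MARGIN = 1: one result per row (position of the row minimum)
row_min_name <- colnames(m)[apply(m, 1, which.min)]

# MARGIN = 2: one result per column (the column minimum)
col_min <- apply(m, 2, min)

# With ties, which.min() returns the first position only
tie <- which.min(c(x1 = 3, x2 = 1, x3 = 1))
names(tie)
```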
Here is another option using pmin and max.col
library(data.table)
setDT(DF)[, c("minx", "y") := list(do.call(pmin, .SD),
names(.SD)[max.col(-1*.SD)]), .SDcols= x1:x3]
DF
# id x1 x2 x3 minx y
# 1: 1 2 0 5 0 x2
# 2: 2 4 1 3 1 x2
# 3: 3 5 2 4 2 x2
# 4: 4 3 6 5 3 x1
# 5: 5 6 7 8 6 x1
# 6: 6 4 6 3 3 x3
# 7: 7 3 0 4 0 x2
# 8: 8 6 8 2 2 x3
# 9: 9 7 2 5 2 x2
#10: 10 7 2 6 2 x2
A data.table solution:
# create variables
id <- c(1,2,3,4,5,6,7,8,9,10)
x1 <- c(2,4,5,3,6,4,3,6,7,7)
x2 <- c(0,1,2,6,7,6,0,8,2,2)
x3 <- c(5,3,4,5,8,3,4,2,5,6)
DF <- data.frame(id, x1,x2,x3)
# load package and set data table, calculating min
library(data.table)
setDT(DF)[, minx := apply(.SD, 1, min), .SDcols=c("x1", "x2", "x3")]
# Create variable with name of minimum
DF[, y := apply(.SD, 1, function(x) names(x)[which.min(x)]), .SDcols = c("x1", "x2", "x3")]
# call result
DF
## id x1 x2 x3 minx y
1: 1 2 0 5 0 x2
2: 2 4 1 3 1 x2
3: 3 5 2 4 2 x2
4: 4 3 6 5 3 x1
5: 5 6 7 8 6 x1
6: 6 4 6 3 3 x3
7: 7 3 0 4 0 x2
8: 8 6 8 2 2 x3
9: 9 7 2 5 2 x2
10: 10 7 2 6 2 x2
The last step can be called directly, without the need to calculate minx.
Please note that data.table is particularly fast on large data sets.
######## EDIT TO ADD: DPLYR METHOD #########
For completeness, this would be a dplyr method to produce the same (final) result. This solution is credited to @eipi10 in a question I started out of this problem (see here):
DF %>% mutate(y = apply(.[,2:4], 1, function(x) names(x)[which.min(x)]))
This solution takes about the same time as the data.table one provided in the original answer when applied to a data frame with 1e6 rows (about 17 secs on my Sony laptop).
What is the most efficient way to apply gsub to various columns?
The following does not work
x1=c("10%","20%","30%")
x2=c("60%","50%","40%")
x3 = c(1,2,3)
x = data.frame(x1,x2,x3)
per_col = c(1,2)
x = gsub("%","",x[,per_col])
How can I most efficiently drop the "%" sign in specified columns?
Can I apply it to the whole dataframe? This would be useful in the case where I don't know where the percentage columns are.
You can use apply to apply it to the whole data.frame
apply(x, 2, function(y) as.numeric(gsub("%", "", y)))
x1 x2 x3
[1,] 10 60 1
[2,] 20 50 2
[3,] 30 40 3
Or, you could try the lapply solution:
as.data.frame(lapply(x, function(y) gsub("%", "", y)))
x1 x2 x3
1 10 60 1
2 20 50 2
3 30 40 3
To clean the % out you can do:
x[per_col] <- lapply(x[per_col], function(y) as.numeric(gsub("%", "", y)))
x
x1 x2 x3
1 10 60 1
2 20 50 2
3 30 40 3
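For the case where the percentage columns are not known in advance, one option is to detect them first. This is a sketch that assumes a column counts as a percentage column only if every value ends in "%"; is_pct is a hypothetical name:

```r
x <- data.frame(x1 = c("10%", "20%", "30%"),
                x2 = c("60%", "50%", "40%"),
                x3 = c(1, 2, 3))

# Flag columns in which every value ends with a percent sign
is_pct <- vapply(x, function(col) all(grepl("%$", col)), logical(1))

# Strip the trailing "%" and convert only the flagged columns
x[is_pct] <- lapply(x[is_pct], function(col) as.numeric(sub("%$", "", col)))
str(x)
```

Columns that do not match the pattern, such as x3 here, are left untouched.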
The first answer works, but be careful if your data frame contains strings: @docendo discimus's answer will return NAs for them.
If you want to keep the column contents as strings, just remove the as.numeric and convert the result into a data frame afterwards:
as.data.frame(apply(x, 2, function(y) gsub("%", "", y)))
x1 x2 x3
1 10 60 1
2 20 50 2
3 30 40 3
We can unlist per_col columns, remove "%" symbol and convert it into numeric.
x[per_col] <- as.numeric(gsub("%","", unlist(x[per_col])))
#In this case using sub would be enough too as we have only 1 % symbol to replace
#x[per_col] <- as.numeric(sub("%","", unlist(x[per_col])))
x
# x1 x2 x3
#1 10 60 1
#2 20 50 2
#3 30 40 3
To add on docendo discimus' answer, an extension with non-adjacent columns and returning a data.frame:
x1 <- c("10%", "20%", "30%")
x2 <- c("60%", "50%", "40%")
x3 <- c(1, 2, 3)
x4 <- c("60%", "50%", "40%")
x <- data.frame(x1, x2, x3, x4)
x[, c(1:2, 4)] <- as.data.frame(apply(x[,c(1:2, 4)], 2,
function(x) {
as.numeric(gsub("%", "", x))}
))
> x
x1 x2 x3 x4
1 10 60 1 60
2 20 50 2 50
3 30 40 3 40
> class(x)
[1] "data.frame"