Applying gsub to various columns - r

What is the most efficient way to apply gsub to various columns?
The following does not work
x1=c("10%","20%","30%")
x2=c("60%","50%","40%")
x3 = c(1,2,3)
x = data.frame(x1,x2,x3)
per_col = c(1,2)
x = gsub("%","",x[,per_col])
How can I most efficiently drop the "%" sign in specified columns.
Can I apply it to the whole dataframe? This would be useful in the case where I don't know where the percentage columns are.

You can use apply to process the whole data.frame at once (note that apply returns a matrix, not a data.frame):
apply(x, 2, function(y) as.numeric(gsub("%", "", y)))
x1 x2 x3
[1,] 10 60 1
[2,] 20 50 2
[3,] 30 40 3

Or, you could try the lapply solution:
as.data.frame(lapply(x, function(y) gsub("%", "", y)))
x1 x2 x3
1 10 60 1
2 20 50 2
3 30 40 3

To clean the % out you can do:
x[per_col] <- lapply(x[per_col], function(y) as.numeric(gsub("%", "", y)))
x
x1 x2 x3
1 10 60 1
2 20 50 2
3 30 40 3
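If the percentage columns are not known in advance, one possible approach is to detect them first and then clean only those. The following is a minimal sketch under stated assumptions: the name is_pct and the rule "a character column in which every value ends in %" are illustrative choices, not part of the answers above.

```r
x <- data.frame(
  x1 = c("10%", "20%", "30%"),
  x2 = c("60%", "50%", "40%"),
  x3 = c(1, 2, 3),
  stringsAsFactors = FALSE  # keep strings as character on R < 4.0 too
)

# Assumption: a "percentage column" is a character column whose values all end in "%"
is_pct <- vapply(x, function(col) is.character(col) && all(grepl("%$", col)),
                 logical(1))

# Strip the sign and convert only the detected columns
x[is_pct] <- lapply(x[is_pct], function(col) as.numeric(sub("%", "", col, fixed = TRUE)))
str(x)
```

Columns that are already numeric (x3 here) are left untouched, so no NAs are introduced by coercion.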

The first answer works, but be careful if your data.frame contains genuine string columns: @docendo discimus's answer will turn them into NAs.
If you want to keep the content of your columns as strings, just remove the as.numeric and convert the resulting matrix into a data frame afterwards:
as.data.frame(apply(x, 2, function(y) gsub("%", "", y)))
x1 x2 x3
1 10 60 1
2 20 50 2
3 30 40 3
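To make the NA caveat concrete, here is a small sketch with a hypothetical data frame that has a genuine text column (the column names id and pct are made up for illustration). It compares the as.numeric variant with the string-preserving one.

```r
x <- data.frame(id  = c("a", "b", "c"),
                pct = c("10%", "20%", "30%"),
                stringsAsFactors = FALSE)

# With as.numeric, the text column cannot be converted and becomes NA
num <- suppressWarnings(apply(x, 2, function(y) as.numeric(gsub("%", "", y))))

# Without as.numeric, everything stays character and the text survives
chr <- as.data.frame(apply(x, 2, function(y) gsub("%", "", y)),
                     stringsAsFactors = FALSE)
```

In chr the pct column is still character ("10", "20", "30"), so it needs a separate as.numeric call if numbers are required.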

We can unlist the per_col columns, remove the "%" symbol and convert the result to numeric.
x[per_col] <- as.numeric(gsub("%","", unlist(x[per_col])))
#In this case using sub would be enough too as we have only 1 % symbol to replace
#x[per_col] <- as.numeric(sub("%","", unlist(x[per_col])))
x
# x1 x2 x3
#1 10 60 1
#2 20 50 2
#3 30 40 3

To add on docendo discimus' answer, an extension with non-adjacent columns and returning a data.frame:
x1 <- c("10%", "20%", "30%")
x2 <- c("60%", "50%", "40%")
x3 <- c(1, 2, 3)
x4 <- c("60%", "50%", "40%")
x <- data.frame(x1, x2, x3, x4)
x[, c(1:2, 4)] <- as.data.frame(apply(x[, c(1:2, 4)], 2,
  function(x) as.numeric(gsub("%", "", x))))
> x
x1 x2 x3 x4
1 10 60 1 60
2 20 50 2 50
3 30 40 3 40
> class(x)
[1] "data.frame"

Related

R: Sample n elements in certain columns in a dataframe/matrix and replace their values

I am struggling to solve the captioned problem.
My dataframe is like:
X1 X2 X3 X4 X5
1 1 2 3 4 5
2 6 7 8 9 10
3 11 12 13 14 15
What I am trying to do is randomly select 3 elements from the third and fourth columns and replace their values with 0. So the manipulated dataframe could look like
X1 X2 X3 X4 X5
1 1 2 3 4 5
2 6 7 0 0 10
3 11 12 13 0 15
I saw in "Random number selection from a data-frame" that it could be easier if I convert the data frame into a matrix, so I tried
mat <- data.frame(rbind(rep(1:5, 1), rep(6:10, 1), rep(11:15, 1)))
mat_matrix <- as.matrix(mat)
mat_matrix[sample(mat_matrix[, 3:4], 3)] <- 0
But it just randomly picked 3 elements across all columns and rows in the matrix and turned them into 0.
Can anyone help me out?
You can use slice.index and sample from that.
mat_matrix[sample(slice.index(mat_matrix, 1:2)[,3:4], 3)] <- 0
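The attempt in the question fails because sample(mat_matrix[, 3:4], 3) draws cell values (such as 3, 9 or 14) and then uses them as positions. What needs to be sampled is the set of linear indices of the cells in columns 3 and 4. As a rough sketch of the same idea without slice.index, using base R's col() (the variable names orig and idx are illustrative):

```r
orig <- matrix(1:15, nrow = 3, byrow = TRUE)
mat_matrix <- orig

set.seed(1)  # only so that the example is reproducible

# Linear (column-major) indices of every cell in columns 3 and 4: 7..12
idx <- which(col(mat_matrix) %in% 3:4)

# Sample 3 positions, not 3 values, and zero them out
mat_matrix[sample(idx, 3)] <- 0
mat_matrix
```

Because the sample is drawn from positions, columns 1, 2 and 5 can never be touched.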
Nothing wrong with a for loop in this case. Perhaps like this:
mat <- data.frame(rbind(rep(1:5, 1), rep(6:10, 1), rep(11:15, 1)))
cols <- c(3, 4)
n <- nrow(mat) * length(cols)    # number of candidate cells
v <- sample(x = 1:n, size = 3)   # positions to set to zero
m <- matrix(FALSE, ncol = length(cols), nrow = nrow(mat))
m[v] <- TRUE
for (i in seq_along(cols)) {
  mat[m[, i], cols[i]] <- 0
}
Just create a two column "index matrix" that you sample on and use to replace back into your data.
Here is one way using replace
cols <- c("X3", "X4")
N <- 3
df[cols] <- replace(as.matrix(df[cols]), sample(length(unlist(df[cols])), N), 0)
such that
> df
X1 X2 X3 X4 X5
1 1 2 3 0 5
2 6 7 8 0 10
3 11 12 0 14 15

Sum the last n non NA values in each column of a matrix in R

I have a matrix that looks like below:
x1<-c(1,2,3,4,5,6,NA)
x2<-c(1,2,NA,4,5,NA,NA)
x3<-c(1,2,3,4,NA,NA,NA)
x4<-c(1,2,3,NA,NA,NA,NA)
x5<-c(1,2,NA,NA,NA,NA,NA)
x<-cbind(x1,x2,x3,x4,x5)
I want to sum the last 3 non-NA values of each column; if a column has fewer than 3 non-NA values (like column 5), I'll sum all the non-NA values in that column. I want an output that looks like
15 11 9 6 3
Thank you!
You can use apply with tail to sum up the last non NA like:
apply(x, 2, function(x) sum(tail(x[!is.na(x)], 3)))
#x1 x2 x3 x4 x5
#15 11 9 6 3
It also works with a custom function (@GKi's answer is pretty cool):
# Build function
myfun <- function(y) {
  # Count the non-NA values
  i <- length(which(!is.na(y)))
  if (i < 3) {
    r1 <- sum(y, na.rm = TRUE)
  } else {
    y1 <- y[!is.na(y)]
    y2 <- y1[(length(y1) - 2):length(y1)]
    r1 <- sum(y2)
  }
  return(r1)
}
# Apply
apply(x, 2, myfun)
Output:
x1 x2 x3 x4 x5
15 11 9 6 3
One dplyr option using the logic from @GKi could be:
library(dplyr)
x %>%
  data.frame() %>%
  summarise(across(everything(), ~ sum(tail(na.omit(.), 3))))
x1 x2 x3 x4 x5
1 15 11 9 6 3
Or:
x %>%
data.frame() %>%
summarise(across(everything(), ~ sum(rev(na.omit(.))[1:3], na.rm = TRUE)))
Using sapply from base R
sapply(as.data.frame(x), function(x) sum(tail(na.omit(x), 3)))
# x1 x2 x3 x4 x5
#15 11 9 6 3
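The answers above all share one idea: drop the NAs, keep the last n values with tail, and sum. A small sketch that wraps this in a reusable function is shown below; the name sum_last_n and its n argument are illustrative, not taken from any answer.

```r
x <- cbind(x1 = c(1, 2, 3, 4, 5, 6, NA),
           x2 = c(1, 2, NA, 4, 5, NA, NA),
           x3 = c(1, 2, 3, 4, NA, NA, NA),
           x4 = c(1, 2, 3, NA, NA, NA, NA),
           x5 = c(1, 2, NA, NA, NA, NA, NA))

# Sum the last n non-NA values of each column; tail() simply returns
# everything when fewer than n values are left
sum_last_n <- function(m, n = 3) {
  apply(m, 2, function(col) sum(tail(col[!is.na(col)], n)))
}

res <- sum_last_n(x)
res
# x1 x2 x3 x4 x5
# 15 11  9  6  3
```

A column consisting only of NAs would yield 0 here, since sum() of an empty vector is 0.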

Convert 3D array into list of dataframes

Basically, I want to group a 3D array by its columns, transform it into a data frame, and bind to it a new column whose value equals to the sum of all existing columns.
For example, consider the following 3D array
> (src <- array(1:8, c(2, 2, 2), dimnames=list(c('X1', 'X2'), c('Y1', 'Y2'), 1:2)))
, , 1
Y1 Y2
X1 1 3
X2 2 4
, , 2
Y1 Y2
X1 5 7
X2 6 8
I would like to convert it to
> (dest <- list(Y1=data.frame(X1=c(1, 5), X2=c(2, 6), Y1=c(1, 5)+c(2, 6)),
Y2=data.frame(X1=c(3, 7), X2=c(4, 8), Y2=c(3, 7)+c(4, 8))))
$Y1
X1 X2 Y1
1 1 2 3
2 5 6 11
$Y2
X1 X2 Y2
1 3 4 7
2 7 8 15
I know how to do the transformation for each individual column in the original array, but have no idea how to handle multiple columns simultaneously.
> library(dplyr)
> as.data.frame(t(src[, 'Y1', ])) %>% mutate(Y1=X1+X2)
X1 X2 Y1
1 1 2 3
2 5 6 11
Feel free to use base R, dplyr, data.table, or whatever package you prefer, as long as it's fast enough. In the real-world application, dim(src) tend to be something like c(hundreds, tens, tens of thousands).
We could first apply data.frame-transformation on margin 2 of the transposed array, where we transpose arrays with aperm(). Then we proceed similarly with the colSums. In order to get the right names "Y1", "Y2" we make an interim step listing the columns as data frames. Finally Map evaluates both lists (the X* and colsums of Y*) element by element.
dest <- Map(cbind, apply(aperm(src, c(3, 2, 1)), 2, data.frame),
{tmp <- data.frame(apply(src, 2, colSums));list(tmp[1], tmp[2])})
dest
# $Y1
# X1 X2 Y1
# 1 1 2 3
# 2 5 6 11
#
# $Y2
# X1 X2 Y2
# 1 3 4 7
# 2 7 8 15
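As a possibly easier-to-read alternative, the per-column transformation from the question can be wrapped in lapply over the column names. This is only a sketch and has not been benchmarked against the Map/aperm version; it also assumes the third dimension has length greater than 1, since src[, j, ] would otherwise drop to a vector.

```r
src <- array(1:8, c(2, 2, 2),
             dimnames = list(c("X1", "X2"), c("Y1", "Y2"), 1:2))

# setNames(nm = ...) makes the result a list named Y1, Y2
dest <- lapply(setNames(nm = dimnames(src)[[2]]), function(j) {
  d <- as.data.frame(t(src[, j, ]))  # rows: third dimension, columns: X1, X2
  d[[j]] <- rowSums(d)               # new column named after the slice
  rownames(d) <- NULL
  d
})
dest
```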

Call col name with min value (NA included)

I have df including NA.
df <- data.frame( X1= c(NA, 1, 4, NA),
X2 = c(34, 75, 1, 4),
X3= c(2,9,3,5))
My desired outcome looks like this:
X1 X2 X3 Min
1 NA 34 2 X3
2 1 75 9 X1
3 4 1 3 X2
4 NA 4 5 X2
I have tried
df$Min <- colnames(df)[apply(df,1,which.min, na.rm=TRUE)]
but this one didn't work
You don't need the na.rm=TRUE when using which.min() – try this instead:
df$Min <- colnames(df)[apply(df,1,which.min)]
Output:
X1 X2 X3 Min
1 NA 34 2 X3
2 1 75 9 X1
3 4 1 3 X2
4 NA 4 5 X2
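The original attempt fails because which.min() has no na.rm argument; it already skips NAs on its own. One edge case worth knowing is a row made up entirely of NAs, for which which.min() returns integer(0). A minimal sketch that guards against this (the name min_col is illustrative):

```r
df <- data.frame(X1 = c(NA, 1, 4, NA),
                 X2 = c(34, 75, 1, 4),
                 X3 = c(2, 9, 3, 5))

min_col <- apply(df, 1, function(r) {
  i <- which.min(r)  # NAs are ignored; integer(0) if the whole row is NA
  if (length(i) == 0) NA_character_ else names(r)[i]
})
df$Min <- unname(min_col)
df
```

Without the guard, an all-NA row would make apply return a list instead of a character vector.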
Code:
foo <- names(df)
df$Min <- apply(df, 1, function(x) foo[which.min(x)])
df
Output:
X1 X2 X3 Min
1 NA 34 2 X3
2 1 75 9 X1
3 4 1 3 X2
4 NA 4 5 X2
Here's an idea that will likely be faster and does not require any looping. You could replace NA with Inf, take the negative of the data, then find the position of the maximum in each row via max.col().
names(df)[max.col(-replace(df, is.na(df), Inf))]
# [1] "X3" "X1" "X2" "X2"
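One thing to keep in mind with max.col() is that it breaks ties at random by default. If a deterministic result is preferred, ties.method = "first" can be passed, as in this sketch on the same data (the name idx is illustrative):

```r
df <- data.frame(X1 = c(NA, 1, 4, NA),
                 X2 = c(34, 75, 1, 4),
                 X3 = c(2, 9, 3, 5))

# NA -> Inf so that, after negation, NAs can never be the maximum
neg <- -replace(df, is.na(df), Inf)

# Column position of the row-wise maximum; ties go to the leftmost column
idx <- max.col(neg, ties.method = "first")
names(df)[idx]
# [1] "X3" "X1" "X2" "X2"
```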
Also, not to forget, a data.table solution, given that dt <- as.data.table(df)
dt[ , Min:=names(dt)[match(min(.SD, na.rm=T), .SD)], by=1:nrow(dt)][]
# X1 X2 X3 Min
#1: NA 34 2 X3
#2: 1 75 9 X1
#3: 4 1 3 X2
#4: NA 4 5 X2
Not much simpler than the solutions above, just extending the choices here.

The order returned from a vectorised function

I am sending two columns of a data frame to a vectorised function.
For each row of the data frame, the function will return 3 rows. So the total number of rows returned will be nrow(dataframe) * 3. The total columns returned will be equal to 2.
The trivial function below produces the correct set of numbers. But these numbers are returned in a peculiar order. I guess it would be possible to get the order of these numbers in the order I desire...using some combination of base functions. But, if possible, I want to write easy-to-understand code.
So my question is this:
Is there a better way of writing either the function (or call to the function) such that it will produce the desired result (which is commented out below) ?
fnVector <- function(fx, fy) {
x1 <- fx + 1
x2 <- fx + 2
x3 <- fx + 3
y1 <- fy + 1
y2 <- fy + 2
y3 <- fy + 3
vctx <- c(x1, x2, x3)
vcty <- c(y1, y2, y3)
#vct.pair <- c(vctx, vcty)
vct.series <- c(x1, y1, x2, y2, x3, y3)
return(vct.series)
}
vct.names <- c("a", "b")
vct.x <- c(10, 20)
vct.y <- c(100, 200)
df.data <- data.frame(name = vct.names, x = vct.x, y = vct.y)
aa <- fnVector(df.data$x, df.data$y)
# desired result [nrow(dataframe) * 3, 2] (i.e. 6 x 2)
#11, 101 (i.e. row a)
#12, 102 (i.e. row a)
#13, 103 (i.e. row a)
#21, 201 (i.e. row b)
#22, 202 (i.e. row b)
#23, 203 (i.e. row b)
I think you want to interleave your vectors, i.e. the returned x is x1[1], x2[1], x3[1], x1[2], x2[2], x3[2], ...
so you could:
vctx <- c(rbind(x1, x2, x3)) # interleaves x1, x2, x3
vcty <- c(rbind(y1, y2, y3)) # interleaves y1, y2, y3
Then return a matrix, not a vector:
return(cbind(vctx, vcty))
Giving you
fnVector <- function(fx, fy) {
x1 <- fx + 1
x2 <- fx + 2
x3 <- fx + 3
y1 <- fy + 1
y2 <- fy + 2
y3 <- fy + 3
vctx <- c(rbind(x1, x2, x3)) # interleaves x1, x2, x3
vcty <- c(rbind(y1, y2, y3)) # interleaves y1, y2, y3
return(cbind(vctx, vcty))
}
fnVector(df.data$x, df.data$y)
# vctx vcty
# [1,] 11 101
# [2,] 12 102
# [3,] 13 103
# [4,] 21 201
# [5,] 22 202
# [6,] 23 203
You may want to think about also retaining the name column.
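One way to retain the name column in base R is to repeat each input row three times with rep(..., each = ) and add the offsets. This is just a sketch for the simplified fnVector in the question; the names offsets and out are illustrative.

```r
df.data <- data.frame(name = c("a", "b"),
                      x = c(10, 20),
                      y = c(100, 200),
                      stringsAsFactors = FALSE)

offsets <- 1:3
k <- length(offsets)

# Each input row is repeated k times; offsets are recycled across the rows
out <- data.frame(name = rep(df.data$name, each = k),
                  x = rep(df.data$x, each = k) + offsets,
                  y = rep(df.data$y, each = k) + offsets)
out
```

This relies on vector recycling, so it only works as written because the output length is an exact multiple of length(offsets).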
I don't know if this is adaptable to your specific application or not (I understand you have simplified your fnVector for the purposes of this question), but you might want to investigate plyr:
library(plyr)
ddply(df.data, .(name), summarize,
vctx = x + 1:3,
vcty = y + 1:3)
# name vctx vcty
# 1 a 11 101
# 2 a 12 102
# 3 a 13 103
# 4 b 21 201
# 5 b 22 202
# 6 b 23 203
Here ddply(df.data, .(name), ...) says "for each unique value in df.data$name", summarize says "call the summarize function", and the two named arguments vctx = ... and vcty = ... create the 3 output rows for each of these columns (for us, x + 1:3 and y + 1:3, but for your application, probably something more complex).
I think your function can be greatly simplified, and I also think it makes the most sense to use the custom function along with one of the apply functions. Try this code:
fnVector <- function(x) {
y <- rbind(x+1, x+2, x+3)
return(y)
}
df.output <- data.frame(apply(df.data[, c("x", "y")], 2, function(x) fnVector(x)))
> df.output
x y
1 11 101
2 12 102
3 13 103
4 21 201
5 22 202
6 23 203
