Rearrange data from single observation to many - r

I have a text file in the following form:
x1, y1, z1, x2, y2, z2, x3, y3, z3
If I import it with read.csv, I get a single observation with nine variables (nine in this example; the number of triplets in the real file is unknown).
I want to rearrange the data so that I have many observations with three variables:
x1 y1 z1
x2 y2 z2
x3 y3 z3
So I can perform operations on each triplet.
For example I want to transform this
fileData <- read.table(text = "1 2 3 10 20 30 100 200 300")
> fileData
V1 V2 V3 V4 V5 V6 V7 V8 V9
1 1 2 3 10 20 30 100 200 300
to this:
> fileData
V1 V2 V3
1 1 2 3
2 10 20 30
3 100 200 300
How can I split it?

Not sure what your actual goal is, but using base R (unlist first, so that matrix receives a plain vector rather than a data frame):
data.frame(matrix(unlist(fileData), ncol = 3, byrow = TRUE))
This should give you what you want:
X1 X2 X3
1 1 2 3
2 10 20 30
3 100 200 300

akash gave a great answer but it may not work if you have mixed data types (numeric and character) because the matrix will force everything to be one type. An alternative is something like the following where we lapply across an index based on the number of columns desired.
fileData <- read.table(text = "m 2 3 a 20 30 cat 200 300")
rows = lapply(seq(3,ncol(fileData),by=3),
function(x){
range = paste("V",(x-2):x,sep="")
output = fileData[,range]
names(output) = c("x","y","z")
return(output)
})
do.call(rbind,rows)
#> x y z
#> 1 m 2 3
#> 2 a 20 30
#> 3 cat 200 300
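If the number of triplets is not known in advance, the same idea can be written with a grouping vector over column positions instead of pasted column names. This is a minimal sketch, assuming the columns always arrive in complete groups of three; k, grp, pieces and result are illustrative names, not taken from the answers above.

```r
fileData <- read.table(text = "m 2 3 a 20 30 cat 200 300")

k <- 3                                     # assumed group size (one triplet)
stopifnot(ncol(fileData) %% k == 0)        # guard against incomplete triplets

# 0 0 0 1 1 1 2 2 2: which triplet each column belongs to
grp <- (seq_along(fileData) - 1) %/% k

# Pull each triplet out as its own one-row data frame, keeping column types
pieces <- lapply(split(seq_along(fileData), grp), function(idx) {
  setNames(fileData[idx], c("x", "y", "z"))
})

result <- do.call(rbind, pieces)
rownames(result) <- NULL
result
```

Because each piece stays a data frame, character and numeric columns are not forced into one type the way they are with matrix().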

Related

R: Sample n elements in certain columns in a dataframe/matrix and replace their values

I am struggling to solve the captioned problem.
My dataframe is like:
X1 X2 X3 X4 X5
1 1 2 3 4 5
2 6 7 8 9 10
3 11 12 13 14 15
What I am trying to do is randomly select 3 elements from the third and fourth columns and replace their values with 0. So the manipulated dataframe could look like
X1 X2 X3 X4 X5
1 1 2 3 4 5
2 6 7 0 0 10
3 11 12 13 0 15
I saw from here Random number selection from a data-frame that it could be easier if I convert the data frame into matrix, so I tried
mat <- data.frame(rbind(rep(1:5, 1), rep(6:10, 1), rep(11:15, 1)))
mat_matrix <- as.matrix(mat)
mat_matrix[sample(mat_matrix[, 3:4], 3)] <- 0
But it just randomly picked 3 elements across all columns and rows in the matrix and turned them into 0.
Can anyone help me out?
You can use slice.index and sample from that.
mat_matrix[sample(slice.index(mat_matrix, 1:2)[,3:4], 3)] <- 0
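The attempt in the question fails because sample(mat_matrix[, 3:4], 3) draws cell values, and those values are then used as positions in the whole matrix. One way to sample positions instead is sketched below with base R's col() rather than slice.index; idx is an illustrative name, and the seed is only there to make the sketch reproducible.

```r
set.seed(42)  # reproducibility of the sketch only

mat_matrix <- matrix(1:15, nrow = 3, byrow = TRUE)

# Linear positions of every cell that lies in column 3 or 4
idx <- which(col(mat_matrix) %in% 3:4)

# Sample three of those positions (not values) and overwrite them
mat_matrix[sample(idx, 3)] <- 0
mat_matrix
```

Since idx always has more than one element here, sample() treats it as a set of positions to draw from rather than as an upper bound.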
Nothing wrong with a for loop in this case. Perhaps like this:
mat <- data.frame(rbind(rep(1:5, 1), rep(6:10, 1), rep(11:15, 1)))
cols <- c(3,4)
n <- nrow(mat)*length(cols)
v <- sample( x=1:n, size=3 )
m <- matrix(FALSE, ncol=length(cols), nrow=nrow(mat))
m[v] <- TRUE
for( i in seq_along(cols) ) {
mat[ m[,i], cols[i] ] <- 0
}
Just create a two column "index matrix" that you sample on and use to replace back into your data.
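The "index matrix" idea can also be taken literally: R accepts a two-column matrix of (row, column) pairs as a subscript. The following is a rough sketch under that reading, not the loop above rewritten; cells and pick are illustrative names.

```r
set.seed(1)  # reproducibility of the sketch only

mat <- data.frame(rbind(1:5, 6:10, 11:15))
cols <- c(3, 4)

# Every (row, col) pair that is allowed to be zeroed
cells <- as.matrix(expand.grid(row = seq_len(nrow(mat)), col = cols))

# Pick three of those pairs at random
pick <- cells[sample(nrow(cells), 3), , drop = FALSE]

# Matrix indexing: each row of 'pick' addresses exactly one cell
m <- as.matrix(mat)
m[pick] <- 0
mat[] <- m      # write the values back, keeping 'mat' a data frame
mat
```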
Here is one way using replace, where df is your data frame:
cols <- c("X3", "X4")
N <- 3
df[cols] <- replace(as.matrix(df[cols]), sample(length(unlist(df[cols])), N), 0)
such that
> df
X1 X2 X3 X4 X5
1 1 2 3 0 5
2 6 7 8 0 10
3 11 12 0 14 15

Convert 3D array into list of dataframes

Basically, I want to group a 3D array by its columns, transform each group into a data frame, and bind to it a new column whose value is equal to the sum of all existing columns.
For example, consider the following 3D array
> (src <- array(1:8, c(2, 2, 2), dimnames=list(c('X1', 'X2'), c('Y1', 'Y2'), 1:2)))
, , 1
Y1 Y2
X1 1 3
X2 2 4
, , 2
Y1 Y2
X1 5 7
X2 6 8
I would like to convert it to
> (dest <- list(Y1=data.frame(X1=c(1, 5), X2=c(2, 6), Y1=c(1, 5)+c(2, 6)),
Y2=data.frame(X1=c(3, 7), X2=c(4, 8), Y2=c(3, 7)+c(4, 8))))
$Y1
X1 X2 Y1
1 1 2 3
2 5 6 11
$Y2
X1 X2 Y2
1 3 4 7
2 7 8 15
I know how to do the transformation for each individual column in the original array, but have no idea how to handle multiple columns simultaneously.
> library(dplyr)
> as.data.frame(t(src[, 'Y1', ])) %>% mutate(Y1=X1+X2)
X1 X2 Y1
1 1 2 3
2 5 6 11
Feel free to use base R, dplyr, data.table, or whatever package you prefer, as long as it's fast enough. In the real-world application, dim(src) tends to be something like c(hundreds, tens, tens of thousands).
We could first apply data.frame-transformation on margin 2 of the transposed array, where we transpose arrays with aperm(). Then we proceed similarly with the colSums. In order to get the right names "Y1", "Y2" we make an interim step listing the columns as data frames. Finally Map evaluates both lists (the X* and colsums of Y*) element by element.
dest <- Map(cbind, apply(aperm(src, c(3, 2, 1)), 2, data.frame),
{tmp <- data.frame(apply(src, 2, colSums));list(tmp[1], tmp[2])})
dest
# $Y1
# X1 X2 Y1
# 1 1 2 3
# 2 5 6 11
#
# $Y2
# X1 X2 Y2
# 1 3 4 7
# 2 7 8 15
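For comparison, the per-column transformation from the question can be wrapped in lapply over the column names, which some readers may find easier to follow than the Map call. This is only a sketch; it has not been benchmarked on arrays of the size mentioned in the question, and it assumes the third dimension has length greater than one, so that src[, nm, ] stays a matrix.

```r
src <- array(1:8, c(2, 2, 2),
             dimnames = list(c("X1", "X2"), c("Y1", "Y2"), 1:2))

# setNames(nm = ...) makes the result a list named Y1, Y2
dest <- lapply(setNames(nm = dimnames(src)[[2]]), function(nm) {
  # Slice one column of the array: rows X1/X2, columns = third dimension
  df <- as.data.frame(t(src[, nm, ]))
  # New column, named after the slice, holding the row-wise sum
  df[[nm]] <- rowSums(df)
  rownames(df) <- NULL
  df
})
dest
```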

Create a new variable from the minimum in R

The data contains four fields: id, x1, x2, and x3.
id <- c(1,2,3,4,5,6,7,8,9,10)
x1 <- c(2,4,5,3,6,4,3,6,7,7)
x2 <- c(0,1,2,6,7,6,0,8,2,2)
x3 <- c(5,3,4,5,8,3,4,2,5,6)
DF <- data.frame(id, x1,x2,x3)
Before I ask the question, let me create a new field (minX) which is the min of (x1,x2,x3)
DF$minX <- pmin(DF$x1, DF$x2, DF$x3)
I need to create a new field, y, that is defined as follows
if min(x1,x2,x3) = x1, then y = "x1"
if min(x1,x2,x3) = x2, then y = "x2"
if min(x1,x2,x3) = x3, then y = "x3"
Note: we assume no ties.
As a simple solution, do (naming the columns explicitly, since DF now also contains minX):
VARS <- c("x1", "x2", "x3")
y <- VARS[apply(DF[, VARS], MARGIN = 1, FUN = which.min)]
DF$y <- y
The function which.min returns the index of the minimum. If the minimum is not unique it returns the first one. Since you guarantee that there is no tie, this is not an issue here.
Finally, you should be familiar with apply, right? MARGIN = 1 means applying the function FUN row-wise, while MARGIN = 2 means applying FUN column-wise. It is a useful function for avoiding a for loop when dealing with a matrix. Since the selected columns of your data frame only contain numeric/integer values, they behave like a matrix, hence we can use apply.
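To make the MARGIN argument and the tie rule concrete, here is a tiny sketch on a two-row matrix built from the first two rows of the data; m, row_idx and col_idx are illustrative names.

```r
m <- matrix(c(2, 0, 5,
              4, 1, 3),
            nrow = 2, byrow = TRUE,
            dimnames = list(NULL, c("x1", "x2", "x3")))

# MARGIN = 1: one result per row (position of the row minimum)
row_idx <- apply(m, 1, which.min)

# MARGIN = 2: one result per column (position of the column minimum)
col_idx <- apply(m, 2, which.min)

# Translate row-wise positions into column names
colnames(m)[row_idx]

# With a tie, which.min reports the first position only
tie_idx <- which.min(c(1, 1, 3))
```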
Here is another option using pmin and max.col
library(data.table)
setDT(DF)[, c("minx", "y") := list(do.call(pmin, .SD),
names(.SD)[max.col(-1*.SD)]), .SDcols= x1:x3]
DF
# id x1 x2 x3 minx y
# 1: 1 2 0 5 0 x2
# 2: 2 4 1 3 1 x2
# 3: 3 5 2 4 2 x2
# 4: 4 3 6 5 3 x1
# 5: 5 6 7 8 6 x1
# 6: 6 4 6 3 3 x3
# 7: 7 3 0 4 0 x2
# 8: 8 6 8 2 2 x3
# 9: 9 7 2 5 2 x2
#10: 10 7 2 6 2 x2
A data.table solution:
# create variables
id <- c(1,2,3,4,5,6,7,8,9,10)
x1 <- c(2,4,5,3,6,4,3,6,7,7)
x2 <- c(0,1,2,6,7,6,0,8,2,2)
x3 <- c(5,3,4,5,8,3,4,2,5,6)
DF <- data.frame(id, x1,x2,x3)
# load package and set data table, calculating min
library(data.table)
setDT(DF)[, minx := apply(.SD, 1, min), .SDcols=c("x1", "x2", "x3")]
# Create variable with name of minimum
DF[, y := apply(.SD, 1, function(x) names(x)[which.min(x)]), .SDcols = c("x1", "x2", "x3")]
# call result
DF
## id x1 x2 x3 minx y
1: 1 2 0 5 0 x2
2: 2 4 1 3 1 x2
3: 3 5 2 4 2 x2
4: 4 3 6 5 3 x1
5: 5 6 7 8 6 x1
6: 6 4 6 3 3 x3
7: 7 3 0 4 0 x2
8: 8 6 8 2 2 x3
9: 9 7 2 5 2 x2
10: 10 7 2 6 2 x2
The last step can be called directly, without the need to calculate minx first.
Please note that data.table is particularly fast on large data sets.
######## EDIT TO ADD: DPLYR METHOD #########
For completeness, this would be a dplyr method to produce the same (final) result. This solution is credited to @eipi10, from a question I started out of this problem (see here):
DF %>% mutate(y = apply(.[,2:4], 1, function(x) names(x)[which.min(x)]))
This solution takes about the same time as the data.table one provided in the original answer when applied to a data frame with 1e6 rows (about 17 seconds on my Sony laptop).

The order returned from a vectorised function

I am sending two columns of a data frame to a vectorised function.
For each row of the data frame, the function will return 3 rows. So the total number of rows returned will be nrow(dataframe) * 3. The total columns returned will be equal to 2.
The trivial function below produces the correct set of numbers. But these numbers are returned in a peculiar order. I guess it would be possible to get the order of these numbers in the order I desire...using some combination of base functions. But, if possible, I want to write easy-to-understand code.
So my question is this:
Is there a better way of writing either the function (or call to the function) such that it will produce the desired result (which is commented out below) ?
fnVector <- function(fx, fy) {
x1 <- fx + 1
x2 <- fx + 2
x3 <- fx + 3
y1 <- fy + 1
y2 <- fy + 2
y3 <- fy + 3
vctx <- c(x1, x2, x3)
vcty <- c(y1, y2, y3)
#vct.pair <- c(vctx, vcty)
vct.series <- c(x1, y1, x2, y2, x3, y3)
return(vct.series)
}
vct.names <- c("a", "b")
vct.x <- c(10, 20)
vct.y <- c(100, 200)
df.data <- data.frame(name = vct.names, x = vct.x, y = vct.y)
aa <- fnVector(df.data$x, df.data$y)
# desired result [nrow(dataframe) * 3, 2] (i.e. 6 x 2 )
#11, 101 (i.e. row a)
#12, 102 (i.e. row a)
#13, 103 (i.e. row a)
#21, 201 (i.e. row b)
#22, 202 (i.e. row b)
#23, 203 (i.e. row b)
I think you want to interleave your vectors, i.e. the returned x is x1[1], x2[1], x3[1], x1[2], x2[2], x3[2], ...
so you could:
vctx <- c(rbind(x1, x2, x3)) # interleaves x1, x2, x3
vcty <- c(rbind(y1, y2, y3)) # interleaves y1, y2, y3
Then return a matrix, not a vector:
return(cbind(vctx, vcty))
Giving you
fnVector <- function(fx, fy) {
x1 <- fx + 1
x2 <- fx + 2
x3 <- fx + 3
y1 <- fy + 1
y2 <- fy + 2
y3 <- fy + 3
vctx <- c(rbind(x1, x2, x3)) # interleaves x1, x2, x3
vcty <- c(rbind(y1, y2, y3)) # interleaves y1, y2, y3
return(cbind(vctx, vcty))
}
fnVector(df.data$x, df.data$y)
# vctx vcty
# [1,] 11 101
# [2,] 12 102
# [3,] 13 103
# [4,] 21 201
# [5,] 22 202
# [6,] 23 203
You may want to think about also retaining the name column.
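One possible way to keep the name column is to repeat each name once per generated row with rep(..., each = 3). The sketch below inlines the + 1, + 2, + 3 offsets instead of calling fnVector, so it stands on its own; stacked and out are illustrative names.

```r
df.data <- data.frame(name = c("a", "b"), x = c(10, 20), y = c(100, 200))

# rbind() stacks the shifted vectors as rows; c() then reads the matrix
# column by column, which is what interleaves the values per input row
stacked <- rbind(df.data$x + 1, df.data$x + 2, df.data$x + 3)
vctx <- c(stacked)
vcty <- c(rbind(df.data$y + 1, df.data$y + 2, df.data$y + 3))

# Each input row produced three output rows, so repeat its name three times
out <- data.frame(name = rep(df.data$name, each = 3), vctx, vcty)
out
```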
I don't know if this is adaptable to your specific application or not (I understand you have simplified your fnVector for the purposes of this question), but you might want to investigate plyr:
library(plyr)
ddply(df.data, .(name), summarize,
vctx = x + 1:3,
vcty = y + 1:3)
# name vctx vcty
# 1 a 11 101
# 2 a 12 102
# 3 a 13 103
# 4 b 21 201
# 5 b 22 202
# 6 b 23 203
The ddply(df.data, .(name), part says "for each unique value in df.data$name"; summarize says "call the summarize function"; and the two named arguments vctx = ... and vcty = ... each create the 3 output rows for their column (for us, x + 1:3 and y + 1:3, but for your application, probably something more complex).
I think your function can be greatly simplified, and I also think it makes the most sense to use the custom function along with one of the apply functions. Try this code:
fnVector <- function(x) {
y <- rbind(x+1, x+2, x+3)
return(y)
}
df.output <- data.frame(apply(df.data[, c("x", "y")], 2, function(x) fnVector(x)))
> df.output
x y
1 11 101
2 12 102
3 13 103
4 21 201
5 22 202
6 23 203

Applying gsub to various columns

What is the most efficient way to apply gsub to various columns?
The following does not work
x1=c("10%","20%","30%")
x2=c("60%","50%","40%")
x3 = c(1,2,3)
x = data.frame(x1,x2,x3)
per_col = c(1,2)
x = gsub("%","",x[,per_col])
How can I most efficiently drop the "%" sign in specified columns?
Can I apply it to the whole dataframe? This would be useful in the case where I don't know where the percentage columns are.
You can use apply to apply it to the whole data.frame
apply(x, 2, function(y) as.numeric(gsub("%", "", y)))
x1 x2 x3
[1,] 10 60 1
[2,] 20 50 2
[3,] 30 40 3
Or, you could try the lapply solution:
as.data.frame(lapply(x, function(y) gsub("%", "", y)))
x1 x2 x3
1 10 60 1
2 20 50 2
3 30 40 3
To clean the % out you can do:
x[per_col] <- lapply(x[per_col], function(y) as.numeric(gsub("%", "", y)))
x
x1 x2 x3
1 10 60 1
2 20 50 2
3 30 40 3
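For the case where the percentage columns are not known in advance, one option is to detect them before converting. This is a sketch under the assumption that a percentage column is a character (or factor) column in which every value ends in "%"; is_pct is an illustrative name.

```r
x <- data.frame(x1 = c("10%", "20%", "30%"),
                x2 = c("60%", "50%", "40%"),
                x3 = c(1, 2, 3),
                stringsAsFactors = FALSE)

# TRUE for text columns whose every entry ends in "%"
is_pct <- vapply(x, function(col) {
  (is.character(col) || is.factor(col)) && all(grepl("%$", as.character(col)))
}, logical(1))

# Strip the sign and convert only those columns; others are left untouched
x[is_pct] <- lapply(x[is_pct], function(col) {
  as.numeric(sub("%", "", as.character(col), fixed = TRUE))
})
str(x)
```

Columns that mix percentages with other text fail the all() check and are left as they are, which avoids silently introducing NAs.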
The first answer works, but be careful if your data.frame contains string columns: @docendo discimus's answer will return NAs for them.
If you want to keep the content of your columns as strings, just remove the as.numeric and convert the result into a data frame afterwards:
as.data.frame(apply(x, 2, function(y) gsub("%", "", y)))
x1 x2 x3
1 10 60 1
2 20 50 2
3 30 40 3
We can unlist per_col columns, remove "%" symbol and convert it into numeric.
x[per_col] <- as.numeric(gsub("%","", unlist(x[per_col])))
#In this case using sub would be enough too as we have only 1 % symbol to replace
#x[per_col] <- as.numeric(sub("%","", unlist(x[per_col])))
x
# x1 x2 x3
#1 10 60 1
#2 20 50 2
#3 30 40 3
To add on docendo discimus' answer, an extension with non-adjacent columns and returning a data.frame:
x1 <- c("10%", "20%", "30%")
x2 <- c("60%", "50%", "40%")
x3 <- c(1, 2, 3)
x4 <- c("60%", "50%", "40%")
x <- data.frame(x1, x2, x3, x4)
x[, c(1:2, 4)] <- as.data.frame(apply(x[,c(1:2, 4)], 2,
function(x) {
as.numeric(gsub("%", "", x))}
))
> x
x1 x2 x3 x4
1 10 60 1 60
2 20 50 2 50
3 30 40 3 40
> class(x)
[1] "data.frame"
