I am looking for a more versatile way to get from a data.frame to a multidimensional array.
I would like to be able to create as many dimensions as needed from as many variables in the data frame as desired.
Currently, the method has to be tailored to each data.frame and requires subsetting to form each vector.
I would love something along the lines of the melt/cast methods in reshape2.
data <- data.frame(coord.name = rep(1:10, 2),
                   x = rnorm(20),
                   y = rnorm(20),
                   ID = rep(c("A", "B"), each = 10))
data.array <- array(dim = c(10, 2, length(unique(data$ID))))
for (i in 1:length(unique(data$ID))) {
  data.array[, 1, i] <- data[data$ID == unique(data$ID)[i], "x"]
  data.array[, 2, i] <- data[data$ID == unique(data$ID)[i], "y"]
}
data.array
, , 1
[,1] [,2]
[1,] 1 1
[2,] 3 3
[3,] 5 5
[4,] 7 7
[5,] 9 9
[6,] 1 1
[7,] 3 3
[8,] 5 5
[9,] 7 7
[10,] 9 9
, , 2
[,1] [,2]
[1,] 2 2
[2,] 4 4
[3,] 6 6
[4,] 8 8
[5,] 10 10
[6,] 2 2
[7,] 4 4
[8,] 6 6
[9,] 8 8
[10,] 10 10
You may have had trouble applying the reshape2 functions for a somewhat subtle reason. The difficulty was that your data.frame has no column that can be used to direct how you want to arrange the elements along the first dimension of an output array.
Below, I explicitly add such a column, calling it "row". With it in place, you can use the expressive acast() or dcast() functions to reshape the data in any way you choose.
library(reshape2)
# Use this or some other method to add a column of row indices.
data$row <- with(data, ave(ID==ID, ID, FUN = cumsum))
m <- melt(data, id.vars = c("row", "ID"))
a <- acast(m, row ~ variable ~ ID)
a[1:3, , ]
# , , A
#
# x y
# 1 1 1
# 2 3 3
# 3 5 5
#
# , , B
#
# x y
# 1 2 2
# 2 4 4
# 3 6 6
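If a wide data frame is preferred over a 3-D array, the dcast() function mentioned above works with the same molten data; a quick sketch (column names follow reshape2's variable_ID convention, something like x_A, x_B, y_A, y_B):
wide <- dcast(m, row ~ variable + ID)
head(wide, 3)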
I think this is right:
array(unlist(lapply(split(data, data$ID), function(x) as.matrix(x[ , c("x", "y")]))), c(10, 2, 2))
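A closely related variant (my suggestion, not part of the answer above) that avoids hard-coding the output dimensions and keeps the ID values as names on the third dimension is simplify2array():
simplify2array(lapply(split(data, data$ID),
                      function(d) as.matrix(d[, c("x", "y")])))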
Related
tl;dr What is the idiomatic way to identify groups of identical rows in a matrix in R?
Given an n-by-2 matrix where some rows occur more than once,
> mat <- matrix(c(2,5,5,3,4,6,2,5,4,6,4,6), ncol=2, byrow=T)
> mat
[,1] [,2]
[1,] 2 5
[2,] 5 3
[3,] 4 6
[4,] 2 5
[5,] 4 6
[6,] 4 6
I am looking to get the groups of row indices of identical rows. In the example above, rows (1,4) are identical, and so are rows (3,5,6). Finally, there is row (2). I am looking to get these groups, represented in whatever way is idiomatic in R.
The output could be something like this,
> groups <- matrix(c(1,1, 2,2, 3,3, 4,1, 5,3, 6,3), ncol=2, byrow=T)
> groups
[,1] [,2]
[1,] 1 1
[2,] 2 2
[3,] 3 3
[4,] 4 1
[5,] 5 3
[6,] 6 3
where the first column contains the row indices of mat and the second the group index for each row index. Or it could be like this:
> split(groups[,1], groups[,2])
$`1`
[1] 1 4
$`2`
[1] 2
$`3`
[1] 3 5 6
Either will do. I am not sure what is the best way to represent groups in R, and advice on this is also welcome.
For benchmarking purposes, here's a larger dataset:
set.seed(123)
n <- 10000000
mat <- matrix(sample.int(10, 2*n, replace = T), ncol=2)
cbind() a sequence of row indices with the match() between the pasted rows and the unique pasted rows:
v1 <- paste(mat[,1], mat[,2])
# or if there are more columns
#v1 <- do.call(paste, as.data.frame(mat))
out <- cbind(seq_len(nrow(mat)), match(v1, unique(v1)))
Output:
> out
[,1] [,2]
[1,] 1 1
[2,] 2 2
[3,] 3 3
[4,] 4 1
[5,] 5 3
[6,] 6 3
If we want a list output
split(out[,1], out[,2])
Output:
$`1`
[1] 1 4
$`2`
[1] 2
$`3`
[1] 3 5 6
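If only the groups that actually contain duplicates are of interest, a small follow-up sketch (my addition, reusing the out object from above):
grps <- split(out[, 1], out[, 2])
grps[lengths(grps) > 1]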
Benchmarks
With the OP's big data
> system.time({
+ v1 <- paste(mat[,1], mat[,2])
+
+ out <- cbind(seq_len(nrow(mat)), match(v1, unique(v1)))
+
+ })
user system elapsed
2.603 0.130 2.706
Does anyone know another method for filtering data when the same ID (column x) appears twice in a data frame, but with a different associated value (column y)?
Basically, I want to know which rows are in both data frames, and then which row is not in both data frames (actually, I want the x and y values of that particular row).
Thank you in advance for your help!
> x <- seq(1:10)
> x[5] <- 4
> y <- (seq.int(1,19,2))
>
> x<- cbind(x,y)
> x
x y
[1,] 1 1
[2,] 2 3
[3,] 3 5
[4,] 4 7
[5,] 4 9
[6,] 6 11
[7,] 7 13
[8,] 8 15
[9,] 9 17
[10,] 10 19
>
> z <- x[1:4,]
> y <- x[6:10,]
>
> z <- rbind(z,y)
> z
x y
[1,] 1 1
[2,] 2 3
[3,] 3 5
[4,] 4 7
[5,] 6 11
[6,] 7 13
[7,] 8 15
[8,] 9 17
[9,] 10 19
>
> df1 <- z[z[,1] %in% x[,1]]
>
> matrix(df1,9,2) # As expected I'm getting 9 rows
[,1] [,2]
[1,] 1 1
[2,] 2 3
[3,] 3 5
[4,] 4 7
[5,] 6 11
[6,] 7 13
[7,] 8 15
[8,] 9 17
[9,] 10 19
>
> # Now I want to know what is the value inside the missing row
> df2 <- z[!z[,1] %in% x[,1]]
>
> matrix(df2,1,2) # I'm getting NA and NA, but I was expecting the values 4 and 9
[,1] [,2]
[1,] NA NA
To use @hansjaneinvielleicht's method:
xlist <- paste(x[,1], x[,2])
zlist <- paste(z[,1], z[,2])
setdiff(xlist, zlist)
# [1] "4 9"
What you're doing here is filtering for values that are not present in x[,1]. However, since 4 is in there, it's also filtered out.
Instead, I assume you'd probably want to work with the setdiff method from dplyr.
Then use df2 <- setdiff(x, z)
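A minimal sketch of that, assuming x and z are first converted to data frames (dplyr's set operations expect data frames rather than plain matrices):
library(dplyr)
setdiff(as.data.frame(x), as.data.frame(z))
#   x y
# 1 4 9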
I am using a cumulative count here to add another key that distinguishes the duplicated values in x[,1]:
# Within-group cumulative counts give each duplicate a distinct key
v <- ave(x[,1] == x[,1], x[,1], FUN = cumsum)
t <- ave(z[,1] == z[,1], z[,1], FUN = cumsum)
# Keep the elements of x whose (value, count) key is absent from z;
# the logical index is recycled across both columns of the matrix
df2 <- x[!paste(x[,1], v) %in% paste(z[,1], t)]
matrix(df2, 1, 2)
[,1] [,2]
[1,] 4 9
x <- data.frame(x)
z <- data.frame(z)
x$from <- "x"
z$from <- "z"
df2 <- merge(x, z, by = c("x", "y"), all.x = T)
df2
# x y from.x from.y
# 1 1 1 x z
# 2 2 3 x z
# 3 3 5 x z
# 4 4 7 x z
# 5 4 9 x <NA>
# 6 6 11 x z
# 7 7 13 x z
# 8 8 15 x z
# 9 9 17 x z
# 10 10 19 x z
df2 <- df2[is.na(df2$from.y),]
df2
# x y from.x from.y
# 5 4 9 x <NA>
My real problem was not the one posted, since it was too complicated.
Basically, I was not able to apply any of the solutions to my real problem, because my real data frames contained all data types and had a lot of columns.
But I found a solution that works both for my real problem and for the problem posted in the question, so I am posting the answer that solved my real problem in case it is useful for someone!
> dup <- which(duplicated(x[,1]) == TRUE)
> ans <- matrix(x[dup,],1,2)
> ans
[,1] [,2]
[1,] 4 9
> # I'm doing this in case the answer was not NA in df2 at the previous step,
> # without providing the "missing" row
> df2 <- rbind(df2, ans)
> df2
[,1] [,2]
[1,] 4 9
How do I extract every two elements in sequence from a matrix and return the result as a matrix, so that I can feed the answer into a formula for calculation?
For example, I have a one-row matrix with 6 columns:
     [,1] [,2] [,3] [,4] [,5] [,6]
[1,]    2    1    5    5   10    1
I want to extract columns 1 and 2 in the first iteration, columns 3 and 4 in the second, and so on. The result has to be in the form of a matrix:
[1,] 2 1
[2,] 5 5
[3,] 10 1
My original code:
data <- matrix(c(1,1,1,2,2,1,2,2,5,5,5,6,10,1,10,2,11,1,11,2), ncol = 2)
Centers matrix:
      [,1] [,2] [,3] [,4] [,5] [,6]
 [1,]    2    1    5    5   10    1
 [2,]    1    1    2    1   10    1
 [3,]    5    5    5    6   11    2
 [4,]    2    2    5    5   10    1
 [5,]    2    1    5    6    5    5
 [6,]    2    2    5    5   11    1
 [7,]    2    1    5    5   10    1
 [8,]    1    1    5    6   11    1
 [9,]    2    1    5    5   10    1
[10,]    5    6   11    1   10    2
# Euclidean distance from each row of `data` to each row of `centers`
objCentroidDist <- function(data, centers) {
  resultMatrix <- matrix(NA, nrow = dim(data)[1], ncol = dim(centers)[1])
  for (i in 1:nrow(centers)) {
    resultMatrix[, i] <- sqrt(rowSums(t(t(data) - centers[i, ])^2))
  }
  resultMatrix
}
objCentroidDist(data,centers)
I want the result matrix to be as per below:
      [,1] [,2] [,3]
 [1,]
 [2,]
 [3,]
 [4,]
 [5,]
 [6,]
 [7,]
 [8,]
 [9,]
[10,]
My concern is how to calculate the data-to-centers distance when the data matrix has two columns and the centers matrix has six, i.e. how to calculate the distance between the data matrix and every two columns of the centers matrix. Each row of the centers matrix holds three centers.
Something like this maybe?
m <- matrix(c(2,1,5,5,10,1), ncol = 6)
list.seq.pairs <- lapply(seq(1, ncol(m), 2), function(x) {
m[,c(x, x+1)]
})
> list.seq.pairs
[[1]]
[1] 2 1
[[2]]
[1] 5 5
[[3]]
[1] 10 1
And, in case you're wanting to iterate over multiple rows in a matrix,
you can expand on the above like this:
mm <- matrix(1:18, ncol = 6, byrow = TRUE)
apply(mm, 1, function(x) {
lapply(seq(1, length(x), 2), function(y) {
x[c(y, y+1)]
})
})
EDIT:
I'm really not sure what you're after exactly. I think, if you want each row transformed into a 3 x 2 matrix:
mm <- matrix(1:18, ncol = 6, byrow = TRUE)
list.mats <- lapply(1:nrow(mm), function(x){
a = matrix(mm[x,], ncol = 2, byrow = TRUE)
})
> list.mats
[[1]]
[,1] [,2]
[1,] 1 2
[2,] 3 4
[3,] 5 6
[[2]]
[,1] [,2]
[1,] 7 8
[2,] 9 10
[3,] 11 12
[[3]]
[,1] [,2]
[1,] 13 14
[2,] 15 16
[3,] 17 18
If, however, you want to get to your results matrix, I think it's probably easiest to do whatever calculations you need while you're dealing with each row:
results <- t(apply(mm, 1, function(x) {
sapply(seq(1, length(x), 2), function(y) {
val1 = x[y] # Get item one
val2 = x[y+1] # Get item two
val1 / val2 # Do your calculation here
})
}))
> results
[,1] [,2] [,3]
[1,] 0.5000000 0.7500 0.8333333
[2,] 0.8750000 0.9000 0.9166667
[3,] 0.9285714 0.9375 0.9444444
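If the calculation you actually need is the Euclidean distance from each data point to each (x, y) pair in the corresponding row of the centres matrix, here is a hedged sketch under that assumption (the matrices pts and ctrs and the row-wise pairing rule are guesses on my part, not something stated in the question):
# Assumption: `pts` is n x 2 and `ctrs` is n x 6, each row of `ctrs` holding
# three (x, y) centres paired with the same row of `pts`.
pts  <- matrix(c(2, 1,  1, 1,  5, 5,  2, 2,  2, 1), ncol = 2, byrow = TRUE)
ctrs <- matrix(rep(c(2, 1, 5, 5, 10, 1), each = nrow(pts)), nrow = nrow(pts))
dist.to.pairs <- sapply(seq(1, ncol(ctrs), by = 2), function(j) {
  sqrt((pts[, 1] - ctrs[, j])^2 + (pts[, 2] - ctrs[, j + 1])^2)
})
dist.to.pairs  # n x 3: one column per centre pair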
That said, I don't understand what you're trying to do so this may miss the mark. You may have more luck if you ask a new question where you show example input and the actual expected output that you're after, with the actual values you expect.
I have a matrix
mat_a <- matrix(data = c( c(rep(1,3), rep(2,3), rep(3,3))
, rep(seq(1,300,100), 3)
, runif(15, 0, 1))
, ncol=3)
[,1] [,2] [,3]
[1,] 1 1 0.8393401
[2,] 1 101 0.5486805
[3,] 1 201 0.4449259
[4,] 2 1 0.3949137
[5,] 2 101 0.4002575
[6,] 2 201 0.3288861
[7,] 3 1 0.7865035
[8,] 3 101 0.2581155
[9,] 3 201 0.8987769
that I compare to another, larger matrix:
mat_b <- matrix(data = c(
c(rep(1,3), rep(2,3), rep(3,3), rep(4,3))
, rep(seq(1,300,100), 4)
, rep(3:5, 4))
, ncol = 3)
[,1] [,2] [,3]
[1,] 1 1 3
[2,] 1 101 4
[3,] 1 201 5
[4,] 2 1 3
[5,] 2 101 4
[6,] 2 201 5
[7,] 3 1 3
[8,] 3 101 4
[9,] 3 201 5
[10,] 4 1 3
[11,] 4 101 4
[12,] 4 201 5
I need to extract the rows of mat_a where column 2 of both matrices matches. For those matches, column 1 also has to match. Also, column 3 of mat_b must be greater than or equal to 4.
I cannot find any solution based on vectorization; I only came up with a loop-based solution.
output <- NULL
for (i in 1:nrow(mat_a)) {
  if (mat_a[i, 2] %in% mat_b[, 2][mat_b[, 3] >= 4]) {
    rows <- which(mat_b[, 2] %in% mat_a[i, 2])
    row  <- which(mat_b[, 1][rows] == mat_a[i, 1])
    if (mat_b[, 3][rows[row]] >= 4) {
      output <- rbind(output, mat_a[i, ])
    }
  }
}
This works but is extremely slow; it took just under an hour to run. mat_a has 9 columns and 40,000 rows (could go higher); mat_b has 5 columns and around 1.2 million rows.
Any ideas?
It is better to work with data frames when comparing tables as you are; that uses R's structures to their strengths instead of working against them. We use a simple merge to match the correct values, then subset b with the necessary condition, b$V3 >= 4. At the end, [-4] makes the output more closely match your desired output:
a <- as.data.frame(mat_a)
b <- as.data.frame(mat_b)
merge(a,b[b$V3 >= 4,], by=c("V1","V2"))[-4]
# V1 V2 V3.x
# 1 1 101 0.1118960
# 2 1 201 0.1543351
# 3 2 101 0.3950491
# 4 2 201 0.5688684
# 5 3 201 0.4749941
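For inputs of this size, a keyed data.table join may also be worth trying; a rough sketch (the V1/V2/V3 column names come from converting the unnamed matrices, and any speed-up is an assumption rather than a measured result):
library(data.table)
A <- as.data.table(mat_a)
B <- as.data.table(mat_b)
# Inner join: keep rows of A whose (V1, V2) also appear in B with V3 >= 4
A[B[V3 >= 4], on = .(V1, V2), nomatch = 0L]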
I am playing with RgoogleMaps and have been able to plot lines on a map. I loaded my lats and longs from a CSV into a coords object.
I wanted to imply direction using: PlotArrowsOnStaticMap
defined as:
PlotArrowsOnStaticMap(MyMap, lat0, lon0, lat1 = lat0, lon1 = lon0, TrueProj = TRUE, FUN = arrows, add = FALSE, verbose = 1,...)
I define lat0 as something like coords[,'lat'].
How do I give lat1? The value is the next row in the file, but how do I describe that relatively (coords[+1,'lat'] in pseudocode)?
Is there some basic reading I should be doing?
An inelegant workaround is to create new columns for your lat and lon that are shifted up by one row compared to the originals. The value of the first row is wrapped to the bottom (or replaced by NA if this does not make sense).
coords$lat1 <- coords$lat[c(2:length(coords$lat), 1)]
coords$lon1 <- coords$lon[c(2:length(coords$lon), 1)]
You now have two columns for lat (lat and lat1) and two columns for lon (lon and lon1).
with(coords, PlotArrowsOnStaticMap(MyMap, lat0 = lat, lon0 = lon, lat1 = lat1, lon1 = lon1, ...))
Some of the functions commonly used to do this include head, tail, and embed:
> tmp <- 1:10
> cbind( head(tmp,-1), tail(tmp,-1) )
[,1] [,2]
[1,] 1 2
[2,] 2 3
[3,] 3 4
[4,] 4 5
[5,] 5 6
[6,] 6 7
[7,] 7 8
[8,] 8 9
[9,] 9 10
> embed(tmp, 2)
[,1] [,2]
[1,] 2 1
[2,] 3 2
[3,] 4 3
[4,] 5 4
[5,] 6 5
[6,] 7 6
[7,] 8 7
[8,] 9 8
[9,] 10 9
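Applied to the coords example from the question (column names lat and lon assumed), the head/tail idiom pairs each point with the next one and avoids the wrap-around row:
with(coords, PlotArrowsOnStaticMap(MyMap,
                                   lat0 = head(lat, -1), lon0 = head(lon, -1),
                                   lat1 = tail(lat, -1), lon1 = tail(lon, -1)))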