splitting integers and converting into matrix - r

I was wondering whether it is possible to split each integer in a set of numbers into its digits and turn the result into a transition matrix, e.g.
data<-c(11,123,142,1423,1234,12)
What I would like to do is split each integer in the data into its digits (considering only the first two elements of the dataset here): the first element becomes 1,1 and the second becomes 1,2,3. Each pair of consecutive digits is then a transition, e.g. 1,1 is 1 to 1, 1,2 is 1 to 2, and 2,3 is 2 to 3, which generates the following matrix:
  1 2 3 4 5
1 1 1 0 0 0
2 0 0 1 0 0
3 0 0 0 0 0
4 0 0 0 0 0
5 0 0 0 0 0
My matrix will never go past 5x5. Below is what I have done, which works but is really tedious.
data2<-as.matrix(as.character(data))
for(i in 1:nrow(data2)) {
values<-strsplit(data2,"")
}
values2<-t(sapply(values, '[', 1:max(sapply(values, length))))
values2[is.na(values2)]<-0
values3<-apply(values2,2,as.numeric)
from1to1<-0
from1to2<-0
from1to3<-0
from1to4<-0
from1to5<-0
from2to1<-0
from2to2<-0
from2to3<-0
from2to4<-0
...
from5to4<-0
from5to5<-0
for(i in 1:nrow(values3)){
for(j in 1:(ncol(values3)-1))
if (((values3[i,j]==1)&(values3[i,j+1]==1))){
from1to1<-from1to1 + 1
}else{
if (((values3[i,j]==1)&(values3[i,j+1]==2))){
from1to2<-from1to2 + 1
}else{
if (((values3[i,j]==1)&(values3[i,j+1]==3))){
from1to3<-from1to3 + 1
}else{
if (((values3[i,j]==1)&(values3[i,j+1]==4))){
from1to4<-from1to4 + 1
}else{
if (((values3[i,j]==1)&(values3[i,j+1]==5))){
from1to5<-from1to5 + 1
}else{
if (((values3[i,j]==1)&(values3[i,j+1]==1))){
from1to1<-from1to1 + 1
}else{.....continues through all other from2to1...from5to5
I then place every single number into a 5x5 matrix.
This is obviously tedious, long, and ridiculous. Is there any way to shorten this? Any suggestions are appreciated.

Here's an option, presented here piped so as to be easy to follow:
library(magrittr) # for the pipe
# initialize a matrix of zeros
mat <- matrix(0, 5, 5)
# split each element into individual digits
strsplit(as.character(data), '') %>%
# turn list elements back to integers
lapply(as.integer) %>%
# make a 2 column matrix of each digit paired with the previous digit
lapply(function(x){matrix(c(x[-length(x)], x[-1]), ncol = 2)}) %>%
# reduce list to a single 2-column matrix
do.call(rbind, .) %>%
# for each row, add 1 to the element of mat they subset
apply(1, function(x){mat[x[1], x[2]] <<- mat[x[1], x[2]] + 1; x})
# output is the transpose of the matrix; the real results are stored in mat
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12]
## [1,] 1 1 2 1 4 1 4 2 1 2 3 1
## [2,] 1 2 3 4 2 4 2 3 2 3 4 2
mat
## [,1] [,2] [,3] [,4] [,5]
## [1,] 1 3 0 2 0
## [2,] 0 0 3 0 0
## [3,] 0 0 0 1 0
## [4,] 0 2 0 0 0
## [5,] 0 0 0 0 0
Alternately, if you'd like xtabs as suggested by alexis_laz, replace the last line with xtabs(formula = ~ .[,1] + .[,2]) instead of using mat.
You might also check out the permutations package, which from what I can tell seems to be for working with this kind of data, though it's somewhat high-level.
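For comparison, the same counts can be sketched in base R without the pipe or the `<<-` side effect. The helper name `digit_transitions` and its `n_states` argument are made up for this illustration; fixing the factor levels at 1 to 5 is what keeps the result 5x5 even when some digits never occur.

```r
# Hypothetical helper: count digit-to-digit transitions in a vector of integers
digit_transitions <- function(x, n_states = 5) {
  # split every number into its digits
  digits <- lapply(strsplit(as.character(x), ""), as.integer)
  # "from" drops the last digit of each number, "to" drops the first
  from <- unlist(lapply(digits, function(d) d[-length(d)]))
  to   <- unlist(lapply(digits, function(d) d[-1]))
  # fixed factor levels keep the table n_states x n_states;
  # digits outside 1..n_states become NA and are not counted
  table(from = factor(from, levels = seq_len(n_states)),
        to   = factor(to,   levels = seq_len(n_states)))
}

data <- c(11, 123, 142, 1423, 1234, 12)
trans <- digit_transitions(data)
trans
```

Calling `digit_transitions(data[1:2])` restricts the count to the first two elements, as in the question.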

Related

optimize network for three connections each in r

I have a list of locations and their weights (calculated distances apart) in a matrix.
I would like the optimal solution for each location having 3 connections, minimizing total distance.
library(lpSolve) # provides lp.transport
costs6 <- matrix(c(0,399671,1525211,990914,1689886,1536081,399671,0,1802419,1128519,1964930,1603803,1525211,1802419,0,814942,164677,943489,990914,1128519.4,814942.7,0,953202,565712,1689886,1964930,164677,953202,0, 1004916,1536081,1603803,943489,565712,1004916,0),ncol=6,byrow=TRUE)
plantcap <- rep(3,6)
citydemand <- rep(3,6)
plant.signs <- rep("=",6)
city.signs <- rep("=",6)
lptrans <- lp.transport(costs6,"min",plant.signs,plantcap,city.signs,citydemand)
lptrans$solution
lptrans
This LP solver returns
[,1] [,2] [,3] [,4] [,5] [,6]
[1,] 3 0 0 0 0 0
[2,] 0 3 0 0 0 0
[3,] 0 0 3 0 0 0
[4,] 0 0 0 3 0 0
[5,] 0 0 0 0 3 0
[6,] 0 0 0 0 0 3
I am wondering if there is a way to max out any Xij at 1, so that the solver will give me three ones in each column/row, rather than one 3 in each column/row? If not, is there another solver I can use to find the solution?
Something like this, setting it up as an LP problem (assuming a symmetric solution matrix)?
library(lpSolve)
costs6 <- matrix(c(0,399671,1525211,990914,1689886,1536081,
399671,0,1802419,1128519,1964930,1603803,
1525211,1802419,0,814942,164677,943489,
990914,1128519.4,814942.7,0,953202,565712,
1689886,1964930,164677,953202,0, 1004916,
1536081,1603803,943489,565712,1004916,0),ncol=6,byrow=TRUE)
nLoc <- nrow(costs6)
nParams <- sum(1:(nLoc - 1L))
# set up the constraint matrix
# columns are parameters corresponding to the lower triangular of costs6 (by column)
# the first six constraints are for the row/column sums
# the last 15 constraints are for the maximum number of times each path can be used (1)
nConst <- sum(1:nLoc)
mConst <- matrix(0L, nConst, nParams)
mConst[matrix(c(c(combn(1:nLoc, 2)), rep(1:nParams, each = 2)), ncol = 2)] <- 1L
mConst[(nLoc + 1L):nConst,] <- diag(nParams)
lpSol <- lp(
direction = "min",
objective.in = unlist(costs6[lower.tri(costs6)]),
const.mat = mConst,
const.dir = c(rep("=", nLoc), rep("<=", nParams)),
const.rhs = c(rep(3L, nLoc), rep(1L, nParams)),
all.int = TRUE
)
lpSol
#> Success: the objective function is 8688039
# convert the solution to a transport matrix
mSol <- matrix(0, nLoc, nLoc)
mSol[lower.tri(mSol)] <- lpSol$solution
mSol[upper.tri(mSol)] <- t(mSol)[upper.tri(mSol)]
mSol
#> [,1] [,2] [,3] [,4] [,5] [,6]
#> [1,] 0 1 1 1 0 0
#> [2,] 1 0 0 1 1 0
#> [3,] 1 0 0 0 1 1
#> [4,] 1 1 0 0 0 1
#> [5,] 0 1 1 0 0 1
#> [6,] 0 0 1 1 1 0
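The matrix-indexing line that fills the first rows of mConst is dense, so here is a small sketch of what it builds, using 4 locations instead of 6. The names `pairs`, `incidence` and `lower` are introduced only for this illustration and are not part of the answer above.

```r
# Small illustration with 4 locations (demo value, not the 6 from the question)
nLoc <- 4
pairs <- combn(1:nLoc, 2)          # 2 x 6: one column per unordered pair of locations
nParams <- ncol(pairs)             # same value as sum(1:(nLoc - 1))

# Row = location, column = path; an entry is 1 when the path touches the location
incidence <- matrix(0L, nLoc, nParams)
incidence[cbind(c(pairs), rep(seq_len(nParams), each = 2))] <- 1L

rowSums(incidence)   # each location lies on nLoc - 1 candidate paths
colSums(incidence)   # each path joins exactly 2 locations

# combn() lists the pairs in the same order as the lower triangle read by column,
# which is why costs6[lower.tri(costs6)] lines up with the parameters
lower <- which(lower.tri(matrix(0, nLoc, nLoc)), arr.ind = TRUE)
all(lower[, "col"] == pairs[1, ], lower[, "row"] == pairs[2, ])
```

Setting each row of this block equal to 3 is what forces three connections per location in the LP.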

Generate all possible binary vectors of length n in R

I'm looking to generate all possible binary vectors of length n in R. What is the best way (preferably both computationally efficient and readable code) to do this?
n = 3
expand.grid(replicate(n, 0:1, simplify = FALSE))
# Var1 Var2 Var3
#1 0 0 0
#2 1 0 0
#3 0 1 0
#4 1 1 0
#5 0 0 1
#6 1 0 1
#7 0 1 1
#8 1 1 1
Inspired by this question generating all possible binary vectors of length n containing less than m 1s, I've extended this code to produce all possible combinations. It's not pretty, though.
> n <- 3
> z <- rep(0, n)
> do.call(rbind, lapply(0:n, function(i) t(apply(combn(1:n,i), 2, function(k) {z[k]=1;z}))))
[,1] [,2] [,3]
[1,] 0 0 0
[2,] 1 0 0
[3,] 0 1 0
[4,] 0 0 1
[5,] 1 1 0
[6,] 1 0 1
[7,] 0 1 1
[8,] 1 1 1
What is it doing? Once we strip it back, the core of this one-liner is the following:
apply(combn(1:n,i), 2, function(k) {z[k]=1;z})
To understand this, let's step back one level further. The function combn(x,m) generates all possible combinations of x taken m at a time.
> combn(1:n, 1)
[,1] [,2] [,3]
[1,] 1 2 3
> combn(1:n, 2)
[,1] [,2] [,3]
[1,] 1 1 2
[2,] 2 3 3
> combn(1:n, 3)
[,1]
[1,] 1
[2,] 2
[3,] 3
Using apply(MARGIN=2), we pass one column of this matrix at a time to our inner function function(k) {z[k]=1;z}, which simply replaces the values at the indices k with 1. Since z was originally all 0, this gives us each possible binary vector.
The rest is just window dressing. combn gives us a wide, short matrix; we transpose it with t. lapply returns a list; we bind the matrices in each element of the list together with do.call(rbind, .).
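To make the inner step concrete, it can be run on its own for a single value of i. This is only a sketch of that one step with i fixed at 2; `idx` and `step` are names chosen here.

```r
n <- 3
z <- rep(0, n)

idx <- combn(1:n, 2)   # columns are the index pairs (1,2), (1,3), (2,3)

# For each column k of idx, set those positions of a copy of z to 1.
# apply() puts one result per column, so t() turns them into rows.
step <- t(apply(idx, 2, function(k) { z[k] <- 1; z }))
step
```

The three rows of `step` are 1 1 0, 1 0 1 and 0 1 1, which are rows 5 to 7 of the full result shown earlier.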
You should define what is "the best way" (fastest? shortest code?, etc.).
One way is to use the package R.utils and the function intToBin for converting decimal numbers to binary numbers. See the example.
require(R.utils)
n <- 5
strsplit(intToBin(0:(2 ^ n - 1)), split = "")
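If an extra package is not wanted, plain integer arithmetic gives the same table as a numeric matrix. This is a sketch only; `binary_matrix` is a name invented here, and column 1 holds the least significant bit, which happens to reproduce the row order of the expand.grid answer.

```r
# Hypothetical helper: all binary vectors of length n as rows of a matrix
binary_matrix <- function(n) {
  k <- 0:(2^n - 1)                       # every integer with at most n bits
  # extract bit number `bit` (0 = least significant) of every k at once
  sapply(seq_len(n) - 1, function(bit) (k %/% 2^bit) %% 2)
}

b <- binary_matrix(3)
b
```

Because it enumerates 2^n rows, this (like every approach here) is only practical for small n.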

Element wise multiplication of first 4 columns by last 4 columns

I'm trying to do an element-wise multiplication of the first 2 columns by the next 2 columns and add the result as a fifth column,
for example:
> x = diag(4)
[,1] [,2] [,3] [,4] C1*C3 + C2*C4
[1,] 1 0 0 0 1*0 + 0*0
[2,] 0 1 0 0 0*0 + 1*0
[3,] 0 0 1 0 0*1 +0*0
[4,] 0 0 0 1 0*0 +0*1
Thank you!
You can do this pretty directly in base R.
cbind(x, rowSums(x[,1:2] * x[,3:4]))
Although going by what you show in your 'results' you are doing every other column which is a simple modification
cbind(x, rowSums(x[,c(1,3)] * x[,c(2,4)]))
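With diag(4) every product is zero, so the two variants cannot be told apart. A small sketch on a matrix with distinct entries shows the difference; the matrix `m` is made up for this illustration.

```r
m <- matrix(1:8, nrow = 2)   # columns are (1,2), (3,4), (5,6), (7,8)

# C1*C3 + C2*C4: first two columns times the next two
res_a <- cbind(m, rowSums(m[, 1:2] * m[, 3:4]))

# C1*C2 + C3*C4: every other column
res_b <- cbind(m, rowSums(m[, c(1, 3)] * m[, c(2, 4)]))

res_a[, 5]   # 1*5 + 3*7 = 26 and 2*6 + 4*8 = 44
res_b[, 5]   # 1*3 + 5*7 = 38 and 2*4 + 6*8 = 56
```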

adding successive four / n numbers in large matrix in R

I have a very large dataset with dimensions of 60K x 4K. I am trying to add every four successive values in each row, moving column-wise. The following is a smaller example dataset.
set.seed(123)
mat <- matrix (sample(0:1, 48, replace = TRUE), 4)
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12]
[1,] 0 1 1 1 0 1 1 0 1 1 0 0
[2,] 1 0 0 1 0 1 1 0 1 0 0 0
[3,] 0 1 1 0 0 1 1 1 0 0 0 0
[4,] 1 1 0 1 1 1 1 1 0 0 0 0
Here is what I am trying to perform:
mat[1,1] + mat[1,2] + mat[1,3] + mat[1,4] = 0 + 1 + 1 + 1 = 3
i.e. add every four values and output.
mat[1,5] + mat[1,6] + mat[1,7] + mat[1,8] = 0 + 1 + 1 + 0 = 2
Keep going to end of matrix (here to 12).
mat[1,9] + mat[1,10] + mat[1,11] + mat[1,12]
Once first row is done apply the same to second row, like:
mat[2,1] + mat[2,2] + mat[2,3] + mat[2,4]
mat[2,5] + mat[2,6] + mat[2,7] + mat[2,8]
mat[2,9] + mat[2,10] + mat[2,11] + mat[2,12]
The result will be nrow x (ncol)/4 matrix.
The expected result will look like:
col1-col4 col5-8 col9-12
row1 3 2 2
row2 2 2 1
row3 2 3 0
row4 3 4 0
Similarly for row 3 up to the number of rows in the matrix. How can I loop over this efficiently?
While Matthew's answer is really cool (+1, btw), you can get a much (~100x) faster solution if you avoid apply and use the *Sums functions (in this case colSums), and a bit of vector manipulation trickery:
funSums <- function(mat) {
t.mat <- t(mat) # rows become columns
dim(t.mat) <- c(4, length(t.mat) / 4) # wrap columns every four items (this is what we want to sum)
t(matrix(colSums(t.mat), nrow=ncol(mat) / 4)) # sum our new 4 element columns, and reconstruct desired output format
}
set.seed(123)
mat <- matrix(sample(0:1, 48, replace = TRUE), 4)
funSums(mat)
Produces desired output:
[,1] [,2] [,3]
[1,] 3 2 2
[2,] 2 2 1
[3,] 2 3 0
[4,] 3 4 0
Now, let's make something the real size and compare against the other options:
set.seed(123)
mat <- matrix(sample(0:1, 6e5, replace = TRUE), 4)
funApply <- function(mat) { # Matthew's Solution
apply(array(mat, dim=c(4, 4, ncol(mat) / 4)), MARGIN=c(1,3), FUN=sum)
}
funRcpp <- function(mat) { # David's Solution
roll_sum(mat, 4, by.column = F)[, seq_len(ncol(mat) - 4 + 1)%%4 == 1]
}
library(RcppRoll) # provides roll_sum, used by funRcpp
library(microbenchmark)
microbenchmark(times=10,
funSums(mat),
funApply(mat),
funRcpp(mat)
)
Produces:
Unit: milliseconds
expr min lq median uq max neval
funSums(mat) 4.035823 4.079707 5.256517 7.5359 42.06529 10
funApply(mat) 379.124825 399.060015 430.899162 455.7755 471.35960 10
funRcpp(mat) 18.481184 20.364885 38.595383 106.0277 132.93382 10
And to check:
all.equal(funSums(mat), funApply(mat))
# [1] TRUE
all.equal(funSums(mat), funRcpp(mat))
# [1] TRUE
The key point is that the *Sums functions are fully "vectorized", inasmuch as all the calculations happen in C. apply still has to do a fair amount of work in R that is not vectorized in that primitive-C sense, so it is slower (but far more flexible).
Specific to this problem, it might be possible to make it 2-3x faster as about half the time is spent on the transpositions, which are only necessary so that the dim changes do what I need for colSums to work.
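One transpose-free variant, offered as an untimed sketch with no claim that it beats funSums: add up the n strided column subsets (columns 1, 5, 9, ... plus columns 2, 6, 10, ... and so on). The function name `sum_every_n` and the small test matrix are assumptions made for this example.

```r
# Hypothetical helper: sum each run of n consecutive columns, row by row
sum_every_n <- function(mat, n = 4) {
  stopifnot(ncol(mat) %% n == 0)
  # k-th term holds the k-th column of every group; adding the n terms
  # gives one column per group without transposing anything
  Reduce(`+`, lapply(seq_len(n), function(k) {
    mat[, seq(k, ncol(mat), by = n), drop = FALSE]
  }))
}

m <- matrix(1:24, nrow = 2)   # 2 x 12, column j holds (2j - 1, 2j)
sum_every_n(m)
# row 1: 1+3+5+7 = 16, 9+11+13+15 = 48, 17+19+21+23 = 80
# row 2: 2+4+6+8 = 20, 10+12+14+16 = 52, 18+20+22+24 = 84
```

Whether this is actually faster on a 60K x 4K matrix would need to be checked with microbenchmark.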
Dividing the matrix up into a 3D array is one way:
apply(array(mat, dim=c(4, 4, 3)), MARGIN=c(1,3), FUN=sum)
# [,1] [,2] [,3]
# [1,] 3 2 2
# [2,] 2 2 1
# [3,] 2 3 0
# [4,] 3 4 0
Here's another approach using the RcppRoll package
library(RcppRoll) # Uses C++/Rcpp
n <- 4 # The summing range
roll_sum(mat, n, by.column = F)[, seq_len(ncol(mat) - n + 1) %% n == 1]
## [,1] [,2] [,3]
## [1,] 3 2 2
## [2,] 2 2 1
## [3,] 2 3 0
## [4,] 3 4 0
This might be the slowest of all:
set.seed(123)
mat <- matrix (sample(0:1, 48, replace = TRUE), 4)
mat
output <- sapply(seq(4,ncol(mat),4), function(i) { apply(mat,1,function(j){
sum(j[c(i-3, i-2, i-1, i)], na.rm=TRUE)
})})
output
[,1] [,2] [,3]
[1,] 3 2 2
[2,] 2 2 1
[3,] 2 3 0
[4,] 3 4 0
Maybe nested for-loops would be slower, but this answer is pretty close to being nested for-loops.

Finding intersection entries in a data frame

I've run into an R-programming problem that I can't seem to wrap my head around. I have data like the following:
data = data.frame("start"=c(1,2,4,5),
"length"=c(2,2,2,3),
"decision"=c("yes","no","yes","yes"))
Which looks like:
start length decision
1 1 2 yes
2 2 2 no
3 4 2 yes
4 5 3 yes
Row one stands for a sequence of integers that start at 1 for length 2 (1,2). Row 3 is 2 integers starting at 4 (4,5). I'm looking for intersections between entries that have a 'yes' decision variable. When the decision variable is 'no', then the sequence is thrown out. Here's what I've attempted so far.
I think I need to create a sequence list first.
sequence.list = lapply(seq(dim(data)[1]),
function(d){
seq(data$start[d],(data$start[d]+data$length[d]-1),by=1)
})
This outputs:
sequence.list
[[1]]
[1] 1 2
[[2]]
[1] 2 3
[[3]]
[1] 4 5
[[4]]
[1] 5 6 7
Which is a start. Then I create a list that counts intersections between items on my list (I stole this idea from another post on here).
count.intersect = lapply(sequence.list,function(a) {
sapply(seq(length(sequence.list)),
function(b) length(intersect(sequence.list[[b]], a)))
})
This creates the list:
count.intersect
[[1]]
[1] 2 1 0 0
[[2]]
[1] 1 2 0 0
[[3]]
[1] 0 0 2 1
[[4]]
[1] 0 0 1 3
The way to read this is that entry 1 in the data frame has 2 trivial intersections with itself and 1 intersection with entry 2.
Here's where I get fuzzy on what to do. Make it a matrix?
intersect.matrix = do.call(rbind,count.intersect)
Then set the rows and columns of non-used entries to zero?
intersect.matrix[,data$decision=="no"]=0
intersect.matrix[data$decision=="no",]=0
intersect.matrix
[,1] [,2] [,3] [,4]
[1,] 2 0 0 0
[2,] 0 0 0 0
[3,] 0 0 2 1
[4,] 0 0 1 3
Now, I would like to return indices 3 and 4 somehow. I want to find the rows (or columns) containing non zeros that are also not on the diagonal.
Sorry for posting the whole procedure, I also want to know if there is a shorter way to go from the starting dataframe to finding intersections in used entries.
Since you are not interested in non zero values on the diagonal, first I would subtract them away:
diag.mat <- diag(intersect.matrix) * diag(ncol(intersect.matrix))
which gives:
intersect.matrix - diag.mat
[,1] [,2] [,3] [,4]
[1,] 0 0 0 0
[2,] 0 0 0 0
[3,] 0 0 0 1
[4,] 0 0 1 0
Then identify which of the columns still hold non zero entries using which:
which(colSums(intersect.matrix - diag.mat) != 0)
[1] 3 4
You asked whether there is a short way to go from your data frame data to the indices. Here it is.
(Note: This may be hard to understand if you're new to R.)
1) Create the sequence list:
sequence.list <- apply(data[1:2], 1, function(x) seq_len(x[2]) + x[1] - 1)
# [[1]]
# [1] 1 2
#
# [[2]]
# [1] 2 3
#
# [[3]]
# [1] 4 5
#
# [[4]]
# [1] 5 6 7
2) Count intersects and create the intersect matrix
# compare entries a and b only, not every entry lying between them
intersect.matrix <- outer(s <- seq_along(sequence.list), s,
Vectorize(function(a, b)
length(intersect(sequence.list[[a]], sequence.list[[b]]))))
# [,1] [,2] [,3] [,4]
# [1,] 2 1 0 0
# [2,] 1 2 0 0
# [3,] 0 0 2 1
# [4,] 0 0 1 3
3) Set cells corresponding to "no" to zero
idx <- data$decision == "no"
intersect.matrix[idx, ] <- intersect.matrix[ , idx] <- 0
# [,1] [,2] [,3] [,4]
# [1,] 2 0 0 0
# [2,] 0 0 0 0
# [3,] 0 0 2 1
# [4,] 0 0 1 3
4) Find indices of non-zero rows/columns (except diagonal)
result <- which(as.logical(colSums("diag<-"(intersect.matrix, 0))))
# [1] 3 4
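Because every entry is a run of consecutive integers, the overlap counts can also be computed from the endpoints alone, without building the sequences. The following is a sketch under that assumption (contiguous runs with positive length); `end`, `overlap`, `keep` and `hits` are names chosen for this example.

```r
data <- data.frame(start    = c(1, 2, 4, 5),
                   length   = c(2, 2, 2, 3),
                   decision = c("yes", "no", "yes", "yes"))

end <- data$start + data$length - 1          # last integer of each run

# shared integers of runs i and j: min(end) - max(start) + 1, floored at 0
overlap <- pmax(outer(end, end, pmin) -
                outer(data$start, data$start, pmax) + 1, 0)

keep <- data$decision == "yes"
overlap[!keep, ] <- 0                        # drop rows of unused entries
overlap[, !keep] <- 0                        # drop columns of unused entries
diag(overlap) <- 0                           # ignore self-intersections

hits <- which(colSums(overlap) > 0)
hits                                         # entries 3 and 4
```

Before the zeroing steps, `overlap` equals the intersect matrix from step 2.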

Resources