Basic operations on a Simple Triplet Matrix (Document Term Matrix) in R

I am struggling to understand how to do basic operations with the Simple Triplet Matrix produced by TermDocumentMatrix() from the tm package.
It seems that the problem could be that the matrices are not recognized as numeric.
library(tm)
data("crude")
tdm <- TermDocumentMatrix(crude)
vector <- tdm[,1]
matrix <- tdm[,2:20]
multiplication <- t(vector) %*% matrix
# Error in t(vector) %*% matrix :
# requires numeric/complex matrix/vector arguments
But this works:
multiplication <- t(as.matrix(vector)) %*% as.matrix(matrix)
multiplication
# Docs
# Docs 144 191 194 211 236 237 242 246 248 273 349 352 353 368 489 502 543 704 708
# 127 232 56 62 65 201 214 61 159 244 197 51 90 71 84 96 126 90 152 11
My actual Term Document Matrix is very large, too large to be converted into a dense matrix with as.matrix().
Is there any way to operate directly on the Simple Triplet Matrix without applying transformation into different classes (like sparseMatrix() of the Matrix package)?

The slam package has methods for simple triplet matrices:
library(slam)
matprod_simple_triplet_matrix(t(vector), matrix)
Or, equivalently:
crossprod_simple_triplet_matrix(vector, matrix)
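For instance, here is a minimal self-contained sketch (using a small random matrix instead of the crude corpus, so only slam is needed) showing that the sparse crossprod agrees with the dense computation:

```r
library(slam)

set.seed(1)
d <- matrix(rpois(30, 1), nrow = 5)   # small dense matrix for comparison
stm <- as.simple_triplet_matrix(d)    # sparse triplet form
v <- stm[, 1]                         # one column, still a simple triplet matrix
m <- stm[, 2:6]

## t(v) %*% m without ever densifying the operands
res <- crossprod_simple_triplet_matrix(v, m)

## agrees with the dense computation
all.equal(as.numeric(res), as.numeric(crossprod(d[, 1], d[, 2:6])))  # TRUE
```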


R function generating incorrect results

I am trying to get better with functions in R, and I was working on a function to pull out every odd value from 100 to 500 that is divisible by 3. I got close with the function below: it returns all of the correct values, but it also includes the first number in the sequence (101), which it should not. Any help would be greatly appreciated. The code I wrote is as follows:
Test = function(n) {
  if (n > 100) {
    s = seq(from = 101, to = n, by = 2)
    p = c()
    for (i in seq(from = 101, to = n, by = 2)) {
      if (any(s == i)) {
        p = c(p, i)
        s = c(s[(s %% 3) == 0], i)
      }
    }
    return(p)
  } else {
    stop
  }
}
Test(500)
Here is a function that gets all non-even multiples of 3. It's fully vectorised, with no loops at all.
1. Check that n is within the range [100, 500].
2. Create an integer vector N from 100 to n.
3. Create a logical index i of the elements of N that are divisible by 3 but not by 2.
4. Extract the elements of N that match the index i.
The main work is done in three lines of code.
Test <- function(n) {
  stopifnot(n >= 100)
  stopifnot(n <= 500)
  N <- seq_len(n)[-(1:99)]
  i <- ((N %% 3) == 0) & ((N %% 2) != 0)
  N[i]
}
Test(500)
Here is a vectorised one-liner which optionally allows you to change the lower bound from its default of 100 to anything you like. If the bounds are wrong, it returns an empty vector rather than throwing an error.
It works by creating a vector 1:n, then testing whether each element is greater than 100 (or whichever lower bound m you set), whether it is odd, and whether it is divisible by 3. The which() function returns the indices of the elements that pass all three tests.
Test <- function(n, m = 100) which(1:n > m & 1:n %% 2 != 0 & 1:n %% 3 == 0)
So you can use it as specified in your question:
Test(500)
# [1] 105 111 117 123 129 135 141 147 153 159 165 171 177 183 189 195 201 207 213 219
# [21] 225 231 237 243 249 255 261 267 273 279 285 291 297 303 309 315 321 327 333 339
# [41] 345 351 357 363 369 375 381 387 393 399 405 411 417 423 429 435 441 447 453 459
# [61] 465 471 477 483 489 495
Or play around with upper and lower bounds:
Test(100, 50)
# [1] 51 57 63 69 75 81 87 93 99
Here is another function for your objective:
Test <- function(n) {
  if (n < 100 | n > 500) stop("out of range")
  v <- seq(101, n, by = 2)
  na.omit(ifelse(v %% 2 == 1 & v %% 3 == 0, v, NA))
}
stop() is called when n is out of the range [100, 500].
ifelse() outputs the desired odd values plus NAs.
na.omit() filters out the NAs and produces the final result.

In R, how can I compute the summary function in parallel?

I have a huge dataset. I computed a multinomial regression with multinom() from the nnet package.
mylogit<- multinom(to ~ RealAge, mydata)
It takes 10 minutes. But when I use the summary() function to compute the coefficients, it takes more than a day!
This is the code I used:
output <- summary(mylogit)
Coef<-t(as.matrix(output$coefficients))
I was wondering if anybody knows how I can compute this part of the code with parallel processing in R?
Here is a small sample of the data (mydata):
to RealAge
513 59.608
513 84.18
0 85.23
119 74.764
116 65.356
0 89.03
513 92.117
69 70.243
253 88.482
88 64.23
513 64
4 84.03
65 65.246
69 81.235
513 87.663
513 81.21
17 75.235
117 49.112
69 59.019
20 90.03
If you just want the coefficients, use only the coef() method, which does less computation.
Example:
mydata <- readr::read_table("to RealAge
513 59.608
513 84.18
0 85.23
119 74.764
116 65.356
0 89.03
513 92.117
69 70.243
253 88.482
88 64.23
513 64
4 84.03
65 65.246
69 81.235
513 87.663
513 81.21
17 75.235
117 49.112
69 59.019
20 90.03")[rep(1:20, 3000), ]
mylogit <- nnet::multinom(to ~ RealAge, mydata)
system.time(output <- summary(mylogit)) # 6 sec
all.equal(output$coefficients, coef(mylogit)) # TRUE & super fast
If you profile the summary() function, you'll see that most of the time is taken by the crossprod() function.
So, if you really want the output of summary(), you could use an optimized math library, such as the MKL provided by Microsoft R Open.
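You can confirm the bottleneck yourself with base R's Rprof(); a minimal sketch, assuming mylogit has been fitted as above:

```r
## Profile summary() to see where the time goes.
Rprof(tmp <- tempfile())
output <- summary(mylogit)
Rprof(NULL)
head(summaryRprof(tmp)$by.self)  # crossprod() typically dominates
```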

Converting different maximum scores to percentage out of 100

I have three datasets with 3 students and 3 subjects each, and with different maximum scores (125, 150, 200). How can I calculate the mean percentage (out of 100) per subject for a standard (not a section) when the three maximum scores are different, and therefore not directly comparable?
Class2:
section1.csv
english maths science
name score(125) score(125) score(125)
sam 114 112 111
erm 89 91 97
asd 101 107 118
section2.csv
english maths science
name score(150) score(150) score(150)
wer 141 127 143
rahul 134 119 145
rohit 149 135 139
section3.csv
english maths science
name score(200) score(200) score(200)
vinod 178 186 176
manoj 189 191 185
deepak 191 178 187
P.S.: Expected columns in the output:
class1 englishavg mathsavg scienceavg (the values are the sum of the mean percentages of all three sections)
Here is the piece of code I tried:
files <- list.files(pattern = ".csv") ## creates a vector with all file names in your folder
list_files <- lapply(files,read.csv,header=F,stringsAsFactors=F)
list_files <- lapply(list_files, function(x) x)
engav <- sapply(list_files,function(x) mean(as.numeric(x[,2]),na.rm=T)/2)
mathav <- sapply(list_files,function(x) mean(as.numeric(x[,3]),na.rm=T)/2)
scienceav <- sapply(list_files,function(x) mean(as.numeric(x[,4]),na.rm=T)/2)
result <- cbind(files,engav,mathav,scienceav)
Looking forward to any assistance.
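One way to do the conversion (a sketch, not tested against the real files; the maxima 125/150/200 and the file names are taken from the question) is to divide each section's column means by that section's maximum score before multiplying by 100:

```r
## Sketch: convert each section's scores to percentages before averaging.
## Skips the two header rows, then divides by the section's known maximum.
section_pct_means <- function(files, maxima) {
  pct <- mapply(function(f, mx) {
    x <- read.csv(f, header = FALSE, stringsAsFactors = FALSE, skip = 2)
    colMeans(sapply(x[, 2:4], as.numeric), na.rm = TRUE) / mx * 100
  }, files, maxima)
  data.frame(section    = files,
             englishavg = pct[1, ],
             mathsavg   = pct[2, ],
             scienceavg = pct[3, ])
}

## Usage (file names and maxima as in the question):
## section_pct_means(c("section1.csv", "section2.csv", "section3.csv"),
##                   c(125, 150, 200))
```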

10 fold cross validation using logspline in R

I would like to do 10-fold cross-validation and then use the MSE for model selection in R. I can divide the data into 10 groups, but I get the following error. How can I fix it?
crossvalind <- function(N, kfold) {
  len.seg <- ceiling(N / kfold)
  incomplete <- kfold * len.seg - N
  complete <- kfold - incomplete
  ind <- matrix(c(sample(1:N), rep(NA, incomplete)), nrow = len.seg, byrow = TRUE)
  cvi <- lapply(as.data.frame(ind), function(x) c(na.omit(x)))  # a list
  return(cvi)
}
I am using the logspline package to estimate a density function.
library(logspline)
x <- rnorm(300, 0, 1)
kfold <- 10
cvi <- crossvalind(N = 300, kfold = 10)
for (i in 1:length(cvi)) {
  xc <- x[cvi[-i]]  # x in training set
  xt <- x[cvi[i]]   # x in test set
  fit <- logspline(xc)
  f.pred <- dlogspline(xt, fit)
  f.true <- dnorm(xt, 0, 1)
  mse[i] <- mean((f.true - f.pred)^2)
}
Error in x[cvi[-i]] : invalid subscript type 'list'
cvi is a list object, so cvi[-1] and cvi[1] are also list objects. You then try x[cvi[-1]], which is subscripting with a list object; that doesn't make sense, because list objects can be complex objects containing numbers, characters, dates, and other lists.
Subscripting a list with single square brackets always returns a list. Use double square brackets to get the constituents, which in this case are vectors.
> cvi[1] # this is a list with one element
$V1
[1] 101 78 231 82 211 239 20 201 294 276 181 168 207 240 61 72 267 75 218
[20] 177 127 228 29 159 185 118 296 67 41 187
> cvi[[1]] # a length 30 vector:
[1] 101 78 231 82 211 239 20 201 294 276 181 168 207 240 61 72 267 75 218
[20] 177 127 228 29 159 185 118 296 67 41 187
so you can then get those elements of x:
> x[cvi[[1]]]
[1] 0.32751014 -1.13362827 -0.13286966 0.47774044 -0.63942372 0.37453378
[7] -1.09954301 -0.52806368 -0.27923480 -0.43530831 1.09462984 0.38454106
[13] -0.68283862 -1.23407793 1.60511404 0.93178122 0.47314510 -0.68034783
[19] 2.13496564 1.20117869 -0.44558321 -0.94099782 -0.19366673 0.26640705
[25] -0.96841548 -1.03443796 1.24849113 0.09258465 -0.32922472 0.83169736
This doesn't work with negative indexes:
> cvi[[-1]]
Error in cvi[[-1]] : attempt to select more than one element
So instead of subscripting x with the list elements you don't want, subscript it with the negative of the indexes you do want (since you are partitioning here):
> x[-cvi[[1]]]
will return the other 270 elements. Note that I've used 1 here for the first pass through your loop; replace it with i and insert it in your code.
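Putting the advice together, a corrected version of the loop might look like this (a sketch; note that mse must also be initialised before the loop, which the original code omitted):

```r
## Corrected loop: use [[ ]] to extract each fold's indices, and negative
## numeric indices for the training set. Uses crossvalind() as defined above.
library(logspline)

set.seed(42)
x <- rnorm(300, 0, 1)
cvi <- crossvalind(N = 300, kfold = 10)
mse <- numeric(length(cvi))
for (i in seq_along(cvi)) {
  xc <- x[-cvi[[i]]]  # training set: everything outside fold i
  xt <- x[cvi[[i]]]   # test set: fold i
  fit <- logspline(xc)
  mse[i] <- mean((dnorm(xt, 0, 1) - dlogspline(xt, fit))^2)
}
mse
```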

Clustering Large Data Matrix using R

I have a large data matrix (33183 x 1681); each row corresponds to one observation and each column to a variable.
I applied k-medoids clustering using the pam() function in R, and I tried to visualize the clustering results using the built-in plots that come with pam(). I got this error:
Error in princomp.default(x, scores = TRUE, cor = ncol(x) != 2) :
cannot use cor=TRUE with a constant variable
I think this problem is caused by the high dimensionality of the data matrix I'm trying to cluster.
Any thoughts/ideas on how to tackle this issue?
Check out the clara() function in the cluster package, which ships with all versions of R.
library("cluster")
## generate 500 objects, divided into 2 clusters.
x <- rbind(cbind(rnorm(200, 0, 8), rnorm(200, 0, 8)),
           cbind(rnorm(300, 50, 8), rnorm(300, 50, 8)))
clarax <- clara(x, 2, samples=50)
clarax
> clarax
Call: clara(x = x, k = 2, samples = 50)
Medoids:
[,1] [,2]
[1,] -1.15913 0.5760027
[2,] 50.11584 50.3360426
Objective function: 10.23341
Clustering vector: int [1:500] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 ...
Cluster sizes: 200 300
Best sample:
[1] 10 17 45 46 68 90 99 150 151 160 184 192 232 238 243 250 266 275 277
[20] 298 303 304 313 316 327 333 339 353 358 398 405 410 411 421 426 429 444 447
[39] 456 477 481 494 499 500
Available components:
[1] "sample" "medoids" "i.med" "clustering" "objective"
[6] "clusinfo" "diss" "call" "silinfo" "data"
Note that you should study the help for clara() (?clara) in some detail, as well as the references cited there, in order to make the clustering performed by clara() as close as possible (or identical) to that of pam().
