10-fold cross validation using logspline in R

I would like to do 10-fold cross validation and then use MSE for model selection in R. I can divide the data into 10 groups, but I get the following error. How can I fix it?
crossvalind <- function(N, kfold) {
len.seg <- ceiling(N/kfold)
incomplete <- kfold*len.seg - N
complete <- kfold - incomplete
ind <- matrix(c(sample(1:N), rep(NA, incomplete)), nrow = len.seg, byrow = TRUE)
cvi <- lapply(as.data.frame(ind), function(x) c(na.omit(x))) # a list
return(cvi)
}
I am using logspline package for estimation of a density function.
library(logspline)
x = rnorm(300, 0, 1)
kfold <- 10
cvi <- crossvalind(N = 300, kfold = 10)
for (i in 1:length(cvi)) {
xc <- x[cvi[-i]] # x in training set
xt <- x[cvi[i]] # x in test set
fit <- logspline(xc)
f.pred <- dlogspline(xt, fit)
f.true <- dnorm(xt, 0, 1)
mse[i] <- mean((f.true - f.pred)^2)
}
Error in x[cvi[-i]] : invalid subscript type 'list'

cvi is a list, so cvi[-1] and cvi[1] are lists as well. x[cvi[-1]] then tries to subscript x with a list, which doesn't make sense because a list can hold arbitrary objects: numbers, characters, dates and other lists.
Subscripting a list with single square brackets always returns a list. Use double square brackets to get the constituents, which in this case are vectors.
> cvi[1] # this is a list with one element
$V1
[1] 101 78 231 82 211 239 20 201 294 276 181 168 207 240 61 72 267 75 218
[20] 177 127 228 29 159 185 118 296 67 41 187
> cvi[[1]] # a length 30 vector:
[1] 101 78 231 82 211 239 20 201 294 276 181 168 207 240 61 72 267 75 218
[20] 177 127 228 29 159 185 118 296 67 41 187
so you can then get those elements of x:
> x[cvi[[1]]]
[1] 0.32751014 -1.13362827 -0.13286966 0.47774044 -0.63942372 0.37453378
[7] -1.09954301 -0.52806368 -0.27923480 -0.43530831 1.09462984 0.38454106
[13] -0.68283862 -1.23407793 1.60511404 0.93178122 0.47314510 -0.68034783
[19] 2.13496564 1.20117869 -0.44558321 -0.94099782 -0.19366673 0.26640705
[25] -0.96841548 -1.03443796 1.24849113 0.09258465 -0.32922472 0.83169736
this doesn't work with negative indexes:
> cvi[[-1]]
Error in cvi[[-1]] : attempt to select more than one element
So instead of subscripting x with the list elements you don't want, subscript it with the negative of the indexes you do want (since you are partitioning here):
> x[-cvi[[1]]]
will return the other 270 elements. Note that I've used 1 here for the first pass through your loop; replace it with i and insert it in your code.
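Putting the two indexing fixes together, the loop from the question could look like the sketch below. It assumes the logspline package is installed; the set.seed() call and the preallocated mse vector are additions for illustration and are not part of the original code.

```r
library(logspline)  # assumed to be installed

# Fold assignment from the question: returns a list of kfold index vectors
crossvalind <- function(N, kfold) {
  len.seg <- ceiling(N / kfold)
  incomplete <- kfold * len.seg - N
  ind <- matrix(c(sample(1:N), rep(NA, incomplete)), nrow = len.seg, byrow = TRUE)
  lapply(as.data.frame(ind), function(x) c(na.omit(x)))
}

set.seed(1)                      # added so the run is reproducible
x <- rnorm(300, 0, 1)
kfold <- 10
cvi <- crossvalind(N = 300, kfold = kfold)
mse <- numeric(kfold)            # preallocate; the question's loop never created mse

for (i in seq_along(cvi)) {
  test_idx <- cvi[[i]]           # double brackets: an integer vector, not a list
  xc <- x[-test_idx]             # training set: everything except fold i
  xt <- x[test_idx]              # test set: fold i
  fit <- logspline(xc)
  f.pred <- dlogspline(xt, fit)
  f.true <- dnorm(xt, 0, 1)
  mse[i] <- mean((f.true - f.pred)^2)
}
mean(mse)                        # average MSE over the folds
```

Negating the test indexes works here only because the folds partition 1:N, so the complement of fold i is exactly the union of the other folds.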

Related

R function generating incorrect results

I am trying to get better with functions in R and I was working on a function to pull out every odd value from 100 to 500 that was divisible by 3. I got close with the function below. It keeps returning all of the values correctly but it also includes the first number in the sequence (101) when it should not. Any help would be greatly appreciated. The code I wrote is as follows:
Test=function(n){
if(n>100){
s=seq(from=101,to=n,by=2)
p=c()
for(i in seq(from=101,to=n,by=2)){
if(any(s==i)){
p=c(p,i)
s=c(s[(s%%3)==0],i)
}}
return (p)}else{
stop
}}
Test(500)
Here is a function that gets all odd multiples of 3. It's fully vectorized, no loops at all.
1. Check that n is within the range [100, 500].
2. Create an integer vector N from 100 to n.
3. Create a logical index i of the elements of N that are divisible by 3 but not by 2.
4. Extract the elements of N that match the index i.
The main work is done in 3 code lines.
Test <- function(n){
stopifnot(n >= 100)
stopifnot(n <= 500)
N <- seq_len(n)[-(1:99)]
i <- ((N %% 3) == 0) & ((N %% 2) != 0)
N[i]
}
Test(500)
Here is a vectorised one-liner which optionally allows you to change the lower bound from a default of 100 to anything you like. If the bounds are wrong, it returns an empty vector rather than throwing an error.
It works by creating a vector of 1:500 (or more generally, 1:n), then testing whether each element is greater than 100 (or whichever lower bound m you set), AND whether each element is odd AND whether each element is divisible by 3. It uses the which function to return the indices of the elements that pass all the tests.
Test <- function(n, m = 100) which(1:n > m & 1:n %% 2 != 0 & 1:n %% 3 == 0)
So you can use it as specified in your question:
Test(500)
# [1] 105 111 117 123 129 135 141 147 153 159 165 171 177 183 189 195 201 207 213 219
# [21] 225 231 237 243 249 255 261 267 273 279 285 291 297 303 309 315 321 327 333 339
# [41] 345 351 357 363 369 375 381 387 393 399 405 411 417 423 429 435 441 447 453 459
# [61] 465 471 477 483 489 495
Or play around with upper and lower bounds:
Test(100, 50)
# [1] 51 57 63 69 75 81 87 93 99
Here is an example function that does what you want:
Test <- function(n) {
if(n<100 | n> 500) stop("out of range")
v <- seq(101,n,by = 2)
na.omit(ifelse(v%%2==1 & v%%3==0,v,NA))
}
stop() is called when n is outside the range [100, 500].
ifelse() returns the desired odd values, with NA everywhere else.
na.omit() filters out the NAs and produces the final result.
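As a cross-check on the answers above: a number is an odd multiple of 3 exactly when it leaves remainder 3 on division by 6, so the same values can also be generated directly with seq(). The sketch below is only an illustration; the function name odd_multiples_of_3 and the exclusive lower bound m (mirroring the one-liner above) are assumptions, not part of the original answers.

```r
# Hypothetical helper: odd multiples of 3 that are > m and <= n
odd_multiples_of_3 <- function(n, m = 100) {
  # smallest value strictly greater than m that is congruent to 3 modulo 6
  start <- m + 1 + (3 - (m + 1)) %% 6
  if (start > n) return(integer(0))   # empty range: nothing to return
  seq(start, n, by = 6)
}

head(odd_multiples_of_3(500))
# [1] 105 111 117 123 129 135
```

Stepping by 6 avoids building and filtering the whole 1:n vector, which only matters for very large n.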

Finding nearest matching points

For each red point, I would like to find the nearest equivalent blue point on the other side of the abline (e.g. for (1, 5) find (5, 1)).
Data:
https://1drv.ms/f/s!Asb7WztvacfOuesIq4evh0jjvejZ4Q
Edit: to open data do readRDS("path/to/data")
So what I have tried is to find the difference between the x and y coordinates, rank them and then find the min value going down the ranks for both x and y. The results are pretty bad. The thing I'm struggling with is finding a way to find the nearest match of tuples.
My attempt:
find_nearest <- function(query, subject){
weight_df <- data.frame(ID=query$ID)
#find difference of first, then second, rank and find match in both going from top to bottom
tmp_df <- query
for(i in 1:nrow(subject)){
first_order <- order(abs(query$mean_score_n-subject$mean_score_n[i]))
second_order <- order(abs(query$mean_score_p-subject$mean_score_p[i]))
tmp_df$order_1[first_order] <- seq(1, nrow(tmp_df))
tmp_df$order_2[second_order] <- seq(1, nrow(tmp_df))
weight_df[,i+1] <- tmp_df$order_1 + tmp_df$order_2
}
rownames(weight_df) <- weight_df$ID
weight_df$ID <- NULL
print(dim(weight_df))
nearest_match <- list()
count <- 1
subject_ids <- NA
query_ids <- NA
while(ncol(weight_df) > 0 & count <= ncol(weight_df)){
pos <- which(weight_df == min(weight_df, na.rm = TRUE), arr.ind = TRUE)
if(length(unique(rownames(pos))) > 1){
for(i in nrow(pos)){
#if subject/query already used then mask and find another
if(subject$ID[pos[i,2]] %in% subject_ids){
weight_df[pos[i,1],pos[i,2]] <- NA
}else if(query$ID[pos[i,1]] %in% query_ids){
weight_df[pos[i,1],pos[i,2]] <- NA
}else{
subject_ids <- c(subject_ids, subject$ID[pos[i,2]])
query_ids <- c(query_ids, query$ID[pos[i,1]])
nearest_match[[count]] <- data.frame(query=query[pos[i,1],]$ID, subject=subject[pos[i,2],]$ID)
#mask
weight_df[pos[i,1],pos[i,2]] <- NA
count <- count + 1
}
}
}else if(nrow(pos) > 1){
#if subject/query already used then mask and find another
if(subject$ID[pos[1,2]] %in% subject_ids){
weight_df[pos[1,1],pos[1,2]] <- NA
}else if(query$ID[pos[1,1]] %in% query_ids){
weight_df[pos[1,1],pos[1,2]] <- NA
}else{
subject_ids <- c(subject_ids, subject$ID[pos[1,1]])
query_ids <- c(query_ids, query$ID[pos[1,1]])
nearest_match[[count]] <- data.frame(query=query[pos[1,1],]$ID, subject=subject[pos[1,2],]$ID)
#mask
weight_df[pos[1,1],pos[1,2]] <- NA
count <- count + 1
}
}else{
#if subject/query already used then mask and find another
if(subject$ID[pos[2]] %in% subject_ids){
weight_df[pos[1],pos[2]] <- NA
}else if(query$ID[pos[1]] %in% query_ids){
weight_df[pos[1],pos[2]] <- NA
}else{
subject_ids <- c(subject_ids, subject$ID[pos[2]])
query_ids <- c(query_ids, query$ID[pos[1]])
nearest_match[[count]] <- data.frame(query=query[pos[1],]$ID, subject=subject[pos[2],]$ID)
#mask
weight_df[pos[1],pos[2]] <- NA
count <- count + 1
}
}
}
out <- plyr::ldply(nearest_match, rbind)
out <- merge(out, data.frame(subject=subject$ID,
mean_score_p_n=subject$mean_score_p,
mean_score_n_n= subject$mean_score_n), by="subject", all.x=TRUE)
out <- merge(out, data.frame(query=query$ID,
mean_score_p_p=query$mean_score_p,
mean_score_n_p= query$mean_score_n), by="query", all.x=TRUE)
return(out)
}
Edit: is this what the solution looks like for you?
ggplot() +
geom_point(data=B[out,], aes(x=mean_score_p, y= mean_score_n, color="red")) +
geom_point(data=A, aes(x=mean_score_p, y=mean_score_n, color="blue")) +
geom_abline(intercept = 0, slope = 1)
Let
query <- readRDS("query.dms")
subject <- readRDS("subject.dms")
kA <- nrow(subject)
kB <- nrow(query)
A <- as.matrix(subject[, 2:3])
B <- as.matrix(query[, 2:3])
where we want to find the closest "reverse" point (row) in B to each point in A.
Solution permitting non-unique results
Then, assuming that you are using the Euclidean distance,
D <- as.matrix(dist(rbind(A, B[, 2:1])))[(1 + kA):(kA + kB), 1:kA]
unname(apply(D, 2, which.min))
# [1] 268 183 350 284 21 360 132 287 100 298 58 56 170 70 47 305 353
# [18] 43 266 198 58 215 198 389 412 321 255 181 79 340 292 268 198 54
# [35] 390 38 376 47 19 94 244 18 168 201 160 194 114 247 287 273 182
# [52] 87 94 87 192 63 160 244 101 298 62
are the corresponding row numbers in B. The trick was to switch the coordinates of the points in B by using B[, 2:1].
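To see the coordinate swap in isolation, here is a small sketch on made-up data. The toy matrices below are hypothetical and only stand in for the subject and query columns used above.

```r
# Toy data (hypothetical): each row is one point
A <- rbind(c(1, 5), c(2, 8))
B <- rbind(c(8, 2), c(5, 1), c(4, 4))
kA <- nrow(A)
kB <- nrow(B)

# Stack A on top of the mirrored B, compute all pairwise distances,
# then keep only the block with rows from B and columns from A
D <- as.matrix(dist(rbind(A, B[, 2:1])))[(1 + kA):(kA + kB), 1:kA]

# For every point in A, the row of B whose mirror image is closest
nearest <- unname(apply(D, 2, which.min))
nearest
# [1] 2 1
```

Here the first point of A, (1, 5), is matched to the second row of B, (5, 1), and the second point, (2, 8), to the first row, (8, 2).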
Solution with unique results
out <- vector("numeric", length = kA)
colnames(D) <- 1:ncol(D)
rownames(D) <- 1:nrow(D)
while(any(out == 0))
for(i in 1:nrow(D)) {
aux <- apply(D, 2, which.min)
if(i %in% aux) {
win <- which(aux == i)[which.min(D[i, aux == i])]
out[as.numeric(names(win))] <- as.numeric(rownames(D)[i])
D <- D[-i, -win, drop = FALSE]
}
}
out
# [1] 268 183 350 284 21 360 132 213 100 298 22 56 170 70 128 305 353
# [18] 43 266 198 58 215 294 389 412 321 255 181 79 340 292 20 347 54
# [35] 390 38 376 47 19 94 73 18 168 201 160 194 114 247 287 273 182
# [52] 87 365 158 192 63 211 244 101 68 62
whereas
all(table(out) == 1)
# [1] TRUE
confirms uniqueness. The solution is not the most efficient, but on your dataset it takes only a couple of seconds. It takes some time because it keeps going over all the available points in B checking if it is the closest one to any of the points in A. If so, the corresponding point in B is assigned to the closest one in A. Then both the point in A and the point in B are eliminated from the distance matrix. The loop goes until every point in A has some match in B.

Creating data continuously using rnorm until an outlier occurs in R

Sorry for the confusing title, but I wasn't sure how to title what I am trying to do. My objective is to create a dataset of 1000 observations, each of which is the length of a run. I have created a phase1 dataset, from which a set of control limits is produced. What I am trying to do now is create a phase2 dataset, most likely using rnorm. I want a repeat loop that keeps creating values in the phase2 dataset until one of those values falls outside the control limits produced from the phase1 dataset. For example, if I had 3.0 and -3.0 as control limits, the phase2 dataset would create observations until obs 398, where the value happens to be 3.45, thus stopping the creation of data. My objective is then to record the number 398. Furthermore, I then want to loop back to the phase1 dataset/control limits portion, create a new set of control limits and run another phase2, until I have 1000 run lengths recorded. The code I have for the phase1/control limits works fine and looks like this:
nphase1=50
nphase2=1000
varcount=1
meanshift= 0
sigmashift= 1
##### phase1 dataset/ control limits #####
phase1 <- matrix(rnorm(nphase1*varcount, 0, 1), nrow = nphase1, ncol=varcount)
mean_var <- apply(phase1, 2, mean)
std_var <- apply(phase1, 2, sd)
df_var <- data.frame(mean_var, std_var)
Upper_SPC_Limit_Method1 <- with(df_var, mean_var + 3 * std_var)
Lower_SPC_Limit_Method1 <- with(df_var, mean_var - 3 * std_var)
df_control_limits<- data.frame(Upper_SPC_Limit_Method1, Lower_SPC_Limit_Method1)
I have previously created this code in SAS and it looks like this. It might be a better reference for what I am trying to achieve than my explanation.
%macro phase2_dataset (n=,varcount=, meanshift=, sigmashift=, nphase1=,simID=,);
%do z=1 %to &n;
%phase1_dataset (n=&nphase1, varcount=&varcount);
data phase2; set control_limits n=lastobs;
call streaminit(0);
do until (phase2_var1<Lower_SPC_limit_method1_var1 or
phase2_var1>Upper_SPC_limit_method1_var1);
phase2_var1 = rand("normal", &meanshift, &sigmashift);
output;
end;
run;
ods exclude all;
proc means data=phase2;
var phase2_var1;
ods output summary=x;
run;
ods select all;
data run_length; set x;
keep Phase2_var1_n;
run;
proc append base= QA.Phase2_dataset&simID data=Run_length force; run;
%end;
%mend;
I have also been doing research about using a while loop in place of the repeat loop.
I'm new to R, so any ideas you are able to throw my way are greatly appreciated. Thanks!
Using a while loop indeed seems to be the way to go. Here's what I think you're looking for:
set.seed(10) #Making results reproducible
replicate(100, { #100 is easier to display here
phase1 <- matrix(rnorm(nphase1*varcount, 0, 1), nrow = nphase1, ncol=varcount)
mean_var <- colMeans(phase1) #Slightly better than apply
std_var <- apply(phase1, 2, sd)
df_var <- data.frame(mean_var, std_var)
Upper_SPC_Limit_Method1 <- with(df_var, mean_var + 3 * std_var)
Lower_SPC_Limit_Method1 <- with(df_var, mean_var - 3 * std_var)
df_control_limits<- data.frame(Upper_SPC_Limit_Method1, Lower_SPC_Limit_Method1)
#Phase 2
x <- 0
count <- 0
while(x > Lower_SPC_Limit_Method1 && x < Upper_SPC_Limit_Method1) {
x <- rnorm(1)
count <- count + 1
}
count
})
The result is:
[1] 225 91 97 118 304 275 550 58 115 6 218 63 176 100 308 844 90 2758
[19] 161 311 1462 717 2446 74 175 91 331 210 118 1517 420 32 39 201 350 89
[37] 64 385 212 4 72 730 151 7 1159 65 36 333 97 306 531 1502 26 18
[55] 67 329 75 532 64 427 39 352 283 483 19 9 2 1018 137 160 223 98
[73] 15 182 98 41 25 1136 405 474 1025 1331 159 70 84 129 233 2 41 66
[91] 1 23 8 325 10 455 363 351 108 3
If performance becomes a problem, perhaps it would be interesting to explore some improvements, like creating more numbers with rnorm() at a time and then counting how many are necessary to exceed the limits and repeat if necessary.
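A minimal sketch of that batching idea, under stated assumptions: the function name run_length, its arguments and the batch size of 1000 are made up for illustration, and the limits are passed in as plain numbers rather than taken from the phase 1 data frame.

```r
# Hypothetical helper: number of draws until one falls outside (lower, upper)
run_length <- function(lower, upper, mean = 0, sd = 1, batch = 1000) {
  count <- 0                                 # draws consumed in earlier batches
  repeat {
    x <- rnorm(batch, mean, sd)              # draw a whole batch at once
    out <- which(x < lower | x > upper)      # positions outside the limits
    if (length(out) > 0) return(count + out[1])
    count <- count + batch                   # nothing outside: draw another batch
  }
}

set.seed(10)
rl <- run_length(lower = -3, upper = 3)
rl
```

The first out-of-limits position within a batch, plus the number of draws used by earlier batches, gives the same run length the while loop counts one draw at a time.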

Basic operations on Simple Triplet Matrix (Document Term Matrix)

I am struggling to understand how to do basic operations with Simple Triplet Matrix produced by TermDocumentMatrix() of the tm package.
It seems that the problem could be with the matrices not being recognized as numeric.
library(tm)
data("crude")
tdm <- TermDocumentMatrix(crude)
vector <- tdm[,1]
matrix <- tdm[,2:20]
multiplication <- t(vector) %*% matrix
# Error in t(vector) %*% matrix :
# requires numeric/complex matrix/vector arguments
But
multiplication <- t(as.matrix(vector)) %*% as.matrix(matrix)
multiplication
# Docs
# Docs 144 191 194 211 236 237 242 246 248 273 349 352 353 368 489 502 543 704 708
# 127 232 56 62 65 201 214 61 159 244 197 51 90 71 84 96 126 90 152 11
I have a very large Term Document Matrix which doesn't allow me to transform the sparse matrix into a dense matrix with as.matrix().
Is there any way to operate directly on the Simple Triplet Matrix without converting it to a different class (like sparseMatrix() of the Matrix package)?
The slam package has methods for simple triplet matrices (v and m below stand for the question's vector and matrix objects):
library(slam)
matprod_simple_triplet_matrix(t(v), m)
Or equivalently:
crossprod_simple_triplet_matrix(v, m)
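The sketch below checks the slam call against the dense computation on a tiny made-up matrix. It assumes the slam package is installed; the toy data are hypothetical and only large enough to verify by hand.

```r
library(slam)  # assumed to be installed

# Small dense matrix (toy data), converted to simple triplet form
dense <- matrix(c(1, 0, 2,
                  0, 3, 0,
                  4, 0, 5,
                  0, 6, 0), nrow = 4, byrow = TRUE)
stm <- as.simple_triplet_matrix(dense)

v <- stm[, 1]      # first column, still a simple triplet matrix
m <- stm[, 2:3]    # remaining columns

# t(v) %*% m computed without densifying the inputs
res <- crossprod_simple_triplet_matrix(v, m)

# Dense reference for comparison (feasible only because the toy data are tiny)
ref <- crossprod(dense[, 1, drop = FALSE], dense[, 2:3])
all.equal(as.vector(res), as.vector(ref))
```

The result itself is an ordinary dense matrix, which is fine here because a 1 x (number of documents) product is small even when the term-document matrix is not.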

Clustering Large Data Matrix using R

I have a large data matrix (33183x1681), each row corresponding to one observation and each column corresponding to the variables.
I applied K-medoids clustering using PAM function in R, and I tried to visualize the clustering results using the built-in plots available with the PAM function. I got this error:
Error in princomp.default(x, scores = TRUE, cor = ncol(x) != 2) :
cannot use cor=TRUE with a constant variable
I think this problem is because of the high dimensionality of the data matrix I'm trying to cluster.
Any thoughts/ideas how to tackle this issue?
Check out the clara() function in package cluster which is shipped with all versions of R.
library("cluster")
## generate 500 objects, divided into 2 clusters.
x <- rbind(cbind(rnorm(200,0,8), rnorm(200,0,8)),
cbind(rnorm(300,50,8), rnorm(300,50,8)))
clarax <- clara(x, 2, samples=50)
clarax
> clarax
Call: clara(x = x, k = 2, samples = 50)
Medoids:
[,1] [,2]
[1,] -1.15913 0.5760027
[2,] 50.11584 50.3360426
Objective function: 10.23341
Clustering vector: int [1:500] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 ...
Cluster sizes: 200 300
Best sample:
[1] 10 17 45 46 68 90 99 150 151 160 184 192 232 238 243 250 266 275 277
[20] 298 303 304 313 316 327 333 339 353 358 398 405 410 411 421 426 429 444 447
[39] 456 477 481 494 499 500
Available components:
[1] "sample" "medoids" "i.med" "clustering" "objective"
[6] "clusinfo" "diss" "call" "silinfo" "data"
Note that you should study the help for clara() (?clara) in some detail as well as the references cited in order to make the clustering performed by clara() as close to or identical to pam().
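Separately from switching to clara(), the error message in the question mentions a constant variable. One possible reading, judging from that message alone, is that at least one column of the data matrix has zero variance. The sketch below shows one way to detect and drop such columns before plotting; the toy matrix is made up and whether this applies to the real data is an assumption.

```r
set.seed(1)
# Toy matrix (hypothetical) with one constant column, b
x <- cbind(a = rnorm(20), b = 1, c = rnorm(20))

# Flag columns whose variance is exactly zero
constant <- apply(x, 2, function(col) var(col) == 0)

# Keep only the non-constant columns; drop = FALSE keeps the matrix shape
x_clean <- x[, !constant, drop = FALSE]
colnames(x_clean)
# [1] "a" "c"
```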
