Connect dots function - r

I am learning the cross-validation method.
In the lines below, the input and query are both a data frame.
my.knn <- get.knnx(input,query,k=2)
nn.index <- my.knn$nn.index
What does the second line mean? What will nn.index be?

my.knn is a list of variables. So nn.index is taking that value out of the list so you can work on it as a single variable.
EXAMPLE OF GETTING ELEMENTS OUT OF A LIST
stats <- list("mean" = 10, "data" = c(0, 10 ,20))
#just get the average out
my.average <- stats$mean
So a list can hold different kinds of results from your testing, and can mix variable types (integers, strings, vectors). The $ syntax takes one of the variables out of the list into a single variable.
If you type my.knn at the prompt you will see its contents with sections marked with $. This will help see what is in your list.
In the example:
> stats
$mean
[1] 10
$data
[1] 0 10 20
SPECIFICS ON FUNCTION
I looked at get.knnx function notes, assuming you are using FNN package, here http://www.inside-r.org/packages/cran/fnn/docs/get.knn:
The output is a list containing:
nn.index
an n x k matrix for the nearest neighbor indice(s).
nn.dist
an n x k matrix for the nearest neighbor Euclidean distances.
So you can see your function output list has these two variables - an index of the nearest neighbour, and the second is the distances.
Trust this helps.
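To make the shape of nn.index concrete, here is a small base-R sketch that imitates what a k-nearest-neighbour query returns, without calling FNN itself. The toy input and query data frames and the helper dist_to_input are made up for illustration; get.knnx should return matrices of the same shape, but that is an assumption and is not checked here.

```r
# Hypothetical toy data: 4 reference points (input) and 2 query points
input <- data.frame(x = c(0, 1, 5, 6), y = c(0, 1, 5, 6))
query <- data.frame(x = c(0.2, 5.4), y = c(0.1, 5.5))
k <- 2

# Euclidean distance from one query point q to every row of input
dist_to_input <- function(q) sqrt(rowSums(sweep(as.matrix(input), 2, q)^2))

# d[i, j] = distance from query row i to input row j (2 x 4 here)
d <- t(apply(as.matrix(query), 1, dist_to_input))

# One row per query point, one column per neighbour: an n x k matrix
nn.index <- t(apply(d, 1, function(r) order(r)[1:k]))
nn.dist  <- t(apply(d, 1, function(r) sort(r)[1:k]))

nn.index[1, ]  # rows of input closest to the first query point: 1 2
nn.index[2, ]  # rows of input closest to the second query point: 3 4
```

So each row of nn.index answers "which rows of input are nearest to this row of query?", and nn.dist holds the matching distances.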

Related

How to use dim. argument in rowVars on an array in R

This question is quite basic: I'm very confused by the documentation for rowVars in the matrixStats package in R.
I have an array of dimensions (12, 12, 10000), ie 10000 12x12 matrices. rowMeans very easily gives me the mean of each row of each matrix in the form of a list with 10000 items. I want to do the same with rowVars to get variances.
This works fine for a single matrix, but for anything with more dimensions it gives an error message saying to use the dim argument, and I don't understand how it works. The package documentation says that dim is "An integer vector of length two specifying the dimension of x, also when not a matrix." (where x is the object the function is to be used on). However, I don't understand what this means, and haven't been able to find any helpful examples of it in use. What does 'specifying the dimension' mean- specifying how many dimensions? or specifying the size of each eg (12, 12, 10000)? If so, how can it be of length 2?
Thank you!
The documentation for rowVars states that the input x should be a "numeric N x K matrix," so it sounds like this function only supports two-dimensional matrices. If you want the variances of each row for your 10000 matrices, you could instead do something like this:
library(matrixStats) # provides rowVars
mat <- array(rnorm(12*12*10000,0,1),dim=c(12,12,10000))
rvs <- sapply(1:10000,function(x) rowVars(mat[,,x]))
The resulting object rvs will be a matrix where the nth column is the row variances for the nth 12 x 12 matrix.
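If you would rather not loop over the slices yourself, base R's apply can take two margins at once. This is only a sketch on a small made-up array (the name arr and its size are assumptions), and it uses var from base R rather than rowVars:

```r
set.seed(1)
# Small hypothetical array: 5 matrices of size 3 x 4
arr <- array(rnorm(3 * 4 * 5), dim = c(3, 4, 5))

# MARGIN = c(1, 3): for every (row, slice) pair, take var() across the columns
rvs <- apply(arr, c(1, 3), var)

dim(rvs)                                # 3 5: one column per matrix
all.equal(rvs[2, 5], var(arr[2, , 5]))  # TRUE
```

The layout matches the sapply version: column n holds the row variances of the nth matrix.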

How can I avoid replacement has length zero error

I am trying to generate the term frequency matrix of a document and subsequently look up the frequency of a certain word in a given query in that matrix. In the end I want to sum the frequencies found of the words in the query.
However, I keep getting the error message: Error in feature[i] <- x : replacement has length zero
I do not have a lot of coding experience in general, and this is my first time working with R, thus I am having difficulties solving this error. I presume it has something to do with a null-value. I already tried to avoid the nested for-loop with an apply function because I thought that might help (not sure though), but I could not quite get the hang of how to convert the for-loop into an apply function.
termfreqname <- function(queries,docs){
n <- length(queries)
feature <- vector(length=n)
for(i in 1:n){
query <- queries[i]
documentcorpus <- c(docs[i])
tdm <- TermDocumentMatrix(tm_corpus) #creates the term frequency matrix per document
m <- sapply(strsplit(query, " "), length) #length of the query in words
totalfreq <- list(0) #initialize list
freq_counter <- rowSums(as.matrix(tdm)) #counts the occurrence of a given word in the tdm matrix
for(j in 1:m){
freq <- freq_counter[word(query,j)] #finds frequency of each word in the given query, in the term frequency matrix
totalfreq[[j]] <- freq #adds this frequency to position j in the list
}
x <- reduce(totalfreq,'+') #sums all the numbers in the list
feature[i] <- x #adds this number to feature list
feature
}
}
It depends on your needs, but bottom line you need to add some if statement. How you use it depends on whether you want the default value of the vector to persist. In your code, while feature starts as a logical vector, it is likely coerced to integer or numeric once you overwrite its first value with a number. In that case, the default value in all positions of the vector will be 0 (or 0L, if integer). That's going to influence your decision on how to use the if statement.
if (length(x)) feature[i] <- x
This will only attempt to overwrite the ith value of feature if the x object has non-zero length (it is equivalent to if (length(x) > 0)). In this case, since the default value in the vector will be zero, you will not be able to distinguish afterwards between an element known to be 0 and an element for which nothing was found.
The alternative (and my preference/recommendation):
feature[i] <- if (length(x)) x else NA
In this case, when you are done, you can clearly distinguish between known-zero (0) and uncertain/unknown values (NA). When doing math operations on that vector, you might want/need na.rm=TRUE ... but it all depends on your use.
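A minimal sketch of both behaviours, using a made-up feature vector and x set to NULL to stand in for a lookup that found nothing:

```r
feature <- numeric(3)   # hypothetical result vector, all zeros
x <- NULL               # what a failed lookup or reduction can return

# Assigning a zero-length value reproduces the error; capture its message
msg <- tryCatch({
  feature[1] <- x
  "no error"
}, error = function(e) conditionMessage(e))
msg  # the "replacement has length zero" message

# Guarded assignment: store NA when nothing was found
feature[1] <- if (length(x)) x else NA
feature[2] <- if (length(5)) 5 else NA
feature  # NA 5 0
```

The third element is an untouched default, which is exactly the ambiguity with zero that the NA variant avoids.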
BTW, as MartinGal noted, your use of reduce(totalfreq, '+') is a little flawed: '+' as a quoted string may not be (is not?) recognized as a known function. The first fix is to put backticks around the operator instead of quotes, so
totalfreq <- 5:7
reduce(totalfreq, '+')
# NULL
reduce(totalfreq, `+`)
# [1] 18
sum(totalfreq)
# [1] 18
Of these, the last is much preferred. Why? With a vector of length 4, for instance, reduce takes the first two elements and adds them, then adds that result to the third, then adds that result to the fourth: three operations. With 100 elements, it makes 99 individual additions. sum does it all in a single vectorized call, and this has a noticeable effect on performance as the input grows.
However, if totalfreq is instead a list, then this changes slightly:
totalfreq <- as.list(5:7)
reduce(totalfreq, `+`)
# [1] 18
sum(totalfreq)
# Error in sum(totalfreq) : invalid 'type' (list) of argument
sum(unlist(totalfreq))
# [1] 18
The reduce code still works, and the sum by itself fails, but we can unlist the list first, effectively creating a vector, and then call sum on that. Much much faster asymptotically. And perhaps clearer, more declarative.
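For comparison, the same idea can be sketched without purrr. Base R's Reduce (capital R) is used here as a stand-in, so this is not the code from the question:

```r
totalfreq <- as.list(5:7)

# Pairwise reduction: (5 + 6) + 7, one call to `+` per step
total_reduce <- Reduce(`+`, totalfreq)

# Flatten the list to a vector first, then add everything in one call
total_sum <- sum(unlist(totalfreq))

total_reduce  # 18
total_sum     # 18
```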
(I'm assuming purrr::reduce, btw ...)
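Putting the pieces together, here is a rough base-R sketch of the whole task: sum, for each query, how often its words occur in the matching document. It deliberately avoids tm and purrr; the function name query_term_freq, the simple tokenisation and the example strings are all assumptions for illustration, not the original poster's setup.

```r
# Hypothetical helper: total frequency of a query's words in one document
query_term_freq <- function(query, doc) {
  # Crude tokenisation: lower-case, split on anything that is not a letter
  doc_words   <- strsplit(tolower(doc), "[^a-z]+")[[1]]
  query_words <- strsplit(tolower(query), " +")[[1]]
  # Count each query word; a word that never occurs simply counts as 0
  per_word <- vapply(query_words, function(w) sum(doc_words == w), integer(1))
  sum(per_word)
}

queries <- c("the cat dog", "mat")
docs    <- c("The cat sat on the mat", "A dog barked")

# One value per (query, document) pair, no nested for loops
feature <- unname(mapply(query_term_freq, queries, docs))
feature  # 3 0
```

Because missing words contribute 0 rather than a zero-length value, the assignment problem from the question does not arise in this version.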

Indexing variables in R

I am normally a maple user currently working with R, and I have a problem with correctly indexing variables.
Say I want to define 2 vectors, v1 and v2, and I want to call the nth element in v1. In maple this is easily done:
v[1]:=some vector,
and the nth element is then called by the command
v[1][n].
How can this be done in R? The actual problem is as follows:
I have a sequence M (say of length 10, indexed by k) of simulated negbin variables. For each of these simulated variables I want to construct a vector X of length M[k] with entries given by some formula. So I should end up with 10 different vectors, each of different length. My incorrect code looks like this
sims<-10
M<-rnegbin(sims, eks_2016_kasko*exp(-2.17173), 840.1746)
for(k in 1:sims){
x[k]<-rep(NA,M[k])
X[k]<-rep(NA,M[k])
for(i in 1:M[k]){x[k][i]<-runif(1,min=0,max=1)
if(x[k][i]>=0 & x[i]<=0.1056379){
X[k][i]<-rlnorm(1, 6.228244, 0.3565041)}
else{
X[k][i]<-rlnorm(1, 8.910837, 1.1890874)
}
}
}
The error appears to be that x[k] is not a valid name for a variable. Any way to make this work?
Thanks a lot :)
I've edited your R script slightly to get it working and make it reproducible. To do this I had to assume that eks_2016_kasko was an integer value of 10.
require(MASS)
sims<-10
# Add one so that no M[k] is zero: R is not zero-indexed, so 1:0 would loop over c(1, 0)
M<-rnegbin(sims, 10*exp(-2.17173), 840.1746) + 1
# Create a list
x <- list()
X <- list()
for(k in 1:sims){
x[[k]]<-rep(NA,M[k])
X[[k]]<-rep(NA,M[k])
for(i in 1:M[k]){
x[[k]][i]<-runif(1,min=0,max=1)
if(x[[k]][i]>=0 & x[[k]][i]<=0.1056379){
X[[k]][i]<-rlnorm(1, 6.228244, 0.3565041)}
else{
X[[k]][i]<-rlnorm(1, 8.910837, 1.1890874)
}
}
}
This will work and I think is what you were trying to do, BUT it is not great R code. I strongly recommend using the lapply family instead of for loops, and learning data.table and parallelisation if you need things to scale. Additionally, if you want to read more about indexing and subsetting in R, Hadley Wickham has a comprehensive breakdown here.
Hope this helps!
Let me start with a few remarks and then show you how your problem can be solved using R.
In R, there is most of the time no need to use a for loop in order to assign several values to a vector. So, for example, to fill a vector of length 100 with uniformly distributed random variables, you do something like:
set.seed(1234)
x1 <- rep(NA, 100)
for (i in 1:100) {
x1[i] <- runif(1, 0, 1)
}
(set.seed() is used to set the random seed, such that you get the same result each time.) It is much simpler (and also much faster) to do this instead:
set.seed(1234) # reset the seed so both approaches draw the same numbers
x2 <- runif(100, 0, 1)
identical(x1, x2)
## [1] TRUE
As you see, results are identical.
The reason that x[k]<-rep(NA,M[k]) does not work is that x[k] is indeed not a valid variable name in R. [ is used for indexing, so x[k] refers to element k of a vector x. Assigning a vector of length greater than 1 to a single element does not do what you want: R warns and keeps only the first value (and fails outright if x has not been created yet). What you probably want to use is a list, as you will see in the example below.
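As a quick illustration of the list approach (the values are made up), the Maple-style v[1][n] corresponds to double brackets followed by single brackets in R:

```r
# A list can hold vectors of different lengths
x <- vector("list", 2)
x[[1]] <- c(0.3, 0.7, 0.1)
x[[2]] <- c(0.5, 0.9)

x[[1]][2]    # second element of the first vector: 0.7
length(x[[2]])  # 2
class(x[1])  # "list": single brackets return a sub-list, not the vector
```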
So here comes the code that I would use instead of what you proposed in your post. Note that I am not sure that I correctly understood what you intend to do, so I will also describe below what the code does. Let me know if this fits your intentions.
# define M
library(MASS)
eks_2016_kasko <- 486689.1
sims<-10
M<-rnegbin(sims, eks_2016_kasko*exp(-2.17173), 840.1746)
# define the function that calculates X for a single value from M
calculate_X <- function(m) {
  x <- runif(m, min=0,max=1)
  # x <= 0.1056379 selects the first log-normal, matching the if/else in the question
  ifelse(x <= 0.1056379, rlnorm(m, 6.228244, 0.3565041),
         rlnorm(m, 8.910837, 1.1890874))
}
# apply that function to each element of M
X <- lapply(M, calculate_X)
As you can see, there are no loops in that solution. I'll start to explain at the end:
lapply is used to apply a function (calculate_X) to each element of a list or vector (here it is the vector M). It returns a list. So, you can get, e.g. the third of the vectors with X[[3]] (note that [[ is used to extract elements from a list). And the contents of X[[3]] will be the result of calculate_X(M[3]).
The function calculate_X() does the following: It creates a vector of m uniformly distributed random values (remember that m runs over the elements of M) and stores that in x. Then it creates a vector X that contains log normally distributed random variables. The parameters of the distribution depend on the value x.
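One thing the vectorised version gets for free is correct handling of a count of zero, which a loop over 1:M[k] does not. The sketch below checks this; it uses rnbinom from base R as a stand-in for MASS::rnegbin (as far as I know, size plays the role of theta) and a small made-up mean so that zeros actually occur.

```r
set.seed(42)
# Stand-in for rnegbin(sims, mu, theta); the mean 10 * exp(-2.17173) is hypothetical
M <- rnbinom(10, size = 840.1746, mu = 10 * exp(-2.17173))

calculate_X <- function(m) {
  x <- runif(m)
  ifelse(x <= 0.1056379, rlnorm(m, 6.228244, 0.3565041),
         rlnorm(m, 8.910837, 1.1890874))
}

X <- lapply(M, calculate_X)

# Every vector has exactly M[k] entries, including empty ones when M[k] is 0
all(lengths(X) == M)  # TRUE
```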

Creating multiple matrices with a "for" loop

I am currently in a statistics class working on multivariate clustering and classification. For our homework we are trying to use a 10 fold cross validation to test how accurate different classification methods are on a 6 variable data set with three classifications. I was hoping I could get some help on creating a for loop (or something else which would be better that I don't know about) to create and run 10 classifications and validations so I don't have to repeat myself 10 times on everything.... Here is what I have. It will run but the first two matrices only show the first variable. Because of this, I have not been able to troubleshoot the other parts.
index<-sample(1:10,90,rep=TRUE)
table(index)
training=NULL
leave=NULL
Trfootball=NULL
football.pred=NULL
for(i in 1:10){
training[i]<-football[index!=i,]
leave[i]<-football[index==i,]
Trfootball[i]<-rpart(V1~., data=training[i], method="class")
football.pred[i]<- predict(Trfootball[i], leave[i], type="class")
table(Actual=leave[i]$"V1", classfied=football.pred[i])}
Removing the "[i]" and replacing them with 1:10 individually works right now....
Your problem lies in the assignment of a data.frame or matrix to an element of a vector that you initially set to NULL (training and leave). A way to think about it: you are trying to squeeze a whole matrix into an element that can only hold a single number. That's why R has a problem with your code. You need to initialise training and leave as something that can hold the objects you accumulate on each iteration (the R object list, as @akrun points out).
The following example should give you a feel for what is happening and what you can do to fix your problem:
a<-NULL # your set up at the moment
print(a) # NULL as expected
# your football data is either data.frame or matrix
# try assigning those objects to the first element of a:
a[1]<-data.frame(1:10,11:20) # no good
a[1]<-matrix(1:10,nrow=2) # no good either
print(a)
## create "a" upfront, instead of an empty object
# what you need:
a<-vector(mode="list",length=10)
print(a) # empty list with 10 locations
## to assign and extract elements out of a list, use the "[[" double brackets
a[[1]]<-data.frame(1:10,11:20)
#access data.frame in "a"
a[1] ## no good
a[[1]] ## what you need to extract the first element of the list
## how does it look when you add an extra element?
a[[2]]<-matrix(1:10,nrow=2)
print(a)
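Applied to the cross-validation loop, the same pattern might look like the sketch below. The football data is not available, so the built-in iris data set stands in for it, and the fold assignment is balanced with rep rather than drawn with replacement; both are assumptions. The model fitting itself is left as a comment.

```r
set.seed(1)
dat <- iris   # stand-in data set; the question's football data is not available
n_folds <- 10

# Balanced fold labels, shuffled: every fold gets the same number of rows
index <- sample(rep(1:n_folds, length.out = nrow(dat)))

# Pre-allocate lists so each element can hold a whole data frame
training <- vector("list", n_folds)
leave    <- vector("list", n_folds)

for (i in seq_len(n_folds)) {
  training[[i]] <- dat[index != i, ]
  leave[[i]]    <- dat[index == i, ]
  # a model would be fitted on training[[i]] and evaluated on leave[[i]] here
}

sapply(leave, nrow)  # 15 rows in each of the 10 folds
```

Fitted models and prediction tables can be stored the same way, each in its own pre-allocated list and assigned with [[i]].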

Looping through sequence objects in a list?

I have a list that contains 24 TraMineR sequence objects. Now I want to calculate the Optimal Matching distances for each of these sequence objects (only within each object) and store it in a new list, now consisting of 24 OM distance objects (distance matrices).
The dataset can be found here.
library(TraMineR)
sequences <- read.csv(file = "event-stream-20-l-m.csv", header = TRUE, nrows=10)
repo_names = colnames(sequences)
# 1. Loop across and define the 24 sequence objects & store them in sequence_objects
colpicks <- seq(10,240,by=10)
sequence_objects <- mapply(function(start,stop) seqdef(sequences[,start:stop]), colpicks- 9, colpicks)
# 2. Calculate the costs for OM distances within each object
costs <- mapply(seqsubm(sequence_objects, method="TRATE"))
# 3. Calculate the OM distance objects for each sequence object
sequences.om <- seqdist(sequence_objects, method="OM", indel=1, sm=costs, with.missing=FALSE, norm="maxdist")
Step (1) works fine, but when I progress to step (2), it tells me:
Error in seqsubm(sequence_objects, method = "TRATE") :
[!] data is NOT a sequence object, see seqdef function to create one
This is natural, because sequence_objects is not a sequence object, but a list of sequence objects.
How can I apply the seqsubm function to a list of sequence objects?
I'm not familiar with the TraMineR package, however it looks like you are trying to iterate over the elements of sequence_objects.
mapply is for iterating over multiple objects simultaneously.
lapply in contrast is for iterating over a single object.
Therefore, the following might work for you:
costs <- lapply(sequence_objects, seqsubm, method="TRATE")
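The difference matters again in step 3 of the question, where each sequence object presumably needs its own cost matrix, so two lists have to be walked in parallel. The sketch below shows only the iteration pattern with plain vectors as stand-ins; it does not call TraMineR, and the names objs and offsets are made up.

```r
# Stand-ins for a list of objects and a matching list of per-object settings
objs    <- list(a = 1:3, b = 4:6)
offsets <- list(a = 10, b = 100)

# lapply: one list, extra arguments are the same for every element
scaled <- lapply(objs, function(v, factor) v * factor, factor = 2)

# Map (a wrapper around mapply): several lists, matched element by element
shifted <- Map(function(v, off) v + off, objs, offsets)

scaled$b   # 8 10 12
shifted$b  # 104 105 106
```

Under that assumption, the distance step would pass sequence_objects and costs together to Map or mapply, with the fixed arguments such as method given once.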
