Updated with code that addresses most of my questions.
I have a modest function to generate iterations of an MLE for a population estimate. (I know that explicit loops are considered poor form in R, but I am trying to show a nonlinear search procedure in detail, to accompany methods from an Excel spreadsheet.)
n <- c(32,54,37,60,41) # number of captures
R <- c(32,36,6,13,5) # of marked fish returned to the population
fn <- function(x){
N = 97 #starting value of N
mle = matrix(0, nrow=x, ncol=8) #per suggestion
colnames(mle) = c("N","g.N","h.N","N1","g.N1","h.N1","delta.h","corr") #added column names
for (i in 1:x) {
g.N = prod(1-n/N)
h.N = N-sum(R)-N*g.N
N1 = N-1
g.N1 = prod(1-n/N1)
h.N1 = N1-sum(R)-N*g.N1
delta.h = h.N-h.N1
corr = -h.N/delta.h
#print(c(N,g.N,h.N,N1,g.N1,h.N1,delta.h,corr))#original output
mle[i,] = c(N,g.N,h.N,N1,g.N1,h.N1,delta.h,corr) #per suggestion
N = N+corr
}
return(mle) #per suggestion
}
fn(5)
This creates the following output
N g.N h.N N1 g.N1 h.N1 delta.h corr
[1,] 97.00000 0.04046356 1.075034e+00 96.00000 0.03851149 0.2643856 0.8106486 -1.326141e+00
[2,] 95.67386 0.03788200 4.954192e-02 94.67386 0.03597455 -0.7679654 0.8175073 -6.060119e-02
[3,] 95.61326 0.03776543 2.382189e-03 94.61326 0.03586008 -0.8154412 0.8178234 -2.912841e-03
[4,] 95.61035 0.03775983 1.147664e-04 94.61035 0.03585458 -0.8177238 0.8178386 -1.403289e-04
[5,] 95.61020 0.03775956 5.529592e-06 94.61020 0.03585432 -0.8178338 0.8178393 -6.761220e-06
I would like to clean up the output, but have not been able to crack the code to put the results in a matrix, data.frame or any other format where I can give column titles and adjust the digits, numeric format, etc. in a meaningful manner. I have had limited success with cat and format but have been unable to get them to do precisely what I would like. Any help formatting this as a table, matrix or data.frame would be appreciated.
Your function doesn't actually work for me (what's n for example). Anyway, you should have something like:
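fn <- function(x) {
N <- 97 # starting value of N
m <- matrix(0, nrow = x, ncol = 7) # one row per iteration, one column per stored quantity
for (i in 1:x) {
#<snip>
m[i,] <- c(N, g.N, N1, g.N1, h.N1, delta.h, corr) # store this iteration's values in row i
N <- N + corr
}
return(m)
}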
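Once the results are in a matrix with column names, the formatting asked about in the question can be handled column by column. The following is only a sketch: the small matrix mle is a stand-in built from two rows and three columns of the output shown in the question, and the rounding choices are arbitrary assumptions.

```r
# Stand-in for the matrix returned by fn(); values copied from the question's output
mle <- matrix(c(97.00000, 0.04046356, -1.326141e+00,
                95.67386, 0.03788200, -6.060119e-02),
              nrow = 2, byrow = TRUE,
              dimnames = list(NULL, c("N", "g.N", "corr")))

tab <- as.data.frame(mle)                 # a data.frame allows per-column formatting
tab$N    <- round(tab$N, 2)               # fixed number of decimals
tab$g.N  <- signif(tab$g.N, 3)            # fixed number of significant digits
tab$corr <- formatC(tab$corr, format = "e", digits = 2)  # scientific notation, as text
print(tab, row.names = FALSE)
```

Note that formatC returns character strings, so it is best applied only at the point of printing.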
I'm trying to perform k-means on a data frame with 69 columns and 1000 rows. First, I need to decide on the optimal number of clusters using the Davies-Bouldin index. This algorithm requires the input to be a matrix, so I first used this code:
totalm <- data.matrix(total)
Followed by the following code (Davies-Bouldin index)
clusternumber<-0
max_cluster_number <- 30
#Davies Bouldin algorithm
library(clusterCrit)
smallest <-99999
for(b in 2:max_cluster_number){
a <-99999
for(i in 1:200){
cl <- kmeans(totalm,b)
cl<-as.numeric(cl)
intCriteria(totalm,cl$cluster,c("dav"))
if(intCriteria(totalm,cl$cluster,c("dav"))$davies_bouldin < a){
a <- intCriteria(totalm,cl$cluster,c("dav"))$davies_bouldin }
}
if(a<smallest){
smallest <- a
clusternumber <-b
}
}
print("##clusternumber##")
print(clusternumber)
print("##smallest##")
print(smallest)
I keep getting this error: (list) object cannot be coerced to type 'double'.
How can I solve this?
Reproducible example:
a <- c(0,0,1,0,1,0,0)
b <- c(0,0,1,0,0,0,0)
c <- c(1,1,0,0,0,0,1)
d <- c(1,1,0,0,0,0,0)
total <- cbind(a,b,c,d)
The error is coming from cl<-as.numeric(cl). The result of a call to kmeans is an object, which is a list containing various information about the model.
Run ?kmeans
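To see this directly, here is a minimal sketch on made-up data (the matrix m and the seed are assumptions for illustration, not the asker's data). It shows that the fitted object is a list and that the cluster assignments live in its cluster element.

```r
set.seed(1)
m <- matrix(rnorm(40), ncol = 2)      # 20 made-up points in 2 dimensions
cl <- kmeans(m, centers = 2, nstart = 20)

is.list(cl)          # the fit is a list with class "kmeans"
names(cl)            # includes "cluster", "centers", "tot.withinss", ...
length(cl$cluster)   # one cluster label per row of m

# Coercing the whole list to numeric fails, which is the error in the question
failed <- inherits(try(as.numeric(cl), silent = TRUE), "try-error")
failed
```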
I would also recommend adding nstart = 20 to your kmeans call. k-means starts from randomly chosen centers, so a single run can end up in a poor local optimum. With nstart = 20, kmeans tries 20 random starts for each number of centers and keeps the best fit.
for(b in 2:max_cluster_number){
a <-99999
for(i in 1:200){
cl <- kmeans(totalm,centers = b,nstart = 20)
# cl<-as.numeric(cl) is removed: kmeans() returns a list, not a number
db <- intCriteria(totalm,cl$cluster,c("dav"))$davies_bouldin # compute the index once per run
if(db < a){
a <- db }
}
if(a<smallest){
smallest <- a
clusternumber <-b
}
}
This gave me
[1] "##clusternumber##"
[1] 4
[1] "##smallest##"
[1] 0.138675
(temporarily changing max clusters to 4, as the reproducible data is a small set)
EDIT Integer Error
I was able to reproduce your error using
a <- as.integer(c(0,0,1,0,1,0,0))
b <- as.integer(c(0,0,1,0,0,0,0))
c <- as.integer(c(1,1,0,0,0,0,1))
d <- as.integer(c(1,1,0,0,0,0,0))
totalm <- cbind(a,b,c,d)
So that an integer matrix is created.
I was then able to remove the error by using
storage.mode(totalm) <- "double"
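As a small check of what that line does, the sketch below builds a tiny integer matrix (made-up values) and converts its storage mode in place. The dimensions and column names are kept; only the underlying type changes.

```r
# Tiny integer matrix, made up for illustration
totalm <- cbind(a = c(0L, 0L, 1L), b = c(1L, 0L, 0L))
before <- typeof(totalm)            # "integer"

storage.mode(totalm) <- "double"    # convert in place, keeping dim and dimnames
after <- typeof(totalm)             # "double"

c(before = before, after = after)
```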
Note that
total <- cbind(a,b,c,d)
totalm <- data.matrix(total)
is unnecessary for the data in this example
> identical(total,totalm)
[1] TRUE
I am currently working my way through the book 'R for Data Science'.
I am trying to solve this exercise question (21.2.1 Q1.4) but have not been able to determine the correct output before starting the for loop.
Write a for loop to:
Generate 10 random normals for each of μ= −10, 0, 10 and 100.
Like the previous questions in the book I have been trying to insert into a vector output but for this example, it appears I need the output to be a data frame?
This is my code so far:
values <- c(-10,0,10,100)
output <- vector("double", 10)
for (i in seq_along(values)) {
output[[i]] <- rnorm(10, mean = values[[i]])
}
I know the output is wrong but am unsure how to create the format I need here. Any help much appreciated. Thanks!
There are many ways of doing this. Here is one. See inline comments.
set.seed(357) # to make things reproducible, set random seed
N <- 10 # number of loops
xy <- vector("list", N) # create an empty list into which values are to be filled
# run the loop N times and on each loop...
for (i in 1:N) {
# generate a data.frame with 4 columns, and add a random number into each one
# random number depends on the mean specified
xy[[i]] <- data.frame(um10 = rnorm(1, mean = -10),
u0 = rnorm(1, mean = 0),
u10 = rnorm(1, mean = 10),
u100 = rnorm(1, mean = 100))
}
# result is a list of data.frames with 1 row and 4 columns
# you can bind them together into one data.frame using do.call
# rbind means they will be merged row-wise
xy <- do.call(rbind, xy)
um10 u0 u10 u100
1 -11.241117 -0.5832050 10.394747 101.50421
2 -9.233200 0.3174604 9.900024 100.22703
3 -10.469015 0.4765213 9.088352 99.65822
4 -9.453259 -0.3272080 10.041090 99.72397
5 -10.593497 0.1764618 10.505760 101.00852
6 -10.935463 0.3845648 9.981747 100.05564
7 -11.447720 0.8477938 9.726617 99.12918
8 -11.373889 -0.3550321 9.806823 99.52711
9 -7.950092 0.5711058 10.162878 101.38218
10 -9.408727 0.5885065 9.471274 100.69328
Another way would be to pre-allocate a matrix, add in values and coerce it to a data.frame.
xy <- matrix(NA, nrow = N, ncol = 4)
for (i in 1:N) {
xy[i, ] <- rnorm(4, mean = c(-10, 0, 10, 100))
}
# note that the column names are assigned after the loop
colnames(xy) <- c("um10", "u0", "u10", "u100")
xy <- as.data.frame(xy)
As this is a learning question I will not provide the solution directly.
> values <- c(-10,0,10,100)
> for (i in seq_along(values)) {print(i)} # Checking we iterate by position
[1] 1
[1] 2
[1] 3
[1] 4
> output <- vector("double", 10)
> output # Checking the place where the output will be
[1] 0 0 0 0 0 0 0 0 0 0
> for (i in seq_along(values)) { # Testing the full code
+ output[[i]] <- rnorm(10, mean = values[[i]])
+ }
Error in output[[i]] <- rnorm(10, mean = values[[i]]) :
more elements supplied than there are to replace
As you can see, the error says that more elements were supplied than there is room for: each iteration generates 10 random numbers (40 in total), but output[[i]] in a double vector can hold only a single number. Consider using a data structure that can store several values for each iteration.
So that:
> output <- ??
> for (i in seq_along(values)) { # Testing the full code
+ output[[i]] <- rnorm(10, mean = values[[i]])
+ }
> output # Should have length 4 and each element all the 10 values you created in the loop
# set the number of rows
rows <- 10
# vector with the values
means <- c(-10,0,10,100)
# generating output matrix
output <- matrix(nrow = rows,
ncol = 4)
# setting seed and looping through the number of rows
set.seed(222)
for (i in 1:rows){
output[i,] <- rnorm(length(means),
mean=means)
}
#printing the output
output
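For completeness, here is one possible sketch that keeps the loop over values from the question and only changes the container, following the hint about storing several values per iteration. Using a list, and the element names mu_..., are assumptions for illustration; a matrix or data frame would work as well.

```r
values <- c(-10, 0, 10, 100)

# A list can hold a whole vector in each slot
output <- vector("list", length(values))

set.seed(1)
for (i in seq_along(values)) {
  output[[i]] <- rnorm(10, mean = values[[i]])   # 10 draws for the i-th mean
}
names(output) <- paste0("mu_", values)           # hypothetical labels

str(output)
```

If a data frame is preferred, as.data.frame(output) turns the four vectors into four columns of 10 rows.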
I want to perform an IDW cross-validation and find out which "power" value gives the smallest RMSE. To do this, I want to store the "power" and "RMSE" values in a list and sort them by RMSE, smallest first.
I'd like something like this:
RMSE Power
[1,] 1.230 2.5
[2,] 1.464 1.5
[3,] 1.698 2.0
[4,] 1.932 3.0
What I have so far is this:
require(sp)
require(gstat)
data("meuse")
#### create grid:
pixels <- 500 #define resolution
#define extent
raster.grd <- expand.grid(x=seq(floor(min(x=meuse$x)),
ceiling(max(x=meuse$x)),
length.out=pixels),
y=seq(floor(min(y=meuse$y)),
ceiling(max(y=meuse$y)),
length.out=pixels))
# convert the dataframe to a spatial points and then to a spatial pixels
grd.pts <- SpatialPixels(SpatialPoints((raster.grd)))
grd <- as(grd.pts, "SpatialGrid")
gridded(grd) = TRUE
#### perform IDW and loop through different power-values
power = seq(from = 1.5, to = 3, by = 0.5)
results=list()
results.cv=list()
for(i in power) {
results[[paste0(i,"P")]] <- gstat::idw(meuse$zinc ~ 1, meuse, grd, idp = i)
results.cv[[paste0(i,"P")]] <- krige.cv(zinc ~ 1, meuse, nfold = nrow(meuse),set = list(idp = i))
}
Now my attempt to calculate and store the RMSE with a for-loop:
results_rmse <- list()
pwr <- names(results.cv)
for(i in results.cv){ #for each Element (1.5P, 2P, etc) in results.cv
for(j in 1:length(pwr)){ #for each Power
results_rmse <- sqrt(mean(i$residual^2))
print(pwr[j])
}
print(paste("RMSE",results_rmse))
}
But with this loop, it prints each RMSE individually. So I changed the code like this
results_rmse[[i]] <- sqrt(mean(i$residual^2))
But then I get an error
Error in results_rmse[[i]] <- sqrt(mean(i$residual^2)) : invalid subscript type 'S4'
I tried several versions of the for-loop, but I couldn't even figure out how to store the values in a list, not to mention to sort them by the smallest RMSE.
There is an extra loop for j in the RMSE calculation that is not needed as far as I understand the problem. Also, I rearranged the loop in such a way that it cycles through a sequence of elements rather than calling them by their names.
# Fake residuals, because your script doesn't run for me. The rest is identical to your code
for(i in power) {
results.cv[[paste0(i,"P")]]$residual <- rnorm(50)
}
# Fixed loop
for(i in 1:length(results.cv)){
results_rmse[[i]] <- sqrt(mean(results.cv[[i]]$residual^2))
}
names(results_rmse) <- names(results.cv)
Alternatively, the for loop can be avoided with lapply. The result is a named list that keeps the input names, so the names() line above can be omitted and results_rmse comes out the same.
results_rmse <- lapply(results.cv, function(x) sqrt(mean(x$residual^2)))
To print the data as you showed in your question:
cbind(RMSE=unlist(results_rmse), Power=power)
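The table above is in the order of power, not sorted. A minimal sketch of the sorting step follows, assuming the illustrative RMSE values from the table in the question (they are not real cross-validation results).

```r
# Illustrative values taken from the question's example table, not real results
power <- seq(from = 1.5, to = 3, by = 0.5)
results_rmse <- list("1.5P" = 1.464, "2P" = 1.698, "2.5P" = 1.230, "3P" = 1.932)

tab <- cbind(RMSE = unlist(results_rmse), Power = power)

# order() returns the row permutation that sorts RMSE in increasing order
tab_sorted <- tab[order(tab[, "RMSE"]), ]
tab_sorted
```

The first row of tab_sorted then holds the power with the smallest RMSE.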
I'm fairly new to R and I've been trying for a while to do something, which I assumed to be very simple, but I keep failing at it (unfortunately for me, it doesn't mean it's not simple!).
I have defined a function that takes a time series as an input and outputs a single value (as a vector) at the end. The function has two parameters (from now on n and m) affecting the output, so it looks like this:
fnc <- function(x, n, m)
My goal is to store/see (possibly in a matrix?) the output while changing n and m (so, a basic sensitivity analysis, if this makes sense?).
My attempts were to create an empty matrix, run a nested for loop for several n and m values and fill in the matrix accordingly.
So, something like this (I'm aware that the code below gives an error, as i,j values would end up being out of bounds the 3x3 matrix, but it is just one of the illogical-trial I did):
n_lens = c(750, 1000, 1250)
m_lens = c(250, 300, 350)
output_matrix = matrix(data=NA, nrow = length(n_lens), ncol = length(m_lens))
for (i in n_lens){
for (j in m_lens){
output_matrix[i,j] <- function(x, i, j)
}
}
Unfortunately all of them were far from getting the job done.
Any suggestion/tip is much appreciated.
I took the liberty of defining a simple fnc function.
The idea is to loop over the indices of n_lens and m_lens rather than over their values, so that i and j are valid row and column positions in the matrix.
Nested for loops may be slower in R than vectorized alternatives, but the code below produces the required output.
fnc <- function(x, n, m)
{
return (n+m)
}
n_lens = c(750, 1000, 1250)
m_lens = c(250, 300, 350)
x = 1
len_n = length(n_lens)
len_m = length(m_lens)
output_matrix = matrix(data=NA, nrow = length(n_lens), ncol = length(m_lens))
for (i in seq(len_n)){
for (j in seq(len_m)){
output_matrix[i,j] <- fnc(x, n_lens[i], m_lens[j])
}
}
output_matrix
The output received is
[,1] [,2] [,3]
[1,] 1000 1050 1100
[2,] 1250 1300 1350
[3,] 1500 1550 1600
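As one of the loop-free alternatives, the same grid can be filled with outer. This is a sketch under the assumption that fnc returns a single value per (n, m) pair; Vectorize is used because outer passes whole vectors of n and m at once, and it can be dropped if fnc is already vectorized. The dimnames are added only for readability.

```r
fnc <- function(x, n, m) n + m      # same toy function as in the answer above

n_lens <- c(750, 1000, 1250)
m_lens <- c(250, 300, 350)
x <- 1

# Vectorize() makes the wrapper evaluate fnc once per (n, m) pair
output_matrix <- outer(n_lens, m_lens,
                       Vectorize(function(n, m) fnc(x, n, m)))
dimnames(output_matrix) <- list(n = as.character(n_lens),
                                m = as.character(m_lens))
output_matrix
```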
First of all, I am new to R (I started yesterday).
I have two groups of points, data and centers, the first one of size n and the second of size K (for instance, n = 3823 and K = 10), and for each i in the first set, I need to find j in the second with the minimum distance.
My idea is simple: for each i, let dist[j] be the distance between i and j, I only need to use which.min(dist) to find what I am looking for.
Each point is an array of 64 doubles, so
> dim(data)
[1] 3823 64
> dim(centers)
[1] 10 64
I have tried with
for (i in 1:n) {
for (j in 1:K) {
d[j] <- sqrt(sum((centers[j,] - data[i,])^2))
}
S[i] <- which.min(d)
}
which is extremely slow (with n = 200, it takes more than 40s!!). The fastest solution that I wrote is
distance <- function(point, group) {
return(dist(t(array(c(point, t(group)), dim=c(ncol(group), 1+nrow(group)))))[1:nrow(group)])
}
for (i in 1:n) {
d <- distance(data[i,], centers)
which.min(d)
}
Even though it does a lot of computation that I don't use (because dist(m) computes the distance between all rows of m), it is much faster than the other one (can anyone explain why?), but it is not fast enough for what I need, because it will not be used only once. Also, the distance code is very ugly. I tried to replace it with
distance <- function(point, group) {
return (dist(rbind(point,group))[1:nrow(group)])
}
but this seems to be about twice as slow. I also tried to use dist for each pair, but that is slower too.
I don't know what to do now. It seems like I am doing something very wrong. Any idea on how to do this more efficiently?
PS: I need this to implement k-means by hand (I have to do it, it is part of an assignment). I believe I will only need Euclidean distance, but I am not yet sure, so I would prefer code where the distance computation can be replaced easily. stats::kmeans does the whole computation in less than one second.
Rather than iterating across data points, you can just condense that to a matrix operation, meaning you only have to iterate across K.
# Generate some fake data.
n <- 3823
K <- 10
d <- 64
x <- matrix(rnorm(n * d), ncol = n)
centers <- matrix(rnorm(K * d), ncol = K)
system.time(
dists <- apply(centers, 2, function(center) {
colSums((x - center)^2)
})
)
Runs in:
user system elapsed
0.100 0.008 0.108
on my laptop.
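To get from the distance matrix to the nearest center for each point, one option is to take the column of the smallest entry in each row. The sketch below uses the same layout as above (points and centers stored as columns) with a smaller made-up n, and breaks ties by taking the first minimum, which is an assumption.

```r
set.seed(42)
n <- 100; K <- 10; d <- 64
x <- matrix(rnorm(n * d), ncol = n)          # one point per column
centers <- matrix(rnorm(K * d), ncol = K)    # one center per column

# n x K matrix of squared Euclidean distances
dists <- apply(centers, 2, function(center) colSums((x - center)^2))

# Column index of the row-wise minimum = index of the nearest center
S <- max.col(-dists, ties.method = "first")

# Cross-check the first point against a direct computation
check <- which.min(colSums((centers - x[, 1])^2))
```

Squared distances are enough here, because taking the square root does not change which center is nearest.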
rdist() is an R function from the {fields} package that quickly calculates the distances between two sets of points given as matrices.
https://www.image.ucar.edu/~nychka/Fields/Help/rdist.html
Usage :
library(fields)
#generating fake data
n <- 5
m <- 10
d <- 3
x <- matrix(rnorm(n * d), ncol = d)
y <- matrix(rnorm(m * d), ncol = d)
rdist(y, x) # rows index the 10 points in y, columns the 5 points in x
[,1] [,2] [,3] [,4] [,5]
[1,] 1.512383 3.053084 3.1420322 4.942360 3.345619
[2,] 3.531150 4.593120 1.9895867 4.212358 2.868283
[3,] 1.925701 2.217248 2.4232672 4.529040 2.243467
[4,] 2.751179 2.260113 2.2469334 3.674180 1.701388
[5,] 3.303224 3.888610 0.5091929 4.563767 1.661411
[6,] 3.188290 3.304657 3.6668867 3.599771 3.453358
[7,] 2.891969 2.823296 1.6926825 4.845681 1.544732
[8,] 2.987394 1.553104 2.8849988 4.683407 2.000689
[9,] 3.199353 2.822421 1.5221291 4.414465 1.078257
[10,] 2.492993 2.994359 3.3573190 6.498129 3.337441
You may want to have a look into the apply functions.
For instance, this code
for (j in 1:K)
{
d[j] <- sqrt(sum((centers[j,] - data[i,])^2))
}
can easily be replaced by something like
dt <- data[i,]
d <- apply(centers, 1, function(x){ sqrt(sum((x-dt)^2))})
You can definitely optimise it further, but I hope you get the point.
dist is fast because it is vectorized and calls internal C functions.
Your loop code could be vectorized in many ways.
For example, to compute the distances between data and centers you could use outer:
diff_ij <- function(i,j) sqrt(rowSums((data[i,]-centers[j,])^2))
X <- outer(seq_len(n), seq_len(K), diff_ij)
This gives you an n x K matrix of distances, and it should be much faster than the loop.
Then you could use max.col to find the position of the maximum in each row (see its help page: there are some nuances when a row has several maxima, since ties are broken at random by default). X must be negated because we are searching for the minimum.
CL <- max.col(-X)
To be efficient in R you should vectorize as much as possible. Loops can in many cases be replaced by a vectorized substitute. Check the help for rowSums (which also describes rowMeans, colSums and colMeans), pmax and cumsum. You could search SO, e.g.
https://stackoverflow.com/search?q=[r]+avoid+loop (copy and paste this link, I don't know how to make it clickable) for some examples.
My solution:
# data is a matrix where each row is a point
# point is a vector of values
euc.dist <- function(data, point) {
apply(data, 1, function (row) sqrt(sum((point - row) ^ 2)))
}
You can try it, like:
x <- matrix(rnorm(25), ncol=5)
euc.dist(x, x[1,])
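Since the question asks for a distance that can be swapped easily, the function above can be passed as an argument. This is only a sketch: assign_nearest and man.dist are hypothetical names introduced here, and the tiny data set is made up.

```r
# Distance from every row of `data` to `point` (same as euc.dist above)
euc.dist <- function(data, point) {
  apply(data, 1, function(row) sqrt(sum((point - row)^2)))
}

# Hypothetical alternative metric with the same interface (Manhattan distance)
man.dist <- function(data, point) {
  apply(data, 1, function(row) sum(abs(point - row)))
}

# Hypothetical helper: index of the nearest center for each row of `data`
assign_nearest <- function(data, centers, dist_fn = euc.dist) {
  apply(data, 1, function(p) which.min(dist_fn(centers, p)))
}

pts  <- rbind(c(0, 0), c(10, 10), c(0, 1))   # made-up points, one per row
cent <- rbind(c(0, 0), c(9, 9))              # made-up centers, one per row
S_euc <- assign_nearest(pts, cent)
S_man <- assign_nearest(pts, cent, dist_fn = man.dist)
```

This is convenient rather than fast: it still calls the distance function once per point, so the matrix-based approaches above are likely to be quicker on large data.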