I've found a page on the R blog with an example of how the genalg library works.
I've written a piece of code, mostly copy-pasted from the page linked above. What I expect from the code is that no chromosome is a good solution, and I've been wondering what happens when, in theory, all the chromosomes should be discarded. It seems that the algorithm is always able to return a solution, but what kind of solution is one that does not satisfy the constraints?
What I would expect is that the algorithm always returns a chromosome made of only zeroes. I ran it several times, but that does not happen: the returned chromosome also contains ones. How is this possible? What am I missing? Does anyone have more experience than me with genetic algorithms in general, and the genalg library for R in particular?
The code:
library("genalg");
evalFunc <- function(x) {
current_solution_survivalpoints <- x %*% dataset$survivalpoints;
current_solution_weight <- x %*% dataset$weight;
if (current_solution_weight > weightlimit){
return(0);
}
else{
return(-current_solution_survivalpoints);
}
}
dataset <- data.frame(item = c("pocketknife", "beans", "potatoes", "unions", "sleeping bag", "rope", "compass"),
survivalpoints = c(10, 20, 15, 2, 30, 10, 30),
weight = c(5, 10, 20, 5, 12, 10, 5));
#this is the constraint that cannot be satisfied
weightlimit <- 4;
GAmodel <- rbga.bin(size = 7, popSize = 200, iters = 100, mutationChance = 0.01, evalFunc = evalFunc);
filter <- GAmodel$evaluations == min(GAmodel$evaluations);
chromosome <- GAmodel$population[filter, , drop= FALSE][1,];
print("the solution is");print(chromosome);
The code below works for extracting sequences using the cspade algorithm.
library("arulesSequences")
df <- data.frame(personID = c(1, 1, 2, 2, 2),
eventID = c(100, 101, 102, 103, 104),
site = c("google", "facebook", "facebook", "askjeeves", "stackoverflow"),
sequence = c(1, 2, 1, 2, 3))
df.trans <- as(df[,"site", drop = FALSE], "transactions")
transactionInfo(df.trans)$sequenceID <- df$sequence
transactionInfo(df.trans)$eventID <- df$eventID
df.trans <- df.trans[order(transactionInfo(df.trans)$sequenceID),]
seq <- cspade(df.trans, parameter = list(support = 0.2),
control = list(verbose = TRUE))
The problem is that my actual data has ~2 million rows, with sequence increasing to ~20 for each person. With the code above, cspade quickly consumes all RAM and R crashes. Does anyone have tips on how to perform sequence mining on large datasets like mine? Thanks!
How many unique IDs do you have in df$sequence? It looks like the last column of your sample dataset has 3 sequence options. Do you think sequences of up to 20 are necessary? One thing you could do is set the maxlen parameter in your cspade call to something like 4 or 5 and evaluate your predictive accuracy, assuming that's what you are after.
So you would have something like seq <- cspade(df.trans, parameter = list(support = 0.2, maxlen = 4), control = list(verbose = TRUE)).
Hope that helps.
I hope this question will be a nice tutorial for beginners in R (such as me).
I was used to programming languages where loops are necessary to manipulate data, implement algorithms, etc.
However, loops in R are slow, which becomes apparent with large data.
Fortunately, R provides built-in functions that iterate over elements and do the calculations in a very efficient way.
Now I'd like to avoid loops when I'm analysing data in R, so I've read about lapply, apply and other useful functions.
I'd like to compute the correlation between the first column and each other column of my data, and print the sample name, the estimate and the p-value in a nice table - everything without a for loop.
My idea - create fake data from scratch:
surv <- c(7.1,8,4,2,0.5,5,6)
geneA_expr <- runif(n = 7, min = 1, max = 30)
geneB_expr <- runif(n = 7, min = 1, max = 30)
geneC_expr <- runif(n = 7, min = 1, max = 30)
my_data <- data.frame(surv, geneA_expr, geneB_expr, geneC_expr)
Correlation test with apply - found it here on Stack Overflow and in the manual:
md_stat <- apply(my_data[,2:4], 2, cor.test, my_data$surv, method="pearson")
md_stat is a list. Now I'd like to visualize it nicely, but I have no idea how to do that; it's too complicated for me, so I used a for loop:
for (i in names(md_stat)) {
  cat(i, md_stat[[i]]$estimate, md_stat[[i]]$p.value, '\n')
}
geneA_expr 0.2517658 0.5860052
geneB_expr 0.2438112 0.5982849
geneC_expr 0.8026801 0.02977544
How can I replace the above for loop with another built-in function?
Unlist every element of md_stat, then bind the outputs into a matrix:
do.call(rbind, lapply(md_stat, unlist))
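If you also want the gene names as a proper column, a small variation on the same idea (my own sketch, reusing the md_stat list from above) builds a data frame with one row per gene:

# one row per gene: name, correlation estimate, p-value
res <- do.call(rbind, lapply(names(md_stat), function(i) {
  data.frame(sample = i,
             estimate = unname(md_stat[[i]]$estimate),
             p.value = md_stat[[i]]$p.value)
}))
res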
Try this
temp <- lapply(seq_along(md_stat), function(i) {
  cat(names(md_stat)[[i]], md_stat[[i]]$estimate, md_stat[[i]]$p.value, '\n')
})
I can think of 4 ways for you to do this, 1 of which depends on the purrr package: a plain loop, walk from the purrr package, lapply, and a recursive function.
library(microbenchmark)
library(purrr)
surv <- c(7.1,8,4,2,0.5,5,6)
geneA_expr <- runif(n = 7, min = 1, max = 30)
geneB_expr <- runif(n = 7, min = 1, max = 30)
geneC_expr <- runif(n = 7, min = 1, max = 30)
my_data <- data.frame(surv, geneA_expr, geneB_expr, geneC_expr)
md_stat <- apply(my_data[,2:4], 2, cor.test, my_data$surv, method="pearson")
md_loop <- function(md_stat) {
  for (i in names(md_stat)) {
    cat(i, md_stat[[i]]$estimate, md_stat[[i]]$p.value, '\n')
  }
}

md_walk <- function(md_stat) {
  walk(names(md_stat), function(i) {
    cat(i, md_stat[[i]]$estimate, md_stat[[i]]$p.value, '\n')
  })
}

md_apply <- function(md_stat) {
  lapply(names(md_stat), function(i) {
    cat(i[[1]], md_stat[[i[[1]]]]$estimate, md_stat[[i[[1]]]]$p.value, '\n')
  })
}

md_recursive <- function(md_stat) {
  i <- names(md_stat)
  if (length(i) < 1) {
    NULL
  } else {
    cat(i[[1]], md_stat[[i[[1]]]]$estimate, md_stat[[i[[1]]]]$p.value, '\n')
    md_recursive(tail(md_stat, -1))
  }
}

md_speed <- microbenchmark(
  md_loop(md_stat),
  md_walk(md_stat),
  md_apply(md_stat),
  md_recursive(md_stat)
)
Speed comparisons (microbenchmark timing output not shown).
I want to calculate the Euclidean distances between the rows of a data frame with 30,000 observations. A simple way to do this is the dist function (e.g., dist(data)). However, since my data frame is large, this takes too much time.
Some of the rows contain missing values. I do not need the distances between pairs of rows that both contain missing values, or between pairs of rows that are both complete.
In a for loop, I tried to exclude the combinations that I do not need. Unfortunately, my solution takes even more time:
# Some example data
data <- data.frame(
x1 = c(1, 22, NA, NA, 15, 7, 10, 8, NA, 5),
x2 = c(11, 2, 7, 15, 1, 17, 11, 18, 5, 5),
x3 = c(21, 5, 6, NA, 10, 22, 12, 2, 12, 3),
x4 = c(13, NA, NA, 20, 12, 5, 1, 8, 7, 14)
)
# Measure speed of dist() function
start_time_dist <- Sys.time()
# Calculate euclidean distance with dist() function for complete dataset
dist_results <- dist(data)
end_time_dist <- Sys.time()
time_taken_dist <- end_time_dist - start_time_dist
# Measure speed of my own loop
start_time_own <- Sys.time()
# Calculate euclidean distance with my own loop only for specific cases
# # #
# The following code should be faster!
# # #
data_cc <- data[complete.cases(data), ]
data_miss <- data[complete.cases(data) == FALSE, ]
distance_list <- list()
for (i in 1:nrow(data_miss)) {
  distances <- numeric()
  for (j in 1:nrow(data_cc)) {
    distances <- c(distances, dist(rbind(data_miss[i, ], data_cc[j, ]), method = "euclidean"))
  }
  distance_list[[i]] <- distances
}
end_time_own <- Sys.time()
time_taken_own <- end_time_own - start_time_own
# Compare speed of both calculations
time_taken_dist # 0.002001047 secs
time_taken_own # 0.01562881 secs
Is there a faster way to calculate the Euclidean distances that I need?
I recommend using parallel computation. Put all your code into one function and run it in parallel.
R does all calculations in one thread by default; you have to add parallel threads manually. Starting clusters in R takes time, but if you have a large data frame, the main job will run about (your_processors_number - 1) times faster.
These links may also help: How-to go parallel in R - basics + tips and A gentle introduction to parallel computing in R.
A good approach is to divide your job into smaller packets and calculate them separately in each thread. Create the threads only once, because starting them is time-consuming in R.
library(parallel)
library(foreach)
library(doParallel)
# I am not sure that all required libraries are listed here;
# try ??yourfunction to find out which library a function needs

# determine how many processors your computer has;
# one processor should always stay free for the system
no_cores <- detectCores() - 1

start.t.total <- Sys.time()
print(start.t.total)
startt <- Sys.time()
print(startt)

# start parallel calculations
cl <- makeCluster(no_cores, outfile = "mycalculation_debug.txt")
registerDoParallel(cl)

# results will be collected in out.df (a data frame)
out.df <- foreach(p = 1:no_cores,
                  .combine = rbind,   # data from different threads will be bound into one table
                  .packages = c(),    # all packages that your function uses must be listed here
                  .inorder = TRUE) %dopar% {   # don't forget this directive
  tryCatch({
    #
    # enter your function here and do what you want in parallel;
    # the value of the last expression becomes this thread's block of out.df
    #
    print(Sys.time() - startt)
    print(Sys.time() - start.t.total)
    data.frame(packet = p, status = "done")   # placeholder per-thread result
  }, error = function(e) {
    paste0("The variable '", p, "' caused the error: '", e, "'")
  })
}
stopCluster(cl)
gc()   # force R to free memory from the killed worker processes
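As a side note, independent of parallelisation: most of the time in the question's loop is spent calling dist(rbind(...)) once per pair. A vectorized sketch (my own illustration, reusing data_cc and data_miss from the question) computes each incomplete row's distances to all complete rows in one sweep, mimicking dist()'s rescaling for missing values:

cc <- as.matrix(data_cc)
miss <- as.matrix(data_miss)
distance_list <- lapply(seq_len(nrow(miss)), function(i) {
  diffs <- sweep(cc, 2, miss[i, ])   # row i subtracted from every complete row
  used <- rowSums(!is.na(diffs))     # number of columns available per pair
  # dist() scales the sum of squares up proportionally to the columns used
  sqrt(rowSums(diffs^2, na.rm = TRUE) * ncol(cc) / used)
})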
I am trying to simulate 5000 samples of size 5 from a normal distribution with mean 5 and standard deviation 3. I then want to compute the mean of each sample and make a histogram of the sample means.
My current code is not giving me an error, but I don't think it's right:
nrSamples = 5000
e <- list(mode="vector",length=nrSamples)
for (i in 1:nrSamples) {
  e[[i]] <- rnorm(n = 5, mean = 5, sd = 3)
}
sample_means <- matrix(NA, 5000, 1)
for (i in 1:5000){
  sample_means[i] <- mean(e[[i]])
}
Any idea on how to tackle this? I am very very new to R!
You don't need a list in this case. It is a common mistake of new R users to use lists excessively.
observations <- matrix(rnorm(25000, mean=5, sd=3), 5000, 5)
means <- rowMeans(observations)
Now means is a vector of 5000 elements.
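To get the histogram the question asks for:

hist(means)   # histogram of the 5000 sample means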
You can actually do this without for loops. replicate can be used to create the 5000 samples. Then use sapply to return the mean of each sample. Wrap the sapply call in hist() to get the histogram of means.
dat = replicate(5000, rnorm(5,5,3), simplify=FALSE)
hist(sapply(dat, mean))
Or, if you want to save the means:
sample.means = sapply(dat,mean)
hist(sample.means)
I think your code is giving valid results. list(mode="vector",length=nrSamples) isn't doing what I think you intended (run it in the console and see what happens), but it works out because the first two list elements get overwritten in the loop.
Although there's no need to use loops here, just for illustration here are two modified versions of your code using loops:
# 1. Store random samples in a list
e <- vector("list", nrSamples)
for (i in 1:nrSamples) {
  e[[i]] <- rnorm(n = 5, mean = 5, sd = 3)
}
sample_means = rep(NA, nrSamples)
for (i in 1:nrSamples){
  sample_means[i] <- mean(e[[i]])
}

# 2. Store random samples in a matrix
e <- matrix(rep(NA, 5000*5), nrow=5)
for (i in 1:nrSamples) {
  e[, i] <- rnorm(n = 5, mean = 5, sd = 3)
}
sample_means = rep(NA, nrSamples)
for (i in 1:nrSamples){
  sample_means[i] <- mean(e[, i])
}
Your code is fine (see below), but I would suggest you try the following:
yourlist <- lapply(1:nrSamples, function(x) rnorm(n=5, mean = 5, sd = 3 ))
yourmeans <- sapply(yourlist, mean)
Here, for each element of the sequence 1, 2, 3, ..., nrSamples that I supply as the first argument, lapply executes a function with the given element of the sequence as its argument (i.e. x). The function that I have supplied does not depend on x, however, so it is simply replicated 5000 times, and the output is stored in a list (this is what lapply does). It is an easy way to avoid loops in situations like these. Needless to say, you could also just run
yourmeans <- sapply(1:nrSamples, function(x) mean(rnorm(n=5, mean = 5, sd = 3)))
Note, though, that apart from the means the latter does not store your results, which may not be what you want. Also note that I call sapply to return a vector, which you can then use to plot your histogram, using e.g. hist(yourmeans).
To show that your code is fine, consider the following:
set.seed(42)
nrSamples = 5000
e <- list(mode="vector",length=nrSamples)
for (i in 1:nrSamples) {
e[[i]] <- rnorm(n = 5, mean = 5, sd = 3)
}
sample_means <- matrix(NA, 5000,1)
for (i in 1:5000){
sample_means[i] <- mean(e[[i]])
}
set.seed(42)
yourlist <- lapply(1:nrSamples, function(x) rnorm(n=5, mean = 5, sd = 3 ))
yourmeans <- sapply(yourlist, mean)
all.equal(as.vector(sample_means), yourmeans)
[1] TRUE
Here, I set the seed of the random number generator to make sure that the random numbers are the same. As you can see, your code works fine, though as others have pointed out, loops can easily be avoided.
I am working with the genalg library for R, and am trying to save all the generations when I run a binary genetic algorithm. There does not seem to be a built-in method for that in the library, so my attempt was to save each chromosome, x, coming through the evaluation function.
To test this method I tried inserting print(x) in the evaluation function to see all the evaluated chromosomes. However, the number of printed chromosomes does not always match what I expect.
I thought that the number of printed chromosomes would equal the number of iterations times the population size, but that does not seem to be true all the time.
The problem is that I want to know which generation (or iteration) each chromosome belongs to, which I can't tell if the number of chromosomes differs from iters times popSize.
What is the reason for this, and how can I "fix" it? Or is there another way of saving each chromosome along with the iteration it belongs to?
Below is an example where I thought the evaluation function would print 2 x 5 chromosomes, but it only prints 8.
library(genalg)
library(ggplot2)
dataset <- data.frame(
  item = c("pocketknife", "beans", "potatoes", "unions", "sleeping bag", "rope", "compass"),
  survivalpoints = c(10, 20, 15, 2, 30, 10, 30),
  weight = c(1, 5, 10, 1, 7, 5, 1))
weightlimit <- 20

evalFunc <- function(x) {
  print(x)
  current_solution_survivalpoints <- x %*% dataset$survivalpoints
  current_solution_weight <- x %*% dataset$weight
  if (current_solution_weight > weightlimit)
    return(0) else return(-current_solution_survivalpoints)
}
iter = 2
popSize = 5
set.seed(1)
GAmodel <- rbga.bin(size = 7, popSize = popSize, iters = iter, mutationChance = 0.1, elitism = T, evalFunc = evalFunc)
Looking at the function code, it seems that at each iteration (generation) a subset of chromosomes is chosen from the population (5 chromosomes in your example) with a certain probability (0.1 in your case) and mutated. The evaluation function is called only for the mutated chromosomes at each generation (and, of course, for all the chromosomes in the first iteration, to establish their initial values).
Note that this subset does not include the elitist group, which in your example is 1 element big (you have erroneously passed elitism=TRUE, and TRUE is implicitly converted to 1).
Anyway, to know the population at each generation, you can pass a monitor function through the monitorFunc parameter, e.g.:
# obj contains a lot of information, try printing it
monitor <- function(obj) {
  print(paste(" GENERATION :", obj$iter))
  print("POPULATION:")
  print(obj$population)
  print("VALUES:")
  print(obj$evaluations)
}
iter = 2
popSize = 5
set.seed(1)
GAmodel <- rbga.bin(size = 7, popSize = popSize,
iters = iter, mutationChance = 0.1,
elitism = 1, evalFunc = evalFunc, monitorFunc = monitor)
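If the goal is to save the generations rather than print them, a small variation of the same monitor (my own sketch, relying only on the obj fields used above) collects them in a list via <<-:

generations <- list()
monitor_save <- function(obj) {
  # append this generation's population and scores to the global list
  generations[[obj$iter]] <<- list(population = obj$population,
                                   evaluations = obj$evaluations)
}
# then pass monitorFunc = monitor_save to rbga.bin as above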