I am attempting to create a 5000-word vector composed of 500 blocks of 10 words each. One block is drawn by sampling with replacement from a fixed list of animals, and blocks of this kind are to alternate with blocks drawn the same way from a fixed list of foods. The following code yields one iteration of what I need:
library(rlist)  # provides list.sample
anim <- data.frame(stim = list.sample(animals$WORD, 10, replace = TRUE), cond = "animal")
food <- data.frame(stim = list.sample(foods$WORD, 10, replace = TRUE), cond = "food")
both <- rbind(anim, food)
This yields a 20-row data frame with a stim column (the sampled words) and a cond column ("animal" or "food").
I just cannot figure out how to repeat this procedure 249 more times to create the total vector I need -- I will be running semantic distances between clusters to determine whether I can auto-segment the boundaries between foods and animals. I attempted a repeat loop, to no avail.
Thanks for any ideas!
Since you did not provide any reproducible data, we will assume that LETTERS are food and letters are animals. This line of code generates the vector you specified. Here we are only using batches of 5 to illustrate the process:
result <- as.vector(replicate(5, c(sample(LETTERS, 5, replace=TRUE), sample(letters, 5, replace=TRUE))))
result
# [1] "H" "O" "T" "K" "J" "m" "c" "s" "u" "c" "P" "Y" "V" "U" "Y" "p" "u" "q" "k" "l" "B" "H" "U" "F" "K" "h" "v" "g"
# [29] "c" "d" "X" "F" "R" "N" "U" "v" "t" "u" "q" "x" "N" "E" "G" "Q" "L" "d" "a" "v" "e" "a"
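To scale this up to the full problem, 250 repetitions of a 10-word animal sample followed by a 10-word food sample give the 500 blocks / 5000 words. Here is a sketch, assuming the animals$WORD and foods$WORD columns from the question and using base sample as in the answer, with the condition labels rebuilt alongside:

stim <- as.vector(replicate(250, c(sample(animals$WORD, 10, replace = TRUE),
                                   sample(foods$WORD, 10, replace = TRUE))))
cond <- rep(rep(c("animal", "food"), each = 10), times = 250)  # labels matching the block structure
both <- data.frame(stim, cond)  # 5000 rows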
I want to take "random" samples from a vector called data but with increasing size and without replacement.
To illustrate my point, suppose data looks like this:
data<-c("a","s","d","f","g","h","j","k","l","x","c","v","b","n","m")
What I need is to get sampling vectors of increasing size (starting at size 2 and growing by 2), where each vector extends the previous one without duplicating any of its elements, and to store everything in a list so that the result looks something like this:
sample_1<-c("s","d")
sample_2<-c("s","d","a","f")
sample_3<-c("s","d","a","f","m","n")
sample_4<-c("s","d","a","f","m","n","l","c")
sample_5<-c("s","d","a","f","m","n","l","c","j","x")
sample_6<-c("s","d","a","f","m","n","l","c","j","x","v","k")
sample_7<-c("s","d","a","f","m","n","l","c","j","x","v","k","g","b")
sample_8<-c("s","d","a","f","m","n","l","c","j","x","v","k","g","b","h")
samples<-list(sample_1,sample_2,sample_3,sample_4,sample_5,sample_6,sample_7,sample_8)
What I have so far is:
samples <- sapply(seq(from = 2, to = length(data), by = 2),
                  function(i) sample(data, size = i, replace = FALSE),
                  simplify = FALSE, USE.NAMES = TRUE)
What does not work is increasing the sample size while keeping the samples of the previous steps, and having the last list element contain all observations.
Is something like this possible?
I'm not sure whether I understood you correctly, but perhaps you only need to scramble the data once:
data <- letters
data_random <- sample(data)  # shuffle once
sapply(seq(from = 2, to = length(data), by = 2),
       function(x) data_random[1:x],
       simplify = FALSE)
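One caveat (my note, not part of the original answer): with odd-length data such as the asker's 15 elements, seq(2, length(data), by = 2) stops at 14, so the final list element never contains all observations. Appending length(data) to the size sequence fixes that:

sizes <- unique(c(seq(from = 2, to = length(data), by = 2), length(data)))
sapply(sizes, function(x) data_random[1:x], simplify = FALSE)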
After your comments on the other answer I think I see what you want to achieve, so extending my previous code I end up with:
data<-c("a","s","d","f","g","h","j","k","l","x","c","v","b","n","m")
set.seed(123)
nbitems=length(data)/2+length(data)%%2
results=vector("list",nbitems)
results[[1]] <- sample(data,2) # get first sample
for (i in 2:nbitems) { # Loop for each result
samplesavail <- data[!data %in% results[[i-1]]] # Reduce the samples available
results[[i]] <- c(results[[i-1]], sample( samplesavail, min( length(samplesavail), 2) ) ) # concatenate a new sample, size depends on step and remaining samples available.
}
Hope this matches your intended use:
> results
[[1]]
[1] "n" "f"
[[2]]
[1] "n" "f" "a" "g"
[[3]]
[1] "n" "f" "a" "g" "m" "v"
[[4]]
[1] "n" "f" "a" "g" "m" "v" "x" "l"
[[5]]
[1] "n" "f" "a" "g" "m" "v" "x" "l" "b" "j"
[[6]]
[1] "n" "f" "a" "g" "m" "v" "x" "l" "b" "j" "k" "h"
[[7]]
[1] "n" "f" "a" "g" "m" "v" "x" "l" "b" "j" "k" "h" "d" "s"
[[8]]
[1] "n" "f" "a" "g" "m" "v" "x" "l" "b" "j" "k" "h" "d" "s" "c"
Previous approach:
If I understood you correctly (though I'm far from sure):
data<-c("a","s","d","f","g","h","j","k","l","x","c","v","b","n","m")
set.seed(123) # fix the seed for repro of answer, remove in real case
nbitems=length(data)/2+length(data)%%2 # Get how much entries we should have when stepping by 2
results=vector("list",nbitems) # preallocate the list (as we'll start by end)
results[[nbitems]] = sample(data,length(data)) # sample the datas
for (i in nbitems:2) {
results[[i-1]] <- results[[i]][1:(length(results[[i]]) - 2)] # for each iteration, take down the 2 last entries.
}
This gives a single entry as the first result (stepping 15 elements down by 2 leaves 1, not 2).
I just noticed this is the same idea as @sbstn's answer, but with a more complicated backward approach; posting in case it has some value.
I have a list of vectors such as:
>list
[[1]]
[1] "a" "m" "l" "s" "t" "o"
[[2]]
[1] "a" "y" "o" "t" "e"
[[3]]
[1] "n" "a" "s" "i" "d"
I want to find the matches between each of them and the remaining ones (i.e. between the 1st and the other 2, the 2nd and the other 2, and so on) and keep the pair with the highest number of matches. I could do it with a "for" loop, intersecting pair by pair. For example
for (i in 2:3) { intersect(list[[1]],list[[i]]) }
and then save the output into a vector or some other structure. However, this seems inefficient to me (given that I have thousands rather than 3) and I am wondering if R has some built-in function to do it in a clever way.
So the question would be:
Is there a way to look for matches of one vector to a list of vectors without the explicit use of a "for" loop?
I don't believe there is a built-in function for this. The best you could try is something like:
lsts <- lapply(1:5, function(x) sample(letters, 10))  # make some data (see below)
combs <- combn(length(lsts), 2)  # one column per pair of list indices
maxcomb <- which.max(apply(combs, 2,
                           function(ix) length(intersect(lsts[[ix[1]]], lsts[[ix[2]]]))))
lsts <- lsts[combs[, maxcomb]]
# [[1]]
# [1] "m" "v" "x" "d" "a" "g" "r" "b" "s" "t"
# [[2]]
# [1] "w" "v" "t" "i" "d" "p" "l" "e" "s" "x"
A dump of the original:
[[1]]
[1] "z" "r" "j" "h" "e" "m" "w" "u" "q" "f"
[[2]]
[1] "m" "v" "x" "d" "a" "g" "r" "b" "s" "t"
[[3]]
[1] "w" "v" "t" "i" "d" "p" "l" "e" "s" "x"
[[4]]
[1] "c" "o" "t" "j" "d" "g" "u" "k" "w" "h"
[[5]]
[1] "f" "g" "q" "y" "d" "e" "n" "s" "w" "i"
datal <- list(a = c(2, 2, 1, 2),
              b = c(2, 2, 2, 4, 3),
              c = c(1, 2, 3, 4))
# all possible combinations of two lists
combs <- combn(length(datal), 2)
# split the combination matrix into a list, one element per pair
combs <- split(combs, rep(1:ncol(combs), each = nrow(combs)))
# calculate the length of the intersection for every combination
intersections_length <- sapply(combs, function(y) {
  length(intersect(datal[[y[1]]], datal[[y[2]]]))
})
# which pairs have the biggest intersection
combs[which(intersections_length == max(intersections_length))]
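For this datal, pair 3 (lists b and c) shares the most values (2, 3 and 4), so the result is:

# $`3`
# [1] 2 3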
If I have a vector of letters:
> all <- letters
> all
[1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m" "n" "o" "p" "q" "r" "s" "t" "u" "v" "w" "x" "y" "z"
and then I define a reference sample from letters as follows:
> refSample <- c("j","l","m","s")
in which the spacing between elements is 2 (1st to 2nd), 1 (2nd to 3rd) and 6 (3rd to 4th), how can I then select n samples from all that have identical, non-wrap-around spacing between their elements to refSample? For example, "a","c","d","j" and "q","s","t","z" would be valid samples, but "a","c","d","k" and "r","t","u","a" would not: the former has an index difference of 7 (rather than 6) between the 3rd and last element, whereas the latter has the correct spacing but wraps around.
Second, how can I parameterise this, so that whatever refSample is used, I can use the spacing of that as a template?
Here's a simple way --
all <- letters
refSample <- c("j","l","m","s")
pick_matches <- function(n, ref, full) {
  iref <- match(ref, full)  # positions of the reference sample
  spaces <- diff(iref)      # the spacing template
  tot_space <- sum(spaces)
  max_start <- length(full) - tot_space  # last start index that avoids wrap-around
  starts <- sample(1:max_start, n, replace = TRUE)
  return(sapply(starts, function(s) full[cumsum(c(s, spaces))]))
}
> set.seed(1)
> pick_matches(5, refSample, all) # each COLUMN is a desired sample vector
[,1] [,2] [,3] [,4] [,5]
[1,] "e" "g" "j" "p" "d"
[2,] "g" "i" "l" "r" "f"
[3,] "h" "j" "m" "s" "g"
[4,] "n" "p" "s" "y" "m"
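As a quick check (my addition, not part of the original answer), every sampled column should reproduce the reference spacing:

m <- pick_matches(5, refSample, all)
all(apply(m, 2, function(col) identical(diff(match(col, all)),
                                        diff(match(refSample, all)))))
# [1] TRUE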
I'm trying to plot a heatmap in ggplot2 using csv data, following casbon's solution in
http://biostar.stackexchange.com/questions/921/how-to-draw-a-csv-data-file-as-a-heatmap-using-numpy-and-matplotlib
The problem is that the x-labels re-sort themselves. For example, if I swap the labels COG0002 and COG0001 in that example data, the x-labels still come out in sorted order (COG0001, COG0002, COG0003, ..., COG0008).
Is there any way to prevent this? I want the labels ordered as in the csv file.
Thanks,
pp
If I recall correctly, when calling factor(x) with the default levels argument, the levels are set as levels = sort(unique(x)).
You can override this action by setting levels = unique(x).
For example:
set.seed(1)
x = sample(letters, 100, replace = TRUE)
head(x, 5)
[1] "g" "j" "o" "x" "f"
levels(factor(x))
[1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m" "n" "o" "p" "q" "r" "s"
[20] "t" "u" "v" "w" "x" "y" "z"
levels(factor(x, levels = unique(x)))
[1] "g" "j" "o" "x" "f" "y" "r" "q" "b" "e" "u" "m" "s" "z" "d" "k" "a" "w" "i"
[20] "p" "v" "c" "n" "t" "l" "h"
You can see that setting levels = unique(x) preserves the order of occurrence in the data.
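To see why this matters for the plot, here is a minimal sketch with hypothetical COG labels (not the asker's actual csv): ggplot2 orders a discrete axis by the factor levels, so fixing the levels fixes the axis order.

library(ggplot2)
d <- data.frame(COG = c("COG0002", "COG0001", "COG0003"), value = 1:3)
ggplot(d, aes(x = COG, y = 1, fill = value)) + geom_tile()  # axis comes out alphabetically
d$COG <- factor(d$COG, levels = unique(d$COG))  # preserve the original order
ggplot(d, aes(x = COG, y = 1, fill = value)) + geom_tile()  # axis now follows the data order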
If you want to keep the order directly from the csv file:
# levels taken from the first column of foo, the data frame read from the csv
foomelt$COG <- factor(foomelt$COG, levels = unique(as.character(foo[[1]])))
Did you try reordering factor levels before plotting?
e.g.
foomelt$COG <- factor(foomelt$COG, levels(foomelt$COG)[c(2, 1, 3:8)])
(I can't try it right now, so I can't be sure that it works)