Is there a way to create a permutation of a vector without using the sample() function in R? - r

I hope you are having a nice day. I would like to know if there is a way to create a permutation (rearrangement) of the values in a vector in R?
My professor gave us an assignment in which we are supposed to create functions for a randomization test: one that uses sample() to create a permutation and one that does not use sample(). So far all of my efforts have been fruitless, as any answer that I can find always resorts to the sample() function. I have tried several other methods, such as indexing with runif() and writing my own functions, but to no avail. Alas, I have accepted defeat and come here for salvation.
While using the sample() function, the code looks like:
#create the groups
a <- c(2,5,5,6,6,7,8,9)
b <- c(1,1,2,3,3,4,5,7,7,8)
#create a permutation of the combined vector without replacement using the sample() function
permsample <-sample(c(a,b),replace=FALSE)
permsample
[1] 2 5 6 1 7 7 3 8 6 3 5 9 2 7 4 8 1 5
And, for reference, the entire code of my function looks like:
PermutationTtest <- function(a, b, P){
sample.t.value <- t.test(a, b)$statistic
perm.t.values<-matrix(rep(0,P),P,1)
N <-length(a)
M <-length(b)
for (i in 1:P)
{
permsample <-sample(c(a,b),replace=FALSE)
pgroup1 <- permsample[1:N]
pgroup2 <- permsample[(N+1) : (N+M)]
perm.t.values[i]<- t.test(pgroup1, pgroup2)$statistic
}
return(mean(perm.t.values))
}
How would I achieve the same thing, but without using the sample() function and within the confines of base R? The only hint my professor gave was "use indices." Thank you very much for your help and have a nice day.

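You can use runif() to generate a value between 1 and index + 1, where index is the number of elements still left to pick from. The floor() function returns the integer part of that number, which gives a position between 1 and index. At each iteration, append the element at position rn of the remaining vector to the new one, remove it from the remaining vector, and decrease the range of the random number by one.
a <- c(2,5,5,6,6,7,8,9)
b <- c(1,1,2,3,3,4,5,7,7,8)
# named pool rather than c, so that the c() function is not masked
pool <- c(a, b)
index <- length(pool)
perm <- c()
for (i in seq_len(length(pool))) {
  # max = index + 1: with max = index, floor() could never return the last position
  rn <- floor(runif(1, min = 1, max = index + 1))
  perm <- append(perm, pool[rn])
  pool <- pool[-rn]
  index <- index - 1
}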

It is easier to see what is going on if we use consecutive numbers:
a <- 1:8
b <- 9:17
ab <- c(a, b)
ab
# [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
Now draw 17 (length(ab)) random numbers and use them to order ab:
rnd <- runif(length(ab))
ab[order(rnd)]
# [1] 5 13 11 12 6 1 17 3 10 2 8 16 7 4 9 15 14
rnd <- runif(length(ab))
ab[order(rnd)]
# [1] 14 11 5 15 10 7 13 9 17 8 2 6 1 4 16 12 3
For each permutation just draw another 17 random numbers.
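As a rough sketch, the order(runif()) idea can be dropped into the permutation test from the question. The names permute and PermutationTtestNoSample are made up for this illustration, and unlike the original function this version returns all permuted t-values rather than their mean, so the caller can summarize them however the assignment requires.

```r
# Hypothetical helper: permute x by ordering it on uniform random keys
permute <- function(x) x[order(runif(length(x)))]

# Hypothetical variant of the question's function that avoids sample()
PermutationTtestNoSample <- function(a, b, P) {
  pooled <- c(a, b)
  N <- length(a)
  perm.t.values <- numeric(P)
  for (i in seq_len(P)) {
    p <- permute(pooled)
    # first N values form group 1, the rest form group 2
    perm.t.values[i] <- t.test(p[1:N], p[-(1:N)])$statistic
  }
  perm.t.values
}

a <- c(2,5,5,6,6,7,8,9)
b <- c(1,1,2,3,3,4,5,7,7,8)
set.seed(42)  # only to make the illustration reproducible
perm.t <- PermutationTtestNoSample(a, b, 200)
```

The observed statistic t.test(a, b)$statistic can then be compared against perm.t.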

Related

efficient rbind alternative with applied function

To take a step back, my ultimate goal is to read in around 130,000 images into R with a pixel size of HxW and then to make a dataframe/datatable containing the rgb of each pixel of each image on a new row. So the output will be something like this:
> head(train_data, 10)
image_no r g b pixel_no
1: 00003e153.jpg 0.11764706 0.1921569 0.3098039 1
2: 00003e153.jpg 0.11372549 0.1882353 0.3058824 2
3: 00003e153.jpg 0.10980392 0.1843137 0.3019608 3
4: 00003e153.jpg 0.11764706 0.1921569 0.3098039 4
5: 00003e153.jpg 0.12941176 0.2039216 0.3215686 5
6: 00003e153.jpg 0.13333333 0.2078431 0.3254902 6
7: 00003e153.jpg 0.12549020 0.2000000 0.3176471 7
8: 00003e153.jpg 0.11764706 0.1921569 0.3098039 8
9: 00003e153.jpg 0.09803922 0.1725490 0.2901961 9
10: 00003e153.jpg 0.11372549 0.1882353 0.3058824 10
I currently have a piece of code to do this in which I apply a function to get the rgb for each pixel of a specified image, returning the result in a dataframe:
#function to get rgb from image file paths
get_rgb_table <- function(link){
img <- readJPEG(toString(link))
# Creating the data frame
rgb_image <- data.frame(r = as.vector(img[1:H, 1:W, 1]),
g = as.vector(img[1:H, 1:W, 2]),
b = as.vector(img[1:H, 1:W, 3]))
#add pixel id
rgb_image$pixel_no <- row.names(rgb_image)
#add image id
train_rgb <- cbind(sub('.*/', '',link),rgb_image)
colnames(train_rgb)[1] <- "image_no"
return(train_rgb)
}
I call this function on another dataframe which contains the links to all the images:
train_files <- list.files(path="~/images/", pattern=".jpg",all.files=T, full.names=T, no.. = T)
train <- data.frame(matrix(unlist(train_files), nrow=length(train_files), byrow=T))
The train dataframe looks like this:
> head(train, 10)
link
1 C:/Documents/image/00003e153.jpg
2 C:/Documents/image/000155de5.jpg
3 C:/Documents/image/00021ddc3.jpg
4 C:/Documents/image/0002756f7.jpg
5 C:/Documents/image/0002d0f32.jpg
6 C:/Documents/image/000303d4d.jpg
7 C:/Documents/image/00031f145.jpg
8 C:/Documents/image/00053c6ba.jpg
9 C:/Documents/image/00057a50d.jpg
10 C:/Documents/image/0005d01c8.jpg
I finally get the result I want with the following loop:
for(i in 1:length(train[,1])){
train_data <- rbind(train_data,get_rgb_table(train[i,1]))
}
However, this last bit of code is very inefficient. An optimization of how the function is applied and/or of the rbind would help. I think the function get_rgb_table() itself is quick, but the problem is with the loop and the rbind. I have tried using apply() but can't manage to do this on each row and put the result in one dataframe without running out of memory. Any help on this would be great. Thanks!
This is very difficult to answer given the vagueness of the question, but I'll make a reproducible example of what I think you're asking and will give a solution.
Say I have a function that returns a data frame:
MyFun <- function(x)randu[1:x,]
And I have a data frame df that will act as an input to the function.
# a b
# 1 1 21
# 2 2 22
# 3 3 23
# 4 4 24
# 5 5 25
# 6 6 26
# 7 7 27
# 8 8 28
# 9 9 29
# 10 10 30
From your question, it looks like only one column will be used as input. So, I apply the function to each row of this data frame using lapply then I bind the results together using do.call and rbind like this:
do.call(rbind, lapply(df$a, MyFun))
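A self-contained version of this pattern might look like the sketch below. Calling rbind inside a loop copies the accumulated data frame on every iteration, whereas collecting the pieces in a list and binding once avoids the repeated copying. The data frame df is reconstructed here from the printed output above, and randu is the data set that ships with R; both are stand-ins for the image table and get_rgb_table().

```r
# Stand-in for a function that returns a data frame
MyFun <- function(x) randu[1:x, ]

# Input data frame, reconstructed from the printed output (assumption)
df <- data.frame(a = 1:10, b = 21:30)

# Build all pieces first, then bind them in a single call
pieces <- lapply(df$a, MyFun)
combined <- do.call(rbind, pieces)
dim(combined)
```

For the image problem, df$a would be the column of file paths and MyFun would be get_rgb_table.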

word_stats function from qdap package application on a dataframe

I have a dataframe, where one column contains strings.
q = data.frame(number=1:2,text=c("The surcingle hung in ribands from my body.", "But a glance will show the fallacy of this idea."))
I want to use the word_stats function for each individual record.
Is it possible?
text_statistic <- apply(q,1,word_stats)
This will apply word_stats() row by row and return a list with the results of word_stats() for every row.
You can do it in many ways; lapply and sapply apply a function over a list or vector.
word_stats <- function(x) {length(unlist(strsplit(x, ' ')))}
sapply(q$text, word_stats)
Sure, have a look at the grouping.var argument:
dat = data.frame(number=1:2,text=c("The surcingle hung in ribands from my body.", "But a glance will show the fallacy of this idea."))
with(dat, qdap::word_stats(text, number))
##   number n.sent n.words n.char n.syl n.poly wps cps sps psps   cpw   spw pspw n.state p.state n.hapax grow.rate
## 1      2      1      10     38    14      2  10  38  14    2 3.800 1.400 .200       1       1      10         1
## 2      1      1       8     35    12      1   8  35  12    1 4.375 1.500 .125       1       1       8         1

Filter between threshold

I am working with a large dataset and I am trying to first identify clusters of values that meet specific threshold values. My aim then is to only keep clusters of a minimum length. Below is some example data and my progress thus far:
Test = c("A","A","A","A","A","A","A","A","A","A","B","B","B","B","B","B","B","B","B","B")
Sequence = c(1,2,3,4,5,6,7,8,9,10,1,2,3,4,5,6,7,8,9,10)
Value = c(3,2,3,4,3,4,4,5,5,2,2,4,5,6,4,4,6,2,3,2)
Data <- data.frame(Test, Sequence, Value)
Using package evd, I have identified clusters of values >3
C1 <- clusters(Data$Value, u = 3, r = 1, cmax = F, plot = T)
Which produces
C1
$cluster1
4
4
$cluster2
6 7 8 9
4 4 5 5
$cluster3
12 13 14 15 16 17
4 5 6 4 4 6
My problem is twofold:
1) I don't know how to relate this back to the original dataframe (for example to Test A & B)
2) How can I only keep clusters with a minimum size of 3 (thus excluding Cluster 1)
I have looked into various filtering options etc. however they do not cluster data according to a desired threshold, with no options for the minimum size of the cluster either.
Any help is much appreciated.
Q1: relate back to the original dataframe: have a look at Carl Witthoft's answer to "detect intervals of the consequent integer sequences". He wrote seqle(), a variant of rle() that looks for integer sequences rather than repetitions.
Q2: only keep clusters of certain length:
C1[sapply(C1, length) >= 3]
yields the 2 clusters that are long enough:
$cluster2
6 7 8 9
4 4 5 5
$cluster3
12 13 14 15 16 17
4 5 6 4 4 6
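If the evd package is not essential, both parts can be sketched in base R with rle(), which keeps everything in the original data frame. This is an alternative technique, not what clusters() does internally: here a run is an unbroken stretch of values above the threshold, and runs are labelled per Test so that none spans two tests. The column names run_id and run_len and the variables threshold and min_len are invented for this sketch.

```r
Test <- rep(c("A", "B"), each = 10)
Sequence <- rep(1:10, times = 2)
Value <- c(3,2,3,4,3,4,4,5,5,2,2,4,5,6,4,4,6,2,3,2)
Data <- data.frame(Test, Sequence, Value)

threshold <- 3  # values must exceed this
min_len <- 3    # minimum run length to keep

# A run ends whenever the Test label or the above/below state changes
above <- Data$Value > threshold
runs <- rle(paste(Data$Test, above))
Data$run_id <- rep(seq_along(runs$lengths), times = runs$lengths)
Data$run_len <- rep(runs$lengths, times = runs$lengths)

# Keep only rows that are above the threshold and in a long enough run
kept <- Data[above & Data$run_len >= min_len, ]
```

On the example data, kept holds rows 6 to 9 of Test A and rows 12 to 17 of Test B, matching cluster2 and cluster3.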

approx() without duplicates?

I am using approx() to interpolate values.
x <- 1:20
y <- c(3,8,2,6,8,2,4,7,9,9,1,3,1,9,6,2,8,7,6,2)
df <- cbind.data.frame(x,y)
> df
x y
1 1 3
2 2 8
3 3 2
4 4 6
5 5 8
6 6 2
7 7 4
8 8 7
9 9 9
10 10 9
11 11 1
12 12 3
13 13 1
14 14 9
15 15 6
16 16 2
17 17 8
18 18 7
19 19 6
20 20 2
interpolated <- approx(x=df$x, y=df$y, method="linear", n=5)
gets me this:
interpolated
$x
[1] 1.00 5.75 10.50 15.25 20.00
$y
[1] 3.0 3.5 5.0 5.0 2.0
Now, the first and last value are duplicates of my real data, is there any way to prevent this or is it something I don't understand properly about approx()?
You may want to specify xout to avoid this. For instance, if you want to always exclude the first and the last points, here's how you can do that:
specify_xout <- function(x, n) {
seq(from=min(x), to=max(x), length.out=n+2)[-c(1, n+2)]
}
plot(df$x, df$y)
points(approx(df$x, df$y, xout=specify_xout(df$x, 5)), pch = "*", col = "red")
Note that this does not prevent an interpolated point from coinciding with an existing data point somewhere in the middle.
approx will fit through all your original datapoints if you give it a chance (change n=5 to xout=df$x to see this). Interpolation is the process of generating values for y given unobserved values of x, but should agree if the values of x have been previously observed.
The method="linear" setup is going to 'draw' linear segments joining up your original coordinates exactly (and so will give the y values you input to it for integer x). You only observe 'new' y values because your n=5 means that for points other than the beginning and end the x is not an integer (and therefore not one of your input values), and so gets interpolated.
If you want observed values not to be exactly reproduced, then maybe add some noise via rnorm()?
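Both points can be checked with a short sketch using the data from the question. The variable names interior, res, and at_knots are chosen for this illustration only.

```r
x <- 1:20
y <- c(3,8,2,6,8,2,4,7,9,9,1,3,1,9,6,2,8,7,6,2)

# Five interior x positions: build n + 2 evenly spaced points, drop both ends
n <- 5
interior <- seq(min(x), max(x), length.out = n + 2)[-c(1, n + 2)]
res <- approx(x, y, xout = interior)

# Interpolating at the observed x values gives back the observed y values
at_knots <- approx(x, y, xout = x)$y
isTRUE(all.equal(at_knots, y))
```

With this choice of n none of the interior positions is an integer, so every value in res$y is a genuinely interpolated one.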

returning a list in R and functional programming behavior

I have a basic question regarding functional programming in R.
Given a function that returns a list, such as:
myF <- function(x){
return (list(a=11,b=x))
}
Why is it that, when the function is called with a range or vector, 'a' in the returned list always has the same length?
Ex:
myF(1:10)
returns:
$a
[1] 11
$b
[1] 1 2 3 4 5 6 7 8 9 10
How can one change the behavior so that 'a' has the same length as 'b'?
I am actually working with a bunch of S4 objects that I cannot easily convert to a list (using as.list), so _apply is not my first choice.
Thanks for any insight or help!
EDIT (Added further explanations)
I am not necessarily looking to just pad 'a' to make its length equal to b's. However, using the solution
as.list(data.frame(a=myA,b=x)) pads the 'a' with the same value computed first.
myF <- function(x){
myA = ceiling(runif(1, max=100))
return (as.list(data.frame(a=myA
,b=x)))
}
myF(1:10)
$a
[1] 79 79 79 79 79 79 79 79 79 79
$b
[1] 1 2 3 4 5 6 7 8 9 10
I still am not sure why that happens!
Thanks
Are you just looking to have 11 repeated so that a is the same length as b? If so:
> myF <- function(x){
+ return (list(a=rep(11,length(x)),b=x))
+ }
> myF(1:10)
$a
[1] 11 11 11 11 11 11 11 11 11 11
$b
[1] 1 2 3 4 5 6 7 8 9 10
EDIT based on OP's clarification/comments. If you want 'a' to instead be a random vector with length equal to 'b':
> myF <- function(x){
+ return (list(a=ceiling(runif(length(x),max=100)),b=x))
+ }
> myF(1:10)
$a
[1] 4 31 8 45 25 74 36 95 64 32
$b
[1] 1 2 3 4 5 6 7 8 9 10
I don't quite understand what you mean by not being able to use as.list. You should be able to get a version of your function satisfying the requirement that all components of the list be equally long by doing:
myF <- function(x){
return(as.list(data.frame(a=11,b=x)))
}
EDIT:
The reason list does not work the way you expect is that list applied to a number of lists/vectors/etc. is just that, a list of those lists/vectors/etc.; it does not "inspect" their structure.
What I think you want is the additional semantics that the vectors contained in the list should match up and produce a set of "rows", each with one corresponding element from each one of your vectors. This is exactly what a data frame is supposed to be (and indeed how, I think, a data frame is represented in R). The final as.list call does little but change what type it's tagged as.
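The difference can be seen in a small sketch. The names plain and framed are made up here; lengths() is the base R function that reports the length of each list component.

```r
# list() stores its arguments as given: no recycling takes place
plain <- list(a = 11, b = 1:10)
lengths(plain)

# data.frame() recycles the length-1 value to match the longer column
framed <- as.list(data.frame(a = 11, b = 1:10))
lengths(framed)
```

This also explains the repeated 79 in the question: runif(1, max=100) is evaluated once, and data.frame() then recycles that single value.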
EDIT2:
Note that if I'm wrong above (and that's not the general behaviour you want) then Mac's solution is more appropriate, as it gives you exactly the behaviour that both the vectors should have the same length, without implying that they should "line up".
In that case, using a data.frame would both be confusing to anyone reading the code (as a data.frame implies you think of your vectors as matching up) and force any additional elements you add to the list to be converted into vectors of the appropriate length (which may or may not be what you want).
In case I did not understand you correctly last time, here is another possibility:
If you want to generate a second vector, given some function/expression, of the same length as your argument you could do something like:
myF <- function(x){
return (list(a=replicate(length(x),f),b=x))
}
in your example f could be runif(1, max=100), though in the specific case of runif you could explicitly tell it to generate a vector of appropriate length by calling runif(length(x), max=100) inside the function.
replicate simply re-evaluates f the number of times you request, and gives you the vector of all the results.
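As a concrete sketch with the expression from the question substituted for f (set.seed() is only there to make the illustration reproducible):

```r
myF <- function(x) {
  # the expression is re-evaluated once per element of x
  list(a = replicate(length(x), ceiling(runif(1, max = 100))), b = x)
}

set.seed(1)
res <- myF(1:10)
length(res$a) == length(res$b)
```

Each element of res$a comes from its own call to runif(), so the values generally differ from one another, unlike the recycled single value produced by the data.frame approach.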
It appears that your function is "hard coding" a. So no matter what you specify it will always give 11.
If for example you changed the function to:
myF <- function(x){ return (list(a=x,b=x)) }
myF(1:10)
$a
[1] 1 2 3 4 5 6 7 8 9 10
$b
[1] 1 2 3 4 5 6 7 8 9 10
a is allowed to change like b.
or
myF <- function(x,y){ return (list(a=y,b=x)) }
myF(10:1,1:10)
$a
[1] 1 2 3 4 5 6 7 8 9 10
$b
[1] 10 9 8 7 6 5 4 3 2 1
Now a is allowed to change independent of b.
