Given myletters:
library(tidyverse)
myletters <- letters
myletters
# [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m" "n" "o" "p" "q" "r" "s" "t" "u" "v" "w" "x" "y" "z"
I want to sample 4 letters at a time (without replacement) from myletters, repeat this X times, and find the probability of having sampled all 26 letters at least once for X = 1:100 draws.
For example if X = 10 we could get:
set.seed(10)
X <- unlist(rerun(10, sample(myletters, 4, replace = F)))
X
# [1] "k" "i" "j" "p" "l" "w" "h" "v" "g" "s" "x" "o" "o" "j" "g" "y" "b" "x" "m" "h" "n" "g" "f" "y" "v" "r" "u" "y" "m" "e" "a" "g" "z" "r" "d" "y" "x" "s" "v"
# [40] "r"
#test if X contains all 26 letters
n_distinct(X) == 26 #26 = no of letters
#FALSE
The following approach does what I want in a simulation, but it doesn't scale very well: it fills a dataframe column with up to 400 letters per cell, so it is awkward and inefficient:
output <- crossing(drawsX = 1:100,
trial = 1:100) %>%
mutate(draws_output = map(drawsX, ~ unlist(rerun(., sample(myletters, 4, replace = F)))),
all_letters = map_lgl(draws_output, ~ n_distinct(.) == 26))
output
#plot
output %>%
group_by(drawsX) %>%
summarise(prob_of_all_letters = mean(all_letters)) %>%
ggplot(., aes(drawsX, prob_of_all_letters)) +
geom_line() +
scale_y_continuous(labels = scales::percent_format()) +
labs(y = "Probability")
Ideally I would like to simulate more times, e.g. trial = 1:100000, but the approach above is too inefficient for that.
1) Is there a more efficient way to fill my dataset (or a matrix) with samples?
2) Also, is there an analytic way to solve this problem in R instead of simulation, e.g. what is the probability of getting all 26 letters from 10 draws of 4 samples each?
Thanks
Here's a somewhat improved version. The code is a bit more efficient and certainly cleaner:
sample_sets = function(replicates, k, set = letters) {
draws = vapply(1:replicates, function(z, ...) sample.int(...), FUN.VALUE = integer(k), n = length(set), size = k, replace = FALSE)
all(seq_along(set) %in% draws)
}
## example use
output <- crossing(
drawsX = 1:100,
trial = 1:100
) %>%
mutate(
outcome = map_lgl(drawsX, sample_sets, set = letters, k = 4),
)
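The summarising and plotting step from the question carries over with only the column name changed (a sketch reusing the asker's ggplot code, with all_letters replaced by outcome):
output %>%
  group_by(drawsX) %>%
  summarise(prob_of_all_letters = mean(outcome)) %>%
  ggplot(aes(drawsX, prob_of_all_letters)) +
  geom_line() +
  scale_y_continuous(labels = scales::percent_format()) +
  labs(y = "Probability")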
## timing
system.time({output <- crossing(
drawsX = 1:100,
trial = 1:100
) %>%
mutate(
outcome = map_lgl(drawsX, sample_sets, set = letters, k = 4),
)
})
# user system elapsed
# 2.79 0.04 2.95
## original way
system.time({output <- crossing(drawsX = 1:100,
trial = 1:100) %>%
mutate(draws_output = map(drawsX, ~ unlist(rerun(., sample(letters, 4, replace = F)))),
all_letters = map_lgl(draws_output, ~ n_distinct(.) == 26))})
# user system elapsed
# 4.96 0.06 5.18
So it's about 40% faster on this data - hopefully that performance gain will continue as the number of draws increases.
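As for the analytic question 2), the simulation can be checked against an exact formula. A minimal sketch using inclusion-exclusion over the sets of letters that are never drawn (the helper p_all_letters is my own, not part of the answer above):
p_all_letters <- function(X, n = 26, k = 4) {
  j <- 0:(n - k)  # possible numbers of letters never seen
  # P(a fixed set of j letters is avoided by one draw of k) = choose(n - j, k) / choose(n, k)
  sum((-1)^j * choose(n, j) * (choose(n - j, k) / choose(n, k))^X)
}
p_all_letters(10)             # exact probability of seeing all 26 letters in 10 draws of 4
sapply(1:100, p_all_letters)  # analytic curve over X = 1:100, comparable to the simulated one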
Let's use the dataset from this question:
dat<-data.frame(replicate(20,sample(c("A", "B", "C","D"), size = 100, replace=TRUE)))
Then we can build the transition matrix and the markov chain:
# Build transition matrix
trans.matrix <- function(X, prob=T)
{
tt <- table( c(X[,-ncol(X)]), c(X[,-1]) )
if(prob) tt <- tt / rowSums(tt)
tt
}
trans.mat <- trans.matrix(as.matrix(dat))
attributes(trans.mat)$class <- 'matrix'
# Build markovchain
library(markovchain)
chain <- new('markovchain', transitionMatrix = trans.mat)
If I now encounter a new sequence, let's say AAABCAD, can I then calculate the probability of observing this sequence given this markovchain?
I cannot see a function in markovchain exactly for that, but it can be easily done manually too. There's one caveat though: the transition matrix does not provide the probability of observing the first A, which needs to be provided by you. Let it be 0.25, as it would be if all four states were equally likely (which is true in your example).
Then, storing the observed sequence as a character vector, the transitions in the observed chain can be obtained with
obs <- strsplit("AAABCAD", "")[[1]]
cbind(head(obs, -1), obs[-1])
# [,1] [,2]
# [1,] "A" "A"
# [2,] "A" "A"
# [3,] "A" "B"
# [4,] "B" "C"
# [5,] "C" "A"
# [6,] "A" "D"
Probabilities for each of those transitions then are
trans.mat[cbind(head(obs, -1), obs[-1])]
# [1] 0.2268722 0.2268722 0.2268722 0.2926316 0.2791165 0.2665198
and the final answer is 0.25 * (the product of the above vector):
0.25 * prod(trans.mat[cbind(head(obs, -1), obs[-1])])
# [1] 6.355069e-05
For comparison, we may estimate this probability by generating many chains of length 7:
dat <- replicate(2000000, paste(sample(c("A", "B", "C", "D"), size = 7, replace = TRUE), collapse = ""))
mean(dat == "AAABCAD")
# [1] 6.55e-05
Looks close enough!
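If this calculation is needed repeatedly, it can be wrapped in a small helper. The seq_prob function below is my own sketch, not something provided by the markovchain package:
seq_prob <- function(obs, trans.mat, p_first = 0.25) {
  steps <- cbind(head(obs, -1), obs[-1])  # consecutive state pairs
  p_first * prod(trans.mat[steps])        # P(first state) times the product of transition probabilities
}
seq_prob(strsplit("AAABCAD", "")[[1]], trans.mat)  # same value as the manual calculation above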
I'm having trouble finding a solution to this simple problem. I have been searching the forums, and although I have gotten closer to an answer, it is not exactly what I need.
I'm trying to find, from a set of x,y points, the point that is furthest away from any other point, i.e. not the maximum distance between points, but the point furthest from the rest.
I've tried
x <-c(x1,x2,x3....)
y <-c(y1,y2,y3...)
dist(cbind(x,y))
Which gives me a matrix of the distances between each pair of points. I can interrogate the data in MS Excel and find the answer: find the minimum value in each column, then take the maximum across those minima.
If I were to plot the data, I would like to have as output the distance of either the red or blue line (depending on which is longer).
Starting from this example data set:
set.seed(100)
x <- rnorm(150)
y <- rnorm(150)
coord <- cbind(x,y)
dobj <- dist(coord)
Now dobj is a distance object, but you can't examine that directly. You'll have to convert that to a matrix first, and make sure you don't take zero distances between a point and itself into account:
dmat <- as.matrix(dobj)
diag(dmat) <- NA
The latter line replaces the diagonal values in the distance matrix with NA.
Now you can use the solution of amonk:
dmax <- max(apply(dmat,2,min,na.rm=TRUE))
This gives you the maximum distance to the nearest point. If you want to know which points these are, you can take an extra step:
which(dmat == dmax, arr.ind = TRUE)
# row col
# 130 130 59
# 59 59 130
So point 130 and 59 are the two points fulfilling your conditions. Plotting this gives you:
id <- which(dmat == dmax, arr.ind = TRUE)
plot(coord)
lines(coord[id[1,],], col = 'red')
Note how you get this info twice, as euclidean distances between two points are symmetric (A -> B is as long as B -> A ).
It looks to me like you have spatial points in some projection. One could argue that the point furthest away from the rest is the one which lies furthest from the center (the mean coordinates):
library(raster)
set.seed(21)
# create fake points
coords <- data.frame(x=sample(438000:443000,10),y=sample(6695000:6700000,10))
# calculate center
center <- matrix(colMeans(coords),ncol=2)
# red = center, magenta = furthest point (Nr.2)
plot(coords)
# furthest point #2
ix <- which.max(pointDistance(coords,center,lonlat = F))
points(center,col='red',pch='*',cex=3)
points(coords[ix,],col='magenta',pch='*',cex=3)
segments(coords[ix,1],coords[ix,2],center[1,1],center[1,2],col='magenta')
To find the point(s) farthest from the rest of the points you could do something like this. I opted for the median distance, since you asked for the point(s) farthest from the rest of the data. If you have a group of points very close to each other, the median should remain robust to this.
There is probably also a way to do this with hierarchical clustering but it is escaping me at the moment.
set.seed(1234)
mat <- rbind(matrix(rnorm(100), ncol=2), c(-5,5), c(-5.25,4.75))
d <- dist(mat)
sort(apply(as.matrix(d), 1, median), decreasing = T)[1:5]
# 51 52 20 12 4
# 6.828322 6.797696 3.264315 2.806263 2.470919
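For reference, rows 51 and 52 of mat are exactly the two points appended away from the random cloud, so a quick visual check could be (my own follow-up, not part of the original answer):
plot(mat)
points(mat[c(51, 52), , drop = FALSE], col = "red", pch = 19)  # the two most isolated points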
I wrote up a handy little function you can use for picking from the largest of line distances. You can specify if you want the largest, second largest, and so forth with the n argument.
getBigSegment <- function(x, y, n = 1){
a <- cbind(x,y)
d <- as.matrix(dist(a, method = "euclidean"))
sorted <- order(d, decreasing = T)
sub <- (1:length(d))[as.logical(1:length(sorted) %% 2)]
s <- which(d == d[sorted[sub][n]], arr.ind = T)
t(cbind(a[s[1],], a[s[2],]))
}
With some example data similar to your own you can see:
set.seed(100)
mydata <- data.frame(x = runif(10, 438000, 445000) + rpois(10, 440000),
y = runif(10, 6695000, 6699000) + rpois(10, 6996000))
# The function
getBigSegment(mydata$x, mydata$y)
# x y
#[1,] 883552.8 13699108
#[2,] 881338.8 13688458
Below you can visualize how I would use such a function
# easy plotting function
pointsegments <- function(z, ...) {
segments(z[1,1], z[1,2], z[2,1], z[2,2], ...)
points(z, pch = 16, col = c("blue", "red"))
}
plot(mydata$x, mydata$y) # points
top3 <- lapply(1:3, getBigSegment, x = mydata$x, y = mydata$y) # top3 longest lines
mycolors <- c("black","blue","green") # 3 colors
for(i in 1:3) pointsegments(top3[[i]], col = mycolors[i]) # plot lines
legend("topleft", legend = round(unlist(lapply(top3, dist))), lty = 1,
col = mycolors, text.col = mycolors, cex = .8) # legend
This approach first uses chull to identify extreme_points, the points that lie on the boundary (convex hull) of the given points. Then, for each extreme point, it calculates the centroid of the remaining extreme points, excluding that particular point, and finally selects the extreme point that is furthest away from its corresponding centroid.
foo = function(X = all_points){
plot(X)
chull_inds = chull(X)
extreme_points = X[chull_inds,]
points(extreme_points, pch = 19, col = "red")
centroid = t(sapply(1:NROW(extreme_points), function(i)
c(mean(extreme_points[-i,1]), mean(extreme_points[-i,2]))))
distances = sapply(1:NROW(extreme_points), function(i)
dist(rbind(extreme_points[i,], centroid[i,])))
points(extreme_points[which.max(distances),], pch = 18, cex = 2)
points(X[chull_inds[which.max(distances)],], cex = 5)
return(X[chull_inds[which.max(distances)],])
}
set.seed(42)
all_points = data.frame(x = rnorm(25), y = rnorm(25))
foo(X = all_points)
# x y
#18 -2.656455 0.7581632
So for df as your initial data frame you can perform the following:
df <- NULL                            # initialize object
for(i in 1:10)                        # create 10 vectors with 10 pseudorandom numbers each
  df <- cbind(df, runif(10))          # fill the data frame
cordf <- cor(df); diag(cordf) <- NA   # create correlation matrix and set diagonal values to NA
Hence:
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
[1,] NA -0.03540916 -0.29183703 0.49358124 0.79846794 0.29490246 0.47661166 -0.51181482 -0.04116772 -0.10797632
[2,] -0.03540916 NA 0.47550478 -0.24284088 -0.01898357 -0.67102287 -0.46488410 0.01125144 0.13355919 0.08738474
[3,] -0.29183703 0.47550478 NA -0.05203104 -0.26311149 0.01120055 -0.16521411 0.49215496 0.40571893 0.30595246
[4,] 0.49358124 -0.24284088 -0.05203104 NA 0.60558581 0.53848638 0.80623397 -0.49950396 -0.01080598 0.41798727
[5,] 0.79846794 -0.01898357 -0.26311149 0.60558581 NA 0.33295170 0.53675545 -0.54756131 0.09225002 -0.01925587
[6,] 0.29490246 -0.67102287 0.01120055 0.53848638 0.33295170 NA 0.72936185 0.09463988 0.14607018 0.19487579
[7,] 0.47661166 -0.46488410 -0.16521411 0.80623397 0.53675545 0.72936185 NA -0.46348644 -0.05275132 0.47619940
[8,] -0.51181482 0.01125144 0.49215496 -0.49950396 -0.54756131 0.09463988 -0.46348644 NA 0.64924510 0.06783324
[9,] -0.04116772 0.13355919 0.40571893 -0.01080598 0.09225002 0.14607018 -0.05275132 0.64924510 NA 0.44698207
[10,] -0.10797632 0.08738474 0.30595246 0.41798727 -0.01925587 0.19487579 0.47619940 0.06783324 0.44698207 NA
Finally by executing:
max(apply(cordf,2,min,na.rm=TRUE),na.rm = TRUE)#avoiding NA's
one can get:
[1] -0.05275132
the maximum of the column-wise minima.
Edit:
In order to get the linear index in the matrix:
> which(cordf==max(apply(cordf,2,min,na.rm=TRUE),na.rm = TRUE))
[1] 68 77
or in order to get the coordinates:
> which(cordf==max(apply(cordf,2,min,na.rm=TRUE),na.rm = TRUE), arr.ind = TRUE)
row col
[1,] 8 7
[2,] 7 8
I am using a dataset to create a barplot in the horizontal orientation, something similar to what has been proposed as a solution in R: How can I make a barplot with labels parallel (horizontal) to bars.
However, there are a few too many labels on the Y axis of my horizontal barplot (due to the problem at hand), and hence they overlap each other.
Is there a way to preserve the barplot bin size and show only a subset of the Y labels in the horizontal orientation of the barplot?
thanks,
rajat
Here's one way to do it: we can use a nice solution to interleave the names from your data with blanks.
generate some data
set.seed(123)
df1 <- data.frame(x = replicate(50, paste(sample(letters, 2, replace = T), collapse = '')),
y = sample(1:10, 50, replace = T), stringsAsFactors = FALSE)
make a barplot, using a subset of the names
barplot(df1$y, names.arg = replace(df1$x, seq(2, 50, by = 2), ''), horiz = TRUE, las = 1)
The main trick is the names.arg = replace(df1$x, seq(2, 50, by = 2), '') part: it blanks out every second name, so each remaining label still lines up with its own bar. Effectively, we are replacing half of the names with blank space.
If that's not sufficient, we can define a function which takes in a vector of names, x, and a multiple, m that defines which values to replace with blanks:
replace_multiple <- function(x, m){
len_x <- length(x)
index_to_replace <- seq(1, len_x, by = m)
x[index_to_replace] <- ''
return(x)
}
replace_multiple(letters[1:12], m = 2)
# "" "b" "" "d" "" "f" "" "h" "" "j" "" "l"
replace_multiple(letters[1:12], m = 3)
# "" "b" "c" "" "e" "f" "" "h" "i" "" "k" "l"
replace_multiple(letters[1:12], m = 4)
# "" "b" "c" "d" "" "f" "g" "h" "" "j" "k" "l"
I would like to find all the possibilities to divide 10 data values into 2 groups of 5.
If I'm right, there are 252 possibilities:
choose(10,5)
252
How can I do it with R?
Thanks!
Here's one possibility:
a <- letters[1:10]
split1 <- combn(a, 5)
split2 <- apply(split1, 2, function(x) a[!a %in% x])
Pick a random one:
set.seed(1)
rnd <- sample(1:ncol(split1), size=1)
split1[, rnd]; split2[, rnd]
# [1] "a" "c" "d" "g" "i"
# [1] "b" "e" "f" "h" "j"
So I will explain in detail what I have to do:
I have 2 sets of data:
cellular_wt = c(1.1656,0.9577,1.3655,0.9016,0.9336)
cellular_mutant = c(2.8896,5.7018,3.595,1.6998,1.8893)
secreted_wt = c(7.8491,6.1546,5.1972,6.1607,5.928)
secreted_mutant = c(4.6801,3.2418,3.6651,3.0678,2.3221)
mean_cellular_wt <- mean(cellular_wt)
mean_cellular_mutant <- mean(cellular_mutant)
mean_secreted_wt <- mean(secreted_wt)
mean_secreted_mutant <- mean(secreted_mutant)
mean_secreted_wt / mean_cellular_wt           # 5.877085
mean_secreted_mutant / mean_cellular_mutant   # 1.076156
mean_ratio <- (mean_secreted_wt / mean_cellular_wt) / (mean_secreted_mutant / mean_cellular_mutant)   # 5.46
I want to run a randomization test on these data to test the significance of the mean ratio.
To do so, I would like to randomly divide these 10 values (cellular_wt + cellular_mutant, and secreted_wt + secreted_mutant) into 2 groups of 5 (as in the initial data sets), and calculate the mean ratio each time.
In this way, I can see whether the observed ratio of 5.46 seems unusually large by comparing it to the 252 ratios that could have been seen due to random assignment alone. Do you understand?
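A minimal sketch of that randomization test, under my assumption that each of the 10 samples keeps its own cellular and secreted value and only the wt/mutant group labels are reshuffled over all choose(10, 5) = 252 possible assignments (variable names are mine):
cellular <- c(cellular_wt, cellular_mutant)
secreted <- c(secreted_wt, secreted_mutant)
splits <- combn(10, 5)   # every way of assigning 5 of the 10 samples to the first group
ratios <- apply(splits, 2, function(idx) {
  (mean(secreted[idx]) / mean(cellular[idx])) /
    (mean(secreted[-idx]) / mean(cellular[-idx]))
})
observed <- (mean(secreted_wt) / mean(cellular_wt)) /
  (mean(secreted_mutant) / mean(cellular_mutant))   # 5.46, as computed above
mean(ratios >= observed)   # share of random assignments with a ratio at least this large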