Horizontal barplot labels overlapping on each other - r

I am using a dataset to create a barplot in the horizontal orientation, similar to what was proposed as a solution in R: How can I make a barplot with labels parallel (horizontal) to bars.
However, my horizontal barplot has too many labels on the Y axis (a consequence of the problem at hand), so they overlap each other.
Is there a way to preserve the barplot bin size and show a subset of the Y labels in the horizontal orientation of the barplot?
thanks,
rajat

Here's one way to do it: blank out every other name so that only half of the labels are drawn.
# generate some data
set.seed(123)
df1 <- data.frame(x = replicate(50, paste(sample(letters, 2, replace = TRUE), collapse = '')),
y = sample(1:10, 50, replace = TRUE), stringsAsFactors = FALSE)
# make a barplot, using a subset of the names
barplot(df1$y, names.arg = replace(df1$x, seq(2, 50, by = 2), ''), horiz = TRUE, las = 1)
The main trick is the names.arg = replace(df1$x, seq(2, 50, by = 2), '') argument. It replaces every second name with an empty string, so half of the labels are left blank while each remaining label stays next to its own bar.
If that's not sufficient, we can define a function which takes a vector of names, x, and a step, m: every m-th name, starting with the first, is replaced with a blank:
replace_multiple <- function(x, m){
len_x <- length(x)
index_to_replace <- seq(1, len_x, by = m)
x[index_to_replace] <- ''
return(x)
}
replace_multiple(letters[1:12], m = 2)
# "" "b" "" "d" "" "f" "" "h" "" "j" "" "l"
replace_multiple(letters[1:12], m = 3)
# "" "b" "c" "" "e" "f" "" "h" "i" "" "k" "l"
replace_multiple(letters[1:12], m = 4)
# "" "b" "c" "d" "" "f" "g" "h" "" "j" "k" "l"
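Note that replace_multiple blanks fewer labels as m grows. If the goal is to thin the labels further, the complementary operation (keep every m-th name and blank the rest) may be more useful. The helper below is a minimal sketch of that idea; keep_every is a hypothetical name, not part of the original answer.

```r
# Hypothetical helper (not from the original answer): keep every m-th label,
# starting with the first, and blank out all the others.
keep_every <- function(x, m) {
  out <- x
  # positions 1, 1 + m, 1 + 2m, ... are kept; everything else becomes ''
  out[(seq_along(x) - 1) %% m != 0] <- ''
  out
}

keep_every(letters[1:12], m = 3)
# "a" "" "" "d" "" "" "g" "" "" "j" "" ""
```

With the example data above, it could be passed straight to the plot, e.g. barplot(df1$y, names.arg = keep_every(df1$x, 5), horiz = TRUE, las = 1).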

Related

Combine a list of similar length vectors with NAs to one vector

This is likely a duplicate, yet I appear to be incapable of finding a similar question atm. I have a list of (very long) vectors that are similar in length. Each vector element contains a character. Sometimes multiple vectors contain characters at the same position (sequential numbering from the beginning). Sometimes none contain a character (i.e. all contain NA). There are maybe 10 of these vectors and each has a length of millions of elements. I need to find a quick and memory-efficient way of combining the vectors to a single vector, preferably without using any dependencies (i.e. no data.table or dplyr). The example is simple and short to understand the concept.
I have:
x <- list(A = c(rep("A", 5), rep(NA, 5)), B = c(rep(NA, 4), rep("B", 5), NA))
I need to combine them to:
c(rep("A", 4), "conflict", rep("B", 4), "none")
# "A" "A" "A" "A" "conflict" "B" "B" "B" "B" "none"
Thank you for help. I should know how to do this but somehow it escapes me atm. I do have an apply solution that goes in row by row but that is inefficient. Need to vectorize the solution.
apply(do.call(cbind, x), 1, function(k) {
if(sum(is.na(k)) == length(k)) {
"none"
} else if (sum(!is.na(k)) == 1) {
k[!is.na(k)]
} else {
"conflict"
}
})
This solution defines a vectorized function f and applies it across the list with Reduce. It assumes that all vectors have the same length, and Reduce is not known for its speed.
f <- function(x, y){
na.x <- is.na(x) | x == "none"
na.y <- is.na(y) | y == "none"
x[na.x & na.y] <- "none"
x[!na.x & !na.y & x != y] <- "conflict"
x[!na.x & na.y] <- x[!na.x & na.y]
x[na.x & !na.y] <- y[na.x & !na.y]
x
}
Reduce(f, x)
# [1] "A" "A" "A" "A" "conflict" "B"
# [7] "B" "B" "B" "none"
Reduce(f, list(A=NA, B = NA, C = 'A'))
#[1] "A"
Here's a vectorised version of your code:
dat <- do.call(cbind, x)
#Logical matrix
mat <- !is.na(dat)
#Number of non-NA's in each row
rs <- rowSums(mat)
#First non-NA value
val <- dat[cbind(1:nrow(dat), max.col(mat, ties.method = 'first'))]
#More than 1 non-NA value
val[rs > 1] <- 'conflict'
#Only NA value
val[rs == 0] <- 'none'
val
#[1] "A" "A" "A" "A" "conflict" "B"
#[7] "B" "B" "B" "none"
EDIT - Updated to include a suggestion from @Henrik to avoid nested ifelse, which should make the solution faster.
Another option:
x <- list(A = c(rep("A", 5), rep(NA, 5)), B = c(rep(NA, 4), rep("B", 5), NA))
y <- apply(do.call('rbind', x), 2, function(x) toString(na.omit(x)))
y[!nzchar(y)] <- 'none'
replace(y, grepl(',', y), 'conflict')
# [1] "A" "A" "A" "A" "conflict" "B" "B" "B" "B" "none"
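Since the question stresses memory use with millions of elements, one more option is to avoid building a matrix at all and walk over the list once, keeping only a counter and the result vector. This is a sketch under the assumption that all vectors have the same length; combine_vectors is a hypothetical name and the approach has not been benchmarked against the answers above.

```r
# Hypothetical helper: combine a list of equal-length character vectors
# without binding them into a matrix first.
combine_vectors <- function(lst) {
  n <- length(lst[[1]])
  count <- integer(n)             # number of non-NA values seen per position
  out <- rep(NA_character_, n)    # first non-NA value seen per position
  for (v in lst) {
    present <- !is.na(v)
    count <- count + present
    fill <- present & is.na(out)
    out[fill] <- v[fill]
  }
  out[count == 0L] <- "none"      # no vector had a value here
  out[count > 1L] <- "conflict"   # more than one vector had a value here
  out
}

x <- list(A = c(rep("A", 5), rep(NA, 5)), B = c(rep(NA, 4), rep("B", 5), NA))
combine_vectors(x)
# [1] "A" "A" "A" "A" "conflict" "B" "B" "B" "B" "none"
```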

improve efficiency of filling dataset with sampling

Given myletters:
library(tidyverse)
myletters <- letters
myletters
# [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m" "n" "o" "p" "q" "r" "s" "t" "u" "v" "w" "x" "y" "z"
I want to sample 4 letters at a time (without replacement) from myletters, repeat this X times, and find the probability of having sampled every letter at least once, for X = 1:100 draws.
For example if X = 10 we could get:
set.seed(10)
X <- unlist(rerun(10, sample(myletters, 4, replace = F)))
X
# [1] "k" "i" "j" "p" "l" "w" "h" "v" "g" "s" "x" "o" "o" "j" "g" "y" "b" "x" "m" "h" "n" "g" "f" "y" "v" "r" "u" "y" "m" "e" "a" "g" "z" "r" "d" "y" "x" "s" "v"
# [40] "r"
#test if X contains all 26 letters
n_distinct(X) == 26 #26 = no of letters
#FALSE
The following approach does what I want in a simulation, but it doesn't scale well: it fills a dataframe column with up to 400 letters per cell, which is awkward and inefficient:
output <- crossing(drawsX = 1:100,
trial = 1:100) %>%
mutate(draws_output = map(drawsX, ~ unlist(rerun(., sample(myletters, 4, replace = F)))),
all_letters = map_lgl(draws_output, ~ n_distinct(.) == 26))
output
#plot
output %>%
group_by(drawsX) %>%
summarise(prob_of_all_letters = mean(all_letters)) %>%
ggplot(., aes(drawsX, prob_of_all_letters)) +
geom_line() +
scale_y_continuous(labels = scales::percent_format()) +
labs(y = "Probability")
Ideally I would like to simulate more times e.g. trial = 1:100000 but the approach above is inefficient if I wanted to do this.
1) Is there a more efficient way to fill my dataset (or a matrix) with samples?
2) Also, is there an analytic way to solve this problem in R instead of simulation? E.g. what is the probability of getting all 26 letters from 10 draws of 4 letters each?
thanks
Here's a somewhat improved version. The code is a bit more efficient and certainly cleaner:
sample_sets = function(replicates, k, set = letters) {
draws = vapply(1:replicates, function(z, ...) sample.int(...), FUN.VALUE = integer(k), n = length(set), size = k, replace = FALSE)
all(seq_along(set) %in% draws)
}
## example use
output <- crossing(
drawsX = 1:100,
trial = 1:100
) %>%
mutate(
outcome = map_lgl(drawsX, sample_sets, set = letters, k = 4),
)
## timing
system.time({output <- crossing(
drawsX = 1:100,
trial = 1:100
) %>%
mutate(
outcome = map_lgl(drawsX, sample_sets, set = letters, k = 4),
)
})
# user system elapsed
# 2.79 0.04 2.95
## original way
system.time({output <- crossing(drawsX = 1:100,
trial = 1:100) %>%
mutate(draws_output = map(drawsX, ~ unlist(rerun(., sample(letters, 4, replace = F)))),
all_letters = map_lgl(draws_output, ~ n_distinct(.) == 26))})
# user system elapsed
# 4.96 0.06 5.18
So it's about 40% faster on this data. Hopefully that performance gain will continue as the number of draws increases.
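The second part of the question (an analytic answer) can be approached with inclusion-exclusion. The sketch below is not part of the original answer, and prob_all_seen is a hypothetical name. It assumes the draws are independent and that each draw picks k distinct letters uniformly at random. A fixed set of j letters is missed by one draw with probability choose(n - j, k) / choose(n, k), so the probability that no letter is missed in any of the draws is an alternating sum over j.

```r
# Sketch: exact probability that `draws` independent draws of k letters
# (each draw without replacement) together cover all n letters.
prob_all_seen <- function(draws, n = 26, k = 4) {
  j <- 0:n  # size of the set of letters assumed to be missed by every draw
  # probability that one draw misses a fixed set of j letters
  miss_one_draw <- choose(n - j, k) / choose(n, k)
  # inclusion-exclusion over all sets of missed letters
  sum((-1)^j * choose(n, j) * miss_one_draw^draws)
}

# probability for each X = 1:100, comparable to prob_of_all_letters above
probs <- vapply(1:100, prob_all_seen, numeric(1))
```

For fewer than 7 draws the true probability is 0 (at most 24 letters can be seen), and the function returns a value that is zero up to floating-point rounding error.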

retrieve points based on Euclidean distance and axis values R

I have a dataframe with 11 variables, represented by the letters a to k, plotted in a two-dimensional scatterplot.
cor<-data.frame(X=c(0.36187115, -0.54755904, -0.82417308, -0.70806545, -0.77422866, -0.70003404,
-0.70043884, 0.73602124,-0.89909694, -0.05937341, 0.93496883),
Y=c(-0.54354070,-0.81211142, -0.52775892, 0.40191296, 0.36820779, 0.28163131,
-0.26161395, -0.26386668,-0.31894766, -0.91541962, -0.04548996),
row.names = letters[1:11]);cor
a<-seq(0,2*pi, length=100)
plot(cos(a),sin(a), type="l", lty=2, xlab = "X", ylab = 'Y')
points(cor[cor$X<0 & cor$Y<0,-3], pch=20, col='blue')
points(cor[cor$X<0 & cor$Y>0,-3], pch=20, col='forestgreen')
points(cor[cor$X>0 & cor$Y<0,-3], pch=20, col='red')
abline(v = 0, h = 0)
text(cor, rownames(cor), pos = 3, cex = 0.8 )
Using Euclidean distance, I observe that the points (d, e, f) and (g, i) are less than 0.3 apart.
d<-dist(cor, method = 'euclidean');d
I want to write R code that identifies all points with distance < 0.3 and retrieves one of these points based on their X and Y values.
Ex: points i and g have a distance of 0.206; based on the axis criterion, point g must be excluded because of its lower X and Y values.
However, I don't have any idea where to begin.
Can somebody help me get started with the code?
Thanks
This function should work, although there's probably a better way to do it.
nearby <- function(data, d){
# full symmetric distance matrix, labelled with the row names of data
dmat <- as.matrix(dist(data))
# keep each pair once: blank the upper triangle and the diagonal
dmat[upper.tri(dmat, diag = TRUE)] <- NA
pairs <- which(dmat < d, arr.ind = TRUE)
# map the row/column indices back to the point names
cbind(rownames(dmat)[pairs[, 2]], rownames(dmat)[pairs[, 1]])
}
So to get pairs for which the distance between them is less than 0.3, do
> nearby(data = cor, d = 0.3)
[,1] [,2]
[1,] "c" "g"
[2,] "c" "i"
[3,] "d" "e"
[4,] "d" "f"
[5,] "e" "f"
[6,] "g" "i"
[7,] "h" "k"
Note that dist() uses every column of data as a coordinate, so data should contain only the numeric coordinates of the points.

Get a random column from R matrix

In R language, I defined a matrix this way:
data <- matrix(c("A","B","C","D","E","F"), nrow = 2)
This gives me something like this:
"A" | "C" | "E"
"B" | "D" | "F"
Now, how do I get a random column of the matrix?
If I do:
sample(x = data, size = 2)
I get random elements from all around the matrix, like "A" and "F". What I want is to get a column like "A" and "B", or "C" and "D", or "E" and "F".
I am new to R, so any help is really appreciated.
I'd use something like this:
f <- function(mat) {
j <- sample(seq_len(ncol(mat)), size=1)
## (Use `drop=FALSE` to say "don't convert 1-column matrices to vectors")
mat[, j, drop=FALSE]
}
## Try it out
f(data)
# [,1]
# [1,] "E"
# [2,] "F"
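The same idea extends to drawing several columns at once. This is a small sketch rather than part of the original answer; random_cols and its n argument are hypothetical names.

```r
data <- matrix(c("A", "B", "C", "D", "E", "F"), nrow = 2)

# Hypothetical helper: draw n distinct columns at random.
random_cols <- function(mat, n = 1) {
  # sample.int() picks n column indices from 1..ncol(mat) without replacement
  j <- sample.int(ncol(mat), size = n)
  # drop = FALSE keeps the result a matrix even when n is 1
  mat[, j, drop = FALSE]
}

random_cols(data, n = 2)
```

Each column of the result is one intact column of the input, so "A" always stays paired with "B", "C" with "D", and "E" with "F".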

An efficient way of converting a vector of characters into integers based on frequency in R

I have a vector of characters consisting of only 'a' or 'g'. I want to convert them to integers based on frequency, i.e. the more frequent one should be coded as 0 and the other as 1. For example:
set.seed(17)
x = sample(c('g', 'a'), 10, replace=T)
x
# [1] "g" "a" "g" "a" "g" "a" "g" "g" "a" "g"
x[x == names(which.max(table(x)))] = 0
x[x != 0] = 1
x
# [1] "0" "1" "0" "1" "0" "1" "0" "0" "1" "0"
This works, but I wonder if there is a more efficient way to do it.
(We don't have to consider the 50%-50% case here, because it should never happen in our study.)
Use this:
ag.encode <- function(x)
{
result <- x == "a"
if( sum(result) > length(result) %/% 2 ) 1-result else as.numeric(result)
}
If you want to keep the labels in a factor structure, use this instead:
ag.encode2factor <- function(x)
{
result <- x == "a"
if( sum(result) > length(result) %/% 2 )
{
factor(2-result, labels=c("a","g"))
}
else
{
factor(result+1, labels=c("g","a"))
}
}
You can convert your character vector to a factor. This solution is more general in the sense that you don't need to know the names of the 2 characters used to create x.
y <- as.integer(factor(x))-1
if(sum(y)>length(y)/2) y <- as.integer(!y)
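For comparison, the approach from the question can also be written as a single comparison against the most frequent value. This is only a compact sketch (encode_by_frequency is a hypothetical name); it has not been timed against the answers above, and like them it assumes there is no tie.

```r
x <- c("g", "a", "g", "a", "g", "a", "g", "g", "a", "g")

# Hypothetical helper: 0 for the most frequent value, 1 for anything else.
encode_by_frequency <- function(x) {
  most_frequent <- names(which.max(table(x)))
  as.integer(x != most_frequent)
}

encode_by_frequency(x)
# [1] 0 1 0 1 0 1 0 0 1 0
```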
