Changing the levels of a pooled DataArray - julia

I'm looking for a way to modify the levels of a DataArray:
result = pool(["a", "a", "b"])
levels(result) = ["A", "B"]

As a quick-and-dirty solution, you can change the pool field of the object -- it happens to be mutable.
result.pool = [ "A", "B" ]
result
# 3-element PooledDataArray{ASCIIString,Uint8,1}:
# "A"
# "A"
# "B"
xdump( result )
# PooledDataArray{ASCIIString,Uint8,1}
# refs: Array(Uint8,(3,)) Uint8[0x01,0x01,0x02]
# pool: Array(ASCIIString,(2,)) ASCIIString["a","b"]

Related

Unable to change name value in a vector

named_vector=c(a=1,b=2,c=3,d=4,e=5,f=6,g=7)
names(named_vector)[names(named_vector)=='c'] <- 'k'
names(named_vector[names(named_vector)])=='c'<-'k'
Line 3 fails to rename the member 'c' in named_vector, although line 2 works fine.
The error message is:
Error in names(named_vector[names(named_vector)]) == "c" <- "k" :
could not find function "==<-"
You can index by numeric position:
`names(named_vector)[3] <- "new name" `
Line 3 doesn't work because you're nesting your data too much. If you break this down
names(named_vector[names(named_vector)]) == 'c' <- 'k'
you get
# Gives you all the names back
names(named_vector)
# [1] "a" "b" "c" "d" "e" "f" "g"
# Putting it back in, you simply get all the values again
names(named_vector[c("a", "b", "c", "d", "e", "f", "g")])
# The inner part simply gives you the `named_vector` again
named_vector[c("a", "b", "c", "d", "e", "f", "g")]
# a b c d e f g
# 1 2 3 4 5 6 7
On top of that, the left-hand side of the assignment is a comparison, which evaluates to a logical vector rather than to something you can assign into:
names(named_vector[names(named_vector)]) == 'c'
# [1] FALSE FALSE TRUE FALSE FALSE FALSE FALSE
Line 2 works because you index the names of the vector with the logical result of comparing each name to the label you wish to change, and then assign into that selection.
names(named_vector)[names(named_vector) == 'c'] <- 'k'
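The working pattern can be broken into its two steps. This is only a minimal sketch that reuses named_vector from the question; the intermediate variable is_c is an illustrative name and not part of the original code.

```r
named_vector <- c(a = 1, b = 2, c = 3, d = 4, e = 5, f = 6, g = 7)

# Step 1: a logical vector marking which names equal "c"
is_c <- names(named_vector) == "c"

# Step 2: assign into the selected positions of the names vector
names(named_vector)[is_c] <- "k"

names(named_vector)
```

The values are untouched; only the label at the selected position changes.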

Subsetting in R (Index Explanation)

a <- c("a", "b", "c", "d", "e")
u <- a > "a"
a[u]
The code gives me the output as: "b" "c" "d" "e".
What does a[u] mean? Does vector a get a new index u of a vector type?
u is a logical vector which is used to subset a.
u
#[1] FALSE TRUE TRUE TRUE TRUE
a[u] keeps the elements of a at the positions where u is TRUE. The 1st element of u is FALSE, so "a" is dropped and the rest are selected:
a[u]
#[1] "b" "c" "d" "e"
It will be more clear with another example. Consider
a <- 11:15
u <- c(FALSE, TRUE, TRUE, FALSE, TRUE)
a[u]
#[1] 12 13 15
So all the elements in a where u is TRUE are selected i.e 12, 13 and 15.
You can figure this out yourself by looking at the contents of the u vector:
u <- a > "a"
u
[1] FALSE TRUE TRUE TRUE TRUE
When you then subset the vector a using this boolean vector u, you are telling R to output a vector consisting only of the elements for which the corresponding index value is TRUE. This leaves you with just:
[1] "b" "c" "d" "e"
To be more explicit:
"a" "b" "c" "d" "e"
 F   T   T   T   T
 ^   |___________|
drop keep the rest
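One way to convince yourself of this is to compare logical indexing with positional indexing. The sketch below assumes the same a and u as in the question; kept_by_logical and kept_by_position are illustrative names.

```r
a <- c("a", "b", "c", "d", "e")
u <- a > "a"   # element-wise comparison, gives a logical vector

kept_by_logical  <- a[u]          # keep positions where u is TRUE
kept_by_position <- a[which(u)]   # which(u) lists those positions: 2 3 4 5

identical(kept_by_logical, kept_by_position)
```

Both forms select the same elements; the logical vector is simply a mask with one entry per element of a.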

Create array of values based on dictionary and array of keys

I'm new to Julia, so I'm sorry if this is a basic question.
Say we have a dictionary, and a vector of keys:
X = [2, 1, 1, 3]
d = Dict( 1 => "A", 2 => "B", 3 => "C")
I want to create a new array which contains values instead of keys (according to the dictionary), so the end result would be something like
Y = ["B", "A", "A", "C"]
I suppose I could iterate over the vector elements, look it up in the dictionary and return the corresponding value, but this seems awfully inefficient to me.
Something like
Y = Array{String}(undef, length(X))
for i in 1:length(X)
Y[i] = d[X[i]]
end
EDIT: Also, my proposed solution doesn't work if X contains missing values.
So my question is if there is some more efficient way of doing this (I'm doing it with a much larger array and dictionary), or is this an appropriate way of doing it?
Efficiency can mean different things in different contexts, but I would probably do:
Y = [d[i] for i in X]
If X contains missing values, you could use skipmissing(X) in the comprehension.
You can use an array comprehension to do this pretty tersely:
julia> [d[x] for x in X]
4-element Array{String,1}:
"B"
"A"
"A"
"C"
In the future it may be possible to write d.[X] to express this even more concisely, but as of Julia 1.3, that is not yet allowed.
As per the edit to the question, let's suppose there is a missing value somewhere in X:
julia> X = [2, 1, missing, 1, 3]
5-element Array{Union{Missing, Int64},1}:
2
1
missing
1
3
If you want to map missing to missing or some other value like the string "?" you can do that explicitly like this:
julia> [ismissing(x) ? missing : d[x] for x in X]
5-element Array{Union{Missing, String},1}:
"B"
"A"
missing
"A"
"C"
julia> [ismissing(x) ? "?" : d[x] for x in X]
5-element Array{String,1}:
"B"
"A"
"?"
"A"
"C"
If you're going to do that a lot, it might be easier to put missing in the dictionary like this:
julia> d = Dict(missing => "?", 1 => "A", 2 => "B", 3 => "C")
Dict{Union{Missing, Int64},String} with 4 entries:
2 => "B"
missing => "?"
3 => "C"
1 => "A"
julia> [d[x] for x in X]
5-element Array{String,1}:
"B"
"A"
"?"
"A"
"C"
If you want to simply skip over missing values, you can use skipmissing(X) instead of X:
julia> [d[x] for x in skipmissing(X)]
4-element Array{String,1}:
"B"
"A"
"A"
"C"
There's generally not a single correct way to handle missing values, which is why you need to explicitly code how to handle missing data.

Random sequence from fixed ensemble that contains at least one of each character

I am trying to generate a random sequence from a fixed number of characters that contains at least one of each character.
For example having the ensemble
m = letters[1:3]
I would like to create a sequence of N = 10 elements that contain at least one of each m characters, like
a
a
a
a
b
c
c
c
c
a
I tried with sample(m, N, replace=TRUE) but in this way also a sequence like
a
a
a
a
a
c
c
c
c
a
can be generated that does not contain b.
f <- function(x, n){
  # one copy of every element of x, topped up with random draws from x, then shuffled
  sample(c(x, sample(x, n-length(x), replace=TRUE)))
}
f(letters[1:3], 5)
# [1] "a" "c" "a" "b" "a"
f(letters[1:3], 5)
# [1] "a" "a" "b" "b" "c"
f(letters[1:3], 5)
# [1] "a" "a" "b" "c" "a"
f(letters[1:3], 5)
# [1] "b" "c" "b" "c" "a"
Josh O'Brien's answer is a good way to do it but doesn't provide much input checking. Since I had already written mine, I might as well post it. It's pretty much the same thing, but it considers only unique items and checks that n is large enough to include at least one of each.
at_least_one_samp <- function(n, input){
  # Only consider unique items.
  items <- unique(input)
  unique_items_count <- length(items)
  if(unique_items_count > n){
    stop("n is too small to include at least one of each unique item")
  }
  # Get values for vector - force each item in at least once
  # then randomly select values to get the remaining.
  vals <- c(items, sample(items, n - unique_items_count, replace = TRUE))
  # Now shuffle them
  sample(vals)
}
m <- c("a", "b", "c")
at_least_one_samp(10, m)
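To check that this kind of approach really guarantees one of each item, you can draw many sequences and test each one. The following is a sketch of the same "force one of each, fill the rest, shuffle" idea under a hypothetical name, sample_with_all; the seed and the number of replications are arbitrary choices.

```r
# Sketch: force one copy of each item, fill the rest at random, then shuffle
sample_with_all <- function(items, n) {
  items <- unique(items)
  stopifnot(n >= length(items))
  # sample.int on positions avoids sample()'s special case for a single number
  extra <- items[sample.int(length(items), n - length(items), replace = TRUE)]
  sample(c(items, extra))
}

set.seed(1)  # arbitrary seed, only to make the sketch reproducible
draws <- replicate(1000, sample_with_all(c("a", "b", "c"), 10), simplify = FALSE)

# Every draw has length 10 and contains all three letters
all_ok <- all(vapply(
  draws,
  function(d) length(d) == 10 && all(c("a", "b", "c") %in% d),
  logical(1)
))
all_ok
```

Note that the resulting sequences are not uniformly distributed over all valid sequences, since one copy of each item is always forced in before the random fill.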

Shuffling a vector - all possible outcomes of sample()?

I have a vector with five items.
my_vec <- c("a","b","a","c","d")
If I want to re-arrange those values into a new vector (shuffle), I could use sample():
shuffled_vec <- sample(my_vec)
Easy - but the sample() function only gives me one possible shuffle. What if I want to know all possible shuffling combinations? The various "combn" functions don't seem to help, and expand.grid() gives me every possible combination with replacement, when I need it without replacement. What's the most efficient way to do this?
Note that in my vector, I have the value "a" twice - therefore, in the set of shuffled vectors returned, they all should each have "a" twice in the set.
I think permn from the combinat package does what you want
library(combinat)
permn(my_vec)
A smaller example
> x
[1] "a" "a" "b"
> permn(x)
[[1]]
[1] "a" "a" "b"
[[2]]
[1] "a" "b" "a"
[[3]]
[1] "b" "a" "a"
[[4]]
[1] "b" "a" "a"
[[5]]
[1] "a" "b" "a"
[[6]]
[1] "a" "a" "b"
If the duplicates are a problem you could do something similar to this to get rid of duplicates
strsplit(unique(sapply(permn(my_vec), paste, collapse = ",")), ",")
Or probably a better approach to removing duplicates...
dat <- do.call(rbind, permn(my_vec))
dat[!duplicated(dat),]
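If you do not have combinat installed, the same de-duplication idea can be tried in base R. This is only a sketch: all_perms is a hypothetical helper written for illustration (it is not permn), and it builds every ordering recursively before dropping repeated rows.

```r
# Hypothetical helper: all orderings of a vector, one per row (base R only)
all_perms <- function(v) {
  if (length(v) <= 1) return(matrix(v, nrow = 1))
  out <- NULL
  for (i in seq_along(v)) {
    rest <- all_perms(v[-i])            # orderings of the remaining elements
    out <- rbind(out, cbind(v[i], rest)) # put v[i] in front of each of them
  }
  out
}

my_vec <- c("a", "b", "a", "c", "d")
perms <- all_perms(my_vec)

# Negate duplicated() to keep the first occurrence of each row
distinct <- perms[!duplicated(perms), , drop = FALSE]

nrow(perms)     # 120 orderings in total (5!)
nrow(distinct)  # 60 distinct ones, because "a" appears twice (5! / 2!)
```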
Noting that your data is effectively 5 levels from 1-5, encoded as "a", "b", "a", "c", and "d", I went looking for ways to get the permutations of the numbers 1-5 and then remap those to the levels you use.
Let's start with the input data:
my_vec <- c("a","b","a","c","d") # the character
my_vec_ind <- seq(1,length(my_vec),1) # their identifier
To get the permutations, I applied the function given at Generating all distinct permutations of a list in R:
permutations <- function(n){
  if(n==1){
    return(matrix(1))
  } else {
    sp <- permutations(n-1)
    p <- nrow(sp)
    A <- matrix(nrow=n*p,ncol=n)
    for(i in 1:n){
      A[(i-1)*p+1:p,] <- cbind(i,sp+(sp>=i))
    }
    return(A)
  }
}
First, create a data.frame with the permutations:
tmp <- data.frame(permutations(length(my_vec)))
You now have a data frame tmp of 120 rows, where each row is a unique permutation of the numbers, 1-5:
>tmp
X1 X2 X3 X4 X5
1 1 2 3 4 5
2 1 2 3 5 4
3 1 2 4 3 5
...
119 5 4 3 1 2
120 5 4 3 2 1
Now you need to remap them to the strings you had. You can remap them using a variation on the theme of gsub(), proposed here: R: replace characters using gsub, how to create a function?
gsub2 <- function(pattern, replacement, x, ...) {
  for(i in 1:length(pattern))
    x <- gsub(pattern[i], replacement[i], x, ...)
  x
}
gsub() won't work because you have more than one value in the replacement array.
You also need a function you can call using lapply() to use the gsub2() function on every element of your tmp data.frame.
remap <- function(x,
old,
new){
return(gsub2(pattern = old,
replacement = new,
fixed = TRUE,
x = as.character(x)))
}
Almost there. We do the mapping like this:
shuffled_vec <- as.data.frame(lapply(tmp,
remap,
old = as.character(my_vec_ind),
new = my_vec))
which can be simplified to...
shuffled_vec <- as.data.frame(lapply(data.frame(permutations(length(my_vec))),
remap,
old = as.character(my_vec_ind),
new = my_vec))
.. should you feel the need.
That gives you your required answer:
> shuffled_vec
X1 X2 X3 X4 X5
1 a b a c d
2 a b a d c
3 a b c a d
...
119 d c a a b
120 d c a b a
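As an alternative to string replacement, the remapping step can probably also be done by plain indexing, since each number in a permutation row is just a position in my_vec. The sketch below uses a shortened three-element vector and a hand-written index matrix idx so that it stays self-contained; both are assumptions made for illustration.

```r
my_vec <- c("a", "b", "a")

# All orderings of the positions 1..3, one per row (written out by hand here)
idx <- rbind(c(1, 2, 3), c(1, 3, 2), c(2, 1, 3),
             c(2, 3, 1), c(3, 1, 2), c(3, 2, 1))

# Indexing a plain vector with a matrix treats the matrix as a vector of positions,
# so the result is reshaped back to the dimensions of idx
shuffled <- matrix(my_vec[idx], nrow = nrow(idx))
shuffled
```

Each row of shuffled is my_vec reordered by the matching row of idx, with no pattern matching involved.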
Looking at a previous question (R: generate all permutations of vector without duplicated elements), I can see that the gtools package has a function for this. I couldn't however get this to work directly on your vector as such:
library(gtools)
permutations(n = 5, r = 5, v = my_vec)
#Error in permutations(n = 5, r = 5, v = my_vec) :
# too few different elements
You can adapt it however like so:
apply(permutations(n = 5, r = 5), 1, function(x) my_vec[x])
# [,1] [,2] [,3] [,4]
#[1,] "a" "a" "a" "a" ...
#[2,] "b" "b" "b" "b" ...
#[3,] "a" "a" "c" "c" ...
#[4,] "c" "d" "a" "d" ...
#[5,] "d" "c" "d" "a" ...
