R: trouble with specifying probability for sample function

> sample(c(2), 10, replace = TRUE, prob = 1)
Error in sample.int(x, size, replace, prob) :
incorrect number of probabilities
> sample(c(1), 10, replace = TRUE, prob = 1)
[1] 1 1 1 1 1 1 1 1 1 1
In the first example, I would like to sample from the vector c(2) ten times, with replacement, with probability 1. I would expect the output to be 2 2 2 2 2 2 2 2 2 2.
However, the same call does work with the vector c(1). Why?

Try removing the prob = 1 and what do you get?
> set.seed(123)
> sample(c(2), 10, replace = TRUE)
# [1] 1 2 1 2 2 1 2 2 2 1
help(sample)
Usage
sample(x, size, replace = FALSE, prob = NULL)
If x has length 1, is numeric (in the sense of is.numeric) and x >= 1,
sampling via sample takes place from 1:x. Note that this convenience
feature may lead to undesired behaviour when x is of varying length in
calls such as sample(x). See the examples.
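So it's sampling from 1:2, not from the single value 2. That is also why prob = 1 fails: 1:2 has two elements, so prob would need length 2. With c(1), sampling is from 1:1, which has one element, so a single probability is accepted.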
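A common workaround is to sample positions rather than values, in the same spirit as the examples in help(sample). The sketch below assumes a helper name, safe_sample, which is made up for illustration and is not part of base R:

```r
# Hypothetical helper (not in base R): always sample from the elements of x,
# even when x is a single number, by drawing positions with sample.int().
safe_sample <- function(x, size, replace = FALSE, prob = NULL) {
  x[sample.int(length(x), size = size, replace = replace, prob = prob)]
}

# length(x) is 1 here, so prob = 1 has the right length and every draw is 2
res <- safe_sample(c(2), 10, replace = TRUE, prob = 1)
res  # ten 2s
```

Because the draw is over seq_len(length(x)), the 1:x convenience behaviour never kicks in.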

Related

Efficient way to iteratively store counts in R

I'm having trouble finding an efficient way to store the counts of a vector that changes over time. In my problem I start with an empty vector of length n and at each iteration I add a number to this vector, but I also want some type of object that acts as a counter: if the number that I add is already in the vector, it should add 1 to that number's count, and if it's not, it should add the value as a "name" and set its count to 1.
What I want is something analogous to a Python dict, in which we can use numbers as keys and counts as values, so that I can access both separately with dict.keys() and dict.values().
For example, if I get the values 1, 2, 1, 4 then I would like the object to update as:
> value count
1 1
> value count
1 1
2 1
> value count
1 2
2 1
> value count
1 2
2 1
4 1
and to efficiently access both values and counts separately. I thought of using something like plyr::count on the vector, but I don't think that it's efficient to count at every iteration, especially if n is really large.
Edit: In my problem it's necessary (well, maybe not) to update the counts at every iteration.
What I'm doing is simulating data from a Dirichlet Process using the Polya urn representation. For example, suppose that I have the vector (1.1, 0.2, 0.3, 1.1, 0.2), then to get a new data point one samples from a base distribution (for example a normal distribution) and adds that value with a certain probability, or adds a previous value with a probability proportional to the frequency of the value. With numbers:
Add the sampled value with probability 1/6, or
Add 1.1 with probability 2/6, or 0.2 with probability 2/6, or 0.3 with probability 1/6 (i.e. the probabilities are proportional to the frequencies)
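The probabilities in this example can be reproduced directly, and one urn step can be sketched without keeping a counter at all: picking a position of the existing vector uniformly at random already selects each value with probability proportional to its frequency. The names polya_step, alpha and base_draw are assumptions made for illustration, and alpha = 1 is inferred from the 1/6 in the example:

```r
vec <- c(1.1, 0.2, 0.3, 1.1, 0.2)

# Probabilities from the example: a new value has weight 1, each existing
# value has weight equal to its count (table() sorts values: 0.2, 0.3, 1.1).
probs <- c(new = 1, table(vec)) / (1 + length(vec))

# Hypothetical helper for one Polya urn step. `alpha` is the weight of a
# fresh draw and `base_draw` samples the base distribution (both assumed).
polya_step <- function(vec, alpha = 1, base_draw = function() rnorm(1)) {
  n <- length(vec)
  if (n == 0 || runif(1) < alpha / (alpha + n)) {
    base_draw()               # new value from the base distribution
  } else {
    vec[sample.int(n, 1)]     # uniform position => frequency-weighted value
  }
}

set.seed(1)
vec_next <- c(vec, polya_step(vec))
```

With this formulation the counts only need to be computed once at the end, for example with table().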
The structure you are describing is produced by as.data.frame(table(vec)). There is no need to update the counts as you go along, since calling this line again after each update gives you the current counts:
vec <- c(1, 2, 4, 1)
as.data.frame(table(vec))
#> vec Freq
#> 1 1 2
#> 2 2 1
#> 3 4 1
Suppose I now update vec
vec <- append(vec, c(1, 2, 4, 5))
We get the new counts the same way
as.data.frame(table(vec))
#> vec Freq
#> 1 1 3
#> 2 2 2
#> 3 4 2
#> 4 5 1
Maybe you can use assign and get0 of an environment to update the counts like:
x <- c(1, 2, 1, 4)
y <- new.env()
lapply(x, function(z) {
assign(as.character(z), get0(as.character(z), y, ifnotfound = 0) + 1, y)
setNames(stack(mget(ls(y), y))[2:1], c("value", "count"))
})
#[[1]]
# value count
#1 1 1
#
#[[2]]
# value count
#1 1 1
#2 2 1
#
#[[3]]
# value count
#1 1 2
#2 2 1
#
#[[4]]
# value count
#1 1 2
#2 2 1
#3 4 1
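If only the keys and the counts are needed separately, as with dict.keys() and dict.values(), the same environment idea can be wrapped in a small helper. This is a sketch only; add_value is a hypothetical name, and inherits = FALSE is added so the lookup never falls through to objects outside the counter:

```r
counter <- new.env()

# Hypothetical helper: increment the count stored under the value's name.
add_value <- function(counter, z) {
  key <- as.character(z)
  old <- get0(key, envir = counter, inherits = FALSE, ifnotfound = 0)
  assign(key, old + 1, envir = counter)
  invisible(counter)
}

for (z in c(1, 2, 1, 4)) add_value(counter, z)

keys <- ls(counter)                             # "1" "2" "4"
counts <- unlist(mget(keys, envir = counter))   # 2 1 1, named by key
```

Note that ls() returns the keys sorted as character strings, so numeric order is not guaranteed for keys such as "10" and "2".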

Impute missing values in partial rank data?

I have some rank data with missing values. The highest ranked item was assigned a value of '1'. 'NA' values occur when the item was not ranked.
# sample data
df <- data.frame(Item1 = c(1,2, NA, 2, 3), Item2 = c(3,1,NA, NA, 1), Item3 = c(2,NA, 1, 1, 2))
> df
Item1 Item2 Item3
1 1 3 2
2 2 1 NA
3 NA NA 1
4 2 NA 1
5 3 1 2
I would like to randomly impute the 'NA' values in each row with the appropriate unranked values. One solution that would meet my goal would be this:
> solution1
Item1 Item2 Item3
1 1 3 2
2 2 1 3
3 3 2 1
4 2 3 1
5 3 1 2
This code gives a list of possible replacement values for each row.
# set max possible rank in data
max_val <- 3
# calculate row max
df$row_max <- apply(df, 1, max, na.rm = TRUE)
# calculate number of missing values in each row
df$num_na <- max_val - df$row_max
# set a sample vector
samp_vec <- 1:max_val
# set an empty list (every element starts out as NULL)
replacements <- vector(mode = "list", length = nrow(df))
# generate a list of replacements for each row
for (i in seq_len(nrow(df))) {
  if (df$num_na[i] > 0) {
    # sample positions, not values: sample(3, 1) would draw from 1:3
    candidates <- samp_vec[samp_vec > df$row_max[i]]
    replacements[[i]] <- candidates[sample.int(length(candidates), df$num_na[i])]
  }
  # rows with nothing missing stay NULL; assigning NULL via [[<- would drop the element
}
Now puzzling over how I can assign the values in my list to the missing values in each row of my data.frame. (My actual data has 1000's of rows.)
Is there a clean way to do this?
A base R option using apply -
set.seed(123)
df[] <- t(apply(df, 1, function(x) {
  # Get the ranks that are not present in the row
  val <- setdiff(seq_along(x), x)
  # If only 1 value is missing, fill it in directly (sample(val) would draw from 1:val)
  if (length(val) == 1) x[is.na(x)] <- val
  # If more than 1 value is missing, fill them in a random order
  else if (length(val) > 1) x[is.na(x)] <- sample(val)
  # If nothing is missing, return the row as it is
  x
}))
df
# Item1 Item2 Item3
#1 1 3 2
#2 2 1 3
#3 2 3 1
#4 2 3 1
#5 3 1 2
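A variant of the same idea, sketched below, shuffles positions with sample.int() so that the one-missing-value case needs no special branch, and then checks that every row ends up as a complete ranking. The names fill_row and imputed are illustrative, not from the answer:

```r
df <- data.frame(Item1 = c(1, 2, NA, 2, 3),
                 Item2 = c(3, 1, NA, NA, 1),
                 Item3 = c(2, NA, 1, 1, 2))

# Hypothetical helper: fill the NAs of one row with the unused ranks.
fill_row <- function(x) {
  val <- setdiff(seq_along(x), x)               # ranks not yet used in this row
  x[is.na(x)] <- val[sample.int(length(val))]   # shuffled; fine for 0 or 1 values
  x
}

set.seed(123)
imputed <- df
imputed[] <- t(apply(df, 1, fill_row))

# Every row should now be a permutation of 1:3
all(apply(imputed, 1, function(r) setequal(r, 1:3)))
```

The final line is a cheap sanity check that is worth keeping when the real data has thousands of rows.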

Alternating between values with rep() in R

I am looking for an elegant way of repeating two values according to a given vector in an alternating fashion. It is better stated by example. Take the following code for instance:
vals_to_rep <- c(1, 2)
tms_to_rep <- c(5, 4, 15)
res <- c(rep(1, 5), rep(2, 4), rep(1, 15))
res
In this example, I wish to repeat the values 1 and 2 according to the vector tms_to_rep, starting with 1 (since it is the first element of vals_to_rep), then alternating to 2, back to 1, ...
I wish to continue this process for the length of tms_to_rep-- in this case, three times. The result would look like this:
1 1 1 1 1 2 2 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
If it helps, you can assume vals_to_rep is binary, but no assumptions on length of tms_to_rep.
Thanks!
You can expand vals_to_rep out to the length of tms_to_rep. Then rep() works fine:
vals_to_rep_expanded = rep(vals_to_rep, length.out = length(tms_to_rep))
rep(vals_to_rep_expanded, times = tms_to_rep)
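As a quick check, the two-step approach can be compared against the hand-written result from the question. This sketch only reuses names from the thread, plus res_manual, which is introduced here for the comparison:

```r
vals_to_rep <- c(1, 2)
tms_to_rep <- c(5, 4, 15)

# Recycle the values to one entry per run, then repeat each entry
vals_to_rep_expanded <- rep(vals_to_rep, length.out = length(tms_to_rep))  # 1 2 1
res <- rep(vals_to_rep_expanded, times = tms_to_rep)

# Hand-written version from the question, for comparison
res_manual <- c(rep(1, 5), rep(2, 4), rep(1, 15))
identical(res, res_manual)  # TRUE
```

Because length.out recycles vals_to_rep, this also works when tms_to_rep is longer and when vals_to_rep holds more than two values.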

Sweep equivalent in Julia

From R documentation:
sweep: Return an array obtained from an input array by sweeping out a summary
statistic.
For example, here is how I divide each row by its row sum:
> rs = rowSums(attitude)
> ratios = sweep(attitude, 1, rs, FUN="/")
> head(ratios)
rating complaints privileges learning raises critical advance
1 0.1191136 0.1412742 0.08310249 0.1080332 0.1689751 0.2548476 0.12465374
2 0.1518072 0.1542169 0.12289157 0.1301205 0.1518072 0.1759036 0.11325301
3 0.1454918 0.1434426 0.13934426 0.1413934 0.1557377 0.1762295 0.09836066
4 0.1568123 0.1619537 0.11568123 0.1208226 0.1388175 0.2159383 0.08997429
5 0.1680498 0.1618257 0.11618257 0.1369295 0.1473029 0.1721992 0.09751037
6 0.1310976 0.1676829 0.14939024 0.1341463 0.1646341 0.1493902 0.10365854
> rowSums(ratios) # check that ratios sum up to 1
[1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
My attempt in Julia:
x = rand(3, 4)
x[1, 1] = 10
x[2, 1] = 20
x[3, 1] = 30
rowsum = sum(x, 2)  # Julia 1.0 and later: sum(x, dims=2)
rowsum_mat = repmat(rowsum, 1, size(x, 2))  # Julia 1.0 and later: repeat(rowsum, 1, size(x, 2))
x = x ./ rowsum_mat
This works but is clunky. Is there a more elegant and efficient way of doing this?
No need to use repmat: all of Julia's .-operators do "broadcasting" by default. This means they match the dimensions of the two arguments and then expand any dimensions that have length 1 (the singleton dimensions) to match the other array. Since reductions keep the same dimensionality as the source array, their results can be used directly with any dot-operator.
In your case, you can just use:
x ./ sum(x, 2)  # Julia 1.0 and later: x ./ sum(x, dims=2)
since:
julia> x ./ rowsum_mat == x ./ rowsum
true
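For comparison on the R side, the closest counterpart to this broadcasting is vector recycling: dividing a matrix by a vector with one entry per row recycles the vector down each column, which gives the same result as the sweep call in the question. A small sketch, using a matrix copy of the built-in attitude data:

```r
m <- as.matrix(attitude)

by_sweep <- sweep(m, 1, rowSums(m), FUN = "/")   # approach from the question
by_recycling <- m / rowSums(m)                   # recycling down the columns
by_prop_table <- prop.table(m, margin = 1)       # base R helper for proportions

all.equal(by_sweep, by_recycling)
all.equal(by_sweep, by_prop_table)
```

Recycling only lines up this way for row-wise statistics; for column-wise statistics sweep (or a transpose) is still needed.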

Calculate run length aggregated by subject ID conditional on observation == 1

I am trying to use the rle function in R to calculate the run lengths for the variable positive in the example below, aggregated by the variable id.
Here is a toy dataset (that admittedly has a few quirks):
test <- c('id', 'positive')
test$id <- rep(1:3, c(24, 24, 24))
set.seed(123456)
test$positive <- round(runif(72, 0, 1))
test <- data.frame(test)
test <- subset(test, select = -X.id.)
test <- subset(test, select = -X.positive.)
result <- aggregate(positive ~ id, data = test, FUN = rle)
The way this currently is set up it reads the run lengths for all possible values (0 and 1) of the variable positive. Is it possible to condition this function such that it only evaluates the run lengths when positive == 1?
At the end of the day, I ultimately want to figure out how to count the number of instances in which two or more consecutive months were positive (positive == 1) for each subject.
UPDATE:
I have a variable called event that has values of 0 or 1. For each of the occurrences of two or more positives that were developed from the code featured in the suggestions below, is it possible to stratify our results such that if event == 1 occurs during any of the positive months it would be classified differently than a run of positives in which event == 0 for all of the months?
The toy dataset looks like this:
set.seed(123456)
x <- c(1, 2, 1)
test <- data.frame(id = rep(1:3, each = 24), positive = round(runif(72, 0, 1)), event = round(runif(72, 0, 1)))
results <- aggregate(positive ~ id + event, data = test, FUN=function(x) with(rle(x), sum(lengths > 1 & values == 1)))
aggregate(positive ~ event, data = results, FUN=sum)
However, this code gives all possible permutations of event and positive, while I would like to limit the results to counting only those occurrences of two or more consecutive positive months for which any event == 1. Alternatively, if it is easier to evaluate only the number of consecutive positive months for which all event == 0 that would be a fine solution too.
To count occurrences of two or more consecutive positives, use this:
aggregate(positive ~ id, data=test, FUN=function(x) with(rle(x), sum(lengths>=2 & values==1)))
(inspired by @sgibb's answer.)
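To see what the anonymous function computes, it helps to run rle on a short vector by hand. The vector below is made up for illustration:

```r
x <- c(1, 1, 0, 1, 0, 1, 1, 1)

r <- rle(x)
r$lengths   # 2 1 1 1 3
r$values    # 1 0 1 0 1

# Lengths of the positive runs only
pos_runs <- r$lengths[r$values == 1]   # 2 1 3

# Number of positive runs lasting two or more observations
n_long <- with(r, sum(lengths >= 2 & values == 1))   # 2
```

aggregate then applies exactly this calculation to the positive values of each id.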
EDIT: Counting the number of 2 or more consecutive positives such that any of them has event==1, separated by id:
Calculate the run to which each record belongs:
tmp <- within(test, run <- ave(positive, by=id, FUN=function(x)cumsum(c(1,diff(x)!=0))))
# id positive event run
# 1 1 1 1
# 1 1 0 1
# 1 0 1 2
# 1 0 0 2
# 1 0 1 2
# 1 0 0 2
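The run numbering relies on cumsum(c(1, diff(x) != 0)), which starts a new run whenever the value changes. A small sketch on made-up data (the data frame toy and its values are assumptions for illustration) shows the labels and the marking step that follows:

```r
toy <- data.frame(id       = c(1, 1, 1, 2, 2, 2),
                  positive = c(1, 1, 0, 0, 1, 1),
                  event    = c(0, 1, 0, 0, 0, 0))

# Label runs within each id: the counter increases whenever positive changes
toy$run <- ave(toy$positive, toy$id,
               FUN = function(x) cumsum(c(1, diff(x) != 0)))
toy$run   # 1 1 2 1 2 2

# Mark runs of length >= 2 that contain at least one event == 1
marked <- aggregate(event ~ id + positive + run, data = toy,
                    FUN = function(e) any(e > 0) && length(e) >= 2)
```

Here only the first run of id 1 is marked, since the two-observation run of id 2 has no event.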
For each id and each run mark if there was at least one record with event==1 and run length >= 2:
tmp2 <- aggregate(event~id+positive+run, data=tmp, function(x)any(x>0) && length(x)>=2)
# id positive run event
# 2 0 1 FALSE
# 1 1 1 TRUE
# 3 1 1 FALSE
# 1 0 2 TRUE
# 3 0 2 TRUE
# 2 1 2 TRUE
Now simply count how many marked runs are there in each id and each kind of run (positive==1 or positive==0):
aggregate(event~positive+id, tmp2, sum)
# positive id event
# 0 1 1
# 1 1 2
# 0 2 1
# 1 2 3
# 0 3 3
# 1 3 1
Do you mean something like this?:
aggregate(positive ~ id, data=test, FUN=function(x) {
  r <- rle(x)
  # keep the lengths of the runs of 1s only
  return(r$lengths[r$values == 1])
})
# id positive
# 1 1 2, 1, 1, 7, 1
# 2 2 4, 2, 1, 4, 2, 1, 2
# 3 3 1, 7, 1, 1, 1
A ddply version for the 'at the end of the day' part:
library(plyr)
set.seed(123456)
test <- data.frame(id = rep(1:3, each = 24), positive = round(runif(72, 0, 1)))
ddply(.data = test, .variables = .(id), function(x){
  rl <- rle(x$positive)
  # count the runs of 1s that are longer than one observation
  sum(rl$lengths[rl$values == 1] > 1)
})
# id V1
# 1 1 2
# 2 2 5
# 3 3 1
