I am looking for an elegant way of repeating two values according to a given vector in an alternating fashion. It is better stated by example. Take the following code for instance:
vals_to_rep <- c(1, 2)
tms_to_rep <- c(5, 4, 15)
res <- c(rep(1, 5), rep(2, 4), rep(1, 15))
res
In this example, I wish to repeat the values 1 and 2 according to the vector tms_to_rep, starting with 1 (since it is the first element of vals_to_rep), then alternating to 2, back to 1, and so on.
I wish to continue this process for the length of tms_to_rep, in this case three times. The result would look like this:
1 1 1 1 1 2 2 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
If it helps, you can assume vals_to_rep is binary, but no assumptions on length of tms_to_rep.
Thanks!
You can expand vals_to_rep out to the length of tms_to_rep. Then rep() works fine:
vals_to_rep_expanded = rep(vals_to_rep, length.out = length(tms_to_rep))
rep(vals_to_rep_expanded, times = tms_to_rep)
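If you need this in more than one place, the two steps can be wrapped in a small function. This is only a sketch; the name rep_alternating is made up for illustration and is not part of base R.

```r
# Hypothetical helper: recycle `vals` to the length of `times`,
# then repeat each recycled value the requested number of times.
rep_alternating <- function(vals, times) {
  expanded <- rep(vals, length.out = length(times))
  rep(expanded, times = times)
}

res <- rep_alternating(c(1, 2), c(5, 4, 15))
length(res)  # 24, i.e. 5 + 4 + 15
```

Because the recycling is done by rep(), this also works when vals_to_rep has more than two values.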
I'm having trouble finding an efficient way to store the counts of a vector that changes over time. In my problem I start with an empty vector of length n and at each iteration I add a number to it. I also want some type of object that acts as a counter: if the number I add is already in the vector, its count should increase by 1, and if it is not, the value should be added as a "name" with its count set to 1.
What I want is something analogous to a Python dictionary, in which numbers are keys and counts are values, so that I can access both separately with dict.keys() and dict.values().
For example, if I get the values 1, 2, 1, 4 then I would like the object to update as:
> value count
1 1
> value count
1 1
2 1
> value count
1 2
2 1
> value count
1 2
2 1
4 1
and to efficiently access both the values and the counts separately. I thought of using something like plyr::count on the vector, but I don't think it's efficient to count at every iteration, especially if n is really large.
Edit: In my problem it's necessary (well, maybe not) to update the counts at every iteration.
What I'm doing is simulating data from a Dirichlet Process using the Polya urn representation. For example, suppose that I have the vector (1.1, 0.2, 0.3, 1.1, 0.2), then to get a new data point one samples from a base distribution (for example a normal distribution) and adds that value with a certain probability, or adds a previous value with a probability proportional to the frequency of the value. With numbers:
Add the sampled value with probability 1/6, or
Add 1.1 with probability 2/6, or 0.2 with probability 2/6, or 0.3 with probability 1/6 (i.e. the probabilities are proportional to the frequencies)
The structure you are describing is produced by as.data.frame(table(vec)). There is no need to update the counts as you go along, since calling this line again gives you the updated counts:
vec <- c(1, 2, 4, 1)
as.data.frame(table(vec))
#> vec Freq
#> 1 1 2
#> 2 2 1
#> 3 4 1
Suppose I now update vec
vec <- append(vec, c(1, 2, 4, 5))
We get the new counts the same way
as.data.frame(table(vec))
#> vec Freq
#> 1 1 3
#> 2 2 2
#> 3 4 2
#> 4 5 1
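To get the values and the counts separately, roughly like dict.keys() and dict.values(), you can pull them out of the table itself. A minimal sketch, assuming the values are numeric (the names of a table are always character, so they are converted back here):

```r
vec <- c(1, 2, 1, 4)
tab <- table(vec)

# "Keys": the distinct values, converted back from the table's character names
keys <- as.numeric(names(tab))

# "Values": the counts, as a plain vector without names
counts <- as.vector(tab)
```

Here keys holds the distinct values in sorted order and counts holds the matching frequencies in the same order.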
Maybe you can use assign and get0 of an environment to update the counts like:
x <- c(1, 2, 1, 4)
y <- new.env()
lapply(x, function(z) {
  assign(as.character(z), get0(as.character(z), y, ifnotfound = 0) + 1, y)
  setNames(stack(mget(ls(y), y))[2:1], c("value", "count"))
})
#[[1]]
# value count
#1 1 1
#
#[[2]]
# value count
#1 1 1
#2 2 1
#
#[[3]]
# value count
#1 1 2
#2 2 1
#
#[[4]]
# value count
#1 1 2
#2 2 1
#3 4 1
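For the Polya urn use case in the question, it may be worth noting that picking an element of the existing vector uniformly at random already selects each distinct value with probability proportional to its frequency, so the counts are not strictly needed for the sampling step. The following is only a sketch under assumptions that are not in the question: the function name polya_urn, the concentration parameter alpha (alpha = 1 reproduces the 1/6 in the example) and a standard normal base distribution are all illustrative choices.

```r
# Hypothetical helper: simulate n draws from a Polya urn.
# `alpha` and `base` are assumptions made for this sketch.
polya_urn <- function(n, alpha = 1, base = function() rnorm(1)) {
  vec <- numeric(n)
  for (i in seq_len(n)) {
    # A fresh value is drawn with probability alpha / (alpha + i - 1);
    # the first draw always comes from the base distribution.
    if (i == 1 || runif(1) < alpha / (alpha + i - 1)) {
      vec[i] <- base()
    } else {
      # Uniform pick among the i - 1 earlier draws, i.e. each distinct
      # value is chosen with probability proportional to its frequency.
      vec[i] <- vec[sample.int(i - 1, 1)]
    }
  }
  vec
}

set.seed(1)
draws <- polya_urn(20)
freq <- as.data.frame(table(draws))  # counts computed once, at the end
```

The vector is preallocated to length n, and the counts are tabulated only when they are actually needed.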
I have several column names in my dataset (eat10.18) that have the suffix "_10p". I'd like to change that suffix to "_p_10" but preserve the rest of the variable name. I only want this to affect columns that end in the exact string "_10p". I cannot figure out how to make this work with rename_with(). Can anyone help? Faux data below:
eat10.18 <- data.frame(id = c(1000, 1001, 1002),
eat_10 = c(2, 4, 1),
eat_10p = c(1, 2, 3),
run_10p = c(1, 1, 2))
In the above example, the variables "id" and "eat_10" would remain the same, but "eat_10p" and "run_10p" would become "eat_p_10" and "run_p_10"
Thanks for your help!
library(tidyverse)
eat10.18 %>%
rename_with(~str_replace(.,'_10p$', '_p_10'))
id eat_10 eat_p_10 run_p_10
1 1000 2 1 1
2 1001 4 2 1
3 1002 1 3 2
I suggest using gsub and referring to this post. Anchoring the pattern with $ restricts the replacement to names that end in "_10p".
names(eat10.18) <- gsub(x = names(eat10.18), pattern = "_10p$", replacement = "_p_10")
Result
id eat_10 eat_p_10 run_p_10
1000 2 1 1
1001 4 2 1
1002 1 3 2
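To see why the end-of-string anchor matters, here is a small base R sketch. The column x_10p_total is hypothetical and only added to show a name that contains "_10p" without ending in it.

```r
# Hypothetical data: x_10p_total contains "_10p" but does not end with it
df <- data.frame(id = 1:2, eat_10p = 1:2, x_10p_total = 3:4)

# Without the anchor, the match inside x_10p_total is replaced too
unanchored <- gsub("_10p", "_p_10", names(df))

# With the anchor, only names that end in "_10p" are changed
anchored <- sub("_10p$", "_p_10", names(df))
anchored  # "id" "eat_p_10" "x_10p_total"
```

sub() is enough here because an anchored pattern can match at most once per name.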
I have this data frame:
Data <- structure(list(ID = c(101, 102, 103, 104, 105, 106
), `1Var` = c(1, 3, 3, 1, 1, 1), `2Var` = c(1, 1,
1, 1, 1, 1), `3Var` = c(3, 1, 1, 1, 1, 1), `4Var` = c(1,
1, 1, 1, 1, 1)), row.names = c(NA, 6L), class = "data.frame")
I have been trying to subset based on values of 1 and 0. In this data table there are no 0 values but my full data has it.
I toyed around with this method:
Prime <- grep('$Var', names(Data))
DataPrime <- Data[rowSums(Data[Prime] <= 1),]
I am getting duplicated observations, though. Another issue with this method is that it keeps all rows that have a 1 or 0, not just rows with ONLY 1 or 0. So a row that has a 3 in one variable and 1 in the rest is still kept in my data.
I think my method will work but I'm not sure what else I need to specify in the argument. I tried a simple subset too but that removed everything from the data:
DataPrime <- subset(Data, '1Var' <=1, '2Var' <=1, '3Var' <=1, '4Var' <=1)
I essentially want my data to look something like this:
ID 1Var 2Var 3Var 4Var
4 104 1 1 1 1
5 105 1 1 1 1
6 106 1 1 1 1
We can use Reduce with & to create a logical vector for subsetting the rows
subset(Data, Reduce(`&`, lapply(Data[-1], `<=`, 1)))
-output
# ID 1Var 2Var 3Var 4Var
#4 104 1 1 1 1
#5 105 1 1 1 1
#6 106 1 1 1 1
Or another option is rowSums
subset(Data, !rowSums(Data[-1] > 1))
I think you're looking for something like:
Prime <- grep('\\dVar', names(Data))
Data[apply(Data[Prime], 1, function(x) !any(x > 1)),]
#> ID 1Var 2Var 3Var 4Var
#> 4 104 1 1 1 1
#> 5 105 1 1 1 1
#> 6 106 1 1 1 1
A few things to note are:
Your regex inside grep was wrong. The "$" symbol represents the end of a string, not a number. For numbers you can use \\d . Your Prime variable is therefore empty in the example.
It's best not to have column names (or any variable name) starting with numbers. These are not legal names in R. You can get round this by surrounding them with backticks, but this is easy to overlook and is a source of bugs.
rowSums adds up all the values in each row, so the lowest sum of any of the rows is 4, whereas rowSums(Data[Prime] <= 1) gives the total number of entries that are one or less, giving a vector like c(3, 3, 3, 4, 4, 4). Subsetting Data by this will give 3 copies of row 3 then three copies of row 4, which clearly isn't what you want.
In subset, you need the logical conjunction of all your var <= 1 terms, so you should split these with &, not with commas.
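The points above can be checked on the example data. This is a sketch that rebuilds the data frame with data.frame() rather than structure(), using check.names = FALSE so the names starting with digits are kept as they are.

```r
# Rebuild the example data; check.names = FALSE keeps the non-syntactic names
Data <- data.frame(ID = 101:106,
                   `1Var` = c(1, 3, 3, 1, 1, 1),
                   `2Var` = rep(1, 6),
                   `3Var` = c(3, 1, 1, 1, 1, 1),
                   `4Var` = rep(1, 6),
                   check.names = FALSE)

# '$Var' never matches, while '\\dVar' finds the four variable columns
empty <- grep('$Var', names(Data))
Prime <- grep('\\dVar', names(Data))

# Number of entries <= 1 in each row; used as a row index, this
# repeats rows 3 and 4 instead of filtering
idx <- rowSums(Data[Prime] <= 1)

# Conditions joined with &, column names wrapped in backticks
DataPrime <- subset(Data, `1Var` <= 1 & `2Var` <= 1 & `3Var` <= 1 & `4Var` <= 1)
DataPrime$ID  # 104 105 106
```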
> sample(c(2), 10, replace = TRUE, prob = 1)
Error in sample.int(x, size, replace, prob) :
incorrect number of probabilities
> sample(c(1), 10, replace = TRUE, prob = 1)
[1] 1 1 1 1 1 1 1 1 1 1
In the first example, I would like to sample the vector 2 ten times, with replacement, each with probability = 1. I would expect the output to be 2 2 2 2 2 2 2 2 2 2
However, it seems to work with a vector of 1?
Try removing the prob = 1 and what do you get?
> set.seed(123)
> sample(c(2), 10, replace = TRUE)
# [1] 1 2 1 2 2 1 2 2 2 1
help(sample)
Usage
sample(x, size, replace = FALSE, prob = NULL)
If x has length 1, is numeric (in the sense of is.numeric) and x >= 1,
sampling via sample takes place from 1:x. Note that this convenience
feature may lead to undesired behaviour when x is of varying length in
calls such as sample(x). See the examples.
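So it's sampling from 1:2, not from the single value 2. That is also why prob = 1 fails in the first call: two values need two probabilities. With c(1), sample draws from 1:1, so a single probability is enough.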
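If x can have length one, a safer pattern is to sample positions with sample.int() and then index into x. The wrapper below follows the resample idea shown in the examples of help(sample); it is a sketch, not a base R function.

```r
# Sample positions, then index into x, so a length-one x is never
# expanded to 1:x
resample <- function(x, ...) x[sample.int(length(x), ...)]

res <- resample(c(2), 10, replace = TRUE, prob = 1)
res  # 2 2 2 2 2 2 2 2 2 2
```

When the vector is known to contain a single value, rep(2, 10) gives the same result without any sampling.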
I am trying to use the rle function in R to calculate the run lengths for the variable positive in the example below, aggregated by the variable id.
Here is a toy dataset (that admittedly has a few quirks):
test <- c('id', 'positive')
test$id <- rep(1:3, c(24, 24, 24))
set.seed(123456)
test$positive <- round(runif(72, 0, 1))
test <- data.frame(test)
test <- subset(test, select = -X.id.)
test <- subset(test, select = -X.positive.)
result <- aggregate(positive ~ id, data = test, FUN = rle)
The way this currently is set up it reads the run lengths for all possible values (0 and 1) of the variable positive. Is it possible to condition this function such that it only evaluates the run lengths when positive == 1?
At the end of the day, I ultimately want to figure out how to count the number of instances in which two or more consecutive months were positive (positive == 1) for each subject.
UPDATE:
I have a variable called event that has values of 0 or 1. For each of the occurrences of two or more positives that were developed from the code featured in the suggestions below, is it possible to stratify our results such that if event == 1 occurs during any of the positive months it would be classified differently than a run of positives in which event == 0 for all of the months?
The toy dataset looks like this:
set.seed(123456)
x <- c(1, 2, 1)
test <- data.frame(id = rep(1:3, each = 24), positive = round(runif(72, 0, 1)), event = round(runif(72, 0, 1)))
results <- aggregate(positive ~ id + event, data = test, FUN=function(x) with(rle(x), sum(lengths > 1 & values == 1)))
aggregate(positive ~ event, data = results, FUN=sum)
However, this code gives all possible permutations of event and positive, while I would like to delimit the results to counting only those occurrences of two or more consecutive positive months for which any event == 1. Alternatively, if it is easier to evaluate only the number of consecutive positive months for which all event == 0 that would be a fine solution too.
To count occurrences of two or more consecutive positives, use this:
aggregate(positive ~ id, data=test, FUN=function(x) with(rle(x), sum(lengths>=2 & values==1)))
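To see what the function passed to aggregate does, here is the same expression on a small vector (toy data made up for illustration):

```r
x <- c(1, 1, 0, 1, 0, 1, 1, 1)
r <- rle(x)
r$lengths  # 2 1 1 1 3
r$values   # 1 0 1 0 1

# Runs of 1s that last two or more positions: the first and the last run
n_runs <- sum(r$lengths >= 2 & r$values == 1)
n_runs  # 2
```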
(Inspired by @sgibb's answer.)
EDIT: Counting the number of 2 or more consecutive positives such that any of them has event==1, separated by id:
Calculate the run to which each record belongs:
tmp <- within(test, run <- ave(positive, by=id, FUN=function(x)cumsum(c(1,diff(x)!=0))))
# id positive event run
# 1 1 1 1
# 1 1 0 1
# 1 0 1 2
# 1 0 0 2
# 1 0 1 2
# 1 0 0 2
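The run numbering and the marking rule can be checked on a short vector. This is a sketch with made-up toy values for a single id; tapply stands in for the grouped aggregate call.

```r
positive <- c(1, 1, 0, 0, 1)
event    <- c(0, 1, 0, 0, 0)

# A new run starts wherever the value differs from the previous one
run <- cumsum(c(1, diff(positive) != 0))
run  # 1 1 2 2 3

# Mark runs that have length >= 2 and at least one event == 1
flagged <- tapply(event, run, function(e) any(e > 0) && length(e) >= 2)
as.vector(flagged)  # TRUE FALSE FALSE
```

Only the first run is marked: the second has no event and the third has length one.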
For each id and each run mark if there was at least one record with event==1 and run length >= 2:
tmp2 <- aggregate(event~id+positive+run, data=tmp, function(x)any(x>0) && length(x)>=2)
# id positive run event
# 2 0 1 FALSE
# 1 1 1 TRUE
# 3 1 1 FALSE
# 1 0 2 TRUE
# 3 0 2 TRUE
# 2 1 2 TRUE
Now simply count how many marked runs there are in each id and each kind of run (positive==1 or positive==0):
aggregate(event~positive+id, tmp2, sum)
# positive id event
# 0 1 1
# 1 1 2
# 0 2 1
# 1 2 3
# 0 3 3
# 1 3 1
Do you mean something like this?:
aggregate(positive ~ id, data=test, FUN=function(x) {
  r <- rle(x)
  # keep only the lengths of the runs of 1s
  r$lengths[r$values == 1]
})
# id positive
# 1 1 2, 1, 1, 7, 1
# 2 2 4, 2, 1, 4, 2, 1, 2
# 3 3 1, 7, 1, 1, 1
A ddply version for the 'at the end of the day' part:
library(plyr)
set.seed(123456)
test <- data.frame(id = rep(1:3, each = 24), positive = round(runif(72, 0, 1)))
ddply(.data = test, .variables = .(id), function(x){
  rl <- rle(x$positive)
  # number of runs of 1s that last two or more months
  sum(rl$lengths[rl$values == 1] > 1)
})
# id V1
# 1 1 2
# 2 2 5
# 3 3 1