Efficient way to iteratively store counts in R

I'm looking for an efficient way to store the counts of a vector that changes over time. In my problem I start with an empty vector of length n, and at each iteration I add a number to it. I also want some kind of object that acts as a counter: if the number I add is already in the vector, it should add 1 to that number's count; if not, it should add the value as a "name" and set its count to 1.
What I want is something analogous to a Python dict, in which numbers are keys and counts are values, so that I can access both separately, as with dict.keys() and dict.values().
For example, if I get the values 1, 2, 1, 4 then I would like the object to update as:
> value count
1 1
> value count
1 1
2 1
> value count
1 2
2 1
> value count
1 2
2 1
4 1
and to access both the values and the counts efficiently and separately. I thought of using something like plyr::count on the vector, but I don't think it's efficient to recount at every iteration, especially if n is really large.
Edit: In my problem it's necessary (well, maybe not) to update the counts at every iteration.
What I'm doing is simulating data from a Dirichlet process using the Polya urn representation. For example, suppose I have the vector (1.1, 0.2, 0.3, 1.1, 0.2). To get a new data point, one either samples from a base distribution (for example, a normal distribution) and adds that value with a certain probability, or adds a previous value with probability proportional to that value's frequency. With numbers:
Add the sampled value with probability 1/6, or
Add 1.1 with probability 2/6, or 0.2 with probability 2/6, or 0.3 with probability 1/6 (i.e. the probabilities are proportional to the frequencies)
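For reference, a minimal sketch of this urn scheme with two parallel vectors acting as the counter might look like the following (my own illustration; alpha, n and the seed are arbitrary choices, not from the question):
# Polya urn sampling sketch (illustrative)
set.seed(1)
alpha  <- 1            # concentration parameter (assumed value)
n      <- 10           # number of draws (assumed value)
values <- numeric(0)   # distinct values seen so far
counts <- integer(0)   # frequency of each distinct value
for (i in seq_len(n)) {
  if (runif(1) < alpha / (alpha + sum(counts))) {
    # with probability alpha / (alpha + total), draw a new value from the base
    values <- c(values, rnorm(1))
    counts <- c(counts, 1L)
  } else {
    # otherwise reuse an existing value with probability proportional to its count
    j <- sample(seq_along(values), 1, prob = counts)
    counts[j] <- counts[j] + 1L
  }
}
data.frame(value = values, count = counts)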

The structure you are describing is produced by as.data.frame(table(vec)). There is no need to update the counts as you go along, since calling this line will give you the updated counts:
vec <- c(1, 2, 4, 1)
as.data.frame(table(vec))
#> vec Freq
#> 1 1 2
#> 2 2 1
#> 3 4 1
Suppose I now update vec
vec <- append(vec, c(1, 2, 4, 5))
We get the new counts the same way
as.data.frame(table(vec))
#> vec Freq
#> 1 1 3
#> 2 2 2
#> 3 4 2
#> 4 5 1
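One caveat worth noting: table() rescans the whole vector on every call, so recomputing it at each iteration costs O(n) per update and O(n^2) over the whole simulation. If counts really are needed at every step, an incrementally updated structure such as the environment in the next answer scales better.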

Maybe you can use assign and get0 with an environment to update the counts, like this:
x <- c(1, 2, 1, 4)
y <- new.env()
lapply(x, function(z) {
  # increment the count stored under the name of z (0 if not seen before)
  assign(as.character(z), get0(as.character(z), envir = y, ifnotfound = 0) + 1, envir = y)
  # snapshot the current counts as a two-column data frame
  setNames(stack(mget(ls(y), envir = y))[2:1], c("value", "count"))
})
#[[1]]
# value count
#1 1 1
#
#[[2]]
# value count
#1 1 1
#2 2 1
#
#[[3]]
# value count
#1 1 2
#2 2 1
#
#[[4]]
# value count
#1 1 2
#2 2 1
#3 4 1
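A note on the design: environments in R are hash-backed, so each lookup and assignment above is roughly constant time, much like a Python dict. If only the final counts are needed, the per-iteration snapshot can be dropped and the environment read out once at the end, e.g.:
# read the final counts out of the environment y in one go
counts <- unlist(mget(ls(y), envir = y))
counts
#> 1 2 4
#> 2 1 1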

Related

Alternating between values with rep() in R

I am looking for an elegant way of repeating two values according to a given vector in an alternating fashion. It is better stated by example. Take the following code for instance:
vals_to_rep <- c(1, 2)
tms_to_rep <- c(5, 4, 15)
res <- c(rep(1, 5), rep(2, 4), rep(1, 15))
res
In this example, I wish to repeat the values 1 and 2 according to the vector tms_to_rep, starting with 1 (given it is first in vals_to_rep), before alternating to 2, back to 1, ...
I wish to continue this process for the length of tms_to_rep (in this case, three times). The result would look like this:
1 1 1 1 1 2 2 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
If it helps, you can assume vals_to_rep is binary, but no assumptions on length of tms_to_rep.
Thanks!
You can expand vals_to_rep out to the length of tms_to_rep. Then rep() works fine:
vals_to_rep_expanded = rep(vals_to_rep, length.out = length(tms_to_rep))
rep(vals_to_rep_expanded, times = tms_to_rep)
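For the example input this reproduces the desired sequence (output shown for illustration):
vals_to_rep_expanded
#> [1] 1 2 1
rep(vals_to_rep_expanded, times = tms_to_rep)
#> [1] 1 1 1 1 1 2 2 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1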

Binning data by row values with minimum sample size

I’m trying to figure out how to create bins with a minimum sample size that also accounts for values in a specific column.
So, in the dummy data below, I want to create bins that have a minimum of 6 samples each, but if a bin includes a row with a specific value in column a, I want that bin to also include all other rows with that same value. I also do not want any bin to contain only 1 unique value in column a. I then want the output to have, for each bin, the mean of the unique values in column a, the mean of all values in column b, and the sample size.
df <- data.frame(a = c(1,1,2,2,2,3,3,3,3,4,4,5,6,6,6,7,7,7,7,7,7,8,8,8,9,9,9,9,10,10,10),
                 b = c(12,13,11,12,12,11,15,13,12,11,14,15,11,14,12,11,14,12,13,15,11,11,12,13,14,16,14,13,15,13,15))
I want the output to look something like this:
mean.a mean.b n
1 2.0 12.33333 9
2 5.0 12.83333 6
3 7.0 12.66667 6
4 8.5 13.28571 7
This is what I have so far:
x <- df
final <- NULL
for(i in 1:16){
  x1 <- x[1:6,]
  x2 <- x[-c(1:6),]
  x3 <- rbind(x1, x2[x2$a == x1$a[6],])
  n <- nrow(x3)
  y <- mean(x3$b)
  z <- mean(unique(x3$a))
  f <- data.frame(mean.a = z, mean.b = y, n = n)
  final <- rbind(final, f)
  x <- x[-c(1:n),]
}
final <- final[complete.cases(final),]
The problem I'm having is that I can't figure out how to avoid a bin with only one unique value in column a. For example, in the third bin, all 6 rows have a = 7, but I would like to add the next sequential row, plus all rows sharing that row's value in column a, to that bin (which would be all rows with a = 8 in this case).
Also, I can't figure out how to keep the loop running without hard-coding 1:number at the top and then deleting the rows with NAs afterwards. This isn't a huge deal, but it's the reason the code is kind of messy.
I'm not attached to this loop by any means, and if there's a simpler way to answer this question, I'm all for it!
Here is a recursive solution to the problem, where get_6 returns a group variable based on column a. The conditions are checked in the inner function get_i: starting from index 6, it moves forward until it finds an index where the next value differs from the current one and the group so far contains more than one unique value. Every time such a sequence is found, the id is increased by one, so the result is similar to what the rleid function from data.table produces. From there, summary statistics can be calculated based on this group variable:
get_6 <- function(vec, id = 1) {
  if(length(vec) < 6) NULL
  else {
    get_i <- function(x, i = 6) {
      if(length(x) == i) i
      else if(x[i + 1] != x[i] && length(unique(x[1:i])) != 1) i
      else get_i(x, i + 1)
    }
    ind <- get_i(vec)
    c(rep(id, ind), get_6(vec[-(1:ind)], id + 1))
  }
}
s <- get_6(df$a)
s
# [1] 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 4 4 4 4 4 4 4
library(dplyr)
df[1:length(s), ] %>%
  mutate(g = s) %>%
  group_by(g) %>%
  summarize(n = n(), mean.a = mean(unique(a)), mean.b = mean(b))
#Source: local data frame [4 x 4]
# g n mean.a mean.b
# <dbl> <int> <dbl> <dbl>
#1 1 9 2.0 12.33333
#2 2 6 5.0 12.83333
#3 3 9 7.5 12.44444
#4 4 7 9.5 14.28571

Vectorizing R-loop for better performance

I'm having trouble finding a vectorized representation for a specific loop in R. My objective is to improve the loop's performance, because it has to run thousands of times in my algorithm.
For each row of an array, I want to find the position of the lowest value within each section of that row, where the sections are defined by a vector Level.
Example:
Level = c(2,3)
Let the first row of array X be: c(2, -1, 3, 0.5, 4).
Searching for the position of the lowest value in the range 1:Level[1] of the row (that is, (2, -1)), I get 2, because -1 < 2 and -1 is in the second position of the row. Then, searching for the position of the lowest value in the second range (Level[1]+1):(Level[1]+Level[2]) (that is, (3, 0.5, 4)), I get 4, because 0.5 < 3 < 4 and 0.5 is in the fourth position of the row.
I have to perform this over each row in the array.
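As a quick check of the worked example (illustrative, not part of the algorithm):
x <- c(2, -1, 3, 0.5, 4)
which.min(x[1:2])        # position of -1 within the first section -> 2
2 + which.min(x[3:5])    # position of 0.5, offset by Level[1] = 2 -> 4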
My solution to the problem works as follows:
Level = c(2,3,3)  # elements per section, here: 3 sections with 2, 3 and 3 levels
rows = 10         # number of rows in array X
X = matrix(runif(rows*sum(Level),-5,5), rows, sum(Level))  # array with 10 rows and sum(Level) columns, here: 8
Position_min = matrix(0, rows, length(Level))  # positions of the minima for each section and row
for(i in 1:rows){
  for(j in 1:length(Level)){  # length(Level) is the number of sections, here: 3
    if(j == 1){coeff = 0}else{coeff = 1}
    Position_min[i,j] = coeff*sum(Level[1:(j-1)]) +
      which(X[i, (coeff*sum(Level[1:(j-1)])+1):sum(Level[1:j])] ==
              min(X[i, (coeff*sum(Level[1:(j-1)])+1):sum(Level[1:j])]))
  }
}
It works fine but I would prefer a solution with better performance. Any ideas?
This will remove the outer loop:
Level1 = c(0, cumsum(Level))
for(j in 1:(length(Level1)-1)){
  Position_min[,j] = max.col(-X[,(Level1[j]+1):Level1[j+1]]) + Level1[j]
}
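This works because max.col() returns, for each row, the column index of the maximum, so applying it to -X gives a row-wise which.min over each section. A quick check on the second section of the example row (my own illustration):
sec <- rbind(c(3, 0.5, 4))   # second section of c(2, -1, 3, 0.5, 4) as a 1-row matrix
max.col(-sec)                # -> 2; adding the offset Level1[2] = 2 gives position 4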
Here is a "fully vectorized" solution with no explicit loops:
findmins <- function(x, level) {
  series <- rep(seq_along(level), level)           # section id for each position
  x <- split(x, factor(series))                    # split the row into its sections
  minsSplit <- as.numeric(sapply(x, which.min))    # within-section minimum positions
  minsSplit + c(0, cumsum(level[-length(level)]))  # add each section's offset
}
Position_min_vectorized <- t(apply(X, 1, findmins, Level))
identical(Position_min, Position_min_vectorized)
## [1] TRUE
You can get better performance by making your matrix into a list, and then using parallel's mclapply():
X_list <- split(X, factor(1:nrow(X)))
do.call(rbind, parallel::mclapply(X_list, findmins, Level))
## [,1] [,2] [,3]
## 1 1 5 6
## 2 2 3 6
## 3 1 4 7
## 4 1 5 6
## 5 2 5 7
## 6 2 4 6
## 7 1 5 8
## 8 1 5 8
## 9 1 3 8
## 10 1 3 8

Count consecutive occurrences of a specific value in every row of a data frame in R

I've got a data.frame of monthly values of a variable for many locations (so many rows), and I want to count the number of consecutive months (i.e. consecutive cells) that have a value of zero. This would be easy if it were just read left to right, but the added complication is that the end of the year is consecutive with the start of the year.
For example, in the shortened example dataset below (with seasons instead of months), location 1 has 3 consecutive '0' months, location 2 has 2, and location 3 has none.
df <- cbind(location = c(1,2,3),
            Winter = c(0,0,3),
            Spring = c(0,2,4),
            Summer = c(0,2,7),
            Autumn = c(3,0,4))
How can I count these consecutive zero values? I've looked at rle but I'm still none the wiser currently!
Many thanks for any help :)
You've identified the two cases where the longest run can occur: (1) somewhere in the middle, or (2) split between the end and beginning of the row. Hence you want to calculate each condition and take the max, like so:
df <- cbind(
  Winter = c(0,0,3),
  Spring = c(0,2,4),
  Summer = c(0,2,7),
  Autumn = c(3,0,4))
#> Winter Spring Summer Autumn
#> [1,] 0 0 0 3
#> [2,] 0 2 2 0
#> [3,] 3 4 7 4
# calculate the number of consecutive zeros at the start and end
startZeros <- apply(df, 1, function(x) which.min(x == 0) - 1)
#> [1] 3 1 0
endZeros <- apply(df, 1, function(x) which.min(rev(x == 0)) - 1)
#> [1] 0 1 0
# calculate the longest run of zeros
longestRun <- apply(df, 1, function(x){
  y <- rle(x)
  max(y$lengths[y$values == 0], 0)})
#> [1] 3 1 0
# take the max of the two values
pmax(longestRun, startZeros + endZeros)
#> [1] 3 2 0
Of course an even easier solution is:
longestRun <- apply(cbind(df, df),  # tricky way to wrap the zeros from the start to the end
                    1,              # the margin over which to apply the summary function
                    function(x){    # the summary function
                      y <- rle(x)
                      max(y$lengths[y$values == 0],
                          0)        # include zero in case there are no zeros in y$values
                    })
Note that the above solution works because my df does not include the location field (column).
Try this:
df <- data.frame(location = c(1, 2, 3),
Winter = c(0, 0, 3),
Spring = c(0, 2, 4),
Summer = c(0, 2, 7),
Autumn = c(3, 0, 4))
maxcumzero <- function(x) {
  l <- x == 0
  # running count of zeros, minus its value frozen at the most recent non-zero,
  # gives the length of the current run of zeros; take the row maximum
  max(cumsum(l) - cummax(cumsum(l) * !l))
}
df$N.Consec <- apply(cbind(df[, -1], df[, -1]), 1, maxcumzero)
df
# location Winter Spring Summer Autumn N.Consec
# 1 1 0 0 0 3 3
# 2 2 0 2 2 0 2
# 3 3 3 4 7 4 0
This adds a column to the data frame specifying the maximum number of times zero has occurred consecutively in each row of the data frame. The data frame is column bound to itself to be able to detect consecutive zeroes between autumn and winter.
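To see how the cumulative-sum trick counts runs, here is a trace of the intermediate vectors for row 2 after the wrap-around duplication (a walkthrough I've added, not part of the original answer):
x <- c(0, 2, 2, 0, 0, 2, 2, 0)        # row 2 of cbind(df[, -1], df[, -1])
l <- x == 0                           # TRUE at the zeros
cumsum(l)                             #> 1 1 1 2 3 4 4 5 (running zero count)
cummax(cumsum(l) * !l)                #> 0 1 1 1 1 4 4 4 (count frozen at the last non-zero)
cumsum(l) - cummax(cumsum(l) * !l)    #> 1 0 0 1 2 0 0 1 (length of the current run)
The maximum of the last vector is 2: the Autumn zero joined with the following Winter zero.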
The method used here is based on that of Martin Morgan in his answer to this similar question.

In R, find duplicated dates in a dataset and replace their associated values with their mean

I have a rather small dataset of 3 columns (id, date and distance) in which some dates may be duplicated (otherwise unique) because there is a second distance value associated with that date.
For those duplicated dates, how do I average the distances then replace the original distance with the averages?
Let's use this dataset as the model:
z <- data.frame(id = c(1, 1, 2, 2, 3, 4), var = c(2, 4, 1, 3, 5, 2))
#  id var
#   1   2
#   1   4
#   2   1
#   2   3
#   3   5
#   4   2
The mean for id 1 is 3 and for id 2 is 2; these means would then replace each of the original var values.
I've checked multiple questions to address this and have found related discussions. As a result, here is what I have so far:
# Check if any dates have two estimates (duplicate Epochs)
length(unique(Rdataset$Epoch)) == nrow(Rdataset)
# if 'TRUE' then each day has a unique data point (no duplicate Epochs)
# if 'FALSE' then duplicate Epochs exist, and the distances must be
# averaged for each duplicate Epoch
Rdataset$Distance <- ave(Rdataset$Distance, Rdataset$Epoch, FUN=mean)
Rdataset <- unique(Rdataset)
Then, with the distances for duplicate dates averaged and replaced, I wish to perform other functions on the entire dataset.
Here's a solution that doesn't bother to check whether the ids are duplicated: you don't actually need to, since for non-duplicated ids the mean of the single var value is just that value:
duplicated_ids = unique(z$id[duplicated(z$id)])
library(plyr)
z_deduped = ddply(
  z,
  .(id),
  function(df_section) {
    data.frame(id = df_section$id[1], var = mean(df_section$var))
  }
)
Output:
> z_deduped
id var
1 1 3
2 2 2
3 3 5
4 4 2
Unless I misunderstand:
library(plyr)
ddply(z, .(id), summarise, var2 = mean(var))
# id var2
# 1 1 3
# 2 2 2
# 3 3 5
# 4 4 2
Here is another answer in data.table style:
library(data.table)
z <- data.table(id = c(1, 1, 2, 2, 3, 4), var = c(2, 4, 1, 3, 5, 2))
z[, mean(var), by = id]
id V1
1: 1 3
2: 2 2
3: 3 5
4: 4 2
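If you prefer the result column to keep the name var rather than V1, you can name it inside j (standard data.table syntax):
z[, .(var = mean(var)), by = id]
#    id var
# 1:  1   3
# 2:  2   2
# 3:  3   5
# 4:  4   2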
There is no need to treat unique values differently from duplicated values, since the mean of a single value is the value itself.
zt <- aggregate(var ~ id, data = z, mean)
zt
id var
1 1 3
2 2 2
3 3 5
4 4 2
