Percentile from bins of distributions in R

I need to find the "highest bin for 90% of samples".
I have a table like this:
my_table <- data.frame(matrix(c(122,68,2,0,30,0,0,0,5,79,23,9000), byrow=TRUE, ncol=4))
names(my_table) <- c("0-10","11-20","21-30","31-5000")
Where the bin-headers indicate minutes (time).
For the first row, 90% of samples fall in intervals at or below "11-20", i.e. 90% of samples have a time shorter than 21 minutes.
For the second row it is at or below the interval "0-10".
And for the third row it is at or below the interval "31-5000".
I would like to add a column "90p-interval" in which these intervals are found automatically, resulting in a table like this:
my_table$Perc90 <- c("11-20","0-10","31-5000")
My real table is thousands and thousands of rows long.
If someone can help I'd be very grateful, and thanks also to everyone contributing to this fantastic site!
/Chris

apply(my_table, 1, function(x) names(x)[
  max(which(c(0, cumsum(x)) < 0.9 * sum(x)))
])
# [1] "11-20" "0-10" "31-5000"

It's not clear from your example how you want the 90% cutoff to be determined when it isn't exact, so I provided a response that matches your example. This makes sure the selected cutoff bin covers at least 90% of the samples.
my_table$Perc90 <- apply(my_table, 1, function(x) {
  pct <- cumsum(x) / sum(x)
  return(names(x[pct >= 0.9][1]))
})
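As a quick check on the example table, the cumulative share of samples in the first row shows why "11-20" is the first bin that reaches 90% (row1 is just a throwaway name for the first row of bins):
row1 <- unlist(my_table[1, 1:4])
cumsum(row1) / sum(row1)
# approximately 0.635 0.990 1.000 1.000 -> "11-20" is the first bin at or above 90%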

Related

Calculate Number of Times Weight Changes more than a percentage

Suppose we have a table A with date and weight columns. Basically, this is daily weight data across 10 years. We want to count the number of times the weight has changed by more than 3% in either direction. Is the pseudo-code below roughly correct?
count = 0
for (i in 1:nrow(A)) {
  weight_initial = weight[i]
  for (j in 1:nrow(A)) {
    weight_compare = weight[j]
    if (weight_compare >= 1.03 * weight_initial || weight_compare <= 0.97 * weight_initial) {
      count = count + 1
    }
  }
}
It's better to do vectorized computations in R whenever possible. This is a quick and dirty approach (does twice as many computations as necessary, but should still be pretty quick):
weight <- rnorm(10,mean=1,sd=0.1)
wtcomp <- outer(weight,weight,"/")
sum(abs(wtcomp[lower.tri(wtcomp)] - 1) > 0.03)  # ratios more than 3% away from 1
This solution is similar to what would be produced by your pseudocode, except that yours does comparisons between present and both past and future values - so yours would (I think) produce double my answer.
Do you really want to count weight gain/loss against all future times? e.g. should weight = (1,2,2,2,2) really be counted as 4 weight-gain events and not just one?
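To make that concern concrete, the same outer() comparison applied to the weight vector mentioned above counts four ">3% change" events for what is really a single change (weight2 and wtcomp2 are throwaway names so the earlier example is not overwritten):
weight2 <- c(1, 2, 2, 2, 2)                       # one real change, then constant
wtcomp2 <- outer(weight2, weight2, "/")
sum(abs(wtcomp2[lower.tri(wtcomp2)] - 1) > 0.03)
# [1] 4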
It is not very clear to me what output you want. So, if you have your time series of weights:
a <- c(1,1.5,2,4,3, 3.005, 3.05, 0.5, 0.99)
and you want to compare each measurement just against the initial measurement to check how many times it changed more than 3%:
sum(abs((a-a[1])/a[1]) > 0.03)
But if you want to calculate how many times there was a daily change of more than 3% with respect to the previous measurement, then:
sum(abs(diff(a)/a[-length(a)]) > 0.03)
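With the example vector a above and a 3% threshold, these give 7 and 6 respectively:
sum(abs((a - a[1]) / a[1]) > 0.03)        # [1] 7  -- changes vs. the initial measurement
sum(abs(diff(a) / a[-length(a)]) > 0.03)  # [1] 6  -- day-to-day changes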
Cheers F.

Counting consecutive repeats, and returning the maximum value in each string of repeats if over a threshold

I am working with long strings of repeating 1's and 0's representing the presence of a phenomenon as a function of depth. If this phenomenon is flagged for over 1 m, it is deemed significant enough to use for further analyses; if not, it could be due to experimental error.
I ultimately need to get a total thickness displaying this phenomenon at each location (if over 1m).
In a dummy data set the input and expected output would look like this:
#Depth from 0m to 10m with 0.5m readings
depth <- seq(0, 10, 0.5)
#Phenomenon found = 1, not = 0
phenomflag <- c(1,0,1,1,1,1,0,0,1,0,1,0,1,0,1,1,1,1,1,0)
What I would like as an output is a vector with: 4, 5 (which gets converted back to 2m and 2.5m)
I have attempted to solve this problem using
y <- rle(phenomflag)
z <- y$length[y$values ==1]
but once I have my count, I have no idea how to:
a) Isolate 1 maximum number from each group of consecutive repeats.
b) Restrict to consecutive strings longer than (x) - this might be easier after a.
Thanks in advance.
count posted a good solution in the comments section.
y <- rle(phenomflag)
x <- cbind(y$lengths, y$values)
x[x[, 1] >= 3 & x[, 2] == 1, 1]
# [1] 4 5
This returns one run length per string of 1's, keeping only the runs that repeat more than the threshold of 2 consecutive readings.
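To convert those run lengths back to thicknesses in metres (assuming, as the expected output above implies, that each flagged reading contributes 0.5 m):
runs <- x[x[, 1] >= 3 & x[, 2] == 1, 1]   # qualifying run lengths: 4 and 5
runs * 0.5                                # [1] 2.0 2.5  -- thickness per interval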

Expected value of the difference between a sum of variables and a threshold

I have a custom deck consisting of eight copies of each card in the sequence 2^n, n = 0, ..., 6 (56 cards in total). I draw cards (without replacement) until the sum is equal to or greater than a threshold. How can I implement in R a function that calculates the mean of the difference between the sum and the threshold?
I tried to do it using this: How to store values in a vector with nested functions,
but it takes ages... I think there is a way to do it with probabilities/simulations, but I can't figure it out.
The threshold could be greater than the value of any single card, e.g. threshold = 500, or less than the value of a single card, e.g. threshold = 50.
What I have done so far is to find all the subsets whose sum is greater than or equal to the threshold. Then I only subtract the threshold and calculate the mean.
I am using the following code in R. For a small set I get the answer quite fast. However, I have been running the function for several hours with the set containing the 56 numbers and it is still running.
set<-c(rep(1,8),rep(2,8), rep(4,8),rep(8,8),rep(16,8),rep(32,8),rep(64,8))
recursive.subset <- function(x, index, current, threshold, result){
  for (i in index:length(x)){
    if (current + x[i] >= threshold){
      store <<- append(store, sum(c(result, x[i])))
    } else {
      recursive.subset(x, i + 1, current + x[i], threshold, c(result, x[i]))
    }
  }
}
store <- vector()
inivector <- vector(mode="numeric", length=0) #initializing empty vector
recursive.subset (set, 1, 0, threshold, inivector)
I don't know if it is possible to get an exact solution, simply because there are so many possible combinations. It is probably better to do simulations, i.e. write a script for 1 full draw and then rerun that script many times. Since the solutions are very similar, the simulation should give a pretty good approximation.
Ok, here goes:
set <- rep(2^(0:6), each = 8)
thr <- 500
fun <- function(set, thr){
  x <- cumsum(sample(set))
  value <- x[min(which(x >= thr))]
  value
}
system.time(a <- replicate(100000, fun(set,thr)))
# user system elapsed
# 1.10 0.00 1.09
mean(a - thr)
# [1] 21.22992
Explanation: Rather than drawing one card at a time, I draw all cards simultaneously (sample) and then calculate the cumulative sum (cumsum). I then find the point where the cards add up to the threshold or more, and look up the corresponding sum in x. We run this function many times with replicate to obtain a vector of the values, and use mean(a - thr) to calculate the mean difference.
Edit: Made a really stupid typo in the code, fixed it now.
Edit2: Shortened the function a little.
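As a rough gauge of the simulation's precision (assuming a and thr from the code above), the Monte Carlo standard error of the estimated mean difference can be checked like this:
se <- sd(a - thr) / sqrt(length(a))
se
mean(a - thr) + c(-2, 2) * se   # approximate 95% interval for the mean difference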

Problems with deleting rows in big data sets in R

I wrote a script that deletes rows in which 20% of the cells are smaller than 10.
It works great on small data sets, but for big ones it's useless.
Can somebody help me please?
Here is my script:
DataSets <- choose.files()
DataSet <- read.delim(DataSets, header = TRUE,
                      row.names = 1, sep = "\t", blank.lines.skip = TRUE)
delete <- 0
for (i in 1:length(DataSet[, 1])) {
  count <- 0
  for (j in 1:length(DataSet[i, ])) {
    if (DataSet[i, j] < 10 || is.na(DataSet[i, j])) {
      count = count + 1
    }
  }
  if (count > 0.2 * length(DataSet[i, ])) {
    DataSet = DataSet[-i, ]
    delete <- delete + 1
  }
}
This is essentially instantaneous on my machine:
m <- matrix(runif(100000),10000,10)
system.time(m1 <- m[rowSums(m <= 0.25 | is.na(m)) < 2, ])
I only approximated your exact situation, but your version would be analogous. The idea here would be to:
Use a matrix, rather than a data frame, if your data is indeed all numeric.
Use vectorized comparison to determine which elements are less than some value (0.25 in my example).
Then use rowSums to count how many values are less than 0.25 in each row.
Subset the matrix according to which rows have fewer than two values less than (or equal to) 0.25.
Edit: Added a check for NAs so that they are counted too.
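A sketch of the same idea with the question's actual rule (drop rows where more than 20% of the entries are below 10 or NA); m2 here is made-up illustration data, not the OP's file:
m2 <- matrix(rnorm(50, mean = 15, sd = 5), nrow = 10)   # made-up numeric data
bad <- rowSums(m2 < 10 | is.na(m2)) > 0.2 * ncol(m2)    # rows breaking the 20% rule
m2_clean <- m2[!bad, , drop = FALSE]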
This would solve your problem. You can leave your data as a data frame.
dat <- data.frame(matrix(rnorm(100, 10, 1), 10))
bad <- apply(dat, 1, function(x) {
  return((sum(x < 10, na.rm = TRUE) + sum(is.na(x))) > length(x) * 0.2)
})
dat <- dat[!bad, ]
This works pretty quickly for me. Like the solution @joran used, I use a matrix:
data <- matrix(rnorm(1000, 15, 5), 100, 10)
tf <- apply(data, 1, function(x) x < 10) # your value of 10
data[-which(colSums(tf) > ncol(data)*0.2),] # here is where the 20% comes in
TRUE counts as 1 and FALSE as 0, which is why colSums can be used here to count values below 10 (apply with MARGIN = 1 returns its results with one column per original row).
Update to handle NAs
If one follows the OP's comment to count "just 20% of the numeric values", rather than the original code (which counts NA values as values < 10), i.e. delete rows where 20% of the numeric entries are less than 10, then this will work:
data[-which(colSums(tf, na.rm=T) > (ncol(data) - colSums(apply(tf,2,is.na)))*0.2),]
colSums(apply(tf,2,is.na)) counts the number of entries in a row of data that are NA.
(ncol(data) - colSums(apply(tf,2,is.na))) subtracts that number from the number of columns so that only the total number of numeric columns is returned.
(ncol(data) - colSums(apply(tf,2,is.na)))*0.2 is 20% of the number of numeric entries per row
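As a small illustration of the NA-aware denominator (data_na and tf_na are made-up copies with a few NAs inserted, so the objects above are not overwritten):
data_na <- data
data_na[1, 1:3] <- NA                            # hypothetical missing values in row 1
tf_na <- apply(data_na, 1, function(x) x < 10)
ncol(data_na) - colSums(apply(tf_na, 2, is.na))  # numeric entries per row: 7 for row 1, 10 for the rest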

Aligning sequences with missing values

The language I'm using is R, but you don't necessarily need to know about R to answer the question.
Question:
I have a sequence that can be considered the ground truth, and another sequence that is a shifted version of the first, with some missing values. I'd like to know how to align the two.
Setup:
I have a sequence ground.truth that is basically a set of times:
ground.truth <- rep( seq(1,by=4,length.out=10), 5 ) +
rep( seq(0,length.out=5,by=4*10+30), each=10 )
Think of ground.truth as times where I'm doing the following:
{take a sample every 4 seconds for 10 times, then wait 30 seconds} x 5
I have a second sequence observations, which is ground.truth shifted with 20% of the values missing:
nSamples <- length(ground.truth)
idx_to_keep <- sort(sample( 1:nSamples, .8*nSamples ))
theLag <- runif(1)*100
observations <- ground.truth[idx_to_keep] + theLag
nObs <- length(observations)
If I plot these vectors, this is what they look like (remember, think of these as times):
What I've tried. I want to:
calculate the shift (theLag in my example above)
calculate a vector idx such that ground.truth[idx] == observations - theLag
First, assume we know theLag. Note that ground.truth[1] is not necessarily observations[1]-theLag. In fact, we have ground.truth[1] == observations[1+lagI]-theLag for some lagI.
To calculate this, I thought I'd use cross-correlation (ccf function).
However, whenever I do this I get the maximum cross-correlation at a lag of 0, implying ground.truth[1] == observations[1] - theLag. But I've tried this on examples where I've explicitly made sure that observations[1] - theLag is not ground.truth[1] (i.e. modified idx_to_keep to make sure it doesn't contain 1).
The shift theLag shouldn't affect the cross-correlation (isn't ccf(x,y) == ccf(x,y-constant)?) so I was going to work it out later.
Perhaps I'm misunderstanding though, because observations doesn't have as many values in it as ground.truth? Even in the simpler case where I set theLag==0, the cross correlation function still fails to identify the correct lag, which leads me to believe I'm thinking about this wrong.
Does anyone have a general methodology for me to go about this, or know of some R functions/packages that could help?
Thanks a lot.
For the lag, you can compute all the differences (distances) between your two sets of points:
diffs <- outer(observations, ground.truth, '-')
Your lag should be the value that appears length(observations) times:
which(table(diffs) == length(observations))
# 55.715382960625
# 86
Double check:
theLag
# [1] 55.71538
The second part of your question is easy once you have found theLag:
idx <- which(ground.truth %in% (observations - theLag))
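If floating-point noise ever breaks the exact %in% match, a tolerance-based variant (a sketch with an arbitrary epsilon and a throwaway dmat matrix) does the same job:
eps <- 1e-8
dmat <- abs(outer(ground.truth, observations - theLag, "-"))
idx <- which(apply(dmat, 1, min) < eps)   # ground.truth indices matched by some observation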
The following should work if your time series are not too long. You have two vectors of time-stamps, the second one being a shifted and incomplete copy of the first, and you want to find by how much it was shifted.
# Sample data
n <- 10
x <- cumsum(rexp(n,.1))
theLag <- rnorm(1)
y <- theLag + x[sort(sample(1:n, floor(.8*n)))]
We can try all possible lags and, for each one, compute how bad the alignment is, by matching each observed timestamp with the closest "truth" timestamp.
# Loss function
library(sqldf)
f <- function(u) {
  # Put all the values in a data.frame
  d1 <- data.frame(g = "truth",    value = x)
  d2 <- data.frame(g = "observed", value = y + u)
  d <- rbind(d1, d2)
  # For each observed value, find the next truth value
  # (we could take the nearest, on either side,
  # but it would be more complicated)
  d <- sqldf("
    SELECT A.g, A.value,
           ( SELECT MIN(B.value)
             FROM d AS B
             WHERE B.g = 'truth'
               AND B.value >= A.value
           ) AS next
    FROM d AS A
    WHERE A.g = 'observed'
  ")
  # If u is greater than the lag, there are missing values.
  # If u is smaller, the differences decrease
  # as we approach the lag.
  if (any(is.na(d))) {
    return(Inf)
  } else {
    return(sum(d$`next` - d$value, na.rm = TRUE))
  }
}
We can now search for the best lag.
# Look at the loss function
sapply( seq(-2,2,by=.1), f )
# Minimize the loss function.
# Change the interval if it does not converge,
# i.e., if it seems in contradiction with the values above
# or if the minimum is Inf
(r <- optimize(f, c(-3,3)))
-r$minimum
theLag # Same value, most of the time
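For comparison, here is a base-R sketch of the same loss function that avoids sqldf (f_base is a new name; it assumes the x and y from the sample data above): for each shifted observation it finds the next "truth" timestamp at or after it and sums the gaps.
f_base <- function(u) {
  shifted <- y + u
  gaps <- sapply(shifted, function(v) {
    later <- x[x >= v]                 # truth timestamps at or after this observation
    if (length(later) == 0) Inf else min(later) - v
  })
  sum(gaps)                            # Inf if some observation has no later truth value
}
optimize(f_base, c(-3, 3))             # should roughly agree with the sqldf version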
