In this example I'm trying to generate a random time series for 3 individuals at 4 time points (below, x contains the first timepoint for each individual). I want the values to tend to increase rather than decrease over time. Below is my current solution.
set.seed(0)
x <- rnorm(3)
x
[1] 1.2629543 -0.3262334 1.3297993
y <- c(x,
       x*runif(1, .8, 1.2),
       x*runif(1, .9, 1.3),
       x*runif(1, 1, 1.4))
y
[1] 1.2629543 -0.3262334 1.3297993 1.4642135 -0.3782206 1.5417106 1.6138915 -0.4168839 1.6993107 1.5967772
[11] -0.4124631 1.6812906
This has some problems.
For each individual the same coefficient is used to calculate the values at a given timepoint, resulting in identical trends. How could I get a random coefficient for each multiplication? I could use lapply, but then the vector would be grouped by individual rather than by timepoint.
I don't want to write the formula for each later timepoint separately and be so precise. The exact coefficients are not important; I just need the values to tend to increase slightly, while occasional decreases should still be allowed. How could I extend the vector more "effectively"?
How can I make negative values increase as well?
I managed to solve this thanks to Federico Manigrasso. The solution is below.
TimeSer <- function(num.id, years, init.val) {
  df <- data.frame(id = factor(rep(1:num.id, length(years))),
                   year = rep(years, each = num.id))
  yrs <- length(years) - 1
  minim <- seq(-.1, by = -.1, len = yrs)
  maxim <- seq(.4, by = .4, len = yrs)
  val <- list(init.val)
  for (i in 1:yrs) {
    val[[i + 1]] <- unlist(lapply(init.val, function(x) {
      x + (x * runif(1, minim[i], maxim[i]))
    }))
  }
  df$val <- unlist(val)
  df
}
df <- TimeSer(num.id = 3, years = 2006:2016, init.val = rnorm(3,1e5, 1e5))
Visual representation of the results:
num.id <- length(unique(df$id))
par(mfrow=c(1,num.id))
lapply(1:num.id, function(x) {
  plot(unique(df$year), df$val[df$id == x], type = 'l', col = x)
})
I suggest putting the output in a list; it's a lot less messy, and you can transform it into a vector later (using unlist).
This is how I would rewrite your code:
x <- rnorm(3)
time <- 3
output <- list(x)  # init output list with initial data
par1 <- c(0.8, 0.9, 1)
par2 <- c(1.2, 1.3, 1.4)
for (i in 1:time) {
  a <- unlist(lapply(x, function(x) { x + runif(1, par1[i], par2[i]) }))
  output[[i + 1]] <- a
  x <- a
}
Let me know if this solves all your problems.
I have the following vector:
wss <- c(23265.2302840678, 4917.06943551649, 1330.49917983449, 288.050702912287,
216.182464712486, 203.769578557051, 151.991297068931, 139.635571841227,
118.285305833194, 117.164567420633, 105.397722980407, 95.4682187817563,
116.448588269066, 88.1287299776581, 83.9345098736843)
And if we plot it with the following code
plot(1:15, wss, type="b", xlab="Number of Clusters",
ylab="Within groups sum of squares")
we get an elbow-shaped curve of the within-groups sum of squares against the number of clusters.
By eye we can see that around x-axis point 4 the values stop changing drastically and begin to plateau.
My question is: given the vector wss, how can we automatically detect the index 4 without looking at the plot?
Edit: This works better:
#change relative to the maximum change
threshold <- 0.1
d1 <- diff(wss)
# this assumes that the first value is the highest
## you could use max(d1) instead of d1[1]
which.max((d1 / d1[1]) < threshold) #results in 3
d1 <- diff(wss2)
which.max(d1 / d1[1] < threshold) #results in 5
Second Edit: This is somewhat subjective, but here's how my three methods compare for your two data sets. While it's easy to visualize what a plateau is, you need to be able to describe in math terminology what a plateau is in order to automate it.
Original: If you know that the second derivative will flip from positive to negative, you can do this:
sec_der <- diff(wss, differences = 2)
inflection_pt <- which.min(sign(sec_der))
inflection_pt
For this data set, the result is 5, which corresponds to index 7 of the original dataset (i.e., 151.991).
Instead of looking at inflection points, you could instead look at some relative percent threshold.
threshold <- 0.06
which.min(sign(abs(diff(wss)) / wss[1:(length(wss)-1)] - threshold))
This results in 5 as well using the first derivative approach.
Regardless, using the diff() function would be a key part of figuring this out in base R. Also see:
Finding the elbow/knee in a curve
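As a rough sketch of the distance-to-chord idea behind that link (this is not one of the three methods compared above, and elbow_by_distance is just an illustrative name):
# elbow = the point farthest from the straight line joining the first and
# last points of the curve; both axes are rescaled to [0, 1] first so that
# neither axis dominates the distance
elbow_by_distance <- function(y) {
  x  <- seq_along(y)
  xs <- (x - min(x)) / diff(range(x))
  ys <- (y - min(y)) / diff(range(y))
  # unit vector along the chord from the first to the last point
  v <- c(xs[length(xs)] - xs[1], ys[length(ys)] - ys[1])
  v <- v / sqrt(sum(v^2))
  # perpendicular distance of every point from that chord
  d <- abs((xs - xs[1]) * v[2] - (ys - ys[1]) * v[1])
  which.max(d)
}
elbow_by_distance(wss)  # apply to the wss vector from the question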
Code to create graphs:
wss <- c(23265.2302840678, 4917.06943551649, 1330.49917983449, 288.050702912287,
216.182464712486, 203.769578557051, 151.991297068931, 139.635571841227,
118.285305833194, 117.164567420633, 105.397722980407, 95.4682187817563,
116.448588269066, 88.1287299776581, 83.9345098736843)
wss2 <- c(1970.08410513303, 936.826421218935, 463.151086710784, 310.219800983285, 227.747583214178, 191.601552329558, 159.703151798393, 146.881710048563, 138.699803963718, 134.534334658148)
data_list <- list(wss, wss2)
# Potential_methods -------------------------------------------------------
plateau_method = list(thresh_to_max = function(x) which.max(diff(x) / diff(x)[1] < threshold),
                      inflection_pt = function(x) which.min(sign(diff(x, differences = 2))),
                      deriv_to_raw  = function(x) which.min(sign(abs(diff(x)) / x[1:(length(x)-1)] - threshold)))
threshold <- 0.1
results <- t(sapply(plateau_method, mapply, data_list))
# graphing ----------------------------------------------------------------
par(mfrow = c(3,2))
apply(results, 1, function (x) {
  for (i in seq_along(x)) {
    plot(data_list[[i]], ylab = "Within groups sum of squares", type = 'b', xlab = 'Number of Clusters')
    abline(v = x[i])
  }
})
lapply(seq_along(names(plateau_method)), function (i) {
  mtext(paste(names(plateau_method)[i], "- \n", substring(plateau_method[i], 15)),
        side = 3, line = -18*(i) + 15, outer = TRUE)
})
mtext('Threshold = 0.1', side = 3, line = -53, outer = T)
I am working on spike trains, and my code to generate spike trains for 20 trials is written below. (The image I included is representative, showing only 5 trials.)
fr = 100       # firing rate in Hz
dt = 1/1000    # bin width: 1 ms, expressed in seconds
duration = 2   # duration in s
nBins = 2000   # number of 1 ms bins (2 s in total)
nTrials = 20   # number of simulations
MyPoissonSpikeTrain = function(p, fr = 100) {
  p = runif(nBins)
  q = ifelse(p < fr*dt, 1, 0)
  return(q)
}
set.seed(1)
SpikeMat <- t(replicate(nTrials, MyPoissonSpikeTrain()))
plot(x = -1, y = -1, xlab = "time (s)", ylab = "Trial",
     main = "Spike trains",
     ylim = c(0.5, nTrials + 1), xlim = c(0, duration))
for (i in 1:nTrials) {
  clip(x1 = 0, x2 = duration, y1 = (i - 0.2), y2 = (i + 0.4))
  abline(h = i, lwd = 1/4)
  abline(v = dt*which(SpikeMat[i, ] == 1))
}
Each trial has spikes occurring at random time points. What I am now working towards is picking a random sample time point that applies to all 20 trials, and getting a vector consisting of the length of the interval this point falls into, for each trial. The code to get the time vector for the points where the spikes occur is:
ISI <- function(i) {
  spike_times <- dt * which(SpikeMat[i, ] == 1)  # times (in s) at which trial i spikes
  diff(spike_times)                              # inter-spike intervals for trial i
}
Then you call ISI(i) for whichever trial you wish to see the inter-spike interval vector for.
I want to get a vector that holds, for each trial, the length of the interval this point falls into. I want to figure out its distribution as well, but that's for later. Can anybody help me figure out how to code my way to this? Any help is appreciated, even if it's just a pointer on how to start or where to look.
Your data
set.seed(1)
SpikeMat <- t(replicate(nTrials, MyPoissonSpikeTrain()))
I suggest transforming your sparse matrix data into a list of indices where spikes occur
L <- lapply(seq_len(nrow(SpikeMat)), function(i) {
  idx <- which(SpikeMat[i, ] == 1)
  setNames(idx, seq_along(idx))
})
Grab a random timepoint
set.seed(1)
RT <- round(runif(1) * ncol(SpikeMat))
# 531
Result
distances contains the distances to the 2 nearest spikes - each element of the list is a named vector where the values are the distances (to RT) and their names are their positions in the vector. nearest_columns shows the original timepoint (column number) of each spike in SpikeMat.
bookend_values <- function(vec) {
  lower_val <- head(sort(vec[sign(vec) == 1]), 1)        # distance to the nearest spike before RT
  upper_val <- head(sort(abs(vec[sign(vec) == -1])), 1)  # distance to the nearest spike after RT
  return(c(lower_val, upper_val))
}
distances <- lapply(L, function(i) bookend_values(RT-i))
nearest_columns <- lapply(seq_along(distances), function(i) L[[i]][names(distances[[i]])])
Note that the inter-spike interval of the two nearest spikes that bookend RT can be obtained with
sapply(distances, sum)
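Building on that, a small sketch of how you could collect one interval length per trial and glance at their distribution (the hist() call is just an illustration, not part of the answer above):
# the ISI that contains the random timepoint RT, one value per trial
interval_lengths <- sapply(distances, sum)
# quick look at their distribution across trials (in bins; multiply by dt for seconds)
hist(interval_lengths, xlab = "interval length (bins)", main = "ISI containing RT")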
I have the following problem.
I have multiple subarrays (say 2) that I have populated with character labels (1, 2, 3, 4, 5). My algorithm selects labels at random based on occurrence probabilities.
How can I get R to instead select, say, labels 1:3 for subarray 1 and labels 4:5 for subarray 2, without using subsetting (i.e., [])? That is, I want a random subset of labels to be selected for each subarray, instead of all labels being assigned to each subarray manually using [].
I know sample() should help.
Using subsetting (which I don't want) one would do
x <- 1:5
sample(x[1:3], size, prob = probs[1:3])
but this assigns labels 1:3 to ALL subarrays.
Would
sample(sample(x), size, replace = TRUE, prob = probs)
work?
Any ideas? Please let me know if this is unclear.
Here is a small example, which selects labels from 1:5 for each of 10 subarrays.
set.seed(1)
N <- 10
K <- 2
Hstar <- 5
probs <- rep(1/Hstar, Hstar)
perms <- 5
## Set up container(s) to hold the identity of each individual from each permutation ##
num.specs <- ceiling(N / K)
## Create an ID for each haplotype ##
haps <- 1:Hstar
## Assign individuals (N) to each subpopulation (K) ##
specs <- 1:num.specs
## Generate permutations, assume each permutation has N individuals, and sample those individuals' haplotypes from the probabilities ##
gen.perms <- function() {
  sample(haps, size = num.specs, replace = TRUE, prob = probs) # I would like each subarray to contain a random subset of 1:5.
}
pop <- array(dim = c(perms, num.specs, K))
for (i in 1:K) {
  pop[,, i] <- replicate(perms, gen.perms())
}
pop
Hopefully this helps.
I think what you actually want is something like this:
num.specs <- 3
haps[sample(seq(haps),size = num.specs,replace = F)]
[1] 3 5 4
That is a random subset of your vector haps. Is that what you were after?
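To use that idea inside the loop from the question, here is a sketch; sub_size is a hypothetical subset size, adjust it as needed:
sub_size <- 3  # hypothetical: how many labels each subarray may draw from
for (i in 1:K) {
  # a different random subset of haps for each subpopulation
  sub_haps <- haps[sample(seq_along(haps), size = sub_size, replace = FALSE)]
  pop[,, i] <- replicate(perms, sample(sub_haps, size = num.specs, replace = TRUE,
                                       prob = rep(1/sub_size, sub_size)))
}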
Not quite what you want (returns list of matrices instead of 3D array) but this might help
lapply(split(1:5, cut(1:5, breaks = c(0, 2, 5))),
       function(i) matrix(sample(i, 25, replace = TRUE), ncol = 5))
Use cut and split to partition your vector of character labels before sampling them. Here I split your character labels at the value 2. Also, rather than sampling 5 numbers 5 times, you can sample 25 numbers once, and convert to matrix.
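If you do want the 3D array shape used in the question rather than a list of matrices, one option (a sketch) is to stack the list with simplify2array():
mat_list <- lapply(split(1:5, cut(1:5, breaks = c(0, 2, 5))),
                   function(i) matrix(sample(i, 25, replace = TRUE), ncol = 5))
pop2 <- simplify2array(mat_list)  # a 5 x 5 x 2 array, one slice per label subset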
I am trying to implement Chebyshev filter to smooth a time series but, unfortunately, there are NAs in the data series.
For example,
t <- seq(0, 1, len = 100)
x <- c(sin(2*pi*t*2.3) + 0.25*rnorm(length(t)),NA, cos(2*pi*t*2.3) + 0.25*rnorm(length(t)))
I am using a Chebyshev filter: cf1 = cheby1(5, 3, 1/44, type = "low")
I am trying to filter the time series excluding the NAs without messing up the order/positions. I have already tried na.rm=T, but it seems there is no such argument.
Then
z <- filter(cf1, x) # apply filter
Thank you guys.
Try using x <- x[!is.na(x)] to remove the NAs, then run the filter.
You can remove the NAs beforehand using the complete.cases function. You might also consider imputing the missing data; check out the mtsdi or Amelia II packages.
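For instance, a minimal sketch along those lines, using the question's data (for a plain vector, complete.cases() behaves like !is.na()):
library(signal)                  # for cheby1() and its filter method
t <- seq(0, 1, len = 100)
x <- c(sin(2*pi*t*2.3) + 0.25*rnorm(length(t)), NA,
       cos(2*pi*t*2.3) + 0.25*rnorm(length(t)))
cf1 <- cheby1(5, 3, 1/44, type = "low")
x_clean <- x[complete.cases(x)]  # drop the NAs, keep everything else in order
z <- filter(cf1, x_clean)        # apply the filter to the NA-free series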
EDIT:
Here's a solution with Rcpp. This might be helpful if speed is important:
require(inline)
require(Rcpp)
t <- seq(0, 1, len = 100)
set.seed(7337)
x <- c(sin(2*pi*t*2.3) + 0.25*rnorm(length(t)),NA, cos(2*pi*t*2.3) + 0.25*rnorm(length(t)))
NAs <- x
x2 <- x[!is.na(x)]
#do something to x2
src <- '
Rcpp::NumericVector vecX(vx);
Rcpp::NumericVector vecNA(vNA);
int j = 0; // counter for vx
for (int i = 0; i < vecNA.size(); i++) {
  if (!(R_IsNA(vecNA[i]))) {
    // replace and update j
    vecNA[i] = vecX[j];
    j++;
  }
}
return Rcpp::wrap(vecNA);
'
fun <- cxxfunction(signature(vx = "numeric", vNA = "numeric"),
                   src, plugin = "Rcpp")
if (identical(x, fun(x2, NAs)))
  print("worked")
# [1] "worked"
I don't know if ts objects can have missing values, but if you just want to re-insert the NA values, you can use ?insert from R.utils. There might be a better way to do this.
install.packages(c('R.utils', 'signal'))
require(R.utils)
require(signal)
t <- seq(0, 1, len = 100)
set.seed(7337)
x <- c(sin(2*pi*t*2.3) + 0.25*rnorm(length(t)), NA, NA, cos(2*pi*t*2.3) + 0.25*rnorm(length(t)), NA)
cf1 = cheby1(5, 3, 1/44, type = "low")
xex <- na.omit(x)
z <- filter(cf1, xex) # apply
z <- as.numeric(z)
for (m in attributes(xex)$na.action) {
  z <- insert(z, ats = m, values = NA)
}
all.equal(is.na(z), is.na(x))
?insert
Here is a function you can use to filter a signal with NAs in it.
The NAs are ignored rather than replaced by zero.
You can then specify a maximum percentage of weight which the NAs may take at any point of the filtered signal. If there are too many NAs (and too few actual data) at a specific point, the filtered signal itself will be set to NA.
# This function applies a filter to a time series with potentially missing data
filter_with_NA <- function(x,
                           window_length = 12,  # will be applied centrally
                           myfilter = rep(1/window_length, window_length),  # a boxcar filter by default
                           max_percentage_NA = 25)  # which percentage of weight created by NAs should not be exceeded
{
  # make the signal longer at both sides
  signal <- c(rep(NA, window_length), x, rep(NA, window_length))
  # see where data are present and not NA
  present <- is.finite(signal)
  # replace the NA values by zero
  signal[!is.finite(signal)] <- 0
  # apply the filter
  filtered_signal <- as.numeric(filter(signal, myfilter, sides = 2))
  # find out which percentage of the filtered signal was created by non-NA values
  # this is easy because the filter is linear
  original_weight <- as.numeric(filter(present, myfilter, sides = 2))
  # where this is lower than one, the signal is now artificially smaller
  # because we added zeros - compensate for that
  filtered_signal <- filtered_signal / original_weight
  # but where there are too few values present, discard the signal
  filtered_signal[100 * (1 - original_weight) > max_percentage_NA] <- NA
  # cut away the padding to left and right which we previously inserted
  filtered_signal <- filtered_signal[(window_length + 1):(window_length + length(x))]
  return(filtered_signal)
}
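A possible usage sketch on the signal from the earlier example (this assumes filter() inside the function resolves to stats::filter, i.e. the signal package is not masking it):
set.seed(7337)
t <- seq(0, 1, len = 100)
x <- c(sin(2*pi*t*2.3) + 0.25*rnorm(length(t)), NA,
       cos(2*pi*t*2.3) + 0.25*rnorm(length(t)))
smoothed <- filter_with_NA(x, window_length = 12, max_percentage_NA = 25)
plot(x, type = "l", col = "grey")   # raw signal with the missing value
lines(smoothed, lwd = 2)            # NA-aware smoothed version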
I have a data frame df with 2 variables, A and B. I would like to split the data into groups 1 and 2 along A so that mean(df$B[df$group==1]) is as close as possible to mean(df$B[df$group==2]).
Or, to put it differently, I would like to find a cut point (cutp) in df$A that minimizes abs(mean(df$B[df$A<cutp])-mean(df$B[df$A>=cutp])).
Any ideas?
If you want to find a threshold on variable A, to split the data into two groups, so that the means of B in those two groups be similar, you can compute these means for all possible cut-points, and check when the distance between those means is minimal.
# Sample data
n <- 10
d <- data.frame(
  A = rnorm(n),
  B = rnorm(n)
)
# The quantity to minimize
# (You can use a loop instead of apply.)
d$differences <- apply(
  d, 1,
  # Compute the difference of the means for each value of A
  function (u) {
    i <- d$A <= u[1]
    abs(mean(d$B[which(i)]) - mean(d$B[which(!i)]))
  }
)
# The mean of an empty vector is NaN: discard those values
d$differences[ ! is.finite( d$differences ) ] <- Inf
# Take the minimum
threshold <- d$A[ which.min( d$differences ) ]
# Build the groups
d$group <- ifelse( d$A <= threshold, "group 1", "group 2" )
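A quick sanity check on the result (just a sketch):
# the means of B in the two groups should now be as close as the data allow
tapply(d$B, d$group, mean)
threshold  # and this is where the cut falls on A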
I'm still not sure how column A factors into it. It seems you want to create a new column with two levels that give approximately equal mean values for column B. Column A is obviously associated with the new column, but it does not directly enter the calculation needed. Am I missing something?
Regardless, here's a start (note this can be made much more robust, but the proof of concept should work). Define a tolerance that you find acceptable, then set up a while loop that keeps creating new groups until the condition is met, i.e.
FUN <- function(tol) {
  df$groups <- sample(1:2, nrow(df), TRUE)
  while (abs(mean(df$B[df$groups == 1]) - mean(df$B[df$groups == 2])) > tol) {
    df$groups <- sample(1:2, nrow(df), TRUE)
  }
  return(df)
}
set.seed(101)
df <- data.frame(A=runif(20),B=runif(20))
#Test it. Means should be less than .02 different and have roughly equivalent sample sizes.
set.seed(101)
out <- FUN(.02)
library(plyr)
> ddply(out, "groups", summarize, n = length(B), mean = mean(B))
groups n mean
1 1 11 0.5229024
2 2 9 0.5037279
I should note that you could create a runaway function if you set tol super low, so don't blame me if your computer crashes.
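If that is a concern, here is a sketch of a guarded variant; max_iter is a made-up cap, not part of the answer above:
FUN_safe <- function(tol, max_iter = 1e4) {
  for (k in seq_len(max_iter)) {
    groups <- sample(1:2, nrow(df), TRUE)
    if (abs(mean(df$B[groups == 1]) - mean(df$B[groups == 2])) <= tol) {
      df$groups <- groups
      return(df)
    }
  }
  stop("tolerance not reached within max_iter iterations; try a larger tol")
}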