Finding the beginning of an (exponential?) decay in R

How do I find the index indicated by the red vertical line in the following example?
# Get the data as "tmpData"
source("http://pastie.org/pastes/9350691/download")
# Plot
plot(tmpData,type="l")
abline(v=49,col="red")
The following approach is promising, but how to find the peak maximum?
library(RcppRoll)
n <- 10
smoothedTmpData <- roll_mean(tmpData,n)
plot(-diff(smoothedTmpData),type="l")
abline(v=49,col="red")

which.max(-diff(smoothedTmpData)) gives you the index of the maximum.
http://www.inside-r.org/r-doc/base/which.max
I'm unsure if this is your actual question...
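A minimal sketch of that, assuming the smoothing snippet above has already been run; note that the rolling mean and diff() can shift the index by a few points relative to the raw series:
idx <- which.max(-diff(smoothedTmpData))
idx
plot(tmpData, type = "l")
abline(v = idx, col = "red")  # may be slightly offset from the raw-data peak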

Where there is a single peak in the gradient, as in your example dataset, gwieshammer is correct: you can just use which.max() to find it.
For the case where there are multiple possible peaks, you need a more sophisticated approach. R has lots of peak-finding functions (of varying quality). One that works for this data is wavCWTPeaks() in the wmtsa package.
library(RcppRoll)
library(wmtsa)
source("http://pastie.org/pastes/9350691/download")
n <- 10
smoothedTmpData <- roll_mean(tmpData, n)
gradient <- -diff(smoothedTmpData)
cwt <- wavCWT(gradient)
tree <- wavCWTTree(cwt)
(peaks <- wavCWTPeaks(tree))
## $x
## [1] 4 52
##
## $y
## [1] 302.6718 5844.3172
##
## attr(,"peaks")
## branch itime iscale time scale extrema iendtime
## 1 1 5 2 5 2 16620.58 4
## 2 2 57 26 57 30 20064.64 52
## attr(,"snr.min")
## [1] 3
## attr(,"scale.range")
## [1] 1 28
## attr(,"length.min")
## [1] 10
## attr(,"noise.span")
## [1] 5
## attr(,"noise.fun")
## [1] "quantile"
## attr(,"noise.min")
## 5%
## 4.121621
So the main peak close to 50 is correctly found, and the routine picks up another smaller peak at the start.
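As a hedged follow-up sketch (assuming the objects above are in scope), the dominant peak reported by wavCWTPeaks() can be drawn back onto the original series; the smoothing and differencing may offset it slightly:
peak_idx <- peaks$x[which.max(peaks$y)]  # the larger peak, 52 in the output above
plot(tmpData, type = "l")
abline(v = peak_idx, col = "red")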

Related

How to set target in quanteda's textstat_keyness() function using a date

It works when I set target = year(dfmat_news$datee) >= 2016:
tstat_key <- textstat_keyness(hr_dfm,
                              measure = "chi2", sort = TRUE, correction = c("default"),
                              target = year(dfmat_news$datee) >= 2016)
But when I set target = date(dfmat_news$datee) >= 2016-02-01, it does not work.
datee is in Date format (YYYY-MM-DD).
How do I set "target" with a date in the textstat_keyness() function of the R package quanteda?
Thanks a lot!
The problem is that the comparison operator (>=) is comparing a numeric value to a Date-formatted field, dfmat_news$datee, and this does not produce what you expect. The unquoted expression 2016-02-01 is arithmetic (2016 minus 2 minus 1) and evaluates to 2013, whereas Date fields take very different values when treated as integers. For instance:
> as.numeric(as.Date("2016-01-01"))
[1] 16801
So you should brush up on date operations in R, starting with ?Ops.Date.
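For illustration, a hedged sketch of the comparison the question intended (hypothetical: it reuses the question's own object names and assumes dfmat_news$datee is, or can be coerced to, a Date). The key point is that the right-hand side must be a Date or a quoted date string, not bare arithmetic:
# hypothetical sketch reusing the question's object names
tstat_key <- textstat_keyness(hr_dfm,
                              measure = "chi2", sort = TRUE,
                              target = as.Date(dfmat_news$datee) >= as.Date("2016-02-01"))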
Here's a reproducible example solving this in the way that you want, using textstat_keyness(). You can see that this works on an expression that evaluates to a logical.
library("quanteda")
## Package version: 2.1.1
# this has a date field, but the package must be installed
# from https://github.com/quanteda/quanteda.corpora
data(data_corpus_sotu, package = "quanteda.corpora")
hr_dfm <- dfm(tail(data_corpus_sotu, 10)) %>%
dfm_remove(stopwords("en"))
hr_dfm$Date
## [1] "2011-01-25" "2012-01-24" "2013-02-12" "2014-01-28" "2015-01-20"
## [6] "2016-01-12" "2017-02-28" "2018-01-30" "2019-02-05" "2020-02-04"
textstat_keyness(hr_dfm, target = hr_dfm$Date >= "2016-01-01") %>%
head()
## feature chi2 p n_target n_reference
## 1 thank 65.17899 6.661338e-16 85 12
## 2 much 33.17024 8.443305e-09 49 9
## 3 great 28.21748 1.084209e-07 66 22
## 4 , 21.21601 4.103217e-06 1822 1791
## 5 drug 20.95085 4.712182e-06 21 1
## 6 border 20.25901 6.763404e-06 27 4
textstat_keyness(hr_dfm, target = rep(c(FALSE, TRUE), each = 5)) %>%
head()
## feature chi2 p n_target n_reference
## 1 thank 65.17899 6.661338e-16 85 12
## 2 much 33.17024 8.443305e-09 49 9
## 3 great 28.21748 1.084209e-07 66 22
## 4 , 21.21601 4.103217e-06 1822 1791
## 5 drug 20.95085 4.712182e-06 21 1
## 6 border 20.25901 6.763404e-06 27 4

How do I transform a categorical variable to a continuous variable in R?

I have a data frame with many variables such as percentage, year, hectares, and others, but one of them is distance. I am evaluating the direct impact (the area directly impacted) and the extent of this impact via buffers in km around the direct impact, and I compare them to the control.
I have:
At this point, my variables are categorical, even though they represent distances.
Distances:"dirImpct",1km","2km","3km","4km","5km","6km","7km","8km","9km","10km","20km","30km","40km","50km","60km","70km", "controle")
I want something like this:
Distances: dirImpct < km < control
dirImpct = distance 0
km = distances (1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 20, 30, 40, 50, 60, 70)
control = distances > 70
You may convert the variable to a factor and rename the levels, matching them by name so the mapping is explicit:
dat$Distances <- factor(dat$Distances)
levels(dat$Distances)[levels(dat$Distances) == "dirImpct"] <- "0"
levels(dat$Distances)[levels(dat$Distances) == "controle"] <- "dist > 70"
dat
# x Distances
# 1 -0.5448391 0
# 2 -0.7178019 1km
# 3 0.8157044 2km
# 4 0.2789727 3km
# 5 0.4940412 4km
# 6 0.4096203 5km
# 7 -0.9503315 6km
# 8 -0.3750518 7km
# 9 0.3385260 8km
# 10 0.3666900 9km
# 11 0.9265970 10km
# 12 0.2554600 20km
# 13 1.4776808 30km
# 14 -0.3211048 40km
# 15 -0.9469399 50km
# 16 -0.1853323 60km
# 17 -0.4438878 70km
# 18 -1.0206900 dist > 70
Data:
Distances <- c("dirImpct","1km","2km","3km","4km","5km","6km","7km",
               "8km","9km","10km","20km","30km","40km","50km","60km",
               "70km","controle")
dat <- data.frame(x = rnorm(length(Distances)), Distances = Distances)
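If an actual numeric distance is wanted rather than renamed factor levels, here is a hedged base-R sketch (not part of the answer above; it assumes the renaming code has been run) that strips the "km" suffix. The value 80 for the control is an arbitrary placeholder meaning "beyond 70 km":
lab <- as.character(dat$Distances)
dat$dist_km <- suppressWarnings(as.numeric(sub("km", "", lab)))
dat$dist_km[lab == "0"] <- 0            # direct impact
dat$dist_km[lab == "dist > 70"] <- 80   # arbitrary placeholder beyond 70 km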

How to generate random numbers in a data.frame with range

I have a list of data.frames in which I want to generate random numbers for each list element from a sequence.
I used the sample() function to create random numbers, but even though I created random numbers for list [[1]], the same numbers were produced again for set [[2]]. So how can I create different random numbers for set [[2]]?
Here is the simple code:
data.list <- lapply(1:2, function(x) {
nrep <- 1
time <- rep(seq(90,54000,by=90),times=nrep)
Mx <- rep(sort(sample(seq(0.012,-0.014,length.out = 600),replace=TRUE)), times=nrep)
My <- rep(sort(sample(seq(0.02,-0.02,length.out = 600),replace=TRUE)), times=nrep)
Mz <- rep(sort(sample(seq(-1,1,length.out=600),replace=TRUE)), times=nrep)
data.frame(time,Mx,My,Mz,set_nbr=x)
})
These are the first 5 lines of each dataset:
[[1]]
time Mx My Mz set_nbr
1 90 -1.391319e-02 -2.000000e-02 -1.000000000 1
2 180 -1.386978e-02 -1.986644e-02 -1.000000000 1
3 270 -1.386978e-02 -1.973289e-02 -0.996661102 1
4 360 -1.382638e-02 -1.973289e-02 -0.993322204 1
5 450 -1.382638e-02 -1.973289e-02 -0.979966611 1
.. .. .... .... .... ...
[[2]]
time Mx My Mz set_nbr
1 90 -1.395659e-02 -0.0200000000 -1.000000000 2
2 180 -1.391319e-02 -0.0199332220 -0.993322204 2
3 270 -1.386978e-02 -0.0199332220 -0.993322204 2
4 360 -1.386978e-02 -0.0199332220 -0.993322204 2
5 450 -1.382638e-02 -0.0199332220 -0.986644407 2
.. .. .... .... .... ...
EDIT 1:
Regarding #bgoldst's answer, now I can produce different numbers:
set.seed(1);
data.list <- lapply(1:2, function(x) {
nrep <- 1;
time <- rep(seq(90,54000,by=90),times=nrep);
Mx <- rep(sort(runif(600,-0.014,0.012)),times=nrep);
My <- rep(sort(runif(600,-0.02,0.02)),times=nrep);
Mz <- rep(sort(runif(600,-1,1)),times=nrep);
data.frame(time,Mx,My,Mz,set_nbr=x);
});
On the other hand, when I change nrep <- 3, the same numbers are created for each repetition. This is the thing I wanted to avoid from the beginning.
EDIT 2:
#bgoldst showed that replicate does the job!
I think you may have some confusion about how sample() works.
First, let's examine sample()'s behavior with respect to this simple vector:
1:5;
## [1] 1 2 3 4 5
When you pass a multi-element vector to sample() it basically just randomizes the order. This means you'll get a different result every time, or rather, to state it more precisely, the longer the vector is, the less likely you are to get the same result twice:
set.seed(1); sample(1:5); sample(1:5); sample(1:5);
## [1] 2 5 4 3 1
## [1] 5 4 2 3 1
## [1] 2 1 3 4 5
This means if you sort it immediately after sampling, then you'll get the same result every time. And if the original vector was itself sorted, then the result will also be equal to that original vector. This will be true regardless how sample() randomized the order, because the order is always restored by sort():
set.seed(1); sort(sample(1:5)); sort(sample(1:5)); sort(sample(1:5));
## [1] 1 2 3 4 5
## [1] 1 2 3 4 5
## [1] 1 2 3 4 5
Now if you add replace=T (or just rep=T if you like to take advantage of partial matching for concision, which I do), then you're not just randomizing the order, you're selecting size elements with replacement, where size is the vector length if you didn't provide size explicitly. This means you can get repeated elements in the result:
set.seed(1); sample(1:5,rep=T); sample(1:5,rep=T); sample(1:5,rep=T);
## [1] 2 2 3 5 2
## [1] 5 5 4 4 1
## [1] 2 1 4 2 4
And so, if you sort the result, you (likely) won't get back the original vector, because some elements will have been repeated, and some elements will have been omitted:
set.seed(1); sort(sample(1:5,rep=T)); sort(sample(1:5,rep=T)); sort(sample(1:5,rep=T));
## [1] 2 2 2 3 5
## [1] 1 4 4 5 5
## [1] 1 2 2 4 4
That's exactly what is happening with your code. Your output vectors are different between the two list components, because you're sampling with replacement before sorting, which means different repetitions and omissions of the elements will occur for each list component. But since you're sampling from the same sequence and you're sorting the result, you're bound to get similar-looking results for each list component, even though they're not identical.
I think what you might be looking for is random deviates from a uniform distribution. You can get these from runif():
set.seed(1); runif(5,-0.014,0.012);
## [1] -0.0070967748 -0.0043247786 0.0008941874 0.0096134025 -0.0087562698
set.seed(1); runif(5,-0.02,0.02);
## [1] -0.009379653 -0.005115044 0.002914135 0.016328312 -0.011932723
set.seed(1); runif(5,-1,1);
## [1] -0.4689827 -0.2557522 0.1457067 0.8164156 -0.5966361
Thus, your code would become:
set.seed(1);
data.list <- lapply(1:2, function(x) {
nrep <- 1;
time <- rep(seq(90,54000,by=90),times=nrep);
Mx <- rep(sort(runif(600,-0.014,0.012)),times=nrep);
My <- rep(sort(runif(600,-0.02,0.02)),times=nrep);
Mz <- rep(sort(runif(600,-1,1)),times=nrep);
data.frame(time,Mx,My,Mz,set_nbr=x);
});
Which gives:
lapply(data.list,head);
## [[1]]
## time Mx My Mz set_nbr
## 1 90 -0.01395224 -0.01994741 -0.9967155 1
## 2 180 -0.01394975 -0.01991923 -0.9933909 1
## 3 270 -0.01378866 -0.01980934 -0.9905714 1
## 4 360 -0.01371306 -0.01977090 -0.9854065 1
## 5 450 -0.01371011 -0.01961713 -0.9850108 1
## 6 540 -0.01365998 -0.01960718 -0.9846628 1
##
## [[2]]
## time Mx My Mz set_nbr
## 1 90 -0.01398426 -0.01997718 -0.9970438 2
## 2 180 -0.01398293 -0.01989651 -0.9931286 2
## 3 270 -0.01397330 -0.01988715 -0.9923425 2
## 4 360 -0.01396455 -0.01957807 -0.9913645 2
## 5 450 -0.01384501 -0.01939597 -0.9892001 2
## 6 540 -0.01382531 -0.01931913 -0.9889356 2
Edit: It looked from your question like you wanted the random numbers to be different between list components, that is to say, between the components generated from the 1:2 passed as the first argument to lapply(). The repetition of each random vector nrep times within each list component didn't appear to be relevant, partly because you set nrep to 1, so there wasn't any actual repetition.
But that's OK, we can achieve this requirement by using replicate() instead of rep(), because replicate() actually runs its expression argument once for every repetition. We also have to flatten the result, because replicate() by default returns a matrix, and we want a straight vector:
set.seed(1);
data.list <- lapply(1:2, function(x) {
nrep <- 2;
time <- rep(seq(90,54000,by=90),times=nrep);
Mx <- c(replicate(nrep,sort(runif(600,-0.014,0.012))));
My <- c(replicate(nrep,sort(runif(600,-0.02,0.02))));
Mz <- c(replicate(nrep,sort(runif(600,-1,1))));
data.frame(time,Mx,My,Mz,set_nbr=x);
});
lapply(data.list,function(x) x[c(1:6,601:606),]);
## [[1]]
## time Mx My Mz set_nbr
## 1 90 -0.01395224 -0.01993431 -0.9988590 1
## 2 180 -0.01394975 -0.01986782 -0.9948254 1
## 3 270 -0.01378866 -0.01981143 -0.9943576 1
## 4 360 -0.01371306 -0.01970813 -0.9789037 1
## 5 450 -0.01371011 -0.01970022 -0.9697986 1
## 6 540 -0.01365998 -0.01969326 -0.9659567 1
## 601 90 -0.01396582 -0.01997579 -0.9970438 1
## 602 180 -0.01394750 -0.01997375 -0.9931286 1
## 603 270 -0.01387607 -0.01995893 -0.9923425 1
## 604 360 -0.01385108 -0.01994546 -0.9913645 1
## 605 450 -0.01375113 -0.01976155 -0.9892001 1
## 606 540 -0.01374467 -0.01973125 -0.9889356 1
##
## [[2]]
## time Mx My Mz set_nbr
## 1 90 -0.01396979 -0.01999198 -0.9960861 2
## 2 180 -0.01390373 -0.01995219 -0.9945237 2
## 3 270 -0.01390252 -0.01991559 -0.9925640 2
## 4 360 -0.01388905 -0.01978123 -0.9890171 2
## 5 450 -0.01386718 -0.01967644 -0.9835435 2
## 6 540 -0.01384351 -0.01958008 -0.9822988 2
## 601 90 -0.01396739 -0.01989328 -0.9971255 2
## 602 180 -0.01396433 -0.01985785 -0.9954987 2
## 603 270 -0.01390700 -0.01984074 -0.9903196 2
## 604 360 -0.01376890 -0.01982715 -0.9902251 2
## 605 450 -0.01366110 -0.01979802 -0.9829480 2
## 606 540 -0.01364868 -0.01977278 -0.9812671 2
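As a quick hedged check (assuming the code above has been run), the draws should now differ both between the two list components and between the repetitions within a component:
identical(data.list[[1]]$Mx, data.list[[2]]$Mx)                    # expected FALSE
identical(data.list[[1]]$Mx[1:600], data.list[[1]]$Mx[601:1200])   # expected FALSE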

converting arrival time/process to count process in R

I have data for an arrival process and I want to convert it to a count process. This is what I did:
# inter-arrival times in milliseconds
# note: rpareto() is not in base R; EnvStats::rpareto(), for example, takes
# location and shape arguments (the package used is assumed, not stated here)
x <- rpareto(100000, location = 10, shape = 1.2)
# arrival time in milliseconds
x.cumsum <- cumsum(x)
# the last arrival
x.max <- max(x.cumsum)
# the time scale for the count data, in this case 1 second
kTimeScale <- 1000
count.length <- ceiling(x.max / kTimeScale)
counts <- rep(0, times = count.length)
for (i in x.cumsum) {
counts[round(i / kTimeScale)] <- counts[round(i / kTimeScale)] + 1
}
This works, but for very large datasets (a few million arrivals) it's slow. I was wondering if there is a better, faster way to do this?
You can do this with table:
countsTable<-table(round(x.cumsum/kTimeScale))
counts[1:10]
## [1] 24 41 1 2 33 26 20 45 36 19
countsTable[1:10]
##
## 0 1 2 3 4 5 6 7 8 9
## 5 24 41 1 2 33 26 20 45 36
The difference is that your function misses the 0 values. The table function won't put in 0 for values where there are no observations but you can do something like this to fix that:
counts2<-rep(0,length(counts)+1)
counts2[as.integer(names(countsTable))+1]<-countsTable
identical(counts,counts2[-1])
## [1] TRUE
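A further hedged alternative (not part of the answer above): tabulate() fills the empty bins with zeros directly and avoids the names-based fix-up:
bins <- round(x.cumsum / kTimeScale)
counts3 <- tabulate(bins + 1, nbins = count.length + 1)  # +1 so bin 0 maps to index 1
all(counts2 == counts3)  # expected TRUE under the same rounding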

finding unique vector elements in a list efficiently

I have a list of numerical vectors, and I need to create a list containing only one copy of each vector. There isn't a list method for the identical function, so I wrote a function that checks every vector against every other.
F1 <- function(x) {
  to_remove <- c()
  for (i in 1:length(x)) {
    for (j in 1:length(x)) {
      if (i < j && identical(x[[i]], x[[j]])) to_remove <- c(to_remove, j)
    }
  }
  if (is.null(to_remove)) x else x[-c(to_remove)]
}
The problem is that this function becomes very slow as the size of the input list x increases, partly due to the two large index vectors (1:length(x)) allocated by the for loops. I'm hoping for a method that will run in under one minute for a list of length 1.5 million with vectors of length 15, but that might be optimistic.
Does anyone know a more efficient way of comparing each vector in a list with every other vector? The vectors themselves are guaranteed to be equal in length.
Sample output is shown below.
x = list(1:4, 1:4, 2:5, 3:6)
F1(x)
> list(1:4, 2:5, 3:6)
As per #JoshuaUlrich and #thelatemail, ll[!duplicated(ll)] works just fine.
And thus, so should unique(ll)
I previously suggested a method using sapply(), with the idea of not checking every element in the list (I deleted that answer, as I think using unique() makes more sense).
Since efficiency is a goal, we should benchmark these.
# Let's create some sample data
xx <- lapply(rep(100,15), sample)
ll <- as.list(sample(xx, 1000, T))
ll
Putting them up against some benchmarks:
fun1 <- function(ll) {
ll[c(TRUE, !sapply(2:length(ll), function(i) ll[i] %in% ll[1:(i-1)]))]
}
fun2 <- function(ll) {
ll[!duplicated(sapply(ll, digest))]
}
fun3 <- function(ll) {
ll[!duplicated(ll)]
}
fun4 <- function(ll) {
unique(ll)
}
#Make sure all the same
all(identical(fun1(ll), fun2(ll)), identical(fun2(ll), fun3(ll)),
identical(fun3(ll), fun4(ll)), identical(fun4(ll), fun1(ll)))
# [1] TRUE
library(rbenchmark)
library(digest)  # fun2 hashes each vector with digest()
benchmark(digest=fun2(ll), duplicated=fun3(ll), unique=fun4(ll), replications=100, order="relative")[, c(1, 3:6)]
test elapsed relative user.self sys.self
3 unique 0.048 1.000 0.049 0.000
2 duplicated 0.050 1.042 0.050 0.000
1 digest 8.427 175.563 8.415 0.038
# I took out fun1, since when ll is large, it ran extremely slow
Fastest Option:
unique(ll)
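Applied to the small example from the question, unique() keeps the first occurrence of each distinct vector:
x <- list(1:4, 1:4, 2:5, 3:6)
unique(x)
## [[1]]
## [1] 1 2 3 4
##
## [[2]]
## [1] 2 3 4 5
##
## [[3]]
## [1] 3 4 5 6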
You could hash each of the vectors and then use !duplicated() to identify unique elements of the resultant character vector:
library(digest)
## Some example data
x <- 1:44
y <- 2:10
z <- rnorm(10)
ll <- list(x,y,x,x,x,z,y)
ll[!duplicated(sapply(ll, digest))]
# [[1]]
# [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
# [26] 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44
#
# [[2]]
# [1] 2 3 4 5 6 7 8 9 10
#
# [[3]]
# [1] 1.24573610 -0.48894189 -0.18799758 -1.30696395 -0.05052373 0.94088670
# [7] -0.20254574 -1.08275938 -0.32937153 0.49454570
To see at a glance why this works, here's what the hashes look like:
sapply(ll, digest)
[1] "efe1bc7b6eca82ad78ac732d6f1507e7" "fd61b0fff79f76586ad840c9c0f497d1"
[3] "efe1bc7b6eca82ad78ac732d6f1507e7" "efe1bc7b6eca82ad78ac732d6f1507e7"
[5] "efe1bc7b6eca82ad78ac732d6f1507e7" "592e2e533582b2bbaf0bb460e558d0a5"
[7] "fd61b0fff79f76586ad840c9c0f497d1"
