Rolling Hurst exponent in R

I am pretty new to R and I have been trying to compute a rolling Hurst exponent, to no avail. I have installed the packages fArma (for the Hurst exponent) and zoo (for rollapply). The data are in a data frame called 'data', in the variable 'returns'. The following code for a single Hurst estimate works great:
rsFit(data$returns, levels = 50, minnpts = 3, cut.off = 10^c(0.7, 2.5), doplot = FALSE, trace = FALSE, title = NULL, description = NULL)
Below is my attempt at a rolling Hurst exponent of window size 230, which generates an error.
rollapply(data$returns, 230, (rsFit(data$returns, levels = 50, minnpts = 3, cut.off = 10^c(0.7, 2.5), doplot = FALSE, trace = FALSE, title = NULL, description = NULL)))
Any help with the code would be much appreciated. I am trying to calculate the Hurst exponent over a 230 period window, that rolls forward 1 period at a time.
The data is;
returns
1 -0.002002003
2 -0.002006019
3 0.000000000
4 0.000000000
5 -0.009077218
6 -0.003044142
7 -0.002034589
8 0.004065046
9 0.002026343
10 0.001011634
11 0.001010612
12 0.000000000
13 -0.001010612
14 -0.001011634
15 0.003031837
16 -0.001009591
17 0.001009591
18 -0.002020203

I'm really not familiar with the fArma package or its functions, but I noticed a couple of major issues with your code.
You are trying to use rollapply incorrectly; specifically, your third argument is the result of the call rsFit(data$returns, ...) rather than the function itself. In general, if you are applying a (one-parameter) function foo over a data object x with rollapply, your call should be rollapply(x, some_integer, foo).
So in your case, you would have
rollapply(data$returns,230,rsFit)
since it is acceptable to call rsFit with only one argument (the first one, x, as shown in the help file ?rsFit).
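To see the calling convention in isolation, here is a minimal rollapply example using mean instead of rsFit (no fArma involved):

```r
library(zoo)

x <- 1:10
# Pass the function itself as the third argument; rollapply calls it
# on each window of width 3.
result <- rollapply(x, 3, mean)
length(result)  # 10 - (3 - 1) = 8 windows
```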
The width of 230 that you are specifying in rollapply is also much too large: your sample data, data$returns, has a length of only 18, and the window size must be less than the length of the data. One option is to use a smaller width. I tried a couple of small values (5, 10, ...) with your data, but this produced errors; like I said, I'm not familiar with the functions in fArma, but I suspect rsFit requires more than 5 or 10 observations at a time. A better solution is to use a larger sample of data, as shown below.
Even having made the changes described above, you will encounter one more issue. From the Value section (i.e. return value) in ?rollapply:
"A object of the same class as data with the results of the rolling
function."
Typically this is some type of simple object, e.g. a vector or matrix, depending on your input. However, rsFit returns an S4 object of class fHURST, which rollapply is apparently not able to deal with. This is not surprising, since fHURST objects have a fairly complex structure: try running str(rsFit(data$returns)) and note all of the various slots they contain. The simple solution is, instead of returning the entire fHURST object inside rollapply, to return only the component / slot that you need. Again, I've never used rsFit and I don't have time to read into the theoretical underpinnings of Hurst exponents, but below I assumed you were mainly concerned with the estimated coefficient values occupying the @hurst slot of the fHURST object.
As noted above, I made a toy data set that is much larger than 18 observations so that I could keep the width=230 in rollapply.
library(fArma)
library(zoo)
##
set.seed(123)
data2 <- rnorm(690)
##
data2.ra <- rollapply(data2, 230, function(x){
  hSlot <- rsFit(x)@hurst
  result <- data.frame(
    H = hSlot$H,
    beta = hSlot$beta,
    Estimate.intercept = hSlot$diag[1,1],
    Estimate.X = hSlot$diag[2,1])
  result
})
##
> head(data2.ra)
H beta Estimate.intercept Estimate.X
1 0.6257476 0.6257476 -0.143363281 0.6257476
2 0.6627804 0.6627804 -0.193806373 0.6627804
3 0.6235309 0.6235309 -0.133828565 0.6235309
4 0.5683417 0.5683417 -0.055960572 0.5683417
5 0.5520769 0.5520769 -0.027270395 0.5520769
6 0.5334170 0.5334170 -0.003523383 0.5334170
> dim(data2.ra)
[1] 461 4
> 690 - (230-1)
[1] 461
This object has 461 rows, since the output of rollapply with a window size k on an object of length n has n - (k-1) elements. Of course, you can change the body of the anonymous function (function(x){...}) used in rollapply above to suit your needs.

Related

Memory management in R ComplexUpset Package

I'm trying to plot a stacked barplot inside an upset plot using the ComplexUpset package. The plot I'd like to get looks something like this (where mpaa would be component in my example):
I have a dataframe of size 57244 by 21, where one column is ID and the other is type of recording, and other 19 columns are components from 1 to 19:
ID component1 component2 ... component19 type
1 1 0 1 a
2 0 0 1 b
3 1 1 0 b
Ones and zeros indicate affiliation with a certain component. As shown in the example in the docs, I first convert these ones and zeros to logical, and then try to plot the basic upset plot. Here's the code:
df <- df %>% mutate(across(where(is.numeric), as.logical))
components <- colnames(df)[2:20]
upset(df, components, name='protein', width_ratio = 0.1)
But unfortunately after thinking for a while when processing the last line it spits out an error message like this:
Error: cannot allocate vector of size 176.2 Mb
Though I know I'm on a 32Gb RAM machine, I'm sure I couldn't have flooded the memory so badly that 176.2 Mb can't be allocated, so my guess is that I am managing memory in R incorrectly somehow. Could you please explain what's faulty in my code, if possible?
I also know that the UpSetR package plots the same data, but as far as I know it provides no way to do the stacked barplotting.
Somehow, it works if you:
Tweak the min_size parameter so that the plot is not overloaded and makes a better impression.
Make the first argument of upset() a sample of your data; this also helps, even if your sample is the whole dataset.
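A minimal sketch of both suggestions together, assuming df and components as defined in the question (the min_size = 100 cutoff is an arbitrary value to illustrate the idea):

```r
library(ComplexUpset)
library(dplyr)

# A "sample" of the data, here consisting of all rows (this alone can help)
df_small <- slice_sample(df, n = nrow(df))

# min_size hides intersections with fewer than 100 members,
# which keeps the plot readable and reduces the work to be done
upset(df_small, components, name = 'protein',
      width_ratio = 0.1,
      min_size = 100)
```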

Is there a way to import the results of HSD.test from agricolae directly into geom_text() in a ggplot2?

I'm creating figures that show the efficacy of several warning signals relative to the event they warn about. The figure is based off a dataframe which is produced by a function that runs a model multiple times and collates the results like this:
t type label early
4 847 alarm alarm.1 41
2 849 alarm alarm.2 39
6 853 alarm alarm.3 35
5 923 alarm alarm.4 -35
7 1003 alarm alarm.5 -115
But with a dozen alarms and a value for each alarm n times (typically 20 - 100), with each value being slightly different depending on random and stochastic variables built into the model.
I'm putting the results in an lm
a.lm <- lm(log(early + 500) ~ label, data = alarm.data)
and after checking the assumptions are met, running a 1 way anova
anova(a.lm)
then a tukey post hoc test
HSD.test(a.lm, trt = "label", console = TRUE)
Which produces
log(early + 500) groups
alarm.1 6.031453 a
alarm.2 6.015221 a
alarm.3 6.008366 b
alarm.4 5.995150 b
alarm.5 5.921384 c
I have a function which generates a ggplot2 figure based on the collated data, to which I then manually add + geom_text(label = c("a", "a", "b", "b", "c")) or whatever the appropriate letters are. Is there a way to generalise that last step, calling the letters directly from the result of the HSD.test? If I put the results of the HSD.test into an object
a.test <- HSD.test(a.lm, trt = "label", console = TRUE)
I can call the results using a.test$groups, and the letter groupings specifically using a.test$groups$groups, but I don't know enough about manipulating lists to make that useful to me. Whilst the order of the labels in the ggplot is predictable, the order of the groups in the HSD.test result isn't, and can vary a lot between iterations of the model-running function.
If anyone has any insights I would be grateful.
Okay I actually bumped into a solution just after I posted the question.
If you take the output of the HSD.test and make it into an object
a.test <- HSD.test(a.lm, trt = "label")
Then convert the groups list into a dataframe
a.df <- as.data.frame(a.test$groups)
The row index is the alarm names rather than numbers
a.df
log(early + 500) groups
alarm.1 6.849082 a
alarm.2 6.842465 a
alarm.3 6.837438 a
alarm.4 6.836437 a
alarm.5 6.812714 a
so they can be called specifically into geom_text inside the function
a.plot +
  geom_text(label = c(a.df["alarm.1", 2],
                      a.df["alarm.2", 2],
                      a.df["alarm.3", 2],
                      a.df["alarm.4", 2],
                      a.df["alarm.5", 2]))
Even though this doesn't use the same functions to get the compact letter display, I think this may be a more efficient way of doing it.
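Since the row index of a.df carries the alarm names, the lookup can be generalised by indexing with a vector of labels in plot order instead of typing each one out. A sketch, assuming the plot's x axis follows the alarm.1 ... alarm.5 naming from the question:

```r
# a.df as built above: rownames are alarm names, "groups" holds the letters
plot_order <- paste0("alarm.", 1:5)   # the order used on the plot's axis

# Indexing by rowname is robust to HSD.test reordering the groups
a.plot + geom_text(label = a.df[plot_order, "groups"])
```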

Creating synthetic user data in R; issues with generating user identifier variable

I am trying to generate synthetic user event log data for demonstration purposes. It's going to be very basic feature-wise (about 4 variables altogether). Here is what I have so far:-
require(wakefield)  # for generating the eventOutcome variable
require(dplyr)
require(stringi)
set.seed(1)
eventDate<-seq(as.Date("2015-01-01"), as.Date("2015-12-31"), by = "1 day")
eventDate<-sample(rep(eventDate,each=1000),replace = T)
u <- runif(length(eventDate), 0, 60*60*12)  # "noise" to add to midnight of each day
eventDateTime <- as.POSIXlt(u, origin = paste0(eventDate, " 00:00:00"))
eventDateTime
eventOutcome<-r_sample_factor(x = c("Passed", "Failed", "Ongoing","Unknown"), n=length(eventDate))
eventOutcome
data<-data.frame(eventDate,eventDateTime,eventOutcome)
head(data)
# eventDate eventDateTime eventOutcome
#1 2015-01-25 2015-01-25 04:48:47 Unknown
#2 2015-05-05 2015-05-05 09:35:22 Unknown
#3 2015-11-28 2015-11-28 08:56:16 Failed
#4 2015-05-23 2015-05-23 02:24:52 Ongoing
#5 2015-01-26 2015-01-26 07:43:52 Failed
#6 2015-10-22 2015-10-22 03:07:14 Passed
There are about 365000 rows of data here. All that is left to do is add a user identifier variable. I would like it if some users have maybe a handful of interactions recorded in the data set, whereas other users have dozens/hundreds/thousands of interactions (I would like this dataset to have that kind of variability).
I can create a user identifier variable no problem:-
UserId<-stri_rand_strings(1300,6)
But if I add this to the data, it doesn't work:-
data$UserId<-stri_rand_strings(1300,6)
Error in `$<-.data.frame`(`*tmp*`, UserId, value = c("k3QlXs", "gK3eBa", :
replacement has 1300 rows, data has 365000
So my request is two-fold: how can I assign a user identifier variable to this kind of data, and how can I make it variable, so that some users have one or a few interactions whilst others appear frequently (i.e. dozens, hundreds, or thousands of times)?
Thank you in advance for any help, always appreciated :)
One option might be to generate a UUID for each user. A UUID looks like this:
c7f2dde5-dfeb-45cb-9720-87b23effd45d
If you use a good UUID generator, then it is almost impossible to generate the same UUID more than once. R has a uuid package which can be used:
library(uuid)
user_uuid <- UUIDgenerate()
It sounds like the distribution of user IDs you are looking for is something akin to a power law or Pareto distribution. This is a way to generate a vector of relative probabilities of sampling each user ID that follows a power law, then sample the user IDs following those relative probabilities.
Here I've used the function Pareto::rPareto to generate the relative probabilities. You can vary the parameters, especially alpha, to get different distributions; as alpha increases, the distribution becomes more even. I also supplied a truncation parameter so that you will not get too many users with unrealistically high numbers of interactions (in this example the most frequent ID has ~3700 cases).
Note you are not guaranteed to sample each of the 1300 user IDs at least once with this particular way of doing it.
Next I use the function uuid::UUIDgenerate to generate 1300 unique strings. Finally I use sample to sample the unique IDs with replacement as many times as you have rows in your data frame. I plot the frequencies of the different IDs in the sample. Again, modify the parameters if this distribution is not what you are looking for.
library(Pareto)
library(uuid)
library(ggplot2)
set.seed(1)
n_users <- 1300
n_rows <- 365000
relative_probs <- rPareto(n = n_users, t = 1, alpha = 0.3, truncation = 500)
unique_ids <- UUIDgenerate(n = n_users)
id_sample <- sample(unique_ids, size = n_rows, prob = relative_probs, replace = TRUE)
# Check the distribution of IDs
ggplot(as.data.frame(table(id_sample)), aes(x = Freq)) +
  geom_histogram() +
  scale_x_log10()
You are generating 1300 strings, but the number of rows in your data is 365000. So you can use sample to repeat those 1300 strings randomly:
library(stringi)
data$UserId <- sample(stri_rand_strings(1300,6), nrow(data), replace = TRUE)
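The two answers can also be combined: keep the short random strings, but sample them with heavy-tailed weights so that a few users dominate. A sketch; the rexp(1300)^3 weights are just one arbitrary way to produce a skewed distribution, not anything principled:

```r
library(stringi)
set.seed(1)

ids <- stri_rand_strings(1300, 6)
w   <- rexp(1300)^3   # heavy-tailed relative weights (assumption, tune to taste)

# Users with large weights appear far more often than users with small ones
data$UserId <- sample(ids, nrow(data), prob = w, replace = TRUE)

# Inspect the most frequent users to check the skew
head(sort(table(data$UserId), decreasing = TRUE))
```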

Fisher test more than 2 groups

Major Edit:
I decided to rewrite this question since my original was poorly put. I will leave the original question below to maintain a record. Basically, I need to do Fisher's Test on tables as big as 4 x 5 with around 200 observations. It turns out that this is often a major computational challenge as explained here (I think, I can't follow it completely). As I use both R and Stata I will frame the question for both with some made-up data.
Stata:
tabi 1 13 3 27 46 \ 25 0 2 5 3 \ 22 2 0 3 0 \ 19 34 3 8 1 , exact(10)
You can increase exact() to 1000 max (but it will take maybe a day before returning an error).
R:
Job <- matrix(c(1,13,3,27,46, 25,0,2,5,3, 22,2,0,3,0, 19,34,3,8,1), 4, 5,
dimnames = list(income = c("< 15k", "15-25k", "25-40k", ">40k"),
satisfaction = c("VeryD", "LittleD", "ModerateS", "VeryS", "exstatic")))
fisher.test(Job)
For me, at least, it errors out in both programs. So the question is: how can this calculation be done in either Stata or R?
Original Question:
I have Stata and R to play with.
I have a dataset with various categorical variables, some of which have multiple categories.
Therefore I'd like to do Fisher's exact test with more than 2 x 2 categories
i.e. apply Fisher's to a 2 x 6 table or a 4 x 4 table.
Can this be done with either R or Stata ?
Edit: whilst this can be done in Stata - it will not work for my dataset as I have too many categories. Stata goes through endless iterations and even being left for a day or more does not produce a solution.
My question is really - can R do this, and can it do it quickly ?
Have you studied the documentation of R function fisher.test? Quoting from help("fisher.test"):
For 2 by 2 cases, p-values are obtained directly using the (central or
non-central) hypergeometric distribution. Otherwise, computations are
based on a C version of the FORTRAN subroutine FEXACT which implements
the network developed by Mehta and Patel (1986) and improved by
Clarkson, Fan and Joe (1993).
This is an example given in the documentation:
Job <- matrix(c(1,2,1,0, 3,3,6,1, 10,10,14,9, 6,7,12,11), 4, 4,
dimnames = list(income = c("< 15k", "15-25k", "25-40k", "> 40k"),
satisfaction = c("VeryD", "LittleD", "ModerateS", "VeryS")))
fisher.test(Job)
# Fisher's Exact Test for Count Data
#
# data: Job
# p-value = 0.7827
# alternative hypothesis: two.sided
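For a table like the 4 x 5 one in the question, where the exact network algorithm can run out of workspace, fisher.test also accepts a workspace argument and a Monte Carlo option (simulate.p.value with B replicates). A sketch using the question's own counts:

```r
Job <- matrix(c(1,13,3,27,46, 25,0,2,5,3, 22,2,0,3,0, 19,34,3,8,1), 4, 5,
              dimnames = list(income = c("< 15k", "15-25k", "25-40k", ">40k"),
                              satisfaction = c("VeryD", "LittleD", "ModerateS",
                                               "VeryS", "exstatic")))

# Monte Carlo p-value sidesteps the exact network algorithm entirely
fisher.test(Job, simulate.p.value = TRUE, B = 1e5)

# Alternatively, enlarge the workspace for the exact computation:
# fisher.test(Job, workspace = 2e8)
```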
As far as Stata is concerned, your original statement was totally incorrect. search fisher leads quickly to help tabulate twoway and
the help for the exact option explains that it may be applied to r x
c as well as to 2 x 2 tables
the very first example in the same place of Fisher's exact test underlines that Stata is not limited to 2 x 2 tables.
It's a minimal expectation anywhere on this site that you try to read basic documentation. Please!

unused arguments error using apply() in R

I get an error message when I attempt to use apply() conditional on a column of dates to return a set of coefficients.
I have a dataset (herein modified for simplicity, but reproducible):
library(data.table)
Adataset <- data.table(Epoch = c("2007-11-15", "2007-11-16", "2007-11-17",
                                 "2007-11-18", "2007-11-19", "2007-11-20", "2007-11-21"),
                       Distance = c("92336.22", "92336.23", "92336.22", "92336.20",
                                    "92336.19", "92336.21", "92336.18"))
Adataset
Epoch Distance
1: 2007-11-15 92336.22
2: 2007-11-16 92336.23
3: 2007-11-17 92336.22
4: 2007-11-18 92336.20
5: 2007-11-19 92336.19
6: 2007-11-20 92336.21
7: 2007-11-21 92336.18
The analysis begins with establishing start and end dates:
############## Establish dates for analysis
#4.Set date for center of duration
StartDate <- "2007-11-18"
as.numeric(as.Date(StartDate)); StartDate
EndDate <- as.Date(tail(Adataset$Epoch,1)); EndDate
Then I establish time durations for analysis:
#5.Quantify duration of time window
STDuration <- 1
LTDuration <- 3
Then I write functions to regress over both durations and return the slopes:
# Write STS and LTS functions, each with following steps
#6.Define time window - from StartDate less ShortTermDuration to StartDate plus ShortTermDuration
#7.Define Short Term & Long Term datasets
#8. Run regression over dataset
library(MASS)  # for rlm
my_STS_Function <- function (StartDate) {
STAhead <- as.Date(StartDate) + STDuration; STAhead
STBehind <- as.Date(StartDate) - STDuration; STBehind
STDataset <- subset(Adataset, as.Date(Epoch) >= STBehind & as.Date(Epoch)<STAhead)
STResults <- rlm( Distance ~ Epoch, data=STDataset); STResults
STSummary <- summary( STResults ); STSummary
# Return coefficient (Slope of regression)
STNum <- STResults$coefficients[2];STNum
}
my_LTS_Function <- function (StartDate) {
LTAhead <- as.Date(StartDate) + LTDuration; LTAhead
LTBehind <- as.Date(StartDate) - LTDuration; LTBehind
LTDataset <- subset(Adataset, as.Date(Epoch) >= LTBehind & as.Date(Epoch)<LTAhead)
LTResults <- rlm( Distance ~ Epoch, data=LTDataset); LTResults
LTSummary <- summary( LTResults ); LTSummary
# Return coefficient (Slope of regression)
LTNum <- LTResults$coefficients[2];LTNum
}
Then I test the function to make sure it works for a single date:
myTestResult <- my_STS_Function("2007-11-18")
It works, so I move on to apply the function over the range of dates in the dataset:
mySTSResult <- apply(Adataset, 1, my_STS_Function, seq(StartDate : EndDate))
...in which my desired result is a list or array or vector of mySTSResult (slopes) (and, subsequently, a separate list/array/vector of myLTSResults so then I can create a STSlope:LTSlope ratio over the duration), something like (mySTSResults fabricated)...
> Adataset
Epoch Distance mySTSResults
1: 2007-11-15 92336.22 3
2: 2007-11-16 92336.23 4
3: 2007-11-17 92336.22 5
4: 2007-11-18 92336.20 6
5: 2007-11-19 92336.19 7
6: 2007-11-20 92336.21 8
7: 2007-11-21 92336.18 9
Only I get this error:
Error in FUN(newX[, i], ...) : unused argument(s) (1:1185)
What is this telling me and how to do correct it? I've done some looking and cannot find the correction.
Hopefully I've explained this sufficiently. Please let me know if you need further details.
Ok, it seems the problem is in the additional arguments to my_STS_Function as stated in your apply function call (as you have defined it with only one parameter). The date range is being passed as an additional parameter to that function, and R is complaining that it is unused (a vector of 1185 elements it seems). Are you rather trying to pull a subset of the rows restricted by date range first, then wishing to apply the my_STS_Function? I'd have to think a bit on an exact solution to that.
Sorry - I did my working out in the comments there. A possible solution is this:
subSet <- Adataset[Adataset[,1] %in% seq(StartDate:EndDate),][order(na.exclude(match(Adataset[,1], seq(StartDate, EndDate)))),]
Adapted from the answer in this question:
R select rows in matrix from another vector (match, %in)
Adding this as a new answer as the previous one was getting confused. A previous commenter was correct, there are bugs in your code, but they aren't a sticking point.
My updated approach was to use seq.Date to generate the date sequence (only works if you have a data point for each day between the start and end - though you could use na.exclude as above):
dates = seq.Date(as.Date(StartDate),as.Date(EndDate),"days")
You then use this as the input to apply, with some munging of types to get things working correctly (I've done this with a lamda function):
mySTSResult <- apply(as.matrix(dates), 1, function(x) {class(x) <- "Date"; my_STS_Function(x)})
Then hopefully you should have a vector of the results, and you should be able to do something similar for LTS, and then manipulate that into another column in your original data frame/matrix.
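As a variant of the same idea: since my_STS_Function already calls as.Date() on its argument, you can sidestep the class-munging entirely by iterating over character dates. A sketch building on the code above (assuming my_STS_Function and the start/end dates from the question):

```r
dates <- seq.Date(as.Date(StartDate), as.Date(EndDate), "days")

# as.character() keeps each element a plain string, which the function
# converts back to Date itself; vapply guarantees one numeric per date
mySTSResult <- vapply(as.character(dates), my_STS_Function, numeric(1))

mySTSResult  # named numeric vector of short-term slopes, one per date
```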