How to calculate area under the curve (AUC) in several data series? - r

I have blood-parameter data from around 400 patients; for each patient I collected the parameter on 30 consecutive days, so each patient has around 30 values.
It looks like this:
So from these data I want to calculate the area under the curve for each patient.
As far as I can see, the "pROC" package could probably help me with this. But what is the fastest method to calculate the AUC for each patient? I want to avoid calculating it for each patient manually.
Can anyone help?
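One thing to note: pROC computes the area under a ROC curve, which is a classification measure; for the area under a parameter-versus-time curve, the trapezoidal rule is the usual tool. A minimal base-R sketch, assuming a hypothetical long-format data frame `blood` with columns `patient`, `day`, and `value` (the names are assumptions, not from the question):

```r
# Trapezoidal rule: area under the points (x, y), x sorted ascending
trapz <- function(x, y) sum(diff(x) * (head(y, -1) + tail(y, -1)) / 2)

# Toy stand-in for the real data: 2 patients x 30 days
set.seed(1)
blood <- data.frame(
  patient = rep(c("P1", "P2"), each = 30),
  day     = rep(1:30, times = 2),
  value   = c(rnorm(30, 10), rnorm(30, 12))
)

# One AUC per patient, without looping manually
aucs <- sapply(split(blood, blood$patient), function(d) {
  d <- d[order(d$day), ]
  trapz(d$day, d$value)
})
aucs
```

`split()` + `sapply()` handles all 400 patients in one pass; the same idea works with `dplyr::group_by()` + `summarise()` if you prefer the tidyverse.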

Related

time series with multiple observations per unit of time

I have a dataset of the daily spreads of 500 stocks. My eventual goal is to build a model using extreme value theory. However, as one of the first steps, I want to check my data for volatility clustering and leptokurticity. So I first want R to see my data as a time series, and I want to plot my data. However, I can only find examples of time series with a single observation per unit of time. Is it possible for R to treat this type of dataset as a time series? And what's the best way to plot it?
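One option worth noting: `ts()` accepts a matrix, in which case each column becomes one series sharing the same time index (a multivariate time series). A small sketch with made-up data standing in for the 500 stocks:

```r
# Toy stand-in for the real data: daily spreads of 3 stocks over 100 days
set.seed(1)
spreads <- matrix(rnorm(300), ncol = 3,
                  dimnames = list(NULL, c("AAA", "BBB", "CCC")))

# ts() on a matrix gives a multivariate time series (class "mts"):
# one column per stock, one shared time index (252 trading days/year here)
spread_ts <- ts(spreads, start = c(2020, 1), frequency = 252)

# All series in one panel; plot(spread_ts) gives one panel per series instead
matplot(time(spread_ts), spread_ts, type = "l", lty = 1,
        xlab = "Time", ylab = "Spread")
```

The `zoo` and `xts` packages offer the same matrix-of-columns idea with more flexible (e.g. irregular) time indices.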

DBSCAN on high dense dataset. R

I've recently been studying DBSCAN with R for transit research purposes, and I'm hoping someone can help me with this particular dataset.
Summary of my dataset is described below.
BTIME ATIME
1029 20001 21249
2944 24832 25687
6876 25231 26179
11120 20364 21259
11428 25550 26398
12447 24208 25172
What I am trying to do is to cluster these data using BTIME as x axis, ATIME as y axis. A pair of BTIME and ATIME represents the boarding time and arrival time of a subway passenger.
For more explanation, I will add the scatter plot of my total data set.
However if I split my dataset in different smaller time periods, the scatter plot looks like this. I would call this a sample dataset.
If I perform DBSCAN clustering on the second image (the sample dataset), the clustering works as expected.
However, it seems that DBSCAN cannot cluster the total dataset at smaller scales, maybe because the data is too dense.
So my question is,
Is there a way I can perform clustering in the total dataset?
What criteria should be used to separate the time scale of the data?
I think the total data set is highly dense, which was why I tried clustering on a sample time period.
If I separate my total data into smaller time scales, how would I choose the hyperparameters for each separated dataset? If I look at the data, the distribution is similar in both the total dataset and the separated sample dataset.
I would sincerely appreciate some advice.
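For what it's worth, a minimal sketch of the usual DBSCAN workflow with the `dbscan` package (assumed installed), on toy data shaped like the BTIME/ATIME pairs above. The key point when the data is dense is that `eps` is a Euclidean radius, so scaling the coordinates first and reading `eps` off a k-nearest-neighbour distance plot usually matters more than splitting the data:

```r
library(dbscan)  # assumed; install.packages("dbscan") if needed

# Toy stand-in: two groups of boarding/arrival times (in seconds)
set.seed(1)
pts <- rbind(
  cbind(rnorm(100, 21000, 150), rnorm(100, 21800, 150)),
  cbind(rnorm(100, 25000, 150), rnorm(100, 25900, 150))
)

# Scale first: otherwise the raw-seconds units dominate the eps radius
pts_s <- scale(pts)

# kNNdistplot(pts_s, k = 4) plots sorted 4-NN distances;
# the "knee" of that curve is a common choice for eps
cl <- dbscan(pts_s, eps = 0.3, minPts = 5)
table(cl$cluster)  # cluster 0 = noise points
```

On the dense full dataset, the same recipe applies; a too-large `eps` is what merges everything into one cluster, so re-reading `eps` from `kNNdistplot()` on the full (scaled) data is the first thing to try before splitting by time period.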

Determining ARIMA frequency of non-stationary time series

I am trying to use ARIMA to forecast chemical concentrations in water tanks. I have a large dataset of around a million intervals, two minutes apart. When I use auto.arima in R I get a forecast that looks like this:
Forecast
As you can see, it evens itself out, which makes larger forecasts quite useless.
As far as I can tell from my reading, the frequency of the time series is what I need to address in the model. I simply cannot find anywhere that explains this. Frequency in this case does not mean that there are two minutes between each observation; it is something along the lines of "twelve observations per year" for monthly data, where the seasons have an effect on the data.
Here is a plot of the data, if it helps
Plot
and on a smaller scale:
Smaller scale plot
Have a look at this question and answer on Stats Stack Exchange; it is pretty much the same question, and the answer addresses it.
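To make the frequency idea concrete: with 2-minute readings, one daily cycle is 720 observations, and that 720 is the `frequency` you give `ts()`. A hedged sketch using the `forecast` package (assumed installed) on simulated data standing in for the tank concentrations; note that `auto.arima` handles only short seasonal periods well, so for a period of 720 the usual workaround is Fourier-term regressors:

```r
library(forecast)  # assumed; install.packages("forecast") if needed

# Simulated stand-in: 2 days of 2-minute readings with a daily cycle
set.seed(1)
n <- 2 * 720
vals <- 10 + 2 * sin(2 * pi * (1:n) / 720) + rnorm(n, sd = 0.3)

# frequency = observations per seasonal cycle, here one day = 720 readings
x <- ts(vals, frequency = 720)

# Capture the long daily cycle with Fourier regressors instead of
# a seasonal ARIMA (which would choke on a period of 720)
fit <- auto.arima(x, seasonal = FALSE, xreg = fourier(x, K = 2))
fc  <- forecast(fit, xreg = fourier(x, K = 2, h = 720))  # one day ahead
```

With the cycle captured by the regressors, the forecast keeps oscillating instead of flattening to the mean, which is the "evens itself out" problem described above.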

Fitting a poisson GLM in R with an aggregated count data

I have a dataset of the number of stranded turtles reported at a variety of locations along the Queensland coast of Australia. What I would like to find out is the number of stranded turtles that are NOT reported at each of these locations. In order to estimate that number, I have collected data on the frequency with which a turtle is reported to a stranding location; i.e. how often is a single turtle stranding reported more than one time at about 20 points along the coast? So I have count data which indicates the number of turtles that are reported to a stranding location one time, two times, or three or more times. Ultimately I would like to relate these data to covariates such as local population density and distance to the nearest road, in order to predict the "zero reporting" incidence for the rest of the coastal areas as well.
My data should look something like this, then:
loc<-c("A","B","C")
rep1<-c(51,24,10)
rep2<-c(4,8,3)
rep3ormore<-c(2,1,0)
pop<-c(50,1000,100)
turtle <- cbind.data.frame(loc, rep1, rep2, rep3ormore, pop)
There are other possible covariates, but I'll keep it simple for now! I think this should be able to be done using a Poisson distribution, but I'm having trouble wrapping my head around how to do it.
Additionally, in certain instances I don't have exact numbers for the turtles that have been reported, but instead I have categories; 4-6, 7-10, >10, etc. If there's a way to model that possibility, that would be great as well!
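One way to get started, sketched under strong assumptions: reshape the table to one row per location x number-of-reports and fit a Poisson GLM to the counts, with population as a covariate. Extrapolating the fitted decay back to zero reports is only a rough stand-in for the "zero reporting" incidence (that step is my assumption, not something the question establishes):

```r
# Rebuild the example data from the question
loc  <- c("A", "B", "C")
rep1 <- c(51, 24, 10); rep2 <- c(4, 8, 3); rep3ormore <- c(2, 1, 0)
pop  <- c(50, 1000, 100)
turtle <- cbind.data.frame(loc, rep1, rep2, rep3ormore, pop)

# Long format: one row per location x number-of-reports category
long <- data.frame(
  loc   = rep(turtle$loc, times = 3),
  nrep  = rep(1:3, each = nrow(turtle)),  # 3 stands in for "3 or more"
  count = c(turtle$rep1, turtle$rep2, turtle$rep3ormore),
  pop   = rep(turtle$pop, times = 3)
)

# Poisson GLM: how counts fall off with the number of reports,
# adjusted for population (log link, so effects are multiplicative)
fit <- glm(count ~ nrep + log(pop), family = poisson, data = long)
summary(fit)

# Extrapolate the fitted decay to nrep = 0 per location -- a crude
# (and strongly model-dependent) "zero reports" estimate
pred0 <- predict(fit, newdata = data.frame(nrep = 0, pop = turtle$pop),
                 type = "response")
```

For the interval-censored categories (4-6, 7-10, >10), packages for interval-censored or grouped-count regression would be the next step; the simple `glm()` call above cannot represent those directly.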

How to compare two forecasted graph for two different time series in R?

I want to compare the forecast graphs for two different time series. I have 5 years of monthly rainfall data for two different cities. I have plotted both series over those 5 years and, using the forecast package, forecast 2 more years into the future for each city. Now I want to compare the two graphs and their 2-year predictions, perhaps in terms of error.
Can anyone help me out with this?
You could start with something like this:
f1 <- forecast(series1, h=24)
f2 <- forecast(series2, h=24)
accuracy(f1)
accuracy(f2)
That will give you a lot of error measures on the historical data. Unless you have the actual data for the future periods, you can't do much more than that.
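If you are willing to hold out the last 2 years of each series, you can compare genuine out-of-sample errors rather than only in-sample fit. A sketch with simulated data standing in for one city (the same steps would be repeated for the second city and the "Test set" rows compared):

```r
library(forecast)  # assumed; install.packages("forecast") if needed

set.seed(1)
# Toy stand-in: 5 years of monthly rainfall for one city
city1 <- ts(rnorm(60, 100, 20), start = c(2015, 1), frequency = 12)

# Hold out the last 24 months so the "future" errors are real
train1 <- window(city1, end = c(2018, 12))
test1  <- window(city1, start = c(2019, 1))

f1 <- forecast(auto.arima(train1), h = 24)
accuracy(f1, test1)  # adds a "Test set" row of out-of-sample error measures
plot(f1)             # forecast with prediction intervals over the training data
```

Comparing the "Test set" RMSE or MAE between the two cities then answers "which forecast is better" on data the models never saw.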
