Anomaly detection within a dataset in R

I would like to detect patterns within a weather dataset of around 10'000 data points. I have around 40 possible predictors (temperature, humidity etc.) which may explain good or bad weather the next day (dependent variable). Normally, I would apply classical machine learning methods like Random Forest to build and test models for classifying the whole dataset and find reliable predictors to forecast the next day's weather.
My task though is different. I want to find predictors and their parameters which "guarantee" me good or bad weather in a subset of the overall data. I am not interested in describing the whole dataset but finding the pattern of predictors (and their parameters) that give me good or bad weather indications. So I am trying to find, for example, 100 datapoints with 100% good weather if certain predictors are set to certain levels. I am not interested in the other 9'900 datapoints.
It is essentially the task of trying all combinations and calibrations of the predictors to find a subset of the overall data points which can be predicted with very high accuracy.
How would you do this systematically?
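One systematic way to approach this is rule or subgroup discovery: grow a deliberately deep classification tree and keep only the leaves (rules) that are nearly pure and cover enough observations. Below is a minimal sketch using rpart; the data frame weather, the outcome column weather_tomorrow, and the 0.99 purity / 100-point coverage thresholds are hypothetical placeholders, not taken from the question.

library(rpart)

# Hypothetical data: 'weather' holds the ~40 predictors plus a factor
# 'weather_tomorrow' with levels "good"/"bad" (placeholder names).
fit <- rpart(weather_tomorrow ~ ., data = weather, method = "class",
             control = rpart.control(cp = 0.001, minbucket = 50))

# Purity and coverage of every terminal node (leaf).
# (Assumes no rows are dropped for missing values.)
leaf_id  <- fit$where                                # leaf of each observation
leaf_tab <- table(leaf_id, weather$weather_tomorrow)
purity   <- prop.table(leaf_tab, margin = 1)
coverage <- rowSums(leaf_tab)

# Keep near-pure leaves with at least 100 observations: these are candidate
# subsets that are (almost) "guaranteed" good weather.
good_leaves <- which(purity[, "good"] >= 0.99 & coverage >= 100)

# Print the predictor/threshold path that defines each selected leaf.
frame_rows <- as.integer(names(good_leaves))         # rows of fit$frame
path.rpart(fit, nodes = as.integer(rownames(fit$frame)[frame_rows]))

Association-rule mining on discretized predictors (e.g. the arules package) is another way to run the same kind of high-confidence, partial-coverage search.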

Singularity in Linear Mixed Effects Models

Dataset Description: I use a dataset with neuropsychological (np) tests from several subjects. Every subject has more than one test in his/her follow-up, i.e. one test per year. I study the cognitive decline in these subjects. The information that I have is: Individual number (identity number), Education (years), Gender (M/F as factor), Age (years), Time from Baseline (= years after the first np test).
AIM: My aim is to measure the rate of change in their np tests, i.e. the cognitive decline per year for each of them. To do that I use Linear Mixed Effects Models (LMEM), taking into account the above parameters, and I compute the slope for each subject.
Question: When I run the possible models (combining different parameters every time), I also check them for singularity, and the result in almost all cases is TRUE. So my models are singular! If I wanted to use these models for prediction this would not be good, as it means the model overfits the data. But since I just want to find the slope for each individual, I think this is not a problem, or even an advantage, as in that case singularity offers a more precise calculation of the subjects' slopes. Do you think this reasoning is correct?
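For reference, a minimal sketch of the kind of model described here, using lme4 with hypothetical column and data frame names (np, np_score, TimeFromBaseline, Age, Education, Gender, ID):

library(lme4)

# Hypothetical data frame 'np': one row per test, with columns np_score,
# TimeFromBaseline, Age, Education, Gender and the subject identifier ID.
m <- lmer(np_score ~ TimeFromBaseline + Age + Education + Gender +
            (1 + TimeFromBaseline | ID),
          data = np)

isSingular(m)   # TRUE signals a degenerate random-effects fit, e.g. the slope
                # variance is estimated as ~0 or a correlation as +/-1

# Per-subject intercepts and slopes (fixed effects plus random effects):
coef(m)$ID[, "TimeFromBaseline"]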

Create a new dataframe to do piecewise linear regression on percentages after doing serial crosstabs in R

I am working with R. I need to identify the predictors of higher Active trial start percentage over time (StartDateMonthsYrs). I will do linear regression with Percent.Active as the dependent variable.
My original dataframe is attached, and my obtained Active trial start percentage over time (named Percent.Active) is presented here.
So, I need to assess whether federal sponsored trials, industry sponsored trials or Other sponsored trials were associated with a higher active trial start percentage over time. I have many other variables that I need to assess, but this is a sample of my data.
I am thinking of doing many crosstabs for each variable (e.g. Federal & Active, then Industry & Active, etc.) in each month (maybe with the help of lapply), then accumulating the obtained percentages in the second sheet and running the analysis based on that.
My code for linear regression is as follows:
q.lm0 <- lm(Percent.Active ~ Time.point + xyz, data = data.percentage); summary(q.lm0)
I'm a little bit confused. You write 'associated'. If you really want to look for association, then yes, a crosstab might be possible, and sufficient, as association is not the same as causation (which is only inferred from correlation if there is a theory behind it). If you are looking for correlation, and insights over time, doing a regression with the lm() function is not very useful.
If you want a regression-type analysis, there are packages in R like the plm package, which can deal with panel data, as you clearly have panel data (time points, the trial labels of interest, and repeated time points for these labels). Look at this post for info about the package: https://stackoverflow.com/questions/2804001/panel-data-with-binary-dependent-variable-in-r
I'm writing you this because your Percent.Active variable is only a binary 0/1 outcome, and I'm not sure if this is on purpose. However, even if your outcome is not binary, the plm package might help, and you will find other packages mentioned in that post.
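Purely as an illustration, a minimal plm sketch under the assumption that the data are in long format with one row per sponsor group per month; the Sponsor column is a placeholder, while Percent.Active, Time.point, StartDateMonthsYrs and data.percentage are taken from the question:

library(plm)

# Hypothetical long-format panel: one row per sponsor group per month.
pdat <- pdata.frame(data.percentage,
                    index = c("Sponsor", "StartDateMonthsYrs"))

# Fixed-effects ("within") model of the active-start percentage on time,
# controlling for time-invariant differences between sponsor groups.
fe <- plm(Percent.Active ~ Time.point, data = pdat, model = "within")
summary(fe)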

How to construct dataframe for time series data using ensemble learning methods

I am trying to predict the Bitcoin price at t+5, i.e. 5 minutes ahead, using 11 technical indicators up to time t which can all be calculated from the open, high, low, close and volume values from the Bitcoin time series (see my full data set here). As far as I know, it is not necessary to manipulate the data frame when using algorithms like regression trees, support vector machines or artificial neural networks, but when using ensemble methods like random forests (RF) and Boosting, I heard that it is necessary to re-arrange the data frame in some way, because ensemble methods draw repeated RANDOM samples from the training data, in which case the sequence of the Bitcoin time series will be ruined. So, is there a way to re-arrange the data frame in some way such that the time series will still be in chronological order every time repeated samples are drawn from the training data?
I was provided with an explanation of how to construct the data frame here and possibly here, too, but unfortunately, I didn't really understand these explanations, because I didn't see a visual example of the to-be-constructed data frame and because I wasn't able to identify the relevant line of code. So, if someone could show me how to re-arrange the data frame using an example data frame, I would be very thankful. As an example data frame, you might consider using the built-in airquality data frame in R (I think it contains time series data), the data I provided above, or any other data frame you think is best.
Many thanks!
There is no problem with resampling for ML algorithms. To capture (auto)correlation, just add columns with lagged values of the time series. E.g. in the case of a univariate time series x[t], where t is time in minutes, you add x[t - 1], x[t - 2], ..., x[t - n] as columns with lagged values. The more lags you add, the more history will be accounted for during model training.
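A minimal sketch of that re-arrangement with base R's embed(), on a made-up series standing in for the Bitcoin close price (the t+5 horizon mirrors the question; the lag count is arbitrary):

# Toy univariate series standing in for the Bitcoin close price.
x <- cumsum(rnorm(1000))

n_lags  <- 10   # how many lagged columns to add
horizon <- 5    # predict 5 steps ahead

# embed() returns a matrix whose first column is x[t] and whose remaining
# columns are x[t-1], ..., x[t-n_lags].
lagged <- embed(x, n_lags + 1)
colnames(lagged) <- c("x_t", paste0("x_t_minus_", 1:n_lags))

# Align the target: the value 'horizon' steps after each row's x[t].
target <- x[(n_lags + 1 + horizon):length(x)]
train  <- data.frame(target = target, lagged[seq_along(target), ])

# Each row is now self-contained, so random resampling (e.g. by
# randomForest or a boosting package) no longer breaks the chronology:
# library(randomForest); rf <- randomForest(target ~ ., data = train)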
You can find a very basic working example here: Prediction using neural networks
More advanced stuff with Keras is here: Time series prediction using RNN
However, just for your information, here is a special message by Mr Chollet and Mr Allaire from the above-mentioned article ;):
NOTE: Markets and machine learning
Some readers are bound to want to take the techniques we’ve introduced here and try them on the problem of forecasting the future price of securities on the stock market (or currency exchange rates, and so on). Markets have very different statistical characteristics than natural phenomena such as weather patterns. Trying to use machine learning to beat markets, when you only have access to publicly available data, is a difficult endeavor, and you’re likely to waste your time and resources with nothing to show for it.
Always remember that when it comes to markets, past performance is not a good predictor of future returns – looking in the rear-view mirror is a bad way to drive. Machine learning, on the other hand, is applicable to datasets where the past is a good predictor of the future.

Did I just do an ANCOVA or MANOVA?

I’m trying to do an ANCOVA here ...
I want to analyze the effect of EROSION FORCE and ZONATION on all the species (listed with small letters) in each POOL.STEP (ranging from 1-12/1-4), while controlling for the effect of FISH.
I’m not sure if I’m doing it right. What is the command for ANCOVA?
So far I used lm(EROSIONFORCE ~ ZONATION + FISH, data = d), which yields:
So what I see here is that both erosion force percentage (intercept?) and sublittoral zonation are significant in some way, but I’m still not sure if I’ve done an ANCOVA correctly here or if this is just an ANOVA?
In general, ANCOVA (analysis of covariance) is simply a special case of the general linear model with one categorical predictor (factor) and one continuous predictor (the "covariate"), so lm() is the right function to use.
However ... the bottom line is that you have a moderately challenging statistical problem here, and I would strongly recommend that you try to get local help (if you're working within a research group, can you consult with others in your group about appropriate methods?). I would suggest following up either on CrossValidated or r-sig-ecology@r-project.org.
By putting EROSIONFORCE on the left side of the formula, you're specifying that you want to use EROSIONFORCE as a response (dependent) variable, i.e. your model is estimating how erosion force varies across zones and for different fish numbers, and says nothing about species responses.
If you want to analyze the response of a single species to erosion and zone, controlling for fish numbers, you need something like
lm(`Acmaeidae s...` ~ EROSIONFORCE + ZONATION + FISH, data = your_data)
The lm() suggestion above would treat each species independently, i.e. you'd have to do a separate analysis for each species. If you also want to do it separately for each POOL.STEP, you're going to have to do a lot of separate analyses. There are various ways of automating this in R; the most idiomatic is probably to melt your data (see reshape2::melt or tidyr::gather) into long format and then use lmList from lme4 (see the sketch at the end of this answer).
Since you have count data with low means, i.e. lots of zeros (and a few big values), you should probably consider a Poisson or negative binomial model, and possibly even a zero-inflated/hurdle model (i.e. analyze presence/absence and the size of positive responses separately).
If you really want to analyze the joint distribution of all species (i.e. a multivariate response, which is the M in MANOVA), you're going to have to work quite a bit harder ... there are a variety of joint species distribution models by people like Pierre Legendre, David Warton and others ... I'd suggest you try starting with the mvabund package, but you might need to do some reading first.
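A minimal sketch of that melt-then-lmList route, assuming your wide data frame d has one count column per species plus the POOL.STEP, EROSIONFORCE, ZONATION and FISH columns (the "species" and "count" names below are placeholders I chose):

library(reshape2)   # melt()
library(lme4)       # lmList()

# Reshape to long format: one row per species x observation.
long <- melt(d,
             id.vars = c("POOL.STEP", "EROSIONFORCE", "ZONATION", "FISH"),
             variable.name = "species",
             value.name = "count")

# One separate linear model per species, all with the same formula.
fits <- lmList(count ~ EROSIONFORCE + ZONATION + FISH | species, data = long)
summary(fits)

# Given the zero-heavy counts, a Poisson or negative binomial GLM per species
# (glm(..., family = poisson) or MASS::glm.nb) may be more appropriate than lm().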

R linear regression with lm - how to deal with categorical variables with thousands of values (like city or zip code)?

I am using R and the linear regression function lm() to build a prediction model for business sales of retail stores. Among the many feature variables in my dataset, there are some categorical (factor) features that can take on thousands of different values, such as zip code (and/or city name). For example, there are over 6000 different zip codes for California alone; if I instead use city, there are over 400 cities.
I understand that lm() creates a dummy variable for each level of a categorical feature. The problem is that when I run lm(), the explosion of variables takes a lot of memory and a really long time. How can I avoid or handle this situation with my categorical variables?
Your intuition to move from zip codes to cities is good. However, the question is: is there a further level of spatial aggregation which will capture important spatial variation but result in the creation of fewer categorical (i.e. dummy) variables? Probably. Depending on your question, simply including a dummy for rural/suburban/urban may be all you need.
In your case, geographic region is likely a proxy meant to capture variation in socio-economic data. If so, why not include the socio-economic data directly? To do this, you could use your city/zip data to link to US census data.
However, if you really need/want to include cities, try estimating a fixed effects model. The resulting within-estimator differences out time-invariant categorical effects such as your city effects.
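A minimal sketch of that within transformation (demeaning by group before OLS); the data frame stores and the columns sales, price and city are hypothetical placeholders:

# Demean the response and the continuous predictors within each city;
# OLS on the demeaned data gives the same slope estimates as including
# city dummies, without ever building the huge dummy matrix.
stores$sales_dm <- with(stores, sales - ave(sales, city))
stores$price_dm <- with(stores, price - ave(price, city))

fe <- lm(sales_dm ~ price_dm, data = stores)
summary(fe)   # note: standard errors need a degrees-of-freedom correction
              # for the absorbed city effects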
Even if you find a way to obtain an OLS estimate with 400 cities in R, I would strongly encourage you not to use an OLS estimator; use a Ridge or Lasso estimator instead. Unless your data is massive (it can't be too big, since you're using R), the inclusion of so many dummy variables is going to dramatically reduce the degrees of freedom, which can lead to over-fitting and generally poorly estimated coefficients and standard errors.
In slightly more sophisticated language: when degrees of freedom are low, the minimization problem you solve when estimating OLS is "ill-posed", and consequently you should use regularization. For example, Ridge regression (i.e. Tikhonov regularization) would be a good solution. Remember, however, that Ridge regression is a biased estimator, and therefore you should perform bias correction.
My solutions in order of my preference:
Aggregate up to a coarser spatial area (e.g. regions instead of cities).
Fixed effects estimator.
Ridge regression (a minimal sketch follows below).
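For option 3, a minimal sketch of a Ridge fit on a sparse dummy encoding with glmnet; the data frame stores, the response sales and the high-cardinality factor zip are hypothetical placeholders:

library(glmnet)
library(Matrix)

# Hypothetical data: 'stores' with numeric predictors, a high-cardinality
# factor 'zip', and a numeric response 'sales'.
X <- sparse.model.matrix(sales ~ . - 1, data = stores)   # sparse dummy columns
y <- stores$sales

cv <- cv.glmnet(X, y, alpha = 0)    # alpha = 0 is Ridge; alpha = 1 is the Lasso
coef(cv, s = "lambda.min")          # coefficients at the CV-chosen penalty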
If you don't like my suggestions, I would suggest you pose this question on Cross Validated. IMO your question is closer to a statistics question than a programming question.
