Prediction analysis (time series model) in UNIX

I know it's not a code-level question, but I wanted your views.
I need to perform prediction analysis at the UNIX level using a time series model (like ARIMA).
We have implemented this in R, but my work environment does not support R.
Data snapshot:
Year | Month | Data1 | Data2 | Data3
2012 | Jan   | 1     | 1     | 3
2012 | Feb   | 2     | 21    | 4
So I want to implement an algorithm that will help me find predicted values for future months.
Is there any other way of implementing time series prediction analysis in UNIX (preferably Perl/Shell)?

Since you are interested in Perl and statistics, I'm sure you are aware of PDL. There are some specific time-series statistics modules available and, of course, since Perl is involved, other CPAN modules can be used.
R is still king and has a lot of packages to choose from - and, lucky us, R and Perl play nicely together using Statistics::R. I've not tried using Statistics::R from the PDL shell, but this too may be possible to some extent.
Here's a PDL example session using Statistics::MVA::MultipleRegression:
/home/zombiepl % pdl
pdl> use Statistics::MVA::MultipleRegression;
pdl> $lol = [ [qw/745 36 66/],
[qw/895 37 68/],
[qw/442 47 64/],
[qw/440 32 53/],
[qw/1598 1 101/],];
pdl> linear_regression($lol);
The coefficients are: B[0] = -281.426985090045, B[1] = -7.61102966577879,
B[2] = 19.0102910918022.
R^2 is 0.943907302962818
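If you do end up bridging to R via Statistics::R, the R side of an ARIMA forecast can stay very small. A minimal sketch, assuming the forecast package and a hypothetical data frame mydata holding your snapshot (mydata and the column name Data1 are assumptions taken from your example):

library(forecast)
x <- ts(mydata$Data1, start = c(2012, 1), frequency = 12)  # monthly series starting Jan 2012
fit <- auto.arima(x)   # automatic ARIMA order selection
forecast(fit, h = 6)   # point forecasts and intervals for the next 6 months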
Cheers and good luck with your project.

Related

R Question: How can I create a histogram with 2 variables against each other?

Okay, let me be as clear as I can about my problem. I'm new to R, so your patience is appreciated.
I want to create a histogram using two different vectors. The first vector contains a list of models (products). These models are listed as either integers, strings, or NA. I'm not exactly sure how R is storing them (I assume they're kept as strings), or if that is a relevant issue. I also have a vector containing a list of incidents pertaining to that model. So for example, one row in the dataframe might be:
Model Incidents
XXX1991 7
How can I create a histogram where the number of incidents for each model is shown? So the histogram will look like
| =
| =
Frequency of | =
Incidents | = =
| = = =
| = = = = =
- - - - - -
Each different Model
Just to give a general idea.
I also need to be able to map everything out with standard deviation lines, so that it's easy to see which models are the least reliable. But that's not the main question here. I just don't want to do anything that will make me unable to use standard deviation in the future.
So far, all I really understand is how to make a histogram with the frequency marked, but for some reason, the x-axis is marked with numbers, not the models' names.
I don't really care if I have to download new packages to make this work, but I suspect that this already exists in basic R or ggplot2 and I'm just too dumb to figure it out.
Feel free to ask clarifying questions. Thanks.
EDIT: I forgot to mention, there are multiple rows of incidents listed under each model. So to add to my example earlier:
Model Incidents
XXX1991 7
XXX1991 1
XXX1991 19
3
5
XXX1002 9
XXX1002 4
etc . . .
I want to add up all the incidents for a model under one label.
I am assuming that you did not mean to leave the model blank in your example, so I filled in some values.
You can add up the number of incidents by model using aggregate then make the relevant plot using barplot.
## Example Data
data = read.table(text="Model Incidents
XXX1991 7
XXX1991 1
XXX1991 19
XXX1992 3
XXX1992 5
XXX1002 9
XXX1002 4",
header=TRUE)
TAB = aggregate(data$Incidents, list(data$Model), sum)
TAB
Group.1 x
1 XXX1002 13
2 XXX1991 27
3 XXX1992 8
barplot(TAB$x, names.arg=TAB$Group.1 )
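Since you mentioned ggplot2, here is a minimal sketch of the same plot done with it, reusing the aggregated TAB from above (geom_col draws bar heights directly from the data rather than from counts):

library(ggplot2)
ggplot(TAB, aes(x = Group.1, y = x)) +
  geom_col() +                              # one bar per model
  labs(x = "Model", y = "Total incidents")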

Formatting Categorical Variables for a linear regression

I am trying to build a linear regression model in R. I am working on converting a categorical variable into a numeric one for consumption by the model. I want to convert the name of a procedure to a number and have the following line of code to do so. It appears to be working successfully. I am using the car package as well.
res$Procedure <- recode(res$Procedure, "'Primary Knee'='1'; 'Primary Hip'='2'; 'Revision Knee'='3'; 'Revision Knee'='4';
'Partial Knee'='5'; 'Revision Hip'='6'; 'Partial knee'='7'; 'Bilateral Hip'='8';
'Bilateral knee'='9'; 'Bilateral Knee'='9'; 'Resurfacing Hip'='10';'Resurfacing Hip '='10'; 'Revision knee'='3'")
I am then running the model -
lg1 = glm(BloodTransfusions~ Age+Hospital+Procedure+LenthOfStay,
family=binomial(link=probit), data=res)
Then I am looking at the results of my model, and this is where things look a little odd.
summary(lg1)
| Variable   | P-Values |
| ---------- | -------- |
| Age        |          |
| Hospital   |          |
| Procedure1 |          |
| Procedure2 |          |
| Procedure3 |          |
Basically, the model is treating each of the categorical values that I converted to numbers as a distinct variable rather than as a single continuous one. Does anyone have any suggestions? Or am I going about this the wrong way? I appreciate the help!
You can dummify your data frame. This will create a binary variable out of every level of your categorical variables.
library("dummy")
res.dummy <- dummy(res)
Then use res.dummy in glm.
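For example, a rough end-to-end sketch (the column name Procedure and the glm call are taken from your question and may differ in your data; note that the full set of dummies is collinear with the intercept, so glm will alias one level):

library(dummy)
proc.dummies <- dummy(res["Procedure"])   # one 0/1 column per procedure level
res.dummy <- cbind(res[setdiff(names(res), "Procedure")], proc.dummies)
lg1 <- glm(BloodTransfusions ~ ., family = binomial(link = probit), data = res.dummy)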

Issues solving a Regression with numeric and categorical variables in R

I am very new to statistics and R in general, so my question might be a bit dumb, but since I cannot find a solution online I thought I should try asking it here.
I have a data frame, dataset, with a whole lot of different variables, very similar to the following:
Item | Size | Value | Town
----------------------------------
A | 10 | 800 | 1
B | 11 | 100 | 2
A | 17 | 900 | 2
D | 13 | 200 | 3
B | 15 | 500 | 1
C | 12 | 250 | 3
E | 14 | NA | 2
A | | 800 | 1
C | | 800 | 2
Basically, I have to try to 'guess' the Size based on the type of Item, its Value, and the Town it was sold in, so I think a regression method would be a good idea.
I tried a polynomial regression (although I'm not even sure that's correct) to see how it looks, using a call similar to the following:
summary(lm(Size~ polym(factor(Item), Value, factor(Town), degree=2, raw=TRUE), dataset))
But I get this error and warning when I try to do so:
Error in lm.fit(x, y, offset = offset, singular.ok = singular.ok, ...) :
0 (non-NA) cases
In addition: Warning message:
In Ops.factor(X, Y, ...) : ‘^’ not meaningful for factors
Can anyone tell me why this happens? More importantly, is what I've done even correct?
My second question is regarding NA values in a regression. In the dataset above, I have an NA value in the Value column. From what I understand, R ignores rows that have an NA value in one of the model's columns. But what if I have a lot of NA values? It also seems like a waste of data to eliminate an entire row over a single NA value in one column, so I was wondering if there is a better way of solving or working around this issue. Thanks!
EDIT: I just have one more question: In the regression model I have created it appears there are new 'levels' in the testing data which were not in the training data (e.g. the error says factor(Town) has new levels). What would be the right thing to do for cases such as this?
Yes, follow @RemkoDuursma's suggestion of using lm(Size ~ factor(Item) + factor(Town) + Value, ...) and look into other degrees as well (was there a reason why you chose squared?) by comparing residuals.
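For instance, a rough sketch of fitting the linear model and a quadratic-in-Value alternative and comparing them (dataset and the column names are from your question; I(Value^2) is used rather than poly() so that rows with NA are simply dropped):

fit1 <- lm(Size ~ factor(Item) + factor(Town) + Value, data = dataset)
fit2 <- lm(Size ~ factor(Item) + factor(Town) + Value + I(Value^2), data = dataset)
plot(residuals(fit1))   # look for leftover structure in the residuals
anova(fit1, fit2)       # does the quadratic term improve the fit?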
In regards to substituting NA values, you have many options (a sketch of the first two follows below):
- substitute all of them with the median value of the variable
- substitute all of them with the mean value of the variable
- substitute each with a prediction based on the values of the other variables
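For example, a minimal sketch of the first two options in base R (packages like mice automate the model-based third option; Value is the column from your question):

dataset$Value[is.na(dataset$Value)] <- median(dataset$Value, na.rm = TRUE)   # median imputation
# or: dataset$Value[is.na(dataset$Value)] <- mean(dataset$Value, na.rm = TRUE)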
Good luck, and next time you might want to check out https://stats.stackexchange.com/!

ANOVA in R using summary data

Is it possible to run an ANOVA in R with only means, standard deviations, and n-values? Here is my data frame:
q2data.mean <- c(90,85,92,100,102,106)
q2data.sd <- c(9.035613,11.479667,9.760268,7.662572,9.830258,9.111457)
q2data.n <- c(9,9,9,9,9,9)
q2data.frame <- data.frame(q2data.mean, q2data.sd, q2data.n)
I am trying to find the means square residual, so I want to take a look at the ANOVA table.
Any help would be really appreciated! :)
Here you go, using ind.oneway.second from the rpsychi package:
library(rpsychi)
with(q2data.frame, ind.oneway.second(q2data.mean,q2data.sd,q2data.n) )
#$anova.table
# SS df MS F
#Between (A) 2923.5 5 584.70 6.413
#Within 4376.4 48 91.18
#Total 7299.9 53
# etc etc
Update: the rpsychi package was archived in March 2022, but the function is still available here: http://github.com/cran/rpsychi/blob/master/R/ind.oneway.second.R (hat-tip to @jrcalabrese in the comments).
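If you would rather not depend on an archived package, the same table can be reproduced by hand from the standard one-way ANOVA identities. A minimal base-R sketch using your numbers:

m <- c(90, 85, 92, 100, 102, 106)                                      # group means
s <- c(9.035613, 11.479667, 9.760268, 7.662572, 9.830258, 9.111457)   # group sds
n <- rep(9, 6)                                                         # group sizes

k  <- length(m); N <- sum(n)
gm <- sum(n * m) / N                 # grand mean
ss.b <- sum(n * (m - gm)^2)          # between-groups SS: 2923.5
ss.w <- sum((n - 1) * s^2)           # within-groups SS: 4376.4
ms.w <- ss.w / (N - k)               # the mean square residual you want: 91.18
f.stat <- (ss.b / (k - 1)) / ms.w    # F statistic: 6.413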
As an unrelated side note, your data could do with some renaming. q2data.frame is a data.frame; there is no need to put that in its name. Likewise, there is no need for the q2data prefix on columns inside q2data.frame - surely mean would suffice. Otherwise you end up with complex code like:
q2data.frame$q2data.mean
when:
q2$mean
would give you all the info you need.

Mixing other languages with R

I use R for most of my statistical analysis. However, cleaning/processing data, especially when dealing with sizes of 1Gb+, is quite cumbersome, so I use common UNIX tools for that. But my question is: is it possible to, say, run them interactively in the middle of an R session? An example: let's say file1 is the output dataset from an R process, with 100 rows. From this, for my next R process, I need a specific subset of columns 1 and 2, file2, which can easily be extracted through cut and awk. So the workflow is something like:
Some R process => file1
cut --fields=1,2 <file1 | awk something something >file2
Next R process using file2
Apologies in advance if this is a foolish question.
Try this (adding other read.table arguments if needed):
# 1
DF <- read.table(pipe("cut --fields=1,2 < data.txt | awk something_else"))
or in pure R:
# 2
DF <- read.table("data.txt")[1:2]
or to not even read the unwanted fields assuming there are 4 fields:
# 3
DF <- read.table("data.txt", colClasses = c(NA, NA, "NULL", "NULL"))
The last line could be modified for the case where we know we want the first two fields but don't know how many other fields there are:
# 3a
n <- count.fields("data.txt")[1]
read.table("data.txt", header = TRUE, colClasses = c(NA, NA, rep("NULL", n-2)))
The sqldf package can also be used. In this example we assume a csv file, data.csv, and that the desired fields are called a and b. If it's not a csv file, then use the appropriate read.csv.sql arguments to specify a different separator, etc.:
# 4
library(sqldf)
DF <- read.csv.sql("data.csv", sql = "select a, b from file")
I think you may be looking for littler, which integrates R into Unix command-line pipelines.
Here is a simple example computing the file size distribution of /bin:
edd@max:~/svn/littler/examples$ ls -l /bin/ | awk '{print $5}' | ./fsizes.r
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
4 5736 23580 61180 55820 1965000 1
The decimal point is 5 digit(s) to the right of the |
0 | 00000000000000000000000000000000111111111111111111111111111122222222+36
1 | 01111112233459
2 | 3
3 | 15
4 |
5 |
6 |
7 |
8 |
9 | 5
10 |
11 |
12 |
13 |
14 |
15 |
16 |
17 |
18 |
19 | 6
edd@max:~/svn/littler/examples$
and all it takes for that is three lines:
edd@max:~/svn/littler/examples$ cat fsizes.r
#!/usr/bin/r -i
fsizes <- as.integer(readLines())
print(summary(fsizes))
stem(fsizes)
See ?system for how to run shell commands from within R.
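For example, a minimal sketch of your original workflow driven from inside R (the awk program is left as a placeholder, exactly as in your question):

system("cut --fields=1,2 <file1 | awk 'something something' >file2")   # shell step
DF <- read.table("file2")   # the next R process picks up file2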
Staying in the tradition of literate programming, using e.g. org-mode and org-babel will do the job perfectly:
You can combine several different programming languages in one script and execute them separately or in sequence, export the results or the code, ...
It is a little bit like Sweave, only that the code blocks can be Python, bash, R, SQL, and numerous others. Check it out: org-mode and babel, and an example using different programming languages.
Apart from that, I think org-mode and babel are the perfect way of writing even pure R scripts.
Preparing data before working with it in R is quite common, and I have a lot of scripts for Unix and Perl pre-processing, and have, at various times, maintained scripts/programs for MySQL, MongoDB, Hadoop, C, etc. for pre-processing.
However, you may get better mileage for portability if you do some kinds of pre-processing in R. You might try asking new questions focused on some of these particulars. For instance, to load large amounts of data into memory mapped files, I seem to evangelize bigmemory. Another example is found in the answers (especially JD Long's) to this question.
