Speed up import of fixed width format table in R

I'm importing a table from a fixed-width format .txt file in R.
This table has about 100 columns and 200,000 lines (a few sample lines below).
11111 2008 7 31 21 2008 8 1 21 3 4 6 18 4 7 0 12 0 0 0 0 0 1 0 0 0 0 0 0 0 5 0 0 7 5 0 1 0 2 0 0 0 0 0 0 2 0 0 0.0 5 14.9 0 14.9 0 14.0 0 16.5 0 14.9 0 15.6 0 15.3 0 0 15.6 0 15.6 0 17.6 0 16.1 0 17.10 0 1 97 0 0.60 0 1 15.1 0 986.6 0 1002.9 0 7 0 0.2 0
11111 2008 8 1 0 2008 8 1 0 4 7 6 18 4 98 0 1 9 0 0 0 2 0 1 0 0 0 0 0 0 0 5 0 0 7 0 0 0 1 0 2 0 260 0 1 0 0 2 0 0 0.0 5 14.4 0 14.4 0 13.0 0 14.9 0 14.9 0 15.2 0 14.6 0 0 15.2 0 14.8 0 16.1 0 15.7 0 16.10 0 1 93 0 1.20 0 1 14.1 0 986.1 0 1002.4 0 7 0 0.5 0
11111 2008 8 1 3 2008 8 1 3 5 10 6 18 4 98 0 1 3 0 0 0 1 0 0 0 0 0 0 0 0 0 5 0 0 7 5 0 1 0 2 0 200 0 1 0 0 4 0 0 0.0 5 25.8 0 7 14.4 0 26.0 0 26.0 0 19.8 0 17.0 0 0 19.8 0 15.2 0 20.1 0 20.1 0 17.10 0 1 74 0 6.00 0 1 15.1 0 984.5 0 1000.6 0 8 0 1.6 0
11111 2008 8 1 6 2008 8 1 6 6 13 6 18 4 98 0 1 7 0 6 0 1 0 0 0 1 0 0 0 0 0 1000 0 1 0 7 5 0 1 0 2 0 230 0 2 0 0 8 0 0 0.0 5 36.0 0 5 5 40.0 0 36.0 0 23.7 0 17.4 0 0 23.7 0 19.8 0 24.6 0 24.0 0 14.80 0 1 51 0 14.50 0 1 12.8 0 983.9 0 999.7 0 6 0 0.6 0
11111 2008 8 1 9 2008 8 1 9 7 16 6 18 4 96 0 0 9 0 9 0 0 0 0 0 2 0 0 0 0 0 1200 0 0 0 7 5 0 7 95 0 300 0 3 0 0 13 0 0 0.0 5 23.5 0 5 5 43.8 0 23.6 0 19.6 0 17.3 0 0 19.6 0 19.6 0 26.0 0 19.8 0 17.90 0 1 79 0 4.90 0 1 15.8 0 981.9 0 997.9 0 8 0 2.0 0
Right now I'm using the following code, which takes quite a while to load (about 1 minute):
col_width <- c(5,5,3,3,3,5,3,3,3,2,
               3,3,3,2,3,2,2,3,2,3,
               2,2,2,2,2,2,2,2,2,2,
               2,5,2,2,2,2,2,2,2,2,
               2,3,2,4,2,3,2,2,3,2,
               2,7,2,6,2,6,2,6,2,6,
               2,6,2,6,2,6,2,2,6,2,
               6,2,6,2,6,2,6,2,2,4,
               2,6,2,2,6,2,7,2,7,2,
               3,2,5,2)
df.h.tomsk <- read.fwf(path,
                       widths = col_width,
                       header = FALSE,
                       sep = "\t",
                       nrows = 200000,
                       comment.char = "",
                       buffersize = 5000)
Any suggestions to speed up the process?
For example, is there something like fread from data.table that works with the fixed-width format?

Have you tried fread from library(data.table)? Please copy-paste a few lines of your file so we can check it...
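fread itself does not parse fixed-width fields directly, but readr::read_fwf is usually much faster than base read.fwf. A minimal sketch, assuming the readr package is installed (the widths and the temp file below are shortened placeholders standing in for the real 94-column file at path):

```r
library(readr)

# Write a tiny two-column fixed-width sample to a temp file so the sketch
# is self-contained; in practice `path` points at the real .txt file.
path <- tempfile(fileext = ".txt")
writeLines(c("11111 2008",
             "11111 2009"), path)

col_width <- c(5, 5)  # placeholder widths; use the full vector in practice
df <- read_fwf(path, col_positions = fwf_widths(col_width))

stopifnot(nrow(df) == 2, ncol(df) == 2)
```

Another option worth a look is the LaF package, which memory-maps the file; either tends to be substantially faster than read.fwf on files of this size.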


Calculate weighted mean from matrix in R

I have a matrix that looks like the following. For rows 1:23, I would like to calculate the weighted mean, where the data in rows 1:23 are the weights and row 24 is the data.
1 107 33 41 22 12 4 122 44 297 123 51 16 7 9 1 1 0
10 5 2 2 1 0 3 4 6 12 3 3 0 1 1 0 0 0
11 1 3 1 0 0 0 4 2 8 3 4 0 0 0 0 0 0
12 2 1 1 0 0 0 2 1 5 6 3 1 0 0 0 0 0
13 1 0 1 0 0 0 3 1 3 5 2 2 0 1 0 0 0
14 3 0 0 0 0 0 3 1 2 3 0 1 0 0 0 0 0
15 0 0 0 0 0 0 2 0 0 1 0 1 0 0 0 0 0
16 0 0 0 0 1 0 0 0 2 0 0 0 0 0 0 0 0
17 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0
18 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
19 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0
2 80 27 37 5 6 4 97 48 242 125 44 27 7 8 8 0 2
20 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0
21 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0
22 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0
23 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0
3 47 12 33 12 6 1 63 42 200 96 45 19 6 6 9 2 0
4 45 14 21 9 4 2 54 26 130 71 36 17 8 5 1 0 2
5 42 10 14 6 3 2 45 19 89 45 26 7 4 8 2 1 0
6 17 3 12 5 2 0 18 21 51 41 19 15 5 1 1 0 0
7 16 2 6 0 0 1 14 9 37 23 17 7 3 0 3 0 0
8 9 4 4 2 1 0 7 9 30 15 8 3 3 1 1 0 1
9 12 2 3 1 1 1 6 5 14 12 5 1 2 0 0 1 0
24 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
As an example using the top two rows, there would be an additional column at the end indicating the weighted mean.
1 107 33 41 22 12 4 122 44 297 123 51 16 7 9 1 1 0 6.391011
10 5 2 2 1 0 3 4 6 12 3 3 0 1 1 0 0 0 6.232558
I'm a little new to coding so I wasn't too sure how to do it - any advice would be appreciated!
You can do:
apply(df[-nrow(df), ], 1, function(row) weighted.mean(df[nrow(df), ], row))
I'm assuming your first column is some kind of index and is not used for the weighted mean (and that the data are stored in matr_dat):
apply(matr_dat[-nrow(matr_dat), -1], 1,
function(row) weighted.mean(matr_dat[nrow(matr_dat), -1], row))
Using apply with the margin set to 1 applies the function given in the third argument to each row of the data; to calculate the weighted mean, use weighted.mean and set the weights to the values of the row.
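A self-contained toy version of the same idea (a hypothetical 3x4 matrix standing in for matr_dat; the last row holds the data values and the first column is an index):

```r
matr_dat <- rbind(c(1,  2,  1,  3),
                  c(2,  0,  4,  1),
                  c(99, 10, 20, 30))  # last row = data values; 99 is its index

# Weighted mean of the last row, weighted by each of the other rows
wm <- apply(matr_dat[-nrow(matr_dat), -1], 1,
            function(row) weighted.mean(matr_dat[nrow(matr_dat), -1], row))

# Row 1: (2*10 + 1*20 + 3*30) / (2 + 1 + 3) = 130/6
# Row 2: (0*10 + 4*20 + 1*30) / (0 + 4 + 1) = 110/5
stopifnot(all.equal(wm, c(130/6, 110/5)))
```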

Error in running glinternet: a statistical function for automatic model selection using interaction terms by Stanford professor T. Hastie

glinternet is an R package and function implementing an algorithm developed by Trevor Hastie -- the eminent Stanford professor of statistical learning -- and his former PhD student. glinternet() automatically detects interaction terms, which makes it very useful for building a model when there are many variables and the number of possible combinations is enormous.
When I run glinternet I get an error message which I reproduce here using the mtcars base R dataset:
library(data.table)   # setDT
library(dplyr)        # glimpse
library(glinternet)

data(mtcars)
setDT(mtcars)
glimpse(mtcars)
x <- as.matrix(mtcars[, -c("am"), with = FALSE])
class(x)
y <- mtcars$am
class(y)
glinter_fit <- glinternet(x, y, numLevels = 2)
Error: pCat + pCont == ncol(X) is not TRUE
Your advice will be appreciated.
It's not very clear, but you need to provide a vector that is as long as your number of predictor columns, each element indicating the number of categories for each column.
In your example, all the columns of x are continuous, so we do:
glinternet(x, y, numLevels = rep(1, ncol(x)))
Call: glinternet(X = x, Y = y, numLevels = rep(1, ncol(x)))
lambda objValue cat cont catcat contcont catcont
1 0.068900 0.1210 0 0 0 0 0
2 0.062800 0.1200 0 1 0 0 0
3 0.057100 0.1180 0 1 0 0 0
4 0.052000 0.1160 0 1 0 0 0
5 0.047300 0.1130 0 2 0 0 0
6 0.043100 0.1100 0 2 0 0 0
7 0.039200 0.1060 0 3 0 0 0
8 0.035700 0.1020 0 3 0 0 0
9 0.032500 0.0983 0 3 0 0 0
10 0.029600 0.0944 0 3 0 0 0
11 0.026900 0.0904 0 3 0 0 0
12 0.024500 0.0866 0 3 0 0 0
13 0.022300 0.0829 0 3 0 0 0
14 0.020300 0.0794 0 3 0 0 0
15 0.018500 0.0760 0 3 0 0 0
16 0.016800 0.0728 0 3 0 1 0
17 0.015300 0.0698 0 4 0 1 0
18 0.014000 0.0668 0 4 0 1 0
19 0.012700 0.0638 0 4 0 2 0
20 0.011600 0.0608 0 4 0 2 0
21 0.010500 0.0579 0 3 0 2 0
22 0.009580 0.0551 0 3 0 2 0
23 0.008720 0.0523 0 3 0 2 0
24 0.007940 0.0497 0 3 0 2 0
25 0.007230 0.0472 0 3 0 3 0
26 0.006580 0.0448 0 5 0 3 0
27 0.005990 0.0425 0 5 0 3 0
28 0.005450 0.0403 0 5 0 3 0
29 0.004960 0.0382 0 5 0 3 0
30 0.004520 0.0361 0 4 0 3 0
31 0.004110 0.0342 0 4 0 3 0
32 0.003740 0.0324 0 4 0 4 0
33 0.003410 0.0307 0 4 0 5 0
34 0.003100 0.0291 0 4 0 6 0
35 0.002820 0.0275 0 3 0 6 0
36 0.002570 0.0261 0 3 0 6 0
37 0.002340 0.0247 0 3 0 8 0
38 0.002130 0.0234 0 3 0 7 0
39 0.001940 0.0221 0 3 0 7 0
40 0.001760 0.0210 0 3 0 7 0
41 0.001610 0.0199 0 3 0 8 0
42 0.001460 0.0188 0 3 0 8 0
43 0.001330 0.0178 0 4 0 10 0
44 0.001210 0.0168 0 4 0 10 0
45 0.001100 0.0159 0 4 0 12 0
46 0.001000 0.0149 0 4 0 12 0
47 0.000914 0.0140 0 4 0 12 0
48 0.000832 0.0132 0 4 0 12 0
49 0.000757 0.0123 0 3 0 13 0
50 0.000689 0.0115 0 2 0 13 0
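More generally, when the predictors mix categorical and continuous columns, numLevels can be built programmatically from the source data frame. A sketch with a hypothetical mixed data frame df (glinternet expects categorical columns coded as integers 0, 1, ..., L-1):

```r
# Hypothetical mixed data frame standing in for the predictors
df <- data.frame(cyl = factor(c(4, 6, 8, 4)),        # categorical, 3 levels
                 mpg = c(21.0, 19.7, 15.0, 27.3))    # continuous

# Factors contribute their number of levels, numeric columns contribute 1
numLevels <- sapply(df, function(col) if (is.factor(col)) nlevels(col) else 1L)

# Recode the categorical columns to 0-based integer codes before fitting
X <- data.matrix(df)
for (j in which(numLevels > 1)) X[, j] <- X[, j] - 1

stopifnot(numLevels[["cyl"]] == 3, numLevels[["mpg"]] == 1,
          min(X[, "cyl"]) == 0)
# glinternet(X, y, numLevels = numLevels)  # then fit as usual
```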

Error in eval(expr, envir, enclos) : object 'accueil' not found

I am trying to create a random forest model in R for sentiment analysis.
Here is the code:
data = as.data.frame(as.matrix(dtm_train), stringsAsFactors = T)
>data
accueil bon depuis banque client très service conseiller agence a plus je
634 0 0 0 0 0 0 0 0 0 0 0 0
3802 0 0 0 0 0 0 0 0 0 0 0 0
16739 0 0 0 0 0 1 0 0 0 0 0 1
20992 0 0 0 0 0 0 0 0 0 0 0 0
4742 0 0 0 0 0 0 0 0 0 0 0 0
5104 0 0 1 0 0 0 0 0 0 0 0 0
6978 1 1 0 1 0 0 0 0 0 0 0 0
21630 0 2 0 0 0 0 1 0 0 0 0 0
13606 0 0 0 0 0 0 0 0 0 0 0 0
21910 0 0 0 0 0 0 0 1 0 0 0 0
8184 0 0 0 0 0 0 0 0 0 0 0 0
...
Note = train[['Note.Reco']]
> Note
[1] 9 10 9 0 10 8 10 7 10 10 5 5 8 8 2 9 8 0 10 10 8 0 8 7 7 6 9 10 8 9 5 10 10 0 5 3 2 8 8 1 7 6 0 8 9 0 5 5 8 6 8
[52] 8 7 8 9 9 9 10 5 4 5 8 8 8 9 9 10 9 8 4 10 9 8 8 8 8 5 0 9 8 7 5 3 2 10 8 10 9 0 10 6 10 8 5 9 10 1 8 9 1
reviews.test = test$reason
[1] "Pas assez service......"
[2] "Pour résidant s....."
[3] " emails, réponses ...."
[4] "Même ...."
review.test_DF = as.data.frame(reviews.test,stringsAsFactors = T)
reviews.svm = randomForest(Note~., data)
pred.svm = predict(reviews.svm, review.test_DF, type="class")
I get this error:
> pred.svm = predict(reviews.svm, review.test_DF, type="class")
Error in eval(expr, envir, enclos) : object 'accueil' not found
Can you help me resolve this problem?
Thank you in advance.
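The error usually means the test data frame lacks the term columns (accueil, bon, ...) that the model was trained on: review.test_DF is built from the raw review texts rather than from a document-term matrix with the training vocabulary. (Note also that randomForest(Note~., data) needs Note to be a column of data.) A hedged sketch of the column alignment, using hypothetical small frames standing in for the real DTMs:

```r
# Hypothetical training DTM columns and an incomplete test DTM
train_dtm <- data.frame(accueil = c(0, 1), bon = c(1, 0), banque = c(0, 0))
test_dtm  <- data.frame(bon = c(1))       # missing 'accueil' and 'banque'

# Add the missing training terms as zero columns, then reorder the
# test columns to match the training columns exactly
missing_terms <- setdiff(colnames(train_dtm), colnames(test_dtm))
test_dtm[missing_terms] <- 0
test_dtm <- test_dtm[, colnames(train_dtm)]

stopifnot(identical(colnames(test_dtm), colnames(train_dtm)))
```

Once the test frame has the same columns as the training frame, predict() can find every term the forest splits on.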

Properly Creating a Time Series in R, auto.arima Function on Daily Data

I am creating a time series of daily sales of a given item at a retailer. I have several questions outlined below that I would like some help with (data and code follow). Note that I am reading my actual data set from a csv file, where observations (dates) are in rows and the variables are in columns. Thank you ahead of time for your help, and please know that I am new to R coding.
1) It appears as if R is reading my time series by the observation number of the date (i.e., April 5th, the 5th date in the data set, has a value of 5, rather than the 297 units that were sold on that particular day). How can I remedy this?
2) I believe that my 'ts' statement is telling R that the data begins on the 91st day (April 1st) of 2013; have I coded this correctly? When I plot the data, it appears that R may be interpreting this statement differently.
3) Do I need to create a separate time series for my xreg? For example, should I create a time series for each variable, then take the union of those, and then cbind them?
4) Have I logged the variables in the correct statements, or should I do it elsewhere in the code?
require("forecast")
G<-read.csv("SingleItemToyDataset.csv")
GT<-ts(G$Units, start = c(2013, 91), frequency = 365.25)
X = cbind(log(G$Price), G$Time, as.factor(G$PromoOne), as.factor(G$PromoTwo), as.factor(G$Mon), as.factor(G$Tue), as.factor(G$Wed), as.factor(G$Thu), as.factor(G$Fri), as.factor(G$Sat))
Fit<-auto.arima(log(GT), xreg = X)
Date Day Units Price Time PromoOne PromoTwo Mon Tue Wed Thu Fri Sat
1 4/1/2013 Mon 351 5.06 1 1 0 1 0 0 0 0 0
2 4/2/2013 Tue 753 4.90 2 1 0 0 1 0 0 0 0
3 4/3/2013 Wed 133 5.32 3 1 0 0 0 1 0 0 0
4 4/4/2013 Thu 150 5.14 4 1 0 0 0 0 1 0 0
5 4/5/2013 Fri 297 5.00 5 1 0 0 0 0 0 1 0
6 4/6/2013 Sat 688 5.27 6 1 0 0 0 0 0 0 1
7 4/7/2013 Sun 1,160 5.06 7 1 0 0 0 0 0 0 0
8 4/8/2013 Mon 613 5.07 8 1 0 1 0 0 0 0 0
9 4/9/2013 Tue 430 5.07 9 1 0 0 1 0 0 0 0
10 4/10/2013 Wed 400 5.03 10 1 0 0 0 1 0 0 0
11 4/11/2013 Thu 1,530 4.97 11 1 0 0 0 0 1 0 0
12 4/12/2013 Fri 2,119 5.00 12 0 1 0 0 0 0 1 0
13 4/13/2013 Sat 1,094 5.09 13 0 1 0 0 0 0 0 1
14 4/14/2013 Sun 736 5.02 14 1 0 0 0 0 0 0 0
15 4/15/2013 Mon 518 5.10 15 1 0 1 0 0 0 0 0
16 4/16/2013 Tue 485 5.02 16 1 0 0 1 0 0 0 0
17 4/17/2013 Wed 472 5.05 17 1 0 0 0 1 0 0 0
18 4/18/2013 Thu 406 5.03 18 1 0 0 0 0 1 0 0
19 4/19/2013 Fri 564 5.00 19 1 0 0 0 0 0 1 0
20 4/20/2013 Sat 475 5.09 20 1 0 0 0 0 0 0 1
21 4/21/2013 Sun 621 5.04 21 1 0 0 0 0 0 0 0
22 4/22/2013 Mon 714 5.02 22 1 0 1 0 0 0 0 0
23 4/23/2013 Tue 1,217 5.32 23 0 0 0 1 0 0 0 0
24 4/24/2013 Wed 1,253 5.45 24 0 0 0 0 1 0 0 0
25 4/25/2013 Thu 1,169 5.06 25 0 0 0 0 0 1 0 0
26 4/26/2013 Fri 1,216 5.01 26 0 0 0 0 0 0 1 0
27 4/27/2013 Sat 1,127 5.02 27 0 0 0 0 0 0 0 1
28 4/28/2013 Sun 693 5.04 28 1 0 0 0 0 0 0 0
29 4/29/2013 Mon 388 5.01 29 1 0 1 0 0 0 0 0
30 4/30/2013 Tue 305 5.01 30 1 0 0 1 0 0 0 0
31 5/1/2013 Wed 207 5.03 31 1 0 0 0 1 0 0 0
32 5/2/2013 Thu 612 4.97 32 1 0 0 0 0 1 0 0
33 5/3/2013 Fri 671 5.01 33 1 0 0 0 0 0 1 0
34 5/4/2013 Sat 1,151 5.04 34 1 0 0 0 0 0 0 1
35 5/5/2013 Sun 2,578 5.00 35 1 0 0 0 0 0 0 0
36 5/6/2013 Mon 2,364 5.01 36 1 0 1 0 0 0 0 0
37 5/7/2013 Tue 423 5.03 37 1 0 0 1 0 0 0 0
38 5/8/2013 Wed 388 5.04 38 1 0 0 0 1 0 0 0
39 5/9/2013 Thu 1,417 4.70 39 0 1 0 0 0 1 0 0
40 5/10/2013 Fri 1,607 4.59 40 0 1 0 0 0 0 1 0
41 5/11/2013 Sat 1,217 4.86 41 1 0 0 0 0 0 0 1
42 5/12/2013 Sun 545 5.12 42 1 0 0 0 0 0 0 0
43 5/13/2013 Mon 461 5.01 43 1 0 1 0 0 0 0 0
44 5/14/2013 Tue 358 4.97 44 1 0 0 1 0 0 0 0
45 5/15/2013 Wed 310 5.00 45 1 0 0 0 1 0 0 0
46 5/16/2013 Thu 925 4.63 46 1 0 0 0 0 1 0 0
47 5/17/2013 Fri 266 4.99 47 1 0 0 0 0 0 1 0
48 5/18/2013 Sat 183 5.15 48 0 0 0 0 0 0 0 1
49 5/19/2013 Sun 363 5.20 49 0 0 0 0 0 0 0 0
50 5/20/2013 Mon 5,469 4.99 50 1 0 1 0 0 0 0 0
51 5/21/2013 Tue 647 4.81 51 1 0 0 1 0 0 0 0
52 5/22/2013 Wed 421 4.97 52 1 0 0 0 1 0 0 0
53 5/23/2013 Thu 353 4.93 53 1 0 0 0 0 1 0 0
54 5/24/2013 Fri 375 4.95 54 1 0 0 0 0 0 1 0
55 5/25/2013 Sat 575 4.88 55 1 0 0 0 0 0 0 1
56 5/26/2013 Sun 707 4.92 56 0 0 0 0 0 0 0 0
57 5/27/2013 Mon 533 4.89 57 0 0 1 0 0 0 0 0
58 5/28/2013 Tue 641 4.66 58 0 0 0 1 0 0 0 0
59 5/29/2013 Wed 264 4.85 59 0 0 0 0 1 0 0 0
60 5/30/2013 Thu 186 5.74 60 1 0 0 0 0 1 0 0
61 5/31/2013 Fri 207 6.40 61 1 0 0 0 0 0 1 0
1) I'm not sure exactly what you mean here, but perhaps you are confused by the row names (numbers in this case) that R has assigned to your data frame G. Assuming the data frame printed below your code is what G looks like, it looks to me like G$Units does indeed hold the data you're interested in modeling (note, however, that R is probably treating G$Units as a character class because of the commas in the numbers; you should remove those from your .csv file).
2) For modeling with auto.arima() (or arima() in base R), the univariate series does not need to be an actual ts object, so you don't really need to create GT. That said, the start and freq arguments to ts() can be a bit odd to figure out. In this case, you need to set freq=365 even though a year is technically a bit longer (i.e., GT <- ts(G$Units, start=c(2013,91), freq=365)).
3) No, you do not need to create a separate time series for xreg. In fact, you don't need to create factors for your promos/days because they are already coded as 0/1. Thus, something like X <- G[,-c(1,2,3,5)]; X$Price <- log(X$Price) would suffice. (Aside: why are you using Time as a covariate? There doesn't appear to be any trend in the data.)
4) Yes, log-transforming the (co)variates where you did is fine, but I'm curious as to why the price covariate needs to be log-transformed?
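Putting points 2) and 3) together, a minimal sketch (a hypothetical two-row data frame stands in for the real csv; the auto.arima call is left commented because it needs the full series):

```r
# Hypothetical stand-in for G <- read.csv("SingleItemToyDataset.csv")
G <- data.frame(Date = c("4/1/2013", "4/2/2013"), Day = c("Mon", "Tue"),
                Units = c(351, 753), Price = c(5.06, 4.90), Time = 1:2,
                PromoOne = c(1, 1), PromoTwo = c(0, 0),
                Mon = c(1, 0), Tue = c(0, 1), Wed = c(0, 0), Thu = c(0, 0),
                Fri = c(0, 0), Sat = c(0, 0))

GT <- ts(G$Units, start = c(2013, 91), freq = 365)

# Drop Date, Day, Units, Time; log-transform Price; keep the 0/1 dummies as-is
X <- G[, -c(1, 2, 3, 5)]
X$Price <- log(X$Price)

stopifnot(!"Time" %in% colnames(X), all.equal(X$Price[1], log(5.06)))
# Fit <- forecast::auto.arima(log(GT), xreg = as.matrix(X))
```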

Extend table by adding missing values [duplicate]

This question already has an answer here:
Include levels of zero count in result of table()
(1 answer)
Closed 8 years ago.
I need to extend a table in R.
result 3 4 5 6 7 8
5 6 29 295 104 6 0
6 1 9 112 238 66 5
7 0 0 5 29 40 6
Should be extended to
result 1 2 3 4 5 6 7 8 9 10
1 0 0 0 0 0 0 0 0 0 0
2 0 0 0 0 0 0 0 0 0 0
3 0 0 0 0 0 0 0 0 0 0
4 0 0 0 0 0 0 0 0 0 0
5 0 0 6 29 295 104 6 0 0 0
6 0 0 1 9 112 238 66 5 0 0
7 0 0 0 0 5 29 40 6 0 0
8 0 0 0 0 0 0 0 0 0 0
9 0 0 0 0 0 0 0 0 0 0
10 0 0 0 0 0 0 0 0 0 0
So I need to add zeros for the missing values. Alternatively, output as a 10x10 matrix with the same data would also be satisfactory.
EDIT:
table(factor(x, levels = 1:10), factor(y, levels = 1:10)) worked perfectly.
As mentioned in the comments, factoring works perfectly.
table(factor(x, levels = 1:10), factor(y, levels = 1:10))
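A self-contained toy illustration (hypothetical x and y vectors; every level in 1:10 gets a row/column even when its count is zero):

```r
x <- c(5, 5, 6, 7)   # row variable ('result')
y <- c(3, 3, 4, 8)   # column variable
tab <- table(factor(x, levels = 1:10), factor(y, levels = 1:10))

stopifnot(all(dim(tab) == c(10, 10)),  # full 10x10 grid
          tab["5", "3"] == 2,          # observed pair counted
          sum(tab) == 4)               # all other cells are zero
```

as.matrix(tab) then gives the 10x10 matrix form mentioned in the alternative scenario.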
