How can I grab data from this web site? - web-scraping

There is a site here (http://www.tsetmc.com/Loader.aspx?ParTree=151311&i=46741025610365786#) where each field of this table (marked by the yellow squares) shows information about one specific day. What I need to do is read only the حجم ("volume") row of each field (what I marked with the red squares in the following photos; you have to go to the tab I mentioned in the first photo to see the second photo):
And write them (i.e. store them on my computer) in a text file like this:
6.832 M (14%) , 40.475 M (85%), 248,000 (0%), 47.059 M (99%)
605,000 (3%), 15.277 M (96%), 478,714 (3%), 15.404 M (96%)
8.102 M (42%), 10.751 M (57%), 9.599 M (50%), 9.253 M (49%)
215,937 (2%), 9.417 M (97%), 1.115 M (11%), 8.518 M (88%)
3.351 M (15%), 18.284 M (84%), 5.987 M (27%), 15.647 M (72%)
But I don't know whether this is possible. If so, how can I do it in the simplest way? (I use Windows 10.)
EDIT:
I did step 3 successfully and ran the node extract.js command in step 4. I got this result:
[ 'حجم, 47.059 M (99%), 248,000 (0%), 40.475 M (85%), 6.832 M (14%)',
'حجم, 15.404 M (96%), 478,714 (3%), 15.277 M (96%), 605,000 (3%)',
'حجم, 9.253 M (49%), 9.599 M (50%), 10.751 M (57%), 8.102 M (42%)',
'حجم, 8.518 M (88%), 1.115 M (11%), 9.417 M (97%), 215,937 (2%)',
'حجم, 15.647 M (72%), 5.987 M (27%), 18.284 M (84%), 3.351 M (15%)',
'حجم, 21.848 M (93%), 1.501 M (6%), 21.648 M (92%), 1.701 M (7%)',
'حجم, 30.845 M (95%), 1.3 M (4%), 30.663 M (95%), 1.482 M (4%)',
'حجم, 9.914 M (64%), 5.474 M (35%), 9.938 M (64%), 5.45 M (35%)',
'حجم, 10.775 M (97%), 250,000 (2%), 10.995 M (99%), 30,000 (0%)',
'حجم, 21.328 M (91%), 2.027 M (8%), 22.315 M (95%), 1.04 M (4%)',
'حجم, 19.588 M (92%), 1.54 M (7%), 21.048 M (99%), 80,000 (0%)',
'حجم, 12.554 M (96%), 418,000 (3%), 11.504 M (88%), 1.468 M (11%)',
'حجم, 14.98 M (92%), 1.299 M (7%), 16.135 M (99%), 144,008 (0%)',
'حجم, 10.878 M (95%), 502,040 (4%), 11.378 M (99%), 2,040 (0%)',
'حجم, 10.012 M (97%), 275,000 (2%), 10.287 M (100%), 0 (0%)',
'حجم, 11.992 M (95%), 500,000 (4%), 11.707 M (93%), 785,244 (6%)',
'حجم, 16.492 M (95%), 820,000 (4%), 17.056 M (98%), 256,241 (1%)',
'حجم, 19.639 M (98%), 378,384 (1%), 20.017 M (100%), 0 (0%)',
'حجم, 13.781 M (95%), 639,609 (4%), 14.161 M (98%), 260,000 (1%)',
'حجم, 31.797 M (99%), 300,507 (0%), 26.089 M (81%), 6.009 M (18%)',
'حجم, 18.159 M (99%), 30,391 (0%), 15.914 M (87%), 2.275 M (12%)',
'حجم, 21.271 M (95%), 1.01 M (4%), 21.501 M (96%), 780,000 (3%)',
'حجم, 17.322 M (62%), 10.615 M (37%), 19.437 M (69%), 8.5 M (30%)',
'حجم, 37.817 M (97%), 1.03 M (2%), 34.125 M (87%), 4.722 M (12%)',
'حجم, 55.396 M (99%), 211,000 (0%), 52.507 M (94%), 3.1 M (5%)',
'حجم, 23.141 M (98%), 420,000 (1%), 23.461 M (99%), 100,000 (0%)',
'حجم, 46.215 M (82%), 9.919 M (17%), 49.764 M (88%), 6.371 M (11%)',
'حجم, 1.26 M (100%), 0 (0%), 1.26 M (100%), 0 (0%)',
'حجم, 35.89 M (99%), 251,000 (0%), 35.921 M (99%), 220,000 (0%)',
'حجم, 48.509 M (88%), 6.349 M (11%), 54.052 M (98%), 806,362 (1%)',
'حجم, 41.018 M (91%), 4.006 M (8%), 41.564 M (92%), 3.46 M (7%)',
'حجم, 40.02 M (99%), 100,000 (0%), 39.22 M (97%), 900,000 (2%)',
'حجم, 36.974 M (99%), 30,000 (0%), 36.549 M (98%), 455,500 (1%)',
'حجم, 35.739 M (99%), 230,000 (0%), 33.104 M (92%), 2.866 M (7%)',
'حجم, 19.627 M (100%), 0 (0%), 18.877 M (96%), 750,000 (3%)',
'حجم, 19.603 M (81%), 4.379 M (18%), 23.982 M (100%), 0 (0%)',
'حجم, 10.186 M (97%), 250,000 (2%), 10.436 M (100%), 0 (0%)',
'حجم, 15.414 M (98%), 250,500 (1%), 15.465 M (98%), 200,000 (1%)',
'حجم, 21.571 M (97%), 665,000 (2%), 22.236 M (100%), 0 (0%)',
'حجم, 15.537 M (98%), 250,000 (1%), 15.787 M (100%), 0 (0%)',
'حجم, 21.422 M (98%), 221,004 (1%), 21.243 M (98%), 400,000 (1%)',
'حجم, 30.662 M (92%), 2.375 M (7%), 33.036 M (100%), 0 (0%)',
'حجم, 39.287 M (98%), 455,000 (1%), 39.742 M (100%), 0 (0%)',
'حجم, 53.141 M (89%), 6.11 M (10%), 59.131 M (99%), 120,000 (0%)',
'حجم, 23.587 M (98%), 255,000 (1%), 23.842 M (100%), 0 (0%)',
'حجم, 17.043 M (98%), 255,000 (1%), 17.298 M (100%), 0 (0%)',
'حجم, 33.51 M (96%), 1.25 M (3%), 34.75 M (99%), 10,000 (0%)',
'حجم, 36.408 M (99%), 15,000 (0%), 28.248 M (77%), 8.175 M (22%)',
'حجم, 32.367 M (98%), 480,000 (1%), 31.535 M (96%), 1.312 M (3%)',
'حجم, 54.773 M (95%), 2.68 M (4%), 43.936 M (76%), 13.517 M (23%)',
'حجم, 58.955 M (95%), 2.54 M (4%), 41.234 M (67%), 20.262 M (32%)',
'حجم, 45.222 M (99%), 15,000 (0%), 40.215 M (88%), 5.023 M (11%)',
'حجم, 43.487 M (97%), 1.225 M (2%), 43.902 M (98%), 810,008 (1%)',
'حجم, 35.46 M (91%), 3.18 M (8%), 38.33 M (99%), 310,000 (0%)',
'حجم, 39.42 M (99%), 90,927 (0%), 36.722 M (92%), 2.789 M (7%)',
'حجم, 41.024 M (99%), 312,000 (0%), 35.814 M (86%), 5.522 M (13%)',
'حجم, 32.718 M (99%), 277,978 (0%), 30.995 M (93%), 2.001 M (6%)',
'حجم, 1.12 M (100%), 0 (0%), 1.12 M (100%), 0 (0%)',
'حجم, 2.015 M (86%), 325,000 (13%), 2.34 M (100%), 0 (0%)',
'حجم, 40.402 M (95%), 2.109 M (4%), 42.511 M (100%), 0 (0%)',
'حجم, 41.726 M (86%), 6.372 M (13%), 48.098 M (100%), 0 (0%)',
'حجم, 39.444 M (97%), 1.14 M (2%), 39.551 M (97%), 1.033 M (2%)',
'حجم, 4.14 M (100%), 0 (0%), 3.14 M (75%), 1,000,000 (24%)',
'حجم, 43.447 M (96%), 1.743 M (3%), 44.292 M (98%), 898,000 (1%)',
'حجم, 56.023 M (98%), 864,338 (1%), 52.627 M (92%), 4.26 M (7%)',
'حجم, 14.062 M (99%), 8,008 (0%), 12.055 M (85%), 2.015 M (14%)',
'حجم, 56.557 M (84%), 10.413 M (15%), 66.47 M (99%), 500,000 (0%)',
'حجم, 7.971 M (69%), 3.481 M (30%), 11.452 M (100%), 0 (0%)',
'حجم, 38.85 M (86%), 5.864 M (13%), 44.494 M (99%), 220,000 (0%)',
'حجم, 53.151 M (99%), 105,000 (0%), 51.039 M (95%), 2.217 M (4%)',
'حجم, 51.861 M (79%), 13.352 M (20%), 64.603 M (99%), 610,000 (0%)',
'حجم, 2.025 M (80%), 500,000 (19%), 2.525 M (100%), 0 (0%)',
'حجم, 67.428 M (95%), 3.294 M (4%), 68.538 M (96%), 2.184 M (3%)',
'حجم, 52.373 M (87%), 7.211 M (12%), 58.408 M (98%), 1.176 M (1%)',
'حجم, 12.073 M (80%), 3.01 M (19%), 14.583 M (96%), 500,000 (3%)',
'حجم, 47.369 M (99%), 424,000 (0%), 30.168 M (63%), 17.626 M (36%)',
'حجم, 3.401 M (100%), 0 (0%), 1.039 M (30%), 2.363 M (69%)',
'حجم, 52.213 M (99%), 247,000 (0%), 41.872 M (79%), 10.588 M (20%)',
'حجم, 73.585 M (98%), 1.356 M (1%), 38.911 M (51%), 36.029 M (48%)',
'حجم, 67.943 M (97%), 1.622 M (2%), 35.571 M (51%), 33.995 M (48%)',
'حجم, 2.653 M (100%), 0 (0%), 2.003 M (75%), 650,000 (24%)',
'حجم, 32.055 M (99%), 18,408 (0%), 24.301 M (75%), 7.772 M (24%)',
'حجم, 16.989 M (98%), 209,000 (1%), 9.598 M (55%), 7.6 M (44%)',
'حجم, 34.906 M (95%), 1.64 M (4%), 21.129 M (57%), 15.417 M (42%)',
'حجم, 14.669 M (98%), 150,000 (1%), 7.852 M (52%), 6.967 M (47%)',
'حجم, 23.542 M (98%), 289,600 (1%), 23.102 M (96%), 729,782 (3%)',
'حجم, 27.87 M (98%), 450,000 (1%), 21.461 M (75%), 6.859 M (24%)',
'حجم, 48.785 M (98%), 500,000 (1%), 30.683 M (62%), 18.603 M (37%)',
'حجم, 22.839 M (93%), 1.518 M (6%), 16.242 M (66%), 8.115 M (33%)',
'حجم, 15.683 M (96%), 631,500 (3%), 13.316 M (81%), 2.999 M (18%)',
'حجم, 15.715 M (96%), 630,000 (3%), 15.436 M (94%), 908,399 (5%)',
'حجم, 11.776 M (90%), 1.305 M (9%), 13.081 M (100%), 0 (0%)',
'حجم, 12.492 M (85%), 2.057 M (14%), 14.149 M (97%), 400,000 (2%)',
'حجم, 11.909 M (100%), 0 (0%), 11.818 M (99%), 91,008 (0%)',
'حجم, 21.404 M (99%), 140,000 (0%), 17.8 M (82%), 3.744 M (17%)',
'حجم, 22.115 M (89%), 2.718 M (10%), 21.969 M (88%), 2.864 M (11%)',
'حجم, 23.146 M (97%), 637,396 (2%), 21.881 M (92%), 1.902 M (7%)',
'حجم, 35.986 M (94%), 1.92 M (5%), 25.749 M (67%), 12.156 M (32%)',
'حجم, 16.064 M (93%), 1.179 M (6%), 17.104 M (99%), 139,467 (0%)',
'حجم, 19.314 M (85%), 3.284 M (14%), 22.408 M (99%), 189,500 (0%)',
... 84 more items ]
(node:13916) UnhandledPromiseRejectionWarning: TypeError [ERR_INVALID_CALLBACK]: Callback must be a function
at maybeCallback (fs.js:129:9)
at Object.writeFile (fs.js:1159:14)
at C:\Users\m\Desktop\GetData\extract.js:21:14
at process._tickCallback (internal/process/next_tick.js:68:7)
(node:13916) UnhandledPromiseRejectionWarning: Unhandled promise rejection. This error originated either by throwing inside of an async function without a catch block, or by rejecting a promise which was not handled with .catch(). (rejection id: 1)
(node:13916) [DEP0018] DeprecationWarning: Unhandled promise rejections are deprecated. In the future, promise rejections that are not handled will terminate the Node.js process
with a non-zero exit code.
But there is no store.txt file!

1. Download and install Node.js & npm from here - https://www.npmjs.com/get-npm
2. Create a folder anywhere on your PC, create a file extract.js in it, and paste the following code.
Code
3. Open a command prompt in that folder and run "npm install puppeteer" (it may take several minutes to complete).
4. Run "node extract.js".
After it runs successfully you will have a "store.txt" file in the same folder that contains your expected result.

Related

Calculate weighted average of four nearest grid points

I have a data frame that looks like this:
Teff logg M_div_H U B V R I J H K L Lprime M
1: 2000 4.00 -0.1 -13.443 -11.390 -7.895 -4.464 -1.831 1.666 3.511 2.701 4.345 4.765 5.680
2: 2000 4.50 -0.1 -13.402 -11.416 -7.896 -4.454 -1.794 1.664 3.503 2.728 4.352 4.772 5.687
3: 2000 5.00 -0.1 -13.358 -11.428 -7.888 -4.431 -1.738 1.664 3.488 2.753 4.361 4.779 5.685
4: 2000 5.50 -0.1 -13.220 -11.079 -7.377 -4.136 -1.483 1.656 3.418 2.759 4.355 4.753 5.638
5: 2200 3.50 -0.1 -11.866 -9.557 -6.378 -3.612 -1.185 1.892 3.294 2.608 3.929 4.289 4.842
6: 2200 4.50 -0.1 -11.845 -9.643 -6.348 -3.589 -1.132 1.874 3.310 2.648 3.947 4.305 4.939
7: 2200 5.50 -0.1 -11.655 -9.615 -6.279 -3.508 -0.997 1.886 3.279 2.709 3.964 4.314 4.928
8: 2500 -1.02 -0.1 -7.410 -7.624 -6.204 -3.854 -1.533 1.884 3.320 2.873 3.598 3.964 5.579
9: 2500 -0.70 -0.1 -7.008 -7.222 -5.818 -3.618 -1.338 1.905 3.266 2.868 3.502 3.877 5.417
10: 2500 -0.29 -0.1 -6.526 -6.740 -5.357 -3.421 -1.215 1.927 3.216 2.870 3.396 3.781 5.247
11: 2500 5.50 -0.1 -9.518 -7.575 -5.010 -2.756 -0.511 1.959 3.057 2.642 3.472 3.756 4.265
12: 2800 -1.02 -0.1 -7.479 -7.386 -5.941 -3.716 -1.432 1.824 3.259 2.812 3.567 3.784 5.333
13: 2800 -0.70 -0.1 -7.125 -7.032 -5.596 -3.477 -1.231 1.822 3.218 2.813 3.479 3.717 5.229
14: 2800 -0.29 -0.1 -6.673 -6.580 -5.154 -3.166 -0.974 1.816 3.163 2.812 3.364 3.628 5.093
15: 2800 3.50 -0.1 -8.113 -6.258 -4.103 -2.209 -0.360 1.957 2.872 2.517 3.219 3.427 4.026
16: 2800 4.00 -0.1 -7.992 -6.099 -3.937 -2.076 -0.230 1.907 2.869 2.480 3.227 3.424 4.075
17: 2800 4.50 -0.1 -7.815 -6.051 -4.067 -2.176 -0.228 1.920 2.877 2.503 3.212 3.428 4.000
18: 2800 5.00 -0.1 -7.746 -6.018 -4.031 -2.144 -0.176 1.907 2.883 2.512 3.216 3.430 4.023
19: 3000 -0.70 -0.1 -7.396 -6.995 -5.605 -3.554 -1.293 1.787 3.172 2.759 3.474 3.588 5.052
20: 3000 -0.29 -0.1 -6.966 -6.565 -5.179 -3.249 -1.035 1.772 3.136 2.764 3.388 3.533 4.978
Notice, for example, how every V value has a unique Teff, logg combination.
Now, let's say I have two values:
input_Teff = 2300
input_log_g = 3.86
If we imagine all the (Teff, logg) combinations as grid points, for a given input point, I would like to find the four points closest to that input point.
point 1 (Teff1, logg1)
point 2 (Teff2, logg2)
point 3 (Teff3, logg3)
point 4 (Teff4, logg4)
then, calculate "distances" between my input point and those points through Pythagoras (in this example, four distances),
(Teff1, logg_1) -> d1
(Teff2, logg_2) -> d2
(Teff3, logg_3) -> d3
(Teff4, logg_4) -> d4 # NOTE: Teff and logg are different scales
next, get the V values in the rows of these points (in this example),
(Teff1, logg_1) -> V1
(Teff2, logg_2) -> V2
(Teff3, logg_3) -> V3
(Teff4, logg_4) -> V4
And finally do a weighted average calculation
V = (d1V1+d2V2+d3V3+d4V4)/(d1+d2+d3+d4)
What would be a good way to do this in R?
Edit: https://www.dropbox.com/s/prbceabxmd25etx/lcb98cor.dat?dl=0
Let's say your data frame is named teff_df, and let's assume this is the way to calculate the Pythagoras distances (update: find the nearest distance between the input and every (Teff, logg) point). Note: I use min-max normalization to rescale the Teff and logg values:
min_max_norm <- function(x) {
  return((x - min(x)) / (max(x) - min(x)))  # rescale to [0, 1]
}
phytagoras <- function(a, b) {
  return(a**2 + b**2)  # sum of squares of the two rescaled coordinates
}
First, rescale the Teff and logg values of the input and all available grid points to the range 0-1:
teff_with_input <- c(teff_df$Teff, 2300)
logg_with_input <- c(teff_df$logg, 3.86)
teff_rescale <- min_max_norm(teff_with_input)
logg_rescale <- min_max_norm(logg_with_input)
teff_input_rescale <- teff_rescale[length(teff_rescale)]
logg_input_rescale <- logg_rescale[length(logg_rescale)]
teff_rescale <- teff_rescale[-length(teff_rescale)]
logg_rescale <- logg_rescale[-length(logg_rescale)]
Second, calculate the differences between the input and all grid points, both already transformed to Pythagoras values:
input_distance <- phytagoras(teff_input_rescale, logg_input_rescale)
my_phyta_dist <- phytagoras(teff_rescale, logg_rescale)
my_v_point <- teff_df[,which(colnames(teff_df)=="V")]
diff_input <- my_phyta_dist - input_distance
teff_df[order(abs(diff_input))[1:4],] # this displays which rows are nearest, based on the Pythagoras distance calculation
Third, extract the 4 nearest points, based on the minimum differences between the input and the (Teff, logg) combination grid points:
nearest_4rows <- as.numeric(rownames(teff_df[order(abs(diff_input))[1:4],]))
nearest_4phyta_dist <- my_phyta_dist[nearest_4rows]
nearest_4v_point <- my_v_point[nearest_4rows]
and then finally, apply the weighted-average formula V = (d1*V1 + d2*V2 + d3*V3 + d4*V4) / (d1 + d2 + d3 + d4):
product_dist <- nearest_4phyta_dist * nearest_4v_point
weighted_average <- sum(product_dist) / sum(nearest_4phyta_dist)
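For reference, here is a compact sketch of the same steps in one function, using the Euclidean distance between the rescaled points and the question's weighted-average formula exactly as written; teff_df is assumed to be the data frame above, and the name weighted_V is only illustrative:
# Sketch: distance-weighted V from the 4 nearest (Teff, logg) grid points
weighted_V <- function(teff_df, input_Teff, input_logg) {
  rng <- function(x) (x - min(x)) / (max(x) - min(x))  # rescale to [0, 1]
  t_all <- rng(c(teff_df$Teff, input_Teff))
  g_all <- rng(c(teff_df$logg, input_logg))
  n <- nrow(teff_df)
  # Euclidean distance from the input point to every grid point
  d <- sqrt((t_all[1:n] - t_all[n + 1])^2 + (g_all[1:n] - g_all[n + 1])^2)
  idx <- order(d)[1:4]                        # the four closest rows
  sum(d[idx] * teff_df$V[idx]) / sum(d[idx])  # V = (d1*V1 + ...)/(d1 + ...)
}
# weighted_V(teff_df, 2300, 3.86)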

Making a matrix from lsmeans contrasts return

To create the data frame:
library(lme4)     # for glmer()
library(lsmeans)
num <- sample(1:25, 20)
x <- data.frame("Day_eclosion" = num,
                "Developmental" = c("AP", "MA", "JU", "L"),
                "Replicate" = 1:5)
model <- glmer(Day_eclosion ~ Developmental + (1 | Replicate),
               family = "poisson", data = x)
I get this return from:
a <- lsmeans(model, pairwise~Developmental, adjust = "tukey")
a$contrasts
contrast estimate SE df z.ratio p.value
AP - JU 0.2051 0.0168 Inf 12.172 <.0001
AP - L 0.3009 0.0212 Inf 14.164 <.0001
AP - MA 0.3889 0.0209 Inf 18.631 <.0001
JU - L 0.0958 0.0182 Inf 5.265 <.0001
JU - MA 0.1839 0.0177 Inf 10.387 <.0001
L - MA 0.0881 0.0222 Inf 3.964 0.0004
I am looking for a simple way to turn this output (just p values) into:
AP MA JU L
AP - <.0001 <.0001 <.0001
MA - - <.0001 0.0004
JU - - - <.0001
L - - -
I have about 20 sets of these that I need to turn into tables, so the simpler and more general the better.
Bonus points if the output is tab-delimited, etc., so that I can easily paste it into Word/Excel.
Thanks!
Here's a function that works...
pvmat = function(emm, ...) {
    emm = update(emm, by = NULL)  # need to work harder otherwise
    pv = test(pairs(emm, reverse = TRUE, ...))$p.value
    fmtpv = sprintf("%6.4f", pv)
    fmtpv[pv < 0.0001] = "<.0001"
    lbls = do.call(paste, emm@grid[emm@misc$pri.vars])
    n = length(lbls)
    mat = matrix("", nrow = n, ncol = n, dimnames = list(lbls, lbls))
    mat[upper.tri(mat)] = fmtpv
    idx = seq_len(n - 1)
    mat[idx, 1 + idx]  # trim off last row and 1st col
}
Illustration:
require(emmeans)
> warp.lm = lm(breaks ~ wool * tension, data = warpbreaks)
> warp.emm = emmeans(warp.lm, ~ wool * tension)
> warp.emm
wool tension emmean SE df lower.CL upper.CL
A L 44.6 3.65 48 37.2 51.9
B L 28.2 3.65 48 20.9 35.6
A M 24.0 3.65 48 16.7 31.3
B M 28.8 3.65 48 21.4 36.1
A H 24.6 3.65 48 17.2 31.9
B H 18.8 3.65 48 11.4 26.1
Confidence level used: 0.95
> pm = pvmat(warp.emm, adjust = "none")
> print(pm, quote=FALSE)
B L A M B M A H B H
A L 0.0027 0.0002 0.0036 0.0003 <.0001
B L 0.4170 0.9147 0.4805 0.0733
A M 0.3589 0.9147 0.3163
B M 0.4170 0.0584
A H 0.2682
Notes
As provided, this does not support 'by' variables; accordingly, the first line of the function disables them.
Using pairs(..., reverse = TRUE) generates the P values in the correct order needed later for upper.tri().
You can pass arguments to test() via the ... argument.
To create a tab-delimited version, use the clipr package:
clipr::write_clip(pm)
What you need is now in the clipboard and ready to paste into a spreadsheet.
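If you would rather write to a file than to the clipboard, a plain write.table call should also work (the file name is just an example):
write.table(pm, "pvalues.txt", sep = "\t", quote = FALSE, col.names = NA)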
Addendum
Answering this question inspired me to add a new function pwpm() to the emmeans package. It will appear in the next CRAN release, and is available now from the github site. It displays means and differences as well as P values; but the user may select which to include.
> pwpm(warp.emm)
wool = A
L M H
L [44.6] 0.0007 0.0009
M 20.556 [24.0] 0.9936
H 20.000 -0.556 [24.6]
wool = B
L M H
L [28.2] 0.9936 0.1704
M -0.556 [28.8] 0.1389
H 9.444 10.000 [18.8]
Row and column labels: tension
Upper triangle: P values    adjust = "tukey"
Diagonal: [Estimates] (emmean)
Lower triangle: Comparisons (estimate)    earlier vs. later
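For example, to display only the P values, my understanding is that the means and differences can be switched off via the corresponding arguments (see the pwpm documentation):
pwpm(warp.emm, means = FALSE, diffs = FALSE)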

how to loop in ggplot2 to get many plots per page

I wanted to loop a piece of code that produces graphs with ggplot2. My data set looks like the example "prot", except that prot is a single accession and the original data contains many more accessions.
For a single accession it looks nice. With the loop, I thought it would place one iteration per page in the .pdf,
but it doesn't. This single plot already creates combined plots, so I don't know how and where to place facet_wrap or facet_grid,
or maybe there is another solution?
Please help.
library(ggplot2)
ggplot(prot, aes(factor(genotype), value, fill = Light)) +
  geom_bar(stat="identity", position = "dodge") +
  scale_fill_brewer(palette = "Set1")
> prot
Accession genotype variable value Light
966 AT1G01050 WT ML_WT_Dejan_05 219971.1 ML
2828 AT1G01050 WT ML_WT_Dejan_06 286308.6 ML
4690 AT1G01050 WT ML_WT_Dejan_14 1177873.5 ML
6552 AT1G01050 m ML_m_Dejan_08 861982.0 ML
8414 AT1G01050 m ML_m_Dejan_10 3786163.0 ML
10276 AT1G01050 m ML_m_Dejan_11 1289267.7 ML
12138 AT1G01050 f ML_f_Dejan_01 400419.3 ML
14000 AT1G01050 f ML_f_Dejan_04 929297.2 ML
15862 AT1G01050 f ML_f_Dejan_09 12245991.9 ML
17724 AT1G01050 ntrc ML_ntrc_Dejan_02 785773.5 ML
19586 AT1G01050 ntrc ML_ntrc_Dejan_03 971133.1 ML
21448 AT1G01050 ntrc ML_ntrc_dejan7 592207.0 ML
23310 AT1G01050 ntrc ML_ntrc_Dejan_12R 347127.5 ML
25204 AT1G01050 WT FL_WT_Dejan_20 131817.0 FL
27134 AT1G01050 WT FL_WT_Dejan_39 560424.7 FL
29064 AT1G01050 WT FL_WT_Dejan_33 9304183.7 FL
30994 AT1G01050 WT FL_WT_Dejan_34 647452.4 FL
32924 AT1G01050 m FL_m_Dejan_21 712381.5 FL
34854 AT1G01050 m FL_m_Dejan_26 6089158.8 FL
36784 AT1G01050 m FL_m_Dejan_28 11341334.1 FL
38714 AT1G01050 f FL_f_Dejan_19 13140258.2 FL
40644 AT1G01050 f FL_f_Dejan_31 11256554.9 FL
42574 AT1G01050 f FL_f_Dejan_35 1621509.9 FL
44504 AT1G01050 f FL_f_Dejan37 392228.2 FL
46434 AT1G01050 ntrc FL_ntrc_Dejan_30 9069074.8 FL
48364 AT1G01050 ntrc FL_ntrc_Dejan_38 562403.6 FL
50294 AT1G01050 ntrc FL_ntrc_Dejan29 175258.6 FL
79347 AT1G01050 WT LL_WT_Dejan_41 2443625.6 LL
81783 AT1G01050 WT LL_WT_Dejan_43 8529143.7 LL
84219 AT1G01050 WT LL_WT_Dejan_49 11054552.6 LL
86655 AT1G01050 m LL_m_Dejan_44 14325152.0 LL
89091 AT1G01050 m LL_m_Dejan_45 13114486.4 LL
91527 AT1G01050 m LL_m_Dejan_54 8250430.1 LL
93963 AT1G01050 f LL_f_Dejan_47 12431354.5 LL
96399 AT1G01050 f LL_f_Dejan_48 11884118.5 LL
98835 AT1G01050 f LL_f_Dejan_53 8408509.1 LL
101271 AT1G01050 ntrc LL_ntrc_Dejan_46 12214783.1 LL
103707 AT1G01050 ntrc LL_ntrc_Dejan_50 1286828.3 LL
106143 AT1G01050 ntrc LL_ntrc_Dejan_42 1819043.9 LL
plots<- list()
pdf("TEST_boxplot.pdf")
IDs<-unique(prot$Accession)
for (i in 1:length(IDs)){
  temp <- prot[(prot$Accession)==IDs[i],]
  p <- ggplot(temp, aes(factor(genotype), value, fill = Light)) +
    geom_bar(stat="identity", position = "dodge") +
    scale_fill_brewer(palette = "Set1") +
    ggtitle(as.character(i))
  plots[[i]] <- p
  #plots[[paste(i)]] = p
  #multiplot(plotlist = plots, cols = 1)
}
dev.off()
I generated a prot toy dataset with two levels for Accession.
The code below prints 2 graphs on two pages of the TEST_boxplot.pdf file.
Here is the file generated.
library(ggplot2)
prot1 <- read.table(text="
n Accession genotype variable value Light
966 AT1G01050 WT ML_WT_Dejan_05 219971.1 ML
2828 AT1G01050 WT ML_WT_Dejan_06 286308.6 ML
4690 AT1G01050 WT ML_WT_Dejan_14 1177873.5 ML
6552 AT1G01050 m ML_m_Dejan_08 861982.0 ML
8414 AT1G01050 m ML_m_Dejan_10 3786163.0 ML
10276 AT1G01050 m ML_m_Dejan_11 1289267.7 ML
12138 AT1G01050 f ML_f_Dejan_01 400419.3 ML
14000 AT1G01050 f ML_f_Dejan_04 929297.2 ML
15862 AT1G01050 f ML_f_Dejan_09 12245991.9 ML
17724 AT1G01050 ntrc ML_ntrc_Dejan_02 785773.5 ML
19586 AT1G01050 ntrc ML_ntrc_Dejan_03 971133.1 ML
21448 AT1G01050 ntrc ML_ntrc_dejan7 592207.0 ML
23310 AT1G01050 ntrc ML_ntrc_Dejan_12R 347127.5 ML
25204 AT1G01050 WT FL_WT_Dejan_20 131817.0 FL
27134 AT1G01050 WT FL_WT_Dejan_39 560424.7 FL
29064 AT1G01050 WT FL_WT_Dejan_33 9304183.7 FL
30994 AT1G01050 WT FL_WT_Dejan_34 647452.4 FL
32924 AT1G01050 m FL_m_Dejan_21 712381.5 FL
34854 AT1G01050 m FL_m_Dejan_26 6089158.8 FL
36784 AT1G01050 m FL_m_Dejan_28 11341334.1 FL
38714 AT1G01050 f FL_f_Dejan_19 13140258.2 FL
40644 AT1G01050 f FL_f_Dejan_31 11256554.9 FL
42574 AT1G01050 f FL_f_Dejan_35 1621509.9 FL
44504 AT1G01050 f FL_f_Dejan37 392228.2 FL
46434 AT1G01050 ntrc FL_ntrc_Dejan_30 9069074.8 FL
48364 AT1G01050 ntrc FL_ntrc_Dejan_38 562403.6 FL
50294 AT1G01050 ntrc FL_ntrc_Dejan29 175258.6 FL
79347 AT1G01050 WT LL_WT_Dejan_41 2443625.6 LL
81783 AT1G01050 WT LL_WT_Dejan_43 8529143.7 LL
84219 AT1G01050 WT LL_WT_Dejan_49 11054552.6 LL
86655 AT1G01050 m LL_m_Dejan_44 14325152.0 LL
89091 AT1G01050 m LL_m_Dejan_45 13114486.4 LL
91527 AT1G01050 m LL_m_Dejan_54 8250430.1 LL
93963 AT1G01050 f LL_f_Dejan_47 12431354.5 LL
96399 AT1G01050 f LL_f_Dejan_48 11884118.5 LL
98835 AT1G01050 f LL_f_Dejan_53 8408509.1 LL
101271 AT1G01050 ntrc LL_ntrc_Dejan_46 12214783.1 LL
103707 AT1G01050 ntrc LL_ntrc_Dejan_50 1286828.3 LL
106143 AT1G01050 ntrc LL_ntrc_Dejan_42 1819043.9 LL
", header=T)
prot2 <- prot1
prot2$Accession <- "AT3G53620"
prot <- rbind(prot1,prot2)
plots <- list()
pdf("TEST_boxplot.pdf", onefile=T)
IDs<-unique(prot$Accession)
for (i in 1:length(IDs)){
  temp <- prot[(prot$Accession)==IDs[i],]
  p <- ggplot(temp, aes(factor(genotype), value, fill = Light)) +
    geom_bar(stat="identity", position = "dodge") +
    scale_fill_brewer(palette = "Set1") +
    ggtitle(as.character(i))
  plots[[i]] <- p
  print(p)  # printing the plot is what writes it to the current page of the PDF
}
dev.off()
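If you would rather go the facet route mentioned in the question, one facet per page can also be produced with the ggforce package; this is only a sketch, assuming ggforce is installed and prot is the combined data frame built above:
library(ggplot2)
library(ggforce)
pdf("TEST_facets.pdf", onefile = TRUE)
for (i in seq_along(unique(prot$Accession))) {
  print(
    ggplot(prot, aes(factor(genotype), value, fill = Light)) +
      geom_bar(stat = "identity", position = "dodge") +
      scale_fill_brewer(palette = "Set1") +
      facet_wrap_paginate(~ Accession, nrow = 1, ncol = 1, page = i)  # one Accession per page
  )
}
dev.off()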

R fitting and forecasting daily time series

I am working with a daily time series and I need to build a forecast for 90 days (or maybe more) based on my history - the current time series has roughly 298 data points.
The issue I have is the famous flat line in the final forecast - and yes, I might not have seasonality, but I am trying to work this out. Another issue is how to find the best model and adapt it from here on for this kind of behaviour.
I created a test case to investigate this further and any help is appreciated.
Thanks,
To start with
x <- day_data # My time series
z <- 90 # Days to forecast
low_bound_date <- as.POSIXlt(min(x$time), format = "%m/%d/%Y") # oldest date in the DF.
> low_bound_date
[1] "2015-12-21 PST"
low_bound_date$yday
> low_bound_date$yday # Day in Julian
[1] 354
lbyear <- as.numeric(substr(low_bound_date, 1, 4))
> lbyear
[1] 2015
This is the content of my time series:
> ts
Time Series:
Start = c(2065, 4)
End = c(2107, 7)
Frequency = 7
[2] 20.73 26.19 27.51 26.11 26.28 27.58 26.84 27.00 26.30 28.75 28.43 39.03 41.36 45.42 44.80 45.33 47.79 44.70 45.17
[20] 34.90 32.54 32.75 33.35 34.76 34.11 33.59 33.60 38.08 30.45 29.66 31.09 31.36 31.96 29.30 30.04 30.85 31.13 25.09
[39] 17.88 23.73 25.31 31.30 35.18 34.13 34.96 35.12 27.36 38.33 38.59 38.14 38.54 41.72 37.15 35.92 37.37 32.39 30.64
[58] 30.57 30.66 31.16 31.50 30.68 32.21 32.27 32.55 33.61 34.80 33.53 33.09 20.90 6.91 7.82 15.78 7.25 6.19 6.38
[77] 38.06 39.82 35.53 38.63 41.91 39.76 37.26 38.79 37.74 35.61 39.70 35.79 35.36 29.63 22.07 35.39 35.99 37.35 38.82
[96] 25.80 21.31 18.85 9.52 20.75 36.83 44.12 37.79 34.45 36.05 16.39 21.84 31.39 34.26 31.50 30.87 28.88 42.83 41.52
[115] 42.34 47.35 44.47 44.10 44.49 26.89 18.17 40.44 43.93 41.56 39.98 40.31 40.59 40.17 40.22 40.50 32.68 35.89 36.06
[134] 34.30 22.67 12.56 13.29 12.34 28.00 35.27 36.57 33.78 32.15 33.58 34.62 30.96 32.06 33.05 30.66 32.47 30.42 32.83
[153] 31.74 29.39 22.39 12.58 16.46 5.36 4.01 15.32 32.79 31.66 32.02 27.60 31.47 31.61 34.96 27.77 31.91 33.94 33.43
[172] 26.94 28.38 21.42 24.51 23.82 31.71 26.64 27.96 29.29 29.25 28.70 27.02 27.62 30.90 27.46 27.37 26.46 27.77 13.61
[191] 5.87 12.18 5.68 4.15 4.35 4.42 16.42 25.18 26.06 27.39 27.57 28.86 15.18 5.19 5.61 8.28 7.78 5.13 4.90
[210] 5.02 5.27 16.31 25.01 26.19 25.96 24.93 25.53 25.56 26.39 26.80 26.73 26.00 25.61 25.90 25.89 13.80 6.66 6.41
[229] 5.28 5.64 5.71 5.38 5.76 7.20 7.27 5.55 5.31 5.94 5.75 5.93 5.77 6.57 5.52 5.51 5.47 5.69 19.75
[248] 29.22 30.75 29.63 30.49 29.48 31.83 30.42 29.27 30.40 29.91 32.00 30.09 28.93 14.54 7.75 5.63 17.17 22.27 24.93
[267] 35.94 37.42 33.13 25.88 24.27 37.64 37.42 38.33 35.20 21.32 7.32 4.81 5.17 17.49 23.77 23.36 27.60 26.53 24.99
[286] 24.22 23.76 24.10 24.22 27.06 25.53 23.40 37.07 26.52 25.19 28.02 28.53 26.67
First step, I get my data into a ts object:
day_data_ts <- ts(x$avg_day, start = c(lbyear,low_bound_date$yday), frequency=7)
plot(day_data_ts)
(plot of day_data_ts)
acf(day_data_ts)
(ACF plot of day_data_ts)
Second step, I get my data into an msts object:
day_data_msts <- msts(x$avg_day, seasonal.periods=c(7,365.25), start = c(lbyear,low_bound_date$yday))
plot(day_data_msts)
acf(day_data_msts)
I did several fitting iterations to try and figure out the best fit and forecast model.
The first fitting test is with the ts only.
fit1 <- HoltWinters(day_data_ts)
> fit1
Holt-Winters exponential smoothing with trend and additive seasonal component.
Call: HoltWinters(x = day_data_ts)
Smoothing parameters: alpha: 1 beta : 0.006757112 gamma: 0
Coefficients:
[,1]
a 28.0922449
b 0.1652477
s1 0.6241837
s2 1.9084694
s3 0.9913265
s4 0.8198980
s5 -1.7015306
s6 -1.2201020
s7 -1.4222449
fit2 <- tbats(day_data_ts)
> fit2
BATS(1, {0,0}, 0.8, -)
Parameters: Alpha: 1.309966 Beta: -0.3011143 Damping Parameter: 0.800001
Seed States:
[,1]
[1,] 15.282259
[2,] 2.177787
Sigma: 5.501356 AIC: 2723.911
fit3 <- ets(day_data_ts)
> fit3
ETS(A,N,N)
Smoothing parameters: alpha = 0.9999
Initial states: l = 25.2275
sigma: 5.8506
AIC AICc BIC
2756.597 2756.678 2767.688
fit4 <- auto.arima(day_data_ts)
> fit4
ARIMA(1,1,2)
Coefficients:
ar1 ma1 ma2
0.7396 -0.6897 -0.2769
s.e. 0.0545 0.0690 0.0621
sigma^2 estimated as 30.47: log likelihood=-927.9
AIC=1863.81 AICc=1863.94 BIC=1878.58
Second test is using msts. I also changed the ets model to MAN.
fit5 <- tbats(day_data_msts)
> fit5
BATS(1, {0,0}, 0.8, -)
Parameters: Alpha: 1.309966 Beta: -0.3011143 Damping Parameter: 0.800001
Seed States:
[,1]
[1,] 15.282259
[2,] 2.177787
Sigma: 5.501356 AIC: 2723.911
fit6 <- ets(day_data_msts, model="MAN")
> fit6
ETS(M,A,N)
Smoothing parameters: alpha = 0.9999 beta = 9e-04
Initial states: l = 52.8658 b = 3.9184
sigma: 0.3459
AIC AICc BIC
3042.744 3042.949 3061.229
fit7 <- auto.arima(day_data_msts)
> fit7
ARIMA(1,1,2)
Coefficients:
ar1 ma1 ma2
0.7396 -0.6897 -0.2769
s.e. 0.0545 0.0690 0.0621
sigma^2 estimated as 30.47: log likelihood=-927.9
AIC=1863.81 AICc=1863.94 BIC=1878.58
You can forecast from a previously estimated model as follows (using the built-in time series LakeHuron):
library(forecast)
y <- LakeHuron
tsdisplay(y)
# estimate ARMA(1,1)
mod_2 <- Arima(y, order = c(1, 0, 1))
#make forecast for 5 periods (years in this case)
fHuron <- forecast(mod_2, h = 5)
#show results in table
fHuron
#plot results
plot(fHuron)
This will give you:
Pay attention that an ARIMA model bases its forecast on previous values, so if we predict many periods ahead, the model will use already-predicted values to predict the next ones, which reduces accuracy.
To fit an optimal ARIMA model, use this function:
library(R.utils) # for the function 'withTimeout'
fitARIMA <- function(timeseriesObject, timeout)
{
  final.aic <- Inf
  final.order <- c(0,0,0)
  for (p in 0:5) for (q in 0:5) {
    if (p == 0 && q == 0) {
      next
    }
    arimaFit = tryCatch(
      withTimeout(arima(timeseriesObject, order = c(p, 0, q)),
                  timeout = timeout)
      ,error = function(err) FALSE
      ,warning = function(err) FALSE)
    if (!is.logical(arimaFit)) {
      current.aic <- AIC(arimaFit)
      if (current.aic < final.aic) {
        final.aic <- current.aic
        final.order <- c(p, 0, q)
        final.arima <- arima(timeseriesObject, order = final.order)
      }
    } else {
      next
    }
  }
  final.order <- c(final.order, final.aic)
  final.order
}
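A minimal usage sketch, assuming day_data_ts from the question and the fitARIMA() function above (the 10-second timeout is an arbitrary choice):
library(forecast)
best <- fitARIMA(day_data_ts, 10)              # returns c(p, 0, q, AIC)
mod  <- Arima(day_data_ts, order = best[1:3])  # refit the best order
plot(forecast(mod, h = 90))                    # 90-day forecast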

How to calculate percentile in R?

In the R help (help(quantile)), you can see:
Type 7: m = 1-p. p[k] = (k - 1) / (n - 1). In this case, p[k] = mode[F(x[k])]. This is used by S.
Now, I have an example:
w<-c(75,64,47.4,66.9,62.2,62.2,58.7,63.5,66.6,64,57,69,56.9,50,72.0)
sort(w)
[1] 47.4 50.0 56.9 57.0 58.7 62.2 62.2 63.5 64.0 64.0 66.6 66.9 69.0 72.0 75.0
quantile(w)
0% 25% 50% 75% 100%
47.40 57.85 63.50 66.75 75.00
How can you use the type 7 formula to get the result?
I'm having some trouble deciding if the answer is just:
> quantile(w, type=7)
0% 25% 50% 75% 100%
47.40 57.85 63.50 66.75 75.00
My problem is that the default for quantile is type=7 and you already have that result. If you look at the code for quantile.default there is a section for type=7:
index <- 1 + (n - 1) * probs
lo <- floor(index)
hi <- ceiling(index)
x <- sort(x, partial = unique(c(lo, hi)))
qs <- x[lo]
i <- which(index > lo)
h <- (index - lo)[i]
qs[i] <- (1 - h) * qs[i] + h * x[hi[i]]
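To make the formula concrete, here is a sketch that applies those same steps by hand to the 25% quantile of w (n = 15), reproducing the 57.85 above:
x <- sort(w)
n <- length(x)               # 15
p <- 0.25
index <- 1 + (n - 1) * p     # 1 + 14 * 0.25 = 4.5
lo <- floor(index)           # 4
hi <- ceiling(index)         # 5
h <- index - lo              # 0.5
(1 - h) * x[lo] + h * x[hi]  # 0.5 * 57.0 + 0.5 * 58.7 = 57.85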
