How can I grab data from this web site? - web-scraping
There is a site here (http://www.tsetmc.com/Loader.aspx?ParTree=151311&i=46741025610365786#) where each field of this table (marked by the yellow squares) shows information about one specific day. What I need to do is read only the حجم (volume) row of each field, i.e. what I marked with the red squares in the following photos (you have to go to the tab I mentioned in the first photo to see the second photo):
And write them (store them on my computer) in a text file like this:
6.832 M (14%) , 40.475 M (85%), 248,000 (0%), 47.059 M (99%)
605,000 (3%), 15.277 M (96%), 478,714 (3%), 15.404 M (96%)
8.102 M (42%), 10.751 M (57%), 9.599 M (50%), 9.253 M (49%)
215,937 (2%), 9.417 M (97%), 1.115 M (11%), 8.518 M (88%)
3.351 M (15%), 18.284 M (84%), 5.987 M (27%), 15.647 M (72%)
But I don't know whether this is possible or not. If it is, how can I do it in the simplest way? (I use Windows 10.)
EDIT:
I completed step 3 successfully and ran the node extract.js command in step 4. I got this result:
[ 'حجم, 47.059 M (99%), 248,000 (0%), 40.475 M (85%), 6.832 M (14%)',
'حجم, 15.404 M (96%), 478,714 (3%), 15.277 M (96%), 605,000 (3%)',
'حجم, 9.253 M (49%), 9.599 M (50%), 10.751 M (57%), 8.102 M (42%)',
'حجم, 8.518 M (88%), 1.115 M (11%), 9.417 M (97%), 215,937 (2%)',
'حجم, 15.647 M (72%), 5.987 M (27%), 18.284 M (84%), 3.351 M (15%)',
'حجم, 21.848 M (93%), 1.501 M (6%), 21.648 M (92%), 1.701 M (7%)',
'حجم, 30.845 M (95%), 1.3 M (4%), 30.663 M (95%), 1.482 M (4%)',
'حجم, 9.914 M (64%), 5.474 M (35%), 9.938 M (64%), 5.45 M (35%)',
'حجم, 10.775 M (97%), 250,000 (2%), 10.995 M (99%), 30,000 (0%)',
'حجم, 21.328 M (91%), 2.027 M (8%), 22.315 M (95%), 1.04 M (4%)',
'حجم, 19.588 M (92%), 1.54 M (7%), 21.048 M (99%), 80,000 (0%)',
'حجم, 12.554 M (96%), 418,000 (3%), 11.504 M (88%), 1.468 M (11%)',
'حجم, 14.98 M (92%), 1.299 M (7%), 16.135 M (99%), 144,008 (0%)',
'حجم, 10.878 M (95%), 502,040 (4%), 11.378 M (99%), 2,040 (0%)',
'حجم, 10.012 M (97%), 275,000 (2%), 10.287 M (100%), 0 (0%)',
'حجم, 11.992 M (95%), 500,000 (4%), 11.707 M (93%), 785,244 (6%)',
'حجم, 16.492 M (95%), 820,000 (4%), 17.056 M (98%), 256,241 (1%)',
'حجم, 19.639 M (98%), 378,384 (1%), 20.017 M (100%), 0 (0%)',
'حجم, 13.781 M (95%), 639,609 (4%), 14.161 M (98%), 260,000 (1%)',
'حجم, 31.797 M (99%), 300,507 (0%), 26.089 M (81%), 6.009 M (18%)',
'حجم, 18.159 M (99%), 30,391 (0%), 15.914 M (87%), 2.275 M (12%)',
'حجم, 21.271 M (95%), 1.01 M (4%), 21.501 M (96%), 780,000 (3%)',
'حجم, 17.322 M (62%), 10.615 M (37%), 19.437 M (69%), 8.5 M (30%)',
'حجم, 37.817 M (97%), 1.03 M (2%), 34.125 M (87%), 4.722 M (12%)',
'حجم, 55.396 M (99%), 211,000 (0%), 52.507 M (94%), 3.1 M (5%)',
'حجم, 23.141 M (98%), 420,000 (1%), 23.461 M (99%), 100,000 (0%)',
'حجم, 46.215 M (82%), 9.919 M (17%), 49.764 M (88%), 6.371 M (11%)',
'حجم, 1.26 M (100%), 0 (0%), 1.26 M (100%), 0 (0%)',
'حجم, 35.89 M (99%), 251,000 (0%), 35.921 M (99%), 220,000 (0%)',
'حجم, 48.509 M (88%), 6.349 M (11%), 54.052 M (98%), 806,362 (1%)',
'حجم, 41.018 M (91%), 4.006 M (8%), 41.564 M (92%), 3.46 M (7%)',
'حجم, 40.02 M (99%), 100,000 (0%), 39.22 M (97%), 900,000 (2%)',
'حجم, 36.974 M (99%), 30,000 (0%), 36.549 M (98%), 455,500 (1%)',
'حجم, 35.739 M (99%), 230,000 (0%), 33.104 M (92%), 2.866 M (7%)',
'حجم, 19.627 M (100%), 0 (0%), 18.877 M (96%), 750,000 (3%)',
'حجم, 19.603 M (81%), 4.379 M (18%), 23.982 M (100%), 0 (0%)',
'حجم, 10.186 M (97%), 250,000 (2%), 10.436 M (100%), 0 (0%)',
'حجم, 15.414 M (98%), 250,500 (1%), 15.465 M (98%), 200,000 (1%)',
'حجم, 21.571 M (97%), 665,000 (2%), 22.236 M (100%), 0 (0%)',
'حجم, 15.537 M (98%), 250,000 (1%), 15.787 M (100%), 0 (0%)',
'حجم, 21.422 M (98%), 221,004 (1%), 21.243 M (98%), 400,000 (1%)',
'حجم, 30.662 M (92%), 2.375 M (7%), 33.036 M (100%), 0 (0%)',
'حجم, 39.287 M (98%), 455,000 (1%), 39.742 M (100%), 0 (0%)',
'حجم, 53.141 M (89%), 6.11 M (10%), 59.131 M (99%), 120,000 (0%)',
'حجم, 23.587 M (98%), 255,000 (1%), 23.842 M (100%), 0 (0%)',
'حجم, 17.043 M (98%), 255,000 (1%), 17.298 M (100%), 0 (0%)',
'حجم, 33.51 M (96%), 1.25 M (3%), 34.75 M (99%), 10,000 (0%)',
'حجم, 36.408 M (99%), 15,000 (0%), 28.248 M (77%), 8.175 M (22%)',
'حجم, 32.367 M (98%), 480,000 (1%), 31.535 M (96%), 1.312 M (3%)',
'حجم, 54.773 M (95%), 2.68 M (4%), 43.936 M (76%), 13.517 M (23%)',
'حجم, 58.955 M (95%), 2.54 M (4%), 41.234 M (67%), 20.262 M (32%)',
'حجم, 45.222 M (99%), 15,000 (0%), 40.215 M (88%), 5.023 M (11%)',
'حجم, 43.487 M (97%), 1.225 M (2%), 43.902 M (98%), 810,008 (1%)',
'حجم, 35.46 M (91%), 3.18 M (8%), 38.33 M (99%), 310,000 (0%)',
'حجم, 39.42 M (99%), 90,927 (0%), 36.722 M (92%), 2.789 M (7%)',
'حجم, 41.024 M (99%), 312,000 (0%), 35.814 M (86%), 5.522 M (13%)',
'حجم, 32.718 M (99%), 277,978 (0%), 30.995 M (93%), 2.001 M (6%)',
'حجم, 1.12 M (100%), 0 (0%), 1.12 M (100%), 0 (0%)',
'حجم, 2.015 M (86%), 325,000 (13%), 2.34 M (100%), 0 (0%)',
'حجم, 40.402 M (95%), 2.109 M (4%), 42.511 M (100%), 0 (0%)',
'حجم, 41.726 M (86%), 6.372 M (13%), 48.098 M (100%), 0 (0%)',
'حجم, 39.444 M (97%), 1.14 M (2%), 39.551 M (97%), 1.033 M (2%)',
'حجم, 4.14 M (100%), 0 (0%), 3.14 M (75%), 1,000,000 (24%)',
'حجم, 43.447 M (96%), 1.743 M (3%), 44.292 M (98%), 898,000 (1%)',
'حجم, 56.023 M (98%), 864,338 (1%), 52.627 M (92%), 4.26 M (7%)',
'حجم, 14.062 M (99%), 8,008 (0%), 12.055 M (85%), 2.015 M (14%)',
'حجم, 56.557 M (84%), 10.413 M (15%), 66.47 M (99%), 500,000 (0%)',
'حجم, 7.971 M (69%), 3.481 M (30%), 11.452 M (100%), 0 (0%)',
'حجم, 38.85 M (86%), 5.864 M (13%), 44.494 M (99%), 220,000 (0%)',
'حجم, 53.151 M (99%), 105,000 (0%), 51.039 M (95%), 2.217 M (4%)',
'حجم, 51.861 M (79%), 13.352 M (20%), 64.603 M (99%), 610,000 (0%)',
'حجم, 2.025 M (80%), 500,000 (19%), 2.525 M (100%), 0 (0%)',
'حجم, 67.428 M (95%), 3.294 M (4%), 68.538 M (96%), 2.184 M (3%)',
'حجم, 52.373 M (87%), 7.211 M (12%), 58.408 M (98%), 1.176 M (1%)',
'حجم, 12.073 M (80%), 3.01 M (19%), 14.583 M (96%), 500,000 (3%)',
'حجم, 47.369 M (99%), 424,000 (0%), 30.168 M (63%), 17.626 M (36%)',
'حجم, 3.401 M (100%), 0 (0%), 1.039 M (30%), 2.363 M (69%)',
'حجم, 52.213 M (99%), 247,000 (0%), 41.872 M (79%), 10.588 M (20%)',
'حجم, 73.585 M (98%), 1.356 M (1%), 38.911 M (51%), 36.029 M (48%)',
'حجم, 67.943 M (97%), 1.622 M (2%), 35.571 M (51%), 33.995 M (48%)',
'حجم, 2.653 M (100%), 0 (0%), 2.003 M (75%), 650,000 (24%)',
'حجم, 32.055 M (99%), 18,408 (0%), 24.301 M (75%), 7.772 M (24%)',
'حجم, 16.989 M (98%), 209,000 (1%), 9.598 M (55%), 7.6 M (44%)',
'حجم, 34.906 M (95%), 1.64 M (4%), 21.129 M (57%), 15.417 M (42%)',
'حجم, 14.669 M (98%), 150,000 (1%), 7.852 M (52%), 6.967 M (47%)',
'حجم, 23.542 M (98%), 289,600 (1%), 23.102 M (96%), 729,782 (3%)',
'حجم, 27.87 M (98%), 450,000 (1%), 21.461 M (75%), 6.859 M (24%)',
'حجم, 48.785 M (98%), 500,000 (1%), 30.683 M (62%), 18.603 M (37%)',
'حجم, 22.839 M (93%), 1.518 M (6%), 16.242 M (66%), 8.115 M (33%)',
'حجم, 15.683 M (96%), 631,500 (3%), 13.316 M (81%), 2.999 M (18%)',
'حجم, 15.715 M (96%), 630,000 (3%), 15.436 M (94%), 908,399 (5%)',
'حجم, 11.776 M (90%), 1.305 M (9%), 13.081 M (100%), 0 (0%)',
'حجم, 12.492 M (85%), 2.057 M (14%), 14.149 M (97%), 400,000 (2%)',
'حجم, 11.909 M (100%), 0 (0%), 11.818 M (99%), 91,008 (0%)',
'حجم, 21.404 M (99%), 140,000 (0%), 17.8 M (82%), 3.744 M (17%)',
'حجم, 22.115 M (89%), 2.718 M (10%), 21.969 M (88%), 2.864 M (11%)',
'حجم, 23.146 M (97%), 637,396 (2%), 21.881 M (92%), 1.902 M (7%)',
'حجم, 35.986 M (94%), 1.92 M (5%), 25.749 M (67%), 12.156 M (32%)',
'حجم, 16.064 M (93%), 1.179 M (6%), 17.104 M (99%), 139,467 (0%)',
'حجم, 19.314 M (85%), 3.284 M (14%), 22.408 M (99%), 189,500 (0%)',
... 84 more items ]
(node:13916) UnhandledPromiseRejectionWarning: TypeError [ERR_INVALID_CALLBACK]: Callback must be a function
at maybeCallback (fs.js:129:9)
at Object.writeFile (fs.js:1159:14)
at C:\Users\m\Desktop\GetData\extract.js:21:14
at process._tickCallback (internal/process/next_tick.js:68:7)
(node:13916) UnhandledPromiseRejectionWarning: Unhandled promise rejection. This error originated either by throwing inside of an async function without a catch block, or by rejecting a promise which was not handled with .catch(). (rejection id: 1)
(node:13916) [DEP0018] DeprecationWarning: Unhandled promise rejections are deprecated. In the future, promise rejections that are not handled will terminate the Node.js process
with a non-zero exit code.
But there is no store.txt file!
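The ERR_INVALID_CALLBACK trace above is what prevents store.txt from being created: on this Node version fs.writeFile throws if it is called without a callback, so the script aborts at extract.js line 21 before anything is written. A minimal sketch of the fix follows; the variable name data is a hypothetical stand-in for whatever string the script builds, since the original extract.js is not reproduced on this page.

const fs = require('fs');

// Either pass the callback that fs.writeFile requires...
fs.writeFile('store.txt', data, (err) => {   // `data` is a placeholder name
  if (err) throw err;
  console.log('store.txt written');
});

// ...or use the synchronous variant, which needs no callback:
fs.writeFileSync('store.txt', data);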
1. Download and install node.js & npm from here - https://www.npmjs.com/get-npm
2. Create a folder anywhere on your PC, create a file extract.js in it, and paste the following code (a sketch of such a script is given after these steps).
Code
3. Then open a command prompt in that folder and run "npm install puppeteer" (it may take several minutes to complete).
4. Then run "node extract.js".
5. After it runs successfully you will have a "store.txt" file in the same folder containing your expected result.
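The answer's actual code block ("Code" above) is not reproduced on this page, so here is only a rough sketch of what such a Puppeteer script could look like. The URL is taken from the question; the tr/td selectors, the filtering on the حجم label, and the store.txt file name are assumptions, not the answer's exact code.

const fs = require('fs');
const puppeteer = require('puppeteer');

const url = 'http://www.tsetmc.com/Loader.aspx?ParTree=151311&i=46741025610365786';

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto(url, { waitUntil: 'networkidle2' });
  // If the relevant tab has to be opened first, an extra page.click(...) would be
  // needed here; the selector for that tab is not known from the question.

  // Collect the text of every table row as "cell, cell, cell, ...".
  const rows = await page.evaluate(() =>
    Array.from(document.querySelectorAll('tr')).map(tr =>
      Array.from(tr.querySelectorAll('td')).map(td => td.innerText.trim()).join(', ')
    )
  );

  // Keep only the rows labelled حجم (volume) and drop the label itself,
  // leaving the four "amount (percent)" values per day.
  const lines = rows
    .filter(r => r.includes('حجم'))
    .map(r => r.replace(/حجم\s*,\s*/, ''));

  // fs.writeFile without a callback throws on this Node version (the error shown
  // in the question's edit); writeFileSync side-steps that.
  fs.writeFileSync('store.txt', lines.join('\n'));

  await browser.close();
})();

Run it with node extract.js as in step 4; the replace() call only strips the leading حجم label so the stored lines match the format shown at the top of the question.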
Related
Calculate weighted average of four nearest grid points
I have a data frame that looks like this:

    Teff logg M_div_H       U       B      V      R      I     J     H     K     L Lprime     M
 1: 2000 4.00    -0.1 -13.443 -11.390 -7.895 -4.464 -1.831 1.666 3.511 2.701 4.345  4.765 5.680
 2: 2000 4.50    -0.1 -13.402 -11.416 -7.896 -4.454 -1.794 1.664 3.503 2.728 4.352  4.772 5.687
 3: 2000 5.00    -0.1 -13.358 -11.428 -7.888 -4.431 -1.738 1.664 3.488 2.753 4.361  4.779 5.685
 4: 2000 5.50    -0.1 -13.220 -11.079 -7.377 -4.136 -1.483 1.656 3.418 2.759 4.355  4.753 5.638
 5: 2200 3.50    -0.1 -11.866  -9.557 -6.378 -3.612 -1.185 1.892 3.294 2.608 3.929  4.289 4.842
 6: 2200 4.50    -0.1 -11.845  -9.643 -6.348 -3.589 -1.132 1.874 3.310 2.648 3.947  4.305 4.939
 7: 2200 5.50    -0.1 -11.655  -9.615 -6.279 -3.508 -0.997 1.886 3.279 2.709 3.964  4.314 4.928
 8: 2500 -1.02   -0.1  -7.410  -7.624 -6.204 -3.854 -1.533 1.884 3.320 2.873 3.598  3.964 5.579
 9: 2500 -0.70   -0.1  -7.008  -7.222 -5.818 -3.618 -1.338 1.905 3.266 2.868 3.502  3.877 5.417
10: 2500 -0.29   -0.1  -6.526  -6.740 -5.357 -3.421 -1.215 1.927 3.216 2.870 3.396  3.781 5.247
11: 2500 5.50    -0.1  -9.518  -7.575 -5.010 -2.756 -0.511 1.959 3.057 2.642 3.472  3.756 4.265
12: 2800 -1.02   -0.1  -7.479  -7.386 -5.941 -3.716 -1.432 1.824 3.259 2.812 3.567  3.784 5.333
13: 2800 -0.70   -0.1  -7.125  -7.032 -5.596 -3.477 -1.231 1.822 3.218 2.813 3.479  3.717 5.229
14: 2800 -0.29   -0.1  -6.673  -6.580 -5.154 -3.166 -0.974 1.816 3.163 2.812 3.364  3.628 5.093
15: 2800 3.50    -0.1  -8.113  -6.258 -4.103 -2.209 -0.360 1.957 2.872 2.517 3.219  3.427 4.026
16: 2800 4.00    -0.1  -7.992  -6.099 -3.937 -2.076 -0.230 1.907 2.869 2.480 3.227  3.424 4.075
17: 2800 4.50    -0.1  -7.815  -6.051 -4.067 -2.176 -0.228 1.920 2.877 2.503 3.212  3.428 4.000
18: 2800 5.00    -0.1  -7.746  -6.018 -4.031 -2.144 -0.176 1.907 2.883 2.512 3.216  3.430 4.023
19: 3000 -0.70   -0.1  -7.396  -6.995 -5.605 -3.554 -1.293 1.787 3.172 2.759 3.474  3.588 5.052
20: 3000 -0.29   -0.1  -6.966  -6.565 -5.179 -3.249 -1.035 1.772 3.136 2.764 3.388  3.533 4.978

Notice, for example, how every V value has a unique Teff, logg combination. Now, let's say I have two values:

input_Teff = 2300
input_log_g = 3.86

If we imagine all the (Teff, logg) combinations as grid points, then for a given input point I would like to find the four grid points closest to it:

point 1 (Teff1, logg1)
point 2 (Teff2, logg2)
point 3 (Teff3, logg3)
point 4 (Teff4, logg4)

then calculate the "distances" between my input point and these points through Pythagoras (in this example, four distances):

(Teff1, logg1) -> d1
(Teff2, logg2) -> d2
(Teff3, logg3) -> d3
(Teff4, logg4) -> d4
# NOTE: Teff and logg are on different scales

next, get the V values in the rows of these points:

(Teff1, logg1) -> V1
(Teff2, logg2) -> V2
(Teff3, logg3) -> V3
(Teff4, logg4) -> V4

and finally do a weighted average calculation:

V = (d1*V1 + d2*V2 + d3*V3 + d4*V4) / (d1 + d2 + d3 + d4)

What would be a good way to do this in R?

Edit: https://www.dropbox.com/s/prbceabxmd25etx/lcb98cor.dat?dl=0
Let's say your data frame is named teff_df, and let's assume this is the way to calculate the Pythagorean distances (update: find the nearest distance between the input and every (Teff, logg) point). Note: I use min-max normalization to rescale the Teff and logg values:

min_max_norm <- function(x) {
  return(((x - min(x)) / (max(x) - min(x))))
}

phytagoras <- function(a, b) {
  return(a**2 + b**2)
}

First, rescale the Teff and logg values of the input and all available grid points into the range 0-1:

teff_with_input <- c(teff_df$Teff, 2300)
logg_with_input <- c(teff_df$logg, 3.86)
teff_rescale <- min_max_norm(teff_with_input)
logg_rescale <- min_max_norm(logg_with_input)
teff_input_rescale <- teff_rescale[length(teff_rescale)]
logg_input_rescale <- logg_rescale[length(logg_rescale)]
teff_rescale <- teff_rescale[-length(teff_rescale)]
logg_rescale <- logg_rescale[-length(logg_rescale)]

Second, calculate the differences between the input and all grid points, which have already been transformed into Pythagorean values:

input_distance <- phytagoras(teff_input_rescale, logg_input_rescale)
my_phyta_dist <- phytagoras(teff_rescale, logg_rescale)
my_v_point <- teff_df[, which(colnames(teff_df) == "V")]
diff_input <- my_phyta_dist - input_distance
teff_df[order(abs(diff_input))[1:4], ]  # this will display which rows are nearest by the Pythagorean distance calculation

Third, extract the 4 nearest points, by the minimum differences between the input and the (Teff, logg) combinations of the grid:

nearest_4rows <- as.numeric(rownames(teff_df[order(abs(diff_input))[1:4], ]))
nearest_4phyta_dist <- my_phyta_dist[nearest_4rows]
nearest_4v_point <- my_v_point[nearest_4rows]

And finally, calculate the weighted average V = (d1*V1 + d2*V2 + d3*V3 + d4*V4) / (d1 + d2 + d3 + d4):

product_dist <- nearest_4phyta_dist * nearest_4v_point
weighted_average <- sum(product_dist) / sum(nearest_4phyta_dist)
Making a matrix from lsmeans contrasts return
To create the data frame:

num <- sample(1:25, 20)
x <- data.frame("Day_eclosion" = num, "Developmental" = c("AP", "MA", "JU", "L"), "Replicate" = 1:5)
model <- glmer(Day_eclosion ~ Developmental + (1 | Replicate), family = "poisson", data = x)

I get this return from:

a <- lsmeans(model, pairwise ~ Developmental, adjust = "tukey")
a$contrasts

contrast estimate     SE  df z.ratio p.value
AP - JU    0.2051 0.0168 Inf  12.172  <.0001
AP - L     0.3009 0.0212 Inf  14.164  <.0001
AP - MA    0.3889 0.0209 Inf  18.631  <.0001
JU - L     0.0958 0.0182 Inf   5.265  <.0001
JU - MA    0.1839 0.0177 Inf  10.387  <.0001
L - MA     0.0881 0.0222 Inf   3.964  0.0004

I am looking for a simple way to turn this output (just the p values) into:

     AP    MA      JU      L
AP   -     <.0001  <.0001  <.0001
MA   -     -       <.0001  0.0004
JU   -     -       -       <.0001
L    -     -       -

I have about 20 sets of these that I need to turn into tables, so the simpler and more general the better. Bonus points if the output is tab-delimited, etc., so that I can easily paste it into Word/Excel. Thanks!
Here's a function that works...

pvmat = function(emm, ...) {
    emm = update(emm, by = NULL)     # need to work harder otherwise
    pv = test(pairs(emm, reverse = TRUE, ...))$p.value
    fmtpv = sprintf("%6.4f", pv)
    fmtpv[pv < 0.0001] = "<.0001"
    lbls = do.call(paste, emm@grid[emm@misc$pri.vars])
    n = length(lbls)
    mat = matrix("", nrow = n, ncol = n, dimnames = list(lbls, lbls))
    mat[upper.tri(mat)] = fmtpv
    idx = seq_len(n - 1)
    mat[idx, 1 + idx]                # trim off last row and 1st col
}

Illustration:

> require(emmeans)
> warp.lm = lm(breaks ~ wool * tension, data = warpbreaks)
> warp.emm = emmeans(warp.lm, ~ wool * tension)
> warp.emm
 wool tension emmean   SE df lower.CL upper.CL
 A    L         44.6 3.65 48     37.2     51.9
 B    L         28.2 3.65 48     20.9     35.6
 A    M         24.0 3.65 48     16.7     31.3
 B    M         28.8 3.65 48     21.4     36.1
 A    H         24.6 3.65 48     17.2     31.9
 B    H         18.8 3.65 48     11.4     26.1

Confidence level used: 0.95

> pm = pvmat(warp.emm, adjust = "none")
> print(pm, quote = FALSE)
    B L    A M    B M    A H    B H
A L 0.0027 0.0002 0.0036 0.0003 <.0001
B L        0.4170 0.9147 0.4805 0.0733
A M               0.3589 0.9147 0.3163
B M                      0.4170 0.0584
A H                             0.2682

Notes

As provided, this does not support by variables. Accordingly, the first line of the function disables them.
Using pairs(..., reverse = TRUE) generates the P values in the correct order needed later for upper.tri().
You can pass arguments to test() via ...
To create a tab-delimited version, use the clipr package: clipr::write_clip(pm). What you need is now in the clipboard and ready to paste into a spreadsheet.

Addendum

Answering this question inspired me to add a new function pwpm() to the emmeans package. It will appear in the next CRAN release, and is available now from the github site. It displays means and differences as well as P values; but the user may select which to include.

> pwpm(warp.emm)
wool = A
        L      M      H
L  [44.6] 0.0007 0.0009
M  20.556 [24.0] 0.9936
H  20.000 -0.556 [24.6]

wool = B
        L      M      H
L  [28.2] 0.9936 0.1704
M  -0.556 [28.8] 0.1389
H   9.444 10.000 [18.8]

Row and column labels: tension
Upper triangle: P values   adjust = "tukey"
Diagonal: [Estimates] (emmean)
Lower triangle: Comparisons (estimate)   earlier vs. later
how to loop in ggplot2 to get many plots per page
I wanted to loop a piece of code that produces a graph with ggplot2. My data set looks like the example "prot", just prot is one accession and in the original data I have many more accessions. For a single accession it looks nice. Just with the looping, I thought that it would place one iteration per page in the .pdf, but it doesn't. This single plot already creates combined plots, so I don't know now how and where to place facet_wrap or facet_grid? Or maybe there is another solution? Please help.

library(ggplot2)
ggplot(prot, aes(factor(genotype), value, fill = Light)) +
  geom_bar(stat = "identity", position = "dodge") +
  scale_fill_brewer(palette = "Set1")

> prot
       Accession genotype          variable      value Light
966    AT1G01050       WT    ML_WT_Dejan_05   219971.1    ML
2828   AT1G01050       WT    ML_WT_Dejan_06   286308.6    ML
4690   AT1G01050       WT    ML_WT_Dejan_14  1177873.5    ML
6552   AT1G01050        m     ML_m_Dejan_08   861982.0    ML
8414   AT1G01050        m     ML_m_Dejan_10  3786163.0    ML
10276  AT1G01050        m     ML_m_Dejan_11  1289267.7    ML
12138  AT1G01050        f     ML_f_Dejan_01   400419.3    ML
14000  AT1G01050        f     ML_f_Dejan_04   929297.2    ML
15862  AT1G01050        f     ML_f_Dejan_09 12245991.9    ML
17724  AT1G01050     ntrc  ML_ntrc_Dejan_02   785773.5    ML
19586  AT1G01050     ntrc  ML_ntrc_Dejan_03   971133.1    ML
21448  AT1G01050     ntrc    ML_ntrc_dejan7   592207.0    ML
23310  AT1G01050     ntrc ML_ntrc_Dejan_12R   347127.5    ML
25204  AT1G01050       WT    FL_WT_Dejan_20   131817.0    FL
27134  AT1G01050       WT    FL_WT_Dejan_39   560424.7    FL
29064  AT1G01050       WT    FL_WT_Dejan_33  9304183.7    FL
30994  AT1G01050       WT    FL_WT_Dejan_34   647452.4    FL
32924  AT1G01050        m     FL_m_Dejan_21   712381.5    FL
34854  AT1G01050        m     FL_m_Dejan_26  6089158.8    FL
36784  AT1G01050        m     FL_m_Dejan_28 11341334.1    FL
38714  AT1G01050        f     FL_f_Dejan_19 13140258.2    FL
40644  AT1G01050        f     FL_f_Dejan_31 11256554.9    FL
42574  AT1G01050        f     FL_f_Dejan_35  1621509.9    FL
44504  AT1G01050        f      FL_f_Dejan37   392228.2    FL
46434  AT1G01050     ntrc  FL_ntrc_Dejan_30  9069074.8    FL
48364  AT1G01050     ntrc  FL_ntrc_Dejan_38   562403.6    FL
50294  AT1G01050     ntrc   FL_ntrc_Dejan29   175258.6    FL
79347  AT1G01050       WT    LL_WT_Dejan_41  2443625.6    LL
81783  AT1G01050       WT    LL_WT_Dejan_43  8529143.7    LL
84219  AT1G01050       WT    LL_WT_Dejan_49 11054552.6    LL
86655  AT1G01050        m     LL_m_Dejan_44 14325152.0    LL
89091  AT1G01050        m     LL_m_Dejan_45 13114486.4    LL
91527  AT1G01050        m     LL_m_Dejan_54  8250430.1    LL
93963  AT1G01050        f     LL_f_Dejan_47 12431354.5    LL
96399  AT1G01050        f     LL_f_Dejan_48 11884118.5    LL
98835  AT1G01050        f     LL_f_Dejan_53  8408509.1    LL
101271 AT1G01050     ntrc  LL_ntrc_Dejan_46 12214783.1    LL
103707 AT1G01050     ntrc  LL_ntrc_Dejan_50  1286828.3    LL
106143 AT1G01050     ntrc  LL_ntrc_Dejan_42  1819043.9    LL

plots <- list()
pdf("TEST_boxplot.pdf")
IDs <- unique(prot$Accession)
for (i in 1:length(IDs)){
  temp <- prot[(prot$Accession) == IDs[i], ]
  p <- ggplot(temp, aes(factor(genotype), value, fill = Light)) +
    geom_bar(stat = "identity", position = "dodge") +
    scale_fill_brewer(palette = "Set1") +
    ggtitle(as.character(i))
  plots[[i]] <- p
  # plots[[paste(i)]] = p
  # multiplot(plotlist = plots, cols = 1)
}
dev.off()
I generated a prot toy dataset with two levels for Accession. The code below prints 2 graphs on two pages of the TEST_boxplot.pdf file. Here is the file generated.

library(ggplot2)

prot1 <- read.table(text="
n Accession genotype variable value Light
966 AT1G01050 WT ML_WT_Dejan_05 219971.1 ML
2828 AT1G01050 WT ML_WT_Dejan_06 286308.6 ML
4690 AT1G01050 WT ML_WT_Dejan_14 1177873.5 ML
6552 AT1G01050 m ML_m_Dejan_08 861982.0 ML
8414 AT1G01050 m ML_m_Dejan_10 3786163.0 ML
10276 AT1G01050 m ML_m_Dejan_11 1289267.7 ML
12138 AT1G01050 f ML_f_Dejan_01 400419.3 ML
14000 AT1G01050 f ML_f_Dejan_04 929297.2 ML
15862 AT1G01050 f ML_f_Dejan_09 12245991.9 ML
17724 AT1G01050 ntrc ML_ntrc_Dejan_02 785773.5 ML
19586 AT1G01050 ntrc ML_ntrc_Dejan_03 971133.1 ML
21448 AT1G01050 ntrc ML_ntrc_dejan7 592207.0 ML
23310 AT1G01050 ntrc ML_ntrc_Dejan_12R 347127.5 ML
25204 AT1G01050 WT FL_WT_Dejan_20 131817.0 FL
27134 AT1G01050 WT FL_WT_Dejan_39 560424.7 FL
29064 AT1G01050 WT FL_WT_Dejan_33 9304183.7 FL
30994 AT1G01050 WT FL_WT_Dejan_34 647452.4 FL
32924 AT1G01050 m FL_m_Dejan_21 712381.5 FL
34854 AT1G01050 m FL_m_Dejan_26 6089158.8 FL
36784 AT1G01050 m FL_m_Dejan_28 11341334.1 FL
38714 AT1G01050 f FL_f_Dejan_19 13140258.2 FL
40644 AT1G01050 f FL_f_Dejan_31 11256554.9 FL
42574 AT1G01050 f FL_f_Dejan_35 1621509.9 FL
44504 AT1G01050 f FL_f_Dejan37 392228.2 FL
46434 AT1G01050 ntrc FL_ntrc_Dejan_30 9069074.8 FL
48364 AT1G01050 ntrc FL_ntrc_Dejan_38 562403.6 FL
50294 AT1G01050 ntrc FL_ntrc_Dejan29 175258.6 FL
79347 AT1G01050 WT LL_WT_Dejan_41 2443625.6 LL
81783 AT1G01050 WT LL_WT_Dejan_43 8529143.7 LL
84219 AT1G01050 WT LL_WT_Dejan_49 11054552.6 LL
86655 AT1G01050 m LL_m_Dejan_44 14325152.0 LL
89091 AT1G01050 m LL_m_Dejan_45 13114486.4 LL
91527 AT1G01050 m LL_m_Dejan_54 8250430.1 LL
93963 AT1G01050 f LL_f_Dejan_47 12431354.5 LL
96399 AT1G01050 f LL_f_Dejan_48 11884118.5 LL
98835 AT1G01050 f LL_f_Dejan_53 8408509.1 LL
101271 AT1G01050 ntrc LL_ntrc_Dejan_46 12214783.1 LL
103707 AT1G01050 ntrc LL_ntrc_Dejan_50 1286828.3 LL
106143 AT1G01050 ntrc LL_ntrc_Dejan_42 1819043.9 LL
", header=T)

prot2 <- prot1
prot2$Accession <- "AT3G53620"
prot <- rbind(prot1, prot2)

plots <- list()
pdf("TEST_boxplot.pdf", onefile=T)
IDs <- unique(prot$Accession)
for (i in 1:length(IDs)){
  temp <- prot[(prot$Accession) == IDs[i], ]
  p <- ggplot(temp, aes(factor(genotype), value, fill = Light)) +
    geom_bar(stat="identity", position = "dodge") +
    scale_fill_brewer(palette = "Set1") +
    ggtitle(as.character(i))
  plots[[i]] <- p
  print(p)
}
dev.off()
R fitting and forecasting daily time series
I am working with a daily time series and I need to build a forecast for 90 days (or maybe more) based on my history - the current time series has roughly 298 data points. The issue I have is the famous flat line in the final forecast - and yes, I might not have a seasonality, but I am trying to work this out. Another issue is how to find the best model and adapt it from here on for this kind of behaviour. I created a test case to investigate this further and any help is appreciated. Thanks.

To start with:

x <- day_data  # My time series
z <- 90        # Days to forecast

low_bound_date <- as.POSIXlt(min(x$time), format = "%m/%d/%Y")  # oldest date in the DF
> low_bound_date
[1] "2015-12-21 PST"

> low_bound_date$yday  # Day in Julian
[1] 354

lbyear <- as.numeric(substr(low_bound_date, 1, 4))
> lbyear
[1] 2015

This is my time series content:

> ts
Time Series:
Start = c(2065, 4)
End = c(2107, 7)
Frequency = 7
  [2] 20.73 26.19 27.51 26.11 26.28 27.58 26.84 27.00 26.30 28.75 28.43 39.03 41.36 45.42 44.80 45.33 47.79 44.70 45.17
 [20] 34.90 32.54 32.75 33.35 34.76 34.11 33.59 33.60 38.08 30.45 29.66 31.09 31.36 31.96 29.30 30.04 30.85 31.13 25.09
 [39] 17.88 23.73 25.31 31.30 35.18 34.13 34.96 35.12 27.36 38.33 38.59 38.14 38.54 41.72 37.15 35.92 37.37 32.39 30.64
 [58] 30.57 30.66 31.16 31.50 30.68 32.21 32.27 32.55 33.61 34.80 33.53 33.09 20.90 6.91 7.82 15.78 7.25 6.19 6.38
 [77] 38.06 39.82 35.53 38.63 41.91 39.76 37.26 38.79 37.74 35.61 39.70 35.79 35.36 29.63 22.07 35.39 35.99 37.35 38.82
 [96] 25.80 21.31 18.85 9.52 20.75 36.83 44.12 37.79 34.45 36.05 16.39 21.84 31.39 34.26 31.50 30.87 28.88 42.83 41.52
[115] 42.34 47.35 44.47 44.10 44.49 26.89 18.17 40.44 43.93 41.56 39.98 40.31 40.59 40.17 40.22 40.50 32.68 35.89 36.06
[134] 34.30 22.67 12.56 13.29 12.34 28.00 35.27 36.57 33.78 32.15 33.58 34.62 30.96 32.06 33.05 30.66 32.47 30.42 32.83
[153] 31.74 29.39 22.39 12.58 16.46 5.36 4.01 15.32 32.79 31.66 32.02 27.60 31.47 31.61 34.96 27.77 31.91 33.94 33.43
[172] 26.94 28.38 21.42 24.51 23.82 31.71 26.64 27.96 29.29 29.25 28.70 27.02 27.62 30.90 27.46 27.37 26.46 27.77 13.61
[191] 5.87 12.18 5.68 4.15 4.35 4.42 16.42 25.18 26.06 27.39 27.57 28.86 15.18 5.19 5.61 8.28 7.78 5.13 4.90
[210] 5.02 5.27 16.31 25.01 26.19 25.96 24.93 25.53 25.56 26.39 26.80 26.73 26.00 25.61 25.90 25.89 13.80 6.66 6.41
[229] 5.28 5.64 5.71 5.38 5.76 7.20 7.27 5.55 5.31 5.94 5.75 5.93 5.77 6.57 5.52 5.51 5.47 5.69 19.75
[248] 29.22 30.75 29.63 30.49 29.48 31.83 30.42 29.27 30.40 29.91 32.00 30.09 28.93 14.54 7.75 5.63 17.17 22.27 24.93
[267] 35.94 37.42 33.13 25.88 24.27 37.64 37.42 38.33 35.20 21.32 7.32 4.81 5.17 17.49 23.77 23.36 27.60 26.53 24.99
[286] 24.22 23.76 24.10 24.22 27.06 25.53 23.40 37.07 26.52 25.19 28.02 28.53 26.67

First step, I get my data into a ts:

day_data_ts <- ts(x$avg_day, start = c(lbyear, low_bound_date$yday), frequency = 7)
plot(day_data_ts)
acf(day_data_ts)

Second step, I get my data into an msts:

day_data_msts <- msts(x$avg_day, seasonal.periods = c(7, 365.25), start = c(lbyear, low_bound_date$yday))
plot(day_data_msts)
acf(day_data_msts)

I did several fitting iterations to try and figure out the best fit and forecast model. The first fitting test is with the ts only.

fit1 <- HoltWinters(day_data_ts)
> fit1
Holt-Winters exponential smoothing with trend and additive seasonal component.
Call:
HoltWinters(x = day_data_ts)

Smoothing parameters:
 alpha: 1
 beta : 0.006757112
 gamma: 0

Coefficients:
         [,1]
a  28.0922449
b   0.1652477
s1  0.6241837
s2  1.9084694
s3  0.9913265
s4  0.8198980
s5 -1.7015306
s6 -1.2201020
s7 -1.4222449

fit2 <- tbats(day_data_ts)
> fit2
BATS(1, {0,0}, 0.8, -)
Parameters:
  Alpha: 1.309966
  Beta: -0.3011143
  Damping Parameter: 0.800001
Seed States:
          [,1]
[1,] 15.282259
[2,]  2.177787
Sigma: 5.501356
AIC: 2723.911

fit3 <- ets(day_data_ts)
> fit3
ETS(A,N,N)
Smoothing parameters:
  alpha = 0.9999
Initial states:
  l = 25.2275
sigma: 5.8506
     AIC     AICc      BIC
2756.597 2756.678 2767.688

fit4 <- auto.arima(day_data_ts)
> fit4
ARIMA(1,1,2)
Coefficients:
        ar1     ma1     ma2
     0.7396 -0.6897 -0.2769
s.e. 0.0545  0.0690  0.0621
sigma^2 estimated as 30.47: log likelihood=-927.9
AIC=1863.81   AICc=1863.94   BIC=1878.58

Second test is using msts. I also changed the ets model to MAM.

fit5 <- tbats(day_data_msts)
> fit5
BATS(1, {0,0}, 0.8, -)
Parameters:
  Alpha: 1.309966
  Beta: -0.3011143
  Damping Parameter: 0.800001
Seed States:
          [,1]
[1,] 15.282259
[2,]  2.177787
Sigma: 5.501356
AIC: 2723.911

fit6 <- ets(day_data_msts, model="MAN")
> fit6
ETS(M,A,N)
Smoothing parameters:
  alpha = 0.9999
  beta  = 9e-04
Initial states:
  l = 52.8658
  b = 3.9184
sigma: 0.3459
     AIC     AICc      BIC
3042.744 3042.949 3061.229

fit7 <- auto.arima(day_data_msts)
> fit7
ARIMA(1,1,2)
Coefficients:
        ar1     ma1     ma2
     0.7396 -0.6897 -0.2769
s.e. 0.0545  0.0690  0.0621
sigma^2 estimated as 30.47: log likelihood=-927.9
AIC=1863.81   AICc=1863.94   BIC=1878.58
You can forecast from a previously estimated model as follows (using the built-in time series LakeHuron):

library(forecast)
y <- LakeHuron
tsdisplay(y)

# estimate ARMA(1,1)
mod_2 <- Arima(y, order = c(1, 0, 1))

# make a forecast for 5 periods (years in this case)
fHuron <- forecast(mod_2, h = 5)

# show results in a table
fHuron

# plot results
plot(fHuron)

This will give you the forecast table and the forecast plot. Pay attention that an ARIMA model bases its forecast on previous values, so if we make a prediction over many periods the model will use already-predicted values to predict the next ones, which will reduce accuracy.

To fit an optimal ARIMA model, use this function:

library(R.utils)  # for the function 'withTimeout'

fitARIMA <- function(timeseriesObject, timeout) {
  final.aic <- Inf
  final.order <- c(0, 0, 0)
  for (p in 0:5) for (q in 0:5) {
    if (p == 0 && q == 0) {
      next
    }
    arimaFit = tryCatch(
      withTimeout(arima(timeseriesObject, order = c(p, 0, q)), timeout = timeout),
      error = function(err) FALSE,
      warning = function(err) FALSE
    )
    if (!is.logical(arimaFit)) {
      current.aic <- AIC(arimaFit)
      if (current.aic < final.aic) {
        final.aic <- current.aic
        final.order <- c(p, 0, q)
        final.arima <- arima(timeseriesObject, order = final.order)
      }
    } else {
      next
    }
  }
  final.order <- c(final.order, final.aic)
  final.order
}
How to calculate percentile in R?
In the R help (help(quantile)) you can see for Type 7: m = 1 - p, and p[k] = (k - 1) / (n - 1). In this case, p[k] = mode[F(x[k])]. This is used by S.

Now, I have an example:

w <- c(75, 64, 47.4, 66.9, 62.2, 62.2, 58.7, 63.5, 66.6, 64, 57, 69, 56.9, 50, 72.0)
sort(w)
 [1] 47.4 50.0 56.9 57.0 58.7 62.2 62.2 63.5 64.0 64.0 66.6 66.9 69.0 72.0 75.0
quantile(w)
   0%   25%   50%   75%  100%
47.40 57.85 63.50 66.75 75.00

How can I use the type 7 formula to get this result?
I'm having some trouble deciding if the answer is just:

> quantile(w, type = 7)
   0%   25%   50%   75%  100%
47.40 57.85 63.50 66.75 75.00

My problem is that the default for quantile is type = 7 and you already have that result. If you look at the code for quantile.default, there is a section for type = 7:

index <- 1 + (n - 1) * probs
lo <- floor(index)
hi <- ceiling(index)
x <- sort(x, partial = unique(c(lo, hi)))
qs <- x[lo]
i <- which(index > lo)
h <- (index - lo)[i]
qs[i] <- (1 - h) * qs[i] + h * x[hi[i]]
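Worked through by hand for the data above (a short check using only the numbers already shown): the sorted w has n = 15 values, so for p = 0.25 the index is 1 + (15 - 1) * 0.25 = 4.5, giving lo = 4, hi = 5 and h = 0.5; the 4th and 5th sorted values are 57.0 and 58.7, so the quantile is (1 - 0.5) * 57.0 + 0.5 * 58.7 = 57.85, exactly the 25% value that quantile(w) returned. The same interpolation at index 1 + 14 * 0.75 = 11.5, between 66.6 and 66.9, gives the 75% value 66.75, and index 8 for p = 0.5 picks out the 8th value 63.5 directly.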