proper date format for time series precipitation analysis - r

I have 100 years of precipitation and temperature data that I would like to analyze using the 'seas' package. I have tried several formats for the date column and get the following error every time:
Error in seas.df.check(x, orig, var) :
a ‘date’ column must exist in ‘srs_precip’
Example data below. I have tried to coerce the Gregorian dates to Date, and also to POSIX (the Date.POSIX column), but I still get the error message from the 'seas' library.
Year yr.m.d Date date.J Inches mm_SRS Max_Temp_F Max_Temp_C Min_Temp_F Min_Temp_C Date.POSIX
1 1919 1919-01-01 1919-01-01 1 0 0 39 3.888889 26 -3.333333333 1919-01-01
2 1919 1919-01-02 1919-01-02 2 0 0 35 1.666667 19 -7.222222222 1919-01-02
3 1919 1919-01-03 1919-01-03 3 0 0 40 4.444444 14 -10 1919-01-03
4 1919 1919-01-04 1919-01-04 4 0 0 52 11.111111 20 -6.666666667 1919-01-04
5 1919 1919-01-05 1919-01-05 5 0 0 43 6.111111 20 -6.666666667 1919-01-05
6 1919 1919-01-06 1919-01-06 6 0 0 56 13.333333 31 -0.555555556 1919-01-06
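One thing worth checking, going only by the wording of the error: R column names are case-sensitive, and the data above has Date and Date.POSIX but no column spelled exactly date. The sketch below rebuilds two rows of the sample with invented values and adds a lowercase date column of class Date. That this is all 'seas' needs is an assumption, not something confirmed against the package.

```r
# Two rows shaped like the sample above (values are illustrative only).
srs_precip <- data.frame(
  Date   = c("1919-01-01", "1919-01-02"),
  mm_SRS = c(0, 0),
  stringsAsFactors = FALSE
)

# Add a column named exactly "date" (lowercase) holding real Date objects.
srs_precip$date <- as.Date(srs_precip$Date, format = "%Y-%m-%d")

str(srs_precip$date)
```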

Related

Calendar heatmap function will not execute

I have the following code for a calendar heatmap. I have created one previously using this code. But when I try to create another one, I enter the code and press enter but plus signs appear and I cannot execute.
library(ggplot2)
source("https://raw.githubusercontent.com/iascchen/VisHealth/master/R/calendarHeat.R")
library(plyr)
library(plotly)
r2g <- c("#D61818", "#B5E384")
calendarHeat(heatmap1$date, heatmap1$ROI, ncolors = 2, color = "r2g", varname="30-day ROI") # here you had backquotes at the end of the line
heatmap1 is the name of the data.
A cut of the data is shown below
Row Date ROI
1 2010-08-17 0
2 2010-08-18 0
3 2010-08-19 0
4 2010-08-20 0
5 2010-08-21 1
6 2010-08-22 1
7 2010-08-23 1
8 2010-08-24 1
9 2010-08-25 1
10 2010-08-26 1
11 2010-08-27 1
12 2010-08-28 0
13 2010-08-29 0
14 2010-08-30 0
15 2010-08-31 0
16 2010-09-01 1
17 2010-09-02 1
18 2010-09-03 1
19 2010-09-04 1
20 2010-09-05 0
21 2010-09-06 1
22 2010-09-07 1
23 2010-09-08 0
24 2010-09-09 0
25 2010-09-10 0
26 2010-09-11 0
27 2010-09-12 0
28 2010-09-13 0
29 2010-09-14 0
30 2010-09-15 0
31 2010-09-16 0
I don't understand why the code will work when executed previously, but now doesn't work. Any ideas?
There was a syntax error with the quotes. R shows a + continuation prompt whenever a line leaves a ', a ", a { or a ( unmatched, and it waits for you to finish the expression.
This is explained in this pdf, page 4-5.
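A quick way to see the same effect outside the console is to hand the line to parse(), which fails on an unfinished expression. This is a minimal sketch; the helper name parses_cleanly and the two sample strings are made up for illustration.

```r
# TRUE when the text parses as complete R code, FALSE otherwise.
# parse() only checks syntax; it does not run the code.
parses_cleanly <- function(code) {
  !inherits(try(parse(text = code), silent = TRUE), "try-error")
}

# Hypothetical call with a string that is never closed.
unclosed <- "calendarHeat(x, y, varname = \"30-day ROI)"
# The same call with the closing quote in place.
closed <- "calendarHeat(x, y, varname = \"30-day ROI\")"

parses_cleanly(unclosed)
parses_cleanly(closed)
```

At the interactive prompt the first string would leave R waiting with +; pressing Esc (or Ctrl+C in a terminal) cancels the pending input.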

Why is my R code for filtering data producing different results with "fread()" and "ffdf()"?

I have a huge file with 7 million records and 160 variables. I came to know that fread() and read.csv.ffdf() are two ways to handle such big data. But when I try to use dplyr to filter these two data sets, I get different results. Below is a small subset of my data-
sample_data
AGE AGE_NEONATE AMONTH AWEEKEND
2 18 5 0
3 32 11 0
4 67 7 0
5 37 6 1
6 57 5 0
7 50 6 0
8 59 12 0
9 44 9 0
10 40 9 0
11 27 3 0
12 59 8 0
13 44 7 0
14 81 10 0
15 59 6 1
16 32 10 0
17 90 12 1
18 69 7 0
19 62 11 1
20 85 6 1
21 43 10 0
Code1
sample_data <- fread("/user/sample_data.csv", stringsAsFactors = T)
age_filter<-sample_data%>%filter(!(is.na(AGE)), between(as.numeric(AGE),65 , 95))
Result1-
AGE AGE_NEONATE AMONTH AWEEKEND
1 67 NA 7 0
2 81 NA 10 0
3 90 NA 12 1
4 69 NA 7 0
5 85 NA 6 1
Code2-
sample_data <- read.csv.ffdf(file="C:/Users/sample_data.csv", header=F ,fill=T)
header.true <- function(df) {
names(df) <- as.character(unlist(df[1,]))
df[-1,]
}
sample_data<-tbl_ffdf(sample_data)
sample_data<-header.true(sample_data)
age_filter<-sample_data%>%filter(!(is.na(AGE)), between(as.numeric(AGE),65 , 95))
Result2-
AGE AGE_NEONATE AMONTH AWEEKEND
1 81 10 0
2 90 12 1
3 85 6 1
I know that my 1st code is correct and gives me the correct results. What am I doing wrong in the 2nd code?
I haven't really tried running your code, but from what I can see, I suspect the following:
In your 2nd code version, you are reading the headers as part of the data. This leads to all the columns being imported as character rather than numeric.
In addition, most likely you have default.stringsAsFactors() returning TRUE, meaning that the imported character columns are treated as factors.
Now I guess that your between is being applied to factor levels between 65 and 95, rather than to the actual numbers. Since you probably don't have data for every year (age), 67 and 69 are likely mapped to factor levels below 65 (i.e. as.numeric(AGE) will return you the factor levels the numbers map to, and not the numbers as you see them when printing).
Try to use stringsAsFactors = FALSE or convert explicitly to character after reading.
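The factor effect described above can be reproduced on a few of the AGE values from the sample. This is a small sketch with a hand-built factor, not the ffdf object from the question.

```r
# A handful of AGE values stored as a factor, as they would be if the
# column had been read as text and converted to factor.
age <- factor(c("18", "32", "67", "81", "90", "69", "85"))

# as.numeric() on a factor returns the internal level codes, not the ages.
codes <- as.numeric(age)

# Going through character first recovers the real numbers.
ages <- as.numeric(as.character(age))

# Filtering on the codes finds nothing between 65 and 95 here,
# while filtering on the real ages keeps five values.
sum(codes >= 65 & codes <= 95)
ages[ages >= 65 & ages <= 95]
```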

transform values in data frame, generate new values as 100 minus current value

I'm currently working on a script which will eventually plot the accumulation of losses from cell divisions. Firstly I generate a matrix of values and then I add the number of times 0 occurs in each column - a 0 represents a loss.
However, I am now thinking that a nice plot would be a degradation curve. So, given the following example;
>losses_plot_data <- melt(full_losses_data, id=c("Divisions", "Accuracy"), value.name = "Losses", variable.name = "Size")
> full_losses_data
Divisions Accuracy 20 15 10 5 2
1 0 0 0 0 3 25
2 0 0 0 1 10 39
3 0 0 1 3 17 48
4 0 0 1 5 23 55
5 0 1 3 8 29 60
6 0 1 4 11 34 64
7 0 2 5 13 38 67
8 0 3 7 16 42 70
9 0 4 9 19 45 72
10 0 5 11 22 48 74
Is there a way I can easily turn this table into being 100 minus the numbers shown in the table? If I can plot that data instead of my current data, I would have a lovely curve of degradation from 100% down to however many cells have been lost.
Assuming you do not want to do that for the first column:
fld <- full_losses_data
fld[, 2:ncol(fld)] <- 100 - fld[, -1]
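For a self-contained check, here is the same idea on a cut-down table. The three rows and two size columns are taken from the example above, and the layout (Divisions first, then one column per size) is an assumption about how full_losses_data is built.

```r
# Cut-down stand-in for full_losses_data; check.names = FALSE keeps the
# numeric column names "5" and "2" as they are.
full_losses_data <- data.frame(
  Divisions = 1:3,
  `5` = c(3, 10, 17),
  `2` = c(25, 39, 48),
  check.names = FALSE
)

fld <- full_losses_data
# Replace every column except the first with 100 minus its value.
fld[, -1] <- 100 - fld[, -1]
fld
```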

R: calculate and superimpose on a ggplot graph minute-based totals for an event series

Consider a data frame df with an extract from a web server access log, with two fields (sample below, duration is in msec and to simplify the example, let's ignore the date).
time,duration
18:17:26.552,8
18:17:26.632,10
18:17:26.681,12
18:17:26.733,4
18:17:26.778,5
18:17:26.832,5
18:17:26.889,4
18:17:26.931,3
18:17:26.991,3
18:17:27.040,5
18:17:27.157,4
18:17:27.209,14
18:17:27.249,4
18:17:27.303,4
18:17:27.356,13
18:17:27.408,13
18:17:27.450,3
18:17:27.506,13
18:17:27.546,3
18:17:27.616,4
18:17:27.664,4
18:17:27.718,3
18:17:27.796,10
18:17:27.856,3
18:17:27.909,3
18:17:27.974,3
18:17:28.029,3
qplot(time, duration, data=df) gives me a graph of the duration. I'd like to superimpose a line showing the number of requests for each minute. Ideally, this line would have a single data point per minute, at the :30 sec point. If that's too complicated, an acceptable alternative is a step line with the same value (the count of requests) throughout each minute.
One way is to use trunc(df$time, units=c("mins")), calculate the count of requests per minute into a new column, and then graph it.
I'm asking if there is, perhaps, a more direct way to accomplish the above. Thanks.
Following may be helpful. Create a data frame with steps and plot:
time duration sec sec2 diffsec2 step30s steps
1 18:17:26.552 8 26.552 552 0 0 0
2 18:17:26.632 10 26.632 632 80 1 1
3 18:17:26.681 12 26.681 681 49 0 0
4 18:17:26.733 4 26.733 733 52 1 1
5 18:17:26.778 5 26.778 778 45 0 0
6 18:17:26.832 5 26.832 832 54 1 1
7 18:17:26.889 4 26.889 889 57 1 2
8 18:17:26.931 3 26.931 931 42 0 0
9 18:17:26.991 3 26.991 991 60 1 1
10 18:17:27.040 5 27.040 040 -951 0 0
11 18:17:27.157 4 27.157 157 117 1 1
12 18:17:27.209 14 27.209 209 52 1 2
13 18:17:27.249 4 27.249 249 40 0 0
14 18:17:27.303 4 27.303 303 54 1 1
15 18:17:27.356 13 27.356 356 53 1 2
16 18:17:27.408 13 27.408 408 52 1 3
17 18:17:27.450 3 27.450 450 42 0 0
18 18:17:27.506 13 27.506 506 56 1 1
19 18:17:27.546 3 27.546 546 40 0 0
20 18:17:27.616 4 27.616 616 70 1 1
21 18:17:27.664 4 27.664 664 48 0 0
22 18:17:27.718 3 27.718 718 54 1 1
23 18:17:27.796 10 27.796 796 78 1 2
24 18:17:27.856 3 27.856 856 60 1 3
25 18:17:27.909 3 27.909 909 53 1 4
26 18:17:27.974 3 27.974 974 65 1 5
27 18:17:28.029 3 28.029 029 -945 0 0
> ggplot(ddf) + geom_point(aes(x=time, y=duration)) + geom_line(aes(x=time, y=steps, group=1), color='red')
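For the per-minute count the question asks about, a more direct route might look like the sketch below. It is not the method used in the answer above: it parses the time strings, counts rows per minute with table(), and places one point per minute at the :30 mark. The three-row df is invented so that it spans two minutes, and the column names t, minute, requests and mid are illustrative.

```r
# Tiny stand-in for the access log (the last row is made up).
df <- data.frame(
  time     = c("18:17:26.552", "18:17:27.040", "18:18:01.100"),
  duration = c(8, 5, 4),
  stringsAsFactors = FALSE
)

# Parse the time of day; with no date in the string, today's date is used.
df$t <- as.POSIXct(df$time, format = "%H:%M:%OS", tz = "UTC")

# Count requests per minute.
counts <- as.data.frame(
  table(minute = format(df$t, "%H:%M")),
  responseName = "requests",
  stringsAsFactors = FALSE
)

# Put each count at the :30 second mark of its minute.
counts$mid <- as.POSIXct(paste0(counts$minute, ":30"),
                         format = "%H:%M:%S", tz = "UTC")

# Overlay the counts on the duration points, if ggplot2 is installed.
if (requireNamespace("ggplot2", quietly = TRUE)) {
  p <- ggplot2::ggplot() +
    ggplot2::geom_point(data = df, ggplot2::aes(x = t, y = duration)) +
    ggplot2::geom_line(data = counts, ggplot2::aes(x = mid, y = requests),
                       colour = "red")
}
```

Swapping geom_line for geom_step would give the step-line alternative mentioned in the question. Durations and counts share one y axis here, so the two series may need rescaling on real data.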

R tvm financial package

I'm trying to estimate the present value of a stream of payments using the tvm function in the financial package.
y <- tvm(pv=NA,i=2.5,n=1:10,pmt=-c(5,5,5,5,5,8,8,8,8,8))
The result that I obtain is:
y
Time Value of Money model
I% #N PV FV PMT Days #Adv P/YR C/YR
1 2.5 1 4.99 0 -5 30 0 12 12
2 2.5 2 9.97 0 -5 30 0 12 12
3 2.5 3 14.94 0 -5 30 0 12 12
4 2.5 4 19.90 0 -5 30 0 12 12
5 2.5 5 24.84 0 -5 30 0 12 12
6 2.5 6 47.65 0 -8 30 0 12 12
7 2.5 7 55.54 0 -8 30 0 12 12
8 2.5 8 63.40 0 -8 30 0 12 12
9 2.5 9 71.26 0 -8 30 0 12 12
10 2.5 10 79.09 0 -8 30 0 12 12
There is a jump in the PV from 5 to 6 (when the price changes to 8) that appears to be incorrect. This affects the result in y[10,3] which is the result that I'm interested in obtaining.
The NPV formula in Excel produces similar results when the payments are the same throughout the whole stream. However, when the vector of payments is variable, the results of the tvm function and of NPV differ. I need to obtain the same result that the NPV formula provides in Excel.
What should I do to make this work?
The cf formula helps but it is not always consistent with Excel.
I solved my problem using the following function:
npv<-function(a,b,c) sum(a/(1+b)^c)
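A usage sketch for that function, with a = payments, b = the rate per period and c = the period numbers. The monthly rate of 0.025/12 is an assumption read off the tvm output above (I% of 2.5 with P/YR of 12); use whatever per-period rate the Excel sheet uses.

```r
# Discount each payment back to time 0 and add them up.
npv <- function(a, b, c) sum(a / (1 + b)^c)

payments <- c(5, 5, 5, 5, 5, 8, 8, 8, 8, 8)
rate     <- 0.025 / 12   # assumed per-period (monthly) rate
periods  <- 1:10

# Present value of the whole variable stream.
pv_all <- npv(payments, rate, periods)

# Sanity check on the constant part: the first two payments alone
# match the second PV row of the tvm output when rounded.
round(npv(payments[1:2], rate, periods[1:2]), 2)
```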
