Hello everyone,
Hope you all are doing well. I'm facing a problem while trying to calculate seasonal observations (using the dm2seasonal function from the hydroTSM package in RStudio) from daily time-series data (1951-2020).
The error occurs specifically when calculating the winter season (comprising December, January, and February). The error is as follows:
Error in 1:(nrow(s.a) - 1) : argument of length 0
A subset of the raw data I'm using to calculate the seasonal observations:
Date ETO
01-01-1951 2.53
02-01-1951 2.95
03-01-1951 2.99
04-01-1951 2.71
05-01-1951 2.74
06-01-1951 3.02
...
27-12-2020 2.73
28-12-2020 2.57
29-12-2020 2.86
30-12-2020 2.96
31-12-2020 3.11
I'm completely confused and looking for a suitable solution or any alternative way to do this task. Thank you.
R version 3.5.1.
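For reference, this is roughly how I'm preparing the series before the call; a minimal sketch, assuming the Date column is in %d-%m-%Y format and that dm2seasonal wants a daily zoo object (the file name eto.csv and the choice of FUN = mean are placeholders for my setup):
library(zoo)
library(hydroTSM)
# Read the daily data shown above; "eto.csv" is a placeholder file name
raw <- read.csv("eto.csv", stringsAsFactors = FALSE)
raw$Date <- as.Date(raw$Date, format = "%d-%m-%Y")   # dates are DD-MM-YYYY
# Build a daily zoo series indexed by Date
eto <- zoo(raw$ETO, order.by = raw$Date)
# Winter (DJF) seasonal values; FUN = mean is a placeholder, use sum for seasonal totals
winter <- dm2seasonal(eto, season = "DJF", FUN = mean, na.rm = TRUE)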
I want to create a correlation matrix with a Data frame that goes like this:
BGI WINDSPEED SLOPE
4277.2 4.23 7.54
4139.8 5.25 8.63
4652.9 3.59 6.54
3942.6 4.42 10.05
That is just an example; I actually have more than 20 columns and over 40,000 rows.
I tried the following code:
corrplot(cor(Site.2),method = "color",outline="white")
Matrix_Site=cor(Site.2,method=c("spearman"))
but every time the same error appears:
Error in cor(Site.2) : 'x' must be numeric
I would like to correlate every variable of the data frame with every other, and produce both a graph and a table of the result, similar to this.
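For reference, a rough sketch of the direction I suspect is needed, dropping the non-numeric columns before cor(); the assumption that Site.2 contains factor or character columns is mine, not something I have confirmed:
library(corrplot)
# Keep only numeric columns; assumes the error comes from factor/character columns
num_cols <- sapply(Site.2, is.numeric)
Site.2_num <- Site.2[, num_cols]
Matrix_Site <- cor(Site.2_num, method = "spearman", use = "pairwise.complete.obs")
corrplot(Matrix_Site, method = "color", outline = "white")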
I have this Spark table:
xydata
y: num 11.00 22.00 33.00 ...
x0: num 1.00 2.00 3.00 ...
x1: num 2.00 3.00 4.00 ...
...
x788: num 2.00 3.00 4.00 ...
and a handle named xy_df that is connected to this table.
I want to invoke the selectExpr function to calculate the mean, something like:
xy_centered <- xy_df %>%
  spark_dataframe() %>%
  invoke("selectExpr", list("( y0-mean(y0) ) AS y0mean"))
which I would also like to apply to all the other columns.
But when I run it, it gives this error:
Error: org.apache.spark.sql.AnalysisException: expression 'y0' is neither present in the group by, nor is it an aggregate function. Add to group by or wrap in first() (or first_value) if you don't care which value you get.;
I know this happens because, under normal SQL rules, I didn't put a GROUP BY clause for the column used inside the aggregate function (mean). How do I add the GROUP BY to the invoke method?
Previously, I managed to accomplish this another way, by:
1. Calculating the mean of each column with summarize_all
2. Collecting the means into R
3. Applying the means using invoke with selectExpr
as explained in this answer, but now I'm trying to speed up the execution time a bit by keeping every operation inside Spark itself, without retrieving anything into R.
My Spark version is 1.6.0.
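For reference, the closest I have come to expressing this without a GROUP BY is the window-function form below; whether Spark 1.6 accepts an empty OVER () clause here (window functions need a HiveContext on that version, as far as I know) is an assumption I have not verified, and y0 stands in for each of the columns above:
library(sparklyr)
library(dplyr)
# Sketch only: avg(y0) OVER () treats the whole table as one window, so the
# column mean can be subtracted row by row without any GROUP BY.
xy_centered <- xy_df %>%
  spark_dataframe() %>%
  invoke("selectExpr", list("y0 - avg(y0) OVER () AS y0_centered")) %>%
  sdf_register("xy_centered")   # register the result so dplyr can see it again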
I have a set of functions that save the column numbers of my data. For example, my data looks something like this:
>MergedData
[[1]]
Date EUR.HIGH EUR.LOW EUR.CLOSE EUR.OPEN EUR.LAST
01/01/16 1.00 1.00 1.25 1.30 1.24
[[2]]
Date AUD.HIGH AUD.LOW AUD.CLOSE AUD.OPEN AUD.LAST
01/01/16 1.00 1.00 1.25 1.30 1.24
I have 29 of the above currencies. So in this case, MergedData[[1]] will return all of my Euro prices, and so on for 29 currencies.
I also have a function in R that calculates the variables and saves the numbers 1 to 29 that correspond to the currencies. This code calculates values in the first row of my data, i.e.:
trending <- intersect(which(!ma.sig[1,]==0), which(!pricebreak[1,]==0))
which returns something like:
>sig.nt
[1] 1 2 5...
And so I can use this to pull up 'trending' currencies via a for() loop:
for (i in sig.nt) {
  MergedData[[i]]
  # ...misc. code for calculations on trending currencies...
}
I want to be able to 'save' my trending currencies for future reference and calculations. The problem is that the sig.nt variable changes with every new row. I was thinking of using lockBinding:
sig.exist <- sig.nt                    # saves the existing trend
lockBinding('sig.exist', .GlobalEnv)   # lock it so it cannot be reassigned
But wouldn't this still get overwritten every time I run my script? Help would be much appreciated!
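For what it's worth, the direction I'm leaning toward instead of lockBinding is accumulating each row's result into a list and writing it to disk; the names trend_history and trend_history.rds below are placeholders of mine:
# Keep one entry per row instead of overwriting a single variable
trend_history <- list()
for (r in seq_len(nrow(ma.sig))) {
  sig.nt <- intersect(which(ma.sig[r, ] != 0),
                      which(pricebreak[r, ] != 0))
  trend_history[[r]] <- sig.nt          # trending currencies for row r
}
saveRDS(trend_history, "trend_history.rds")    # survives restarts of the script
# trend_history <- readRDS("trend_history.rds") to reload it later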
I want to calculate the mean for each Day, but only for a portion of the day (Time = 12-14). This code works for me, but I have to enter each day as a new line of code, which will amount to hundreds of lines.
This seems like it should be simple to do. I've done it easily when the grouping variables are the same, but I don't know how to do it when I don't want to include all values for the day.
Is there a better way to do this?
sapply(sap[sap$Day==165 & sap$Time %in% c(12,12.1,12.2,12.3,12.4,12.5,13,13.1,13.2,13.3,13.4,13.5, 14), ],mean)
sapply(sap[sap$Day==166 & sap$Time %in% c(12,12.1,12.2,12.3,12.4,12.5,13,13.1,13.2,13.3,13.4,13.5, 14), ],mean)
Here's what the data looks like:
Day Time StomCond_Trunc
165 12 33.57189926
165 12.1 50.29437636
165 12.2 35.59876214
165 12.3 24.39879768
Try this:
aggregate(StomCond_Trunc~Day,data=subset(sap,Time>=12 & Time<=14),mean)
If you have a large dataset, you may also want to look into the data.table package. Converting a data.frame to a data.table is quite easy.
Example:
Large(ish) dataset:
df <- data.frame(Day=1:1000000, Time=sample(1:14,1000000,replace=T), StomCond_Trunc=rnorm(1000000)*20)
Using aggregate on the data.frame:
>system.time(aggregate(StomCond_Trunc~Day,data=subset(df,Time>=12 & Time<=14),mean))
user system elapsed
16.255 0.377 24.263
Converting it to a data.table:
library(data.table)
dt <- data.table(df, key="Time")
>system.time(dt[Time>=12 & Time<=14,mean(StomCond_Trunc),by=Day])
user system elapsed
9.534 0.178 15.270
Update from Matthew: this timing has improved dramatically since this was originally answered, due to a new optimization feature in data.table 1.8.2.
Retesting the difference between the two approaches, using data.table 1.8.2 in R 2.15.1:
df <- data.frame(Day=1:1000000,
                 Time=sample(1:14,1000000,replace=T),
                 StomCond_Trunc=rnorm(1000000)*20)
system.time(aggregate(StomCond_Trunc~Day,data=subset(df,Time>=12 & Time<=14),mean))
# user system elapsed
# 10.19 0.27 10.47
dt <- data.table(df,key="Time")
system.time(dt[Time>=12 & Time<=14,mean(StomCond_Trunc),by=Day])
# user system elapsed
# 0.31 0.00 0.31
Using your original method, but with less typing:
sapply(sap[sap$Day==165 & sap$Time %in% seq(12, 14, 0.1), ],mean)
However, this is only slightly better than your original method. It's not as flexible as the other answers, since it depends on the 0.1 increments in your time values. The other methods don't care about the increment size, which makes them more versatile. I'd recommend @Maiasaura's answer with data.table.