I have this Spark table:
xydata
y: num 11.00 22.00 33.00 ...
x0: num 1.00 2.00 3.00 ...
x1: num 2.00 3.00 4.00 ...
...
x788: num 2.00 3.00 4.00 ...
and a handle named xy_df that is connected to this table.
I want to invoke the selectExpr method to center each column by subtracting its mean, something like:
xy_centered <- xy_df %>%
  spark_dataframe() %>%
  invoke("selectExpr", list("(y0 - mean(y0)) AS y0mean"))
and likewise for all the other columns.
But when I run it, it gives this error:
Error: org.apache.spark.sql.AnalysisException: expression 'y0' is neither present in the group by, nor is it an aggregate function. Add to group by or wrap in first() (or first_value) if you don't care which value you get.;
I know this happens because, under the usual SQL rules, a column used inside an aggregate function (mean) must also appear in a GROUP BY clause. How do I supply the GROUP BY through the invoke method?
Previously, I managed to accomplish this another way:
Calculate the mean of each column by summarize_all
Collect the mean inside R
Apply this mean using invoke and selectExpr
as explained in this answer, but now I'm trying to speed up execution a bit by keeping all the operations inside Spark itself, without retrieving anything into R.
My Spark version is 1.6.0.
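One possibility that keeps everything inside Spark is a window aggregate, which needs no GROUP BY at all. A minimal sketch, assuming your Spark 1.6 session has window-function support (a HiveContext) and reusing the y0 column name from the snippet above:

xy_centered <- xy_df %>%
  spark_dataframe() %>%
  # avg(y0) OVER () is a window aggregate: every row sees the grand
  # mean, so no GROUP BY is required for the subtraction
  invoke("selectExpr", list("y0 - avg(y0) OVER () AS y0_centered")) %>%
  sdf_register("xy_centered")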
My R version is 3.5.1.
I want to create a correlation matrix with a Data frame that goes like this:
BGI WINDSPEED SLOPE
4277.2 4.23 7.54
4139.8 5.25 8.63
4652.9 3.59 6.54
3942.6 4.42 10.05
I put that as an example but I have more than 20 columns and over 40,000 entries.
I tried the following code:
corrplot(cor(Site.2), method = "color", outline = "white")
Matrix_Site = cor(Site.2, method = c("spearman"))
but every time the same error appears:
Error in cor(Site.2) : 'x' must be numeric
I would like to correlate every variable of the data frame with every other variable, and create a graph and a table from it, similar to this.
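For reference, that error usually means some columns of Site.2 are not numeric (factors, characters, dates). A minimal sketch that keeps only the numeric columns first (the use = "pairwise.complete.obs" argument is an optional guard against NAs):

library(corrplot)
# cor() stops with "'x' must be numeric" if any non-numeric column
# is present, so subset to the numeric ones first
Site.2.num <- Site.2[sapply(Site.2, is.numeric)]
Matrix_Site <- cor(Site.2.num, method = "spearman",
                   use = "pairwise.complete.obs")
corrplot(Matrix_Site, method = "color", outline = "white")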
I have a data frame with climatic values like temperature_max, temperature_min... in different locations. The data is a time series, and on some specific days no data was registered. I would like to impute those days, taking into account both the date and the location (the place variable in the data frame).
I have tried to impute those missing values with Amelia, but no imputation is done and only a warning is printed.
Checking the variables with head(df):
PLACE DATE TEMP_MAX TEMP_MIN TEMP_AVG
F 12/01/2007 19.7 2.5 10.1
F 13/01/2007 18.8 3.5 10.4
F 14/01/2007 17.3 2.4 10.4
F 15/01/2007 19.5 4.0 9.2
F 16/01/2007
F 17/01/2007 21.5 2.8 9.7
F 18/01/2007 17.7 3.3 12.9
F 19/01/2007 18.3 3.8 9.7
A 16/01/2007 17.7 3.4 9.7
A 17/01/2007
A 18/01/2007 19.7 6.2 10.4
A 19/01/2007 17.7 3.8 10.1
A 20/01/2007 18.6 3.8 12.9
This is just some of the records of my data set.
DF <- amelia(df, m = 4, ts = "DATE", cs = "PLACE")
where DATE is the time-series variable (01/01/2001, 02/01/2001, 03/01/2001...), but if you filter by PLACE the time series are not equal (they do not share the same start and end dates).
I have 3 questions:
I am not sure whether the time series should be complete for all the places, i.e., have the same start and end dates everywhere.
I am not using the lags or polytime parameters, so am I imputing correctly with respect to the time-series structure? I am not sure how to use the lags parameter, although I have read the package documentation.
The last question: when I run that code, a warning appears and no imputation is done.
Warning: There are observations in the data that are completely missing.
These observations will remain unimputed in the final datasets.
-- Imputation 1 --
No missing data in bootstrapped sample: EM chain unnecessary
-- Imputation 2 --
No missing data in bootstrapped sample: EM chain unnecessary
-- Imputation 3 --
No missing data in bootstrapped sample: EM chain unnecessary
-- Imputation 4 --
No missing data in bootstrapped sample: EM chain unnecessary
Can someone help me with this?
Thanks very much for your time!
For the software it does not matter whether you have different start and end dates for different places; that is more a question for you and your understanding of the data. I would ask myself whether those periods count as missing data (missing at random), and therefore whether empty rows should be created for them in the data set.
You want to use lags in order to use past values of a variable to improve the prediction of its missing values. It is not mandatory (the function can impute missing data even without such a specification), but it can be useful.
I contacted the author of the package, and he told me that you need to specify the splinetime or polytime arguments to make sure that Amelia uses the time-series information when imputing. For instance, if you set polytime = 3, it will impute based on a cubic of time. If you do that, I think you shouldn't see that warning anymore.
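A minimal sketch of what that might look like for the data above (the argument values are illustrative, and it assumes DATE has already been converted to a numeric or Date index, as Amelia expects for ts):

library(Amelia)
# polytime = 3 imputes with a cubic trend in time (add intercs = TRUE
# for a separate trend per PLACE); lags/leads use the neighbouring
# observations of each temperature variable as extra predictors
DF <- amelia(df, m = 4, ts = "DATE", cs = "PLACE",
             polytime = 3,
             lags  = c("TEMP_MAX", "TEMP_MIN", "TEMP_AVG"),
             leads = c("TEMP_MAX", "TEMP_MIN", "TEMP_AVG"))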
I have the following dataset:
Class Value
Drive 9.5
Analyser 6.35
GameGUI 12.09
Drive 9.5
Analyser 5.5
GameGUI 2.69
Drive 9.5
Analyser 9.10
GameGUI 6.1
I want to retrieve the classes whose values are all the same, which in the example above would be Drive. To do that I have the following command:
dataset[as.logical(ave(dataset$Value, dataset$Class, FUN = function(x) all(x==1))), ]
But this command returns only the classes whose value is always 1. What I want is different: I don't want to hard-code a specific value.
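One way to avoid hard-coding the value, as a minimal sketch: compare each group's values to that group's own first value, so any class whose values are all equal (Drive here) is kept:

# all(x == x[1]) is TRUE exactly when a class's values are constant,
# whatever that constant happens to be
dataset[as.logical(ave(dataset$Value, dataset$Class,
                       FUN = function(x) all(x == x[1]))), ]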
I have a bunch of functions that save the column numbers of my data. For example, my data looks something like:
>MergedData
[[1]]
Date EUR.HIGH EUR.LOW EUR.CLOSE EUR.OPEN EUR.LAST
01/01/16 1.00 1.00 1.25 1.30 1.24
[[2]]
Date AUD.HIGH AUD.LOW AUD.CLOSE AUD.OPEN AUD.LAST
01/01/16 1.00 1.00 1.25 1.30 1.24
I have 29 of the above currencies. So in this case, MergedData[[1]] will return all of my Euro prices, and so on for 29 currencies.
I also have a function in R that calculates the variables and saves the numbers 1 to 29 that correspond to the currencies. This code calculates values from the first row of my data, i.e.:
trending <- intersect(which(!ma.sig[1,]==0), which(!pricebreak[1,]==0))
which returns something like:
>sig.nt
[1] 1 2 5...
And so I can use this to pull up 'trending' currencies via a for() loop:
for (i in sig.nt) {
  MergedData[[i]]
  ...misc. code for calculations on trending currencies...
}
I want to be able to 'save' my trending currencies for future reference and calculations. The problem is that the sig.nt variable changes with every new row. I was thinking of using the lockBinding command:
sig.exist <- sig.nt # saves existing trend
lockBinding("sig.exist", .GlobalEnv)
But wouldn't this still get overwritten every time I run my script? Help would be much appreciated!
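For what it's worth, a minimal sketch of an alternative (reusing ma.sig, pricebreak, and the intersect() call from above; untested): collect each row's trending set into a list and persist it to disk, instead of locking a single binding that every new row would need to overwrite:

# one entry per row of ma.sig; each entry is that row's trending set
trend.history <- list()
for (r in seq_len(nrow(ma.sig))) {
  trend.history[[r]] <- intersect(which(!ma.sig[r, ] == 0),
                                  which(!pricebreak[r, ] == 0))
}
saveRDS(trend.history, "trend_history.rds") # reload later with readRDS()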
I have a simple function like this:
f <- function(x) { ymd_h(x[1]) + x[2] }
Here I use ymd_h() and hours() from the lubridate package. But when I try to call this function from apply() like this:
result<-apply(wind.new[,c("date","hors")],1,f)
(where wind.new is a data frame with date, hors, and other columns)
I get the following error:
Error in FUN(newX[, i], ...) : could not find function "ymd_h"
Obviously ymd_h is not visible inside apply(). How would I fix this?
UPDATE:
Actually, I have the data frame shown below and want to parse the date column, which is in YYYYMMDDHH format, and add the hors column to it as hours. I thought the best approach would be the lubridate package, and I used the function above for that. But the execution took a very long time and the output was not correct. Any hints, please?
date hors u v ws wd
2009070100 1 2.34 -0.79 2.47 108.68
2009070100 2 2.18 -0.99 2.4 114.31
2009070100 3 2.2 -1.21 2.51 118.71
............
2009070100 47 2.3 -2.18 3.17 133.5
2009070100 48 1.93 -1.87 2.69 134.12
2009070112 1 2.77 -0.65 2.85 103.17
.........
I am sure that @DrewSteen's comment regarding loading the lubridate package onto the search path is correct.
I think you will also have issues with your function, because apply() coerces the data frame to a matrix and will return the incorrect type (more coercions would be required).
Changing your function to take the date-time and hours data as separate arguments should help.
library(lubridate)
# date and hours arrive as separate vectors, so apply()'s matrix
# coercion is avoided; as.character() keeps ymd_h() off the numeric
f <- function(date, hours) { ymd_h(date) + hours(hours) }
result <- with(wind.new, f(as.character(date), hors))
ymd_h will be slow as it has to guess the format of the date/time.
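If the guessing is the bottleneck, a minimal sketch of a base-R alternative (assuming every date really is YYYYMMDDHH): parse with an explicit format so nothing is guessed, then add the hours as seconds:

# an explicit format skips format guessing entirely; POSIXct counts
# seconds, so one hour is 3600
result <- as.POSIXct(as.character(wind.new$date),
                     format = "%Y%m%d%H", tz = "UTC") +
          3600 * wind.new$hors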