I am trying to use the forecastML R package to run some tests, but as soon as I reach this step it renames the columns:
data <- read.csv("C:\\Users\\User\\Desktop\\DG ST Forecast\\LassoTemporalForecast.csv", header=TRUE)
date_frequency <- "1 week"
dates <- seq(as.Date("2012-10-05"), as.Date("2020-10-05"), by = date_frequency)
data_train <- data[1:357,]
data_test <- data[358:429,]
outcome_col <- 1 # The column index of the outcome.
horizons <- 1:12 # Forecast horizons of 1 through 12 time steps ahead.
# A lookback across select time steps in the past. Feature lags 1 through 9, for instance, will be
# silently dropped from the 12-step-ahead model.
lookback <- c(1)
# A non-lagged feature that changes through time whose value we either know (e.g., month) or whose
# value we would like to forecast.
dynamic_features <- colnames(data_train)
data_list <- forecastML::create_lagged_df(data_train,
type = "train",
outcome_col = 1,
horizons = horizons,
lookback = lookback,
date = dates[1:nrow(data_train)],
frequency = date_frequency,
dynamic_features = colnames(data_train)
)
After the data_list, here is a snapshot of what happens in the console:
Next, when I try to create windows following the name change,
windows <- forecastML::create_windows(lagged_df = data_list, window_length = 36,
window_start = NULL, window_stop = NULL,
include_partial_window = TRUE)
plot(windows, data_list, show_labels = TRUE)
I get this error: Can't subset columns that don't exist. x Column `cases` doesn't exist.
I have checked my input data and the code above many times and still can't understand why the name change occurs. If anyone is familiar with this package, please assist. Thank you!
I'm the package author. It's difficult to tell without a reproducible example, but here's what I think is going on: Dynamic features are essentially features with a lag of 0. Dynamic features also retain their original names, as opposed to lagged features which have "_lag_n" appended to the feature name. So by setting dynamic_features to all column names you are getting duplicate columns specifically for the outcome column. My guess is that "cases" is the outcome here. Fix this by removing dynamic_features = colnames(data_train) and setting it to only those features that you really want to have a lag of 0.
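As a sketch of that fix, assuming a hypothetical data set whose outcome is cases and whose only feature known in advance is a month column (the column names, the toy values, and the horizons and lookback used here are illustrative assumptions, not taken from the question):

```r
# Toy weekly data; the column names and values are assumptions for illustration.
set.seed(1)
n <- 60
dates <- seq(as.Date("2012-10-05"), by = "1 week", length.out = n)
data_train <- data.frame(
  cases = as.numeric(rpois(n, lambda = 20)),  # outcome (column 1)
  temp  = rnorm(n, mean = 15, sd = 5),        # feature to be lagged
  month = as.numeric(format(dates, "%m"))     # known in advance, so lag 0 is fine
)

outcome_col <- 1
# Only features that should enter with a lag of 0; never the outcome itself.
dynamic_features <- "month"
stopifnot(!colnames(data_train)[outcome_col] %in% dynamic_features)

# Run the forecastML step only if the package is installed.
if (requireNamespace("forecastML", quietly = TRUE)) {
  data_list <- forecastML::create_lagged_df(
    data_train,
    type = "train",
    outcome_col = outcome_col,
    horizons = 1:3,
    lookback = 1:6,
    dates = dates,
    frequency = "1 week",
    dynamic_features = dynamic_features
  )
  # Inspect the column names of the first horizon's data set.
  print(colnames(data_list[[1]]))
}
```

The stopifnot() line is only a guard against accidentally passing the outcome as a dynamic feature again.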
I started learning R three days ago, so please bear with me. If you see any flaws in my code or calculations, please call them out.
I have tried this, but I get an error message every time:
table.AnnualizedReturns(Apple.Monthly.Returns[, 2:3, drop = FALSE], scale = 12,
Rf = 0, geometric = TRUE, digits = 4)
Error in checkData(R) :
The data cannot be converted into a time series. If you are trying to pass in names from a data object with one column, you should use the form 'data[rows, columns, drop = FALSE]'. Rownames should have standard date formats, such as '1985-03-15'.
As you can clearly see I have no clue what I am doing.
This is every line of code I have written so far:
Dates <- Data_Task2$`Names Date`[1801:2270]
as.numeric(Dates)
Dates <- ymd(Dates)
Monthly.Return <- Data_Task2$Returns[1801:2270]
Monthly.Return <- as.numeric(Monthly.Return)
Apple.Monthly.Returns <- data.frame(Dates, Monthly.Return)
Log.return = log(Monthly.Return + 1)
Apple.Monthly.Returns$Log.return = log(Apple.Monthly.Returns$Monthly.Return + 1)
You should check out the Tidyverse and specifically dplyr (https://dplyr.tidyverse.org/).
This gets you to a good starting point:
https://www.r-bloggers.com/2014/03/using-r-quickly-calculating-summary-statistics-with-dplyr/
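For the checkData error itself, the message quoted in the question hints at one possible fix: give the return columns row names in a standard date format instead of keeping the dates as an ordinary column. A minimal base R sketch, using made-up monthly returns in place of Data_Task2 (which is not shown in the question):

```r
# Made-up monthly data standing in for the asker's Data_Task2 columns.
Dates <- seq(as.Date("2015-01-01"), by = "month", length.out = 24)
set.seed(42)
Monthly.Return <- rnorm(24, mean = 0.01, sd = 0.05)

# Keep only the return columns and move the dates into the row names.
Apple.Monthly.Returns <- data.frame(
  Monthly.Return = Monthly.Return,
  Log.return = log(Monthly.Return + 1),
  row.names = format(Dates, "%Y-%m-%d")
)
head(Apple.Monthly.Returns, 3)

# The original call, guarded so the sketch also runs without PerformanceAnalytics.
if (requireNamespace("PerformanceAnalytics", quietly = TRUE)) {
  print(PerformanceAnalytics::table.AnnualizedReturns(
    Apple.Monthly.Returns, scale = 12, Rf = 0, geometric = TRUE, digits = 4
  ))
}
```

Converting the data to an xts object would be another route; the row-name approach is shown only because it is the one the error message describes.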
I recently switched from Stata to R.
In Stata, there is something called a value label. The encode command, for example, turns a string variable into a numeric one, with a string label attached to each number. Since string variables contain names (which repeat most of the time), value labels save a lot of space when dealing with large datasets.
Unfortunately, I did not manage to find a similar command in R. The only package I have found that can attach labels to a vector of values is sjlabelled. It attaches the labels, but when I try to merge the labelled numeric vector into another data frame, the labels seem to “fall off”.
Example: Start with a string variable.
paragraph <- "Melanija Knavs was born in Novo Mesto, and grew up in Sevnica, in the Yugoslav republic of Slovenia. She worked as a fashion model through agencies in Milan and Paris, later moving to New York City in 1996. Her modeling career was associated with Irene Marie Models and Trump Model Management"
install.packages("sjlabelled")
library(sjlabelled)
sentences <- strsplit(paragraph, " ")
sentences <- unlist(sentences, use.names = FALSE)
# Now we have a vector of string values.
sentrnces_df <- as.data.frame(sentences)
sentences <- unique(sentrnces_df$sentences)
group_sentences <- c(1:length(sentences))
sentences <- as.data.frame(sentences)
group_sentences <- as.data.frame(group_sentences)
z <- cbind(sentences,group_sentences)
z$group_sentences <- set_labels(z$group_sentences, labels = (z$sentences))
sentrnces_df <- merge(sentrnces_df, z, by = c('sentences'))
get_labels(z$group_sentences) # the labels I was attaching using set labels
get_labels(sentrnces_df$group_sentences) # the output is just “NULL”
Thanks!
P.S. Sorry about the inelegant code; as I said before, I'm pretty new to R.
source: https://simplystatistics.org/2015/07/24/stringsasfactors-an-unauthorized-biography/
...
Around June of 2007, R introduced hashing of CHARSXP elements in the
underlying C code thanks to Seth Falcon. What this meant was that
effectively, character strings were hashed to an integer
representation and stored in a global table in R. Anytime a given
string was needed in R, it could be referenced by its underlying
integer. This effectively put in place, globally, the factor encoding
behavior of strings from before. Once this was implemented, there was
little to be gained from an efficiency standpoint by encoding
character variables as factor. Of course, you still needed to use
‘factors’ for the modeling functions.
...
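For readers coming from Stata's encode, a base R factor is the closest built-in analogue: it stores integer codes plus a table of string labels (the levels). A minimal sketch with made-up words:

```r
words <- c("Milan", "Paris", "Milan", "New York", "Paris", "Milan")
f <- factor(words)

as.integer(f)   # integer codes, one per element
levels(f)       # the label attached to each code

# The levels are part of the factor itself, so they are still there after a merge.
df_words <- data.frame(id = 1:6, city = f)
df_other <- data.frame(id = 1:6, year = 1996:2001)
merged <- merge(df_words, df_other, by = "id")
levels(merged$city)
```

Whether this saves memory compared with plain character vectors is a separate matter, as the quoted passage explains.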
I adjusted your initial test data a little. I was confused by so many strings and am unsure whether they are necessary for this issue. Let me know if I missed a point. Here is my adjustment and the answer:
#####################################
# initial problem rephrased
#####################################
# create test data
library(sjlabelled) # provides set_labels() and get_labels()
id <- 1:20
variable1 = sample(30:35, 20, replace=TRUE)
variable2 = sample(36:40, 20, replace=TRUE)
df1 <- data.frame(id, variable1)
df2 <- data.frame(id, variable2)
# set arbitrary labels
df1$variable1 <- set_labels(df1$variable1, labels = c("few" = 1, "lots" = 5))
# show labels in this frame
get_labels(df1)
# include associated values
get_labels(df1, values = "as.prefix")
# merge df1 and df2
df_merge <- merge(df1, df2, by = c('id'))
# labels lost after merge
get_labels(df_merge, values = "as.prefix")
#####################################
# solution with dplyr
#####################################
library(dplyr)
df_merge2 <- left_join(x = df1, y = df2, by = "id")
get_labels(df_merge2, values = "as.prefix")
Solution attributed to:
Merging and keeping variable labels in R
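The loss of labels can also be reproduced without sjlabelled. The sketch below assumes the labels live in an ordinary "labels" attribute on the column (an assumption made for illustration). Base R's merge() rebuilds each column by subsetting, which drops such attributes, so one base R workaround is to copy the attribute back afterwards:

```r
df1 <- data.frame(id = 1:5, variable1 = c(1, 5, 1, 5, 1))
df2 <- data.frame(id = 1:5, variable2 = 36:40)
attr(df1$variable1, "labels") <- c(few = 1, lots = 5)

merged <- merge(df1, df2, by = "id")
lost <- attr(merged$variable1, "labels")
is.null(lost)   # the attribute did not survive the merge

# Workaround: copy the attribute back from the original column.
attr(merged$variable1, "labels") <- attr(df1$variable1, "labels")
attr(merged$variable1, "labels")
```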
Like this we have 500 entries. Entries may be repeated.
The date is the day on which that particular car part (carparts) malfunctioned. We have to predict the date on which a car part (carparts) is going to malfunction.
The code is written in R. The code that builds the table is below:
q<-c("Mercedes","Audi","Tata","Renault","Ferrari","Lamborgini")
w<-sample(q,500,replace=TRUE)
m <- c("accelerator", "gear", "coolant", "brakes", "airbags")
k <- sample(m, 500, replace=TRUE)
e <- seq(as.Date("2010/1/1"), as.Date("2011/1/1"), by="days")
l <- sample(e, 500, replace=TRUE)
test <- list(w,k, l)
t2 <- as.data.frame(test)
names(t2) <- c("carnames","carparts", "date")
t2$Diffdate<-as.numeric(t2$date-as.Date("2010-01-01"))
head(t2)
I'm preparing my data for survival analysis. In the code above I haven't included the censor and event variables. (I tried a rough draft and it got messy.) I just need an idea of how to include the event and censor variables along with the carparts and carnames variables. I'm stuck because I can't fit all the variables into a single table.
The two problems I'm facing are:
1> I can't find a way to keep the carparts, carnames, event and censor variables in one table.
2> The event variable is always 1 because every entry (row) records a breakdown/defect of a car part. Is that OK?
In all the examples I saw on the internet, the event variable had both ones and zeroes.
Edit 1: It's not necessary to do it in R; you may write it down (draw the table with the existing columns as well as the censor and event variables) on a piece of paper and attach a snapshot.
Thanks
I have a data analysis module that I've been using for some time. From the output of a selected model, I can use a data.frame to predict outcomes over a range of values of interest. The following line should create a data.frame. Sometimes it will run, but sometimes the column 'tod' fails to create, and trips an error.
todData <- data.frame(kpsp=rep(c(0,1,0), each=10), tlwma=rep(c(0,0,1), each=10), tod=rep(seq(-0.25,4.25, by=.5),3), tod2=tod^2, doy=36)
This results in the following return:
Error in data.frame(kpsp = rep(c(0, 1, 0), each = 10), tlwma = rep(c(0, :
object 'tod' not found
I did some searching but couldn't get any returns... wasn't even sure how to properly search for such an issue. Thanks for any suggestions on how to make this run consistently.
A.Birdman
The error happens because tod2 is computed from a column (tod) that is being created in the same data.frame call. The arguments to data.frame are evaluated before the data frame exists, so a column can only be referenced by name after the data.frame object has been created. We can use the data.frame call to create the initial columns and then use mutate (from dplyr) or within or transform (from base R) to create new columns that depend on the initial columns.
todData <- data.frame(kpsp=rep(c(0,1,0), each=10),
tlwma=rep(c(0,0,1), each=10), tod=rep(seq(-0.25,4.25, by=.5),3),
doy = 36)
todData <- within(todData, {tod2 <- tod^2})
Or
todData <- transform(todData, tod2 = tod^2)
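A small base R sketch of why the original line only fails sometimes. It assumes a fresh session with no object called tod; make_tod_data is a hypothetical helper used only to keep the example short:

```r
make_tod_data <- function() {
  # tod^2 is evaluated before the data frame exists, so `tod` is looked up
  # in the surrounding environment, not among the new columns.
  data.frame(tod = seq(-0.25, 4.25, by = 0.5), tod2 = tod^2)
}

# No object named `tod` yet: the call fails and we capture the message.
first_try <- tryCatch(make_tod_data(), error = function(e) conditionMessage(e))
first_try

# A leftover `tod` from earlier work: the call now runs, but tod2 is built
# from the leftover values rather than from the new tod column.
tod <- 1:10
second_try <- make_tod_data()
all(second_try$tod2 == second_try$tod^2)
```

So when the one-line version runs without an error, it may be silently using a stale tod from the workspace, which is a good reason to prefer within or transform.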
I think it works only when you have executed tod=rep(seq(-0.25,4.25, by=.5),3) as an individual line somewhere before.
This will work:
tod=rep(seq(-0.25,4.25, by=.5),3)
todData <- data.frame(kpsp=rep(c(0,1,0), each=10), tlwma=rep(c(0,0,1), each=10), tod=tod, tod2=tod^2, doy=36)
Or if you really want to execute this several times with one line, use this function that has a default formula for tod2 (you then won't mention tod2 in your call unless needed):
create.toData <- function(kpsp,tlwma,tod,tod2=tod^2,doy){
data.frame(kpsp=kpsp, tlwma=tlwma, tod=tod, tod2=tod2,doy=doy)
}
todData <- create.toData(kpsp=rep(c(0,1,0), each=10), tlwma=rep(c(0,0,1), each=10), tod=rep(seq(-0.25,4.25, by=.5),3), doy=36)
I am organizing weather data into netCDF files in R. Everything goes fine until I try to populate the netcdf variables with data, because it is asking me to specify only one dimension for two-dimensional variables.
library(ncdf)
These are the dimension tags for the variables. Each variable uses the Threshold dimension and one of the other two dimensions.
th <- dim.def.ncdf("Threshold", "level", c(5,6,7,8,9,10,50,75,100))
rt <- dim.def.ncdf("RainMinimum", "cm", c(5, 10, 25))
wt <- dim.def.ncdf("WindMinimum", "m/s", c(18, 30, 50))
The variables are created in a loop, and there are a lot of them, so for the sake of easy understanding, in my example I'll only populate the list of variables with one variable.
vars <- list()
v1 <- var.def.ncdf("ARMM_rain", "percent", list(th, rt), -1, prec="double")
vars[[length(vars)+1]] <- v1
ncdata <- create.ncdf("composite.nc", vars)
I use another loop to extract data from different data files into a 9x3 data frame named subframe while iterating through the variables of the netCDF file with varindex. For reproducibility, here is a quick initialization of these values.
varindex <- 1
subframe <- data.frame(matrix(nrow=9, ncol=3, rep(.01, 27)))
The desired outcome from there is to populate each ncdf variable with the contents of subframe. The code to do so is:
for(x in 1:9) {
for(y in 1:3) {
value <- ifelse(is.na(subframe[x,y]), -1, subframe[x,y])
put.var.ncdf(ncdata, varindex, value, start=c(x,y), count=1)
}
}
The error message is:
Error in put.var.ncdf(ncdata, varindex, value, start = c(x, y), count = 1) :
'start' should specify 1 dims but actually specifies 2
tl;dr: I have defined two-dimensional variables using ncdf in R, I am trying to write data to them, but I am getting an error message because R believes they are single-dimensional variables instead.
Anyone know how to fix this error?