Turn R output into a dataframe of multiple vectors

The standard bit of code below from the VARS package forecasts values for several variables.
What I want to do is to take those values and turn them into a data frame so I can produce time series graphs.
> predict(var4, n.ahead=12, ci=0.95)

This question is highly vague. I suppose you're looking for:
x <- predict(var4, n.ahead = 12, ci = 0.95)
data.frame(n = rep(names(x$fcst), each = nrow(x$fcst[[1]])),
           do.call(rbind, x$fcst))
By the way: The package's name is vars, not VARS.
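For a self-contained illustration, here is a minimal sketch using the Canada data that ships with vars (the lag order here is arbitrary, chosen only to produce a forecast object shaped like yours):
library(vars)
# Fit a small VAR on the built-in Canada data and forecast 12 steps ahead
data(Canada)
var4 <- VAR(Canada, p = 2)
x <- predict(var4, n.ahead = 12, ci = 0.95)
# One row per variable per horizon; columns fcst, lower, upper, CI
fc <- data.frame(n = rep(names(x$fcst), each = nrow(x$fcst[[1]])),
                 do.call(rbind, x$fcst))
head(fc)
The resulting long-format data frame has one row per variable per forecast horizon, which is convenient for time series plotting tools.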

Related

Split-apply-combine with aggregate: can the applied function accept multiple arguments that are specified variables of the original data?

Some context: On my quest to improve my R code, I'm trying to replace my for-loops, whenever I can, with R's apply-family functions.
The question: Are R's apply functions such as sapply, tapply, aggregate, etc. useful for applying functions that are more complicated in the sense that they take as arguments specified variables of the original data?
Simple examples of what works and what does not:
I have a dataframe with one time variable date.time and two numeric variables value.one and value.two:
Generate the data:
library(lubridate)
df <- data.frame(date.time = seq(ymd_hms("2000-01-01 00:00:00"),
                                 ymd_hms("2000-01-03 00:00:00"),
                                 length.out = 100),
                 value.one = 1:100,
                 value.two = 1:100 + 10)
I would like to apply a function to every 10-hour cut of the dataframe that takes as its two arguments the two numeric variables of the dataframe. For example, if I want to compute the mean of each of the two values for each 10-hour cut, the solution is the following:
A function that computes the mean of value.one and value.two for each time period of 10 hours:
work_on_subsets <- function(data, time.step = "10 hours") {
  aggregate(data[, -1], list(cut(df$date.time, breaks = time.step)),
            function(x) mean(x))
}
However, if I want to work with the two data values separately to run another function, say compute the sum of the two averages, I run into trouble. The function work_on_subsets_2 gives me the following error: Error in x$value.one : $ operator is invalid for atomic vectors
A function that computes the sum of the means of value.one and value.two for each 10 hour time period:
work_on_subsets_2 <- function(data, time.step = "10 hours"){
aggregate(data, list(cut(df$date.time, breaks = time.step)), function(x) mean(x$value.one) + mean(x$value.two)}
In the limit, I would like to be able to do something like this:
A function that runs another_function on value.one and value.two for each time period of 10 hours:
another_function <- function(a,b) {
# do something with a and b
}
work_on_subsets_3 <- function(data, time.step = "10 hours") {
  aggregate(data, list(cut(df$date.time, breaks = time.step)),
            another_function(x$value.one, x$value.two))
}
Am I using the wrong tools for this job? If so, are there any viable alternatives to for-loops? I already have a working solution using for-loops, but I'm trying to get a grip on the split-apply-combine strategy.
Hi, there are a few basic things you are doing wrong here. You are creating a function that takes data as its data.frame argument, but you are still referencing df from the global environment. You're also missing at least one bracket. And I don't quite know why you have two layers of functions embedded.
My solution departs from your method but hopefully will help. I'd recommend using the plyr package when you want to split dataframes and apply functions, as I find it much more intuitive. Combining it with dplyr also helps, in my opinion. Note: always load plyr before dplyr, or you run into dependency issues.
If I understand your question correctly, the below should work, and you could create different functions to apply:
library(plyr)
library(dplyr)
# Create the function you want to apply
MeanFun <- function(data) mean(data[["value.one"]]) + mean(data[["value.two"]])
# Add a grouping variable to your dataframe. You can chain this with pipes (%>%)
# if you don't want to create a new data.frame, but for testing purposes this
# more clearly shows what's happening
df1 <- df %>% mutate(Breaks = cut(date.time, breaks = "10 hours"))
# Use plyr's ddply to split the dataframe on the "Breaks" column and apply the function
out <- ddply(df1, "Breaks", MeanFun)
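For comparison, here is a dplyr-only sketch of the same computation, assuming the df generated in the question (summarise collapses each 10-hour bin to a single row):
library(dplyr)
out2 <- df %>%
  mutate(Breaks = cut(date.time, breaks = "10 hours")) %>%
  group_by(Breaks) %>%
  summarise(sum.of.means = mean(value.one) + mean(value.two))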

How to loop a test against one designated control group in R?

I'm completely new to programming and R, but have a dataset that can only be analyzed with a more powerful statistics program such as R.
I have a large but simple dataset consisting of thousands of different groups with multiple samples that I want to compare against the control group with a Mann-Whitney U test; the data structure is shown below.
group  measurement
a      0.14534
cont   0.42574
d      0.36347
c      0.14284
a      0.23593
d      0.36347
cont   0.33514
cont   0.29210
b      0.36345
...
The problem is that the nature of the test requires that exactly two groups are designated. As I have more than two groups, it does not work.
This is what I have so far; as you can see, it does not run in a repeated fashion, and it only works if I have exactly two groups in my input file.
data1 <- read.csv(file.choose(), header = TRUE, stringsAsFactors = FALSE)
attach(data1)
testoutput <- wilcox.test(measurement ~ group, mu = 0, alt = "two.sided",
                          conf.int = TRUE, conf.level = 0.95,
                          paired = FALSE, exact = FALSE, correct = TRUE)
write.table(testoutput$p.value, file = "mwUtest.tsv", sep = "\t")
How do I write and loop the test properly for it to test all my groups against my designated control group? I assume the sapply or lapply functions are used before wilcox.test, but I don't know how.
I'm sorry if this simple question has been brought up before, but I could not find any previous question regarding this specific problem.
In R, there are often many solutions to the same problem. Here's how I would solve this one.
First, I would split my data and have one dataframe with experiments and one with controls:
experiments <- dat[dat$group != "cont", ]
controls <- dat[dat$group == "cont", ]
Then I would split my experimental data by group, and feed that to my test together with my control measurements. Note that this construction makes it easy to extract more values from the test: just return a (named) vector.
result <- lapply(split(experiments, experiments$group), function(x) {
  mytest <- wilcox.test(x$measurement, controls$measurement, mu = 0,
                        alt = "two.sided", conf.int = TRUE, conf.level = 0.95,
                        paired = FALSE, exact = FALSE, correct = TRUE)
  return(mytest$p.value)
})
Combining to a table is then easy:
output <- do.call(rbind, result)
Data used:
set.seed(123)
nobs <- 100
dat <- data.frame(group = sample(c(LETTERS[1:6], "cont"), nobs, TRUE),
                  measurement = runif(nobs), stringsAsFactors = FALSE)
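As noted above, returning a named vector makes it easy to keep more than the p-value; a sketch along those lines (the element names here are arbitrary):
result <- lapply(split(experiments, experiments$group), function(x) {
  mytest <- wilcox.test(x$measurement, controls$measurement,
                        conf.int = TRUE, exact = FALSE)
  # Return several statistics per group instead of just the p-value
  c(p.value = mytest$p.value,
    estimate = unname(mytest$estimate),
    conf.low = mytest$conf.int[1],
    conf.high = mytest$conf.int[2])
})
output <- do.call(rbind, result)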

R: Help using dummyVars and adding back into data.frame

I have a data.frame of 373127 obs. of 193 variables. Some variables are factors, which I want to separate into their own dummy-variable columns using dummyVars(). I then want to merge the separate dummy-variable columns back into my original data.frame, so I thought I could do the whole thing with apply, but something is not working and I can't figure out what it is.
Sample:
dat_final <- apply(dummies.var1, 1, function(x) {
  dummies.var1 <- dummyVars(~ dat1$factor.var1 - 1, data = dat1)
})
Thanks!
You can do the following, which will create a new df, trsf, but you could always reassign back to the original df:
library(caret)
customers <- data.frame(
  id = c(10, 20, 30, 40, 50),
  gender = c('male', 'female', 'female', 'male', 'female'),
  mood = c('happy', 'sad', 'happy', 'sad', 'happy'),
  outcome = c(1, 1, 0, 0, 0))
# dummify the data
dmy <- dummyVars(" ~ .", data = customers)
trsf <- data.frame(predict(dmy, newdata = customers))
print(trsf)
The real answer is: don't do that. It's almost never necessary.
You could do something like this:
# Example data
df <- data.frame(x = rep(LETTERS, each = 3), y = rnorm(78))
df <- cbind(df, model.matrix(~ x - 1, data = df))
However, as pointed out by @user30257, it is hard to see why you want to do it. In general, modeling tools in R don't need dummy vars but deal with factors directly.
Creating dummy variables can be very important in feature selection, which it sounds like the original poster was doing.
For instance, suppose you have a feature that contains duplicated information (i.e., one of its levels corresponds to something measured elsewhere). You can determine this is the case very simply by comparing the dummy variables for these features using a variety of dissimilarity measures.
My preference is to use sparse.model.matrix() and cBind() from the Matrix package.
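A minimal sketch of that approach, on a small hypothetical data frame like the one above (note that in current versions of Matrix, plain cbind() also dispatches on sparse matrices, so cBind() is no longer required):
library(Matrix)
# Example data: one factor with many levels plus a numeric column
df2 <- data.frame(x = rep(LETTERS, each = 3), y = rnorm(78))
# Sparse dummy encoding of the factor column; efficient when there are many levels
sparse_dummies <- sparse.model.matrix(~ x - 1, data = df2)
# Attach the numeric column alongside the dummies (plain cbind works on
# sparse matrices in current Matrix versions; older ones used cBind)
combined <- cbind(sparse_dummies, df2$y)
dim(combined)  # 78 rows, 26 dummy columns + 1 numeric column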

R: Split-Apply-Combine... Apply Functions via Aggregate to Row-Bound Data Frames Subset by Class

Update: My NOAA GHCN-Daily weather station data functions have since been cleaned and merged into the rnoaa package, available on CRAN or here: https://github.com/ropensci/rnoaa
I'm designing an R function to calculate statistics across a data set comprised of multiple data frames. In short, I want to pull data frames by class based on a reference data frame containing the names. I then want to apply statistical functions to the values of the metrics listed for each given day. In effect, I want to call and then overlay a list of data frames to calculate functions on a vector of values for every unique date and metric where values are not NA.

The data frames are iteratively read into the workspace from file based on a class variable, using the 'by' function. After importing the files for a given class, I want to rbind() the data frames for that class and each user-defined metric within a range of years. I then want to apply a concatenation of user-provided statistical functions to each metric within a class that corresponds to a given value for the year, month, and day (e.g., the mean [function] low temperature [metric] on July 1st, 1990 [date] reported across all locations [data frames] within a given region [class]).

I want the end result to be new data frames containing values for every date within a region and a year range for each metric and statistical function applied. I am very close to having this result using the aggregate() function, but I am having trouble getting reasonable results out of it; it is currently outputting NA and NaN for most functions other than the mean temperature. Any advice would be much appreciated! Here is my code thus far:
# Example parameters
w <- c("mean", "sd", "scale")            # Statistical functions to apply
x <- "C:/Data/"                          # Folder location of CSV files
y <- c("MaxTemp", "AvgTemp", "MinTemp")  # Metrics to subset the data
z <- 1970:2000                           # Year range to subset the data
CSVstnClass <- data.frame(CSVstations, CSVclasses)
by(CSVstnClass, CSVstnClass[, 2], function(a) {  # Station list by class
  suppressWarnings(assign(paste(a[, 2]), paste(a[, 1]), envir = .GlobalEnv))
  apply(a, 1, function(b) {                      # Data frame list, row-wise
    classData <- data.frame()
    sapply(y, function(d) {                      # Element list
      # Read in CSV files as data frames
      CSV_DF <- read.csv(paste(x, b[2], "/", b[1], ".csv", sep = ""))
      CSV_DF1 <- CSV_DF[!is.na("Value")]
      CSV_DF2 <- CSV_DF1[which(CSV_DF1$Year %in% z & CSV_DF1$Element == d), ]
      assign(paste(b[2], "_", d, sep = ""), CSV_DF2, envir = .GlobalEnv)
      if (nrow(CSV_DF2) > 0) {  # Remove empty data frames
        # Bind all data frames by row for a class and element
        classData <<- rbind(classData, CSV_DF2)
        assign(paste(b[2], "_", d, "_bound", sep = ""), classData, envir = .GlobalEnv)
        sapply(w, function(g) {                  # Function list
          # Aggregate results of bound data frame for each unique date
          dataFunc <- aggregate(Value ~ Year + Month + Day + Element,
                                data = classData, FUN = g, na.action = na.pass)
          assign(paste(b[2], "_", d, "_", g, sep = ""), dataFunc, envir = .GlobalEnv)
        })
      }
    })
  })
})
I think I am pretty close, but I am not sure whether rbind() is performing properly, or why the aggregate() function is outputting NA and NaN for so many metrics. I was concerned that the data frames were not being bound together or that missing values were not being handled well by some of the statistical functions. Thank you in advance for any advice you can offer.
Cheers,
Adam
You've tackled this problem in a way that makes it very hard to debug. I'd recommend switching things around so you can more easily check each step. (Using informative variable names also helps!) The code is unlikely to work as is, but it should be much easier to work iteratively, checking that each step has succeeded before continuing to the next.
paths <- dir("C:/Data/", pattern = "\\.csv$", full.names = TRUE)
# Read in CSV files as data frames
raw <- lapply(paths, read.csv)
# Extract needed rows
filter_metrics <- c("MaxTemp", "AvgTemp", "MinTemp")
filter_years <- 1970:2000
filtered <- lapply(raw, subset,
                   !is.na(Value) & Year %in% filter_years & Element %in% filter_metrics)
# Drop any empty data frames
rows <- vapply(filtered, nrow, integer(1))
filtered <- filtered[rows > 0]
# Compute aggregates
my_aggregate <- function(df, fun) {
  aggregate(Value ~ Year + Month + Day + Element, data = df, FUN = fun,
            na.action = na.pass)
}
means <- lapply(filtered, my_aggregate, mean)
sds <- lapply(filtered, my_aggregate, sd)
scales <- lapply(filtered, my_aggregate, scale)
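If you then want a single data frame per statistic rather than a list, one possible sketch (this assumes raw was read in the same order as paths, so the surviving file names line up with filtered):
# Label each per-file aggregate with its source file, then stack them
names(means) <- basename(paths[rows > 0])
all_means <- do.call(rbind, Map(cbind, source = names(means), means))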

Extract data from a by-timeseries object

Let's start from the end: the R output will be read into Tableau to create a dashboard, and therefore I need the R output to look a certain way. With that in mind, I'm starting with a data frame in R with n groups of time series. I want to run auto.arima (or another forecasting method from the forecast package) on each by-group. I'm using the by function to do that, but I'm not attached to that approach; it's just what seemed to do the job for an R beginner like me.
The output I need would append a (say) 1-period forecast to the original data frame, filling in the date (variable t) and by-variable (variable class).
If possible, I'd like the approach to generalize to multiple by-variables (i.e., class_1, ..., class_n).
# Generate fake data
t <- seq(as.Date("2012/1/1"), by = "month", length.out = 36)
class <- rep(c("A", "B"), each = 18)
set.seed(1234)
metric <- as.numeric(arima.sim(model = list(order = c(2, 1, 1),
                                            ar = c(0.5, 0.3), ma = 0.3),
                               n = 35))
df <- data.frame(t, class, metric)
df$type <- "ORIGINAL"
# Sort of what I'd like to do
library(forecast)
ts <- ts(df$metric)
ts <- by(df$metric, df$class, auto.arima)
# Extract forecast and relevant other pieces of data
# ???
# What I'd like it to look like
t <- as.Date(c("2013/7/1", "2015/1/1"))
class <- rep(c("A", "B"), each = 1)
metric <- c(1.111, 2.222)
dfn <- data.frame(t, class, metric)
dfn$type <- "FORECAST"
dfinal <- rbind(df, dfn)
I'm not attached to the how-to, as long as it starts with a data frame that looks like what I described, and outputs a data frame like the output I described.
Your description is a little vague, but something along these lines should work:
library(data.table)
dt <- data.table(df)
dt[, {
  result <- auto.arima(metric)
  rbind(.SD,
        list(seq(t[.N], length.out = 2, by = '1 month')[2],
             result$sigma2, "FORECAST"))
}, by = class]
I arbitrarily chose to fill in the sigma^2, since it wasn't clear which variable(s) you want there.
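If what you actually want appended is the one-step point forecast rather than sigma^2, a sketch swapping in forecast() from the forecast package (fc$mean[1] is the first-horizon point forecast):
library(data.table)
library(forecast)
dt <- data.table(df)
dt[, {
  fit <- auto.arima(metric)
  fc <- forecast(fit, h = 1)  # one-step-ahead forecast
  rbind(.SD,
        list(seq(t[.N], length.out = 2, by = '1 month')[2],
             as.numeric(fc$mean[1]), "FORECAST"))
}, by = class]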
