I have a data frame with daily observations, which I would like to interpolate. I use automap to build a variogram for every day and then apply it to new data. I run a loop and try to store the results in a new data frame. Unfortunately, the data frame with the results only contains the last predicted day.
coordinates(mydata) <- ~lat+long
coordinates(new_data) <- ~lat+long
df <- data.frame(matrix(nrow=50,ncol=10)) #new data frame for predicted values
for(i in 1:ncol(mydata))
kriging_new <- autoKrige(mydata[,i],mydata,newdata)
pred <- kriging_new$krige_output$var1.pred
df[,i] <- data.frame(pred)
The result looks like this, all the columns should be filled with values, not just the last one:
X1 X2 X3 X4 X5 X6 X7 X8 X9 X10
1 NA NA NA NA NA NA NA NA NA 12.008726
2 NA NA NA NA NA NA NA NA NA 6.960499
3 NA NA NA NA NA NA NA NA NA 10.894787
4 NA NA NA NA NA NA NA NA NA 14.378945
5 NA NA NA NA NA NA NA NA NA 17.719522
I also get a warning, saying:
Warning message:
In autofitVariogram(formula, data_variogram, model = model, kappa = kappa, :
Some models where removed for being either NULL or having a negative sill/range/nugget,
set verbose == TRUE for more information
If I do autoKrige manually for each row, everything works fine. It seems the loop is not working as it usually does. Is this some problem in the automap package?
Thanks a lot!
I think you just forgot to enclose the body of your for loop in curly brackets. As a result, only the first statement is repeated 10 times, overwriting kriging_new on every iteration:
for(i in 1:ncol(mydata))
kriging_new <- autoKrige(mydata[,i],mydata,newdata)
Only then do you assign the result from your last iteration:
pred <- kriging_new$krige_output$var1.pred
and finally assign those to the last column of your data frame holding your predictions (the loop counter i is still set to 10 at this point):
df[, i] <- data.frame(pred)
Always write loops with multiple lines of statements like this:
for (i in sequence) {
statement1
statement2
...
}
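Applied to the loop from the question, the fix is only the pair of braces. A minimal sketch of the pattern follows; a made-up computation (i * seq_len(5)) stands in for the autoKrige() call so that it runs without automap or the original data:

```r
# Result container: 5 rows, 3 columns, all NA to begin with
df <- data.frame(matrix(nrow = 5, ncol = 3))

for (i in seq_len(ncol(df))) {
  # Placeholder for the kriging step (an assumption for this sketch);
  # in the question this would be the var1.pred column of the
  # autoKrige() output for variable i
  pred <- i * seq_len(5)
  # Inside the braces, so column i is filled on every iteration
  df[, i] <- pred
}

print(df)
```

Because both statements now sit inside the braces, every column is filled rather than only the last one.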
I've created a formula that calculates the exponential moving average of data:
myEMA <- function(price, n) {
ema <- c()
data_start <- which(!is.na(price))[1]
ema[1:(data_start+n-2)] <- NA
ema[data_start+n-1] <- mean(price[data_start:(data_start+n-1)])
beta <- 2/(n+1)
for(i in (data_start+n):length(price)) {
ema[i] <- beta*price[i] +
(1-beta)*ema[i-1]
}
ema <- reclass(ema,price)
return(ema)
}
The data I'm using is:
pricesupdated <- data.frame(a = seq(1,100), b = seq(1,200,2), c = c(NA,NA,NA,seq(1,97)))
I would like to create a dataframe where I apply the formula to each variable in my above data.frame. My attempt was:
frameddata <- data.frame(myEMA(pricesupdated,12))
But the error message that I get is:
Error in h(simpleError(msg, call)) :
error in evaluating the argument 'x' in selecting a method for function 'mean': undefined columns selected
I'm able to print the answer that I want, but not create a dataframe...
Can you help me?
First of all myEMA() is a function, not a formula. Check out help("function") and help("formula") for details on what the distinction is.
The myEMA() function takes a numeric vector as its first argument and returns a numeric vector of the same length.
A data.frame object is basically just a list of vectors with a special class attribute. The most common way to repeat a function call across each element in a list is to use one of the *apply family of functions. For example, you can use lapply(), which calls myEMA once on each variable in pricesupdated and returns a list with one element per function call, containing that call's return value (a numeric vector). This list can easily be converted back to a data.frame, since all its elements have the same length:
results <- lapply(pricesupdated, myEMA, n = 12)
# look at the structure of the results object
> str(results)
List of 3
$ a: num [1:100] NA NA NA NA NA NA NA NA NA NA ...
$ b: num [1:100] NA NA NA NA NA NA NA NA NA NA ...
$ c: num [1:100] NA NA NA NA NA NA NA NA NA NA ...
frameddata <- as.data.frame(results)
# look at the top 15 records in this object
> head(frameddata, 15)
a b c
1 NA NA NA
2 NA NA NA
3 NA NA NA
4 NA NA NA
5 NA NA NA
6 NA NA NA
7 NA NA NA
8 NA NA NA
9 NA NA NA
10 NA NA NA
11 NA NA NA
12 6.5 12 NA
13 7.5 14 NA
14 8.5 16 NA
15 9.5 18 6.5
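The same pattern can be tried without the dependencies of myEMA(). In the sketch below, running_mean() is a hypothetical stand-in (a plain trailing mean, not an EMA) and prices is made-up data; the point is only the lapply() plus as.data.frame() round trip:

```r
# Hypothetical stand-in for myEMA(): takes a numeric vector and
# returns a numeric vector of the same length
running_mean <- function(x, n) {
  out <- rep(NA_real_, length(x))
  for (i in seq_along(x)) {
    # Only defined once n observations are available
    if (i >= n) out[i] <- mean(x[(i - n + 1):i])
  }
  out
}

# Made-up data with two numeric columns
prices <- data.frame(a = 1:10, b = seq(2, 20, 2))

# One call per column; extra arguments (n = 3) are passed through
res <- as.data.frame(lapply(prices, running_mean, n = 3))
print(res)
```

Each column of prices is handed to the function as a plain vector, which is why passing the whole data frame to a vector function fails while this works.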
The question is likely a duplicate, ...
but the apply-family might help, e.g.
sapply(pricesupdated, myEMA, n=12)
For reproducibility, it would be beneficial to add require(pec).
I have a set of data and a loop containing numerous calculations for the data set, where the individual components of the set are split into a subset and cycled through one by one. However I need to be able to execute the same calculations across the original data set as a whole first.
For a fictional data set called masterdata with 3 components (column D1) and numerous variables (X2-X10) as such:
# masterdata
# D1 X2 X3 X4 X5 X6 X7 X8 X9 X10
# A NA NA NA NA NA NA NA NA NA
# B NA NA NA NA NA NA NA NA NA
# C NA NA NA NA NA NA NA NA NA
# B NA NA NA NA NA NA NA NA NA
# B NA NA NA NA NA NA NA NA NA
# C NA NA NA NA NA NA NA NA NA
# C NA NA NA NA NA NA NA NA NA
# A NA NA NA NA NA NA NA NA NA
# B NA NA NA NA NA NA NA NA NA
# A NA NA NA NA NA NA NA NA NA
A loop is in place to split off a subset for component A, perform the calculations, output the results and then repeat this for B and C:
Component.List = c("A", "B", "C")
for(k in 1:length(Component.List)) {
subdata = subset(masterdata, D1 == Component.List[k])
# Numerous calculations performed on "subdata" within the loop
}
# End of loop
What I am trying to do is initially perform the same numerous calculations against the whole of masterdata and then start looping through the individual components.
The calculations produce two vectors, which are placed into columns of two data frames created just before the loop is executed:
# Prior to the start of the loop two frames below created
Components = 3 # In this example 3 components in column D1 - "A", "B", "C"
Result.Frame.V1 = as.data.frame(matrix(0, nrow = 200, ncol = Components))
Result.Frame.V2 = as.data.frame(matrix(0, nrow = 200, ncol = Components))
# Loop runs and contains all of the calculations and within the calculations the last two
# lines below place the two generated vectors into the kth columns of the frames.
Result.Frame.V1[,k] = V1.Result
Result.Frame.V2[,k] = V2.Result
# First run of the loop for "A" will place the outputs in the 1st columns
# Second run of the loop for "B" will place the outputs in the 2nd columns, etc.
# With the expansion to also calculate against the whole group, the above data frames
# would be expanded to an extra column that would hold the result vector for the whole
# masterdata run through the calculations
My initial theoretical solution is to write every calculation in the loop once for masterdata and then have the above loop, however the calculations are hundreds of lines of code!
Is it possible to incorporate into the For loop a way to calculate for the original data and then continue cycling through the components?
It seems like dplyr would solve this elegantly, among the other options
For the whole data:
library(dplyr)
masterdata %>%
summarise(result = your_function(arg1 = X1, arg2 = X2, ...))
For each component, just add group_by
masterdata %>%
group_by(D1) %>%
summarise(result = your_function(arg1 = X1, arg2 = X2, ...))
If you are outputting data frames, the key is to write a function that performs your calculations on a data frame passed to it and returns a data frame. In the example below the function is called your_function().
For simplicity a three-stage process is used: first create the output data frame for the overall dataset, then use lapply to perform the same calculations on the sub-datasets. The sub-dataset outputs are then bound together into a single data frame before finally being combined with the output for the full dataset.
Note: I created a new variable called "Subset" so that the outputs are all identifiable as belonging to each distinct set.
library(dplyr)
FullSet <- your_function(masterdata) %>% mutate(Subset = "Full")
SubSets <- lapply(unique(masterdata$D1), function(n){
masterdata %>% filter(D1 == n) %>%
your_function(.) %>% mutate(Subset = n)
}) %>% bind_rows()
FinalSet <- bind_rows(FullSet, SubSets)
If you want to run the process in parallel for speed, use mclapply() from the parallel package:
mclapply(unique(masterdata$D1), function..., mc.cores=detectCores())
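The same idea can also be sketched in base R without dplyr: put the full data set and the per-component subsets into one list, then run a single function over that list. The toy masterdata values and the body of your_function() below are invented for illustration only.

```r
# Toy data (made up): one component column and one variable
masterdata <- data.frame(
  D1 = c("A", "B", "C", "B", "A"),
  X2 = c(1, 2, 3, 4, 5)
)

# Hypothetical calculation: takes a data frame, returns a data frame
your_function <- function(d) {
  data.frame(n = nrow(d), mean_X2 = mean(d$X2))
}

# First element is the whole data set, followed by one subset per component
pieces <- c(list(Full = masterdata), split(masterdata, masterdata$D1))

# The calculation is written once and applied to every piece
results <- do.call(rbind, lapply(pieces, your_function))
print(results)
```

With this layout the hundreds of lines of calculations live in one function, and the whole-data run is just the first pass through the same code.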
I have a dataset archivo containing the rates of bonds for every duration of the government auctions since 2003. The first few rows are:
Fecha 1 2 3 4 5 6 7 8 9 10 11 12 18 24
2003-01-02 NA NA NA NA NA 44.9999 NA NA 52.0002 NA NA NA NA NA
2003-01-03 NA NA NA NA NA NA NA NA NA NA NA NA NA NA
2003-01-06 NA NA NA NA NA NA NA NA NA NA NA NA NA NA
2003-01-07 NA NA NA NA NA 40.0000 NA NA 45.9900 NA NA NA NA NA
2003-01-08 NA NA NA NA NA NA NA NA NA NA NA NA NA NA
2003-01-09 NA NA NA NA NA 37.0000 NA NA 41.9999 NA NA NA NA NA
Every column named 1 to 24 corresponds to a different duration. (1 month, 2 months, ..., 24 months). Not all durations are sold on the auction date. That's why I have NAs.
I need to estimate the missing (NA) rates with a log-fitted curve for every row that has more than one value. For the rows that are all NA, I can just use the preceding fitted curve.
I'm aware I could run code like:
x<-as.numeric(colnames(archivo[,-1])) # durations as numbers, so that log(x) works
y<-t(archivo[1,-1])
estimacion<-lm(y ~ log(x))
param<-estimacion$coefficients
and get the coefficients for the first row. Then run a loop and do it for every row.
Is there any way to do it directly with the entire dataset and obtain the parameters of every row (every log fitting) without doing a loop?
Hope the question is clear enough.
Thanks in advance!
Try:
dat <- as.data.frame(t(archivo[,-1])) ## transpose your data frame
## a function to fit a model `y ~ log(x)` for response vector `y`
fit_model <- function (y) {
non_NA <- which(!is.na(y)) ## non-NA rows index
if (length(non_NA) > 1) {
## there are at least 2 data points, a linear model is possible
lm.fit(cbind(1, log(non_NA)), y[non_NA])$coef
} else {
## not sufficient number of data, return c(NA, NA)
c(NA, NA)
}
}
## fit linear model column-by-column
result <- sapply(dat, FUN = fit_model)
Note that I am using lm.fit(), the kernel fitting routine called by lm(). Have a read of ?lm.fit if you are not familiar with it. It takes 2 essential arguments:
The first is the model matrix. The model matrix for your model y ~ log(x) is matrix(c(rep(1,24), log(1:24)), ncol = 2). You can also construct it via model.matrix(~log(x), data = data.frame(x = 1:24)).
The second is the response vector. For your problem it is a column of dat.
Unlike lm(), which can handle NA, lm.fit() cannot. So we need to remove NA rows from the model matrix and response vector ourselves. The non_NA variable does this. Note that your model y ~ log(x) involves 2 parameters / coefficients, so at least 2 data points are required for fitting. If there are not enough data, model fitting is impossible and we return c(NA, NA).
Finally, I use sapply() to fit a linear model column by column, retaining coefficients only by $coef.
Test
I am using the example rows you posted in your question. Using the above code, I get:
# V1 V2 V3 V4 V5 V6
# x1 14.06542 NA NA 13.53005 NA 14.90533
# x2 17.26486 NA NA 14.77316 NA 12.33127
Each column gives coefficients for each column of dat (or each row of archivo).
Update
Originally I used matrix(c(rep(1,24), log(1:24)), ncol = 2)[non_NA, ] for the model matrix in lm.fit(). This is not efficient, though: it first generates the complete model matrix and then drops the rows with NA. On second thought, this is better: cbind(1, log(non_NA)).
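One thing to check before reusing this: log(non_NA) takes the logarithm of the column position, which equals the duration for columns 1 to 12 but not for the columns named 18 and 24 (positions 13 and 14). A hedged variant that reads the durations from the column names might look like the sketch below. The cut-down archivo and the names durations and fit_row are assumptions made for the example.

```r
# Cut-down version of archivo: three auction dates, three durations
archivo <- data.frame(
  Fecha = c("2003-01-02", "2003-01-03", "2003-01-07"),
  `6`  = c(44.9999, NA, 40.0000),
  `9`  = c(52.0002, NA, 45.9900),
  `18` = rep(NA_real_, 3),
  check.names = FALSE
)

# Durations in months, taken from the column names rather than positions
durations <- as.numeric(colnames(archivo)[-1])

# Fit y ~ log(duration) for one row of rates
fit_row <- function(y) {
  ok <- !is.na(y)
  if (sum(ok) > 1) {
    # At least 2 observed rates: intercept and slope are identifiable
    lm.fit(cbind(1, log(durations[ok])), y[ok])$coefficients
  } else {
    # Not enough data for this auction date
    c(NA_real_, NA_real_)
  }
}

# One column of coefficients per row of archivo
result <- apply(as.matrix(archivo[, -1]), 1, fit_row)
print(result)
```

For the rows posted in the question only the 6- and 9-month columns are filled, where position and duration coincide, so the coefficients agree with the ones shown under Test.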
I feel like this is a relatively straightforward question, and I feel I'm close but I'm not passing edge-case testing. I have a directory of CSVs and instead of reading all of them, I only want some of them. The files are in a format like 001.csv, 002.csv,...,099.csv, 100.csv, 101.csv, etc which should help to explain my if() logic in the loop. For example, to get all files, I'd do something like:
id = 1:1000
setwd("D:/")
filenames = as.character(NULL)
for (i in id){
if(i < 10){
i <- paste("00",i,sep="")
}
else if(i < 100){
i <- paste("0",i,sep="")
}
filenames[[i]] <- paste(i,".csv", sep="")
}
y <- do.call("rbind", lapply(filenames, read.csv, header = TRUE))
The above code works fine for id=1:1000, for id=1:10, id=20:70 but as soon as I pass it id=99:100 or any sequence involving numbers starting at over 100, it introduces a lot of NAs.
Example output below for id=98:99
> filenames
098 099
"098.csv" "099.csv"
Example output below for id=99:100
> filenames
099
"099.csv" NA NA NA NA NA NA NA NA
NA NA NA NA NA NA NA NA NA
NA NA NA NA NA NA NA NA NA
NA NA NA NA NA NA NA NA NA
NA NA NA NA NA NA NA NA NA
NA NA NA NA NA NA NA NA NA
NA NA NA NA NA NA NA NA NA
NA NA NA NA NA NA NA NA NA
NA NA NA NA NA NA NA NA NA
NA NA NA NA NA NA NA NA NA
NA NA NA NA NA NA NA NA NA
"100.csv"
I feel like I'm missing some catch statement in my if() logic. Any insight would be greatly appreciated! :)
You can avoid the loop for creating the filenames
filenames <- sprintf('%03d.csv', 1:1000)
y <- do.call(rbind, lapply(filenames, read.csv, header = TRUE))
@akrun has given you a much better way of solving your task. But in terms of the actual issue with your code, the problem is that for i < 100 you index by a character string (i is implicitly converted by paste) while for i >= 100 you index by an integer. When you use id = 99:100 this translates to:
filenames <- character(0)
filenames["099"] <- "099.csv" # length(filenames) == 1L
filenames[100] <- "100.csv" # length(filenames) == 100L, with all(is.na(filenames[2:99]))
Assigning to a named member of a vector that doesn't yet exist will create a new member at position length(vector) + 1 whereas assigning to a numbered position that is > length(vector) will also fill in every intervening position with NA.
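The two kinds of assignment can be checked directly in a fresh session. This is a small sketch using only base R; it also shows that building the names with sprintf() avoids mixing index types:

```r
filenames <- character(0)

# Character index: appends one new named element
filenames["099"] <- "099.csv"
len_after_name <- length(filenames)

# Integer index beyond the end: pads positions 2 to 99 with NA
filenames[100] <- "100.csv"
len_after_int <- length(filenames)
n_missing <- sum(is.na(filenames))

# Zero-padded names built in one step, no indexing involved
padded <- sprintf("%03d.csv", 99:100)

print(c(len_after_name, len_after_int, n_missing))
print(padded)
```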
Another approach, although less efficient than @akrun's solution, is with the following function:
merged <- function(id = 1:332) {
df <- data.frame()
for(i in 1:length(id)){
add <- read.csv(sprintf('%03d.csv', id[i]))
df <- rbind(df,add)
}
df
}
Now, you can merge the files with:
dat <- merged(99:100)
Furthermore, you can assign column names by inserting the following line in the function just before the last line with df:
colnames(df) <- c(..specify the colnames in here..)
I want to create a lot of variables across several separate dataframes which I will then combine into one grand data frame.
Each sheet is labeled by a letter (there are 24) and each sheet contributes somewhere between 100-200 variables. I could write it as such:
a$variable1 <- NA
a$variable2 <- NA
.
.
.
w$variable25 <- NA
This can/will get ugly, and I'd like to write a loop or use a vector to do the work. I'm having a heck of a time doing it though.
I essentially need a script which will allow me to specify a form and then just tack numbers onto it.
So,
a$variable[i] <- NA
where [i] gets tacked onto the actual variable created.
I just learnt this neat little trick from @eddi:
#created some random dataset with 3 columns
library(data.table)
a <- data.table(
a1 = c(1,5),
a2 = c(2,1),
a3 = c(3,4)
)
#assuming that you now need to add more columns from a4 to a200
# first, creating the sequence from 4 to 200
v = c(4:200)
# then using that sequence to add the 197 more columns
a[, paste0("a", v) :=
NA]
# now a has 200 columns, as compared to the three we initiated it with
dim(a)
#[1] 2 200
I don't think you actually need this, although you seem to think so for some reason.
Maybe something like this:
a <- as.data.frame(matrix(NA, ncol=10, nrow=5))
names(a) <- paste0("Variable", 1:10)
print(a)
# Variable1 Variable2 Variable3 Variable4 Variable5 Variable6 Variable7 Variable8 Variable9 Variable10
# 1 NA NA NA NA NA NA NA NA NA NA
# 2 NA NA NA NA NA NA NA NA NA NA
# 3 NA NA NA NA NA NA NA NA NA NA
# 4 NA NA NA NA NA NA NA NA NA NA
# 5 NA NA NA NA NA NA NA NA NA NA
If you want variables with different types:
p <- 10 # number of variables
N <- 100 # number of records
vn <- vector(mode="list", length=p)
names(vn) <- paste0("V", seq(p))
vn[1:8] <- NA_real_ # numeric
vn[9:10] <- NA_character_ # character
df <- as.data.frame(lapply(vn, function(x, n) rep(x, n), n=N))
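If the data frames already exist and only need extra empty columns, base R can do that in one assignment per data frame. A minimal sketch, assuming a small made-up data frame a and the naming scheme variable1 to variable25 from the question:

```r
# Existing sheet with some data already in it (made up)
a <- data.frame(id = 1:3)

# Names of the columns to add: variable1, variable2, ..., variable25
new_names <- paste0("variable", 1:25)

# Assigning to columns that do not exist yet creates them, filled with NA
a[new_names] <- NA

print(dim(a))
```

The same two lines can be repeated for each sheet, or run over a list of data frames with lapply() if the sheets are kept in a list.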