I have a dataset archivo containing the rates of bonds for every duration of the government auctions since 2003. The first few rows are:
Fecha 1 2 3 4 5 6 7 8 9 10 11 12 18 24
2003-01-02 NA NA NA NA NA 44.9999 NA NA 52.0002 NA NA NA NA NA
2003-01-03 NA NA NA NA NA NA NA NA NA NA NA NA NA NA
2003-01-06 NA NA NA NA NA NA NA NA NA NA NA NA NA NA
2003-01-07 NA NA NA NA NA 40.0000 NA NA 45.9900 NA NA NA NA NA
2003-01-08 NA NA NA NA NA NA NA NA NA NA NA NA NA NA
2003-01-09 NA NA NA NA NA 37.0000 NA NA 41.9999 NA NA NA NA NA
Every column named 1 to 24 corresponds to a different duration (1 month, 2 months, ..., 24 months). Not all durations are sold on every auction date. That's why I have NAs.
I need to estimate the missing (NA) rates with a log fitting curve for every row that has more than one value. For the rows that are all NA, I can just use the preceding fitted curve.
I'm aware I could run a code like:
x<-colnames(archivo[,-1]) # to keep the durations
y<-t(archivo[1,-1])
estimacion<-lm(y ~ log(x))
param<-estimacion$coefficients
and get the coefficients for the first row. Then run a loop and do it for every row.
Is there any way to do it directly with the entire dataset and obtain the parameters of every row (every log fitting) without doing a loop?
Hope the question is clear enough.
Thanks in advance!
Try:
dat <- as.data.frame(t(archivo[,-1])) ## transpose your data frame
## a function to fit a model `y ~ log(x)` for response vector `y`
fit_model <- function (y) {
non_NA <- which(!is.na(y)) ## non-NA rows index
if (length(non_NA) > 1) {
## there are at least 2 data points, a linear model is possible
lm.fit(cbind(1, log(non_NA)), y[non_NA])$coef
} else {
## not sufficient number of data, return c(NA, NA)
c(NA, NA)
}
}
## fit linear model column-by-column
result <- sapply(dat, FUN = fit_model)
Note that I am using lm.fit(), the kernel fitting routine called by lm(). Have a read on ?lm.fit if you are not familiar with it. It takes 2 essential arguments:
The first is the model matrix. The model matrix for your model y ~ log(x), is matrix(c(rep(1,24), log(1:24)), ncol = 2). You can also construct it via model.matrix(~log(x), data = data.frame(x = 1:24)).
The second is the response vector. For your problem it is a column of dat.
Unlike lm(), which can handle NA, lm.fit() cannot. So we need to remove the NA rows from the model matrix and the response vector ourselves. The non_NA variable does this. Note that your model y ~ log(x) involves 2 parameters / coefficients, so at least 2 data points are required for fitting. If there are not enough data, model fitting is impossible and we return c(NA, NA).
Finally, I use sapply() to fit a linear model column by column, retaining coefficients only by $coef.
Test
I am using the example rows you posted in your question. Using the above code, I get:
# V1 V2 V3 V4 V5 V6
# x1 14.06542 NA NA 13.53005 NA 14.90533
# x2 17.26486 NA NA 14.77316 NA 12.33127
Each column gives coefficients for each column of dat (or each row of archivo).
Update
Originally I used matrix(c(rep(1,24), log(1:24)), ncol = 2)[non_NA, ] for the model matrix in lm.fit(). This is not efficient though: it first generates the complete model matrix and then drops the rows with NA. On second thought, this is better: cbind(1, log(non_NA)).
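One caveat worth spelling out: log(non_NA) uses the column position as the duration. That matches for the 1 to 12 month columns, but the last two columns are 18 and 24 months, not 13 and 14. Below is a minimal sketch, not the answer's original code, that passes the durations in explicitly and also carries the previous curve forward for all-NA rows, as the question asks. The names rates, durations and fit_row are made up for illustration, and the data are the six example rows typed in by hand.

```r
# Durations in months, taken from the column names of the question's data
durations <- c(1:12, 18, 24)

# The six example rows, as a numeric matrix (dates left out)
rates <- matrix(NA_real_, nrow = 6, ncol = length(durations),
                dimnames = list(NULL, as.character(durations)))
rates[c(1, 4, 6), "6"] <- c(44.9999, 40.0000, 37.0000)
rates[c(1, 4, 6), "9"] <- c(52.0002, 45.9900, 41.9999)

# Fit y ~ log(x) on the non-missing points of one row (hypothetical helper)
fit_row <- function(y, x) {
  ok <- !is.na(y)
  if (sum(ok) > 1) {
    unname(lm.fit(cbind(1, log(x[ok])), y[ok])$coefficients)
  } else {
    c(NA_real_, NA_real_)
  }
}

# One row of coefficients per auction date
coefs <- t(apply(rates, 1, fit_row, x = durations))
colnames(coefs) <- c("intercept", "slope")

# Rows with no fit reuse the preceding curve
for (i in seq_len(nrow(coefs))[-1]) {
  if (anyNA(coefs[i, ])) coefs[i, ] <- coefs[i - 1, ]
}

# Fitted rate for every date and every duration
fitted_rates <- coefs[, "intercept"] + outer(coefs[, "slope"], log(durations))
```

On these six rows the coefficients agree with the Test output above, because only the 6 and 9 month columns are populated; the two versions would differ only on rows where the 18 or 24 month rate is present.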
Related
I've created a formula that calculates the exponential moving average of data:
myEMA <- function(price, n) {
ema <- c()
data_start <- which(!is.na(price))[1]
ema[1:data_start+n-2] <- NA
ema[data_start+n-1] <- mean(price[data_start:(data_start+n-1)])
beta <- 2/(n+1)
for(i in (data_start+n):length(price)) {
ema[i] <- beta*price[i] +
(1-beta)*ema[i-1]
}
ema <- reclass(ema,price)
return(ema)
}
The data I'm using is:
pricesupdated <- data.frame(a = seq(1,100), b = seq(1,200,2), c = c(NA,NA,NA,seq(1,97)))
I would like to create a dataframe where I apply the formula to each variable in my above data.frame. My attempt was:
frameddata <- data.frame(myEMA(pricesupdated,12))
But the error message that I get is:
Error in h(simpleError(msg, call)) :
error in evaluating the argument 'x' in selecting a method for function 'mean': undefined columns selected
I'm able to print the answer that I want, but not create a dataframe...
Can you help me?
First of all myEMA() is a function, not a formula. Check out help("function") and help("formula") for details on what the distinction is.
The myEMA() function takes a numeric vector as its first argument and returns a numeric vector with the same dimensions as its first argument.
A data.frame object is basically just a list of vectors with a special class attribute. The most common way to repeat a function call across each element in a list is to use one of the *apply family of functions. For example, you can use lapply(), which calls myEMA() once on each variable in pricesupdated and returns a list with one element per function call, containing that call's returned value (a numeric vector). This list can easily be converted back to a data.frame since all its elements have the same length:
results <- lapply(pricesupdated, myEMA, n = 12)
# look at the structure of the results object
> str(results)
List of 3
$ a: num [1:100] NA NA NA NA NA NA NA NA NA NA ...
$ b: num [1:100] NA NA NA NA NA NA NA NA NA NA ...
$ c: num [1:100] NA NA NA NA NA NA NA NA NA NA ...
frameddata <- as.data.frame(results)
# look at the top 15 records in this object
> head(frameddata, 15)
a b c
1 NA NA NA
2 NA NA NA
3 NA NA NA
4 NA NA NA
5 NA NA NA
6 NA NA NA
7 NA NA NA
8 NA NA NA
9 NA NA NA
10 NA NA NA
11 NA NA NA
12 6.5 12 NA
13 7.5 14 NA
14 8.5 16 NA
15 9.5 18 6.5
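For readers who want to reproduce this without the package that provides reclass(), here is a minimal sketch under that assumption. ema_plain is a hypothetical helper that keeps the same recursion as myEMA() but returns a plain numeric vector; it is not the asker's function.

```r
# Hypothetical helper: same recursion as myEMA(), without the reclass() call
ema_plain <- function(price, n) {
  ema <- rep(NA_real_, length(price))
  data_start <- which(!is.na(price))[1]   # first non-missing observation
  first <- data_start + n - 1             # first index with a full window
  ema[first] <- mean(price[data_start:first])
  beta <- 2 / (n + 1)
  if (first < length(price)) {
    for (i in (first + 1):length(price)) {
      ema[i] <- beta * price[i] + (1 - beta) * ema[i - 1]
    }
  }
  ema
}

pricesupdated <- data.frame(a = seq(1, 100), b = seq(1, 200, 2),
                            c = c(NA, NA, NA, seq(1, 97)))

# One call per column, then back to a data frame
frameddata <- as.data.frame(lapply(pricesupdated, ema_plain, n = 12))
head(frameddata, 15)
```

The key point is unchanged: the function works on one vector at a time, and lapply() supplies the columns one by one.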
The question is likely a duplicate, ...
but the apply-family might help, e.g.
sapply(pricesupdated, myEMA, n=12)
For reproducibility, it would be beneficial to add require(pec).
I am testing the impact of missing data on regression analysis. So, using a simulated dataset, I want to randomly remove a proportion of observations (not entire rows) from a designated set of columns. I am using 'sample' to do this. Unfortunately, this leaves some columns with many more missing values than others. See an example below:
#Data frame with 5 columns, 10 rows
DF = data.frame(A = paste(letters[1:10]),B = rnorm(10, 1, 10), C = rnorm(10, 1, 10), D = rnorm(10, 1, 10), E = rnorm(10,1,10))
#Function to randomly delete a proportion (ProportionRemove) of records per column, for a designated set of columns (ColumnStart - ColumnEnd)
RandomSample = function(DataFrame,ColumnStart, ColumnEnd,ProportionRemove){
#ci is the opposite of the proportion
ci = 1-ProportionRemove
Missing = sapply(DataFrame[(ColumnStart:ColumnEnd)], function(x) x[sample(c(TRUE, NA), prob = c(ci,ProportionRemove), size = length(DataFrame), replace = TRUE)])}
#Randomly sample column 2 - 5 within DF, deleting 80% of the observation per column
Test = RandomSample(DF, 2, 5, 0.8)
I understand there is an element of randomness to this, but in 10 trials (10*4 = 40 columns), 17 of the columns had no data, and in one trial, one column still had 6 records (rather than the expected ~2) - see below.
B C D E
[1,] NA 24.004402 7.201558 NA
[2,] NA NA NA NA
[3,] NA 4.029659 NA NA
[4,] NA NA NA NA
[5,] NA 29.377632 NA NA
[6,] NA 3.340918 -2.131747 NA
[7,] NA NA NA NA
[8,] NA 15.967318 NA NA
[9,] NA NA NA NA
[10,] NA -8.078221 NA NA
In summary, I want to replace a proportion of observations with NAs in each column.
Any help is greatly appreciated!!!
This makes sense to me. As @Frank suggested (in a since-deleted comment ... *sigh*), "randomness" can give you really non-random-looking results (Dilbert: Tour of Accounting, 2001-10-25).
If you want random samples with guaranteed ratios, try this:
guaranteedSampling <- function(DataFrame, ProportionRemove) {
n <- max(1L, floor(nrow(DataFrame) * ProportionRemove))
inds <- replicate(ncol(DataFrame), sample(nrow(DataFrame), size=n), simplify=FALSE)
DataFrame[] <- mapply(`[<-`, DataFrame, inds, MoreArgs=list(NA), SIMPLIFY=FALSE)
DataFrame
}
set.seed(2)
guaranteedSampling(DF[2:5], 0.8)
# B C D E
# 1 NA NA NA NA
# 2 NA NA NA NA
# 3 NA NA NA NA
# 4 6.792463 10.582938 NA NA
# 5 NA NA -0.612816 NA
# 6 NA -2.278758 NA NA
# 7 NA NA NA 2.245884
# 8 NA NA NA 5.993387
# 9 7.863310 NA 9.042127 NA
# 10 NA NA NA NA
Further to @joran's comment, you wanted either nrow(DataFrame) or length(x).
The specific impact in your example is that you are producing a vector with 5 elements (because DF has 5 variables) each with 0.8 probability of being NA and 0.2 of being TRUE.
Then this statement (which is what the sapply is doing to each column you specify and in this case I'm applying to DF$B only):
DF$B[sample(c(TRUE, NA), prob=c(0.2, 0.8), size = 5, replace=TRUE)]
does something that isn't immediately obvious to the uninitiated*. This:
sample(c(TRUE, NA), prob=c(0.2, 0.8), size = 5, replace=TRUE)
gives a logical vector, which is silently recycled when used to extract elements of a longer vector. So let's say you end up with:
NA TRUE NA TRUE NA
When you subset DF$B you end up getting this:
DF$B[c(NA, TRUE, NA, TRUE, NA, NA, TRUE, NA, TRUE, NA)]
Notice in your example how the top 5 numbers always follow the same pattern as the bottom 5 numbers. This explains why so many columns ended up being all NA, because there is a 0.32768 probability of getting 5 out of 5 NA which gets recycled to the whole column.
The other issue with your code is that the function doesn't actually do anything useful because you didn't specify any return value. Here it is corrected and cleaned up, following the style guide at http://adv-r.had.co.nz/Style.html:
random_sample <- function(x, col_start, col_end, p) {
sapply(x[col_start:col_end],
function(y) y[sample(c(TRUE, NA), prob = c(1-p, p), size = length(y), replace = TRUE)])
}
*The uninitiated in this case includes me! I had no idea that logical vectors were recycled when used to extract until having a look at this question.
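The recycling behaviour is easy to see on a small vector. This is only an illustrative sketch; x, idx and keep are made-up names.

```r
x <- 1:10

# A length-5 logical index is recycled to length 10 before subsetting,
# so the pattern of the first five results repeats in the last five
idx <- c(NA, TRUE, NA, TRUE, NA)
recycled <- x[idx]
recycled

# Chance that all five draws are NA, which blanks the whole column
p_all_na <- 0.8^5

# Sizing the index by the vector itself avoids the recycling
set.seed(1)
keep <- sample(c(TRUE, NA), size = length(x), replace = TRUE,
               prob = c(0.2, 0.8))
masked <- x[keep]
```

Here recycled has NA in positions 1, 3, 5, 6, 8 and 10, with the values 2, 4, 7 and 9 surviving, which is the same mirrored top-half/bottom-half pattern visible in the question's output.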
I have a set of data and a loop containing numerous calculations for the data set, where the individual components of the set are split into a subset and cycled through one by one. However I need to be able to execute the same calculations across the original data set as a whole first.
For a fictional data set called masterdata with 3 components (column D1) and numerous variables (X2-X10) as such:
# masterdata
# D1 X2 X3 X4 X5 X6 X7 X8 X9 X10
# A NA NA NA NA NA NA NA NA NA
# B NA NA NA NA NA NA NA NA NA
# C NA NA NA NA NA NA NA NA NA
# B NA NA NA NA NA NA NA NA NA
# B NA NA NA NA NA NA NA NA NA
# C NA NA NA NA NA NA NA NA NA
# C NA NA NA NA NA NA NA NA NA
# A NA NA NA NA NA NA NA NA NA
# B NA NA NA NA NA NA NA NA NA
# A NA NA NA NA NA NA NA NA NA
A loop is in place to split off a subset for component A, perform the calculations, output the results and then repeat this for B and C:
Component.List = c("A", "B", "C")
for(k in 1:length(Component.List)) {
subdata = subset(masterdata, D1 == Component.List[k])
# Numerous calculations performed on "subdata" within the loop
}
# End of loop
What I am trying to do is initially perform the same numerous calculations against the whole of masterdata and then start looping through the individual components.
Part of the output from the calculations is that two vectors that are created are placed into the first column of the data frames created just prior to executing the loop:
# Prior to the start of the loop two frames below created
Components = 3 # In this example 3 components in column D1 - "A", "B", "C"
Result.Frame.V1 = as.data.frame(matrix(0, nrow = 200, ncol = Components))
Result.Frame.V2 = as.data.frame(matrix(0, nrow = 200, ncol = Components))
# Loop runs and contains all of the calculations and within the calculations the last two
# lines below place two vectors generated into the the kth columns of the frames.
Result.Frame.V1[,k] = V1.Result
Result.Frame.V2[,k] = V2.Result
# First run of the loop for "A" will place the outputs in the 1st columns
# Second run of the loop for "B" will place the outputs in the 2nd columns, etc.
# With the expansion to also calculate against the whole group, the above data frames
# would be expanded to an extra column that would hold the result vector for the whole
# masterdata run through the calculations
My initial theoretical solution is to write every calculation in the loop once for masterdata and then have the above loop, however the calculations are hundreds of lines of code!
Is it possible to incorporate into the For loop a way to calculate for the original data and then continue cycling through the components?
It seems like dplyr would solve this elegantly, among the other options
For the whole data:
library(dplyr)
masterdata %>%
summarise(result = your_function(arg1 = X1, arg2 = X2, ...))
For each component, just add group_by
masterdata %>%
group_by(D1) %>%
summarise(result = your_function(arg1 = X1, arg2 = X2, ...))
If you are outputting dataframes then creating a function that performs your calculations when passed a dataframe, and outputs a dataframe will be key. In the below example the function is called your_function().
For simplicity, a three-stage process is used: first create the output dataframe for the overall dataset, then use lapply to perform the same calculations on the sub-datasets. The sub-dataset results are then bound together into a single dataframe before finally being combined with the output for the full dataset.
Note: I created a new variable called "Subset" so that the outputs are all identifiable as belonging to each distinct set.
library(dplyr)
FullSet <- your_function(masterdata) %>% mutate(Subset = "Full")
SubSets <- lapply(unique(masterdata$D1), function(n){
masterdata %>% filter(D1 == n) %>%
your_function(.) %>% mutate(Subset = n)
}) %>% bind_rows()
FinalSet <- bind_rows(FullSet, SubSets)
If you want to run the process in parallel for speed, use mclapply() from the parallel package in place of lapply():
mclapply(unique(masterdata$D1), function..., mc.cores = detectCores())
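If the existing for loop has to stay, another option is to add one extra entry for the whole data set to the list being looped over, and skip the subsetting for that entry. The sketch below assumes a toy masterdata with a single numeric column; the label "All" and the use of sum() as a stand-in for the hundreds of lines of calculations are illustrative choices, not part of the original code.

```r
# Toy stand-in for masterdata
masterdata <- data.frame(
  D1 = c("A", "B", "C", "B", "B", "C", "C", "A", "B", "A"),
  X2 = 1:10
)

Component.List <- c("A", "B", "C")
Groups <- c("All", Component.List)   # "All" is a hypothetical label

# One result column per group, including the whole data set
Result.Frame <- as.data.frame(matrix(0, nrow = 1, ncol = length(Groups)))
names(Result.Frame) <- Groups

for (k in seq_along(Groups)) {
  # Whole data on the first pass, one component on each later pass
  subdata <- if (Groups[k] == "All") {
    masterdata
  } else {
    subset(masterdata, D1 == Groups[k])
  }
  # Stand-in for the real calculations on "subdata"
  Result.Frame[1, k] <- sum(subdata$X2)
}

Result.Frame
```

The body of the loop is written once and does not need to know whether it is looking at the full data or at a component.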
I have a data frame with daily observations, which I would like to interpolate. I use automap to build a variogram for every day and then apply it to new data. I try to run a loop and put the results in a new data frame. Unfortunately, the data frame with the results only contains the last predicted day.
coordinates(mydata) <- ~lat+long
coordinates(new_data) <- ~lat+long
df <- data.frame(matrix(nrow=50,ncol=10)) #new data frame for predicted values
for(i in 1:ncol(mydata))
kriging_new <- autoKrige(mydata[,i],mydata,newdata)
pred <- kriging_new$krige_output$var1.pred
df[,i] <- data.frame(pred)
The result looks like this, all the columns should be filled with values, not just the last one:
X1 X2 X3 X4 X5 X6 X7 X8 X9 X10
1 NA NA NA NA NA NA NA NA NA 12.008726
2 NA NA NA NA NA NA NA NA NA 6.960499
3 NA NA NA NA NA NA NA NA NA 10.894787
4 NA NA NA NA NA NA NA NA NA 14.378945
5 NA NA NA NA NA NA NA NA NA 17.719522
I also get a warning, saying:
Warning message:
In autofitVariogram(formula, data_variogram, model = model, kappa = kappa, :
Some models where removed for being either NULL or having a negative sill/range/nugget,
set verbose == TRUE for more information
If I do autoKrige manually for each row, everything works fine. It seems the loop is not working as it usually does. Is this some problem in the automap package?
Thanks a lot!
I think you just forgot to enclose the code in your for loop in curly brackets. As a result you execute the loop 10 times, overwriting kriging_new with itself every time:
for(i in 1:ncol(mydata))
kriging_new <- autoKrige(mydata[,i],mydata,newdata)
Only then do you assign the result from your last iteration:
pred <- kriging_new$krige_output$var1.pred
and finally assign those to the last column of your data frame holding your predictions (the loop counter i is still set to 10 at this point):
df[, i] <- data.frame(pred)
Always write loops with multiple lines of statements like this:
for (condition) {
statement1
statement2
...
}
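The effect can be reproduced without automap. In this small sketch (all names are made up), only the first statement after for() belongs to the loop when the braces are missing:

```r
# Without braces: only the line directly after for() is repeated
out_no_braces <- numeric(3)
for (i in 1:3)
  tmp <- i * 10
out_no_braces[i] <- tmp   # runs once, after the loop, with i still equal to 3

# With braces: both statements run on every iteration
out_braces <- numeric(3)
for (i in 1:3) {
  tmp <- i * 10
  out_braces[i] <- tmp
}

out_no_braces
out_braces
```

out_no_braces ends up as 0 0 30, with only the last position filled, just like the last column of df in the question, while out_braces is 10 20 30.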
I want to create a lot of variables across several separate dataframes which I will then combine into one grand data frame.
Each sheet is labeled by a letter (there are 24) and each sheet contributes somewhere between 100-200 variables. I could write it as such:
a$variable1 <- NA
a$variable2 <- NA
.
.
.
w$variable25 <- NA
This can/will get ugly, and I'd like to write a loop or use a vector to do the work. I'm having a heck of a time doing it though.
I essentially need a script which will allow me to specify a form and then just tack numbers onto it.
So,
a$variable[i] <- NA
where [i] gets tacked onto the actual variable created.
I just learnt this neat little trick from @eddi
#created some random dataset with 3 columns
library(data.table)
a <- data.table(
a1 = c(1,5),
a2 = c(2,1),
a3 = c(3,4)
)
#assuming that you now need to add more columns from a4 to a200
# first, creating the sequence from 4 to 200
v = c(4:200)
# then using that sequence to add the 197 more columns
a[, paste0("a", v) :=
NA]
# now a has 200 columns, as compared to the three we initiated it with
dim(a)
#[1] 2 200
I don't think you actually need this, although you seem to think so for some reason.
Maybe something like this:
a <- as.data.frame(matrix(NA, ncol=10, nrow=5))
names(a) <- paste0("Variable", 1:10)
print(a)
# Variable1 Variable2 Variable3 Variable4 Variable5 Variable6 Variable7 Variable8 Variable9 Variable10
# 1 NA NA NA NA NA NA NA NA NA NA
# 2 NA NA NA NA NA NA NA NA NA NA
# 3 NA NA NA NA NA NA NA NA NA NA
# 4 NA NA NA NA NA NA NA NA NA NA
# 5 NA NA NA NA NA NA NA NA NA NA
If you want variables with different types:
p <- 10 # number of variables
N <- 100 # number of records
vn <- vector(mode="list", length=p)
names(vn) <- paste0("V", seq(p))
vn[1:8] <- NA_real_ # numeric
vn[9:10] <- NA_character_ # character
df <- as.data.frame(lapply(vn, function(x, n) rep(x, n), n=N))
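Since the question involves many separate data frames, it may also help to keep them in a named list and add the columns in one pass. This is a minimal base R sketch under that assumption; sheets, add_vars and the column prefix are hypothetical names, and the two small data frames stand in for the 24 real sheets.

```r
# Hypothetical stand-ins for the lettered sheets
sheets <- list(
  a = data.frame(id = 1:2),
  b = data.frame(id = 1:3)
)

# Add n empty columns named <prefix>1 ... <prefix>n to one data frame
add_vars <- function(d, n, prefix = "variable") {
  d[paste0(prefix, seq_len(n))] <- NA
  d
}

# Same call on every sheet
sheets <- lapply(sheets, add_vars, n = 5)

names(sheets$a)
```

If each sheet needs a different number of variables, the same helper could be called with Map() over the list and a vector of counts.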