How can I create a data frame with 2 rows with this structure?
X1 Y1 Calc1 X2 Y2 Calc2 … Xn Yn Calcn
1 4 0.25 2 5 0.4 i i+3 i/i+3
I tried using this code:
dataRowTemp<-numeric(length = 0)
dataRow<-numeric(length = 0)
headerRowTemp<-character(length = 0)
headerRow<-character(length = 0)
for (i in 1:150){
X<- i
Y<- i+3
Calc <- X/Y
dataRowTemp <- c(X,Y,Calc)
dataRow<-c(dataRow,dataRowTemp)
headerRowTemp <- paste(c("X", i),c("Y", i),c("Calc", i),sep='')
headerRow<-c(headerRow,headerRowTemp)
}
unfortunately, I can’t create the a correct header (titleRow) and how can I combine them to data.frame later?
Is there an elegant and better way to do so?
Build a function to be used in each iteration.
myfun <- function(i) {
X <- i
Y <- i + 3
c(X = X, Y = Y, Calc = X/Y)
}
Set the number of iterations.
n <- 150
Apply the function to the numbers from 1 to n, use matrix(..., nrow = 1) to store the output in a matrix of only 1 row, and transform it into a data.frame (because it is what you say you aim at).
mydf <- data.frame(matrix(sapply(seq_len(n), myfun), nrow = 1))
Use paste0 in a loop to iteratively assign names to the column of your data.frame.
names(mydf) <- c(sapply(seq_len(n), function(i) paste0(c('X', 'Y', 'Calc'), i)))
Related
I am working on an assignment, which tasks me to generate a list of data, using the below code.
##Use the make_data function to generate 25 different datasets, with mu_1 being a vector
x <- seq(0, 3, len=25)
make_data <- function(a){
n = 1000
p = 0.5
mu_0 = 0
mu_1=a
sigma_0 = 1
sigma_1 = 1
y <- rbinom(n, 1, p)
f_0 <- rnorm(n, mu_0, sigma_0)
f_1 <- rnorm(n, mu_1, sigma_1)
x <- ifelse(y == 1, f_1, f_0)
test_index <- createDataPartition(y, times = 1, p = 0.5, list = FALSE)
list(train = data.frame(x = x, y = as.factor(y)) %>% slice(-test_index),
test = data.frame(x = x, y = as.factor(y)) %>% slice(test_index))
}
dat <- sapply(x,make_data)
The code looks good to go, and 'dat' appears to be a 25 column, 2 row table, each with its own data frame.
Now, each data frame within a cell has 2 columns.
And this is where I get stuck.
While I can get to the data frame in row 1, column 1, just fine (i.e. just use dat[1,1]), I can't reach the column of 'x' values within dat[1,1]. I've experimented with
dat[1,1]$x
dat[1,1][1]
But they only throw weird responses: error/null.
Any idea how I can pull the column? Thanks.
dat[1, 1] is a list.
class(dat[1, 1])
#[1] "list"
So to reach to x you can do
dat[1, 1]$train$x
Or
dat[1, 1][[1]]$x
As a sidenote, instead of having this 25 X 2 matrix as output in dat I would actually prefer to have a nested list.
dat <- lapply(x,make_data)
#Access `x` column of first list from `train` dataset.
dat[[1]]$train$x
However, this is quite subjective and you can chose whatever format you like the best.
I have two datasets, each with 5 columns and 10,000 rows. I want to calculate y from values in columns between the two datasets, column 1 in data set 1 and column 1 in data set 2; then column 2 in data set 1 and column 2 in data set 2. The yneeds nonetheless to follow a set of rules before being calculated. What I did so far doesn't work, and I cannot figure it out why and if there is a easier way to do all of this.
Create data from t-distributions
mx20 <- as.data.frame(replicate(10000, rt(20,19)))
mx20.50 <- as.data.frame(replicate(10000, rt(20,19)+0.5))
Calculates the mean for each simulated sample
m20 <- apply(mx20, FUN=mean, MARGIN=2)
m20.05 <- apply(mx20.50, FUN=mean, MARGIN=2)
The steps 1 and 2_ above are repeated for five sample sizes from t-distributions rt(30,29); rt(50,49); rt(100,99); and rt(1000,999)
Bind tables (create data.frame) for each t-distribution specification
tbl <- cbind(m20, m30, m50, m100, m1000)
tbl.50 <- cbind(m20.05, m30.05, m50.05, m100.05, m1000.05)
Finally, I want to calculate the y as specified above. But here is where I get totally lost. Please see below my best attempt so far.
y = (mtheo-m0)/(m1-m0), where y = 0 when m1 < m0 and y = y when m1 >= m0. mtheo is a constant (e.g. 0.50), m1 is value in column 1 of tbl and m0 is value in column 1 of tbl.50.
ycalc <- function(mtheo, m1, m0) {
ifelse(m1>=m0) {
y = (mteo-m0)/(m1-m0)
} ifelse(m1<m0) {
y=0
} returnValue(y)
}
You can try this. I used data frames instead of data tables.
This code is more versatile. You can add or remove parameters. Below are the parameters that you can use to create t distributions.
params = data.frame(
n = c(20, 30, 50, 100, 1000),
df = c(19, 29, 49, 99, 999)
)
And here is a loop that creates the values you need for each t distribution. You can ignore this part if you already have those values (or code to create those values).
tbl = data.frame(i = c(1:10000))
tbl.50 = data.frame(i = c(1:10000))
for (i in 1:nrow(params)) {
mx = as.data.frame(replicate(10000, rt(params[i, 1], params[i, 2])))
m <- apply(mx, FUN=mean, MARGIN=2)
tbl = cbind(tbl, m)
names(tbl)[ncol(tbl)] = paste("m", params[i, 1], sep="")
mx.50 = as.data.frame(replicate(10000, rt(params[i, 1], params[i, 2])+.5))
m.50 <- apply(mx.50, FUN=mean, MARGIN=2)
tbl.50 = cbind(tbl.50, m.50)
names(tbl.50)[ncol(tbl.50)] = paste("m", params[i, 1], ".50", sep="")
}
tbl = tbl[-1]
tbl.50 = tbl.50[-1]
And here is the loop that does the calculations. I save them in a data frame (y). Each column in this data frame is the result of your function applied for all rows.
mtheo = .50
y = data.frame(i = c(1:10000))
for (i in 1:nrow(params)) {
y$dum = 0
idx = which(tbl[, i] >= tbl.50[, i])
y[idx, ]$dum =
(mtheo - tbl.50[idx, i]) /
(tbl[idx, i] - tbl.50[idx, i])
names(y)[ncol(y)] = paste("y", params[i, 1], sep="")
}
y = y[-1]
You could try this, if the first column in tbl is called m0 and the first column in tbl.50 is called m1:
mteo <- 0.5
ycalc <- ifelse(tbl$m1 >= tbl.50$m0, (mteo - tbl.50$m0)/(tbl$m1 - tbl.50$m0),
ifelse(tbl$m1 < tbl.50$m0), 0, "no")
Using the same column names provided by your code, and transforming your matrices into dataframes:
tbl <- data.frame(tbl)
tbl.50 <- data.frame(tbl.50)
mteo <- 0.5
ycalc <- ifelse(tbl$m20 >= tbl.50$m20.05, (mteo - tbl.50$m20.05)/(tbl$m20 - tbl.50$m20.05),
ifelse(tbl$m20 < tbl.50$m20.05, "0", "no"))
This results in:
head(ycalc)
[1] "9.22491706576716" "0" "0" "0" "0" "1.77027049630147"
I managed to apply a linear regression for each subject of my data frame and paste the values into a new dataframe using a for-loop. However, I think there should be a more readable way of achieving my result using an apply function, but all my attempts fail. This is how I do it:
numberOfFiles <- length(resultsHick$subject)
intslop <- data.frame(matrix(0,numberOfFiles,4))
intslop <- rename(intslop,
subject = X1,
intercept = X2,
slope = X3,
Rsquare = X4)
cond <- c(0:3)
allSubjects <- resultsHick$subject
for (i in allSubjects)
{intslop[i,1] <- i
yvalues <- t(subset(resultsHick,
subject == i,
select = c(H0meanRT, H1meanRT, H2meanRT, H258meanRT)))
fit <- lm(yvalues ~ cond)
intercept <- fit$coefficients[1]
slope <- fit$coefficients[2]
rsquared <- summary(fit)$r.squared
intslop[i,2] <- intercept
intslop[i,3] <- slope
intslop[i,4] <- rsquared
}
The result should look the same as
> head(intslop)
subject intercept slope Rsquare
1 1 221.3555 54.98290 0.9871209
2 2 259.4947 66.33344 0.9781499
3 3 227.8693 47.28699 0.9537868
4 4 257.7355 80.71935 0.9729132
5 5 197.4659 49.57882 0.9730409
6 6 339.1649 61.63161 0.8213179
...
Does anybody know a more readable way of writing this code using an apply function?
One common pattern I use to replace for loops that aggregate data.frames is:
do.call(
rbind,
lapply(1:numberOfDataFrames,
FUN = function(i) {
print(paste("Processing index:", i)) # helpful to see how slow/fast
temp_df <- do_some_work[i]
temp_df$intercept <- 1, etc.
return(temp_df) # key is to return a data.frame for each index.
}
)
)
I need some help vectorizing the following code because I believe that it will become more efficient. However i do not know how to begin... I created a loop that goes through z. z has 3 columns and 112847 rows, which might be a reason it takes a long time. The 3 columns contain numbers that are used in the MACD() function...
library(quantmod)
library(TTR)
# get stock data
getSymbols('LUNA')
#Choose the Adjusted Close of a Symbol
stock <- Ad(LUNA)
#Create matrix for returns only
y <- stock
#Create a "MATRIX" by choosing the Adjusted Close
Nudata3 <- stock
#Sharpe Ratio Matrix
SR1<- matrix(NA, nrow=1)
# I want to create a table with all possible combinations from the ranges below
i = c(2:50)
k = c(4:50)
j = c(2:50)
# stores possible combinations into z
z <- expand.grid(i,k,j)
colnames(z)<- c("one","two","three")
n = 1
stretches <- length(z[,1])
while (n < stretches){
# I am trying to go through all the values in "z"
Nuw <- MACD((stock), nFast=z[n,1], nSlow=z[n,2], nSig=z[n,3], maType="EMA")
colnames(Nuw) <- c("MACD","Signal") #change the col names to create signals
x <- na.omit(merge((stock), Nuw))
x$sig <- NA
# Create trading signals
sig1 <- Lag(ifelse((x$MACD <= x$Signal),-1, 0)) # short when MACD < SIGNAL
sig2 <- Lag(ifelse((x$MACD >= x$Signal),1, 0)) # long when MACD > SIGNAL
x$sig <- sig1 + sig2
#calculate Returns
ret <- na.omit(ROC(Ad(x))*x$sig)
colnames(ret)<- c(paste(z[n,1],z[n,2],z[n,3],sep=","))
x <- merge(ret,x)
y <- merge(y,ret) #This creates a MATRIX with RETURNs ONLY
Nudata3 <- merge(Nudata3, x)
((mean(ret)/sd(ret)) * sqrt(252)) -> ANNUAL # Creates a Ratio
ANNUAL->Shrat # stores Ratio into ShRat
SR1 <- cbind(SR1,Shrat) # binds all ratios as it loops
n <- (n+1)
}
I would like to know how to vectorize the MACD() function, to speed up the process since the length of stretches is approx. 112847. It takes my computer quite some time to go through the loop itself.
First and foremost - case specific optimization - remove the cases where nFast > nSlow as it doesn't make sense technically.
Secondly - you are creating objects and copying them over and over again. This is very expensive.
Thirdly - you can code this better perhaps by creating a matrix of signals in one loop and doing rest of the operations in vectorized manner.
I would code what you are doing something like this.
Please read help pages of mapply, do.call, merge and sapply if you don't understand.
require(quantmod)
getSymbols("LUNA")
#Choose the Adjusted Close of a Symbol
stock <- Ad(LUNA)
# I want to create a table with all possible combinations from the ranges below
i = c(2:50)
k = c(4:50)
j = c(2:50)
# stores possible combinations into z
z <- expand.grid(i,k,j)
IMO : This is where your first optimization should be. Remove cases where i > k
z <- z[z[,1]<z[,2], ]
It reduces the number of cases from 112847 to 57575
#Calculate only once. No need to calculate this in every iteration.
stockret <- ROC(stock)
getStratRet <- function(nFast, nSlow, nSig, stock, stockret) {
x <- MACD((stock), nFast=nFast, nSlow=nSlow, nSig=nSig, maType="EMA")
x <- na.omit(x)
sig <- Lag(ifelse((x$macd <= x$signal),-1, 0)) + Lag(ifelse((x$macd >= x$signal),1, 0))
return(na.omit(stockret * sig))
}
RETURNSLIST <- do.call(merge, mapply(FUN = getStratRet, nFast = z[,1], nSlow = z[,2], nSig = z[,3], MoreArgs = list(stock = stock, stockret = stockret), SIMPLIFY = TRUE))
getAnnualSharpe <- function(ret) {
ret <- na.omit(ret)
return ((mean(ret)/sd(ret)) * sqrt(252))
}
SHARPELIST <- sapply(RETURNSLIST, FUN = getAnnualSharpe)
Results will be as below. Which column belongs to which combo of i, j, k is trivial.
head(RETURNSLIST[, 1:3])
## LUNA.Adjusted LUNA.Adjusted.1 LUNA.Adjusted.2
## 2007-01-10 0.012739026 -0.012739026 0
## 2007-01-11 -0.051959739 0.051959739 0
## 2007-01-12 -0.007968170 -0.007968170 0
## 2007-01-16 -0.007905180 -0.007905180 0
## 2007-01-17 -0.005235614 -0.005235614 0
## 2007-01-18 0.028315920 -0.028315920 0
SHARPELIST
## LUNA.Adjusted LUNA.Adjusted.1 LUNA.Adjusted.2 LUNA.Adjusted.3 LUNA.Adjusted.4 LUNA.Adjusted.5 LUNA.Adjusted.6
## 0.04939150 -0.07428392 NaN 0.02626382 -0.06789803 -0.22584987 -0.07305477
## LUNA.Adjusted.7 LUNA.Adjusted.8 LUNA.Adjusted.9
## -0.05831643 -0.08864845 -0.08221986
system.time(
+ RETURNSLIST <- do.call(merge, mapply(FUN = getStratRet, nFast = z[1:100,1], nSlow = z[1:100,2], nSig = z[1:100,3], MoreArgs = list(stock = stock, stockret = stockret), SIMPLIFY = TRUE)),
+ SHARPELIST <- sapply(RETURNSLIST, FUN = getAnnualSharpe)
+ )
user system elapsed
2.28 0.00 2.29
I am running correlations between variables, some of which have missing data, so the sample size for each correlation are likely different. I tried print and summary, but neither of these shows me how big my n is for each correlation. This is a fairly simple problem that I cannot find the answer to anywhere.
like this..?
x <- c(1:100,NA)
length(x)
length(x[!is.na(x)])
you can also get the degrees of freedom like this...
y <- c(1:100,NA)
x <- c(1:100,NA)
cor.test(x,y)$parameter
But I think it would be best if you show the code for how your are estimating the correlation for exact help.
Here's an example of how to find the pairwise sample sizes among the columns of a matrix. If you want to apply it to (certain) numeric columns of a data frame, combine them accordingly, coerce the resulting object to matrix and apply the function.
# Example matrix:
xx <- rnorm(3000)
# Generate some NAs
vv <- sample(3000, 200)
xx[vv] <- NA
# reshape to a matrix
dd <- matrix(xx, ncol = 3)
# find the number of NAs per column
apply(dd, 2, function(x) sum(is.na(x)))
# tack on some column names
colnames(dd) <- paste0("x", seq(3))
# Function to find the number of pairwise complete observations
# among all pairs of columns in a matrix. It returns a data frame
# whose first two columns comprise all column pairs
pairwiseN <- function(mat)
{
u <- if(is.null(colnames(mat))) paste0("x", seq_len(ncol(mat))) else colnames(mat)
h <- expand.grid(x = u, y = u)
f <- function(x, y)
sum(apply(mat[, c(x, y)], 1, function(z) !any(is.na(z))))
h$n <- mapply(f, h[, 1], h[, 2])
h
}
# Call it
pairwiseN(dd)
The function can easily be improved; for example, you could set h <- expand.grid(x = u[-1], y = u[-length(u)]) to cut down on the number of calculations, you could return an n x n matrix instead of a three-column data frame, etc.
Here is a for-loop implementation of Dennis' function above to output an n x n matrix rather than have to pivot_wide() that result. On my databricks cluster it cut the compute time for 1865 row x 69 column matrix down from 2.5 - 3 minutes to 30-40 seconds.
Thanks for your answer Dennis, this helped me with my work.
pairwise_nxn <- function(mat)
{
cols <- if(is.null(colnames(mat))) paste0("x", seq_len(ncol(mat))) else colnames(mat)
nn <- data.frame(matrix(nrow = length(cols), ncol = length(cols)))
rownames(nn) <- colnames(nn) <- cols
f <- function(x, y)
sum(apply(mat[, c(x, y)], 1, function(z) !any(is.na(z))))
for (i in 1:nrow(nn))
for (j in 1:ncol(nn))
nn[i,j] <- f(rownames(nn)[i], colnames(nn)[j])
nn
}
If your variables are vectors named a and b, would something like sum(is.na(a) | is.na(b)) help you?