I'm running a principal component analysis on a dataset with more than 1000 variables. I'm using RStudio, and when I run the summary to see the cumulative variance of the components, I can only see the last few hundred components. How do I limit the summary to show only, say, the first 100 components?
I tried this and it seems to be working:
l = loadings(prin)
l[,1:100]
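For the variance table itself, the same subsetting idea works on quantities derived from prin$sdev (a minimal sketch, assuming prin is the princomp fit from the question):
vars  <- prin$sdev^2
props <- vars / sum(vars)
rbind(`Standard deviation`     = prin$sdev[1:100],
      `Proportion of Variance` = props[1:100],
      `Cumulative Proportion`  = cumsum(props)[1:100])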
It's pretty easy to modify print.summary.princomp (you can see the original code by typing stats:::print.summary.princomp) to do this:
pcaPrint <- function(x, digits = 3, loadings = x$print.loadings,
                     cutoff = x$cutoff, n, ...) {
  # Check for a sensible value of n; default to full output
  if (missing(n) || n > length(x$sdev) || n < 1) n <- length(x$sdev)
  vars <- x$sdev^2
  vars <- vars / sum(vars)
  cat("Importance of components:\n")
  print(rbind(`Standard deviation`     = x$sdev[1:n],
              `Proportion of Variance` = vars[1:n],
              `Cumulative Proportion`  = cumsum(vars)[1:n]))
  if (loadings) {
    cat("\nLoadings:\n")
    cx <- format(round(x$loadings, digits = digits))
    cx[abs(x$loadings) < cutoff] <- paste(rep(" ", nchar(cx[1, 1], type = "w")),
                                          collapse = "")
    print(cx[, 1:n], quote = FALSE, ...)
  }
  invisible(x)
}
pcaPrint(summary(princomp(USArrests, cor = TRUE), loadings = TRUE, cutoff = 0.2),
         digits = 2, n = 2)
Edited to include a basic check for a sensible value of n. Now that I've done this, I wonder whether it's worth suggesting to R Core as a permanent addition; it seems simple and potentially useful.
You can put the loadings in matrix form: save the matrix to a variable and then subset it (à la matrix[,1:100]) to see the first/middle/last n. In this example, I've used head(). Each column is a principal component.
head(
  matrix(prin$loadings,
         ncol = length(dimnames(prin$loadings)[[2]]),
         nrow = length(dimnames(prin$loadings)[[1]])),
  100)
I want to loop through a list of elements and use them in a function. Here is partial code, from a t-test in the jamovi package.
j = c("K1", "K2")
for (i in j) {
temp1 <- jmv::ttestIS(
formula = i ~ T1,
data = data1,
vars = i,
students = FALSE,
eqv = TRUE)
temp1
}
I get an error saying that the argument i is not in the dataset. I suppose that's because the function sees i as a literal name rather than as "K1". Any ideas how to access the loop elements inside the function?
P.S. I can provide reproducible code if needed.
You can use as.formula(). However, please note that you are overwriting temp1 on each iteration of the loop; was that your intention? (See the sketch after the code for a way to keep every result.)
for (i in j) {
  formula = as.formula(paste(i, "~T1"))
  temp1 <- jmv::ttestIS(
    formula = formula,
    data = data1,
    students = FALSE,
    eqv = TRUE)
  temp1
}
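If you meant to keep every result rather than overwrite temp1, one option (a sketch, using the same data1/T1 setup as the question) is to collect the results in a named list:
results <- list()
for (i in j) {
  results[[i]] <- jmv::ttestIS(
    formula = as.formula(paste(i, "~T1")),
    data = data1,
    students = FALSE,
    eqv = TRUE)
}
results[["K1"]]  # the t-test results for K1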
Instead of using
for (i in j)
use
for (i in 1:length(j))
and subset j with j[i] inside the loop body.
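A sketch of that index-based version, reusing the as.formula() idea from the answer above (data1 and T1 as in the question):
for (i in 1:length(j)) {
  temp1 <- jmv::ttestIS(
    formula = as.formula(paste(j[i], "~T1")),  # j[i] is "K1", then "K2"
    data = data1,
    students = FALSE,
    eqv = TRUE)
  print(temp1)
}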
I am a newbie to the R programming environment. Can anyone please help me with the following problem?
I have a .csv file with stock return data for 100-odd firms (207 days each). I need to estimate GARCH volatilities for each firm and save the output for all firms in a single .csv file. My data looks like this:
[screenshot: StockData]
Below is the reproducible code I have tried so far, unsuccessfully:
library(fGarch)
stdata <- read.csv("StockData.csv", header = T)
out <- vector("list", c(437))
for (j in length(names(stdata[,-1]))) {
  fit = garchFit(~ arma(1,0) + garch(1,1), data = stdata[,j], trace = F)
  volatility(fit)
  out = as.data.frame(volatility(fit))
}
write.csv(out, 'volatility.csv')
The output .csv file prints the volatility only for the last firm (the 100th firm). I also get the following warning message:
Warning message:
Using formula(x) is deprecated when x is a character vector of length > 1.
Consider formula(paste(x, collapse = " ")) instead.
My expected output will be as follows:
[screenshot: SampleVolatilityOutput]
Please tell me if there is a way to get all volatilities at once in a single .csv file.
You get that output because you overwrite out for each j in the loop. (Note also that for(j in length(...)) runs only once, with j set to the last index, which is why only the last firm appears.) If you store the results in a matrix, i.e. each column is the volatility for one firm, you don't need a list:
library(fGarch)
stdata <- matrix(rnorm(100 * 207), ncol = 100)
out <- matrix(rep(NA, 100 * 207), ncol = 100)
for (j in 1:ncol(stdata)) {
  fit <- garchFit(
    formula = ~ arma(1, 0) + garch(1, 1),
    data = stdata[, j],
    trace = F
  )
  out[, j] <- volatility(fit)
}
write.csv(out, 'volatility.csv')
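If you want identifiable firm names in the header of volatility.csv, you can label the columns before writing (a small sketch; with the real file the labels could come from names(stdata)[-1], assuming the first column is a date or ID rather than a return series):
colnames(out) <- paste0("Firm", seq_len(ncol(out)))  # placeholder labels
write.csv(out, "volatility.csv", row.names = FALSE)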
I want to draw a hexbin plot with ggplot, but with log scale "pretty" breaks for the frequency. Consider
df = data.frame(a=rnorm(1000)); df$b <- df$a+rnorm(1000);
I used this answer to get pretty breaks on a linear scale:
ggplot(df, aes(a, b)) +
  geom_hex(aes(fill = cut(..value.., breaks = pretty(..value.., n = 10)))) +
  scale_fill_discrete("Frequency")
This works. Now say I want to use log scale pretty breaks. So I used the idea from another answer to define
base_breaks <- function(n = 10) {
  function(x) {
    axisTicks(log10(range(x, na.rm = TRUE)), log = TRUE, n = n)
  }
}
and try to do
ggplot(df, aes(a, b)) +
  geom_hex(aes(fill = cut(..value.., breaks = base_breaks(n = 10)(..value..))))
but it is not able to find the function. It says:
Error in cut.default(value, breaks = base_breaks(n = 10)(value)) :
could not find function "base_breaks"
This happens even though base_breaks is defined and works at the console:
> base_breaks(n=10)(c(1:1000))
[1] 1 5 10 50 100 500 1000
How can I make my function visible in whatever environment ggplot calls it from? I even defined it as a global variable with
base_breaks <<- function(n = 10) {
  function(x) {
    axisTicks(log10(range(x, na.rm = TRUE)), log = TRUE, n = n)
  }
}
but I still get the same error.
I am not sure about it, but you could try simplifying the function like this:
base_breaks <<- function(n = 10, x) {
  axisTicks(log10(range(x, na.rm = TRUE)), log = TRUE, n = n)
}
Maybe the problem is that you have a function whose result is another function, and that could be causing the error. With this approach you would get the values more directly. Check it out!
I can't check it myself, since I get an error: object 'value' not found...
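One more thing worth trying (untested here, and assuming the cause is that calculated aesthetics are evaluated in an environment that can't see user-defined functions, as older ggplot2 versions did): inline the break computation and qualify axisTicks with its package, so nothing has to be looked up in ggplot's evaluation environment:
ggplot(df, aes(a, b)) +
  geom_hex(aes(fill = cut(..value..,
                          breaks = grDevices::axisTicks(
                            log10(range(..value.., na.rm = TRUE)),
                            log = TRUE, nint = 10)))) +
  scale_fill_discrete("Frequency")
(nint is the full name of the axisTicks argument that n = n partially matched in base_breaks.)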
I wrote this code using a for-statement. (Its main purpose is to build a list of different webpage URLs, which are then retrieved via httr and rvest.)
r = "asdgkjkhdf"
t = "osrt"
all = c()
for(i in 1:400)
{
y = paste(r, i, sep = '')
d = paste(y, t, sep = '')
all = c(all, d)
}
all
I get results like these (the pasted numbers are actually accumulating in each result):
[1] asdgkjkhdf1osrt
[2] asdgkjkhdf12osrt
[3] asdgkjkhdf123osrt
[4] asdgkjkhdf1234osrt
...
But I want results like these, regardless of how many iterations I put in the for() function:
[1] asdgkjkhdf1osrt
[2] asdgkjkhdf2osrt
...
[400] asdgkjkhdf400osrt
What should I change to get the result I want? Should I use paste(substr(), substr(), sep='')?
If you really want to use a for-statement, you can use the following:
r = "asdgkjkhdf"
t = "osrt"
all = c()
for (idx in 1:400)
all = c(all, paste0(r, idx, t))
However, in R you should prefer code without for-statements, since they are generally less readable and hurt performance. The solution without the for-statement (given by Roland in the comments) is
all <- paste0(r, 1:400, t)
Note that paste0("string") is just shorthand for paste("string", sep = '').
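A quick check that the vectorized version matches the expected output:
all <- paste0(r, 1:400, t)
head(all, 2)  # "asdgkjkhdf1osrt" "asdgkjkhdf2osrt"
tail(all, 1)  # "asdgkjkhdf400osrt"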
I came across this function a while back that was created for fixing PCA values. The problem with the function was that it wasn't compatible with xts time series objects.
amend <- function(result) {
  result.m <- as.matrix(result)
  n <- dim(result.m)[1]
  delta <- apply(abs(result.m[-1,] - result.m[-n,]), 1, sum)
  delta.1 <- apply(abs(result.m[-1,] + result.m[-n,]), 1, sum)
  signs <- c(1, cumprod(rep(-1, n-1) ^ (delta.1 <= delta)))
  zoo(result * signs)
}
The full example can be found at https://stats.stackexchange.com/questions/34396/im-getting-jumpy-loadings-in-rollapply-pca-in-r-can-i-fix-it
The problem is that applying the function to an xts object with multiple columns and rows won't solve the problem. Is there an elegant way of applying the algorithm to a multi-column xts object?
My current solution, given a single column with multiple rows, is to loop through row by row... which is slow and tedious. Imagine having to do it column by column as well.
Thanks,
Here is some code to get one started:
rm(list=ls())
require(RCurl)
sit = getURLContent('https://github.com/systematicinvestor/SIT/raw/master/sit.gz', binary=TRUE, followlocation = TRUE, ssl.verifypeer = FALSE)
con = gzcon(rawConnection(sit, 'rb'))
source(con)
close(con)
load.packages('quantmod')
data <- new.env()
tickers <- spl("VTI,IEF,VNQ,TLT")
getSymbols(tickers, src = 'yahoo', from = '1980-01-01', env = data, auto.assign = T)
for (i in ls(data)) data[[i]] = adjustOHLC(data[[i]], use.Adjusted = T)
bt.prep(data, align = 'remove.na', dates = '1990::2013')
prices <- data$prices[,-10]  # don't include cash
retmat <- na.omit(prices / mlag(prices) - 1)
princomproll <- rollapply(retmat, 500, function(x) summary(princomp(x))$loadings[, 1],
                          by.column = FALSE, align = "right")
require(lattice)
xyplot(amend(princomproll))
plotting "princomproll" will get you jumpy loadings...
It isn't very obvious how the amend function relates to the rest of the script (beyond the final xyplot call), or what you are trying to achieve. There are a couple of small changes that can be made; I haven't profiled the difference, but the result is a little more readable if nothing else.
You remove the first and last rows of the result twice.
rowSums might be slightly more efficient for getting the row sums than apply.
rep.int is a little bit faster than rep.
amend <- function(result) {
  result <- as.matrix(result)
  n <- nrow(result)
  without_first_row <- result[-1, ]
  without_last_row <- result[-n, ]
  delta_minus <- rowSums(abs(without_first_row - without_last_row))
  delta_plus <- rowSums(abs(without_first_row + without_last_row))
  signs <- c(1, cumprod(rep.int(-1, n - 1) ^ (delta_plus <= delta_minus)))
  zoo(result * signs)
}
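A usage sketch, assuming princomproll from the script in the question; the rewrite is a drop-in replacement for the original amend:
fixed <- amend(princomproll)  # same sign correction, built with rowSums/rep.int
xyplot(fixed)                 # the loadings should no longer flip sign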