How to loop over different columns of a data frame in R?

Basically I was working on a portfolio return problem. The stock returns look like this:
library(xts)                    # for xts()
library(PerformanceAnalytics)   # for Return.portfolio()
AMZN <- c(0.1, 0.3, 0.4, 0.2)
BBY <- c(0.2, 0.4, 0.5, 0.3)
TGT <- c(-0.1, -0.3, -0.2, -0.5)
df1 <- data.frame(AMZN, BBY, TGT)
date <- c("2000-01-01", "2000-02-01", "2000-03-01", "2000-04-01")
date <- as.Date(date, "%Y-%m-%d")
df1 <- cbind(date, df1)
xts <- xts(df1[, -1], order.by = df1[, 1])
I want to use Return.portfolio(xts, weight) from PerformanceAnalytics to calculate the portfolio return. The weights look like this:
w1 <- c(0.2, 0.3, 0.1, 0.4)
w2 <- c(0.5, 0.1, 0.1, 0.3)
w3 <- c(0.1, 0.1, 0.4, 0.4)
Weights <- data.frame(w1, w2, w3)
Since there are several groups of weights, I need to get multiple portfolio returns.
The code I tried is
for (i in colnames(Weights)){
Return.portfolio(xts, (Weights[[i]]))
}
Although R does not report any error, the only object I end up with is the loop variable i, whose value is "w3"; the portfolio returns are never stored.

I think you may need to initialize a NULL object first and assign the result of each iteration to it. Maybe something like this:
Return<-NULL
for (i in 1:ncol(Weights)){
Return<- cbind(Return, Return.portfolio(xts, (Weights[[i]])))
}
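An alternative sketch of the same idea (assuming the xts and PerformanceAnalytics packages are loaded as above, and that each weight vector has one weight per asset column of xts): loop with lapply over the weight columns and bind the results, keeping the weight names as column names.
returns_list <- lapply(Weights, function(w) Return.portfolio(xts, weights = w))
Return <- do.call(cbind, returns_list)   # one column of portfolio returns per weight vector
colnames(Return) <- colnames(Weights)
Return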

combining lists and dataframes in R from raster values

QUESTION EDITED FOR CLARITY AND REPRODUCIBILITY
I am trying to summarize proportions of landcover classes within many buffers contained within a list. Although it appears to be a common problem, I have not found an appropriate solution:
I have a raster stack called hab_stack with discrete values 1-6 for each of 3 layers (each layer == year). I also have locational data with >800,000 locations called dat_sf. I have extracted hab_stack raster values within a 400 m buffer around each location.
I now have a large list with ~800,000 elements (not all hab classes 1-6 are represented in each list element). So I tried to create a dummy data frame called true_names with all hab_stack values 1-6 and a frequency/proportion of zero, to stand in for classes not represented within a buffer, because I need to combine all the proportions together. I have tried to accomplish this with an lapply loop but can't seem to get it quite right. Below is the full function and error:
sum_class <- lapply(values_hab, function(x){
true_names <- data.frame(x = 1:6, Freq = 0)
prop_df <- as.data.frame(prop.table(table(x))) %>%
mutate(x = as.numeric(x))
true_names %>%
anti_join(prop_df, by = "x") %>%
bind_rows(prop_df) %>%
arrange(x)
})
Error in `mutate()`:
! Problem while computing `x = as.numeric(x)`.
x `x` must be size 0 or 1, not 1659.
Run `rlang::last_error()` to see where the error occurred.
When I dissect the function, the error arises from the table(values_hab) call: Error in table(values_hab) : all arguments must have the same length.
I think a hypothetical list could look something like this, where there are different numbers of NAs and not all classes are represented in each element; also, see a data frame of my desired output below:
list <- list(c(1,1,1,2,2,2,3,3,4,4,4,NA,NA,NA,5,6),
c(1,2,3,4,NA,NA,NA,NA,4,4,4,4,NA,5,1,1),
c(5,5,5,5,5,1,2,2,2,2,NA,NA,NA,NA,NA,3))
desired_output <- data.frame(`1` = c(0.4, 0.5, 0.6, 0.5, 0.5, 0.3),
`2` = c(0.1, 0.1, 0.1, 0.1, 0.1, 0.2),
`3` = c(0.1, 0.1, 0.0, 0.1, 0.0, 0.3),
`4` = c(0.3, 0.2, 0.0, 0.1, 0.1, 0.1),
`5` = c(0.0, 0.1, 0.2, 0.2, 0.1, 0.0),
`6` = c(0.1, 0.0, 0.1, 0.0, 0.2, 0.1))
Any help is much appreciated. Best,
It looks like my function works and this was a very easy fix. dplyr::mutate() was picking up x as the entire input vector when I actually wanted to convert the x column of the proportion table, so I now do that step in base R instead. R is still running in the background, but this should have taken care of it.
sum_class_function <- function(x){
true_names <- data.frame(x = 1:6, Freq = 0)
prop_df <- as.data.frame(prop.table(table(x)))
prop_df$x <- as.numeric(prop_df$x)
temp<- true_names %>%
anti_join(prop_df, by = "x") %>%
bind_rows(prop_df) %>%
arrange(x)
return(temp)
}
sum_class <- lapply(values_hab, sum_class_function)
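If, as mentioned in the question, the per-buffer proportions then need to be combined into one table, here is a small follow-up sketch (assuming dplyr is loaded; the column name buffer_id is just an illustrative choice):
# Stack the list of per-buffer proportion tables into a single data frame,
# keeping the position in the list as an id column.
sum_class_all <- bind_rows(sum_class, .id = "buffer_id")
head(sum_class_all)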

How do I assign values to each other in triplets using R?

The situation is as follows:
I need to create a dataset of triplets. We have a discrete distribution of stock prices S <- c(80, 100, 120, 140, 160) with probabilities P <- c(0.2, 0.3, 0.2, 0.2, 0.1), a call option payoff C <- pmax(S - 120, 0) = c(0, 0, 0, 20, 40), and the liability of an option which pays 30 if S lies in a certain region and zero otherwise, namely L = 30 * I{110 <= S <= 150} = c(0, 0, 30, 30, 0). It is important to mention that if S[1] = 80 is drawn, then C[1] and L[1] must go with it; this holds for i = 1, 2, 3, 4, 5. How do you create a dataset of N = 10000 simulations where each drawn value for i corresponds to the other two values for the same i?
This is the code I had for now. Note that X_1 = S, X_2 = C and Y = L.
X_1 <- function(n) {
sample(c(80,100,120,140,160), size = n,
prob = c(0.2, 0.3, 0.2, 0.2, 0.1), replace=T)
}
X_2 <- function(n) {
sample(X_1 - 120, size = n,
prob = c(0.2, 0.3, 0.2, 0.2, 0.1), replace=T)
}
Y <- function(n) {
sample(L, size = n,
prob = c(0.2, 0.3, 0.2, 0.2, 0.1), replace=T)
}
##Creating triplets##
df <- data.frame(S_T = X_1(10000), C_T = X_2(10000), L_T =Y(10000))
df
I'm not sure if you want C_T to be dependent on the S_T values. If you do, I think you just want to call X_1, assign the results to an object, then use that as the argument to X_2 (or just subtract 120, which is what X_2 does).
X_1 <- function(n) {
sample(c(80,100,120,140,160), size = n,
prob = c(0.2, 0.3, 0.2, 0.2, 0.1), replace=T)
}
# Call that function
S_T <- X_1(10) # for practice
C_T <- S_T - 120 # that's all you're doing in function X_2, if you want to use S_T
If you want C_T to contain values independent of S_T, you can call one function within another:
X_1 <- function(n) {
sample(c(80,100,120,140,160), size = n,
prob = c(0.2, 0.3, 0.2, 0.2, 0.1), replace=T)
}
X_2 <- function(n) {
X_1(n) - 120
}
S_T <- X_1(10) # Same as above
C_T <- X_2(10) # Gives values not dependent on S_T
EDIT to address comment below:
It's hard to read the comment, but it looks like you want to create a function that takes the results of function X_1 and returns a result based on a condition. Use ifelse to handle each element one at a time. You can create another function and then pass in the results of function X_1:
Y <- function(X_1_func){
ifelse( X_1_func == 80,
return(0),
ifelse(X_1_func == 100,
return(0),
ifelse(X_1_func == 120,
return(30),
return(60) # Add a default value here or the last possible value if others are F
)
)
)
}
sapply(X_1(10), Y) # Use sapply to feed one element of X_1's output at a time. Assign the results to L or whatever you wish to call it.
If this all works for you, you can accept the answer.
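For completeness, a minimal sketch of another way to keep the triplets aligned (not taken from the answer above): sample the index i once per simulation and use it to pick S, C and L together.
S <- c(80, 100, 120, 140, 160)
P <- c(0.2, 0.3, 0.2, 0.2, 0.1)
C <- pmax(S - 120, 0)             # c(0, 0, 0, 20, 40)
L <- 30 * (S >= 110 & S <= 150)   # c(0, 0, 30, 30, 0)
idx <- sample(seq_along(S), size = 10000, prob = P, replace = TRUE)
df <- data.frame(S_T = S[idx], C_T = C[idx], L_T = L[idx])
head(df)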

Fastest way to select a valid range for raster data

Using R, I need to select the valid range for a given raster (from package raster) in the fastest possible way. I tried this:
library(raster)
library(microbenchmark)
library(ggplot2)
library(compiler)
r <- raster(ncol=100, nrow=100)
r[] <- runif(ncell(r))
#Let's see if precompiling helps speed...
f <- function(x, min, max) reclassify(x, c(-Inf, min, NA, max, Inf, NA))
g <- cmpfun(f)
#Benchmark!
compare <- microbenchmark(
calc(r, fun=function(x){ x[x < 0.2] <- NA; x[x > 0.8] <- NA; return(x)}),
reclassify(r, c(-Inf, 0.2, NA, 0.8, Inf, NA)),
g(r, 0.2, 0.8),
times=100)
autoplot(compare) #Reclassify is much faster, precompiling doesn't help much.
#Check they are the same...
identical(
calc(r, fun=function(x){ x[x < 0.2] <- NA; x[x > 0.8] <- NA; return(x)}),
reclassify(r, c(-Inf, 0.2, NA, 0.8, Inf, NA))
) #TRUE
identical(
reclassify(r, c(-Inf, 0.2, NA, 0.8, Inf, NA)),
g(r, 0.2, 0.8)
) #TRUE
The reclassify method is much faster, but I'm sure that it can be sped up more. How can I do so?
While the accepted answer to this question is true for the example raster, it is important to note that the fastest safe function is highly dependent on raster size: the functions h and i presented by @rengis are only faster with relatively small rasters (and relatively simple reclassifications). Just increasing the size of the raster r in the OP's example by an order of magnitude makes reclassify quicker:
# Code from OP @AF7
library(raster)
library(microbenchmark)
library(ggplot2)
library(compiler)
#Let's see if precompiling helps speed...
f <- function(x, min, max) reclassify(x, c(-Inf, min, NA, max, Inf, NA))
g <- cmpfun(f)
# Functions from @rengis
h <- function(r, min, max) {
rr <- r[]
rr[rr < min | rr > max] <- NA
r[] <- rr
r
}
i <- cmpfun(h)
# Benchmark with larger raster (100k cells, vs 10k originally)
r <- raster(ncol = 1000, nrow = 100)
r[] <- runif(ncell(r))
compare <- microbenchmark(
calc(r, fun=function(x){ x[x < 0.2] <- NA; x[x > 0.8] <- NA; return(x)}),
reclassify(r, c(-Inf, 0.2, NA, 0.8, Inf, NA)),
g(r, 0.2, 0.8),
h(r, 0.2, 0.8),
i(r, 0.2, 0.8),
times=100)
autoplot(compare)
The exact point at which reclassify becomes quicker depends both on the number of cells in the raster and on the complexity of the reclassification, but in this case the cross-over point is at about 50,000 cells (see below).
As the raster becomes even larger (or the calculation more complex), another way to speed up reclassification is using multi-threading, e.g. with the snow package:
# Reclassify, using clusterR to split into two threads
library(snow)
tryCatch({
beginCluster(n = 2)
clusterR(r, reclassify, args = list(rcl = c(-Inf, 0.2, NA, 0.8, Inf, NA)))
}, finally = endCluster())
Multi-threading involves even more overhead to set up, and so only makes sense with very large rasters and/or more complex calculations (in fact, I was surprised to note that it didn't come out as the best option under any of the conditions I tested below--perhaps with a more complex reclassification?).
To illustrate, I've plotted results from microbenchmark using the OP's setup at intervals up to 10 million cells (10 runs of each) below:
As a final note, compiling didn't make a difference at any of the tested sizes.
Here is one more way:
h <- function(r, min, max) {
rr <- r[]
rr[rr < min | rr > max] <- NA
r[] <- rr
r
}
i <- cmpfun(h)
identical(
i(r, 0.2, 0.8),
g(r, 0.2, 0.8)
)
#Benchmark!
compare <- microbenchmark(
calc(r, fun=function(x){ x[x < 0.2] <- NA; x[x > 0.8] <- NA; return(x)}),
reclassify(r, c(-Inf, 0.2, NA, 0.8, Inf, NA)),
g(r, 0.2, 0.8),
h(r, 0.2, 0.8),
i(r, 0.2, 0.8),
times=100)
autoplot(compare)
Compiling doesn't help much in this instance.
You could even gain some further speed-up by accessing slots of the raster object directly using @ (although this is usually discouraged).
j <- function(r, min, max) {
v <- r@data@values
v[v < min | v > max] <- NA
r@data@values <- v
r
}
k <- cmpfun(j)
identical(
j(r, 0.2, 0.8)[],
g(r, 0.2, 0.8)[]
)
The raster package has a function for that: clamp. It is faster than g but slower than h and i because it has some overhead (safety) built in.
compare <- microbenchmark(
h(r, 0.2, 0.8),
i(r, 0.2, 0.8),
clamp(r, 0.2, 0.8),
g(r, 0.2, 0.8),
times=100)
autoplot(compare)
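One hedged caveat about clamp(): by default it keeps out-of-range cells at the boundary values rather than setting them to NA, so to reproduce the behaviour of the other functions you may need useValues = FALSE (an assumption about the clamp() interface worth verifying against your version of raster).
# Sketch: useValues = FALSE should set out-of-range cells to NA instead of
# truncating them to the lower/upper bounds.
clamp(r, 0.2, 0.8, useValues = FALSE)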

R: function cut

Here is my code:
>my.cut <- cut(my.variable, breaks = c(-Inf, -0.5, -0.25, -0.1, 0, 0.02, 0.05, 0.15, 0.3, 0.5, 1, Inf), right = FALSE)
>levels(my.cut)
"[-Inf,-0.5)" "[-0.5,-0.25)" "[-0.25,-0.1)" "[-0.1,0)" "[0,0.02)" "[0.02,0.05)" "[0.05,0.15)" "[0.15,0.3)" "[0.3,0.5)" "[0.5,1)" "[1, Inf)"
Expected result:
>levels(my.cut)
"[-Inf,-0.5)" "[-0.5,-0.25)" "[-0.25,-0.1)" "[-0.1,0)" "0" "(0,0.02)" "[0.02,0.05)" "[0.05,0.15)" "[0.15,0.3)" "[0.3,0.5)" "[0.5,1)" "[1, Inf)"
In the expected result there is a single level "0", which should contain only the cases where my.variable == 0, but with the cut() call above there is no such level because breaks can only define intervals. How can I do this?
Hope to get your answer soon! Thanks!
You could explicitly put each value into a group. This is more flexible, but also a lot more verbose.
One way of doing this could be to define a bespoke cut function and then apply it to every element of your vector.
my.variable <- rnorm(100)
bespoke_cut <- function(value){
if (value < -0.1) return('[-Inf, -0.1)')
if (value < 0) return('[-0.1, 0)')
if (value == 0) return('0')
return('(0, Inf]')
}
my.cut <- sapply(my.variable, bespoke_cut)
my.cut <- factor(my.cut)
I've only done a few of the groupings you wanted, but I think it should be apparent how to add extra groups.
I think the best you can hope for with 'cut' is to specify a really small range for 0, i.e.,
cps = c(-Inf, -0.1, 0-.Machine$double.eps, 0+.Machine$double.eps, 0.02, Inf)
bgroup = cut(c(-10, 10, 0,0), breaks = cps)
cat(deparse(levels(bgroup)), "\n") ## use this to edit the levels more easily
levels(bgroup) = c("(-Inf, -0.1]", "(-0.1,0)", "0", "(0,0.02]", "(0.02, Inf]")
table(bgroup)
Obviously, the display levels are not identical to those used to cut the data, but if you are okay with that window around 0, then the solution is to form the cuts with that value then change the labels.
You could do this: cut, assign "0" where my.variable == 0, then rebuild the factor.
my.variable <- rnorm(100)
my.variable[sample(1:100,2)] <- 0
my.cut <- cut(my.variable, breaks = c(-Inf, -0.5, -0.25, -0.1, 0, 0.02, 0.05, 0.15, 0.3, 0.5, 1, Inf), right = FALSE)
lvl <- levels(my.cut)
lvlR <- c(lvl[1:4],"0","(0,0.02)",lvl[6:11])
my.cut <- as.character(my.cut)
my.cut[my.variable == 0] <- 0
my.cut <- factor(my.cut,levels=lvlR)
rm(lvl,lvlR)
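A quick sanity check for either approach (just an illustrative sketch): the exact zeros should now fall into their own "0" level.
table(my.cut)
my.cut[my.variable == 0]   # should all be the "0" level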

Simulating Data for SEM with psych package

I'm simulating data for SEM (a structural equation model) with the psych package. I used the code given on page 17 of "Using the psych package to generate and test structural models". The code is:
library(psych)
set.seed(42)
fx <- matrix(c(0.9, 0.8, 0.7, rep(0, 9), 0.7, 0.6, 0.5, rep(0, 9), 0.6, 0.5, 0.4), ncol = 3)
rownames(fx) <- paste("x", 1:9, sep="")
fy <- matrix(c(0.6, 0.5, 0.4), ncol=1)
rownames(fy) <- paste("y", 1:3, sep="")
Phi <- matrix(c(1, 0.48, 0.32, 0.4, 0.48, 1, 0.32, 0.3, 0.32, 0.32, 1, 0.2, 0.4, 0.3, 0.2, 1), ncol = 4)
twelveV <- sim.structure(fx=fx, Phi=Phi, fy=fy, n=100, raw=TRUE)
round(twelveV$model, 2)
round(twelveV$model-twelveV$r, 2)
twelveV$observed
Then I tried to use the sem package to analyse the simulated data. The code is:
sem.mod <- structure.sem(twelveV$model)
library(sem)
sem.fit <- sem(sem.mod, twelveV$r, 100)
This code is giving the following error message:
Error in solve.default(diag(m) - A) :
Lapack routine dgesv: system is exactly singular
I don't know what is causing this error. Any idea, comment and/or help will be highly appreciated. Thanks.
Ah, that error message was the bane of my life for a while.
Essentially (as I eventually gathered from the R-Help archives, specifically here), it means that there is redundant information in your matrix, in that (at least) one column's information can be derived from the others.
I believe this is related to collinearity, but I could be wrong on this point. In most cases, dropping the column that is most highly correlated with the others will solve the problem.
In a real application, it's a sign that you should throw out some of your questions or measures.
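A minimal diagnostic sketch (not part of the original answer): check whether the model-implied correlation matrix is rank-deficient, and look for near-perfect off-diagonal correlations that flag redundant variables.
m <- twelveV$model
qr(m)$rank    # a rank lower than ncol(m) means the matrix is singular
ncol(m)
# off-diagonal entries very close to 1 point at redundant (collinear) variables
which(abs(m) > 0.99 & row(m) != col(m), arr.ind = TRUE)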
