I want to run a partition regression in R, for which I need a factor indicating which partition each observation belongs to. For example, when a value is greater than mean + 2 standard deviations I assign the indicator 2, when it is between mean + 1 sd and mean + 2 sd I assign 1, and so on. I know this can be done with if and else, but when there are many partitions the code gets too long. Is there an easy and succinct way to do it?
mean=mean(x)
sd=sd(x)
signal <- vector()
for(i in 1:length(x)){
if((x[i]-mean)/sd< -3) signal[i]=-3
if((x[i]-mean)/sd> -3) signal[i]=-2
if((x[i]-mean)/sd> -2) signal[i]=-1
if((x[i]-mean)/sd> -1) signal[i]=0
if((x[i]-mean)/sd>1) signal[i]=1
if((x[i]-mean)/sd>2) signal[i]=2
if((x[i]-mean)/sd>3) signal[i]=3
}
Thanks to @jogo and @r.user.05apr.
Now I have a slightly different problem. I want to compute the partition on rolling windows, 20 days for example, which means I need to scale the data of day t using the past 20 days (day t-20 to day t-1) and assign the same values as above according to its z-score. In that case, can the cut function still be used? I have written code with a loop and if statements:
signal <- vector()
n=20 #window
for(i in (n+1):length(x)){
mean=mean(x[(i-n):(i-1)])
sd=sd(x[(i-n):(i-1)])
if((x[i]-mean)/sd< -3) signal[i]=-3
if((x[i]-mean)/sd> -3) signal[i]=-2
if((x[i]-mean)/sd> -2) signal[i]=-1
if((x[i]-mean)/sd> -1) signal[i]=0
if((x[i]-mean)/sd>1) signal[i]=1
if((x[i]-mean)/sd>2) signal[i]=2
if((x[i]-mean)/sd>3) signal[i]=3
}
You can use cut()
x <- iris$Petal.Length
m <- mean(x)
s <- sd(x)
cut((x - m)/s, breaks = c(-Inf, -3, -2, -1, 1, 2, 3, +Inf), labels = c((-3):3))
To coerce the result to numeric:
as.numeric(as.character(cut((x - m)/s, breaks = c(-Inf, -3, -2, -1, 1, 2, 3, +Inf), labels = c((-3):3))))
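Remark:
You can shorten (x - m)/s to scale(x). Note that scale() returns a one-column matrix, so wrap it in as.vector() if you need a plain vector.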
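For the rolling-window variant in the question, the same cut() call can be applied to z-scores computed from the previous n observations. This is a minimal sketch, assuming x is a numeric vector longer than n with no missing values. The helper name rolling_signal is hypothetical.

```r
# Rolling z-score partition: scale x[i] by the mean and sd of the
# previous n observations, then bin the z-score with cut().
rolling_signal <- function(x, n = 20) {
  stopifnot(length(x) > n)
  idx <- (n + 1):length(x)
  z <- sapply(idx, function(i) {
    w <- x[(i - n):(i - 1)]          # window: day t-n to day t-1
    (x[i] - mean(w)) / sd(w)
  })
  out <- rep(NA_integer_, length(x)) # first n days have no full window
  out[idx] <- as.integer(as.character(
    cut(z, breaks = c(-Inf, -3, -2, -1, 1, 2, 3, Inf), labels = -3:3)
  ))
  out
}

# Toy series: 20 alternating values, then one large jump and one ordinary value
x <- c(rep(c(0, 1), 10), 100, 0.5)
s <- rolling_signal(x, n = 20)
s[21:22]
```

The first n entries stay NA because they have no complete look-back window. The loop over windows is still there, hidden inside sapply(); only the if chain is replaced by cut().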
It depends on how dynamic the value assignment has to be. An alternative option:
criteria <- data.frame(operator = c("<", rep(">", 6)),
criterion = c(-3, seq(-3, -1, 1), 1:3),
result = c(seq(-3, 0, 1), 1:3),
stringsAsFactors = FALSE)
criteria # data frame with individual conditions for if
get_signal <- function(mean, sd, x) {
dummy <- (x-mean)/sd
for (i in (1:nrow(criteria))) {
if (do.call(criteria[i, 1], list(dummy, criteria[i, 2]))) res <- criteria[i, 3]
}
res
}
sapply(-5:10, function(x) get_signal(2, 1, x))
Long time reader, first time poster. I have not found any previous questions about my current problem. I would like to create multiple linear functions, which I can later apply to variables. I have a data frame of slopes, df_slope, and a data frame of constants, df_constant.
Dummy data:
df_slope <- data.frame(var1 = c(1, 2, 3,4,5), var2 = c(2,3,4,5,6), var3 = c(-1, 1, 0, -10, 1))
df_constant<- data.frame(var1 = c(3, 4, 6,7,9), var2 = c(2,3,4,5,6), var3 = c(-1, 7, 8, 0, -1))
I would like to construct functions such as
myfunc <- function(slope, constant, trvalue){
result <- trvalue*slope+constant
return(result)}
where the slope and constant values are
slope<- df_slope[i,j]
constant<- df_constant[i,j]
I have tried many approaches, for example creating a data frame of functions with a for loop:
myfunc_all<-data.frame()
for(i in 1:5){
for(j in 1:3){
myfunc_all[i,j]<-function (x){ x*df_slope[i,j]+df_constant[i,j] }
full_func[[i]][j]<- func_full
}
}
without success. The slope-constant values are paired up, such as df_slope[i,j] is paired with df_constant[i,j]. The desired end result would be some kind of data frame, from where I can call a function by giving it the coordinates, for example like this:
myfunc_all[i,j]
but any form would be great. For example
myfunc_all[2,1]
in our case would be
function (x){ x*2+4 }
which I can apply to different x values. I hope my problem is clear.
So you have a slight problem with lazy evaluation and variable scopes when you are using a for loop to build functions (see here for more info). It's a bit safer to use something like mapply which will create closures for you. Try
myfunc_all <- with(expand.grid(1:5, 1:3), mapply(function(i, j) {
function(x) {
x*df_slope[i,j]+df_constant[i,j]
}
},Var1, Var2))
dim(myfunc_all) <- c(5,3)
This will create an array-like object. The only difference is that you need to use double brackets to extract the function. For example
myfunc_all[[2,1]](0)
# [1] 4
myfunc_all[[5,3]](0)
# [1] -1
Alternatively, you can write a function that returns a function. That would look like
myfunc_all <- (function(slopes, constants) {
function(i, j)
function(x) x*slopes[i,j]+constants[i,j]
})(df_slope, df_constant)
Then, rather than using brackets, you call the function with parentheses.
myfunc_all(2,1)(0)
# [1] 4
myfunc_all(5,3)(0)
# [1] -1
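The scoping problem mentioned above can be seen in a small example. This is a minimal sketch with hypothetical names (adders_loop, make_adder, adders_ok): functions created in a for loop all look up the same loop variable, while a factory function gives each closure its own copy.

```r
# A for loop does not create a new scope per iteration, so every
# function below looks up the same i when it is finally called.
adders_loop <- list()
for (i in 1:3) {
  adders_loop[[i]] <- function(x) x + i
}
adders_loop[[1]](10)  # 13, not 11: i is 3 once the loop has finished

# A factory function gives each closure its own environment.
make_adder <- function(i) {
  force(i)            # evaluate the argument now rather than lazily
  function(x) x + i
}
adders_ok <- lapply(1:3, make_adder)
adders_ok[[1]](10)    # 11
```

The mapply version above works for the same reason: each call of the outer function has its own i and j.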
df_slope <- data.frame(var1 = c(1, 2, 3,4,5), var2 = c(2,3,4,5,6), var3 = c(-1, 1, 0, -10, 1))
df_constant<- data.frame(var1 = c(3, 4, 6,7,9), var2 = c(2,3,4,5,6), var3 = c(-1, 7, 8, 0, -1))
functions = vector(mode = "list", length = nrow(df_slope))
for (i in 1:nrow(df_slope)) {
functions[[i]] = function(i,x) { df_slope[i]*x + df_constant[i]}
}
f = function(i, x) {
functions[[i]](i, x)
}
f(1, 1:10)
f(3, 5:10)
I am working on an assignment for school. I need to transform the columns in a data frame using a for loop and the bcPower function from the car package. My data frame, named bb2.df, consists of 13 columns of baseball statistics for 337 players. The data is from:
http://ww2.amstat.org/publications/jse/datasets/baseball.dat.txt
I read the data in using:
bb.df <- read.fwf("baseball.dat.txt",widths=c(4,6,6,4,4,3,3,3,4,4,4,3,3,2,2,2,2,19))
And then I created a second data frame just for the numeric stats using:
bb2.df <- bb.df[,1:13]
library(car)
Then I unsuccessfully tried to build the for loop.
> for (i in 1:ncol(bb2.df)) {
+ c <- coef(powerTransform(bb2.df[[i]]))
+ bb2.df[[i]] <- bcPower(bb2.df[[i]],c)
+ }
Error in bc1(out[, j], lambda[j]) :
First argument must be strictly positive.
The loop seems to transform the first three columns but stops.
What am I doing wrong?
This solution tests whether a column appears to contain logical values and omits such columns from the transformation, replaces zero values in the vectors with a small number outside the range of the actual values, and stores the transformed values in a new data frame, retaining the column and row names.
I have also tested all of the variables for normality before and after the transformation. I tried to find a variable that is interesting in that the transformed variable has a large p-value for the Shapiro test and the p-value also changed by a large amount. Finally, the interesting variable is scaled in both the original and transformed version, and the two versions are overlaid on a density plot.
library(car); library(ggplot2); library(reshape2)
# see this link for column names and type hints
# http://ww2.amstat.org/publications/jse/datasets/baseball.txt
# add placeholder column for opening quotation mark
bb.df <-
read.fwf(
"http://ww2.amstat.org/publications/jse/datasets/baseball.dat.txt",
widths = c(4, 6, 6, 4, 4, 3, 3, 3, 4, 4, 4, 3, 3, 2, 2, 2, 2, 2, 17)
)
# remove placeholder column
bb.df <- bb.df[,-(ncol(bb.df) - 1)]
names(bb.df) <- make.names(
c(
'Salary', 'Batting average', 'OBP', 'runs', 'hits', 'doubles', 'triples',
'home runs', 'RBI', 'walks', 'strike-outs', 'stolen bases', 'errors',
"free agency eligibility", "free agent in 1991/2" ,
"arbitration eligibility", "arbitration in 1991/2", 'name'
)
)
# test for boolean/logical values... don't try to transform them
logicals.test <- apply(
bb.df,
MARGIN = 2,
FUN = function(one.col) {
asnumeric <- as.numeric(one.col)
aslogical <- as.logical(asnumeric)
renumeric <- as.numeric(aslogical)
matchflags <- renumeric == asnumeric
cant.be.logical <- any(!matchflags)
print(cant.be.logical)
}
)
logicals.test[is.na(logicals.test)] <- FALSE
probably.numeric <- bb.df[, logicals.test]
result <- apply(probably.numeric, MARGIN = 2, function(one.col)
{
# can't transform vectors containing non-positive values
# replace zeros with something small
non.zero <- one.col[one.col > 0]
small <- min(non.zero) / max(non.zero)
zeroless <- one.col
zeroless[zeroless == 0] <- small
c <- coef(powerTransform(zeroless))
transformation <- bcPower(zeroless, c)
return(transformation)
})
result <- as.data.frame(result)
row.names(result) <- bb.df$name
cols2test <- names(result)
normal.before <- sapply(cols2test, function(one.col) {
print(one.col)
temp <- shapiro.test(bb.df[, one.col])
return(temp$p.value)
})
normal.after <- sapply(cols2test, function(one.col) {
print(one.col)
temp <- shapiro.test(result[, one.col])
return(temp$p.value)
})
more.normal <- cbind.data.frame(normal.before, normal.after)
more.normal$more.normal <-
more.normal$normal.after / more.normal$normal.before
more.normal$interest <-
more.normal$normal.after * more.normal$more.normal
interesting <-
rownames(more.normal)[which.max(more.normal$interest)]
data2plot <-
cbind.data.frame(bb.df[, interesting], result[, interesting])
names(data2plot) <- c("original", "transformed")
data2plot <- scale(data2plot)
data2plot <- melt(data2plot)
names(data2plot) <- c("Var1", "dataset", interesting)
ggplot(data2plot, aes(x = data2plot[, 3], fill = dataset)) +
geom_density(alpha = 0.25) + xlab(interesting)
Original, incomplete answer:
I believe you're trying to do illegal power transformations (vectors including non-positive values, specifically zeros; vectors with no variance)
The fact that you are copying bb.df into bb2.df and then overwriting is a sure sign that you should really be using apply.
This doesn't create a useful data frame, but it should get you started:
library(car)
bb.df <-
read.fwf(
"baseball.dat.txt",
widths = c(4, 6, 6, 4, 4, 3, 3, 3, 4, 4, 4, 3, 3, 2, 2, 2, 2, 19)
)
bb.df[bb.df == 0] <- NA
# skip last (text) col
for (i in 1:(ncol(bb.df) - 1)) {
print(i)
# use comma to indicate indexing by column
temp <- bb.df[, i]
temp[temp == 0] <- NA
temp <- temp[complete.cases(temp)]
if (length(unique(temp)) > 1) {
c <- coef(powerTransform(bb.df[, i]))
print(bcPower(bb.df[i], c))
} else {
print(paste0("column ", i, " is invariant"))
}
}
# apply solution
result <- apply(bb.df[,-ncol(bb.df)], MARGIN = 2, function(one.col)
{
temp <- one.col
temp[temp == 0] <- NA
temp <- temp[complete.cases(temp)]
if (length(unique(temp)) > 1) {
c <- coef(powerTransform(temp))
transformation <- bcPower(temp, c)
return(transformation)
} else
{
print("skipping invariant column")
return(NULL)
}
})
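The "strictly positive" requirement follows from the form of the Box-Cox transform itself. This is a minimal base R sketch, assuming the usual scaled-power definition. box_cox is a hypothetical helper written for illustration, not the car implementation, and the hits vector is made-up data.

```r
# Scaled power (Box-Cox) transform: (x^lambda - 1) / lambda, or log(x)
# when lambda is 0. Both forms are only defined for x > 0.
box_cox <- function(x, lambda) {
  if (any(x <= 0, na.rm = TRUE)) stop("x must be strictly positive")
  if (abs(lambda) < 1e-8) log(x) else (x^lambda - 1) / lambda
}

hits <- c(0, 3, 12, 40)   # made-up column containing a zero
res <- tryCatch(box_cox(hits, 0.5), error = function(e) conditionMessage(e))
res                       # the error message, because hits contains a zero

hits[hits == 0] <- 0.01   # arbitrary small positive stand-in for zero
box_cox(hits, 0.5)
```

Which stand-in value to use for the zeros is a modelling choice. The answer above picks one from the range of the data, while the 0.01 here is purely illustrative.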
I've written a simple correlation function that takes in three variables. "A" and "B" are numerical vectors of equal length, and "n" is the length.
Corr.fxn <- function(A, B, n){
Correlation <- (sum((A - mean(A))*(B - mean(B))) / (n-1)) / (sd(A)*sd(B))
return(Correlation)
}
The function works well enough, but I have many vectors I want to process. What's the best way to modify this code to process all "N take 2" unique analyses for my set of vectors "N"?
EDIT:
Example data showing the structure of the vectors:
A <- c(-1, 0, 1, -1, 0, 1, -1, 0, 1)
B <- c(1, 1, -1, 0, 1, -1, 0, 0, 1)
...
n <- length(A)
So let's say I have vectors A through Z and I want to modify my code to output a new vector containing all {26 take 2} correlation values.
Here is one possible way you can do it assuming you have a bunch of numeric vectors in a list v as follows:
v <- list()
for (i in 1:10) {
v[[i]] <- sample(1:10, 10, replace = TRUE)
}
apply(combn(1:10, 2), 2, function(x) Corr.fxn(v[[x[1]]], v[[x[2]]], length(v[[x[1]]])))
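As a cross-check, the pairwise values produced this way can be compared with base R's cor() applied to a matrix of the same vectors. This is a minimal sketch using the Corr.fxn from the question and random data. It assumes all vectors have equal length, and the names manual and builtin are illustrative.

```r
# Corr.fxn as defined in the question
Corr.fxn <- function(A, B, n) {
  (sum((A - mean(A)) * (B - mean(B))) / (n - 1)) / (sd(A) * sd(B))
}

set.seed(1)
v <- lapply(1:4, function(i) rnorm(10))   # four random vectors

# All "4 take 2" pairs, in the column order produced by combn()
pairs <- combn(length(v), 2)
manual <- apply(pairs, 2, function(p) {
  Corr.fxn(v[[p[1]]], v[[p[2]]], length(v[[p[1]]]))
})

# cor() on a matrix returns the full correlation matrix; its lower
# triangle, read column by column, has the same pair order as combn()
m <- cor(do.call(cbind, v))
builtin <- m[lower.tri(m)]

all.equal(manual, builtin)
```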
In this answer, I assume 2 things. First, you want to write a function yourself, since otherwise you can use Hmisc::rcorr. Second, you want the "N take 2" part to be inside the function, otherwise the ways suggested earlier are correct. In that case, you can do this:
Corr.fxn <- function(vectors, n){
pairs<- combn(length(vectors), 2)
npairs<- ncol(pairs)
# square matrix: one row and one column per vector
cor.mat<- matrix(NA, nrow = length(vectors), ncol = length(vectors))
for (i in 1:npairs){
A<- vectors[[pairs[1, i]]]
B<- vectors[[pairs[2, i]]]
cor.mat[pairs[1, i], pairs[2, i]] <- (sum((A - mean(A))*(B - mean(B))) / (n-1)) /(sd(A)*sd(B))
}
# mirror the upper triangle; t() keeps each pair aligned
cor.mat[lower.tri(cor.mat)]<- t(cor.mat)[lower.tri(cor.mat)] ###
diag(cor.mat)<- 1 ###
cor.mat<- data.frame(cor.mat) ###
row.names(cor.mat)<- colnames(cor.mat)<- names(vectors) ###
return(cor.mat)
}
The lines that end in ### only tidy the output: they fill in the lower triangle, set the diagonal, and add names. The main input is a list called "vectors". So it works as follows:
A<- runif(100, 1, 100)
B<- runif(100, 30, 50)
C<- runif(100, 120, 200)
> Corr.fxn(list(A=A, B=B, C=C), n=100)
A B C
A 1.0000000 -0.11800104 -0.13980458
B -0.1180010 1.00000000 0.04933581
C -0.1398046 0.04933581 1.00000000
I know similar questions have been asked in this site here, here, and here, but none of them tackles my problem.
I have a data frame, and I want to apply the rdirichlet function (from gtools) to each row, so that each row is treated as alpha.
data = NULL
data <- data.frame(rbind(
oct = c(60, 32, 8),
sep = c(53, 35, 12),
ago = c(54, 40, 6)
))
data <- data/100*1000
library(gtools) # contains the function
sim <- 10000 # simulation
My first attempt was to use apply. It does work, but the output is not convenient for further analysis; each row's computation becomes a single vector:
p = apply(data, 1, function(x) rdirichlet(sim, alpha = x + 1))
I also tried a loop, without success:
p = NULL
for(i in 1:length(data)) {
p[i] <- rdirichlet(sim, alpha = data[i] + 1)
}
Any tip how can I solve this?
Well, firstly you might want to change data to x inside the anonymous function passed to apply, so that it matches the x in function(x):
apply(data, 1, function(x) rdirichlet(sim, alpha = x + 1))
This works for me, as in it provides an output with three columns and 30000 rows.
Two important things here. First, vectorizing with apply is the way to go:
ans <- apply(data, 1, function(x) rdirichlet(sim, alpha = x + 1))
This returns each row's draws flattened into one column of length k * sim (here k = 3): the sim draws of the first component come first, then those of the second, and so on.
Second, you then need to pick out the components by index, for example:
margin <- ans[1:sim, 1] - ans[(sim + 1):(2 * sim), 1]
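If index arithmetic on the flattened output is awkward, one option is to keep one sim x k matrix per row of data in a list. This is a minimal sketch. To stay runnable in base R it uses draw_dirichlet, a hypothetical stand-in built from normalised gamma draws. With gtools loaded, rdirichlet(sim, alpha) could be called in its place.

```r
# Hypothetical stand-in for gtools::rdirichlet, base R only:
# independent gamma draws normalised to sum to 1 are Dirichlet draws.
draw_dirichlet <- function(n, alpha) {
  g <- matrix(rgamma(n * length(alpha), shape = alpha),
              nrow = n, byrow = TRUE)   # each row uses the full alpha vector
  g / rowSums(g)
}

data <- data.frame(rbind(
  oct = c(60, 32, 8),
  sep = c(53, 35, 12),
  ago = c(54, 40, 6)
))
data <- data / 100 * 1000
sim <- 1000   # smaller than in the question, to keep the example quick

set.seed(42)
# One sim x 3 matrix per month, stored in a named list
p <- lapply(seq_len(nrow(data)), function(i) {
  draw_dirichlet(sim, unlist(data[i, ]) + 1)
})
names(p) <- rownames(data)

dim(p$oct)                         # sim rows, one column per component
margin <- p$oct[, 1] - p$oct[, 2]  # first minus second component, October
```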
I have a list of data.frames, and would like to operate on their columns, using various weights.
For example, subtracting the first columns from the second column (solved, see below); or subtracting the first and third from twice the second (unsolved).
Thanks to the generous help obtained in response to this question, I have a solution to the problem in two dimensions without weights, using Reduce.
I would like to have the flexibility to operate with weights, and in higher dimensions.
What I have so far is:
priceList <- data.frame(aaa = rnorm(100, 100, 10), bbb = rnorm(100, 100, 10),
ccc = rnorm(100, 100, 10), ddd = rnorm(100, 100, 10),
eee = rnorm(100, 100, 10), fff = rnorm(100, 100, 10),
ggg = rnorm(100, 100, 10)
)
colDiff <- function(x)
{
Reduce('-', rev(x))
}
tradeLegsList <- combn(names(priceList), 3, function(x) priceList[x], simplify = FALSE)
tradeList <- lapply(tradeLegsList, colDiff)
From what I can tell, Reduce is not designed to take multiple arguments.
I can do this the long way with 2* tradeLegsList[[1]]$bbb - tradeLegsList[[1]]$aaa - tradeLegsList[[1]]$ccc, and some loops, but it doesn't seem like the R way.
Is there a way to pass in a weight vector?
Ideally, I would like to pass an argument such as w = c(-1, 2, -1) to the colDiff (or Reduce) function, or something similar.
True, Reduce is not geared to allow multiple arguments, just two for each reduction. Therefore it is easiest to premultiply the elements in the list you are Reduce-ing.
Below is a solution that does this using mapply within your colDiff function definition.
Change your definition of colDiff to apply a weight vector, using mapply with SIMPLIFY = FALSE.
EDIT
In light of the comments: the weighting depends on the number of columns, and there is no need for rev.
The weighting by length:
length(x) == 1 -> w = 1
length(x) == 2 -> w = c(-1, 1),
length(x) == 3 -> w = c(-1, 2, -1),
length(x) == 4 -> w = c(-1, 1, -1, +1)
weighting <- function(i){
switch(i, 1, c(-1,1), c(-1,2,-1), c(-1,1,-1, 1))
}
colDiff <- function(x)
{
w = weighting(length(x))
Reduce('+', mapply('*', x, e2 = w, SIMPLIFY = F))
}
Then something like this would work:
tradeList <- lapply(tradeLegsList, colDiff)
You could also keep with the functional programming theme and use Map, which is a simple wrapper for mapply with SIMPLIFY = FALSE:
colDiff <- function(x)
{
w = weighting(length(x))
Reduce('+', Map('*', x , e2 = w))
}
You could also predefine the weighting within the function colDiff (which may be easier).
weighting[[2]] is the weighting for when there are 2 columns, weighting[[3]] for when there are 3.
colDiff <- function(x)
{
weighting <- list(1, c(-1,1), c(-1,2,-1), c(-1,1,-1, 1))
w = weighting[[length(x)]]
Reduce('+', Map('*', x , e2 = w))
}
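If the weights should be supplied by the caller, as the question asks, the same Reduce/Map pattern works with w as an argument. This is a minimal sketch on a small made-up price table. The name colDiffW is hypothetical and is used here to avoid clashing with colDiff above.

```r
# Weighted column combination: multiply each column by its weight,
# then sum the weighted columns element-wise.
colDiffW <- function(x, w) {
  stopifnot(length(x) == length(w))   # one weight per column
  Reduce(`+`, Map(`*`, x, w))
}

set.seed(1)
priceList <- data.frame(aaa = rnorm(5, 100, 10),
                        bbb = rnorm(5, 100, 10),
                        ccc = rnorm(5, 100, 10))

# Twice the second column minus the first and the third
spread <- colDiffW(priceList[c("aaa", "bbb", "ccc")], w = c(-1, 2, -1))
all.equal(spread, 2 * priceList$bbb - priceList$aaa - priceList$ccc)

# The same weights applied to every three-column combination
legs <- combn(names(priceList), 3, function(n) priceList[n], simplify = FALSE)
trades <- lapply(legs, colDiffW, w = c(-1, 2, -1))
```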