how to correlate 2 variables when X > 1


I have a data set and want to run a correlation between X and Y. However, I only want to look at X values that are greater than 1.
cor(Data$X, Data$Y, use = "complete.obs")
What argument do I add to run a correlation between X and Y only for the X values that are greater than 1?

You can subset using the [ operator.
Try this:
# Generate Example Data
Data <- data.frame(X = seq(-5, 10, 1),
Y = sample(1:100, 16))
with(data = Data[Data$X > 1, ], cor(X, Y, use = "complete.obs"))
[ lets us specify rows and columns in the style my.data.frame[rows, columns]. Here we are specifying that we want only rows where X > 1, but all columns. We could also do the following to ask for each column individually by name:
cor(Data[Data$X > 1, "X"], Data[Data$X > 1, "Y"], use = "complete.obs")
Or even the following to subset the column vectors:
cor(Data$X[Data$X > 1], Data$Y[Data$X > 1], use = "complete.obs")
Of course, these are only to illustrate the flexibility. It's best to subset the whole data set once to avoid discrepancies.
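As a further illustration of subsetting once, subset() is an alternative that also drops rows where the condition is NA. This is only a minimal sketch: the seed, the injected NA and the names Sub, r_subset and r_bracket are made up for the example.

```r
set.seed(42)                      # make the example reproducible
Data <- data.frame(X = seq(-5, 10, 1),
                   Y = sample(1:100, 16))
Data$X[3] <- NA                   # add a missing X to show the difference

# subset() keeps only rows where the condition is TRUE (NA counts as "drop")
Sub <- subset(Data, X > 1)
r_subset <- cor(Sub$X, Sub$Y, use = "complete.obs")

# `[` with a logical index keeps an all-NA row for every NA in the condition;
# use = "complete.obs" then discards those rows again
r_bracket <- with(Data[Data$X > 1, ], cor(X, Y, use = "complete.obs"))
```

Both calls give the same correlation; the difference only shows up if you inspect the subsetted data frame itself.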

Related

R: How to access a 'complicated list'

I am working on an assignment, which tasks me to generate a list of data, using the below code.
library(caret)  # createDataPartition()
library(dplyr)  # %>% and slice()
##Use the make_data function to generate 25 different datasets, with mu_1 being a vector
x <- seq(0, 3, len=25)
make_data <- function(a){
n = 1000
p = 0.5
mu_0 = 0
mu_1=a
sigma_0 = 1
sigma_1 = 1
y <- rbinom(n, 1, p)
f_0 <- rnorm(n, mu_0, sigma_0)
f_1 <- rnorm(n, mu_1, sigma_1)
x <- ifelse(y == 1, f_1, f_0)
test_index <- createDataPartition(y, times = 1, p = 0.5, list = FALSE)
list(train = data.frame(x = x, y = as.factor(y)) %>% slice(-test_index),
test = data.frame(x = x, y = as.factor(y)) %>% slice(test_index))
}
dat <- sapply(x,make_data)
The code looks good to go, and 'dat' appears to be a 25 column, 2 row table, each with its own data frame.
Now, each data frame within a cell has 2 columns.
And this is where I get stuck.
While I can get to the data frame in row 1, column 1, just fine (i.e. just use dat[1,1]), I can't reach the column of 'x' values within dat[1,1]. I've experimented with
dat[1,1]$x
dat[1,1][1]
But they only throw weird responses: error/null.
Any idea how I can pull the column? Thanks.

dat[1, 1] is a list.
class(dat[1, 1])
#[1] "list"
So to reach to x you can do
dat[1, 1]$train$x
Or
dat[1, 1][[1]]$x
As a side note, instead of having this 2 x 25 matrix as output in dat, I would actually prefer to have a nested list.
dat <- lapply(x,make_data)
#Access `x` column of first list from `train` dataset.
dat[[1]]$train$x
However, this is quite subjective and you can choose whatever format you like best.
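If you keep the matrix form, double brackets with two indices return the cell's content directly. The sketch below uses a hypothetical stand-in, make_toy(), instead of make_data() so that it runs without caret or dplyr; the values are arbitrary.

```r
# Hypothetical stand-in for make_data(): returns a list of two data frames
make_toy <- function(a) {
  list(train = data.frame(x = a + 1:3, y = factor(c(0, 1, 0))),
       test  = data.frame(x = a - 1:3, y = factor(c(1, 0, 1))))
}
dat <- sapply(c(0, 1.5, 3), make_toy)   # 2 x 3 matrix whose cells are data frames

dim(dat)                 # rows are "train"/"test", one column per input value
dat[[1, 1]]$x            # [[i, j]] returns the data frame itself, not a list
dat[["train", 2]]$x      # the row name can be used instead of the row number
```

Single brackets (dat[1, 1]) keep the list wrapper, which is why $x on that result does not find the column.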

R Loop: Perform a Function for Every 3 Rows

I have 2000 wheat plants, growing over the course of 40 days.
I'd like to perform the coeff function on each plant to find the coefficients of the quadratic equation the 3 time points make. (a, b, and c)
(1) The coef(lm(y~poly(x,2,raw=TRUE))) function works exactly the way I want it to.
(2) However, the way my data is presented, requires me to manually set x and y.
(3) Thus, I melted my data, and ordered it.
(4) I'd like to make a loop that will take the first three in column "Day" and set that as x. Then I'd like it to take the first three in column "Height" and set that as y.
Then I'd like to perform the coeff function.
Last I'd like it to present the coefficient outputs I need, preferably in a new data table.
Then repeat for every three rows, which represent each wheat ID, for all wheat plants.
1) This function works, giving me coefficients: a, b, c
x<-c(1,2,3)
y<-c(1,10,4)
coef(lm(y~poly(x,2,raw=TRUE)))
2) This is what my data originally looked like
A = matrix(c(5, 4, 2, 10, 10, 4, 5, 15, 6),nrow=3, ncol=3)
colnames(A)<-c("10", "25", "40")
rownames(A)<-c("Wheat 1", "Wheat 2", "Wheat 3")
A
3) This is my melted format
library(reshape2)  # provides melt()
A.melted<-as.data.frame(melt(A, id.vars="ID"))
A.melted<-A.melted[with(A.melted,order(Var1)),]
colnames(A.melted) <- c("WheatID", "Day", "Height")
A.melted$Day<-as.numeric(as.character(A.melted$Day))
A.melted
#
4) This is what I am trying to do with my loop....
for every 3 rows,
x<-A.melted[,2]
y<-A.melted[,3]
coef(lm(y~poly(x,2,raw=TRUE)))
something to compile the coefficients: a, b, c
I am just not familiar with the syntax of loops, and I'd love any tips and suggestions. Perusing Google tells me that one should not do loops unless it is absolutely required since I may run into more problems- thus I am open to non loop techniques as well.

If you want to do it in a loop try this. The crucial part is to use seq together with a by = argument to let the index take the steps you need.
library(tibble)
df <- tibble(
WheatID = rep(NA_character_, nrow(A)),
Intercept = rep(NA_real_, nrow(A)),
poly1 = rep(NA_real_, nrow(A)),
poly2 = rep(NA_real_, nrow(A))
)
cnt <- 1
for (i in seq(1, nrow(A.melted), by = 3)) {
x <- A.melted$Day[i + 0:2]
y <- A.melted$Height[i + 0:2]
df$WheatID[cnt] <- as.character(A.melted$WheatID[i])
df[cnt, 2:4] <- coef(lm(y~poly(x,2,raw=TRUE)))
cnt <- cnt + 1
}
df
Note: I am not a data.table guy. Therefore, I present you with a tibble.

We can do this with the help of data.table, see ?data.table:
library(data.table)
setDT(A.melted)  # A.melted is a data.frame; := needs a data.table
A.models = A.melted[, model := list(.(lm(Height ~ poly(Day, 2),
data = list(.(.SD[WheatID == .BY[[1]]]))))),
by = WheatID]
A.models[, coefs := list(.(coefficients(model[[1]]))),
by = WheatID]
You can access each model like this:
A.models[WheatID == "Wheat 1", model[[1]]]
and even
A.models[WheatID == "Wheat 1", summary(model[[1]])]
The magic here happens because data.table takes in J expressions, not only functions.
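If only the coefficients are needed, a shorter grouped call may be enough. The following is a sketch, not the answer's method: the data are rebuilt from the question's matrix A, and the column names intercept, linear and quadratic are chosen here for illustration.

```r
library(data.table)

# Rebuild the long-format data from the question's matrix A
A <- matrix(c(5, 4, 2, 10, 10, 4, 5, 15, 6), nrow = 3, ncol = 3,
            dimnames = list(c("Wheat 1", "Wheat 2", "Wheat 3"),
                            c("10", "25", "40")))
DT <- data.table(WheatID = rep(rownames(A), times = ncol(A)),
                 Day     = rep(as.numeric(colnames(A)), each = nrow(A)),
                 Height  = as.vector(A))

# Fit one quadratic per plant; as.list() spreads the coefficients into columns
coefs <- DT[, as.list(setNames(coef(lm(Height ~ poly(Day, 2, raw = TRUE))),
                               c("intercept", "linear", "quadratic"))),
            by = WheatID]
coefs
```

Grouping with by = WheatID does not depend on the rows being sorted or on every plant having exactly three rows.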

This is something you can do with the data.table package.
library(data.table)
# One chunk of three consecutive rows per plant
data.list <- split(A.melted, f = (1:nrow(A.melted) - 1) %/% 3)
coefs <- lapply(data.list, function(x) {
coefs <- coef(lm(Height ~ poly(Day, 2, raw = TRUE), data = x))
data.table(
intercept = coefs[1],
day = coefs[2],
day.sq = coefs[3]
)
})
coefs <- rbindlist(coefs)

Or you could perform apply() directly on the original matrix:
x <- as.numeric(colnames(A))
apply(A, 1, function(y) coef(lm(y~poly(x,2,raw=TRUE))))
Wheat 1 Wheat 2 Wheat 3
(Intercept) -3.88888889 -0.555555556 6.666667e-01
poly(x, 2, raw = TRUE)1 1.11111111 0.477777778 1.333333e-01
poly(x, 2, raw = TRUE)2 -0.02222222 -0.002222222 -2.417315e-18
Or you could transpose the data and use the coef(...) call directly:
x <- as.numeric(colnames(A))
coef(lm(t(A) ~ poly(x, 2, raw = TRUE)))
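To check that the two matrix-based variants agree, and to get the result as one row per plant, something like the following should work. It is a sketch; the names per_plant, all_at_once and result and the column labels are assumptions made for the example.

```r
A <- matrix(c(5, 4, 2, 10, 10, 4, 5, 15, 6), nrow = 3, ncol = 3,
            dimnames = list(c("Wheat 1", "Wheat 2", "Wheat 3"),
                            c("10", "25", "40")))
x <- as.numeric(colnames(A))

# Variant 1: one lm() per row of A
per_plant <- apply(A, 1, function(y) coef(lm(y ~ poly(x, 2, raw = TRUE))))

# Variant 2: a single lm() with a matrix response (one column per plant)
all_at_once <- coef(lm(t(A) ~ poly(x, 2, raw = TRUE)))

# Tidy up: one row per plant, columns named after the quadratic's terms
result <- data.frame(WheatID = colnames(all_at_once),
                     t(all_at_once),
                     row.names = NULL)
names(result)[-1] <- c("intercept", "linear", "quadratic")
result
```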

Create a matrix from a list consisting of unequal matrices for individual bootstraps

I tried to create a matrix from a list which consists of N unequal matrices...
The reason to do this is to make R individual bootstrap samples.
In the example below you can find e.g. 2 companies, where we have 1 with 10 & 1 with just 5 observations.
Data:
set.seed(7)
Time <- c(10,5)
xv <- matrix(c(rnorm(10,5,2), rnorm(5,20,1), rnorm(10,5,2), rnorm(5,20,1)), ncol=2);
y <- matrix( c(rnorm(10,5,2), rnorm(5,20,1)));
z <- matrix(c(rnorm(10,5,2), rnorm(5,20,1), rnorm(10,5,2), rnorm(5,20,1)), ncol=2)
# create data frame of input variables which helps
# to conduct the rowise bootstrapping
data <- data.frame (y = y, xv = xv, z = z);
rows <- dim(data)[1];
cols <- dim(data)[2];
# create the index to sample from the different panels
cumTime <- c(0, cumsum (Time));
index <- findInterval (seq (1:rows), cumTime, left.open = TRUE);
# draw R individual bootstrap samples
bootList <- replicate(n = 5, list(), simplify = FALSE);
bootList <- lapply (bootList, function(x) by (data, INDICES = index, FUN = function(x) dplyr::sample_n (tbl = x, size = dim(x)[1], replace = T)));
---------- UNLISTING ---------
Currently, I try do it incorrectly like this:
Example for just 1 entry of the list:
matrix(unlist(bootList[[1]], recursive = T), ncol = cols)
The desired output is just
bootList[[1]]
as a matrix.
Do you have an idea how to do this & if possible reasonably efficient?
The matrices are then processed in unfortunately slow MLE estimations...

I found a solution for you. From what I gather, you have a data frame containing all observations of all companies, which may have different panel lengths, and as a result you would like a bootstrap sample for each company of the same size as the original panel length.
You merely have to add a company indicator:
data$company = c(rep(1, 10), rep(2, 5)) # this could even be a factor.
L1 = split(data, data$company)
L2 = lapply(L1, FUN = function(s) s[sample(x = 1:nrow(s), size = nrow(s), replace = TRUE),] )
# Stop here if you would like separate bootstrap samples, e.g. in case you want to estimate each company separately
bootdata = do.call(rbind, L2)
Best wishes,
Tim
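Since the question asks for matrices and for several bootstrap replicates, the same idea can be wrapped in a function. This is only a sketch under assumptions: boot_once() is a hypothetical helper, the toy data are random, and 5 replicates are used as in the question.

```r
set.seed(7)
# Toy panel: company 1 has 10 observations, company 2 has 5
data <- data.frame(y = rnorm(15), x1 = rnorm(15), x2 = rnorm(15))
company <- rep(1:2, times = c(10, 5))

# Hypothetical helper: resample rows within each company, return one matrix
boot_once <- function(data, group) {
  idx <- unlist(lapply(split(seq_len(nrow(data)), group),
                       function(i) i[sample.int(length(i), replace = TRUE)]))
  as.matrix(data[idx, , drop = FALSE])
}

bootList <- replicate(5, boot_once(data, company), simplify = FALSE)
dim(bootList[[1]])   # same shape as the original data
```

Sampling row indices rather than data frames keeps the work per replicate small; the only conversion is one as.matrix() call at the end.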

DMwR::unscale to unscale only selected columns

I've got a data.frame with 4 columns which I want to scale and then add some new columns (without scaling them). Then I perform some calculations, after which I need to unscale only the first 4 columns (as the remaining two weren't scaled in the first place). DMwR::unscale seems to allow for that with the col.ids argument. But when I call the function as below, it returns
Error in DMwR::unscale(cbind(scale(x), x2), scale(x), 1:4) :
Incorrect dimension of data to unscale.
x <- matrix(2*rnorm(400) + 1, ncol = 4)
x2 <- matrix(9*rnorm(200), ncol = 2)
DMwR::unscale(cbind(scale(x), x2), scale(x), 1:4)
What am I doing wrong? How can I unscale only selected 4 first columns of matrix?

DMwR::unscale(vals, norm.data, col.ids) requires vals to have exactly as many columns as col.ids selects; here vals has 6 columns but col.ids selects only 4, hence the error.
I suggest the following modified version of unscale:
myunscale <- function (vals, norm.data, col.ids) {
cols <- if (missing(col.ids)) 1:NCOL(vals) else col.ids
if (length(cols) > NCOL(vals))
stop("Incorrect dimension of data to unscale.")
centers <- attr(norm.data, "scaled:center")[cols]
scales <- attr(norm.data, "scaled:scale")[cols]
unvals <- scale(vals[,cols], center = (-centers/scales), scale = 1/scales)
unvals <- cbind(unvals,vals[,-cols])
attr(unvals, "scaled:center") <- attr(unvals, "scaled:scale") <- NULL
unvals
}
set.seed(1)
x <- matrix(2*rnorm(4000) + 1, ncol = 4)
x2 <- matrix(9*rnorm(2000), ncol = 2)
x_unsc <- myunscale(cbind(scale(x), x2), scale(x) , 1:4)
The mean values and the standard deviations of x_unsc are:
apply(x_unsc, 2, mean)
# [1] 0.9767037 0.9674762 1.0306181 1.0334445 -0.1805717 -0.1053083
apply(x_unsc, 2, sd)
# [1] 2.069832 2.079963 2.062214 2.077307 8.904343 8.810420
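The same inversion can also be written without a helper function, using the attributes that scale() stores. A minimal sketch in base R (the names xs, vals and restored are chosen for the example) that leaves the unscaled columns in place:

```r
set.seed(1)
x  <- matrix(2 * rnorm(400) + 1, ncol = 4)
x2 <- matrix(9 * rnorm(200), ncol = 2)

xs   <- scale(x)                 # stores centers and scales as attributes
vals <- cbind(xs, x2)            # columns 1:4 scaled, 5:6 untouched

cols    <- 1:4
centers <- attr(xs, "scaled:center")
scales  <- attr(xs, "scaled:scale")

# Invert (x - center) / scale column by column: multiply back, then add center
restored <- vals
restored[, cols] <- sweep(sweep(vals[, cols], 2, scales, `*`), 2, centers, `+`)
```

Comparing restored[, 1:4] with x using all.equal() is a quick way to confirm the round trip.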

How to avoid gaps due to missing values in matplot in R?

I have a function that uses matplot to plot some data. Data structure is like this:
test = data.frame(x = 1:10, a = 1:10, b = 11:20)
matplot(test[,-1])
matlines(test[,1], test[,-1])
So far so good. However, if there are missing values in the data set, then there are gaps in the resulting plot, and I would like to avoid those by connecting the edges of the gaps.
test$a[3:4] = NA
test$b[7] = NA
matplot(test[,-1])
matlines(test[,1], test[,-1])
In the real situation this is inside a function, the dimension of the matrix is bigger, and the number of rows, columns and the position of the non-overlapping missing values may change between different calls, so I'd like to find a solution that could handle this in a flexible way. I also need to use matlines.
I was thinking of maybe filling in the gaps with interpolated data, but maybe there is a better solution.

I came across this exact situation today, but I didn't want to interpolate values - I just wanted the lines to "span the gaps", so to speak. I came up with a solution that, in my opinion, is more elegant than interpolating, so I thought I'd post it even though the question is rather old.
The problem causing the gaps is that there are NAs between consecutive values. So my solution is to 'shift' the column values so that there are no NA gaps. For example, a column consisting of c(1,2,NA,NA,5) would become c(1,2,5,NA,NA). I do this with a function called shift_vec_na() in an apply() loop. The x values also need to be adjusted, so we can make the x values into a matrix using the same principle, but using the columns of the y matrix to determine which values to shift.
Here's the code for the functions:
# x -> vector
# bool -> boolean vector; must be same length as x. The values of x where bool
# is TRUE will be 'shifted' to the front of the vector, and the back of the
# vector will be all NA (i.e. the number of NAs in the resulting vector is
# sum(!bool))
# returns the 'shifted' vector (will be the same length as x)
shift_vec_na <- function(x, bool){
n <- sum(bool)
if(n < length(x)){
x[1:n] <- x[bool]
x[(n + 1):length(x)] <- NA
}
return(x)
}
# x -> vector
# y -> matrix, where nrow(y) == length(x)
# returns a list of two elements ('x' and 'y') that contain the 'adjusted'
# values that can be used with 'matplot()'
adj_data_matplot <- function(x, y){
y2 <- apply(y, 2, function(col_i){
return(shift_vec_na(col_i, !is.na(col_i)))
})
x2 <- apply(y, 2, function(col_i){
return(shift_vec_na(x, !is.na(col_i)))
})
return(list(x = x2, y = y2))
}
Then, using the sample data:
test <- data.frame(x = 1:10, a = 1:10, b = 11:20)
test$a[3:4] <- NA
test$b[7] <- NA
lst <- adj_data_matplot(test[,1], test[,-1])
matplot(lst$x, lst$y, type = "b")

You could use the na.interpolation function from the imputeTS package (renamed na_interpolation in imputeTS 3.0):
test = data.frame(x = 1:10, a = 1:10, b = 11:20)
test$a[3:4] = NA
test$b[7] = NA
matplot(test[,-1])
matlines(test[,1], test[,-1])
library('imputeTS')
test <- na.interpolation(test, option = "linear")
matplot(test[,-1])
matlines(test[,1], test[,-1])
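If adding a package is not an option, linear interpolation is also available in base R through approx(). This is a sketch; fill_gaps() is a hypothetical helper name, and NAs at the very start or end of a column stay NA because approx() does not extrapolate by default.

```r
test <- data.frame(x = 1:10, a = 1:10, b = 11:20)
test$a[3:4] <- NA
test$b[7] <- NA

# Hypothetical helper: fill NAs in every y column by linear interpolation over x
fill_gaps <- function(x, y) {
  as.data.frame(lapply(y, function(col) approx(x, col, xout = x)$y))
}

filled <- fill_gaps(test$x, test[, -1])
matplot(test$x, filled, type = "n")   # set up the axes without drawing points
matlines(test$x, filled)              # lines now run across the former gaps
```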

I had the same issue today. In my context I was not permitted to interpolate. Here is a minimal but sufficiently general working example of what I did. I hope it helps someone:
mymatplot <- function(data, main=NULL, xlab=NULL, ylab=NULL,...){
#graphical set up of the window
plot.new()
plot.window(xlim=c(1,ncol(data)), ylim=range(data, na.rm=TRUE))
mtext(text = xlab,side = 1, line = 3)
mtext(text = ylab,side = 2, line = 3)
mtext(text = main,side = 3, line = 0)
axis(1L)
axis(2L)
#plot the data
for(i in 1:nrow(data)){
nin.na <- !is.na(data[i,])
lines(x=which(nin.na), y=data[i,nin.na], col = i,...)
}
}
The core 'trick' is in x=which(nin.na). It aligns the data points of the line consistently with the indices of the x axis.
The lines
plot.new()
plot.window(xlim=c(1,ncol(data)), ylim=range(data, na.rm=TRUE))
mtext(text = xlab,side = 1, line = 3)
mtext(text = ylab,side = 2, line = 3)
mtext(text = main,side = 3, line = 0)
axis(1L)
axis(2L)
draw the graphical part of the window.
range(data, na.rm=TRUE) adapts the plot to a proper size being able to include all data points.
mtext(...) is used to label the axes and provides the main title. The axes themselves are drawn by the axis(...) command.
The following for-loop plots the data.
The function head of mymatplot provides the ... argument for optionally passing typical plot parameters such as lty, lwd, cex etc. These are passed on to lines().
A last word on the choice of colors: they are up to your taste.
