I simulated a data set with the following assumptions:
x1 <- rbinom(100, 1, 0.5) # trt (0/1 treatment indicator; size must be 1, not 0)
x2 <- rnorm(100,0,1) # metric outcome
df <- data.frame(x1,x2)
Now I'm trying to introduce missing values in two different ways: first "missing completely at random" (MCAR) and second "missing not at random" (MNAR). I have tried lots of packages, but they do not work as I expected.
For the first scenario (MCAR) I used:
df_mcar <- ampute(data = df, prop = 0.1, mech = "MCAR", patterns = c(1, 0))$amp
... and it seems to work (with probability 10%, only x2 has missing values, independently of x1)
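One way to verify this (a sketch; exact counts vary with the seed):
# Proportion of rows with x2 missing should be near 0.10,
# and the NA rate should not depend on x1 under MCAR:
mean(is.na(df_mcar$x2))
prop.table(table(df_mcar$x1, is.na(df_mcar$x2)), margin = 1)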
For the second scenario I again want only x2 to have missing values, but this time depending on x1: x2 should be missing only where x1 = 1, in 10% of those cases.
So in x2 I want missing values with probability p = 0.1 where x1 = 1 and with probability p = 0 where x1 = 0.
I would be glad for any hint or a simple solution :)
PS: I often read suggestions like prodNA(...), but I could not get it to work.
Could probably do something like:
library(dplyr)
df %>%
  mutate(
    x2 = if_else(x1 == 1 & runif(n()) < 0.1, NA_real_, x2)
  )
My R is currently too busy for me to run the code, though.
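For what it's worth, a quick check of that approach (a sketch, assuming the df simulated in the question):
set.seed(1)
df_mnar <- df %>%
  mutate(x2 = if_else(x1 == 1 & runif(n()) < 0.1, NA_real_, x2))
# NAs should occur only where x1 == 1, in roughly 10% of those rows:
tapply(is.na(df_mnar$x2), df_mnar$x1, mean)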
I have the following data:
set.seed(1)
df=data.frame(y=rnorm(500,0,20),x1=rnorm(500,50,100),x2=rnorm(500,10,40))
df$x3=df$x1+runif(500,-50,50); df$x4=df$x2+runif(500,-5,5)
These data are multicollinear. If I do this:
library(ppcor)
t <- pcor(df, method = "pearson")
t$estimate
I see that x1/x3 and x2/x4 have issues with multicollinearity. At the moment I have to screen the output manually. Is there a way to detect these items automatically? And any thoughts on what the threshold should be?
Regarding multicollinearity: there are many diagnostics that can help you detect it. For example, you can calculate the "variance inflation factor" using the vif function in the car package.
fit <- lm(y ~ x1 + x2 + x3 + x4, data = df)
vifValues <- car::vif(fit)
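A common rule of thumb (conventions differ; both 5 and 10 are in use) is to flag predictors whose VIF exceeds 5, which also automates the screening:
# Predictors flagged by a VIF threshold of 5 (the threshold is an assumption):
vifValues[vifValues > 5]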
In addition, Wikipedia has a full page about detecting multicollinearity, and this looks like a good blog post.
However, similar to what you did, I usually start with correlations in order to detect "problematic" high correlations. For that, you can "flatten" the correlation table and then filter out the high correlations.
Note: because I'm not familiar with ppcor I used Hmisc. However, the idea is the same.
require(tidyverse)
require(Hmisc)
#Flatten correlation matrix function
flattenCorrMatrix <- function(DF) {
  res <- DF %>% as.matrix() %>% rcorr()  # rcorr() returns r, P and n matrices
  ut <- upper.tri(res$r)                 # index each pair once
  data.frame(row    = rownames(res$r)[row(res$r)[ut]],
             column = rownames(res$r)[col(res$r)[ut]],
             cor    = res$r[ut],
             p      = res$P[ut],
             n      = res$n[ut])
}
#use the function, drop rows involving y, and keep absolute correlations above 0.7
flattenCorrMatrix(df) %>%
  filter(!grepl("y", row)) %>%
  filter(abs(cor) > 0.7)
Output:
row column cor p n
1 x1 x3 0.9626412 0 500
2 x2 x4 0.9972960 0 500
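If you want the screening fully automated, caret::findCorrelation takes a correlation matrix and suggests which columns to drop. A sketch (the 0.7 cutoff is an assumption, chosen to match the filter above):
library(caret)
corMat <- cor(df[, -1])                              # correlations among the predictors, y excluded
findCorrelation(corMat, cutoff = 0.7, names = TRUE)  # columns suggested for removal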
For a science project, I am looking for a way to generate random data in a certain range (e.g. min=0, max=100000) with a certain correlation with another variable which already exists in R. The goal is to enrich the dataset a little so I can produce some more meaningful graphs (no worries, I am working with fictional data).
For example, I want to generate random values correlating with r=-.78 with the following data:
var1 <- rnorm(100, 50, 10)
I already came across some pretty good solutions (i.e. https://stats.stackexchange.com/questions/15011/generate-a-random-variable-with-a-defined-correlation-to-an-existing-variable), but I only get very small values, which I cannot transform so that they make sense in the context of the other, original values.
Following the example:
var1 <- rnorm(100, 50, 10)
n <- length(var1)
rho <- -0.78
theta <- acos(rho)                              # angle that corresponds to the target correlation
x1 <- var1
x2 <- rnorm(n, 50, 50)                          # arbitrary starting vector
X <- cbind(x1, x2)
Xctr <- scale(X, center = TRUE, scale = FALSE)  # center the columns
Id <- diag(n)
Q <- qr.Q(qr(Xctr[, 1, drop = FALSE]))
P <- tcrossprod(Q)                              # = Q Q', projection onto span(x1)
x2o <- (Id - P) %*% Xctr[, 2]                   # component of x2 orthogonal to x1
Xc2 <- cbind(Xctr[, 1], x2o)
Y <- Xc2 %*% diag(1 / sqrt(colSums(Xc2^2)))     # scale both columns to unit length
var2 <- Y[, 2] + (1 / tan(theta)) * Y[, 1]      # mix to achieve the desired angle
cor(var1, var2)                                 # ~ -0.78
What I get for var2 are values ranging between -0.5 and 0.5, with a mean of 0. I would like the data to be much more spread out, so I could simply transform it by adding 50 and get a range quite similar to my first variable.
Does anyone know a way to generate this kind of, more or less, meaningful data?
Thanks a lot in advance!
Starting with var1, renamed to A, and using 10,000 points:
set.seed(1)
A <- rnorm(10000,50,10) # Mean of 50
First convert values in A to have the new desired mean 50,000 and have an inverse relationship (ie subtract):
B <- 1e5 - (A*1e3) # Note that { mean(A) * 1000 = 50,000 }
This only results in r = -1. Add some noise to achieve the desired r:
B <- B + rnorm(10000,0,8.15e3) # Note this noise has mean = 0
# the amount of noise, 8.15e3, was found through parameter-search
This has your desired correlation:
cor(A,B)
[1] -0.7805972
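Incidentally, the amount of noise needn't be found by search. Assuming independent additive noise, r = -sd(signal) / sqrt(sd(signal)^2 + sd(noise)^2), which can be solved for the noise sd; a quick sketch:
# sd of the noiseless signal B = 1e5 - A*1e3 is 1000 * sd(A) = 10,000
sd_signal <- 1e4
rho <- -0.78
sd_noise <- sd_signal * sqrt(1/rho^2 - 1)  # ~8023, close to the 8.15e3 found by search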
View with:
plot(A,B)
Caution
Your B values might fall outside your range of 0 to 100,000. You might need to filter out values outside your range if you use a different seed or generate more numbers.
That said, the current range is fine:
range(B)
[1] 1668.733 95604.457
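If some values do land outside the range, clamping is one option (a sketch; note it slightly perturbs the correlation and the shape):
B <- pmin(pmax(B, 0), 1e5)  # clamp into [0, 100000]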
If you're happy with the correlation and the marginal distribution (i.e., shape) of the generated values, multiply the values (which fall between -0.5 and +0.5) by 100,000 and add 50,000.
> c(-0.5, 0.5) * 100000 + 50000
[1] 0e+00 1e+05
Edit: this approach, or anything else where 100,000 and 50,000 are exchanged for different numbers, is an example of the 'linear transformation' recommended by @gregor-de-cillia.
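Following that up, a sketch applying the transformation to var2 from the code above:
var2_scaled <- var2 * 100000 + 50000  # map roughly (-0.5, 0.5) onto (0, 100000)
cor(var1, var2_scaled)  # unchanged: Pearson correlation is invariant under increasing linear maps
range(var2_scaled)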
I'd like to solve an equation for a variable for each row of a given CSV file.
You may know the equation as the Euler-Lotka equation.
That is what I have so far:
# seed is needed for reproducible results (otherwise random numbers will never be the same!)
set.seed(42)
# using the Euler-Lotka equation
# l = survival rate until age x
# m = amount of offspring at age x
# x = age of reproduction
# r = population growth rate
y <- function(r, l1, l2, l3, m1, m2, m3, x1, x2, x3, z) {
  (l1*m1*exp(-r*x1)) + (l2*m2*exp(-r*x2)) + (l3*m3*exp(-r*x3)) - z
}
# iterate through each line calculating r and writing it into the respective field
for (i in 1:length(neos_data$jar_no)){
# declare the variables from table (this does not work!!)
l1 <- neos_data$surv_rate_clutch1[i]
l2 <- neos_data$surv_rate_clutch2[i]
l3 <- neos_data$surv_rate_clutch3[i]
m1 <- neos_data$indiv_sum_1_clutch[i]
m2 <- neos_data$indiv_sum_2_clutch[i]
m3 <- neos_data$indiv_sum_3_clutch[i]
x1 <- neos_data$age_clutch_1[i]
x2 <- neos_data$age_clutch_2[i]
x3 <- neos_data$age_clutch_3[i]
# this works, even though these numbers are the same as in the data frame
l1 <- 0.9333333
l2 <- 0.9333333
l3 <- 0.9333333
m1 <- 3.4
m2 <- 0
m3 <- 0
x1 <- 9
x2 <- 13
x3 <- 16
## uniroot searches for a zero, so the function is offset by z; that's why -z appears in the formula above
r <- uniroot(y, l1 = l1, l2 = l2, l3 = l3, m1 = m1, m2 = m2, m3 = m3,
             x1 = x1, x2 = x2, x3 = x3, z = 1, interval = c(-1, 1))$root # keep only the root
# write r into table
neos_data$pop_gr[i] <- r
}
As I already commented, uniroot works fine with manual input of values. But when I try to load a value from my data frame, it gives the error "values of f() have the same sign".
I do understand the meaning of the error itself, but why does it work with the values I insert manually and not with the same values from the data frame (and yes, I have checked the data types).
Would be glad for any help, as what I've seen so far was not helpful in my case :)
EDIT:
To clarify: I want the value of r for which the equation equals 0. The given code works fine as long as I type the values in as literal numbers. But when I pass the same values from my data frame, it fails.
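For reference, a single solve with those literal values does succeed (a sketch; with m2 = m3 = 0 the root is ln(l1*m1)/x1 ≈ 0.128):
uniroot(y, l1 = 0.9333333, l2 = 0.9333333, l3 = 0.9333333,
        m1 = 3.4, m2 = 0, m3 = 0, x1 = 9, x2 = 13, x3 = 16,
        z = 1, interval = c(-1, 1))$root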
Ok, I think I've found the problem.
There are some rows where every part of the sum is 0. Whenever the loop hits one of those rows, the error occurs and the whole run fails.
This seems natural as the equation is:
1 = SUM( l(x) * m(x) * exp(-r*x) )
If all l(x) and m(x) are 0, the sum is 0 for every r, so it can never equal 1.
I didn't spot this at first because the script didn't work at all. Now, after rewriting and deleting code, it writes the resulting r into the data frame up to the first row of zeros, which is what led me to this conclusion.
Why does this always happen after hours of trying? :D
However, to work around the issue, I inserted 0.0001 into the affected fields just to get the loop running. In my case I only want to copy the r values into my data master sheet. As there are only 3 rows of all zeros (they came from NAs, which uniroot cannot handle), I will delete those values by hand (NAs won't disturb any further calculation).
Thanks for your help anyway; it pointed me in the right direction :)
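For anyone hitting the same thing, a guard inside the loop avoids the manual 0.0001 workaround (a sketch using the variable names from the question):
# Skip rows where every l*m product is zero or missing: no root can exist there
lm_prods <- c(l1*m1, l2*m2, l3*m3)
if (anyNA(lm_prods) || all(lm_prods == 0)) {
  neos_data$pop_gr[i] <- NA
} else {
  neos_data$pop_gr[i] <- uniroot(y, l1 = l1, l2 = l2, l3 = l3,
                                 m1 = m1, m2 = m2, m3 = m3,
                                 x1 = x1, x2 = x2, x3 = x3,
                                 z = 1, interval = c(-1, 1))$root
}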
I am trying to generate a formula using dataframe column names of the following format:
d ~ x1 + x2 + x3 + x4
From the following sample dataset:
a = c(1,2,3)
b = c(2,4,6)
c = c(1,3,5)
d = c(9,8,7)
x1 = c(1,2,3)
x2 = c(2,4,6)
x3 = c(1,3,5)
x4 = c(9,8,7)
df = data.frame(a,b,c,d,x1,x2,x3,x4)
As for what I have tried already:
I know that I can subset only the columns I need using the following approach
predictors = names(df[5:8])
response = names(df[4])
However, my attempts to include these in a formula have failed.
How can I assemble the predictors and the response variables into the following format:
d ~ x1 + x2 + x3 + x4
I ultimately want to input this formula into a randomForest function.
We can avoid the entire problem by using the default method of randomForest (rather than the formula method):
randomForest(df[5:8], df[[4]])
or in terms of predictors and response defined in the question:
randomForest(df[predictors], df[[response]])
As mentioned in the Note section of the randomForest help file, the default method used here has the additional advantage of better performance than the formula method.
How about:
reformulate(predictors,response=response)
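In case it helps, a quick sketch of feeding the result to randomForest (predictors and response as defined in the question):
f <- reformulate(predictors, response = response)
f
# d ~ x1 + x2 + x3 + x4
library(randomForest)
rf <- randomForest(f, data = df)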
I have this sample data table:
df <- data.table(
  indexer = 0:12,
  x1 = c(0, 1000, 1500, 1000, 1000, 2000,
         1000, 1000, 0, 351.2, 1000, 1000, 1851.2)
)
Now I need to create two additional columns x2 and x3 in this data table such that x2[i] = x1[i] - x3[i] and x3[i] = x2[i-1], with x3[1] = 0.
How can I do this efficiently, without using a loop?
EDIT1: expected results are
x2 = c(0.0,1000.0,500.0,500.0,500.0,1500.0,-500.0,1500.0,-1500.0,1851.2,-851.2,1851.2,0.0)
and
x3 = c(0.0,0.0,1000.0,500.0,500.0,500.0,1500.0,-500.0,1500.0,-1500.0,1851.2,-851.2,1851.2)
EDIT2: This is my first time posting a question, hence all the confusion. Forget the example, guys; the formulas are:
x3[i] = c - x2[i-1]*(1+r/12); x2[i] = x1[i] - x3[i]; x3[1] = 0; # c is some constant.
The problem is that x2 and x3 depend on each other, so one first needs to express x2 in terms of x1 alone. Substituting x3[i] = x2[i-1] into x2[i] = x1[i] - x3[i] gives the recurrence x2[i] = x1[i] - x2[i-1], which unrolls to x2[i] = SUM_{k=1..i} (-1)^(i-k) * x1[k].
Once we have the formula, programming is easy:
df$x2 <- (-1)^(df$indexer) * cumsum(df$x1*(-1)^(df$indexer))
And x3 can be obtained from x2:
df$x3 <- c(0,df$x2[-nrow(df)])
[EDIT2] I guess the solution to the modified question, if it exists at all, should be sought along the same lines (see the sketch below). I don't think this is really a programming problem: the code is straightforward once the mathematical formula is known.
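A sketch along those lines for the EDIT2 recurrence (assumes the constants c and r are defined; cc stands in for c here to avoid masking base::c). Substituting x3[i] = c - x2[i-1]*(1+r/12) gives x2[i] = (x1[i] - c) + q*x2[i-1] with q = 1 + r/12, which unrolls the same way as above:
q <- 1 + r/12
u <- df$x1 - cc
u[1] <- df$x1[1]                                  # x3[1] = 0, so no constant on the first row
df$x2 <- q^df$indexer * cumsum(u / q^df$indexer)  # beware q^indexer over/underflow for long series
df$x3 <- df$x1 - df$x2                            # from x2[i] = x1[i] - x3[i]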