I have 2 continuous variables, each having values in the range [0, 1]. Each can be categorized as Low ($\le 0.25$), Medium ($0.25 - 0.70$) and High ($\ge 0.7$). I need to create an index using both the variables and use this index in a regression model. The generated index will be as per following truth table:
Var1 \ Var2 | Low | Medium | High   |
=======================================
Low         | Low | Low    | Low    |
Medium      | Low | Medium | Medium |
High        | Low | Medium | High   |
=======================================
Straightforward multiplication of the two variables is not the solution, as some pairs of High values will yield a Medium output (var1 = 0.75 and var2 = 0.8 give 0.6, for example).
In the model, I would like to use the index expression (rather than the categorical transformation). This will preserve the data variation.
What f(var1, var2) will provide me this index to be used in lm/R?
Help!!!
I do not know whether there is a built-in function for this, and I couldn't find one quickly. Can you use something like the following?
get_index <- function(var1, var2)
{
if (var1 < 0 || var1 > 1 || var2 < 0 || var2 > 1)
return("out of range");
low <- min(var1, var2);
if (low < 0.25)
return("Low");
if (low <= 0.70)
return("Medium");
return("High");
}
How about:
cfun <- function(x) cut(x,c(-0.01,0.25,0.7,1.01),
labels=c("low","medium","high"))
var1c <- cfun(var1)
var2c <- cfun(var2)
comb <- ifelse(var1c=="low" | var2c=="low", "low",
ifelse(var1c=="medium" | var2c=="medium", "medium",
"high"))
or actually, as suggested by other answers:
cfun(pmin(var1,var2))
After re-reading your request, my (second) guess is that you want only the "numerical index" and can dispense with a character vector of labels. If it is entered as a numerical variable in a regression formula, the p-value for that synthetic interaction gives you a "test of trend" for the joint "minimum" discretized level condition.
inter.n <- pmin( findInterval(x, c(0, .25, .7, 1)),
findInterval(y, c(0, .25, .7, 1)) )
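To make the boundary behaviour concrete, here is a small sketch (the test values, the seed and the response `resp` are made up for illustration). By default `findInterval()` uses intervals closed on the left, so 0.25 counts as level 2 and 0.7 as level 3, and a value of exactly 1 falls past the last break unless `rightmost.closed = TRUE` is set.

```r
breaks <- c(0, 0.25, 0.7, 1)
v <- c(0, 0.249, 0.25, 0.699, 0.7, 1)

lev_default <- findInterval(v, breaks)                           # 1 1 2 2 3 4
lev_closed  <- findInterval(v, breaks, rightmost.closed = TRUE)  # 1 1 2 2 3 3

# Joint "minimum" level used as a numeric term in lm() (test of trend)
set.seed(2)
x <- runif(300)
y <- runif(300)
inter.n <- pmin(findInterval(x, breaks, rightmost.closed = TRUE),
                findInterval(y, breaks, rightmost.closed = TRUE))
resp <- inter.n + rnorm(300)          # made-up response, illustration only
fit <- lm(resp ~ inter.n)
summary(fit)$coefficients
```

With `rightmost.closed = TRUE` the index stays within 1 to 3 even when a variable equals 1 exactly.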
Earlier comments:
At the moment it is unclear how you want the inequalities to work when values are at the boundaries. The findInterval function uses intervals closed on the left by default; setting left.open = TRUE closes them on the right instead. You say: "Low ($\le 0.25$), Medium ($0.25 - 0.70$) and High ($\ge 0.7$)", which would make a value of either 0.25 or 0.7 a member of two groups. The code is fairly simple if the groups are defined as Low ($\lt 0.25$), Medium ($\ge 0.25$ and $\lt 0.70$) and High ($\ge 0.7$):
x=runif(1000)
y=runif(1000)
inter <- c("Low", "Middle", "High")[ pmin( findInterval(x, c(0,.25,.7,1)),
findInterval(y, c(0, .25, .7, 1)))]
> table(inter)
inter
High Low Middle
78 383 539
If you use a modification of @BenBolker's cfun that makes ordered factors, you can get pmin to work directly on the values:
cfun2 <- function(x) cut(x,c(0, 0.25, 0.7, 1.01), include.lowest=TRUE,
labels=c("low","medium","high"), ordered=TRUE)
inter.f <- pmin( cfun2(x) , cfun2(y) )
table(inter.f)
#--------
inter.f
low medium high
449 473 78
And that is in some ways superior because the table function automagically honors the ordering of the factor labels.
I am a beginner with the R language and its syntax, but it seems you are looking for a function rather than a procedure.
What about using f(var1, var2)=min(var1,var2)? Clearly, you have to apply this to the numeric version, and then categorize the variables.
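A minimal sketch of that idea, assuming var1 and var2 are numeric vectors on [0, 1] (the simulated data, the response `resp` and the helper name `to_level` are made up for illustration). The element-wise minimum `pmin()` is a continuous index, and categorizing it afterwards reproduces the truth table, because the smaller of two values can never land in a higher category than either of them.

```r
set.seed(1)                       # illustrative data only
var1 <- runif(200)
var2 <- runif(200)

# Continuous index: element-wise minimum (min() would return a single number)
idx <- pmin(var1, var2)

# Hypothetical helper: 1 = Low [0, .25), 2 = Medium [.25, .7), 3 = High [.7, 1]
to_level <- function(v) {
  as.integer(cut(v, breaks = c(0, 0.25, 0.7, 1),
                 right = FALSE, include.lowest = TRUE))
}

# Truth-table check: level of the index equals the lower of the two levels
table_ok <- all(to_level(idx) == pmin(to_level(var1), to_level(var2)))
table_ok

# Made-up response, just to show the index entering lm() as a numeric term
resp <- 2 * idx + rnorm(200, sd = 0.1)
fit <- lm(resp ~ idx)
coef(fit)
```

Because the index stays numeric, the regression keeps the within-category variation that a Low/Medium/High factor would discard.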
In my view, since you want to use this new index in a regression, you are trying to do what is known as feature elimination. Generally, it is best to use all the variables you have if the total number of variables is small. If the number of variables is large and you therefore need to eliminate some, there are multiple ways to do it, including stepwise elimination, recursive feature elimination, etc.
In your case you only have 2 variables, and essentially you want to combine them without losing any variance. One thing you can use for that is Principal Component Analysis. Let's see an example:
#create data
var1 <- runif(100)
var2 <- runif(100)
df <- data.frame(var1,var2)
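#the line below fits a PCA model
PCAmod <- princomp(var1+var2,data=df) #note: without a leading ~ this is not a formula, so princomp receives the single vector var1+var2
> summary(PCAmod)
Importance of components:
Comp.1
Standard deviation 0.4052599
Proportion of Variance 1.0000000
Cumulative Proportion 1.0000000
The above shows that one principal component has been created, i.e. a vector of 100 new elements that accounts for 100% of the variance (proportion of variance in the table above). Note that this is 100% of the variance of the single summed variable var1+var2 that was passed in, not of var1 and var2 jointly; with the one-sided formula ~ var1 + var2, princomp would return two components, and the first would explain only part of the total variance.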
newvar <- PCAmod$scores #the new vector
Essentially, the newvar can be used instead of var1 and var2
If you need the vector to be numbers ranging between [0,1] then you can scale it:
scaled_newvar <- scale(newvar,center=min(newvar), scale=max(newvar)-min(newvar) )
> summary(scaled_newvar)
Comp.1
Min. :0.0000
1st Qu.:0.2991
Median :0.4607
Mean :0.4788
3rd Qu.:0.6566
Max. :1.0000
However, the above will probably not conform to your 'low','medium','high' condition table, but I think this is the right thing to do if you will use the above in a regression.
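For comparison, here is a hedged sketch of the same steps using the formula interface with a leading `~` (the seed and the names `pc1` and `scaled_pc1` are illustrative). With two roughly uncorrelated variables the first component explains only part of the total variance, and `scores` has one column per component, so the first column is picked explicitly before rescaling.

```r
set.seed(3)
df <- data.frame(var1 = runif(100), var2 = runif(100))

PCAmod <- princomp(~ var1 + var2, data = df)   # one-sided formula, no response
summary(PCAmod)                                # reports two components

pc1 <- PCAmod$scores[, 1]                      # first component only

# Min-max rescaling to [0, 1]
scaled_pc1 <- (pc1 - min(pc1)) / (max(pc1) - min(pc1))
range(scaled_pc1)
```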
If the above is not satisfying enough, then (and I wouldn't recommend it) you could do one of the following:
Just use the min(var1,var2) for each combination and use that
Multiply the two, applying the boundary value if it is outside the range you would want it to be e.g. if both var1 and var2 are high and their product is medium then choose 0.75 as the correct value.
According to your final edit, you could just multiply the 2 together without caring about 'low','medium','high'
I have the following field data:
sowing_date<- rep(c("Early" ,"Normal"), each=12)
herbicide<- rep (rep(c("No" ,"Yes"), each=6),2)
nitrogen<- rep (rep(c("No" ,"Yes"), each=3),4)
Block<- rep(c("Block 1" ,"Block 2", "Block 3"), times=8)
Yield<- c(30,27,25,40,41,42,37,38,40,48,47,46,25,27,26,
41,41,42,38,39,42,57,59,60)
DataA<- data.frame(sowing_date,herbicide,nitrogen,Block,Yield)
and I conducted 3-way ANOVA
anova3way <- aov (Yield ~ sowing_date + herbicide + nitrogen +
sowing_date:herbicide + sowing_date:nitrogen +
herbicide:nitrogen + sowing_date:herbicide:nitrogen +
factor(Block), data=DataA)
summary(anova3way)
There is a 3-way interaction among 3 factors. So, I'd like to see which combination shows the greatest yield.
I know how to compare mean difference with single factor like below, but in case of interactions, how can I do that?
library(agricolae)
LSD_Test<- LSD.test(anova3way,"sowing_date")
LSD_Test
For example, I'd like to check the mean difference under 3 way interaction, and also interaction between each two factors.
For example, I'd like to get this LSD result in R
Could you tell me how I can do that?
Many thanks,
One way, which does take some manual work, is to encode the experimental parameters as -1 and 1 in order to properly separate the 2 and 3 parameter interactions.
Once you have the values encoded, you can pull the residual degrees of freedom and the mean square error (the residual sum of squares divided by the residual degrees of freedom) from the ANOVA model and pass them to the LSD.test function.
See Example below:
sowing_date<- rep(c("Early" ,"Normal"), each=12)
herbicide<- rep (rep(c("No" ,"Yes"), each=6),2)
nitrogen<- rep (rep(c("No" ,"Yes"), each=3),4)
Block<- rep(c("Block 1" ,"Block 2", "Block 3"), times=8)
Yield<- c(30,27,25,40,41,42,37,38,40,48,47,46,25,27,26,
41,41,42,38,39,42,57,59,60)
DataA<- data.frame(sowing_date,herbicide,nitrogen,Block,Yield)
anova3way <- aov (Yield ~ sowing_date * herbicide * nitrogen +
factor(Block), data=DataA)
summary(anova3way)
#Encode the experiment's parameters as -1 and 1
DataA$codeSD <- ifelse(DataA$sowing_date == "Early", -1, 1)
DataA$codeherb <- ifelse(DataA$herbicide == "No", -1, 1)
DataA$codeN2 <- ifelse(DataA$nitrogen == "No", -1, 1)
library(agricolae)
LSD_Test<- LSD.test(anova3way, c("sowing_date"))
LSD_Test
#Manually defining the treatment and specifying the
# residual degrees of freedom and mean square error (from the residuals of the ANOVA)
df_err <- df.residual(anova3way) #14 for this design
ms_err <- deviance(anova3way) / df_err #residual sum of squares / residual df
print(LSD.test(y=DataA$Yield, trt=DataA$sowing_date, DFerror=df_err, MSerror=ms_err))
#Example for a two parameter value
print(LSD.test(y=DataA$Yield, trt=(DataA$codeSD*DataA$codeherb), DFerror=df_err, MSerror=ms_err))
#Example for the three parameter interaction
print(LSD.test(y=DataA$Yield, trt=(DataA$codeSD*DataA$codeherb*DataA$codeN2), DFerror=df_err, MSerror=ms_err))
#calculate the means and std (as a check; needs library(dplyr))
#DataA %>% group_by(sowing_date) %>% summarize(mean=mean(Yield), sd=sd(Yield))
#DataA %>% group_by(codeSD*codeherb*codeN2) %>% summarize(mean=mean(Yield), sd=sd(Yield))
You will need to manually track which run/condition goes with the -1 and 1 in the final report.
Edit:
So my answer above will show the overall effect based on interactions, for example how the interaction of herbicide and nitrogen affects yield.
Based on your comment that you want to determine which combination provides the greatest yield, you can use the LSD.test() function again, this time passing a vector of parameter names.
LSD_Test<- LSD.test(anova3way, c("sowing_date", "herbicide", "nitrogen"))
LSD_Test
From the groups part of the output you can see that Normal, Yes and Yes gives the highest yield. The "groups" column identifies clusters of results that are not significantly different from each other: rows sharing a letter belong to the same cluster. For example, the last 2 rows provide a similar yield.
...
$groups
Yield groups
Normal:Yes:Yes 58.66667 a
Early:Yes:Yes 47.00000 b
Normal:No:Yes 41.33333 c
Early:No:Yes 41.00000 cd
Normal:Yes:No 39.66667 cd
Early:Yes:No 38.33333 d
Early:No:No 27.33333 e
Normal:No:No 26.00000 e
...
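As a cross-check on the `$groups` table, the cell means can be recomputed with base R. This is a small sketch using the data from the question; only `aggregate()` is used beyond what is shown above, and the name `cell_means` is illustrative.

```r
sowing_date <- rep(c("Early", "Normal"), each = 12)
herbicide <- rep(rep(c("No", "Yes"), each = 6), 2)
nitrogen <- rep(rep(c("No", "Yes"), each = 3), 4)
Yield <- c(30,27,25,40,41,42,37,38,40,48,47,46,25,27,26,
           41,41,42,38,39,42,57,59,60)
DataA <- data.frame(sowing_date, herbicide, nitrogen, Yield)

# Mean yield for each of the 8 treatment combinations
cell_means <- aggregate(Yield ~ sowing_date + herbicide + nitrogen,
                        data = DataA, FUN = mean)

# Highest-yielding combination first
cell_means[order(-cell_means$Yield), ]
```

The means should match the Yield column of the `$groups` output; the letters themselves still come from `LSD.test()`.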
I need to run a 2-sample independent t-test, comparing Column1 to Column2. But Column1 is in DataframeA, and Column2 is in DataframeB. How should I do this?
Just in case relevant (feel free to ignore): I am a true beginner. My experience with R so far has been limited to running 2-sample matched t-tests within the same data frame by doing the following:
t.test(response ~ Column1,
data = (Dataframe1 %>%
gather(key = "Column1", value = "response", "Column1", "Column2")),
paired = TRUE)
TL;DR
t_test_result = t.test(DataframeA$Column1, DataframeB$Column2, paired=TRUE)
Explanation
If the data is paired, I assume that both dataframes will have the same number of observations (same number of rows). You can check this with nrow(DataframeA) == nrow(DataframeB) .
You can think of each column of a dataframe as a vector (an ordered list of values). The way that you have used t.test is by using a formula (y~x), and you were essentially saying: Given the dataframe specified in data, perform a t test to assess the significance in the difference in means of the variable response between the paired groups in Column1.
Another way of thinking about this is by grabbing the data in data and separating it into two vectors: the vector with observations for the first group of Column1, and the one for the second group. Then, for each vector, you compute the mean and stdev and apply the appropriate formula that will give you the t statistic and hence the p value.
Thus, you can just extract those 2 vectors separately and provide them as arguments to the t.test() function. I hope it was beginner-friendly enough ^^ otherwise let me know
EDIT: a few additions
(I was going to reply in the comments but realized I did not have space hehe)
Regarding what @Ashish did in order to turn it into a Welch's test, I'd say it was setting var.equal = FALSE. The paired parameter controls whether the t-test is run on paired samples or not, and since your data frames have unequal numbers of rows, I suspect the observations are not matched.
As for the Cohen's d effect size, you can check this stats exchange question, from which I copy the code:
For context, m1 and m2 are the groups' means (which you can get with m1 = mean(DataframeA$Column1)), s1 and s2 are the standard deviations (s2 = sd(DataframeB$Column2)) and n1 and n2 the sample sizes (n2 = length(DataframeB$Column2))
lx <- n1- 1 # degrees of freedom in group 1 (observations minus 1)
ly <- n2- 1 # degrees of freedom in group 2 (observations minus 1)
md <- abs(m1-m2) ## mean difference (numerator)
csd <- lx * s1^2 + ly * s2^2
csd <- csd/(lx + ly)
csd <- sqrt(csd) ## common sd computation
cd <- md/csd ## cohen's d
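Putting the pieces together, here is a minimal sketch with two made-up data frames of different lengths (the simulated data and the helper name `cohens_d` are assumptions for illustration). It runs the unpaired Welch test and computes Cohen's d with the pooled standard deviation, following the formula above.

```r
set.seed(4)
DataframeA <- data.frame(Column1 = rnorm(30, mean = 10, sd = 2))
DataframeB <- data.frame(Column2 = rnorm(45, mean = 11, sd = 2))

# Independent samples, unequal variances allowed (Welch)
res <- t.test(DataframeA$Column1, DataframeB$Column2,
              paired = FALSE, var.equal = FALSE)
res$p.value

# Hypothetical helper: Cohen's d with the pooled standard deviation
cohens_d <- function(x, y) {
  lx <- length(x) - 1                      # degrees of freedom, group 1
  ly <- length(y) - 1                      # degrees of freedom, group 2
  pooled_sd <- sqrt((lx * var(x) + ly * var(y)) / (lx + ly))
  abs(mean(x) - mean(y)) / pooled_sd
}
cohens_d(DataframeA$Column1, DataframeB$Column2)
```

Taking the vectors directly means the helper does not care which data frame each column lives in.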
This should work for you
res = t.test(DataframeA$Column1, DataframeB$Column2, alternative = "two.sided", var.equal = FALSE)
I tested differences among sampling sites in terms of abundance values using kruskal.test. However, I want to determine the multiple differences between sites.
The dunn.test function accepts either a data vector together with a categorical grouping vector, or a formula expression as in lm.
I called the function directly on a data frame with many columns, but I have not found an example that confirms my procedure.
library(dunn.test)
df<-data.frame(a=runif(5,1,20),b=runif(5,1,20), c=runif(5,1,20))
kruskal.test(df)
dunn.test(df)
My results were:
Kruskal-Wallis chi-squared = 6.02, df = 2, p-value = 0.04929
Kruskal-Wallis chi-squared = 6.02, df = 2, p-value = 0.05
Comparison of df by group
Between 1 and 2 2.050609, 0.0202
Between 1 and 3 -0.141421, 0.4438
Between 2 and 3 -2.192031, 0.0142
I took a look at your code, and you are close. One issue is that you should be specifying a method to correct for multiple comparisons, using the method argument.
Correcting for Multiple Comparisons
For your example data, I'll use the Benjamini-Yekutieli variant of the False Discovery Rate (FDR). The reasons why I think this is a good performer for your data are beyond the scope of StackOverflow, but you can read more about it and other correction methods here. I also suggest you read the associated papers; most of them are open-access.
library(dunn.test)
set.seed(711) # set pseudorandom seed
df <- data.frame(a = runif(5,1,20),
b = runif(5,1,20),
c = runif(5,1,20))
dunn.test(df, method = "by") # correct for multiple comparisons using "B-Y" procedure
# Output
data: df and group
Kruskal-Wallis chi-squared = 3.62, df = 2, p-value = 0.16
Comparison of df by group
(Benjamini-Yekuteili)
Col Mean-|
Row Mean | 1 2
---------+----------------------
2 | 0.494974
| 0.5689
|
3 | -1.343502 -1.838477
| 0.2463 0.1815
alpha = 0.05
Reject Ho if p <= alpha/2
Interpreting the Results
The first row in each cell provides the Dunn's pairwise z test statistic for each comparison, and the second row provides your corrected p-values.
Notice that, once corrected for multiple comparisons, none of your pairwise tests are significant at an alpha of 0.05, which is not surprising given that each of your example "sites" was generated by exactly the same distribution. I hope this has been helpful. Happy analyzing!
P.S. In the future, you should use set.seed() if you're going to construct example dataframes using runif (or any other kind of pseudorandom number generation). Also, if you have other questions about statistical analysis, it's better to ask at: https://stats.stackexchange.com/
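For readers who keep their data in long form, the following base R sketch shows the reshaping step and the Benjamini-Yekutieli adjustment itself (`stack()` and `p.adjust()` are base R; the three p-values are the unadjusted ones reported in the question, in the order 1-2, 1-3, 2-3). It is not a substitute for `dunn.test`, only an illustration of the vector-plus-grouping layout and of what the correction does to raw p-values.

```r
set.seed(711)
df <- data.frame(a = runif(5, 1, 20),
                 b = runif(5, 1, 20),
                 c = runif(5, 1, 20))

# Long format: one column of values, one grouping factor
long <- stack(df)                        # columns: values, ind

kw_wide <- kruskal.test(df)              # data frame treated as a list of groups
kw_long <- kruskal.test(values ~ ind, data = long)
same_stat <- all.equal(unname(kw_wide$statistic), unname(kw_long$statistic))
same_stat

# Benjamini-Yekutieli adjustment of the question's unadjusted p-values
p_raw <- c(0.0202, 0.4438, 0.0142)
p_by <- p.adjust(p_raw, method = "BY")
p_by
```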
OK, straight to the question. I have a database with lots and lots of categorical variables.
Sample database with a few variables as below
gender <- as.factor(sample( letters[6:7], 100, replace=TRUE, prob=c(0.2, 0.8) ))
smoking <- as.factor(sample(c(0,1),size=100,replace=T,prob=c(0.6,0.4)))
alcohol <- as.factor(sample(c(0,1),size=100,replace=T,prob=c(0.3,0.7)))
htn <- as.factor(sample(c(0,1),size=100,replace=T,prob=c(0.2,0.8)))
tertile <- as.factor(sample(c(1,2,3),size=100,replace=T,prob=c(0.3,0.3,0.4)))
df <- as.data.frame(cbind(gender,smoking,alcohol,htn,tertile))
I want to test the hypothesis, using a chi-square test, that there is a difference in the proportion of smokers, alcohol use, hypertension (htn) etc. by tertile (3 levels). I then want to extract the p-values for each variable.
Now I know I can test each individual variable using a 2 by 3 cross tabulation, but is there more efficient code to derive the test statistic and p-value across all variables in one go and extract the p-value for each variable?
Thanks in advance
Anoop
If you want to do all the comparisons in one statement, you can do
mapply(function(x, y) chisq.test(x, y)$p.value, df[, -5], MoreArgs=list(df[,5]))
# gender smoking alcohol htn
# 0.4967724 0.8251178 0.5008898 0.3775083
Of course doing tests this way is somewhat statistically inefficient since you are doing multiple tests here so some correction is required to maintain an appropriate type 1 error rate.
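One way to apply such a correction is `p.adjust()` from base R. This is a minimal sketch, assuming simulated data like the question's (built with `data.frame()` so the columns stay factors; the seed and the choice of Holm's method are illustrative, not a recommendation).

```r
set.seed(5)
n <- 100
df <- data.frame(
  gender  = factor(sample(c("f", "g"), n, replace = TRUE, prob = c(0.2, 0.8))),
  smoking = factor(sample(0:1, n, replace = TRUE, prob = c(0.6, 0.4))),
  alcohol = factor(sample(0:1, n, replace = TRUE, prob = c(0.3, 0.7))),
  htn     = factor(sample(0:1, n, replace = TRUE, prob = c(0.2, 0.8))),
  tertile = factor(sample(1:3, n, replace = TRUE, prob = c(0.3, 0.3, 0.4)))
)

# One chi-square test per variable against tertile
p_raw <- sapply(df[setdiff(names(df), "tertile")],
                function(x) chisq.test(table(x, df$tertile))$p.value)

# Holm adjustment controls the family-wise error rate across the 4 tests
p_holm <- p.adjust(p_raw, method = "holm")
cbind(p_raw, p_holm)
```

Other methods (for example "BH" for the false discovery rate) can be swapped in through the `method` argument.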
You can run the following code chunk if you want to get the test result in details:
lapply(df[,-5], function(x) chisq.test(table(x,df$tertile), simulate.p.value = TRUE))
You can get just p-values:
lapply(df[,-5], function(x) chisq.test(table(x,df$tertile), simulate.p.value = TRUE)$p.value)
This is to get the p-values in the data frame:
data.frame(lapply(df[,-5], function(x) chisq.test(table(x,df$tertile), simulate.p.value = TRUE)$p.value))
Thanks to RPubs for the inspiration.
http://www.rpubs.com/kaz_yos/1204
In the following code I use bootstrapping to calculate the C.I. and the p-value under the null hypothesis that two different fertilizers applied to tomato plants have no effect on plant yields (the alternative being that the "improved" fertilizer is better). The first random sample (x) comes from plants where a standard fertilizer has been used, while an "improved" one has been used on the plants that the second sample (y) comes from.
x <- c(11.4,25.3,29.9,16.5,21.1)
y <- c(23.7,26.6,28.5,14.2,17.9,24.3)
total <- c(x,y)
library(boot)
diff <- function(x,i) mean(x[i[6:11]]) - mean(x[i[1:5]])
b <- boot(total, diff, R = 10000)
ci <- boot.ci(b)
p.value <- sum(b$t>=b$t0)/b$R
What I don't like about the code above is that resampling is done as if there was only one sample of 11 values (separating the first 5 as belonging to sample x leaving the rest to sample y).
Could you show me how this code should be modified in order to draw resamples of size 5 with replacement from the first sample and separate resamples of size 6 from the second sample, so that bootstrap resampling would mimic the “separate samples” design that produced the original data?
EDIT2 :
Hack deleted as it was a wrong solution. Instead one has to use the argument strata of the boot function :
total <- c(x,y)
id <- as.factor(c(rep("x",length(x)),rep("y",length(y))))
b <- boot(total, diff, strata=id, R = 10000)
...
Be aware you're not going to get even close to a correct estimate of your p.value :
x <- c(1.4,2.3,2.9,1.5,1.1)
y <- c(23.7,26.6,28.5,14.2,17.9,24.3)
total <- c(x,y)
b <- boot(total, diff, strata=id, R = 10000)
ci <- boot.ci(b)
p.value <- sum(b$t>=b$t0)/b$R
> p.value
[1] 0.5162
How would you explain a p-value of 0.51 for two samples where all values of the second are higher than the highest value of the first?
The above code is fine to get a -biased- estimate of the confidence interval, but the significance testing about the difference should be done by permutation over the complete dataset.
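A minimal sketch of such a permutation test in base R (the number of permutations, the seed and the one-sided alternative "y larger than x" are assumptions for illustration). Under the null hypothesis the group labels are exchangeable, so the labels are reshuffled over the pooled values and the difference in means is recomputed each time.

```r
x <- c(1.4, 2.3, 2.9, 1.5, 1.1)
y <- c(23.7, 26.6, 28.5, 14.2, 17.9, 24.3)
total <- c(x, y)
n_x <- length(x)

obs <- mean(y) - mean(x)                 # observed difference in means

set.seed(6)
R <- 10000
perm <- replicate(R, {
  idx <- sample(length(total), n_x)      # random relabelling: who counts as "x"
  mean(total[-idx]) - mean(total[idx])
})

# One-sided p-value; the +1 counts the observed arrangement itself
p.value <- (sum(perm >= obs) + 1) / (R + 1)
p.value
```

With these two samples the p-value comes out far below 0.05, unlike the 0.5162 obtained by comparing the bootstrap replicates with `t0`.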
Following John, I think the appropriate way to use bootstrap to test if the sums of these two different populations are significantly different is as follows:
x <- c(1.4,2.3,2.9,1.5,1.1)
y <- c(23.7,26.6,28.5,14.2,17.9,24.3)
b_x <- boot(x, function(d, i) sum(d[i]), R = 10000) # the statistic must take (data, indices)
b_y <- boot(y, function(d, i) sum(d[i]), R = 10000)
z<-(b_x$t0-b_y$t0)/sqrt(var(b_x$t[,1])+var(b_y$t[,1]))
pnorm(z)
So we can clearly reject the null that they are the same population. I may have missed a degree of freedom adjustment, I am not sure how bootstrapping works in that regard, but such an adjustment will not change your results drastically.
While the actual soil beds could be considered a stratified variable in some instances this is not one of them. You only have the one manipulation, between the groups of plants. Therefore, your null hypothesis is that they really do come from the exact same population. Treating the items as if they're from a single set of 11 samples is the correct way to bootstrap in this case.
If you had two plots, and in each plot tried the different fertilizers over different seasons in a counterbalanced fashion, then the plots would be stratified samples and you'd want to treat them as such. But that isn't the case here.