How to move several columns to the front of a dataframe in R?

I am writing a simulation in which I try several multiple testing methods. In the simulation, I want to vary the percentage of true null hypotheses and move the true null hypotheses to the beginning of my dataframe. This is proving to be a little tricky when the number of null hypotheses is being varied.
I have looked into moving them by index, but this doesn't work in all cases (especially when h0 = 0).
It looks like relocate() might do what I want, but can I use it with several columns, referring to them only by column index?
I am just including the "inner loop" of my simulation, where the error occurs. First you can see the levels at which I want to vary h0.
iter <- 100 #number of iterations for 1 datapoint
rho_vec <- c(0, 0.20, 0.40, 0.60, 0.80) # correlation value
h0_vec <- c(0, 0.20, 0.40, 0.60, 0.80) # list of percentage of true h0
for(j in 1 : iter){
mu11 <- c(rep(0, (h0*50)), rep(1.5, (1-h0)*50)) #vector giving the means of the variables. true nulls have 0 mean, false nulls have 1.5 in mean. (12 false h0)
Sigma11 <- diag(k) + rho - diag(k)*rho #Making simple correlation matrix for dependent variables
corrdata1 <- mvrnorm(n, mu = mu11, Sigma = Sigma11)
# now we simulate the uncorrelated data with (1-h0)*50 non-true null hypotheses. n and k are the same.
mu12 <- c(rep(0, h0*50), rep(1.5, (1-h0)*50))
SigmaId <- diag(k) #making correlation matrix (id matrix) for independent data.
indepdata1 <- mvrnorm(n, mu = mu12, Sigma = SigmaId)
#we define the total data matrix for both of the cases
data1 <- cbind(corrdata1,indepdata1) #a 1000 x 100 matrix: 1000 observations of 100 variables
#reorder columns so the false nulls are the last columns.
#data1 <- data1[, c( 0:(h0*50), 51:(50+(h0*50)), (51-((1-h0)*50)):50, (101-(50*(1-h0))):100)] #can check this by calling colMeans(data1). I tried this version first.
data1 %>% relocate(c(0:(h0*50), 51:(50 + (h0*50))) %>% head()) # this is the relocate() approach.
}
This produces an error in relocate():
"Error in UseMethod("relocate") :
no applicable method for 'relocate' applied to an object of class "c('matrix', 'double', 'numeric')"
Does anyone have any ideas on how to do this? Advice is greatly appreciated!

The error message tells you that relocate() is being applied to a matrix, which it cannot handle. relocate() must be applied to a data frame, so convert the matrix with as.data.frame() or as_tibble() beforehand, as mentioned in the comments.
Finally, you need to reassign the result of the call, otherwise it won't have any effect:
data1 <- data1 %>% as_tibble() %>% relocate(c(0:(h0*50), 51:(50 + (h0*50))) %>% head())
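For illustration, here is a minimal, self-contained sketch of the same pattern on a toy tibble; the data and the column positions (3 and 5) are made up for the example, but it shows that relocate() accepts plain integer positions:
library(dplyr)
df <- as_tibble(as.data.frame(matrix(rnorm(30), ncol = 6))) # 6 columns named V1..V6
df <- df %>% relocate(c(3, 5)) # move columns 3 and 5 to the front by index
names(df) # "V3" "V5" "V1" "V2" "V4" "V6"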

Related

Combining for loops and ifelse in R

I am trying to write a for loop that will generate a correlation for a fixed column (LPS0) vs. all other columns in the data set. I don't want to use a correlation matrix, because I only care about the correlation of LPS0 with each of the other columns, not the correlations of the other columns with themselves. I then want to include an if statement to print only the significant correlations (p.value <= 0.05). I ran into some issues where some of the p-values are returned as NA, so I switched to if_else(). However, I am now getting an error. My code is as follows:
for(i in 3:ncol(microbiota_lps_0_morm)) {
morm_0 <- cor.test(microbiota_lps_0_morm$LPS0, microbiota_lps_0_morm[[colnames(microbiota_lps_0_morm)[i]]], method = "spearman")
if_else(morm_0$p.value <= 0.05, print(morm_0), print("Not Sig"), print("NA"))
}
The first value is returned, and then the loop stops with the following error:
Error in if_else():
! true must be length 1 (length of condition), not 8.
Backtrace: 1. dplyr::if_else(morm_0$p.value <= 0.05, print(morm_0), print("Not Sig"), print("NA"))
How can I make the loop print morm_0 only when p.value <= 0.05?
Here's a long piece of code which automates the whole thing. It might be overkill, but you can just take the matrix and use whatever you need. It makes use of the tidyverse.
df <- select_if(mtcars,is.numeric)
glimpse(df)
# keeping real names
dict <- cbind(original=names(df),new=paste0("v",1:ncol(df)))
# but changing names for better data viz
colnames(df) <- paste0("v",1:ncol(df))
# correlating between variables + p values
pvals <- list()
corss <- list()
for (coln in colnames(df)) {
pvals[[coln]] <- map(df, ~ cor.test(df[,coln], .)$p.value)
corss[[coln]] <- map(df, ~ cor(df[,coln], .))
}
# Keeping both matrices in a list
matrices <- list(
pvalues = matrix(data=unlist(pvals),
ncol=length(names(pvals)),
nrow=length(names(pvals))),
correlations = matrix(data=unlist(corss),
ncol=length(names(corss)),
nrow=length(names(corss)))
)
rownames(matrices[[1]]) <- colnames(df)
rownames(matrices[[2]]) <- colnames(df)
# Creating a combined data frame
long_cors <- expand.grid(Var1=names(df),Var2=names(df)) %>%
mutate(cor=unlist(matrices["correlations"]),
pval=unlist(matrices["pvalues"]),
same=Var1==Var2,
significant=pval<0.05,
dpcate=duplicated(cor)) %>%
# Leaving no duplicates, non-significant or self-correlation results
filter(same ==F,significant==T,dpcate==F) %>%
select(-c(same,dpcate,significant))
# Plotting correlations
long_cors %>%mutate(negative=cor<0) %>%
ggplot(aes(x=Var1,y=Var2,
color=negative,size=abs(cor),fill=Var2,
label=round(cor,2)))+
geom_label(show.legend = F,alpha=0.2)+
scale_color_manual(values = c("black","darkred"))+
# Sizing each correlation by its magnitude
scale_size_area(seq(1,100,length=length(unique(long_cors$Var1))))+ theme_light()+
theme(axis.text = element_text(face = "bold",size=12))+
labs(title="Correlation between variables",
caption = "p < 0.05")+xlab("")+ylab("")
If you want to correlate a column of a matrix with the remaining columns, you can do so with one function call:
mtx <- matrix(rnorm(800), ncol=8)
cor(mtx[,1], mtx[,-1])
However, you will not get p-values. For getting p-values, I would recommend this approach:
library(tidyverse)
significant <- map_dbl(2:ncol(mtx),
~ cor.test(mtx[,1], mtx[,.], use="p", method="s")$p.value)
Whenever you feel like you need a for loop in R, chances are you should be using another approach. for is a very un-R construct, and R gives many better ways of handling the same issues. The map_* family of functions from the tidyverse is but one of them. Another approach, in base R, would be to use apply:
significant <- apply(mtx[,-1], 2,
\(x) cor.test(x, mtx[,1], method="s", use="p")$p.value)
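As a side note on the original error: dplyr::if_else() is vectorized, so its true and false arguments must match the length of the condition, and print(morm_0) returns the whole htest object. A minimal sketch of the original loop using a plain if()/else instead (the column names are the ones assumed from the question, and the NA guard covers the missing p-values the question mentions):
for (i in 3:ncol(microbiota_lps_0_morm)) {
  morm_0 <- cor.test(microbiota_lps_0_morm$LPS0,
                     microbiota_lps_0_morm[[i]],
                     method = "spearman")
  # base if() expects a single TRUE/FALSE, so handle NA p-values explicitly
  if (!is.na(morm_0$p.value) && morm_0$p.value <= 0.05) {
    print(morm_0)
  } else {
    print("Not Sig")
  }
}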

Correlation matrix giving the wrong value in R

I wanted to get the correlation matrix for an actual value and a predicted value, but the results after the first column are wrong.
There are many missing values in my dataframe.
This is the code:
# Create different tables for overall (actual) and potential (predicted) values for every year
actual <- table %>%
select(starts_with('overall_'))
predicted <- table %>%
select(starts_with('potential_'))
# Create a matrix for r of every year
cormat <- round( cor(
x = actual, y = predicted,
use = "complete.obs",
method = "pearson"),
3)
cormat
However, when I calculate the correlation manually, the result is different.
round(
cor(table$overall_15,
table$potential_15,
use = "complete.obs"), 3)
# Result: 0.804
Does anybody know why?
Thank you,
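One likely explanation (an assumption, since it is not confirmed in the thread): when cor() is given two matrices and use = "complete.obs", it performs casewise deletion, dropping every row that has an NA in any of the supplied columns, whereas the manual two-column call only drops rows missing in those two columns. A small sketch of the difference, with use = "pairwise.complete.obs" reproducing the per-pair result:
set.seed(42)
actual <- matrix(rnorm(100), ncol = 5) # stand-ins for the overall_/potential_ tables
predicted <- matrix(rnorm(100), ncol = 5)
actual[sample(100, 15)] <- NA # scatter some missing values
# casewise deletion: rows with an NA in any column are removed before anything is computed
cor(actual, predicted, use = "complete.obs")[1, 1]
# per-pair deletion: only rows missing in the two columns involved are removed,
# which matches the manual two-column call
cor(actual, predicted, use = "pairwise.complete.obs")[1, 1]
cor(actual[, 1], predicted[, 1], use = "complete.obs")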

Logistic regression training and test data

I am a beginner with R and am having trouble with something that feels basic, but I am not sure how to do it. I have a data set with 1319 rows, and I want to set up the training data as observations 1 to 1000 and the test data as observations 1001 to 1319.
Comparing with notes from my class, the professor set this up by creating a Boolean vector from the 'Year' variable in her data. For example:
train=(Year<2005)
And that returns a vector of TRUE/FALSE values.
I understand that, and I would be able to set up a Boolean vector if I were subsetting my data by a variable, but instead I have to subset strictly by row number, which I do not understand how to accomplish. I tried
train=(data$nrow < 1001)
But got logical(0) as a result.
Can anyone lead me in the right direction?
You get logical(0) because nrow is not a column of your data frame, so data$nrow returns NULL.
You can also subset your dataframe by using row numbers:
train = 1:1000 # vector with integers from 1 to 1000
test = 1001:nrow(data)
train_data = data[train,]
test_data = data[test,]
But be careful: unless the order of the rows in your dataframe is completely random, you probably want to pick 1000 rows at random rather than the first 1000. You can do this using
train = sample(1:nrow(data),1000)
You can then get your train_data and test_data using
train_data = data[train,]
test_data = data[setdiff(1:nrow(data),train),]
The setdiff() function is used to get all row numbers not selected in train.
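A compact sketch putting the random split together, with set.seed() added so the split is reproducible (the seed value is arbitrary, and data is the data frame from the question):
set.seed(123) # any fixed seed makes the split reproducible
train <- sample(1:nrow(data), 1000) # 1000 random row numbers
train_data <- data[train, ]
test_data <- data[setdiff(1:nrow(data), train), ] # the remaining 319 rows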
The issue with splitting your data set by row position is the potential to introduce bias into your training and testing sets, particularly for ordered data.
# Create a data set
data <- data.frame(year = sample(seq(2000, 2019, by = 1), 1000, replace = T),
data = sample(seq(0, 1, by = 0.01), 1000, replace = T))
nrow(data)
[1] 1000
If you really want to take the first n rows then you can try:
first.n.rows <- data[1:1000, ]
The caret package provides a more reliable approach to using cross validation in your models.
First create the partition rule:
library(caret)
inTrain <- createDataPartition(y = data$year,
p = 0.8, list = FALSE)
Note y = data$year: this tells caret to sample from the year variable, ensuring you don't split ordered data and introduce bias into the model.
The p argument tells caret how much of the original data should be partitioned to the training set, in this case 80%.
Then apply the partition to the data set:
# Create the training set
train <- data[inTrain,]
# Create the testing set
test <- data[-inTrain,]
nrow(train) + nrow(test)
[1] 1000
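One small caveat: createDataPartition() samples at random, so if you need the same split across runs, set a seed first (a minimal sketch; the seed value is arbitrary):
set.seed(42) # arbitrary seed, only needed for reproducibility
inTrain <- createDataPartition(y = data$year, p = 0.8, list = FALSE)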

R: Testing each level of a factor without creating new variables

Suppose I have a data frame with a binary grouping variable and a factor. An example of such a grouping variable could specify assignment to the treatment and control conditions of an experiment. In the below, b is the grouping variable while a is an arbitrary factor variable:
a <- c("a","a","a","b","b")
b <- c(0,0,1,0,1)
df <- data.frame(a,b)
I want to complete two-sample t-tests to assess the below:
For each level of a, whether there is a difference in the mean propensity to adopt that level between the groups specified in b.
I have used the dummies package to create separate dummies for each level of the factor and then manually performed t-tests on the resulting variables:
library(dummies)
new <- dummy.data.frame(df, names = "a")
t.test(new$aa, new$b)
t.test(new$ab, new$b)
I am looking for help with the following:
Is there a way to perform this without creating a large number of dummy variables via dummy.data.frame()?
If there is not a quicker way to do it without creating a large number of dummies, is there a quicker way to complete the t-test across multiple columns?
Note
This is similar to, but different from, "R - How to perform the same operation on multiple variables", and nearly the same as "Apply t-test on many columns in a dataframe split by factor", but the solution to that question no longer works.
Here is a base R solution implementing a chi-squared test for equality of proportions, which I believe is more likely to answer whatever question you're asking of your data (see my comment above):
set.seed(1)
## generate similar but larger/more complex toy dataset
a <- sample(letters[1:4], 100, replace = T)
b <- sample(0:1, 10, replace = T)
head((df <- data.frame(a,b)))
a b
1 b 1
2 b 0
3 c 0
4 d 1
5 a 1
6 d 0
## create a set of contingency tables for proportions
## of each level of df$a to the others
cTbls <- lapply(unique(a), function(x) table(df$a==x, df$b))
## apply chi-squared test to each contingency table
results <- lapply(cTbls, prop.test, correct = FALSE)
## preserve names
names(results) <- unique(a)
## only one result displayed for sake of space:
results$b
2-sample test for equality of proportions without continuity
correction
data: X[[i]]
X-squared = 0.18382, df = 1, p-value = 0.6681
alternative hypothesis: two.sided
95 percent confidence interval:
-0.2557295 0.1638177
sample estimates:
prop 1 prop 2
0.4852941 0.5312500
Be aware, however, that you might not want to interpret your p-values without correcting for multiple comparisons. A quick simulation demonstrates that the chance of incorrectly rejecting the null hypothesis with at least one of your tests can be dramatically higher than 5%(!):
set.seed(11)
sum(
replicate(1e4, {
a <- sample(letters[1:4], 100, replace = T)
b <- sample(0:1, 100, replace = T)
df <- data.frame(a,b)
cTbls <- lapply(unique(a), function(x) table(df$a==x, df$b))
results <- lapply(cTbls, prop.test, correct = FALSE)
any(lapply(results, function(x) x$p.value < .05))
})
) / 1e4
[1] 0.1642
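If you do want corrected p-values, stats::p.adjust() can be applied to the raw p-values collected from the tests above (a minimal sketch, assuming the results list produced earlier):
raw_p <- sapply(results, function(x) x$p.value) # one p-value per level of a
p.adjust(raw_p, method = "holm") # Holm correction; "bonferroni" or "BH" also work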
I don't exactly understand what this is doing from a statistical standpoint, but this code generates a list where each element is the output of the t.test() you ran above:
a <- c("a","a","a","b","b")
b <- c(0,0,1,0,1)
df <- data.frame(a,b)
library(dplyr)
library(tidyr)
dfNew<-df %>% group_by(a) %>% summarise(count = n()) %>% spread(a, count)
lapply(1:ncol(dfNew), function (x)
t.test(c(rep(1, dfNew[1,x]), rep(0, length(b)-dfNew[1,x])), b))
This will save you from typing t.test(foo, bar) over and over, and also eliminates the need for dummy variables.
Edit: I don't think the above method preserves the order of the columns, only the frequency of values measured as 0 or 1. If the order is important (again, I don't know the goal of this procedure), then you can use the dummy method and lapply through the data.frame you named new.
library(dummies)
new <- dummy.data.frame(df, names = "a")
lapply(1:(ncol(new)-1), function(x)
t.test(new[,x], new[,ncol(new)]))

Custom AUC in R with different thresholds and binary predictions

I am looking to plot an FPR vs. TPR point on an AUC graph for different thresholds.
For example, if data$C2 is the column of my data frame with the true responses (either 0 or 1), I want to make a vector of predicted values (0 or 1) depending on whether data$C1 (a different measurement column) is above or below a specified threshold. Here is the function I've attempted with the ROCR package.
fun <- function (data, col1, col2){
perfc <- NULL #Create null vectors for prediction and performance
perfs <- NULL
temp <- NULL
d <- seq(0.10,0.30,0.01) ##Various thresholds to be tested
for (i in length(d)){
temp <- ifelse(data[,col1] > d, 1 , 0) ##Create predicted responses
pred <- prediction(temp, data[,col2]) #Predict responses over true values
perf <- performance(pred, "tpr","fpr") #Store performance information
predc[i] <- pred #Do this i times for every d in the sequence
perfc[i] <- perf
preds <- prediction.class(predc, col2) #Combine to make prediction class
perfs <- performance.class(preds, "tpr","fpr") #Combine to make performance class
}
plot(perfs) #Plot TPR against FPR
}
Is the problem because temp is a list vector and the true labels are from a matrix? Am I applying this for loop incorrectly?
Thanks in advance!
Edit: Here's my attempt to do this manually, without the ROCR package.
for(t in seq(0.40,0.60,0.01)) #I want to do this for every t in the sequence
{
t <- t
TP <- 0
FP <- 0
p <- sum(data$C2==1, na.rm=TRUE) #Total number of true positives
n <- sum(data$C2==0, na.rm=TRUE) #Total number of true negatives
list <- data$C1 #Column to vector
test <- ifelse(list > t, 1, 0) #Make prediction vector
for(i in 1:nrow(data))
{if(test==1 & data$C2==1)
{TP <- TP + 1} #Count number of correct predictions
if(test==1 & data$C2==0)
{FP <- FP + 1} #Count number of false positives
}
plot(x=(FP/n),y=(TP/p)) #Plot every FP,TP pair
}
I hope I understand your question right, but I think that by "AUC graph" you mean a ROC curve. The ROC curve already takes different thresholds into account when making those classification decisions. See this Wikipedia page; I found this picture particularly helpful.
If the above is right, then all you need to do in your code is:
pred <- prediction(data[,col1], data[,col2])
perf <- performance(pred, "tpr","fpr")
plot(perf)
If you would like to add a different curve to that plot, perhaps because you used a different classification technique (e.g. a decision tree instead of logistic regression), then use plot(perf2, add = TRUE), where perf2 is created in the same way as perf. See the documentation.
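A self-contained sketch of that pattern with simulated data; the score and label vectors are invented for the example, so substitute data$C1 and data$C2 from the question:
library(ROCR)
set.seed(1)
score <- runif(200) # continuous predictions (stand-in for data$C1)
labels <- rbinom(200, 1, prob = score) # true 0/1 outcomes (stand-in for data$C2)
pred <- prediction(score, labels) # ROCR prediction object
perf <- performance(pred, "tpr", "fpr") # TPR vs FPR across all thresholds
plot(perf)
# a second classifier's curve can be overlaid on the same axes
score2 <- runif(200)
perf2 <- performance(prediction(score2, labels), "tpr", "fpr")
plot(perf2, add = TRUE, col = "red")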
