Error in creating a function to perform ttests on multiple continuous variables - r

So I'm trying to create a function that will take in a string of continuous variables, a categorical variable and a dataframe and output a table that includes, for each continuous variable: mean group1, mean group2, teststat, confidence interval, p-value.
What is currently here gives me the error: Error in model.frame.default(formula = var ~ class, data = data) : variable lengths differ (found for 'class')
I would love any feedback on how to fix this error and make this function do what I like. I want to make this function way more substantial and flexible, but I can't even get the basic version (handling multiple variables) to work.
THANKS!
#Continuous must be an object of the form:
#vars<-c("cont1", "cont2", "cont3", etc)
#CREATE DATA
cat1<-sample(c(1,2), 100, replace=T)
cont1<-rnorm(100, 25, 8)
cont2<-rnorm(100, 0, 1)
cont3<-rnorm(100, 6, 14.23)
cont4<-rnorm(100, 25, 8)*runif(5, 0.1, 1)
one<-data.frame(cat1, cont1, cont2, cont3, cont4)
#FUNCTION
two.group.comp<-function(continvars,class,data){
attach(data)
descriptives<-function(var){
test<-t.test(var~class, data)
means<-data.frame(test[5])
mean1<-means[1,1]
mean2<-means[2,1]
teststatbig<-data.frame(test[1])
teststat<-teststatbig[1,1]
conf<-data.frame(test[4])
lconf<-conf[1,1]
uconf<-conf[2,1]
pvalues<-data.frame(test[3])
pvalue<-pvalues[1,1]
variablename<-deparse(substitute(var))
entry<-data.frame(variablename,mean1,mean2,lconf,uconf,teststat,pvalue)
}
var<-data.frame(continvars)
table<<-sapply(var,descriptives)
detach(data)
}
#VARIABLES
continvars<-c("cont1", "cont2", "cont3")
#CALL TO FUNCTION
two.group.comp(continvars=continvars, class=cat1, data=one)

Does this do what you want?
two.group.comp <- function(continvars,class,data){
get.stats <- function(x,cat){
f <- unique(cat)
x1 <- x[cat==f[1]]
x2 <- x[cat==f[2]]
tt <- t.test(x1,x2)
smry <- c(tt$estimate,tt$statistic,p=tt$p.value)
names(smry) <- c("mean.1","mean.2","t","p")
return(smry)
}
result <- do.call(rbind,lapply(data[,continvars],get.stats,cat=class))
return(result)
}
# create sample dataset
set.seed(1)
cat1 <-sample(c(1,2), 100, replace=T)
cont1<-rnorm(100, 25, 8)
cont2<-rnorm(100, 0, 1)
cont3<-rnorm(100, 6, 14.23)
cont4<-rnorm(100, 25, 8)*runif(5, 0.1, 1)
one <-data.frame(cat1, cont1, cont2, cont3, cont4)
continvars<-c("cont1", "cont2", "cont3")
# call the function...
two.group.comp(continvars,cat1,one)
#           mean.1      mean.2          t          p
# cont1 24.4223859 25.33275704 -0.6024497 0.54827955
# cont2  0.0330148  0.01168979  0.1013519 0.91947827
# cont3 10.5784201  4.00651493  2.4183031 0.01747468
Working from the inside out:
get.stats(...) takes a single column of data, splits it into x1 and x2 according to cat, runs the t-test, and returns the summary statistics as a named vector.
lapply(...) passes the continvars columns of data to get.stats(...) one at a time.
do.call(rbind,...) binds together the set of vectors returned from lapply(...), row-wise, to generate the final result table.
This will work also if you pass column numbers instead of column names.
A piece of advice: the way you have it set up, you pass the column names of the continuous variables, but you pass the grouping factor as a vector. It would be cleaner if you pass the column name of the grouping factor.
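To illustrate that last piece of advice, here is a minimal sketch of a variant that takes the grouping column by name and also returns the confidence interval asked for in the question. The name two.group.comp2 and the argument classvar are made up for this example, and it assumes the grouping column has exactly two levels.

```r
# Hypothetical variant: the grouping factor is passed as a column name.
two.group.comp2 <- function(continvars, classvar, data) {
  g <- factor(data[[classvar]])
  stopifnot(nlevels(g) == 2)          # a two-sample t-test needs exactly two groups
  get.stats <- function(x) {
    tt <- t.test(x[g == levels(g)[1]], x[g == levels(g)[2]])
    c(mean.1 = unname(tt$estimate[1]),
      mean.2 = unname(tt$estimate[2]),
      lconf  = tt$conf.int[1],
      uconf  = tt$conf.int[2],
      t      = unname(tt$statistic),
      p      = tt$p.value)
  }
  # sapply() returns one column per variable; transpose to one row per variable
  t(sapply(data[continvars], get.stats))
}

# same kind of sample data as above
set.seed(1)
one <- data.frame(cat1  = sample(c(1, 2), 100, replace = TRUE),
                  cont1 = rnorm(100, 25, 8),
                  cont2 = rnorm(100, 0, 1),
                  cont3 = rnorm(100, 6, 14.23))
res <- two.group.comp2(c("cont1", "cont2", "cont3"), "cat1", one)
res
```

Because the groups are ordered by factor level here rather than by first appearance in the data, mean.1 always belongs to the lowest level.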


How to create a formulated table in R?

This is my reproducible example :
#http://gekkoquant.com/2012/05/26/neural-networks-with-r-simple-example/
library("neuralnet")
require(ggplot2)
traininginput <- as.data.frame(runif(50, min=0, max=100))
trainingoutput <- sqrt(traininginput)
trainingdata <- cbind(traininginput,trainingoutput)
colnames(trainingdata) <- c("Input","Output")
Hidden_Layer_1 <- 1 # value is randomly assigned
Hidden_Layer_2 <- 1 # value is randomly assigned
Threshold_Level <- 0.1 # value is randomly assigned
net.sqrt <- neuralnet(Output~Input,trainingdata, hidden=c(Hidden_Layer_1, Hidden_Layer_2), threshold = Threshold_Level)
#Test the neural network on some test data
testdata <- as.data.frame((1:13)^2) #Generate some squared numbers
net.results <- predict(net.sqrt, testdata) #Run them through the neural network
cleanoutput <- cbind(testdata,sqrt(testdata),
as.data.frame(net.results))
colnames(cleanoutput) <- c("Input","ExpectedOutput","NeuralNetOutput")
ggplot(data = cleanoutput, aes(x= ExpectedOutput, y= NeuralNetOutput)) + geom_point() +
geom_abline(intercept = 0, slope = 1
, color="brown", size=0.5)
rmse <- sqrt(sum((sqrt(testdata)- net.results)^2)/length(net.results))
print(rmse)
Here, when Hidden_Layer_1 is 1, Hidden_Layer_2 is 2, and Threshold_Level is 0.1, the rmse I get is 0.6717354.
As another example, when Hidden_Layer_1 is 2, Hidden_Layer_2 is 3, and Threshold_Level is 0.2, the rmse I get is 0.8355925.
How can I create a table that automatically calculates the rmse whenever the user assigns values to Hidden_Layer_1, Hidden_Layer_2, and Threshold_Level? (I know how to do it in Excel but not in R, haha.)
The desired table should have Trial(s), Hidden_Layer_1, Hidden_Layer_2, Threshold_Level, and rmse as its columns, and rows should be added indefinitely, for example through an actionButton (if possible), so the user can keep trying until they get the rmse they want.
How can I do that? Can anyone help me? I will definitely learn from this, as I am quite new to R.
Thank you very much to anyone who is willing to lend a hand.
Here is a way to create the table of values that can be displayed with the data frame viewer.
# initialize an object where we can store the parameters as a data frame
data <- NULL
# function to receive a row of parameters and add them to the
# df argument
addModelElements <- function(df,trial,layer1,layer2,threshold,rmse){
newRow <- data.frame(trial = trial,
Hidden_Layer_1 = layer1,
Hidden_Layer_2 = layer2,
Threshold = threshold,
RMSE = rmse)
rbind(df,newRow)
}
# once a model has been run, call addModelElements() with the
# model parameters
data <- addModelElements(data,1,1,2,0.1,0.671735)
data <- addModelElements(data,2,2,3,0.2,0.835593)
...and the output:
View(data)
Note that if you're going to create scores or hundreds of rows of parameters & RMSE results before displaying any of them to the end user, the code should be altered to improve the efficiency of rbind(). In this scenario, we build a list of sets of parameters, convert them into data frames, and use do.call() to execute rbind() only once.
# version that improves efficiency of rbind()
addModelElements <- function(trial,layer1,layer2,threshold,rmse){
# return row as data frame
data.frame(trial = trial,
Hidden_Layer_1 = layer1,
Hidden_Layer_2 = layer2,
Threshold = threshold,
RMSE = rmse)
}
# generate list of data frames and rbind() once
inputParms <- list(c(1,1,2,0.1,0.671735),
c(1,1,2,0.3,0.681935),
c(2,2,3,0.2,0.835593))
parmList <- lapply(inputParms,function(x){
addModelElements(x[1],x[2],x[3],x[4],x[5])
})
# bind to single data frame
data <- do.call(rbind,parmList)
View(data)
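If the RMSE should be computed rather than typed in by hand, the same pattern can be driven by a scoring function. The following is only a sketch: run_trials and score_fun are hypothetical names, and dummy_score is a placeholder so the example runs without neuralnet. In practice score_fun would wrap the neuralnet() fit, the predict() call and the rmse calculation from the question.

```r
# Hypothetical helper: one row per parameter set, RMSE filled in by score_fun.
run_trials <- function(params, score_fun) {
  rows <- lapply(seq_len(nrow(params)), function(i) {
    p <- params[i, ]
    data.frame(trial = i,
               Hidden_Layer_1 = p$Hidden_Layer_1,
               Hidden_Layer_2 = p$Hidden_Layer_2,
               Threshold = p$Threshold,
               RMSE = score_fun(p$Hidden_Layer_1, p$Hidden_Layer_2, p$Threshold))
  })
  do.call(rbind, rows)   # rbind() is executed only once
}

# parameter sets to try
params <- data.frame(Hidden_Layer_1 = c(1, 2),
                     Hidden_Layer_2 = c(2, 3),
                     Threshold      = c(0.1, 0.2))

# Placeholder scorer, NOT a real RMSE; replace with a function that fits the network.
dummy_score <- function(layer1, layer2, threshold) threshold * (layer1 + layer2)

trials <- run_trials(params, dummy_score)
trials
```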

Transform left skewed data in R

I have a column that is left-skewed, I need to transform it. So I tried this
library(car)
vect<-c(1516201202, 1526238001, 1512050372, 1362933719, 1516342174, 1526502557 ,1523548827, 1512241202,1526417785, 1517846464)
powerTransform(vect)
The values in the vector are 10-digit numeric Unix epoch timestamps like these. I have a few thousand values and am pasting 10 of them here; I do the same operation on the entire column. This gave me an error:
Error in qr.resid(xqr, w * fam(Y, lambda, j = TRUE, ...)) : NA/NaN/Inf in foreign function call (arg 5)
I was expecting transformed column back. Any Idea on how to do this in R?
Thanks
Raj
Generally, car::powerTransform returns a powerTransform object (a list containing, among other things, the estimated Box-Cox transformation parameter(s)). To get the transformed values, you need bcPower, which takes the original data together with the estimated parameter(s), extracted from the powerTransform object with coef().
Unfortunately you don't provide sample data, so here's an example based on the iris dataset.
library(car)
# Box-Cox transformation of `Sepal.Length`
df <- iris
trans <- powerTransform(df$Sepal.Length)
# Or the same using formula syntax:
# trans <- powerTransform(Sepal.Length ~ 1, data = df)
# Add the transformed `Sepal.Length` data to the original `data.frame`
df <- cbind(
df,
Sepal.Length_trans = bcPower(
with(iris, cbind(Sepal.Length)), coef(trans))[, 1])
# Show a histogram of the Box-Cox-transformed data
library(ggplot2)
ggplot(df, aes(Sepal.Length_trans)) +
geom_histogram(aes(Sepal.Length_trans), bins = 30)
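For reference, the one-parameter Box-Cox transform that bcPower applies can be written out in base R. This is a minimal sketch; the function name box_cox and the lambda values below are chosen only for illustration, not estimated from any data.

```r
# Box-Cox power transform: (y^lambda - 1) / lambda, with log(y) as the limit at lambda = 0.
# Only defined for strictly positive y.
box_cox <- function(y, lambda) {
  stopifnot(all(y > 0))
  if (abs(lambda) < 1e-8) log(y) else (y^lambda - 1) / lambda
}

y <- c(1, 2, 4, 8)
bc0 <- box_cox(y, 0)   # identical to log(y)
bc1 <- box_cox(y, 1)   # y - 1: shifts the data, shape unchanged
bc2 <- box_cox(y, 2)   # stretches the upper tail, which reduces left skew
```

Values of lambda above 1 pull out the right-hand side of the distribution, which is the direction needed for left-skewed data; values below 1 do the opposite.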

Generating n new datasets by randomly sampling existing data, and then applying a function to new datasets

For a paper I'm writing I have subsetted a larger dataset into 3 groups, because I thought the strength of correlations between 2 variables in those groups would differ (they did). I want to see if subsetting my data into random groupings would also significantly affect the strength of correlations (i.e., whether what I'm seeing is just an effect of subsetting, or if those groupings are actually significant).
To this end, I am trying to generate n new data frames by randomly sampling 150 rows from an existing dataset, and then want to calculate correlation coefficients for two variables in those n new data frames, saving the correlation coefficient and significance in a new file.
But, HOW?
I can do it manually, e.g., with dplyr, something like
newdata <- sample_n(Random_sample_data, 150)
output <- cor.test(newdata$x, newdata$y, method="kendall")
I'd obviously like to not type this out 1000 or 100000 times, and have been trying things with loops and lapply (see below) but they've not worked (undoubtedly due to something really obvious that I'm missing!).
Here I have tried to assign each row to a different group, with 10 groups in total, and then to do correlations between x and y by those groups:
Random_sample_data<-select(Range_corrected, x, y)
cat <- sample(1:10, 1229, replace=TRUE)
Random_sample_cats<-cbind(Random_sample_data,cat)
correlation <- function(c) {
c <- cor.test(x,y, method="kendall")
return(c)
}
b<- daply(Random_sample_cats, .(cat), correlation)
Error message:
Error in cor.test(x, y, method = "kendall") :
object 'x' not found
Once you have the code for what you want to do once, you can put it in replicate to do it n times. Here's a reproducible example on built-in data:
library(dplyr) # provides sample_n()
result = replicate(n = 10, expr = {
newdata <- sample_n(mtcars, 10)
output <- cor.test(newdata$wt, newdata$qsec, method="kendall")
})
replicate will save the result of the last line of what you did (output <- ...) for each replication. It will attempt to simplify the result, in this case cor.test returns a list of length 8, so replicate will simplify the results to a matrix with 8 rows and 10 columns (1 column per replication).
You may want to clean up the results a little bit so that, e.g., you only save the p-value. Here, we store only the p-value, so the result is a vector with one p-value per replication, not a matrix:
result = replicate(n = 10, expr = {
newdata <- sample_n(mtcars, 10)
cor.test(newdata$wt, newdata$qsec, method="kendall")$p.value
})
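To keep both the correlation coefficient and its p-value and save them to a file, as the question asks, the same replicate pattern can return a small named vector per replication. This sketch uses base R row sampling instead of sample_n; the file is written to a temporary path here, so substitute your own file name. Passing exact = FALSE is a choice made for the example to avoid warnings about ties.

```r
set.seed(42)
n_rep <- 100

# each replication returns c(tau, p.value); replicate() stacks them as columns
res <- t(replicate(n_rep, {
  idx <- sample(nrow(mtcars), 10)   # 10 random rows, without replacement
  ct  <- cor.test(mtcars$wt[idx], mtcars$qsec[idx],
                  method = "kendall", exact = FALSE)
  c(tau = unname(ct$estimate), p.value = ct$p.value)
}))
res <- as.data.frame(res)           # one row per replication

out_file <- tempfile(fileext = ".csv")   # placeholder path
write.csv(res, out_file, row.names = FALSE)
head(res)
```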

function to call the different columns for calculating Correlation and Confidence interval using Bootstrap in R

Here is the problem I am currently facing: I have a data frame (let's call it A) of 200 observations (rows) and 12 variables (columns), and I am trying to find a bootstrap confidence interval for the correlation between two variables in the data frame.
My Data:
library(boot)
library(tidyverse)
library(psychometric)
hsb2 <- read.table("https://stats.idre.ucla.edu/stat/data/hsb2.csv", sep=",", header=T)
Here I am trying to find the confidence interval of the correlation using the bootstrap.
I wrote code for that, and it works.
k<-CIr(r=orig.cor, n = 21, level = .95)
k
n<-length(hsb2$math)
#n
B<-5000
boot.cor.all<-NULL
for (i in 1:B){
index<-sample(1:n, replace=T)
boot.v2<-hsb2$math[index]
boot.v1<-hsb2$write[index]
boot.cor<-cor(boot.v1, boot.v2,method="spearman")
boot.cor.all<-c(boot.cor.all, boot.cor)
}
ci_boot<-quantile(boot.cor.all, prob=c(0.025, 0.975))
ci_boot
Result:
[1] 0.6439442
[1] 0.2939780 0.8416635
2.5% 97.5%
0.5556964 0.7211145
Here is the actual problem I am facing: I have to write a function to get the result for other variables, but this function is not working.
bo<-function(v1,v2,df){
orig.cor <- cor(df$v1,df$v2,method="spearman")
orig.ci<-CIr(r=orig.cor, n = 21, level = .95)
B<-5000
n<-length(df$v1)
boot.cor.all<-NULL
for (i in 1:B){
index<-sample(1:n, replace=T)
boot.hvltt2<-df$v1[index]
boot.hvltt<-df$v2[index]
boot.cor<-cor(boot.hvltt2, boot.hvltt,method="spearman")
boot.cor.all<-c(boot.cor.all, boot.cor)
}
ci_boot<-quantile(boot.cor.all, prob=c(0.025, 0.975))
return(orig.cor,orig.ci,ci_boot)
}
When I call this function like this:
bo(math,write,hsb2)
bo(math,read,hsb2)
bo(female,write,hsb2)
bo(female,read,hsb2)
I get this error:
Error in cor(df$v1, df$v2, method = "spearman") : supply both 'x' and 'y' or a matrix-like 'x'
How do I write this function correctly?
Each time I call the function, I want the result to be stored in a data frame like the one below:
Variable1  Variable2  Orig Cor  Orig CI                  Bootstrap CI
math       write      0.643     0.2939780   0.8416635    0.5556964    0.7211145
math       read       0.66      0.3242639   0.8511580    0.5736904    0.7400174
female     read       -0.059    -0.4787978  0.3820967    -0.20432743  0.08176896
female     write
science    write
science    read
The logic was right; I just had to change how the columns of df are accessed. R doesn't recognize the objects math and write because they are columns inside the data.frame, not variables in your workspace, and df$v1 looks for a column literally named v1. One way to pass them as arguments to the function is to give their names as strings, e.g. v1 = "math", and then access them with df[,v1].
bo<-function(v1,v2,df){
orig.cor <- cor(df[,v1],df[,v2],method="spearman")
orig.ci<-CIr(r=orig.cor, n = 21, level = .95) # n = 21 is kept from the question; nrow(df) is probably what is intended
B<-5000
n<-nrow(df) #Changed length to nrow
boot.cor.all<-NULL
for (i in 1:B){
index<-sample(1:n, replace=T)
boot.hvltt2<-df[index,v1]
boot.hvltt<-df[index,v2]
boot.cor<-cor(boot.hvltt2, boot.hvltt,method="spearman")
boot.cor.all<-c(boot.cor.all, boot.cor)
}
ci_boot<-quantile(boot.cor.all, prob=c(0.025, 0.975))
return(list(orig.cor,orig.ci,ci_boot)) #wrap your returns in a list
}
bo("math","write",hsb2)
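To get the results into a data frame with one row per pair of variables, as requested, the function can return a one-row data frame and the rows can be bound together afterwards. A minimal sketch, with several assumptions: boot_cor_row is a made-up name, the built-in mtcars data stands in for hsb2 so the example runs without a download, and the CIr() interval is left out to avoid the psychometric dependency.

```r
# Hypothetical helper: Spearman correlation plus percentile bootstrap CI as one row.
boot_cor_row <- function(v1, v2, df, B = 1000, level = 0.95) {
  n    <- nrow(df)
  orig <- cor(df[[v1]], df[[v2]], method = "spearman")
  boot <- replicate(B, {
    idx <- sample.int(n, replace = TRUE)      # resample rows with replacement
    cor(df[[v1]][idx], df[[v2]][idx], method = "spearman")
  })
  alpha <- 1 - level
  ci <- quantile(boot, probs = c(alpha / 2, 1 - alpha / 2), names = FALSE)
  data.frame(variable1 = v1, variable2 = v2,
             orig.cor = orig, boot.lower = ci[1], boot.upper = ci[2])
}

# pairs of column names to compare (mtcars columns, for illustration only)
var_pairs <- list(c("mpg", "wt"), c("mpg", "hp"), c("wt", "qsec"))

set.seed(1)
results <- do.call(rbind, lapply(var_pairs, function(p) {
  boot_cor_row(p[1], p[2], mtcars, B = 200)
}))
results
```

Using replicate() also avoids growing boot.cor.all with c() inside the loop.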

R: iterate a function over two lists simultaneously using lapply?

I have multiple factors dividing my data.
I would like to subset my data by one factor (uniqueGroup). By another factor (distance), I want to first classify my data using a "moving threshold" and then test for a statistical difference between the groups.
I have created a function movThreshold to classify my data and test it with wilcox.test. To vary the threshold values, I just run
lapply(th.list, # list of thresholds
movThreshold, # my function
tab = tab, # original data
dependent = "infGrad") # dependent variable
Now I've realized that I actually need to first subset my data by uniqueGroup and then vary the threshold value. But I am not sure how to write that in my lapply code.
My dummy data:
set.seed(10)
infGrad <- c(rnorm(20, mean=14, sd=8),
rnorm(20, mean=13, sd=5),
rnorm(20, mean=8, sd=2),
rnorm(20, mean=7, sd=1))
distance <- rep(c(1:4), each = 20)
uniqueGroup <- rep(c("x", "y"), 40)
tab<-data.frame(infGrad, distance, uniqueGroup)
# Create moving threshold function &
# test for original data
# ============================================
movThreshold <- function(th, tab, dependent, ...) {
# Classify data
tab$group<- ifelse(tab$distance < th, "a", "b")
# Calculate Wilcoxon test - as I have only two groups
test<-wilcox.test(tab[[dependent]] ~ as.factor(group), # specify column name
data = tab)
# Put results in a vector
c(th, unique(tab$uniqueGroup), dependent, uniqueGroup, round(test$p.value, 3))
}
# Define two vectors to run through
# unique group
gr.list<-unique(tab$uniqueGroup)
# unique threshold
th.list<-c(2,3,4)
How to run lapply over two lists??
lapply(c(th.list,gr.list), # iterate over two vectors, DOES not work!!
movThreshold,
tab = tab,
dependent = "infGrad")
In my previous question (Kruskal-Wallis test: create lapply function to subset data.frame?), I've learnt how to iterate through individual subsets within a table:
lapply(split(tab, tab$uniqueGroup), movThreshold)
But how to iterate through subsets, and through thresholds at once?
If I understood correctly what you're trying to do, here is a data.table solution:
library(data.table)
setDT(tab)[, lapply(th.list, movThreshold, tab = tab, dependent = "infGrad"), by = uniqueGroup]
Also, you can just do a nested lapply.
lapply(gr.list, function(z) lapply(th.list, movThreshold, tab = tab[uniqueGroup == z, ], dependent = "infGrad"))
I apologize if I misunderstood what you're trying to do.
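Another base R option is to build every group/threshold combination with expand.grid() and walk both columns in parallel with Map(). This is a sketch rather than a drop-in replacement: test_one is a hypothetical rewrite of movThreshold that takes the group as an explicit argument and returns a one-row data frame.

```r
# dummy data, as in the question
set.seed(10)
tab <- data.frame(
  infGrad     = c(rnorm(20, 14, 8), rnorm(20, 13, 5), rnorm(20, 8, 2), rnorm(20, 7, 1)),
  distance    = rep(1:4, each = 20),
  uniqueGroup = rep(c("x", "y"), 40)
)

# Hypothetical helper: subset by group, classify by threshold, run the Wilcoxon test.
test_one <- function(gr, th, tab, dependent) {
  sub     <- tab[tab$uniqueGroup == gr, ]
  y       <- sub[[dependent]]
  is_near <- sub$distance < th               # class "a" in the original function
  p       <- wilcox.test(y[is_near], y[!is_near])$p.value
  data.frame(uniqueGroup = gr, threshold = th,
             dependent = dependent, p.value = round(p, 3))
}

# all combinations of group and threshold
combos <- expand.grid(gr = unique(tab$uniqueGroup), th = c(2, 3, 4),
                      stringsAsFactors = FALSE)

# Map() steps through both vectors simultaneously; MoreArgs holds the fixed arguments
res <- do.call(rbind, Map(test_one, combos$gr, combos$th,
                          MoreArgs = list(tab = tab, dependent = "infGrad")))
rownames(res) <- NULL
res
```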
