means/pvalue table from t.test in R - r

Is there a way to extract the mean and p-value from a t.test output and create a table that includes the features, mean, and p-value? Say there are 10 columns put through t.test, and that means there are 10 means, and 10 p-values. How would I be able to create a table which only shows those specific items?
here is an example: data (iris):
a. b. c. d. e.
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
t.test(a)
t.test(b) #...ect we obtain the mean and p-value.
this is the output im looking for:
feature mean p-val
col1 0.01 0.95
col2 0.01 0.95
.
.
.
coln
hope it makes sense!

Using the iris built in data set as an example
t(sapply(iris[, 1:4], function(i){
t.test(i)[c(5,3)]
}))
The sapply() function is iteratively performing that custom function - which performs a t-test on a variable and returns the estimate and p-value - through columns 1 to 4 of iris. That is then transposed by t() to rotate the data to the desired shape. You can store that as a data.frame using data.frame() and use row.names() to get the variable names into a new column on that if you like.
values <- t(sapply(iris[, 1:4], function(i){
t.test(i)[c(5,3)]
}))
values <- data.frame("feature"=row.names(values), values)
row.names(values) <- NULL
values
Beware multiple testing though...

Related

Repeated Simulation of New Data Prediction with Tidymodels (Parsnip XGboost)

I have a model, called predictive_fit <- fit(workflow, training) that classifies the Iris dataset species using xgboost. The data are pivoted wide such that each species is a dummied column represented by a 0 or 1. Here, I am trying to predict Virginica based on the Sepal and Petal columns.
Currently, I have the following code which then takes the dataset after the model has been fit to test if it can accurately predict the Virginia species of iris. (Snippet below)
testing_data <-
test %>%
bind_cols(
predict(predictive_fit, test)
)
I cannot, however, figure out how to scale this up with simulation. If I have another dataset with exactly the same structure, I would like to predict whether it is Virginica 100 times. (Snippet below)
new_iris_data <-
new_iris_data %>%
bind_cols(
replicate(n = 100, predict(predictive_fit, new_iris_data))
)
However, it looks as if when I run the new data the same predictions are just being copied 100 times. What is the appropriate way to repeatedly predict the classification? I wouldn't expect that all 100 times the model would predict exactly the same thing, but I'd like some way to have the predictions run n number of times so each and every row of new data can have its own proportion calculated.
I have already tried using the replicate() function to try this. However, it appears as if it copies the same exact results 100 times. I considered having a for loop that iterated through a different seed and then ran the predictions, but I was hoping for a more performant solution out there.
You are replicating the prediction of you model, not the data.frame you call new_iris_data, and the result is exactly that. In order to replicate a (random) part of the iris dataset, try this:
> data("iris")
>
> sample <- sample(nrow(iris), floor(nrow(iris) * 0.5))
>
> train <- iris[sample,]
> test <- iris[-sample,]
>
> new_test <- replicate(100, test, simplify = FALSE)
> new_test <- Reduce(rbind.data.frame, new_test)
>
> head(new_test)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
8 5.0 3.4 1.5 0.2 setosa
9 4.4 2.9 1.4 0.2 setosa
> nrow(new_test)
[1] 7500
The you can use the new_test in any prediction, independent of the model.
If you want 100 differents random parts of the data set, you need to drop the replicate function and do something like:
> new_test <- lapply(1:100, function(x) {
+ sample <- sample(nrow(iris), floor(nrow(iris) * 0.5))
+ iris[-sample,]
+ })
>
> new_test <- Reduce(rbind.data.frame, new_test)
>
> head(new_test)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
7 4.6 3.4 1.4 0.3 setosa
10 4.9 3.1 1.5 0.1 setosa
11 5.4 3.7 1.5 0.2 setosa
13 4.8 3.0 1.4 0.1 setosa
18 5.1 3.5 1.4 0.3 setosa
> nrow(new_test)
[1] 7500
>
Hope it helps.

apply ntile function to list of data frames with different bucket sizes

I would like to use the ntile function from dplyr or a similar function on a list of data frames but using a different n for each data frame. My list contains 150 data frames so a manual solution like the one below will not work. How can I rewrite the code below to act on the list of data frames and return the list of data frames with the new column?
library(tidyverse)
iris_list=split(iris,iris$Species)
iris_setosa=iris_list[[1]]
iris_versicolor=iris_list[[2]]
iris_virginica=iris_list[[3]]
iris_setosa$n3=ntile(iris_setosa$Sepal.Length,3)
iris_versicolor$n5=ntile(iris_setosa$Sepal.Length,5)
iris_virginica$n7=ntile(iris_setosa$Sepal.Length,7)
The final result should be this
final_list=list(iris_setosa,iris_versicolor,iris_virginica)
head(final_list[[1]])
Sepal.Length Sepal.Width Petal.Length Petal.Width Species n3
1 5.1 3.5 1.4 0.2 setosa 2
2 4.9 3.0 1.4 0.2 setosa 1
3 4.7 3.2 1.3 0.2 setosa 1
4 4.6 3.1 1.5 0.2 setosa 1
5 5.0 3.6 1.4 0.2 setosa 2
6 5.4 3.9 1.7 0.4 setosa 3
There are several ways to achieve this, depending on what type of object you want in the end.
One way would be to use base::expand.grid and purrr::pmap like this:
percentiles = list(3,5,7)
iris_list %>%
map("Sepal.Length") %>%
expand.grid(percentiles) %>%
pmap(~ntile(..1,..2))
First, you want only the Sepal.Length variable of all your datasets, so you use purrr::map to get them.
Then, expand.grid creates a dataframe of all combinations of its parameters. Here, with 2 lists of 3 members, it would return a dataframe of 3x3=9 rows: setosa 3, versicolor 3, virginica 3, setosa 5, ...
Finally, pmap can iterate over the dataframe and apply the function ntile, with the first column (iris_list) as the first argument and the second column (percentiles) as the second argument. Unfortunately, purrr is very bad in dealing with names, but it seems that it is on purpose.
EDIT:
Your edit is somehow another question, so here is another answer:
iris_list %>%
map(~mutate(.x, n3=ntile(Sepal.Length,3)),
n5=ntile(Sepal.Length,5)), n7=ntile(Sepal.Length,7)))
I've found a way that works
n_size=data.frame(Species=c("setosa ","versicolor","virginica"),size=c(3,5,7))
iris_bin=iris %>% inner_join(n_size,by="Species") %>%
group_by(Species)%>%
mutate(bin=ntile(Sepal.Length,size[1])) %>%
arrange(Species,Sepal.Length,bin)

Is it possible to combine parameters to a subset function that is generated programmatically in R?

Before my question, here is a little background.
I am creating a general purpose data shaping and charting library for plotting survey data of a particular format.
As part of my scripts, I am using the subset function on my data frame. The way I am working is that I have a parameter file where I can pass this subsetting criteria into my functions (so I don't need to directly edit my main library). The way I do this is as follows:
subset_criteria <- expression(variable1 != "" & variable2 == TRUE)
(where variable1 and variable2 are columns in my data frame, for example).
Then in my function, I call this as follows:
my.subset <- subset(my.data, eval(subset_criteria))
This part works exactly as I want it to work. But now I want to augment that subsetting criteria inside the function, based on some other calculations that can only be performed inside the function. So I am trying to find a way to combine together these subsetting expressions.
Imagine inside my function I create some new column in my data frame automatically, and then I want to add a condition to my subsetting that says that this additional column must be TRUE.
Essentially, I do the following:
my.data$newcolumn <- with(my.data, ifelse(...some condition..., TRUE, FALSE))
Then I want my subsetting to end up being:
my.subset <- subset(my.data, eval(subset_criteria & newcolumn == TRUE))
But it does not seem like simply doing what I list above is valid. I get the wrong solution. So I'm looking for a way of combining these expressions using expression and eval so that I essentially get the combination of all the conditions.
Thanks for any pointers. It would be great if I can do this without having to rewrite how I do all my expressions, but I understand that might be what is needed...
Bob
You should probably avoid two things: using subset in non-interactive setting (see warning in the help pages) and eval(parse()). Here we go.
You can change the expression into a string and append it whatever you want. The trick is to convert the string back to expression. This is where the aforementioned parse comes in.
sub1 <- expression(Species == "setosa")
subset(iris, eval(sub1))
sub2 <- paste(sub1, '&', 'Petal.Width > 0.2')
subset(iris, eval(parse(text = sub2))) # your case
> subset(iris, eval(parse(text = sub2)))
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
6 5.4 3.9 1.7 0.4 setosa
7 4.6 3.4 1.4 0.3 setosa
16 5.7 4.4 1.5 0.4 setosa
17 5.4 3.9 1.3 0.4 setosa
18 5.1 3.5 1.4 0.3 setosa
19 5.7 3.8 1.7 0.3 setosa
20 5.1 3.8 1.5 0.3 setosa
22 5.1 3.7 1.5 0.4 setosa
24 5.1 3.3 1.7 0.5 setosa
27 5.0 3.4 1.6 0.4 setosa
32 5.4 3.4 1.5 0.4 setosa
41 5.0 3.5 1.3 0.3 setosa
42 4.5 2.3 1.3 0.3 setosa
44 5.0 3.5 1.6 0.6 setosa
45 5.1 3.8 1.9 0.4 setosa
46 4.8 3.0 1.4 0.3 setosa

biglm finds the wrong data.frame to take the data from

I am trying to create chunks of my dataset to run biglm. (with fastLm I would need 350Gb of RAM)
My complete dataset is called res. As experiment I drastically decreased the size to 10.000 rows. I want to create chunks to use with biglm.
library(biglm)
formula <- iris$Sepal.Length ~ iris$Sepal.Width
test <- iris[1:10,]
biglm(formula, test)
And somehow, I get the following output:
> test <- iris[1:10,]
> test
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
7 4.6 3.4 1.4 0.3 setosa
8 5.0 3.4 1.5 0.2 setosa
9 4.4 2.9 1.4 0.2 setosa
10 4.9 3.1 1.5 0.1 setosa
Above you can see the matrix test contains 10 rows. Yet when running biglm it shows a sample size of 150
> biglm(formula, test)
Large data regression model: biglm(formula, test)
Sample size = 150
Looks like it uses iris instead of test.. how is this possible and how do I get biglm to use chunk1 the way I intend it to?
I suspect the following line is to blame:
formula <- iris$Sepal.Length ~ iris$Sepal.Width
where in the formula you explicitly reference the iris dataset. This will cause R to try and find the iris dataset when lm is called, which it finds in the global environment (because of R's scoping rules).
In a formula you normally do not use vectors, but simply the column names:
formula <- Sepal.Length ~ Sepal.Width
This will ensure that the formula contains only the column (or variable) names, which will be found in the data lm is passed. So, lm will use test in stead of iris.

Defining functions (in rollapply) using lines of a dataframe

First of all, I have a dataframe (lets call it "years") with 5 rows and 10 columns. I need to build a new one doing (x1-x2)/x1, being x1 the first element and x2 the second element of a column in "years", then (x2-x3)/x2 and so forth. I thought rollapply would be the best tool for the task, but I can't figure out how to define such function to insert it in rollapply.
I'm new to R, so I hope my question is not too basic. Anyway, I couldn't find a similar question here so I'd be really thankful if someone could help me.
You can use transform, diff and length, no need to use rollapply
> df <- head(iris,5) # some data
> transform(df, New = c(NA, diff(Sepal.Length)/Sepal.Length[-length(Sepal.Length)] ))
Sepal.Length Sepal.Width Petal.Length Petal.Width Species New
1 5.1 3.5 1.4 0.2 setosa NA
2 4.9 3.0 1.4 0.2 setosa -0.03921569
3 4.7 3.2 1.3 0.2 setosa -0.04081633
4 4.6 3.1 1.5 0.2 setosa -0.02127660
5 5.0 3.6 1.4 0.2 setosa 0.08695652
diff.zoo in the zoo package with the arithmetic=FALSE argument will divide each number by the prior in each column:
library(zoo)
as.data.frame(1 - diff(zoo(DF), arithmetic = FALSE))

Resources