I have a data frame 'math.numeric' with 32 variables. Each row represents a student and each variable is an attribute. The students have been put into 5 groups based on their final grade.
The data looks as follows:
head(math.numeric)
school sex age address famsize Pstatus Medu Fedu Mjob Fjob reason ... group
1 1 18 2 1 1 4 4 1 5 1 2
1 1 17 2 1 2 1 1 1 3 1 2
1 1 15 2 2 2 1 1 1 3 3 3
1 1 15 2 1 2 4 2 2 4 2 4
1 1 16 2 1 2 3 3 3 3 2 3
1 2 16 2 2 2 4 3 4 3 4 4
I am performing t-tests on each variable for group 1 vs. all the other groups to identify significantly different attributes with this group. I am looking to pull out the p-values for each test such as:
t.test(subset(math.numeric$school, math.numeric$group == 1),
subset(math.numeric$school, math.numeric$group != 1))$p.value
t.test(subset(math.numeric$sex, math.numeric$group == 1),
subset(math.numeric$sex, math.numeric$group != 1))$p.value
t.test(subset(math.numeric$age, math.numeric$group == 1),
subset(math.numeric$age, math.numeric$group != 1))$p.value
I have been trying to figure out how I can create a loop to do this instead of writing out each test one at a time. I have tried a for loop, and lapply, but so far I have not had any luck.
I am fairly new to this, so any help would be appreciated.
Courtney
Your example data is not sufficient to actually carry out t-tests on all subgroups. For that reason, I use the iris dataset, which contains three species of plants: setosa, versicolor, and virginica. These are my groups, so you will have to adjust the code to your own data. Below I show how to test one group versus all other groups combined, one group versus each other group, and all combinations of individual groups.
One group versus all other groups combined:
First, let's say I want to compare Versicolor and Virginica to Setosa, i.e. Setosa is my group 1 to which all other groups should be compared. An easy way to achieve what you want is the following:
sapply(names(iris)[-ncol(iris)], function(x){
t.test(iris[iris$Species=="setosa", x],
iris[iris$Species!="setosa", x])$p.value
})
Sepal.Length Sepal.Width Petal.Length Petal.Width
7.709331e-32 1.035396e-13 1.746188e-69 1.347804e-60
Here, I have supplied the names of the variables in the dataset, names(iris), excluding the column that holds the grouping variable via [-ncol(iris)] (it is the last column), as a vector to sapply, which passes each name in turn as the argument to the function that I have defined.
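Applied to a data frame shaped like the one in the question, the same pattern might look as follows. This is only a sketch: math.sim is made-up data standing in for math.numeric, and it assumes the grouping column is called group. vapply is used instead of sapply simply so that the result is guaranteed to be a numeric vector.

```r
# Simulated stand-in for math.numeric (hypothetical data, not the real set)
set.seed(1)
math.sim <- data.frame(
  age   = sample(15:19, 100, replace = TRUE),
  Medu  = sample(0:4,   100, replace = TRUE),
  Fedu  = sample(0:4,   100, replace = TRUE),
  group = rep(1:5, each = 20)
)

# Every column except the grouping column
vars <- setdiff(names(math.sim), "group")

# p-value of group 1 versus all other groups, one test per variable
pvals <- vapply(vars, function(v) {
  t.test(math.sim[math.sim$group == 1, v],
         math.sim[math.sim$group != 1, v])$p.value
}, numeric(1))

pvals
```

Selecting the columns by name with setdiff means the grouping column does not have to be the last one.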
One group versus each of the other groups:
In case you want to make groupwise comparisons for all groups, the following may be helpful: First, create a dataframe of all group x variable combinations that you are going to do, excluding the grouping variable itself and the reference group, of course. This can be achieved by:
comps <- expand.grid(unique(iris$Species)[-1], # excluding Setosa as reference group
names(iris)[-ncol(iris)] # excluding group column
)
head(comps)
Var1 Var2
1 versicolor Sepal.Length
2 virginica Sepal.Length
3 versicolor Sepal.Width
4 virginica Sepal.Width
5 versicolor Petal.Length
6 virginica Petal.Length
Here, Var1 holds the species and Var2 the variables to be compared. The reference group (group 1, i.e. setosa) is implicit in this case. Now I can use apply to run the tests. Each row of comps is passed to the function as a vector with two elements: the first indicates which group's turn it is, and the second indicates which variable should be compared. These are used to subset the original dataframe.
comps$pval <- apply(comps, 1, function(x) {
t.test(iris[iris$Species=="setosa", x[2]], iris[iris$Species==x[1], x[2]])$p.value
} )
where group 1 aka Setosa is hard-coded in the function. This gives me a dataframe with p-values for all combinations (with Setosa as reference group) so that they are easy to look up:
head(comps)
Var1 Var2 pval
1 versicolor Sepal.Length 3.746743e-17
2 virginica Sepal.Length 3.966867e-25
3 versicolor Sepal.Width 2.484228e-15
4 virginica Sepal.Width 4.570771e-09
5 versicolor Petal.Length 9.934433e-46
6 virginica Petal.Length 9.269628e-50
All combinations of groups:
You can expand the above easily to produce a dataframe that contains p-values of t-tests for each combination of groups. One approach would be:
comps <- expand.grid(unique(iris$Species), unique(iris$Species), names(iris)[-ncol(iris)])
This now has three columns. The first two are the groups, and the third the variable:
head(comps)
Var1 Var2 Var3
1 setosa setosa Sepal.Length
2 versicolor setosa Sepal.Length
3 virginica setosa Sepal.Length
4 setosa versicolor Sepal.Length
5 versicolor versicolor Sepal.Length
6 virginica versicolor Sepal.Length
You can use this to carry out the tests:
comps$pval <- apply(comps, 1, function(x) {
t.test(iris[iris$Species==x[1], x[3]], iris[iris$Species==x[2], x[3]])$p.value
} )
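The grid above compares each group with itself and lists every pair in both orders. If that is not wanted, one possible variation is to build the pairs with combn, which returns each unordered pair once. This is a sketch only; the names pairs, vars and comps2 are mine and not part of the answer above.

```r
# Unique unordered pairs of species (3 pairs instead of 9 ordered combinations)
pairs <- t(combn(levels(iris$Species), 2))
vars  <- names(iris)[-ncol(iris)]

# One row per (pair, variable) combination
comps2 <- data.frame(
  g1  = rep(pairs[, 1], times = length(vars)),
  g2  = rep(pairs[, 2], times = length(vars)),
  var = rep(vars, each = nrow(pairs)),
  stringsAsFactors = FALSE
)

# Run one t-test per row and keep only the p-value
comps2$pval <- mapply(function(g1, g2, v) {
  t.test(iris[iris$Species == g1, v],
         iris[iris$Species == g2, v])$p.value
}, comps2$g1, comps2$g2, comps2$var, USE.NAMES = FALSE)

comps2
```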
I get an error message: what should I do?
t.test may throw an error if the sample size is too small or if the values are constant for one group. This is problematic since it might only occur for specific groups, and you may not know in advance which ones. Yet the error will abort the entire call to apply, and you will not be able to see any results.
A way to circumvent this and to identify the problematic groups is to wrap t.test in dplyr::failwith (see also ?tryCatch; note that failwith has been deprecated in later dplyr releases in favour of purrr::possibly). To show how this works, consider the following:
smalln <- data.frame(a=1, b=2)
t.test(smalln$a, smalln$b)
> Error in t.test.default(smalln$a, smalln$b) : not enough 'x' observations
failproof.t <- dplyr::failwith(default="Some default of your liking", t.test, quiet = TRUE)
failproof.t(smalln$a, smalln$b)
[1] "Some default of your liking"
That way, whenever t.test would throw an error, you get a character string as the result instead, and the computation continues with the other groups. You could also set default to a number, or anything else; it does not have to be a character string.
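The same idea can be written in base R with tryCatch, which avoids the dependency on dplyr. A minimal sketch; safe_pval is a hypothetical helper name, and returning NA is just one possible choice of default.

```r
# Return the p-value, or NA if t.test() signals an error
safe_pval <- function(x, y) {
  tryCatch(t.test(x, y)$p.value,
           error = function(e) NA_real_)
}

safe_pval(1, 2)                                  # too few observations, so NA
safe_pval(iris$Sepal.Length, iris$Sepal.Width)   # a regular p-value
```

Because the fallback is a numeric NA, the results of many calls still combine into a plain numeric vector, and the problematic groups can be found afterwards with is.na.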
Statistical disclaimer:
Having said all of this, note that conducting several t-tests is not necessarily good statistical practice. You may want to adjust your p-values to account for multiple testing, or you may want to use alternative test procedures that conduct joint tests.
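As an illustration of the adjustment step, base R's p.adjust takes a vector of p-values and a correction method. The sketch below reuses the setosa-versus-rest tests on iris; which method is appropriate depends on your analysis, and "holm" and "BH" are shown only as examples.

```r
# Raw p-values: setosa versus all other species, one test per variable
pvals <- sapply(names(iris)[-ncol(iris)], function(x) {
  t.test(iris[iris$Species == "setosa", x],
         iris[iris$Species != "setosa", x])$p.value
})

# Holm correction (controls the family-wise error rate)
p_holm <- p.adjust(pvals, method = "holm")

# Benjamini-Hochberg correction (controls the false discovery rate)
p_bh <- p.adjust(pvals, method = "BH")

cbind(raw = pvals, holm = p_holm, BH = p_bh)
```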
How's this?
pvals <- numeric() #the vector of p values
k <- 1 #in case you choose to use a subset not continuing from 1
# "for(i in seq(1,5))" is just doing the pvalues for the first 5 columns. You could do a
# list, like "c(1,2,4)" (in place of "seq(1,5)"), to do tests for columns 1, 2, and 4.
# To do all of the columns, try "for(i in seq(1,(ncol(math.numeric)-1)))".
for(i in seq(1,5)){
# using your code to grab the p-values and store them in the kth element of "pvals"
pvals[k] <- t.test(subset(math.numeric[,i], math.numeric$group == 1),
subset(math.numeric[,i], math.numeric$group != 1))$p.value
#iterating the "pvals" vector entry counter
k=k+1
}
pvals #printing the p values for each test
Consider splitting the data frame by group and using mapply() across the columns. The output is a compiled list of each test's results: statistic, parameter, p-value, confidence interval, etc.
# FILTER ROWS AND KEEP ALL COLUMNS EXCEPT THE LAST (GROUP) COLUMN; df IS YOUR DATA FRAME
# NOTE THE PARENTHESES: 1:ncol(df)-1 IS PARSED AS (1:ncol(df))-1
group1df <- df[df$group==1, 1:(ncol(df)-1)]
othgroupdf <- df[df$group!=1, 1:(ncol(df)-1)]
# T-TEST FCT
tfct <- function(v1, v2){
t.test(v1, v2)
}
# RUN T-TESTS BY COL, SAVE RESULTS TO LIST
ttests <- mapply(tfct, group1df, othgroupdf)
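If only the p-values are needed, they can be pulled out of the test objects afterwards. The sketch below uses iris in place of df and passes SIMPLIFY = FALSE so that mapply returns a plain list of test results; that argument and the names g1, rest and tests are my own additions, not part of the code above.

```r
# Split into reference group and the rest, dropping the grouping column
g1   <- iris[iris$Species == "setosa", -ncol(iris)]
rest <- iris[iris$Species != "setosa", -ncol(iris)]

# One t-test per column pair; SIMPLIFY = FALSE keeps a list of "htest" objects
tests <- mapply(t.test, g1, rest, SIMPLIFY = FALSE)

# Extract the p-value from each test
pvals <- sapply(tests, function(tt) tt$p.value)
pvals
```

Any other component, such as conf.int or statistic, can be extracted the same way.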
Related
I have a large data set with over 2000 observations. The data involves toxin concentrations in animal tissue. My response variable is myRESULT and I have multiple observations per ANALYTE of interest. I need to remove the outliers, as defined by numbers more than three SD away from the mean, from within each ANALYTE group.
While I realize that I should not remove outliers from a dataset normally, I would still like to know how to do it in R.
Here is a small portion of what my data look like:
It's subsetting by group, which can be done in different ways. With dplyr, you use group_by to set the grouping, then filter to subset rows, passing it an expression that returns TRUE for rows to keep and FALSE for outliers.
For example, using iris and 2 standard deviations (everything is within 3):
library(dplyr)
iris_clean <- iris %>%
group_by(Species) %>%
filter(abs(Petal.Length - mean(Petal.Length)) < 2*sd(Petal.Length))
iris_clean %>% count()
#> # A tibble: 3 x 2
#> # Groups: Species [3]
#> Species n
#> <fct> <int>
#> 1 setosa 46
#> 2 versicolor 47
#> 3 virginica 47
With a split-apply-combine approach in base R,
do.call(rbind, lapply(
split(iris, iris$Species),
function(x) x[abs(x$Petal.Length - mean(x$Petal.Length)) < 2*sd(x$Petal.Length), ]
))
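For the three-SD rule from the question, another base R option is ave, which computes a per-group statistic and returns it aligned with the original rows. The data below are simulated: only the column names ANALYTE and myRESULT come from the question, and the values and planted outliers are made up for illustration.

```r
# Hypothetical toxin data with one planted outlier per analyte
set.seed(42)
tox <- data.frame(
  ANALYTE  = rep(c("A", "B"), each = 50),
  myRESULT = c(rnorm(50, 10, 1), rnorm(50, 100, 5))
)
tox$myRESULT[c(1, 51)] <- c(1000, -1000)

# Distance from the group mean in units of the group SD, row by row
z <- ave(tox$myRESULT, tox$ANALYTE,
         FUN = function(v) abs(v - mean(v)) / sd(v))

# Keep rows within three SDs of their own group's mean
tox_clean <- tox[z <= 3, ]
nrow(tox_clean)
```

Note that the mean and SD are computed with the outliers still included, so a single pass may leave less extreme outliers in place.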
For someone new to R, what is the best way to view the range for a number of variables? I've run the summary command on the entire dataset. Can I run range() on the entire dataset as well, or do I need to create variables for each variable in the dataset?
For an individual variable, you can use range. To see the range of multiple variables, you can combine range with one of the apply functions. See below for an example.
range(iris$Sepal.Length)
# [1] 4.3 7.9
sapply(iris[, 1:4], range)
# Sepal.Length Sepal.Width Petal.Length Petal.Width
#[1,] 4.3 2.0 1.0 0.1
#[2,] 7.9 4.4 6.9 2.5
(only the first four columns were selected from iris since the 5th is a factor, and range is not defined for factors)
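When the positions of the numeric columns are not known in advance, they can be selected by type rather than by index. A small sketch of that variation; num_cols is just an illustrative name.

```r
# Logical vector: TRUE for numeric columns, FALSE for factors etc.
num_cols <- sapply(iris, is.numeric)

# Range of every numeric column, whatever its position
ranges <- sapply(iris[num_cols], range)
ranges
```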
More a curiosity than a question: is it possible to perform an operation on only specific columns of a data frame while keeping the data frame's original structure?
For example, suppose I want simply to add 1 to the first 4 columns of the iris dataset because the 5th column is a factor and it is nonsense to add values to it.
1. ignoring the factor column
just perform the operation and ignore the warning message
ex <- iris[,] + 1
head(ex, 2)
#gives
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 6.1 4.5 2.4 1.2 NA
2 5.9 4.0 2.4 1.2 NA
so the 5th column loses its original values because of the nonsensical operation.
2. excluding the last column
excluding the index of the column from the operation
ex <- iris[,-c(5)] + 1
head(ex, 2)
Sepal.Length Sepal.Width Petal.Length Petal.Width
1 6.1 4.5 2.4 1.2
2 5.9 4.0 2.4 1.2
but doing so I have to perform a cbind operation to recover the original column (not a big deal with this data frame).
I was wondering if there is a smarter solution for this operation. Imagine the data frame is very big: with cbind one loses the original position of the columns, and restoring it could be quite tricky.
Thanks to all
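One possible approach, sketched here under the assumption that "specific columns" means the numeric ones, is to assign back into just those columns. The data frame keeps its column order and the factor column is left untouched, so no cbind is needed.

```r
ex <- iris

# TRUE for the columns the operation should apply to
num <- sapply(ex, is.numeric)

# Replace only those columns; everything else stays where it is
ex[num] <- ex[num] + 1

head(ex, 2)
```

The logical vector num could equally be a vector of column names or positions chosen by hand.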
I am trying to create a specialized summary 'matrix' for my supervisor, and would like R to export it in a clean, readable form. As such, I am creating it from scratch basically, to tailor it to our project. My problem is I can't figure out how to get a created data frame to behave like an imported one, specifically headers.
I am most comfortable dealing with imported data frames with headers, and calling specific columns by name instead of by column number:
iris$Sepal.Length
with(iris,Sepal.Length)
iris['Sepal.Length']
Now, if I want to create a data frame (or matrix, I'm not entirely sure what the difference is), I have tried the following:
groups<-c("Group 1", "Group 2")
factors<-c("Fac 1", "Fac 2", "Fac 3","Fac 4", "Fac 5")
x<-1:10
y<-11:20
z<-21-30 # evaluates to -9, a single value; 21:30 was presumably intended
data<-cbind(groups, factors, x, y, z)
names(data) #returns NULL
data$x #clearly doesn't return the column 'x' since the matrix 'data' has no names
data<-data.frame(cbind(groups, factors, x, y, z))
names(data) #confirms that there are header names
So, I have created a data frame that has the columns x, y and z, but in reality I don't have a premade column to start off with. If I knew how many rows of data there would be I could simply do:
data<-data.frame(1:10)
data$x<-x
data$y<-y
data$z<-z
I tried creating an empty data frame, but it is one element big, and if I try to append a vector to it (of any length greater than 1), I get an error:
data<-data.frame(0)
data$x<-x #returns an error
My best guess at what to do is to pass through the data once to find out how many rows of data I will have (there are several factor levels, and the summary matrix will have a row for each possible combination of factors). Then I can get the data frame started with a simple:
data<-data.frame(length(n)) #where n would be how many rows of data I would have
And follow through by creating individual vectors for each summary statistic I want and appending each one to the data frame with $.
Another solution I tried to play with was creating a matrix and filling in each element as I calculate it within a loop. I know the apply family is better than a loop, but to make my summary table tailored to my needs I would need to run an apply function then try to pull the individual data:
means<-with(iris,tapply(iris[,4],Species,mean))
means[1] #This returns the species and the mean petal width. What I need is the numeric part of this, as I will have my own headers, or possibly a separate summary table for each species.
I'm not sure if extracting the numerical information from the apply output is better or any easier than simply constructing my own loop to calculate the required statistics. It would be a nested loop that would first sort by group (2 runs), then an internal loop that would run by factors (5 runs), for a total of 10 runs through the data. I was thinking of creating an empty matrix and simply saving the data in the appropriate cell when it is calculated. My problem, again, is calling a specific row in a matrix. I have tried:
m<-matrix(0,ncol=5)
m[1,1]<-'Groups'
m[1,2]<-'Factors'
m[1,3]<-'Mean.x'
m[1,4]<-'Mean.y'
m[1,5]<-'Mean.z'
names(m) #Returns NULL
My desired output would look like:
Groups Factors Mean.x Mean.y Mean.z
Group 1 Fac 1
Group 1 Fac 2
Group 1 Fac 3
Etc, for all combinations of groups and factors.
You can use ddply from plyr package for that: assume your original data frame is mydata and your new data frame where you store the result is newdata:
library(plyr)
newdata<-ddply(mydata,.(Groups,Factors),summarize,mean.x=mean(x),mean.y=mean(y),mean.z=mean(z))
Example: mydata<-iris
> newdata<-ddply(mydata,.(Species),colwise(mean))
> newdata
Species Sepal.Length Sepal.Width Petal.Length Petal.Width
1 setosa 5.006 3.428 1.462 0.246
2 versicolor 5.936 2.770 4.260 1.326
3 virginica 6.588 2.974 5.552 2.026
I think this is what you're looking for, but I'm a little confused by your question in general. This basically will give you a pivot table of the means in each column x,y, and z grouped by the columns 'groups' and 'factors'
aggregate(.~groups+factors, data=data, FUN="mean")
groups factors x y z
1 Group 1 Fac 1 1 1 1
2 Group 2 Fac 1 7 6 1
3 Group 1 Fac 2 8 7 1
4 Group 2 Fac 2 3 2 1
5 Group 1 Fac 3 4 3 1
6 Group 2 Fac 3 9 8 1
7 Group 1 Fac 4 10 9 1
8 Group 2 Fac 4 5 4 1
9 Group 1 Fac 5 6 5 1
10 Group 2 Fac 5 2 10 1
or with the iris data grouped by Species:
aggregate(.~Species, data=iris, FUN="mean")
Species Sepal.Length Sepal.Width Petal.Length Petal.Width
1 setosa 5.006 3.428 1.462 0.246
2 versicolor 5.936 2.770 4.260 1.326
3 virginica 6.588 2.974 5.552 2.026
UPDATE: To calculate the mean of only certain columns, you can either pass only the appropriate columns of your dataset to the aggregate function (perhaps by calling subset) or modify the formula like this:
aggregate(cbind(Sepal.Length,Sepal.Width)~Species, data=iris, FUN="mean")
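One thing to watch with the data from the question: data.frame(cbind(...)) first builds a character matrix, so x, y and z stop being numeric (they become factors or character strings, depending on the R version). A sketch that builds the data frame directly instead, assuming z was meant to be 21:30, and then aggregates by both grouping columns; dat and res are illustrative names.

```r
groups  <- c("Group 1", "Group 2")
factors <- c("Fac 1", "Fac 2", "Fac 3", "Fac 4", "Fac 5")

# data.frame() recycles the short vectors and keeps x, y, z numeric
dat <- data.frame(groups, factors, x = 1:10, y = 11:20, z = 21:30)
str(dat)

# Mean of x, y and z for every groups/factors combination
res <- aggregate(cbind(x, y, z) ~ groups + factors, data = dat, FUN = mean)
res
```

The result is an ordinary data frame, so columns can be accessed by name, e.g. res$x.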
I am not entirely sure if that's what you are looking for but there are several options to add “stuff” to data frames:
To add a variable, just type data$newname <- NA (no need to know the length of the data frame or pass a vector, all rows will be filled with NA)
To append data use rbind (the data you are adding should be another data frame with the same variables)
To fix your example, first create an empty data frame and append data as it comes:
data <- data.frame(x=numeric())
data <- rbind(data, data.frame(x))
The previous example had only one variable (x) but you can also define a data frame with several variables and no rows:
data <- data.frame(x=numeric(),
y=numeric(),
a=character(),
b=factor(levels=c("Factor 1", "Factor 2")))
You don't need to know how many rows you will have, but the data you are adding needs to have the same structure. If that's not the case, you need to create columns with missing values in both data frames as needed, e.g.
data1 <- data.frame(x=1:10, y=1)
data2 <- data.frame(y=2, z=100:110)
rbind(data1, data2) # Error
data1$z <- NA
data2$x <- NA
rbind(data1, data2) # Now it works
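When the rows are produced one at a time in a loop, a common alternative to calling rbind at every step is to collect the pieces in a list and bind them once at the end. A sketch using iris; the column names Mean.Sepal.Length and Mean.Petal.Width are made up for the example.

```r
rows <- list()

for (sp in levels(iris$Species)) {
  part <- iris[iris$Species == sp, ]
  # One single-row data frame per group, stored under the group's name
  rows[[sp]] <- data.frame(Species           = sp,
                           Mean.Sepal.Length = mean(part$Sepal.Length),
                           Mean.Petal.Width  = mean(part$Petal.Width))
}

# Bind all rows in one go and drop the list names used as row names
summary_df <- do.call(rbind, rows)
rownames(summary_df) <- NULL
summary_df
```

This way the number of rows never has to be known in advance, and every piece is guaranteed to have the same columns.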