Conducting a t-test with a grouping variable - r

Getting started on an assignment with R, and I haven't really worked with it before, so apologies if this is basic.
brain is a data frame imported from Excel. Its format is as follows (for 40-odd rows):
para1 para2 para3 para4 para5 para6 para7
FF 133 132 124 118 64.5 816932
highVAL = ifelse(brain$para2>=130,1, 0)
highVAL gives me a vector of 1's and 0's, categorized by para2.
I'm looking to perform a t-test on the mean of para7 between two sets: rows with para2 >= 130 and rows with para2 < 130.
In Python, I would construct two new arrays, append values into them, and perform a t-test there. I'm not sure how I would go about it in R.

You're closer than you think! Your highVAL variable should be added as a new column to the brain data frame:
brain$highVAL <- brain$para2 >= 130
This adds a TRUE/FALSE column to the data frame. Then you can run the test using t.test's formula interface:
result <- t.test(para7 ~ highVAL, data = brain)
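The result is a standard htest object, so you can inspect its components directly, for example:
result           # prints the full Welch two-sample t-test
result$p.value   # just the p-value
result$estimate  # mean of para7 in each group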

Related

Sorting data by type in R

I am struggling to write a function for a dataset that looks like this:
identifier age occupation
pers1 18 student
pers2 45 teacher
pers3 65 retired
What I am trying to do is write a function that will:
1. sort my variables into numerical vs. factor variables
2. for the numerical variables, give me the mean, min and max
3. for the factor variables, give me a frequency table
4. return points (2) and (3) in a "nice" format (data frame, vector or table)
So far, I have tried this:
describe <- function(x) {
  if (is.numeric(x)) {
    data.frame(mean = mean(x), min = min(x), max = max(x))
  } else {
    table(x)
  }
}
stats <- lapply(data, describe)
Problems:
My problem is that now, "stats" is a list that is difficult to read and to export to Excel or share. I don't know how to make the list "stats" more reader-friendly.
Alternatively, maybe is there a better way to build the function "describe"?
Any thoughts on how to solve these two problems are much appreciated!
I may be late to the party, but maybe you still need a solution. I combined the answers from some of the comments on your post into the following code. It assumes you only have numerical columns and factors, and it scales to a large number of columns, as you specified:
# Just some sample data for this example; you don't need ggplot2 otherwise.
library(ggplot2)
data <- diamonds
# Find which columns are numeric and which are not.
# (is.numeric is safer than comparing class(), which can return more
# than one value, e.g. for ordered factors.)
numeric <- which(sapply(data, is.numeric))
non_numeric <- which(!sapply(data, is.numeric))
# Create the summary objects.
summ_numeric <- summary(data[, numeric])
summ_non_numeric <- summary(data[, non_numeric])
# result is easily written to csv
write.csv(summ_non_numeric,file="test.csv")
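If you also want the numeric summaries in a single exportable table, one hedged option is to bind the one-row data frames produced by the describe function from the question (the file name is just an example):
num_cols <- Filter(is.numeric, data)
stats_df <- do.call(rbind, lapply(num_cols, describe))
write.csv(stats_df, file = "numeric_summary.csv")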
Hope this helps.
The desired functionality is already available elsewhere, so if you are not interested in coding it yourself you can use that instead. The Publish package can be used to generate a table for presentation in a paper. It is not on CRAN, but you can install it from GitHub:
devtools::install_github('tagteam/Publish')
library(Publish)
library(isdals) # Get some data
data(fev)
fev$Smoke <- factor(fev$Smoke, levels=0:1, labels=c("No", "Yes"))
fev$Gender <- factor(fev$Gender, levels=0:1, labels=c("Girl", "Boy"))
univariateTable can generate a publication-ready table presenting the data. By default, univariateTable computes the mean and standard deviation for numeric variables, and the distribution of observations across categories for factors. These values can be computed and compared across groups. The main input to univariateTable is a formula whose right-hand side lists the variables to be included in the table, while the left-hand side, if present, specifies a grouping variable.
univariateTable(Smoke ~ Age + Ht + FEV + Gender, data=fev)
This produces the following output
Variable Level No (n=589) Yes (n=65) Total (n=654) p-value
1 Age mean (sd) 9.5 (2.7) 13.5 (2.3) 9.9 (3.0) <1e-04
2 Ht mean (sd) 60.6 (5.7) 66.0 (3.2) 61.1 (5.7) <1e-04
3 FEV mean (sd) 2.6 (0.9) 3.3 (0.7) 2.6 (0.9) <1e-04
4 Gender Girl 279 (47.4) 39 (60.0) 318 (48.6)
5 Boy 310 (52.6) 26 (40.0) 336 (51.4) 0.0714

Performing a Specific Function for One Column For The First 12 Rows?

This is easy, but for some reason I'm having trouble with it. I have a set of Data like this:
File Trait Temp Value Rep
PB Mortality 16 52.2 54
PB Mortality 17 21.9 91
PB Mortality 18 15.3 50
...
And it goes on like that for 36 rows. What I need to do is divide the Value column by 100 in only the first 12 rows. I did:
NewData <- Data[1:12,4]/100
to try and create a new data frame without changing the old one. When I do this, it divides the fourth column but saves only that column (rows 1-12) as a standalone value in the Global Environment, not as a data frame with the rest of the rows/columns from the original set. Overall, I'm trying to fit NewData into an nls function, so I need the modified rows saved with the rest of the data, not as a separate value. Is there a way to modify the first 12 rows without R saving the result as a separate value?
Consider copying the data frame and then updating the column at the selected rows:
NewData <- Data
NewData$Value[1:12] <- NewData$Value[1:12] / 100
# NewData[1:12, 4] <- NewData[1:12, 4] / 100   # alternate equivalent
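A quick sanity check, assuming Data has the 36 rows described, that only the intended rows changed:
all.equal(NewData$Value[1:12], Data$Value[1:12] / 100)  # TRUE
identical(NewData$Value[13:36], Data$Value[13:36])      # TRUE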
library(dplyr)
newdata <- Data[1:12, ] %>% mutate(newV = Value / 100)
newdata$Value <- newdata$newV
newdata <- newdata %>% select(-newV)
then you can do
full_data <- rbind(newdata, Data[13:36, ])
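Alternatively, a single-pipeline sketch that avoids splitting and re-binding (row_number() and if_else() are standard dplyr functions; this assumes the row order is meaningful):
full_data <- Data %>%
  mutate(Value = if_else(row_number() <= 12, Value / 100, Value))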

How to Bootstrap Resample Count Data in R

I have a vector of counts which I want to resample with replacement in R:
X350277 128
X193233 301
X514940 3715
X535375 760
X953855 50
X357046 236
X196664 460
X589071 898
X583656 670
X583117 1614
(Note the second column is counts, the first column is the object the counts represent)
From reading various documentation it seems easy to resample data where each row or column represents a single observation. But how do I do this when each row represents multiple observations summed together (as in a table of counts)?
You can use weighted sampling (as user20650 also mentioned in the comments):
sample_weights <- dat$count/sum(dat$count)
mysample <- dat[sample(1:nrow(dat),1000,replace=T,prob=sample_weights),]
A less efficient approach, which might have its uses depending on what you want to do, is to expand your data back into 'long' format:
dat_large <- dat[rep(1:nrow(dat),dat$count),]
#then sampling is easy
mysample <- dat_large[sample(1:nrow(dat_large),1000,replace=T),]
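For a self-contained example, here is a hedged sketch that reconstructs the posted counts as a data frame (the names dat, id and count are assumptions, not from the original post) and turns one resample back into a table of counts:
dat <- data.frame(
  id = c("X350277", "X193233", "X514940", "X535375", "X953855",
         "X357046", "X196664", "X589071", "X583656", "X583117"),
  count = c(128, 301, 3715, 760, 50, 236, 460, 898, 670, 1614)
)
set.seed(1)  # for reproducibility
sample_weights <- dat$count / sum(dat$count)
# Draw as many observations as the original total, then re-tabulate:
draws <- sample(dat$id, size = sum(dat$count), replace = TRUE, prob = sample_weights)
boot_counts <- table(draws)  # one bootstrap replicate of the count vector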

Identifying duplicate columns in a dataframe

I'm an R newbie attempting to remove duplicate columns from a largish data frame (50K rows, 215 columns). The frame has a mix of discrete, continuous and categorical variables.
My approach has been to build a table for each column of the frame, collect the tables in a list, and then use the duplicated() function to find entries of the list that are duplicates, as follows:
age=18:29
height=c(76.1,77,78.1,78.2,78.8,79.7,79.9,81.1,81.2,81.8,82.8,83.5)
gender=c("M","F","M","M","F","F","M","M","F","M","F","M")
testframe = data.frame(age=age,height=height,height2=height,gender=gender,gender2=gender)
tables=apply(testframe,2,table)
dups=which(duplicated(tables))
testframe <- subset(testframe, select = -c(dups))
This isn't very efficient, especially for large continuous variables. However, I've gone down this route because I've been unable to get the same result using summary (note, the following assumes an original testframe containing duplicates):
summaries=apply(testframe,2,summary)
dups=which(duplicated(summaries))
testframe <- subset(testframe, select = -c(dups))
If you run that code you'll see it only removes the first duplicate found. I presume this is because I am doing something wrong. Can anyone point out where I am going wrong or, even better, point me in the direction of a better way to remove duplicate columns from a dataframe?
How about:
testframe[!duplicated(as.list(testframe))]
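as.list() turns the data frame into a list of column vectors, so duplicated() compares whole columns, and the negation keeps the first occurrence of each distinct one. A quick check with the testframe from the question:
names(testframe[!duplicated(as.list(testframe))])
# [1] "age"    "height" "gender"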
You can do it with lapply:
testframe[!duplicated(lapply(testframe, summary))]
summary summarizes the distribution while ignoring the order, so it is not 100% reliable. If the data is huge, I would use digest instead:
library(digest)
testframe[!duplicated(lapply(testframe, digest))]
A nice trick that you can use is to transpose your data frame and then check for duplicates.
duplicated(t(testframe))
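To actually drop the duplicates with this approach, use the flags to subset the columns (t() coerces a mixed data frame to a character matrix, which is why this also catches the duplicated factor column):
testframe[!duplicated(t(testframe))]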
unique(testframe, MARGIN=2)
does not work, though I think it should, so try
as.data.frame(unique(as.matrix(testframe), MARGIN=2))
or if you are worried about numbers turning into factors,
testframe[,colnames(unique(as.matrix(testframe), MARGIN=2))]
which produces
age height gender
1 18 76.1 M
2 19 77.0 F
3 20 78.1 M
4 21 78.2 M
5 22 78.8 F
6 23 79.7 F
7 24 79.9 M
8 25 81.1 M
9 26 81.2 F
10 27 81.8 M
11 28 82.8 F
12 29 83.5 M
It is probably best for you to first find the duplicate column names and treat them accordingly (for example summing the two, taking the mean, first, last, second, mode, etc.). To find the duplicate columns:
names(df)[duplicated(names(df))]
What about just:
unique.matrix(testframe, MARGIN=2)
Actually, you would just need to invert the duplicated() result in your code and could stick to using subset, which is more readable than bracket notation IMHO:
require(dplyr)
iris %>% subset(., select=which(!duplicated(names(.))))
Here is a simple command that would work if the duplicated columns of your data frame had the same names:
testframe[names(testframe)[!duplicated(names(testframe))]]
If the problem is that dataframes have been merged one time too many using, for example:
testframe2 <- merge(testframe, testframe, by = c('age'))
It is also good to remove the .x suffix from the column names. I applied it here on top of Mostafa Rezaei's great answer:
testframe2 <- testframe2[!duplicated(as.list(testframe2))]
names(testframe2) <- gsub('\\.x$', '', names(testframe2))
Since this Q&A is a popular Google search result, but the answers above are a bit slow for a large matrix, I propose a new version using exponential search and data.table power.
This is a function I implemented in the dataPreparation package.
The function is dataPreparation::which_are_in_double:
which_are_in_double(testframe)
This returns 3 and 4, the columns that are duplicated in your example.
Build a data set with the desired dimensions for performance tests:
age=18:29
height=c(76.1,77,78.1,78.2,78.8,79.7,79.9,81.1,81.2,81.8,82.8,83.5)
gender=c("M","F","M","M","F","F","M","M","F","M","F","M")
testframe = data.frame(age=age,height=height,height2=height,gender=gender,gender2=gender)
for (i in 1:12) {
  testframe <- rbind(testframe, testframe)
}
# Results in 49152 rows
for (i in 1:5) {
  testframe <- cbind(testframe, testframe)
}
# Results in 160 columns
The benchmark
To perform the benchmark, I use the rbenchmark library, which replicates each computation 100 times:
library(rbenchmark)
library(digest)
library(dataPreparation)

benchmark(
  which_are_in_double(testframe, verbose = FALSE),
  duplicated(lapply(testframe, summary)),
  duplicated(lapply(testframe, digest))
)
test replications elapsed
3 duplicated(lapply(testframe, digest)) 100 39.505
2 duplicated(lapply(testframe, summary)) 100 20.412
1 which_are_in_double(testframe, verbose = FALSE) 100 13.581
So which_are_in_double is 1.5 to 3 times faster than the other proposed solutions.
NB 1: I excluded from the benchmark the solution testframe[, colnames(unique(as.matrix(testframe), MARGIN = 2))] because it was already 10 times slower with 12k rows.
NB 2: Please note that the way this data set is constructed produces a lot of duplicated columns, which reduces the advantage of exponential search. With just a few duplicated columns, which_are_in_double would perform much better, while the other methods would perform about the same.

How to sum data.frame column values?

I have a data frame with several columns, some numeric and some character. How do I compute the sum of a specific column? I've googled this and see numerous functions (sum, cumsum, rowsum, rowSums, colSums, aggregate, apply), but I can't make sense of it all.
For example suppose I have a data frame people with the following columns
people <- read.table(
text =
"Name Height Weight
Mary 65 110
John 70 200
Jane 64 115",
header = TRUE
)
…
How do I get the sum of all the weights?
You can just use sum(people$Weight).
sum sums up a vector, and people$Weight retrieves the weight column from your data frame.
Note - you can get built-in help by using ?sum, ?colSums, etc. (by the way, colSums will give you the sum for each column).
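For example, a quick sketch using the people frame above (colSums needs all-numeric input, so select the numeric columns first):
colSums(people[, c("Height", "Weight")])
# Height Weight
#    199    425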
To sum values in a data.frame you first need to extract them as a vector.
There are several ways to do it:
# $ operator
x <- people$Weight
x
# [1] 110 200 115
Or using [, ], similar to a matrix:
x <- people[, 'Weight']
x
# [1] 110 200 115
Once you have the vector you can use any vector-to-scalar function to aggregate the result:
sum(people[, 'Weight'])
# [1] 425
If you have NA values in your data, you should specify the na.rm parameter:
sum(people[, 'Weight'], na.rm = TRUE)
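A quick illustration of why na.rm matters:
weights <- c(110, NA, 115)
sum(weights)                # NA (any NA in the input makes the sum NA)
sum(weights, na.rm = TRUE)  # 225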
You can use the tidyverse package to solve it; to me the result reads more clearly:
library(tidyverse)
people %>%
  summarise(sum(Weight, na.rm = TRUE))
If the column contains 'NA' values or was read in as character, convert it to numeric and use na.rm:
sum(as.numeric(JuneData1$Account.Balance), na.rm = TRUE)
To order the columns by their sums (numeric columns only, since colSums fails on character data):
order(colSums(people[, -1]), decreasing = TRUE)
If you have 20+ columns:
order(colSums(people[, 5:25]), decreasing = TRUE)  # in case you want to keep the first 4 columns out
