I have two dataframes for men and women working in a company. One is 15000 rows x 1000 columns and the other is 150 x 1000. Each column represents an attribute (e.g., Salary, Height, etc.). I am comparing female and male employees within each Bracket (there are five in total).
Below I created some dummy data and the for loop.
#Create the data
num_of_employee = 100
f <- rep(c("Female"), 15)
m <- rep(c("Male"), 85)
Employee = paste("Employee",seq(1:num_of_employee))
Bracket = sample(seq(1,5,1),num_of_employee, replace = TRUE)
Height = sample(seq(65,100, 1),num_of_employee, replace = TRUE)
Weight = sample(seq(120,220, 1),num_of_employee, replace = TRUE)
Years_Employed = sample(seq(1,13, 1),num_of_employee, replace = TRUE)
Income = sample(seq(50000,200000, 1000),num_of_employee, replace = TRUE)
gender <- sample(append(f,m), replace = FALSE)
df1 = data.frame(Employee, Height, Weight, Years_Employed, Income, Bracket, gender)
women <-df1[df1$gender == 'Female',]
men <- df1[df1$gender == 'Male',]
So that's all the data. Now, this for-loop compares the men and women dataframes column by column. For example, Income from men will be compared to Income from women, and likewise for Height, Years_Employed, etc.
v <-c()
runs <- 1000
for(j in 1:runs){
male_vector <- c()
female_vector <- c()
#loop through each of the 5 Brackets
for(z in 1:5){
#count the women in this bracket (this sets the size of the men's sample).
number_of_rows <- length(which(women$Bracket == z))
#compare attributes of men and women within each bracket.
male_vector <- append(male_vector, men[sample(which(men$Bracket == z), number_of_rows), ]$Height)
female_vector <- append(female_vector, women[which(women$Bracket == z), ]$Height)
}
#Ask, are men and women different?
v <- append(v, sum(male_vector) > sum(female_vector))
}
#How many times are the men>women out of 1000?
as.numeric(sum(v))
[1] 70
So this code works, but I want to compare each column - meaning Height, Weight, Years_Employed and Income.
Edit
I would like to input the two dataframes and the output be the following:
"Height " 0.223
"Salary: " 0.994
"Weight: " 0.006
"Years_Employed:" 0.325
.
.
.
"1000th column :" 0.013
Note, my actual data consists of 1000 columns, so hard-coding anything (the way I originally did it), won't work.
The following is much simpler than your code.
Note that there are loops in disguise, namely split and sapply. But the code is cleaner and it avoids repeating the same computations over and over again.
If you call set.seed(4358) just before running your code the result will be exactly the same as the result of mean(v) in the end of this.
set.seed(4358) # Needed because of the call to sample()
runs <- 1000
v <- logical(runs)
df1_br <- split(df1, df1$Bracket)
df2_br <- split(df2, df2$Bracket)
female_vector <- sapply(df2_br, function(x) sum(x$Income))
sum_female_vector <- sum(female_vector)
number_of_rows <- sapply(df2_br, nrow)
for(j in 1:runs){
male_vector <- sapply(seq_along(df1_br), function(i) sum(sample(df1_br[[i]]$Income, number_of_rows[i], TRUE)))
v[j] <- sum(male_vector) > sum_female_vector
}
mean(v)
#[1] 0.933
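To make the "loops in disguise" remark concrete, here is a minimal sketch of what split and sapply do in the code above. The toy data frame and its values are made up for illustration and are not part of the original data.

```r
# Hypothetical toy data: two brackets with a handful of incomes
toy <- data.frame(Bracket = c(1, 1, 2, 2, 2),
                  Income  = c(10, 20, 30, 40, 50))

# split() returns a named list with one data frame per bracket
by_bracket <- split(toy, toy$Bracket)

# sapply() visits each list element and returns one sum per bracket,
# named after the bracket it came from
sums <- sapply(by_bracket, function(x) sum(x$Income))
sums
```

Because the per-bracket sums for the women never change, computing them once outside the loop (as the answer does) avoids repeating the same work in each of the 1000 runs.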
Sample data.
I have recreated the datasets by first calling set.seed().
set.seed(6736)
num_of_employee = 15000
#Create their attributes
Employee <- paste("Employee", 1:num_of_employee)
Bracket <- sample(1:5, num_of_employee, replace = TRUE)
Height <- sample(65:100, num_of_employee, replace = TRUE)
Weight <- sample(120:220, num_of_employee, replace = TRUE)
Years_Employed <- sample(1:13, num_of_employee, replace = TRUE)
Income <- sample(seq(50000, 200000, 1000), num_of_employee, replace = TRUE)
gender <- sample(c("Female", "Male"), num_of_employee, prob = c(150, 14850)/15000, replace = TRUE)
#Finally make a dataframe for all their data
df1 = data.frame(Employee, Height, Weight, Years_Employed, Income, Bracket, gender)
#Split the dataframe by gender
df2 <- df1[df1$gender == 'Female', ]
df1 <- df1[df1$gender == 'Male', ]
Edit.
To have the code above accept any column, rewrite it as a function.
compareGender <- function(Female, Male, what = "Income", Runs = 1000){
v <- logical(Runs)
Male_br <- split(Male, Male[["Bracket"]])
Female_br <- split(Female, Female[["Bracket"]])
female_vector <- sapply(Female_br, function(x) sum(x[[what]]))
sum_female_vector <- sum(female_vector)
number_of_rows <- sapply(Female_br, nrow)
for(j in seq_len(Runs)){
male_vector <- sapply(seq_along(Male_br), function(i) sum(sample(Male_br[[i]][[what]], number_of_rows[i], TRUE)))
v[j] <- sum(male_vector) > sum_female_vector
}
c(what = mean(v))
}
set.seed(4358) # To compare the result with the result above
compareGender(Female = df2, Male = df1)
#[1] 0.933
compareGender(Female = df2, Male = df1, what = "Height")
#[1] 0.012
compareGender(Female = df2, Male = df1, what = "Years_Employed")
#[1] 0.815
If you want to apply the function to several columns automatically, you can use the *apply functions.
In this case I will sapply the function to columns 2 to 5, or to names(df1)[2:5].
res <- sapply(names(df1)[2:5], function(x) compareGender(df2, df1, x))
names(res) <- sub("\\.what$", "", names(res))
res
#Height Weight Years_Employed Income
#0.012 0.211 0.827 0.948
Now, you can transform this output into a data.frame. There are two ways you can do it. The first creates a df with one column and the names attribute as the row names. The second creates a df with two columns, the original column names and the mean values returned by compareGender.
final1 <- data.frame(Mean = res)
final1
# Mean
#Height 0.012
#Weight 0.211
#Years_Employed 0.827
#Income 0.948
final2 <- data.frame(Variable = names(res), Mean = res)
row.names(final2) <- NULL
final2
# Variable Mean
#1 Height 0.012
#2 Weight 0.211
#3 Years_Employed 0.827
#4 Income 0.948
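With 1000 columns, typing positions such as 2:5 may not be practical. One possible way to pick the columns, sketched here under the assumption that every attribute to compare is numeric and that Bracket is the only numeric column to exclude, is to select them by type. The toy data frame below is hypothetical.

```r
# Hypothetical toy data standing in for the real, much wider data frame
set.seed(1)
n <- 200
toy <- data.frame(
  Employee = paste("Employee", seq_len(n)),
  Height   = sample(65:100, n, replace = TRUE),
  Weight   = sample(120:220, n, replace = TRUE),
  Bracket  = sample(1:5, n, replace = TRUE),
  gender   = sample(c("Female", "Male"), n, replace = TRUE, prob = c(0.2, 0.8))
)

# Flag the numeric columns, then drop the grouping column
is_num <- vapply(toy, is.numeric, logical(1))
cols <- setdiff(names(toy)[is_num], "Bracket")
cols
```

The resulting character vector could then replace names(df1)[2:5] in the sapply call above, for example sapply(cols, function(x) compareGender(df2, df1, x)).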
Intro: Working in R, I often need to reorganize information from lists of data.frames to create a summary table. In this example, I start with a single data.frame, and I show my function that converts key information from the data.frame into a single row. Bearing in mind that my desired output requires the sorting of a mixture of numeric and character data, I can’t help wondering if there is an easier technique to do this kind of thing.
My question: Can anyone provide advice, or better yet a solution, for a simpler technique to convert data.frames like these into rows, while respecting the specific sorting of the data?
#sample data
input_df <- data.frame(M1 = c("Age", "Weight", "Speed", "Range"),
dogs = c(100, 120, 85, 105),
cats = c(115, 89, 80, 111),
birds = c(100, 90, 100, 104))
# desired summary row
desired_row <- data.frame(Model = "M1",
dogs = "Weight (120)",
cats = "Age (115), Range (111)",
birds = "Range (104)",
stringsAsFactors = F)
desired_row$Model <- factor(desired_row$Model)
# my function
makeRow <- function(dat1) {
# get model name
mod <- data.frame(Model = names(dat1[1]))
# make list of variables with model variable
d1 <- setNames(lapply(names(dat1)[-1], function(x) cbind(dat1[1],
dat1[x])), names(dat1)[-1])
# create a sorted named vector, largest-to-smallest
sorted_named_vec <- function(x) {
sort(setNames(x[[2]], x[[1]]), decreasing = T)
}
d2 <- lapply(d1, sorted_named_vec)
# implement a criterion to report only top indexes
keep_tops <- function(x) {
ifelse(x == max(x) | x >= 110 | (x > 102) & ((x -
100)/(max(x) - 100) > 0.33), x, "")
}
d3 <- lapply(d2, keep_tops)
# remove blank character elements
remove_blank_elements <- function(x) {
x[nchar(x) > 0]
}
d4 <- lapply(d3, remove_blank_elements)
# collapse variable name with top values and add parenthesis
collapse_to_string <- function(x) {
paste0(names(x), " (", x, "),", collapse = " ")
}
d5 <- lapply(d4, collapse_to_string)
# remove the last comma
remove_last_comma <- function(x) {
gsub("\\,$", "", x)
}
d6 <- lapply(d5, remove_last_comma)
# construct a row from the list
row <- cbind(mod, as.data.frame(d6, stringsAsFactors = F))
row
}
# call
row_output <- makeRow(dat1 = input_df)
row_output
# check output to desired
identical(desired_row, row_output)
Not sure if it is more efficient, but this is slightly less code and a more direct approach, imo.
makeRow <- function(dat1) {
#make data frame for row with model name
d0 <- data.frame(mod = names(dat1)[1]) #col name changed later
# implement a criterion to report only top indexes -> now return if true or false
keep_tops <- function(x) {
x == max(x) | x >= 110 | (x > 102) & ((x - 100)/(max(x) - 100) > 0.33)
}
vals =c() #empty -> for values of each cols
# make list of variables with model variables(dat1 cols)
#use the columns of the df directly
for(col in 2:ncol(dat1)){
#make temp df with each and evaluate what row to keep in the same line
df = dat1[keep_tops(dat1[,col])==1,c(1,col)]
df[,2] = paste0("(",df[,2],")") #add the () around the numbers
val = apply(as.data.frame(apply(df, 1, paste0, collapse=" ")), 2, paste0, collapse=", ") #collapse rows, then cols
vals = c(vals, val) #add this variable values to the values' list
}
# bind the first col made earlier with these values
row <- cbind(d0, as.data.frame(t(vals), stringsAsFactors = F))
colnames(row) = colnames(dat1) #rename the columns to match
row
}
# call
row_output <- makeRow(dat1 = input_df)
# check output to desired
identical(desired_row$birds, row_output$birds)
With your 'input_df', identical() was TRUE.
I would like to run a custom function that uses specific columns of a dataframe split by groups. Here is my sample data & function code:
my_data = data.frame(N = c(12, 12, 24, 24, 12, 12),
p = rep(c(.125,.125,.025),2),
group = rep(c("dogs","cats"),each=3))
power.sequential <- function(d, nseq, pseq){
decvec <- NULL
nvec <- NULL
for (i in 1:100){
decvec[i] <- 0
nvec[i] <- 0
j <- 1
x <- NULL
while(decvec[i] == 0 & nvec[i] < sum(nseq)){
x <- c(x, rnorm(nseq[j], mean = d))
p <- t.test(x)$p.value
nvec[i] <- nvec[i] + nseq[j]
if (p < pseq[j]) decvec[i] <- 1
j <- j + 1
}
}
power <- mean(decvec == 1)
meanN <- mean(nvec)
return(list("power" = power, "mean_N" = meanN))
}
Now I want to run this function on each group in my dataframe. This is how the function is called normally:
power.sequential(d = .5,
nseq = c(12,12,24),
pseq = c(.125,.125,.025))
The function returns two values, and ideally they would each be saved in a separate column of my dataframe.
And this is my best try, but it gives an error message:
my_data %>% group_by(group) %>%
mutate(result = power.sequential(d=.5,nseq=N,pseq=p))
I probably need to reshape my dataframe so that each group is a single row, but I'm stuck on how to proceed.
Here is my desired output, the function outputs two values (power and meanN), each should get its own column.
group power meanN
dogs .94 20.28
cats .95 27.36
You can do:
library(dplyr)
my_data %>%
group_by(group) %>%
do(data.frame(power.sequential(d=.5,nseq=.$N,pseq=.$p)[c(1, 2)])) %>%
data.frame()
That gives:
group power mean_N
1 cats 0.96 27.24
2 dogs 0.94 21.12
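The same split-apply-combine pattern can also be written in base R. The sketch below is only an illustration: toy_fun is a hypothetical, deterministic stand-in for power.sequential (it also returns a list of two values), used so the result is reproducible.

```r
my_data <- data.frame(N = c(12, 12, 24, 24, 12, 12),
                      p = rep(c(.125, .125, .025), 2),
                      group = rep(c("dogs", "cats"), each = 3))

# Hypothetical stand-in for power.sequential(): returns a list of two values
toy_fun <- function(nseq, pseq) {
  list(first_N = nseq[1], total_N = sum(nseq))
}

# One single-row data frame per group, then stack the rows
pieces <- lapply(split(my_data, my_data$group),
                 function(g) as.data.frame(toy_fun(g$N, g$p)))
out <- do.call(rbind, pieces)

# Move the group labels from the row names into a proper column
out <- cbind(group = rownames(out), out)
rownames(out) <- NULL
out
```

Swapping toy_fun for power.sequential(d = .5, nseq = g$N, pseq = g$p) should give one row per group with a power and a mean_N column.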
The task can be simplified with data.table. One can call the function directly in the 'j' section, and both values will appear as separate columns.
library(data.table)
setDT(my_data)
set.seed(1)
my_data[,power.sequential(0.5, N, p), by=group]
# group power mean_N
# 1: dogs 0.90 24.48
# 2: cats 0.94 27.72
Note: set.seed(1) has been used to keep the result consistent.
Given the American Community Survey dataset from Lock5Data, I want to calculate the t-stat (and its Bonferroni-corrected p-value) of incomes for each combination of races. I'd like to store the results in a dataframe with columns 'race1', 'race2', 'tstat', and 'pval'. This way I can sort the dataframe to show the biggest (or most significant) differences in income.
library(Lock5Data)
data("ACS")
ACS$Sex <- factor(ACS$Sex, labels = c("Female","Male"))
sub_acs <- subset(ACS, select = c("Income","Sex","Race"))
sub_acs <- na.omit(sub_acs)
# form results df (t_df)
race_unique <- unique(sub_acs$Race)
t_df <- expand.grid(race1 = race_unique, race2 = race_unique)
t_df <- t_df[t_df$race1 != t_df$race2,]
rownames(t_df) <- NULL
# fill df col with t-stat
t_df$tstat <- t.test(sub_acs[sub_acs$Race == t_df$race1,]$Income,
sub_acs[sub_acs$Race == t_df$race2,]$Income,
p.adjust.methods='bonferroni')$statistic
# fill df col with p_val
t_df$pval <- t.test(sub_acs[sub_acs$Race == t_df$race1,]$Income,
sub_acs[sub_acs$Race == t_df$race2,]$Income,
p.adjust.methods='bonferroni')$p.value
Unfortunately, the results df t_df appears to only show the first result for each test repeated down all the rows. How do I correctly map the t-statistic and p-value results? Answers that make my current solution more elegant and portable are welcome!
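Consider Map (a wrapper around mapply that always returns a list) and use t_df only as a helper dataframe to pass each pair of values into the function. Note that t.test() has no p.adjust.methods argument (an extra argument like that is silently ignored), so the Bonferroni correction is applied afterwards with p.adjust().
# compare as character so that '<' also works when Race is a factor
t_df <- subset(expand.grid(race1 = as.character(race_unique),
race2 = as.character(race_unique),
stringsAsFactors = FALSE), race1 < race2)
ttest_proc <- function(r1, r2) {
output <- t.test(sub_acs[sub_acs$Race == r1,]$Income,
sub_acs[sub_acs$Race == r2,]$Income)
df <- data.frame(race1 = r1,
race2 = r2,
t_stat = output$statistic,
p_val = output$p.value)
return(df)
}
df_list <- Map(ttest_proc, t_df$race1, t_df$race2)
final_df <- do.call(rbind, df_list)
# Bonferroni-correct the p-values across all pairwise tests
final_df$p_val <- p.adjust(final_df$p_val, method = "bonferroni")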
Are you looking for something like this?
cbn <- t(combn(as.character(race_unique), 2))
pval <- numeric(nrow(cbn))
tstat <- numeric(nrow(cbn))
for(i in seq_len(nrow(cbn))){
a <- subset(sub_acs, Race %in% cbn[i, ])
# Note: t.test() does not adjust p-values; apply p.adjust() to pval afterwards if a correction is needed
tt <- t.test(Income ~ Race, data = a)
pval[i] <- tt$p.value
tstat[i] <- tt$statistic
}
result <- data.frame(race1 = cbn[, 1], race2 = cbn[, 2], p.value = pval, statistic = tstat)
result
# race1 race2 p.value statistic
#1 white black 0.190337465 -1.3173942
#2 white other 0.007776557 -2.7231317
#3 white asian 0.203332407 1.2831045
#4 black other 0.365064209 0.9092968
#5 black asian 0.050391428 1.9782605
#6 other asian 0.005943459 2.8158303
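Since t.test() returns an unadjusted p-value for each pair, the Bonferroni correction the question asks for has to be applied across the whole set of tests. A minimal sketch, assuming the six p-values printed above:

```r
# Unadjusted p-values copied from the printed result above
pval <- c(0.190337465, 0.007776557, 0.203332407,
          0.365064209, 0.050391428, 0.005943459)

# Bonferroni multiplies each p-value by the number of tests and caps it at 1
pval_bonf <- p.adjust(pval, method = "bonferroni")
round(pval_bonf, 4)

# Pairs that remain significant at the 5% level after correction
which(pval_bonf < 0.05)
```

In the loop-based answer this would amount to adding a column such as p.adjust(result$p.value, method = "bonferroni") to the result data frame.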
I have a dataset where a subset of measurements for each entry are randomly missing:
dat <- matrix(runif(100), nrow=10)
rownames(dat) <- letters[1:10]
colnames(dat) <- paste("time", 1:10)
dat[sample(100, 25)] <- NA
I am interested in calculating correlations between each row in this dataset (i.e., a-a, a-b, a-c, a-d, ...). However, I would like to exclude correlations where there are fewer than 5 pairwise non-NA observations by setting their value to NA in the resulting correlation matrix.
Currently I am doing this as follows:
cor <- cor(t(dat), use = 'pairwise.complete.obs')
names <- rownames(dat)
filter <- sapply(names, function(x1) sapply(names, function(x2)
sum(!is.na(dat[x1,]) & !is.na(dat[x2,])) < 5))
cor[filter] <- NA
However, this operation is very slow as the actual dataset contains >1,000 entries.
Is there a way to filter cells based on the number of non-NA pairwise observations in a vectorized manner, instead of within nested loops?
You can count the number of non-NA pairwise observations using a matrix approach.
Let's use this data generation code. I made data larger and added more NAs.
nr = 1000;
nc = 900;
dat = matrix(runif(nr*nc), nrow=nr)
rownames(dat) = paste(1:nr)
colnames(dat) = paste("time", 1:nc)
dat[sample(nr*nc, nr*nc*0.9)] = NA
Then your filter code takes about 85 seconds.
tic = proc.time()
names = rownames(dat)
filter = sapply(names, function(x1) sapply(names, function(x2)
sum(!is.na(dat[x1,]) & !is.na(dat[x2,])) < 5));
toc = proc.time();
show(toc-tic);
# 85.50 seconds
My version creates a matrix with values 1 for non-NAs in the original data. Then using matrix multiplication I calculate number of pairwise non-NAs. It ran in a fraction of a second.
tic = proc.time()
NAmat = matrix(0, nrow = nr, ncol = nc)
NAmat[ !is.na(dat) ] = 1;
filter2 = (tcrossprod(NAmat) < 5)
toc = proc.time();
show(toc-tic);
# 0.09 seconds
Simple check shows the results are the same:
all(filter == filter2)
# TRUE
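To see why the matrix product gives the pairwise counts, here is a small worked sketch on a made-up 3 x 4 matrix. Entry (i, j) of the product sums, over the columns, the product of the two 0/1 indicators, which is 1 only where both rows are observed.

```r
# Hypothetical toy matrix with a few missing values
dat <- rbind(a = c(1, NA, 3, 4),
             b = c(NA, 2, 3, 4),
             c = c(1, 2, NA, NA))

# 1 where a value is present, 0 where it is NA
obs <- (!is.na(dat)) * 1

# counts[i, j] = number of columns where rows i and j are both non-NA
counts <- tcrossprod(obs)
counts

# Logical mask of pairs with fewer than 2 shared observations
too_few <- counts < 2
too_few
```

The diagonal holds the number of non-NA values in each row, and the mask can be used to set entries of the correlation matrix to NA, as in cor[filter] <- NA above.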
I want to apply t-tests on a bunch of variables. Below is some mock data
d <- data.frame(var1=rnorm(10),
var2=rnorm(10),
group=sample(c(0,1), 10, replace=TRUE))
# Is there a way to do this in some sort of loop?
with(d, t.test(var1~group))
with(d, t.test(var2~group))
# I tried this but the loop did not give a result!?
varnames <- c('var1', 'var2')
for (i in 1:2) {
eval(substitute(with(d, t.test(variable~group)),
list(variable=as.name(varnames[i]))))
}
Also, is it possible to extract the values from the t-test's result (e.g. two group means, p-value) so that the loop will produce a neat balance table across the variables? In other words, the end result I want is not a bunch of t-tests upon one another, but a table like this:
Varname mean1 mean2 p-value
Var1 1.1 1.2 0.989
Var2 1.2 1.3 0.912
You can use formula and lapply like this
set.seed(1)
d <- data.frame(var1 = rnorm(10),
var2 = rnorm(10),
group = sample(c(0, 1), 10, replace = TRUE))
varnames <- c("var1", "var2")
formulas <- paste(varnames, "group", sep = " ~ ")
res <- lapply(formulas, function(f) t.test(as.formula(f), data = d))
names(res) <- varnames
If you want to extract your table, you can proceed like this
t(sapply(res, function(x) c(x$estimate, pval = x$p.value)))
mean in group 0 mean in group 1 pval
var1 0.61288 0.012034 0.098055
var2 0.46382 0.195100 0.702365
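As for why the loop in the question showed nothing: values are only auto-printed at the top level, not inside a for loop, so the t-test results were computed and then discarded. A minimal sketch of a loop that stores and prints results, using reformulate() to build each formula; the data here are simulated and the group column is fixed so that both groups have five observations.

```r
set.seed(1)
d <- data.frame(var1 = rnorm(10),
                var2 = rnorm(10),
                group = rep(c(0, 1), each = 5))
varnames <- c("var1", "var2")

# Pre-allocate a named vector to hold one p-value per variable
pvals <- numeric(length(varnames))
names(pvals) <- varnames

for (v in varnames) {
  # reformulate("group", response = v) builds the formula v ~ group
  tt <- t.test(reformulate("group", response = v), data = d)
  pvals[v] <- tt$p.value
  print(tt$estimate)  # explicit print() is needed inside a loop
}
pvals
```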
Here is a reshape2/plyr solution:
The foo function is the workhorse: it runs the t-test and extracts the means and p-value.
d <- data.frame(var1=rnorm(10),
var2=rnorm(10),
group=sample(c(0,1), 10, replace=TRUE))
require(reshape2)
require(plyr)
dfm <- melt(d, id = 'group')
foo <- function(x) {
tt <- t.test(value ~ group, data = x)
out <- data.frame(mean1 = tt$estimate[1], mean2 = tt$estimate[2], P = tt$p.value)
return(out)
}
ddply(dfm, .(variable), .fun=foo)
# variable mean1 mean2 P
#1 var1 -0.2641942 0.3716034 0.4049852
#2 var2 -0.9186919 -0.2749101 0.5949053
Use sapply to apply the t-test to all varnames and extract the necessary data by subsetting "estimate" and "p.value". Check names(with(d, t.test(var1~group))) if you want to extract other information.
t(with(d, sapply(varnames, function(x) unlist(t.test(get(x)~group)[c("estimate", "p.value")]))))