I have a data frame describing a population of particles of given sizes. The first column is the size (the x value) and the remaining columns are the densities (y values) for that size. I need to calculate the median size for each of these columns.
Since median() works on raw observations rather than binned counts, I decided to expand my data into that form: for each row I append the value of the first column to a vector N times, where N is the count in the column of interest. This works, but it is really slow on my 1200-row data frames, so I wonder if there is a more efficient solution.
df <- data.frame(Size = 1:100,
                 val1 = sample(0:9, 100, replace = TRUE),
                 val2 = sample(0:9, 100, replace = TRUE))
get.median <- function(dataset){
  results <- list()
  for(col in colnames(dataset)[2:ncol(dataset)]){
    col.results <- c()
    for(i in 1:nrow(dataset)){
      size <- dataset[i, "Size"]
      count <- dataset[i, col]
      out <- rep(size, count)
      col.results <- c(col.results, out)
    }
    med <- median(col.results)
    results <- append(results, med)
  }
  return(results)
}
get.median(df)
Without transforming:
lapply(df[,2:3], function(y) median(rep(df$Size, times = y)))
$val1
[1] 49
$val2
[1] 47
data:
set.seed(99)
df <- data.frame(Size = 1:100,
                 val1 = sample(0:9, 100, replace = TRUE),
                 val2 = sample(0:9, 100, replace = TRUE))
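To check the speed difference on your own data, you can time both approaches. This is only a sketch, assuming the microbenchmark package is installed (it is not part of the answer above):
library(microbenchmark)
microbenchmark(
  loop       = get.median(df),
  vectorised = lapply(df[, 2:3], function(y) median(rep(df$Size, times = y))),
  times      = 20
)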
If you only need the plain (unweighted) median of each column, you can use sapply with median like this:
sapply(df, median)
Output:
Size val1 val2
50.5 6.0 3.5
from "spatstat" library with dplyr::across
> df %>% summarize(across(-Size, ~weighted.median(Size,.x,na.rm = TRUE)))
val1 val2
1 42.5 47.5
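If you prefer to avoid the spatstat dependency, a weighted median can also be sketched in base R without materialising the rep() vectors at all. Note this is only a sketch: it returns the lower weighted median, which can differ from median(rep(...)) by half a step when the cumulative weight lands exactly on 0.5, and it does no NA handling:
weighted_median <- function(x, w) {
  o <- order(x)
  x <- x[o]
  w <- w[o]
  cw <- cumsum(w) / sum(w)          # cumulative weight share
  x[which(cw >= 0.5)[1]]            # first size reaching half the total weight
}
sapply(df[-1], function(y) weighted_median(df$Size, y))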
I'd like to compute the trimmed mean for each trimming proportion alpha, and then see which trimming proportion gives the minimal variance of the trimmed means when bootstrap simulations of size N = 200 are applied. The problem is that when I try to create a data frame with column1 = mean and column2 = variance, my code produces each mean/variance pair as a separate data frame, so I cannot look up the trimming proportion and trimmed mean that have the minimal variance.
The function prints "data.frame" 9 times, I guess because the alpha argument is vectorized. The code:
tmean_var <- function(n, N, alpha){
  set.seed(1)
  data <- rnorm(n)
  data_aug1 <- c(data, -data)
  data_aug2 <- c(data, 2 * median(data) - data)
  est <- data.frame()
  tmean <- replicate(N, {
    sample <- base::mean(sample(x = data[(round(alpha*n)+1):(n-round(alpha*n))],
                                size = n - 2*round(alpha*n), replace = TRUE))
  })
  mean <- base::mean(tmean)
  var <- var(tmean) * (n - 2*round(alpha*n))
  df <- data.frame(mean = mean, var = var)
  class(df)
}
f <- Vectorize(tmean_var, vectorize.args = "alpha")
f(n,N,alpha)
How could I make the output be a single data frame instead of nine?
This should do it. Rather than trying to use Vectorize() on a function that doesn't inherently take vector arguments, you can use lapply() and sapply() across the values of alpha you provide, as below:
tmean_var <- function(n, N, alpha){
  set.seed(1)
  data <- rnorm(n)
  data_aug1 <- c(data, -data)
  data_aug2 <- c(data, 2 * median(data) - data)
  est <- data.frame()
  tmean <- lapply(alpha, function(a){
    replicate(N, {
      sample <- base::mean(sample(x = data[(round(a*n)+1):(n-round(a*n))],
                                  size = n - 2*round(a*n), replace = TRUE))
    })
  })
  mean <- sapply(tmean, mean)
  var <- sapply(seq_along(tmean), function(i) var(tmean[[i]]) * (n - 2*round(alpha[i]*n)))
  df <- data.frame(mean = mean, var = var, alpha = alpha)
  # class(df)
}
out <- tmean_var(100, 200, c(.1, .2, .3))
out
#> mean var alpha
#> 1 0.10555709 0.8066377 0.1
#> 2 0.06868891 0.8331401 0.2
#> 3 0.21791984 0.9024612 0.3
Created on 2022-05-13 by the reprex package (v2.0.1)
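For completeness, Vectorize() itself can also be made to return one data frame once tmean_var() returns the data frame (as the version above effectively does, since the class(df) call is commented out): mapply's simplification just has to be turned off and the per-alpha pieces bound afterwards. A sketch of that variant:
f <- Vectorize(tmean_var, vectorize.args = "alpha", SIMPLIFY = FALSE)
out_list <- f(100, 200, c(.1, .2, .3))   # a list of one-row data frames
out <- do.call(rbind, out_list)          # one data frame, one row per alpha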
I have two data frames for the men and women working in a company. One is 15000 rows x 1000 columns and the other is 150 x 1000. Each column represents an attribute (e.g., Salary, Height, etc.). I am comparing female and male employees within each Bracket (there are five in total).
Below I created some dummy data and the for loop.
#Create the data
num_of_employee = 100
f <- rep(c("Female"), 15)
m <- rep(c("Male"), 85)
Employee = paste("Employee",seq(1:num_of_employee))
Bracket = sample(seq(1,5,1),num_of_employee, replace = TRUE)
Height = sample(seq(65,100, 1),num_of_employee, replace = TRUE)
Weight = sample(seq(120,220, 1),num_of_employee, replace = TRUE)
Years_Employed = sample(seq(1,13, 1),num_of_employee, replace = TRUE)
Income = sample(seq(50000,200000, 1000),num_of_employee, replace = TRUE)
gender <- sample(append(f,m), replace = FALSE)
df1 = data.frame(Employee, Height, Weight, Years_Employed, Income, Bracket, gender)
women <-df1[df1$gender == 'Female',]
men <- df1[df1$gender == 'Male',]
So that's all the data. Now, this for loop essentially compares the men and women data frames column by column. So, for example, Income of the men is compared to Income of the women, and likewise for Height, Years_Employed, etc.
v <- c()
runs <- 1000
for(j in 1:runs){
  male_vector <- c()
  female_vector <- c()
  # loop through each of the 5 Brackets
  for(z in 1:5){
    # number of women in this bracket
    number_of_rows <- length(which(women$Bracket == z))
    # compare attributes of men and women within each bracket
    male_vector <- append(male_vector, men[sample(which(men$Bracket == z), number_of_rows), ]$Height)
    female_vector <- append(female_vector, women[which(women$Bracket == z), ]$Height)
  }
  # Ask: are men and women different?
  v <- append(v, sum(male_vector) > sum(female_vector))
}
#How many times are the men>women out of 1000?
as.numeric(sum(v))
[1] 70
So this code works, but I want to compare every column, meaning Height, Weight, Years_Employed and Income.
Edit
I would like to pass in the two data frames and get output like the following:
"Height " 0.223
"Salary: " 0.994
"Weight: " 0.006
"Years_Employed:" 0.325
.
.
.
"1000th column :" 0.013
Note that my actual data has 1000 columns, so hard-coding anything (the way I originally did it) won't work.
The following is much simpler than your code.
Note that there are loops in disguise, namely split and sapply. But the code is cleaner and it avoids repeating the same computations over and over again.
If you call set.seed(4358) just before running your code, its result will be exactly the same as the result of mean(v) at the end of this answer.
set.seed(4358) # Needed because of the call to sample()
runs <- 1000
v <- logical(runs)
df1_br <- split(df1, df1$Bracket)
df2_br <- split(df2, df2$Bracket)
female_vector <- sapply(df2_br, function(x) sum(x$Income))
sum_female_vector <- sum(female_vector)
number_of_rows <- sapply(df2_br, nrow)
for(j in 1:runs){
  male_vector <- sapply(seq_along(df1_br), function(i) sum(sample(df1_br[[i]]$Income, number_of_rows[i], TRUE)))
  v[j] <- sum(male_vector) > sum_female_vector
}
mean(v)
#[1] 0.933
Sample data.
I have recreated the datasets by first calling set.seed().
set.seed(6736)
num_of_employee = 15000
#Create their attributes
Employee <- paste("Employee", 1:num_of_employee)
Bracket <- sample(1:5, num_of_employee, replace = TRUE)
Height <- sample(65:100, num_of_employee, replace = TRUE)
Weight <- sample(120:220, num_of_employee, replace = TRUE)
Years_Employed <- sample(1:13, num_of_employee, replace = TRUE)
Income <- sample(seq(50000, 200000, 1000), num_of_employee, replace = TRUE)
gender <- sample(c("Female", "Male"), num_of_employee, prob = c(150, 14850)/15000, replace = TRUE)
#Finally make a dataframe for all their data
df1 = data.frame(Employee, Height, Weight, Years_Employed, Income, Bracket, gender)
#Split the dataframe by gender
df2 <- df1[df1$gender == 'Female', ]
df1 <- df1[df1$gender == 'Male', ]
Edit.
To have the code above accept any column, rewrite it as a function.
compareGender <- function(Female, Male, what = "Income", Runs = 1000){
  v <- logical(Runs)
  Male_br <- split(Male, Male[["Bracket"]])
  Female_br <- split(Female, Female[["Bracket"]])
  female_vector <- sapply(Female_br, function(x) sum(x[[what]]))
  sum_female_vector <- sum(female_vector)
  number_of_rows <- sapply(Female_br, nrow)
  for(j in seq_len(Runs)){
    male_vector <- sapply(seq_along(Male_br), function(i) sum(sample(Male_br[[i]][[what]], number_of_rows[i], TRUE)))
    v[j] <- sum(male_vector) > sum_female_vector
  }
  c(what = mean(v))
}
set.seed(4358) # To compare the result with the result above
compareGender(Female = df2, Male = df1)
#[1] 0.933
compareGender(Female = df2, Male = df1, what = "Height")
#[1] 0.012
compareGender(Female = df2, Male = df1, what = "Years_Employed")
#[1] 0.815
If you want to apply the function to several columns automatically, you can use the *apply functions.
In this case I will sapply the function to columns 2 to 5, or to names(df1)[2:5].
res <- sapply(names(df1)[2:5], function(x) compareGender(df2, df1, x))
names(res) <- sub("\\.what$", "", names(res))
res
#Height Weight Years_Employed Income
#0.012 0.211 0.827 0.948
Now, you can transform this output into a data.frame. There are two ways you can do it. The first creates a df with one column and the names attribute as the row names. The second creates a df with two columns, the original column names and the mean values returned by compareGender.
final1 <- data.frame(Mean = res)
final1
# Mean
#Height 0.012
#Weight 0.211
#Years_Employed 0.827
#Income 0.948
final2 <- data.frame(Variable = names(res), Mean = res)
row.names(final2) <- NULL
final2
# Variable Mean
#1 Height 0.012
#2 Weight 0.211
#3 Years_Employed 0.827
#4 Income 0.948
I would like to bootstrap confidence intervals for a proportion from a data.frame, with a separate result for each level of one of my columns.
I have managed to perform the bootstrap for a vector but do not know how to scale it up to a data.frame from here.
Here is a simplified example that sets a threshold value of 10 and looks at the proportion of values less than 10 in the data.
Vector solution:
library(boot)
vec <- abs(rnorm(1000)*10) #generate example vector
data_to_tb <- vec
tb <- function(data) {
  sum(data < 10, na.rm = FALSE)/length(data)  # function for generating the proportion
}
tb(data_to_tb)
boot.out <- boot(data = data_to_tb, function(u,i) tb(u[i]), R = 999)
quantile(boot.out$t, c(.025,.975))
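As a side note (not part of the question), boot.ci() from the same boot package can compute the percentile interval directly from the boot object, which should closely match the quantile() call above:
boot.ci(boot.out, conf = 0.95, type = "perc")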
And from here I would like to do the same for a data.frame containing two columns.
I would like to return the result in a "summarized" data.frame if possible, with columns (x, n, proportion, CI):
x n proportion CI
A xx xx xx
B xx xx xx
C xx xx xx
It would be extra good if the dplyr package could be used.
Here is a simplified example of my data:
Example:
dataframe <- data.frame(x = sample(c("A","B","C"),100,replace = TRUE), vec =abs(rnorm(100)*10))
head(dataframe)
## x vec
## 1 B 0.06735163
## 2 C 0.48612358
## 3 B 2.34190635
## 4 C 0.36393262
## 5 A 7.99762969
## 6 B 1.43293330
You can use group_by and summarise from dplyr to achieve the desired result. See below for the code.
# load required package
require(dplyr)
# function to calculate the confidence interval
CIfun <- function(v, probs = c(.025, .975)) {
  quantile(boot(data = v, function(u, i) tb(u[i]), R = 999)$t, probs)
}
# using summarise from dplyr
dataframe %>%
  group_by(x) %>%
  summarise(n = n(),
            proportion = tb(vec),
            `2.5%` = CIfun(vec, .025),
            `97.5%` = CIfun(vec, .975))
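If you prefer a single CI column, as in the layout sketched in the question, the two quantiles can be pasted into one string. This is just a variation on the code above, with the same assumptions:
dataframe %>%
  group_by(x) %>%
  summarise(n = n(),
            proportion = tb(vec),
            CI = paste0("[", paste(round(CIfun(vec), 3), collapse = ", "), "]"))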
I have a dataframe:
a <- c(1:5)
b <- c(5:9)
c <- c(7:11)
min <- c(2,7,4,5,3)
max <- c(5,9,12,8,7)
df1 <- data.frame(a,b,c,min,max)
df1
How can I clamp the rows of columns a, b and c, so that each value is bounded below by that row's min and above by that row's max?
Thank you for your help.
I find this easier if you think of the data frame as a list of columns:
mins <- lapply(df1[1:3], function(x) pmax(x, df1$min))
maxs <- lapply(mins, function(x) pmin(x, df1$max))
Then you can rebuild the result (do.call(cbind, ...) returns a matrix; wrap it in data.frame() if you need a data frame):
df2 <- do.call(cbind, maxs)
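The two lapply() steps can also be collapsed into a single pass. A minimal sketch of the same pmin/pmax idea (using a new name so df2 above is not overwritten):
clamped <- data.frame(lapply(df1[1:3], function(x) pmin(pmax(x, df1$min), df1$max)),
                      min = df1$min, max = df1$max)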
f <- function(X){
  X[X < df1$min] <- df1$min[X < df1$min]
  X[X > df1$max] <- df1$max[X > df1$max]
  X
}
sapply(df1[, 1:3], f)
Try something like:
## Function that checks the limits of each column
limit <- function(x, mn, mx) {
  pmax(pmin(x, mx), mn)
}
## Then build your data.frame
df2 <- data.frame(a = limit(a, min, max), b = limit(b, min, max), c = limit(c, min, max), min, max)
Using apply on an existing data.frame is another option:
df3 <- apply(df1[ ,1:3], 2, limit, mn = min, mx = max)
I hope it helps,
Alex
Pretty much this (note that it is pmin, for "parallel minimum", and likewise pmax):
for(j in 1:3) df1[, j] <- pmin(df1[, j], df1[, 5])
for(j in 1:3) df1[, j] <- pmax(df1[, j], df1[, 4])
PS you're probably better off with a matrix
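Following that PS, the same clamp works directly on a matrix, since pmin()/pmax() recycle the min/max vectors row by row down each column; just a sketch:
m <- as.matrix(df1[1:3])
m <- pmin(pmax(m, df1$min), df1$max)  # min/max are recycled per row of each column
m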
I want to apply t-tests to a bunch of variables. Below is some mock data:
d <- data.frame(var1 = rnorm(10),
                var2 = rnorm(10),
                group = sample(c(0, 1), 10, replace = TRUE))
# Is there a way to do this in some sort of loop?
with(d, t.test(var1 ~ group))
with(d, t.test(var2 ~ group))
# I tried this but the loop did not give a result!?
varnames <- c('var1', 'var2')
for (i in 1:2) {
  eval(substitute(with(d, t.test(variable ~ group)),
                  list(variable = as.name(varnames[i]))))
}
Also, is it possible to extract values from the t-test results (e.g. the two group means and the p-value) so that the loop produces a neat balance table across the variables? In other words, the end result I want is not a pile of t-test printouts but a table like this:
Varname mean1 mean2 p-value
Var1 1.1 1.2 0.989
Var2 1.2 1.3 0.912
You can build the formulas as strings and use lapply like this:
set.seed(1)
d <- data.frame(var1 = rnorm(10),
                var2 = rnorm(10),
                group = sample(c(0, 1), 10, replace = TRUE))
varnames <- c("var1", "var2")
formulas <- paste(varnames, "group", sep = " ~ ")
res <- lapply(formulas, function(f) t.test(as.formula(f), data = d))
names(res) <- varnames
If you want to extract your table, you can proceed like this
t(sapply(res, function(x) c(x$estimate, pval = x$p.value)))
mean in group 0 mean in group 1 pval
var1 0.61288 0.012034 0.098055
var2 0.46382 0.195100 0.702365
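If you want the exact layout from the question (a Varname column rather than row names), the matrix above can be wrapped into a data frame. A small sketch:
tab <- t(sapply(res, function(x) c(x$estimate, pval = x$p.value)))
data.frame(Varname = rownames(tab), tab, row.names = NULL, check.names = FALSE)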
Here is a reshape2/plyr solution:
The foo function is the workhorse: it runs the t-test and extracts the means and the p-value.
d <- data.frame(var1 = rnorm(10),
                var2 = rnorm(10),
                group = sample(c(0, 1), 10, replace = TRUE))
require(reshape2)
require(plyr)
dfm <- melt(d, id = 'group')
foo <- function(x) {
  tt <- t.test(value ~ group, data = x)
  out <- data.frame(mean1 = tt$estimate[1], mean2 = tt$estimate[2], P = tt$p.value)
  return(out)
}
ddply(dfm, .(variable), .fun=foo)
# variable mean1 mean2 P
#1 var1 -0.2641942 0.3716034 0.4049852
#2 var2 -0.9186919 -0.2749101 0.5949053
Use sapply to apply the t-test to all varnames and extract what you need by subsetting "estimate" and "p.value". Check names(with(d, t.test(var1 ~ group))) if you want to extract other information:
t(with(d, sapply(varnames, function(x) unlist(t.test(get(x)~group)[c("estimate", "p.value")]))))
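The same idea without get()/with(), indexing the columns directly; just an equivalent sketch:
t(sapply(varnames, function(x)
  unlist(t.test(d[[x]] ~ d$group)[c("estimate", "p.value")])))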