I'm looking for a simpler way to aggregate and calculate percentages of a numerical variable using data.table.
The code below produces the desired result; my question is whether there is a better way to get it. I'm not very familiar with the package, so any tips would be useful.
I'd like to have the following columns:
   second_factor_variable third_factor_variable factor_variable       porc porcentaje
1:                   HIGH                     C           > 200 0.04456544        4 %
2:                    LOW                     A        51 - 100 0.31739130       32 %
3:                    LOW                     A       101 - 200 0.68260870       68 %
4:                    LOW                     A         26 - 50 0.00000000        0 %
Here porc is the numerical percentage and porcentaje is the rounded percentage, to be used as a label in a ggplot call.
library("ggplot2")
library("scales")
library("data.table")
### Generate some data
set.seed(123)
df <- data.frame(x = rnorm(10000, mean = 100, sd = 50))
df <- subset(df, x > 0)
df$factor_variable <- cut(df$x, right = TRUE,
breaks = c(0, 25, 50, 100, 200, 100000),
labels = c("0 - 25", "26 - 50", "51 - 100", "101 - 200", "> 200")
)
df$second_factor_variable <- cut(df$x, right = TRUE,
breaks = c(0, 100, 100000),
labels = c("LOW", "HIGH")
)
df$third_factor_variable <- cut(df$x, right = TRUE,
breaks = c(0, 50, 100, 100000),
labels = c("A", "B","C")
)
str(df)
### Aggregate
DT <- data.table(df)
dt = DT[, list(factor_variable = unique(DT$factor_variable),
porc = as.numeric(table(factor_variable)/length(factor_variable)),
porcentaje = paste( round( as.numeric(table(factor_variable)/length(factor_variable), 0 ) * 100 ), "%")
), by="second_factor_variable,third_factor_variable"]
EDIT
I've tried agstudy's solution, grouping by just one variable, and I believe it didn't produce the labels (the porcentaje column) correctly. In the real dataset I ran into a similar issue, and I can't spot what's wrong with this function.
grp <- function(factor_variable) {
porc = as.numeric(table(factor_variable)/length(factor_variable))
list(factor_variable = factor_variable[1],
porc =porc,
porcentaje = paste( round( porc, 0 ) * 100 , "%"))
}
DT[, grp(factor_variable) , by="second_factor_variable"]
The numerical values are correct
DT2 <- DT[DT$second_factor_variable %in% "LOW"]
table(DT2$factor_variable)/length(DT2$factor_variable)
I believe the same problem appears if I group by two factor variables:
DT[, grp(factor_variable) , by="second_factor_variable,third_factor_variable"]
Two changes: factor out the porc computation into its own variable, and don't reference DT when computing factor_variable.
DT[, { porc = as.numeric(table(factor_variable)/length(factor_variable))
list(factor_variable = factor_variable[1],
porc =porc,
porcentaje = paste( round( porc, 0 ) * 100 , "%"))
}
, by="second_factor_variable,third_factor_variable"]
In the real dataset, for some reason, the previous function wasn't working well: the values and labels weren't right.
I've changed it just a tiny bit and it worked. I'm adding it as a separate answer so it's easier to find later.
This keeps the factor levels in the order in which they were created. The previous version kept them in the order in which they appeared in the DT, which didn't work well for the particular data I was working with.
The labels (the porcentaje column) work now.
factor_variable = levels(factor_variable)
grp2 <- function(factor_variable) {
porc = as.numeric(table(factor_variable)/length(factor_variable))
list(factor_variable = levels(factor_variable),
porc = porc,
porcentaje = paste( round( as.numeric(table(factor_variable)/length(factor_variable), 0 ) * 100 ), "%")
)
}
DT[, grp2(factor_variable) , by="second_factor_variable"]
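For comparison, here is a minimal sketch of the same aggregation done with counts (.N) instead of table(). The toy data set and the names toy and res are hypothetical stand-ins, not part of the original post. Note two things: unlike table(), this approach drops factor levels that have zero rows in a group, and the percentage is scaled to 0-100 before rounding, because round(porc, 0) * 100 can only ever give 0 or 100.

```r
library(data.table)

# Small hypothetical data set, standing in for the simulated DT above
set.seed(123)
x <- rnorm(1000, mean = 100, sd = 50)
x <- x[x > 0]
toy <- data.table(
  factor_variable = cut(x, breaks = c(0, 25, 50, 100, 200, 1e5),
                        labels = c("0 - 25", "26 - 50", "51 - 100", "101 - 200", "> 200")),
  second_factor_variable = cut(x, breaks = c(0, 100, 1e5), labels = c("LOW", "HIGH"))
)

# Count rows per (group, level) combination
res <- toy[, .N, by = .(second_factor_variable, factor_variable)]
# Turn the counts into shares within each group
res[, porc := N / sum(N), by = second_factor_variable]
# Scale to 0-100 first, then round, then build the label
res[, porcentaje := paste(round(porc * 100), "%")]
res
```

Because each row of res already carries its own factor level, there is no need to line up levels(factor_variable) with the output of table() by position.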
Related
I'm trying to generate multiple plots with ggmosaic using a for loop (or map) but I'm not able to pull out the correct title names or x-axis names.
This is an example of the data frame:
set.seed(42) ## for sake of reproducibility
n <- 10
dat <- data.frame(balance=factor(paste("DM", 1:n)),
credit_history=sample(c("repaid", "critical"), 10, replace = TRUE),
purpose=sample(c("yes", "no"), 10, replace = TRUE),
employment_rate=sample(c("0-1 yrs", "1-4 yrs", ">4 yrs"), 10, replace = TRUE),
personal_status=sample(c("married", "single"), 10, replace=TRUE),
other_debtors=sample(c("guarantor", "none"), 10, replace= TRUE),
default=sample(c("yes", "no"), 10, replace = TRUE))
library(ggmosaic)
# create a list of variables
c_names <- dat[ , c("balance", "credit_history", "purpose", "employment_rate",
"personal_status", "other_debtors", "default")]
for ( col in c_names ) {
s<- ggplot(data = dat) +
geom_mosaic(aes(x=product(default, col), fill = default)) +
ggtitle(paste("DEFAULT", col, sep = " "))
print(s)
}
Can someone give some advice?
This is probably how I would do it. Normally if you're trying to pass strings to ggplot aesthetics, you would use aes_string() and then pass all the aesthetic arguments as string values rather than as unquoted values. However, this doesn't seem to work with the product() function. The alternative I proposed below is to create a temporary data object each time where the variable on the x-axis is always x and then everything works. The title can incorporate the string without a problem.
c_names <- dat[ , c("balance", "credit_history", "purpose", "employment_rate",
"personal_status", "other_debtors", "default")]
for ( cn in colnames(c_names)[1:6]) {
tmp <- data.frame(
default =dat$default,
x = dat[[cn]]
)
s<- ggplot(data = tmp) +
geom_mosaic(aes(x=product(default, x), fill = default)) +
ggtitle(paste("DEFAULT", cn, sep = " "))
print(s)
}
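As a side note on why the loop in the question could not produce the right titles, the following base-R sketch may help. It assumes a hypothetical two-column data frame called toy: looping over a data frame yields each column's values, whereas looping over its names yields the strings needed for titles and axis labels.

```r
# Hypothetical two-column data frame, for illustration only
toy <- data.frame(a = c("x", "y"), b = c("u", "v"))

# Looping over the data frame itself yields each column's values
for (col in toy) print(col)

# Looping over the names yields the column names, which work as titles;
# toy[[cn]] would fetch the matching values
titles <- vapply(names(toy), function(cn) paste("DEFAULT", cn), character(1))
print(unname(titles))
```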
Here is a solution slightly different from that of @DaveArmstrong.
c_names <- c("balance", "credit_history", "purpose", "employment_rate",
"personal_status", "other_debtors")
for ( col in c_names ) {
df <- dat[, c(col, "default")]
names(df)[1] <- "y"
s <- ggplot(data = df) +
geom_mosaic(aes(x=product(default, y), fill = default)) +
ggtitle(paste("DEFAULT", col, sep = " ")) +
labs(x=col)
dev.new()
print(s)
}
I have a dataframe with 2 columns namely p1 and p2. I need to split the p1 column into a range of values like 10-50, 50-100, 100-150, etc. After splitting the values of p1, the corresponding values of p2 should be printed. The sample input is given below.
df = data.frame(p1 = c(10,20,70,80,150,200),p2 = c(1000, 1111.7, 15522.1, 15729.3,18033.8,19358.2))
The sample output is attached below.
When I try this on a large dataset, the p2 values get mixed up with p1.
One way of doing it:
library(dplyr)
df %>%
mutate(
p1 = cut(p1, breaks = 0:(max(p1) %/% 50 + 1) * 50, include.lowest = TRUE)
) %>%
group_by(p1) %>%
summarise(p2 = list(p2))
Maybe this?
setNames(
aggregate(
p2 ~ cut(p1, c(10, 50, 100, 150, 200), include.lowest = TRUE),
df,
c
), names(df)
)
gives
p1 p2
1 [10,50] 1000.0, 1111.7
2 (50,100] 15522.1, 15729.3
3 (100,150] 18033.8
4 (150,200] 19358.2
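If a plain named list is enough, a base-R sketch along the same lines could combine cut() with split(). The names bins and by_bin are introduced here for illustration only; the breaks are the ones used in the aggregate() call above.

```r
df <- data.frame(p1 = c(10, 20, 70, 80, 150, 200),
                 p2 = c(1000, 1111.7, 15522.1, 15729.3, 18033.8, 19358.2))

# Assign each p1 value to an interval; include.lowest keeps p1 == 10 in the first bin
bins <- cut(df$p1, breaks = c(10, 50, 100, 150, 200), include.lowest = TRUE)

# Split p2 by the interval of the p1 value on the same row
by_bin <- split(df$p2, bins)
by_bin
```

Since the bins are computed from p1 row by row and only p2 is split, the two columns cannot get mixed up.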
I am trying to assess the combined uncertainties related to different input parameters using a Markov Chain Monte Carlo method in R. In other words, using the uncertainty parameters reported in the input data documentation, I try to generate a distribution for each of the datasets by drawing 1000 random values from the chosen distribution (normal or truncated normal).
However, I don't know how to make this faster with the purrr::map() functions without exhausting the RAM. The dataset has 2 million rows and 80 columns.
Here is a simplified example:
library(tidyverse); library(truncnorm);library(data.table); library(dtplyr)
n <- 1000 # number of simulations
n_obs <- 10000 # number of observations. Does not work if e.g. 50000
Create a data.frame
dt <- data.frame(
var1 = runif(n_obs, 0, 100),
var2_low = runif(n_obs, 0, 1),
var2_mean = runif(n_obs, 0, 5),
var2_up = runif(n_obs, 0, 10)
)
Convert to lazy data table to speed things up
dt1 <- dt %>% as.data.table() %>%
lazy_dt(., immutable = FALSE)
Simulate
dt_sim <- dt1 %>%
mutate(mean_val = rep(1, nrow(.)), # just row of 1
var1_rnorm = map(.x = mean_val,~rnorm(n, mean = .x, sd = 0.10)), # normal distribution with given sd
sim_var1 = map2(.x = var1, .y = var1_rnorm, ~(.x*.y))) %>% # multiply the data with simulated distribution
# add truncated normal distribution for each row (var2)
mutate(sim_var2 = pmap(.,~ rtruncnorm(n,
a = dt$var2_low,
b = dt$var2_up,
mean =dt$var2_mean))) %>%
# multiply simulated variables sim_var1 and sim_var2
mutate(sim_vars_multiplied =
pmap(list(x = .$sim_var1,
y = .$sim_var2),
function(x,y) (x*y))) %>%
# derive coefficient of variation
mutate(var_mean =map(.x = sim_vars_multiplied, ~ mean(.x, na.rm = TRUE)),
var_sd = map(.x = sim_vars_multiplied, ~ sd(.x, na.rm = TRUE)),
var_cv = unlist(var_sd) / unlist(var_mean)) %>%
# select only the variables needed
dplyr::select(var_cv)
# collect the results
sim_results <- dt_sim %>% as.data.table()
This may help
library(data.table)
#n <- 1000 # number of simulations
n_obs <- 10000
dt <- data.table(
var1 = runif(n_obs, 0, 100),
var2_low = runif(n_obs, 0, 1),
var2_mean = runif(n_obs, 0, 5),
var2_up = runif(n_obs, 0, 10)
)
dt[, mean_val := rep(1,.N)]
dt[, var1_rnorm := rnorm(.N, mean = mean_val, sd = 0.10)]
dt[, sim_var1 := var1 * var1_rnorm]
dt[, sim_var2 := truncnorm::rtruncnorm(.N, a = var2_low, b = var2_up, mean = var2_mean)]
dt[, sim_vars_multiplied := sim_var1 * sim_var2]
dt[, var_mean := mean(sim_vars_multiplied, na.rm=TRUE)]
dt[, var_sd := sd(sim_vars_multiplied, na.rm=TRUE)]
dt[, var_cv := var_sd / var_mean]
sim_results <- dt[,var_cv]
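Note that the code above draws one value per row and computes a single coefficient of variation over all rows. If a coefficient of variation per row from n draws is what is needed, a rough base-R sketch could look like the one below. Several things here are assumptions made for illustration: the helper names rtrunc and cv_row are hypothetical, the truncated normal is drawn by inverse-CDF sampling as a stand-in for truncnorm::rtruncnorm, n_obs is kept small, and var2_up is simulated from 1 to 10 so that it always exceeds var2_low.

```r
set.seed(1)
n     <- 1000   # simulations per row
n_obs <- 200    # kept small for the sketch

dt <- data.frame(
  var1      = runif(n_obs, 0, 100),
  var2_low  = runif(n_obs, 0, 1),
  var2_mean = runif(n_obs, 0, 5),
  var2_up   = runif(n_obs, 1, 10)   # assumption: upper bound always above lower bound
)

# Inverse-CDF draw from a normal truncated to [a, b]
# (hypothetical base-R stand-in for truncnorm::rtruncnorm)
rtrunc <- function(n, a, b, mean, sd = 1) {
  u <- runif(n, pnorm(a, mean, sd), pnorm(b, mean, sd))
  qnorm(u, mean, sd)
}

# Coefficient of variation for one row, from n simulated products
cv_row <- function(v1, a, b, m) {
  sims <- v1 * rnorm(n, mean = 1, sd = 0.10) * rtrunc(n, a, b, m)
  sd(sims) / mean(sims)
}

# Only n values are held in memory at a time; one number per row is kept
var_cv <- mapply(cv_row, dt$var1, dt$var2_low, dt$var2_up, dt$var2_mean)
head(var_cv)
```

The point of the design is that the simulated values for a row are discarded as soon as its summary is computed, so memory use grows with n rather than with n times the number of rows. Whether this is fast enough for 2 million rows would need to be measured.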
Edit: Modified simulated data so that price means/medians and neighborhoods don't overlap perfectly.
I have a column in a dataframe, we'll call it Price. I'm just simulating data here:
mydata = data.frame(index = 1:1000, price = NA_real_)
mydata$price[1:300] = rnorm(300, mean = 10000, sd = 1000)
mydata$price[301:550] = rnorm(250, mean = 25000, sd = 1000)
mydata$price[551:775] = rnorm(225, mean = 75000, sd = 1000)
mydata$price[776:1000] = rnorm(225, mean = 100000, sd = 1000)
And a set of neighborhoods, we'll call it Hoods:
mydata$hoods = factor(c(rep('hood1',250),rep('hood2',250),rep('hood3',250),rep('hood4',250)))
Then I aggregate the neighborhoods by the median price to create a median bin. I'd like to bin neighborhoods by their median price.
agg <- aggregate(mydata$price, by = list(mydata$hoods), FUN = median)
Then I create a cut version of the neighborhood medians (in my actual data there are 24 neighborhoods). So something like:
cut_aggregates <- cut(agg$x, breaks = c(0, 25000, 70000, 110000), labels = c('low','medium','high'))
I then want to replace every 'hood1' in the original data with its aggregated price label, and likewise for all the other neighborhoods, so that the first 250 records would be 'low', for example. I know I could write nested if statements or hard-code the values by brute force. Does anyone know a more efficient way to assign all the values in one go? I may use this on datasets larger than 1000 records. Thank you very much for any help you may provide.
In the final output, the categorized neighborhood ('low', 'medium', 'high') won't necessarily match a cut on price in the original data, because a single neighborhood can contain a mix of 'low', 'medium', and 'high' prices. I want to first categorize each neighborhood based on its aggregate, and THEN recode the neighborhood.
A very simple way to do this, and probably the fastest, is to use data.table.
library(data.table)
# convert mydata into a data.table
setDT(mydata)
# calculate median price by hood
mydata[, med := median(price), by=hoods]
Now you can either:
# replace the original data of `hoods` with the new price labels
mydata[, hoods := cut(med, breaks = c(0, 25000, 70000, 110000), labels = c('low','medium','high'))]
# or create new price labels in a new column
mydata[, new_col := cut(med, breaks = c(0, 25000, 70000, 110000), labels = c('low','medium','high'))]
Finally, if you want just a summary table for each hood:
mydata[, (med = median(price)), by=.(hoods, new_col)]
> hoods new_col V1
> 1: hood1 low 9916.564
> 2: hood2 low 24696.864
> 3: hood3 high 74749.481
> 4: hood4 high 99852.744
Edit: Approach 1
mydata <- within(mydata, med <- ave(price, hoods, FUN = median) )
mydata$new_label <- cut(mydata$med, breaks = c(0, 25000, 70000, 110000), labels = c('low','medium','high'))
# index price hoods med new_label
# 1 1 10084.756 hood1 10014.38 low
# 2 2 10226.460 hood1 10014.38 low
# 3 3 10432.556 hood1 10014.38 low
# 4 4 10558.065 hood1 10014.38 low
# 5 5 10059.755 hood1 10014.38 low
# 6 6 9885.359 hood1 10014.38 low
Approach 2:
Since the labels in agg$labs are not unique across the levels of hoods in mydata, it is better to reassign them one at a time, using a loop that maps each level of hoods to its value in agg$labs.
If agg$labs had a unique label for each level of hoods in mydata, you could simply reassign the labels with mydata$hoods <- factor( mydata$hoods, levels = agg$Group.1, labels = agg$labs ). However, agg$labs contains duplicated labels, so follow the steps below.
mydata$hoods <- as.character( mydata$hoods ) # convert factor to character
agg$labs <- as.character(agg$labs) # convert factor to character
for( i in seq_len( nrow( agg ) ) ) { # change labels for hoods in mydata
mydata[ mydata$hoods %in% agg$Group.1[ i ], "hoods" ] <- agg$labs[i]
}
mydata$hoods <- factor( mydata$hoods ) # convert hoods back to factor
unique(mydata$hoods) # output
# [1] low medium high
# Levels: high low medium
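As a loop-free alternative to the steps above, a lookup with match() may do the same job. This is a minimal sketch on a small hypothetical data set (toy, with three neighborhoods of five records each), not the data from the question; the column name hood_label is also introduced here for illustration.

```r
set.seed(200)
# Hypothetical toy data: three neighborhoods with clearly separated prices
toy <- data.frame(
  price = c(rnorm(5, 10000, 1000), rnorm(5, 60000, 1000), rnorm(5, 100000, 1000)),
  hoods = factor(rep(c("hood1", "hood2", "hood3"), each = 5))
)

# One median per neighborhood, then one label per neighborhood
agg <- aggregate(price ~ hoods, data = toy, FUN = median)
agg$labs <- cut(agg$price, breaks = c(0, 25000, 70000, 110000),
                labels = c("low", "medium", "high"))

# match() finds each record's neighborhood row in agg; indexing copies its label
toy$hood_label <- agg$labs[match(toy$hoods, agg$hoods)]
table(toy$hoods, toy$hood_label)
```

Duplicated labels in agg$labs are not a problem here, because the lookup goes from neighborhood to label and never the other way round.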
Data:
set.seed( 200 )
mydata = data.frame(index = rep(1:1000))
mydata$price[1:250] = rnorm(250, mean = 10000, sd = 1000)
mydata$price[251:500] = rnorm(250, mean = 25000, sd = 1000)
mydata$price[501:750] = rnorm(250, mean = 75000, sd = 1000)
mydata$price[751:1000] = rnorm(250, mean = 100000, sd = 1000)
mydata$hoods = factor(c(rep('hood1',250),rep('hood2',250),rep('hood3',250),rep('hood4',250)))
agg <- with(mydata, aggregate( price, by = list(hoods), FUN = median) )
agg$labs <- cut(agg$x, breaks = c(0, 25000, 70000, 110000), labels = c('low','medium','high'))
agg
# Group.1 x labs
# 1 hood1 10014.38 low
# 2 hood2 25021.96 medium
# 3 hood3 74963.40 high
# 4 hood4 100019.88 high
The data in agg will vary if you choose a different seed in set.seed() function.
How would I use apply or a version of it to get around using this loop:
data<- seq(from = -100, to = 100, by = 10)
data
times<- seq(from = 25, to = 100, by=25)
for(i in 1:length(times))
{
print(length(data[data < -times[i] | data > times[i]])/length(data))
}
I am calculating a frequency table of the "data" dataframe.
Thank you.
Here is a vectorized solution:
unname( #remove names
rev( #reverse order
cumsum( #cumulative sum
rev( #reverse order
table( #contingency table of the counts of each interval
cut(abs(data), c(times, Inf)) #cut into intervals
)
)
)
)
) / length(data)
#[1] 0.7619048 0.4761905 0.2857143 0.0000000
Note that data is not a data.frame.
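Since the question asked specifically about the apply family, a more literal translation of the loop is sketched below. It relies on the fact that data < -t | data > t is the same condition as abs(data) > t, and that the mean of a logical vector is the proportion of TRUE values. The name props is introduced here for illustration.

```r
data  <- seq(from = -100, to = 100, by = 10)
times <- seq(from = 25, to = 100, by = 25)

# For each threshold, the share of values whose absolute value exceeds it
props <- sapply(times, function(t) mean(abs(data) > t))
props
```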