Edit: Modified simulated data so that price means/medians and neighborhoods don't overlap perfectly.
I have a column in a dataframe, we'll call it Price. I'm just simulating data here:
mydata = data.frame(index = rep(1:1000))
mydata$price[1:300] = rnorm(300, mean = 10000, sd = 1000)
mydata$price[301:550] = rnorm(250, mean = 25000, sd = 1000)
mydata$price[551:775] = rnorm(225, mean = 75000, sd = 1000)
mydata$price[776:1000] = rnorm(225, mean = 100000, sd = 1000)
And a set of neighborhoods, we'll call it Hoods:
mydata$hoods = factor(c(rep('hood1',250),rep('hood2',250),rep('hood3',250),rep('hood4',250)))
Then I aggregate by neighborhood to get each neighborhood's median price, since I'd like to bin neighborhoods by their median price:
agg <- aggregate(mydata$price, by = list(mydata$hoods), FUN = median)
Then I create a cut version of the neighborhood medians (in my actual data there are 24 neighborhoods). So something like:
cut_aggregates <- cut(agg$x, breaks = c(0, 25000, 70000, 110000), labels = c('low','medium','high'))
I then want to replace the value of every 'hood1' in the original data with the aggregated price label, and so on for all the neighborhoods. So the first 250 records would be 'low', for example. I know I could write a nested if statement or brute-force hard-code it. Does anyone know a way I can assign all the values more efficiently, in one go, as I may use this for datasets larger than 1000 records? Thank you very much for any help you may provide.
Note that in the final output, the categorized neighborhood ('low', 'medium', 'high') won't necessarily be the same as just doing a cut on price in the original data, because some neighborhoods would contain a mix of 'low', 'medium', and 'high' rows under that strategy. I want to first categorize each neighborhood based on its aggregate, and THEN recode the neighborhood.
A very simple way to do this, and probably the fastest, is to use data.table.
library(data.table)
# convert mydata into a data.table
setDT(mydata)
# calculate median price by hood
mydata[, med := median(price), by=hoods]
Now you can either:
# replace the original data of `hoods` with the new price labels
mydata[, hoods := cut(med, breaks = c(0, 25000, 70000, 110000), labels = c('low','medium','high'))]
# or create new price labels in a new column
mydata[, new_col := cut(med, breaks = c(0, 25000, 70000, 110000), labels = c('low','medium','high'))]
Finally, if you want just a summary table for each hood:
mydata[, .(med = median(price)), by = .(hoods, new_col)]
> hoods new_col med
> 1: hood1 low 9916.564
> 2: hood2 low 24696.864
> 3: hood3 high 74749.481
> 4: hood4 high 99852.744
Edit: Approach 1
mydata <- within(mydata, med <- ave(price, hoods, FUN = median) )
mydata$new_label <- cut(mydata$med, breaks = c(0, 25000, 70000, 110000), labels = c('low','medium','high'))
# index price hoods med new_label
# 1 1 10084.756 hood1 10014.38 low
# 2 2 10226.460 hood1 10014.38 low
# 3 3 10432.556 hood1 10014.38 low
# 4 4 10558.065 hood1 10014.38 low
# 5 5 10059.755 hood1 10014.38 low
# 6 6 9885.359 hood1 10014.38 low
Approach 2:
Since agg$labs is not unique for each level of hoods in mydata, one way is to reassign the labels with a loop that maps the levels of hoods to the values of agg$labs (a vectorized alternative using match() is sketched after the output below).
If you had unique labels in agg$labs for each level of hoods in mydata, it would be very simple to reassign the labels with mydata$hoods <- factor( mydata$hoods, levels = agg$Group.1, labels = agg$labs ). However, you have duplicated values in agg$labs, so instead follow the steps below.
mydata$hoods <- as.character( mydata$hoods ) # convert factor to character
agg$labs <- as.character(agg$labs) # convert factor to character
for( i in seq_len( nrow( agg ) ) ) { # change labels for hoods in mydata
mydata[ mydata$hoods %in% agg$Group.1[ i ], "hoods" ] <- agg$labs[i]
}
mydata$hoods <- factor( mydata$hoods ) # convert hoods back to factor
unique(mydata$hoods) # output
# [1] low medium high
# Levels: high low medium
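As an aside, the whole loop can be replaced by a single vectorized lookup. This is just a sketch equivalent to the loop above, using base match() to find each hood's row in agg:
# vectorized alternative to the loop: match() returns each hood's row index in agg
mydata$hoods <- factor( as.character(agg$labs)[ match(mydata$hoods, agg$Group.1) ] )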
Data:
set.seed( 200 )
mydata = data.frame(index = rep(1:1000))
mydata$price[1:250] = rnorm(250, mean = 10000, sd = 1000)
mydata$price[251:500] = rnorm(250, mean = 25000, sd = 1000)
mydata$price[501:750] = rnorm(250, mean = 75000, sd = 1000)
mydata$price[751:1000] = rnorm(250, mean = 100000, sd = 1000)
mydata$hoods = factor(c(rep('hood1',250),rep('hood2',250),rep('hood3',250),rep('hood4',250)))
agg <- with(mydata, aggregate( price, by = list(hoods), FUN = median) )
agg$labs <- cut(agg$x, breaks = c(0, 25000, 70000, 110000), labels = c('low','medium','high'))
agg
# Group.1 x labs
# 1 hood1 10014.38 low
# 2 hood2 25021.96 medium
# 3 hood3 74963.40 high
# 4 hood4 100019.88 high
The data in agg will vary if you choose a different seed in set.seed() function.
I have a dataframe with 100+ columns. I want to filter those columns based on their nonuniformity.
For example, if there are columns with more than 90% (or 95%, or 99%) NAs (or 0s, or -999, or whatever value), remove them from the dataframe.
I can handle NAs or 0s, but the problem is that I don't know in advance which value it will be.
Example of removing columns with more than 90% NAs: df[, which(colMeans(is.na(df)) <= 0.9)]
I would simply use table to count the number of occurrences of each value; when the maximum of these counts exceeds the needed threshold, you can discard the column.
In the following toy example, x, y and z are effectively "constant": x is 96% NA, y is 99% zeros, and z is 97% -1 (but any value would work).
set.seed(26012023)
df <- data.frame(w = rnorm(100), x = c(rep(NA, 96), rnorm(4)), y = c(rep(0, 99), rnorm(1)),
z = c(rep(-1, 97), rnorm(3)))
# flag columns where a single value (NAs included) accounts for at least cutoff of the rows
apply(df, 2, function(x, cutoff = .95) {
  tab <- table(x, useNA = "ifany") # count occurrences of each value, NAs included
  max_val <- max(tab)              # frequency of the most common value
  max_val >= cutoff * length(x)
})
# w x y z
# FALSE TRUE TRUE TRUE
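To actually drop the flagged columns, the logical vector returned by apply can be negated and used as a column selector. A minimal sketch following on from the code above (to_drop is just a name chosen here):
to_drop <- apply(df, 2, function(x, cutoff = .95) {
  max(table(x, useNA = "ifany")) >= cutoff * length(x)
})
df_clean <- df[, !to_drop, drop = FALSE] # keeps only w in this example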
We can create a toy example, defining the following data.frame named df:
# Seed to make it reproducible
set.seed(12345)
# note: wrapping cbind() in data.frame() coerces every column to character,
# which is fine here since we only count value frequencies
df <- data.frame(cbind(Var1 = c(rep(10,19),1),
                       Var2 = sample(letters[1:5],20, prob = c(0.8,0.1,0.5,0.25,0.25), replace = T),
                       Var3 = sample(c("Yes","No"), 20, prob = c(.95, .05), replace = T),
                       Var4 = sample(1:3, 20, replace = T),
                       Var5 = c(rep(NA,15),rep(1,5))))
Then we compute the maximum relative frequency of a single value for each column, and finally we delete the columns that exceed the required threshold:
# Calculate the maximum relative frequency of a single value for each column
aux <- apply(df, 2, function(x) max(prop.table(table(x, useNA = "ifany"))))
# Define new.df as df without the columns where one value occurs more than 90% of the time
# (caveat: if no column exceeds the threshold, -which(...) drops every column;
# df[, aux <= .9, drop = FALSE] is a safer variant)
new.df <- df[, -which(aux > .9)]
I try to assess the combined uncertainty related to different input parameters using a Monte Carlo method in R. In other words, using the uncertainty parameters reported in the input data documentation, I try to generate a distribution for each dataset by creating 1000 random values per observation from the relevant distribution (normal or truncated normal).
However, I don't know how to do this faster with the purrr::map() functions and without exhausting the RAM. The real dataset has 2 million rows and 80 columns.
Here is a simplified example:
library(tidyverse); library(truncnorm);library(data.table); library(dtplyr)
n <- 1000 # number of simulations
n_obs <- 10000 # number of observations. Does not work if e.g. 50000
Create a data.frame
dt <- data.frame(
var1 = runif(n_obs, 0, 100),
var2_low = runif(n_obs, 0, 1),
var2_mean = runif(n_obs, 0, 5),
var2_up = runif(n_obs, 0, 10)
)
Convert to lazy data table to speed things up
dt1 <- dt %>% as.data.table() %>%
lazy_dt(., immutable = FALSE)
Simulate
dt_sim <- dt1 %>%
mutate(mean_val = rep(1, nrow(.)), # just a column of 1s
var1_rnorm = map(.x = mean_val,~rnorm(n, mean = .x, sd = 0.10)), # normal distribution with given sd
sim_var1 = map2(.x = var1, .y = var1_rnorm, ~(.x*.y))) %>% # multiply the data with simulated distribution
# add truncated normal distribution for each row (var2)
mutate(sim_var2 = pmap(.,~ rtruncnorm(n,
a = dt$var2_low,
b = dt$var2_up,
mean =dt$var2_mean))) %>%
# multiply simulated variables sim_var1 and sim_var2
mutate(sim_vars_multiplied =
pmap(list(x = .$sim_var1,
y = .$sim_var2),
function(x,y) (x*y))) %>%
# derive coefficient of variation
mutate(var_mean =map(.x = sim_vars_multiplied, ~ mean(.x, na.rm = TRUE)),
var_sd = map(.x = sim_vars_multiplied, ~ sd(.x, na.rm = TRUE)),
var_cv = unlist(var_sd) / unlist(var_mean)) %>%
# select only the variables needed
dplyr::select(var_cv)
# collect the results
sim_results <- dt_sim %>% as.data.table()
This may help
library(data.table)
#n <- 1000 # number of simulations
n_obs <- 10000
dt <- data.table(
var1 = runif(n_obs, 0, 100),
var2_low = runif(n_obs, 0, 1),
var2_mean = runif(n_obs, 0, 5),
var2_up = runif(n_obs, 0, 10)
)
dt[, mean_val := rep(1,.N)]
dt[, var1_rnorm := rnorm(.N, mean = mean_val, sd = 0.10)]
dt[, sim_var1 := var1 * var1_rnorm]
dt[, sim_var2 := truncnorm::rtruncnorm(.N, a = var2_low, b = var2_up, mean = var2_mean)]
dt[, sim_vars_multiplied := sim_var1 * sim_var2]
dt[, var_mean := mean(sim_vars_multiplied, na.rm=TRUE)]
dt[, var_sd := sd(sim_vars_multiplied, na.rm=TRUE)]
dt[, var_cv := var_sd / var_mean]
sim_results <- dt[,var_cv]
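If you need the n = 1000 simulated values per observation, as in the question, rather than one draw per row, a plain matrix version avoids the purrr list columns entirely. This is only a sketch (sim_var1, sim_var2 and prod_sim are names chosen here), under the assumption that an n_obs x n matrix fits in memory, about 80 MB at these sizes; for 2 million rows you would process the data in row chunks:
library(truncnorm)
n <- 1000 # number of simulations per observation
# rows = observations, columns = simulations; matrix() fills column-wise,
# so the vectorized parameters recycle correctly within each column
sim_var1 <- dt$var1 * matrix(rnorm(n_obs * n, mean = 1, sd = 0.10), nrow = n_obs)
sim_var2 <- matrix(rtruncnorm(n_obs * n, a = dt$var2_low, b = dt$var2_up,
                              mean = dt$var2_mean), nrow = n_obs)
prod_sim <- sim_var1 * sim_var2
# coefficient of variation per observation, across the n simulations
var_cv <- apply(prod_sim, 1, sd) / rowMeans(prod_sim)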
In the data set created below, assume I randomly picked up 20 flat rocks. Each of these rocks was assigned a unique ID number. I measured the concentration of 7 substances (Copper, Iron, Carbon, Lead, Mg, CaCO, and Zinc) along the longest axis of each rock. Distance is recorded in mm, and is therefore a function of each rock's length. Note that not all rocks are of the same length. Location is a grouping variable that describes where the rock was picked up.
ID <- data.frame(ID=rep(c(12,122,242,329,595,130,145,245,654,878), each = 200))
ID2 <- data.frame(ID=rep(c(863,425,24,92,75,3,200,300,40,500), each = 300))
RockID<-data.frame(RockID = c(unlist(ID), unlist(ID2)))
Location <- rep(c("Alpha","Beta","Charlie","Delta","Echo"), each = 1000)
a <- rep(c(1:200),times = 10)
b <- rep(c(1:300), times = 10)
Time <- data.frame(Time = c(unlist(a), unlist(b)))
set.seed(1)
Copper <- rnorm(5000, mean = 0, sd = 5)
Iron <- rnorm(5000, mean = 0, sd = 10)
Carbon <- rnorm(5000, mean = 0, sd = 1)
Lead <- rnorm(5000, mean = 0, sd = 4)
Mg <- rnorm(5000, mean = 0, sd = 6)
CaCO <- rnorm(5000, mean = 0, sd = 2)
Zinc <- rnorm(5000, mean = 0, sd = 3)
data <- cbind(RockID, Location, Time, Copper, Iron, Carbon, Lead, Mg, CaCO, Zinc)
data$ID <- as.factor(data$RockID)
I want to create a new data frame that contains the following information:
1. The first observation and the last observation for each individual
2. The average of the first 3 observations and last 3 observations for each individual
3. The same as step 2. for the first and last 5, 7, and 10 observations
I want the new data frame to be set up like this:
ID FirstPt First3 First5 First7 First10 LastPt Last3 Last5 Last7 Last10
12 … … … … … … … … … …
122
242
329
595
130
145
245
654
878
863
425
etc...
How would I write a function to accomplish this?
We can create functions to calculate the average of the first and last n values. Use pivot_longer to get the data in long format, group_by each RockID and substance, and calculate the means.
library(dplyr)
average_of_first_n_values <- function(value, x) mean(head(value, x))
average_of_last_n_values <- function(value, x) mean(tail(value, x))
data %>%
tidyr::pivot_longer(cols = Copper:Zinc) %>%
group_by(RockID, name) %>%
summarise(first_obs = first(value),
last_obs = last(value),
first_3_avg = average_of_first_n_values(value, 3),
first_5_avg = average_of_first_n_values(value, 5),
first_7_avg = average_of_first_n_values(value, 7),
first_10_avg = average_of_first_n_values(value, 10),
last_3_avg = average_of_last_n_values(value, 3),
last_5_avg = average_of_last_n_values(value, 5),
last_7_avg = average_of_last_n_values(value, 7),
last_10_avg = average_of_last_n_values(value, 10))
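If you also want the wide layout shown in the question (one row per rock, with one column per statistic and substance), the summarised result can be reshaped with pivot_wider. A sketch, assuming the result of the pipeline above was stored in res:
library(tidyr)
# spreads each statistic into one column per substance, e.g. first_obs_Copper
res %>%
  pivot_wider(names_from = name,
              values_from = first_obs:last_10_avg)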
Suppose I have a data frame with 8 schools and their characteristics, and another with 48 teachers and their characteristics. I can generate some fake data with the following code:
library(dplyr)
library(geosphere)
set.seed(6232015)
n.schools <-8
n.teachers <- 48
makeRandomString <- function(pre, n=1, length=12) {
randomString <- c(1:n) # initialize vector
for (i in 1:n) {
randomString[i] <- paste0(pre,'.', paste(sample(c(0:9, letters, LETTERS),
length, replace=TRUE),
collapse=""))
}
return(randomString)
}
gen.teachers <- function(n.teachers){
Teacher.ID <- makeRandomString(pre= 'T', n = n.teachers, length = 20)
Teacher.exp <- runif(n = n.teachers, min = 1, max = 30)
Teacher.Other <- sample(c(0,1), replace = T, prob = c(0.5, 0.5), size = n.teachers)
Teachers <- data.frame(Teacher.ID, Teacher.exp, Teacher.Other)
return(Teachers)
}
gen.schools <- function(n.schools){
School.ID <- makeRandomString(pre= 'S', n = n.schools, length = 20)
School.lat <- runif(n = n.schools, min = -2, max = 2)
School.long <- runif(n = n.schools, min = -2, max = 2)
Schools <- data.frame(School.ID, School.lat, School.long) %>%
rowwise() %>% mutate(School.distance = distHaversine(p1 = c(School.long, School.lat),
p2 = c(0, 0), r = 3961))
return(Schools)
}
Teachers <- gen.teachers(n.teachers = n.teachers)
Schools <- gen.schools(n.schools = n.schools)
To each school, I want to assign 6 teachers (every teacher gets one and only one school). I could use:
Teachers %>% sample_n(6)
to get a group of 6 teachers, assign them to a school, remove them from my original pool, and keep going with a loop. My guess/hope is that there is a much easier way of doing this.
Thanks for the help!
In the context of your code
sample(rep(Schools$School.ID, each = 6))
gives a random sequence of school IDs in which each School.ID appears 6 times. Set Teachers$AssignedSchool to this sample and each teacher has an assigned school, as sketched below.
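A usage sketch in the context of the generated data (the seed is arbitrary, and AssignedSchool is just a name chosen here):
set.seed(1)
Teachers$AssignedSchool <- sample(rep(Schools$School.ID, each = 6))
table(Teachers$AssignedSchool) # check: each of the 8 schools gets exactly 6 teachers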
I'm looking for a simpler way to aggregate and calculate percentages of a numerical variable using data.table.
The following code outputs the desired result; my question is whether there is a better way to get the same result. I'm not really familiar with the package, so any tips would be useful.
I'd like to have the following columns:
second_factor_variable third_factor_variable factor_variable porc porcentaje
1: HIGH C > 200 0.04456544 4 %
2: LOW A 51 - 100 0.31739130 32 %
3: LOW A 101 - 200 0.68260870 68 %
4: LOW A 26 - 50 0.00000000 0 %
Here porc is the numerical proportion and porcentaje is the percentage rounded for use as a label in a ggplot call.
library("ggplot2")
library("scales")
library("data.table")
### Generate some data
set.seed(123)
df <- data.frame(x = rnorm(10000, mean = 100, sd = 50))
df <- subset(df, x > 0)
df$factor_variable <- cut(df$x, right = TRUE,
breaks = c(0, 25, 50, 100, 200, 100000),
labels = c("0 - 25", "26 - 50", "51 - 100", "101 - 200", "> 200")
)
df$second_factor_variable <- cut(df$x, right = TRUE,
breaks = c(0, 100, 100000),
labels = c("LOW", "HIGH")
)
df$third_factor_variable <- cut(df$x, right = TRUE,
breaks = c(0, 50, 100, 100000),
labels = c("A", "B","C")
)
str(df)
### Aggregate
DT <- data.table(df)
dt = DT[, list(factor_variable = unique(DT$factor_variable),
porc = as.numeric(table(factor_variable)/length(factor_variable)),
porcentaje = paste( round( as.numeric(table(factor_variable)/length(factor_variable), 0 ) * 100 ), "%")
), by="second_factor_variable,third_factor_variable"]
EDIT
I've tried agstudy's solution grouping by just one variable, and I believe it didn't work for producing the labels (the porcentaje column). In the real dataset I ended up having a similar issue, and I can't spot what's wrong with this function:
grp <- function(factor_variable) {
porc = as.numeric(table(factor_variable)/length(factor_variable))
list(factor_variable = factor_variable[1],
porc =porc,
porcentaje = paste( round( porc, 0 ) * 100 , "%"))
}
DT[, grp(factor_variable) , by="second_factor_variable"]
The numerical values are correct
DT2 <- DT[DT$second_factor_variable %in% "LOW"]
table(DT2$factor_variable)/length(DT2$factor_variable)
I believe the same problem appears if I group by two factor variables:
DT[, grp(factor_variable) , by="second_factor_variable,third_factor_variable"]
Two changes: factor porc out into a local variable, and don't use DT to compute factor_variable (DT$factor_variable ignores the by grouping):
DT[, { porc = as.numeric(table(factor_variable)/length(factor_variable))
list(factor_variable = factor_variable[1],
porc =porc,
porcentaje = paste( round( porc, 0 ) * 100 , "%"))
}
, by="second_factor_variable,third_factor_variable"]
In the real dataset, the previous function wasn't working well: the values and labels weren't right. Two details in grp explain it: factor_variable[1] is a single value that gets recycled across all the proportions, so porc is no longer matched to the right level, and round(porc, 0) * 100 rounds each proportion to 0 or 1 before scaling, so every label comes out as "0 %" or "100 %".
I've just changed it a tiny bit and it worked. I add it as a separate answer so it's easier to find later.
Using factor_variable = levels(factor_variable) keeps the factor levels in the order they were created; the previous way kept them as they were ordered in the DT, which didn't work well for the particular data I was working with. The labels (the porcentaje column) work now, because the rounding happens after multiplying by 100.
grp2 <- function(factor_variable) {
  porc = as.numeric(table(factor_variable)/length(factor_variable))
  list(factor_variable = levels(factor_variable), # all levels, in the order table() reports them
       porc = porc,
       porcentaje = paste(round(porc * 100), "%") # round after scaling to 0-100
  )
}
DT[, grp2(factor_variable) , by="second_factor_variable"]
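For reference, a more idiomatic data.table sketch of the same aggregation, counting rows with .N and converting the counts to within-group proportions (res is just a name chosen here). One caveat: unlike the table()-based approach, combinations with zero counts are dropped, so the 0 % rows shown in the desired output would be missing:
# count rows per combination, then turn counts into proportions within each group
res <- DT[, .N, by = .(second_factor_variable, third_factor_variable, factor_variable)]
res[, porc := N / sum(N), by = .(second_factor_variable, third_factor_variable)]
res[, porcentaje := paste(round(porc * 100), "%")]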