I want to plot the fee as a percentage of income (fee_per_inc) for each quintile of household income (year_hh_inc).
This is what I have so far:
pacman::p_load(RCurl, plm, tibble, ggplot2, AER, dplyr, car, arm, broom, tidyr, fastDummies, dummies)
x <- getURL("https://raw.githubusercontent.com/dothemathonthatone/maps/master/main_test.csv")
maindf <- read.csv(text = x, row.names=NULL)
maindf <- maindf %>%
mutate(category = cut(year_hh_inc, breaks = (quantile(year_hh_inc, c(0, 1 / 5, 2 / 5, 3 / 5, 4 / 5, 1), na.rm = TRUE)), labels = c("first_quint", "second_quint", "third_quint", 'fourth_quint', 'fifth_quint'), include.lowest = TRUE), vals = 1) %>%
pivot_wider(names_from = category, values_from = vals, values_fill = list(vals = 0))
box <- boxplot(maindf$year_hh_inc ~ maindf$fee_per_inc, col = 3:5)
This is what I would like as an end result:
I think I have a bit more work to do; any help from this point is appreciated.
I think there were a couple of problems here. You need the boxplot to have the variables the other way round. Also, you need to use the category variable that you created in mutate instead of the original variable. Lastly, you don't need the pivot_wider.
Some of the values were also way outside the useful range and may have been wrong (some numbers were -8), so I have trimmed the outliers to make the graph prettier. You'll want to check the original data to see whether this makes sense.
pacman::p_load(RCurl, plm, tibble, ggplot2, AER, dplyr, car, arm, broom, tidyr, fastDummies, dummies)
x <- getURL("https://raw.githubusercontent.com/dothemathonthatone/maps/master/main_test.csv")
maindf <- read.csv(text = x, row.names=NULL)
maindf <- maindf %>%
mutate(category = cut(year_hh_inc,
breaks = (quantile(year_hh_inc, c(0, 1/5, 2/5, 3/5, 4/5, 1), na.rm = TRUE)),
labels = c("first_quint", "second_quint", "third_quint",
'fourth_quint', 'fifth_quint'),
include.lowest = TRUE),
vals = 1)
maindf <- maindf[maindf$fee_per_inc > 0 & maindf$fee_per_inc < 0.01, ]
box <- boxplot(maindf$fee_per_inc ~ maindf$category, col = 3:5)
Created on 2020-03-03 by the reprex package (v0.3.0)
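In case the CSV is not reachable, the quintile step can be checked on simulated data. This is only a sketch in base R: the `toy` data frame and its distributions are made up for illustration, and only the column names come from the question.

```r
set.seed(42)

# Simulated stand-in for the real data (assumption: 100 households)
toy <- data.frame(
  year_hh_inc = rlnorm(100, meanlog = 10, sdlog = 0.5),
  fee_per_inc = runif(100, min = 0, max = 0.01)
)

# Quintile break points of income, then one factor level per quintile
breaks <- quantile(toy$year_hh_inc, probs = (0:5) / 5, na.rm = TRUE)
toy$category <- cut(toy$year_hh_inc,
                    breaks = breaks,
                    labels = c("first_quint", "second_quint", "third_quint",
                               "fourth_quint", "fifth_quint"),
                    include.lowest = TRUE)

# Each quintile should hold a fifth of the rows
counts <- table(toy$category)
print(counts)

# Median fee share per quintile: the centre line of each box
medians <- tapply(toy$fee_per_inc, toy$category, median)
print(medians)

# The plot itself, with the factor on the right-hand side of the formula:
# boxplot(fee_per_inc ~ category, data = toy, col = 3:5)
```

The important part is that the grouping factor goes on the right of `~` and the numeric variable on the left.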
I'm trying to apply the data_color() function from the gt package to several columns in my data frame, but each with their own color palette domain. So far, what I have is:
df <- data.frame(Var1 = rnorm(30),
Var2 = rnorm(30),
Var3 = rnorm(30),
Var4 = rnorm(30),
Var5 = rnorm(30),
Var6 = rnorm(30))
mypals <- list()
for (i in 2:6){
mypals[[i]] <- scales::col_bin(colpal,
domain = c(min(df[,i]), max(df[,i])))
}
df %>%
gt() %>%
data_color(columns = 2, colors = mypals[[2]]) %>%
data_color(columns = 3, colors = mypals[[3]]) %>%
data_color(columns = 4, colors = mypals[[4]]) %>%
data_color(columns = 5, colors = mypals[[5]]) %>%
data_color(columns = 6, colors = mypals[[6]])
Is there a way to do a "recursive" piping, something similar to this perhaps?
df %>%
gt() %>% seq(2:6) %>% (function(x){
data_color(columns = x, colors = mypals[[x]])
}
)
Thanks in advance for all your suggestions.
I'm new to the gt package, so forgive me if there's an easier way to do this.
I can't test this answer thoroughly because I can't install the gt package, but I believe you are looking for the accumulate or reduce functions from the purrr package.
library(purrr)
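my_data_color <- \(x, y, z) data_color(x, columns = y, colors = z[[y]])
# Start from the gt table (.init) and apply data_color() to columns 2 to 6 in turn;
# mypals[[1]] is never filled in by the loop, so column 1 is skipped
reduce(2:6,
~ my_data_color(x = .x,
y = .y,
z = mypals),
.init = df %>% gt())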
From the man page:
reduce() is an operation that combines the elements of a vector into a single value. The combination is driven by .f, a binary function that takes two values and returns a single value: reducing f over 1:3 computes the value f(f(1, 2), 3).
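To make the folding order concrete, here is a small sketch using base R's `Reduce()` rather than purrr (a swapped-in function, chosen so that it runs without extra packages). The string-building function is only there to make the nesting visible.

```r
# Build the call structure as text so the nesting is visible
nested <- Reduce(function(acc, x) paste0("f(", acc, ", ", x, ")"), 1:3)
print(nested)
# [1] "f(f(1, 2), 3)"

# accumulate = TRUE keeps every intermediate result (like purrr::accumulate)
steps <- Reduce(`+`, 1:4, accumulate = TRUE)
print(steps)
# [1]  1  3  6 10

# A starting value plays the role of the initial object that is threaded
# through each step, e.g. a table that gets one more column styled per step
threaded <- Reduce(function(acc, i) paste(acc, i), 2:6, "tbl")
print(threaded)
# [1] "tbl 2 3 4 5 6"
```

The last call is the pattern needed here: the accumulator is the table, and each element of `2:6` is the next column to colour.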
One approach would be to generate the statement as text and run it with eval(parse(text = <statement>)), as below:
eval(parse(text=paste(
"df %>% gt() %>%",
paste0("data_color(columns=", 2:6, ", colors=mypals[[", 2:6, "]])", collapse=" %>% ")
)))
In R I've got a dataset like this one:
df <- data.frame(
ID = c(1:30),
x1 = seq(0, 1, length.out = 30),
x2 = seq(100, 3000, length.out = 30),
category = gl(3, 10, labels = c("NEGATIVE", "NEUTRAL", "POSITIVE"))
)
Now I want to add a new column with randomized boolean values, but inside each category the proportion of TRUE and FALSE values should be the same (i.e. the randomizing process should generate the same count of TRUE and FALSE values; in the above data frame that means 5 TRUEs and 5 FALSEs in each of the 3 categories). How can I do this?
You can sample a vector of TRUE and FALSE values without replacement, which gives a randomized and balanced column in your data frame.
sample(rep(c(TRUE, FALSE), each = 5), 10, replace = FALSE)
Based on Yacine Hajji's answer:
addRandomBool <- function(df, p){
n <- ceiling(nrow(df) * p) # number of TRUE values in this group
df$bool <- sample(rep(c(TRUE, FALSE), times = c(n, nrow(df) - n)))
df
}
Reduce(rbind, lapply(split(df, df$category), addRandomBool, p = 0.5))
where the parameter p determines the proportion of TRUE.
This will sample within each group from a vector of 5 TRUE and 5 FALSE without replacement. It will assume that there are always 10 records per group.
library(dplyr)
library(tidyr)
df <- data.frame(
ID = c(1:30),
x1 = seq(0, 1, length.out = 30),
x2 = seq(100, 3000, length.out = 30),
category = gl(3, 10, labels = c("NEGATIVE", "NEUTRAL", "POSITIVE"))
)
set.seed(pi)
df %>%
group_by(category) %>%
nest() %>%
mutate(data = lapply(data,
function(df){ # Function to sample and assign new_col
df$new_col <- sample(rep(c(FALSE, TRUE),
each = 5),
size = 10,
replace = FALSE)
df
})) %>%
unnest(cols = "data")
This next example is a little more generalized, but still assumes an (approximately) even split of TRUE and FALSE within a group. It can accommodate variable group sizes, and even groups with odd numbers of records (an odd-sized group gets one more FALSE than TRUE).
library(dplyr)
library(tidyr)
df <- data.frame(
ID = c(1:30),
x1 = seq(0, 1, length.out = 30),
x2 = seq(100, 3000, length.out = 30),
category = gl(3, 10, labels = c("NEGATIVE", "NEUTRAL", "POSITIVE"))
)
set.seed(pi)
df %>%
group_by(category) %>%
nest() %>%
mutate(data = lapply(data,
function(df){
df$new_col <- sample(rep(c(FALSE, TRUE),
length.out = nrow(df)),
size = nrow(df),
replace = FALSE)
df
})) %>%
unnest(cols = "data")
Maintaining Column Order
A couple of options to maintain the column order:
First, you can save the column order before you do your group_by - nest, and then use select to set the order when you're done.
set.seed(pi)
orig_col <- names(df) # original column order
df %>%
group_by(category) %>%
nest() %>%
mutate(data = lapply(data,
function(df){
df$new_col <- sample(rep(c(FALSE, TRUE),
length.out = nrow(df)),
size = nrow(df),
replace = FALSE)
df
})) %>%
unnest(cols = "data") %>%
select(all_of(c(orig_col, "new_col"))) # Restore the column order
Or you can use a base R solution that doesn't change the column order in the first place:
df <- split(df, df["category"])
df <- lapply(df,
function(df){
df$new_col <- sample(rep(c(FALSE, TRUE),
length.out = nrow(df)),
size = nrow(df),
replace = FALSE)
df
})
do.call("rbind", c(df, list(make.row.names = FALSE)))
There are likely a dozen other ways to do this, and probably more efficient ways that I'm not thinking of.
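One of those other ways, as a rough sketch: base R's `ave()` applies a function within each group and puts the results back in the original row order, so neither the rows nor the columns get rearranged. The column name `new_col` follows the examples above; the 0/1 round trip through `as.logical()` is needed because `ave()` returns values of the same type as its first argument.

```r
df <- data.frame(
  ID = c(1:30),
  x1 = seq(0, 1, length.out = 30),
  x2 = seq(100, 3000, length.out = 30),
  category = gl(3, 10, labels = c("NEGATIVE", "NEUTRAL", "POSITIVE"))
)

set.seed(3)

# Within each category, shuffle a balanced vector of TRUE/FALSE.
# ave() hands the function the row indices of one group at a time and
# writes the result back to those same rows.
df$new_col <- as.logical(
  ave(seq_len(nrow(df)), df$category,
      FUN = function(i) sample(rep(c(TRUE, FALSE), length.out = length(i))))
)

# Number of TRUE values per category
true_counts <- tapply(df$new_col, df$category, sum)
print(true_counts)
```

With ten rows per category this gives five TRUE values in each group; odd-sized groups would get one more TRUE than FALSE because the repeated vector starts with TRUE.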
I create some models like this using a nested tidyr dataframe:
set.seed(1)
library(tidyr)
library(dplyr)
library(sjPlot)
library(tibble)
library(purrr)
fits <- tribble(~group, ~colA, ~colB, ~colC,
sample(c("group1", "group2"), 10, replace = T), 0, sample(10, replace = T), sample(10, replace = T),
sample(c("group1", "group2"), 10, replace = T), 1, sample(10, replace = T), sample(10, replace = T)) %>%
unnest(cols = c(colB, colC)) %>%
nest(data=-group) %>%
mutate(fit= map(data, ~glm(formula = colA ~ colB + colC, data = .x, family="binomial"))) %>%
dplyr::select(group, fit) %>%
tibble::column_to_rownames("group")
I would like to use this data to create some quick marginal effects plots with sjPlot::plot_models like this
plot_models(as.list(fits), type = "pred", terms = c("colB", "colA", "colC"))
Unfortunately, I get the error
Error in if (fam.info$is_linear) tf <- NULL else tf <- "exp" :
argument is of length zero
In addition: Warning message:
Could not access model information.
I've played around a bit with the nesting of the data but I've been unable to get it into a format that sjPlot::plot_models will accept.
What I was expecting to get is a "Forest plot of multiple regression models" as described in the help file. Ultimately, the goal is to plot the marginal effects of regression models by group, which I was hoping the plot_models will do (please correct me if I'm wrong).
I think there are some issues with the original code as well as with the data. The function call passes arguments that belong to plot_model and are not supported by plot_models. I first show an example of how plot_models can be called with a nested tibble, using {ggplot2}'s diamonds data set. Then I apply this approach to the OP's sample data, which doesn't yield usable results*. Finally, I create some new toy data to show how the approach could be applied to a binomial model.
(* In the original toy data the dependent variable is either always 0 or always 1 in each model, so this is unlikely to yield usable results.)
set.seed(1)
library(tidyr)
library(dplyr)
library(sjPlot)
library(tibble)
library(ggplot2)
# general example
fits <- tibble(id = c("x", "y", "z")) %>%
rowwise() %>%
mutate(fit = list(glm(reformulate(
termlabels = c("cut", "color", "depth", "table", "price", id),
response = "carat"),
data = diamonds)))
plot_models(fits$fit)
# OP's example data
fits2 <- tribble(~group, ~colA, ~colB, ~colC,
sample(c("group1", "group2"), 10, replace = T), 0,
sample(10, replace = T), sample(10, replace = T),
sample(c("group1", "group2"), 10, replace = T), 1,
sample(10, replace = T),
sample(10, replace = T)) %>%
unnest(cols = c(colB, colC)) %>%
nest(data = -group) %>%
rowwise() %>%
mutate(fit = list(glm(formula = colA ~ colB + colC, data = data, family="binomial")))
plot_models(fits2$fit)
#> Warning: Transformation introduced infinite values in continuous y-axis
#> Warning: Removed 4 rows containing missing values (geom_point).
# new data for binomial model
n <- 500
g <- round(runif(n, 0L, 1L), 0)
x1 <- runif(n,0,100)
x2 <- runif(n,0,100)
y <- (x2 - x1 + rnorm(n,sd=20)) < 0
fits3 <- tibble(g, y, x1, x2) %>%
nest_by(g) %>%
mutate(fit = list(glm(formula = y ~ x1 + x2, data = data, family="binomial")))
plot_models(fits3$fit)
Created on 2021-01-23 by the reprex package (v0.3.0)
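If the nesting itself is what gets in the way, the per-group models can also be collected in a plain named list with base R. This is only a sketch on the same kind of simulated data as above; the names `dat` and `fits_list` are made up here, and I have not checked the resulting list against plot_models.

```r
set.seed(1)

# Simulated data in the spirit of the toy example above
n <- 500
dat <- data.frame(
  g  = sample(c("group1", "group2"), n, replace = TRUE),
  x1 = runif(n, 0, 100),
  x2 = runif(n, 0, 100)
)
# The outcome varies within each group, so the logistic fit is well defined
dat$y <- (dat$x2 - dat$x1 + rnorm(n, sd = 20)) < 0

# One binomial GLM per group, returned as a named list of model objects
fits_list <- lapply(split(dat, dat$g), function(d) {
  glm(y ~ x1 + x2, data = d, family = binomial)
})

# Coefficients side by side: one column per group
coef_table <- sapply(fits_list, coef)
print(round(coef_table, 3))
```

Here y is TRUE mostly when x1 exceeds x2, so the x1 coefficient comes out positive and the x2 coefficient negative in both groups.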
I have a simple data.frame where I want to compute some summary statistics on a rolling basis. For example, a rolling median over a window of five observations (2 lags, current one and 2 ahead) is achieved by
library(dplyr)
library(zoo) # provides rollapply()
x <- data.frame("vals" = rnorm(3e04))
y <- x %>%
mutate(med5 = rollapply(data = vals,
width = 5,
FUN = median,
align = "center",
fill = NA,
na.rm = TRUE))
However, I would like to exclude the current row from this computation. I found the following approach:
z <- x %>%
mutate(N=1:n()) %>%
do(data.frame(., prmed = sapply(.$N, function(i) median(.$vals[.$N %in% c((i - 2):(i - 1), (i + 1):(i + 2))]))))
This does what I want, if I subsequently set the first two values to NA.
So far so good, the only problem is that the latter approach is terribly slow compared to rollapply.
Is there a way to achieve the outcome of the latter with the speed of the former?
A solution based on excluding the third number of the five, which is the current row of the calculation.
library(dplyr)
library(zoo)
set.seed(124)
x <- data.frame("vals" = rnorm(3e04))
y <- x %>%
mutate(med5 = rollapply(data = vals,
width = 5,
FUN = function(x) median(x[-3], na.rm = TRUE),
align = "center",
fill = NA))
head(y)
# vals med5
# 1 -1.38507062 NA
# 2 0.03832318 NA
# 3 -0.76303016 0.1253147
# 4 0.21230614 0.3914015
# 5 1.42553797 0.4562678
# 6 0.74447982 0.4562678
The width= argument of rollapply can be a one element list containing a vector of offsets.
y <- x %>%
mutate(med5 = rollapply(data = vals,
width = list(c(-2, -1, 1, 2)),
FUN = median,
na.rm = TRUE,
fill = NA))
Note that align = "center" is the default and so does not have to be specified. In addition, if we use offsets then align= is ignored. For safety, TRUE should be written out in full since T can also be a variable name.
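To see what the offsets c(-2, -1, 1, 2) select, here is a hand-rolled base R check on a tiny vector. It does not use zoo and is far too slow for real data; it is only meant to show which neighbours go into each median, assuming positions without a full set of neighbours are left as NA.

```r
vals    <- c(10, 1, 7, 3, 8, 2, 9)
offsets <- c(-2, -1, 1, 2)   # two rows back, two rows ahead, current row skipped
n       <- length(vals)

neighbour_median <- vapply(seq_len(n), function(i) {
  idx <- i + offsets
  # Incomplete windows at the edges are left as NA
  if (any(idx < 1 | idx > n)) return(NA_real_)
  median(vals[idx])
}, numeric(1))

print(neighbour_median)
# [1]  NA  NA 5.5 4.5 5.0  NA  NA
```

For position 3 the neighbours are 10, 1, 3 and 8, so the median is 5.5; the value 7 in the current row plays no part.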
My dataset has about 20 columns and I would like to create 7 new columns with lagged data for each of the 20 current columns.
For example, I have columns x, y, and z. I would like to create columns xlag1, xlag2, xlag3, xlag4, xlag5, xlag6, xlag7, ylag1, ylag2, etc.
My current attempt is with dplyr in R -
aq %>% mutate(.,
xlag1 = lag(x, 1),
xlag2 = lag(x, 2),
xlag3 = lag(x, 3),
xlag4 = lag(x, 4),
xlag5 = lag(x, 5),
xlag6 = lag(x, 6),
xlag7 = lag(x, 7),
)
As you can see, it will take a lot of lines of code to cover all 20 columns. Is there a more efficient way of doing this? If possible in dplyr, as that is the package I'm most familiar with.
We can use data.table. The shift from data.table can take a sequence of 'n'.
library(data.table)
setDT(aq)[, paste0('xlag', 1:7) := shift(x, 1:7)]
If there are multiple columns,
setDT(aq)[, paste0(rep(c("xlag", "ylag"), each = 7), 1:7) :=
c(shift(x, 1:7), shift(y, 1:7))]
If we have many columns, then specify the columns in .SDcols, loop through them with lapply, get the shift for each, flatten the result with unlist(recursive = FALSE), and assign it to the new columns.
setDT(aq)[, paste0(rep(c("xlag", "ylag"), each = 7), 1:7) :=
unlist(lapply(.SD, shift, n = 1:7), recursive = FALSE) , .SDcols = x:y]
We can also use shift (still from data.table) inside a dplyr pipeline
library(dplyr)
aq %>%
do(setNames(data.frame(., shift(.$x, 1:7)), c(names(aq), paste0('xlag', 1:7))))
and for multiple columns
aq %>%
do(setNames(data.frame(., shift(.$x, 1:7), shift(.$y, 1:7)),
c(names(aq), paste0(rep(c("xlag", "ylag"), each = 7), 1:7) )))
data
aq <- data.frame(x = 1:20, y = 21:40)
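For comparison, the same lag columns can be built without any packages. This is a minimal base R sketch; `lag_vec` is a hypothetical helper written for this example, not a function from dplyr or data.table.

```r
aq <- data.frame(x = 1:20, y = 21:40)

# Hypothetical helper: shift a vector down by k positions, padding with NA
lag_vec <- function(v, k) c(rep(NA, k), head(v, -k))

# Loop over the columns to lag and over the lag lengths;
# new columns are named <column>lag<k>, e.g. xlag1 ... ylag7
for (col in c("x", "y")) {
  for (k in 1:7) {
    aq[[paste0(col, "lag", k)]] <- lag_vec(aq[[col]], k)
  }
}

print(head(aq[, c("x", "xlag1", "xlag2", "y", "ylag1")], 4))
```

To cover all 20 columns, the vector c("x", "y") would be replaced by names(aq) captured before the loop starts.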