"For" loop with column names as index - r

I would like to create a loop in which the index is given by the column names of a dataframe. The idea is to select one column at a time and create a map based on the data in that column. I need i to be the column name, as it identifies the variable and I'll use it as part of the title of the map. However, I do not seem to be able to associate my index i with the name of the column. My code goes as follows:
# random data
x <- rep(c("AT130", "DEA1A", "DEA2C", "SE125", "SE232"), 4)
y <- c(1, 1, 0, 0, 0, 1, 1, 1, 0, 1, 1, 0, 1, 1, 0 ,1, 0, 1, 0, 1)
z <- c(0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 ,0, 0, 0, 0, 0)
w <- c(0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1 ,0, 1, 0, 1, 0)
d <- as.data.frame(cbind(x,y,z,w))
colnames(d) <- c("id", "typeA", "typeB", "typeC")
for (i in colnames(d[,2:ncol(d)])) {
var_to_map <- d[,c(1,i)]
## do stuff
}
I get the following error at the first line:
Error: Can't subset columns that don't exist.
x Columns `1`, `2`, and `3` don't exist.
Run `rlang::last_error()` to see where the error occurred.
However, if I just run colnames(d[,2:ncol(d)]), it works properly
colnames(d[,2:ncol(d)])
[1] "typeA" "typeB" "typeC"
I could find a workaround by using columns numbers to make it work, but I would like to keep the column names since I am printing (10+) maps within the loop and I am using i to insert the title of the map each, as follows:
# I use geodata files from the library `Eurostat`.
geodata <- get_eurostat_geospatial(resolution = "60", nuts_level = "3", year = 2013)
for (i in colnames(d[,2:ncol(d)])) {
var_to_map <- d[,c(1,i)]
colnames(var_to_map)[1] <- "geo"
# Joining, by = "geo"
map_data <- merge(var_to_map, geodata, by=c("geo"), all.y=T, all.x=T)
## creating ranges
map_data$cat <- with(map_data, cut(value,
breaks= qu <- unique(quantile(value,
probs=c(0, 0.2, 0.5, 0.8,
0.9, 0.95, 0.99, 1),
na.rm=TRUE, include.lowest=T )),
labels=qu[-1]),include.lowest=TRUE )
# Map
print(ggplot(data=map_data) + geom_sf(aes(fill=cat), size=.1) +
scale_fill_brewer(palette = "Darkred", na.value= "grey") + aes(geometry = geometry) +
guides(fill = guide_legend(reverse=T, title = "Percentiles")) +
labs(title = paste("The name of this graph is the column name", i) ## here is where I use the index
)+
theme_minimal() + theme(legend.position=c(.8,.6)) +
coord_sf(xlim=c(-12,44), ylim=c(35,70)) +
theme( axis.text.x=element_blank(), axis.text.y=element_blank()))
}
I could also use column numbers for i and create another object with column names to refer to when pasting the title of the map, but I am wondering why the above approach fails and what I could do to make it work in that setting.

In base R, you can select columns either by position or by name, but you can't combine both in one index vector: c(1, i) coerces the number to a character string, so R looks for a column literally named "1". If you use dplyr::select, you can select columns by name and position in the same command.
So here are your options:
cols <- colnames(d)
for (i in cols[-1]) {
#Select columns by position
var_to_map <- d[,c(1,match(i, cols))]
#OR select column by name
var_to_map <- d[,c(cols[1],i)]
#OR select column by position and name
#(all_of() marks i as an external character vector of column names)
var_to_map <- dplyr::select(d, 1, dplyr::all_of(i))
#...rest of the code
}
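To see why the original d[, c(1, i)] fails, it may help to look at what c() does when it mixes a number and a string. The following is a minimal base R sketch; the data frame toy and its values are made up for illustration and are not from the question.

```r
# Toy data frame standing in for `d` (hypothetical values).
toy <- data.frame(id = c("AT130", "DEA1A"), typeA = c(1, 0), typeB = c(0, 1))

i <- "typeA"

# Mixing a number and a string coerces everything to character,
# so the index asks for columns named "1" and "typeA".
mixed <- c(1, i)

# Selecting purely by name works.
by_name <- toy[, c(colnames(toy)[1], i)]

# Selecting purely by position works as well.
by_pos <- toy[, c(1, match(i, colnames(toy)))]
```

Both by_name and by_pos hold the id column plus the column named by i, so the loop index can still be used directly in the plot title.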

There's a lot going on in this question, but perhaps this minimal example will help:
library(tidyverse)
# random data
d <- data.frame(x = rep(c("AT130", "DEA1A", "DEA2C", "SE125", "SE232"), 4),
y = sample(1:10, 20, replace = TRUE),
z = sample(1:10, 20, replace = TRUE),
w = sample(1:10, 20, replace = TRUE))
colnames(d) <- c("id", "typeA", "typeB", "typeC")
for (i in colnames(d[,2:ncol(d)])) {
type <- sym(i)  # sym() turns the column-name string into a symbol
p <- ggplot(d, aes(y = !!type, x = id, fill = id)) +
geom_boxplot() +
ggtitle(i)
print(p)
}

Related

R: How to transform set-membership indicator line item data into single summary row of all set combinations?

I have line items indicating which groups my customers are members of.
cols <- c("CustomerName", "Magazines", "Books", "Emails")
df <- data.frame(matrix(ncol = length(cols), nrow=0))
colnames(df) <- cols
df[nrow(df) + 1,] <- c("Alice", 1, 0, 1)
df[nrow(df) + 1,] <- c("Bob", 0, 1, 1)
df[nrow(df) + 1,] <- c("Chris", 1, 1, 1)
df[nrow(df) + 1,] <- c("Darcy", 0, 1, 1)
How do I summarize data of this shape into a single summary row with columns & counts for each possible group-combination?
Desired output:
df_DesiredOutput <- c("Books" = 0, "Magazines" = 0, "Emails" = 0, "BooksMagazines" = 0, "BooksEmails" = 2, "MagazinesEmails" = 1, "BooksMagazinesEmails" = 1)
The transformation should be agnostic to the number of products as well as their actual product names.
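One possible way to approach this in base R is sketched below. It assumes the indicator columns are numeric 0/1 (the row-by-row construction above would store them as character), that every column except CustomerName is a product, and that each customer is counted only under the exact set of products they belong to. The names products, combos, membership and summary_row are illustrative.

```r
# Toy data with numeric 0/1 indicators (an assumption, see above).
df <- data.frame(
  CustomerName = c("Alice", "Bob", "Chris", "Darcy"),
  Magazines = c(1, 0, 1, 0),
  Books     = c(0, 1, 1, 1),
  Emails    = c(1, 1, 1, 1)
)

products <- setdiff(names(df), "CustomerName")

# All non-empty product combinations, e.g. "Magazines", "MagazinesBooks", ...
combos <- unlist(
  lapply(seq_along(products), function(k) {
    combn(products, k, FUN = paste, collapse = "")
  })
)

# Label each customer with the exact set of products they belong to.
membership <- apply(df[products] == 1, 1, function(row) {
  paste(products[row], collapse = "")
})

# Count customers per combination, keeping combinations with zero members.
counts <- table(factor(membership, levels = combos))
summary_row <- setNames(as.integer(counts), names(counts))
```

The combination names follow the column order of the data frame (for example "MagazinesBooks" rather than "BooksMagazines"), so the labels may need reordering to match a specific naming scheme.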

When plotting a correlation matrix with pairs(), how to display long column names in many lines?

After drawing the correlation matrix in R, I try to read it in binary form in an external program.
However, each column name is too long, so in the correlation matrix drawn in pairs(), parts of both sides of the column names are cut off.
If the column name is this long, is there a way to make these column names appear in multiple lines?
And can we increase the fontsize of the column names to increase readability?
This is the sample code.
In this case, for example, I want the column
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
to look like
AAAAAAAAAAAAAAAAAAA
AAAAAAAAAAAAAAAAAAA
in two lines.
a <- runif(100, 0, 30) + rnorm(100, 5, 1) + 10
b <- runif(100, 0, 30) + rnorm(100, 5, 1) + 10
c <- runif(100, 0, 30) + rnorm(100, 5, 1) + 10
d <- runif(100, 0, 30) + rnorm(100, 5, 1) + 10
e <- runif(100, 0, 30) + rnorm(100, 5, 1) + 10
f <- runif(100, 0, 30) + rnorm(100, 5, 1) + 10
g <- runif(100, 0, 30) + rnorm(100, 5, 1) + 10
h <- runif(100, 0, 30) + rnorm(100, 5, 1) + 10
i <- runif(100, 0, 30) + rnorm(100, 5, 1) + 10
df <- data.frame(AAAAAAAAAAAAAAAAAAAAAAAAAAA = a,
BBBBBBBBBBBBBBBBBBBBBBBBBBB = b,
CCCCCCCCCCCCCCCCCCCCCCCCCCC = c,
DDDDDDDDDDDDDDDDDDDDDDDDDDD = d,
EEEEEEEEEEEEEEEEEEEEEEEEEEE = e,
FFFFFFFFFFFFFFFFFFFFFFFFFFF = f,
GGGGGGGGGGGGGGGGGGGGGGGGGGG = g,
HHHHHHHHHHHHHHHHHHHHHHHHHHH = h,
IIIIIIIIIIIIIIIIIIIIIIIIIII = i)
pairs(df,
lower.panel = NULL,
upper.panel = function(x, y){
points(x,y,pch=20)
r <- round(cor(x, y, use = "complete.obs"), digits=2)
txt <- paste0("R = ", r)
usr <- par("usr"); on.exit(par(usr = usr))
par(usr = c(0, 1, 0, 1))
text(1, 0.95, txt, col="red", pos=2, cex=1.0)
}
)
I would begin by using the function stringi::stri_extract_all on a long label to break it down into chunks of at most ten characters:
longlabel <- "AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA"
small_chunks <- stringi::stri_extract_all(longlabel,
regex = ".{1,10}")[[1]]
After that, you can use paste to get
betterlabel <- paste(small_chunks, collapse = "\n")
And test that it works:
hist(rnorm(100), main = betterlabel)
On your example, you would need to wrap all that into an sapply to wrap the column names of df, like so:
colnames(df) <- sapply(stringi::stri_extract_all(colnames(df), regex = ".{1,10}"), paste, collapse = "\n")
to obtain the desired multi-line labels.
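If adding stringi as a dependency is not wanted, the same chunking can be sketched in base R. The helper wrap_label below is hypothetical (it is not part of any package), and the width of ten characters is simply carried over from the example above.

```r
# Hypothetical helper: break each string into chunks of at most `width`
# characters and join the chunks with newlines.
wrap_label <- function(x, width = 10) {
  vapply(x, function(s) {
    starts <- seq(1, max(nchar(s), 1), by = width)
    paste(substring(s, starts, starts + width - 1), collapse = "\n")
  }, character(1), USE.NAMES = FALSE)
}

wrapped <- wrap_label(c("AAAAAAAAAAAAAAAAAAAAAAAAAAA", "short"), width = 10)
```

The wrapped names could then be passed to the plot without renaming the data frame, for example pairs(df, labels = wrap_label(colnames(df)), cex.labels = 1.5), where cex.labels is the pairs() argument that controls the size of the diagonal labels; the value 1.5 is only a starting point.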

Different return values of the sum of a row with imputed values using 'complete' (mice) and 'update' (survey)

I need to calculate the sum of some variables with imputed values. I did this with complete --> as.mids --> with --> do.call
I needed to do the same thing but in a survey context. Therefore, I did: update --> with --> MIcombine
The means of the variables calculated both ways do not match. Which one is correct?
You may check this different behavior in this toy database:
library(tidyverse)
library(mice)
library(mitools)
library(survey)
mydata <- structure(list(dis1 = c(NA, NA, 1, 0, 0, 1, 1, 1, 1, 0),
dis2 = c(0, 1, 0, 1, NA, 1, 1, 1, 1, 0),
dis3 = c(1, 1, 0, 0, NA, 1, 1, 1, 1, 0),
sex = c(0,0,0,1,0,1,1,1,1,0),
clus = c(1,1,1,1,1,2,2,2,2,2)),
row.names = c(NA, 10L),
class = c("tbl_df", "tbl", "data.frame") )
imp <- mice::mice(mydata, m = 5, seed = 237856)
# calculating numenf with mice::complete
long <- mice::complete(imp, action = "long", include = TRUE)
long$numenf <- long$dis1 + long$dis2 + long$dis3
imp2 <- mice::as.mids(long)
res <- with(imp2, mean(numenf))
do.call(mean, res$analyses) # mean = 2.1
#calculating numenf with update (from survey)
imp1 <- mice::complete(imp)
imp2 <- mice::complete(imp, 2)
imp3 <- mice::complete(imp, 3)
imp4 <- mice::complete(imp, 4)
imp5 <- mice::complete(imp, 5)
listimp <- mitools::imputationList(list(imp1, imp2, imp3, imp4, imp5))
clus <- survey::svydesign(id = ~clus, data = listimp)
clus <- stats::update(clus, numenf = dis1 + dis2 + dis3)
res <- with(clus, survey::svymean(~numenf))
summary(mitools::MIcombine(res)) # mean = 1.98
Answer
Replace do.call(mean, res$analyses) with mean(unlist(res$analyses)).
Rationale
In the first code snippet, res$analyses is a list. When you pass it to do.call, each list element becomes a separate argument, so you are essentially calling:
mean(res$analyses[[1]], res$analyses[[2]], res$analyses[[3]], res$analyses[[4]], res$analyses[[5]])
mean averages only the vector given as its first argument. The remaining arguments are not treated as data: the second is matched to trim and the third to na.rm (see ?mean). Hence, you're just getting 2.1 back, since that is the (mean of the) value of the first analysis.
We can make a vector out of the list by using unlist(res$analyses). Then, we can just feed it to mean as an argument:
mean(unlist(res$analyses))
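The difference can be reproduced without mice. In this small sketch, the list analyses holds made-up numbers standing in for res$analyses.

```r
# Made-up per-imputation means standing in for res$analyses.
analyses <- list(2.1, 1.9, 2.0, 2.2, 1.8)

# Equivalent to mean(2.1, 1.9, 2.0, 2.2, 1.8): only 2.1 is treated as data.
wrong <- do.call(mean, analyses)

# Flatten the list first so that all five values are averaged.
right <- mean(unlist(analyses))
```

Here wrong is just the first element, while right is the average of all five values.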

Different colors of geom_point() based on subsets of dataframe

I am trying to produce a geom_violin() plot overlayed with a geom_point() plot, in which the geom_point() plot has different colors of the points based on which subset I have categorized the data into.
I have an error saying "Error in eval(expr, envir, enclos) : object 'ind' not found" when attempting to load the subset dataframe when I do it within the geom_point() function, but I don't understand what I am doing wrong from poking around or googling the error.
(Without that row, the code runs and generates this output, which is what I want other than the color coding of the points: PDF output when the second geom_point is commented out)
Here is the nonsense dataset I used to try and make this work (gene1,2,3 are rownames). I will transpose it in the code below:
,cell_1,cell_2,cell_3,cell_4,cell_5,cell_6,cell_7,cell_8,cell_9,cell_10,cell_11,cell_12,cell_13,cell_14,cell_15,cell_16,cell_17,cell_18,cell_19,cell_20,cell_21,cell_22,cell_23,cell_24,cell_25,cell_26,cell_27,cell_28,cell_29,cell_30,cell_31,cell_32,cell_33,cell_34,cell_35,cell_36,cell_37,cell_38,cell_39,cell_40,cell_41,cell_42,cell_43,cell_44,cell_45,cell_46,cell_47,cell_48,cell_49,cell_50
gene1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.19230,0.0,0.0,0.0,0.19230,0.0,0.0,0.0,69.3915,0.0,0.0,74.123,0,0,0,0,0,13.01,0.0,0.0,0.0,0.0,0.0,0.9231,73.023,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
gene2,0.279204,23.456,13.1,10.5,0.0,14.2,151,2,50.3201,0.0,0.0,128.0,0.0,0.0,0.0,9.74082,20.9432,0.0,0.0,300.023,20.0234,0.0,0.0,300.024,123,201.345,164.681,301.421,173.023,216.537,201.234,302.102,199.234,20.234,40.234,180.0234,0.0,23.234,190.134,170.023,0.0,8.023,40.234,180.0234,0.0,23.234,190.134,170.023,21.24,8.023
gene3,25.9954,77.3398,45.3092,107.508,0.266139,70.4924,114.17,291.324,198.525,190.353,185.381,0.14223,90.323,20.4332,29.012,500.391,2.51459,300.021,60.001,192.023,60.0234,300.022,60.002,192.024,34,500.392,2.51460,300.022,60.002,192.024,60.0235,300.023,60.003,192.025,60.002,192.024,34,500.392,2.51460,300.022,60.002,192.024,60.0235,300.023,60.003,192.025,35,194.231,94.13,32.124
gene4,46.1717,194.241,0.776565,3.0325,0.762981,2.3123,14.507,13.0234,0.538315,0.0,1.5234,11.2341,0.0,1.34819,6.0142,3.2341,4.4444,150.324,0.0,20.9432,134.023,150.325,0.0,20.9433,3.2341,4.4444,150.324,0.0,20.9432,134.023,170.13408,0.0,3.2341,4.4444,150.324,0.0,3.2341,6.7023,150.324,0.0,3.2341,4.4444,170.341,0.0,20.9432,134.023,150.325,0.0,50.234,3.123
gene5,94.2341,301.234,0.0,0.0,123.371,0.0,0.0,155.234,0.0,0.664744,0.0,402.616,222.148,0.0,0.0,0.0,169.234,0.0,10.234,0.0,0.0,0.0,0.99234,0.0,0.99234,0.0,0.0,0.0,0.99234,0.0,0.99234,0.0,0.0,0.0,0.99234,0.0,10.324,0.0,0.0,15.0234,43.1243,0.0,320.023,0.0,0.0,0.0,1.234,0.0,12.123,0.0
Here's the code I wrote:
#Load dataset
df_raw <- read.table("pretend_dataset.csv",
sep=",",
header=TRUE)
#Make gene names into rownames
rownames(df_raw) <- df_raw$Name
#Remove "Name" column
df_raw$Name <- NULL
#TRANSPOSE DATASET
matrix_transp <- t(df_raw)
#Make matrix_transp matrix into dataframe
df <- as.data.frame(as.matrix(matrix_transp))
#Subset gene1 positive and negatve cells
df.positive <- subset(df, gene1 > 0)
#Convert data in data frames to log scale
df.log <- log(df+1)
df.positive.log <- log(df.positive+1)
#Violin plot for each gene with all cells (positive and negative with color coded scatter)
plot <- ggplot(stack(df.log), aes(x = ind, y = values, fill=ind)) +
geom_violin() +
geom_point(position = position_jitterdodge(jitter.width=4)) +
geom_point(data=df.positive.log, aes(x = ind, y = values, fill=ind), position = position_jitterdodge(jitter.width=4), color="red") +
xlab("Gene") + ylab("Expression level (TPM log)") +
theme_classic(base_size = 14, base_family = "Helvetica") +
theme(axis.text.y=element_text(size=14)) +
theme(axis.title.y=element_text(size=14, face="bold")) +
theme(axis.text.x=element_text(size=14)) +
theme(axis.title.x=element_text(size=14, face="bold")) +
scale_fill_brewer(palette="Pastel1")
plot + coord_cartesian(ylim = c(0, 8))
Update:
This question was asked due to a fundamental misunderstanding regarding how data needs to be formatted to efficiently plot it in R.
The data needs to be reformatted into a long instead of a wide format, which can be done e.g. with gather as suggested below, but also with other methods listed in this question: Reshaping multiple sets of measurement columns (wide format) into single columns (long format)
The below answer overlays a coloured violin plot with a jittered set of points that are coloured by positive or negative.
library(dplyr); library(ggplot2); library(tidyr)
#read in data.
df2 <-read.csv(textConnection(df), header=TRUE, row.names = 1)
# Add in the rownames and gather the dataset
df3 <- df2 %>% mutate(Gene= rownames(.)) %>%
gather(., key= "cell", value="value", -Gene) %>%
mutate(positive = value>0, absolute= abs(value), logabs= log(absolute+1))
df3 %>% ggplot(. , aes(x = Gene, y=logabs, fill=Gene)) +
geom_violin() +geom_jitter( aes(colour= positive))
Is this what you were looking for?
EDIT: The read-in line takes the data you presented above, pasted in as a text string, and converts that string to a data frame. If you already have the data frame, it isn't necessary. It is only used because there was no dput() object available.
EDIT 2:
This extended answer results from comments on the previous answer. The solution uses a transposed matrix of the data shown in the question. The resulting plot has violin plots coloured by gene, overlaid with points coloured by whether that cell is positive for gene1.
The exact data set is shown below and is the result of calling the dput() command on the matrix.
df <- structure(c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0.1923, 0, 0, 0, 0.1923, 0, 0, 0, 69.3915, 0, 0, 74.123, 0, 0,
0, 0, 0, 13.01, 0, 0, 0, 0, 0, 0.9231, 73.023, 0, 0, 0, 0, 0,
0, 0, 0, 0.279204, 23.456, 13.1, 10.5, 0, 14.2, 151, 2, 50.3201,
0, 0, 128, 0, 0, 0, 9.74082, 20.9432, 0, 0, 300.023, 20.0234,
0, 0, 300.024, 123, 201.345, 164.681, 301.421, 173.023, 216.537,
201.234, 302.102, 199.234, 20.234, 40.234, 180.0234, 0, 23.234,
190.134, 170.023, 0, 8.023, 40.234, 180.0234, 0, 23.234, 190.134,
170.023, 21.24, 8.023, 25.9954, 77.3398, 45.3092, 107.508, 0.266139,
70.4924, 114.17, 291.324, 198.525, 190.353, 185.381, 0.14223,
90.323, 20.4332, 29.012, 500.391, 2.51459, 300.021, 60.001, 192.023,
60.0234, 300.022, 60.002, 192.024, 34, 500.392, 2.5146, 300.022,
60.002, 192.024, 60.0235, 300.023, 60.003, 192.025, 60.002, 192.024,
34, 500.392, 2.5146, 300.022, 60.002, 192.024, 60.0235, 300.023,
60.003, 192.025, 35, 194.231, 94.13, 32.124, 46.1717, 194.241,
0.776565, 3.0325, 0.762981, 2.3123, 14.507, 13.0234, 0.538315,
0, 1.5234, 11.2341, 0, 1.34819, 6.0142, 3.2341, 4.4444, 150.324,
0, 20.9432, 134.023, 150.325, 0, 20.9433, 3.2341, 4.4444, 150.324,
0, 20.9432, 134.023, 170.13408, 0, 3.2341, 4.4444, 150.324, 0,
3.2341, 6.7023, 150.324, 0, 3.2341, 4.4444, 170.341, 0, 20.9432,
134.023, 150.325, 0, 50.234, 3.123), .Dim = c(50L, 4L), .Dimnames = list(
c("cell_1", "cell_2", "cell_3", "cell_4", "cell_5", "cell_6",
"cell_7", "cell_8", "cell_9", "cell_10", "cell_11", "cell_12",
"cell_13", "cell_14", "cell_15", "cell_16", "cell_17", "cell_18",
"cell_19", "cell_20", "cell_21", "cell_22", "cell_23", "cell_24",
"cell_25", "cell_26", "cell_27", "cell_28", "cell_29", "cell_30",
"cell_31", "cell_32", "cell_33", "cell_34", "cell_35", "cell_36",
"cell_37", "cell_38", "cell_39", "cell_40", "cell_41", "cell_42",
"cell_43", "cell_44", "cell_45", "cell_46", "cell_47", "cell_48",
"cell_49", "cell_50"), c("gene1", "gene2", "gene3", "gene4"
)))
The code required to turn the above data set into the plot requested is shown below.
df2 <- df %>% as.data.frame %>% mutate(Cell= rownames(.), positive = gene1>0) %>%
gather(., key= "Gene", value="value", -Cell,-positive) %>%
mutate( absolute= abs(value), logabs= log(absolute+1))
df2 %>% ggplot(. , aes(x = Gene, y=logabs, fill=Gene)) +
geom_violin() +geom_jitter( aes(colour= positive))
As the plot might be difficult to interpret, here are two additional ways of displaying the status relative to gene1.
df2 %>% ggplot(., aes(x=Gene, y=logabs, fill=positive)) +geom_boxplot()
df2 %>% ggplot(. , aes(x = Gene, y=logabs, fill=positive)) +
geom_violin()
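The same wide-to-long reshaping can also be sketched in base R with stack(), which the question already uses. The data frame wide below is a small made-up example, not the dataset from the question.

```r
# Small made-up expression table: rows are cells, columns are genes.
wide <- data.frame(
  gene1 = c(0, 0.19, 69.4, 0),
  gene2 = c(0.28, 23.5, 13.1, 10.5),
  row.names = paste0("cell_", 1:4)
)

# Flag each cell by whether gene1 is expressed, before reshaping.
positive <- wide$gene1 > 0

# stack() turns the wide frame into two columns: values and ind (the gene).
long <- stack(log(wide + 1))

# Carry the cell name and the gene1 flag along for every gene.
long$cell <- rep(rownames(wide), times = ncol(wide))
long$positive <- rep(positive, times = ncol(wide))
```

Because ind, values and positive now live in one data frame, a single ggplot() call can use them for both the violin layer and a point layer with aes(colour = positive), which avoids the "object 'ind' not found" error that comes from handing an unstacked data frame to geom_point().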

R function to plot binned means and model fit, ggplot

Sample data:
pp.inc <- structure(list(has.di.rec.pp = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0), m.dist.km2 = c(-34.4150009155273, 6.80600023269653, -6.55499982833862,
-61.7700004577637, 15.6840000152588, -11.2869997024536, -26.9729995727539,
0, 81.9940032958984, -35.1459999084473, -12.5179996490479, 0,
21.5919990539551, 81.9940032958984, -20.7770004272461, 85.9469985961914,
-15.2959995269775, -75.5879974365234, 81.9940032958984, 3.04999995231628,
-17.1490001678467, -25.806999206543, -16.0060005187988, -14.91100025177,
-12.9020004272461, -16.0060005187988, 5.44000005722046, -34.4150009155273,
81.9940032958984, 3.61400008201599, 13.7379999160767, 2.71300005912781,
4.31300020217896), treated = c(0, 1, 0, 0, 1, 0, 0, 1, 1, 0,
0, 1, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1,
1, 1)), .Names = c("has.di.rec.pp", "m.dist.km2", "treated"), row.names = c(NA,
-33L), class = c("data.table", "data.frame"))
Code:
library(data.table)
library(ggplot2)
rddplot <- function(data, outcome, runvar, treatment = treated, span, bw, ...){
data <- data.table(data)
data.span <- data[abs(runvar) <= span, ]
data.span <- data.span[ , bins := cut(runvar,
seq(-span, span, by = bw),
include.lowest = TRUE, right = FALSE)]
data.span.plot <- data.span[ , list(avg.outcome = mean(outcome),
avg.runvar = mean(runvar),
treated = max(treatment),
n.iid = length(outcome)), keyby = bins]
data.span.plot <- data.span.plot[ , runvar := head(seq(-span, span, by = bw), -1)]
bp <- ggplot(data = data.span.plot, aes(x = runvar, y = avg.outcome))
bp <- bp + geom_point(aes(colour = n.iid))
bp <- bp + stat_smooth(data = data.span, aes(x = runvar, y = outcome,
group = factor(treatment)), ...)
bp
return(bp)
}
rddplot(pp.inc, has.di.rec.pp, m.dist.km2, treated, 50, 5)
This code runs perfectly if I do not wrap it in a function. I am a novice in R and only use it very infrequently. What am I doing wrong? Am I missing something obvious, or does it have to do with data.table or ggplot2? I thought it might be something with ggplot, as other questions mention there is an issue and that aes_string should be used. I can rewrite the data.table parts to use base functions, but I think the error already occurs before that, on the second line. How do I make this work?
EDIT:
[Original title:
R function returns Error in eval(expr, envir, enclos) : object 'name' not found]
I had some time to look at this again and have worked out a solution, hence I also modified the title a bit. Using eval() didn't really work out for me, so I went the [['columname']] selection route. I've ditched data.table (and plyr as well), so that this only uses base functions except for ggplot2. I am happy for any comments on how to improve it. Please let me know if there are some essential flaws. If not I will add an answer with my solution later.
I have changed the bin calculation so that there is always a breakpoint at zero, which is necessary. The default bin width is determined by the Silverman rule. I am thinking of calculating the model fit separately and returning it, as the model choice within ggplot is limited. However, I can't think of a nice way to incorporate this for a variety of diverse models such as lm or loess, and it's not strictly necessary. I actually wanted to overlay a thin bar plot displaying the number of observations in each bin, but found out this is impossible in ggplot (I know this generally is a bad idea, but there are several well-published papers which use similar graphs). I don't find the size aesthetic too appealing here, but these are really minor gripes.
Thanks for getting me on the right path.
My solution:
rddplot <- function(data, outcome, runvar, treatment = "treated",
span, bw = bw.nrd0(data[[runvar]]), ...){
breaks <- c(sort(-seq(0, span, by = bw)[-1]), seq(0, span, by = bw))
data.span <- data[abs(data[[runvar]]) <= max(breaks), ]
data.span$bins <- cut(data.span[[runvar]], breaks,
include.lowest = TRUE, right = FALSE)
data.span.plot <- as.data.frame(cbind(tapply(data.span[[outcome]], data.span$bins, mean),
tapply(data.span[[runvar]], data.span$bins, mean),
tapply(data.span[[treatment]], data.span$bins, max),
tapply(data.span[[outcome]], data.span$bins, length),
tapply(data.span[[outcome]], data.span$bins, sum)))
colnames(data.span.plot) <- c("avg.outcome", "avg.runvar", "treated", "n.iid", "n.rec")
data.span.plot$runvar <- head(breaks, -1)
print(data.span.plot)
bp <- ggplot(data = data.span.plot, aes(x = runvar, y = avg.outcome))
bp <- bp + geom_point(aes(size = n.iid))
bp <- bp + stat_smooth(data = data.span, aes_string(x = runvar, y = outcome,
group = treatment), ...)
print(bp)
}
Call:
rddplot(pp.inc, "has.di.rec.pp", "m.dist.km2", "treated", 50,
method = lm, formula = y ~ poly(x, 4, raw = TRUE))
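The break construction used in the function can be checked on its own. This is a small sketch with made-up values for span, bw, runvar and outcome; it only illustrates how zero always ends up as a break point and how the bins are then summarised.

```r
span <- 50   # half-width of the window around the cutoff
bw   <- 20   # bin width (made-up value for illustration)

# Mirror the positive break points so that zero is always a break.
breaks <- c(sort(-seq(0, span, by = bw)[-1]), seq(0, span, by = bw))

# Made-up running variable and outcome.
runvar  <- c(-30, -5, 0, 10, 40)
outcome <- c(0, 1, 1, 0, 1)

# Left-closed bins, so an observation exactly at zero falls on the right side.
bins <- cut(runvar, breaks, include.lowest = TRUE, right = FALSE)

# Mean outcome per bin, computed with tapply() as in the function above.
bin_means <- tapply(outcome, bins, mean)
```

With these values the breaks run from -40 to 40 in steps of 20, so observations beyond the last full bin (here, between 40 and 50) would fall outside the binned range; that is why the function filters on max(breaks) rather than on span.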
I have an approach using data.table and some deparse(substitute()) and setnames trickery....
rddplot <- function(data, outcome, runvar, treatment = treated, span, bw, ...){
# convert to data.table
data <- data.table(data)
# get the column names as defined in the call to rddplot
outname <- deparse(substitute(outcome))
runname <- deparse(substitute(runvar))
treatname <- deparse(substitute(treatment))
# rename these columns with the argument names
setnames(data, old = c(outname,runname,treatname), new = c('outcome','runvar', 'treatment'))
# breaks as defined in the second example
breaks <- c(sort(-seq(0, span, by = bw)[-1]), seq(0, span, by = bw))
# the stuff you were doing before
data.span <- data[abs(runvar) <= span, ]
data.span <- data.span[ , bins := cut(runvar,
breaks,
include.lowest = TRUE, right = FALSE)]
data.span.plot <- data.span[ , list(avg.outcome = mean(outcome),
avg.runvar = mean(runvar),
treated = max(treatment),
n.iid = length(outcome)), keyby = bins]
# note I've removed the attempt to add a `runvar` column to data.span.plot
bp <- ggplot(data = data.span.plot, aes(x = avg.runvar, y = avg.outcome))
bp <- bp + geom_point(aes(colour = n.iid))
bp <- bp + stat_smooth(data = data.span, aes(x = runvar, y = outcome,
group = treatment), ...)
bp
}
rddplot(pp.inc, has.di.rec.pp, m.dist.km2, treated, 50, 5)
Note that if you didn't convert to data.table within the function, and assumed the data argument was a data.table, then you could use on.exit() to revert the names changed by reference.
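The two ways of passing a column into a function that appear in this thread (a bare name captured with deparse(substitute()) and a string used with [[ ]]) can be compared in a stripped-down sketch. The function names col_mean_bare and col_mean_chr and the data frame toy are made up for illustration.

```r
# Variant 1: the caller passes a bare column name; substitute() captures it
# unevaluated and deparse() turns it into a string.
col_mean_bare <- function(data, column) {
  name <- deparse(substitute(column))
  mean(data[[name]])
}

# Variant 2: the caller passes the column name as a string.
col_mean_chr <- function(data, column) {
  mean(data[[column]])
}

toy <- data.frame(m.dist = c(-10, 0, 25))

a <- col_mean_bare(toy, m.dist)    # bare name, never evaluated directly
b <- col_mean_chr(toy, "m.dist")   # string
```

Both calls return the same value. Without substitute(), a bare m.dist would be evaluated as an ordinary variable and R would report that the object is not found, which matches the error in the original title.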
