R: Split metricsgraphics histogram by factor - r

I have a data frame that looks sort of like the following:
'data.frame': 400 obs. of 4 variables:
$ admit: Factor w/ 2 levels "rejected","accepted": 1 2 2 2 1 2 2 1 2 1 ...
$ gpa : num 3.61 3.67 4 3.19 2.93 3 2.98 3.08 3.39 3.92 ...
Now I would like to turn this into a histogram of GPA using the metricsgraphics package, but split the data by the factor 'admit'. How is this done?
Using ggplot I can do something like the following:
ggplot(data, aes(gpa)) +
geom_histogram(aes(fill=admit, y=..density..),
position="dodge",
binwidth=0.1
)
but I'm looking at how to specifically do so using metricsgraphics.
I currently have
mjs_plot(data, x = gpa) %>%
mjs_histogram(bins = 80)
but of course this doesn't split by the factor.

I think you'll have to produce each plot and arrange it into a grid. From the package vignette:
moar_plots <- lapply(1:7, function(x) {
mjs_plot(rbeta(10000, x, x), width="250px", height="250px", linked=TRUE) %>%
mjs_histogram(bar_margin=2) %>%
mjs_labs(x_label=sprintf("Plot %d", x))
})
mjs_grid(moar_plots, nrow=4, ncol=3, widths=c(rep(0.33, 3)))
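Applied to the question's data, the same grid approach might look like the sketch below. It assumes a data frame shaped like the one in the question (simulated here, so the values are made up) and uses only the metricsgraphics calls that appear in the question and the vignette excerpt. The plotting part runs only if the package is installed.

```r
# Simulated stand-in for the question's data (values are hypothetical).
set.seed(1)
data <- data.frame(
  admit = factor(sample(c("rejected", "accepted"), 400, replace = TRUE),
                 levels = c("rejected", "accepted")),
  gpa   = round(runif(400, min = 2, max = 4), 2)
)

# One numeric vector of GPAs per level of the factor.
gpa_by_admit <- split(data$gpa, data$admit)

if (requireNamespace("metricsgraphics", quietly = TRUE)) {
  library(metricsgraphics)

  # Build one histogram per level, labelling each with its level name.
  admit_plots <- lapply(names(gpa_by_admit), function(level) {
    p <- mjs_plot(gpa_by_admit[[level]], width = "300px", height = "300px",
                  linked = TRUE)
    p <- mjs_histogram(p, bins = 20)
    mjs_labs(p, x_label = sprintf("GPA (%s)", level))
  })

  # Arrange the two histograms side by side.
  mjs_grid(admit_plots, nrow = 1, ncol = 2, widths = c(0.5, 0.5))
}
```

This gives one panel per level rather than the dodged bars of the ggplot version, so the two groups are compared side by side instead of within a single set of axes.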

Related

How can I write a dataframe to a csv after running scale() in R?

I'm scaling one column in a dataset with the intention of fitting a linear model. However, when I try to write the dataframe (with scaled column) to a csv, it doesn't work because the scaled column became complex with center and scale attributes.
Can someone please indicate how to convert the scaled column to something that can write to a csv? (and maybe why scale() needs to do it this way.)
# make a data frame
testDF <- data.frame(x1 = c(1,2,2,3,2,4,4,5,6,15,36,42,11,12,23,24,25,66,77,18,9),
x2 = c(1,4,5,9,4,15,17,25,35,200,1297,1764,120,150,500,500,640,4200,6000,365,78))
# scale the x1 attribute
testDF <- testDF %>%
mutate(x1_scaled = scale(x1, center = TRUE, scale = TRUE))
# write to csv doesn't work
write_csv(as.matrix(testDF), "testDF.csv")
# but plotting and lm do work
ggplot(testDF, aes(x1_scaled)) +
geom_histogram(aes(y = ..density..),binwidth = 1)
Lm_scaled <- lm(x2 ~ x1_scaled, data = testDF)
plot(Lm_scaled)
scale() returns a one-column matrix. We can extract the column or use as.vector() to remove the dim attribute:
testDF <- testDF %>%
mutate(x1_scaled = as.vector(scale(x1, center = TRUE, scale = TRUE)))
Check the structure of the output without as.vector and with as.vector
> testDF %>%
+ mutate(x1_scaled = scale(x1, center = TRUE, scale = TRUE)) %>% str
'data.frame': 21 obs. of 3 variables:
$ x1 : num 1 2 2 3 2 4 4 5 6 15 ...
$ x2 : num 1 4 5 9 4 15 17 25 35 200 ...
$ x1_scaled: num [1:21, 1] -0.824 -0.776 -0.776 -0.729 -0.776 ...
..- attr(*, "scaled:center")= num 18.4
..- attr(*, "scaled:scale")= num 21.2
> testDF %>%
+ mutate(x1_scaled = as.vector(scale(x1, center = TRUE, scale = TRUE))) %>% str
'data.frame': 21 obs. of 3 variables:
$ x1 : num 1 2 2 3 2 4 4 5 6 15 ...
$ x2 : num 1 4 5 9 4 15 17 25 35 200 ...
$ x1_scaled: num -0.824 -0.776 -0.776 -0.729 -0.776 ...
You can simply convert the scaled column to numeric in base R and write out the data frame:
testDF$x1_scaled <- as.numeric(testDF$x1_scaled)
write_csv(testDF, "testDF.csv")
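As for why scale() behaves this way: it is written to work on the columns of a matrix, so even a plain vector comes back as a one-column matrix carrying the centering and scaling constants as attributes. The sketch below uses a small made-up vector and base R's write.csv (rather than readr's write_csv) to show the difference and a round trip through a CSV file.

```r
x <- c(1, 2, 3, 4, 5)

scaled_matrix <- scale(x)                   # one-column matrix with attributes
scaled_vector <- as.vector(scaled_matrix)   # plain numeric vector

is.matrix(scaled_matrix)   # TRUE
is.matrix(scaled_vector)   # FALSE

# The centering and scaling constants live in attributes, so they can be
# saved before being dropped if the scaling needs to be undone later.
x_center <- attr(scaled_matrix, "scaled:center")
x_scale  <- attr(scaled_matrix, "scaled:scale")

df <- data.frame(x = x, x_scaled = scaled_vector)
csv_path <- tempfile(fileext = ".csv")
write.csv(df, csv_path, row.names = FALSE)

# Reading the file back and undoing the scaling recovers the original values.
round_trip <- read.csv(csv_path)
all.equal(round_trip$x_scaled * x_scale + x_center, x)   # TRUE
```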

Plot a Dataframe on a basic R chart

I think it's a basic question, but I didn't see any answers on SO that solve this.
So, I would like to put each line of this data frame on a chart, with the respective column and row names on the x and y axes.
> indicador.transposta
31-12-2017.pdf 31-12-2018.pdf 31-12-2019.pdf
Liq..Imed. 0.260650162167045 0.278000595266861 0.100940099971038
Liq..Corr. 1.20707183817692 1.07611507200346 0.775547123687795
Liq..Seca 1.01127035774033 0.87978786315216 0.616295652990034
Liq..Geral 1.38911863440832 1.20904526839338 1.22121514777491
Endividamento 0.719880919620619 0.827098890456626 0.818856531399918
Retorno.s..Invest. -0.0281507369406506 -0.110425824989136 0.00682789312217763
Retorno.s..PL -0.100495606734552 -0.638664640618945 0.037693289053948
Margem.Líquida -0.0440458341645613 -0.181853784203517 0.0103531380484155
The structure is:
> str(indicador.transposta)
'data.frame': 8 obs. of 3 variables:
$ 31-12-2017.pdf: chr "0.260650162167045" "1.20707183817692" "1.01127035774033" "1.38911863440832" ...
$ 31-12-2018.pdf: chr "0.278000595266861" "1.07611507200346" "0.87978786315216" "1.20904526839338" ...
$ 31-12-2019.pdf: chr "0.100940099971038" "0.775547123687795" "0.616295652990034" "1.22121514777491" ...
Regards.
Base R:
plot(NA, type = "n", xlim = c(1, nrow(dat)), xlab = "", ylim = range(unlist(dat)), ylab = "")
for (y in dat) lines(y)
But dealing with different lines and then optionally coloring and such, it might be easier in the long run to shift to ggplot2. That graphic engine prefers its data in a "long" format, so we'll use tidyr::pivot_longer to reshape it:
library(tidyr) # pivot_longer
library(ggplot2)
pivot_longer(dat, -x)
# # A tibble: 24 x 3
# x name value
# * <int> <chr> <dbl>
# 1 1 X31.12.2017.pdf 0.261
# 2 1 X31.12.2018.pdf 0.278
# 3 1 X31.12.2019.pdf 0.101
# 4 2 X31.12.2017.pdf 1.21
# 5 2 X31.12.2018.pdf 1.08
# 6 2 X31.12.2019.pdf 0.776
# 7 3 X31.12.2017.pdf 1.01
# 8 3 X31.12.2018.pdf 0.880
# 9 3 X31.12.2019.pdf 0.616
# 10 4 X31.12.2017.pdf 1.39
# # ... with 14 more rows
ggplot(pivot_longer(dat, -x), aes(x, value, color = name, group = name)) +
geom_line()
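The code above uses a data frame called dat with numeric columns, syntactic column names and an index column x, none of which is shown being created. One way to get there from the question's character-typed data frame is sketched below. The stand-in data (three rows, rounded values) and the choice of a simple row index for x are assumptions made for illustration.

```r
# Small stand-in for indicador.transposta (first three rows, values rounded).
indicador.transposta <- data.frame(
  "31-12-2017.pdf" = c("0.261", "1.207", "1.011"),
  "31-12-2018.pdf" = c("0.278", "1.076", "0.880"),
  "31-12-2019.pdf" = c("0.101", "0.776", "0.616"),
  row.names = c("Liq..Imed.", "Liq..Corr.", "Liq..Seca"),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

dat <- indicador.transposta
dat[] <- lapply(dat, as.numeric)            # chr -> num, keeps names and row names
names(dat) <- make.names(names(dat))        # "31-12-2017.pdf" -> "X31.12.2017.pdf"
dat <- cbind(x = seq_len(nrow(dat)), dat)   # row index used as the x variable

str(dat)
```

For the base R loop, the x column would need to be left out (for example by looping over dat[-1]), since otherwise the index itself is drawn as a line.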

Formula notation for scatterplot producing unexpected results

I am working on a map, where the color of each point is proportional to one response variable, and the size of the point is proportional to another. I've noticed that when I try to plot the points using formula notation things go haywire, while default notation performs as expected. I have used formula notation to plot maps many times before, and thought that the notations were nearly interchangeable. Why would these produce different results? I have read through the plot.formula and plot.default documentation and haven't been able to figure it out. Based on this I am wondering if it has to do with the columns of dat being coerced to factors, but I'm not sure why that would be happening. Any ideas?
Consider the following example data frame, dat:
latitude <- c(runif(10, min = 45, max = 48))
latitude[9] <- NA
longitude <- c(runif(10, min = -124.5, max = -122.5))
longitude[9] <- NA
color <- c("#00FFCCCC", "#99FF00CC", "#FF0000CC", "#3300FFCC", "#00FFCCCC",
"#00FFCCCC", "#3300FFCC", "#00FFCCCC", NA, "#3300FFCC")
size <- c(4.916667, 5.750000, 7.000000, 2.000000, 5.750000,
4.500000, 2.000000, 4.500000, NA, 2.000000)
dat <- as.data.frame(cbind(longitude, latitude, color, size))
Plotting according to formula notation
plot(latitude ~ longitude, data = dat, type = "p", pch = 21, col = 1, bg = color, cex = size)
produces
this mess and the following error: graphical parameter "type" is obsolete.
Plotting according to the default notation
plot(longitude, latitude, type = "p", pch = 21, col = 1, bg = color, cex = size)
works as expected, though with the same error.
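There are a couple of problems with this. First, your use of cbind turns the data into a matrix, albeit temporarily, which converts your numbers to character (and as.data.frame then turns them into factors). The default-notation call only appears to work because it uses the original numeric vectors in your workspace, not the columns of dat. See: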
dat <- as.data.frame(cbind(longitude, latitude, color, size))
str(dat)
# 'data.frame': 10 obs. of 4 variables:
# $ longitude: Factor w/ 9 levels "-122.855375511572",..: 6 8 9 1 4 3 2 7 NA 5
# $ latitude : Factor w/ 9 levels "45.5418886151165",..: 6 2 4 1 3 7 5 9 NA 8
# $ color : Factor w/ 4 levels "#00FFCCCC","#3300FFCC",..: 1 3 4 2 1 1 2 1 NA 2
# $ size : Factor w/ 5 levels "2","4.5","4.916667",..: 3 4 5 1 4 2 1 2 NA 1
If instead you just use data.frame, you'll get:
dat <- data.frame(longitude, latitude, color, size)
str(dat)
# 'data.frame': 10 obs. of 4 variables:
# $ longitude: num -124 -124 -124 -123 -124 ...
# $ latitude : num 47.3 45.9 46.3 45.5 46 ...
# $ color : Factor w/ 4 levels "#00FFCCCC","#3300FFCC",..: 1 3 4 2 1 1 2 1 NA 2
# $ size : num 4.92 5.75 7 2 5.75 ...
plot(latitude ~ longitude, data = dat, pch = 21, col = 1, bg = color, cex = size)
But now the colors are all wrong. The likely cause is that your $color column is a factor, which is interpreted internally as integer codes. Try stringsAsFactors = FALSE:
dat <- data.frame(longitude, latitude, color, size, stringsAsFactors=FALSE)
str(dat)
# 'data.frame': 10 obs. of 4 variables:
# $ longitude: num -124 -124 -124 -123 -124 ...
# $ latitude : num 47.3 45.9 46.3 45.5 46 ...
# $ color : chr "#00FFCCCC" "#99FF00CC" "#FF0000CC" "#3300FFCC" ...
# $ size : num 4.92 5.75 7 2 5.75 ...
plot(latitude ~ longitude, data = dat, pch = 21, col = 1, bg = color, cex = size)
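The coercion step can be checked in isolation. The sketch below uses shortened, made-up vectors to show that cbind on mixed numeric and character input yields a character matrix, while data.frame keeps each column's own type. Since R 4.0.0 the default is stringsAsFactors = FALSE, so on current versions the matrix route gives character columns rather than factors, but the numbers are still not numeric.

```r
longitude <- c(-123.1, -124.0, -122.9)
color     <- c("#00FFCCCC", "#99FF00CC", "#FF0000CC")

# cbind builds a matrix, and a matrix can hold only one type.
m <- cbind(longitude, color)
typeof(m)                            # "character": the numbers were coerced

via_matrix <- as.data.frame(m)
is.numeric(via_matrix$longitude)     # FALSE

# data.frame keeps each vector's own type.
direct <- data.frame(longitude, color, stringsAsFactors = FALSE)
is.numeric(direct$longitude)         # TRUE
is.character(direct$color)           # TRUE
```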

How to retrieve name of element in list (data frame) to use it as a title of the plot?

So briefly and without further ado - is it possible to retrieve only a name of element in list and use it as a main title of plot?
Let me explain - example:
Let's create a random df:
a <- c(1,2,3,4)
b <- runif(4)
c <- runif(4)
d <- runif(4)
e <- runif(4)
f <- runif(4)
df <- data.frame(a,b,c,d,e,f)
head(df)
a b c d e f
1 1 0.9694204 0.9869154 0.5386678 0.39331278 0.15054698
2 2 0.8949330 0.9910894 0.1009689 0.03632476 0.15523628
3 3 0.4930752 0.7179144 0.6957262 0.36579883 0.32006026
4 4 0.4850141 0.5539939 0.3196953 0.14348259 0.05292068
Then I want to create a list of data frames (based on the one above) with specific columns to make a plot. In other words, I'd like to make plots where the first column of df (a) is the x axis and columns b, c, d, e and f represent the values on the y axis. Yes, there'll be 5 plots - that's the point!
So my idea was to write a simple function that creates a list of data frames based on the one created above, so:
my_fun <- function(x){
a <- df[1]
b <- x
aname <- "x_label"
bname <- "y_label"
df <- data.frame(a,b)
names(df) <- c(aname,bname)
return(df)
}
Run it for all (specified) columns:
df_s <- apply(df[,2:6], 2, function(x) my_fun(x))
So I have now:
class(df_s)
[1] "list"
str(df_s)
List of 5
$ b:'data.frame': 4 obs. of 2 variables:
..$ x_label: num [1:4] 1 2 3 4
..$ y_label: num [1:4] 0.969 0.895 0.493 0.485
$ c:'data.frame': 4 obs. of 2 variables:
..$ x_label: num [1:4] 1 2 3 4
..$ y_label: num [1:4] 0.987 0.991 0.718 0.554
$ d:'data.frame': 4 obs. of 2 variables:
..$ x_label: num [1:4] 1 2 3 4
..$ y_label: num [1:4] 0.539 0.101 0.696 0.32
$ e:'data.frame': 4 obs. of 2 variables:
..$ x_label: num [1:4] 1 2 3 4
..$ y_label: num [1:4] 0.3933 0.0363 0.3658 0.1435
$ f:'data.frame': 4 obs. of 2 variables:
..$ x_label: num [1:4] 1 2 3 4
..$ y_label: num [1:4] 0.1505 0.1552 0.3201 0.0529
That is what I wanted, but here's the question. I'd like to create a plot for every df in my list... As a result I want 5 plots with main titles b, c, d, e, f respectively. The axis labels are the same for every plot; only the title differs... So I tried:
lapply(df_s, function(x) plot(x[2] ~ x[1], data = x, main = ???))
What should go in place of the question marks? I tried main = names(df_s)[x]; however, it didn't work...
I think the following works. However, it might be better to use ggplot2 instead of the plot function (unless you are saving the plots inside lapply).
lapply(seq_along(df_s), function(i)
plot(df_s[[i]][, 2] ~ df_s[[i]][, 1],
xlab = names(df_s[[i]])[1],
ylab = names(df_s[[i]])[2],
main = names(df_s)[i]))
With ggplot2
plot_lst <- lapply(seq_along(df_s), function(i) {
ggplot(df_s[[i]], aes(x=x_label, y=y_label)) +
geom_point() +
theme(plot.title = element_text(hjust = 0.5)) +
ggtitle(names(df_s)[i]) })
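The attempt in the question fails because inside lapply(df_s, function(x) ...) the argument x is the data frame itself, not its position or name, so names(df_s)[x] has nothing sensible to index with. Besides looping over indices as above, one option is Map, which walks the list and its names in parallel. The sketch below rebuilds a smaller version of df_s, and the helper name plot_one is made up for illustration.

```r
# Smaller stand-in for the question's data.
df <- data.frame(a = 1:4, b = runif(4), c = runif(4))

# One two-column data frame per y column; the list is named "b", "c".
df_s <- lapply(df[-1], function(y) data.frame(x_label = df$a, y_label = y))

# Hypothetical helper: draws one plot and returns the title it used.
plot_one <- function(d, nm) {
  plot(y_label ~ x_label, data = d, main = nm)
  nm
}

# Map passes each data frame together with its name.
titles <- Map(plot_one, df_s, names(df_s))
```

The returned titles are only there so the result can be inspected; if just the plots are needed, the call can be wrapped in invisible().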

Add boxplots in the coplot() function

Consider the dataset "ToothGrowth" from the "datasets" package: a 60-row dataset with three variables: tooth length (len), supplement type (supp) and dose in milligrams per day (dose).
str(ToothGrowth)
## 'data.frame': 60 obs. of 3 variables:
## $ len : num 4.2 11.5 7.3 5.8 6.4 10 11.2 11.2 5.2 7 ...
## $ supp: Factor w/ 2 levels "OJ","VC": 2 2 2 2 2 2 2 2 2 2 ...
## $ dose: num 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 ...
I use the function coplot() to see the effect of the variable dose on the variable len for each factor of supp.
with(ToothGrowth, coplot(len ~ dose | supp))
How can I create the same plot with boxplots for len ~ dose, instead of having a point for each case?
Using the coplot() function from R base graphics would be preferable.
Try this:
library(lattice)
data("ToothGrowth")
ToothGrowth[,3]<-factor(ToothGrowth[,3])
#before
xyplot(len ~ dose | supp, data=ToothGrowth, layout=c(2,1))
#after
bwplot(len ~ dose | supp, data=ToothGrowth, layout=c(2,1))
The result is the following:
Edit:
If you want to only employ the R base package you can use the following.
coplot(len ~ dose | supp, data=ToothGrowth, xlim = c(0, 4),
panel = function(x, y, ...){boxplot(y ~ x, add=TRUE)})
Which yields:
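If coplot() itself is not strictly required, a similar layout can be put together with other base graphics functions. The sketch below is one such alternative, not the coplot-based approach from the answer: it splits the data by supp and draws one boxplot panel per level using par(mfrow = ...).

```r
# One data frame per supplement type ("OJ" and "VC").
groups <- split(ToothGrowth, ToothGrowth$supp)

op <- par(mfrow = c(1, length(groups)))   # one panel per supplement
for (s in names(groups)) {
  boxplot(len ~ dose, data = groups[[s]],
          main = paste("supp =", s), xlab = "dose", ylab = "len",
          ylim = range(ToothGrowth$len))   # shared y axis for comparison
}
par(op)   # restore the previous layout
```

Fixing ylim across panels keeps the two supplements visually comparable, which coplot does automatically.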
