Plot a Dataframe on a basic R chart - r

I think this is a basic question, but I didn't see any answers on SO that solve it.
So, I would like to plot each row of this data frame as a line on a chart, with the respective column and row names on the x and y axes.
> indicador.transposta
31-12-2017.pdf 31-12-2018.pdf 31-12-2019.pdf
Liq..Imed. 0.260650162167045 0.278000595266861 0.100940099971038
Liq..Corr. 1.20707183817692 1.07611507200346 0.775547123687795
Liq..Seca 1.01127035774033 0.87978786315216 0.616295652990034
Liq..Geral 1.38911863440832 1.20904526839338 1.22121514777491
Endividamento 0.719880919620619 0.827098890456626 0.818856531399918
Retorno.s..Invest. -0.0281507369406506 -0.110425824989136 0.00682789312217763
Retorno.s..PL -0.100495606734552 -0.638664640618945 0.037693289053948
Margem.Líquida -0.0440458341645613 -0.181853784203517 0.0103531380484155
The structure is:
> str(indicador.transposta)
'data.frame': 8 obs. of 3 variables:
$ 31-12-2017.pdf: chr "0.260650162167045" "1.20707183817692" "1.01127035774033" "1.38911863440832" ...
$ 31-12-2018.pdf: chr "0.278000595266861" "1.07611507200346" "0.87978786315216" "1.20904526839338" ...
$ 31-12-2019.pdf: chr "0.100940099971038" "0.775547123687795" "0.616295652990034" "1.22121514777491" ...
Regards.

Base R:
plot(NA, type = "n", xlim = c(1, nrow(dat)), xlab = "", ylim = range(unlist(dat)), ylab = "")
for (y in dat) lines(y)
But dealing with different lines and then optionally coloring and such, it might be easier in the long run to shift to ggplot2. That graphic engine prefers its data in a "long" format, so we'll use tidyr::pivot_longer to reshape it:
library(tidyr) # pivot_longer
library(ggplot2)
pivot_longer(dat, -x)
# # A tibble: 24 x 3
# x name value
# * <int> <chr> <dbl>
# 1 1 X31.12.2017.pdf 0.261
# 2 1 X31.12.2018.pdf 0.278
# 3 1 X31.12.2019.pdf 0.101
# 4 2 X31.12.2017.pdf 1.21
# 5 2 X31.12.2018.pdf 1.08
# 6 2 X31.12.2019.pdf 0.776
# 7 3 X31.12.2017.pdf 1.01
# 8 3 X31.12.2018.pdf 0.880
# 9 3 X31.12.2019.pdf 0.616
# 10 4 X31.12.2017.pdf 1.39
# # ... with 14 more rows
ggplot(pivot_longer(dat, -x), aes(x, value, color = name, group = name)) +
geom_line()
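The str() output in the question shows character columns, while the snippets above work on a dat with numeric values and an x column. The step that builds dat is not shown, so the following is only a sketch of one way to get there; the two-row data frame is a cut-down, hypothetical copy of the question's data.

```r
# Hypothetical two-row excerpt of the question's data (values stored as text)
indicador.transposta <- data.frame(
  `31-12-2017.pdf` = c("0.260650162167045", "1.20707183817692"),
  `31-12-2018.pdf` = c("0.278000595266861", "1.07611507200346"),
  `31-12-2019.pdf` = c("0.100940099971038", "0.775547123687795"),
  row.names = c("Liq..Imed.", "Liq..Corr."),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# Convert every character column to numeric; dat[] keeps the data frame shape,
# including row and column names
dat <- indicador.transposta
dat[] <- lapply(dat, as.numeric)

# Add a row index to serve as the x variable (assumed, to match pivot_longer(dat, -x))
dat$x <- seq_len(nrow(dat))
str(dat)
```

For the base R loop, the x column would need to be left out (for example by looping over dat[names(dat) != "x"]), otherwise it is drawn as a line too.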

Related

How can I write a dataframe to a csv after running scale() in R?

I'm scaling one column in a dataset with the intention of fitting a linear model. However, when I try to write the dataframe (with scaled column) to a csv, it doesn't work because the scaled column became complex with center and scale attributes.
Can someone please indicate how to convert the scaled column to something that can write to a csv? (and maybe why scale() needs to do it this way.)
library(dplyr)   # %>% and mutate
library(readr)   # write_csv
library(ggplot2) # ggplot
# make a data frame
testDF <- data.frame(x1 = c(1,2,2,3,2,4,4,5,6,15,36,42,11,12,23,24,25,66,77,18,9),
x2 = c(1,4,5,9,4,15,17,25,35,200,1297,1764,120,150,500,500,640,4200,6000,365,78))
# scale the x1 attribute
testDF <- testDF %>%
mutate(x1_scaled = scale(x1, center = TRUE, scale = TRUE))
# write to csv doesn't work
write_csv(as.matrix(testDF), "testDF.csv")
# but plotting and lm do work
ggplot(testDF, aes(x1_scaled)) +
geom_histogram(aes(y = ..density..),binwidth = 1)
Lm_scaled <- lm(x2 ~ x1_scaled, data = testDF)
plot(Lm_scaled)
scale() returns a one-column matrix. We can extract the column or use as.vector() to drop the dim attribute:
testDF <- testDF %>%
mutate(x1_scaled = as.vector(scale(x1, center = TRUE, scale = TRUE)))
Compare the structure of the output without and with as.vector():
> testDF %>%
+ mutate(x1_scaled = scale(x1, center = TRUE, scale = TRUE)) %>% str
'data.frame': 21 obs. of 3 variables:
$ x1 : num 1 2 2 3 2 4 4 5 6 15 ...
$ x2 : num 1 4 5 9 4 15 17 25 35 200 ...
$ x1_scaled: num [1:21, 1] -0.824 -0.776 -0.776 -0.729 -0.776 ...
..- attr(*, "scaled:center")= num 18.4
..- attr(*, "scaled:scale")= num 21.2
> testDF %>%
+ mutate(x1_scaled = as.vector(scale(x1, center = TRUE, scale = TRUE))) %>% str
'data.frame': 21 obs. of 3 variables:
$ x1 : num 1 2 2 3 2 4 4 5 6 15 ...
$ x2 : num 1 4 5 9 4 15 17 25 35 200 ...
$ x1_scaled: num -0.824 -0.776 -0.776 -0.729 -0.776 ...
You can simply convert the scaled column to numeric in base R and write out the data frame:
testDF$x1_scaled <- as.numeric(testDF$x1_scaled)
write_csv(testDF, "testDF.csv")
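On the "why": scale() is written for matrix-like input, so it always returns a matrix and records the centering and scaling values as attributes so the transformation can be undone later. A minimal base R sketch of the round trip, using a shortened, made-up x1 vector and write.csv() in place of readr's write_csv():

```r
x1 <- c(1, 2, 2, 3, 2, 4)

# scale() returns a one-column matrix carrying the center and scale as attributes
scaled <- scale(x1, center = TRUE, scale = TRUE)
is.matrix(scaled)
attr(scaled, "scaled:center")
attr(scaled, "scaled:scale")

# as.vector() drops all attributes, leaving a plain numeric vector
plain <- as.vector(scaled)

# A plain numeric column writes and reads back without trouble
out <- data.frame(x1 = x1, x1_scaled = plain)
path <- tempfile(fileext = ".csv")
write.csv(out, path, row.names = FALSE)
back <- read.csv(path)
str(back)
```

If the centering and scaling values are needed later, they can be saved from the attributes before calling as.vector().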

tidyr::spread() with multiple keys and values

I assume this has been asked multiple times but I couldn't find the proper words to find a workable solution.
How can I spread() a data frame based on multiple keys for multiple values?
A simplified version of the data I'm working with looks like this (I have many more columns to spread, but only two keys: id and the time point of a given measurement):
df <- data.frame(id = rep(seq(1:10),3),
time = rep(1:3, each=10),
x = rnorm(n=30),
y = rnorm(n=30))
> head(df)
id time x y
1 1 1 -2.62671241 0.01669755
2 2 1 -1.69862885 0.24992634
3 3 1 1.01820778 -1.04754037
4 4 1 0.97561596 0.35216040
5 5 1 0.60367158 -0.78066767
6 6 1 -0.03761868 1.08173157
> tail(df)
id time x y
25 5 3 0.03621258 -1.1134368
26 6 3 -0.25900538 1.6009824
27 7 3 0.13996626 0.1359013
28 8 3 -0.60364935 1.5750232
29 9 3 0.89618748 0.0294315
30 10 3 0.14709567 0.5461084
What I'd like to have is a data frame with one row per id and one column for each combination of time point and measurement variable.
With pivot_wider (first available in the devel version tidyr_0.8.3.9000, and part of released tidyr since 1.0.0), we can reshape multiple value columns from long to wide format:
library(dplyr)
library(tidyr)
library(stringr)
df %>%
mutate(time = str_c("time", time)) %>%
pivot_wider(names_from = time, values_from = c("x", "y"), names_sep="")
# A tibble: 10 x 7
# id xtime1 xtime2 xtime3 ytime1 ytime2 ytime3
# <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 1 -0.256 0.483 -0.254 -0.652 0.655 0.291
# 2 2 1.10 -0.596 -1.85 1.09 -0.401 -1.24
# 3 3 0.756 -2.19 -0.0779 -0.763 -0.335 -0.456
# 4 4 -0.238 -0.675 0.969 -0.829 1.37 -0.830
# 5 5 0.987 -2.12 0.185 0.834 2.14 0.340
# 6 6 0.741 -1.27 -1.38 -0.968 0.506 1.07
# 7 7 0.0893 -0.374 -1.44 -0.0288 0.786 1.22
# 8 8 -0.955 -0.688 0.362 0.233 -0.902 0.736
# 9 9 -0.195 -0.872 -1.76 -0.301 0.533 -0.481
#10 10 0.926 -0.102 -0.325 -0.678 -0.646 0.563
NOTE: The numbers are different as there was no set seed while creating the sample dataset
Reshaping with multiple value variables can best be done with dcast from data.table or reshape from base R.
library(data.table)
out <- dcast(setDT(df), id ~ paste0("time", time), value.var = c("x", "y"), sep = "")
out
# id xtime1 xtime2 xtime3 ytime1 ytime2 ytime3
# 1: 1 0.4334921 -0.5205570 -1.44364515 0.49288757 -1.26955148 -0.83344256
# 2: 2 0.4785870 0.9261711 0.68173681 1.24639813 0.91805332 0.34346260
# 3: 3 -1.2067665 1.7309593 0.04923993 1.28184341 -0.69435556 0.01609261
# 4: 4 0.5240518 0.7481787 0.07966677 -1.36408357 1.72636849 -0.45827205
# 5: 5 0.3733316 -0.3689391 -0.11879819 -0.03276689 0.91824437 2.18084692
# 6: 6 0.2363018 -0.2358572 0.73389984 -1.10946940 -1.05379502 -0.82691626
# 7: 7 -1.4979165 0.9026397 0.84666801 1.02138768 -0.01072588 0.08925716
# 8: 8 0.3428946 -0.2235349 -1.21684977 0.40549497 0.68937085 -0.15793111
# 9: 9 -1.1304688 -0.3901419 -0.10722222 -0.54206830 0.34134397 0.48504564
#10: 10 -0.5275251 -1.1328937 -0.68059800 1.38790593 0.93199593 -1.77498807
Using reshape we could do
# setDF(df) # in case df is a data.table now
reshape(df, idvar = "id", timevar = "time", direction = "wide")
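To make the base R route reproducible, here is a small sketch with a fixed seed. The seed value and the renaming step are assumptions added for illustration; reshape() itself names the wide columns as variable.time (for example x.1), so the sub() call only rewrites them to the xtime1 style used in the other answers.

```r
set.seed(42)  # arbitrary seed, only so the example is repeatable
df <- data.frame(id   = rep(1:10, 3),
                 time = rep(1:3, each = 10),
                 x    = rnorm(n = 30),
                 y    = rnorm(n = 30))

# One row per id; x and y get one column per time point
wide <- reshape(df, idvar = "id", timevar = "time", direction = "wide")

# Rename x.1 -> xtime1, y.3 -> ytime3, and so on
names(wide) <- sub("\\.([0-9]+)$", "time\\1", names(wide))
dim(wide)
names(wide)
```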
Your input data frame is not tidy. You should use gather to make it so, then build the combined key and spread:
gather(df, key, value, -id, -time) %>%
mutate(key = paste0(key, "time", time)) %>%
select(-time) %>%
spread(key, value)

Formula notation for scatterplot producing unexpected results

I am working on a map, where the color of each point is proportional to one response variable, and the size of the point is proportional to another. I've noticed that when I try to plot the points using formula notation things go haywire, while default notation performs as expected. I have used formula notation to plot maps many times before, and thought that the notations were nearly interchangeable. Why would these produce different results? I have read through the plot.formula and plot.default documentation and haven't been able to figure it out. Based on this I am wondering if it has to do with the columns of dat being coerced to factors, but I'm not sure why that would be happening. Any ideas?
Consider the following example data frame, dat:
latitude <- c(runif(10, min = 45, max = 48))
latitude[9] <- NA
longitude <- c(runif(10, min = -124.5, max = -122.5))
longitude[9] <- NA
color <- c("#00FFCCCC", "#99FF00CC", "#FF0000CC", "#3300FFCC", "#00FFCCCC",
"#00FFCCCC", "#3300FFCC", "#00FFCCCC", NA, "#3300FFCC")
size <- c(4.916667, 5.750000, 7.000000, 2.000000, 5.750000,
4.500000, 2.000000, 4.500000, NA, 2.000000)
dat <- as.data.frame(cbind(longitude, latitude, color, size))
Plotting according to formula notation
plot(latitude ~ longitude, data = dat, type = "p", pch = 21, col = 1, bg = color, cex = size)
produces
this mess and the following error: graphical parameter "type" is obsolete.
Plotting according to the default notation
plot(longitude, latitude, type = "p", pch = 21, col = 1, bg = color, cex = size)
works as expected, though with the same error.
There are a couple of problems with this. First is that your use of cbind is turning this into a matrix, albeit temporarily, which is converting your numbers to character. See:
dat <- as.data.frame(cbind(longitude, latitude, color, size))
str(dat)
# 'data.frame': 10 obs. of 4 variables:
# $ longitude: Factor w/ 9 levels "-122.855375511572",..: 6 8 9 1 4 3 2 7 NA 5
# $ latitude : Factor w/ 9 levels "45.5418886151165",..: 6 2 4 1 3 7 5 9 NA 8
# $ color : Factor w/ 4 levels "#00FFCCCC","#3300FFCC",..: 1 3 4 2 1 1 2 1 NA 2
# $ size : Factor w/ 5 levels "2","4.5","4.916667",..: 3 4 5 1 4 2 1 2 NA 1
If instead you just use data.frame, you'll get:
dat <- data.frame(longitude, latitude, color, size)
str(dat)
# 'data.frame': 10 obs. of 4 variables:
# $ longitude: num -124 -124 -124 -123 -124 ...
# $ latitude : num 47.3 45.9 46.3 45.5 46 ...
# $ color : Factor w/ 4 levels "#00FFCCCC","#3300FFCC",..: 1 3 4 2 1 1 2 1 NA 2
# $ size : num 4.92 5.75 7 2 5.75 ...
plot(latitude ~ longitude, data = dat, pch = 21, col = 1, bg = color, cex = size)
But now the colors are all dorked. Okay, the problem is likely because your $color is a factor, which is being interpreted internally as integers. Try stringsAsFactors=F:
dat <- data.frame(longitude, latitude, color, size, stringsAsFactors=FALSE)
str(dat)
# 'data.frame': 10 obs. of 4 variables:
# $ longitude: num -124 -124 -124 -123 -124 ...
# $ latitude : num 47.3 45.9 46.3 45.5 46 ...
# $ color : chr "#00FFCCCC" "#99FF00CC" "#FF0000CC" "#3300FFCC" ...
# $ size : num 4.92 5.75 7 2 5.75 ...
plot(latitude ~ longitude, data = dat, pch = 21, col = 1, bg = color, cex = size)
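The coercion can be checked directly. The sketch below uses short made-up vectors rather than the question's random data. Note that since R 4.0.0 the default is stringsAsFactors = FALSE, so on current versions the columns from as.data.frame(cbind(...)) come out as character rather than factor, but they are still not numeric.

```r
longitude <- c(-123.1, -124.0, -122.9)
size      <- c(4.5, 2, 7)
color     <- c("#00FFCCCC", "#3300FFCC", "#FF0000CC")

# cbind() builds a matrix, and a matrix holds a single type:
# mixing numbers with strings turns everything into character
m <- cbind(longitude, color, size)
typeof(m)

# data.frame() keeps each column's own type
dat <- data.frame(longitude, color, size, stringsAsFactors = FALSE)
sapply(dat, class)
```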

How to retrieve name of element in list (data frame) to use it as a title of the plot?

So, briefly and without further ado: is it possible to retrieve only the name of an element in a list and use it as the main title of a plot?
Let me explain - example:
Let's create a random df:
a <- c(1,2,3,4)
b <- runif(4)
c <- runif(4)
d <- runif(4)
e <- runif(4)
f <- runif(4)
df <- data.frame(a,b,c,d,e,f)
head(df)
a b c d e f
1 1 0.9694204 0.9869154 0.5386678 0.39331278 0.15054698
2 2 0.8949330 0.9910894 0.1009689 0.03632476 0.15523628
3 3 0.4930752 0.7179144 0.6957262 0.36579883 0.32006026
4 4 0.4850141 0.5539939 0.3196953 0.14348259 0.05292068
Then I want to create a list of data frames (based on the one above) with specific columns to plot. In other words, I'd like to make plots where the first column of df (a) is the x axis and columns b, c, d, e and f supply the values on the y axis. Yes, there will be 5 plots, and that's the point!
So my idea was to write a simple function that creates a list of data frames from the one created above:
my_fun <- function(x){
a <- df[1]
b <- x
aname <- "x_label"
bname <- "y_label"
df <- data.frame(a,b)
names(df) <- c(aname,bname)
return(df)
}
Run it for all (specified) columns:
df_s <- apply(df[,2:6], 2, function(x) my_fun(x))
So I have now:
class(df_s)
[1] "list"
str(df_s)
List of 5
$ b:'data.frame': 4 obs. of 2 variables:
..$ x_label: num [1:4] 1 2 3 4
..$ y_label: num [1:4] 0.969 0.895 0.493 0.485
$ c:'data.frame': 4 obs. of 2 variables:
..$ x_label: num [1:4] 1 2 3 4
..$ y_label: num [1:4] 0.987 0.991 0.718 0.554
$ d:'data.frame': 4 obs. of 2 variables:
..$ x_label: num [1:4] 1 2 3 4
..$ y_label: num [1:4] 0.539 0.101 0.696 0.32
$ e:'data.frame': 4 obs. of 2 variables:
..$ x_label: num [1:4] 1 2 3 4
..$ y_label: num [1:4] 0.3933 0.0363 0.3658 0.1435
$ f:'data.frame': 4 obs. of 2 variables:
..$ x_label: num [1:4] 1 2 3 4
..$ y_label: num [1:4] 0.1505 0.1552 0.3201 0.0529
That is what I wanted, but here's the question. I'd like to create a plot for every data frame in my list. As a result I want 5 plots with main titles b, c, d, e, f respectively. The axis labels are the same for every plot; the titles are not. So I tried:
lapply(df_s, function(x) plot(x[2] ~ x[1], data = x, main = ???))
What should go in place of the question marks? I tried main = names(df_s)[x], but it didn't work...
I think the following works. However, it might be better to use ggplot2 instead of the plot function (unless you are saving the plots inside lapply).
lapply(seq_along(df_s), function(x)
  plot(df_s[[x]][, 2] ~ df_s[[x]][, 1],
       xlab = names(df_s[[x]])[1],
       ylab = names(df_s[[x]])[2],
       main = names(df_s)[x]))
With ggplot2:
library(ggplot2)
plot_lst <- lapply(seq_along(df_s), function(i) {
ggplot(df_s[[i]], aes(x=x_label, y=y_label)) +
geom_point() +
theme(plot.title = element_text(hjust = 0.5)) +
ggtitle(names(df_s)[i]) })
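The original attempt fails because lapply(df_s, function(x) ...) hands the function only each element (the data frame), not its name or position, so names(df_s)[x] ends up indexing by a data frame. Iterating over the names instead keeps both available. A minimal base R sketch, with a two-element list whose values are rounded from the question's output; plot_one is a hypothetical helper name.

```r
df_s <- list(
  b = data.frame(x_label = 1:4, y_label = c(0.969, 0.895, 0.493, 0.485)),
  c = data.frame(x_label = 1:4, y_label = c(0.987, 0.991, 0.718, 0.554))
)

# Draw one plot for the list element called `nm` and return the title used
plot_one <- function(nm) {
  plot(y_label ~ x_label, data = df_s[[nm]], main = nm)
  nm
}

# Loop over the names rather than the elements
titles <- vapply(names(df_s), plot_one, character(1))
titles
```

Map(function(d, nm) ..., df_s, names(df_s)) is an equivalent way to pass element and name side by side.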

R: Split metricsgraphics histogram by factor

I have a data frame that looks sort of like the following:
'data.frame': 400 obs. of 4 variables:
$ admit: Factor w/ 2 levels "rejected","accepted": 1 2 2 2 1 2 2 1 2 1 ...
$ gpa : num 3.61 3.67 4 3.19 2.93 3 2.98 3.08 3.39 3.92 ...
Now I would like to turn this into a histogram of GPA using the metricsgraphics package, but split the data by the factor 'admit'. How is this done?
Using ggplot I can do something like the following:
ggplot(data, aes(gpa)) +
geom_histogram(aes(fill=admit, y=..density..),
position="dodge",
binwidth=0.1
)
but I'm looking at how to specifically do so using metricsgraphics.
I currently have
mjs_plot(data, x = gpa) %>%
mjs_histogram(bins = 80)
but of course this doesn't split by the factor.
I think you'll have to produce each plot and arrange it into a grid. From the package vignette:
moar_plots <- lapply(1:7, function(x) {
mjs_plot(rbeta(10000, x, x), width="250px", height="250px", linked=TRUE) %>%
mjs_histogram(bar_margin=2) %>%
mjs_labs(x_label=sprintf("Plot %d", x))
})
mjs_grid(moar_plots, nrow=4, ncol=3, widths=c(rep(0.33, 3)))
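Adapting the vignette pattern to the question, one option is to split gpa by admit first and build one histogram per group. This is an untested sketch: the data frame below is made up, the sizes and grid widths are arbitrary, and the metricsgraphics calls are limited to the functions already shown above (mjs_plot, mjs_histogram, mjs_labs, mjs_grid), nested rather than piped so no pipe operator is required.

```r
# Hypothetical stand-in for the question's data
data <- data.frame(
  admit = factor(c("rejected", "accepted", "accepted", "accepted", "rejected", "accepted"),
                 levels = c("rejected", "accepted")),
  gpa   = c(3.61, 3.67, 4.00, 3.19, 2.93, 3.00)
)

# One numeric vector of GPAs per level of admit
gpa_by_admit <- split(data$gpa, data$admit)

# Only build the widgets when the package is installed
if (requireNamespace("metricsgraphics", quietly = TRUE)) {
  library(metricsgraphics)
  plots <- lapply(names(gpa_by_admit), function(nm) {
    p <- mjs_plot(gpa_by_admit[[nm]], width = "300px", height = "250px")
    p <- mjs_histogram(p, bins = 80)
    mjs_labs(p, x_label = nm)   # label each panel with its factor level
  })
  grid <- mjs_grid(plots, nrow = 1, ncol = 2, widths = c(0.5, 0.5))
}
```

This gives side-by-side panels rather than dodged bars in a single panel, which is what the ggplot2 version in the question produces.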
