Multiple plots in R on a pdf with a loop - r

I think I need help with the loop. How would you do multiple plots on separate pdf pages with the data below:
pdf page 1:
Mazda RX4
2 panel plot for mpg vs cyl and mpg vs vs
pdf page 2:
Hornet 4 D
2 panel plot for mpg vs cyl and mpg vs vs
and the same for Valiant.
model mpg cyl vs
Mazda RX4 21.0 6 0
Mazda RX4 21.0 6 0
Mazda RX4 22.8 4 1
Hornet 4 D 21.4 6 1
Hornet 4 D 18.7 8 0
Valiant 18.1 6 1
Valiant 21.4 6 1
Valiant 21.0 6 0
Valiant 22.8 6 0
Thanks.

What I do in this case is set up the plots I want on one page with gridExtra, save that as a PDF, and then concatenate all these PDFs with ghostscript.
In R:
library(gridExtra)
library(ggplot2)
plot_one <- ggplot() + geom_...
plot_two <- ggplot() + geom_...
# Arrange the two plots one per row.
# grid.arrange'd plots can be nested, too!
two_rows <- grid.arrange(plot_one, plot_two, nrow = 2)
ggsave("dataset_1.pdf", two_rows)
# repeat for second, third, etc datasets so you end up with dataset_2.pdf etc
These are then concatenated to one PDF with multiple pages with ghostscript:
gs -sDEVICE=pdfwrite \
-dNOPAUSE \
-dQUIET \
-dBATCH \
-sOutputFile=multipage.pdf \
dataset_1.pdf dataset_2.pdf

Derived from an example elsewhere (https://www.researchgate.net/post/How_to_save_the_graphics_in_several_separate_pages_with_R)
# Create pdf
pdf(...)
# Create different plots
plot1(...)
plot2(...)
plot3(...)
dev.off()
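Note: pdf() defaults to onefile=TRUE, which puts each new plot on its own page of a single file, so that default is what you want here. Set onefile=FALSE (with a file name such as "Rplot%03d.pdf") only if you want one separate file per page.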
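Applied to the data in the question, the pdf()/dev.off() pattern might look like the sketch below. It assumes base graphics; the data frame name `cars` and the output path `out_file` are made up for illustration, and the panel layout is only one reasonable choice.

```r
# Data typed in from the question.
cars <- data.frame(
  model = c(rep("Mazda RX4", 3), rep("Hornet 4 D", 2), rep("Valiant", 4)),
  mpg   = c(21.0, 21.0, 22.8, 21.4, 18.7, 18.1, 21.4, 21.0, 22.8),
  cyl   = c(6, 6, 4, 6, 8, 6, 6, 6, 6),
  vs    = c(0, 0, 1, 1, 0, 1, 1, 0, 0)
)

# Hypothetical output location; replace with your own path.
out_file <- file.path(tempdir(), "plots_by_model.pdf")

pdf(out_file, width = 8, height = 4)       # onefile = TRUE is the default
par(mfrow = c(1, 2), oma = c(0, 0, 2, 0))  # two panels per page, room for a title
for (m in unique(cars$model)) {
  d <- cars[cars$model == m, ]
  plot(d$cyl, d$mpg, xlab = "cyl", ylab = "mpg")  # left panel: mpg vs cyl
  plot(d$vs,  d$mpg, xlab = "vs",  ylab = "mpg")  # right panel: mpg vs vs
  mtext(m, outer = TRUE, cex = 1.2)               # page title: the model name
}
dev.off()
```

Because the layout holds two panels, every third call to plot() starts a new page, so each model ends up on its own page without any extra bookkeeping.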

Related

Converting Multiple PDF files into Text (R Language)

I am using the "tesseract" library in R to convert "PDF files into text", like shown over here: https://cran.r-project.org/web/packages/tesseract/vignettes/intro.html
library(pdftools)
library(tesseract)
pngfile <- pdftools::pdf_convert('myfile_1.pdf', dpi = 600)
text <- tesseract::ocr(pngfile)
cat(text)
The above code works perfectly. Now, I am trying to "mass upload" a large number of PDF files and convert them into text- currently, I figured out how to do this manually
#import and convert 1st file
pngfile_1 <- pdftools::pdf_convert('myfile_1.pdf', dpi = 600)
text_1 <- tesseract::ocr(pngfile_1)
#import and convert 2nd file (note: the files do not have the same naming convention)
pngfile_2 <- pdftools::pdf_convert('second_file.pdf', dpi = 600)
text_2 <- tesseract::ocr(pngfile_2)
etc
I copied/pasted the above code 50 times (while changing the "index", i.e. pngfile_i, text_i) and was able to accomplish what I wanted to do. However, I am looking for a more "automatic" way to import and convert all the pdf files.
At the moment, all my pdf files are in the following folder:
"C:/Users/me/Documents/mypdfs"
I found the following code which can be used to "mass import" pdf files into R:
library(dplyr)
library(data.table)
tbl_fread <-
list.files(pattern = "*.pdf") %>%
map_df(~fread(.))
But I am not sure how to instruct this code to import all pdf's from the correct directory ("C:/Users/me/Documents/mypdfs"). I also don't know how to instruct R to "rename" each imported pdf as "pdf_1, pdf_2, etc."
If all the pdf files were correctly imported and created, I could then write a "loop" and execute the desired commands, e.g.
# "n" would be the total number of pdf files
for (i in 1:n)
{
pngfile_i <- pdftools::pdf_convert('myfile_i.pdf', dpi = 600)
text_i <- tesseract::ocr(pngfile_i)
}
Can someone please show me how to do this?
Thanks
You can add full.names = TRUE to your list.files() call, but without a path argument it only searches the current working directory, so this assumes that the PDFs are contained within your project.
Alternatively, you can pass path = "C:/Users/me/Documents/mypdfs" together with full.names = TRUE, which points list.files() at the mypdfs folder and returns paths you can use directly. Note that pattern is a regular expression, not a glob, so match the extension with "\\.pdf$".
list.files(
path = "C:/Users/me/Documents/mypdfs",
full.names = TRUE,
pattern = "\\.pdf$"
)
To save the results under your pdf_n naming scheme, you can use paste0 along with map. Here I used data.frames to provide an example, as I do not work with PDFs and have none in bulk that I am willing to process.
library(tidyverse)
seq_along(tbl_fread) %>% map(
.f = function(i) {
# Your regular function
# related to PDF
# Saving according to desired names
write.table(
tbl_fread[[i]],
file = paste0("pdf_", i, ".csv")
)
}
)
To verify it works as intended, we can read it accordingly,
read.table(
file = "pdf_1.csv"
)
mpg cyl disp hp drat wt qsec vs am gear carb
Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
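Putting the pieces together for the PDF case, a minimal sketch of the loop could look like this. The folder and file names are throwaway stand-ins created only so the example runs, and `process_pdf` is a hypothetical placeholder for the real OCR step, not part of any package.

```r
# Create a scratch folder with a few empty stand-in files (hypothetical names).
pdf_dir <- file.path(tempdir(), "mypdfs")
dir.create(pdf_dir, showWarnings = FALSE)
file.create(file.path(pdf_dir, c("myfile_1.pdf", "second_file.pdf", "notes.txt")))

# full.names = TRUE returns paths that can be handed straight to a reader.
# pattern is a regular expression, so anchor it on the file extension.
pdf_files <- list.files(pdf_dir, pattern = "\\.pdf$", full.names = TRUE)

# Placeholder for the real work. With pdftools and tesseract installed,
# the body could instead be:
#   tesseract::ocr(pdftools::pdf_convert(f, dpi = 600))
process_pdf <- function(f) basename(f)

# One result per file, collected in a list and named pdf_1, pdf_2, ...
texts <- lapply(pdf_files, process_pdf)
names(texts) <- paste0("pdf_", seq_along(texts))
```

Keeping the results in one named list (texts$pdf_1, texts$pdf_2, and so on) avoids creating 50 separate variables, and the files do not need to share a naming convention.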

Multiply two matrices (sumproduct for multiple functions) - to get Fishers discriminant linear function scores

I have a set of Fisher's discriminant linear functions that I need to multiply against some test data. Both data files are in the form of two matrices (variables lined up to match variable order), so I need to multiply them together.
Here is some example test data, to which I've added a constant=1 variable (you'll see why when we get to the coefficients)
testdata <- cbind(constant=1,mtcars[ 1:6 ,c("mpg","disp","hp") ])
> testdata
constant mpg disp hp
Mazda RX4 1 21.0 160 110
Mazda RX4 Wag 1 21.0 160 110
Datsun 710 1 22.8 108 93
Hornet 4 Drive 1 21.4 258 110
Hornet Sportabout 1 18.7 360 175
Valiant 1 18.1 225 105
Here is my coefficients matrix (the Fisher's discriminant linear functions)
coefs <- data.frame(constant = c(-67.67, -59.46, -89.70),
mpg = c(4.01,3.49,3.69),
disp = c(0.14,0.15,0.22),
hp = c(0.13,0.15,0.20))
rownames(coefs) <- c("Function1","Function2","Function3")
> coefs
constant mpg disp hp
Function1 -67.67 4.01 0.14 0.13
Function2 -59.46 3.49 0.15 0.15
Function3 -89.70 3.69 0.22 0.20
I need to multiply the values in test data against the respective coefficients to get 3 functions scores per row. Here is how the values would be calculated
for the first row, Function1 = 1*(-67.67)+21*(4.01)+160*(0.14)+110*(0.13)
for the first row, Function2 = 1*(-59.46)+21*(3.49)+160*(0.15)+110*(0.15)
for the first row, Function3 = 1*(-89.70)+21*(3.69)+160*(0.22)+110*(0.20)
It's kind of like a sumproduct of the coefficients against each row, repeated 3 times (once per function).
So the resulting df/matrix should have the same number of rows as the test data, with 3 function score variables, like this
> df_result
Function1 Function2 Function3
row1 53.24 54.33 44.99
row2
Not ideal, but at the moment I'm exporting the data and doing this in Excel. If this is possible to do in R, any help is greatly appreciated. Many thanks
Are you just looking for the inner product?
testdata <- cbind(constant=1,mtcars[ 1:6 ,c("mpg","disp","hp") ])
coefs <- data.frame(constant = c(-67.67, -59.46, -89.70),
mpg = c(4.01,3.49,3.69),
disp = c(0.14,0.15,0.22),
hp = c(0.13,0.15,0.20))
rownames(coefs) <- c("Function1","Function2","Function3")
as.matrix(testdata) %*% t(as.matrix(coefs))
# Function1 Function2 Function3
# Mazda RX4 53.240 54.330 44.990
# Mazda RX4 Wag 53.240 54.330 44.990
# Datsun 710 50.968 50.262 36.792
# Hornet 4 Drive 68.564 70.426 68.026
# Hornet Sportabout 80.467 86.053 93.503
# Valiant 50.061 53.209 47.589
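As a sanity check, the matrix product can be compared against the hand calculation from the question for the first row. This is only a sketch that reuses the objects defined above; the names `scores` and `manual_f1` are introduced here for illustration.

```r
testdata <- cbind(constant = 1, mtcars[1:6, c("mpg", "disp", "hp")])
coefs <- data.frame(constant = c(-67.67, -59.46, -89.70),
                    mpg  = c(4.01, 3.49, 3.69),
                    disp = c(0.14, 0.15, 0.22),
                    hp   = c(0.13, 0.15, 0.20))
rownames(coefs) <- c("Function1", "Function2", "Function3")

# (6 x 4) %*% (4 x 3) gives one row per car and one column per function.
scores <- as.matrix(testdata) %*% t(as.matrix(coefs))

# Hand calculation for the first row and Function1, as written in the question.
manual_f1 <- 1 * (-67.67) + 21 * 4.01 + 160 * 0.14 + 110 * 0.13

# TRUE if the matrix product agrees with the hand calculation.
isTRUE(all.equal(scores["Mazda RX4", "Function1"], manual_f1))

# Convert back to a data frame if that is more convenient downstream.
df_result <- as.data.frame(scores)
```

The transpose is needed because coefs stores one function per row, while the product requires the variables (constant, mpg, disp, hp) to run down the rows of the right-hand matrix.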

Adding tidyselect helper functions to a vector [duplicate]

This question already has answers here:
dplyr/rlang: parse_expr with multiple expressions
(3 answers)
Closed 2 years ago.
I often create a "vector" of the variables I use most often while I'm coding. Usually if I just pass the vector object to select it works perfectly. Is there any way I can include the helper functions as strings in that vector?
For example I could do
library(dplyr)
x = c('matches("cyl")')
mtcars %>%
select_(x)
but this is not preferable because 1) select_ is deprecated and 2) it's not scalable (i.e., x = c('hp', 'matches("cyl")') will not grab both of the relevant columns).
Is there any way I could use tidyselect helper functions as part of a vector?
Note: if I do something like:
x = c(matches("cyl"))
#> Error: `matches()` must be used within a *selecting* function.
#> ℹ See <https://tidyselect.r-lib.org/reference/faq-selection-context.html>.
I get an error, so I'll definitely need to enquo it somehow.
You are trying to turn a string into code which might not be the best approach. However, you can use parse_exprs with !!!.
library(dplyr)
library(rlang)
x = c('matches("cyl")')
mtcars %>% select(!!!parse_exprs(x))
# cyl
#Mazda RX4 6
#Mazda RX4 Wag 6
#Datsun 710 4
#Hornet 4 Drive 6
#Hornet Sportabout 8
#...
x = c('matches("cyl")', 'hp')
mtcars %>% select(!!!parse_exprs(x))
# cyl hp
#Mazda RX4 6 110
#Mazda RX4 Wag 6 110
#Datsun 710 4 93
#Hornet 4 Drive 6 110
#Hornet Sportabout 8 175
#....
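If the goal is mainly to reuse a set of frequently used columns, one way to avoid parsing strings into code is to keep only plain column names in the vector and call the helpers directly inside select(). A minimal sketch, assuming dplyr 1.0 or later; the vector `common_vars` is a made-up example.

```r
library(dplyr)

# Plain column names only (hypothetical "most used" set).
common_vars <- c("hp", "wt")

# all_of() selects the columns named in the character vector;
# matches() is used directly, inside the selecting function.
res <- mtcars %>%
  select(all_of(common_vars), matches("cyl"))

names(res)
# "hp"  "wt"  "cyl"
```

This keeps the stored vector as data rather than code, at the cost of writing the helper calls at each select().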

Cannot use a variable named with numbers in R

I have some dataframes named as:
1_patient
2_patient
3_patient
Now I am not able to access its variables. For example:
I am not able to obtain:
2_patient$age
If I press tab when writing the name, it automatically gets quoted, but I am still unable to use it.
Do you know how can I solve this?
Naming an object with a leading digit is not recommended, because such names are not syntactically valid in R, but we can wrap the name in backticks to extract values from the object
`1_patient`$age
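If there is more than one such object, we can use mget to return the objects in a list and then extract the 'age' column from each by looping over the list with lapply (the first call below uses the 1_mtcars object from the reproducible example further down)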
mget(ls(pattern = "^\\d+_mtcars$"))
#$`1_mtcars`
# mpg cyl disp hp drat wt qsec vs am gear carb
#Mazda RX4 21 6 160 110 3.9 2.620 16.46 0 1 4 4
#Mazda RX4 Wag 21 6 160 110 3.9 2.875 17.02 0 1 4 4
lapply(mget(ls(pattern = "^\\d+_patient$")), `[[`, 'age')
Using a small reproducible example
data(mtcars)
`1_mtcars` <- head(mtcars, 2)
1_mtcars$mpg
Error: unexpected input in "1_"
`1_mtcars`$mpg
#[1] 21 21
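The same idea applied to the data frames from the question might look like the following sketch. The `age` values are invented purely so the example runs.

```r
# Stand-in data frames with non-syntactic names (values are made up).
`1_patient` <- data.frame(age = c(34, 51))
`2_patient` <- data.frame(age = c(27, 45))

# Backticks are required whenever the name is typed directly.
`2_patient`$age

# Collect every matching object into a named list ...
patients <- mget(ls(pattern = "^\\d+_patient$"))

# ... and pull the age column out of each element.
ages <- lapply(patients, `[[`, "age")
ages[["1_patient"]]
```

Once the objects are in a list, the awkward names are only used as strings (list element names), so no backticks are needed for further processing.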

R: geom_histogram

I am trying to create a histogram using geom_histogram() that uses a numeric variable for both the x and y axis.
The numeric x axis will be bucketed and the numeric y axis will show the sum of some other numeric value for each bucket. Right now, I am not having any luck and was hoping someone could help.
attach(Pre_vitality_HZ_Data)
buckets_pre = seq(min(Pre_V_HzR),max(Pre_V_HzR)+1,0.05)
ggplot() +
geom_histogram(alpha = 0.2, aes(x=Pre_V_HzR, y = sum(Policy_Count)), bins = length(buckets), fill = 'aquamarine3')
To make the plot you want with ggplot2, it's necessary to prepare the data before plotting. In the solution below, I propose dividing the continuous x-variable into a discrete variable with cut(), and using aggregate() to sum the y-values for each bin of x-values. Besides the base R function aggregate, there are many ways to summarize, aggregate and reshape your data. You may wish to look into the dplyr package or data.table package (two very powerful, well supported packages).
library(ggplot2)
# Use the built-in data set `mtcars` to make the example reproducible.
# Run ?mtcars to see a description of the data set.
head(mtcars)
# mpg cyl disp hp drat wt qsec vs am gear carb
# Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
# Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
# Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
# Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
# Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
# Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
# Let's use `disp` (engine displacement) as the x-variable
# and `mpg` (miles per gallon) as the y-variable.
# Bin the `disp` column into discrete variable with `cut()`
disp_bin_edges = seq(from=71, to=472, length.out=21)
mtcars$disp_discrete = cut(mtcars$disp, breaks=disp_bin_edges)
# Use `aggregate()` to sum `mpg` over levels of `disp_discrete`,
# creating a new data.frame.
dat = aggregate(mpg ~ disp_discrete, data=mtcars, FUN=sum)
# Use `geom_bar(stat="identity")` to plot pre-computed y-values.
p1 = ggplot(dat, aes(x=disp_discrete, y=mpg)) +
geom_bar(stat="identity") +
scale_x_discrete(drop=FALSE) +
theme(axis.text.x=element_text(angle=90)) +
ylab("Sum of miles per gallon") +
xlab("Displacement, binned")
# For this example data, a scatterplot conveys a clearer story.
p2 = ggplot(mtcars, aes(x=disp, y=mpg)) +
geom_point(size=5, alpha=0.4) +
ylab("Miles per gallon") +
xlab("Displacement")
library(gridExtra)
ggsave("plots.png", arrangeGrob(p1, p2, nrow=1), height=4, width=8, dpi=150)
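As an alternative to pre-aggregating, geom_histogram() can also sum a second variable per bin through the weight aesthetic, which is closer to what the question attempted. This is a sketch on mtcars rather than the asker's data; the bin count of 20 is arbitrary, and `p_weighted` and `binned` are names introduced here.

```r
library(ggplot2)

# weight = mpg makes each bar's height the sum of mpg within that disp bin,
# instead of the number of rows in the bin.
p_weighted <- ggplot(mtcars, aes(x = disp, weight = mpg)) +
  geom_histogram(bins = 20, alpha = 0.6) +
  ylab("Sum of miles per gallon") +
  xlab("Displacement")

# Inspect the computed bins: `count` holds the weighted sums.
binned <- layer_data(p_weighted)

# TRUE if the bar heights add up to the overall total of mpg.
isTRUE(all.equal(sum(binned$count), sum(mtcars$mpg)))
```

With this approach x stays continuous, so the axis shows numeric values rather than interval labels.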
