Extract table from a PDF with multiple headers (R)

I am trying to get some epidemiological data stored in a PDF that is publicly available (link).
I am just looking at the data on page 9 (right table).
What I would like to achieve is to get the data into a table, but since there are multiple header rows, this is quite difficult. Example: the column SIDA is divided into two further columns (SEM and ACUM). Would it be possible to split the SIDA cell?
So far I have tried to extract the data using pdftools and tabulizer.
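One workable approach with tabulizer is to read the table as a raw character matrix and rebuild the two-level header yourself. Below is a minimal sketch of that header-merging step; the small matrix `tab` stands in for one element returned by `extract_tables()` (the real call would be something like `extract_tables("report.pdf", pages = 9)`), and the column names other than SIDA/SEM/ACUM are made up for illustration.

```r
# `tab` stands in for one character matrix from tabulizer::extract_tables();
# its first two rows hold the two header levels (e.g. "SIDA" over "SEM"/"ACUM").
tab <- rbind(
  c("PROVINCIA", "SIDA", "",     "TBC", ""),
  c("",          "SEM",  "ACUM", "SEM", "ACUM"),
  c("Azua",      "1",    "10",   "2",   "20")
)

# Carry the top-level names rightwards across the spanned (blank) cells,
# then paste the two levels together: SIDA_SEM, SIDA_ACUM, ...
top <- tab[1, ]
for (i in seq_along(top)[-1]) if (top[i] == "") top[i] <- top[i - 1]
header <- ifelse(tab[2, ] == "", top, paste(top, tab[2, ], sep = "_"))

df <- setNames(as.data.frame(tab[-(1:2), , drop = FALSE],
                             stringsAsFactors = FALSE), header)
# df now has columns PROVINCIA, SIDA_SEM, SIDA_ACUM, TBC_SEM, TBC_ACUM
```

This effectively "splits" the SIDA cell into SIDA_SEM and SIDA_ACUM columns, after which each column holds a single series of values.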

Related

Export dataframe to multiple excel workbooks in R with conditional formatting intact

I have a dataset that I'd like to export to multiple Excel workbooks with conditional formatting. I can't post the actual data, but a sample is below. Essentially, I've got a dataset showing whether or not an individual qualifies for a survey, and what department and team within the department they are in:
Survey Status   Department    Team
1               Budget Off    Acts
0               Budget Off    Acts
1               Sales         Local
1               Public Rel    Social
I want to do the following:
Conditionally format the data so that rows with a survey status of 1 are in bold, black text and rows with a survey status of 0 are in non-bold, red text for easier reading.
Maintain this formatting when exporting to Excel.
Create an individual workbook for each Department/Team grouping.
I can format the data how I'd like within R, and I can create an individual workbook for each Department/Team grouping without a problem.
The issue is that the formatting gets lost. I've tried a few different packages, including xlsx, openxlsx, formattable, and condformat, but can't seem to bridge the gap so that the formatting is applied within the Excel files.
I was previously able to do this in SAS with no problems. We're transitioning to R, which is why I'm recreating these documents. However, I'm wondering if R is the best choice for this procedure. Perhaps Python would be better?
Thanks in advance for all your help. The SO community is my lifeline in learning to code, and has been an invaluable resource.
Finally discovered the problem with the help of YouTuber CradleToGraveR. User error! It was in the formula for the "rule" within conditionalFormatting().
I had been using the formatting examples in the documentation for openxlsx (e.g., rule = "$colname=1") but applying them to a string (e.g., rule = "$colname=YES"), which is why Excel didn't recognize it.
When conditionally formatting strings, use single quotes around the rule with double quotes around the value (e.g., rule = '$colname="YES"').
CradleToGraveR shows how to format for text values in this video, https://www.youtube.com/watch?v=ACdCQuQJxhU
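A minimal sketch of the fix with openxlsx, on a small made-up data frame (the column names, sheet name, and output file name here are illustrative, not from the original data). Column A holds the survey status as text, and the rule strings follow the quoting pattern from the answer above:

```r
library(openxlsx)

df <- data.frame(SurveyStatus = c("YES", "NO"),
                 Department   = c("Budget Off", "Sales"),
                 stringsAsFactors = FALSE)

wb <- createWorkbook()
addWorksheet(wb, "Survey")
writeData(wb, "Survey", df)

boldBlack <- createStyle(textDecoration = "bold", fontColour = "black")
redPlain  <- createStyle(fontColour = "red")

# Note the quoting: single quotes around the rule, double quotes
# around the text value being compared.
conditionalFormatting(wb, "Survey", cols = 1:2, rows = 2:3,
                      rule = '$A2="YES"', style = boldBlack)
conditionalFormatting(wb, "Survey", cols = 1:2, rows = 2:3,
                      rule = '$A2="NO"',  style = redPlain)

saveWorkbook(wb, "survey_formatted.xlsx", overwrite = TRUE)
```

Because the rule is an expression anchored on column A (`$A2`), each whole row is formatted according to its own status value, which matches the bold-vs-red requirement in the question.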

Is there any method to extract pdf table tidy with R?

I need an automated way to extract PDF tables in R.
So I searched around and found the tabulizer package, and I use:
extract_tables(f2, pages = 25, guess = TRUE, encoding = 'UTF-8', method = "stream") # f2 is the pdf file name
I tried every method type, but the outcome is not tidy.
Some columns are mixed together and there are a lot of blanks, as you can see in the image file.
I could modify the data directly, but the purpose is to automate this, so a general method is needed. And not every PDF file is well organized: some tables are very tidy, with every related line matched perfectly, but others are not.
As you can see in my output image, in column 4 the numbers are mixed into the same column, while in the other columns the numbers are matched one by one. What I mean is that I want to make the columns tidy, like the table in the PDF, automatically.
Is there any package or method to make the extracted table tidy?
[Image: my code result]
[Image: table in PDF]
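There is no fully general fix, but when two values end up fused into one column, a common repair is to split that column on the whitespace that "stream" mode often leaves between them (trying method = "lattice" is also worth a shot on tables with ruled lines). A minimal sketch, where `raw` stands in for the problem column of the extracted matrix and the column names are made up:

```r
# `raw` stands in for the mixed column 4 from extract_tables();
# each entry holds two numbers separated by whitespace.
raw <- c("12 345", "6 789", "10 11")

# Split each entry on runs of whitespace and rebind into two columns.
parts <- do.call(rbind, strsplit(trimws(raw), "\\s+"))
fixed <- data.frame(col4a = as.numeric(parts[, 1]),
                    col4b = as.numeric(parts[, 2]))
# fixed now has one number per cell in each of the two columns
```

This only works when every row of the column fuses the same number of values; rows that differ need to be handled separately (e.g. padded before the split).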

Locate starting coordinates of a table in R

I am trying to extract information from a portion of a table in R. Example table below...
This is just a simple example compared to what I am really dealing with. I am working with a very large table that has a very strange structure and changes with each page. When I read the whole table using "extract_tables" function, I get a very unstructured result back with multiple table elements being pushed into the same row/column. So I am attempting to read only a portion of the table. I am trying to locate the position of the table using the text in the first cell "Here", so I can plug this into the "area" parameter of the "extract_tables" function. I cannot use the "extract_areas" function because I do not want to extract the tables manually.
Can anyone help me with this?
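One way to find that anchor programmatically is pdftools::pdf_data(), which returns one data frame per page with a row for each word and its x/y position on the page. A sketch, assuming a hypothetical file name and that the target text sits on page 1 (the 400/500-point offsets are placeholders you would tune to your table's size):

```r
library(pdftools)

# One data frame per page; columns include text, x, y, width, height.
words  <- pdf_data("report.pdf")[[1]]          # hypothetical file, page 1
anchor <- words[words$text == "Here", ][1, ]   # first occurrence of "Here"

# tabulizer areas are c(top, left, bottom, right) in points.
area <- c(anchor$y, anchor$x, anchor$y + 400, anchor$x + 500)
# tabulizer::extract_tables("report.pdf", pages = 1,
#                           area = list(area), guess = FALSE)
```

One caveat: pdftools and tabulizer may use slightly different coordinate conventions, so it is worth checking the extracted region against the page before relying on it.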

Load multiple tables from one worksheet and split them by detecting one fully empty row

So generally what I would like to do is load a sheet containing 4 different tables and split this one big dataset into smaller tables, using str_detect() to detect the fully blank row dividing those tables. After that I want to plug that information into startRow, startCol, endRow, and endCol.
I have tried using the function as follows:
str_detect(my_data, '') but the my_data format is wrong. I'm not sure what step I should take to prevent this and make it work.
I'm using read_xlsx() to read my dataset.
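The likely cause of the error is that str_detect() expects a character vector, while read_xlsx() returns a data frame (a list of columns). An alternative that avoids str_detect() entirely is a row-wise is.na() check: read the sheet with col_names = FALSE so separator rows come through as all-NA, then split on those rows. A minimal sketch, where `sheet` stands in for the read_xlsx() result:

```r
# `sheet` stands in for read_xlsx(path, col_names = FALSE); the blank
# separator row between tables comes through as an all-NA row.
sheet <- data.frame(a = c("x", "1", NA, "y", "2"),
                    b = c("v", "3", NA, "w", "4"),
                    stringsAsFactors = FALSE)

blank  <- apply(is.na(sheet), 1, all)   # TRUE on fully blank rows
group  <- cumsum(blank)                 # increments at each separator
tables <- split(sheet[!blank, ], group[!blank])
# `tables` is a list of data frames, one per block between blank rows
```

The row indices of the blank rows (`which(blank)`) also give you the startRow/endRow values directly if you prefer to re-read each block with read_xlsx()'s range arguments.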

Make and Visualise table in R

First I have an input and I want to get an output like this
(I want to group the occurrences in the dataset between columns):
Second:
Display this table in a good-looking way (something that looks like Word or Excel)
I can't use Word or Excel as I'm making some calculations in R with this dataset (it contains columns with numbers which aren't displayed here)
calculating the output table
I don't really see how the output table is calculated. Shouldn't it be output_table["A", "C"] = "e"?
Displaying your data
There are a lot of ways to do that. You might consider using RMarkdown to create a report-style output.
The DT library is also a very handy tool to display tables. It works well with RMarkdown and can be embedded in HTML documents. If you are using RStudio, you can use the following code to display your data:
library(DT)
DT::datatable(iris)
