The text from a PDF I scraped is jumbled up across different elements. On top of that, data was deleted when it was converted to a data frame. It's really hard to tell where the text should have been split, since it seems like I got it right in the code below. How do I split the text so that it looks like the original table?
library(pdftools)  # pdf_subset(), pdf_text()
library(dplyr)     # rename(), mutate(), select(), %>%
library(tidyr)     # unnest(), separate()

mintz = "https://www.mintz.com/sites/default/files/media/documents/2019-02-08/State%20Legislation%20on%20Biosimilars.pdf"
mintzText = pdf_subset(mintz, pages = 2:23)
mintzText = pdf_text(mintzText)
q = data.frame(trimws(mintzText))
# one row per line of extracted text
mintzdf <- q %>%
  rename(x = trimws.mintzText.) %>%
  mutate(x = strsplit(x, "\\n")) %>%
  unnest(x)
View(mintzdf)
mintzDF = mintzdf[-c(1:2), ]
# try to split each line into the four table columns
mintzDF = mintzDF %>%
  separate(x, c("a", "State", "Substitution\nRequirements",
                "Pharmacy Notification Requirements\n(to prescriber, patient, or others)",
                "Recordkeeping\nRequirements")) %>%
  select(-a)
View(mintzdf)
what it looks like
what it should look like
The order in which text is stored on a PDF page may be random, or run from the bottom rows upwards, because there are no key-press-order rules for when lasers charge a drum (printing was the design requirement when PDF was introduced).
We are lucky if the order can be sensibly extracted, but this is a very well-ordered PDF. Remember that a PDF has no need to observe the grid; it simply outputs rows with spaces that, with luck, form columns.
In this case, using poppler pdftotext with no controls, the text order for a single page could look like this, with the first column headed State and the second starting with Substitution\nRequirements\n. There may be some head scratching over why State is not spaced away from Alaska, but then it is a PDF after all, so expect there to be no rules.
It looks like it was written down one column, then across two, then perhaps down the last.
Depending on how much the pages vary, I would attempt to target vertical strips rather than horizontal ones: set a template of 4 vertical, page-high zones and then hope the horizontal breaks can be determined as matches. The alternative (probably better) is to extract as a tabular layout; xpdf pdftotext may then give a better result.
Or use a Python table extractor like pdfminer.
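If the text does come out in a tabular layout (one text line per table row, columns separated by runs of spaces), splitting on two or more spaces rather than on every space keeps multi-word cells together. The following is only a minimal sketch in base R: the two sample lines are hypothetical stand-ins, not text from the actual PDF, and it assumes every usable row has exactly four cells.

```r
# Hypothetical sample lines standing in for one page of layout-preserved text
page_lines <- c(
  "State      Substitution Requirements      Pharmacy Notification      Recordkeeping",
  "Alaska     text A with spaces             text B                     text C"
)

# Split on runs of two or more spaces, so single spaces inside a cell survive
cells <- strsplit(trimws(page_lines), " {2,}")

# Keep only lines that produced the assumed four cells;
# wrapped or partial rows would need separate handling
rows <- cells[lengths(cells) == 4]

# First kept line is treated as the header, the rest as data
table_df <- as.data.frame(do.call(rbind, rows[-1]), stringsAsFactors = FALSE)
names(table_df) <- rows[[1]]
```

Rows that wrap over several text lines will not yield four cells and are dropped here, so on a real page they would have to be merged back into the row above before splitting.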
Related
The data I have been working with reads everything just fine, **except** for the date column. It always reads it as characters instead.
This would be fine except that, when I have lots of dates (like over 400 of them), then you can see something like this on a scatterplot:
Scatter Plot
In essence, I have two questions.
The first is: apart from using as.Date, which is fine when I need something temporary, how do I permanently make R read the date column as legitimate dates? What I mean is, is there a way I can make that date column read as dates when I am using read.csv or read.excel?
The second is: when graphing, like the graph I have included here, how can I include only some of the labels so that the axis won't be so cramped? I still want all the data, but really do not want all those labels.
I was hoping to add the data file, but I am unaware of how to add Excel/CSV files on this website, and my data set is quite long (n = 491). I have 9 columns, 1 of which is the date column. The others are numbers or actual letters (the latter are in fact characters). I can add a few rows just to help out.
Some of the data set
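One possible approach to the first question, sketched in base R: read.csv accepts a colClasses argument, and naming a column there as "Date" converts it while reading. The column names and values below are hypothetical, and the sketch assumes the dates are stored in year-month-day form; other formats would still need as.Date with a format string after reading.

```r
# Hypothetical CSV content; in practice this would be a file path
csv_text <- "date,value
2021-01-05,3.2
2021-02-10,4.1
2021-03-15,5.0"

# Ask read.csv to convert the 'date' column while reading
dat <- read.csv(text = csv_text, colClasses = c(date = "Date"))
class(dat$date)

# A small set of evenly spaced dates to use as axis labels,
# e.g. via axis.Date(1, at = label_dates) on a base plot
label_dates <- seq(min(dat$date), max(dat$date), length.out = 3)
```

All points are still plotted; only the positions passed as labels are thinned out.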
I have tens of thousands of rows of unstructured data in csv format. I need to extract certain product attributes from a long string of text. Given a set of acceptable attributes, if there is a match, I need it to fill in the cell with the match.
Example data:
"[ROOT];Earrings;Brands;Brands>JeweleryExchange;Earrings>Gender;Earrings>Gemstone;Earrings>Metal;Earrings>Occasion;Earrings>Style;Earrings>Gender>Women's;Earrings>Gemstone>Zircon;Earrings>Metal>White Gold;Earrings>Occasion>Just to say: I Love You;Earrings>Style>Drop/Dangle;Earrings>Style>Fashion;Not Visible;Gifts;Gifts>Price>$500 - $1000;Gifts>Shop>Earrings;Gifts>Occasion;Gifts>Occasion>Christmas;Gifts>Occasion>Just to say: I Love You;Gifts>For>Her"
Look up table of values:
Zircon, Diamond, Pearl, Ruby
Output:
Zircon
I tried using the VLOOKUP() function, but it needs to match an entire cell and works better for translating acronyms. I haven't really found a built-in function that accomplishes what I need. The data is totally unstructured and changes from row to row, with no consistency even within variations of the same product. Does anyone have an idea how to do this, or how to write an OpenOffice Calc function to accomplish it? I am also open to other, better methods if anyone has experience or ideas on how to approach this.
OK, so I figured out how to do this on my own. I created many different columns, each with a keyword I was looking to extract as its header.
Spreadsheet solution for structured data extraction
Then I used this formula to extract the keywords into the correct row beneath the column header:
=IF(ISERROR(SEARCH(CF$1,$D769)),"",CF$1)
The SEARCH function returns the position of the search string as a number, or an error if the string is not found. I use the ISERROR function to detect the error condition, and the IF statement so that if there is an error the cell is left blank, otherwise it takes the value of the header. I had over 100 columns of specific information to extract, plus one final column where I join all the previous cells in the row together for the final list. It worked like a charm. I recommend this approach to anyone who has to do a similar task.
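For anyone open to doing this outside a spreadsheet, the same keyword check can be sketched in R. This is an illustration rather than the method used above: find_attrs is a hypothetical helper name, the first row is a shortened version of the example string, and the match is made case-insensitive as an assumption.

```r
# Lookup values from the question
lookup <- c("Zircon", "Diamond", "Pearl", "Ruby")

# Shortened example rows (the second is hypothetical, with no match)
rows <- c(
  "[ROOT];Earrings;Earrings>Gemstone;Earrings>Gemstone>Zircon;Earrings>Metal>White Gold",
  "[ROOT];Earrings;Earrings>Metal>White Gold"
)

# Hypothetical helper: return every keyword found in the text, joined by ", "
find_attrs <- function(text, keywords) {
  found <- vapply(keywords,
                  function(k) grepl(tolower(k), tolower(text), fixed = TRUE),
                  logical(1))
  paste(keywords[found], collapse = ", ")
}

matches <- vapply(rows, find_attrs, character(1), keywords = lookup,
                  USE.NAMES = FALSE)
```

Rows with no matching keyword come back as an empty string, mirroring the blank cell in the formula version.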
I am so very very new to R. Like had to look up how to open a file in R new. Diving in the deep end. Anyway
I have a bunch of .csv files with results that I need to analyse. Really, I would like to set up some kind of automation so I can just say "go" (a function?)
Basically I have results in one file ending in -particle.csv and another ending in -ROI.csv. They have the same names, so I know which ones match up (e.g. brain1 section1 -particle.csv and brain1 section1 -ROI.csv). I need to do some maths using these two datasets: divide column 2, rows 2 to x, in -particle.csv (the row number might change, but is there a way of saying row "2 to no more content"?) by columns 1, 5, 10, etc. of row 2 in -ROI.csv. The column numbers will always stay the same, but if it helps they are all called Area1, Area2, Area3, ... The number of Area columns can vary, but surely there's a way I can say "every column that begins with Area"? Also, the area count and the row count will always match up.
Okay, I'm fine to do that manually for each set of results, but I have over 300 brains to analyse! Is there any way I can set it up as a process that I can apply to these and future results that will be in the same format?
Sorry if this is a huge ask!
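One way this could be automated is a function that handles a single pair of files, applied over every -particle.csv file found in a folder. The sketch below is a guess at the layout, not a tested solution for the real data: it writes two small hypothetical demo files to a temporary folder, assumes "row 2" means the first data row under the header, and uses a made-up function name, analyse_pair.

```r
# Hypothetical demo files in a temporary folder; real files would already exist
data_dir <- tempfile("brains")
dir.create(data_dir)
write.csv(data.frame(id = 1:2, count = c(10, 30)),
          file.path(data_dir, "brain1 section1 -particle.csv"), row.names = FALSE)
write.csv(data.frame(Area1 = 5, Mean1 = 0, Area2 = 10, Mean2 = 0),
          file.path(data_dir, "brain1 section1 -ROI.csv"), row.names = FALSE)

# Process one matching pair, given the path of the particle file
analyse_pair <- function(particle_file) {
  roi_file  <- sub("-particle\\.csv$", "-ROI.csv", particle_file)
  particles <- read.csv(particle_file)
  roi       <- read.csv(roi_file)
  # Every column whose name begins with "Area", first data row only
  areas <- unlist(roi[1, startsWith(names(roi), "Area")])
  # Column 2 of the particle file, all data rows, divided element-wise
  particles[[2]] / areas
}

# Find every particle file and apply the function to each
particle_files <- list.files(data_dir, pattern = "-particle\\.csv$", full.names = TRUE)
results <- lapply(particle_files, analyse_pair)
names(results) <- basename(particle_files)
```

Pointing data_dir at the real results folder and dropping the two write.csv calls would run the same steps over every pair, including files added later.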
So generally what I would like to do is load a sheet containing 4 different tables, and split this one big data set into smaller tables, using str_detect() to detect the fully blank row that divides those tables. After that I want to plug that information into startRow, startCol, endRow and endCol.
I have tried using this function as follows:
str_detect(my_data, "") but the my_data format is wrong. I'm not sure what step I should take to prevent this and make it work.
I'm using read_xlsx() to read my dataset.
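str_detect() works on character vectors, so passing it a whole data frame is likely the source of the format problem. A row-wise check may be simpler. The sketch below uses base R on a small hypothetical data frame and assumes the dividing row arrives as a row of NA or empty strings; whether read_xlsx() delivers it that way for a given sheet is worth checking first.

```r
# Hypothetical sheet: two small tables separated by one fully blank row
sheet <- data.frame(
  a = c("x", "1", NA, "y", "2"),
  b = c("p", "3", NA, "q", "4"),
  stringsAsFactors = FALSE
)

# TRUE for rows where every cell is NA or an empty string
is_blank <- apply(sheet, 1, function(r) all(is.na(r) | r == ""))

# Each blank row starts a new group; drop the blank rows themselves
group  <- cumsum(is_blank)
tables <- split(sheet[!is_blank, ], group[!is_blank])

# First and last row number of each table, usable as startRow / endRow
row_ranges <- tapply(which(!is_blank), group[!is_blank], range)
```

Here tables is a list with one data frame per block, and row_ranges gives the row positions within the imported data frame, not counting any header row in the spreadsheet itself.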
I have a data frame containing two columns, let's call them "description" and "closure_notes". Basically, what I am trying to do is combine the contents of both columns into a single one (replacing the contents of "description" with the merged contents of the two). The trick is, I need there to be a blank line or two between the two pieces of data.
For instance, if df$description is "A short description of the issue" and df$closure_notes is "Solved (Workaround): Fixed issue by restarting services", then the result I want as the new value for df$description should be:
A short description of the issue

Solved (Workaround): Fixed issue by restarting services
The reason for the space is readability. This data will eventually be shown in a Shiny app and in an accompanying PDF report that can be generated via knitr/rmarkdown. I want the space in there so that when someone reads this they can easily jump right to the closure notes if they want to, but they want them combined into a single column. I have tried paste with several "\n\n" as a separator, and tried using writeLines and cat, which work great for printing to the screen, but I want something that will write the result back to the data frame. I am looping through each row combining these two columns; I just need that blank line separating the two pieces of data. Any suggestions? Thanks in advance!
Shiny deals with HTML tags. Therefore try using <br/><br/> and not \n\n
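As a minimal sketch of that suggestion, paste() is vectorised, so the two columns can be merged for every row at once without a loop. The data frame below is a one-row stand-in built from the example in the question.

```r
# One-row stand-in using the example values from the question
df <- data.frame(
  description   = "A short description of the issue",
  closure_notes = "Solved (Workaround): Fixed issue by restarting services",
  stringsAsFactors = FALSE
)

# Merge the two columns for all rows, separated by two HTML line breaks
df$description <- paste(df$description, df$closure_notes, sep = "<br/><br/>")
```

The tags only turn into a visible gap where the text is rendered as HTML (for example, wrapped in HTML() in the Shiny UI); in a PDF report built with knitr/rmarkdown, a different separator would probably be needed.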