R Dataframe - add a newline or whitespace between two strings

I have a data frame containing two columns, let's call them "description" and "closure_notes". Basically what I am trying to do is combine the contents of both of those columns into a single one (replacing the contents of "description" with the merged contents of the two). The trick is, I need there to be a blank line or two between the two pieces of data.
For instance, if df$description is "A short description of the issue" and df$closure_notes is "Solved (Workaround): Fixed issue by restarting services", then the result I want as the new value for df$description should be:
A short description of the issue

Solved (Workaround): Fixed issue by restarting services
The reason for the space is readability. This data will eventually be shown in a Shiny app and in an accompanying PDF report generated via knitr/rmarkdown. I want the space there so that when someone reads this they can jump straight to the closure notes if they want to, but the two still need to be combined into a single column. I have tried paste() with several "\n\n" separators, and tried writeLines and cat, which work great for printing to the screen, but I want something that writes the result back to the data frame. I am looping through each row combining these two columns; I just need that blank line separating the two pieces of data. Any suggestions? Thanks in advance!

Shiny deals with HTML tags, so try using <br/> rather than \n\n.
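A minimal sketch of both options, assuming the data frame is df as in the question (paste() is vectorized, so no row-by-row loop is needed):

# For the Shiny/HTML side, a <br/> separator renders as a line break;
# for the knitr/rmarkdown PDF, a "\n\n" separator starts a new paragraph.
df$description <- paste(df$description, df$closure_notes, sep = "<br/><br/>")
# or, for the PDF report path:
# df$description <- paste(df$description, df$closure_notes, sep = "\n\n")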

Related

separating multiple columns into more columns

The text from a PDF I scraped is jumbled up across different elements. Not to mention, data was deleted when it was converted to a data frame. It's really hard to tell where the text should have been split, since it seems like I got it right in the code below. How do I split the text so that it looks like the original table?
library(pdftools)   # pdf_text(); pdf_subset() is re-exported from qpdf
library(dplyr)
library(tidyr)

mintz = "https://www.mintz.com/sites/default/files/media/documents/2019-02-08/State%20Legislation%20on%20Biosimilars.pdf"
mintzText = pdf_subset(mintz, pages = 2:23)   # keep only the table pages
mintzText = pdf_text(mintzText)               # one long string per page
q = data.frame(trimws(mintzText))
mintzdf <- q %>%
  rename(x = trimws.mintzText.) %>%           # the auto-generated column name
  mutate(x = strsplit(x, "\\n")) %>%          # split each page into lines
  unnest(x)
View(mintzdf)
mintzDF = mintzdf[-c(1:2), ]    # drop the two header lines
mintzDF = mintzDF %>%
  separate(x, c("a", "State", "Substitution Requirements",
                "Pharmacy Notification Requirements (to prescriber, patient, or others)",
                "Recordkeeping Requirements")) %>%
  select(-a)
View(mintzDF)
[screenshots omitted: what it looks like now vs. what it should look like]
The stored text order for a PDF page can be arbitrary, even bottom row upwards, because PDF was designed as a print format and imposes no rules about reading order.
We are lucky when the order can be sensibly extracted, and this happens to be a very well-ordered PDF. Remember, though, that the extractor does not observe the grid: it simply outputs rows with spaces that, with luck, form columns.
In this case, using poppler's pdftotext with no options, the single-page text order starts with the first column headed State and the second beginning with Substitution\nRequirements\n, so there may be some head-scratching over why State is not spaced away from Alaska; but it is PDF, after all, so expect no rules.
It looks as if the text was written down one column, then across two, then perhaps down the last.
Given how much the pages vary, I would target vertical strips rather than horizontal rows: set a template of four page-high vertical zones and then hope the horizontal breaks can be matched up. The alternative (probably better) is to extract as a tabular layout; xpdf's pdftotext may then give a better result.
Or use a Python table extractor like pdfminer.
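A rough sketch of the vertical-strip idea using pdftools::pdf_data(), which returns word-level coordinates; the x cut-offs below are assumptions that would need tuning against the actual pages:

library(pdftools)
library(dplyr)

words <- pdf_data(mintz)[[2]]   # word positions for page 2 of the URL above

words <- words %>%
  mutate(column = cut(x,
                      breaks = c(0, 150, 300, 450, Inf),   # hypothetical strip boundaries
                      labels = c("State", "Substitution", "Notification", "Recordkeeping")))

# Words can then be regrouped by (column, y) to rebuild rows,
# instead of relying on the raw extraction order.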

Printing out R Dataframe - Single Character Between Columns While Maintaining Alignment (Variable Spacing)

In a previous question, I received a solution that prints an R data frame as two aligned columns.
While that post answered my initial question, it seems the program I intend to use requires a text file in which the two columns are both aligned and separated by a single character (e.g. a tab). The previous solution instead produces a large and variable number of spaces between the first and second columns (depending on the length of the string in the first column for that particular row). Inserting a single character, however, misaligns the columns.
Is there any way I can replace a run of spaces with a single character while keeping the variable spacing needed to 'reach' the second column?
If it helps, this webpage contains a .txt file that you may download to see the intended output. Although that file does not suffer from the problem of variable-length names in the first column, it has a single 'space character' separating the first and second columns; if I copy and paste that specific character between columns 1 and 2, the program interprets the .txt file successfully, with a single character separating the columns and correct alignment.
For a further example, the first of the two attached screenshots (where the highlighted separator is a single character) parses properly while the second does not.
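One way to sketch the idea described above (df, V1, V2, and output.txt are placeholder names): pad the first column to a fixed width, then join the columns with a single tab, so the file both splits on one character and stays visually aligned when the tab stops are wide enough.

# Hypothetical data frame df with character columns V1 and V2
width <- max(nchar(df$V1))
out   <- paste0(formatC(df$V1, width = width, flag = "-"), "\t", df$V2)
writeLines(out, "output.txt")   # placeholder file name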

Is there a way to extract a substring from a cell in OpenOffice Calc?

I have tens of thousands of rows of unstructured data in csv format. I need to extract certain product attributes from a long string of text. Given a set of acceptable attributes, if there is a match, I need it to fill in the cell with the match.
Example data:
"[ROOT];Earrings;Brands;Brands>JeweleryExchange;Earrings>Gender;Earrings>Gemstone;Earrings>Metal;Earrings>Occasion;Earrings>Style;Earrings>Gender>Women's;Earrings>Gemstone>Zircon;Earrings>Metal>White Gold;Earrings>Occasion>Just to say: I Love You;Earrings>Style>Drop/Dangle;Earrings>Style>Fashion;Not Visible;Gifts;Gifts>Price>$500 - $1000;Gifts>Shop>Earrings;Gifts>Occasion;Gifts>Occasion>Christmas;Gifts>Occasion>Just to say: I Love You;Gifts>For>Her"
Look up table of values:
Zircon, Diamond, Pearl, Ruby
Output:
Zircon
I tried using the VLOOKUP() function, but it needs to match an entire cell and works better for translating acronyms. I haven't really found a built-in function that accomplishes what I need. The data is totally unstructured and changes from row to row, with no consistency even within variations of the same product. Does anyone have an idea how to do this, or how to write an OpenOffice Calc function to accomplish it? I am also open to other, better methods if anyone has experience or ideas on how to approach this.
OK, so I figured out how to do this on my own. I created many different columns, each with a keyword I was looking to extract as its header.
[screenshot omitted: spreadsheet solution for structured data extraction]
Then I used this formula to extract the keywords into the correct row beneath the column header: =IF(ISERROR(SEARCH(CF$1,$D769)),"",CF$1). The SEARCH function returns a numeric position for the search string and otherwise produces an error. I use the ISERROR function to detect that error condition, and the IF statement so that on an error the cell is left blank; otherwise it takes the value of the header. I had over 100 columns of specific information to extract, plus one final column where I join all the previous cells in the row together for the final list. Worked like a charm. I recommend this approach to anyone who has a similar task.
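If a spreadsheet formula is not a hard requirement, a hedged R sketch of the same keyword-matching idea (keywords and text are placeholders for the lookup values and one unstructured string):

keywords <- c("Zircon", "Diamond", "Pearl", "Ruby")
text     <- "[ROOT];Earrings;...;Earrings>Gemstone>Zircon;..."   # truncated example
# Keep every lookup value that appears somewhere in the string
hit <- keywords[vapply(keywords, grepl, logical(1), x = text, fixed = TRUE)]
hit   # "Zircon"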

Load multiple tables from one worksheet and split them by detecting one fully empty row

So generally what I would like to do is load a sheet with 4 different tables and split this one big data set into smaller tables, using str_detect() to detect the fully blank row that divides those tables. After that I want to plug that information into startRow, startCol, endRow, and endCol.
I have tried using the function as follows:
str_detect(my_data, "") but the my_data format is wrong. I'm not sure what step I should take to prevent this and make it work.
I'm using read_xlsx() to read my dataset.
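Since read_xlsx() returns a data frame rather than a character vector, str_detect() is not a natural fit here; a minimal sketch of an alternative, assuming the readxl package and a hypothetical file name, is to flag rows where every cell is NA and split on those:

library(readxl)

raw <- read_xlsx("my_data.xlsx", col_names = FALSE)   # hypothetical file name

blank  <- apply(is.na(raw), 1, all)                   # TRUE for fully empty separator rows
tables <- split(raw[!blank, ], cumsum(blank)[!blank]) # one data frame per table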

Using data.table::fread on column containing a single double quote

I've been googling and reading posts on problems similar but different to the one described below; apologies if this is a duplicate.
I've got a csv file with a field which can contain, among other things, a single instance of a double quote (object descriptions sometimes containing lengths specified in inches).
When I call fread as follows
data_in <- data.table::fread(file_path,stringsAsFactors = FALSE)
the resulting data frame contains two consecutive double quotes in instances where the source file only had one (e.g., the string which appears in the raw csv as
MI|WIRE 9" BGD
appears in the data frame as
MI|WIRE 9"" BGD
).
This character field can also contain commas, semicolons, single quotes in any quantity, and many other characters which I cannot identify.
This is a problem as I need the exact string to match another dataset's values with merge (in fact, the file being read in was originally written from r with fwrite).
I assume that nearly any io problem I'm wrestling with can be solved with readLines and some elbow grease, but I quite like fread. Based on what I've read online this seems similar to problems that others have faced and so I'm guessing that some tweaking of fread's parameters will solve this problem. Any ideas?
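One thing worth trying (a hedged sketch, not guaranteed for every malformed file): fread's quote argument can be set to "" to disable quote processing, so a lone double quote inside a field is read literally.

library(data.table)

# quote = "" turns off quote handling entirely
data_in <- data.table::fread(file_path, quote = "", stringsAsFactors = FALSE)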
