R: How to extract all dates from a block of text

Does anybody know of a package that can be used to extract all dates from a block of text using R? I am looking for something similar to search_dates from dateparser.search in Python (see page 11 of this: https://readthedocs.org/projects/dateparser/downloads/pdf/latest/#:~:text=dateparser.search.,or%20time%20and%20parse%20them.&text=text%20(str%7Cunicode)%20%E2%80%93,date%20and%2For%20time%20expressions.)
I have tried searching Google, but for R I only find options to convert a character string (with no surplus characters) to a date. I am after something that can find and identify multiple dates, in a range of different formats, in a large block of text.
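As a starting point, here is a minimal base-R sketch that pulls date-like substrings out of free text with regular expressions. It is not an equivalent of search_dates: the function name extract_dates and the three patterns are illustrative assumptions and only cover ISO dates, slash-separated dates and "day Month year" dates with English month names.

```r
# Sketch: find date-like substrings with base R regular expressions.
# extract_dates() is a hypothetical helper; the patterns cover three formats only.
extract_dates <- function(text) {
  months <- paste(month.name, collapse = "|")   # "January|February|..."
  patterns <- c(
    "\\b\\d{4}-\\d{2}-\\d{2}\\b",                          # 2021-03-15
    "\\b\\d{1,2}/\\d{1,2}/\\d{4}\\b",                      # 12/08/2022
    paste0("\\b\\d{1,2} (?:", months, ") \\d{4}\\b")       # 4 July 2022
  )
  combined <- paste(patterns, collapse = "|")
  # gregexpr finds every match; regmatches returns them in order of appearance
  regmatches(text, gregexpr(combined, text, perl = TRUE))[[1]]
}

txt <- "Signed on 2021-03-15, amended 4 July 2022 and filed 12/08/2022."
found <- extract_dates(txt)
print(found)
```

Each match is still a character string; converting it to a Date would be a separate step, for example as.Date with an explicit format chosen per pattern.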

Related

Reading dates as dates and not characters

The data I have been working with reads everything just fine, **except** for the date column. It always reads it as characters instead.
This would be fine except that, when I have lots of dates (like over 400 of them), then you can see something like this on a scatterplot:
Scatter Plot
In essence, I have two questions.
The first is: apart from using as.Date, which is fine when I need something temporary, how do I permanently make R read the date column as proper dates? That is, is there a way to have the date column read as dates when I am using read.csv or read.excel?
The second is: when graphing, as in the graph I have included here, how can I show only some of the labels so that the axis is not so cramped? I still want all the data, but not all those labels.
I was hoping to add the data file, but I am unaware of how to add Excel/CSV files on this website and my data set is quite long (n = 491). I have 9 columns, 1 of which is the date column. The others are numbers or letters (the latter stored as character). I can add a few rows just to help out.
Some of the data set
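One possible approach to the first question is to declare the column type when importing. The sketch below uses made-up data; the column names date and value are hypothetical, and it assumes the dates are either in ISO form (2020-01-01) or in a known day/month/year layout.

```r
# Made-up data standing in for the real file; column names are hypothetical.
csv_text <- "date,value
2020-01-01,3.2
2020-01-02,4.1
2020-01-03,2.7"

# colClasses asks read.csv to parse the column as Date at import time.
# This works directly for ISO-formatted dates.
dat <- read.csv(text = csv_text, colClasses = c(date = "Date", value = "numeric"))
str(dat)

# For other layouts, read the column as character and convert once,
# stating the format explicitly.
raw <- read.csv(text = "date,value\n01/02/2020,3.2", stringsAsFactors = FALSE)
raw$date <- as.Date(raw$date, format = "%d/%m/%Y")
str(raw)
```

On the second question: once the column really is of class Date, base plot() normally picks a handful of axis ticks by itself instead of labelling every value, and axis.Date() with an `at` sequence such as seq(min(dat$date), max(dat$date), by = "month") is one way to place the labels by hand.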

Is there a digit separator in R like using underscore in Python?

In Python there is a way to make large numerals more readable using the underscore, e.g.
1000000 == 1_000_000, as discussed several times here. However, is there something similar in R?
Googling it just leads me to how to format the variable as a string using format and formatC. I already tried 1_000, 1 000, 1'000 and 1,000 but they only produce error messages. Is there really no workaround?
Something like this:
as.numeric(gsub("_","", "1_000_000"))
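To expand on that a little: R's numeric literals do not accept a separator character, but exponent notation is valid and often reads just as well. The helper name num below is hypothetical; it only wraps the gsub idea above.

```r
# Exponent notation is a valid numeric literal in R.
million <- 1e6
print(million == 1000000)

# Hypothetical helper: parse an underscore-separated string into a number.
num <- function(x) as.numeric(gsub("_", "", x, fixed = TRUE))
print(num("1_000_000") == million)

# For display only, format() can insert a thousands separator.
pretty_million <- format(million, big.mark = ",", scientific = FALSE)
print(pretty_million)
```

The helper costs a string parse at run time, so for constants in code 1e6 is probably the more convenient choice.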

How do I export a custom list of numbers and letters to Excel from R?

To help with some regular label-making and printing I need to do, I am looking to write a script that allows me to enter a range of sequential numbers (some with string identifiers) that I can export with a specific format to Excel. For example, if I entered the range '1:16', I am looking for an output in Excel exactly as:
Example Excel Output
For each unique sequential number (i.e., 1 to 16) the first five rows must be labeled with a 'U', the next three rows with an 'F' and the last two rows must be the number alone. The final exported matrix will be n columns x 21 rows, where n will vary depending on the number range I enter.
My main problem is in writing to Excel. I can't find out how to customize this output and write to specific rows and columns as in the example above. I am limited to 'openxlsx' since I work on a corporate secure workstation. Here is what I have so far:
Example Code
Any help you may have would be very appreciated, thanks in advance!
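A rough sketch of one way this could be done, under loudly stated assumptions, since the example screenshot is not available here: labels are assumed to look like "1U" and "1F", and the 21 rows are assumed to be two numbers stacked per column with one blank row between them. The helper names make_block and make_label_matrix are hypothetical. The idea is to build the whole layout as a character matrix first and then hand it to openxlsx::writeData in a single call.

```r
# ASSUMED layout: each number fills 10 rows (5 "U", 3 "F", 2 plain);
# two numbers share a column, separated by one blank row (10 + 1 + 10 = 21).
make_block <- function(i) {
  c(rep(paste0(i, "U"), 5), rep(paste0(i, "F"), 3), rep(as.character(i), 2))
}

make_label_matrix <- function(nums) {
  if (length(nums) %% 2 == 1) nums <- c(nums, NA)   # pad odd-length ranges
  pairs <- matrix(nums, nrow = 2)                   # column j holds the j-th pair
  apply(pairs, 2, function(p) {
    top <- make_block(p[1])
    bottom <- if (is.na(p[2])) rep("", 10) else make_block(p[2])
    c(top, "", bottom)
  })
}

labels <- make_label_matrix(1:16)
print(dim(labels))

# Write the matrix as-is, without a header row, if openxlsx is installed.
if (requireNamespace("openxlsx", quietly = TRUE)) {
  wb <- openxlsx::createWorkbook()
  openxlsx::addWorksheet(wb, "Labels")
  openxlsx::writeData(wb, "Labels", as.data.frame(labels),
                      startCol = 1, startRow = 1, colNames = FALSE)
  openxlsx::saveWorkbook(wb, file.path(tempdir(), "labels.xlsx"), overwrite = TRUE)
}
```

If the real layout differs, only make_label_matrix needs to change; startCol and startRow in writeData control where the block lands on the sheet.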

RStudio numeric integer display format options

I don't want the display format like this: 2.150209e+06
the format I want is 2150209
because when I export data, a format like 2.150209e+06 causes me a lot of trouble.
I did some searching and found that this function could help:
formatC(numeric_summary$mean, digits=1,format="f").
Can I set an option to change this permanently? I don't want to apply this function to every variable in my data, because I run into this problem very often.
One more question: can I change the class of all integer variables to numeric automatically? With integer columns, summing the whole column often fails with "integer overflow - use sum(as.numeric(.))".
I don't need the integer class; all I need is numeric. Can I set an option to change the integer class to numeric, please?
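On the integer question: as far as I know there is no global option in base R that does this, but converting every integer column after import takes two lines. A minimal sketch on a toy data frame (the column names are hypothetical):

```r
# Toy data frame; column names are hypothetical.
df <- data.frame(id = c(2147483647L, 1L),
                 score = c(1.5, 2.5),
                 name = c("a", "b"))

# Find the integer columns and convert only those to double ("numeric").
is_int <- vapply(df, is.integer, logical(1))
df[is_int] <- lapply(df[is_int], as.numeric)

str(df)
total <- sum(df$id)   # summed as double, so no integer overflow
print(total)
```

Other column types are left untouched. Passing colClasses to read.csv at import time is another way to get doubles from the start.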
I don't know how you are exporting your data, but when I use write.csv with a data frame containing numeric data, I don't get scientific notation, I get the full number written out, including all decimal precision. Actually, I also get the full number written out even with factor data. Have a look here:
df <- data.frame(c1=c(2150209.123, 10001111),
c2=c('2150209.123', '10001111'))
write.csv(df, file="C:\\Users\\tbiegeleisen\\temp.txt")
Output file:
"","c1","c2"
"1",2150209.123,"2150209.123"
"2",10001111,"10001111"
Update:
It is possible that you are just dealing with a data rendering issue. What you see in the R console or in your spreadsheet does not necessarily reflect the precision of the underlying data. For instance, if you are using Excel, you highlight a numeric cell, press CTRL + 1 and then change the format. You should be able to see full/true precision of the underlying data. Similarly, the number you see printed in the R console might use scientific notation only for ease of reading (SN was invented partially for this very reason).
Thank you all.
For the example above, I tried this:
df <- data.frame(c1=c(21503413542209.123, 10001111),
c2=c('2150209.123', '100011413413111'))
c1 in df is shown in scientific notation; c2 is not.
Then I ran write.csv(df, file="C:\\Users\\tbiegeleisen\\temp.txt").
It does output all digits.
Can I disable scientific notation in R, please? It still causes me trouble, even though all digits were exported to the txt file.
Sometimes I want to visually compare two big numbers. For example, if I run
df <- data.frame(c1=c(21503413542209.123, 21503413542210.123),
c2=c('2150209.123', '100011413413111'))
df will be
c1 c2
2.150341e+13 2150209.123
2.150341e+13 100011413413111
The two values for c1 are actually different, but I cannot tell them apart in R unless I export them to txt. The numbers here are fake, but I encounter the same problem every day.
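A minimal sketch of two ways to get fixed notation for values like these: format() with scientific = FALSE for a one-off, and options(scipen = ...) for the whole session. The value 999 is just a conventionally large penalty, not a special constant.

```r
x <- c(21503413542209.123, 21503413542210.123)

# One-off: ask format() for fixed notation.
fixed <- format(x, scientific = FALSE)
print(fixed)

# Session-wide: a large scipen penalty makes R prefer fixed notation
# whenever it prints numbers.
old <- options(scipen = 999)
printed <- capture.output(print(x))
print(printed)
options(old)   # restore the previous setting
```

To make the setting apply to every session, options(scipen = 999) can go in the .Rprofile file. Raising options(digits = ...) as well shows more significant digits, though doubles only carry about 15 to 16 of them.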

general to number in R

I have data in Excel, and after reading it into R it appears as follows:
lob2 lob3
1.86E+12 7.58E+12
I want it as
lob2 lob3
1857529190776.75 7587529190776.75
This difference changes the results of my later analysis.
How is the data stored in Excel (does it think it is a number, a string, a date, etc.)?
How are you getting the data from Excel to R? If you save the data as a .csv file and then read it into R, look at the intermediate file: Excel is known to abbreviate numbers when saving, and R would then see the abbreviated values (or even character strings) instead of the full numbers. You need to find a way to tell Excel to export the data in the correct format with the correct precision.
If you are using a package (there is more than one), look into the details of that package for how to grab the numbers correctly (you may need to make changes in Excel so that it knows they are numbers).
Lastly, what does the str function on your R object say? It could be that R is storing the proper numbers and only displaying the short version as mentioned in the comments. Or, it could be that R received strings that did not convert nicely to numbers and is storing them as characters or factors. The str function will let you see how your data is stored in R, and therefore how to convert or display it correctly.
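The checks described above might look like the sketch below. The value is taken from the question, and the variable names are only for illustration.

```r
lob2 <- 1857529190776.75

# str() reports how the value is stored (num, chr, Factor, ...).
str(lob2)

# The console abbreviates by default, but the stored value keeps its precision.
print(lob2)
full <- sprintf("%.2f", lob2)   # fixed notation with two decimals
print(full)

# If the abbreviated text is what actually reached R, the lost digits
# cannot be recovered by converting it back to a number.
from_text <- as.numeric("1.86E+12")
print(from_text - lob2)
```

If str() shows chr or Factor instead of num, the column arrived as text and needs as.numeric (on as.character of the column, for factors) before any analysis.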
