How can I generate a fully-formatted Excel document as a final output?

I have built a script that extracts data from numerous backend systems and I want the output to be formatted like this:
Which package is the best and/or easiest to do this with?
I am aware of the xlsx package, but I also know there are others available, so I would like to know which is best in terms of ease and simplicity for achieving my desired output.
A little more detail:
If I run the report across seven days, then the resulting data frame is 168 rows deep (1 row represents 1 hour, 168 hours per week). I want each date (00:00 - 23:00) to be broken out into day-long blocks, as per the image I have provided.
(Also note that I am in London, England, currently on UTC+1, so right now the hourly breakdown for each date runs from 01:00 to 00:00 on the next day, because our backend systems run on UTC; that is fine.)
At present, I copy and paste (transpose) the values across manually, but want to be able to fully automate the process so that I can run the script (function), and have the resulting output looking like the image.
This is what the current final output looks like:

Try the openxlsx package. It offers a lot of custom formatting for .xlsx documents and is actively developed, with fairly responsive maintainers on GitHub issues. The vignettes on the CRAN website are particularly useful.
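To make the recommendation concrete, here is a minimal sketch of the transpose-and-format step with openxlsx. The data frame, dates, and column layout are invented stand-ins for the asker's 168-row report; only the openxlsx calls themselves (createWorkbook, writeData, createStyle, addStyle, setColWidths, saveWorkbook) are the real API:

```r
library(openxlsx)

# Invented stand-in for the 168-row report: 24 hourly rows x 7 day columns.
hours <- sprintf("%02d:00", c(1:23, 0))   # 01:00 ... 00:00, per the UTC note
days  <- format(seq(as.Date("2019-01-07"), by = "day", length.out = 7), "%d/%m/%Y")
wide  <- as.data.frame(matrix(rnorm(168), nrow = 24, dimnames = list(NULL, days)))
wide  <- cbind(Hour = hours, wide)

wb <- createWorkbook()
addWorksheet(wb, "Report")
writeData(wb, "Report", wide)

# Bold, centred header row with a bottom border:
hdr <- createStyle(textDecoration = "bold", halign = "center", border = "bottom")
addStyle(wb, "Report", hdr, rows = 1, cols = seq_len(ncol(wide)), gridExpand = TRUE)
setColWidths(wb, "Report", cols = seq_len(ncol(wide)), widths = "auto")

out <- file.path(tempdir(), "weekly_report.xlsx")
saveWorkbook(wb, out, overwrite = TRUE)
```

Reshaping the 168-row frame into one column per day (the manual transpose) is the `matrix(..., nrow = 24)` step; everything after that is pure formatting.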

Related

Poor performance Arrow Parquet multiple files

After watching the mind-blowing webinar at the RStudio conference here, I was pumped enough to dump an entire SQL Server table to Parquet files. The result was 2,886 files (78 entities over 37 months) with around 700 million rows in total.
Doing a basic select returned all rows in less than 15 seconds! (An out-of-this-world result!) At the webinar, Neal Richardson from Ursa Labs was showcasing the NYC Taxi dataset (2 billion rows) in under 4 seconds.
I felt it was time to do something more daring, like a basic mean, sd, and mode over a year's worth of data, but that took about a minute per month, so I sat 12.4 minutes waiting for a reply from R.
What is the issue? My badly written R query? Or simply too many files, or the granularity (decimal values)?
Any ideas?
PS: I did not want to file a Jira case on the Apache Arrow board, as Google search does not seem to retrieve answers from there.
My guess (without actually looking at the data or profiling the query) is two things:
You're right, the decimal type is going to require some work in converting to an R type because R doesn't have a decimal type, so that will be slower than just reading in an int32 or float64 type.
You're still reading in ~350 million rows of data to your R session, and that's going to take some time. In the example query on the arrow package vignette, more data is filtered out (and the filtering is very fast).
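The practical fix that follows from the second point is to push the filter (and, in recent arrow releases, the aggregation) into the Arrow scan rather than collecting raw rows first. A self-contained sketch with an invented toy dataset, not the asker's schema:

```r
library(arrow)
library(dplyr)

# Tiny stand-in written as partitioned Parquet (the real case is ~2,886
# files); `entity` is an invented partition column, not the asker's schema.
td <- file.path(tempdir(), "parquet_demo")
write_dataset(mutate(mtcars, entity = ifelse(am == 1, "A", "B")),
              td, partitioning = "entity")

ds <- open_dataset(td)

# Filter and aggregate *before* collect(): Arrow prunes partitions and
# row groups, and R only ever materializes the small summary table.
res <- ds %>%
  filter(cyl >= 6) %>%
  group_by(entity) %>%
  summarise(n = n(), mean_mpg = mean(mpg)) %>%
  collect()
```

The point is where `collect()` sits: everything above it runs in Arrow's C++ engine, so R never holds the ~350 million rows at once.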

Trouble importing Excel time Duration data into R (with GUI EZR)

Tearing my hair out on this one. It took me hours just to get rJava up and running (because Mac OS X El Capitan did not want to play nice with Java) in order to load Excel-specific data-importing packages, etc. But in the end this hasn't helped my problem, and I'm just about at my wits' end. Please help.
Basic situation is this:
I have simple Excel data of time durations, spanning a couple of years. The two columns I'm importing are the time (duration) and the year (2016, 2017, etc.).
In Excel, the data is formatted as [h]:mm:ss so it displays correctly (the data relates to the number of hours worked in a month, so typically something like 80:xx:xx to 120:xx:xx). I'm aware that despite the cells being formatted as above, and only showing the relevant number of hours, Excel has in reality appended an (irrelevant, arbitrary) date to this hours data. I have searched and searched and found no way around this limitation in how Excel handles dates/times/durations.
I import this data into R via the "import data -> import from excel data set" menu item in R commander GUI, not the console.
However, when importing the data into R, it displays as a single number: approx. 110 hrs is converted to 4.xxxxx, not hh:mm:ss. So when running analyses and generating plots, instead of the actual (meaningful) 110:xx:xx-type data, a completely meaningless 4.xxxxxx is displayed.
If I change the formatting of the Excel cells to display the date as well as the time, rather than using the [h]:mm:ss cell format, R misinterprets the data as something equally useless, like 1901/02/04 05:23 am.
I have installed and loaded a variety of packages such as xlsx, XLConnect, and lubridate, but it hasn't made any difference to how R interprets the Excel data on import, from the GUI at least.
Please tell me how do I either
a) edit the raw data to a format that R will understand as a time duration (and nothing but a time duration) in hh:mm:ss format, or
b) format the current data from within R after importing, so that it displays the data in the correct way rather than a useless number or arbitrary date/time?
[Please note: I can use the console, when given the commands etc needed to be executed. But I need to find a solution that ultimately will allow the data to be imported and/or manipulated from within the GUI, not from typing a bunch of commands into the console, as the end user (not me) has zero programming ability and cannot use a console, and will only ever be using R via the GUI.]
Is your code importing the data from Excel as seconds?
library(lubridate)
duration <- lubridate::as.duration(400000)
as.numeric(duration, "hours")
## 111.1111
as.numeric(duration, "days")
## 4.62963
seconds_to_period(400000)
## "4d 15H 6M 40S"

read RTF files into R containing econ data (a lot of numbers)

Recently I obtained a lot of RTF files that contain economic data for analysis I need to do. Unfortunately, this is how the Bureau of Statistics of my country provides time series data covering a long time span. For a one-time need to look up a particular indicator for 10 years or so, I'd be fine finding the values by hand using Word/Notepad/TextEdit (on Mac). But my problem is that I have 15 files of data that I need to combine somehow into one dataset for my work. Before I even start, though, I don't have a clue whether it is possible to read those files into an appropriate format (a data.frame). I wanted to ask for expert opinion on how to approach this task. An example of one of the files can be downloaded here:
https://www.dropbox.com/s/863ikx6poid8unc/Export_for_SO.rtf?dl=0
All values are in Russian. Dataset represents export of particular product (first column) across countries (second column) in US dollars for 2 periods.
Thank you.
Use the code found at https://datascienceplus.com/how-to-import-multiple-csv-files-simultaneously-in-r-and-create-a-data-frame/, replacing read_csv with striprtf's read_rtf.
You may want to manually convert your files to another format using an office suite or a text editor. You should be able to save as in another format.
While in R, you may want to give striprtf a try. I'm guessing you will still have to clean your data a bit afterward.
You can install the package like this:
install.packages("striprtf")
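As a sketch of the striprtf route (the stand-in RTF file and the looping step are invented; only striprtf::read_rtf is the real API, and its output is plain text lines that will still need parsing into columns):

```r
library(striprtf)

# A tiny stand-in RTF file so the sketch is self-contained; the asker's
# real files are the Bureau of Statistics exports.
rtf_file <- file.path(tempdir(), "Export_demo.rtf")
writeLines("{\\rtf1\\ansi\\deff0 First row\\par Second row\\par}", rtf_file)

txt <- read_rtf(rtf_file)   # character vector, one element per paragraph/row

# For all 15 files: read each, then parse/clean into data frames and bind.
files <- rtf_file  # in practice: list.files(pattern = "\\.rtf$", full.names = TRUE)
all_text <- lapply(files, read_rtf)
```

Since the table structure (product, country, two value columns) is regular, splitting each line on runs of whitespace or tabs should get it most of the way to a data.frame; the Russian labels come through as ordinary UTF-8 text.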

SPSS date format when imported into R

I have not worked with SPSS (.sav) files before and am trying to work with some data files provided to me by importing them into R. I did not receive any explanation of the files, and because communication is difficult I am trying to figure out as much as I can on my own.
Here's my first question. This is what the Date field looks like in an R data frame after import:
> dataset2$Date[1:4]
[1] 13608172800 13608259200 13608345600 13608345600
I don't know what dates the data is supposed to be for, but I found that if I divide the above numbers by 10, that seems to give a reasonable date (in February 2013). Can anyone confirm this is indeed what the above represents?
My second question is regarding another column called Begin_time. Here's what that looks like:
> dataset2$Begin_time[1:4]
[1] 29520 61800 21480 55080
Any idea what this is representing? I want to believe this is some representation of time of day because the records are for wildlife observations, but I haven't got more info than that to try to guess. I noticed that if I take the difference between End_Time and Begin_time I get numbers like 120 and 180, which seems like minutes to me (3 hours seems reasonable to observe a wild animal), but the absolute numbers are far greater than the number of minutes in a day (1440), so that leaves me puzzled. Is this some time keeping format from SPSS? If so, what's the logic?
Unfortunately, I don't have access to SPSS, so any help would be much appreciated.
I had the same problem and this function is a good solution:
pss2date <- function(x) as.Date(x/86400, origin = "1582-10-14")
This is where I found the answer:
http://scs.math.yorku.ca/index.php/R:_Importing_dates_from_SPSS
Dates in SPSS Statistics are represented as floating-point doubles holding the number of seconds since October 14, 1582. If you use the SPSS R plug-in APIs, they can be converted to R dates automatically, but any proper converter should be able to do this for you.
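For the second question, a base-R sketch that pairs that origin with a guess about Begin_time. Treating Begin_time as seconds since midnight is my assumption, not something the data confirms (under it, the 120/180 differences would be 2-3 minute observations, not hours), so it needs checking against known observation times:

```r
# Dates: seconds since 1582-10-14, base R only:
spss2date <- function(x) as.Date(x / 86400, origin = "1582-10-14")

# Times of day, *assuming* Begin_time is seconds since midnight:
sec_to_hms <- function(x) {
  x <- as.integer(x)
  sprintf("%02d:%02d:%02d", x %/% 3600L, (x %% 3600L) %/% 60L, x %% 60L)
}

spss2date(13608172800)   # "2014-01-04"
sec_to_hms(29520)        # "08:12:00"
```

Note that under the standard SPSS origin the first value converts to early January 2014, not the February 2013 that dividing by ten suggested, so it is worth validating against a record whose date is independently known.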

Import data in R from Access

I'm trying to import a table from Microsoft Access (.accdb) to R.
The code that I use is:
library(RODBC)
testdb <- file.path("modelEAU Database V.2.accdb")
channel <- odbcConnectAccess2007(testdb)
WQ_data <- sqlFetch(channel, "WaterQuality")
It seems to work, but the problem is importing the date and time data. In the Access file there are two columns: one with a date field (dd/mm/yyyy) and one with a time field (hh:mm:ss). When I import them into R, the date column appears in yyyy-mm-dd format, and the time column comes through as 1899-12-30 hh:mm:ss. R also doesn't recognise these as date/time variables, so I can't work with them.
I also tried the mdb.get function, but that didn't work either.
Does somebody know how to import data from Access into R while defining the date and time formats? Alternatively, any idea how to import the Access file as a text file?
Note: I'm working with Office 2010 and R version 2.14.1.
Thanks a lot in advance.
Look at the result of running str() on your data frame. That will tell you more about how the data is actually stored. Generally, dates and times are stored as a number counted from an origin date (Access uses 30 Dec 1899 because MS thought 1900 was a leap year). Sometimes the value is the number of days since the origin, with time represented as a fraction of a day; other times it is the number of seconds (or milliseconds) since the origin.
You will need to see how the data was sent (whether access and odbc converted to strings first, or sent days or seconds), then you will have a better feel for how to work with these (possibly converting) in R.
There is an article in the June 2004 edition of R News (the predecessor to the R Journal) that details the common ways to deal with dates and times in R; it could be very useful to you.
You should decide what you want to end up with, a single column of DateTimes, 2 columns with numbers, 2 columns with characters, etc.
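Once str() shows what came through, the usual fix for the case described here is to strip the bogus 1899-12-30 date off the time column and add the remaining seconds onto the real date. A base-R sketch under that assumption (the two values below are invented stand-ins for one row of the fetched table):

```r
# Invented stand-ins for one row of the two columns from sqlFetch():
d  <- as.Date("2011-06-15")                          # the date column
tm <- as.POSIXct("1899-12-30 14:30:00", tz = "UTC")  # the time column

# Seconds past midnight carried on Access's 1899-12-30 origin:
secs <- as.numeric(tm) - as.numeric(as.POSIXct("1899-12-30", tz = "UTC"))

# One combined date-time variable that R can treat as a proper time stamp:
dt <- as.POSIXct(format(d), tz = "UTC") + secs
format(dt)   # "2011-06-15 14:30:00"
```

Applied column-wise, this yields the single DateTime column option; keeping `d` and `secs` separately gives the two-numeric-columns option.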
