How to investigate 5MB+ datasets in RStudio's source editor? - r

My question:
Can I change a setting so that RStudio's source editor can also open data sets larger than 5 MB?
If not, what is your advice?
Background:
I recently stopped looking at data in Excel and switched to R entirely. As I did in Excel and still prefer to do in R, I like to look at the entire frame and then decide on filters.
Problem: The World Development Indicators (WDI) data set is over 100 MB, and opening it in the source editor does not work. View(df) opens an empty tab in RStudio.
RStudio threw another error when I selected the data set from the Files tab in the pane on the right, which read:
The selected file 'wdi.csv' is too large to open in the source editor (the file is 104.5 MB and the maximum file size is 5MB).
Solutions?
My alter ego would tell me to increase the source editor's file size threshold so I could investigate the data set there. In brief: change 5 MB to 200 MB. My alter ego would also tell me that I would probably run into performance issues (since I am using a MacBook Air).
How I resolved the issue:
I used head() and dplyr's glimpse() to get a better idea, but ended up looking at the WDI matrix in Excel and then filtering it in R. Newly created data frames could be opened in the source editor without any problems.
Thanks in advance!
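One way to get the "look at everything, then filter" workflow without the source editor is to preview the file from the console and open only a subset in the viewer. This is a minimal base R sketch, not a way to raise the 5 MB limit. The data frame and its columns (country, year, value) are made-up stand-ins for the real WDI file.

```r
# Stand-in data with hypothetical columns, written to a temporary CSV so the
# example runs without the real wdi.csv.
set.seed(42)
wdi <- data.frame(
  country = sample(c("Kenya", "Peru", "Norway"), 1000, replace = TRUE),
  year    = sample(1960:2020, 1000, replace = TRUE),
  value   = rnorm(1000)
)
csv_path <- tempfile(fileext = ".csv")
write.csv(wdi, csv_path, row.names = FALSE)

# 1. Read only the first rows to learn the column names and types.
peek <- read.csv(csv_path, nrows = 20)
str(peek)

# 2. Load the full file, then look at its size and a random sample of rows
#    instead of the whole frame.
full <- read.csv(csv_path)
dim(full)
preview <- full[sample(nrow(full), 100), ]

# 3. Decide on filters, then open only the filtered subset.
recent_kenya <- subset(full, country == "Kenya" & year >= 2000)
# View(recent_kenya)  # interactive RStudio sessions only
```

The filtered frame is small enough to open with View(), which matches the observation above that newly created data frames open without problems.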

Related

Is there a way to accelerate formatted table writing from R to excel?

I have a data frame with 174,603 rows and 178 columns, which I'm exporting to Excel using openxlsx::saveWorkbook (I use this package to get formatted cells, with colors, header styles and so on). But the process is extremely slow (depending on the amount of memory in use on the machine, it can take from 7 to 17 minutes!) and I need a way to reduce this significantly (it doesn't need to be seconds, but anything below 5 minutes would be OK).
I've already searched other questions, but they all seem to focus either on importing into R (I have no problem with this) or on writing non-formatted files from R (using write.csv and similar options).
Apparently I can't use the xlsx package because of the settings on my computer (an industrial computer; see the comments on this question).
Any suggestions regarding packages or other functionalities inside this package to make this run faster would be highly appreciated.
This question is a bit old, but I had the same problem and came up with a solution worth mentioning.
There is a package called writexl that exports a data frame to Excel using the C library libxlsxwriter. You can export to Excel with the following code:
library(writexl)
writexl::write_xlsx(df, "Excel.xlsx",format_headers = TRUE)
The format_headers parameter only applies centered, bold titles, but I edited the C code in the source of the writexl library, which rOpenSci hosts on GitHub.
You can download or clone it. Inside the src folder you can edit the write_xlsx.c file.
For example, in the part that sets up the header format
//how to format headers (bold + center)
lxw_format * title = workbook_add_format(workbook);
format_set_bold(title);
format_set_align(title, LXW_ALIGN_CENTER);
you can add these lines to give the header a background color
format_set_pattern (title, LXW_PATTERN_SOLID);
format_set_bg_color(title, 0x8DC4E4);
There is a lot more formatting you can do; search the libxlsxwriter documentation.
When you have finished editing that file, and given that you have the source code in a folder called writexl, you can build and install the edited package with
shell("R CMD build writexl")
install.packages("writexl_1.2.tar.gz", repos = NULL)
Exporting again with the first chunk of code will generate the formatted Excel file, and faster than any other library I know of.
Hope this helps.
Have you tried
write.table(GroupsAlldata, file = 'Groupsalldata.txt')
to get it in txt format?
Then, in Excel, you can simply use 'Text to Columns' to put your data into a table.
Good luck
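If you go the text-file route, note that write.table() with its defaults writes space-separated values, quotes strings and adds a row-name column, all of which make the Excel import harder. Here is a hedged sketch with a tab separator instead. The small data frame is made up and stands in for GroupsAlldata.

```r
# Hypothetical stand-in for GroupsAlldata.
groups <- data.frame(
  group = c("a", "b", "c"),
  total = c(10.5, 20, 31),
  stringsAsFactors = FALSE
)

txt_path <- tempfile(fileext = ".txt")

# Tab-separated, no quotes, no row names: one clean header line plus data.
write.table(groups, file = txt_path, sep = "\t", quote = FALSE, row.names = FALSE)

# Read it back to confirm the layout survived the round trip.
check <- read.delim(txt_path, stringsAsFactors = FALSE)
```

This is fast, but it carries no cell formatting, so colors and header styles would still have to be applied in Excel.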

BlueSky Statistics - File naming conventions

When I open the file 'TestFile.RData' in BlueSky Statistics, it is opened with this name plus Dataset3 attached. The tab looks like this: TestFile.RData(Dataset3)
I would like to use my original name when writing R code in the R command editor, but from what I see BlueSky wants me to use the Dataset3 name.
Please clarify this file name issue for me.
If my original name is changed, I see problems with reproducibility, as I cannot control the assigned name Dataset3.
Regards
Your observation is correct. Whenever a file that is not an R data file is opened in BlueSky Statistics, we create a data frame object in R. We name these objects sequentially: Dataset1, Dataset2, Dataset3, etc. We could have used the name of the original file, but we went with Dataset1, Dataset2, Dataset3 for compatibility with SPSS. Many of our users come from SPSS, and that is exactly what SPSS does. There is a simple workaround, see below.
To work around this, you need to change the default code we use to open the dataset. To see the code in the output window, go to the top-level menu Tools -> Configuration settings, select the Output tab, and select the checkbox next to the text "Show syntax in output window".
The code you will see in the output window when you open a dataset is
BSkyloadDataset(fullpathfilename='C:/Users/Aaron_2/Documents/BlueSky Statistics/Sample Datasets/IRT/engagement.csv', filetype='CSV', worksheetName='',load.missing=FALSE, character.to.factor=FALSE, csvHeader=TRUE, isBasketData=FALSE, trimSPSStrailing=FALSE, sepChar=',', deciChar='.', datasetName='Dataset2')
All you need to do is change the datasetName parameter to the name you want to use
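As a sketch of that change, the dataset name can be derived from the file name rather than typed by hand. Only the BSkyloadDataset() call and its arguments come from the output shown above. The base R lines around it are an assumption about how you might build the name, and the call is left commented out so the snippet also runs outside BlueSky.

```r
csv_file <- "C:/Users/Aaron_2/Documents/BlueSky Statistics/Sample Datasets/IRT/engagement.csv"

# Strip the directory and the extension, then make the result a valid R name.
dataset_name <- make.names(tools::file_path_sans_ext(basename(csv_file)))
dataset_name

# Inside BlueSky, pass it as datasetName (all other arguments unchanged):
# BSkyloadDataset(fullpathfilename = csv_file, filetype = 'CSV', worksheetName = '',
#   load.missing = FALSE, character.to.factor = FALSE, csvHeader = TRUE,
#   isBasketData = FALSE, trimSPSStrailing = FALSE, sepChar = ',', deciChar = '.',
#   datasetName = dataset_name)
```

Because the name then depends only on the file, the same script gives the same object name on every run.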
I will also add an enhancement so that, by default, a dataset opened from a file is named after the file. This is easy to do.
With R datasets this is not a problem because we load all data frame objects into the grid. The name of the dataset in the grid continues to be the name of the dataset object.
BlueSky is one of the few packages that use R and allow you to open and work on multiple data files at once. This naming approach is its way of allowing that while using files that have not yet been stored as R data files (.RData). After importing data from a non-R file, simply use "File> Save as" and save it as an R Object (.RData). The next time you open that file, it will maintain the name you've given it.
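The reason an .RData file keeps its name is that R's save() stores objects together with their names, and load() restores them under those names. A small base R illustration outside BlueSky follows; the object name engagement is only an example.

```r
# Create a data frame under the name we want to keep.
engagement <- data.frame(id = 1:3, score = c(2.5, 3.0, 4.5))

rdata_path <- tempfile(fileext = ".RData")
save(engagement, file = rdata_path)

# Remove the object, then restore it from the file.
rm(engagement)
restored <- load(rdata_path)

restored              # the names of the objects that were restored
exists("engagement")  # the data frame is back under its original name
```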

How to achieve the equivalent of Refresh All (in Excel) in R?

The title really says it all. I am trying to automate a report in R. Ideally, the Excel file would be imported into R and refreshed (as the Excel file has a direct link to SQL) before the result is finally output as an Excel file.
Finally, this report needs to automatically run without human intervention or supervision (hence the need for R).
Any help or ideas would be appreciated !

Accessing large spreadsheets written from R

I am using the following R script to write the data.table into the Excel file in my set directory. However, the file is several GB in size, as it has 50 million+ rows. Hence, upon opening the file, I just see a blank grey screen and nothing else.
How can I see the contents in the file?
The first line is just for illustration purpose.
library(data.table)
Final1 <- rep(iris, times = 1000000)
fwrite(Final1, "data2.csv")
You mention that this is part of the report. I would be willing to bet good money that whoever will be reading this report will not check all or most of the values by hand. In which case, you don't need a format that is easily browsable, e.g. xlsx or even csv. If this indeed is the case, you might want to try a (relational) database. If you do not have anything centralized, you might want to give SQLite a try. You save everything into one file which acts as a database. There are packages that handle this interaction in R. You can try with sqldf or RSQLite.
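The SQLite suggestion could look roughly like the sketch below. It assumes the DBI and RSQLite packages are installed. The table name measurements and the columns are hypothetical, and a small data frame stands in for the 50 million+ row table.

```r
library(DBI)

# Small stand-in for the large table; column names are hypothetical.
big <- data.frame(
  id      = 1:1000,
  species = rep(c("setosa", "versicolor"), 500),
  width   = runif(1000)
)

# One file on disk acts as the database.
db_path <- tempfile(fileext = ".sqlite")
con <- dbConnect(RSQLite::SQLite(), db_path)
dbWriteTable(con, "measurements", big)

# Look at a few rows, or aggregate, without pulling everything into memory.
first_rows <- dbGetQuery(con, "SELECT * FROM measurements LIMIT 5")
counts <- dbGetQuery(
  con,
  "SELECT species, COUNT(*) AS n FROM measurements GROUP BY species"
)

dbDisconnect(con)
```

For data that does not fit comfortably in memory, dbWriteTable() can be called repeatedly with append = TRUE to load the table in chunks.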

RStudio: Save data from Viewer

Due to a stupid mistake and a defective USB stick I lost a bunch of data and I am now trying to recover it.
Some of the data is still displayed in the Viewer tabs when I open RStudio. However, I can only save R scripts and R Markdown files out of the Viewer. The displayed data frames are nice and complete, and I can sort and filter them in the Viewer, but I cannot find a "save" option. Is there a way to save this displayed data as RData, csv or something similar?
I would suggest three different approaches, but none of them will necessarily work. I sort them according to my prior expectations of success.
1) You can copy the whole data frame from the viewer and paste it into external spreadsheet software to obtain a .csv file, e.g. through the "Text to Columns" button in MS Excel.
2) You can copy and paste the character string into an object that is passed to the text argument of read.table or to dput(). Check out the "Copy your data" section of this famous SO question.
3) Finally, you can use Google Chrome's "Inspect Element" function to inspect the HTML code of the object in the viewer. Once you find the table, you can copy and paste it and scrape it with an HTML parser, e.g. using the rvest package. Good luck!
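Option 2 could be sketched as follows. The pasted string is made-up sample data, and it assumes the copied text arrives as whitespace-separated columns with a header row; the real viewer output may need a different sep.

```r
# Text as it might look after copying from the viewer (hypothetical content).
pasted <- "id name score
1 alice 3.5
2 bob 4.0"

# Parse the string as if it were a file.
recovered <- read.table(text = pasted, header = TRUE, stringsAsFactors = FALSE)

# Persist it right away, both as CSV and as R code that recreates the object.
csv_out <- tempfile(fileext = ".csv")
write.csv(recovered, csv_out, row.names = FALSE)
dput(recovered)
```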
Thanks everybody, there is a way to access the data as Rdata files, which was kindly explained to me here
I used the second method and located the files in %localappdata%\RStudio-Desktop\viewer-cache.
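For anyone in the same situation, here is a rough sketch of how the cached files could be pulled back into R. It assumes the files in that folder can be read with load(), which I have not verified for every RStudio version. load_cached_frames() is a hypothetical helper, and the example uses a temporary folder in place of the real cache.

```r
# Hypothetical helper: try load() on every file in a folder and keep whatever
# data frames come out. Files that cannot be loaded are skipped.
load_cached_frames <- function(cache_dir) {
  frames <- list()
  for (f in list.files(cache_dir, full.names = TRUE)) {
    env <- new.env()
    loaded <- tryCatch(load(f, envir = env), error = function(e) character(0))
    for (nm in loaded) {
      obj <- get(nm, envir = env)
      if (is.data.frame(obj)) {
        frames[[paste(basename(f), nm, sep = ":")]] <- obj
      }
    }
  }
  frames
}

# Stand-in for the real cache folder, which on Windows would be
# file.path(Sys.getenv("LOCALAPPDATA"), "RStudio-Desktop", "viewer-cache").
cache_dir <- tempfile("viewer-cache")
dir.create(cache_dir)
x <- data.frame(a = 1:3, b = c("u", "v", "w"))
save(x, file = file.path(cache_dir, "example.Rdata"))

frames <- load_cached_frames(cache_dir)
names(frames)
```

Each recovered data frame can then be written out with write.csv() or saveRDS().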

Resources