I have 2 tables, Table_1 and Table_2, on one sheet (sheet_5) in my excel workbook myWorkbook.xlsx. I know there are plenty of packages that allow you to specify which sheet to load in, but is there a way to load in only the table(s) you want? In this case, I want to load Table_2 only.
Thanks
The most recent version of readxl lets you set the cell range to read. IMHO this is by far the best package for reading Excel files into R.
See the part of this post entitled "Specifying the data rectangle": https://blog.rstudio.org/2017/04/19/readxl-1-0-0/
The syntax should be very familiar to an Excel user.
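A minimal sketch of the range argument, using the example workbook that ships with readxl as a stand-in for myWorkbook.xlsx. The sheet name and the A1:C5 rectangle are only illustrative; readxl reads a cell rectangle, so you need to know where the table sits on the sheet.

```r
library(readxl)

# Example workbook bundled with readxl (stand-in for myWorkbook.xlsx).
path <- readxl_example("datasets.xlsx")

# Read only the rectangle A1:C5 from the "mtcars" sheet.
# The first row of the range is used as the header row.
table_part <- read_excel(path, sheet = "mtcars", range = "A1:C5")

print(table_part)
```

For the workbook in the question the call would look like read_excel("myWorkbook.xlsx", sheet = "sheet_5", range = "H1:K20"), where H1:K20 is a placeholder for the cells that Table_2 actually occupies.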
Also please see this post for asking questions on SO:
How to make a great R reproducible example?
Background
I'm doing some data manipulation (joins, etc.) on a very large dataset in R, so I decided to use a local installation of Apache Spark and sparklyr to be able to use my dplyr code to manipulate it all. (I'm running Windows 10 Pro; R is 64-bit.) I've done the work needed, and now want to output the sparklyr table to a .csv file.
The Problem
Here's the code I'm using to output a .csv file to a folder on my hard drive:
spark_write_csv(d1, "C:/d1.csv")
When I navigate to the directory in question, though, I don't see a single csv file d1.csv. Instead I see a newly created folder called d1, and when I click inside it I see ~10 .csv files all beginning with "part".
The folder also contains the same number of .csv.crc files, which I see from Googling are "used to store CRC code for a split file archive".
What's going on here? Is there a way to put these files back together, or to get spark_write_csv to output a single file like write.csv?
Edit
A user below suggested that this post may answer the question, and it nearly does, but it seems like the asker is looking for Scala code that does what I want, while I'm looking for R code that does what I want.
I had the exact same issue.
In simple terms, the data is partitioned for computational efficiency. If you have several partitions, multiple workers/executors can each write one partition of the table in parallel. In contrast, if you only have one partition, the csv file can only be written by a single worker/executor, making the task much slower. The same principle applies not only to writing tables but also to parallel computations.
For more details on partitioning, you can check this link.
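Suppose I want to save table as a single file with the path path/to/table.csv. I would do this as follows:
# sdf_repartition() returns a new table, so the result has to be assigned
table <- table %>% sdf_repartition(partitions = 1)
# Spark still creates a folder at this path; it will now contain a single part file
spark_write_csv(table, "path/to/table.csv")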
You can check full details of sdf_repartition in the official documentation.
The data is divided into multiple partitions. When you save the dataframe to CSV, you get one file per partition. Before calling spark_write_csv you need to bring all the data into a single partition to get a single file.
You can use coalesce to achieve this (in sparklyr, the function is sdf_coalesce):
sdf_coalesce(df, 1)
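If the partitioned output already exists, the part files can also be put back together in base R without rerunning Spark. This is only a sketch: it simulates the output folder with two small part files (made-up data) and assumes each part file carries its own header row.

```r
# Simulate a Spark-style output folder with two part files (hypothetical data).
out_dir <- file.path(tempdir(), "d1")
dir.create(out_dir, showWarnings = FALSE)
write.csv(data.frame(id = 1:2, x = c("a", "b")),
          file.path(out_dir, "part-00000.csv"), row.names = FALSE)
write.csv(data.frame(id = 3:4, x = c("c", "d")),
          file.path(out_dir, "part-00001.csv"), row.names = FALSE)

# Collect only the part files; the pattern skips the .crc checksum files.
part_files <- list.files(out_dir, pattern = "^part-.*\\.csv$", full.names = TRUE)

# Read each part and stack the pieces into one data frame.
combined <- do.call(rbind, lapply(part_files, read.csv, stringsAsFactors = FALSE))

# Write the result as one ordinary CSV file.
single_path <- file.path(tempdir(), "d1_single.csv")
write.csv(combined, single_path, row.names = FALSE)
```

This reads everything into R's memory, so it only suits results that fit on one machine.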
I have a data frame with 174603 rows and 178 columns, which I'm exporting to Excel using openxlsx::saveWorkbook (I use this package to get the cell formatting I need, with colors, header styles and so on). But the process is extremely slow (depending on the amount of memory used by the machine, it can take from 7 to 17 minutes!) and I need a way to reduce this significantly. It doesn't need to be seconds, but anything below 5 minutes would be OK.
I've already searched other questions, but they all seem to focus either on exporting to R (I have no problem with this) or on writing unformatted files from R (using write.csv and similar options).
Apparently I can't use the xlsx package because of the settings on my computer (industrial computer; check the comments on this question).
Any suggestions regarding packages or other functionalities inside this package to make this run faster would be highly appreciated.
This question is a bit old, but I had the same problem as you and came up with a solution worth mentioning.
There is a package called writexl that exports a data frame to Excel using the C library libxlsxwriter. You can export to Excel with the following code:
library(writexl)
writexl::write_xlsx(df, "Excel.xlsx", format_headers = TRUE)
The format_headers parameter only makes the titles centered and bold, but I edited the C code in the source of the writexl library, which is maintained by rOpenSci on GitHub.
You can download it or clone it. Inside the src folder you can edit the write_xlsx.c file.
For example, in the part where the header format is set
//how to format headers (bold + center)
lxw_format * title = workbook_add_format(workbook);
format_set_bold(title);
format_set_align(title, LXW_ALIGN_CENTER);
you can add these lines to give the header a background color
format_set_pattern (title, LXW_PATTERN_SOLID);
format_set_bg_color(title, 0x8DC4E4);
There are lots of formatting options you can find by searching the libxlsxwriter documentation.
When you have finished editing that file, and assuming the source code is in a folder called writexl, you can build and install the edited package with
shell("R CMD build writexl")
install.packages("writexl_1.2.tar.gz", repos = NULL)
Exporting again with the first chunk of code will generate the Excel file with the new formats, and faster than any other library I know of.
Hope this helps.
Have you tried:
write.table(GroupsAlldata, file = 'Groupsalldata.txt')
to get the data in txt format?
Then, in Excel, you can simply use 'Text to Columns' to put your data into a table.
Good luck
I am working with an Excel file. It has 3 books. I need help with extracting only one of the books into R. I did a Google search and could not glean the solution from the information. I am working on a MacBook; I am running the latest version of R.
More specifically, here is the question.
The data set has three workbooks "Sales", "Resources", and "Supplies". How do you read in only the items from the "Sales" workbook?
Thank you.
I think the best way is to save the worksheet you need as csv and use read.csv in R. But if you prefer to read the Excel file directly:
Use the package XLConnect
library(XLConnect)
df <- readWorksheetFromFile("excel_file.xlsx", sheet = "Sales")
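XLConnect depends on Java, so as an alternative sketch the same thing can be done with readxl. The example below uses the workbook bundled with readxl as a stand-in; for the file in the question, the path and sheet = "Sales" would be used instead.

```r
library(readxl)

# Example workbook bundled with readxl (stand-in for the real file).
path <- readxl_example("datasets.xlsx")

# List the sheet names first to check which ones exist.
sheets <- excel_sheets(path)
print(sheets)

# Read a single sheet by name; the other sheets are not loaded.
one_sheet <- read_excel(path, sheet = "iris")
```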
My question:
Can I change the parameters in R to use the source editor to also view >5MB data sets in R?
If not, what is your advice?
Background:
I recently stopped looking at data in Excel and switched to R entirely. As I did in Excel and still prefer to do in R, I like to look at the entire frame and then decide on filters.
Problem: Working with the World Development Indicators (WDI) data set, which is over 100MB, opening it in the source editor does not work. View(df) opens an empty tab in RStudio.
R threw another error when I selected the data set from the Files tab in the pane on the right of RStudio, which read:
The selected file 'wdi.csv' is too large to open in the source editor (the file is 104.5 MB and the maximum file size is 5MB).
Solutions?
My alter ego would tell me to increase the threshold of datasets' file size for the source editor, so I could investigate it there. In brief: change 5 to 200 MB. My alter ego would also tell me that I would probably encounter performance issues (since I am using a MacAir).
How I resolved the issue:
I used head() and dplyr's glimpse() to get a better idea, but ended up looking at the wdi matrix in excel and then filtered it out in R. Newly created dataframes could be opened in the source editor without any problems.
Thanks in advance!
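One way to inspect a large file without opening it in the source editor is to read only its first rows into a data frame. A minimal sketch, using a small generated file as a stand-in for wdi.csv (the column names and the 100-row limit are arbitrary choices for illustration):

```r
# Small stand-in for a large CSV file (hypothetical data).
csv_path <- tempfile(fileext = ".csv")
write.csv(data.frame(country = rep(c("A", "B"), 500), value = seq_len(1000)),
          csv_path, row.names = FALSE)

# Read only the first 100 rows instead of the whole file.
preview <- read.csv(csv_path, nrows = 100)

# Inspect the structure; open the data viewer only in an interactive session.
str(preview)
if (interactive()) View(preview)
```

For a data frame that is already loaded, View(head(df, 1000)) limits the viewer to the first rows in the same way.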
I am wondering whether it is possible to read an Excel file that is currently open, and capture the things you manually test into R.
I have an Excel file open (in Windows). In Excel, I have connected to an SSAS cube, and I do some manipulations using PivotTable Fields (like changing columns, rows, and filters) to understand the data. I would like to import some of the results I see in Excel into R to create a report (that is, without manually copying and pasting the results into R or saving Excel sheets to read them later). Is this possible in R?
UPDATE
I was able to find an answer, thanks to the awesome DescTools package created by Andri Signorell.
library(DescTools)
fxls<-GetCurrXL()
tttt<-XLGetRange(header=TRUE)
Copy the values you are interested in (from a single spreadsheet at a time) to the clipboard.
Then
dat = read.table('clipboard', header = TRUE, sep = "\t")
You can save the final excel spreadsheet as a csv file (comma separated).
Then use read.csv("filename") in R and go from there. Alternatively, you can use read.table("filename",sep=",") which is the more general version of read.csv(). For tab separated files, use sep="\t" and so forth.
I will assume this blog post will be useful: http://www.r-bloggers.com/a-million-ways-to-connect-r-and-excel/
In the R console, you can type
?read.table
for more information on the arguments and uses of this function. You can just repeat the same call in R after Excel sheet changes have been saved.
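To illustrate the relationship between read.csv and read.table described above, here is a small sketch using a temporary file with made-up data. Note that read.table needs header = TRUE spelled out, while read.csv assumes it.

```r
# Write a small comma-separated file to a temporary location (hypothetical data).
csv_path <- tempfile(fileext = ".csv")
write.csv(data.frame(id = 1:3, score = c(2.5, 3.0, 4.5)),
          csv_path, row.names = FALSE)

# read.csv assumes a header row and a comma separator.
a <- read.csv(csv_path)

# read.table is the general version: header and separator are given explicitly.
b <- read.table(csv_path, header = TRUE, sep = ",")

# Both calls produce the same data frame for this file.
print(identical(a, b))
```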