Consider the following vectors:
x <- c(123456789123456789, 987654321987654321)
y <- as.character(x)
I'm trying to export y into a csv file for later conversion into XLSX (don't ask, client requested), like so:
write.csv(y, file = 'y.csv', row.names = FALSE)
If I open y.csv in a plain text editor, I can see that the elements are correctly quoted, but when I open it in Excel the program insists on converting the column to numbers and showing the contents in scientific notation. This requires the extra step of reformatting the column, which can be a real time-waster when one works with lots of files.
How can I format a character vector of 18-digit numbers in R so that Excel doesn't display them in scientific notation?
Instead of opening the CSV file via File->Open, you can go to Data->From Text in Excel and, in the last step, set the Column data format to Text.
I'm not sure that this saves any time, though. You could also consider the WriteXLS package (or another package that writes directly to xls).
edit
Here's a much better way of forcing Excel to read as text:
write.csv(paste0('"=""', y, '"""'), 'y.csv', row.names = FALSE, quote = FALSE)
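To check what that call actually writes, here is a minimal sketch that sends the output to a temporary file and prints the raw lines. The literal strings in `y` and the temporary path are assumptions for illustration; the IDs are typed as strings because a double cannot hold all 18 digits exactly.

```r
# Sketch: inspect the raw CSV produced by the ="..." wrapper.
# Assumption: the IDs are kept as strings from the start.
y <- c("123456789123456789", "987654321987654321")

out <- tempfile(fileext = ".csv")
write.csv(paste0('"=""', y, '"""'), out, row.names = FALSE, quote = FALSE)

csv_lines <- readLines(out)
print(csv_lines)
# Each data line is a quoted CSV field holding the formula ="<digits>",
# which Excel evaluates to a text value when it opens the file.
```

The header line comes from the default column name that write.csv gives an unnamed vector; rename it before handing the file over if that matters.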
In Excel, select the column of numbers, and format them as text. (Format Cells -> Number tab -> Text in the list on the left)
Related
I have a series of massive data files that range in size from 800k to 1.4M rows, and one variable in particular has a set length of 12 characters (numeric data, but padded with leading zeros where the number has fewer than 12 digits). The column should look like this:
col
000000000003
000000000102
000000246691
000000000042
102851000324
etc.
I need to export these files for a client to a CSV file, using R. The final data NEEDS to retain the 12-character structure, but when I open the CSV files in Excel, the zeros disappear. This happens even after converting the entire data frame to character. The code I am using to do this is as follows.
library(dplyr)
library(rio)  # assumption: export() comes from the rio package

df1 <- df1 %>%
  mutate(across(everything(), as.character))
##### I did this for all data frames #####
export(df1, "df1.csv")
export(df2, "df2.csv")
....
export(df17, "df17.csv")
I've read a few other posts that say this is an Excel problem, and that makes sense, but given the number of data files and the amount of data, as well as the need for the client to be able to open the files in Excel, I need a way to do it on the front end in R. Any ideas?
Yes, this is definitely an Excel problem!
To demonstrate: in Excel, enter your column values, save the file as a CSV, and then re-open it in Excel. The leading zeros will disappear.
One option is to add a leading non-numeric character such as an apostrophe ('):
paste0("\' ", df$col)
Not a great option, but an option.
A slightly better option is to paste Excel's TEXT function around the character string. Excel will then evaluate the function when the file is opened.
df$col <- paste0("=Text(", df$col, ", \"000000000000\")")
#or
df$col <- paste0("=\"", df$col, "\"")
write.csv(df, "df2.csv", row.names = FALSE)
Of course, if the CSV file is saved and reopened, the leading zeros will disappear again.
Another option is to investigate saving the file directly as an .xlsx file with the "writexl", "xlsx", or a similar package.
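As a rough sketch of the formula-wrapper idea in base R, the snippet below pads some example IDs to 12 characters, wraps them as ="...", and reads the CSV back as text to confirm that nothing is lost on the R side. The sample values, the `excel_col` column name, and the temporary file are assumptions for illustration.

```r
# Sketch: pad IDs to 12 characters and wrap them so Excel treats them as text.
ids <- c(3, 102, 246691, 42, 102851000324)   # assumed sample values

df <- data.frame(col = sprintf("%012.0f", ids))   # zero-padded strings
df$excel_col <- paste0("=\"", df$col, "\"")       # hypothetical wrapped column

out <- tempfile(fileext = ".csv")
write.csv(df, out, row.names = FALSE)

# Reading everything back as character shows the file itself keeps the zeros;
# it is Excel's type guessing on open that strips them from the plain column.
back <- read.csv(out, colClasses = "character")
print(back)
```

Only the wrapped column needs to go to the client; the plain column is kept here just to show that the CSV content is intact.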
I'm trying to write data to an existing Excel file from R, while preserving the formatting. I'm able to do so following the answer to this question (Write from R into template in excel while preserving formatting), except that my file includes empty columns at the beginning, and so I cannot just begin to write data at cell A1.
As a solution I was hoping to be able to find the first non-empty cell, then start writing from there. If I run read.xlsx(file="myfile.xlsx") using the openxlsx package, the empty columns and rows are automatically removed, and only the data is left, so this doesn't work for me.
So I thought I would first load the worksheet using wb <- loadWorkbook("file.xlsx") so I have access to getStyles(wb) (which works). However, the subsequent command getTables returns character(0), and wb$tables returns NULL. I can't figure out why this is? Am I right in that these variables would tell me the first non-empty cell?
I've tried manually removing the empty columns and rows preceding the data, straight in the Excel file, but that doesn't change things. Am I on the right path here or is there a different solution?
As suggested by Stéphane Laurent, the package tidyxl offers the perfect solution here.
For instance, I can now search the Excel file for a character value, like my variable names of interest ("Item", "Score", and "Mean", which correspond to the names() of the data.frame I want to write to my Excel file):
library(tidyxl)
colnames <- c("Item", "Score", "Mean")
excelfile <- "FormattedSheet.xlsx"
x <- xlsx_cells(excelfile)
# Find all cells with character values: keep their address (i.e., cell) and character (i.e., value)
chars <- x[x$data_type == "character", c("address", "character")]
starting.positions <- unlist(
  chars[which(chars$character %in% colnames), "address"]
)
# returns: c(C6, D6, E6)
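To write data starting at one of those addresses, the A1-style address has to become a row and column number. Below is a small base-R sketch; `address_to_rowcol` is a hypothetical helper, not part of tidyxl or openxlsx. (I believe the xlsx_cells output also carries row and col columns, which would make the helper unnecessary, but treat that as an assumption to verify.)

```r
# Hypothetical helper: convert A1-style addresses ("C6", "AA10") to row/col numbers.
address_to_rowcol <- function(address) {
  letters_part <- toupper(gsub("[0-9]", "", address))   # column letters
  row <- as.integer(gsub("[^0-9]", "", address))        # row digits
  col <- vapply(strsplit(letters_part, ""), function(ch) {
    # Treat the letters as a base-26 number with A = 1, ..., Z = 26
    Reduce(function(acc, l) acc * 26 + match(l, LETTERS), ch, 0)
  }, numeric(1))
  data.frame(address = address, row = row, col = col)
}

pos <- address_to_rowcol(c("C6", "D6", "E6", "AA10"))
print(pos)

# The result could then feed openxlsx, e.g. (not run here; wb and mydata are assumed):
# writeData(wb, sheet = 1, x = mydata, startCol = pos$col[1], startRow = pos$row[1])
```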
I'm using read.xlsx from openxlsx to import a data frame from a column that contains mixed types. The desired result is to import all values as strings, exactly as they're represented in Excel. However, some decimals are represented as very long floats.
Sample data is simply an Excel file with a column containing the following rows:
abc123,
556.1,
556.12,
556.123,
556.1234,
556.12345
library(openxlsx)
df <- read.xlsx('testnumbers.xlsx')
Using the above R code to read the file results in df containing these string values:
abc123,
556.1,
556.12,
556.12300000000005,
556.12339999999995,
556.12345000000005
The Excel file provided in production has the column formatted as "General". If I format the column as Text, there is no change unless I explicitly double-click each cell in Excel and hit enter. In that case, the number is correctly displayed as a string. Unfortunately, clicking each cell isn't an option in the production environment. Any solution, Excel, R, or otherwise is appreciated.
*Edit:
I've read through the question "Why Are Floating Point Numbers Inaccurate?" and believe I understand the math behind what's going on. At this point, I suppose I'm looking for a workaround. How can I get a float from Excel into an R data frame as text without changing the representation?
I was able to get the correct formats into a data frame using pandas in python.
import pandas as pd
test = pd.read_excel('testnumbers.xlsx', dtype = str)
This will suffice as a workaround, but I'd like to see a solution built in R.
Here is a workaround in R using openxlsx that I used to solve a similar issue. I think it will solve your question, or at least allow you to format as text in the excel files programmatically.
I use it to reformat specific cells in a large number of files. (In my case I'm converting from General to 'scientific', which shows how you might adapt this to another format.)
This uses functions in the openxlsx package that you reference in the OP
First, load the xlsx file in as a workbook (stored in memory, which preserves all the xlsx formatting/etc; slightly different than the method shown in the question, which pulls in only the data):
testnumbers <- loadWorkbook(here::here("test_data/testnumbers.xlsx"))
Then create a "style" to apply which converts the numbers to "text" and apply it to the virtual worksheet (in memory).
numbersAsText <- createStyle(numFmt = "TEXT")
addStyle(testnumbers, sheet = "Sheet1", style = numbersAsText, cols = 1, rows = 1:10)
Finally, save the workbook (here under a new file name, so the original is left untouched):
saveWorkbook(testnumbers,
             file = here::here("test_data/testnumbers_formatted.xlsx"),
             overwrite = TRUE)
When you open the Excel file, the numbers will be stored as "text".
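If the Excel file cannot be modified, another possible workaround is to clean the strings after import. The long values are 17-significant-digit renderings of the stored doubles; reformatting them with 15 significant digits recovers the shorter form. This is a base-R sketch under that assumption, using the values from the question as literal input rather than reading the file.

```r
# Sketch: shorten 17-digit float strings back to 15 significant digits,
# leaving non-numeric strings untouched.
raw <- c("abc123", "556.1", "556.12",
         "556.12300000000005", "556.12339999999995", "556.12345000000005")

num <- suppressWarnings(as.numeric(raw))   # NA for non-numeric entries
clean <- ifelse(is.na(num), raw, sprintf("%.15g", num))
print(clean)
```

Note that this would also rewrite genuine text that merely looks numeric (for example "007" would become "7"), so it only fits columns where that is acceptable.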
Suppose I have variable s with this code:
s <- "foo\nbar"
Then I convert it to a data.frame:
s2 <- data.frame(s)
Now s2 is a data.frame with one record. Next, I export it to a CSV file with:
write.csv(s2, file = "out.csv", row.names = FALSE)
When I open the file in Notepad, "foo\nbar" is split across two lines. With SAS import:
proc import datafile = "out.csv" out = out dbms = csv replace;
run;
I got two records, one is '"foo', the other is 'bar"', which is not expected.
After struggling for a while, I found if I export from R with foreign package like this:
write.dbf(s2, 'out.dbf')
Then import with SAS:
proc import datafile = "out.dbf" out = out dbms = dbf replace;
run;
Everything works nicely and I get one record in SAS; the value seems to be 'foo bar'.
Does this mean csv is a bad choice when dealing with data, compared with dbf? Are there any other solutions or explanations to this?
CSV stands for comma-separated values. This means that each line in the file should contain a list of values separated by commas. SAS imported the file correctly based on that definition of a CSV file (i.e. 2 lines = 2 rows).
The problem you are experiencing is due to the \n character(s) in your string. This sequence of characters happens to represent a newline character, and this is why the R write.csv() call is creating two lines instead of putting it all on one.
I'm not an expert in R so I can't tell you how to either modify the call to write.csv() or mask the \n value in the input string to prevent it from writing out the newline character.
The reason you don't have this problem with .dbf is probably that it doesn't rely on commas or newlines to indicate where new variables or rows start; it must have its own special sequence of bytes for that.
DBF is a database format, and database formats are generally easier to work with because they have variable types and lengths embedded in their structure.
With a CSV or any other delimited file you have to have documentation included to know the file structure.
The benefit of CSV is smaller file sizes and compatibility across multiple OS and applications. For a while Excel (2007?) no longer supported DBF for example.
As Robert says you will need to mask the new line value. For example:
replace_linebreak <- function(x, ...) {
  gsub('\n', '|n', x)
}
s3 <- replace_linebreak(s2$s)
This replaces \n with |n, which you would then need to reverse when you import the data again. Obviously, what you choose to mask it with will depend on your data.
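A minimal sketch of the full round trip follows, assuming "|n" never occurs in the real data. It also shows that the unmasked CSV really does span several physical lines, while R's own read.csv still reads it back as a single record because the field is quoted. `mask_linebreak` and `unmask_linebreak` are hypothetical helper names.

```r
# Sketch: an embedded newline in a CSV field, and a mask/unmask round trip.
s2 <- data.frame(s = "foo\nbar")

out <- tempfile(fileext = ".csv")
write.csv(s2, out, row.names = FALSE)
n_physical_lines <- length(readLines(out))   # header plus the two halves of the field
back <- read.csv(out)                        # the quoted field is read as one record

# Hypothetical helpers; fixed = TRUE treats the patterns as literal text.
mask_linebreak   <- function(x) gsub("\n", "|n", x, fixed = TRUE)
unmask_linebreak <- function(x) gsub("|n", "\n", x, fixed = TRUE)

masked   <- mask_linebreak(s2$s)
restored <- unmask_linebreak(masked)
print(c(n_physical_lines, nrow(back)))
print(masked)
```

Masking is only needed for readers that treat every physical line as a record.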
The R function read.csv works as follows, as stated in the manual: "If there is a header and the first row contains one fewer field than the number of columns, the first column in the input is used for the row names." That's good. However, when it comes to the function write.csv, I cannot find a way to write the CSV file in a similar way. So, suppose I have a file.txt as below:
Column_1,Column_2
Row_1,2,3
Row_2,4,5
Then when I read it using a = read.csv('file.txt'), the row and column names are Row_x and Column_x as expected. However, when I write the matrix a to a CSV file again, what I get from write.csv(a, 'file2.txt', quote=F) is as below:
,Column_1,Column_2
Row_1,2,3
Row_2,4,5
So, there is a comma at the beginning of this file. If I read this file again using a2 = read.csv('file2.txt'), the resulting a2 will not be the same as the previous matrix a: the row names of a2 will not be Row_x. That is, I do not want a comma at the beginning of the file. How can I get rid of this comma while using write.csv?
The two functions that you have mentioned, read.csv and write.csv, are just specific forms of the more generic functions read.table and write.table.
When I copy your example data into a .csv file and try to read it with read.csv, R throws a warning saying that the header line was incomplete, so it resorts to special behaviour to fix the error. Because the file was incomplete, R completed it by adding an empty element at the top left. R understands that this is a header row, so the data appears okay in R, but when we write to a CSV it doesn't know what is header and what is not. The empty element, which appears only in the header row created by R, therefore shows up as a regular element, as you would expect. Basically, R made our table 3x3, because every row has to have the same number of elements.
You want the extra comma there, because it allows programs to read the column names in the right place. To read the file in again you can do the following, assuming foo.csv is your data. Alternatively, you can fix this by manually adding the column and row names in R, including the missing element, to put everything in place.
To fix the wonky row names, add an extra option specifying which column holds the row names (row.names = your_column_number) when you read the file back in with the comma correctly in place.
y <- read.csv(file = "foo.csv")  # this throws a warning because your input is incorrect
write.csv(y, "foo_out.csv")
x <- read.csv(file = "foo.csv", header = TRUE, row.names = 1)  # this reads the first column as the row names
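If the goal really is a file without the leading comma, one option is write.table: its default header has no entry for the row-name column, whereas write.csv always adds the blank one. The sketch below rebuilds the question's data from a string and writes to a temporary file, both assumptions for illustration.

```r
# Sketch: write row names without a leading comma in the header line.
a <- read.csv(text = "Column_1,Column_2\nRow_1,2,3\nRow_2,4,5")

out <- tempfile(fileext = ".txt")
write.table(a, out, sep = ",", quote = FALSE)   # header has no blank first field

written <- readLines(out)
print(written)

# The header again has one field fewer than the data rows,
# so read.csv uses the first column as row names, as in the original file.
a2 <- read.csv(out)
print(rownames(a2))
```

The trade-off is the one mentioned above: spreadsheet programs expect the blank header field, so a file written this way will show shifted column names there.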
Play around with read/write.csv, but it might be worth while to move into the more generic functions read.table and write.table. They offer expanded functionality.
To read a CSV with the generic function:
y <- read.table(file = "foo.csv", sep = ",", header = TRUE)
This way you can specify the delimiter and easily read tab-delimited files exported from Excel (sep = "\t") or space-delimited files (sep = " ").
Hope that helps.