I have a series of massive data files that range in size from 800k to 1.4M rows, and one variable in particular has a fixed length of 12 characters (numeric data, with leading zeros where the number of non-zero digits is fewer than 12). The column should look like this:
col
000000000003
000000000102
000000246691
000000000042
102851000324
etc.
I need to export these files to CSV for a client, using R. The final data NEEDS to retain the 12-character structure, but when I open the CSV files in Excel, the zeros disappear. This happens even after converting the entire data frame to character. The code I am using is as follows.
df1 <- df1 %>%
  mutate(across(everything(), as.character))
##### I did this for all data frames #####
export(df1, "df1.csv")
export(df2, "df2.csv")
....
export(df17, "df17.csv")
I've read a few other posts that say this is an Excel problem, and that makes sense, but given the number of data files and amount of data, as well as the need for the client to be able to open it in Excel, I need a way to do it on the front end in R. Any ideas?
Yes, this is definitely an Excel problem!
To demonstrate: enter your column values in Excel, save the file as a CSV, and then re-open it in Excel; the leading zeros will disappear.
One option is to add a leading non-numeric character, such as an apostrophe:
df$col <- paste0("'", df$col)
Not great, but it's an option.
A slightly better option is to paste Excel's TEXT function around the character string; Excel will then evaluate the function when the file is opened.
df$col <- paste0("=Text(", df$col, ", \"000000000000\")")
#or
df$col <- paste0("=\"", df$col, "\"")
write.csv(df, "df2.csv", row.names = FALSE)
Of course, if the CSV file is saved and reopened in Excel, the leading zeros will again disappear.
Another option is to investigate saving the file directly as a .xlsx file with the "writexl", "openxlsx", or similar package.
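For example, a minimal sketch with writexl (assuming the same df1 as in the question, and that writexl is installed); character columns are written as text cells, so Excel keeps the leading zeros:
library(writexl)
library(dplyr)

# Convert everything to character so the 12-digit codes stay text,
# then write straight to .xlsx (no CSV round-trip for Excel to mangle)
df1 %>%
  mutate(across(everything(), as.character)) %>%
  write_xlsx("df1.xlsx")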
I am importing a csv file with read_csv() using the following command:
motus <- read_csv("motus_tables/TEST_metabar_motus_miseq_nov_12S.csv",
col_names = TRUE,
progress = show_progress())
The issue is that this adds a first column, X1, containing row numbers. I don't know why, and I don't know how to fix it! I am trying to use this option instead of read.table because the file I will have to read is huge and read.table takes forever. I'm open to any other suggestions for handling large CSV files quickly; a progress bar would be a plus!
I am running R v.3.6.2 in RStudio v.1.3
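For what it's worth, read_csv() does not number rows on its own: an auto-named X1 column usually means the file has an unnamed first column, typically row numbers written by an earlier write.csv(..., row.names = TRUE). A minimal sketch of dropping it after import, assuming the same path as above:
library(readr)
library(dplyr)

motus <- read_csv("motus_tables/TEST_metabar_motus_miseq_nov_12S.csv",
                  col_names = TRUE,
                  progress = show_progress()) %>%
  select(-X1)  # drop the leftover row-number column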
I'm trying to write data to an existing Excel file from R, while preserving the formatting. I'm able to do so following the answer to this question (Write from R into template in excel while preserving formatting), except that my file includes empty columns at the beginning, and so I cannot just begin to write data at cell A1.
As a solution I was hoping to be able to find the first non-empty cell, then start writing from there. If I run read.xlsx(file="myfile.xlsx") using the openxlsx package, the empty columns and rows are automatically removed, and only the data is left, so this doesn't work for me.
So I thought I would first load the worksheet using wb <- loadWorkbook("file.xlsx") so I have access to getStyles(wb) (which works). However, the subsequent command getTables returns character(0), and wb$tables returns NULL. I can't figure out why this is. Am I right that these variables would tell me the first non-empty cell?
I've tried manually removing the empty columns and rows preceding the data, straight in the Excel file, but that doesn't change things. Am I on the right path here or is there a different solution?
As suggested by Stéphane Laurent, the package tidyxl offers the perfect solution here.
For instance, I can now search the Excel file for a character value, like my variable names of interest ("Item", "Score", and "Mean", which correspond to the names() of the data.frame I want to write to my Excel file):
library(tidyxl)

colnames <- c("Item", "Score", "Mean")
excelfile <- "FormattedSheet.xlsx"
x <- xlsx_cells(excelfile)

# Find all cells with character values: return their address (i.e., cell) and character (i.e., value)
chars <- x[x$data_type == "character", c("address", "character")]
starting.positions <- unlist(
  chars[which(chars$character %in% colnames), "address"]
)
# returns: c("C6", "D6", "E6")
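From there, a possible next step (a sketch, not part of the original answer: mydata stands in for the data.frame being written, and the address parsing assumes single-letter columns) is to convert the first address into numeric indices for openxlsx::writeData():
library(openxlsx)

addr <- starting.positions[1]                         # e.g. "C6"
start_col <- match(gsub("[0-9]", "", addr), LETTERS)  # "C" -> 3 (single-letter columns only)
start_row <- as.integer(gsub("[A-Z]", "", addr))      # "6" -> 6

wb <- loadWorkbook(excelfile)
writeData(wb, sheet = 1, x = mydata, startCol = start_col, startRow = start_row)
saveWorkbook(wb, excelfile, overwrite = TRUE)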
I have the following code to read a file and save it as a CSV file. I remove the first 7 lines of the text file and then the 3rd column as well, since I just require the first two columns.
current_file <- paste("Experiment 1 ", i, ".cor", sep = "")
curfile <- list.files(pattern = current_file)
curfile_data <- read.table(curfile, header = FALSE, skip = 7, sep = ",")
# Drop the third column (V3); keep only V1 and V2
curfile_data <- curfile_data[-grep('V3', colnames(curfile_data))]
write.csv(curfile_data, curfile)
new_file <- paste("Dev_C", i, ".csv", sep = "")
new_file
file.copy(curfile, new_file)
curfile thus holds the two column variables, V1 and V2, along with the observation-number column at the beginning.
Now, when I use file.copy to copy the contents of curfile into a .csv file and then open the new .csv file in Excel, all the data appears concatenated in a single column. Is there a way to show each of the individual columns separately? Thanks in advance for your suggestions.
The data in the .txt file looks like this,
"","V1","V2","V3"
"1",-0.02868862,5.442283e-11,76.3
"2",-0.03359281,7.669754e-12,76.35
"3",-0.03801883,-1.497323e-10,76.4
"4",-0.04320051,-6.557672e-11,76.45
"5",-0.04801207,-2.557059e-10,76.5
"6",-0.05325544,-9.986231e-11,76.55
You need to use the Text to Columns feature in Excel, selecting comma as the delimiter. Where this item sits in the menu depends on the version of Excel you are using.
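If the fix has to happen on the R side instead, one guess (an assumption on my part: the single-column symptom often means Excel's locale uses a semicolon as the list separator) is to write the cleaned data directly to the .csv name with write.csv2(), which uses ";" as the field separator and "," as the decimal mark:
# A sketch: write the cleaned data straight to the target .csv (skipping
# the file.copy step) in semicolon-separated form for such locales
write.csv2(curfile_data, new_file, row.names = FALSE)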
Let the following vectors:
x <- c(123456789123456789, 987654321987654321)
y <- as.character(x)
I'm trying to export y into a csv file for later conversion into XLSX (don't ask, client requested), like so:
write.csv(y, file = 'y.csv', row.names = F)
If I open y.csv in a plain-text editor, I can see it has correctly inserted quotes around the elements, but when I open it in Excel, the program insists on converting the column into numbers and showing the contents in scientific format. This requires the extra step of reformatting the column, which can be a real time-waster when one works with lots of files.
How can I format a character vector of 20-digit numbers in R so that Excel doesn't display them in scientific notation?
Instead of opening the CSV file via File -> Open, you can go to Data -> From Text in Excel and, on the last step, specify the column data format as Text.
Not really sure, though, that you save any time by doing this; you could also consider using the WriteXLS (or some other direct-to-xlsx) package.
edit
Here's a much better way of forcing Excel to read as text:
write.csv(paste0('"=""', y, '"""'), 'y.csv', row.names = F, quote = F)
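Each value is written in the form "=""<value>""", which Excel parses as a text formula (="<value>") and therefore displays verbatim, with no numeric conversion. A quick sanity check of the output file:
readLines("y.csv")  # each data line should be the value wrapped as "=""<value>"""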
In Excel, select the column of numbers, and format them as text. (Format Cells -> Number tab -> Text in the list on the left)
I have some data that I am trying to load into R. It is in .csv files and I can view the data in both Excel and OpenOffice. (If you are curious, it is the 2011 poll results data from Elections Canada, available here.)
The data is coded in an unusual manner. A typical line is:
12002,Central Nova","Nova-Centre"," 1","River John",N,N,"",1,299,"Chisholm","","Matthew","Green Party","Parti Vert",N,N,11
There is a " at the end of "Central Nova" but not at the beginning. So in order to read in the data, I suppressed the quotes, which worked fine for the first few files, i.e.:
test<-read.csv("pollresults_resultatsbureau11001.csv",header = TRUE,sep=",",fileEncoding="latin1",as.is=TRUE,quote="")
Now here is the problem: in another file (eg. pollresults_resultatsbureau12002.csv), there is a line of data like this:
12002,Central Nova","Nova-Centre"," 6-1","Pictou, Subd. A",N,N,"",0,168,"Parker","","David K.","NDP-New Democratic Party","NPD-Nouveau Parti democratique",N,N,28
Because I need to suppress the quotes, the entry "Pictou, Subd. A" makes R want to split this into 2 variables. The data can't be read in, since R wants to add a column halfway through constructing the data frame.
Excel and OpenOffice both can open these files no problem. Somehow, Excel and OpenOffice know that quotation marks only matter if they are at the beginning of a variable entry.
Do you know what option I need to enable in R to get this data in? I have >300 files that I need to load (each with ~1000 rows), so a manual fix is not an option...
I have looked all over the place for a solution but can't find one.
Building on my comments, here is a solution that would read all the CSV files into a single list.
# Deal with French properly
options(encoding="latin1")
# Set your working directory to where you have
# unzipped all of your 308 CSV files
setwd("path/to/unzipped/files")
# Get the file names
temp <- list.files()
# Extract the 5-digit code which we can use as names
Codes <- gsub("pollresults_resultatsbureau|.csv", "", temp)
# Read all the files into a single list named "pollResults"
pollResults <- lapply(seq_along(temp), function(x) {
  T0 <- readLines(temp[x])
  # Insert the missing opening quote after the 5-digit code and its comma
  # (6 characters in total), leaving the header row untouched
  T0[-1] <- gsub('^(.{6})(.*)$', '\\1"\\2', T0[-1])
  final <- read.csv(text = T0, header = TRUE)
  final
})
names(pollResults) <- Codes
You can easily work with this list in different ways. If you wanted to just see the 90th data.frame you can access it by using pollResults[[90]] or by using pollResults[["24058"]] (in other words, either by index number or by district number).
Having the data in this format means you can also do a lot of other convenient things. For instance, if you wanted to fix all 308 of the CSVs in one go, you can use the following code, which will create new CSVs with the file name prefixed with "Corrected_".
invisible(lapply(seq_along(pollResults), function(x) {
  NewFilename <- paste("Corrected", temp[x], sep = "_")
  write.csv(pollResults[[x]], file = NewFilename,
            quote = TRUE, row.names = FALSE)
}))
Hope this helps!
This answer is mainly for @AnandaMahto (see the comments on the original question).
First, it helps to set some options globally because of the French accents in the data:
options(encoding="latin1")
Next, read in the data verbatim using readLines():
temp <- readLines("pollresults_resultatsbureau13001.csv")
Following this, simply insert a quotation mark after the first comma in each line of data. This works because the first field is always 5 characters long (5 digits plus the comma accounts for the {6} in the pattern). Note that the header is left untouched.
temp[-1] <- gsub('^(.{6})(.*)$', '\\1"\\2', temp[-1])
Penultimately, write over the original file.
fileConn <- file("pollresults_resultatsbureau13001.csv")
writeLines(temp, fileConn)
close(fileConn)
Finally, simply read the data back into R:
data <- read.csv(file = "pollresults_resultatsbureau13001.csv", header = TRUE, sep = ",")
There is probably a more parsimonious way to do this (and one that can be iterated more easily) but this process made sense to me.