How to avoid: read.table truncate last digits after decimal - r

I want to import a table (.txt file) in R with read.table().
table1<- read.table("input.txt",sep = "\t")
The file contains data like 0.09165395632583884
After reading the data, data becomes 0.09165396. Last few digits are lost,
but I want to avoid this problem.
If I used
options(digits=22)
then it creates another problem, like maindata = 0.19285969062961023 but when I write the data in file,
write.table(table1,file = "output.txt",col.names = F, row.names = F)
I get data = 0.192859690629610225. Here, last digit is extra and the second last digit is change.
Can someone give me a hint how to solve the problem?

Related

Problems with displaying .txt file (delimiters)

I have a problem with one task where I have to load some data set, and I have to make sure that missing values are read in properly and that column names are unambiguous.
The format of .txt file:
At the end, data set should contain only country column and median age.
I tried using read.delim, precisely this chunk:
rawdata <- read.delim("rawdata_343.txt", sep = "", stringsAsFactors = FALSE, header = TRUE)
And when I run it, I get this:
It confuses me that if country has multiple words (Turks and Caicos Islands) it assigns every word to another column.
Since I am still a beginner in R, any suggestion would be very helpful for me. Thanks!
Three points to note about your input file: (1) the first two lines at the top are not tabular and should be skipped with skip = 2, (2) your column separators are tabs and this should be specified with sep = "\t", and (c) you have no headers, so header = FALSE. Your command should be: -
rawdata <- read.delim("rawdata_343.txt", sep = "\t", stringsAsFactors = FALSE, header = FALSE, skip = 2)
UPDATE: A fourth point is that the first column includes row numbers, so row.names = 1. This also addresses the follow-up comment.
rawdata <- read.delim("rawdata_343.txt", sep = "\t", stringsAsFactors = FALSE, header = FALSE, skip = 2, row.names = 1)
It looks like your delimiter that you are specifying in the sep= argument is telling R to consider spaces as the column delimiter. Looking at your data as a .txt file, there is no apparent delimiter (like commas that you would find in a typical .csv). If you can put the data in a tabular form in something like a .csv or .xlsx file, R is much better at reading that data as expected. As it is, you may struggle to get the .txt format to read in a tabular fashion, which is what I assume you want.
P.s. you can use read.csv() if you do end up putting the data in that format.

R changes the format of a variable in the middle of writing a file

Something unusual is happening to me with a code that I have written in R. I have not changed anything in the code but today when I ran it, when writing the file it begins to write it well until a point (random), where R changes the format of a column of a data.table of character to numeric.
I have run it several times and R always change the format of the column, but in different points of the file.
function with i wrote is
write.table(table,gzfile(paste("~/NameFile.csv.gz",sep = "")), row.names = F, sep = "|;|", quote = F, dec = ",")
what can i do?

CSV with multiple datasets/different-number-of-columns

Similar to How can you read a CSV file in R with different number of columns, I have some complex CSV-files. Mine are from SAP BusinessObjects and hold challenges different to those of the quoted question. I want to automate the capture of an arbitrary number of datasets held in one CSV file. There are many CSV-files, but let's start with one of them.
Given: One CSV file containing several flat tables.
Wanted: Several dataframes or other structure holding all data (S4?)
The method so far:
get line numbers of header data by counting number of columns
get headers by reading every line index held in vector described above
read data by calculating skip and nrows between data sets in index described by header lines as above.
give the read data column names from read header
I need help getting me on the right track to avoid loops/making the code more readable/compact when reading headers and datasets.
These CSVs are formatted as normal CSVs, only that they contain an more or less arbitrary amount of subtables. For each dataset I export, the structure is different. In the current example I will suppose there are five tables included in the CSV.
In order to give you an idea, here is some fictous sample data with line numbers. Separator and quote has been stripped:
1: n, Name, Species, Description, Classification
2: 90, Mickey, Mouse, Big ears, rat
3: 45, Minnie, Mouse, Big bow, rat
...
16835: Code, Species
16836: RT, rat
...
22673: n, Code, Country
22674: 1, RT, Murica
...
33211: Activity, Code, Descriptor
32212: running, RU, senseless activity
...
34749: Last update
34750: 2017/05/09 02:09:14
There are a number of ways going about reading each data set. What I have come up with so far:
filepath <- file.path(paste0(Sys.getenv("USERPROFILE"), "\\SAMPLE.CSV)
# Make a vector with column number per line
fieldVector <- utils::count.fields(filepath, sep = ",", quote = "\"")
# Make a vector with unique number of fields in file
nFields <- base::unique(fieldVector)
# Make a vector with indices for position of new dataset
iHeaders <- base::match(nFields, fieldVector)
With this, I can do things like:
header <- utils::read.csv2(filepath, header = FALSE, sep = ",", quote = "\"", skip = iHeaders[4], nrows = iHeaders[5]-iHeaders[4]-1)
data <- utils::read.csv2(filepath, header = FALSE, sep = ",", quote = "\"", skip = iHeaders[4] + 1, nrows = iHeaders[5]-iHeaders[4]-1)
names(data) <- header
As in the intro of this post, I have made a couple of functions which makes it easier to get headers for each dataset:
Headers <- GetHeaders(filepath, iHeaders)
colnames(data) <- Headers[[4]]
I have two functions now - one is GetHeader, which captures one line from the file with utils::read.csv2 while ensuring safe headernames (no æøå % etc).
The other returns a list of string vectors holding all headers:
GetHeaders <- function(filepath, linenums) {
# init an empty list of length(linenums)
l.headers <- vector(mode = "list", length = length(linenums))
for(i in seq_along(linenums)) {
# read.csv2(filepath, skip = linenums[i]-1, nrows = 1)
l.headers[[i]] <- GetHeader(filepath, linenums[i])
}
l.headers
}
What I struggle with is how to read in all possible datasets in one go. Specifically the last set is a bit hard to wrap my head around if I should write a common function, where I only know the line number of header, and not the number of lines in the following data.
Also, what is the best data structure for such a structure as described? The data in the subtables are all relevant to each other (can be used to normalize parts of the data). I understand that I must do manual work for each read CSV, but as I have to read TONS of these files, some common functions to structure them in a predictable manner at each pass would be excellent.
Before you answer, please keep in mind that, no, using a different export format is not an option.
Thank you so much for any pointers. I am a beginner in R and haven't completely wrapped my head around all possible solutions in this particular domain.

Skip all leading empty lines in read.csv

I am wishing to import csv files into R, with the first non empty line supplying the name of data frame columns. I know that you can supply the skip = 0 argument to specify which line to read first. However, the row number of the first non empty line can change between files.
How do I work out how many lines are empty, and dynamically skip them for each file?
As pointed out in the comments, I need to clarify what "blank" means. My csv files look like:
,,,
w,x,y,z
a,b,5,c
a,b,5,c
a,b,5,c
a,b,4,c
a,b,4,c
a,b,4,c
which means there are rows of commas at the start.
read.csv automatically skips blank lines (unless you set blank.lines.skip=FALSE). See ?read.csv
After writing the above, the poster explained that blank lines are not actually blank but have commas in them but nothing between the commas. In that case use fread from the data.table package which will handle that. The skip= argument can be set to any character string found in the header:
library(data.table)
DT <- fread("myfile.csv", skip = "w") # assuming w is in the header
DF <- as.data.frame(DT)
The last line can be omitted if a data.table is ok as the returned value.
Depending on your file size, this may be not the best solution but will do the job.
Strategy here is, instead of reading file with delimiter, will read as lines,
and count the characters and store into temp.
Then, while loop will search for first non-zero character length in the list,
then will read the file, and store as data_filename.
flist = list.files()
for (onefile in flist) {
temp = nchar(readLines(onefile))
i = 1
while (temp[i] == 0) {
i = i + 1
}
temp = read.table(onefile, sep = ",", skip = (i-1))
assign(paste0(data, onefile), temp)
}
If file contains headers, you can start i from 2.
If the first couple of empty lines are truly empty, then read.csv should automatically skip to the first line. If they have commas but no values, then you can use:
df = read.csv(file = 'd.csv')
df = read.csv(file = 'd.csv',skip = as.numeric(rownames(df[which(df[,1]!=''),])[1]))
It's not efficient if you have large files (since you have to import twice), but it works.
If you want to import a tab-delimited file with the same problem (variable blank lines) then use:
df = read.table(file = 'd.txt',sep='\t')
df = read.table(file = 'd.txt',skip = as.numeric(rownames(df[which(df[,1]!=''),])[1]))

Error when reading in data

I’ trying to read in a set of data tables. All of which is representing different parts of a larger Excel table, selected using ”filter” and saved individually as a .csv file. Most of my tables have 5 rows of data but two of them have 4 rows. The tables with 5 rows of data reads in to R as requested:
Y <- read.csv(file = "MyFile.csv", row.names = 1,header = T, sep = ";")
No problem.
The tables with 4 rows of data gives following error meassage:
In read.csv("MyFile.csv", quote = "", : incomplete final line found
by readTableHeader on ' MyFile.csv'
It’s the same problem with
Z <- read.table("MyFile.csv", quote = "", sep = ';', header = TRUE)
There is no missing data in the file. When I print the Y or Z object in R no missing data is visible (or invisible as it were).
I know the problem is extremely simple, but as I’ve got frustration pouring out of my ears my officemates would really appreciate your help.
The final line of your CSV doesn't have a line feed or carriage return.
Plan A: open the files in a text editor, go to the end of the final line, hit enter and then save the modified file.
Plan B: if there are too many files for Plan A, you could simply ignore the warnings, since the files seem to be loaded fine (apart from that message).

Resources