Importing data into R: which file format is the easiest?

I have a few datasets in the following formats: .asc, .wf1, .xls. I am indifferent about which one I use, as they contain exactly the same data. Could anyone please tell me which of these file formats is easiest to import into R, and how this is done?

Save the .xls file as .txt or .csv; those are the easiest formats for R to read.
Just make sure the file has at most one header line (or none). Then try one of:
read.table("*.txt", header = TRUE)
read.table("*.txt", header = FALSE)
read.delim("*.txt", header = FALSE)
read.csv("*.csv")
etc.

Definitely not .xls. If .asc is some sort of fixed-width format, then it can be read in easily with read.csv or read.table.
Other formats that are easy to read include plain delimited text (comma- or tab-separated files) and DTA (Stata files, via read.dta in the foreign package).
Edit: @KarlOveHufthammer pointed out that .asc is most likely a fixed-width format, in which case read.fwf is the tool to use to read it into R. Note that FWF is a pain in the heiny to deal with, though, in that you have to have the column widths and names of every column stored somewhere else, then convert that to a format that read.fwf can use; and that's before problems like overlapping ranges.
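For example, a minimal read.fwf sketch (the file name, widths, and column names below are placeholders; the real values have to come from the file's codebook):
widths <- c(2, 10, 8)                       # character positions occupied by each column (placeholder)
cols   <- c("state", "city", "population")  # hypothetical column names
dat <- read.fwf("mydata.asc", widths = widths, col.names = cols,
                stringsAsFactors = FALSE)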

Reading large numeric TSV file into memory in R

I am trying to read a file representing a numeric matrix with 4.5e5 rows and 2e3 columns. The first line is a header with ncol + 1 words, and each subsequent row begins with a row name. In txt format it is around 17 GB in size.
I tried using:
read.table(fname, header=TRUE)
but the operation ate all 64 GB of RAM available. I assume it loaded the data into the wrong structure.
People usually discuss speed; is there a way to import the file so that it fits in memory properly? Performance is not the primary issue.
EDIT: I managed to read it with read.table:
colclasses = c("character",rep("numeric",2000))
betas = read.table(beta_fname, header=TRUE, colClasses=colclasses, row.names=1)
But the documentation still recommends scan for memory usage. What would the scan alternative look like?
There are several things you might try. Searching for advice on reading large files will usually point you to fread in the data.table package. readr::read_delim_chunked might also help. Another option is to break the file into smaller pieces, read each one in, and write it out as an RDS file; when that is done, you can read the RDS files back in and combine them using less memory.
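For instance, a rough sketch of the fread route, reusing the column layout from the edit above (one character row-name column followed by 2000 numeric columns; beta_fname is the same file):
library(data.table)
dt <- fread(beta_fname, header = TRUE,
            colClasses = c("character", rep("numeric", 2000)))
betas <- as.matrix(dt[, -1, with = FALSE])  # drop the row-name column
rownames(betas) <- dt[[1]]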

When I upload my Excel file to R, the column titles are in the rows and the data seems all jumbled. How do I fix this?

Hi, literally day one as a new coder here.
In the Excel sheet my data looks organized, but when I upload the file to R, it can't read the spreadsheet properly: the column headers end up in the rows and the data looks scrambled.
So far I have tried:
library(readxl)
dataset <- read_excel("pathname")
View(dataset)
Also tried:
dataset <- read_excel("pathname", sheet = 1, col_names = TRUE)
Also tried to use the package openxlsx
but nothing is giving me the correct, organized data set.
I tried saving my Excel file as a CSV, and the CSV looks exactly like the data that shows up in R (both are messed up).
How should I approach this problem?
I deal with importing .xlsx files into R frequently. It can be challenging because of how flexible the Excel platform is. I generally use readxl::read_xlsx() to fetch data from .xlsx files. My suggestions:
First, specify exactly the data you want to import with the range argument. From the readxl documentation:
A cell range to read from, as described in cell-specification. Includes typical Excel
ranges like "B3:D87", possibly including the sheet name like "Budget!B2:G14"
Second, if there are merged cells or other formatting problems in the column headers, I resort to setting col_names = FALSE and supplying clean names after import with names(df) <- c("first_col", "second_col") (see the sketch at the end of this answer).
Third, if there are merged cells elsewhere in the spreadsheet, I generally resort to "fixing" them in Excel (not ideal, but easier for my use case); others may have suggestions for a programmatic fix.
It may be helpful to provide a screenshot of your spreadsheet.
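A minimal sketch combining the first two suggestions (the file name, sheet-qualified range, and column names are placeholders for your own):
library(readxl)
# Read only the rectangle that actually holds the data, skip the messy
# header rows, and supply clean names afterwards.
dataset <- read_xlsx("pathname.xlsx",
                     range = "Sheet1!B3:D87",  # placeholder range
                     col_names = FALSE)
names(dataset) <- c("first_col", "second_col", "third_col")
View(dataset)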

R misreading CSV files after modifications in Excel

This is more of a curiosity.
Sometimes I modify CSV files in Excel rather than in R (say I manage to find a missing piece of information and type it into the CSV file), of course keeping the commas and quotes as they were.
Every time I do this, R becomes unable to read the CSV file: it imports everything as a single column, as it appears in Excel, rather than separating the values (no options like sep= or quote= change this).
Does anyone know why this happens?
Thanks a lot
An example
This was readable:
state,"city","county"
AK,"Anchorage",""
AK,"Haines",""
AK,"Juneau","Juneau"
After adding the missing info under "county", R fails to import it as a data frame, reading it instead as a single vector.
state,"city","county"
AK,"Anchorage","Anchorage"
AK,"Haines","Haines"
AK,"Juneau","Juneau"
Edit:
I'm just running the basic read.csv
df <- read.csv("C:/directory/df.csv")
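A quick way to check what Excel actually wrote to disk is to inspect the raw text before parsing it (same placeholder path as above):
readLines("C:/directory/df.csv", n = 5)          # shows the separator and quoting Excel used
count.fields("C:/directory/df.csv", sep = ",")   # one field per line would match the single-column symptom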

read_csv does not separate on commas and does not capture separate rows

I am trying to parse a text log file like the one shown below. I can use the default read.csv to parse this file:
test <- read.csv("test.txt", header=FALSE)
It separated all the comma-delimited parts; the result is not perfectly arranged in a data frame, but further manipulation can improve it.
However, I cannot seem to do the same using the readr package:
test <- read_csv("test.txt", col_names = FALSE)
All observations end up in one row, with no separation at the commas.
I am learning this package so any help would be great.
{"dev_id":"f8:f0:05:xx:db:xx","data":[{"dist":[7270,7269,7269,7275,7270,7271,7265,7270,7274,7267,7271,7271,7266,7263,7268,7271,7266,7265,7270,7268,7264,7270,7261,7260]},{"temp":0},{"hum":0},{"vin":448}],"time":4485318,"transmit_time":4495658,"version":"1.0"}
{"dev_id":"f8:xx:05:xx:d9:xx","data":[{"dist":[6869,6868,6867,6871,6866,6867,6863,6865,6868,6869,6868,6860,6865,6866,6870,6861,6865,6868,6866,6864,6866,6866,6865,6872]},{"temp":0},{"hum":0},{"vin":449}],"time":4405316,"transmit_time":4413715,"version":"1.0"}
{"dev_id":"xx:f0:05:e8:da:xx","data":[{"dist":[5775,5775,5777,5772,5777,5770,5779,5773,5776,5777,5772,5768,5782,5772,5765,5770,5770,5767,5767,5777,5766,5763,5773,5776]},{"temp":0},{"hum":0},{"vin":447}],"time":4461316,"transmit_time":4473307,"version":"1.0"}
{"dev_id":"xx:f0:xx:e8:xx:0a","data":[{"dist":[4358,4361,4355,4358,4359,4359,4361,4358,4359,4360,4360,4361,4361,4359,4359,4356,4357,4361,4359,4360,4358,4358,4362,4359]},{"temp":0},{"hum":0},{"vin":424}],"time":5190320,"transmit_time":5198748,"version":"1.0"}
Thanks to @Dave2e for pointing out that this file is in JSON format, I found a way to parse it using ndjson::stream_in.
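A short sketch of that approach (assuming the log shown above is saved as test.txt):
# Each line is a standalone JSON object (NDJSON), so a streaming JSON
# parser is the right tool rather than a CSV reader.
library(ndjson)
logs <- ndjson::stream_in("test.txt")  # returns a flattened data.table
head(logs)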

Exporting large number to csv from R

I came across a strange problem when trying to export an R dataframe to a csv file.
The data frame contains some big numbers, but when they are written to the CSV file they "lose" the decimal part and are written without it.
Not in the way one would expect, though, but like this:
Say 3224571816.5649 is the correct value in R. When written to the CSV, it becomes 32245718165649.
I am using the write.csv2 function to write the CSV. The separators are correct, since it works normally for smaller values. Is the problem occurring because the number (with its decimals) is bigger than 32 bits?
And more importantly, how can I solve this, given that I have a whole data frame with values as big as (or bigger than) this? It also has to be written to a CSV.
write.csv2 is intended for a different CSV convention (the Western European style), which, given your use of "." as the decimal indicator, I am guessing is not what you want. write.csv2 uses a comma as the decimal mark and a semicolon as the field delimiter, so if you try to read the result as a comma-separated file it will look strange indeed.
I suggest you use write.csv (or, even better, write.table) to write your file. write.csv assumes a comma as the field separator and a period as the decimal marker.
Both write.csv and write.csv2 are just wrappers around write.table, which is the underlying function. In general, I recommend write.table because it does not assume your locale and lets you pass sep = ",", dec = ".", and so on explicitly. This not only makes it clear exactly what you are writing, it also makes your code a lot more readable.
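For example, a minimal sketch (df and the file name are placeholders):
# Explicit separators: comma between fields, period as the decimal mark,
# so the output is unambiguous whatever the locale settings are.
write.table(df, file = "big_numbers.csv",
            sep = ",", dec = ".", row.names = FALSE)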
For more, see the write.table entry on rdocumentation.org: https://www.rdocumentation.org/packages/utils/versions/3.5.3/topics/write.table

Resources