Extract bz2 file in R - r

I have a bunch of .csv.bz2 files, which I have to download, extract, and read in R.
I downloaded a file and want to extract it to the current working directory, then read it.
I tried unz(filename,filename.csv), but it does not seem to work. How can I do that?
I heard somewhere that bz2 files can be read directly without decompressing. How can I do that?

You can use either of these two commands:
read.csv() command: with this command you can directly supply the name of the compressed file containing the csv data.
read.csv("file.csv.bz2")
read.table() command: This is the generic version of read.csv(). You set the delimiter and the other options yourself, which read.csv() sets automatically. You don't need to decompress the file separately; the command does it automatically for you.
read.table("file.csv.bz2", header = TRUE, sep = ",", quote = "\"",...)

Like this:
readcsvbz2file <- read.csv(bzfile("file.csv.bz2"))
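As a rough illustration of reading without a separate decompression step, the sketch below builds a tiny bz2-compressed CSV in a temporary location and reads it back both ways. The file contents and variable names are made up for the example; only bzfile() and read.csv() come from the answers above.

```r
# Create a small bz2-compressed CSV in a temporary location (illustrative data).
bz_path <- tempfile(fileext = ".csv.bz2")
con <- bzfile(bz_path, open = "w")
writeLines(c("id,name", "1,alpha", "2,beta"), con)
close(con)

# Option 1: wrap the path in an explicit bzfile() connection.
via_bzfile <- read.csv(bzfile(bz_path))

# Option 2: pass the path directly; the compression is detected when the file is opened.
via_path <- read.csv(bz_path)

print(via_bzfile)
```

Both calls should give the same data frame, so the explicit bzfile() wrapper is mostly a matter of making the intent visible.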

You can use the very fast fread from data.table, which reads bz2-compressed files directly (recent versions rely on the R.utils package being installed for this):
library(data.table)
fread("file.csv.bz2")

Basically, you need to type:
library(R.utils)
bunzip2("dataset.csv.bz2", "dataset.csv", remove = FALSE, skip = TRUE)
dataset <- read.csv("dataset.csv")
See documentation here: bunzip2 {R.utils}.
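If installing R.utils is not an option, the same "extract to disk, then read" flow can be approximated with base R connections. This is a swapped-in technique rather than what bunzip2() does internally, and the file names and contents here are placeholders.

```r
# Create a small compressed file to work with (illustrative data).
bz_path <- tempfile(fileext = ".csv.bz2")
con <- bzfile(bz_path, open = "w")
writeLines(c("id,value", "1,10", "2,20"), con)
close(con)

# Destination: same name without the .bz2 suffix.
csv_path <- sub("\\.bz2$", "", bz_path)

# Stream the decompressed bytes to disk in chunks.
src <- bzfile(bz_path, open = "rb")
dst <- file(csv_path, open = "wb")
repeat {
  chunk <- readBin(src, what = raw(), n = 65536L)
  if (length(chunk) == 0L) break
  writeBin(chunk, dst)
}
close(src)
close(dst)

# The compressed original is left in place; read the extracted copy.
dataset <- read.csv(csv_path)
```

Copying in chunks keeps memory use flat for large files, at the cost of a few more lines than a single read.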

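According to the read.table documentation, a compressed file can be read directly. For a comma-separated file with a header row, set sep and header accordingly:
read.table("file.csv.bz2", header = TRUE, sep = ",")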
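Since the question mentions a whole bunch of .csv.bz2 files, here is a minimal sketch of reading every such file in a folder into a list. The folder, file names, and contents are invented for the example; in practice data_dir would be wherever the downloads were saved.

```r
# Set up a throwaway folder with two compressed CSV files (illustrative data).
data_dir <- tempfile("bz2_demo_")
dir.create(data_dir)
for (name in c("a", "b")) {
  con <- bzfile(file.path(data_dir, paste0(name, ".csv.bz2")), open = "w")
  writeLines(c("x,y", "1,2", "3,4"), con)
  close(con)
}

# Find every file ending in .csv.bz2 and read each one directly.
bz_files <- list.files(data_dir, pattern = "\\.csv\\.bz2$", full.names = TRUE)
tables <- lapply(bz_files, read.csv)
names(tables) <- basename(bz_files)

str(tables)
```

Naming the list elements after the files makes it easier to trace each data frame back to its source.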

Related

Read files with a specific extension from a folder in R

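I want to read files with the extension .output using the function read.table.
I used pattern=".output" but it's not correct.
Any suggestions?
As an example, here's how you could read in files with the extension ".output" and create a list of tables. The pattern argument is a regular expression, so the dot must be escaped and the match anchored at the end with $: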
list.filenames <- list.files(pattern="\\.output$")
trialsdata <- lapply(list.filenames,read.table,sep="\t")
Or, if you just want to read them one at a time manually, include the extension in the filename argument.
read.table("ACF.output",sep=...)
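To see why pattern=".output" over-matches, the sketch below compares it with the escaped, anchored version on a few empty files. The file names are hypothetical and exist only in a temporary folder.

```r
# Create a throwaway folder with three empty files (hypothetical names).
demo_dir <- tempfile("output_demo_")
dir.create(demo_dir)
invisible(file.create(file.path(
  demo_dir, c("ACF.output", "ACF.output.bak", "myoutput.txt")
)))

# Unescaped: "." matches any character and nothing anchors the match to the end.
loose <- list.files(demo_dir, pattern = ".output")

# Escaped dot plus end-of-string anchor: only true ".output" extensions.
strict <- list.files(demo_dir, pattern = "\\.output$")

# glob2rx() converts a shell-style wildcard into an equivalent regular expression.
globbed <- list.files(demo_dir, pattern = glob2rx("*.output"))

print(loose)
print(strict)
```

The loose pattern picks up all three files, while the strict and glob-based patterns keep only ACF.output.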
In the end, because I didn't find a solution (something is going wrong with my path), I made a text file listing all the .output files with ls *.output > data.txt.
After that, using:
files = read.table("./data.txt")
I build a data.frame containing all my file names, and convert it to character with
files[] <- lapply(files, as.character)
Finally, with test = read.table(files[i,],header=F,row.names=1)
we can read the file listed on line i (i = line number).

Why R cannot read this table while excel can?

I am trying to read a specific file that I have copied from an SFTP location. The file is pipe-delimited. I can read the file in Excel, but R reads it as null values and the column names are duplicated. I don't know whether this is an encoding issue. I am trying to create a bash script to automate this process. Any help? Below is the link to the data.
Here's file!
I have tried changing the encoding, but without knowing which encoding it is I am struggling. I have tried using read_delim, read_table, read.table, read_csv and read.csv, but none of them helped.
This is the code I have used to read the file:
read_delim("./Engagement_Level.txt", delim = "|")
I would like to read it as a data frame.
The issue is that the file encoding is UTF-16LE, which read_delim cannot read at present.
You could use the base read.delim and file() to specify the encoding:
read.delim(file("Engagement_Level.txt", encoding = "UTF-16LE"), sep = "|")
That will convert all the quoted numbers to numeric. If you'd rather they were type character, to deal with later:
read.delim(file("Engagement_Level.txt", encoding = "UTF-16LE"), sep = "|",
           colClasses = "character")
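A small self-contained check of the two calls above: the sketch writes a made-up pipe-delimited table as UTF-16LE bytes to a temporary file and reads it back. The column names and values are invented; the original Engagement_Level.txt is not reproduced here.

```r
# Write a tiny pipe-delimited table as UTF-16LE bytes (illustrative data).
path <- tempfile(fileext = ".txt")
txt <- "id|level\n\"1\"|\"high\"\n\"2\"|\"low\"\n"
utf16_bytes <- iconv(txt, from = "UTF-8", to = "UTF-16LE", toRaw = TRUE)[[1]]
writeBin(utf16_bytes, path)

# Declare the encoding on the connection; quoted numbers are converted to numeric.
as_numbers <- read.delim(file(path, encoding = "UTF-16LE"), sep = "|")

# Keep every column as character instead.
as_text <- read.delim(file(path, encoding = "UTF-16LE"), sep = "|",
                      colClasses = "character")

str(as_numbers)
str(as_text)
```

Declaring the encoding on the file() connection means the text is converted before read.delim parses it, which is why the delimiter and header are then recognized.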
I really recommend using Excel to build a CSV file via Data > Text to Columns. It is not the most appropriate tool in this context, but it is reliable and quick.
Then use read.csv(file,sep=",").

fread issue with archive package unzip file in R

I am having issues while trying to use fread, after I unzip a file using the archive package in R. The data I am using can be downloaded from https://www.kaggle.com/c/favorita-grocery-sales-forecasting/data
The code is as follows:
library(dplyr)
library(devtools)
library(archive)
library(data.table)
setwd("C:/jc/2017/13.Lafavorita")
hol<-archive("./holidays_events.csv.7z")
holcsv<-fread(hol$path, header = T, sep = ",")
This code gives the error message:
File 'holidays_events.csv' does not exist. Include one or more spaces to consider the input a system command.
Yet if I try:
holcsv1<-read.csv(archive_read(hol),header = T,sep = ",")
It works perfectly. I need to use fread because the other datasets I need to open are too big for read.csv. I am puzzled because my code was working fine a few days ago. I could unzip the files manually, but that is not the point. I have tried to solve this problem for hours, but I cannot find anything useful in the documentation. I found this: https://github.com/yihui/knitr/blob/master/man/knit.Rd#L104-L107 , but I cannot understand it.
It turns out the answer is rather simple, though I found it by luck. After calling the archive function, you need to pass its result to the archive_extract function. In my case, I added the following to the code: hol1<-archive_extract(hol) . Then I changed the last line to: holcsv<-fread(hol1$path, header = T, sep = ",")

read an Excel file embedded in a website

I would like to read automatically in R the file which is located at
https://clients.rte-france.com/servlets/IndispoProdServlet?annee=2017
This link generates the automatic download of a zipfile. This zipfile contains the Excel file I want to read in R.
Does any of you have any suggestions on this? Thanks.
Panagiotis' comment to use download.file() is generally good advice, but I couldn't make it work here (and would be curious to know why). Instead I used httr.
(Edit: got it, I reversed args of download.file()... Repeat after me: always use named args...)
Another problem with this data: it appears not to be a regular xls file; I couldn't open it with the otherwise excellent readxl package.
It looks like a tab-separated flat file, but I had no success with read.table() either. readr::read_delim() worked.
library(httr)
library(readr)
r <- GET("https://clients.rte-france.com/servlets/IndispoProdServlet?annee=2017")
# Write the archive on disk
writeBin(r$content, "./data/rte_data")
rte_data <-
  read_delim(
    unzip("./data/rte_data", exdir = "./data/"),
    delim = "\t",
    locale = locale(encoding = "ISO-8859-1"),
    col_names = TRUE
  )
There still are parsing problems, but not sure they should be dealt with in this SO question.

read.csv directly into character vector in R

This code works, however, I wonder if there is a more efficient way. I have a CSV file that has a single column of ticker symbols. I then read this csv into R and apply functions to each ticker using a for loop.
I read in the csv, and then go into the data frame and pull out the character vector that the for loop needs to run properly.
SymbolListDataFrame = read.csv("DJIA.csv", header = FALSE, stringsAsFactors=F)
SymbolList = SymbolListDataFrame[[1]]
for (Symbol in SymbolList){...}
Is there a way to combine the first two lines I have written into one? Maybe read.csv is not the best command for this?
Thank you.
UPDATE
I am using the readLines method suggested by Jake and Bartek. There is a warning, "incomplete final line found on", for the csv file, but I ignore it since the data is correct.
SymbolList <- readLines("DJIA.csv")
SymbolList <- read.csv("DJIA.csv", header = FALSE, stringsAsFactors=F)[[1]]
The readLines function is the best solution here.
Please note that the read.csv function is not only for reading files with a csv extension. It is simply the read.table function with parameters such as header and sep set differently. Check the documentation for more info.
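For a single-column file, the options discussed here can be compared side by side. The ticker symbols below are placeholders written to a temporary file, not the contents of the asker's DJIA.csv; scan() is included as one more base R alternative.

```r
# Write a one-column file of ticker symbols (illustrative data).
path <- tempfile(fileext = ".csv")
writeLines(c("AAPL", "MSFT", "IBM"), path)

# One line per element; warn = FALSE would silence the
# "incomplete final line" warning for files without a trailing newline.
from_lines <- readLines(path)

# read.csv returns a data frame, so extract its first column in the same call.
from_csv <- read.csv(path, header = FALSE, stringsAsFactors = FALSE)[[1]]

# scan() also returns a plain character vector.
from_scan <- scan(path, what = character(), quiet = TRUE)

for (symbol in from_lines) {
  print(symbol)
}
```

All three produce the same character vector for this input; readLines simply skips the data frame step entirely.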
