I am trying to download a list using R with the following code:
name <- paste0("https://www.sec.gov/Archives/edgar/full-index/2016/QTR1/master.idx")
master <- readLines(url(name))
master <- master[grep("SC 13(D|G)", master)]
master <- gsub("#", "", master)
master_table <- fread(textConnection(master), sep = "|")
The final line returns an error. I verified that the textConnection works as expected (I can read from it using readLines), but fread fails, and read.table runs into the same problem.
Error in fread(textConnection(master), sep = "|") : input= must be a single character string containing a file name, a system command containing at least one space, a URL starting 'http[s]://', 'ftp[s]://' or 'file://', or, the input data itself containing at least one \n or \r
What am I doing wrong?
1) In the first line we don't need paste0, since there is nothing to paste together, and in the next line we don't need url(...) because readLines accepts a URL directly. We have also limited the input to 1000 lines so the example runs in less time. The gsub can be omitted if we specify na.strings in fread, and collapsing the input to a single newline-separated string lets us eliminate textConnection in fread.
library(data.table)
name <- "https://www.sec.gov/Archives/edgar/full-index/2016/QTR1/master.idx"
master <- readLines(name, 1000)
master <- master[grep("SC 13(D|G)", master)]
master <- paste(master, collapse = "\n")
master_table <- fread(master, sep = "|", na.strings = "")
2) A second approach, which may be faster, is to download the file first and then fread it, as shown.
name <- "https://www.sec.gov/Archives/edgar/full-index/2016/QTR1/master.idx"
download.file(name, "master.txt")
master_table <- fread('findstr "SC 13[DG]" master.txt', sep = "|", na.strings = "")
The above is for Windows. For Linux with bash replace the last line with:
master_table <- fread("grep 'SC 13[DG]' master.txt", sep = "|", na.strings = "")
I'm not quite sure of the broader context, in particular whether you need to use fread(), but
s <- scan(text=master, sep="|", what=character())
works well, and fast (0.1 seconds).
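If you then need a table rather than a flat vector, you can reshape the result. A minimal sketch, assuming each record has exactly five pipe-delimited fields (CIK|Company Name|Form Type|Date Filed|Filename, as in EDGAR's master index):
# reshape the flat vector into rows of five fields each
m <- matrix(s, ncol = 5, byrow = TRUE)
master_df <- as.data.frame(m, stringsAsFactors = FALSE)
names(master_df) <- c("CIK", "CompanyName", "FormType", "DateFiled", "Filename")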
I would like to add the final solution: a text argument was implemented in fread (see https://github.com/Rdatatable/data.table/issues/1423).
Perhaps this also saves others a bit of time.
The solution becomes simpler:
library(data.table)
name <- "https://www.sec.gov/Archives/edgar/full-index/2016/QTR1/master.idx"
master <- readLines(name, 1000)
master <- master[grep("SC 13(D|G)", master)]
master <- paste(master, collapse = "\n")
master_table <- fread(text = master, sep = "|")
Take a look at the "Estimated Global Trend daily values" file on this NOAA web page. It is a .txt file with something like 50 header lines (identified with leading #s) followed by several thousand lines of tabular data. The link to download the file is embedded in the code below.
How can I read this file so that I end up with a data frame (or tibble) with the appropriate column names and data?
All the text-to-data functions I know get stymied by those header lines. Here's what I just tried, riffing off of this SO Q&A. My thought was to read the file into a list of lines, then drop the lines that start with # from the list, then do.call(rbind, ...) the rest. The downloading part at the top works fine, but when I run the function, I'm getting back an empty list.
temp <- paste0(tempfile(), ".txt")
download.file("ftp://aftp.cmdl.noaa.gov/products/trends/co2/co2_trend_gl.txt",
destfile = temp, mode = "wb")
processFile = function(filepath) {
  dat_list <- list()
  con = file(filepath, "r")
  while (TRUE) {
    line = readLines(con, n = 1)
    if (length(line) == 0) {
      break
    }
    append(dat_list, line)
  }
  close(con)
  return(dat_list)
}
dat_list <- processFile(temp)
Here's a possible alternative
processFile = function(filepath, header = TRUE, ...) {
  lines <- readLines(filepath)
  comments <- which(grepl("^#", lines))
  header_row <- gsub("^#", "", lines[tail(comments, 1)])
  data <- read.table(text = c(header_row, lines[-comments]), header = header, ...)
  return(data)
}
processFile(temp)
The idea is that we read in all the lines, find the ones that start with "#" and ignore them except for the last one which will be used as the header. We remove the "#" from the header (otherwise it's usually treated as a comment) and then pass it off to read.table to parse the data.
Here are a few options that bypass your function and that you can mix & match.
In the easiest (albeit unlikely) scenario where you know the column names already, you can use read.table and enter the column names manually. The default option of comment.char = "#" means those comment lines will be omitted.
read.table(temp, col.names = c("year", "month", "day", "cycle", "trend"))
More likely is that you don't know those column names, but can get them by figuring out how many comment lines there are, then reading just the last of those lines. That saves you having to read more of the file than you need; this is a small enough file that it shouldn't make a huge difference, but in a larger file it might. I'm doing the counting by accessing the command line, only because that's the way I know how. Note also that I saved the file at an easier path; you could instead paste the command together with the temp variable.
Again, the comments are omitted by default.
n_comments <- as.numeric(system("grep '^# ' co2.txt | wc -l", intern = TRUE))
hdrs <- scan(temp, skip = n_comments - 1, nlines = 1, what = "character")[-1]
read.table(temp, col.names = hdrs)
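If you would rather stay in R (for example on a system without grep), a hedged equivalent of the shell counting, assuming the header lines are exactly those starting with "#":
lines <- readLines(temp)
n_comments <- sum(grepl("^#", lines))   # count the comment lines in R
hdrs <- scan(temp, skip = n_comments - 1, nlines = 1, what = "character")[-1]
read.table(temp, col.names = hdrs)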
Or with dplyr and stringr, read all the lines, separate out the comments to extract column names, then filter to remove the comment lines and separate into fields, assigning the column names you've just pulled out. Again, with a bigger file, this could become burdensome.
library(dplyr)
lines <- data.frame(text = readLines(temp), stringsAsFactors = FALSE)
comments <- lines %>%
  filter(stringr::str_detect(text, "^#"))
hdrs <- strsplit(comments[nrow(comments), 1], "\\s+")[[1]][-1]
lines %>%
  filter(!stringr::str_detect(text, "^#")) %>%
  mutate(text = trimws(text)) %>%
  tidyr::separate(text, into = hdrs, sep = "\\s+") %>%
  mutate_all(as.numeric)
I have several hundred .pet files organized by date code (19960101 is YYYYMMDD format). I'm trying to add a column, NDate, with the date code:
for (pet.atual in files.pet) {
  data.pet.atual <-
    read.table(file = pet.atual,
               header = FALSE,
               sep = ",",
               quote = "\"",
               comment.char = ";")
  data.pet.atual <- cbind(data.pet.atual, NDate = pet.atual)
}
What I'm trying to achieve, for example, is NDate = 19960101 for 01-01-1996, NDate = 19960102 for 02-01-1996, and so on. Still, the for loop just replaces the NDate field with the latest pet.atual every time it runs. Ideas? Thanks
Small modification should do the trick:
data.pet.atual <- NULL
for (pet.atual in files.pet) {
  tmp.data <-
    read.table(file = pet.atual,
               header = FALSE,
               sep = ",",
               quote = "\"",
               comment.char = ";")
  tmp.data <- cbind(tmp.data, NDate = pet.atual)
  data.pet.atual <- rbind(data.pet.atual, tmp.data)
}
You can also replace the tmp.data <- cbind(...) line with tmp.data$NDate <- pet.atual.
You may also try fread() and rbindlist() from the data.table package (untested due to lack of a reproducible example):
library(data.table)
result <- rbindlist(lapply(files.pet, fread), idcol = "NDate")
result[, NDate := anytime::anydate(files.pet[NDate])]
lapply() "loops" over all entries in files.pet executing fread() for each entry and returns a list with the data.tables fread has created from reading each file. rbindlist() is used to combine all pieces into one large data.table. The parameter idcol = NDate generates an index column named NDate to identify the origin of each row in the final output. The ids are integer numbers 1 to the length of the list (if the list is not named).
Finally, the id number is used to lookup the file name in files.pet which is directly converted to class Date using the anytime package.
EDIT Perhaps, it might be more efficient to convert the file names to Date first before looking them up:
result[, NDate := anytime::anydate(files.pet)[NDate]]
Although fread() is pretty smart in analysing and guessing the right parameters for reading the files it might be necessary (and perhaps faster as well) to supply additional parameters, e.g.:
result <- rbindlist(lapply(files.pet, fread, header = FALSE, sep = ","), idcol = "NDate")
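As a variation on the above (untested, same caveat): naming the list before rbindlist() makes idcol store the file name itself instead of an integer index, which skips the lookup step. This sketch assumes, as above, that the names in files.pet parse as dates:
dt_list <- lapply(files.pet, fread)
names(dt_list) <- files.pet               # named list => idcol holds the names
result <- rbindlist(dt_list, idcol = "NDate")
result[, NDate := anytime::anydate(NDate)]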
Yes, lapply will help, as Frank suggests. And you want to use rbind to keep the dates different for each file. Something along the lines of:
I'm assuming files.pet is a list of all the files you want to include...
my.fun <- function(file) {
  data <- read.table(file = file,
                     header = FALSE,
                     sep = ",",
                     quote = "\"",
                     comment.char = ";")
  data$NDate <- file
  return(data)
}
data.pet.atual <- do.call(rbind.data.frame, lapply(files.pet, FUN = my.fun))
I can't test this without a reproducible example, so you may need to play with it a bit, but the general approach should work!
I have a txt file (remove.txt) with this kind of data (RGB hex colors):
"#DDDEE0", "#D8D9DB", "#F5F6F8", "#C9CBCA"...
These are colors I don't want in my analysis.
I also have an R object (nacreHEX) with data like that in the file, but it contains both the good colors and the colors I don't want in my analysis. So I use this code to remove them:
nacreHEX <- nacreHEX[!nacreHEX %in% remove]
It works when remove is an R object like remove <- c("#DDDEE0", "#D8D9DB", ...), but it doesn't work when it comes from a txt file that I turn into a data.frame, nor when I try remove2 <- as.vector(t(remove)).
So there is my code:
remove <- read.table("remove.txt", sep=",")
remove2 <-as.vector(t(remove))
nacreHEX <- nacreHEX [! nacreHEX %in% remove2]
head(nacreHEX)
With this, the vector from as.vector has no commas, so maybe that's why it doesn't work.
How can I make an R vector with commas from this kind of data?
What step did I forget?
The problem is that your txt file is separated by ", " (comma plus space), not just ",". The spaces end up in your strings:
rr = read.table(text = '"#DDDEE0", "#D8D9DB", "#F5F6F8", "#C9CBCA"', sep = ",")
(rr = as.vector(t(rr)))
# [1] "#DDDEE0" " #D8D9DB" " #F5F6F8" " #C9CBCA"
You can see the leading spaces before the #. We can trim these spaces with trimws().
trimws(rr)
# [1] "#DDDEE0" "#D8D9DB" "#F5F6F8" "#C9CBCA"
Even better, you can use the argument strip.white to have read.table do it for you:
rr = read.table(text = '"#DDDEE0", "#D8D9DB", "#F5F6F8", "#C9CBCA"',
                sep = ",", strip.white = TRUE)
I want to write a data frame from R into a CSV file. Consider the following toy example
df <- data.frame(ID = c(1,2,3), X = c("a", "b", "c"), Y = c(1,2,NA))
df[which(is.na(df[,"Y"])), 1]
write.table(t(df), file = "path to CSV/test.csv", col.names = F, sep = ",", quote = F)
The output in test.csv looks as follows:
ID,1,2,3
X,a,b,c
Y, 1, 2,NA
At first glance, this is exactly what I need, BUT what cannot be seen in the code block above is that after the NA in the last line, there is another linebreak. When I pass test.csv to a Javascript chart on a website, however, the trailing linebreak causes trouble.
Is there a way to avoid this final linebreak within R?
This is a little convoluted, but obtains your desired result:
zz <- textConnection("foo", "w")
write.table(t(df), file = zz, col.names=F, sep=",", quote=F)
close(zz)
foo
# [1] "ID,1,2,3" "X,a,b,c" "Y, 1, 2,NA"
cat(paste(foo, collapse='\n'), file = 'test.csv', sep='')
You should end up with a file that has a newline character after only the first two data rows.
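If you want to verify there is no trailing linebreak, a quick check (not part of the original answer) is to inspect the file's final byte:
bytes <- readBin("test.csv", what = "raw", n = file.size("test.csv"))
tail(bytes, 1) == charToRaw("\n")   # FALSE means no trailing newline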
You can use a command line utility to strip the trailing newline from a file, for example with perl:
perl -pi -e 'chomp if eof' test.csv
Or, you could begin by writing a single row and then appending the rest, as sketched below.
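A minimal sketch of that idea: open a connection once and emit the linebreak before every row except the first, so the file never ends with one.
m <- t(df)                            # character matrix; row names ID, X, Y
con <- file("test.csv", open = "w")
for (i in seq_len(nrow(m))) {
  if (i > 1) cat("\n", file = con)    # newline precedes rows 2..n only
  cat(rownames(m)[i], m[i, ], file = con, sep = ",")
}
close(con)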
An alternative in a similar vein to the answer by @Thomas, but with slightly less typing. Send the output from write.csv to a character vector (capture.output). Concatenate the strings (paste) and separate the elements with linebreaks (collapse = "\n"). Write to file with cat.
x <- capture.output(write.csv(df, row.names = FALSE, quote = FALSE))
cat(paste(x, collapse = "\n"), file = "df.csv")
You may also use format_csv from package readr to create a character vector with line breaks (\n). Remove the last end-of-line \n with substr. Write to file with cat.
library(readr)
x <- format_csv(df)
cat(substr(x, 1, nchar(x) - 1), file = "df.csv")
I'm wondering if it would be possible to draw out out the filename from a file.choose() command embedded within a read.csv call. Right now I'm doing this in two steps, but the user has to select the same file twice in order to extract both the data (csv) and the filename for use in a function I run. I want to make it so the user only has to select the file once, and then I can use both the data and the name of the file.
Here's what I'm working with:
data <- read.csv(file.choose(), skip = 1)
name <- basename(file.choose())
I'm running OS X if that helps, since I think file.choose() has different behavior depending on the OS. Thanks in advance.
Why do you use an embedded file.choose() command? Store the result first:
filename <- file.choose()
data <- read.csv(filename, skip=1)
name <- basename(filename)
Use this:
df = read.csv(file.choose(), sep = "<use relevant separator>", header = T, quote = "")
Separators are generally a comma (,) or a full stop (.).
Example:
df = read.csv(file.choose(), sep = ",", header = T, quote = "")
Use:
df = df[, !apply(is.na(df), 2, all)]  # works for any data format
to remove blank (all-NA) columns.