I have a rather simple data frame
str(match)
'data.frame': 261 obs. of 2 variables:
$ country: chr "Afghanistan" "Albania" "Algeria" "American Samoa" ...
$ match : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
If I write this to a .csv file by running a single line, everything works fine.
write.csv(match.df, file = "match.csv")
However, when using knitr to produce a .pdf I (of course) use chunks.
<<worldmapmatch>>=
exo$match <- is.na(match(exo$iso2c, wrld_simpl#data$ISO2))
match <- exo[, c("country", "match")]
match.df <- data.frame(match)
head(match.df)
str(match.df)
write.csv(match.df, file = "match.csv")
save(exo, file = "exo.RData")
#
In this case I receive an error message.
Dimension too large.
The error message is given for the line which states "write..."
Any clues?
Related
Hi I was trying to read a csv file, I would like to get a vector
the inside of file like
head filetoread.csv
0610010K14Rik,0610011F06Rik,1110032F04Rik,1110034G24Rik,1500011B03Rik,1700019L03Rik,1700021K19Rik, blah,blah,...
in R session:
c <- read.csv("filetoread.csv")
> c
[1] X0610010K14Rik X0610011F06Rik X1110032F04Rik
...
> str(c)
'data.frame': 0 obs. of 2840 variables:
$ X0610010K14Rik : logi
$ X0610011F06Rik : logi
$ X1110032F04Rik : logi
$ X1110034G24Rik : logi
...
but I wanna something like:
> c
[1] "X0610010K14Rik" "X0610011F06Rik" "X1110032F04Rik" ...
str(c)
chr [1:2840] "X0610010K14Rik" "X0610011F06Rik" "X1110032F04Rik"...
We can use scan
scan("filetoread.csv", sep=',', what = "", quiet = TRUE)
#[1] "0610010K14Rik" "0610011F06Rik" "1110032F04Rik" "1110034G24Rik"
#[5] "1500011B03Rik" "1700019L03Rik" "1700021K19Rik" " blah" "blah"
I have a PDF document with 300 pages. I need to split this file in 150 files containing each one 2 pages. For example, the 1st document would contain pages 1 & 2 of the original file, the 2nd document, the pages 3 & 4 and so on.
Maybe I can use the "pdftools" package, but I don't know how.
1) pdftools Assuming that the input PDF is in the current directory and the outputs are to go into the same directory, change the inputs below and then get the number of pages num, compute the st and en vectors of start and end page numbers and repeatedly call pdf_subset. Note that the pdf_length and pdf_subset functions come from the qpdf R package but are also made available by the pdftools R package by importing them and exporting them back out.
library(pdftools)
# inputs
infile <- "a.pdf" # input pdf
prefix <- "out_" # output pdf's will begin with this prefix
num <- pdf_length(infile)
st <- seq(1, num, 2)
en <- pmin(st + 1, num)
for (i in seq_along(st)) {
outfile <- sprintf("%s%0*d.pdf", prefix, nchar(num), i)
pdf_subset(infile, pages = st[i]:en[i], output = outfile)
}
2) pdfbox The Apache pdfbox utility can split into files of 2 pages each. Download the .jar command line utilities file from pdfbox and be sure you have java installed. Then run this assuming that your input file is a.pdf and is in the current directory (or run the quoted part directly from the command line without the quotes and without R). The jar file name below may need to be changed if a later version is to be used. The one named below is the latest one currently (not counting alpha version).
system("java -jar pdfbox-app-2.0.26.jar PDFSplit -split 2 a.pdf")
3) animation/pdftk Another option is to install the pdftk program, change the inputs at the top of the script below and run. This gets the number of pages in the input, num, using pdftk and then computes the start and end page numbers, st and en, and then invokes pdftk repeatedly, once for each st/en pair to extract those pages into another file.
library(animation)
# inputs
PDFTK <- "~/../bin/pdftk.exe" # path to pdftk
infile <- "a.pdf" # input pdf
prefix <- "out_" # output pdf's will begin with this prefix
ani.options(pdftk = Sys.glob(PDFTK))
tmp <- tempfile()
dump_data <- pdftk(infile, "dump_data", tmp)
g <- grep("NumberOfPages", readLines(tmp), value = TRUE)
num <- as.numeric(sub(".* ", "", g))
st <- seq(1, num, 2)
en <- pmin(st + 1, num)
for (i in seq_along(st)) {
outfile <- sprintf("%s%0*d.pdf", prefix, nchar(num), i)
pdftk(infile, sprintf("cat %d-%d", st[i], en[i]), outfile)
}
Neither pdftools nor qpdf (on which the first depends) support splitting PDF files by other than "every page". You likely will need to rely on an external program, I'm confident you can get pdftk to do that by calling it once for each 2-page output.
I have a 36-page PDF here named quux.pdf in the current working directory.
str(pdftools::pdf_info("quux.pdf"))
# List of 11
# $ version : chr "1.5"
# $ pages : int 36
# $ encrypted : logi FALSE
# $ linearized : logi FALSE
# $ keys :List of 8
# ..$ Producer : chr "pdfTeX-1.40.24"
# ..$ Author : chr ""
# ..$ Title : chr ""
# ..$ Subject : chr ""
# ..$ Creator : chr "LaTeX via pandoc"
# ..$ Keywords : chr ""
# ..$ Trapped : chr ""
# ..$ PTEX.Fullbanner: chr "This is pdfTeX, Version 3.141592653-2.6-1.40.24 (TeX Live 2022) kpathsea version 6.3.4"
# $ created : POSIXct[1:1], format: "2022-05-17 22:54:40"
# $ modified : POSIXct[1:1], format: "2022-05-17 22:54:40"
# $ metadata : chr ""
# $ locked : logi FALSE
# $ attachments: logi FALSE
# $ layout : chr "no_layout"
I also have pdftk installed and available in the page,
Sys.which("pdftk")
# pdftk
# "C:\\PROGRA~2\\PDFtk Server\\bin\\pdftk.exe"
With this, I can run an external script to create 2-page PDFs:
list.files(pattern = "pdf$")
# [1] "quux.pdf"
pages <- seq(pdftools::pdf_info("quux.pdf")$pages)
pages <- split(pages, (pages - 1) %/% 2)
pages[1:3]
# $`0`
# [1] 1 2
# $`1`
# [1] 3 4
# $`2`
# [1] 5 6
for (pg in pages) {
system(sprintf("pdftk quux.pdf cat %s-%s output out_%02i-%02i.pdf",
min(pg), max(pg), min(pg), max(pg)))
}
list.files(pattern = "pdf$")
# [1] "out_01-02.pdf" "out_03-04.pdf" "out_05-06.pdf" "out_07-08.pdf"
# [5] "out_09-10.pdf" "out_11-12.pdf" "out_13-14.pdf" "out_15-16.pdf"
# [9] "out_17-18.pdf" "out_19-20.pdf" "out_21-22.pdf" "out_23-24.pdf"
# [13] "out_25-26.pdf" "out_27-28.pdf" "out_29-30.pdf" "out_31-32.pdf"
# [17] "out_33-34.pdf" "out_35-36.pdf" "quux.pdf"
str(pdftools::pdf_info("out_01-02.pdf"))
# List of 11
# $ version : chr "1.5"
# $ pages : int 2
# $ encrypted : logi FALSE
# $ linearized : logi FALSE
# $ keys :List of 2
# ..$ Creator : chr "pdftk 2.02 - www.pdftk.com"
# ..$ Producer: chr "itext-paulo-155 (itextpdf.sf.net-lowagie.com)"
# $ created : POSIXct[1:1], format: "2022-05-18 09:37:56"
# $ modified : POSIXct[1:1], format: "2022-05-18 09:37:56"
# $ metadata : chr ""
# $ locked : logi FALSE
# $ attachments: logi FALSE
# $ layout : chr "no_layout"
I am trying to import the data from this api https://api.ycombinator.com/companies/export.json?callback=true
and i am getting the following error:
Error in parse_con(txt, bigint_as_char) lexical error: invalid char in json text.
setupCompanies([{"name":"Parake
(right here) ------^
i thought the error was because of emoticons, so i downloaded the file as text and did the manual removal. It didnt work
Remove the ?callback=true from your URL, and it works without error:
aa <- jsonlite::fromJSON("https://api.ycombinator.com/companies/export.json")
str(aa)
# 'data.frame': 2055 obs. of 8 variables:
# $ name : chr "Parakey" "Dinesafe" "Pengram" "Demeanor.co" ...
# $ url : chr "http://parakey.com" "https://dinesafe.org" "http://pengramar.com" "https://demeanor.co" ...
# $ batch : chr "s2005" "s2018" "w2019" "s2018" ...
# $ vertical : chr NA "B2B" "Augmented Reality" "Media" ...
# $ description: chr "" "We crowdsource food poisoning reports and help detect and prevent outbreaks." "Pengram provides indoor navigation in augmented reality on your phone. " "Now part of thentwrk.com" ...
# $ dead : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
# $ has_ff : logi NA FALSE FALSE FALSE FALSE FALSE ...
# $ all_ff : logi NA FALSE FALSE FALSE FALSE FALSE ...
I'm guessing that the "callback API" is setting up the return value so that it is effectively a function call (i.e., setupCompanies(...)), not just data.
I have ASCII files with data separated by $ signs.
There are 23 columns in the data, the first row is of column names, but there is inconsistency between the line endings, which causes R to import the data improperly, by shift the data left-wise with respect to their columns.
Header line:
ISR$CASE$I_F_COD$FOLL_SEQ$IMAGE$EVENT_DT$MFR_DT$FDA_DT$REPT_COD$MFR_NUM$MFR_SNDR$AGE$AGE_COD$GNDR_COD$E_SUB$WT$WT_COD$REPT_DT$OCCP_COD$DEATH_DT$TO_MFR$CONFID$REPORTER_COUNTRY
which does not end with a $ sign.
First row line:
7215577$8135839$I$$7215577-0$20101011$$20110104$DIR$$$67$YR$F$N$220$LBS$20110102$CN$$N$Y$UNITED STATES$
Which does end with a $ sign.
My import command:
read.table(filename, header=TRUE, sep="$", comment.char="", header=TRUE, quote="")
My guess is that the inconsistency between the line endings causes R to think that the records have one column more than the header, thus making the first column as a row.names column, which is not correct. Adding the specification row.names=NULL does not fix the issue.
If I manually add a $ sign in the file the problem is solved, but this is infeasible as the issue occurs in hundreds of files. Is there a way to specify how to read the header line? Do I have any alternative?
Additional info: the headers change across different files, so I cannot set my own vector of column names
Create a dummy test file:
cat("ISR$CASE$I_F_COD$FOLL_SEQ$IMAGE$EVENT_DT$MFR_DT$FDA_DT$REPT_COD$MFR_NUM$MFR_SNDR$AGE$AGE_COD$GNDR_COD$E_SUB$WT$WT_COD$REPT_DT$OCCP_COD$DEATH_DT$TO_MFR$CONFID$REPORTER_COUNTRY\n7215577$8135839$I$$7215577-0$20101011$$20110104$DIR$$$67$YR$F$N$220$LBS$20110102$CN$$N$Y$UNITED STATES$",
file="deleteme.txt",
"\n")
Solution using gsub:
First read the file as text and then edit its content:
file_path <- "deleteme.txt"
fh <- file(file_path)
file_content <- readLines(fh)
close(fh)
Either add a $ at the end of header row:
file_content[1] <- paste0(file_content, "$")
Or remove $ from the end of all rows:
file_content <- gsub("\\$$", "", file_content)
Then we write the fixed file back to disk:
cat(paste0(file_content, collapse="\n"), file=paste0("fixed_", file_path), "\n")
Now we can read the file:
df <- read.table(paste0("fixed_", file_path), header=TRUE, sep="$", comment.char="", quote="", stringsAsFactors=FALSE)
And get the desired structure:
str(df)
'data.frame': 1 obs. of 23 variables:
$ ISR : int 7215577
$ CASE : int 8135839
$ I_F_COD : chr "I"
$ FOLL_SEQ : logi NA
$ IMAGE : chr "7215577-0"
$ EVENT_DT : int 20101011
$ MFR_DT : logi NA
$ FDA_DT : int 20110104
$ REPT_COD : chr "DIR"
$ MFR_NUM : logi NA
$ MFR_SNDR : logi NA
$ AGE : int 67
$ AGE_COD : chr "YR"
$ GNDR_COD : logi FALSE
$ E_SUB : chr "N"
$ WT : int 220
$ WT_COD : chr "LBS"
$ REPT_DT : int 20110102
$ OCCP_COD : chr "CN"
$ DEATH_DT : logi NA
$ TO_MFR : chr "N"
$ CONFID : chr "Y"
$ REPORTER_COUNTRY: chr "UNITED STATES "
I'm opening ebird data with auk, having trouble creating a path for the file. I set the path to a folder. When I try to change it to a file it says the path is not true.
with the Sys.getenv() I can see the path is set to a folder. with the auk_get_ebd_path() command I see the same thing. When I try to change the path to a file inside that folder with the auk_set_ebd_path() command I receive an error message.
library(auk)
auk_get_ebd_path()
[1] "/Users/lucypullen/Documents/bird/data"
auk_set_ebd_path("/Users/lucypullen/Documents/bird/data/ebd_CA_relApr-2019.txt", overwrite = TRUE)
[1] Error: dir.exists(paths = path) is not TRUE
other attempts yeilded an Error in file(con, "r") : cannot open the connection message
Warning messages: 1: In file(con, "r") :
'raw = FALSE' but '/Users.....data/CA' is not a regular file
2: In file(con, "r") :
cannot open file '/Users/lucypullen/Documents/bird/data/CA': it is a directory
seems like they want the path to go to a file. I thought the path would be complete with the system.file() command. I've tried a bunch of variations:
input_file <- system.file("/Users/lucypullen/Documents/bird/data/CA/ebd_CA_relApr-2019.txt", package = "auk")
or
input_file <- system.file("ebd_CA_relApr-2019.txt", package = "auk")
or
input_file <- system.file("~/ebd_CA_relApr-2019.txt", package = "auk")
I suspect you should be doing this, since there appears to have been some sort of setup operation that preceded this question:
my_ebd_path = auk_get_ebd_path() # since you appear to already set it.
my_full_file_loc <- paste0(my_ebd_path, ”/“, "ebd_CA_relApr-2019.txt")
my_ebd_data <- read_ebd(my_full_file_loc)
str(my_ebd_data)
# ------what I get with the sample file in the package--------------
Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 494 obs. of 45 variables:
$ checklist_id : chr "S6852862" "S14432467" "S39033556" "S38303088" ...
$ global_unique_identifier : chr "URN:CornellLabOfOrnithology:EBIRD:OBS97935965" "URN:CornellLabOfOrnithology:EBIRD:OBS201605886" "URN:CornellLabOfOrnithology:EBIRD:OBS530638734" "URN:CornellLabOfOrnithology:EBIRD:OBS520887169" ...
$ last_edited_date : chr "2016-02-22 14:59:49" "2013-06-16 17:34:19" "2017-09-06 13:13:34" "2017-07-24 15:17:16" ...
$ taxonomic_order : num 20145 20145 20145 20145 20145 ...
$ category : chr "species" "species" "species" "species" ...
$ common_name : chr "Green Jay" "Green Jay" "Green Jay" "Green Jay" ...
$ scientific_name : chr "Cyanocorax yncas" "Cyanocorax yncas" "Cyanocorax yncas" "Cyanocorax yncas" ...
$ observation_count : chr "4" "2" "1" "1" ...
$ breeding_bird_atlas_code : chr NA NA NA NA ...
#----snipped a bunch of output---------