R read.csv didn't load all rows of .tsv file

A little mystery. I have a .tsv file that contains 58936 rows. I loaded the file into R using this command:
dat <- read.csv("weekly_devdata.tsv", header=FALSE, stringsAsFactors=TRUE, sep="\t")
but nrow(dat) only shows this:
> nrow(dat)
[1] 28485
So I used sed -n to write the rows around where it stopped (the row before, that row itself, and the row after) to a new file, and I was able to load that file into R, so I don't think there was any corruption in the file.
Is it an environment issue?
Here's my sessionInfo()
> sessionInfo()
R version 3.1.2 (2014-10-31)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] tcltk stats graphics grDevices utils datasets methods base
other attached packages:
[1] sqldf_0.4-10 RSQLite_1.0.0 DBI_0.3.1 gsubfn_0.6-6 proto_0.3-10 scales_0.2.4 plotrix_3.5-11
[8] reshape2_1.4.1 dplyr_0.4.1
loaded via a namespace (and not attached):
[1] assertthat_0.1 chron_2.3-45 colorspace_1.2-4 lazyeval_0.1.10 magrittr_1.5 munsell_0.4.2
[7] parallel_3.1.2 plyr_1.8.1 Rcpp_0.11.4 rpart_4.1-8 stringr_0.6.2 tools_3.1.2
Did I run out of memory? Is that why it didn't finish loading?

I had a similar problem lately, and it turned out it came down to two different issues.
1 - Not all rows had the right number of tabs. I ended up counting them using awk
2 - At some points in the file I had quotes that were not closed. This was causing it to skip over all the lines until it found a closing quote.
I will dig up the awk code I used to investigate and fix these issues and post it.
Since I am using Windows, I used the awk that came with git bash.
This counts the fields in each line and prints the tab count and the contents of any line that does not have the expected number of fields:
awk -F "\t" 'NF!=6 { print NF-1 ":" $0 } ' Catalog01.csv
I used something similar to count quotes, and I used tr to fix a lot of it.
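For reference, here is a rough R version of the same checks (an untested sketch; the expected 6 fields / 5 tabs are taken from the awk line above, and the quote count simply has to come out even):
lines  <- readLines("Catalog01.csv", warn = FALSE)
tabs   <- nchar(gsub("[^\t]", "", lines))   # tabs per line (fields - 1)
quotes <- nchar(gsub('[^"]', "", lines))    # double quotes per line
which(tabs != 5)          # lines that do not have the expected 6 fields
which(quotes %% 2 != 0)   # lines with an unmatched quote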

Pretty sure this was not a memory issue. If the problem is unmatched quotes then try this:
t <- read.csv("weekly_devdata.tsv", header=FALSE, stringsAsFactors=TRUE, sep="\t",
              quote="")
There is also the very useful function count.fields, which I use inside table() to get a high-level view of the consequences of various parameter settings. Take a look at the results of:
table( count.fields( "weekly_devdata.tsv", sep="\t"))
And compare to:
table( count.fields( "weekly_devdata.tsv", sep="\t", quote=""))
It's sometimes necessary to read the file in with readLines, remove one or more offending lines (assigning the result to, say, clean), and then send the cleaned-up lines to read.table(text=clean, sep="\t", quote="").
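A minimal sketch of that readLines route, assuming the offending lines are the ones with the wrong field count (the expected count of 7 here is only an example; take it from the table() output above):
lines <- readLines("weekly_devdata.tsv", warn = FALSE)
nf <- count.fields("weekly_devdata.tsv", sep = "\t", quote = "",
                   comment.char = "", blank.lines.skip = FALSE)
clean <- lines[nf == 7]   # drop lines with the wrong number of fields
dat <- read.table(text = clean, sep = "\t", quote = "", header = FALSE,
                  stringsAsFactors = TRUE)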

Could well be some illegal characters in some of the entries... Check how many rows load and where the issue takes place, then delve deeper into that row of the raw data. Webdings font, that kind of stuff!
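A rough way to eyeball the raw line right after the last row that loaded (a sketch assuming the OP's dat and file name; with header=FALSE, data row n corresponds to file line n):
raw <- readLines("weekly_devdata.tsv", warn = FALSE)
suspect <- raw[nrow(dat) + 1]
suspect                                                          # print the raw line
regmatches(suspect, gregexpr("[^ -~\t]", suspect, perl = TRUE))  # list any non-ASCII bytes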

Related

R Markdown struggling with read_xlsx, Warning: Expecting logical

When running read_xlsx() in my normal .R script, I'm able to read in the data. But when running the .R script with source() from R Markdown, it suddenly takes a long time (> 20 minutes; I always terminate it before the end) and I keep getting these warning messages where it evaluates every single column and expects it to be a logical:
Warning: Expecting logical in DE5073 / R5073C109: got 'HOSPITAL/CLINIC'
Warning: Expecting logical in DG5073 / R5073C111: got 'YES'
Warning: Expecting logical in CQ5074 / R5074C95: got '0'
Warning: Expecting logical in CR5074 / R5074C96: got 'MARKET/GROCERY STORE'
Warning: Expecting logical in CT5074 / R5074C98: got 'NO'
Warning: Expecting logical in CU5074 / R5074C99: got 'YES'
Warning: Expecting logical in CV5074 / R5074C100: got 'Less than one week'
Warning: Expecting logical in CW5074 / R5074C101: got 'NEXT'
Warning: Expecting logical in CX5074 / R5074C102: got '0'
.. etc
I can't share the data here, but it is just a normal xlsx file (30k obs, 110 vars). The data has responses in all capitals like YES and NO. The raw data has filters applied, some additional sheets, and some mild formatting in Excel (no borders, white fill) but I don't think these are affecting it.
An example of my workflow setup is like this:
Dataprep.R:
setwd()
pacman::p_load() # all my packages
df <- read_xlsx("./data/Data.xlsx") %>% type_convert()
## blabla more cleaning stuff
Report.Rmd:
setwd()
pacman::p_load() # all my packages again
source("Dataprep.R")
When I run Dataprep.R, everything works in < 1 min. But when I try to source("Dataprep.R") from Report.Rmd, then it starts being slow at read_xlsx() and giving me those warnings.
I've also tried taking df <- read_xlsx() out of Dataprep.R and moving it to Report.Rmd, and it is still as slow as running source(). I've also removed type_convert() and tried other things like removing the extra sheets in the Excel file. source() was also in the setup chunk in Report.Rmd, but I took it out and it's still the same thing.
So I think it is something to do with R Markdown and readxl/read_xlsx(). The exact same code and data is evaluating so differently in R vs Rmd and it's very puzzling.
Would appreciate any insight on this. Is there a fix? Or is this something I will just have to live with (i.e. convert to csv)?
> sessionInfo()
R version 4.2.0 (2022-04-22 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 22000)
Matrix products: default
locale:
[1] LC_COLLATE=English_United Kingdom.utf8 LC_CTYPE=English_United Kingdom.utf8 LC_MONETARY=English_United Kingdom.utf8
[4] LC_NUMERIC=C LC_TIME=English_United Kingdom.utf8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
loaded via a namespace (and not attached):
[1] digest_0.6.29 R6_2.5.1 lifecycle_1.0.1 pacman_0.5.1 evaluate_0.15 scales_1.2.0 rlang_1.0.2 cli_3.3.0 rstudioapi_0.13
[10] rmarkdown_2.14 tools_4.2.0 munsell_0.5.0 xfun_0.30 yaml_2.3.5 fastmap_1.1.0 compiler_4.2.0 colorspace_2.0-3 htmltools_0.5.2
[19] knitr_1.39
UPDATE:
So in Markdown, I can use the more generic read_excel() and that works in my setup chunk. But I still get the same Warning messages if I try to source() it, even if the R script sourced is also using read_excel() instead of read_xlsx(). Very puzzling all around.
When you run that code in a .R script (and probably any other code that generates many warnings), you only get a summary of warnings, something like "There were 50 or more warnings (use warnings() to see the first 50)".
If you run that same code in a standard R Markdown code chunk, however, you get every single warning printed, which could mean printing thousands or even millions of warnings.
If your question is WHY does that happen on Rmarkdown and not on R, I'm not sure.
But if your question is how to solve it, it's simple. Just make sure to add the options message=FALSE and warning=FALSE to your code chunk.
It should look something like this:
{r chunk_name, message=FALSE, warning=FALSE}
setwd()
pacman::p_load() # all my packages again
source("Dataprep.R")
Now, about the "setwd()", I would advise against using anything that changes the state of your system (avoid "side effect" functions). They can create problems if you are not very careful. But that is another topic for another day.

R/data.table: read multiline script with fread

I'm using data.table::fread to read input from a shell script. For readability I want to split the script on multiple lines using the line continuation character '\'.
However, fread doesn't seem to like shell scripts on multiple lines.
Examples:
library(data.table)
fread("cat test1.txt test2.txt") ## OK
Now split script on two lines:
fread("cat test1.txt \
test2.txt")
Error in fread("cat test.txt \n test.txt") :
Expected sep (' ') but new line, EOF (or other non printing character) ends field 0 when detecting types ( first): test.txt
## Same problem
fread("cat test.txt \\
test.txt")
Is there any escape sequence or switch I'm missing?
If not, these are the possible workarounds I guess: 1) don't split the script at all, or 2) write the script to a file and call that file from fread.
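A minimal sketch of option 2, assuming the multi-line script is first written to a temporary file:
library(data.table)
script <- tempfile(fileext = ".sh")
writeLines(c("cat test1.txt \\", "    test2.txt"), script)  # the multi-line script
dt <- fread(paste("sh", shQuote(script)))                   # single-line command handed to fread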
These are my settings:
sessionInfo()
R version 3.2.3 (2015-12-10)
Platform: x86_64-redhat-linux-gnu (64-bit)
Running under: CentOS release 6.7 (Final)
locale:
[1] LC_CTYPE=en_GB.UTF-8 LC_NUMERIC=C LC_TIME=en_GB.UTF-8 LC_COLLATE=en_GB.UTF-8 LC_MONETARY=en_GB.UTF-8 LC_MESSAGES=en_GB.UTF-8 LC_PAPER=en_GB.UTF-8 LC_NAME=C LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] data.table_1.10.4
loaded via a namespace (and not attached):
[1] tools_3.2.3 chron_2.3-46 tcltk_3.2.3
Embedding the command in paste() is an alternative, since paste()'s default sep = " " joins the pieces back into a single-line command:
fread(paste("cat test1.txt",
            "test2.txt"))
If you are looking for an easy way to read multiple text files, you could either use
fread("cat t*.txt")
or, if the .txt files don't follow the above file-name pattern, perhaps move them to a sub-directory (say 'data') and read them all as below:
fread("cat data/*.txt")

Non-ASCII characters in R, reading from .sav file

I am trying to read a .sav file into RStudio. The file contains data from a Spanish language survey, and when I read it into R -- even though my default text encoding has already been set to ISO-8859-1 -- the display of special characters is incorrect.
For example, the word "Camión" appears as
"Cami<c3><b3>n"
even though it shows up correctly as "Camión" in PSPP.
This is what I did:
install.packages("memisc")
jcv2014 <- as.data.set(spss.system.file('myfile.sav'))
Later, I wanted to create a list of just the variable labels, so I did the following:
library(foreign)
jcv2014.spss <- read.spss("myfile.sav", to.data.frame=FALSE, use.value.labels=FALSE)
jcv2014_vars <- attr(jcv2014.spss, "variable.labels")
(I'm not sure if this is the best way to do it, but it worked)
Anyway, this time around, I still didn't get the proper accents but there was a different sort of encoding:
A variable label that was supposed to be "¿Qué calificación le daría..." instead appeared as
"\302\277Qu\303\251 calificaci\303\263n le dar\303\255a..."
I'm not sure how to get the proper characters, but they appear correctly in PSPP. I tried changing the default text encoding in R to both ISO-8859-1 and UTF-8, to no avail. I don't know what the original file was encoded in, but I guessed it would be one of those.
Any ideas?
And if it helps, I have R version 3.1.1 and OS X Yosemite version 10.10.1, and I am using PSPP, not SPSS.
Thanks so much in advance!!!
Can you just set the encoding once you've read the data in?
# Here's your sentence
s <- "\302\277Qu\303\251 calificaci\303\263n le dar\303\255a..."
# it has no encoding
Encoding(s)
# [1] "unknown"
# but if you specify UTF-8, then it shows up correctly
iconv(s, 'UTF-8')
# [1] "¿Qué calificación le daría..."
# This also works
Encoding(s) <- 'UTF-8'
s
# [1] "¿Qué calificación le daría..."
Here are the results of my sessionInfo() call. You should post yours too.
> sessionInfo()
R version 3.1.1 (2014-07-10)
Platform: x86_64-apple-darwin13.1.0 (64-bit)
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] reshape2_1.4 hexbin_1.27.0 ggplot2_1.0.0 data.table_1.9.2 yaml_2.1.13
[6] redshift_0.4 RJDBC_0.2-4 rJava_0.9-6 DBI_0.3.1
loaded via a namespace (and not attached):
[1] colorspace_1.2-4 digest_0.6.4 grid_3.1.1 gtable_0.1.2 labeling_0.2
[6] lattice_0.20-29 MASS_7.3-33 munsell_0.4.2 plyr_1.8.1 proto_0.3-10
[11] Rcpp_0.11.2 scales_0.2.4 stringr_0.6.2 tools_3.1.1
Update: it looks like you may not have a locale that supports UTF-8. Here are the locale settings for each category on my system. You might try using Sys.setlocale() and updating them one by one on your system (or just use LC_ALL if you don't feel the need to test each one incrementally). See ?Sys.setlocale for more info.
cat_str <- c("LC_COLLATE", "LC_CTYPE", "LC_MONETARY", "LC_NUMERIC",
"LC_TIME", "LC_MESSAGES", "LC_PAPER", "LC_MEASUREMENT")
sapply(cat_str, Sys.getlocale)
# LC_COLLATE LC_CTYPE LC_MONETARY LC_NUMERIC LC_TIME LC_MESSAGES
# "en_US.UTF-8" "en_US.UTF-8" "en_US.UTF-8" "C" "en_US.UTF-8" "en_US.UTF-8"
# LC_PAPER LC_MEASUREMENT
# "" ""

Understanding data.table invalid .selfref warning

I am trying to figure out the data.table 'invalid .selfref' warning that I am getting with the code below.
library(data.table)
library(dplyr)
DT <- data.table(aa=1:100, bb=rnorm(n=100), dd=gl(2,100))
DT <- DT %.% group_by(dd, aa) %.% summarize(m=mean(bb))
DT <- DT[, ee := 3]
The last line throws the warning. Here there is the suggestion to just write the last line as DT$ee <- 3, but that doesn't really explain why it works (and why := doesn't), and to a beginner data.table user it also doesn't feel like the proper data.table idiom.
It IS related to the dplyr line in there, which obviously changes the DT data.table. But when I change that line (and those following) to DDT <- DT %.% group_by() ..., I still get the .selfref warning from the DT[, ee := 3] line.
I've been checking the various sources, but the information there hasn't really sunk in, so I am still confused.
R version 3.1.0 (2014-04-10)
Platform: x86_64-w64-mingw32/x64 (64-bit)
locale:
[1] LC_COLLATE=Dutch_Netherlands.1252 LC_CTYPE=Dutch_Netherlands.1252
[3] LC_MONETARY=Dutch_Netherlands.1252 LC_NUMERIC=C
[5] LC_TIME=Dutch_Netherlands.1252
attached base packages:
[1] graphics grDevices utils datasets stats methods base
other attached packages:
[1] dplyr_0.2 data.table_1.9.2 ggplot2_1.0.0
loaded via a namespace (and not attached):
[1] assertthat_0.1 colorspace_1.2-4 digest_0.6.4 grid_3.1.0
[5] gtable_0.1.2 MASS_7.3-31 munsell_0.4.2 parallel_3.1.0
[9] plyr_1.8.1 proto_0.3-10 Rcpp_0.11.2 reshape2_1.4
[13] scales_0.2.4 stringr_0.6.2 tools_3.1.0
I just ran your code, and I see the problem. data.table over-allocates the vector of column pointers (so that columns can later be added by reference efficiently), and this warning occurs when an operation (most likely inadvertently) removes that over-allocation.
Let me try to explain over-allocation using slide 45 from Matt's useR 2014 presentation. The (blue and yellow) boxes on the top correspond to the vector of column pointers, and the arrows show the data each pointer points to.
The figure on the left depicts pictorially how adding (or cbinding) a column to a data.frame works. cbinding a column basically results in a (deep or shallow) copy, giving a new location for the vector of column pointers (shown in yellow) and for the data (which now has one more column).
The figure on the right shows the data.table way, where there are more than 3 blue boxes to begin with, due to over-allocation during data.table creation. And by using :=, not even a shallow copy is made: the vector of column pointers stays where it is, and the next unused over-allocated slot is used for your new column.
That is the difference, and that is what over-allocation means here.
Now the warning tells you that whatever operation you did has removed this over-allocation - meaning the extra blue boxes are gone! So we can't add columns by reference anymore until we over-allocate again (which is unnecessary and should be avoided, but since the over-allocation is already gone, we do the next best thing).
My guess is that your dplyr syntax somehow removes this over-allocation; that is caught in the next step when you use :=, and data.table over-allocates once again before adding the new column by reference (which results in a shallow copy).
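For what it's worth, a minimal sketch of what the over-allocation looks like from the prompt (the exact numbers depend on your data.table version and the datatable.alloccol option):
library(data.table)
DT2 <- data.table(aa = 1:100, bb = rnorm(100), dd = gl(2, 50))
length(DT2)      # columns actually in use: 3
truelength(DT2)  # column-pointer slots allocated, i.e. the extra blue boxes
# after an operation that rebuilds the object as a plain copy, truelength()
# typically falls back to 0 or length(DT2), and the next := triggers the
# .selfref warning plus a fresh over-allocation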
If I do it the data.table way:
DT <- DT[, list(m=mean(bb)), by=list(dd,aa)]
DT[, ee := 3]
it works just fine.
I don't have the time to look into dplyr right now to verify or find out what's doing this.
Update: Have suggested necessary changes as a pull request here.

knitr updated from 1.2 to 1.4 error: Quitting from lines

I recently updated knitr to 1.4, and since then my .Rnw files don't compile.
The document is rich (7 chapters, included with child="").
Now, in the recent knitr version I get an error message:
Quitting from lines 131-792 (/DATEN/anna/tex/CoSta/chapter1.Rnw)
Quitting from lines 817-826 (/DATEN/anna/tex/CoSta/chapter1.Rnw)
Fehler in if (eval) { :
Argument kann nicht als logischer Wert interpretiert werden
(The last two lines are German for "Error in if (eval) { : argument is not interpretable as logical", i.e. knitr expects a logical value and cannot find one.)
At lines 131 and 817 two figures end. Compiling these snippets separately works.
I have no idea how to resolve this problem.
Thanks in advance for any hints that help resolve my issue.
Here is the sessionInfo()
R version 2.15.1 (2012-06-22)
Platform: x86_64-pc-linux-gnu (64-bit)
locale:
[1] LC_CTYPE=de_DE.UTF-8 LC_NUMERIC=C
[3] LC_TIME=de_DE.UTF-8 LC_COLLATE=de_DE.UTF-8
[5] LC_MONETARY=de_DE.UTF-8 LC_MESSAGES=de_DE.UTF-8
[7] LC_PAPER=C LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=de_DE.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] tools stats graphics grDevices utils datasets methods
[8] base
other attached packages:
[1] knitr_1.4
loaded via a namespace (and not attached):
[1] compiler_2.15.1 digest_0.6.3 evaluate_0.4.7 formatR_0.9
[5] stringr_0.6.2 tcltk_2.15.1
Following the suggestions of Yihui, I ran each chapter separately with
knit("chapter1.Rnw")
and so on. No error message occurs, and separate tex files are created. To provide more information I display part of the code.
There is a main document in which several options are set
<<options-setting,echo=FALSE>>=
showthis <- FALSE
evalthis <- FALSE
evalchapter <- TRUE
opts_chunk$set(comment=NA, fig.width=6, fig.height=4)
@
Then each chapter is included via child chunks, e.g. chapter1 is called from
<<child-chapter1, child='chapter1.Rnw', eval=evalchapter>>=
@
The error message which appears when knitting the main Rnw file was given above.
The related Figure environment is as follows
\begin{figure}[ht]
\centering
<<wuerfel-simulation,echo=showthis,fig.height=5>>=
data.sample6 <- sample(1:6,repl=TRUE,100)
table(data.sample6)
barplot(table(data.sample6)/100,col=5,main="Haeufigkeiten beim Wuerfeln")
@
\caption{Visualisierung beim W"urfeln. 100 Versuche.}
\label{fig:muent-vis}
\end{figure}
This is not very advanced, but the error is still as it was given before.
The 'Quitting from lines' message covers a long stretch of text, from line 131 (the end of the first chunk) to line 792 (the beginning of the follow-up chunk), which is
<< zeiten, echo=showthis,eval=evalthis>>=
zeiten <- c(17,16,20,24,22,15,21,15,17,22)
max(zeiten)
mean(zeiten)
zeiten[4] <- 18; zeiten
mean(zeiten)
sum(zeiten > 20)
@
Is there a problem with correctly closing a chunk?
I have now located the error and can provide a short piece of code with a reproducible error message. It concerns conditional evaluation of child documents involving \Sexpr:
The main file is the following
\documentclass{article}
\begin{document}
<<options-setting,echo=FALSE>>=
evalchapter <- TRUE
@
<<test,child="test-child.Rnw", eval=evalchapter>>=
@
\end{document}
The related child file 'test-child.Rnw' is
<<no-sexpr>>=
t <- 2:4
@
text \Sexpr{(t <- 2:4)}
Knitting this as is gives the error message from above. If I remove the \Sexpr from the child, everything works nicely.
But everything also works nicely if I remove the conditioning in the call of the child file, i.e. drop 'eval=evalchapter'.
Since I use \Sexpr quite often, I would like a solution to this problem. As I mentioned earlier, there were no problems up to knitr version 1.2.
This is related to a change in knitr 1.3 and mentioned in the NEWS:
added an argument options to knit_child() to set global chunk options for child documents; if a parent chunk calls a child document (via the child option), the chunk options of the parent chunk will be used as global options for the child document, e.g. for <<foo, child='bar.Rnw', fig.path='figure/foo-'>>=, the figure path prefix will be figure/foo- in bar.Rnw; see How to avoid figure filenames in child calls for an application
And this caused a bug for inline R code. In your case, the chunk option eval=evalchapter was not evaluated before being used to evaluate the inline code. I have fixed the bug in the development version v1.4.5 on GitHub.
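If you need the fix before it reaches CRAN, one way to get the development version (a sketch assuming the devtools package is installed) is:
devtools::install_github("yihui/knitr")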
