Issue with double quotes and the fread function in R

I have some column entries that look like this:
c("This is just a \"shame\"...") # quotes escaped, since it is a character string containing double quotes
Note: this will write a file to your C:\ drive:
sample.data <- data.frame(case1=c("This is just a 'shame'..."),
case2="This is just a shame") # here I could not manage to insert the double quotes
write.csv(sample.data, file="C:/sample_data.csv")
require(data.table)
test.fread <- fread("C:/sample_data.csv")
test.read.csv <- read.csv("C:/sample_data.csv")
If I read the csv data with the fread function (from data.table), I get this error:
Bumped column 79 to type character on data row 12681, field contains '
a.n."'. Coercing previously read values in this column from logical,
integer or numeric back to character which may not be lossless; e.g., if
'00' and '000' occurred before they will now be just '0', and there
may be inconsistencies with treatment of ',,' and ',NA,' too (if they
occurred in this column before the bump). If this matters please rerun
and set 'colClasses' to 'character' for this column. Please note that column
type detection uses the first 5 rows, the middle 5 rows and the
last 5 rows, so hopefully this message should be very rare.
If reporting to datatable-help, please rerun and include
the output from verbose=TRUE.
If I use read.csv no error occurs and the entries are read in correctly!
Question 1: How can I remove the double quotes inside the character strings?
Question 2: Why does read.csv read the entries correctly while fread fails?
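A minimal sketch for Question 1 (not from the original thread): once the entries are in R, the literal double quotes can be stripped with gsub.
x <- c("This is just a \"shame\"...") # embedded quotes escaped in R source
gsub('"', "", x, fixed = TRUE) # returns "This is just a shame..."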

As @Arun kindly suggested, the data.table development version 1.9.5, currently on GitHub, may be of help here.
To install, please follow this procedure (Rtools required):
# To install development version
library(devtools)
install_github("Rdatatable/data.table", build_vignettes = FALSE)
I have tested it, so this is to confirm that the newest version of data.table solves the issue with double quotes without problems.
For further details and updates, check the data.table repository on GitHub.
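To verify the fix, a quick sketch reusing the (hypothetical) file location from the question:
sample.data <- data.frame(case1 = "This is just a \"shame\"...")
write.csv(sample.data, file = "C:/sample_data.csv") # write.csv doubles the embedded quotes
data.table::fread("C:/sample_data.csv") # parses cleanly on 1.9.5+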

Related

str_detect producing vector related errors in R code (which previously worked) since update 1.5.0

I'm trying to do some simple str_detects as follows:
index1 <- str_detect(colnames(DataFrame), paste0("^", name_))
(name_ here is just a character string, so paste0("^", name_) is of length 1.)
This yields the following error:
Error in stop_vctrs(): ! Input must be a vector, not an environment.
When I check rlang::last_error() I get:
Backtrace:
stringr::str_detect(colnames(DataFrame), paste0("^", name_))
vctrs:::stop_scalar_type(<fn>(<env>), "")
vctrs:::stop_vctrs(msg, "vctrs_error_scalar_type", actual = x)
I know that in this instance I could use the base R alternative:
grep(paste0("^", name_), colnames(DataFrame))
but the issue is that I have many long scripts which feature str_detect many times...
I'd like to understand the ways around this new error so that I can best fix all these instances in my code, thank you.
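One way to narrow such errors down (a sketch using the hypothetical object names from the question): confirm that both arguments really are plain character vectors before the call.
library(stringr)
str(colnames(DataFrame)) # should print: chr [1:n] ...
str(paste0("^", name_)) # should print: chr "^..."
index1 <- str_detect(as.character(colnames(DataFrame)), paste0("^", name_))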
I have read Hadley's post on the stringr 1.5.0 update and the stricter vector definitions implemented across the tidyverse, but my question still stands.
EDIT: uninstalling and reinstalling R, RStudio and Rtools fixed the issue.

R: read.csv introduced unreadable characters in one column name [duplicate]

I have a text file with Byte order mark (U+FEFF) at the beginning. I am trying to read the file in R. Is it possible to avoid the Byte order mark?
The function fread (from the data.table package) reads the file, but adds "ï»¿" at the beginning of the first variable name:
> names(frame_pers)[1]
[1] "ï»¿reg_date"
The same happens with the read.csv function.
Currently I have made a function which removes the BOM from the first column name, but I believe there should be a way to strip the BOM automatically.
remove.BOM <- function(x) setnames(x, 1, substring(names(x)[1], 4)) # drop the first 3 characters (the BOM rendered as "ï»¿") from the first column name
> names(frame_pers)[1]
[1] "ï»¿reg_date"
> remove.BOM(frame_pers)
> names(frame_pers)[1]
[1] "reg_date"
I am using the native encoding for the R session:
> options("encoding" = "")
> options("encoding")
$encoding
[1] ""
Have you tried read.csv(..., fileEncoding = "UTF-8-BOM")? The ?file help page says:
As from R 3.0.0 the encoding ‘"UTF-8-BOM"’ is accepted and will remove
a Byte Order Mark if present (which it often is for files and webpages
generated by Microsoft applications).
This was handled between versions 1.9.6 and 1.9.8 with this commit; update your data.table installation to fix this.
Once done, you can just use fread:
fread("file_name.csv")
I know it's been 8 years, but I just had this problem and came across this, so it might help. An important detail (mentioned by @hadley above) is that it needs to be fileEncoding = "UTF-8-BOM", not just encoding = "UTF-8-BOM". The encoding argument works for a few encodings, but not UTF-8-BOM. Go figure. Found this out here: https://www.johndcook.com/blog/2019/09/07/excel-r-bom/
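A short sketch of the distinction, assuming a BOM-prefixed file_name.csv as in the answers above (the exact mangling of the second name depends on locale and check.names):
d1 <- read.csv("file_name.csv", fileEncoding = "UTF-8-BOM") # BOM stripped
d2 <- read.csv("file_name.csv", encoding = "UTF-8") # BOM leaks into the first column name
names(d1)[1] # "reg_date"
names(d2)[1] # something like "ï..reg_date"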

Is it possible to use unicode column name in data.frame/data.table in certain locale?

I need to create a data.table, and some column names need to contain unicode symbols for consistency reasons (to match another source that uses these names).
In some locales there is no problem, but some users will run into trouble: it turned out my code does not work in certain locales.
Sys.setlocale("LC_CTYPE", "English_United States.1252")
> df <- data.frame("\u0394AICc" = 2)
Warning message:
unable to translate '<U+0394>AICc' to native encoding
> dt <- data.table(mtcars)
> dt[, "\u0394AICc" := "test"]
> dt
...
Warning message:
In do.call("cbind", lapply(x, function(col, ...) { :
unable to translate '<U+0394>AICc' to native encoding
# interestingly, the string prints normally in the console;
# it only causes problems as a data.frame column name.
> cat("\u0394AICc")
ΔAICc
I searched around, and most of the information I found is about file or string encoding, not data.frame column names.
There was an issue filed against data.table for this; it also depends on the R version, OS and locale. Given that data.frame cannot handle it either, I think it's not a data.table problem.
The problem goes away entirely under some other locales, so I'm hoping there is a way to fix this without removing all the unicode symbols.
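One locale-independent workaround (a sketch, not from the original thread): keep ASCII-safe names in the table itself and carry the unicode text as metadata, applying it only when printing or plotting.
library(data.table)
dt <- data.table(mtcars)
dt[, dAICc := "test"] # ASCII-safe column name
attr(dt$dAICc, "label") <- "\u0394AICc" # unicode kept as an attribute, not a name
cat(attr(dt$dAICc, "label")) # prints: ΔAICc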

Fixing a multiple warning "unknown column"

I get a persistent, repeated "unknown column" warning from all types of commands (from str(x) to installing package updates), and I am not sure how to debug or fix it.
The warning is clearly related to a variable in a tbl_df that I renamed, but it comes up in all kinds of commands seemingly unrelated to the tbl_df (e.g., installing updates to a package, or str(x) where x is simply a character vector).
This is an issue with the Diagnostics tool in RStudio (the tool that shows warnings and possible mistakes in your code). It was partially fixed by @kevin-ushey at this commit in RStudio v1.1.103 or later. That fix was only partial, because the warnings still appeared (albeit less frequently). The issue was reported with a reproducible example at https://github.com/rstudio/rstudio/issues/7372 and was fixed in RStudio v1.4 with this pull request.
Update to the latest RStudio release to fix this issue. Alternatively, there are several workarounds available; choose the solution you prefer:
1. Disable code diagnostics for all files in Preferences/Code/Diagnostics.
2. Disable all diagnostics for a specific file: add the following at the beginning of the opened file(s), then save them; the warnings should stop appearing.
# !diagnostics off
3. Disable the diagnostics only for the variables that cause the warning: add the following at the beginning of the opened file(s), then save them.
# !diagnostics suppress=<comma-separated list of variables>
The warnings appear because the diagnostics tool in RStudio parses the source code to detect errors, and these checks access columns in your tibble that are not initialized, producing the warning we see. The warnings are not triggered by the unrelated commands themselves; they appear whenever the RStudio diagnostics are executed (when a file is saved, then modified, when you run something, and so on).
I have been encountering the same problem, and although I don't know why it occurs, I have been able to pin down when it occurs, and thus prevent it from happening.
The issue seems to be with adding in a new column, derived from indexing, in a base R data frame vs. in a tibble data frame. Take this example, where you add a new column (age) to a base R data frame:
base_df <- data.frame(id = c(1:3), name = c("mary", "jill","steve"))
base_df$age[base_df$name == "mary"] <- 47
That works without returning a warning. But when the same is done with a tibble, it throws a warning (and this, I think, is what causes the weird, seemingly unprovoked, multiple-warning issue):
library(tibble)
tibble_df <- tibble(id = c(1:3), name = c("mary", "jill","steve"))
tibble_df$age[tibble_df$name == "mary"] <- 47
Warning message:
Unknown column 'age'
There are surely better ways of avoiding this, but I have found that first creating a vector of NAs does the job:
tibble_df$age <- NA
tibble_df$age[tibble_df$name == "mary"] <- 47
I have faced this issue when using the "dplyr" package.
For those facing this problem after using the group_by function from dplyr:
I have found that ungrouping the variables solves the unknown-column warning. Sometimes I have had to repeat the ungrouping several times until the problem is resolved.
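In code, that looks something like this (a sketch with hypothetical data; ungroup() drops the grouping before the result is used further):
library(dplyr)
df <- data.frame(id = c(1, 1:3), name = c("mary", "jo", "jill", "steve"))
dfTbl <- df %>%
  group_by(id) %>%
  summarize(n = n()) %>%
  ungroup()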
Converting the object back to a data.frame solved the problem for me:
library(dplyr)
df <- data.frame(id = c(1,1:3), name = c("mary", "jo", "jill","steve"))
dfTbl <- df %>%
group_by(id) %>%
summarize(n = n())
class(dfTbl) # [1] "tbl_df" "tbl" "data.frame"
dfTbl = as.data.frame(dfTbl)
class(dfTbl) # [1] "data.frame"
Borrowed part of the script from @adts.
I had this problem when using tibble and lapply together: the tibble seemed to store the lapply results as a list column inside the data frame.
I solved it by using unlist before adding the results of an lapply function to the tibble.
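For example (a sketch, assuming the lapply call returns one scalar per element):
library(tibble)
tb <- tibble(x = 1:3)
res <- lapply(tb$x, function(v) v^2) # a list
tb$y <- unlist(res) # flatten first, then assign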
I ran into this problem too, except through a tibble created in a dplyr block. Here's a slight modification of @sabre's code to show how I arrived at the same error.
library(dplyr)
df <- data.frame(id = c(1,1:3), name = c("mary", "jo", "jill","steve"))
t <- df %>%
group_by(id) %>%
summarize(n = n())
t
str(t)
t$newvar[t$id==1] <- 0
I know this is an old thread, but I just encountered the same problem when loading a spatial vector in GeoPackage format with the sf package. Using as_tibble = FALSE worked for me. The file was loaded as an sp object, but everything still worked fine. As mentioned by @sabre, forcing an object into a tibble seems to cause problems when indexing a column that is no longer there.
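In code, that looks something like this (a sketch; the file name is hypothetical, and as_tibble is an argument of sf::st_read):
library(sf)
v <- st_read("layer.gpkg", as_tibble = FALSE) # keep a plain data.frame instead of a tibble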
Let's say I wanted to select the following column(s):
best.columns = 'id'
For me the following gave the warning:
df %>% select_(one_of(best.columns))
while this worked as expected, although, as far as I understand dplyr, the two should be identical:
df %>% select_(.dots = best.columns)
I get these warnings when I rename a column using dplyr::rename after reading the data with the readr package.
The old name of the column is not renamed in the spec attribute, so removing the spec attribute makes the warnings go away. Removing the "spec_tbl_df" class also seems like a good idea.
attr(dat, "spec") <- NULL
class(dat) <- setdiff(class(dat), "spec_tbl_df")
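In context, the whole sequence looks something like this (a sketch with a hypothetical file and column names):
library(readr)
library(dplyr)
dat <- read_csv("file.csv") # returns a spec_tbl_df with a spec attribute
dat <- rename(dat, new_name = old_name) # the old name lingers in the spec attribute
attr(dat, "spec") <- NULL # drop the stale spec
class(dat) <- setdiff(class(dat), "spec_tbl_df")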
Building on the answer by @stok (https://stackoverflow.com/a/47848259/7733418), who found this problem when using group_by (which also converts your data.frame to a tibble) and solved it in the same way.
For me the problem was ultimately due to the use of slice().
slice() converted my data.frame to a tibble, causing this error.
Checking the class of your data.frame, and re-converting it to a data.frame whenever a function turns it into a tibble, could solve this issue.
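A defensive pattern along those lines (a sketch):
library(dplyr)
d <- mtcars %>% slice(1:10)
if (inherits(d, "tbl_df")) d <- as.data.frame(d) # re-convert before indexing new columns
d$flag[d$cyl == 6] <- TRUE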

Kindly check the R command

I am doing the following with the Cooccur library in R.
> fb<-read.table("Fb6_peaks.bed")
> f1<-read.table("F16_peaks.bed")
Everything is OK with the first two commands, and I can also display the data:
> fb
> f1
But when I run the next command, as given below:
> explore_pairs(c("fb", "f1"))
I get an error message:
Error in sum(sapply(tf1_s, score_sample, tf2_hits = tf2_s, hit_list = hit_l)) :
invalid 'type' (list) of argument
Could anyone suggest something?
Despite promising, in the article they published over a year ago, to release a version to the Bioconductor repository, the authors have still not delivered. The gz file attached to the article is not of a form that my installation recognizes. You really should be corresponding with the authors about this question.
The nature of the error message suggests that the function is expecting a different data class. You should be looking at the specification for the arguments in the help(explore_pairs) file. If it is expecting 2 matrices, then wrapping data.matrix around the arguments may solve the problem, but if it is expecting a class created by one of that packages functions then you need to take the necessary step to construct the right objects.
The help file for explore_pairs does exist (at least in the MAN directory) and says the first argument should be a character vector, with further provisos:
\arguments{
\item{factornames}{an vector of character strings, each naming a GFF-like
data frame containing the binding profile of a DNA-binding factor.
There is also a load utility, load_GFF, which I assume is designed for creation of such files.
Try renaming your data frame:
names(fb)=c("seq","start","end")
Check the example datasets. The column names are as above. I set the names and it worked.
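Putting the whole fix together (a sketch, assuming explore_pairs looks the data frames up by name, as in the question):
fb <- read.table("Fb6_peaks.bed")
f1 <- read.table("F16_peaks.bed")
names(fb) <- c("seq", "start", "end")
names(f1) <- c("seq", "start", "end")
explore_pairs(c("fb", "f1"))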
