Why can't I import the following dataset from UCI? - R

Good afternoon,
Assume we have the following function:
data_preprocessing<-function(link,drop_last_column=TRUE){
link=as.character(link)
DT <- data.table::fread(link,
fill = TRUE,
na.strings = "?")
DT=DT[-1,]
DT=as.data.frame(DT)
if(drop_last_column==TRUE){
DT=as.data.frame(DT)[,-ncol(DT)]
}
return(DT)
}
When I try to import the acute dataset from UCI, I get the following error:
acute=data_preprocessing("https://archive.ics.uci.edu/ml/machine-learning-databases/acute/diagnosis.data")
[100%] Downloaded 7276 bytes...
Error in data.table::fread(link, fill = TRUE, na.strings = "?") :
File is encoded in UTF-16, this encoding is not supported by fread(). Please recode the file to UTF-8.
I also tried:
acute=read.csv("http://archive.ics.uci.edu/ml/machine-learning-databases/acute/diagnosis.data")
Warning messages:
1: In read.table(file = file, header = header, sep = sep, quote = quote, :
line 1 appears to contain embedded nulls
2: In read.table(file = file, header = header, sep = sep, quote = quote, :
line 2 appears to contain embedded nulls
3: In read.table(file = file, header = header, sep = sep, quote = quote, :
line 3 appears to contain embedded nulls
4: In read.table(file = file, header = header, sep = sep, quote = quote, :
line 4 appears to contain embedded nulls
5: In read.table(file = file, header = header, sep = sep, quote = quote, :
line 5 appears to contain embedded nulls
6: In scan(file = file, what = what, sep = sep, quote = quote, dec = dec, :
embedded nul(s) found in input
Thank you for your help!

The file is encoded in UTF-16, which fread() does not support. Use read.table with the matching fileEncoding instead.
data_preprocessing<-function(link,drop_last_column=TRUE){
link=as.character(link)
DT <- read.table(link,
fileEncoding="UTF-16",
fill = TRUE,
na.strings = "?")
DT=DT[-1,]
DT=as.data.frame(DT)
if(drop_last_column==TRUE){
DT=as.data.frame(DT)[,-ncol(DT)]
}
return(DT)
}
acute=data_preprocessing("https://archive.ics.uci.edu/ml/machine-learning-databases/acute/diagnosis.data")
head(acute)
V1 V2 V3 V4 V5 V6 V7
2 35,9 no no yes yes yes yes
3 35,9 no yes no no no no
4 36,0 no no yes yes yes yes
5 36,0 no yes no no no no
6 36,0 no yes no no no no
7 36,2 no no yes yes yes yes
Edit:
To detect the encoding of the data file automatically, you can use the guess_encoding function in the readr package.
data_preprocessing<-function(link,drop_last_column=TRUE){
link=as.character(link)
enc_guess <- readr::guess_encoding(link)
# which.max returns a single index, so ties in confidence cannot yield several encodings
enc <- enc_guess$encoding[which.max(enc_guess$confidence)]
DT <- read.table(link,
fileEncoding = enc,
fill = TRUE,
na.strings = "?")
DT=DT[-1,]
DT=as.data.frame(DT)
if(drop_last_column==TRUE){
DT=as.data.frame(DT)[,-ncol(DT)]
}
return(DT)
}
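As a cross-check on a guessed encoding, the byte-order mark at the start of the file can also be inspected directly. The following is only a minimal sketch: bom_encoding is a hypothetical helper (not part of readr or base R), and the small tab-separated file with decimal commas is made up for illustration rather than downloaded from UCI.

```r
# Hypothetical helper: map a byte-order mark (BOM) to a fileEncoding value.
# Returns NA when no known BOM is present.
bom_encoding <- function(path) {
  b <- readBin(path, what = "raw", n = 3)
  if (length(b) >= 3 && all(b[1:3] == as.raw(c(0xef, 0xbb, 0xbf)))) return("UTF-8-BOM")
  if (length(b) >= 2 && all(b[1:2] == as.raw(c(0xff, 0xfe)))) return("UTF-16LE")
  if (length(b) >= 2 && all(b[1:2] == as.raw(c(0xfe, 0xff)))) return("UTF-16BE")
  NA_character_
}

# Build a toy UTF-16LE file with a BOM (assumed layout: tab-separated, decimal commas).
tmp <- tempfile(fileext = ".data")
body <- iconv("35,9\tno\tyes\n36,0\tyes\t?\n", from = "UTF-8", to = "UTF-16LE", toRaw = TRUE)[[1]]
writeBin(c(as.raw(c(0xff, 0xfe)), body), tmp)

enc <- bom_encoding(tmp)
toy <- read.table(tmp, fileEncoding = enc, sep = "\t", dec = ",", na.strings = "?")
print(enc)
print(toy)
```

Passing dec = "," makes read.table parse values such as 35,9 as numbers instead of leaving them as text.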

Related

Trying to create a data frame in R from a directory that contains different types of files, i.e. png, tif, rds

As the question states, I am trying to make a data frame in R from a directory that has different types of files. I have tried this code:
setwd("/working/directory/here")
file_list <- list.files()
# Creating the dataset for all the files in file_list.
for (file in file_list) {
# if the merged dataset does not exist, create it.
if (!exists("dataset")){
dataset <- read.table(file, header = TRUE, sep = "\t")
}
# if the merged dataset does exist, append to it.
if (exists("dataset")){
temp_dataset <- read.table(file, header = TRUE, sep = "\t")
dataset <- rbind(dataset, temp_dataset)
rm(temp_dataset)
}
}
But I end up receiving several different errors and am not sure how to proceed:
Error in match.names(clabs, names(xi)) :
names do not match previous names
In addition: Warning messages:
1: In read.table(file, header = TRUE, sep = "\t") :
line 1 appears to contain embedded nulls
2: In scan(file = file, what = what, sep = sep, quote = quote, dec = dec, :
embedded nul(s) found in input
3: In read.table(file, header = TRUE, sep = "\t") :
line 1 appears to contain embedded nulls
4: In scan(file = file, what = what, sep = sep, quote = quote, dec = dec, :
embedded nul(s) found in input
5: In read.table(file, header = TRUE, sep = "\t") :
line 2 appears to contain embedded nulls
6: In read.table(file, header = TRUE, sep = "\t") :
line 3 appears to contain embedded nulls
7: In read.table(file, header = TRUE, sep = "\t") :
line 4 appears to contain embedded nulls
8: In read.table(file, header = TRUE, sep = "\t") :
line 5 appears to contain embedded nulls
9: In scan(file = file, what = what, sep = sep, quote = quote, dec = dec, :
embedded nul(s) found in input
Code when using rbindlist:
setwd("/srv/shiny-server/magneto/Storage/1880")
file_list_1880 <- list.files()
all_data <- rbindlist(lapply(file_list_1880, fread), fill = TRUE)
all_data
Error:
Error in FUN(X[[i]], ...) :
embedded nul in string: '\xf5\xfd\x9e\x9a\xc0:\xea~\xa1\u07fcV\xfd\xbd\xe4s\xf9\x99\U02e6aead\xdfC\xb6y\x97\xfa\xbd\xa6$g\xa9\xef۩\xf7\xaf>g\xdf\023\xe0\f\xfa:\0p\x97\xfaߛw\xed+\xf5\xf3?\xfb^\xf5sJ99\001\xe0\021\xe6\r\0\0\x85\xfaw\023\xfb-\xafP\xdf\xe7\xa9'
In addition: Warning messages:
1: In FUN(X[[i]], ...) :
Previous fread() session was not cleaned up properly. Cleaned up ok at the beginning of this fread() call.
2: In FUN(X[[i]], ...) :
Detected 3 column names but the data has 2 columns. Filling rows automatically. Set fill=TRUE explicitly to avoid this warning.
3: In FUN(X[[i]], ...) :
Stopped early on line 26. Expected 3 fields but found 4. Consider fill=TRUE and comment.char=. First discarded non-empty line: < >>
4: In FUN(X[[i]], ...) :
Detected 3 column names but the data has 5 columns (i.e. invalid file). Added 2 extra default column names at the end.
With your list of files, lapply over it and then use dplyr::bind_rows to form one large data frame.
Here is a small example:
data_list = lapply(file_list, function(x) {
# read the file passed to the function (x), not the loop variable `file`
read.table(x, header = TRUE, sep = "\t")
})
all_data = dplyr::bind_rows(data_list)
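Because the directory mixes text tables with binary files such as png, tif and rds, it may help to restrict the file list to the text files before reading. A minimal sketch, assuming the tables are tab-separated .txt files; the temporary directory, the file names and the fake png bytes are invented for illustration, and base rbind stands in for dplyr::bind_rows so the sketch has no dependencies.

```r
# Set up a throwaway directory with two text tables and one binary file.
dir <- tempfile("mixed_")
dir.create(dir)
write.table(data.frame(id = 1:2, value = c(0.5, 1.5)),
            file.path(dir, "a.txt"), sep = "\t", row.names = FALSE, quote = FALSE)
write.table(data.frame(id = 3:4, value = c(2.5, 3.5)),
            file.path(dir, "b.txt"), sep = "\t", row.names = FALSE, quote = FALSE)
writeBin(as.raw(c(0x89, 0x50, 0x4e, 0x47, 0x00, 0x00)), file.path(dir, "plot.png"))

# Keep only files whose names end in .txt; the png is never passed to read.table.
text_files <- list.files(dir, pattern = "\\.txt$", full.names = TRUE)

data_list <- lapply(text_files, read.table, header = TRUE, sep = "\t")
all_data <- do.call(rbind, data_list)
print(all_data)
```

Using full.names = TRUE avoids the need for setwd, since each element of text_files is already a complete path.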

Error when I am trying to import an xlsx file into R

d=read.csv(file.choose())
Warning messages:
1: In read.table(file = file, header = header, sep = sep, quote = quote, :
line 1 appears to contain embedded nulls
2: In read.table(file = file, header = header, sep = sep, quote = quote, :
incomplete final line found by readTableHeader on 'C:\Users\xforce47\Desktop\airbnb .xlsx'
That's because you are trying to read an Excel document with a function meant for CSV files. Try
library(rio)
d <- import(file.choose(), setclass = "tbl")
instead. The setclass argument is optional and only useful if you work with the tidyverse.
Just save the file as .csv and read it.
Make sure the working directory is set correctly.
x <- read.csv('myfile1.csv')
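An .xlsx file is a zip archive, so its first two bytes are "PK". Checking for that signature before calling read.csv is one way to catch this mistake early. This is just a sketch: looks_like_zip is a hypothetical helper, and the two temporary files are stand-ins created for the demonstration, not real spreadsheets.

```r
# Hypothetical helper: TRUE when the file starts with the zip signature "PK".
looks_like_zip <- function(path) {
  magic <- readBin(path, what = "raw", n = 2)
  identical(magic, as.raw(c(0x50, 0x4b)))
}

# Stand-in for an xlsx file: only the zip signature bytes, not a valid workbook.
fake_xlsx <- tempfile(fileext = ".xlsx")
writeBin(as.raw(c(0x50, 0x4b, 0x03, 0x04)), fake_xlsx)

# A genuine CSV file for comparison.
real_csv <- tempfile(fileext = ".csv")
write.csv(data.frame(a = 1:2, b = c("x", "y")), real_csv, row.names = FALSE)

for (f in c(fake_xlsx, real_csv)) {
  if (looks_like_zip(f)) {
    message(basename(f), ": zip container, use an Excel reader instead of read.csv")
  } else {
    print(read.csv(f))
  }
}
```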

How can I read from a text file in R?

How can I read from a text file? I have the following data in a text file-
A,B,C,D
E,F,G,H
I am trying to choose the file interactively.
read.delim(file.choose(), sep=",")
Warning messages:
1: In read.table(file = file, header = header, sep = sep, quote = quote, :
line 1 appears to contain embedded nulls
2: In read.table(file = file, header = header, sep = sep, quote = quote, :
line 2 appears to contain embedded nulls
3: In read.table(file = file, header = header, sep = sep, quote = quote, :
line 3 appears to contain embedded nulls
4: In read.table(file = file, header = header, sep = sep, quote = quote, :
line 4 appears to contain embedded nulls
5: In read.table(file = file, header = header, sep = sep, quote = quote, :
line 5 appears to contain embedded nulls
6: In scan(file = file, what = what, sep = sep, quote = quote, dec = dec, :
EOF within quoted string
7: In scan(file = file, what = what, sep = sep, quote = quote, dec = dec, :
embedded nul(s) found in input
I wish to simply read the data and store it in a variable.
Just use read.csv:
your_df <- read.csv(file="/path/to/your_file.csv", header=FALSE)
your_df
V1 V2 V3 V4
1 A B C D
2 E F G H
Setting header=FALSE tells R that your input CSV file does not have a leading header row with column names.
Install and attach data.table, then use fread:
fread(file.choose(), sep = ",")
Your error could be due to an encoding issue. If so, specify the right encoding (fread accepts only "unknown", "UTF-8" or "Latin-1" for this argument):
fread(file.choose(), sep = ",", encoding = "INSERT YOUR ENCODING")
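To tell whether the "embedded nulls" warnings come from the encoding, one option is to count the zero bytes in the file: in UTF-16 text made of ASCII characters, roughly every second byte is zero. The sketch below builds such a file itself, so the file content and the 0.4 threshold are assumptions for illustration only.

```r
# Create a UTF-16LE copy of the two-line example, with a byte-order mark.
tmp <- tempfile(fileext = ".txt")
body <- iconv("A,B,C,D\nE,F,G,H\n", from = "UTF-8", to = "UTF-16LE", toRaw = TRUE)[[1]]
writeBin(c(as.raw(c(0xff, 0xfe)), body), tmp)

# Share of zero bytes in the file; close to 0.5 hints at UTF-16.
bytes <- readBin(tmp, what = "raw", n = file.size(tmp))
nul_share <- mean(bytes == as.raw(0))
print(nul_share)

# Assumed rule of thumb: treat a high share of zero bytes as UTF-16LE.
if (nul_share > 0.4) {
  letters_df <- read.csv(tmp, header = FALSE, fileEncoding = "UTF-16LE")
} else {
  letters_df <- read.csv(tmp, header = FALSE)
}
print(letters_df)
```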

Issues Reading txt files [duplicate]

This question already has answers here:
Get "embedded nul(s) found in input" when reading a csv using read.csv()
(7 answers)
Closed 3 years ago.
I want to read a .txt file in R, but I keep getting an embedded null error.
I have tried this code:
text_df = read.delim2(testfile, header = TRUE, sep = ',')
The original file ("testfile") looks like this:
UPC,HSY Item Description,Hsy Seasonal Segmentation,Store Nbr,Store Name,Building City,Building State/Prov,Building Postal Code,Store Type,WM Date,SeasonAndYear,OH_Qty,POS_Qty,POS_Sales
"0001070006638","Whprs Rbn Egg 13.75OZ","EAS $2.98 Candy Dish",1,"ROGERS, AR","ROGERS","AR","72756","Supercenter",1/27/2018 12:00:00 AM,"EAS2018",0,0,0.0000
"0001070006638","Whprs Rbn Egg 13.75OZ","EAS $2.98 Candy Dish",1,"ROGERS, AR","ROGERS","AR","72756","Supercenter",1/30/2018 12:00:00 AM,"EAS2018",0,0,0.0000
"0001070006638","Whprs Rbn Egg 13.75OZ","EAS $2.98 Candy Dish",1,"ROGERS, AR","ROGERS","AR","72756","Supercenter",2/2/2018 12:00:00 AM,"EAS2018",0,0,0.0000
I keep getting this error:
Warning messages:
1: In read.table(file = file, header = header, sep = sep, quote = quote, :
line 1 appears to contain embedded nulls
2: In read.table(file = file, header = header, sep = sep, quote = quote, :
line 2 appears to contain embedded nulls
3: In read.table(file = file, header = header, sep = sep, quote = quote, :
line 3 appears to contain embedded nulls
4: In read.table(file = file, header = header, sep = sep, quote = quote, :
line 4 appears to contain embedded nulls
5: In read.table(file = file, header = header, sep = sep, quote = quote, :
line 5 appears to contain embedded nulls
6: In scan(file = file, what = what, sep = sep, quote = quote, dec = dec, :
embedded nul(s) found in input
Try this:
df = read.table(yourFile, quote = '"', sep = ",", header = TRUE)
This should treat the comma inside "ROGERS, AR" as part of the string and not as a separator.
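The effect of the quote argument can be seen on a cut-down version of the data. This sketch keeps only three of the original columns and reads from an in-memory character vector rather than from the file, so it illustrates the quoting behaviour only and says nothing about the embedded nulls.

```r
# Three columns taken from the sample above; "ROGERS, AR" contains a comma inside quotes.
csv_lines <- c(
  "Store Nbr,Store Name,Building City",
  '1,"ROGERS, AR","ROGERS"'
)

stores <- read.table(text = csv_lines, quote = '"', sep = ",", header = TRUE)

# check.names = TRUE (the default) turns spaces in the header into dots.
print(names(stores))
print(stores$Store.Name)
```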

Unable to use biocLite in R on CentOS 7 - error in read.table

I am using R on CentOS 7. When I try to install Bioconductor packages, I get the following error.
> source("http://bioconductor.org/biocLite.R")
Bioconductor version 3.0 (BiocInstaller 1.16.1), ?biocLite for help
> biocLite("affy")
Error in read.table(file = file, header = header, sep = sep, quote = quote, :
no lines available in input
This error seems to be bigger than just biocLite, because other functions that use read.table (like rma in the affy package) throw the same error. I have no idea how to solve it. Any help is very much appreciated. Thanks.
eset=rma(data,normalize=FALSE)
Error in read.table(file = file, header = header, sep = sep, quote = quote, :
no lines available in input
@Richie Cotton
I am not sure what you meant by option(error = recover), but I tried the following:
> source("http://bioconductor.org/biocLite.R")
Bioconductor version 3.0 (BiocInstaller 1.16.1), ?biocLite for help
> biocLite("affy")
Error in read.table(file = file, header = header, sep = sep, quote = quote, :
no lines available in input
> traceback()
9: stop("no lines available in input")
8: read.table(file = file, header = header, sep = sep, quote = quote,
dec = dec, fill = fill, comment.char = comment.char, ...)
7: utils::read.delim(file, header = TRUE, comment.char = "#", colClasses = c(rep.int("character",
3L), rep.int("logical", 4L)))
6: tools:::.read_repositories(p)
5: setRepositories(ind = 1:20)
4: .biocinstallRepos(biocVersion = biocVersion)
3: .getContribUrl(biocVersion())
2: bioconductorPackageIsCurrent()
1: biocLite("affy")
> options(error=recover)
> biocLite("affy")
Error in read.table(file = file, header = header, sep = sep, quote = quote, :
no lines available in input
Enter a frame number, or 0 to exit
1: biocLite("affy")
2: bioconductorPackageIsCurrent()
3: .getContribUrl(biocVersion())
4: .biocinstallRepos(biocVersion = biocVersion)
5: setRepositories(ind = 1:20)
6: tools:::.read_repositories(p)
7: utils::read.delim(file, header = TRUE, comment.char = "#", colClasses = c(re
8: read.table(file = file, header = header, sep = sep, quote = quote, dec = dec
Selection: 8
Called from: top level
Browse[1]>
eval(expr,envir,enclos)
eval(substitute(browser(skipCalls=skip),list(skip=7...
The error message typically results when a file exists but is empty:
> system("touch /tmp/123")
> read.table("/tmp/123")
Error in read.table("/tmp/123") : no lines available in input
The traceback says setRepositories() fails. Looking at the source:
> head(setRepositories, 9)
1 function (graphics = getOption("menu.graphics"), ind = NULL,
2 addURLs = character())
3 {
4 if (is.null(ind) && !interactive())
5 stop("cannot set repositories non-interactively")
6 p <- file.path(Sys.getenv("HOME"), ".R", "repositories")
7 if (!file.exists(p))
8 p <- file.path(R.home("etc"), "repositories")
9 a <- tools:::.read_repositories(p)
The file that exists but is empty is either a user customization
file.path(Sys.getenv("HOME"), ".R", "repositories")
(in which case the solution is to remove the file pointed to above) or a system file
file.path(R.home("etc"), "repositories")
For the latter case, in a 'factory fresh' installation from source, I get
> length(readLines(p))
[1] 20
but the poster probably gets 0. This points to a somehow corrupted installation on CentOS, so more information about the version and installation of R is needed. I believe there has been a post on R-devel, R-help, or in the R bug tracker about this recently, but I have not been able to find it.
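The check described above can be put together as a short script. It is a sketch that follows the lookup order in the setRepositories() source quoted earlier, which may differ in other R versions; the variable names are made up for illustration.

```r
# Same lookup order as in the quoted setRepositories() source:
# a user file in ~/.R takes precedence over the system file in R.home("etc").
user_file <- file.path(Sys.getenv("HOME"), ".R", "repositories")
system_file <- file.path(R.home("etc"), "repositories")
p <- if (file.exists(user_file)) user_file else system_file

# A count of 0 means the file exists but is empty, which triggers the error.
n_lines <- if (file.exists(p)) length(readLines(p, warn = FALSE)) else NA_integer_
cat("repositories file:", p, "\n")
cat("lines:", n_lines, "\n")

# Reproduce the error message with an empty temporary file.
empty <- tempfile()
invisible(file.create(empty))
msg <- tryCatch(read.table(empty), error = function(e) conditionMessage(e))
cat("read.table on an empty file:", msg, "\n")
```

If the reported file is the user file and it has 0 lines, removing it should make R fall back to the system file.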

Resources