I have downloaded the package SemiPar and have been trying to attach the dataset fuel.frame using the command data(fuel.frame), but without success. The error I have been getting is:
Error in read.table(zfile, header = TRUE, as.is = FALSE) :
more columns than column names
In addition: Warning messages:
1: In read.table(zfile, header = TRUE, as.is = FALSE) :
line 1 appears to contain embedded nulls
2: In read.table(zfile, header = TRUE, as.is = FALSE) :
line 5 appears to contain embedded nulls
3: In read.table(zfile, header = TRUE, as.is = FALSE) :
incomplete final line found by readTableHeader on 'C:/...
Could you please tell me what is wrong here? I have tried to look for solutions online, but it seems the package works for everyone but me.
My sessionInfo()
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] SemiPar_1.0-4.1
loaded via a namespace (and not attached):
[1] cluster_1.15.3 grid_3.1.1 lattice_0.20-29 MASS_7.3-33 nlme_3.1-117
[6] tools_3.1.1
Thank you.
The "fuel.frame" file is actually in the ../SemiPar/data/ directory of whichever library your packages are installed in. You can find that location with the .libPaths() function. For me it returns:
> .libPaths()
[1] "/Library/Frameworks/R.framework/Versions/3.1/Resources/library"
If you look in there you should see "fuel.frame.txt.gz", which tells you that it's a gzipped file that expands to a text file (which is what the data() call does before passing it to read.table()). The top of it looks like:
car.name Weight Disp. Mileage Fuel Type
"Eagle Summit 4" 2560 97 33 3.030303 Small
"Ford Escort 4" 2345 114 33 3.030303 Small
"Ford Festiva 4" 1845 81 37 2.702703 Small
"Honda Civic 4" 2260 91 32 3.125000 Small
"Mazda Protege 4" 2440 113 32 3.125000 Small
"Mercury Tracer 4" 2285 97 26 3.846154 Small
"Nissan Sentra 4" 2275 97 33 3.030303 Small
"Pontiac LeMans 4" 2350 98 28 3.571429 Small
As you can see, your error message does not match my copy. So you may want to use your (unnamed) operating system's tools to expand the .gz file and investigate. (I was not getting an error with my R 3.1.1 (Snow Leopard build) running on OS X 10.7.5.) With my setup this also succeeds:
data('fuel.frame',
lib.loc='/Library/Frameworks/R.framework/Versions/3.1/Resources/library/')
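If data() keeps failing for you, it can also help to inspect the raw file directly. A minimal sketch (this uses system.file() as an alternative way to locate the installed package; the file name is taken from the listing above):

```r
# Locate the gzipped data file inside the installed package and peek at
# its first lines; embedded nulls or truncation should be visible here
path <- system.file("data", "fuel.frame.txt.gz", package = "SemiPar")
if (nzchar(path)) {
  head(readLines(gzfile(path)))
}
```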
I have a large (8GB+) csv file (comma-separated) that I want to read into R. The file contains three columns
date #in 2017-12-27 format
text #a string
type #a label per string (either NA, typeA, or typeB)
The problem I encounter is that the text column mixes quoting styles: ' (single quotation marks), " (double quotation marks), no quotation marks at all, as well as multiple separate strings.
E.g.
date text type
2016-01-01 great job! NA
2016-01-02 please, type "submit" typeA
2016-01-02 "can't see the "error" now" typeA
2016-01-03 "add \\"/filename.txt\\"" NA
To read this large file, I tried:
Base read.csv and readr's read_csv function: work fine for a portion of the data, but fail (probably due to memory) or take ages to read
Chunking the data via the Mac terminal into batches of 1m lines: fails because lines seem to break arbitrarily
Using fread (preferred, as I hope it will solve the other two issues): fails with Error: Expecting 3 cols, but line 1103 contains text after processing all cols.
My idea is to work around these issues by using specifics of the data that I know, i.e. that each line starts with a date and ends with either NA, typeA, or typeB.
How could I implement this (either using pure readLines or into fread)?
Edit:
Sample data (anonymized) as opened with Mac TextWrangler:
"date","text","type"
"2016-03-30","Maybe use `tapply` from `base`, and check how that works.",NA
"2016-04-01","Fiex this now. Please check.","typeA"
"2016-04-01","Does it work? Maybe try the other approach.","typeB"
"2016-04-01","This won't work. You should remove ABC ... each line starts with a date and ends with ... and this line is veeeeeeeeeeeeeeeeeery long.",NA
"2014-05-02","Tried to remove ""../"" but no success #myid",typeA
Sample data 2:
"date","text","type"
"2018-05-02","i try this, but it doesnt work",NA
"2018-05-02","Thank you very much. Cheers !!",NA
"2018-05-02","#myid. I'll change this.",NA
Sample data for reproducible fread error "Expecting 3 cols, but line 3 contains text after processing all cols.":
"date","text","type"
"2015-03-02","Some text, some text, some question? Please, some question?",NA
"2015-03-02","Here you have the error ""Can’t access {file \""Macintosh HD:abc:def:filename\"", \""/abc.txt\""} from directory."" something -1100 from {file ""Macintosh HD:abc:def:filename"", ""/abc.txt""} to file",NA
"2015-03-02","good idea",NA
"2015-03-02","Worked perfectly :)",NA
SessionInfo:
R version 3.5.0 (2018-04-23)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS High Sierra 10.13.5
Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] data.table_1.10.4-3 readr_1.1.1
loaded via a namespace (and not attached):
[1] compiler_3.5.0 assertthat_0.2.0 R6_2.2.2 cli_1.0.0
[5] hms_0.4.2 tools_3.5.0 pillar_1.2.2 rstudioapi_0.7
[9] tibble_1.4.2 yaml_2.1.19 crayon_1.3.4 Rcpp_0.12.16
[13] utf8_1.1.3 pkgconfig_2.0.1 rlang_0.2.0
A readLines approach could be:
infile <- file("test.txt", "r")
txt <- readLines(infile, n = 1)
df <- NULL
#change this value as per your requirement
chunksize <- 1
while(length(txt)){
txt <- readLines(infile, warn=F, n = chunksize)
df <- rbind(df, data.frame(date = gsub("\\s.*", "", txt),
text = trimws(gsub("\\S+(.*)\\s+\\S+$", "\\1", txt)),
type = gsub(".*\\s", "", txt),
stringsAsFactors = F))
}
which gives
> df
date text type
1 2016-01-01 great job! NA
2 2016-01-02 please, type "submit" typeA
3 2016-01-02 "can't see the "error" now" typeA
4 2016-01-03 "add \\\\"/filename.txt\\\\"" NA
Sample data: test.txt contains
date text type
2016-01-01 great job! NA
2016-01-02 please, type "submit" typeA
2016-01-02 "can't see the "error" now" typeA
2016-01-03 "add \\"/filename.txt\\"" NA
Update:
You can modify the above code with the regex parser below to handle the other set of sample data:
df <- rbind(df, data.frame(date = gsub("\"(\\S{10}).*", "\\1", txt),
text = gsub(".*\"\\,\"(.*)\"\\,(\"|NA).*", "\\1", txt),
type = gsub(".*\\,|\"", "", txt),
stringsAsFactors = F))
Another set of sample data:
"date","text","type"
"2016-03-30","Maybe use `tapply` from `base`, and check how that works.",NA
"2016-04-01","Fiex this now. Please check.","typeA"
"2016-04-01","Does it work? Maybe try the other approach.","typeB"
"2016-04-01","This won't work. You should remove ABC ... each line starts with a date and ends with ... and this line is veeeeeeeeeeeeeeeeeery long.",NA
"2014-05-02","Tried to remove ""../"" but no success #myid","typeA"
I tried to import a text file into R (3.4.0) which actually contains 4 columns, but the 4th column is mostly empty until the 200,000+th row. I use fread() from the data.table package (ver 1.10.4):
fread("test.txt",fill = TRUE, sep = "\t", quote = "", header = FALSE)
I got this error message:
Error in fread("test.txt", fill = TRUE, sep = "\t", quote = "", header = FALSE) :
Expecting 3 cols, but line 258088 contains text after processing all cols. Try again with fill=TRUE. Another reason could be that fread's logic in distinguishing one or more fields having embedded sep=' ' and/or (unescaped) '\n' characters within unbalanced unescaped quotes has failed. If quote='' doesn't help, please file an issue to figure out if the logic could be improved.
I checked the file and there's additional text in 258088th row in the 4th column ("8-4").
Nevertheless, fill = TRUE did not solve this as I expected. I thought fread() might be determining the number of columns inappropriately because the additional column occurs very late in the file. So I tried this:
fread("test.txt", fill = TRUE, header = FALSE, sep = "\t", skip = 250000)
The error persisted. On the other hand, this gives no error:
fread("test.txt", fill = TRUE, header = FALSE, sep = "\t", skip = 258080)
I thought I had found the reason, but something weird happened when I tested with a dummy file generated by:
write.table(matrix(c(1:990000), nrow = 330000), "test2.txt", sep = "\t", row.names = FALSE)
with "8-4" added in the 4th column of the 250000th row by Excel. When read by fread():
fread("test2.txt", fill = TRUE, header = FALSE, sep = "\t")
It worked fine with no error message, which suggests that a late additional column does not necessarily trigger the error.
I also tried changing the encoding ("Latin-1" and "UTF-8") and the quote argument, but neither helped.
Now I feel clueless; hopefully I have done my homework well enough to make this reproducible. Thank you for helping.
For additional environment info, my sessionInfo() is:
R version 3.4.0 (2017-04-21)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS Sierra 10.12.5
Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRlapack.dylib
locale:
[1] zh_TW.UTF-8/zh_TW.UTF-8/zh_TW.UTF-8/C/zh_TW.UTF-8/zh_TW.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] dplyr_0.5.0 purrr_0.2.2.2 readr_1.1.1 tidyr_0.6.3
[5] tibble_1.3.3 ggplot2_2.2.1 tidyverse_1.1.1 stringr_1.2.0
[9] microbenchmark_1.4-2.1 data.table_1.10.4
loaded via a namespace (and not attached):
[1] Rcpp_0.12.11 cellranger_1.1.0 compiler_3.4.0 plyr_1.8.4 forcats_0.2.0
[6] tools_3.4.0 jsonlite_1.5 lubridate_1.6.0 nlme_3.1-131 gtable_0.2.0
[11] lattice_0.20-35 rlang_0.1.1 psych_1.7.5 DBI_0.6-1 parallel_3.4.0
[16] haven_1.0.0 xml2_1.1.1 httr_1.2.1 hms_0.3 grid_3.4.0
[21] R6_2.2.1 readxl_1.0.0 foreign_0.8-68 reshape2_1.4.2 modelr_0.1.0
[26] magrittr_1.5 scales_0.4.1 rvest_0.3.2 assertthat_0.2.0 mnormt_1.5-5
[31] colorspace_1.3-2 stringi_1.1.5 lazyeval_0.2.0 munsell_0.4.3 broom_0.4.2
Actually there is a difference between the two files that you provide, and I think this is the cause of fread's different outputs.
The first file has an end of line after the 3rd column, except on line 258088, where there is a tab, a 4th column, and then the end of line. (You can use the 'show all characters' option to confirm that.)
The second file, on the other hand, has an extra tab in every row, i.e. a new empty column.
So in the first case fread expects 3 columns and then encounters a 4th; in the second file, fread expects 4 columns from the start.
I checked read.table with fill = TRUE and it worked with both files, so I think the fill option of fread does something differently.
Since fill = TRUE, I would expect all lines to be used to infer the number of columns (at some computational cost).
In the comments there are some nice workarounds you can use.
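One sketch of such a workaround, on a tiny stand-in file (the real test.txt is not available here): read.table with fill = TRUE pads the short rows with NA, regardless of whether the three-field rows carry a trailing tab.

```r
# Tiny ragged file mirroring the failing case: one row has a 4th field,
# the others only 3 (and no trailing tab)
tf <- tempfile(fileext = ".txt")
writeLines(c("1\t2\t3", "4\t5\t6\t8-4", "7\t8\t9"), tf)

# read.table scans the first lines to fix the column count, then fills
# the missing cells of the short rows with NA
a <- read.table(tf, sep = "\t", fill = TRUE, header = FALSE)
dim(a)  # 3 rows, 4 columns
```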
The file has an issue: if the table has four columns, a \t should be present at the end of each row where the fourth column is missing.
In this case you may have better luck with a low-level approach: read the file line by line, split each line on \t (rows lacking the fourth column are padded with NA), and collect everything into a data.frame. Most of this work is done by the data.table::tstrsplit function. Try something like:
f <- readLines("test.txt")
require(data.table)
a <- data.frame(tstrsplit(f, "\t", type.convert = TRUE, names = TRUE, keep = 1:4),
                stringsAsFactors = FALSE)
str(a)
#'data.frame': 273070 obs. of 4 variables:
# $ V1: num 0 0.002 0.004 0.006 0.008 0.01 0.012 0.014 0.016 0.018 ...
# $ V2: num -18.7113 -1.2685 0.0768 0.1507 0.1609 ...
# $ V3: num 0 0 0 0 0 0 0 0 0 0 ...
# $ V4: chr NA NA NA NA ...
I was struggling with this as well, and found another solution (for csv and read.table) here: How can you read a CSV file in R with different number of columns. Following that answer, you can use the handy count.fields function to count the delimiters of a file line by line, then take the maximum field count and pass that many column names to fread. A reproducible example is below.
Generate text with an uneven number of fields:
text <- "12223, University\n12227, bridge, Sky\n12828, Sunset\n13801, Ground\n14853, Tranceamerica\n16520, California, ocean, summer, golden gate, beach, San Francisco\n14854, San Francisco\n15595, shibuya, Shrine\n16126, fog, San Francisco\n"
Write it to a file:
cat(text, file = "foo")
Scan the file for delimiters:
max.fields<-max(count.fields("foo", sep = ','))
Now use fread to read the file, supplying the maximum number of column names via the col.names argument:
fread("foo", header = FALSE, fill=TRUE, sep=",", col.names = paste("V", 1:max.fields, sep = ""))
However, I based this data on the example data from ?count.fields and found that if the maximum number of fields occurs in the last line of the file, fread will still fail with the following error:
Error in fread("foo", header = FALSE, fill = TRUE, sep = ",", col.names = paste("V", :
Expecting 3 cols, but line 9 contains text after processing all cols. Try again with fill=TRUE. Another reason could be that fread's logic in distinguishing one or more fields having embedded sep=',' and/or (unescaped) '\n' characters within unbalanced unescaped quotes has failed. If quote='' doesn't help, please file an issue to figure out if the logic could be improved.
Example:
text <- "12223, University\n12227, bridge, Sky\n12828, Sunset\n13801, Ground\n14853, Tranceamerica\n14854, San Francisco\n15595, shibuya, Shrine\n16126, fog, San Francisco\n16520, California, ocean, summer, golden gate, beach, San Francisco\n"
cat(text, file = "foo")
max.fields<-max(count.fields("foo", sep = ','))
fread("foo", header = FALSE, fill=TRUE, sep=",", col.names = paste("V", 1:max.fields, sep = ""))
I'll report this as an issue on the data.table GitHub. Update: issue logged here: https://github.com/Rdatatable/data.table/issues/2691
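Until that is fixed, one way around it (a sketch, assuming padded empty fields are acceptable) is to pad every line to the maximum field count before calling fread, so it no longer matters where the widest row sits:

```r
library(data.table)

# Reuse the failing example: the widest row is the last line
text <- "12223, University\n12227, bridge, Sky\n12828, Sunset\n13801, Ground\n14853, Tranceamerica\n14854, San Francisco\n15595, shibuya, Shrine\n16126, fog, San Francisco\n16520, California, ocean, summer, golden gate, beach, San Francisco\n"
cat(text, file = "foo")

# Pad each line with enough commas to reach the maximum field count
nf <- count.fields("foo", sep = ",")
padded <- paste0(readLines("foo"), strrep(",", max(nf) - nf))
writeLines(padded, "foo_padded")

dt <- fread("foo_padded", header = FALSE, sep = ",")
```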
I am using the following instructions to read a TIGER/Line gdb into R.
The first commands work correctly and print out the feature classes:
fgdb <- "ACS_2014_5YR_BG_48_TEXAS.gdb"
subset(ogrDrivers(), grepl("GDB", name))
fc_list = ogrListLayers(fgdb)
print(fc_list)
[1] "ACS_2014_5YR_BG_48_TEXAS" "BG_METADATA_2014"
[3] "X00_COUNTS" "X01_AGE_AND_SEX"
[5] "X02_RACE" "X03_HISPANIC_OR_LATINO_ORIGIN"
[7] "X07_MIGRATION" "X08_COMMUTING"
[9] "X09_CHILDREN_HOUSEHOLD_RELATIONSHIP" "X11_HOUSEHOLD_FAMILY_SUBFAMILIES"
[11] "X12_MARITAL_STATUS_AND_HISTORY" "X14_SCHOOL_ENROLLMENT"
[13] "X15_EDUCATIONAL_ATTAINMENT" "X16_LANGUAGE_SPOKEN_AT_HOME"
[15] "X17_POVERTY" "X19_INCOME"
[17] "X20_EARNINGS" "X21_VETERAN_STATUS"
[19] "X22_FOOD_STAMPS" "X23_EMPLOYMENT_STATUS"
[21] "X24_INDUSTRY_OCCUPATION" "X25_HOUSING_CHARACTERISTICS"
[23] "X27_HEALTH_INSURANCE" "X99_IMPUTATION"
attr(,"driver")
[1] "OpenFileGDB"
attr(,"nlayers")
[1] 24
However, when I try to import specific feature classes, the following error occurs:
> X01 = readOGR(dsn=fgdb,layer="X01_AGE_AND_SEX")
Error in readOGR(dsn = fgdb, layer = "X01_AGE_AND_SEX") :
no features found
In addition: Warning message:
In ogrInfo(dsn = dsn, layer = layer, encoding = encoding, use_iconv = use_iconv, :
ogrInfo: all features NULL
These feature classes do not contain geometry data, which may be why no features are found.
Is there a workaround to read non-geometry feature classes into R from a gdb file? The only solution I've found so far is to use the GDAL command-line tools to export the feature class to csv, which can then be read into R, but I would like to cut out this middle step.
ogr2ogr -f CSV output.csv ACS_2014_5YR_BG_48_TEXAS.gdb X01_AGE_AND_SEX
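One possible in-R alternative (a sketch, untested on this particular geodatabase): the sf package also uses GDAL's OpenFileGDB driver and, unlike readOGR, returns attribute-only layers as a plain data.frame rather than erroring.

```r
library(sf)

# st_read() on a geometry-less layer should come back as a data.frame,
# with a warning that no simple feature geometries are present
x01 <- st_read("ACS_2014_5YR_BG_48_TEXAS.gdb", layer = "X01_AGE_AND_SEX")
```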
Everything works on my PC at work, but at home on my Mac I run into a problem.
I wrote my data in Excel, which formats dates as dd/mm/yy even if I type dd/mm/yyyy, but keeps in memory the way I wrote it (dd/mm/yyyy).
I save the file as a CSV and read it into a data.frame.
Here is the problem:
data$ddn
[1] 29/11/58 25/07/64 25/09/67 03/01/82 15/05/58 29/07/78 22/03/69 23/01/60 15/12/60 16/06/64
[11] 10/12/60 23/08/78 13/04/67 29/11/59 25/09/56 10/10/87 22/06/60 21/06/76 01/11/63 08/07/69
[21] 22/05/52 06/05/69 04/03/64 08/04/75 09/03/54 22/04/69 29/04/71 18/03/79 14/06/71 03/06/60
71 Levels: 01/11/63 01/12/40 02/07/48 03/01/82 03/05/68 03/06/60 04/03/64 05/01/62 ... 31/07/70
> class(data$ddn)
[1] "factor"
data$ddn <- as.Date(data$ddn, format = "%d/%m/%Y") (this syntax works perfectly on my PC)
data$ddn
[1] "0058-11-29" "0064-07-25" "0067-09-25" "0082-01-03" "0058-05-15" "0078-07-29" "0069-03-22"
[8] "0060-01-23" "0060-12-15" "0064-06-16" "0060-12-10" "0078-08-23" "0067-04-13" "0059-11-29"
[15] "0056-09-25" "0087-10-10" "0060-06-22" "0076-06-21" "0063-11-01" "0069-07-08" "0052-05-22"
[22] "0069-05-06" "0064-03-04" "0075-04-08" "0054-03-09" "0069-04-22" "0071-04-29" "0079-03-18"
[29] "0071-06-14" "0060-06-03"
data$ddn<-as.Date(data$ddn,format="%d/%m/%y")
data$ddn
[1] "2058-11-29" "2064-07-25" "2067-09-25" "1982-01-03" "2058-05-15" "1978-07-29" "1969-03-22"
[8] "2060-01-23" "2060-12-15" "2064-06-16" "2060-12-10" "1978-08-23" "2067-04-13" "2059-11-29"
[15] "2056-09-25" "1987-10-10" "2060-06-22" "1976-06-21" "2063-11-01" "1969-07-08" "2052-05-22"
[22] "1969-05-06" "2064-03-04" "1975-04-08" "2054-03-09" "1969-04-22" "1971-04-29" "1979-03-18"
[29] "1971-06-14" "2060-06-03"
R chooses to put 19 or 20 before the year, and I do not know why.
And if I modify the original data (cell format: text or standard instead of date), 29/11/58 becomes 20056 (again, I am perplexed).
I thought it was an Excel problem, but the CSV that works with R on my PC doesn't work on the Mac.
How can I correct this R compatibility problem between PC and Mac?
Thanks.
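As an aside on the two-digit-year ambiguity itself: %y maps 58 to either 1958 or 2058 depending on the platform's C library, which is why the two machines can disagree. A common fix (a sketch, assuming no genuine dates lie in the future) is to parse with %y and then push any resulting future date back a century:

```r
d <- as.Date(c("29/11/58", "03/01/82"), format = "%d/%m/%y")

# Any date parsed into the future must really belong to the 1900s
future <- !is.na(d) & d > Sys.Date()
d[future] <- as.Date(format(d[future], "19%y-%m-%d"))
d
# [1] "1958-11-29" "1982-01-03"
```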
I am trying to read a .txt file with Hebrew column names, but without success.
I uploaded an example file to:
http://www.talgalili.com/files/aa.txt
And am trying the command:
read.table("http://www.talgalili.com/files/aa.txt", header = T, sep = "\t")
This returns:
X.....ª X...ª...... X...œ....
1 12 97 6
2 123 354 44
3 6 1 3
Instead of:
אחת שתיים שלוש
12 97 6
123 354 44
6 1 3
My output for:
l10n_info()
Is:
$MBCS
[1] FALSE
$`UTF-8`
[1] FALSE
$`Latin-1`
[1] TRUE
$codepage
[1] 1252
And for:
Sys.getlocale()
Is:
[1] "LC_COLLATE=English_United States.1252;LC_CTYPE=English_United States.1252;LC_MONETARY=English_United States.1252;LC_NUMERIC=C;LC_TIME=English_United States.1252"
Can you suggest what I should try or change to load the file correctly?
Update:
Trying to use:
read.table("http://www.talgalili.com/files/aa.txt",fileEncoding ="iso8859-8")
Has resulted in:
V1
1 ?
Warning messages:
1: In read.table("http://www.talgalili.com/files/aa.txt", fileEncoding = "iso8859-8") :
invalid input found on input connection 'http://www.talgalili.com/files/aa.txt'
2: In read.table("http://www.talgalili.com/files/aa.txt", fileEncoding = "iso8859-8") :
incomplete final line found by readTableHeader on 'http://www.talgalili.com/files/aa.txt'
While also trying this:
Sys.setlocale("LC_ALL", "en_US.UTF-8")
Or this:
Sys.setlocale("LC_ALL", "en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8")
Gives me this:
[1] ""
Warning message:
In Sys.setlocale("LC_ALL", "en_US.UTF-8") :
OS reports request to set locale to "en_US.UTF-8" cannot be honored
Finally, here is my sessionInfo():
R version 2.10.1 (2009-12-14)
i386-pc-mingw32
locale:
[1] LC_COLLATE=English_United States.1255 LC_CTYPE=English_United States.1252 LC_MONETARY=English_United States.1252 LC_NUMERIC=C
[5] LC_TIME=English_United States.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
loaded via a namespace (and not attached):
[1] tools_2.10.1
Any suggestion or clarification will be appreciated.
Best,
Tal
I would try passing the fileEncoding parameter to read.table with a value of iso8859-8.
Use iconvlist() to get an alphabetical list of the supported encodings. As I saw here, Hebrew should be part 8 of ISO 8859.
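A self-contained sketch of that suggestion (using a throwaway tab-separated file rather than aa.txt, and the encoding name as R's iconv spells it):

```r
# Write a tiny tab-separated file in ISO 8859-8 (first header cell is
# Hebrew), then read it back, converting via fileEncoding
tf <- tempfile(fileext = ".txt")
con <- file(tf, open = "w", encoding = "ISO8859-8")
writeLines(c("\u05d0\u05d7\u05ea\ttwo", "12\t97"), con)
close(con)

dat <- read.table(tf, header = TRUE, sep = "\t", fileEncoding = "ISO8859-8")
dat
```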
I've tried George Donats's answer, but couldn't make it work, so I want to suggest another possibility for future reference.
I couldn't find the file online, so I recreated a txt file like yours, using TAB as a separator. You can load it into R with the Hebrew text using a connection, as demonstrated below:
con<-file("aa.txt",open="r",encoding="iso8859-8") ##Open a read-only connection with encoding fit for Hebrew (iso8859-8)
Then you can load it into R with your code, using the con variable as the file input:
data<-read.table(con,sep="\t",header=TRUE)
Inspecting the data variable gives the following results:
str(data)
'data.frame': 3 obs. of 3 variables:
$ אחת : int 6 44 3
$ שתיים: int 97 354 1
$ שלוש : int 12 123 6
> data$אחת
[1] 6 44 3