Why is fread not accepting the skip argument? - r

I have a .txt dataset where the first 12 lines are text, followed by 2 blank lines, and then the data:
DATE HEIGHT INPUT OUTPUT TESTMEASURE
01/01/1933 NO RECORD NO RECORD MISSING MISSING
01/02/1933 NO RECORD NO RECORD MISSING MISSING
But when I do
dat <- fread('data.txt')
it starts on line 15, uses the first data line as the column names, and ignores the header line:
01/01/1933 NO RECORD NO RECORD MISSING MISSING
The skip parameter is not affecting what I import at all. How can I specify which row should be used for the column names? Alternatively, I could rename the columns, but then the first line of data must not be ignored.
Verbose diagnosis from fread:
Input contains no \n. Taking this to be a filename to open
File opened, filesize is 0.001319 GB.
Memory mapping ... ok
Detected eol as \r\n (CRLF) in that order, the Windows standard.
Positioned on line 1 after skip or autostart
This line is the autostart and not blank so searching up for the last non-blank ... line 1
Detecting sep ... '\t'
Detected 5 columns. Longest stretch was from line 15 to line 30
Starting data input on line 15 (either column names or first row of data). First 10 characters: 01/01/1933
The line before starting line 15 is non-empty and will be ignored (it has too few or too many items to be column names or data): DATE HEIGHT INPUT OUTPUT TESTMEASURE
All the fields on line 15 are character fields. Treating as the column names.

You have 12 lines of text, 2 blank lines, and then your data. But I noticed extra whitespace between DATE and HEIGHT. To reproduce this, make a tab-delimited text file like the following, with 2 tabs between DATE and HEIGHT instead of 1:
garbage
garbage
garbage
garbage
garbage
garbage
garbage
garbage
garbage
garbage
garbage
garbage
DATE HEIGHT INPUT OUTPUT TESTMEASURE
01/01/1933 NO RECORD NO RECORD MISSING MISSING
01/02/1933 NO RECORD NO RECORD MISSING MISSING
Doing fread(data) on that file gives me:
01/01/1933 NO RECORD NO RECORD MISSING MISSING
1: 01/02/1933 NO RECORD NO RECORD MISSING MISSING
Removing the extra tab between DATE and HEIGHT gives me:
DATE HEIGHT INPUT OUTPUT TESTMEASURE
1: 01/01/1933 NO RECORD NO RECORD MISSING MISSING
2: 01/02/1933 NO RECORD NO RECORD MISSING MISSING
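As an aside, fread's skip argument also accepts a character string, in which case reading starts at the first line containing that string. A minimal sketch, assuming the header line begins with DATE as in the file above:
library(data.table)
# Start reading at the first line that contains "DATE"; the 12 lines of
# preamble and the blank lines are skipped without having to count them.
dat <- fread("data.txt", skip = "DATE")
The header row then supplies the column names, and the first data line is kept as data.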

Related

Start reading an excel after empty line

I have several Excel files I need to read, but in some of them the results start in row 41, in others in row 48. All of them, though, start right after a single empty row. How can I write code to read every file starting after this empty row?
Here is an example: my data starts in row 41. The name of the cell is the same; it's the position that changes.
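One possible approach is to locate the empty row first and then re-read from just below it. A sketch assuming the readxl package, a hypothetical file name, and that the sheet has no leading empty rows (readxl drops those, which would shift the indices):
library(readxl)
# Read the sheet once without column names so every row is preserved.
raw <- read_excel("results.xlsx", col_names = FALSE)
# Find the first row in which every cell is NA, i.e. the empty row.
empty_row <- which(rowSums(!is.na(raw)) == 0)[1]
# Re-read, skipping everything up to and including the empty row.
df <- read_excel("results.xlsx", skip = empty_row)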

R: Why am I getting an extra column titled "X.1" in my dataframe after reading my .txt file?

I have a .txt file, output by a microscope, that I need to process.
#read the .txt file generated by microscope, skipping the first 9 lines of garbage information
df <- read.csv("Objects_Population - AllCells.txt", sep="\t", skip = 9,header=TRUE, fill = T)
Then I started looking at the structure of the data frame. Everything seems fine, except that I found an extra column at the end named "X.1" whose rows are all NA values. I don't see this column when I open the .txt file in Excel. I suspect the problem has something to do with the column names generated by the microscope, which contain quite a few special characters.
Below is the data frame as read by Excel (only showing the last 2 columns, since I have 132 columns and their names are disgustingly long):
AllCells - Cell Contact Area with Neighbors [%] AllCells - Nucleus Nearest Neighbor Distance [µm]
0 4.82083
21.9512 0
15.7895 0
29.4118 0.584611
0 4.21569
0 1.99599
0 3.50767
...
This has happened to me before, but I never took it too seriously as I was always interested in a subset of my data frame. Now that I'm looking at all the columns, this is starting to bother me.
Is there any way I can read the file correctly without R attaching that additional "X.1" column at the end? Preferably without manually deleting or subsetting out the last column...
Cheers,
ML
If all the other column names are correct, you probably have a trailing \t in the text file. R tries to include it as a column and gives it the generic column name X.1.
You could first read the file as plain text, remove the trailing \t, and only then use read.csv:
file_connection <- file("Objects_Population - AllCells.txt")
content <- readLines(file_connection)
close(file_connection)
Now we get rid of the trailing \t (this might need some testing to fit your needs):
sanitized <- gsub("\\t$", "", content)
And then we read the sanitized string as if it were a file (using the text argument):
df <- read.csv(text = paste0(sanitized, collapse = "\n"), sep = "\t", skip = 9, header = TRUE, fill = TRUE)
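An alternative sketch: read the file as-is and then drop any column that is entirely NA. This assumes the phantom X.1 column is the only all-NA column, which may not hold for every file:
df <- read.csv("Objects_Population - AllCells.txt", sep = "\t", skip = 9, header = TRUE, fill = TRUE)
# Keep only the columns that contain at least one non-NA value.
df <- df[, colSums(!is.na(df)) > 0]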
Had that problem too. Fixed it by saving the file as "CSV (MS-DOS) (*.csv)" instead of what I originally had, "CSV (Comma delimited) (*.csv)".
This is almost certainly because you've got an extra empty column in your spreadsheet.
In Excel, open your sheet and press Ctrl-End. If you end up in an empty cell outside the range of your data, there's the problem. Select the column (Ctrl-Space), right-click, and choose Delete.
I also encountered a similar problem: three extra columns (X, X.1, X.2) were created after I loaded a dataset from an Excel sheet into RStudio.
Steps I followed:
a) In the Excel sheet, I selected the three extra columns after the last column with actual values (hover over the column headers to select them, then right-click and choose Delete).
b) I loaded the Excel sheet into R again; the 3 columns were gone.

SQLite: Stop insertion on: expected n columns but found m

I'm using the .import command to insert a large csv into an SQLite db. Unfortunately, some of the lines contain the delimiter inside the value of a field. For example, each line is in the form:
id, title, first name, last name, location
but some lines have values like:
1, Mr, Bob, Saget, Sydney, Australia
and the comma in the location field causes an expected 5 columns but found 6 error. Luckily, it's not important to me to insert every line, so I'd like to just not insert any lines that raise this error. Is this possible?
Parsing the csv with regex is a last resort as the file is very large and could take minutes every time I need to import it.
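One hedged workaround is to pre-filter the file so that only well-formed lines reach .import. A sketch in R, in keeping with the rest of this page (the file names are assumptions, and count.fields' quote handling may need tuning for your data):
lines <- readLines("contacts.csv")
# Count the comma-separated fields on each line; count.fields returns NA
# for lines it cannot parse (e.g. unbalanced quotes).
n <- count.fields(textConnection(lines), sep = ",")
# Keep only the lines with exactly 5 fields, then write a clean copy to import.
writeLines(lines[!is.na(n) & n == 5], "contacts_clean.csv")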

fread() error and strange behaviour when reading csv

I used fread() from the data.table library to try to read a 540 MB csv file. It returned an error message saying:
' ends field 36 on line 4 when detecting types: 20.00,8/25/2006 0:00:00,"07:05:00 PM","CST",143.00,"OTTAWA","KS","HAIL",1.00,"S","MINNEAPOLIS",8/25/2006 0:00:00,"07:05:00 PM",0.00,,1.00,"S","MINNEAPOLIS",0.00,0.00,,88.00,0.00,0.00,0.00,,0.00,,"TOP","KANSAS, East",,3907.00,9743.00,3907.00,9743.00,"Dime to nickel sized hail.
I have no idea what caused the error and want to track down if it's a bug or just some data formating issue that I can tweak fread() to process.
I managed to read the csv using read.csv(), and decided to track down the row that triggered the error above (line 617174, not line 4 as the error message says). I then re-output the offending row plus the rows immediately preceding and following it, written out using write.csv() as testout.csv.
I was able to read back testout.csv using read.csv() creating a data frame with 3 observations, as expected. Using fread() on testout.csv, however, resulted in a data table with only 1 observation, which is the last row.
The four lines in testout.csv are below (I start a new line for each entry below for readability).
"STATE__","BGN_DATE","BGN_TIME","TIME_ZONE","COUNTY","COUNTYNAME","STATE","EVTYPE","BGN_RANGE","BGN_AZI","BGN_LOCATI","END_DATE","END_TIME","COUNTY_END","COUNTYENDN","END_RANGE","END_AZI","END_LOCATI","LENGTH","WIDTH","F","MAG","FATALITIES","INJURIES","PROPDMG","PROPDMGEXP","CROPDMG","CROPDMGEXP","WFO","STATEOFFIC","ZONENAMES","LATITUDE","LONGITUDE","LATITUDE_E","LONGITUDE_","REMARKS","REFNUM"
20,"8/25/2006 0:00:00","07:01:00 PM","CST",139,"OSAGE","KS","TSTM WIND",5,"WNW","OSAGE CITY","8/25/2006 0:00:00","07:01:00 PM",0,NA,5,"WNW","OSAGE CITY",0,0,NA,52,0,0,0,"",0,"","TOP","KANSAS, East","",3840,9554,3840,9554,".",617129
20,"8/25/2006 0:00:00","07:05:00 PM","CST",143,"OTTAWA","KS","HAIL",1,"S","MINNEAPOLIS","8/25/2006 0:00:00","07:05:00 PM",0,NA,1,"S","MINNEAPOLIS",0,0,NA,88,0,0,0,"",0,"","TOP","KANSAS, East","",3907,9743,3907,9743,"Dime to nickel sized hail.
.",617130
20,"8/25/2006 0:00:00","07:07:00 PM","CST",125,"MONTGOMERY","KS","TSTM WIND",3,"N","COFFEYVILLE","8/25/2006 0:00:00","07:07:00 PM",0,NA,3,"N","COFFEYVILLE",0,0,NA,61,0,0,0,"",0,"","ICT","KANSAS, Southeast","",3705,9538,3705,9538,"",617131
When I ran fread("testout.csv", sep=",", verbose=TRUE), the output was
Input contains no \n. Taking this to be a filename to open
File opened, filesize is 1.05E-06 GB.
File is opened and mapped ok
Detected eol as \r\n (CRLF) in that order, the Windows standard.
Looking for supplied sep ',' on line 5 (the last non blank line in the first 'autostart') ... found ok
Found 37 columns
First row with 37 fields occurs on line 5 (either column names or first row of data)
Some fields on line 5 are not type character (or are empty). Treating as a data row and using default column names.
Count of eol after first data row: 2
Subtracted 1 for last eol and any trailing empty lines, leaving 1 data rows
Type codes: 1444144414444111441111111414444111141 (first 5 rows)
Type codes: 1444144414444111441111111414444111141 (after applying colClasses and integer64)
Type codes: 1444144414444111441111111414444111141 (after applying drop or select (if supplied))
Any idea what may have caused the unexpected results, and the error in the first place? And is there any way around it? Just to be clear, my aim is to be able to use fread() to read the main file, even though read.csv() has worked so far.
UPDATE: Now fixed in v1.9.3 on GitHub:
fread() now accepts line breaks inside quoted fields. Thanks to Clayton Stanley for highlighting. See: fread and a quoted multi-line column value
Windows users are reporting success with the latest version from GitHub.
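With a version of data.table that includes that fix, the three-row test file should read cleanly. A quick hedged check:
library(data.table)
dt <- fread("testout.csv")
nrow(dt)  # expect 3; the embedded line break no longer splits the row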

duplicate 'row.names' are not allowed error

I am trying to load a csv file that has 14 columns like this:
StartDate, var1, var2, var3, ..., var14
when I issue this command:
systems <- read.table("http://getfile.pl?test.csv", header = TRUE, sep = ",")
I get an error message:
duplicate 'row.names' are not allowed
It seems to me that the first column name is causing the issue. When I manually download the file and remove the StartDate name from the file, R successfully reads the file and replaces the first column name with X. Can someone tell me what is going on? The file is a (comma separated) csv file.
Then tell read.table not to use row.names:
systems <- read.table("http://getfile.pl?test.csv",
header=TRUE, sep=",", row.names=NULL)
and now your rows will simply be numbered.
Also look at read.csv, which is a wrapper around read.table that already sets the sep="," and header=TRUE arguments, so your call simplifies to:
systems <- read.csv("http://getfile.pl?test.csv", row.names=NULL)
This related question points out a part of the ?read.table documentation that explains your problem:
If there is a header and the first row contains one fewer field
than the number of columns, the first column in the input is used
for the row names. Otherwise if row.names is missing, the rows are numbered.
Your header row likely has 1 fewer field than the rest of the file, so read.table assumes that the first column holds the row.names (which must all be unique) rather than being a regular column (which can contain duplicated values). You can fix this with one of the following two solutions:
adding a delimiter (i.e. \t or ,) to the front or end of your header row in the source file, or
removing any trailing delimiters from your data.
The choice will depend on the structure of your data.
Example:
Here the header row is interpreted as having one fewer column than the data because the delimiters don't match:
v1,v2,v3 # 3 items!!
a1,a2,a3, # 4 items
b1,b2,b3, # 4 items
This is how it is interpreted by default, with the first field of each data row shifted into the row names:
     v1  v2  v3
a1   a2  a3
b1   b2  b3
The values in the first column (which has no header) are interpreted as row.names: a1 and b1. If this column contains duplicates, which is entirely possible, then you get the duplicate 'row.names' are not allowed error.
If you set row.names = NULL, the shift doesn't happen, but you still have a mismatching number of items in the header and in the data because the delimiters don't match.
Solution 1
Add trailing delimiter to header:
v1,v2,v3, # 4 items!!
a1,a2,a3, # 4 items
b1,b2,b3, # 4 items
Solution 2
Remove excess trailing delimiter from non-header rows:
v1,v2,v3 # 3 items
a1,a2,a3 # 3 items!!
b1,b2,b3 # 3 items!!
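If editing the source file by hand is impractical, Solution 2 can be applied in R before parsing. A minimal sketch, assuming a local copy of the file:
lines <- readLines("test.csv")
# Drop a single trailing delimiter from the end of each line.
lines <- sub(",$", "", lines)
systems <- read.csv(text = paste(lines, collapse = "\n"))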
In my case there was a comma at the end of every line. Removing it made things work.
I had this error when opening a CSV file in which one of the fields had commas embedded in it. The field had quotes around it, and I had copied and pasted a read.table call with quote="" in it. Once I took quote="" out, the default behavior of read.table took over and killed the problem. So I went from this:
systems <- read.table("http://getfile.pl?test.csv", header=TRUE, sep=",", quote="")
to this:
systems <- read.table("http://getfile.pl?test.csv", header=TRUE, sep=",")
I used read_csv from the readr package.
In my experience, the row.names=NULL parameter in the read.csv function leads to a wrong reading of the file if a column name is missing, i.e. every column gets shifted.
read_csv solves this.
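A minimal usage sketch; read_csv never converts a column into row names, so the shift cannot happen:
library(readr)
systems <- read_csv("http://getfile.pl?test.csv")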
Another possible reason for this error is that you have entire rows duplicated. If that is the case, the problem is solved by removing the duplicate rows.
The answer here (https://stackoverflow.com/a/22408965/2236315) by @adrianoesch should help (e.g., it solves "If you know of a solution that does not require the awkward workaround mentioned in your comment (shift the column names, copy the data), that would be great." and "...requiring that the data be copied" raised by @Frank).
Note that if you open the file in a text editor, you should see that the number of header fields is less than the number of columns below the header row. In my case, the data set had a "," missing at the end of the last header field.
It seems the problem can arise for more than one reason. The following steps worked when I was getting the same error:
I saved my file as an MS-DOS CSV (earlier it was saved as just CSV, in Excel Starter 2010).
I opened the csv in Notepad++ and checked that no comma was inconsistent (consistency as described above by @Brian).
I noticed I was not using the argument sep = ",". I used it and it worked (even though that is the default argument!).
