Read.CSV not working as expected in R - r

I am stumped. Normally, read.csv works as expected, but I have come across an issue where the behavior is unexpected. It most likely is user error on my part, but any help will be appreciated.
Here is the URL for the file
http://nces.ed.gov/ipeds/datacenter/data/SFA0910.zip
Here is my code to get the file, unzip, and read it in:
URL <- "http://nces.ed.gov/ipeds/datacenter/data/SFA0910.zip"
download.file(URL, destfile="temp.zip")
unzip("temp.zip")
tmp <- read.table("sfa0910.csv",
header=T, stringsAsFactors=F, sep=",", row.names=NULL)
Here is my problem. When I open the data csv data in Excel, the data look as expected. When I read the data into R, the first column is actually named row.names. R is reading in one extra row of data, but I can't figure out where the "error" occurs that is causing row.names to be a column. Simply, it looks like the data shifted over.
However, what is strange is that the last column in R does appear to contain the proper data.
Here are a few rows from the first few columns:
tmp[1:5,1:7]
row.names UNITID XSCUGRAD SCUGRAD XSCUGFFN SCUGFFN XSCUGFFP
1 100654 R 4496 R 1044 R 23
2 100663 R 10646 R 1496 R 14
3 100690 R 380 R 5 R 1
4 100706 R 6119 R 774 R 13
5 100724 R 4638 R 1209 R 26
Any thoughts on what I could be doing wrong?

My tip: use count.fields() as a quick diagnostic when delimited files do not behave as expected.
First, count the number of fields using table():
table(count.fields("sfa0910.csv", sep = ","))
# 451 452
# 1 6852
That tells you that all but one of the lines contains 452 fields. So which is the aberrant line?
which(count.fields("sfa0910.csv", sep = ",") != 452)
# [1] 1
The first line is the problem. On inspection, all lines except the first are terminated by 2 commas.
The question now is: what does that mean? Is there supposed to be an extra field in the header row which was omitted? Or were the 2 commas appended to the other lines in error? It may be best to contact whoever generated the data, if possible, to clarify the ambiguity.

I have a fix maybe based on mnel's comments
dat<-readLines(paste("sfa", '0910', ".csv", sep=""))
ncommas<-sapply(seq_along(dat),function(x){sum(attributes(gregexpr(',',dat[x])[[1]])$match.length)})
> head(ncommas)
[1] 450 451 451 451 451 451
all columns after the first have an extra seperator which excel ignores.
for(i in seq_along(dat)[-1]){
dat[i]<-gsub('(.*),','\\1',dat[i])
}
write(dat,'temp.csv')
tmp<-read.table('temp.csv',header=T, stringsAsFactors=F, sep=",")
> tmp[1:5,1:7]
UNITID XSCUGRAD SCUGRAD XSCUGFFN SCUGFFN XSCUGFFP SCUGFFP
1 100654 R 4496 R 1044 R 23
2 100663 R 10646 R 1496 R 14
3 100690 R 380 R 5 R 1
4 100706 R 6119 R 774 R 13
5 100724 R 4638 R 1209 R 26
the moral of the story .... listen to Joshua Ulrich ;)
Quick fix. Open the file in excel and save it. This will also delete the extra seperators.
Alternatively
dat<-readLines(paste("sfa", '0910', ".csv", sep=""),n=1)
dum.names<-unlist(strsplit(dat,','))
tmp <- read.table(paste("sfa", '0910', ".csv", sep=""),
header=F, stringsAsFactors=F,col.names=c(dum.names,'XXXX'),sep=",",skip=1)
tmp1<-tmp[,-dim(tmp)[2]]

I know you've found an answer but as your answer helped me to find out this, I'll share:
If you read into R a file with different amount of columns for different rows, like this:
1,2,3,4,5
1,2,3,4
1,2,3
it would be read-in filling the missing columns with NAs, like this:
1,2,3,4,5
1,2,3,4,NA
1,2,3,NA,NA
BUT!
If the row with the biggest columns is not the first row, like this:
1,2,3,4
1,2,3,4,5
1,2,3
then it would be read in a bit confusing way:
1,2,3,4
1,2,3,4
5,NA,NA,NA
1,2,3,NA
(overwhelming before you figure out the problem and quite simple after!)
Just hope it may help someone!

If you using local data, also make sure that it's in the right place. To be sure put it for instance in your working directory and change it via
setwd("C:/[User]/[MyFolder]")
directly in your R-console.

Related

read.xlsx file with one column consisting "numbers as text"

I have excel file that contains numeric variables, but the first column (index column) uses custom formatting: those are numbers that should be presented as text (or similar to text) and having always fixed number of digits where some are zeroes. Here is my example table from excel:
And here is formatting for bad_col1 (rest are numbers or general):
When I try to import my data by using read.xlsx function from either openxlsx or xlsx package it produces something like this:
read.xlsx(file_dir,sheet=1)#for openxlsx
bad_col1 col2 col3
1 5 11 974
2 230 15 719
3 10250 6 944
4 2340 7 401
So as you can see, zeroes are gone. Is there any way to read 1st column as "text" and as other numeric? I can not convert it to text after, because "front zeroes" are gone arleady. I can think of workaround, but it would be more feasible for my project to have them converted while importing.
Thank you in Advance
You can use a vector to filter your desired format, with library readxl:
library(readxl)
filter <- c('text','numeric','numeric')
the_file <- read_xlsx("sample.xlsx", col_types = filter)
Even more, you can skip columns if you use in your filter 'skip' in the desired position, considering that you might have many columns.
Regards
With this https://readxl.tidyverse.org/reference/read_excel.html you can use paramater col_types so that first column is read as character.

R: lack of value in second column of first row causing read.table to recognise a 2D file as 1D

I have a series of data frames in my R environment the I have read in as follows:
x <- list.files(pattern="nuc_occupancy_region");
for(i in seq_along(x)){
print(x[i])
assign(paste(x[i]), read.table(x[i], sep='\t', header=T, fill=T))
}
ESC=ls()[grep(ls(), pattern='ESC_nuc')]
MEF=ls()[grep(ls(), pattern='MEF_nuc')]
The list of files MEF often have missing data:
eg.
from command line
head MEF_nuc_occupancy_regionCybb9049012-9053217chrX.txt
9049012 26
9049013
9049014 29
9049015
9049016 26
etc.
The above file is not a problem as the missing values will be read as NA's and I can deal with that later.
However, in others the second value of the first row is missing....
117755994
117755995
117755996
117755997 6
117755998 6
117755999 6
so despite the fact that each file has 2 columns, the lack of a second value in the first row of some of them causes them to be recognised as a file with a single column:
read.table(example.txt, sep='\t', header=T, fill=T)
117755994
117755995
117755996
117755997
6
117755998
6
117755999
6
Is there some way to avoid this as I need all the data frames to be in 2D?
Thanks
I had to just sort it out with python as 'readlines()' is unbiased to column number:
import os
list=os.listdir('.')
counter=0
files=[]
for i in list:
file=open(i, 'r')
print file
lines=file.readlines()
file.close()
corrected=open(i+'formatted', 'a')
print lines[0:9]
for line in lines:
line=line.rstrip('\n')
line=line.split()
if len(line)<2:
line.append(0)
corrected.write("{}\t{}\n".format(line[0], line[1]))
else:
corrected.write("{}\t{}\n".format(line[0], line[1]))
corrected.close()

R readr package - written and read in file doesn't match source

I apologize in advance for the somewhat lack of reproducibility here. I am doing an analysis on a very large (for me) dataset. It is from the CMS Open Payments database.
There are four files I downloaded from that website, read into R using readr, then manipulated a bit to make them smaller (column removal), and then stuck them all together using rbind. I would like to write my pared down file out to an external hard drive so I don't have to read in all the data each time I want to work on it and doing the paring then. (Obviously, its all scripted but, it takes about 45 minutes to do this so I'd like to avoid it if possible.)
So I wrote out the data and read it in, but now I am getting different results. Below is about as close as I can get to a good example. The data is named sa_all. There is a column in the table for the source. It can only take on two values: gen or res. It is a column that is actually added as part of the analysis, not one that comes in the data.
table(sa_all$src)
gen res
14837291 822559
So I save the sa_all dataframe into a CSV file.
write.csv(sa_all, 'D:\\Open_Payments\\data\\written_files\\sa_all.csv',
row.names = FALSE)
Then I open it:
sa_all2 <- read_csv('D:\\Open_Payments\\data\\written_files\\sa_all.csv')
table(sa_all2$src)
g gen res
1 14837289 822559
I did receive the following parsing warnings.
Warning: 4 parsing failures.
row col expected actual
5454739 pmt_nature embedded null
7849361 src delimiter or quote 2
7849361 src embedded null
7849361 NA 28 columns 54 columns
Since I manually add the src column and it can only take on two values, I don't see how this could cause any parsing errors.
Has anyone had any similar problems using readr? Thank you.
Just to follow up on the comment:
write_csv(sa_all, 'D:\\Open_Payments\\data\\written_files\\sa_all.csv')
sa_all2a <- read_csv('D:\\Open_Payments\\data\\written_files\\sa_all.csv')
Warning: 83 parsing failures.
row col expected actual
1535657 drug2 embedded null
1535657 NA 28 columns 25 columns
1535748 drug1 embedded null
1535748 year an integer No
1535748 NA 28 columns 27 columns
Even more parsing errors and it looks like some columns are getting shuffled entirely:
table(sa_all2a$src)
100000000278 Allergan Inc. gen GlaxoSmithKline, LLC.
1 1 14837267 1
No res
1 822559
There are columns for manufacturer names and it looks like those are leaking into the src column when I use the write_csv function.

Technique for finding bad data in read.csv in R

I am reading in a file of data that looks like this:
userId, fullName,email,password,activated,registrationDate,locale,notifyOnUpdates,lastSyncTime,plan_id,plan_period_months,plan_price,plan_exp_date,plan_is_trial,plan_is_trial_used,q_hear,q_occupation,pp_subid,pp_payments,pp_since,pp_cancelled,apikey
"2","John Smith,"john.smith#gmail.com","a","1","2004-07-23 14:19:32","en_US","1","2011-04-07 07:29:17","3",\N,\N,\N,"0","1",\N,\N,\N,\N,\N,\N,"d7734dce-4ae2-102a-8951-0040ca38ff83"
but the actual file as around 20000 records. I use the following R code to read it in:
user = read.csv("~/Desktop/dbdump/users.txt", na.strings = "\\N", quote="")
And the reason I have quote="" is because without it the import stops prematurely. I end up with a total of 9569 observations. Why I don't understand why exactly the quote="" overcomes this problem, it seems to do so.
Except that it introduces other problems that I have to 'fix'. The first one I saw is that the dates end up being strings which include the quotes, which don't want to convert to actual dates when I use to.Date() on them.
Now I could fix the strings and hack my way through. But better to know more about what I am doing. Can someone explain:
Why does the quote="" fix the 'bad data'
What is a best-practice technique to figure out what is causing the read.csv to stop early? (If I just look at the input data at +/- the indicated row, I don't see anything amiss).
Here are the lines 'near' the 'problem'. I don't see the damage do you?
"16888","user1","user1#gmail.com","TeilS12","1","2008-01-19 08:47:45","en_US","0","2008-02-23 16:51:53","1",\N,\N,\N,"0","0","article","student",\N,\N,\N,\N,"ad949a8e-17ed-102b-9237-0040ca390025"
"16889","user2","user2#gmail.com","Gaspar","1","2008-01-19 10:34:11","en_US","1",\N,"1",\N,\N,\N,"0","0","email","journalist",\N,\N,\N,\N,"8b90f63a-17fc-102b-9237-0040ca390025"
"16890","user3","user3#gmail.com","boomblaadje","1","2008-01-19 14:36:54","en_US","0",\N,"1",\N,\N,\N,"0","0","article","student",\N,\N,\N,\N,"73f31f4a-181e-102b-9237-0040ca390025"
"16891","user4","user4#gmail.com","mytyty","1","2008-01-19 15:10:45","en_US","1","2008-01-19 15:16:45","1",\N,\N,\N,"0","0","google-ad","student",\N,\N,\N,\N,"2e48e308-1823-102b-9237-0040ca390025"
"16892","user5","user5#gmail.com","08091969","1","2008-01-19 15:12:50","en_US","1",\N,"1",\N,\N,\N,"0","0","dont","dont",\N,\N,\N,\N,"79051bc8-1823-102b-9237-0040ca390025"
* Update *
It's more tricky. Even though the total number of rows imported is 9569, if I look at the last few rows they correspond to the last few rows of data. Therefore I surmise that something happened during the import to cause a lot of rows to be skipped. In fact 15914 - 9569 = 6345 records. When I have the quote="" in there I get 15914.
So my question can be modified: Is there a way to get read.csv to report about rows it decides not to import?
* UPDATE 2 *
#Dwin, I had to remove na.strings="\N" because the count.fields function doesn't permit it. With that, I get this output which looks interesting but I don't understand it.
3 4 22 23 24
1 83 15466 178 4
Your second command produces a lots of data (and stops when max.print is reached.) But the first row is this:
[1] 2 4 2 3 5 3 3 3 5 3 3 3 2 3 4 2 3 2 2 3 2 2 4 2 4 3 5 4 3 4 3 3 3 3 3 2 4
Which I don't understand if the output is supposed to show how many fields there are in each record of input. Clearly the first lines all have more than 2,4,2 etc fields... Feel like I am getting closer, but still confused!
The count.fields function can be very useful in identifying where to look for malformed data.
This gives a tabulation of fields per line ignores quoting, possibly a problem if there are embedded commas:
table( count.fields("~/Desktop/dbdump/users.txt", quote="", sep=",") )
This give a tabulation ignoring both quotes and "#"(octothorpe) as a comment character:
table( count.fields("~/Desktop/dbdump/users.txt", quote="", comment.char="") )
Atfer seeing what you report for the first tabulation..... most of which were as desired ... You can get a list of the line positions with non-22 values (using the comma and non-quote settings):
which( count.fields("~/Desktop/dbdump/users.txt", quote="", sep=",") != 22)
Sometimes the problem can be solved with fill=TRUE if the only difficulty is missing commas at the ends of lines.
One problem I have spotted (thanks to data.table) is the missing quote (") after John Smith. Could this be a problem also for other lines you have?
If I add the "missing" quote after John Smith, it reads fine.
I saved this data to data.txt:
userId, fullName,email,password,activated,registrationDate,locale,notifyOnUpdates,lastSyncTime,plan_id,plan_period_months,plan_price,plan_exp_date,plan_is_trial,plan_is_trial_used,q_hear,q_occupation,pp_subid,pp_payments,pp_since,pp_cancelled,apikey
"2","John Smith","john.smith#gmail.com","a","1","2004-07-23 14:19:32","en_US","1","2011-04-07 07:29:17","3",\N,\N,\N,"0","1",\N,\N,\N,\N,\N,\N,"d7734dce-4ae2-102a-8951-0040ca38ff83"
"16888","user1","user1#gmail.com","TeilS12","1","2008-01-19 08:47:45","en_US","0","2008-02-23 16:51:53","1",\N,\N,\N,"0","0","article","student",\N,\N,\N,\N,"ad949a8e-17ed-102b-9237-0040ca390025"
"16889","user2","user2#gmail.com","Gaspar","1","2008-01-19 10:34:11","en_US","1",\N,"1",\N,\N,\N,"0","0","email","journalist",\N,\N,\N,\N,"8b90f63a-17fc-102b-9237-0040ca390025"
"16890","user3","user3#gmail.com","boomblaadje","1","2008-01-19 14:36:54","en_US","0",\N,"1",\N,\N,\N,"0","0","article","student",\N,\N,\N,\N,"73f31f4a-181e-102b-9237-0040ca390025"
"16891","user4","user4#gmail.com","mytyty","1","2008-01-19 15:10:45","en_US","1","2008-01-19 15:16:45","1",\N,\N,\N,"0","0","google-ad","student",\N,\N,\N,\N,"2e48e308-1823-102b-9237-0040ca390025"
"16892","user5","user5#gmail.com","08091969","1","2008-01-19 15:12:50","en_US","1",\N,"1",\N,\N,\N,"0","0","dont","dont",\N,\N,\N,\N,"79051bc8-1823-102b-9237-0040ca390025"
And this is a code. Both fread and read.csv works fine.
require(data.table)
dat1 <- fread("data.txt", header = T, na.strings = "\\N")
dat1
dat2 <- read.csv("data.txt", header = T, na.strings = "\\N")
dat2

R Programming: read.csv() skips lines unexpectedly

I am trying to read a CSV file in R (under linux) using read.csv(). After the function gets completed I find that the number of lines read in R is less than the number of lines in CSV file (obtained by wc -l). Also, every time I read that specific CSV file always the same lines are getting skipped. I checked the formatting errors in CSV file but everything looks good.
But if I extract the lines being skipped into another CSV file, then R is able to read very lines from that file.
I am not able to find anywhere what my problem could be. Any help greatly appreciated.
Here's an example of using count.fields to determine where to look and perhaps apply fixes. You have a modest number of lines that are 23 'fields' in width:
> table(count.fields("~/Downloads/bugs.csv", quote="", sep=","))
2 23 30
502 10 136532
> table(count.fields("~/Downloads/bugs.csv", sep=","))
# Just wanted to see if removing quote-recognition would help.... It didn't.
2 4 10 12 20 22 23 25 28 30
11308 24 20 33 642 251 10 2 170 124584
> which(count.fields("~/Downloads/bugs.csv", quote="", sep=",") == 23)
[1] 104843 125158 127876 129734 130988 131456 132515 133048 136764
[10] 136765
I looked at the 23 with:
txt <-readLines("~/Downloads/bugs.csv")[
which(count.fields("~/Downloads/bugs.csv", quote="", sep=",") == 23)]
And they had octothorpes ("#", hash-signs) which are comment characters in R data parlance.
> table(count.fields("~/Downloads/bugs.csv", quote="", sep=",", comment.char=""))
30
137044
So.... use those settings in read.table and you should be "good to go".

Resources