R Programming: read.csv() skips lines unexpectedly

I am trying to read a CSV file in R (under Linux) using read.csv(). After the function completes, the number of lines read into R is less than the number of lines in the CSV file (obtained with wc -l). Also, every time I read that specific CSV file, the same lines are skipped. I checked the CSV file for formatting errors, but everything looks fine.
However, if I extract the skipped lines into another CSV file, R reads every line from that file without trouble.
I have not been able to find out what my problem could be. Any help is greatly appreciated.

Here's an example of using count.fields to determine where to look and perhaps apply fixes. You have a modest number of lines that are 23 'fields' in width:
> table(count.fields("~/Downloads/bugs.csv", quote="", sep=","))
2 23 30
502 10 136532
> table(count.fields("~/Downloads/bugs.csv", sep=","))
# Just wanted to see if removing quote-recognition would help.... It didn't.
2 4 10 12 20 22 23 25 28 30
11308 24 20 33 642 251 10 2 170 124584
> which(count.fields("~/Downloads/bugs.csv", quote="", sep=",") == 23)
[1] 104843 125158 127876 129734 130988 131456 132515 133048 136764
[10] 136765
I looked at the 23-field lines with:
txt <- readLines("~/Downloads/bugs.csv")[
  which(count.fields("~/Downloads/bugs.csv", quote="", sep=",") == 23)]
They contained octothorpes ("#", hash signs), which count.fields and read.table treat as comment characters by default.
> table(count.fields("~/Downloads/bugs.csv", quote="", sep=",", comment.char=""))
30
137044
So.... use those settings in read.table and you should be "good to go".
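For instance, a minimal sketch of the final call, assuming the file has a header row (adjust header to match your data):
# Disable quote and comment handling so no lines are dropped or merged
bugs <- read.table("~/Downloads/bugs.csv", sep = ",", quote = "",
                   comment.char = "", header = TRUE,
                   stringsAsFactors = FALSE)
nrow(bugs)  # should now be consistent with wc -l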

Related

getting < table of extent 0 > when using table() function to get table of frequency

Using RStudio from Anaconda, I am trying to generate a table of frequencies from a CSV file. When I run the code, instead of the expected table of frequencies, I get < table of extent 0 > as a result.
I tried running the same code in R (instead of RStudio) and it works as expected there. RStudio from Anaconda has already caused me a few problems when reading code files, so I suspect the two issues might be related.
Code:
sn <- read.csv("social_network.csv", header = T)
table(sn$Site)
File content, via head(sn):
ID.Gender.Age.Site.Times
1 1;male;24;None;0
2 2;female;26;Facebook;20
3 3;male;54;Facebook;2
4 4;female;42;Facebook;7
5 5;male;54;None;
6 6;female;21;Facebook;3
Expected result:
Facebook LinkedIn MySpace None Other Twitter
93 3 22 70 11 3
Actual result:
< table of extent 0 >
The column delimiter is not set correctly: the file is semicolon-separated, but read.csv defaults to a comma, so each entire row is read into a single column. That means sn$Site does not refer to a real column (it is NULL), which is why table() returns < table of extent 0 >. Read the file with the correct delimiter:
sn <- read.csv('social_network.csv', header = TRUE, sep = ';')
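Equivalently, read.csv2() defaults to a semicolon separator (and a comma decimal mark), so this sketch should give the same result:
sn <- read.csv2("social_network.csv", header = TRUE)
table(sn$Site)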

Checking for number of items in a string in R

I have a very large csv file (1.4 million rows). It is supposed to have 22 fields and 21 commas in each row. It was created by compiling quarterly text files into one large text file so that I could import it into SQL. In the past, one field was missing from the file, and I don't have time to check row by row.
In R, is there a way to verify that each row has 22 fields (21 commas)? Below is a small sample data set. The possibly missing field is the 0 in the 10th slot.
32,01,01,01,01,01,000000,123,456,0,132,345,456,456,789,235,256,88,4,1,2,1
32,01,01,01,01,01,000001,123,456,0,132,345,456,456,789,235,256,88,5,1,2,1
You can use the base R function count.fields to do this:
count.fields(tmp, sep=",")
[1] 22 22
The input for this function is the name of a file or a connection. Below, I supplied a textConnection. For large files, you would probably want to feed this into table:
table(count.fields(tmp, sep=","))
Note that count.fields can also be used to count the number of rows in a file via length(), similar to the output of wc -l on *nix systems.
data
tmp <- textConnection(
"32,01,01,01,01,01,000000,123,456,0,132,345,456,456,789,235,256,88,4,1,2,1
32,01,01,01,01,01,000001,123,456,0,132,345,456,456,789,235,256,88,5,1,2,1"
)
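Putting the pieces together on a real file, a quick sketch (my_file.csv is a hypothetical name; substitute your own):
# Count fields per row, then flag any row that deviates from the expected 22
fields <- count.fields("my_file.csv", sep = ",")
length(fields)       # total number of rows, analogous to wc -l
which(fields != 22)  # indices of malformed rows, if any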
Assuming df is your data frame:
apply(df, 1, length)
This gives the number of fields in each row. Note that once read.csv has parsed the file, every row of a data frame has the same number of columns, so this checks the parsed data rather than the raw file.

Collecting data in one row from different csv files by the name

It's hard to explain what exactly I want to achieve with my script but let me try.
I have 20 different csv files, so I loaded them into R:
tbl = list.files(pattern="*.csv")
list_of_data = lapply(tbl, read.csv)
then with your help I combined them into one and removed all of the duplicates:
data_rd <- subset(transform(all_data, X = sub("\\..*", "", X)),
                  !duplicated(X))
I now have one master table that includes all of the names (Accession):
Accession
AT1G19570
AT5G38480
AT1G07370
AT4G23670
AT5G10450
AT4G09000
AT1G22300
AT1G16080
AT1G78300
AT2G29570
Now I would like to find each accession in the other csv files and put that accession's data in the same row. There are about 20 csv files with about 20 columns each, so in some cases it might give me 400 columns. It doesn't matter how long it takes; it has to be done. Is it even possible to do in R?
Example:
First csv Second csv Third csv
Accession Size Length Weight Size Length Weight Size Length Weight
AT1G19570 12 23 43 22 77 666 656 565 33
AT5G38480
AT1G07370 33 22 33 34 22
AT4G23670
AT5G10450
AT4G09000 12 45 32
AT1G22300
AT1G16080
AT1G78300 44 22 222
AT2G29570
It looks like a hard task. Probably it has to be done with a loop. Any ideas?
This is a merge loop. Here is rough R code that will inefficiently grow with each merge.
Begin as before:
tbls = list.files(pattern="*.csv")
list_of_data = lapply(tbls, read.csv)
tbl = list_of_data[[1]]
for (i in 2:length(list_of_data)) {
  tbl = merge(tbl, list_of_data[[i]], by="Accession", all=TRUE)
}
The matching column names (those not used as the key) will be renamed with a suffix (.x, .y, and so on). The all=TRUE argument ensures that whenever a new Accession key is merged, a new row is created and the missing cells are filled with NA.
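As a more idiomatic sketch of the same idea, Reduce() can fold merge() over the whole list in one call:
# Equivalent to the loop above: successively merge all data frames
merged <- Reduce(function(x, y) merge(x, y, by = "Accession", all = TRUE),
                 list_of_data)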

What kind of files are suitable to be read in R [duplicate]

This question already has answers here:
Closed 10 years ago.
Possible Duplicate:
Read an Excel file directly from a R script
I made an Excel file, named test.xlsx, that I want to read in R:
date price
1 34
2 34.5
3 34
4 34
5 35
6 34.5
7 36
Now, when I used
x = read.csv("test.xlsx")
it didn't work. I also tried
x = read.table("test.xlsx")
I got the warning
Warning message:
In read.table("test.xlsx") :
incomplete final line found by readTableHeader on 'test.xlsx'
and the result:
V1
1 PK\003\004\024
2 PˆTز\005›DQ4ï½ùfىé|[™d\003\001µ³9\033g
So, do I need to make a special file in order to read it in R?
Try using a plain CSV file; you can save one from Excel using the Save As option. The binary output above is expected: an .xlsx file is a zip archive (the PK\003\004 at the start is the zip signature), not plain text, so read.csv and read.table cannot parse it directly.
You may want to have a look at the XLConnect package for dealing with Excel files in R: http://cran.r-project.org/web/packages/XLConnect/index.html
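A minimal sketch with XLConnect, assuming the package (and the Java runtime it requires) is installed and the data sit on the first worksheet:
library(XLConnect)
wb <- loadWorkbook("test.xlsx")    # open the workbook
x <- readWorksheet(wb, sheet = 1)  # returns a data.frame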

read.csv not working as expected in R

I am stumped. Normally, read.csv works as expected, but I have come across an issue where the behavior is unexpected. It most likely is user error on my part, but any help will be appreciated.
Here is the URL for the file
http://nces.ed.gov/ipeds/datacenter/data/SFA0910.zip
Here is my code to get the file, unzip, and read it in:
URL <- "http://nces.ed.gov/ipeds/datacenter/data/SFA0910.zip"
download.file(URL, destfile="temp.zip")
unzip("temp.zip")
tmp <- read.table("sfa0910.csv",
                  header=T, stringsAsFactors=F, sep=",", row.names=NULL)
Here is my problem. When I open the csv data in Excel, the data look as expected. When I read the data into R, though, the first column is actually named row.names and the data appear shifted over by one column, as if each row carried one extra field. I can't figure out where the "error" occurs that causes row.names to become a column.
However, what is strange is that the last column in R does appear to contain the proper data.
Here are a few rows from the first few columns:
tmp[1:5,1:7]
row.names UNITID XSCUGRAD SCUGRAD XSCUGFFN SCUGFFN XSCUGFFP
1 100654 R 4496 R 1044 R 23
2 100663 R 10646 R 1496 R 14
3 100690 R 380 R 5 R 1
4 100706 R 6119 R 774 R 13
5 100724 R 4638 R 1209 R 26
Any thoughts on what I could be doing wrong?
My tip: use count.fields() as a quick diagnostic when delimited files do not behave as expected.
First, count the number of fields using table():
table(count.fields("sfa0910.csv", sep = ","))
# 451 452
# 1 6852
That tells you that all but one of the lines contains 452 fields. So which is the aberrant line?
which(count.fields("sfa0910.csv", sep = ",") != 452)
# [1] 1
The first line is the problem. On inspection, all lines except the first are terminated by 2 commas.
The question now is: what does that mean? Is there supposed to be an extra field in the header row which was omitted? Or were the 2 commas appended to the other lines in error? It may be best to contact whoever generated the data, if possible, to clarify the ambiguity.
I may have a fix, based on mnel's comments:
dat <- readLines(paste("sfa", '0910', ".csv", sep=""))
ncommas <- sapply(seq_along(dat), function(x) {
  sum(attributes(gregexpr(',', dat[x])[[1]])$match.length)
})
> head(ncommas)
[1] 450 451 451 451 451 451
All lines after the first have an extra separator, which Excel ignores. Strip the trailing comma from every line except the header:
for (i in seq_along(dat)[-1]) {
  dat[i] <- gsub('(.*),', '\\1', dat[i])
}
write(dat, 'temp.csv')
tmp <- read.table('temp.csv', header=T, stringsAsFactors=F, sep=",")
> tmp[1:5,1:7]
UNITID XSCUGRAD SCUGRAD XSCUGFFN SCUGFFN XSCUGFFP SCUGFFP
1 100654 R 4496 R 1044 R 23
2 100663 R 10646 R 1496 R 14
3 100690 R 380 R 5 R 1
4 100706 R 6119 R 774 R 13
5 100724 R 4638 R 1209 R 26
the moral of the story .... listen to Joshua Ulrich ;)
Quick fix: open the file in Excel and save it. This will also delete the extra separators.
Alternatively
dat <- readLines(paste("sfa", '0910', ".csv", sep=""), n=1)
dum.names <- unlist(strsplit(dat, ','))
tmp <- read.table(paste("sfa", '0910', ".csv", sep=""), header=F,
                  stringsAsFactors=F, col.names=c(dum.names, 'XXXX'),
                  sep=",", skip=1)
tmp1 <- tmp[, -dim(tmp)[2]]  # drop the dummy last column
I know you've found an answer, but since your answer helped me figure this out, I'll share:
If you read a file into R whose rows have different numbers of columns, like this:
1,2,3,4,5
1,2,3,4
1,2,3
it will be read in with the missing columns filled with NAs, like this:
1,2,3,4,5
1,2,3,4,NA
1,2,3,NA,NA
BUT!
If the row with the most columns is not the first row, like this:
1,2,3,4
1,2,3,4,5
1,2,3
then it will be read in a rather confusing way:
1,2,3,4
1,2,3,4
5,NA,NA,NA
1,2,3,NA
(overwhelming before you figure out the problem and quite simple after!)
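If you want to experiment with this behavior yourself, here is a minimal sketch (the inline rows are made-up data):
# Write three uneven rows to a temp file and read them back to see
# how read.csv pads or wraps them
f <- tempfile(fileext = ".csv")
writeLines(c("1,2,3,4", "1,2,3,4,5", "1,2,3"), f)
read.csv(f, header = FALSE, fill = TRUE)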
Just hope it may help someone!
If you are using local data, also make sure it is in the right place. To be sure, put it in your working directory, which you can set with, for instance,
setwd("C:/[User]/[MyFolder]")
directly in your R console.
