Data.table fread position error [duplicate] - r

Is this a bug in data.table::fread (version 1.9.2) or misplaced user expectation/error?
Consider this trivial example where I have a table of values, TAB separated with possibly missing values. If the values are missing in the first column, fread gets upset, but if missing values are elsewhere I return the data.table I expect:
# Data with missing value in first column, third row and last column, second row:
12 876 19
23 39
15 20
fread("12 876 19
23 39
15 20")
#Error in fread("12\t876\t19\n23\t39\t\n\t15\t20") :
# Not positioned correctly after testing format of header row. ch=' '
# Data with missing values last column, rows two and three:
"12 876 19
23 39
15 20 "
fread( "12 876 19
23 39
15 20 " )
# V1 V2 V3
#1: 12 876 19
#2: 23 39 NA
#3: 15 20 NA
# Returns as expected.
Is this a bug or is it not possible to have missing values in the first column (or do I have malformed data somehow?).

I believe this is the same bug that I reported here.
The most recent version that I know will work with this type of input is Rev. 1180. You could checkout and build that version by adding #1180 to the end of the svn checkout command.
svn checkout svn://svn.r-forge.r-project.org/svnroot/datatable/#1180
If you're not familiar with checking out and building packages, see here
But, a lot of great features, bug fixes, enhancements have been implemented since Rev. 1180. (The deveolpment version at the time of this writing is Rev. 1272). So, a better solution, is to replace the R/fread.R and src/fread.c files with the versions from Rev. 1180 or older, and then re-building the package.
You can find those files online without checking them out here (sorry, I can't figure out how to post links that include '*', so you have to copy/paste):
fread.R:
http://r-forge.r-project.org/scm/viewvc.php/*checkout*/pkg/R/fread.R?revision=988&root=datatable
fread.c:
http://r-forge.r-project.org/scm/viewvc.php/*checkout*/pkg/src/fread.c?revision=1159&root=datatable
Once you've rebuilt the package, you'll be able to read your tsv file.
> fread("12\t876\t19\n23\t39\t\n\t15\t20")
V1 V2 V3
1: 12 876 19
2: 23 39 NA
3: NA 15 20
The downside to doing this is that the old version of fread() does not pass a newer test -- you won't be able to read fields that have quotes in the middle.
> fread('A,B,C\n1.2,Foo"Bar,"a"b\"c"d"\nfo"o,bar,"b,az""\n')
Error in fread("A,B,C\n1.2,Foo\"Bar,\"a\"b\"c\"d\"\nfo\"o,bar,\"b,az\"\"\n") :
Not positioned correctly after testing format of header row. ch=','
With newer versions of fread, you would get this
> fread('A,B,C\n1.2,Foo"Bar,"a"b\"c"d"\nfo"o,bar,"b,az""\n')
A B C
1: 1.2 Foo"Bar a"b"c"d
2: fo"o bar b,az"
So, for now, which version "works" depends on whether you're more likely to have missing values in the first column, or quotes in fields. For me, it's the former, so I'm still using the old code.

Related

Join Statement omitting entries

Using:
Unix
2.6.18-194.el5
I am having an issue where this join statement is omitting values/indexes from the match. I found out the values are between 11-90 (out of about 3.5 Million entries) and I have tried to look for foreign characters but I may be overlooking something (Tried cat -v to see hidden characters).
Here is the join statement I am using (only simplified the output columns for security):
join -t "|" -j 1 -o 1.1 2.1 file1 file2> fileJoined
file1 contents (first 20 values):
1
3
7
11
12
16
17
19
20
21
27
28
31
33
34
37
39
40
41
42
file2 contents (first 50 values so you can see where it would match):
1|US
2|US
3|US
4|US
5|US
6|US
7|US
8|US
9|US
10|US
11|US
12|US
13|US
14|US
15|US
16|US
17|US
18|US
19|US
20|US
21|US
22|US
23|US
24|US
25|US
26|US
27|US
28|US
29|US
30|US
31|US
32|US
33|US
34|US
35|US
36|US
37|US
38|US
39|US
40|US
41|US
42|US
43|US
44|US
45|US
46|US
47|US
48|US
49|US
50|US
From my initial testing it appears that file2 is the culprit. Because when I create a new file with values 1-100 I am able to get the join statement to match completely against file1; however the same file will not match against file2.
Another strange thing is that the file is 3.5 million records long and at value 90 they start matching again. For example, the output of fileJoined looks like this (first 20 values only):
1|1
3|3
7|7
90|90
91|91
92|92
93|93
95|95
96|96
97|97
98|98
99|99
106|106
109|109
111|111
112|112
115|115
116|116
117|117
118|118
Other things I have tried are:
Using vi to manually enter a new line 11 (still doesnt match on the join statement)
copying the code into notepad, deleting the lines in vi and then copying them back in (same result, no matching 11-90)
Removing lines 11-90 to see if the problem then shifts to 90-170 and it does not shift
I think that there may be some hidden values that I am missing, or that the 11 - 90 from file1 is not the same binary equivalent as the 11 - 90 in file2?
I am lost here, any help would be greatly appreciated.
I tried this out, and I noticed a couple things.
First: this is minor, but I think you're missing a comma in your -o specifier. I changed it to -o 1.1,2.1.
But then, running it on just the fragments you posted, I got only three lines of output:
1|1
3|3
7|7
I think this is because join assumes alphabetical sorting, while your input files look like they're numerically sorted.
Rule #1 of join(1) is to make sure your inputs are sorted, and the same way join expects them to be!
When I ran the two input files through sort and then joined again, I got 18 rows of output. (Sorting was easy, since you're joining on the first column; I didn't have to muck around with sort's column specifiers.)
Beware that, these days, sort doesn't always sort the way you expect, due to locale issues. I tend to set LC_ALL=C to make sure I get the old-fashioned behavior I'm used to.

R readHTMLTable() function error

I'm running into a problem when trying to use the readHTMLTable function in the R package XML. When running
library(XML)
baseurl <- "http://www.pro-football-reference.com/teams/"
team <- "nwe"
year <- 2011
theurl <- paste(baseurl,team,"/",year,".htm",sep="")
readurl <- getURL(theurl)
readtable <- readHTMLTable(readurl)
I get the error message:
Error in names(ans) = header :
'names' attribute [27] must be the same length as the vector [21]
I'm running 64 bit R 2.15.1 through R Studio 0.96.330. It seems there are several other questions that have been asked about the readHTMLTable() function, but none addressed this specific question. Does anyone know what's going on?
When readHTMLTable() complains about the 'names' attribute, it's a good bet that it's having trouble matching the data with what it's parsed for header values. The simplest way around this is to simply turn off header parsing entirely:
table.list <- readHTMLTable(theurl, header=F)
Note that I changed the name of the return value from "readtable" to "table.list". (I also skipped the getURL() call since 1. it didn't work for me and 2. readHTMLTable() knows how to handle URLs). The reason for the change is that, without further direction, readHTMLTable() will hunt down and parse every HTML table it can find on the given page, returning a list containing a data.frame for each.
The page you have sent it after is fairly rich, with 8 separate tables:
> length(table.list)
[1] 8
If you were only interested in a single table on the page, you can use the which attribute to specify it and receive its contents as a data.frame directly.
This could also cure your original problem if it had choked on a table you're not interested in. Many pages still use tables for navigation, search boxes, etc., so it's worth taking a look at the page first.
But this is unlikely to be the case in your example since it actually choked on all but one of them. In the unlikely event that the stars aligned and you were only interested in the successfully-oarsed third table on the page (passing statistics) you could grab it like this, keeping header parsing on:
> passing.df = readHTMLTable(theurl, which=3)
> print(passing.df)
No. Age Pos G GS QBrec Cmp Att Cmp% Yds TD TD% Int Int% Lng Y/A AY/A Y/C Y/G Rate Sk Yds NY/A ANY/A Sk% 4QC GWD
1 12 Tom Brady* 34 QB 16 16 13-3-0 401 611 65.6 5235 39 6.4 12 2.0 99 8.6 9.0 13.1 327.2 105.6 32 173 7.9 8.2 5.0 2 3
2 8 Brian Hoyer 26 3 0 1 1 100.0 22 0 0.0 0 0.0 22 22.0 22.0 22.0 7.3 118.7 0 0 22.0 22.0 0.0

Sorting data in R

I have a dataset that I need to sort by participant (RECORDING_SESSION_LABEL) and by trial_number. However, when I sort the data using R none of the sort functions I have tried put the variables in the correct numeric order that I want. The participant variable comes out ok but the trial ID variable comes out in the wrong order for what I need.
using:
fix_rep[order(as.numeric(RECORDING_SESSION_LABEL), as.numeric(trial_number)),]
Participant ID comes out as:
118 118 118 etc. 211 211 211 etc. 306 306 306 etc.(which is fine)
trial_number comes out as:
1 1 10 10 11 11 12 12 13 13 14 14 15 15 16 16 17 17 18 18 19 19 2 2 20 20 .... (which is not what I want - it seems to be sorting lexically rather than numerically)
What I would like is trial_number to be order like this within each participant number:
1 1 2 2 3 3 4 4 5 5 6 6 7 7 8 8 9 9 10 10 11 11 ....
I have checked that these variables are not factors and are numeric and also tried without the 'as.numeric', but with no joy. Looking around I saw suggestions that sort() and mixedsort() might do the trick in place of 'order', both come up with errors. I am slowly pulling my hair out over what I think should be a simple thing. Can anybody help shed some light on how to do this to get what I need?
Even though you claim it is not a factor, it does behave exactly as if it were a factor. Testing if something is a factor can be tricky since a factor is just an integer vector with a levels attribute and a class label. If it is a factor, your code needs to have a call to as.character() nested inside the as.numeric():
fix_rep[order(as.numeric(RECORDING_SESSION_LABEL), as.numeric(as.character(trial_number))),]
To be really sure if it's a factor, I recommend the str() function:
str(trial_number)
I think it may be worthwhile for you to design your own function in this case. It wouldn't be too hard, basically you could just design a bubble-sort algorithm with a few alterations. These alterations could change each number to a string, and begin by sorting those with different numbers of digits into different bins (easily done by finding which numbers, which are now strings, have the greatest numbers of indices). Then, in a similar fashion, the numbers in these bins could be sorted by converting the least significant digit to a numeric type and checking to see which are the largest/smallest. If you're interested, I could come up with some code for this, however, it looks like the two above me have beat me to the punch with some of the built-in functions. I've never used those functions, so I'm not sure if they'll work as you intend, but there's no use in reinventing the wheel.

read.csv appends/modifies column headings with date values

I'm trying to read a csv file into R that has date values in some of the colum headings.
As an example, the data file looks something like this:
ID Type 1/1/2001 2/1/2001 3/1/2001 4/1/2011
A Supply 25 35 45 55
B Demand 26 35 41 22
C Supply 25 35 44 85
D Supply 24 39 45 75
D Demand 26 35 41 22
...and my read.csv logic looks like this
dat10 <- read.csv("c:\data.csv",header=TRUE, sep=",",as.is=TRUE)
The read.csv works fine except it modifies the name of the colums with dates as follows:
x1.1.2001 x2.1.2001 x3.1.2001 x4.1.2001
Is there a way to prevent this, or a easy way to correct afterwards?
Set check.names=FALSE. But be aware that 1/1/2001 et al are syntactically invalid names, therefore they may cause you some headaches.
You can always change the column names using the colnames function. For example,
colnames(dat10) = gsub("\\.", "/", colnames(dat10))
However, having slashes in your column names isn't a particularly good idea. You can always change them just before you print out the table or when you create a graph.

How can I select human miRNA from affy chip while analyzing data using R?

I am new to R and want to analyze miRNA expression from a data set of 3 groups. Can anyone help me out.
In this case I got other miRNAs(on affy chips) as top expressed genes. Now I want to select only human miRNAs. Please help me
Thanks in advance
Summary
I'm not entirely sure what your data frame looks like, given that I haven't worked with Affy chips before. Let me try to summarize what I think you have told us. You have a data frame with a list of all of the microRNAs on the Affy chip, along with their expression data. You want to select a subset of these microRNAs that are unique to humans.
Possible solution 1
You do not state whether or not your data frame contains a variable that identifies whether or not these microRNAs are indeed from humans. If it does have this information, all you would need to do is subset your data based on this identifier. Type help(subset) or help(Extract) for more information on how to do this.
Possible solution 2
If your data frame does not contain such an identifier, you will first need to make a list of all known human microRNAs. You could retrieve these manually from the online miRBase website (and then import them into R), or you could download them from Ensembl using the R package biomaRt. To do the latter, after loading biomaRt, you might type this command:
miRNA <- getBM(c("mirbase_id", "ensembl_gene_id", "start_position", "chromosome_name"), filters = c("with_mirbase"), values = list(TRUE), mart = ensembl)
The above code requests that R download the mirbase identifier, gene ID, start position, and chromosome name for all microRNAs in the miRBase catalog. (Note that you would have to specify the human Ensembl mart in an earlier command, which I have not shown).
Once you have downloaded this information, you could use a merge command or perhaps a which command to pull the appropriate microRNAs from your Affy chip data.
Recommendations
This all might sound a bit complicated. If you haven't already, I recommend that you spend some time working through exercises on biomaRt and bioconductor. Information about these packages, and how to install them, are available at the below links:
Bioconductor, http://www.bioconductor.org/install/
Database mining with biomaRt, http://www.stat.berkeley.edu/~sandrine/Teaching/PH292.S10/Durinck.pdf
You might consider asking for this question to be migrated to Biostar. I think you would get better responses there. Also, consider editing your question to provide more information about your data. Good luck.
Edit to my original answer
In reference to your comment made at 2012-02-26 22:08:02, try the following:
## Load biomaRt package
library(biomaRt)
## Specify which "mart" (i.e., source of genetic data) that you want to use
ensembl <- useMart("ensembl")
ensembl <- useDataset("hsapiens_gene_ensembl", mart = ensembl)
## You can then ask the system what attributes are available for download
listAttributes(ensembl)
name description
58 mirbase_accession miRBase Accession(s)
59 mirbase_id miRBase ID(s)
60 mirbase_gene_name miRBase gene name
61 mirbase_transcript_name miRBase transcript
Above I have pasted part of the output from the listAttributes() command, which shows the relevant miRBase options. Now you can try the following code:
## Download microRNA data
miRNA <- getBM(c("mirbase_id", "ensembl_gene_id", "start_position", "chromosome_name"), filters = c("with_mirbase"), values = list(TRUE), mart = ensembl)
## Check how much we downloaded
> dim(miRNA)
[1] 715 4
## Peak at the head of our data
> head(miRNA)
mirbase_id ensembl_gene_id start_position chromosome_name
1 hsa-mir-320c-1 ENSG00000221493 19263471 18
2 hsa-mir-133a-1 ENSG00000207786 19405659 18
3 hsa-mir-1-2 ENSG00000207694 19408965 18
4 hsa-mir-320c-2 ENSG00000212051 21901650 18
5 hsa-mir-187 ENSG00000207797 33484781 18
6 hsa-mir-1539 ENSG00000222690 47013743 18
## Check which chromosomes are contributing to our data
> table(miRNA$chromosome_name)
1 10 11 12 13 14 15 16 17 18 19 2 20 21 22 3 4 5 6 7 8 9 X
50 27 26 25 15 59 26 15 35 7 85 23 32 5 16 31 23 30 17 33 27 28 80
Now your challenge will be to use this downloaded data to parse your original Affy data frame. Again, read the help files for the merge, Extract, and which functions to give it a try yourself first.

Resources