How to read a specific .Matrix file in R

I have a .Matrix file. I have been told it is similar to a .csv file, and when I open it in a web browser it looks like this:
%TransMat_H0004.E1.L1.S1.B1.T1
CLUSTER,,3,3,2,2,1,1,3,1,1,1,1,3,2,3,1,2,2,1,1,3,3,1,2,1,3,1,1,2,1,3,3,2,3,3,1,1,1,1,1,3,3,1,2,3,2,1,1,1,1,2,1,2,2,3,1,3,2,2,2,1,3,3,2,3,3,1,2,3,3,2,2,2,3,2,2,2,1,1,2,1,1,2,1,1,1,1,2,3,1,3,2,3,3,3,3,2,1,1,3,3,3,1,1,1,2,1,3,1,2,1,1,1,1,1,1,1,3,1,3,2,3,1,1,3,2,2,3,3,1,3,1,1,2,1,2,2,1,1,3,3,1,2,1,2,2,2,2,2,1,3,1,2,3,2,2,2,2,3,2,1,1,2,3,3,2,1,3,1,1,1,1,3,3,3,1,3,3,1,2,2,3,2,3,2,2,3,1,2,2,1,3,1,2,2,3,1,2,3,2,3,3,1,3,2,3,1,1,2,3,1,1,3,2,1,2,1,1,3,1,1,3,1,1,2,1,2,2,2,3,1,3,3,3,1,3,1,1,3,2,3,1,3,2,1,3,1,1,1,2,3,3,3,1,3,3,3,1,1,2,2,3,2,3,3,3,1,3,3,1,1,2,3,2,1,1,3,1,1,1,1,1,3,3,2,2,1,1,1,1,1,3,1,1,2,3,3,1,1,3,2,2,1,1,2,1,1,3,2,1,2,1,2,3,2,1,1,3,2,1,3,2,1,2,2,1,3,3,1,3,3,2,3,2,3,1,3,3,3,3,2,1,3,2,3,3,3,2,1,2,1,2,3,1,1,3,3,3,3,3,2,3,3,1,3,1,1,2,3,3,3,3,3,3,2,2,2,3,1,2,3,3,3,3,2,1,2,2,3,2,3,2,3,2,3,3,2,1,2,3,3,2,1,2,3,3,3,1,3,2,3,3,1,2,2,3,1,1,2,2,3,2,1,1,2,2,1,3,1,2,3,1,3,1,1,2,3,3,1,2,3,2,2,1,1,2,3,2,2,2,1,2,1,2,2,3,2,1,2,1,3,1,2,3,1,2,3,1,2,1,1,2,1,3,3,3,1,3,3,2,2,2,1,2,3,1,3,1,2,1,3,1,2,2,1,2,3,1,1,3,3,2,2,3,1,1,2,1,1,1,2,1,2,3,3,2,2,1,2,3,2,3,1,2,2,2,1,3,3,3,3,3,3,2,3,2,1,2,1,3,3,1,3,3,1,3,2,3,3,1,2,3,3,3,3,3,1,2,1,2,1,1,1,1,2,2,3,1,1,2,3,2,3,2,2,3,3,1,2,1,3,2,3,2,2,3,2,3,1,1,1,3,1,2,3,1,3,2,3,2,2,1,2,3,1,3,2,1,2,3,1,3,1,2,2,1,3,3,2,1,3,3,1,2,3,1,2,1,1,3,1,3,2,3,3,3,3,2,2,1,1,3,3,2,1,3,1,1,3,3,3,1,3,3,1,1,3,3,3,1,1,3,3,2,1,3,2,3,1,3,2,2,2,2,2,3,3,1,2,2,3,2,3,3,1,3,1,3,3,1,3,2,1,2,3,1,3,1,3,2,2,1,1,1,1,3,2,3,3,2,2,3,2,3,1,3,2,1,2,3,1,2,2,1,1,1,3,3,2,3,3,3,3,2,3,1,1,3,3,3,1,1,3,2,1,2,3,2,3,1,3,3,2,1,1,1,1,3,3,2,3,1,2,1,3,3,3,2,2,2,2,3,3,1,1,2,3,2,2,3,3,2,2,3,3,3,2,2,1,2,2,3,3,3,3,1,2,2,3,2,2,2,2,3,2,2,2,1,1,2,2,2,1,2,3,2,2,3,3,2,1,3,1,2,2,1,3,2,3,1,1,3,1,2,2,2,3,3,1,3,3,1,2,1,2,3,1,3,2,3,1,1,3,3,3,1,2,3,3,3,1,3,3,1,3,2,2,2,3,2,1,2,3,3,2,1,2,1,2,1,1,3,3,1,1,3,2,1,3,2,1,3,3,3,2,2,2,1,3,2,3,2,3,1,2,3,1,3,3,1,1,3,2,1,2,3,2,1,1,2,3,1,3,2,1,2,2,3,2,2,1,3,2,1,1,3,3,2,1,3,1,2,2,1,2,2,3,2,2,2,3,1,1,3,3,3,3,1,2,2,3,3,3,2,1,3,2,1,2,3,3,1,3,2,1,2,1,1,2,2,3,2,2,3,1,2,3,2,3,1,2,3,3,2,3,3,1
,1,2,1,1,1,3,1,3,1,3,3,2,3,1,2,2,1,2,3,3,2,3,2,3,2,1,1,3,2,3,2,3,1,1,3,1,3,2,1,3,2,2,2,3,1,1,2,3,1,1,1,2,3,3,3,1,2,3,3,3,3,2,3,1,3,1,3,2,3,2,3,3,1,1,2,3,1,1,3,3,2,3,3,1,2,3,1,2,3,3,2,3,3,2,1,2,3,3,2,3,1,2,2,3,1,2,1,3,2,3,1,2,2,3,3,2,2,3,1,3,3,3,3,2,3,2,2,1,3,1,2,1,1,1,3,2,3,1,1,1,1,3,3,2,3,1,1,2,1,3,1,2,3,3,2,2,1,1,3,2,2,3,1,2,3,3,3,2,1,2,2,3,1,3,3,2,1,2,2,3,3,2,2,3,2,1,1,3,1,3,3,1,3,2,3,3,3,1,1,1,3,1,2,2,3,2,3,2,3,1,1,2,1,2,1,3,3,1,3,3,2,2,1,3,1,2,2,3,2,2,2,3,3,2,1,1,1,1,3,1,1,2,1,2,2,3,3,2,3,3,3,2,1,1,3,2,2,2,3,1,3,3,3,2,2,3,1,3,3,3,1,3,3,3,2,3,1,2,1,1,3,1,2,3,2,1,3,3,2,1,3,2,3,2,3,1,2,2,3,3,2,3,3,3,1,2,3,3,3,3,3,1,1,2,3,1,2,1,1,1,1,2,1,1,2,3,1,3,3,2,2,3,2,2,1,3,2,2,3,1,1,1,1,1,3,1,3,1,1,3,2,2,3,3,3,1,2,2,3,3,2,3,2,3,3,2,1,2,3,3,1,3,1,2,1,1,2,2,2,2,2,2,1,3,1,3,2,3,2,2,2,2,2,3,2,2,1,3,1,1,1,2,1,2,1,2,1,3,1,3,3,1,3,1,3,3,1,3,2,3,3,3,3,1,3,3,2,3,2,3,3,3,1,1,2,2,3,3,3,2,2,3,3,1,3,1,2,1,2,2,1,1,3,3,1,1,3,1,1,1,2,2,3,2,2,2,3,3,1,2,1,2,2,2,3,2,2,1,2,1,1,1,3,3,3,2,1,3,3,3,2,2,3,1,2,1,3,1,3,3,1,3,2,3,2,2,1,1,1,3,3,2,3,1,3,2,2,2,2,2,3,1,3,2,3,1,3,1,3,1,2,3,2,2,3,3,3,3,3,1,1,2,3,3,2,3,1,3,3,1,3,3,2,2,1,3,3,3,3,2,1,3,2,2,2,3,3,1,1,3,3,3,1,3,1,1,2,3,1,3,3,3,2,1,3,1,2,1,3,2,2,3,1,3,1,2,3,3,3,2,2,3,1,2,1,1,1,2,3,1,2,3,2,3,3,2,1,1,2,3,3,1,2,3,1,1,1,3,1,2,3,1,2,3,2,2,3,2,3,2,3,1,2,3,3,1,3,3,2,2,1,1,2,3,2,2,3,3,2,1,1,1,3,3,3,2,2,1,3,2,2,1,3,2,3,3,1,1,3,2,3,3,2,3,1,3,3,1,3,3,2,3,3,2,3,1,3,3,3,3,3,1,1,3,2,2,3,3,3,3,1,1,1,1,3,2,3,3,1,3,2,2,1,1,1,1,3,2,2,3,2,2,3,3,2,3,1,1,1,3,3,3,3,2,3,1,3,3,1,1,3,3,1,3,3,3,1,3,2,1,1,3,3,2,3,3,3,2,2,1,3,3,3,1,2,2,2,2,1,2,2,1,2,3,2,1,2,2,3,3,3,3,3,2,2,3,2,2,3,2,1,3,1,1,2,2,3,1,2,3,2,1,3,1,1,2,1,2,2,3,1,2,2,3,3,1,3,2,1,3,3,2,1,3,3,3,1,3,2,3,3,2,3,2,2,3,2,1,3,3,3,3,2,1,3,3,3,1,3,3,1,3,1,3,3,3
tSNE-1,,8.13846968090103,12.8635212043927,10.3864480425066,7.17083119797853,-72.7452686458686,-49.7960088439495,45.63460621346,-50.3693843293848,-53.2415432674881,-54.6891175204711,-46.4635164735514,4.49644447816871,3.98243750756555,-9.99729157677144,-98.1041739031645,14.4129117311442,21.8090838800674,-46.5547640077783,-65.8379505581324,39.8907136841164,45.2453417297103,-43.4054353275594,5.58370171555427,-82.6419520577671,42.7647608862027,-91.125151907502,-37.9838559192307,62.9924569510685,-69.108888726706,62.7774653919852,60.3873481045592,62.825
I tried to read it by read.csv:
test=read.csv('TransMat_H0004.E1.L1.S1.B1.T1.Matrix',sep='' )
str(test)
'data.frame': 33141 obs. of 1 variable:
$ X.TransMat_H0004.E1.L1.S1.B1.T1: Factor w/ 33141 levels "A1BG,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,"| truncated,..: 13453 31099 31100 1 2 3 4 5 6 7 ...
How should I read it in the right format, say, with the first field of each line (the first element of each 'sequence', or list, I guess) as the row name?
Thanks in advance!
Sorry, I cannot provide a link to the data because it is unpublished, but I can tell you what the data look like:
%TransMat_H0004.E1.L1.S1.B1.T1
cluster,1,2,3,2,3….
tsne-1,-41,-80…..
tsne-2,-41,-80…..
tsne-3,-41,-80…..
(the remaining lines all start with a gene name followed by numbers, such as)
genea, 0,2,1,0…
….
genez,0,2,1,0
My desired output is to drop the first 4 rows (cluster, tsne-1, tsne-2, tsne-3) and extract the gene transcript matrix, such as:
V1 V2 V3 V4 V5
1 0 0 0 0 0
2 0 0 0 0 0
3 0 0 0 0 0
4 0 0 0 0 0
5 0 0 0 0 0

I figured it out with this:
read.csv("E2.Matrix", skip=1)
The first row is an annotation, according to the bioinformatics technician who prepared the .Matrix file.
Thanks! # Stephan
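A minimal sketch of how the skip = 1 approach might be extended to the desired gene matrix, assuming the simplified layout described in the question (one annotation line, then the cluster and tsne rows, then one row per gene). The file contents, the meta_rows vector and the variable names are made up for illustration, and the real file may need adjusting.

```r
# Build a tiny stand-in file with the simplified layout from the question.
# All values are invented for illustration.
path <- tempfile(fileext = ".Matrix")
writeLines(c(
  "%TransMat_example",
  "cluster,1,2,3",
  "tsne-1,-41,-80,12",
  "tsne-2,-41,-80,12",
  "tsne-3,-41,-80,12",
  "genea,0,2,1",
  "genez,0,2,0"
), path)

# Skip the annotation line and use the first field of each line as the row name.
raw <- read.csv(path, skip = 1, header = FALSE, row.names = 1)

# Assumed names of the non-gene rows; adjust to match the real file.
meta_rows <- c("cluster", "tsne-1", "tsne-2", "tsne-3")

# Keep only the gene rows and convert them to a matrix.
genes <- as.matrix(raw[!(rownames(raw) %in% meta_rows), ])
print(genes)
```

Matching on row names rather than dropping a fixed number of rows keeps the sketch working if the number of annotation rows differs between files.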

Related

How can I pull player stats from a tabbed ESPN table?

I've been reading through a couple of the other useful guides on pulling player and match data from ESPN using R, however I have come across a problem with tabbed tables. As shown here on the player stats for a recent rugby game, the player statistics table is tabbed into 'Scoring', 'Attacking', 'Defending' and 'Discipline'.
Using the following code (with the help of two lovely packages, RCurl and htmltab), I can pull out the first tab ('Scoring') from that page ...
# install & attach RCurl
if (!base::require(package="RCurl")) utils::install.packages("RCurl")
library(RCurl)
# install & attach htmltab
if (!base::require(package="htmltab")) utils::install.packages("htmltab")
library(htmltab)
# assign URL
theurl <- RCurl::getURL("https://www.espn.co.uk/rugby/playerstats?gameId=294854&league=270557",.opts = list(ssl.verifypeer = FALSE))
# pull tables from url
team1 <- htmltab::htmltab(theurl,which=1)
team2 <- htmltab::htmltab(theurl,which=2)
league <- htmltab::htmltab(theurl,which=3)
... in the following format, which is exactly what I wanted ...
team1
rowID LEINS Tx TA CG PG PTS
2 J LarmourFB 0 0 0 0 0 0
3 H KeenanW 0 0 0 0 0 0
4 G RingroseC 0 0 0 0 0 0
5 R HenshawC 1 0 0 0 0 5
6 J LoweW 1 0 0 0 0 5
7 R ByrneFH 0 0 2 2 0 10
8 J Gibson-ParkSH 0 1 0 0 0 0
9 C HealyP 0 0 0 0 0 0
10 R KelleherH 0 0 0 0 0 0
11 A PorterP 0 0 0 0 0 0
... however I seem unable to pull out any tab other than 'Scoring'. I'm sure I'm missing something really obvious, so would appreciate someone pointing out where I'm going wrong!
Thanks in advance!
If you check the HTML source of the page, you will see that the data is not there at the start. You can find a data-reactid attribute, which indicates that the data is only loaded once you click on the new tab. So you will need to find a way to trigger that click on the second tab.
One option for you might be to use Selenium: https://www.rdocumentation.org/packages/RSelenium/versions/1.7.7
This would enable you to make the necessary button click.
A sample can be found here: https://www.r-bloggers.com/2014/12/scraping-with-selenium/
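A rough sketch of what the Selenium route could look like, assuming the RSelenium and htmltab packages are installed and a local Firefox driver is available. The function name scrape_tab and the tab_css argument are hypothetical; the real CSS selector for the tab has to be looked up in the browser's developer tools. The rendered HTML is passed to htmltab as a string, the same way the question passes the output of getURL.

```r
# Sketch only: needs RSelenium, htmltab and a working browser driver.
# `tab_css` is a hypothetical CSS selector identifying the tab to click.
scrape_tab <- function(url, tab_css, which = 1, wait = 2) {
  rD <- RSelenium::rsDriver(browser = "firefox", verbose = FALSE)
  remDr <- rD$client
  # Always shut the browser and the driver down, even if a step fails.
  on.exit({
    remDr$close()
    rD$server$stop()
  })

  remDr$navigate(url)

  # Click the tab so that the page loads the corresponding table.
  tab <- remDr$findElement(using = "css selector", value = tab_css)
  tab$clickElement()
  Sys.sleep(wait)  # crude wait for the content to render

  # Hand the rendered HTML to htmltab.
  html <- remDr$getPageSource()[[1]]
  htmltab::htmltab(doc = html, which = which)
}
```

The fixed Sys.sleep() is the simplest possible wait; a slow connection may need a longer pause.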

How to fix rows order with pheatmap?

I have generated a heatmap with pheatmap and, for some reason, I want the rows to appear in a predefined order.
I saw in previous posts that the solution is to set the parameter cluster_rows to FALSE and to arrange the matrix in the order we want, like this in my case:
Otu0085 Otu0086 Otu0087 Otu0088 Otu0091
AB200 0 0 0 0 0
2 91 0 2 1 0
20CF360 0 1 0 1 0
19CF359 0 0 0 2 0
11VP12 0 0 0 0 155
11VP04 4 1 0 0 345
However, when I do:
pheatmap(shared,cluster_rows = F)
My rows are sorted alphabetically, like this:
10CF278a
11
11AA07
11CF278b
11VP03
11VP04
11VP05
11VP06
11VP08
11VP09
Any suggestions would be welcome.
Thanks in advance.
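A minimal sketch of the approach described in the question, assuming the matrix has row names and the desired order is available as a character vector (here the hypothetical wanted). With cluster_rows = FALSE, pheatmap draws the rows in the order they appear in the matrix, so the matrix itself has to be reordered before the call.

```r
# Toy matrix built from the values shown in the question.
shared <- matrix(
  c( 0, 0, 0, 0,   0,
    91, 0, 2, 1,   0,
     0, 1, 0, 1,   0,
     0, 0, 0, 2,   0,
     0, 0, 0, 0, 155,
     4, 1, 0, 0, 345),
  nrow = 6, byrow = TRUE,
  dimnames = list(
    c("AB200", "2", "20CF360", "19CF359", "11VP12", "11VP04"),
    c("Otu0085", "Otu0086", "Otu0087", "Otu0088", "Otu0091")
  )
)

# Simulate a matrix whose rows ended up sorted alphabetically.
scrambled <- shared[sort(rownames(shared)), ]

# Hypothetical vector holding the row order wanted in the plot.
wanted <- c("AB200", "2", "20CF360", "19CF359", "11VP12", "11VP04")

# Reorder the rows by name; drop = FALSE keeps the result a matrix.
ordered <- scrambled[wanted, , drop = FALSE]
print(rownames(ordered))

# Draw the heatmap only if pheatmap is installed.
if (requireNamespace("pheatmap", quietly = TRUE)) {
  pheatmap::pheatmap(ordered, cluster_rows = FALSE)
}
```

Checking rownames() of the matrix right before the pheatmap call is a quick way to confirm which order will be drawn.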

How do I make a selected table confined to a matrix, rather than a running list?

My previous lines of code for making tables from column names successfully produced short, dense matrices that let me readily process data from two survey questions (2nd example).
However, when I use the same kind of code on a different column (1st example), I don't get that sleek matrix. I end up with a list of unlinked tables, which I do not want. Perhaps it is because the new column only contains 0s and 1s, whereas the others have more than 2 distinct values.
[Please forgive my formatting issues (StackOverflow Status: Newbie). Also, many thanks in advance to those checking in on and answering my question!]
>table(select(data_final, `Relationship 2Affected Individual`, Satisfied_Treatments))
Relationship 2Affected Individual 1
1 0
2 0
3 0
6 0
Other (please specify) 0
, , 1 = 1, Response = 10679308122
0
Relationship 2Affected Individual 1
1 0
2 0
3 0
6 0
Other (please specify) 0
, ,
...
> table(select(data_final, `Relationship 2Affected Individual`, Indirect_Benefits))
Indirect_Benefits
Relationship 2Affected Individual 0 1 2 3
1 4 1 0 0
2 42 17 9 3
3 12 1 1 0
6 5 2 2 0
Other (please specify) 1 0 0 0
>#rstudioapi::versionInfo()
>#packageVersion("dplyr")
table(data_final$`Relationship 2Affected Individual`, data_final$Satisfied_Treatments)
Problem solved with the line above.
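A minimal sketch of that fix on made-up data, assuming only the two column names from the question. Passing two vectors to table() always gives a two-way table, and the column name containing spaces has to be wrapped in backticks after the $.

```r
# Invented stand-in for data_final; check.names = FALSE keeps the spaces
# in the column name, as in the question.
data_final <- data.frame(
  `Relationship 2Affected Individual` = c("1", "2", "2", "3"),
  Satisfied_Treatments = c(0, 1, 1, 0),
  check.names = FALSE
)

# Cross-tabulate the two columns as plain vectors.
tab <- table(
  data_final$`Relationship 2Affected Individual`,
  data_final$Satisfied_Treatments
)
print(tab)
```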

Conditional input using read.table or readLines

I'm struggling with using readLines() and read.table() to get a well-formatted data frame in R.
I want to read files like this, which contain hockey stats. I'd like to get a nicely formatted data frame; however, specifying a fixed number of lines to read is difficult because the number of players differs between files. Also, non-players, marked as #.AC, #.HC and so on, should not be read in.
I tried something like this
LINES <- 19
stats <- read.table(file=Datei, skip=11, header=FALSE, stringsAsFactors=FALSE,
encoding="UTF-8", nrows=LINES)
but as mentioned above, the value for LINES is different each time.
I also tried readLines as in this post, but had no luck with it.
Is there a way to integrate a condition in read.table, like (pseudo code)
if (first character == "AC") {
break read.table
}
Sorry if this looks strange, I don't have that much experience in scripting or coding.
Any help is appreciated, thanks a lot!
Greetz!
Your data show a couple of difficulties which should be handled in a sequence, which means you should not try to read the entire file with one command:
Read plain lines and find start and stop row
Depending on the specification of the files you read in, my suggestion is to first find the first row you actually want to read, using some indicator. This can be a line number which is always the same or, as in my example, two lines after the line "TEAM STATS". Finding the last line is then simple: just look for the first line after the start line that contains only whitespace:
lines <- readLines( Datei )
start <- which(lines == "TEAM STATS") + 2
end <- start + min( grep( "^\\s+$", lines[ start:length(lines) ] ) ) -2
block <- lines[start:end]  # keep 'lines' itself unchanged; the later steps index it with start and end
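A small self-contained sketch of the start/end search on made-up lines, assuming the same "TEAM STATS" marker and a whitespace-only line after the last player. The sample text and the name player_block are invented for illustration.

```r
# Invented stand-in for the result of readLines(Datei).
sample_lines <- c(
  "Some preamble",
  "TEAM STATS",
  "",
  "#  PLAYER POS GP",
  "93 Alpha, Ann F 8",
  "91 Beta, Bob F 6",
  "   ",
  "AC Coach, Carl"
)

# The header row sits two lines after the marker line.
start <- which(sample_lines == "TEAM STATS") + 2

# The first whitespace-only line after the start marks the end of the block.
first_blank <- min(grep("^\\s+$", sample_lines[start:length(sample_lines)]))
end <- start + first_blank - 2

player_block <- sample_lines[start:end]
print(player_block)
```

Note that "^\\s+$" only matches lines containing at least one whitespace character; "^\\s*$" would also catch completely empty lines.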
Read the data to data.frame
In your case you meet a couple of complications:
Your header line starts with a #, which is recognized as a comment character by default, so the line is ignored. But even if you switch this behavior off (comment.char = ""), # is not a valid column name.
If we tell read.table to split the columns on whitespace, you end up with one more column in the data than in the header, since the PLAYER column contains whitespace in its cells. So the best option for now is to ignore the header line and let read.table skip it with its default behavior (comment.char = "#"). We also let the PLAYER column be split into two and fix this later.
You won't be able to use the first column as row.names since its values are not unique.
The rows have unequal length, since the POS column is not filled everywhere.
With these points in mind, read the block:
tab <- read.table( text = lines[ start:end ], fill = TRUE, stringsAsFactors=FALSE )
# fix the PLAYER column
tab$V2 <- paste( tab$V2, tab$V3 )
tab <- tab[-3]
Fix the header
Just split the start line at multiple whitespaces and reset the first entry (#) by a valid column name:
colns <- strsplit( lines[start], "\\s+" )[[1]]
colns[1] <- "code"
colnames(tab) <- colns
Fix cases where "POS" was empty
This is done by finding the rows whose last cell contains NA and shifting their values one cell to the right:
colsToFix <- which( is.na(tab[, "SHO%"]) )
tab[ colsToFix, 4:ncol(tab) ] <- tab[ colsToFix, 3:(ncol(tab)-1) ]
tab[ colsToFix, 3 ] <- NA
> str(tab)
'data.frame': 25 obs. of 20 variables:
$ code : chr "93" "91" "61" "88" ...
$ PLAYER: chr "Eichelkraut, Flori" "Müller, Lars" "Alt, Sebastian" "Gross, Arthur" ...
$ POS : chr "F" "F" "D" "F" ...
$ GP : chr "8" "6" "7" "8" ...
$ G : int 10 1 4 3 4 2 0 2 1 0 ...
$ A : int 5 11 5 5 3 4 6 3 3 4 ...
$ PTS : int 15 12 9 8 7 6 6 5 4 4 ...
$ PIM : int 12 10 12 6 2 36 37 29 6 0 ...
$ PPG : int 3 0 1 1 1 1 0 0 1 0 ...
$ PPA : int 1 5 2 2 1 2 4 2 1 1 ...
$ SHG : int 0 1 0 1 1 0 0 0 0 0 ...
$ SHA : int 0 0 1 0 1 0 0 1 0 0 ...
$ GWG : int 2 0 1 0 0 0 0 0 0 0 ...
$ FG : int 1 0 1 1 1 0 0 0 0 0 ...
$ OTG : int 0 0 0 0 0 0 0 0 0 0 ...
$ UAG : int 1 0 1 0 0 0 0 0 0 0 ...
$ ENG : int 0 0 0 0 0 0 0 0 0 0 ...
$ SHOG : int 0 0 0 0 0 0 0 0 0 0 ...
$ SHOA : num 0 0 0 0 0 0 0 0 0 0 ...
$ SHO% : num 0 0 0 0 0 0 0 0 0 0 ...
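The read, header and POS steps above can be tried end to end on a tiny made-up block. This is only a sketch under the same assumptions as the answer (header line starting with #, PLAYER split into two fields, POS sometimes missing); the player names and the reduced set of columns are invented.

```r
# Invented block: header line plus two players, the second without a POS value.
block <- c(
  "#  PLAYER POS GP G",
  "93 Alpha, Ann F 8 10",
  "91 Beta, Bob 6 1"
)

# The header is skipped as a comment; fill = TRUE pads the short row with NA.
tab <- read.table(text = block, fill = TRUE, stringsAsFactors = FALSE)

# Glue the two halves of the player name back together.
tab$V2 <- paste(tab$V2, tab$V3)
tab <- tab[-3]

# Reuse the header line for the column names, replacing the invalid "#".
colns <- strsplit(block[1], "\\s+")[[1]]
colns[1] <- "code"
colnames(tab) <- colns

# Rows with NA in the last column lack POS: shift their values one cell right.
rows_to_fix <- which(is.na(tab[, ncol(tab)]))
tab[rows_to_fix, 4:ncol(tab)] <- tab[rows_to_fix, 3:(ncol(tab) - 1)]
tab[rows_to_fix, 3] <- NA
print(tab)
```

After the shift, a column such as GP can end up as character because it temporarily shared a column with POS values, which matches the chr type of GP in the str() output above; as.integer() converts it back if needed.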

using graph.adjacency() in R

I have a sample code in R as follows:
library(igraph)
rm(list=ls())
dat=read.csv(file.choose(),header=TRUE,row.names=1,check.names=T) # read .csv file
m=as.matrix(dat)
net=graph.adjacency(adjmatrix=m,mode="undirected",weighted=TRUE,diag=FALSE)
where I used a csv file as input, which contains the following data:
23732 23778 23824 23871 58009 58098 58256
23732 0 8 0 1 0 10 0
23778 8 0 1 15 0 1 0
23824 0 1 0 0 0 0 0
23871 1 15 0 0 1 5 0
58009 0 0 0 1 0 7 0
58098 10 1 0 5 7 0 1
58256 0 0 0 0 0 1 0
After this I used following command to check weight values:
E(net)$weight
Expected output is somewhat like this:
> E(net)$weight
[1] 8 1 10 1 15 1 1 5 7 1
But I'm getting weird values (and every time different):
> E(net)$weight
[1] 2.121996e-314 2.121996e-313 1.697597e-313 1.291034e-57 1.273197e-312 5.092790e-313 2.121996e-314 2.121996e-314 6.320627e-316 2.121996e-314 1.273197e-312 2.121996e-313
[13] 8.026755e-316 9.734900e-72 1.273197e-312 8.027076e-316 6.320491e-316 8.190221e-316 5.092790e-313 1.968065e-62 6.358638e-316
I'm unable to work out what I am doing wrong, and where.
Please help me get the expected result, and please also tell me why the output is so weird and why it differs every time I run it.
Thanks,
Nitin
Just a small working example below, much clearer than CSV input.
library('igraph');
adjm1<-matrix(sample(0:1,100,replace=TRUE,prob=c(0.9,0.1)),nc=10);
g1<-graph.adjacency(adjm1);
plot(g1)
P.s. ?graph.adjacency has a lot of good examples (remember to run library('igraph')).
Related threads
Creating co-occurrence matrix
Co-occurrence matrix using SAC?
The problem seems to be due to the data type of the matrix elements. graph.adjacency expects elements of type numeric. Not sure if it's a bug.
After you do,
m <- as.matrix(dat)
set its mode to numeric by:
mode(m) <- "numeric"
And then do:
net <- graph.adjacency(m, mode = "undirected", weighted = TRUE, diag = FALSE)
> E(net)$weight
[1] 8 1 10 1 15 1 1 5 7 1
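A small sketch of what the mode() assignment changes, using a 3 x 3 corner of the matrix from the question. An integer matrix is stored as "integer", and mode(m) <- "numeric" converts it to "double" while keeping the dimnames. The igraph call is optional and uses graph_from_adjacency_matrix, the newer name for graph.adjacency in current igraph releases; whether the integer storage still causes the problem there is not something this sketch checks.

```r
# Corner of the adjacency matrix from the question, stored as integers
# (as.matrix() on a data frame of whole numbers gives this storage type).
ids <- c("23732", "23778", "23824")
m <- matrix(c(0L, 8L, 0L,
              8L, 0L, 1L,
              0L, 1L, 0L),
            nrow = 3, byrow = TRUE, dimnames = list(ids, ids))
before <- storage.mode(m)   # "integer"

# Convert the elements to double, as suggested in the answer.
mode(m) <- "numeric"
after <- storage.mode(m)    # "double"
print(c(before = before, after = after))

# Build the weighted graph only if igraph is installed.
if (requireNamespace("igraph", quietly = TRUE)) {
  net <- igraph::graph_from_adjacency_matrix(
    m, mode = "undirected", weighted = TRUE, diag = FALSE
  )
  print(igraph::E(net)$weight)
}
```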
