How to import and transform an adjacency matrix into an edge list in R?

A sample of my data is shown below. The data contain information about ties between organizations (over 2000 organizations; the csv file contains 0s, 1s, and empty cells).
A2654 B0004 B0188 B1278 B1372 B1722 B2503
A2654 0 1 0 0 0 1 0
B0004 1 0 0 0 0 1 0
B0188 0 0 0 0 0 0 0
B1278 0 0 0 0 0 0 0
B1372 0 0 0 0 0 0 0
B1722 1 1 0 0 0 0 0
(1) The first problem is that I can't import this data (.csv) into R.
I run the following code: dt <- read_csv2("Org_ties.csv"). The problem is that the first column header is left empty in the csv file (as it should be), but when reading it into R, read_csv2() generates the label "X1" for this column. I import the data in order to run the next code, g = graph_from_adjacency_matrix(dtmtrx, mode = "directed", weighted = T), to produce a graph. However, I get the error message below. I think it occurs because the file is not read properly.
graph.adjacency.dense(adjmatrix, mode = mode, weighted = weighted, :
not a square matrix
In addition: Warning message:
In mde(x) : NAs introduced by coercion
(2) The second problem is that I cannot transform the current data structure into an edge list. How can I do that? The edge list should look something like this:
V1 V2 weight
A2654 B0004 1
A2654 B0188 0
A2654 B1278 0
A2654 B1372 0
A2654 B1722 1
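Both steps can be sketched in base R. This is a minimal sketch under stated assumptions, not a definitive solution: the inline `csv_text` is hypothetical data standing in for Org_ties.csv, the file is assumed to be semicolon-separated (as the use of read_csv2() suggests), and empty cells are assumed to mean "no tie". Passing `row.names = 1` turns the unlabeled first column into row names, so it does not end up as an extra "X1" data column, and restricting the matrix to names that occur in both rows and columns makes it square.

```r
# Hypothetical stand-in for Org_ties.csv: semicolon-separated, empty first header cell.
csv_text <- ";A;B;C
A;0;1;0
B;1;0;1
C;0;;0"

# row.names = 1 uses the first column as row names instead of as a data column.
dt <- read.csv2(text = csv_text, row.names = 1)
m <- as.matrix(dt)
m[is.na(m)] <- 0  # assumption: an empty cell means "no tie"

# Keep only organizations present in both rows and columns, so the matrix is square.
ids <- intersect(rownames(m), colnames(m))
m <- m[ids, ids, drop = FALSE]

# Long format: one row per (row name, column name) pair with its cell value.
edges <- as.data.frame(as.table(m), stringsAsFactors = FALSE)
names(edges) <- c("V1", "V2", "weight")
edges <- edges[order(edges$V1, edges$V2), ]
print(edges)

# With igraph installed, the square matrix can then be passed on as in the question:
# g <- igraph::graph_from_adjacency_matrix(m, mode = "directed", weighted = TRUE)
```

To drop the zero-weight pairs, subset with `edges[edges$weight != 0, ]`.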

Related

Determine Row Number based on Nonzero Elements

I am currently working with about 301 rows of data and want to determine the earliest row at which only a few particular columns are nonzero. However, I also want to ensure that this stays true for all later rows. For example, if two columns are nonzero while all other columns are zero, but other columns become nonzero again further down the dataframe, then I have to pick a later row as the "correct" one.
I have the data:
1 x y z xx xy xz
292 0 -8.965140 9.596890 0 0 0 -0.03147483
293 0 -9.079889 9.645991 0 0 0 -0.02722520
294 0 -8.967767 9.597826 0 0 0 0
295 0 -9.090561 9.650230 0 0 0 -0.02685287
296 0 -9.081568 9.646105 0 0 0 -0.02716237
297 0 0.000000 0.000000 0 0 0 0.00000000
298 0 0.000000 0.000000 0 0 0 0.00000000
299 0 -9.098568 9.628576 0 0 0 -0.02654466
300 0 -9.089815 9.646099 0 0 0 -0.02681748
301 0 -8.998078 9.605140 0 0 0 0
As you can see, only the variables x and y are nonzero in row 294; however, the xz variable contains values again in later rows, up to row 300. Is it possible to write a function that returns the earliest row from which only x and y are nonzero and it stays that way until the final row of the dataframe?
I'm sorry if the question is difficult to understand; I found it hard to describe exactly what I want to accomplish.
EDIT: I presume I could use something like
which((df$x != 0 & df$y != 0 &
(df[, 1] | df[, 4] == 0)))
but then I need to somehow expand the second "or" condition to all columns of df.
Thanks in advance.
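One way to express this is sketched below, with the assumptions stated up front: the toy `df` is hypothetical data, `first_stable_row` and its `keep` argument are illustrative names, and a row counts as valid only if every column in `keep` is nonzero and every other column is zero (so an all-zero row breaks the run). A reversed cumulative product marks the rows from which the condition holds all the way to the end.

```r
# Hypothetical toy data; the column names follow the question.
df <- data.frame(
  x  = c(-9.0, -9.1, 0, -9.0, -8.9),
  y  = c( 9.6,  9.6, 0,  9.6,  9.6),
  z  = 0,
  xz = c(-0.03, 0, 0, 0, 0)
)

first_stable_row <- function(df, keep = c("x", "y")) {
  others <- setdiff(names(df), keep)
  # Row-wise condition: all `keep` columns nonzero, all other columns zero.
  ok <- rowSums(df[keep] != 0) == length(keep) &
    rowSums(df[others] != 0) == 0
  # stable[i] is TRUE only if `ok` holds for every row from i to the last row.
  stable <- rev(cumprod(rev(ok))) == 1
  if (any(stable)) which(stable)[1] else NA_integer_
}

result <- first_stable_row(df)
print(result)
```

The function returns a row position; if the dataframe carries its own row labels (292, 293, ...), `rownames(df)[result]` gives the corresponding label.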

How can I pull player stats from a tabbed ESPN table?

I've been reading through a couple of the other useful guides on pulling player and match data from ESPN using R; however, I have come across a problem with tabbed tables. As shown here on the player stats for a recent rugby game, the player statistics table is split into the tabs 'Scoring', 'Attacking', 'Defending' and 'Discipline'.
Using the following code (with the help of two lovely packages, RCurl and htmltab), I can pull out the first tab ('Scoring') from that page ...
# install & attach RCurl
if (!base::require(package="RCurl")) utils::install.packages("RCurl")
library(RCurl)
# install & attach htmltab
if (!base::require(package="htmltab")) utils::install.packages("htmltab")
library(htmltab)
# assign URL
theurl <- RCurl::getURL("https://www.espn.co.uk/rugby/playerstats?gameId=294854&league=270557",.opts = list(ssl.verifypeer = FALSE))
# pull tables from url
team1 <- htmltab::htmltab(theurl,which=1)
team2 <- htmltab::htmltab(theurl,which=2)
league <- htmltab::htmltab(theurl,which=3)
... in the following format, which is exactly what I wanted ...
team1
rowID LEINS Tx TA CG PG PTS
2 J LarmourFB 0 0 0 0 0 0
3 H KeenanW 0 0 0 0 0 0
4 G RingroseC 0 0 0 0 0 0
5 R HenshawC 1 0 0 0 0 5
6 J LoweW 1 0 0 0 0 5
7 R ByrneFH 0 0 2 2 0 10
8 J Gibson-ParkSH 0 1 0 0 0 0
9 C HealyP 0 0 0 0 0 0
10 R KelleherH 0 0 0 0 0 0
11 A PorterP 0 0 0 0 0 0
... however I seem unable to pull out any tab other than 'Scoring'. I'm sure I'm missing something really obvious, so would appreciate someone pointing out where I'm going wrong!
Thanks in advance!
If you check the HTML source of the page, you will see that the data for the other tabs is not there at the start. There is a data-reactid attribute which indicates that the data is only loaded once you click on the new tab. So you will need to find a way to perform that click on the second tab.
One option for you might be to use Selenium: https://www.rdocumentation.org/packages/RSelenium/versions/1.7.7
This would enable you to make the necessary button click.
A sample can be found here: https://www.r-bloggers.com/2014/12/scraping-with-selenium/

How to fix rows order with pheatmap?

I have generated a heatmap with pheatmap and, for various reasons, I want the rows to appear in a predefined order.
I saw in previous posts that the solution is to set the parameter cluster_rows to FALSE and to sort the matrix into the desired order, like this in my case:
Otu0085 Otu0086 Otu0087 Otu0088 Otu0091
AB200 0 0 0 0 0
2 91 0 2 1 0
20CF360 0 1 0 1 0
19CF359 0 0 0 2 0
11VP12 0 0 0 0 155
11VP04 4 1 0 0 345
However, when I do:
pheatmap(shared,cluster_rows = F)
My rows are sorted alphabetically, like this:
10CF278a
11
11AA07
11CF278b
11VP03
11VP04
11VP05
11VP06
11VP08
11VP09
Any suggestions would be welcome.
Thanks in advance
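The approach described in the question (reorder the matrix, then switch off row clustering) can be sketched as follows. The small `shared` matrix reuses a few values from the sample above, and `row_order` is a hypothetical vector holding the desired order. With `cluster_rows = FALSE`, pheatmap draws rows in the order they have in the matrix, so alphabetical output may simply mean that the matrix passed in is itself sorted that way.

```r
# Small stand-in for the `shared` matrix (values taken from the sample above).
shared <- matrix(
  c(0, 91, 0, 4,
    0,  0, 1, 1),
  nrow = 4,
  dimnames = list(c("AB200", "2", "20CF360", "11VP04"),
                  c("Otu0085", "Otu0086"))
)

# Hypothetical predefined row order.
row_order <- c("11VP04", "AB200", "20CF360", "2")
stopifnot(all(row_order %in% rownames(shared)))

# Reorder the rows by name before plotting.
shared_ordered <- shared[row_order, , drop = FALSE]

# Plot only if pheatmap is installed; rows are then drawn in matrix order.
if (requireNamespace("pheatmap", quietly = TRUE)) {
  pheatmap::pheatmap(shared_ordered, cluster_rows = FALSE)
}
```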

T test to find differentially expressed genes in R

I have a matrix which contains the genes and their mRNA expression values.
ID_REF GSM362168 GSM362169 GSM362170 GSM362171 GSM362172 GSM362173 GSM362174
244901_at 5.171072 5.207896 5.191145 5.067809 5.010239 5.556884 4.879528
244902_at 5.296012 5.460796 5.419633 5.440318 5.234789 7.567894 6.908795
I wanted to find the differentially expressed genes in the matrix using a t test, and I ran the following:
stat=mt.teststat(control,classlabel,test="t",na=.mt.naNUM,nonpara="n")
and I get the following error:
Error in is.factor(classlabel) : object 'classlabel' not found.
I am not sure how I have to assign the class labels. Is this the right way to find the differentially expressed genes?
The classlabel should be a vector of integers corresponding to observation (column) class labels. I do not understand what that is.
If you open the documentation for mt.teststat:
?mt.teststat
and scroll down to the end, you'll see an example using the "Golub data":
data(golub)
teststat <- mt.teststat(golub, golub.cl)
If you look at golub.cl, it will become clear what the classlabel vector should look like:
golub.cl
[1] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1
In this case, 0 or 1 are labels for two classes of sample. There should be as many values in the vector as you have samples, in the same order that the samples appear in the data matrix. You can also look at:
?golub
golub.cl: numeric vector indicating the tumor class, 27 acute
lymphoblastic leukemia (ALL) cases (code 0) and 11 acute
myeloid leukemia (AML) cases (code 1).
So you need to create a similar vector, with labels (0, 1, ...) for however many classes you have for your own data.
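As a minimal sketch of what such a vector could look like for the seven samples in the question: the split into four samples of class 0 and three of class 1 is purely hypothetical, since the question does not say which GSM columns belong to which group. The per-gene statistic is computed here with base R's `t.test` (a Welch two-sample t test) so that the example runs without multtest.

```r
# The two probes shown in the question, as a genes x samples matrix.
expr <- rbind(
  "244901_at" = c(5.171072, 5.207896, 5.191145, 5.067809, 5.010239, 5.556884, 4.879528),
  "244902_at" = c(5.296012, 5.460796, 5.419633, 5.440318, 5.234789, 7.567894, 6.908795)
)
colnames(expr) <- paste0("GSM3621", 68:74)

# Hypothetical grouping: first four samples are class 0, last three are class 1.
# One label per column, in the same order as the columns of the matrix.
classlabel <- c(0, 0, 0, 0, 1, 1, 1)
stopifnot(length(classlabel) == ncol(expr))

# Welch t statistic per gene (row), comparing class 1 against class 0.
t_stats <- apply(expr, 1, function(v) {
  unname(t.test(v[classlabel == 1], v[classlabel == 0])$statistic)
})
print(t_stats)

# With multtest installed, the same labels can be passed to the call from the question:
# stat <- multtest::mt.teststat(expr, classlabel, test = "t")
```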

Highlight cells in heatmap

I am currently trying to set up a heatmap of a matrix and highlight specific cells, based on two other matrices.
An example:
> SOI
NAP_G021 NAP_G033 NAP_G039 NAP_G120 NAP_G122
2315101 59.69418 27.26002 69.94698 35.22521 38.63995
2315102 104.15294 76.70379 114.72999 97.35930 79.46014
2315104 164.32822 61.83898 140.99388 63.25482 105.48041
2315105 32.15792 21.03730 26.89965 36.25943 40.46321
2315103 74.67434 82.49875 133.89709 93.17211 35.53019
> above150
NAP_G021 NAP_G033 NAP_G039 NAP_G120 NAP_G122
2315101 0 0 0 0 0
2315102 0 0 0 0 0
2315104 1 0 0 0 0
2315105 0 0 0 0 0
2315103 0 0 0 0 0
> below30
NAP_G021 NAP_G033 NAP_G039 NAP_G120 NAP_G122
2315101 0 1 0 0 0
2315102 0 0 0 0 0
2315104 0 0 0 0 0
2315105 0 1 1 0 0
2315103 0 0 0 0 0
Now I create a normal heatmap:
heatmap(t(SOI), Rowv = NA, Colv = NA)
Now what I want to do is highlight the cells that have a 1 in above150 with a frame of one colour (e.g. blue), whilst the cells with a 1 in below30 should get a red frame. Of course, all matrices are the same size, as they are related. I know that I can add things to the heatmap after processing via add.expr, but so far I have only managed to create full ablines that span the whole heatmap, which is not what I'm looking for.
If anybody has any suggestions I would be delighted.
When add.expr is called, the plot is set up so that the centres of the cells are at integer coordinates. Try add.expr=points(1:5,1:5) to see this. Now all you need to do is write a function that draws boxes (see help(rect)) with the colours you need, with corners near the half-integer coordinates.
Try this:
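set.seed(310366)
nx <- 5
ny <- 6
# Simulated data standing in for the real SOI matrix
SOI <- matrix(rnorm(nx * ny, 100, 50), nx, ny)
colnames(SOI) <- paste("NAP_G0", sort(as.integer(runif(ny, 10, 99))), sep = "")
rownames(SOI) <- sample(2315101:(2315101 + nx - 1))
above150 <- SOI > 150
below30 <- SOI < 30
# Draw a frame around every cell that is TRUE in tfMat
makeRects <- function(tfMat, border) {
  # Grid coordinates (cell centres) of the TRUE cells
  cAbove <- expand.grid(1:nx, 1:ny)[tfMat, ]
  xl <- cAbove[, 1] - 0.49
  yb <- cAbove[, 2] - 0.49
  xr <- cAbove[, 1] + 0.49
  yt <- cAbove[, 2] + 0.49
  rect(xl, yb, xr, yt, border = border, lwd = 3)
}
# Blue frames for values above 150, red frames for values below 30, as asked in the question
heatmap(t(SOI), Rowv = NA, Colv = NA, add.expr = {
  makeRects(above150, "blue")
  makeRects(below30, "red")
})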
