R: create a column by selecting partial strings

I have a data frame and I want to extract a specific string from one of its columns, splitting on a delimiter, but there are several conditions. I want to mutate a new column that contains only the COSVxxxx strings.
df:
ID
.
COSV50419740
.
.
.
rs375210814
.
rs114284775;COSV60321424
.
.
.
rs67376798;88974
rs1169783812
rs56386506;51676;COSV66451617
rs80358907;52202
.
.
.
482972
629301
COSV66463357
rs80358408;51066
rs80358420;51100;COSV66464432
desired df:
ID COSV.ID
. .
COSV50419740 COSV50419740
. .
. .
. .
rs375210814 rs375210814
. .
rs114284775;COSV60321424 COSV60321424
. .
. .
. .
rs67376798;88974 rs67376798;88974
rs1169783812 rs1169783812
rs56386506;51676;COSV66451617 COSV66451617
rs80358907;52202 rs80358907;52202
. .
. .
. .
482972 482972
629301 629301
COSV66463357 COSV66463357
rs80358408;51066 rs80358408;51066
rs80358420;51100;COSV66464432 COSV66464432
I want to keep the original string if there is no COSV annotation. However, my problem is that some rows contain from one to four annotations separated by semicolons. I tried to use the cSplit function to separate them, but I have no idea how to collect the COSV strings into one column.

You could use sub here, e.g.
df$ID_new <- ifelse(grepl("\\bCOSV\\d+\\b", df$ID),
                    sub("^.*\\b(COSV\\d+)\\b.*$", "\\1", df$ID),
                    NA)
This option will assign the (last) COSV value, should it exist in the ID column, otherwise it will assign NA.
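The desired output in the question keeps the original ID when no COSV entry is present, rather than NA. A minimal sketch of that variant follows, assuming a data frame with an ID column as in the question; the small df below is sample data copied from the question's rows, and COSV.ID is the column name from the desired output. Since sub() returns its input unchanged when the pattern does not match, the ifelse() can be dropped.

```r
# Sample rows copied from the question; the real df is assumed to have an ID column like this.
df <- data.frame(ID = c(".", "COSV50419740", "rs375210814",
                        "rs114284775;COSV60321424", "rs67376798;88974",
                        "rs56386506;51676;COSV66451617"),
                 stringsAsFactors = FALSE)

# sub() leaves a string untouched when the pattern does not match,
# so rows without a COSV annotation keep their original value.
df$COSV.ID <- sub("^.*\\b(COSV\\d+)\\b.*$", "\\1", df$ID, perl = TRUE)

print(df)
```

As with the ifelse() version, the greedy leading .* means the last COSV entry wins if a row happens to contain more than one.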

Related

Random binary matrix with row and column sum constraints

My objective is to create:
a randomly populated matrix with entries either 0 or 1. In this particular case, the matrix is 4x24.
The row sum of each of the 4 rows is exactly 6.
The column sum of each of the 24 columns is exactly 1.
Call the desired matrix M.
Another way of looking at M:
There are exactly 24 entries equal to 1.
No column has more than one 1 entry.
Progress:
There are 6 spots on each row with a 1 entry. The rest are zero, the matrix is sparse. With 4 rows, this means that M can be uniquely determined by a matrix of indices that stores the locations of the 1 entries. Call this matrix of indices indexM.
I populated indexM with the numbers 1:24 sampled without replacement:
set.seed(30592)
colNum <- 24
rowSum <- 6
numZeros <- colNum - rowSum
OneRow <- c(rep(1, rowSum), rep(0, numZeros))
indexM <- matrix(sample(1:24, replace = FALSE),
                 nrow = 4, ncol = 6, byrow = TRUE)
For the given seed, the matrix is: https://pastebin.com/8T21MiDv .
How do I turn indexM into the desired sparse matrix?
I found the sparseMatrix function in the Matrix library, but it wants a vector of row indices and another vector of column indices, which is not what I have.
Thank you.
I found the sparseMatrix function in the Matrix library, but it wants a vector of row indices and another vector of column indices, which is not what I have.
The constraints impose that...
row indices are rep(1:4, 6)
col indices are 1:24
The match between row and col indices is randomized. We can...
library(Matrix)
# fix rows, jumble cols
sparseMatrix(rep(1:4, each=6), sample(1:24))
# fix cols, jumble rows
sparseMatrix(sample(rep(1:4, each=6)), 1:24)
# jumbl'm all
sparseMatrix(sample(rep(1:4, each=6)), sample(1:24))
any of which will return something like
4 x 24 sparse Matrix of class "ngCMatrix"
[1,] . . . . | | . . | . . . | | . | . . . . . . . .
[2,] . | | . . . | . . | . . . . . . . . | . . . | .
[3,] . . . | . . . | . . | | . . . . | . . . | . . .
[4,] | . . . . . . . . . . . . . | . . | . | . | . |
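If the indexM matrix from the question should be reused as-is, the row and column index vectors can be derived from it directly. This is a minimal sketch, assuming indexM is built exactly as in the question (row r of indexM lists the columns that hold a 1 in row r of M); the name M and the choice of x = 1 for a numeric rather than pattern matrix are assumptions for illustration.

```r
library(Matrix)

set.seed(30592)
# indexM as built in the question: row r lists the columns holding a 1 in row r of M
indexM <- matrix(sample(1:24, replace = FALSE),
                 nrow = 4, ncol = 6, byrow = TRUE)

# row(indexM) gives the row number of every cell in the same layout as indexM,
# so both matrices flatten into matching (i, j) index vectors
M <- sparseMatrix(i = as.vector(row(indexM)),
                  j = as.vector(indexM),
                  x = 1,
                  dims = c(4, 24))

rowSums(M)  # every row sum should be 6
colSums(M)  # every column sum should be 1
```

Dropping the x argument would give a pattern matrix like the "ngCMatrix" shown above instead of a numeric one.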

manipulating a string in R replacing decimals

I am a bit stuck on this problem. I am trying to delete the dots before the first number, but I would like to keep any dots between two numbers.
. . . . . . . . . . . . . . 122 (100.0) . . . . . . . . . . . . . . 7 (5. 7)
for example the above should output to
122 (100.0) . . . . . . . . . . . . . . 7 (5. 7)
I am not sure what functions or packages I should be using to do the above.
Thanks!
This should work for what you're asking.
sub('^[\\h.]+', '', x, perl=TRUE)
Maybe something like this:
#copy pasted from your example
text <- ". . . . . . . . . . . . . . 122 (100.0) . . . . . . . . . . . . . . 7 (5. 7)"
#find the location of the first number using gregexpr
loc <- gregexpr('[0-9]', text)[[1]][1]
#substring the text from loc and until the end
substr(text, loc, nchar(text)) # or substring(text, loc)
Output:
[1] "122 (100.0) . . . . . . . . . . . . . . 7 (5. 7)"
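The one-line sub() answer above can be checked against the same example. This is a small sketch; the names text, cleaned and expected are illustrative, and the input string is rebuilt with strrep() (14 dot-space pairs, as in the question) so the dot counts are explicit.

```r
# Rebuild the example string from the question: 14 ". " pairs before each number
text <- paste0(strrep(". ", 14), "122 (100.0) ", strrep(". ", 14), "7 (5. 7)")

# ^[\h.]+ matches only the leading run of horizontal whitespace and dots;
# \h needs perl = TRUE. Dots later in the string are left alone.
cleaned <- sub("^[\\h.]+", "", text, perl = TRUE)

expected <- paste0("122 (100.0) ", strrep(". ", 14), "7 (5. 7)")
print(cleaned)
```

Because the pattern is anchored with ^ and sub() replaces only the first match, the dots between the two numbers are never touched.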

R: how to extract a row from sparseMatrix as sparseVector

I need to extract a row of a sparseMatrix as a sparseVector; however, the 'drop=FALSE' option does not work well for me.
To explain the issue, I will use an example from "extract sparse rows from sparse matrix in r" (my question is different since I need to convert the extracted row to a vector):
library(Matrix)
i <- c(1,3:8); j <- c(2,9,6:10); x <- 7 * (1:7)
A <- sparseMatrix(i, j, x = x)
b <- sparseVector(7,2,10)
Now A[1,,drop=FALSE] and b should hold the same values.
However, A[1,,drop=FALSE] is still a matrix with 2 dimensions. So if I try Matrix::crossprod(b), I get:
1 x 1 Matrix of class "dsyMatrix"
[,1]
[1,] 49
but if I try Matrix::crossprod(A[1,,drop=FALSE]), then I get:
10 x 10 sparse Matrix of class "dsCMatrix"
[1,] . . . . . . . . . .
[2,] . 49 . . . . . . . .
[3,] . . . . . . . . . .
[4,] . . . . . . . . . .
[5,] . . . . . . . . . .
[6,] . . . . . . . . . .
[7,] . . . . . . . . . .
[8,] . . . . . . . . . .
[9,] . . . . . . . . . .
[10,] . . . . . . . . . .
How can I get just 49 in the second case in an efficient way (Matrix::crossprod should be faster than %*%, as far as I understand from the description of the function)?
Also, b%*%b works correctly, while A[1,,drop=FALSE]%*%A[1,,drop=FALSE] returns the following error:
Cholmod error 'A and B inner dimensions must match' at file ../MatrixOps/cholmod_ssmult.c, line 82
I am not quite sure there is a method for (directly) casting a sparse matrix row as a sparse vector.
The reason why you are getting an error from
A[1,,drop=FALSE]%*%A[1,,drop=FALSE]
is that you're multiplying two 1x10 matrices, so the inner dimensions (10 and 1) do not match. You need to transpose the second matrix:
A[1,,drop=FALSE] %*% t(A[1,,drop=FALSE])
will return a 1x1 sparse matrix, which you can then cast with as.numeric().
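A short sketch of the transpose approach on the question's own data, together with tcrossprod(), which the Matrix documentation describes as equivalent to x %*% t(x). The names r, p1, p2, val1 and val2 are illustrative; the conversion goes through as.matrix() only to keep the final step unambiguous.

```r
library(Matrix)

i <- c(1, 3:8); j <- c(2, 9, 6:10); x <- 7 * (1:7)
A <- sparseMatrix(i, j, x = x)

r <- A[1, , drop = FALSE]   # 1 x 10 sparse matrix holding row 1 of A

# (1 x 10) %*% (10 x 1) gives a 1 x 1 matrix
p1 <- r %*% t(r)
# tcrossprod(r) computes the same product, r %*% t(r)
p2 <- tcrossprod(r)

# convert the 1 x 1 results to plain numbers
val1 <- as.numeric(as.matrix(p1))
val2 <- as.numeric(as.matrix(p2))
```

crossprod(r) computes t(r) %*% r instead, which is why it produced the 10 x 10 matrix in the question.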

Converting database output to a corpus for topic modelling

I have a total of 54892 documents. After retrieving them from the database, how am I supposed to convert them to a corpus that can be used for Topic Modelling using LDA?
This is the code I have tried:
library(RMySQL)
library(RTextTools)
library(topicmodels)
library(tm)
con <- dbConnect(MySQL(), user="root", password="root", dbname="dbtemp", host="localhost")
rs <- dbSendQuery(con, "select text_body from all_text;")
data <- fetch(rs, n=54892)
huh <- dbHasCompleted(rs)
dbClearResult(rs)
dbDisconnect(con)
I referred to this page, and noticed that the output of data from the line data <- NYTimes[sample(1:3100,size=1000,replace=FALSE),] contains a two-column table along with another table with something called TopicCode; this data is then converted to a term-document frequency matrix. I don't know how to get that TopicCode from the two columns that I retrieved from the database.
I have tried a similar problem in Python, where I converted the data to the Matrix Market format. I thought of using this file for further computations in R. I tried reading this file using b <- readMM(file="PRC.mm"), and when I printed b I got a 336331 x 88 matrix which looked like:
. . 2 . . . . . . 1 1 . 1 . . 1 . 2 . . . . . . . . . . . . . ......
. 1 . . . . . . 1 1 . . . . . . . . . . . . . . . . . . . . . ......
. . . . . . . . . 1 1 1 . . . 2 . . . . . . . 1 . . 1 . . . . ......
. . 1 . . . 2 . . . . 1 1 . . . . . . . 1 . . . . . . . . . . ......
where . means 0. This looks like a term-document matrix, but I would still like to build this kind of matrix in R. What should I do?
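One possible route, sketched under assumptions: the fetched data frame has a character column text_body (as in the query above), and the tm package is used to build the matrix. The three short strings below are made-up stand-ins for the database rows, and the control options are illustrative rather than recommended settings.

```r
library(tm)

# Stand-in for data$text_body; in the question this would come from fetch(rs, ...)
docs <- c("the cat sat on the mat",
          "dogs chase the cat around the house",
          "topic models group related words together")

# Wrap the character vector as a corpus, one document per element
corpus <- VCorpus(VectorSource(docs))

# Build a document-term matrix (documents in rows, terms in columns)
dtm <- DocumentTermMatrix(corpus,
                          control = list(removePunctuation = TRUE,
                                         stopwords = TRUE))

dim(dtm)
# A matrix like dtm is the kind of input topicmodels::LDA(dtm, k = ...) takes;
# the number of topics k would have to be chosen separately.
```

No TopicCode column is needed for this step; in the NYTimes example it appears to be a label column of that dataset, not something LDA requires.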

Remove a comma in a matrix in r

This problem is harder for me than it might sound. I imported a GML file. I now have all of my rows with numbers followed by a ,. I can't figure out how to remove the commas and make the values numeric. I have tried as.numeric and gsub, but when I print my adjacency matrix I get this output:
[1,] . 1 . . 1 . . . . 1 . . . . . . 1 . . . . . . 1 . . . . . . . . . 1 . 1 . . . ......
[2,] 1 . . . . . . . . . . . . . . . . . . . . . . . . 1 . 1 . . . . . 1 . . . 1 . ......
I need the values in row [1,] to be real numbers so I can attempt a loop, which I will come back for help on later!
This code doesn't work:
games[0] <- as.numeric(gsub("[^[:digit:]]","",games[0]))
I get this error:
Error in `[<-.igraph`(`*tmp*`, 0, value = numeric(0)) :
Logical or numeric value must be of length 1
Here is the code I have:
library(igraph)
games <- read.graph("football.gml", format="gml")
and I eventually need to be able to run this call:
get.shortest.paths(games, 1, 155, weights = NULL ,output=c("vpath", "epath", "both"))
[1,] is the label of a row with multiple values (one for each column), not a single string. The error is not raised by gsub, which is vectorized over character vectors; it comes from the assignment games[0] <- ..., because games is an igraph object (and R indexes from 1, not 0). Also keep in mind that "[^[:digit:]]" is a regular expression matching any non-digit character, so that call strips everything except digits. If you do have a character matrix (called data here) and want to remove a character from each individual value, here is an example in a loop:
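for (i in seq_len(nrow(data))) {
  for (j in seq_len(ncol(data))) {
    # fixed = TRUE treats "." as a literal dot; as a regex it would match every character
    data[i, j] <- gsub(".", "", data[i, j], fixed = TRUE)
  }
}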
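Since gsub() is vectorized, the same cleanup can be sketched without an explicit loop. The matrix m below is made-up sample data (values with trailing commas, as described in the question), not the output of read.graph(); this only applies once the values are held in an ordinary character matrix.

```r
# Made-up character matrix whose values carry a trailing comma
m <- matrix(c("1,", "2,", "3,", "4,"), nrow = 2)

# m[] <- ... replaces the contents while keeping the matrix dimensions;
# fixed = TRUE makes "," a literal string rather than a regular expression
m[] <- gsub(",", "", m, fixed = TRUE)

# Convert the cleaned strings to numbers, again keeping the dimensions
storage.mode(m) <- "double"

print(m)
```

The m[] <- form is used so that the result stays a 2 x 2 matrix regardless of what the replacement function returns.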
Maybe you could do something creative like this:
read.table(text='1 2 3 4 ,
5 6 7 8 ,
9 1 2 3 ,', sep=' ', na.strings=',')
And then drop the last column.
