I have a data frame and I want to extract a specific string from one of its columns by splitting on a delimiter, but there are several conditions. I want to mutate a new column that contains only the COSVxxxx strings.
df:
ID
.
COSV50419740
.
.
.
rs375210814
.
rs114284775;COSV60321424
.
.
.
rs67376798;88974
rs1169783812
rs56386506;51676;COSV66451617
rs80358907;52202
.
.
.
482972
629301
COSV66463357
rs80358408;51066
rs80358420;51100;COSV66464432
desired df:
ID COSV.ID
. .
COSV50419740 COSV50419740
. .
. .
. .
rs375210814 rs375210814
. .
rs114284775;COSV60321424 COSV60321424
. .
. .
. .
rs67376798;88974 rs67376798;88974
rs1169783812 rs1169783812
rs56386506;51676;COSV66451617 COSV66451617
rs80358907;52202 rs80358907;52202
. .
. .
. .
482972 482972
629301 629301
COSV66463357 COSV66463357
rs80358408;51066 rs80358408;51066
rs80358420;51100;COSV66464432 COSV66464432
I want to keep the original string if there is no COSV annotation. However, my problem is that some rows contain from one to four annotations separated by semicolons. I tried to use the cSplit function to separate them, but I have no idea how to collect the COSV strings into one column.
You could use sub here, e.g.
df$ID_new <- ifelse(grepl("\\bCOSV\\d+\\b", df$ID),
sub("^.*\\b(COSV\\d+)\\b.*$", "\\1", df$ID),
NA)
This option will assign the (last) COSV value, should it exist in the ID column, otherwise it will assign NA.
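If the original value should be kept when no COSV annotation is present (as in the desired output of the question), a minimal sketch is to drop the ifelse wrapper, because sub returns its input unchanged when the pattern does not match. The small data frame below is a hypothetical subset of the question's data, and perl = TRUE is an assumption made here so that \\b and \\d follow PCRE semantics:

```r
# Hypothetical subset of the ID column from the question
df <- data.frame(
  ID = c(".", "COSV50419740", "rs375210814",
         "rs114284775;COSV60321424", "rs67376798;88974",
         "rs56386506;51676;COSV66451617"),
  stringsAsFactors = FALSE
)

# sub() leaves a string untouched when the pattern does not match,
# so rows without a COSV annotation keep their original value
df$COSV.ID <- sub("^.*\\b(COSV\\d+)\\b.*$", "\\1", df$ID, perl = TRUE)

print(df)
```

As with the ifelse version, the greedy ^.* means the last COSV value is taken if a row contains more than one.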
My objective is to create:
a randomly populated matrix with entries either 0 or 1. In this particular case, the matrix is 4x24.
The row sum of each of the 4 rows is exactly 6.
The column sum of each of the 24 columns is exactly 1
Call the desired matrix M.
Another way of looking at M:
There are exactly 24 entries equal to 1.
No column has more than one 1 entry.
Progress:
There are 6 spots on each row with a 1 entry. The rest are zero, the matrix is sparse. With 4 rows, this means that M can be uniquely determined by a matrix of indices that stores the locations of the 1 entries. Call this matrix of indices indexM.
I populated indexM with the numbers 1:24 sampled without replacement:
set.seed(30592)
colNum <- 24
rowSum <- 6
numZeros <- colNum - rowSum
OneRow <- c(rep(1, rowSum), rep(0, numZeros))
indexM <- matrix(sample(1:24, replace = FALSE),
                 nrow = 4, ncol = 6, byrow = TRUE)
For the given seed, the matrix is: https://pastebin.com/8T21MiDv .
How do I turn indexM into the desired sparse matrix?
I found the sparseMatrix function in the Matrix library, but it wants a vector of row indices and another vector of column indices, which is not what I have.
Thank you.
I found the sparseMatrix function in the Matrix library, but it wants a vector of row indices and another vector of column indices, which is not what I have.
The constraints impose that...
row indices are rep(1:4, each=6)
col indices are 1:24
The match between row and col indices is randomized. We can...
library(Matrix)
# fix rows, jumble cols
sparseMatrix(rep(1:4, each=6), sample(1:24))
# fix cols, jumble rows
sparseMatrix(sample(rep(1:4, each=6)), 1:24)
# jumbl'm all
sparseMatrix(sample(rep(1:4, each=6)), sample(1:24))
any of which will return something like
4 x 24 sparse Matrix of class "ngCMatrix"
[1,] . . . . | | . . | . . . | | . | . . . . . . . .
[2,] . | | . . . | . . | . . . . . . . . | . . . | .
[3,] . . . | . . . | . . | | . . . . | . . . | . . .
[4,] | . . . . . . . . . . . . . | . . | . | . | . |
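To use the indexM from the question directly, rather than drawing fresh random indices, one possible sketch is to pair each entry of indexM (a column index of M) with the row of indexM it sits in. The call to row() and the explicit dims argument are choices made here, not part of the answer above:

```r
library(Matrix)

set.seed(30592)
indexM <- matrix(sample(1:24, replace = FALSE),
                 nrow = 4, ncol = 6, byrow = TRUE)

# row(indexM) holds the row number of every cell; both matrices are
# flattened column-major, so position k in i pairs with position k in j
M <- sparseMatrix(i = as.vector(row(indexM)),
                  j = as.vector(indexM),
                  dims = c(4, 24))

# Check the constraints on a dense copy: six ones per row, one per column
rowSums(as.matrix(M))
colSums(as.matrix(M))
```

Without x, sparseMatrix returns a pattern matrix; if numeric 0/1 entries are needed, passing x = 1 as well should give a numeric sparse matrix instead.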
I am a bit stuck on this problem. I am trying to delete the dots before the first number, but I would like to keep any dots between two numbers.
. . . . . . . . . . . . . . 122 (100.0) . . . . . . . . . . . . . . 7 (5. 7)
for example the above should output to
122 (100.0) . . . . . . . . . . . . . . 7 (5. 7)
I am not sure what functions or package I should be using to do the above.
Thanks!
This should work for what you're asking.
sub('^[\\h.]+', '', x, perl=TRUE)
Maybe something like this:
#copy pasted from your example
text <- ". . . . . . . . . . . . . . 122 (100.0) . . . . . . . . . . . . . . 7 (5. 7)"
#find the location of the first number using gregexpr
loc <- gregexpr('[0-9]', text)[[1]][1]
#substring the text from loc and until the end
substr(text, loc, nchar(text)) # or substring(text, loc)
Output:
[1] "122 (100.0) . . . . . . . . . . . . . . 7 (5. 7)"
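The two suggestions can be checked side by side on the example string. This is only a sketch: the variable names are made up here, and regexpr is used in place of gregexpr because only the first match is needed:

```r
x <- ". . . . . . . . . . . . . . 122 (100.0) . . . . . . . . . . . . . . 7 (5. 7)"

# Approach 1: strip the leading run of horizontal whitespace and dots
by_sub <- sub("^[\\h.]+", "", x, perl = TRUE)

# Approach 2: keep everything from the first digit onwards
by_pos <- substring(x, regexpr("[0-9]", x))

print(by_sub)
print(identical(by_sub, by_pos))
```

The approaches differ when the first non-dot character is not a digit: the sub version stops at any character that is neither a dot nor whitespace, while the position version always skips ahead to the first digit.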
I need to extract a row of a sparseMatrix as a sparseVector; however, the 'drop=FALSE' option does not work well for me.
To explain the issue, I will use an example from extract sparse rows from sparse matrix in r (my question is different since I need to convert extracted row to vector):
i <- c(1,3:8); j <- c(2,9,6:10); x <- 7 * (1:7)
A <- sparseMatrix(i, j, x = x)
b <- sparseVector(7,2,10)
Now A[1,,drop=FALSE] and b should have the same value.
However, A[1,,drop=FALSE] is still a matrix with 2 dimensions. So if I try Matrix::crossprod(b), I get:
1 x 1 Matrix of class "dsyMatrix"
[,1]
[1,] 49
but if I try Matrix::crossprod(A[1,,drop=FALSE]), then I get:
10 x 10 sparse Matrix of class "dsCMatrix"
[1,] . . . . . . . . . .
[2,] . 49 . . . . . . . .
[3,] . . . . . . . . . .
[4,] . . . . . . . . . .
[5,] . . . . . . . . . .
[6,] . . . . . . . . . .
[7,] . . . . . . . . . .
[8,] . . . . . . . . . .
[9,] . . . . . . . . . .
[10,] . . . . . . . . . .
How can I get just 49 in the second case in an efficient way (Matrix::crossprod should be faster than %*%, as far as I understand from the description of the function)?
Also, b%*%b works correctly, while A[1,,drop=FALSE]%*%A[1,,drop=FALSE] returns the following error:
Cholmod error 'A and B inner dimensions must match' at file ../MatrixOps/cholmod_ssmult.c, line 82
I am not quite sure there is a method for (directly) casting a sparse matrix row as a sparse vector.
The reason why you are getting an error from
A[1,,drop=FALSE]%*%A[1,,drop=FALSE]
is that both matrices are 1 x 10, so their inner dimensions (10 and 1) do not match. You need to transpose the second matrix:
A[1,,drop=FALSE] %*% t(A[1,,drop=FALSE])
will return a 1x1 sparse matrix, which you can then convert with as.numeric()
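A short sketch of this on the example data follows. It also shows tcrossprod, which computes x %*% t(x) and so gives the 1 x 1 result for a row matrix, whereas crossprod computes t(x) %*% x and gives the 10 x 10 matrix seen in the question. The intermediate names r, p1 and p2 are made up here, and going through as.matrix before as.numeric is just a conservative choice:

```r
library(Matrix)

i <- c(1, 3:8); j <- c(2, 9, 6:10); x <- 7 * (1:7)
A <- sparseMatrix(i, j, x = x)

r <- A[1, , drop = FALSE]   # 1 x 10 sparse row

p1 <- r %*% t(r)            # explicit transpose, as suggested above
p2 <- tcrossprod(r)         # same product without forming t(r) by hand

# Convert the 1 x 1 results to plain numbers
as.numeric(as.matrix(p1))
as.numeric(as.matrix(p2))
```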
I have a total of 54892 documents. After retrieving them from the database, how am I supposed to convert them to a corpus that can be used for Topic Modelling using LDA?
This is the code I have tried:
library(RMySQL)
library(RTextTools)
library(topicmodels)
library(tm)
con <- dbConnect(MySQL(), user="root", password="root", dbname="dbtemp", host="localhost")
rs <- dbSendQuery(con, "select text_body from all_text;")
data <- fetch(rs, n=54892)
huh <- dbHasCompleted(rs)
dbClearResult(rs)
dbDisconnect(con)
I referred to this page and noticed that the data produced by the line data <- NYTimes[sample(1:3100,size=1000,replace=FALSE),] is a two-column table along with another table with something called TopicCode; this data is then converted to a term-document frequency matrix. I don't know how to get that TopicCode from the two columns that I retrieved from the database.
I have tried a similar problem in Python, where I converted the data to the Matrix Market format. I thought of using this file for further computations in R. I read the file using b <- readMM(file="PRC.mm"), and when I printed b I got a 336331 x 88 matrix which looked like:
. . 2 . . . . . . 1 1 . 1 . . 1 . 2 . . . . . . . . . . . . . ......
. 1 . . . . . . 1 1 . . . . . . . . . . . . . . . . . . . . . ......
. . . . . . . . . 1 1 1 . . . 2 . . . . . . . 1 . . 1 . . . . ......
. . 1 . . . 2 . . . . 1 1 . . . . . . . 1 . . . . . . . . . . ......
where . means 0. This looks like a term-document matrix, but I would still like to build this kind of matrix in R. What should I do?
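One possible route, sketched here under assumptions rather than taken from the original thread, is to build a tm corpus from the fetched text column, turn it into a document-term matrix, and pass that to topicmodels::LDA. The toy data frame docs stands in for the result of fetch(), and the preprocessing options and k = 2 are arbitrary illustrative choices:

```r
library(tm)
library(topicmodels)

# Stand-in for the data frame returned by fetch(rs, n = ...)
docs <- data.frame(
  text_body = c("the cat sat on the mat",
                "dogs chase cats",
                "stock markets fell sharply",
                "investors sold stock"),
  stringsAsFactors = FALSE
)

# One document per row of the text column
corpus <- VCorpus(VectorSource(docs$text_body))

# Document-term matrix with basic cleaning
dtm <- DocumentTermMatrix(
  corpus,
  control = list(tolower = TRUE, removePunctuation = TRUE, stopwords = TRUE)
)

# LDA rejects documents with no remaining terms, so drop empty rows
dtm <- dtm[slam::row_sums(dtm) > 0, ]

fit <- LDA(dtm, k = 2, control = list(seed = 1))
top_terms <- terms(fit, 3)
print(top_terms)
```

No TopicCode column is needed for this: LDA is unsupervised, so only the text itself goes into the model.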
This problem is harder for me than it might sound. I imported a GML file. All of my rows now contain numbers followed by a comma (,). I can't figure out how to remove the commas and make the values numeric. I have tried as.numeric and gsub, but when I build my adjacency matrix I get this output:
[1,] . 1 . . 1 . . . . 1 . . . . . . 1 . . . . . . 1 . . . . . . . . . 1 . 1 . . . ......
[2,] 1 . . . . . . . . . . . . . . . . . . . . . . . . 1 . 1 . . . . . 1 . . . 1 . ......
I need the numbers in [1,] to be real numbers so I can attempt a loop, which I will come back for help with later!
This code doesn't work:
games[0] <- as.numeric(gsub("[^[:digit:]]","",games[0]))
I get this error:
Error in `[<-.igraph`(`*tmp*`, 0, value = numeric(0)) :
Logical or numeric value must be of length 1
Here is the code I have:
library(igraph)
games <- read.graph("football.gml", format="gml")
and I eventually need to be able to run this algorithm:
get.shortest.paths(games, 1, 155, weights = NULL ,output=c("vpath", "epath", "both"))
[1,] is a row with multiple values (one for each column), not a single string. Note that the error is raised by the assignment into the igraph object (`[<-.igraph`), not by gsub itself; gsub is vectorized over character vectors. To clean a character matrix you can loop over each value in the n x k matrix (or use an apply function) and apply gsub to each individual value. Also, I'm not sure why you are replacing "[^[:digit:]]". Keep in mind that this is a regular expression that removes every character that is not a digit, not a literal string. Here is an example in a loop:
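for (i in seq_len(nrow(data))) {
  for (j in seq_len(ncol(data))) {
    # fixed = TRUE treats "." as a literal dot rather than "any character"
    data[i, j] <- gsub(".", "", data[i, j], fixed = TRUE)
  }
}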
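One detail worth checking when adapting the loop: gsub treats its pattern as a regular expression by default, so "." matches any character unless fixed = TRUE is given (or the dot is escaped). The sketch below uses made-up values to show the difference, and also shows that the whole character matrix can be cleaned in one vectorized call before converting to numeric:

```r
x <- c("1.", "2.5", "30")

gsub(".", "", x)                 # regex dot: every character is removed
gsub(".", "", x, fixed = TRUE)   # literal dot: only the dots are removed
gsub("[^[:digit:]]", "", x)      # character class: keep digits only

# Hypothetical character matrix with trailing commas
m <- matrix(c("1,", "2,", "3,", "4,"), nrow = 2)

# m[] <- ... assigns into the existing cells, so the dimensions are kept
m[] <- gsub(",", "", m, fixed = TRUE)
storage.mode(m) <- "numeric"
print(m)
```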
Maybe you could do something creative like this:
read.table(text='1 2 3 4 ,
5 6 7 8 ,
9 1 2 3 ,', sep=' ', na.strings=',')
And then drop the last column.