manipulating a string in R replacing decimals - r

I am a bit stuck on this problem. I am trying to delete the dots before the first number, but I would like to keep any dots that appear between two numbers.
. . . . . . . . . . . . . . 122 (100.0) . . . . . . . . . . . . . . 7 (5. 7)
For example, the line above should become:
122 (100.0) . . . . . . . . . . . . . . 7 (5. 7)
I am not sure which functions or packages I should be using for this.
Thanks!

This should work for what you're asking.
sub('^[\\h.]+', '', x, perl=TRUE)
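To see the pattern at work on the string from the question, here is a minimal sketch. The variable names `x` and `trimmed` are just placeholders. `^[\\h.]+` matches a leading run of horizontal whitespace and dots (`\h` needs `perl = TRUE`), and because `sub()` replaces only the first match, the dots later in the string are left alone.

```r
# The example string from the question
x <- ". . . . . . . . . . . . . . 122 (100.0) . . . . . . . . . . . . . . 7 (5. 7)"

# Strip the leading run of spaces/tabs and dots; \h requires perl = TRUE
trimmed <- sub("^[\\h.]+", "", x, perl = TRUE)
print(trimmed)

# If only plain spaces can occur, a character class without \h behaves the same
trimmed_plain <- sub("^[ .]+", "", x)
print(identical(trimmed, trimmed_plain))
```

Both calls anchor at the start of the string with `^`, so dots between or after numbers are never touched.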

Maybe something like this:
#copy pasted from your example
text <- ". . . . . . . . . . . . . . 122 (100.0) . . . . . . . . . . . . . . 7 (5. 7)"
#find the location of the first number using gregexpr
loc <- gregexpr('[0-9]', text)[[1]][1]
#substring the text from loc and until the end
substr(text, loc, nchar(text)) # or substring(text, loc)
Output:
[1] "122 (100.0) . . . . . . . . . . . . . . 7 (5. 7)"

Related

Extract Elastic Net penalized logistic regression Coefficients from mlr

I have read a few other answers but none have worked for me in extracting the penalized logistic regression coefficients from my final trained model.
penlrntune = makeLearner("classif.glmnet", predict.type = "prob", par.vals = logres.mbo$x)
set.seed(9)
pen_model_tune <- mlr::train(penlrntune, task = trn.task)
## when I run coef(getLearnerModel(pen_model_tune)), I get:
coef(getLearnerModel(pen_model_tune, more.unwrap = TRUE))
36 x 83 sparse Matrix of class "dgCMatrix"
[[ suppressing 45 column names ‘s0’, ‘s1’, ‘s2’ ... ]]
[[ suppressing 45 column names ‘s0’, ‘s1’, ‘s2’ ... ]]
(Intercept) -3.15367 -3.251889 -3.3528 -3.455813 -3.560283 -3.665525 -3.792622 -3.816052 -3.093964 -2.249888
Obesity . . . . . . . . . .
Diabetes . . . . . . . . . .
Smoke . . . . . . . . . .
Pulm . . . . . . . . . .
Cardiac . . . . . . . . . .
Dyspnea . . . . . . . . . .
Steroids . . . . . . . . . .
RenalComorb . . . . . . . . . .
emergency.Yes . . . . . . . . . .
ASA3 . . . . . . . . . .
This doesn't make much sense, as you can tell. The model otherwise works well, so I must be missing something.
I have reproduced the issue using the PimaIndiansDiabetes2 dataset from the mlbench package. All I want is the logistic regression coefficients from the trained model.
library(mlbench)
library(mlr)
# Load the data and remove NAs
data("PimaIndiansDiabetes2", package = "mlbench")
PimaIndiansDiabetes2 <- na.omit(PimaIndiansDiabetes2)
PimaIndiansDiabetes2 <- PimaIndiansDiabetes2[2:9]
set.seed(9)
indian.task = makeClassifTask(data=PimaIndiansDiabetes2, target = "diabetes", positive = "pos")
testlrn <-makeLearner("classif.glmnet", predict.type = "prob", par.vals = list(alpha = 0.339, s=0.00556))
testlrn_train <- mlr::train(testlrn, indian.task)
coef(getLearnerModel(testlrn_train))
8 x 70 sparse Matrix of class "dgCMatrix"
[[ suppressing 70 column names ‘s0’, ‘s1’, ‘s2’ ... ]]
(Intercept) -0.7008101 -0.832287755 -0.96789148 -1.107230550 -1.249880176 -1.395389472 -1.593724136
glucose . 0.001070684 0.00217191 0.003300326 0.004452356 0.005624265 0.006706897
pressure . . . . . . .
triceps . . . . . . .
insulin . . . . . . .
mass . . . . . . .
pedigree . . . . . . .
age . . . . . . 0.002042163
(Intercept) -1.795136758 -1.995953151 -2.264942612 -2.5516299342 -2.840942953 -3.125132e+00 -3.411189e+00
glucose 0.007787101 0.008868089 0.009892814 0.0108901514 0.011863974 1.280009e-02 1.367427e-02
pressure . . . . . . .
triceps . . . 0.0003408727 0.001478306 2.543441e-03 3.465742e-03
insulin . . . . . 2.182663e-05 6.272459e-05
mass . . 0.002302462 0.0049824211 0.007195585 9.319553e-03 1.125890e-02
pedigree . . . . . 4.557533e-03 4.506202e-02
age 0.004173928 0.006265506 0.008297616 0.0102624303 0.012130153 1.393673e-02 1.565741e-02
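One possible reading of the output above: a glmnet fit stores coefficients for the whole regularization path, one column per lambda value, and the dots are entries shrunk to exactly zero. `coef()` on a glmnet object takes an `s` argument that selects a single penalty value. The sketch below uses glmnet directly on simulated data (the data and variable names are made up for illustration). Under the assumption that `getLearnerModel()` returns the underlying glmnet object, the same `s` argument should apply there, e.g. `coef(getLearnerModel(testlrn_train), s = 0.00556)`.

```r
library(glmnet)

# Simulated binary-classification data (illustrative only)
set.seed(9)
x <- matrix(rnorm(200 * 5), nrow = 200, ncol = 5)
y <- rbinom(200, 1, plogis(x[, 1] - x[, 2]))

# Elastic net logistic regression over a whole path of lambda values
fit <- glmnet(x, y, family = "binomial", alpha = 0.339)

# Without s: one column of coefficients per lambda on the path
path_coefs <- coef(fit)
print(dim(path_coefs))

# With s: a single column of coefficients at that penalty value
one_coefs <- coef(fit, s = 0.00556)
print(one_coefs)
```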

text2vec's vocab_vectorizer output is the function itself

I am trying to run through text2vec's example on this page. However, whenever I try to see what the vocab_vectorizer function returned, it's just an output of the function itself. In all my years of R coding, I've never seen this before, but it also feels funky enough to extend beyond just this function. Any pointers?
> library(data.table)
> data("movie_review")
> setDT(movie_review)
> setkey(movie_review, id)
> set.seed(2016L)
> all_ids <- movie_review$id
> train_ids <- sample(all_ids, 4000)
> test_ids <- setdiff(all_ids, train_ids)
> train <- movie_review[J(train_ids)]
> test <- movie_review[J(test_ids)]
>
> prep_fun <- tolower
> tok_fun <- word_tokenizer
>
> it_train <- itoken(train$review,
+ preprocessor = prep_fun,
+ tokenizer = tok_fun,
+ ids = train$id,
+ progressbar = FALSE)
> vocabulary <- create_vocabulary(it_train)
>
> vec <- text2vec::vocab_vectorizer(vocabulary = vocabulary)
> vec
function (iterator, grow_dtm, skip_grams_window_context, window_size,
weights, binary_cooccurence = FALSE)
{
vocab_corpus_ptr = cpp_vocabulary_corpus_create(vocabulary$term,
attr(vocabulary, "ngram")[[1]], attr(vocabulary, "ngram")[[2]],
attr(vocabulary, "stopwords"), attr(vocabulary, "sep_ngram"))
setattr(vocab_corpus_ptr, "ids", character(0))
setattr(vocab_corpus_ptr, "class", "VocabCorpus")
corpus_insert(vocab_corpus_ptr, iterator, grow_dtm, skip_grams_window_context,
window_size, weights, binary_cooccurence)
}
<bytecode: 0x7f9c2e3f7380>
<environment: 0x7f9c18970970>
>
The output of vocab_vectorizer is supposed to be a function. I ran the function from the example in the documentation as below:
data("movie_review")
N = 100
vectorizer = hash_vectorizer(2 ^ 18, c(1L, 2L))
it = itoken(movie_review$review[1:N], preprocess_function = tolower,
tokenizer = word_tokenizer, n_chunks = 10)
hash_dtm = create_dtm(it, vectorizer)
it = itoken(movie_review$review[1:N], preprocess_function = tolower,
tokenizer = word_tokenizer, n_chunks = 10)
v = create_vocabulary(it, c(1L, 1L) )
vectorizer = vocab_vectorizer(v)
The output of vocab_vectorizer:
> vectorizer
function (iterator, grow_dtm, skip_grams_window_context, window_size,
weights, binary_cooccurence = FALSE)
{
vocab_corpus_ptr = cpp_vocabulary_corpus_create(vocabulary$term,
attr(vocabulary, "ngram")[[1]], attr(vocabulary,
"ngram")[[2]], attr(vocabulary, "stopwords"),
attr(vocabulary, "sep_ngram"))
setattr(vocab_corpus_ptr, "ids", character(0))
setattr(vocab_corpus_ptr, "class", "VocabCorpus")
corpus_insert(vocab_corpus_ptr, iterator, grow_dtm, skip_grams_window_context,
window_size, weights, binary_cooccurence)
}
<bytecode: 0x00000147ada65218>
<environment: 0x00000147b2a6dc38>
In the documentation, it has been mentioned that "It supposed to be used only as argument to create_dtm, create_tcm, create_vocabulary".
Finally, when I ran create_dtm(it, vectorizer), I got the output
> create_dtm(it, vectorizer)
100 x 5356 sparse Matrix of class "dgCMatrix"
[[ suppressing 52 column names ‘0.3’, ‘02’, ‘10,000,000’ ... ]]
[[ suppressing 52 column names ‘0.3’, ‘02’, ‘10,000,000’ ... ]]
1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ......
2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ......
3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ......
4 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 . . . . . . . . . . ......
5 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 . . . . . . . . . ......
6 . . . . . . . . . . . . . . . . . . . . . . . . 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . ......
7 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ......
8 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ......
9 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 . . . . . . . . . . . ......
10 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ......
..............................
........suppressing 5304 columns and 81 rows in show(); maybe adjust 'options(max.print= *, width = *)'
..............................
[[ suppressing 52 column names ‘0.3’, ‘02’, ‘10,000,000’ ... ]]
92 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ......
93 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ......
94 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ......
95 . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 . . . . . . . . . . . . . . . . . . . . . . . ......
96 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 . . . . . . . . . . . . ......
97 . . . . . . . . . . . . . . . 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ......
98 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ......
99 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ......
100 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ......
I hope this answers your question.
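The behaviour above is ordinary R: a function can return another function (a closure), and printing a function shows its source and environment rather than any data. Here is a small base-R sketch of the same pattern; `make_scaler` and `double_it` are hypothetical names and have nothing to do with text2vec.

```r
# A function factory: returns a new function that remembers `factor`
make_scaler <- function(factor) {
  function(x) x * factor
}

double_it <- make_scaler(2)

# Printing the object shows the function's source and its environment
print(double_it)

# The returned function only does work once it is called
print(is.function(double_it))
print(double_it(21))
```

In the same way, the vectorizer only produces a matrix once it is passed on to something like create_dtm.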

R: how to extract a row from sparseMatrix as sparseVector

I need to extract the row of sparseMatrix as sparseVector, however 'drop=FALSE' option does not work well for me.
To explain the issue, I will use an example from extract sparse rows from sparse matrix in r (my question is different since I need to convert extracted row to vector):
library(Matrix)
i <- c(1,3:8); j <- c(2,9,6:10); x <- 7 * (1:7)
A <- sparseMatrix(i, j, x = x)
b <- sparseVector(7,2,10)
Now A[1,,drop=FALSE] and b should have the same value.
However, A[1,,drop=FALSE] is still a matrix with 2 dimensions. So if I try Matrix::crossprod(b), I get:
1 x 1 Matrix of class "dsyMatrix"
[,1]
[1,] 49
but if I try Matrix::crossprod(A[1,,drop=FALSE]), then I get:
10 x 10 sparse Matrix of class "dsCMatrix"
[1,] . . . . . . . . . .
[2,] . 49 . . . . . . . .
[3,] . . . . . . . . . .
[4,] . . . . . . . . . .
[5,] . . . . . . . . . .
[6,] . . . . . . . . . .
[7,] . . . . . . . . . .
[8,] . . . . . . . . . .
[9,] . . . . . . . . . .
[10,] . . . . . . . . . .
How can I get just 49 in the second case in an efficient way (Matrix::crossprod should be faster than %*%, as far as I understand from the description of the function)?
Also, b%*%b works correctly, while A[1,,drop=FALSE]%*%A[1,,drop=FALSE] returns the following error:
Cholmod error 'A and B inner dimensions must match' at file ../MatrixOps/cholmod_ssmult.c, line 82
I am not sure there is a method for directly casting a sparse matrix row to a sparse vector.
The reason you are getting an error from
A[1,,drop=FALSE]%*%A[1,,drop=FALSE]
is that both operands are 1 x 10 matrices, so the inner dimensions (10 and 1) do not match. You need to transpose the second matrix:
A[1,,drop=FALSE] %*% t(A[1,,drop=FALSE])
This returns a 1 x 1 sparse matrix, which you can then convert with as.numeric().
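As a quick check of the transpose approach, here is a sketch on the matrix from the question. `row1` is just a placeholder name. `tcrossprod(x)` computes `x %*% t(x)`, so for a 1 x 10 row it gives the 1 x 1 result directly. Whether it is actually faster than the explicit transpose is not something this sketch measures.

```r
library(Matrix)

# The example matrix from the question
i <- c(1, 3:8); j <- c(2, 9, 6:10); x <- 7 * (1:7)
A <- sparseMatrix(i, j, x = x)

# Keep the row as a 1 x 10 sparse matrix
row1 <- A[1, , drop = FALSE]

# Explicit transpose: (1 x 10) %*% (10 x 1) gives a 1 x 1 matrix
via_transpose <- as.numeric(row1 %*% t(row1))

# tcrossprod(x) is x %*% t(x), without writing the transpose by hand
via_tcrossprod <- as.numeric(tcrossprod(row1))

print(via_transpose)
print(via_tcrossprod)
```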

Converting database output to a corpus for topic modelling

I have a total of 54892 documents. After retrieving them from the database, how am I supposed to convert them to a corpus that can be used for Topic Modelling using LDA?
This is the code I have tried:
library(RMySQL)
library(RTextTools)
library(topicmodels)
library(tm)
con <- dbConnect(MySQL(), user="root", password="root", dbname="dbtemp", host="localhost")
rs <- dbSendQuery(con, "select text_body from all_text;")
data <- fetch(rs, n=54892)
huh <- dbHasCompleted(rs)
dbClearResult(rs)
dbDisconnect(con)
I referred to this page and noticed that data, as produced by the line data <- NYTimes[sample(1:3100,size=1000,replace=FALSE),], contains a two-column table along with another table with something called TopicCode; this data is then converted to a term-document frequency matrix. I don't know how to get that TopicCode from the two columns that I retrieved from the database.
I have tried a similar problem in Python, where I converted the data to Matrix Market format. I thought of using this file for further computations in R. I read the file using b <- readMM(file="PRC.mm"), and when I printed b I got a 336331 x 88 matrix that looked like this:
. . 2 . . . . . . 1 1 . 1 . . 1 . 2 . . . . . . . . . . . . . ......
. 1 . . . . . . 1 1 . . . . . . . . . . . . . . . . . . . . . ......
. . . . . . . . . 1 1 1 . . . 2 . . . . . . . 1 . . 1 . . . . ......
. . 1 . . . 2 . . . . 1 1 . . . . . . . 1 . . . . . . . . . . ......
where . means 0. This looks like a term-document matrix, but I would still like to build this kind of matrix directly in R. What should I do?
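One way this could be approached with the tm package already loaded in the question is sketched below, assuming the fetched result is a data frame with a text_body column. The toy rows are made up to stand in for the database output. `VectorSource` treats each element of the character vector as one document, and `DocumentTermMatrix` builds the document-by-term counts.

```r
library(tm)

# Stand-in for the data frame returned by fetch(); hypothetical toy rows
data <- data.frame(
  text_body = c("Sparse matrices store zeros implicitly.",
                "Topic models group related words together.",
                "A corpus is a collection of documents."),
  stringsAsFactors = FALSE
)

# One document per element of the text column
corpus <- VCorpus(VectorSource(data$text_body))

# Document-term matrix with basic cleaning applied while counting terms
dtm <- DocumentTermMatrix(
  corpus,
  control = list(tolower = TRUE, removePunctuation = TRUE, stopwords = TRUE)
)
print(dim(dtm))
```

A DocumentTermMatrix like `dtm` is the kind of object that topicmodels::LDA takes as its first argument, along with the number of topics k.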

Overloading + operator in R S4 classes and Matrix package

I get a weird effect when trying to overload the + operator and using the Matrix package with sparse matrices. I first define a very simple class that does not use the Matrix package but has a + operator. I then sum two sparse matrices. The first M+M addition delivers the expected result but the second throws an error. Here is a very simple code that generates the error:
require(Matrix)
setClass("TestM",representation(M='numeric'))
setMethod("initialize", "TestM", function(.Object,x) {
.Object@M = x
.Object
})
setMethod("+", c("TestM","TestM"), function(e1,e2) {
e1@M + e2@M
})
M = Matrix(diag(1:10),sparse=T)
M+M # > FINE
M+M # > ERROR
M = Matrix(diag(1:10),sparse=F)
M+M # > FINE
M+M # > FINE
The second addition throws the following error:
Error in forceSymmetric(callGeneric(as(e1, "dgCMatrix"), as(e2, "dgCMatrix"))) :
error in evaluating the argument 'x' in selecting a method for function
'forceSymmetric': Error in .Arith.Csparse(e1, e2, .Generic, class. = "dgCMatrix") :
object '.Generic' not found
The error does not happen if the matrices are not sparse. Is there some interference between the + I define and the + for sparseMatrix? Am I not defining the + operator correctly?
Thank you!
Try setting the Ops class to be overloaded:
> setMethod(Ops, c("TestM","TestM"), function(e1,e2) {
+ e1@M + e2@M
+ })
[1] "Ops"
attr(,"package")
[1] "base"
>
> M = Matrix(diag(1:10),sparse=T)
> M+M # > FINE
10 x 10 sparse Matrix of class "dsCMatrix"
[1,] 2 . . . . . . . . .
[2,] . 4 . . . . . . . .
[3,] . . 6 . . . . . . .
[4,] . . . 8 . . . . . .
[5,] . . . . 10 . . . . .
[6,] . . . . . 12 . . . .
[7,] . . . . . . 14 . . .
[8,] . . . . . . . 16 . .
[9,] . . . . . . . . 18 .
[10,] . . . . . . . . . 20
> M+M # (NOT error)... was ERROR
10 x 10 sparse Matrix of class "dsCMatrix"
[1,] 2 . . . . . . . . .
[2,] . 4 . . . . . . . .
[3,] . . 6 . . . . . . .
[4,] . . . 8 . . . . . .
[5,] . . . . 10 . . . . .
[6,] . . . . . 12 . . . .
[7,] . . . . . . 14 . . .
[8,] . . . . . . . 16 . .
[9,] . . . . . . . . 18 .
[10,] . . . . . . . . . 20
Almost 3.5 years later, I stumbled upon the same error, which is also the topic of a more recent question: setMethod and package Matrix. I sent a bug report to R-devel but found out afterwards that the error is only reproducible if the overloading is done outside of a package. In other words, if you define the + method for objects of class TestM within a package and load the overloaded + method by loading that package, the initial problem goes away without having to overload the whole group generic (which is not always possible).
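For reference, here is a minimal sketch of what a group-generic method for a class like TestM might look like, using only the methods package. It uses the Arith group (a subgroup of Ops) with `callGeneric`, and it leaves out the custom initialize method, which is an assumption of this sketch rather than part of the original code. It illustrates the group-generic approach only and does not try to reproduce or fix the interaction with Matrix.

```r
library(methods)

# Simple S4 class with one numeric slot
setClass("TestM", representation(M = "numeric"))

# One method covers +, -, *, / and the other Arith operators;
# callGeneric() re-dispatches the operator that was actually called
setMethod("Arith", signature("TestM", "TestM"), function(e1, e2) {
  new("TestM", M = callGeneric(e1@M, e2@M))
})

a <- new("TestM", M = c(1, 2, 3))
b <- new("TestM", M = c(10, 20, 30))

total <- a + b
difference <- b - a
print(total@M)
print(difference@M)
```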

Resources