R - Exact String Match - Revisited - r

I have the below test input in a file called Input
Exploratory objectives :
This is Exp objective 1
This is Exp objective 2
3.3 Exploratory objective(s)
This is Exp objective 1
This is Exp objective 2
From this text file, I'm trying to grep for "Exploratory objective(s)" using the code below. The line number I expect is 7.
However, when I run the command below, I get line number 1. Can anyone please point out what is wrong with my grep and why it doesn't return 7? Also, how can I fix this?
key_str <-"Exploratory objective(s)"
key_str
key_pat <- paste0("(", key_str, ")", "(?![[:alpha:]])")
line_number<-grep(key_pat,Input,perl=TRUE)
line_number
Expected line_number: 7
Output line_number using above: 1 (Incorrect)

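You have to escape the parentheses. Unescaped, (s) is a regex capture group that matches the letter s, so the pattern matches "Exploratory objectives" on line 1 and never matches the literal "(s)" on line 7: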
key_str <- "Exploratory objective\\(s\\)"
If the string is dynamically generated or read from a file, use this:
key_str <- gsub("([\\(\\)])", "\\\\\\1", string)
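To see the difference side by side, here is a minimal sketch. The Input vector is a shortened, made-up stand-in for the file in the question (so the matching line is 3 here, not 7), and the names unescaped_pat and escaped_pat are illustrative only.

```r
# Shortened stand-in for the lines of the input file (assumption, not the real file)
Input <- c(
  "Exploratory objectives :",
  "This is Exp objective 1",
  "3.3 Exploratory objective(s)",
  "This is Exp objective 2"
)

key_str <- "Exploratory objective(s)"

# Unescaped: "(s)" is a capture group, so this matches "objectives" on line 1
unescaped_pat <- paste0("(", key_str, ")", "(?![[:alpha:]])")
grep(unescaped_pat, Input, perl = TRUE)   # 1

# Escape the parentheses so they are matched literally
escaped_key <- gsub("([\\(\\)])", "\\\\\\1", key_str)
escaped_pat <- paste0("(", escaped_key, ")", "(?![[:alpha:]])")
grep(escaped_pat, Input, perl = TRUE)     # 3
```

Only parentheses are escaped here; a key string containing other regex metacharacters (for example . or +) would need those escaped as well.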

Related

Print text with subscripts (programmatically) to R console

I'm using R to balance some complex chemical equations and would like to print these equations including subscripts to the console as the code runs. I've seen some answers posted, most of which are related to plots or rely on pasting the subscript from another program into R scripts:
Subscripts in R when adding other text
How to literally print superscripts in R not used in labels or legends?
Using Subscripts and Superscripts in R console
Unicode subscript in R had some pointers that were helpful. I can get the appropriate code from this link, but it doesn't allow me to programmatically create the code for the character I want.
CODE
Here's a simple example equation for combustion of methane that works:
> sub2 <- '\u2082' # hard-coding unicode for '2' as a subscript
> sub4 <- '\u2084' # hard-coding unicode for '4' as a subscript
> cat(sprintf('CH%s + 2 O%s --> CO%s + 2 H%sO', sub4, sub2, sub2, sub2))
CH₄ + 2 O₂ --> CO₂ + 2 H₂O
Lengthy workaround (proof-of-concept):
desired_subscript <- 3.375
subs <- c('\u2080', '\u2081', '\u2082', '\u2083', '\u2084',
'\u2085', '\u2086', '\u2087', '\u2088', '\u2089')
charvec <- as.character(x = desired_subscript)
lapply(0:9, function(z){
charvec <<- gsub(pattern = z, replacement = subs[z+1], x = charvec)
return(NULL)
})
> cat(charvec)
₃.₃₇₅
Here's what doesn't work:
replacing the last digit of the unicode string to what I want:
> cat(sub(pattern = '2', replacement = '4', x = sub2))
₂
Trying to create a unicode string:
> paste('\208','4',sep = '')
[1] "\02084"
I have multiple equations to balance and the subscripts are not always whole numbers. Is there a way to programmatically get the unicode for the subscript that I want to include in my console output?
Try this: create a function that returns the Unicode subscript character for a single digit (0 to 9). Caution: there is no error checking.
ss <- function(x) {intToUtf8(0x2080 + x)}
cat(sprintf('CH%s + 2 O%s --> CO%s + 2 H%sO', ss(4), ss(2), ss(2), ss(2)))
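Since the question mentions subscripts that are not whole numbers, the same intToUtf8() idea can be extended to every digit of a number. The following is only a sketch: to_subscript is a hypothetical helper name, and it assumes that non-digit characters such as the decimal point should be left unchanged, as in the question's own workaround.

```r
# Hypothetical helper: convert every digit of x to its Unicode subscript form.
# Non-digit characters (e.g. the decimal point) are left as they are.
to_subscript <- function(x) {
  chars <- strsplit(as.character(x), "")[[1]]
  is_digit <- chars %in% as.character(0:9)
  # Subscript digits occupy U+2080 (0) through U+2089 (9)
  chars[is_digit] <- intToUtf8(0x2080 + as.integer(chars[is_digit]),
                               multiple = TRUE)
  paste(chars, collapse = "")
}

cat(sprintf("CH%s + 2 O%s --> CO%s + 2 H%sO\n",
            to_subscript(4), to_subscript(2), to_subscript(2), to_subscript(2)))
cat(to_subscript(3.375), "\n")
```

Whether the subscripts display correctly still depends on the console's font and encoding.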

R, getting an invalid argument to unary operator when using order function

I'm essentially doing the exact same thing 3 times, and when adding a new variable I get this error
Error in -emps$EV : invalid argument to unary operator
The code chunk causing this is
evps<-aggregate(EV~player,s1k,mean)
sort2<-evps[order(-evps$EV),]
head(sort2,10)
s1k$EM<-s1k$points-s1k$EV
emps<-aggregate(EM~player,s1k,mean)
sort3<-emps[order(-emps$EV),]
head(sort3,10)
Works like a charm for the first list, but the identical code thereafter causes the error.
This specific line is causing the error
sort3<-emps[order(-emps$EV),]
How can I fix/workaround this?
Full Code
library(RCurl) # provides getURL()
url <- getURL("https://raw.githubusercontent.com/M-ttM/Basketball/master/class.csv")
shots <- read.csv(text = url)
shots$make<-shots$points>0
shots2<-shots[which(!(shots$player=="Luc Richard Mbah a Moute")),]
fit1<-glm(make~factor(type)+factor(period), data=shots2,family="binomial")
summary(fit1)
shots2$makeodds<-fitted(fit1)
shots2$EV<-shots2$makeodds*ifelse(shots2$type=="3pt",3,2)
shots3<-shots2[which(shots2$y>7),]
locmakes<-data.frame(table(shots3[, c("x", "y")]))
s1k <- shots2[with(shots2, player %in% names(which(table(player)>=1000))), ]
pps<-aggregate(points~player,s1k,mean)
sort<-pps[order(-pps$points),]
head(sort,10)
evps<-aggregate(EV~player,s1k,mean)
sort2<-evps[order(-evps$EV),]
head(sort2,10)
s1k$EM<-s1k$points-s1k$EV
emps<-aggregate(EM~player,s1k,mean)
sort3<-emps[order(-emps$EV),]
head(sort3,10)
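The error message also occurs when you try to negate a character (chr) column inside order(), because unary minus is only defined for numeric data. In the example below, which is already sorted by column_a, rev() can stand in for the minus sign. For unsorted data, rev() only reverses the vector and does not give a descending sort, so use order(column_a, decreasing = TRUE) or -xtfrm(column_a) instead: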
column_a = c("a","a","b","b","c","c")
column_b = seq(6)
df = data.frame(column_a, column_b)
df$column_a = as.character(df$column_a)
df[with(df, order(-column_a, column_b)),]
> Error in -column_a : invalid argument to unary operator
df[with(df, order(rev(column_a), column_b)),]
column_a column_b
5 c 5
6 c 6
3 b 3
4 b 4
1 a 1
2 a 2
Let me know if it works in your case.
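For character columns that are not already sorted, a more general option is xtfrm(), which turns the values into sort keys that can be negated. This is a sketch on made-up data, not the asker's data set; the data frame and the name desc are illustrative.

```r
# Made-up, deliberately unsorted data
df <- data.frame(
  column_a = c("b", "a", "c", "a"),
  column_b = 1:4,
  stringsAsFactors = FALSE
)

# xtfrm() returns numeric sort keys for the character column, so unary minus works:
# column_a descending, ties broken by column_b ascending
desc <- df[order(-xtfrm(df$column_a), df$column_b), ]
print(desc)
```

With a single sort key, order(df$column_a, decreasing = TRUE) gives the same descending order without xtfrm().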
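In this block, emps$EV doesn't exist. emps was built by aggregating EM, so its only columns are player and EM; emps$EV is therefore NULL, and negating NULL raises the error.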
s1k$EM<-s1k$points-s1k$EV
emps<-aggregate(EM~player,s1k,mean)
sort3<-emps[order(-emps$EV),]
head(sort3,10)
You probably meant
s1k$EM<-s1k$points-s1k$EV
emps<-aggregate(EM~player,s1k,mean)
sort3<-emps[order(-emps$EM),]
head(sort3,10)
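The cause can be reproduced on a small made-up data frame. This is a sketch with illustrative values, not the basketball data: a column that does not exist comes back as NULL, and unary minus on NULL fails with the same message.

```r
# Made-up stand-in for the aggregated data frame (columns: player, EM)
emps <- data.frame(player = c("A", "B", "C"), EM = c(0.2, -0.1, 0.5))

# A missing column is returned as NULL rather than raising an error by itself
is.null(emps$EV)

# Negating NULL is what triggers "invalid argument to unary operator"
msg <- tryCatch(-emps$EV, error = function(e) conditionMessage(e))
print(msg)

# Ordering by the column that actually exists works as intended
sort3 <- emps[order(-emps$EM), ]
print(sort3)
```

Checking names(emps) before sorting is a quick way to catch this kind of typo.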

Selecting features from a feature set using mRMRe package

I am a new user of R and am trying to use the mRMRe R package (mRMR is a well-known feature selection approach) to obtain a feature subset from a feature set. Please excuse me if my question is simple; I really want to know how to fix an error. The details are below.
Suppose, I have a csv file (gene.csv) having feature set of 6 attributes ([G1.1.1.1], [G1.1.1.2], [G1.1.1.3], [G1.1.1.4], [G1.1.1.5], [G1.1.1.6]) and a target class variable [Output] ('1' indicates positive class and '-1' stands for negative class). Here's a sample gene.csv file:
[G1.1.1.1] [G1.1.1.2] [G1.1.1.3] [G1.1.1.4] [G1.1.1.5] [G1.1.1.6] [Output]
11.688312 0.974026 4.87013 7.142857 3.571429 10.064935 -1
12.538226 1.223242 3.669725 6.116208 3.363914 9.174312 1
10.791367 0.719424 6.115108 6.47482 3.597122 10.791367 -1
13.533835 0.37594 6.766917 7.142857 2.631579 10.902256 1
9.737828 2.247191 5.992509 5.992509 2.996255 8.614232 -1
11.864407 0.564972 7.344633 4.519774 3.389831 7.909605 -1
11.931818 0 7.386364 5.113636 3.409091 6.818182 1
16.666667 0.333333 7.333333 4.333333 2 8.333333 -1
I am trying to get best feature subset of 2 attributes (out of above 6 attributes) and wrote following R code.
library(mRMRe)
file_n<-paste0("E:\\gene", ".csv")
df <- read.csv(file_n, header = TRUE)
f_data <- mRMR.data(data = data.frame(df))
featureData(f_data)
mRMR.ensemble(data = f_data, target_indices = 7,
feature_count = 2, solution_count = 1)
When I run this code, I am getting following error for the statement f_data <- mRMR.data(data = data.frame(df)):
Error in .local(.Object, ...) :
data columns must be either of numeric, ordered factor or Surv type
However, the data in each column of the csv file are real numbers. So how can I change the R code to fix this problem? Also, I am not sure what the value of target_indices should be in the statement mRMR.ensemble(data = f_data, target_indices = 7,feature_count = 2, solution_count = 1), as my target class variable is named "[Output]" in the gene.csv file.
I would much appreciate it if anyone could help me obtain the best feature subset from the gene.csv file using the mRMRe R package.
I solved the problem by modifying my code as follows.
library(mRMRe)
file_n<-paste0("E:\\gene", ".csv")
df <- read.csv(file_n, header = TRUE)
df[[7]] <- as.numeric(df[[7]])
f_data <- mRMR.data(data = data.frame(df))
results <- mRMR.classic("mRMRe.Filter", data = f_data, target_indices = 7,
feature_count = 2)
solutions(results)
It worked fine. The output of the code gives the indices of the selected 2 features.
I think it has to do with your Output column, which is probably of class integer (read.csv reads a column of whole numbers such as 1 and -1 as integer). You can check that using class(df[[7]]).
To convert it to numeric, as required by the error message, just type:
df[[7]] <- as.numeric(df[[7]])
That worked for me.
As for the other question, after reading the documentation, setting target_indices = 7 seems the right choice.
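The class check and conversion can be tried without the package. Below is a base-R sketch on a small made-up data frame; the column names G1, G2 and Output are illustrative stand-ins for the csv columns. It converts every integer column, not just column 7.

```r
# Made-up stand-in for the data read from gene.csv
df <- data.frame(
  G1 = c(11.688312, 12.538226, 10.791367),
  G2 = c(0.974026, 1.223242, 0.719424),
  Output = c(-1L, 1L, -1L)   # whole numbers are read as integer by read.csv
)

# Inspect the class of every column; Output shows up as "integer"
print(sapply(df, class))

# Convert all integer columns to numeric in one step
int_cols <- vapply(df, is.integer, logical(1))
df[int_cols] <- lapply(df[int_cols], as.numeric)

print(sapply(df, class))
```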

R scripting reading a txt file, finding and replacing some text, concatenating results and then writing to a file

I have a query file, which the code below reads into the variable linn. Line 2 (linn[2]) and line 28 (linn[28]) each contain the same pair of dates (20160831;20160907). I would like to loop over a vector a and, for each element, create a copy of the original file in which only the dates on lines 2 and 28 are replaced by that element. Then I would like to concatenate all the copies and write them to a txt file.
How could I achieve this?
#http://stackoverflow.com/questions/12626637/reading-a-text-file-in-r-line-by-line
fileName <- "C:/Users/500361/Downloads/query.txt"
conn <- file(fileName,open="r")
linn <-readLines(conn)
for (i in 1:length(linn)){
print(linn[i])
}
close(conn)
str(linn)
linn[2]
#[1] "<"department_code,department_desc,subclass_code,subclass_desc,dept_rank\" label=\"20160831;20160907\">"
linn[28]
#<sel value=\"between(date_id;20160831;20160907) [[ \\ text_req:From \\ text_req:To]] \"/>"
a=c("20160406;20160413","20160330;20160406")
a
update 1: updated values in line 2 and 28 will be
"<"department_code,department_desc,subclass_code,subclass_desc,dept_rank\" label=\"20160406;20160413\">"
#<sel value=\"between(date_id;20160406;20160413) [[ \\ text_req:From \\ text_req:To]] \"/>"
"<"department_code,department_desc,subclass_code,subclass_desc,dept_rank\" label=\"20160330;20160406\">"
#<sel value=\"between(date_id;20160330;20160406) [[ \\ text_req:From \\ text_req:To]] \"/>"
I found the answer and wanted to post it for future readers
#http://stackoverflow.com/questions/23474552/how-to-read-txt-file-and-replace-word-in-r
#variable `a` has different set of values that I want to put in my code (find and replace)
a=c("20160907;20160914","20160831;20160907","20160824;20160831","20160817;20160824","20160810;20160817","20160803;20160810","20160727;20160803","20160720;20160727","20160713;20160720","20160706;20160713","20160629;20160706","20160622;20160629","20160615;20160622","20160608;20160615","20160601;20160608","20160525;20160601","20160518;20160525","20160511;20160518","20160504;20160511","20160427;20160504","20160420;20160427","20160413;20160420","20160406;20160413","20160330;20160406","20160323;20160330","20160316;20160323","20160309;20160316","20160302;20160309","20160224;20160302","20160217;20160224","20160210;20160217","20160203;20160210","20160127;20160203","20160120;20160127","20160113;20160120","20160106;20160113")
res <- readLines("C:/Users/500361/Downloads/query.txt")
result=c()
for (i in seq_along(a))
{
#res1 is just a copy of the query
res1=res
#lines 2 and 28 contain the text to replace, so gsub() is applied to each while looping over i
res1[2]=gsub(pattern = "20160831;20160907", replacement = a[i],x=res1[2])
res1[28]=gsub(pattern = "20160831;20160907", replacement = a[i],x=res1[28])
#concatenating results
result=c(result,res1)
}
#writing the file
writeLines(result, con="C:/Users/500361/Downloads/answer.txt")
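The same approach can be written without growing result inside a loop. The sketch below is self-contained and uses a made-up four-line template and a temporary output file, so the line positions in target_lines (2 and 4) are assumptions for the example, not the 2 and 28 of the real query.

```r
# Made-up stand-in for the query file; the dates sit on lines 2 and 4 here
template <- c(
  "<query>",
  "label=\"20160831;20160907\"",
  "<body/>",
  "between(date_id;20160831;20160907)"
)
ranges <- c("20160406;20160413", "20160330;20160406")
target_lines <- c(2, 4)   # assumption for this example

# Build one modified copy per date range, then flatten into a single vector
copies <- lapply(ranges, function(r) {
  out <- template
  out[target_lines] <- sub("20160831;20160907", r, out[target_lines], fixed = TRUE)
  out
})
result <- unlist(copies)

# Write all copies, one after another, to a temporary file
out_file <- tempfile(fileext = ".txt")
writeLines(result, out_file)
```

fixed = TRUE treats the search string literally, which avoids surprises if the text to replace ever contains regex metacharacters.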

R - How does the pos() function work for parts-of-speech tagging

I'm new to R and confused with the way the pos() function works. Here's why:
Example:
library(qdap)
s1<-c("Hello World")
pos(s1)
This produces the correct output saying the word count
wrd.cnt - 2
NN -1(50%)
UH-1(50%)
whereas the following two operations throw errors:
s2<-"Hello"
pos(s2)
Error in apply(pro, 2, paster, digits = digits, symbol = s.ymb, override = override) :
dim(X) must have a positive length
s3<-c("Hello Hello")
pos(s3)
Error in apply(pro, 2, paster, digits = digits, symbol = s.ymb, override = override) :
dim(X) must have a positive length
I'm not able to understand why this is caused.
You have found a bug in this version of qdap, caused by not using drop = FALSE while indexing.
The dev version will behave as expected. You can download it easily with this code:
if (!require("pacman")) install.packages("pacman"); library(pacman)
p_install_gh("trinker/qdap")
The following has been added to the NEWS file as well:
pos threw an error if only one word was passed to text.var. Fix:
drop = FALSE has been added to data frame indexing. Caught by
StackOverflow user G_1991 R-How dos the pos() function work for parts-of-speech tagging.
Here's the updated output:
library(qdap)
s1<-c("Hello World")
pos(s1)
## wrd.cnt NN UH
## 1 2 1(50%) 1(50%)
s2<-"Hello"
pos(s2)
## wrd.cnt UH
## 1 1 1(100%)
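The role of drop = FALSE can be shown in base R, independent of qdap. This is only an illustration of the mechanism, not qdap's actual internals; the matrix counts and its values are made up.

```r
# Made-up tag-count matrix with two tag columns
counts <- matrix(1:4, nrow = 2, dimnames = list(NULL, c("NN", "UH")))

# Selecting a single column drops the dim attribute and yields a plain vector
dropped <- counts[, "UH"]
is.null(dim(dropped))

# apply() needs dimensions, so it fails on the dropped result
err <- tryCatch(apply(dropped, 2, sum), error = function(e) conditionMessage(e))
print(err)

# With drop = FALSE the result stays a one-column matrix and apply() works
kept <- counts[, "UH", drop = FALSE]
print(apply(kept, 2, sum))
```

This is the same situation as text with only one distinct tag: a single remaining column collapses to a vector unless drop = FALSE is used.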
