Read a specific part of a dataset in R

I have this code here:
library(BCEA)
data(Vaccine)
ls()
Now, I get this:
[1] "c" "cost.GP" "cost.hosp" "cost.otc" "cost.time.off" "cost.time.vac"
[7] "cost.travel" "cost.trt1" "cost.trt2" "cost.vac" "e" "N"
[13] "N.outcomes" "N.resources" "QALYs.adv" "QALYs.death" "QALYs.hosp" "QALYs.inf"
[19] "QALYs.pne" "treats"
How can I access the 'c' value?
Something like Vaccine(c)?
Please help

In the Vaccine dataset, each of "c", "cost.GP", ... is a separate object.
So once you load Vaccine into the current workspace with data(Vaccine), you can directly access each of the 20 listed objects.
For example,
> head(c)
Status Quo Vaccination
[1,] 10.409146 16.252537
[2,] 5.834875 9.373437
[3,] 5.784903 15.935623
[4,] 12.208484 18.654250
[5,] 9.786787 16.467321
[6,] 6.560276 9.689887

Just like this:
c
It's a named object in the workspace, so just type c at the prompt and press return, or use it directly in any function in this environment.
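To make the access pattern concrete, here is a minimal sketch. It uses a small toy matrix standing in for the real c (the values are illustrative, taken from the head(c) output above), since the real object comes from data(Vaccine) in the BCEA package:

```r
# Toy stand-in for the "c" matrix that data(Vaccine) puts in the workspace
c <- matrix(c(10.41, 5.83, 5.78, 16.25, 9.37, 15.94), ncol = 2,
            dimnames = list(NULL, c("Status Quo", "Vaccination")))

head(c)               # first rows, just as with the real object
c[1, "Vaccination"]   # a single value by row index and column name
colMeans(c)           # summaries work as on any matrix
# rm(c)  # removes the object; calls like c(1, 2) still find base::c()
```

Note that having a data object named c does not break calls like c(1, 2), because R skips non-function objects when looking up a function, but it can still be confusing, so rm(c) when you are done is a reasonable habit.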


Convert number to alphabetically sortable character representation

I need to pass some numbers into functions/environments/contexts that only accept strings. They will then be processed there, and after receiving the result I can convert them back to numbers. The key issue is that when the string representations are sorted, they must be sorted in the correct order (in what would be the numerical sort order of the numeric representation). I also need to be able to add numbers after creating the initial batch. Following are two conversion approaches and why they do not work:
The simplest conversion is the standard one. 1->"1", 10->"10" and so on. This does not satisfy the criteria of sortability, because "10" gets sorted before "2".
The next approach is to prefix with zeroes. 1->"001", 10->"010" and so on. This satisfies sortability ("002" gets sorted before "010"), but if a larger number needs to be added later, this approach fails. If, say, the numbers 2000 and 10000 need to be added later, it is not possible to do so in a way that maintains sorting.
Are there any good approaches to doing this? The question does not pertain specifically to any particular language (although the target language in my use case is R, which has a number of places, such as vector names, that accept only character variables). Simplicity and/or standardization (of the representation, not of the implementation) would both be big factors in choosing the best solution here.
I think I've had the same problem, and I used a workaround that could help you, though I'm not sure it will apply in all situations.
First, a vector of numbers as strings, ordered as strings:
str.numbers <- sort(as.character(1:20))
Now, I use the numeric representation of the strings to order the same vector numerically:
str.numbers[order(as.numeric(str.numbers))]
This does the trick for simple vectors, but I'm not sure it will solve more complex problems.
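A different route (my own sketch, not part of the answer above) is to prefix each string with a letter encoding the digit count. A later letter sorts after an earlier one, so longer numbers always sort after shorter ones, and you can keep adding larger numbers later without re-padding the existing batch (this assumes non-negative integers of up to 26 digits):

```r
# Encode: one letter for the digit count, then the digits themselves,
# e.g. 2 -> "a2", 10 -> "b10", 2000 -> "d2000"
to_sortable <- function(n) {
  digits <- nchar(as.character(n))
  paste0(letters[digits], n)
}

# Decode: drop the leading letter and convert back to numeric
from_sortable <- function(s) as.numeric(substring(s, 2))

x <- c(2, 10, 2000, 10000)
sort(to_sortable(x))                  # "a2" "b10" "d2000" "e10000"
from_sortable(sort(to_sortable(x)))   # 2 10 2000 10000
```

Within one digit count the strings have equal length, so plain lexicographic comparison of the digits gives the numeric order; across digit counts the letter prefix decides, which is exactly what fixed-width zero padding cannot guarantee once larger numbers arrive.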
Funny how the world works: after searching for a solution and not finding one, I came across a package, last released only four days ago, that focuses on exactly this problem. It is the strex package. A reproducible example of my troubles, and of how strex provides a pretty good fix, follows:
# Load the strex library
library(strex)
#> Loading required package: stringr
# This would be a function created by others, doing more than
# sorting, but nevertheless requiring sortable input
fun_requiring_char <- function(x) {
  stopifnot(is.character(x))
  sort(x)
}
# Example data
set.seed(42)
a <- sample(20, 5) ; a
#> [1] 17 5 1 10 4
b <- sample(2000, 10) ; b
#> [1] 1170 634 49 1152 1327 24 1863 356 1625 165
# Won't work, error
#fun_requiring_char(a)
# Works, but returns incorrectly sorted input
fun_requiring_char(as.character(a))
#> [1] "1" "10" "17" "4" "5"
fun_requiring_char(as.character(b))
#> [1] "1152" "1170" "1327" "1625" "165" "1863" "24" "356" "49" "634"
# Solution provided by strex
fun_requiring_char(str_alphord_nums(a))
#> [1] "01" "04" "05" "10" "17"
fun_requiring_char(str_alphord_nums(b))
#> [1] "0024" "0049" "0165" "0356" "0634" "1152" "1170" "1327" "1625" "1863"
# What quick and dirty zero padding did not allow was to first
# convert a, and then b into character, where both were of a
# unified format that would represent the numbers and yet be
# sortable in correct order according to the numbers they
# represent. However, using str_alphord_nums repeatedly gets
# very close to a solution.
ac <- str_alphord_nums(a); ac
#> [1] "17" "05" "01" "10" "04"
bc <- str_alphord_nums(b); bc
#> [1] "1170" "0634" "0049" "1152" "1327" "0024" "1863" "0356" "1625" "0165"
# Wrong order here
fun_requiring_char(c(ac,bc))
#> [1] "0024" "0049" "01" "0165" "0356" "04" "05" "0634" "10" "1152"
#> [11] "1170" "1327" "1625" "17" "1863"
# But doing an alphord again on the concatenated vectors provides a fix
fun_requiring_char(str_alphord_nums(c(ac,bc)))
#> [1] "0001" "0004" "0005" "0010" "0017" "0024" "0049" "0165" "0356" "0634"
#> [11] "1152" "1170" "1327" "1625" "1863"
Created on 2020-10-21 by the reprex package (v0.3.0)

Quanteda with topicmodels: removed stopwords appear in results (Chinese)

My code:
library(quanteda)
library(topicmodels)
# Some raw text as a vector
postText <- c("普京 称 俄罗斯 未 乌克兰 施压 来自 头 条 新闻", "长期 电脑 前进 食 致癌 环球网 报道 乌克兰 学者 认为 电脑 前进 食 会 引发 癌症 等 病症 电磁 辐射 作用 电脑 旁 水 食物 会 逐渐 变质 有害 物质 累积 尽管 人体 短期 内 会 感到 适 会 渐渐 引发 出 癌症 阿尔茨海默 式 症 帕金森 症 等 兔子", "全 木 手表 乌克兰 木匠 瓦列里·达内维奇 木头 制作 手表 共计 154 手工 零部件 唯一 一个 非 木制 零件 金属 弹簧 驱动 指针 运行 其他 零部件 材料 取自 桦树 苹果树 杏树 坚果树 竹子 黄杨树 愈疮木 非洲 红木 总共 耗时 7 打造 手表 不仅 能够 正常 运行 天 时间 误差 保持 5 分钟 之内 ")
# Create a corpus of the posts
postCorpus <- corpus(postText)
# Make a dfm, removing numbers and punctuation
myDocTermMat <- dfm(postCorpus, stem = FALSE, removeNumbers = TRUE, removeTwitter = TRUE, removePunct = TRUE)
# Estimate a LDA Topic Model
if (require(topicmodels)) {
  myLDAfit <- LDA(convert(myDocTermMat, to = "topicmodels"), k = 2)
}
terms(myLDAfit, 11)
The code works and I see a result. Here is an example of the output:
Topic 1 Topic 2
[1,] "木" "会"
[2,] "手表" "电脑"
[3,] "零" "乌克兰"
[4,] "部件" "前进"
[5,] "运行" "食"
[6,] "乌克兰" "引发"
[7,] "内" "癌症"
[8,] "全" "等"
[9,] "木匠" "症"
[10,] "瓦" "普"
[11,] "列" "京"
Here is the problem. All of my posts have been segmented (necessary pre-processing step for Chinese) and had stop words removed. Nonetheless, the topic model returns topics containing single-character stop terms that have already been removed. If I open the raw .txt files and do ctrl-f for a given single-character stop word, no results are returned. But those terms show up in the returned topics from the R code, perhaps because the individual characters occur as part of other multi-character words. E.g. 就 is a preposition treated as a stop word, but 成就 means "success."
Related to this, certain terms are split. For example, one of the events I am examining contains references to Russian president Putin ("普京"). In the topic model results, however, I see separate term entries for "普" and "京" and no entries for "普京". (See lines 10 and 11 in output topic 2, compared to the first word in the raw text.)
Is there an additional tokenization step occurring here?
Edit: Modified to make reproducible. For some reason it wouldn't let me post until I also deleted my introductory paragraph.
Here's a workaround, based on using a faster but "dumber" word tokeniser based on space ("\\s") splitting:
# fails
features(dfm(postText, verbose = FALSE))
## [1] "普" "京" "称" "俄罗斯" "未" "乌克兰" "施压" "来自" "头" "条" "新闻"
# works
features(dfm(postText, what = "fasterword", verbose = FALSE))
## [1] "普京" "称" "俄罗斯" "未" "乌克兰" "施压" "来自" "头" "条" "新闻"
So add what = "fasterword" to the dfm() call and you will get this as a result, where Putin ("普京") is not split.
terms(myLDAfit, 11)
## Topic 1 Topic 2
## [1,] "会" "手表"
## [2,] "电脑" "零部件"
## [3,] "乌克兰" "运行"
## [4,] "前进" "乌克兰"
## [5,] "食" "全"
## [6,] "引发" "木"
## [7,] "癌症" "木匠"
## [8,] "等" "瓦列里达内维奇"
## [9,] "症" "木头"
## [10,] "普京" "制作"
## [11,] "称" "共计"
This is an interesting case where quanteda's default tokeniser, built on stringi's definition of text boundaries (see stri_split_boundaries), does not work in the default setting. It might after experimentation with locale, but these are not currently options that can be passed to quanteda::tokenize(), which dfm() calls.
Please file this as an issue at https://github.com/kbenoit/quanteda/issues and I'll try to get working on a better solution using the "smarter" word tokeniser.

Assigning variable names within a function in R

I am currently working on a dataset in R which is assigned to the global environment by a function of i. Due to the nature of my work I am unable to disclose the dataset, so let's use an example.
DATA
[,1] [,2] [,3] [,4] [,5]
[1,] 32320 27442 29275 45921 162306
[2,] 38506 29326 33290 45641 175386
[3,] 42805 30974 33797 47110 198358
[4,] 42107 34690 47224 62893 272305
[5,] 54448 39739 58548 69470 316550
[6,] 53358 48463 63793 79180 372685
Here DATA(i) is a function and the above is the output for a certain i.
I want to assign variable names based on i, such as:-
names(i)<-c(a(i),b(i),c(i),d(i),e(i))
for argument's sake, let's say that the value of names for this specific i is
c("a","b","c","d","e")
I hope that it will produce the following:-
a b c d e
[1,] 32320 27442 29275 45921 162306
[2,] 38506 29326 33290 45641 175386
[3,] 42805 30974 33797 47110 198358
[4,] 42107 34690 47224 62893 272305
[5,] 54448 39739 58548 69470 316550
[6,] 53358 48463 63793 79180 372685
This is the code I currently use:-
VarName <- function(i) {
  colnames(DATA(i)) <<- names(i)
}
However, this produces an error message when I run it: "Error in colnames(DATA(i)) <- names(i) : target of assignment expands to non-language object", which, from my input, we can see isn't true. Is there another way to do this?
Sorry for the basic questions. I'm fairly new to programming.
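The error arises because the target of the assignment is a function call: colnames(DATA(i)) <<- tries to write back into the value returned by DATA(i), which is not a variable R can modify. One workaround is to capture the value first, rename it, and store it under a name of your choosing. A minimal sketch, using toy stand-ins for the asker's DATA(i) and per-i names (both DATA and names_for here are hypothetical):

```r
# Toy stand-ins for the asker's functions
DATA <- function(i) matrix(seq_len(30) * i, nrow = 6)
names_for <- function(i) c("a", "b", "c", "d", "e")

VarName <- function(i) {
  m <- DATA(i)                  # capture the value in a real variable
  colnames(m) <- names_for(i)   # now the assignment has a valid target
  # store the renamed matrix in the global environment under a new name
  assign(paste0("DATA_", i), m, envir = .GlobalEnv)
  invisible(m)
}

VarName(2)
colnames(DATA_2)  # "a" "b" "c" "d" "e"
```

Whether assign() into the global environment is the right design is debatable (returning the matrix and assigning at the call site is usually cleaner), but it matches the pattern the question is reaching for.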

How to select a value from a table in R

I have the following data, called fit.2.sim:
An object of class "sim"
Slot "coef":
fit.2.sim
[,1] [,2]
[1,] -1.806363 5.148728
[2,] -3.599123 5.183769
[3,] 4.192562 4.855095
[4,] 2.658218 4.967007
[5,] -2.304084 5.220325
[6,] -1.010406 5.071663
[7,] 2.601671 5.129750
[8,] 5.977764 4.757826
[9,] 3.873432 4.932319
[10,] 1.281331 5.138091
Slot "sigma":
[1] 8.285497 10.659971 9.568340 8.649106 8.611894 9.041444 8.316859 7.990499 8.985450
[10] 7.947142
The command I have been using, to no avail unfortunately, is:
fit.2.sim$coef[i,j]
i,j being the respective rows and columns. The error I get is:
"Error in fit.2.sim$coef : $ operator not defined for this S4 class"
Could you please tell me if there is another way to make this work?
S4 classes use @, not $, to access slots, so you probably wanted
fit.2.sim@coef[i,j]
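A minimal reproducible sketch follows. It defines a toy S4 class of the same shape (the real "sim" class comes from the arm package, so the class name and values here are illustrative):

```r
library(methods)  # for setClass() / new()

# Toy S4 class mimicking the shape of the "sim" object in the question
setClass("sim2", representation(coef = "matrix", sigma = "numeric"))
fit.2.sim <- new("sim2",
                 coef  = matrix(c(-1.81, -3.60, 5.15, 5.18), ncol = 2),
                 sigma = c(8.29, 10.66))

fit.2.sim@coef[1, 2]           # @ extracts an S4 slot, then index as usual
slot(fit.2.sim, "coef")[1, 2]  # equivalent, with the slot name as a string
```

The slot() form is handy when the slot name is held in a variable.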

tkplot in igraph within R

Here is my code with the corresponding output
> tkplot(g.2,vertex.label=nodes,
+ canvas.width=700,
+ canvas.height=700)
[1] 6
> ?tkplot
Warning message:
In rm(list = cmd, envir = .tkplot.env) : object 'tkp.6' not found
I get this error no matter what command I run after building and viewing my plot.
This may be obvious, but I can't get at the data from the plot.
> tkp.6.getcoords
Error: object 'tkp.6.getcoords' not found
Any thoughts? On Windows 2007 Pro.
R is a functional programming language. tkplot is a bit odd (for R users anyway) in that it returns numeric handles to its creations. Try this instead:
tkplot.getcoords(6)
When I run the example on the tkplot help page, I then get this from tkplot.getcoords(1), since it was my first igraph plot:
> tkplot.getcoords(1)
[,1] [,2]
[1,] 334.49319 33.82983
[2,] 362.43837 286.10754
[3,] 410.61862 324.98319
[4,] 148.00673 370.91116
[5,] 195.69191 20.00000
[6,] 29.49197 430.00000
[7,] 20.00000 155.05409
[8,] 388.51103 62.61010
[9,] 430.00000 133.44695
[10,] 312.76239 168.90260
