My code:
library(quanteda)
library(topicmodels)
# Some raw text as a vector
postText <- c("普京 称 俄罗斯 未 乌克兰 施压 来自 头 条 新闻", "长期 电脑 前进 食 致癌 环球网 报道 乌克兰 学者 认为 电脑 前进 食 会 引发 癌症 等 病症 电磁 辐射 作用 电脑 旁 水 食物 会 逐渐 变质 有害 物质 累积 尽管 人体 短期 内 会 感到 适 会 渐渐 引发 出 癌症 阿尔茨海默 式 症 帕金森 症 等 兔子", "全 木 手表 乌克兰 木匠 瓦列里·达内维奇 木头 制作 手表 共计 154 手工 零部件 唯一 一个 非 木制 零件 金属 弹簧 驱动 指针 运行 其他 零部件 材料 取自 桦树 苹果树 杏树 坚果树 竹子 黄杨树 愈疮木 非洲 红木 总共 耗时 7 打造 手表 不仅 能够 正常 运行 天 时间 误差 保持 5 分钟 之内 ")
# Create a corpus of the posts
postCorpus <- corpus(postText)
# Make a dfm, removing numbers and punctuation
myDocTermMat <- dfm(postCorpus, stem = FALSE, removeNumbers = TRUE, removeTwitter = TRUE, removePunct = TRUE)
# Estimate a LDA Topic Model
if (require(topicmodels)) {
myLDAfit <- LDA(convert(myDocTermMat, to = "topicmodels"), k = 2)
}
terms(myLDAfit, 11)
The code works and I see a result. Here is an example of the output:
Topic 1 Topic 2
[1,] "木" "会"
[2,] "手表" "电脑"
[3,] "零" "乌克兰"
[4,] "部件" "前进"
[5,] "运行" "食"
[6,] "乌克兰" "引发"
[7,] "内" "癌症"
[8,] "全" "等"
[9,] "木匠" "症"
[10,] "瓦" "普"
[11,] "列" "京"
Here is the problem. All of my posts have been segmented (necessary pre-processing step for Chinese) and had stop words removed. Nonetheless, the topic model returns topics containing single-character stop terms that have already been removed. If I open the raw .txt files and do ctrl-f for a given single-character stop word, no results are returned. But those terms show up in the returned topics from the R code, perhaps because the individual characters occur as part of other multi-character words. E.g. 就 is a preposition treated as a stop word, but 成就 means "success."
Related to this, certain terms are split. For example, one of the events I am examining contains references to Russian president Putin ("普京"). In the topic model results, however, I see separate term entries for "普" and "京" and no entries for "普京". (See lines 10 and 11 in output topic 2, compared to the first word in the raw text.)
Is there an additional tokenization step occurring here?
Edit: Modified to make reproducible. For some reason it wouldn't let me post until I also deleted my introductory paragraph.
Here's a workaround that uses a faster but "dumber" word tokeniser, which simply splits on whitespace ("\\s"):
# fails
features(dfm(postText, verbose = FALSE))
## [1] "普" "京" "称" "俄罗斯" "未" "乌克兰" "施压" "来自" "头" "条" "新闻"
# works
features(dfm(postText, what = "fasterword", verbose = FALSE))
## [1] "普京" "称" "俄罗斯" "未" "乌克兰" "施压" "来自" "头" "条" "新闻"
So add what = "fasterword" to the dfm() call and you will get this as a result, where Putin ("普京") is not split.
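In other words, the fix is a one-argument change (a sketch, keeping the same older-API arguments as the question):
# tokenise on whitespace so the existing Chinese segmentation is preserved
myDocTermMat <- dfm(postCorpus, what = "fasterword", stem = FALSE,
                    removeNumbers = TRUE, removeTwitter = TRUE, removePunct = TRUE)
myLDAfit <- LDA(convert(myDocTermMat, to = "topicmodels"), k = 2)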
terms(myLDAfit, 11)
## Topic 1 Topic 2
## [1,] "会" "手表"
## [2,] "电脑" "零部件"
## [3,] "乌克兰" "运行"
## [4,] "前进" "乌克兰"
## [5,] "食" "全"
## [6,] "引发" "木"
## [7,] "癌症" "木匠"
## [8,] "等" "瓦列里达内维奇"
## [9,] "症" "木头"
## [10,] "普京" "制作"
## [11,] "称" "共计"
This is an interesting case where quanteda's default tokeniser, built on stringi's definition of text boundaries (see stri_split_boundaries), does not work in its default setting. It might work after some experimentation with the locale, but locale is not currently an option that can be passed to quanteda::tokenize(), which dfm() calls.
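You can see the re-splitting by calling stringi directly on one of the pre-segmented strings. A hedged illustration, since the exact segmentation depends on your ICU dictionary and locale:
library(stringi)
# ICU's default word-boundary rules re-segment the Chinese characters,
# ignoring the spaces inserted by your earlier segmentation step,
# so multi-character names such as 普京 may come back split
stri_split_boundaries(postText[1], type = "word", skip_word_none = TRUE)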
Please file this as an issue at https://github.com/kbenoit/quanteda/issues and I'll try to get working on a better solution using the "smarter" word tokeniser.
This is my first time asking a question on here, so I hope I don't miss any crucial parts. I want to perform sentiment analysis on windows of speeches around certain keywords. My dataset is a large csv file containing a number of speeches, but I'm only interested in the sentiment of the words immediately surrounding certain key words.
I was told that the quanteda package in R would likely be my best bet for finding such a function, but I've been unsuccessful in locating it so far. If anyone knows how to do such a task it would be greatly appreciated!!!
Reprex (I hope?) below:
speech = c("This is the first speech. Many words are in this speech, but only few are relevant for my research question. One relevant word, for example, is the word stackoverflow. However there are so many more words that I am not interested in assessing the sentiment of", "This is a second speech, much shorter than the first one. It still includes the word of interest, but at the very end. stackoverflow.", "this is the third speech, and this speech does not include the word of interest so I'm not interested in assessing this speech.")
data <- data.frame(id=1:3,
speechContent = speech)
I'd suggest using tokens_select() with the window argument set to a range of tokens surrounding your target terms.
To take your example, if "stackoverflow" is the target term, and you want to measure sentiment in the +/- 10 tokens around that, then this would work:
library("quanteda")
## Package version: 3.2.1
## Unicode version: 13.0
## ICU version: 69.1
## Parallel computing: 8 of 8 threads used.
## See https://quanteda.io for tutorials and examples.
## [CODE FROM ABOVE]
corp <- corpus(data, text_field = "speechContent")
toks <- tokens(corp) %>%
tokens_select("stackoverflow", window = 10)
toks
## Tokens consisting of 3 documents and 1 docvar.
## text1 :
## [1] "One" "relevant" "word" ","
## [5] "for" "example" "," "is"
## [9] "the" "word" "stackoverflow" "."
## [ ... and 9 more ]
##
## text2 :
## [1] "word" "of" "interest" ","
## [5] "but" "at" "the" "very"
## [9] "end" "." "stackoverflow" "."
##
## text3 :
## character(0)
There are many ways to compute sentiment from this point. An easy one is to apply a sentiment dictionary, e.g.
tokens_lookup(toks, data_dictionary_LSD2015) %>%
dfm()
## Document-feature matrix of: 3 documents, 4 features (91.67% sparse) and 1 docvar.
## features
## docs negative positive neg_positive neg_negative
## text1 0 1 0 0
## text2 0 0 0 0
## text3 0 0 0 0
Using quanteda:
library(quanteda)
corp <- corpus(data, docid_field = "id", text_field = "speechContent")
x <- kwic(tokens(corp, remove_punct = TRUE),
pattern = "stackoverflow",
window = 3
)
x
Keyword-in-context with 2 matches.
[1, 29] is the word | stackoverflow | However there are
[2, 24] the very end | stackoverflow |
as.data.frame(x)
docname from to pre keyword post pattern
1 1 29 29 is the word stackoverflow However there are stackoverflow
2 2 24 24 the very end stackoverflow stackoverflow
Now read the help for kwic (use ?kwic in the console) to see what kinds of patterns you can use. With tokens() you can specify which data cleaning you want to apply before using kwic(); in my example I removed the punctuation.
The end result is a data frame with the window before and after the keyword(s), in this example a window of length 3. After that you can do some form of sentiment analysis on the pre and post results (or paste them together first), as sketched below.
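For example, a minimal sketch of that last step, pasting the windows together and scoring them with the Lexicoder dictionary (building on the kwic object x above):
kw <- as.data.frame(x)
window_text <- paste(kw$pre, kw$post)    # pre + post context for each match
dfm(tokens_lookup(tokens(window_text), data_dictionary_LSD2015))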
I am using R to extract text. The code below works well for extracting the non-bold text from a PDF, but it ignores the bold parts. Is there a way to extract both bold and non-bold text?
news <-'http://www.frbe-kbsb.be/sites/manager/ICN/14-15/ind01.pdf'
library(pdftools)
library(tesseract)
library(tiff)
info <- pdf_info(news)
numberOfPageInPdf <- as.numeric(info[2])
numberOfPageInPdf
for (i in 1:numberOfPageInPdf){
bitmap <- pdf_render_page(news, page=i, dpi = 300, numeric = TRUE)
file_name <- paste0("page", i, ".tiff")
file_tiff <- tiff::writeTIFF(bitmap, file_name)
out <- ocr(file_name)
file_txt <- paste0("text", i, ".txt")
writeLines(out, file_txt)
}
I like using the tabulizer library for this. Here's a small example:
devtools::install_github("ropensci/tabulizer")
library(tabulizer)
news <-'http://www.frbe-kbsb.be/sites/manager/ICN/14-15/ind01.pdf'
# note that you need to specify UTF-8 as the encoding, otherwise your special characters
# won't come in correctly
page1 <- extract_tables(news, guess=TRUE, page = 1, encoding='UTF-8')
page1[[1]]
[,1] [,2] [,3] [,4] [,5] [,6] [,7]
[1,] "" "Division: 1" "" "" "" "" "Série: A"
[2,] "" "514" "" "Fontaine 1 KBSK 1" "" "" "303"
[3,] "1" "62529 WIRIG ANTHONY" "" "2501 1⁄2-1⁄2" "51560" "CZEBE ATTILLA" "2439"
[4,] "2" "62359 BRUNNER NICOLAS" "" "2443 0-1" "51861" "PICEU TOM" "2401"
[5,] "3" "75655 CEKRO EKREM" "" "2393 0-1" "10391" "GEIRNAERT STEVEN" "2400"
[6,] "4" "50211 MARECHAL ANDY" "" "2355 0-1" "35181" "LEENHOUTS KOEN" "2388"
[7,] "5" "73059 CLAESEN PIETER" "" "2327 1⁄2-1⁄2" "25615" "DECOSTER FREDERIC" "2373"
[8,] "6" "63614 HOURIEZ CLEMENT" "" "2304 1⁄2-1⁄2" "44954" "MAENHOUT THIBAUT" "2372"
[9,] "7" "60369 CAPONE NICOLA" "" "2283 1⁄2-1⁄2" "10430" "VERLINDE TIEME" "2271"
[10,] "8" "70653 LE QUANG KIM" "" "2282 0-1" "44636" "GRYSON WOUTER" "2269"
[11,] "" "" "< 2361 >" "12 - 20" "" "< 2364 >" ""
You can also use the locate_areas function to specify a specific region if you only care about some of the tables. Note that for locate_areas to work, I had to download the file locally first; using the URL returned an error.
You'll note that each table is its own element in the returned list.
Here's an example using a custom region to just select the first table on each page:
customArea <- extract_tables(news, guess = FALSE, page = 1, area = list(c(84, 27, 232, 569)), encoding = 'UTF-8')
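If you'd rather pick the region interactively, locate_areas() can give you those coordinates, but (as noted above) only on a local copy of the file. A hedged sketch:
local_pdf <- file.path(tempdir(), "ind01.pdf")
download.file(news, local_pdf, mode = "wb")   # locate_areas() needs a local file
area1 <- locate_areas(local_pdf, pages = 1)   # click and drag to outline the table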
This is also a more direct method than going through the OCR (Optical Character Recognition) library tesseract, because you're not relying on OCR to translate a pixel arrangement back into text. In digital PDFs, each text element has an x and y position, and the tabulizer library uses that information to detect tables heuristically and extract sensibly formatted data. You'll see you still have some clean-up to do, but it's pretty manageable.
Edit: just for fun, here's a little example of starting the clean up with data.table
library(data.table)
cleanUp <- setDT(as.data.frame(page1[[1]]))
cleanUp[ , `:=` (Division = as.numeric(gsub("^.*(\\d+{1,2}).*", "\\1", grep('Division', cleanUp$V2, value=TRUE))),
Series = as.character(gsub(".*:\\s(\\w).*","\\1", grep('Série:', cleanUp$V7, value=TRUE))))
][,ID := tstrsplit(V2," ", fixed=TRUE, keep = 1)
][, c("V1", "V3") := NULL
][-grep('Division', V2, fixed=TRUE)]
Here we've moved Division, Series, and ID into their own columns, and removed the Division header row. This is just the general idea, and would need a little refinement to apply to all 27 pages; a sketch of that follows the output below.
V2 V4 V5 V6 V7 Division Series ID
1: 514 Fontaine 1 KBSK 1 303 1 A 514
2: 62529 WIRIG ANTHONY 2501 1/2-1/2 51560 CZEBE ATTILLA 2439 1 A 62529
3: 62359 BRUNNER NICOLAS 2443 0-1 51861 PICEU TOM 2401 1 A 62359
4: 75655 CEKRO EKREM 2393 0-1 10391 GEIRNAERT STEVEN 2400 1 A 75655
5: 50211 MARECHAL ANDY 2355 0-1 35181 LEENHOUTS KOEN 2388 1 A 50211
6: 73059 CLAESEN PIETER 2327 1/2-1/2 25615 DECOSTER FREDERIC 2373 1 A 73059
7: 63614 HOURIEZ CLEMENT 2304 1/2-1/2 44954 MAENHOUT THIBAUT 2372 1 A 63614
8: 60369 CAPONE NICOLA 2283 1/2-1/2 10430 VERLINDE TIEME 2271 1 A 60369
9: 70653 LE QUANG KIM 2282 0-1 44636 GRYSON WOUTER 2269 1 A 70653
10: 12 - 20 < 2364 > 1 A NA
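To push this across all 27 pages, one option (a hedged, untested sketch) is to extract every page at once and stack the results before cleaning:
# pages defaults to NULL, which extracts from every page; each table is one list element
allTables <- extract_tables(news, guess = TRUE, encoding = 'UTF-8')
allDT <- rbindlist(lapply(allTables, as.data.table), fill = TRUE)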
There is no need to go through the PDF -> TIFF -> OCR loop, since pdftools::pdf_text() can read this file directly:
stringi::stri_split(pdf_text(news), regex = "\n")
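For instance, a short sketch that pulls each page into its own vector of lines:
library(pdftools)
library(stringi)
txt <- pdf_text(news)                  # one character string per page
lines_by_page <- stri_split_lines(txt) # list: one vector of lines per page
head(lines_by_page[[1]])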
I am unsure how to describe this problem. I have a feeling it is trivial but I cannot get a hold of it.
I have a stack of raster objects (object NDVI). From these I extracted x and y coordinates using rasterToPoints
xycoord1 <- rasterToPoints(NDVI)
xycoord <- xycoord1[,c(1:2)]
During pre-processing I kicked out several unusable pixels and ended up with:
> str(xycoord.short)
num [1:20054, 1:2] 3802292 3802523 3802755 3802987 3803218 ...
- attr(*, "dimnames")=List of 2
..$ : NULL
..$ : chr [1:2] "x" "y"
Now I simply want to find a certain x and y coordinate.
e.g.
> which(xycoord.short[,1]==3802292)
integer(0)
But I seem unable to "get hold" of the values inside, for example, one column.
> xycoord.short[,1][1]
[1] 3802292
> xycoord.short[,1][1]==xycoord.short[,1][1]
[1] TRUE
> xycoord.short[,1][1]==3802292
[1] FALSE
Can anyone help me with this problem? I just can't find the cause. Does it have to do with the initial extraction through rasterToPoints? Thanks!
EDIT:
dput output for the first 10 rows of my xy-coordinates
xy <- structure(c(3802291.63636448, 3802523.29272274, 3802754.94908101,
3802986.60543927, 3803218.26179754, 3803449.9181558, 3803681.57451406,
3803913.23087233, 3804144.88723059, 3804376.54358886, -49690.2888476191,
-49690.2888476191, -49690.2888476191, -49690.2888476191, -49690.2888476191,
-49690.2888476191, -49690.2888476191, -49690.2888476191, -49690.2888476191,
-49690.2888476191), .Dim = c(10L, 2L), .Dimnames = list(NULL,
c("x", "y")))
EDIT2:
After posting the dput output it makes sense, as the values printed earlier were obviously rounded.
Using the exact numbers works...
> any(xycoord.short[,1]==3802291.63636448)
[1] TRUE
What you have here is a rounding "problem". Your coordinates are in what we call "double" (10.3 is a double) but you're trying to subset based on an integer (say 10). What you can do here is round to n places and subset based on that.
For instance, let's check eight digits.
format(xy, digits = 8)
x y
[1,] "3802291.636" " -49690.289"
[2,] "3802523.293" " -49690.289"
[3,] "3802754.949" " -49690.289"
[4,] "3802986.605" " -49690.289"
[5,] "3803218.262" " -49690.289"
[6,] "3803449.918" " -49690.289"
[7,] "3803681.575" " -49690.289"
[8,] "3803913.231" " -49690.289"
[9,] "3804144.887" " -49690.289"
[10,] "3804376.544" " -49690.289"
So in essence, when you're looking for 3802292 it doesn't find it because it's actually 3802291.636....
You can either specify the exact coordinate correct to x decimal places, or round your numbers and compare those. Or you could specify a range of values (a tolerance) that encompasses your desired value(s).
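For example, using the xy object from the dput above (a small sketch; choose a tolerance that suits your raster resolution):
target <- 3802292
tol <- 1                                   # tolerance in map units
which(abs(xy[, "x"] - target) < tol)       # tolerance-based match
## [1] 1
which(round(xy[, "x"]) == round(target))   # or round both sides first
## [1] 1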
I am currently working on a dataset in R which is assigned to the global environment by a function of i. Due to the nature of my work I am unable to disclose the dataset, so let's use an example.
DATA
[,1] [,2] [,3] [,4] [,5]
[1,] 32320 27442 29275 45921 162306
[2,] 38506 29326 33290 45641 175386
[3,] 42805 30974 33797 47110 198358
[4,] 42107 34690 47224 62893 272305
[5,] 54448 39739 58548 69470 316550
[6,] 53358 48463 63793 79180 372685
Here DATA(i) is a function and the above is its output for a certain i.
I want to assign variable names based on i such as:-
names(i)<-c(a(i),b(i),c(i),d(i),e(i))
For argument's sake, let's say that the value of names for this specific i is
c("a","b","c","d","e")
I hope that it will produce the following:-
a b c d e
[1,] 32320 27442 29275 45921 162306
[2,] 38506 29326 33290 45641 175386
[3,] 42805 30974 33797 47110 198358
[4,] 42107 34690 47224 62893 272305
[5,] 54448 39739 58548 69470 316550
[6,] 53358 48463 63793 79180 372685
This is the code I currently use:-
VarName<-function(i){
colnames(DATA(i))<<-names(i)
}
However, this produces an error message when I run it: "Error in colnames(DATA(i)) <<- names(i) : target of assignment expands to non-language object", which, as we can see from my input, isn't actually the case. Is there another way to do this?
Sorry for the basic questions. I'm fairly new to programming.
Here is my code with the corresponding output
> tkplot(g.2,vertex.label=nodes,
+ canvas.width=700,
+ canvas.height=700)
[1] 6
> ?tkplot
Warning message:
In rm(list = cmd, envir = .tkplot.env) : object 'tkp.6' not found
I get this error no matter what command I run after building and viewing my plot.
This may be obvious, but I can't get at the data from the plot.
> tkp.6.getcoords
Error: object 'tkp.6.getcoords' not found
Any thoughts? On Windows 2007 Pro.
R is a functional programming language. tkplot is a bit odd (for R users anyway) in that it returns numeric handles to its creations. Try this instead:
tkplot.getcoords(6)
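A cleaner pattern is to store that handle when you create the plot (a sketch, assuming the g.2 and nodes objects from your session):
id <- tkplot(g.2, vertex.label = nodes,
             canvas.width = 700, canvas.height = 700)   # tkplot() returns the handle
coords <- tkplot.getcoords(id)   # current vertex coordinates of that plot
tkplot.close(id)                 # close the Tk window when you're done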
When I run the example on the tkplot page, I then get this from tkplot.getcoords(1) since it was my first igraph plot:
> tkplot.getcoords(1)
[,1] [,2]
[1,] 334.49319 33.82983
[2,] 362.43837 286.10754
[3,] 410.61862 324.98319
[4,] 148.00673 370.91116
[5,] 195.69191 20.00000
[6,] 29.49197 430.00000
[7,] 20.00000 155.05409
[8,] 388.51103 62.61010
[9,] 430.00000 133.44695
[10,] 312.76239 168.90260