I am unsure how to describe this problem. I have a feeling it is trivial but I cannot get a hold of it.
I have a stack of raster objects (object NDVI). From these I extracted x and y coordinates using rasterToPoints
xycoord1 <- rasterToPoints(NDVI)
xycoord <- xycoord1[,c(1:2)]
During pre-processing I dropped several unusable pixels and ended up with:
> str(xycoord.short)
num [1:20054, 1:2] 3802292 3802523 3802755 3802987 3803218 ...
- attr(*, "dimnames")=List of 2
..$ : NULL
..$ : chr [1:2] "x" "y"
Now I simply want to find a certain x and y coordinate.
e.g.
> which(xycoord.short[,1]==3802292)
integer(0)
But I seem unable to "get hold" of the values inside, for example, one column:
> xycoord.short[,1][1]
[1] 3802292
> xycoord.short[,1][1]==xycoord.short[,1][1]
[1] TRUE
> xycoord.short[,1][1]==3802292
[1] FALSE
Can anyone help me with this problem? I just can't find the cause. Does it have to do with the initial extraction through rasterToPoints? Thanks!
EDIT:
dput output for the first 10 rows of my xy-coordinates
xy <- structure(c(3802291.63636448, 3802523.29272274, 3802754.94908101,
3802986.60543927, 3803218.26179754, 3803449.9181558, 3803681.57451406,
3803913.23087233, 3804144.88723059, 3804376.54358886, -49690.2888476191,
-49690.2888476191, -49690.2888476191, -49690.2888476191, -49690.2888476191,
-49690.2888476191, -49690.2888476191, -49690.2888476191, -49690.2888476191,
-49690.2888476191), .Dim = c(10L, 2L), .Dimnames = list(NULL,
c("x", "y")))
EDIT2:
After posting the dput output it makes sense, as the values printed by str() are obviously rounded.
Using the exact numbers works...
> any(xycoord.short[,1]==3802291.63636448)
[1] TRUE
What you have here is a rounding "problem". Your coordinates are stored as doubles (10.3 is a double), but you're trying to subset based on a whole number (say 10). What you can do here is round to n places and subset based on that.
For instance, let's check eight digits.
format(xy, digits = 8)
x y
[1,] "3802291.636" " -49690.289"
[2,] "3802523.293" " -49690.289"
[3,] "3802754.949" " -49690.289"
[4,] "3802986.605" " -49690.289"
[5,] "3803218.262" " -49690.289"
[6,] "3803449.918" " -49690.289"
[7,] "3803681.575" " -49690.289"
[8,] "3803913.231" " -49690.289"
[9,] "3804144.887" " -49690.289"
[10,] "3804376.544" " -49690.289"
So in essence, when you're looking for 3802292 it doesn't find it because it's actually 3802291.636....
You can either specify the exact coordinate to full precision, round your values and work with those, or specify a range of values that encompasses your desired value(s).
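For instance (a minimal sketch, not part of the original data), you could round the stored coordinates before comparing, or match within a tolerance:
# round the stored x coordinates to whole numbers before comparing
which(round(xycoord.short[, 1]) == 3802292)
# or match within a tolerance (0.5 map units here, an arbitrary choice)
which(abs(xycoord.short[, 1] - 3802292) < 0.5)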
I need to pass some numbers into functions/environments/contexts that only accept strings. They will be processed there, and after receiving the result I can convert them back to numbers. The key issue is that when the string representations are sorted, they must sort in the same order as the numbers they represent. I also need to be able to add numbers after creating the initial batch. Following are two conversion approaches and why they do not work:
The simplest conversion is the standard one: 1->"1", 10->"10" and so on. This does not satisfy the sortability criterion, because "10" gets sorted before "2".
The next approach is to pad with zeroes: 1->"001", 10->"010" and so on. This satisfies sortability ("002" sorts before "010"), but it fails if a larger number needs to be added later. If, say, the numbers 2000 and 10000 need to be added later, it is not possible to do so in a way that maintains the sort order.
Are there any good approaches to doing this? The question does not pertain to any particular language (although the target language in my use case is R, which has a number of places, such as vector names, that accept only character values). Simplicity and/or standardization (of the representation, not of the implementation) would both be big factors in choosing the best solution here.
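For illustration, here is the behaviour described above in plain R (sorting strings lexicographically):
# plain conversion: "10" sorts before "2"
sort(c("1", "2", "10"))
# zero-padding to three digits works for the initial batch...
sort(c("001", "002", "010"))
# ...but breaks once wider numbers such as 2000 and 10000 are added later
sort(c("001", "002", "010", "2000", "10000"))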
I think I've had the same problem, and I used a workaround that could help you. But I'm not sure it will apply in all situations.
First, a vector of numbers as strings, sorted as strings:
str.numbers <- sort(as.character(1:20))
Now, use the numeric representation of the strings to order the same vector numerically:
str.numbers[order(as.numeric(str.numbers))]
This does the trick for simple vectors. But not sure it'll solve more complex problems.
Funny how the world works: after searching for a solution and not finding one, I came across a package, with its last release only four days ago, that focuses on this problem: the strex package. A reproducible example of my troubles, and how strex provides a pretty good fix, follows:
# Load the strex library
library(strex)
#> Loading required package: stringr
# This would be a function created by others, doing more than
# sorting, but nevertheless requiring sortable input
fun_requiring_char <- function(x) {
  stopifnot(is.character(x))
  sort(x)
}
# Example data
set.seed(42)
a <- sample(20, 5) ; a
#> [1] 17 5 1 10 4
b <- sample(2000, 10) ; b
#> [1] 1170 634 49 1152 1327 24 1863 356 1625 165
# Won't work, error
#fun_requiring_char(a)
# Works, but returns incorrectly sorted input
fun_requiring_char(as.character(a))
#> [1] "1" "10" "17" "4" "5"
fun_requiring_char(as.character(b))
#> [1] "1152" "1170" "1327" "1625" "165" "1863" "24" "356" "49" "634"
# Solution provided by strex
fun_requiring_char(str_alphord_nums(a))
#> [1] "01" "04" "05" "10" "17"
fun_requiring_char(str_alphord_nums(b))
#> [1] "0024" "0049" "0165" "0356" "0634" "1152" "1170" "1327" "1625" "1863"
# Quick and dirty zero padding did not allow first converting a,
# and only later b, into character while keeping both in one
# unified format that represents the numbers and still sorts
# in the correct numeric order. However, applying
# str_alphord_nums repeatedly gets very close to a solution.
ac <- str_alphord_nums(a); ac
#> [1] "17" "05" "01" "10" "04"
bc <- str_alphord_nums(b); bc
#> [1] "1170" "0634" "0049" "1152" "1327" "0024" "1863" "0356" "1625" "0165"
# Wrong order here
fun_requiring_char(c(ac,bc))
#> [1] "0024" "0049" "01" "0165" "0356" "04" "05" "0634" "10" "1152"
#> [11] "1170" "1327" "1625" "17" "1863"
# But doing an alphord again on the concatenated vectors provides a fix
fun_requiring_char(str_alphord_nums(c(ac,bc)))
#> [1] "0001" "0004" "0005" "0010" "0017" "0024" "0049" "0165" "0356" "0634"
#> [11] "1152" "1170" "1327" "1625" "1863"
Created on 2020-10-21 by the reprex package (v0.3.0)
I am using R to extract text. The code below works well to extract the non-bold text from a PDF, but it ignores the bold parts. Is there a way to extract both bold and non-bold text?
news <-'http://www.frbe-kbsb.be/sites/manager/ICN/14-15/ind01.pdf'
library(pdftools)
library(tesseract)
library(tiff)
info <- pdf_info(news)
numberOfPageInPdf <- as.numeric(info[2])
numberOfPageInPdf
for (i in 1:numberOfPageInPdf){
  bitmap <- pdf_render_page(news, page = i, dpi = 300, numeric = TRUE)
  file_name <- paste0("page", i, ".tiff")
  file_tiff <- tiff::writeTIFF(bitmap, file_name)
  out <- ocr(file_name)
  file_txt <- paste0("text", i, ".txt")
  writeLines(out, file_txt)
}
I like using the tabulizer library for this. Here's a small example:
devtools::install_github("ropensci/tabulizer")
library(tabulizer)
news <-'http://www.frbe-kbsb.be/sites/manager/ICN/14-15/ind01.pdf'
# note that you need to specify UTF-8 as the encoding, otherwise your special characters
# won't come in correctly
page1 <- extract_tables(news, guess=TRUE, page = 1, encoding='UTF-8')
page1[[1]]
[,1] [,2] [,3] [,4] [,5] [,6] [,7]
[1,] "" "Division: 1" "" "" "" "" "Série: A"
[2,] "" "514" "" "Fontaine 1 KBSK 1" "" "" "303"
[3,] "1" "62529 WIRIG ANTHONY" "" "2501 1⁄2-1⁄2" "51560" "CZEBE ATTILLA" "2439"
[4,] "2" "62359 BRUNNER NICOLAS" "" "2443 0-1" "51861" "PICEU TOM" "2401"
[5,] "3" "75655 CEKRO EKREM" "" "2393 0-1" "10391" "GEIRNAERT STEVEN" "2400"
[6,] "4" "50211 MARECHAL ANDY" "" "2355 0-1" "35181" "LEENHOUTS KOEN" "2388"
[7,] "5" "73059 CLAESEN PIETER" "" "2327 1⁄2-1⁄2" "25615" "DECOSTER FREDERIC" "2373"
[8,] "6" "63614 HOURIEZ CLEMENT" "" "2304 1⁄2-1⁄2" "44954" "MAENHOUT THIBAUT" "2372"
[9,] "7" "60369 CAPONE NICOLA" "" "2283 1⁄2-1⁄2" "10430" "VERLINDE TIEME" "2271"
[10,] "8" "70653 LE QUANG KIM" "" "2282 0-1" "44636" "GRYSON WOUTER" "2269"
[11,] "" "" "< 2361 >" "12 - 20" "" "< 2364 >" ""
You can also use the locate_areas function to specify a specific region if you only care about some of the tables. Note that for locate_areas to work, I had to download the file locally first; using the URL returned an error.
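For example, a sketch of that workflow (the local path is just an illustration):
# download the PDF locally, then pick the region interactively
local_pdf <- file.path(tempdir(), "ind01.pdf")
download.file(news, destfile = local_pdf, mode = "wb")
firstTableArea <- locate_areas(local_pdf, pages = 1)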
You'll note that each table is its own element in the returned list.
Here's an example using a custom region to just select the first table on each page:
customArea <- extract_tables(news, guess = FALSE, page = 1, area = list(c(84, 27, 232, 569)), encoding = 'UTF-8')
This is also a more direct method than using the OCR (Optical Character Recognition) library tesseract, because you're not relying on the OCR library to translate pixel arrangements back into text. In digital PDFs, each text element has an x and y position, and the tabulizer library uses that information to detect table heuristics and extract sensibly formatted data. You'll see you still have some cleanup to do, but it's pretty manageable.
Edit: just for fun, here's a little example of starting the clean up with data.table
library(data.table)
cleanUp <- setDT(as.data.frame(page1[[1]]))
cleanUp[ , `:=` (Division = as.numeric(gsub("^.*(\\d+{1,2}).*", "\\1", grep('Division', cleanUp$V2, value=TRUE))),
Series = as.character(gsub(".*:\\s(\\w).*","\\1", grep('Série:', cleanUp$V7, value=TRUE))))
][,ID := tstrsplit(V2," ", fixed=TRUE, keep = 1)
][, c("V1", "V3") := NULL
][-grep('Division', V2, fixed=TRUE)]
Here we've moved Division, Series, and ID into their own columns, and removed the Division header row. This is just the general idea, and would need a little refinement to apply to all 27 pages.
V2 V4 V5 V6 V7 Division Series ID
1: 514 Fontaine 1 KBSK 1 303 1 A 514
2: 62529 WIRIG ANTHONY 2501 1/2-1/2 51560 CZEBE ATTILLA 2439 1 A 62529
3: 62359 BRUNNER NICOLAS 2443 0-1 51861 PICEU TOM 2401 1 A 62359
4: 75655 CEKRO EKREM 2393 0-1 10391 GEIRNAERT STEVEN 2400 1 A 75655
5: 50211 MARECHAL ANDY 2355 0-1 35181 LEENHOUTS KOEN 2388 1 A 50211
6: 73059 CLAESEN PIETER 2327 1/2-1/2 25615 DECOSTER FREDERIC 2373 1 A 73059
7: 63614 HOURIEZ CLEMENT 2304 1/2-1/2 44954 MAENHOUT THIBAUT 2372 1 A 63614
8: 60369 CAPONE NICOLA 2283 1/2-1/2 10430 VERLINDE TIEME 2271 1 A 60369
9: 70653 LE QUANG KIM 2282 0-1 44636 GRYSON WOUTER 2269 1 A 70653
10: 12 - 20 < 2364 > 1 A NA
There is no need to go through the PDF -> TIFF -> OCR loop, since pdftools::pdf_text() can read this file directly:
stringi::stri_split(pdf_text(news), regex = "\n")
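For example, a sketch that mirrors the loop in the question but skips the OCR step, writing one text file per page:
library(pdftools)
txt <- pdf_text(news)   # one character string per page
for (i in seq_along(txt)) {
  writeLines(txt[i], paste0("text", i, ".txt"))
}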
My code:
library(quanteda)
library(topicmodels)
# Some raw text as a vector
postText <- c("普京 称 俄罗斯 未 乌克兰 施压 来自 头 条 新闻", "长期 电脑 前进 食 致癌 环球网 报道 乌克兰 学者 认为 电脑 前进 食 会 引发 癌症 等 病症 电磁 辐射 作用 电脑 旁 水 食物 会 逐渐 变质 有害 物质 累积 尽管 人体 短期 内 会 感到 适 会 渐渐 引发 出 癌症 阿尔茨海默 式 症 帕金森 症 等 兔子", "全 木 手表 乌克兰 木匠 瓦列里·达内维奇 木头 制作 手表 共计 154 手工 零部件 唯一 一个 非 木制 零件 金属 弹簧 驱动 指针 运行 其他 零部件 材料 取自 桦树 苹果树 杏树 坚果树 竹子 黄杨树 愈疮木 非洲 红木 总共 耗时 7 打造 手表 不仅 能够 正常 运行 天 时间 误差 保持 5 分钟 之内 ")
# Create a corpus of the posts
postCorpus <- corpus(postText)
# Make a dfm, removing numbers and punctuation
myDocTermMat <- dfm(postCorpus, stem = FALSE, removeNumbers = TRUE, removeTwitter = TRUE, removePunct = TRUE)
# Estimate a LDA Topic Model
if (require(topicmodels)) {
  myLDAfit <- LDA(convert(myDocTermMat, to = "topicmodels"), k = 2)
}
terms(myLDAfit, 11)
The code works and I see a result. Here is an example of the output:
Topic 1 Topic 2
[1,] "木" "会"
[2,] "手表" "电脑"
[3,] "零" "乌克兰"
[4,] "部件" "前进"
[5,] "运行" "食"
[6,] "乌克兰" "引发"
[7,] "内" "癌症"
[8,] "全" "等"
[9,] "木匠" "症"
[10,] "瓦" "普"
[11,] "列" "京"
Here is the problem. All of my posts have been segmented (necessary pre-processing step for Chinese) and had stop words removed. Nonetheless, the topic model returns topics containing single-character stop terms that have already been removed. If I open the raw .txt files and do ctrl-f for a given single-character stop word, no results are returned. But those terms show up in the returned topics from the R code, perhaps because the individual characters occur as part of other multi-character words. E.g. 就 is a preposition treated as a stop word, but 成就 means "success."
Related to this, certain terms are split. For example, one of the events I am examining contains references to Russian president Putin ("普京"). In the topic model results, however, I see separate term entries for "普" and "京" and no entries for "普京". (See lines 10 and 11 in output topic 2, compared to the first word in the raw text.)
Is there an additional tokenization step occurring here?
Edit: Modified to make reproducible. For some reason it wouldn't let me post until I also deleted my introductory paragraph.
Here's a workaround, using a faster but "dumber" word tokeniser that splits on whitespace ("\\s"):
# fails
features(dfm(postText, verbose = FALSE))
## [1] "普" "京" "称" "俄罗斯" "未" "乌克兰" "施压" "来自" "头" "条" "新闻"
# works
features(dfm(postText, what = "fasterword", verbose = FALSE))
## [1] "普京" "称" "俄罗斯" "未" "乌克兰" "施压" "来自" "头" "条" "新闻"
So add what = "fasterword" to the dfm() call and you will get this as a result, where Putin ("普京") is not split.
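Following the code in the question, the modified call would look something like this (a sketch; the other arguments are left as in the question):
myDocTermMat <- dfm(postCorpus, what = "fasterword", stem = FALSE,
                    removeNumbers = TRUE, removeTwitter = TRUE, removePunct = TRUE)
myLDAfit <- LDA(convert(myDocTermMat, to = "topicmodels"), k = 2)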
terms(myLDAfit, 11)
## Topic 1 Topic 2
## [1,] "会" "手表"
## [2,] "电脑" "零部件"
## [3,] "乌克兰" "运行"
## [4,] "前进" "乌克兰"
## [5,] "食" "全"
## [6,] "引发" "木"
## [7,] "癌症" "木匠"
## [8,] "等" "瓦列里达内维奇"
## [9,] "症" "木头"
## [10,] "普京" "制作"
## [11,] "称" "共计"
This is an interesting case where quanteda's default tokeniser, built on stringi's definition of text boundaries (see stri_split_boundaries), does not work in its default setting. It might work after experimentation with the locale, but these are not currently options that can be passed to quanteda::tokenize(), which dfm() calls.
Please file this as an issue at https://github.com/kbenoit/quanteda/issues and I'll try to get working on a better solution using the "smarter" word tokeniser.
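In the meantime, if you want to experiment with stringi directly, here is a sketch (the locale value is just an example to try):
library(stringi)
stri_split_boundaries(postText[1],
                      opts_brkiter = stri_opts_brkiter(type = "word", locale = "zh_CN"))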
Row<-c(1,2,3,4,5)
Content<-c("I love cheese", "whre is the fish", "Final Countdow", "show me your s", "where is what")
Data<-cbind(Row, Content)
View(Data)
I want to create a function that tells me how many words are misspelled per row.
An intermediate step would be to have it look like this:
Row<-c(1,2,3,4,5)
Content<-c("I love cheese", "whre is the fs", "Final Countdow", "show me your s", "where is what")
MisspelledWords<-c(NA, "whre, fs", "Countdow","s",NA)
Data<-cbind(Row, Content,MisspelledWords)
I know that I have to use aspell, but I'm having trouble running aspell on individual rows rather than on the whole file. Finally, I want to count how many words are wrong in every row; for this I would adapt the code from: Count the number of words in a string in R?
Inspired by this article, here's a try with which_misspelled and check_spelling in library(qdap).
library(qdap)
# which_misspelled
n_misspelled <- sapply(Content, function(x){
  length(which_misspelled(x, suggest = FALSE))
})
data.frame(Content, n_misspelled, row.names = NULL)
# Content n_misspelled
# 1 I love cheese 0
# 2 whre is the fs 2
# 3 Final Countdow 1
# 4 show me your s 0
# 5 where is what 0
# check_spelling
df <- check_spelling(Content, n.suggest = 0)
n_misspelled <- as.vector(table(factor(df$row, levels = Row)))
data.frame(Content, n_misspelled)
# Content n_misspelled
# 1 I love cheese 0
# 2 whre is the fs 2
# 3 Final Countdow 1
# 4 show me your s 0
# 5 where is what 0
To use aspell you have to use a file. It's pretty straightforward to use a function to dump a column to a file, run aspell and get the counts (but it will not be all that efficient if you have a large matrix/dataframe).
countMispelled <- function(words) {
  # do a bit of cleanup: strip punctuation, collapse runs of spaces
  words <- gsub(" +", " ", gsub("[[:punct:]]", "", words))
  temp_file <- tempfile()
  writeLines(words, temp_file)
  res <- aspell(temp_file)
  unlink(temp_file)
  # return the number of misspelled words
  length(res$Original)
}
Data <- cbind(Data, Errors=unlist(lapply(Data[,2], countMispelled)))
Data
## Row Content Errors
## [1,] "1" "I love cheese" "0"
## [2,] "2" "whre is thed fish" "2"
## [3,] "3" "Final Countdow" "1"
## [4,] "4" "show me your s" "0"
## [5,] "5" "where is what" "0"
You might be better off using a data frame vs a matrix (I just worked with what you provided) since you can keep Row and Errors numeric that way.
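For example, a sketch of the data frame version, reusing countMispelled from above:
Data_df <- data.frame(Row = Row, Content = Content, stringsAsFactors = FALSE)
Data_df$Errors <- sapply(Data_df$Content, countMispelled)
str(Data_df)  # Row and Errors stay numeric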
I have this code here:
library(BCEA)
data(Vaccine)
ls()
Now, I get this:
[1] "c" "cost.GP" "cost.hosp" "cost.otc" "cost.time.off" "cost.time.vac"
[7] "cost.travel" "cost.trt1" "cost.trt2" "cost.vac" "e" "N"
[13] "N.outcomes" "N.resources" "QALYs.adv" "QALYs.death" "QALYs.hosp" "QALYs.inf"
[19] "QALYs.pne" "treats"
How can I access the 'c' value?
Something like Vaccine(c)?
Please help!
In the Vaccine dataset, each one of "c", "cost.GP", ... is a separate object.
So, when you load the data into the current workspace with data(Vaccine), you can directly access each of the 20 listed objects.
For example,
> head(c)
Status Quo Vaccination
[1,] 10.409146 16.252537
[2,] 5.834875 9.373437
[3,] 5.784903 15.935623
[4,] 12.208484 18.654250
[5,] 9.786787 16.467321
[6,] 6.560276 9.689887
Just like this:
c
It's a named object, so just type c at the prompt and press return, or use it directly in any function within this environment.
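For example, using the column names shown in the head(c) output above (a minimal sketch):
mean(c[, "Vaccination"])
summary(c[, "Status Quo"])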