How can I include pseudos in a DTM in R? - r

I am trying to get a DTM of the words on this page:
https://en.wikipedia.org/wiki/Talk:Libyan_Civil_War_(2011)/Archive_1
My problem is that the usernames (pseudos) of the people posting, which are words in my corpus, never appear in my DTM even when I set dictionary to NULL. For instance, I expect the word "Lihaas" to be found 31 times, but it does not show up in my DTM.
My code :
library(tm)
docs<- VCorpus(DirSource(directory = "~dir"))
docsTDM <- DocumentTermMatrix(docs, control=list(dictionary=NULL))
I obtain :
the 2011 february utc
628 319 293 280
talk and this that
236 197 163 152
for are not uprising
106 101 92 79
libyan protests but support
76 75 68 68
with there revolt its
68 65 62 61
protest article have now
58 57 53 50
has civil should which
47 46 44 44
more think war was
43 43 41 41
from libya what would
40 40 36 35
about revolution added sources
34 34 32 32
comment government people some
30 30 30 30
all just section you
29 29 29 29
than unsigned will can
27 27 27 26
talk•contribs then even name
26 26 25 25

It might have to do with the fact that "Lihaas" is adjacent to a preceding "." in all of the cases that I see, or inside parentheses. So it is likely to be due to issues with tm's tokeniser.
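To illustrate the hypothesis above, here is a minimal base-R sketch. It is not tm's actual tokeniser, and the sample line is made up, only modelled on a talk-page signature. It shows how splitting on whitespace alone leaves the name glued to neighbouring punctuation, while splitting on runs of letters recovers it.

```r
# Hypothetical sample text (not copied from the real page)
txt <- "See the sources above.Lihaas (talk) 12:00, 20 February 2011 (UTC) I agree with (Lihaas)"
txt_lower <- tolower(txt)

# Whitespace-only splitting: punctuation stays attached to the word,
# so the tokens are "above.lihaas" and "(lihaas)", never plain "lihaas"
ws_tokens <- strsplit(txt_lower, "[[:space:]]+")[[1]]
sum(ws_tokens == "lihaas")

# Splitting on runs of letters separates the name from "." and "()"
word_tokens <- regmatches(txt_lower, gregexpr("[[:alpha:]]+", txt_lower))[[1]]
sum(word_tokens == "lihaas")
```

If the real page behaves like this sample, the name is present in the matrix only as part of longer, punctuation-laden terms.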
Here is an alternative that produces what you want, using the quanteda package.
# read the document using the readtext package
wikitxt <- readtext::readtext("Talk:Libyan Civil War (2011):Archive 1 - Wikipedia.html")
library("quanteda")
# newer quanteda releases expect a tokens object here, so tokenise first
wikidfm <- dfm(tokens(corpus(wikitxt)), tolower = FALSE)
wikidfm
## Document-feature matrix of: 1 document, 3,006 features (0% sparse).
wikidfm[, c("lihaas", "Lihaas")]
## Document-feature matrix of: 1 document, 2 features (0% sparse).
## 1 x 2 sparse Matrix of class "dfm"
## features
## docs lihaas Lihaas
## text1 1 30

Related

Ordering list object of IRanges to get all elements decreasing

I am having difficulties trying to order a list element-wise by decreasing order...
I have a ByPos_Mindex object or a list of 1000 IRange objects (CG_seqP) from
C <- vmatchPattern(CG, CPGi_Seq, max.mismatch = 0, with.indels = FALSE)
IRanges object with 27 ranges and 0 metadata columns:
start end width
<integer> <integer> <integer>
[1] 1 2 2
[2] 3 4 2
[3] 9 10 2
[4] 27 28 2
[5] 34 35 2
... ... ... ...
[23] 189 190 2
[24] 207 208 2
[25] 212 213 2
[26] 215 216 2
[27] 218 219 2
length(1000 of these IRanges)
I then change this to a list of only the start integers (which I want)
CG_SeqP <- sapply(C, function(x) sapply(as.vector(x), "[", 1))
[[1]]
[1] 1 3 9 27 34 47 52 56 62 66 68 70 89 110 112
[16] 136 140 146 154 160 163 178 189 207 212 215 218
(1000 of these)
The Problem happens when I try and order the list of elements using
CG_SeqP <- sapply(as.vector(CG_SeqP),order, decreasing = TRUE)
I get a list of what I think are row numbers, so if the first IRanges object has 27 ranges I get this...
CG_SeqP[1]
[[1]]
[1] 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9 8
[21] 7 6 5 4 3 2 1
So decreasing has worked, but not on my actual list elements?
Any suggestions, thanks in advance.
order() returns the positions that would sort the vector, not the sorted elements themselves. To get the elements, let us look at a toy example (I am following your idea here):
set.seed(1)
alist1 <- list(a = sample(1:100, 30))
So, if you print alist1 with this seed, you will get the results below:
> alist1
$a
[1] 99 51 67 59 23 25 69 43 17 68 10 77 55 49 29 39 93 16 44
[20] 7 96 92 80 94 34 97 66 31 5 24
To sort them you can use either sort() or order(). sort() returns the sorted data, whereas order() returns the position of each element in the sorted sequence, not the values. Hence we need to index the original vector with those positions, using square brackets, to get the sorted outcome.
lapply(as.vector(alist1),function(x)x[order(x, decreasing = TRUE)])
I have used lapply instead of sapply just to force the result to be a list. You are free to choose whichever suits your need.
Will return:
#> lapply(as.vector(alist1),function(x)x[order(x, decreasing = TRUE)])
#$a
# [1] 99 97 96 94 93 92 80 77 69 68 67 66 59 55 51 49 44 43 39
#[20] 34 31 29 25 24 23 17 16 10 7 5
I hope this clarifies your doubt. Thanks
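As a small self-contained check of the point above, the sketch below uses a made-up list of start positions (the names starts, positions, by_order and by_sort are only illustrative) to show that order() yields positions while sort() and x[order(x)] yield the same sorted values.

```r
# Made-up start positions standing in for the IRanges starts
starts <- list(a = c(9, 1, 27, 3), b = c(52, 34, 47))

# order() alone returns positions, which is what the question observed
positions <- lapply(starts, order, decreasing = TRUE)
positions$a   # 3 1 4 2

# Indexing with those positions, or calling sort(), returns the values
by_order <- lapply(starts, function(x) x[order(x, decreasing = TRUE)])
by_sort  <- lapply(starts, sort, decreasing = TRUE)

by_sort$a                     # 27 9 3 1
identical(by_order, by_sort)  # TRUE
```

When only the sorted values are needed, sort() is the shorter form; order() is more useful when the same ordering has to be applied to several parallel vectors.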

Dealing with date format in zoo

I've a csv data file with the following formats
Stock prices over the period of Jan 1, 2015 to Sep 26, 2017
Now I use the following code to import the data as zoo object:
sensexzoo1<- read.zoo(file = "/home/bidyut/Downloads/SENSEX.csv",
format="%d-%B-%Y", header=T, sep=",")
It produces the following error:
Error in read.zoo(file = "/home/bidyut/Downloads/SENSEX.csv", format =
"%d-%B-%Y", : index has 679 bad entries at data rows: 1 2 3 4 5 6
7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30
31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53
54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76
77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99
100 ...
What is wrong with this? Please suggest a fix.
The problem is the mismatch between the header and the data. The header line has 5 fields and the remaining lines of the file have 6 fields:
head(count.fields("SENSEX.csv", sep = ","))
## [1] 5 6 6 6 6 6
When that happens it assumes that the first field of the data is the row names so by default the next field (which in fact contains the Open data) is assumed to be the time index.
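The row-name behaviour described above can be reproduced with base R alone. The sketch below writes two data rows from the sample file to a temporary file (the temp file and the variable names are only for illustration) and reads them with read.table, which is assumed here to be what read.zoo relies on for parsing.

```r
# Write a tiny file with the same shape: 5 header fields, 6 data fields
csv <- tempfile(fileext = ".csv")
writeLines(c(
  "Date,Open,High,Low,Close",
  "1-January-2015,27485.77,27545.61,27395.34,27507.54,",
  "2-January-2015,27521.28,27937.47,27519.26,27887.90,"
), csv)

n_fields <- count.fields(csv, sep = ",")
n_fields          # header has one field fewer than the data rows

df <- read.table(csv, header = TRUE, sep = ",")
rownames(df)      # the dates have become row names
df$Date           # the column labelled Date actually holds the Open prices
df$Close          # all NA, from the empty field after the trailing comma
```

Because the column labelled Date contains prices rather than dates, parsing it with format "%d-%B-%Y" fails for every row, which matches the "bad entries" error.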
We can address this in several alternative ways:
1) The easiest way to fix this is to add a field called Volume, say, to the header so that the header looks like this:
Date,Open,High,Low,Close,Volume
2) If you have many files of this format so that it is not feasible to modify them we can read the data in without the headers and then add them on in a second pass. The [, -5] drops the column of NAs and the [-1] on the second line drops the Date header.
z <- read.zoo("SENSEX.csv", format="%d-%B-%Y", sep = ",", skip = 1)[, -5]
names(z) <- unlist(read.table("SENSEX.csv", sep = ",", nrow = 1))[-1]
giving:
> head(z)
Open High Low Close
2015-01-01 27485.77 27545.61 27395.34 27507.54
2015-01-02 27521.28 27937.47 27519.26 27887.90
2015-01-05 27978.43 28064.49 27786.85 27842.32
2015-01-06 27694.23 27698.93 26937.06 26987.46
2015-01-07 26983.43 27051.60 26776.12 26908.82
2015-01-08 27178.77 27316.41 27101.94 27274.71
3) A third approach is to read the file in as text, use R to append ",Volume" to the first line and then read the text with read.zoo:
Lines <- readLines("SENSEX.csv")
Lines[1] <- paste0(Lines[1], ",Volume")
z <- read.zoo(text = Lines, header = TRUE, sep = ",", format="%d-%B-%Y")
Note: The first few lines of SENSEX.csv are shown below to make this self-contained (not dependent on the link in the question which could disappear in the future):
Date,Open,High,Low,Close
1-January-2015,27485.77,27545.61,27395.34,27507.54,
2-January-2015,27521.28,27937.47,27519.26,27887.90,
5-January-2015,27978.43,28064.49,27786.85,27842.32,
6-January-2015,27694.23,27698.93,26937.06,26987.46,
7-January-2015,26983.43,27051.60,26776.12,26908.82,
8-January-2015,27178.77,27316.41,27101.94,27274.71,
9-January-2015,27404.19,27507.67,27119.63,27458.38,
12-January-2015,27523.86,27620.66,27323.74,27585.27,
13-January-2015,27611.56,27670.19,27324.58,27425.73,
14-January-2015,27432.14,27512.80,27203.25,27346.82,

In R topicmodels package, how could we get the topics' distributions over terms?

I'm running LDA by using topicmodels package.
lda.model = LDA(dtm, k,control = list(em = list(iter.max = 1000, tol = 10^-4)))
apps.terms<-terms(lda.model,15)
head(apps.terms)
Topic.1 Topic.2 Topic.3 Topic.4 Topic.5
1 38 55 187 38 38
2 40 38 171 40 35
3 55 35 178 56 44
4 49 49 74 35 55
5 35 44 177 190 52
6 44 53 80 55 49
This code gets the top 15 terms of each topic, ordered by their proportion. If I understand the LDA algorithm correctly, each topic is a distribution over terms, so I want to know the exact distribution over these terms. For example, Topic.1 is 30% related to 38, 20% related to 40, etc. Is there any way to get it using the topicmodels package?
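It sounds like you want the posterior probabilities. posterior() returns a list: its terms component holds each topic's distribution over terms, and its topics component holds each document's distribution over topics.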
lda.inf <- posterior(lda.model,dtm)
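A minimal sketch of reading the topic-term distributions out of that result, assuming the topicmodels package is installed. It uses the AssociatedPress data shipped with the package rather than the asker's dtm, and a small k and document subset chosen only to keep the example quick.

```r
library(topicmodels)

# Example data bundled with topicmodels (stands in for the asker's dtm)
data("AssociatedPress", package = "topicmodels")

# Fit a small model; k = 2 and the 20-document subset are arbitrary choices
lda_small <- LDA(AssociatedPress[1:20, ], k = 2, control = list(seed = 1))

post <- posterior(lda_small)
topic_term <- post$terms   # one row per topic, one column per term

dim(topic_term)
rowSums(topic_term)        # each topic's probabilities sum to 1

# The five most probable terms of topic 1, with their probabilities
head(sort(topic_term[1, ], decreasing = TRUE), 5)
```

The same indexing on the asker's model would be posterior(lda.model)$terms, with terms(lda.model, 15) giving only the ranking and not the probabilities.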

Overlay two differently formatted qplots in ggplot2

I have two scatterplots, based on different but related data, created using qplot() from ggplot2. (Learning ggplot hasn't been a priority because qplot has been sufficient for my needs up to now). What I want to do is superimpose/overlay the two charts so that the x,y data for each is plotted in the same plot space. The complication is that I want each plot to retain its formatting/aesthetics.
The data in question are row and column scores from correspondence analysis - corresp() from MASS - so the number of data rows (i.e. samples or taxa) differs between the two datasets. I can plot the two score sets together easily, either by combining the two datasets or, even easier, just using the biplot() function.
However, I have been using qplot to get the plots looking exactly as I need them; with samples plotted as colour-coded symbols and taxa as labels:
PlotSample <- qplot(DataCorresp$rscore[,1], DataCorresp$rscore[,2],
colour=factor(DataAll$ColourCode)) +
scale_colour_manual(values = c("black","darkgoldenrod2",
"deepskyblue2","deeppink2"))
and
PlotTaxa <- qplot(DataCorresp$cscore[,1], DataCorresp$cscore[,2],
label=colnames(DataCorresp), size=10, geom="text")
Can anyone suggest a way by which either
the two plots (PlotSample and PlotTaxa) can be superimposed atop of each other,
the two datasets (DataCorresp$rscore and DataCorresp$cscore) can be plotted together but formatted in their different ways, or
another function (e.g. biplot()) that could be used to achieve my aim.
Example of workflow using an extremely simplified and made-up dataset:
> require(MASS)
> require(ggplot2)
> alldata<-read.csv("Fake data.csv",header=T,row.name=1)
> selectdata<-alldata[,2:10]
> alldata
Period Species.1 Species.2 Species.3 Species.4 Species.5 Species.6
Sample-1 Early 50 87 97 12 60 49
Sample-2 Early 41 90 36 52 36 27
Sample-3 Early 87 56 82 45 56 13
Sample-4 Early 37 47 78 29 53 34
Sample-5 Early 58 70 34 35 8 21
Sample-6 Early 94 82 48 16 27 26
Sample-7 Early 91 69 50 57 24 13
Sample-8 Early 63 38 86 20 28 11
Sample-9 Middle 4 19 55 99 86 38
Sample-10 Middle 29 25 10 93 37 54
Sample-11 Middle 48 12 59 73 39 92
Sample-12 Middle 31 6 34 81 39 54
Sample-13 Middle 29 40 26 52 34 84
Sample-14 Middle 1 46 15 97 67 41
Sample-15 Late 43 47 30 18 60 23
Sample-16 Late 45 10 49 2 2 45
Sample-17 Late 14 8 51 36 58 51
Sample-18 Late 41 51 32 47 23 43
Sample-19 Late 43 17 6 54 4 12
Sample-20 Late 20 25 1 29 35 2
Species.7 Species.8 Species.9
Sample-1 41 39 57
Sample-2 59 4 45
Sample-3 10 56 5
Sample-4 59 30 39
Sample-5 9 29 57
Sample-6 29 24 35
Sample-7 22 4 42
Sample-8 31 19 40
Sample-9 17 7 57
Sample-10 6 9 29
Sample-11 34 20 0
Sample-12 56 41 59
Sample-13 6 31 13
Sample-14 25 12 28
Sample-15 60 75 84
Sample-16 32 69 34
Sample-17 48 53 56
Sample-18 80 86 46
Sample-19 50 70 82
Sample-20 57 84 70
> selectca<-corresp(selectdata,nf=5)
> biplot(selectca,cex=c(0.6,0.6))
> PlotSample <- qplot(selectca$rscore[,1], selectca$rscore[,2], colour=factor(alldata$Period) )
> PlotTaxa<-qplot(selectca$cscore[,1], selectca$cscore[,2], label=colnames(selectdata), size=10, geom="text")
The biplot will produce this plot: /r/10wk1a8/5
The PlotSample appears as such: /r/i29cba/5
The PlotTaxa appears as such: /r/245bl9d/5
EDIT: I don't have enough rep to post pictures, and tinypic links are not accepted (despite https://meta.stackexchange.com/questions/60563/how-to-upload-images-on-stack-overflow). So if you add tinypic's URL to the start of those codes above you'll get there.
Essentially I want to create the biplot plot but with samples colour coded as they are in PlotSample.
Have a look at Gavin Simpson's ggvegan package!
require(vegan)
require(ggvegan)
# some data
data(dune)
# CA
mod <- cca(dune)
# plot
autoplot(mod, geom = 'text')
For finer control (or if you want to stick with corresp()), you may also want to take a look at the code of the two functions involved: fortify.cca (which wraps the data in the cca object into a format usable by ggplot) and autoplot.cca, which creates the plot.
If you want to do it from scratch, you'll have to wrap both scores (sites and species) into one data.frame (see how fortify.cca does this and extract the relevant values from the corresp() object) and use this to build the plot.
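A rough sketch of the from-scratch route, assuming MASS and ggplot2 are installed. The counts are randomly generated stand-ins for the asker's data, and instead of one combined data.frame it uses two data frames, each feeding its own layer, which is one simple way to keep separate formatting for samples and taxa.

```r
library(MASS)
library(ggplot2)

# Made-up counts: 20 samples x 9 species, three periods (illustrative only)
set.seed(42)
counts <- matrix(rpois(20 * 9, lambda = 40), nrow = 20,
                 dimnames = list(paste0("Sample-", 1:20),
                                 paste0("Species.", 1:9)))
period <- factor(rep(c("Early", "Middle", "Late"), times = c(8, 6, 6)),
                 levels = c("Early", "Middle", "Late"))

ca <- corresp(counts, nf = 2)

# One data frame per score set, each with its own aesthetics
samples <- data.frame(dim1 = ca$rscore[, 1], dim2 = ca$rscore[, 2],
                      period = period)
taxa <- data.frame(dim1 = ca$cscore[, 1], dim2 = ca$cscore[, 2],
                   label = colnames(counts))

# Samples as colour-coded points, taxa as text labels, in one plot space
p <- ggplot() +
  geom_point(data = samples, aes(x = dim1, y = dim2, colour = period)) +
  geom_text(data = taxa, aes(x = dim1, y = dim2, label = label), size = 3) +
  scale_colour_manual(values = c("black", "darkgoldenrod2", "deepskyblue2"))
print(p)
```

Each layer takes its own data argument, so the two score sets never need the same number of rows.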

In R: Indexing vectors by boolean comparison of a value in range: index==c(min : max)

In R, let's say we have a vector
area = c(rep(c(26:30), 5), rep(c(500:504), 5), rep(c(550:554), 5), rep(c(76:80), 5)) and another vector yield = c(1:100).
Now, say I want to index like so:
> yield[area==27]
[1] 2 7 12 17 22
> yield[area==501]
[1] 27 32 37 42 47
No problem, right? But weird things start happening when I try to index it by using c(A, B). (and even weirder when I try c(min:max) ...)
> yield[area==c(27,501)]
[1] 7 17 32 42
What I'm expecting is of course the instances that are present in both of the other examples, not just some weird combination of them. This works when I can use the pipe OR operator:
> yield[area==27 | area==501]
[1] 2 7 12 17 22 27 32 37 42 47
But what if I'm working with a range? Say I want to index it by the range c(27:503)? In my real example there are many more data points and ranges, so please don't suggest I do it by hand, which would essentially mean:
yield[area==27 | area==28 | area==29 | ... | area==303 | ... | area==502 | area==503]
There must be a better way...
You want to use %in%. Also notice that c(27:503) and 27:503 yield the same object.
> yield[area %in% 27:503]
[1] 2 3 4 5 7 8 9 10 12 13 14 15 17
[14] 18 19 20 22 23 24 25 26 27 28 29 31 32
[27] 33 34 36 37 38 39 41 42 43 44 46 47 48
[40] 49 76 77 78 79 80 81 82 83 84 85 86 87
[53] 88 89 90 91 92 93 94 95 96 97 98 99 100
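As for why the == version gave the odd result: == compares element by element and recycles the shorter vector, so area is compared against 27, 501, 27, 501, ... in turn. The sketch below, using the vectors from the question (the name recycled is only illustrative), makes that explicit and contrasts it with %in% and a plain range test.

```r
area  <- c(rep(26:30, 5), rep(500:504, 5), rep(550:554, 5), rep(76:80, 5))
yield <- 1:100

# == recycles c(27, 501) along area: odd positions vs 27, even vs 501
recycled <- rep(c(27, 501), length.out = length(area))
identical(area == c(27, 501), area == recycled)   # TRUE
yield[area == c(27, 501)]                         # 7 17 32 42

# %in% tests membership of each element in the whole set
yield[area %in% c(27, 501)]   # 2 7 12 17 22 27 32 37 42 47

# For a contiguous numeric range, two comparisons give the same selection
identical(yield[area %in% 27:503], yield[area >= 27 & area <= 503])  # TRUE
```

The comparison form also works for non-integer values, where building the set with 27:503 would miss anything between whole numbers.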
Why not use subset?
subset(yield, area > 26 & area < 504) ## for indexes
subset(area, area > 26 & area < 504) ## for values
