I want to preprocess my texts before analysis.
mydat
Production of banners 1,2x2, Cutting
Production of a plate with the size 2330 * 600mm
Delivery
Placement of advertising information on posters 0.85 * 0.65 at Ordzhonikidze Street (TSUM) -Gerzen, side A2 April 2014
Manufacturing of a banner 3,7х2,7
Placement of advertising information on the prismatron 3 * 4 at 60, Ordzhonikidze, Aldjonikidze Street, A (01.12.2011-14.12.2011)
Placement of advertising information on the multipanel 3 * 12 at Malygina-M.Torez street, side A, (01.12.2011-14.12.2011)
Designer services
41526326
12
Mounting and rolling of the RIM on the prismatron 3 * 6
The code:
library(tm)

mydat <- read.csv("C:/kr_csv.csv", sep = ";", dec = ",")
tw.corpus <- Corpus(VectorSource(mydat$descr))
tw.corpus <- tm_map(tw.corpus, removePunctuation)
tw.corpus <- tm_map(tw.corpus, removeNumbers)
tw.corpus <- tm_map(tw.corpus, content_transformer(tolower))
tw.corpus <- tm_map(tw.corpus, stemDocument)

# deleting empty documents
doc.m <- DocumentTermMatrix(tw.corpus)
rowTotals <- apply(doc.m, 1, sum)  # find the total number of words in each document
doc.m.new <- doc.m[rowTotals > 0, ]
1. How do I find out which observations were deleted during preprocessing (for example, that the first and second texts were deleted)?
2. How do I delete those observations from the original dataset (mydat)?
After pre-processing and stemming your corpus, you are counting the number of words that are left in each document. Naturally, the "documents" with no words in them have a count of zero. The documents containing only numbers and punctuation also end up empty, because you removed those characters.
In your data, you have many "documents" that are empty lines. In total, you have 28 "documents" in your corpus, but more than half of them are empty lines (i.e. they contain zero words).
You calculate the word-count for each document in rowTotals. If you check which of the entries in rowTotals are equal to zero, you would get the document numbers that are subsequently removed from doc.m:
rowTotals
# 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28
# 3 5 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 10 2 8 8 2 0 0 0 7
You can see that documents 4 through 19 and 25 through 27 all contain zero words, and are therefore not present in doc.m. You can get these numbers automatically with which():
which(rowTotals == 0)
# [1] 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 25 26 27
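That covers the first question. For the second, you can reuse the same logical index to drop those rows from the original dataset; a minimal sketch, assuming the rows of mydat line up one-to-one with the documents in the corpus:
mydat.new <- mydat[rowTotals > 0, ]  # keep only the rows whose documents still contain words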
I found a very useful piece of code on Stack Overflow, Finding 2 & 3 word Phrases Using R TM Package (credit @patrick perry), to show the frequency of 2- and 3-word phrases within a corpus:
library(corpus)
corpus <- gutenberg_corpus(55) # Project Gutenberg #55, _The Wizard of Oz_
text_filter(corpus)$drop_punct <- TRUE # ignore punctuation
term_stats(corpus, ngrams = 2:3)
## term count support
## 1 of the 336 1
## 2 the scarecrow 208 1
## 3 to the 185 1
## 4 and the 166 1
## 5 said the 152 1
## 6 in the 147 1
## 7 the lion 141 1
## 8 the tin 123 1
## 9 the tin woodman 114 1
## 10 tin woodman 114 1
## 11 i am 84 1
## 12 it was 69 1
## 13 in a 64 1
## 14 the great 63 1
## 15 the wicked 61 1
## 16 wicked witch 60 1
## 17 at the 59 1
## 18 the little 59 1
## 19 the wicked witch 58 1
## 20 back to 57 1
## ⋮ (52511 rows total)
How do you ensure that frequency counts of phrases like "the tin" are not also included in the frequency count of "the tin woodman" or "tin woodman"?
Thanks
Removing stopwords can remove noise from the data, helping with issues such as the one you are having above:
library(tm)
library(corpus)
library(dplyr)
library(stringr)  # for str_extract() below

corpus <- Corpus(VectorSource(gutenberg_corpus(55)))
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeWords, stopwords("english"))

term_stats(corpus, ngrams = 2:3) %>%
  arrange(desc(count)) %>%
  group_by(grp = str_extract(as.character(term), "\\w+\\s+\\w+")) %>%
  mutate(count_unique = ifelse(length(unique(count)) > 1, max(count) - min(count), count)) %>%
  ungroup() %>%
  select(-grp)
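If you also want to discount nested phrases directly, which is what the question asks about, one option is to subtract each 3-gram's count from the counts of the two 2-grams it contains. A rough base-R sketch over a term_stats() result (assuming stats holds the term/count columns shown above; note that overlapping 3-grams can double-subtract, so treat this as a starting point rather than a drop-in answer):
stats <- term_stats(corpus, ngrams = 2:3)
n_words <- lengths(strsplit(as.character(stats$term), " "))
bi  <- stats[n_words == 2, ]  # 2-grams
tri <- stats[n_words == 3, ]  # 3-grams
for (k in seq_len(nrow(tri))) {
  w <- strsplit(as.character(tri$term[k]), " ")[[1]]
  for (sub2 in c(paste(w[1], w[2]), paste(w[2], w[3]))) {
    hit <- bi$term == sub2
    bi$count[hit] <- bi$count[hit] - tri$count[k]  # drop occurrences already counted in the 3-gram
  }
}
After this, the count for "the tin" keeps only the occurrences that are not part of a 3-gram such as "the tin woodman".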
I'm trying to map column values of a data.frame object (consisting of a large number of bilateral trade records among 161 countries) to a 161 x 161 adjacency matrix (also of class data.frame), such that each cell represents the dyadic trade flow between two countries.
The data look like this:
# load the data from dropbox folder
library(foreign)
example_data <- read.csv("https://www.dropbox.com/s/hf0ga22tdjlvdvr/example_data.csv?dl=1")
head(example_data, n = 10)
rid pid TradeValue
1 2 3 500
2 2 7 2328
3 2 8 2233465
4 2 9 81470
5 2 12 572893
6 2 17 488374
7 2 19 3314932
8 2 23 20323
9 2 25 10
10 2 29 9026220
length(unique(example_data$rid))
[1] 139
length(unique(example_data$pid))
[1] 161
where rid is the reporter id and pid is the (trade) partner id; a country's rid and pid are the same. Each id in the rid column is matched with multiple rows in the pid column, each carrying a TradeValue.
However, there are some problems with this data. First, countries (usually developing countries) that did not report trade statistics have no data to extract, so their ids are absent from the rid column (such as country 1). On the other hand, those country ids may still enter the pid column through other countries' reporting (in which case the reporters tend to be developed countries). Hence, the rid column contains only some of the country ids (139 out of 161), while the pid column has all 161.
What I'm attempting to do is map this example_data dataframe to a 161 x 161 adjacency matrix, using rid for rows and pid for columns, where each cell represents the TradeValue between two country ids. To this end, there are a couple of things I need to tackle:
Fill in the country ids that are missing from the rid column of example_data and, temporarily, set all cell values in their rows to 0.
After the previous step, impute those "0" cells using the bilateral trade statistics reported by other countries; if the corresponding statistics are still unavailable, leave the "0" cells as they are.
For example, for a 5-country dataframe of the following form
rid pid TradeValue
2 1 50
2 3 45
2 4 7
2 5 18
3 1 24
3 2 45
3 4 88
3 5 12
5 1 27
5 2 18
5 3 12
5 4 92
The desired output should look like this
pid_1 pid_2 pid_3 pid_4 pid_5
rid_1 0 50 24 0 27
rid_2 50 0 45 7 18
rid_3 24 45 0 88 12
rid_4 0 7 88 0 92
rid_5 27 18 12 92 0
but off the top of my head, I could not figure out how. It would be really appreciated if someone could help me with this.
df1$rid <- factor(df1$rid, levels = 1:5, labels = paste("rid", 1:5, sep = "_"))
df1$pid <- factor(df1$pid, levels = 1:5, labels = paste("pid", 1:5, sep = "_"))
data.table::dcast(df1, rid ~ pid, fill = 0, drop = FALSE, value.var = "TradeValue")
# rid pid_1 pid_2 pid_3 pid_4 pid_5
#1 rid_1 0 0 0 0 0
#2 rid_2 50 0 45 7 18
#3 rid_3 24 45 0 88 12
#4 rid_4 0 0 0 0 0
#5 rid_5 27 18 12 92 0
The secrets/tricks:
use factor variables to tell R which values are possible, as well as their order.
in data.table's dcast, use fill = 0 (fill zero where you have nothing) and drop = FALSE (make entries for factor levels that aren't observed).
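That covers step 1 (rows rid_1 and rid_4 now exist, filled with zeros). For step 2, imputing those zeros from the partners' reports, a minimal sketch, assuming a flow reported by either side can stand in for its missing mirror cell:
res <- data.table::dcast(df1, rid ~ pid, fill = 0, drop = FALSE, value.var = "TradeValue")
m <- as.matrix(res[, -1])        # numeric matrix, rows = rid, columns = pid
rownames(m) <- as.character(res$rid)
m <- pmax(m, t(m))               # fill each zero cell from its mirror cell, if reported
m
For the 5-country example this reproduces the desired symmetric output above; for the 161-country data, cells unreported by both sides simply stay 0.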
I have a list of barcodes with the format: AAACCTGAGCGTCAAG-1
The letters can be A, C, G, or T, and the number after the dash can be 1 to 16.
barcode = c('AAACCTGAGCGTCAAG-1',
'AAACCTGAGTACCGGA-1',
'AAACCTGCAGCTGCTG-1',
'AAACCTGCATCACGAT-3',
'AAACCTGCATTGGGCC-5',
'AAACCTGGTATAGTAG-10',
'AAACCTGGTCGCGTGT-1',
'AAACCTGGTTTCCACC-16',
'AAACCTGTCATGCATG-14',
'AAACCTGTCGCAGGCT-15',
'AAACGGGAGAACTCGG-1')
cluster = c(6,3,6,16,17,11,14,18,9,8,14)
df <- data.frame(Barcode = barcode, Cluster = cluster)
I need to subset this dataframe based on the -# at the end of the barcode. I have been using the following to subset the dataframe. The problem is that it works for every number except 1.
> df[grep("([ACGT]-10){1}", df$Barcode),]
Barcode Cluster
6 AAACCTGGTATAGTAG-10 11
When I use the following, it will include all the barcodes that end in -1, as well as -10, -11, -12, -13, -14, -15 and -16.
> df[grep("([ACGT]-1){1}", df$Barcode),]
Barcode Cluster
1 AAACCTGAGCGTCAAG-1 6
2 AAACCTGAGTACCGGA-1 3
3 AAACCTGCAGCTGCTG-1 6
6 AAACCTGGTATAGTAG-10 11
7 AAACCTGGTCGCGTGT-1 14
8 AAACCTGGTTTCCACC-16 18
9 AAACCTGTCATGCATG-14 9
10 AAACCTGTCGCAGGCT-15 8
11 AAACGGGAGAACTCGG-1 14
Is there a regex that will include barcodes ending in -1, but exclude all other barcodes that end in numbers from 10 - 16?
I want to subset the dataframe so that I only get this:
Barcode Cluster
1 AAACCTGAGCGTCAAG-1 6
2 AAACCTGAGTACCGGA-1 3
3 AAACCTGCAGCTGCTG-1 6
7 AAACCTGGTCGCGTGT-1 14
11 AAACGGGAGAACTCGG-1 14
Thanks!
How about:
df[grep("-1$", df$Barcode),]
This matches a literal "-" immediately followed by 1 at the end of the string. Barcodes ending in -10 through -16 don't match: either the last character isn't 1 (-10, -12, ...), or the final 1 is preceded by another digit rather than the dash (-11).
Barcode Cluster
1 AAACCTGAGCGTCAAG-1 6
2 AAACCTGAGTACCGGA-1 3
3 AAACCTGCAGCTGCTG-1 6
7 AAACCTGGTCGCGTGT-1 14
11 AAACGGGAGAACTCGG-1 14
I think you can just use df[grep("([ACGT]-1$){1}", df$Barcode),]
You can just use a $ to anchor the match at the end of the string. See more information on "pattern" use here: http://www.jdatalab.com/data_science_and_data_mining/2017/03/20/regular-expression-R.html
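Alternatively, you can sidestep the regex anchoring altogether by extracting the suffix after the dash and comparing it exactly; a small base-R sketch:
# strip everything up to and including the last "-" and keep exact matches
df[sub(".*-", "", df$Barcode) == "1", ]
sub(".*-", "", x) is greedy, so it removes everything through the final dash, leaving just the trailing number for an exact comparison.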
I am working on an assignment problem in R. I have the following dataframe:
cycle_time TAT ready_for_next ITV_no
2 10 12 0
4 12 16 0
6 13 19 0
8 11 19 0
10 15 25 0
12 17 29 0
14 13 27 0
16 13 29 0
18 12 30 0
20 16 36 0
22 13 35 0
24 12 36 0
26 15 41 0
28 14 42 0
30 17 47 0
My desired dataframe would be
cycle_time TAT ready_for_next ITV_no wait_time
2 10 12 1 0
4 12 16 2 0
6 13 19 3 0
8 11 19 4 0
10 15 25 5 0
12 17 29 1 0
14 13 27 6 0
16 13 29 2 0
18 12 30 3 1
20 16 36 4 1
22 13 35 5 3
24 12 36 6 3
26 15 41 2 3
28 14 42 3 2
30 17 47 5 5
cycle_time = crane cycle time
TAT(in mins) = turn around time of truck
ready_for_next(in mins) = ready to take next container
ITV_no = ITV no to be assigned for that job
***There are only 6 unique trucks available***
The idea here is to assign trucks such that waiting time is minimized.
In the first five observations, trucks 1-5 are assigned.
For the next container, i.e. row 6 (at the 12th min), ITV_no 1 is coming back from its job, so it gets assigned to this job.
At the 7th observation (14th min) there are no trucks available, so we have to assign a new truck (ITV_no 6).
At the 8th observation (16th min) ITV_no 2 is coming back from its job, so it gets assigned to this job, and so on.
If no trucks are available, the job has to wait until the nearest truck comes back.
How can I implement this in R?
I have built some logic:
cycle_time <- c(2,4,6,8,10,12,14,16,18,20,22,24,26,28,30)
ITV_no <- c(1,2,3,4,5,6,7)
temp <- c()
TAT <- c(10,12,13,11,15,17,13,13,12,16,13,12,15,14,17)
ready_for_next <- cycle_time + TAT
assignment <- data.frame(cycle_time,TAT,ready_for_next)
assignment$ITV_no <- 0
for (i in 1:nrow(assignment)) {
  for (j in 1:length(ITV_no)) {
    assignment$ITV_no[i] <- ifelse(assignment$cycle_time <= assignment$ready_for_next, ITV_no[j],
                                   ifelse())  # incomplete: stuck here
    ## I am not able to update the count of trucks which are already assigned
    ## and which are free to be assigned
  }
}
Logic:
1. For the first row, assign truck 1 directly to that job.
2. For later rows, check whether cycle_time >= any previous ready_for_next value (i.e. some truck is already back); if yes, reuse that truck's ITV_no; if not, increment ITV_no and assign a new truck.
e.g.
For row 6, the cycle time (12) is compared with all previous ready_for_next values (25, 19, 19, 16, 12); it finds a match, so that truck's ITV_no (i.e. 1) is assigned to row 6.
For row 7, the cycle time (14) is compared with all previous ready_for_next values (25, 19, 19, 16); 12 is removed from the comparison because that truck has already been reassigned. There is no match, so a new truck is assigned to that job.
I have come up with a solution. It works with the sample data:
rm(list = ls())

df <- data.frame(qc_time = seq(2, 40, 2),
                 itv_tat = c(10,15,12,18,25,19,18,16,14,10,12,15,17,19,13,12,8,15,9,14))

itv_number_vec <- 0   # ids of the trucks brought into use so far
itvno_time <- list()  # itvno_time[[j]]: time at which truck j is next available

for (i in 1:nrow(df))
{
  #### Initialisation ####
  if (i == 1)
  {
    df$itv_available_time[i] <- df$qc_time[i] + df$itv_tat[i]
    itvno_time[[i]] <- df$itv_available_time[i]
    df$delay[i] <- 0
    df$itv_number[i] <- 1
    itv_number_vec <- 1
  }
  if (i != 1)
  {
    if (df$qc_time[i] >= min(unlist(itvno_time)))
    {
      # some truck is already back: reuse the lowest-numbered free one
      for (j in 1:length(itvno_time))
      {
        if (itvno_time[[j]] <= df$qc_time[i])
        {
          df$itv_number[i] <- j
          df$delay[i] <- 0
          df$itv_available_time[i] <- df$qc_time[i] + df$itv_tat[i]
          itvno_time[[j]] <- df$itv_available_time[i]
          break
        }
      }
    } else {
      if (max(itv_number_vec) < 7)
      {
        # no truck is free yet, but the fleet is not exhausted: bring in a new truck
        df$itv_number[i] <- max(itv_number_vec) + 1
        itv_number_vec <- c(itv_number_vec, max(itv_number_vec) + 1)
        df$delay[i] <- 0
        df$itv_available_time[i] <- df$qc_time[i] + df$itv_tat[i]
        itvno_time[[max(itv_number_vec)]] <- df$itv_available_time[i]
      } else {
        # all trucks are busy: wait for the earliest returning truck
        df$delay[i] <- min(unlist(itvno_time)) - df$qc_time[i]
        df$itv_number[i] <- which.min(itvno_time)
        df$itv_available_time[i] <- df$qc_time[i] + df$itv_tat[i] + df$delay[i]
        itvno_time[[which.min(itvno_time)]] <- df$itv_available_time[i]
      }
    }
  }
}
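The logic is greedy: reuse the lowest-numbered truck that is already back, otherwise bring in a new truck while the fleet limit allows, and otherwise wait for the earliest return. To inspect the result on the sample data (just a usage example):
df[, c("qc_time", "itv_tat", "itv_number", "delay", "itv_available_time")]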
I've got some R code that works and does what I want, but it takes a huge amount of time to run. Here is an explanation of what the code does, followed by the code itself.
I've got a vector of 200,000 lines containing street addresses (strings): data.
Example:
> data[150000,]
address
"15 rue andre lalande residence marguerite yourcenar 91000 evry france"
And I have a 131 x 2 matrix of string elements: 5-grams (parts of words) and the ids of the bags of n-grams they belong to (example of a bag of 5-grams: ["stack", "tacko", "ackov", "ckove", "overf", ...]): list_ngrams.
Example of list_ngrams:
idSac ngram
1 4 stree
2 4 tree_
3 4 _stre
4 4 treet
5 5 avenu
6 5 _aven
7 5 venue
8 5 enue_
I also have a 200000 x 31 numerical matrix initialized with 0: idv_x_sacs.
In total I have 131 5-grams and 31 bags of 5-grams.
I want to loop over the string addresses and check whether each one contains any of the n-grams in my list. If it does, I put a 1 in the corresponding column, which represents the id of the bag containing that 5-gram.
Example:
In the address "15 rue andre lalande residence marguerite yourcenar 91000 evry france", the word "residence" matches the bag ["resid", "eside", "dence", ...] whose id is 5, so I put a 1 in the column called 5. The corresponding line of the idv_x_sacs matrix will then look like the following:
> idv_x_sacs[150000,]
4 5 6 8 10 12 13 15 17 18 22 26 29 34 35 36 42 43 45 46 47 48 52 55 81 82 108 114 119 122 123
0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Here is the code that does this:
library(sqldf)

idv_x_sacs <- matrix(rep(0, nrow(data) * 31), nrow = nrow(data), ncol = 31)
colnames(idv_x_sacs) <- as.vector(sqldf("select distinct idSac from list_ngrams order by idSac"))$idSac
for (i in 1:nrow(idv_x_sacs))
{
  for (ngram in list_ngrams$ngram)
  {
    if (grepl(ngram, data[i, ]) == TRUE)
    {
      # look up the bag id of the matching n-gram and flag that column
      idSac <- sqldf(sprintf("select idSac from list_ngrams where ngram='%s'", ngram))[[1]]
      idv_x_sacs[i, as.character(idSac)] <- 1
    }
  }
}
The code does exactly what I want, but it takes about 18 hours to run, which is huge. I tried to recode it in C++ using the Rcpp library but encountered many problems, and I tried to recode it using apply but couldn't get it to work.
Here is what I did:
apply(cbind(data, 1:nrow(data)), 1, function(x) {
  apply(list_ngrams, 1, function(y) {
    # note: a plain <- here only modifies a local copy of the matrix
    if (grepl(y[2], x[1]) == TRUE) {
      idv_x_sacs[x[2], str_trim(as.character(y[1]))] <- 1
    }
  })
})
I need some help coding my loop with apply, or some other method that runs faster than the current one. Thank you very much.
Check this one, and run the simple example step by step to see how it works.
My n-grams don't make much sense, but it will work with actual n-grams as well.
library(dplyr)
library(reshape2)
# your example dataset
dt_sen = data.frame(sen = c("this is a good thing", "this is bad"), stringsAsFactors = F)
dt_ngr = data.frame(id_ngr = c(2,2,2,3,3,3),
ngr = c("th","go","tt","drf","ytu","bad"), stringsAsFactors = F)
# sentence dataset
dt_sen
sen
1 this is a good thing
2 this is bad
#ngrams dataset
dt_ngr
id_ngr ngr
1 2 th
2 2 go
3 2 tt
4 3 drf
5 3 ytu
6 3 bad
# create table of matches
expand.grid(unique(dt_sen$sen), unique(dt_ngr$id_ngr)) %>%
  data.frame() %>%
  rename(sen = Var1,
         id_ngr = Var2) %>%
  left_join(dt_ngr, by = "id_ngr") %>%
  group_by(sen, id_ngr, ngr) %>%
  do(data.frame(match = grepl(.$ngr, .$sen))) %>%
  group_by(sen, id_ngr) %>%
  summarise(sum_success = sum(match)) %>%
  mutate(match = ifelse(sum_success > 0, 1, 0)) -> dt_full
dt_full
Source: local data frame [4 x 4]
Groups: sen
sen id_ngr sum_success match
1 this is a good thing 2 2 1
2 this is a good thing 3 0 0
3 this is bad 2 1 1
4 this is bad 3 1 1
# reshape table
dt_full %>% dcast(., sen~id_ngr, value.var = "match")
sen 2 3
1 this is a good thing 1 0
2 this is bad 1 1
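Back on the original 200,000-row problem: the main cost in the first version is calling grepl() once per address/n-gram pair. Since grepl() is vectorised over its second argument, you can instead run one pass per n-gram over the whole address vector (131 passes instead of 200000 x 131 scalar calls). A sketch, assuming data is a one-column data frame of addresses and list_ngrams has the idSac and ngram columns shown above:
bag_ids <- sort(unique(list_ngrams$idSac))
idv_x_sacs <- matrix(0, nrow = nrow(data), ncol = length(bag_ids),
                     dimnames = list(NULL, as.character(bag_ids)))
for (k in seq_len(nrow(list_ngrams))) {
  hit <- grepl(list_ngrams$ngram[k], data[, 1], fixed = TRUE)  # fixed = TRUE: literal match, no regex
  idv_x_sacs[hit, as.character(list_ngrams$idSac[k])] <- 1
}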