Optimization of an R loop taking 18 hours to run

I've got R code that works and does what I want, but it takes a huge amount of time to run. Here is an explanation of what the code does, and the code itself.
I have a vector of 200,000 street addresses (strings): data.
Example:
> data[150000,]
address
"15 rue andre lalande residence marguerite yourcenar 91000 evry france"
And I have a 131x2 string matrix holding 5-grams (word fragments) and the ids of the bags of n-grams they belong to (example of a 5-gram bag: ["stack", "tacko", "ackov", "ckove", "kover", ...]): list_ngrams
Example of list_ngrams :
idSac ngram
1 4 stree
2 4 tree_
3 4 _stre
4 4 treet
5 5 avenu
6 5 _aven
7 5 venue
8 5 enue_
I also have a 200000x31 numeric matrix initialized with 0s: idv_x_bags
In total I have 131 5-grams and 31 bags of 5-grams.
I want to loop over the string addresses and check whether each one contains any of the n-grams in my list. If it does, I put a 1 in the corresponding column, which represents the id of the bag containing that 5-gram.
Example:
In this address, "15 rue andre lalande residence marguerite yourcenar 91000 evry france", the word "residence" matches the bag ["resid","eside","dence",...], whose id is 5. So I put a 1 in the column called 5, and the corresponding line of the "idv_x_bags" matrix looks like the following:
> idv_x_bags[150000,]
4 5 6 8 10 12 13 15 17 18 22 26 29 34 35 36 42 43 45 46 47 48 52 55 81 82 108 114 119 122 123
0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Here is the code that does :
idv_x_bags <- matrix(0, nrow = nrow(data), ncol = 31)
colnames(idv_x_bags) <- as.vector(sqldf("select distinct idSac from list_ngrams order by idSac"))$idSac
for (i in 1:nrow(idv_x_bags))
{
  for (ngram in list_ngrams$ngram)
  {
    if (grepl(ngram, data[i,]))
    {
      idSac <- sqldf(sprintf("select idSac from list_ngrams where ngram='%s'", ngram))[[1]]
      idv_x_bags[i, as.character(idSac)] <- 1
    }
  }
}
The code does exactly what I want, but it takes about 18 hours, which is huge. I tried to recode it in C++ with the Rcpp library but ran into many problems. I also tried to recode it using apply, but I couldn't get it to work.
Here is what I tried:
apply(cbind(data, 1:nrow(data)), 1, function(x){
  apply(list_ngrams, 1, function(y){
    if(grepl(y[2], x[1])){ idv_x_bags[x[2], str_trim(as.character(y[1]))] <- 1 }
  })
})
I need some help recoding my loop with apply or some other method that runs faster than the current one. Thank you very much.

Check this one, and run the simple example step by step to see how it works.
My n-grams don't make much sense, but it will work with actual n-grams as well.
library(dplyr)
library(reshape2)
# your example dataset
dt_sen = data.frame(sen = c("this is a good thing", "this is bad"), stringsAsFactors = F)
dt_ngr = data.frame(id_ngr = c(2,2,2,3,3,3),
ngr = c("th","go","tt","drf","ytu","bad"), stringsAsFactors = F)
# sentence dataset
dt_sen
sen
1 this is a good thing
2 this is bad
#ngrams dataset
dt_ngr
id_ngr ngr
1 2 th
2 2 go
3 2 tt
4 3 drf
5 3 ytu
6 3 bad
# create table of matches
expand.grid(unique(dt_sen$sen), unique(dt_ngr$id_ngr)) %>%
data.frame() %>%
rename(sen = Var1,
id_ngr = Var2) %>%
left_join(dt_ngr, by = "id_ngr") %>%
group_by(sen, id_ngr,ngr) %>%
do(data.frame(match = grepl(.$ngr,.$sen))) %>%
group_by(sen,id_ngr) %>%
summarise(sum_success = sum(match)) %>%
mutate(match = ifelse(sum_success > 0,1,0)) -> dt_full
dt_full
Source: local data frame [4 x 4]
Groups: sen
sen id_ngr sum_success match
1 this is a good thing 2 2 1
2 this is a good thing 3 0 0
3 this is bad 2 1 1
4 this is bad 3 1 1
# reshape table
dt_full %>% dcast(., sen~id_ngr, value.var = "match")
sen 2 3
1 this is a good thing 1 0
2 this is bad 1 1
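As a base-R alternative to the dplyr pipeline above, the matching can be vectorized directly: grepl() scans every address in a single call per n-gram, so the work drops to 131 passes over the vector instead of 200000 x 131 individual tests, and the per-hit sqldf lookup disappears entirely. A minimal sketch, assuming data is a character vector of addresses (the question's data[[1]]) and list_ngrams has the columns idSac and ngram; the toy values below are invented:

```r
# Toy inputs standing in for the question's 200000 addresses / 131 n-grams
data <- c("15 rue andre lalande residence marguerite yourcenar 91000 evry france",
          "3 avenue de la gare 75001 paris france")
list_ngrams <- data.frame(idSac = c(4, 4, 5, 5),
                          ngram = c("stree", "_stre", "avenu", "venue"),
                          stringsAsFactors = FALSE)

# One grepl() call per n-gram, each scanning the whole address vector:
# yields an n_addresses x n_ngrams logical matrix of hits.
hits <- sapply(list_ngrams$ngram, grepl, x = data, fixed = TRUE)

# Collapse the hit columns bag by bag: a bag scores 1 for an address
# if any of its n-grams matched that address.
bags <- unique(list_ngrams$idSac)
idv_x_bags <- sapply(bags, function(b)
  as.integer(rowSums(hits[, list_ngrams$idSac == b, drop = FALSE]) > 0))
colnames(idv_x_bags) <- bags
idv_x_bags
```

fixed = TRUE avoids regular-expression interpretation of the n-grams and is typically faster; with very many addresses, stringi::stri_detect_fixed() could speed the scan up further, but the rowSums collapse is the key change.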

Related

Problems to separate data

I have the FreqAnual table:
Fêmea Macho
Abril 3 0
Agosto 1 0
Dezembro 7 0
Fevereiro 6 4
Janeiro 6 4
Julho 1 0
Junho 5 0
Maio 3 0
Março 20 2
Novembro 4 1
Outubro 3 0
It comes from an Excel dataset that has a column "Mes", with a row for every record, and another column for sex, whose values are Fêmea and Macho.
I used FreqAnual <- table(Dados_procesados$Mes, Dados_procesados$Sexo).
Then I tried FreqJan <- Dados_Procesados[Mes == Janeiro, ], and also the version with $ before Mes, and just get this result:
FreqJan <- Dados_Procesados[Mes = Janeiro, ]
Error: object 'Dados_Procesados' not found
What can I do? The subtable approach didn't work either. I was expecting something like:
Fêmea Macho
Janeiro 6 4
I need it that way so I can run a G-test monthly to find the sex ratio and whether there are significant differences.
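For what it's worth, a hedged sketch of the subsetting step: the result of table() can be row-indexed by its dimnames, and the month name must be a quoted string (the unquoted Janeiro is looked up as an object; note also the capitalisation mismatch between Dados_procesados and Dados_Procesados in the question). The values below are invented stand-ins for the Excel data (accents dropped for portability):

```r
# Invented stand-in for the question's Excel data
Dados_procesados <- data.frame(
  Mes  = c("Janeiro", "Janeiro", "Fevereiro", "Janeiro", "Fevereiro"),
  Sexo = c("Femea", "Macho", "Femea", "Macho", "Femea"),
  stringsAsFactors = FALSE
)
FreqAnual <- table(Dados_procesados$Mes, Dados_procesados$Sexo)

# Subset one month by row name; drop = FALSE keeps the two-dimensional
# table shape instead of collapsing it to a bare vector.
FreqJan <- FreqAnual["Janeiro", , drop = FALSE]
FreqJan
```

The one-row table keeps both sex columns, which is the shape needed for a monthly G-test.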

R - Perform two calculations to different parts of a loop

OK, so I have written a loop which is intended to perform, for the first part of the loop, a multiplication of two columns, and then, for the remainder of the loop, another multiplication using two different columns.
The columns for multiplication are ocret and clret, each multiplied against response.
My code for this:
train.set$output[[1]] = if (train.set$response[[1]] == 1) {
  apply(train.set[, c('ocret', 'response')], 1, function(x) { (x[1]*x[2]) })
}
for (i in 2:nrow(train.set)){
  train.set$output[i] = if (train.set$response[i] == 1) {
    apply(train.set[, c('clret', 'response')], 1, function(x) { (x[1]*x[2]) })
    train.set$output[i-1]
  }
}
The idea was that on first finding a response == 1, it would perform the ocret * response calculation.
The second part of the loop starts on row 2, so as not to overwrite the first part, and continues down the rows performing the clret * response calculation.
The logic makes sense to me, and this is pretty much my first attempt at a loop. When I run the code, nothing happens: it doesn't create the output column. Can anyone give me any pointers? I keep re-reading it and it makes sense to me, so I'm not sure what I'm missing. Any explanation greatly appreciated.
Example data frame and output below:
ocret clret response output
1 0.00730616 0.003382433 0 0
2 -0.084899894 -0.088067766 0 0
3 0.047208568 0.054174679 1 0.047208568
4 -0.002671414 -0.004543992 0 0
5 -0.039943462 -0.040290793 0 0
6 -0.01428499 -0.013506524 0 0
7 -0.037054965 -0.038517845 0 0
8 -0.058027611 -0.057394837 1 -0.058027611
9 -0.004014491 -0.011332705 1 -0.011332705
10 -0.079419682 -0.076167096 1 -0.076167096
11 -0.003424577 -0.011759287 1 -0.011759287
12 0.099260455 0.115800375 1 0.115800375
13 -0.011841897 -0.005322141 1 -0.005322141
14 -0.087230999 -0.090349775 1 -0.090349775
15 0.040570359 0.042507445 1 0.042507445
16 -0.001846555 -0.006212821 1 -0.006212821
17 0.044398056 0.047684898 1 0.047684898
18 -0.025856823 -0.030799705 0 0
19 -0.057677505 -0.061012471 0 0
20 0.010043567 0.012634046 0 0
21 -0.020609404 -0.034511205 0 0
Line 3: ocret * response
Line 8: ocret * response
Lines 9 to 16: clret * response
A for loop may not be required. We can use dplyr and data.table to get the desired output (dt2).
library(dplyr)
library(data.table)
dt2 <- dt %>%
mutate(RunID = rleid(response)) %>%
group_by(RunID) %>%
mutate(output = ifelse(response == 0, 0,
ifelse(row_number() == 1, ocret, clret))) %>%
ungroup() %>%
select(-RunID)
Data Preparation
dt <- read.table(text = " ocret clret response
1 0.00730616 0.003382433 0
2 -0.084899894 -0.088067766 0
3 0.047208568 0.054174679 1
4 -0.002671414 -0.004543992 0
5 -0.039943462 -0.040290793 0
6 -0.01428499 -0.013506524 0
7 -0.037054965 -0.038517845 0
8 -0.058027611 -0.057394837 1
9 -0.004014491 -0.011332705 1
10 -0.079419682 -0.076167096 1
11 -0.003424577 -0.011759287 1
12 0.099260455 0.115800375 1
13 -0.011841897 -0.005322141 1
14 -0.087230999 -0.090349775 1
15 0.040570359 0.042507445 1
16 -0.001846555 -0.006212821 1
17 0.044398056 0.047684898 1
18 -0.025856823 -0.030799705 0
19 -0.057677505 -0.061012471 0
20 0.010043567 0.012634046 0
21 -0.020609404 -0.034511205 0",
header = TRUE, stringsAsFactors = FALSE)

Print variable names in table() with 2 binary variables in R

I'm sure I'll kick myself for not being able to figure this out, but when you have a table with 2 variables (i.e. cross-tab) and both are binary or otherwise have the same levels, how can you make R show which variable is displayed row-wise and which is column-wise?
For example:
> table(tc$tr, tc$fall_term)
0 1
0 1569 538
1 0 408
is a little confusing because it's not immediately obvious which is which. Of course, I checked out ?table but I don't see an option to do this, at least not a logical switch that doesn't require me to already know which is which.
I tried ftable but had the same problem.
The output I want would be something like this:
> table(tc$tr, tc$fall_term)
tr tr
0 1
fallterm 0 1569 538
fallterm 1 0 408
or
> table(tc$tr, tc$fall_term)
fallterm fallterm
0 1
tr 0 1569 538
tr 1 0 408
You can use the dnn option:
table(df$tr,df$fall_term) # impossible to tell the difference
0 1
0 18 33
1 15 34
table(df$tr,df$fall_term,dnn=c('tr','fall_term')) # you have the names
fall_term
tr 0 1
0 18 33
1 15 34
Note that it's easier (and safer) to do table(df$tr,df$fall_term,dnn=colnames(df))
Check out dimnames, and in particular their names. I’m using another example here since I don’t have your data:
x = HairEyeColor[, , Sex = 'Male']
names(dimnames(x))
# [1] "Hair" "Eye"
names(dimnames(x)) = c('Something', 'Else')
x
# Else
# Something Brown Blue Hazel Green
# Black 32 11 10 3
# Brown 53 50 25 15
# Red 10 10 7 7
# Blond 3 30 5 8

Find a function to return value based on condition using R

I have a table with values
KId sales_month quantity_sold
100 1 0
100 2 0
100 3 0
496 2 6
511 2 10
846 1 4
846 2 6
846 3 1
338 1 6
338 2 0
Now I require the output as:
KId sales_month quantity_sold result
100 1 0 1
100 2 0 1
100 3 0 1
496 2 6 1
511 2 10 1
846 1 4 1
846 2 6 1
846 3 1 0
338 1 6 1
338 2 0 1
The calculation should go as follows: if the quantity sold in the month of March (3) is less than 60% of the quantity sold over the two months January (1) and February (2), then the result should be 1; otherwise it should be 0. I need a solution that performs this.
Thanks in advance.
If I understand correctly, your requirement is to compare the quantity sold in month t with the sum of the quantities sold in months t-1 and t-2. If so, I suggest the dplyr package, which offers the nice feature of grouping rows and mutating columns in your data frame.
resultData <- group_by(data, KId) %>%
arrange(sales_month) %>%
mutate(monthMinus1Qty = lag(quantity_sold,1), monthMinus2Qty = lag(quantity_sold, 2)) %>%
group_by(KId, sales_month) %>%
mutate(previous2MonthsQty = sum(monthMinus1Qty, monthMinus2Qty, na.rm = TRUE)) %>%
mutate(result = ifelse(quantity_sold/previous2MonthsQty >= 0.6,0,1)) %>%
select(KId,sales_month, quantity_sold, result)
Adding
select(KId, sales_month, quantity_sold, result)
at the end lets us display only the columns we care about (and not all the intermediate steps).
I believe this should satisfy your requirement. NAs in the result column are due to 0/0 division or to there being no data at all for the previous months.
Should you need to expand your calculation beyond one calendar year, you can add year column and adjust group_by() arguments appropriately.
For more information on the dplyr package, see its documentation.

Aggregating big data in R

I have a dataset (dat) that looks like this:
Team Person Performance1 Performance2
1 36465930 1 101
1 37236856 1 101
1 34940210 1 101
1 29135524 1 101
2 10318268 1 541
2 641793 1 541
2 32352593 1 541
2 2139024 1 541
3 35193922 2 790
3 32645504 2 890
3 32304024 2 790
3 22696491 2 790
I am trying to identify and remove all teams that have variance on Performance1 or Performance2. So, for example, team 3 in the example has variance on Performance 2, so I would want to remove that team from the dataset. Here is the code as I've written it:
tda <- aggregate(dat, by=list(dat$Team), FUN=sd)
tda1 <- tda[ which(tda$Performance1 != 0 | tda$Performance2 != 0), ]
The problem is that there are over 100,000 teams in my dataset, so my first line of code is taking an extremely long time, and I'm not sure if it will ever finish aggregating the dataset. What would be a more efficient way to solve this problem?
Thanks in advance! :)
Sincerely,
Amy
The dplyr package is generally very fast. Here's a way to select only those teams with standard deviation equal to zero for both Performance1 and Performance2:
library(dplyr)
datAggregated = dat %>%
group_by(Team) %>%
summarise(sdP1 = sd(Performance1),
sdP2 = sd(Performance2)) %>%
filter(sdP1==0 & sdP2==0)
datAggregated
Team sdP1 sdP2
1 1 0 0
2 2 0 0
Using data.table for big datasets
library(data.table)
setDT(dat)[, setNames(lapply(.SD, sd), paste0("sdP", 1:2)),
           .SDcols = 3:4, by = Team][, .SD[!sdP1 & !sdP2]]
# Team sdP1 sdP2
#1: 1 0 0
#2: 2 0 0
If you have more Performance columns, you could use summarise_each from dplyr
datNew <- dat %>%
group_by(Team) %>%
summarise_each(funs(sd), starts_with("Performance"))
colnames(datNew)[-1] <- paste0("sdP", head(seq_along(datNew),-1))
datNew[!rowSums(datNew[-1]),]
which gives the output
# Team sdP1 sdP2
#1 1 0 0
#2 2 0 0
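If adding packages is undesirable, a base-R route is also possible: ave() computes the per-team standard deviation of each column in one vectorized pass, avoiding the whole-data-frame aggregate() call. A sketch, using an invented dat shaped like the question's example:

```r
# Invented stand-in for the question's dataset
dat <- data.frame(
  Team = c(1,1,1,1, 2,2,2,2, 3,3,3,3),
  Person = 1:12,
  Performance1 = c(1,1,1,1, 1,1,1,1, 2,2,2,2),
  Performance2 = c(101,101,101,101, 541,541,541,541, 790,890,790,790)
)

# Per-row standard deviation of each measure within its team
sd1 <- ave(dat$Performance1, dat$Team, FUN = sd)
sd2 <- ave(dat$Performance2, dat$Team, FUN = sd)

# Keep only rows from teams with no variance on either measure
datKept <- dat[sd1 == 0 & sd2 == 0, ]
unique(datKept$Team)
```

Team 3 is dropped because its Performance2 values vary; teams 1 and 2 survive, matching the dplyr and data.table results above.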
