R Loop optimisation/ Loop is way too time consuming - r

The following loop takes ages. Is there any way to this in a more time-efficient way? The following data.table consists of 27 variables and more than 600k observations.
data <- read.table("file.txt", header = T, sep= "|")
colnames(data)[c(1)] <- c("X")
data <- as.data.table(data)
n=1;
vector <- vector()
for(i in 2:nrow(data))
{
if(data[["X"]][i] != data[["X"]][i-1])
{
n=1; vector[i]=1}
else {
n=n+1; vector[i]=n}}
Basically, I need to index every appearance of a unique entry in X, i.e. the first time it appeared, the second time it appeared, etc and then merge this to the existing data as additional column. However, I got stock at compiling the vector.
Thank you.

First off, use fread:
DT <- fread("file.txt", sep = "|")
Next, use setnames:
setnames(DT, 1, "X")
Finally, use rowid:
DT[ , vector := rowid(X)]

Related

Difference between two columns with separated variables by ; in R

I am a beginner in R and while trying to make some exercises I got stuck in one of them. My data.frame is as follow:
LanguageWorkedNow LanguageNextYear
Java; PHP Java; C++; SQL
C;C++;JavaScript; JavaScript; C; SQL
And I need to know the variables which are in LanguageNextYear and are not in LanguageWorkedNow, to set a list with the different ones.
Sorry if the question is duplicated, I'm quite new here and tried to find it, but with no success.
base R
Idea: mapply setdiff on strsplitted NextYear and WorkedNow, and then paste it using collapse=";":
df$New <- with(df, {
a <- mapply(setdiff, strsplit(NextYear, ";"), strsplit(WorkedNow, ";"), SIMPLIFY = FALSE)
sapply(a, paste, collapse=";")
})
# SIMPLIFY = FALSE is needed in a general case, it doesn't
# affect the output in the example case
# Or if you use Map instead of mapply, that is the default, so
# it could also be...
df$New <- with(df,
sapply(Map(setdiff, strsplit(NextYear, ";"), strsplit(WorkedNow, ";")),
paste, collapse=";"))
data
df <- read.table(text = "WorkedNow NextYear
Java;PHP Java;C++;SQL
C;C++;JavaScript JavaScript;C;SQL
", header=TRUE, stringsAsFactors=FALSE)
Here's a solution using purrr package:
df = read.table(text = "
LanguageWorkedNow LanguageNextYear
Java;PHP Java;C++;SQL
C;C++;JavaScript JavaScript;C;SQL
", header=T, stringsAsFactors=F)
library(purrr)
df$New = map2_chr(df$LanguageWorkedNow,
df$LanguageNextYear,
~{x1 = unlist(strsplit(.x, split=";"))
x2 = unlist(strsplit(.y, split=";"))
paste0(x2[!x2%in%x1], collapse = ";")})
df
# LanguageWorkedNow LanguageNextYear New
# 1 Java;PHP Java;C++;SQL C++;SQL
# 2 C;C++;JavaScript JavaScript;C;SQL SQL
For each row you get your columns and you create vectors of values (separated by ;). Then you check which values of NextYear vector don't exist in WorkedNow vector and you create a string based on / combining those values.
The map function family will help you apply your logic / function to each row. In our case we use map2_chr as we have two inputs (your two columns) and we excpet a string / character output.

Efficient way to split strings in separate rows (creating an edgelist)

I currently have the following problem. I work with Web-of-Science scientific publication and citation data, which has the following structure: A variable "SR" is a string with the name of a publication, "CR" a variable with a string containing all cited references in the article, separated by a ";".
My task now is to create an edgelist between all publications with the corresponding citations, where every publication and citation combination is in a single row. I do it currently with the following code:
# Some minimal data for example
pub <- c("pub1", "pub2", "pub3")
cit <- c("cit1;cit2;cit3;cit4","cit1;cit4;cit5","cit5;cit1")
M <- cbind(pub,cit)
colnames(M) <- c("SR","CR")
# Create an edgelist
cit_el <- data.frame() #
for (i in seq(1, nrow(M), 1)) { # i=3
cit <- data.frame(strsplit(as.character(M[i,"CR"]), ";", fixed=T), stringsAsFactors=F)
colnames(cit)[1] <- c("SR")
cit$SR_source <- M[i,"SR"]
cit <- unique(cit)
cit_el <- rbind(cit_el, cit)
}
However, for large datasets of some 10k+ of publications (which tend to have 50+ citations), the script runs 15min+. I know that loops are usually an inefficient way of coding in R, yet didn't find an alternative that produces what I want.
Anyone knows some trick to make this faster?
This is my attempt. I haven't compared the speeds of different approaches yet.
First is the artificial data with 10k pubs, 100k possible citations, max is 80 citations per pub.
library(data.table)
library(stringr)
pubCount = 10000
citCount = 100000
maxCitPerPub = 80
pubList <- paste0("pub", seq(pubCount))
citList <- paste0("cit", seq(citCount))
cit <- sapply(sample(seq(maxCitPerPub), pubCount, replace = TRUE),
function(x) str_c(sample(citList, x), collapse = ";"))
data <- data.table(pub = pubList,
cit = cit)
For processing, I use stringr::str_split_fixed to split the citations into columns and use data.table::melt to collapse the columns.
temp <- data.table(pub = pubList, str_split_fixed(data$cit, ";", maxCitPerPub))
result <- melt(temp, id.vars = "pub")[, variable:= NULL][value!='']
Not sure if this is any quicker but if I'm understanding correctly this should give the desired result
rbindlist(lapply(1:nrow(M), function(i){
data.frame(SR_source = M[i, 'SR'], SR = strsplit(M[i, 'CR'], ';'))
}))

Removing rows based on character conditions in a column

Good morning, I have created the following R code:
setwd("xxx")
library(reshape)
##Insert needed year
url <- "./Quarterly/1990_qtrly.csv"
##Writes data in R with applicable columns
qtrly_data <- read.csv(url, header = TRUE, sep = ",", quote="\"", dec=".", na.strings=" ", skip=0)
relevant_cols <- c("area_fips", "industry_code", "own_code", "agglvl_code", "year", "qtr")
overall <- c(relevant_cols, colnames(qtrly_data)[8:16])
lq <- c(relevant_cols, colnames(qtrly_data)[17:25])
oty <- c(relevant_cols, colnames(qtrly_data)[18:42])
types <- c("overall", "lq", "oty")
overallx <- colnames(qtrly_data)[9:16]
lqx <- colnames(qtrly_data)[18:25]
otyx <- colnames(qtrly_data)[seq(27,42,2)]
###Adding in the disclosure codes from each section
disc_codes <- c("disclosure_code", "lq_disclosure_code", "oty_disclosure_code")
cols_list = list(overall, lq, oty)
denom_list = list(overallx, lqx, otyx)
##Uses a two-loop peice of code to go through data denominations and categories, while melting it into the correct format
for (j in 1:length(types))
{
cat("Working on type: " , types[j], "\n")
these_denominations <- denom_list[[j]]
type_data <- qtrly_data[ , cols_list[[j]] ]
QCEW_County <- melt(type_data, id=c(relevant_cols, disc_codes[j]))
colnames(QCEW_County) <- c(relevant_cols, "disclosure_code", "text_denomination", "value")
Data_Cat <- j
for (k in 1:length(these_denominations))
{
cat("Working on type: " , types[j], "and denomination: ", these_denominations[k], "\n")
QCEW_County_Denominated <- QCEW_County[QCEW_County[, "text_denomination"] == these_denominations[k], ]
QCEW_County_Denominated$disclosure_code <- ifelse(QCEW_County_Denominated$disclosure_code == "", 0, 1)
Data_Denom <- k
QCEW_County_Denominated <- cbind(QCEW_County_Denominated, Data_Cat, Data_Denom)
QCEW_County_Denominated$Source_ID <- 1
QCEW_County_Denominated$text_denomination <- NULL
colnames(QCEW_County_Denominated) <- NULL
###Actually writes the txt file to the QCEW folder
write.table(QCEW_County_Denominated, file="C:\\Users\\jjackson\\Downloads\\QCEW\\1990_test.txt", append=TRUE, quote=FALSE, sep=',', row.names=FALSE)
}
}
Now, there are some things I need to get rid of, namely, all the rows in my QCEW_County_Denominated dataframe where the "area_fips" column begins with the character "C", in that same column, there are also codes that start with US that I would like to replace with a 0. Finally, I also have the "industry_code" column that in my final dataframe has 3 values that need to be replaced. 31-33 with 31, 44-45 with 44, and 48-49 with 48. I understand that this is a difficult task. I'm slowly figuring it out on my own, but if anyone could give me a helpful nudge in the right direction while I'm figuring this out on my own, it would be much appreciated. Conditional statements in R is looking like it's my Achilles heel, as it's always where I begin to get confused with how its syntax differs from other statistical packages.
Thank you, and have a nice day.
You can remove and recode your data using regex and subsetting.
Using grepl, you can select the rows in the column area_fips that DON'T start with C.
QCEW_County_Denominated <- QCEW_County_Denominated[!grepl("^C", QCEW_County_Denominated$area_fips), ]
Using gsub, you can replace with 0 the values in the area_fips columns that start with 0.
QCEW_County_Denominated$area_fips <- as.numeric(gsub("^US", 0, QCEW_County_Denominated$area_fips))
Finally, using subsetting you can replace the values in the industry_code.
QCEW_County_Denominated$industry_code[QCEW_County_Denominated$industry_code == "31-33"] <- 31
QCEW_County_Denominated$industry_code[QCEW_County_Denominated$industry_code == "44-45"] <- 44
QCEW_County_Denominated$industry_code[QCEW_County_Denominated$industry_code == "48-49"] <- 48

R Optimizing double loop that uses stri_extract

I have been working on some text scraping/analysis. One thing I did was pull out the top words from documents to compare and learn about different metrics. This was fast and easy. There became an issue with defining what separators to use though and pulling out individual words rather than phrases removed information from the analysis. For example .Net Developer becomes net and developer after the transformation. I already had a list of set phrases/words from an old project someone else gave up on. The next step was pulling out specific keywords from multiple rows for multiple documents.
I have been looking into several techniques including vectorization, parallel processing, using C++ code within R and others. Moving forward I will experiment with all of these techniques and try and speed up my process as well as give me these tools for future projects. In the mean time (without experimentation) I'm wondering what adjustments are obvious which will significantly decrease the time taken e.g. moving parts of the code outside the loop, using better packages etc
I also have a progress bar, but I can remove it if its slowing down my loop significantly.
Here is my code:
words <- read.csv("keyphrases.csv")
df <- data.frame(x=(list.files("sec/new/")))
total = length(df$x)
pb <- txtProgressBar(title = "Progress Bar", min = 0, max =total , width = 300, style=3)
for (i in df$x){
s <- read.csv(paste0("sec/new/",i))
u <- do.call(rbind, pblapply(words$words, function(x){
t <- data.frame(ref= s[,2], words = stri_extract(s[,3], coll=x))
t<-na.omit(t)
}))
write.csv(u,paste0("sec/new_results/new/",i), row.names = F)
setTxtProgressBar(pb, i, title=paste( round(which(df$x== i)/total*100, 2),"% done"))
}
So words has 60,000 rows of words/short phrases - no more than 30 characters each. Length i is around 4000 where each i has between 100 and 5000 rows with each row having between 1 and 5000 characters. Any random characters/strings can be used if my question needs to be reproducible.
I only used lapply because combining it with rbind and do.call worked really well, having a loop within a loop may be slowing down the process significantly too.
So off the bat there are somethings I can do right? Swapping data.frame to data.table or using vectors instead. Do the reading and writing outside the loop somehow? Perhaps write it such that one of the loops isnt nested?
Thanks in advance
EDIT
The key element that needs speeding up is the extract. Whether I use lapply above or cut it down to:
for(x in words$words){t<-data.table(words=stri_extract(s[,3], coll=x))}
This still takes the most time for a long way. skills and t are data tables in this case.
EDIT2
Attempting to create reproducible data:
set.seed(42)
words <- data.frame(words=rnorm(1:60000))
words$wwords <- as.String(words$words)
set.seed(42)
file1 <- data.frame(x=rnorm(1:5000))
file1$x<-as.String(file1$x)
pblapply(words$words, function(x){
t <- data.frame(words = stri_extract(file1$x, coll=x))
})
First things first. Yes, I would definitely switch from data.frame to data.table. Not only is it faster and easier to use, when you start merging data sets data.table will do reasonable things when data.frame will give you unexpected and unintended results.
Secondly, is there an advantage to using R to take care of your separators? You mentioned a number of different techniques you are considering using. If separators are just noise for the purposes of your analysis, why not split the work into two tools and use a tool that is much better at handling separators and continuation lines and so on? For me, Python is a natural choice to do things like parsing a bunch of text into keywords--including stripping off separators and other "noise" words you do not care about in your analysis. Feed the results of the Python parsing into R, and use R for its strengths.
There are a few different ways to get the output of Python into R. I would suggest starting off with something simple: CSV files. They are what you are starting with, they are easy to read and write in Python and easy to read in R. Later you can deal with a direct pipe between Python and R, but it does not give you much advantage until you have a working prototype and it is a lot more work at first. Make Python read in your raw data and turn out a CSV file that R can drop straight into a data.table without further processing.
As for stri_extract, it is really not the tool you need this time. You certainly can match on a bunch of different words, but it is not really what it is optimized for. I agree with #Chris that using merge() on data.tables is a much more efficient--and faster--way to search for a number of key words.
Single Word Version
When you have single words in each lookup, this is easily accomplished with merging:
library(data.table)
#Word List
set.seed(42)
WordList <- data.table(ID = 1:60000, words = sapply(1:60000, function(x) paste(sample(letters, 5), collapse = '')))
#A list of dictionaries
set.seed(42)
Dicts <- list(
Dict1 = sapply(1:15000, function(x) {
paste(sample(letters, 5), collapse = '')
}),
Dict2 = sapply(1:15000, function(x) {
paste(sample(letters, 5), collapse = '')
}),
Dict3 = sapply(1:15000, function(x) {
paste(sample(letters, 5), collapse = '')
})
)
#Create Dictionary Data.table and add Identifier
Dicts <- rbindlist(lapply(Dicts, function(x){data.table(ref = x)}), use.names = T, idcol = T)
# set key for joining
setkey(WordList, "words")
setkey(Dicts, "ref")
Now we have a data.table with all dictionary words, and a data.table with all words in our word list. Now we can just merge:
merge(WordList, Dicts, by.x = "words", by.y = "ref", all.x = T, allow.cartesian = T)
words ID .id
1: abcli 30174 Dict3
2: abcrg 26210 Dict2
3: abcsj 8487 Dict1
4: abczg 24311 Dict2
5: abdgl 1326 Dict1
---
60260: zyxeb 52194 NA
60261: zyxfg 57359 NA
60262: zyxjw 19337 Dict2
60263: zyxoq 5771 Dict1
60264: zyxqa 24544 Dict2
So we can see abcli appears in Dict3, while zyxeb does not appear in any of the dictionaries. There look to be 264 duplicates (words that appear in >1 dictionary), as the resultant data.table is larger than our word list (60264 > 60000). This is shown as follows:
merge(WordList, Dicts, by.x = "words", by.y = "ref", all.x = T, allow.cartesian = T)[words == "ahlpk"]
words ID .id
1: ahlpk 7344 Dict1
2: ahlpk 7344 Dict2
3: ahlpk 28487 Dict1
4: ahlpk 28487 Dict2
We also see here that duplicated words in our word list are going to create multiple resultant rows.
This is very very quick to run
Phrases + Sentences
In the case where you are searching for phrases within sentences, you will need to perform a string match instead. However, you will still need to make n(Phrases) * n(Sentences) comparisons, which will quick hit memory limits in most R data structures. Fortunately, this is an embarrassingly parallel operation:
Same setup:
library(data.table)
library(foreach)
library(doParallel)
# Sentence List
set.seed(42)
Sentences <- data.table(ID = 1:60000, Sentence = sapply(1:60000, function(x) paste(sample(letters, 10), collapse = '')))
# A list of phrases
set.seed(42)
Phrases <- list(
Phrases1 = sapply(1:15000, function(x) {
paste(sample(letters, 5), collapse = '')
}),
Phrases2 = sapply(1:15000, function(x) {
paste(sample(letters, 5), collapse = '')
}),
Phrases3 = sapply(1:15000, function(x) {
paste(sample(letters, 5), collapse = '')
})
)
# Create Dictionary Data.table and add Identifier
Phrases <- rbindlist(lapply(Phrases, function(x){data.table(Phrase = x)}), use.names = T, idcol = T)
# Full Outer Join
Sentences[, JA := 1]
Phrases[, JA := 1]
# set key for joining
setkey(Sentences, "JA")
setkey(Phrases, "JA")
We now want to break up our Phrases table into manageable batches
cl<-makeCluster(4)
registerDoParallel(cl)
nPhrases <- as.numeric(nrow(Phrases))
nSentences <- as.numeric(nrow(Sentences))
batch_size <- ceiling(nPhrases*nSentences / 2^30) #Max data.table allocation is 2^31. Lower this if you are hitting memory allocation limits
seq_s <- seq(1,nrow(Phrases), by = floor(nrow(Phrases)/batch_size))
ln_s <- length(seq_s)
if(ln_s > 1){
str_seq <- paste0(seq_s,":",c(seq_s[2:ln_s],nrow(Phrases) + 1) - 1)
} else {
str_seq <- paste0(seq_s,":",nrow(Phrases))
}
We are now ready to send our job out. The grepl line below is doing the work - testing which phrases match each sentence. We then filter out any non-matches.
ls<-foreach(i = 1:ln_s) %dopar% {
library(data.table)
TEMP_DT <- merge(Sentences,Phrases[eval(parse(text = str_seq[1]))], by = "JA", allow.cartesian = T)
TEMP_DT <- TEMP_DT[, match_test := grepl(Phrase,Sentence), by = .(Phrase,Sentence)][match_test == 1]
return(TEMP_DT)
}
stopCluster(cl)
DT_OUT <- unique(do.call(rbind,ls))
DT_OUT now summarizes the sentences that match, along with the Phrase + Phrase list that it is found in.
This still will take some time (as there is a lot of processing that is necessary) , but nowhere near a year.

R- How to do a loop on a list and output different dataframes

I'm attempting to create a loop in R that will use a vector of dates, run them through a loop that includes a SQL query, and then generate a separate dataframe for each output. Here is as far as I've gotten:
library(RODBC)
dvect <- as.Date("2015-04-13") + 0:2
d <- list()
for(i in list(dvect)){
queryData <- sqlQuery(myconn, paste("SELECT
WQ_hour,
sum(calls) as calls
FROM database
WHERE DDATE = '", i,"'
GROUP BY 1
", sep = ""))
d[i] <- rbind(d, queryData)
}
From what I can tell, the query portion of the code runs fine since I've tested it by itself. Where I'm stumbling is the last line where I try to save the contents of each loop through the query separately with each having a label of the date that was used in the loop.
I'd appreciate any help. I've only been using R consistently for about 2 months now so I'm definitely open to alternative ways of doing this that are cleaner and more efficient.
Thanks.
I'd suggest making the SQL query a function, and use lapply to apply it and return your result as a list.
userSQLquery = function(i) {
sqlQuery(myconn, paste("SELECT
WQ_hour,
sum(calls) as calls
FROM database
WHERE DDATE = '", i,"'
GROUP BY 1
", sep = ""))
}
dvect = as.Date("2015-04-13") + 0:2
d = as.list(1:length(dvect))
names(d) = dvect
lapply(d, userSQLquery)
I have very little experience with SQL though, so this may not work. Maybe it could start you off?
Looks like a job for lapply (lapply documentation)instead of a for loop. (In R it's often good to avoid a for loop by using a vectorization.)
If you want each date to return a separate data frame, and then have each data frame labelled with the original date, try:
dates <- c("Jan 1", "Oct 31", "Dec 25")
queryData <- function(date){
#dummy data
return(runif(5))
}
results <- lapply(dates, queryData)
names(results) <- dates
Either use:
d[[i]] <- queryData
if you want each data.frame (query result) as a separate element in the list output d.
Or use:
d <- rbind(d, queryData)
if you want a single data.frame with all the query outputs combined. In this case you should declare d as a data.frame (i.e. d <- data.frame()).
You can also store each data.frame (i.e. the query result) with its corresponding date in a list as:
d[[i]] <- list(date = dvect[[i]], queryResult = queryData)
I think the last one is what you are looking for.

Resources