I have been working on some text scraping and analysis. One thing I did was pull the top words out of documents to compare them and learn about different metrics. This was fast and easy. A problem arose in deciding which separators to use, though, and pulling out individual words rather than phrases removed information from the analysis. For example, ".Net Developer" becomes "net" and "developer" after the transformation. I already had a list of set phrases/words from an old project someone else gave up on. The next step was pulling out specific keywords from multiple rows for multiple documents.
I have been looking into several techniques, including vectorization, parallel processing, using C++ code within R, and others. Moving forward I will experiment with all of these techniques to speed up my process and to have these tools available for future projects. In the meantime (without experimentation) I'm wondering which obvious adjustments would significantly decrease the time taken, e.g. moving parts of the code outside the loop, using better packages, etc.
I also have a progress bar, but I can remove it if it's slowing down my loop significantly.
Here is my code:
library(stringi) # stri_extract
library(pbapply) # pblapply
words <- read.csv("keyphrases.csv")
df <- data.frame(x=(list.files("sec/new/")))
total = length(df$x)
pb <- txtProgressBar(title = "Progress Bar", min = 0, max =total , width = 300, style=3)
for (i in df$x){
s <- read.csv(paste0("sec/new/",i))
u <- do.call(rbind, pblapply(words$words, function(x){
t <- data.frame(ref= s[,2], words = stri_extract(s[,3], coll=x))
t<-na.omit(t)
}))
write.csv(u,paste0("sec/new_results/new/",i), row.names = F)
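# the progress value must be numeric, so use the position of file i rather than its name
setTxtProgressBar(pb, which(df$x == i), title=paste( round(which(df$x== i)/total*100, 2),"% done"))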
}
So words has 60,000 rows of words/short phrases, no more than 30 characters each. There are around 4000 files (values of i), where each file has between 100 and 5000 rows and each row has between 1 and 5000 characters. Any random characters/strings can be used if my question needs to be reproducible.
I only used lapply because combining it with rbind and do.call worked really well; having a loop within a loop may be slowing down the process significantly too.
So off the bat there are some things I can do, right? Swapping data.frame for data.table, or using vectors instead. Do the reading and writing outside the loop somehow? Perhaps write it such that one of the loops isn't nested?
Thanks in advance
EDIT
The key element that needs speeding up is the extract. Whether I use lapply above or cut it down to:
for(x in words$words){t<-data.table(words=stri_extract(s[,3], coll=x))}
This still takes the most time by a long way. skills and t are data tables in this case.
EDIT2
Attempting to create reproducible data:
set.seed(42)
# rnorm() takes a count; as.character() (base R) turns each number into its own string
words <- data.frame(words = as.character(rnorm(60000)), stringsAsFactors = FALSE)
set.seed(42)
file1 <- data.frame(x = as.character(rnorm(5000)), stringsAsFactors = FALSE)
res <- pblapply(words$words, function(x){
t <- data.frame(words = stri_extract(file1$x, coll=x))
})
First things first. Yes, I would definitely switch from data.frame to data.table. Not only is it faster and easier to use, when you start merging data sets data.table will do reasonable things when data.frame will give you unexpected and unintended results.
Secondly, is there an advantage to using R to take care of your separators? You mentioned a number of different techniques you are considering using. If separators are just noise for the purposes of your analysis, why not split the work into two tools and use a tool that is much better at handling separators and continuation lines and so on? For me, Python is a natural choice to do things like parsing a bunch of text into keywords--including stripping off separators and other "noise" words you do not care about in your analysis. Feed the results of the Python parsing into R, and use R for its strengths.
There are a few different ways to get the output of Python into R. I would suggest starting off with something simple: CSV files. They are what you are starting with, they are easy to read and write in Python and easy to read in R. Later you can deal with a direct pipe between Python and R, but it does not give you much advantage until you have a working prototype and it is a lot more work at first. Make Python read in your raw data and turn out a CSV file that R can drop straight into a data.table without further processing.
As for stri_extract, it is really not the tool you need this time. You certainly can match on a bunch of different words, but it is not really what it is optimized for. I agree with @Chris that using merge() on data.tables is a much more efficient--and faster--way to search for a number of key words.
Single Word Version
When you have single words in each lookup, this is easily accomplished with merging:
library(data.table)
#Word List
set.seed(42)
WordList <- data.table(ID = 1:60000, words = sapply(1:60000, function(x) paste(sample(letters, 5), collapse = '')))
#A list of dictionaries
set.seed(42)
Dicts <- list(
Dict1 = sapply(1:15000, function(x) {
paste(sample(letters, 5), collapse = '')
}),
Dict2 = sapply(1:15000, function(x) {
paste(sample(letters, 5), collapse = '')
}),
Dict3 = sapply(1:15000, function(x) {
paste(sample(letters, 5), collapse = '')
})
)
#Create Dictionary Data.table and add Identifier
Dicts <- rbindlist(lapply(Dicts, function(x){data.table(ref = x)}), use.names = T, idcol = T)
# set key for joining
setkey(WordList, "words")
setkey(Dicts, "ref")
Now we have a data.table with all dictionary words, and a data.table with all words in our word list. Now we can just merge:
merge(WordList, Dicts, by.x = "words", by.y = "ref", all.x = T, allow.cartesian = T)
words ID .id
1: abcli 30174 Dict3
2: abcrg 26210 Dict2
3: abcsj 8487 Dict1
4: abczg 24311 Dict2
5: abdgl 1326 Dict1
---
60260: zyxeb 52194 NA
60261: zyxfg 57359 NA
60262: zyxjw 19337 Dict2
60263: zyxoq 5771 Dict1
60264: zyxqa 24544 Dict2
So we can see abcli appears in Dict3, while zyxeb does not appear in any of the dictionaries. There look to be 264 duplicates (words that appear in >1 dictionary), as the resultant data.table is larger than our word list (60264 > 60000). This is shown as follows:
merge(WordList, Dicts, by.x = "words", by.y = "ref", all.x = T, allow.cartesian = T)[words == "ahlpk"]
words ID .id
1: ahlpk 7344 Dict1
2: ahlpk 7344 Dict2
3: ahlpk 28487 Dict1
4: ahlpk 28487 Dict2
We also see here that duplicated words in our word list are going to create multiple resultant rows.
This is very quick to run.
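The duplicate-row effect described above can be avoided by de-duplicating the word list before the join. The following is a minimal sketch on toy data; the names word_list, dicts and word_list_unique are made up for illustration and are not part of the answer's code.

```r
library(data.table)

# Toy word list: "net" appears twice, and one entry is a multi-word phrase
word_list <- data.table(ID = 1:4,
                        words = c("net", "developer", "net", ".net developer"))
# Toy dictionaries in the same long format as Dicts above
dicts <- data.table(.id = c("Dict1", "Dict2", "Dict2"),
                    ref = c("net", "net", "developer"))

# Keep the first occurrence of each word so it is looked up only once
word_list_unique <- unique(word_list, by = "words")

# Inner join: only words that occur in at least one dictionary are returned
matches <- merge(word_list_unique, dicts, by.x = "words", by.y = "ref")
print(matches)
```

With the duplicate removed, "net" yields one row per dictionary it appears in, rather than one row per dictionary per copy in the word list.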
Phrases + Sentences
In the case where you are searching for phrases within sentences, you will need to perform a string match instead. However, you will still need to make n(Phrases) * n(Sentences) comparisons, which will quickly hit memory limits in most R data structures. Fortunately, this is an embarrassingly parallel operation:
Same setup:
library(data.table)
library(foreach)
library(doParallel)
# Sentence List
set.seed(42)
Sentences <- data.table(ID = 1:60000, Sentence = sapply(1:60000, function(x) paste(sample(letters, 10), collapse = '')))
# A list of phrases
set.seed(42)
Phrases <- list(
Phrases1 = sapply(1:15000, function(x) {
paste(sample(letters, 5), collapse = '')
}),
Phrases2 = sapply(1:15000, function(x) {
paste(sample(letters, 5), collapse = '')
}),
Phrases3 = sapply(1:15000, function(x) {
paste(sample(letters, 5), collapse = '')
})
)
# Create Dictionary Data.table and add Identifier
Phrases <- rbindlist(lapply(Phrases, function(x){data.table(Phrase = x)}), use.names = T, idcol = T)
# Full Outer Join
Sentences[, JA := 1]
Phrases[, JA := 1]
# set key for joining
setkey(Sentences, "JA")
setkey(Phrases, "JA")
We now want to break up our Phrases table into manageable batches:
cl<-makeCluster(4)
registerDoParallel(cl)
nPhrases <- as.numeric(nrow(Phrases))
nSentences <- as.numeric(nrow(Sentences))
batch_size <- ceiling(nPhrases*nSentences / 2^30) #Max data.table allocation is 2^31. Lower this if you are hitting memory allocation limits
seq_s <- seq(1,nrow(Phrases), by = floor(nrow(Phrases)/batch_size))
ln_s <- length(seq_s)
if(ln_s > 1){
str_seq <- paste0(seq_s,":",c(seq_s[2:ln_s],nrow(Phrases) + 1) - 1)
} else {
str_seq <- paste0(seq_s,":",nrow(Phrases))
}
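The batching arithmetic can be checked on its own with plain numbers. This is only a sketch under the sizes used in this example (45000 phrases, 60000 sentences); n_batches follows the same rule as batch_size above, and the other variable names are illustrative.

```r
n_phrases   <- 45000
n_sentences <- 60000

# Number of phrase/sentence pairs a full cross join would create
n_pairs <- n_phrases * n_sentences

# Same rule as batch_size above: keep each batch at or below 2^30 pairs
n_batches <- ceiling(n_pairs / 2^30)

# Assign every phrase row to a batch and split the row indices accordingly
rows_per_batch <- ceiling(n_phrases / n_batches)
batch_id <- ceiling(seq_len(n_phrases) / rows_per_batch)
batches  <- split(seq_len(n_phrases), batch_id)

print(n_batches)
print(lengths(batches))
```

Splitting row indices with split() avoids building range strings that later have to be parsed back into code.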
We are now ready to send our job out. The grepl line below is doing the work - testing which phrases match each sentence. We then filter out any non-matches.
ls<-foreach(i = 1:ln_s) %dopar% {
library(data.table)
# use batch i here; indexing with str_seq[1] would make every worker process the first batch
TEMP_DT <- merge(Sentences,Phrases[eval(parse(text = str_seq[i]))], by = "JA", allow.cartesian = T)
TEMP_DT <- TEMP_DT[, match_test := grepl(Phrase,Sentence), by = .(Phrase,Sentence)][match_test == 1]
return(TEMP_DT)
}
stopCluster(cl)
DT_OUT <- unique(do.call(rbind,ls))
DT_OUT now summarizes the sentences that match, along with the Phrase + Phrase list that it is found in.
This will still take some time (as there is a lot of processing that is necessary), but nowhere near a year.
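As a smaller-scale alternative, not the method used above, each phrase can be tested against all sentences in one vectorised call, which avoids materialising the cross join. The sketch below uses base R only; the toy sentences and phrases are invented for illustration.

```r
sentences <- c("senior .net developer wanted",
               "java developer",
               "data analyst with sql")
phrases <- c(".net developer", "developer", "sql", "python")

# For each phrase, test every sentence at once.
# fixed = TRUE treats the phrase as literal text, so "." is not a regex wildcard.
hits <- lapply(phrases, function(p) {
  idx <- which(grepl(p, sentences, fixed = TRUE))
  data.frame(sentence_id = idx,
             phrase = rep(p, length(idx)),
             stringsAsFactors = FALSE)
})

# One row per (sentence, phrase) match; phrases with no match contribute no rows
matches <- do.call(rbind, hits)
print(matches)
```

The loop still runs once per phrase, so it parallelises in the same way as the batched version if needed.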
I am building a Shiny application in which a large ggplot2 fortified dataframe needs to be calculated over and over again, using a large amount of external source files. I am searching for the fastest and most efficient way to do this. In the following paragraph I will delve a little bit more into the subject and the code I have so far and also provide the input data to enable your kind assistance.
I am using the Helsinki Region Travel Time Matrix 2018, a dataset provided by Digital Geography Lab, a research group in the University of Helsinki. This data uses a generalised map of Helsinki capital region, in 250 x 250 meter cells (in my code grid_f), to calculate travel times between all cells in the map (grid ids are called YKR_ID, n=13231) by public transport, private car, bicycle and by foot. The calculations are stored in delimited .txt files, one text file for all the travel times to a specific cell id. The data is available for download at this website, under "Download the data". NB, the unzipped data is 13.8 GB in size.
Here is a selection from a text file in the dataset:
from_id;to_id;walk_t;walk_d;bike_s_t;bike_f_t;bike_d;pt_r_tt;pt_r_t;pt_r_d;pt_m_tt;pt_m_t;pt_m_d;car_r_t;car_r_d;car_m_t;car_m_d;car_sl_t
5785640;5785640;0;0;-1;-1;-1;0;0;0;0;0;0;-1;0;-1;0;-1
5785641;5785640;48;3353;51;32;11590;48;48;3353;48;48;3353;22;985;21;985;16
5785642;5785640;50;3471;51;32;11590;50;50;3471;50;50;3471;22;12167;21;12167;16
5785643;5785640;54;3764;41;26;9333;54;54;3764;54;54;3764;22;10372;21;10370;16
5787544;5785640;38;2658;10;7;1758;38;38;2658;38;38;2658;7;2183;7;2183;6
My interest is to visualise (with ggplot2) this 250x250m Helsinki region map for one travel mode, the private car, using any of the possible 13231 cell ids, repeatedly if the user wants. Because of this it is important that the dataframe fetch is as fast and efficient as possible. For this question, let's concentrate on the fetching and processing of the data from the external files and use only one specific id value.
In a nutshell, after I have produced a ggplot2::fortify() version of the 250 x 250 meter grid spatial dataset grid_f, I need to:
Scan through all the 13231 Travel Time Matrix 2018 text files
Pick only the relevant columns (from_id, to_id, car_r_t, car_m_t, car_sl_t) in each file
Pick the relevant row using from_id (in this case, origin_id <- "5985086") in each file
Join the resulting row to the fortified spatial data grid_f
My code is as follows:
# Libraries
library(ggplot2)
library(dplyr)
library(rgdal)
library(data.table)
library(sf)
library(sp)
# File paths. ttm_path is the folder which contains the unchanged Travel
# Time Matrix 2018 data from the research group's home page
ttm_path <- "HelsinkiTravelTimeMatrix2018"
gridpath <- "MetropAccess_YKR_grid_EurefFIN.shp"
#### Import grid cells
# use this CRS information throughout the app
app_crs <- sp::CRS("+init=epsg:3067")
# Read grid shapefile and transform
grid_f <- rgdal::readOGR(gridpath, stringsAsFactors = TRUE) %>%
sp::spTransform(., app_crs) %>%
# preserve grid dataframe data in the fortify
{dplyr::left_join(ggplot2::fortify(.),
as.data.frame(.) %>%
dplyr::mutate(id = as.character(dplyr::row_number() - 1)))} %>%
dplyr::select(-c(x, y))
The code above this point is meant to run only once. The code below, more or less, would be run over and over with different origin_ids.
#### Fetch TTM18 data
origin_id <- "5985086"
origin_id_num <- as.numeric(origin_id)
# column positions of columns from_id, to_id, car_r_t, car_m_t, car_sl_t
col_range <- c(1, 2, 14, 16, 18)
# grid_f as data.table version
dt_grid <- as.data.table(grid_f)
# Get filepaths of all of the TTM18 data. Remove metadata textfile filepath.
all_files <- list.files(path = ttm_path,
pattern = ".txt$",
recursive = TRUE,
full.names = TRUE)
all_files <- all_files[-length(all_files)]
# lapply function
TTM18_fetch <- function(x, col_range, origin_id) {
res <- fread(x, select = col_range)
res <- subset(res, from_id == origin_id)
return(res)
}
# The part of the code that needs to be fast and efficient
result <-
lapply(all_files, FUN = TTM18_fetch, col_range, origin_id_num) %>%
data.table::rbindlist(., fill = TRUE) %>%
data.table::merge.data.table(dt_grid, ., by.x = "YKR_ID", by.y = "to_id")
The dataframe result should have 66155 rows of 12 variables, five rows for each 250x250 meter grid cell. The columns are YKR_ID, long, lat, order, hole, piece, id, group, from_id, car_r_t, car_m_t, car_sl_t.
My current lapply() and data.table::fread() solution takes about 2-3 minutes to complete. I think this is already a good achievement, but I can't help but think there are better and faster ways to complete this. So far, I have tried these alternatives to what I now have:
A conventional for loop: that was obviously a slow solution
I tried to teach myself more about vectorised functions in R, but that did not lead anywhere. Used this link
Tried to dabble with with() unsuccessfully using this SO question, inspired by this SO question
Looked into package parallel but ended up not utilising that because of the Windows environment I am using
Tried to find alternative ways to solve this with apply() and sapply() but nothing noteworthy came out of that.
As to why I didn't do all this to the data before ggplot2::fortify, I simply found it troublesome to work with a SpatialPolygonsDataFrame.
Thank you for your time.
Whenever I’m trying to figure out how to improve the performance of my R
functions, I generally use the following approach. First, I look for any
function calls that may be unnecessary or identify places where multiple
function calls can be simplified into one. Then, I look for places in my
code that are incurring the greatest time penalty by benchmarking each
part separately. This can easily be done using the microbenchmark
package.
For example, we can ask if we get better performance with or without
piping (e.g. %>%).
# hint... piping is always slower
library(magrittr)
library(microbenchmark)
microbenchmark(
pipe = iris %>% subset(Species=='setosa'),
no_pipe = subset(iris, Species=='setosa'),
times = 200)
Unit: microseconds
expr min lq mean median uq max neval cld
pipe 157.518 196.739 308.1328 229.6775 312.6565 2473.582 200 b
no_pipe 84.894 116.386 145.4039 126.1950 139.4100 612.492 200 a
Here, we find that subsetting a data.frame without piping
takes roughly half the time to execute!
Next, I determine the net time penalty for each place I
benchmarked by multiplying the execution time by the total number of times it
needs to be executed. For the areas with the greatest net time penalty,
I try to replace it with faster functions and/or try to reduce the total
number of times it needs to be executed.
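As a rough worked example of that multiplication, the sketch below uses only the figures stated in the question (13231 files, about 2 to 3 minutes in total). The 2 ms saving per file is a hypothetical number chosen purely for illustration.

```r
n_files <- 13231                          # files read per request (from the question)
total_seconds <- c(low = 120, high = 180) # "about 2-3 minutes"

# Average cost of handling a single file, in milliseconds
per_file_ms <- total_seconds / n_files * 1000
print(round(per_file_ms, 1))

# Hypothetical: a change that shaves 2 ms off every file read
saved_ms_per_file <- 2
saved_seconds <- saved_ms_per_file / 1000 * n_files
print(round(saved_seconds, 1))
```

Because the per-file cost is multiplied by the number of files, even a saving of a few milliseconds per call adds up to a noticeable share of the total run time.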
TLDR
In your case, you can speed things up by using the fst package
although you would need to convert your csv files to fst files.
# before
TTM18_fetch <- function(x, col_range, origin_id) {
res <- data.table::fread(x, select = col_range)
res <- subset(res, from_id == origin_id)
return(res)
}
# after (NB x needs to be a fst file)
col_range <- c('from_id', 'to_id', 'car_r_t', 'car_m_t', 'car_sl_t')
TTM18_fetch <- function(x, col_range, origin_id) {
res <- fst::read_fst(path = x,
columns = col_range,
as.data.table = TRUE)[from_id==origin_id]
return(res)
}
To convert your csv files to fst
library(data.table)
library(fst)
ttm_path <- 'REPLACE THIS'
new_ttm_path <- 'REPLACE THIS'
# Get filepaths of all of the TTM18 data. Remove metadata textfile filepath.
all_files <- list.files(path = ttm_path,
pattern = ".txt$",
recursive = TRUE,
full.names = TRUE)
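# use !grepl(): negating a logical vector with "-" would drop the wrong element
all_files <- all_files[!grepl('[Mm]eta', all_files)]
# creating new file paths and names for fst files
file_names <- list.files(path = ttm_path,
pattern = ".txt$",
recursive = TRUE)
file_names <- file_names[!grepl('[Mm]eta', file_names)]
# the source files end in .txt, so that is the extension to swap for .fst
file_names <- gsub(pattern = '\\.txt$',
replacement = '.fst',
x =file_names)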
file_names <- file.path(new_ttm_path, file_names)
# csv to fst conversion
require(progress) # this will help you keep track of things
pb <- progress_bar$new(
format = " :what [:bar] :percent eta: :eta",
clear = FALSE, total = length(file_names), width = 60)
# an index file to store from_id file locations
from_id_paths <- data.table(from_id = numeric(),
file_path = character())
for(i in seq_along(file_names)){
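pb$tick(0, tokens = list(what = 'reading')) # tick(0) redraws the bar without advancing it
tmp <- data.table::fread(all_files[i], key = 'from_id')
pb$tick(0, tokens = list(what = 'writing'))
# file_names may include subfolders (recursive = TRUE), so make sure they exist
dir.create(dirname(file_names[i]), recursive = TRUE, showWarnings = FALSE)
fst::write_fst(tmp,
compress = 50, # less compressed files read faster
path = file_names[i] )
pb$tick(tokens = list(what = 'indexing')) # advance the bar once per file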
from_id_paths <- rbind(from_id_paths,
data.table(from_id = unique(tmp$from_id),
file_path = file_names[i]))
}
setkey(from_id_paths, from_id)
write_fst(from_id_paths,
path = file.path(new_ttm_path, 'from_id_index.fst'), # the variable, not the literal string
compress = 0)
This would be the replacement
library(fst)
library(data.table)
new_ttm_path <- 'REPLACE THIS'
#### Fetch TTM18 data
origin_id <- "5985086"
origin_id_num <- as.numeric(origin_id)
# names of the columns to read: from_id, to_id, car_r_t, car_m_t, car_sl_t
col_range <- c('from_id', 'to_id', 'car_r_t', 'car_m_t', 'car_sl_t')
# grid_f as data.table version
dt_grid <- as.data.table(grid_f)
nescessary_files <- read_fst(path = file.path(new_ttm_path,
'from_id_index.fst'),
as.data.table = TRUE
)[from_id==origin_id,file_path]
TTM18_fetch <- function(x, col_range, origin_id) {
res <- fst::read_fst(path = x,
columns = col_range,
as.data.table = TRUE)[from_id==origin_id]
return(res)
}
result <- rbindlist(lapply(nescessary_files, FUN = TTM18_fetch, col_range, origin_id_num),
fill = TRUE)
result <- data.table::merge.data.table(dt_grid, result, by.x = "YKR_ID", by.y = "to_id")
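The question mentions skipping the parallel package because of Windows. As a hedged aside that is not part of the answer above: socket clusters created with parallel::makeCluster() also run on Windows, so parLapply() can spread the per-file reads over several workers. The sketch below uses small temporary files as stand-ins for the real data; fetch_one, the file names and the column subset are all hypothetical.

```r
library(parallel)

# Hypothetical stand-ins for the TTM files: three small semicolon-delimited files
tmp_dir <- tempfile("ttm_")
dir.create(tmp_dir)
for (to in 1:3) {
  write.table(data.frame(from_id = 1:4, to_id = to, car_r_t = (1:4) * to),
              file.path(tmp_dir, paste0("to_", to, ".txt")),
              sep = ";", row.names = FALSE, quote = FALSE)
}
files <- list.files(tmp_dir, pattern = "\\.txt$", full.names = TRUE)

# Read one file and keep only the rows for the requested origin
fetch_one <- function(path, origin_id) {
  res <- read.table(path, header = TRUE, sep = ";")
  res[res$from_id == origin_id, ]
}

cl <- makeCluster(2)   # socket (PSOCK) cluster, the default type
parts <- parLapply(cl, files, fetch_one, origin_id = 2)
stopCluster(cl)

result <- do.call(rbind, parts)
print(result)
```

Whether this beats a single-process read depends on disk speed and the cost of starting workers, so it is worth benchmarking before adopting it.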
I currently have the following problem. I work with Web-of-Science scientific publication and citation data, which has the following structure: A variable "SR" is a string with the name of a publication, "CR" a variable with a string containing all cited references in the article, separated by a ";".
My task now is to create an edgelist between all publications with the corresponding citations, where every publication and citation combination is in a single row. I do it currently with the following code:
# Some minimal data for example
pub <- c("pub1", "pub2", "pub3")
cit <- c("cit1;cit2;cit3;cit4","cit1;cit4;cit5","cit5;cit1")
M <- cbind(pub,cit)
colnames(M) <- c("SR","CR")
# Create an edgelist
cit_el <- data.frame() #
for (i in seq(1, nrow(M), 1)) { # i=3
cit <- data.frame(strsplit(as.character(M[i,"CR"]), ";", fixed=T), stringsAsFactors=F)
colnames(cit)[1] <- c("SR")
cit$SR_source <- M[i,"SR"]
cit <- unique(cit)
cit_el <- rbind(cit_el, cit)
}
However, for large datasets of some 10k+ of publications (which tend to have 50+ citations), the script runs 15min+. I know that loops are usually an inefficient way of coding in R, yet didn't find an alternative that produces what I want.
Does anyone know a trick to make this faster?
This is my attempt. I haven't compared the speeds of different approaches yet.
First is the artificial data with 10k pubs, 100k possible citations, max is 80 citations per pub.
library(data.table)
library(stringr)
pubCount = 10000
citCount = 100000
maxCitPerPub = 80
pubList <- paste0("pub", seq(pubCount))
citList <- paste0("cit", seq(citCount))
cit <- sapply(sample(seq(maxCitPerPub), pubCount, replace = TRUE),
function(x) str_c(sample(citList, x), collapse = ";"))
data <- data.table(pub = pubList,
cit = cit)
For processing, I use stringr::str_split_fixed to split the citations into columns and use data.table::melt to collapse the columns.
temp <- data.table(pub = pubList, str_split_fixed(data$cit, ";", maxCitPerPub))
result <- melt(temp, id.vars = "pub")[, variable:= NULL][value!='']
Not sure if this is any quicker but if I'm understanding correctly this should give the desired result
rbindlist(lapply(1:nrow(M), function(i){
data.frame(SR_source = M[i, 'SR'], SR = strsplit(M[i, 'CR'], ';'))
}))
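For comparison, here is a base R sketch that builds the same edge list without any explicit loop, by splitting all citation strings at once and repeating each publication name once per citation. It uses the minimal data from the question; refs is an illustrative name.

```r
pub <- c("pub1", "pub2", "pub3")
cit <- c("cit1;cit2;cit3;cit4", "cit1;cit4;cit5", "cit5;cit1")
M <- cbind(SR = pub, CR = cit)

# Split every citation string in one call; the result is a list with one vector per publication
refs <- strsplit(M[, "CR"], ";", fixed = TRUE)

# Repeat each publication name once per citation it contains
cit_el <- data.frame(SR = unlist(refs, use.names = FALSE),
                     SR_source = rep(M[, "SR"], lengths(refs)),
                     stringsAsFactors = FALSE)

# Drop repeated (citation, publication) pairs, as the original loop does
cit_el <- unique(cit_el)
print(cit_el)
```

Growing a data frame with rbind() inside a loop copies it on every iteration, which is the main cost this version avoids.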
Working in R, I'm trying to find an efficient way to search through a file of texts and remove or replace all instances of proper names (e.g., Thomas). I assume there is something available to do this but have been unable to locate.
So, in this example the words "Susan" and "Bob" would be removed. This is a simplified example; in reality I would want this to apply to hundreds of documents and therefore a fairly large list of names.
texts <- as.data.frame (rbind (
'This text stuff if quite interesting',
'Where are all the names said Susan',
'Bob wondered what happened to all the proper nouns'
))
names(texts) [1] <- "text"
Here's one approach based upon a data set of firstnames:
install.packages("gender")
library(gender)
install_genderdata_package()
sets <- data(package = "genderdata")$results[,"Item"]
data(list = sets, package = "genderdata")
stopwords <- unique(kantrowitz$name)
texts <- as.data.frame (rbind (
'This text stuff if quite interesting',
'Where are all the names said Susan',
'Bob wondered what happened to all the proper nouns'
))
removeWords <- function(txt, words, n = 30000L) {
l <- cumsum(nchar(words)+c(0, rep(1, length(words)-1)))
groups <- cut(l, breaks = seq(1,ceiling(tail(l, 1)/n)*n+1, by = n))
regexes <- sapply(split(words, groups), function(words) sprintf("(*UCP)\\b(%s)\\b", paste(sort(words, decreasing = TRUE), collapse = "|")))
for (regex in regexes) txt <- gsub(regex, "", txt, perl = TRUE, ignore.case = TRUE)
return(txt)
}
removeWords(texts[,1], stopwords)
# [1] "This text stuff if quite interesting"
# [2] "Where are all the names said "
# [3] " wondered what happened to all the proper nouns"
It may need some tuning for your specific data set.
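One possible piece of tuning is tidying up the whitespace that is left behind once a name is removed. The sketch below shows this with base R and a tiny hand-written name list; first_names is a stand-in for a real list such as the one above, and matching here is case-sensitive.

```r
texts <- c("Where are all the names said Susan",
           "Bob wondered what happened to all the proper nouns")
first_names <- c("Susan", "Bob")   # stand-in for a real list of first names

# One alternation wrapped in word boundaries, so only whole words are removed
pattern <- paste0("\\b(", paste(first_names, collapse = "|"), ")\\b")
removed <- gsub(pattern, "", texts, perl = TRUE)

# Collapse runs of whitespace and trim the ends
cleaned <- trimws(gsub("\\s+", " ", removed))
print(cleaned)
```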
Another approach could be based upon part-of-speech tagging.
As an example, I have the following XML code
tt = '<Nummeraanduiding>
<identificatie>0010200000114849</identificatie>
<aanduidingRecordInactief>N</aanduidingRecordInactief>
<aanduidingRecordCorrectie>0</aanduidingRecordCorrectie>
<huisnummer>13</huisnummer>
<officieel>N</officieel>
<postcode>9904PC</postcode>
<tijdvakgeldigheid>
<begindatumTijdvakGeldigheid>2010051100000000</begindatumTijdvakGeldigheid>
</tijdvakgeldigheid>
<inOnderzoek>N</inOnderzoek>
<typeAdresseerbaarObject>Verblijfsobject</typeAdresseerbaarObject>
<bron>
<documentdatum>20100511</documentdatum>
<documentnummer>2010/NR002F</documentnummer>
</bron>
<nummeraanduidingStatus>Naamgeving uitgegeven</nummeraanduidingStatus>
<gerelateerdeOpenbareRuimte>
<identificatie>0010300000000444</identificatie>
</gerelateerdeOpenbareRuimte>
</Nummeraanduiding> '
The goal is to convert this node(Nummeraanduiding) to a data.table (or data.frame is also fine). One challenge is that I have a lot of these Nummeraanduiding nodes (millions of them).
The following code is able to process the data:
library(XML)
# This parses the doc...
doc = xmlParse(tt)
# Solution (1) - this is the most obvious solution..
XML::xmlToDataFrame(doc)
# Solution (2) - apparently converting to a list is also possible..
unlist(xmlToList(doc))
# Solution (3) - My own solution
data.frame(as.list(unlist(xmlToList(doc))))
Not all solutions produce the desired result... In the end only the version of Solution (3) satisfies my needs.
It is in a data.frame/data.table format
It contains all the child-child-nodes and has distinct names for each column
It does not 'merge' the information of child-child-nodes
However, running this piece of code for all my data becomes quite slow. It took 8+ hours to complete it for a file containing 2290000 times the 'Nummeraanduiding'-node.
Do you guys know any way to speed up this process? Can my method be improved? Am I missing some useful function maybe?
Given that each field is already on a separate line just grep them out, read what is left using read.table and convert from long to wide using tapply to produce the resulting matrix (which can be converted to a data frame or data.table if desired). Note that in read.table we bypass quote, comment and class processing. Finally, test it out to see if it is faster. No packages are used.
nms <- c("identificatie", "aanduidingRecordInactief", "aanduidingRecordCorrectie",
"huisnummer", "officieel", "postcode", "tijdvakgeldigheid.begindatumTijdvakGeldigheid",
"inOnderzoek", "typeAdresseerbaarObject", "bron.documentdatum",
"bron.documentnummer", "nummeraanduidingStatus",
"gerelateerdeOpenbareRuimte.identificatie")
rx <- paste(nms, collapse = "|")
g <- chartr("<", ">", grep(rx, readLines(textConnection(tt)), value = TRUE))
long <- read.table(text = g, sep = ">", quote = "", comment.char = "",
colClasses = "character")[2:3]
names(long) <- c("field", "value")
long$field <- factor(long$field, levels = nms) # maintain order of columns
long$recno <- cumsum(long$field == "identificatie")
with(long, tapply(value, list(recno, field), c))
If all records have exactly the same set of fields, such as those listed in nms, then the last line could be replaced with this (which is likely faster):
matrix(long$value, ncol = length(nms), byrow = TRUE, dimnames = list(NULL, nms))
Another alternative to the tapply line would be to use reshape from base R or to use dcast from the reshape2 package.
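A minimal sketch of the reshape() route on a toy long table with the same three columns (recno, field, value) follows. The second record's values are invented for illustration.

```r
# Toy long-format data: two records, two fields each
long <- data.frame(recno = c(1, 1, 2, 2),
                   field = c("huisnummer", "postcode", "huisnummer", "postcode"),
                   value = c("13", "9904PC", "15", "9904PD"),
                   stringsAsFactors = FALSE)

# One row per record, one column per field
wide <- reshape(long, idvar = "recno", timevar = "field", direction = "wide")

# reshape() prefixes the new columns with "value."; strip that prefix
names(wide) <- sub("^value\\.", "", names(wide))
print(wide)
```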
I have 9880 records in a data frame. I am trying to split it into 9 groups of 1000 each, with a last group of 880 records, and also name them accordingly. I used a for-loop for groups 1-9 but handled the last 880 records manually; I am sure there are better ways to achieve this.
library(sqldf)
for (i in 0:8)
{
assign(paste("test",i,sep="_"),as.data.frame(final_9880[((1000*i)+1):(1000*(i+1)), (1:53)]))
}
test_9<- final_9880[9001:9880,1:53]
I am also unable to append all the parts in one for-loop!
#append all parts
all_9880<-rbind(test_0,test_1,test_2,test_3,test_4,test_5,test_6,test_7,test_8,test_9)
Any help is appreciated, thanks!
A small variation on this solution
ls <- split(final_9880, rep(0:9, each = 1000, length.out = 9880)) # edited to Roman's suggestion
for(i in 1:10) assign(paste("test",i,sep="_"), ls[[i]])
Your command for binding should work.
Edit
If you have many dataframes you can use a parse-eval combo. I use the package gsubfn for readability.
library(gsubfn)
nms <- paste("test", 1:10, sep="_", collapse=",")
eval(fn$parse(text='do.call(rbind, list($nms))'))
How does this work? First I create a string containing the comma-separated list of the dataframes
> paste("test", 1:10, sep="_", collapse=",")
[1] "test_1,test_2,test_3,test_4,test_5,test_6,test_7,test_8,test_9,test_10"
Then I use this string to construct the list
list(test_1,test_2,test_3,test_4,test_5,test_6,test_7,test_8,test_9,test_10)
using parse and eval with string interpolation.
eval(fn$parse(text='list($nms)'))
String interpolation is implemented via the fn$ prefix of parse; its effect is to intercept and substitute $nms with the string contained in the variable nms. Parsing and evaluating the string "list($nms)" creates the list needed. In the solution the rbind is included in the parse-eval combo.
EDIT 2
You can collect all variables with a certain pattern, put them in a list and bind them by rows.
do.call("rbind", sapply(ls(pattern = "test_"), get, simplify = FALSE))
ls finds all variables with a pattern "test_"
sapply retrieves all those variables and stores them in a list
do.call flattens the list row-wise.
No for loop required -- use split
data <- data.frame(a = 1:9880, b = sample(letters, 9880, replace = TRUE))
splitter <- (data$a-1) %/% 1000
.list <- split(data, splitter)
lapply(0:9, function(i){
assign(paste('test',i,sep='_'), .list[[(i+1)]], envir = .GlobalEnv)
return(invisible())
})
all_9880<-rbind(test_0,test_1,test_2,test_3,test_4,test_5,test_6,test_7,test_8,test_9)
identical(all_9880,data)
## [1] TRUE
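If the separate test_0 ... test_9 variables are not strictly required, the pieces can stay in one named list, which makes both the splitting and the re-binding a single call each. This is a sketch on a one-column stand-in for the real data frame; parts is an illustrative name.

```r
data <- data.frame(a = 1:9880)   # stand-in for the real 9880-row data frame

# Group rows into blocks of 1000; the last block gets the remaining 880 rows
parts <- split(data, (seq_len(nrow(data)) - 1) %/% 1000)
names(parts) <- paste0("test_", names(parts))

print(sapply(parts, nrow))

# Bind all parts back together in one call
all_9880 <- do.call(rbind, unname(parts))
```

Individual pieces are then available as parts$test_0, parts[["test_9"]] and so on, without calling assign().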