How to concatenate strings with different separator every n-th element - r

I would like to concatenate words (strings) with different separator every 10-th element, such that each word is separated by a comma until every 10th word then it's separated by a comma and a line break. The ultimate purpose is for printing neatly a list of words into a table.
I can write a loop, but I am hoping for a more elegant solution, like those proposed in related questions that use gsub and regular expressions.
Those solutions insert or replace a string after every n-th character, whereas my words have a variable number of characters.
Edit: I am looking for a solution I can apply to any vector with a variable number of words.
For reproducible data, I generate a vector of 40 random words using this helper function:
MHmakeRandomString <- function(n, length) {
randomString <- c(1:n)
for (i in 1:n) {
randomString[i] <- paste(sample(c(0:9, letters, LETTERS), length, replace=TRUE),
collapse="")}
return(randomString)
}
set.seed(4)
word_vector <- MHmakeRandomString(n=40, length=5)
word_vector
# [1] "A0ihO" "gIUW4" "Kh6Xp" "sYAXL" "IZvuE" "PtQvw" "zeSEt" "YsCo0" "WfzbU" "5TTIz"
# [11] "oKTOO" "qaaTK" "y4QUd" "C4vNY" "lDplP" "Gjrg8" "UHzUT" "32ZcV" "c7xgl" "5Lr2H"
# [21] "fDgxt" "zFdYO" "hohuK" "vrNU4" "8oRg5" "IYcyl" "pblbO" "SHhq0" "yFjWa" "rzYLr"
# [31] "m2AXf" "QdhtM" "TWpkh" "4499K" "5Bcv8" "0DeqI" "6BdTy" "fJgKX" "tUZeh" "HPso5"
I usually do a paste(x, collapse) and then print to table using gridExtra
word_sep <- paste(word_vector, collapse=", ")
# [1] "z6LHb, 1ubB1, o9TZ2, 8s8bV, sZmcB, blirI, gMfo1, xXkkt, gFMrA, hXdaO,
# lNP2Q, p9B9G, JXTsJ, qVsWS, ntiT8, d0QRv, uoR1D, L99Bg, THWQo, meuev,
# IO0Au, 0yWmh, 72d3g, FJRDS, PtbJT, JaXVK, OPo9m, i0678, 6BpXZ, b6hzT,
# BDQBk, ANC5h, 7QPgM, JJSxf, nnX7Z, rbEfm, XXl4Z, kHMuI, wFLyM, P8rlp"
library(gridExtra)
library(cowplot) # plot_grid() comes from cowplot, not gridExtra
plot_grid(tableGrob(word_sep))
Current table output: in this case I have a really long list of words and a specified table width, so I need line breaks.
My desired output would look like this hacked version:
word_sep2 <- paste(c(paste(MHmakeRandomString(n=10, length=5), collapse=", "), ",\n",
paste(MHmakeRandomString(n=10, length=5), collapse=", "), ",\n",
paste(MHmakeRandomString(n=10, length=5), collapse=", "), ",\n",
paste(MHmakeRandomString(n=10, length=5), collapse=", ")), collapse="")
word_sep2
# [1] "0ahiL, 2pA5c, dKWuR, 79sw5, MeL1I, KpB1w, UNLSo, LlDlN, jNOcI, tv8R5,
# \norf60, avKFo, jZFxE, U7RQW, SSmxD, czlMt, 75zEB, 2jLwG, 08dmN, H3sVW,
# \nCZwQt, ggumo, wHUpj, Z7WGR, BHYLE, eWksX, Lbt3D, P1Brf, OpEvk, 1WFVa,
# \nEeFd4, afX7B, nyBzF, vbNLz, U7MU0, H4rx4, AKgv8, Kbzri, KKajp, Yg6EW"
plot_grid(tableGrob(word_sep2))
Desired table output:

You may use
gsub("((?:[^,]*,){10}) ", "\\1\n", word_sep)
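Details:
((?:[^,]*,){10}) - Group 1 (referred to with \1 in the replacement pattern), which matches 10 consecutive occurrences of:
  [^,]* - zero or more characters other than a comma
  , - a comma
(a single space) - the space after the 10th comma; it is matched outside the group, so it is dropped and the newline takes its place
See the R demo: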
MHmakeRandomString <- function(n, length) {
randomString <- c(1:n)
for (i in 1:n) {
randomString[i] <- paste(sample(c(0:9, letters, LETTERS), length, replace=TRUE),
collapse="")}
return(randomString)
}
set.seed(4)
word_vector <- MHmakeRandomString(n=40, length=5)
word_sep <- paste(word_vector, collapse=", ")
f <- gsub("((?:[^,]*,){10}) ", "\\1\n", word_sep)
cat(f, "\n") # cat() has no collapse argument; print the string followed by a newline
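If the group size needs to vary, the same pattern can be built with sprintf. This is only a minimal sketch: the function name wrap_every is made up for illustration, and it assumes the words themselves contain no commas and that the input was joined with ", ". It passes perl = TRUE so the non-capturing group is handled by PCRE.

```r
# Hypothetical helper: break a ", "-joined string after every n-th comma.
wrap_every <- function(x, n = 10) {
  # Build "((?:[^,]*,){n}) " for the requested group size
  pattern <- sprintf("((?:[^,]*,){%d}) ", as.integer(n))
  # Keep the n words and their commas (group 1), swap the trailing space for a newline
  gsub(pattern, "\\1\n", x, perl = TRUE)
}

short <- paste(c("a", "b", "c", "d", "e"), collapse = ", ")
cat(wrap_every(short, n = 2), "\n")
```

Because only complete groups of n commas are matched, a final incomplete group is left untouched and no trailing separator is added.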

I guess you can do it with paste:
paste(word_vector, rep(c(", ", ",\n"), c(9,1)), collapse = "", sep = "")
[1] "A0ihO, gIUW4, Kh6Xp, sYAXL, IZvuE, PtQvw, zeSEt, YsCo0, WfzbU, 5TTIz,\noKTOO, qaaTK, y4QUd, C4vNY, lDplP, Gjrg8, UHzUT, 32ZcV, c7xgl, 5Lr2H,\nfDgxt, zFdYO, hohuK, vrNU4, 8oRg5, IYcyl, pblbO, SHhq0, yFjWa, rzYLr,\nm2AXf, QdhtM, TWpkh, 4499K, 5Bcv8, 0DeqI, 6BdTy, fJgKX, tUZeh, HPso5,\n"
Here's what it looks like when printing it with cat:
res <- paste(word_vector, rep(c(", ", ",\n"), c(9,1)), collapse = "", sep = "")
cat(res)
# A0ihO, gIUW4, Kh6Xp, sYAXL, IZvuE, PtQvw, zeSEt, YsCo0, WfzbU, 5TTIz,
# oKTOO, qaaTK, y4QUd, C4vNY, lDplP, Gjrg8, UHzUT, 32ZcV, c7xgl, 5Lr2H,
# fDgxt, zFdYO, hohuK, vrNU4, 8oRg5, IYcyl, pblbO, SHhq0, yFjWa, rzYLr,
# m2AXf, QdhtM, TWpkh, 4499K, 5Bcv8, 0DeqI, 6BdTy, fJgKX, tUZeh, HPso5,
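Note that the result above ends with a trailing ",\n". If that is not wanted, or the vector length is not a multiple of 10, one option is to build the separator vector explicitly. The following is a sketch only; the function name join_every and its arguments are made up for illustration.

```r
# Hypothetical helper: join words with `sep`, but use `brk` after every n-th word.
join_every <- function(x, n = 10, sep = ", ", brk = ",\n") {
  if (length(x) == 0) return("")
  seps <- rep(sep, length(x))          # default separator after each word
  seps[seq_along(x) %% n == 0] <- brk  # line break after every n-th word
  seps[length(x)] <- ""                # nothing after the last word
  paste0(x, seps, collapse = "")
}

cat(join_every(c("a", "b", "c", "d", "e"), n = 2), "\n")
```

This works for a vector of any length, since the separators are computed from the positions rather than recycled.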


convert numbers in to letters in the data frame

How can I convert the numbers in my last column into letters, where 1=A, 2=C, 3=G, 4=T, 5=^, 6=!?
```
"TCCA^TA!A"
"TT^CAATTAA!C"
test <- list(c("T"="4", "C"="2", "C"="2", "A"="1", "^"="5", "T"="4", "A"="1","!"="6","A"="1"),c("T"="4","T"="4","^"="5","C"="2","A"="1","A"="1","T"="4","T"="4","A"="1","A"="1","!"="6","C"="2"))
oddfunction <- function(test) {
first_column<- test
second_column<- sort(first_column)
third_column<-paste(first_column,second_column)
fourth_column<-sort(third_column)
column_5<-paste(first_column,fourth_column)
column_6<-sort(column_5)
column_7<-paste(first_column,column_6)
column_8<-sort(column_7)
column_9<-paste(first_column,column_8)
column_10<-sort(column_9)
column_11<-paste(first_column,column_10)
df<-data.frame(first_column, second_column, third_column, fourth_column,column_5,column_6,column_7,column_8,column_9,column_10,column_11)
print(df)
}
for (i in test) {
oddfunction(i)
}
```
Use chartr:
chartr("123456", "ACGT^!", "1232126565")
#[1] "ACGCAC!^!^"
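Applied to the last column of a data frame, this could look like the sketch below. The data frame and its column names are invented for illustration; a numeric column would first be converted with as.character.

```r
# Hypothetical data frame whose last column holds the digit codes
df <- data.frame(
  id   = 1:3,
  code = c("4221", "5416", "123456"),
  stringsAsFactors = FALSE
)

last <- ncol(df)
# Translate each digit to its letter: 1=A, 2=C, 3=G, 4=T, 5=^, 6=!
df[[last]] <- chartr("123456", "ACGT^!", as.character(df[[last]]))
df
```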

Is there a dynamic way to extract information from forms?

I want to write an R script that extracts information from MSG files (emails).
The emails are automated sign-up mails from a website. They contain information about the user (forename, surname, email, etc.). I try to extract the specific fields using regex. The problem is that the order of the fields may vary.
I use the msgxtractr library, which works fine. The output looks like this:
\r\n\r\nAnrede \r\n\r\nHerr\r\n\r\nVorname \r\n\r\nJames \r\n\r\nName \r\n\r\nBond \r\n\r\
To get the information, I extract the text between two text patterns with (.*?).
Example:
"Vorname \r\n\r\n(.*?) \r\n\r\n"
library(msgxtractr) # usage
library(magrittr)
#------ set the path ------
setwd(dirname(rstudioapi::getActiveDocumentContext()$path))
#------ read the MSG file ------
BALBLI = read_msg("MSG/Test2.msg")
#------ extract the text between 2 patterns ------
testAR = BALBLI[["body"]][["text"]] # body of the MSG file
patternVN= "Vorname \r\n\r\n(.*?) \r\n\r\n"
searchVN <- regmatches(testAR,regexec(patternVN,testAR))
Vorname = searchVN[[1]][2]
Vorname
I have tried two test cases:
1) Good Result:
> patternVN= "Vorname \r\n\r\n(.*?) \r\n\r\n"
> searchVN <- regmatches(testAR,regexec(patternVN,testAR))
> Vorname = searchVN[[1]][2]
> Vorname
[1] "James"
2) Bad Result:
> patternVN= "Vorname \r\n\r\n(.*?) \r\n\r\n"
> searchVN <- regmatches(testAR,regexec(patternVN,testAR))
> Vorname = searchVN[[1]][2]
> Vorname
[1] "John\r\n\r\nName"
In this case the match runs past the first name and also captures the label of the next field (Name).
I would try a completely different approach.
msg <- "\r\n\r\nAnrede \r\n\r\nHerr\r\n\r\nVorname \r\n\r\nJames \r\n\r\nName \r\n\r\nBond \r\n\r\n"
msg <- gsub("^\\s+", "", msg) # remove spaces at the beginning and end
msg <- gsub("\\s+$", "", msg)
words <- strsplit(msg, " *[\n\r]+ *")[[1]]
res <- as.list(words[seq(2, length(words), 2)])
names(res) <- words[seq(1, length(words), 2)]
Result
> res
$Anrede
[1] "Herr"
$Vorname
[1] "James"
$Name
[1] "Bond"
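Wrapped in a function, the same idea can be reused for every message, and the fields can be looked up by label regardless of their order. This is only a sketch: parse_form is a made-up name, the second message is invented test data, and it assumes every label is followed by exactly one non-empty value.

```r
# Hypothetical helper: turn "label / value" blocks into a named list.
parse_form <- function(body) {
  # Drop leading/trailing whitespace, then split on line breaks and surrounding spaces
  parts <- strsplit(trimws(body), " *[\r\n]+ *")[[1]]
  labels <- parts[seq(1, length(parts), by = 2)]  # odd positions: field labels
  values <- parts[seq(2, length(parts), by = 2)]  # even positions: field values
  setNames(as.list(values), labels)
}

msg1 <- "\r\n\r\nAnrede \r\n\r\nHerr\r\n\r\nVorname \r\n\r\nJames \r\n\r\nName \r\n\r\nBond \r\n\r\n"
# Invented example with a different field order and no space after the first name
msg2 <- "Anrede \r\n\r\nHerr\r\n\r\nName \r\n\r\nSmith \r\n\r\nVorname \r\n\r\nJohn\r\n\r\n"

parse_form(msg1)[["Vorname"]]
parse_form(msg2)[["Vorname"]]
```

Since the value is found by its label, it no longer matters whether Vorname comes before or after Name, or whether the value is followed by a space.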

How to sub matching words with bracketed words?

Trying to create a function to bracket reserved words in Access for a SQL query:
library(dplyr)
tester <- data.frame(names=c("Add", "Date", "Test", "DOB"))
bracket_access <- function(x) {x %>% gsub(c("ADD|ALL|Alphanumeric|ALTER|AND|ANY|Application|AS|ASC|Assistant|
AUTOINCREMENT|Avg|BETWEEN|BINARY|BIT|BOOLEAN|BY|BYTE|CHAR|CHARACTER|
COLUMN|CompactDatabase|CONSTRAINT|Container|Count|COUNTER|CREATE|CreateDatabase|
CreateField|CreateGroup|CreateIndex|CreateObject|CreateProperty|CreateRelation|
CreateTableDef|CreateUser|CreateWorkspace|CURRENCY|CurrentUser|DATABASE|DATE|
DATETIME|DELETE|DESC|Description|DISALLOW|DISTINCT|DISTINCTROW|Document|
DOUBLE|DROP|Echo|Else|End|Eqv|Error|EXISTS|Exit|FALSE|Field |Fields|
FillCache|FLOAT |FLOAT4 |FLOAT8|FOREIGN|Form |Forms|FROM|Full|FUNCTION|
GENERAL|GetObject|GetOption|GotoPage|GROUP|GROUP BY|GUID|HAVING|Idle|
IEEEDOUBLE|IEEESINGLE|If|IGNORE|Imp|IN|INDEX|Index|Indexes|INNER|
INSERT|InsertText|INT|INTEGER|INTEGER1 |INTEGER2 |INTEGER4|INTO|IS|
JOIN|KEY|LastModified|LEFT|Level|Like|LOGICAL |LOGICAL1|LONG |LONGBINARY|
LONGTEXT|Macro|Match|Max |Min |Mod|MEMO|Module|MONEY|Move|NAME|
NewPassword|NO|Not|Note|NULL|NUMBER |NUMERIC|Object|OLEOBJECT|OFF|ON|
OpenRecordset|OPTION|OR|ORDER|Orientation|Outer|OWNERACCESS|Parameter|
PARAMETERS|Partial|PERCENT|PIVOT|PRIMARY|PROCEDURE|Property|Queries|Query|
Quit|REAL|Recalc|Recordset|REFERENCES|Refresh|RefreshLink|RegisterDatabase|
Relation|Repaint|RepairDatabase|Report|Reports|Requery|RIGHT|SCREEN|SECTION|
SELECT|SET|SetFocus|SetOption|SHORT|SINGLE|SMALLINT|SOME|SQL|StDev|
StDevP|STRING|Sum|TABLE|TableDef|TableDefs|TableID|TEXT|TIME |TIMESTAMP|
TOP|TRANSFORM|TRUE|Type|UNION|UNIQUE|UPDATE|USER|VALUE|VALUES|Var|
VarP|VARBINARY|VARCHAR|VERSION|WHERE|WITH|Workspace|Xor|Year|YES|YESNO"), paste0("[",.,"]"), ignore.case = T)
}
bracket_access(tester)
I get a numeric output and I don't really understand why:
> bracket_access(tester)
[1] "[c(1, 2, 4, 3)]"
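The numeric output appears because the pipe passes x as the first argument of gsub (the pattern), so paste0("[", ., "]") ends up as the string being searched; pasting a whole data frame coerces its factor column to the integer codes c(1, 2, 4, 3). You may fix the current approach by passing the names column itself, matching and capturing the strings equal to the alternatives you provided, and replacing each match with [\\1] in the gsub: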
bracket_access <- function(x) {
gsub("^(ADD|ALL|Alphanumeric|ALTER|AND|ANY|Application|AS|ASC|Assistant|AUTOINCREMENT|Avg|BETWEEN|BINARY|BIT|BOOLEAN|BY|BYTE|CHAR|CHARACTER|COLUMN|CompactDatabase|CONSTRAINT|Container|Count|COUNTER|CREATE|CreateDatabase|CreateField|CreateGroup|CreateIndex|CreateObject|CreateProperty|CreateRelation|CreateTableDef|CreateUser|CreateWorkspace|CURRENCY|CurrentUser|DATABASE|DATE|DATETIME|DELETE|DESC|Description|DISALLOW|DISTINCT|DISTINCTROW|Document|DOUBLE|DROP|Echo|Else|End|Eqv|Error|EXISTS|Exit|FALSE|Field |Fields|FillCache|FLOAT |FLOAT4 |FLOAT8|FOREIGN|Form |Forms|FROM|Full|FUNCTION|GENERAL|GetObject|GetOption|GotoPage|GROUP|GROUP BY|GUID|HAVING|Idle|IEEEDOUBLE|IEEESINGLE|If|IGNORE|Imp|IN|INDEX|Index|Indexes|INNER|INSERT|InsertText|INT|INTEGER|INTEGER1 |INTEGER2 |INTEGER4|INTO|IS|JOIN|KEY|LastModified|LEFT|Level|Like|LOGICAL |LOGICAL1|LONG |LONGBINARY|LONGTEXT|Macro|Match|Max |Min |Mod|MEMO|Module|MONEY|Move|NAME|NewPassword|NO|Not|Note|NULL|NUMBER |NUMERIC|Object|OLEOBJECT|OFF|ON|OpenRecordset|OPTION|OR|ORDER|Orientation|Outer|OWNERACCESS|Parameter|PARAMETERS|Partial|PERCENT|PIVOT|PRIMARY|PROCEDURE|Property|Queries|Query|Quit|REAL|Recalc|Recordset|REFERENCES|Refresh|RefreshLink|RegisterDatabase|Relation|Repaint|RepairDatabase|Report|Reports|Requery|RIGHT|SCREEN|SECTION|SELECT|SET|SetFocus|SetOption|SHORT|SINGLE|SMALLINT|SOME|SQL|StDev|StDevP|STRING|Sum|TABLE|TableDef|TableDefs|TableID|TEXT|TIME |TIMESTAMP|TOP|TRANSFORM|TRUE|Type|UNION|UNIQUE|UPDATE|USER|VALUE|VALUES|Var|VarP|VARBINARY|VARCHAR|VERSION|WHERE|WITH|Workspace|Xor|Year|YES|YESNO)$",
"[\\1]",
x,
ignore.case = T)
}
bracket_access(tester$names)
## => [1] "[Add]" "[Date]" "Test" "DOB"
Here, the gsub pattern looks like ^(word1|word2|...|wordN)$ and once there is a match, the whole string is wrapped with [...] and put back (the \\1 is a placeholder for the capturing group #1 in the pattern, and there is one defined with a pair of unescaped parentheses).
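To keep the word list maintainable, the pattern can also be assembled from a character vector. The sketch below uses a shortened, illustrative list and a made-up function name, bracket_reserved. It trims each word first, because an alternative such as "Field " (with a trailing space) between ^ and $ would only match a name that also ends in a space. It assumes the words contain no regex metacharacters.

```r
# Shortened, illustrative subset of the reserved words
reserved <- c("ADD", "DATE ", "GROUP BY", "NAME")

# Hypothetical helper: wrap names that exactly match a reserved word in [ ]
bracket_reserved <- function(x, words) {
  # Builds ^(ADD|DATE|GROUP BY|NAME)$ from the trimmed word list
  pattern <- paste0("^(", paste(trimws(words), collapse = "|"), ")$")
  sub(pattern, "[\\1]", x, ignore.case = TRUE)
}

bracket_reserved(c("Add", "Date", "Test", "DOB"), reserved)
```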

Different spacing while printing to log

I am printing the importance matrix of XGBoost to a log using the write command (write works with a file connection and directs output to stderr well). Here is the command I am using:
importance_matrix <- xgb.importance(names, model=bst)
write("The top 30 variables are:",stderr())
write(paste0("Feature",'\t','\t','Gain','\t','Cover','\t','Frequency'),stderr())
write(t(as.matrix(importance_matrix[1:30,])),sep="\t",ncolumns = length(names(importance_matrix)),stderr())
Output comes in format:
Feature Gain Cover Frequency
pctTillDate 0.560359696 0.1314074664 0.024278250
colr_per 0.183149483 0.0962457545 0.049618673
date 0.050528297 0.1143752021 0.066395735
GREG_D 0.025648433 0.0381476142 0.018070143
LNGTD_I 0.020346020 0.0485235001 0.101322109
LATTD_I 0.019241497 0.0421892270 0.093867103
which makes it look a bit clumsy (much clumsier in the log than it appears here on SO). To make it look better, I want to change the last line, t(as.matrix(importance_matrix[1:30,])),sep="\t", so that the first separator is two tabs ('\t','\t') and the rest are single tabs ('\t'), instead of the current uniform spacing. It seems simple, but searching has not given me any ideas. Any suggestions?
Consider padding the column names and the first character column of the matrix with whitespace, so that each is aligned to the longest string in the first column:
write.table(importance_matrix, sep="\t", row.names = FALSE, quote = FALSE)
# Feature Gain Cover Frequency
# pctTillDate 0.56035970 0.13140747 0.02427825
# colr_per 0.18314948 0.09624575 0.04961867
# date 0.05052830 0.11437520 0.06639573
# GREG_D 0.02564843 0.03814761 0.01807014
# LNGTD_I 0.02034602 0.04852350 0.10132211
# LATTD_I 0.01924150 0.04218923 0.09386710
new_matrix <- importance_matrix
# FIRST COLUMN LARGEST CHAR LENGTH
charmax <- max(nchar(new_matrix[,1]))
# PAD COLUMN HEADERS
colnames(new_matrix) <- lapply(1:ncol(new_matrix), function(i)
paste0(colnames(new_matrix)[i],
paste(rep(" ", charmax - nchar(colnames(new_matrix)[i])), collapse=""))
)
# PAD FIRST COLUMN
new_matrix[,1] <- sapply(1:nrow(new_matrix), function(i)
paste0(new_matrix[i,1],
paste(rep(" ", charmax - nchar(new_matrix[i,1])), collapse=""))
)
write.table(new_matrix, sep="\t", row.names = FALSE, quote = FALSE)
# Feature Gain Cover Frequency
# pctTillDate 0.56035970 0.13140747 0.02427825
# colr_per 0.18314948 0.09624575 0.04961867
# date 0.05052830 0.11437520 0.06639573
# GREG_D 0.02564843 0.03814761 0.01807014
# LNGTD_I 0.02034602 0.04852350 0.10132211
# LATTD_I 0.01924150 0.04218923 0.09386710
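A shorter variant of the same padding idea is sketched below, relying on format(), which left-justifies character vectors and pads them to a common width. The small data frame is a stand-in built from the values shown in the question, not the real xgb.importance output, and only the first column and its header are padded.

```r
# Stand-in for the importance matrix (values copied from the question)
imp <- data.frame(
  Feature = c("pctTillDate", "colr_per", "date"),
  Gain    = c(0.560359696, 0.183149483, 0.050528297),
  stringsAsFactors = FALSE
)

# Pad the header and the first column together so they share one width
padded <- format(c(names(imp)[1], imp[[1]]))
names(imp)[1] <- padded[1]
imp[[1]] <- padded[-1]

# write.table accepts a connection, so the table can go straight to stderr
write.table(imp, file = stderr(), sep = "\t", row.names = FALSE, quote = FALSE)
```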

R loop for creating and using -csv

I have a function output (from koRpus) of the form:
Total number of tokens: 887
Total number of types: 393
Measure of Textual Lexical Diversity
MTLD: 142.66
Number of factors: 6.22
Factor size: 0.72
SD tokens/factor: 41.55 (all factors)
38 (complete factors only)
And I want to make a loop for storing these results for 80 different documents. I have tried the following:
for (i in 1:length(infra$tableid)) {
whypar <- paste(infra$whypar [infra[,1] ==i], collapse=" ")
wpi<- removeWords(whypar, stopwords("english"))
as.data.frame(wpi)
write.csv(data.frame(wpi), file= "wp.csv")
tagged.text <- tokenize("wp.csv", lang="en")
res.mtld <- MTLD(tagged.text)
write.csv(data.frame(res.mtld),file="output.csv")
}
where infra is:
tableid 1, 2, 3, ... 80
whypar "I took part because xxx", "I believe that jshfdjk", "jhsadkjhd" ... (N=350)
Thanks for any help
Extract the parts of the MTLD object you are interested in first. From your question it seems you are only interested in a subset of the object returned by MTLD, namely the MTLD score, the factor size, the SD of tokens/factor, and the SD for complete factors only. If you only want these results for each file, you can write one table as your output for all the files:
res <- data.frame( ID = numeric() , MTLD=numeric() , Factor_Size=numeric() , SD=numeric() , SD_Complete=numeric() )
for (i in 1:length(infra$tableid)) {
whypar <- paste(infra$whypar [infra[,1] ==i], collapse=" ")
wpi<- removeWords(whypar, stopwords("english"))
wpi <- as.data.frame(wpi)
write.csv(data.frame(wpi), file= "wp.csv")
tagged.text <- tokenize("wp.csv", lang="en")
res.mtld <- MTLD(tagged.text)
mtld <- res.mtld@MTLD$MTLD # slots of the S4 result are accessed with @
fac.size <- res.mtld@param$factor.size
mtld.sd <- res.mtld@MTLD$lengths$sd
mtld.sd.compl <- res.mtld@MTLD$lengths$sd.compl
res <- rbind( res , c( infra$tableid[i] , mtld, fac.size , mtld.sd , mtld.sd.compl ) )
}
write.csv( res , file="output.csv" )
I hope this helps, but check these are the results you want returned.
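As an alternative to growing res with rbind inside the loop, the rows can be collected in a list and bound once at the end, which keeps the column names intact. The sketch below shows only that pattern: score_document is a hypothetical stand-in for the tokenize/MTLD step (it just counts tokens and types so the example runs without koRpus), and the docs data frame is invented.

```r
# Hypothetical stand-in for the per-document koRpus computation;
# it returns one row of summary values for a document.
score_document <- function(id, text) {
  tokens <- strsplit(text, "\\s+")[[1]]
  data.frame(ID = id,
             Tokens = length(tokens),
             Types = length(unique(tolower(tokens))))
}

# Invented data in the shape described in the question
docs <- data.frame(
  tableid = c(1, 1, 2),
  whypar  = c("I took part", "because I care", "I believe that"),
  stringsAsFactors = FALSE
)

# One row per document id, then a single rbind at the end
rows <- lapply(unique(docs$tableid), function(id) {
  text <- paste(docs$whypar[docs$tableid == id], collapse = " ")
  score_document(id, text)
})
res <- do.call(rbind, rows)
res
```

In the real loop, the body of score_document would be replaced by the koRpus calls and the slot extraction shown above.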
