Save .dta files with long strings in R

I have to save an R dataset in Stata's .dta format. The dataset contains, among other data, a single column containing long strings (column 3).
Test data:
r_data <- data.frame(ae = 1, be = 2,
                     ce = "Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat, sed diam voluptua. At vero eos et accusam et justo duo dolores et ea rebum. Stet clita kasd gubergren, no sea takimata sanctus est Lorem ipsum dolor sit amet",
                     stringsAsFactors = FALSE)
Export to .dta:
library(foreign)
write.dta(r_data, file = "r_data.dta")
results in this warning message:
Warning message:
In write.dta(r_data, file = "r_data.dta") :
character strings of >244 bytes in column 3 will be truncated
Furthermore, I can't open the file in Stata (14 SE) at all due to an error stating:
. use "r_data.dta"
file not Stata format
.dta file contains 1 invalid storage-type code.
File uses invalid codes other than code 0.
r(610);
How can I save longer strings to a .dta file? An R solution is preferred because I am not experienced with Stata.
PS: The indirect route via a CSV file does not work, because the resulting CSV file is too big for my limited RAM when importing into Stata.

Old question, but it deserves an answer: use the haven package to write a .dta file in Stata 14 format.
library(haven)
r_data <- data.frame(ae = 1, be = 2,
                     ce = "Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat, sed diam voluptua. At vero eos et accusam et justo duo dolores et ea rebum. Stet clita kasd gubergren, no sea takimata sanctus est Lorem ipsum dolor sit amet",
                     stringsAsFactors = FALSE)
write_dta(r_data, "r_data.dta")
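Stata 13 and later raised the string limit well beyond the old 244 bytes, so nothing is truncated; write_dta() also takes a version argument (14 by default) if you need to target a different release. A quick round-trip check to confirm the column survives:
library(haven)
check <- read_dta("r_data.dta")
# The long string in column 3 comes back untruncated
nchar(check$ce) == nchar(r_data$ce)  # TRUE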

Related

Pull all 8-digit numbers from a data frame

I have this assignment where I need to pull all the 8-digit numbers from a text file. I've converted the text file into a data frame and now have some 67 columns with 18,000 rows, including empty cells.
Within this table some 8-digit numbers exist (not in any particular row or column), which is what I want to extract.
I need all these numbers extracted into one single column, without checking for duplicates.
The only code I've written so far:
data <- read.delim("cerupload_adsi_1_01-02-2019.txt", header = FALSE, sep="|")
You may use regmatches() and match runs of exactly 8 digits with the regex "\\d{8}". Adding word boundaries, "\\b\\d{8}\\b", makes this more robust by excluding longer digit runs.
Example
txt <- "Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod
tempor invidunt ut labore et dolore 235462354 magna aliquyam erat, sed diam voluptua. At
vero eos et accusam et justo duo dolores et ea rebum. Stet clita 235 kasd gubergren, no sea
takimata sanctus est Lorem ipsum dolor sit amet. 12345678 Lorem ipsum dolor 345.454 sit amet,
12345678 consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut labore et
dolore magna aliquyam erat, sed diam 345 voluptua. At vero eos et accusam et justo duo
dolores et ea rebum. Stet clita 12345.67 12345.678 kasd gubergren, no sea takimata sanctus
est Lorem ipsum dolor sit amet. 12345678"
regmatches(txt, gregexpr("\\b\\d{8}\\b", txt))
# [[1]]
# [1] "12345678" "12345678" "12345678"
First, put all of your data into a simple integer vector:
data = as.integer(unlist(data))
Next, remove any elements that weren't convertible to integers (optional):
data = data[!is.na(data)]
Next, find the integers that are 8 characters long:
data = data[nchar(as.character(data))==8]
Then, make a data.frame with the integer vector as a column:
data = data.frame(x=data)
Using str_extract_all from stringr (word boundaries ensure only standalone 8-digit runs match):
temp <- data.frame(col = unlist(stringr::str_extract_all(unlist(data), "\\b\\d{8}\\b")))
temp
# col
#1 12352318
#2 98765432
Tested on this sample data with two columns:
data <- data.frame(a = "This is a text with number 1234 and 12352318",
                   b = "More random text 123456789 98765432")

Efficiently break up a string based on the nth occurrence of a substring using R

Introduction
Given a string in R, is it possible to get a vectorized solution (i.e. no loops) that breaks the string into blocks, where each block is delimited by every nth occurrence of a substring?
Work done with Reproducible Example
Suppose we have several paragraphs of the famous Lorem Ipsum text.
library(strex)
# devtools::install_github("aakosm/lipsum")
library(lipsum)
my.string = capture.output(lipsum(5))
my.string = paste(my.string, collapse = " ")
> my.string # (partial output)
# [1] "Lorem ipsum dolor ... id est laborum. "
We would like to break this text into segments at every 3rd occurrence of the word " in" (a space is included in order to distinguish it from words that contain "in", such as "min").
I have the following solution with a while loop:
# We wish to break up the string at every
# 3rd occurrence of the word "in"
break.character = " in"
break.occurrence = 3
string.list = list()
i = 1
# initialize string to send into the loop
current.string = my.string
while (length(current.string) > 0) {
  # Store the segment that occurs BEFORE the nth occurrence of the break character
  string.list[[i]] = str_before_nth(current.string, break.character, break.occurrence)
  # Update the next string to examine:
  # the current string AFTER the nth occurrence of the break character
  current.string = str_after_nth(current.string, break.character, break.occurrence)
  i = i + 1
}
We are able to get the desired output in a list, albeit with a warning (not shown):
> string.list (#partial output shown)
[[1]]
[1] "Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit"
[[2]]
[1] " voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum. Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor"
...
[[6]]
[1] " voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum. Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor"
Goal
Is it possible to improve this solution by vectorizing (i.e. using apply(), lapply(), mapply(), etc.)? Also, my current solution cuts off the last occurrence of the substring in a block.
The current solution may not work well on extremely long strings (such as DNA sequences where we are looking for blocks with the nth occurrence of a substring of nucleotides).
Try with this:
text_split = strsplit(my.string, " in ")[[1]]
l = length(text_split)
n = floor(l/3)
Seq = seq(1, by = 3, length.out = n)   # block starts: pieces 1, 4, 7, ...
L = sapply(Seq, function(x){
  paste0(paste(text_split[x:(x+2)], collapse = " in "), " in ")
})
if (l > (n*3)){
  L = c(L, paste(text_split[(n*3+1):l], collapse = " in "))
}
The last conditional handles the case where the number of " in " occurrences is not divisible by 3. The trailing " in " pasted inside the sapply() is there because it is unclear what you want to do with the occurrence that separates your blocks; as written, each block keeps its closing " in ". Let me know if this does the trick. I will try to make it faster, and if it works I'll annotate it more too.
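A toy example (hypothetical, not from the original post) makes the indexing easier to follow:
my.string <- "a in b in c in d in e in f in g"
text_split <- strsplit(my.string, " in ")[[1]]  # "a" "b" ... "g": 7 pieces, 6 separators
n <- floor(length(text_split) / 3)              # 2 complete blocks of 3 pieces
Seq <- seq(1, by = 3, length.out = n)           # block starts: 1, 4
L <- sapply(Seq, function(x){
  paste0(paste(text_split[x:(x+2)], collapse = " in "), " in ")
})
c(L, paste(text_split[(n*3+1):length(text_split)], collapse = " in "))
# [1] "a in b in c in " "d in e in f in " "g"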
library(lipsum)
library(stringi)
my.string = capture.output(lipsum(5))
my.string = paste(my.string, collapse = " ")
# End positions of every " in " occurrence
end_of_in <- stri_locate_all(fixed = " in ", my.string)[[1]][, 2]
# The logical c(F, F, T) recycles to pick out every 3rd occurrence
start_of_strings <- c(1, end_of_in[c(F, F, T)])
end_of_strings <- c(end_of_in[c(F, F, T)] - 1, nchar(my.string))
# Drop a duplicated endpoint (possible when a block boundary falls at the string's end)
end_of_strings <- end_of_strings[!duplicated(end_of_strings)]
stri_sub(my.string, start_of_strings, end_of_strings)
EDIT: actually, use stri_sub from stringi. It will scale much better than substring, since it extracts all blocks from integer offsets in one vectorized call instead of splitting and re-pasting the full string. See:
my.string <- paste(rep(my.string, 10000), collapse = " ")
nchar(my.string)
[1] 22349999
microbenchmark::microbenchmark(
  sol1 = {
    text_split = strsplit(my.string, " in ")[[1]]
    l = length(text_split)
    n = floor(l/3)
    Seq = seq(1, by = 3, length.out = n)
    L = sapply(Seq, function(x){
      paste0(paste(text_split[x:(x+2)], collapse = " in "), " in ")
    })
    if (l > (n*3)){
      L = c(L, paste(text_split[(n*3+1):l], collapse = " in "))
    }
  },
  sol2 = {
    end_of_in <- stri_locate_all(fixed = " in ", my.string)[[1]][, 2]
    start_of_strings <- c(1, end_of_in[c(F, F, T)])
    end_of_strings <- c(end_of_in[c(F, F, T)] - 1, nchar(my.string))
    end_of_strings <- end_of_strings[!duplicated(end_of_strings)]
    stri_sub(my.string, start_of_strings, end_of_strings)
  },
  times = 10
)
Unit: milliseconds
expr min lq mean median uq max neval
sol1 914.1268 927.45958 941.36117 939.80361 950.18099 980.86941 10
sol2 55.4163 56.40759 58.53444 56.86043 57.03707 71.02974 10

TermDocumentMatrix in R - only 1-grams created

I just started with the tm package in R and cannot seem to overcome an issue.
Even though my tokenizer functions seem to work right:
uniTokenizer <- function(x) NGramTokenizer(x, Weka_control(min=1, max=1))
biTokenizer <- function(x) NGramTokenizer(x, Weka_control(min=2, max=2))
triTokenizer <- function(x) NGramTokenizer(x, Weka_control(min=3, max=3))
uniTDM <- TermDocumentMatrix(corpus, control=list(tokenize = uniTokenizer))
biTDM <- TermDocumentMatrix(corpus, control=list(tokenize = biTokenizer))
triTDM <- TermDocumentMatrix(corpus, control=list(tokenize = triTokenizer))
when I try to pull 2-grams from biTDM, only 1-grams come up...
findFreqTerms(biTDM, 50)
[1] "after" "and" "most" "the" "were" "years" "love"
[8] "you" "all" "also" "been" "did" "from" "get"
At the same time, the 2-gram tokenizer appears to be intact:
x <- biTokenizer(corpus)
head(x)
[1] "c in" "in the" "the years"
[4] "years thereafter" "thereafter most" "most of"
I can only assume what the problem is here: NGramTokenizer needs a VCorpus object rather than a Corpus object.
library(tm)
library(RWeka)
# some dummy text
text <- c("Lorem ipsum dolor sit amet, consetetur sadipscing elitr",
"sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat",
"sed diam voluptua. At vero eos et accusam et justo duo dolores et ea rebum",
"Stet clita kasd gubergren, no sea takimata sanctus est Lorem ipsum dolor sit amet")
# create a VCorpus
corpus <- VCorpus(VectorSource(text))
biTokenizer <- function(x) NGramTokenizer(x, Weka_control(min=2, max=2))
biTDM <- TermDocumentMatrix(corpus, control=list(tokenize = biTokenizer))
print(biTDM$dimnames$Terms)
[1] "accusam et" "aliquyam erat" "amet consetetur" "at vero" "clita kasd" "consetetur sadipscing" "diam nonumy" "diam voluptua" "dolor sit" "dolore magna"
[11] "dolores et" "duo dolores" "ea rebum" "eirmod tempor" "eos et" "est lorem" "et accusam" "et dolore" "et ea" "et justo"
[21] "gubergren no" "invidunt ut" "ipsum dolor" "justo duo" "kasd gubergren" "labore et" "lorem ipsum" "magna aliquyam" "no sea" "nonumy eirmod"
[31] "sadipscing elitr" "sanctus est" "sea takimata" "sed diam" "sit amet" "stet clita" "takimata sanctus" "tempor invidunt" "ut labore" "vero eos"
[41] "voluptua at"

How to extract repeated patterns from a string

I need to extract certain patterns from the text below.
Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Budget 2016-2017
Curabitur dictum gravida mauris. Budget 2015-2016 mauris ut leo. Cras
viverra metus rhoncus sem
I need to get the 'Budget \d{4}-\d{4}' part of the text so it looks like:
[1] "Budget 2016-2017" "Budget 2015-2016"
You can get what you want with the following:
library(stringr)
string <- "Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Budget 2016-2017 Curabitur dictum gravida mauris. Budget 2015-2016 mauris ut leo. Cras viverra metus rhoncus sem"
unlist(str_extract_all(string, 'Budget [0-9]{4}-[0-9]{4}'))
Result:
> unlist(str_extract_all(string, 'Budget [0-9]{4}-[0-9]{4}'))
[1] "Budget 2016-2017" "Budget 2015-2016"
Something close, though note that the greedy .* means this gsub() keeps only the last match:
s <- "Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Budget 2016-2017 Curabitur dictum gravida mauris. Budget 2015-2016 mauris ut leo. Cras viverra metus rhoncus sem"
gsub(".*(Budget [0-9]{4}-[0-9]{4}).*", "\\1", s)
[1] "Budget 2015-2016"

two column beamer/sweave slide with grid graphic

I'm trying to make a presentation on ggplot2 graphics using beamer + Sweave. Some slides should have two columns: the left one for the code, the right one for the resulting graphic. Here's what I tried:
\documentclass[xcolor=dvipsnames]{beamer}
\usepackage{/Library/Frameworks/R.framework/Resources/share/texmf/tex/latex/Sweave}
\usepackage[english]{babel}
\usepackage{tikz}
\usepackage{amsmath,amssymb}% AMS standards
\usepackage{listings}
\usetheme{Madrid}
\usecolortheme{dove}
\usecolortheme{rose}
\SweaveOpts{pdf=TRUE, echo=FALSE, fig=FALSE, eps=FALSE, tidy=T, width=4, height=4}
\title{Reproducible data analysis with \texttt{ggplot2} \& \texttt{R}}
\subtitle{subtitle}
\author{Baptiste Augui\'e}
\date{\today}
\institute{Here}
\begin{document}
\begin{frame}[fragile]
\frametitle{Some text to show the space taken by the title}
\begin{columns}[t] \column{0.5\textwidth}
Lorem ipsum dolor sit amet, consectetuer adipiscing elit, sed diam nonummy nibh euismod tincidunt ut laoreet dolore magna aliquam erat volutpat. Ut wisi enim ad minim veniam, quis nostrud exerci tation ullamcorper suscipit lobortis nisl ut aliquip ex ea commodo consequat. Duis autem vel eum iriure dolor in hendrerit in vulputate velit esse molestie consequat, vel illum dolore eu feugiat nulla facilisis at vero eros et accumsan et iusto odio dignissim qui blandit praesent luptatum zzril delenit augue duis dolore te feugait nulla facilisi.
\column{0.5\textwidth}
\begin{figure}[!ht]
\centering
<<fig=TRUE>>=
library(grid)  # grid.rect() and gpar() come from the grid package
grid.rect(gp=gpar(fill="slateblue"))
@
\end{figure}
\end{columns}
\end{frame}
\begin{frame}[fragile]
\frametitle{Some text to show the space taken by the title}
\begin{columns}[t]
\column{0.5\textwidth}
<<echo=TRUE,fig=FALSE>>=
library(ggplot2)
p <-
qplot(mpg, wt, data=mtcars, colour=cyl) +
theme_grey(base_family="Helvetica")
@
\column{0.5\textwidth}
\begin{figure}[!ht]
\centering
<<fig=TRUE>>=
print(p)
@
\end{figure}
\end{columns}
\end{frame}
\end{document}
And the two pages of output.
I have two issues with this output:
the echoed Sweave code ignores the columns environment and spans both columns
the column margins for either graphic are unnecessarily wide
Any ideas?
Thanks.
As for the first question, the easy way is to set keep.source=TRUE in \SweaveOpts{}. For fancier control, see the fancyvrb package and FAQ #9 of the Sweave manual.
The width of the figure can be set with \setkeys{Gin}{width=1.0\textwidth}.
Here is a slight modification:
... snip ...
\SweaveOpts{pdf=TRUE, echo=FALSE, fig=FALSE, eps=FALSE, tidy=T, width=4, height=4, keep.source=TRUE}
\title{Reproducible data analysis with \texttt{ggplot2} \& \texttt{R}}
... snip ...
\begin{document}
\setkeys{Gin}{width=1.1\textwidth}
... snip...
<<echo=TRUE,fig=FALSE>>=
library(ggplot2)
p <-
qplot(mpg,
wt,
data=mtcars,
colour=cyl) +
theme_grey(base_family=
"Helvetica")
@
