Sometimes, when I save a column of double-precision numbers to a CSV using write_csv from readr (part of the tidyverse), the following happens: a double like 285121.15 is written as 285121.14999999997. The original value has only two decimals, and this is not an artifact of printing it on screen. Numerically they are pretty much the same thing, but it is annoying to share files with so many (unneeded) decimals. The documentation of write_csv says that the grisu3 algorithm is used.
At the same time, I would like to avoid rounding the values myself, since in general the number of decimals may vary.
According to what I found here
http://www.serpentine.com/blog/2011/06/29/here-be-dragons-advances-in-problems-you-didnt-even-know-you-had/
florian.loitsch.com/publications/dtoa-pldi2010.pdf?attredirects=0
it is a known shortcoming of grisu3.
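For reference, here is a minimal way to reproduce what I am seeing (on my readr version, which uses grisu3; the temporary file is just for illustration):
library(readr)
tmp <- tempfile(fileext = ".csv")
write_csv(data.frame(x = 285121.15), tmp)
readLines(tmp)
# [1] "x"                   "285121.14999999997"
unlink(tmp)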
Given that I am not dealing with huge data sets (hence writing to disk is not a big issue), I came up with the following:
############ to avoid troubles when saving numbers
library(dplyr)
library(readr)

# convert every numeric column to character, so the numbers are written
# exactly as R formats them
num_to_char <- function(df) {
  df %>% mutate_if(is.numeric, as.character)
}

to_csv <- function(df, ...) {
  df <- num_to_char(df)
  write_csv(df, ...)
}
i.e. I essentially convert the numbers to strings prior to saving the file.
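For example (a small sketch; the tibble and file name are made up):
library(tibble)
df <- tibble(x = c(285121.15, 0.1 + 0.2), y = c("a", "b"))
to_csv(df, "out.csv")
# the numbers are written exactly as R formats them, e.g. 285121.15 and 0.3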
I ran some tests, and it seems to me my problem has been solved, but are there any caveats I should be aware of?
Many thanks!
My suggestion is to use the following code:
# remove the unwanted extra precision (keep two decimal places)
A <- floor(A * 100)
# then convert back to the real number
A <- A / 100
A simple example in R ;)
A <- 9.12234353423242
A <- A * 100
A
# [1] 912.2344
A <- floor(A)
A
# [1] 912
A <- A / 100
A
# [1] 9.12
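The same idea applied to a whole column at once (df and x are hypothetical names):
df$x <- floor(df$x * 100) / 100  # keep two decimal places for the whole column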
Related
I'm working with some data that has hundreds of covariates, so I decided to write some functions to make pre-processing much faster and cleaner (like scaling certain numeric variables). An important part of all of these functions is type-checking the columns before I apply a particular function to them.
Here is my function for scaling continuous columns:
# rm (vector): names of columns not to be scaled
scale.continuous <- function(df, rm = NULL) {
  cols <- setdiff(colnames(df), rm)
  for (col in cols) {
    if (is.numeric(df[, col])) {
      df[, col] <- as.numeric(scale(df[, col]))
    }
  }
  df
}
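For context, a typical call looks like this (assuming a hypothetical identifier column "id" that should not be scaled):
df_scaled <- scale.continuous(df, rm = "id")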
This works perfectly fine if I load the data frame using read.csv(), but the data I have is huge so the speed boost of using read_csv() from readr/tidyverse is significant. Unfortunately, if I load my data using read_csv() all of my functions break.
I narrowed down the issue to the type-checking, specifically when type-checking a column I am accessing by a string of its column name. Here's some code to demonstrate what I mean:
# When using read.csv()
> is.numeric(df$col)
[1] TRUE
> is.numeric(df[,"col"])
[1] TRUE
# When using read_csv()
> is.numeric(df$col)
[1] TRUE
> is.numeric(df[,"col"])
[1] FALSE
I realized the issue here was that indexing the data frame with a string the way I do above returns a tibble, rather than the plain vector that other methods of indexing return. What I don't understand is why this behavior exists, why is.numeric() (or any type check) does not return TRUE for a one-column tibble, and in general why there is this difference in the way base and tidyverse data frames behave. Also, it would be nice to know if there is a parameter I can change in read_csv() that will make this type of indexing behave the same as with a default data frame.
I should mention, I realize there are probably better ways of writing this code (for example, just using df$"col" to index fixes the issue), but I still don't understand what the root of the issue was with my first approach. I am now working with much larger data sets that require much more involved pre-processing than what I have been used to in the past so I want to have as complete an understanding of the data structures I am using as possible.
Tibbles have a slightly different default behaviour from regular data frames when using the [ extraction operator, which can be a bit of a gotcha. Specifically, df[, "col"] on a tibble will return a one-column tibble, whereas on a regular data frame it will return a vector. So you need to use:
df[["col"]]
Or explicitly state that you want to coerce to the lowest dimension and do:
df[, "col", drop = TRUE]
From the documentation:
df[, j] returns a tibble; it does not automatically extract the column
inside. df[, j, drop = FALSE] is the default.
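A quick illustration of the difference, using a small made-up tibble:
library(tibble)
tb <- tibble(col = c(1.5, 2.5))
is.numeric(tb[, "col"])               # FALSE - still a one-column tibble
is.numeric(tb[["col"]])               # TRUE  - extracts the underlying vector
is.numeric(tb[, "col", drop = TRUE])  # TRUE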
I am trying to decrease the memory footprint of some of my datasets, where I have a small set of distinct values per column (repeated a large number of times). Are there better ways to minimize it? For comparison, this is what I get from just using factors:
library(pryr)
N <- 10 * 8
M <- 10
Initial data:
test <- data.frame(A = c(rep(strrep("A", M), N), rep(strrep("B", N), N)))
object_size(test)
# 1.95 kB
Using Factors:
test2 <- as.factor(test$A)
object_size(test2)
# 1.33 kB
Aside: I naively assumed that factors replace the strings with a number, and was pleasantly surprised to see test2 smaller than test3 (defined below). Can anyone point me to some material on how to optimize the factor representation?
test3 <- data.frame(A = c(rep("1", N), rep("2", N)))
object_size(test3)
# 1.82 kB
I'm afraid the difference is minimal.
The principle would be easy enough: instead of (in your example) 160 strings, you would just be storing 2, along with 160 integers (which are only 4 bytes each).
Except that R already stores character vectors internally in much the same way.
Every modern language supports strings of (virtually) unlimited length. That creates a problem: you can't store a vector (or array) of strings as one contiguous block, because any element can be reassigned to an arbitrary length. If a longer value is assigned to one element, the rest of the array would have to be shifted, or the OS/language would have to reserve a large amount of space for every string.
Therefore, strings are stored at whatever place in memory is convenient, and arrays (or vectors in R) are stored as blocks of pointers to the places where the values actually live.
In the early days of R, each pointer pointed to its own place in memory, even if the actual value was the same. So in your example, 160 pointers to 160 memory locations. But that has changed: nowadays it is implemented as 160 pointers to 2 memory locations.
There may be some small differences, mainly because a factor can normally support only 2^31 - 1 levels, meaning 32-bit integers are enough to store it, while a character vector mostly uses 64-bit pointers. Then again, there is more overhead in factors.
Generally, there may be some advantage in using a factor if you really have a large percentage of duplicates, but if that's not the case it may even harm your memory usage.
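A rough illustration of that point (exact sizes vary with R version and platform):
x_chr <- rep(c("alpha", "beta"), each = 80)  # many duplicates, only 2 unique strings
x_fct <- factor(x_chr)
object.size(x_chr)  # 160 pointers into the shared string pool
object.size(x_fct)  # 160 integers plus 2 levels plus factor overhead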
And the example you provided doesn't really work, as you're comparing a data.frame with a factor, instead of comparing the bare character vector with the factor.
Even stronger: when I reproduce your example, the result depends on stringsAsFactors; with the old default of TRUE, the data.frame column is itself a factor, so you're effectively comparing a factor to a factor in a data.frame.
Comparing bare vectors instead gives a much smaller difference: 1568 bytes for the character vector, 1328 for the factor.
And that advantage only holds if you have a lot of repeated values; if you look at this, you see that the factor can even be larger:
> object.size(factor(sample(letters)))
2224 bytes
> object.size(sample(letters))
1712 bytes
So generally, there is no real way to compress your data while still keeping it easy to work with, except for using common sense in what you actually want to store.
I don't have a direct answer for your question, but here is some information from the book "Advanced R" by Hadley Wickham:
Factors
One important use of attributes is to define factors. A factor
is a vector that can contain only predefined values, and is used to
store categorical data. Factors are built on top of integer vectors
using two attributes: the class, “factor”, which makes them behave
differently from regular integer vectors, and the levels, which
defines the set of allowed values.
Also:
"While factors look (and often behave) like character vectors, they
are actually integers. Be careful when treating them like strings.
Some string methods (like gsub() and grepl()) will coerce factors to
strings, while others (like nchar()) will throw an error, and still
others (like c()) will use the underlying integer values. For this
reason, it’s usually best to explicitly convert factors to character
vectors if you need string-like behaviour. In early versions of R,
there was a memory advantage to using factors instead of character
vectors, but this is no longer the case."
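For illustration, you can see the integer representation directly:
f <- factor(c("low", "high", "low"))
typeof(f)      # "integer"
attributes(f)  # $levels: "high" "low"   $class: "factor"
as.integer(f)  # the underlying codes: 2 1 2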
There is a package in R called fst (Lightning Fast Serialization of Data Frames for R), with which you can create compressed fst objects for your data frame. A detailed explanation can be found in the fst package manual, but I'll briefly explain how to use it and how much space an fst object takes. First, let's make your test data frame a bit larger, as follows:
library(pryr)
N <- 1000 * 8
M <- 100
test <- data.frame(A = c(rep(strrep("A", M), N), rep(strrep("B", N), N)))
object_size(test)
# 73.3 kB
Now, let's convert this dataframe into an fst object, as follows:
install.packages("fst") #install the package
library(fst) #load the package
path <- paste0(tempfile(), ".fst") #create a temporary '.fst' file
write_fst(test, path) #write the dataframe into the '.fst' file
test2 <- fst(path) #load the data as an fst object
object_size(test2)
# 2.14 kB
The disk space for the created .fst file is 434 bytes. You can deal with test2 like a normal data frame (as far as I have tried).
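One nice consequence is that you can read back just part of the data without loading the whole file, using read_fst's from/to arguments:
read_fst(path, from = 1, to = 5)  # read only the first 5 rows from disk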
Hope this helps.
I am working on a problem in which I have two data frames, data and abbreviations, and I would like to replace all the abbreviations present in data with their respective full forms. Until now I have been using for loops in the following manner:
abb <- c()
for (i in 1:length(data$text)) {
  for (j in 1:length(AbbreviationList$Abb)) {
    abb <- paste("(\\b", AbbreviationList$Abb[j], "\\b)", sep = "")
    data$text[i] <- gsub(abb, AbbreviationList$Fullform[j], tolower(data$text[i]))
  }
}
The abbreviation data frame looks something like this and can be generated using the following code:
Abbreviation <- c(c("hru", "how are you"),
                  c("asap", "as soon as possible"),
                  c("bf", "boyfriend"),
                  c("ur", "your"),
                  c("u", "you"),
                  c("afk", "away from keyboard"))
Abbreviation <- data.frame(matrix(Abbreviation, ncol = 2, byrow = TRUE), row.names = NULL)
names(Abbreviation) <- c("abb", "Fullform")
And data is merely a data frame with one column containing a text string in each row, which can also be generated using the following code:
data <- data.frame(unlist(c("its good to see you, hru doing?",
                            "I am near bridge come ASAP",
                            "Can u tell me the method u used for",
                            "afk so couldn't respond to ur mails",
                            "asmof I dont know who is your bf?")))
names(data) <- "text"
Initially, I had a data frame with around 1000 observations and around 100 abbreviations, so I was able to run the analysis. But now the data has grown to almost 50000 rows and I am having difficulty processing it, since the two nested for loops make the process very slow. Can you suggest some better alternatives to the for loops and explain with an example how to use them in this situation? If this problem can be solved faster via vectorization, please suggest how to do that as well.
Thanks for the help!
This should be faster, and without side effects.
# abriv is the abbreviation table (called Abbreviation in the question)
mapply(function(x, y) {
  abb <- paste0("(\\b", x, "\\b)")
  gsub(abb, y, tolower(data$text))
}, abriv$Abb, abriv$Fullform)
gsub is vectorized over the text argument, so you can give it the whole character vector in which matches are sought; here I give it data$text.
I use mapply to avoid the side effects of a for loop.
First of all, there is clearly no need to compile the regular expressions with each iteration of the loop. Also, there is no need to actually loop over data$text: in R, you can very often use a vector where a single value would do, and R will go through all the elements of the vector and return a vector of the same length.
Abbreviation$regex <- sprintf("(\\b%s\\b)", Abbreviation$abb)

for (j in 1:length(Abbreviation$abb)) {
  data$text <- gsub(Abbreviation$regex[j],
                    Abbreviation$Fullform[j], data$text,
                    ignore.case = TRUE)
}
The above code works with the example data.
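For instance, with the sample data generated above, the first two rows become:
data$text[1]
# [1] "its good to see you, how are you doing?"
data$text[2]
# [1] "I am near bridge come as soon as possible"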
I have a corpus of 26 plain text files, each between 12 and 148 kB, 1.2 MB in total. I'm using R on a Windows 7 laptop.
I did all the normal cleanup stuff (stopwords, custom stopwords, lower case, numbers) and want to do stem completion. I am using the original corpus as a dictionary as shown in the examples. I tried a couple of simple vectors to make sure it would work at all (with about 5 terms) and it did and very quickly.
exchanger <- function(x) stemCompletion(x, budget.orig)
budget <- tm_map(budget, exchanger)
It's been running since yesterday at 4 pm! In RStudio, under diagnostics, the request log shows new requests with different request numbers. Task manager shows it using some memory, but not a crazy amount. I don't want to stop it, because what if it's almost there? Any other ideas of how to check progress (it's a volatile corpus, unfortunately)? Any idea how long it should take? I thought about using the dtm's names vector as a dictionary, cut off at the most frequent terms (or by high tf-idf), but I'm reluctant to kill this process.
This is a regular windows 7 laptop with lots of other things running.
Is this corpus too big for stemCompletion? Short of moving to Python, is there a better way to do stemCompletion, or to lemmatize instead of stemming? My web searching has not yielded any answers.
I can't give you a definite answer without data that reproduces your problem, but I would guess the bottleneck comes from the following line of the stemCompletion source code:
possibleCompletions <- lapply(x, function(w) grep(sprintf("^%s", w), dictionary, value = TRUE))
After which, given you've kept the completion heuristic on the default of "prevalent", this happens:
possibleCompletions <- lapply(possibleCompletions, function(x) sort(table(x), decreasing = TRUE))
structure(names(sapply(possibleCompletions, "[", 1)), names = x)
That first line loops through each word in your corpus and checks it against your dictionary for possible completions. I'm guessing you have many words that appear many times in your corpus. That means the function is being called many times only to give the same response. A possibly faster version (depending on how many words were repeats and how often they were repeated) would look something like this:
y <- unique(x)
possibleCompletions <- lapply(y, function(w) grep(sprintf("^%s", w), dictionary, value = TRUE))
possibleCompletions <- lapply(possibleCompletions, function(x) sort(table(x), decreasing = TRUE))
z <- structure(names(sapply(possibleCompletions, "[", 1)), names = y)
z[match(x, names(z))]
So it only loops through the unique values of x rather than every value of x. To create this revised version of the code, you would need to download the source from CRAN and modify the function (I found it in the completion.R in the R folder).
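If you would rather not modify the package source, a thin wrapper that applies the same unique-values idea from the outside might be enough (just a sketch; stemCompletion_unique is a made-up name, and x is assumed to be a character vector of stemmed tokens):
library(tm)
stemCompletion_unique <- function(x, dictionary, ...) {
  y <- unique(x)
  z <- stemCompletion(y, dictionary, ...)  # complete each distinct token only once
  out <- unname(z[match(x, y)])            # map the results back onto the full vector
  names(out) <- x                          # same naming convention as stemCompletion()
  out
}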
Or you may just want to use Python for this one.
Cristina, following Schaun, I recommend you apply stemCompletion only to the unique words. It is much easier for your PC to do the completion on the unique words than to do it on your whole corpus (with all its repetitions).
First of all, take the unique words from your stemmed corpus. For example:
uniques <- data.frame(text = unique(budget))
Then, you can get the unique words from your original text:
unique_budget.orig <- unique(budget.orig)
Now, you can apply stemCompletion to your unique words:
uniques$completion <- stemCompletion(uniques$text, dictionary = unique_budget.orig)
Now you have an object with all the distinct words from your corpus and their completions. You just have to apply a join between your corpus and the uniques object. Make sure both objects use the same variable name for the word without the completion: that will be the key.
This reduces the number of operations your PC has to do.
I'm using the intToBin() function from the "R.utils" package and am having trouble using it to convert large decimal numbers to binary.
I get this error : NAs introduced by coercion.
Is there another function out there that can handle big numbers/ is there an algorithm/ code to implement such a function?
Thanks
If you read the help page for intToBin, it quite explicitly says it takes "integer" inputs. These are not mathematical "integers" but rather the computer-language-defined ints, which in R are limited to 32 bits (values up to 2^31 - 1).
You'll need to find (or write :-( ) a function which converts floating-point numbers to binary, or, if you're lucky, the Rmpfr or gmp packages, which do arbitrary-precision "big number" math, may have a float-to-binary tool.
By the time this gets posted, someone will have exposed my ignorance by posting an existing function, w/ my luck.
Edit -- like maybe the package pack
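If all you need are the binary digits of non-negative whole numbers (up to 2^53, where doubles still represent integers exactly), a plain-R sketch along these lines may be enough; bigToBin is just a made-up name and it is not extensively tested:
bigToBin <- function(x) {
  if (x == 0) return("0")
  bits <- c()
  while (x > 0) {
    bits <- c(x %% 2, bits)  # each new bit is more significant, so prepend it
    x <- floor(x / 2)
  }
  paste(bits, collapse = "")
}
bigToBin(2^31)  # "1" followed by 31 zeros; intToBin() hits the "NAs introduced by coercion" problem here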
I needed a converter between doubles and hex numbers, so I wrote these; they might be helpful to others.
# Convert a non-negative whole-valued double to a hexadecimal string
doubleToHex <- function(x) {
  if (x < 16)
    return(sprintf("%X", x))
  remainders <- c()
  # peel off hex digits from the least significant end
  while (x > 15) {
    remainders <- append(remainders, x %% 16)
    x <- floor(x / 16)
  }
  remainders <- paste(sprintf("%X", rev(remainders)), collapse = "")
  # the leading digit also needs to be printed in hex (e.g. 10 -> "A")
  return(paste(sprintf("%X", x), remainders, sep = ""))
}

# Convert a hexadecimal string back to a double
hexToDouble <- function(x) {
  x <- strsplit(x, "")[[1]]
  output <- as.double(0)
  for (i in rev(seq_along(x))) {
    output <- output + as.numeric(as.hexmode(x[i])) * 16^(length(x) - i)
  }
  return(output)
}
doubleToHex(x = 8356723)
# [1] "7F8373"
hexToDouble(x = "7F8373")
# [1] 8356723
Hasn't been extensively tested yet, let me know if you detect a problem with it.