Is R able to compute contingency tables on a big file without putting the whole file in RAM? - r

Let me explain the question:
I know that the functions table and xtabs compute contingency tables, but they expect a data.frame, which is always stored in RAM. That is really painful when trying to do this on a big file (say 20 GB, the largest I have to tackle).
On the other hand, SAS is perfectly able to do this, because it reads the file line by line and updates the result as it goes. Hence only one line is ever in RAM, which is much more acceptable.
I have done the same as SAS with ad-hoc Python programs on occasion, when I had to do more complicated things that I either didn't know how to do in SAS or found too cumbersome there. Python's syntax and built-in features (dictionaries, regular expressions...) compensate for its weaknesses (mainly speed, but when reading 20 GB, speed is limited by the hard drive anyway).
My question, then: I would like to know if there are packages to do this in R. I know it's possible to read a file line by line, like I do in Python, but computing simple statistics (contingency tables for instance) on a big file is such a basic task that I feel there should be some more or less "integrated" feature to do it in a statistical package.
Please tell me if this question should be asked on "Cross Validated". I had a doubt, since it's more about software than statistics.

You can use the ff package for this. It uses the hard disk drive instead of RAM, but is implemented in a way that doesn't make it (significantly) slower than the way R normally uses RAM.
This is from the package description:
The ff package provides data structures that are stored on disk but behave (almost) as if they were in RAM by transparently mapping only a section (pagesize) in main memory.
I think this will solve your problem of loading a 20GB file in RAM. I have used it myself for such purposes and it worked great.
Here is a small example as well, adapted from the example in the xtabs documentation:
Base R
#example from ?xtabs
d.ergo <- data.frame(Type = paste0("T", rep(1:4, 9*4)),
                     Subj = gl(9, 4, 36*4))
> print(xtabs(~ Type + Subj, data = d.ergo)) # 4 replicates each
Subj
Type 1 2 3 4 5 6 7 8 9
T1 4 4 4 4 4 4 4 4 4
T2 4 4 4 4 4 4 4 4 4
T3 4 4 4 4 4 4 4 4 4
T4 4 4 4 4 4 4 4 4 4
ff package
#convert to ff (the ff package must be installed and loaded)
library(ff)
d.ergoff <- as.ffdf(d.ergo)
> print(xtabs(~ Type + Subj, data = d.ergoff)) # 4 replicates each
Subj
Type 1 2 3 4 5 6 7 8 9
T1 4 4 4 4 4 4 4 4 4
T2 4 4 4 4 4 4 4 4 4
T3 4 4 4 4 4 4 4 4 4
T4 4 4 4 4 4 4 4 4 4
You can check here for more information on memory manipulation.
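If you would rather stay in base R, the line-by-line strategy described in the question can also be reproduced by reading the file in chunks and accumulating counts as you go. Below is a minimal sketch, not a tested implementation: it assumes a headerless CSV file whose first two columns hold the factors to cross-tabulate (the file name big.csv is purely illustrative).
count_chunked <- function(path, chunk_size = 1e5) {
  con <- file(path, open = "r")
  on.exit(close(con))
  counts <- NULL
  repeat {
    lines <- readLines(con, n = chunk_size)
    if (length(lines) == 0L) break
    parts <- strsplit(lines, ",", fixed = TRUE)
    # cross-tabulate the first two fields of this chunk
    tab <- table(vapply(parts, `[[`, "", 1L),
                 vapply(parts, `[[`, "", 2L))
    if (is.null(counts)) {
      counts <- tab
    } else {
      # merge this chunk's counts into the running total by dimnames
      rows <- union(rownames(counts), rownames(tab))
      cols <- union(colnames(counts), colnames(tab))
      total <- matrix(0, length(rows), length(cols), dimnames = list(rows, cols))
      total[rownames(counts), colnames(counts)] <- counts
      total[rownames(tab), colnames(tab)] <- total[rownames(tab), colnames(tab)] + tab
      counts <- as.table(total)
    }
  }
  counts
}
counts <- count_chunked("big.csv")  # hypothetical 20 GB file
Only chunk_size lines are in RAM at any time, mirroring the SAS behaviour described in the question.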

Related

How do I change the order of multiple grouped values in a row dependent on another variable in that row in R?

I need some help conditionally sorting/switching data based on a factor variable.
I'm not sure whether this is a typical use case that I just can't formulate well enough for a search engine to show me a solution, or whether it is genuinely niche, but I haven't found anything yet.
I currently have a dataframe like this:
id group a1 a2 a3 a4 b1 b2 b3 b4
1 1 2 6 6 3 4 4 6 4
2 2 5 2 2 2 2 5 2 3
3 1 6 3 3 1 3 6 4 1
4 1 4 8 4 2 7 8 8 9
5 2 3 1 1 4 2 1 1 7
For context this is from a psychological experiment where people went through two variations of a task and the order of those conditions was determined by the experimental group they were assigned to. The columns represent different measurements from different trials and are currently grouped together for the same variable and in chronological order, meaning a1,a2,a3,a4 are essentially the same variable at consecutive time points, same with b1,b2,b3,b4.
I want to split them up by condition, so that regardless of which group (= which order of tasks) someone went through, data from one condition comes first in the dataframe, with columns still grouped together by variable and in chronological order within that condition. It should essentially look like this:
id group c1a1 c1a2 c2a1 c2a2 c1b1 c1b2 c2b1 c2b2
1 1 2 6 6 3 4 4 6 4
2 2 2 2 5 2 2 3 2 5
3 1 6 3 3 1 3 6 4 1
4 1 4 8 4 2 7 8 8 9
5 2 1 4 3 1 1 7 2 1
So essentially, for group 1 everything stays the same, since they happened to go through the conditions in the order I want in the new dataframe, while for group 2 the values are switched: the originally second half of values for each variable is put in front of the originally first half.
I hope I have formulated the problem in a way people can understand.
My real dataset is a bit more complicated: it has 180 columns, so 178 besides id and group. There are 13 variables, some of which were measured over two conditions with 5 trials each, and some which have those 5 trials for each of the 2 main conditions but also 2 additional measurements per condition, where the order was again determined by the same group variable.
(We essentially asked participants to do the task again in two particular ways, which allowed us to see whether they were capable of doing it like that if they wanted to, under the circumstances of both main conditions.)
So there are 4 additional columns for some variables which need to be treated separately. It should look like this when transformed (x and y are the 2 extra tasks where only b was measured once):
id group c1a1 c1a2 c2a1 c2a2 c1b1 c1b2 c1bx c1by c2b1 c2b2 c2bx c2by
1 1 2 6 6 3 4 4 3 7 6 4 4 2
2 2 2 2 5 2 2 3 4 3 2 5 2 2
3 1 6 3 3 1 3 6 2 2 4 1 1 1
4 1 4 8 4 2 7 8 1 1 8 9 5 8
5 2 1 4 3 1 1 7 8 9 2 1 3 4
What I want to say with this is: I need a pretty general solution.
I already tried writing a function that creates two separate datasets for the groups and then merges them by id, but I got stuck on the automatic creation and naming of columns, which I can't seem to wrap my head around. dplyr is currently loaded and used for some other transformations, but since I'm not really good with it, I'd appreciate help with a solution with or without it. I'm still pretty new to R and this is for my bachelor thesis.
Thanks in advance!
Your question leaves a few things unclear that make it hard to answer, but here is a start that may help, or at least help clarify your problem.
It would really help if you could clarify two pieces of information: what types of column rearrangements you need, and what indicates that a row needs this transformation.
I'm also wondering whether, instead of manipulating your data in its current shape, it might be more practical to reshape it so that it represents your data better, perhaps using something like pivot_longer() (see the sketch at the end of this answer). I don't know how this data will ultimately be used or what the actual values indicate, but it doesn't seem very tidy in its current form, and a "longer" table might be more meaningful. Still, I'll provide what I think is a solution to your stated problem.
This creates some example data that appears to reflect the example table in your question.
ID <- 1:10
group <- sample(1:2, 10, replace = TRUE)
Data <- matrix(sample(1:10, 80, replace = TRUE), nrow = 10, ncol = 8)
DataFrame <- data.frame(ID = ID, Group = group, Data)
You then define the groups of columns that need to be kept together. I can't tell whether there is an automated way for you to indicate which columns belong together, and this might get bulky if done manually. More information on what your column names actually are, and how they are distributed into groups, would help.
ColumnGroups <- list(One = c('X1','X2'), Two = c('X3','X4'),
                     Three = c('X5','X6'), Four = c('X7','X8'))
You can then figure out which rows need to be rearranged using some conditional. Based on your example, I'm assuming the rearranging needs to happen when the group variable equals 2, which is what I've used here.
FlipRows <- DataFrame$Group == 2
You can then have R apply the rearrangement only to the rows that need it, and define the rearrangement by the ordering of the column groups. I know you asked for a general solution, but it is hard to identify the general solution you need without knowing what types of column rearrangements are required. If it is always flipping two sets of consecutive column groups, that would be easier to define without typing it all out. What I have done here requires you to manually type out the order in which the column groups should be rearranged. The SortedDataFrame object seems to be what you are looking for, but might not reflect your real data. I excluded columns 1 and 2 in this operation since those are ID and Group, which you don't want overwritten.
SortedDataFrame <- DataFrame
SortedDataFrame[FlipRows, -c(1, 2)] <- DataFrame[FlipRows, c(ColumnGroups$Two, ColumnGroups$One,
                                                             ColumnGroups$Four, ColumnGroups$Three)]
This solution won't work if you need to rearrange each row differently, but it is unclear whether that is the case. Try to provide the other information requested here, and let me know where this solution doesn't work for you.
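To illustrate the reshaping idea mentioned above, here is a hedged sketch using tidyr's pivot_longer(). It assumes the a1...b4 naming from your example and that, for group 2, the second half of each variable's trials belongs to condition 1; the data frame df below is purely illustrative.
library(dplyr)
library(tidyr)
df <- data.frame(id = 1:4, group = c(1, 2, 1, 2),
                 a1 = 1:4, a2 = 5:8, a3 = 9:12, a4 = 13:16,
                 b1 = 1:4, b2 = 5:8, b3 = 9:12, b4 = 13:16)
long <- df %>%
  pivot_longer(-c(id, group),
               names_to = c("variable", "trial"),
               names_pattern = "([ab])([0-9]+)") %>%
  mutate(trial = as.integer(trial),
         # trials 1-2 came from the first task, 3-4 from the second
         task = if_else(trial <= 2, 1L, 2L),
         # which task is condition 1 depends on the group
         condition = if_else(group == 1, task, 3L - task))
In this long form the condition is an explicit column, so no column switching is needed; if you do need the wide layout, you can rebuild it from long with pivot_wider().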

Manually entering contingency table in R

Hi, I have the following code in R:
> trial <- matrix(c(6,9,6,5,2,2,5,8,5,3,2,4,6,8,8), ncol=5,byrow=T)
> colnames(trial) <- c('Never','Rarely','Sometimes', 'Often' ,'Always')
> rownames(trial) <- c('Walk', 'Bus','Drive')
> trial
      Never Rarely Sometimes Often Always
Walk      6      9         6     5      2
Bus       2      5         8     5      3
Drive     2      4         6     8      8
However what I want is;
       Breakfast
Travel  Never Rarely Sometimes Often Always
  Walk      6      9         6     5      2
  Bus       2      5         8     5      3
  Drive     2      4         6     8      8
i.e. a contingency table where Breakfast has the options Never, Rarely, etc.,
and Travel has the options Walk, Bus, etc.
Thank you
Answer courtesy of Salvatore S. Mangiafico, Ph.D.:
trial <- matrix(c(6,9,6,5,2,2,5,8,5,3,2,4,6,8,8), ncol=5, byrow=TRUE)
colnames(trial) <- c('Never','Rarely','Sometimes','Often','Always')
rownames(trial) <- c('Walk','Bus','Drive')
# convert the matrix to a table so the dimensions can be named
Trial = as.table(trial)
dimnames(Trial)
# label the dimensions: rows are Travel, columns are Breakfast
names(dimnames(Trial)) = c("Travel", "Breakfast")
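Printing the result should then show both dimension names in the desired layout (output sketched below):
Trial
       Breakfast
Travel  Never Rarely Sometimes Often Always
  Walk      6      9         6     5      2
  Bus       2      5         8     5      3
  Drive     2      4         6     8      8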

Very confusing R feature - completion of list item names

I found a very surprising and unpleasant feature of R - it completes list item names!!! See the following code:
a <- list(cov_spring = "spring")
a$cov <- c()
a$cov
# spring ## I expect it to be empty!!! I've set it empty!
a$co
# spring
a$c
I don't know what to do with that.... I need to be able to set $cov to NULL and have $cov_spring there at the same time!!! And use $cov separately!! This is annoying!
My question:
What is going on here? How is this possible, what is the logic behind?
Is there some easy fix to turn this completion off? I need to use the list items cov_spring and cov independently, as if they were normal variables. No damn completion please!!!
From help("$"):
'x$name' is equivalent to 'x[["name", exact = FALSE]]'
When you scroll back and read up on exact=:
exact: Controls possible partial matching of '[[' when extracting by
       a character vector (for most objects, but see under
       'Environments'). The default is no partial matching. Value
       'NA' allows partial matching but issues a warning when it
       occurs. Value 'FALSE' allows partial matching without any
       warning.
So this provides you partial matching capability in both $ and [[ indexing:
mtcars$cy
# [1] 6 6 4 6 8 6 8 4 4 6 6 8 8 8 8 8 8 4 4 4 4 8 8 8 8 4 4 4 8 6 8 4
mtcars[["cy"]]
# NULL
mtcars[["cy", exact=FALSE]]
# [1] 6 6 4 6 8 6 8 4 4 6 6 8 8 8 8 8 8 4 4 4 4 8 8 8 8 4 4 4 8 6 8 4
There is no way I can see to disable the exact=FALSE default for $ (unless you want to mess with formals, which I do not recommend, for the sake of reproducibility and consistent behavior).
Programmatic use of frames and lists (for defensive purposes) should prefer [[ over $ for precisely this reason. (It's rare, but I have been bitten by this permissive behavior.)
Edit:
For clarity on that last point:
mtcars$cyl becomes mtcars[["cyl"]]
mtcars$cyl[1:3] becomes mtcars[["cyl"]][1:3]
mtcars[,"cy"] is not a problem, nor is mtcars[1:3,"cy"]
You can use [ or [[ instead.
a["cov"] will return a list with a NULL element.
a[["cov"]] will return the NULL element directly.

R - set bucket from a mapper data frame

Probably a similar situation has already been solved, but I could not find it.
I have a mapper data frame like the following
mapper
bucket_label bucket_no
1 (-Inf; 9.99) 1
2 (25.01; 29.99) 1
3 (29.99; 30.01) 1
4 (30.01; Inf) 1
5 (19.99; 20.01) 2
6 (20.01; 24.99) 2
7 (24.99; 25.01) 2
8 (9.99; 10.11) 3
9 (10.11; 14.99) 3
10 (14.99; 15.01) 3
11 (15.01; 19.99) 3
and a vector x with random data
x <- rnorm(100)*100
I need to set the corresponding bucket for each entry of x in a quick way, and findInterval and cut do not seem to help with this directly.
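One possible approach (a sketch, assuming the intervals in mapper are non-overlapping and jointly cover the real line): recover each interval's lower bound from its label, sort by it, and then findInterval can do the lookup after all. The regular expression below is illustrative and assumes labels formatted exactly like those shown.
# parse the lower bound out of labels like "(9.99; 10.11)" or "(-Inf; 9.99)"
lower <- as.numeric(sub("^\\((-?Inf|-?[0-9.]+);.*$", "\\1", mapper$bucket_label))
ord <- order(lower)                   # intervals sorted by lower bound
idx <- findInterval(x, lower[ord])    # sorted interval containing each x
bucket <- mapper$bucket_no[ord][idx]  # map back to the bucket number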

How to test graphical output of functions?

I am wondering how to test functions that produce graphics. I have a simple
plotting function img:
img <- function() {
  plot(1:10)
}
In my package I'd like to create a unit test for this function using testthat.
Because plot and its friends in base graphics just return NULL, a simple
expect_identical does not work:
library("testthat")
## example for a successful test
expect_identical(plot(1:10), img()) ## equal (as expected)
## example for a test failure
expect_identical(plot(1:10, col="red"), img()) ## DOES NOT FAIL!
# (because both return NULL)
First I thought about plotting into a file and comparing the md5 checksums to
ensure that the output of the functions is equal:
md5plot <- function(expr) {
  file <- tempfile(fileext=".pdf")
  on.exit(unlink(file))
  pdf(file)
  expr
  dev.off()
  unname(tools::md5sum(file))
}
## example for a successful test
expect_identical(md5plot(img()),
md5plot(plot(1:10))) ## equal (as expected)
## example for a test failure
expect_identical(md5plot(img()),
md5plot(plot(1:10, col="red"))) ## not equal (as expected)
That works well on Linux but not on Windows. Surprisingly,
md5plot(plot(1:10)) results in a new md5sum at each call.
Aside from this problem, I need to create a lot of temporary files.
Next I used recordPlot (first creating a null device, calling the plotting
function, and recording its output). This works as expected:
recPlot <- function(expr) {
  pdf(NULL)
  on.exit(dev.off())
  dev.control(displaylist="enable")
  expr
  recordPlot()
}
## example for a successful test
expect_identical(recPlot(plot(1:10)),
recPlot(img())) ## equal (as expected)
## example for a test failure
expect_identical(recPlot(plot(1:10, col="red")),
recPlot(img())) ## not equal (as expected)
Does anybody know a better way to test the graphical output of functions?
EDIT: regarding the points @josilber asked about in his comments.
While the recordPlot approach works well, you have to rewrite the whole plotting function in the unit test. That becomes complicated for complex plotting functions. It would be nice to have an approach that allows storing a file (*.RData, *.pdf, ...) containing an image against which you could compare in future tests. The md5sum approach doesn't work because the md5sums differ across platforms. Via recordPlot you could create an *.RData file, but you could not rely on its format (from the recordPlot manual page):
The format of recorded plots may change between R versions.
Recorded plots can not be used as a permanent storage format for
R plots.
Maybe it would be possible to store an image file (*.png, *.bmp, etc.), import it, and compare it pixel by pixel...
EDIT2: The following code illustrate the desired reference file approach using svg as output. First the needed helper functions:
## plot to svg and return the file content as a character vector
plot_image <- function(expr) {
  file <- tempfile(fileext=".svg")
  on.exit(unlink(file))
  svg(file)
  expr
  dev.off()
  readLines(file)
}
## the IDs differ at each `svg` call, which is why we simply remove them
ignore_svg_id <- function(lines) {
  gsub(pattern = "(xlink:href|id)=\"#?([a-z0-9]+)-?(?<![0-9])[0-9]+\"",
       replacement = "\\1=\"\\2\"", x = lines, perl = TRUE)
}
## compare svg character vector vs reference file
expect_image_equal <- function(object, expected, ...) {
  stopifnot(is.character(expected) && file.exists(expected))
  expect_equal(ignore_svg_id(plot_image(object)),
               ignore_svg_id(readLines(expected)), ...)
}
## create reference image
create_reference_image <- function(expr, file) {
  svg(file)
  expr
  dev.off()
}
A test would be:
create_reference_image(img(), "reference.svg")
## create tests
library("testthat")
expect_image_equal(img(), "reference.svg") ## equal (as expected)
expect_image_equal(plot(1:10, col="red"), "reference.svg") ## not equal (as expected)
Sadly this does not work across different platforms. The order (and the names)
of the svg elements differ completely between Linux and Windows.
Similar problems exist for png, jpeg and recordPlot: the resulting files
differ across platforms.
Currently the only working solution is the recPlot approach above, but for that
I need to rewrite the whole plotting functions in my unit tests.
P.S.:
I am completely confused about the different md5sums on Windows. It seems they depend on the creation time of the temporary files:
# on Windows
table(sapply(1:100, function(x)md5plot(plot(1:10))))
#4693c8bcf6b6cb78ce1fc7ca41831353 51e8845fead596c86a3f0ca36495eacb
# 40 60
Mango Solutions have published an open-source package, visualTest, that does fuzzy matching of plots to address this use case.
The package is on github, so install using:
devtools::install_github("MangoTheCat/visualTest")
library(visualTest)
Then use the function getFingerprint() to extract a fingerprint for each plot, and compare using the function isSimilar(), specifying a suitable threshold.
First, create some plots on file:
png(filename = "test1.png")
img()
dev.off()
png(filename = "test2.png")
plot(1:11, col="red")
dev.off()
The fingerprint is a numeric vector:
> getFingerprint(file = "test1.png")
[1] 4 7 4 4 10 4 7 7 4 7 7 4 7 4 5 9 4 7 7 5 6 7 4 7 4 4 10
[28] 4 7 7 4 7 7 4 7 4 3 7 4 4 3 4 4 5 5 4 7 4 7 4 7 7 7 4
[55] 7 7 4 7 4 7 5 6 7 7 4 8 6 4 7 4 7 4 7 7 7 4 4 10 4 7 4
> getFingerprint(file = "test2.png")
[1] 7 7 4 4 17 4 7 4 7 4 7 7 4 5 9 4 7 7 5 6 7 4 7 7 11 4 7
[28] 7 5 6 7 4 7 4 14 4 3 4 7 11 7 4 7 5 6 7 7 4 7 11 7 4 7 5
[55] 6 7 7 4 8 6 4 7 7 4 4 7 7 4 10 11 4 7 7
Compare using isSimilar():
> isSimilar(file = "test2.png",
+ fingerprint = getFingerprint(file = "test1.png"),
+ threshold = 0.1
+ )
[1] FALSE
You can read more about the package at http://www.mango-solutions.com/wp/products-services/r-services/r-packages/visualtest/
It's worth noting that the vdiffr package also supports comparing plots. A nice feature is that it integrates with the testthat package -- it's actually used for testing in ggplot2 -- and it has an add-in for RStudio to help manage your testsuite.
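For illustration, here is a minimal sketch of what a vdiffr-based test might look like (assuming it lives in a testthat test file; expect_doppelganger() records an SVG snapshot the first time it runs and compares later output against it):
library(testthat)
test_that("img() produces the expected plot", {
  # the first run records a reference SVG; later runs fail if the
  # new output no longer matches the stored snapshot
  vdiffr::expect_doppelganger("img-basic", img())
})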
