Memory management in R ComplexUpset Package

I'm trying to plot a stacked barplot inside an upset plot using the ComplexUpset package. The plot I'd like to get looks something like the stacked-bar example in the docs (where mpaa would be component in my case).
I have a dataframe of size 57244 by 21, where one column is ID, another is the type of recording, and the other 19 columns are components 1 through 19:
ID  component1  component2  ...  component19  type
1   1           0           ...  1            a
2   0           0           ...  1            b
3   1           1           ...  0            b
Ones and zeros indicate affiliation with a certain component. As shown in the example in the docs, I first convert these ones and zeros to logical, and then try to plot the basic upset plot. Here's the code:
df <- df %>% mutate(across(where(is.numeric), as.logical))
components <- colnames(df)[2:20]
upset(df, components, name='protein', width_ratio = 0.1)
But unfortunately, after processing the last line for a while, it fails with an error message like this:
Error: cannot allocate vector of size 176.2 Mb
Though I know I'm on a machine with 32 GB of RAM, I'm sure I couldn't have flooded the memory so much that ~176 MB can't be allocated, so my guess is that I'm somehow managing memory in R incorrectly. Could you please explain what's faulty in my code, if possible?
I also know that the UpSetR package plots the same data, but as far as I know it provides no way to do stacked barplots.

Somehow, it works if you:
Tweak the min_size parameter so that the plot is not overloaded and makes a better impression.
Pass ComplexUpset a sample of the data as its first argument; this helps even if your sample is the whole dataset (see the sketch below).
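For instance, a minimal sketch of both tweaks (assuming a data frame df as described in the question; the min_size value is only an illustrative guess and should be tuned to your data):
library(dplyr)
library(ComplexUpset)
# Convert the 0/1 component columns to logical, leaving ID and type alone
df <- df %>% mutate(across(starts_with("component"), as.logical))
components <- colnames(df)[2:20]
upset(
  slice_sample(df, n = nrow(df)),  # pass a "sample"; here it is the whole dataset
  components,
  name = 'protein',
  width_ratio = 0.1,
  min_size = 100                   # hide the smallest intersections
)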

Related

R - Making a ggplot while using survey package

I am stuck with a real problem.
My dataset comes from a survey, and to make it usable for computing statistics about the whole French population I must apply the survey weights.
For this purpose I used the survey package, but its syntax is not really easy to work with.
Is there a way to use ggplot while keeping the weights?
To explain it a bit better, here is my dataset:
head(df)
Id  Weight  Var1
1   30      0
2   12.4    0
3   68.2    1
So my individual 1 accounts for 30 people in the French population.
I create a df_weighted dataset using the survey package.
How can I use ggplot now? df_weighted is a list!
I did something like this to try to escape the list problem, but it did not work at all...
df_weighted_ggplot$var1 <- svytable(~var1, df_weighted)
df_weighted_ggplot$var_fill <- svytable(~var_fill, df_weighted)
ggplot(df_weighted_ggplot, aes(fill = var_fill , x =var1)) + geom_bar(position = "fill")
I received this predictable error:
Error: `data` must be a data frame, or other object coercible by `fortify()`, not a list
Do you know of any other package that could help me? From the many forums I have read, survey seems to be the most helpful one...
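One possible way around that error (just a sketch, not from the original thread; Id, Weight and Var1 are the columns shown in head(df), and var_fill is the hypothetical fill variable from the snippet above) is to coerce the svytable() result to a data frame before handing it to ggplot2:
library(survey)
library(ggplot2)
# Build the weighted design from the raw data (no clustering assumed)
design <- svydesign(ids = ~1, weights = ~Weight, data = df)
# svytable() returns a table, not a data frame, which is exactly what triggers
# the fortify() error; as.data.frame() gives columns Var1, var_fill and Freq
tab <- as.data.frame(svytable(~Var1 + var_fill, design))
ggplot(tab, aes(x = Var1, y = Freq, fill = var_fill)) +
  geom_col(position = "fill")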

To sort a specific column in a DataFrame in SparkR

In SparkR I have a DataFrame data. It contains time, game and id.
head(data)
then gives ID = 1, 4, 1, 1, 215, 985, ...; game = 1, 5, 1, 10, ...; and time = 2012-2-1, 2013-9-9, ...
Now game contains a gametype, which is a number from 1 to 10.
For a given gametype I want to find the minimum time, meaning the first time this game has been played. For gametype 1 I do this
data1 <- filter(data, data$game == 1)
This new data contains all data for gametype 1. To find the minimum time I do this
g <- groupBy(data1, game$time)
first(arrange(g, desc(g$time)))
but this can't run in SparkR. It says "object of type 'S4' is not subsettable".
Game 1 has been played 2012-01-02, 2013-05-04, 2011-01-04,... I would like to find the minimum-time.
If all you want is a minimum time, sorting the whole data set doesn't make sense. You can simply use min:
agg(df, min(df$time))
or for each type of game:
groupBy(df, df$game) %>% agg(min(df$time))
By typing
arrange(game, game$time)
I get all of the times sorted. Using the first function I get the first entry. If I want the last entry, I simply type this:
first(arrange(game, desc(game$time)))
Just to clarify, because this is something I keep running into: the error you were getting is probably because you also imported dplyr into your environment. If you had used SparkR::first(SparkR::arrange(g, SparkR::desc(g$time))) things would probably have been fine (although obviously the query could have been more efficient).
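For example, a minimal sketch (assuming a SparkDataFrame data with columns game and time; qualifying every call with SparkR:: sidesteps the masking by dplyr described above):
library(SparkR)
# Earliest time per game type, without sorting the whole dataset
grouped      <- SparkR::groupBy(data, data$game)
first_played <- SparkR::agg(grouped, min(data$time))
head(SparkR::arrange(first_played, "game"))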

Refer to (part of) a data frame using a string in R

I have a large data set in which I have to search for specific codes depending on what I want. For example, chemotherapy is coded by ~40 codes that can appear in any of 40 columns (diag1, diag2, etc.).
I am in the process of writing a function that produces plots depending on what I want to show. I thought it would be good to specify what I want to plot in an input data frame. Thus, for example, if I only want to plot chemotherapy events for patients, I would have a data frame like this:
Dataframe name: Style
Name   SearchIn                               codes        PlotAs  PlotColour
Chemo  data[substr(names(data),1,4)=="diag"]  1,2,3,4,5,6  |       red
I already have a function that searches for codes in specific parts of the data frame and flags the events of interest. What I cannot do, and need your help with, is referring to a subset of the data frame using the expression stored in the Style data frame (Style$SearchIn[1]), as above.
> Style$SearchIn[1]
[1] data[substr(names(data),1,4)=="diag"]
Levels: data[substr(names(data),1,4)=="diag"]
I thought perhaps get() would work, but I can't get it to work:
> get(Style$SearchIn[1])
Error in get(vars$SearchIn[1]) : invalid first argument
or
> get(as.character(Style$SearchIn[1]))
Error in get(as.character(Style$SearchIn[1])) :
object 'data[substr(names(data),1,5)=="TDIAG"]' not found
Obviously, running data[substr(names(data),1,5)=="TDIAG"] works.
Example:
library(survival)
ex <- data.frame(SearchIn="lung[substr(names(lung),1,2) == 'ph']")
lung[substr(names(lung),1,2) == 'ph'] #works
get(ex$SearchIn[1]) # does not work
It is not a good idea to store R code in strings and then try to eval them when needed; there are nearly always better solutions for dynamic logic, such as lambdas.
I would recommend using a list to store the plot specification, rather than a data.frame. This would allow you to include a function as one of the list's components which could take the input data and return a subset of it for plotting.
For example:
library(survival);
plotFromSpec <- function(data,spec) {
filteredData <- spec$filter(data);
## ... draw a plot from filteredData and other stuff in spec ...
};
spec <- list(
Name='Chemo',
filter=function(data) data[,substr(names(data),1,2)=='ph'],
Codes=c(1,2,3,4,5,6),
PlotAs='|',
PlotColour='red'
);
plotFromSpec(lung,spec);
If you want to store multiple specifications, you could create a list of lists.
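For example, a sketch of that idea (hypothetical specifications; the filters here just pick columns of the lung dataset by name prefix):
specs <- list(
  chemo = list(
    Name = 'Chemo',
    filter = function(data) data[, substr(names(data), 1, 2) == 'ph'],
    Codes = c(1, 2, 3, 4, 5, 6),
    PlotAs = '|',
    PlotColour = 'red'
  ),
  age = list(
    Name = 'Age',
    filter = function(data) data[, substr(names(data), 1, 3) == 'age', drop = FALSE],
    Codes = c(1, 2),
    PlotAs = 'o',
    PlotColour = 'blue'
  )
)
## Draw one plot per specification
lapply(specs, function(spec) plotFromSpec(lung, spec))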
Have you tried using quote()?
I'm not entirely sure what you want, but maybe you could store the things you're trying to get() like this:
quote(data[substr(names(data),1,4)=="diag"])
and then use eval()
eval(quote(data[substr(names(data),1,4)=="diag"]), list(data=data))
For example,
dat <- data.frame("diag1"=1:10, "diag2"=1:10, "other"=1:10)
Style <- list(SearchIn=c(quote(data[substr(names(data),1,4)=="diag"]), quote("Other stuff")))
> head(eval(Style$SearchIn[[1]], list(data=dat)))
diag1 diag2
1 1 1
2 2 2
3 3 3
4 4 4
5 5 5
6 6 6

R storing different columns in different vectors to compute conditional probabilities

I am completely new to R. I tried reading the reference and a couple of good introductions, but I am still quite confused.
I am hoping to do the following:
I have produced a .txt file that looks like the following:
area,energy
1.41155882174e-05,1.0914586287e-11
1.46893363946e-05,5.25011714434e-11
1.39244046855e-05,1.57904991488e-10
1.64155121046e-05,9.0815757601e-12
1.85202830392e-05,8.3207522281e-11
1.5256036289e-05,4.24756620609e-10
1.82107587343e-05,0.0
I have the following command to read the file in R:
tbl <- read.csv("foo.txt",header=TRUE).
producing:
> tbl
area energy
1 1.411559e-05 1.091459e-11
2 1.468934e-05 5.250117e-11
3 1.392440e-05 1.579050e-10
4 1.641551e-05 9.081576e-12
5 1.852028e-05 8.320752e-11
6 1.525604e-05 4.247566e-10
7 1.821076e-05 0.000000e+00
Now I want to store each column in two different vectors, respectively area and energy.
I tried:
area <- c(tbl$first)
energy <- c(tbl$second)
but it does not seem to work.
I need two different vectors (which must include only the numerical data of each column) in order to do this:
> prob(energy, given = area), i.e. the conditional probability P(energy|area).
And then plot it. Can you help me please?
As @Ananda Mahto alluded to, the problem is in the way you are referring to the columns.
To 'get' a column of a data frame in R, you have several options:
DataFrameName$ColumnName
DataFrameName[,ColumnNumber]
DataFrameName[["ColumnName"]]
So to get area, you would do:
tbl$area #or
tbl[,1] #or
tbl[["area"]]
With the first option generally being preferred (from what I've seen).
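Applied to the question's data, that gives the two plain numeric vectors you were after:
area   <- tbl$area
energy <- tbl$energy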
Incidentally, for your 'end goal', you don't need to do any of this:
with(tbl, prob(energy, given = area))
does the trick.

Lines between certain points in a plot, based on the data? (with R)

I have done my research and googling but have yet to find a solution to the following problem. I have quite often found solutions to R-related issues on this forum, so I thought I'd give it a try and hope that somebody can suggest something. I would need it for my PhD thesis; anybody whose code or suggestions I use will naturally be acknowledged and credited.
So: I need to draw lines/segments to connect points in a plot (of multidimensional scaling, specifically) in R (SPSS-based solutions are welcome as well), but not between all points, just those that represent properties/variables that at least one data item shares. The placement of the lines should be based on the data that the plot itself is based on. Let me exemplify; below are some fictional data with dummy variables, where '1' means that the item has the property:
"properties"
a b c
"items" ---------
tree | 1 1 0
house | 0 1 1
hut | 0 1 1
book | 1 0 0
The plot is a multidimensional scaling plot (distances are to be interpreted as dissimilarities). This is the logic:
there's a line between A and B, because there is at least one item ("tree") in the data that has both properties;
there is a line between B and C, because there is at least one item in the data ("house" and "hut") that has both properties;
there is an item ("book") that has only one property (A), so it does not affect the placement of the lines;
importantly, there is no line between A and C, because there are no items in the data that have both properties.
What I am looking for is a way to add the grey lines automatically/computationally, rather than drawing them manually as I have done on the plot above. The automatic drawing should be based on the data as described above. With a small data set, drawing the lines manually is no problem, but it becomes a problem when there are tens of such "properties" and hundreds of items/rows of data.
Any ideas? Some R code (commented if possible) would be most welcome!
EDIT: It seems I forgot something very important. First of all, the solution proposed by @GaborCsardi below works perfectly with the example data, thanks for that! But I forgot to mention that the linking of the points should also be "conservative", with as few connecting lines as possible. For example, if there is an item that has all the "properties", it should not create lines between every single property point in the plot just because of that, if the points are already connected by other items, even if only indirectly. So a plot based on the following data should not be a full triangle, even though item1 has all three properties:
        A  B  C
item1   1  1  1
item2   1  1  0
item3   0  1  1
Instead, A and B, and B and C, should be connected by a line, but a line between A and C would be excessive, as they are already indirectly connected (through B). Could this be done with incidence graphs?
This is very easy if you use graphs, and create the projection of the bipartite graph that you have in your table. E.g.
library(igraph)
## Some example data
mat <- " properties
items a b c
tree 1 1 0
house 0 1 1
hut 0 1 1
book 1 0 0
"
tab <- read.table(textConnection(mat), skip=1,
                  header=TRUE, row.names=1)
## Create a bipartite graph
graph <- graph.incidence(as.matrix(tab))
## Project the bipartite graph
proj <- bipartite.projection(graph)
## Plot one of the projections, the one you need
## happens to be the second one
plot(proj$proj2)
## Minimum spanning tree of the projection
plot(minimum.spanning.tree(proj$proj2))
For more information see the manual pages, i.e. ?"igraph-package", ?graph.incidence, ?bipartite.projection and ?plot.igraph.
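To actually draw those connections on the MDS plot itself, here is one possible sketch (not from the original answer; it assumes coords is a two-column matrix of property coordinates, e.g. from cmdscale(), with row names "a", "b", "c" matching the projection's vertices):
mst <- minimum.spanning.tree(proj$proj2)
el  <- get.edgelist(mst)                 # each row is a pair of property names
plot(coords, pch = 19, xlab = "", ylab = "")
text(coords, labels = rownames(coords), pos = 3)
segments(coords[el[, 1], 1], coords[el[, 1], 2],
         coords[el[, 2], 1], coords[el[, 2], 2], col = "grey")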
