How to Count Records in a Dimension? - crossfilter

I'm trying to convert an app to use Crossfilter and have run into a problem.
My data looks something like the following:
Threshold: 0.7
ID  Category A  Category B  Category C  Any category above threshold?
1   0.21        0.83        0.83        TRUE
2   0.38        0.68        0.47        FALSE
3   0.72        0.58        0.01        TRUE
4   0.95        0.62        0.01        TRUE
5   0.61        0.75        0.27        TRUE

Category         A  B  C
Above threshold  2  2  1
Total rows above threshold: 4
I have a global threshold value and a number of categories (A through C). The global threshold determines which rows pass the filter: a row is kept if at least one of its categories is above the threshold (here, every row except row 2). Then, for each category, I need to count how many of the filtered rows have a value above the threshold in that category (A: 2, B: 2, C: 1). The threshold can change dynamically.
I have not been able to figure out how to maintain these per-category counts without iterating over all filtered rows (expensive!) each time the threshold changes. Can someone suggest a better approach?
Thank you in advance!

Related

Heatmap of effect sizes and p-values using different exposures and outcomes in ggplot2

I want to create a heat map that graphically shows the effect sizes between different outcomes and exposures, and whether the p-values were significant.
I have created one big data frame containing all exposure-outcome tests with p-values and effect sizes. The effect direction can be positive or negative. There are great resources for doing this for correlation matrices, such as corrplot.
I don't see how to do this for effect sizes with different exposures and outcomes.
This would be the sample data frame; the full data has 20 exposures and 15 outcomes.
Here is a shortened example. The estimates and p-values are made up, so disregard the statistical nonsense in the values.
dat
# id Exposure Outcome  beta p-value    se
#  1        a       1  0.02    0.04 0.001
#  1        a       2  0.52   0.001  0.02
#  1        a       3 0.001    0.54 0.001
#  1        b       1 -0.02    0.09 0.045
#  1        b       2  0.06    0.12  0.03
#  1        b       3  -0.1    0.41  0.09
#  1        c       1 -0.42    0.01  0.08
This is an example of a similar plot using correlation.
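A minimal sketch of one common approach, using the example rows above: a geom_tile heatmap filled by effect size, with significant cells marked by an asterisk. The column name p.value (standing in for p-value), the 0.05 cutoff, and the colour choices are assumptions, not anything given in the question.
library(ggplot2)

# the example rows from the question (values are made up, as noted above)
dat <- data.frame(
  Exposure = c("a", "a", "a", "b", "b", "b", "c"),
  Outcome  = c(1, 2, 3, 1, 2, 3, 1),
  beta     = c(0.02, 0.52, 0.001, -0.02, 0.06, -0.1, -0.42),
  p.value  = c(0.04, 0.001, 0.54, 0.09, 0.12, 0.41, 0.01)
)

ggplot(dat, aes(x = Exposure, y = factor(Outcome), fill = beta)) +
  geom_tile(colour = "grey70") +
  # mark significant exposure-outcome pairs with an asterisk
  geom_text(aes(label = ifelse(p.value < 0.05, "*", "")), size = 6) +
  # diverging scale so negative and positive effects are distinguishable
  scale_fill_gradient2(low = "blue", mid = "white", high = "red", midpoint = 0) +
  labs(x = "Exposure", y = "Outcome", fill = "Effect size")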

Split a vector uniformly in R

I have a vector of probabilities in R:
p = c(0.01, 0.02, 0.2, 0.1, 0.07, 0.15, 0.09)
and I want to divide it uniformly into 5 categories according to its range:
range(p)[1]
range(p)[2]
range(p)[2]/5
For example, category 1 would contain the probabilities from 0 to 0.04, category 2 from 0.04 to 0.08, category 3 from 0.08 to 0.12, category 4 from 0.12 to 0.16, and category 5 from 0.16 to 0.20.
But I want to implement it automatically. How can I do it?
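A minimal sketch, assuming the bins should run in equal widths from 0 up to max(p), as in the example breaks above: cut() assigns each value to one of the 5 intervals.
p <- c(0.01, 0.02, 0.2, 0.1, 0.07, 0.15, 0.09)

# 6 break points give 5 equal-width bins: 0, 0.04, 0.08, 0.12, 0.16, 0.20
breaks <- seq(0, max(p), length.out = 6)

# include.lowest = TRUE keeps a value equal to the lowest break in bin 1
categories <- cut(p, breaks = breaks, include.lowest = TRUE, labels = 1:5)
categories
# [1] 1 1 5 3 2 4 3
# Levels: 1 2 3 4 5
If the bins should instead span range(p) (0.01 to 0.20), cut(p, breaks = 5) splits that range into 5 equal-width intervals directly.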

How to use the `map` family of functions in the **purrr** package to swap columns across rows in a data frame?

Imagine there are 4 cards on the desk, and there are several rows of them (e.g., 5 rows in the demo). The value of each card is listed in the demo data frame, but the actual position of each card is indexed by the pos columns; see the demo data I generated below.
To restore the cards' values to their original positions, I swap them across rows with `[` indexing. The following code already fulfils this purpose. To avoid an explicit loop, I wonder whether I can achieve the same effect with a vectorised approach from the tidyverse family, e.g. pmap or a related function from the purrr package?
# 1. data generation ------------------------------------------------------
rm(list = ls())
vect <- matrix(round(runif(20), 2), nrow = 5)
colnames(vect) <- paste0('card', 1:4)
order <- rbind(c(2,3,4,1), c(3,4,1,2), c(1,2,3,4), c(4,3,2,1), c(3,4,2,1))
colnames(order) <- paste0('pos', 1:4)
dat <- data.frame(vect, order, stringsAsFactors = FALSE)
# 2. data swap ------------------------------------------------------------
for (i in 1:nrow(dat)) {
  orders <- dat[i, paste0('pos', 1:4)]
  card <- dat[i, paste0('card', 1:4)]
  vec <- card[order(unlist(orders))]
  names(vec) <- paste0('deck', 1:4)
  dat[i, paste0('deck', 1:4)] <- vec
}
dat
You could use pmap_dfr:
card_cols <- grep('card', names(dat))
pos_cols <- grep('pos', names(dat))
dat[paste0('deck', seq_along(card_cols))] <- purrr::pmap_dfr(dat, ~{
x <- c(...)
as.data.frame(t(unname(x[card_cols][order(x[pos_cols])])))
})
dat
# card1 card2 card3 card4 pos1 pos2 pos3 pos4 deck1 deck2 deck3 deck4
#1 0.05 0.07 0.16 0.86 2 3 4 1 0.86 0.05 0.07 0.16
#2 0.20 0.98 0.79 0.72 3 4 1 2 0.79 0.72 0.20 0.98
#3 0.50 0.79 0.72 0.10 1 2 3 4 0.50 0.79 0.72 0.10
#4 0.03 0.98 0.48 0.06 4 3 2 1 0.06 0.48 0.98 0.03
#5 0.41 0.72 0.91 0.84 3 4 2 1 0.84 0.91 0.41 0.72
One thing to note here is to make sure that the output from the pmap function does not keep the original column names. If it did, pmap_dfr would bind the rows by column name, reshuffling the columns, and the output would not be in the correct order. I use unname here to remove the names.
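To see why, here is a rough illustration of what one iteration works with, using row 1 of the output shown above (x mimics the named vector that c(...) builds inside the function):
# roughly what c(...) produces for row 1 inside pmap_dfr
x <- unlist(dat[1, ])

x[card_cols][order(x[pos_cols])]
# card4 card1 card2 card3
#  0.86  0.05  0.07  0.16
The reordered values still carry their old card* names, so binding the per-row results by name would undo the reordering; unname() (followed by t() and as.data.frame()) yields plain V1..V4 columns instead.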

How to keep values with same sign in r

I have a data frame like the one below. The rows are protein IDs; column a is experiment 1 and column b is experiment 2. We expect proteins to show the same expression values in both experiments, but the sign of the expression is not always in agreement, as the heatmap shows.
> head(a[,c(1,3)])
a b
A0JLT2 0.29 0.2
A8MXV4 -1.25 -0.6
O00194 -2.21 0.9
O00462 0.68 -0.6
O00505 1.05 -0.6
O00560 0.43 -0.2
I want to keep only the proteins whose values have the same sign (- or +) in both columns, but I don't know how to do that.
Any help?
Here's one way -
a[sign(a$a) == sign(a$b), ]
sign() returns the sign of a number: -1, 0, or 1.
sign(-1.25)
[1] -1
sign(-0.6)
[1] -1
sign(0.29)
[1] 1
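Applied to the rows shown in the question (a small self-contained sketch; the data frame is rebuilt here under the name a), only the first two proteins are kept:
# rebuild the example rows from the question
a <- data.frame(
  a = c(0.29, -1.25, -2.21, 0.68, 1.05, 0.43),
  b = c(0.2, -0.6, 0.9, -0.6, -0.6, -0.2),
  row.names = c("A0JLT2", "A8MXV4", "O00194", "O00462", "O00505", "O00560")
)

a[sign(a$a) == sign(a$b), ]
#            a    b
# A0JLT2  0.29  0.2
# A8MXV4 -1.25 -0.6
Note that sign(0) is 0, so a protein with an exact zero in one column is kept only if the other column is zero as well.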

mlogit "row names supplied are of the wrong length", R

I am implementing a multinomial logit model using the mlogit package in R. The data includes three different "choices" and three variables (A, B, C) which contain information for the independent variable. I have transformed the data using the mlogit.data function, which makes it look like this:
Observation Choice VariableA VariableB VariableC
1 1 1.27 0.2 0.81
1 0 1.27 0.2 0.81
1 -1 1.27 0.2 0.81
2 1 0.20 0.45 0.70
2 0 0.20 0.45 0.70
2 -1 0.20 0.45 0.70
The thing is that I want the independent variable to be choice-specific, and therefore constructed as Variable D below:
Observation Choice VariableA VariableB VariableC VariableD
1 1 1.27 0.2 0.81 1.27
1 0 1.27 0.2 0.81 0.2
1 -1 1.27 0.2 0.81 0.81
2 1 0.20 0.45 0.70 0.20
2 0 0.20 0.45 0.70 0.45
2 -1 0.20 0.45 0.70 0.70
Variable D was constructed using the following code:
choice_map <- data.frame(choice = c(1, 0, -1), var = grep('Variable[A-C]', names(df)))
df$VariableD <- df[cbind(seq_len(nrow(df)), with(choice_map, var[match(df$Choice, choice)]))]
However, when I try to run the multinomial logit model,
mlog <- mlogit(Choice ~ 1 | VariableD, data=df, reflevel = "0")
the error message "row names supplied are of the wrong length" is returned. When I use any of the other variables A-C separately, the regression runs without any problems. So my questions are: why can't Variable D be used, and how can this problem be solved?
Thanks!
I got this error when I passed my original data frame into the model, rather than the data frame created by mlogit.data.
So make sure to create the mlogit.data data frame first and pass that into your mlogit function, as sketched below.
(source: Andy Field, Discovering statistics using R, page 348)
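A rough sketch of that workflow, assuming column names Observation, alt, and chosen for the choice-situation id, the alternative label, and the chosen-alternative indicator (these names are placeholders for however the raw data is actually laid out):
library(mlogit)

# df_raw: one row per observation-alternative pair, with
#   Observation - the choice situation id
#   alt         - the alternative label (here 1, 0, -1)
#   chosen      - TRUE/FALSE indicator of the alternative actually chosen
#   VariableD   - the alternative-specific covariate built above
df_ml <- mlogit.data(df_raw,
                     choice   = "chosen",
                     shape    = "long",
                     chid.var = "Observation",
                     alt.var  = "alt")

# fit on the mlogit.data object, not on df_raw
mlog <- mlogit(chosen ~ VariableD, data = df_ml, reflevel = "0")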
