How to plot a data.table with columns with lists - r

Currently I have a data.table of this form:
USER active reason days # of elements by hour
4q7C0o 1 NA 28 c(0, 0, 0, 0, 0, 0, 5, 98, 167, 211, 246)
2BrKY63 1 NA 28 c(0, 0, 0, 0, 0, 0, 0, 5, 15, 24, 89, 187)
3drUy6I 1 NA 28 c(0, 0, 0, 0, 0, 0, 0, 0, 1, 112, 265, 309)
G5ALtO 1 NA 28 c(0, 0, 0, 0, 0, 0, 0, 2, 20, 153, 170)
In the column "# of elements by hour", each list is 24 elements long (I omitted the rest for clarity).
However, I don't know how to do the following two things:
1) Plot all the "# of elements by hour" series in a single plot, labelled by "USER" or "active" (something that looks like a time series).
2) Apply a function to the column "# of elements by hour".
I tried the following, but it produces nothing:
plotserieslines <- function(yvar){
ggplot(tickets_by_hour_2019031, aes_(x=c(0:23) ,y=yvar)) +
geom_line()
}
lapply(names(tickets_by_hour_2019031[,#elements by hour,]), plotserieslines)
where tickets_by_hour_2019031 is my data.table.
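One way to approach both points is to reshape the list column into long format first. The sketch below is only an illustration under assumptions: the list column is called by_hour (a hypothetical name), the toy data has 3 hours instead of 24, and a plain data.frame stands in for the data.table (the same `$` access also works on a data.table).

```r
# Toy stand-in for the table above; `by_hour` is a hypothetical column name.
tickets <- data.frame(USER = c("4q7C0o", "2BrKY63"), active = c(1, 1))
tickets$by_hour <- list(c(0, 5, 98), c(0, 15, 24))

# 2) Apply a function to each list element, e.g. the total per user.
tickets$total <- sapply(tickets$by_hour, sum)

# 1) Reshape to long format: one row per user and hour.
n_per_user <- lengths(tickets$by_hour)
long <- data.frame(
  USER  = rep(tickets$USER, n_per_user),
  hour  = sequence(n_per_user) - 1L,  # hours counted from 0
  count = unlist(tickets$by_hour)
)

# With ggplot2 installed, each user becomes one line in a single plot.
# Call print(p) to draw it.
if (requireNamespace("ggplot2", quietly = TRUE)) {
  p <- ggplot2::ggplot(long, ggplot2::aes(x = hour, y = count, colour = USER)) +
    ggplot2::geom_line()
}
```

Mapping colour to USER (or to factor(active), if that column is carried into the long table) is what gives each series its own label in the legend.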

Related

Making a venn diagram from a count table

I'm trying to create a venn diagram to help me inspect how many shared variables (species) there are between participant groups. I have a dataframe with dimensions 97 (participants) x 320. My first 2 columns are participant_id and participant_group respectively, and the rest 318 columns are the names of the species with their respective counts. I want to create a venn diagram which will tell me how many species are shared between all the groups.
Here is a reproducible example.
participant_id <- c("P01","P02","P03","P04","P05","P06","P07","P08","P09","P10", "P11", "P12", "P13", "P14", "P15")
participant_group <- c("control", "responsive", "resistant", "non-responsive", "control", "responsive", "resistant", "non-responsive", "resistant", "non-responsive", "control", "responsive", "non-responsive", "control", "resistant")
A <- c (0, 54, 23, 4, 0, 2, 0, 35, 0, 0, 45, 0, 1, 99, 12)
B <- c (10, 0, 1, 0, 4, 65, 0, 1, 52, 0, 0, 15, 20, 0, 0)
C <- c (0, 0, 0, 5, 35, 0, 0, 45, 0, 0 , 0, 22, 0, 89, 50)
D <- c (0, 0, 45, 0, 1, 0, 0, 0, 56, 32, 0, 0, 40, 0, 0)
E <- c (0, 0, 40, 5, 0, 0, 0, 45, 0, 1, 76, 0, 34, 56, 31)
F <- c (0, 64, 1, 5, 0, 0, 80, 0, 0, 1, 76, 0, 34, 0, 32)
G <- c (12, 5, 0, 0, 80, 45, 0, 0, 76, 0, 0, 0, 0, 32, 11)
H <- c (0, 0, 0, 5, 0, 0, 80, 0, 0, 1, 0, 0, 34, 0, 2)
example_df <- data.frame(participant_id, participant_group, A, B, C, D, E, F, G, H)
I can see all the wonderful venn diagram packages out there, but I'm struggling to format my data correctly.
I have started with:
example_df %>%
group_by(participant_group) %>%
dplyr::summarise(across(where(is.numeric), sum)) %>%
mutate_if(is.numeric, ~1 * (. > 0))
So now I have an indication whether a species (A,B,C, etc) is present (1) or absent (0) within every group. Now, I want to see the overlap of species between the groups through a venn diagram (something like this https://statisticsglobe.com/venn-diagram-with-proportional-size-in-r ). However, I am a little bit stuck on what to do next. Does anybody have any ideas?
I hope this makes sense! Thanks for your time.
When using the code from @Paul Stafford Allen, I get this diagram, but the goal here is to have something that shows shared presence/absence for species (A, B, C, etc.) between groups irrespective of the counts.
Using
library(VennDiagram)
library(dplyr)
library(magrittr)
I managed the following start point:
groupSums <- example_df %>%
group_by(participant_group) %>%
summarise(across(where(is.numeric), sum))
forVenn <- lapply(groupSums$participant_group, function(x) {
rep(names(groupSums)[-1], times = groupSums[groupSums$participant_group == x,-1])
})
names(forVenn) <- groupSums$participant_group
venn.diagram(forVenn, filename = "Venn.png", force.unique = FALSE)
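If only presence/absence matters, one possible variation is to give each group the set of species names it contains rather than repeated names. This is a minimal sketch in base R, assuming a small made-up table (counts) in place of example_df; the venn.diagram() call is left as a comment because it writes a file.

```r
# Hypothetical miniature of example_df: one row per participant.
counts <- data.frame(
  participant_group = c("control", "responsive", "control", "resistant"),
  A = c(0, 54, 0, 23),
  B = c(10, 0, 4, 1),
  C = c(0, 0, 35, 0)
)
species <- setdiff(names(counts), "participant_group")

# For each group, keep the names of the species with a non-zero total.
present <- lapply(
  split(counts[species], counts$participant_group),
  function(g) species[colSums(g) > 0]
)

# Species found in every group (the centre of the diagram).
shared <- Reduce(intersect, present)

# The list of sets can then be handed to the plotting function, e.g.
# VennDiagram::venn.diagram(present, filename = "Venn.png")
```

Because each species name appears at most once per group, the overlaps count shared species rather than summed counts.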

R: Confusion matrix from list of relevant variables

I am trying to set up a simulation but I struggle with something.
I know which variables are relevant or not in the beginning as I set them myself. They are stored in a beta matrix taking 0 if insignificant and 1 (same amplitude for all) if they are significant.
I run a procedure which gives me its list of relevant variables. The problem is that this procedure does not give me a 0-1 matrix but just a list of the selected variables, identified by their position (so 38 for the 38th variable in my list). I now wish to test whether this procedure is powerful or not, and I want to display the confusion matrix (among other things).
Below you can find those two vectors. I've tried using the "%in%" operator but it does not work and I don't want to use a loop as the simulation takes enough time already.
Thanks in advance!
res=c( 1, 9, 11 ,18, 19, 25, 26, 28, 31, 34 ,37 ,38, 39, 42 ,43 ,47 ,48, 50) #From my procedure
beta=c(1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1)
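A loop-free way to get there is to turn the positions in res into a 0-1 indicator of the same length as beta, then cross-tabulate the two. The sketch below assumes that res holds 1-based positions into beta; the names selected and cm are introduced here for illustration.

```r
res <- c(1, 9, 11, 18, 19, 25, 26, 28, 31, 34, 37, 38, 39, 42, 43, 47, 48, 50)
beta <- c(1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1,
          0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1)

# 1 where the position was selected by the procedure, 0 otherwise.
selected <- as.integer(seq_along(beta) %in% res)

# Fixing the factor levels keeps the table 2 x 2 even if a class is empty.
cm <- table(
  truth    = factor(beta, levels = c(0, 1)),
  selected = factor(selected, levels = c(0, 1))
)
cm
```

For these two vectors the table has 15 true positives, 3 false positives (positions 26, 43 and 47), no false negatives and 32 true negatives.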

Selecting column sequences and creating variables

I was wondering if there was a way to select specific columns via a sequence and create new variables from this.
So for example, if I had 8 columns with n observations, how could I create 4 variables that each select 2 columns sequentially? My dataset is much larger than this: I have 1416 variables with 62 observations each (I have pasted a link to the spreadsheet below, where the first column and row hold the names). I would like to create new data frames from this, named as sites 1-12. So site 1 = df[,1:117]; site 2 = df[,119:237] etc.
I am planning on using this code for future datasets with even more variables so some form of loop or sequence function would be very effective if anyone could shed any light on how to achieve this?
https://www.dropbox.com/s/p1a5cu567lxntmw/MyData.csv?dl=0
Thank you in advance.
James
P.S. @nrussell, I have copied and pasted the output of the code you mentioned below; it continues as a series of numbers like those displayed.
dput(z[ , 1:10])
structure(list(`1` = c(0, 0, 0, 0, 0, 0, 0, 0, 0.0311410340342049,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0.0207444023791158, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0.0312971643732546,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0.0376287494579976, 0, 0, 0, 0, 0,
0, 0),......... `10` = c(0, 0, 0, 0, 0.119280313679916,
0, 0, 0.301029995663981, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0.715681882079494,
0.136831816210901, 0, 0, 0, 0.0273663632421801, 0, 0, 0, 0.0547327264843602,
0, 0, 0, 0, 0.0231561535126139, 0, 0, 0.0903089986991944, 0,
0, 0.0752574989159953, 0.159368821233872, 0.0272640716982664,
0.0177076468037636, 0, 0, 0.120411998265592, 0, 0, 0, 0, 0.0322532138211408,
0.0250858329719984, 0, 0, 0, 0.119280313679916, 0, 0.172922500085254,
0.225772496747986, 0, 0, 0, 0.0954242509439325, 0)), .Names = c("1",
"2", "3", "4", "5", "6", "7", "8", "9", "10"), class = "data.frame", row.names = c(NA,
-62L))
We could split the dataset ('df') with 1416 columns into equal-sized chunks of 118 columns by creating a grouping index with gl:
lst <- setNames(lapply(split(1:ncol(df), as.numeric(gl(ncol(df), 118,
ncol(df)))), function(i) df[,i]), paste0('site', 1:12))
Or you can create the 'lst' without using the split
lst <- setNames(lapply(seq(1, ncol(df), by = 118),
function(i) df[i:(i+117)]), paste0('site', 1:12))
If we need to create 12 dataset objects in the global environment, list2env is an option (I would prefer to work within the 'lst' itself)
list2env(lst, envir=.GlobalEnv)
Using a small dataset ('df1') with '8' columns
lst1 <- setNames(lapply(split(1:ncol(df1), as.numeric(gl(ncol(df1),
2, ncol(df1)))), function(i) df1[,i]), paste0('site', 1:4))
list2env(lst1, envir=.GlobalEnv)
head(site1,3)
# V1 V2
#1 6 12
#2 4 7
#3 14 14
head(site4,3)
# V7 V8
#1 10 2
#2 5 4
#3 5 0
data
set.seed(24)
df1 <- as.data.frame(matrix(sample(0:20, 8*10, replace=TRUE), ncol=8))
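As a further variant, the same grouping idea can be written with split.default(), which splits a data frame by columns when given one group label per column. This is a sketch on the small df1 above; width, grp and lst2 are names introduced here for illustration.

```r
set.seed(24)
df1 <- as.data.frame(matrix(sample(0:20, 8 * 10, replace = TRUE), ncol = 8))

width <- 2                               # columns per site
grp <- ceiling(seq_along(df1) / width)   # 1 1 2 2 3 3 4 4

# split.default() treats the data frame as a list of columns.
lst2 <- split.default(df1, grp)
names(lst2) <- paste0("site", seq_along(lst2))

names(lst2$site4)
```

For the full dataset the only change would be width <- 118, assuming the number of columns is an exact multiple of the width.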

How to add a column of labels to the right of a plot

I have code producing a swell plot in R. Check it out!
I made it using this code:
SwellPlot <- function(Data,P) {
par(mfcol=c(1,1), mar=c(4, 4.5, 0, 0) + 0.1)
plot(c(0.5,P+0.5), c(-0.5,P+0.5), type="n", cex=1, lab=c(20,20,1), cex.lab=1.5, yaxs="i", xaxs="i", xlab=~bold("Stuff!"), ylab=~bold("Other stuff!"))
for (f in 1:(P+1)) {
for (r in 0:P) {
points(r,(f-1), pch=15, cex=(60/P)*sqrt(Data[f,r]/200))
}
}
}
With Data input as a matrix, like this one used for the above example:
Data <- matrix(c(0, 117, 76, 7, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 199, 1, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 200, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 12, 188, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 90,
109, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 38, 143, 19, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 18, 142, 39, 1, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 8, 114, 75, 3, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 4, 95, 99, 2, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 4, 68, 121, 7, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 65, 122, 10,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 65,
120, 15, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
2, 42, 132, 23, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 35, 135, 30, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 44, 128, 27, 1, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 33, 139, 27, 1, 0, 0, 0, 0,
0, 0, 0, 0), 17, 16, byrow=FALSE)
I would like to glom a vertical series of labels along the right side of my swell plot (these labels are different than the y-axis labels, which I need to remain as they already are). I have done so by extending the x-axis and using text(), but when I do this it looks kludgy (on the x-axis labels) and because I have plots with quite variable x-axis lengths I have to spend a lot of trial and error time making sure that the width of my labels does not overlap the right-most plotted elements. The improved plot would look kinda like this:
I imagine that there must be some relatively straightforward way of doing this using layout() or mfcol or somesuch, but the documentation and examples get rapidly complex for me. How would you solve this?
Bonus points for answers that permit me to specifically (1) make the width of the label column I would like to add 8% of the width of my swell plot, and (2) make my swell plot (not including the labels up the right side) retain its 1:1 aspect ratio.
Also bonus points if this can be done using par and plot and the like.
Here is one possibility. You can either add rownames to your Data or you can pass in a vector of names
SwellPlot <- function(Data,P=nrow(Data), labels=!is.null(rownames(Data))) {
par(mfcol=c(1,1), mar=c(4, 4.5, 0, 3) + 0.1)
plot(c(0.5,P+0.5), c(-0.5,P+0.5),
type="n", cex=1, lab=c(20,20,1),
cex.lab=1.5, yaxs="i", xaxs="i",
xlab=~bold("Stuff!"), ylab=~bold("Other stuff!")
)
cex<-(60/P)*sqrt(Data/200)
points(col(Data), row(Data), pch=15, cex=cex)
if(is.logical(labels) && labels) {
labels = rownames(Data)
}
if(is.character(labels)) {
axis(4, labels=labels, at=1:P, las=1, tick=FALSE)
}
}
#add names for labeling
lbl<-c("",rev(c("Ant","Bat","Cat","Dog","Eel","Fly","Gar","Hog")), rep("",8))
rownames(Data)<-lbl
SwellPlot(Data)
I just drew the extra names with axis and I made sure to make room for them by adjusting the par
I also edited the middle to take out the two loops. Those were pretty slow. You can do it all with one call to points, passing all the x,y pairs at once. The trick is that if you pass a matrix, it will be converted to a simple vector. So if all the matrices are arranged the same way, you'll get a point for each cell.
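To see why the single points() call lines up, here is a small sketch on a made-up 2 x 3 matrix (m and xy are hypothetical names): col() and row() return matrices of the same shape as the input, so flattening all three gives one (x, y, value) triple per cell.

```r
m <- matrix(c(10, 20, 30, 40, 50, 60), nrow = 2)

# col(m) and row(m) have the same shape as m, so they flatten in the same
# column-major order as the values themselves.
xy <- data.frame(
  x     = as.vector(col(m)),
  y     = as.vector(row(m)),
  value = as.vector(m)
)
xy
```

Each row of xy describes exactly one cell, which is what lets a vectorised cex be matched to the right point.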

Add consecutive elements of a vector until a value

I would like to calculate the minimum number of consecutive elements in a vector that, when added (consecutively), would reach a given value.
For example in the following vector
ev<-c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 2.7, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3.27, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 370.33, 1375.4,
1394.03, 1423.8, 1360, 1269.77, 1378.8, 1350.37, 1425.97, 1423.6,
1363.4, 1369.87, 1365.5, 1294.97, 1362.27, 1117.67, 1026.97,
1077.4, 1356.83, 565.23, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 356.83,
973.5, 0, 240.43, 1232.07, 1440, 1329.67, 1096.87, 1331.37, 1305.03,
1328.03, 1246.03, 1182.3, 1054.53, 723.03, 1171.53, 1263.17,
1200.37, 1054.8, 971.4, 936.4, 968.57, 897.93, 1099.87, 876.43,
1095.47, 1132, 774.4, 1075.13, 982.57, 947.33, 1096.97, 929.83,
1246.9, 1398.2, 1063.83, 1223.73, 1174.37, 1248.5, 1171.63, 1280.57,
1183.33, 1016.23, 1082.1, 795.37, 900.83, 1159.2, 992.5, 967.3,
1440, 804.13, 418.17, 559.57, 563.87, 562.97, 1113.1, 954.87,
883.8, 1207.1, 1046.83, 995.77, 803.93, 1036.63, 946.9, 887.33,
727.97, 733.93, 979.2, 1176.8, 1241.3, 1435.6)
What is the minimum number of elements that, when added consecutively (in the order they appear in the vector), would sum up to, let's say, 20000?
To be clearer, I need the following:
Start with ev[1] and add consecutively up to 20000. Record the number of elements you had to add in order to get to 20000 as r[1]. Then start with ev[2] and add until 20000, recording the number of elements you had to add as r[2], and so on. Do this for the entire length of ev. Then return min(r).
For example
j<-c(1, 2, 3, 5, 7, 9, 2)
I want the minimum number of elements that, when added consecutively, would give, let's say, >20. This should be 3 (5+7+9).
Thanks a lot
Well, I'll give it a shot: This one will find the length of the minimum sequence of numbers
that add up to or above max. It makes no claims to be fast, but it has O(2n) time complexity :-)
I made it return both the start index and the length.
f <- function(x, max=10) {
s <- 0
len <- Inf
start <- 1
j <- 1
for (i in seq_along(x)) {
s <- s + x[i]
while (s >= max) {
if (i-j+1 < len) {
len <- i-j+1
start <- j
}
s <- s - x[j]
j <- j + 1
}
}
list(start=start, length=len)
# uncomment the line below if you don't need the start index...
#len
}
r <- f(ev, 20000) # list(start=245, length=15)
sum(ev[seq(r$start, len=r$length)]) # 20275.42
# Test speed:
x <- sin(1:1e6)
system.time( r <- f(x, 1.9) ) # 1.54 secs
# Compiling the function makes it 9x faster...
g <- compiler::cmpfun(f)
system.time( r <- g(x, 1.9) ) # 0.17 secs
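As a cross-check on the loop above, the same answer can be sketched without an explicit loop by combining cumsum() with findInterval(). This is a different technique from the function f, and it assumes every element is non-negative (so the running sum never decreases) and that the target is positive; min_run is a hypothetical name. Because it compares differences of running sums, floating-point rounding could in principle shift a borderline case.

```r
min_run <- function(x, target) {
  stopifnot(all(x >= 0), target > 0)
  cs <- c(0, cumsum(x))            # cs[k + 1] is the sum of x[1..k]
  starts <- seq_along(x)
  # For each start s, find the smallest k with cs[k] >= cs[s] + target.
  # With left.open = TRUE, findInterval() counts the entries strictly below.
  k <- findInterval(cs[starts] + target, cs, left.open = TRUE) + 1L
  len <- ifelse(k <= length(cs), k - starts, NA_integer_)
  if (all(is.na(len))) {
    return(list(start = NA_integer_, length = NA_integer_))
  }
  best <- which.min(len)           # NA entries (target never reached) are skipped
  list(start = best, length = len[best])
}

r <- min_run(c(1, 2, 3, 5, 7, 9, 2), 20)
r$start   # 4
r$length  # 3, i.e. 5 + 7 + 9
```

On the small example from the question this picks the run starting at the fourth element, matching the expected answer of 3.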
library(zoo) # Needed for rollapply
N <- 20000 # The desired sum we want to achieve
j <- 0
for(i in 1:length(ev)){
k <- rollapply(ev, i, sum)
j[i] <- max(k)
if(j[i] >= N){
break
}
}
i # contains how many consecutive elements you need to sum (15)
j[i] # contains the corresponding sum(20275.42)
Currently this doesn't tell you where the specific subset occurs in the vector but another use of rollapply could get you that information.
There are other ways to do it but if you have a really long vector this will break out of the loop so you don't calculate more than you need. The basic idea is to use rollapply to create a vector of the consecutive sums of length k and then find the maximum of that. If this is less than what we desire do the same thing for sums of length k+1. Repeat until we find a sum that is larger than the desired threshold.
Edit:
This appears to be about 100x faster. I haven't compared it to Tommy's answer (which is probably faster than this), but this will provide a significant speedup compared to my original method.
Edit 2: Moving the [-n] and removing the suppressWarnings speeds this up quite a bit.
myfun <- function(ev, N){
i <- 1
n <- length(ev)
j <- ev
repeat{
j <- (j[-n] + ev[-c(1:i)])
i <- i+1
n <- n-1
if(max(j) >= N || i > length(ev)){
break;
}
}
return(i)
}
myfun(ev, 20000)
# And stealing the idea from Tommy gives a nice speedup as well
myfuncomp <- compiler::cmpfun(myfun)
myfuncomp(ev, 20000)
myfunc3 <- compiler::cmpfun(myfun, options = list(optimize = 3))
myfunc3(ev, 20000)
library(rbenchmark) # For testing
# If you have Tommy's functions loaded as f and g you can compare
benchmark(f(ev, 20000), g(ev, 20000), myfun(ev, 20000), myfuncomp(ev, 20000), myfunc3(ev, 20000))
you mean something like this?
> sum(ifelse(cumsum(ev)<=200000, 1, 0))
[1] 364
I think this may be a Traveling Salesman Problem in disguise unless you put in some more constraints. You cannot necessarily start at the max ev and go out in either direction since it may be a local non-dense maximum
x=1:length(ev)
plot(x,ev)
lxy <- loess(ev~x )
lines(x, predict(lxy))
title(main="loess() fit of ev")
But in the region of the most dense values the values are fairly flat.
y=c(356.83,
973.5, 0, 240.43, 1232.07, 1440, 1329.67, 1096.87, 1331.37, 1305.03,
1328.03, 1246.03, 1182.3, 1054.53, 723.03, 1171.53, 1263.17,
1200.37, 1054.8, 971.4, 936.4, 968.57, 897.93, 1099.87, 876.43,
1095.47, 1132, 774.4, 1075.13, 982.57, 947.33, 1096.97, 929.83,
1246.9, 1398.2, 1063.83, 1223.73, 1174.37, 1248.5, 1171.63, 1280.57,
1183.33, 1016.23, 1082.1, 795.37, 900.83, 1159.2, 992.5, 967.3,
1440, 804.13, 418.17, 559.57, 563.87, 562.97, 1113.1, 954.87,
883.8, 1207.1, 1046.83, 995.77, 803.93, 1036.63, 946.9, 887.33,
727.97, 733.93, 979.2, 1176.8, 1241.3, 1435.6)
x=1:length(y) # define x only after y exists
lxyhi <- loess(y~x)
plot(x,y)
lines(x, predict(lxyhi))
