I have some data:
> head(dat)
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14
1: 2 2 3 2 4 1 1 0 0 0 2 2 0 0
2: 0 0 0 0 0 0 0 0 0 0 0 0 0 0
3: 0 0 0 0 0 0 0 0 0 0 0 0 0 0
4: 0 0 0 0 0 1 0 0 0 0 0 0 0 0
5: 0 0 0 0 0 0 0 0 0 0 1 1 0 0
6: 0 0 0 0 0 0 0 0 0 0 0 0 0 0
How can I create a 3D plot of this data, where the X axis is V1:V14, the Y axis is the row index 1:6, and the Z axis is the value in each cell (e.g. V1[1])?
When I try to plot I get:
> scatter3D(dat)
Error in range(y, na.rm = TRUE) : 'y' is missing
What should I pass as Y and Z?
You'll want to play around with the arguments, but wireframe is nice.
library(lattice)

# wireframe() wants a matrix, not a data frame
d <- as.matrix(dat)

# arrows = FALSE gives labelled axis ticks; drape = TRUE colours the
# surface by height; screen controls the viewing angle
wireframe(d, scales = list(arrows = FALSE),
          drape = TRUE, colorkey = TRUE,
          screen = list(z = 30, x = -60))
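If you'd rather stay with plot3D, note that scatter3D() takes x, y, and z as separate vectors, one element per point, which is why passing the data frame alone fails with "'y' is missing". A minimal sketch, assuming dat is the data frame above:

library(plot3D)

d <- as.matrix(dat)

# build one (x, y, z) triple per cell: x = column, y = row, z = value
xyz <- expand.grid(row = 1:nrow(d), col = 1:ncol(d))
scatter3D(x = xyz$col, y = xyz$row, z = d[cbind(xyz$row, xyz$col)],
          pch = 19, ticktype = "detailed")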
I have a 3000-element vector populated mostly with zeros (0) and intermixed values of one (1) throughout. I am attempting to visualize where the ones appear and the extent of their sequential runs in the data.
I like the idea of a waffle chart with tiny squares, with different colors denoting instances of 0 and 1. Is there a way to tweak a waffle chart to achieve this two-colored, ordered, stacked representation?
The code below provides a 200-element vector populated with mostly zeros as an example. A waffle-type chart with width = 20 and height = 10 is something along the lines of what I seek.
This solution is close to what I desire, except I need to retain the original order of the data in the visual.
Create waffle chart in ggplot2
library(tidyverse)
library(waffle)
dabble <- ifelse(runif(200) < 0.8, 0, 1)
dabble
# [1] 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 1 1 0 0 0 0 1 0 0 0 0 0 0 0 1 1 1 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1 1
# [70] 1 0 0 0 0 0 0 0 0 1 1 0 1 1 1 0 0 0 0 0 0 0 0 0 0 0 1 1 1 0 0 0 0 0 0 0 0 1 0 1 0 1 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0
# [139] 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 1 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 1 0 0 1 0 1 0 0 0 0 0 0 0 1 0 1 0 0 0 0
You can do this in ggplot directly using geom_tile and a little data shaping:
library(tidyverse)
dabble <- ifelse(runif(200) < 0.8, 0, 1)

# lay the 200 values out on a 20 x 10 grid
df <- data.frame(z = dabble, x = rep(1:20, 10), y = rep(10:1, each = 20))

ggplot(df, aes(x, y, fill = factor(z))) +
  geom_tile(color = "white", size = 2) +    # white borders between tiles
  scale_fill_manual(values = c("lightblue", "red4"), name = NULL) +
  coord_equal() +
  theme_void()
Created on 2022-06-01 by the reprex package (v2.0.1)
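The ordering is preserved by the data shaping: y = rep(10:1, each = 20) puts the first 20 elements of dabble on the top row, so the tiles read left-to-right, top-to-bottom in the order of the original vector, and coord_equal() keeps the tiles square like a waffle chart.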
I have a very large CSV file containing counts of unique DNA sequences, with a column for each unique sequence. I started with hundreds of samples and cut it down to only the 15 I care about, but now I have thousands of columns that contain nothing but zeroes, and they are messing up my data processing. How do I completely remove any column that sums to zero? I've seen some similar questions on here, but none of those suggestions have worked for me.
I have 6653 columns and 16 rows in my data frame.
If it matters, my columns all have super crazy names, some several hundred characters long (AATCGGCTAA..., etc.), and the row names are the sample IDs, which are also not entirely numeric. Any tips greatly appreciated. I am still new to R, so please let me know where I would need to change things in the code examples if you can! Thanks!
You can use colSums:
set.seed(10)
df <- as.data.frame(matrix(sample(0:1, 50, replace = TRUE, prob = c(.8, .2)),
                           5, 10))
df
# V1 V2 V3 V4 V5 V6 V7 V8 V9 V10
# 1 0 0 0 0 1 0 0 0 0 0
# 2 0 0 0 0 0 1 0 1 0 0
# 3 0 0 0 0 0 0 0 1 0 0
# 4 0 0 0 0 0 0 1 0 0 0
# 5 0 0 0 1 0 0 0 0 0 1
# keep only the columns whose sum is non-zero
df[colSums(df) != 0]
# V4 V5 V6 V7 V8 V10
# 1 0 1 0 0 0 0
# 2 0 0 1 0 1 0
# 3 0 0 0 0 1 0
# 4 0 0 0 1 0 0
# 5 1 0 0 0 0 1
But you might not want to remove every column that sums to 0, because a column can sum to 0 even when not all of its elements are 0. Take V4 in the data frame below as an example.
df$V4[1] <- -1
df
# V1 V2 V3 V4 V5 V6 V7 V8 V9 V10
# 1 0 0 0 -1 1 0 0 0 0 0
# 2 0 0 0 0 0 1 0 1 0 0
# 3 0 0 0 0 0 0 0 1 0 0
# 4 0 0 0 0 0 0 1 0 0 0
# 5 0 0 0 1 0 0 0 0 0 1
So if you want to remove only the columns where all elements are 0, you can count the zeroes per column instead:
# keep columns with fewer than nrow(df) zeroes, i.e. at least one non-zero entry
df[colSums(df == 0) < nrow(df)]
# V4 V5 V6 V7 V8 V10
# 1 -1 1 0 0 0 0
# 2 0 0 1 0 1 0
# 3 0 0 0 0 1 0
# 4 0 0 0 1 0 0
# 5 1 0 0 0 0 1
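An equivalent base R spelling, if you find it more readable, tests each column directly:

# drop columns whose elements are all zero
df[!sapply(df, function(x) all(x == 0))]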
Welcome to SO! Here is a tidyverse approach:
library(tidyverse)
mtcars %>%
  select_if(is.numeric) %>%     # keep the numeric columns
  select_if(~ sum(.x) > 0)      # keep columns with a positive sum
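As an aside, select_if() is superseded in recent dplyr; the same idea written with where() would be something like:

mtcars %>%
  select(where(is.numeric)) %>%
  select(where(~ sum(.x) > 0))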
I want a multinomially distributed data frame with dummies, where the probabilities are applied to the columns. I have the following code, which seems a bit awkward. Does anyone have a better idea?
set.seed(1234)
data.table::transpose(data.frame(rmultinom(10, 1, c(1:5)/5)))
# V1 V2 V3 V4 V5
# 1 0 0 0 1 0
# 2 0 0 0 0 1
# 3 0 0 0 0 1
# 4 0 1 0 0 0
# 5 0 0 0 0 1
# 6 0 0 0 0 1
# 7 0 0 0 1 0
# 8 0 1 0 0 0
# 9 0 0 0 0 1
# 10 0 0 0 1 0
A little shorter, and it doesn't involve multiple coercions:
data.frame(t(rmultinom(10, 1, c(1:5)/5)))
or
library(data.table)
data.table(t(rmultinom(10, 1, c(1:5)/5)))
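Either way, each row of the result is a one-hot dummy, which you can sanity-check with rowSums():

m <- t(rmultinom(10, 1, (1:5) / 5))
all(rowSums(m) == 1)  # TRUE: exactly one 1 per row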
Because I think my primary obstacle is coding this problem, I put it here rather than on MSE. I'll move it over if that was incorrect.
I have a matrix, m, that I constructed to represent an acyclic directed graph, where the matrix entries are the distances between nodes. For example [and to confirm I constructed it correctly], vertex 1 goes to vertices 2, 3, and 4 with distances of 3, 4, and 4 respectively, and vertex 12 goes to vertices 13, 16, and 19 with distances of 6, 2, and 6 respectively.
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14 V15 V16 V17 V18 V19 V20
[1,] 0 3 4 4 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
[2,] 0 0 0 0 3 5 0 0 0 0 0 0 0 0 0 0 0 0 0 0
[3,] 0 0 0 0 0 2 2 0 0 0 0 0 0 0 0 0 0 0 0 0
[4,] 0 0 0 0 0 0 0 5 0 0 0 0 0 0 0 0 0 0 0 0
[5,] 0 0 0 0 0 0 0 0 0 0 0 0 0 8 0 0 0 0 0 0
[6,] 0 0 0 0 0 0 0 0 3 0 0 0 0 0 0 0 0 0 0 0
[7,] 0 0 0 0 0 0 0 0 6 4 0 0 0 0 0 0 0 0 0 0
[8,] 0 0 0 0 0 0 0 0 0 3 0 5 0 0 0 0 0 0 0 0
[9,] 0 0 0 0 0 0 0 0 0 0 3 0 3 0 0 0 0 0 0 0
[10,] 0 0 0 0 0 0 0 0 0 0 0 0 3 0 0 0 0 0 0 0
[11,] 0 0 0 0 0 0 0 0 0 0 0 0 0 1 3 0 0 0 0 0
[12,] 0 0 0 0 0 0 0 0 0 0 0 0 6 0 0 2 0 0 6 0
[13,] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0
[14,] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 5 0 0 0
[15,] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 4 1 0 0
[16,] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 3 2 0
[17,] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2
[18,] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 3
[19,] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 4
[20,] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
I already went through and manually calculated a minimum-distance path using a backwards recursion (essentially backward induction / dynamic programming), which starts at vertex 20 and iteratively works its way back to vertex 1, where at each step
f(i) = min over successors j of { s(i,j) + f(j) }
where i and j are vertices, s(i,j) is the edge length from vertex i to j, and f(j) is the minimum distance from vertex j to the terminal vertex.
f(20) = 0
f(19) = 4 + f(20) = 4
f(18) = 3 + f(20) = 3
f(17) = 2 + f(20) = 2
f(16) = min{ 3 + f(18), 2 + f(19) } = 6
f(15) = min{ 4 + f(17), 1 + f(18) } = 4
f(14) = 5 + f(17) = 7
f(13) = 1 + f(15) = 5
f(12) = min{ 6 + f(13), 2 + f(16), 6 + f(19) } = 8
f(11) = min{ 1 + f(14), 3 + f(15) } = 7
f(10) = 3 + f(13) = 8
f(9) = min{ 3 + f(11), 3 + f(13) } = 8
f(8) = min{ 3 + f(10), 5 + f(12) } = 11
f(7) = min{ 6 + f(9), 4 + f(10) } = 12
f(6) = 3 + f(9) = 11
f(5) = 8 + f(14) = 15
f(4) = 5 + f(8) = 16
f(3) = min{ 2 + f(6), 2 + f(7) } = 13
f(2) = min{ 3 + f(5), 5 + f(6) } = 16
f(1) = min{ 3 + f(2), 4 + f(3), 4 + f(4) } = 17
This identifies the minimal path as 1 - 3 - 6 - 9 - 13 - 15 - 18 - 20, with total distance 17. Now I'm trying to find which single edge to remove so as to maximize the minimum distance from vertex 1 to vertex 20, without using an R package [if there is one] so I can see what's going on. I'm really uncertain how to get started on this, and I know what I have is next to nothing, but I started with something like:
for (i in (nrow(m) - 1):1) {    # work backwards over the vertices
  for (j in ncol(m):1) {
    if (m[i, j] != 0) {
      # somehow accumulate s(i, j) + f(j) here?
    }
  }
}
How would I go about taking the minimum of the sum of the distance from vertex i to j and the minimum distance from j to the terminal node, as done manually above?
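For what it's worth, here is a minimal sketch of that backward recursion in R, plus a brute-force search over single-edge removals. It assumes the distance matrix is stored in m, that a 0 entry means no edge, and that the vertices are topologically ordered (every edge goes from a lower to a higher index, as in the matrix above):

# f[i] = minimum distance from vertex i to the terminal vertex
shortest_from_1 <- function(m) {
  n <- nrow(m)
  f <- rep(Inf, n)
  f[n] <- 0                                   # f(20) = 0
  for (i in (n - 1):1) {                      # backwards: 19, 18, ..., 1
    succ <- which(m[i, ] != 0)                # successors of vertex i
    if (length(succ) > 0) f[i] <- min(m[i, succ] + f[succ])
  }
  f[1]
}
shortest_from_1(m)  # 17, matching the hand calculation

# brute force the single-edge removal: zero out each edge in turn,
# recompute, and keep the removal that makes f(1) largest
edges <- which(m != 0, arr.ind = TRUE)
worst <- apply(edges, 1, function(e) {
  m2 <- m
  m2[e[1], e[2]] <- 0
  shortest_from_1(m2)
})
edges[which.max(worst), ]  # the edge whose removal hurts most
# (a result of Inf means that removal disconnects vertex 1 from 20)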
I want to calculate the correlation of V2 with each of V3, V4, ..., V18:
That is, cor(V2, V3, na.rm = TRUE), cor(V2, V4, na.rm = TRUE), etc.
What is the most effective way to do this?
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14 V15 V16 V17 V18
1 141_21311223 2.000 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
2 44_33331123 2.000 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
3 247_11131211 2.065 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
4 33_31122113 2.080 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
5 277_21212111 2.090 0 0 0 0 0 1 1 1 1 0 0 0 0 0 0 0
Converting my comment to an answer: one simple approach would be to use the column positions in an sapply statement:
sapply(3:ncol(mydf), function(y) cor(mydf[, 2], mydf[, y]))
This creates a vector of the output values; change sapply to lapply if you prefer a list as the output.
Note that cor() has no na.rm argument; missing-value handling is controlled by its use argument instead.
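So, to ignore rows containing NAs, a sketch along these lines should work:

# use = "complete.obs" drops incomplete rows before computing each correlation
sapply(3:ncol(mydf), function(y) cor(mydf[, 2], mydf[, y], use = "complete.obs"))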