I want to calculate the correlation of V2 with each of V3, V4, ..., V18.
That is, cor(V2, V3, na.rm = TRUE), cor(V2, V4, na.rm = TRUE), etc.
What is the most efficient way to do this?
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14 V15 V16 V17 V18
1 141_21311223 2.000 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
2 44_33331123 2.000 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
3 247_11131211 2.065 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
4 33_31122113 2.080 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
5 277_21212111 2.090 0 0 0 0 0 1 1 1 1 0 0 0 0 0 0 0
Converting my comment to an answer, one simple approach would be to use the column positions in a sapply statement:
sapply(3:ncol(mydf), function(y) cor(mydf[, 2], mydf[, y]))
This should create a vector of the output values. Change sapply to lapply if you prefer a list as the output.
I've never seen na.rm for cor though. It has no such argument; missing values are handled through use instead, e.g. use = "complete.obs".
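As a possible alternative to looping over column positions, cor() also accepts a vector as x and a multi-column object as y, so all correlations can come from one call. This is only a sketch on made-up toy data (the mydf below is hypothetical, not the asker's data), and it assumes pairwise deletion of missing values is what was meant by na.rm = TRUE:

```r
# Hypothetical toy data: V1 is an ID column, V2..V5 are numeric.
mydf <- data.frame(
  V1 = c("a", "b", "c", "d", "e"),
  V2 = c(2.000, 2.000, 2.065, 2.080, 2.090),
  V3 = c(1, 0, 0, NA, 1),
  V4 = c(0, 1, 0, 1, 1),
  V5 = c(0, 0, 1, 1, NA)
)

# x is a vector, y is a data frame of the remaining columns;
# the result is a 1 x k matrix, flattened here to a named vector.
res <- cor(mydf$V2, mydf[, 3:ncol(mydf)], use = "pairwise.complete.obs")
res <- setNames(as.vector(res), colnames(mydf)[3:ncol(mydf)])
res
```

Note that a column with zero variance (for example, all zeros) gives NA with a warning, whichever approach is used.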
I have 1000+ rows that look like this:
ID V1 V2 V3 V4 V5 V6 V7
KCT00094653 0 0 0 0 0 1 0
KCT00094653 0 0 0 1 0 0 0
KCT00094653 0 0 0 0 0 0 1
KCT00094653 0 0 1 0 0 0 0
KCT00140855 0 0 0 0 1 0 0
KCT00140855 0 0 0 0 0 0 1
KCT00140855 0 0 0 0 0 1 0
KCT00140855 0 0 0 1 0 0 0
KCT00162123 0 1 0 0 0 0 0
KCT00162123 1 0 0 0 0 0 0
KCT00323882 0 0 0 0 0 1 0
KCT00323882 0 0 0 1 0 0 0
KCT00323882 0 0 0 0 0 0 1
KCT00323882 0 0 0 0 1 0 0
KCT00323882 0 1 0 0 0 0 0
I am trying to collect all the 1s into a single row per ID.
The expected output looks like this:
ID V1 V2 V3 V4 V5 V6 V7
KCT00094653 0 0 1 1 0 1 1
KCT00140855 0 0 0 1 1 1 1
KCT00162123 1 1 0 0 0 0 0
KCT00323882 0 1 0 1 1 1 1
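One way this could be done, assuming the columns only ever hold 0 or 1, is to take the maximum of each column within each ID. The sketch below uses base R's aggregate() on a few rows copied from the question; the data frame name df is just a placeholder:

```r
# Rows copied from the question for two of the IDs.
df <- read.table(header = TRUE, text = "
ID V1 V2 V3 V4 V5 V6 V7
KCT00094653 0 0 0 0 0 1 0
KCT00094653 0 0 0 1 0 0 0
KCT00094653 0 0 0 0 0 0 1
KCT00094653 0 0 1 0 0 0 0
KCT00162123 0 1 0 0 0 0 0
KCT00162123 1 0 0 0 0 0 0
")

# For 0/1 data, the per-group maximum is 1 if any row in the group has a 1.
out <- aggregate(. ~ ID, data = df, FUN = max)
out
```

For these rows the result agrees with the KCT00094653 and KCT00162123 lines of the expected output.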
I have a very large CSV file containing counts of unique DNA sequences, and there is a column for each unique sequence. I started with hundreds of samples and cut it down to only 15 that I care about but now I have THOUSANDS of columns that contain nothing but Zeroes and it is messing up my data processing. How do I go about completely removing any column that sums to zero? I’ve seen some similar questions on here but none of those suggestions have worked for me.
I have 6653 columns and 16 rows in my data frame.
If it matters my columns all have super crazy names, some several hundred characters long ( AATCGGCTAA..., etc) and the row names are the sample IDs which are also not entirely numeric. Any tips greatly appreciated. I am still new to R so please let me know where I will need to change things in code examples if you can! Thanks!
You can use colSums
set.seed(10)
df <- as.data.frame(matrix(sample(0:1, 50, replace = TRUE, prob = c(.8, .2)),
5, 10))
df
# V1 V2 V3 V4 V5 V6 V7 V8 V9 V10
# 1 0 0 0 0 1 0 0 0 0 0
# 2 0 0 0 0 0 1 0 1 0 0
# 3 0 0 0 0 0 0 0 1 0 0
# 4 0 0 0 0 0 0 1 0 0 0
# 5 0 0 0 1 0 0 0 0 0 1
df[colSums(df) != 0]
# V4 V5 V6 V7 V8 V10
# 1 0 1 0 0 0 0
# 2 0 0 1 0 1 0
# 3 0 0 0 0 1 0
# 4 0 0 0 1 0 0
# 5 1 0 0 0 0 1
But you might not want to remove all columns which sum to 0, because that could be true even if not all elements are 0. Take V4 in the data frame below as an example.
df$V4[1] <- -1
df
# V1 V2 V3 V4 V5 V6 V7 V8 V9 V10
# 1 0 0 0 -1 1 0 0 0 0 0
# 2 0 0 0 0 0 1 0 1 0 0
# 3 0 0 0 0 0 0 0 1 0 0
# 4 0 0 0 0 0 0 1 0 0 0
# 5 0 0 0 1 0 0 0 0 0 1
So if you want to only remove columns where all elements are 0, you can do
df[colSums(df == 0) < nrow(df)]
# V4 V5 V6 V7 V8 V10
# 1 -1 1 0 0 0 0
# 2 0 0 1 0 1 0
# 3 0 0 0 0 1 0
# 4 0 0 0 1 0 0
# 5 1 0 0 0 0 1
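If the real table may also contain missing values, a variant that counts non-zero entries per column might be safer. This is a sketch under assumptions: the counts data frame, its short sequence names, and the sample IDs are all made up for illustration.

```r
# Hypothetical count table: sample IDs as row names, sequences as column names.
counts <- data.frame(
  AATCGG = c(0, 3, 0),
  GGCTAA = c(0, 0, 0),
  TTAGCC = c(0, NA, 0),
  CCGATT = c(1, 0, -1),
  row.names = c("s1", "s2", "s3"),
  check.names = FALSE
)

# Keep a column only if it has at least one non-zero, non-missing entry.
keep <- colSums(counts != 0, na.rm = TRUE) > 0
filtered <- counts[, keep, drop = FALSE]
names(filtered)
```

Long column names and non-numeric row names do not matter here, since the selection is done with a logical vector rather than by name.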
Welcome to SO. Here is a tidyverse approach:
library(tidyverse)
mtcars %>%
select_if(is.numeric) %>%
select_if(~ sum(.x) > 0)
I have a 10x1 matrix of character (say e212m).
> print(e212m)
[,1]
[1,] "0000000000000111111000000000"
[2,] "0000000000000111111100000000"
[3,] "0000000000001111111100000000"
[4,] "0000000000001111111100000000"
[5,] "0000000000011100111100000000"
[6,] "0000000000011111111100000000"
[7,] "0000000000011111111100000000"
[8,] "0000000000011111111100000000"
[9,] "0000000000001111111000000000"
[10,] "0000000000000011111000000000"
> dim(e212m)
[1] 10 1
> typeof(e212m)
[1] "character"
I want to convert each character of every row into an integer, but not by converting the whole string, like
"0000000000000111111000000000" (string/character) to integer = 0000000000000111111000000000.
I want each character changed to a digit, e.g.
"0" "0" "1" "1" to the numbers 0 0 1 1,
so that in the end I get an integer matrix of 10x28.
P.S.: I am new to R. Direct commands doing the above task are welcome.
x<-"0000000000000111111000000000"
y<-as.numeric(strsplit(x,split='')[[1]])
will return
y
[1] 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0
If your matrix is named m, just use:
m2<-apply(m,1,function(x){as.numeric(strsplit(x,split='')[[1]])})
m2<-t(m2)
x <- c("0000000000000111111000000000", "0000000000000111111100000000", "0000000000001111111100000000")
y <- paste(x, collapse = "\n")
read.fwf(textConnection(y), rep(1, nchar(x[1])))
# V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14 V15 V16 V17 V18 V19 V20 V21 V22 V23 V24 V25 V26 V27 V28
#1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0
#2 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0
#3 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0
Try using a regular expression to put a space between the digits (the result is still a character string):
gsub('(\\d)','\\1 ',x)
or
gsub('(?<=\\d)(\\d)',' \\1',x,perl=T)
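The space-separated strings still have to be parsed to get numbers. One way that might work is to hand them to read.table(); this is a sketch on two short made-up strings, not the full e212m matrix:

```r
# Hypothetical short inputs standing in for the rows of the matrix.
x <- c("0011", "0110")

# Insert a space after every digit, then parse each string as one table row.
spaced <- gsub("(\\d)", "\\1 ", x)
m <- unname(as.matrix(read.table(text = spaced)))
m
```

Each input string becomes one row of m, with one column per character.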
A file I want to read into R looks like this:
0010101010101010101
1110101010101010101
1111110101010111000
0001010101000010100
When I read it in, the problem is that R thinks every row is a number and just shows
V1
1 Inf
2 Inf
3 Inf
4 Inf
5 Inf
6 Inf
How can I read it in as a matrix of 0s and 1s?
One option is
as.matrix(read.fwf('triub.txt', widths=rep(1,19)))
# V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14 V15 V16 V17 V18 V19
#[1,] 0 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1
#[2,] 1 1 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1
#[3,] 1 1 1 1 1 1 0 1 0 1 0 1 0 1 1 1 0 0 0
#[4,] 0 0 0 1 0 1 0 1 0 1 0 0 0 0 1 0 1 0 0
Or
as.matrix(read.table(text=gsub("", ' ', readLines('triub.txt'))))
# V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14 V15 V16 V17 V18 V19
#[1,] 0 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1
#[2,] 1 1 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1
#[3,] 1 1 1 1 1 1 0 1 0 1 0 1 0 1 1 1 0 0 0
#[4,] 0 0 0 1 0 1 0 1 0 1 0 0 0 0 1 0 1 0 0
Or you can pipe through sed or awk (on Linux):
as.matrix(read.table(pipe("sed 's/./& /g' triub.txt"), header=FALSE))
# V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14 V15 V16 V17 V18 V19
#[1,] 0 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1
#[2,] 1 1 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1
#[3,] 1 1 1 1 1 1 0 1 0 1 0 1 0 1 1 1 0 0 0
#[4,] 0 0 0 1 0 1 0 1 0 1 0 0 0 0 1 0 1 0 0
as.matrix(read.table(pipe("awk 'BEGIN{FS=\"\"; OFS=\" \"}{$1=$1}1' triub.txt"),
header=FALSE))
You could treat it like a fixed-width data file and use the readr package.
library(readr)
read_fwf("0010101010101010101
1110101010101010101
1111110101010111000
0001010101000010100", fwf_widths(rep(1,19)))
This returns a data frame, which you can convert to a matrix with as.matrix.
Or you could read the lines and split (useful if you don't know the number of columns ahead of time)
tx<-textConnection("0010101010101010101
1110101010101010101
1111110101010111000
0001010101000010100")
do.call(rbind, lapply(strsplit(readLines(tx), split=""), as.numeric))
close(tx)
Note I only use textConnection() here to make a reproducible example. You can use readLines("filename.txt") for your real data file.
I have some data:
> head(dat)
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14
1: 2 2 3 2 4 1 1 0 0 0 2 2 0 0
2: 0 0 0 0 0 0 0 0 0 0 0 0 0 0
3: 0 0 0 0 0 0 0 0 0 0 0 0 0 0
4: 0 0 0 0 0 1 0 0 0 0 0 0 0 0
5: 0 0 0 0 0 0 0 0 0 0 1 1 0 0
6: 0 0 0 0 0 0 0 0 0 0 0 0 0 0
How can I create a 3D plot of this data, so that the X axis would be V1:V14, the Y axis would be 1:6 (the row index), and the Z axis would be the value in each cell (e.g. V1[1])?
When I try to plot I get:
> scatter3D(dat)
Error in range(y, na.rm = TRUE) : 'y' is missing
What should I pass as Y and Z?
You'll want to play around with the arguments, but wireframe is nice.
library(lattice)
d <- as.matrix(dat)
wireframe(d, scales = list(arrows = FALSE),
drape = TRUE, colorkey = TRUE,
screen = list(z = 30, x = -60))
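On the original question of what to pass as Y and Z: the error suggests the scatter function wants three separate coordinate vectors rather than a whole table. A minimal base R sketch of building them from a matrix follows; the small matrix d here is a made-up stand-in for dat, and xyz is just an illustrative name:

```r
# Hypothetical small stand-in for dat.
d <- as.matrix(data.frame(V1 = c(2, 0), V2 = c(2, 0), V3 = c(3, 1)))

# One row per cell: column index as x, row index as y, cell value as z.
xyz <- data.frame(
  x = as.vector(col(d)),
  y = as.vector(row(d)),
  z = as.vector(d)
)
xyz

# The three columns could then be passed as separate vectors, e.g.
# plot3D::scatter3D(xyz$x, xyz$y, xyz$z), assuming that package is installed.
```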