A file I want to read into R looks like this:
0010101010101010101
1110101010101010101
1111110101010111000
0001010101000010100
When I read it in, the problem is that R treats every row as a single number and just shows:
V1
1 Inf
2 Inf
3 Inf
4 Inf
5 Inf
6 Inf
How can I read it in as a matrix of 0s and 1s, with one digit per element?
One option is
as.matrix(read.fwf('triub.txt', widths=rep(1,19)))
# V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14 V15 V16 V17 V18 V19
#[1,] 0 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1
#[2,] 1 1 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1
#[3,] 1 1 1 1 1 1 0 1 0 1 0 1 0 1 1 1 0 0 0
#[4,] 0 0 0 1 0 1 0 1 0 1 0 0 0 0 1 0 1 0 0
Or
as.matrix(read.table(text=gsub("", ' ', readLines('triub.txt'))))
# V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14 V15 V16 V17 V18 V19
#[1,] 0 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1
#[2,] 1 1 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1
#[3,] 1 1 1 1 1 1 0 1 0 1 0 1 0 1 1 1 0 0 0
#[4,] 0 0 0 1 0 1 0 1 0 1 0 0 0 0 1 0 1 0 0
Or you can pipe the file through sed or awk (on Linux):
as.matrix(read.table(pipe("sed 's/./& /g' triub.txt"), header=FALSE))
# V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14 V15 V16 V17 V18 V19
#[1,] 0 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1
#[2,] 1 1 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1
#[3,] 1 1 1 1 1 1 0 1 0 1 0 1 0 1 1 1 0 0 0
#[4,] 0 0 0 1 0 1 0 1 0 1 0 0 0 0 1 0 1 0 0
as.matrix(read.table(pipe("awk 'BEGIN{FS=\"\"; OFS=\" \"}{$1=$1}1' triub.txt"),
header=FALSE))
You could treat it as a fixed-width data file and use the readr package.
library(readr)
read_fwf("0010101010101010101
1110101010101010101
1111110101010111000
0001010101000010100", fwf_widths(rep(1,19)))
This returns a data frame, which you can convert to a matrix with as.matrix().
Or you could read the lines and split (useful if you don't know the number of columns ahead of time)
tx<-textConnection("0010101010101010101
1110101010101010101
1111110101010111000
0001010101000010100")
do.call(rbind, lapply(strsplit(readLines(tx), split=""), as.numeric))
close(tx)
Note I only use textConnection() here to make a reproducible example. You can use readLines("filename.txt") for your real data file.
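As a minimal sketch of the read-and-split approach on an actual file: the temporary file below is only a stand-in for triub.txt, and the width check is an added assumption (that every line should hold the same number of digits), not something the answers above require.

```r
# Stand-in for the real file: write the sample rows to a temporary file.
path <- tempfile(fileext = ".txt")
writeLines(c("0010101010101010101",
             "1110101010101010101",
             "1111110101010111000",
             "0001010101000010100"), path)

lines  <- readLines(path)
digits <- strsplit(lines, split = "")   # one character vector per line

# Assumption: all rows have the same width; stop early if they do not.
stopifnot(length(unique(lengths(digits))) == 1)

# Fill row by row so that each line of the file becomes one matrix row.
m <- matrix(as.integer(unlist(digits)), nrow = length(lines), byrow = TRUE)
dim(m)
# [1]  4 19
```

Because the number of columns is taken from the data, nothing here needs to know the width of 19 in advance.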
I have a very large CSV file containing counts of unique DNA sequences, and there is a column for each unique sequence. I started with hundreds of samples and cut it down to only 15 that I care about but now I have THOUSANDS of columns that contain nothing but Zeroes and it is messing up my data processing. How do I go about completely removing any column that sums to zero? I’ve seen some similar questions on here but none of those suggestions have worked for me.
I have 6653 columns and 16 rows in my data frame.
If it matters my columns all have super crazy names, some several hundred characters long ( AATCGGCTAA..., etc) and the row names are the sample IDs which are also not entirely numeric. Any tips greatly appreciated. I am still new to R so please let me know where I will need to change things in code examples if you can! Thanks!
You can use colSums
set.seed(10)
df <- as.data.frame(matrix(sample(0:1, 50, replace = TRUE, prob = c(.8, .2)),
5, 10))
df
# V1 V2 V3 V4 V5 V6 V7 V8 V9 V10
# 1 0 0 0 0 1 0 0 0 0 0
# 2 0 0 0 0 0 1 0 1 0 0
# 3 0 0 0 0 0 0 0 1 0 0
# 4 0 0 0 0 0 0 1 0 0 0
# 5 0 0 0 1 0 0 0 0 0 1
df[colSums(df) != 0]
# V4 V5 V6 V7 V8 V10
# 1 0 1 0 0 0 0
# 2 0 0 1 0 1 0
# 3 0 0 0 0 1 0
# 4 0 0 0 1 0 0
# 5 1 0 0 0 0 1
But you might not want to remove all columns which sum to 0, because that could be true even if not all elements are 0. Take V4 in the data frame below as an example.
df$V4[1] <- -1
df
# V1 V2 V3 V4 V5 V6 V7 V8 V9 V10
# 1 0 0 0 -1 1 0 0 0 0 0
# 2 0 0 0 0 0 1 0 1 0 0
# 3 0 0 0 0 0 0 0 1 0 0
# 4 0 0 0 0 0 0 1 0 0 0
# 5 0 0 0 1 0 0 0 0 0 1
So if you want to only remove columns where all elements are 0, you can do
df[colSums(df == 0) < nrow(df)]
# V4 V5 V6 V7 V8 V10
# 1 -1 1 0 0 0 0
# 2 0 0 1 0 1 0
# 3 0 0 0 0 1 0
# 4 0 0 0 1 0 0
# 5 1 0 0 0 0 1
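For a data frame closer to the one in the question (long sequence-like column names, sample IDs as row names), a hedged sketch of the same "drop all-zero columns" idea could look like this. The toy data, the column names and the choice to ignore NA values via na.rm = TRUE are assumptions made for illustration only.

```r
# Toy counts table; the column and row names are made up for illustration.
counts <- data.frame(
  AATCGG = c(0, 3, 0),
  GGCTAA = c(0, 0, 0),
  TTAGCC = c(1, NA, 0),
  row.names = c("sampleA", "sampleB", "sampleC")
)

# A column is kept if it has at least one non-zero entry.
# Assumption: NA entries are ignored rather than counted as non-zero.
keep <- colSums(counts != 0, na.rm = TRUE) > 0

# drop = FALSE keeps a data frame even if only one column survives.
filtered <- counts[, keep, drop = FALSE]
names(filtered)
# [1] "AATCGG" "TTAGCC"
```

The row names are carried over unchanged, and the length of the column names does not matter because selection is done with a logical vector.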
Welcome to SO. Here is a tidyverse approach:
library(tidyverse)
mtcars %>%
select_if(is.numeric) %>%
select_if(~ sum(.x) > 0)
I have two groups linked by a connectivity matrix like the following:
#
# X1 X2 X3 X4 X5 X6
# 1 0 0 0 0 0 V1
# 1 1 1 0 0 0 V2
# 0 1 0 0 0 0 V3
# 0 0 1 0 0 0 V4
# 0 0 0 1 0 0 V5
# 0 0 0 1 0 0 V6
# 0 0 0 0 1 0 V7
# 0 0 0 0 1 1 V8
# 0 0 0 0 1 0 V9
# 0 0 0 0 0 1 V10
#
So X1 is linked to V1 and V2 while V2 is linked to X1, X2 and X3 and so on. I need to find a way (algorithm or command) for getting all the biggest independent subsets of the matrix. So, in this case:
# X1 X2 X3
# 1 0 0 V1
# 1 1 1 V2
# 0 1 0 V3
# 0 0 1 V4
and:
# X4
# 1 V5
# 1 V6
and:
# X5 X6
# 1 0 V7
# 1 1 V8
# 1 0 V9
# 0 1 V10
Do you have any hint? I guess there's already some library or function to use either from graph analysis or linear algebra.
As you hinted we can do this with igraph:
# dummy data
df1 <- read.table(text = " X1 X2 X3 X4 X5 X6
V1 1 0 0 0 0 0
V2 1 1 1 0 0 0
V3 0 1 0 0 0 0
V4 0 0 1 0 0 0
V5 0 0 0 1 0 0
V6 0 0 0 1 0 0
V7 0 0 0 0 1 0
V8 0 0 0 0 1 1
V9 0 0 0 0 1 0
V10 0 0 0 0 0 1
")
library(dplyr)
library(tidyr)
library(igraph)
# make graph object
gg <-
df1 %>%
add_rownames(var = "V") %>%
gather(X, value, -V) %>%
filter(value == 1) %>%
graph.data.frame
# split based on clusters of graph
lapply(
sapply(split(clusters(gg)$membership,
clusters(gg)$membership), names),
function(i)
df1[intersect(rownames(df1), i),
intersect(colnames(df1), i),
drop = FALSE])
# $`1`
# X1 X2 X3
# V1 1 0 0
# V2 1 1 1
# V3 0 1 0
# V4 0 0 1
#
# $`2`
# X4
# V5 1
# V6 1
#
# $`3`
# X5 X6
# V7 1 0
# V8 1 1
# V9 1 0
# V10 0 1
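If igraph is not available, the same blocks can be found in base R. The sketch below is not the igraph method above but a simple label-propagation loop swapped in for illustration: every row starts with its own label, and rows that share a column repeatedly take the smallest label among them until nothing changes. The names row_label and blocks are introduced here and are not part of the original answer.

```r
# Connectivity matrix from the question (rows V1..V10, columns X1..X6).
m <- matrix(c(1,0,0,0,0,0,
              1,1,1,0,0,0,
              0,1,0,0,0,0,
              0,0,1,0,0,0,
              0,0,0,1,0,0,
              0,0,0,1,0,0,
              0,0,0,0,1,0,
              0,0,0,0,1,1,
              0,0,0,0,1,0,
              0,0,0,0,0,1),
            nrow = 10, byrow = TRUE,
            dimnames = list(paste0("V", 1:10), paste0("X", 1:6)))

# Each row starts in its own group.
row_label <- seq_len(nrow(m))
repeat {
  old <- row_label
  for (j in seq_len(ncol(m))) {
    rows <- which(m[, j] == 1)
    # Rows linked through column j must share one label: use the smallest.
    if (length(rows) > 0) row_label[rows] <- min(row_label[rows])
  }
  if (identical(old, row_label)) break   # stable, so components are found
}

# For each group of rows, keep only the columns those rows actually use.
blocks <- lapply(split(rownames(m), row_label), function(r) {
  m[r, colSums(m[r, , drop = FALSE]) > 0, drop = FALSE]
})
lengths(lapply(blocks, rownames))
#  1  5  7 
#  4  2  4 
```

The list names are just the smallest row index in each block; a row with no links at all would end up as its own block with zero columns.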
I have a 10x1 matrix of character (say e212m).
> print(e212m)
[,1]
[1,] "0000000000000111111000000000"
[2,] "0000000000000111111100000000"
[3,] "0000000000001111111100000000"
[4,] "0000000000001111111100000000"
[5,] "0000000000011100111100000000"
[6,] "0000000000011111111100000000"
[7,] "0000000000011111111100000000"
[8,] "0000000000011111111100000000"
[9,] "0000000000001111111000000000"
[10,] "0000000000000011111000000000"
> dim(e212m)
[1] 10 1
> typeof(e212m)
[1] "character"
I want to convert each character of every row into an integer, but not by turning the whole string into one number, like
"0000000000000111111000000000" (string/character) to integer = 0000000000000111111000000000
I want each character to become its own digit, e.g.
"0" "0" "1" "1" to the numbers 0 0 1 1,
so that in the end I get an integer matrix of 10x28.
P.S: I am new to R. Direct commands doing the above task are welcome.
x<-"0000000000000111111000000000"
y<-as.numeric(strsplit(x,split='')[[1]])
will return
y
[1] 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0
If your matrix is named m, just use:
m2<-apply(m,1,function(x){as.numeric(strsplit(x,split='')[[1]])})
m2<-t(m2)
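A variation on the same idea, sketched here under the assumption that every string has the same length: vapply() checks that each row splits into exactly `width` digits and produces an integer (not double) matrix. The small matrix m below is made-up sample data, and width is a name introduced for this sketch.

```r
# Made-up 3x1 character matrix standing in for the real one.
m <- matrix(c("0011", "1100", "0101"), ncol = 1)

# Assumption: all strings are as long as the first one.
width <- nchar(m[1, 1])

# vapply() errors if any row does not yield exactly `width` integers.
# It returns one column per input string, so transpose to get one row each.
m2 <- t(vapply(strsplit(m[, 1], split = ""), as.integer, integer(width)))
dim(m2)
# [1] 3 4
typeof(m2)
# [1] "integer"
```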
x <- c("0000000000000111111000000000", "0000000000000111111100000000", "0000000000001111111100000000")
y <- paste(x, collapse = "\n")
read.fwf(textConnection(y), rep(1, nchar(x[1])))
# V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14 V15 V16 V17 V18 V19 V20 V21 V22 V23 V24 V25 V26 V27 V28
#1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0
#2 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0
#3 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0
Try using a regular expression to insert a space after (or between) the digits:
gsub('(\\d)','\\1 ',x)
or
gsub('(?<=\\d)(\\d)',' \\1',x,perl=T)
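The gsub() calls only return space-separated strings, so one more step is needed to get numbers. A minimal sketch, assuming the spaced strings are then parsed with read.table(text = ...); the two-element x here is made-up sample data.

```r
x <- c("0011", "1100")

# Insert a space after every digit: "0 0 1 1 " and "1 1 0 0 ".
spaced <- gsub("(\\d)", "\\1 ", x)

# Each string is parsed as one line of whitespace-separated fields.
m <- as.matrix(read.table(text = spaced))
dim(m)
# [1] 2 4
```

The resulting matrix carries the default column names V1, V2, and so on; unname(m) removes them if they are not wanted.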
I want to calculate correlation of V2 with V3, V4, ..., V18:
That is cor(V2,V3, na.rm = TRUE), cor(V2, V4, na.rm =TRUE), etc
What is the most effective way to do this?
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14 V15 V16 V17 V18
1 141_21311223 2.000 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
2 44_33331123 2.000 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
3 247_11131211 2.065 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
4 33_31122113 2.080 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
5 277_21212111 2.090 0 0 0 0 0 1 1 1 1 0 0 0 0 0 0 0
Converting my comment to an answer, one simple approach would be to use the column positions in a sapply statement:
sapply(3:ncol(mydf), function(y) cor(mydf[, 2], mydf[, y]))
This should create a vector of the output values. Change sapply to lapply if you prefer a list as the output.
Note that cor() has no na.rm argument; missing values are handled through its use argument instead, e.g. use = "complete.obs".
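As a hedged illustration of handling missing values here: cor() takes a use argument rather than na.rm. The data frame below is invented for the example (only its shape mirrors the question), and use = "complete.obs" is one possible choice among the options documented in ?cor.

```r
# Invented data: an ID column, V2, and two indicator columns (one with an NA).
mydf <- data.frame(
  V1 = c("a", "b", "c", "d", "e"),
  V2 = c(2.000, 2.000, 2.065, 2.080, 2.090),
  V3 = c(1, 0, 0, NA, 1),
  V4 = c(0, 0, 1, 1, 1)
)

# Correlate V2 with every column from the third onwards,
# dropping incomplete pairs for each correlation.
res <- sapply(mydf[-(1:2)], function(col) {
  cor(mydf$V2, col, use = "complete.obs")
})
names(res)
# [1] "V3" "V4"
```

Columns that are constant (for example all zeros, as several are in the sample shown in the question) have zero standard deviation, so cor() returns NA for them with a warning.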
I'm currently trying to get used to the data.table package in R. I want to get the index of the last 1 occurring in each row of a data table, say a, and add that as a new column to a. My code for this is the following:
a = data.table(matrix(sample(c(0,1),500,rep=T),50,10))
a[,ind:=apply(a==1,1,function(x) max(which(x)))]
Nevertheless, I think this can be written more concisely using more data.table syntax. Therefore, my question is: how can I do this without the apply function within the j component of [?
Great question. Yes, the apply by row isn't page efficient: which() will allocate for each and every row, and a==1 creates a new logical matrix as large as a.
In data.table we do things by column. Sometimes, it's data.table-ish to use a for loop through columns (never for loop through rows) :
a[,ans:=0L]
for (i in 1:10) set(a,which(a[[i]]==1),"ans",i)
identical(a$ind, a$ans)
# [1] TRUE
As you can see, it's a completely different style. But I think this approach:
is page efficient, i.e., it runs by columns, which are contiguous in memory;
needs working memory only as large as one column rather than the whole of a;
calls which() (a non-primitive, vectorized function) 10 times rather than nrow(a) times.
I didn't do any speed tests, though, so I might have to eat my words.
See ?set.
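The column-wise overwrite idea does not depend on data.table itself. As a rough sketch on a plain base-R matrix (the small matrix m and the name ans are made up here, and this is not the set() call from the answer): loop over columns left to right, and let every row that has a 1 in column j record j, so later columns overwrite earlier ones.

```r
# Made-up 0/1 matrix; the last row has no 1 at all.
m <- rbind(c(0, 0, 1, 0, 1),
           c(1, 0, 0, 0, 0),
           c(0, 1, 1, 0, 0),
           c(0, 0, 0, 0, 0))

# Start at 0L, meaning "no 1 seen yet".
ans <- integer(nrow(m))

# One vectorized assignment per column; the right-most 1 wins.
for (j in seq_len(ncol(m))) ans[m[, j] == 1] <- j
ans
# [1] 5 1 3 0
```

Rows without any 1 simply keep the initial 0, whereas max(which(x)) on such a row returns -Inf with a warning.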
In response to comment, to inspect how it works, set happens to return a pointer to the data.table, so we can look at the first few rows as it progresses.
a[,ans:=0L] # add column by reference, initialized with 0L
> head(a)
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 ans
1: 0 0 1 0 1 1 0 1 1 1 0
2: 0 0 1 1 0 0 1 1 1 1 0
3: 0 1 0 0 0 1 1 1 1 0 0
4: 0 0 0 1 1 0 1 1 1 1 0
5: 1 1 1 1 0 0 0 0 0 1 0
6: 1 1 0 1 1 0 1 0 1 1 0
Now hopefully the following reveals how it's working :
> for (i in 1:10) print(head(set(a,which(a[[i]]==1),"ans",i)))
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 ans
1: 0 0 1 0 1 1 0 1 1 1 0
2: 0 0 1 1 0 0 1 1 1 1 0
3: 0 1 0 0 0 1 1 1 1 0 0
4: 0 0 0 1 1 0 1 1 1 1 0
5: 1 1 1 1 0 0 0 0 0 1 1
6: 1 1 0 1 1 0 1 0 1 1 1
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 ans
1: 0 0 1 0 1 1 0 1 1 1 0
2: 0 0 1 1 0 0 1 1 1 1 0
3: 0 1 0 0 0 1 1 1 1 0 2
4: 0 0 0 1 1 0 1 1 1 1 0
5: 1 1 1 1 0 0 0 0 0 1 2
6: 1 1 0 1 1 0 1 0 1 1 2
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 ans
1: 0 0 1 0 1 1 0 1 1 1 3
2: 0 0 1 1 0 0 1 1 1 1 3
3: 0 1 0 0 0 1 1 1 1 0 2
4: 0 0 0 1 1 0 1 1 1 1 0
5: 1 1 1 1 0 0 0 0 0 1 3
6: 1 1 0 1 1 0 1 0 1 1 2
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 ans
1: 0 0 1 0 1 1 0 1 1 1 3
2: 0 0 1 1 0 0 1 1 1 1 4
3: 0 1 0 0 0 1 1 1 1 0 2
4: 0 0 0 1 1 0 1 1 1 1 4
5: 1 1 1 1 0 0 0 0 0 1 4
6: 1 1 0 1 1 0 1 0 1 1 4
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 ans
1: 0 0 1 0 1 1 0 1 1 1 5
2: 0 0 1 1 0 0 1 1 1 1 4
3: 0 1 0 0 0 1 1 1 1 0 2
4: 0 0 0 1 1 0 1 1 1 1 5
5: 1 1 1 1 0 0 0 0 0 1 4
6: 1 1 0 1 1 0 1 0 1 1 5
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 ans
1: 0 0 1 0 1 1 0 1 1 1 6
2: 0 0 1 1 0 0 1 1 1 1 4
3: 0 1 0 0 0 1 1 1 1 0 6
4: 0 0 0 1 1 0 1 1 1 1 5
5: 1 1 1 1 0 0 0 0 0 1 4
6: 1 1 0 1 1 0 1 0 1 1 5
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 ans
1: 0 0 1 0 1 1 0 1 1 1 6
2: 0 0 1 1 0 0 1 1 1 1 7
3: 0 1 0 0 0 1 1 1 1 0 7
4: 0 0 0 1 1 0 1 1 1 1 7
5: 1 1 1 1 0 0 0 0 0 1 4
6: 1 1 0 1 1 0 1 0 1 1 7
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 ans
1: 0 0 1 0 1 1 0 1 1 1 8
2: 0 0 1 1 0 0 1 1 1 1 8
3: 0 1 0 0 0 1 1 1 1 0 8
4: 0 0 0 1 1 0 1 1 1 1 8
5: 1 1 1 1 0 0 0 0 0 1 4
6: 1 1 0 1 1 0 1 0 1 1 7
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 ans
1: 0 0 1 0 1 1 0 1 1 1 9
2: 0 0 1 1 0 0 1 1 1 1 9
3: 0 1 0 0 0 1 1 1 1 0 9
4: 0 0 0 1 1 0 1 1 1 1 9
5: 1 1 1 1 0 0 0 0 0 1 4
6: 1 1 0 1 1 0 1 0 1 1 9
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 ans
1: 0 0 1 0 1 1 0 1 1 1 10
2: 0 0 1 1 0 0 1 1 1 1 10
3: 0 1 0 0 0 1 1 1 1 0 9
4: 0 0 0 1 1 0 1 1 1 1 10
5: 1 1 1 1 0 0 0 0 0 1 10
6: 1 1 0 1 1 0 1 0 1 1 10