creating a DTM from a 3 column CSV file with r - r

I have that csv file, containing 600k lines and 3 rows, first one containing a disease name, second one a gene, a third one a number something like that: i have roughly 4k disease and 16k genes so sometimes the disease names and genes names are redudant.
cholera xx45 12
Cancer xx65 1
cholera xx65 0
i would like to make a DTM matrix using R, i've been trying to use the Corpus command from the tm library but corpus doesn't reduce the amount of disease and size's 600k ish, i'd love to understand how to transform that file into a DTM.
I'm sorry for not being that precise, totally starting with computer science things as a bio guy :)
Cheers!

If you're not concerned with the number in the third column, then you can accomplish what I think you're trying to do using only the first two columns (gene and disease).
Example with some simulated data:
library(data.table)
# Create a table with 10k combinations of ~6k different genes and 40 different diseases
df <- data.frame(gene=sapply(1:10000, function(x) paste(c(sample(LETTERS, size=2), sample(10, size=1)), collapse="")), disease=sample(40, size=100000, replace=TRUE))
table(df) creates a large matrix, nGenes rows long and nDiseases columns wide. Looking at just the first 10 rows (because it's so large and sparse).
head(table(df))
disease
gene 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27
AB10 0 0 1 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0
AB2 1 1 0 0 0 0 1 0 0 0 0 0 0 0 2 0 0 2 0 0 0 0 1 0 1 0 1
AB3 0 1 0 0 2 1 1 0 0 1 0 0 0 0 0 2 1 0 0 1 0 0 1 0 3 0 1
AB4 0 0 1 0 0 1 0 2 1 1 0 1 0 0 1 1 1 1 0 1 0 2 0 0 0 1 1
AB5 0 1 0 1 0 0 2 2 0 1 1 1 0 1 0 0 2 0 0 0 0 0 0 1 1 1 0
AB6 0 0 2 0 2 1 0 0 0 0 0 0 0 0 0 0 1 0 1 1 0 1 0 0 0 0 0
disease
gene 28 29 30 31 32 33 34 35 36 37 38 39 40
AB10 0 0 1 2 1 0 0 1 0 0 0 0 0
AB2 0 0 0 0 0 0 0 0 0 0 0 0 0
AB3 0 0 1 1 1 0 0 0 0 0 1 1 0
AB4 0 0 1 2 1 1 1 1 1 2 0 3 1
AB5 0 2 1 1 0 0 3 4 0 1 1 0 2
AB6 0 0 0 0 0 0 0 1 0 0 0 0 0
Alternatively, you can exclude the counts of 0 and only include combinations that actually exist. Easy aggregation can be done with data.table, e.g. (continuing from the above example)
library(data.table)
dt <- data.table(df)
dt[, .N, by=list(gene, disease)]
which gives a frequency table like the following:
gene disease N
1: HA5 20 2
2: RF9 10 3
3: SD8 40 2
4: JA7 35 4
5: MJ2 1 2
---
75872: FR10 26 1
75873: IC5 40 1
75874: IU2 20 1
75875: IG5 13 1
75876: DW7 21 1

Related

create a loop to get samples in grouped data which meet a condition

I have a dataframe where data are grouped by ID. I need to know how many cells are the 10% of each group in order to select this number in a sample, but this sample should select the cells which EP is 1.
I've tried to do a nested For loop: one For to know the quantity of cells which are the 10% for each group and the bigger one to sample this number meeting the condition EP==1
x <- data.frame("ID"=rep(1:2, each=10),"EP" = rep(0:1, times=10))
x
ID EP
1 1 0
2 1 1
3 1 0
4 1 1
5 1 0
6 1 1
7 1 0
8 1 1
9 1 0
10 1 1
11 2 0
12 2 1
13 2 0
14 2 1
15 2 0
16 2 1
17 2 0
18 2 1
19 2 0
20 2 1
for(j in 1:1000){
for (i in 1:nrow(x)){
d <- x[x$ID==i,]
npix <- 10*nrow(d)/100
}
r <- sample(d[d$EP==1,],npix)
print(r)
}
data frame with 0 columns and 0 rows
data frame with 0 columns and 0 rows
data frame with 0 columns and 0 rows
.
.
.
until 1000
I would want to get this dataframe, where each sample is in a new column in x, and the cell sampled has "1":
ID EP s1 s2....s1000
1 1 0 0 0 ....
2 1 1 0 1
3 1 0 0 0
4 1 1 0 0
5 1 0 0 0
6 1 1 0 0
7 1 0 0 0
8 1 1 0 0
9 1 0 0 0
10 1 1 1 0
11 2 0 0 0
12 2 1 0 0
13 2 0 0 0
14 2 1 0 1
15 2 0 0 0
16 2 1 0 0
17 2 0 0 0
18 2 1 1 0
19 2 0 0 0
20 2 1 0 0
see that each 1 in S1 and s2 are the sampled cells and correspond to 10% of cells in each group (1, 2) which meet the condition EP==1
you can try
set.seed(1231)
x <- data.frame("ID"=rep(1:2, each=10),"EP" = rep(0:1, times=10))
library(tidyverse)
x %>%
group_by(ID) %>%
mutate(index= ifelse(EP==1, 1:n(),0)) %>%
mutate(s1 = ifelse(index %in% sample(index[index!=0], n()*0.1), 1, 0)) %>%
mutate(s2 = ifelse(index %in% sample(index[index!=0], n()*0.1), 1, 0))
# A tibble: 20 x 5
# Groups: ID [2]
ID EP index s1 s2
<int> <int> <dbl> <dbl> <dbl>
1 1 0 0 0 0
2 1 1 2 0 0
3 1 0 0 0 0
4 1 1 4 0 0
5 1 0 0 0 0
6 1 1 6 1 1
7 1 0 0 0 0
8 1 1 8 0 0
9 1 0 0 0 0
10 1 1 10 0 0
11 2 0 0 0 0
12 2 1 2 0 0
13 2 0 0 0 0
14 2 1 4 0 1
15 2 0 0 0 0
16 2 1 6 0 0
17 2 0 0 0 0
18 2 1 8 0 0
19 2 0 0 0 0
20 2 1 10 1 0
We can write a function which gives us 1's which are 10% for each ID and place it where EP = 1.
library(dplyr)
rep_func <- function() {
x %>%
group_by(ID) %>%
mutate(s1 = 0,
s1 = replace(s1, sample(which(EP == 1), floor(0.1 * n())), 1)) %>%
pull(s1)
}
then use replicate to repeat it for n times
n <- 5
x[paste0("s", seq_len(n))] <- replicate(n, rep_func())
x
# ID EP s1 s2 s3 s4 s5
#1 1 0 0 0 0 0 0
#2 1 1 0 0 0 0 0
#3 1 0 0 0 0 0 0
#4 1 1 0 0 0 0 0
#5 1 0 0 0 0 0 0
#6 1 1 1 0 0 1 0
#7 1 0 0 0 0 0 0
#8 1 1 0 1 0 0 0
#9 1 0 0 0 0 0 0
#10 1 1 0 0 1 0 1
#11 2 0 0 0 0 0 0
#12 2 1 0 0 1 0 0
#13 2 0 0 0 0 0 0
#14 2 1 1 1 0 0 0
#15 2 0 0 0 0 0 0
#16 2 1 0 0 0 0 1
#17 2 0 0 0 0 0 0
#18 2 1 0 0 0 1 0
#19 2 0 0 0 0 0 0
#20 2 1 0 0 0 0 0

R: how do split a delimited columns(state) into columns with a binary 1,0

Appreciate your help. Need to split a column filled with delimited values into columns named after its delimited values and each of these new columns are to be filled with either 1 or 0 where values are found or not.
state <-
c('ACT',
'ACT|NSW|NT|QLD|SA|VIC',
'ACT|NSW|NT|QLD|TAS|VIC|WA',
'ACT|NSW|NT|SA|TAS|VIC',
'ACT|NSW|QLD|VIC',
'ACT|NSW|SA',
'ACT|NSW|NT|QLD|TAS|VIC|WA|SA',
'NSW',
'NT',
'NT|SA',
'QLD',
'SA',
'TAS',
'VIC',
'WA')
df <- data.frame(id = 1:length(state),state)
id state
1 1 ACT
2 2 ACT|NSW|NT|QLD|SA|VIC
3 3 ACT|NSW|NT|QLD|TAS|VIC|WA
4 4 ACT|NSW|NT|SA|TAS|VIC
...
Desired state is a dataframe with the same dimensions plus the additional columns based on state populated with a 1 or 0 depending on the rows.
tq,
James
You can do something like this:
library(tidyr)
library(dplyr)
df %>%
separate_rows(state) %>%
unique() %>% # in case you have duplicated states for a single id
mutate(exist = 1) %>%
spread(state, exist, fill=0)
# id ACT NSW NT QLD SA TAS VIC WA
#1 1 1 0 0 0 0 0 0 0
#2 2 1 1 1 1 1 0 1 0
#3 3 1 1 1 1 0 1 1 1
#4 4 1 1 1 0 1 1 1 0
#5 5 1 1 0 1 0 0 1 0
#6 6 1 1 0 0 1 0 0 0
#7 7 1 1 1 1 1 1 1 1
#8 8 0 1 0 0 0 0 0 0
#9 9 0 0 1 0 0 0 0 0
#10 10 0 0 1 0 1 0 0 0
#11 11 0 0 0 1 0 0 0 0
#12 12 0 0 0 0 1 0 0 0
#13 13 0 0 0 0 0 1 0 0
#14 14 0 0 0 0 0 0 1 0
#15 15 0 0 0 0 0 0 0 1
separate_rows split state and convert the data frame to long format;
add a constant value column for reshaping purpose;
use spread to transform the result to wide format;
Here is a base R option to split the 'state' column by |, convert the list of vectors into a two column data.frame (stack), get the frequency with table and cbind with the first column of 'df'
cbind(df[1], as.data.frame.matrix(table(stack(setNames(strsplit(as.character(df$state),
"[|]"), df$id))[2:1])))
# id ACT NSW NT QLD SA TAS VIC WA
#1 1 1 0 0 0 0 0 0 0
#2 2 1 1 1 1 1 0 1 0
#3 3 1 1 1 1 0 1 1 1
#4 4 1 1 1 0 1 1 1 0
#5 5 1 1 0 1 0 0 1 0
#6 6 1 1 0 0 1 0 0 0
#7 7 1 1 1 1 1 1 1 1
#8 8 0 1 0 0 0 0 0 0
#9 9 0 0 1 0 0 0 0 0
#10 10 0 0 1 0 1 0 0 0
#11 11 0 0 0 1 0 0 0 0
#12 12 0 0 0 0 1 0 0 0
#13 13 0 0 0 0 0 1 0 0
#14 14 0 0 0 0 0 0 1 0
#15 15 0 0 0 0 0 0 0 1

Re-expanding a compressed dataframe to include zero values in missing rows

Given a dataset in the following form:
> Test
Pos Watson Crick Total
1 39023 0 0 0
2 39024 0 0 0
3 39025 0 0 0
4 39026 2 1 3
5 39027 0 0 0
6 39028 0 4 4
7 39029 0 0 0
8 39030 0 1 1
9 39031 0 0 0
10 39032 0 0 0
11 39033 0 0 0
12 39034 1 0 1
13 39035 0 0 0
14 39036 0 0 0
15 39037 3 0 3
16 39038 2 0 2
17 39039 0 0 0
18 39040 0 1 1
19 39041 0 0 0
20 39042 0 0 0
21 39043 0 0 0
22 39044 0 0 0
23 39045 0 0 0
I can compress these data to remove zero rows with the following code:
a=subset(Test, Total!=0)
> a
Pos Watson Crick Total
4 39026 2 1 3
6 39028 0 4 4
8 39030 0 1 1
12 39034 1 0 1
15 39037 3 0 3
16 39038 2 0 2
18 39040 0 1 1
How would I code the reverse transformation? i.e. To convert dataframe a back into the original form of Test.
More specifically: without any access to the original data, how would I re-expand the data (to include all sequential "Pos" rows) for an arbitrary range of Pos?
Here, the ID column is irrelevant. In a real example, the ID numbers are just row numbers created by R. In a real example, the compressed dataset will have sequential ID numbers.
Here's another possibility, using base R. Unless you explicitly provide the initial and the final value of Pos, the first and the last index value in the restored dataframe will correspond to the values given in the "compressed" dataframe a:
restored <- data.frame(Pos=(a$Pos[1]:a$Pos[nrow(a)])) # change range if required
restored <- merge(restored,a, all=TRUE)
restored[is.na(restored)] <- 0
#> restored
# Pos Watson Crick Total
#1 39026 2 1 3
#2 39027 0 0 0
#3 39028 0 4 4
#4 39029 0 0 0
#5 39030 0 1 1
#6 39031 0 0 0
#7 39032 0 0 0
#8 39033 0 0 0
#9 39034 1 0 1
#10 39035 0 0 0
#11 39036 0 0 0
#12 39037 3 0 3
#13 39038 2 0 2
#14 39039 0 0 0
#15 39040 0 1 1
Possibly the last step can be combined with the merge function by using the na.action option correctly, but I didn't find out how.
You need to know at least the Pos values you want to fill in. Then, it's a combination of join and mutate operations in dplyr.
Test <- read.table(text = "
Pos Watson Crick Total
1 39023 0 0 0
2 39024 0 0 0
3 39025 0 0 0
4 39026 2 1 3
5 39027 0 0 0
6 39028 0 4 4
7 39029 0 0 0
8 39030 0 1 1
9 39031 0 0 0
10 39032 0 0 0
11 39033 0 0 0
12 39034 1 0 1
13 39035 0 0 0
14 39036 0 0 0
15 39037 3 0 3
16 39038 2 0 2
17 39039 0 0 0
18 39040 0 1 1
19 39041 0 0 0
20 39042 0 0 0
21 39043 0 0 0
22 39044 0 0 0")
library(dplyr)
Nonzero <- Test %>% filter(Total > 0)
All_Pos <- Test %>% select(Pos)
Reconstruct <-
All_Pos %>%
left_join(Nonzero) %>%
mutate_each(funs(ifelse(is.na(.), 0, .)), Watson, Crick, Total)
In my code, All_Pos contains all valid positions as a one-column data frame; the mutate_each() call converts NA values to zeros. If you only know the largest MaxPos, you can construct it using
All_Pos <- data.frame(seq_len(MaxPos))

How do I create a DTM-like text matrix from a list of text blocks?

I have been using the textmatrix() function for a while to create DTMs which I can further use for LSI.
dirLSA<-function(dir){
dtm<-textmatrix(dir)
return(lsa(dtm))
}
textdir<-"C:/RProjects/docs"
dirLSA(textdir)
> tm
$matrix
D1 D2 D3 D4 D5 D6 D7 D8 D9
1. 000 2 0 0 0 0 0 0 0 0
2. 20 1 0 0 1 0 0 1 0 0
3. 200 1 0 0 0 0 0 0 0 0
4. 2014 1 0 0 0 0 0 0 0 0
5. 2015 1 0 0 0 0 0 0 0 0
6. 27 1 0 0 0 0 0 0 1 0
7. 30 1 0 0 0 1 0 1 0 0
8. 31 1 0 2 0 0 0 0 0 0
9. 40 1 0 0 0 0 0 0 0 0
10. 45 1 0 0 0 0 0 0 0 0
11. 500 1 0 0 0 0 0 1 0 0
12. 600 1 0 0 0 0 0 0 0 0
728. bias 0 0 0 2 0 0 0 0 0
729. biased 0 0 0 1 0 0 0 0 0
730. called 0 0 0 1 0 0 0 0 0
731. calm 0 0 0 1 0 0 0 0 0
732. cause 0 0 0 1 0 0 0 0 0
733. chauhan 0 0 0 2 0 0 0 0 0
734. chief 0 0 0 8 0 0 1 0 0
Textmatrix() is a function which takes a directory(folder path) and returns a document-wise term frequency. This is used in further analysis like Latent Semantic Indexing/Allocation(LSI/LSA)
However, a new problem that came across me is that if I have tweet data in batch files (~500000 tweets/batch) and I want to carry out similar operations on this data.
I have code modules to clean up my data, and I want to pass the cleaned tweets directly to the LSI function. The problem I face is that the textmatrix() does not support it.
I tried looking at other packages and code snippets, but that didn't get me any further. Is there any way I can create a line-term matrix of sorts?
I tried sending table(tokenize(cleanline[i])) into a loop, but it wont add new columns for words not already there in the matrix. Any workaround?
Update: I just tried this:
a<-table(tokenize(cleanline[10]))
b<-table(tokenize(cleanline[12]))
df1<-data.frame(a)
df1
df2<-data.frame(b)
df2
merge(df1,df2, all=TRUE)
I got this:
> df1
Var1 Freq
1 6
2 " 2
3 and 1
4 home 1
5 mabe 1
6 School 1
7 then 1
8 xbox 1
> b<-table(tokenize(cleanline[12]))
> df2<-data.frame(b)
> df2
Var1 Freq
1 13
2 " 2
3 BillGates 1
4 Come 1
5 help 1
6 Mac 1
7 make 1
8 Microsoft 1
9 please 1
10 Project 1
11 really 1
12 version 1
13 wish 1
14 would 1
> merge(df1,df2)
Var1 Freq
1 " 2
> merge(df1,df2, all=TRUE)
Var1 Freq
1 6
2 13
3 " 2
4 and 1
5 home 1
6 mabe 1
7 School 1
8 then 1
9 xbox 1
10 BillGates 1
11 Come 1
12 help 1
13 Mac 1
14 make 1
15 Microsoft 1
16 please 1
17 Project 1
18 really 1
19 version 1
20 wish 1
21 would 1
I think I'm close.
Try something like this
ll <- list(df1,df2)
dtm <- xtabs(Freq ~ ., data = do.call("rbind", ll))
Something that works for me:
textLSA<-function(text){
a<-data.frame(table(tokenize(text[1])))
colnames(a)[2]<-paste(c("Line",1),collapse=' ')
df<-a
for(i in 1:length(text)){
a<-data.frame(table(tokenize(text[i])))
colnames(a)[2]<-paste(c("Line",i),collapse=' ')
df<-merge(df,a, all=TRUE)
}
df[is.na(df)]<-0
dtm<-as.matrix(df[,-1])
rownames(dtm)<-df$Var1
return(lsa(dtm))
}
What do you think of this code?

How to sum leading diagonal of table in R

I have a table created using the table() command in R:
y
x 0 1 2 3 4 5 6 7 8 9
0 23 0 0 0 0 1 0 0 0 0
1 0 23 1 0 1 0 1 2 0 2
2 1 1 28 0 0 0 1 0 2 2
3 0 1 0 24 0 1 0 0 0 1
4 1 1 0 0 34 0 3 0 0 0
5 0 0 0 0 0 33 0 0 0 0
6 0 0 0 0 0 2 32 0 0 0
7 0 1 0 1 0 0 0 36 0 1
8 1 1 1 1 0 0 0 1 20 1
9 1 3 0 1 0 1 0 1 0 24
This table shows the results of a classification, and I want to sum the leading diagonal of it (the diagonal with the large numbers - like 23, 23, 28 etc). Is there a sensible/easy way to do this in R?
How about sum(diag(tbl)), where tbl is your table?

Resources