For instance, if I have a text file (mytext.txt) with the following text:
Table1
13 3 20 0 0 0 0
3 10 0 0 0 6 0
20 0 5 0 0 0 0
0 0 0 7 20 0 0
0 0 0 20 19 0 0
0 0 0 0 0 8 0
0 0 0 0 0 0 13
Table2
0
2
10
-5
3
-10
-5
Can I retrieve both of them and get two tables?
So that if I print my data table1 I would get the first table and if I print my data table 2 I would get the second table.
I know that if mytext.txt only had one table I could do something like:
table1 <- read.table("mytext.txt")
1) Assuming the input file is tables.txt, read the lines into Lines and let names.ix be the indexes of the lines containing the table names -- these lines are identified as beginning with a character that is not a minus or digit. Then create a grouping variable grp which identifies which table each line belongs to, split the lines into those groups and read each group of lines. This uses no packages and can handle any number of tables in the file.
Lines <- readLines("tables.txt")
names.ix <- grep("^[^-0-9]", Lines)
grp <- Lines[names.ix][ cumsum(seq_along(Lines) %in% names.ix) ]
Read <- function(x) read.table(text = x)
L <- lapply(split(Lines[-names.ix], grp[-names.ix]), Read)
giving:
> L
$Table1
V1 V2 V3 V4 V5 V6 V7
1 13 3 20 0 0 0 0
2 3 10 0 0 0 6 0
3 20 0 5 0 0 0 0
4 0 0 0 7 20 0 0
5 0 0 0 20 19 0 0
6 0 0 0 0 0 8 0
7 0 0 0 0 0 0 13
$Table2
V1
1 0
2 2
3 10
4 -5
5 3
6 -10
7 -5
2) By the way, if you only need the first table then this does it:
library(data.table)
fread("tables.txt")
Not directly, but you can try read.mtable from my "SOfun" package, available only on GitHub.
The approach is similar to #G.Grothendieck's approach, but packaged into a function, so you can simply do:
read.mtable("tables.txt", chunkId = "Table", header = FALSE)
# $Table1
# V1 V2 V3 V4 V5 V6 V7
# 1 13 3 20 0 0 0 0
# 2 3 10 0 0 0 6 0
# 3 20 0 5 0 0 0 0
# 4 0 0 0 7 20 0 0
# 5 0 0 0 20 19 0 0
# 6 0 0 0 0 0 8 0
# 7 0 0 0 0 0 0 13
#
# $Table2
# V1
# 1 0
# 2 2
# 3 10
# 4 -5
# 5 3
# 6 -10
# 7 -5
The chunkId parameter can also be a regular expression, like `chunkId = "[A-Za-z]+".
Related
I have that csv file, containing 600k lines and 3 rows, first one containing a disease name, second one a gene, a third one a number something like that: i have roughly 4k disease and 16k genes so sometimes the disease names and genes names are redudant.
cholera xx45 12
Cancer xx65 1
cholera xx65 0
i would like to make a DTM matrix using R, i've been trying to use the Corpus command from the tm library but corpus doesn't reduce the amount of disease and size's 600k ish, i'd love to understand how to transform that file into a DTM.
I'm sorry for not being that precise, totally starting with computer science things as a bio guy :)
Cheers!
If you're not concerned with the number in the third column, then you can accomplish what I think you're trying to do using only the first two columns (gene and disease).
Example with some simulated data:
library(data.table)
# Create a table with 10k combinations of ~6k different genes and 40 different diseases
df <- data.frame(gene=sapply(1:10000, function(x) paste(c(sample(LETTERS, size=2), sample(10, size=1)), collapse="")), disease=sample(40, size=100000, replace=TRUE))
table(df) creates a large matrix, nGenes rows long and nDiseases columns wide. Looking at just the first 10 rows (because it's so large and sparse).
head(table(df))
disease
gene 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27
AB10 0 0 1 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0
AB2 1 1 0 0 0 0 1 0 0 0 0 0 0 0 2 0 0 2 0 0 0 0 1 0 1 0 1
AB3 0 1 0 0 2 1 1 0 0 1 0 0 0 0 0 2 1 0 0 1 0 0 1 0 3 0 1
AB4 0 0 1 0 0 1 0 2 1 1 0 1 0 0 1 1 1 1 0 1 0 2 0 0 0 1 1
AB5 0 1 0 1 0 0 2 2 0 1 1 1 0 1 0 0 2 0 0 0 0 0 0 1 1 1 0
AB6 0 0 2 0 2 1 0 0 0 0 0 0 0 0 0 0 1 0 1 1 0 1 0 0 0 0 0
disease
gene 28 29 30 31 32 33 34 35 36 37 38 39 40
AB10 0 0 1 2 1 0 0 1 0 0 0 0 0
AB2 0 0 0 0 0 0 0 0 0 0 0 0 0
AB3 0 0 1 1 1 0 0 0 0 0 1 1 0
AB4 0 0 1 2 1 1 1 1 1 2 0 3 1
AB5 0 2 1 1 0 0 3 4 0 1 1 0 2
AB6 0 0 0 0 0 0 0 1 0 0 0 0 0
Alternatively, you can exclude the counts of 0 and only include combinations that actually exist. Easy aggregation can be done with data.table, e.g. (continuing from the above example)
library(data.table)
dt <- data.table(df)
dt[, .N, by=list(gene, disease)]
which gives a frequency table like the following:
gene disease N
1: HA5 20 2
2: RF9 10 3
3: SD8 40 2
4: JA7 35 4
5: MJ2 1 2
---
75872: FR10 26 1
75873: IC5 40 1
75874: IU2 20 1
75875: IG5 13 1
75876: DW7 21 1
I have a data frame, please see below.
How do I compare the Volume where Purchase == 1 to the previous Purchase == 1 Volume and create a factor variable V1 like shown in the Picture 2?
The df[5,"V1"] == 1 because df[5,"Volume"] > df[3,"Volume"].... and so on.
How to achieve this without using loops, how do I achieve this the vectorized way so calculation speed is faster(when dealing with millions of rows)?
I've tried sub-setting, then do the comparison but when tried to put them back to a factor variable, the number of rows of the result is not the same as the number of rows of the df therefore I cannot put the factor variable to the dataframe.
Picture 2
Volume Weight Purchase V1
1 3.95670 5.27560 0 0
2 3.97110 5.29280 0 0
3 3.97200 5.29120 1 0
4 3.98640 5.31160 0 0
5 3.98880 5.31240 1 1
6 3.98700 5.31040 0 0
7 3.98370 5.31080 0 0
8 3.98580 5.31400 0 0
9 3.98670 5.31120 1 0
10 3.98460 5.29040 0 0
11 3.97710 5.28920 0 0
12 3.96720 5.26080 1 0
13 3.95190 5.26520 0 0
14 3.95160 5.26840 0 0
15 3.95340 5.26360 1 0
16 3.95370 5.23600 1 1
17 3.93450 5.23480 0 0
18 3.93480 5.23640 1 0
19 3.92760 5.23600 0 0
20 3.92820 5.22960 1 0
With data.table:
library(data.table)
data <- data.table(read.table(text=' Volume Weight Purchase V1
1 3.95670 5.27560 0 0
2 3.97110 5.29280 0 0
3 3.97200 5.29120 1 0
4 3.98640 5.31160 0 0
5 3.98880 5.31240 1 1
6 3.98700 5.31040 0 0
7 3.98370 5.31080 0 0
8 3.98580 5.31400 0 0
9 3.98670 5.31120 1 0
10 3.98460 5.29040 0 0
11 3.97710 5.28920 0 0
12 3.96720 5.26080 1 0
13 3.95190 5.26520 0 0
14 3.95160 5.26840 0 0
15 3.95340 5.26360 1 0
16 3.95370 5.23600 1 1
17 3.93450 5.23480 0 0
18 3.93480 5.23640 1 0
19 3.92760 5.23600 0 0
20 3.92820 5.22960 1 0', header=T))
data[, V1 := 0]
data[Purchase == 1, V1 := as.integer(Volume > shift(Volume)) ]
data[, V1 := as.factor(V1)]
Here, I filtered data to where Purchase = 1, then I brought previous Volume with shift function.
Finally, I compared Volume to Previous volume and assigned 1 if Volume is larger than Previous.
Given a dataset in the following form:
> Test
Pos Watson Crick Total
1 39023 0 0 0
2 39024 0 0 0
3 39025 0 0 0
4 39026 2 1 3
5 39027 0 0 0
6 39028 0 4 4
7 39029 0 0 0
8 39030 0 1 1
9 39031 0 0 0
10 39032 0 0 0
11 39033 0 0 0
12 39034 1 0 1
13 39035 0 0 0
14 39036 0 0 0
15 39037 3 0 3
16 39038 2 0 2
17 39039 0 0 0
18 39040 0 1 1
19 39041 0 0 0
20 39042 0 0 0
21 39043 0 0 0
22 39044 0 0 0
23 39045 0 0 0
I can compress these data to remove zero rows with the following code:
a=subset(Test, Total!=0)
> a
Pos Watson Crick Total
4 39026 2 1 3
6 39028 0 4 4
8 39030 0 1 1
12 39034 1 0 1
15 39037 3 0 3
16 39038 2 0 2
18 39040 0 1 1
How would I code the reverse transformation? i.e. To convert dataframe a back into the original form of Test.
More specifically: without any access to the original data, how would I re-expand the data (to include all sequential "Pos" rows) for an arbitrary range of Pos?
Here, the ID column is irrelevant. In a real example, the ID numbers are just row numbers created by R. In a real example, the compressed dataset will have sequential ID numbers.
Here's another possibility, using base R. Unless you explicitly provide the initial and the final value of Pos, the first and the last index value in the restored dataframe will correspond to the values given in the "compressed" dataframe a:
restored <- data.frame(Pos=(a$Pos[1]:a$Pos[nrow(a)])) # change range if required
restored <- merge(restored,a, all=TRUE)
restored[is.na(restored)] <- 0
#> restored
# Pos Watson Crick Total
#1 39026 2 1 3
#2 39027 0 0 0
#3 39028 0 4 4
#4 39029 0 0 0
#5 39030 0 1 1
#6 39031 0 0 0
#7 39032 0 0 0
#8 39033 0 0 0
#9 39034 1 0 1
#10 39035 0 0 0
#11 39036 0 0 0
#12 39037 3 0 3
#13 39038 2 0 2
#14 39039 0 0 0
#15 39040 0 1 1
Possibly the last step can be combined with the merge function by using the na.action option correctly, but I didn't find out how.
You need to know at least the Pos values you want to fill in. Then, it's a combination of join and mutate operations in dplyr.
Test <- read.table(text = "
Pos Watson Crick Total
1 39023 0 0 0
2 39024 0 0 0
3 39025 0 0 0
4 39026 2 1 3
5 39027 0 0 0
6 39028 0 4 4
7 39029 0 0 0
8 39030 0 1 1
9 39031 0 0 0
10 39032 0 0 0
11 39033 0 0 0
12 39034 1 0 1
13 39035 0 0 0
14 39036 0 0 0
15 39037 3 0 3
16 39038 2 0 2
17 39039 0 0 0
18 39040 0 1 1
19 39041 0 0 0
20 39042 0 0 0
21 39043 0 0 0
22 39044 0 0 0")
library(dplyr)
Nonzero <- Test %>% filter(Total > 0)
All_Pos <- Test %>% select(Pos)
Reconstruct <-
All_Pos %>%
left_join(Nonzero) %>%
mutate_each(funs(ifelse(is.na(.), 0, .)), Watson, Crick, Total)
In my code, All_Pos contains all valid positions as a one-column data frame; the mutate_each() call converts NA values to zeros. If you only know the largest MaxPos, you can construct it using
All_Pos <- data.frame(seq_len(MaxPos))
I have a data frame with four columns, let's call them V1-V4 and ten observations. Exactly one of V1-V4 is 1 for each row, and the others of V1-V4 are 0. I want to create a new column called NEWCOL that takes on the value of 3 if V3 is 1, 4 if V4 is 1, and is 0 otherwise.
I have to do this for MANY sets of variables V1-V4 so I would like the solution to be as short as possible so that it will be easy to replicate.
This does it for 4 columns to add a fifth using matrix multiplication:
> cbind( mydf, newcol=data.matrix(mydf) %*% c(0,0,3,4) )
V1 V2 V3 V4 newcol
1 1 0 0 0 0
2 1 0 0 0 0
3 0 1 0 0 0
4 0 1 0 0 0
5 0 0 1 0 3
6 0 0 1 0 3
7 0 0 0 1 4
8 0 0 0 1 4
9 0 0 0 1 4
10 0 0 0 1 4
It's generalizable to getting multiple columns.... we just need the rules. You need to make a matric with the the same number of rows as there are columns in the original data and have one column for each of the new factors needed to build each new variable. This shows how to build one new column from the sum of 3 times the third column plus 4 times the fourth, and another new column from one times the first and 2 times the second.
> cbind( mydf, newcol=data.matrix(mydf) %*% matrix(c(0,0,3,4, # first set of factors
1,2,0,0), # second set
ncol=2) )
V1 V2 V3 V4 newcol.1 newcol.2
1 1 0 0 0 0 1
2 1 0 0 0 0 1
3 0 1 0 0 0 2
4 0 1 0 0 0 2
5 0 0 1 0 3 0
6 0 0 1 0 3 0
7 0 0 0 1 4 0
8 0 0 0 1 4 0
9 0 0 0 1 4 0
10 0 0 0 1 4 0
An example data set:
mydf <- data.frame(V1 = c(1, 1, rep(0, 8)),
V2 = c(0, 0, 1, 1, rep(0, 6)),
V3 = c(rep(0, 4), 1, 1, rep(0, 4)),
V4 = c(rep(0, 6), rep(1, 4)))
# V1 V2 V3 V4
# 1 1 0 0 0
# 2 1 0 0 0
# 3 0 1 0 0
# 4 0 1 0 0
# 5 0 0 1 0
# 6 0 0 1 0
# 7 0 0 0 1
# 8 0 0 0 1
# 9 0 0 0 1
# 10 0 0 0 1
Here's an easy approach to generate the new column:
mydf <- transform(mydf, NEWCOL = V3 * 3 + V4 * 4)
# V1 V2 V3 V4 NEWCOL
# 1 1 0 0 0 0
# 2 1 0 0 0 0
# 3 0 1 0 0 0
# 4 0 1 0 0 0
# 5 0 0 1 0 3
# 6 0 0 1 0 3
# 7 0 0 0 1 4
# 8 0 0 0 1 4
# 9 0 0 0 1 4
# 10 0 0 0 1 4
I have been using the textmatrix() function for a while to create DTMs which I can further use for LSI.
dirLSA<-function(dir){
dtm<-textmatrix(dir)
return(lsa(dtm))
}
textdir<-"C:/RProjects/docs"
dirLSA(textdir)
> tm
$matrix
D1 D2 D3 D4 D5 D6 D7 D8 D9
1. 000 2 0 0 0 0 0 0 0 0
2. 20 1 0 0 1 0 0 1 0 0
3. 200 1 0 0 0 0 0 0 0 0
4. 2014 1 0 0 0 0 0 0 0 0
5. 2015 1 0 0 0 0 0 0 0 0
6. 27 1 0 0 0 0 0 0 1 0
7. 30 1 0 0 0 1 0 1 0 0
8. 31 1 0 2 0 0 0 0 0 0
9. 40 1 0 0 0 0 0 0 0 0
10. 45 1 0 0 0 0 0 0 0 0
11. 500 1 0 0 0 0 0 1 0 0
12. 600 1 0 0 0 0 0 0 0 0
728. bias 0 0 0 2 0 0 0 0 0
729. biased 0 0 0 1 0 0 0 0 0
730. called 0 0 0 1 0 0 0 0 0
731. calm 0 0 0 1 0 0 0 0 0
732. cause 0 0 0 1 0 0 0 0 0
733. chauhan 0 0 0 2 0 0 0 0 0
734. chief 0 0 0 8 0 0 1 0 0
Textmatrix() is a function which takes a directory(folder path) and returns a document-wise term frequency. This is used in further analysis like Latent Semantic Indexing/Allocation(LSI/LSA)
However, a new problem that came across me is that if I have tweet data in batch files (~500000 tweets/batch) and I want to carry out similar operations on this data.
I have code modules to clean up my data, and I want to pass the cleaned tweets directly to the LSI function. The problem I face is that the textmatrix() does not support it.
I tried looking at other packages and code snippets, but that didn't get me any further. Is there any way I can create a line-term matrix of sorts?
I tried sending table(tokenize(cleanline[i])) into a loop, but it wont add new columns for words not already there in the matrix. Any workaround?
Update: I just tried this:
a<-table(tokenize(cleanline[10]))
b<-table(tokenize(cleanline[12]))
df1<-data.frame(a)
df1
df2<-data.frame(b)
df2
merge(df1,df2, all=TRUE)
I got this:
> df1
Var1 Freq
1 6
2 " 2
3 and 1
4 home 1
5 mabe 1
6 School 1
7 then 1
8 xbox 1
> b<-table(tokenize(cleanline[12]))
> df2<-data.frame(b)
> df2
Var1 Freq
1 13
2 " 2
3 BillGates 1
4 Come 1
5 help 1
6 Mac 1
7 make 1
8 Microsoft 1
9 please 1
10 Project 1
11 really 1
12 version 1
13 wish 1
14 would 1
> merge(df1,df2)
Var1 Freq
1 " 2
> merge(df1,df2, all=TRUE)
Var1 Freq
1 6
2 13
3 " 2
4 and 1
5 home 1
6 mabe 1
7 School 1
8 then 1
9 xbox 1
10 BillGates 1
11 Come 1
12 help 1
13 Mac 1
14 make 1
15 Microsoft 1
16 please 1
17 Project 1
18 really 1
19 version 1
20 wish 1
21 would 1
I think I'm close.
Try something like this
ll <- list(df1,df2)
dtm <- xtabs(Freq ~ ., data = do.call("rbind", ll))
Something that works for me:
textLSA<-function(text){
a<-data.frame(table(tokenize(text[1])))
colnames(a)[2]<-paste(c("Line",1),collapse=' ')
df<-a
for(i in 1:length(text)){
a<-data.frame(table(tokenize(text[i])))
colnames(a)[2]<-paste(c("Line",i),collapse=' ')
df<-merge(df,a, all=TRUE)
}
df[is.na(df)]<-0
dtm<-as.matrix(df[,-1])
rownames(dtm)<-df$Var1
return(lsa(dtm))
}
What do you think of this code?