Re-expanding a compressed dataframe to include zero values in missing rows

Given a dataset in the following form:
> Test
Pos Watson Crick Total
1 39023 0 0 0
2 39024 0 0 0
3 39025 0 0 0
4 39026 2 1 3
5 39027 0 0 0
6 39028 0 4 4
7 39029 0 0 0
8 39030 0 1 1
9 39031 0 0 0
10 39032 0 0 0
11 39033 0 0 0
12 39034 1 0 1
13 39035 0 0 0
14 39036 0 0 0
15 39037 3 0 3
16 39038 2 0 2
17 39039 0 0 0
18 39040 0 1 1
19 39041 0 0 0
20 39042 0 0 0
21 39043 0 0 0
22 39044 0 0 0
23 39045 0 0 0
I can compress these data to remove zero rows with the following code:
a <- subset(Test, Total != 0)
> a
Pos Watson Crick Total
4 39026 2 1 3
6 39028 0 4 4
8 39030 0 1 1
12 39034 1 0 1
15 39037 3 0 3
16 39038 2 0 2
18 39040 0 1 1
How would I code the reverse transformation, i.e. convert dataframe a back into the original form of Test?
More specifically: without any access to the original data, how would I re-expand the data (to include all sequential "Pos" rows) for an arbitrary range of Pos?
Here, the ID column is irrelevant: the ID numbers are just row numbers created by R, and in a real example the compressed dataset would have its own sequential row numbers, so they cannot be used to re-expand.

Here's another possibility, using base R. Unless you explicitly provide the initial and final values of Pos, the first and last rows of the restored dataframe will correspond to the first and last Pos values in the "compressed" dataframe a:
restored <- data.frame(Pos=(a$Pos[1]:a$Pos[nrow(a)])) # change range if required
restored <- merge(restored,a, all=TRUE)
restored[is.na(restored)] <- 0
#> restored
# Pos Watson Crick Total
#1 39026 2 1 3
#2 39027 0 0 0
#3 39028 0 4 4
#4 39029 0 0 0
#5 39030 0 1 1
#6 39031 0 0 0
#7 39032 0 0 0
#8 39033 0 0 0
#9 39034 1 0 1
#10 39035 0 0 0
#11 39036 0 0 0
#12 39037 3 0 3
#13 39038 2 0 2
#14 39039 0 0 0
#15 39040 0 1 1
Possibly the last step can be combined with the merge function by using the na.action option correctly, but I didn't find out how.
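For completeness: tidyr can do the expansion and the zero fill in one call with complete(), where full_seq() generates every Pos value between the observed minimum and maximum (a sketch, assuming tidyr is available):
library(tidyr)
restored <- complete(a, Pos = full_seq(Pos, period = 1),
                     fill = list(Watson = 0, Crick = 0, Total = 0))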

You need to know at least the Pos values you want to fill in. Then, it's a combination of join and mutate operations in dplyr.
Test <- read.table(text = "
Pos Watson Crick Total
1 39023 0 0 0
2 39024 0 0 0
3 39025 0 0 0
4 39026 2 1 3
5 39027 0 0 0
6 39028 0 4 4
7 39029 0 0 0
8 39030 0 1 1
9 39031 0 0 0
10 39032 0 0 0
11 39033 0 0 0
12 39034 1 0 1
13 39035 0 0 0
14 39036 0 0 0
15 39037 3 0 3
16 39038 2 0 2
17 39039 0 0 0
18 39040 0 1 1
19 39041 0 0 0
20 39042 0 0 0
21 39043 0 0 0
22 39044 0 0 0")
library(dplyr)
Nonzero <- Test %>% filter(Total > 0)
All_Pos <- Test %>% select(Pos)
Reconstruct <-
All_Pos %>%
left_join(Nonzero, by = "Pos") %>%
mutate(across(c(Watson, Crick, Total), ~ ifelse(is.na(.x), 0, .x)))
In my code, All_Pos contains all valid positions as a one-column data frame; the mutate(across(...)) call converts the NA values introduced by the join to zeros (older dplyr spelled this mutate_each(funs(...)), which is now deprecated). If you only know the largest position MaxPos, you can construct All_Pos with (the column must be named Pos for the join to line up):
All_Pos <- data.frame(Pos = seq_len(MaxPos))
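A slightly more compact spelling of the NA-to-zero step uses tidyr::replace_na() (an alternative sketch, assuming tidyr is installed):
library(tidyr)
Reconstruct <-
All_Pos %>%
left_join(Nonzero, by = "Pos") %>%
replace_na(list(Watson = 0, Crick = 0, Total = 0))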


How to assign a single count to groups of adjacent identical values in R

I have a column that contains binary values indicating the presence (1) or absence (0) of an event. Based on this column I want to create a new column containing a continuous count that assigns a single count to groups of adjacent events.
event <- c(0,0,0,1,0,0,0,1,1,1,1,1,0,0,0,0,0,0,1,1,0,0)
count<- c(0,0,0,1,0,0,0,2,2,2,2,2,0,0,0,0,0,0,3,3,0,0)
df <- data.frame(event, count)
The desired count should look like this:
event count
0 0
0 0
0 0
1 1
0 0
0 0
0 0
1 2
1 2
1 2
1 2
1 2
0 0
0 0
0 0
0 0
0 0
0 0
1 3
1 3
0 0
0 0
Any suggestions on how to get there are much appreciated. Thank you!
With dplyr, the following checks whether a 1 directly follows a 0 and takes the cumulative sum of those transitions. The result is then multiplied by event so that the zeros are maintained.
library(dplyr)
df %>%
mutate(count_2 = event * cumsum(event == 1 & lag(event, default = 0) == 0))
gives
event count count_2
1 0 0 0
2 0 0 0
3 0 0 0
4 1 1 1
5 0 0 0
6 0 0 0
7 0 0 0
8 1 2 2
9 1 2 2
10 1 2 2
11 1 2 2
12 1 2 2
13 0 0 0
14 0 0 0
15 0 0 0
16 0 0 0
17 0 0 0
18 0 0 0
19 1 3 3
20 1 3 3
21 0 0 0
22 0 0 0
A base-R variant (diff(df$event) == 1 flags each 0-to-1 transition of event, and cumsum() numbers those transitions):
df$count_2 <- df$event * cumsum(c(0, diff(df$event)==1))
Using rle in base R:
df$count1 <- with(df, event * with(rle(event == 1),rep(cumsum(values), lengths)))
df
# event count count1
#1 0 0 0
#2 0 0 0
#3 0 0 0
#4 1 1 1
#5 0 0 0
#6 0 0 0
#7 0 0 0
#8 1 2 2
#9 1 2 2
#10 1 2 2
#11 1 2 2
#12 1 2 2
#13 0 0 0
#14 0 0 0
#15 0 0 0
#16 0 0 0
#17 0 0 0
#18 0 0 0
#19 1 3 3
#20 1 3 3
#21 0 0 0
#22 0 0 0
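To see why the rle() version works, look at its pieces: rle() encodes the runs of the logical vector event == 1, cumsum() over the run values numbers the TRUE runs, and rep() expands those numbers back to the original length:
r <- rle(df$event == 1)
r$values                         # FALSE TRUE FALSE TRUE FALSE TRUE FALSE
cumsum(r$values)                 # 0 1 1 2 2 3 3
rep(cumsum(r$values), r$lengths) # run number repeated to the full length
Multiplying by event then zeroes out the runs of zeros.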

How to add elements to a vector based on another vector position-wise in R

I have a positive integer-valued vector C of length 192. Based on C, I want to generate another vector B of the same length. The algorithm for creating B is:
Whenever a value above 0 is observed in C, add the same value 12 places back in the vector B. For example, if the first 15 entries of C are 0 and the 16th entry is 3, then I want to add the value 3, 12 positions back (which is 16-12=position 4) in vector B. The vector B would be generated in this way over all values of C.
Any help would be greatly appreciated! The vector C can be obtained from the R package "outbreaks": it is the onset column of the ebola_kikwit_1995 dataset, i.e. ebola_kikwit_1995$onset.
It doesn't seem to be very tricky. An index vector and a for loop will do what the question asks for.
library("outbreaks")
i <- which(ebola_kikwit_1995$onset > 0)
i <- i[i > 12]
ebola_kikwit_1995$B <- 0L
for(j in i){
ebola_kikwit_1995$B[j] <- ebola_kikwit_1995$onset[j] + ebola_kikwit_1995$B[j - 12]
}
ebola_kikwit_1995$B
# [1] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
# [28] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
# [55] 0 0 0 0 0 1 1 1 0 0 0 0 3 0 0 0 1 2 0 2 0 2 0 1 0 0 0
# [82] 0 0 0 0 4 0 3 1 3 1 0 1 1 1 1 1 5 3 0 5 7 2 2 5 3 2 8
#[109] 5 9 5 4 10 10 13 14 20 10 9 16 7 14 13 10 18 13 17 21 31 13 21 21 15 17 16
#[136] 18 22 18 18 24 35 16 22 23 18 18 19 21 26 21 23 26 37 17 0 25 0 0 21 0 0 0
#[163] 25 28 0 18 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0
#[190] 0 0 0
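Note that this loop adds the value found 12 positions back to the current position of B. If the question is read literally (write each positive value of C into B twelve positions earlier), a sketch of that variant would be:
B2 <- integer(length(ebola_kikwit_1995$onset))
k <- which(ebola_kikwit_1995$onset > 0)
k <- k[k > 12]                          # only positions with room 12 places back
B2[k - 12] <- ebola_kikwit_1995$onset[k]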

R: how do I split a delimited column (state) into columns with binary 1/0 values

Appreciate your help. I need to split a column of delimited values into new columns named after those values, with each new column filled with 1 or 0 depending on whether the value appears in that row.
state <-
c('ACT',
'ACT|NSW|NT|QLD|SA|VIC',
'ACT|NSW|NT|QLD|TAS|VIC|WA',
'ACT|NSW|NT|SA|TAS|VIC',
'ACT|NSW|QLD|VIC',
'ACT|NSW|SA',
'ACT|NSW|NT|QLD|TAS|VIC|WA|SA',
'NSW',
'NT',
'NT|SA',
'QLD',
'SA',
'TAS',
'VIC',
'WA')
df <- data.frame(id = 1:length(state),state)
id state
1 1 ACT
2 2 ACT|NSW|NT|QLD|SA|VIC
3 3 ACT|NSW|NT|QLD|TAS|VIC|WA
4 4 ACT|NSW|NT|SA|TAS|VIC
...
The desired result is a dataframe with the same number of rows plus additional columns, one per state value, populated with 1 or 0 depending on the row.
You can do something like this:
library(tidyr)
library(dplyr)
df %>%
separate_rows(state) %>%
unique() %>% # in case you have duplicated states for a single id
mutate(exist = 1) %>%
spread(state, exist, fill=0)
# id ACT NSW NT QLD SA TAS VIC WA
#1 1 1 0 0 0 0 0 0 0
#2 2 1 1 1 1 1 0 1 0
#3 3 1 1 1 1 0 1 1 1
#4 4 1 1 1 0 1 1 1 0
#5 5 1 1 0 1 0 0 1 0
#6 6 1 1 0 0 1 0 0 0
#7 7 1 1 1 1 1 1 1 1
#8 8 0 1 0 0 0 0 0 0
#9 9 0 0 1 0 0 0 0 0
#10 10 0 0 1 0 1 0 0 0
#11 11 0 0 0 1 0 0 0 0
#12 12 0 0 0 0 1 0 0 0
#13 13 0 0 0 0 0 1 0 0
#14 14 0 0 0 0 0 0 1 0
#15 15 0 0 0 0 0 0 0 1
separate_rows() splits state and converts the data frame to long format;
mutate() adds a constant-value column for reshaping purposes;
spread() transforms the result to wide format, with fill = 0 for absent id-state combinations.
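With tidyr 1.0 or later, the same reshaping can be written with pivot_wider(), which supersedes spread() (a sketch of the equivalent pipeline):
library(dplyr)
library(tidyr)
df %>%
separate_rows(state) %>%
distinct() %>%
mutate(exist = 1) %>%
pivot_wider(names_from = state, values_from = exist, values_fill = 0)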
Here is a base R option: split the 'state' column by "|", convert the list of vectors into a two-column data.frame with stack(), get the frequencies with table(), and cbind() the result with the first column of 'df':
cbind(df[1], as.data.frame.matrix(table(stack(setNames(
strsplit(as.character(df$state), "[|]"), df$id))[2:1])))
# id ACT NSW NT QLD SA TAS VIC WA
#1 1 1 0 0 0 0 0 0 0
#2 2 1 1 1 1 1 0 1 0
#3 3 1 1 1 1 0 1 1 1
#4 4 1 1 1 0 1 1 1 0
#5 5 1 1 0 1 0 0 1 0
#6 6 1 1 0 0 1 0 0 0
#7 7 1 1 1 1 1 1 1 1
#8 8 0 1 0 0 0 0 0 0
#9 9 0 0 1 0 0 0 0 0
#10 10 0 0 1 0 1 0 0 0
#11 11 0 0 0 1 0 0 0 0
#12 12 0 0 0 0 1 0 0 0
#13 13 0 0 0 0 0 1 0 0
#14 14 0 0 0 0 0 0 1 0
#15 15 0 0 0 0 0 0 0 1
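To unpack that one-liner step by step:
s <- setNames(strsplit(as.character(df$state), "[|]"), df$id) # named list of state vectors
long <- stack(s)                 # two columns: values (state) and ind (id)
tab <- table(long[2:1])          # id x state contingency table of 0/1 counts
cbind(df[1], as.data.frame.matrix(tab)) # back to a data frame with the id column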

Creating a DTM from a 3-column CSV file with R

I have a CSV file containing 600k lines and 3 columns: the first contains a disease name, the second a gene, and the third a number, something like this (I have roughly 4k diseases and 16k genes, so the disease and gene names recur across rows):
cholera xx45 12
Cancer xx65 1
cholera xx65 0
I would like to make a DTM from this using R. I've been trying the Corpus command from the tm library, but it doesn't collapse the repeated disease names and the result stays around 600k entries. I'd love to understand how to transform that file into a DTM.
I'm sorry for not being more precise; I'm just getting started with computer science as a biology person :)
Cheers!
If you're not concerned with the number in the third column, then you can accomplish what I think you're trying to do using only the first two columns (gene and disease).
Example with some simulated data:
library(data.table)
# Simulate ~100k gene-disease pairs: ~6k distinct gene labels and 40 diseases
# (data.frame() recycles the 10k generated gene labels to match the 100k diseases)
genes <- sapply(1:10000, function(x) paste(c(sample(LETTERS, size = 2), sample(10, size = 1)), collapse = ""))
df <- data.frame(gene = genes, disease = sample(40, size = 100000, replace = TRUE))
table(df) creates a large matrix, nGenes rows long and nDiseases columns wide. Looking at just the first few rows (because it's so large and sparse):
head(table(df))
disease
gene 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27
AB10 0 0 1 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0
AB2 1 1 0 0 0 0 1 0 0 0 0 0 0 0 2 0 0 2 0 0 0 0 1 0 1 0 1
AB3 0 1 0 0 2 1 1 0 0 1 0 0 0 0 0 2 1 0 0 1 0 0 1 0 3 0 1
AB4 0 0 1 0 0 1 0 2 1 1 0 1 0 0 1 1 1 1 0 1 0 2 0 0 0 1 1
AB5 0 1 0 1 0 0 2 2 0 1 1 1 0 1 0 0 2 0 0 0 0 0 0 1 1 1 0
AB6 0 0 2 0 2 1 0 0 0 0 0 0 0 0 0 0 1 0 1 1 0 1 0 0 0 0 0
disease
gene 28 29 30 31 32 33 34 35 36 37 38 39 40
AB10 0 0 1 2 1 0 0 1 0 0 0 0 0
AB2 0 0 0 0 0 0 0 0 0 0 0 0 0
AB3 0 0 1 1 1 0 0 0 0 0 1 1 0
AB4 0 0 1 2 1 1 1 1 1 2 0 3 1
AB5 0 2 1 1 0 0 3 4 0 1 1 0 2
AB6 0 0 0 0 0 0 0 1 0 0 0 0 0
Alternatively, you can exclude the counts of 0 and only include combinations that actually exist. Easy aggregation can be done with data.table, e.g. (continuing from the above example)
library(data.table)
dt <- data.table(df)
dt[, .N, by=list(gene, disease)]
which gives a frequency table like the following:
gene disease N
1: HA5 20 2
2: RF9 10 3
3: SD8 40 2
4: JA7 35 4
5: MJ2 1 2
---
75872: FR10 26 1
75873: IC5 40 1
75874: IU2 20 1
75875: IG5 13 1
75876: DW7 21 1
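Back to the original 600k-line file: if you do want the numbers in the third column as cell values rather than counting rows, xtabs() can build the matrix directly, and sparse = TRUE (which uses the Matrix package) keeps the roughly 4k x 16k result memory-friendly. A sketch, with a hypothetical file name and column names; adjust the separator and names to your actual file:
dat <- read.csv("disease_gene.csv", header = FALSE,
                col.names = c("disease", "gene", "value"))
dtm <- xtabs(value ~ disease + gene, data = dat, sparse = TRUE)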

How do I create a DTM-like text matrix from a list of text blocks?

I have been using the textmatrix() function for a while to create DTMs which I can further use for LSI.
dirLSA<-function(dir){
dtm<-textmatrix(dir)
return(lsa(dtm))
}
textdir<-"C:/RProjects/docs"
tm <- dirLSA(textdir)
> tm
$matrix
D1 D2 D3 D4 D5 D6 D7 D8 D9
1. 000 2 0 0 0 0 0 0 0 0
2. 20 1 0 0 1 0 0 1 0 0
3. 200 1 0 0 0 0 0 0 0 0
4. 2014 1 0 0 0 0 0 0 0 0
5. 2015 1 0 0 0 0 0 0 0 0
6. 27 1 0 0 0 0 0 0 1 0
7. 30 1 0 0 0 1 0 1 0 0
8. 31 1 0 2 0 0 0 0 0 0
9. 40 1 0 0 0 0 0 0 0 0
10. 45 1 0 0 0 0 0 0 0 0
11. 500 1 0 0 0 0 0 1 0 0
12. 600 1 0 0 0 0 0 0 0 0
728. bias 0 0 0 2 0 0 0 0 0
729. biased 0 0 0 1 0 0 0 0 0
730. called 0 0 0 1 0 0 0 0 0
731. calm 0 0 0 1 0 0 0 0 0
732. cause 0 0 0 1 0 0 0 0 0
733. chauhan 0 0 0 2 0 0 0 0 0
734. chief 0 0 0 8 0 0 1 0 0
textmatrix() is a function which takes a directory (folder path) and returns a document-wise term-frequency matrix. This is used in further analysis such as Latent Semantic Indexing/Analysis (LSI/LSA).
However, a new problem I have come across is that my tweet data arrives in batch files (~500,000 tweets per batch), and I want to carry out similar operations on it.
I have code modules to clean up my data, and I want to pass the cleaned tweets directly to the LSI function. The problem I face is that textmatrix() does not support this.
I tried looking at other packages and code snippets, but that didn't get me any further. Is there any way I can create a line-term matrix of sorts?
I tried sending table(tokenize(cleanline[i])) into a loop, but it won't add new columns for words that aren't already in the matrix. Any workaround?
Update: I just tried this:
a<-table(tokenize(cleanline[10]))
b<-table(tokenize(cleanline[12]))
df1<-data.frame(a)
df1
df2<-data.frame(b)
df2
merge(df1,df2, all=TRUE)
I got this:
> df1
Var1 Freq
1 6
2 " 2
3 and 1
4 home 1
5 mabe 1
6 School 1
7 then 1
8 xbox 1
> b<-table(tokenize(cleanline[12]))
> df2<-data.frame(b)
> df2
Var1 Freq
1 13
2 " 2
3 BillGates 1
4 Come 1
5 help 1
6 Mac 1
7 make 1
8 Microsoft 1
9 please 1
10 Project 1
11 really 1
12 version 1
13 wish 1
14 would 1
> merge(df1,df2)
Var1 Freq
1 " 2
> merge(df1,df2, all=TRUE)
Var1 Freq
1 6
2 13
3 " 2
4 and 1
5 home 1
6 mabe 1
7 School 1
8 then 1
9 xbox 1
10 BillGates 1
11 Come 1
12 help 1
13 Mac 1
14 make 1
15 Microsoft 1
16 please 1
17 Project 1
18 really 1
19 version 1
20 wish 1
21 would 1
I think I'm close.
Try something like this (each data frame needs a column identifying its document; a plain rbind() alone would collapse both documents into a single column):
ll <- list(df1, df2)
ll <- Map(cbind, ll, doc = seq_along(ll)) # tag each frame with its document id
dtm <- xtabs(Freq ~ Var1 + doc, data = do.call("rbind", ll))
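Generalizing to a whole vector of cleaned lines, the same xtabs() idea builds the matrix in one pass and avoids repeated merge() calls (a sketch, assuming the tokenize() function and cleanline vector from the question):
freqs <- do.call(rbind, lapply(seq_along(cleanline), function(i) {
  tab <- table(tokenize(cleanline[i]))
  data.frame(term = names(tab), freq = as.integer(tab), line = i)
}))
dtm <- xtabs(freq ~ term + line, data = freqs)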
Something that works for me:
textLSA <- function(text) {
  # term frequencies of the first line seed the data frame
  df <- data.frame(table(tokenize(text[1])))
  colnames(df)[2] <- paste("Line", 1)
  # merge in the remaining lines; all = TRUE keeps terms missing from earlier lines
  for (i in seq_along(text)[-1]) {
    a <- data.frame(table(tokenize(text[i])))
    colnames(a)[2] <- paste("Line", i)
    df <- merge(df, a, all = TRUE)
  }
  df[is.na(df)] <- 0 # a term absent from a line gets frequency 0
  dtm <- as.matrix(df[, -1])
  rownames(dtm) <- df$Var1
  return(lsa(dtm))
}
What do you think of this code?
