Is there a way in R to represent the alternation of these sequences, SEQ1 and SEQ2, within a single ultra-long read?

I have a data frame in R similar to the one below, except that it is 2000 rows long.
Throughout the data frame there is an alternation of SEQ1 and SEQ2 within a single read called "id read". The sequences alternate: SEQ1 always starts 1 nucleotide after the preceding SEQ2, while SEQ2 starts about 335 nucleotides after the preceding SEQ1, sometimes jumping to about 670.
The sequences occur in both forward and reverse orientation, as can be seen from the end coordinate, which is sometimes smaller than the start coordinate.
sequence   id read   start    end   sequencedistance   sequencelength
SEQ1       id read      90    105                  1               15
SEQ2       id read     440    458                335               18
SEQ1       id read     459    474                  1               15
SEQ2       id read     808    826                334               18
SEQ1       id read     827    812                  1               15
SEQ2       id read    1148   1156                336               18
SEQ1       id read    1157   1172                  1               15
SEQ2       id read    1850   1868                678               18
SEQ1       id read    1869   1854                  1               15
SEQ2       id read    2187   2205                333               18
SEQ1       id read    2206   2221                  1               15
SEQ2       id read    2887   2905                666               18
Would anyone have any ideas on how to plot this data and visually show the pattern that these sequences have within a read?
I have tried plotting with horizontal lines, lollipops and points, but none of these methods is effective at representing the amount of data I have or at making the behaviour of these sequences visually clear.
Would anyone have an idea of how to plot the pattern? If need be, I could also plot only part of the large data frame I have, but at least I would like to understand the particular arrangement of these sequences in the ultra-long read under consideration.

I'm still not exactly sure what you are looking for, but if every row i where sequence == "SEQ1" has a paired row i + 1 where sequence == "SEQ2", you can calculate the relative start and end sites and then try to visualise them.
Assuming your data is in a variable called df, you can calculate these as follows.
df <- transform(
  df,
  rel_start = ifelse(
    as.character(sequence) == "SEQ1",
    start - start,
    start - c(0, head(start, -1))
  ),
  rel_end = ifelse(
    as.character(sequence) == "SEQ1",
    end - start,
    end - c(0, head(start, -1))
  )
)
Then for visualisation, you can just use geom_segment(). You could use arrows to indicate the direction of the reads.
library(ggplot2)

ggplot(df, aes(rel_start, y = seq_along(start), colour = sequence)) +
  geom_segment(aes(xend = rel_end, yend = seq_along(start)),
               arrow = arrow(length = unit(2, "mm")))
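Since the end coordinate is smaller than the start coordinate for the reverse hits, you could also make the orientation explicit. A minimal sketch, assuming the same df as above (the strand column is one I'm introducing just for illustration):
# derive a hypothetical "strand" column from the raw coordinates:
# reverse hits have end < start in the original data
df$strand <- ifelse(df$end >= df$start, "forward", "reverse")

ggplot(df, aes(rel_start, y = seq_along(start), colour = sequence)) +
  geom_segment(aes(xend = rel_end, yend = seq_along(start), linetype = strand),
               arrow = arrow(length = unit(2, "mm")))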
Data loading:
txt <- "sequence id read start end sequencedistance sequencelength
SEQ1 id read 90 105 1 15
SEQ2 id read 440 458 335 18
SEQ1 id read 459 474 1 15
SEQ2 id read 808 826 334 18
SEQ1 id read 827 812 1 15
SEQ2 id read 1148 1156 336 18
SEQ1 id read 1157 1172 1 15
SEQ2 id read 1850 1868 678 18
SEQ1 id read 1869 1854 1 15
SEQ2 id read 2187 2205 333 18
SEQ1 id read 2206 2221 1 15
SEQ2 id read 2887 2905 666 18"
df <- read.table(text = txt, header = TRUE)

Related

Adding values of two columns on the same row to get a new value

Sorry for asking a very basic question but I am new to R and really stuck on a rather simple matter; I have the data frame below (2 rows and 7 columns):
Sub sup_b hdt sup_2 lbnp sup_3 hut sup_4
6 175 434 596 585 601 593 211
7 130 592 592 593 600 384 166
These values correspond to time durations (secs) for seven test conditions,
col$names <- c(sup_b, hdt, sup_2, lbnp, sup_3, hut, sup_4)
and 17 rows (each row is for one study subject; I have only included the first two rows).
I am trying to add values from row 1 col$sup_b (175) and row 1 col$hdt (434) to get the combined duration for the first two conditions i.e. 609 secs. I then add the value of the previous two cols (609) to the next col$sup_2 to get the total duration (609 + 596) and so on until the last condition col$sup_4.
I have tried the method below for subject 6 (row 1), which works fine, but I want to tidy this up and make it easier, as I have 17 subjects (rows) and have been advised there is an easier way around this:
sup_b <- 175
hdt <- (sup_b + 434)
sup_2 <- (hdt + 596)
lbnp <- (sup_2 + 585)
sup_3 <- (lbnp + 601)
hut <- (sup_3 + 593)
sup_4 <- (hut + 211)
I want to be able to just change the row number and have the data pulled across from the data frame, rather than entering each individual time period; for instance:
line <- 1 ### the row I want which corresponds to the subject
sup_b <- df[line, 2]
hdt <-df[line, 2] + df[line, 3]
but I keep getting this warning message:
In Ops.factor(df[line, 2], df[line, 3]) : ‘+’ not meaningful for factor
I have even tried colSums(df[,c(2:3)]), but get the following error:
Error in colSums(df[, c(2:3)]) : 'x' must be numeric.
also tried: st$sum <- apply(df[,c(2:3)], 1, sum), which doesn't work either.
You can apply cumsum across each row of the numeric columns and transpose the result back into the data frame:
df1[-1] <- t(apply(df1[-1], 1, cumsum))
#   Sub sup_b hdt sup_2 lbnp sup_3  hut sup_4
# 1   6   175 609  1205 1790  2391 2984  3195
# 2   7   130 722  1314 1907  2507 2891  3057
data
df1 <- read.table(text = "Sub sup_b hdt sup_2 lbnp sup_3 hut sup_4
6 175 434 596 585 601 593 211
7 130 592 592 593 600 384 166", header = TRUE, stringsAsFactors = FALSE)
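The "'+' not meaningful for factors" warning in the question comes from the numeric columns having been read in as factors. A minimal sketch of converting them before doing any arithmetic, assuming the data frame is called df1 as above:
# convert every column except Sub from factor to numeric
num_cols <- names(df1)[-1]
df1[num_cols] <- lapply(df1[num_cols], function(x) as.numeric(as.character(x)))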

Divide a data-frame into x roughly equal groups -- sequentially

I want to divide a df into x roughly equal groups, sequentially.
I was basically doing it like this:
df_1 <- df[1:10,]
df_2 <- df[11:21,]
df_3..
Is there a simpler way to do this, using split or slice? The important thing is, I want to maintain the order of the df, not sample from it.
Imagine I had 7000 observations, and I wanted 19 roughly equal groups.
Best!
I don't know if it counts as roughly equal, but you can do this:
nobs <- 7000
ngroups <- 17
df <- data.frame(x = sample(nobs))
set.seed(1)
df$grp <- sort(sample(1:ngroups,nobs,T)) # added the sort so the order of your df is maintained
table(df$grp)
# 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
# 436 407 410 369 417 411 440 401 431 411 356 398 390 414 443 418 448
then split(df,df$grp)
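If you want contiguous groups whose sizes are as equal as possible while preserving the row order, a minimal sketch using cut(), assuming the same df and ngroups as above:
# contiguous, nearly equal groups in the original row order
df$grp2 <- cut(seq_len(nrow(df)), breaks = ngroups, labels = FALSE)
table(df$grp2)      # group sizes are as equal as possible
split(df, df$grp2)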

R One sample test for set of columns for each row

I have a data set where I have the Levels and Trends for say 50 cities for 3 scenarios. Below is the sample data -
City <- paste0("City",1:50)
L1 <- sample(100:500,50,replace = T)
L2 <- sample(100:500,50,replace = T)
L3 <- sample(100:500,50,replace = T)
T1 <- runif(50,0,3)
T2 <- runif(50,0,3)
T3 <- runif(50,0,3)
df <- data.frame(City,L1,L2,L3,T1,T2,T3)
Now, across the 3 scenarios I find the minimum Level and Minimum Trend using the below code -
df$L_min <- apply(df[,2:4],1,min)
df$T_min <- apply(df[,5:7],1,min)
Now I want to check whether these minimum values are significantly different from the levels and trends respectively, i.e. compare L_min with columns 2-4 and T_min with columns 5-7. This needs to be done for each city (row) and, if significant, return which column it differs from significantly.
It would help if someone could guide me on how this can be done.
Thank you!!
I'll put my idea here; nevertheless, I'm looking forward to ideas from others.
> head(df)
City L1 L2 L3 T1 T2 T3 L_min T_min
1 City1 251 176 263 1.162313 0.07196579 2.0925715 176 0.07196579
2 City2 385 406 264 0.353124 0.66089524 2.5613980 264 0.35312402
3 City3 437 333 426 2.625795 1.43547766 1.7667891 333 1.43547766
4 City4 431 405 493 2.042905 0.93041254 1.3872058 405 0.93041254
5 City5 101 429 100 1.731004 2.89794314 0.3535423 100 0.35354230
6 City6 374 394 465 1.854794 0.57909775 2.7485841 374 0.57909775
> df$FC <- rowMeans(df[,2:4])/df[,8]
> df <- df[order(-df$FC), ]
> head(df)
City L1 L2 L3 T1 T2 T3 L_min T_min FC
18 City18 461 425 117 2.7786757 2.6577894 0.75974121 117 0.75974121 2.857550
38 City38 370 117 445 0.1103141 2.6890014 2.26174542 117 0.11031411 2.655271
44 City44 101 473 222 1.2754675 0.8667007 0.04057544 101 0.04057544 2.627063
10 City10 459 361 132 0.1529519 2.4678493 2.23373484 132 0.15295194 2.404040
16 City16 232 393 110 0.8628494 1.3995549 1.01689217 110 0.86284938 2.227273
15 City15 499 475 182 0.3679611 0.2519497 2.82647041 182 0.25194969 2.117216
Now you have the rows that are most different based on columns 2:4 at the top. Columns 5:7 can be handled in an analogous way, as sketched below.
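For the trend columns, the analogous computation might look like this (a sketch; FC_T is a name I'm introducing):
# analogous fold change for the trend columns
df$FC_T <- rowMeans(df[, 5:7]) / df$T_min
df <- df[order(-df$FC_T), ]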
And some tips for statistical tests:
Always use t.test (parametric, based on the mean) instead of the Wilcoxon/Mann-Whitney U test (non-parametric, based on the median); it has more power. HOWEVER:
- Data sets should be big. Example hypothesis: Montreal has taller citizens than Quebec; t.test will work fine when you take 100 people from each city, so we have height measurements of 200 people, 100 vs 100.
- The distribution should be close to normal in all samples, or both samples should have a similar distribution far from normal (it may be binomial, for example). Either way, we can't use this test when one sample has a normal distribution and the other doesn't.
- The sizes of both samples should be equal, so 100 vs 100 is fine, but 87 vs 234 is not ideal; the p-value may come out below 0.05 yet be misleading.
If your data doesn't meet the above conditions, I prefer a non-parametric test: less power, but more robust.
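If you do want a per-row test against the minimum, one possible sketch, assuming the df built above (with only three values per row any such test has very little power and should be interpreted with caution):
# one-sample t-test of each row's three levels (and trends) against the row minimum
df$L_pval <- apply(df[, 2:4], 1, function(x) t.test(x, mu = min(x))$p.value)
df$T_pval <- apply(df[, 5:7], 1, function(x) t.test(x, mu = min(x))$p.value)
head(df[, c("City", "L_min", "L_pval", "T_min", "T_pval")])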

Multiple scatterplot figure in R

I have a slightly complicated plotting task. I am halfway there but not quite sure how to get the rest. I have a dataset of the form below, with multiple subjects, each in either Treatgroup 0 or Treatgroup 1, and each subject contributing several rows of data. Each row corresponds to a single timepoint at which there are values in the columns "count1", "count2", "weirdname3", etc.
Task 1. I need to calculate "Days", which is just the VisitDate minus the StartDate, for each row. It should be an apply-type function, I guess.
Task 2. I have to make a multiplot figure with one scatterplot for each of the count variables (a plot for count1, one for count2, etc). In each scatterplot, I need to plot the value of the count (y axis) against "Days" (x-axis) and connect the dots for each subject. Subjects in Treatgroup 0 are one color, subjects in treatgroup 1 are another color. Each scatterplot should be labeled with count1, count2 etc as appropriate.
I am trying to use the base plotting function, and have taken the approach of writing a plotting function to call later. I think this can work but need some help with syntax.
#Enter example data
tC <- textConnection("
ID StartDate VisitDate Treatstarted count1 count2 count3 Treatgroup
C0098 13-Jan-07 12-Feb-10 NA 457 343 957 0
C0098 13-Jan-06 2-Jul-10 NA 467 345 56 0
C0098 13-Jan-06 7-Oct-10 NA 420 234 435 0
C0098 13-Jan-05 3-Feb-11 NA 357 243 345 0
C0098 14-Jan-06 8-Jun-11 NA 209 567 254 0
C0098 13-Jan-06 9-Jul-11 NA 223 235 54 0
C0098 13-Jan-06 12-Oct-11 NA 309 245 642 0
C0110 13-Jan-06 23-Jun-10 30-Oct-10 629 2436 45 1
C0110 13-Jan-07 30-Sep-10 30-Oct-10 461 467 453 1
C0110 13-Jan-06 15-Feb-11 30-Oct-10 270 365 234 1
C0110 13-Jan-06 22-Jun-11 30-Oct-10 236 245 23 1
C0151 13-Jan-08 2-Feb-10 30-Oct-10 199 653 456 1
C0151 13-Jan-06 24-Mar-10 3-Apr-10 936 25 654 1
C0151 13-Jan-06 7-Jul-10 3-Apr-10 1147 254 666 1
C0151 13-Jan-06 9-Mar-11 3-Apr-10 1192 254 777 1
")
data1 <- read.table(header=TRUE, tC)
close.connection(tC)
# format date
data1$VisitDate <- with(data1,as.Date(VisitDate,format="%d-%b-%y"))
# stuck: need to define days as VisitDate - StartDate for each row of dataframe (I know I need an apply family fxn here)
data1$Days <- [applyfunction of some kind ](VisitDate,ID,function(x){x-data1$StartDate})))
# Unsure here. Need to define plot function
plot_one <- function(d){
with(d, plot(Days, Count, t="n", tck=1, cex.main = 0.8, ylab = "", yaxt = 'n', xlab = "", xaxt="n", xlim=c(0,1000), ylim=c(0,1200))) # set limits
grid(lwd = 0.3, lty = 7)
with(d[d$Treatgroup == 0,], points(Days, Count1, col = 1))
with(d[d$Treatgroup == 1,], points(Days, Count1, col = 2))
}
#Create multiple plot figure
par(mfrow=c(2,2), oma = c(0.5,0.5,0.5,0.5), mar = c(0.5,0.5,0.5,0.5))
#trouble here. I need to call the column names somehow, with; plyr::d_ply(data1, ???, plot_one)
Task 1:
data1$days <- floor(as.numeric(as.POSIXlt(data1$VisitDate,format="%d-%b-%y")
-as.POSIXlt(data1$StartDate,format="%d-%b-%y")))
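An alternative sketch for Task 1 using Date arithmetic (assuming the same data1; VisitDate may already be a Date from the code in the question, in which case the format argument is simply ignored):
# days between visit and start as a plain number
data1$days <- as.numeric(as.Date(data1$VisitDate, format = "%d-%b-%y") -
                         as.Date(data1$StartDate, format = "%d-%b-%y"))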
Task 2:
par(mfrow=c(3,1), oma = c(2,0.5,1,0.5), mar = c(2,0.5,1,0.5))
plot(data1$days, data1$count1, col=as.factor(data1$Treatgroup), main="count1")
plot(data1$days, data1$count2, col=as.factor(data1$Treatgroup), main="count2")
plot(data1$days, data1$count3, col=as.factor(data1$Treatgroup), main="count3")
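If you want the dots connected per subject and one labelled panel per count variable, as described in the question, a sketch using ggplot2 (assuming data1 and the days column from above; the reshaping with reshape2::melt is my own choice, not part of the answer):
library(ggplot2)
library(reshape2)

# long format: one row per subject, day and count variable
long <- melt(data1,
             id.vars = c("ID", "Treatgroup", "days"),
             measure.vars = c("count1", "count2", "count3"),
             variable.name = "count", value.name = "value")

ggplot(long, aes(days, value, colour = factor(Treatgroup), group = ID)) +
  geom_point() +
  geom_line() +
  facet_wrap(~ count)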

Find the non zero values and frequency of those values in R

I have a dataset with two parameters: date/time and flow. The flow is intermittent. Let's say at times there is zero flow, then suddenly the flow starts and there are non-zero values for some time, and then the flow is zero again. I want to understand when the non-zero values occur and how long each non-zero flow lasts. I have attached the sample dataset at this location: https://www.dropbox.com/s/ef1411dq4gyg0cm/sampledataflow.csv
The data is 1 minute data.
I was able to import the data into R as follows:
flow <- read.csv("sampledataflow.csv")
summary(flow)
names(flow) <- c("Date","discharge")
flow$Date <- strptime(flow$Date, format="%m/%d/%Y %H:%M")
sapply(flow,class)
plot(flow$Date, flow$discharge,type="l")
I made a plot to see the distribution but couldn't work out where to start in order to get the frequency of each non-zero value. I would like to see an output table as follows:
Date Duration in Minutes
Please let me know if I am not clear here. Thanks.
Additional Info:
I think we need to check the first non-zero value and then find how many non-zero values follow continuously before the flow reaches zero again. What I want to understand is the flow release durations. For example, in one day there might be multiple releases, and I want to note at what time each release started and how long it continued before returning to zero. I hope this explains the problem a little better.
The first point is that you have a lot of NAs in your data, in case you want to look into that.
If I understand correctly, you require the count of continuous 0's followed by continuous non-zeros, then zeros, then non-zeros, and so on, for each date.
This can be achieved with rle, of course, as also mentioned by @mnel in the comments. But there are quite a few catches.
First, I'll set up the data with non-NA entries:
flow <- read.csv("~/Downloads/sampledataflow.csv")
names(flow) <- c("Date","discharge")
flow <- flow[1:33119, ] # remove NA entries
# format Date to POSIXct to play nice with data.table
flow$Date <- as.POSIXct(flow$Date, format="%m/%d/%Y %H:%M")
Next, I'll create a Date column:
flow$g1 <- as.Date(flow$Date)
Finally, I prefer using data.table. So here's a solution using it.
# load package, get data as data.table and set key
require(data.table)
flow.dt <- data.table(flow)
# set key to both "Date" and "g1" (even though we'll just use g1)
# to make sure that the order of the rows is not changed (during the sort)
setkey(flow.dt, "Date", "g1")
# group by g1 and set data to TRUE/FALSE by equating to 0 and get rle lengths
out <- flow.dt[, list(duration = rle(discharge == 0)$lengths,
val = rle(discharge == 0)$values + 1), by=g1][val == 2, val := 0]
> out # just to show a few first and last entries
# g1 duration val
# 1: 2010-05-31 120 0
# 2: 2010-06-01 722 0
# 3: 2010-06-01 138 1
# 4: 2010-06-01 32 0
# 5: 2010-06-01 79 1
# ---
# 98: 2010-06-22 291 1
# 99: 2010-06-22 423 0
# 100: 2010-06-23 664 0
# 101: 2010-06-23 278 1
# 102: 2010-06-23 379 0
So, for example, for 2010-06-01, there are 722 0's followed by 138 non-zeros, followed by 32 0's followed by 79 non-zeros and so on...
I looked at a small sample of the first two days:
> do.call( cbind, tapply(flow$discharge, as.Date(flow$Date), function(x) table(x > 0) ) )
2010-06-01 2010-06-02
FALSE 1223 911
TRUE 217 529 # these are the cumulative daily durations of positive flow.
You may want this transposed in which case the t() function should succeed. Or you could use rbind.
If you just wanted the number of flow-positive minutes, this would also work:
tapply(flow$discharge, as.Date(flow$Date), function(x) sum(x > 0, na.rm=TRUE) )
#--------
2010-06-01 2010-06-02 2010-06-03 2010-06-04 2010-06-05 2010-06-06 2010-06-07 2010-06-08
217 529 417 463 0 0 263 220
2010-06-09 2010-06-10 2010-06-11 2010-06-12 2010-06-13 2010-06-14 2010-06-15 2010-06-16
244 219 287 234 31 245 311 324
2010-06-17 2010-06-18 2010-06-19 2010-06-20 2010-06-21 2010-06-22 2010-06-23 2010-06-24
299 305 124 129 295 296 278 0
To get the lengths of intervals with discharge values greater than zero:
tapply(flow$discharge, as.Date(flow$Date), function(x) rle(x>0)$lengths[rle(x>0)$values] )
#--------
$`2010-06-01`
[1] 138 79
$`2010-06-02`
[1] 95 195 239
$`2010-06-03`
[1] 57 360
$`2010-06-04`
[1] 6 457
$`2010-06-05`
integer(0)
$`2010-06-06`
integer(0)
... Snipped output
If you want to look at the distribution of these durations you will need to unlist that result. (And remember that the durations which were split at midnight may have influenced the counts and durations.) If you just wanted durations without dates, then use this:
flowrle <- rle(flow$discharge>0)
flowrle$lengths[!is.na(flowrle$values) & flowrle$values]
#----------
[1] 138 79 95 195 296 360 6 457 263 17 203 79 80 85 30 189 17 270 127 107 31 1
[23] 2 1 241 311 229 13 82 299 305 3 121 129 295 3 2 291 278
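To also get the requested start time of each release together with its duration in minutes, a sketch that builds on the same rle result (assuming flow$Date has been parsed and the data are at 1-minute intervals):
# one row per positive-flow run: when it started and how many minutes it lasted
flowrle <- rle(flow$discharge > 0)
run_start <- cumsum(c(1, head(flowrle$lengths, -1)))   # first row index of each run
pos <- !is.na(flowrle$values) & flowrle$values         # runs with positive flow
releases <- data.frame(Date = flow$Date[run_start[pos]],
                       DurationMinutes = flowrle$lengths[pos])
head(releases)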
