Finding which contigs go into any given scaffold in a 454 Newbler assembly

Is there a simple script for listing which contigs go into which scaffolds? This is 454 data assembled with Newbler.
Thanks!

Thanks for the corrections ;)
A simple grep on the 454Scaffolds.txt file gives me the following output:
-bash-3.2$ grep scaffold10511 454Scaffolds.txt
scaffold10511 1 1251 1 W contig368342 1 1251 +
scaffold10511 1252 2340 2 N 1089 fragment yes
scaffold10511 2341 3244 3 W contig368343 1 904 +
scaffold10511 3245 5704 4 N 2460 fragment yes
scaffold10511 5705 6500 5 W contig368344 1 796 +
scaffold10511 6501 7472 6 N 972 fragment yes
scaffold10511 7473 10932 7 W contig368345 1 3460 +
scaffold10511 10933 11579 8 N 647 fragment yes
scaffold10511 11580 15817 9 W contig368346 1 4238 +
scaffold10511 15818 19635 10 N 3818 fragment yes
scaffold10511 19636 21767 11 W contig368347 1 2132 +
scaffold10511 21768 22244 12 N 477 fragment yes
scaffold10511 22245 25733 13 W contig368348 1 3489 +
scaffold10511 25734 28642 14 N 2909 fragment yes
scaffold10511 28643 32182 15 W contig368349 1 3540 +
As I understand it, this scaffold is built from contig368342 through contig368349, so that answers the question ;)

Maybe you can also use grep -f :)
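To get the contigs for every scaffold at once instead of one grep per scaffold, the file can also be parsed directly. As far as I know, 454Scaffolds.txt follows the AGP layout, where column 5 is W for a contig line and N for a gap line, and column 6 of a W line holds the contig name. Below is a minimal sketch in R under that assumption; the three input lines are copied from the grep output above, and the variable names are only illustrative.

```r
# A few lines copied from the grep output above; a real run would instead use
# agp_lines <- readLines("454Scaffolds.txt")
agp_lines <- c(
  "scaffold10511 1 1251 1 W contig368342 1 1251 +",
  "scaffold10511 1252 2340 2 N 1089 fragment yes",
  "scaffold10511 2341 3244 3 W contig368343 1 904 +"
)

# Contig (W) and gap (N) lines have different numbers of fields,
# so split each line on whitespace instead of using read.table().
fields <- strsplit(trimws(agp_lines), "[[:space:]]+")

# Keep only lines whose 5th field is "W"; their 6th field is the contig name.
is_contig <- vapply(fields, function(f) f[5] == "W", logical(1))
scaffold  <- vapply(fields[is_contig], function(f) f[1], character(1))
contig    <- vapply(fields[is_contig], function(f) f[6], character(1))

# One character vector of contig names per scaffold, in file order.
contigs_by_scaffold <- split(contig, scaffold)
print(contigs_by_scaffold)
```

Looking up a single scaffold is then just contigs_by_scaffold[["scaffold10511"]].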

Related

I am unable to import in R a downloaded xls file

I am trying to directly import the .xls file that comes from this link (French electricity distributor).
Based on this question, I wrote the following code:
library(rio)
Chemin = "F:/DGTresor/00.Refontes/06.Electricite_HauteFrequence" #WhateverPath
## RTE mois en cours
temporaire <- tempfile()
download.file("https://eco2mix.rte-france.com/download/eco2mix/eCO2mix_RTE_En-cours-TR.zip",temporaire)
unzip(zipfile=temporaire,
files = "eCO2mix_RTE_En-cours-TR.xls",
exdir=Chemin)
RTE_EnCours <- import(paste0(Chemin,"/eCO2mix_RTE_En-cours-TR.xls"))
The file exists, but I am unable to read it. I get the following error: libxls error: Unable to open file
I am not sure why this happens, but when I try to open the .xls file manually, it gives an error like "The file format and its extension does not match". That suggests the file is not a real Excel workbook but tab-separated text saved with an .xls extension. To work around it, I changed the file extension to .csv with the code below.
file.rename(paste0(Chemin,"/eCO2mix_RTE_En-cours-TR.xls"), paste0(Chemin,"/eCO2mix_RTE_En-cours-TR.csv"))
After that, importing the file works,
# header = FALSE keeps the columns from shifting; the file is tab-separated
RTE_EnCours <- read.csv(paste0(Chemin,"/eCO2mix_RTE_En-cours-TR.csv"),sep="\t",header=FALSE,row.names=NULL)
# drop the last column, which is entirely NA
RTE_EnCours <- RTE_EnCours[,-ncol(RTE_EnCours)]
# use the first row as the column names
colnames(RTE_EnCours) <- as.character(unlist(RTE_EnCours[1,]))
# remove the first row, which now duplicates the header
RTE_EnCours <- RTE_EnCours[-1,]
head(RTE_EnCours)
gives,
Périmètre Nature Date Heures Consommation Prévision J-1 Prévision J Fioul Charbon Gaz Nucléaire Eolien Solaire Hydraulique
2 France Données temps réel 2020-10-01 00:00 46957 46500 47100 134 286 4524 35004 4327 0 4645
3 France Données temps réel 2020-10-01 00:15 46342 45350 45950 149 318 4727 35278 4336 0 4953
4 France Données temps réel 2020-10-01 00:30 44689 44200 44800 149 304 4380 34732 4428 0 4580
5 France Données temps réel 2020-10-01 00:45 43277 42950 43700 165 308 4244 34644 4528 0 4147
6 France Données temps réel 2020-10-01 01:00 42511 41700 42600 165 302 4012 34780 4488 0 4096
7 France Données temps réel 2020-10-01 01:15 42714 41650 42750 165 297 4114 35145 4630 0 3758
Pompage Bioénergies Ech. physiques Taux de Co2 Ech. comm. Angleterre Ech. comm. Espagne Ech. comm. Italie Ech. comm. Suisse
2 -751 1087 -2299 58 179 -914 -1732 -1283
3 -750 1055 -3724 59
4 -920 1045 -4009 58 179 -914 -1732 -1283
5 -1861 1048 -3946 59
6 -1857 1039 -4514 56 497 -1759 -2279 -2217
7 -2005 1037 -4427 57
Ech. comm. Allemagne-Belgique Fioul - TAC Fioul - Cogén. Fioul - Autres Gaz - TAC Gaz - Cogén. Gaz - CCG Gaz - Autres
2 -79 0 21 113 -2 585 3941 0
3 0 21 128 -1 580 4148 0
4 -159 0 21 128 -1 580 3801 0
5 0 21 144 -1 582 3663 0
6 1252 0 21 144 -1 579 3434 0
7 0 21 144 -1 581 3534 0
Hydraulique - Fil de l?eau + éclusée Hydraulique - Lacs Hydraulique - STEP turbinage Bioénergies - Déchets Bioénergies - Biomasse
2 3355 1288 2 183 447
3 3336 1615 2 174 435
4 3242 1338 0 174 434
5 3155 992 0 174 437
6 3060 1036 0 172 434
7 2992 766 0 177 436
Bioénergies - Biogaz
2 301
3 294
4 294
5 294
6 294
7 294
>
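As a side note, the rename itself should not be necessary: read.csv() and read.delim() only look at the file content, not the extension. Here is a small sketch of that point. The file written below is a made-up stand-in (a few values borrowed from the output above, saved as tab-separated text with an .xls name), not the real RTE download, which may still need the clean-up steps shown earlier.

```r
# Hypothetical stand-in for the downloaded file: tab-separated text
# saved under an .xls name in the session's temporary directory.
fake_xls <- file.path(tempdir(), "example_not_really_excel.xls")
writeLines(c("Date\tHeures\tConsommation",
             "2020-10-01\t00:00\t46957",
             "2020-10-01\t00:15\t46342"),
           fake_xls)

# read.delim() defaults to sep = "\t" and ignores the extension,
# so the file can be read as it is.
dat <- read.delim(fake_xls, header = TRUE, stringsAsFactors = FALSE)
str(dat)
```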

merging multiple p-values from Fisher test to the original data

I have run a Fisher test on every row of my data, which gives one p-value per row. How can I correctly attach the p-values to the original columns? I tried the following code, but the rows of the original data (d) do not line up with the p-values (e) in the merged data frame (f).
d <- read.table('test.txt', header = FALSE)
e <-apply(d,1, function(x) fisher.test(matrix(x,nr=2), alternative='greater')$p.value)
f <-merge(d,as.data.frame(e),by.x=0,by.y=0)
> d
V1 V2 V3 V4
1 1 839 63 222247
2 1 839 47 222263
3 1 839 299 222011
4 6 834 1821 220489
5 1 839 198 222112
6 1 839 324 221986
7 2 838 808 221502
8 3 837 935 221375
9 4 836 1723 220587
10 1 839 117 22219
> e
[1] 2.144749e-01 1.656028e-01 6.776690e-01 6.848409e-01 5.280300e-01 7.067099e-01 8.091576e-01 6.859446e-01
[9] 8.895988e-01 3.592658e-01
> f
Row.names V1 V2 V3 V4 e
1 1 1 839 63 222247 2.144749e-01
2 10 1 839 117 222193 3.592658e-01
3 11 6 834 850 221460 1.071752e-01
4 12 29 811 11625 210685 9.941101e-01
5 13 2 838 1231 221079 9.463472e-01
6 14 1 839 1236 221074 9.907043e-01
7 15 3 837 905 221405 6.647785e-01
8 16 3 837 793 221517 5.768163e-01
9 17 6 834 687 221623 4.906665e-02
10 18 1 839 226 222084 5.753710e-01
f <-cbind(d,e)
# V1 V2 V3 V4 e
#1 1 839 63 222247 0.2144749
#2 1 839 47 222263 0.1656028
#3 1 839 299 222011 0.6776690
#4 6 834 1821 220489 0.6848409
#5 1 839 198 222112 0.5280300
#6 1 839 324 221986 0.7067099
#7 2 838 808 221502 0.8091576
#8 3 837 935 221375 0.6859446
#9 4 836 1723 220587 0.8895988
#10 1 839 117 22219 0.9873172
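For what it is worth, the merge() call in the question does seem to pair each row with its own p-value; what changes is the row order, because merging on row names (by = 0) sorts them as text, so "10" comes right after "1". That matches the Row.names column in the printed f. A small sketch with toy data (the values are stand-ins, not real p-values):

```r
# Toy data: 12 rows so that the row names go past 9.
d <- data.frame(V1 = 1:12)
e <- (1:12) / 100          # stand-in "p-values", one per row of d

# Merging on row names sorts the result by row name as text.
f <- merge(d, as.data.frame(e), by = 0)
print(as.character(f$Row.names))

# Each row is still paired with its own value; only the order changed.
print(all(f$e == f$V1 / 100))

# Restore the original order by sorting on the numeric row name.
f_sorted <- f[order(as.integer(as.character(f$Row.names))), ]
print(f_sorted)
```

cbind(d, e) avoids the problem entirely because it never reorders anything, which is why it is the simpler choice when e was computed row by row from d.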

Not getting the correct degrees of freedom in R

I'm unsure what I'm doing wrong. This is the data that I'm using:
dtf <- read.table(text=
"Litter Treatment Tube.L
1 Control 1641
2 Control 1290
3 Control 2411
4 Control 2527
5 Control 1930
6 Control 2158
1 GH 1829
2 GH 1811
3 GH 1897
4 GH 1506
5 GH 2060
6 GH 1207
1 FSH 3395
2 FSH 3113
3 FSH 2219
4 FSH 2667
5 FSH 2210
6 FSH 2625
1 GH+FSH 1537
2 GH+FSH 1991
3 GH+FSH 3639
4 GH+FSH 2246
5 GH+FSH 1840
6 GH+FSH 2217", header=TRUE)
What I did was:
BoarsMod1 <- aov(Tube.L ~ Litter + Treatment, data=dtf)
anova(BoarsMod1)
I'm getting an incorrect number of degrees of freedom for litter. It should be 5 (as there are 6 litter blocks) but it is 1. Am I doing something wrong?
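A likely cause: read.table() reads Litter as a number, so aov() fits it as a single numeric covariate (one slope, 1 df) rather than as a blocking factor. Converting it with factor() should give the expected 5 df. A sketch using the same data, rebuilt with data.frame() so that it is self-contained; df_numeric and df_factor are just illustrative names:

```r
dtf <- data.frame(
  Litter    = rep(1:6, times = 4),
  Treatment = rep(c("Control", "GH", "FSH", "GH+FSH"), each = 6),
  Tube.L    = c(1641, 1290, 2411, 2527, 1930, 2158,
                1829, 1811, 1897, 1506, 2060, 1207,
                3395, 3113, 2219, 2667, 2210, 2625,
                1537, 1991, 3639, 2246, 1840, 2217)
)

# Litter is numeric here, so it enters the model as one slope (1 df).
df_numeric <- anova(aov(Tube.L ~ Litter + Treatment, data = dtf))["Litter", "Df"]

# As a factor, each litter is its own block level (6 - 1 = 5 df).
dtf$Litter <- factor(dtf$Litter)
BoarsMod1 <- aov(Tube.L ~ Litter + Treatment, data = dtf)
df_factor <- anova(BoarsMod1)["Litter", "Df"]

print(c(numeric = df_numeric, factor = df_factor))
```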

Weighted price in a data.frame with R

I have a weekly dataset of prices of a product. This product has many varieties, each with its own price. I am interested in calculating a weighted price depending on the sales volume of each.
I tried to do it with a loop, but it does not work.
Can someone help me?
Here is a minimal example of my dataset:
nrow week variety price volume
1 10 Semiduro 911 15550
2 10 Semiduro 809 13400
3 10 Semiduro 611 15200
4 10 Semiduro 517 17250
5 10 Semiduro 389 4550
6 10 Semiduro 300 1500
7 10 Paisana(o) 1100 19200
8 10 Paisana(o) 726 22900
9 10 Paisana(o) 452 10450
10 11 Semiduro 1362 13250
11 11 Semiduro 1163 7100
12 11 Semiduro 1032 15580
13 11 Semiduro 768 9700
14 11 Semiduro 703 3670
15 11 Semiduro 550 1450
16 11 Paisana(o) 1825 20200
17 11 Paisana(o) 1402 30650
18 11 Paisana(o) 838 9750
19 12 Semiduro 1050 11350
20 12 Semiduro 878 9200
We could use dplyr
library(dplyr)
df1 %>%
group_by(week, variety) %>%
summarise(wprice = weighted.mean(price, volume))
# week variety wprice
# <int> <chr> <dbl>
#1 10 Paisana(o) 808.1598
#2 10 Semiduro 673.5663
#3 11 Paisana(o) 1452.2574
#4 11 Semiduro 1048.4625
#5 12 Semiduro 972.9976
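If you would rather not load dplyr, the same number can be computed in base R. The weighted price of a group is sum(price * volume) / sum(volume), which is what weighted.mean() does. A sketch using only the week 10 rows from the question; the object names are illustrative:

```r
# Week 10 rows from the question.
df1 <- data.frame(
  week    = 10,
  variety = c(rep("Semiduro", 6), rep("Paisana(o)", 3)),
  price   = c(911, 809, 611, 517, 389, 300, 1100, 726, 452),
  volume  = c(15550, 13400, 15200, 17250, 4550, 1500, 19200, 22900, 10450)
)

# One data frame per week/variety combination that actually occurs.
groups <- split(df1, list(df1$week, df1$variety), drop = TRUE)

# Volume-weighted average price within each group.
wprice <- sapply(groups, function(g) sum(g$price * g$volume) / sum(g$volume))
print(round(wprice, 4))
```

For week 10 this gives about 808.16 for Paisana(o) and 673.57 for Semiduro, in line with the dplyr result above.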

R: trouble reading dates and time

I have some problems reading in dates and times properly, and I wonder why. The problem only occurs on my Windows installation of R. Running the exact same script on my UNIX installation works fine.
Basically, I want to read in a file with date and time in the second column, like this:
TrainData[[i]] = read.csv(TrainFiles[i],header=F, colClasses=c(NA,"POSIXct",rep(NA,8)))
colnames(TrainData[[i]])=c("comp","time","s1","s2","s3","s4","r1","r2","r3","r4")
However, only the dates are read, not the times, and my data looks like this:
comp time s1 s2 s3 s4 r1 r2 r3 r4
1 1 2009-08-18 711 630 69 600 689 20 40 1
2 5 2009-08-18 725 460 101 705 689 20 40 1
3 6 2009-08-18 711 505 69 678 689 20 40 1
4 1 2009-08-18 705 630 69 600 689 20 40 1
5 2 2009-08-18 734 516 101 671 689 20 40 1
6 3 2009-08-18 743 637 69 595 689 20 40 1
7 4 2009-08-18 730 577 101 633 689 20 40 1
8 2 2009-08-18 721 511 101 674 689 20 40 1
9 3 2009-08-18 747 563 101 642 689 20 40 1
10 4 2009-08-18 716 572 101 636 689 20 40 1
Running the exact same code on UNIX returned both dates and times.
When I read in another file in the same script, with dates and times in the two first columns, I get the correct format of the date/time:
TrainData[[i]]=read.csv(TrainFiles[i],header=F, colClasses=c("POSIXct","POSIXct",NA))
colnames(TrainData[[i]])=c("start","end","fault")
returns
start end fault
1 2010-10-24 04:25:53 2010-10-24 11:22:33 6
2 2010-10-30 12:57:16 2010-11-02 12:29:54 6
3 2010-11-05 10:40:17 2010-11-05 11:59:51 6
4 2010-11-05 17:07:37 2010-11-06 14:30:01 6
5 2010-11-06 23:59:59 2010-11-07 00:14:49 6
6 2010-11-06 23:59:59 2010-11-07 00:14:49 6
7 2010-11-06 23:59:59 2010-11-07 00:14:49 6
8 2010-11-06 23:59:59 2010-11-07 00:14:49 6
9 2010-11-06 23:59:59 2010-11-07 00:14:50 6
10 2010-11-06 23:59:47 2010-11-07 00:14:51 6
Actually, I found a solution that works, eventually, but I wonder why I get these problems.
It appears that my Sys.timezone is set to "Europe/Berlin". If I set this to NA, the times will be read in as well, i.e. using Sys.setenv(tz=NA). If I then run the same code, my data looks like this:
comp time s1 s2 s3 s4 r1 r2 r3 r4
1 1 2009-08-18 18:12:00 711 630 69 600 689 20 40 1
2 5 2009-08-18 18:14:27 725 460 101 705 689 20 40 1
3 6 2009-08-18 18:14:31 711 505 69 678 689 20 40 1
4 1 2009-08-18 18:14:43 705 630 69 600 689 20 40 1
5 2 2009-08-18 18:14:47 734 516 101 671 689 20 40 1
6 3 2009-08-18 18:14:51 743 637 69 595 689 20 40 1
7 4 2009-08-18 18:15:00 730 577 101 633 689 20 40 1
8 2 2009-08-18 18:29:33 721 511 101 674 689 20 40 1
9 3 2009-08-18 18:29:37 747 563 101 642 689 20 40 1
10 4 2009-08-18 18:29:45 716 572 101 636 689 20 40 1
The other file still gets times, but they are now consistently shifted by two hours.
This is what the CSV files look like (basically, text separated by commas):
1,2009-08-18 18:12:00,711,630,69,600,689,20,40,1
5,2009-08-18 8:14:27,725,460,101,705,689,20,40,1
6,2009-08-18 18:14:31,711,505,69,678,689,20,40,1
1,2009-08-18 18:14:43,705,630,69,600,689,20,40,1
2,2009-08-18 8:14:47,734,516,101,671,689,20,40,1
3,2009-08-18 18:14:51,743,637,69,595,689,20,40,1
4,2009-08-18 8:15:00,730,577,101,633,689,20,40,1
2,2009-08-18 8:29:33,721,511,101,674,689,20,40,1
3,2009-08-18 8:29:37,747,563,101,642,689,20,40,1
4,2009-08-18 8:29:45,716,572,101,636,689,20,40,1
Why am I having these problems with reading in the times? I would not expect tz=NA to be correct, but it is the only thing I found that works. Can anyone help me figure out why the times are ignored when tz = "Europe/Berlin"?
Is it generally advised to set tz=NA when reading files like this? Even though it seems to work for reading in the times, tz="NA" results in warning messages when I later want to work with the data:
Warning message:
In as.POSIXlt.POSIXct(x, tz) : unknown timezone 'NA'
Can anyone help me explain the differences I get?
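I cannot say for sure why the two platforms differ, but one way to take the guesswork out of it is to read the column as character and then parse it yourself with an explicit format and time zone, so that nothing depends on what read.csv() and the system TZ setting decide. A minimal sketch using two lines from the sample file; tz = "UTC" is only an assumption for the example, so substitute the zone the timestamps were actually recorded in (for example "Europe/Berlin").

```r
# Two lines taken from the sample file in the question.
csv_text <- c("1,2009-08-18 18:12:00,711,630,69,600,689,20,40,1",
              "6,2009-08-18 18:14:31,711,505,69,678,689,20,40,1")

# Read the timestamp column as plain text first.
TrainData <- read.csv(text = csv_text, header = FALSE,
                      colClasses = c(NA, "character", rep(NA, 8)))
colnames(TrainData) <- c("comp","time","s1","s2","s3","s4","r1","r2","r3","r4")

# Parse with an explicit format and time zone (tz here is an assumption).
TrainData$time <- as.POSIXct(TrainData$time,
                             format = "%Y-%m-%d %H:%M:%S", tz = "UTC")
print(format(TrainData$time, "%Y-%m-%d %H:%M:%S"))
```

With the real files, csv_text and the text = argument would be replaced by the file path, as in the original read.csv(TrainFiles[i], ...) call.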
