Incorrectly sorted numbers in R

I have a data frame with over 1000 rows.
1 chr1 23373262 23390014 NM_001081079 - Ogfrl1
10 chr1 43807075 43884553 NM_026430 - Uxs1
100 chr10 77069342 77081076 NM_001301671 + Sumo3
1000 chr5 137736931 137748902 NM_031405 - Srrt
1001 chr5 137737851 137737916 NR_106003 -
1002 chr5 137751126 137755469 NM_011639 - Trip6
1003 chr5 137755785 137774810 NM_031406 - Slc12a9
1004 chr5 138220145 138221335 NM_027242 + Ppp1r35
1005 chr5 138223133 138227929 NM_144913 - Mepce
1006 chr5 138229029 138263849 NM_001005426 + Zcwpw1
I want to sort the rows by row.names, which are the numbers in the first column above. My command is
mef_genes<-mef_genes[sort(as.integer(row.names(mef_genes)),decreasing=F),]
The result is ordered as shown above. I want a normal ascending order, i.e. 1, 2, 3, ...
Can anyone tell me how to do it?
Thank you very much

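Use order instead of sort. sort returns the sorted values themselves, and if the row names are the integers 1 to n, indexing with 1, 2, 3, ... just selects the rows by position, so the data frame comes back in its original order. order returns the permutation of row positions that puts the row names in ascending order: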
mef_genes <- mef_genes[order(as.integer(row.names(mef_genes))),]

Related

Error in using GADEM function from rGADEM package

I have a big peak list in BED format, which I converted to a GRanges object to use as input for the rGADEM package to find de novo motifs. But whenever I call the GADEM function I get the error below.
Could anybody who knows this package please help me with this error?
This is a small example of my real file, with only 20 rows.
1 chr6 29723590 29723790
2 chr14 103334312 103334512
3 chr1 150579030 150579230
4 chr7 76358527 76358727
5 chr6 11537891 11538091
6 chr14 49893256 49893456
7 chr5 179623200 179623400
8 chr1 228082831 228083031
9 chr12 93441644 93441844
10 chr10 3784776 3784976
11 chr3 183635833 183636033
12 chr7 975301 975501
13 chr12 123364510 123364710
14 chr1 1615578 1615778
15 chr1 36156320 36156520
16 chr14 55051781 55051981
17 chr8 11867697 11867897
18 chr22 38706135 38706335
19 chr6 44265256 44265456
20 chr1 185316658 185316858
and the code that I use is :
library(GenomicRanges)
library(rGADEM)
data = makeGRangesFromDataFrame(data, keep.extra.columns = TRUE)
data = reduce(data)
data = resize(data, width = 50, fix='center')
gadem<-GADEM(data,verbose=1,genome=Hsapiens)
plot(gadem)
and error is:
[ Retrieving sequences... Error in .Call2("C_solve_user_SEW", refwidths, start, end, width, translate.negative.coord:
solving row 136: 'allow.nonnarrowing' is FALSE and the supplied start (55134751) is > refwidth + 1 ]
I should mention that when I try an example input file with fewer than 136 rows, it works and I get motifs.
Thanks in advance.

Bash - one liner to sort a bed file on qvalue column then extract top 20% of rows with highest q value

I have a bed file in the following format:
chr start end q-value name
chr1 10004 10467 310.43 peak_1
chr2 15410 15704 19.61 peak_2
chr3 21207 21354 4.04 peak_3
chr4 26073 26165 25.32 peak_4
chr5 63044057 63044425 39.65 peak_5
If possible, I need a bash one-liner to sort this file on the q-value column (column 4), then I need to extract the top 20% of rows with the highest q-value.
After sorting this would look like:
chr start end q-value name
chr1 10004 10467 310.43 peak_1
chr5 63044057 63044425 39.65 peak_5
chr4 26073 26165 25.32 peak_4
chr2 15410 15704 19.61 peak_2
chr3 21207 21354 4.04 peak_3
After percentage it would look like:
chr1 10004 10467 310.43 peak_1
I need to run this on over 40 files.
I'm also familiar with R so if this is not possible in bash, but doable in R, R code would also be useful (but Bash is preferable).
Many Thanks.
Edit comments:
Made code more testable.
Re: my own attempt
When I first tried running sort -k4 file.txt, I got the following, which is not what I'm looking for:
chr2 15410 15704 19.61 peak_2
chr4 26073 26165 25.32 peak_4
chr1 10004 10467 310.43 peak_1
chr5 63044057 63044425 39.65 peak_5
chr3 21207 21354 4.04 peak_3
This confused me. I assume the decimals are causing an issue, and I'm not sure how to get around this first part.
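The ordering above is what a text comparison produces: without -n or -g, sort compares the key character by character, so 310.43 sorts before 39.65 because "1" comes before "9" at the second character. Also, -k4 without an end position means "from field 4 to the end of the line". The sketch below contrasts the two behaviours on the sample rows; the temporary file and the variable names are only for illustration.

```shell
# Hypothetical sample file holding the five data rows from the question (header omitted).
tmp=$(mktemp)
printf '%s\n' \
  'chr1 10004 10467 310.43 peak_1' \
  'chr2 15410 15704 19.61 peak_2' \
  'chr3 21207 21354 4.04 peak_3' \
  'chr4 26073 26165 25.32 peak_4' \
  'chr5 63044057 63044425 39.65 peak_5' > "$tmp"

# Text comparison on field 4 through the end of the line (what "sort -k4" does).
lexical=$(LC_ALL=C sort -k4 "$tmp" | cut -d' ' -f5 | paste -sd' ' -)

# Numeric, descending comparison restricted to field 4 only.
numeric=$(LC_ALL=C sort -k4,4nr "$tmp" | cut -d' ' -f5 | paste -sd' ' -)

echo "lexical: $lexical"
echo "numeric: $numeric"
rm -f "$tmp"
```

If the q-values could appear in scientific notation, -g (general numeric, available in GNU and BSD sort) would be the safer flag than -n.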
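Is this what you are looking for?
#!/bin/sh
# Sort on column 4 only, as general numeric values, largest first.
sort -r -g -k 4,4 < inputFile.file > tempfile_sorted.out
# Count the sorted lines and compute 20% of that, rounded to an integer.
lncnt=$(wc -l < tempfile_sorted.out)
percent_linecount_infloat=$(echo "$lncnt*.2" | bc)
float2Int=$(printf %.0f "$percent_linecount_infloat")
new_fn=$(printf "%s_20" tempfile_sorted.out) # new file with top 20% of sorted output
# Write the lines straight to the new file; passing the data to printf as a
# format string would mangle any % or \ and drop the final newline.
head -n "$float2Int" tempfile_sorted.out > "$new_fn"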
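For the 40-file case, the same idea can be condensed into a small function that skips the header line and needs no temporary files. This is only a sketch: the function name top20, the sample file, and the choice to round the 20% row count up (so that at least one row is returned) are assumptions, not part of the question.

```shell
# top20 FILE: print the top 20% of data rows by column 4, largest first.
# The header line is skipped; the row count is rounded up (an assumption).
top20() {
  n=$(( ($(tail -n +2 "$1" | wc -l) + 4) / 5 ))
  tail -n +2 "$1" | LC_ALL=C sort -k4,4nr | head -n "$n"
}

# Hypothetical sample file with the header and rows from the question.
tmp=$(mktemp)
printf '%s\n' \
  'chr start end q-value name' \
  'chr1 10004 10467 310.43 peak_1' \
  'chr2 15410 15704 19.61 peak_2' \
  'chr3 21207 21354 4.04 peak_3' \
  'chr4 26073 26165 25.32 peak_4' \
  'chr5 63044057 63044425 39.65 peak_5' > "$tmp"

result=$(top20 "$tmp")
echo "$result"
rm -f "$tmp"

# To process many files (file names here are placeholders):
# for f in *.bed; do top20 "$f" > "${f%.bed}_top20.bed"; done
```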

Sort multiple csv files within a directory based on two columns [duplicate]

This question already has answers here:
Sorting multiple keys with Unix sort
(7 answers)
Closed 7 years ago.
I have multiple .csv files in a directory called mydirectory. I want to sort all these files using some bash/awk/sed command first based on LeftChr column and then RightChr column and get the result.
>Id LeftChr LeftPosition LeftStrand LeftLength RightChr RightPosition RightStrand
1979 chr1 825881 - 252 chr2 5726723 -
5480 chr2 826313 + 444 chr2 5727501 +
5492 chr5 869527 + 698 chr2 870339 +
1980 chr2 1584550 - 263 chr1 1651034 -
5491 chr14 1685863 + 148 chr1 1686679 +
5490 chr1 1691382 + 190 chr1 1693020 +
result
>Id LeftChr LeftPosition LeftStrand LeftLength RightChr RightPosition RightStrand
5490 chr1 1691382 + 190 chr1 1693020 +
1979 chr1 825881 - 252 chr2 5726723 -
1980 chr2 1584550 - 263 chr1 1651034 -
5480 chr2 826313 + 444 chr2 5727501 +
5492 chr5 869527 + 698 chr2 870339 +
5491 chr14 1685863 + 148 chr1 1686679 +
awk 'h{NF+=0;print |"sort -t\" \" -k2.4n -k6.4n"}!h{print;h=1}' file | column -t
Id LeftChr LeftPosition LeftStrand LeftLength RightChr RightPosition RightStrand
5490 chr1 1691382 + 190 chr1 1693020 +
1979 chr1 825881 - 252 chr2 5726723 -
1980 chr2 1584550 - 263 chr1 1651034 -
5480 chr2 826313 + 444 chr2 5727501 +
5492 chr5 869527 + 698 chr2 870339 +
5491 chr14 1685863 + 148 chr1 1686679 +
This might work for you (GNU sed and sort):
sed '1b;/Id/d;s/chr//g' mydirectory/*.csv |
sort -k2,2n -k6,6n |
sed '1b;s/\S\+/chr&/2;s/\S\+/chr&/6' > outputFile
The first sed drops all but the first header line and removes the literal chr from all files. Its output is piped into sort, which orders the lines numerically by the second and sixth fields. That in turn is piped into a final sed command, which skips the first line (the header) and puts the literal chr back in front of the second and sixth fields.
Assuming you have access to a reasonable computing environment, the following should provide the foundation for what you are trying to do:
in=input.txt; head -n 1 "$in"; tail -n +2 "$in" | sort -k2,2 -k6,6
There are several potential issues, though. One is that the input file you have posted is not a "CSV" file in the usual sense. Another is whether you want a "stable sort" or not.
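A further point to watch is that a plain text sort puts chr14 before chr2, which is not the order shown in the expected result. The sketch below keeps the header in place and uses version sort to get chr1, chr2, chr5, chr14. It assumes a sort that supports -V (a GNU coreutils extension) and whitespace-separated columns; the function name sort_by_chr and the temporary file are only for illustration. The -s flag makes the sort stable, so rows with equal keys stay in input order.

```shell
# Hypothetical sample file with the header and rows from the question.
tmp=$(mktemp)
printf '%s\n' \
  '>Id LeftChr LeftPosition LeftStrand LeftLength RightChr RightPosition RightStrand' \
  '1979 chr1 825881 - 252 chr2 5726723 -' \
  '5480 chr2 826313 + 444 chr2 5727501 +' \
  '5492 chr5 869527 + 698 chr2 870339 +' \
  '1980 chr2 1584550 - 263 chr1 1651034 -' \
  '5491 chr14 1685863 + 148 chr1 1686679 +' \
  '5490 chr1 1691382 + 190 chr1 1693020 +' > "$tmp"

# sort_by_chr FILE: print the header first, then the rows ordered by
# LeftChr (field 2) and RightChr (field 6) using version sort.
sort_by_chr() {
  head -n 1 "$1"
  tail -n +2 "$1" | LC_ALL=C sort -s -k2,2V -k6,6V
}

# Collect the Id column of the sorted rows to make the order easy to check.
sorted_ids=$(sort_by_chr "$tmp" | tail -n +2 | cut -d' ' -f1 | paste -sd' ' -)
echo "$sorted_ids"
rm -f "$tmp"
```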
Load it into R and use order:
result <- yourdataname[order(yourdataname[,"LeftChr"], yourdataname[,"RightChr"]),]
If you have NAs in the sort columns and want to drop those rows:
result <- yourdataname[order(yourdataname[,"LeftChr"],yourdataname[,"RightChr"], na.last = NA),]
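None of the above answers worked for me, but I was able to get this done with something like this:
for x in *.csv; do
# Drop the ">" header line of this file, then version-sort on columns 2 and 6
# and numerically on column 3, using a comma as the field separator.
grep -v "^>" "$x" | sort -t ',' -k2,2V -k6,6V -k3,3n >"$x.tmp"
mv "$x.tmp" "$x"
done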

How to replace values in dataframe in R with translation table with minimal computational time?

I have the following biological data file.
#acgh_file
chromosome startPosition
chr1 37196
chr1 52308
chr1 357503
chr1 443361
chr1 530358
and I need to convert the positions by means of a translation table.
#convert
chr1 37196 chr1 47333
chr1 52308 chr1 62445
chr1 357503 chr1 367640
chr1 443361 chr1 453498
chr1 530358 chr1 540495
What needs to happen is that I have to replace the startPosition in the acgh_file with the value in fourth column of the convert table.
I wrote a script, but as the files are quite large it takes ages to finish (probably because for-loops are slow in R).
for (n in 1:nrow(convert)){
acgh_file[acgh_file$chromosome==convert[n,1] & acgh_file$startPosition==convert[n,2],3] <- convert[n,4]
}
I'm looking for a quicker solution here. Anybody have some ideas? I thought about doing something with the apply functions, but I don't know how to combine that when using this convert look-up table that I have here.
There is no need for a for-loop here (for-loops in R are slow mainly when they are used badly, for example to modify a data frame row by row). What you want is a merge of the two data sets. Since you have a big data.frame, I suggest using the data.table package to do the merge.
library(data.table)
setkey(acgh_file,chromosome,startPosition)
setkey(convert_file,V1,V2)
acgh_file[convert_file]
# chromosome startPosition V4
# 1: chr1 37196 47333
# 2: chr1 52308 62445
# 3: chr1 357503 367640
# 4: chr1 443361 453498
# 5: chr1 530358 540495
where both data sets are data.tables:
acgh_file <- fread("
chromosome startPosition
chr1 37196
chr1 52308
chr1 357503
chr1 443361
chr1 530358")
convert_file <- fread("
chr1 37196 chr1 47333
chr1 52308 chr1 62445
chr1 357503 chr1 367640
chr1 443361 chr1 453498
chr1 530358 chr1 540495")[,V3:=NULL]

R - retrieve specific information from several columns

I have a huge dataframe df which includes information about overlapping intervals (A) and (B) and on which chromosome (chrom) they were located. There is also information about a value (level of gene expression) observed over interval (A).
chrom value Astart Aend Bstart Bend
chr1 0 0 54519752 17408 17431
chr1 0 0 54519752 17368 17391
chr1 0 0 54519752 567761 567783
chr11 0 2 93466832 568111 568133
chr11 0 2 93466832 568149 568171
chr11 0 2 93466832 1880734 1880756
chr11 4 93466844 93466880 93466856 93466878
chr11 2 93466885 135006516 93466889 93466911
chr11 2 93466885 135006516 94199710 94199732
Note that the same interval may appear several times, for instance, an interval (B) will have been reported two times if it overlapped with two (A) intervals:
Astart(1)=========================Aend(1) Astart(2)========================Aend(2)
Bstart(1)=======================================Bend(1)
chrom value Astart Aend Bstart Bend
chr1 0 0 25 15 35 #A(1) and B(1) overlap
chr1 1 28 45 15 35 #A(2) and B(1) overlap
Likewise, an interval (A) will have been reported two or more times if it overlapped with two or more (B) intervals:
Astart(3)===================================================================Aend(3)
Bstart(2)=========Bend(2) Bstart(3)===========Bend(3) Bstart(4)===============Bend(4)
chrom value Astart Aend Bstart Bend
chr4 0 10 100 15 25 #A(3) and B(2) overlap
chr4 0 10 100 30 75 #A(3) and B(3) overlap
chr4 3 10 100 80 120 #A(3) and B(4) overlap
My goal is to output all the individual positions from intervals (B) and the corresponding values from (A). I have a piece of code that beautifully outputs all the relevant positions in (B):
position <- unlist(mapply(seq, ans$Bstart, ans$Bend - 1))
> head(position)
[1] 17408 17409 17410 17411 17412 17413
The problem with this is that it is not enough to retrieve the chromosome information back from there. I need to check chromosome information AND position at the same time when I list these positions. That is because the same position integer may occur on several chromosomes, so I can't afterwards just run something like for position %in% range(Astart, Aend) output $chrom, $value (dummy code).
How can I retrieve (chrom, position, value) at the same time?
The expected result would be something like this:
> head(expected_result)
chrom position value
chr1 17408 0
chr1 17409 0
chr1 17410 0
chr1 17411 0
chr1 17412 0
chr1 17413 0
#skipping some lines to show another part of the dataframe
chr11 93466856 4
chr11 93466857 4
A call to ddply might be more elegant, but the logic would be the same:
dfA = read.table(textConnection("chrom value Astart Aend Bstart Bend
chr1 0 0 54519752 17408 17431
chr1 0 0 54519752 17368 17391
chr1 0 0 54519752 567761 567783
chr11 0 2 93466832 568111 568133
chr11 0 2 93466832 568149 568171
chr11 0 2 93466832 1880734 1880756
chr11 4 93466844 93466880 93466856 93466878
chr11 2 93466885 135006516 93466889 93466911
chr11 2 93466885 135006516 94199710 94199732"), header = TRUE)
dfB = as.data.frame(do.call(rbind,
apply(dfA, MARGIN = 1, FUN = function(x) {
cbind(mapply(seq,
as.numeric(x['Bstart']),
as.numeric(x['Bend']) - 1),
x['chrom'], x['value'])
}
)))
lapply(dfB, typeof)
