How to know if there is a different element in one array in Scilab?

My goal is to check if there are misplaced objects in one array.
For example, the array is:
2.
2.
2.
2.
2.
1.
3.
1.
3.
3.
3.
1.
3.
1.
1.
1.
1.
I want to know whether the first 5 elements, elements 6 to 13, and elements 14 to 17 are each the same.
The purpose of this is to identify the misplaced elements in a clustering solution.
For the first 5 elements I have tried:
ISet=5
IVer=7
IVir=5
numMisp=0  // counter for misplaced elements (needs to be initialized before the loops)
for i=1:ISet
    if(isequal(FIRSTMIN(i,1,2),FIRSTMIN(i+1,1,2))==%f)
        numMisp=numMisp+1
        mprintf("Set misp: %i",numMisp)
    end
end
For elements 6 to 13:
for i=ISet+1:IVer+ISet-1
    if(isequal(FIRSTMIN(i,1,2),FIRSTMIN(i+1,1,2))==%f)
        mprintf("%i %i Ver misp: %i\n",FIRSTMIN(i,1,2),FIRSTMIN(i+1,1,2),i)
        numMisp=numMisp+1
    end
end
For elements 14 to 17:
for i=IVer+ISet:IVer+IVir-1
    if(isequal(FIRSTMIN(i,1,2),FIRSTMIN(i+1,1,2))==%f)
        mprintf("%i %i Ver misp: %i\n",FIRSTMIN(i,1,2),FIRSTMIN(i+1,1,2),i)
        numMisp=numMisp+1
        mprintf("Vir misp: %i",i)
    end
end

You can use unique for that purpose. For example, the following test checks whether the first five elements are the same:
x=[2 2 2 2 2 1 3 1 3 3 3 1 3 1 1 1 1];
if length(unique(x(1:5))) == 1
    // all of the first five elements are equal
end
You can do the same for the other clusters by replacing 1:5 with 6:13 and then 14:17.
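If it helps to see all three checks in one place, here is the same idea sketched in Python (the cluster boundaries are the ones from the question; this is an illustration, not Scilab code):
# Same idea as the unique-based check above: a cluster is "clean"
# when all of its labels are identical.
x = [2, 2, 2, 2, 2, 1, 3, 1, 3, 3, 3, 1, 3, 1, 1, 1, 1]
clusters = [(0, 5), (5, 13), (13, 17)]   # Python slices for elements 1-5, 6-13, 14-17

for start, end in clusters:
    labels = x[start:end]
    if len(set(labels)) == 1:
        print("elements %d-%d: all the same" % (start + 1, end))
    else:
        print("elements %d-%d: contain misplaced elements" % (start + 1, end))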

Related

Subtracting lines in one file from another file

I couldn't find an answer that truly subtracts one file from another.
My goal is to remove lines in one file that occur in another file.
Multiple occurrences should be respected, which means, for example, that if a line occurs 4 times in file A and only once in file B, file C should contain 3 of those lines.
File A:
1
3
3
3
4
4
File B:
1
3
4
File C (desired output)
3
3
4
Thanks in advance
In awk:
$ awk 'NR==FNR{a[$0]--;next} ($0 in a) && ++a[$0] > 0' f2 f1
3
3
4
Explained:
NR==FNR {                  # for each record in the first file given (f2)
    a[$0]--;               # decrement a[value] for each occurrence (counts start at 0)
    next
}
($0 in a) && ++a[$0] > 0   # for records of f1: if the value exists in a, increment a[value];
                           # once the count from f2 is used up, the record is printed
If you also want to print items in f1 that do not appear in f2 at all, you can drop ($0 in a) &&:
$ echo 5 >> f1
$ awk 'NR==FNR{a[$0]--;next} (++a[$0] > 0)' f2 f1
3
3
4
5
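For comparison, the same multiset subtraction can be sketched in Python with collections.Counter (the file names f1 and f2 are the ones used above):
from collections import Counter

# Count how many times each line occurs in f2 (the file to subtract).
with open("f2") as f:
    budget = Counter(line.rstrip("\n") for line in f)

# Stream f1 and print a line only once its "budget" from f2 is used up.
with open("f1") as f:
    for line in f:
        key = line.rstrip("\n")
        if budget[key] > 0:
            budget[key] -= 1      # consume one occurrence from f2
        else:
            print(key)            # surplus occurrence: keep it
Like the second awk one-liner, this also keeps lines of f1 that never appear in f2, and it preserves the original order and multiplicities.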
If the input files are already sorted, as in the sample, comm is better suited:
$ comm -23 f1 f2
3
3
4
Option descriptions from the man page:
-2 suppress column 2 (lines unique to FILE2)
-3 suppress column 3 (lines that appear in both files)
You can do:
awk 'NR==FNR{++cnt[$1]
             next}
     cnt[$1]-->0{next}
     1' f2 f1

Invalid values in an array when executing a for loop in R

I am new to R and stuck on a very naive thing. I am getting NA values in the count array after executing the following code:
i=1
j=2
l=1
count=0
while(j<length(positions)){
    a=positions[i]
    b=positions[j]
    for(k in a:b){
        if(y$feature[k]==x$feature[l]){
            count[l]=count[l]+1
        }
    }
    i=i+2
    j=j+2
    l=l+1
}
For reference, y and x data frames are as follows:
y data frame
positions id feature
1 1 45128
2 1 28901
3 1 48902
. .
. .
. .
. .
2344 1 45579
2345 2 37689
2346 2 45547
. .
. .
5677 2 12339
5678 3 98034
5679
.
.
x data frame:
id feature
1 28901
2 23498
3 98906
. .
. .
. .
I have inserted into the positions array the index at which each new id starts and the index at which it ends.
positions is an array of the form [1,2344,2345,5677,5678,7390,7391,...]. I step through it in pairs, with i taking the values 1,3,5,... and j taking 2,4,6,... If y$feature and x$feature match, I increment count[l].
So the first feature of x is compared with all features in y with id=1, the second feature of x with all features in y with id=2, and so on. When they match, count[l] is incremented. i and j are incremented by 2 so that they start at the correct positions. But I only get a valid answer for count[1]; all the other values are NA.
Please explain why this happens and show a valid way to do this using loops.
It's because you are trying to add 1 to a nonexistent value count[l]. You start out with count<-0, so count is of length one. There is no count[2], so a reference to count[2] returns NA. Then (when l = 2 in your loop), NA + 1 returns NA.
If you initialize count<-rep(0,length(positions)) this particular problem will go away.
Meanwhile, you can vectorize your operations quite a lot. I believe you can replace the k-loop with
count[l] <- sum(y$feature[a:b]==x$feature[l])
for one example.
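The same pattern (preallocate one counter per block, then count matches with a vectorized comparison) looks like this in Python with NumPy; the small arrays below are made-up stand-ins for y$feature, x$feature and positions:
import numpy as np

# Made-up stand-ins for the data in the question.
y_feature = np.array([45128, 28901, 48902, 45579, 37689, 45547, 12339])
x_feature = np.array([28901, 45547])
positions = [1, 4, 5, 7]                 # 1-based (start, end) pairs, as in the question

# Preallocate one counter per (start, end) pair, then count matches per block.
count = np.zeros(len(positions) // 2, dtype=int)
for l, (a, b) in enumerate(zip(positions[0::2], positions[1::2])):
    block = y_feature[a - 1:b]           # convert the 1-based inclusive range to a slice
    count[l] = np.sum(block == x_feature[l])

print(count)                             # -> [1 1]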

Make a file counting instances in sets of 5

I have a file that looks like this:
1 rs531842 503939 61733 G A
1 rs10494103 35025 114771 C T
1 rs17038458 254490 21116837 G A
1 rs616378 525783 21127670 T C
1 rs3845293 432526 21199392 A C
2 rs16840461 233620 157112959 A G
2 rs1560628 224228 157113214 T C
2 rs17200880 269314 257145829 C T
2 rs10497165 35844 357156412 C T
2 rs7607531 624696 457156575 T C
...with column 1 stretching on to 22, and several thousand entries in total.
I want to create files that list the 5-million-wide bins of column 4 which contain data, separated by column 1.
Basically, all but columns 1 and 4 can be discarded. A simple input would look like this:
InputChr1:
61733
114771
21116837
21127670
21199392
InputChr2:
157112959
157113214
257145829
357156412
457156575
So, for the example above, I would want to get two files that look like this:
OutputChr1.txt
Start End Occurrences
1 5000000 2
20000001 25000000 3
OutputChr2.txt
Start End Occurrences
155000001 160000000 2
255000001 260000000 1
355000001 360000000 1
455000001 460000000 1
Any ideas? It seems like something that should be doable with lapply in R, but I can't get the for loops to work...
EDIT: Actually, I made this look much harder than it needed to be - basically, I want to split the original file by column 1, extract the data in column 4, and then count the instances in bins of 5 million.
(Apologies for slightly random tags, just trying to think of which tools might be best!)
Well, this turned out to be quite challenging. I couldn't find a way to do it with a single awk command, though.
awk -v const=5000000 -v max=150 '
  {a[$1,int($4/const)]++; b[$1]}
  END{for (i in b)
        {for (j=0; j<max; j++)
          print i, j*const +1, (j+1)*const, a[i,j]
        }
  }' file
And then to get only the results:
awk 'NF==4'
Explanation
-v const=5000000 -v max=150 set the variables. const is the 5-million value used to split the results. max is the largest bin index up to which we look for data in the END block.
a[$1,int($4/const)]++ creates an array indexed by (1st field, bin of the 4th field). Note that int($4/const) maps 23432 --> 0, 6000000 --> 1, etc. That is, it determines which block of values each 4th-column entry falls into.
b[$1] keeps track of the first-column values that have been processed.
END{for (i in b) {for (j=0; j<max; j++) print i, j*const +1, (j+1)*const, a[i,j]}} prints the values.
awk 'NF==4' prints only those lines that have 4 columns. This way it outputs only the bins for which there were matches.
In case you want to store the values into a new file, you can do
awk 'NF==4 {print > ("OutputChr" $1 ".txt")}'
Sample output
$ awk -v const=5000000 -v max=150 '{a[$1,int($4/const)]++; b[$1]} END{for (i in b) {for (j=0; j<max; j++) print i, j*const +1, (j+1)*const, a[i,j]}}' a | awk 'NF==4'
1 1 5000000 2
1 20000001 25000000 3
2 155000001 160000000 2
2 255000001 260000000 1
2 355000001 360000000 1
2 455000001 460000000 1
All in one
awk '{ v=int($4/const)
a[$1 FS v]++
min[$1]=min[$1]<v?min[$1]:v # get the Minimum of column $4 for group $1
max[$1]=max[$1]>v?max[$1]:v # get the Maximum of column $4 for group $1
}END{ for (i in min)
for (j=min[i];j<=max[i];j++) # set the for loop, and use the min and max value.
if (a[i FS j]!="") print j*const+1,(j+1)*const,a[i FS j] > "OutputChr" i ".txt" # if data exists for this bin, print it to the file "OutputChr" i ".txt"
}' const=5000000 file
result:
$ cat OutputChr1.txt
1 5000000 2
20000001 25000000 3
$ cat OutputChr2.txt
155000001 160000000 2
255000001 260000000 1
355000001 360000000 1
455000001 460000000 1
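For reference, the same binning logic can be sketched in Python; the input file name file and the OutputChr<N>.txt naming follow the awk examples above:
from collections import defaultdict

BIN = 5_000_000
counts = defaultdict(lambda: defaultdict(int))   # chromosome -> bin index -> occurrences

with open("file") as fh:                         # same input file as in the awk examples
    for line in fh:
        fields = line.split()
        chrom, pos = fields[0], int(fields[3])
        counts[chrom][pos // BIN] += 1

# One OutputChr<N>.txt per chromosome, listing only the non-empty bins.
for chrom, bins in counts.items():
    with open("OutputChr%s.txt" % chrom, "w") as out:
        for b in sorted(bins):
            out.write("%d %d %d\n" % (b * BIN + 1, (b + 1) * BIN, bins[b]))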

Check if a column has a value and if so write TRUE or FALSE to the column next to it

I was wondering how to make something that checks whether the column Lair in the data
is below or above a certain threshold; say below 0.5 is called LOH and
above is called imbalance. The calls LOH and INBALANCE should then be written to a new column. I tried something like the code below.
detection<-function(assay,method,thres){
    if(method=="threshold"){
        idx<-ifelse(segmenten["intensity"]<1.1000000 & segmenten["intensity"]>0.900000 & segmenten["Lair"]>thres,TRUE,FALSE)
    }
    if(method=="cnloh"){
        idx<-ifelse(segmenten["intensity"]<1.1000000 & segmenten["intensity"]>0.900000 & segmenten["Lair"]<thres,TRUE,FALSE)
    }
    if(method=="gain"){
        idx<-ifelse(segmenten["intensity"]>1.1000000 & segmenten["Lair"]<thres,TRUE,FALSE)
    }
    if(method=="loss"){
        idx<-ifelse(segmenten["intensity"]<0.900000 & segmenten["Lair"]<thres,TRUE,FALSE)
    }
    if(method=="bloss"){
        idx<-ifelse(segmenten["intensity"]<0.900000 & segmenten["Lair"]>thres,TRUE,FALSE)
    }
    if(method=="bgain"){
        idx<-ifelse(segmenten["intensity"]>1.100000 & segmenten["Lair"]>thres,TRUE,FALSE)
    }
    return(idx)
}
After this part the next step is to write the data from the function to the existing table.
Does anyone have an idea?
Since your desired result is not entirely clear, I made some assumptions and wrote something that may or may not be useful.
First of all, inside your function there is an object segmenten which is not defined; I suppose this is the data set supplied as input. You also used ifelse, so the returned results are TRUE or FALSE, but you want either LOH or INBALANCE when certain conditions are met.
You want INBALANCE when ... & segmenten["Lair"]>thres and LOH otherwise (here ... means the other part of the condition). This will give a vector, but you want it in the main dataset as an additional column, don't you? So maybe this could be a new starting point for you to improve your code.
detection <- function(assay, method=c('threshold', 'cnloh', 'gain', 'loss', 'bloss', 'bgain'),
                      thres=0.5){
    x <- assay
    idx <- switch(match.arg(method),
                  threshold = ifelse(x["intensity"]<1.1 & x["intensity"]>0.9 & x["Lair"]>thres, 'INBALANCE', 'LOH'),
                  cnloh     = ifelse(x["intensity"]<1.1 & x["intensity"]>0.9 & x["Lair"]<thres, 'LOH', 'INBALANCE'),
                  gain      = ifelse(x["intensity"]>1.1 & x["Lair"]<thres, 'LOH', 'INBALANCE'),
                  loss      = ifelse(x["intensity"]<0.9 & x["Lair"]<thres, 'LOH', 'INBALANCE'),
                  bloss     = ifelse(x["intensity"]<0.9 & x["Lair"]>thres, 'INBALANCE', 'LOH'),
                  bgain     = ifelse(x["intensity"]>1.1 & x["Lair"]>thres, 'INBALANCE', 'LOH'))
    colnames(idx) <- 'Checking'
    return(cbind(x, as.data.frame(idx)))
}
Example:
Data <- read.csv("japansegment data.csv", header=T)
result <- detection(Data, method='threshold', thres=0.5) # 'threshold' is the default value for method
head(result)
SNP_NAME x0 x1 y pos.start pos.end chrom count copynumber intensity allele.B Lair uncertain sample_id
1 SNP_A-1656705 0 0 0 836727 27933161 1 230 2 1.0783 1 0.9218 FALSE GSM288035
2 SNP_A-1677548 0 0 0 28244579 246860994 1 4408 2 0.9827 1 0.9236 FALSE GSM288035
3 SNP_A-1669537 0 0 0 100819 159783145 2 3480 2 0.9806 1 0.9193 FALSE GSM288035
4 SNP_A-1758569 0 0 0 159783255 159791136 2 5 2 1.7244 1 0.9665 FALSE GSM288035
5 SNP_A-1662168 0 0 0 159817465 168664268 2 250 2 0.9786 1 0.9197 FALSE GSM288035
6 SNP_A-1723506 0 0 0 168721411 168721920 2 2 2 1.8027 -4 NA FALSE GSM288035
Checking
1 INBALANCE
2 INBALANCE
3 INBALANCE
4 LOH
5 INBALANCE
6 LOH
Using the match.arg and switch functions will help you avoid a lot of if statements.
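The same dispatch-table idea can be sketched outside R as well; here is a rough Python analogue with one table entry per method instead of a chain of if statements (the thresholds and labels mirror the answer above, and the scalar inputs are a simplification of the data frame columns):
RULES = {
    # method -> (condition on (intensity, Lair, thres), label if true, label if false)
    "threshold": (lambda i, l, t: 0.9 < i < 1.1 and l > t, "INBALANCE", "LOH"),
    "cnloh":     (lambda i, l, t: 0.9 < i < 1.1 and l < t, "LOH", "INBALANCE"),
    "gain":      (lambda i, l, t: i > 1.1 and l < t,       "LOH", "INBALANCE"),
    "loss":      (lambda i, l, t: i < 0.9 and l < t,       "LOH", "INBALANCE"),
    "bloss":     (lambda i, l, t: i < 0.9 and l > t,       "INBALANCE", "LOH"),
    "bgain":     (lambda i, l, t: i > 1.1 and l > t,       "INBALANCE", "LOH"),
}

def detect(intensity, lair, method="threshold", thres=0.5):
    if method not in RULES:                       # plays the role of match.arg
        raise ValueError("unknown method: %r" % method)
    condition, if_true, if_false = RULES[method]
    return if_true if condition(intensity, lair, thres) else if_false

print(detect(1.0783, 0.9218))                     # -> INBALANCE, as in row 1 of the output above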

How to find the 5 closest numbers in a matrix, with their positions?

I have a matrix as follows
> y
1 2 3
1 0.8802216 1.2277843 0.6875047
2 0.9381081 1.3189847 0.2046542
3 1.3245534 0.8221709 0.4630722
4 1.2006974 0.8890464 0.6710844
5 1.2344071 0.8354292 0.7259998
6 1.1670665 0.9214787 0.6826173
7 0.9670581 1.1070461 0.7742342
8 0.8867365 1.2160533 0.7024281
9 0.8235792 1.4424190 0.2030302
10 0.8821301 1.0541099 1.2279813
11 1.1958634 0.9708839 0.4297043
12 1.3542734 0.7747481 0.5119648
13 0.4385487 0.3588158 4.9167998
14 0.8530141 1.3578511 0.3698620
15 0.9651803 0.8426226 1.6132899
16 0.8854192 1.2272616 0.6715839
17 0.7779642 0.8132233 2.3386331
18 0.9936722 1.1629110 0.5083558
19 1.1235897 1.0018480 0.5764672
20 0.7887222 1.3101684 0.7373181
21 2.2276176 0.0000000 0.0000000
I found one clue, but it only gives the position of the single closest value in the whole matrix:
n<-read.table(file.choose(),header=T)
y<-n[,c("1","2","3")]
my.number=1.12270420185886
z<-abs(y-my.number)==min(abs(y-my.number))
which(z)
[1] 19
I want to find at least the 5 closest values, with their row and column numbers too; in other words, I want the 5 closest single values from the matrix together with their positions.
I don't know what language it is; is it R?
In a procedural language, I would save all values in a map from value to position, i.e. value -> (row, col), for example 0.880... -> (1, 1), then sort by value.
Then iterate over i from 1 to size-5, take the difference between the values at sorted positions i and i+4 (a window of five values), search for the minimum of those differences, and read off the corresponding values and their positions.
Here is a solution in Scala:
val matrix = """1 0.8802216 1.2277843 0.6875047
2 0.9381081 1.3189847 0.2046542
3 1.3245534 0.8221709 0.4630722
4 1.2006974 0.8890464 0.6710844
5 1.2344071 0.8354292 0.7259998
6 1.1670665 0.9214787 0.6826173
7 0.9670581 1.1070461 0.7742342
8 0.8867365 1.2160533 0.7024281
9 0.8235792 1.4424190 0.2030302
10 0.8821301 1.0541099 1.2279813
11 1.1958634 0.9708839 0.4297043
12 1.3542734 0.7747481 0.5119648
13 0.4385487 0.3588158 4.9167998
14 0.8530141 1.3578511 0.3698620
15 0.9651803 0.8426226 1.6132899
16 0.8854192 1.2272616 0.6715839
17 0.7779642 0.8132233 2.3386331
18 0.9936722 1.1629110 0.5083558
19 1.1235897 1.0018480 0.5764672
20 0.7887222 1.3101684 0.7373181
21 2.2276176 0.0000000 0.0000000"""
// split block of text into lines
val lines=matrix.split ("\n")
// split lines into words
val rows = lines.map (l => l.split (" +"))
// remove the index from the beginning (1, 2, ... 21) and
// transform values from Strings to double numbers
// triples is: Array(Array(0.8802216, 1.2277843, 0.6875047), Array(0.9381081, 1.3189847, 0.2046542),
val triples = rows.map (_.tail).map(triple=> triple.map (_.toDouble))
// generate an own index for the rows and columns
// elems is: elems: Array[Array[(Double, (Int, Int))]] = Array(Array((0.8802216,(0,0)), (1.2277843,(0,1)), (0.6875047,(0,2))), Array((0.9381081,(1,0)), ...
val elems = triples.zipWithIndex.map {t=> t._1.zipWithIndex.map (vc=> (vc._1 -> (t._2, vc._2)))}
// sorted = Array((0.0,(20,1)), (0.0,(20,2)), (0.2030302,(8,2)), (0.2046542,(1,2)),
val sorted = elems.flatten.sortBy (e => e._1)
// delta5 = List(0.3588158, 0.369862, 0.2266741, 0.2338945, 0.10425639999999997, 0.1384938,
val delta5 = sorted.sliding (5, 1).map (q => q(4)._1-q(0)._1).toList
val minindex = delta5.indexOf (delta5.min) // minindex: Int = 29, delta5.min = 0.008824799999999966
// we found the smallest interval of 5 values beginning at index 29:
(29 to 29 +5).map (sorted (_))
res568: scala.collection.immutable.IndexedSeq[(Double, (Int, Int))] =
Vector( (0.8802216,(0,0)),
(0.8821301,(9,0)),
(0.8854192,(15,0)),
(0.8867365,(7,0)),
(0.8890464,(3,1)),
(0.9214787,(5,1)))
Since Scala counts rows from 0 to 20 and columns from 0 to 2, whereas your row index runs from 1 to 21 and your column index from 1 to 3, you have to add (1, 1) to each of the positions: (1, 1), (10, 1), and so on.
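If what you are after is instead the 5 values closest to a given target number (as in the which(z) attempt in the question), a NumPy sketch of that variant could look like this; the first, second and last matrix rows are copied from the question and the rest are omitted:
import numpy as np

# The matrix from the question (21 rows, 3 columns); labels are 1-based.
y = np.array([
    [0.8802216, 1.2277843, 0.6875047],
    [0.9381081, 1.3189847, 0.2046542],
    # ... remaining rows omitted for brevity ...
    [2.2276176, 0.0000000, 0.0000000],
])

my_number = 1.12270420185886

# Rank every cell by its distance to my_number and take the 5 smallest.
order = np.argsort(np.abs(y - my_number), axis=None)[:5]
rows, cols = np.unravel_index(order, y.shape)

for r, c, v in zip(rows + 1, cols + 1, y[rows, cols]):   # +1 for 1-based labels
    print("value %.7f at row %d, column %d" % (v, r, c))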
