How to define a column class in R, dataframe - r

I have a data frame like this
scf0001 123 4567 - 4350
scf0001 4474 4878 + 376 *
scf0002 5375 10571 c 5006
scf0003 11370 16986 c 5536
scf0003 16256 17000 - 789 *
When I attempt to read it as a table in R, the error is "line 2 did not have 5 elements", so it is not recognizing the star (*). I tried to read it as a string with na.string="*", but it didn't work, so I think I have to define it with colClasses=. I tried something like colClasses=[, c(6)="character"], but it doesn't work: Error: unexpected '['
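The error comes from the rows having different numbers of fields: most have five, but the rows ending in * have six. In R itself, read.table() has a fill = TRUE argument that pads short rows, which may be worth trying before touching colClasses. The underlying idea can be sketched in Python (used here only for illustration; padding with an empty string is an assumption, not something from the question):

```python
# Sample rows copied from the question; two of them carry a trailing "*" flag.
raw = """scf0001 123 4567 - 4350
scf0001 4474 4878 + 376 *
scf0002 5375 10571 c 5006
scf0003 11370 16986 c 5536
scf0003 16256 17000 - 789 *"""

# Split every line on whitespace, so rows may have 5 or 6 fields.
rows = [line.split() for line in raw.splitlines()]

# The widest row decides how many columns the table needs.
width = max(len(r) for r in rows)

# Pad the short rows with empty strings so every row has the same length.
padded = [r + [""] * (width - len(r)) for r in rows]

for r in padded:
    print(r)
```

After padding, the sixth column holds "*" where the flag was present and an empty string elsewhere.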

Related

chromoMap error - subscript out of bounds

I am new to this community and hope my post is correct.
I tried to run a test case on chromoMap using the chromosome file (chrom) and annotation file (anno) below
> read.table(chrom)
V1 V2 V3
1 7 1 1000
> read.table(anno)
V1 V2 V3 V4
1 An1 7 10 30
2 An2 7 15 40
Unfortunately, I constantly ran into an error shown below. I tried to have a look into the code, but was not able to figure out the problem.
> chromoMap(chrom,anno)
********************************** __ __ ************
** __**|__ * __* __ * __ __ * __ *| | |* __ * __ **
**|__**| |*| *|__|*| | |*|__|*| | |*|_ |*|__|**
***********************************************| **
*****************************************************
OUTPUT:
Number of Chromosome sets: 1
Number of Chromosomes in set 1 : 1
Processing data..
Number of annotations in data set 1 : 2
Error in temp.list[[inputData[[h]]$ch_name[i]]] : subscript out of bounds
Also the columns seemed to be recognized correctly.
> ncol(read.table(anno))
[1] 4
> ncol(read.table(chrom))
[1] 3
I know it's a trivial problem, but I would be happy for any suggestions.
Thanks!
When using R objects as input, you need to pass them within a list, like:
chromoMap(list(chrom),list(anno))
This allows passing the objects for multiple ploidy within a list. For instance, for ploidy = 2, you can use:
chromoMap(list(chrom1,chrom2),list(anno1,anno2))
Thanks!
Try to change your chromosome name from 7 to VII.

Counting observations using multiple BY groups SAS

I am examining prescription patterns within a large EHR dataset. The data is structured so that we are given several key bits of information, such as patient_num, encounter_num, ordering_date, medication, age_event (age at event) etc. Example below:
Patient_num enc_num ordering_date medication age_event
1111 888888 07NOV2008 Wellbutrin 48
1111 876578 11MAY2011 Bupropion 50
2222 999999 08DEC2009 Amitriptyline 32
2222 999999 08DEC2009 Escitalopram 32
3333 656463 12APR2007 Imipramine 44
3333 643211 21DEC2008 Zoloft 45
3333 543213 02FEB2009 Fluoxetine 45
Currently I have the dataset sorted by patient_id then by ordering_date so that I can see what each individual was prescribed during their encounters in a longitudinal fashion. For now, I am most concerned with the prescription(s) that were made during their first visit. I wrote some code to count the number of prescriptions and had originally restricted later analyses to RX = 1, but as we can see, that doesn't work for people with multiple scripts on the same encounter (Patient 2222).
data pt_meds_;
set pt_meds;
by patient_num;
if first.patient_num then RX = 1;
else RX + 1;
run;
Patient_num enc_num ordering_date medication age_event RX
1111 888888 07NOV2008 Wellbutrin 48 1
1111 876578 11MAY2011 Bupropion 50 2
2222 999999 08DEC2009 Amitriptyline 32 1
2222 999999 08DEC2009 Escitalopram 32 2
3333 656463 12APR2007 Imipramine 44 1
3333 643211 21DEC2008 Zoloft 45 2
3333 543213 02FEB2009 Fluoxetine 45 3
I think it would be more appropriate to recode the encounter numbers into a new variable so that they reflect a style similar to the RX variable. Where each encounter is listed 1-n, and the number will repeat if multiple scripts are made in the same encounter. Such as below:
Patient_num enc_num ordering_date medication age_event RX Enc_
1111 888888 07NOV2008 Wellbutrin 48 1 1
1111 876578 11MAY2011 Bupropion 50 2 2
2222 999999 08DEC2009 Amitriptyline 32 1 1
2222 999999 08DEC2009 Escitalopram 32 2 1
3333 656463 12APR2007 Imipramine 44 1 1
3333 643211 21DEC2008 Zoloft 45 2 2
3333 543213 02FEB2009 Fluoxetine 45 3 3
From what I have seen, this could be possible with a variant of the above code using 2 BY groups (patient_num & enc_num), but I can't seem to get it. I think the first. / last. codes require sorting, but if I am to sort by enc_num, they won't be in chronological order because the encounter numbers are generated by the system and depend on all other encounters going in at that time.
I tried the following code (using ordering_date instead, because it's already sorted properly), but everything under Enc_ is printed as a 1. I'm sure my logic is all wrong. Any thoughts?
data pt_meds_test;
set pt_meds_;
by patient_num ordering_date;
if first.patient_num;
if first.ordering_date then enc_ = 1;
else enc_ + 1;
run;
First
The FIRST./LAST. flags don't require a sort step if the data is already properly ordered or if you use NOTSORTED in your BY statement. If the variable in the BY statement is not properly ordered, the BY statement throws an error and the step stops executing as soon as it encounters the deviation. Like this:
data class;
set sashelp.class;
by age;
first = first.age;
last = last.age;
run;
ERROR: BY variables are not properly sorted on data set SASHELP.CLASS.
Name=Alfred Sex=M Age=14 Height=69 Weight=112.5 FIRST.Age=1 LAST.Age=1 first=. last=. _ERROR_=1 _N_=1
NOTE: The SAS System stopped processing this step because of errors.
NOTE: There were 2 observations read from the data set SASHELP.CLASS.
Try this code to see exactly how the FIRST./LAST. flags work:
data pt_meds_test;
set pt_meds_;
by patient_num ordering_date;
fp = first.patient_num;
lp = last.patient_num;
fo = first.ordering_date;
lo = last.ordering_date;
run;
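For readers without SAS at hand, the flags that step would produce can be approximated in Python. This is only a sketch of the BY-group logic under the assumption that the rows are already in the order shown in the question; the helper name by_flags is made up for illustration.

```python
def by_flags(keys):
    """Return (first, last) 0/1 flags for a list of BY-group keys.

    A row is 'first' when its key differs from the previous row's key,
    and 'last' when its key differs from the next row's key.
    """
    n = len(keys)
    first = [int(i == 0 or keys[i] != keys[i - 1]) for i in range(n)]
    last = [int(i == n - 1 or keys[i] != keys[i + 1]) for i in range(n)]
    return first, last


# (patient_num, ordering_date) pairs from the question, in their given order.
rows = [
    (1111, "07NOV2008"),
    (1111, "11MAY2011"),
    (2222, "08DEC2009"),
    (2222, "08DEC2009"),
    (3333, "12APR2007"),
    (3333, "21DEC2008"),
    (3333, "02FEB2009"),
]

# Flags for the first BY variable depend on patient_num alone.
fp, lp = by_flags([patient for patient, _ in rows])

# Flags for the second BY variable depend on the (patient_num, ordering_date) pair.
fo, lo = by_flags(rows)

for row, flags in zip(rows, zip(fp, lp, fo, lo)):
    print(row, flags)
```

Only patient 2222 has two rows with the same date, so that is the only place where fo is 0 or lo is 0.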
Second
Those conditions work differently than you think:
if expression;
If the expression is true, execution continues with the statements after the if.
Otherwise SAS returns to the beginning of the data step without the implicit output, so the observation is not written to the output data set.
In most cases an if without then is equivalent to where. However:
where works faster, but it is limited to variables that come from the data set you are reading
if can be used with any type of expression, including calculated fields
More info: IF Statement, Subsetting
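To see why the code in the question ends up with Enc_ equal to 1 everywhere, the subsetting if can be mimicked in Python with continue. This is a rough sketch of the data step's control flow, not of SAS itself; the variable names are illustrative.

```python
# (patient_num, ordering_date) pairs from the question, in their given order.
rows = [
    (1111, "07NOV2008"),
    (1111, "11MAY2011"),
    (2222, "08DEC2009"),
    (2222, "08DEC2009"),
    (3333, "12APR2007"),
    (3333, "21DEC2008"),
    (3333, "02FEB2009"),
]

out = []
enc = 0
prev_patient = None
prev_key = None

for patient, date in rows:
    # Equivalent of first.patient_num and first.ordering_date.
    first_patient = patient != prev_patient
    first_date = (patient, date) != prev_key
    prev_patient, prev_key = patient, (patient, date)

    # Subsetting IF: when the condition is false, skip the rest of the
    # step for this row, including the output.
    if not first_patient:
        continue

    # Only first-of-patient rows get here, and those are always also
    # first-of-date rows, so the else branch is never taken.
    if first_date:
        enc = 1
    else:
        enc += 1
    out.append((patient, date, enc))

print(out)
```

Only one row per patient survives, and each of them has the counter set to 1.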
Third
I think the lag() function can be your answer.
data pt_meds_test;
set pt_meds_;
by patient_num;
retain enc_;
/* call lag() on every row so its queue always holds the previous observation */
prev_patient_num = lag(patient_num);
prev_ordering_date = lag(ordering_date);
if first.patient_num then enc_ = 1;
else if patient_num = prev_patient_num and ordering_date ne prev_ordering_date then enc_ + 1;
run;
With the lag() function you can look up the value a variable had on the previous observation and compare it with the current one later.
But be careful: lag() doesn't actually look up the variable's value on the previous observation. It takes the current value and stores it in a FIFO queue of size 1. On the next call it retrieves the stored value from the queue and puts the new value there.
More info: LAG Function
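The queue behaviour described above can be imitated in Python. The Lag class below is a hypothetical stand-in written for this illustration (with None playing the role of a SAS missing value), not a port of the SAS function.

```python
from collections import deque


class Lag:
    """One-slot FIFO queue: each call returns the value stored by the previous call."""

    def __init__(self):
        self._queue = deque([None], maxlen=1)

    def __call__(self, value):
        previous = self._queue[0]
        self._queue.append(value)  # maxlen=1 pushes the old value out
        return previous


values = [10, 20, 30, 40]

# Called on every row: behaves like "value from the previous row".
lag_all = Lag()
every_row = [lag_all(v) for v in values]

# Called only on rows 0 and 2: row 2 gets the value from the last CALL (10),
# not from the previous row (20).
lag_cond = Lag()
conditional = [lag_cond(v) if i % 2 == 0 else None for i, v in enumerate(values)]

# Encounter numbering with unconditional lag calls, on the question's data.
rows = [
    (1111, "07NOV2008"), (1111, "11MAY2011"),
    (2222, "08DEC2009"), (2222, "08DEC2009"),
    (3333, "12APR2007"), (3333, "21DEC2008"), (3333, "02FEB2009"),
]
lag_patient, lag_date = Lag(), Lag()
enc = 0
enc_col = []
for patient, date in rows:
    prev_patient = lag_patient(patient)
    prev_date = lag_date(date)
    if patient != prev_patient:
        enc = 1        # new patient: restart the counter
    elif date != prev_date:
        enc += 1       # same patient, new date: next encounter
    enc_col.append(enc)

print(every_row, conditional, enc_col)
```

This is why the lag() calls belong outside any if branch: calling them conditionally leaves a stale value in the queue.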
I'm not sure if this hurts the rest of your analysis, but what about just
proc freq data=pt_meds noprint;
tables patient_num*ordering_date / out=pt_meds_freq;
run;
data pt_meds_freq2;
set pt_meds_freq;
by patient_num ordering_date;
if first.patient_num;
run;

decrypt xored content in binary

I want to decrypt XORed content. If you want, you can download the file here.
The file extension is .bin, but the content looks like hex to me and not binary; I'm not sure what kind of content it is.
The content looks like this:
2007 0b54 180a 541d 1318 1a00 541c 0654
0a0c 0606 065a 9854 0caa 2624 3000 0c04
260c 102c b435 fcaa b2ab acbf 32b2 aeb9
34b9 a0a8 a425 b6a9 809c bcb7 a8bb 2e34
eaa7 a835 80aa 8625 b8a7 aebc 2cbb 9e9d
329c bcaf 3493 a080 a625 aab9 329c bcaf
34b1 aab6 aab3 3431 b0a8 bebf b6ad 3634
b0af 849d 329c b225 faab acba b4af 3a93
32aa a0a9 a6b3 b80a 0a
And if it's hex, why is there a space after every 4 characters?
I think it can't be base64, because when I try to run the following command I get an error:
a#ubuntu:~/Downloads$ base64 -d enigma.bin>enigma.txt
base64: invalid input
Second, my goal is to find the key, so I tried xortool:
a#ubuntu:~/Downloads$ xortool enigma.bin
The most probable key lengths:
3: 15.1%
6: 19.3%
9: 13.6%
12: 15.3%
15: 9.4%
18: 10.9%
20: 4.4%
24: 5.3%
30: 3.4%
36: 3.4%
Key-length can be 3*n
Most possible char is needed to guess the key!
So I tried the most frequent characters, like space (20) or E T A O I N S H R D L U, but I had no luck. My guess is still that I got the encoding wrong.
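Groups of four hex digits are how many hex viewers and dump tools display two bytes at a time, so one plausible reading is that the listing above is a hex view of the raw bytes rather than the file's literal content. Assuming the hex text is what you have, a minimal Python sketch for turning it back into bytes and trying a repeating XOR key could look like this. The key and plaintext in the round-trip check are made up for the demonstration and are not the real key.

```python
def hex_dump_to_bytes(text):
    """Drop all whitespace between hex groups and decode the digits to bytes."""
    return bytes.fromhex("".join(text.split()))


def xor_repeating(data, key):
    """XOR every byte of data with the key, repeating the key as needed."""
    return bytes(b ^ key[i % len(key)] for i, b in enumerate(data))


# First line of the listing in the question: 8 groups of 2 bytes each.
first_line = "2007 0b54 180a 541d 1318 1a00 541c 0654"
data = hex_dump_to_bytes(first_line)
print(len(data), data[:4].hex())

# Round trip with a HYPOTHETICAL key, only to show that XOR is its own inverse.
demo_key = b"key"
demo_plain = b"hello world"
demo_cipher = xor_repeating(demo_plain, demo_key)
print(xor_repeating(demo_cipher, demo_key))
```

With the bytes in hand, candidate keys of the lengths suggested by xortool can be tried with xor_repeating.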

Stata counting substring

My table looks like this:
ID AQ_ATC amountATC
. "A05" 1
123 "A05AA02" 2525
234 "A05AA02" 2525
991 "A05AD39" 190
. "C10" 1
441 "C10AA11" 4330
229 "C10AA22" 3100
. "C05AA" 1
441 "C05AA03" 130
The count for the full 8-character AQ_ATC codes is already correct.
The shorter codes are unique in the table and are substrings of the complete 8-character codes (they represent the first x characters).
What I am looking for is the count of the appearances of the shorter codes throughout the entire table.
For example in this case the resulting table would be
ID AQ_ATC amountATC
. "A05" 2715 <-- 2525 + 190
123 "A05AA02" 2525
234 "A05AA02" 2525
991 "A05AD39" 190
. "C10" 7430 <-- 4330 + 3100
441 "C10AA11" 4330
229 "C10AA22" 3100
. "C05AA" 130 <-- 130
441 "C05AA03" 130
The partial codes do not overlap, by which I mean that if there is "C05" there won't be another partial code "C05A1".
I created the amountATC column using
bysort ATC: egen amountATC = total(AQ_ATC==AQ_ATC)
I attempted recycling the code that I had received yesterday but failed in doing so.
My attempt looks like this:
levelsof AQ_ATC, local(ATCvals)
quietly foreach y in AQ_ATC {
local i = 0
quietly foreach x of local ATCvals {
if strpos(`y', `"`x'"') == 1{
local i = `i'+1
replace amountATC = `i'
}
}
}
My idea was to use a counter "i" and increase it by 1 every time an AQ_ATC starts with another AQ_ATC code. Then I write "i" into amountATC, and after I have iterated over the entire table for my AQ_ATC, I will have an "i" value equal to the number of occurrences of the substring. Then I reset "i" to 0 and continue with the next AQ_ATC.
At least that's how I intended it to work; what it did in the end was set all amountATC values to 1.
I also attempted looking into different egen-functions such as noccur and moss, but my connection keeps timing out when I attempt to install the packages.
It seems as if you come from another language and insist on using loops when they are not strictly necessary. Stata does many things without explicit loops, precisely because commands already apply to all observations.
One way is:
clear
set more off
input ///
ID str15 AQ_ATC amountATC
. "A05" 1
123 "A05AA02" 2525
234 "A05AA02" 2525
991 "A05AD39" 190
. "C10" 1
441 "C10AA11" 4330
229 "C10AA22" 3100
. "C05AA" 1
441 "C05AA03" 130
end
*----- what you want -----
sort AQ_ATC ID
gen grou = sum(missing(ID))
bysort grou AQ_ATC: gen tosum = amountATC if _n == 1 & !missing(ID)
by grou: egen s = total(tosum)
replace amountATC = s if missing(ID)
list, sepby(grou)
Edit
With your edit the same principles apply. Below is code that adjusts to your change and is slightly shorter (one line less):
*----- what you want -----
sort AQ_ATC
gen grou = sum(missing(ID))
bysort grou: gen s = sum(amountATC) if AQ_ATC != AQ_ATC[_n+1] & !missing(ID)
by grou: replace amountATC = s[_N] if missing(ID)
More efficient should be:
<snip>
bysort grou: gen s = sum(amountATC) if AQ_ATC != AQ_ATC[_n+1]
by grou: replace amountATC = s[_N] - 1 if missing(ID)
Some comments:
sort is a very handy command. If you sort the data by AQ_ATC they are arranged in such a way that the short (sub)strings are placed before corresponding long strings.
The by: prefix is fundamental and very helpful, and I noticed you can use it after defining appropriate groups. I created the groups taking advantage of the fact that all short (sub)strings have a missing(ID).
Then (by the groups just defined) you only want to add up one value (observation) per distinct AQ_ATC. That's what the condition if AQ_ATC != AQ_ATC[_n+1] does.
Finally, replace back into your original variable. I would usually generate a copy and work with that, so my original variable doesn't suffer.
An excellent read for the by: prefix is Speaking Stata: How to move step by: step, by Nick Cox.
Edit2
Yet another slightly different way:
*----- what you want -----
sort AQ_ATC
gen grou = sum(missing(ID))
egen t = tag(grou AQ_ATC)
bysort grou: gen s = sum(amountATC * t)
by grou: replace amountATC = s[_N] - 1 if missing(ID)
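The same result can be cross-checked outside Stata. The Python sketch below is not a translation of the code above. It swaps in a plain prefix match, taking one amount per distinct full code and summing those whose code starts with the short code. Using None for the missing ID is an assumption of the sketch.

```python
# Rows from the question: (ID, AQ_ATC, amountATC); None marks a missing ID.
rows = [
    (None, "A05", 1),
    (123, "A05AA02", 2525),
    (234, "A05AA02", 2525),
    (991, "A05AD39", 190),
    (None, "C10", 1),
    (441, "C10AA11", 4330),
    (229, "C10AA22", 3100),
    (None, "C05AA", 1),
    (441, "C05AA03", 130),
]

# One amount per distinct full code (rows that have an ID), so a code that
# appears on several rows is only counted once.
full_amounts = {code: amount for id_, code, amount in rows if id_ is not None}

result = []
for id_, code, amount in rows:
    if id_ is None:
        # Short code: add up the amounts of all full codes it is a prefix of.
        amount = sum(a for c, a in full_amounts.items() if c.startswith(code))
    result.append((id_, code, amount))

for row in result:
    print(row)
```

The short codes come out as 2715, 7430 and 130, matching the table the question asks for.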

R: Combine fragments below certain length

I have a bed file containing restriction fragments of the mouse genome. Each fragment has a different length/width, like this:
chr start end width
1 chr1 0 3000534 3000534
2 chr1 3000535 3000799 264
3 chr1 3000800 3001209 409
4 chr1 3001210 3001496 286
5 chr1 3001497 3002121 624
Is it possible to combine shorter fragments (< 500 bp) with adjacent fragments using R (see example below), and if so, how?
chr start end width
1 chr1 0 3000534 3000534
2 chr1 3000535 3001209 673
3 chr1 3001210 3002121 910
Note that I don't want to filter out fragments under a certain length, so subsetting the data is not an option.
I hope my question is not too confusing…
Here is a first solution. It assumes that chr stays the same, and it filters out the last fragment if it is < 500 (the result is the data frame you put in your example):
mydata<-data.frame(chr=rep("chr1",6),start=c(0,3000535,3000800,3001210,3001497,3002122),end=c(3000534,3000799,3001209,3001496,3002121,3002134),width=c(3000534,264,409,286,624,12),stringsAsFactors=F)
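i<-1
while(i<nrow(mydata)){
  if(mydata$width[i]>=500) {
    # long enough: move on to the next fragment
    i<-i+1
  } else {
    # too short: absorb the next fragment into this one, then re-check it
    mydata$end[i]<-mydata$end[i+1]
    mydata$width[i]<-sum(mydata$width[i:(i+1)])
    mydata<-mydata[-(i+1),]
  }
}
# drop the last fragment if it is still too short
if(mydata$width[i]<500) mydata<-mydata[-i,]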
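The merging logic of that loop can also be written without deleting rows in place. The Python sketch below follows the same rule (a short fragment absorbs its right-hand neighbour until it reaches the minimum width, and a short fragment left at the very end is dropped). It is an illustration only and, like the R code, assumes all fragments are on one chromosome and sorted by position. The function name is made up.

```python
def merge_short_fragments(fragments, min_width=500):
    """Merge fragments shorter than min_width with the fragments that follow them.

    fragments: list of (chr, start, end, width) tuples, sorted by start.
    A short fragment remaining at the end of the list is dropped.
    """
    merged = []
    current = None  # fragment being built up, as [chr, start, end, width]
    for chrom, start, end, width in fragments:
        if current is None:
            current = [chrom, start, end, width]
        else:
            # Extend the pending short fragment over this one.
            current[2] = end
            current[3] += width
        if current[3] >= min_width:
            merged.append(tuple(current))
            current = None
    return merged


# Same six fragments as in mydata above.
fragments = [
    ("chr1", 0, 3000534, 3000534),
    ("chr1", 3000535, 3000799, 264),
    ("chr1", 3000800, 3001209, 409),
    ("chr1", 3001210, 3001496, 286),
    ("chr1", 3001497, 3002121, 624),
    ("chr1", 3002122, 3002134, 12),
]

result = merge_short_fragments(fragments)
for fragment in result:
    print(fragment)
```

As in the R version, widths are summed rather than recomputed from start and end, so the merged rows match the example in the question.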
