I have the following makefile.
I would like step0 to run first, then all of the b*.R scripts to run at the same time in step1. When step1 is complete, I would like final to run.
When I run make or make -j 8, it seems like all of the b*.R files still run sequentially. Is this makefile set up correctly to run all of the b*.R files at the same time? If not, what do I need to change?
final : step1
    Rscript c.R

step1 : step0
    Rscript b1.R
    Rscript b2.R
    Rscript b3.R
    Rscript b4.R
    Rscript b5.R
    Rscript b6.R

step0 :
    Rscript a.R
When I run make or make -j 8 it seems like all of the b*.R files still run sequentially.
-j N allows parallel execution of different recipes, not of the individual commands that make up a single recipe. All six Rscript lines belong to the one step1 recipe, so they always run one after another.
So the makefile should be restructured like this:
.PHONY: final b1 b2 b3 b4 b5 b6 step0
final: b1 b2 b3 b4 b5 b6
b1 b2 b3 b4 b5 b6: step0
b1: ;Rscript b1.R
b2: ;Rscript b2.R
b3: ;Rscript b3.R
b4: ;Rscript b4.R
b5: ;Rscript b5.R
b6: ;Rscript b6.R
step0: ;Rscript a.R
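With the targets split out like this, make -j8 is free to schedule b1 through b6 concurrently as soon as step0 has finished, while final still waits for all six to complete.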
If you want make to handle the parallelism for you, you need to restructure the makefile to have different targets. For example:
step1: b1 b2 b3 b4 b5 b6

b1: step0
    Rscript b1.R

b2: step0
    Rscript b2.R

...

step0 :
    Rscript a.R
Or, you could let the shell do the parallelism and write:
step1: step0
    Rscript b1.R & Rscript b2.R & \
    Rscript b3.R & ... & wait
I would recommend the former.
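If you do take the shell route, a slightly tidier option than chaining & by hand is to let xargs manage the fan-out. A minimal sketch, assuming GNU or BSD xargs and that the scripts all match the glob b*.R and can run in any order:

step1: step0
    printf '%s\n' b*.R | xargs -n 1 -P 8 Rscript

Here -n 1 hands each script to its own Rscript invocation and -P 8 caps the number running at once, while the whole pipeline still counts as one recipe line as far as make is concerned.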
I am trying to use Unix to transform a tab-delimited file from a short/wide format to a long format, similar to the reshape function in R. I hope to create three rows for each row in the starting file. Column 4 currently contains 3 values separated by commas. I hope to keep columns 1, 2, and 3 the same for each starting row, but have column 4 be one of the values from the initial column 4. This example probably makes it clearer than I can describe verbally:
current file:
A1 A2 A3 A4,A5,A6
B1 B2 B3 B4,B5,B6
C1 C2 C3 C4,C5,C6
goal:
A1 A2 A3 A4
A1 A2 A3 A5
A1 A2 A3 A6
B1 B2 B3 B4
B1 B2 B3 B5
B1 B2 B3 B6
C1 C2 C3 C4
C1 C2 C3 C5
C1 C2 C3 C6
As someone just becoming familiar with this language, my initial thought was to use sed to find the commas and replace them with a newline:
sed 's/,/&\n/' data.frame
I am really not sure how to include the values for columns 1-3. I had low hopes of this working, but the only thing I could think of was to try inserting the column values with {print $1, $2, $3}.
sed 's/,/&\n{print $1, $2, $3}/' data.frame
Not to my surprise, the output looked like this:
A1 A2 A3 A4
{print $1, $2, $3} A5
{print $1, $2, $3} A6
B1 B2 B3 B4
{print $1, $2, $3} B5
{print $1, $2, $3} B6
C1 C2 C3 C4
{print $1, $2, $3} C5
{print $1, $2, $3} C6
It seems like an approach might be to store the values of columns 1-3 and then insert them. I am not really sure how to store the values; I think it may involve an adaptation of the following script, but I am having a hard time understanding all of its components.
NR==FNR{a[$1, $2, $3]=1}
Thanks in advance for your thoughts on this.
You can write a simple read loop for this and use parameter expansion for splitting the comma-delimited field:
#!/bin/bash

while read -r f1 f2 f3 c1; do
    # split the comma-delimited field 'c1' into its constituents
    for c in ${c1//,/ }; do
        printf '%s %s %s %s\n' "$f1" "$f2" "$f3" "$c"
    done
done < input.txt
Output:
A1 A2 A3 A4
A1 A2 A3 A5
A1 A2 A3 A6
B1 B2 B3 B4
B1 B2 B3 B5
B1 B2 B3 B6
C1 C2 C3 C4
C1 C2 C3 C5
C1 C2 C3 C6
As a solution without calling any external program:

#!/bin/bash

data_file="d"

while IFS=" " read -r f1 f2 f3 r
do
    IFS="," read -r f4 f5 f6 <<<"$r"
    # printf reuses the format string for each group of four arguments
    printf '%s %s %s %s\n' "$f1" "$f2" "$f3" "$f4" \
                           "$f1" "$f2" "$f3" "$f5" \
                           "$f1" "$f2" "$f3" "$f6"
done <"$data_file"
The great Miller has the nest verb to do exactly this.
With
mlr --nidx --ifs "\t" nest --explode --values --across-records -f 4 --nested-fs "," input.tsv
you will have
A1 A2 A3 A4
A1 A2 A3 A5
A1 A2 A3 A6
B1 B2 B3 B4
B1 B2 B3 B5
B1 B2 B3 B6
C1 C2 C3 C4
C1 C2 C3 C5
C1 C2 C3 C6
If you don't need the output to be in any particular order within a group of the fourth column, the following awk one-liner might do:
awk '{split($4,a,","); for(i in a) print $1,$2,$3,a[i]}' input.txt
This works by splitting your 4th column into an array, then for each element of the array, printing the "new" four columns.
If order is important -- that is, A4 must come before A5, etc. -- then you can use a classic for loop, driven by split()'s return value (the number of elements):
awk '{n=split($4,a,","); for(i=1;i<=n;i++) print $1,$2,$3,a[i]}' input.txt
But that's awk. And you're asking about bash.
The following might work:
#!/usr/bin/env bash

mapfile -t arr < input.txt
for s in "${arr[@]}"; do
    t=($s)
    mapfile -t -d, u <<<"${t[3]}"
    for v in "${u[@]}"; do
        printf '%s %s %s %s\n' "${t[@]:0:3}" "${v%$'\n'}"
    done
done
This copies your entire input file into the elements of an array, and then steps through that array, mapping each 4th-column into a second array. It then steps through that second array, printing the first three columns from the first array, along with the current field from the second array.
It's obviously similar in structure to the awk alternative, but much more cumbersome to read and code.
Note the ${v%$'\n'} on the printf line. This strips off the last field's trailing newline, which doesn't get stripped by mapfile because we're using an alternate delimiter.
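To see what mapfile does and doesn't strip here, a quick toy example typed at a prompt should show:

$ mapfile -t -d, u <<<"A4,A5,A6"
$ printf '[%s]' "${u[@]}"; echo
[A4][A5][A6
]

-t removes the trailing comma delimiter from the first two elements, but the here-string's final newline is not a comma, so it stays glued to A6.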
Note also that there's no reason you have to copy all your input into an array; I just did it that way to demonstrate a little more of mapfile. You could of course use the old standard,
while read s; do
...
done < input.txt
if you prefer.
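In case it helps, that while-loop version, filled in with the same mapfile splitting (and read -r added so backslashes survive), might look like:

while read -r s; do
    t=($s)
    mapfile -t -d, u <<<"${t[3]}"
    for v in "${u[@]}"; do
        printf '%s %s %s %s\n' "${t[@]:0:3}" "${v%$'\n'}"
    done
done < input.txt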
I am examining prescription patterns within a large EHR dataset. The data is structured so that we are given several key bits of information, such as patient_num, encounter_num, ordering_date, medication, age_event (age at event) etc. Example below:
Patient_num enc_num ordering_date medication age_event
1111 888888 07NOV2008 Wellbutrin 48
1111 876578 11MAY2011 Bupropion 50
2222 999999 08DEC2009 Amitriptyline 32
2222 999999 08DEC2009 Escitalopram 32
3333 656463 12APR2007 Imipramine 44
3333 643211 21DEC2008 Zoloft 45
3333 543213 02FEB2009 Fluoxetine 45
Currently I have the dataset sorted by patient_id then by ordering_date so that I can see what each individual was prescribed during their encounters in a longitudinal fashion. For now, I am most concerned with the prescription(s) that were made during their first visit. I wrote some code to count the number of prescriptions and had originally restricted later analyses to RX = 1, but as we can see, that doesn't work for people with multiple scripts on the same encounter (Patient 2222).
data pt_meds_;
set pt_meds;
by patient_num;
if first.patient_num then RX = 1;
else RX + 1;
run;
Patient_num enc_num ordering_date medication age_event RX
1111 888888 07NOV2008 Wellbutrin 48 1
1111 876578 11MAY2011 Bupropion 50 2
2222 999999 08DEC2009 Amitriptyline 32 1
2222 999999 08DEC2009 Escitalopram 32 2
3333 656463 12APR2007 Imipramine 44 1
3333 643211 21DEC2008 Zoloft 45 2
3333 543213 02FEB2009 Fluoxetine 45 3
I think it would be more appropriate to recode the encounter numbers into a new variable so that they follow a style similar to the RX variable, where each encounter is numbered 1-n and the number repeats if multiple scripts are made in the same encounter. Such as below:
Patient_num enc_num ordering_date medication age_event RX Enc_
1111 888888 07NOV2008 Wellbutrin 48 1 1
1111 876578 11MAY2011 Bupropion 50 2 2
2222 999999 08DEC2009 Amitriptyline 32 1 1
2222 999999 08DEC2009 Escitalopram 32 2 1
3333 656463 12APR2007 Imipramine 44 1 1
3333 643211 21DEC2008 Zoloft 45 2 2
3333 543213 02FEB2009 Fluoxetine 45 3 3
From what I have seen, this could be possible with a variant of the above code using 2 BY groups (patient_num & enc_num), but I can't seem to get it. I think the first./last. flags require sorting, but if I sort by enc_num, the rows won't be in chronological order, because the encounter numbers are generated by the system and depend on all the other encounters going in at that time.
I tried the following code (using ordering_date instead, because it's already sorted properly) but everything under Enc_ is printed as a 1. I'm sure my logic is all wrong. Any thoughts?
data pt_meds_test;
set pt_meds_;
by patient_num ordering_date;
if first.patient_num;
if first.ordering_date then enc_ = 1;
else enc_ + 1;
run;
First
first./last. flags don't require sorting if the data is properly ordered, or if you use NOTSORTED in your BY statement. If the variable in your BY statement is not properly ordered, the BY statement will throw an error and stop executing when it encounters a deviation. Like this:
data class;
set sashelp.class;
by age;
first = first.age;
last = last.age;
run;
ERROR: BY variables are not properly sorted on data set SASHELP.CLASS.
Name=Alfred Sex=M Age=14 Height=69 Weight=112.5 FIRST.Age=1 LAST.Age=1 first=. last=. _ERROR_=1 _N_=1
NOTE: The SAS System stopped processing this step because of errors.
NOTE: There were 2 observations read from the data set SASHELP.CLASS.
Try this code to see exactly how the first./last. flags work:
data pt_meds_test;
set pt_meds_;
by patient_num ordering_date;
fp = first.patient_num;
lp = last.patient_num;
fo = first.ordering_date;
lo = last.ordering_date;
run;
Second
That kind of condition works differently than you think:
if expression;
If expression is true, the data step continues with the instructions after the if.
Otherwise it returns to the beginning of the data step (no implicit output). This also means the observation is not kept in the output.
In most cases, if without then is equivalent to where. However:
where works faster, but it is limited to variables that come from the data set you are reading
if can be used with any type of expression, including calculated fields
More info: IF Statement, Subsetting
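For instance, with the SASHELP.CLASS toy data again, a subsetting if silently drops every observation where the expression is false:

data teens;
set sashelp.class;
if age >= 13; /* observations with age < 13 never reach the output data set */
run;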
Third
I think the lag() function can be your answer.
data pt_meds_test;
set pt_meds_;
by patient_num;
retain enc_;
prev_patient_num = lag(patient_num);
prev_ordering_date = lag(ordering_date);
if first.patient_num then enc_ = 1;
else if patient_num = prev_patient_num and ordering_date ne prev_ordering_date then enc_ + 1;
run;
With the lag() function you can see what the value of a variable was on the previous call and compare it with the current one.
But be careful: lag() doesn't literally fetch the variable's value from the previous observation. It takes the current value of the variable and stores it in a FIFO queue of size 1. On the next call it retrieves the stored value from the queue and puts the new value there.
More info: LAG Function
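One way to see that queue behavior is to call lag() conditionally. A small sketch with toy data, showing that a conditionally executed lag() does not return the previous observation's value:

data _null_;
input x;
if mod(x, 2) = 0 then y = lag(x); /* lag only executes when x is even */
put x= y=;
datalines;
1
2
3
4
;
run;

At x=4 this prints y=2 (the value stored on the previous call, at x=2), not 3.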
I'm not sure if this hurts the rest of your analysis, but what about just
proc freq data=pt_meds noprint;
tables patient_num*ordering_date / out=pt_meds_freq;
run;
data pt_meds_freq2;
set pt_meds_freq;
by patient_num ordering_date;
if first.patient_num;
run;
How can I merge two lines if they meet specific criteria in a Unix terminal?
I have data like:
A1
B1
A2
B2
A3
A4
A5
B5
And I want to merge them like this:
A1, B1
A2, B2
A3,
A4,
A5, B5
Real data looks like this:
"224222"
<Frequency freq="0.136" allele="T" sampleSize="5008"/>
"224223"
<Frequency freq="0.3864" allele="T" sampleSize="5008"/>
"224224"
"224225"
<Frequency freq="0.3894" allele="G" sampleSize="5008"/>
"1801179"
"1861759"
I actually tried adding dummy delimiter text before the "A" lines to separate them, but I couldn't achieve it.
Using sed
sed 's/$/, /;N;/\n<Freq/{s/\n//};P;D' <file>
Explanation:
s/$/, / - Append a comma and a space to the current line
N - Get the next line
/\n<Freq/{s/\n//} - If the second line contains <Freq, delete the newline
P - Print first portion of pattern space
D - Delete first portion of pattern space
It can be done using awk getline:
awk '{ if (condition) { if ((getline var) > 0) print $0 ", " var; else print $0 } else print $0 }' <file>
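If a generic condition is hard to pin down, a buffering variant tailored to the sample data might look like this (assuming frequency lines start with < and every other line is a data line; like the sed answer, it appends ", " to each data line):

awk '
/^</ { print prev $0; prev = ""; next }   # merge onto the buffered data line
prev { print prev }                       # previous data line had no partner
     { prev = $0 ", " }                   # buffer the current data line
END  { if (prev) print prev }             # flush the final line
' file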
I have a file where I want to print every entry for a column i>N followed by the contents of the next column. Each line has the same number of columns. An example input:
a b c d
a1 b1 c1 d1
a2 b2 c2 d2
a3 b3 c3 d3
say in this case I want to skip the first column, so the desired output would be
b
b1
b2
b3
c
c1
c2
c3
d
d1
d2
d3
I got close to what I wanted using
awk '{for(i=2; i<=NF; i++) print $i}'
but this prints the entries row by row instead of all the entries from each column consecutively.
Thanks in advance
If every line has the same number of fields then you can do:
awk '
{
for(i=2;i<=NF;i++)
rec[i]=(rec[i]?rec[i]RS$i:$i)
}
END {
for(i=2;i<=NF;i++) print rec[i]
}' file
If the number of fields is uneven, then you also need to remember the maximum number of fields:
awk '
{
for(i=2;i<=NF;i++) {
rec[i]=(rec[i]?rec[i]RS$i:$i)
}
num=(num>NF?num:NF)
}
END {
for(i=2;i<=num;i++) print rec[i]
}' file
Output:
b
b1
b2
b3
c
c1
c2
c3
d
d1
d2
d3
Using cut would be easier here:
# figure out how many fields there are
read -a fields < <(sed 1q file)
nf=${#fields[@]}

# start dumping the columns from field n onward
n=2
for ((i = n; i <= nf; i++)); do
    cut -d " " -f "$i" file
done
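For the sample input this should print the b, c, and d columns in turn. Note that it re-reads the file once per column, so on large inputs the single-pass awk approach above will be faster.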