A bit of context: the file shown below is generated by a VLSI tool. It lists the timing delays caused by various components in a circuit. When I generate this "timing file", the fields are sometimes not properly aligned.
The generated file:
something1 0.20 0.00 0.00
something2 6 12.95
something3 0.00 0.08 0.00 0.00 0.07
something4 6 8.70
something5 0.00 0.03 0.00 0.00 0.05
something6 5 4.70
What I want:
something1 0.20 0.00 0.00
something2 6 12.95
something3 0.00 0.08 0.00 0.00 0.07
something4 6 8.70
something5 0.00 0.03 0.00 0.00 0.05
something6 5 4.70
The displacement for something4 and something6 keeps recurring throughout the table in a particular order (say, every 2 lines or every line). Only something2 has a different displacement, whereas all the other displacements follow something4/something6.
So far I have no clue how to proceed with this. Any way to fix this?
$ awk '{gsub(/ {6}/,","); gsub(/ +/,",")} 1' file | column -s, -t
something1 0.20 0.00 0.00
something2 6 12.95
something3 0.00 0.08 0.00 0.00 0.07
something4 6 8.70
something5 0.00 0.03 0.00 0.00 0.05
something6 5 4.70
or:
$ awk 'BEGIN{FS=OFS="\t"} {gsub(/ {6}/,FS); gsub(/ +/,FS); $1=$1} 1' file
something1 0.20 0.00 0.00
something2 6 12.95
something3 0.00 0.08 0.00 0.00 0.07
something4 6 8.70
something5 0.00 0.03 0.00 0.00 0.05
something6 5 4.70
Another way with awk:
awk 'NF>3{$2=OFS$2}NF==4{$2=OFS$2}{$1=$1}1' OFS='\t' infile
I need help selecting a subset of participants from my dataset.
I have a varying number of observations (rows) per participant. I want to work only with those participants for whom fewer than 5% of responses have Conventional_response_code = F, and for whom Conventional_response_code = S has the highest proportion.
participant Test_word Regular_response Conventional_response_code PPT
<chr> <chr> <chr> <chr> <dbl>
1 BM0289 ambulance NA N 92
2 BM0289 bat NR NR 92
3 BM0289 beard man with a mustache D 92
4 BM0289 binoculars NA N 92
5 BM0289 bride wedding dress PP 92
6 BM0289 cannon gun M 92
7 BM0289 cheerleaders NA N 92
8 BM0289 chimney NA N 92
9 BM0289 dinosaur NR NR 92
10 BM0289 dragon NR NR 92
I managed to create a proportion table in which I can see this information:
###Number of errors per participant (raw numbers)
proptable<-xtabs(formula= ~ participant + Conventional_response_code, data=data)
###proportion of errors per participant (row)
proptable<- (round(100*prop.table(proptable, margin=1), digits=2))
head(proptable)
Conventional_response_code
participant Adm E AN B D F M M-F-F M-F-U M-N M-N-A M-N-N MO N NR O PP PP-F PP-N Prima S S-F S-F-F
BM0289 0.00 0.00 0.00 0.99 2.97 18.81 0.99 0.00 0.00 0.00 0.99 2.97 23.76 29.70 0.00 6.93 0.00 0.00 0.00 8.91 0.00 0.00
BM0601 0.95 5.71 0.00 9.52 20.00 4.76 0.00 0.00 0.00 0.00 0.00 2.86 20.95 3.81 7.62 1.90 0.00 0.00 3.81 18.10 0.00 0.00
LD0001 0.00 0.00 0.00 0.00 0.00 7.69 0.00 0.00 0.00 0.00 0.00 0.00 61.54 7.69 0.00 15.38 0.00 0.00 0.00 7.69 0.00 0.00
LD0002 0.00 0.00 0.00 27.50 0.00 12.50 0.00 0.00 0.00 0.00 0.00 2.50 2.50 20.00 5.00 7.50 0.00 0.00 0.00 17.50 0.00 0.00
LD0003 2.27 0.00 4.55 13.64 27.27 2.27 0.00 0.00 0.00 0.00 0.00 2.27 29.55 2.27 0.00 2.27 0.00 0.00 0.00 13.64 0.00 0.00
LD0004 4.67 0.00 0.00 11.21 4.67 12.15 0.00 0.00 0.00 0.93 0.93 0.93 20.56 14.95 0.93 4.67 0.00 0.00 0.93 19.63 0.00 0.00
Conventional_response_code
participant S-F-U S-N S-N-A S-N-N U
BM0289 0.00 0.00 1.98 0.99 0.00
BM0601 0.00 0.00 0.00 0.00 0.00
LD0001 0.00 0.00 0.00 0.00 0.00
LD0002 0.00 0.00 5.00 0.00 0.00
LD0003 0.00 0.00 0.00 0.00 0.00
LD0004 0.00 0.00 0.93 1.87 0.00
but that creates a separate table. How do I use the information from this proportion table in my own dataset (data) to select the participants that satisfy the conditions I need:
the number of F values in the Conventional_response_code column should be under 5%
the proportion of S values in the Conventional_response_code column should be the highest
Thank you!!
Even though you have already found a way to put together your proptable, I would imagine it is easier to start from the initial data frame (data), since you want to compare the frequencies of Conventional_response_code within each participant.
library(dplyr)
# `data` is the original data frame from the question
participants.filtered <- data %>%
  count(participant, Conventional_response_code) %>%
  group_by(participant) %>%
  mutate(freq = n / sum(n)) %>%
  # keep participants whose share of F is under 5% (including those with no F at all)
  filter(sum(freq[Conventional_response_code == 'F']) < 0.05) %>%
  # keep participants whose most frequent code is S
  arrange(desc(freq), .by_group = TRUE) %>%
  filter(Conventional_response_code == 'S', row_number() == 1) %>%
  pull(participant) %>%
  unique()
Does that help?
From the output of a sar command, I want to extract only the lines in which the %iowait value is higher than a set threshold.
I tried using AWK but somehow I'm not able to perform the action.
sar -u -f sa12 | sed 's/\./,/g' | awk -f" " '{ if ( $7 -gt 0 ) print $0 }'
I tried to substitute the . with , and using -gt but still no joy.
Can someone suggest a solution?
If you need the entire line of the sar -u output wherever iowait > 0.01, you can use this:
Command
sar -u | grep -v "CPU" | awk '$7 > 0.01'
Output will be similar to
03:40:01 AM all 3.16 0.00 0.05 0.11 0.00 96.68
04:40:01 PM all 0.19 0.00 0.05 0.02 0.00 99.74
If you wish to output specific fields, say only iowait, use the command given below.
Command to output specific field(s):
sar -u | grep -v "CPU" | awk '{if($7 > 0.01 ) print $7}'
Output will be
0.11
0.02
Note: grep -v is used only to remove the heading lines from the output.
Hope this helps.
My sar -u gives several lines similar to the following:
Linux 4.4.0-127-generic (v1) 06/12/2018 _x86_64_ (1 CPU)
12:00:01 AM CPU %user %nice %system %iowait %steal %idle
12:05:01 AM all 0.29 0.00 0.30 0.01 0.00 99.40
12:15:01 AM all 0.33 0.00 0.34 0.00 0.00 99.32
12:25:01 AM all 0.33 0.00 0.30 0.01 0.00 99.36
12:35:01 AM all 0.31 0.00 0.29 0.01 0.00 99.39
12:45:01 AM all 0.33 0.00 0.32 0.01 0.00 99.35
12:55:01 AM all 0.32 0.00 0.30 0.00 0.00 99.38
01:05:01 AM all 0.32 0.00 0.28 0.00 0.00 99.39
01:15:01 AM all 0.33 0.00 0.30 0.01 0.00 99.37
01:25:01 AM all 0.31 0.00 0.30 0.01 0.00 99.39
01:35:01 AM all 0.31 0.00 0.33 0.00 0.00 99.36
01:45:01 AM all 0.31 0.00 0.28 0.01 0.00 99.40
01:55:01 AM all 0.31 0.00 0.30 0.00 0.00 99.38
02:05:01 AM all 0.31 0.00 0.28 0.01 0.00 99.40
02:15:01 AM all 0.32 0.00 0.30 0.01 0.00 99.38
02:25:01 AM all 0.31 0.00 0.30 0.01 0.00 99.38
02:35:01 AM all 0.33 0.00 0.33 0.00 0.00 99.33
02:45:01 AM all 0.35 0.00 0.32 0.01 0.00 99.32
02:55:01 AM all 0.28 0.00 0.30 0.00 0.00 99.42
03:05:01 AM all 0.32 0.00 0.31 0.00 0.00 99.37
03:15:01 AM all 0.34 0.00 0.30 0.01 0.00 99.36
03:25:01 AM all 0.32 0.00 0.29 0.01 0.00 99.38
03:35:01 AM all 0.33 0.00 0.26 0.00 0.00 99.40
03:45:01 AM all 0.34 0.00 0.29 0.00 0.00 99.36
03:55:01 AM all 0.30 0.00 0.28 0.01 0.00 99.41
04:05:01 AM all 0.32 0.00 0.30 0.01 0.00 99.37
04:15:01 AM all 0.37 0.00 0.31 0.01 0.00 99.32
04:25:01 AM all 1.78 2.04 0.59 0.05 0.00 95.55
To keep only the lines where %iowait is greater than, let's say, 0.01:
sar -u | awk '$7>0.01{print}'
Linux 4.4.0-127-generic (v1) 06/12/2018 _x86_64_ (1 CPU)
04:25:01 AM all 1.78 2.04 0.59 0.05 0.00 95.55
05:15:01 AM all 0.34 0.00 0.32 0.02 0.00 99.32
06:35:01 AM all 0.33 0.22 1.23 4.48 0.00 93.74
06:45:01 AM all 0.16 0.00 0.12 0.02 0.00 99.71
10:35:01 AM all 0.22 0.00 0.13 0.02 0.00 99.63
12:15:01 PM all 0.42 0.00 0.16 0.03 0.00 99.40
01:45:01 PM all 0.17 0.00 0.11 0.02 0.00 99.71
04:05:01 PM all 0.15 0.00 0.12 0.03 0.00 99.70
04:15:01 PM all 0.42 0.00 0.23 0.10 0.00 99.25
Edit:
As correctly pointed out by @Ed Morton, the awk code can be shortened to simply awk '$7>0.01', since the default action is to print the current line.
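The "Linux ..." banner line shows up in the output above because its seventh field ("CPU)") is not a number, so awk falls back to a string comparison. Note also that awk compares with >, not the shell's -gt, and that its field-separator option is -F, not -f. Below is a minimal sketch of a stricter filter, assuming the 12-hour sar layout shown above, where %iowait is field 7. The function name iowait_above is hypothetical, and the threshold is passed in with awk's -v option rather than hard-coded:

```shell
# Hypothetical helper: keep only sar -u data rows whose %iowait exceeds a threshold.
# Assumes the 12-hour layout (time, AM/PM, CPU, ...), so %iowait is field 7.
iowait_above() {
    # $1 is the threshold; sar output is read from standard input.
    # The regex test skips the banner and header lines, whose field 7 is not numeric.
    awk -v t="$1" '$7 ~ /^[0-9]+(\.[0-9]+)?$/ && $7 + 0 > t + 0'
}
```

It could be used as sar -u -f sa12 | iowait_above 0.01. With a 24-hour time format there is no AM/PM column, so the field number would need to be adjusted.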
I have downloaded a file with the extension .mea. It contains climate data. I don't know how to import it into R, and I don't even know how to open it on macOS. Here is what the first lines of data look like.
IPCC Data Distribution Centre Results from model HADCM3 11-07-2002
Grid is 96 * 73 Month is Jan
HADCM A1F
Total precipitation (mm/day)
7008 format is (10F8.2) missing code is 9999.99
0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
I did it the following way:
First I split the file into 12 smaller files, each containing one month's data, using the command-line split utility:
split -l 706 filename newfilePrefix
Then I read in each small file with the following:
readr::read_table(filename, col_names=FALSE, skip=5)
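The 706 passed to split above can be derived from the file header rather than counted by hand. This is a small sketch of that arithmetic, assuming every month starts with the same five header lines as the excerpt in the question (which is also what skip=5 relies on); the variable names are illustrative only:

```shell
# Derive the per-month block length from the values printed in the file header.
cols=96; rows=73          # "Grid is 96 * 73"
per_line=10               # "format is (10F8.2)": ten values per line
header_lines=5            # assumption: five header lines precede every month

values=$((cols * rows))                               # 7008 grid cells
data_lines=$(( (values + per_line - 1) / per_line ))  # ceiling division: 701
block_lines=$((header_lines + data_lines))            # 706
echo "$block_lines"
```

The result can then be passed on directly, for example split -l "$block_lines" filename newfilePrefix.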
I have a file with space-separated columns, and I want to extract specific data from it. Below is the format of the file:
12:00:01 AM CPU %usr %nice %sys %iowait %steal %irq %soft %guest %idle
12:01:01 AM all 78.13 0.00 0.98 0.00 0.00 0.00 0.56 0.00 20.33
12:02:01 AM all 93.42 0.00 0.53 0.00 0.00 0.00 0.10 0.00 5.95
12:03:01 AM 1 88.62 0.00 1.71 0.00 0.00 0.00 0.71 0.00 8.96
12:01:01 AM 2 92.56 0.00 0.70 0.00 0.00 0.00 1.17 0.00 5.58
12:01:01 AM 3 86.90 0.00 1.57 0.00 0.00 0.00 0.55 0.00 10.99
01:01:01 AM all 78.13 0.00 0.98 0.00 0.00 0.00 0.56 0.00 20.33
01:02:01 AM all 93.42 0.00 0.53 0.00 0.00 0.00 0.10 0.00 5.95
01:03:01 AM all 88.62 0.00 1.71 0.00 0.00 0.00 0.71 0.00 8.96
01:01:01 AM 2 92.56 0.00 0.70 0.00 0.00 0.00 1.17 0.00 5.58
01:01:01 AM 3 86.90 0.00 1.57 0.00 0.00 0.00 0.55 0.00 10.99
12:01:01 PM all 78.13 0.00 0.98 0.00 0.00 0.00 0.56 0.00 20.33
12:02:01 PM 0 93.42 0.00 0.53 0.00 0.00 0.00 0.10 0.00 5.95
12:03:01 PM 1 88.62 0.00 1.71 0.00 0.00 0.00 0.71 0.00 8.96
12:01:01 PM 2 92.56 0.00 0.70 0.00 0.00 0.00 1.17 0.00 5.58
12:01:01 PM 3 86.90 0.00 1.57 0.00 0.00 0.00 0.55 0.00 10.99
From this file I want the rows whose time looks like 12:01:01 AM/PM, that is, one row per hour, and that have "all" in the CPU column.
So after extraction I want the data below, but I am not able to get it.
12:01:01 AM all 78.13 0.00 0.98 0.00 0.00 0.00 0.56 0.00 20.33
01:01:01 AM all 78.13 0.00 0.98 0.00 0.00 0.00 0.56 0.00 20.33
12:01:01 PM all 78.13 0.00 0.98 0.00 0.00 0.00 0.56 0.00 20.33
Please suggest how I can get this data in UNIX.
If you add the -E option to grep it allows you to look for "Extended Regular Expressions". One such expression is
"CPU|01:01"
which will allow you to find all lines containing the word "CPU" (such as your column heading line) and also any lines with "01:01" in them. It is called an "alternation" and uses the pipe symbol (|) to separate alternate sub-parts.
So, an answer would be:
grep -E "CPU|01:01 .*all" yourFile > newFile
Try running:
man grep
to get the manual (help) page.
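The pattern above matches "01:01" anywhere on the line. A slightly stricter sketch, assuming the timestamps always look like HH:MM:SS followed by AM or PM as in the sample, anchors the match to the start of the line and to the CPU column. The function name hourly_all is hypothetical:

```shell
# Hypothetical helper: keep the header line (it contains "CPU") plus the rows
# stamped HH:01:01 AM/PM whose CPU column is "all".
hourly_all() {
    # Reads the named files, or standard input when no file is given.
    grep -E 'CPU|^[0-9]{2}:01:01 [AP]M +all ' "$@"
}
```

For example: hourly_all yourFile > newFile.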
awk to the rescue!
If you need field-specific matches, awk is the right tool.
$ awk '$3=="all" && $1~/01:01$/' file
12:01:01 AM all 78.13 0.00 0.98 0.00 0.00 0.00 0.56 0.00 20.33
01:01:01 AM all 78.13 0.00 0.98 0.00 0.00 0.00 0.56 0.00 20.33
12:01:01 PM all 78.13 0.00 0.98 0.00 0.00 0.00 0.56 0.00 20.33
You can extract the header as well, with this:
$ awk 'NR==1 || $3=="all" && $1~/01:01$/' file
I'm trying to run a compositional analysis of the use of different types of habitat by ground-nesting chicks on a set of data using RStudio. It starts processing but never stops. I have to manually stop the processing or kill RStudio. (Same result in R.)
I'm using the compana function from the adehabitatHS package. With the adehabitat package I'm able to run the sample pheasant and squirrel data without any problems. (I've tried calling compana from both packages with the same result.)
For each chick, the habitat available varies as it's taken as a buffer zone around their nest site.
My data
These are the habitats available to each chick:
grass fallow.plot oil.seed.rape spring.barley winter.wheat maize other.crops other woodland hedgerow
1 23.35 7.53 45.75 0.00 0.00 0.00 0.00 0.00 23.37 0.00
2 86.52 10.35 0.00 0.00 1.24 0.00 0.00 1.89 0.00 0.00
3 5.18 10.33 28.36 38.82 0.00 0.00 17.17 0.14 0.00 0.00
4 4.26 18.32 27.31 32.66 3.82 0.00 0.00 5.02 5.52 3.09
5 4.26 18.32 27.31 32.66 3.82 0.00 0.00 5.02 5.52 3.09
6 12.52 10.35 0.00 0.00 0.00 18.02 43.59 13.15 2.37 0.00
7 21.41 11.56 59.25 0.00 0.00 0.00 0.00 5.82 0.00 1.96
8 21.41 11.56 59.25 0.00 0.00 0.00 0.00 5.82 0.00 1.96
9 36.17 16.93 0.00 30.14 0.00 0.00 0.00 7.08 9.68 0.00
10 0.00 12.17 26.49 0.00 3.99 55.77 0.00 1.58 0.00 0.00
11 0.00 10.27 67.41 1.93 18.30 0.00 0.00 1.18 0.00 0.91
12 2.66 5.38 0.00 14.39 54.06 0.00 8.40 3.83 7.84 3.44
13 2.66 5.38 0.00 14.39 54.06 0.00 8.40 3.83 7.84 3.44
14 84.22 8.00 0.00 0.00 0.00 2.90 0.00 0.22 3.84 0.82
15 84.22 8.00 0.00 0.00 0.00 2.90 0.00 0.22 3.84 0.82
16 86.85 13.04 0.00 0.00 0.00 0.00 0.00 0.11 0.00 0.00
17 86.85 13.04 0.00 0.00 0.00 0.00 0.00 0.11 0.00 0.00
18 86.85 13.04 0.00 0.00 0.00 0.00 0.00 0.11 0.00 0.00
19 86.85 13.04 0.00 0.00 0.00 0.00 0.00 0.11 0.00 0.00
20 21.41 8.11 0.47 8.08 0.00 0.00 56.78 2.26 0.00 2.89
These are the used habitats (mcp):
grass fallow.plot oil.seed.rape spring.barley winter.wheat maize other.crops other woodland hedgerow
1 41.14 58.67 0.19 0.00 0.00 0.00 0.00 0.00 0 0.0
2 35.45 64.55 0.00 0.00 0.00 0.00 0.00 0.00 0 0.0
3 10.10 60.04 7.72 21.37 0.00 0.00 0.00 0.77 0 0.0
4 0.00 44.55 0.00 50.27 0.00 0.00 0.00 5.18 0 0.0
5 2.82 48.48 44.80 0.00 0.00 0.00 0.00 0.00 0 3.9
6 0.00 0.00 0.00 0.00 0.00 0.00 0.00 100.00 0 0.0
7 0.00 87.41 12.59 0.00 0.00 0.00 0.00 0.00 0 0.0
8 0.00 83.59 16.41 0.00 0.00 0.00 0.00 0.00 0 0.0
9 0.00 100.00 0.00 0.00 0.00 0.00 0.00 0.00 0 0.0
10 0.00 18.93 0.00 0.00 0.00 81.07 0.00 0.00 0 0.0
11 0.00 100.00 0.00 0.00 0.00 0.00 0.00 0.00 0 0.0
12 0.00 22.79 0.00 0.00 77.13 0.00 0.00 0.08 0 0.0
13 0.00 0.00 0.00 0.00 100.00 0.00 0.00 0.00 0 0.0
14 54.60 44.97 0.00 0.00 0.00 0.00 0.00 0.43 0 0.0
15 62.86 36.57 0.00 0.00 0.00 0.00 0.00 0.57 0 0.0
16 11.15 88.10 0.00 0.00 0.00 0.00 0.00 0.75 0 0.0
17 20.06 79.62 0.00 0.00 0.00 0.00 0.00 0.32 0 0.0
18 38.64 60.95 0.00 0.00 0.00 0.00 0.00 0.41 0 0.0
19 3.81 95.81 0.00 0.00 0.00 0.00 0.00 0.38 0 0.0
20 0.00 3.56 0.00 0.00 0.00 0.00 96.44 0.00 0 0.0
I've tried both parametric and randomisation tests with the same results. The code I'm running:
habuse <- compana(used, avail, test = "randomisation",rnv = 0.001, nrep = 500, alpha = 0.1)
habuse <- compana(used, avail, test = "parametric")
Any ideas where I'm going wrong?
I've discovered the answer to my own question. For the used data, the function replaces 0 values with the value you specify (0.001 in my case). But it doesn't replace 0 values in the available data, and it doesn't like them either.
I replaced all the 0s with 0.001 in the available table, adjusted the other values and the function worked.