Use awk to modify lines with specific keys - unix

I have a main file containing data lines whose IDs are stored in the second column. A separate key file lists specific IDs, and I would like to comment out (prefix with $) the records in the main file that have those IDs and leave the rest unchanged. The script below adds the comment marker but also repeats the non-keyed lines. Can you please help debug the awk command?
key_file:
10
20
30
main_file:
PSHELL 10 136514 0.7
PSHELL 15 136514 0.7
PSHELL 20 136513 2.0
PSHELL 30 13571 1.7
Current output:
PSHELL 10 136514 0.7
PSHELL 10 136514 0.7
$PSHELL 10 136514 0.7
PSHELL 15 136514 0.7
PSHELL 15 136514 0.7
PSHELL 15 136514 0.7
PSHELL 20 136513 2.0
$PSHELL 20 136513 2.0
PSHELL 20 136513 2.0
$PSHELL 30 13571 1.7
PSHELL 30 13571 1.7
PSHELL 30 13571 1.7
Desired output
$PSHELL 10 136514 0.7
PSHELL 15 136514 0.7
$PSHELL 20 136513 2.0
$PSHELL 30 13571 1.7
awk 'NR==FNR{a[$1]; next} {for (i in a) if (index($2, i)) {print "$"$0 > "out_file"} else {print $0 > "out_file"}}' key_file main_file

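Your loop prints the line once for every key in the array, so each record appears three times. In addition, index($2, i) is a substring test, so key 10 would also match an ID such as 110. Test for membership with "in" instead. You may use this awk:
awk 'FNR == NR {key[$1]; next} $2 in key {$0 = "$" $0} 1' key_file main_file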
$PSHELL 10 136514 0.7
PSHELL 15 136514 0.7
$PSHELL 20 136513 2.0
$PSHELL 30 13571 1.7

Related

not dealing properly with dates in R

I am trying to use selectByDate from openair package but got stuck in my second try
I have A
> A
date x
23 1982-08-23 0.0
24 1982-08-24 0.0
25 1982-08-25 0.0
26 1982-08-26 9.3
27 1982-08-27 0.0
28 1982-08-28 0.2
29 1982-08-29 0.0
30 1982-08-30 0.0
31 1982-08-31 0.0
32 1982-09-01 0.0
33 1982-09-02 0.2
34 1982-09-03 0.9
35 1982-09-04 4.2
36 1982-09-05 0.0
37 1982-09-06 0.0
38 1982-09-07 1.2
39 1982-09-08 0.0
40 1982-09-09 0.0
and then
> selectByDate(A, month = 9)
date x
10 1982-09-01 0.0
11 1982-09-02 0.2
12 1982-09-03 0.9
13 1982-09-04 4.2
14 1982-09-05 0.0
15 1982-09-06 0.0
16 1982-09-07 1.2
17 1982-09-08 0.0
18 1982-09-09 0.0
but with B
16 1971-04-20 100511
17 1971-04-21 100795
18 1971-04-22 101008
19 1971-04-23 101292
20 1971-04-24 101577
21 1971-04-25 101862
22 1971-04-26 102220
23 1971-04-27 103372
24 1971-04-28 103662
25 1971-04-29 103807
26 1971-04-30 104025
27 1971-05-01 104316
28 1971-05-02 104462
29 1971-05-03 104681
30 1971-05-04 104900
31 1971-05-05 105047
I got
> selectByDate(B, month = 4)
Error in as.POSIXlt.default(x, tz = tz(x)) :
do not know how to convert 'x' to class “POSIXlt”
I am a beginner in R and I can't see why this happens. Any clue?
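Convert the date column to class POSIXct first, naming the format argument (the second positional argument of as.POSIXct is tz, not format), and then try:
B$date <- as.POSIXct(B$date, format = '%Y-%m-%d')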
openair::selectByDate(B, month = 4)
You can also do this in base R (once date is stored as a date-time class):
subset(B, as.integer(format(date, '%m')) == 4)
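As a minimal, self-contained sketch of the conversion step: the three-row data frame below is a hypothetical stand-in for B, and the UTC time zone is an assumption made only so the result does not depend on the local machine. The date column starts as plain text, is converted with an explicit format, and is then filtered by month in base R.

```r
# Hypothetical three-row stand-in for B; the dates start out as character strings
B <- data.frame(
  date = c("1971-04-29", "1971-04-30", "1971-05-01"),
  x    = c(103807, 104025, 104316),
  stringsAsFactors = FALSE
)

# Name the format argument explicitly; tz = "UTC" is an assumption for reproducibility
B$date <- as.POSIXct(B$date, format = "%Y-%m-%d", tz = "UTC")

# Keep only the April rows by extracting the month number
april <- subset(B, as.integer(format(date, "%m")) == 4)
print(april)
```

Once the column has class POSIXct, the same data frame can also be passed to openair::selectByDate as shown above.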

piecewise regression in r

I have two variables, A and B, that are significantly related if modeled in a piecewise regression. The model has two segments. The problem is that in the plot, the two segments do not connect to one another the way they should: they form a 'nose' at the break point. I've seen in other posts on Stack Overflow that problems with plotting segmented regressions correctly seem widespread.
Here's the dataframe with A and B:
dfrm <- read.table(text=" A B
1 0.04545455 1.3
2 0.09090909 1.1
3 0.13636364 1.6
4 0.18181818 1.8
5 0.22727273 3.4
6 0.27272727 1.8
7 0.31818182 1.9
8 0.36363636 0.7
9 0.40909091 2.9
10 0.45454545 1.2
11 0.50000000 0.8
12 0.54545455 0.7
13 0.59090909 0.6
14 0.63636364 1.7
15 0.68181818 0.7
16 0.72727273 2.0
17 0.77272727 1.2
18 0.81818182 0.5
19 0.86363636 2.8
20 0.90909091 1.0
21 0.95454545 0.5
22 1.00000000 1.0
23 0.06666667 0.2
24 0.13333333 0.6
25 0.20000000 1.6
26 0.26666667 0.4
27 0.33333333 1.7
28 0.40000000 2.5
29 0.46666667 0.5
30 0.53333333 1.5
31 0.60000000 0.4
32 0.66666667 0.3
33 0.73333333 0.2
34 0.80000000 0.2
35 0.86666667 0.7
36 0.93333333 2.2
37 1.00000000 2.3
38 0.05882353 1.4
39 0.11764706 2.7
40 0.17647059 0.7
41 0.23529412 0.2
42 0.29411765 0.8
43 0.35294118 2.9
44 0.41176471 0.4
45 0.47058824 0.5
46 0.52941176 2.1
47 0.58823529 0.4
48 0.64705882 0.6
49 0.70588235 1.0
50 0.76470588 0.3
51 0.82352941 0.9
52 0.88235294 1.4
53 0.94117647 0.6
54 1.00000000 0.4
55 0.10000000 1.7
56 0.20000000 1.4
57 0.30000000 1.5
58 0.40000000 0.6
59 0.50000000 0.4
60 0.60000000 0.5
61 0.70000000 0.4
62 0.80000000 1.0
63 0.90000000 0.8
64 1.00000000 3.0
65 0.03846154 1.5
66 0.07692308 2.7
67 0.11538462 2.2
68 0.15384615 0.6
69 0.19230769 0.7
70 0.23076923 0.5
71 0.26923077 0.5
72 0.30769231 0.6
73 0.34615385 1.2
74 0.38461538 0.8
75 0.42307692 1.8
76 0.46153846 2.1
77 0.50000000 0.6
78 0.53846154 0.7
79 0.57692308 1.3
80 0.61538462 0.4
81 0.65384615 0.7
82 0.69230769 1.2
83 0.73076923 0.8
84 0.76923077 1.2
85 0.80769231 1.0
86 0.84615385 1.4
87 0.88461538 0.9
88 0.92307692 0.8
89 0.96153846 1.7
90 1.00000000 5.8", header=TRUE)
## attach(df) NO, don't use attach and mistrust anyone who tells you differently
model <- lm(B ~ (A < 0.89394)*A + (A >= 0.89394)*A, data=dfrm) # 0.89394 = breakpoint
# Preparing the plot:
a <- sort(unique(dfrm$A))
# Plotting:
plot(B ~ A, data=dfrm)
lines(a, predict(model, list(A=a)), lwd=2, col="blue")
This is the plot: [figure: Piecewise regression]
How can the two segments be connected cleanly at the break point?
It might be easiest to attempt this with a GAM (Generalized Additive Model), fitted with either the gam package or the mgcv package in R. This technique allows you to fit a non-linear model in stages, smoothing out the joins (or 'knots') between functions. As a bonus, the GAM is basically a GLM anyway, so the learning curve should be quite gentle.
The nose and the disconnect between the segments may be due to lack of precision in the way the break point is determined.
After re-determining the break point for my data based on the method detailed in Crawley (2007: 427), the two segments perfectly connect.
The steps involved are:
define a vector "breaks" of candidate break points
run a for loop that fits a piecewise regression for each candidate break point and stores the residual standard error (called mse here) of each model:
mse <- numeric(length(breaks))
for(i in seq_along(breaks)){
piecewise <- lm(V_dep ~ V_indep*(V_indep < breaks[i]) + V_indep*(V_indep >= breaks[i]))
mse[i] <- summary(piecewise)$sigma # residual standard error of this fit
}
identify the break point with the smallest mse:
breaks[which(mse==min(mse))]
fit the model using this break point.
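The steps above can be sketched as follows. This is only an illustration under stated assumptions: the data are simulated (a true break at 0.6 and a small noise level are choices made here, not values from the question), and the names x, y, rse and best_break are hypothetical. The formula y ~ x * (x < b) spans the same fit as the two-term formula used in the loop above.

```r
# Simulated data with a known break at x = 0.6 (an assumption for illustration)
set.seed(1)
x <- seq(0, 1, length.out = 100)
y <- ifelse(x < 0.6, 1 + 0.5 * x, 1.3 + 4 * (x - 0.6)) + rnorm(100, sd = 0.05)

# Candidate break points, kept away from the edges so both segments have data
breaks <- x[x > 0.2 & x < 0.9]

# Residual standard error of the piecewise fit for every candidate
rse <- vapply(breaks, function(b) {
  fit <- lm(y ~ x * (x < b))
  summary(fit)$sigma
}, numeric(1))

# Break point with the smallest residual standard error, then the final fit
best_break <- breaks[which.min(rse)]
final_model <- lm(y ~ x * (x < best_break))
print(best_break)
```

Plotting rse against breaks is a quick way to see whether the minimum is sharp or whether several neighbouring break points fit almost equally well.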

Does ggplot2 exclude some data?

I want to create some basic grouped barplots with ggplot2 but it seems to exclude some data. If I review my input data everything is there, but some bars are missing and it is also messing with the error bars. I tried to convert into multiple variable types, regrouped, loaded again, saved everything in .csv and loaded all new... I just don't know what is wrong.
Here is my code:
library(ggplot2)
limits <- aes(ymax = DataCm$mean + DataCm$sd,
ymin = DataCm$mean - DataCm$sd)
p <- ggplot(data = DataCm, aes(x = factor(DataCm$Zeit), y = factor(DataCm$mean)
) )
p + geom_bar(stat = "identity",
position = position_dodge(0.9),fill =DataCm$group) +
geom_errorbar(limits, position = position_dodge(0.9),
width = 0.25) +
labs(x = "Time [min]", y = "Individuals per foodsource")
This is DataCm:
Zeit mean sd group
1 30 0.1 0.3162278 1
2 60 0.0 0.0000000 2
3 90 0.1 0.3162278 3
4 120 0.0 0.0000000 4
5 150 0.1 0.3162278 5
6 180 0.1 0.3162278 6
7 240 0.3 0.6749486 1
8 300 0.3 0.6749486 2
9 360 0.3 0.6749486 3
10 30 0.1 0.3162278 4
11 60 0.1 0.3162278 5
12 90 0.2 0.4216370 6
13 120 0.3 0.4830459 1
14 150 0.3 0.4830459 2
15 180 0.4 0.5163978 3
16 240 0.3 0.4830459 4
17 300 0.4 0.5163978 5
18 360 0.4 0.5163978 6
19 30 1.2 1.1352924 1
20 60 1.8 1.6865481 2
21 90 2.2 2.0976177 3
22 120 2.2 2.0976177 4
23 150 2.0 1.8856181 5
24 180 2.3 1.9465068 6
25 240 2.4 2.0655911 1
26 300 2.1 1.8529256 2
27 360 2.0 2.1602469 3
28 30 0.2 0.4216370 4
29 60 0.1 0.3162278 5
30 90 0.1 0.3162278 6
31 120 0.1 0.3162278 1
32 150 0.0 0.0000000 2
33 180 0.1 0.3162278 3
34 240 0.1 0.3162278 4
35 300 0.1 0.3162278 5
36 360 0.1 0.3162278 6
37 30 1.3 1.5670212 1
38 60 1.5 1.5811388 2
39 90 1.5 1.7159384 3
40 120 1.5 1.9002924 4
41 150 1.9 2.1317703 5
42 180 1.9 2.1317703 6
43 240 2.2 2.3475756 1
44 300 2.4 2.3190036 2
45 360 2.2 2.1499354 3
46 30 2.1 2.1317703 4
47 60 3.0 2.2110832 5
48 90 3.3 2.1628171 6
49 120 3.2 2.1499354 1
50 150 3.4 2.6331224 2
51 180 3.5 2.4152295 3
52 240 3.7 2.6267851 4
53 300 3.7 2.4060110 5
54 360 3.8 2.6583203 6
The output is: [plot not shown]
Maybe you can help me. Thanks in advance!
Best wishes,
Benjamin
Solved it:
I reshaped everything in Excel and exported it another way. The group variable was also not the way I wanted it. Now it is fixed, but I can't really tell you why.
Your data looks malformed. I guess you wanted to have 6 different group values for each time point, but now the group variable just loops over, and you have:
1 30 0.1 0.3162278 1
...
10 30 0.1 0.3162278 4
...
19 30 1.2 1.1352924 1
...
28 30 0.2 0.4216370 4
geom_bar then probably omits rows that have identical mean and time. Although I am not sure why it chooses to do so, you should solve the group problem first anyway.
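A quick way to see the group problem is to cross-tabulate time against group. The sketch below rebuilds only the Zeit and group columns as they appear in DataCm; the intended_group layout (one group per block of nine rows) is a guess at what was meant, not something the question confirms.

```r
# Zeit and group exactly as they cycle in the posted data (54 rows)
Zeit  <- rep(c(30, 60, 90, 120, 150, 180, 240, 300, 360), times = 6)
group <- rep(1:6, times = 9)

# Every time point should meet every group once; instead two groups appear three times each
counts <- table(Zeit, group)
print(counts)

# Groups that actually occur at Zeit == 30
groups_at_30 <- sort(unique(group[Zeit == 30]))
print(groups_at_30)

# Assumed intended layout: rows 1-9 are group 1, rows 10-18 are group 2, and so on
intended_group <- rep(1:6, each = 9)
print(table(Zeit, intended_group))
```

With the assumed layout every Zeit/group cell holds exactly one row, which is what a dodged bar chart with one bar per group and time point needs.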

Why does the length function not work correctly in R?

The following R code lists the cars of Type Small, but the length function returns 6 instead of 13. Why is that?
> fuel.frame[fuel.frame$Type=="Small",]
row.names Weight Disp. Mileage Fuel Type
1 Eagle.Summit.4 30 0.97 33 3.030303 Small
2 Ford.Escort.4 28 114.00 33 3.030303 Small
3 Ford.Festiva.4 23 0.81 37 2.702703 Small
4 Honda.Civic.4 27 0.91 32 3.125000 Small
5 Mazda.Protege.4 29 113.00 32 3.125000 Small
6 Mercury.Tracer.4 27 0.97 26 3.846154 Small
7 Nissan.Sentra.4 27 0.97 33 3.030303 Small
8 Pontiac.LeMans.4 28 0.98 28 3.571429 Small
9 Subaru.Loyale.4 27 109.00 25 4.000000 Small
10 Subaru.Justy.3 24 0.73 34 2.941176 Small
11 Toyota.Corolla.4 28 0.97 29 3.448276 Small
12 Toyota.Tercel.4 25 0.89 35 2.857143 Small
13 Volkswagen.Jetta.4 28 109.00 26 3.846154 Small
> length(fuel.frame[fuel.frame$Type=="Small",])
[1] 6
length gives in this case the number of columns in the data frame. You can instead use nrow or ncol to get the number of rows or number of columns respectively:
nrow(fuel.frame[fuel.frame$Type=="Small",])
Another example using iris dataset:
> d = head(iris)
> d
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
> nrow(d)
[1] 6
> ncol(d)
[1] 5
> dim(d)
[1] 6 5
I thought it might help to give a bit of an explanation as to why you're getting your result. You're asking for the length of the data.frame, not of a vector. Since the data.frame has 6 columns, that explains your result.
this asks for the vector specifically:
length(fuel.frame$Type[fuel.frame$Type=="Small"])
and so does this:
length(fuel.frame[fuel.frame$Type=="Small",][,1])
or use nrow instead of length as already suggested.
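The difference can be checked on the built-in iris data set, used here as a stand-in because fuel.frame is not available in a plain R session. The variable names are illustrative only.

```r
# Filter rows of a data frame, as in the question
small <- iris[iris$Species == "setosa", ]

# length() of a data frame counts its columns, because a data frame is a list of columns
n_cols <- length(small)

# nrow() counts the rows that passed the filter
n_rows <- nrow(small)

# Summing the logical condition counts matching rows without building the subset
n_match <- sum(iris$Species == "setosa")

print(c(columns = n_cols, rows = n_rows, matches = n_match))
```

Here length reports 5 (the number of columns), while nrow and the sum of the condition both report 50.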

Adding information on a graph using R

I would like to add some information on my graph which was plotted from this data set:
EDITED:
#data set:
day <- c(0:28)
ndied <- c(342,335,240,122,74,64,49,60,51,44,35,48,41,34,38,27,29,23,20,15,20,16,17,17,14,10,4,1,2)
pdied <- c(19.1,18.7,13.4,6.8,4.1,3.6,2.7,3.3,2.8,2.5,2.0,2.7,2.3,1.9,2.1,1.5,1.6,1.3,1.1,0.8,1.1,0.9,0.9,0.9,0.8,0.6,0.2,0.1,0.1)
pmort <- data.frame(day,ndied,pdied)
> pmort
day ndied pdied
1 0 342 19.1
2 1 335 18.7
3 2 240 13.4
4 3 122 6.8
5 4 74 4.1
6 5 64 3.6
7 6 49 2.7
8 7 60 3.3
9 8 51 2.8
10 9 44 2.5
11 10 35 2.0
12 11 48 2.7
13 12 41 2.3
14 13 34 1.9
15 14 38 2.1
16 15 27 1.5
17 16 29 1.6
18 17 23 1.3
19 18 20 1.1
20 19 15 0.8
21 20 20 1.1
22 21 16 0.9
23 22 17 0.9
24 23 17 0.9
25 24 14 0.8
26 25 10 0.6
27 26 4 0.2
28 27 1 0.1
29 28 2 0.1
I have put together this script and still trying to improve on it so that the rest of the information can be added:
> barplot(pmort$pdied,xlab="Age(days)",ylab="Percent",xlim=c(0,28),ylim=c(0,20),legend="Mortality")
I am trying to insert the numbers 0 to 28 (age in days) on the x-axis but could not and I know that it could be a simple script. Secondly, I would like to add the number died or ndied (342 to 2) below each day(0 to 28) along the x-axis.
Example:
0 1 2 3 4 5 and so on...
(N=342) (N=335) (N=240) (N=122) (N=74) (N=64)
Graph: [plot not shown]
Any help would be appreciated.
Baz
I gave you two ways to plot the info: one above the bars and one below. You can tweak it to meet your needs.
# xlim is left at its default: the bar midpoints are not at 0..28, so xlim=c(0,28) would clip the last bars
# Option 1: number died printed just above each bar
barX <- barplot(pmort$pdied, xlab="Age(days)",
ylab="Percent", names.arg=pmort$day,
ylim=c(0,20), legend.text="Mortality")
text(cex=.5, x=barX, y=pmort$pdied+par("cxy")[2]/2, pmort$ndied, xpd=TRUE)
# Option 2: number died printed just below the baseline
barX <- barplot(pmort$pdied, xlab="Age(days)",
ylab="Percent", names.arg=pmort$day,
ylim=c(0,20), legend.text="Mortality")
text(cex=.5, x=barX, y=-.5, pmort$ndied, xpd=TRUE)
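A variation that comes closer to the "(N=342)" layout asked for in the question is to build two-line axis labels and pass them as names.arg. This is a sketch rather than a polished figure: the label size (cex.names = 0.4) is a guess that will likely need tuning, and the null PDF device is used only so the code also runs in a non-interactive session.

```r
# Data from the question
day   <- 0:28
ndied <- c(342,335,240,122,74,64,49,60,51,44,35,48,41,34,38,27,29,23,20,15,
           20,16,17,17,14,10,4,1,2)
pdied <- c(19.1,18.7,13.4,6.8,4.1,3.6,2.7,3.3,2.8,2.5,2.0,2.7,2.3,1.9,2.1,
           1.5,1.6,1.3,1.1,0.8,1.1,0.9,0.9,0.9,0.8,0.6,0.2,0.1,0.1)

# Two-line labels: the day on the first line, the count on the second
labels <- paste0(day, "\n(N=", ndied, ")")

# Null device so nothing is written to disk; remove this line and the dev.off() to draw on screen
pdf(NULL)
barX <- barplot(pdied, names.arg = labels, cex.names = 0.4,
                xlab = "Age (days)", ylab = "Percent", ylim = c(0, 20))
invisible(dev.off())
```

barplot returns the bar midpoints (barX), so further annotations can still be added with text() as in the two options above.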
