AWK command to sum 2 files - unix

I am new to awk and I need an awk command to sum values across two files when they share the same key column.
file 1
a | 16:00 | 24
b | 16:00 | 12
c | 16:00 | 32
file 2
b | 16:00 | 10
c | 16:00 | 5
d | 16:00 | 14
and the output should be
a | 16:00 | 24
b | 16:00 | 22
c | 16:00 | 37
d | 16:00 | 14
I have read some of the questions here and still have not found the correct way to do it. I already tried this command:
awk 'BEGIN { FS = "," } ; FNR=NR{a[$1]=$2 FS $3;next}{print $0,a[$1]}'
Please help me, thank you.

This script also uses sort, but it works:
awk -F'|' ' { f[$1] += $3 ; g[$1] = $2 } END { for (a in f) { print a , "|", g[a] , "|", f[a] } } ' a.txt b.txt | sort
The results are
a | 16:00 | 24
b | 16:00 | 22
c | 16:00 | 37
d | 16:00 | 14

Without | sort (GNU awk only, since asorti() is a gawk extension; note that iterating the sorted indices with a counter is required, because for (t in T) does not guarantee order):
awk -F'|' '{O[$1 FS $2]+=$3} END{n=asorti(O,T,"@ind_str_asc"); for(t=1;t<=n;t++) print T[t] FS O[T[t]]}' file1 file2

Just store all the data in two arrays a[] and b[] and then print them back:
awk 'BEGIN{FS=OFS="|"}
{a[$1]+=$3; b[$1]=$2}
END{for (i in a) print i,b[i],a[i]}' f1 f2
Test
$ awk 'BEGIN{FS=OFS="|"} {a[$1]+=$3; b[$1]=$2} END{for (i in a) print i,b[i],a[i]}' f1 f2
b | 16:00 |22
c | 16:00 |37
d | 16:00 |14
a | 16:00 |24
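For a self-contained run, here is a sketch of the sum-and-sort approach from the first answer, with the sample data written to hypothetical files file1 and file2 (the file names are assumptions):

```shell
# Recreate the sample data from the question (hypothetical file names).
printf 'a | 16:00 | 24\nb | 16:00 | 12\nc | 16:00 | 32\n' > file1
printf 'b | 16:00 | 10\nc | 16:00 | 5\nd | 16:00 | 14\n' > file2

# Sum field 3 per key, remember field 2, then print; sort fixes the
# arbitrary iteration order of "for (k in sum)".
awk -F'|' '{ sum[$1] += $3; time[$1] = $2 }
           END { for (k in sum) print k "|" time[k] "| " sum[k] }' file1 file2 | sort
```

Because FS is '|', the surrounding spaces stay attached to each field, so the output matches the question's desired format line for line.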

Related

how to reference a result in a subquery

I have the following table in an sqlite database
+----+-------------+-------+
| ID | Week Number | Count |
+----+-------------+-------+
| 1 | 1 | 31 |
| 2 | 2 | 16 |
| 3 | 3 | 73 |
| 4 | 4 | 59 |
| 5 | 5 | 44 |
| 6 | 6 | 73 |
+----+-------------+-------+
I want to get the following table out, where this week's sales are one column and the next column is last week's sales.
+-------------+-----------+-----------+
| Week Number | This_Week | Last_Week |
+-------------+-----------+-----------+
| 1 | 31 | null |
| 2 | 16 | 31 |
| 3 | 73 | 16 |
| 4 | 59 | 73 |
| 5 | 44 | 59 |
| 6 | 73 | 44 |
+-------------+-----------+-----------+
This is the select statement I was going to use:
select
id, week_number, count,
(select count from tempTable
where week_number = (week_number-1))
from
tempTable;
You are comparing values in two different rows. When you are just writing week_number, the database does not know which one you mean.
To refer to a column in a specific table, you have to prefix it with the table name: tempTable.week_number.
And if both tables have the same name, you have to rename at least one of them:
SELECT id,
week_number,
count AS This_Week,
(SELECT count
FROM tempTable AS T2
WHERE T2.week_number = tempTable.week_number - 1
) AS Last_Week
FROM tempTable;
If you want to query the same table twice, you have to put aliases on the original and its copy to differentiate them:
select a.week_number,a.count this_week,
(select b.count from tempTable b
where b.week_number=(a.week_number-1)) last_week
from tempTable a;
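To see the correlated subquery in action end to end, here is a minimal sketch using Python's stdlib sqlite3 module with an in-memory copy of the question's table (the Python harness is an assumption; the query itself mirrors the aliased answer above):

```python
import sqlite3

# In-memory copy of the example table, names as in the question.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tempTable (id INTEGER, week_number INTEGER, count INTEGER)")
conn.executemany(
    "INSERT INTO tempTable VALUES (?, ?, ?)",
    [(1, 1, 31), (2, 2, 16), (3, 3, 73), (4, 4, 59), (5, 5, 44), (6, 6, 73)],
)

# Correlated subquery: alias the outer and inner uses of the table
# so "week_number" is unambiguous in each scope.
rows = conn.execute(
    """
    SELECT a.week_number,
           a.count AS This_Week,
           (SELECT b.count FROM tempTable b
             WHERE b.week_number = a.week_number - 1) AS Last_Week
      FROM tempTable a
     ORDER BY a.week_number
    """
).fetchall()
print(rows[:2])  # → [(1, 31, None), (2, 16, 31)]
```

Week 1 has no predecessor, so its Last_Week comes back as NULL (None in Python), exactly as in the desired output table.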

Extracting columns from text file

I load a text file (tree.txt) to R, with the below content (copy pasted from JWEKA - J48 command).
I use the following command to load the text file:
data3 <-read.table (file.choose(), header = FALSE,sep = ",")
I would like to insert each column into a separate variable, named in the format COL1, COL2, ... COL8 (in this example, since we have 8 columns). If you load it into Excel with delimited separation, each row will be split across columns (this is the required result).
Each COLn will contain the relevant characters of the tree in this example.
How can I separate and insert the text file into these columns automatically, while ignoring the header and footer content of the file?
Here is the text file content:
[[1]]
J48 pruned tree
------------------
MSTV <= 0.4
| MLTV <= 4.1: 3 -2
| MLTV > 4.1
| | ASTV <= 79
| | | b <= 1383:00:00 2 -18
| | | b > 1383
| | | | UC <= 05:00 1 -2
| | | | UC > 05:00 2 -2
| | ASTV > 79:00:00 3 -2
MSTV > 0.4
| DP <= 0
| | ALTV <= 09:00 1 (170.0/2.0)
| | ALTV > 9
| | | FM <= 7
| | | | LBE <= 142:00:00 1 (27.0/1.0)
| | | | LBE > 142
| | | | | AC <= 2
| | | | | | e <= 1058:00:00 1 -5
| | | | | | e > 1058
| | | | | | | DL <= 04:00 2 (9.0/1.0)
| | | | | | | DL > 04:00 1 -2
| | | | | AC > 02:00 1 -3
| | | FM > 07:00 2 -2
| DP > 0
| | DP <= 1
| | | UC <= 03:00 2 (4.0/1.0)
| | | UC > 3
| | | | MLTV <= 0.4: 3 -2
| | | | MLTV > 0.4: 1 -8
| | DP > 01:00 3 -8
Number of Leaves : 16
Size of the tree : 31
An example of the COL1 content will be:
MSTV
|
|
|
|
|
|
|
|
MSTV
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
COL2 content will be:
MLTV
MLTV
|
|
|
|
|
|
>
DP
|
|
|
|
|
|
|
|
|
|
|
|
DP
|
|
|
|
|
|
Try this:
cleaned.txt <- capture.output(cat(paste0(tail(head(readLines("FILE_LOCATION"), -4), -4), collapse = '\n'), sep = '\n'))
cleaned.df <- read.fwf(file = textConnection(cleaned.txt),
header = FALSE,
widths = rep.int(4, max(nchar(cleaned.txt)/4)),
strip.white= TRUE
)
cleaned.df <- cleaned.df[,colSums(is.na(cleaned.df))<nrow(cleaned.df)]
For the cleaning process, I end up using a combination of head and tail to remove the 4 lines at the top and the bottom. There's probably a more efficient way to do this outside of R, but this isn't so bad. Generally, I'm just making the file readable to R.
Your file looks like a fixed-width file so I use read.fwf, and use textConnection() to point the function to the cleaned output.
Finally, I'm not sure how your data is actually structured, but when I copied it from Stack Overflow, it pasted with a bunch of whitespace at the end of each line. I'm using some tricks to guess at how long each line is, and removing the extraneous columns with:
widths = rep.int(4, max(nchar(cleaned.txt)/4))
cleaned.df <- cleaned.df[,colSums(is.na(cleaned.df))<nrow(cleaned.df)]
Next, I'm creating the data in the way you would like it structured.
for (i in colnames(cleaned.df)) {
assign(i, subset(cleaned.df, select=i))
assign(i, capture.output(cat(paste0(unlist(get(i)[get(i)!=""])),sep = ' ', fill = FALSE)))
}
rm(i)
rm(cleaned.df)
rm(cleaned.txt)
What this does is it creates a loop for each column header in your data frame.
From there it uses assign() to put the data from each column into its own data frame. In your case, they are named V1 through V15.
Next, it uses a combination of cat() and paste() with unlist() and capture.output() to concatenate each list into a single character vector, so each data frame becomes a character vector.
Keep in mind that because you wanted a space at each new character, I'm using a space as a separator. But because this is a fixed-width file, some columns are completely blank, which I'm removing using
get(i)[get(i)!=""]
(Your question said you wanted COL2 to be: MLTV MLTV | | | | | | > DP | | | | | | | | | | | | DP | | | | | |).
If we just use get(i), there will be a leading whitespace in the output.

Extracting contents from decision tree J48

I have the following decision tree (created by JWEKA package - by the command J48(NSP~., data=training) ):
[[1]]
J48 pruned tree
------------------
MSTV <= 0.4
| MLTV <= 4.1: 3 -2
| MLTV > 4.1
| | ASTV <= 79
| | | b <= 1383:00:00 2 -18
| | | b > 1383
| | | | UC <= 05:00 1 -2
| | | | UC > 05:00 2 -2
| | ASTV > 79:00:00 3 -2
MSTV > 0.4
| DP <= 0
| | ALTV <= 09:00 1 (170.0/2.0)
| | ALTV > 9
| | | FM <= 7
| | | | LBE <= 142:00:00 1 (27.0/1.0)
| | | | LBE > 142
| | | | | AC <= 2
| | | | | | e <= 1058:00:00 1 -5
| | | | | | e > 1058
| | | | | | | DL <= 04:00 2 (9.0/1.0)
| | | | | | | DL > 04:00 1 -2
| | | | | AC > 02:00 1 -3
| | | FM > 07:00 2 -2
| DP > 0
| | DP <= 1
| | | UC <= 03:00 2 (4.0/1.0)
| | | UC > 3
| | | | MLTV <= 0.4: 3 -2
| | | | MLTV > 0.4: 1 -8
| | DP > 01:00 3 -8
Number of Leaves : 16
Size of the tree : 31
I would like to extract the nodes' values in two formats.
In the first format, only the name of the property, such as MSTV, MLTV, DP, etc.
So each level of the tree will be followed by its parent; in the above case I would like '(' as the separator between levels, such as:
(MSTV (MLTV...) (DP...) )
In the second format I would like to get the nodes with their values such as:
(MSTV 0.4 (MLTV 4.1 ....) (DP 0..... ) )
How can I extract the relevant information? To separate the node values, I think we should strip the characters using gsub("[A-Z]:", "", string),
but we need to ignore the last lines.
Thanks a lot for your help.

Subset a dataframe using a array created by itself in R

Sorry, I don't know how to describe my question in the title.
My problem is as follows.
I have a data frame as below:
| id | content | created_at |
|----|---------|---------------------|
| 1 | hello | 2014-12-10 00:00:00 |
| 2 | world | 2013-11-11 00:00:00 |
| 3 | oh~no | 2012-10-10 00:00:00 |
| 4 | helpme | 2011-09-11 00:00:00 |
I want to subset this frame by time interval
for example:
subset: 2011 - 2012
| 4 | helpme | 2011-09-11 00:00:00 |
subset: 2012 - 2013
| 3 | oh~no | 2012-10-10 00:00:00 |
subset: 2013 - 2014
| 2 | world | 2013-11-11 00:00:00 |
subset: 2014 - 2015
| 1 | hello | 2014-12-10 00:00:00 |
Below is how I tried to resolve this problem.
I tried to create a TRUE/FALSE array and apply it to each row:
ifelse(
difftime(DF$created_at,as.Date(ISOdate(seq(2004,2014),1,1))) >= 0 &
difftime(DF$created_at,as.Date(ISOdate(seq(2005,2015),1,1))) < 0
, assign_to_subset_X, do_nothing)
but...
I don't think this is a good idea, especially since I am already using R...
Then I found some solutions such as apply:
apply(DF, 2, do_a_function_to_subset)
but I still have no idea how to write this function.
Please give me a hint.
Here is one possible solution:
library(lubridate)
df <- read.table(textConnection(" id | content | created_at
1 | hello | 2014-12-10 00:00:00
2 | world | 2013-11-11 00:00:00
3 | oh~no | 2012-10-10 00:00:00
4 | helpme | 2011-09-11 00:00:00 "), header=TRUE, sep="|")
df$ts <- ymd_hms(df$created_at)
## create an interval
myInt <- ymd("2011-01-01") %--% ymd("2011-12-31")
df[df$ts %within% myInt, ]
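In case it helps readers outside R, the same interval test can be sketched in stdlib Python (the data is hard-coded from the question; the helper name is an assumption, not part of any library):

```python
from datetime import datetime

# Rows copied from the question's data frame.
rows = [
    (1, "hello",  "2014-12-10 00:00:00"),
    (2, "world",  "2013-11-11 00:00:00"),
    (3, "oh~no",  "2012-10-10 00:00:00"),
    (4, "helpme", "2011-09-11 00:00:00"),
]

def within(ts, start, end):
    """True if timestamp string ts falls in the half-open interval [start, end)."""
    t = datetime.strptime(ts, "%Y-%m-%d %H:%M:%S")
    return start <= t < end

# Subset 2011 - 2012, mirroring the lubridate interval above.
subset = [r for r in rows
          if within(r[2], datetime(2011, 1, 1), datetime(2012, 1, 1))]
print(subset)  # → [(4, 'helpme', '2011-09-11 00:00:00')]
```

Looping the start year from 2011 to 2014 reproduces the four subsets requested in the question.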

Joining two files using awk

File 1
+---------+
|ID |
+---------+
| 15 |
| 45 |
| 18 |
| 76 |
| 29 |
| 10 |
| 40 |
+---------+
File 2:
+---------+
| ID Name |
+---------+
| 12 abc |
| 18 nop |
| 15 ujh |
| 30 jkl |
| 15 lmn |
| 18 tre |
| 19 hgt |
+---------+
Desired output:
+---------+
| ID Name |
+---------+
| 18 nop |
| 15 ujh |
| 15 lmn |
| 18 tre |
+---------+
The join command below does not give the desired result. (It should return all rows in File 2 whose first-column value exists in the File 1 table.)
join -1 1 -2 1 File1.txt File2.txt
Please help.
Well, since you ask specifically for an awk solution, here's one approach:
#!/bin/sh
awk 'BEGIN {
    # First pass: read File1.txt and remember every numeric field as a key.
    while ((getline line < "File1.txt") > 0) {
        split(line, a)
        for (fld in a) {
            if (a[fld] ~ /^[0-9]*$/) {
                targets[a[fld]] = a[fld]
            }
        }
    }
} {
    # Data rows look like "| 18 nop |": 4 fields, with the ID in $2.
    if (NF == 4 && $2 ~ /^[0-9]*$/) {
        if ($2 in targets) {
            print $0
        }
    } else {
        # Header and border lines pass through unchanged.
        print $0
    }
}' File2.txt
Although, like @Mark Setchell, I wonder why you wouldn't get this output from the database itself, if you have access to it.
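As an aside, the original join attempt fails mainly because join requires both inputs to be sorted on the join key, and the decorative borders get in the way. Once the files are cleaned, plain join works; a hedged sketch, assuming border-free copies named ids.txt and names.txt:

```shell
# Cleaned, border-free versions of the two files (assumed names).
printf '15\n45\n18\n76\n29\n10\n40\n' > ids.txt
printf '12 abc\n18 nop\n15 ujh\n30 jkl\n15 lmn\n18 tre\n19 hgt\n' > names.txt

# join needs both inputs sorted on the join key (field 1 here).
sort ids.txt > ids.sorted
sort -k1,1 names.txt > names.sorted
join ids.sorted names.sorted
```

This prints only the name rows whose ID appears in the ID file, which is the desired output minus the ASCII-art framing.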
