R's paste function equivalent in BASH?
I'm trying to translate this piece of R code into bash:
year <- 2010
emmissions <- paste("Emmision table from ",year, sep="")
em_table <- read.table(emmissions, head=F, as.is=T)
for (i in 1:nrow(em_table)) {
print(em_table[i,1])
}
But I can't figure out how to translate the paste function to concatenate a string with a variable. The expected outcome would be this script translated into bash code.
Perhaps use echo "Emmision table from ${year}" inside your for loop, something like below:
for var in 2001 2003 2004;
do
echo "Emission table ${var}"
done
Updated at the OP's request:
For sequence generation in bash, one can do for i in {1..5} for a sequence from 1 to 5, or for i in {0..20..4} for a sequence from 0 to 20 with a step size of 4.
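For example, a quick sketch tying this back to the OP's year variable (assuming the years you need really form a simple range):
for year in {2001..2004}
do
    # brace expansion generates the sequence; ${year} does the paste()-style concatenation
    echo "Emmision table from ${year}"
done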
For a column-like structure, assuming a table with two columns saved in a text file (table.txt):
col1 col2
1 2
3 4
5 6
while read col1 col2
do
echo "Col1 : $col1"
echo "Col2 : $col2"
done < table.txt
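Putting the pieces together, a rough sketch of the OP's whole R snippet in bash might look like this (it assumes the file is literally named "Emmision table from 2010", as built by the paste() call, and that the columns are whitespace-separated):
year=2010
emissions="Emmision table from ${year}"   # paste() equivalent: plain variable interpolation
while read -r col1 _rest
do
    echo "$col1"                          # same as print(em_table[i,1])
done < "$emissions"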
There already is an accepted answer, but here is an awk solution.
Tested with just one file, named "test2020".
for var in 2020;
do
awk '{print $1}' "test${var}"
done
Output:
1
2
3
To read both columns, the awk line would be
awk '{print $1, $2}' "test${var}"
or, since the OP says the files only have two columns,
awk '{print $0}' "test${var}"
The file contents are:
1 a
2 b
3 c
Related
How to split file in two by segments of 4 lines in unix
I'm searching for a way to split my file in two; I haven't found an answer here. I have a large file (with millions of lines) and would like to split it in two files, always by four lines, i.e. the first four lines (1,2,3,4) go in the first output file, the second four lines (5,6,7,8) go in the second output file, then the third four lines (9,10,11,12) go again to the first output file, etc. I'm sure there will be a way - possibly using awk (?) but I can't get it right... Thanks a lot!
All you need is
awk 'NR%4==1{c=!c} {print > ("out"c)}'
Look:
$ seq 10 | awk 'NR%4==1{c=!c} {print $0 " > " ("out"c)}'
1 > out1
2 > out1
3 > out1
4 > out1
5 > out0
6 > out0
7 > out0
8 > out0
9 > out1
10 > out1
Another awk:
awk '{print > "out_"((NR-1)%8>3)}' file
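As a quick sanity check of the first one-liner, something like this (faking a 12-line input with seq; the out0/out1 file names come from that answer) should show the alternating 4-line blocks:
seq 12 | awk 'NR%4==1{c=!c} {print > ("out"c)}'
cat out1   # lines 1-4 and 9-12
cat out0   # lines 5-8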
How to remove multiple commas but keep one in between two values in a csv file?
I have a csv file with millions of records like below:
1,,,,,,,,,,a,,,,,,,,,,,,,,,,4,,,,,,,,,,,,,,,456,,,,,,,,,,,,,,,,,,,,,3455,,,,,,,,,,
1,,,,,,,,,,b,,,,,,,,,,,,,,,,5,,,,,,,,,,,,,,,467,,,,,,,,,,,,,,,,,,,,,3445,,,,,,,,,,
2,,,,,,,,,,c,,,,,,,,,,,,,,,,6,,,,,,,,,,,,,,,567,,,,,,,,,,,,,,,,,,,,,4656,,,,,,,,,,
I have to remove the extra commas between two values and keep only one. The output for the sample input should look like:
1,a,4,456,3455
1,b,5,467,3445
2,c,6,567,4656
How can I achieve this using the shell, since it needs to be automated for other files too? I need to load this data into a database. Can we do it using R?
sed method:
sed -e "s/,\+/,/g" -e "s/,$//" input_file > output_file
This turns multiple commas into a single comma and also removes the last comma on each line.
Edited to address the modified question. R solution. The original solution provided was just processing text. Assuming that your rows are in a structure, you can handle multiple rows with:
# Create data
Row1 = "1,,,,,,,a,,,,,,,,,,4,,,,,,,,,456,,,,,,,,,,,3455,,,,,,,"
Row2 = "2,,,,,,,b,,,,,,,,,,5,,,,,,,,,567,,,,,,,,,,,4566,,,,,,,"
Rows = c(Row1, Row2)
CleanedRows = gsub(",+", ",", Rows)          # Compress multiple commas
CleanedRows = sub(",\\s*$", "", CleanedRows) # Remove final comma if any
[1] "1,a,4,456,3455" "2,b,5,567,4566"
But if you are trying to read this from a csv and compress the rows:
## Create sample data
Data = read.csv(text="1,,,,,,,a,,,,,,,,,,4,,,,,,,,,456,,,,,,,,,,,3455,,,,,,,
2,,,,,,,b,,,,,,,,,,5,,,,,,,,,567,,,,,,,,,,,4566,,,,,,,", header=FALSE)
Your code would probably say
Data = read.csv("YourFile.csv", header=FALSE)
Data = Data[which(!is.na(Data[1,]))]
Data
  V1 V8 V18 V27  V38
1  1  a   4 456 3455
2  2  b   5 567 4566
Note: This assumes that the non-blank fields are in the same place in every row.
Use tr -s:
echo 'a,,,,,,,,b,,,,,,,,,,c' | tr -s ','
Output:
a,b,c
If the input line has trailing commas, tr -s ',' would squeeze those trailing commas into one comma, but getting rid of that one requires adding a little sed code: tr -s ',' | sed 's/,$//'.
Speed. Tests on a 10,000,000 line test file consisting of the first line in the OP example, repeated:
3 seconds: tr -s ',' (but leaves a trailing comma)
9 seconds: tr -s ',' | sed 's/,$//'
30 seconds: sed -e "s/,\+/,/g" -e "s/,$//" (Jean-François Fabre's answer)
If you have a file that's really a CSV file, it might have quoting of commas in a few different ways, which can make regex-based CSV parsing unhappy. I generally use and recommend csvkit, which has a nice set of CSV parsing utilities for the shell. Docs at http://csvkit.readthedocs.io/en/latest/
Your exact issue is answered in csvkit with this set of commands. First, csvstat shows what the file looks like:
$ csvstat -H --max tmp.csv | grep -v None
  1. column1: 2
 11. column11: c
 27. column27: 6
 42. column42: 567
 63. column63: 4656
Then, now that you know that all of the data is in those columns, you can run this:
$ csvcut -c 1,11,27,42,63 tmp.csv
1,a,4,456,3455
1,b,5,467,3445
2,c,6,567,4656
to get your desired answer.
Can we do it using R? Provided your input is as shown, i.e., you want to skip the same columns in all rows, you can analyze the first line and then define column classes in read.table:
text <- "1,,,,,,,,,,a,,,,,,,,,,,,,,,,4,,,,,,,,,,,,,,,456,,,,,,,,,,,,,,,,,,,,,3455,,,,,,,,,,
1,,,,,,,,,,b,,,,,,,,,,,,,,,,5,,,,,,,,,,,,,,,467,,,,,,,,,,,,,,,,,,,,,3445,,,,,,,,,,
2,,,,,,,,,,c,,,,,,,,,,,,,,,,6,,,,,,,,,,,,,,,567,,,,,,,,,,,,,,,,,,,,,4656,,,,,,,,,,"
tmp <- read.table(text = text, nrows = 1, sep = ",")
colClasses <- sapply(tmp, class)
colClasses[is.na(unlist(tmp))] <- "NULL"
Here I assume there are no actual NA values in the first line. If there could be, you'd need to adjust it slightly.
read.table(text = text, sep = ",", colClasses = colClasses)
#  V1 V11 V27 V42  V63
#1  1   a   4 456 3455
#2  1   b   5 467 3445
#3  2   c   6 567 4656
Obviously, you'd specify a file instead of text. This solution is fairly efficient for smallish to moderately sized data. For large data, substitute the second read.table with fread from package data.table (but that applies regardless of the skipping-columns problem).
AWK or SED to remove pattern from every line while using grep
I have grepped 2 columns, say col1 and col2. In col2 I want to remove a pattern which occurs in every line. How can I use awk/sed for this purpose? Suppose ps -eaf | grep b would result in the following output:
col1 col2 col3
1 a/b/rac 123
2 a/b/rac1 456
3 a/b/rac3 789
I want the output stored in a file like this:
1 rac
2 rac1
3 rac3
Vaguely speaking, this might do what you want:
$ awk 'sub(/.*b\//,"",$2){print $1, $2}' file
1 rac
2 rac1
3 rac3
assuming file contains:
col1 col2 col3
1 a/b/rac 123
2 a/b/rac1 456
3 a/b/rac3 789
You will want to pipe the output to
| awk '{sub(/^.*\//, "", $2); print $1, $2}'
The first command in the awk program changes the contents of the second field to only contain whatever was after the final / character, which appears to be what you want. The second command prints the first two fields.
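Since the question also wants the result stored in a file, the whole pipeline would be something along these lines (assuming the fields really line up as in the sample output; output.txt is just a placeholder name):
ps -eaf | grep b | awk '{sub(/^.*\//, "", $2); print $1, $2}' > output.txt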
Split first column into two and preserve rest in unix
I need to split the first column, delimited by '#', into two columns. My data is in the following format:
1#b,a
2#b,a
5#c,d
Required output:
1,b,a
2,b,a
5,c,d
Other columns can have # in their values, so I want to apply the regex only to the first column. Thanks, Jitendra
A file (file.orig) contains this:
1#b,a #
2#b,a #
5#c,d #
Use sed:
sed 's/#/,/1' file.orig > file.new
Output (cat file.new):
1,b,a #
2,b,a #
5,c,d #
You didn't say where the data is. I'll assume it's in a file. tr '#' ',' < some_file.txt
awk -F, '{OFS=","}{gsub(/#/,",",$1);}1' your_file
How to diff two file lists while ignoring position in the list
I have two lists of files which I want to diff. The second list has more files in it, and because they are all in alphabetical order, when I diff these two lists I get files (lines) that exist in both lists, but in a different place. I want to diff these two lists, ignoring line position in the list. This way I would get only the new or missing lines in the list. Thank you.
You can try this approach, which involves "subtracting" the two lists as follows:
$ cat file1
a.txt
b.txt
c.txt
$ cat file2
a.txt
a1.txt
b.txt
b2.txt
1) print everything in file2 that is not in file1, i.e. file2 - file1:
$ grep -vxFf file1 file2
a1.txt
b2.txt
2) print everything in file1 that is not in file2, i.e. file1 - file2:
$ grep -vxFf file2 file1
c.txt
(You can then do what you want with these diffs, e.g. write to file, sort, etc.)
grep option descriptions:
-v, --invert-match     select non-matching lines
-x, --line-regexp      force PATTERN to match only whole lines
-F, --fixed-strings    PATTERN is a set of newline-separated strings
-f, --file=FILE        obtain PATTERN from FILE
Do the following:
cat file1 file2 | sort | uniq -u
This will give you a list of lines which are unique (i.e., not duplicated).
Explanation:
1) cat file1 file2 will put all of the entries into one list
2) sort will sort the combined list
3) uniq -u will only output the entries which don't have duplicates
The deft command to use here is the humble comm command. To demonstrate, let's create two input files:
$ cat <<EOF >a
> a.txt
> b.txt
> c.txt
> EOF
$ cat <<EOF >b
> a.txt
> a1.txt
> b.txt
> b2.txt
> EOF
Now, using comm to get what the question wanted (only the new or missing lines):
$ comm -3 a b
        a1.txt
        b2.txt
c.txt
This shows a columnar output with missing files (lines in a but not in b) in the first column and extra files (lines in b but not in a) in the second column.
What exactly does comm do? Here's the output if the command is typed without any switches:
$ comm a b
                a.txt
        a1.txt
                b.txt
        b2.txt
c.txt
This shows three columns:
lines in a but not in b
lines in both a and b
lines in b but not in a
What the numbered switches -1, -2 and -3 do is hide the corresponding column from the output. So for example:
Specifying -23 leaves only lines in a but not in b
Specifying -13 leaves only lines in b but not in a
Specifying -12 leaves only the lines common to both
Specifying -3 gives the symmetric difference (new plus missing lines)
Specifying -123 produces no output at all
For the example you quoted, @Sparr:
a contains
a.txt
b.txt
c.txt
b contains
a.txt
a1.txt
b.txt
b2.txt
diff a b gives
1a2
> a1.txt
3c4
< c.txt
---
> b2.txt
What is it about this output that does not meet your needs?
Sorting the two lists before you diff them will give you more useful diff output.
If the lines are sorted, diff should catch the insertions and deletions just fine and only report the differences.
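A minimal sketch of the sort-then-diff idea from the last two answers, using bash process substitution so the original files stay untouched (file1 and file2 as in the grep answer above):
diff <(sort file1) <(sort file2)
# or, to list only the added/removed names without diff's position markers:
comm -3 <(sort file1) <(sort file2)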