Start reading an Excel file after an empty line - r

I have several Excel files I need to read, but in some of them the results start in row 41, in others in row 48. However, all of them start right after a single empty row. How can I write code that reads every file starting after this empty row?
Here is an example: my data start at row 41. The name of the cell is the same; it's only the position that changes.
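A minimal sketch using the readxl package (the helper name read_after_blank and the "data" folder are placeholders, and it assumes each sheet's used range starts at row 1): read the sheet once without column names to find the first completely empty row, then re-read it, skipping down so the row just below the empty one is taken as the header.
library(readxl)

read_after_blank <- function(path) {
  raw   <- read_excel(path, col_names = FALSE)    # raw read of the whole sheet
  blank <- which(rowSums(!is.na(raw)) == 0)[1]    # index of the first all-empty row
  read_excel(path, skip = blank)                  # header row sits right below it
}

files   <- list.files("data", pattern = "\\.xlsx$", full.names = TRUE)
results <- lapply(files, read_after_blank)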

Related

Why does the test run three times if the CSV stores two data rows?

I made a CSV and I used it for a data driven test in Katalon.
x,y
house,way
1,2
When I run the test, it runs three times, even though I stored only two valid data rows (house,way and 1,2).
I don't know why this happened.
It's your CSV file: it has an extra line or CR at the end. Use Notepad to delete all empty space and lines after the last character of the second data row. Also remember to delete and reload the file in Data Driven.

Why is fread not accepting the skip command?

I have a .txt dataset where the first 12 lines are text, followed by 2 blank rows and then the data:
DATE HEIGHT INPUT OUTPUT TESTMEASURE
01/01/1933 NO RECORD NO RECORD MISSING MISSING
01/02/1933 NO RECORD NO RECORD MISSING MISSING
But when I do
dat <- fread('data.txt')
it skips 15 rows and uses the first data line as the column names for the imported dataset, ignoring the header line.
01/01/1933 NO RECORD NO RECORD MISSING MISSING
The skip parameter is not affecting what I import at all. How can I specify the row number that should be used as the column names? Alternatively I could rename the columns afterwards, but the first line of data shouldn't be ignored.
DIAGNOSIS
Input contains no \n. Taking this to be a filename to open
File opened, filesize is 0.001319 GB.
Memory mapping ... ok
Detected eol as \r\n (CRLF) in that order, the Windows standard.
Positioned on line 1 after skip or autostart
This line is the autostart and not blank so searching up for the last non-blank ... line 1
Detecting sep ... '\t'
Detected 5 columns. Longest stretch was from line 15 to line 30
Starting data input on line 15 (either column names or first row of data). First 10 characters: 01/01/1933
The line before starting line 15 is non-empty and will be ignored (it has too few or too many items to be column names or data): DATE HEIGHT INPUT OUTPUT TESTMEASURE
All the fields on line 15 are character fields. Treating as the column names.
You have 12 lines of text, 2 lines of spaces, and then your data, but I noticed extra whitespace between DATE and HEIGHT. To reproduce the problem, make a tab-delimited text file like this, with 2 tabs between DATE and HEIGHT instead of 1:
garbage
garbage
garbage
garbage
garbage
garbage
garbage
garbage
garbage
garbage
garbage
garbage
DATE HEIGHT INPUT OUTPUT TESTMEASURE
01/01/1933 NO RECORD NO RECORD MISSING MISSING
01/02/1933 NO RECORD NO RECORD MISSING MISSING
Doing fread(data) gives me:
fread(data)
01/01/1933 NO RECORD NO RECORD MISSING MISSING
1: 01/02/1933 NO RECORD NO RECORD MISSING MISSING
Removing the extra tab between DATE and HEIGHT gives me:
DATE HEIGHT INPUT OUTPUT TESTMEASURE
1: 01/01/1933 NO RECORD NO RECORD MISSING MISSING
2: 01/02/1933 NO RECORD NO RECORD MISSING MISSING
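If editing the file is not an option, a hedged alternative is to skip straight to the first data line and supply the column names yourself (fread's skip argument also accepts a search string); the names below are simply copied from the header line in the question:
library(data.table)

# Jump to the first data line and pass the names explicitly, so the malformed
# header line is ignored but no data rows are lost.
dat <- fread("data.txt", skip = "01/01/1933", header = FALSE,
             col.names = c("DATE", "HEIGHT", "INPUT", "OUTPUT", "TESTMEASURE"))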

R - Extract multiple tables from text file

I have a .txt file containing text (which I don't want) and 65 tables, as shown below (just the top of the .txt file)
Does anyone know how I can extract only the tables from this text file, so that I can open the resulting .txt file as a data.frame with my 65 tables in R? Above each table is a fixed number of lines (starting with "The result of abcpred on seq..." and ending with "Predicted B cell epitopes"), and below each table is a variable number of lines, depending on how many rows the table has. Then comes the next table, and so on until the 65th table.
Given that the tables are the only lines that start with numbers, grepping for digits at the beginning of the line is indeed the best solution. Using the shell (not R), the command:
grep '^[0-9]' input > output
did exactly what I wanted.
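For reference, the same filter can also be applied from within R, assuming the file fits in memory (the file names here are placeholders):
lines  <- readLines("input.txt")
tables <- lines[grepl("^[0-9]", lines)]   # keep only lines starting with a digit
writeLines(tables, "output.txt")
# If all 65 tables share the same columns, they can be parsed directly instead:
# df <- read.table(text = tables)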

Modifying line names to avoid redundancies when files are merged in terminal

I have two files containing biological DNA sequence data. Each of these files are the output of a python script which assigns each DNA sequence to a sample ID based on a DNA barcode at the beginning of the sequence. The output of one of these .txt files looks like this:
>S066_1 IGJRWKL02G0QZG orig_bc=ACACGTGTCGC new_bc=ACACGTGTCGC bc_diffs=0
TTAAGTTCAGCGGGTATCCCTACCTGATCCGAGGTCAACCGTGAGAAGTTGAGGTTATGGCAAGCATCCATAAGAACCCTATAGCGAGAATAATTACTACGCTTAGAGCCAGATGGCACCGCCACTGATTTTAGGGGCCGCTGAATAGCGAGCTCCAAGACCCCTTGCGGGATTGGTCAAAATAGACGCTCGAACAGGCATGCCCCTCGGAATACCAAGGGGCGCAATGTGCGTCCAAAGATTCGATGATTCACTGAATTCTGCAATTCACATTACTTATCGCATTTCGCAGCGTTCTTCATCGATGACGAGTCTAG
>S045_2 IGJRWKL02H5XHD orig_bc=ATCTGACGTCA new_bc=ATCTGACGTCA bc_diffs=0
CTAAGTTCAGCGGGTAGTCTTGTCTGATATCAGGTCCAATTGAGATACCACCGACAATCATTCGATCATCAACGATACAGAATTTCCCAAATAAATCTCTCTACGCAACTAAATGCAGCGTCTCCGTACATCGCGAAATACCCTACTAAACAACGATCCACAGCTCAAACCGACAACCTCCAGTACACCTCAAGGCACACAGGGGATAGG
The first line is the sequence ID, and the second line is the DNA sequence. S066 in the first part of the ID indicates that the sequence is from sample 066, and the _1 indicates that it's the first sequence in the file (not the first sequence from S066 per se). Because of the nuances of the DNA sequencing technology being used, I need to generate two files like this from the raw sequencing files, and the result is two such files, which I then merge together with cat. So far so good.
The next downstream step in my workflow does not allow identical sample names. Right now it gets halfway through, errors out, and closes because it encounters some identical sequence IDs. So it must be that, say, the 400th sequence in both files belongs to the same sample, generating identical sequence IDs (i.e. both files might have S066_400).
What I would like to do is use some code to insert a number (1000, 4971, whatever) immediately after the _ on every other line in the second file, starting with the first line. This way the IDs would no longer be confounded and I could proceed, so S066_2 would become something like S066_24971 or S066_49712. Part of the trouble is that the ID may vary in length, so it could begin as S066_ or as 49BBT1_.
Try:
awk '/^\>/ {$1=$1 "_13"} {print $0}' filename > tmp.tmp
mv tmp.tmp filename
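If you would rather do the same edit from R, a rough equivalent of the awk one-liner (assuming the merged file fits in memory; the file name is a placeholder) is:
lines <- readLines("merged.fasta")
is_id <- grepl("^>", lines)                              # ID lines start with ">"
lines[is_id] <- sub("^(\\S+)", "\\1_13", lines[is_id])   # append _13 to the ID token
writeLines(lines, "merged.fasta")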

Modify the dates in a huge file (around 1000 rows) using a script

I have a requirement in which I need to subtract x days from the dates present in a delimited file, excluding the first and last rows. If a date does not exist in the specified field, it should be left untouched.
For example, aaa.txt contains
header
abc|20160431|dhadjs|20160325|hjkkj|kllls
ddd||dhajded|20160320|dwdas|hfehf
footer
I want the modified file to have the dates shifted back by 10 days, something like below:
header
abc|20160421|dhadjs|20160315|hjkkj|kllls
ddd||dhajded|20160310|dwdas|hfehf
footer
I don't want to use a programming language like Java to read the file but rather use a scripting language on unix. Any suggestions on how this can be done?
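No answer is included above. The poster asked for a unix scripting approach (awk or a similar tool would be typical); purely as an illustration of the logic, here is a hedged sketch in R, which can also be run from the shell with Rscript. It assumes pipe-delimited fields, dates stored as YYYYMMDD, and that the first and last rows pass through untouched; file names are placeholders, and a calendar-invalid date such as 20160431 in the sample would be left unchanged rather than naively shifted.
lines <- readLines("aaa.txt")
body  <- seq(2, length(lines) - 1)            # skip the header and footer rows

shift_date <- function(field) {
  d <- as.Date(field, "%Y%m%d")
  if (grepl("^[0-9]{8}$", field) && !is.na(d)) {
    format(d - 10, "%Y%m%d")                  # real calendar subtraction of 10 days
  } else {
    field                                     # non-dates and invalid dates untouched
  }
}

# note: a trailing empty field (a line ending in "|") would be dropped by strsplit
lines[body] <- vapply(lines[body], function(line) {
  fields <- strsplit(line, "|", fixed = TRUE)[[1]]
  paste(vapply(fields, shift_date, ""), collapse = "|")
}, "")

writeLines(lines, "aaa_modified.txt")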
