Issues with importing R data due to formatting

I'm trying to import a txt file into R, but due to the file's unusual formatting, I'm unsure how to do this. The issue is that the file was formatted to line up columns under the column names, and since it's a text file, this was done with varying runs of spaces. For example:
Gene Chromosomal Swiss-Prot MIM Description
name position AC Entry name code
______________ _______________ ______________________ ______ ______________________
A3GALT2 1p35.1 U3KPV4 A3LT2_HUMAN Alpha-1,3-galactosyltransferase 2 (EC 2.4.1.87) (Isoglobotriaosylceramide synthase) (iGb3 synthase) (iGb3S) [A3GALT2P] [IGBS3S]
AADACL3 1p36.21 Q5VUY0 ADCL3_HUMAN Arylacetamide deacetylase-like 3 (EC 3.1.1.-)
AADACL4 1p36.21 Q5VUY2 ADCL4_HUMAN Arylacetamide deacetylase-like 4 (EC 3.1.1.-)
ABCA4 1p21-p22.1 P78363 ABCA4_HUMAN 601691 Retinal-specific phospholipid-transporting ATPase ABCA4 (EC 7.6.2.1) (ATP-binding cassette sub-family A member 4) (RIM ABC transporter) (RIM protein) (RmP) (Retinal-specific ATP-binding cassette transporter) (Stargardt disease protein) [ABCR]
ABCB10 1q42 Q9NRK6 ABCBA_HUMAN 605454 ATP-binding cassette sub-family B member 10, mitochondrial precursor (ATP-binding cassette transporter
Because of this, I have not been able to import my data at all. Since the text was justified with spaces, the number of spaces between fields isn't uniform.
This is the link to the data sheet that I am using: https://www.uniprot.org/docs/humchr01.txt

Each field has a fixed width, so you can use the function read.fwf to read the file.
The following code reads the input file (assuming it contains only the data rows, without the headers):
f <- read.fwf('input.txt', widths = c(14, 16, 11, 12, 7, 250), strip.white = TRUE)
colnames(f) <- c('Gene name', 'Chromosomal position', 'Swiss-Prot AC',
                 'Swiss-Prot Entry name', 'MIM code', 'Description')
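As a self-contained illustration of how read.fwf splits each line at fixed character positions, here is a toy two-line file; the widths (8, 10, 6) are invented for this snippet only, not the real UniProt file:

```r
# Toy fixed-width file: 8 chars for the gene name, 10 for the position,
# 6 for the accession. These widths are made up for this demo.
tmp <- tempfile(fileext = ".txt")
writeLines(c("A3GALT2 1p35.1    U3KPV4",
             "ABCA4   1p21-p22.1P78363"), tmp)

f <- read.fwf(tmp, widths = c(8, 10, 6), strip.white = TRUE)
colnames(f) <- c("Gene name", "Chromosomal position", "Swiss-Prot AC")
```

read.fwf also accepts a skip argument to drop leading lines, which is handy for the header block at the top of the real file.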

Related

Best way to import words from a text file into a data frame in R

I have a text file filled with words separated by spaces as seen below:
ACNES ACOCK ACOLD ACORN ACRED ACRES ACRID ACTED ACTIN ACTON ACTOR ACUTE ACYLS ADAGE ADAPT ADAWS ADAYS ADDAX ADDED ADDER ADDIO ADDLE ADEEM ADEPT ADHAN ADIEU ADIOS ADITS ADMAN ADMEN ADMIN ADMIT ADMIX ADOBE ADOBO ADOPT ADORE ADORN ADOWN ADOZE ADRAD ADRED ADSUM ADUKI ADULT ADUNC ADUST ADVEW ADYTA ADZED ADZES AECIA AEDES AEGIS AEONS AERIE AEROS AESIR AFALD AFARA AFARS AFEAR AFFIX AFIRE AFLAJ AFOOT AFORE AFOUL AFRIT AFROS AFTER AGAIN AGAMA AGAMI AGAPE AGARS AGAST AGATE AGAVE AGAZE AGENE AGENT AGERS AGGER AGGIE AGGRI AGGRO AGGRY AGHAS AGILA AGILE AGING AGIOS AGISM AGIST AGITA AGLEE AGLET AGLEY AGLOO AGLOW AGLUS AGMAS AGOGE AGONE AGONS AGONY AGOOD AGORA AGREE AGRIA AGRIN
What's the best way to import all these words into a 1 column data frame?
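No answer is attached above, but a minimal sketch, assuming the file is plain whitespace-separated words (an inline string stands in for the actual file here):

```r
# scan() splits on whitespace by default; for the real file you would call
# scan("words.txt", what = character()) instead of using text = ...
words <- scan(text = "ACNES ACOCK ACOLD ACORN ACRED",
              what = character(), quiet = TRUE)
df <- data.frame(word = words)  # one-column data frame, one word per row
```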

Error in R - more columns than column names

I am trying to read in a file that has 5 column headers; however, in column 5 I have a list of genes separated by commas.
EC2 <- read.table("david.txt", header=TRUE)
Whenever I run the code above, I get the message
"more columns than column names."
I feel like the answer is probably simple. Any idea?
These are the first 3 lines:
Category ID Term PValue Genes
BP GO: 0006412 translation 2.711930356491234E-10 P0A7U3, P0A7K6, P68191, P0A7Q1, P0A7U7, P02359, P02358, P60438, P0A7L0, P0A7L3, P0A7L8, P0A7T3, P0A8A8, P69441, P0A8N5, P0A8N3, P02413, P0A7T7, P0AG63, P0A7D1, P0AA10 , P0ADY3, P0AG67, P0A7M2, P0A898, P0A9W3, P0A7M6, P0A7X3, P0AAR3, P0A7S3, P0A7S9, P0ADY7, P62399, P60624, P32132, P0ADZ4, P60723, P0C0U4, P0AG51, P0ADZ0, P0A7N9, P0A7J3, P0A7W7, P0AG59, P68679, P0C018 , P0A7R1, P0A7N4, P0A7R5, P0A7R9, P0AG44, P68919, P61175, P0A6K3, P0A7V0, P0A7M9, P0A7K2, P0A7V3, P0AG48
BP GO: 0051301 cell division 1.4011247561051483E-7 P0AC30, P17952, P75949, P0A6H1, P06966, P0A9R7, P64612, P36548, P60472, P45955, P0A855, P06136, P0A850, P6246, P0246, P024 P22523, P08373, P11880, P0AFB1, P60293, P18196, P0ABG4, P07026, P0A749, P29131, P0A6S5, P26648, P17443, P0ADS2, P0A8P6, P0A8P8, P0A6, P0A6A7, P0A8P8, P0A6, P0A6A7, P0A6, P0A6A7 P46889, P0A6F9, P0AE60, P0AD68, P19934, P0ABU9, P37773
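No answer is shown here, but the usual cause of that error is read.table's default separator (any whitespace) splitting fields like "GO: 0006412" and the gene list into extra columns. If the file is actually tab-delimited (an assumption about david.txt's layout), naming the separator explicitly fixes it:

```r
# Simulate a tab-delimited david.txt; the real file's layout is an assumption.
tmp <- tempfile(fileext = ".txt")
writeLines(c("Category\tID\tTerm\tPValue\tGenes",
             "BP\tGO: 0006412\ttranslation\t2.7e-10\tP0A7U3, P0A7K6, P68191"),
           tmp)

# sep = "\t" keeps spaces and commas inside a field from creating new columns
EC2 <- read.table(tmp, header = TRUE, sep = "\t", quote = "",
                  stringsAsFactors = FALSE)
```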

Join multiline message from log file into a single row in R

How is it possible to join multiple lines of a log file into 1 dataframe row?
Example 4-line log file:
[WARN ][2016-12-16 13:43:10,138][ConfigManagerLoader] - [Low max memory=477102080. Java max memory=1000 MB is recommended for production use, as a minimum.]
[DEBUG][2016-05-26 10:10:22,185][DataSourceImpl] - [SELECT mr.lb_id,mr.lf_id,mr.mr_id FROM mr WHERE (( mr.cap_em >
0 AND mr.cap_em > 5
)) ORDER BY mr.lb_id, mr.lf_id, mr.mr_id]
[ERROR][2016-12-21 13:51:04,710][DWRWorkflowService] - [Update Wizard - : [DWR WFR request error:
workflow rule = BenCommonResources-getDataRecords
version = 2.0
filterValues = [{"fieldName": "wotable_hwohtable.status", "filterValue": "CLOSED"}, {"fieldName": "wotable_hwohtable.status_clearance", "filterValue": "Goods Delivered"}]
sortValues = [{"fieldName": "wotable_hwohtable.cost_actual", "sortOrder": -1}]
Result code = ruleFailed
Result message = Database error while processing request.
Result details = null
]]
[INFO ][2019-03-15 12:34:55,886][DefaultListableBeanFactory] - [Overriding bean definition for bean 'cpnreq': replacing [Generic bean: class [com.ar.moves.domain.bom.Cpnreq]; scope=prototype; abstract=false; lazyInit=false; autowireMode=0; dependencyCheck=0; autowireCandidate=true; primary=false; factoryBeanName=null; factoryMethodName=null; initMethodName=null; destroyMethodName=null; defined in URL [jar:file:/D:/Dev/404.jar!/com/ar/moves/moves-context.xml]] with [Generic bean: class [com.ar.bl.bom.domain.Cpnreq]; scope=prototype; abstract=false; lazyInit=false; autowireMode=0; dependencyCheck=0; autowireCandidate=true; primary=false; factoryBeanName=null; factoryMethodName=null; initMethodName=null; destroyMethodName=null; defined in URL [jar:file:/D:/Dev/Tools/Tomcatv8.5-appGit-master/404.jar!/com/ar/bl/bom/bl-bom-context.xml]]]
(See representative 8-line extract at https://pastebin.com/bsmWWCgw.)
The structure is clean:
[PRIOR][datetime][ClassName] - [Msg]
but the message is often multi-line, and it may itself contain brackets (even trailing ones) or ^M newlines, though not always. That makes it difficult to parse, and I don't know where to begin.
So, in order to process such a file, and be able to read it with something like:
#!/usr/bin/env Rscript
df <- read.table('D:/logfile.log')
we really need to have that merge of lines happening first. How is that doable?
The goal is to load the whole log file for making graphics, analysis (grepping out stuff), and eventually writing it back into a file, so -- if possible -- newlines should be kept in order to respect the original formatting.
The expected dataframe would look like:
PRIOR Datetime ClassName Msg
----- ------------------- ------------------- ----------
WARN 2016-12-16 13:43:10 ConfigManagerLoader Low max...
DEBUG 2016-05-26 10:10:22 DataSourceImpl SELECT ...
And, ideally once again, this should be doable in R directly (?), so that we can "process" a live log file (opened in write mode by the server app), "à la tail -f".
This is a pretty wicked Regex bomb. I'd recommend using the stringr package, but you could do all this with grep style functions.
library(stringr)
library(magrittr) # for the %>% pipe used below
str <- c(
'[WARN ][2016-12-16 13:43:10,138][ConfigManagerLoader] - [Low max memory=477102080. Java max memory=1000 MB is recommended for production use, as a minimum.]
[DEBUG][2016-05-26 10:10:22,185][DataSourceImpl] - [SELECT mr.lb_id,mr.lf_id,mr.mr_id FROM mr WHERE (( mr.cap_em >
0 AND mr.cap_em > 5
)) ORDER BY mr.lb_id, mr.lf_id, mr.mr_id]
[ERROR][2016-12-21 13:51:04,710][DWRWorkflowService] - [Update Wizard - : [DWR WFR request error:
workflow rule = BenCommonResources-getDataRecords
version = 2.0
filterValues = [{"fieldName": "wotable_hwohtable.status", "filterValue": "CLOSED"}, {"fieldName": "wotable_hwohtable.status_clearance", "filterValue": "Goods Delivered"}]
sortValues = [{"fieldName": "wotable_hwohtable.cost_actual", "sortOrder": -1}]
Result code = ruleFailed
Result message = Database error while processing request.
Result details = null
]]'
)
Using regex we can split each line by checking for the pattern you mentioned. This regex checks for a [, followed by any non-line-feed character, line feed, or carriage return, followed by a ] — but in a lazy (non-greedy) way, using *?. Repeat that 3 times, then check for a -. Finally, check for a [, followed by any characters or a group that includes information within square brackets, then a ]. That's a mouthful. Type it into a regex calculator. Just remember to remove the extra backslashes (a regex calculator uses \ where R uses \\).
# Split the text into each line without using \n or \r.
# pattern for each line is a lazy (non-greedy) [][][] - []
linesplit <- str %>%
# str_remove_all("\n") %>%
# str_extract_all('\\[(.|\\n|\\r)+\\]')
str_extract_all('\\[(.|\\n|\\r)*?\\]\\[(.|\\n|\\r)*?\\]\\[(.|\\n|\\r)*?\\] - \\[(.|\\n|\\r|(\\[(.|\\n|\\r)*?\\]))*?\\]') %>%
unlist()
linesplit # Run this to view what happened
Now that we have each line separated, break them into columns. But we don't want to keep the [ or ], so we use a positive lookbehind and a positive lookahead in the regex to check whether they are there without capturing them. Oh, and capture everything between them, of course.
# Split each line into columns
colsplit <- linesplit %>%
str_extract_all("(?<=\\[)(.|\\n|\\r)*?(?=\\])")
colsplit # Run this to view what happened
Now we have a list with an object for each line. In each object are 4 items for each column. We need to convert those 4 items to a dataframe and then join those dataframes together.
# Convert each line to a dataframe, then join the dataframes together
df <- lapply(colsplit,
function(x){
data.frame(
PRIOR = x[1],
Datetime = x[2],
ClassName = x[3],
Msg = x[4],
stringsAsFactors = FALSE
)
}
) %>%
do.call(rbind,.)
df
# PRIOR Datetime ClassName Msg
# 1 WARN 2016-12-16 13:43:10,138 ConfigManagerLoader Low max memory=
# 2 DEBUG 2016-05-26 10:10:22,185 DataSourceImpl SELECT mr.lb_id
# 3 ERROR 2016-12-21 13:51:04,710 DWRWorkflowService Update Wizard -
# Note: there are extra spaces that probably should be trimmed,
# and the dates are slightly messed up. I'll leave those for the
# questioner to fix using a mutate and the string functions.
I will leave it to you to fix the extra spaces and the date field.
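For completeness, a sketch of that cleanup. The only subtle part is that as.POSIXct's %OS format expects a decimal point, not a comma, before the milliseconds; the one-row data frame below stands in for the df built above:

```r
# Stand-in for one row of the parsed data frame.
df <- data.frame(PRIOR = "WARN ",
                 Datetime = "2016-12-16 13:43:10,138",
                 stringsAsFactors = FALSE)

df$PRIOR <- trimws(df$PRIOR)  # drop the padding spaces
df$Datetime <- as.POSIXct(sub(",", ".", df$Datetime, fixed = TRUE),
                          format = "%Y-%m-%d %H:%M:%OS")  # comma -> dot, then parse
```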

Attach foldername to first column of file

I have a list of files that have identical filenames but live in different subfolders. The values in the files are separated by tabs.
I would like to prepend to each "test.txt" an additional first column containing the folder name, and then merge everything into one file at the end (they all have the same column header).
The most important part, though, is the merging.
I have tried many commands now that did not work, so I guess I am missing an essential step with awk...
Current structure is:
mainfolder
|_>Folder1
|  |_>test.txt
|_>Folder2
|  |_>test.txt
.
.
.
This is what I would like to get per file, before merging all of them:
#Name Count FragCount Type Left LeftB Right RightB Support FRPM LeftBD LeftBE RightBD RightBE annots
RFP1A 13 10 REF RFP1A_ins chr3:3124352:+ RFP1A_ins chr3:5234143:+ confirmed 0.86 TA 1.454 AC 1.564 ["INTRACHROM."]
#Samplename #Name Count FragCount Type Left LeftB Right RightB Support FRPM LeftBD LeftBE RightBD RightBE annots
Sample1 RFP1A 13 10 REF RFP1A_ins chr3:3124352:+ RFP1A_ins chr3:5234143:+ confirmed 0.86 TA 1.454 AC 1.564 ["INTRACHROM."]
Thanks so much!!
D
I believe this might do the trick:
$ cd mainfolder
$ awk '(NR==1){sub("#","#Samplename\t"); print} # print header
(FNR==1){next} # skip header
{print substr(FILENAME,1,match(FILENAME,"/")-1)"\t"$0 } # add directory
' */test.txt > /path/to/newfile.txt
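If you would rather stay in R than awk, here is a sketch of the same idea; the Folder1/Folder2/test.txt layout follows the question, but the file contents below are made up:

```r
# Build a toy mainfolder/FolderN/test.txt layout to run against.
root <- file.path(tempdir(), "mainfolder")
counts <- c(Folder1 = 13, Folder2 = 10)
for (d in names(counts)) {
  dir.create(file.path(root, d), recursive = TRUE, showWarnings = FALSE)
  writeLines(c("#Name\tCount", paste0("RFP1A\t", counts[[d]])),
             file.path(root, d, "test.txt"))
}

files <- list.files(root, pattern = "^test\\.txt$", recursive = TRUE,
                    full.names = TRUE)
merged <- do.call(rbind, lapply(files, function(f) {
  # comment.char = "" keeps the leading # in the header from being dropped
  x <- read.table(f, header = TRUE, sep = "\t", comment.char = "")
  cbind(Samplename = basename(dirname(f)), x)  # folder name as first column
}))
```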

Doing a complex join on files

I have files (~1k) that look (basically) like this:
NAME1.txt
NAME ATTR VALUE
NAME1 x 1
NAME1 y 2
...
NAME2.txt
NAME ATTR VALUE
NAME2 x 19
NAME2 y 23
...
Where the ATTR column is the same in every file and the NAME column is just some version of the filename. I would like to combine them into one file that looks like:
All_data.txt
ATTR NAME1_VALUE NAME2_VALUE NAME3_VALUE ...
x 1 19 ...
y 2 23 ...
...
Is there simple way to do this with just command line utilities or will I have to resort to writing some script?
Thanks
You need to write a script.
gawk is the obvious candidate.
You could build an associative array as you read each file, using FILENAME as the key and
ATTR " " VALUE
as the value, then create your output in an END block.
gawk can process all the txt files together if you pass *.txt as the filename argument.
It's a bit optimistic to expect a ready-made command to do exactly what you want: very few commands join data horizontally.
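Since the rest of this page is R-flavoured, here is a sketch of the same horizontal join done with Reduce(merge, ...); the file names and columns follow the question, and the values come from its example:

```r
# Recreate NAME1.txt and NAME2.txt from the question in a temp directory.
dir <- file.path(tempdir(), "join_demo")
dir.create(dir, showWarnings = FALSE)
writeLines(c("NAME ATTR VALUE", "NAME1 x 1", "NAME1 y 2"),
           file.path(dir, "NAME1.txt"))
writeLines(c("NAME ATTR VALUE", "NAME2 x 19", "NAME2 y 23"),
           file.path(dir, "NAME2.txt"))

files <- list.files(dir, pattern = "^NAME[0-9]+\\.txt$", full.names = TRUE)
tables <- lapply(files, function(f) {
  x <- read.table(f, header = TRUE)
  # Keep ATTR plus VALUE, renaming VALUE after this file's NAME.
  setNames(x[c("ATTR", "VALUE")], c("ATTR", paste0(x$NAME[1], "_VALUE")))
})
# Join all per-file tables on ATTR, one VALUE column per file.
all_data <- Reduce(function(a, b) merge(a, b, by = "ATTR"), tables)
```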
