R custom parser function

I have data in a txt file in this form:
col1 col2 col3 col4 col5
44 PT-222 My name is John 829302 24.02.14 01.53.51.000000 AM
11 PT-111 This is not user 8292829 24.02.14 01.40.47.000000 AM
I want to stress that these columns are not tab separated. They are separated only by one or more spaces. Also, col3 and col5 contain data made up of space-separated words.
Actually rows are fixed length. To make it clear:
44 PT-222 My name 829302 24.02.14 01.53.51.000000 AM
1 PT-1 This is not user and John 829 24.02.14 01.40.47.000000 AM
How can I read that txt file into a table?
Is there a custom separator function that reads one line at a time, so that I can override it?

If the fields are fixed width you can use read.fwf. Otherwise, we can use read.pattern in the gsubfn package. (Below we can replace text = Lines with something like "myfile.dat".) First we read in the column names cn separately since they are not in the same format as the data. Then we skip over the header line, since the data begins on the line after it, and read in the data using an appropriate pattern, pat:
Lines <- "col1 col2 col3 col4 col5
44 PT-222 My name is John 829302 24.02.14 01.53.51.000000 AM
11 PT-111 This is not user 8292829 24.02.14 01.40.47.000000 AM"
library(gsubfn)
cn <- read.table(text = Lines, nrows = 1, as.is = TRUE)
pat <- "^ *(\\S+) +(\\S+) +(.*\\S) +(\\S+) +(\\S+ \\S+ \\S+) *$"
DF <- read.pattern(text = Lines, pattern = pat, skip = 1,
col.names = cn, as.is = TRUE)
giving:
> DF
col1 col2 col3 col4 col5
1 44 PT-222 My name is John 829302 24.02.14 01.53.51.000000 AM
2 11 PT-111 This is not user 8292829 24.02.14 01.40.47.000000 AM
Note that the pattern we used assumes that no fields are empty. Any rows that do not match the pattern are silently dropped. The skip = 1 is optional, since the header line would be ignored in any case because it does not match the pattern.
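For readers who would rather avoid an extra package, the same idea can be sketched in base R with regexec and regmatches. This is only an illustration under stated assumptions, not the answer's method: the names lines, m and fields are made up here, the sample is the one shown above with a single header line, and the integer conversion of col1 is an optional extra.

```r
# Sample input: one header line followed by two data lines.
lines <- c(
  "col1 col2 col3 col4 col5",
  "44 PT-222 My name is John 829302 24.02.14 01.53.51.000000 AM",
  "11 PT-111 This is not user 8292829 24.02.14 01.40.47.000000 AM"
)

# Same pattern as in the answer: five capture groups, one per column.
pat <- "^ *(\\S+) +(\\S+) +(.*\\S) +(\\S+) +(\\S+ \\S+ \\S+) *$"

# regmatches() returns character(0) for lines that do not match.
m <- regmatches(lines, regexec(pat, lines, perl = TRUE))
m <- m[lengths(m) > 0]  # drop non-matching lines (here, the header)

# Element 1 of each match is the whole line; the rest are the groups.
fields <- do.call(rbind, lapply(m, function(x) x[-1]))
DF <- as.data.frame(fields, stringsAsFactors = FALSE)
names(DF) <- strsplit(lines[1], " +")[[1]]
DF$col1 <- as.integer(DF$col1)  # optional type conversion
```

As with read.pattern, a line that does not match is dropped without warning, so it may be worth comparing length(m) with the number of data lines expected.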

Related

Specify unique newline character (for upload to R)

I have a pipe delimited file with several embedded '\n' characters per row, but a unique pattern that I would like to substitute as a '\n' prior to importing into R.
For example, a sample text document might look like:
COL1|COL2|COL3|COL4
ID1|num1|num2|text\n text\n text[uniquepattern]\n
ID2|num3|num4|text2\n tex2\n text[uniquepattern]\n
I would ideally like the above to be loaded into R as a dataframe with two rows as follows:
COL1  COL2  COL3  COL4
ID1   num1  num2  text text text
ID2   num3  num4  text2 text2 text2
Without specifying that [uniquepattern] should be treated as a newline, R will upload this row as several rows. My initial solution was to use shell scripting to process the file beforehand. Something like:
tr '\n' ' ' < original_file.txt > temp_file.txt
tr '[uniquepattern]' '\n' < temp_file.txt > final_file.txt
However this doesn't seem to work. Many thanks for any suggestions!

I think this is what you want:
library(tidyverse)
file1 <- read_lines("COL1|COL2|COL3|COL4
ID1|num1|num2|text\n text\n text[uniquepattern]\n
ID2|num3|num4|text2\n tex2\n text[uniquepattern]\n")
unsplit_df <- paste(file1[2:length(file1)], collapse = "") %>%
str_split("\\[uniquepattern\\]") %>%
unlist() %>%
as_tibble_col(file1[1]) %>%
filter(str_detect(.[[1]], "[:alnum:]"))
separate(unsplit_df, col = 1, into = unlist(str_split(colnames(unsplit_df), "\\|")), sep = "\\|")
# # A tibble: 2 × 4
# COL1 COL2 COL3 COL4
# <chr> <chr> <chr> <chr>
# 1 ID1 num1 num2 text text text
# 2 ID2 num3 num4 text2 tex2 text
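One reason the shell attempt in the question falls short is that tr maps single characters to single characters, so it cannot replace the multi-character string [uniquepattern]. The same join-then-split idea can also be sketched in base R. This is a rough illustration that assumes the embedded \n are real line breaks and that [uniquepattern] never occurs inside a field; the vector raw stands in for the result of readLines() on the actual file.

```r
# Stand-in for readLines("original_file.txt"): records are broken
# across several physical lines and end with the marker string.
raw <- c(
  "COL1|COL2|COL3|COL4",
  "ID1|num1|num2|text",
  " text",
  " text[uniquepattern]",
  "ID2|num3|num4|text2",
  " tex2",
  " text[uniquepattern]"
)

header <- raw[1]
body <- paste(raw[-1], collapse = "")  # undo the embedded line breaks

# Split on the literal marker (fixed = TRUE avoids regex escaping).
records <- strsplit(body, "[uniquepattern]", fixed = TRUE)[[1]]
records <- records[nzchar(trimws(records))]  # drop empty leftovers

df <- read.table(text = c(header, records), sep = "|", header = TRUE,
                 quote = "", comment.char = "", stringsAsFactors = FALSE)
```

Setting quote = "" and comment.char = "" keeps read.table from treating apostrophes or # inside the free-text column as special characters.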

reading comma-separated strings with read.csv()

I am trying to load a comma-delimited data file that also has commas in one of its text columns. The following sample code generates such a file, 'test.csv', which I'll load using read.csv() to illustrate my problem.
> d <- data.frame(name = c("John Smith", "Smith, John"), age = c(34, 34))
> d
name age
1 John Smith 34
2 Smith, John 34
> write.csv(d, file = "test.csv", quote = F, row.names = F)
> d2 <- read.csv("test.csv")
> d2
name age
John Smith 34 NA
Smith John 34
Because of the ',' in Smith, John, d2 is not assigned correctly. How do I read the file so that d2 looks exactly like d?
Thanks.

1) read.pattern. read.pattern (in the gsubfn package) can read such files:
library(gsubfn)
pat <- "(.*),(.*)"
read.pattern("test.csv", pattern = pat, header = TRUE, as.is = TRUE)
giving:
name age
1 John Smith 34
2 Smith, John 34
2) Two pass. Another possibility is to read it in, fix it up and then re-read it. This uses no packages and gives the same output.
L <- readLines("test.csv")
read.table(text = sub("(.*),", "\\1|", L), header = TRUE, sep = "|", as.is = TRUE)
Note: For 3 fields, where the text field comes first and is followed by two comma-free fields, use this in (1):
pat <- "(.*),([^,]+),([^,]+)"
For the same situation, use this in (2), assuming that there are non-spaces adjacent to each of the last two commas, that there is at least one space adjacent to any comma in the text field, and that fields have at least 2 characters:
text = gsub("(\\S),(\\S)", "\\1|\\2", L)
If you have some other arrangement just modify the regular expression in (1) appropriately and the sub or gsub in (2).
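To see why the two-pass approach in (2) touches only the last comma, it can help to look at the intermediate result. The sketch below works on an in-memory vector L that stands in for readLines("test.csv"); the name fixed is made up for illustration.

```r
# Stand-in for readLines("test.csv").
L <- c("name,age", "John Smith,34", "Smith, John,34")

# "(.*)," is greedy, so the comma it consumes is the last one on
# each line; earlier commas stay inside the captured text.
fixed <- sub("(.*),", "\\1|", L)
print(fixed)

# Re-read using the new, unambiguous separator.
d2 <- read.table(text = fixed, header = TRUE, sep = "|", as.is = TRUE)
```

This relies on the comma-containing field not being the last one on the line, and on "|" never appearing in the data.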

Splitting Columns that contain decimal point values in R

I am facing difficulties splitting columns in R. For instance
Col1.Col2.Col3
12.3,10,11
11.3,11,50
85,89.3,90
and over 100x records
I did
tidyr::separate(df, Col1.Col2.Col3,
c("Col1", "Col2", "Col3" ))
And I get
Col1 Col2 Col3
12 3 10
11 3 11
85 89 3
and over 100x records
I realised that the decimal part is moved to the next column and the values of Col3 are left out. How can I fix this, or is there a better way of splitting the columns?

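tidyr::separate has a sep argument that controls where the splits occur. By default it splits on any run of non-alphanumeric characters, so the decimal point is treated as a separator as well as the comma. Use sep = ",".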
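The effect of the separator can be reproduced without tidyr. The following base R sketch is an illustration rather than the answer's code: it splits one value with the pattern "[^[:alnum:]]+" to show where the stray pieces come from, then splits on the comma only. The names x, default_split, parts and out are assumptions.

```r
x <- c("12.3,10,11", "11.3,11,50", "85,89.3,90")

# Splitting on any non-alphanumeric run breaks "12.3" in two,
# which is what pushes the decimal part into the next column.
default_split <- strsplit(x[1], "[^[:alnum:]]+")[[1]]

# Splitting on the literal comma keeps the decimals intact.
parts <- do.call(rbind, strsplit(x, ",", fixed = TRUE))
out <- as.data.frame(parts, stringsAsFactors = FALSE)
names(out) <- c("Col1", "Col2", "Col3")
out[] <- lapply(out, as.numeric)  # convert every column to numeric
```

If the values are already in a file, reading it with read.csv and supplying the column names is likely to be simpler than splitting after the fact.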

split string in a column of a dataframe and return new column with split

I have a dataframe called dat which has two columns as below
col1 col2
chr2 atagaaaaatcggctgggtgcggtggctcactcctataatcccagcactttg
chr3 atagaaaaatcggctgggtgcggtggctcactcctataatcccagcactttg
I want to be able to split the string at a match for gtggctc and return a new column containing the match plus a specified number of following characters (e.g. 10 further characters), as follows:
col1 col2 new_split_col
chr2 atagaaaaatcggctgggtgcg gtggctcactcctataa
chr3 atagaaaaatcggctgggtgcg gtggctcactcctataa
I have tried
library(stringr)
dat$new_split_col <- str_split(dat$col2, "gtggctc", 2)
but it gives me two matches in one column and doesn't include the match itself. It also doesn't allow me to specify the length of the desired match.

Try
library(stringr)
# regex() replaces the older perl() helper, which current stringr no longer provides
dat[c('col2', 'new_split_col')] <- do.call(rbind,lapply(str_split(dat$col2,
regex('(?=gtggctc)'), 2), function(x) c(x[1],substr(x[2],1,17))))
Or
library(tidyr)
extract(dat, col2, into=c('col2', 'new_split_col'), '(.*)(gtggctc.{10}).*')
# col1 col2 new_split_col
#1 chr2 atagaaaaatcggctgggtgcg gtggctcactcctataa
#2 chr3 atagaaaaatcggctgggtgcg gtggctcactcctataa
Or
dat[c('col2', 'new_split_col')] <- read.table(text=gsub('(.*)(gtggctc.{10}).*',
'\\1 \\2', dat$col2))
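A package-free variant that makes the motif and the number of extra characters explicit parameters might look like the sketch below. It is an assumption-laden illustration: the names motif, extra and pos are made up, it uses the first occurrence of the motif (the greedy (.*) in the patterns above picks the last one), and rows without a match get NA.

```r
dat <- data.frame(
  col1 = c("chr2", "chr3"),
  col2 = rep("atagaaaaatcggctgggtgcggtggctcactcctataatcccagcactttg", 2),
  stringsAsFactors = FALSE
)

motif <- "gtggctc"  # string to split at
extra <- 10         # characters to keep after the motif

# Position of the first occurrence of the motif, or -1 if absent.
pos <- as.integer(regexpr(motif, dat$col2, fixed = TRUE))

# Motif plus the following `extra` characters; NA where no match.
dat$new_split_col <- ifelse(
  pos > 0,
  substr(dat$col2, pos, pos + nchar(motif) + extra - 1),
  NA
)

# Everything before the motif; unmatched rows are left unchanged.
dat$col2 <- ifelse(pos > 0, substr(dat$col2, 1, pos - 1), dat$col2)
```

The new column is computed before col2 is overwritten, because both depend on the original string.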

Read fixed-width format, where the widths are inferred from the column headers

I have a rather odd file format that I need to read. It has space-separated columns, but the column widths must be inferred from the header.
In addition, there are some bogus lines that must be ignored, both blank and non-blank.
A representation of the data:
The first line contains some text that is not important, and should be ignored.
The second line also. In addition, the third and fifth lines are blank.
col1 col2 col3 col4 col5
ab cd e 132399.4 101 0 17:25:24 Ignore anything past the last named column
blah 773411 25 10 17:25:25 Ignore this too
Here, the first column, col1, contains the text from the beginning of the line until the character position of the end of the text string col1. The second column, col2 contains the text from the next character following the 1 in col1 until the end of the text string col2. And so on.
In reality, there are 17 columns rather than 5, but that should not change the code.
I'm looking for a data frame with the contents:
col1 col2 col3 col4 col5
1 ab cd e 132399.4 101 0 17:25:24
2 blah 773411.0 25 10 17:25:25
Here is a rather inelegant approach:
read.tt <- function(file) {
con <- base::file(file, 'r')
readLines(con, n=3);
header <- readLines(con, n=1)
close(con)
endpoints <- c(0L, gregexpr('[^ ]( |$)', header)[[1]])
widths <- diff(endpoints)
names <- sapply(seq_along(widths),
function(i) substr(header, endpoints[i]+1, endpoints[i]+widths[i]))
names <- sub('^ *', '', names)
body <- read.fwf(file, widths, skip=5)
names(body) <- names
body
}
There must be a better way.
The lines to be ignored is a minor piece of this puzzle. I'll accept a solution that works with these already removed from the file (but of course would prefer one that does not need preprocessing).

If you know your header line, you can get the widths using the following method.
x
## [1] " col1 col2 col3 col4 col5"
nchar(unlist(regmatches(x, gregexpr("\\s+\\S+", x))))
## [1] 13 9 5 5 10
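Putting the pieces together, the widths obtained this way can be passed straight to read.fwf. The sketch below is a hedged illustration: the sample lines are made up with right-aligned columns (the spacing of the original file is not recoverable from the question as shown), the junk and blank lines are assumed to be removed already, and the names lines, tokens and body are hypothetical.

```r
# Hypothetical input: a right-aligned header and two data lines,
# each with trailing text beyond the last named column.
lines <- c(
  "    col1     col2 col3 col4     col5",
  " ab cd e 132399.4  101    0 17:25:24 ignored trailing text",
  "    blah   773411   25   10 17:25:25 ignored too"
)

header <- lines[1]

# Each token is a run of spaces followed by a column name, so its
# length is the width of that column.
tokens <- regmatches(header, gregexpr("\\s+\\S+", header))[[1]]
widths <- nchar(tokens)
col_names <- trimws(tokens)

# read.fwf cuts each line at the given widths; text beyond the
# last width is not read.
body <- read.fwf(textConnection(lines[-1]), widths = widths,
                 col.names = col_names, strip.white = TRUE,
                 stringsAsFactors = FALSE)
```

Note that the pattern "\\s+\\S+" needs at least one space before the first column name; for a header that starts in column one, "\\s*\\S+" would be the variant to try.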
