Specify unique newline character (for upload to R)


I have a pipe delimited file with several embedded '\n' characters per row, but a unique pattern that I would like to substitute as a '\n' prior to importing into R.
For example, a sample text document might look like:
COL1|COL2|COL3|COL4
ID1|num1|num2|text\n text\n text[uniquepattern]\n
ID2|num3|num4|text2\n tex2\n text[uniquepattern]\n
I would ideally like the above to be loaded into R as a dataframe with two rows as follows:
COL1  COL2  COL3  COL4
ID1   num1  num2  text text text
ID2   num3  num4  text2 text2 text2
Unless [uniquepattern] is treated as the record terminator, R reads each record as several rows. My initial solution was to use shell scripting to process the file beforehand. Something like:
tr '\n' ' ' < original_file.txt > temp_file.txt
tr '[uniquepattern]' '\n' < temp_file.txt > final_file.txt
However this doesn't seem to work. Many thanks for any suggestions!

I think this is what you want:
library(tidyverse)
file1 <- read_lines("COL1|COL2|COL3|COL4
ID1|num1|num2|text\n text\n text[uniquepattern]\n
ID2|num3|num4|text2\n tex2\n text[uniquepattern]\n")
unsplit_df <- paste(file1[2:length(file1)], collapse = "") %>%
str_split("\\[uniquepattern\\]") %>%
unlist() %>%
as_tibble_col(file1[1]) %>%
filter(str_detect(.[[1]], "[:alnum:]"))
separate(unsplit_df, col = 1, into = unlist(str_split(colnames(unsplit_df), "\\|")), sep = "\\|")
# # A tibble: 2 × 4
#   COL1  COL2  COL3  COL4
#   <chr> <chr> <chr> <chr>
# 1 ID1   num1  num2  text text text
# 2 ID2   num3  num4  text2 tex2 text
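As for why the shell attempt fails: tr maps single characters to single characters, so tr '[uniquepattern]' '\n' treats its first argument as a set of characters rather than as one string. The same idea as the answer above can also be sketched in base R, without the tidyverse. The helper name parse_records and its arguments are hypothetical, and the sketch assumes that the header line ends with an ordinary newline while every data record ends with the literal text [uniquepattern].

```r
# Hypothetical helper (not from any package): parse text whose records end
# with a custom terminator and whose fields contain stray newlines.
parse_records <- function(txt, eol = "[uniquepattern]", sep = "|") {
  # Assumption: the header is everything before the first real newline.
  nl     <- as.integer(regexpr("\n", txt, fixed = TRUE))
  header <- substr(txt, 1, nl - 1)
  body   <- substr(txt, nl + 1, nchar(txt))

  # Drop the embedded newlines, then split on the literal terminator.
  body    <- gsub("\n", "", body, fixed = TRUE)
  records <- strsplit(body, eol, fixed = TRUE)[[1]]
  records <- records[nzchar(trimws(records))]

  # Header plus one element per record, read as pipe-delimited lines.
  read.table(text = c(header, records), sep = sep, header = TRUE,
             quote = "", comment.char = "", stringsAsFactors = FALSE)
}

sample_txt <- paste0(
  "COL1|COL2|COL3|COL4\n",
  "ID1|num1|num2|text\n text\n text[uniquepattern]\n",
  "ID2|num3|num4|text2\n tex2\n text[uniquepattern]\n"
)
df <- parse_records(sample_txt)
```

For a real file, the input string could be built with paste(readLines("original_file.txt"), collapse = "\n"). If the embedded newlines are not followed by a space, replacing them with " " instead of "" keeps the words apart.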

Related

Split values from single cell into new rows [duplicate]

Need some help with my R code please folks!
My table has two columns:
- a list of codes, with numerous codes in the same "cell" separated by commas
- a description that applies to all of the codes in the same row
I want to split the values in the first column so that there is only 1 code per row and the corresponding description is repeated for every relevant code.
I really don't know where to start sorry, I actually don't really know what to search for!

You can use separate_rows from tidyr:
library(tidyr)
separate_rows(df, numbers, convert = TRUE)
Or in base R, we can use strsplit:
s <- strsplit(df$numbers, split = ",")
output <- data.frame(numbers = unlist(s), descriptions = rep(df$descriptions, sapply(s, length)))
Output
  numbers descriptions
    <int> <chr>
1       1 This is a description for ID1
2       2 This is a description for ID2
3       3 This is a description for ID2
4       4 This is a description for ID2
5       5 This is a description for ID3
6       6 This is a description for ID3
Data
df <- tibble(
numbers = c("1", "2,3,4", "5,6"),
descriptions = c("This is a description for ID1", "This is a description for ID2", "This is a description for ID3")
)
#   numbers descriptions
#   <chr>   <chr>
# 1 1       This is a description for ID1
# 2 2,3,4   This is a description for ID2
# 3 5,6     This is a description for ID3

Split a string by two delimiters only in the first occurrence

I have read many examples here and on other forums and tried things myself, but I still can't do what I want:
I have a string like this:
myString <- c("ENSG00000185561.10|TLCD2", "ENSG00000124785.9|NRN1", "ENSG00000287339.1|RP11-575F12.4")
And I want to split it into columns at the first dot and at the vertical bar (|), so it looks like this:
data.frame(c("ENSG00000185561", "ENSG00000124785", "ENSG00000287339"), c("TLCD2","NRN1","RP11-575F12.4")) %>% set_colnames(c("col1","col2"))
The biggest problem here is the dot that is sometimes present to the right of the bar (e.g. third row), which I don't want to split on.
Among others, what I tried was:
data.frame(do.call(rbind, strsplit(myString,"(\\.)|(\\|)")))
but this also creates a fourth column when it splits after the second dot.
I tried to tell it to only split once for the dot:
data.frame(do.call(rbind, strsplit(myString,"(\\.{1})|(\\|)")))
but same result.
Then tried to tell it that the dot could not be preceded by a slash:
data.frame(do.call(rbind, strsplit(myString,"([^\\|]\\.)|(\\|)")))
data.frame(do.call(rbind, strsplit(myString,"([[:alnum:]][^\\|]\\.)|(\\|)")))
but in both cases it splits by both dots.
I tried various combinations with reshape2::colsplit as well, similar results; either it splits in both dots, or it splits on the first dot but not on the slash:
reshape2::colsplit(myString, "([^\\|]\\.)|(\\|)", c("col1", "col2"))
Does anyone have an idea on how to solve this?
It is totally ok if it creates 3 columns instead of 2, I can then select the ones of interest.
E.g.
data.frame(c("ENSG00000185561", "ENSG00000124785", "ENSG00000287339"), c("10","9","1"), c("TLCD2","NRN1","RP11-575F12.4")) %>% set_colnames(c("col1","col2", "col3"))

library(stringr)
# myString is the character vector from the question; the class [.|] matches a dot or a bar
str_split_fixed(myString, "[.|]", 3)
output:
     [,1]              [,2] [,3]
[1,] "ENSG00000185561" "10" "TLCD2"
[2,] "ENSG00000124785" "9"  "NRN1"
[3,] "ENSG00000287339" "1"  "RP11-575F12.4"

This should work. The secret sauce is the option extra = "merge", which means that any extra separated parts get added back onto the last column.
library(tidyr)
tibble(string = c(
"ENSG00000185561.10|TLCD2",
"ENSG00000124785.9|NRN1",
"ENSG00000287339.1|RP11-575F12.4"
)) %>%
separate(
string, into = c("c1", "c2", "c3"), sep = "[.]|[|]", extra = "merge"
)
#> # A tibble: 3 x 3
#> c1 c2 c3
#> <chr> <chr> <chr>
#> 1 ENSG00000185561 10 TLCD2
#> 2 ENSG00000124785 9 NRN1
#> 3 ENSG00000287339 1 RP11-575F12.4
Created on 2021-10-21 by the reprex package (v2.0.0)
NB, reshape2 is superseded by tidyr. You should make the switch ASAP!

I would suggest using matching instead of splitting (i.e. write a regex that specifies the parts that should be matched, rather than the splitter):
library(tidyr)
df = tibble(ID = myString)
df %>% extract(ID, into = c('ID', 'Name'), '([^.]+).*\\|(.+)')
# A tibble: 3 × 2
ID Name
<chr> <chr>
1 ENSG00000185561 TLCD2
2 ENSG00000124785 NRN1
3 ENSG00000287339 RP11-575F12.4
Just like the other answer, this is using ‘tidyr’ (which supersedes ‘reshape2’).
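The same "match the parts instead of splitting" idea can also be sketched in base R with strcapture() from the utils package (available since R 3.4.0). This is only an illustration: the pattern, the column names and the column types in proto are assumptions chosen to reproduce the three-column layout from the question.

```r
myString <- c("ENSG00000185561.10|TLCD2",
              "ENSG00000124785.9|NRN1",
              "ENSG00000287339.1|RP11-575F12.4")

# One capture group per output column:
#   ([^.]+)  everything before the first dot
#   (\\d+)   the digits between that dot and the bar
#   (.+)     everything after the bar, later dots included
pattern <- "^([^.]+)\\.(\\d+)\\|(.+)$"

# proto supplies the (illustrative) column names and types of the result.
res <- strcapture(pattern, myString,
                  proto = data.frame(col1 = character(),
                                     col2 = integer(),
                                     col3 = character(),
                                     stringsAsFactors = FALSE),
                  perl = TRUE)
```

Because the pattern is anchored at both ends, a string that does not have the expected shape yields a row of NA values rather than a partial split.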

This could also help in base R:
as.data.frame(do.call(rbind, strsplit(myString, "\\.\\d+.+?", perl = TRUE)))
V1 V2
1 ENSG00000185561 TLCD2
2 ENSG00000124785 NRN1
3 ENSG00000287339 RP11-575F12.4

You can use str_extract and lookahead (?=\\|) and, respectively, lookbehind (?<=\\|) to assert the | as demarcation point:
library(stringr)
df <- data.frame(
col1 = str_extract(myString, ".*?(?=\\|)"),
col2 = str_extract(myString, "(?<=\\|).*$")
)
df
col1 col2
1 ENSG00000185561.10 TLCD2
2 ENSG00000124785.9 NRN1
3 ENSG00000287339.1 RP11-575F12.4
EDIT:
If you want three columns:
df <- data.frame(
col1 = str_extract(myString, ".*?(?=\\.)"),
col2 = str_extract(myString, "(?<=\\.)\\d+(?=\\|)"),
col3 = str_extract(myString, "(?<=\\|).*$")
)
df
col1 col2 col3
1 ENSG00000185561 10 TLCD2
2 ENSG00000124785 9 NRN1
3 ENSG00000287339 1 RP11-575F12.4

It seems to me that you are trying to cram two operations into a single command. First split at | and create two columns, then remove the dot suffix from the first column. I think this is simpler and there is no need for external packages either:
myString <- c("ENSG00000185561.10|TLCD2", "ENSG00000124785.9|NRN1", "ENSG00000287339.1|RP11-575F12.4")
df <- do.call(rbind, strsplit(myString, '\\|'))
df[,1] <- sub('\\..*', '', df[,1])
df
[,1] [,2]
[1,] "ENSG00000185561" "TLCD2"
[2,] "ENSG00000124785" "NRN1"
[3,] "ENSG00000287339" "RP11-575F12.4"
or am I missing something...?

How to return rows of a df that contain strings from a character list

I have a character list. I would like to return rows in a df that contain any of the strings in the list in a given column.
I have tried things like:
hits <- df %>%
filter(column, any(strings))
strings <- c("ape", "bat", "cat")
head(df$column)
[1] "ape and some other text here"
[2] "just some random text"
[3] "Something about cats"
I would like only rows 1 and 3 returned
Thanks in advance for the help.

Use grepl() with a regular expression matching any of the strings in your strings vector:
strings <- c("ape", "bat", "cat")
Firstly, you can collapse the strings vector to the regex you need:
regex <- paste(strings, collapse = "|")
Which gives:
> regex <- paste(strings, collapse = "|")
> regex
[1] "ape|bat|cat"
The pipe symbol | acts as an or operator, so this regex ape|bat|cat will match ape or bat or cat.
If your data.frame df looks like this:
> df
# A tibble: 3 x 1
column
<chr>
1 ape and some other text here
2 just some random text
3 something about cats
Then you can run the following line of code to return just the rows matching your desired strings:
df[grepl(regex, df$column), ]
The output is as follows:
> df[grepl(regex, df$column), ]
# A tibble: 2 x 1
column
<chr>
1 ape and some other text here
2 something about cats
Note that the above example is case-sensitive: it will only match the lower case strings exactly as specified. You can overcome this easily using the ignore.case parameter of grepl() (note the upper case Cats):
> df[grepl(regex, df$column, ignore.case = TRUE), ]
# A tibble: 2 x 1
column
<chr>
1 ape and some other text here
2 something about Cats

This can be accomplished with a regular expression.
aColumn <- c("ape and some other text here","just some random text","Something about cats")
aColumn[grepl("ape|bat|cat",aColumn)]
...and the output:
> aColumn[grepl("ape|bat|cat",aColumn)]
[1] "ape and some other text here" "Something about cats"
>
One can also set up the regular expression in an R object, as follows.
# use with a variable
strings <- "ape|cat|bat"
aColumn[grepl(strings,aColumn)]
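Both answers treat the search strings as regular expressions, so a string containing characters such as "." or "(" would be interpreted as a pattern. A minimal sketch of a literal-matching variant, assuming a data frame with a character column named column as in the question:

```r
strings <- c("ape", "bat", "cat")
df <- data.frame(
  column = c("ape and some other text here",
             "just some random text",
             "Something about cats"),
  stringsAsFactors = FALSE
)

# One logical vector per search string; fixed = TRUE matches it literally,
# so special regex characters in a search string need no escaping.
per_string <- lapply(strings, function(s) grepl(s, df$column, fixed = TRUE))

# A row is kept if any of the strings matched it.
keep <- Reduce(`|`, per_string)
hits <- df[keep, , drop = FALSE]
```

Here hits holds rows 1 and 3. Note that fixed = TRUE cannot be combined with ignore.case = TRUE; for case-insensitive literal matching one option is to lower-case both sides with tolower() first.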

R custom parser function

I have data in txt file in this form:
col1 col2 col3 col4 col5
44 PT-222 My name is John 829302 24.02.14 01.53.51.000000 AM
11 PT-111 This is not user 8292829 24.02.14 01.40.47.000000 AM
I want to stress that these columns are not tab separated. They are separated only by one or more spaces. Also, col3 and col5 contain data that is composed of space-separated words.
Actually rows are fixed length. To make it clear:
44 PT-222 My name 829302 24.02.14 01.53.51.000000 AM
1 PT-1 This is not user and John 829 24.02.14 01.40.47.000000 AM
How can I read that txt file into a table?
Is there a custom separator function that reads one line at a time, so that I can override it?

If the fields are fixed width you can use read.fwf. Otherwise, we can use read.pattern in the gsubfn package. (Below we can replace text = Lines with something like "myfile.dat" .) First we read in the column names cn separately since they are not in the same format as the data. Then we skip over the first two lines of the file since the data begins in the third line and we read in the data using an appropriate pattern, pat:
Lines <- "col1 col2 col3 col4 col5
44 PT-222 My name is John 829302 24.02.14 01.53.51.000000 AM
11 PT-111 This is not user 8292829 24.02.14 01.40.47.000000 AM"
library(gsubfn)
cn <- read.table(text = Lines, nrow = 1, as.is = TRUE)
pat <- "^ *(\\S+) +(\\S+) +(.*\\S) +(\\S+) +(\\S+ \\S+ \\S+) *$"
DF <- read.pattern(text = Lines, pattern = pat, skip = 2,
col.names = cn, as.is = TRUE)
giving:
> DF
col1 col2 col3 col4 col5
1 44 PT-222 My name is John 829302 24.02.14 01.53.51.000000 AM
2 11 PT-111 This is not user 8292829 24.02.14 01.40.47.000000 AM
Note that the pattern we used assumes that no fields are empty. Any rows that do not match the pattern are silently dropped. The skip=2 is optional since the first two rows would be ignored in any case since they do not match the pattern.
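If installing gsubfn is not an option, the same pattern can be applied with base R's regexec() and regmatches(). This is a sketch rather than a drop-in replacement for read.pattern: the column names are set by hand, only col1 is converted to a number, and lines that do not match (such as the header) are dropped explicitly.

```r
lines <- c(
  "col1 col2 col3 col4 col5",
  "44 PT-222 My name is John 829302 24.02.14 01.53.51.000000 AM",
  "11 PT-111 This is not user 8292829 24.02.14 01.40.47.000000 AM"
)

# Same pattern as above: five capture groups, one per column.
pat <- "^ *(\\S+) +(\\S+) +(.*\\S) +(\\S+) +(\\S+ \\S+ \\S+) *$"

# For each line: the full match followed by the five groups,
# or character(0) when the line does not match.
m  <- regmatches(lines, regexec(pat, lines, perl = TRUE))
ok <- lengths(m) == 6

# Drop the full match, stack the groups row by row.
fields <- do.call(rbind, lapply(m[ok], `[`, -1))
DF <- as.data.frame(fields, stringsAsFactors = FALSE)
names(DF) <- paste0("col", 1:5)
DF$col1 <- as.integer(DF$col1)
```

For a file on disk, lines would come from readLines("myfile.dat"). Inspecting lines[!ok] shows which lines were skipped, which avoids the silent dropping mentioned above.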

Read fixed-width format, where the widths are inferred from the column headers

I have a rather odd file format that I need to read. It has space-separated columns, but the column widths must be inferred from the header.
In addition, there are some bogus lines that must be ignored, both blank and non-blank.
A representation of the data:
The first line contains some text that is not important, and shoud be ignored.
The second line also. In addition, the third and fifth lines are blank.
col1 col2 col3 col4 col5
ab cd e 132399.4 101 0 17:25:24 Ignore anything past the last named column
blah 773411 25 10 17:25:25 Ignore this too
Here, the first column, col1, contains the text from the beginning of the line until the character position of the end of the text string col1. The second column, col2 contains the text from the next character following the 1 in col1 until the end of the text string col2. And so on.
In reality, there are 17 columns rather than 5, but that should not change the code.
I'm looking for a data frame with the contents:
col1 col2 col3 col4 col5
1 ab cd e 132399.4 101 0 17:25:24
2 blah 773411.0 25 10 17:25:25
Here is a rather inelegant approach:
read.tt <- function(file) {
  con <- base::file(file, 'r')
  readLines(con, n=3);
  header <- readLines(con, n=1)
  close(con)
  endpoints <- c(0L, gregexpr('[^ ]( |$)', header)[[1]])
  widths <- diff(endpoints)
  names <- sapply(seq_along(widths),
                  function(i) substr(header, endpoints[i]+1, endpoints[i]+widths[i]))
  names <- sub('^ *', '', names)
  body <- read.fwf(file, widths, skip=5)
  names(body) <- names
  body
}
There must be a better way.
The lines to be ignored are a minor piece of this puzzle. I'll accept a solution that works with these already removed from the file (but of course would prefer one that does not need preprocessing).

If you know your header line, you can get the widths using the following method.
x
## [1] "         col1     col2 col3 col4      col5"
nchar(unlist(regmatches(x, gregexpr("\\s+\\S+", x))))
## [1] 13 9 5 5 10
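To show how such widths feed into read.fwf(), here is a small end-to-end sketch. The sample header and rows are made up and built with sprintf() so that the alignment is guaranteed. The pattern uses \\s* instead of \\s+, a small variation that also covers a header starting in the first character position.

```r
# Made-up sample: columns are 8, 6 and 9 characters wide.
header <- sprintf("%8s%6s%9s", "name", "qty", "time")
body <- c(
  sprintf("%-8s%6s%9s trailing text", "ab cd e", "101", "17:25:24"),
  sprintf("%-8s%6s%9s", "blah", "25", "17:25:25")
)

# Each match is a column name plus the blanks in front of it,
# so its length is the width of that column.
pieces <- unlist(regmatches(header, gregexpr("\\s*\\S+", header)))
widths <- nchar(pieces)

# read.fwf() only reads sum(widths) characters per line, so text past the
# last named column is ignored; strip.white trims the padding.
con <- textConnection(body)
DF <- read.fwf(con, widths = widths, col.names = trimws(pieces),
               strip.white = TRUE, stringsAsFactors = FALSE)
close(con)
```

With a real file, the header would be read with readLines() as in the question, and read.fwf() would be called on the file name with skip set to the number of lines before the data.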
