How to split a sentence in two halves in R

I have a vector of strings, and I want each string to be cut roughly in half, at the nearest space.
For example, with the following data:
test <- data.frame(init = c("qsdf mqsldkfop mqsdfmlk lksdfp pqpdfm mqsdfmj mlk",
"qsdf",
"mp mlksdfm mkmlklkjjjjjjjjjjjjjjjjjjjjjjklmmjlkjll",
"qsddddddddddddddddddddddddddddddd",
"qsdfmlk mlk mkljlmkjlmkjml lmj mjjmjmjm lkj"), stringsAsFactors = FALSE)
I want to get something like this :
first sec
1 qsdf mqsldkfop mqsdfmlk lksdfp pqpdfm mqsdfmj mlk
2 qsdf
3 mp mlksdfm mkmlklkjjjjjjjjjjjjjjjjjjjjjjklmmjlkjll
4 qsddddddddddddddddddddddddddddddd
5 lmj mjjmjmjm lkj lmj mjjmjmjm lkj
A solution that does not cut in half but instead ensures that "the first part is no longer than X characters" would also be great.

First, we split the strings by spaces.
a <- strsplit(test$init, " ")
Then we find the last word of each vector for which the cumulative number of characters is at most half the total number of characters in the vector (spaces are not counted):
b <- lapply(a, function(x) which.max(cumsum(cumsum(nchar(x)) <= sum(nchar(x))/2)))
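To see what the index computation does, here is a small walk-through on a made-up word vector (the names `words`, `lens`, `running`, `fits` and `split_index` are illustrative and not part of the answer's code):

```r
words <- c("ab", "cd", "efgh", "ij")      # toy input, already split on spaces
lens <- nchar(words)                      # 2 2 4 2
running <- cumsum(lens)                   # 2 4 8 10
fits <- running <= sum(lens) / 2          # TRUE TRUE FALSE FALSE (half is 5)
# cumsum(fits) is 1 2 2 2; which.max returns the position of the first
# maximum, which is the last word that still fits into the first half
split_index <- which.max(cumsum(fits))
first_half <- paste(words[1:split_index], collapse = " ")
second_half <- paste(words[(split_index + 1):length(words)], collapse = " ")
```

Note that when no word fits (for example a single long word), `cumsum(fits)` is all zeros and `which.max` returns 1.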
Afterwards we combine the two halves, substituting NA if the vector was of length 1 (only one word).
combined <- Map(function(x, y){
  # a single word cannot be split, so the second half is NA
  if(length(x) == 1){
    return(c(x, NA))
  }else{
    return(c(paste(x[1:y], collapse = " "), paste(x[(y+1):length(x)], collapse = " ")))
  }
}, a, b)
Finally, we rbind the combined strings and change the column names.
newdf <- do.call(rbind.data.frame, combined)
names(newdf) <- c("first", "second")
Result:
> newdf
first second
1 qsdf mqsldkfop mqsdfmlk lksdfp pqpdfm mqsdfmj mlk
2 qsdf <NA>
3 mp mlksdfm mkmlklkjjjjjjjjjjjjjjjjjjjjjjklmmjlkjll
4 qsddddddddddddddddddddddddddddddd <NA>
5 qsdfmlk mlk mkljlmkjlmkjml lmj mjjmjmjm lkj
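For the variant mentioned in the question ("the first part is no longer than X characters"), a minimal base R sketch could look like the following. The helper name `split_at_width` and its behaviour for strings without a usable space (they are left whole, with NA as the second part) are assumptions made for illustration.

```r
# Hypothetical helper: cut each string at the last space that keeps the
# first part at or below `width` characters.
split_at_width <- function(x, width) {
  parts <- lapply(x, function(s) {
    # short strings are not split at all
    if (nchar(s) <= width) return(c(s, NA_character_))
    spaces <- gregexpr(" ", s, fixed = TRUE)[[1]]   # -1 when there is no space
    # a space at position p leaves p - 1 characters in the first part
    ok <- spaces[spaces > 0 & spaces <= width + 1]
    if (length(ok) == 0) return(c(s, NA_character_))
    cut <- max(ok)
    c(substr(s, 1, cut - 1), substr(s, cut + 1, nchar(s)))
  })
  out <- as.data.frame(do.call(rbind, parts), stringsAsFactors = FALSE)
  names(out) <- c("first", "second")
  out
}

res <- split_at_width(c("aa bb cc", "abc", "abcdefgh ij"), width = 5)
res
```

In the third string the only space comes after 8 characters, so the sketch keeps the string intact rather than cutting inside a word.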

You can use the function nbreak from the package that I wrote:
devtools::install_github("igorkf/breaker")
library(tidyverse)
test <- data.frame(init = c("Phrase with four words", "That phrase has five words"), stringsAsFactors = FALSE)
# Count the number of words in each row:
nwords = str_count(test$init, " ") + 1
# Position at which to break each row:
break_here = ifelse(nwords %% 2 == 0, nwords/2, round(nwords/2) + 1)
test
# init
# 1 Phrase with four words
# 2 That phrase has five words
# map2_chr applies a two-argument function,
# where the string is "init" and n is "break_here":
test %>%
  mutate(init = map2_chr(init, break_here, ~breaker::nbreak(string = .x, n = .y, loop = FALSE))) %>%
  separate(init, c("first", "second"), sep = "\n")
# first second
# 1 Phrase with four words
# 2 That phrase has five words

Related

Using regex to drop duplicated elements in columns of an R dataframe

I have a dummy dataframe df which has dimensions 6 X 4.
df <- data.frame(
Hits = c("Hit1", "Hit2", "Hit3", "Hit4", "Hit5", "Hit6"),
GO = c("GO:0005634~nucleus,", "", "GO:0005737~cytoplasm,", "GO:0005634~nucleus,GO:0005737~cytoplasm,", "",
"GO:0005634~nucleus,GO:0005654~nucleoplasm,"),
KEGG = c("", "", "", "", "", ""),
SMART = c("SM00394:RIIa,", "SM00394:RIIa,", "", "SM00054:EFh,",
"", "SM00394:RIIa,SM00239:C2,"))
df looks like this
The elements in the columns consist of two parts:
an identifier (e.g. GO:0005634~, SM00394: etc.)
a term (e.g. nucleus, EFh etc.)
For each column I want to retain a row if it contains at least one term which is not present in any row above it. For example, in the column GO, rows 1 and 3 contain unique terms, so these should be retained. Row 4 contains terms which are already present in rows 1 and 3, so it should be dropped. Row 6 has one term which is not present in any row above it, hence it should also be retained.
I have been able to come up with regular expressions to extract the terms from the columns GO and SMART
Regex for GO: (?<=~).*?(?=,(?:GO:\\d+~|$))
Regex for SMART: (?<=:).*?(?=,(?:\\w+\\d+:|$))
But I'm unable to figure out a way to integrate the regex and the conditions mentioned above into a solution. The output should look like this
Any suggestions on how to solve this?

Here is a general approach that will handle GO, SMART, and potentially KEGG, though it is impossible to say without any information about KEGG.
The function f below takes as arguments
x, a character vector
split, the delimiter separating items in lists
sep, the delimiter separating identifiers and terms within items
and returns a logical vector indexing the elements of x with at least one non-duplicated term.
f <- function(x, split, sep) {
l1 <- strsplit(x, split)
tt <- sub(paste0("^[^", sep, "]*", sep), "", unlist(l1))
l2 <- relist(duplicated(tt), l1)
!vapply(l2, all, NA)
}
Applying f to GO and SMART:
nms <- c("GO", "SMART")
l <- Map(f, x = df[nms], split = ",", sep = c("~", ":"))
l
## $GO
## [1] TRUE FALSE TRUE FALSE FALSE TRUE
##
## $SMART
## [1] TRUE FALSE FALSE TRUE FALSE TRUE
Setting to "" elements of GO and SMART with zero non-duplicated terms, then filtering out empty rows, we obtain the desired result:
df2 <- df
df2[nms] <- Map(replace, df2[nms], lapply(l, `!`), "")
df2[Reduce(`|`, l), ]
## Hits GO KEGG SMART
## 1 Hit1 GO:0005634~nucleus, SM00394:RIIa,
## 3 Hit3 GO:0005737~cytoplasm,
## 4 Hit4 SM00054:EFh,
## 6 Hit6 GO:0005634~nucleus,GO:0005654~nucleoplasm, SM00394:RIIa,SM00239:C2,
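The same rule can also be written as an explicit loop that remembers the terms seen so far, which may be easier to follow than the `relist` version. This is only a sketch: the name `has_new_term` is made up, and it assumes, as `f` does, that `sep` is a single character that is safe inside a regex character class.

```r
# Hypothetical loop-based equivalent of f(): TRUE where a row introduces
# at least one term not seen in any earlier row.
has_new_term <- function(x, split, sep) {
  seen <- character(0)
  keep <- logical(length(x))
  for (i in seq_along(x)) {
    items <- strsplit(x[i], split, fixed = TRUE)[[1]]
    # drop everything up to and including the first `sep` (the identifier)
    terms <- sub(paste0("^[^", sep, "]*", sep), "", items)
    keep[i] <- any(!terms %in% seen)   # any(logical(0)) is FALSE for empty rows
    seen <- union(seen, terms)
  }
  keep
}

go <- c("GO:0005634~nucleus,", "", "GO:0005737~cytoplasm,",
        "GO:0005634~nucleus,GO:0005737~cytoplasm,", "",
        "GO:0005634~nucleus,GO:0005654~nucleoplasm,")
go_keep <- has_new_term(go, ",", "~")
go_keep
```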

The following algorithm is applied to each term (GO, SMART, KEGG):
split the identifier+term list on commas (see stringr::str_split etc.)
extract the term with a regex
cumulate all the terms along the dataframe as they appear
extract the difference between each row and the row immediately preceding
replace the string with "" if no new term is introduced
filter rows where not all the terms are ""
library(dplyr)
library(stringr)
library(purrr)
termred <- function(terms, rx) {
terms |>
stringr::str_split(",") |>
purrr::map(stringr::str_trim) |>
purrr::map(~{.x[.x != ""]}) |>
purrr::map(~stringr::str_extract(.x, rx)) |>
purrr::accumulate(union) %>%
{mapply(setdiff, ., lag(., 1), SIMPLIFY = TRUE)} %>%
{ifelse(sapply(., length) > 0, terms, "")}
}
df |>
transform(GO = termred(GO, "~.*$")) |>
transform(SMART = termred(SMART, ":.*$")) |>
filter(GO != "" | SMART != ""| KEGG != "")
##> Hits GO KEGG SMART
##>1 Hit1 GO:0005634~nucleus, SM00394:RIIa,
##>2 Hit3 GO:0005737~cytoplasm,
##>3 Hit4 SM00054:EFh,
##>4 Hit6 GO:0005634~nucleus,GO:0005654~nucleoplasm, SM00394:RIIa,SM00239:C2,

fread specifying separator within column

I am trying to parse a 2 column list that is separated using multiple spaces for columns and single spaces for words within a column. Nothing I have tried has successfully split the data into two columns. How do I do this?
library(data.table)
item.ids<-fread("http://eve-files.com/chribba/typeid.txt",sep2=" ")
Example of the dataset:
typeID typeName
----------- ----------------------------------------
0 #System
2 Corporation
3 Region
4 Constellation
5 Solar System

This seems to work:
library(readr)
url = "http://eve-files.com/chribba/typeid.txt"
df = read_fwf(url, fwf_empty(url), skip = 2)
colnames = read_table(url, n_max = 1)
names(df) = names(colnames)
df = na.omit(df)
dim(df)
# [1] 22382 2
summary(df)
# typeID typeName
# Min. : 0 Length:22382
# 1st Qu.: 13986 Class :character
# Median : 22938 Mode :character
# Mean : 53827
# 3rd Qu.: 30209
# Max. :368620

Here's one approach that uses extract from "tidyr" that should be pretty easy to follow.
First, we read the data in, and inspect the first few lines and last few lines. After inspection, we find that the data values are from lines 3 to 22384.
x <- readLines("http://eve-files.com/chribba/typeid.txt")
# Check out the data
head(x) # Let's get rid of the first two lines...
tail(x) # ... and the last 3
In the extraction stage, we're basically looking for:
A set of numbers--can be of varying lengths (([0-9]+)). It's in (), so capture it and extract it to a new column.
The numbers should be followed by 2 or more spaces ([ ]{2,}). That's not in (), so we don't need to extract that into a new column.
The set of spaces can be followed by anything else ((.*)). That's in (), so capture that and extract it into a new column.
I've also used the first value of "x" to extract the original column names.
Here's what it looks like:
library(tidyverse)
# tibble() replaces the deprecated data_frame()
tibble(V1 = x[3:(length(x)-3)]) %>%
  extract(V1, into = scan(text = x[1], what = ""), regex = "([0-9]+)[ ]{2,}(.*)")
# # A tibble: 22,382 x 2
# typeID typeName
# * <chr> <chr>
# 1 0 #System
# 2 2 Corporation
# 3 3 Region
# 4 4 Constellation
# 5 5 Solar System
# 6 6 Sun G5 (Yellow)
# 7 7 Sun K7 (Orange)
# 8 8 Sun K5 (Red Giant)
# 9 9 Sun B0 (Blue)
# 10 10 Sun F0 (White)
# # ... with 22,372 more rows
Or
tibble(V1 = x[3:(length(x)-3)]) %>%
  separate(V1, into = scan(text = x[1], what = ""), sep = "[ ]{2,}",
           extra = "merge", convert = TRUE)
Another approach might be to use strsplit with [ ]{2,} as the split value. do.call(rbind, ...) would be the idiom to follow after that, but you might want to filter only for cases where the split resulted in two values.
do.call(rbind, Filter(function(z) length(z) == 2, strsplit(x, "[ ]{2,}")))
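As a self-contained illustration of the strsplit idea, here is a sketch on a few inline lines. The sample lines, including the leading padding and the trailer line, are assumptions made up to resemble the excerpt in the question, not the real file. The sketch trims each line first and also checks that the first field is numeric, so that a dashed separator row is not kept.

```r
# Made-up sample resembling the file layout (an assumption, not real data)
lines <- c("     typeID  typeName",
           "-----------  ----------------",
           "          0  #System",
           "          5  Solar System",
           "",
           "(2 rows affected)")

parts <- strsplit(trimws(lines), "[ ]{2,}")   # split on runs of 2+ spaces
# keep only lines with exactly two fields whose first field is an integer
rows <- Filter(function(z) length(z) == 2 && grepl("^[0-9]+$", z[1]), parts)
m <- do.call(rbind, rows)
items <- data.frame(typeID = as.integer(m[, 1]), typeName = m[, 2],
                    stringsAsFactors = FALSE)
items
```

Single spaces inside a name such as "Solar System" are left alone because the split pattern needs at least two spaces.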

Read in your text file line-by-line:
l <- list()
fileName <- "http://eve-files.com/chribba/typeid.txt"
conn <- file(fileName,open="r")
linn <-readLines(conn)
for (i in 1:length(linn)){
l[i] <- list(linn[i])
}
close(conn)
Create a list of all entries:
l_new <- list()
for(p in 1:length(l)) {
new_vec <- unlist(strsplit(gsub("(?<=[\\s])\\s*|^\\s+|\\s+$", "", l[[p]], perl=TRUE), " "))
if(!is.na(new_vec[4])) {
new_vec_t <- paste(new_vec[2], new_vec[3], new_vec[4])
}
else if (!is.na(new_vec[3])) {
new_vec_t <- paste(new_vec[2], new_vec[3])
}
else {
new_vec_t <- paste(new_vec[2])
}
l_new[p] <- list(c(new_vec[1], new_vec_t))
}
Convert your list to a dataframe:
l_new_frame <- data.frame(do.call('rbind', l_new))
l_new_frame <- l_new_frame[-c(1,2),]
names(l_new_frame) <- c('typeID', 'typeName')
Check results:
print(l_new_frame[1:100,], row.names = FALSE)

read data into R with different delimiters

I am trying to read a file into R that uses different delimiters. The first row is space-delimited. From the second row to the last, there is a space between the first and second columns and another between the second and third; after that, each character in the block of twos, zeros and ones should become its own column.
Any hints?
ID Chip AX-77047182 AX-80910836 AX-80737273 AX-77048714 AX-77048779 AX-77050447
3811582 1 2002202222200202022020200200220200222200022220002200000201202000222022
3712982 1 2002202222200202022020200200220200222200022220002200000200202000222022
3712990 1 2002202211200202021011100101210200111101022121112100111110211110122122
3713019 1 2002202211200202021011100101210200111101022121112100111110211110122122
3713025 1 2002202211200202021011100101210200111101022121112100111110211110122122
3713126 1 2002202222200202022020200200220200222200022220002200000200202000222022

Certainly not the most elegant solution, but you could try the following. If I have understood your example data correctly, you have not provided all the column names (AX-77047182,...) that would be needed for the rows of zeros/ones/twos. If my understanding is wrong, the approach below will not produce the desired result, but it might still help you find a workaround: you might simply adapt the delimiter in the second split command. I hope this helps...
#read file as character vector
chipstable <- readLines(".../chips.txt")
#extract first line to be used as column names
tablehead <- unlist(strsplit(chipstable[1], " "))
#split by first delimiter, i.e., space
chipstable <- strsplit(chipstable[2:length(chipstable)], " ")
#split by second delimiter, i.e., between each character (here number)
#and merge the two split results in one line
chipstable <- lapply(chipstable, function(x) {
c(x[1:2], unlist(strsplit(x[3], "")))
})
#combine all lines to a data frame
chipstable <- do.call(rbind, chipstable)
#assign column names
colnames(chipstable) <- tablehead
#turn values to numeric (if needed)
chipstable <- apply(chipstable, 2, as.numeric)
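A reduced, self-contained version of the same two-stage split may make the steps easier to check. The inline lines and marker names (M1 to M3) below are invented so that the header has exactly one name per digit, which is an assumption about the real file.

```r
# Made-up miniature of the file: header plus two data rows (assumption)
lines <- c("ID Chip M1 M2 M3",
           "3811582 1 202",
           "3712990 1 211")

header <- strsplit(lines[1], " ")[[1]]
fields <- strsplit(lines[-1], " ")            # first delimiter: space
rows <- lapply(fields, function(f) {
  c(f[1:2], strsplit(f[3], "")[[1]])          # second split: one column per digit
})
m <- do.call(rbind, rows)
colnames(m) <- header
geno <- as.data.frame(m, stringsAsFactors = FALSE)
geno[] <- lapply(geno, as.integer)            # convert every column to integer
geno
```

Unlike the `apply(..., as.numeric)` step above, which returns a matrix, this variant keeps a data frame.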

You can try ... read(pattern = " || 1 ", recursive = TRUE)
Then bind the results.
For instance:
data <- "ID Chip AX-77047182 AX-80910836 AX-80737273 AX-77048714 AX-77048779 AX-77050447
3811582 1 2002202222200202022020200200220200222200022220002200000201202000222022
3712982 1 2002202222200202022020200200220200222200022220002200000200202000222022
3712990 1 2002202211200202021011100101210200111101022121112100111110211110122122
3713019 1 2002202211200202021011100101210200111101022121112100111110211110122122
3713025 1 2002202211200202021011100101210200111101022121112100111110211110122122
3713126 1 2002202222200202022020200200220200222200022220002200000200202000222022"
teste <- strsplit(data, split = "\n")
for(i in seq_along(teste[[1]])) {
  if (i == 1) {
    # header row: split on single spaces
    dataOut <- strsplit(teste[[1]][i], split = " ")
  } else {
    # data rows: split on the " 1 " between the ID and the genotype string
    dataOut <- strsplit(teste[[1]][i], split = " 1 ")
  }
  print(dataOut)
}

Split string according to ambiguous delimiter in R

I have pairs of strings stored in a data frame:
df <- data.frame(str = c("L_V1_ROI-L_MST_ROI",
"L_V6_ROI-L_V2_ROI",
"L_V3_ROI-L_V4_ROI",
"L_V8_ROI-L_4_ROI",
"L_p9-46v_ROI-L_a9-46v_ROI"))
Each pair is separated by - symbol with the exception of the last pair which contains three - symbols and should be separated into substrings L_p9-46v_ROI and L_a9-46v_ROI.
A task is to split these pairs into substrings according to the separator. To do this I simply use:
library(tidyr)
df %>% separate(col = str, into = c("str1", "str2"), sep = "-")
which gives the following result:
str1 str2
1 L_V1_ROI L_MST_ROI
2 L_V6_ROI L_V2_ROI
3 L_V3_ROI L_V4_ROI
4 L_V8_ROI L_4_ROI
5 L_p9 46v_ROI
Warning message:
Too many values at 1 locations: 5
As expected, the problem lies in the 5th pair which has more than one - symbol.
Question: what is the regex to match the proper separator?
My partial solution is pasted below, but I hope that there should be more intelligent solution.
library(stringr)
my_split <- function(string, pattern) {
  ## Match start and end position of the "_ROI-"
  position <- str_locate(string = string, pattern = pattern)
  start <- position[1]
  end <- position[2]
  ## Extract substrings (use the argument, not the global my_str)
  substring1 <- substr(string, 1, start + 3)
  substring2 <- substr(string, end + 1, nchar(string))
  return(list(substring1, substring2))
}
## Toy example
my_str <- "L_p9-46v_ROI-L_a9-46v_ROI"
my_split(string = my_str, pattern = "_ROI-")
[[1]]
[1] "L_p9-46v_ROI"
[[2]]
[1] "L_a9-46v_ROI"
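One way to express "the - that directly follows _ROI" as a regex is a lookbehind. The following is a minimal base R sketch, assuming that the first string of every pair ends in _ROI as in the sample data; the names `pairs`, `parts` and `out` are illustrative.

```r
pairs <- c("L_V1_ROI-L_MST_ROI",
           "L_V8_ROI-L_4_ROI",
           "L_p9-46v_ROI-L_a9-46v_ROI")

# (?<=_ROI)- matches a hyphen only when it is preceded by "_ROI";
# lookbehind needs perl = TRUE in base R
parts <- strsplit(pairs, "(?<=_ROI)-", perl = TRUE)
out <- as.data.frame(do.call(rbind, parts), stringsAsFactors = FALSE)
names(out) <- c("str1", "str2")
out
```

The hyphens inside "p9-46v" and "a9-46v" are not preceded by "_ROI", so they are left alone.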

Need to find most common combination of letters

Let's say for simplicity that I have 10 rows of 5 characters, where each character can be A-Z.
E.g.:
KJGXI
GDGQT
JZKDC
YOTQD
SSDIQ
PLUWC
TORHC
PFJSQ
IIZMO
BRPOJ
WLMDX
AZDIJ
ARNUA
JEXGA
VFPIP
GXOXM
VIZEM
TFVQJ
OFNOG
QFNJR
ZGUBZ
CCTMB
HZPGV
ORQTJ
I want to know which 3 letter combination is most common. However, the combination does not need to be in order, nor next to each other. E.g
ABCXY
CQDBA
=ABC
I could probably brute-force it with endless loops but I was wondering if there was a better way of doing it!

Here is a solution:
x <- c("KJGXI", "GDGQT", "JZKDC", "YOTQD", "SSDIQ", "PLUWC", "TORHC", "PFJSQ", "IIZMO", "BRPOJ", "WLMDX", "AZDIJ",
"ARNUA", "JEXGA", "VFPIP", "GXOXM", "VIZEM", "TFVQJ", "OFNOG", "QFNJR", "ZGUBZ", "CCTMB", "HZPGV", "ORQTJ")
temp <- do.call(cbind, lapply(strsplit(x, ""), combn, m = 3))
temp <- apply(temp, 2, sort)
temp <- apply(temp, 2, paste0, collapse = "")
sort(table(temp), decreasing = TRUE)
which returns the number of times each combination appears. You can then use names(which.max(sort(table(temp), decreasing = TRUE))) to get the most frequent combination (in this case, "FJQ").
In this case, two combinations appear 3 times, so you can do
result <- sort(table(temp), decreasing = TRUE)
names(which(result == max(result)))
# [1] "FJQ" "IMZ"
to get all the combinations that appear most often.
The code works as follows:
split each element of x in five letters, then generate each possible combination of 3 elements from the 5 letters
sort each of those combination alphabetically
paste the 3 letters together
generate the count for each of those combinations, and sort the result
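The steps above can be checked on the two-row example from the question, where the expected answer is ABC. This sketch adds one assumption that is not in the answer's code: `unique()` is applied per row, so a combination counts at most once per row even if the row contains repeated letters.

```r
x <- c("ABCXY", "CQDBA")   # the two-row example from the question

combos <- lapply(strsplit(x, ""), function(ch) {
  # sort the letters first so every 3-letter combination is already in
  # alphabetical order, then paste each column of combn() into one string
  unique(apply(combn(sort(ch), 3), 2, paste0, collapse = ""))
})
counts <- sort(table(unlist(combos)), decreasing = TRUE)
top <- names(counts)[as.vector(counts) == max(counts)]
top
```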

I would split each string into letters, sort them, then use combn to get all combinations. Use paste0 to collapse these back into strings and count.
txt <- c("KJGXI", "GDGQT", "JZKDC", "YOTQD", "SSDIQ", "PLUWC", "TORHC",
"PFJSQ", "IIZMO", "BRPOJ", "WLMDX", "AZDIJ", "ARNUA", "JEXGA",
"VFPIP", "GXOXM", "VIZEM", "TFVQJ", "OFNOG", "QFNJR", "ZGUBZ",
"CCTMB", "HZPGV", "ORQTJ")
txt2 <- strsplit(txt, split = "")
txt2 <- lapply(txt2, sort)
txt3 <- lapply(txt2, combn, m = 3)
txt4 <- lapply(txt3, function(x){apply(x, 2, paste0, collapse = "")})
table(unlist(txt4))
Several steps here could be combined.
