R, readLines, strsplit and grep - r

I am trying to read a random text file one line at a time, then split each line into "words" and run a regex on each word, such as finding all words that start with "w". With the following code snippet I get:
while (length(oneLine <- readLines(infile, n = 1, warn = FALSE)) > 0) {
myVector <- (strsplit(oneLine, " ", fixed = FALSE, perl = TRUE))
res <- grep("^w", myVector, perl = TRUE, value = TRUE)
...
> myVector
[[1]]
[1] "u" "rtu" "jgiyu" "t6riuri-4e5-" "ee4" "59"
[7] "43"
My question is, what is the correct syntax to access "u", "rtu", ... ?
> myVector[1]
[[1]]
[1] "u" "rtu" "jgiyu" "t6riuri-4e5-" "ee4" "59"
[7] "43"
Doesn't work. What will? What's up with the [[1]]? I was under the impression that vectors are one-dimensional and its elements are accessed like myVector[1], myVector[2], etc.
Thanks for the help.

strsplit returns a list. In this case, it is a list of length 1, but if you used readLines on the whole file, then called strsplit, it would return a list of the same length as the number of lines.
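For the way you're using it, you need to select the first element of the first component of the list, i.e. myVector[[1]][1] for "u" and myVector[[1]][2] for "rtu". Single brackets, as in myVector[1], return a list of length 1 rather than its contents, which is why the output still shows [[1]]. Also, in this case, unlist(myVector)[1] and unlist(myVector)[2] would work.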
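A minimal sketch of the whole step, assuming a made-up input line (the sentence and the variable name words are illustrative, not from the question). It extracts the character vector from the list first and only then runs grep on it:

```r
# Hypothetical input line, chosen so that some words start with "w"
oneLine <- "we went to the west wing"

# strsplit() returns a list with one element per input string
myVector <- strsplit(oneLine, " ", fixed = TRUE)

# [[1]] extracts the character vector stored in the first list element
words <- myVector[[1]]
first_word <- words[1]

# grep() now sees one word per element instead of one list element
res <- grep("^w", words, value = TRUE)
print(res)
```

Passing the list itself to grep makes R treat the whole line as a single element, so extracting with [[1]] (or unlist) first is the safer habit.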

Related

how to extract part of a string matching pattern with separation in r

I'm trying to extract part of a file name that matches a set of letters with variable length. The file names consist of several parameters separated by "_", but they vary in the number of parts. I'm trying to pull some of the parameters out to use separately.
Example file names:
a = "Vel_Mag_ft_modelExisting_350cfs_blah3.tif"
b = "Depth_modelDesign_11000cfs_blah2.tif"
I'm trying to pull out the parts that start with "model" so I end up with
"modelExisting"
"modelDesign"
The filenames are stored as a variable in a data.frame
I've tried
library(tidyverse)
tibble(files = c(a,b))%>%
mutate(attempt1 = str_extract(files, "model"),
attempt2 = str_match(str_split(files, "_"), "model"))
but just ended up with the "model" in all cases and not the "model...." that I need.
The pieces I need are a consistent number of pieces from the end, but I couldn't figure out how to specify that either. I tried
str_split(files, "_")[-3]
but this threw an error that it must be size 480 or 1 not size 479
We can create a function that captures, as a group, the run of letters that sits between an underscore and a following underscore plus one or more digits. In the replacement, the backreference (\\1) refers to that captured group:
f1 <- function(x) sub(".*_([[:alpha:]]+)_\\d+.*", "\\1", x)
Testing:
> f1(a)
[1] "modelExisting"
> f1(b)
[1] "modelDesign"
We can use strsplit or regmatches like below
> s <- c("Vel_Mag_ft_modelExisting_350cfs_blah3.tif", "Depth_modelDesign_11000cfs_blah2.tif")
> lapply(strsplit(s, "_"), function(x) x[which(grepl("^\\d+", x)) - 1])
[[1]]
[1] "modelExisting"
[[2]]
[1] "modelDesign"
> regmatches(s, gregexpr("[[:alpha:]]+(?=_\\d+)", s, perl = TRUE))
[[1]]
[1] "modelExisting"
[[2]]
[1] "modelDesign"
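Two further base R variants, sketched under the assumptions stated in the question: that the wanted piece always starts with "model", and that it is always the third piece from the end. The names by_prefix and by_position are illustrative only:

```r
s <- c("Vel_Mag_ft_modelExisting_350cfs_blah3.tif",
       "Depth_modelDesign_11000cfs_blah2.tif")

# Variant 1: match "model" plus everything up to the next underscore.
# Note that regmatches() silently drops strings that have no match.
by_prefix <- regmatches(s, regexpr("model[^_]*", s))

# Variant 2: split on "_" and take the third piece counted from the end
by_position <- vapply(strsplit(s, "_", fixed = TRUE),
                      function(p) p[length(p) - 2],
                      character(1))

print(by_prefix)
print(by_position)
```

The attempt str_split(files, "_")[-3] in the question indexes the outer list (dropping the third file) rather than each split vector, which is why the per-element function is needed.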

Recode if string (with punctuation) contains certain text

How can I search through a character vector and, if the string at a given index contains a pattern, replace that index's value?
I tried this:
List <- c(1:8)
Types<-as.character(c(
"ABC, the (stuff).\n\n\n fun", "meaningful", "relevant", "rewarding",
"unpleasant", "enjoyable", "engaging", "disinteresting"))
for (i in List) {
if (grepl(Types[i], "fun", fixed = TRUE))
{Types[i]="1"
} else if (grepl(Types[i], "meaningful", fixed = TRUE))
{Types[i]="2"}}
The code works for "meaningful", but doesn't when there's punctuation or other things in the string, as with "fun".
The first argument to grepl is the pattern, not the string.
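A small check, using the first string from the question, that may make the argument order concrete. It also suggests why "meaningful" appeared to work: when pattern and string are identical, swapping them changes nothing.

```r
x <- "ABC, the (stuff).\n\n\n fun"

# Correct order: grepl(pattern, x). Is "fun" somewhere inside x?
right_order <- grepl("fun", x, fixed = TRUE)

# Swapped order: is the long string somewhere inside "fun"?
swapped_order <- grepl(x, "fun", fixed = TRUE)

# Identical pattern and string match in either order
same_both_ways <- grepl("meaningful", "meaningful", fixed = TRUE)

print(c(right_order, swapped_order, same_both_ways))
```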
This would be a literal fix of your code:
for (i in seq_along(Types)) {
if (grepl("fun", Types[i], fixed = TRUE)) {
Types[i] = "1"
} else if (grepl("meaningful", Types[i], fixed = TRUE)) {
Types[i] = "2"
}
}
Types
# [1] "1" "2" "relevant" "rewarding" "unpleasant"
# [6] "enjoyable" "engaging" "disinteresting"
BTW, the use of List works, but it's a little extra: when you have separate variables like that, it is possible that one might go out of sync with the other. For instance, if you update Types and forget to update List, then it will break (or fail). For this, I used seq_along(Types) instead.
BTW: here's a slightly different version that drops the loop and introduces you to the power of vectorization (note that it still modifies Types in place):
Types[grepl("fun", Types, fixed = TRUE)] <- "1"
Types[grepl("meaningful", Types, fixed = TRUE)] <- "2"
Types
# [1] "1" "2" "relevant" "rewarding" "unpleasant"
# [6] "enjoyable" "engaging" "disinteresting"
The next level (perhaps over-complicating?) would be to store the patterns and recoding replacements in a frame (always a 1-to-1, you'll never accidentally update one without the other, can be stored in CSV if needed) and Reduce on it:
ptns <- data.frame(ptn = c("fun", "meaningful"), repl = c("1", "2"))
Reduce(function(txt, i) {
txt[grepl(ptns$ptn[i], txt, fixed = TRUE)] <- ptns$repl[i]
txt
}, seq_len(nrow(ptns)), init = Types)
# [1] "1" "2" "relevant" "rewarding" "unpleasant"
# [6] "enjoyable" "engaging" "disinteresting"
You could use str_replace_all:
library(stringr)
pat <- c(fun = '1', meaningful = '2')
str_replace_all(Types, setNames(pat, sprintf('(?s).*%s.*', names(pat))))
[1] "1" "2" "relevant"
[4] "rewarding" "unpleasant" "enjoyable"
[7] "engaging" "disinteresting"
Try to use str_replace(string, pattern, replacement) from the stringr package.

How to randomly reshuffle letters in words

I am trying to make a word scrambler in R. I have put some words in a collection and tried to use strsplit() to split the letters of each word in the collection.
But I don't understand how to jumble the letters of a word and merge them back into one word in R. Does anyone know how I can solve this?
This is what I have done: (screenshot not reproduced)
Once you've split the words, you can use sample() to rescramble the letters, and then paste0() with collapse="", to concatenate back into a 'word'
lapply(words, function(x) paste0(sample(strsplit(x, split="")[[1]]), collapse=""))
You can use the stringi package if you want:
> stringi::stri_rand_shuffle(c("hello", "goodbye"))
[1] "oellh" "deoygob"
Here's a one-liner:
lapply(lapply(strsplit(strings, ""), sample), paste0, collapse = "")
[[1]]
[1] "elfi"
[[2]]
[1] "vleo"
[[3]]
[1] "rmsyyet"
Use unlist to get rid of the list:
unlist(lapply(lapply(strsplit(strings, ""), sample), paste0, collapse = ""))
Data:
strings <- c("life", "love", "mystery")
You can use the sample function for this.
Here is an example of doing it for a single word. You can use this within your for loop:
yourword <- "hello"
# strsplit() returns a list containing one character vector.
# We only want the vector, not the list, so we extract the first
# (and only) element with "[[1]]".
jumble <- strsplit(yourword, "")[[1]]
jumble <- sample(jumble,                # sample random elements from jumble
                 size = length(jumble), # as many times as the length of jumble,
                                        # i.e. all letters
                 replace = FALSE        # do not sample an element multiple times
)
restored <- paste0(jumble,
                   collapse = ""        # join the letters back into one string
)
As the answer from langtang suggests, you can use the apply family for this, which is more efficient. But maybe this answer helps in understanding what R is actually doing here.
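The single-word steps above can be wrapped in a small helper and applied to a whole vector. This is only a sketch: the function name scramble is hypothetical, the word list is borrowed from the other answer, and set.seed is there solely to make the run repeatable.

```r
# Hypothetical helper: shuffle the letters of one word
scramble <- function(w) {
  letters_of_w <- strsplit(w, "")[[1]]        # character vector of letters
  paste(sample(letters_of_w), collapse = "")  # shuffle, then rejoin
}

words <- c("life", "love", "mystery")

set.seed(42)  # only for reproducibility of this illustration
scrambled <- vapply(words, scramble, character(1), USE.NAMES = FALSE)
print(scrambled)
```

Unlike lapply, vapply returns a plain character vector directly, so no unlist step is needed.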

a list of multiple lists of 2 for synonyms

I want to read synonyms from a CSV file, where the first word in each record is the "main" word and the rest of the words in the same record are its synonyms.
I basically want to create a list like the one I would write by hand in R:
synonyms <- list(
list(word="ss", syns=c("yy","yyss")),
list(word="ser", syns=c("sert","sertyy","serty"))
)
This gives me a list as
synonyms
[[1]]
[[1]]$word
[1] "ss"
[[1]]$syns
[1] "yy" "yyss"
[[2]]
[[2]]$word
[1] "ser"
[[2]]$syns
[1] "sert" "sertyy" "serty"
which is essentially a list of lists of "word" and "syns".
How do I go about creating a similar list while reading the words and synonyms from a CSV file?
Any pointers would help! Thanks.
This process should return what you want.
# read in data using readLines
myStuff <- readLines(textConnection(temp))
This will return a character vector with one element per line in the file. Note that textConnection is only used here to simulate a file; to read a real file, just supply the file path to readLines. Now, split each vector element into a vector of words using strsplit, which returns a list.
myList <- strsplit(myStuff, split=" ")
Now, separate the first element from the remaining element for each vector within the list.
result <- lapply(myList, function(x) list(word=x[1], synonyms=x[-1]))
This returns the desired result. We use lapply to move through the list items. For each list item, we return a named list where the first element, named word, corresponds to the first element of the vector that is the list item and the remaining elements of this vector are placed in a second list element called synonyms.
result
[[1]]
[[1]]$word
[1] "ss"
[[1]]$synonyms
[1] "yy" "yyss"
[[2]]
[[2]]$word
[1] "ser"
[[2]]$synonyms
[1] "sert" "sertyy" "serty"
[[3]]
[[3]]$word
[1] "at"
[[3]]$synonyms
[1] "ate" "ater" "ates"
[[4]]
[[4]]$word
[1] "late"
[[4]]$synonyms
[1] "lated" "lates" "latee"
data
temp <-
"ss yy yyss
ser sert sertyy serty
at ate ater ates
late lated lates latee"
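The example data above is space-separated. If the real file is comma-separated, as the question suggests, the same approach might look like the sketch below. The sample text is made up, and the second element is named syns only to match the structure shown in the question:

```r
# Hypothetical comma-separated content standing in for the real file
csv_text <- "ss,yy,yyss\nser,sert,sertyy,serty"

con <- textConnection(csv_text)   # for a real file: readLines("path/to/file.csv")
csv_lines <- readLines(con)
close(con)

# Split each record on commas; rows may have different numbers of fields
parts <- strsplit(csv_lines, ",", fixed = TRUE)

# First field is the main word, the rest are its synonyms
synonyms <- lapply(parts, function(x) {
  list(word = trimws(x[1]), syns = trimws(x[-1]))
})
str(synonyms)
```

Reading with readLines plus strsplit, rather than read.csv, avoids padding shorter rows with empty fields when records have different numbers of synonyms.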

How can I create a table of only the unique elements in a list so I can order the elements in terms of frequency?

I have tried running the code below, however it does not work as the arguments are not all of equal length.
sentence= "I like tea and I love coffee and biscuits"
words = function(x) {
txt = unlist(strsplit(x,' '))
wl = list()
for(i in seq_along(txt)) {
wrd = txt[i]
wl[[wrd]] = c(wl[[wrd]], i)
}
class(wl) <- "wordclass"
return(wl)
}
summary.wordclass <- function(y) {
cat("the frequency of words",names(sort(table(y), decreasing=TRUE)),"\n")
}
wordfreq=words(sentence)
summary(wordfreq)
I want to get an output like
[1] "I" "and" "like" "tea" "love" "coffee"
However, I am getting the error
Error in table(y) : all arguments must have the same length
If anyone could help that would be great!
Would
names(sort(table(unlist(strsplit(sentence, " "))), decreasing = TRUE))
work for you?
The output is
[1] "and" "I" "biscuits" "coffee" "like" "love" "tea"
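The error in the question arises because table() receives the whole list of position vectors, which have different lengths. A possible fix that keeps the wordclass object from the question is sketched below: it counts the stored positions per word with lengths() instead. The (object, ...) signature follows the usual S3 convention for summary methods, and returning the names rather than printing them with cat is an assumption about what is wanted.

```r
sentence <- "I like tea and I love coffee and biscuits"

# Constructor from the question: maps each word to its positions
words <- function(x) {
  txt <- unlist(strsplit(x, " ", fixed = TRUE))
  wl <- list()
  for (i in seq_along(txt)) {
    wrd <- txt[i]
    wl[[wrd]] <- c(wl[[wrd]], i)
  }
  class(wl) <- "wordclass"
  wl
}

# The number of stored positions per word is that word's frequency
summary.wordclass <- function(object, ...) {
  counts <- lengths(unclass(object))
  names(sort(counts, decreasing = TRUE))
}

wordfreq <- words(sentence)
by_freq <- summary(wordfreq)
print(by_freq)
```

Here "I" and "and" come first because each occurs twice; the relative order of words with equal counts depends on how sort handles ties.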

Resources