Recode if string (with punctuation) contains certain text - r

How can I search through a character vector and, if the string at a given index contains a pattern, replace that index's value?
I tried this:
List <- c(1:8)
Types<-as.character(c(
"ABC, the (stuff).\n\n\n fun", "meaningful", "relevant", "rewarding",
"unpleasant", "enjoyable", "engaging", "disinteresting"))
for (i in List) {
if (grepl(Types[i], "fun", fixed = TRUE))
{Types[i]="1"
} else if (grepl(Types[i], "meaningful", fixed = TRUE))
{Types[i]="2"}}
The code works for "meaningful", but doesn't when there's punctuation or other things in the string, as with "fun".

The first argument to grepl is the pattern, not the string.
This would be a literal fix of your code:
for (i in seq_along(Types)) {
if (grepl("fun", Types[i], fixed = TRUE)) {
Types[i] = "1"
} else if (grepl("meaningful", Types[i], fixed = TRUE)) {
Types[i] = "2"
}
}
Types
# [1] "1" "2" "relevant" "rewarding" "unpleasant"
# [6] "enjoyable" "engaging" "disinteresting"
As an aside, List works but is unnecessary: a separate index variable can get out of sync with the vector it indexes. If you change the length of Types and forget to update List, the loop will skip elements or run past the end of the vector. That is why I used seq_along(Types) instead.
Here is a vectorized version that does the same recoding without a loop (like the loop, it modifies Types in place):
Types[grepl("fun", Types, fixed = TRUE)] <- "1"
Types[grepl("meaningful", Types, fixed = TRUE)] <- "2"
Types
# [1] "1" "2" "relevant" "rewarding" "unpleasant"
# [6] "enjoyable" "engaging" "disinteresting"
The next level (perhaps over-complicating?) would be to store the patterns and recoding replacements in a frame (always a 1-to-1, you'll never accidentally update one without the other, can be stored in CSV if needed) and Reduce on it:
ptns <- data.frame(ptn = c("fun", "meaningful"), repl = c("1", "2"))
Reduce(function(txt, i) {
txt[grepl(ptns$ptn[i], txt, fixed = TRUE)] <- ptns$repl[i]
txt
}, seq_len(nrow(ptns)), init = Types)
# [1] "1" "2" "relevant" "rewarding" "unpleasant"
# [6] "enjoyable" "engaging" "disinteresting"

You could use str_replace_all:
library(stringr)
pat <- c(fun = '1', meaningful = '2')
str_replace_all(Types, setNames(pat, sprintf('(?s).*%s.*', names(pat))))
[1] "1" "2" "relevant"
[4] "rewarding" "unpleasant" "enjoyable"
[7] "engaging" "disinteresting"

Try using str_replace(string, pattern, replacement) from the stringr package.


Row names disappear after as.matrix

I noticed that if the row names of a data frame are the sequence of numbers from 1 to the number of rows, they disappear after using as.matrix. But the row names are kept if they are not such a sequence.
Here is a reproducible example:
test <- as.data.frame(list(x=c(0.1, 0.1, 1), y=c(0.1, 0.2, 0.3)))
rownames(test)
# [1] "1" "2" "3"
rownames(as.matrix(test))
# NULL
rownames(as.matrix(test[c(1, 3), ]))
# [1] "1" "3"
Does anyone have an idea on what is going on?
Thanks a lot
You can pass rownames.force = TRUE to as.matrix (rownames = TRUE, as below, works through partial argument matching):
> as.matrix(test, rownames = TRUE)
x y
1 0.1 0.1
2 0.1 0.2
3 1.0 0.3
First and foremost, we always have a numerical index for sub-setting that won't disappear and that we should not confuse with row names.
as.matrix(test)[c(1, 3), ]
# x y
# [1,] 0.1 0.1
# [2,] 1.0 0.3
What happens when you call rownames() on a matrix is determined by its dimnames, as the source code of base::rownames() shows:
function (x, do.NULL = TRUE, prefix = "row")
{
dn <- dimnames(x)
if (!is.null(dn[[1L]]))
dn[[1L]]
else {
nr <- NROW(x)
if (do.NULL)
NULL
else if (nr > 0L)
paste0(prefix, seq_len(nr))
else character()
}
}
which yields NULL for dimnames(as.matrix(test))[[1]] but yields "1" "3" in the case of dimnames(as.matrix(test[c(1, 3), ]))[[1]].
Note, that the method base:::row.names.data.frame is applied in case of data frames, e.g. rownames(test).
That should explain what happens; fortunately you did not ask why, which would be rather opinion-based.
There is a difference between 'automatic' and non-'automatic' row names.
Here is a motivating example:
automatic
test <- as.data.frame(list(x = c(0.1,0.1,1), y = c(0.1,0.2,0.3)))
rownames(test)
# [1] "1" "2" "3"
rownames(as.matrix(test))
# NULL
non-'automatic'
test1 <- test
rownames(test1) <- as.character(1:3)
rownames(test1)
# [1] "1" "2" "3"
rownames(as.matrix(test1))
# [1] "1" "2" "3"
You can read about this in e.g. ?data.frame, which mentions the behavior you discovered at the end:
If row.names was supplied as NULL or no suitable component was found the row names are the integer sequence starting at one (and such row names are considered to be ‘automatic’, and not preserved by as.matrix).
When you call test[c(1, 3), ] then you create non-'automatic' rownames implicitly, which is kinda documented in ?Extract.data.frame:
If `[` returns a data frame it will have unique (and non-missing) row names.
(type `[.data.frame` into your console if you want to go deeper here.)
Others showed what this means for your case already, see the argument rownames.force in ?matrix:
rownames.force: ... The default, NA, uses NULL rownames if the data frame has ‘automatic’ row.names or for a zero-row data frame.
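To check which kind of row names a data frame has, one option is the base helper .row_names_info(), which with its default type returns a negative count for 'automatic' row names and a positive one otherwise. A minimal sketch; the object names sub13, m_full, m_sub and m_forced are only for illustration:

```r
test <- data.frame(x = c(0.1, 0.1, 1), y = c(0.1, 0.2, 0.3))
sub13 <- test[c(1, 3), ]             # subsetting makes the row names non-automatic

auto_full <- .row_names_info(test)   # negative: automatic row names
auto_sub  <- .row_names_info(sub13)  # positive: row names are stored explicitly

# by default as.matrix() keeps only non-automatic row names
m_full <- as.matrix(test)
m_sub  <- as.matrix(sub13)
is.null(rownames(m_full))
rownames(m_sub)

# rownames.force = TRUE keeps them in either case
m_forced <- as.matrix(test, rownames.force = TRUE)
rownames(m_forced)
```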
The difference dataframe vs. matrix:
?rownames
rownames(x, do.NULL = TRUE, prefix = "row")
The important part is do.NULL, whose default is TRUE. The documentation says:
If do.NULL is FALSE, a character vector (of length NROW(x) or NCOL(x)) is returned in any case,
If the replacement versions are called on a matrix without any existing dimnames, they will add suitable dimnames. But constructions such as
rownames(x)[3] <- "c"
may not work unless x already has dimnames, since this will create a length-3 value from the NULL value of rownames(x).
As I understand it (I may not be putting this precisely), rownames() on a matrix returns row names only if they were set beforehand; otherwise you get NULL, because do.NULL = TRUE is the default.
In your example you experience this kind of behaviour:
Here you declare row 1 and 3 and get 1 and 3
rownames(as.matrix(test[c(1, 3), ]))
[1] "1" "3"
Here you declare nothing and get NULL because NULL is the default.
rownames(as.matrix(test))
NULL
You can overcome this by declaring before:
rownames(test) <- 1:3
rownames(as.matrix(test))
[1] "1" "2" "3"
or you could do :
rownames(as.matrix(test), do.NULL = FALSE)
[1] "row1" "row2" "row3"
> rownames(as.matrix(test), do.NULL = FALSE, prefix="")
[1] "1" "2" "3"
Similar effect with rownames.force:
rownames.force
logical indicating if the resulting matrix should have character (rather than NULL) rownames. The default, NA, uses NULL rownames if the data frame has ‘automatic’ row.names or for a zero-row data frame.
dimnames(matrix_test)
I don't know exactly why it happens, but one way to fix it is to include the argument rownames.force = T, inside as.matrix
rownames(as.matrix(test, rownames.force = T))

How to check if filename ends with a certain string? (R)

I have the following code:
for (fileName in fileNames) {
index <- "0"
if (grepl("_01", fileName, fixed = TRUE)) {
index <- "01"
}
if (grepl("_02", fileName, fixed = TRUE)) {
index <- "02"
}
}
and so on.
My filename is like "31231_sad_01.csv" or "31231_happy_01.csv".
All of my filenames are stored in a character vector fileNames. I loop through each file.
How can I find the last part of the filename, i.e. 01 or 02 in this case?
I tried using the code I mentioned and it always returns 1 for every value.
Try the following:
library(stringr)
# suppose you have your file names in a character vector
fnames <- c("31231_sad_01.csv", "31231_happy_02.csv")
# extract every run of digits, then take the second run from each name
unlist(lapply(str_extract_all(fnames,"\\d+"),'[',2))
It would return a vector
[1] "01" "02"
Vectorized alternatives exist, there is no need for a loop.
To check if the last numeric part of filename ends with a specific number, here 01, we can first extract the numeric part, then run endsWith.
string <- c("31231_sad_01.csv", "bla_215.csv", "test_05.csv")
endsWith(stringr::str_extract(string, "[^_]*(?=\\.csv)"), "01")
#> [1] TRUE FALSE FALSE
An alternative way is to use sub to extract parts of the strings. Your examples show that the targeted index in each file name is always located after _ and before .csv. We can use this pattern in sub:
library(magrittr)
findex <- function(filename){
filename %>%
sub("\\.csv.*" , "", .) %>% # extract the part before ".csv"
sub(".*_" , "", .) # extract the part after the last "_"
}
This method works for an index of any length.
Test:
findex("31231_sad_01.csv")
#[1] "01"
findex("31231_happy_02.csv")
#[1] "02"
findex("31231_happy_213.csv")
#[1] "213"
findex("31231_happy_15213.csv")
#[1] "15213"
Then you can apply it to the vector that contains all the names with lapply or vapply (sub is itself vectorized, so findex(names) also works):
names <- c("31231_happy_1032.csv", "31231_happy_02.csv", "31231_happy_213.csv", "31231_happy_15213.csv")
lapply(names, findex)
#[[1]]
#[1] "1032"
#[[2]]
#[1] "02"
#[[3]]
#[1] "213"
#[[4]]
#[1] "15213"
vapply(names, findex, character(1))
#31231_happy_1032.csv 31231_happy_02.csv 31231_happy_213.csv
"1032" "02" "213"
#31231_happy_15213.csv
"15213"
In case you want to use only base R, this should work:
findex1 <- function(filename) sub(".*_" , "", sub("\\.csv.*" , "", filename))
vapply(names, findex1, character(1))
# 31231_happy_1032.csv 31231_happy_02.csv 31231_happy_213.csv
# "1032" "02" "213"
#31231_happy_15213.csv
# "15213"
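If the goal is literally to test what a file name ends with, base R may be enough. A minimal sketch, assuming every name follows the id_label_index.csv pattern from the question; stem, index and is_01 are illustrative variable names:

```r
fileNames <- c("31231_sad_01.csv", "31231_happy_02.csv")

# drop the extension, then keep whatever follows the last underscore
stem  <- tools::file_path_sans_ext(fileNames)
index <- sub(".*_", "", stem)

# endsWith() is vectorized, so the check itself needs no loop
is_01 <- endsWith(stem, "_01")
```

Because both steps operate on the whole vector, index lines up element by element with fileNames.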

Time conversion to number

I have a vector of time values. How can I remove the colons and convert the text to a numeric value, i.e. go from "10:01:02" (character) to 100102 (numeric)? All that I could find is presented below.
> x <- c("10:01:02", "11:01:02")
> strsplit(x, split = ":")
[[1]]
[1] "10" "01" "02"
[[2]]
[1] "11" "01" "02"
If you want to do everything in one line, you can use the destring() function from taRifx to remove everything that isn't a number and convert the result to numeric.
taRifx::destring(x)
This will also work if some of your data's formatted in a different way, such as "10-01-02", though you may have to set the value of keep.
destring("10-10-10", keep = "0-9")
And if you don't want to have to install the taRifx package you can define the destring() function locally.
destring <- function(x, keep = "0-9.-")
{
return(as.numeric(gsub(paste("[^", keep, "]+", sep = ""),
"", x)))
}
We can use gsub to replace : with "". After that, use as.numeric to do the conversion.
x <- as.numeric(gsub(":", "", x, fixed = TRUE))
Or we can use the regex suggested by Soto
x <- as.numeric(gsub('\\D+', '', x))
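One thing to watch for with any of these conversions is that a numeric value cannot keep a leading zero, so "09:05:00" becomes 90500. A small sketch with made-up input; the sprintf() step is only needed if the six-digit form has to be recovered as text, and num and padded are illustrative names:

```r
x <- c("10:01:02", "09:05:00")
num <- as.numeric(gsub(":", "", x, fixed = TRUE))
num
# [1] 100102  90500

# pad back to six digits when a fixed-width string is required
padded <- sprintf("%06d", as.integer(num))
padded
# [1] "100102" "090500"
```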
Once the colons have been removed (as.numeric returns NA for a string that still contains them), try with
x <- as.numeric(x)
and then to make sure
class(x)

How can I create a table of only the unique elements in a list so I can order the elements in terms of frequency?

I have tried running the code below, however it does not work as the arguments are not all of equal length.
sentence= "I like tea and I love coffee and biscuits"
words = function(x) {
txt = unlist(strsplit(x,' '))
wl = list()
for(i in seq_along(txt)) {
wrd = txt[i]
wl[[wrd]] = c(wl[[wrd]], i)
}
class(wl) <- "wordclass"
return(wl)
}
summary.wordclass <- function(y) {
cat("the frequency of words",names(sort(table(y), decreasing=TRUE)),"\n")
}
wordfreq=words(sentence)
summary(wordfreq)
I want to get an output like
[1] "I" "and" "like" "tea" "love" "coffee"
However, I am getting the error
Error in table(y) : all arguments must have the same length
If anyone could help that would be great!
would
names(sort(table(unlist(strsplit(sentence," "))),decreasing=T))
work for you ?
the output is
[1] "and" "I" "biscuits" "coffee" "like" "love" "tea"
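The error in the question arises because table() treats a list as a set of factors to cross-tabulate, one per element, and those must all have the same length. Since each element of the wordclass object already holds the positions of one word, the frequency is simply that element's length. A sketch of an alternative summary method under that assumption (the printed wording is kept from the question; freq, ordered and top are illustrative names):

```r
sentence <- "I like tea and I love coffee and biscuits"

words <- function(x) {
  txt <- unlist(strsplit(x, " "))
  wl <- list()
  for (i in seq_along(txt)) {
    wrd <- txt[i]
    wl[[wrd]] <- c(wl[[wrd]], i)   # positions at which each word occurs
  }
  class(wl) <- "wordclass"
  wl
}

summary.wordclass <- function(object, ...) {
  freq <- lengths(unclass(object))                 # occurrences per word
  ordered <- names(sort(freq, decreasing = TRUE))  # most frequent first
  cat("the frequency of words", ordered, "\n")
  invisible(ordered)
}

wordfreq <- words(sentence)
top <- summary(wordfreq)
```

Here "I" and "and" (two occurrences each) come before the words that occur once.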

R, readLines, strsplit and grep

I am trying to read a random text file one line at a time, then split each line into "words" and run some regex on each word, like finding all words that start with "w". With a code snippet like the following I get:
while (length(oneLine <- readLines(infile, n = 1, warn = FALSE)) > 0) {
myVector <- (strsplit(oneLine, " ", fixed = FALSE, perl = TRUE))
res <- grep("^w", myVector, perl = TRUE, value = TRUE)
...
> myVector
[[1]]
[1] "u" "rtu" "jgiyu" "t6riuri-4e5-" "ee4" "59"
[7] "43"
My question is, what is the correct syntax to access "u", "rtu", ... ?
> myVector[1]
[[1]]
[1] "u" "rtu" "jgiyu" "t6riuri-4e5-" "ee4" "59"
[7] "43"
Doesn't work. What will? What's up with the [[1]]? I was under the impression that vectors are one-dimensional and their elements are accessed like myVector[1], myVector[2], etc.
Thanks for the help.
strsplit returns a list. In this case, it is a list of length 1, but if you used readLines on the whole file, then called strsplit, it would return a list of the same length as the number of lines.
For the way you're using it, you need to select the first element of the first component of the list. i.e. myVector[[1]][1] for "u" and myVector[[1]][2] for "rtu". Also, in this case, unlist(myVector)[1] and unlist(myVector)[2] would work.
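A short sketch of the same idea with a made-up line of text, showing the difference between [ and [[ on the list that strsplit returns (wordsInLine, first and res are illustrative names):

```r
oneLine <- "u rtu jgiyu wax 59 web"
myVector <- strsplit(oneLine, " ", fixed = TRUE)

class(myVector)                # "list": one element per input string
wordsInLine <- myVector[[1]]   # [[ extracts the character vector itself
first <- wordsInLine[1]        # "u"

# run grep on the character vector, not on the list
res <- grep("^w", wordsInLine, value = TRUE)
```

myVector[1], by contrast, is still a list of length 1, which is why it prints with the [[1]] header.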
