replacing a pattern in a column value with a column value - r

After an hour or so of skimming through Stack Overflow and trying out different things, I've decided to post another question.
I made a data frame [picture 1] in which I basically inserted a vector, with the same length as the df, containing a URL used to access an API's data.
In that URL I put the text "FLAVOURS" as a placeholder pattern for gsub, to be replaced later with the column value holding the flavors.
What I ended up with was a df with 2 columns: one with the URL used for the API and one with all the flavors. What I want to do now is insert the flavors [column 2] into the URL so it becomes, e.g.:
"http://strainapi.evanbusse.com/ZlWfxSa/searchdata/flavors/Earthy"
So what I would like is for the pattern "FLAVOURS" to be replaced by the column 2 data in a row-wise fashion.
I've tried using gsub on its own, or in combination with rowwise() but I've been getting errors out of them, or they do something I didn't expect at all.
*I'm still new to both R and making stackoverflow posts so please do give me pointers if I did something wrong.

In base R, you can use any of the apply family of functions to do this. If your data frame is called df and the first two columns are a and b, you can do:
df$a <- mapply(function(x, y) sub('FLAVOURS', y, x), df$a, df$b)
However, stringr has a vectorised function str_replace.
df$a <- stringr::str_replace(df$a, 'FLAVOURS', df$b)
Another base R option would be to treat column a as file paths, use dirname to extract the path up to the last '/', and paste it together with column b.
df$a <- paste(dirname(df$a), df$b, sep = '/')
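As a minimal sketch of the two base R approaches above, the snippet below builds a small data frame and checks that both produce the same URLs. The flavor values "Sweet" and "Citrus" and the variable names via_mapply and via_paste are made up for illustration; only the "Earthy" URL comes from the question. Passing fixed = TRUE is an optional addition that makes sub treat the placeholder as a literal string rather than a regular expression.

```r
# Hypothetical data: column a holds the URL template, column b the flavors.
df <- data.frame(
  a = rep("http://strainapi.evanbusse.com/ZlWfxSa/searchdata/flavors/FLAVOURS", 3),
  b = c("Earthy", "Sweet", "Citrus"),
  stringsAsFactors = FALSE
)

# Row-wise replacement of the placeholder with the matching flavor.
# unname() drops the names that mapply copies over from df$a.
via_mapply <- unname(
  mapply(function(x, y) sub("FLAVOURS", y, x, fixed = TRUE), df$a, df$b)
)

# Rebuild each URL from its directory part plus the flavor.
via_paste <- paste(dirname(df$a), df$b, sep = "/")

print(via_mapply)
print(identical(via_mapply, via_paste))
```

The dirname approach only works when the placeholder is the last path component, whereas the sub approach works wherever the placeholder sits in the string.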

Related

Extracting information between special characters in a column in R

I'm sorry because I feel like versions of this question have been asked many times, but I simply cannot find code from other examples that works in this case. I have a column where all the information I want is stored between two sets of "%%", and I want to extract the information between those two delimiters and put it into a new column, in this case called df$empty.
This is a long column, but in all cases I just want the information between the two "%%" delimiters. Is there a way to code this out across the whole column?
To be specific, I want in this example a new column that will look like "information", "wanted".
empty <- c('NA', 'NA')
information <- c('notimportant%%information%%morenotimportant', 'ignorethis%%wanted%%notthiseither')
df <- data.frame(information, empty)
In this case you can do:
df$empty <- sapply(strsplit(df$information, '%%'), '[', 2)
# information empty
# 1 notimportant%%information%%morenotimportant information
# 2 ignorethis%%wanted%%notthiseither wanted
That is, split the text by '%%' and take second elements of the resulting vectors.
Or you can get the same result using sub():
df$empty <- sub('.*%%(.+)%%.*', '\\1', df$information)
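The two approaches agree on the example data but behave differently when a value has no "%%" delimiters at all. The sketch below illustrates this with a third, made-up element ("nodelimiters"); the variable names by_split and by_sub are also just for illustration.

```r
x <- c(
  "notimportant%%information%%morenotimportant",
  "ignorethis%%wanted%%notthiseither",
  "nodelimiters"  # hypothetical value without any '%%'
)

# Split on the literal delimiter and take the second piece.
# A value with no delimiter has only one piece, so the result is NA.
by_split <- sapply(strsplit(x, "%%", fixed = TRUE), `[`, 2)

# Capture the text between the delimiters.
# When the pattern does not match, sub() returns the input unchanged.
by_sub <- sub(".*%%(.+)%%.*", "\\1", x)

print(by_split)
print(by_sub)
```

If missing delimiters are possible in the real column, the strsplit version may be the safer choice, because an NA is easier to spot than an unchanged string.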

Best way to extract a single letter from each row and create a new column in R?

Below is an excerpt of the data I'm working with. I am having trouble finding a way to extract the last letter from the sbp.id column and using the results to add a new column to the below data frame called "sex". I initially tried grepl to separate the rows ending in F and the ones ending in M, but couldn't figure out how to use that to create a new column with just M or F, depending on which one is the last letter of each row in the sbp.id column
sbp.id newID
125F 125
13000M 13000
13120M 13120
13260M 13260
13480M 13480
Another way, if you know you need the last character, irrespective of whether the other characters are letters or digits, or whether the elements all have different lengths:
df$sex <- substr(df$sbp.id, nchar(df$sbp.id), nchar(df$sbp.id))
This works because substr() and nchar() are both vectorized.
Using regex you can extract the last part from sbp.id
df$sex <- sub('.*([A-Z])$', '\\1', df$sbp.id)
#Also
#df$sex <- sub('.*([MF])$', '\\1', df$sbp.id)
Or another way would be to remove all the numbers.
df$sex <- sub('\\d+', '', df$sbp.id)
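For a quick comparison, the sketch below runs the three approaches on the sample IDs from the question plus one made-up ID ("A12F") that contains a leading letter. The variable names are illustrative only.

```r
# Sample IDs from the question plus one hypothetical ID with a leading letter.
ids <- c("125F", "13000M", "13120M", "13260M", "13480M", "A12F")

# Last character via substr(): start and stop are both the string length.
n <- nchar(ids)
by_substr <- substr(ids, n, n)

# Last capital letter via a capture group anchored at the end of the string.
by_regex <- sub(".*([A-Z])$", "\\1", ids)

# Removing the first run of digits leaves every remaining character behind.
by_strip <- sub("\\d+", "", ids)

print(by_substr)
print(by_regex)
print(by_strip)
```

The first two agree on every element, while stripping digits returns "AF" for the made-up ID, so that variant is only safe when the ID is digits followed by a single letter.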

Find whether a row in data table contains at least one word from the list

I am quite new to R and data tables, so probably my question will sound obvious, but I searched through questions here for similar issues and couldn't find a solution anyway.
So, initially, I have a data table, and one of the columns contains fields that hold many values (in fact these values are all separate words) joined together by &&&&. I also have a list of words (list). This list is big and has 38 000 different words, but for the purpose of the example let's say that it is small.
list <- c('word1', 'word2', 'word3')
What I need is to filter the data table so that I only have rows that contain at least one word from the list of words.
I split the data on &&&& and created a list
fields_with_words <- strsplit(data_final$fields_with_words, "&&&&")
But I don't know which function should I use to check whether the row from my data table has at least one word from the list. Can you give me some clues?
Try :
data_final[sapply(strsplit(data_final$fields_with_words,"&&&&"), function(x)
any(x %in% word_list)), ]
I have used word_list instead of list here since list is a built-in function in R.
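A small self-contained sketch of this filter in base R follows. The data frame contents and the id column are invented for illustration; vapply is used in place of sapply only to guarantee a logical vector, and fixed = TRUE makes strsplit treat "&&&&" as a literal delimiter.

```r
# Hypothetical data: each field holds words joined by '&&&&'.
data_final <- data.frame(
  id = 1:3,
  fields_with_words = c("word1&&&&other", "foo&&&&bar", "baz&&&&word3&&&&qux"),
  stringsAsFactors = FALSE
)
word_list <- c("word1", "word2", "word3")

# Split every field into its words.
pieces <- strsplit(data_final$fields_with_words, "&&&&", fixed = TRUE)

# TRUE for rows where at least one word appears in word_list.
keep <- vapply(pieces, function(x) any(x %in% word_list), logical(1))

result <- data_final[keep, ]
print(result)
```

Here rows 1 and 3 are kept, because they contain "word1" and "word3" respectively.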
Assuming you want to scan x variable in df with the list of words lw <- c("word1","word2","word3") (character vector of words), you can use
df[grepl(paste0("(",paste(lw, collapse = "|"), ")"), x)]
if you want a regular expression. Note that this also matches when one of your words appears inside a longer sentence. However, with 38k words, I don't know whether this solution scales.
If your x column contains only words and you want exact matching, the problem is simpler. You can do:
df[x %chin% lw]
%chin% is a data.table special %in% operator for character vectors (%in% can also be used but it will not be as performant). You can have better performance there if you use merge by transforming lw into a data.table:
merge(df, data.table(x = lw), by = "x")

How to replace all values in my dataframe containing the string "x" with NAs?

I saved some Excel files as CSVs. The problem is that in the Excel sheets, rather than leaving cells blank where no value exists, someone entered "x" or "xx" up to "xxxxx", and I need to convert all of those into NAs. What should I do? Some of the more complicated solutions that I find online in R don't make sense to me (like apply + function + grepl). I can understand things like grepl individually, but I cannot seem to find something that works.
I have tried
df <- replace(df, df == grepl(df, "x"), NA) %>%
write_csv("df.csv")
However I get an
error: my df (pattern in grepl) has length >1
and only the first element will be used (I'm assuming the first column).
I've also done things individually by column, but I'm looking for something that scales.
Thanks!
df[sapply(df, grepl, pattern = 'x')] <- NA
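Note that the pattern 'x' matches any value that contains an x anywhere, so a legitimate value such as "box" would also become NA. A hedged variant, assuming the placeholders consist only of one or more x characters, anchors the pattern. The data below is invented for illustration.

```r
# Hypothetical data frame with 'x' placeholders and one legitimate value containing an x.
df <- data.frame(
  a = c("1", "x", "box"),
  b = c("xx", "2", "xxxxx"),
  stringsAsFactors = FALSE
)

# Logical matrix: TRUE where a cell consists solely of one or more 'x' characters.
is_placeholder <- sapply(df, grepl, pattern = "^x+$")

# Assign NA to exactly those cells.
df[is_placeholder] <- NA
print(df)
```

With the anchored pattern, "box" is left untouched while "x", "xx", and "xxxxx" are replaced.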

Using grepl() in a particular type of pattern matching

I'm not sure how to do this, I have a feeling that I can use grepl() with this but I am not sure how.
I have a column in my dataset where I have names like "Abbot", "Baron", "William", and hundreds of other names, and many blanks/missing-values.
I want to extract it in such a way where the first letter is extracted and put in a new column that only contains the letter, and if its missing a value then fill in with unknown.
Below I use a quick sapply statement and strsplit to grab the first letter. There is likely a better way to do this, but here's one solution. :)
test <- c('Abbot', 'Baron', 'William')
firstLetter <- sapply(test, function(x){unlist(strsplit(x,''))[1]})
What do you mean by "and if its missing a value then fill in with unknown"?
The following code using substr should be very fast with a large number of rows. It always returns the first letter and returns NA if the respective value in test$name is NA.
test <- data.frame(name = c('Abbot', 'Baron', 'William', NA))
test$first.letter <- substr(test$name, 1, 1)
If you want to convert all NA in test$first.letter to 'unknown', you can do this afterwards:
test$first.letter <- ifelse(is.na(test$first.letter), "unknown", test$first.letter)
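The question also mentions blanks. Assuming those are stored as empty strings rather than NA (an assumption, since the real data is not shown), substr returns "" for them, so the check can be extended as in this sketch. The intermediate variable first is introduced only for illustration.

```r
# Hypothetical data with both a missing value (NA) and a blank ("").
test <- data.frame(
  name = c("Abbot", "Baron", "William", NA, ""),
  stringsAsFactors = FALSE
)

# First character of each name; NA stays NA and "" stays "".
first <- substr(test$name, 1, 1)

# Treat both NA and empty strings as unknown.
test$first.letter <- ifelse(is.na(first) | first == "", "unknown", first)
print(test)
```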
