I would like to add a number indicating the x-th occurrence of a word in a vector. (So this question is different from Make a column with duplicated values unique in a dataframe, because I have a simple vector and want to avoid the overhead of casting it to a data.frame.)
E.g. for the vector:
book, ship, umbrella, book, ship, ship
the output would be:
book, ship, umbrella, book2, ship2, ship3
I have solved this myself by converting the vector to a data.frame and then using a grouping function. That feels like using a sledgehammer to crack a nut:
# add consecutive numbers for equal strings
library(dplyr)
words <- c("book", "ship", "umbrella", "book", "ship", "ship")
# convert word vector to data.frame for grouping
df <- data.frame(words = words)
df <- df %>% group_by(words) %>% mutate(seqN = row_number())
# combine columns and remove the trailing '1' for first occurrences
wordsVec <- paste0(df$words, df$seqN)
sub("1$", "", wordsVec)
# [1] "book" "ship" "umbrella" "book2" "ship2" "ship3"
Is there a more clean solution, e.g. using the stringr package?
You can still utilize row_number() from dplyr, but you don't need to convert to a data frame, i.e.
sub('1$', '', ave(words, words, FUN = function(i) paste0(i, row_number(i))))
#[1] "book" "ship" "umbrella" "book2" "ship2" "ship3"
Another option is to use make.unique along with gsubfn to increment your values by 1, i.e.
library(gsubfn)
gsubfn("\\d+", function(x) as.numeric(x) + 1, make.unique(words))
#[1] "book" "ship" "umbrella" "book.2" "ship.2" "ship.3"
I have a list of IP address pairs separated by "::".
ip_pairs <- c("104.124.199.136::192.168.1.67", "104.124.199.136::192.168.137.174", "192.168.1.67::104.124.199.136", "192.168.137.174::104.124.199.136")
As you can see, the third and fourth elements of the vector are the same as the first two, but reversed (my actual problem is to find all unique pairings of IPs, so the solution would drop the pair B::A if A::B is already present). This could be solved using stringr or regex, I'm guessing.
One option:
library(stringr)
split_function = function(x) {
  x = sort(x)
  paste(x, collapse = "::")
}
pairs = str_split(ip_pairs, "::")
unique(sapply(pairs, split_function))
[1] "104.124.199.136::192.168.1.67" "104.124.199.136::192.168.137.174"
Use read.table to create a two column data frame from the pairs, sort each row and find the duplicates using duplicated. Then extract out the non-duplicates. No packages are used.
DF <- read.table(text = ip_pairs, sep = ":")[-2]
ip_pairs[! duplicated(t(apply(DF, 1, sort)))]
## [1] "192.168.1.67::104.124.199.136" "192.168.137.174::104.124.199.136"
Example:
df <- data.frame(Name = c("J*120_234_458_28", "Z*23_205_a834_306", "H*_39_004_204_99_04902"))
I would like to be able to select everything before the third underscore for each row in the dataframe. I understand how to split the string apart:
df$New <- sapply(strsplit((df$Name),"_"), `[`)
But this places a list in each row. I've thus far been unable to figure out how to use sapply to unlist() each row of df$New and select the first N elements of the list to paste/collapse them back together. Because the length of each subelement can be distinct, and the number of subelements can also be distinct, I haven't been able to figure out an alternative way of getting this info.
We specify 'n', and after splitting the character column by '_', extract the first n-1 components:
n <- 4
lapply(strsplit(as.character(df$Name), "_"), `[`, seq_len(n - 1))
If we need to paste it back together, we can use an anonymous function (function(x)) while looping over the list with lapply/sapply, get the first n-1 elements with head and paste them together:
sapply(strsplit(as.character(df$Name), "_"), function(x)
paste(head(x, n - 1), collapse="_"))
#[1] "J*120_234_458" "Z*23_205_a834" "H*_39_004"
Or use a regex method:
sub("^([^_]+_[^_]+_[^_]+)_.*", "\\1", df$Name)
#[1] "J*120_234_458" "Z*23_205_a834" "H*_39_004"
Or if the 'n' is really large, then
pat <- sprintf("^(([^_]+_){%d}[^_]+)_.*", n - 2)
sub(pat, "\\1", df$Name)
Or
sub("^(([^_]+_){2}[^_]+)_.*", "\\1", df$Name)
#[1] "J*120_234_458" "Z*23_205_a834" "H*_39_004"
I'm trying to do a string search and replace across multiple columns in R. My code:
# Get columns of interest
selected_columns <- c(368,370,372,374,376,378,380,382,384,386,388,390,392,394)
#Perform grepl across multiple columns
df[,selected_columns][grepl('apples',df[,selected_columns],ignore.case = TRUE)] <- 'category1'
However, I'm getting the error:
Error: undefined columns selected
Thanks in advance.
grep/grepl works on vectors/matrices and not on data.frames/lists. According to ?grep:
x - a character vector where matches are sought, or an object which can be coerced by as.character to a character vector.
We can loop over the columns (lapply) and replace the values based on the match
df[, selected_columns] <- lapply(df[, selected_columns],
function(x) replace(x, grepl('apples', x, ignore.case = TRUE), 'category1'))
Or with dplyr
library(dplyr)
library(stringr)
df %>%
mutate_at(selected_columns, ~ replace(., str_detect(., 'apples'), 'category1'))
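In more recent dplyr versions (1.0+), mutate_at() is superseded by across(); a roughly equivalent sketch, which also keeps the match case-insensitive as in the original grepl() call, might look like this (assuming selected_columns holds column positions):
df %>%
  mutate(across(all_of(selected_columns),
                ~ replace(.x, str_detect(.x, regex('apples', ignore_case = TRUE)), 'category1')))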
Assuming you want to partially match a cell's contents and replace only that part, you could use rapply() and replace the "apples" portion of each cell with "category1" using gsub():
df[selected_columns] <- rapply(df[selected_columns], function(x) gsub("apples", "category1", x), how = "replace")
Just keep in mind the difference between grepl()/gsub() (with and without boundaries in your regex), and %in%/match() when searching for strings.
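To illustrate that last point, grepl()/gsub() match substrings (unless you add anchors or word boundaries), while %in%/match() only compare whole strings:
grepl("apples", "green apples")        # TRUE: partial match on a substring
grepl("\\bapples\\b", "pineapples")    # FALSE: word boundaries block the partial hit
"green apples" %in% "apples"           # FALSE: %in% compares whole strings only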
I have a dataframe that is a list of meeting transcripts converted from PDF using pdftools with a series of unnested words that look like this:
document_id <- c("BOARD19810203meeting.pdf", "BOARD19810405meeting.pdf", "BOARD19810609meeting.pdf", "BOARD19810405meeting.pdf", "BOARD19810609meeting.pdf")
word <- c("leave", "tomorrow", "for", "first", meeting")
df <- data.frame(document_id, word)
I want to write code that counts the number of times a word appears, but only if it is followed by a specific other word, aggregated by the date it appears on. Using the example above, I would like to count how many times 'leave tomorrow' appears (i.e. count 'leave' if it is followed by 'tomorrow'). So the final output would look like this:
date <- c("1981-02-03", "1982-08-09", "1991-04-04", "1991-07-04")
word <- c("leave", "leave", "leave", "leave")
df <- data.frame(date, word)
I have written the following code to aggregate one of the terms:
leave_in_transcripts <- select(interview_transcripts, 1:3) %>% filter(grepl("leave", word, ignore.case=TRUE) | grepl("tomorrow", word, ignore.case=TRUE))
leave_in_transcripts$word <- str_count(leave_in_transcripts$word, 'leave')
count_leave <- aggregate(leave_in_transcripts['word'], by = list(Group.date = leave_in_transcripts$date), sum, na.rm=T)
But obviously this just counts 'leave' even when it is not followed by 'tomorrow'.
I have been searching for a while and I can't quite figure out what to do. Any ideas?
Thanks in advance for your help!
We can count the number of instances of 'leave' followed by 'tomorrow' by creating a logical expression comparing the current row with the next row (lead) and summing the logical vector:
library(dplyr)
library(stringr)
df %>%
summarise(Sum = sum(str_detect(word, 'leave') &
str_detect(lead(word), 'tomorrow'), na.rm = TRUE))
Thanks #akrun for answering this.
For anyone else reading this, I also wrote code, based on akrun's answer, to aggregate by date the instances where the words appear:
leave_in_transcripts <- df %>% mutate(match = str_detect(word, 'leave') & str_detect(lead(word), 'tomorrow'))
leave_in_transcripts <- select(leave_in_transcripts, 1:4) %>% filter(match == "TRUE")
leave_in_transcripts$match <- str_count(leave_in_transcripts$match, 'TRUE')
count_leave <- aggregate(leave_in_transcripts['match'], by = list(Group.date = leave_in_transcripts$date), sum, na.rm=T)
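A slightly more compact dplyr-only version of the same date aggregation might look like this (a sketch, assuming the date can be parsed out of document_id as in the base R answer below):
df %>%
  mutate(date = as.Date(document_id, "BOARD%Y%m%dmeeting.pdf"),
         match = str_detect(word, 'leave') & str_detect(lead(word), 'tomorrow')) %>%
  filter(match) %>%
  count(date)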
In base R, we can use head and tail to compare values in the current and next rows. We can subset the rows which match the condition and use as.Date to convert document_id to a Date object by supplying the appropriate format. Also, since you want to test for an exact match and not a partial match, it is better to use == and not grepl.
transform(subset(df, c(head(word, -1) == "leave" &
tail(word, -1) == "tomorrow", FALSE)),
date = as.Date(document_id,"BOARD%Y%m%dmeeting.pdf"))
# document_id word date
#1 BOARD19810203meeting.pdf leave 1981-02-03
If you just want to count number of times the above condition is satisfied, we can use sum.
with(df, sum(head(word, -1) == "leave" & tail(word, -1) == "tomorrow"))
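If the matches should also be aggregated per meeting date, as asked, the subset above can be fed to aggregate (a small sketch building on the same code):
matches <- transform(subset(df, c(head(word, -1) == "leave" &
                                  tail(word, -1) == "tomorrow", FALSE)),
                     date = as.Date(document_id, "BOARD%Y%m%dmeeting.pdf"))
aggregate(word ~ date, matches, length)
#         date word
# 1 1981-02-03    1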
I have a dataframe that looks something like this:
df <- ("test1/a/x/w/e/a/adfsadfsfads
test2/w/s/f/x/a/saffakwfkwlwe
test3/a/e/c/o/a/saljsfadswwoe")
The structure is always like testX/0/0/0/0/a/randomstuff, where 0 is a random letter. Now I want to change the letter "a" after the 4 random letters to a "z" in every row.
I tried a regex, but it didn't work: when I chose "/a/" as the pattern and "/z/" as the replacement, it also replaced the "a"s near the beginning of the test1 and test3 rows.
So what I need is a function that replaces only the last pattern that is observed in each line. Is there anything that can do this?
I believe this is what you are looking for:
data <- c(
"test1/a/x/w/e/a/adfsadfsfads",
"test2/w/s/f/x/a/saffakwfkwlwe",
"test3/a/e/c/o/a/saljsfadswwoe"
)
gsub("a/([a-z]+)$", "z/\\1", data)
[1] "test1/a/x/w/e/z/adfsadfsfads" "test2/w/s/f/x/z/saffakwfkwlwe"
[3] "test3/a/e/c/o/z/saljsfadswwoe"
And if you don't like regex you could use strsplit().
library(magrittr)
data %>%
strsplit("/") %>%
lapply(function(x) {x[6] <- "z"; x}) %>%
sapply(paste, collapse = "/")
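Note that x[6] <- "z" relies on the fixed testX/0/0/0/0/a/randomstuff layout; if instead you always want the field just before the random part, whatever the number of fields, indexing from the end is a small variation (sketch):
data %>%
  strsplit("/") %>%
  lapply(function(x) {x[length(x) - 1] <- "z"; x}) %>%
  sapply(paste, collapse = "/")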