I have a dataframe of meeting transcripts converted from PDF using pdftools, unnested to one word per row, that looks like this:
document_id <- c("BOARD19810203meeting.pdf", "BOARD19810405meeting.pdf", "BOARD19810609meeting.pdf", "BOARD19810405meeting.pdf", "BOARD19810609meeting.pdf")
word <- c("leave", "tomorrow", "for", "first", "meeting")
df <- data.frame(document_id, word)
I want to write code that counts, by the date it appears on, the number of times a word appears only if it is followed by another specific word. Using the example above, I would like to count how many times 'leave tomorrow' appears (i.e. count 'leave' only when it is followed by 'tomorrow'). So the final output would look like this:
date <- c("1981-02-03", "1982-08-09", "1991-04-04", "1991-07-04")
word <- c("leave", "leave", "leave", "leave")
df <- data.frame(date, word)
I have written the following code to aggregate one of the terms:
leave_in_transcripts <- select(interview_transcripts, 1:3) %>% filter(grepl("leave", word, ignore.case = TRUE) | grepl("tomorrow", word, ignore.case = TRUE))
leave_in_transcripts$word <- str_count(leave_in_transcripts$word, 'leave')
count_leave <- aggregate(leave_in_transcripts['word'], by = list(Group.date = leave_in_transcripts$date), sum, na.rm=T)
But obviously this just counts 'leave' regardless of whether it is followed by 'tomorrow'.
I have been searching for a while and I can't quite figure out what to do. Any ideas?
Thanks in advance for your help!
We can count the number of instances of 'leave' followed by 'tomorrow' by building a logical expression from the current row and the next row (lead) and summing the logical vector:
library(dplyr)
library(stringr)
df %>%
  summarise(Sum = sum(str_detect(word, 'leave') &
                        str_detect(lead(word), 'tomorrow'), na.rm = TRUE))
Thanks @akrun for answering this.
For anyone else reading this, I also wrote code to aggregate by date the instances that words appear based on Akrun's code:
leave_in_transcripts <- df %>% mutate(match = str_detect(word, 'leave') & str_detect(lead(word), 'tomorrow'))
leave_in_transcripts <- select(leave_in_transcripts, 1:4) %>% filter(match)
leave_in_transcripts$match <- as.integer(leave_in_transcripts$match)
count_leave <- aggregate(leave_in_transcripts['match'], by = list(Group.date = leave_in_transcripts$date), sum, na.rm = TRUE)
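For completeness, here is a sketch of a more compact dplyr-only version of the same aggregation (the date parsing from document_id follows the base R answer below, and the group_by(document_id) is my addition so a bigram never spans two transcripts):

library(dplyr)
library(stringr)

count_leave <- df %>%
  group_by(document_id) %>%  # don't pair a document's last word with the next document's first
  mutate(match = str_detect(word, 'leave') & str_detect(lead(word), 'tomorrow')) %>%
  ungroup() %>%
  filter(match) %>%
  mutate(date = as.Date(document_id, "BOARD%Y%m%dmeeting.pdf")) %>%
  count(date, name = "n_leave")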
In base R, we can use head and tail to line up the current and next rows. We can subset the rows which match the condition and use as.Date to convert the document_id to a date object by giving the appropriate format. Also, since you want an exact match and not a partial match, it is better to use == rather than grepl.
transform(subset(df, c(head(word, -1) == "leave" &
tail(word, -1) == "tomorrow", FALSE)),
date = as.Date(document_id,"BOARD%Y%m%dmeeting.pdf"))
# document_id word date
#1 BOARD19810203meeting.pdf leave 1981-02-03
If you just want to count number of times the above condition is satisfied, we can use sum.
with(df, sum(head(word, -1) == "leave" & tail(word, -1) == "tomorrow"))
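And to aggregate those counts by date in base R, the same padded logical can be fed to aggregate, parsing the date out of document_id as above (a sketch):

dates <- as.Date(df$document_id, "BOARD%Y%m%dmeeting.pdf")
match <- c(head(df$word, -1) == "leave" & tail(df$word, -1) == "tomorrow", FALSE)
aggregate(data.frame(n_leave = match), by = list(date = dates), FUN = sum)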
I am trying to add a new column, currency, to df "myfile".
The contents of that column are conditional: if the year column fulfills a condition, the new column gets one value, else another.
When I tried if/else without a loop, it complained that the condition has length > 1, so I guessed if/else couldn't work on a vector with multiple elements and that I could use a for loop instead, but then this error showed up:
myfile$currency <- myfile %>% for (i in year) {if(year>2000){print("Latest")}else{"Oldest"}}
Error in for (. in i) year : 4 arguments passed to 'for' which requires 3
You can use ifelse in mutate. See the dplyr documentation.
library(dplyr)
myfile <- myfile %>%
  mutate(
    currency = ifelse(year > 2000, "latest", "oldest")
  )
If you have more conditions, see case_when.
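For instance, a minimal case_when sketch (the extra "recent" band and its 2010 cutoff are invented purely for illustration):

library(dplyr)

myfile <- myfile %>%
  mutate(
    currency = case_when(
      year > 2010 ~ "latest",   # checked first
      year > 2000 ~ "recent",   # hypothetical middle band
      TRUE        ~ "oldest"    # fallback for everything else
    )
  )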
Or you can do something like this:
myfile$currency[myfile$year > 2000] <- "latest"
myfile$currency[myfile$year <= 2000] <- "oldest"
I would like to add a number indicating the x-th occurrence of a word in a vector. (So this question is different from 'Make a column with duplicated values unique in a dataframe', because I have a simple vector and am trying to avoid the overhead of casting it to a data.frame.)
E.g. for the vector:
book, ship, umbrella, book, ship, ship
the output would be:
book, ship, umbrella, book2, ship2, ship3
I have solved this myself by converting the vector to a data.frame and using a grouping function, which feels like using a sledgehammer to crack a nut:
# add a consecutive number for each repeated string
library(dplyr)

words <- c("book", "ship", "umbrella", "book", "ship", "ship")
# transpose word vector to data.frame for grouping
df <- data.frame(words = words)
df <- df %>% group_by(words) %>% mutate(seqN = row_number())
# combine columns and strip the trailing '1' from first occurrences
wordsVec <- paste0(df$words, df$seqN)
sub("1$", "", wordsVec)
# [1] "book" "ship" "umbrella" "book2" "ship2" "ship3"
Is there a more clean solution, e.g. using the stringr package?
You can still utilize row_number() from dplyr, but you don't need to convert to a data frame, i.e.
library(dplyr)
sub('1$', '', ave(words, words, FUN = function(i) paste0(i, row_number(i))))
#[1] "book" "ship" "umbrella" "book2" "ship2" "ship3"
Another option is to use make.unique along with gsubfn to increment your values by 1, i.e.
library(gsubfn)
gsubfn("\\d+", function(x) as.numeric(x) + 1, make.unique(words))
#[1] "book" "ship" "umbrella" "book.2" "ship.2" "ship.3"
Say I have a directory with four files:
someText.abcd.xyz.10Sep16.csv
someText.xyz.10Sep16.csv
someText.abcd.xyz.23Oct16.csv
someText.xyz.23Oct16.csv
This is how the names are formatted. I cannot change them, and the format will remain the same except that the dates will change. All of the names begin with someText. Next, there is either a four-letter code (abcd) or a three-letter code (xyz). If the file name has a four-letter code, it will always have a three-letter code after it. Finally there is a date value.
I have two tasks. First, I need to filter out the files that have the "abcd" component. This will always be a four-character code that appears after the someText. in the name. Is there a way to write a regular expression to remove these files?
That leaves two files:
someText.xyz.10Sep16.csv
someText.xyz.23Oct16.csv
I need only the file with the later date. Is there a second regex I could use to extract the dates, find the latest, and keep only that file? This is what I'm doing to get the file set down to the four above:
myDir <- "\\\\myDir\\folder\\"
files <- list.files(path = myDir, pattern = "\\.csv$")
Here's a vector with the file names if someone wants to try it out:
files <- c("someText.abcd.xyz.10Sep16.csv", "someText.xyz.10Sep16.csv", "someText.abcd.xyz.23Oct16.csv", "someText.xyz.23Oct16.csv")
Here's my attempt at a simple base R answer
# regex subset
files <- files[!grepl("^.*?\\.[[:alpha:]]{4}\\.", files)]
# get date
dates <- unlist(lapply(strsplit(files, "\\."), "[[", 3))
files[which.max(as.Date(dates, format = "%d%b%y"))]
# [1] "someText.xyz.23Oct16.csv"
I think this should be robust enough to work reliably. I used dplyr to pipe the results through and manipulate them, and lubridate for convenient date parsing (dmy). Almost forgot: you also need to load magrittr to get the %$% pipe.
I split the file names on the dots, then shift the pieces over when a name is missing the four-letter-code section, and bind everything into a data.frame for easy filtering. From there, I filter for the names without the four-letter code and select the one with the latest date.
library(dplyr)
library(lubridate)
library(magrittr)

strsplit(files, "\\.") %>%
  setNames(files) %>%
  lapply(function(x){
    if (length(x) == 4) {
      x[3:5] <- x[2:4]
      x[2] <- "noCode"
    }
    rbind(x) %>%
      as.data.frame()
  }) %>%
  bind_rows(.id = "fileName") %>%
  mutate(date = dmy(V4)) %>%
  filter(V2 == "noCode") %$%
  c(fileName[which.max(date)])
returns: "someText.xyz.23Oct16.csv"
I am sure that this can be made more compact, but here is a base R answer:
# file names
file_names =c(
"someText.abcd.xyz.10Sep16.csv",
"someText.xyz.10Sep16.csv",
"someText.abcd.xyz.23Oct16.csv",
"someText.xyz.23Oct16.csv"
)
# the pattern to be tested
reg_file_names = regexec(
  pattern = "^someText\\.[a-z]{3}\\.(.*)\\.csv$",
  file_names
)
# parse out the matched dates, and look for the maximum
file_names[
  which.max(
    sapply(
      regmatches(x = file_names, m = reg_file_names),
      function(match) {
        as.Date(
          ifelse(length(match) == 0, NA, match[2]),
          format = "%d%b%y"
        )
      }
    )
  )
]
The regular expression that you need is fairly straightforward, and the rest of the code is just to handle the cases where there is no match, and to format the dates so that they can be compared.
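For instance, here is a sketch of a more compact version under the same assumptions about the name format:

# keep only names whose second dot-separated field is a three-letter code
keep <- grep("^someText\\.[a-z]{3}\\.", file_names, value = TRUE)
# pull out the date field and parse it
dates <- as.Date(sub("^someText\\.[a-z]{3}\\.(.*)\\.csv$", "\\1", keep), format = "%d%b%y")
keep[which.max(dates)]
# [1] "someText.xyz.23Oct16.csv"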
I am working on a large dataset, with some rows with NAs and others with blanks:
df <- data.frame(ID = c(1:7),
                 home_pc = c("", "CB4 2DT", "NE5 7TH", "BY5 8IB", "DH4 6PB", "MP9 7GH", "KN4 5GH"),
                 start_pc = c(NA, "Home", "FC5 7YH", "Home", "CB3 5TH", "BV6 5PB", NA),
                 end_pc = c(NA, "CB5 4FG", "Home", "", "Home", "", NA))
How do I remove the NAs and blanks in one go (in the start_pc and end_pc columns)? I have in the past used:
df <- df[-which(is.na(df$start_pc)), ]
... to remove the NAs - is there a similar command to remove the blanks?
It is the same construct - simply test for the empty strings as well as the NAs:
df[!(is.na(df$start_pc) | df$start_pc == ""), ]
Try this:
df <- df[-which(df$start_pc == ""), ]
In fact, looking at your code, you don't need the which; using negation instead, you can simplify it to:
df <- df[!(df$start_pc == ""), ]
df <- df[!is.na(df$start_pc), ]
And, of course, you can combine these two statements as follows:
df <- df[!(df$start_pc == "" | is.na(df$start_pc)), ]
And you can simplify it even further using with():
df <- with(df, df[!(start_pc == "" | is.na(start_pc)), ])
You can also test for non-zero string length using nzchar.
df <- with(df, df[nzchar(start_pc) & !is.na(start_pc), ])
Disclaimer: I didn't test any of this code. Please let me know if there are syntax errors anywhere
An elegant solution with dplyr would be:
df %>%
  # recode empty strings "" as NA
  na_if("") %>%
  # remove NAs
  na.omit
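Note that recent dplyr versions document na_if() as a vector function, so applying it per column with across() may be the safer spelling (a sketch):

library(dplyr)

df %>%
  mutate(across(where(is.character), ~ na_if(.x, ""))) %>%
  na.omit()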
An alternative solution is to remove the rows with blanks in a single variable:
df <- subset(df, VAR != "")
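Since subset() drops rows where the condition is NA as well as where it is FALSE, this single test removes both blanks and NAs; for the question's two columns that would be (a sketch):

df <- subset(df, start_pc != "" & end_pc != "")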
An easy approach is to make all the blank cells NA and keep only complete cases. You might also look at na.omit examples; it is a widely discussed topic.
df[df==""]<-NA
df<-df[complete.cases(df),]