I have a number of .txt files (articles) in a folder, and I use a for loop to read the text from all of them into R:
input_loc <- "C:/Users/User/Desktop/Folder"
files <- dir(input_loc, full.names = TRUE)
text <- c()
for (f in files) {
text <- c(text, paste(readLines(f), collapse = "\n"))
}
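As an aside, growing a vector with c() inside a loop works but scales poorly with many files; an equivalent vectorised sketch:
# one string per article, same result as the loop above
text <- sapply(files, function(f) paste(readLines(f), collapse = "\n"),
               USE.NAMES = FALSE)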
From here, I tokenize into paragraphs (tokenize_paragraphs() comes from the tokenizers package) and get each paragraph in each article:
library(tokenizers)
paragraphs <- tokenize_paragraphs(text)
sapply(paragraphs, length)
paragraphs
Then I unlist and transform it into a data frame:
par_unlisted <- unlist(paragraphs)
par_unlisted
par_unlisted_df <- as.data.frame(par_unlisted)
BUT doing that I no longer have the per-article separation of paragraph numbers (e.g. if the first article has 6 paragraphs, then before unlisting the first paragraph of the second article still has a [1] in front, while after unlisting it has a [7]).
What I would like to do, once I have the data frame, is to have a column with the paragraph number, and then another column named "article" with the number of the article.
Thank you in advance.
EDIT
this is roughly what I get once I get to paragraphs:
> paragraphs
[[1]]
[1] "The Miami Dolphins have decided to use their non-exclusive franchise
tag on wide receiver Jarvis Landry."
[2] "The Dolphins tweeted the announcement Tuesday, the first day teams
could use their franchise or transition tags. The salary for wide receivers
getting the franchise tag this offseason is expected to be around $16.2
million, which will be quite the raise for Landry, who made $894,000 last
season."
[[2]]
[1] "Despite months of little-to-no movement on contract negotiations,
Jarvis Landry has often stated his desire to stay in Miami."
[2] "The Dolphins used their lone tool to wipe away negotation-driven stress
-- at least in the immediate future -- and ensure Landry won't be lured away
from Miami, placing the franchise tag on the receiver on Tuesday, the team
announced."
I would want to keep the paragraph number ([n]) as a column in the data frame, because when I unlist, the paragraphs no longer stay separated per article and then per paragraph; I get them in one running sequence. Basically, in the example I've just posted, I no longer have
[[1]]
[1] ...
[2] ...
[[2]]
[1] ...
[2] ...
but I get
[1] ...
[2] ...
[3] ...
[4] ...
Consider iterating through the paragraphs list, building a list of data frames with the needed article and paragraph numbers, and finishing with a row bind across all the data frame elements.
Input Data
paragraphs <- list(
c("The Miami Dolphins have decided to use their non-exclusive franchise tag on wide receiver Jarvis Landry.",
"The Dolphins tweeted the announcement Tuesday, the first day teams could use their franchise or transition tags. The salary for wide receivers
getting the franchise tag this offseason is expected to be around $16.2 million, which will be quite the raise for Landry, who made $894,000 last
season."),
c("Despite months of little-to-no movement on contract negotiations, Jarvis Landry has often stated his desire to stay in Miami.",
"The Dolphins used their lone tool to wipe away negotation-driven stress -- at least in the immediate future -- and ensure Landry won't be lured away
from Miami, placing the franchise tag on the receiver on Tuesday, the team announced."))
Dataframe Build
# one data frame per article: article number, within-article paragraph number, text
df_list <- lapply(seq_along(paragraphs), function(i)
  setNames(data.frame(i, seq_along(paragraphs[[i]]), paragraphs[[i]]),
           c("article_num", "paragraph_num", "paragraph"))
)
final_df <- do.call(rbind, df_list)
Output Result
final_df
# article_num paragraph_num paragraph
# 1 1 1 The Miami Dolphins have decided to use their non-e...
# 2 1 2 The Dolphins tweeted the announcement Tuesday, the...
# 3 2 1 Despite months of little-to-no movement on contrac...
# 4 2 2 The Dolphins used their lone tool to wipe away neg...
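For reference, the same frame can be built without an explicit loop, using rep() and lengths() (a sketch, assuming paragraphs as above):
final_df <- data.frame(
  article_num   = rep(seq_along(paragraphs), lengths(paragraphs)),  # article index, once per paragraph
  paragraph_num = unlist(lapply(lengths(paragraphs), seq_len)),     # restarts at 1 within each article
  paragraph     = unlist(paragraphs),
  stringsAsFactors = FALSE
)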
Related
I have six .txt dataset files stored at '../data/csv'. All the datasets have a similar structure: X1 (the speech) and part (the part of the speech, i.e. Charlotte_part_1 ... Charlotte_part_60). I am trying to combine all six datasets into a single .csv file called biden.csv with the columns speech, part, location, event and date, but I am having trouble extracting speech and part (these two come from the file contents) and event (from the file name) because of the varying naming structure of the files.
The six datasets
"Charlotte_Sep23_2020_Racial_Equity_Discussion-1.txt",
"Cleveland_Sep30_2020_Whistle_Stop_Tour.txt",
"Milwaukee_Aug20_2020_Democratic_National_Convention.txt",
"Philadelphia_Sep20_2020_SCOTUS.txt",
"Washington_Sep26_2020_US_Conference_of_Mayors.txt",
"Wilmington_Nov25_2020_Thanksgiving.txt"
Sample content from 'Charlotte_Sep23_2020_Racial_Equity_Discussion-1.txt'
X1 part
"Folks, thanks for taking the time to be here today. I really appreciate it. And we even have an astronaut in our house and I tell you what, that’s pretty cool. Look, first of all, I want to thank Chris and the mayor for being here, and all of you for being here. You know, these are tough times. Over 200,000 Americans have passed away. Over 200,000, and the number is still rising. The impact on communities is bad across the board, but particularly bad for African-American communities. Almost four times as likely, three times as likely to catch the disease, COVID, and when it’s caught, twice as likely to die as white Americans. It’s sort of emblematic of the inequality that exists and the circumstances that exist." Charlotte_part_1
"One of the things that really matters to me, is we could do … It didn’t have to be this bad. You have 30 million people on unemployment, you have 20 million people figuring whether or not they can pay their mortgage payment this month, and what they’re going to be able to do or not do as the consequence of that, and you’ve got millions of people who are worried that they’re going to be thrown out in the street because they can’t pay their rent. Although they’ve been given a reprieve for three months, but they have to pay double the next three months when it comes around." Charlotte_part_2
Here is the code I have designed, but it's not producing the output I want. I mean, it just creates the tibble with the titles but no contents in any of the variables:
biden_data <- tibble() # initialize empty tibble
# loop through all text files in the specified directory
for (file in list.files(path="./data/csv", pattern='*.txt', full.names=T)){
filename <- strsplit(file, "[./]")[[1]][5] # extract file name from path
# extract location from file name
location <- strsplit(filename, split='_')[[1]][1]
# extract raw date from file name
raw_date <- strsplit(filename, split='_')[[1]][2]
date <- as.Date(raw_date, "%b%d_%Y") # format as datetime
# extract event from file name
event <- strsplit(filename, split='_')[[1]][3]
# extract speech and part from file
content <- readChar(file, file.info(file)$size)
speech <- content[grepl("^X1", content)]
part <- content[grepl("^part", content)]
# create a new observation (row)
new_obs <- tibble(speech=speech, part=part, location=location, event=event, date=date)
# append the new observation to the existing data
biden_data <- bind_rows(biden_data, new_obs)
rm(filename, location, raw_date, date, content, speech, part, new_obs, file) # cleanup
}
Desired Output is supposed to look like this:
## # A tibble: 128 x 5
## speech part location event date
## <chr> <chr> <chr> <chr> <date>
## 1 Folks, thanks for taking the time to be here~ Char~ Charlot~ Raci~ 2020-09-23
## 2 One of the things that really matters to me,~ Char~ Charlot~ Raci~ 2020-09-23
## 3 How people going to do that? And the way, in~ Char~ Charlot~ Raci~ 2020-09-23
## 4 In addition to that, we find ourselves in a ~ Char~ Charlot~ Raci~ 2020-09-23
## 5 If he had spoken, as I said, they said at Co~ Char~ Charlot~ Raci~ 2020-09-23
## 6 But what I want to talk to you about today i~ Char~ Charlot~ Raci~ 2020-09-23
## 7 And thirdly, if you’re a business person, le~ Char~ Charlot~ Raci~ 2020-09-23
## 8 For too many people, particularly in the Afr~ Char~ Charlot~ Raci~ 2020-09-23
## 9 It goes to education, as well as access to e~ Char~ Charlot~ Raci~ 2020-09-23
## 10 And then we’re going to talk about, I think ~ Char~ Charlot~ Raci~ 2020-09-23
## # ... with 118 more rows
A likely reason the loop yields an empty tibble: readChar() reads the whole file into a single string, so the grepl() subsets never operate on individual rows the way the code intends. An alternative approach, starting with a vector of file paths:
files <- c("Charlotte_Sep23_2020_Racial_Equity_Discussion-1.txt", "Cleveland_Sep30_2020_Whistle_Stop_Tour.txt", "Milwaukee_Aug20_2020_Democratic_National_Convention.txt", "Philadelphia_Sep20_2020_SCOTUS.txt", "Washington_Sep26_2020_US_Conference_of_Mayors.txt", "Wilmington_Nov25_2020_Thanksgiving.txt")
We can capture the components into a frame:
meta <- strcapture("^([^_]+)_([^_]+_[^_]+)_(.*)\\.txt$", files, list(location="", date="", event=""))
meta
# location date event
# 1 Charlotte Sep23_2020 Racial_Equity_Discussion-1
# 2 Cleveland Sep30_2020 Whistle_Stop_Tour
# 3 Milwaukee Aug20_2020 Democratic_National_Convention
# 4 Philadelphia Sep20_2020 SCOTUS
# 5 Washington Sep26_2020 US_Conference_of_Mayors
# 6 Wilmington Nov25_2020 Thanksgiving
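If you want real Date values rather than strings, the captured date column parses directly (assuming English month abbreviations in your locale):
meta$date <- as.Date(meta$date, format = "%b%d_%Y")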
And then iterate over that, reading each file's contents and combining everything into a single frame:
out <- do.call(Map, c(list(f = function(fn, ...) cbind(..., read.table(fn, header = TRUE))),
list(files), meta))
out <- do.call(rbind, out)
rownames(out) <- NULL
out[1:3,]
# location date event
# 1 Charlotte Sep23_2020 Racial_Equity_Discussion-1
# 2 Charlotte Sep23_2020 Racial_Equity_Discussion-1
# 3 Cleveland Sep30_2020 Whistle_Stop_Tour
# X1
# 1 Folks, thanks for taking the time to be here today. I really appreciate it. And we even have an astronaut in our house and I tell you what, that’s pretty cool. Look, first of all, I want to thank Chris and the mayor for being here, and all of you for being here. You know, these are tough times. Over 200,000 Americans have passed away. Over 200,000, and the number is still rising. The impact on communities is bad across the board, but particularly bad for African-American communities. Almost four times as likely, three times as likely to catch the disease, COVID, and when it’s caught, twice as likely to die as white Americans. It’s sort of emblematic of the inequality that exists and the circumstances that exist.
# 2 One of the things that really matters to me, is we could do … It didn’t have to be this bad. You have 30 million people on unemployment, you have 20 million people figuring whether or not they can pay their mortgage payment this month, and what they’re going to be able to do or not do as the consequence of that, and you’ve got millions of people who are worried that they’re going to be thrown out in the street because they can’t pay their rent. Although they’ve been given a reprieve for three months, but they have to pay double the next three months when it comes around.
# 3 Charlotte_Sep23_2020_Racial_Equity_Discussion-1.txt
# part
# 1 Charlotte_part_1
# 2 Charlotte_part_2
# 3 something
(I made fake files for all but the first file.)
Brief walk-through:
strcapture takes the regex (lots of _-separation) and creates a frame of location, date, etc.
Map takes a function with 1 or more arguments (we use two: fn= for the filename, and ... for "the rest") and applies it to each of the subsequent lists/vectors. In this case, I'm using ... to cbind (column-bind/concatenate) the columns from meta to what we read from the file itself. This is useful in that it combines the 1 row of each meta row with any-number-of-rows from the file itself. (We could have hard-coded ... instead as location, date, and event, but I tend to prefer to generalize, in case you need to extract something else from the filenames.)
Because we use ..., however, we need to combine files and the columns of meta in a list and then call our anon-function with the list contents as arguments.
The contents of out after our do.call(Map, ...) is in a list and not a single frame. Each element of this list is a frame with the same column-structure, so we then combine them by rows with do.call(rbind, out).
R is going to carry the names from files into the row names, which I find unnecessary (and distracting), so I removed the row names. Optional.
If you're interested, this may appear much easier to digest using dplyr and friends:
library(dplyr)
# library(tidyr) # unnest
out <- strcapture("^([^_]+)_([^_]+_[^_]+)_(.*)\\.txt$", files,
list(location="", date="", event="")) %>%
mutate(contents = lapply(files, read.table, header = TRUE)) %>%
tidyr::unnest(contents)
I have a data frame containing unstructured text. In this reproducible example, I'm downloading a 10K company filing directly from the SEC website and loading it with read.table.
dir = getwd(); setwd(dir)
download.file("https://www.sec.gov/Archives/edgar/data/2648/0000002648-96-000013.txt", file.path(dir,"filing.txt"))
filing <- read.table(file=file.path(dir, "filing.txt"), sep="\t", quote="", comment.char="")
droplevels.data.frame(filing)
I want to remove the SEC header in order to focus on the main body of the document (starting in row 216) and divide my text into sections/items.
> filing$V1[216:218]
[1] PART I
[2] Item 1. Business.
[3] A. Organization of Business
Therefore, I'm trying to match strings starting with the word Item (or ITEM) followed by one or more spaces, one or two digits, a dot, one or more spaces and one or more words. For example:
Item 1. Business.
ITEM 1. BUSINESS
Item 1. Business
Item 10. Directors and Executive Officers of
ITEM 10. DIRECTORS AND EXECUTIVE OFFICERS OF THE REGISTRANT
My attempt involves str_detect and regex in order to create a variable count that jumps each time there is a string match.
library(dplyr)
library(stringr)
tidy_filing <- filing %>%
  mutate(count = cumsum(str_detect(V1, regex("^Item [\\d]{1,2}\\.", ignore_case = TRUE)))) %>%
  ungroup()
However, I'm missing the first 9 Items and my count starts only with Item 10.
tidy_filing[c(217, 218,251:254),]
V1 count
217 Item 1. Business. 0
218 A. Organization of Business 3 0
251 PART III 0
252 Item 10. Directors etc. 38 1
253 Item 11. Executive Compens. 38 2
254 Item 12. Security Ownership. 38 3
Any help would be highly appreciated.
The problem is that the single digit items have double spaces in order to align with the two digit ones. You can get round this by changing your regex string to
"^Item\\s+\\d{1,2}\\."
My dataframe column looks like this:
head(tweets_date$Tweet)
[1] b"It is #DineshKarthik's birthday and here's a rare image of the captain of #KKRiders. Have you seen him do this before? Happy birthday, DK\\xf0\\x9f\\x98\\xac
[2] b'The awesome #IPL officials do a wide range of duties to ensure smooth execution of work! Here\\xe2\\x80\\x99s #prabhakaran285 engaging with the #ChennaiIPL kid-squad that wanted to meet their daddies while the presentation was on :) #cutenessoverload #lineofduty \\xf0\\x9f\\x98\\x81
[3] b'\\xf0\\x9f\\x8e\\x89\\xf0\\x9f\\x8e\\x89\\n\\nCHAMPIONS!!
[4] b'CHAMPIONS - 2018 #IPLFinal
[5] b'Chennai are Super Kings. A fairytale comeback as #ChennaiIPL beat #SRH by 8 wickets to seal their third #VIVOIPL Trophy \\xf0\\x9f\\x8f\\x86\\xf0\\x9f\\x8f\\x86\\xf0\\x9f\\x8f\\x86. This is their moment to cherish, a moment to savour.
[6] b"Final. It's all over! Chennai Super Kings won by 8 wickets
These are tweets containing mentions that start with '#'. I need to extract all of them and save the mentions in each particular tweet as "#mention1 #mention2". Currently my code just extracts them as lists.
My code:
tweets_date$Mentions<-str_extract_all(tweets_date$Tweet, "#\\w+")
How do I collapse those lists in each row to form a string separated by spaces, as mentioned earlier?
Thanks in advance.
I believe it would be best if you used an as-is column (via I()) in this case:
extract the mentions:
library(stringr)
Mentions <- str_extract_all(lis, "#\\w+")
some data frame:
df <- data.frame(col = 1:6, lett = LETTERS[1:6])
create a list column:
df$Mentions <- I(Mentions)
df
#output
col lett Mentions
1 1 A #DineshK....
2 2 B #IPL, #p....
3 3 C
4 4 D
5 5 E #ChennaiIPL
6 6 F
I think this is better since it allows for quite easy subsetting:
df$Mentions[[1]]
#output
[1] "#DineshKarthik" "#KKRiders"
df$Mentions[[1]][1]
#output
[1] "#DineshKarthik"
and it succinctly shows what's inside the column when printing the df.
data:
lis <- c("b'It is #DineshKarthik's birthday and here's a rare image of the captain of #KKRiders. Have you seen him do this before? Happy birthday, DK\\xf0\\x9f\\x98\\xac",
"b'The awesome #IPL officials do a wide range of duties to ensure smooth execution of work! Here\\xe2\\x80\\x99s #prabhakaran285 engaging with the #ChennaiIPL kid-squad that wanted to meet their daddies while the presentation was on :) #cutenessoverload #lineofduty \\xf0\\x9f\\x98\\x81",
"b'\\xf0\\x9f\\x8e\\x89\\xf0\\x9f\\x8e\\x89\\n\\nCHAMPIONS!!",
"b'CHAMPIONS - 2018 #IPLFinal",
"b'Chennai are Super Kings. A fairytale comeback as #ChennaiIPL beat #SRH by 8 wickets to seal their third #VIVOIPL Trophy \\xf0\\x9f\\x8f\\x86\\xf0\\x9f\\x8f\\x86\\xf0\\x9f\\x8f\\x86. This is their moment to cherish, a moment to savour.",
"b'Final. It's all over! Chennai Super Kings won by 8 wickets")
The str_extract_all function from the stringr package returns a list of character vectors. So, if you instead want a single comma-separated string per row, then you may try using sapply for a base R option:
tweets <- str_extract_all(tweets_date$Tweet, "#\\w+")
tweets_date$Mentions <- sapply(tweets, function(x) paste(x, collapse=", "))
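If you want the space-separated form from the question ("#mention1 #mention2") rather than comma-separated, just change the separator:
tweets_date$Mentions <- sapply(tweets, paste, collapse = " ")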
Via Twitter's help site: "Your username cannot be longer than 15 characters. Your real name can be longer (20 characters), but usernames are kept shorter for the sake of ease. A username can only contain alphanumeric characters (letters A-Z, numbers 0-9) with the exception of underscores, as noted above. Check to make sure your desired username doesn't contain any symbols, dashes, or spaces."
Note that email addresses can be in tweets as can URLs with #'s in them (and not just the silly URLs with username/password in the host component). Thus, something like:
(^|[^[:alnum:]_#/\\!?=&])#([[:alnum:]_]{1,15})\\b
is likely a better, safer choice
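A sketch of how that pattern might be applied with base R, reusing the lis vector from above (the sub() call strips the single leading context character that the first group matches):
pat <- "(^|[^[:alnum:]_#/\\!?=&])#([[:alnum:]_]{1,15})\\b"
hits <- regmatches(lis, gregexpr(pat, lis))
tags <- lapply(hits, function(x) sub("^[^#]*", "", x))  # keep from '#' onward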
I want to count the number of words in each row:
Review_ID Review_Date Review_Content Listing_Title Star Hotel_Name
1 1/25/2016 I booked both the Crosby and Four Seasons but decided to cancel the Four Seasons closer to the arrival date based on reviews. Glad I did. The Crosby is an outstanding hotel. The rooms are immaculate and luxurious, with real attention to detail and none of the bland furnishings you find in even the top chain hotels. Staff on the whole were extremely attentive and seemed to enjoy being there. Breakfast was superb and facilities at ground level gave an intimate and exclusive feel to the hotel. It's a fairly expensive place to stay but is one of those hotels where you feel you're getting what you pay for, helped by an excellent location. Hope to be back! Outstanding 5 Crosby Street Hotel
2 1/18/2016 We've stayed many times at the Crosby Street Hotel and always have an incredible, flawless experience! The staff couldn't be more accommodating, the housekeeping is immaculate, the location's awesome and the rooms are the coolest combination of luxury and chic. During our most recent trip over The New Years holiday, we stayed in the stunning Crosby Suite which has the most extraordinary, gorgeous decor. The Crosby remains our absolute favorite in NYC. Can't wait to return! Always perfect! 5 Crosby Street Hotel
I was thinking something like:
WordFreqRowWise %>%
rowwise() %>%
summarise(n = n())
To get the results something like..
Review_ID Review_Content total_Words Min_occrd_word Max Average
1 .... 230 great: 1 the: 25 total_unique/total_words in the row
But I have no idea how I can do it....
Here is a method in base R using strsplit and sapply. Let's say the data is stored in a data.frame df and the reviews are stored in the variable Review_Content
# break up the strings in each row by " "
temp <- strsplit(df$Review_Content, split=" ")
# count the number of words as the length of the vectors
df$wordCount <- sapply(temp, length)
In this instance, sapply will return a vector of the counts for each row.
Since the word count is now an object, you can perform whatever analysis you want on it. Here are some examples:
summarize the distribution of word counts: summary(df$wordCount)
maximum word count: max(df$wordCount)
mean word count: mean(df$wordCount)
range of word counts: range(df$wordCount)
interquartile range of word counts: IQR(df$wordCount)
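Building on that, a sketch of the other per-row summaries the question asks about (uniqueWords and lexicalDiversity are hypothetical column names of my own):
df$uniqueWords <- sapply(temp, function(x) length(unique(x)))  # distinct words per row
df$lexicalDiversity <- df$uniqueWords / df$wordCount           # total_unique / total_words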
Adding to @lmo's answer above:
The code below will generate a data frame that consists of all the words, row-wise, and their frequencies:
temp2 <- data.frame()
for (i in 1:length(temp)){
temp1 <- as.data.frame(table(temp[[i]]))
temp1$ID <- paste0("Row_", i)
temp2 <- rbind(temp2, temp1)
temp1 <- NULL
}
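The same result can be had without growing a data frame inside the loop (a sketch):
word_freqs <- do.call(rbind, lapply(seq_along(temp), function(i) {
  d <- as.data.frame(table(temp[[i]]))  # word frequencies for row i
  d$ID <- paste0("Row_", i)
  d
}))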
I'm working with a large number (1,983) of CSV files. Posts on stackoverflow have said that lists are easier to work with so I've approached my task that way. I have read the CSVs in and gotten the first part of my task accomplished: what is the maximum number of concurrent users of the application? (A:203) Here's that code:
# get a list of the files
files <- list.files("my_path_here",pattern="*.CSV$", recursive = TRUE, full.names=TRUE)
#read in the csv's and store them as a list of dataframes
tables <- lapply(files, read.csv)
#store the counts of the number of users here
counts<-rep(NA,length(tables))
#loop thru the files to find the count and store that value
for (i in 1:length(files)) {
counts[i] <- length(tables[[i]][[2]])
}
#what's the largest number?
max(counts)
#203
The 2nd part of the task is to show the count of each title for each file. The contents of each file will be something like this:
compute_0001 compute_0002
[1] 3/26/2015 6:00:00 Business System Manager;Lead CoPath Analyst
[2] Regional Histotechnologist;Hist Tech - Ht
[3] Regional Histotechnologist;Tissue Tech
[4] SDX Histotechnologist;Histology Tech
[5] SDX Histotechnologist;Histology Tech
[6] Regional Histotechnologist;Lab Asst II Histology
[7] CytoPrep Tech;Histo Tech - Ht
[8] Regional Histotechnologist;Tissue Tech
[9] Histology Supervisor;Supv Reg Lab Unit
[10] Histotech/FC Tech/PA/Diener;Pathology Tissue Technician;;CONTRACT
What will differ from file to file is the time stamp in compute_0001, name of the file and the number of users (ie length of the file).
My approach was to try this:
>col2 <- sapply(tables,summary, maxsum=300) # gives me a list of 1983 elements that is 23.6Mb
(I noticed that when doing a summary() on the files I would get something like this - which is why I was trying it)
>col2[[1]]
compute_0001 compute_0002
[1] Business System Manager;Lead CoPath Analyst :1
[2] Regional Histotechnologist;Hist Tech - Ht :1
[3] Regional Histotechnologist;Tissue Tech :1
[4] SDX Histotechnologist;Histology Tech :1
[5] SDX Histotechnologist;Histology Tech :1
[6] Regional Histotechnologist;Lab Asst II Histology :2
[7] CytoPrep Tech;Histo Tech - Ht :4
[8] Regional Histotechnologist;Tissue Tech :1
[9] Histotech/FC Tech/PA/Diener;Pathology Tissue Technician;;CONTRACT :1
The above is actually many different people. For my purposes, [2], [3], [6] and [8] are the same title, even though the text after the ";" differs. (The truth is that even [4] and [5] could also be considered the same as [2], [3], [6] and [8].)
That ":1" (or generally ":#") is the number of users with that title at that particular time. I was hoping to grab that character, make it numeric and add them up to get a count of the users with each title for each file. Each file is an observation at a particular datetime.
I tried something like this:
>for (s in 1:length(col2)) {
>split <- strsplit(col2[[s]][,2], ":")
>#... make it numeric so I can do addition with it
>num <- as.numeric(split[[s]][2])
>#... and put it in the correct df
>tables[[s]]$count <- num
# After dealing with the ":" I was going to handle splitting on the first ";"
>}
But I couldn't get the loop to iterate more than a single time or past the first element of col2.
A more experienced useR suggested something like this:
>strsplit(x = as.character(compute2[[s]]),split=";",fixed=TRUE)
He said "However this results in a messy list also, since there are multiple ";" in some lines. What I would #suggest is to use grep() with a regex that returns the text before the first ";"- use that with sapply(compute2,grep()) and then you can run sapply(??,table) on the list that is returned to tally the job titles."
I'd prefer not to get into regex but, following his advice, I tried:
>for (s in 1:length(tables)){
>+ split <- strsplit(x = >as.character(compute2[[s]]),split=";",fixed=TRUE)
>+ }
split is a list of only 122 elements, not nearly long enough, so it's not iterating through the loop either. So, I figured I'd skip the loop and try:
>title_split<- sapply(compute2, strsplit, x = as.character(compute2[[1]]),split=";",fixed=TRUE)
But that gave me more than 50 warnings and a matrix that had 105,000+ elements that was 20.2Mb in size.
Like I said, I'd prefer to not venture into the world of regex, since I think I should be able to split on the ":" first and then the first of the ";" and return the string that precedes the ";". I'm just not sure why the loop is failing.
What I eventually want is a table that shows the count of each title (collapsed for duplicates like [2],[3], [6] and [8] above) for each file (which represents an observation at a particular datetime). I'm pretty agnostic as to approach, so if I have to do it via regex, then so be it.
Sorry for the lengthy post but I suspect that part of my problem (besides being brand new to stackoverflow, R and not understanding regex well) is that I'm not well versed in list manipulation and I wanted you to have the context.
Many thanks for reading.
Your data isn't easily reproducible, so I've created a simple list of fake data that I hope captures the essence of your data.
Make a list of fake data frames:
string1 = "53 Regional histotechnologist;text2 - more text"
string2 = "54 Regional histotechnologist;text2 - more text"
string3 = "CytoPrep Tech;text2 - more text"
tables = list(df1=data.frame(compute=c(string1, string2, string3)),
df2=data.frame(compute=c(string1, string2, string3)))
Count the number of rows in each data frame:
counts = sapply(tables, nrow)
Add a column that extracts job titles from the compute column. The regex pattern skips zero or more digit characters ([0-9]*) followed by zero or one space character ( ?), then captures everything up to, but not including, the first semi-colon (([^;]*);), and then skips every character after the semi-colon (.*).
tables = sapply(names(tables), function(df) {
cbind(tables[[df]], title=gsub("[0-9]* ?([^;]*);.*", "\\1", tables[[df]][,"compute"]))
}, simplify=FALSE)
tables
$df1
compute title
1 53 Regional histotechnologist;text2 - more text Regional histotechnologist
2 54 Regional histotechnologist;text2 - more text Regional histotechnologist
3 CytoPrep Tech;text2 - more text CytoPrep Tech
$df2
compute title
1 53 Regional histotechnologist;text2 - more text Regional histotechnologist
2 54 Regional histotechnologist;text2 - more text Regional histotechnologist
3 CytoPrep Tech;text2 - more text CytoPrep Tech
Make a table of counts of each title for each data frame in tables:
title.table.list = lapply(tables, function(df) table(df$title))
title.table.list
$df1
CytoPrep Tech Regional histotechnologist
1 2
$df2
CytoPrep Tech Regional histotechnologist
1 2
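If you then want a single long frame of counts across all files, the per-file tables combine readily (a sketch; the file/title column names are my own):
title_counts <- do.call(rbind, lapply(names(title.table.list), function(nm) {
  d <- as.data.frame(title.table.list[[nm]])  # columns: Var1 (the title), Freq
  names(d)[1] <- "title"
  d$file <- nm
  d
}))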