Data Preparation In R

I have six .txt dataset files stored at '../data/csv'. All of the datasets have a similar structure: X1 (the speech text) and part (the part of the speech, i.e. Charlotte_part_1 ... Charlotte_part_60). I want to combine all six datasets into a single .csv file called biden.csv with the columns speech, part, location, event and date, but I am having trouble extracting the speech and part (these two come from the file contents) and the event (from the file name) because of the files' varying naming structure.
The six datasets
"Charlotte_Sep23_2020_Racial_Equity_Discussion-1.txt",
"Cleveland_Sep30_2020_Whistle_Stop_Tour.txt",
"Milwaukee_Aug20_2020_Democratic_National_Convention.txt",
"Philadelphia_Sep20_2020_SCOTUS.txt",
"Washington_Sep26_2020_US_Conference_of_Mayors.txt",
"Wilmington_Nov25_2020_Thanksgiving.txt"
Sample content from 'Charlotte_Sep23_2020_Racial_Equity_Discussion-1.txt'
X1 part
"Folks, thanks for taking the time to be here today. I really appreciate it. And we even have an astronaut in our house and I tell you what, that’s pretty cool. Look, first of all, I want to thank Chris and the mayor for being here, and all of you for being here. You know, these are tough times. Over 200,000 Americans have passed away. Over 200,000, and the number is still rising. The impact on communities is bad across the board, but particularly bad for African-American communities. Almost four times as likely, three times as likely to catch the disease, COVID, and when it’s caught, twice as likely to die as white Americans. It’s sort of emblematic of the inequality that exists and the circumstances that exist." Charlotte_part_1
"One of the things that really matters to me, is we could do … It didn’t have to be this bad. You have 30 million people on unemployment, you have 20 million people figuring whether or not they can pay their mortgage payment this month, and what they’re going to be able to do or not do as the consequence of that, and you’ve got millions of people who are worried that they’re going to be thrown out in the street because they can’t pay their rent. Although they’ve been given a reprieve for three months, but they have to pay double the next three months when it comes around." Charlotte_part_2
Here is the code I have written, but it's not producing the output I want: it just creates the tibble with the column titles but no contents in any of the variables.
library(dplyr) # for tibble() and bind_rows()

biden_data <- tibble() # initialize empty tibble
# loop through all text files in the specified directory
for (file in list.files(path="./data/csv", pattern='*.txt', full.names=T)){
  filename <- strsplit(file, "[./]")[[1]][5] # extract file name from path
  # extract location from file name
  location <- strsplit(filename, split='_')[[1]][1]
  # extract raw date from file name
  raw_date <- strsplit(filename, split='_')[[1]][2]
  date <- as.Date(raw_date, "%b%d_%Y") # format as datetime
  # extract event from file name
  event <- strsplit(filename, split='_')[[1]][3]
  # extract speech and part from file
  content <- readChar(file, file.info(file)$size)
  speech <- content[grepl("^X1", content)]
  part <- content[grepl("^part", content)]
  # create a new observation (row)
  new_obs <- tibble(speech=speech, part=part, location=location, event=event, date=date)
  # append the new observation to the existing data
  biden_data <- bind_rows(biden_data, new_obs)
  rm(filename, location, raw_date, date, content, speech, part, new_obs, file) # cleanup
}
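As a side note on why the loop comes back empty: readChar() returns the whole file as one long string, so content[grepl("^X1", content)] and content[grepl("^part", content)] never isolate the two columns, and the date is parsed from "Sep23" alone without the year. A minimal patched version of the loop, only a sketch and assuming each .txt file is a whitespace-delimited table with a header row of X1 and part (as in the sample above), reads the columns with read.table() instead:
library(dplyr)

biden_data <- tibble()
for (file in list.files(path = "./data/csv", pattern = "\\.txt$", full.names = TRUE)) {
  filename <- tools::file_path_sans_ext(basename(file))  # e.g. "Charlotte_Sep23_2020_Racial_Equity_Discussion-1"
  parts    <- strsplit(filename, "_")[[1]]
  location <- parts[1]
  date     <- as.Date(paste(parts[2], parts[3], sep = "_"),
                      format = "%b%d_%Y")                 # assumes an English locale for %b
  event    <- paste(parts[-(1:3)], collapse = "_")        # everything after the date
  contents <- read.table(file, header = TRUE, stringsAsFactors = FALSE)  # columns X1 and part
  new_obs  <- tibble(speech = contents$X1, part = contents$part,
                     location = location, event = event, date = date)
  biden_data <- bind_rows(biden_data, new_obs)
}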
Desired Output is supposed to look like this:
## # A tibble: 128 x 5
## speech part location event date
## <chr> <chr> <chr> <chr> <date>
## 1 Folks, thanks for taking the time to be here~ Char~ Charlot~ Raci~ 2020-09-23
## 2 One of the things that really matters to me,~ Char~ Charlot~ Raci~ 2020-09-23
## 3 How people going to do that? And the way, in~ Char~ Charlot~ Raci~ 2020-09-23
## 4 In addition to that, we find ourselves in a ~ Char~ Charlot~ Raci~ 2020-09-23
## 5 If he had spoken, as I said, they said at Co~ Char~ Charlot~ Raci~ 2020-09-23
## 6 But what I want to talk to you about today i~ Char~ Charlot~ Raci~ 2020-09-23
## 7 And thirdly, if you’re a business person, le~ Char~ Charlot~ Raci~ 2020-09-23
## 8 For too many people, particularly in the Afr~ Char~ Charlot~ Raci~ 2020-09-23
## 9 It goes to education, as well as access to e~ Char~ Charlot~ Raci~ 2020-09-23
## 10 And then we’re going to talk about, I think ~ Char~ Charlot~ Raci~ 2020-09-23
## # ... with 118 more rows

Starting with a vector of file paths:
files <- c("Charlotte_Sep23_2020_Racial_Equity_Discussion-1.txt", "Cleveland_Sep30_2020_Whistle_Stop_Tour.txt", "Milwaukee_Aug20_2020_Democratic_National_Convention.txt", "Philadelphia_Sep20_2020_SCOTUS.txt", "Washington_Sep26_2020_US_Conference_of_Mayors.txt", "Wilmington_Nov25_2020_Thanksgiving.txt")
We can capture the components into a frame:
meta <- strcapture("^([^_]+)_([^_]+_[^_]+)_(.*)\\.txt$", files, list(location="", date="", event=""))
meta
# location date event
# 1 Charlotte Sep23_2020 Racial_Equity_Discussion-1
# 2 Cleveland Sep30_2020 Whistle_Stop_Tour
# 3 Milwaukee Aug20_2020 Democratic_National_Convention
# 4 Philadelphia Sep20_2020 SCOTUS
# 5 Washington Sep26_2020 US_Conference_of_Mayors
# 6 Wilmington Nov25_2020 Thanksgiving
And then iterate over that, reading the contents of each file into a single frame.
out <- do.call(Map, c(list(f = function(fn, ...) cbind(..., read.table(fn, header = TRUE))),
                      list(files), meta))
out <- do.call(rbind, out)
rownames(out) <- NULL
out[1:3,]
# location date event
# 1 Charlotte Sep23_2020 Racial_Equity_Discussion-1
# 2 Charlotte Sep23_2020 Racial_Equity_Discussion-1
# 3 Cleveland Sep30_2020 Whistle_Stop_Tour
# X1
# 1 Folks, thanks for taking the time to be here today. I really appreciate it. And we even have an astronaut in our house and I tell you what, that’s pretty cool. Look, first of all, I want to thank Chris and the mayor for being here, and all of you for being here. You know, these are tough times. Over 200,000 Americans have passed away. Over 200,000, and the number is still rising. The impact on communities is bad across the board, but particularly bad for African-American communities. Almost four times as likely, three times as likely to catch the disease, COVID, and when it’s caught, twice as likely to die as white Americans. It’s sort of emblematic of the inequality that exists and the circumstances that exist.
# 2 One of the things that really matters to me, is we could do … It didn’t have to be this bad. You have 30 million people on unemployment, you have 20 million people figuring whether or not they can pay their mortgage payment this month, and what they’re going to be able to do or not do as the consequence of that, and you’ve got millions of people who are worried that they’re going to be thrown out in the street because they can’t pay their rent. Although they’ve been given a reprieve for three months, but they have to pay double the next three months when it comes around.
# 3 Charlotte_Sep23_2020_Racial_Equity_Discussion-1.txt
# part
# 1 Charlotte_part_1
# 2 Charlotte_part_2
# 3 something
(I made fake files for all but the first file.)
Brief walk-through:
strcapture takes the regex (lots of _-separation) and creates a frame of location, date, etc.
Map takes a function with 1 or more arguments (we use two: fn= for the filename, and ... for "the rest") and applies it to each of the subsequent lists/vectors. In this case, I'm using ... to cbind (column-bind/concatenate) the columns from meta to what we read from the file itself. This is useful in that it combines the 1 row of each meta row with any-number-of-rows from the file itself. (We could have hard-coded ... instead as location, date, and event, but I tend to prefer to generalize, in case you need to extract something else from the filenames.)
Because we use ..., however, we need to combine files and the columns of meta in a list and then call our anon-function with the list contents as arguments.
The contents of out after our do.call(Map, ...) is in a list and not a single frame. Each element of this list is a frame with the same column-structure, so we then combine them by rows with do.call(rbind, out).
R is going to carry the names from files over into the row names, which I find unnecessary (and distracting), so I removed them. Optional.
If you're interested, this may appear much easier to digest using dplyr and friends:
library(dplyr)
# library(tidyr) # unnest
out <- strcapture("^([^_]+)_([^_]+_[^_]+)_(.*)\\.txt$", files,
                  list(location="", date="", event="")) %>%
  mutate(contents = lapply(files, read.table, header = TRUE)) %>%
  tidyr::unnest(contents)
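One finishing touch, sketched under the assumption of an English locale for the %b month abbreviation: rename X1 to speech and parse the "Sep23_2020"-style string into a Date so the frame matches the desired output:
library(dplyr)

out <- out %>%
  rename(speech = X1) %>%                              # X1 holds the speech text
  mutate(date = as.Date(date, format = "%b%d_%Y"))     # "Sep23_2020" -> 2020-09-23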

Related

fuzzy and exact match of two databases

I have two databases. The first one has about 70k rows with 3 columns. The second one has 790k rows with 2 columns. Both databases have a common variable grantee_name. I want to match each row of the first database to one or more rows of the second database based on this grantee_name. Note that merge will not work because the grantee_name values do not match perfectly. There are different spellings etc. So, I am using the fuzzyjoin package and trying the following:
library("haven"); library("fuzzyjoin"); library("dplyr")
forfuzzy<-read_dta("/path/forfuzzy.dta")
filings <- read_dta ("/path/filings.dta")
> head(forfuzzy)
# A tibble: 6 x 3
grantee_name grantee_city grantee_state
<chr> <chr> <chr>
1 (ICS)2 MAINE CHAPTER CLEARWATER FL
2 (SUFFOLK COUNTY) VANDERBILT~ CENTERPORT NY
3 1 VOICE TREKKING A FUND OF ~ WESTMINSTER MD
4 10 CAN NEWBERRY FL
5 10 THOUSAND WINDOWS LIVERMORE CA
6 100 BLACK MEN IN CHICAGO INC CHICAGO IL
... 7 - 70000 rows to go
> head(filings)
# A tibble: 6 x 2
grantee_name ein
<chr> <dbl>
1 ICS-2 MAINE CHAPTER 123456
2 SUFFOLK COUNTY VANDERBILT 654321
3 VOICE TREKKING A FUND OF VOICES 789456
4 10 CAN 654987
5 10 THOUSAND MUSKETEERS INC 789123
6 100 BLACK MEN IN HOUSTON INC 987321
rows 7-790000 omitted for brevity
The above examples are clear enough to provide some good matches and some not-so-good matches. Note that, for example, 10 THOUSAND WINDOWS will match best with 10 THOUSAND MUSKETEERS INC but it does not mean it is a good match. There will be a better match somewhere in the filings data (not shown above). That does not matter at this stage.
So, I have tried the following:
df<-as.data.frame(stringdist_inner_join(forfuzzy, filings, by="grantee_name", method="jw", p=0.1, max_dist=0.1, distance_col="distance"))
Totally new to R. This is resulting in an error:
cannot allocate vector of size 375GB (with the big database of course). A sample of 100 rows from forfuzzy always works. So, I thought of iterating over a list of 100 rows at a time.
I have tried the following:
n <- 100
lst <- split(forfuzzy, cumsum((1:nrow(forfuzzy) - 1) %% n == 0))
df <- as.data.frame(lapply(lst, function(df_) {
  stringdist_inner_join(df_, filings, by = "grantee_name", method = "jw",
                        p = 0.1, max_dist = 0.1, distance_col = "distance",
                        nthread = getOption("sd_num_thread"))
}) %>% bind_rows)
I have also tried the above with mclapply instead of lapply. Same error happens even though I have tried a high-performance cluster setting 3 CPUs, each with 480G of memory and using mclapply with the option mc.cores=3. Perhaps a foreach command could help, but I have no idea how to implement it.
I have been advised to use the purrr and repurrrsive packages, so I try the following:
purrr::map(lst, ~stringdist_inner_join(., filings, by="grantee_name", method="jw", p=0.1, max_dist=0.1, distance_col="distance", nthread = getOption("sd_num_thread")))
This seems to be working, after a novice error in the by=grantee_name statement. However, it is taking forever and I am not sure it will work. A sample list in forfuzzy of 100 rows, with n=10 (so 10 lists with 10 rows each) has been running for 50 minutes, and still no results.
If you split (with base::split or dplyr::group_split) your uniquegrantees data frame into a list of data frames, then you can call purrr::map on the list. (map is pretty much lapply)
purrr::map(list_of_dfs, ~stringdist_inner_join(., filings, by="grantee_name", method="jw", p=0.1, max_dist=0.1, distance_col="distance"))
Your result will be a list of data frames each fuzzyjoined with filings. You can then call bind_rows (or you could do map_dfr) to get all the results in the same data frame again.
See R - Splitting a large dataframe into several smaller dateframes, performing fuzzyjoin on each one and outputting to a single dataframe
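For concreteness, a sketch of that chunk-and-map approach (the 100-row chunk size is arbitrary, not a recommendation):
library(dplyr)
library(purrr)
library(fuzzyjoin)

# split forfuzzy into ~100-row chunks, join each chunk against filings, then row-bind
chunks <- forfuzzy %>%
  mutate(chunk = (row_number() - 1) %/% 100) %>%
  group_split(chunk, .keep = FALSE)

result <- map_dfr(chunks,
                  ~ stringdist_inner_join(.x, filings, by = "grantee_name",
                                          method = "jw", p = 0.1, max_dist = 0.1,
                                          distance_col = "distance"))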
I haven't used foreach before but maybe the variable x is already the individual rows of zz1?
Have you tried:
stringdist_inner_join(x, zz2, by="grantee_name", method="jw", p=0.1, max_dist=0.1, distance_col="distance")
?
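And if you do want to try foreach, here is a hedged sketch with doParallel, reusing the lst of 100-row chunks built in the question; this parallelizes the joins but does not by itself reduce peak memory:
library(foreach)
library(doParallel)
library(fuzzyjoin)
library(dplyr)

registerDoParallel(cores = 3)   # matches the 3 CPUs mentioned in the question

result <- foreach(df_ = lst,
                  .combine  = bind_rows,
                  .packages = c("fuzzyjoin", "dplyr")) %dopar% {
  stringdist_inner_join(df_, filings, by = "grantee_name",
                        method = "jw", p = 0.1, max_dist = 0.1,
                        distance_col = "distance")
}

stopImplicitCluster()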

Extract words starting with # in R dataframe and save as new column

My dataframe column looks like this:
head(tweets_date$Tweet)
[1] b"It is #DineshKarthik's birthday and here's a rare image of the captain of #KKRiders. Have you seen him do this before? Happy birthday, DK\\xf0\\x9f\\x98\\xac
[2] b'The awesome #IPL officials do a wide range of duties to ensure smooth execution of work! Here\\xe2\\x80\\x99s #prabhakaran285 engaging with the #ChennaiIPL kid-squad that wanted to meet their daddies while the presentation was on :) #cutenessoverload #lineofduty \\xf0\\x9f\\x98\\x81
[3] b'\\xf0\\x9f\\x8e\\x89\\xf0\\x9f\\x8e\\x89\\n\\nCHAMPIONS!!
[4] b'CHAMPIONS - 2018 #IPLFinal
[5] b'Chennai are Super Kings. A fairytale comeback as #ChennaiIPL beat #SRH by 8 wickets to seal their third #VIVOIPL Trophy \\xf0\\x9f\\x8f\\x86\\xf0\\x9f\\x8f\\x86\\xf0\\x9f\\x8f\\x86. This is their moment to cherish, a moment to savour.
[6] b"Final. It's all over! Chennai Super Kings won by 8 wickets
These are tweets which have mentions starting with '#'. I need to extract all of them and save each mention in that particular tweet as "#mention1 #mention2". Currently my code just extracts them as lists.
My code:
tweets_date$Mentions<-str_extract_all(tweets_date$Tweet, "#\\w+")
How do I collapse those lists in each row to form a string separated by spaces, as mentioned earlier?
Thanks in advance.
I trust it would be best if you used an AsIs (list) column in this case:
extract words:
library(stringr)
Mentions <- str_extract_all(lis, "#\\w+")
some data frame:
df <- data.frame(col = 1:6, lett = LETTERS[1:6])
create a list column:
df$Mentions <- I(Mentions)
df
#output
col lett Mentions
1 1 A #DineshK....
2 2 B #IPL, #p....
3 3 C
4 4 D
5 5 E #ChennaiIPL
6 6 F
I think this is better since it allows for quite easy subsetting:
df$Mentions[[1]]
#output
[1] "#DineshKarthik" "#KKRiders"
df$Mentions[[1]][1]
#output
[1] "#DineshKarthik"
and it succinctly shows what's inside the column when printing the df.
data:
lis <- c("b'It is #DineshKarthik's birthday and here's a rare image of the captain of #KKRiders. Have you seen him do this before? Happy birthday, DK\\xf0\\x9f\\x98\\xac",
"b'The awesome #IPL officials do a wide range of duties to ensure smooth execution of work! Here\\xe2\\x80\\x99s #prabhakaran285 engaging with the #ChennaiIPL kid-squad that wanted to meet their daddies while the presentation was on :) #cutenessoverload #lineofduty \\xf0\\x9f\\x98\\x81",
"b'\\xf0\\x9f\\x8e\\x89\\xf0\\x9f\\x8e\\x89\\n\\nCHAMPIONS!!",
"b'CHAMPIONS - 2018 #IPLFinal",
"b'Chennai are Super Kings. A fairytale comeback as #ChennaiIPL beat #SRH by 8 wickets to seal their third #VIVOIPL Trophy \\xf0\\x9f\\x8f\\x86\\xf0\\x9f\\x8f\\x86\\xf0\\x9f\\x8f\\x86. This is their moment to cherish, a moment to savour.",
"b'Final. It's all over! Chennai Super Kings won by 8 wickets")
The str_extract_all function from the stringr package returns a list of character vectors. So, if you instead want each tweet's mentions collapsed into a single comma-separated string, then you may try using sapply for a base R option:
tweets <- str_extract_all(tweets_date$Tweet, "#\\w+")
tweets_date$Mentions <- sapply(tweets, function(x) paste(x, collapse=", "))
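If you want the space-separated format described in the question rather than commas, the same pattern works with a different collapse string (vapply here is just a stricter sapply that guarantees a character result):
tweets_date$Mentions <- vapply(tweets, paste, character(1), collapse = " ")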
Via Twitter's help site: "Your username cannot be longer than 15 characters. Your real name can be longer (20 characters), but usernames are kept shorter for the sake of ease. A username can only contain alphanumeric characters (letters A-Z, numbers 0-9) with the exception of underscores, as noted above. Check to make sure your desired username doesn't contain any symbols, dashes, or spaces."
Note that email addresses can be in tweets as can URLs with #'s in them (and not just the silly URLs with username/password in the host component). Thus, something like:
(^|[^[[:alnum:]_]#/\\!?=&])#([[:alnum:]_]{1,15})\\b
is likely a better, safer choice

apply nested within lapply not working in R

just earlier today I received a very helpful answer for a problem I was running into that allowed me to move onto the next step of one of my projects. However, I got stuck again later on in the project, and I'm wondering if any of you can help me move forward.
Context
Currently, I have a list of data frames that are full of soccer matches called wc_match_dataframes. Here is what one of the data frames looks like:
type_id tourn_id day month year team_A score_A score_B team_B win loss
f wc_1934 27 5 1934 Germany 5 2 Belgium Germany Belgium
I wasn't able to fit the data for the final three columns, draw, drawA, and drawB but basically the draw column is TRUE if the match is a draw, if not, it is FALSE. In the case of a draw, the win and loss columns are just filled by Draw. The drawA column is filled by team_A if the match was a draw, and likewise, the drawB column is filled by team_B.
The type_id is either f or q depending on if the match was a World Cup qualifier or a World Cup finals match. The tourn_id refers to the tournament the match was for, whether it was a qualifier or finals.
There are a total of 39 of these data frames, with a "finals" data frame for each of the 20 World Cup tournaments, and a "qualifiers" data frame for 19 tournaments (the first World Cup did not have qualifying).
What I Want To Do
I'm trying to populate a different list of data frames wc_dataframes with data for each of the 20 World Cups at the country level as opposed to the match level. Each of these twenty data frames will have the countries that made it to the finals of said tournament and their data like so:
Country
Wins in qualifying
Wins in finals
Losses in qualifying
Losses in finals
... and so on.
I have been able to populate the first country column for every World Cup no problem, but I'm running into issues for the rest of the columns.
Here is what I'm doing
This is the unlooped (only works for one World Cup) version of my code that works successfully:
wc_dataframes$wc_1930$fw <- apply(wc_dataframes$wc_1930, MARGIN = 1, function(country)
sum(wc_match_dataframes$`wc_1930 f`$w == country, na.rm = TRUE))
This is successfully populating the finals win column in the wc_dataframes$wc_1930 data frame by counting the number of wins.
Now, when I try and nest this under lapply to do it across all World Cup years like so:
lapply(names(wc_dataframes), function(year)
wc_dataframes$year$fw <- apply(wc_dataframes$year, MARGIN = 1, function(country)
sum(wc_match_dataframes$`year f`$w == country, na.rm = TRUE)))
It does not work for me. I suspect that the issue has to do with defining the year function and running into issues in the sum portion of my code. I come from a background in STATA so I am more used to running for loops and what not. I'm still getting used to R and lists and everything so I really appreciate the help.
Thank you!
Thank you so much in advance for the help, and happy holidays! :)
What you need is to return whatever you have modified, and to index with [[ ]] so that year is actually evaluated (wc_dataframes$year looks for an element literally named "year"):
lapply(names(wc_dataframes), function(year){
  wc_dataframes[[year]]$fw <- apply(wc_dataframes[[year]], MARGIN = 1, function(country)
    sum(wc_match_dataframes[[paste(year, 'f')]]$w == country, na.rm = TRUE))
  wc_dataframes
})
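Note that lapply() returns its results rather than modifying wc_dataframes in place, so the snippet above yields a list (one full copy of wc_dataframes per year). One way to capture the update, sketched with the same object names, is to return only the modified frame for each year and rebuild the named list:
wc_dataframes <- setNames(
  lapply(names(wc_dataframes), function(year) {
    wc_dataframes[[year]]$fw <- apply(wc_dataframes[[year]], MARGIN = 1, function(country)
      sum(wc_match_dataframes[[paste(year, "f")]]$w == country, na.rm = TRUE))
    wc_dataframes[[year]]   # return just this year's data frame
  }),
  names(wc_dataframes)
)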

Extract total frequency of words from vector in R

This is the vector I have:
posts = c("originally by: cearainmy only concern with csm is they seem a bit insulated from players. they have private message boards where it appears most of their work goes on. i would bet they are posting more there than in jita speakers corner. i think that is unfortunate because its hard to know who to vote for if you never really see what positions they hold. its sort of like ccp used to post here on the forums then they stopped. so they got a csm to represent players and use jita park forum to interact. now the csm no longer posts there as they have their internal forums where they hash things out. perhaps we need a csm to the csm to find out what they are up to.i don't think you need to worry too much. the csm has had an internal forum for over 2 years, although it is getting used a lot more now than it was. a lot of what goes on in there is nda stuff that we couldn't discuss anyway.i am quite happy to give my opinion on any topic, to the extent that the nda allows, and i" , "fot those of you bleating about imagined nda scandals as you attempt to cast yourselves as the julian assange of eve, here's a quote from the winter summit thread:originally by: sokrateszday 3post dominion 0.0 (3hrs!)if i had to fly to iceland only for this session i would have done it. we had gathered a list of items and prepared it a bit. important things we went over were supercaps, force projection, empire building, profitability of 0.0, objectives for small gangs and of course sovereingty.the csm spent 3 hours talking to ccp about how dominion had changed 0.0, and the first thing on sokratesz's list is supercaps. its not hard to figure out the nature of the discussion.on the other hand, maybe you're right, and the csm's priority for this discussion was to talk about how underpowered and useless supercarriers are and how they needed triple the ehp and dps from their current levels?(it wasn't)")
I want a data frame as a result that contains the words and the number of times they occur.
So result should look something like:
word count
a 300
and 260
be 200
... ...
... ...
What I tried to do was use tm:
library(tm)
corpus <- VCorpus(VectorSource(posts))
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, removePunctuation)
m <- DocumentTermMatrix(corpus)
Running findFreqTerms(m, lowfreq = 0, highfreq = Inf) just gives me the words, so I understand it's a sparse matrix. How do I extract the words and their frequencies?
Is there an easier way to do this, maybe by not using tm at all?
posts = c("originally by: cearainmy only concern with csm is they seem a bit insulated from players. they have private message boards where it appears most of their work goes on. i would bet they are posting more there than in jita speakers corner. i think that is unfortunate because its hard to know who to vote for if you never really see what positions they hold. its sort of like ccp used to post here on the forums then they stopped. so they got a csm to represent players and use jita park forum to interact. now the csm no longer posts there as they have their internal forums where they hash things out. perhaps we need a csm to the csm to find out what they are up to.i don't think you need to worry too much. the csm has had an internal forum for over 2 years, although it is getting used a lot more now than it was. a lot of what goes on in there is nda stuff that we couldn't discuss anyway.i am quite happy to give my opinion on any topic, to the extent that the nda allows, and i" , "fot those of you bleating about imagined nda scandals as you attempt to cast yourselves as the julian assange of eve, here's a quote from the winter summit thread:originally by: sokrateszday 3post dominion 0.0 (3hrs!)if i had to fly to iceland only for this session i would have done it. we had gathered a list of items and prepared it a bit. important things we went over were supercaps, force projection, empire building, profitability of 0.0, objectives for small gangs and of course sovereingty.the csm spent 3 hours talking to ccp about how dominion had changed 0.0, and the first thing on sokratesz's list is supercaps. its not hard to figure out the nature of the discussion.on the other hand, maybe you're right, and the csm's priority for this discussion was to talk about how underpowered and useless supercarriers are and how they needed triple the ehp and dps from their current levels?(it wasn't)")
posts <- gsub("[[:punct:]]", '', posts) # remove punctuations
posts <- gsub("[[:digit:]]", '', posts) # remove numbers
word_counts <- as.data.frame(table(unlist(strsplit(posts, " ")))) # split vector by space
word_counts <- with(word_counts, word_counts[ Var1 != "", ] ) # remove empty characters
head(word_counts)
# Var1 Freq
# 2 a 8
# 3 about 3
# 4 allows 1
# 5 although 1
# 6 am 1
# 7 an 1
Plain R solution, assuming all words are separated by space:
words <- strsplit(posts, " ", fixed = T)
words <- unlist(words)
counts <- table(words)
names(counts) holds the words, and the values are the counts.
You might want to use gsub to get rid of (),.?: and 's, 't or 're as in your example. As in:
posts <- gsub("'t|'s|'t|'re", "", posts)
posts <- gsub("[(),.?:]", " ", posts)
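To go from that named table to the word/count data frame shown in the question, sorted by frequency, a small finishing sketch:
word_counts <- as.data.frame(counts, stringsAsFactors = FALSE)
names(word_counts) <- c("word", "count")
word_counts <- word_counts[order(word_counts$count, decreasing = TRUE), ]
head(word_counts)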
You've got two options, depending on whether you want word counts per document or across all documents.
All Documents
library(dplyr)
count <- as.data.frame(t(inspect(m)))
sel_cols <- colnames(count)
count$word <- rownames(count)
rownames(count) <- seq(length = nrow(count))
count$count <- rowSums(count[,sel_cols])
count <- count %>% select(word,count)
count <- count[order(count$count, decreasing=TRUE), ]
### RESULT of head(count)
# word count
# 140 the 14
# 144 they 10
# 4 and 9
# 25 csm 7
# 43 for 5
# 55 had 4
This should capture occurrences across all documents (by use of rowSums).
Per Document
I would suggesting using the tidytext package, if you want word frequency per document.
library(tidytext)
m_td <- tidy(m)
The tidytext package allows fairly intuitive text mining, including tokenization. It is designed to work in a tidyverse pipeline, so it supplies a list of stop words ("a", "the", "to", etc.) to exclude with dplyr::anti_join. Here, you might do
library(dplyr) # or if you want it all, `library(tidyverse)`
library(tidytext)
data_frame(posts) %>%
  unnest_tokens(word, posts) %>%
  anti_join(stop_words) %>%
  count(word, sort = TRUE)
## # A tibble: 101 × 2
## word n
## <chr> <int>
## 1 csm 7
## 2 0.0 3
## 3 nda 3
## 4 bit 2
## 5 ccp 2
## 6 dominion 2
## 7 forum 2
## 8 forums 2
## 9 hard 2
## 10 internal 2
## # ... with 91 more rows

How do I generate a dataframe displaying the number of unique pairs between two vectors, for each unique value in one of the vectors?

First of all, I apologize for the title. I really don't know how to succinctly explain this issue in one sentence.
I have a dataframe where each row represents some aspect of a hospital visit by a patient. A single patient might have thousands of rows for dozens of hospital visits, and each hospital visit could account for several rows.
One column is Medical.Record.Number, which corresponds to Patient IDs, and the other is Patient.ID.Visit, which corresponds to an ID for an individual hospital visit. I am trying to calculate the number of hospital visits each patient has had.
For example:
Medical.Record.Number    Patient.ID.Visit
AAAXXX           1111
AAAXXX           1112
AAAXXX           1113
AAAZZZ           1114
AAAZZZ           1114
AAABBB           1115
AAABBB           1116
would produce the following:
Medical.Record.Number   Number.Of.Visits
AAAXXX          3
AAAZZZ          1
AAABBB          2
The solution I am currently using is the following, where "data" is my dataframe:
#this function returns the number of unique hospital visits associated with the
#supplied record number
countVisits <- function(record.number){
  visits.by.number <- data$Patient.ID.Visit[which(data$Medical.Record.Number == record.number)]
  return(length(unique(visits.by.number)))
}
recordNumbers <- unique(data$Medical.Record.Number)
visits <- integer()
for (record in recordNumbers){
  visits <- c(visits, countVisits(record))
}
visit.counts <- data.frame(recordNumbers, visits)
This works, but it is pretty slow. I am dealing with potentially millions of rows of data, so I'd like something efficient. From what little I know about R, I know there's usually a faster way to do things without using a for-loop.
This essentially looks like a table() operation after you take out duplicates. First, some sample data
#sample data
dd<-read.table(text="Medical.Record.Number Patient.ID.Visit
AAAXXX 1111
AAAXXX 1112
AAAXXX 1113
AAAZZZ 1114
AAAZZZ 1114
AAABBB 1115
AAABBB 1116", header=T)
then you could do
tt <- table(Medical.Record.Number=unique(dd)$Medical.Record.Number)
as.data.frame(tt, responseName="Number.Of.Visits") #to get a data.frame rather than named vector (table)
# Medical.Record.Number Number.Of.Visits
# 1 AAABBB 2
# 2 AAAXXX 3
# 3 AAAZZZ 1
Or you could also think of this as an aggregation problem
aggregate(Patient.ID.Visit~Medical.Record.Number, dd, function(x) length(unique(x)))
# Medical.Record.Number Patient.ID.Visit
# 1 AAABBB 2
# 2 AAAXXX 3
# 3 AAAZZZ 1
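For completeness, the same aggregation is a near one-liner with dplyr as well (a sketch using the sample frame dd from above); n_distinct() counts the unique visit IDs per record number.
library(dplyr)

dd %>%
  group_by(Medical.Record.Number) %>%
  summarise(Number.Of.Visits = n_distinct(Patient.ID.Visit))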
There are many ways to do this; @MrFlick provided a handful of perfectly valid approaches. Personally I'm fond of the data.table package. It's faster on large data frames and I find the logic to be more intuitive than the base functions. I'd check it out if you are having problems with execution time.
library(data.table)
med.dt <- data.table(med_tbl)
num.visits.dt <- med.dt[ , .(num_visits = length(unique(Patient.ID.Visit))),
                         by = Medical.Record.Number]
data.table should be much faster than data.frame on large tables.
