Searching one .CSV File with the Columns of another .CSV File - r

I'm working with two .csv files. One is a very large (~5 to 6 GB) data set from the government's Open Payments Data, which is free and open to the public. It lists all of the disclosed payments from industry to physicians.
The second file is also large; it lists the disclosed payments for physicians at a particular institution.
My goals:
I'd like to filter the Open Payments Data down to just the physicians listed in my second file. Is there any way to do that? The Open Payments Data is inconsistent, mixing uppercase and lowercase.
What I've done so far:
I've been able to parse out the Open Payments Data to just include the state of the physicians I'm looking for. I've also imported both of these .csv files into R and named them accordingly.
I'm taking a course in R right now but it's been no help ... and most of the answers I've found online are for smaller sets of data. The data I'm working with has ~500,000 rows! Thank you in advance for your insight.
Edit: This is head(mydata)
Physician_Profile_ID Physician_First_Name
1 377519 KELLI
2 377519 KELLI
3 377519 KELLI
4 272641 ABDUL
5 272641 ABDUL
6 272641 ABDUL
Physician_Middle_Name Physician_Last_Name
1 A AABY
2 A AABY
3 A AABY
4 A AADAM
5 A AADAM
6 AADAM
Physician_Name_Suffix
1
2
3
4
5
6
Submitting_Applicable_Manufacturer_or_Applicable_GPO_Name
1 BioHorizons Implant Systems Inc.
2 BioHorizons Implant Systems Inc.
3 BioHorizons Implant Systems Inc.
4 APOLLO ENDOSURGERY INC
5 APOLLO ENDOSURGERY INC
6 BOSTON SCIENTIFIC CORPORATION
Applicable_Manufacturer_or_Applicable_GPO_Making_Payment_Name
1 BioHorizons Implant Systems Inc.
2 BioHorizons Implant Systems Inc.
3 BioHorizons Implant Systems Inc.
4 Apollo Endosurgery Inc
5 APOLLO ENDOSURGERY INC
6 Boston Scientific Corporation
Total_Amount_of_Payment_USDollars Date_of_Payment
1 11.55 6/17/2014
2 187.50 6/4/2014
3 222.24 5/23/2014
4 60.20 5/4/2014
5 110.15 7/28/2014
6 12.36 12/10/2014
Form_of_Payment_or_Transfer_of_Value
1 In-kind items and services
2 In-kind items and services
3 In-kind items and services
4 In-kind items and services
5 In-kind items and services
6 In-kind items and services
Nature_of_Payment_or_Transfer_of_Value City_of_Travel
1 Food and Beverage
2 Gift
3 Education
4 Food and Beverage
5 Food and Beverage
6 Food and Beverage
State_of_Travel Country_of_Travel
1
2
3
4
5
6
And this is head(institution_data, 2):
DB.ID Last.Name First.Name
1 12345 Johnson John
2 12354 Twain Mark
Names have been changed for confidentiality. DB ID != Physician_ID unfortunately.

A list (vector actually) of physician IDs could be constructed:
PHY_ID <- unique(
institution_data$DB.ID[ institution_data$DB.ID %in% mydata$Physician_Profile_ID ] )
Then extract the data from the main file using the matches to that vector:
chargedata <- mydata[ mydata$Physician_Profile_ID %in% PHY_ID , ]
You could also use match with the same logic, but %in% uses match under the hood, and code written with %in% is generally easier to read. If the IDs were not supposed to match (which you should have stated if that were the case), then name matching could be attempted, but it would make sense to add additional criteria, such as state or nearby zip code; see the sketch below.
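Since the question's edit notes that DB.ID and Physician_Profile_ID do not correspond, here is a minimal sketch of name matching; it assumes the column names shown above and uses toupper() to cope with the inconsistent casing mentioned in the question:
# build a normalized "LAST FIRST" key for each file
key_inst <- paste(toupper(trimws(institution_data$Last.Name)),
                  toupper(trimws(institution_data$First.Name)))
key_main <- paste(toupper(trimws(mydata$Physician_Last_Name)),
                  toupper(trimws(mydata$Physician_First_Name)))
# keep only rows of the big file whose normalized name appears at the institution
chargedata <- mydata[ key_main %in% key_inst , ]
Be aware that name collisions across institutions are possible, which is why adding state or zip code as a further criterion is advisable.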

Related

R: How to Prepare Data for LDA/Text Analysis

I am working with the R programming language.
I would like to perform BTM (Biterm Topic Model, a variant of LDA (Latent Dirichlet Allocation) for small text datasets) on some text data. I am following this tutorial: https://cran.r-project.org/web/packages/BTM/readme/README.html
When I look at the dataset ("brussels_reviews_anno") used in this tutorial, it looks something like this (I cannot recognize the format of this data!):
library(udpipe)
library(BTM)
data("brussels_reviews_anno", package = "udpipe")
head(brussels_reviews_anno)
doc_id language sentence_id token_id token lemma upos xpos
1 32198807 es 1 1 Gwen gwen NOUN NNP
2 32198807 es 1 2 fue ser VERB VB
3 32198807 es 1 3 una un DET DT
4 32198807 es 1 4 magnifica magnifica NOUN NN
5 32198807 es 1 5 anfitriona anfitriono ADJ JJ
6 32198807 es 1 6 . . PUNCT .
My dataset ("my_data") is in the current format - I manually create a text dataset for this example using reviews of fast food restaurants found on the internet:
my_data = structure(list(id = 1:8, reviews = c("I guess the employee decided to buy their lunch with my card my card hoping I wouldn't notice but since it took so long to run my car I want to head and check my bank account and sure enough they had bought food on my card that I did not receive leave. Had to demand for and for a refund because they acted like it was my fault and told me the charges are still pending even though they are for 2 different amounts.",
"I went to McDonald's and they charge me 50 for Big Mac when I only came with 49. The casher told me that I can't read correctly and told me to get glasses. I am file a report on your casher and now I'm mad.",
"I really think that if you can buy breakfast anytime then I should be able to get a cheeseburger anytime especially since I really don't care for breakfast food. I really like McDonald's food but I preferred tree lunch rather than breakfast. Thank you thank you thank you.",
"I guess the employee decided to buy their lunch with my card my card hoping I wouldn't notice but since it took so long to run my car I want to head and check my bank account and sure enough they had bought food on my card that I did not receive leave. Had to demand for and for a refund because they acted like it was my fault and told me the charges are still pending even though they are for 2 different amounts.",
"Never order McDonald's from Uber or Skip or any delivery service for that matter, most particularly one on Elgin Street and Rideau Street, they never get the order right. Workers at either of these locations don't know how to follow simple instructions. Don't waste your money at these two locations.",
"Employees left me out in the snow and wouldn’t answer the drive through. They locked the doors and it was freezing. I asked the employee a simple question and they were so stupid they answered a completely different question. Dumb employees and bad food.",
"McDonalds food was always so good but ever since they add new/more crispy chicken sandwiches it has come out bad. At first I thought oh they must haven't had a good day but every time I go there now it's always soggy, and has no flavor. They need to fix this!!!",
"I just ordered the new crispy chicken sandwich and I'm very disappointed. Not only did it taste horrible, but it was more bun than chicken. Not at all like the commercial shows. I hate sweet pickles and there were two slices on my sandwich. I wish I could add a photo to show the huge bun and tiny chicken."
)), class = "data.frame", row.names = c(NA, -8L))
Can someone please show me how to transform my dataset so that I can perform BTM analysis on it and create visualizations similar to those in the tutorial?
Thanks!
Additional References:
https://rforanalytics.com/11-7-topic-modelling.html
The class of brussels_reviews_anno is just a regular data.frame. That structure is generated by the function udpipe() from the package udpipe.
Below I provide a working example (apart from the path where I save the language model) that shows how to replicate a similar data structure.
Please keep in mind that udpipe() does a lot. The reason you see many more columns in the final data.frame out is that I did not tweak any of the function's parameters or delete any of the columns.
Overall, to get started with BTM() you need to tokenize your textual data. That's one of the things you can do with the package udpipe.
Hope this helped!
library(udpipe)
library(BTM)
data("brussels_reviews_anno", package = "udpipe")
head(brussels_reviews_anno)
#> doc_id language sentence_id token_id token lemma upos xpos
#> 1 32198807 es 1 1 Gwen gwen NOUN NNP
#> 2 32198807 es 1 2 fue ser VERB VB
#> 3 32198807 es 1 3 una un DET DT
#> 4 32198807 es 1 4 magnifica magnifica NOUN NN
#> 5 32198807 es 1 5 anfitriona anfitriono ADJ JJ
#> 6 32198807 es 1 6 . . PUNCT .
my_data = structure(list(id = 1:8, reviews = c("I guess the employee decided to buy their lunch with my card my card hoping I wouldn't notice but since it took so long to run my car I want to head and check my bank account and sure enough they had bought food on my card that I did not receive leave. Had to demand for and for a refund because they acted like it was my fault and told me the charges are still pending even though they are for 2 different amounts.",
"I went to McDonald's and they charge me 50 for Big Mac when I only came with 49. The casher told me that I can't read correctly and told me to get glasses. I am file a report on your casher and now I'm mad.",
"I really think that if you can buy breakfast anytime then I should be able to get a cheeseburger anytime especially since I really don't care for breakfast food. I really like McDonald's food but I preferred tree lunch rather than breakfast. Thank you thank you thank you.",
"I guess the employee decided to buy their lunch with my card my card hoping I wouldn't notice but since it took so long to run my car I want to head and check my bank account and sure enough they had bought food on my card that I did not receive leave. Had to demand for and for a refund because they acted like it was my fault and told me the charges are still pending even though they are for 2 different amounts.",
"Never order McDonald's from Uber or Skip or any delivery service for that matter, most particularly one on Elgin Street and Rideau Street, they never get the order right. Workers at either of these locations don't know how to follow simple instructions. Don't waste your money at these two locations.",
"Employees left me out in the snow and wouldn’t answer the drive through. They locked the doors and it was freezing. I asked the employee a simple question and they were so stupid they answered a completely different question. Dumb employees and bad food.",
"McDonalds food was always so good but ever since they add new/more crispy chicken sandwiches it has come out bad. At first I thought oh they must haven't had a good day but every time I go there now it's always soggy, and has no flavor. They need to fix this!!!",
"I just ordered the new crispy chicken sandwich and I'm very disappointed. Not only did it taste horrible, but it was more bun than chicken. Not at all like the commercial shows. I hate sweet pickles and there were two slices on my sandwich. I wish I could add a photo to show the huge bun and tiny chicken."
)), class = "data.frame", row.names = c(NA, -8L))
# download a language model
udpipe_download_model("english-ewt", model_dir = "~/Desktop/")
#> Downloading udpipe model from https://raw.githubusercontent.com/jwijffels/udpipe.models.ud.2.5/master/inst/udpipe-ud-2.5-191206/english-ewt-ud-2.5-191206.udpipe to ~/Desktop//english-ewt-ud-2.5-191206.udpipe
#> - This model has been trained on version 2.5 of data from https://universaldependencies.org
#> - The model is distributed under the CC-BY-SA-NC license: https://creativecommons.org/licenses/by-nc-sa/4.0
#> - Visit https://github.com/jwijffels/udpipe.models.ud.2.5 for model license details.
#> - For a list of all models and their licenses (most models you can download with this package have either a CC-BY-SA or a CC-BY-SA-NC license) read the documentation at ?udpipe_download_model. For building your own models: visit the documentation by typing vignette('udpipe-train', package = 'udpipe')
#> Downloading finished, model stored at '~/Desktop//english-ewt-ud-2.5-191206.udpipe'
#> language file_model
#> 1 english-ewt ~/Desktop//english-ewt-ud-2.5-191206.udpipe
#> url
#> 1 https://raw.githubusercontent.com/jwijffels/udpipe.models.ud.2.5/master/inst/udpipe-ud-2.5-191206/english-ewt-ud-2.5-191206.udpipe
#> download_failed download_message
#> 1 FALSE OK
# load in the environment
eng_model = udpipe_load_model("~/Desktop/english-ewt-ud-2.5-191206.udpipe")
# apply the tokenization
out = udpipe(my_data$reviews, object = eng_model)
head(out)
#> doc_id paragraph_id sentence_id
#> 1 doc1 1 1
#> 2 doc1 1 1
#> 3 doc1 1 1
#> 4 doc1 1 1
#> 5 doc1 1 1
#> 6 doc1 1 1
#> sentence
#> 1 I guess the employee decided to buy their lunch with my card my card hoping I wouldn't notice but since it took so long to run my car I want to head and check my bank account and sure enough they had bought food on my card that I did not receive leave.
#> 2 I guess the employee decided to buy their lunch with my card my card hoping I wouldn't notice but since it took so long to run my car I want to head and check my bank account and sure enough they had bought food on my card that I did not receive leave.
#> 3 I guess the employee decided to buy their lunch with my card my card hoping I wouldn't notice but since it took so long to run my car I want to head and check my bank account and sure enough they had bought food on my card that I did not receive leave.
#> 4 I guess the employee decided to buy their lunch with my card my card hoping I wouldn't notice but since it took so long to run my car I want to head and check my bank account and sure enough they had bought food on my card that I did not receive leave.
#> 5 I guess the employee decided to buy their lunch with my card my card hoping I wouldn't notice but since it took so long to run my car I want to head and check my bank account and sure enough they had bought food on my card that I did not receive leave.
#> 6 I guess the employee decided to buy their lunch with my card my card hoping I wouldn't notice but since it took so long to run my car I want to head and check my bank account and sure enough they had bought food on my card that I did not receive leave.
#> start end term_id token_id token lemma upos xpos
#> 1 1 1 1 1 I I PRON PRP
#> 2 3 7 2 2 guess guess VERB VBP
#> 3 9 11 3 3 the the DET DT
#> 4 13 20 4 4 employee employee NOUN NN
#> 5 22 28 5 5 decided decide VERB VBD
#> 6 30 31 6 6 to to PART TO
#> feats head_token_id dep_rel deps misc
#> 1 Case=Nom|Number=Sing|Person=1|PronType=Prs 2 nsubj <NA> <NA>
#> 2 Mood=Ind|Tense=Pres|VerbForm=Fin 0 root <NA> <NA>
#> 3 Definite=Def|PronType=Art 4 det <NA> <NA>
#> 4 Number=Sing 5 nsubj <NA> <NA>
#> 5 Mood=Ind|Tense=Past|VerbForm=Fin 2 ccomp <NA> <NA>
#> 6 <NA> 7 mark <NA> <NA>
Created on 2022-09-20 by the reprex package (v2.0.1)
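From here, a minimal sketch of the actual BTM fit, following the pattern in the README linked above (k = 3 is an arbitrary choice for eight reviews; tune it for real data):
# keep informative parts of speech, then pass doc_id/lemma pairs to BTM()
traindata <- subset(out, upos %in% c("NOUN", "PROPN", "ADJ"))
traindata <- traindata[, c("doc_id", "lemma")]
set.seed(123)
model <- BTM(traindata, k = 3, beta = 0.01, iter = 1000, trace = 100)
terms(model)   # top terms per topic
For the visualizations, the README uses plot(model), which additionally requires the textplot, ggraph and concaveman packages.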

Comparing rows in the same R dataframe

I want to take the nth row in a dataframe, compare it to all rows that are not the nth row, and return how many of its columns match and/or mismatch.
I tried the match function and ifelse for single observations but I haven't been able to replicate it for the entire dataframe.
The dataset Superstore contains the order priority, customer name, ship mode, customer segment and category. It looks like this:
> head(df2)
Order.Priority Customer.Name Ship.Mode Customer.Segment Product.Category
1 Not Specified Dana Teague Regular Air Corporate Office Supplies
2 Critical Vanessa Boyer Regular Air Consumer Office Supplies
3 Critical Wesley Tate Regular Air Corporate Technology
4 High Brian Grady Delivery Truck Corporate Furniture
5 Medium Kristine Connolly Delivery Truck Corporate Furniture
6 High Emily Britt Regular Air Corporate Office Supplies
The code I tried (extracting relevant columns):
df <- read.csv("Superstore.csv", header = TRUE)
df2 <- df[,c(2,4,5,6,7)]
match(df2[2,],df2[1,],nomatch = 0)
This returns:
> match(df2[2,],df2[1,],nomatch = 0)
[1] 0 0 3 0 5
Using ifelse I get:
> ifelse(df2[1,]==df2[2,],1,0)
Order.Priority Customer.Name Ship.Mode Customer.Segment Product.Category
1 0 0 1 0 1
Like I said, this is exactly the result I need, but I haven't been able to replicate it for the whole dataframe.
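One way to extend that same comparison to every other row is to wrap it in sapply; a minimal sketch that counts, for a chosen row n, how many columns each remaining row matches:
n <- 1                                    # the row to compare against
others <- setdiff(seq_len(nrow(df2)), n)
match_counts <- sapply(others, function(i) sum(df2[n, ] == df2[i, ]))
names(match_counts) <- others             # match_counts["i"] = columns shared with row i
Subtracting match_counts from ncol(df2) gives the mismatch counts.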

R data cleaning

I have a dataframe (df1) scraped as a single column of data:
1
2 Amazon Pantry
3 Best Sellerin Soaps & Hand Wash
4
5 Palmolive Hygiene-Plus Sensitive Liquid Hand Wash, 300ml
6 Palmolive Hygiene-Plus Sensitive Liquid Hand Wash, 300ml
7 £0.90
8 ?
9
10 Palmolive Naturals Nourishing Liquid Hand Wash, 300ml
11 Palmolive Naturals Nourishing Liquid Hand Wash, 300ml
12 £0.90
13 ?
14
15 L'Oreal Men Expert Carbon Protect Deodorant 250ml
16 L'Oreal Men Expert Carbon Protect Deodorant 250ml
17 £1.50
To clean the data, I tried the commands below to get product and pricing information into two separate columns. Can someone let me know if there is an alternative way of doing it?
install.packages("splitstackshape")
library(splitstackshape)
newdf <- cSplit(df1, "Amazon_Normal_Text2", direction = "long")
This is just a thought process (a sketch follows the list):
Every time there's an "ml", extract backward from the "ml" until there is a space, and store that in a volume variable (substr).
Extract from the "£" to the end of the string, and store that in a price variable (grep, regex, nchar).
Extract from the beginning of the string up to where the volume occurs, and store that in a product variable (substr, nchar).
Look into substr, nchar, grep, and regex.
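A minimal sketch of that idea, assuming the scraped column is named Amazon_Normal_Text2 as in the cSplit call, each price line starts with £, and the product name sits on the line just above its price:
txt <- trimws(as.character(df1$Amazon_Normal_Text2))
price_rows <- grep("^£", txt)                     # lines holding a price
newdf <- data.frame(product = txt[price_rows - 1],
                    price   = as.numeric(sub("^£", "", txt[price_rows])))
# pull the volume out of the product name (returns the name unchanged if no "ml")
newdf$volume <- sub(".*?([0-9]+ ?ml).*", "\\1", newdf$product, perl = TRUE)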

Consolidate data table factor levels in R

Suppose I have a very large data table, one column of which is "ManufacturerName". The data was not entered uniformly, so it's pretty messy. For example, there may be observations like:
ABC Inc
ABC, Inc
ABC Incorporated
A.B.C.
...
Joe Shmos Plumbing
Joe Shmo Plumbing
...
I am looking for an automated way in R to consolidate similar names into one factor level. I have learned the syntax to do this manually, for example:
levels(df$ManufacturerName) <- list(ABC=c("ABC", "A.B.C", ....), JoeShmoPlumbing=c(...))
But I'm trying to think of an automated solution. Obviously it's not going to be perfect as I can't anticipate every type of permutation in the data table. But maybe something that searches the factor levels, strips out punctuation/special characters, and creates levels based on common first words. Or any other ideas. Thanks!
Look into the stringdist package. For starters, you could do something like this:
library(stringdist)
x <- c("ABC Inc", "ABC, Inc", "ABC Incorporated", "A.B.C.", "Joe Shmos Plumbing", "Joe Shmo Plumbing")
d <- stringdistmatrix(x)
# 1 2 3 4 5
# 2 1
# 3 9 10
# 4 6 7 15
# 5 16 16 16 18
# 6 15 15 15 17 1
For more help, see ?stringdistmatrix or do searches on StackOverflow for fuzzy matching, approximate string matching, string distance functions, and agrep.
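To turn those distances into consolidated levels automatically, one hedged follow-up is to cluster the distance matrix and treat each cluster as a level; the cutoff height below is an arbitrary assumption you would tune against your data:
hc <- hclust(d)               # d is the stringdistmatrix() result from above
groups <- cutree(hc, h = 5)   # merge names within 5 edits of each other
split(x, groups)              # inspect which names fell into each group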

How can I de- and re-classify data?

Some of the data I work with contains sensitive information (names of persons, dates, locations, etc.). But I sometimes need to share "the numbers" with other people to get help with statistical analysis, or to process the data on more powerful machines where I can't control who looks at it.
Ideally I would like to work like this:
Read the data into R (look at it, clean it, etc.)
Select a data frame that I want to de-classify, run it through a package and receive two "files": the de-classified data and a translation-file. The latter I will keep myself.
The de-classified data can be shared, manipulated and processed without worries.
I re-classify the processed data together with the translation-file.
I suppose that this can also be useful when uploading data for processing "in the cloud" (Amazon, etc.).
Have you been in this situation? I first thought about writing a "randomize" function myself, but then I realized there is no end to how sophisticated this can get (for example, offsetting time stamps without losing their order). Maybe there is already an established method or tool?
Thanks to everyone who contributes to [r]-tag here at Stack Overflow!
One way to do this is with match. First I make a small dataframe:
foo <- data.frame( person=c("Mickey","Donald","Daisy","Scrooge"), score=rnorm(4))
foo
person score
1 Mickey -0.07891709
2 Donald 0.88678481
3 Daisy 0.11697127
4 Scrooge 0.31863009
Then I make a key:
set.seed(100)
key <- as.character(foo$person[sample(1:nrow(foo))])
Obviously, you must save this key somewhere. Now I can encode the persons:
foo$person <- match(foo$person, key)
foo
person score
1 2 0.3186301
2 1 -0.5817907
3 4 0.7145327
4 3 -0.8252594
If I want the person names again I can index the key:
key[foo$person]
[1] "Mickey" "Donald" "Daisy" "Scrooge"
Or use transform; this also works if the data is changed, as long as the person ID remains the same:
foo <-rbind(foo,foo[sample(1:4),],foo[sample(1:4,2),],foo)
foo
person score
1 2 0.3186301
2 1 -0.5817907
3 4 0.7145327
4 3 -0.8252594
21 1 -0.5817907
41 3 -0.8252594
31 4 0.7145327
15 2 0.3186301
32 4 0.7145327
16 2 0.3186301
11 2 0.3186301
12 1 -0.5817907
13 4 0.7145327
14 3 -0.8252594
transform(foo, person=key[person])
person score
1 Mickey 0.3186301
2 Donald -0.5817907
3 Daisy 0.7145327
4 Scrooge -0.8252594
21 Donald -0.5817907
41 Scrooge -0.8252594
31 Daisy 0.7145327
15 Mickey 0.3186301
32 Daisy 0.7145327
16 Mickey 0.3186301
11 Mickey 0.3186301
12 Donald -0.5817907
13 Daisy 0.7145327
14 Scrooge -0.8252594
Can you simply assign a GUID to each row from which you have removed all of the sensitive information? As long as your colleagues lacking the security clearance don't mess with the GUID, you'd be able to incorporate any changes and additions they make simply by joining on the GUID. Then it becomes a matter of generating bogus ersatz values for the columns whose data you have purged: LastName1, LastName2, City1, City2, etc.
EDIT: You'd have a table for each purged column (e.g. City, State, Zip, FirstName, LastName), each containing the distinct set of real classified values in that column alongside an integer value. So "Jones" could be represented in the sanitized dataset as, say, LastName22, "Schenectady" as City343, and "90210" as Zipcode716. This gives your colleagues valid values to work with (e.g. they'd have the same number of distinct cities as your real data, just with anonymized names), and the interrelationships of the anonymized data are preserved.
EDIT2: If the goal is to give your colleagues sanitized data that is still statistically meaningful, then date columns would require special processing. E.g. if your colleagues need to do statistical computations on a person's age, you have to give them something close to the original date: not so close that it could be revealing, yet not so far that it would skew the analysis.
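A minimal sketch of that per-column lookup idea (the data frame dat and its LastName column are illustrative names, not from the question):
# build a private lookup table mapping real values to surrogate labels
lut <- data.frame(real = unique(dat$LastName), stringsAsFactors = FALSE)
lut$alias <- paste0("LastName", seq_len(nrow(lut)))        # LastName1, LastName2, ...
dat$LastName <- lut$alias[match(dat$LastName, lut$real)]   # sanitize; keep lut private
# re-classify later: dat$LastName <- lut$real[match(dat$LastName, lut$alias)]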
Sounds like a Statistical Disclosure Control problem. Take a look at the sdcMicro package.
EDIT: Just realized that you have a slightly different problem. The point of Statistical Disclosure Control is to "damage" data so that the risk of disclosure is reduced. By "damaging" the data you lose some information; this is the price you pay for the reduced risk of disclosure. Your data will contain less information, so your analysis may give different or weaker results than the same analysis done on the original data.
It depends on what you are going to do with your data.
