R data cleaning

I have a data frame (df1) scraped as a single column of data:
1
2 Amazon Pantry
3 Best Sellerin Soaps & Hand Wash
4
5 Palmolive Hygiene-Plus Sensitive Liquid Hand Wash, 300ml
6 Palmolive Hygiene-Plus Sensitive Liquid Hand Wash, 300ml
7 £0.90
8 ?
9
10 Palmolive Naturals Nourishing Liquid Hand Wash, 300ml
11 Palmolive Naturals Nourishing Liquid Hand Wash, 300ml
12 £0.90
13 ?
14
15 L'Oreal Men Expert Carbon Protect Deodorant 250ml
16 L'Oreal Men Expert Carbon Protect Deodorant 250ml
17 £1.50
To clean the data I tried the commands below, aiming to get the product and pricing information into two separate columns. Can someone let me know if there is an alternative way of doing it?
install.packages("splitstackshape")
library(splitstackshape)
newdf <- cSplit(df1, "Amazon_Normal_Text2", direction = "long")

this is just a thought process... (a sketch implementing it follows the list)
every time there's an "ml", extract backward from "ml" until there is a space and store that in a volume variable (substr)
extract from "£" to the end of the string and store that in a price variable (grep, regex, nchar)
extract from the beginning of the string up to where the volume occurs into a product variable (substr, nchar)
look into substr, nchar, grep, regex
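Here is a hedged sketch of that thought process using stringr, assuming the scraped text lives in a character column named Amazon_Normal_Text2 (as in the cSplit call above), that the leading numbers shown are row indices rather than part of the strings, that every product line contains a volume like "300ml" and appears twice, and that every price line starts with "£":
library(stringr)

txt <- as.character(df1$Amazon_Normal_Text2)

# product lines contain a volume such as "300ml"; each appears twice, so de-duplicate
prod_lines <- unique(txt[str_detect(txt, "\\d+\\s?ml")])
# price lines start with a pound sign
price_lines <- txt[str_detect(txt, "^£")]

newdf <- data.frame(
  product = str_trim(str_remove(prod_lines, ",?\\s*\\d+\\s?ml.*$")),  # text before the volume
  volume  = str_extract(prod_lines, "\\d+\\s?ml"),                    # e.g. "300ml"
  price   = as.numeric(str_remove(price_lines, "£"))                  # e.g. 0.90
)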

Related

Find every combination of elements in a column of a data frame that adds up to a given sum in R

I'm trying to ease my life by writing a menu creator, which is supposed to assemble a weekly menu from a list of my favourite dishes, to get a little more variety into my life.
I gave every dish a value of how many days it approximately lasts and tried to arrange the dishes to end up with menus worth 7 days of food.
I've already tried knapsack-function solutions from here, including dynamic programming, but I'm not experienced enough to get the hang of them: all of these solutions target only the most efficient option, not every combination that fills the knapsack.
library(adagio)
#create some data
dish <-c('Schnitzel','Burger','Steak','Salad','Falafel','Salmon','Mashed potatoes','MacnCheese','Hot Dogs')
days_the_food_lasts <- c(2,2,1,1,3,1,2,2,4)
price_of_the_food <- c(20,20,40,10,15,18,10,15,15)
data <- data.frame(dish,days_the_food_lasts,price_of_the_food)
#give each dish a distinct id
data$rownumber <- (1:nrow(data))
#set limit for how many days should be covered with the dishes
food_needed_for_days <- 7
#knapsack function from the adagio library as an example; all other solutions I found to the knapsack problem behaved the same
most_expensive_food <- knapsack(days_the_food_lasts, price_of_the_food, food_needed_for_days)
data[data$rownumber %in% most_expensive_food$indices, ]
#output
dish days_the_food_lasts price_of_the_food rownumber
1 Schnitzel 2 20 1
2 Burger 2 20 2
3 Steak 1 40 3
4 Salad 1 10 4
6 Salmon 1 18 6
Simplified:
I need a solution to a single-objective knapsack problem that returns all possible combinations of dishes adding up to 7 days of food.
Thank you very much in advance
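No answer is shown for this one, so here is a hedged brute-force sketch under the question's own setup: with only nine dishes, combn() can enumerate every index combination, which is then filtered to the menus whose durations sum exactly to food_needed_for_days (this is exponential in the number of dishes, so it only suits small lists like this).
# enumerate every combination of row indices, for every subset size
all_combos <- unlist(
  lapply(seq_len(nrow(data)), function(k)
    combn(nrow(data), k, simplify = FALSE)),
  recursive = FALSE)
# keep only the combinations that add up to exactly 7 days of food
menus <- Filter(
  function(idx) sum(data$days_the_food_lasts[idx]) == food_needed_for_days,
  all_combos)
# each element of 'menus' is one valid weekly menu
lapply(menus, function(idx) as.character(data$dish[idx]))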

Text-mining including patterns and numbers

The dataset contains a free-text field with information on building plans. I need to split the content of the field into two parts: the first part contains only the number of planned buildings, the other only the type of building. I have a reference lexicon list with the types of buildings.
Example
Plans<- c("build 10 houses ","5 luxury apartments with sea view",
"renovate 20 cottages"," transform 2 bungalows and a school", "1 hotel")
Reference list
Types <-c("houses", "cottages", "bungalows", "luxury apartments")
Desired output: two columns, Number and Type, with this content:
Number Type
10 houses
5 apartments
20 cottages
2 bungalows
Tried:
matches <- unique(grep(paste(Types, collapse = "|"), Plans, value = TRUE))
I can match the plans and types, but I can't extract the numbers and types into two columns.
I tried str_split_fixed and grepl using [[:digit:]] and [[:alpha:]], but it isn't working.
Assuming there is only going to be one numeric part in each string, we can extract it by replacing all alphabetic characters with empty strings. We create the Type column by extracting whichever of the Types strings is present in Plans.
library(stringr)
data.frame(Number = as.numeric(gsub("[[:alpha:]]", "", Plans)),
           Type = str_extract(Plans, paste(Types, collapse = "|")))
# Number Type
#1 10 houses
#2 5 luxury apartments
#3 20 cottages
#4 2 bungalows
#5 1 <NA>
For the 5th row, "hotel" is not present in Types, so the output is NA; if you need to drop such cases, you can do so with is.na (see the sketch below). The trick for extracting the number from the string is taken from here.
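For instance, a minimal sketch of that filtering step:
res <- data.frame(Number = as.numeric(gsub("[[:alpha:]]", "", Plans)),
                  Type = str_extract(Plans, paste(Types, collapse = "|")))
res[!is.na(res$Type), ]  # keep only rows whose type appears in the lexicon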
You can also use strcapture from base R:
strcapture(pattern = paste0("(\\d+)\\s(", paste(Types, collapse = "|"), ")"),
           x = Plans,
           proto = data.frame(Number = numeric(), Type = character()))
Number Type
1 10 houses
2 5 luxury apartments
3 20 cottages
4 2 bungalows
5 NA <NA>

Grouping words that are similar

CompanyName <- c('Kraft', 'Kraft Foods', 'Kfraft', 'nestle', 'nestle usa', 'GM', 'general motors', 'the dow chemical company', 'Dow')
I want to get either:
CompanyName2
Kraft
Kraft
Kraft
nestle
nestle
general motors
general motors
Dow
Dow
But would be absolutely fine with:
CompanyName2
1
1
1
2
2
3
3
I see algorithms for getting the distance between two words, so if I had just one weird name I would compare it to all the other names and pick the one with the lowest distance. But I have thousands of names and want to group them all.
I do not know anything about elastic search, but would one of the functions in the elastic package or some other function help me out here?
I'm sorry there's no programming here. I know. But this is way out of my area of normal expertise.
Solution: use string distance
You're on the right track. Here is some R code to get you started:
install.packages("stringdist") # install this package
library("stringdist")
CompanyName <- c('Kraft', 'Kraft Foods', 'Kfraft', 'nestle', 'nestle usa', 'GM', 'general motors', 'the dow chemical company', 'Dow')
CompanyName = tolower(CompanyName) # otherwise case matters too much
# Calculate a string distance matrix; LCS is just one option
?"stringdist-metrics" # see others
sdm = stringdistmatrix(CompanyName, CompanyName, useNames=T, method="lcs")
Let's take a look. These are the calculated distances between strings, using the longest common subsequence (LCS) metric (try others, e.g. cosine or Levenshtein). They all measure, in essence, how many characters the strings have in common; their pros and cons are beyond this Q&A. You might look into something that gives a higher similarity value to two strings that contain the exact same substring (like "dow").
sdm[1:5,1:5]
kraft kraft foods kfraft nestle nestle usa
kraft 0 6 1 9 13
kraft foods 6 0 7 15 15
kfraft 1 7 0 10 14
nestle 9 15 10 0 4
nestle usa 13 15 14 4 0
Some visualization
# Hierarchical clustering
sdm_dist = as.dist(sdm) # convert to a dist object (you essentially already have distances calculated)
plot(hclust(sdm_dist))
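To turn the dendrogram into explicit group labels, a minimal sketch with cutree (k = 5 is picked here by eye from the plot, so treat it as an assumption):
hc <- hclust(sdm_dist)
groups <- cutree(hc, k = 5)  # named integer vector: one group id per name
data.frame(name = CompanyName, group = groups)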
Alternatively, if you want to group them explicitly into k groups, use k-medoids.
library("cluster")
clusplot(pam(sdm_dist, 5), color=TRUE, shade=F, labels=2, lines=0)
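The pam() fit also carries those numeric labels directly, which matches the "would be absolutely fine with" output above:
pam_fit <- pam(sdm_dist, k = 5)
pam_fit$clustering  # one integer group id per company name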

Consolidate data table factor levels in R

Suppose I have a very large data table, one column of which is "ManufacturerName". The data was not entered uniformly, so it's pretty messy. For example, there may be observations like:
ABC Inc
ABC, Inc
ABC Incorporated
A.B.C.
...
Joe Shmos Plumbing
Joe Shmo Plumbing
...
I am looking for an automated way in R to try and consider similar names as one factor level. I have learned the syntax to manually do this, for example:
levels(df$ManufacturerName) <- list(ABC=c("ABC", "A.B.C", ....), JoeShmoPlumbing=c(...))
But I'm trying to think of an automated solution. Obviously it's not going to be perfect as I can't anticipate every type of permutation in the data table. But maybe something that searches the factor levels, strips out punctuation/special characters, and creates levels based on common first words. Or any other ideas. Thanks!
Look into the stringdist package. For starters, you could do something like this:
library(stringdist)
x <- c("ABC Inc", "ABC, Inc", "ABC Incorporated", "A.B.C.", "Joe Shmos Plumbing", "Joe Shmo Plumbing")
d <- stringdistmatrix(x)
# 1 2 3 4 5
# 2 1
# 3 9 10
# 4 6 7 15
# 5 16 16 16 18
# 6 15 15 15 17 1
For more help, see ?stringdistmatrix or do searches on StackOverflow for fuzzy matching, approximate string matching, string distance functions, and agrep.
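As a starting point for the agrep route, a hedged sketch (it reuses x from above, is order-dependent, and max.distance needs tuning for real data): label each name with its first approximate match so near-duplicates share a label.
canonical <- vapply(x, function(nm) {
  agrep(nm, x, max.distance = 0.3, value = TRUE)[1]  # first fuzzy match as the label
}, character(1))
canonical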

Searching one .CSV File with the Columns of another .CSV File

I'm working with two Excel files. One is a very large (~5 to 6 GB) data set from the government's Open Payments Data, which is free and open for everyone to view. It lists all of the disclosed payments from industry to physicians.
The second Excel file I'm working with is also large, but it's a file that lists the disclosed payments from physicians at a particular institution.
My goals:
I'd like to filter the Open Payments Data down to just the physicians I have in my second Excel file. Is there any way to do that? The Open Payments Data is inconsistent and mixes uppercase and lowercase.
What I've done so far:
I've been able to parse out the Open Payments Data to just include the state of the physicians I'm looking for. I've also imported both of these .csv files into R and named them accordingly.
I'm taking a course in R right now, but it's been no help ... and most of the answers I've found online are for smaller data sets. The data I'm working with has ~500,000 rows! Thank you in advance for your insight.
Edit: This is head(mydata)
Physician_Profile_ID Physician_First_Name
1 377519 KELLI
2 377519 KELLI
3 377519 KELLI
4 272641 ABDUL
5 272641 ABDUL
6 272641 ABDUL
Physician_Middle_Name Physician_Last_Name
1 A AABY
2 A AABY
3 A AABY
4 A AADAM
5 A AADAM
6 AADAM
Physician_Name_Suffix
1
2
3
4
5
6
Submitting_Applicable_Manufacturer_or_Applicable_GPO_Name
1 BioHorizons Implant Systems Inc.
2 BioHorizons Implant Systems Inc.
3 BioHorizons Implant Systems Inc.
4 APOLLO ENDOSURGERY INC
5 APOLLO ENDOSURGERY INC
6 BOSTON SCIENTIFIC CORPORATION
Applicable_Manufacturer_or_Applicable_GPO_Making_Payment_Name
1 BioHorizons Implant Systems Inc.
2 BioHorizons Implant Systems Inc.
3 BioHorizons Implant Systems Inc.
4 Apollo Endosurgery Inc
5 APOLLO ENDOSURGERY INC
6 Boston Scientific Corporation
Total_Amount_of_Payment_USDollars Date_of_Payment
1 11.55 6/17/2014
2 187.50 6/4/2014
3 222.24 5/23/2014
4 60.20 5/4/2014
5 110.15 7/28/2014
6 12.36 12/10/2014
Form_of_Payment_or_Transfer_of_Value
1 In-kind items and services
2 In-kind items and services
3 In-kind items and services
4 In-kind items and services
5 In-kind items and services
6 In-kind items and services
Nature_of_Payment_or_Transfer_of_Value City_of_Travel
1 Food and Beverage
2 Gift
3 Education
4 Food and Beverage
5 Food and Beverage
6 Food and Beverage
State_of_Travel Country_of_Travel
1
2
3
4
5
6
And this is head(institution_data, 2):
DB.ID Last.Name First.Name
1 12345 Johnson John
2 12354 Twain Mark
Names have been changed for confidentiality. DB ID != Physician_ID unfortunately.
A list (vector actually) of physician IDs could be constructed:
PHY_ID <- unique(
institution_data$DB.ID[ institution_data$DB.ID %in% mydata$Physician_Profile_ID ] )
Then extract the data from the main file using the matches to that vector:
chargedata <- mydata[ mydata$Physician_Profile_ID %in% PHY_ID , ]
You could also use match with the same logic, but the %in% function uses match "under the hood" and code written with %in% is generally easier to read. If the IDs were not supposed to match (which you should have stated if that were the case), then name matching could be attempted, but it would make sense to add additional criteria, such as state or a nearby zip code.
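A hedged sketch of such name matching, assuming the institution file's First.Name/Last.Name correspond to Physician_First_Name/Physician_Last_Name in the main file (toupper() absorbs the inconsistent casing mentioned in the question):
key_main <- toupper(paste(mydata$Physician_Last_Name,
                          mydata$Physician_First_Name))
key_inst <- toupper(paste(institution_data$Last.Name,
                          institution_data$First.Name))
chargedata <- mydata[key_main %in% key_inst, ]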
