Name matching and correcting spelling error in r - r

I have a huge data table with millions of rows that consists of Merchandise code with its description. I want to assign a category to each group (based on the combination of code and description). The problem is that the description is spelled in different ways and I want to convert all the similar names into a single one. Here is an illustrative example:
ibrary(data.table)
dt <- data.table(code = c(rep(1,2),rep(2,2),rep(3,2)), name = c('McDonalds','Mc
Dnald','Macys','macy','Comcast','Com-cats'))
dt[,cat:='NA']
setkeyv(dt,c('code','name'))
dt[.(1,'McDonalds'),cat:='Restaurant']
dt[.(1,'Mc Dnald'),cat:='Restaurant']
dt[.(1,'Macys'),cat:='Department Store']
Of course in the real case, it is impossible to go through all the spelling that refer to the same word and fix them manually.
Is there a way to detect all the similar words and convert them to a single (correct) spelling?
Thanks in advance

Related

R: Searching a column in a dataframe for matches to a reference list in another dataframe

I am trying to categorize genes with multiple GO descriptors into bins based on what those GO descriptors are related to. I have dataframe A which contains the raw data associated with a list of geneIDs (>500,000) and their associated GO descriptors and dataframe B which classifies these GO descriptors into larger groups.
Example of dataframe A
dfA
Example of dataframe B
dfB
Ideally, the final output would reference the entire list and generate a new column in dataframe A classifying the GeneIDs into the GO_Category's associated with its specific GO_IDs -- bonus points if it removes duplicate hits on the GO_Categorys.
Looking something like this...
Example of Ideal Solution
However, I know that the ideal solution might be difficult to obtain, and I already have dataframe B listed out based on the unique GO_Categories so a solution like this might be easier to obtain.
Example of Acceptable Solution
So far I have struggled with getting any command to search for partial strings using a list from another dataframe with the goal of returning all matches.
I have had partial success with the acceptable solution approach and using:
dfA <- dfA %>%
mutate(GO_Cat_1 = c('No', 'Yes')[1+str_detect(dfA$GO_IDs, as.character(dfB$GO_IDs))])
The solution seems okay, however, it does return an error along the lines of
problem with mutate() column GO_Cat_1.
i GO_Cat_1 = ...[].
i longer object length is not a multiple of shorter object length
I have also tried to look into applying grepl/grep - but struggled to feed it a list of terms to look for partial string matches in dfA.
Any assistance is greatly appreciated!

given a dataset, how can I define the data type of each column including the category and sub-category

Hi guys I am a student and we have barely gone through R and are expected the following:
dataset
In the dataset provided, I'm being told to define the data type of each column
including the category and sub-category. How do I do this?
I am not sure where to start
Any help will be appreciated
you can change a column data type using the following commands (depending on the data type you want) :
dataframeName$colName <- as.factor(datataframeName$colName)
dataframeName$colName <- as.numeric(datataframeName$colName)

Assigning observation name to a value when retrieving a variable

I want to create a dataframe that contains > 100 observations on ~20 variables. Now, this will be based on a list of html files which are saved to my local folder. I would like to make sure that are matches the correct values per variable to each observation. Assuming that R would use the same order of going through the files for constructing each variable AND not skipping variables in case of errors or there like, this should happen automatically.
But, is there a "save way" to this, meaning assigning observation names to each variable value when retrieving the info?
Take my sample code for extracting a variable to make it more clear:
#Specifying the url for desired website to be scrapped
url <- 'http://www.imdb.com/search/title?
count=100&release_date=2016,2016&title_type=feature'
#Reading the HTML code from the website
webpage <- read_html(url)
title_data_html <- html_text(html_nodes(webpage,'.lister-item-header a'))
rank_data_html <- html_text(html_nodes(webpage,'.text-primary'))
description_data_html <- html_text(html_nodes(webpage,'.ratings-bar+ .text-
muted'))
df <- data.frame(title_data_html, rank_data_html,description_data_html)
This would come up with a list of rank and description data, but no reference to the observation name for rank or description (before binding it in the df). Now, in my actual code one variable suddenly comes up with 1 value too much, so 201 descriptions but there are only 200 movies. Without having a reference to which movie the description belongs, it is very though to see why that happens.
A colleague suggested to extract all variables for 1 observation at a time and extend the dataframe row-wise (1 observation at a time), instead of extending column-wise (1 variable at a time), but spotting errors and clean up needs per variable seems way more time consuming this way.
Does anyone have a suggestion of what is the "best practice" in such a case?
Thank you!
I know it's not a satisfying answer, but there is not a single strategy for solving this type of problem. This is the work of web scraping. There is no guarantee that the html is going to be structured in the way you'd expect it to be structured.
You haven't shown us a reproducible example (something we can run on our own machine that reproduces the problem you're having), so we can't help you troubleshoot why you ended up extracting 201 nodes during one call to html_nodes when you expected 200. Best practice here is the boring old advice to LOOK at the website you're scraping, LOOK at your data, and see where the extra or duplicate description is (or where the missing movie is). Perhaps there's an odd element that has an attribute that is also matching your xpath selector text. Look at both the website as it appears in a browser, as well as the source. Right click, CTL + U (PC), or OPT + CTL + U (Mac) are some ways to pull up the source code. Use the search function to see what matches the selector text.
If the html document you're working with is like the example you used, you won't be able to use the strategy you're looking for help with (extract the name of the movie together with the description). You're already extracting the names. The names are not in the same elements as the descriptions.

How do I match single ID's in one data frame to multiples of the IDs in another data frame in R?

For a project at work, I need to generate a table from a list of proposal ids, and a table with more data about some of those proposals (called "awards"). I'm having trouble with the match() function; the data in the "awards" table often has several rows that use the same ID, while the proposals frame has only one copy of each ID. From what I've tried, R ignores multiple rows and only returns the first match, when I need all of them. I haven't been able to find anything in documentation or through searches that helps me, though I have been having difficulty phrasing the right question.
Here's what I have so far:
#R CODE to add awards data on proposals to new data spreadsheet
#read tab delimited files
Awards=read.delim("O:/testing.txt",as.is=T)
Proposals=read.delim("O:/test.txt",as.is=T)
#match IDs from both spreadsheets
Proposals$TotalAwarded=Awards$TotalAwarded([match(Proposals$IDs,Awards$IDs)]),
write.table(Proposals,"O:/tested.txt",quote=F,row.names=F,sep="\t")
This does exactly what I want, except that only the first match is encapsulated.
What's the best way to go forward? How do I make R utilize all of the matches available?
Thanks
See help on merge: ?merge
merge( Proposals, Awards, by=ID, all.y=TRUE )
But I cannot believe this hasn't been asked on SO before.

Adding (mathematically) columns of a CSV based on information in another column with PowerShell

I was having a really hard time describing what I need in the Title, so I apologize ahead of time if that makes absolutely no sense.
If I have a CSV that has 2 columns, one with a persons name and a second column with a numeric value I need to find the duplicates in the names column then add the numeric values for that person together to get a total number in a new CSV.
This is a very simplified version of the real CSV
Name,Number
Dog,1
Cat,2
Fish,1
Dog,3
Dog,2
Cat,2
Fish,1
Given the information above, what I would like to be able to produce is this:
Name,Number
Dog,6
Cat,4
Fish,2
I really don't have any idea how to get there or if it's possible with PowerShell. I can only get as far as using group-object to group by name, but I have no clue how to add the columns after that.
The biggest problem I'm coming across with my research on this is that most if not all the results I get when googling involve adding new columns to a csv and not performing the mathematical calculation.
I finally got it
$csvfile = import-csv c:\csvfile.csv
$csvfile | group name | select name,#{Name="Totals";Expression={($_.group | Measure-Object -sum number).sum}}
Credit goes to:
http://www.hanselman.com/blog/ParsingCSVsAndPoorMansWebLogAnalysisWithPowerShell.aspx

Resources