I am trying to use tapply in R so that it gives me a summary of averages for data that correspond to two categories (category_name and brand_name).
It is working fine when I include one category, i.e.:
brandavg <- tapply(train$price, train$brand_name, mean)
head(brandavg, 10)
Looks something like:
"" "brandavg" "brandstd"
"% Pure" 13.85 5.53719480565651
"10.Deep" 21 NA
"21men" 10 NA
But when I try to run it with two categories, using this code:
brandcatavg <- tapply(train$price, list(train$brand_name, train$category_name), mean)
head(brandcatavg, 10)
I get a long list of category names (far more than 10!), but no averages.
I'm sure this has a really easy solution that I'm missing, but I'm hitting my head on the wall trying to figure it out.
Current code attempts have been:
brandcatavg <- tapply(train$price, list(train$brand_name, train$category_name), mean)
brandcatavg <- with(train, tapply(price, list(brand_name, category_name), mean))
Both yield something that looks like:
Vintage & Collectibles/Paper Ephemera/Postcard Vintage & Collectibles/Paper Ephemera/Stamps
Vintage & Collectibles/Serving/Bowl Vintage & Collectibles/Serving/Casserole
Vintage & Collectibles/Serving/Cream and Sugar Set Vintage & Collectibles/Serving/Dinnerware Set
Vintage & Collectibles/Serving/Flatware Vintage & Collectibles/Serving/Glassware
Vintage & Collectibles/Serving/Mug Vintage & Collectibles/Serving/Other
Vintage & Collectibles/Serving/Pitcher Vintage & Collectibles/Serving/Plate
Vintage & Collectibles/Serving/Salt and Pepper Shakers Vintage & Collectibles/Serving/Teacup
Vintage & Collectibles/Serving/Teapot Vintage & Collectibles/Serving/Tray
Vintage & Collectibles/Serving/Tumbler Vintage & Collectibles/Supplies/Bead
(Note that I cut the output for length...)
When I pull the first 10 rows, it looks like:
train_id name item_condition_id category_name
1 0 MLB Cincinnati Reds T Shirt Size XL 3 Men/Tops/T-shirts
2 1 Razer BlackWidow Chroma Keyboard 3 Electronics/Computers & Tablets/Components & Parts
3 2 AVA-VIV Blouse 1 Women/Tops & Blouses/Blouse
4 3 Leather Horse Statues 1 Home/Home Décor/Home Décor Accents
5 4 24K GOLD plated rose 1 Women/Jewelry/Necklaces
6 5 Bundled items requested for Ruie 3 Women/Other/Other
7 6 Acacia pacific tides santorini top 3 Women/Swimwear/Two-Piece
8 7 Girls cheer and tumbling bundle of 7 3 Sports & Outdoors/Apparel/Girls
9 8 Girls Nike Pro shorts 3 Sports & Outdoors/Apparel/Girls
10 9 Porcelain clown doll checker pants VTG 3 Vintage & Collectibles/Collectibles/Doll
brand_name price shipping
1 10 1
2 Razer 52 0
3 Target 10 1
4 35 1
5 44 0
6 59 0
7 Acacia Swimwear 64 0
8 Soffe 6 1
9 Nike 19 0
10 8 0
item_description
1 No description yet
2 This keyboard is in great condition and works like it came out of the box. All of the ports are tested and work perfectly. The lights are customizable via the Razer Synapse app on your PC.
3 Adorable top with a hint of lace and a key hole in the back! The pale pink is a 1X, and I also have a 3X available in white!
4 New with tags. Leather horses. Retail for [rm] each. Stand about a foot high. They are being sold as a pair. Any questions please ask. Free shipping. Just got out of storage
5 Complete with certificate of authenticity
6 Banana republic bottoms, Candies skirt with matching blazer,Amy Byers suit, Loft bottoms and cami top.
7 Size small but straps slightly shortened to fit xs, besides that, perfect condition
8 You get three pairs of Sophie cheer shorts size small and medium girls and two sports bra/boy shorts spandex matching sets in small and medium girls. All items total retail for [rm] in store and you can take him today for less than the price of one item at the store!)
9 Girls Size small Plus green. Three shorts total.
10 I realized his pants are on backwards after the picture. They were very dirty so I hand washed them. He has a stuffed body and painted porcelain head, hands and feet. Back before clowns were too scary. 9" tall. No chips or cracks but minor paint loss in a few places. Clown Circus Doll Collectible
(dput is giving me funky results.)
I eventually got this to work with the following code:
library(dplyr)
brandcatavg <- train %>%
  group_by(category_name, brand_name) %>%
  summarise(averageprice = mean(price, na.rm = TRUE))
You are close. Try:
with(train, tapply(price, list(brand_name, category_name), mean))
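For what it's worth, the two-factor call is not failing: tapply() with two grouping factors returns a brand x category matrix, and because most brand/category pairs never occur it is mostly NA, which is why printing it mainly shows column names. Here is a sketch (assuming the same train data frame) of flattening that matrix back into a long table of averages:
brandcatavg <- with(train, tapply(price, list(brand_name, category_name), mean))

# Convert the wide matrix to long form and drop the empty brand/category pairs
brandcatlong <- as.data.frame.table(brandcatavg, responseName = "averageprice")
names(brandcatlong)[1:2] <- c("brand_name", "category_name")
brandcatlong <- brandcatlong[!is.na(brandcatlong$averageprice), ]
head(brandcatlong, 10)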
I have 2 tables. Table1 is a smaller table with around 10K values.
Table 1 (Sample):
KeyWords PageView
Phillips Trimmer 123
Buy Samsung Mobile 45
Ripe yellow Banana 63
Pepsi 140
Table 2 contains 1 Million Values.
Table 2 (Sample):
KeyWords PageView
Electric Trimmer 123
Samsung Mobile 45
Yellow Ripe Banana 63
Samsung S6 304
Banana 105
Phillips 209
Trimmer Phillips 29
Now I want to take all words from table 1 and look into table 2 to find the best match. The order of words should not have a big influence on the match, i.e. "Ripe yellow Banana" should match perfectly with "Yellow Ripe Banana". "Buy Samsung Mobile" should match with "Samsung Mobile" and with "Samsung S6".
The final output should look like this.
Table 3:
Word PageView Match
Phillips Trimmer 123 Trimmer Phillips
Buy Samsung Mobile 45 Samsung Mobile
Ripe yellow Banana 63 Yellow Ripe Banana
Pepsi 140 NA
I would really appreciate it if we could stem and tokenize the sentences before matching.
I have tried the following, but it's not working properly and the loop takes quite a bit of time.
file_1$match <- ""
for (i in 1:dim(file_1)[1]) {
  print(i)
  x <- grep(file_1$Keywords[i], file_2$Keyword,
            value = TRUE, ignore.case = TRUE, useBytes = TRUE)
  x <- paste0(x, collapse = "")
  file_1$match[i] <- x
}
I have tried using 'agrep' as well, varying the 'max.distance' parameter. The results were not as expected.
Edited: I used the "apply" function to perform the following on every row of tab1:
The function inside apply takes x[1], which is the keyword (let's say "Ripe Yellow Banana"). strsplit splits it by spaces ("Ripe" "Yellow" "Banana"), and sapply runs grepl on each of these tokens to see whether the pattern exists in tab2, so you get three columns of TRUE/FALSE values for "Ripe", "Yellow" and "Banana". The next step is to count the number of TRUEs in each row and return the tab2 row with the highest count. I also added an if statement to return NA when the maximum number of TRUEs is 0:
tab1<-data.frame(Keyword=c("Phillips Trimmer",
"Buy Samsung Mobile","Ripe Yellow Banana","Pepsi"),
PageView=c(123,45,63,140))
tab2<-data.frame(Keyword=c("Electric Trimmer","Samsung Mobile",
"Yellow Ripe Banana","Samsung S6","Banana",
"Phillips","Trimmer Phillips","Buy Trimmer Philips"),
PageView=c(123,45,63,304,105,209,29,21))
tab2$StrLen <- apply(tab2, 1, function(x) length(unlist(strsplit(x[1], " "))))

tab1$BestMatch <- apply(tab1, 1, function(x) {
  a <- sapply(unlist(strsplit(x[1], " ")), grepl, tab2$Keyword)
  a <- cbind(a, TRUECnt = rowSums(a == TRUE))
  a <- as.data.frame(a)
  a$StrLen <- length(unlist(strsplit(x[1], " ")))
  if (max(a$TRUECnt) == 0) {
    return(NA)
  }
  return(as.character(tab2[which(a$TRUECnt == max(a$TRUECnt) &
                                 tab2$StrLen <= a$StrLen), ]$Keyword))
})
View(tab1)
# Keyword PageView BestMatch
# 1 Phillips Trimmer 123 Trimmer Phillips
# 2 Buy Samsung Mobile 45 Samsung Mobile
# 3 Ripe Yellow Banana 63 Yellow Ripe Banana
# 4 Pepsi 140 <NA>
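For reference, here is a sketch of an order-insensitive alternative using the stringdist package (assuming the tab1 and tab2 data frames defined above; the 0.3 cutoff is an arbitrary placeholder that would need tuning):
library(stringdist)

# Sort the tokens inside each keyword so word order no longer matters,
# then pick the closest tab2 keyword by string distance.
sort_tokens <- function(s) {
  sapply(strsplit(tolower(s), "\\s+"),
         function(w) paste(sort(w), collapse = " "))
}

k1 <- sort_tokens(as.character(tab1$Keyword))
k2 <- sort_tokens(as.character(tab2$Keyword))

d    <- stringdistmatrix(k1, k2, method = "jw")  # Jaro-Winkler distances
best <- apply(d, 1, which.min)
tab1$BestMatch <- ifelse(apply(d, 1, min) > 0.3, NA,
                         as.character(tab2$Keyword[best]))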
I have a group of individuals that I am distributing items to in an effort to move toward even distribution of total items across individuals.
Each individual can receive only certain item types.
The starting distribution of items is not equal.
The number of available items of each type is known, and must fully be exhausted.
df contains an example format for the person data. Note that Chuck has 14 items total, not 14 bats and 14 gloves.
df <- data.frame(person = c("Chuck","Walter","Mickey","Vince","Walter","Mickey","Vince","Chuck"),
                 alloweditem = c("bat","bat","bat","bat","ball","ball","glove","glove"),
                 startingtotalitemspossessed = c(14,9,7,12,9,7,12,14))
otherdf contains an example format for the items and the number of each needing assignment:
otherdf <- data.frame(item = c("bat","ball","glove"),
                      numberneedingassignment = c(3,4,7))
Is there a best method for coding this form of item distribution? I imagine the steps to be:
Check which of the people who can receive the given item has the lowest total items assigned; break ties at random.
Assign 1 of the given item to this person.
Update the startingtotalitemspossessed for the person receiving the item.
Update the remaining number of the item left to assign.
Stop this loop for a given item if the total remaining is 0, and move to the next item.
Below is a partial representation of roughly how I'd imagine this working, as a view inside the loop, left to right.
Note: The number of items and people is very large. If possible, a method that would scale to any given number of people or items would be ideal!
Thank you in advance for your help!
I'm sure there are better ways, but here is an example:
df<-data.frame(person=c("Chuck","Walter","Mickey","Vince","Walter","Mickey","Vince","Chuck"),
alloweditem=c("bat","bat","bat","bat","ball","ball","glove","glove"),
total=c(14,9,7,12,9,7,12,14))
print(df)
## person alloweditem total
## 1 Chuck bat 14
## 2 Walter bat 9
## 3 Mickey bat 7
## 4 Vince bat 12
## 5 Walter ball 9
## 6 Mickey ball 7
## 7 Vince glove 12
## 8 Chuck glove 14
otherdf<-data.frame(item=c("bat","ball","glove"),
numberneedingassignment=c(3,4,7))
# Items in queue
queue <- rep(otherdf$item, otherdf$numberneedingassignment)
for (i in 1:length(queue)) {
  # Find person with the lowest starting total
  personToBeAssigned <- df[df$alloweditem == queue[i] &
                             df$total == min(df[df$alloweditem == queue[i], 3]), 1][1]
  df[df$person == personToBeAssigned & df$alloweditem == queue[i], 3] <-
    df[df$person == personToBeAssigned & df$alloweditem == queue[i], 3] + 1
}
print(df)
## person alloweditem total
## 1 Chuck bat 14
## 2 Walter bat 10
## 3 Mickey bat 9
## 4 Vince bat 12
## 5 Walter ball 10
## 6 Mickey ball 10
## 7 Vince glove 17
## 8 Chuck glove 16
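One caveat worth noting: the loop above only increments the row matching the current item, so a person's totals can drift apart across their rows (compare Chuck's bat and glove rows in the output). If the running total should stay in sync across all of a person's rows, and ties should be broken at random as the question asks, a variation like the following sketch (run on a fresh copy of df) might be closer:
for (i in seq_along(queue)) {
  eligible <- df[df$alloweditem == queue[i], ]
  lowest   <- eligible$person[eligible$total == min(eligible$total)]
  # Break ties at random
  personToBeAssigned <- if (length(lowest) > 1) sample(lowest, 1) else lowest
  # Update every row belonging to that person so their total stays consistent
  df[df$person == personToBeAssigned, "total"] <-
    df[df$person == personToBeAssigned, "total"] + 1
}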
I have a database of thoroughbred names that is structured as follows:
HorseName <- c("Grey emperor", "Smokey grey", "Gaining greys", "chestnut", "Glowing Chestnuts", "Ruby red", "My fair lady", "Man of war")
Number <- seq(1:8)
df <- data.frame(HorseName, Number)
I now wish to search for occurrences of colours within each horse's name. Specifically, I wish to select all the instances of 'grey' and 'chestnut', creating a new column that identifies these colours. Any other names can simply be 'Other'. Unfortunately, the names are not consistent, with plurals included and varying case formats. How would I go about doing this in R?
My anticipated output would be:
df$Type <- c("Grey", "Grey", "Grey", "Chestnut", "Chestnut", "Other", "Other", "Other")
I am familiar with chained ifelse statements but unsure how to handle the plural occurrences and case sensitivity!
In case you are interested in other ways to do this, here's a tidyverse alternative which has the same end result as @amrrs' answer.
library(tidyverse)
library(stringr)
df %>%
mutate(Type = str_extract(str_to_lower(HorseName), "grey|chestnut")) %>%
mutate(Type = str_to_title(if_else(is.na(Type), "other", Type)))
#> HorseName Number Type
#> 1 Grey emperor 1 Grey
#> 2 Smokey grey 2 Grey
#> 3 Gaining greys 3 Grey
#> 4 chestnut 4 Chestnut
#> 5 Glowing Chestnuts 5 Chestnut
#> 6 Ruby red 6 Other
#> 7 My fair lady 7 Other
#> 8 Man of war 8 Other
Converting all the input text df$HorseName to lower case before pattern matching with grepl (using lower-cased pattern) solves this problem.
> df$Type <- ifelse(grepl('grey',tolower(df$HorseName)),'Grey',
+ ifelse(grepl('chestnut',tolower(df$HorseName)),'Chestnut',
+ 'others'))
> df
HorseName Number Type
1 Grey emperor 1 Grey
2 Smokey grey 2 Grey
3 Gaining greys 3 Grey
4 chestnut 4 Chestnut
5 Glowing Chestnuts 5 Chestnut
6 Ruby red 6 others
7 My fair lady 7 others
8 Man of war 8 others
>
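A small variant of the same idea, in case it helps: grepl() can ignore case directly via ignore.case = TRUE, so the tolower() calls can be dropped:
df$Type <- ifelse(grepl("grey", df$HorseName, ignore.case = TRUE), "Grey",
                  ifelse(grepl("chestnut", df$HorseName, ignore.case = TRUE), "Chestnut",
                         "others"))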
I'm new to R and have been trying to figure out what I can do to move this along. I know loops are not the best thing to use, but they're all I can figure out. I've searched here and the net, and am seeing options like tapply, but I can't figure out whether it's something I'm doing wrong, or whether tapply isn't compatible with this type of data. I think it's the latter, but I'm new, so what do I know. HA!
I have a data.frame, from a previous parse, that holds all the players that have played; it amounts to over 18,000 rows. The script below takes each player's URL from it and scrapes another URL if they played last year. Is there anything I can do to make this quicker or less of a memory pig, as it routinely pegs my RAM at 99% after around 15 minutes? Thanks for any help!
# GET YEARS PLAYED LINKS
yplist <- NULL
playerURLs <- paste("http://www.baseball-reference.com", datafile[, c("hrefs")], sep = "")
for (thisplayerURL in playerURLs) {
  doc <- htmlParse(thisplayerURL)
  yplinks <- data.frame(
    names = xpathSApply(doc, '//*[@id="all_standard_batting"]/div/ul/li[2]/ul/li[. = "2014"]/a', xmlValue),
    hrefs = xpathSApply(doc, '//*[@id="all_standard_batting"]/div/ul/li[2]/ul/li[. = "2014"]/a', xmlGetAttr, 'href'))
  yplist <- rbind(yplist, yplinks)
}
yplist[, c("hrefs")]
Example of the datafile rows used in playerURLs (there are 2 Mel Queens; they are different players):
X names hrefs
1 1 Jason Kipnis /players/k/kipnija01.shtml
2 2 Tom Qualters /players/q/qualtto01.shtml
3 3 Paul Quantrill /players/q/quantpa01.shtml
4 4 Bill Quarles /players/q/quarlbi01.shtml
5 5 Billy Queen /players/q/queenbi01.shtml
6 6 Mel Queen /players/q/queenme01.shtml
7 7 Mel Queen /players/q/queenme02.shtml
If any of those players played in 2014, my script above would return a data.frame that looks like the following:
X names hrefs
1 1 Jason Kipnis players/gl.cgi?id=kipnija01&t=b&year=2014
2 2 Tom Qualters /players/gl.cgi?id=qualtto01&t=b&year=2014
3 3 Paul Quantrill /players/gl.cgi?id=quantpa01&t=b&year=2014
4 4 Bill Quarles /players/gl.cgi?id=quarlbi01&t=b&year=2014
5 5 Billy Queen /players/gl.cgi?id=queenbi01&t=b&year=2014
6 6 Mel Queen /players/gl.cgi?id=queenme01&t=b&year=2014
7 7 Mel Queen /players/gl.cgi?id=queenme02&t=b&year=2014
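For what it's worth, here is a sketch (untested against the live site, assuming the same datafile and the XML package already in use) that avoids growing yplist with rbind() inside the loop and frees each parsed document; the XML package keeps parsed pages in C-level memory until free() is called, which may be what is eating the RAM:
library(XML)

getYearLinks <- function(u) {
  doc <- htmlParse(u)
  on.exit(free(doc))  # release the C-level memory held for each parsed page
  xp   <- '//*[@id="all_standard_batting"]/div/ul/li[2]/ul/li[. = "2014"]/a'
  nms  <- xpathSApply(doc, xp, xmlValue)
  hrfs <- xpathSApply(doc, xp, xmlGetAttr, "href")
  if (length(nms) == 0) return(NULL)   # player did not play in 2014
  data.frame(names = nms, hrefs = hrfs, stringsAsFactors = FALSE)
}

playerURLs <- paste0("http://www.baseball-reference.com", datafile[, "hrefs"])
yplist <- do.call(rbind, lapply(playerURLs, getYearLinks))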
I have a CSV file with two columns containing Commodity Category and Commodity Name.
Ex:
Sl.No. Commodity Category Commodity Name
1 Stationary Pencil
2 Stationary Pen
3 Stationary Marker
4 Office Utensils Chair
5 Office Utensils Drawer
6 Hardware Monitor
7 Hardware CPU
and I have another csv file which contains various Commodity names.
Ex:
Sl.No. Commodity Name
1 Pancil
2 Pencil-HB 02
3 Pencil-Apsara
4 Pancil-Nataraj
5 Pen-Parker
6 Pen-Reynolds
7 Monitor-X001RL
The output I would like is to standardise and categorise the commodity names, classifying them into their respective Commodity Categories as shown below:
Sl.No. Commodity Name Commodity Category
1 Pencil Stationary
2 Pencil Stationary
3 Pencil Stationary
4 Pancil Stationary
5 Pen Stationary
6 Pen Stationary
7 Monitor Hardware
Step 1) I first have to use NLTK (text mining methods) to clean the data so as to separate "Pencil" from "Pencil-HB 02".
Step 2) After cleaning, I have to use an approximate string matching technique, i.e. agrep(), to match patterns like "Pencil*" or to correct "Pancil" to "Pencil".
Step 3) Once the names are corrected, I have to categorise them. No idea how.
This is what I have thought about so far. I started with step 2 and I'm stuck there.
I'm not finding an exact method to code this.
Is there any way to get the output as required?
If yes, please suggest a method I can proceed with.
You could use the stringdist package. The correct() function below corrects the Commodity.Name in file2 based on the distance of each item to the different CName levels.
Then a left_join is used to join the two tables.
I also noticed some misclassifications when using the default options for stringdistmatrix. You can try changing the weight argument of stringdistmatrix for a better correction result.
> library(dplyr)
> library(stringdist)
>
> file1 <- read.csv("/Users/Randy/Desktop/file1.csv")
> file2 <- read.csv("/Users/Randy/Desktop/file2.csv")
>
> head(file1)
Sl.No. Commodity.Category Commodity.Name
1 1 Stationary Pencil
2 2 Stationary Pen
3 3 Stationary Marker
4 4 Office Utensils Chair
5 5 Office Utensils Drawer
6 6 Hardware Monitor
> head(file2)
Sl.No. Commodity.Name
1 1 Pancil
2 2 Pencil-HB 02
3 3 Pencil-Apsara
4 4 Pancil-Nataraj
5 5 Pen-Parker
6 6 Pen-Reynolds
>
> CName <- levels(file1$Commodity.Name)
> correct <- function(x){
+ factor(sapply(x, function(z) CName[which.min(stringdistmatrix(z, CName, weight=c(1,0.1,1,1)))]), CName)
+ }
>
> correctedfile2 <- file2 %>%
+ transmute(Commodity.Name.Old = Commodity.Name, Commodity.Name = correct(Commodity.Name))
>
> correctedfile2 %>%
+ inner_join(file1[,-1], by="Commodity.Name")
Commodity.Name.Old Commodity.Name Commodity.Category
1 Pancil Pencil Stationary
2 Pencil-HB 02 Pencil Stationary
3 Pencil-Apsara Pencil Stationary
4 Pancil-Nataraj Pencil Stationary
5 Pen-Parker Pen Stationary
6 Pen-Reynolds Pen Stationary
7 Monitor-X001RL Monitor Hardware
If you need the "Others" category, you just need to play with the weights.
I added a row "Diesel" to file2, then computed the score using stringdist with customized weights (you should try varying the values). If the score is larger than 2 (this threshold is related to how the weights are assigned), nothing is corrected.
PS: as we don't know all the possible labels, we have to use as.character to convert the factor to character.
PS2: I am also using tolower for case insensitive scoring.
> head(file2)
Sl.No. Commodity.Name
1 1 Diesel
2 2 Pancil
3 3 Pencil-HB 02
4 4 Pencil-Apsara
5 5 Pancil-Nataraj
6 6 Pen-Parker
>
> CName <- levels(file1$Commodity.Name)
> CName.lower <- tolower(CName)
> correct_1 <- function(x){
+ scores = stringdistmatrix(tolower(x), CName.lower, weight=c(1,0.001,1,0.5))
+ if (min(scores)>2) {
+ return(x)
+ } else {
+ return(as.character(CName[which.min(scores)]))
+ }
+ }
> correct <- function(x) {
+ sapply(as.character(x), correct_1)
+ }
>
> correctedfile2 <- file2 %>%
+ transmute(Commodity.Name.Old = Commodity.Name, Commodity.Name = correct(Commodity.Name))
>
> file1$Commodity.Name = as.character(file1$Commodity.Name)
> correctedfile2 %>%
+ left_join(file1[,-1], by="Commodity.Name")
Commodity.Name.Old Commodity.Name Commodity.Category
1 Diesel Diesel <NA>
2 Pancil Pencil Stationary
3 Pencil-HB 02 Pencil Stationary
4 Pencil-Apsara Pencil Stationary
5 Pancil-Nataraj Pencil Stationary
6 Pen-Parker Pen Stationary
7 Pen-Reynolds Pen Stationary
8 Monitor-X001RL Monitor Hardware
There is an 'approximate string matching' function amatch() in {stringdist} (at least in 0.9.4.6) that returns the most probable match from a pre-defined set of words. It has a maxDist parameter that sets the maximum distance allowed for a match, and a nomatch parameter that can be used for the 'other' category. Otherwise, method, weights, etc. can be set similarly to stringdistmatrix().
So, your original problem can be solved like this with a tidyverse-compatible solution:
library(dplyr)
library(stringdist)
# Reading the files
file1 <- readr::read_csv("file1.csv")
file2 <- readr::read_csv("file2.csv")
# Getting the commodity names in a vector
commodities <- file1 %>% distinct(`Commodity Name`) %>% pull()
# Finding the closest string match of the commodities, and joining the file containing the categories
file2 %>%
mutate(`Commodity Name` = commodities[amatch(`Commodity Name`, commodities, maxDist = 5)]) %>%
left_join(file1, by = "Commodity Name")
This will return a data frame that contains the corrected commodity name and category. If the original commodity name is more than a string distance of 5 (loosely, about 5 edits) away from every possible commodity name, the corrected name will be NA.
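A small, hypothetical illustration of the nomatch idea mentioned above (the commodity vector and maxDist here are made up, and the exact matches depend on the method and maxDist chosen):
library(stringdist)

# Anything farther than maxDist from every entry in `commodities` falls back to
# nomatch (0 here), which we then map to an "Other" category.
commodities <- c("Pencil", "Pen", "Marker", "Chair", "Drawer", "Monitor", "CPU")
idx <- amatch(c("Pancil", "Diesel"), commodities, maxDist = 2, nomatch = 0)
ifelse(idx == 0, "Other", commodities[idx])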