Aggregating frequencies with multiple columns in R - r

I am working with a dataframe in R that has three columns: House, Appliance, and Count. The data is essentially an inventory of the different types of kitchen appliances contained within each house on a block. The data look something like this: (spaces added for illustrative purposes)
House  Appliance     Count
1      Toaster       2
2      Dishwasher    1
2      Toaster       1
2      Refrigerator  1
2      Toaster       1
3      Dishwasher    1
3      Oven          1
For each appliance type, I would like to be able to compute the proportion of houses containing at least one of those appliances. Note that in my data, it is possible for a single house to have zero, one, or multiple appliances in a single category. If a house does not have an appliance, it is not listed in the data for that house. If the house has more than one appliance, the appliance could be listed once with a count >1 (e.g., toasters in House 1), or it could be listed twice (each with count = 1, e.g., toasters in House 2).
As an example showing what I am trying to compute, in the data shown here, the proportion of houses with toasters would be 0.67 (rounded) because 2/3 of the houses have at least one toaster. Similarly, the proportion of houses with ovens would be 0.33 (since only 1/3 of the houses have an oven). I do not care that any of the houses have more than one toaster -- only that they have at least one.
I have fooled around with xtabs and ftable in R but am not confident that they provide the simplest solution. Part of the problem is that these functions will provide the number of appliances for each house, which then throws off my proportion of houses calculations. Here's my current approach:
temp1 <- xtabs(~House + Appliance, data=housedata)
temp1[temp1[,] > 1] <- 1 # This is needed to correct houses with >1 unit.
proportion.of.houses <- data.frame(margin.table(temp1,2)/3)
This appears to work but it's not elegant. I'm guessing there is a better way to do this in R. Any suggestions much appreciated.

library(data.table)
setDT(df)
n.houses = length(unique(df$House))
df[, length(unique(House))/n.houses, by = Appliance]

library(dplyr)
n <- length(unique(df$House))
df %>%
  group_by(Appliance) %>%
  summarise(freq = n_distinct(House)/n)
Output:
     Appliance      freq
1   Dishwasher 0.6666667
2         Oven 0.3333333
3 Refrigerator 0.3333333
4      Toaster 0.6666667
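For completeness, the same proportions can be computed in base R with no packages at all; a minimal sketch, reconstructing the question's data in a data frame called housedata:

```r
# Rebuild the example data from the question
housedata <- data.frame(
  House     = c(1, 2, 2, 2, 2, 3, 3),
  Appliance = c("Toaster", "Dishwasher", "Toaster", "Refrigerator",
                "Toaster", "Dishwasher", "Oven")
)

n.houses <- length(unique(housedata$House))

# For each appliance, count distinct houses containing it, then
# divide by the total number of houses
props <- tapply(housedata$House, housedata$Appliance,
                function(h) length(unique(h)) / n.houses)
props
```

Counting distinct houses per appliance sidesteps the "count > 1" and "listed twice" cases in one stroke, which is the same trick the data.table and dplyr answers use.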


find every combination of elements in a column of a dataframe, which add up to a given sum in R

I'm trying to ease my life by writing a menu creator, which is supposed to assemble a weekly menu from a list of my favourite dishes, in order to get a little more variety into my life.
I gave every dish a value for roughly how many days it lasts and tried to arrange the dishes to end up with menus worth 7 days of food.
I've already tried knapsack-function solutions from here, including dynamic programming, but I'm not experienced enough to get the hang of them. The problem is that all of these solutions target only the single most valuable option, not every combination that fills the knapsack.
library(adagio)
#create some data
dish <-c('Schnitzel','Burger','Steak','Salad','Falafel','Salmon','Mashed potatoes','MacnCheese','Hot Dogs')
days_the_food_lasts <- c(2,2,1,1,3,1,2,2,4)
price_of_the_food <- c(20,20,40,10,15,18,10,15,15)
data <- data.frame(dish,days_the_food_lasts,price_of_the_food)
#give each dish a distinct id
data$rownumber <- (1:nrow(data))
#set limit for how many days should be covered with the dishes
food_needed_for_days <- 7
#knapsack function from the adagio library as an example; all the other knapsack-problem solutions I found behave the same
most_exspensive_food <- knapsack(days_the_food_lasts,price_of_the_food,food_needed_for_days)
data[data$rownumber %in% most_exspensive_food$indices, ]
#output
       dish days_the_food_lasts price_of_the_food rownumber
1 Schnitzel                   2                20         1
2    Burger                   2                20         2
3     Steak                   1                40         3
4     Salad                   1                10         4
6    Salmon                   1                18         6
Simplified:
I need a solution to a single-objective knapsack problem that returns all possible combinations of dishes adding up to 7 days of food.
Thank you very much in advance
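Since the thread shows no posted answer here, one brute-force sketch (an assumption on my part, and only feasible for a small dish list) is to enumerate every subset of dishes with combn and keep those whose durations sum to the target:

```r
dish <- c('Schnitzel','Burger','Steak','Salad','Falafel','Salmon',
          'Mashed potatoes','MacnCheese','Hot Dogs')
days_the_food_lasts <- c(2, 2, 1, 1, 3, 1, 2, 2, 4)
food_needed_for_days <- 7

# Enumerate all index subsets of every size (2^9 = 512 here)...
all_combos <- unlist(
  lapply(seq_along(dish), function(k)
    combn(seq_along(dish), k, simplify = FALSE)),
  recursive = FALSE
)

# ...then keep only the subsets whose total days equal the target
valid <- Filter(function(i) sum(days_the_food_lasts[i]) == food_needed_for_days,
                all_combos)

# Each element of 'valid' is a vector of row indices; show the dishes
lapply(valid, function(i) dish[i])
```

With nine dishes the full enumeration is trivial; for much longer dish lists a dynamic-programming table over achievable sums would be needed instead, since the number of subsets doubles with each added dish.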

Comparing rows in the same R dataframe

I want to take the nth row in a dataframe, compare it to all rows that are not the nth row, and return how many of its columns match and/or mismatch.
I tried the match function and ifelse for single observations but I haven't been able to replicate it for the entire dataframe.
The dataset Superstore contains the order priority, customer name, ship mode, customer segment and category. It looks like this:
> head(df2)
  Order.Priority     Customer.Name      Ship.Mode Customer.Segment Product.Category
1  Not Specified       Dana Teague    Regular Air        Corporate  Office Supplies
2       Critical     Vanessa Boyer    Regular Air         Consumer  Office Supplies
3       Critical       Wesley Tate    Regular Air        Corporate       Technology
4           High       Brian Grady Delivery Truck        Corporate        Furniture
5         Medium Kristine Connolly Delivery Truck        Corporate        Furniture
6           High       Emily Britt    Regular Air        Corporate  Office Supplies
The code I tried (extracting relevant columns):
df <- read.csv("Superstore.csv", header = TRUE)
df2 <- df[,c(2,4,5,6,7)]
match(df2[2,],df2[1,],nomatch = 0)
This returns:
> match(df2[2,],df2[1,],nomatch = 0)
[1] 0 0 3 0 5
Using ifelse I get:
> ifelse(df2[1,]==df2[2,],1,0)
Order.Priority Customer.Name Ship.Mode Customer.Segment Product.Category
1 0 0 1 0 1
Like I said, this is exactly the result I need, but I haven't been able to replicate it for the whole dataframe.
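One way to extend the single-row comparison to the whole dataframe (a sketch, assuming character columns and a small stand-in for the Superstore data, since the full CSV isn't shown) is to apply the same elementwise test against a fixed reference row:

```r
# Small stand-in built from the first rows shown in the question
df2 <- data.frame(
  Order.Priority   = c("Not Specified", "Critical", "Critical"),
  Customer.Name    = c("Dana Teague", "Vanessa Boyer", "Wesley Tate"),
  Ship.Mode        = c("Regular Air", "Regular Air", "Regular Air"),
  Customer.Segment = c("Corporate", "Consumer", "Corporate"),
  Product.Category = c("Office Supplies", "Office Supplies", "Technology"),
  stringsAsFactors = FALSE
)

n   <- 1                  # the reference row to compare against
ref <- unlist(df2[n, ])   # named character vector of row n's values

# 1/0 match matrix: one row per non-reference row, one column per field
match_matrix <- t(apply(df2[-n, ], 1, function(row) as.integer(row == ref)))
colnames(match_matrix) <- names(df2)

# How many columns each row shares with row n
matches_per_row <- rowSums(match_matrix)
```

apply coerces each row to a character vector, so this works cleanly only when all compared columns are (or can be treated as) character, which matches the columns in the question.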

R-return the name with the least number of occurences

I need to find the sector with the lowest frequency in my data frame. Using min gives the minimum number of occurrences, but I would like to obtain the corresponding sector name with the lowest number of occurrences... so in this case, I would like it to print "Consumer Staples". I keep getting the frequency rather than the actual sector name. Is there a way to do this?
Thank you.
sector_count <- count(portfolio, "Sector")
sector_count
                  Sector freq
1 Consumer Discretionary    5
2       Consumer Staples    1
3            Health Care    2
4            Industrials    3
5 Information Technology    4
min(sector_count$freq)
[1] 1
You want
sector_count$Sector[which.min(sector_count$freq)]
which.min(sector_count$freq) returns the index (row) at which the minimum value is found; the sector_count$Sector vector is then subset at that index to give the corresponding name.

Grouping words that are similar

CompanyName <- c('Kraft', 'Kraft Foods', 'Kfraft', 'nestle', 'nestle usa', 'GM', 'general motors', 'the dow chemical company', 'Dow')
I want to get either:
CompanyName2
Kraft
Kraft
Kraft
nestle
nestle
general motors
general motors
Dow
Dow
But would be absolutely fine with:
CompanyName2
1
1
1
2
2
3
3
I see algorithms for getting the distance between two words, so if I had just one weird name I would compare it to all other names and pick the one with the lowest distance. But I have thousands of names and want to group them all into groups.
I do not know anything about elastic search, but would one of the functions in the elastic package or some other function help me out here?
I'm sorry there's no programming here. I know. But this is way out of my area of normal expertise.
Solution: use string distance
You're on the right track. Here is some R code to get you started:
install.packages("stringdist") # install this package
library("stringdist")
CompanyName <- c('Kraft', 'Kraft Foods', 'Kfraft', 'nestle', 'nestle usa', 'GM', 'general motors', 'the dow chemical company', 'Dow')
CompanyName = tolower(CompanyName) # otherwise case matters too much
# Calculate a string distance matrix; LCS is just one option
?"stringdist-metrics" # see others
sdm = stringdistmatrix(CompanyName, CompanyName, useNames=TRUE, method="lcs")
Let's take a look. These are the calculated distances between strings, using the longest common subsequence (LCS) metric (try others, e.g. cosine or Levenshtein). They all measure, in essence, how many characters the strings have in common; their pros and cons are beyond this Q&A. You might also look into metrics that give a higher similarity value to two strings containing the exact same substring (like "dow").
sdm[1:5,1:5]
            kraft kraft foods kfraft nestle nestle usa
kraft           0           6      1      9         13
kraft foods     6           0      7     15         15
kfraft          1           7      0     10         14
nestle          9          15     10      0          4
nestle usa     13          15     14      4          0
Some visualization
# Hierarchical clustering
sdm_dist = as.dist(sdm) # convert to a dist object (you essentially already have distances calculated)
plot(hclust(sdm_dist))
If you want to group them explicitly into k groups, use k-medoids.
library("cluster")
clusplot(pam(sdm_dist, 5), color=TRUE, shade=F, labels=2, lines=0)
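To get the integer group labels the question says it would settle for, you can also cut the hierarchical tree directly. A sketch using base R's adist (Levenshtein distances) so it runs without extra packages; the same cutree call works on the sdm_dist object above:

```r
CompanyName <- tolower(c('Kraft', 'Kraft Foods', 'Kfraft', 'nestle',
                         'nestle usa', 'GM', 'general motors',
                         'the dow chemical company', 'Dow'))

# Base R Levenshtein distance matrix (stand-in for stringdistmatrix)
d <- as.dist(adist(CompanyName))

# Cut the hierarchical tree into k groups -> one integer label per name
groups <- cutree(hclust(d), k = 4)
data.frame(CompanyName, group = groups)
```

Note that plain Levenshtein puts "gm" in the same group as "dow" (they are only 3 edits apart) rather than with "general motors", which illustrates why it's worth trying several metrics, as suggested above.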

Tallying up data

I have two data sets. One has 2 million cases (individual donations to various causes), the other has about 38,000 (all zip codes in the U.S.).
I want to sort through the first data set and tally up the total number of donations by zip code. (Additionally, the total for each zip code will be broken down by cause.) Each case in the first data set includes the zip code of the corresponding donation and information about what kind of cause it went to.
Is there an efficient way to do this? The only approach that I (very much a novice) can think of is to use a for ... if loop to go through each case and count them up one by one. This seems like it would be really slow, though, for data sets of this size.
edit: thanks, @josilber. This gets me a step closer to what I'm looking for.
One more question, though. table seems to generate frequencies, correct? What if I'm actually looking for the sum for each cause by zip code? For example, if the data frame looks like this:
dat3 <- data.frame(zip = sample(paste("Zip", 1:3), 2000000, replace=TRUE),
                   cause = sample(paste("Cause", 1:3), 2000000, replace=TRUE),
                   amt = sample(250:2500, 2000000, replace=TRUE))
Suppose that instead of frequencies, I want to end up with output that looks like this?
# Cause 1(amt) Cause 2(amt) Cause 3(amt)
# Zip 1 (sum) (sum) (sum)
# Zip 2 (sum) (sum) (sum)
# Zip 3 (sum) (sum) (sum)
# etc. ... ... ...
Does that make sense?
Sure, you can accomplish what you're looking for with the table command in R. First, let's start with a reproducible example (I'll create an example with 2 million cases, 3 zip codes, and 3 causes; I know you have more zip codes and more causes but that won't cause the code to take too much longer to run):
# Data
set.seed(144)
dat <- data.frame(zip = sample(paste("Zip", 1:3), 2000000, replace=TRUE),
                  cause = sample(paste("Cause", 1:3), 2000000, replace=TRUE))
Please note that it's a good idea to include a reproducible example with all your questions on Stack Overflow because it helps make sure we understand what you're asking! Basically you should include a sample dataset (like the one I've just included) along with your desired output for that dataset.
Now you can use the table function to count the number of donations in each zip code, broken down by cause:
table(dat$zip, dat$cause)
#         Cause 1 Cause 2 Cause 3
#  Zip 1   222276  222004  222744
#  Zip 2   222068  222791  222363
#  Zip 3   221015  221930  222809
This took about 0.3 seconds on my computer.
Could this work?
aggregate(amt~cause+zip,data=dat3,FUN=sum)
    cause   zip       amt
1 Cause 1 Zip 1 306231179
2 Cause 2 Zip 1 306600943
3 Cause 3 Zip 1 305964165
4 Cause 1 Zip 2 305788668
5 Cause 2 Zip 2 306306940
6 Cause 3 Zip 2 305559305
7 Cause 1 Zip 3 304898918
8 Cause 2 Zip 3 304281568
9 Cause 3 Zip 3 303939326
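To get the wide zip-by-cause layout the edit sketches, xtabs can produce the sums directly: when a variable appears on the left-hand side of the formula, xtabs sums it per cell instead of counting rows. A sketch using the dat3 example from the edit:

```r
set.seed(144)
dat3 <- data.frame(zip = sample(paste("Zip", 1:3), 2000000, replace=TRUE),
                   cause = sample(paste("Cause", 1:3), 2000000, replace=TRUE),
                   amt = sample(250:2500, 2000000, replace=TRUE))

# amt on the left-hand side makes xtabs sum it within each zip/cause cell
sums <- xtabs(amt ~ zip + cause, data = dat3)
sums
```

This is the same result as the aggregate call above, just reshaped into the zip-by-cause grid rather than one row per zip/cause pair.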
