Group By and Summarize - r

I have a quick question. I did a group-by and summarize on the following data as shown below. However, how do I summarize the length of the variable (Trump, Obama, McConnell) individually?
dta.subset.tabluea = dta.subset %>%
group_by(variable,catvalue2) %>%
summarize(value = length(catvalue))
The output I got was:
variable catvalue2 value
1 Trump Slightly Warm 216
2 Trump Very Cold 778
3 Trump Very Warm 311
4 Trump <NA> 176
5 Obama Slightly Warm 251
6 Obama Very Cold 427
7 Obama Very Warm 676
8 Obama <NA> 224
9 McConnell Slightly Warm 248
10 McConnell Very Cold 731
11 McConnell Very Warm 60
12 McConnell <NA> 444
However, how do I get the total count for each variable (Trump, Obama, McConnell) in another column? I need this info so I can make percentages.
If I do the following, I get the same answers as in the first column:
summarize(value = length(catvalue), varvalue = length(vatvalue))
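One way to get the per-variable total alongside each count is a grouped mutate() after the summarize(); a minimal sketch, assuming dta.subset has the columns used above (the pct column name is just illustrative):
library(dplyr)

dta.subset.tabluea <- dta.subset %>%
  group_by(variable, catvalue2) %>%
  summarize(value = length(catvalue)) %>%   # count per (variable, catvalue2) pair
  group_by(variable) %>%
  mutate(varvalue = sum(value),             # total per variable (Trump, Obama, McConnell)
         pct = 100 * value / varvalue)      # percentage within each variable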

Related

list of data frames, trying to create new column with normalisation values for each dataframe

I'm new to R and mostly work with data frames. A frequent task is to normalize counts for several parameters from several data frames. I have a demo dataset:
dataset
Season Product Quality Sales
Winter Apple   bad       345
Winter Apple   good       13
Winter Potato  bad        23
Winter Potato  good       66
Winter Beer    bad       345
Winter Beer    good       34
Summer Apple   bad        88
Summer Apple   good       90
Summer Potato  bad       123
Summer Potato  good      457
Summer Beer    bad        44
Summer Beer    good      546
What I want to do is add a column "FC" (fold change) for "Sales". FC must be calculated for each "Season" and "Product" according to "Quality". "Bad" is the baseline.
Desired result:
Season Product Quality Sales    FC
Winter Apple   bad       345  1.00
Winter Apple   good       13  0.04
Winter Potato  bad        23  1.00
Winter Potato  good       66  2.87
Winter Beer    bad       345  1.00
Winter Beer    good       34  0.10
Summer Apple   bad        88  1.00
Summer Apple   good       90  1.02
Summer Potato  bad       123  1.00
Summer Potato  good      457  3.72
Summer Beer    bad        44  1.00
Summer Beer    good      546 12.41
One way to do it is to filter first by "Season" and then by "Product" (e.g. creating subset data frame subset_winter_apple) and then calculate FC similarly to this:
subset_winter_apple$FC = subset_winter_apple$Sales / subset_winter_apple$Sales[1]
Later on, I can then combine all subset dataframes again e.g. using rbind() to reconstitute the original data frame with the FC column. However, this is highly inefficient. So I thought of splitting the data frame and creating a list:
split(
dataset,
list(dataset$Season, dataset$Product)
)
However, now I struggle with the normalisation (FC calculation), as I do not know how to reference the specific first cell value of "Sales" in the list of data frames so that each value in that column in each listed data frame is individually normalized. I did manage to calculate an FC value for the list using lapply; however, it is an exact copy of the first data frame's values in every listed data frame:
lapply(
dataset,
function(DF){DF$FC = dataset[[1]]$Sales/dataset[[1]]$Sales[1]; DF}
)
Clearly, I do not know how to reference the first cell in a specific column to normalize the entire column for each listed data frame. Can somebody please help me?
Many thanks in advance for your suggestions.
dplyr solution
Using logical indexing within a grouped mutate():
library(dplyr)
dataset %>%
group_by(Season, Product) %>%
mutate(FC = Sales / Sales[Quality == "bad"]) %>%
ungroup()
# A tibble: 12 × 5
Season Product Quality Sales FC
<chr> <chr> <chr> <int> <dbl>
1 Winter Apple bad 345 1
2 Winter Apple good 13 0.0377
3 Winter Potato bad 23 1
4 Winter Potato good 66 2.87
5 Winter Beer bad 345 1
6 Winter Beer good 34 0.0986
7 Summer Apple bad 88 1
8 Summer Apple good 90 1.02
9 Summer Potato bad 123 1
10 Summer Potato good 457 3.72
11 Summer Beer bad 44 1
12 Summer Beer good 546 12.4
Base R solution
Using by():
dataset <- by(
dataset,
list(dataset$Season, dataset$Product),
\(x) transform(x, FC = Sales / Sales[Quality == "bad"])
)
dataset <- do.call(rbind, dataset)
dataset[order(as.numeric(rownames(dataset))), ]
Season Product Quality Sales FC
1 Winter Apple bad 345 1.00000000
2 Winter Apple good 13 0.03768116
3 Winter Potato bad 23 1.00000000
4 Winter Potato good 66 2.86956522
5 Winter Beer bad 345 1.00000000
6 Winter Beer good 34 0.09855072
7 Summer Apple bad 88 1.00000000
8 Summer Apple good 90 1.02272727
9 Summer Potato bad 123 1.00000000
10 Summer Potato good 457 3.71544715
11 Summer Beer bad 44 1.00000000
12 Summer Beer good 546 12.40909091
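Since the question started from split(), the same result can also be reached with a split/lapply/rbind round trip; a sketch, starting again from the original dataset:
# split into one data frame per Season/Product combination
pieces <- split(dataset, list(dataset$Season, dataset$Product))

# within each piece, divide Sales by the "bad" baseline row, then stitch back together
pieces <- lapply(pieces, function(DF) {
  DF$FC <- DF$Sales / DF$Sales[DF$Quality == "bad"]
  DF
})
result <- do.call(rbind, pieces)
Note that rbind() returns the rows grouped by Season/Product rather than in the original order, so they may need reordering afterwards.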

Combining & totalling rows in R

I have the below dataset, with the variables as follows:
member_id - an id number for each member
year - the year in question
gender - binary variable, 0 is male, 1 is female
party - the party of the member
Leadership - TRUE if the member holds a leadership position in government or opposition, FALSE if they don't
house_start - the date the member became an MP
Year.Entered - the year they became an MP
Years.in.parliament - how many years it has been since they were first elected
Edu - the number of times the MP participated in debates related to education in the given year.
member_id year gender party Leadership house_start Year.Entered Years.in.parliament Edu
1 386 1997 0 Conservative FALSE 03/05/1979 1979 18 7
2 37 1997 0 Labour FALSE 03/05/1979 1979 18 10
3 47 1997 0 Labour TRUE 09/06/1983 1983 14 157
4 408 1997 0 Conservative TRUE 03/05/1979 1979 18 48
5 15 1997 1 Liberal Democrat FALSE 09/06/1983 1983 14 3
6 15 1997 1 Liberal Democrat TRUE 09/06/1983 1983 14 9
As you can see with rows 5 and 6 in the dataset, the same member is recorded twice in the same year. This happens throughout the dataset for some members because of the Leadership variable. For example, this member (id number 15) did not have a leadership position for the first part of 1997 but did get one later in the year. I want to combine these two rows and have the Leadership variable as TRUE in these cases. I also need to sum the Edu values for those rows, so for this member it would become 12 (because I want each member's number of times participated per year for this policy area). So I want it to look like:
member_id year gender party Leadership house_start Year.Entered Years.in.parliament Edu
1 386 1997 0 Conservative FALSE 03/05/1979 1979 18 7
2 37 1997 0 Labour FALSE 03/05/1979 1979 18 10
3 47 1997 0 Labour TRUE 09/06/1983 1983 14 157
4 408 1997 0 Conservative TRUE 03/05/1979 1979 18 48
5 15 1997 1 Liberal Democrat TRUE 09/06/1983 1983 14 12
I have been trying to change these manually in Excel, but I need to do this for several different policy areas, so it is taking a lot of time. Any help would be much appreciated!
We can do a group-by sum, then arrange and slice the first row:
library(dplyr)
df1 %>%
group_by(member_id, year, gender, party) %>%
mutate(Edu = sum(Edu)) %>%
arrange(party, desc(Leadership)) %>%
slice(1)
For each group, you can keep the row if the group has only one row, or the row where Leadership is TRUE.
library(dplyr)
df %>%
group_by(member_id, year, gender, party) %>%
mutate(Edu = sum(Edu)) %>%
filter(n() == 1 | Leadership)
From my understanding, the minimal repeating group is member_id & year. We can then sum the Edu amount defensively (using na.rm = TRUE) and slice the grouped data.frame using Boolean logic (which.max() on a logical vector returns the position of the first TRUE):
library(dplyr)
df %>%
group_by(member_id, year) %>%
mutate(Edu = sum(Edu, na.rm = TRUE)) %>%
slice(which.max(Leadership)) %>%
ungroup()
Alternatively, we can use the top_n() function (which yields the same result):
df %>%
group_by(member_id, year) %>%
mutate(Edu = sum(Edu, na.rm = TRUE)) %>%
top_n(1, Leadership) %>%
ungroup()
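The question mentions several policy areas. Assuming those are additional numeric columns in the same data frame (the names Health and Defence below are hypothetical), across() lets you sum them all in the same pass; a sketch:
library(dplyr)

df %>%
  group_by(member_id, year) %>%
  mutate(across(c(Edu, Health, Defence),    # Health and Defence are hypothetical policy-area columns
                ~ sum(.x, na.rm = TRUE))) %>%
  slice(which.max(Leadership)) %>%
  ungroup()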

Create Dataframe w/All Combinations of 2 Categorical Columns then Sum 3rd Column by Each Combination

I have a large, messy dataset but want to accomplish a straightforward thing. Essentially I want to fill a tibble based on every combination of two columns and sum a third column.
As a hypothetical example, say each observation has the company_name (Wendys, BK, McDonalds), the food_option (burgers, fries, frosty), and the total_spending (in $). I would like to make a 9x3 tibble with the company, food, and total as a sum of every observation. Here's my code so far:
df_table <- df %>%
group_by(company_name, food_option) %>%
summarize(total= sum(total_spending))
company_name food_option total
<chr> <chr> <dbl>
1 Wendys Burgers 757
2 Wendys Fries 140
3 Wendys Frosty 98
4 McDonalds Burgers 1044
5 McDonalds Fries 148
6 BK Burgers 669
7 BK Fries 38
The problem is that McDonalds has zero observations with "Frosty" as the food_option. Consequently, I get a partial table. I'd like to fill that with a row that shows:
8 McDonalds Frosty 0
9 BK Frosty 0
I know I can add the rows manually, but the actual dataset has over a hundred combinations so it will be tedious and complicated. Also, I'm constantly modifying the upstream data and I want the code to automatically fill correctly.
Thank you SO MUCH to anyone who can help. This forum has really been a godsend, really appreciate all of you.
Try:
library(dplyr)
df %>%
mutate(food_option = factor(food_option, levels = unique(food_option))) %>%
group_by(company_name, food_option, .drop = FALSE) %>%
summarise(total = sum(total_spending))
Newer versions of dplyr have a .drop argument to group_by(): if you have a factor with pre-defined levels, the empty levels will not be dropped (and you'll get the zeros).
You can use tidyr::expand_grid():
tidyr::expand_grid(company_name = c("Wendys", "McDonalds", "BK"),
food_option = c("Burgers", "Fries", "Frosty"))
to create all possible variations
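To turn that grid into the zero-filled summary, one option is to left-join the summarised counts onto it and replace the missing totals with 0; a sketch, assuming the summarised data frame is called df_table as in the question:
library(dplyr)
library(tidyr)

all_combos <- expand_grid(company_name = c("Wendys", "McDonalds", "BK"),
                          food_option  = c("Burgers", "Fries", "Frosty"))

all_combos %>%
  left_join(df_table, by = c("company_name", "food_option")) %>%  # unobserved pairs become NA
  replace_na(list(total = 0))                                     # turn those NAs into zeros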
library(tidyverse)
# example data
df = read.table(text = "
company_name food_option total
1 Wendys Burgers 757
2 Wendys Fries 140
3 Wendys Frosty 98
4 McDonalds Burgers 1044
5 McDonalds Fries 148
6 BK Burgers 669
7 BK Fries 38
", header=T)
df %>% complete(company_name, food_option, fill=list(total = 0))
# # A tibble: 9 x 3
# company_name food_option total
# <fct> <fct> <dbl>
# 1 BK Burgers 669
# 2 BK Fries 38
# 3 BK Frosty 0
# 4 McDonalds Burgers 1044
# 5 McDonalds Fries 148
# 6 McDonalds Frosty 0
# 7 Wendys Burgers 757
# 8 Wendys Fries 140
# 9 Wendys Frosty 98

Aggregate/Group_by second minimum value in R

I have used either group_by() in dplyr or the aggregate() function to aggregate across columns in R. For my current problem I want to group by individual but find the second lowest value of one column (Number) and the lowest of another (Year). So, if my data looks like this:
Number Individual Year Value
123 M. Smith 2010 234
435 M. Smith 2011 346
435 M. Smith 2012 356
524 M. Smith 2015 432
119 J. Jones 2010 345
119 J. Jones 2012 432
254 J. Jones 2013 453
876 J. Jones 2014 654
I want it to become:
Number Individual Year Value
435 M. Smith 2011 346
254 J. Jones 2013 453
Thank you.
We can use the dplyr package. dt2 is the final output. The idea is to filter out the minimum in the Number column, then arrange the data frame by Individual, Number, and Year. Finally, select the first row of each group.
# Load package
library(dplyr)
# Create example data frame
dt <- read.table(text = "Number Individual Year Value
123 'M. Smith' 2010 234
435 'M. Smith' 2011 346
435 'M. Smith' 2012 356
524 'M. Smith' 2015 432
119 'J. Jones' 2010 345
119 'J. Jones' 2012 432
254 'J. Jones' 2013 453
876 'J. Jones' 2014 654",
header = TRUE, stringsAsFactors = FALSE)
# Process the data
dt2 <- dt %>%
group_by(Individual) %>%
filter(Number != min(Number)) %>%
arrange(Individual, Number, Year) %>%
slice(1)
We can use dplyr
library(dplyr)
df1 %>%
group_by(Individual) %>%
arrange(Individual, Number) %>%
filter(Number != max(Number)) %>%
slice(which.max(Number))
# A tibble: 2 x 4
# Groups: Individual [2]
# Number Individual Year Value
# <int> <chr> <int> <int>
#1 254 J. Jones 2013 453
#2 435 M. Smith 2011 346
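An alternative that states the two conditions directly is to rank Number within each group and then take the earliest Year; a sketch using dense_rank() and slice_min() (dplyr >= 1.0.0), reusing the dt data frame from the first answer:
library(dplyr)

dt %>%
  group_by(Individual) %>%
  filter(dense_rank(Number) == 2) %>%   # keep rows with the second-lowest Number
  slice_min(Year, n = 1) %>%            # of those, keep the earliest Year
  ungroup()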

Fuzzy string matching in r

I have 2 datasets with more than 100K rows each. I would like to merge them based on fuzzy string matching one column('movie title') as well as using release date. I am providing a sample from both datasets below.
dataset-1
itemid userid rating time title release_date
99991 1673 835 3 1998-03-27 mirage 1995
99992 1674 840 4 1998-03-29 mamma roma 1962
99993 1675 851 3 1998-01-08 sunchaser, the 1996
99994 1676 851 2 1997-10-01 war at home, the 1996
99995 1677 854 3 1997-12-22 sweet nothing 1995
99996 1678 863 1 1998-03-07 mat' i syn 1997
99997 1679 863 3 1998-03-07 b. monkey 1998
99998 1680 863 2 1998-03-07 sliding doors 1998
99999 1681 896 3 1998-02-11 you so crazy 1994
100000 1682 916 3 1997-11-29 scream of stone (schrei aus stein) 1991
dataset - 2
itemid userid rating time title release_date
1 2844 4477 3 2013-03-09 fantômas - à l'ombre de la guillotine 1913
2 4936 8871 4 2013-05-05 the bank 1915
3 4936 11628 3 2013-07-06 the bank 1915
4 4972 16885 4 2013-08-19 the birth of a nation 1915
5 5078 11628 2 2013-08-23 the cheat 1915
6 6684 4222 3 2013-08-24 the fireman 1916
7 6689 4222 3 2013-08-24 the floorwalker 1916
8 7264 2092 4 2013-03-17 the rink 1916
9 7264 5943 3 2013-05-12 the rink 1916
10 7880 11628 4 2013-07-19 easy street 1917
I have looked at 'agrep' but it only matches one string at a time. The 'stringdist' function is good, but you need to run it in a loop, find the minimum distance and then go on to further processing, which is very time consuming given the size of the datasets. The strings can have typos and special characters, which is why fuzzy matching is required. I have looked around and found the 'Levenshtein' and 'Jaro-Winkler' methods. The latter, I read, is good for when you have typos in strings.
In this scenario, fuzzy matching alone may not provide good results; e.g., a movie title 'toy story' in one dataset could be matched to 'toy story 2' in the other, which is not right. So I need to consider the release date to make sure the movies that are matched are unique.
I want to know if there is a way to achieve this task without using a loop. Worst case, if I have to use a loop, how can I make it work efficiently and as fast as possible?
I have tried the following code but it has taken an awful amount of time to process.
for(i in 1:nrow(test))
for(j in 1:nrow(test1))
{
test$title.match <- ifelse(jarowinkler(test$x[i], test1$x[j]) > 0.85,
test$title, NA)
}
test - contains 1682 unique movie names converted to lower case
test1 - contains 11451 unique movie names converted to lower case
Is there a way to avoid the for loops and make it work faster?
What about this approach to move you forward? You can adjust the degree of match from 0.85 after you see the results. You could then use dplyr to group by the matched title and summarise by subtracting release dates. Any zeros would mean the same release date.
dataset1$title.match <- ifelse(jarowinkler(dataset1$title, dataset2$title) > 0.85, dataset1$title, NA)
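One way to avoid the explicit double loop is to compute the whole Jaro-Winkler distance matrix in one call with stringdist::stringdistmatrix() and then pick, for each title in the smaller table, its best match in the larger one. A sketch, assuming the lowercased titles are in a column called title in test and test1; the 0.15 distance cutoff corresponds to the 0.85 similarity used above, and the release-date check is left as a follow-up filter:
library(stringdist)

# one call computes all pairwise distances: rows = titles in test, cols = titles in test1
d <- stringdistmatrix(test$title, test1$title, method = "jw", p = 0.1)

best      <- apply(d, 1, which.min)   # index of the closest title in test1 for each row of test
best_dist <- apply(d, 1, min)

test$title.match <- ifelse(best_dist < 0.15, test1$title[best], NA)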
