How to group rows with duplicate names in R?

I am quite new to R and struggling with subsetting datasets.
This is where the dataset came from and how I cleaned it:
board_game <- read.csv("https://raw.githubusercontent.com/bryandmartin/STAT302/master/docs/Projects/project1_bgdataviz/board_game_raw.csv")
# tidy up the mechanic and category columns with the cSplit function
library(splitstackshape)
mechanic <- board_game$mechanic
board_game_tidy <- cSplit(board_game, splitCols = c("mechanic", "category"), sep = ",", direction = "long")
Here's my code trying to extract two columns: category and average complexity.
summary_category <- summary(board_game_tidy$category)
top_5_category <- summary_category[1:5]
complexity_top_5_category <- board_game_tidy %>%
  group_by(category) %>%
  select(average_complexity) %>%
  filter(category == c("Abstract Strategy", "Action / Dexterity", "Adventure", "Age of Reason", "American Civil War"))
complexity_top_5_category
My final intent: create a data frame with only two columns, category and average complexity, and take the mean of the average complexities under the same category name.
What I encountered: I have 5 rows of category, but 30 rows of average complexities. What can I do to take the mean of all the average complexities under the same category name? All help will be appreciated, thank you!

Filter the values for the top 5 categories, then group_by category and take the mean of average_complexity.
library(dplyr)
board_game_tidy %>%
  filter(category %in% names(top_5_category)) %>%
  group_by(category) %>%
  summarise(average_complexity = mean(average_complexity))
# category average_complexity
# <fct> <dbl>
#1 Abstract Strategy 0.844
#2 Action / Dexterity 0.469
#3 Adventure 1.25
#4 Age of Reason 1.95
#5 American Civil War 1.68

You are very close. You need dplyr::summarise():
complexity_top_5_category <- board_game_tidy %>%
  group_by(category) %>%
  dplyr::summarise(mean_average_complexity = mean(average_complexity, na.rm = TRUE)) %>%
  top_n(5, mean_average_complexity)
# select(average_complexity) %>% # you don't need this
# filter(category == c("Abstract Strategy", "Action / Dexterity", "Adventure", "Age of Reason", "American Civil War")) # you don't need this either
complexity_top_5_category
You don't have to include dplyr:: before summarise(). However, some other common packages have their own versions of summarise(), so it's safer to be specific.
You can use top_n() to automatically select the top n categories instead of using filter().
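As a side note, if you are on dplyr 1.0 or later, slice_max() is the documented successor to top_n() and makes the ordering column explicit. A minimal sketch of the same pipeline, using the same column names as above:
complexity_top_5_category <- board_game_tidy %>%
  group_by(category) %>%
  summarise(mean_average_complexity = mean(average_complexity, na.rm = TRUE)) %>%
  slice_max(mean_average_complexity, n = 5)  # top 5 categories by mean complexity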


R: Home sales in the last year before each sale

As a follow-up question to a previous one in the same project:
I found that real estate is often measured in inventory time, which is defined as (number of active listings) / (number of homes sold per month, averaged over the last 12 months). The best way I could find to count the number of homes sold in the 12 months before each sale is a for-loop.
library(dplyr)
homesales$yearlysales <- 0
for (i in 1:nrow(homesales)) {
  sdt <- as.Date(homesales$saledate[i])
  x <- homesales %>%
    filter(sdt - saledate >= 0 & sdt - saledate < 365) %>%
    summarise(count = n())
  homesales$yearlysales[i] <- x$count[1]
}
homesales$inventorytime = homesales$inventory / homesales$yearlysales * 12
homesales$inventorytime[is.na(homesales$saledate)] = NA
homesales$inventorytime[homesales$yearlysales==0] = NA
Obviously (?), the R language has some prejudice against using a for-loop for this type of selection. Is there a better way?
Appendix 1. data table structure
address, listingdate, saledate
101 Street, 2017/01/01, 2017/06/06
106 Street, 2017/03/01, 2017/08/11
102 Street, 2017/05/04, 2017/06/13
109 Street, 2017/07/04, 2017/11/24
...
Appendix 2. The output I'm looking for is something like this.
The following gives you the number of active listings on any given day:
library(tidyverse)
library(lubridate)
tmp <- tempfile()
download.file("https://raw.githubusercontent.com/robhanssen/glenlake-homesales/master/homesalesdata-source.csv", tmp)
data <- read_csv(tmp) %>%
  select(ends_with("date")) %>%
  mutate(across(everything(), mdy)) %>%
  pivot_longer(cols = everything(), names_to = "activity", values_to = "date", names_pattern = "(.*)date")
active <- data %>%
  mutate(active = if_else(activity == "listing", 1, -1)) %>%
  arrange(date) %>%
  mutate(active = cumsum(active)) %>%
  group_by(date) %>%
  filter(row_number() == n()) %>%
  select(-activity)
tibble(date = seq(min(data$date, na.rm = TRUE), max(data$date, na.rm = TRUE), by = "days")) %>%
  left_join(active, by = "date") %>%
  fill(active)
Basically, we pivot longer and split each row of the data into two rows indicating distinct activities: adding a listing or removing one. The cumulative sum of these +1/-1 events then gives the number of active listings.
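To see the running-count idea in isolation, here is a toy version with made-up events (not from the real csv): each listing contributes +1, each sale contributes -1, and cumsum() tracks how many listings are open at each step.
events <- c(listing = 1, listing = 1, sale = -1, listing = 1, sale = -1)
cumsum(events)
# listing listing    sale listing    sale
#       1       2       1       2       1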
Note that this assumes you are not missing any data. Depending on the specification from which the csv was made, you could be missing activity at the start or end, but that is a caveat about the csv itself.
Active listings are a fact about an instant in time; sales are a fact about a time period. You probably want to aggregate sales by month, and then use the number of active listings from the last day of the month, or perhaps the average number of listings over that month; a sketch of that follows.
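Here is one way that monthly aggregation could look, reusing the data and active objects from the pipeline above. The month, sales, and listings names are illustrative (not from the original post), the last activity day in each month stands in for the month-end listing count, and plain per-month sales are used instead of the 12-month average to keep the sketch short:
# count sales per calendar month
monthly_sales <- data %>%
  filter(activity == "sale") %>%
  mutate(month = floor_date(date, "month")) %>%
  count(month, name = "sales")
# take the listing count on the last active day of each month
month_end_listings <- active %>%
  ungroup() %>%
  mutate(month = floor_date(date, "month")) %>%
  group_by(month) %>%
  filter(date == max(date)) %>%
  select(month, listings = active)
left_join(monthly_sales, month_end_listings, by = "month") %>%
  mutate(inventory_months = listings / sales)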

How do I combine these into a single table?

I'm working with some survey data and I want to summarize responses from everyone, and responses from members in a single table.
The best way I can think of to translate this to Starwars is that I want to know how many characters total have any one eye color, and how many female characters have that eye color. For simplicity, I limited the population to blue and brown eyes.
I can run two separate queries, one to show just the females:
starwars %>%
  filter(eye_color %in% c("brown", "blue")) %>%
  count(eye_color, gender) %>%
  filter(gender == "female") %>%
  mutate(percent = n / sum(n) * 100,
         percent = sprintf("%.0f%%", percent))
And one to show all characters regardless of gender:
starwars %>%
  filter(eye_color %in% c("brown", "blue")) %>%
  count(eye_color) %>%
  mutate(percent = n / sum(n) * 100,
         percent = sprintf("%.0f%%", percent))
But I'd like to spit these out as a single table. Is there a better approach to that than just pasting the two resulting tibbles together?
I still don't know of a good way in dplyr to group data where groups overlap without repeating data, so I think combining the data from two different pipelines is fine. If you want to eliminate code duplication, you could write a helper function. Here's one such example:
library(dplyr)
library(purrr)
plus_margin <- function(data, filters, fun = identity, .id = "id") {
  stopifnot(is.list(filters))
  stopifnot(!is.null(names(filters)))
  stopifnot(all(sapply(filters, is.function)))
  stopifnot(is.function(fun))
  bind_rows(
    # one block per named filter, tagged with the filter's name via .id
    map_dfr(filters, ~ data %>% .x %>% fun, .id = .id),
    # plus the unfiltered margin, tagged "all" (!! so the .id string is used as the column name)
    data %>% fun %>% mutate(!!.id := "all")
  )
}
Then you could call it with something like
starwars %>%
  filter(eye_color %in% c("brown", "blue")) %>%
  plus_margin(
    list(feminine = . %>% filter(gender == "feminine")),
    . %>%
      count(eye_color) %>%
      mutate(percent = n / sum(n) * 100,
             percent = sprintf("%.0f%%", percent))
  )
Which returns
id eye_color n percent
<chr> <chr> <int> <chr>
1 feminine blue 6 55%
2 feminine brown 5 45%
3 all blue 19 48%
4 all brown 21 52%
The idea is that you pass in a list of filters to subset the data by. These filters should be functions that take data and subset it in some way. The list should be named, and the names will be used as values in the resulting "id" column. Here we use the magrittr syntax . %>% ... to create an anonymous function. We then need to pass in a function to apply to each of the subsets.
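For anyone unfamiliar with that dot syntax: a magrittr pipeline that starts with . is itself a function, so with dplyr attached the two forms below are interchangeable (the names f1 and f2 are illustrative):
f1 <- . %>% filter(gender == "feminine")      # magrittr functional sequence
f2 <- function(d) filter(d, gender == "feminine")  # ordinary function
identical(f1(starwars), f2(starwars))
# TRUE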
But at the end of the day, the joining is still happening with bind_rows. Maybe someone else will suggest a better way.

Question on filtering down a large dataset

In the problem here, I have a data set of popular baby names going back to 1880. I am trying to find the timelessly popular baby names, meaning names that rank among the 30 most common for their gender in every year of the data.
I have tried using group_by, top_n, and filter, but I am not very well versed with the language yet, so I am unsure of the proper order and thinking here.
library(babynames)
timeless <- babynames %>% group_by(name, sex, year) %>% top_n(30) %>% filter()
I am getting a large data table back with the 30 most common names for each year of data, but I want to compare that to find the most common names in every year. My prof hinted that there should be four timeless boy names, and one timeless girl name. Any help is appreciated!
Here is the answer.
library(babynames)
library(dplyr)
timeless <- babynames %>%
  group_by(sex, year) %>%
  top_n(30) %>%
  ungroup() %>%
  count(sex, name) %>%
  filter(n == max(babynames$year) - min(babynames$year) + 1)
timeless
# # A tibble: 5 x 3
# sex name n
# <chr> <chr> <int>
# 1 F Elizabeth 138
# 2 M James 138
# 3 M John 138
# 4 M Joseph 138
# 5 M William 138
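The filter condition works because max(babynames$year) - min(babynames$year) + 1 counts the number of distinct years in the data (138 here, matching the n column), so it keeps exactly the names that made the top 30 in every single year.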
Regarding your original code, group_by(name, sex, year) %>% top_n(30) does not make sense, as all combinations of name, sex, and year are unique, so there is nothing from which to filter a "top 30".
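You can verify that uniqueness directly; a quick check (using nrow() rather than count() to sidestep the clash with the existing n column):
library(babynames)
library(dplyr)
nrow(distinct(babynames, name, sex, year)) == nrow(babynames)
# TRUE, so each (name, sex, year) combination has exactly one row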

How to calculate weighted sums of rows based on value in another column

I searched around a lot trying to find an answer for this. It seems like it would be a relatively simple and common question, and I'm surprised I didn't find an answer; perhaps I am just not searching for the correct keywords.
I would like to calculate a weighted sum of some columns in three rows based on a value in another column. I think it makes more sense if you look at the dummy table below.
INDIVIDUAL <- c("A","A","A","A","A","A","B","B","B","B","B","B")
BEHAVIOR <- c("Smell", "Dig", "Eat", "Smell", "Dig", "Eat","Smell", "Dig", "Eat","Smell", "Dig", "Eat")
FOOD <- c("a", "a", "a","b","b","b", "a", "a", "a","b","b","b")
TIME <- c(2,4,7,6,1,2,9,0,4,3,7,6)
sample <- data.frame(Individual=INDIVIDUAL, Behavior=BEHAVIOR, Food=FOOD, Time=TIME)
Each individual spends a certain amount of time Smelling, Digging, and Eating each food item. I would like to weight and sum these three times to have one overall time per food item. Smelling is the lowest weight, eating is the highest. So basically I want a time interacting with each food item: Time per FoodA = (EatA) + (0.5*DigA) + (0.33*SmellA).
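(With the dummy data above, individual A's time for food a would be 7 + 0.5 * 4 + 0.33 * 2 = 9.66.)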
After extensive web browsing the best idea I could come up with was this:
sample %>%
  group_by(Individual, Food) %>%
  mutate(TIME = (fullsum$BEHAVIOR == "EAT")
         + .5 * (fullsum$BEHAVIOR == "DIG")
         + .33 * (fullsum$BEHAVIOR == "SMELL"))
But it doesn't work and I get this error: Error in mutate_impl(.data, dots) : incompatible size (2195), expecting 1 (the group size) or 1.
Any advice or direction to where this question has been answered already would be greatly appreciated!
FINAL RESULT
I modified fexjoo's suggestion to account for missing values and the result matches up with the values I calculated manually in Excel, so it looks like this is the winner. There may be a tidier way to remove the NAs from each of the columns but I'm ok with this.
fullsum %>%
  spread(BEHAVIOR, TIME) %>%
  mutate(EAT = coalesce(EAT, 0)) %>%
  mutate(DIG = coalesce(DIG, 0)) %>%
  mutate(SMELL = coalesce(SMELL, 0)) %>%
  mutate(TIME = EAT + .5 * DIG + .33 * SMELL)
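One tidier option for those NAs: spread() has a fill argument that fills missing combinations as they are created, which replaces the three coalesce() calls. With the dummy data's column names that would look like:
sample %>%
  spread(Behavior, Time, fill = 0) %>%   # fill absent behavior/food pairs with 0
  mutate(TIME = Eat + .5 * Dig + .33 * Smell)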
Try this:
sample %>%
  group_by(Individual, Food) %>%
  mutate(TIME = (Behavior == "Eat")
         + .5 * (Behavior == "Dig")
         + .33 * (Behavior == "Smell"))
My suggestion:
library(tidyr)
library(dplyr)
sample %>%
  spread(Behavior, Time) %>%
  mutate(TIME = Eat + .5 * Dig + .33 * Smell)
The result is:
Individual Food Dig Eat Smell TIME
1 A a 4 7 2 9.66
2 A b 1 2 6 4.48
3 B a 0 4 9 6.97
4 B b 7 6 3 10.49
You could do:
sample %>%
  mutate(weights = case_when(Behavior == "Smell" ~ 0.33,
                             Behavior == "Dig" ~ 0.5,
                             Behavior == "Eat" ~ 1)) %>%
  group_by(Food, Individual) %>%
  summarise(WeightedTime = sum(weights * Time))
Which gives:
Food Individual WeightedTime
<fctr> <fctr> <dbl>
1 a A 9.66
2 a B 6.97
3 b A 4.48
4 b B 10.49
You could create a column with the weights based on the Behavior column:
library(dplyr)
sample$weights <-
case_when(
sample$Behavior == "Smell" ~ 0.33,
sample$Behavior == "Dig" ~ 0.5,
sample$Behavior == "Eat" ~ 1
)
sample %>%
  group_by(Individual, Food) %>%
  summarise(time = sum(Time * weights))
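An equivalent base-R flavoured way to build that weights column, assuming the behaviors are spelled exactly as in the dummy data, is to index a named vector instead of using case_when():
w <- c(Smell = 0.33, Dig = 0.5, Eat = 1)
sample$weights <- w[as.character(sample$Behavior)]  # as.character() in case Behavior is a factor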

Using spread to create two value columns with tidyr

I have a data frame that looks just like this (see link). I'd like to take the output that is produced below and go one step further by spreading the tone variable across both the n and the average variables. It seems like this topic might bear on this, but I can't get it to work:
Is it possible to use spread on multiple columns in tidyr similar to dcast?
I'd like the final table to have the source variable in one column, then the tone-n and tone-avg variables in separate columns, so the column headers would be "source", "For - n", "Against - n", "For - Avg", "Against - Avg". This is for publication, not for further calculation, so it's about presenting data; it seems more intuitive to me to present data this way. Thank you.
#variable1
Politician.For<-sample(seq(0,4,1),50, replace=TRUE)
#variable2
Politician.Against<-sample(seq(0,4,1),50, replace=TRUE)
#Variable3
Activist.For<-sample(seq(0,4,1),50,replace=TRUE)
#variable4
Activist.Against<-sample(seq(0,4,1),50,replace=TRUE)
#dataframe
df<-data.frame(Politician.For, Politician.Against, Activist.For,Activist.Against)
#tidyr
df %>%
  # gather all columns
  gather(df) %>%
  # separate by the period character
  # (the default separation character is any non-alphanumeric character)
  separate(col = df, into = c('source', 'tone')) %>%
  # group by both source and tone
  group_by(source, tone) %>%
  # summarise to create counts and averages
  summarise(n = sum(value), avg = mean(value)) %>%
  # try to spread
  spread(tone, c('n', 'value'))
I think what you want is another gather to break out the count and mean as separate observations, the gather(type, val, -source, -tone) below.
gather(df, who, value) %>%
  separate(who, into = c('source', 'tone')) %>%
  group_by(source, tone) %>%
  summarise(n = sum(value), avg = mean(value)) %>%
  gather(type, val, -source, -tone) %>%
  unite(stat, c(tone, type)) %>%
  spread(stat, val)
Yields
Source: local data frame [2 x 5]
source Against_avg Against_n For_avg For_n
1 Activist 1.82 91 1.84 92
2 Politician 1.94 97 1.70 85
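For completeness: on tidyr 1.0 or later, pivot_wider() can spread several value columns in one step, so the gather/unite/spread dance collapses. A sketch against the same df (the .groups argument needs dplyr 1.0+):
library(dplyr)
library(tidyr)
df %>%
  # split "Politician.For" etc. into source and tone on the period
  pivot_longer(everything(), names_to = c("source", "tone"), names_sep = "\\.") %>%
  group_by(source, tone) %>%
  summarise(n = sum(value), avg = mean(value), .groups = "drop") %>%
  # spread both n and avg across tone at once
  pivot_wider(names_from = tone, values_from = c(n, avg))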
Using data.table syntax (thanks @akrun):
library(data.table)
dcast(
  setDT(melt(df))[, c('source', 'tone') := tstrsplit(variable, '[.]')
  ][, list(N = sum(value), avg = mean(value)), by = .(source, tone)],
  source ~ tone,
  value.var = c('N', 'avg')
)
