I am working with some state elections data that has a list of candidates who've run in different years. There's a program that some of them have participated in, and I'm interested in looking at why candidates move in and out of the program. What I want is a list of names of those who've participated in some years, but not in others. I'd like to eliminate from the list all the candidates who always or never participate.
The data looks a bit like this:
names program year
1 Smith John 1 2008
2 Smith John 1 2010
3 Oliver Mary 0 2008
4 Oliver Mary 1 2010
5 Oliver Mary 1 2012
6 O'Neil Cathy 0 2010
7 O'Neil Cathy 1 2012
So in this case, I'd want to collect Mary Oliver and Cathy O'Neil in the list, but not John Smith. I thought of using group_by in dplyr, but I'm not sure where to go next. Any thoughts on how to set this operation up?
Try keeping only the names whose program values are mixed, i.e. where the sum of the values in the program column is neither 0 (never participated) nor equal to the number of rows for that name (always participated). The following should do, I think:
Data:
df1 <- structure(list(names = c("Smith John", "Smith John", "Oliver Mary",
    "Oliver Mary", "Oliver Mary", "ONeil Cathy", "ONeil Cathy"),
    program = c(1L, 1L, 0L, 1L, 1L, 0L, 1L), year = c(2008L,
    2010L, 2008L, 2010L, 2012L, 2010L, 2012L)), .Names = c("names",
    "program", "year"), class = "data.frame", row.names = c(NA, -7L
))
Code:
library(dplyr)
# drop names that always participate (sum == n) or never participate (sum == 0)
df1 %>% group_by(names) %>% dplyr::filter(sum(program) != n(), sum(program) != 0)
Output:
names program year
<chr> <int> <int>
1 Oliver Mary 0 2008
2 Oliver Mary 1 2010
3 Oliver Mary 1 2012
4 ONeil Cathy 0 2010
5 ONeil Cathy 1 2012
I hope this helps.
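If only the list of names is needed rather than the matching rows, a sketch along these lines may work. It treats a candidate as a switcher when the program column takes more than one distinct value for that name. The "Doe Jane" rows are hypothetical and are added only to show that a never-participant is dropped as well.

```r
library(dplyr)

# Example data from the question, plus a hypothetical never-participant ("Doe Jane")
df1 <- data.frame(
  names = c("Smith John", "Smith John", "Oliver Mary", "Oliver Mary",
            "Oliver Mary", "ONeil Cathy", "ONeil Cathy", "Doe Jane", "Doe Jane"),
  program = c(1L, 1L, 0L, 1L, 1L, 0L, 1L, 0L, 0L),
  year = c(2008L, 2010L, 2008L, 2010L, 2012L, 2010L, 2012L, 2008L, 2010L)
)

switchers <- df1 %>%
  group_by(names) %>%
  filter(n_distinct(program) > 1) %>%  # keep names with both 0 and 1
  ungroup() %>%
  distinct(names) %>%                  # one row per remaining name
  pull(names)                          # plain character vector
```

Because program only holds 0 and 1 here, n_distinct(program) > 1 is equivalent to checking that the sum is neither 0 nor n().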
Related
I have panel data that records the employment status of individuals across different years. Many of them change jobs over the time span of my data. I want to capture these transitions and merge them into string sequences. For example:
Year Person Employment_Status
1990 Bob High School Teacher
1991 Bob High School Teacher
1992 Bob Freelancer
1993 Bob High School Teacher
1990 Peter Singer
1991 Peter Singer
1990 James Actor
1991 James Actor
1992 James Producer
1993 James Producer
1994 James Investor
The ideal output should look like below:
Person Job_Sequence
Bob High School Teacher-Freelancer-High School Teacher
Peter Singer
James Actor-Producer-Investor
Essentially, each person is reduced to one row. The challenge for me is that different people have different numbers of transitions (ranging from zero to a dozen).
We may apply rleid on 'Employment_Status' to put adjacent identical elements into a single group, keep the distinct combinations of 'Person' and 'grp', and then paste the statuses together by 'Person':
library(dplyr)
library(data.table)  # for rleid
library(stringr)     # for str_c
df1 %>%
  mutate(grp = rleid(Employment_Status)) %>%
  distinct(Person, grp, .keep_all = TRUE) %>%
  group_by(Person) %>%
  summarise(Job_Sequence = str_c(Employment_Status,
    collapse = '-'), .groups = 'drop')
-output
# A tibble: 3 × 2
Person Job_Sequence
<chr> <chr>
1 Bob High School Teacher-Freelancer-High School Teacher
2 James Actor-Producer-Investor
3 Peter Singer
Or using base R
aggregate(cbind(Job_Sequence = Employment_Status) ~ Person,
subset(df1, !duplicated(with(rle(Employment_Status),
rep(seq_along(values), lengths)))), FUN = paste, collapse = '-')
-output
Person Job_Sequence
1 Bob High School Teacher-Freelancer-High School Teacher
2 James Actor-Producer-Investor
3 Peter Singer
data
df1 <- structure(list(Year = c(1990L, 1991L, 1992L, 1993L, 1990L, 1991L,
1990L, 1991L, 1992L, 1993L, 1994L), Person = c("Bob", "Bob",
"Bob", "Bob", "Peter", "Peter", "James", "James", "James", "James",
"James"), Employment_Status = c("High School Teacher", "High School Teacher",
"Freelancer", "High School Teacher", "Singer", "Singer", "Actor",
"Actor", "Producer", "Producer", "Investor")),
class = "data.frame", row.names = c(NA,
-11L))
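A shorter base R variant is sketched below. It runs rle separately within each person, so a run can never span two people even if one person's last status happens to equal the next person's first. The names job_seq and result are illustrative, and the rows are assumed to be in chronological order within each person.

```r
# Same example data as above
df1 <- data.frame(
  Year = c(1990L, 1991L, 1992L, 1993L, 1990L, 1991L,
           1990L, 1991L, 1992L, 1993L, 1994L),
  Person = c("Bob", "Bob", "Bob", "Bob", "Peter", "Peter",
             "James", "James", "James", "James", "James"),
  Employment_Status = c("High School Teacher", "High School Teacher",
                        "Freelancer", "High School Teacher", "Singer", "Singer",
                        "Actor", "Actor", "Producer", "Producer", "Investor")
)

# For each person, collapse consecutive repeats with rle() and join the run values
job_seq <- tapply(df1$Employment_Status, df1$Person,
                  function(x) paste(rle(x)$values, collapse = "-"))

result <- data.frame(Person = names(job_seq),
                     Job_Sequence = as.vector(job_seq))
```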
Suppose I have a dataset that looks like the one below:
Person Year From To
Peter 2001 Apple Microsoft
Peter 2006 Microsoft IBM
Peter 2010 IBM Facebook
Peter 2016 Facebook Apple
Kate 2003 Microsoft Google
Jimmy 2001 Samsung IBM
Jimmy 2004 IBM Google
Jimmy 2009 Google Facebook
I want to filter by person and only keep people who worked at IBM sometime (either in the From or in the To column). Furthermore, I only want to keep the records before people move away from IBM (that is, before "IBM" first appears in the From column). Thus, I want something like below:
Person Year From To
Peter 2001 Apple Microsoft
Peter 2006 Microsoft IBM
Jimmy 2001 Samsung IBM
A possible solution with dplyr:
library(dplyr)
df %>%
group_by(Person) %>%
filter(To == "IBM" | lead(To) == "IBM") %>%
ungroup()
# A tibble: 3 x 4
Person Year From To
<chr> <int> <chr> <chr>
1 Peter 2001 Apple Microsoft
2 Peter 2006 Microsoft IBM
3 Jimmy 2001 Samsung IBM
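The lead-based filter keeps at most one row ahead of the move to IBM. If a person could have several earlier records, a cumulative-sum condition may be a more general sketch: keep people who ever appear with IBM, and keep only their rows before "IBM" first shows up in From. This assumes rows are already sorted by Year within each Person; result is an illustrative name.

```r
library(dplyr)

# Same example data as in the question
df <- data.frame(
  Person = c("Peter", "Peter", "Peter", "Peter", "Kate", "Jimmy", "Jimmy", "Jimmy"),
  Year = c(2001L, 2006L, 2010L, 2016L, 2003L, 2001L, 2004L, 2009L),
  From = c("Apple", "Microsoft", "IBM", "Facebook", "Microsoft",
           "Samsung", "IBM", "Google"),
  To = c("Microsoft", "IBM", "Facebook", "Apple", "Google",
         "IBM", "Google", "Facebook")
)

result <- df %>%
  group_by(Person) %>%
  filter(any(From == "IBM" | To == "IBM"),  # person worked at IBM at some point
         cumsum(From == "IBM") == 0) %>%    # rows before first leaving IBM
  ungroup()
```

On the example data this keeps the same three rows as the answer above. Note that a person whose very first record already has "IBM" in From would lose all rows under this rule.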
Data
df <- structure(list(Person = c("Peter", "Peter", "Peter", "Peter",
"Kate", "Jimmy", "Jimmy", "Jimmy"), Year = c(2001L, 2006L, 2010L,
2016L, 2003L, 2001L, 2004L, 2009L), From = c("Apple", "Microsoft",
"IBM", "Facebook", "Microsoft", "Samsung", "IBM", "Google"),
To = c("Microsoft", "IBM", "Facebook", "Apple", "Google",
"IBM", "Google", "Facebook")), class = "data.frame", row.names = c(NA, -8L))
I have a data frame that I created using xtabs.
My goal is to create an area graph / sand chart using this data frame, but I'm not entirely sure how to declare the axes.
vg <- read.csv("vgdata.csv")
df <- data.frame(vg)
graph <- xtabs(Sales ~ Year + Genre, df)
print(graph)
Output:
Genre
Year Action RPG Shooter
2005 3 2 2
2006 1 1 3
2007 3 3 4
2008 1 5 8
2009 4 7 7
2010 4 5 2
Typically I would use Sales, Genre, Year, etc as variables of my graph, but these don't exist because of how it was created using xtabs. I simply have graph as a defined variable.
I would like to have the years on the x axis and the sales data on the y axis with the genre being labels. I'm hoping there is an easy way to do this with the format I already have. The reason I chose xtabs is because I had several video game titles under action, RPG, and shooter for each year and it was a convenient way to sum them to get a data frame of total sales per year.
Here are a couple of possible ways you can plot your results. For this example I'm using data from your last SO question.
Assuming you use xtabs as described:
result <- xtabs(Sales ~ Year + Genre, df)
You can convert to a data frame:
plot_data <- as.data.frame(result)
plot_data
Year Genre Freq
1 2005 Action 3
2 2006 Action 0
3 2007 Action 1
4 2005 RPG 0
5 2006 RPG 4
6 2007 RPG 2
7 2005 Shooter 3
8 2006 Shooter 0
9 2007 Shooter 3
For the area plot, you can make Year numeric instead of a factor for use on the x axis:
plot_data$Year <- as.numeric(as.character(plot_data$Year))
Then plot with geom_area:
library(ggplot2)
ggplot(plot_data, aes(x = Year, y = Freq, fill = Genre)) +
geom_area() +
scale_x_continuous(breaks = 2005:2007)
Data
df <- structure(list(Name = 1:8, Year = c(2005L, 2005L, 2005L, 2006L,
2006L, 2007L, 2007L, 2007L), Genre = c("Action", "Action", "Shooter",
"RPG", "RPG", "Action", "Shooter", "RPG"), Sales = c(1L, 2L,
3L, 2L, 2L, 1L, 3L, 2L)), class = "data.frame", row.names = c(NA,
-8L))
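As a second way to plot the same table, a stacked bar chart is sketched below. It uses the responseName argument of as.data.frame so the value column is called Sales instead of Freq; the object name p is illustrative.

```r
library(ggplot2)

# Same example data as above (Name column omitted, it is not used)
df <- data.frame(
  Year = c(2005L, 2005L, 2005L, 2006L, 2006L, 2007L, 2007L, 2007L),
  Genre = c("Action", "Action", "Shooter", "RPG", "RPG", "Action", "Shooter", "RPG"),
  Sales = c(1L, 2L, 3L, 2L, 2L, 1L, 3L, 2L)
)

# Long format: one row per Year/Genre combination, value column named Sales
plot_data <- as.data.frame(xtabs(Sales ~ Year + Genre, df), responseName = "Sales")

# Year comes back as a factor; convert it for a continuous x axis
plot_data$Year <- as.numeric(as.character(plot_data$Year))

p <- ggplot(plot_data, aes(x = Year, y = Sales, fill = Genre)) +
  geom_col() +                               # bars stack by Genre by default
  scale_x_continuous(breaks = 2005:2007)
```

Printing p draws the chart; swapping geom_col() for geom_area() gives the area version.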
I have a csv file similar to below:
Name - Year - Genre - Sales
1 - 2005 - Action - 1
2 - 2005 - Action - 2
3 - 2005 - Shooter - 3
4 - 2006 - RPG - 2
5 - 2006 - RPG - 2
6 - 2007 - Action - 1
7 - 2007 - Shooter - 3
8 - 2007 - RPG - 2
...
My end goal is to make a sand chart in R that shows the total sales of each genre on the y axis and year on the x axis, with the labels being the genres.
I need to sum up the sales of each of the genres per year, for example 2005 sales would be Action:3, Shooter:3, RPG:0. And do this for every year.
This would eventually give me a data frame that looks like this:
Action Shooter RPG
2005 3 3 0
2006 0 0 4
2007 1 3 2
In Python, I could do this using enumerate, but I'm having a hard time figuring this out in R.
Here's what I have so far
vg <- read.csv("vgdata.csv")
genres <- unique(vg$Genre)
years <- sort(unique(vg$Year))
genredf <-data.frame(vg$Genre)
i<-0
for (year in (unique(vg$Year))) {
yeardata = rep(0,length(genres))
}
This would give me the data frame with 0s in it. Now I'm trying to add in the summation of the data so I can chart it.
Sorry for the poor formatting. I'm still new to stack overflow.
We could use xtabs
xtabs(Sales ~ Year + Genre, df)
Here is a base R solution using reshape + aggregate (though it is not as simple as the xtabs approach by @akrun)
dfout <- reshape(aggregate(Sales~Year + Genre,df,sum),
direction = "wide",
idvar = "Year",
timevar = "Genre")
such that
> dfout
Year Sales.Action Sales.RPG Sales.Shooter
1 2005 3 NA 3
2 2007 1 2 3
3 2006 NA 4 NA
DATA
df <- structure(list(Name = 1:8, Year = c(2005L, 2005L, 2005L, 2006L,
2006L, 2007L, 2007L, 2007L), Genre = c("Action", "Action", "Shooter",
"RPG", "RPG", "Action", "Shooter", "RPG"), Sales = c(1L, 2L,
3L, 2L, 2L, 1L, 3L, 2L)), class = "data.frame", row.names = c(NA,
-8L))
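Since the question asks for a data frame with zeros for missing combinations, one possible sketch is to convert the xtabs table directly with as.data.frame.matrix, which keeps the wide layout and the zero counts. The name wide is illustrative.

```r
# Same example data as above (Name column omitted, it is not used)
df <- data.frame(
  Year = c(2005L, 2005L, 2005L, 2006L, 2006L, 2007L, 2007L, 2007L),
  Genre = c("Action", "Action", "Shooter", "RPG", "RPG", "Action", "Shooter", "RPG"),
  Sales = c(1L, 2L, 3L, 2L, 2L, 1L, 3L, 2L)
)

# Years become row names, genres become columns, absent combinations are 0
wide <- as.data.frame.matrix(xtabs(Sales ~ Year + Genre, df))
```

The columns come out in alphabetical order (Action, RPG, Shooter) rather than the order shown in the question.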
Imagine I have the following data:
Year Month State ppo
2011 Jan CA 220
2011 Feb CA 250
2012 Jan CA 230
2011 Jan WA 200
2011 Feb WA 210
I need to calculate the mean for each state for the year, so the output would look something like this:
Year Month State ppo annualAvg
2011 Jan CA 220 235
2011 Feb CA 250 235
2012 Jan CA 230 230
2011 Jan WA 200 205
2011 Feb WA 210 205
where the annual average is the mean of any entries for that state in the same year. If the year and state were constant I would know how to do this, but somehow the fact that they are variable is throwing me off.
Looking around, it seems like maybe ddply is what I want to be using for this (https://stats.stackexchange.com/questions/8225/how-to-summarize-data-by-group-in-r), but when I tried to use it I was doing something wrong and kept getting errors (I have tried so many variations of it that I won't bother to post them all here). Any idea how I am actually supposed to be doing this?
Thanks for the help!
Try this:
library(data.table)
setDT(df)
df[ , annualAvg := mean(ppo) , by =.(Year, State) ]
Base R:
df$annualAvg <- ave(df$ppo, df$State, df$Year, FUN = mean)
Using dplyr with group_by %>% mutate to add a column:
library(dplyr)
df %>% group_by(Year, State) %>% mutate(annualAvg = mean(ppo))
#Source: local data frame [5 x 5]
#Groups: Year, State [3]
# Year Month State ppo annualAvg
# (int) (fctr) (fctr) (int) (dbl)
#1 2011 Jan CA 220 235
#2 2011 Feb CA 250 235
#3 2012 Jan CA 230 230
#4 2011 Jan WA 200 205
#5 2011 Feb WA 210 205
Using data.table:
library(data.table)
setDT(df)[, annualAvg := mean(ppo), .(Year, State)]
df
# Year Month State ppo annualAvg
#1: 2011 Jan CA 220 235
#2: 2011 Feb CA 250 235
#3: 2012 Jan CA 230 230
#4: 2011 Jan WA 200 205
#5: 2011 Feb WA 210 205
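Since the question mentions ddply, here is a sketch of how the same column could be added with plyr, assuming the package is installed. Passing transform as the function keeps every original row and appends the group mean. The name out is illustrative.

```r
library(plyr)

# Same example data as in the question
df <- data.frame(
  Year = c(2011L, 2011L, 2012L, 2011L, 2011L),
  Month = c("Jan", "Feb", "Jan", "Jan", "Feb"),
  State = c("CA", "CA", "CA", "WA", "WA"),
  ppo = c(220L, 250L, 230L, 200L, 210L)
)

# Split by Year and State, add the group mean to each piece, recombine
out <- ddply(df, .(Year, State), transform, annualAvg = mean(ppo))
```

Unlike the mutate and := versions, ddply returns the rows ordered by the grouping variables, so the row order may differ from the input.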
Data:
# input data only; annualAvg is added by the code above
df <- structure(list(Year = c(2011L, 2011L, 2012L, 2011L, 2011L), Month = structure(c(2L,
    1L, 2L, 2L, 1L), .Label = c("Feb", "Jan"), class = "factor"),
    State = structure(c(1L, 1L, 1L, 2L, 2L), .Label = c("CA",
    "WA"), class = "factor"), ppo = c(220L, 250L, 230L, 200L,
    210L)), class = "data.frame", row.names = c(NA, -5L))