Aggregating Data in R Data Frame


I have a csv file similar to below:
Name - Year - Genre - Sales
1 - 2005 - Action - 1
2 - 2005 - Action - 2
3 - 2005 - Shooter - 3
4 - 2006 - RPG - 2
5 - 2006 - RPG - 2
6 - 2007 - Action - 1
7 - 2007 - Shooter - 3
8 - 2007 - RPG - 2
...
My end goal is to make a sand chart in R that shows the total sales of each genre on the y axis and year on the x axis, with the labels being the genres.
I need to sum up the sales of each of the genres per year, for example 2005 sales would be Action:3, Shooter:3, RPG:0. And do this for every year.
This would eventually give me a data frame that looks like this:
Action Shooter RPG
2005 3 3 0
2006 0 0 4
2007 1 3 2
In Python, I could do this using enumerate, but I'm having a hard time figuring this out in R.
Here's what I have so far
vg <- read.csv("vgdata.csv")
genres <- unique(vg$Genre)
years <- sort(unique(vg$Year))
genredf <- data.frame(vg$Genre)
i <- 0
for (year in unique(vg$Year)) {
  yeardata <- rep(0, length(genres))
}
This would give me the data frame with 0s in it. Now I'm trying to add in the summation of the data so I can chart it.
Sorry for the poor formatting. I'm still new to Stack Overflow.

We could use xtabs
xtabs(Sales ~ Year + Genre, df1)
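To make the one-liner above concrete, here is a minimal sketch that rebuilds the sample rows from the question and turns the xtabs result into an ordinary wide data frame. The object names tab and wide are only illustrative, and as.data.frame.matrix is one of several ways to do the conversion.

```r
# Sample data mirroring the rows shown in the question
df1 <- data.frame(
  Name  = 1:8,
  Year  = c(2005, 2005, 2005, 2006, 2006, 2007, 2007, 2007),
  Genre = c("Action", "Action", "Shooter", "RPG", "RPG",
            "Action", "Shooter", "RPG"),
  Sales = c(1, 2, 3, 2, 2, 1, 3, 2)
)

# Sum Sales for every Year/Genre combination;
# combinations that never occur are filled with 0
tab <- xtabs(Sales ~ Year + Genre, data = df1)

# Convert the contingency table into a plain data frame:
# one row per year (as row names), one column per genre
wide <- as.data.frame.matrix(tab)
print(wide)
```

Note that the genre columns come out in alphabetical order (Action, RPG, Shooter) rather than in order of first appearance.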

Here is a base R solution using reshape + aggregate (though it is not as simple as @akrun's xtabs approach):
dfout <- reshape(aggregate(Sales ~ Year + Genre, df, sum),
                 direction = "wide",
                 idvar = "Year",
                 timevar = "Genre")
such that
> dfout
Year Sales.Action Sales.RPG Sales.Shooter
1 2005 3 NA 3
2 2007 1 2 3
3 2006 NA 4 NA
DATA
df <- structure(list(Name = 1:8, Year = c(2005L, 2005L, 2005L, 2006L,
2006L, 2007L, 2007L, 2007L), Genre = c("Action", "Action", "Shooter",
"RPG", "RPG", "Action", "Shooter", "RPG"), Sales = c(1L, 2L,
3L, 2L, 2L, 1L, 3L, 2L)), class = "data.frame", row.names = c(NA,
-8L))
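The reshape output above has NA where a genre had no sales in a year and its rows are not in year order, while the question asked for zeros. One possible cleanup, sketched here as an assumption about what the asker wants rather than as part of the original answer, is:

```r
# Same sample data as in the answer (Name column omitted, it is not needed)
df <- data.frame(
  Year  = c(2005L, 2005L, 2005L, 2006L, 2006L, 2007L, 2007L, 2007L),
  Genre = c("Action", "Action", "Shooter", "RPG", "RPG",
            "Action", "Shooter", "RPG"),
  Sales = c(1L, 2L, 3L, 2L, 2L, 1L, 3L, 2L)
)

# Long table of totals, then one column per genre
agg   <- aggregate(Sales ~ Year + Genre, data = df, FUN = sum)
dfout <- reshape(agg, direction = "wide", idvar = "Year", timevar = "Genre")

# Treat "no rows for this Year/Genre" as zero sales
dfout[is.na(dfout)] <- 0

# Drop the "Sales." prefix and sort by year
names(dfout) <- sub("^Sales\\.", "", names(dfout))
dfout <- dfout[order(dfout$Year), ]
rownames(dfout) <- NULL
print(dfout)
```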

Related

How to replace a column in R by a modified column, dependent on filtered values? (removing outliers in panel data)

I have a panel dataset that goes like this
year id treatment_year time_to_treatment outcome
2000  1           2011               -11       2
2002  1           2011               -10       3
2004  2           2015                -9      22
and so on. I am trying to deal with outliers by winsorizing the data. The end goal is to make a scatterplot with time_to_treatment on the x axis and outcome on the y axis.
I would like to replace the outcomes for each time_to_treatment by its winsorized outcomes, i.e. replace all extreme values with the 5% and 95% quantile values.
So far what I have tried to do is this but it doesn't work.
for(i in range(dataset$time_to_treatment)){
dplyr::filter(dataset, time_to_treatment == i)$outcome <- DescTools::Winsorize(dplyr::filter(dataset,time_to_treatment==i)$outcome)
}
I get the error - Error in filter(dataset, time_to_treatment == i) <- *vtmp* :
could not find function "filter<-"
Would anyone be able to suggest a better way?
Thanks.
My actual data, where conflicts = outcome, commission = year of treatment, and CD_MUN = id.
The relevant time period indicator is time_to_t.
# Groups: year, CD_MUN, type [6]
type   CD_MUN  year time_to_t conflicts commission
<chr>   <dbl> <dbl>     <dbl>     <int>      <dbl>
manif 1100023  2000       -11         1       2011
manif 1100189  2000        -3         2       2003
manif 1100205  2000        -9         5       2009
manif 1500602  2000        -4         1       2004
manif 3111002  2000       -11         2       2011
manif 3147006  2000       -10         1       2010

Assuming "time periods" refers to the 'commission' column, you may use ave.
transform(dat, conflicts_w=ave(conflicts, commission, FUN=DescTools::Winsorize))
# type CD_MUN year time_to_t conflicts commission conflicts_w
# 1 manif 1100023 2000 -11 1 2011 1.05
# 2 manif 1100189 2000 -3 2 2003 2.00
# 3 manif 1100205 2000 -9 5 2009 5.00
# 4 manif 1500602 2000 -4 1 2004 1.00
# 5 manif 3111002 2000 -11 2 2011 1.95
# 6 manif 3147006 2000 -10 1 2010 1.00
Data:
dat <- structure(list(type = c("manif", "manif", "manif", "manif", "manif",
"manif"), CD_MUN = c(1100023L, 1100189L, 1100205L, 1500602L,
3111002L, 3147006L), year = c(2000L, 2000L, 2000L, 2000L, 2000L,
2000L), time_to_t = c(-11L, -3L, -9L, -4L, -11L, -10L), conflicts = c(1L,
2L, 5L, 1L, 2L, 1L), commission = c(2011L, 2003L, 2009L, 2004L,
2011L, 2010L)), class = "data.frame", row.names = c(NA, -6L))
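If you would rather group by time_to_t (the indicator named in the question) and avoid the DescTools dependency, the same idea can be sketched in base R. The helper name winsorize_5_95 is made up for this illustration, and the 5%/95% cut-offs are taken from the question.

```r
# Hypothetical helper: clamp a vector to its own 5% and 95% quantiles
winsorize_5_95 <- function(x) {
  q <- quantile(x, probs = c(0.05, 0.95), na.rm = TRUE, names = FALSE)
  pmin(pmax(x, q[1]), q[2])
}

# Reduced copy of the sample data (conflicts stored as double so the
# winsorized, non-integer values fit in the result)
dat <- data.frame(
  time_to_t = c(-11, -3, -9, -4, -11, -10),
  conflicts = c(1, 2, 5, 1, 2, 1)
)

# ave() applies the helper within each time_to_t group and returns
# the results in the original row order
dat$conflicts_w <- ave(dat$conflicts, dat$time_to_t, FUN = winsorize_5_95)
print(dat)
```

With this tiny sample only the time_to_t == -11 group has more than one row, so only rows 1 and 5 change (to 1.05 and 1.95); groups with a single observation are left as they are.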

For a start you may use this:
# The data
set.seed(123)
df <- data.frame(
  time_to_treatment = seq(-15, 0, 1),
  outcome = sample(1:30, 16, replace = TRUE)
)
# A solution without Winsorize based solely on dplyr
library(dplyr)
df %>%
  mutate(outcome05 = quantile(outcome, probs = 0.05), # 5% quantile
         outcome95 = quantile(outcome, probs = 0.95), # 95% quantile
         outcome = ifelse(outcome <= outcome05, outcome05, outcome), # replace
         outcome = ifelse(outcome >= outcome95, outcome95, outcome)) %>%
  select(-c(outcome05, outcome95))
You may adapt this to your exact problem.

Checking data in R (whether the full range of values in a data frame column exists)

I have stumbled across a problem in checking data with R. I am fairly new to it and unfortunately, I have not managed to find a solution.
An example of my data frame (let's call it X) is as follows:
ID Year Month
1 2012 7
1 2012 8
1 2012 9
2 2012 10
1 2012 11
3 2012 12
What I want to do is check for each ID whether all the months from 1 until 12 are present. I have tried this code :
Dataset_check <- X %>% mutate(check=X$ID<- ifelse(sapply(X$ID, function(Month)
any(X$Month <=12 & X$Month >=1)), "YES", NA))
but it does not check whether ALL of the months are included but rather if any of the months (1 through 12) are there.
I am not sure which function to use instead of "any" to say that I want to check whether all of them exist. Do you have any ideas? Am I on the right track at all, or should I look at it another way?
Thank you in advance.

Does this work:
library(dplyr)
df %>% group_by(ID) %>% mutate(check = if_else(all(1:12 %in% Month), 'Yes','No'))
# A tibble: 6 x 4
# Groups: ID [3]
ID Year Month check
<dbl> <dbl> <int> <chr>
1 1 2012 7 No
2 1 2012 8 No
3 1 2012 9 No
4 2 2012 10 No
5 1 2012 11 No
6 3 2012 12 No

We may use base R as well
df1$check <- with(df1, c("No", "Yes")[1 + ave(Month, ID,
FUN = function(x) all(1:12 %in% x))])
df1$check
[1] "No" "No" "No" "No" "No" "No"
data
df1 <- structure(list(ID = c(1L, 1L, 1L, 2L, 1L, 3L), Year = c(2012L,
2012L, 2012L, 2012L, 2012L, 2012L), Month = 7:12),
class = "data.frame", row.names = c(NA,
-6L))
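Beyond a Yes/No flag, it can help to see which months are missing for each ID. A small base R sketch along those lines follows; the names missing_months and complete are only illustrative.

```r
# Sample data from the question
df1 <- data.frame(
  ID    = c(1L, 1L, 1L, 2L, 1L, 3L),
  Year  = 2012L,
  Month = 7:12
)

# For each ID, list the months in 1:12 that never appear
missing_months <- lapply(split(df1$Month, df1$ID),
                         function(m) setdiff(1:12, m))

# An ID is complete only when nothing is missing
complete <- lengths(missing_months) == 0

print(missing_months)
print(complete)
```

all(1:12 %in% m) and length(setdiff(1:12, m)) == 0 express the same condition; the setdiff form simply keeps the detail of what is absent.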

Creating Area Chart in R from xtabs Data Frame

I have a data frame that I created using xtabs.
My goal is to create an area graph / sand chart using this data frame, but I'm not entirely sure how to declare the axes.
vg <- read.csv("vgdata.csv")
df <- data.frame(vg)
graph <- xtabs(Sales ~ Year + Genre, df)
print(graph)
Output:
Genre
Year Action RPG Shooter
2005 3 2 2
2006 1 1 3
2007 3 3 4
2008 1 5 8
2009 4 7 7
2010 4 5 2
Typically I would use Sales, Genre, Year, etc as variables of my graph, but these don't exist because of how it was created using xtabs. I simply have graph as a defined variable.
I would like to have the years on the x axis and the sales data on the y axis with the genre being labels. I'm hoping there is an easy way to do this with the format I already have. The reason I chose xtabs is because I had several video game titles under action, RPG, and shooter for each year and it was a convenient way to sum them to get a data frame of total sales per year.

Here are a couple of possible ways you can plot your results. For this example I'm using data from your last SO question.
Assuming you use xtabs as described:
result <- xtabs(Sales ~ Year + Genre, df)
You can convert to a data frame:
plot_data <- as.data.frame(result)
plot_data
Year Genre Freq
1 2005 Action 3
2 2006 Action 0
3 2007 Action 1
4 2005 RPG 0
5 2006 RPG 4
6 2007 RPG 2
7 2005 Shooter 3
8 2006 Shooter 0
9 2007 Shooter 3
For the area plot, you can make Year numeric instead of a factor for use on the x axis:
plot_data$Year <- as.numeric(as.character(plot_data$Year))
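The as.character step matters because as.data.frame on a table stores Year as a factor, and calling as.numeric directly on a factor returns its internal level codes rather than the labels. A short illustration with a made-up factor:

```r
# A factor whose labels look like years
year_factor <- factor(c("2005", "2006", "2007"))

# Direct conversion yields the level codes, not the years
codes <- as.numeric(year_factor)

# Going through character first recovers the actual year values
years <- as.numeric(as.character(year_factor))

print(codes)  # 1 2 3
print(years)  # 2005 2006 2007
```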
Then plot with geom_area:
library(ggplot2)
ggplot(plot_data, aes(x = Year, y = Freq, fill = Genre)) +
  geom_area() +
  scale_x_continuous(breaks = 2005:2007)
Plot
Data
df <- structure(list(Name = 1:8, Year = c(2005L, 2005L, 2005L, 2006L,
2006L, 2007L, 2007L, 2007L), Genre = c("Action", "Action", "Shooter",
"RPG", "RPG", "Action", "Shooter", "RPG"), Sales = c(1L, 2L,
3L, 2L, 2L, 1L, 3L, 2L)), class = "data.frame", row.names = c(NA,
-8L))

Months to integer R

This is part of the dataframe I am working on. The first column represents the year, the second the month, and the third one the number of observations for that month of that year.
2005 07 2
2005 10 4
2005 12 2
2006 01 4
2006 02 1
2006 07 2
2006 08 1
2006 10 3
I have observations from 2000 to 2018. I would like to run a kernel regression on this data, so I need to create a continuous integer index from a date class vector. For instance, Jan 2000 would be 1, Jan 2001 would be 13, Jan 2002 would be 25, and so on. With that I will be able to run the kernel regression. Later on, I need to translate the index back (1 would be Jan 2000, 2 would be Feb 2000, and so on) to plot my model.

Just use a little algebra:
df$cont <- (df$year - 2000L) * 12L + df$month
You could go backward with modulus and integer division.
df$year <- (df$cont - 1L) %/% 12L + 2000L # subtract 1 first so December stays in its own year
df$month <- (df$cont - 1L) %% 12L + 1L # maps remainders 0..11 back to months 1..12
Here, %% is the modulus operator and %/% is the integer division operator. See ?"%%" for an explanation of these and other arithmetic operators.
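A quick way to convince yourself that the index arithmetic is right is a round-trip check on the edge cases (January and December). The sketch below wraps both directions in two small functions; to_index, from_index and base_year are names invented for this example, and it subtracts 1 before dividing so that month 12 does not spill into the next year.

```r
# Hypothetical helpers for a month index where Jan of base_year is 1
to_index <- function(year, month, base_year = 2000L) {
  (year - base_year) * 12L + month
}

from_index <- function(idx, base_year = 2000L) {
  list(
    year  = (idx - 1L) %/% 12L + base_year,  # whole years elapsed
    month = (idx - 1L) %% 12L + 1L           # remainder 0..11 -> month 1..12
  )
}

# Edge cases: first month, a December, the following January, last month
years  <- c(2000L, 2000L, 2001L, 2018L)
months <- c(1L, 12L, 1L, 12L)

idx  <- to_index(years, months)   # 1 12 13 228
back <- from_index(idx)

print(idx)
print(back)
```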

What you can do is something like the following. First create a dates data.frame with expand.grid so we have all the years and months from 2000 01 to 2018 12. Next put this in the correct order and last add an order column so that 2000 01 starts with 1 and 2018 12 is 228. If you merge this with your original table you get the below result. You can then remove columns you don't need. And because you have a dates table you can return the year and month columns based on the order column.
dates <- expand.grid(year = seq(2000, 2018), month = seq(1, 12))
dates <- dates[order(dates$year, dates$month), ]
dates$order <- seq_along(dates$year)
merge(df, dates, by.x = c("year", "month"), by.y = c("year", "month"))
year month obs order
1 2005 10 4 70
2 2005 12 2 72
3 2005 7 2 67
4 2006 1 4 73
5 2006 10 3 82
6 2006 2 1 74
7 2006 7 2 79
8 2006 8 1 80
data:
df <- structure(list(year = c(2005L, 2005L, 2005L, 2006L, 2006L, 2006L, 2006L, 2006L),
month = c(7L, 10L, 12L, 1L, 2L, 7L, 8L, 10L),
obs = c(2L, 4L, 2L, 4L, 1L, 2L, 1L, 3L)),
class = "data.frame",
row.names = c(NA, -8L))

An option is to use the yearmon type from the zoo package and then calculate the number of months since Jan 2000 as the difference between two yearmon values.
library(zoo)
# +1 has been added to difference so that Jan 2000 is treated as 1
df$slNum = (as.yearmon(paste0(df$year, df$month),"%Y%m")-as.yearmon("200001","%Y%m"))*12+1
# year month obs slNum
# 1 2005 7 2 67
# 2 2005 10 4 70
# 3 2005 12 2 72
# 4 2006 1 4 73
# 5 2006 2 1 74
# 6 2006 7 2 79
# 7 2006 8 1 80
# 8 2006 10 3 82
Data:
df <- read.table(text =
"year month obs
2005 07 2
2005 10 4
2005 12 2
2006 01 4
2006 02 1
2006 07 2
2006 08 1
2006 10 3",
header = TRUE, stringsAsFactors = FALSE)

Finding status changes in R

I am working with some state elections data that has a list of candidates
who've run in different years. There's a program that some of them have participated in, and I'm interested in looking at why candidates move in and out of the program. What I want is a list of names of those who've participated in some years, but not in others. I'd like to eliminate from the list all the candidates who always or never participate.
The data looks a bit like this:
names program year
1 Smith John 1 2008
2 Smith John 1 2010
3 Oliver Mary 0 2008
4 Oliver Mary 1 2010
5 Oliver Mary 1 2012
6 O'Neil Cathy 0 2010
7 O'Neil Cathy 1 2012
So in this case, I'd want to collect Mary Oliver and Cathy O'Neil in the list, but not John Smith. I thought of using group_by in dplyr, but I'm not sure where to go next. Any thoughts on how to set this operation up?

Try keeping only the names for which the sum of the values in the program column differs from the number of rows for that name, which drops the candidates who participate every time. The following should do, I think:
Data:
df1 <- structure(list(names = c("Smith John", "Smith John", "Oliver Mary",
"Oliver Mary", "Oliver Mary", "ONeil Cathy", "ONeil Cathy"),
program = c(1L, 1L, 0L, 1L, 1L, 0L, 1L), year = c(2008L,
2010L, 2008L, 2010L, 2012L, 2010L, 2012L)), .Names = c("names",
"program", "year"), class = "data.frame", row.names = c(NA, -7L
))
Code:
library(dplyr)
df1 %>% group_by(names) %>% dplyr::filter(sum(program) != n())
Output:
names program year
<chr> <int> <int>
1 Oliver Mary 0 2008
2 Oliver Mary 1 2010
3 Oliver Mary 1 2012
4 ONeil Cathy 0 2010
5 ONeil Cathy 1 2012
I hope this helps.
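One caveat: comparing sum(program) with the row count removes candidates who always participate, but someone who never participates (all zeros) would still be kept. A base R sketch that drops both groups by counting distinct program values per name is shown below; the candidate "Doe Jane" is invented here purely to exercise the never-participates case.

```r
# Sample data from the answer, plus one made-up candidate who never participates
df1 <- data.frame(
  names   = c("Smith John", "Smith John", "Oliver Mary", "Oliver Mary",
              "Oliver Mary", "ONeil Cathy", "ONeil Cathy",
              "Doe Jane", "Doe Jane"),
  program = c(1, 1, 0, 1, 1, 0, 1, 0, 0),
  year    = c(2008, 2010, 2008, 2010, 2012, 2010, 2012, 2008, 2010)
)

# Number of distinct program values seen for each candidate,
# repeated on every row of that candidate
n_states <- ave(df1$program, df1$names,
                FUN = function(p) length(unique(p)))

# Keep only candidates with both 0 and 1 somewhere in their history
mixed <- df1[n_states > 1, ]
print(unique(mixed$names))
```

In dplyr the equivalent condition inside filter would be n_distinct(program) > 1.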
