Creating Area Chart in R from xtabs Data Frame - r

I have a data frame that I created using xtabs.
My goal is to create an area graph (sand chart) from this data frame, but I'm not entirely sure how to declare the axes.
vg <- read.csv("vgdata.csv")
df <- data.frame(vg)
graph <- xtabs(Sales ~ Year + Genre, df)
print(graph)
Output:
      Genre
Year   Action RPG Shooter
  2005      3   2       2
  2006      1   1       3
  2007      3   3       4
  2008      1   5       8
  2009      4   7       7
  2010      4   5       2
Typically I would use Sales, Genre, Year, etc as variables of my graph, but these don't exist because of how it was created using xtabs. I simply have graph as a defined variable.
I would like to have the years on the x axis and the sales data on the y axis with the genre being labels. I'm hoping there is an easy way to do this with the format I already have. The reason I chose xtabs is because I had several video game titles under action, RPG, and shooter for each year and it was a convenient way to sum them to get a data frame of total sales per year.

Here are a couple of possible ways you can plot your results. For this example I'm using data from your last SO question.
Assuming you use xtabs as described:
result <- xtabs(Sales ~ Year + Genre, df)
You can convert to a data frame:
plot_data <- as.data.frame(result)
plot_data
  Year   Genre Freq
1 2005  Action    3
2 2006  Action    0
3 2007  Action    1
4 2005     RPG    0
5 2006     RPG    4
6 2007     RPG    2
7 2005 Shooter    3
8 2006 Shooter    0
9 2007 Shooter    3
For the area plot, you can make Year numeric instead of a factor for use on the x axis:
plot_data$Year <- as.numeric(as.character(plot_data$Year))
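The detour through as.character() matters because as.data.frame() on a table returns Year as a factor, and calling as.numeric() directly on a factor returns its internal level codes rather than the labels. A minimal sketch of the difference, using a stand-alone factor (the names year, codes, and years are illustrative only):

```r
# A factor like the Year column produced by as.data.frame(result)
year <- factor(c("2005", "2006", "2007"))

# Direct conversion returns the internal level codes
codes <- as.numeric(year)
print(codes)   # 1 2 3

# Converting to character first recovers the actual years
years <- as.numeric(as.character(year))
print(years)   # 2005 2006 2007
```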
Then plot with geom_area:
library(ggplot2)
ggplot(plot_data, aes(x = Year, y = Freq, fill = Genre)) +
geom_area() +
scale_x_continuous(breaks = 2005:2007)
Data
df <- structure(list(Name = 1:8, Year = c(2005L, 2005L, 2005L, 2006L,
2006L, 2007L, 2007L, 2007L), Genre = c("Action", "Action", "Shooter",
"RPG", "RPG", "Action", "Shooter", "RPG"), Sales = c(1L, 2L,
3L, 2L, 2L, 1L, 3L, 2L)), class = "data.frame", row.names = c(NA,
-8L))

Related

Return the 90th percentile values in R [duplicate]

For example, I have a dataset of 30-years air temperature of a city, the dataset looks like:
Year Julian_date temperature
1991 1 2.1
1991 2 2.2
... ... ...
1991 365 2.3
1992 1 2.1
... ... ...
1992 365 2.5
... ... ...
2020 366 2.5
I would like to calculate the 90th percentile value of each Julian date (from different years), and return the results, like:
Julian_date value(the 90th percentile)
1 2.4
2 2.6
... ...
365 2.5
How should I write the code in R?
You can first group by Julian_date, then use the quantile function to set the probability inside summarise.
library(tidyverse)
df %>%
group_by(Julian_date) %>%
summarise("value (the 90th percentile)" = quantile(temperature, probs=0.9, na.rm=TRUE))
Output
  Julian_date `value (the 90th percentile)`
        <int>                         <dbl>
1           1                           2.1
2           2                           2.2
3         365                           2.5
Data
df <- structure(list(Year = c(1991L, 1991L, 1991L, 1992L, 1992L, 2020L
), Julian_date = c(1L, 2L, 365L, 1L, 365L, 365L), temperature = c(2.1,
2.2, 2.3, 2.1, 2.5, 2.5)), class = "data.frame", row.names = c(NA,
-6L))
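If you would rather avoid loading the tidyverse, the same grouping can be sketched in base R with aggregate(). This is only an alternative under the same assumptions as above; the names temps and p90 are illustrative, and the data is the small sample from this answer.

```r
# Same sample data as above, under a different (illustrative) name
temps <- data.frame(
  Year        = c(1991L, 1991L, 1991L, 1992L, 1992L, 2020L),
  Julian_date = c(1L, 2L, 365L, 1L, 365L, 365L),
  temperature = c(2.1, 2.2, 2.3, 2.1, 2.5, 2.5)
)

# One row per Julian_date; extra arguments (probs, na.rm) are passed on to quantile()
p90 <- aggregate(temperature ~ Julian_date, data = temps,
                 FUN = quantile, probs = 0.9, na.rm = TRUE)
names(p90)[2] <- "value_90th_percentile"
print(p90)
```

The result has one row for each Julian date, pooled across all years.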
You can use the quantile() function. If "(from different years)" in your question means that each year should be calculated separately, then you need to group the data frame by Year and Julian_date. If instead it means that the different years are combined, you need to group the data frame only by Julian_date, as @AndrewGB and @benson23 showed.
library(dplyr)
yourdf %>% group_by(Year, Julian_date) %>%
summarise (value_90th_percentile = quantile(temperature, 0.9, na.rm = TRUE))

How to replace a column in R by a modified column, dependent on filtered values? (removing outliers in panel data)

I have a panel dataset that goes like this
year  id  treatment_year  time_to_treatment  outcome
2000   1            2011                -11        2
2002   1            2011                -10        3
2004   2            2015                 -9       22
and so on. I am trying to deal with the outliers by winsorizing. The end goal is to make a scatterplot with time_to_treatment on the X axis and outcome on the Y axis.
I would like to replace the outcomes for each time_to_treatment by its winsorized outcomes, i.e. replace all extreme values with the 5% and 95% quantile values.
So far what I have tried to do is this but it doesn't work.
for(i in range(dataset$time_to_treatment)){
dplyr::filter(dataset, time_to_treatment == i)$outcome <- DescTools::Winsorize(dplyr::filter(dataset,time_to_treatment==i)$outcome)
}
I get the error - Error in filter(dataset, time_to_treatment == i) <- *vtmp* :
could not find function "filter<-"
Would anyone be able to suggest a better way?
Thanks.
my actual data
where: conflicts = outcome, commission = year of treatment, CD_mun = id.
The concerned time period indicator is time_to_t
Groups: year, CD_MUN, type [6]
  type   CD_MUN  year time_to_t conflicts commission
  <chr>   <dbl> <dbl>     <dbl>     <int>      <dbl>
  manif 1100023  2000       -11         1       2011
  manif 1100189  2000        -3         2       2003
  manif 1100205  2000        -9         5       2009
  manif 1500602  2000        -4         1       2004
  manif 3111002  2000       -11         2       2011
  manif 3147006  2000       -10         1       2010
Assuming "time periods" refers to the 'commission' column, you may use ave.
transform(dat, conflicts_w=ave(conflicts, commission, FUN=DescTools::Winsorize))
# type CD_MUN year time_to_t conflicts commission conflicts_w
# 1 manif 1100023 2000 -11 1 2011 1.05
# 2 manif 1100189 2000 -3 2 2003 2.00
# 3 manif 1100205 2000 -9 5 2009 5.00
# 4 manif 1500602 2000 -4 1 2004 1.00
# 5 manif 3111002 2000 -11 2 2011 1.95
# 6 manif 3147006 2000 -10 1 2010 1.00
Data:
dat <- structure(list(type = c("manif", "manif", "manif", "manif", "manif",
"manif"), CD_MUN = c(1100023L, 1100189L, 1100205L, 1500602L,
3111002L, 3147006L), year = c(2000L, 2000L, 2000L, 2000L, 2000L,
2000L), time_to_t = c(-11L, -3L, -9L, -4L, -11L, -10L), conflicts = c(1L,
2L, 5L, 1L, 2L, 1L), commission = c(2011L, 2003L, 2009L, 2004L,
2011L, 2010L)), class = "data.frame", row.names = c(NA, -6L))
For a start you may use this:
# The data
set.seed(123)
df <- data.frame(
  time_to_treatment = seq(-15, 0, 1),
  outcome = sample(1:30, 16, replace = TRUE)
)
# A solution without Winsorize, based solely on dplyr
# (the quantiles here are computed over all rows, not per group)
library(dplyr)
df %>%
  mutate(outcome05 = quantile(outcome, probs = 0.05), # 5% quantile
         outcome95 = quantile(outcome, probs = 0.95), # 95% quantile
         outcome = ifelse(outcome <= outcome05, outcome05, outcome), # raise low values
         outcome = ifelse(outcome >= outcome95, outcome95, outcome)) %>% # cap high values
  select(-c(outcome05, outcome95))
You may adapt this to your exact problem.
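One possible adaptation is to winsorize within each time_to_treatment value, which is what the question asks for. The sketch below combines the ave() idea from the other answer with a small hand-written helper in place of DescTools::Winsorize. The helper winsorize() and the simulated dataset are assumptions made for illustration, not the asker's real data.

```r
# Hypothetical helper: clamp x to its own 5% and 95% quantiles
winsorize <- function(x, probs = c(0.05, 0.95)) {
  limits <- quantile(x, probs = probs, na.rm = TRUE)
  pmin(pmax(x, limits[[1]]), limits[[2]])
}

# Simulated panel with several outcomes per time_to_treatment value
set.seed(123)
dataset <- data.frame(
  time_to_treatment = rep(-3:0, each = 25),
  outcome = sample(1:100, 100, replace = TRUE)
)

# ave() applies the helper separately within each time_to_treatment group
dataset$outcome_w <- ave(dataset$outcome, dataset$time_to_treatment,
                         FUN = winsorize)
head(dataset)
```

This writes the result into a new column, so no replacement function such as "filter<-" is needed. Note also that range() in R returns only the minimum and maximum, so the loop in the question would have visited just two values of time_to_treatment.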

Checking data in R (whether the full range of values in a data frame column exists)

I have stumbled across a problem in checking data with R. I am fairly new to it and unfortunately, I have not managed to find a solution.
An example of my data frame (let's call it X) is as follows:
ID Year Month
1 2012 7
1 2012 8
1 2012 9
2 2012 10
1 2012 11
3 2012 12
What I want to do is check for each ID whether all the months from 1 until 12 are present. I have tried this code :
Dataset_check <- X %>% mutate(check=X$ID<- ifelse(sapply(X$ID, function(Month)
any(X$Month <=12 & X$Month >=1)), "YES", NA))
but it does not check whether ALL of the months are included but rather if any of the months (1 through 12) are there.
I am not sure which function to use, if not "any", to specify that I want to check whether all of them exist. Do you have any ideas? Am I on the right track at all, or should I look at it another way?
Thank you in advance.
Does this work:
library(dplyr)
df %>% group_by(ID) %>% mutate(check = if_else(all(1:12 %in% Month), 'Yes','No'))
# A tibble: 6 x 4
# Groups:   ID [3]
     ID  Year Month check
  <dbl> <dbl> <int> <chr>
1     1  2012     7 No
2     1  2012     8 No
3     1  2012     9 No
4     2  2012    10 No
5     1  2012    11 No
6     3  2012    12 No
We may use base R as well
df1$check <- with(df1, c("No", "Yes")[1 + ave(Month, ID,
FUN = function(x) all(1:12 %in% x))])
df1$check
[1] "No" "No" "No" "No" "No" "No"
data
df1 <- structure(list(ID = c(1L, 1L, 1L, 2L, 1L, 3L), Year = c(2012L,
2012L, 2012L, 2012L, 2012L, 2012L), Month = 7:12),
class = "data.frame", row.names = c(NA,
-6L))
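If it also helps to see which months are missing for each ID, not just whether the set is complete, setdiff() can be applied per ID. This is a minimal base R sketch on the same sample data; the names months_by_id, missing_months, and has_all_months are illustrative.

```r
df1 <- data.frame(ID = c(1L, 1L, 1L, 2L, 1L, 3L),
                  Year = 2012L,
                  Month = 7:12)

# Collect the observed months for each ID
months_by_id <- split(df1$Month, df1$ID)

# Months from 1 to 12 that never occur for that ID
missing_months <- lapply(months_by_id, function(m) setdiff(1:12, m))

# An ID is complete when nothing is missing
has_all_months <- lengths(missing_months) == 0

print(missing_months[["1"]])   # 1 2 3 4 5 6 10 12
print(has_all_months)          # FALSE for all three IDs
```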

Aggregating Data in R Data Frame

I have a csv file similar to below:
Name - Year - Genre - Sales
1 - 2005 - Action - 1
2 - 2005 - Action - 2
3 - 2005 - Shooter - 3
4 - 2006 - RPG - 2
5 - 2006 - RPG - 2
6 - 2007 - Action - 1
7 - 2007 - Shooter - 3
8 - 2007 - RPG - 2
...
My end goal is to make a sand chart in R that shows the total sales of each genre on the y axis and year on the x axis, with the labels being the genres.
I need to sum up the sales of each of the genres per year, for example 2005 sales would be Action:3, Shooter:3, RPG:0. And do this for every year.
This would eventually give me a data frame that looks like this:
     Action Shooter RPG
2005      3       3   0
2006      0       0   4
2007      1       3   2
In Python, I could do this using enumerate, but I'm having a hard time figuring this out in R.
Here's what I have so far
vg <- read.csv("vgdata.csv")
genres <- unique(vg$Genre)
years <- sort(unique(vg$Year))
genredf <-data.frame(vg$Genre)
i<-0
for (year in (unique(vg$Year))) {
yeardata = rep(0,length(genres))
}
This would give me the data frame with 0s in it. Now I'm trying to add in the summation of the data so I can chart it.
Sorry for the poor formatting. I'm still new to stack overflow.
We could use xtabs
xtabs(Sales ~ Year + Genre, df1)
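As a small worked sketch on the sample rows from the question: xtabs() sums Sales for every Year and Genre combination and fills combinations with no rows with 0. The conversion to a plain wide data frame with as.data.frame.matrix() is an optional extra step, and the name wide is illustrative.

```r
df1 <- data.frame(
  Name  = 1:8,
  Year  = c(2005L, 2005L, 2005L, 2006L, 2006L, 2007L, 2007L, 2007L),
  Genre = c("Action", "Action", "Shooter", "RPG", "RPG",
            "Action", "Shooter", "RPG"),
  Sales = c(1L, 2L, 3L, 2L, 2L, 1L, 3L, 2L)
)

# Cross-tabulate: rows are years, columns are genres, cells are summed sales
tab <- xtabs(Sales ~ Year + Genre, df1)
print(tab)

# Optional: a regular data frame with years as row names
wide <- as.data.frame.matrix(tab)
print(wide)
```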
Here is a base R solution using reshape + aggregate (though it is not as simple as @akrun's xtabs approach)
dfout <- reshape(aggregate(Sales~Year + Genre,df,sum),
direction = "wide",
idvar = "Year",
timevar = "Genre")
such that
> dfout
  Year Sales.Action Sales.RPG Sales.Shooter
1 2005            3        NA             3
2 2007            1         2             3
3 2006           NA         4            NA
DATA
df <- structure(list(Name = 1:8, Year = c(2005L, 2005L, 2005L, 2006L,
2006L, 2007L, 2007L, 2007L), Genre = c("Action", "Action", "Shooter",
"RPG", "RPG", "Action", "Shooter", "RPG"), Sales = c(1L, 2L,
3L, 2L, 2L, 1L, 3L, 2L)), class = "data.frame", row.names = c(NA,
-8L))
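The reshape() result above has NA where a genre had no sales in a year, and its rows are not in year order. Since the question expects zeros, one possible clean-up, sketched here on the same data, is to replace the NA values and then sort by Year:

```r
df <- data.frame(
  Name  = 1:8,
  Year  = c(2005L, 2005L, 2005L, 2006L, 2006L, 2007L, 2007L, 2007L),
  Genre = c("Action", "Action", "Shooter", "RPG", "RPG",
            "Action", "Shooter", "RPG"),
  Sales = c(1L, 2L, 3L, 2L, 2L, 1L, 3L, 2L)
)

dfout <- reshape(aggregate(Sales ~ Year + Genre, df, sum),
                 direction = "wide",
                 idvar = "Year",
                 timevar = "Genre")

# Treat missing Year/Genre combinations as zero sales
dfout[is.na(dfout)] <- 0

# Put the rows in chronological order
dfout <- dfout[order(dfout$Year), ]
print(dfout)
```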

using R to make calculations across certain rows

I’m a beginner with R and appreciate all the help on this website. But I have been unable to locate a solution to a little problem...
I have 3 columns of data: SchoolName, Year, SATScore
There are many different school names, and for each school name, there is a “Year” which ranges from 2001-2012. (ex., JFK high school has 12 years of SAT data).
For each high school, I need to calculate the difference between SAT score in 2012 and SAT score in 2001.
A pivot table in Excel does this in a few minutes, but I’d like to learn how to do it in R.
Thanks in advance,
Paul
The answer will depend on the format of your data. If it looks like this
dat <- structure(list(shool = c("a", "a", "a", "b", "b", "b", "c", "c",
"c"), year = c(2001L, 2004L, 2012L, 2001L, 2005L, 2012L, 2001L,
2007L, 2012L), sat = c(12L, 45L, 5L, 6L, 8L, 9L, 44L, 55L, 5L
)), .Names = c("shool", "year", "sat"), class = "data.frame", row.names = c(NA,
-9L))
>dat
# shool year sat
#1 a 2001 12
#2 a 2004 45
#3 a 2012 5
#4 b 2001 6
#5 b 2005 8
#6 b 2012 9
#7 c 2001 44
#8 c 2007 55
#9 c 2012 5
Then you can simply do:
dat$sat[dat$year == 2012] - dat$sat[dat$year == 2001]
If things are not ordered so nicely, I suggest:
library(plyr)
ddply(dat, .(shool), summarise,
difference = sat[year == 2012] - sat[year == 2001] )
# shool difference
# 1 a -7
# 2 b 3
# 3 c -39
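For readers who prefer not to depend on plyr, the same per-school calculation can be sketched in base R with split() and sapply(). This assumes, as above, that every school has exactly one row for 2001 and one for 2012; the name diffs is illustrative.

```r
dat <- data.frame(
  shool = c("a", "a", "a", "b", "b", "b", "c", "c", "c"),
  year  = c(2001L, 2004L, 2012L, 2001L, 2005L, 2012L, 2001L, 2007L, 2012L),
  sat   = c(12L, 45L, 5L, 6L, 8L, 9L, 44L, 55L, 5L)
)

# Split into one data frame per school, then take 2012 minus 2001 in each
diffs <- sapply(split(dat, dat$shool),
                function(d) d$sat[d$year == 2012] - d$sat[d$year == 2001])
print(diffs)
#   a   b   c
#  -7   3 -39
```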
I'm assuming your data is in a data frame called data. You can do the following:
data2001 <- data.frame(SchoolName = data[data$Year == 2001, ]$SchoolName, Score2001 = data[data$Year == 2001, ]$SATScore)
data2012 <- data.frame(SchoolName = data[data$Year == 2012, ]$SchoolName, Score2012 = data[data$Year == 2012, ]$SATScore)
stats <- merge(data2001, data2012)
stats$Difference <- stats$Score2012 - stats$Score2001
