I’m a beginner with R and appreciate all the help on this website. But I have been unable to locate a solution to a little problem...
I have 3 columns of data: SchoolName, Year, SATScore
There are many different school names, and for each school name there is a “Year” which ranges from 2001 to 2012 (e.g., JFK high school has 12 years of SAT data).
For each high school, I need to calculate the difference between SAT score in 2012 and SAT score in 2001.
A pivot table in Excel does this in a few minutes, but I’d like to learn how to do it in R.
Thanks in advance,
Paul
The answer will depend on the format of your data. If it looks like this
dat <- structure(list(shool = c("a", "a", "a", "b", "b", "b", "c", "c",
"c"), year = c(2001L, 2004L, 2012L, 2001L, 2005L, 2012L, 2001L,
2007L, 2012L), sat = c(12L, 45L, 5L, 6L, 8L, 9L, 44L, 55L, 5L
)), .Names = c("shool", "year", "sat"), class = "data.frame", row.names = c(NA,
-9L))
>dat
# shool year sat
#1 a 2001 12
#2 a 2004 45
#3 a 2012 5
#4 b 2001 6
#5 b 2005 8
#6 b 2012 9
#7 c 2001 44
#8 c 2007 55
#9 c 2012 5
Then you can simply do:
dat$sat[dat$year == 2012] - dat$sat[dat$year == 2001]
If things are not ordered so nicely, I suggest:
library(plyr)
ddply(dat, .(shool), summarise,
difference = sat[year == 2012] - sat[year == 2001] )
# shool difference
# 1 a -7
# 2 b 3
# 3 c -39
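If plyr is not available, the same per-school difference can be sketched in base R. This is only a minimal sketch on the sample data above, with the rows deliberately reversed to show that ordering does not matter. It assumes every school has exactly one row for 2001 and one for 2012.

```r
# Sample data from above (column name kept as in the answer), rows reversed
dat <- data.frame(
  shool = rep(c("a", "b", "c"), each = 3),
  year  = c(2001, 2004, 2012, 2001, 2005, 2012, 2001, 2007, 2012),
  sat   = c(12, 45, 5, 6, 8, 9, 44, 55, 5)
)
dat <- dat[rev(seq_len(nrow(dat))), ]

# Split by school, then subtract the 2001 score from the 2012 score
difference <- sapply(split(dat, dat$shool), function(d) {
  d$sat[d$year == 2012] - d$sat[d$year == 2001]
})
difference
```

The result is a named numeric vector with one element per school.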
I'm assuming your data is in a data frame called data. You can do the following:
# Column names follow the question: SchoolName, Year, SATScore
data2001 <- data.frame(SchoolName = data[data$Year == 2001, ]$SchoolName, Score2001 = data[data$Year == 2001, ]$SATScore)
data2012 <- data.frame(SchoolName = data[data$Year == 2012, ]$SchoolName, Score2012 = data[data$Year == 2012, ]$SATScore)
stats <- merge(data2001, data2012)
stats$Difference <- stats$Score2012 - stats$Score2001
I have a panel dataset that goes like this
year  id  treatment_year  time_to_treatment  outcome
2000   1            2011                -11        2
2002   1            2011                -10        3
2004   2            2015                 -9       22
and so on and so forth. I am trying to deal with the outliers by winsorizing them. The end goal is to make a scatterplot with time_to_treatment on the X axis and outcome on the Y axis.
I would like to replace the outcomes for each time_to_treatment by its winsorized outcomes, i.e. replace all extreme values with the 5% and 95% quantile values.
So far what I have tried to do is this but it doesn't work.
for(i in range(dataset$time_to_treatment)){
dplyr::filter(dataset, time_to_treatment == i)$outcome <- DescTools::Winsorize(dplyr::filter(dataset,time_to_treatment==i)$outcome)
}
I get the error - Error in filter(dataset, time_to_treatment == i) <- *vtmp* :
could not find function "filter<-"
Would anyone be able to suggest a better way?
Thanks.
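For reference, the loop above fails for two reasons: R cannot assign into the result of a function call such as filter(), and range() returns only the minimum and maximum, so the loop would visit just two values. A minimal base R sketch of a working loop follows. The toy data and the clamp() helper are assumptions made for illustration; clamp() stands in for DescTools::Winsorize with its default 5% and 95% quantiles.

```r
# Hypothetical toy data, not the real dataset
dataset <- data.frame(
  time_to_treatment = c(-2, -2, -2, -1, -1, -1),
  outcome           = c(1, 2, 100, 3, 4, 5)
)

# Hypothetical helper: pull values outside the 5%/95% quantiles back to them
clamp <- function(x, probs = c(0.05, 0.95)) {
  q <- quantile(x, probs = probs, names = FALSE)
  pmin(pmax(x, q[1]), q[2])
}

# Loop over every distinct period and assign through a logical row index
for (i in unique(dataset$time_to_treatment)) {
  rows <- dataset$time_to_treatment == i
  dataset$outcome[rows] <- clamp(dataset$outcome[rows])
}
dataset
```

Assigning through `dataset$outcome[rows]` modifies the data frame in place, which is what the `filter(...)$outcome <- ...` line was trying to do.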
My actual data, where conflicts = outcome, commission = year of treatment, and CD_MUN = id.
The relevant time period indicator is time_to_t.
Groups: year, CD_MUN, type [6]
type   CD_MUN   year  time_to_t  conflicts  commission
chr    dbl      dbl   dbl        int        dbl
manif  1100023  2000  -11        1          2011
manif  1100189  2000   -3        2          2003
manif  1100205  2000   -9        5          2009
manif  1500602  2000   -4        1          2004
manif  3111002  2000  -11        2          2011
manif  3147006  2000  -10        1          2010
Assuming "time periods" refers to the 'commission' column, you can use ave.
transform(dat, conflicts_w=ave(conflicts, commission, FUN=DescTools::Winsorize))
# type CD_MUN year time_to_t conflicts commission conflicts_w
# 1 manif 1100023 2000 -11 1 2011 1.05
# 2 manif 1100189 2000 -3 2 2003 2.00
# 3 manif 1100205 2000 -9 5 2009 5.00
# 4 manif 1500602 2000 -4 1 2004 1.00
# 5 manif 3111002 2000 -11 2 2011 1.95
# 6 manif 3147006 2000 -10 1 2010 1.00
Data:
dat <- structure(list(type = c("manif", "manif", "manif", "manif", "manif",
"manif"), CD_MUN = c(1100023L, 1100189L, 1100205L, 1500602L,
3111002L, 3147006L), year = c(2000L, 2000L, 2000L, 2000L, 2000L,
2000L), time_to_t = c(-11L, -3L, -9L, -4L, -11L, -10L), conflicts = c(1L,
2L, 5L, 1L, 2L, 1L), commission = c(2011L, 2003L, 2009L, 2004L,
2011L, 2010L)), class = "data.frame", row.names = c(NA, -6L))
For a start you may use this:
# The data
set.seed(123)
df <- data.frame(
time_to_treatment = seq(-15, 0, 1),
outcome = sample(1:30, 16, replace = TRUE)
)
# A solution without Winsorize based solely on dplyr
library(dplyr)
df %>%
mutate(outcome05 = quantile(outcome, probs = 0.05), # 5% quantile
outcome95 = quantile(outcome, probs = 0.95), # 95% quantile
outcome = ifelse(outcome <= outcome05, outcome05, outcome), # replace
outcome = ifelse(outcome >= outcome95, outcome95, outcome)) %>%
select(-c(outcome05, outcome95))
You may adapt this to your exact problem.
I have a csv file similar to below:
Name - Year - Genre - Sales
1 - 2005 - Action - 1
2 - 2005 - Action - 2
3 - 2005 - Shooter - 3
4 - 2006 - RPG - 2
5 - 2006 - RPG - 2
6 - 2007 - Action - 1
7 - 2007 - Shooter - 3
8 - 2007 - RPG - 2
...
My end goal is to make a sand chart in R that shows the total sales of each genre on the y axis and year on the x axis, with the labels being the genres.
I need to sum up the sales of each of the genres per year, for example 2005 sales would be Action:3, Shooter:3, RPG:0. And do this for every year.
This would eventually give me a data frame that looks like this:
Action Shooter RPG
2005 3 3 0
2006 0 0 4
2007 1 3 2
In Python, I could do this using enumerate, but I'm having a hard time figuring this out in R.
Here's what I have so far
vg <- read.csv("vgdata.csv")
genres <- unique(vg$Genre)
years <- sort(unique(vg$Year))
genredf <-data.frame(vg$Genre)
i<-0
for (year in (unique(vg$Year))) {
yeardata = rep(0,length(genres))
}
This would give me the data frame with 0s in it. Now I'm trying to add in the summation of the data so I can chart it.
Sorry for the poor formatting. I'm still new to stack overflow.
We could use xtabs
xtabs(Sales ~ Year + Genre, df1)
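As a minimal sketch of how that call behaves, the sample rows from the question can be rebuilt by hand (the name df1 and the typed-in values are assumptions based on the question's table). Year and genre combinations with no rows come out as 0, and as.data.frame.matrix() turns the table into the wide data frame the question asks for.

```r
# Sample rows from the question, typed in by hand
df1 <- data.frame(
  Year  = c(2005, 2005, 2005, 2006, 2006, 2007, 2007, 2007),
  Genre = c("Action", "Action", "Shooter", "RPG", "RPG",
            "Action", "Shooter", "RPG"),
  Sales = c(1, 2, 3, 2, 2, 1, 3, 2)
)

# Cross-tabulate total Sales by Year (rows) and Genre (columns)
tab <- xtabs(Sales ~ Year + Genre, data = df1)

# Convert the table to a plain data frame: one row per year, one column per genre
wide <- as.data.frame.matrix(tab)
wide
```

The columns come out in alphabetical order (Action, RPG, Shooter), with the years as row names.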
Here is a base R solution using reshape + aggregate (though it is not as simple as the xtabs approach by @akrun)
dfout <- reshape(aggregate(Sales~Year + Genre,df,sum),
direction = "wide",
idvar = "Year",
timevar = "Genre")
such that
> dfout
Year Sales.Action Sales.RPG Sales.Shooter
1 2005 3 NA 3
2 2007 1 2 3
3 2006 NA 4 NA
DATA
df <- structure(list(Name = 1:8, Year = c(2005L, 2005L, 2005L, 2006L,
2006L, 2007L, 2007L, 2007L), Genre = c("Action", "Action", "Shooter",
"RPG", "RPG", "Action", "Shooter", "RPG"), Sales = c(1L, 2L,
3L, 2L, 2L, 1L, 3L, 2L)), class = "data.frame", row.names = c(NA,
-8L))
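The reshape output above has NA where a genre had no sales in a year, and its rows follow the order in which years first appear. If zeros and sorted years are wanted, as in the question's target table, a possible follow-up step is sketched below; it simply repeats the call on the same data and then post-processes the result.

```r
df <- data.frame(
  Name  = 1:8,
  Year  = c(2005L, 2005L, 2005L, 2006L, 2006L, 2007L, 2007L, 2007L),
  Genre = c("Action", "Action", "Shooter", "RPG", "RPG",
            "Action", "Shooter", "RPG"),
  Sales = c(1L, 2L, 3L, 2L, 2L, 1L, 3L, 2L)
)

# Same aggregate + reshape call as in the answer
dfout <- reshape(aggregate(Sales ~ Year + Genre, df, sum),
                 direction = "wide",
                 idvar = "Year",
                 timevar = "Genre")

# Missing Year/Genre combinations mean no sales, so replace NA with 0
dfout[is.na(dfout)] <- 0

# Sort the rows by year
dfout <- dfout[order(dfout$Year), ]
dfout
```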
I am trying to melt/stack/gather multiple specific columns of a dataframe into 2 columns, retaining all the others.
I have tried many, many answers on stackoverflow without success (some below). I basically have a situation similar to this post here:
Reshaping multiple sets of measurement columns (wide format) into single columns (long format)
only many more columns to retain and combine. It is important to mention my year columns are factors and I have many, many more columns than the sample listed below so I want to call column names not positions.
>df
ID Code Country year.x value.x year.y value.y year.x.x value.x.x
1 A USA 2000 34.33422 2001 35.35241 2002 42.30042
1 A Spain 2000 34.71842 2001 39.82727 2002 43.22209
3 B USA 2000 35.98180 2001 37.70768 2002 44.40232
3 B Peru 2000 33.00000 2001 37.66468 2002 41.30232
4 C Argentina 2000 37.78005 2001 39.25627 2002 45.72927
4 C Peru 2000 40.52575 2001 40.55918 2002 46.62914
I tried using the pivot_longer in tidyr based on the post above which seemed very similar, which resulted in various errors depending on what I did:
pivot_longer(df,
cols = -c(ID, Code, Country),
names_to = c(".value", "group"),
names_sep = ".")
I also played with melt in reshape2 in various ways which either melted only the values columns or only the years columns. Such as:
new.df <- reshape2:::melt(df, id.var = c("ID", "Code", "Country"), measure.vars=c("value.x", "value.y", "value.x.x", "value.y.y", "value.x.x.x", "value.y.y.y"), value.name = "value", variable.vars=c('year.x','year.y', "year.x.x", "year.y.y", "year.x.x.x", "year.y.y.y", "value.x", variable.name = "year")
I also tried dplyr gather based on other posts but I find it extremely difficult to understand the help page and posts.
To be clear what I am looking to achieve:
ID Code Country   year value
1  A    USA       2000 34.33422
1  A    Spain     2000 34.71842
3  B    USA       2000 35.98180
3  B    Peru      2000 33.00000
4  C    Argentina 2000 37.78005
4  C    Peru      2000 40.52575
1  A    USA       2001 35.35241
1  A    Spain     2001 39.82727
3  B    USA       2001 37.70768
3  B    Peru      2001 37.66468
4  C    Argentina 2001 39.25627
4  C    Peru      2001 40.55918
1  A    USA       2002 42.30042
etc.
I really appreciate the help here.
We can specify the names_pattern
library(tidyr)
library(dplyr)
df %>%
pivot_longer(cols = -c(ID, Code, Country),
names_to = c(".value", "group"),names_pattern = "(.*)\\.(.*)")
Or use names_sep with an escaped dot, because according to ?pivot_longer:
names_sep - names_sep takes the same specification as separate(), and can either be a numeric vector (specifying positions to break on), or a single string (specifying a regular expression to split on).
This means a string is treated as a regular expression by default, where . matches any character rather than a literal dot. To match a literal dot, either escape it or place it inside square brackets:
pivot_longer(df,
cols = -c(ID, Code, Country),
names_to = c(".value", "group"),
names_sep = "\\.")
# A tibble: 18 x 6
# ID Code Country group year value
# <int> <chr> <chr> <chr> <int> <dbl>
# 1 1 A USA x 2000 34.3
# 2 1 A USA y 2001 35.4
# 3 1 A USA z 2002 42.3
# 4 1 A Spain x 2000 34.7
# 5 1 A Spain y 2001 39.8
# 6 1 A Spain z 2002 43.2
# 7 3 B USA x 2000 36.0
# 8 3 B USA y 2001 37.7
# 9 3 B USA z 2002 44.4
#10 3 B Peru x 2000 33
#11 3 B Peru y 2001 37.7
#12 3 B Peru z 2002 41.3
#13 4 C Argentina x 2000 37.8
#14 4 C Argentina y 2001 39.3
#15 4 C Argentina z 2002 45.7
#16 4 C Peru x 2000 40.5
#17 4 C Peru y 2001 40.6
#18 4 C Peru z 2002 46.6
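The difference between an unescaped and an escaped dot can be checked in base R with strsplit(), which also treats its split argument as a regular expression by default. This is only an illustration of the regex behaviour, not of pivot_longer itself.

```r
name <- "value.x"

# Unescaped: "." matches every character, so only empty strings are left
unescaped <- strsplit(name, ".")[[1]]

# Escaped or bracketed: only the literal dot is a separator
escaped   <- strsplit(name, "\\.")[[1]]
bracketed <- strsplit(name, "[.]")[[1]]

# fixed = TRUE switches regex matching off entirely
fixed     <- strsplit(name, ".", fixed = TRUE)[[1]]

escaped
```

The last three variants all split the name into "value" and "x".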
Update
For the updated dataset
library(stringr)
df2 %>%
rename_at(vars(matches("year|value")), ~
str_replace(., "^([^.]+\\.[^.]+)\\.([^.]+)$", "\\1\\2")) %>%
pivot_longer(cols = -c(ID, Code, Country),
names_to = c(".value", "group"),names_pattern = "(.*)\\.(.*)")
Or without the rename, use regex lookaround
df2 %>%
pivot_longer(cols = -c(ID, Code, Country),
names_to = c(".value", "group"),
names_sep = "(?<=year|value)\\.")
data
df <- structure(list(ID = c(1L, 1L, 3L, 3L, 4L, 4L), Code = c("A",
"A", "B", "B", "C", "C"), Country = c("USA", "Spain", "USA",
"Peru", "Argentina", "Peru"), year.x = c(2000L, 2000L, 2000L,
2000L, 2000L, 2000L), value.x = c(34.33422, 34.71842, 35.9818,
33, 37.78005, 40.52575), year.y = c(2001L, 2001L, 2001L, 2001L,
2001L, 2001L), value.y = c(35.35241, 39.82727, 37.70768, 37.66468,
39.25627, 40.55918), year.z = c(2002L, 2002L, 2002L, 2002L, 2002L,
2002L), value.z = c(42.30042, 43.22209, 44.40232, 41.30232, 45.72927,
46.62914)), class = "data.frame", row.names = c(NA, -6L))
df2 <- structure(list(ID = c(1L, 1L, 3L, 3L, 4L, 4L), Code = c("A",
"A", "B", "B", "C", "C"), Country = c("USA", "Spain", "USA",
"Peru", "Argentina", "Peru"), year.x = c(2000L, 2000L, 2000L,
2000L, 2000L, 2000L), value.x = c(34.33422, 34.71842, 35.9818,
33, 37.78005, 40.52575), year.y = c(2001L, 2001L, 2001L, 2001L,
2001L, 2001L), value.y = c(35.35241, 39.82727, 37.70768, 37.66468,
39.25627, 40.55918), year.x.x = c(2002L, 2002L, 2002L, 2002L,
2002L, 2002L), value.x.x = c(42.30042, 43.22209, 44.40232, 41.30232,
45.72927, 46.62914)), class = "data.frame", row.names = c(NA,
-6L))
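For comparison, a similar wide-to-long reshape can be sketched without tidyr using base reshape(), selecting the columns by name pattern rather than by position. The three-row data frame is a cut-down, hand-typed version of df2. The sketch assumes the year and value columns appear in matching order, and it numbers the groups instead of reusing the x/y suffixes.

```r
# Cut-down version of df2 (first three rows only)
df2 <- data.frame(
  ID        = c(1L, 1L, 3L),
  Code      = c("A", "A", "B"),
  Country   = c("USA", "Spain", "USA"),
  year.x    = 2000L, value.x   = c(34.33422, 34.71842, 35.9818),
  year.y    = 2001L, value.y   = c(35.35241, 39.82727, 37.70768),
  year.x.x  = 2002L, value.x.x = c(42.30042, 43.22209, 44.40232)
)

# Pick the measurement columns by name, not by position
year_cols  <- grep("^year",  names(df2), value = TRUE)
value_cols <- grep("^value", names(df2), value = TRUE)

# Stack each set of columns into a single "year" and a single "value" column
long <- reshape(df2,
                direction = "long",
                varying   = list(year_cols, value_cols),
                v.names   = c("year", "value"),
                timevar   = "group",
                times     = seq_along(year_cols),
                idvar     = c("ID", "Code", "Country"))
rownames(long) <- NULL
long
```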
I have some Revenue data in a format like that shown below. So the years are not sequential and they can also repeat (because of a different firm).
Firm Year Revenue
1 A 2018 100
2 B 2017 90
3 B 2018 80
4 C 2016 20
And I want to adjust the Revenue for inflation, by dividing through by the appropriate CPI for each year. The CPI data looks like this:
Year CPI
1 2016 98
2 2017 100
3 2018 101
I have a solution that works, but is this the best way to do it? Is it clumsy to mutate an entire calculating column in there?
revenue <- data.frame(stringsAsFactors=FALSE,
Firm = c("A", "B", "B", "C"),
Year = c(2018L, 2017L, 2018L, 2016L),
Revenue = c(100L, 90L, 80L, 20L)
)
cpi <- data.frame(
Year = c(2016L, 2017L, 2018L),
CPI = c(98L, 100L, 101L)
)
library(dplyr)
df <- left_join(revenue, cpi, by = 'Year')
mutate(df, real_revenue = (Revenue*100)/CPI)
The output is correct, shown below. But is this the best way to do it?
Firm Year Revenue CPI real_revenue
1 A 2018 100 101 99.00990
2 B 2017 90 100 90.00000
3 B 2018 80 101 79.20792
4 C 2016 20 98 20.40816
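The join-then-mutate approach above is a common pattern. One possible alternative, sketched here in base R, looks up each year's CPI with match() so that no CPI column has to be added to the revenue table. It assumes every Year in revenue also appears in cpi; a missing year would produce NA.

```r
revenue <- data.frame(
  Firm    = c("A", "B", "B", "C"),
  Year    = c(2018L, 2017L, 2018L, 2016L),
  Revenue = c(100L, 90L, 80L, 20L)
)
cpi <- data.frame(
  Year = c(2016L, 2017L, 2018L),
  CPI  = c(98L, 100L, 101L)
)

# Position of each revenue year within the CPI table
idx <- match(revenue$Year, cpi$Year)

# Deflate revenue by the matching CPI value
revenue$real_revenue <- revenue$Revenue * 100 / cpi$CPI[idx]
revenue
```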
I need to write a function in R to return the first date in a series for which the value of a column is greater than 0. I would like to identify that date for each year in the dataframe.
For example, given this example data...
Date Year Catch
3/12/2001 2001 0
3/19/2001 2001 7
3/24/2001 2001 9
4/6/2002 2002 12
4/9/2002 2002 0
4/15/2002 2002 5
4/27/2002 2002 0
3/18/2003 2003 0
3/22/2003 2003 0
3/27/2003 2003 15
I would like R to return the first date for each year with catch > 0
Year Date
2001 3/19/2001
2002 4/6/2002
2003 3/27/2003
I had been working with the min function below, but it only returns the line number and I was unable to return a value for each year in the dataframe. min(which(data$Catch > 0))
I'm new to writing my own functions in R. Any help would be appreciated. Thanks.
library(dplyr)
df1 %>%
group_by(Year) %>%
slice(which.max(Catch > 0))
# # A tibble: 3 x 3
# # Groups: Year [3]
# Date Year Catch
# <date> <int> <int>
# 1 2001-03-19 2001 7
# 2 2002-04-06 2002 12
# 3 2003-03-27 2003 15
Data:
df1 <-
structure(list(Date = structure(c(11393, 11400, 11405, 11783,
11786, 11792, 11804, 12129, 12133, 12138), class = "Date"), Year = c(2001L,
2001L, 2001L, 2002L, 2002L, 2002L, 2002L, 2003L, 2003L, 2003L
), Catch = c(0L, 7L, 9L, 12L, 0L, 5L, 0L, 0L, 0L, 15L)), .Names = c("Date",
"Year", "Catch"), row.names = c(NA, -10L), class = "data.frame")
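One caveat with which.max(Catch > 0): when a year has no positive catch at all, which.max() still returns 1, so the first row of that year is returned even though its catch is 0. The sketch below shows this on a small made-up data frame (the 2004 rows are hypothetical) and then filters first, which drops such years instead.

```r
# Hypothetical data: 2004 has no positive catch
df1 <- data.frame(
  Date  = as.Date(c("2001-03-12", "2001-03-19", "2004-03-20", "2004-03-25")),
  Year  = c(2001L, 2001L, 2004L, 2004L),
  Catch = c(0L, 7L, 0L, 0L)
)

# which.max() on an all-FALSE vector returns 1, not "no match"
idx <- tapply(df1$Catch > 0, df1$Year, which.max)

# Keep only positive catches, sort by date, then take the first row per year
pos <- df1[df1$Catch > 0, ]
pos <- pos[order(pos$Year, pos$Date), ]
first_catch <- pos[!duplicated(pos$Year), c("Year", "Date")]
first_catch
```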
Here is an option with data.table
library(data.table)
setDT(df1)[, .SD[which.max(Catch > 0)], Year]
# Year Date Catch
#1: 2001 2001-03-19 7
#2: 2002 2002-04-06 12
#3: 2003 2003-03-27 15
data
df1 <- structure(list(Date = structure(c(11393, 11400, 11405, 11783,
11786, 11792, 11804, 12129, 12133, 12138), class = "Date"), Year = c(2001L,
2001L, 2001L, 2002L, 2002L, 2002L, 2002L, 2003L, 2003L, 2003L
), Catch = c(0L, 7L, 9L, 12L, 0L, 5L, 0L, 0L, 0L, 15L)), row.names = c(NA,
-10L), class = "data.frame")
Here is a dplyr solution.
df1 %>%
group_by(Year) %>%
mutate(Inx = first(which(Catch > 0))) %>%
filter(Inx == row_number()) %>%
select(-Inx)
## A tibble: 3 x 3
## Groups: Year [3]
# Date Year Catch
# <date> <int> <int>
#1 2001-03-19 2001 7
#2 2002-04-06 2002 12
#3 2003-03-27 2003 15
Data.
df1 <- read.table(text = "
Date Year Catch
3/12/2001 2001 0
3/19/2001 2001 7
3/24/2001 2001 9
4/6/2002 2002 12
4/9/2002 2002 0
4/15/2002 2002 5
4/27/2002 2002 0
3/18/2003 2003 0
3/22/2003 2003 0
3/27/2003 2003 15
", header = TRUE)
df1$Date <- as.Date(df1$Date, "%m/%d/%Y")
df <- data.frame(Date = as.Date(c("3/12/2001", "3/19/2001", "3/24/2001",
"4/6/2002", "4/9/2002", "4/15/2002", "4/27/2002",
"3/18/2003", "3/22/2003", "3/27/2003"), "%m/%d/%Y"),
Year = c(2001, 2001, 2001, 2002, 2002, 2002, 2002, 2003, 2003, 2003),
Catch = c(0, 7, 9, 12, 0, 5, 0, 0, 0, 15))
If you do not need a function, you can try
library(dplyr)
df %>% filter(Catch > 0) %>% group_by(Year) %>% summarize(date = min(Date))
If you specifically want to write a function, perhaps
firstcatch <- function(yr) {
dd <- subset(df, yr == Year)
withcatches <- dd[which(dd$Catch > 0), ]
min(as.character(withcatches$Date))
}
yrs <- c(2001, 2002, 2003)
dates <- unlist(lapply(yrs, firstcatch))
ndt <- data.frame(Year = yrs, Date = dates)
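The function above returns the dates as character strings. A variant that takes the data frame as an argument and keeps the Date class could look like the sketch below. The name first_catch_dates and the threshold argument are assumptions made for illustration, and the data are the first two years of the example.

```r
df <- data.frame(
  Date  = as.Date(c("3/12/2001", "3/19/2001", "3/24/2001",
                    "4/6/2002", "4/9/2002", "4/15/2002"), "%m/%d/%Y"),
  Year  = c(2001, 2001, 2001, 2002, 2002, 2002),
  Catch = c(0, 7, 9, 12, 0, 5)
)

# Hypothetical helper: earliest date per year with Catch above a threshold
first_catch_dates <- function(data, threshold = 0) {
  hits <- data[data$Catch > threshold, ]
  # For each year, keep the row with the earliest date
  per_year <- lapply(split(hits, hits$Year), function(d) {
    d[which.min(d$Date), c("Year", "Date")]
  })
  out <- do.call(rbind, per_year)
  rownames(out) <- NULL
  out
}

res <- first_catch_dates(df)
res
```

Years without any catch above the threshold simply do not appear in the result.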
You can try something like this:
df <- data %>%
group_by(Year) %>%
mutate(newCol=Date[Catch>0][1]) %>%
distinct(Year, newCol)