I need to write a function in R to return the first date in a series for which the value of a column is greater than 0. I would like to identify that date for each year in the dataframe.
For example, given this example data...
Date Year Catch
3/12/2001 2001 0
3/19/2001 2001 7
3/24/2001 2001 9
4/6/2002 2002 12
4/9/2002 2002 0
4/15/2002 2002 5
4/27/2002 2002 0
3/18/2003 2003 0
3/22/2003 2003 0
3/27/2003 2003 15
I would like R to return the first date for each year with catch > 0
Year Date
2001 3/19/2001
2002 4/6/2002
2003 3/27/2003
I had been working with the min function below, but it only returns the row number, and I was unable to return a value for each year in the dataframe.
min(which(data$Catch > 0))
I'm new to writing my own functions in R. Any help would be appreciated. Thanks.
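For reference, a minimal base-R sketch of that idea (assuming your data frame is called data, as in your snippet, and that Date is stored as text in m/d/Y format):
# Parse the dates, keep only rows with a positive catch, sort by date,
# and take the first row seen for each year.
data$Date <- as.Date(data$Date, format = "%m/%d/%Y")
pos <- data[data$Catch > 0, ]
pos <- pos[order(pos$Date), ]
pos[!duplicated(pos$Year), c("Year", "Date")]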
library(dplyr)
df1 %>%
group_by(Year) %>%
slice(which.max(Catch > 0))
# # A tibble: 3 x 3
# # Groups: Year [3]
# Date Year Catch
# <date> <int> <int>
# 1 2001-03-19 2001 7
# 2 2002-04-06 2002 12
# 3 2003-03-27 2003 15
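which.max() on a logical vector returns the position of its first TRUE, which is why this picks the first positive-catch row in each year. One caveat: a year with no positive catch would still contribute its first row. A sketch that simply drops such years instead:
df1 %>%
  group_by(Year) %>%
  filter(Catch > 0) %>%       # a year with no positive catch drops out entirely
  slice_min(Date, n = 1)      # earliest remaining date per year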
Data:
df1 <-
structure(list(Date = structure(c(11393, 11400, 11405, 11783,
11786, 11792, 11804, 12129, 12133, 12138), class = "Date"), Year = c(2001L,
2001L, 2001L, 2002L, 2002L, 2002L, 2002L, 2003L, 2003L, 2003L
), Catch = c(0L, 7L, 9L, 12L, 0L, 5L, 0L, 0L, 0L, 15L)), .Names = c("Date",
"Year", "Catch"), row.names = c(NA, -10L), class = "data.frame")
Here is an option with data.table
library(data.table)
setDT(df1)[, .SD[which.max(Catch > 0)], Year]
# Year Date Catch
#1: 2001 2001-03-19 7
#2: 2002 2002-04-06 12
#3: 2003 2003-03-27 15
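An equivalent data.table sketch that filters first and then keeps the first remaining row per year (assuming rows are already in date order within each year, as in the example):
setDT(df1)[Catch > 0, head(.SD, 1), by = Year]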
data
df1 <- structure(list(Date = structure(c(11393, 11400, 11405, 11783,
11786, 11792, 11804, 12129, 12133, 12138), class = "Date"), Year = c(2001L,
2001L, 2001L, 2002L, 2002L, 2002L, 2002L, 2003L, 2003L, 2003L
), Catch = c(0L, 7L, 9L, 12L, 0L, 5L, 0L, 0L, 0L, 15L)), row.names = c(NA,
-10L), class = "data.frame")
Here is a dplyr solution.
df1 %>%
group_by(Year) %>%
mutate(Inx = first(which(Catch > 0))) %>%
filter(Inx == row_number()) %>%
select(-Inx)
## A tibble: 3 x 3
## Groups: Year [3]
# Date Year Catch
# <date> <int> <int>
#1 2001-03-19 2001 7
#2 2002-04-06 2002 12
#3 2003-03-27 2003 15
Data.
df1 <- read.table(text = "
Date Year Catch
3/12/2001 2001 0
3/19/2001 2001 7
3/24/2001 2001 9
4/6/2002 2002 12
4/9/2002 2002 0
4/15/2002 2002 5
4/27/2002 2002 0
3/18/2003 2003 0
3/22/2003 2003 0
3/27/2003 2003 15
", header = TRUE)
df1$Date <- as.Date(df1$Date, "%m/%d/%Y")
df <- data.frame(Date = as.Date(c("3/12/2001", "3/19/2001", "3/24/2001",
"4/6/2002", "4/9/2002", "4/15/2002", "4/27/2002",
"3/18/2003", "3/22/2003", "3/27/2003"), "%m/%d/%Y"),
Year = c(2001, 2001, 2001, 2002, 2002, 2002, 2002, 2003, 2003, 2003),
Catch = c(0, 7, 9, 12, 0, 5, 0, 0, 0, 15))
If you do not need a function, you can try
library(dplyr)
df %>% group_by(Date) %>% filter(Catch > 0 ) %>% group_by(Year) %>% summarize(date = min(Date))
If you specifically want to write a function, perhaps:
firstcatch <- function(yr) {
dd <- subset(df, yr == Year)
withcatches <- dd[which(dd$Catch > 0), ]
min(as.character(withcatches$Date))
}
yrs <- c(2001, 2002, 2003)
dates <- unlist(lapply(yrs, firstcatch))
ndt <- data.frame(Year = yrs, Date = dates)
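A small variant of the same function that keeps the Date class instead of comparing character strings (a sketch; it assumes the df defined above and at least one positive catch per year):
firstcatch2 <- function(yr) {
  dd <- df[df$Year == yr & df$Catch > 0, ]
  min(dd$Date)                              # min() on a Date column returns a Date
}
yrs <- sort(unique(df$Year))
ndt2 <- data.frame(Year = yrs,
                   Date = do.call(c, lapply(yrs, firstcatch2)))  # c() keeps the Date class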
You can try something like this:
df <- data %>%
group_by(Year) %>%
mutate(newCol=Date[Catch>0][1]) %>%
distinct(Year, newCol)
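Equivalently, the grouped first date can be computed in a single summarize step (a sketch; it assumes every year has at least one positive catch):
data %>%
  group_by(Year) %>%
  summarize(Date = min(Date[Catch > 0]))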
I have a panel dataset that goes like this
year  id  treatment_year  time_to_treatment  outcome
2000   1            2011                -11        2
2002   1            2011                -10        3
2004   2            2015                 -9       22
and so on. I am trying to deal with outliers by winsorizing. The end goal is to make a scatterplot with time_to_treatment on the X axis and outcome on the Y axis.
I would like to replace the outcome values for each time_to_treatment with their winsorized counterparts, i.e. replace all extreme values with the 5% and 95% quantile values.
So far what I have tried to do is this but it doesn't work.
for(i in range(dataset$time_to_treatment)){
dplyr::filter(dataset, time_to_treatment == i)$outcome <- DescTools::Winsorize(dplyr::filter(dataset,time_to_treatment==i)$outcome)
}
I get the error:
Error in filter(dataset, time_to_treatment == i) <- *vtmp* :
could not find function "filter<-"
Would anyone be able to suggest a better way?
Thanks.
Here is my actual data, where conflicts = outcome, commission = year of treatment, and CD_MUN = id. The time period indicator of interest is time_to_t.
# Groups: year, CD_MUN, type [6]
  type   CD_MUN  year time_to_t conflicts commission
  <chr>   <dbl> <dbl>     <dbl>     <int>      <dbl>
1 manif 1100023  2000       -11         1       2011
2 manif 1100189  2000        -3         2       2003
3 manif 1100205  2000        -9         5       2009
4 manif 1500602  2000        -4         1       2004
5 manif 3111002  2000       -11         2       2011
6 manif 3147006  2000       -10         1       2010
Assuming, "time periods" refer to 'commission' column, you may use ave.
transform(dat, conflicts_w=ave(conflicts, commission, FUN=DescTools::Winsorize))
# type CD_MUN year time_to_t conflicts commission conflicts_w
# 1 manif 1100023 2000 -11 1 2011 1.05
# 2 manif 1100189 2000 -3 2 2003 2.00
# 3 manif 1100205 2000 -9 5 2009 5.00
# 4 manif 1500602 2000 -4 1 2004 1.00
# 5 manif 3111002 2000 -11 2 2011 1.95
# 6 manif 3147006 2000 -10 1 2010 1.00
Data:
dat <- structure(list(type = c("manif", "manif", "manif", "manif", "manif",
"manif"), CD_MUN = c(1100023L, 1100189L, 1100205L, 1500602L,
3111002L, 3147006L), year = c(2000L, 2000L, 2000L, 2000L, 2000L,
2000L), time_to_t = c(-11L, -3L, -9L, -4L, -11L, -10L), conflicts = c(1L,
2L, 5L, 1L, 2L, 1L), commission = c(2011L, 2003L, 2009L, 2004L,
2011L, 2010L)), class = "data.frame", row.names = c(NA, -6L))
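The same grouped winsorization can also be written with dplyr, as a sketch (grouping by commission as above; swap in time_to_t if that is the grouping you actually want):
library(dplyr)
dat %>%
  group_by(commission) %>%
  mutate(conflicts_w = DescTools::Winsorize(conflicts)) %>%
  ungroup()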
For a start you may use this:
# The data
set.seed(123)
df <- data.frame(
time_to_treatment = seq(-15, 0, 1),
outcome = sample(1:30, 16, replace=T)
)
# A solution without Winsorize based solely on dplyr
library(dplyr)
df %>%
mutate(outcome05 = quantile(outcome, probs = 0.05), # 5% quantile
outcome95 = quantile(outcome, probs = 0.95), # 95% quantile
outcome = ifelse(outcome <= outcome05, outcome05, outcome), # replace
outcome = ifelse(outcome >= outcome95, outcome95, outcome)) %>%
select(-c(outcome05, outcome95))
You may adapt this to your exact problem.
I've posted this as another question, but realised I've got my sample data wrong.
I've got two separate datasets. df1 looks like this:
loc_ID year observations
nin212 2002 90
nin212 2003 98
nin212 2004 102
cha670 2001 18
cha670 2002 19
cha670 2003 21
df2 looks like this:
loc_ID start_year end_year
nin212 2002 2003
nin212 2003 2004
cha670 2001 2002
cha670 2002 2003
I want to calculate the number of observations in the time intervals (start_year to end_year) per loc_ID. In the example above, I would like to achieve this final dataset:
loc_ID start_year end_year observations
nin212 2002 2003 188
nin212 2003 2004 200
cha670 2001 2002 37
cha670 2002 2003 40
How could I do this?
We can do a non-equi join: match each (start_year, end_year) interval in df2 against the year column of df1 within loc_ID, then sum observations per interval with by = .EACHI.
library(data.table)
setDT(df2)[, observations := setDT(df1)[df2, sum(observations),
on = .(loc_ID, year >= start_year, year <= end_year),
by = .EACHI]$V1]
Output:
df2
# loc_ID start_year end_year observations
#1: nin212 2002 2003 188
#2: nin212 2003 2004 200
#3: cha670 2001 2002 37
#4: cha670 2002 2003 40
data
df1 <- structure(list(loc_ID = c("nin212", "nin212", "nin212", "cha670",
"cha670", "cha670"), year = c(2002L, 2003L, 2004L, 2001L, 2002L,
2003L), observations = c(90L, 98L, 102L, 18L, 19L, 21L)),
class = "data.frame", row.names = c(NA,
-6L))
df2 <- structure(list(loc_ID = c("nin212", "nin212", "cha670", "cha670"
), start_year = c(2002L, 2003L, 2001L, 2002L), end_year = c(2003L,
2004L, 2002L, 2003L)), class = "data.frame", row.names = c(NA,
-4L))
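The same interval aggregation can also be done without data.table; for example, a base-R sketch using mapply over the rows of df2 (assuming the df1 and df2 shown above):
df2$observations <- mapply(function(id, s, e) {
  sum(df1$observations[df1$loc_ID == id & df1$year >= s & df1$year <= e])
}, df2$loc_ID, df2$start_year, df2$end_year)
df2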
I am trying to melt/stack/gather multiple specific columns of a dataframe into 2 columns, retaining all the others.
I have tried many, many answers on stackoverflow without success (some below). I basically have a situation similar to this post here:
Reshaping multiple sets of measurement columns (wide format) into single columns (long format)
only with many more columns to retain and combine. It is important to mention that my year columns are factors, and I have many more columns than in the sample listed below, so I want to refer to columns by name rather than by position.
>df
ID Code Country year.x value.x year.y value.y year.x.x value.x.x
1 A USA 2000 34.33422 2001 35.35241 2002 42.30042
1 A Spain 2000 34.71842 2001 39.82727 2002 43.22209
3 B USA 2000 35.98180 2001 37.70768 2002 44.40232
3 B Peru 2000 33.00000 2001 37.66468 2002 41.30232
4 C Argentina 2000 37.78005 2001 39.25627 2002 45.72927
4 C Peru 2000 40.52575 2001 40.55918 2002 46.62914
I tried using pivot_longer in tidyr based on the post above, which seemed very similar, but it resulted in various errors depending on what I did:
pivot_longer(df,
cols = -c(ID, Code, Country),
names_to = c(".value", "group"),
names_sep = ".")
I also played with melt in reshape2 in various ways, which either melted only the value columns or only the year columns. For example:
new.df <- reshape2:::melt(df, id.var = c("ID", "Code", "Country"), measure.vars=c("value.x", "value.y", "value.x.x", "value.y.y", "value.x.x.x", "value.y.y.y"), value.name = "value", variable.vars=c('year.x','year.y', "year.x.x", "year.y.y", "year.x.x.x", "year.y.y.y", "value.x", variable.name = "year")
I also tried gather (from tidyr) based on other posts, but I find it extremely difficult to understand the help page and the posts.
To be clear what I am looking to achieve:
ID Code Country year value
1 A USA 2000 34.33422
1 A Spain 2000 34.71842
3 B USA 2000 35.98180
3 B Peru 2000 33.00000
4 C Argentina 2000 37.78005
4 C Peru 2000 40.52575
1 A USA 2001 35.35241
1 A Spain 2001 39.82727
3 B USA 2001 37.70768
3 B Peru 2001 37.66468
4 C Argentina 2001 39.25627
4 C Peru 2001 40.55918
1 A USA 2002 42.30042
etc.
I really appreciate the help here.
We can specify the names_pattern
library(tidyr)
library(dplyr)
df %>%
pivot_longer(cols = -c(ID, Code, Country),
names_to = c(".value", "group"),names_pattern = "(.*)\\.(.*)")
Or use names_sep with an escaped ., since according to ?pivot_longer:
names_sep - names_sep takes the same specification as separate(), and can either be a numeric vector (specifying positions to break on), or a single string (specifying a regular expression to split on).
which implies that it is treated as a regular expression, where . matches any character rather than a literal dot. To match the literal dot, either escape it or place it inside square brackets:
pivot_longer(df,
cols = -c(ID, Code, Country),
names_to = c(".value", "group"),
names_sep = "\\.")
# A tibble: 18 x 6
# ID Code Country group year value
# <int> <chr> <chr> <chr> <int> <dbl>
# 1 1 A USA x 2000 34.3
# 2 1 A USA y 2001 35.4
# 3 1 A USA z 2002 42.3
# 4 1 A Spain x 2000 34.7
# 5 1 A Spain y 2001 39.8
# 6 1 A Spain z 2002 43.2
# 7 3 B USA x 2000 36.0
# 8 3 B USA y 2001 37.7
# 9 3 B USA z 2002 44.4
#10 3 B Peru x 2000 33
#11 3 B Peru y 2001 37.7
#12 3 B Peru z 2002 41.3
#13 4 C Argentina x 2000 37.8
#14 4 C Argentina y 2001 39.3
#15 4 C Argentina z 2002 45.7
#16 4 C Peru x 2000 40.5
#17 4 C Peru y 2001 40.6
#18 4 C Peru z 2002 46.6
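The group helper column is not part of the desired output; it can be dropped and the rows ordered by year afterwards (a sketch continuing from the call above):
pivot_longer(df,
    cols = -c(ID, Code, Country),
    names_to = c(".value", "group"),
    names_sep = "\\.") %>%
  select(-group) %>%
  arrange(year)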
Update
For the updated dataset
library(stringr)
df2 %>%
rename_at(vars(matches("year|value")), ~
str_replace(., "^([^.]+\\.[^.]+)\\.([^.]+)$", "\\1\\2")) %>%
pivot_longer(cols = -c(ID, Code, Country),
names_to = c(".value", "group"),names_pattern = "(.*)\\.(.*)")
Or, without the rename, use a regex lookbehind:
df2 %>%
pivot_longer(cols = -c(ID, Code, Country),
names_to = c(".value", "group"),
names_sep = "(?<=year|value)\\.")
data
df <- structure(list(ID = c(1L, 1L, 3L, 3L, 4L, 4L), Code = c("A",
"A", "B", "B", "C", "C"), Country = c("USA", "Spain", "USA",
"Peru", "Argentina", "Peru"), year.x = c(2000L, 2000L, 2000L,
2000L, 2000L, 2000L), value.x = c(34.33422, 34.71842, 35.9818,
33, 37.78005, 40.52575), year.y = c(2001L, 2001L, 2001L, 2001L,
2001L, 2001L), value.y = c(35.35241, 39.82727, 37.70768, 37.66468,
39.25627, 40.55918), year.z = c(2002L, 2002L, 2002L, 2002L, 2002L,
2002L), value.z = c(42.30042, 43.22209, 44.40232, 41.30232, 45.72927,
46.62914)), class = "data.frame", row.names = c(NA, -6L))
df2 <- structure(list(ID = c(1L, 1L, 3L, 3L, 4L, 4L), Code = c("A",
"A", "B", "B", "C", "C"), Country = c("USA", "Spain", "USA",
"Peru", "Argentina", "Peru"), year.x = c(2000L, 2000L, 2000L,
2000L, 2000L, 2000L), value.x = c(34.33422, 34.71842, 35.9818,
33, 37.78005, 40.52575), year.y = c(2001L, 2001L, 2001L, 2001L,
2001L, 2001L), value.y = c(35.35241, 39.82727, 37.70768, 37.66468,
39.25627, 40.55918), year.x.x = c(2002L, 2002L, 2002L, 2002L,
2002L, 2002L), value.x.x = c(42.30042, 43.22209, 44.40232, 41.30232,
45.72927, 46.62914)), class = "data.frame", row.names = c(NA,
-6L))
This is the library I am using for creating dummies:
install.packages("fastDummies")
library(fastDummies)
This is the dataset
winners <- data.frame(
city = c("SaoPaulito", "NewAmsterdam", "BeatifulCow"),
year = c(1990, 2000, 1990),
crime = 1:3)
Let's then create dummies out of these cities:
dummy_cols(winners, select_columns = c("city"))
The results are
city year crime city_SaoPaulito city_NewAmsterdam city_BeatifulCow
1 SaoPaulito 1990 1 1 0 0
2 NewAmsterdam 2000 2 0 1 0
3 BeatifulCow 1990 3 0 0 1
So the question is: how do I get back to the previous dataset? Any ideas?
Thanks in advance!
We can use dcast
library(data.table)
dcast(setDT(winners), crime ~ city, length)
If we need to get back the original input, it would be
subset(df1, select = 1:3)
# city year crime
#1 SaoPaulito 1990 1
#2 NewAmsterdam 2000 2
#3 BeatifulCow 1990 3
Or with melt
melt(setDT(df1), measure = patterns("_"))[value == 1, .(city, year, crime)]
# city year crime
#1: SaoPaulito 1990 1
#2: NewAmsterdam 2000 2
#3: BeatifulCow 1990 3
data
df1 <- structure(list(city = c("SaoPaulito", "NewAmsterdam", "BeatifulCow"
), year = c(1990L, 2000L, 1990L), crime = 1:3, city_SaoPaulito = c(1L,
0L, 0L), city_NewAmsterdam = c(0L, 1L, 0L), city_BeatifulCow = c(0L,
0L, 1L)), class = "data.frame", row.names = c("1", "2", "3"))
If you are going to have only one city as 1 in each row, you can just skip the dummy columns
df[, 1:3]
# city year crime
#1 SaoPaulito 1990 1
#2 NewAmsterdam 2000 2
#3 BeatifulCow 1990 3
If you can have multiple cities, one way using dplyr and tidyr::gather is:
library(dplyr)
df %>%
tidyr::gather(key, value, starts_with("city_")) %>%
filter(value == 1) %>%
select(-value, -key)
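gather() has since been superseded by pivot_longer() in tidyr; a sketch of the same reversal with pivot_longer(), assuming the same df as above:
library(dplyr)
library(tidyr)
df %>%
  pivot_longer(starts_with("city_"), names_to = "key", values_to = "value") %>%
  filter(value == 1) %>%
  select(-key, -value)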
I’m a beginner with R and appreciate all the help on this website. But I have been unable to locate a solution to a little problem...
I have 3 columns of data: SchoolName, Year, SATScore
There are many different school names, and for each school name there is a "Year" that ranges from 2001-2012 (e.g., JFK High School has 12 years of SAT data).
For each high school, I need to calculate the difference between SAT score in 2012 and SAT score in 2001.
A pivot table in Excel does this in a few minutes, but I’d like to learn how to do it in R.
Thanks in advance,
Paul
The answer will depend on the format of your data. If it looks like this
dat <- structure(list(school = c("a", "a", "a", "b", "b", "b", "c", "c",
"c"), year = c(2001L, 2004L, 2012L, 2001L, 2005L, 2012L, 2001L,
2007L, 2012L), sat = c(12L, 45L, 5L, 6L, 8L, 9L, 44L, 55L, 5L
)), .Names = c("school", "year", "sat"), class = "data.frame", row.names = c(NA,
-9L))
> dat
# school year sat
#1 a 2001 12
#2 a 2004 45
#3 a 2012 5
#4 b 2001 6
#5 b 2005 8
#6 b 2012 9
#7 c 2001 44
#8 c 2007 55
#9 c 2012 5
Then you can simply do:
dat$sat[dat$year == 2012] - dat$sat[dat$year == 2001]
If things are not ordered so nicely, I suggest:
library(plyr)
ddply(dat, .(school), summarise,
difference = sat[year == 2012] - sat[year == 2001] )
# school difference
# 1 a -7
# 2 b 3
# 3 c -39
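A dplyr sketch of the same per-school difference, using the dat shown above:
library(dplyr)
dat %>%
  group_by(school) %>%
  summarise(difference = sat[year == 2012] - sat[year == 2001])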
I'm assuming your data is in a data frame called data. You can do the following:
data2001 <- data.frame(SchoolName = data[data$Year == 2001, ]$SchoolName,
                       Score2001 = data[data$Year == 2001, ]$SATScore)
data2012 <- data.frame(SchoolName = data[data$Year == 2012, ]$SchoolName,
                       Score2012 = data[data$Year == 2012, ]$SATScore)
stats <- merge(data2001, data2012)
stats$Difference <- stats$Score2012 - stats$Score2001
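The same merge-and-subtract idea can also be expressed with tidyr's pivot_wider (a sketch; it assumes the SchoolName, Year, and SATScore columns described in the question):
library(dplyr)
library(tidyr)
data %>%
  filter(Year %in% c(2001, 2012)) %>%
  pivot_wider(id_cols = SchoolName, names_from = Year,
              values_from = SATScore, names_prefix = "Score") %>%
  mutate(Difference = Score2012 - Score2001)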