I am working with the R programming language.
Suppose I have the following dataset of student grades:
my_data = data.frame(id = c(1,1,1,1,1,2,2,2,3,3,3,3), year = c(2010,2011,2012,2013, 2014, 2008, 2009, 2010, 2018, 2019, 2020, 2021), grade = c(55, 56, 61, 61, 62, 90,89,89, 67, 87, 51, 65))
> my_data
id year grade
1 1 2010 55
2 1 2011 56
3 1 2012 61
4 1 2013 61
5 1 2014 62
6 2 2008 90
7 2 2009 89
8 2 2010 89
9 3 2018 67
10 3 2019 87
11 3 2020 51
12 3 2021 65
My Question: I want to find out which students improved their grades (or kept the same grade) from year to year, and which students got worse grades from year to year.
Using the idea of "grouped window functions", I tried to write the following functions:
check_grades_improvement <- function(grades){
  for(i in 2:length(grades)){
    if(grades[i] < grades[i-1]){
      return(FALSE)
    }
  }
  return(TRUE)
}
check_grades_decline <- function(grades){
  for(i in 2:length(grades)){
    if(grades[i] > grades[i-1]){
      return(FALSE)
    }
  }
  return(TRUE)
}
Then, I tried to apply these functions to my dataset:
improving_students <- my_data %>%
  group_by(id) %>%
  filter(check_grades_improvement(grade)) %>%
  select(id) %>%
  unique()
worse_students <- my_data %>%
  group_by(id) %>%
  filter(check_grades_decline(grade)) %>%
  select(id) %>%
  unique()
But I am getting empty results.
Can someone please show me what I am doing wrong and how I can fix this?
Thanks!
Something like this:
library(dplyr)
my_data %>%
  group_by(id) %>%
  mutate(x = grade - lag(grade, default = grade[1])) %>%
  mutate(performance = case_when(x == 0 ~ "kept_same",
                                 x > 0 ~ "improved",
                                 x < 0 ~ "got_worse",
                                 TRUE ~ NA_character_), .keep = "unused")
id year grade performance
<dbl> <dbl> <dbl> <chr>
1 1 2010 55 kept_same
2 1 2011 56 improved
3 1 2012 61 improved
4 1 2013 61 kept_same
5 1 2014 62 improved
6 2 2008 90 kept_same
7 2 2009 89 got_worse
8 2 2010 89 kept_same
9 3 2018 67 kept_same
10 3 2019 87 improved
11 3 2020 51 got_worse
12 3 2021 65 improved
If we want to break out of the loop at the first instance:
check_grades_improvement <- function(grades){
  for(i in 2:length(grades)){
    if(grades[i] < grades[i-1]){
      return(FALSE)
      break
    }
  }
  return(TRUE)
}
check_grades_decline <- function(grades){
  for(i in 2:length(grades)){
    if(grades[i] > grades[i-1]){
      return(FALSE)
      break
    }
  }
  return(TRUE)
}
Testing:
my_data %>%
  group_by(id) %>%
  filter(check_grades_improvement(grade)) %>%
  ungroup() %>%
  select(id) %>%
  unique()
# A tibble: 1 × 1
id
<dbl>
1 1
my_data %>%
  group_by(id) %>%
  filter(check_grades_decline(grade)) %>%
  ungroup() %>%
  select(id) %>%
  unique()
# A tibble: 1 × 1
id
<dbl>
1 2
Or, if we want to check all instances:
my_data %>%
  arrange(id, year) %>%
  group_by(id) %>%
  filter(c(FALSE, diff(grade) > 0)) %>%
  ungroup() %>%
  select(id) %>%
  unique()
# A tibble: 2 × 1
id
<dbl>
1 1
2 3
my_data %>%
  arrange(id, year) %>%
  group_by(id) %>%
  filter(c(FALSE, diff(grade) < 0)) %>%
  ungroup() %>%
  select(id) %>%
  unique()
# A tibble: 2 × 1
id
<dbl>
1 2
2 3
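For completeness, the per-student check can also be done without a helper function at all, using all() on the grouped differences. A minimal sketch (my addition, not part of the original answer), assuming "improved" means the grades never drop within a student and "declined" means they never rise; the column names are made up for illustration:
library(dplyr)
my_data %>%
  arrange(id, year) %>%
  group_by(id) %>%
  summarise(never_declined = all(diff(grade) >= 0),  # improved or stayed the same throughout
            never_improved = all(diff(grade) <= 0))  # declined or stayed the same throughout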
df_input is the data frame which needs to be transformed into df_output.
For instance, 2001-2003 belong to assembly = 1, and the winner was decided in 2001; that winner carries over for as long as the assembly doesn't change. Similarly, there is a string variable called "party" that also stays the same as long as the assembly is the same.
df_input <- data.frame(winner = c(1,0,0,0,2,0,0,0,1,0,0,0,0),
                       party = c("A",0,0,0,"B",0,0,0,"C",0,0,0,0),
                       assembly = c(1,1,1,2,2,2,3,3,3,3,4,4,4),
                       year = c(2001,2002,2003,2004,2005,2006,2007,2008,2009,2010,2011,2012,2013))
df_output <- data.frame(winner = c(1,1,1,0,2,2,0,0,1,1,0,0,0),
                        party = c("A","A","A",0,"B","B",0,0,"C","C",0,0,0),
                        assembly = c(1,1,1,2,2,2,3,3,3,3,4,4,4),
                        year = c(2001,2002,2003,2004,2005,2006,2007,2008,2009,2010,2011,2012,2013))
The code below works fine for the numeric variable (winner). How can I do the same when there is an additional string variable, "party"?
I get the following error after running this code:
df_output <- df_input %>%
  mutate(winner = if_else(winner > 0, winner, NA_real_)) %>%
  group_by(assembly) %>%
  fill(winner) %>%
  ungroup() %>%
  replace_na(list(winner = 0)) # working fine

df_output <- df_input %>%
  mutate(party = ifelse(party > 0, party, NA)) %>%
  group_by(assembly) %>%
  fill(party) %>%
  ungroup() %>%
  replace_na(list(party = 0))
Error:
Error in `vec_assign()`:
! Can't convert `replace$party` <double> to match type of `data$party` <character>.
You have to pay attention to the data types. As party is a character column, use "0" in replace_na(). Also, there is NA_character_ for the character missing value:
library(dplyr)
library(tidyr)
df_input %>%
  mutate(winner = if_else(winner > 0, winner, NA_real_),
         party = if_else(party != "0", party, NA_character_)) %>%
  group_by(assembly) %>%
  fill(winner, party) %>%
  ungroup() %>%
  replace_na(list(winner = 0, party = "0"))
#> # A tibble: 13 × 4
#> winner party assembly year
#> <dbl> <chr> <dbl> <dbl>
#> 1 1 A 1 2001
#> 2 1 A 1 2002
#> 3 1 A 1 2003
#> 4 0 0 2 2004
#> 5 2 B 2 2005
#> 6 2 B 2 2006
#> 7 0 0 3 2007
#> 8 0 0 3 2008
#> 9 1 C 3 2009
#> 10 1 C 3 2010
#> 11 0 0 4 2011
#> 12 0 0 4 2012
#> 13 0 0 4 2013
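An equivalent sketch (my addition, not from the answer above) that sidesteps the NA_real_ / NA_character_ distinction by using dplyr::na_if(), which works for both column types:
library(dplyr)
library(tidyr)
df_input %>%
  mutate(winner = na_if(winner, 0),      # numeric 0 becomes NA
         party = na_if(party, "0")) %>%  # character "0" becomes NA
  group_by(assembly) %>%
  fill(winner, party) %>%
  ungroup() %>%
  replace_na(list(winner = 0, party = "0"))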
I have a dataframe like so:
id year month val
1 2020 1 50
1 2020 7 80
1 2021 1 40
1 2021 7 70
.
.
Now, I want to index all the values using Jan 2020 as the base period for each id. Essentially, group by id, then divide each val by the val at Jan 2020 and multiply by 100. So the final dataframe would look something like this:
id year month val
1 2020 1 100
1 2020 7 160
1 2021 1 80
1 2021 7 140
.
.
This is what I have tried so far:
df %>% group_by(id) %>% mutate(val = 100*val/[val at Jan 2020])
I can separately get val at Jan 2020 like so:
df %>% filter(year==2020, month==1) %>% select(val)
But it doesn't work together:
df %>% group_by(id) %>% mutate(val = 100*val/(df %>% filter(year==2020, month==1) %>% select(val)))
The above throws an error.
A dplyr approach
library(dplyr)
df %>%
  group_by(id) %>%
  mutate(val = val / val[year == 2020 & month == 1] * 100) %>%
  ungroup()
# A tibble: 4 × 4
id year month val
<int> <int> <int> <dbl>
1 1 2020 1 100
2 1 2020 7 160
3 1 2021 1 80
4 1 2021 7 140
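A defensive variant (my own assumption, not in the original answer): if some id had no Jan 2020 row, val[year == 2020 & month == 1] would have length zero and mutate() would error, so taking the first element with [1] turns that case into NA instead. The data.frame call just rebuilds the example rows shown above.
library(dplyr)
# rebuild the example rows shown in the question
df <- data.frame(id = 1,
                 year = c(2020, 2020, 2021, 2021),
                 month = c(1, 7, 1, 7),
                 val = c(50, 80, 40, 70))
df %>%
  group_by(id) %>%
  mutate(base = val[year == 2020 & month == 1][1],  # NA if an id has no Jan 2020 row
         val = 100 * val / base) %>%
  ungroup() %>%
  select(-base)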
Base R
do.call(
  rbind,
  lapply(
    split(df, df$id),
    function(x){
      cbind(
        subset(x, select = -c(val)),
        "val" = x$val / x$val[x$year == 2020 & x$month == 1] * 100
      )
    }
  )
)
id year month val
1.1 1 2020 1 100
1.2 1 2020 7 160
1.3 1 2021 1 80
1.4 1 2021 7 140
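An equivalent data.table sketch (my addition, not part of the answers above), computing the same per-id base value by reference:
library(data.table)
dt <- as.data.table(df)
dt[, val := 100 * val / val[year == 2020 & month == 1][1], by = id]
dt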
I have data about sales by year and by product, let's say like this:
Year <- c(2010,2010,2010,2010,2010,2011,2011,2011,2011,2011,2012,2012,2012,2012,2012)
Model <- c("a","b","c","d","e","a","b","c","d","e","a","b","c","d","e")
Sale <- c("30","45","23","33","24","11","56","19","45","56","33","32","89","33","12")
df <- data.frame(Year, Model, Sale)
Firstly, I need to calculate a "Share" column which represents the share of each product within each year.
After that, I compute the cumulative share like this:
In the 3rd step I need to identify the products that accumulate total sales up to 70% in the last year (2012 in this case), keep only these products in the whole dataframe, add a ranking column (based on the last year), and summarise all the remaining products as a category "Other". So the final dataframe should look like this:
This is a fairly complex data wrangling task, but can be achieved using dplyr:
library(dplyr)
df %>%
  mutate(Sale = as.numeric(Sale)) %>%
  group_by(Year) %>%
  mutate(Share = 100 * Sale / sum(Sale),
         Year_order = order(order(-Share))) %>%
  arrange(Year, Year_order, .by_group = TRUE) %>%
  mutate(Cumm.Share = cumsum(Share)) %>%
  ungroup() %>%
  mutate(below_70 = Model %in% Model[Year == max(Year) & Cumm.Share < 70]) %>%
  mutate(Model = ifelse(below_70, Model, 'Other')) %>%
  group_by(Year, Model) %>%
  summarize(Sale = sum(Sale), Share = sum(Share), .groups = 'keep') %>%
  group_by(Year) %>%
  mutate(pseudoShare = ifelse(Model == 'Other', 0, Share)) %>%
  arrange(Year, -pseudoShare, .by_group = TRUE) %>%
  ungroup() %>%
  mutate(Rank = match(Model, Model[Year == max(Year)])) %>%
  select(-pseudoShare)
#> # A tibble: 9 x 5
#> Year Model Sale Share Rank
#> <dbl> <chr> <dbl> <dbl> <int>
#> 1 2010 a 30 19.4 2
#> 2 2010 c 23 14.8 1
#> 3 2010 Other 102 65.8 3
#> 4 2011 c 19 10.2 1
#> 5 2011 a 11 5.88 2
#> 6 2011 Other 157 84.0 3
#> 7 2012 c 89 44.7 1
#> 8 2012 a 33 16.6 2
#> 9 2012 Other 77 38.7 3
Note that in the output this code has kept groups a and c, rather than c and d, as in your expected output. This is because a and d have the same value in the final year (16.6), and therefore either could be chosen.
Created on 2022-04-21 by the reprex package (v2.0.1)
Year <- c(2010,2010,2010,2010,2010,2011,2011,2011,2011,2011,2012,2012,2012,2012,2012)
Model <- c("a","b","c","d","e","a","b","c","d","e","a","b","c","d","e")
Sale <- c("30","45","23","33","24","11","56","19","45","56","33","32","89","33","12")
df <- data.frame(Year, Model, Sale, stringsAsFactors=F)
years <- unique(df$Year)
shares <- c()
cumshares <- c()
for (year in years){
  extract <- df[df$Year == year, ]
  sale <- as.numeric(extract$Sale)
  share <- 100 * sale / sum(sale)
  shares <- append(shares, share)
  # reversed cumsum gives the cumulative share counted from the bottom of each year up
  cumshare <- rev(cumsum(rev(share)))
  cumshares <- append(cumshares, cumshare)
}
df$Share <- shares
df$Cumm.Share <- cumshares
df
gives
> df
Year Model Sale Share Cumm.Share
1 2010 a 30 19.354839 100.000000
2 2010 b 45 29.032258 80.645161
3 2010 c 23 14.838710 51.612903
4 2010 d 33 21.290323 36.774194
5 2010 e 24 15.483871 15.483871
6 2011 a 11 5.882353 100.000000
7 2011 b 56 29.946524 94.117647
8 2011 c 19 10.160428 64.171123
9 2011 d 45 24.064171 54.010695
10 2011 e 56 29.946524 29.946524
11 2012 a 33 16.582915 100.000000
12 2012 b 32 16.080402 83.417085
13 2012 c 89 44.723618 67.336683
14 2012 d 33 16.582915 22.613065
15 2012 e 12 6.030151 6.030151
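For reference, the same two columns can also be added with dplyr instead of the loop (a sketch, not part of this answer); rev(cumsum(rev(...))) reproduces the bottom-up cumulative share computed above:
library(dplyr)
df %>%
  mutate(Sale = as.numeric(Sale)) %>%
  group_by(Year) %>%
  mutate(Share = 100 * Sale / sum(Sale),
         Cumm.Share = rev(cumsum(rev(Share)))) %>%
  ungroup()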
I don't understand what you mean by step 3; how do you decide which products to keep?
Using the example data provided below: for each institution type ("a" and "b") I want to drop rows with fac == "no" if there exists a row with fac == "yes" for the same year. I then want to sum the values by year. I am, however, not able to figure out how to drop the correct "no" rows. Below are a couple of my attempts based on answers given here.
set.seed(123)
ext <- tibble(
  institution = c(rep("a", 7), rep("b", 7)),
  year = rep(c("2005", "2005", "2006", "2007", "2008", "2009", "2009"), 2),
  fac = rep(c("yes", "no", "no", "no", "no", "yes", "no"), 2),
  value = sample(1:100, 14, replace = T)
)
ext %>%
  group_by(institution, year) %>%
  filter(if (fac == "yes") fac != "no")

ext %>%
  group_by(institution, year) %>%
  case_when(fac == "yes" ~ filter(., fac != "no"))

ext %>%
  group_by(institution, year) %>%
  {if (fac == "yes") filter(., fac != "no")}
Another way would be:
library(dplyr)
ext %>%
  group_by(institution, year) %>%
  filter(fac == 'yes' | n() < 2)
# institution year fac value
# 1 a 2005 yes 31
# 2 a 2006 no 51
# 3 a 2007 no 14
# 4 a 2008 no 67
# 5 a 2009 yes 42
# 6 b 2005 yes 43
# 7 b 2006 no 25
# 8 b 2007 no 90
# 9 b 2008 no 91
# 10 b 2009 yes 69
In case you want the overall amounts by year, add these two lines, which will yield the following output:
group_by(year) %>%
summarise(value=sum(value))
# year value
# <chr> <int>
# 1 2005 74
# 2 2006 76
# 3 2007 104
# 4 2008 158
# 5 2009 111
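Putting the pieces above together, the full pipeline reads:
library(dplyr)
ext %>%
  group_by(institution, year) %>%
  filter(fac == 'yes' | n() < 2) %>%  # keep the "yes" row, or the lone "no" row
  group_by(year) %>%
  summarise(value = sum(value))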
Does this work? By summarise, I assume you want to sum by year after applying the filtering.
library(dplyr)
ext %>% group_by(institution, year) %>% filter(fac == 'yes'|all(fac == 'no'))
# A tibble: 10 x 4
# Groups: institution, year [10]
institution year fac value
<chr> <chr> <chr> <int>
1 a 2005 yes 31
2 a 2006 no 51
3 a 2007 no 14
4 a 2008 no 67
5 a 2009 yes 42
6 b 2005 yes 43
7 b 2006 no 25
8 b 2007 no 90
9 b 2008 no 91
10 b 2009 yes 69
ext %>%
  group_by(institution, year) %>%
  filter(fac == 'yes' | all(fac == 'no')) %>%
  ungroup() %>%
  group_by(year) %>%
  summarise(value = sum(value))
`summarise()` ungrouping output (override with `.groups` argument)
# A tibble: 5 x 2
year value
<chr> <int>
1 2005 74
2 2006 76
3 2007 104
4 2008 158
5 2009 111
Try creating a flag to identify the "yes" occurrence and after that filter only the desired values. You would need to group by institution and year, then count how many "yes" values each group contains. With that count you can flag the "no" values whenever the group also contains a "yes". Finally, keep only the rows where Flag is zero and you will drop the rows as expected. Here is the code:
library(dplyr)
#Code
newdf <- ext %>%
  group_by(institution, year) %>%
  mutate(NYes = length(fac[fac == 'yes']),
         Flag = ifelse(fac == 'no' & NYes >= 1, 1, 0)) %>%
  filter(Flag == 0) %>%
  select(-c(NYes, Flag))
Output:
# A tibble: 10 x 4
# Groups: institution, year [10]
institution year fac value
<chr> <chr> <chr> <int>
1 a 2005 yes 31
2 a 2006 no 51
3 a 2007 no 14
4 a 2008 no 67
5 a 2009 yes 42
6 b 2005 yes 43
7 b 2006 no 25
8 b 2007 no 90
9 b 2008 no 91
10 b 2009 yes 69
And the full code to summarise by year:
#Code 2
newdf <- ext %>%
  group_by(institution, year) %>%
  mutate(NYes = length(fac[fac == 'yes']),
         Flag = ifelse(fac == 'no' & NYes >= 1, 1, 0)) %>%
  filter(Flag == 0) %>%
  select(-c(NYes, Flag)) %>%
  ungroup() %>%
  group_by(year) %>%
  summarise(value = sum(value))
Output:
# A tibble: 5 x 2
year value
<chr> <int>
1 2005 74
2 2006 76
3 2007 104
4 2008 158
5 2009 111
An option with data.table
library(data.table)
setDT(ext)[ext[, .I[fac == 'yes'|all(fac == 'no')], .(institution, year)]$V1]
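Here .I gives the row numbers within each institution/year group, so the inner call collects the indices of the rows to keep. A sketch (my addition) that stores the filtered rows and adds the by-year totals, assuming the same summary as the other answers:
library(data.table)
setDT(ext)                        # convert to data.table by reference
idx <- ext[, .I[fac == 'yes' | all(fac == 'no')], by = .(institution, year)]$V1
kept <- ext[idx]
kept[, .(value = sum(value)), by = year]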
As said in the title, I have a data.frame like the one below,
df<-data.frame('id'=c('1','1','1','1','1','1','1'),'time'=c('1998','2000','2001','2002','2003','2004','2007'))
df
id time
1 1 1998
2 1 2000
3 1 2001
4 1 2002
5 1 2003
6 1 2004
7 1 2007
There are some other cases with shorter or longer time windows than this; this one is just for illustration's sake.
I want to do two things with this data set. First, find all ids that have at least five consecutive observations; this can be done by following the solutions here. Second, I want to keep only the observations that belong to those runs of at least five consecutive rows for the ids selected in the first step. The ideal result would be:
df
id time
1 1 2000
2 1 2001
3 1 2002
4 1 2003
5 1 2004
I could write a complex function using a for loop and the diff function, but this may be very time consuming both to write and to run if I have a bigger data set with lots of ids. That does not seem like the R way, and I do believe there should be a one- or two-line solution.
Does anyone know how to achieve this? Your time and knowledge would be deeply appreciated. Thanks in advance.
You can use dplyr to group by id and by runs of consecutive time, and then filter out groups with fewer than 5 entries, i.e.
#read data with stringsAsFactors = FALSE
df <- data.frame('id' = c('1','1','1','1','1','1','1'),
                 'time' = c('1998','2000','2001','2002','2003','2004','2007'),
                 stringsAsFactors = FALSE)

library(dplyr)
df %>%
  mutate(time = as.integer(time)) %>%
  group_by(id, grp = cumsum(c(1, diff(time) != 1))) %>%
  filter(n() >= 5)
which gives
# A tibble: 5 x 3
# Groups: id, grp [1]
id time grp
<chr> <int> <dbl>
1 1 2000 2
2 1 2001 2
3 1 2002 2
4 1 2003 2
5 1 2004 2
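If the helper grp column is not wanted in the result, it can be dropped afterwards (a small addition, mirroring the clean-up shown in the next answer):
library(dplyr)
df %>%
  mutate(time = as.integer(time)) %>%
  group_by(id, grp = cumsum(c(1, diff(time) != 1))) %>%
  filter(n() >= 5) %>%
  ungroup() %>%
  select(-grp)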
Similar to #Sotos answer, this solution instead uses seqle (from cgwtools) as the grouping variable:
library(dplyr)
library(cgwtools)
df %>%
  mutate(time = as.numeric(time)) %>%
  group_by(id, consec = rep(seqle(time)$length, seqle(time)$length)) %>%
  filter(consec >= 5)
Result:
# A tibble: 5 x 3
# Groups: id, consec [1]
id time consec
<chr> <dbl> <int>
1 1 2000 5
2 1 2001 5
3 1 2002 5
4 1 2003 5
5 1 2004 5
To remove grouping variable:
df %>%
  mutate(time = as.numeric(time)) %>%
  group_by(id, consec = rep(seqle(time)$length, seqle(time)$length)) %>%
  filter(consec >= 5) %>%
  ungroup() %>%
  select(-consec)
Result:
# A tibble: 5 x 2
id time
<chr> <dbl>
1 1 2000
2 1 2001
3 1 2002
4 1 2003
5 1 2004
Data:
df <- data.frame('id' = c('1','1','1','1','1','1','1'),
                 'time' = c('1998','2000','2001','2002','2003','2004','2007'),
                 stringsAsFactors = FALSE)
Try this on your data:
library(magrittr) # for %>%

# convert the character columns to their natural types
df[,] <- lapply(df, function(x) type.convert(as.character(x), as.is = TRUE))
# absolute gap to the next observation (IND1) and to the previous one (IND2)
IND1 <- (df$time - c(df$time[-1], df$time[length(df$time) - 1])) %>% abs(.)
IND2 <- (df$time - c(df$time[2], df$time[-(length(df$time))])) %>% abs(.)
# keep rows that sit next to a consecutive year, then keep ids with at least 5 such rows
df <- df[IND1 %in% 1 | IND2 %in% 1, ]
df[ave(df$time, df$id, FUN = length) >= 5, ]
A solution from dplyr, tidyr, and data.table.
library(dplyr)
library(tidyr)
library(data.table)
df2 <- df %>%
  mutate(time = as.numeric(as.character(time))) %>%
  arrange(id, time) %>%
  right_join(tibble(time = full_seq(.$time, 1)), by = "time") %>%
  mutate(RunID = rleid(id)) %>%
  group_by(RunID) %>%
  filter(n() >= 5, !is.na(id)) %>%
  ungroup() %>%
  select(-RunID)
df2
# A tibble: 5 x 2
id time
<fctr> <dbl>
1 1 2000
2 1 2001
3 1 2002
4 1 2003
5 1 2004