Conditional and grouped mutate dplyr - r

Let's say I have the following data on the number of socks per drawer:
>socks
year drawer_nbr sock_total
1990 1 2
1991 1 2
1990 2 3
1991 2 4
1990 3 2
1991 3 1
I would like to have a binary variable that identifies if the socks have increased in each drawer. 1 if they increased and 0 if not. The result would be
>socks
drawer_nbr growth
<dbl> <factor>
1 0
2 1
3 0
I am stuck on comparing sock_total for one year with sock_total for another year. I know that I need to use dplyr::summarise(), but I am having difficulty with what goes inside that function.

If you are comparing year 1991 with 1990, you can do:
socks %>%
group_by(drawer_nbr) %>%
summarise(growth = +(sock_total[year == 1991] - sock_total[year == 1990] > 0))
# A tibble: 3 x 2
# drawer_nbr growth
# <int> <int>
#1 1 0
#2 2 1
#3 3 0
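The unary `+` in the answer above is only a compact way of turning the logical comparison into 0/1. A minimal base R sketch of that coercion, where the vector `increased` is made up for illustration:

```r
# Hypothetical result of a "did it grow?" comparison, one value per drawer
increased <- c(FALSE, TRUE, FALSE)

# Unary plus coerces logical to integer: FALSE -> 0, TRUE -> 1
growth_plus <- +increased

# as.integer() does the same thing and may read more clearly
growth_int <- as.integer(increased)
```

Either form can be used inside summarise(); as.integer() is arguably easier to read for someone who has not seen the `+` idiom.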

You could use a mix of dplyr and tidyr:
library(tidyr)
library(dplyr)
socks %>%
group_by(drawer_nbr) %>%
spread(year, sock_total) %>%
mutate(growth = `1991` - `1990`)
Or if you only wanted growth to be binary:
socks %>%
group_by(drawer_nbr) %>%
spread(year, sock_total) %>%
mutate(growth = ifelse((`1991` - `1990`) > 0,
1, 0))
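In more recent versions of tidyr, pivot_wider() is the suggested replacement for spread(). A sketch of the same idea, assuming tidyr 1.0.0 or later and rebuilding the socks data from the question:

```r
library(dplyr)
library(tidyr)

# The example data from the question
socks <- data.frame(
  year       = c(1990, 1991, 1990, 1991, 1990, 1991),
  drawer_nbr = c(1, 1, 2, 2, 3, 3),
  sock_total = c(2, 2, 3, 4, 2, 1)
)

# One row per drawer, one column per year, then compare the two years
wide <- socks %>%
  pivot_wider(names_from = year, values_from = sock_total) %>%
  mutate(growth = as.integer(`1991` > `1990`))
```

No group_by() is needed here, because after widening each drawer already occupies a single row.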

Related

How to create a column with a value of 1 if another column goes past a threshold from one year to another in panel data , 0 otherwise in R

I want to create a column that is 1 if the value went over a threshold (10, for example) in a given year when it had not gone over the threshold in the previous year, and 0 otherwise, as shown in the last column of the sample image. This should be done separately for each unique item across all the years.
My intuition is to use dplyr and case_when, but I don't know how to write the case_when part:
df <- df %>%
group_by(unique, year) %>%
mutate(value turned = case_when(value10 == 1 in year but not in previous year) ~ 1, otherwise 0)
library(dplyr)
df <- data.frame(unique=c(rep("10ab",3),rep("12cc",3)), year=rep(2017:2019,2),
value10 = c(0,1,1,1,1,1))
df %>% group_by(unique) %>% arrange(year) %>%
mutate(value_turned = as.integer(value10 == 1 & lag(value10, default = 1) == 0)) %>%
ungroup() %>% arrange(unique)
#> # A tibble: 6 × 4
#> unique year value10 value_turned
#> <chr> <int> <dbl> <int>
#> 1 10ab 2017 0 0
#> 2 10ab 2018 1 1
#> 3 10ab 2019 1 0
#> 4 12cc 2017 1 0
#> 5 12cc 2018 1 0
#> 6 12cc 2019 1 0
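The answer above starts from a precomputed 0/1 indicator. If only the raw values are available, the threshold test can be done in the same pipeline. This is a sketch under assumptions: the column `value` and its numbers are invented for illustration, and the first year of each item is treated as "already over" so that it is never flagged, mirroring `default = 1` above.

```r
library(dplyr)

# Hypothetical raw values; only the layout follows the question
df <- data.frame(
  unique = rep(c("10ab", "12cc"), each = 3),
  year   = rep(2017:2019, 2),
  value  = c(4, 12, 15, 11, 13, 20)
)

res <- df %>%
  group_by(unique) %>%
  arrange(year, .by_group = TRUE) %>%
  mutate(
    over = value > 10,  # is this year above the threshold?
    # 1 only when this year is above and the previous year was not
    value_turned = as.integer(over & !lag(over, default = TRUE))
  ) %>%
  ungroup()
```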

Count the number of times two values appear in a column based on the unique values of another column [duplicate]

I have the dataframe below:
year<-c("2000","2000","2001","2002","2000")
gender<-c("M","F","M","F","M")
YG<-data.frame(year,gender)
In this dataframe I want to count the number of "M" and "F" for every year and then create a new dataframe like :
year M F
1 2000 2 1
2 2001 1 0
3 2002 0 1
I tried something like:
library(dplyr)
ns<-YG %>%
group_by(year) %>%
count(YG$gender == "M")
A solution using reshape2:
library(reshape2)
dcast(YG, year~gender)
year F M
1 2000 1 2
2 2001 0 1
3 2002 1 0
Or a different tidyverse solution:
YG %>%
group_by(year) %>%
summarise(M = length(gender[gender == "M"]),
F = length(gender[gender == "F"]))
year M F
<fct> <int> <int>
1 2000 2 1
2 2001 1 0
3 2002 0 1
Or as proposed by @zx8754:
YG %>%
group_by(year) %>%
summarise(M = sum(gender == "M"),
F = sum(gender == "F"))
We can use count and spread to get the df format and use fill = 0 in spread to fill in the 0s:
library(tidyverse)
YG %>%
group_by(year) %>%
count(gender) %>%
spread(gender, n, fill = 0)
Output:
# A tibble: 3 x 3
# Groups: year [3]
year F M
<fct> <dbl> <dbl>
1 2000 1 2
2 2001 0 1
3 2002 1 0
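For completeness, the same counts can be obtained without any packages by cross-tabulating the two columns. A minimal base R sketch using the data from the question:

```r
year   <- c("2000", "2000", "2001", "2002", "2000")
gender <- c("M", "F", "M", "F", "M")
YG <- data.frame(year, gender)

# Cross-tabulate year against gender; combinations that never occur count as 0
tab <- table(YG$year, YG$gender)

# Turn the contingency table into a data frame with one column per gender;
# the years end up as row names
counts <- as.data.frame.matrix(tab)
```

If the year is needed as a regular column, it can be recovered from rownames(counts).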

Dplyr solution using slice and group

Ciao, here is my reproducible example.
a=c(1,2,3,4,5,6)
a1=c(15,17,17,16,14,15)
a2=c(0,0,1,1,1,0)
b=c(1,0,NA,NA,0,NA)
c=c(2010,2010,2010,2010,2010,2010)
d=c(1,1,0,1,0,NA)
e=c(2012,2012,2012,2012,2012,2012)
f=c(1,0,0,0,0,NA)
g=c(2014,2014,2014,2014,2014,2014)
h=c(1,1,0,1,0,NA)
i=c(2010,2012,2014,2012,2014,2014)
mydata = data.frame(a,a1,a2,b,c,d,e,f,g,h,i)
names(mydata) = c("id","age","gender","drop1","year1","drop2","year2","drop3","year3","drop4","year4")
mydata2 <- reshape(mydata, direction = "long", varying = list(c("year1","year2","year3","year4"), c("drop1","drop2","drop3","drop4")),v.names = c("year", "drop"), idvar = "X", timevar = "Year", times = c(1:4))
x1 = mydata2 %>%
group_by(id) %>%
slice(which(drop==1)[1])
x2 = mydata2 %>%
group_by(id) %>%
slice(which(drop==0)[1])
I have data "mydata2" in long format, so every ID has many rows.
I want to make a new data set "x" in which every ID has one row, based on whether they dropped or not.
For the first of drop1, drop2, drop3, drop4 that equals 1, I want to take the corresponding year and put it in a variable dropYEAR. If none of drop1, drop2, drop3, drop4 equals 1, I want to put the last data point in year1, year2, year3, year4 in dropYEAR.
Ultimately every ID should have one row, and I want to create two new columns: didDROP equals 1 if the ID ever dropped and 0 if it never dropped; dropYEAR equals the year of the drop if didDROP equals 1, or the last reported year among year1, year2, year3, year4 if the ID never dropped. I tried to do this in dplyr, but it gives only part of what I want because it gets rid of IDs whose values equal 0.
This is the desired output, thank you to @Wimpel
First run mydata2 %>% arrange(id) to understand the dataset. Then, using dplyr's first and last, we can pull the first year where drop==1 or, if drop never equals 1, the last year where drop is not NA. case_when is used to compute didDROP because it handles NAs gracefully.
library(dplyr)
mydata2 %>% group_by(id) %>%
mutate(dropY=first(year[!is.na(drop) & drop==1]),
dropYEAR=if_else(is.na(dropY), last(year[!is.na(drop)]),dropY)) %>%
slice(1)
#Update
mydata2 %>% group_by(id) %>%
mutate(dropY=first(year[!is.na(drop) & drop==1]),
dropYEAR=if_else(is.na(dropY), last(year),dropY),
didDROP=case_when(any(drop==1) ~ 1, #Return 1 if there is any drop=1 o.w it will return 0
TRUE ~ 0)) %>%
select(-dropY) %>% slice(1)
# A tibble: 6 x 9
# Groups: id [6]
id age gender Year year drop X dropYEAR didDROP
<dbl> <dbl> <dbl> <int> <dbl> <dbl> <int> <dbl> <dbl>
1 1 15 0 1 2010 1 1 2010 1
2 2 17 0 1 2010 0 2 2012 1
3 3 17 1 1 2010 NA 3 2014 0
4 4 16 1 1 2010 NA 4 2012 1
5 5 14 1 1 2010 0 5 2014 0
6 6 15 0 1 2010 NA 6 2014 0
I hope this is what you're looking for.
You can sort by id, drop and year, conditionally on dropping or not:
library(dplyr)
mydata2 %>%
mutate(drop=ifelse(is.na(drop),0,drop)) %>%
arrange(id,-drop,year*(2*drop-1)) %>%
group_by(id) %>%
slice(1) %>%
select(id,age,gender,didDROP=drop,dropYEAR=year)
# A tibble: 6 x 5
# Groups: id [6]
id age gender didDROP dropYEAR
<dbl> <dbl> <dbl> <dbl> <dbl>
1 1 15 0 1 2010
2 2 17 0 1 2012
3 3 17 1 0 2014
4 4 16 1 1 2012
5 5 14 1 0 2014
6 6 15 0 0 2014
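Because the goal is one row per ID, the same logic can also be written with summarise() instead of mutate() plus slice(). This is a sketch on a small made-up long data set (the values in `long` are hypothetical), and it assumes that "first drop" means the earliest year with drop equal to 1:

```r
library(dplyr)

# Hypothetical long data: three ids observed in three years
long <- data.frame(
  id   = rep(1:3, each = 3),
  year = rep(c(2010, 2012, 2014), 3),
  drop = c(0, 1, 1,   0, 0, NA,   NA, NA, NA)
)

res <- long %>%
  group_by(id) %>%
  summarise(
    # 1 if any non-missing drop equals 1, otherwise 0
    didDROP  = as.integer(any(drop == 1, na.rm = TRUE)),
    # earliest drop year, or the last reported year when the id never dropped
    dropYEAR = if (didDROP == 1) min(year[which(drop == 1)]) else max(year)
  )
```

Using which() inside the subset avoids NA indices when drop contains missing values.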

keep only consecutive observations

As said in the title, I have a data.frame like below,
df<-data.frame('id'=c('1','1','1','1','1','1','1'),'time'=c('1998','2000','2001','2002','2003','2004','2007'))
df
id time
1 1 1998
2 1 2000
3 1 2001
4 1 2002
5 1 2003
6 1 2004
7 1 2007
There are other cases with shorter or longer time windows than this; this one is just for illustration.
I want to do two things with this data set. First, find all ids that have at least five consecutive observations; this can be done by following the solutions here. Second, keep only the observations that belong to a run of at least five consecutive rows for the ids selected in the first step. The ideal result would be:
df
id time
1 1 2000
2 1 2001
3 1 2002
4 1 2003
5 1 2004
I could write a complex function using a for loop and the diff function, but this may be very time consuming, both in writing the function and in getting the result if I have a bigger data set with lots of ids. That does not seem like idiomatic R, and I believe there should be a one- or two-line solution.
Does anyone know how to achieve this? Your time and knowledge would be deeply appreciated. Thanks in advance.
You can use dplyr to group by id and by runs of consecutive times, and filter out groups with fewer than 5 entries, i.e.
#read data with stringsAsFactors = FALSE
df<-data.frame('id'=c('1','1','1','1','1','1','1'),
'time'=c('1998','2000','2001','2002','2003','2004','2007'),
stringsAsFactors = FALSE)
library(dplyr)
df %>%
mutate(time = as.integer(time)) %>%
group_by(id, grp = cumsum(c(1, diff(time) != 1))) %>%
filter(n() >= 5)
which gives
# A tibble: 5 x 3
# Groups: id, grp [1]
id time grp
<chr> <int> <dbl>
1 1 2000 2
2 1 2001 2
3 1 2002 2
4 1 2003 2
5 1 2004 2
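The grouping variable in this answer comes from the cumsum(diff(...) != 1) idiom, which starts a new run index every time the gap to the previous value is not exactly 1. A base R sketch on the times from the question shows the intermediate steps:

```r
time <- c(1998, 2000, 2001, 2002, 2003, 2004, 2007)

# TRUE wherever the step from the previous year is not exactly 1
breaks <- diff(time) != 1

# Prepend 1 for the first element, then accumulate: each break opens a new run
grp <- cumsum(c(1, breaks))

# Keep only the values that sit in a run of length 5 or more
kept <- time[ave(time, grp, FUN = length) >= 5]
```

Here grp is 1 for 1998, 2 for 2000 through 2004, and 3 for 2007, which matches the grp column in the output above.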
Similar to @Sotos's answer, this solution instead uses seqle (from cgwtools) as the grouping variable:
library(dplyr)
library(cgwtools)
df %>%
mutate(time = as.numeric(time)) %>%
group_by(id, consec = rep(seqle(time)$length, seqle(time)$length)) %>%
filter(consec >= 5)
Result:
# A tibble: 5 x 3
# Groups: id, consec [1]
id time consec
<chr> <dbl> <int>
1 1 2000 5
2 1 2001 5
3 1 2002 5
4 1 2003 5
5 1 2004 5
To remove grouping variable:
df %>%
mutate(time = as.numeric(time)) %>%
group_by(id, consec = rep(seqle(time)$length, seqle(time)$length)) %>%
filter(consec >= 5) %>%
ungroup() %>%
select(-consec)
Result:
# A tibble: 5 x 2
id time
<chr> <dbl>
1 1 2000
2 1 2001
3 1 2002
4 1 2003
5 1 2004
Data:
df<-data.frame('id'=c('1','1','1','1','1','1','1'),
'time'=c('1998','2000','2001','2002','2003','2004','2007'),
stringsAsFactors = FALSE)
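The examples above contain a single id. With several ids it may be safer to compute the run index inside each id, so that a run cannot continue from the last row of one id into the first row of the next. A sketch under that assumption, with a second id invented for illustration:

```r
library(dplyr)

# id "1" is the data from the question; id "2" is hypothetical
df <- data.frame(
  id   = c(rep("1", 7), rep("2", 3)),
  time = c(1998, 2000:2004, 2007, 2005:2007)
)

res <- df %>%
  group_by(id) %>%
  arrange(time, .by_group = TRUE) %>%
  mutate(grp = cumsum(c(1, diff(time) != 1))) %>%  # run index within each id
  group_by(id, grp) %>%
  filter(n() >= 5) %>%                             # keep runs of at least 5 rows
  ungroup() %>%
  select(-grp)
```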
Try that on your data:
library(dplyr) # provides the %>% pipe used below
df[,] <- lapply(df, function(x) type.convert(as.character(x), as.is = TRUE))
IND1 <- (df$time - c(df$time[-1],df$time[length(df$time)-1])) %>% abs(.)
IND2 <- (df$time - c(df$time[2],df$time[-(length(df$time))])) %>% abs(.)
df <- df[IND1 %in% 1 | IND2 %in% 1,]
df[ave(df$time, df$id, FUN = length) >= 5, ]
A solution from dplyr, tidyr, and data.table.
library(dplyr)
library(tidyr)
library(data.table)
df2 <- df %>%
mutate(time = as.numeric(as.character(time))) %>%
arrange(id, time) %>%
right_join(data_frame(time = full_seq(.$time, 1)), by = "time") %>%
mutate(RunID = rleid(id)) %>%
group_by(RunID) %>%
filter(n() >= 5, !is.na(id)) %>%
ungroup() %>%
select(-RunID)
df2
# A tibble: 5 x 2
id time
<fctr> <dbl>
1 1 2000
2 1 2001
3 1 2002
4 1 2003
5 1 2004

Count number of observations without N/A per year in R

I have a dataset and I want to summarize the number of observations without the missing values (denoted by NA).
My data is similar to the following:
data <- read.table(header = TRUE,
stringsAsFactors = FALSE,
text="CompanyNumber ResponseVariable Year ExplanatoryVariable1 ExplanatoryVariable2
1 2.5 2000 1 2
1 4 2001 3 1
1 3 2002 NA 7
2 1 2000 3 NA
2 2.4 2001 0 4
2 6 2002 2 9
3 10 2000 NA 3")
I was planning to use the dplyr package, but this only takes the years into account and not the different variables:
library(dplyr)
data %>%
group_by(Year) %>%
summarise(number = n())
How can I obtain the following outcome?
2000 2001 2002
ExplanatoryVariable1 2 2 1
ExplanatoryVariable2 2 2 2
To get the counts, you can start by using:
library(dplyr)
data %>%
group_by(Year) %>%
summarise_at(vars(starts_with("Expla")), ~sum(!is.na(.)))
## A tibble: 3 x 3
# Year ExplanatoryVariable1 ExplanatoryVariable2
# <int> <int> <int>
#1 2000 2 2
#2 2001 2 2
#3 2002 1 2
If you want to reshape it as shown in your question, you can extend the pipe using tidyr functions:
library(tidyr)
data %>%
group_by(Year) %>%
summarise_at(vars(starts_with("Expla")), ~sum(!is.na(.))) %>%
gather(var, count, -Year) %>%
spread(Year, count)
## A tibble: 2 x 4
# var `2000` `2001` `2002`
#* <chr> <int> <int> <int>
#1 ExplanatoryVariable1 2 2 1
#2 ExplanatoryVariable2 2 2 2
Since the OP has ~200 explanatory variables to select, note that summarise_at offers other ways to select the variables. You can simply give the first:last range of variables, if they are ordered correctly in the data, for example:
data %>%
group_by(Year) %>%
summarise_at(vars(ExplanatoryVariable1:ExplanatoryVariable2), ~sum(!is.na(.)))
Or:
data %>%
group_by(Year) %>%
# numeric positions are counted among the non-grouping columns, so Year is skipped
summarise_at(3:4, ~sum(!is.na(.)))
Or store the variable names in a vector and use that:
vars <- names(data)[4:5]
data %>%
group_by(Year) %>%
summarise_at(vars, ~sum(!is.na(.)))
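In current dplyr and tidyr, summarise_at(), gather() and spread() are superseded by across(), pivot_longer() and pivot_wider(). A sketch of the same pipeline with the newer verbs, assuming dplyr 1.0.0 and tidyr 1.0.0 or later:

```r
library(dplyr)
library(tidyr)

data <- read.table(header = TRUE, stringsAsFactors = FALSE, text = "
CompanyNumber ResponseVariable Year ExplanatoryVariable1 ExplanatoryVariable2
1 2.5 2000 1 2
1 4 2001 3 1
1 3 2002 NA 7
2 1 2000 3 NA
2 2.4 2001 0 4
2 6 2002 2 9
3 10 2000 NA 3")

counts <- data %>%
  group_by(Year) %>%
  # count the non-missing values of every explanatory variable per year
  summarise(across(starts_with("Expla"), ~ sum(!is.na(.x)))) %>%
  # reshape so that variables are rows and years are columns
  pivot_longer(-Year, names_to = "var", values_to = "count") %>%
  pivot_wider(names_from = Year, values_from = count)
```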
data %>%
gather(cat, val, -(1:3)) %>%
filter(complete.cases(.)) %>%
group_by(Year, cat) %>%
summarize(n = n()) %>%
spread(Year, n)
# # A tibble: 2 x 4
# cat `2000` `2001` `2002`
# * <chr> <int> <int> <int>
# 1 ExplanatoryVariable1 2 2 1
# 2 ExplanatoryVariable2 2 2 2
This should do it. You start by stacking the data into long format and then simply count n for each combination of year and explanatory variable. If you want the data back in wide format, use spread; even without spread, you get the counts for both variables.
Using base R:
do.call(cbind,by(data[3:5], data$Year,function(x) colSums(!is.na(x[-1]))))
2000 2001 2002
ExplanatoryVariable1 2 2 1
ExplanatoryVariable2 2 2 2
For aggregate:
aggregate(.~Year,data[3:5],function(x) sum(!is.na(x)),na.action = function(x)x)
You could do it with aggregate in base R.
aggregate(list(ExplanatoryVariable1 = data$ExplanatoryVariable1,
ExplanatoryVariable2 = data$ExplanatoryVariable2),
list(Year = data$Year),
function(x) length(x[!is.na(x)]))
# Year ExplanatoryVariable1 ExplanatoryVariable2
#1 2000 2 2
#2 2001 2 2
#3 2002 1 2
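Another base R possibility is rowsum(), which sums the rows of a matrix by group. This is a sketch that first converts the "is not missing" indicator to 0/1 and then transposes so that the years become columns, as in the requested layout:

```r
data <- read.table(header = TRUE, stringsAsFactors = FALSE, text = "
CompanyNumber ResponseVariable Year ExplanatoryVariable1 ExplanatoryVariable2
1 2.5 2000 1 2
1 4 2001 3 1
1 3 2002 NA 7
2 1 2000 3 NA
2 2.4 2001 0 4
2 6 2002 2 9
3 10 2000 NA 3")

# 1 where a value is present, 0 where it is NA
present <- +!is.na(data[c("ExplanatoryVariable1", "ExplanatoryVariable2")])

# Sum the indicators per year, then put variables in rows and years in columns
counts <- t(rowsum(present, data$Year))
```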

Resources