As the title says, I have a data.frame like the one below:
df <- data.frame(id = c('1','1','1','1','1','1','1'), time = c('1998','2000','2001','2002','2003','2004','2007'))
df
id time
1 1 1998
2 1 2000
3 1 2001
4 1 2002
5 1 2003
6 1 2004
7 1 2007
There are other cases with shorter or longer time windows than this one; it is just for illustration.
I want to do two things with this data set. First, find all ids that have at least five consecutive observations; this can be done by following existing solutions. Second, keep only the observations that belong to those runs of at least five consecutive years for the ids selected in the first step. The ideal result would be:
df
id time
1 1 2000
2 1 2001
3 1 2002
4 1 2003
5 1 2004
I could write a complex function using a for loop and the diff function, but that would be time consuming both to write and to run on a bigger data set with lots of ids. It also does not seem like the R way, and I believe there should be a one- or two-line solution.
Does anyone know how to achieve this? Your time and knowledge would be deeply appreciated. Thanks in advance.
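For comparison, the run-detection idea the answers below build on can also be sketched in base R. This is a minimal sketch, assuming time can be converted to integer years:

```r
# Base R sketch: start a new run whenever the gap to the previous year
# is not exactly 1, then keep rows that belong to runs of length >= 5
df <- data.frame(id   = rep('1', 7),
                 time = c('1998','2000','2001','2002','2003','2004','2007'),
                 stringsAsFactors = FALSE)
df$time <- as.integer(df$time)
run <- ave(df$time, df$id, FUN = function(x) cumsum(c(1, diff(x) != 1)))
len <- ave(df$time, df$id, run, FUN = length)  # run length, per id and run
df[len >= 5, ]
```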
You can use dplyr to group by id and runs of consecutive time values, and then filter out groups with fewer than 5 entries, i.e.
#read data with stringsAsFactors = FALSE
df<-data.frame('id'=c('1','1','1','1','1','1','1'),
'time'=c('1998','2000','2001','2002','2003','2004','2007'),
stringsAsFactors = FALSE)
library(dplyr)
df %>%
mutate(time = as.integer(time)) %>%
group_by(id, grp = cumsum(c(1, diff(time) != 1))) %>%
filter(n() >= 5)
which gives
# A tibble: 5 x 3
# Groups: id, grp [1]
id time grp
<chr> <int> <dbl>
1 1 2000 2
2 1 2001 2
3 1 2002 2
4 1 2003 2
5 1 2004 2
Similar to @Sotos' answer, this solution instead uses seqle (from the cgwtools package) as the grouping variable:
library(dplyr)
library(cgwtools)
df %>%
mutate(time = as.numeric(time)) %>%
group_by(id, consec = rep(seqle(time)$length, seqle(time)$length)) %>%
filter(consec >= 5)
Result:
# A tibble: 5 x 3
# Groups: id, consec [1]
id time consec
<chr> <dbl> <int>
1 1 2000 5
2 1 2001 5
3 1 2002 5
4 1 2003 5
5 1 2004 5
To remove grouping variable:
df %>%
mutate(time = as.numeric(time)) %>%
group_by(id, consec = rep(seqle(time)$length, seqle(time)$length)) %>%
filter(consec >= 5) %>%
ungroup() %>%
select(-consec)
Result:
# A tibble: 5 x 2
id time
<chr> <dbl>
1 1 2000
2 1 2001
3 1 2002
4 1 2003
5 1 2004
Data:
df<-data.frame('id'=c('1','1','1','1','1','1','1'),
'time'=c('1998','2000','2001','2002','2003','2004','2007'),
stringsAsFactors = FALSE)
Try this on your data:
# convert character columns to numeric where possible
df[,] <- lapply(df, function(x) type.convert(as.character(x), as.is = TRUE))
# IND1: absolute gap to the next row; IND2: absolute gap to the previous row
IND1 <- abs(df$time - c(df$time[-1], df$time[length(df$time) - 1]))
IND2 <- abs(df$time - c(df$time[2], df$time[-length(df$time)]))
# keep rows adjacent (gap of exactly 1) to at least one neighbour
df <- df[IND1 %in% 1 | IND2 %in% 1, ]
# then keep only ids with at least 5 such rows
df[ave(df$time, df$id, FUN = length) >= 5, ]
A solution using dplyr, tidyr, and data.table:
library(dplyr)
library(tidyr)
library(data.table)
df2 <- df %>%
mutate(time = as.numeric(as.character(time))) %>%
arrange(id, time) %>%
right_join(tibble(time = full_seq(.$time, 1)), by = "time") %>%
mutate(RunID = rleid(id)) %>%
group_by(RunID) %>%
filter(n() >= 5, !is.na(id)) %>%
ungroup() %>%
select(-RunID)
df2
# A tibble: 5 x 2
id time
<fctr> <dbl>
1 1 2000
2 1 2001
3 1 2002
4 1 2003
5 1 2004
In R, I have the following data frame:
Id Year Age
1  2000 25
1  2001 NA
1  2002 NA
2  2000 NA
2  2001 30
2  2002 NA
Each Id has at least one row with age filled.
I would like to fill the missing "Age" values with the correct age for each ID.
Expected result:
Id Year Age
1  2000 25
1  2001 25
1  2002 25
2  2000 30
2  2001 30
2  2002 30
I've tried using 'fill':
df %>% fill(age)
But I'm not getting the expected results.
Is there a simple way to do this?
The comments were close; you just have to add the .direction argument:
df %>% group_by(Id) %>% fill(Age, .direction="downup")
# A tibble: 6 x 3
# Groups: Id [2]
Id Year Age
<int> <int> <int>
1 1 2000 25
2 1 2001 25
3 1 2002 25
4 2 2000 30
5 2 2001 30
6 2 2002 30
Assuming this is your data frame:
df<-data.frame(id=c(1,1,1,2,2,2),year=c(2000,2001,2002,2000,2001,2002),age=c(25,NA,NA,NA,30,NA))
With the zoo package, you can try
library(zoo)
df <- df[order(df$id, df$age), ]  # sort so NAs come last within each id
df$age <- na.locf(df$age)         # last observation carried forward
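A base R sketch of the same fill, without zoo or the tidyverse, is also possible under the stated assumption that each id has exactly one non-NA age:

```r
# Replace each id's age values with that id's single non-NA value
df <- data.frame(id   = c(1, 1, 1, 2, 2, 2),
                 year = c(2000, 2001, 2002, 2000, 2001, 2002),
                 age  = c(25, NA, NA, NA, 30, NA))
df$age <- ave(df$age, df$id,
              FUN = function(x) replace(x, is.na(x), x[!is.na(x)][1]))
df
```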
Please see the solution below using the tidyverse library.
library(tidyverse)
dt <- data.frame(Id = rep(1:2, each = 3),
Year = rep(2000:2002, times = 2),
Age = c(25, NA, NA, 30, NA, NA))
dt %>% group_by(Id) %>% arrange(Id,Age) %>% fill(Age)
In the code you provided, you didn't use group_by. It is also important to arrange by Id and Age, because fill only fills the column downward by default. For example, take this data frame and compare the results with and without arrange:
dt <- data.frame(Id = rep(1:2, each = 3),
Year = rep(2000:2002, times = 2),
Age = c(NA, 25, NA, NA, 30, NA))
dt %>% group_by(Id) %>% fill(Age) # only fills partially
dt %>% group_by(Id) %>% arrange(Id,Age) %>% fill(Age) # does the right job
I have a dataset which includes seller_ID, Product_ID, and the year the product was sold, and for each individual seller I am trying to find the year with the maximum number of products sold and the count of sales in that year. Here is an example of the data:
seller_ID <- c(1,1,1,2,2,3,4,4,4,4,4)
Product_ID <- c(1000,1000,1005,1004,1005,1003,1010,
1000,1001,1019,1017)
year <- c(2015,2016,2015,2020,2020,2000,2000,2001,2001,2001,2005)
data<- data.frame(seller_ID,Product_ID,year)
seller_ID Product_ID year
1 1 1000 2015
2 1 1000 2016
3 1 1005 2015
4 2 1004 2020
5 2 1005 2020
6 3 1003 2000
7 4 1010 2000
8 4 1000 2001
9 4 1001 2001
10 4 1019 2001
11 4 1017 2005
So the ideal result would be:
seller_ID Max_sold_num_year Max_year
1 1 2 2015
2 2 2 2020
3 3 1 2000
4 4 3 2001
I tried the approach below, and it worked:
df_temp<- data %>%
group_by(seller_ID, year) %>%
summarize(Sold_in_Year=length(Product_ID))
unique_seller=unique(data$seller_ID)
ID_list=c()
Max_list=c()
Max_Sold_Year=c()
j=1
for (ID in unique_seller) {
df_temp_2<- subset(df_temp, df_temp$seller_ID==ID)
Max_year<- subset(df_temp_2,df_temp_2$Sold_in_Year==max(df_temp_2$Sold_in_Year))
# take the first row; in case of ties this keeps the earliest max year
ID_list[j] <- Max_year[1, 1]
Max_Sold_Year[j] <- Max_year[1, 2]
Max_list[j] <- Max_year[1, 3]
j <- j + 1
}
#changing above list to dataframe
mm=length(ID_list)
df_test_list<- data.frame(seller_ID=numeric(mm), Max_sold_num_year=numeric(mm),Max_year=numeric(mm))
for (i in 1:mm){
df_test_list$seller_ID[i] <- ID_list[[i]]
df_test_list$Max_sold_num_year[i] <- Max_list[[i]]
df_test_list$Max_year[i] <- Max_Sold_Year[[i]]
}
However, due to subsetting each time and using a for loop, this approach is slow for a large dataset. Do you have any suggestions on how I can improve my code? Is there any other way to calculate the desired result without using a for loop?
Thanks
Try this
library(dplyr)
seller_ID <- c(1,1,1,2,2,3,4,4,4,4,4)
Product_ID <- c(1000,1000,1005,1004,1005,1003,1010,
1000,1001,1019,1017)
year <- c(2015,2016,2015,2020,2020,2000,2000,2001,2001,2001,2005)
data<- data.frame(seller_ID,Product_ID,year)
data %>%
dplyr::count(seller_ID, year) %>%
dplyr::group_by(seller_ID) %>%
dplyr::filter(n == max(n)) %>%
dplyr::rename(Max_sold_num_year = n, Max_year = year)
#> # A tibble: 4 x 3
#> # Groups: seller_ID [4]
#> seller_ID Max_year Max_sold_num_year
#> <dbl> <dbl> <int>
#> 1 1 2015 2
#> 2 2 2020 2
#> 3 3 2000 1
#> 4 4 2001 3
And thanks to the comment by @yung_febreze, this could be achieved even more concisely with
data %>%
dplyr::count(seller_ID, year) %>%
dplyr::group_by(seller_ID) %>%
dplyr::top_n(1)
EDIT: In case of duplicated maximum values, one can add dplyr::top_n(1, wt = year), which keeps the latest (maximum) year:
data %>%
dplyr::count(seller_ID, year) %>%
dplyr::group_by(seller_ID) %>%
dplyr::top_n(1, wt = n) %>%
dplyr::top_n(1, wt = year) %>%
dplyr::rename(Max_sold_num_year = n, Max_year = year)
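For reference, a base R sketch of the same computation is shown below; it keeps the first maximum (i.e. the earliest year) in case of ties:

```r
# count sales per seller and year, then pick the row with the largest
# count within each seller; which.max keeps the first maximum on ties
seller_ID  <- c(1, 1, 1, 2, 2, 3, 4, 4, 4, 4, 4)
Product_ID <- c(1000, 1000, 1005, 1004, 1005, 1003, 1010,
                1000, 1001, 1019, 1017)
year <- c(2015, 2016, 2015, 2020, 2020, 2000, 2000, 2001, 2001, 2001, 2005)
data <- data.frame(seller_ID, Product_ID, year)

tab <- aggregate(cbind(n = Product_ID) ~ seller_ID + year, data, length)
res <- do.call(rbind,
               lapply(split(tab, tab$seller_ID),
                      function(d) d[which.max(d$n), ]))
res
```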
I have this df
df <- read.table(text="
id month gas tickets
1 1 13 14
2 1 12 1
1 2 4 5
3 1 5 7
1 3 0 9
", header=TRUE)
What I'd like to do is calculate the sum of gas, tickets (and another 50+ columns in my real df) for each month. Usually I would do something like
result <-
df %>%
group_by(month) %>%
summarise(
gas = sum(gas),
tickets = sum(tickets)
) %>%
ungroup()
But since I have a lot of columns in my data frame, I don't want to repeat myself by writing a sum for each column. I'm wondering if it's possible to do something more elegant: a function or expression that sums every column except id, grouped by month.
You can use summarise_at() to ignore id and sum the rest:
df %>%
group_by(month) %>%
summarise_at(vars(-id), list(sum = ~sum))
# A tibble: 3 x 3
month gas_sum tickets_sum
<int> <int> <int>
1 1 30 22
2 2 4 5
3 3 0 9
You can use aggregate as markus recommends in the comments. If you want to stick to the tidyverse, you could try something like this:
df %>%
select(-id) %>%
group_by(month) %>%
summarise_if(is.numeric, sum)
#### OUTPUT ####
# A tibble: 3 x 3
month gas tickets
<fct> <int> <int>
1 1 30 22
2 2 4 5
3 3 0 9
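Note that in dplyr 1.0 and later, summarise_at() and summarise_if() are superseded by across(); a sketch of the equivalent call on the same data:

```r
library(dplyr)

df <- read.table(text = "
id month gas tickets
1  1 13 14
2  1 12  1
1  2  4  5
3  1  5  7
1  3  0  9
", header = TRUE)

# across(-id) selects every non-grouping column except id
df %>%
  group_by(month) %>%
  summarise(across(-id, sum), .groups = "drop")
```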
This question already has answers here:
Faster ways to calculate frequencies and cast from long to wide
(4 answers)
Tidyr how to spread into count of occurrence [duplicate]
(2 answers)
Closed 4 years ago.
I have the dataframe below:
year<-c("2000","2000","2001","2002","2000")
gender<-c("M","F","M","F","M")
YG<-data.frame(year,gender)
In this dataframe I want to count the number of "M" and "F" for every year and then create a new dataframe like :
year M F
1 2000 2 1
2 2001 1 0
3 2002 0 1
I tried something like:
library(dplyr)
ns<-YG %>%
group_by(year) %>%
count(YG$gender == "M")
A solution using reshape2:
library(reshape2)
dcast(YG, year ~ gender, fun.aggregate = length)
year F M
1 2000 1 2
2 2001 0 1
3 2002 1 0
Or a different tidyverse solution:
YG %>%
group_by(year) %>%
summarise(M = length(gender[gender == "M"]),
F = length(gender[gender == "F"]))
year M F
<fct> <int> <int>
1 2000 2 1
2 2001 1 0
3 2002 0 1
Or, as proposed by @zx8754:
YG %>%
group_by(year) %>%
summarise(M = sum(gender == "M"),
F = sum(gender == "F"))
We can use count and spread to get the desired format, using fill = 0 in spread to fill in the 0s:
library(tidyverse)
YG %>%
group_by(year) %>%
count(gender) %>%
spread(gender, n, fill = 0)
Output:
# A tibble: 3 x 3
# Groups: year [3]
year F M
<fct> <dbl> <dbl>
1 2000 1 2
2 2001 0 1
3 2002 1 0
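In current tidyr, spread() is superseded by pivot_wider(); assuming tidyr >= 1.1 (where values_fill accepts a scalar), the equivalent is:

```r
library(dplyr)
library(tidyr)

year   <- c("2000", "2000", "2001", "2002", "2000")
gender <- c("M", "F", "M", "F", "M")
YG <- data.frame(year, gender)

# count per year/gender, then widen with 0 for missing combinations
YG %>%
  count(year, gender) %>%
  pivot_wider(names_from = gender, values_from = n, values_fill = 0)
```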
I have a dataset and I want to summarize the number of non-missing observations (missing values are denoted by NA).
My data is similar to the following:
data <- read.table(header = TRUE,
stringsAsFactors = FALSE,
text="CompanyNumber ResponseVariable Year ExplanatoryVariable1 ExplanatoryVariable2
1 2.5 2000 1 2
1 4 2001 3 1
1 3 2002 NA 7
2 1 2000 3 NA
2 2.4 2001 0 4
2 6 2002 2 9
3 10 2000 NA 3")
I was planning to use the dplyr package, but that only takes the years into account and not the different variables:
library(dplyr)
data %>%
group_by(Year) %>%
summarise(number = n())
How can I obtain the following outcome?
2000 2001 2002
ExplanatoryVariable1 2 2 1
ExplanatoryVariable2 2 2 2
To get the counts, you can start by using:
library(dplyr)
data %>%
group_by(Year) %>%
summarise_at(vars(starts_with("Expla")), ~sum(!is.na(.)))
## A tibble: 3 x 3
# Year ExplanatoryVariable1 ExplanatoryVariable2
# <int> <int> <int>
#1 2000 2 2
#2 2001 2 2
#3 2002 1 2
If you want to reshape it as shown in your question, you can extend the pipe using tidyr functions:
library(tidyr)
data %>%
group_by(Year) %>%
summarise_at(vars(starts_with("Expla")), ~sum(!is.na(.))) %>%
gather(var, count, -Year) %>%
spread(Year, count)
## A tibble: 2 x 4
# var `2000` `2001` `2002`
#* <chr> <int> <int> <int>
#1 ExplanatoryVariable1 2 2 1
#2 ExplanatoryVariable2 2 2 2
Just to let OP know, since they have ~200 explanatory variables to select: summarise_at offers other ways to select the variables. You can simply name the first:last variable, if they are ordered correctly in the data, for example:
data %>%
group_by(Year) %>%
summarise_at(vars(ExplanatoryVariable1:ExplanatoryVariable2), ~sum(!is.na(.)))
Or:
data %>%
group_by(Year) %>%
summarise_at(3:4, ~sum(!is.na(.)))
Or store the variable names in a vector and use that:
vars <- names(data)[4:5]
data %>%
group_by(Year) %>%
summarise_at(vars, ~sum(!is.na(.)))
data %>%
gather(cat, val, -(1:3)) %>%
filter(complete.cases(.)) %>%
group_by(Year, cat) %>%
summarize(n = n()) %>%
spread(Year, n)
# # A tibble: 2 x 4
# cat `2000` `2001` `2002`
# * <chr> <int> <int> <int>
# 1 ExplanatoryVariable1 2 2 1
# 2 ExplanatoryVariable2 2 2 2
Should do it. You start by stacking the data, and then simply calculate the n for each combination of Year and explanatory variable. If you want the data back in wide format, use spread; either way, even without spread you get the counts for both variables.
Using base R:
do.call(cbind, by(data[3:5], data$Year, function(x) colSums(!is.na(x[-1]))))
2000 2001 2002
ExplanatoryVariable1 2 2 1
ExplanatoryVariable2 2 2 2
For aggregate:
aggregate(. ~ Year, data[3:5], function(x) sum(!is.na(x)), na.action = function(x) x)
You could do it with aggregate in base R.
aggregate(list(ExplanatoryVariable1 = data$ExplanatoryVariable1,
ExplanatoryVariable2 = data$ExplanatoryVariable2),
list(Year = data$Year),
function(x) length(x[!is.na(x)]))
# Year ExplanatoryVariable1 ExplanatoryVariable2
#1 2000 2 2
#2 2001 2 2
#3 2002 1 2
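As with the summarise_at answers above, note that current dplyr supersedes summarise_at with across(); a sketch of the equivalent NA counting:

```r
library(dplyr)

data <- read.table(header = TRUE, stringsAsFactors = FALSE, text =
"CompanyNumber ResponseVariable Year ExplanatoryVariable1 ExplanatoryVariable2
1 2.5 2000 1 2
1 4 2001 3 1
1 3 2002 NA 7
2 1 2000 3 NA
2 2.4 2001 0 4
2 6 2002 2 9
3 10 2000 NA 3")

# count non-NA values per year for every explanatory variable
data %>%
  group_by(Year) %>%
  summarise(across(starts_with("Expla"), ~ sum(!is.na(.x))))
```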