R data frame - fill missing values with condition on another column - r

In R, I have a the following data frame:
Id
Year
Age
1
2000
25
1
2001
NA
1
2002
NA
2
2000
NA
2
2001
30
2
2002
NA
Each Id has at least one row with age filled.
I would like to fill the missing "Age" values with the correct age for each ID.
Expected result:
Id
Year
Age
1
2000
25
1
2001
25
1
2002
25
2
2000
30
2
2001
30
2
2002
30
I've tried using 'fill':
df %>% fill(age)
But not getting the expected results.
Is there a simple way to do this?

The comments were close, you just have to add the .direction
df %>% group_by(Id) %>% fill(Age, .direction="downup")
# A tibble: 6 x 3
# Groups: Id [2]
Id Year Age
<int> <int> <int>
1 1 2000 25
2 1 2001 25
3 1 2002 25
4 2 2000 30
5 2 2001 30
6 2 2002 30

Assuming this is your dataframe
df<-data.frame(id=c(1,1,1,2,2,2),year=c(2000,2001,2002,2000,2001,2002),age=c(25,NA,NA,NA,30,NA))
With the zoo package, you can try
library(zoo)
df<-df[order(df$id,df$age),]
df$age<-na.locf(df$age)

Please see the solution below with the tidyverse library.
library(tidyverse)
dt <- data.frame(Id = rep(1:2, each = 3),
Year = rep(2000:2002, each = 2),
Age = c(25,NA,NA,30,NA,NA))
dt %>% group_by(Id) %>% arrange(Id,Age) %>% fill(Age)
In the code you provided, you didn't use group_by. It is also important to arrange by Id and Age, because the function fill only fills the column down. See for example that data frame, and compare the option with and without arrange:
dt <- data.frame(Id = rep(1:2, each = 3),
Year = rep(2000:2002, each = 2),
Age = c(NA, 25,NA,NA,30,NA))
dt %>% group_by(Id) %>% fill(Age) # only fills partially
dt %>% group_by(Id) %>% arrange(Id,Age) %>% fill(Age) # does the right job

Related

Select sample from a grouping variable depending on another grouping in R

I have the following data frame with 1,000 rows; 10 Cities, each having 100 rows and I would like to randomly select 10 names by Year in the city and the selected should 10 sample names should come from at least one of the years in the City i.e the 10 names for City 1 should not come from only 1996 for instance.
City Year name
1 1 1996 b
2 1 1996 c
3 1 1997 d
4 1 1997 e
...
101 2 1996 f
102 2 1996 g
103 2 1997 h
104 2 1997 i
Desired Final Sample Data
City Year name
1 1 1996 b
2 1 1998 c
3 1 2001 d
...
11 2 1997 g
12 2 1999 h
13 2 2005 b
...
21 3 1998 a
22 3 2010 c
23 3 2005 d
Sample Data
df1 <- data.frame(City = rep(1:10, each = 100),
Year = rep(1996:2015, each = 5),
name = rep(letters[1:25], 40))
I am failing to randomly select the 10 sample names by Year (without repeating years - unless when the number of Years in a city is less than 10) for all the 10 Cities, how can I go over this?
The Final sample should have 10 names of each city and years should not repeat unless when they are less than 10 in that city.
Thank you.
First group by City and use sample_n to sample a sub-dataframe.
Then group by City and Year, and sample from name one element per group. Don't forget to set the RNG seed in order to make the result reproducible.
library(dplyr)
set.seed(2020)
df1 %>%
group_by(City) %>%
sample_n(min(n(), 10)) %>%
ungroup() %>%
group_by(City, Year) %>%
summarise(name = sample(name, 1))
#`summarise()` regrouping output by 'City' (override with `.groups` argument)
## A tibble: 4 x 3
## Groups: City [2]
# City Year name
# <int> <int> <chr>
#1 1 1996 b
#2 1 1997 e
#3 2 1996 f
#4 2 1997 h
Data
df1 <- read.table(text = "
City Year name
1 1 1996 b
2 1 1996 c
3 1 1997 d
4 1 1997 e
101 2 1996 f
102 2 1996 g
103 2 1997 h
104 2 1997 i
", header = TRUE)
Edit
Instead of reinventing the wheel, use package sampling, function strata to get an index into the data set and then filter its corresponding rows.
library(dplyr)
library(sampling)
set.seed(2020)
df1 %>%
mutate(row = row_number()) %>%
filter(row %in% strata(df1, stratanames = c('City', 'Year'), size = rep(1, 1000), method = 'srswor')$ID_unit) %>%
select(-row) %>%
group_by(City) %>%
sample_n(10) %>%
arrange(City, Year)

Calculate change over time with tidy data in R - do you have to spread and gather?

Quick question about calculating a change over time for tidy data. Do I need to spread the data, mutate the variable and then gather the data again (see below), or is there a quicker way to do this keeping the data tidy.
Here is an example:
df <- data.frame(country = c(1, 1, 2, 2),
year = c(1999, 2000, 1999, 2000),
value = c(20, 30, 40, 50))
df
country year value
1 1 1999 20
2 1 2000 30
3 2 1999 40
4 2 2000 50
To calculate the change in value between 1999 and 2000 I would:
library(dplyr)
library(tidyr)
df2 <- df %>%
spread(year, value) %>%
mutate(change.99.00 = `2000` - `1999`) %>%
gather(year, value, c(`1999`, `2000`))
df2
country change.99.00 year value
1 1 10 1999 20
2 2 10 1999 40
3 1 10 2000 30
4 2 10 2000 50
This seems a laborious way to do this. I assume there should be a neat way to do this while keeping the data in narrow, tidy format, by grouping the data or something but I can't think of it and I can't find an answer online.
Is there an easier way to do this?
After grouping by 'country', get the diff of 'value' filtered with the logical expression year %in% 1999:2000
library(dplyr)
df %>%
group_by(country) %>%
mutate(change.99.00 = diff(value[year %in% 1999:2000]))
# A tibble: 4 x 4
# Groups: country [2]
# country year value change.99.00
# <dbl> <dbl> <dbl> <dbl>
#1 1 1999 20 10
#2 1 2000 30 10
#3 2 1999 40 10
#4 2 2000 50 10
NOTE: Here, we assume that the 'year' is not duplicated per 'country'

Expand data frame with intervening observations

I am trying to expand a data frame in R with missing observations that are not immediately obvious. Here is what I mean:
data.frame(id = c("a","b"),start = c(2002,2004), end = c(2005,2007))
Which is:
id start end
1 a 2002 2005
2 b 2004 2007
What I would like is a new data frame with 8 total observations, 4 each for "a" and "b", and a year that is one of the values between start and end (inclusive). So:
id year
a 2002
a 2003
a 2004
a 2005
b 2004
b 2005
b 2006
b 2007
As I understand, various versions of expand only work on unique values, but here my data frame doesn't have all the unique values (explicitly).
I was thinking to step through each row and then generate a data frame with sapply(), then join all the new data frames together. But this attempt fails:
sapply(test,function(x) { data.frame( id=rep(id,x[["end"]]-x[["start"]]), year = x[["start"]]:x[["end"]] )})
I know there must be some dplyr or other magic to solve this problem!
you could use tidyr and dplyr
library(tidyr)
library(dplyr)
df %>%
gather(key = key, value = year, -id) %>%
select(-key) %>%
group_by(id) %>%
complete(year = full_seq(year,1))
# A tibble: 8 x 2
# Groups: id [2]
id year
<fct> <dbl>
1 a 2002
2 a 2003
3 a 2004
4 a 2005
5 b 2004
6 b 2005
7 b 2006
8 b 2007
Using dplyr and tidyr, I make a new column which contains the list of years, then unnest the dataframe.
library(tidyr)
library(dplyr)
df <-
data.frame(
id = c("a", "b"),
start = c(2002, 2004),
end = c(2005, 2007)
)
df %>%
rowwise() %>%
mutate(year = list(seq(start, end))) %>%
select(-start, -end) %>%
unnest()
Output
# A tibble: 8 x 2
id year
<fct> <int>
1 a 2002
2 a 2003
3 a 2004
4 a 2005
5 b 2004
6 b 2005
7 b 2006
8 b 2007
An easy solution with data.table:
library(data.table)
# option 1
setDT(df)[, .(year = seq(start, end)), by = id]
# option 2
setDT(df)[, .(year = start:end), by = id]
which gives:
id year
1: a 2002
2: a 2003
3: a 2004
4: a 2005
5: b 2004
6: b 2005
7: b 2006
8: b 2007
An approach with base R:
lst <- Map(seq, df$start, df$end)
data.frame(id = rep(df$id, lengths(lst)), year = unlist(lst))

keep only consecutive observations

As said in the title, I have a data.frame like below,
df<-data.frame('id'=c('1','1','1','1','1','1','1'),'time'=c('1998','2000','2001','2002','2003','2004','2007'))
df
id time
1 1 1998
2 1 2000
3 1 2001
4 1 2002
5 1 2003
6 1 2004
7 1 2007
there are some others cases with shorter or longer time window than this,just for illustration's sake.
I want to do two things about this data set, first, find all those id that have at least five consecutive observations here, this can be done by following solutions here. Second, I want to keep only those observations in the at least five consecutive row of id selected by first step. The ideal result would be :
df
id time
1 1 2000
2 1 2001
3 1 2002
4 1 2003
5 1 2004
I could write a complex function using for loop and diff function, but this may be very time consuming both in writing the function and getting the result if I have a bigger data set with lots if id. But this is not seems like R and I do believe there should be a one or two line solution.
Anyone know how to achieve this? your time and knowledge would be deeply appreciated. Thanks in advance.
You can use dplyr to group by id and consecutive time, and filter groups with less than 5 entries, i.e.
#read data with stringsAsFactors = FALSE
df<-data.frame('id'=c('1','1','1','1','1','1','1'),
'time'=c('1998','2000','2001','2002','2003','2004','2007'),
stringsAsFactors = FALSE)
library(dplyr)
df %>%
mutate(time = as.integer(time)) %>%
group_by(id, grp = cumsum(c(1, diff(time) != 1))) %>%
filter(n() >= 5)
which gives
# A tibble: 5 x 3
# Groups: id, grp [1]
id time grp
<chr> <int> <dbl>
1 1 2000 2
2 1 2001 2
3 1 2002 2
4 1 2003 2
5 1 2004 2
Similar to #Sotos answer, this solution instead uses seqle (from cgwtools) as the grouping variable:
library(dplyr)
library(cgwtools)
df %>%
mutate(time = as.numeric(time)) %>%
group_by(id, consec = rep(seqle(time)$length, seqle(time)$length)) %>%
filter(consec >= 5)
Result:
# A tibble: 5 x 3
# Groups: id, consec [1]
id time consec
<chr> <dbl> <int>
1 1 2000 5
2 1 2001 5
3 1 2002 5
4 1 2003 5
5 1 2004 5
To remove grouping variable:
df %>%
mutate(time = as.numeric(time)) %>%
group_by(id, consec = rep(seqle(time)$length, seqle(time)$length)) %>%
filter(consec >= 5) %>%
ungroup() %>%
select(-consec)
Result:
# A tibble: 5 x 2
id time
<chr> <dbl>
1 1 2000
2 1 2001
3 1 2002
4 1 2003
5 1 2004
Data:
df<-data.frame('id'=c('1','1','1','1','1','1','1'),
'time'=c('1998','2000','2001','2002','2003','2004','2007'),
stringsAsFactors = FALSE)
Try that on your data:
df[,] <- lapply(df, function(x) type.convert(as.character(x), as.is = TRUE))
IND1 <- (df$time - c(df$time[-1],df$time[length(df$time)-1])) %>% abs(.)
IND2 <- (df$time - c(df$time[2],df$time[-(length(df$time))])) %>% abs(.)
df <- df[IND1 %in% 1 | IND2 %in% 1,]
df[ave(df$time, df$id, FUN = length) >= 5, ]
A solution from dplyr, tidyr, and data.table.
library(dplyr)
library(tidyr)
library(data.table)
df2 <- df %>%
mutate(time = as.numeric(as.character(time))) %>%
arrange(id, time) %>%
right_join(data_frame(time = full_seq(.$time, 1)), by = "time") %>%
mutate(RunID = rleid(id)) %>%
group_by(RunID) %>%
filter(n() >= 5, !is.na(id)) %>%
ungroup() %>%
select(-RunID)
df2
# A tibble: 5 x 2
id time
<fctr> <dbl>
1 1 2000
2 1 2001
3 1 2002
4 1 2003
5 1 2004

Count number of observations without N/A per year in R

I have a dataset and I want to summarize the number of observations without the missing values (denoted by NA).
My data is similar as the following:
data <- read.table(header = TRUE,
stringsAsFactors = FALSE,
text="CompanyNumber ResponseVariable Year ExplanatoryVariable1 ExplanatoryVariable2
1 2.5 2000 1 2
1 4 2001 3 1
1 3 2002 NA 7
2 1 2000 3 NA
2 2.4 2001 0 4
2 6 2002 2 9
3 10 2000 NA 3")
I was planning to use the package dplyr, but that does only take the years into account and not the different variables:
library(dplyr)
data %>%
group_by(Year) %>%
summarise(number = n())
How can I obtain the following outcome?
2000 2001 2002
ExplanatoryVariable1 2 2 1
ExplanatoryVariable2 2 2 2
To get the counts, you can start by using:
library(dplyr)
data %>%
group_by(Year) %>%
summarise_at(vars(starts_with("Expla")), ~sum(!is.na(.)))
## A tibble: 3 x 3
# Year ExplanatoryVariable1 ExplanatoryVariable2
# <int> <int> <int>
#1 2000 2 2
#2 2001 2 2
#3 2002 1 2
If you want to reshape it as shown in your question, you can extend the pipe using tidyr functions:
library(tidyr)
data %>%
group_by(Year) %>%
summarise_at(vars(starts_with("Expla")), ~sum(!is.na(.))) %>%
gather(var, count, -Year) %>%
spread(Year, count)
## A tibble: 2 x 4
# var `2000` `2001` `2002`
#* <chr> <int> <int> <int>
#1 ExplanatoryVariable1 2 2 1
#2 ExplanatoryVariable2 2 2 2
Just to let OP know, since they have ~200 explanatory variables to select. You can use another option of summarise_at to select the variables. You can simply name the first:last variable, if they are ordered correctly in the data, for example:
data %>%
group_by(Year) %>%
summarise_at(vars(ExplanatoryVariable1:ExplanatoryVariable2), ~sum(!is.na(.)))
Or:
data %>%
group_by(Year) %>%
summarise_at(3:4, ~sum(!is.na(.)))
Or store the variable names in a vector and use that:
vars <- names(data)[4:5]
data %>%
group_by(Year) %>%
summarise_at(vars, ~sum(!is.na(.)))
data %>%
gather(cat, val, -(1:3)) %>%
filter(complete.cases(.)) %>%
group_by(Year, cat) %>%
summarize(n = n()) %>%
spread(Year, n)
# # A tibble: 2 x 4
# cat `2000` `2001` `2002`
# * <chr> <int> <int> <int>
# 1 ExplanatoryVariable1 2 2 1
# 2 ExplanatoryVariable2 2 2 2
Should do it. You start by making the data stacked, and the simply calculating the n for both year and each explanatory variable. If you want the data back to wide format, then use spread, but either way without spread, you get the counts for both variables.
Using base R:
do.call(cbind,by(data[3:5], data$Year,function(x) colSums(!is.na(x[-1]))))
2000 2001 2002
ExplanatoryVariable1 2 2 1
ExplanatoryVariable2 2 2 2
For aggregate:
aggregate(.~Year,data[3:5],function(x) sum(!is.na(x)),na.action = function(x)x)
You could do it with aggregate in base R.
aggregate(list(ExplanatoryVariable1 = data$ExplanatoryVariable1,
ExplanatoryVariable2 = data$ExplanatoryVariable2),
list(Year = data$Year),
function(x) length(x[!is.na(x)]))
# Year ExplanatoryVariable1 ExplanatoryVariable2
#1 2000 2 2
#2 2001 2 2
#3 2002 1 2

Resources