Create a column based on a condition in groups of a dataframe - r

I have a data frame that looks like this:
I want to add a dummy column based on the id group and Acq: if Acq == 1, then the later years in that group should get a dummy value of 1.
Something like this:
I'm trying to do this in R. I tried a double for loop and dplyr, but both failed. Any help will be appreciated.

After grouping by 'id', we can use cummax and take the lag of it
library(dplyr)
df1 %>%
  group_by(id) %>%
  mutate(Post = lag(cummax(Acq), default = 0))
# A tibble: 7 x 4
# Groups: id [2]
# id Year Acq Post
# <int> <int> <dbl> <dbl>
#1 1 2008 0 0
#2 1 2009 0 0
#3 1 2010 0 0
#4 2 2008 0 0
#5 2 2009 1.00 0
#6 2 2010 0 1.00
#7 2 2011 0 1.00
data
df1 <- data.frame(id = rep(1:2, c(3, 4)), Year = c(2008:2010, 2008:2011),
                  Acq = c(0, 0, 0, 0, 1, 0, 0))
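For completeness, a base R sketch of the same lag-of-cummax idea (assuming the df1 above; not part of the original answer) could be:
# within each id, take the cumulative max of Acq and shift it down by one row
df1$Post <- ave(df1$Acq, df1$id, FUN = function(x) c(0, head(cummax(x), -1)))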

Related

Assigning values to a column based on values of other columns in the same dataframe in R

I have a dataframe with 3 columns and I want to assign values to a fourth column of this dataframe when a condition on the other columns is met in a row. In this example I want to assign 1 to df[,4] if the sum of the time columns is >= 2 for that row.
An example of what I want as the output is:
Any help is appreciated.
Thank you,
library(tidyverse)
data <-
  tribble(
    ~ID, ~time1, ~time2,
    'jkjkdf', 1, 1,
    'kjkj', 1, 0,
    'fgf', 1, 1,
    'jhkj', 0, 1,
    'hgd', 0, 0
  )
mutate(data, label = if_else(time1 + time2 >= 2, 1, 0))
#> # A tibble: 5 x 4
#> ID time1 time2 label
#> <chr> <dbl> <dbl> <dbl>
#> 1 jkjkdf 1 1 1
#> 2 kjkj 1 0 0
#> 3 fgf 1 1 1
#> 4 jhkj 0 1 0
#> 5 hgd 0 0 0
# or with n time columns
data %>%
  rowwise() %>%
  mutate(label = if_else(sum(across(starts_with('time'))) >= 2, 1, 0))
#> # A tibble: 5 x 4
#> # Rowwise:
#> ID time1 time2 label
#> <chr> <dbl> <dbl> <dbl>
#> 1 jkjkdf 1 1 1
#> 2 kjkj 1 0 0
#> 3 fgf 1 1 1
#> 4 jhkj 0 1 0
#> 5 hgd 0 0 0
Created on 2021-06-06 by the reprex package (v2.0.0)
Do you want to assign 1 if both time1 and time2 are 1?
If there are only two columns you can do -
df$label <- as.integer(df$time1 == 1 & df$time2 == 1)
If there are many such time columns we can use rowSums -
cols <- grep('time', names(df))
df$label <- as.integer(rowSums(df[cols] == 1) == length(cols))
df
# a time1 time2 label
#1 a 1 1 1
#2 b 1 0 0
#3 c 1 1 1
#4 d 0 1 0
#5 e 0 0 0
data
Images are not the right way to share data, provide them in a reproducible format.
df <- data.frame(a = letters[1:5],
                 time1 = c(1, 1, 1, 0, 0),
                 time2 = c(1, 0, 1, 1, 0))
We could do this in a vectorized way using tidyverse methods: select the columns whose names start with 'time', reduce them to a single vector by adding (+) the corresponding elements, then use the aliases from magrittr to convert the result to binary for creating the 'label' column. Finally, the result should be assigned (<-) to the original data if we want the original object to be changed.
library(dplyr)
library(purrr)
library(magrittr)
df %>%
  mutate(label = select(cur_data(), starts_with('time')) %>%
           reduce(`+`) %>%
           is_weakly_greater_than(2) %>%
           multiply_by(1))
a time1 time2 label
1 a 1 1 1
2 b 1 0 0
3 c 1 1 1
4 d 0 1 0
5 e 0 0 0
data
df <- structure(list(a = c("a", "b", "c", "d", "e"), time1 = c(1, 1,
1, 0, 0), time2 = c(1, 0, 1, 1, 0)), class = "data.frame", row.names = c(NA,
-5L))

Calculate sum of n previous rows

I have a quite big dataframe and I'm trying to add a new variable which is the sum of the three previous rows on a running basis; it should also be grouped by ID. The first three rows per ID should be 0. Here's what it should look like.
ID Var1 VarNew
1 2 0
1 2 0
1 3 0
1 0 7
1 4 5
1 1 7
Here's an example dataframe
ID <- c(1, 1, 1, 1, 1, 1)
Var1 <- c(2, 2, 3, 0, 4, 1)
df <- data.frame(ID, Var1)
You can use any package that has a rolling calculation function with a window size of 3 and lag the result, for example zoo::rollsumr.
library(dplyr)
df %>%
  group_by(ID) %>%
  mutate(VarNew = lag(zoo::rollsumr(Var1, 3, fill = 0), default = 0)) %>%
  ungroup()
# ID Var1 VarNew
# <dbl> <dbl> <dbl>
#1 1 2 0
#2 1 2 0
#3 1 3 0
#4 1 0 7
#5 1 4 5
#6 1 1 7
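The same pipeline should work with other rolling-sum functions too; for example, a sketch swapping in data.table::frollsum (not part of the original answer, and it assumes data.table is installed):
library(dplyr)
df %>%
  group_by(ID) %>%
  # right-aligned rolling sum of width 3, then lag by one row
  mutate(VarNew = lag(data.table::frollsum(Var1, 3, fill = 0), default = 0)) %>%
  ungroup()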
You can use stats::filter in ave (written as stats::filter so it is not masked by dplyr::filter).
df$VarNew <- ave(df$Var1, df$ID, FUN = function(x) c(0, 0, 0,
  stats::filter(head(x, -1), c(1, 1, 1), sides = 1)[-1:-2]))
df
# ID Var1 VarNew
#1 1 2 0
#2 1 2 0
#3 1 3 0
#4 1 0 7
#5 1 4 5
#6 1 1 7
or using cumsum in combination with head and tail.
df$VarNew <- ave(df$Var1, df$ID, FUN = function(x) {
  y <- c(0, cumsum(x))  # y[i] - y[i - 3] is the sum of the three rows ending at row i
  c(0, 0, 0, head(tail(y, -3) - head(y, -3), -1))  # drop the last value to lag by one row
})
The runner package also helps:
library(runner)
df %>% mutate(var_new = sum_run(Var1, k = 3, na_pad = TRUE, lag = 1))
ID Var1 var_new
1 1 2 NA
2 1 2 NA
3 1 3 NA
4 1 0 7
5 1 4 5
6 1 1 7
The NAs can easily be changed to 0 afterwards if desired.
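For example, a small sketch of that clean-up step (assuming dplyr and runner are loaded as above):
df %>%
  mutate(var_new = sum_run(Var1, k = 3, na_pad = TRUE, lag = 1),
         var_new = coalesce(var_new, 0))  # replace the leading NAs with 0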

Return summary table that sums data with iteration and control statement

None of these operations is particularly hard on its own, but I'm wondering how to combine them.
df <- tibble::tibble(index = seq(1:8),
                     amps = c(7, 6, 7, 0, 7, 6, 0, 6))
As long as there is a positive value for amps, I'd like to sum them up. If amps = 0, then that's a break in the sequence and I'd like to return the 0, then start over. I'd also like to return the corresponding index value. The result would look like this:
index amps
<dbl> <dbl>
1 1 20
2 4 0
3 5 13
4 7 0
5 8 6
I can do this in VBA but I'd like to beef up my R skills in functional programming. I would prefer to use functions rather than loops just because they're cleaner. Any help is appreciated.
Another base R solution using rle + tapply
u <- with(rle(df$amps == 0), rep(seq_along(lengths), lengths))
dfout <- data.frame(
  index = which(!duplicated(u)),
  amps = tapply(df$amps, u, sum)
)
which gives
> dfout
index amps
1 1 20
2 4 0
3 5 13
4 7 0
5 8 6
One dplyr option could be:
df %>%
  group_by(grp = with(rle(amps == 0), rep(seq_along(lengths), lengths))) %>%
  summarise(index = first(index),
            amps = sum(amps))
grp index amps
<int> <int> <dbl>
1 1 1 20
2 2 4 0
3 3 5 13
4 4 7 0
5 5 8 6
We can create a new group where amps = 0 or where previous value of amps is 0, get the first value of index and sum of amps for each group.
library(dplyr)
df %>%
  group_by(gr = cumsum(amps == 0 | lag(amps, default = first(amps)) == 0)) %>%
  summarise(index = first(index), amps = sum(amps)) %>%
  select(-gr)
# A tibble: 5 x 2
# index amps
# <int> <dbl>
#1 1 20
#2 4 0
#3 5 13
#4 7 0
#5 8 6
Using the same logic in data.table :
library(data.table)
setDT(df)[, .(index = first(index), amps = sum(amps)),
cumsum(amps == 0 | shift(amps, fill = first(amps)) == 0)]
In base R we could use aggregate based on the rle.
ll <- rle(df$amps != 0)$lengths
rr <- aggregate(amps ~ cbind(index=rep(index[!!c(amps[1]>0, diff(amps!=0))], ll)), df, sum)
rr
# index amps
# 1 1 20
# 2 4 0
# 3 5 13
# 4 7 0
# 5 8 6
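Since the question mentions wanting to practice functional programming, a purrr-flavoured sketch of the same run-length grouping idea (a sketch, not one of the original answers; it assumes purrr and tibble are installed):
grp <- with(rle(df$amps == 0), rep(seq_along(lengths), lengths))
# split the rows into runs, then summarise each run as a one-row tibble
purrr::map_dfr(split(df, grp),
               ~ tibble::tibble(index = .x$index[1], amps = sum(.x$amps)))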

How to identify specific rows in R (based on other values)

For my dataset I want a row for each year for each ID, and I want to determine whether they lived in an urban area or not (0/1). Because some IDs moved within a year and therefore have two rows for that year, I want to identify whether an ID has two rows for that specific year, which means they lived in both an urban and a non-urban area in that year (so I can manually determine in Excel where they belong).
I’ve already excluded the exact double rows (so they moved in a certain year, but the urbanisation didn’t change).
df <- df %>% distinct(ID, YEAR, URBAN, .keep_all = TRUE)
structure(t2A)
# A tibble: 3,177,783 x 4
ID ZIPCODE YEAR URBAN
<dbl> <chr> <chr> <dbl>
1 1 1234AB 2013 0
2 1 1234AB 2014 0
3 1 1234AB 2015 0
4 1 1234AB 2016 0
5 1 1234AB 2017 0
6 1 1234AB 2018 0
7 2 5678CD 2013 0
8 2 5678CD 2014 0
9 2 5678CD 2015 0
10 2 5678CD 2016 0
# ... with 3,177,773 more rows
structure(list(ID= c(1, 1, 1, 1
), YEAR = c("2013", "2014", "2015", "2016"), URBAN = c(0,
0, 0, 0)), row.names = c(NA, -4L), class = c("tbl_df", "tbl",
"data.frame"))
Can you guys help me with identifying IDs that have two rows for a specific year, i.e. both a 0 and a 1 in the same year?
Edit: the example doesn't show any IDs with urbanisation 1, but there are some, and not all IDs are included in all years :)
Below might be useful:
df <- df %>%
  dplyr::group_by(ID, YEAR) %>%
  dplyr::mutate(nIds = dplyr::n(),  # count the occurrences of each unique ID and year combination
                URBAN_Flag = sum(URBAN),  # urban flag for those who are from urban areas
                moved = dplyr::if_else(nIds > 1, 1, 0)) %>%
  dplyr::select(-c(nIds))
You can deselect the other columns if they are not needed.
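If the goal is just a list of the flagged ID/year combinations, a small follow-up sketch (assuming the df produced above) could be:
df %>%
  dplyr::filter(moved == 1) %>%   # keep only the years with two rows
  dplyr::distinct(ID, YEAR)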
First, we create some dummy data
library(tidyverse)
db <- tibble(
  id = c(1, 1, 1, 2, 2, 2),
  year = c(2000, 2000, 2001, 2001, 2002, 2003),
  urban = c(0, 1, 0, 0, 0, 0)
)
We see that person one moved in 2000.
id year urban
<dbl> <dbl> <dbl>
1 1 2000 0
2 1 2000 1
3 1 2001 0
4 2 2001 0
5 2 2002 0
6 2 2003 0
Now, we can group by id and year and count the number of rows. We can use the count value to create a dummy whether or not they moved in a given year.
db %>%
  group_by(id, year) %>%
  summarize(rows = n()) %>%
  mutate(
    moved = ifelse(rows == 2, 1, 0)
  )
Which gives the result:
id year rows moved
<dbl> <dbl> <int> <dbl>
1 1 2000 2 1
2 1 2001 1 0
3 2 2001 1 0
4 2 2002 1 0
5 2 2003 1 0
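If the flag is needed on the original rows rather than in a summary, a variation using dplyr::add_count (a sketch, not part of the original answer) could be:
db %>%
  add_count(id, year, name = "rows") %>%   # count rows per id/year without collapsing
  mutate(moved = ifelse(rows == 2, 1, 0))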

R: Tidyverse way of updating a value of a column at change points (What's wrong?)

I want to update the value of a column when its value changes. For example, in the following data, I would like to create the column grp based on the value column, which is a binary variable signifying a change point. I tried to do it by creating temp1, but the result is not what I want.
library(tidyverse)
as_tibble(c(1, 0, 0, 0, 1, 0, 1, 0)) %>%
  mutate(temp1 = 1,
         lag_temp1 = lag(temp1, 1, default = 1),
         temp1 = ifelse(row_number() == 1, 1, value + lag_temp1)) %>%
  mutate(grp = c(1, 1, 1, 1, 2, 2, 3, 3)) %>%
  print
# A tibble: 8 x 4
value temp1 lag_temp1 grp
<dbl> <dbl> <dbl> <dbl>
1 1 1 1 1
2 0 1 1 1
3 0 1 1 1
4 0 1 1 1
5 1 2 1 2
6 0 1 1 2
7 1 2 1 3
8 0 1 1 3
Update
Apart from getting grp correctly, I would also like to know why my solution did not work. I have used similar logic in other places in my data analysis, so it would be very beneficial for me to know where the mistake is. Apart from the inbuilt cumsum, I may have to use other functions at times.
To get the grp variable right we can use cumsum
library(tidyverse)
as_tibble(c(1, 0, 0, 0, 1, 0, 1, 0)) %>%
  mutate(grp = cumsum(value))
# A tibble: 8 x 2
# value grp
# <dbl> <dbl>
#1 1 1
#2 0 1
#3 0 1
#4 0 1
#5 1 2
#6 0 2
#7 1 3
#8 0 3
In your solution there is no difference between temp1 and lag_temp1 in the first place: mutate() evaluates its expressions sequentially, so lag(temp1) sees the constant column of 1s that was just created, not the temp1 you redefine afterwards.
as_tibble(c(1, 0, 0, 0, 1, 0, 1, 0)) %>%
  mutate(temp1 = 1,
         lag_temp1 = lag(temp1, 1, default = 1))
So in the end temp1 is simply c(1, value[-1] + 1).
It is not entirely clear to me what is meant by "Apart from inbuilt cumsum I may have to use other functions at times." - because this depends on the specific case. For the above example cumsum does the job.
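As one illustration of a more general tool than cumsum, the group counter could also be built iteratively with purrr::accumulate (a sketch on the same value column; not part of the original answer):
library(tidyverse)
as_tibble(c(1, 0, 0, 0, 1, 0, 1, 0)) %>%
  # carry the previous counter forward and add 1 at every change point (value == 1)
  mutate(grp = accumulate(value, ~ if (.y == 1) .x + 1 else .x))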
