For my dataset I want one row per year per ID, and I want to determine whether each ID lived in an urban area or not (0/1). Because some IDs moved within a year and therefore have two rows for that year, I want to identify the IDs that have two rows for a specific year, which means they lived in both an urban and a non-urban area in that year (so I can manually determine in Excel where they belong).
I’ve already excluded the exact duplicate rows (where the ID moved in a certain year, but the urbanisation didn’t change).
df <- df %>% distinct(ID, YEAR, URBAN, .keep_all = TRUE)
structure(t2A)
# A tibble: 3,177,783 x 4
ID ZIPCODE YEAR URBAN
<dbl> <chr> <chr> <dbl>
1 1 1234AB 2013 0
2 1 1234AB 2014 0
3 1 1234AB 2015 0
4 1 1234AB 2016 0
5 1 1234AB 2017 0
6 1 1234AB 2018 0
7 2 5678CD 2013 0
8 2 5678CD 2014 0
9 2 5678CD 2015 0
10 2 5678CD 2016 0
# ... with 3,177,773 more rows
structure(list(ID= c(1, 1, 1, 1
), YEAR = c("2013", "2014", "2015", "2016"), URBAN = c(0,
0, 0, 0)), row.names = c(NA, -4L), class = c("tbl_df", "tbl",
"data.frame"))
Can you help me identify the IDs that have two rows for a specific year, i.e. both a 0 and a 1 in the same year?
Edit: the example doesn't show any IDs with urbanisation 1, but there are some, and not all IDs are present in every year :)
The code below might be useful:
df <- df %>%
  dplyr::group_by(ID, YEAR) %>%
  dplyr::mutate(nIds = dplyr::n(),                      # number of rows for this ID and year combination
                URBAN_Flag = sum(URBAN),                # number of urban (URBAN == 1) rows for this ID and year
                moved = dplyr::if_else(nIds > 1, 1, 0)) %>%  # 1 if the ID has more than one row in this year
  dplyr::select(-c(nIds))
You can drop the helper columns if they are not needed.
First, we create some dummy data
library(tidyverse)
db <- tibble(
id = c(1, 1, 1, 2, 2, 2),
year = c(2000, 2000, 2001, 2001, 2002, 2003),
urban = c(0, 1, 0, 0, 0, 0)
)
We see that person one moved in 2000.
id year urban
<dbl> <dbl> <dbl>
1 1 2000 0
2 1 2000 1
3 1 2001 0
4 2 2001 0
5 2 2002 0
6 2 2003 0
Now, we can group by id and year and count the number of rows. We can use the count value to create a dummy whether or not they moved in a given year.
db %>%
group_by(id, year) %>%
summarize(rows = n()) %>%
mutate(
moved = ifelse(rows == 2, 1, 0)
)
Which gives the result:
id year rows moved
<dbl> <dbl> <int> <dbl>
1 1 2000 2 1
2 1 2001 1 0
3 2 2001 1 0
4 2 2002 1 0
5 2 2003 1 0
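Counting rows flags every id and year that appears more than once. If the goal is specifically to find years with both a 0 and a 1, one could instead check whether urban takes more than one distinct value within each id and year. A minimal base R sketch on the dummy data above (the column name mixed and the objects key and mixed_pairs are made up for illustration):

```r
# Same dummy data as above, as a plain data frame
db <- data.frame(
  id    = c(1, 1, 1, 2, 2, 2),
  year  = c(2000, 2000, 2001, 2001, 2002, 2003),
  urban = c(0, 1, 0, 0, 0, 0)
)

# One key per id-year combination (hypothetical helper)
key <- paste(db$id, db$year, sep = "_")

# ave() applies the function within each key and returns one value per row:
# 1 if the group contains more than one distinct urban value, else 0
db$mixed <- ave(db$urban, key,
                FUN = function(u) as.numeric(length(unique(u)) > 1))

# The id-year pairs that need manual review
mixed_pairs <- unique(db[db$mixed == 1, c("id", "year")])
mixed_pairs
```

On the real data the same idea would use ID, YEAR and URBAN in place of id, year and urban.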
I have a dataframe with more than 2 000 000 records. Here is sample data:
year <- c(2002, 2002, 2001, 2001, 2000)
type <- c("red", "red", "blue", "blue", "blue")
mydata <- data.frame(year, type)
I need to extract the type per year, which would look something like this:
2002:
“red”: 2, “blue”: 0
2001:
“red”: 0, “blue”: 2
2000:
“red”: 0, “blue”: 1
I am able to extract it separately using table():
table(mydata$year)
table(mydata$type)
However, I cannot figure out how to do it in one table.
Try aggregate like below
aggregate(type ~ ., mydata, function(x) table(factor(x, levels = unique(mydata$type))))
which gives
year type.red type.blue
1 2000 0 1
2 2001 0 2
3 2002 2 0
Another base R option using xtabs
xtabs(~ year + type, mydata)
gives
type
year blue red
2000 1 0
2001 2 0
2002 0 2
Here's another approach
library(dplyr)
library(tidyr)  # pivot_wider() comes from tidyr, not dplyr
data.frame(table(mydata)) %>%
  pivot_wider(names_from = type, values_from = Freq)
# A tibble: 3 x 3
year blue red
<fct> <int> <int>
1 2000 1 0
2 2001 2 0
3 2002 0 2
We could also use table
table(mydata)
Let me illustrate my question with an example:
Sample data:
df<-data.frame(BirthYear = c(1995, 1996, 1997, 1998, 1999, 2000, 2001, 2002, 2003, 2004, 2005), Number= c(1,1,1,1,1,1,1,1,1,1,1), Group = c("g", "g", "g", "g", "g", "g","t","t","t","t","t"))
df
BirthYear Number Group
1 1995 1 g
2 1996 1 g
3 1997 1 g
4 1998 1 g
5 1999 1 g
6 2000 1 g
7 2001 1 t
8 2002 1 t
9 2003 1 t
10 2004 1 t
11 2005 1 t
and
df1<- structure(list(Year = c(2015, 2016, 2017, 2018, 2019, 2020)), class = "data.frame", row.names = c(NA,
-6L))
df1
Year
1 2015
2 2016
3 2017
4 2018
5 2019
6 2020
Now I want to add new columns to df1: g1, g2, t1 and t2.
g1 and t1 respectively represent the sum of df$Number for all instances of a group (g or t in df) where df1$Year - df$BirthYear is greater than 18 and lower than 21, so basically if someone is in the age between 19 & 20.
g2 and t2 represent the sum of df$Number for all instances of a group where the difference in years is lower than 19.
I want to end up with the following:
df1
Year g1 g2 t1 t2
1 2015 2 4 0 5
2 2016 2 3 0 5
3 2017 2 2 0 5
4 2018 2 1 0 5
5 2019 2 0 0 5
6 2020 1 0 1 4
I know I could make a for-loop over df1 to create the new columns but I don't know how to specify the condition to get the correct group sums for each year.
I hope this example makes clear what I'm trying to achieve.
I'd be very grateful for any help because I'm really stuck at this point.
If what you want to do is just to calculate year differences across 2015:2020 and BirthYear, then you don't have to create a separate dataframe. Perhaps just
library(tidyr)
library(dplyr)
df %>%
expand(Year = 2015:2020, nesting(BirthYear, Number, Group)) %>%
group_by(Year, Group) %>%
summarise(
`1` = sum(between(Year - BirthYear, 19, 20) * Number),
`2` = sum((Year - BirthYear < 19) * Number)
) %>%
pivot_wider(names_from = "Group", values_from = c("1", "2"), names_glue = "{Group}{.value}")
Output
`summarise()` regrouping output by 'Year' (override with `.groups` argument)
# A tibble: 6 x 5
# Groups: Year [6]
Year g1 t1 g2 t2
<int> <dbl> <dbl> <dbl> <dbl>
1 2015 2 0 4 5
2 2016 2 0 3 5
3 2017 2 0 2 5
4 2018 2 0 1 5
5 2019 2 0 0 5
6 2020 1 1 0 4
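For comparison, the same sums can be computed in base R by looping over the groups and years, which also keeps the column order asked for in the question (g1, g2, t1, t2). This is only a sketch; the helper names sum_by_age, is_19_or_20 and is_under_19 are invented here.

```r
df <- data.frame(BirthYear = 1995:2005,
                 Number    = rep(1, 11),
                 Group     = rep(c("g", "t"), c(6, 5)))
df1 <- data.frame(Year = 2015:2020)

# Sum Number for one group in one year, keeping only the ages accepted by keep()
sum_by_age <- function(year, group, keep) {
  age <- year - df$BirthYear
  sum(df$Number[df$Group == group & keep(age)])
}

# Age conditions from the question (hypothetical helper names)
is_19_or_20 <- function(age) age > 18 & age < 21
is_under_19 <- function(age) age < 19

for (grp in c("g", "t")) {
  df1[[paste0(grp, 1)]] <- sapply(df1$Year, sum_by_age, group = grp, keep = is_19_or_20)
  df1[[paste0(grp, 2)]] <- sapply(df1$Year, sum_by_age, group = grp, keep = is_under_19)
}
df1
```

The loop is easy to read but recomputes the ages for every year and group, so for large inputs the expand() approach above is likely the better fit.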
I'm challenged with this problem. I have these types of data:
df <- data.frame(
ID = c(1,1,1,1,1,1,2,2,2,2,2,3,3,3,3),
Pr = c(0, 1, 0, 999, -1, 1, 999, 1, 0, 0, 1, 0, 1, 0, 0),
Yrs = c(2010,2011,2012,2013,2014,2015, 2010, 2011, 2012, 2013, 2014, 2012, 2013, 2014, 2015)
)
ID Pr Yrs
1 0 2010
1 1 2011
1 0 2012
1 999 2013
1 -1 2014
1 1 2015
2 999 2010
2 1 2011
2 0 2012
2 0 2013
2 1 2014
3 0 2012
3 1 2013
3 0 2014
3 0 2015
I would like to get:
a) the number of (unique) IDs having "1" just once;
b) the distance (in years) between the first occurrence of "1" and the following occurrence of "1", per group (ID).
Thank you for your help.
Here's one way to get at the problem:
library(tidyverse)
df %>% group_by(ID) %>% filter(sum(Pr==1)==1)
# A tibble: 4 x 3
# Groups: ID [1]
# ID Pr Yrs
# <dbl> <dbl> <dbl>
#1 3 0 2012
#2 3 1 2013
#3 3 0 2014
#4 3 0 2015
df %>%
group_by(ID) %>%
filter(Pr==1) %>%
filter(n()>1) %>%
summarise(dist=diff(Yrs))
# A tibble: 2 x 2
# ID dist
# <dbl> <dbl>
#1 1 4
#2 2 3
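Note that the first snippet returns the rows of the matching IDs rather than a count, and that diff(Yrs) returns more than one value for an ID with three or more 1s. A base R sketch that yields a single count for a) and only the gap between the first and second 1 for b), under the assumption that rows are ordered by year within each ID (n_ones and gap are illustrative names):

```r
df <- data.frame(
  ID  = c(1,1,1,1,1,1,2,2,2,2,2,3,3,3,3),
  Pr  = c(0, 1, 0, 999, -1, 1, 999, 1, 0, 0, 1, 0, 1, 0, 0),
  Yrs = c(2010,2011,2012,2013,2014,2015, 2010, 2011, 2012, 2013, 2014, 2012, 2013, 2014, 2015)
)

# a) number of 1s per ID, then count the IDs with exactly one
n_ones <- tapply(df$Pr == 1, df$ID, sum)
sum(n_ones == 1)

# b) years between the first and second 1 per ID (NA if there is no second 1)
gap <- sapply(split(df, df$ID), function(d) {
  yrs <- d$Yrs[d$Pr == 1]
  if (length(yrs) >= 2) yrs[2] - yrs[1] else NA_real_
})
gap
```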
With a summary data frame as
library(data.table)
setDT(df)
df_summ <-
df[, {one <- which(Pr == 1);
.(num_ones = length(one), gap = diff(Yrs[one[1:2]]))}
, by = ID]
We can see
a)the number of (unique)IDs having "1" just once;
df_summ[, sum(num_ones == 1)]
# [1] 1
b)The distance (years) between the first occurrence of "1" and the
following occurrence of "1", per group(ID)
See gap column
df_summ
# ID num_ones gap
# 1: 1 2 4
# 2: 2 2 3
# 3: 3 1 NA
Ciao, here is my reproducible example.
a=c(1,2,3,4,5,6)
a1=c(15,17,17,16,14,15)
a2=c(0,0,1,1,1,0)
b=c(1,0,NA,NA,0,NA)
c=c(2010,2010,2010,2010,2010,2010)
d=c(1,1,0,1,0,NA)
e=c(2012,2012,2012,2012,2012,2012)
f=c(1,0,0,0,0,NA)
g=c(2014,2014,2014,2014,2014,2014)
h=c(1,1,0,1,0,NA)
i=c(2010,2012,2014,2012,2014,2014)
mydata = data.frame(a,a1,a2,b,c,d,e,f,g,h,i)
names(mydata) = c("id","age","gender","drop1","year1","drop2","year2","drop3","year3","drop4","year4")
mydata2 <- reshape(mydata, direction = "long", varying = list(c("year1","year2","year3","year4"), c("drop1","drop2","drop3","drop4")),v.names = c("year", "drop"), idvar = "X", timevar = "Year", times = c(1:4))
library(dplyr)
x1 = mydata2 %>%
group_by(id) %>%
slice(which(drop==1)[1])
x2 = mydata2 %>%
group_by(id) %>%
slice(which(drop==0)[1])
I have a data set "mydata2" in long format, so every ID has many rows.
I want to make a new data set "x" in which every ID has one row, based on whether they dropped or not.
I want to take the year of the first of drop1, drop2, drop3, drop4 that equals 1 and put it in a variable dropYEAR. If none of drop1, drop2, drop3, drop4 equals 1, I want to put the last data point of year1, year2, year3, year4 in dropYEAR.
Ultimately every ID should have one row with two new columns: didDROP equals 1 if the ID ever dropped and 0 otherwise; dropYEAR equals the year of the drop if didDROP equals 1, or the last reported year among year1, year2, year3, year4 if the ID never dropped. I tried to do this in dplyr, but it gives only part of what I want because it gets rid of the IDs whose drop values equal 0.
This is desired output, thank you to #Wimpel
First run mydata2 %>% arrange(id) to understand the dataset. Then, using dplyr's first and last, we can pull the first year where drop == 1 or, if drop is never 1, the last year where drop is not NA. case_when is used to compute didDROP because it handles NAs gracefully.
library(dplyr)
mydata2 %>% group_by(id) %>%
mutate(dropY=first(year[!is.na(drop) & drop==1]),
dropYEAR=if_else(is.na(dropY), last(year[!is.na(drop)]),dropY)) %>%
slice(1)
#Update
mydata2 %>% group_by(id) %>%
mutate(dropY=first(year[!is.na(drop) & drop==1]),
dropYEAR=if_else(is.na(dropY), last(year),dropY),
didDROP=case_when(any(drop==1) ~ 1, #Return 1 if there is any drop=1 o.w it will return 0
TRUE ~ 0)) %>%
select(-dropY) %>% slice(1)
# A tibble: 6 x 9
# Groups: id [6]
id age gender Year year drop X dropYEAR didDROP
<dbl> <dbl> <dbl> <int> <dbl> <dbl> <int> <dbl> <dbl>
1 1 15 0 1 2010 1 1 2010 1
2 2 17 0 1 2010 0 2 2012 1
3 3 17 1 1 2010 NA 3 2014 0
4 4 16 1 1 2010 NA 4 2012 1
5 5 14 1 1 2010 0 5 2014 0
6 6 15 0 1 2010 NA 6 2014 0
I hope this is what you're looking for.
You can sort by id, drop and year, conditionally on dropping or not:
library(dplyr)
mydata2 %>%
mutate(drop=ifelse(is.na(drop),0,drop)) %>%
arrange(id,-drop,year*(2*drop-1)) %>%
group_by(id) %>%
slice(1) %>%
select(id,age,gender,didDROP=drop,dropYEAR=year)
# A tibble: 6 x 5
# Groups: id [6]
id age gender didDROP dropYEAR
<dbl> <dbl> <dbl> <dbl> <dbl>
1 1 15 0 1 2010
2 2 17 0 1 2012
3 3 17 1 0 2014
4 4 16 1 1 2012
5 5 14 1 0 2014
6 6 15 0 0 2014
I have a data frame that looks like this:
I want to add a dummy column based on the id group and acq: if acq == 1, then the later years in that group should get a dummy value of 1.
Something like this:
I'm trying to do this in R. I tried a double for loop and dplyr, but both failed. Any help will be appreciated.
After grouping by 'id', we can use cummax and take the lag of it
library(dplyr)
df1 %>%
group_by(id) %>%
mutate(Post = lag(cummax(Acq), default = 0))
# A tibble: 7 x 4
# Groups: id [2]
# id Year Acq Post
# <int> <int> <dbl> <dbl>
#1 1 2008 0 0
#2 1 2009 0 0
#3 1 2010 0 0
#4 2 2008 0 0
#5 2 2009 1.00 0
#6 2 2010 0 1.00
#7 2 2011 0 1.00
data
df1 <- data.frame(id = rep(1:2, c(3, 4)), Year = c(2008:2010, 2008:2011),
Acq = c(0, 0, 0, 0, 1, 0, 0))
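To see why this works: cummax(Acq) switches to 1 in the acquisition year and stays 1 afterwards, and shifting it down by one row makes the flag start in the following year. A base R sketch of the same idea, assuming rows are sorted by Year within each id (Post_base is an illustrative name):

```r
df1 <- data.frame(id = rep(1:2, c(3, 4)), Year = c(2008:2010, 2008:2011),
                  Acq = c(0, 0, 0, 0, 1, 0, 0))

# Within each id: running maximum of Acq, shifted down one row and padded with 0
df1$Post_base <- ave(df1$Acq, df1$id,
                     FUN = function(a) c(0, head(cummax(a), -1)))
df1
```

If the data are not already ordered, sorting with df1[order(df1$id, df1$Year), ] first would be needed for either version.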