I have the following sample from a large dataset:
isin directorid dob_Year2 ROLE_START ROLE_END gender datestartrole dateendrole
US6819771048 340769 1970 1995 2003 M 1995-02-01 2003-01-09
US6819771048 340769 1970 2003 2004 M 2003-01-09 2004-02-24
US6819771048 340769 1970 2004 2007 M 2004-02-24 2007-09-07
US68243Q1067 86917 1951 1986 2016 M 1986-01-01 2016-06-30
US68243Q1067 86917 1951 1976 1986 M 1976-04-01 1986-01-01
US68243Q1067 327069 1961 2016 2020 M 2016-06-30 2020-05-21
My QUESTION is -
I want to create new rows based on the variables ROLE_START and ROLE_END. The number of rows to be created depends on the minimum ROLE_START and the maximum ROLE_END, assuming that the data is grouped by isin and directorid. For example, for isin US6819771048 and directorid 340769, the minimum ROLE_START year is 1995 and the maximum ROLE_END year is 2007, so I need one row for each year from 1995 to 2007, with the year stored in a YEAR variable. Note that in this example there is no break between 1995 and 2007, because ROLE_START and ROLE_END together cover every year. If there is a break between any years, the missing years should be excluded. For the sample dataset above, my expected output should look like this:
isin directorid dob_Year2 ROLE_START ROLE_END gender datestartrole dateendrole YEAR
US6819771048 340769 1970 1995 2003 M 1995-02-01 2003-01-09 1995
US6819771048 340769 1970 1995 2003 M 1995-02-01 2003-01-09 1996
US6819771048 340769 1970 1995 2003 M 1995-02-01 2003-01-09 1997
US6819771048 340769 1970 1995 2003 M 1995-02-01 2003-01-09 1998
US6819771048 340769 1970 1995 2003 M 1995-02-01 2003-01-09 1999
US6819771048 340769 1970 1995 2003 M 1995-02-01 2003-01-09 2000
US6819771048 340769 1970 1995 2003 M 1995-02-01 2003-01-09 2001
US6819771048 340769 1970 1995 2003 M 1995-02-01 2003-01-09 2002
US6819771048 340769 1970 1995 2003 M 1995-02-01 2003-01-09 2003
US6819771048 340769 1970 2003 2004 M 2003-01-09 2004-02-24 2003
US6819771048 340769 1970 2003 2004 M 2003-01-09 2004-02-24 2004
US6819771048 340769 1970 2004 2007 M 2004-02-24 2007-09-07 2004
US6819771048 340769 1970 2004 2007 M 2004-02-24 2007-09-07 2005
US6819771048 340769 1970 2004 2007 M 2004-02-24 2007-09-07 2006
US6819771048 340769 1970 2004 2007 M 2004-02-24 2007-09-07 2007
US68243Q1067 86917 1951 1986 2016 M 1986-01-01 2016-06-30 1986
US68243Q1067 86917 1951 1986 2016 M 1986-01-01 2016-06-30 1987
US68243Q1067 86917 1951 1986 2016 M 1986-01-01 2016-06-30 1988
US68243Q1067 86917 1951 1986 2016 M 1986-01-01 2016-06-30 1989
US68243Q1067 86917 1951 1986 2016 M 1986-01-01 2016-06-30 1990
US68243Q1067 86917 1951 1986 2016 M 1986-01-01 2016-06-30 1991
US68243Q1067 86917 1951 1986 2016 M 1986-01-01 2016-06-30 1992
US68243Q1067 86917 1951 1986 2016 M 1986-01-01 2016-06-30 1993
US68243Q1067 86917 1951 1986 2016 M 1986-01-01 2016-06-30 1994
US68243Q1067 86917 1951 1986 2016 M 1986-01-01 2016-06-30 1995
US68243Q1067 86917 1951 1986 2016 M 1986-01-01 2016-06-30 1996
US68243Q1067 86917 1951 1986 2016 M 1986-01-01 2016-06-30 1997
US68243Q1067 86917 1951 1986 2016 M 1986-01-01 2016-06-30 1998
US68243Q1067 86917 1951 1986 2016 M 1986-01-01 2016-06-30 1999
US68243Q1067 86917 1951 1986 2016 M 1986-01-01 2016-06-30 2000
US68243Q1067 86917 1951 1986 2016 M 1986-01-01 2016-06-30 2001
US68243Q1067 86917 1951 1986 2016 M 1986-01-01 2016-06-30 2002
US68243Q1067 86917 1951 1986 2016 M 1986-01-01 2016-06-30 2003
US68243Q1067 86917 1951 1986 2016 M 1986-01-01 2016-06-30 2004
US68243Q1067 86917 1951 1986 2016 M 1986-01-01 2016-06-30 2005
US68243Q1067 86917 1951 1986 2016 M 1986-01-01 2016-06-30 2006
US68243Q1067 86917 1951 1986 2016 M 1986-01-01 2016-06-30 2007
US68243Q1067 86917 1951 1986 2016 M 1986-01-01 2016-06-30 2008
US68243Q1067 86917 1951 1986 2016 M 1986-01-01 2016-06-30 2009
US68243Q1067 86917 1951 1986 2016 M 1986-01-01 2016-06-30 2010
US68243Q1067 86917 1951 1986 2016 M 1986-01-01 2016-06-30 2011
US68243Q1067 86917 1951 1986 2016 M 1986-01-01 2016-06-30 2012
US68243Q1067 86917 1951 1986 2016 M 1986-01-01 2016-06-30 2013
US68243Q1067 86917 1951 1986 2016 M 1986-01-01 2016-06-30 2014
US68243Q1067 86917 1951 1986 2016 M 1986-01-01 2016-06-30 2015
US68243Q1067 86917 1951 1986 2016 M 1986-01-01 2016-06-30 2016
US68243Q1067 86917 1951 1976 1986 M 1976-04-01 1986-01-01 1976
US68243Q1067 86917 1951 1976 1986 M 1976-04-01 1986-01-01 1977
US68243Q1067 86917 1951 1976 1986 M 1976-04-01 1986-01-01 1978
US68243Q1067 86917 1951 1976 1986 M 1976-04-01 1986-01-01 1979
US68243Q1067 86917 1951 1976 1986 M 1976-04-01 1986-01-01 1980
US68243Q1067 86917 1951 1976 1986 M 1976-04-01 1986-01-01 1981
US68243Q1067 86917 1951 1976 1986 M 1976-04-01 1986-01-01 1982
US68243Q1067 86917 1951 1976 1986 M 1976-04-01 1986-01-01 1983
US68243Q1067 86917 1951 1976 1986 M 1976-04-01 1986-01-01 1984
US68243Q1067 86917 1951 1976 1986 M 1976-04-01 1986-01-01 1985
US68243Q1067 86917 1951 1976 1986 M 1976-04-01 1986-01-01 1986
US68243Q1067 327069 1961 2016 2020 M 2016-06-30 2020-05-21 2016
US68243Q1067 327069 1961 2016 2020 M 2016-06-30 2020-05-21 2017
US68243Q1067 327069 1961 2016 2020 M 2016-06-30 2020-05-21 2018
US68243Q1067 327069 1961 2016 2020 M 2016-06-30 2020-05-21 2019
US68243Q1067 327069 1961 2016 2020 M 2016-06-30 2020-05-21 2020
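You can create a sequence between ROLE_START and ROLE_END for each row and unnest it so that every year gets its own row. Because the sequence is built row by row, years that are not covered by any role are never generated.
library(dplyr)

df %>%
  mutate(YEAR = purrr::map2(ROLE_START, ROLE_END, seq)) %>%
  tidyr::unnest(YEAR)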
# isin directorid dob_Year2 ROLE_START ROLE_END gender datestartrole dateendrole YEAR
# <chr> <int> <int> <int> <int> <chr> <chr> <chr> <int>
# 1 US6819771048 340769 1970 1995 2003 M 1995-02-01 2003-01-09 1995
# 2 US6819771048 340769 1970 1995 2003 M 1995-02-01 2003-01-09 1996
# 3 US6819771048 340769 1970 1995 2003 M 1995-02-01 2003-01-09 1997
# 4 US6819771048 340769 1970 1995 2003 M 1995-02-01 2003-01-09 1998
# 5 US6819771048 340769 1970 1995 2003 M 1995-02-01 2003-01-09 1999
# 6 US6819771048 340769 1970 1995 2003 M 1995-02-01 2003-01-09 2000
# 7 US6819771048 340769 1970 1995 2003 M 1995-02-01 2003-01-09 2001
# 8 US6819771048 340769 1970 1995 2003 M 1995-02-01 2003-01-09 2002
# 9 US6819771048 340769 1970 1995 2003 M 1995-02-01 2003-01-09 2003
#10 US6819771048 340769 1970 2003 2004 M 2003-01-09 2004-02-24 2003
# … with 52 more rows
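To see how a row-wise sequence treats gaps, here is a minimal base R sketch on made-up data (the two-row `df` below is hypothetical, not the asker's dataset). It builds one sequence per row with Map() and repeats each row once per year, so the uncovered years 1998 to 2000 never appear.

```r
# Hypothetical toy data: one director with a gap between 1997 and 2001
df <- data.frame(
  directorid = c(1L, 1L),
  ROLE_START = c(1995L, 2001L),
  ROLE_END   = c(1997L, 2002L)
)

# One sequence of years per row
year_list <- Map(seq, df$ROLE_START, df$ROLE_END)

# Repeat each row once per year, then attach the years
expanded <- df[rep(seq_len(nrow(df)), lengths(year_list)), ]
expanded$YEAR <- unlist(year_list)
rownames(expanded) <- NULL

expanded$YEAR
# 1995 1996 1997 2001 2002
```

A year shared by two consecutive roles (such as 2003 in the sample data) would still appear once per role, which matches the expected output in the question.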
Related
I need to count the number of contiguous years in a data frame. I want to filter data frames that have more than 30 years of consecutive records. Before I was doing:
length(unique(Daily_Streamflow$year)) > 30
But I realized that the number of years (unique years) could be more than 30 but not in a consecutive range, for example:
(unique(DSF_09494000$year))
[1] 1917 1918 1957 1958 1959 1960 1961 1962 1963 1964 1965 1966 1967 1968 1969 1970 1971 1972 1973 1974 1975 1976 1977 1978 1979 1980
[27] 1981 1982 1983 1984 1985 1986 1987 1988 1989 1990 1991 1992 1993 1994 1995 1996 1997 1998 1999 2000 2001 2002 2003 2004 2005 2006
[53] 2007 2008 2009 2010 2011 2012 2013 2014 2015 2016 2017 2018 2019 2020
How is it possible to count the number of years in a continuous range without missing years? Is there a function similar to na.contiguous from the stats package, but for non-NA values?
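One possible approach, sketched in base R under the assumption that the years are available as a plain numeric vector (the `years` vector below is reconstructed from the printed output above): sort the unique years, start a new run wherever the difference from the previous year is not 1, and take the length of the longest run.

```r
# Years as printed above: 1917, 1918, then 1957 through 2020
years <- c(1917, 1918, 1957:2020)

yrs <- sort(unique(years))
# A new run starts at the first year and wherever the gap to the previous year is not 1
run_id <- cumsum(c(TRUE, diff(yrs) != 1))
# Number of years in each run
run_lengths <- tabulate(run_id)

longest_run <- max(run_lengths)
longest_run        # 64
longest_run > 30   # TRUE
```

The comparison `longest_run > 30` can then replace the `length(unique(...)) > 30` test when filtering data frames.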
I have already linearly interpolated data between observed data points using that code:
df2 <- df %>%
group_by(iso3)%>%
mutate_at(vars(anc4), list(inter = ~na.approx(., na.rm = FALSE)))%>%
ungroup()%>%
mutate_cond(is.na(anc4),anc4=inter)%>%
dplyr::select(-c(inter))
This gives me the dataset below. I am showing only an extract; the full dataset contains data from 2000 to 2018 for 194 countries.
ID iso3 year anc4
<chr> <chr> <dbl> <dbl>
1 AFG2000 AFG 2000 NA
2 AFG2001 AFG 2001 NA
3 AFG2002 AFG 2002 NA
4 AFG2003 AFG 2003 NA
5 AFG2004 AFG 2004 NA
6 AFG2005 AFG 2005 NA
7 AFG2006 AFG 2006 NA
8 AFG2007 AFG 2007 NA
9 AFG2008 AFG 2008 16.1
10 AFG2009 AFG 2009 9.9
11 AFG2010 AFG 2010 14.6
12 AFG2011 AFG 2011 18.6
13 AFG2012 AFG 2012 22.7
14 AFG2013 AFG 2013 17.8
15 AFG2014 AFG 2014 16.3
16 AFG2015 AFG 2015 17.8
17 AFG2016 AFG 2016 19.4
18 AFG2017 AFG 2017 20.9
19 AFG2018 AFG 2018 NA
20 AGO2000 AGO 2000 39.8
21 AGO2001 AGO 2001 41.5
22 AGO2002 AGO 2002 43.1
23 AGO2003 AGO 2003 44.8
24 AGO2004 AGO 2004 46.4
25 AGO2005 AGO 2005 48.1
26 AGO2006 AGO 2006 49.8
27 AGO2007 AGO 2007 51.4
28 AGO2008 AGO 2008 53.1
29 AGO2009 AGO 2009 54.8
30 AGO2010 AGO 2010 56.4
31 AGO2011 AGO 2011 58.1
32 AGO2012 AGO 2012 59.7
33 AGO2013 AGO 2013 61.4
34 AGO2014 AGO 2014 NA
35 AGO2015 AGO 2015 NA
36 AGO2016 AGO 2016 NA
37 AGO2017 AGO 2017 NA
38 AGO2018 AGO 2018 NA
What I would like to do now is to extrapolate backward and forward using a linear regression at the country level. I know the function na.locf and na.approx but cannot find any options that would do that. Mice or Amelia do not seem to do the trick as they need covariates. I have only one variable.
The complexity is that I have 194 countries so I am looking for something that could do this for all countries. I would be grateful if you could help!
I have tried this, to first try to extract the slope:
df_slope <- df2 %>%
mutate(slope=NA)%>%
group_by(iso3)%>%
mutate_cond(is.na(slope),slope=lm(anc4 ~year,.)$coefficients[[2]])%>%
ungroup()
..which of course does not work..
Thank you very much!
Since you have only one variable, the only thing you can do is use the year column as the independent variable (in other words, as a trend) to predict anc4.
The code below imputes the missing values with a per-country linear regression in base R.
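df <- as.data.frame(df)
# Fit one linear trend per country on its observed rows and fill only that country's NAs
for (i in unique(df$iso3[!is.na(df$anc4)])) {
  observed <- df$iso3 == i & !is.na(df$anc4)
  missing <- df$iso3 == i & is.na(df$anc4)
  if (any(missing)) {
    fit <- lm(anc4 ~ year, data = df[observed, ])
    # Assigning inside the loop keeps predictions aligned with their rows,
    # even if the data is not sorted by country
    df$anc4[missing] <- predict(fit, newdata = df[missing, ])
  }
}
df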
gives,
ID iso3 year anc4
1 AFG2000 AFG 2000 8.781212
2 AFG2001 AFG 2001 9.471515
3 AFG2002 AFG 2002 10.161818
4 AFG2003 AFG 2003 10.852121
5 AFG2004 AFG 2004 11.542424
6 AFG2005 AFG 2005 12.232727
7 AFG2006 AFG 2006 12.923030
8 AFG2007 AFG 2007 13.613333
9 AFG2008 AFG 2008 16.100000
10 AFG2009 AFG 2009 9.900000
11 AFG2010 AFG 2010 14.600000
12 AFG2011 AFG 2011 18.600000
13 AFG2012 AFG 2012 22.700000
14 AFG2013 AFG 2013 17.800000
15 AFG2014 AFG 2014 16.300000
16 AFG2015 AFG 2015 17.800000
17 AFG2016 AFG 2016 19.400000
18 AFG2017 AFG 2017 20.900000
19 AFG2018 AFG 2018 21.206667
20 AGO2000 AGO 2000 39.800000
21 AGO2001 AGO 2001 41.500000
22 AGO2002 AGO 2002 43.100000
23 AGO2003 AGO 2003 44.800000
24 AGO2004 AGO 2004 46.400000
25 AGO2005 AGO 2005 48.100000
26 AGO2006 AGO 2006 49.800000
27 AGO2007 AGO 2007 51.400000
28 AGO2008 AGO 2008 53.100000
29 AGO2009 AGO 2009 54.800000
30 AGO2010 AGO 2010 56.400000
31 AGO2011 AGO 2011 58.100000
32 AGO2012 AGO 2012 59.700000
33 AGO2013 AGO 2013 61.400000
34 AGO2014 AGO 2014 63.058242
35 AGO2015 AGO 2015 64.719341
36 AGO2016 AGO 2016 66.380440
37 AGO2017 AGO 2017 68.041538
38 AGO2018 AGO 2018 69.702637
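The same idea can also be written without an explicit loop. The sketch below is one way to do it, using made-up data in which both groups lie exactly on a straight line so the result is easy to check by hand; the `toy` data frame and the helper `fill_linear()` are hypothetical names introduced for illustration only.

```r
# Hypothetical toy data: two groups that follow exact straight lines
toy <- data.frame(
  iso3 = rep(c("AAA", "BBB"), each = 5),
  year = rep(2000:2004, times = 2),
  anc4 = c(NA, 12, 14, 16, NA,   # AAA: slope 2
           NA, NA, 30, 35, 40)   # BBB: slope 5
)

# Fill the NAs of one group from a linear trend fitted to its observed values
fill_linear <- function(d) {
  miss <- is.na(d$anc4)
  # At least two observed points are needed to estimate a slope
  if (any(miss) && sum(!miss) >= 2) {
    fit <- lm(anc4 ~ year, data = d[!miss, ])
    d$anc4[miss] <- predict(fit, newdata = d[miss, ])
  }
  d
}

filled <- do.call(rbind, lapply(split(toy, toy$iso3), fill_linear))
rownames(filled) <- NULL
round(filled$anc4, 6)
# 10 12 14 16 18 20 25 30 35 40
```

Groups with fewer than two observed values are returned unchanged here, which is a design choice of this sketch rather than something the original answer specifies.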
I might be overcomplicating things, and would love to know if there is an easier way to solve this. I have a data frame (df) with 5654 observations: 1332 foreign-born and 4322 Canada-born subjects.
The variable df$YR_IMM captures: "In what year did you come to live in Canada?"
See the following distribution of observations per immigration year table(df$YR_IMM) :
1920 1926 1928 1930 1939 1942 1944 1946 1947 1948 1949 1950 1951 1952 1953 1954
2 1 1 2 1 2 1 1 1 9 5 1 7 13 3 5
1955 1956 1957 1958 1959 1960 1961 1962 1963 1964 1965 1966 1967 1968 1969 1970
10 5 8 6 6 1 5 1 6 3 7 16 18 12 15 13
1971 1972 1973 1974 1975 1976 1977 1978 1979 1980 1981 1982 1983 1984 1985 1986
10 17 8 18 25 16 15 12 16 27 13 16 11 9 17 16
1987 1988 1989 1990 1991 1992 1993 1994 1995 1996 1997 1998 1999 2000 2001 2002 2003
24 21 31 36 26 30 26 24 22 30 29 26 47 52 53 28 9
Naturally these are only foreign-born individuals (mean = 1985); however, the year is missing for 348 foreign-born subjects. There are 4670 NAs in total, which also include the Canada-born subjects.
How can I code these df$YR_IMM NAs in such a way that
348 (NA) --> 1985
4322(NA) --> 100
Additionally, the status is given by df$Brthcoun, with 0 = "born in Canada" and 1 = "born outside of Canada".
Hope this makes sense - thank you!
EDIT: This was the solution ->
df$YR_IMM[is.na(df$YR_IMM) & df$Brthcoun == 0] <- 100
df$YR_IMM[is.na(df$YR_IMM) & df$Brthcoun == 1] <- 1985
Try the below code:
df$YR_IMM[is.na(df$YR_IMM) & df$Brthcoun == 0] <- 100
df$YR_IMM[is.na(df$YR_IMM) & df$Brthcoun == 1] <- 1985
I hope this helps!
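Something like this should also work. Note the nested ifelse, so that years that are not missing are kept as they are:
df$YR_IMM <- ifelse(is.na(df$YR_IMM) & df$Brthcoun == 0, 100,
             ifelse(is.na(df$YR_IMM) & df$Brthcoun == 1, 1985, df$YR_IMM))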
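Whichever variant is used, it is worth confirming on a small example that observed years survive the recoding. The sketch below uses a made-up four-row data frame (hypothetical values, same column names as in the question) and the logical-index assignment.

```r
# Hypothetical toy data: two observed years, one missing per birth status
df <- data.frame(
  YR_IMM   = c(1990, NA, NA, 2001),
  Brthcoun = c(1, 1, 0, 1)
)

recoded <- df$YR_IMM
# Canada-born with a missing year get the placeholder 100
recoded[is.na(df$YR_IMM) & df$Brthcoun == 0] <- 100
# Foreign-born with a missing year get the mean year 1985
recoded[is.na(df$YR_IMM) & df$Brthcoun == 1] <- 1985

recoded
# 1990 1985 100 2001
```

Both conditions are evaluated against the original `df$YR_IMM`, so the order of the two assignments does not matter.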
My Dataframe:
> head(scotland_weather)
JAN Year.1 FEB Year.2 MAR Year.3 APR Year.4 MAY Year.5 JUN Year.6 JUL Year.7 AUG Year.8 SEP Year.9 OCT Year.10
1 293.8 1993 278.1 1993 238.5 1993 191.1 1947 191.4 2011 155.0 1938 185.6 1940 216.5 1985 267.6 1950 258.1 1935
2 292.2 1928 258.8 1997 233.4 1990 149.0 1910 168.7 1986 137.9 2002 181.4 1988 211.9 1992 221.2 1981 254.0 1954
3 275.6 2008 244.7 2002 201.3 1992 146.8 1934 155.9 1925 137.8 1948 170.1 1939 202.3 2009 193.9 1982 248.8 2014
4 252.3 2015 227.9 1989 200.2 1967 142.1 1949 149.5 2015 137.7 1931 165.8 2010 191.4 1962 189.7 2011 247.7 1938
5 246.2 1974 224.9 2014 180.2 1979 133.5 1950 137.4 2003 135.0 1966 162.9 1956 190.3 2014 189.7 1927 242.3 1983
6 245.0 1975 195.6 1995 180.0 1989 132.9 1932 129.7 2007 131.7 2004 159.9 1985 189.1 2004 189.6 1985 240.9 2001
NOV Year.11 DEC Year.12 WIN Year.13 SPR Year.14 SUM Year.15 AUT Year.16 ANN Year.17
1 262.0 2009 300.7 2013 743.6 2014 409.5 1986 455.6 1985 661.2 1981 1886.4 2011
2 244.8 1938 268.5 1986 649.5 1995 401.3 2015 435.6 1948 633.8 1954 1828.1 1990
3 242.2 2006 267.2 1929 645.4 2000 393.7 1994 427.8 2009 615.8 1938 1756.8 2014
4 231.3 1917 265.4 2011 638.3 2007 393.2 1967 422.6 1956 594.5 1935 1735.8 1938
5 229.9 1981 264.0 2006 608.9 1990 391.7 1992 397.0 2004 590.6 1982 1720.0 2008
6 224.9 1951 261.0 1912 592.8 2015 389.1 1913 390.1 1938 589.2 2006 1716.5 1954
Year.X column is not ordered. I wish to convert this into the following format:
month year rainfall_mm
Jan 1993 293.8
Feb 1993 278.1
Mar 1993 238.5
...
Nov 2015 230.0
I tried t(), but it keeps the year column separate.
I also tried reshape2's recast(data, formula, ..., id.var, measure.var), but something is missing, as the month columns are numeric and the Year.X columns are integer.
> str(scotland_weather)
'data.frame': 106 obs. of 34 variables:
$ JAN : num 294 292 276 252 246 ...
$ Year.1 : int 1993 1928 2008 2015 1974 1975 2005 2007 1990 1983 ...
$ FEB : num 278 259 245 228 225 ...
$ Year.2 : int 1990 1997 2002 1989 2014 1995 1998 2000 1920 1918 ...
$ MAR : num 238 233 201 200 180 ...
$ Year.3 : int 1994 1990 1992 1967 1979 1989 1921 1913 2015 1978 ...
$ APR : num 191 149 147 142 134 ...
Based on the pattern of alternating columns in 'scotland_weather' (a month column followed by its 'Year.X' column), one way is to use c(TRUE, FALSE) to select every other column by recycling, which is equivalent to seq(1, ncol(scotland_weather), by = 2). Likewise, c(FALSE, TRUE) corresponds to seq(2, ncol(scotland_weather), by = 2). We use these to extract the two sets of columns, transpose each (t) and flatten it to a vector (c), so that the values are ordered row by row. Next, we extract the column names that do not contain 'Year' using grepl. Finally, data.frame binds the vectors together, recycling the month names.
res <- data.frame(
  month = names(scotland_weather)[!grepl('Year', names(scotland_weather))],
  year = c(t(scotland_weather[c(FALSE, TRUE)])),
  rainfall_mm = c(t(scotland_weather[c(TRUE, FALSE)]))
)
head(res,4)
# month year rainfall_mm
#1 JAN 1993 293.8
#2 FEB 1993 278.1
#3 MAR 1993 238.5
#4 APR 1947 191.1
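The recycling and transpose steps are easier to follow on a tiny example. The sketch below uses a made-up two-row, four-column data frame (`toy` and its values are hypothetical) with the same alternating layout.

```r
# Hypothetical toy data with the same month / Year.X alternation
toy <- data.frame(
  JAN = c(10.5, 9.1), Year.1 = c(1993L, 1928L),
  FEB = c(8.2, 7.7),  Year.2 = c(1990L, 1997L)
)

value_cols <- toy[c(TRUE, FALSE)]   # columns 1 and 3: JAN, FEB
year_cols  <- toy[c(FALSE, TRUE)]   # columns 2 and 4: Year.1, Year.2

long <- data.frame(
  month = names(value_cols),        # recycled once per original row
  year = c(t(year_cols)),           # transpose, then flatten: row by row
  rainfall_mm = c(t(value_cols))
)
long
#   month year rainfall_mm
# 1   JAN 1993        10.5
# 2   FEB 1990         8.2
# 3   JAN 1928         9.1
# 4   FEB 1997         7.7
```

Without the transpose, c() would flatten the columns one after another, and the recycled month names would no longer line up with the values.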
The problem is not only that you need to reshape your data, but also that the years for the first column are in the second column, the years for the third column are in the fourth, and so on.
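Here is a solution using tidyr (with dplyr for filter and select).
library(dplyr)
library(tidyr)
df <- scotland_weather
# Positions of the Year.X columns, which are left untouched by the first gather
years <- grep("Year", names(df))
df %>%
  gather("month", "rainfall_mm", -all_of(years)) %>%
  gather("yearname", "year", -c(month, rainfall_mm)) %>%
  # Keep only pairs where the year column sits directly after its month column
  filter(match(yearname, names(df)) - match(month, names(df)) == 1) %>%
  select(-yearname)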
That goes on until 2011. I know that 1961:2011 would assign all the years in between, but is there a way to accommodate the separate flag column?
To be more specific, it is a csv file. I am reading in the data as
data <-read.csv("file",
col.names=(country, element, *would be 1961:2011 if there were no flag columns*), header=TRUE)
You can use paste0 to generate the names.
col.names = c("country", "element", paste0(rep(1961:2011, each=2), c("", "flags")))
the rep call will generate:
[1] 1961 1961 1962 1962 1963 1963 1964 1964 1965 1965 1966 1966 1967 1967 1968
[16] 1968 1969 1969 1970 1970 1971 1971 1972 1972 1973 1973 1974 1974 1975 1975
[31] 1976 1976 1977 1977 1978 1978 1979 1979 1980 1980 1981 1981 1982 1982 1983
[46] 1983 1984 1984 1985 1985 1986 1986 1987 1987 1988 1988 1989 1989 1990 1990
[61] 1991 1991 1992 1992 1993 1993 1994 1994 1995 1995 1996 1996 1997 1997 1998
[76] 1998 1999 1999 2000 2000 2001 2001 2002 2002 2003 2003 2004 2004 2005 2005
[91] 2006 2006 2007 2007 2008 2008 2009 2009 2010 2010 2011 2011
Note that I am using each rather than times: rep(1961:2011, times=2) would instead append the whole sequence 1961:2011 twice.
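A shortened version makes the resulting names easy to inspect. The sketch below uses 1961:1963 instead of the full range purely for illustration, and assumes the first two columns are named "country" and "element" as in the question.

```r
years <- 1961:1963  # shortened range for illustration

# each = 2 repeats every year twice in place; c("", "flags") is recycled over the pairs
col_names <- c("country", "element",
               paste0(rep(years, each = 2), c("", "flags")))
col_names
# "country" "element" "1961" "1961flags" "1962" "1962flags" "1963" "1963flags"

# For contrast, times = 2 appends the whole sequence a second time
rep(years, times = 2)
# 1961 1962 1963 1961 1962 1963
```

Names that start with a digit are not syntactically valid in R, so read.csv() can be expected to rewrite them (for example 1961 to X1961) unless check.names = FALSE is passed.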