Here is my problem. I've got data on city codes (GeoCode) and zip codes (PostCode). Often several zip codes correspond to a single city code. If that's the case, I want to make a column with a string of zip codes corresponding to the same city:
ID<-1:10
GeoCode<-c("AA","BB","BB","CC","CC","CC","DD","DD","DD","DD")
PostCode<-c("01","10","11","20","21","22","30","31","32","33")
data<-data.frame(ID,GeoCode,PostCode)
I want to produce a table like the one below. For example, "20_21_22" belongs to city code CC:
ID GeoCode PostCode strPostcode
1 1 AA 01 01
2 2 BB 10 10_11
3 3 BB 11 10_11
4 4 CC 20 20_21_22
5 5 CC 21 20_21_22
6 6 CC 22 20_21_22
7 7 DD 30 30_31_32_33
8 8 DD 31 30_31_32_33
9 9 DD 32 30_31_32_33
10 10 DD 33 30_31_32_33
We could group by 'GeoCode' and paste all the unique 'PostCode' values together inside mutate:
library(dplyr)
library(stringr)
data %>%
group_by(GeoCode) %>%
mutate(strPostcode = str_c(unique(PostCode), collapse="_"))
# A tibble: 10 x 4
# Groups: GeoCode [4]
# ID GeoCode PostCode strPostcode
# <int> <chr> <chr> <chr>
# 1 1 AA 01 01
# 2 2 BB 10 10_11
# 3 3 BB 11 10_11
# 4 4 CC 20 20_21_22
# 5 5 CC 21 20_21_22
# 6 6 CC 22 20_21_22
# 7 7 DD 30 30_31_32_33
# 8 8 DD 31 30_31_32_33
# 9 9 DD 32 30_31_32_33
#10 10 DD 33 30_31_32_33
Or an option with base R
data$strPostcode <- with(data, ave(PostCode, GeoCode, FUN =
function(x) paste(unique(x), collapse="_")))
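For reference, here is a minimal self-contained sketch of the ave approach on the question's data. Setting stringsAsFactors = FALSE is an assumption added here so that PostCode is a character vector regardless of R version. ave returns one value per input row, which is why the result can be assigned straight back as a column.

```r
# Question's data; stringsAsFactors = FALSE is an assumption so that
# PostCode stays character in every R version
data <- data.frame(
  ID = 1:10,
  GeoCode = c("AA", "BB", "BB", "CC", "CC", "CC", "DD", "DD", "DD", "DD"),
  PostCode = c("01", "10", "11", "20", "21", "22", "30", "31", "32", "33"),
  stringsAsFactors = FALSE
)

# ave() applies FUN within each GeoCode group and recycles the single
# pasted string back to every row of that group
data$strPostcode <- ave(data$PostCode, data$GeoCode,
                        FUN = function(x) paste(unique(x), collapse = "_"))

print(data)
```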
The base R option with ave by @akrun is efficient. Here is another workaround:
merge(data,
aggregate(PostCode ~ ., data[-1], paste0, collapse = "_"),
by = "GeoCode",
all = TRUE
)
which gives
GeoCode ID PostCode.x PostCode.y
1 AA 1 01 01
2 BB 2 10 10_11
3 BB 3 11 10_11
4 CC 4 20 20_21_22
5 CC 5 21 20_21_22
6 CC 6 22 20_21_22
7 DD 7 30 30_31_32_33
8 DD 8 31 30_31_32_33
9 DD 9 32 30_31_32_33
10 DD 10 33 30_31_32_33
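If the PostCode.x / PostCode.y names are not wanted, one possible variation (a sketch, not part of the answer above) is to rename the aggregated column before merging. The column name strPostcode comes from the question; agg and out are illustrative names only.

```r
# Question's data (stringsAsFactors = FALSE is an assumption to keep strings
# as character in every R version)
data <- data.frame(
  ID = 1:10,
  GeoCode = c("AA", "BB", "BB", "CC", "CC", "CC", "DD", "DD", "DD", "DD"),
  PostCode = c("01", "10", "11", "20", "21", "22", "30", "31", "32", "33"),
  stringsAsFactors = FALSE
)

# One row per GeoCode with the zip codes pasted together
agg <- aggregate(PostCode ~ GeoCode, data, paste, collapse = "_")

# Rename before merging so merge() has no clashing column names to suffix
names(agg)[names(agg) == "PostCode"] <- "strPostcode"

out <- merge(data, agg, by = "GeoCode")

# Restore the original row and column order
out <- out[order(out$ID), c("ID", "GeoCode", "PostCode", "strPostcode")]
print(out)
```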
Or you can try this one
data2 <- data %>%
group_by(GeoCode) %>%
mutate(strPostCode = paste0(unique(PostCode), collapse = "_"))
# ID GeoCode PostCode strPostCode
# <int> <chr> <chr> <chr>
# 1 1 AA 01 01
# 2 2 BB 10 10_11
# 3 3 BB 11 10_11
# 4 4 CC 20 20_21_22
# 5 5 CC 21 20_21_22
# 6 6 CC 22 20_21_22
# 7 7 DD 30 30_31_32_33
# 8 8 DD 31 30_31_32_33
# 9 9 DD 32 30_31_32_33
# 10 10 DD 33 30_31_32_33
I have a temporal dataset, but it is incomplete, so I cannot reconstruct the series accurately. This is the data:
df<-data.frame(year=c(2006,2007,2008,2009,2010,2011,2012,2013,2014,2015),
sample1=c("D","D","DDD","D","U","UU","UUU","U","D","DDD"),
sample2=c("U","UU","D","D","DDD","D","U","UU","UUU","U"),
sample3=c("D","DDD","D","U","UU","UUU","U","D","DDD","D"),
sample4=c("D","D","UUU","U","D","DDD","D","U","U",NA),
sample5=c(NA,"UU","D","U","UU","UUU","U","D","U",NA))
I need it to end up like this:
df2<-data.frame(year=c(2004,2005,2006,2007,2008,2009,2010,2011,2012,2013,2014,
2015,2016,2017,2018),
sample1=c(NA,NA,"D","D","DDD","D","U","UU","UUU","U","D","DDD",NA,NA,NA),
sample2=c("U","UU","D","D","DDD","D","U","UU","UUU","U",NA,NA,NA,NA,NA),
sample3=c(NA,NA,NA,"D","DDD","D","U","UU","UUU","U","D","DDD","D",NA,NA),
sample4=c(NA,NA,"D","D",NA,NA,NA,NA,"UUU","U","D","DDD","D","U","U"),
sample5=c(NA,"UU","D",NA,NA,NA,"U","UU","UUU","U",NA,NA,"D","U",NA))
I need all the columns aligned to the same pattern. The best result so far came from DNA alignment functions, but to find the best alignment they sometimes invert the order of elements, which cannot happen in my case.
I have no idea how to do this.
dplyr's add_row function makes this pretty easy, once the initial dataframe exists.
library(dplyr)
df<-data.frame(year=c(2006,2007,2008,2009,2010,2011,2012,2013,2014,2015),
sample1 = c("D","D","DDD","D","U","UU","UUU","U","D","DDD"),
sample2 = c("U","UU","D","D","DDD","D","U","UU","UUU","U"),
sample3 = c("D","DDD","D","U","UU","UUU","U","D","DDD","D"),
sample4 = c("D","D","UUU","U","D","DDD","D","U","U",NA),
sample5 = c(NA,"UU","D","U","UU","UUU","U","D","U",NA))
df2 <- df %>%
add_row(year = 2016:2018)
library(dplyr)
# tibble() replaces the deprecated data_frame(); sample2 uses the question's values
df <- tibble(year=c(2006,2007,2008,2009,2010,2011,2012,2013,2014,2015),
sample1=c("D","D","DDD","D","U","UU","UUU","U","D","DDD"),
sample2=c("U","UU","D","D","DDD","D","U","UU","UUU","U"),
sample3=c("D","DDD","D","U","UU","UUU","U","D","DDD","D"),
sample4=c("D","D","UUU","U","D","DDD","D","U","U",NA),
sample5=c(NA,"UU","D","U","UU","UUU","U","D","U",NA)) %>%
add_row(year = c(2004, 2005), .before = 1) %>%
add_row(year = c(2016:2018))
Result:
# A tibble: 15 x 6
year sample1 sample2 sample3 sample4 sample5
<dbl> <chr> <chr> <chr> <chr> <chr>
1 2004 NA NA NA NA NA
2 2005 NA NA NA NA NA
3 2006 D U D D NA
4 2007 D UU DDD D UU
5 2008 DDD D D UUU D
6 2009 D D U U U
7 2010 U DDD UU D UU
8 2011 UU D UUU DDD UUU
9 2012 UUU U U D U
10 2013 U UU D U D
11 2014 D UUU DDD U U
12 2015 DDD U D NA NA
13 2016 NA NA NA NA NA
14 2017 NA NA NA NA NA
15 2018 NA NA NA NA NA
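Padding only makes room for the shifted series; it does not move anything. For columns that differ from the target by a constant offset, a possible next step is dplyr's lead() and lag(). The sketch below uses the question's sample2 and sample3 values, with offsets (2 years earlier, 1 year later) read by hand from the desired df2. They are not detected automatically, and columns with internal gaps such as sample4 and sample5 are not handled by this.

```r
library(dplyr)

# Question's sample2 and sample3 only, to keep the sketch short
df <- tibble(
  year    = 2006:2015,
  sample2 = c("U", "UU", "D", "D", "DDD", "D", "U", "UU", "UUU", "U"),
  sample3 = c("D", "DDD", "D", "U", "UU", "UUU", "U", "D", "DDD", "D")
)

aligned <- df %>%
  add_row(year = 2004:2005, .before = 1) %>%  # pad the start with NA rows
  add_row(year = 2016:2018) %>%               # pad the end with NA rows
  mutate(
    sample2 = lead(sample2, 2),  # shift 2 rows up: series now starts in 2004
    sample3 = lag(sample3, 1)    # shift 1 row down: series now starts in 2007
  )

print(aligned, n = 15)
```

Because lead() and lag() only slide values and fill the vacated positions with NA, the order of elements within a column is never inverted.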
I have a sequence of numbers (days):
dayNum <- c(1:10)
And I have a dataframe of id, day, and event:
id = c("aa", "aa", "aa", "bb", "bb", "cc")
day = c(1, 2, 3, 1, 6, 2)
event = c("Y", "Y", "Y", "Y", "Y", "Y")
df = data.frame(id, day, event)
Which looks like this:
id day event
aa 1 Y
aa 2 Y
aa 3 Y
bb 1 Y
bb 6 Y
cc 2 Y
I am trying to put this dataframe into a form that resembles left joining dayNum with df for each id. That is, even if id "aa" had no event on day 5, I should still get a row for "aa" on day 5 with N/A or something under event. Like this:
id day event
aa 1 Y
aa 2 Y
aa 3 Y
aa 4 N/A
aa 5 N/A
aa 6 N/A
aa 7 N/A
aa 8 N/A
aa 9 N/A
aa 10 N/A
bb 1 Y
bb 2 N/A
bb 3 N/A
bb 4 N/A
bb 5 N/A
bb 6 Y
bb 7 N/A
...etc
I can make this work using dplyr and left_join when my dataframe only contains one unique id, but I am stuck trying to make this work with a dataframe that has many different ids.
A push in the right direction would be greatly appreciated.
Thank you!
We can use expand.grid and merge. We create a new dataset using the unique 'id' of 'df' and the 'dayNum'. Then merge with the 'df' to get the expected output.
merge(expand.grid(id=unique(df$id), day=dayNum), df, all.x=TRUE)
# id day event
#1 aa 1 Y
#2 aa 2 Y
#3 aa 3 Y
#4 aa 4 <NA>
#5 aa 5 <NA>
#6 aa 6 <NA>
#7 aa 7 <NA>
#8 aa 8 <NA>
#9 aa 9 <NA>
#10 aa 10 <NA>
#11 bb 1 Y
#12 bb 2 <NA>
#13 bb 3 <NA>
#14 bb 4 <NA>
#15 bb 5 <NA>
#16 bb 6 Y
#17 bb 7 <NA>
#18 bb 8 <NA>
#19 bb 9 <NA>
#20 bb 10 <NA>
#21 cc 1 <NA>
#22 cc 2 Y
#23 cc 3 <NA>
#24 cc 4 <NA>
#25 cc 5 <NA>
#26 cc 6 <NA>
#27 cc 7 <NA>
#28 cc 8 <NA>
#29 cc 9 <NA>
#30 cc 10 <NA>
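Since the question mentions dplyr and left_join, the same idea can be sketched that way too: build the full id-by-day grid first, then left join the events onto it. This is a minimal sketch; grid and full are illustrative names, and day is coerced to numeric as a precaution so both join columns have the same type.

```r
library(dplyr)

dayNum <- 1:10
df <- data.frame(
  id    = c("aa", "aa", "aa", "bb", "bb", "cc"),
  day   = c(1, 2, 3, 1, 6, 2),
  event = "Y",
  stringsAsFactors = FALSE
)

# Every combination of id and day, 3 ids x 10 days = 30 rows
grid <- expand.grid(id = unique(df$id), day = as.numeric(dayNum),
                    stringsAsFactors = FALSE)

# Keep every grid row; event is NA where df has no matching id/day
full <- grid %>%
  left_join(df, by = c("id", "day")) %>%
  arrange(id, day)

print(full)
```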
A similar option using data.table would be to convert the 'data.frame' to a 'data.table' (setDT(df)), set the 'key' columns, and join with the dataset derived from the cross join of the unique 'id' values and 'dayNum'.
library(data.table)
setDT(df, key=c('id', 'day'))[CJ(id=unique(id), day=dayNum)]
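Note that setDT converts df in place. A variation that leaves the original data.frame untouched might look like the sketch below, where dt and res are illustrative names and day is coerced to numeric as a precaution so the join columns share a type.

```r
library(data.table)

dayNum <- 1:10
df <- data.frame(
  id    = c("aa", "aa", "aa", "bb", "bb", "cc"),
  day   = c(1, 2, 3, 1, 6, 2),
  event = "Y",
  stringsAsFactors = FALSE
)

# as.data.table() makes a copy, so df itself is not modified
dt <- as.data.table(df)
setkey(dt, id, day)

# CJ() builds the cross join of ids and days; joining on the key keeps
# every id/day pair and fills event with NA where there is no match
res <- dt[CJ(id = unique(dt$id), day = as.numeric(dayNum))]

print(res)
```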