Appending and overwriting when joining dataframes


I have the following three dataframes:
prim <- data.frame("t"=2007:2012,
"a"=1:6,
"b"=7:12)
secnd <- data.frame("t"=2012:2013,
"a"=c(5, 7))
third <- data.frame("t"=2012:2013,
"b"=c(11, 13))
I want to join secnd and third to prim in two steps. In the first step I join prim and secnd, where any existing elements in prim are overwritten by those in secnd, so we end up with:
t a b
1 2007 1 7
2 2008 2 8
3 2009 3 9
4 2010 4 10
5 2011 5 11
6 2012 5 12
7 2013 7 NA
After this I want to join with third, where again existing elements are overwritten by those in third:
t a b
1 2007 1 7
2 2008 2 8
3 2009 3 9
4 2010 4 10
5 2011 5 11
6 2012 5 11
7 2013 7 13
Is there a way to achieve this using dplyr or base R?

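Using dplyr, you can do:
library(dplyr)
prim %>%
  full_join(secnd, by = 't') %>%
  full_join(third, by = 't') %>%
  # prefer the value from secnd/third (.y) and fall back to prim (.x)
  mutate(a = coalesce(as.integer(a.y), a.x),
         b = coalesce(as.integer(b.y), b.x)) %>%
  select(t, a, b)
I added as.integer() because the columns have different types across your data frames (integer in prim, double in secnd and third), and coalesce() needs compatible types.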

Consider base R with a chain merge and ifelse calls, followed by final column cleanup:
final_df <- Reduce(function(x, y) merge(x, y, by="t", all=TRUE), list(prim, secnd, third))
final_df <- within(final_df, {
a.x <- ifelse(is.na(a.y), a.x, a.y)
b.x <- ifelse(is.na(b.y), b.x, b.y)
})
final_df <- setNames(final_df[,1:3], c("t", "a", "b"))
final_df
# t a b
# 1 2007 1 7
# 2 2008 2 8
# 3 2009 3 9
# 4 2010 4 10
# 5 2011 5 11
# 6 2012 5 11
# 7 2013 7 13

Not very pretty, but it seems to do the job:
prim %>%
  anti_join(secnd, by = "t") %>%
  full_join(secnd, by = c("t", "a")) %>%
  select(-b) %>%
  left_join(prim %>%
              anti_join(third, by = "t") %>%
              full_join(third, by = c("t", "b")) %>%
              select(-a),
            by = "t")
which gives
t a b
1 2007 1 7
2 2008 2 8
3 2009 3 9
4 2010 4 10
5 2011 5 11
6 2012 5 11
7 2013 7 13
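
The two steps can also be written as a small base R helper that is applied once per data frame. This is only a sketch: the name upsert is hypothetical (it is not a base R or dplyr function), and it assumes the key column t is unique within each data frame.

```r
prim  <- data.frame(t = 2007:2012, a = 1:6, b = 7:12)
secnd <- data.frame(t = 2012:2013, a = c(5, 7))
third <- data.frame(t = 2012:2013, b = c(11, 13))

# Hypothetical helper: overwrite matching rows of x with y, append new keys.
upsert <- function(x, y, by = "t") {
  # Append an all-NA row for every key that exists only in y
  new_keys <- setdiff(y[[by]], x[[by]])
  if (length(new_keys) > 0) {
    filler <- x[rep(NA_integer_, length(new_keys)), , drop = FALSE]
    filler[[by]] <- new_keys
    x <- rbind(x, filler)
  }
  # Overwrite the non-key columns of y at the matching key positions
  idx <- match(y[[by]], x[[by]])
  for (col in setdiff(names(y), by)) {
    x[[col]][idx] <- y[[col]]
  }
  rownames(x) <- NULL
  x
}

step1 <- upsert(prim, secnd)   # a is overwritten for 2012, 2013 is appended
step2 <- upsert(step1, third)  # b is overwritten for 2012 and filled for 2013
step2
```

Because each call only touches the columns present in the second data frame, the intermediate result step1 keeps b = 12 for 2012 and has NA for b in 2013, as in the first desired table.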

Related

Subtract specific rows and rename them

Is it possible to subtract certain rows from each other and rename the result?
year <- c(2005,2005,2005,2006,2006,2006,2007,2007,2007)
category <- c("a","b","c","a","b","c", "a", "b", "c")
value <- c(2,2,10,3,3,12,4,4,16)
df <- data.frame(year, category,value, stringsAsFactors = FALSE)
And this is how the result should look:
year category value
2005 a 2
2005 b 2
2005 c 4
2006 a 3
2006 b 3
2006 c 12
2007 a 4
2007 b 4
2007 c 16
2005 c-b 2
2006 c-b 9
2007 c-b 12

You can use group_modify:
library(tidyverse)
df %>%
group_by(year) %>%
group_modify(~ add_row(.x, category = "c-b", value = .x$value[.x$category == "c"] - .x$value[.x$category == "b"]))
# A tibble: 12 x 3
# Groups: year [3]
year category value
<dbl> <chr> <dbl>
1 2005 a 2
2 2005 b 2
3 2005 c 10
4 2005 c-b 8
5 2006 a 3
6 2006 b 3
7 2006 c 12
8 2006 c-b 9
9 2007 a 4
10 2007 b 4
11 2007 c 16
12 2007 c-b 12

See the subset() function. (There is no substract() function in base R, and substr() extracts substrings, not rows.)
Example:
subset_df <- subset(df, category == "c")
If you want to know which rows you are dealing with, use which():
rows <- which(df$category == "c")
subset_df <- df[rows, ]
You can then rename the selected rows, supplying one name per row:
row.names(subset_df) <- c("name1", "name2", "name3")
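
Building on the row-selection idea, a base R sketch of the whole task could look like the following. It assumes every year has exactly one "b" row and one "c" row; the intermediate names rows_b, rows_c, paired and diff_rows are only illustrative.

```r
year <- c(2005, 2005, 2005, 2006, 2006, 2006, 2007, 2007, 2007)
category <- c("a", "b", "c", "a", "b", "c", "a", "b", "c")
value <- c(2, 2, 10, 3, 3, 12, 4, 4, 16)
df <- data.frame(year, category, value, stringsAsFactors = FALSE)

# Select the two categories that take part in the subtraction
rows_b <- subset(df, category == "b")
rows_c <- subset(df, category == "c")

# Pair them up by year so the subtraction does not depend on row order
paired <- merge(rows_c, rows_b, by = "year", suffixes = c(".c", ".b"))

# Build the new rows and label them "c-b"
diff_rows <- data.frame(year = paired$year,
                        category = "c-b",
                        value = paired$value.c - paired$value.b,
                        stringsAsFactors = FALSE)

result <- rbind(df, diff_rows)
result
```

Unlike the group_modify version, this appends the "c-b" rows at the end of the data frame rather than inside each year group.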

Lagging single column in Time-Series

I am running R 4.0.3. No access to the internet.
I want to lag a single column of a multicolumn Time-Series. I wasn't able to find a satisfactory answer anywhere else.
Intuitively this makes sense to me, but it just doesn't work:
library(tsbox)
data=data.frame(Date=c('2005-01-01','2005-02-01','2005-03-01','2005-04-01','2005-05-01'),
col1 = c(1,2,3,4,5),
col2 = c(1,2,3,4,5))
data[,'Date']= as.POSIXct(data[,'Date'],format='%Y-%m-%d')
timeseries = ts_ts(ts_long(data))
timeseries[,'col1_L1'] = lag(timeseries[,'col1'],1)
What I get:
col1 col2 col1_L1
Jan 2005 1 1 1
Feb 2005 2 2 2
Mar 2005 3 3 3
Apr 2005 4 4 4
May 2005 5 5 5
What I would expect from this code:
col1 col2 col1_L1
Jan 2005 1 1 NA
Feb 2005 2 2 1
Mar 2005 3 3 2
Apr 2005 4 4 3
May 2005 5 5 4

I wasn't able to reproduce your example (likely due to the reasons pointed out in the comments) but perhaps you could use the function from this post, e.g.
data=data.frame(Date=c('2005-01-01','2005-02-01','2005-03-01','2005-04-01','2005-05-01'),
col1 = c(1,2,3,4,5),
col2 = c(1,2,3,4,5))
data[,'Date']= as.POSIXct(data[,'Date'],format='%Y-%m-%d')
lagpad <- function(x, k) {
if (k>0) {
return (c(rep(NA, k), x)[1 : length(x)] )
}
else {
return (c(x[(-k+1) : length(x)], rep(NA, -k)))
}
}
data$col1_L1 <- lagpad(data$col1, 1)
data
#> Date col1 col2 col1_L1
#> 1 2005-01-01 1 1 NA
#> 2 2005-02-01 2 2 1
#> 3 2005-03-01 3 3 2
#> 4 2005-04-01 4 4 3
#> 5 2005-05-01 5 5 4
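
One likely reason the original attempt leaves the values unshifted is that stats::lag() on a ts object moves the time index and leaves the values alone, so copying the result into a column by position discards the shift. A small sketch with a toy monthly series (the series x is made up for illustration) shows this, along with a ts-native way to get the padded column by letting cbind() align on time:

```r
# Toy monthly series standing in for timeseries[, "col1"]
x <- ts(1:5, start = c(2005, 1), frequency = 12)

# A negative k moves the series forward in time; the values are untouched
shifted <- stats::lag(x, -1)
as.numeric(shifted)  # still 1 2 3 4 5
start(shifted)       # now starts in Feb 2005

# cbind() on ts objects aligns by time and pads with NA where needed
aligned <- cbind(col1 = x, col1_L1 = shifted)

# Drop the extra trailing month introduced by the shift
trimmed <- window(aligned, end = c(2005, 5))
trimmed
```

Note that if dplyr is attached, its lag() masks stats::lag() and shifts values instead, which is why the call is written with the stats:: prefix here.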

Create a new column with max values using the identifier column within a pipeline

I am trying to clean up some old code and convert over to "tidy". I am trying to create a new column of data within a pipeline that is the maximum age of individual fish. Let's represent the columns of interest as:
fish_1 <- data.frame(year = c(2012,2012,2015,2015,2015,2013,2013,2013,2013,2012,2012,2015,2015,2015),
fishid = c('a','a','b','b','b','c','c','c','c','d','d','e','e','e'), # unique identifier for each fish
agei = c(1,2,1,2,3,1,2,3,4,1,2,1,2,3))
# which looks like this:
fish_1
year fishid agei
1 2012 a 1
2 2012 a 2
3 2015 b 1
4 2015 b 2
5 2015 b 3
6 2013 c 1
7 2013 c 2
8 2013 c 3
9 2013 c 4
10 2012 d 1
11 2012 d 2
12 2015 e 1
13 2015 e 2
14 2015 e 3
What I'm trying to do is create a new column agec that holds the maximum age of each individual fish, repeated on every row belonging to that fish.
The desired output would be:
fish_2 <- data.frame(year = c(2012,2012,2015,2015,2015,2013,2013,2013,2013,2012,2012,2015,2015,2015),
fishid = c('a','a','b','b','b','c','c','c','c','d','d','e','e','e'), # unique identifier for each fish
agei = c(1,2,1,2,3,1,2,3,4,1,2,1,2,3),
agec = c(2,2,3,3,3,4,4,4,4,2,2,3,3,3))
# Which looks like:
fish_2
year fishid agei agec
1 2012 a 1 2
2 2012 a 2 2
3 2015 b 1 3
4 2015 b 2 3
5 2015 b 3 3
6 2013 c 1 4
7 2013 c 2 4
8 2013 c 3 4
9 2013 c 4 4
10 2012 d 1 2
11 2012 d 2 2
12 2015 e 1 3
13 2015 e 2 3
14 2015 e 3 3
The way I had done this in the past was to use a plyr::ddply() call to create a new dataframe and then merge with fish like this:
caps = plyr::ddply(fish_1, c('fishid'), plyr::summarize, agec=max(agei))
fish = merge(fish_1, caps, by='fishid')
fish
fishid year agei agec
1 a 2012 1 2
2 a 2012 2 2
3 b 2015 1 3
4 b 2015 2 3
5 b 2015 3 3
6 c 2013 1 4
7 c 2013 2 4
8 c 2013 3 4
9 c 2013 4 4
10 d 2012 1 2
11 d 2012 2 2
12 e 2015 1 3
13 e 2015 2 3
14 e 2015 3 3
I'm hoping someone can help me achieve this data structure concisely within a pipeline. All of the similar questions I have found have been very verbose and not specific to this issue. I am new to using tidyverse but I'm having trouble getting the group_by() function (to replace the ddply() call) within a pipe, and I'm hoping there is a simpler way.
UPDATE
For those interested, it appears both answers below are correct. The reason I struggled was that I was already doing other data manipulations within my pipeline and tried to create the agec column within an earlier call to dplyr::mutate(). You can refer to my comment on @Thomas's answer to see the error in my ways. Hope this helps.

Try dplyr instead of plyr
library(dplyr)
fish_1 %>%
group_by(fishid) %>%
mutate(agec = max(agei))

You can use group_by from dplyr to group your fish IDs and then simply call mutate (dplyr as well) with max:
fish_1 <- data.frame(year = c(2012,2012,2015,2015,2015,2013,2013,2013,2013,2012,2012,2015,2015,2015),
fishid = c('a','a','b','b','b','c','c','c','c','d','d','e','e','e'), # unique identifier for each fish
agei = c(1,2,1,2,3,1,2,3,4,1,2,1,2,3))
fish_1 %>%
group_by(fishid) %>%
mutate(agec = max(agei))
# A tibble: 14 x 4
# Groups: fishid [5]
year fishid agei agec
<dbl> <chr> <dbl> <dbl>
1 2012 a 1 2
2 2012 a 2 2
3 2015 b 1 3
4 2015 b 2 3
5 2015 b 3 3
6 2013 c 1 4
7 2013 c 2 4
8 2013 c 3 4
9 2013 c 4 4
10 2012 d 1 2
11 2012 d 2 2
12 2015 e 1 3
13 2015 e 2 3
14 2015 e 3 3

An option with data.table
library(data.table)
setDT(fish_1)[, agec := max(agei, na.rm = TRUE), fishid]
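
For completeness, the same per-fish maximum can be sketched in base R with ave(), which returns a vector of the same length as its input and so repeats the group maximum on every row:

```r
fish_1 <- data.frame(
  year   = c(2012, 2012, 2015, 2015, 2015, 2013, 2013, 2013, 2013,
             2012, 2012, 2015, 2015, 2015),
  fishid = c('a', 'a', 'b', 'b', 'b', 'c', 'c', 'c', 'c',
             'd', 'd', 'e', 'e', 'e'),
  agei   = c(1, 2, 1, 2, 3, 1, 2, 3, 4, 1, 2, 1, 2, 3)
)

# ave() applies FUN within each fishid group and keeps the original row order
fish_1$agec <- ave(fish_1$agei, fish_1$fishid, FUN = max)
fish_1
```

This avoids the separate summarise-and-merge step and, unlike merge(), does not reorder the rows or columns.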

Combine data in many rows into a column

I have data like this:
year Male
1 2011 8
2 2011 1
3 2011 4
4 2012 3
5 2012 12
6 2012 9
7 2013 4
8 2013 3
9 2013 3
and I need to group the data for the year 2011 in one column, 2012 in the next column and so on.
2011 2012 2013
1 8 3 4
2 1 12 3
3 4 9 3
How do I achieve this?

One option is unstack if the number of rows per 'year' is the same
unstack(df1, Male ~ year)

One option is to use functions from dplyr and tidyr.
library(dplyr)
library(tidyr)
dt2 <- dt %>%
group_by(year) %>%
mutate(ID = 1:n()) %>%
spread(year, Male) %>%
select(-ID)
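
In current versions of tidyr, spread() is superseded by pivot_wider(). A sketch of the same approach with pivot_wider() follows; the data frame df is rebuilt here from the question's table, and the ungroup() call is a precaution so that the grouping on year does not interfere with the reshape.

```r
library(dplyr)
library(tidyr)

df <- data.frame(year = rep(2011:2013, each = 3),
                 Male = c(8, 1, 4, 3, 12, 9, 4, 3, 3))

wide <- df %>%
  group_by(year) %>%
  mutate(ID = row_number()) %>%   # position of each value within its year
  ungroup() %>%
  pivot_wider(names_from = year, values_from = Male) %>%
  select(-ID)

wide
```

The helper column ID is what lines the values up row by row; without it, pivot_wider() has no way to tell which values belong on the same output row.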

1. If every year has the same number of observations, you could split the data and cbind it using base R
do.call(cbind, split(df$Male, df$year))
# 2011 2012 2013
#[1,] 8 3 4
#[2,] 1 12 3
#[3,] 4 9 3
2. If every year does not have the same number of observations, you could use rbind.fill.matrix from plyr
df[10,] = c(2015, 5) #Add only one data for the year 2015
library(plyr)
setNames(object = data.frame(t(rbind.fill.matrix(lapply(split(df$Male, df$year), t)))),
nm = unique(df$year))
# 2011 2012 2013 2015
#1 8 3 4 5
#2 1 12 3 NA
#3 4 9 3 NA
3. Yet another way is to use dcast to convert data from long to wide format
df[10,] = c(2015, 5) #Add only one data for the year 2015
library(reshape2)
dcast(df, ave(df$Male, df$year, FUN = seq_along) ~ year, value.var = "Male")[,-1]
# 2011 2012 2013 2015
#1 8 3 4 5
#2 1 12 3 NA
#3 4 9 3 NA

How to replace missing values with the previous years' binned mean

I have a data frame, shown below as df.
Bin_p1 and Bin_f1 were calculated with the cut function as follows:
Bins <- function(x) cut(x, breaks = c(0, seq(1, 1000, by = 5)), labels = 1:200)
binned <- as.data.frame (sapply(df[,-1], Bins))
colnames(binned) <- paste("Bin", colnames(binned), sep = "_")
df<- cbind(df, binned)
How can I calculate the mean over the previous two years within a bin and use it to replace the NA values in that bin?
For example, in row 5 the value of p1 is NA and f1 is 30, with corresponding bin 7. The NA should be replaced with the mean of p1 over the previous two years for the same bin (7), so that df becomes df1:
df
ID year p1 f1 Bin_p1 Bin_f1
1 2013 20 30 5 7
2 2013 24 29 5 7
3 2014 10 16 2 3
4 2014 11 17 2 3
5 2015 NA 30 NA 7
6 2016 10 NA 2 NA
df1
ID year p1 f1 Bin_p1 Bin_f1
1 2013 20 30 5 7
2 2013 24 29 5 7
3 2014 10 16 2 3
4 2014 11 17 2 3
5 2015 **22** 30 NA 7
6 2016 10 **16.5** 2 NA
Thanks in advance

I believe the following code produces the desired output. There's probably a much more elegant way than using mean(rev(lag(f1))[1:2]) to get the average of the last two values of f1 but this should do the trick anyway.
library(dplyr)
df %>%
arrange(year) %>%
mutate_at(c("p1", "f1"), "as.double") %>%
group_by(Bin_p1) %>%
mutate(f1 = ifelse(is.na(f1), mean(rev(lag(f1))[1:2]), f1)) %>%
group_by(Bin_f1) %>%
mutate(p1 = ifelse(is.na(p1), mean(rev(lag(p1))[1:2]), p1)) %>%
ungroup
and the output is:
# A tibble: 6 x 6
ID year p1 f1 Bin_p1 Bin_f1
<int> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 2013 20 30.0 5 7
2 2 2013 24 29.0 5 7
3 3 2014 10 16.0 2 3
4 4 2014 11 17.0 2 3
5 5 2015 22 30.0 NA 7
6 6 2016 10 16.5 2 NA
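
As a possible simplification of the mean(rev(lag(x))[1:2]) expression, the "mean of the last two known values" step can be pulled into a small base R helper. The name mean_last_two is hypothetical, and the sketch assumes the values are already sorted by year within the group:

```r
# Hypothetical helper: mean of the last two non-missing values of x
mean_last_two <- function(x) {
  known <- x[!is.na(x)]
  mean(tail(known, 2))
}

# Values of f1 in the group Bin_p1 == 2 (rows 3, 4 and 6 of df)
mean_last_two(c(16, 17, NA))

# Values of p1 in the group Bin_f1 == 7 (rows 1, 2 and 5 of df)
mean_last_two(c(20, 24, NA))
```

Inside the grouped mutate() calls this could replace the rev(lag()) expression, for example ifelse(is.na(f1), mean_last_two(f1), f1). It is not an exact equivalent in every case, since it skips missing values anywhere in the group rather than only dropping the final one.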