'rbind' dataframes with different prefix in column names

I have two dataframes like the following:
df1 <- data.frame(ID = c(1:4),
Year = 2001,
a_Var1 = c("A","B","C","D"),
a_Var2 = c("T","F","F","T"))
df2 <- data.frame(ID = c(1:4),
Year = 2002,
b_Var1 = c("E","F","G","H"))
The desired end product is
df_combined <- data.frame(ID = c(1,1,2,2,3,3,4,4),
Year = c(2001,2002,2001,2002,2001,2002,2001,2002),
Var1 = c("A","E","B","F","C","G","D","H"),
Var2 = c("T",NA,"F",NA,"F",NA,"T",NA))
The question is how to 'rbind' in such a way that the prefix a_ or b_ is removed and Var1, Var2, etc. become the new columns.
I tried plyr's rbind.fill, but that doesn't solve the problem.

Here is one option. Place the datasets in a list, rename by removing the prefix part including the _ and arrange by 'ID'
library(tidyverse)
map_df(list(df1, df2), ~ .x %>%
rename_all(~ str_remove(.x, "^[^_]+_"))) %>%
arrange(ID)
# ID Year Var1 Var2
#1 1 2001 A T
#2 1 2002 E <NA>
#3 2 2001 B F
#4 2 2002 F <NA>
#5 3 2001 C F
#6 3 2002 G <NA>
#7 4 2001 D T
#8 4 2002 H <NA>
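For readers without the tidyverse, the same idea (strip everything up to and including the first underscore, then bind while filling absent columns with NA) can be sketched in base R. The helper names strip_prefix and fill_cols are hypothetical and not part of the original answer; this is a minimal sketch, assuming every prefix ends at the first underscore.

```r
df1 <- data.frame(ID = 1:4, Year = 2001,
                  a_Var1 = c("A", "B", "C", "D"),
                  a_Var2 = c("T", "F", "F", "T"),
                  stringsAsFactors = FALSE)
df2 <- data.frame(ID = 1:4, Year = 2002,
                  b_Var1 = c("E", "F", "G", "H"),
                  stringsAsFactors = FALSE)

# Hypothetical helper: drop the prefix up to and including the first "_"
strip_prefix <- function(d) {
  names(d) <- sub("^[^_]+_", "", names(d))
  d
}

dfs <- lapply(list(df1, df2), strip_prefix)
all_cols <- unique(unlist(lapply(dfs, names)))

# Hypothetical helper: add any missing column as NA, then fix the column order
fill_cols <- function(d) {
  for (nm in setdiff(all_cols, names(d))) d[[nm]] <- NA
  d[all_cols]
}

combined <- do.call(rbind, lapply(dfs, fill_cols))
combined <- combined[order(combined$ID, combined$Year), ]
rownames(combined) <- NULL
combined
```

Columns without an underscore (ID, Year) are left untouched because the pattern does not match them.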


Apply function over list then iterate over second variable, in r

I am trying to have a function apply over a list and iterate over a second variable in the function, in r.
Here is an example:
Create the data
A <- data.frame(var = 1:3, year = 2000:2002)
B <- data.frame(var = 4:6, year = 2000:2002)
C <- data.frame(var = 7:9, year = 2000:2002)
ABC <- list(A, B, C)
> ABC
[[1]]
var year
1 1 2000
2 2 2001
3 3 2002
[[2]]
var year
1 4 2000
2 5 2001
3 6 2002
[[3]]
var year
1 7 2000
2 8 2001
3 9 2002
Write the function: sum (which simply filters for a start year and sums the 'var' values - sorry this simple function got messier in this example than I had intended).
library(dplyr)
sum <- function(dat, start.year) {
dat %>%
filter(year >= start.year) %>%
select(var) %>%
colSums() %>%
data.frame(row.names = NULL) %>%
rename(var = '.') %>%
mutate(start = start.year)
}
Now I can apply the function to the list (and bind_rows to get a neat output):
lapply(ABC, sum, 2000) %>%
bind_rows()
var start
1 6 2000
2 15 2000
3 24 2000
What I want to do however is iterate over start.year creating dataframes for start.year = c(2000, 2001, 2002). This would ideally give:
var start
1 6 2000
2 15 2000
3 24 2000
4 5 2001
5 11 2001
6 17 2001
7 3 2002
8 6 2002
9 9 2002
I have looked at map2, but that talks about using vectors of the same length. That would work in this case, but imagine my list had 4 items in it and only 3 records per list. So I assume map2 is doing something different. I also thought about a nested for loop. When I started writing that, however, I realized I would be dealing with list.append functions in R, and that seemed wrong. I assume this is an easy thing to do. Any help would be appreciated.
We can do this with a nested lapply/map
library(purrr)
map_dfr(2000:2002, ~ map_dfr(ABC, sum, .x))
# var start
#1 6 2000
#2 15 2000
#3 24 2000
#4 5 2001
#5 11 2001
#6 17 2001
#7 3 2002
#8 6 2002
#9 9 2002
Or, inspired by @thelatemail's suggestion with Map
map2_dfr(rep(ABC, 3), rep(2000:2002,each=length(ABC)), sum)
With lapply
do.call(rbind, lapply(2000:2002, function(x) do.call(rbind, lapply(ABC, sum, x))))
# var start
#1 6 2000
#2 15 2000
#3 24 2000
#4 5 2001
#5 11 2001
#6 17 2001
#7 3 2002
#8 6 2002
#9 9 2002
Or, as @thelatemail mentioned
do.call(rbind, Map(sum, ABC, start.year=rep(2000:2002,each=length(ABC))))
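The Map call works because Map recycles its shorter argument: the three-element list is reused for each block of start years produced by rep(..., each = length(ABC)). Below is a minimal base R sketch of that pairing. The function name sum_from is hypothetical; it stands in for the question's sum so that base::sum is not masked.

```r
A <- data.frame(var = 1:3, year = 2000:2002)
B <- data.frame(var = 4:6, year = 2000:2002)
C <- data.frame(var = 7:9, year = 2000:2002)
ABC <- list(A, B, C)

# Hypothetical stand-in for the question's function, written in base R
sum_from <- function(dat, start.year) {
  data.frame(var = sum(dat$var[dat$year >= start.year]),
             start = start.year)
}

# 2000 2000 2000 2001 2001 2001 2002 2002 2002
starts <- rep(2000:2002, each = length(ABC))

# ABC (length 3) is recycled against starts (length 9)
res <- do.call(rbind, Map(sum_from, ABC, starts))
res
```

Because the pairing is positional, the list does not need the same length as the vector of years; only the total number of calls must be a multiple of the list length.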
If the OP's function can be changed, another option is
library(dplyr)
library(tidyr)
map_dfr(ABC, ~ .x %>%
crossing(year2 = 2000:2002) %>%
filter(year >= year2) %>%
group_by(year2) %>%
summarise(var = base::sum(var)))
Or instead of doing this in a list, we can bind them together with bind_rows then do a group by sum after crossing with the input 'years'
bind_rows(ABC, .id = 'grp') %>%
group_by(grp) %>%
crossing(year2 = 2000:2002) %>%
filter(year >= year2) %>%
group_by(grp, year2) %>%
summarise(var = base::sum(var))

Create a new dataframe with rows for every value in a sequence between two columns in a previous dataframe [duplicate]

I have a dataframe, where two columns represent the beginning and end of a range of dates. So:
df <- data.frame(var=c("A", "B"), start_year=c(2000, 2002), end_year=c(2005, 2004))
> df
var start_year end_year
1 A 2000 2005
2 B 2002 2004
And I'd like to create a new dataframe, where there is a row for every value between start_year and end_year, for each var.
So the result should look like:
> newdf
var year
1 A 2000
2 A 2001
3 A 2002
4 A 2003
5 A 2004
6 A 2005
7 B 2002
8 B 2003
9 B 2004
Ideally this would involve something from the tidyverse. I've been trying different things with dplyr::group_by and tidyr::gather, but I'm not having any luck.
As akrun demonstrated, it's probably easier to do it without gather and group_by (as mentioned in the question). But in case you're curious how to do it that way, here it is
df %>%
gather(key, value, -var) %>%
group_by(var) %>%
expand(year = value[1]:value[2])
# # A tibble: 9 x 2
# # Groups: var [2]
# var year
# <fct> <int>
# 1 A 2000
# 2 A 2001
# 3 A 2002
# 4 A 2003
# 5 A 2004
# 6 A 2005
# 7 B 2002
# 8 B 2003
# 9 B 2004
Here's the same idea, convert to long and expand, in data.table (same output)
library(data.table)
setDT(df)
melt(df, 'var')[, .(year = value[1]:value[2]), var]
Edit: As markus points out, you don't need to convert to long first with data.table, you can do it in one step (not counting the two lines library/setDT in the code block above). This is a similar approach to akrun's tidyverse answer.
df[, .(year = start_year:end_year), by=var]
We can use map2 to get the sequence from 'start_year' to 'end_year' and unnest the list column to expand the data into 'long' format
library(tidyverse)
df %>%
transmute(var, year = map2(start_year, end_year, `:`)) %>%
unnest(year)
# var year
#1 A 2000
#2 A 2001
#3 A 2002
#4 A 2003
#5 A 2004
#6 A 2005
#7 B 2002
#8 B 2003
#9 B 2004
Or another option is complete
df %>%
group_by(var) %>%
complete(start_year = start_year:end_year) %>%
select(var, year = start_year)
Or in base R with stack and Map
stack(setNames(do.call(Map, c(f = `:`, df[-1])), df$var))
NOTE: First posted the solution with Map and stack
In case of other variations,
stack(setNames(Map(`:`, df[[2]], df[[3]]), df$var))
stack(setNames(do.call(mapply, c(FUN = `:`, df[-1])), df$var))
A short base R solution with seq.
stack(setNames(Map(seq, df[[2]], df[[3]]), df[[1]]))
# values ind
# 1 2000 A
# 2 2001 A
# 3 2002 A
# 4 2003 A
# 5 2004 A
# 6 2005 A
# 7 2002 B
# 8 2003 B
# 9 2004 B
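The stack output uses the default column names values and ind. A minimal sketch, assuming the column names requested in the question (var and year), of reshaping it into the desired layout:

```r
df <- data.frame(var = c("A", "B"),
                 start_year = c(2000, 2002),
                 end_year = c(2005, 2004))

# One sequence per row, named by var, then stacked into long form
long <- stack(setNames(Map(seq, df$start_year, df$end_year),
                       as.character(df$var)))

# Rename and reorder to match the layout asked for in the question
newdf <- data.frame(var = as.character(long$ind), year = long$values)
newdf
```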
Data
df <- structure(list(var = structure(1:2, .Label = c("A", "B"), class = "factor"),
start_year = c(2000, 2002), end_year = c(2005, 2004)), class = "data.frame", row.names = c(NA,
-2L))

Merge two dataframes and create multiple columns in R

Suppose that we have two data frames as shown below:
df1 <- data.frame(Team1 = c("A","B","C"), Team2 = c("D","E","F"), Winner = c("A","E","F"))
df2 <- data.frame(Country = c("A","B","C","D","E","F"), Index = c(1,2,3,4,5,6))
What I want is to add three columns to df1, named Team1_index, Team2_index, and Winner_index, with the values looked up in df2:
Team1 Team2 Winner Team1_index Team2_index Winner_index
A D A 1 4 1
B E E 2 5 5
C F F 3 6 6
I tried many ways but failed. Tips and advice!
If you just have a small number of columns, you can use the match function as in the example:
df1$Team1_index <- df2$Index[match(df1$Team1, df2$Country)]
df1$Team2_index <- df2$Index[match(df1$Team2, df2$Country)]
df1$Winner_index <- df2$Index[match(df1$Winner, df2$Country)]
df1
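A related base R idiom is a named lookup vector, which avoids repeating the match call for each column. This is only a sketch; the name lookup is an assumption introduced here, and the columns are created as character (stringsAsFactors = FALSE) so that indexing by name works.

```r
df1 <- data.frame(Team1 = c("A", "B", "C"),
                  Team2 = c("D", "E", "F"),
                  Winner = c("A", "E", "F"),
                  stringsAsFactors = FALSE)
df2 <- data.frame(Country = c("A", "B", "C", "D", "E", "F"),
                  Index = c(1, 2, 3, 4, 5, 6),
                  stringsAsFactors = FALSE)

# Named vector: names are countries, values are their indices
lookup <- setNames(df2$Index, df2$Country)

# Index the lookup by each team column and store the result alongside it
for (col in c("Team1", "Team2", "Winner")) {
  df1[[paste0(col, "_index")]] <- unname(lookup[df1[[col]]])
}
df1
```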
If you have more columns, you may look for more systematic solutions, but if it's really just three cases, this should do:
library("tidyverse")
df1 <- data.frame(Team1 = c("A","B","C"), Team2 = c("D","E","F"), Winner = c("A","E","F"))
df2 <- data.frame(Country = c("A","B","C","D","E","F"), Index = c(1,2,3,4,5,6))
df1 %>%
left_join(df2 %>% rename(Team1 = Country), by = "Team1") %>%
rename(Team1_Index = Index) %>%
left_join(df2 %>% rename(Team2 = Country), by = "Team2") %>%
rename(Team2_Index = Index) %>%
left_join(df2 %>% rename(Winner = Country), by = "Winner") %>%
rename(Winner_Index = Index)
#> Warning: Column `Team1` joining factors with different levels, coercing to
#> character vector
#> Warning: Column `Team2` joining factors with different levels, coercing to
#> character vector
#> Warning: Column `Winner` joining factors with different levels, coercing to
#> character vector
#> Team1 Team2 Winner Team1_Index Team2_Index Winner_Index
#> 1 A D A 1 4 1
#> 2 B E E 2 5 5
#> 3 C F F 3 6 6
You can safely ignore the warnings.
To get new columns as factors :
df1[paste0(colnames(df1),"_index")] <- lapply(df1,factor,df2$Country,df2$Index)
# Team1 Team2 Winner Team1_index Team2_index Winner_index
# 1 A D A 1 4 1
# 2 B E E 2 5 5
# 3 C F F 3 6 6
To get new columns as numeric :
df1[paste0(colnames(df1),"_index")] <-
lapply(df1,function(x) as.numeric(as.character(factor(x,df2$Country,df2$Index))))
# Team1 Team2 Winner Team1_index Team2_index Winner_index
# 1 A D A 1 4 1
# 2 B E E 2 5 5
# 3 C F F 3 6 6
Note that for this specific case (index from 1 incremented by 1), this shorter version works:
df1[paste0(colnames(df1),"_index")] <-
lapply(df1,function(x) as.numeric(factor(x,df2$Country)))
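To see why the shorter version is limited to indices 1, 2, 3, ..., here is a small sketch with made-up, non-consecutive index values (the numbers 10 to 60 are purely illustrative). as.numeric on a factor returns the level positions, whereas going through as.character recovers the labels.

```r
countries <- c("A", "B", "C", "D", "E", "F")
index <- c(10, 20, 30, 40, 50, 60)  # illustrative values, not from the question
x <- c("A", "E", "F")

# levels = what to look for, labels = what to show instead
f <- factor(x, levels = countries, labels = index)

via_labels <- as.numeric(as.character(f))  # the labels: 10 50 60
via_codes  <- as.numeric(f)                # the level positions: 1 5 6
```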
I have a partial solution with data.table, using melt and dcast to change shape
library(data.table)
df1 <- data.table(Team1 = c("A","B","C"), Team2 = c("D","E","F"), Winner = c("A","E","F"))
df2 <- data.table(Country = c("A","B","C","D","E","F"), Index = c(1,2,3,4,5,6))
plouf <- merge(df2,melt(df1,measure = 1:2), by.x = "Country", by.y = "value")
plouf[,winneridx := Index[Country == Winner]]
dcast(plouf,Country+winneridx~variable,value.var = "Index")
Country winneridx Team1 Team2
1: A 1 1 NA
2: B 5 2 NA
3: C 6 3 NA
4: D 1 NA 4
5: E 5 NA 5
6: F 6 NA 6
This is basically the same as giocomai's answer, just uses purrr to help eliminate duplication:
library(rlang)
library(dplyr)
getIndexCols <- function(df1, df2, colName){
idxColName <- sym(paste0(colName, "_Index"))
df1 %>% left_join(df2 %>% rename(!! sym(colName) := Country, !! idxColName := Index))
}
names(df1) %>% purrr::map(~ getIndexCols(df1, df2, .)) %>% purrr::reduce(~ left_join(.x, .y))
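You can use chartr. This takes both the Country column and the Index column into account (note that chartr translates character by character, so it only works when every country and every index is a single character):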
df3=as.matrix(setNames(df1,paste0(names(df1),"_index")))
cbind(df1,chartr(paste0(df2$Country,collapse=""),paste0(df2$Index,collapse=""),df3))
Team1 Team2 Winner Team1_index Team2_index Winner_index
1 A D A 1 4 1
2 B E E 2 5 5
3 C F F 3 6 6
You can also do:
cbind(df1,do.call(chartr,c(as.list(sapply(unname(df2),paste,collapse="")),list(df3))))
Team1 Team2 Winner Team1_index Team2_index Winner_index
1 A D A 1 4 1
2 B E E 2 5 5
3 C F F 3 6 6
Here is another option for you that uses match and cbind.
df3 <- as.matrix(df1)
colnames(df3) <- paste0(colnames(df3), "_index")
# match the positions
df3[] <- match(df3, df2$Country)
cbind(df1, df3)
# Team1 Team2 Winner Team1_index Team2_index Winner_index
#1 A D A 1 4 1
#2 B E E 2 5 5
#3 C F F 3 6 6
df3 is created as a matrix, i.e. a vector with dimensions attribute, such that we can replace its entries with the result of match (a vector) right away and don't need to repeat the code for every column.
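The empty-bracket assignment can be illustrated on a small standalone matrix (the names m and pos are assumptions for this sketch). One detail worth noting: because the matrix holds character data, the matched positions are stored as character strings, so a conversion is needed if numeric values are wanted.

```r
m <- matrix(c("A", "B", "C", "D", "E", "F"), nrow = 3,
            dimnames = list(NULL, c("x_index", "y_index")))

# match() works on the underlying vector and returns a plain integer vector
pos <- match(m, c("A", "B", "C", "D", "E", "F"))

# m[] <- replaces the entries but keeps dim and dimnames
m[] <- pos
is_char <- is.character(m)   # TRUE: integers were coerced to character

# Convert the storage type while keeping the matrix attributes
mode(m) <- "numeric"
m
```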
Or in one go
df1[paste0(colnames(df1), "_index")] <- match(as.matrix(df1), df2$Country)
Note however, that this ignores the index column of df2.
Thanks to @Moody_Mudskipper, we could also write this more generally as
df1[paste0(colnames(df1), "_index")] <- lapply(df1, function(x) df2$Index[match(x, df2$Country)])

Expand data frame with intervening observations

I am trying to expand a data frame in R with missing observations that are not immediately obvious. Here is what I mean:
data.frame(id = c("a","b"),start = c(2002,2004), end = c(2005,2007))
Which is:
id start end
1 a 2002 2005
2 b 2004 2007
What I would like is a new data frame with 8 total observations, 4 each for "a" and "b", and a year that is one of the values between start and end (inclusive). So:
id year
a 2002
a 2003
a 2004
a 2005
b 2004
b 2005
b 2006
b 2007
As I understand, various versions of expand only work on unique values, but here my data frame doesn't have all the unique values (explicitly).
I was thinking to step through each row and then generate a data frame with sapply(), then join all the new data frames together. But this attempt fails:
sapply(test,function(x) { data.frame( id=rep(id,x[["end"]]-x[["start"]]), year = x[["start"]]:x[["end"]] )})
I know there must be some dplyr or other magic to solve this problem!
You could use tidyr and dplyr
library(tidyr)
library(dplyr)
df %>%
gather(key = key, value = year, -id) %>%
select(-key) %>%
group_by(id) %>%
complete(year = full_seq(year,1))
# A tibble: 8 x 2
# Groups: id [2]
id year
<fct> <dbl>
1 a 2002
2 a 2003
3 a 2004
4 a 2005
5 b 2004
6 b 2005
7 b 2006
8 b 2007
Using dplyr and tidyr, I make a new column which contains the list of years, then unnest the dataframe.
library(tidyr)
library(dplyr)
df <-
data.frame(
id = c("a", "b"),
start = c(2002, 2004),
end = c(2005, 2007)
)
df %>%
rowwise() %>%
mutate(year = list(seq(start, end))) %>%
select(-start, -end) %>%
unnest(year)
Output
# A tibble: 8 x 2
id year
<fct> <int>
1 a 2002
2 a 2003
3 a 2004
4 a 2005
5 b 2004
6 b 2005
7 b 2006
8 b 2007
An easy solution with data.table:
library(data.table)
# option 1
setDT(df)[, .(year = seq(start, end)), by = id]
# option 2
setDT(df)[, .(year = start:end), by = id]
which gives:
id year
1: a 2002
2: a 2003
3: a 2004
4: a 2005
5: b 2004
6: b 2005
7: b 2006
8: b 2007
An approach with base R:
lst <- Map(seq, df$start, df$end)
data.frame(id = rep(df$id, lengths(lst)), year = unlist(lst))
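The row-by-row idea described in the question can also be made to work. The original attempt fails partly because sapply over a data frame iterates over its columns, not its rows. A minimal sketch that iterates over row indices instead (the names pieces and expanded are assumptions for this example):

```r
df <- data.frame(id = c("a", "b"),
                 start = c(2002, 2004),
                 end = c(2005, 2007),
                 stringsAsFactors = FALSE)

# Build one small data frame per row; id is recycled to the sequence length
pieces <- lapply(seq_len(nrow(df)), function(i) {
  data.frame(id = df$id[i], year = df$start[i]:df$end[i])
})

# Stack the per-row pieces into a single data frame
expanded <- do.call(rbind, pieces)
expanded
```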

Fill missing values using last or previous observation R

Suppose I have the following table:
ID Name Country
1 A Nor
2 B Nor
3 C Nor
4 D Nor
and I have another table:
ID Name Country
1 A
2 Bel
3 Bel
4 Bel
the result I want to get is:
ID Name Country
1 A Nor
2 B Bel
3 C Bel
4 D Bel
Basically I would like to create a third table that gives priority to the second table but fills its missing fields from the first table, based on ID. Any help on how to do this in base R will be much appreciated.
You can get the logical vector representing the locations of the NA values using is.na(df2).
You can then set the NA elements of df2 to be the corresponding elements in df.
df <- data.frame(
ID = 1:4,
Name = LETTERS[1:4],
Country = "Nor",
stringsAsFactors = FALSE)
df2 <- data.frame(
ID = 1:4,
Name = c("A", NA, NA, NA),
Country = c(NA, "Bel", "Bel", "Bel"),
stringsAsFactors = FALSE)
df2[is.na(df2)] <- df[is.na(df2)]
df2
#> ID Name Country
#> 1 1 A Nor
#> 2 2 B Bel
#> 3 3 C Bel
#> 4 4 D Bel
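The logical-matrix assignment relies on both tables having their rows in the same order. If that cannot be guaranteed, one possible variation is to align the reference table by ID with match first. This is a sketch with a deliberately shuffled df2; the names ref and miss are assumptions for this example.

```r
df <- data.frame(ID = 1:4, Name = LETTERS[1:4], Country = "Nor",
                 stringsAsFactors = FALSE)
# Same content as the second table, but with the rows shuffled
df2 <- data.frame(ID = c(3, 1, 4, 2),
                  Name = c(NA, "A", NA, NA),
                  Country = c("Bel", NA, "Bel", "Bel"),
                  stringsAsFactors = FALSE)

# Reorder df so that row i of ref has the same ID as row i of df2
ref <- df[match(df2$ID, df$ID), ]

# Fill the NA cells of df2 from the aligned reference table
miss <- is.na(df2)
df2[miss] <- ref[miss]

df2 <- df2[order(df2$ID), ]
df2
```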
You can try a tidyverse solution
library(tidyverse)
d1 %>%
left_join(d2, by="ID") %>%
mutate(Country=case_when(
is.na(Country.y) ~ as.character(Country.x),
TRUE ~ as.character(Country.y)
)) %>%
select(ID, Name=Name.x, Country)
ID Name Country
1 1 A Nor
2 2 B Bel
3 3 C Bel
4 4 D Bel
The case_when part is easily and freely expandable.
Data
d1 <- read.table(text="ID Name Country
1 A Nor
2 B Nor
3 C Nor
4 D Nor", header=TRUE)
d2 <- read.table(text="ID Name Country
1 A NA
2 NA Bel
3 NA Bel
4 NA Bel", header=TRUE)
This assumes that the row order is strictly the same, that df1 and df2 have the same size, and that df1 has all the names defined (if not, you need to go through a left_join). It is not base R, but dplyr is a must-have ;)
df3 <- dplyr::mutate(df1, Country = ifelse(is.na(df2$Country), Country, df2$Country))
This takes df1 as the baseline (so as to keep the Name column) and replaces the Country column with the value from df2 unless that value is NA.
(If you have already loaded dplyr, remove the dplyr:: prefix.)
with df2
ID Name Country
1 A
2 Bel
3 Bel
4 Bel
and df1
ID Name Country
1 A Nor
2 B Nor
3 C Nor
4 D Nor
PS: I voted for @Paul for the base solution ... very neat.
