Merge two data frames by row and column names and by group - r

I have two data frames, df1 and df2, that look as follows:
df1<- data.frame(year, week, X1, X2)
df1
year week X1 X2
1 2010 1 2 3
2 2010 2 8 6
3 2011 1 7 5
firm<-c("X1", "X1", "X2")
year <- c(2010,2010,2011)
week<- c(1, 2, 1)
cost<-c(10,30,20)
df2<- data.frame(firm,year, week, cost)
df2
firm year week cost
1 X1 2010 1 10
2 X1 2010 2 30
3 X2 2011 1 20
I'd like to merge these so the final result (i.e. df3) looks as follows:
df3
firm year week cost Y
1 X1 2010 1 10 2
2 X1 2010 2 30 8
3 X2 2011 1 20 5
Where "Y" is a new variable that reflects the values of X1 and X2 for a particular year and week found in df1.
Is there a way to do this in R? Thank you in advance for your reply.

We can reshape the first dataset to 'long' format and then join it with the second dataset:
library(dplyr)
library(tidyr)
df1 %>%
pivot_longer(cols = X1:X2, values_to = 'Y', names_to = 'firm') %>%
right_join(df2)
-output
# A tibble: 3 x 5
# year week firm Y cost
# <dbl> <dbl> <chr> <int> <dbl>
#1 2010 1 X1 2 10
#2 2010 2 X1 8 30
#3 2011 1 X2 5 20
data
df1 <- structure(list(year = c(2010L, 2010L, 2011L), week = c(1L, 2L,
1L), X1 = c(2L, 8L, 7L), X2 = c(3L, 6L, 5L)), class = "data.frame",
row.names = c("1",
"2", "3"))
df2 <- structure(list(firm = c("X1", "X1", "X2"), year = c(2010, 2010,
2011), week = c(1, 2, 1), cost = c(10, 30, 20)), class = "data.frame",
row.names = c(NA,
-3L))

Here is a base R option (borrowing the data from @akrun, thanks!):
q <- startsWith(names(df1),"X")
v <- cbind(df1[!q],stack(df1[q]),row.names = NULL)
df3 <- merge(setNames(v,c(names(df1)[!q],"Y","firm")),df2)
which gives
> df3
year week firm Y cost
1 2010 1 X1 2 10
2 2010 2 X1 8 30
3 2011 1 X2 5 20
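A lookup by position is another base R possibility that avoids reshaping. The sketch below is only an illustration under stated assumptions: it rebuilds the example data inline, the helper names row_idx and col_idx are made up for the example, and it assumes that each (year, week) pair is unique in df1 and that the value columns of df1 are all numeric. It matches each row of df2 to a row of df1 by year and week, matches firm to a column name of df1, and then extracts the values with a two-column index matrix:

```r
# Example data rebuilt from the question (values copied from the printed tables)
df1 <- data.frame(year = c(2010, 2010, 2011), week = c(1, 2, 1),
                  X1 = c(2, 8, 7), X2 = c(3, 6, 5))
df2 <- data.frame(firm = c("X1", "X1", "X2"), year = c(2010, 2010, 2011),
                  week = c(1, 2, 1), cost = c(10, 30, 20))

# Row of df1 that has the same year and week as each row of df2
row_idx <- match(paste(df2$year, df2$week), paste(df1$year, df1$week))
# Column of df1 whose name equals the firm in each row of df2
col_idx <- match(df2$firm, names(df1))

# Index the numeric matrix with (row, column) pairs, one pair per row of df2
df3 <- df2
df3$Y <- as.matrix(df1)[cbind(row_idx, col_idx)]
df3
```

A row of df2 with no matching year/week or firm ends up with NA in Y, which is roughly what a left join from df2 would give.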

Related

Use an if inside loop to replace the data between two dataframe

I have two files and want to transfer data from one to the other after doing a test
File1:
ID, X1, X2, X3
2000, 1, 2, 3
2001, 3, 4, 5
1999, 2, 5, 6
2003, 3, 5, 4
File2:
ID, X1, X2, X3,
2000,
2001,
2002,
2003,
Result file will be like:
1999 "There is an error"
File2:
ID, X1, X2, X3
2000, 1, 2, 3
2001, 3, 4, 5
2002, Na, Na, Na
2003, 3, 5, 4
I tried to use a for loop with an if, but unfortunately it doesn't work:
for(j in length(1: nrows(file1){
for(i in length(1: nrows(file2){
if( file1&ID[j]>= file2&ID[j+1]){
print(j, ' wrong value')
esle
file2[i,]<- file1[j,]
break
I would appreciate any ideas or code that would help me get something similar to the result file.
There is no need to iterate with loops; you can simply use right_join from the dplyr package:
library(dplyr)
df1 %>%
right_join(df2, by="ID") %>%
arrange(ID)
ID X1 X2 X3
1 2000 1 2 3
2 2001 3 4 5
3 2002 NA NA NA
4 2003 3 5 4
Sample data
df1 <- structure(list(ID = c(2000L, 2001L, 1999L, 2003L), X1 = c(1L,
3L, 2L, 3L), X2 = c(2L, 4L, 5L, 5L), X3 = c(3L, 5L, 6L, 4L)), class = "data.frame", row.names = c(NA,
-4L))
df2 <- structure(list(ID = 2000:2003), class = "data.frame", row.names = c(NA,
-4L))
Using data.table
library(data.table)
setDT(df2)[df1, names(df1)[-1] := mget(paste0("i.", names(df1)[-1])), on = .(ID)]
-output
> df2
ID X1 X2 X3
1: 2000 1 2 3
2: 2001 3 4 5
3: 2002 NA NA NA
4: 2003 3 5 4
Here is a slightly different approach, which does not give the exact expected output. Note that the ID 1999 is kept in the data frame:
coalesce_by_column <- function(df) {
return(coalesce(df[1], df[2]))
}
bind_rows(df1, df2) %>%
group_by(ID) %>%
summarise_all(coalesce_by_column)
ID X1 X2 X3
<int> <int> <int> <int>
1 1999 2 5 6
2 2000 1 2 3
3 2001 3 4 5
4 2002 NA NA NA
5 2003 3 5 4
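The question also asks for a message about IDs that exist in File1 but not in File2 (1999 in the example). A minimal base R sketch of both steps follows; the object names file1, file2, unmatched and result are assumptions made for the illustration, and the data are retyped from the question:

```r
# Data retyped from the question; file2 only has the ID column filled in
file1 <- data.frame(ID = c(2000, 2001, 1999, 2003),
                    X1 = c(1, 3, 2, 3), X2 = c(2, 4, 5, 5), X3 = c(3, 5, 6, 4))
file2 <- data.frame(ID = 2000:2003)

# IDs present in file1 but absent from file2: report them instead of copying
unmatched <- file1$ID[!file1$ID %in% file2$ID]
for (id in unmatched) message(id, " There is an error")

# Keep every ID of file2; IDs without a partner in file1 get NA
result <- merge(file2, file1, by = "ID", all.x = TRUE)
result
```

merge() sorts the result by the key, so no explicit ordering step is needed here.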

Rearrangement columns of a table in R

I have the following table that I want to modify
Debt2017 Debt2018 Debt2019 Cash2017 Cash2018 Cash2019 Year Other
2 4 3 5 6 7 2018 x
3 8 9 7 9 9 2017 y
So that the result is the following
Debt Cash FLAG After Other
2 5 0 x
3 7 1 x
8 9 1 y
9 9 1 y
Basically, I want to change the data so that I have the different years in different rows, eliminating the values for the year indicated in the column "Year" and adding a FLAG that tells me whether the data indicated in the row is from a previous (0) or following (1) year (with respect to the year indicated in the column "Year").
Furthermore, I also want to keep the column "Other".
Does anybody know how to do it in R?
library(dplyr)
library(tidyr)
df %>%
pivot_longer(Debt2017:Cash2019,
names_to = c(".value", "Year2"),
names_pattern = "(\\D+)(\\d+)") %>%
filter(Year != Year2) %>%
mutate(flag = +(Year2 > Year))
# # A tibble: 4 × 6
# Year Other Year2 Debt Cash flag
# <int> <chr> <chr> <int> <int> <int>
# 1 2018 x 2017 2 5 0
# 2 2018 x 2019 3 7 1
# 3 2017 y 2018 8 9 1
# 4 2017 y 2019 9 9 1
Data
df <- structure(list(Debt2017 = 2:3, Debt2018 = c(4L, 8L), Debt2019 = c(3L, 9L),
Cash2017 = c(5L, 7L), Cash2018 = c(6L, 9L), Cash2019 = c(7L, 9L),
Year = 2018:2017, Other = c("x", "y")), class = "data.frame", row.names = c(NA, -2L))
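For readers without tidyr, the same steps can be sketched with base reshape(). This is an illustration only: it assumes the years are 2017 to 2019 as in the example and that the column Other uniquely identifies each row (it is used as the id variable); the names years and long are made up for the example.

```r
# Example data retyped from the question
df <- data.frame(
  Debt2017 = c(2, 3), Debt2018 = c(4, 8), Debt2019 = c(3, 9),
  Cash2017 = c(5, 7), Cash2018 = c(6, 9), Cash2019 = c(7, 9),
  Year = c(2018, 2017), Other = c("x", "y")
)
years <- 2017:2019

# One row per original row and year, with Debt and Cash as value columns
long <- reshape(
  df,
  direction = "long",
  varying = list(paste0("Debt", years), paste0("Cash", years)),
  v.names = c("Debt", "Cash"),
  timevar = "Year2",
  times = years,
  idvar = "Other"
)

# Drop the reference year, then flag later years with 1 and earlier years with 0
long <- long[long$Year2 != long$Year, ]
long$flag <- as.integer(long$Year2 > long$Year)
long <- long[order(long$Other, long$Year2), ]
rownames(long) <- NULL
long
```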

replace certain values in df according to several conditions

This is a basic question, but I am looking for a clean solution (no for loops) for conditionally replacing values in DF1 with values from DF2 if several conditions are fulfilled:
DF1
Name Year Val1
A 2010 x1
A 2012 x2
B 2012 x3
C 2015 x4
C 2012 x5
DF2
Name Year Val1
A 2012 y1
B 2012 y2
C 2012 y3
If Year is of a certain value such as 2012 in this case and the Name of DF1 and DF2 are the same then assign Val1 from DF2 to DF1.
I tried several things:
DF1$Val1[DF1$Year=="2012"&DF1$Name==DF2$Name,] <-DF2$Val1
DF1$Val1<-replace(DF1$Val1, DF1$Year=="2012" & DF1$Name==DF2$Name, DF2$Val1)
But I unfortunately get an error because DF1 and DF2 are not of the same length.
Expected:
DF1
Name Year Val1
A 2010 x1
A 2012 y1
B 2012 y2
C 2015 x4
C 2012 y3
THANK YOU FOR YOUR HELP!
We can use a join on the columns with data.table and update 'Val1'
library(data.table)
setDT(DF1)[DF2, Val1 := i.Val1, on = .(Name, Year)]
DF1
# Name Year Val1
#1: A 2010 x1
#2: A 2012 y1
#3: B 2012 y2
#4: C 2015 x4
#5: C 2012 y3
data
DF1 <- structure(list(Name = c("A", "A", "B", "C", "C"), Year = c(2010L,
2012L, 2012L, 2015L, 2012L), Val1 = c("x1", "x2", "x3", "x4",
"x5")), class = "data.frame", row.names = c(NA, -5L))
DF2 <- structure(list(Name = c("A", "B", "C"), Year = c(2012L, 2012L,
2012L), Val1 = c("y1", "y2", "y3")), class = "data.frame", row.names = c(NA,
-3L))
I think the easiest way to do this is to filter DF2 down and then append it to DF1.
So
DF2 <- dplyr::filter(DF2, Year==2012,
Name %in% unique(DF1$Name))
# bind_rows() appends the DF2 rows; the original 2012 rows of DF1 are still present
DF1 <- dplyr::bind_rows(DF1, DF2)
Here are two base R solutions.
- Using match:
inds <- match(data.frame(t(DF2[-3]),stringsAsFactors = FALSE),
data.frame(t(DF1[-3]),stringsAsFactors = FALSE))
DF1$Val1[inds] <- DF2$Val1
such that
> DF1
Name Year Val1
1 A 2010 x1
2 A 2012 y1
3 B 2012 y2
4 C 2015 x4
5 C 2012 y3
- Using merge + subset:
DF1 <- subset(within(merge(DF1,DF2,by=c("Name","Year"),all.x = TRUE),
Val1 <- ifelse(is.na(Val1.y),Val1.x,Val1.y)),
select = names(DF1))
such that
> DF1
Name Year Val1
1 A 2010 x1
2 A 2012 y1
3 B 2012 y2
4 C 2012 y3
5 C 2015 x4
We can left_join DF1 and DF2 on Name and Year and use coalesce to select non-NA values from the two Val1 columns.
library(dplyr)
DF1 %>%
left_join(DF2, by = c('Name', 'Year')) %>%
mutate(Val1 = coalesce(Val1.y, Val1.x)) %>%
select(names(DF1))
# Name Year Val1
#1 A 2010 x1
#2 A 2012 y1
#3 B 2012 y2
#4 C 2015 x4
#5 C 2012 y3
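The update can also be written as a key lookup that keeps the row order of DF1 and needs no packages. The sketch below is one possible variant, not the method of any answer above; the names key1, key2, idx and hit are made up for the illustration, and it assumes each (Name, Year) pair occurs at most once in DF2.

```r
# Example data retyped from the question
DF1 <- data.frame(Name = c("A", "A", "B", "C", "C"),
                  Year = c(2010, 2012, 2012, 2015, 2012),
                  Val1 = c("x1", "x2", "x3", "x4", "x5"),
                  stringsAsFactors = FALSE)
DF2 <- data.frame(Name = c("A", "B", "C"),
                  Year = c(2012, 2012, 2012),
                  Val1 = c("y1", "y2", "y3"),
                  stringsAsFactors = FALSE)

# Build one key per row from the columns that must agree
key1 <- paste(DF1$Name, DF1$Year)
key2 <- paste(DF2$Name, DF2$Year)

# For each row of DF1, the position of its partner in DF2 (NA if there is none)
idx <- match(key1, key2)
hit <- !is.na(idx)

# Overwrite only the rows that found a partner
DF1$Val1[hit] <- DF2$Val1[idx[hit]]
DF1
```

Because the lookup runs from DF1 to DF2, rows of DF2 that have no partner in DF1 are simply ignored.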

Create columns from aggregated row data in R

I have a data frame that contains historical price returns. The data is organized with date columns and many asset columns (denoted A1, A2, ...). Each asset column contains the price return for each unique historical date. I would like to process this data to create a data frame with many asset columns and only one row, where that row contains the average of the original rows for each new column. The new columns need headers that consist of the original asset name concatenated with the date information. A simplified example of the original data follows:
> df <- read.csv("data.csv", header=T)
> df
Year Month A1 A2 A3
1 2015 Jan 1 1 1
2 2015 Feb 2 2 2
3 2015 Mar 3 3 3
4 2016 Jan 1 1 1
5 2016 Feb 2 2 2
6 2016 Mar 3 3 3
I used simple repeating numbers for the returns here. I am using a function that requires the data to be organized as follows:
> df2 <- read.csv("data2.csv", header=T)
> df2
Returns A1.Jan A1.Feb A1.Mar A2.Jan A2.Feb A2.Mar A3.Jan A3.Feb A3.Mar
1 Average 1 2 3 1 2 3 1 2 3
For clarity, A1.Jan contains the average of all Year's Jan returns. Thanks in advance for the insight and/or solution.
Take a look at the base function reshape. This is basically the same task as is solved by the last example on its help page:
reshape(df, idvar="Year", direction="wide", timevar="Month")
Year A1.Jan A2.Jan A3.Jan A1.Feb A2.Feb A3.Feb A1.Mar A2.Mar A3.Mar
1 2015 1 1 1 2 2 2 3 3 3
4 2016 1 1 1 2 2 2 3 3 3
You wanted the Year variable to remain as a column identifier but wanted the Month variable to act as a sequence that gets spread "wide".
With data.table you can do
library(data.table)
setDT(df)
df[, lapply(.SD, mean), .SDcols = names(df)[grep("^A", names(df))], by = Month
][, Returns := "Average"
][, melt(.SD, id = c("Month", "Returns"))
][, dcast(.SD, Returns ~ variable + Month, value.var = 'value', sep = ".")]
# Returns A1.Feb A1.Jan A1.Mar A2.Feb A2.Jan A2.Mar A3.Feb A3.Jan A3.Mar
#1: Average 2 1 3 2 1 3 2 1 3
In the first line we aggregate the data by Month. The part names(df)[grep("^A", names(df))] ensures that we only aggregate variables that start with the letter "A".
The second line creates the variable Returns, which contains the value "Average".
melt gathers your data into long format, and dcast finally spreads it into the desired output.
data
df <- structure(list(Year = c(2015L, 2015L, 2015L, 2016L, 2016L, 2016L
), Month = c("Jan", "Feb", "Mar", "Jan", "Feb", "Mar"), A1 = c(1L,
2L, 3L, 1L, 2L, 3L), A2 = c(1L, 2L, 3L, 1L, 2L, 3L), A3 = c(1L,
2L, 3L, 1L, 2L, 3L)), .Names = c("Year", "Month", "A1", "A2",
"A3"), class = "data.frame", row.names = c("1", "2", "3", "4",
"5", "6"))
Here's a tidyverse solution. I factored the months so they can be ordered, then used tidyr::gather() to convert into long format so I could dplyr::group_by() by month to dplyr::summarise() to find the average:
library(dplyr)
library(tidyr)
df <- read.table(text = "
Year Month A1 A2 A3
1 2015 Jan 1 1 1
2 2015 Feb 2 2 2
3 2015 Mar 3 3 3
4 2016 Jan 1 1 1
5 2016 Feb 2 2 2
6 2016 Mar 3 3 3", header = T) %>%
as_tibble() # tbl_df() is deprecated in recent dplyr versions
df$Month <- df$Month %>%
factor(levels = format(ISOdate(2000, 1:12, 1), "%b"))
df_tidy <- df %>%
gather(asset, value, -Year, -Month) %>%
group_by(Month, asset) %>%
summarise(Average = mean(value)) %>%
arrange(asset, Month)
df_tidy
# # A tibble: 9 x 3
# # Groups: Month [3]
# Month asset Average
# <fct> <chr> <dbl>
# 1 Jan A1 1
# 2 Feb A1 2
# 3 Mar A1 3
# 4 Jan A2 1
# 5 Feb A2 2
# 6 Mar A2 3
# 7 Jan A3 1
# 8 Feb A3 2
# 9 Mar A3 3
# convert to wide format, as in OP - not sure of 'easy' way
# to order columns by asset.month other than using 'select()'
# (it currently sorts alphabetically).
df_tidy %>%
unite(Returns, c(asset, Month), sep = ".") %>%
spread(Returns, Average)
# # A tibble: 1 x 9
# A1.Feb A1.Jan A1.Mar A2.Feb A2.Jan A2.Mar A3.Feb A3.Jan A3.Mar
# <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 2 1 3 2 1 3 2 1 3
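If the column order of the question (Jan, Feb, Mar within each asset) matters, one way to get it is to fix the month order before aggregating. The following base R sketch is an illustration under assumptions: the months should appear in the order in which they first occur in the data, all asset columns start with "A", and the names months, assets, agg, vals and wide are made up for the example.

```r
# Example data retyped from the question
df <- data.frame(Year = rep(c(2015, 2016), each = 3),
                 Month = rep(c("Jan", "Feb", "Mar"), times = 2),
                 A1 = rep(1:3, 2), A2 = rep(1:3, 2), A3 = rep(1:3, 2))

months <- unique(df$Month)                      # order of first appearance
assets <- grep("^A", names(df), value = TRUE)   # asset columns only

# Average every asset per month; the factor levels fix the row order
agg <- aggregate(df[assets],
                 by = list(Month = factor(df$Month, levels = months)),
                 FUN = mean)

# Flatten column by column: all months of A1, then all months of A2, ...
vals <- as.vector(as.matrix(agg[assets]))
names(vals) <- paste(rep(assets, each = nrow(agg)), agg$Month, sep = ".")

# One-row data frame with the label column first
wide <- cbind(data.frame(Returns = "Average"), as.data.frame(as.list(vals)))
wide
```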

Create Binary Variable based on lag in R

I want to create a binary/indicator variable based on a lagged observation. I have a variable X1. The raw data looks like the sample below; the original data has close to 10K records.
X1
Diagnosis
1
2
3
4
Treatment
1
2
3
I want the output to look like this :
X1 NewVar
Diagnosis Diagnosis
1 Diagnosis
2 Diagnosis
3 Diagnosis
4 Diagnosis
Treatment Treatment
1 Treatment
2 Treatment
3 Treatment
Any help would be highly appreciated!
You can achieve this with cumsum, which can create a new group each time Diagnosis or Treatment appears. NewVar in each group then takes the value of the first X1 in that group:
library(dplyr)
dtf %>%
mutate(g = cumsum(X1 == 'Diagnosis' | X1 == 'Treatment')) %>%
group_by(g) %>%
mutate(NewVar = X1[1]) %>%
ungroup() %>% select(-g)
# # A tibble: 9 x 2
# X1 NewVar
# <fctr> <fctr>
# 1 Diagnosis Diagnosis
# 2 1 Diagnosis
# 3 2 Diagnosis
# 4 3 Diagnosis
# 5 4 Diagnosis
# 6 Treatment Treatment
# 7 1 Treatment
# 8 2 Treatment
# 9 3 Treatment
The dtf used in the code above:
> dput(dtf)
structure(list(X1 = structure(c(5L, 1L, 2L, 3L, 4L, 6L, 1L, 2L,
3L), .Label = c("1", "2", "3", "4", "Diagnosis", "Treatment"), class = "factor")), .Names = "X1", class = "data.frame", row.names = c(NA,
-9L))
Here is an option with data.table. After converting to 'data.table' (setDT(dtf)), get the cumulative sum of the logical vector that marks the 'X1' values consisting of letters only, and assign 'NewVar' as the first element of 'X1' (X1[1]) within each group:
library(data.table)
setDT(dtf)[, NewVar := X1[1], cumsum(grepl('^[A-Za-z]+$', X1))]
dtf
# X1 NewVar
#1: Diagnosis Diagnosis
#2: 1 Diagnosis
#3: 2 Diagnosis
#4: 3 Diagnosis
#5: 4 Diagnosis
#6: Treatment Treatment
#7: 1 Treatment
#8: 2 Treatment
#9: 3 Treatment
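The same cumsum idea also works in base R without grouping. The sketch below is an illustration only: it assumes X1 is stored as character (not factor), that label rows consist of letters only, and that the first row is a label row; the name is_header is made up for the example.

```r
# Example data retyped from the question, stored as character
dtf <- data.frame(X1 = c("Diagnosis", "1", "2", "3", "4",
                         "Treatment", "1", "2", "3"),
                  stringsAsFactors = FALSE)

# TRUE on the rows that hold a label rather than a number
is_header <- grepl("^[A-Za-z]+$", dtf$X1)

# cumsum() counts how many labels have been seen so far, which is exactly
# the position of the current label within the vector of labels
dtf$NewVar <- dtf$X1[is_header][cumsum(is_header)]
dtf
```

If the data could start with a numeric row, the counter would be 0 there and the indexing would drop those rows, so that case needs separate handling.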
