Replace a part of dataframe with new data - r

I have data1 and data2, and I need data3, that replaces certain regions of data1 with data2.
I use the method below to update the data, but several columns actually need to be updated, and repeating this for each one would be tedious.
Do you know a simpler way?
library(tidyverse)
library(lubridate)
data1 <- tibble(date=date("2017-11-1") + c(1:10),
a=sample(100,10),b=sample(100,10))
data2 <- tibble(date=date("2017-11-1") + c(1:8),
a=sample(100,8))
data_bind <- left_join(data1, data2, by=("date"))
data_bind$a.x[!is.na(data_bind$a.y)] <- data_bind$a.y[!is.na(data_bind$a.y)]
data_bind %>% select(-a.y) %>% dplyr::rename(a=a.x)

In my opinion, the data.table-package is better suited for such a task. Using:
# create a vector with names from 'data2' that are not used to join by
nms <- names(data2)[-1]
# load the 'data.table'-package
library(data.table)
# convert the data frames to data.tables (by reference)
setDT(data1)
setDT(data2)
# join and update the column in 'data1' with the matching values from 'data2'
data1[data2, on = 'date', (nms) := mget(paste0('i.',nms))][]
gives:
date a b
1: 2017-11-02 21 11
2: 2017-11-03 22 12
3: 2017-11-04 23 13
4: 2017-11-05 24 14
5: 2017-11-06 25 15
6: 2017-11-07 26 16
7: 2017-11-08 27 17
8: 2017-11-09 28 18
9: 2017-11-10 9 19
10: 2017-11-11 10 20
What this does:
With setDT(data1) you convert the dataframes/tibbles to a data.table.
With data1[data2, on = 'date'] you can do a join the data.table-way.
By adding (nms) := mget(paste0('i.',nms)) to the join, you tell data.table to update the columns in data1 with the columns that are also present in data2 only where the dates match.
As an alternative approach you could also reshape both datasets into long format and then do the join:
library(data.table)
melt(data1, id = 'date')[melt(data2, id = 'date')
, on = .(date, variable)
, value := i.value
][, dcast(.SD, date ~ variable)]
A translation of this approach to the tidyverse:
library(dplyr)
library(tidyr)
gather(data1, key, value, -1) %>%
left_join(., gather(data2, key, value, -1), by = c('date','key')) %>%
mutate(value.x = ifelse(!is.na(value.y), value.y, value.x)) %>%
select(date, key, value = value.x) %>%
spread(key, value)
Both will give you the same output.
Used data:
data1 <- data.frame(date = as.Date("2017-11-1") + c(1:10), a = 1:10, b = 11:20)
data2 <- data.frame(date = as.Date("2017-11-1") + c(1:8), a = 21:28)
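The update-on-match idea from the answers above can also be sketched in base R with match(), which may help when several columns have to be overwritten at once. This is only an illustration under stated assumptions: the names key, cols, idx, hit and data3 are introduced here, and the key column is assumed to be unique in data1.

```r
# Toy data, same as the "Used data" above
data1 <- data.frame(date = as.Date("2017-11-01") + 1:10, a = 1:10, b = 11:20)
data2 <- data.frame(date = as.Date("2017-11-01") + 1:8, a = 21:28)

key  <- "date"
cols <- setdiff(names(data2), key)   # columns to overwrite (here just "a")

# For every row of data2, find the matching row of data1 (NA if there is none)
idx <- match(data2[[key]], data1[[key]])
hit <- !is.na(idx)

data3 <- data1                       # work on a copy, data1 stays untouched
data3[idx[hit], cols] <- data2[hit, cols, drop = FALSE]
data3
```

Because cols is derived from names(data2), adding more columns to data2 needs no further code changes. Note that an NA stored in data2 would overwrite the value in data1 in this sketch.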

Related

R - applying calculation pairwise on columns of data frame/data table

Let's say I have the data frames with the same column names
DF1 = data.frame(a = c(0,1), b = c(2,3), c = c(4,5))
DF2 = data.frame(a = c(6,7), c = c(8,9))
and want to apply some basic calculation on them, for example add each column.
Since I also want the goal data frame to display missing data, I appended such a column to DF2, so I have
> DF2
a c b
1 6 8 NA
2 7 9 NA
What I tried here now is to create the data frame
for(i in names(DF2)){
DF3 = data.frame(i = DF1[i] + DF2[i])
}
(and then bind this together) but this obviously doesn't work since the order of the columns is mashed up.
So, what's the best way to do this pairwise calculation when the order of the columns is not the same, without reordering them?
I also tried doing (since this is what I thought would be a fix)
for(i in names(DF2)){
DF3 = data.frame(i = DF1$i + DF2$i)
}
but this doesn't work because DF1$i is NULL for all i.
Conclusion: I want the data frame
>DF3
a b c
1 6+0 NA 4+8
2 1+7 NA 5+9
Any help would be appreciated.
This may help -
#Get column names from DF1 and DF2
all_cols <- union(names(DF1), names(DF2))
#Fill missing columns with NA in both data frames
DF1[setdiff(all_cols, names(DF1))] <- NA
DF2[setdiff(all_cols, names(DF2))] <- NA
#add the two dataframes arranging the columns
DF1[all_cols] + DF2[all_cols]
# a b c
#1 6 NA 12
#2 8 NA 14
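As a side note on why the loops in the question fail: `$` looks for a column literally named "i", while `[[` uses the value stored in the loop variable. A minimal sketch of a name-based loop, assuming a column missing from DF2 should simply become NA (the names nm and DF3 are only illustrative):

```r
DF1 <- data.frame(a = c(0, 1), b = c(2, 3), c = c(4, 5))
DF2 <- data.frame(a = c(6, 7), c = c(8, 9))

i <- "a"
DF1$i      # NULL: `$` looks for a column literally called "i"
DF1[[i]]   # 0 1:  `[[` uses the value stored in i

# Add columns by name, so column order in DF2 does not matter
DF3 <- DF1
for (nm in names(DF1)) {
  DF3[[nm]] <- if (nm %in% names(DF2)) DF1[[nm]] + DF2[[nm]] else NA_real_
}
DF3
```

Looking columns up by name keeps the result in the column order of DF1 without reordering DF2 first.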
We can use bind_rows
library(dplyr)
library(data.table)
bind_rows(DF1, DF2, .id = 'grp') %>%
group_by(grp = rowid(grp)) %>%
summarise(across(everything(), sum), .groups = 'drop') %>%
select(-grp)
-output
# A tibble: 2 x 3
a b c
<dbl> <dbl> <dbl>
1 6 NA 12
2 8 NA 14
Another base R option using aggregate + stack + reshape
aggregate(
. ~ rid,
transform(
reshape(
transform(rbind(
stack(DF1),
stack(DF2)
),
rid = ave(seq_along(ind), ind, FUN = seq_along)
),
direction = "wide",
idvar = "rid",
timevar = "ind"
),
rid = 1:nrow(DF1)
),
sum,
na.action = "na.pass"
)[-1]
gives
values.a values.b values.c
1 6 NA 12
2 8 NA 14

Retrieve values of the data.frame by matching ID and column name

I have a data frame named df1 with four columns (id, s, date and value). The value column is empty, and I want to fill it from a second data frame named df2. df2 has an id column plus many other columns, each named after the date it belongs to. For every row of df1, I need to look up the value in df2 where both the id and the date match.
Example data:
set.seed(123)
#df1
df1 <- data.frame(id = 1:100,
s = runif(100,100,1000),
date = sample(seq(as.Date('1999/01/01'), as.Date('2001/01/01'), by="day"), 100),
value = NA)
#df2
df2 <- data.frame(matrix(runif(80000,1,100), ncol=800, nrow=100))[-1]
names(df2) <- seq(as.Date("1999-01-01"),as.Date("2002-12-31"),1)[c(1:799)]
df2 <- cbind(id = 1:100, df2)
One way is to convert df2 into long format using gather and then do left_join
library(dplyr)
library(tidyr)
df1 %>%
left_join(df2 %>%
gather(date, value, -id) %>%
mutate(date = as.Date(date)), by = c("id", "date"))
# id s date value
#1 1 359 2000-03-15 48.32
#2 2 809 1999-09-01 62.16
#3 3 468 1999-12-23 16.41
#4 4 895 2000-11-26 32.70
#5 5 946 1999-12-18 5.84
#6 6 141 2000-10-09 74.65
#7 7 575 2000-10-25 9.22
#8 8 903 2000-03-17 6.46
#9 9 596 1999-10-25 73.48
#10 10 511 1999-04-17 62.43
#...
data
set.seed(123)
df1 <- data.frame(id = 1:100,
s = runif(100,100,1000),
date = sample(seq(as.Date('1999/01/01'), as.Date('2001/01/01'), by="day"), 100))
df2 <- data.frame(matrix(runif(80000,1,100), ncol=800, nrow=100))[-1]
names(df2) <- seq(as.Date("1999-01-01"),as.Date("2002-12-31"),1)[c(1:799)]
df2 <- cbind(id = 1:100, df2)
You can also use melt and then left join using both the keys:
library(dplyr)
library(reshape2)
set.seed(123)
#df1
df1 <- data.frame(id = 1:100,
s = runif(100,100,1000),
date = sample(seq(as.Date('1999/01/01'), as.Date('2001/01/01'), by="day"), 100),
value = NA)
#df2
df2 <- data.frame(matrix(runif(80000,1,100), ncol=800, nrow=100))[-1]
names(df2) <- seq(as.Date("1999-01-01"),as.Date("2002-12-31"),1)[c(1:799)]
df2 <- cbind(id = 1:100, df2)
df2<-melt(df2, id.vars = "id", value.name = "Value", variable.name = "date")
df2$date<-as.Date(df2$date, format = "%Y-%m-%d")
df1<-left_join(df1, df2, by = c("id", "date"))
head(df1)
id s date value Value
1 1 358.8198 2000-03-15 NA 48.31799
2 2 809.4746 1999-09-01 NA 62.15760
3 3 468.0792 1999-12-23 NA 16.41291
4 4 894.7157 2000-11-26 NA 32.70024
5 5 946.4206 1999-12-18 NA 5.83607
6 6 141.0008 2000-10-09 NA 74.64832
We can also do this efficiently with a data.table join, which should be fast for big datasets:
library(data.table)
setDT(df1)[melt(setDT(df2), id.var = 'id')[,
date := as.IDate(variable, '%Y-%m-%d')], on = .(id, date)]
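If reshaping df2 to long format is not wanted, the same lookup can be sketched in base R with matrix indexing: find the row position from the id and the column position from the date, then index with a two-column matrix. The toy data and the names m, rows and cols below are assumptions made for illustration only.

```r
# Small stand-in for the question's data: one column per date in df2
df1 <- data.frame(id = 1:3,
                  date = as.Date(c("1999-01-02", "1999-01-01", "1999-01-03")))
dates <- seq(as.Date("1999-01-01"), by = "day", length.out = 3)
vals  <- matrix(1:9, nrow = 3, dimnames = list(NULL, as.character(dates)))
df2   <- data.frame(id = 1:3, vals, check.names = FALSE)

m    <- as.matrix(df2[-1])                        # value columns only
rows <- match(df1$id, df2$id)                     # row position per id
cols <- match(as.character(df1$date), colnames(m)) # column position per date

# A two-column (row, col) matrix picks one cell per row of df1
df1$value <- m[cbind(rows, cols)]
df1
```

An id or date without a match gives NA in rows or cols, and therefore NA in value.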

Sum group values by lubridate %within% interval

There are 10 projects split between group A & B, each with different start and end dates. For each day within a given period the sum of outputX and outputY needs to be calculated. I manage to do this for all projects together, but how to split the results per group?
I've made several attempts with lapply() and purrr::map(), also looking at filters and splits, but to no avail. An example that doesn't distinguish between groups is found below.
library(tidyverse)
library(lubridate)
df <- data.frame(
project = 1:10,
group = c("A","B"),
outputX = rnorm(2),
outputY = rnorm(5),
start_date = sample(seq(as.Date('2018-01-3'), as.Date('2018-1-13'), by="day"), 10),
end_date = sample(seq(as.Date('2018-01-13'), as.Date('2018-01-31'), by="day"), 10))
df$interval <- interval(df$start_date, df$end_date)
period <- data.frame(date = seq(as.Date("2018-01-08"), as.Date("2018-01-17"), by = 1))
df_sum <- do.call(rbind, lapply(period$date, function(x){
index <- x %within% df$interval;
list("X" = sum(df$outputX[index]),
"Y" = sum(df$outputY[index]))}))
outcome <- cbind(period, df_sum) %>% gather("id", "value", 2:3)
outcome
Ultimately, it should be a 40x4 table. Some suggestions are much appreciated!
If I understand you correctly, you need an inner join on a date range. An existing Stack Overflow answer suggests sqldf for this, see https://stackoverflow.com/a/11895368/9300556
With your data we can do something like this. There is no need to calculate df$interval, but we need to add an id column to period, otherwise sqldf won't work.
df <- data.frame(
project = 1:10,
group = c("A","B"),
outputX = rnorm(2),
outputY = rnorm(5),
start = sample(seq(as.Date('2018-01-3'), as.Date('2018-1-13'), by="day"), 10),
end = sample(seq(as.Date('2018-01-13'), as.Date('2018-01-31'), by="day"), 10))
# df$interval <- interval(df$start_date, df$end_date)
period <- data.frame(date = seq(as.Date("2018-01-08"), as.Date("2018-01-17"), by = 1)) %>%
mutate(id = 1:nrow(.))
Then we can use sqldf
sqldf::sqldf("select * from period inner join df
on (period.date > df.start and period.date <= df.end) ") %>%
as_tibble() %>%
group_by(date, group) %>%
summarise(X = sum(outputX),
Y = sum(outputY)) %>%
gather(id, value, -group, -date)
# A tibble: 40 x 4
# Groups: date [10]
date group id value
<date> <fct> <chr> <dbl>
1 2018-01-08 A X 3.04
2 2018-01-08 B X 2.34
3 2018-01-09 A X 3.04
4 2018-01-09 B X 3.51
5 2018-01-10 A X 3.04
6 2018-01-10 B X 4.68
7 2018-01-11 A X 4.05
8 2018-01-11 B X 4.68
9 2018-01-12 A X 4.05
10 2018-01-12 B X 5.84
# ... with 30 more rows
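The same range join can be sketched without SQL in base R: build every date/project pair with a cross join, keep the pairs where the date lies inside the project's range, then sum per date and group. The toy data and the names pairs, active and sums are assumptions for illustration. Both endpoints are treated as inclusive here, which is how lubridate's %within% behaves, whereas the sqldf query above uses a strict > on the start date.

```r
df <- data.frame(
  project = 1:4,
  group   = c("A", "B", "A", "B"),
  outputX = c(1, 2, 3, 4),
  outputY = c(10, 20, 30, 40),
  start   = as.Date(c("2018-01-03", "2018-01-05", "2018-01-09", "2018-01-10")),
  end     = as.Date(c("2018-01-13", "2018-01-08", "2018-01-20", "2018-01-15")))
period <- data.frame(date = seq(as.Date("2018-01-08"), as.Date("2018-01-10"), by = "day"))

# by = NULL makes merge() return the Cartesian product (every date x every project)
pairs  <- merge(period, df, by = NULL)

# Keep only the projects that are running on each date (inclusive endpoints)
active <- pairs[pairs$date >= pairs$start & pairs$date <= pairs$end, ]

# Sum both outputs per date and group
sums <- aggregate(cbind(X = outputX, Y = outputY) ~ date + group,
                  data = active, FUN = sum)
sums
```

A date/group combination with no running project produces no row at all in this sketch, so the result can be shorter than dates times groups. The cross join also grows quickly, so for large inputs a dedicated range join is likely the better choice.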

How to do SUMIFS in R

Data:
set.seed(42)
df1 = data.frame(
Date = seq.Date(as.Date("2018-01-01"),as.Date("2018-01-30"),1),
value = sample(1:30),
Y = sample(c("yes", "no"), 30, replace = TRUE)
)
df2 = data.frame(
Date = seq.Date(as.Date("2018-01-01"),as.Date("2018-01-30"),7)
)
For each date in df2$Date, I want to calculate the sum of df1$value over the rows where df1$Date falls between df2$Date and df2$Date + 6.
In short, I need to calculate weekly sums.
Using data.table, create a range start/end, then merge on overlap, then get sum over group:
library(data.table)
df1$start <- df1$Date
df1$end <- df1$Date
df2$start <- df2$Date
df2$end <- df2$Date + 6
setDT(df1, key = c("start", "end"))
setDT(df2, key = c("start", "end"))
foverlaps(df1, df2)[, list(mySum = sum(value)), by = Date ]
# Date mySum
# 1: 2018-01-01 138
# 2: 2018-01-08 96
# 3: 2018-01-15 83
# 4: 2018-01-22 109
# 5: 2018-01-29 39
Check out the lubridate and dplyr packages; those two are quite common.
library(lubridate)
library(dplyr)
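# Label each date with the Monday that starts its week. This lines up with df2$Date
# only because the windows in df2 start on Mondays (2018-01-01 is a Monday).
df1$week_start <- floor_date(df1$Date, "week", week_start = 1)
df1 %>% group_by(week_start) %>% summarize(week_value = sum(value))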
We can use fuzzyjoin
library(dplyr)
library(fuzzyjoin)
df2$EndDate <- df2$Date+6
fuzzy_left_join(
df1, df2,
by = c(
"Date" = "Date",
"Date" = "EndDate"
), match_fun = list(`>=`, `<=`)) %>%
group_by(Date.y) %>% summarise(Sum=sum(value))
# A tibble: 5 x 2
Date.y Sum
<date> <int>
1 2018-01-01 138
2 2018-01-08 96
3 2018-01-15 83
4 2018-01-22 109
5 2018-01-29 39
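When the windows are contiguous and sorted, as they are here (each one starts 7 days after the previous one), a base R sketch with findInterval() avoids the join entirely. The simplified value column and the names bucket and Sum are assumptions for illustration, not the question's random data.

```r
df1 <- data.frame(
  Date  = seq(as.Date("2018-01-01"), as.Date("2018-01-30"), by = "day"),
  value = 1:30)
df2 <- data.frame(Date = seq(as.Date("2018-01-01"), as.Date("2018-01-30"), by = 7))

# Index of the window start at or before each date (window starts must be sorted)
bucket <- findInterval(as.numeric(df1$Date), as.numeric(df2$Date))

# Sum per window; the factor levels keep windows without any dates as NA
df2$Sum <- as.vector(tapply(df1$value,
                            factor(bucket, levels = seq_len(nrow(df2))),
                            sum))
df2
```

This relies on every date belonging to exactly one window; overlapping or gapped windows need one of the join-based answers above.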

How can I organize my CSV file in R

I got .csv file with 53000 rows as follows:
s
1
2
3
m
4
5
6
7
r
8
9
10
11
I would like to make it following format using R or excel:
s 1 2 3
m 4 5 6 7
r 8 9 10 11
Three alternative implementations using base R and data.table:
1: with base R
df$id <- cumsum(grepl("\\D", df$x))
df$name <- ave(df$x, df$id, FUN = function(x) rep(x[1],length(x)))
df <- df[!grepl("\\D", df$x),]
df$pos <- ave(df$x, df$name, FUN = function(x) paste0("p",1:length(x)))
library(reshape2)
dcast(df, name ~ pos, value.var = "x")
this gives:
name p1 p2 p3 p4
1 m 4 5 6 7
2 r 8 9 10 11
3 s 1 2 3 <NA>
2: first approach with data.table
library(data.table)
dcast(setDT(df)[, id := cumsum(grepl("\\D", x))
][, `:=` (name = x[1], pos = 0:(.N-1)), id
][!grepl("\\D", x), .(name, x, pos=paste0("p",pos))],
name ~ pos, value.var = "x")
3: second approach with data.table, this time using the rowid function (requires data.table v1.9.7 or later):
library(data.table) # v1.9.7+
dcast(setDT(df)[, id := cumsum(grepl("\\D", x))
][, name := x[1], id
][!grepl("\\D", x), .(name, x)],
name ~ rowid(name, prefix="p"), value.var = "x")
both data.table approaches result in:
name p1 p2 p3 p4
1: m 4 5 6 7
2: r 8 9 10 11
3: s 1 2 3 NA
Used data:
df <- data.frame(x = c("s", 1:3, "m", 4:7, "r", 8:11), stringsAsFactors = FALSE)
Assuming that the row labels always contain letters and the values in between are always numeric, this reshapes the data into the wide data frame you may be looking for.
library(dplyr)
library(tidyr)
data.frame(x = c("s", 1:3, "m", 4:7, "r", 8:11),
stringsAsFactors = FALSE) %>%
mutate(var_id = cumsum(grepl("[[:alpha:]]", x))) %>%
group_by(var_id) %>%
mutate(row_name = x[1]) %>%
filter(!grepl("[[:alpha:]]", x)) %>%
mutate(var_index = 1:n()) %>%
ungroup() %>%
select(-var_id) %>%
spread(var_index, x)
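For comparison, the grouping step shared by the answers above (a running count of label lines) can be sketched with base R only. It assumes the first line is a label, and the names is_header, labels, rows and wide are illustrative. The result is a matrix padded with NA rather than a data frame.

```r
x <- c("s", 1:3, "m", 4:7, "r", 8:11)

is_header <- grepl("\\D", x)          # TRUE for label lines such as "s"
group     <- cumsum(is_header)        # running group number per line
labels    <- x[is_header][group]      # label repeated for each line of its group

# One numeric vector per label, kept in order of first appearance
rows <- split(as.numeric(x[!is_header]),
              factor(labels[!is_header], levels = unique(labels)))

# Pad shorter groups with NA so they fit into one matrix
width <- max(lengths(rows))
wide  <- t(sapply(rows, function(v) c(v, rep(NA, width - length(v)))))
colnames(wide) <- paste0("p", seq_len(width))
wide
```

Using a factor with levels = unique(labels) keeps the rows in file order (s, m, r) instead of the alphabetical order produced by the dcast-based answers.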

Resources