Create a new column based on difference of dates - r

I have a dataframe in which I have to create a new column based on the difference of two dates. Example:
Col1 Col2 Col3 Date New_Column_Required
A X A 01/01/2001 Wave1
B Y Q 01/01/2001 Wave1
C Z N 01/01/2001 Wave1
D W M 02/01/2001 Wave2
E Q V 02/01/2001 Wave2
F R O 03/01/2001 Wave3
G S T 03/01/2001 Wave3
Rows with the 1st date should be Wave1, rows with the 2nd date Wave2, and so on, as in the example above. Because each date appears on several rows, I can't figure out how to do this.

Using dplyr, we can convert Date to class Date, arrange the rows by Date, and subtract the first Date from each value.
library(dplyr)
df %>%
mutate(Date = lubridate::dmy(Date)) %>%
arrange(Date) %>%
mutate(new_col = paste0("Wave", Date - first(Date) + 1))
#OR
#mutate(new_col = paste0("Wave", as.integer(as.factor(Date))))
# Col1 Col2 Col3 Date new_col
#1 A X A 2001-01-01 Wave1
#2 B Y Q 2001-01-01 Wave1
#3 C Z N 2001-01-01 Wave1
#4 D W M 2001-01-02 Wave2
#5 E Q V 2001-01-02 Wave2
#6 F R O 2001-01-03 Wave3
#7 G S T 2001-01-03 Wave3
And the same logic in base R :
df$Date = as.Date(df$Date, "%d/%m/%Y")
df <- df[order(df$Date), ]
transform(df, new_col = paste0('Wave', Date - Date[1] + 1))
data
df <- structure(list(Col1 = c("A", "B", "C", "D", "E", "F", "G"), Col2 = c("X",
"Y", "Z", "W", "Q", "R", "S"), Col3 = c("A", "Q", "N", "M", "V",
"O", "T"), Date = c("01/01/2001", "01/01/2001", "01/01/2001",
"02/01/2001", "02/01/2001", "03/01/2001", "03/01/2001")), row.names = c(NA,
-7L), class = "data.frame")
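Note that Date - first(Date) + 1 gives consecutive wave numbers only when the dates themselves are consecutive days. If the dates can have gaps, ranking the distinct dates is one option, in the same spirit as the as.integer(as.factor(Date)) alternative. A minimal sketch, assuming made-up data (df_gap is hypothetical and not from the question) and using dplyr's dense_rank:

```r
library(dplyr)

# Hypothetical data: the third distinct date is three days after the second
df_gap <- data.frame(
  Col1 = c("A", "B", "C", "D"),
  Date = c("01/01/2001", "01/01/2001", "02/01/2001", "05/01/2001"),
  stringsAsFactors = FALSE
)

res <- df_gap %>%
  mutate(Date = as.Date(Date, "%d/%m/%Y")) %>%
  arrange(Date) %>%
  # dense_rank numbers the distinct dates 1, 2, 3, ... regardless of gaps
  mutate(new_col = paste0("Wave", dense_rank(Date)))

res$new_col
# "Wave1" "Wave1" "Wave2" "Wave3"
```

With the date-difference version, the last row here would be labelled Wave5 instead of Wave3.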

Related

Merge three Variables to one and replicate observations

I have a Dataframe which looks like the following:
B <- data.frame(
nr=c(1,2,3,4,5),
A=c('a','b','c','d','e'),
B=c("s", "t", "i", "u", "z"),
B1=c("", "v", "", "", ""),
B2 =c("", "g", "", "", ""))
B <- B %>% mutate_all(na_if,"")
Since my variables B1 and B2 only have one value each, I would like to merge B1 and B2 into the variable B. This should create two new observations, replicating every other variable of that observation.
It should look like the following:
B <- data.frame(
nr=c(1,2,2, 2, 3,4,5),
A=c("a","b", "b", "b", "c","d","e"),
B=c("s", "v", "g", "t", "i", "u", "z"))
Thanks for your help!!
Reshape to 'long' format with pivot_longer on the 'B' columns and remove the NA with values_drop_na = TRUE
library(dplyr)
library(tidyr)
B %>%
pivot_longer(cols = starts_with("B"), values_to = "B",
values_drop_na = TRUE, names_to = NULL)
-output
# A tibble: 7 × 3
nr A B
<dbl> <chr> <chr>
1 1 a s
2 2 b t
3 2 b v
4 2 b g
5 3 c i
6 4 d u
7 5 e z
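For comparison, the same reshaping can be done without packages. This is only a sketch of one possible base R approach, assuming the empty strings have already been converted to NA as in the question; the helper names b_cols and long are made up for illustration:

```r
# Same data as in the question, with NA in place of the empty strings
B <- data.frame(
  nr = c(1, 2, 3, 4, 5),
  A  = c("a", "b", "c", "d", "e"),
  B  = c("s", "t", "i", "u", "z"),
  B1 = c(NA, "v", NA, NA, NA),
  B2 = c(NA, "g", NA, NA, NA),
  stringsAsFactors = FALSE
)

b_cols <- c("B", "B1", "B2")

# Stack one copy of (nr, A) per value column
long <- do.call(rbind, lapply(b_cols, function(col) {
  data.frame(nr = B$nr, A = B$A, B = B[[col]], stringsAsFactors = FALSE)
}))

# Drop the rows that came from NA cells, then restore the original row order
long <- long[!is.na(long$B), ]
long <- long[order(long$nr), ]
rownames(long) <- NULL

long$B
# "s" "t" "v" "g" "i" "u" "z"
```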

Summation of money amounts in character format by group

I have a data frame that contains the monetary transactions among individuals. The transactions can be two-way, i.e. A can transfer money to B and B can also transfer money to A. The structure of the data frame looks like below:
From To Amount
A B $100
A C $40
A D $30
B A $25
B C $70
C A $190
C D $110
I want to summarize the total amount of transactions among each pair of individuals who have transactions with each other and the results should be something like:
Individual_1 Individual_2 Sum
A B $125
A C $230
A D $30
B C $70
C D $110
I tried to utilize the grouping feature of the package dplyr but I think it does not apply to my case.
You can use pmin/pmax to sort From and To columns and sum the Amount value.
library(dplyr)
df %>%
group_by(col1 = pmin(From, To),
col2 = pmax(From, To)) %>%
summarise(Amount = sum(readr::parse_number(Amount)))
# col1 col2 Amount
# <chr> <chr> <dbl>
#1 A B 125
#2 A C 230
#3 A D 30
#4 B C 70
#5 C D 110
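To see why this groups A-to-B and B-to-A together, it may help to look at pmin and pmax on their own. They work element-wise, and for character vectors they compare strings, so each pair ends up in the same order whichever direction the transfer went. A small sketch with made-up vectors:

```r
# Hypothetical sender/receiver vectors: the pair A/B occurs in both directions
From <- c("A", "B", "C")
To   <- c("B", "A", "A")

first_id  <- pmin(From, To)  # the alphabetically smaller name of each pair
second_id <- pmax(From, To)  # the alphabetically larger name of each pair

first_id
# "A" "A" "A"
second_id
# "B" "B" "C"
```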
Using the same logic in base R you can do :
aggregate(Amount~col1 + col2,
transform(df, col1 = pmin(From, To), col2 = pmax(From, To),
Amount = as.numeric(sub('$', '', Amount, fixed = TRUE))), sum)
data
df <- structure(list(From = c("A", "A", "A", "B", "B", "C", "C"), To = c("B",
"C", "D", "A", "C", "A", "D"), Amount = c("$100", "$40", "$30",
"$25", "$70", "$190", "$110")), class = "data.frame", row.names = c(NA, -7L))
A solution using the tidyverse package. You need to find a way to create a common grouping column with the right order of the individuals. dat2 is the final output.
library(tidyverse)
dat2 <- dat %>%
mutate(Amount = as.numeric(str_remove(Amount, "\\$"))) %>%
mutate(Group = map2_chr(From, To, ~str_c(sort(c(.x, .y)), collapse = "_"))) %>%
group_by(Group) %>%
summarize(Sum = sum(Amount, na.rm = TRUE)) %>%
separate(Group, into = c("Individual_1", "Individual_2"), sep = "_") %>%
mutate(Sum = str_c("$", Sum))
print(dat2)
# # A tibble: 5 x 3
# Individual_1 Individual_2 Sum
# <chr> <chr> <chr>
# 1 A B $125
# 2 A C $230
# 3 A D $30
# 4 B C $70
# 5 C D $110
Data
dat <- read.table(text = "From To Amount
A B $100
A C $40
A D $30
B A $25
B C $70
C A $190
C D $110",
header = TRUE)
A complete solution without packages, based on @RonakShah's great pmin/pmax approach, using list notation in aggregate (in contrast to formula notation), which allows name assignment.
with(
transform(d, a=as.numeric(gsub("\\D", "", Amount)), b=pmin(From, To), c=pmax(From, To)),
aggregate(list(Sum=a), list(Individual_1=b, Individual_2=c), function(x)
paste0("$", sum(x))))
# Individual_1 Individual_2 Sum
# 1 A B $125
# 2 A C $230
# 3 B C $70
# 4 A D $30
# 5 C D $110
Data:
d <- structure(list(From = c("A", "A", "A", "B", "B", "C", "C"), To = c("B",
"C", "D", "A", "C", "A", "D"), Amount = c("$100", "$40", "$30",
"$25", "$70", "$190", "$110")), class = "data.frame", row.names = c(NA,
-7L))
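One caveat with gsub("\\D", "", Amount): it removes every non-digit character, including a decimal point, which is fine for the whole-dollar amounts here but not for values with cents. A hedged sketch of a slightly more defensive parser, assuming amounts formatted like "$1,250.50" (the example values and the name to_number are made up):

```r
# Hypothetical amounts with a thousands separator and cents
amounts <- c("$1,250.50", "$40")

# Keep only digits and the decimal point, then convert
to_number <- function(x) as.numeric(gsub("[^0-9.]", "", x))

parsed <- to_number(amounts)
parsed
# 1250.5   40.0

# For comparison, stripping all non-digits loses the decimal point
digits_only <- as.numeric(gsub("\\D", "", amounts))
digits_only
# 125050     40
```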

Subset data for the most recent month

I have a .txt file
test.txt
V1 V2 Date
A B 2020-01-02
C D 2020-02-27
E F 2020-09-10
G H 2020-09-15
I want to subset data based on the most recent month.
The code below does the job, but I want to find the most recent month automatically, rather than typing it in manually, and then extract the data:
test$month <- factor(format(test$Date, "%B"),levels = month.name)
test.subset <- test[test$month == "September", ]
We can arrange by the Date class column and filter the formatted value by comparing it with the last one
library(dplyr)
test %>%
# compare year and month together so the same month in an earlier year is not kept
mutate(Date = as.Date(Date), Month = format(Date, '%Y-%m')) %>%
arrange(Date) %>%
filter(Month == last(Month)) %>%
select(-Month)
-output
# V1 V2 Date
#1 E F 2020-09-10
#2 G H 2020-09-15
data
test <- structure(list(V1 = c("A", "C", "E", "G"), V2 = c("B", "D", "F",
"H"), Date = c("2020-01-02", "2020-02-27", "2020-09-10", "2020-09-15"
)), class = "data.frame", row.names = c(NA, -4L))
Here is a base R option using subset + gsub
subset(
transform(
df,
ym = gsub("\\d+$", "", Date)
),
ym == max(ym),
select = -ym
)
which gives
V1 V2 Date
3 E F 2020-09-10
4 G H 2020-09-15
A data.table option
library(data.table)
setDT(df)[
,
`:=`(Year = year(as.IDate(Date)), Month = month(as.Date(Date)))
][
.(max(Year), max(Month)),
on = .(Year, Month)
][
,
`:=`(Year = NULL, Month = NULL)
][]
gives
V1 V2 Date
1: E F 2020-09-10
2: G H 2020-09-15
Data
> dput(df)
structure(list(V1 = c("A", "C", "E", "G"), V2 = c("B", "D", "F",
"H"), Date = c("2020-01-02", "2020-02-27", "2020-09-10", "2020-09-15"
)), class = "data.frame", row.names = c(NA, -4L))
Using the structure shared by @ThomasIsCoding, and assuming the year is constant, one could just look for the rows with the max month and filter for those:
# using datatable
library(data.table)
setDT(df)[month(as.IDate(Date)) == max(month(as.IDate(Date)))]
V1 V2 Date
1: E F 2020-09-10
2: G H 2020-09-15
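When the year is not constant, comparing the month number alone would also keep the same month from earlier years. One way around this, sketched here in base R with made-up data (test_multi is hypothetical), is to compare a combined year-month string:

```r
# Hypothetical data spanning two years, with a September in each
test_multi <- data.frame(
  V1 = c("A", "C", "E", "G"),
  V2 = c("B", "D", "F", "H"),
  Date = as.Date(c("2019-09-05", "2020-02-27", "2020-09-10", "2020-09-15")),
  stringsAsFactors = FALSE
)

# "YYYY-MM" strings sort chronologically, so max() picks the latest month
ym <- format(test_multi$Date, "%Y-%m")
recent <- test_multi[ym == max(ym), ]

recent$V1
# "E" "G"
```

The 2019-09-05 row is dropped here, whereas a comparison on the month number only would keep it.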

Remove df rows using information about unrepeated levels between two vectors

df <- data.frame(X = c("a", "b", "c", "a", "b", "c", "a", "b", "c", "d" , "a", "b", "c", "d", "e"),
Y = c("w", "w", "w", "K", "K", "K", "L", "L", "L", "L", "Z", "Z", "Z", "Z", "Z"))
Note that the first vector (X) has 5 levels and the second (Y) has 4 levels. My goal is to select the rows of df whose X level occurs with every level of Y. That is, I want to keep the rows with levels "a", "b" and "c", since "d" appears only twice and "e" appears only once.
I tried to make a list of the common levels and keep only the rows with those levels by subsetting. However, this doesn't work, because indexing by the list of levels does not give the positions of the rows I want to keep. Example:
common <- c ("a", "b", "c")
df2 <- df [c(common),]
In my real df there are 64 levels in common, so doing this by hand is not feasible. Can someone help me?
I think this is what you want. Essentially splitting X by Y, then looking for all intersecting values that are in every set.
df[df$X %in% Reduce(intersect, split(df$X, df$Y)),]
# X Y
#1 a w
#2 b w
#3 c w
#4 a K
#5 b K
#6 c K
#7 a L
#8 b L
#9 c L
#11 a Z
#12 b Z
#13 c Z
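The one-liner can be unpacked into its intermediate steps, which may make it easier to check on real data. A sketch using the data from the question (the names groups, common and kept are just for illustration):

```r
df <- data.frame(
  X = c("a", "b", "c", "a", "b", "c", "a", "b", "c", "d",
        "a", "b", "c", "d", "e"),
  Y = c("w", "w", "w", "K", "K", "K", "L", "L", "L", "L",
        "Z", "Z", "Z", "Z", "Z"),
  stringsAsFactors = FALSE
)

# Step 1: one vector of X values per level of Y
groups <- split(df$X, df$Y)

# Step 2: intersect the vectors pairwise, leaving the X values found in every group
common <- Reduce(intersect, groups)
common
# "a" "b" "c"

# Step 3: keep the rows whose X is one of those values
kept <- df[df$X %in% common, ]
nrow(kept)
# 12
```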
Another way could be to group_by X and select the groups that have all distinct values of Y.
library(dplyr)
df %>%
group_by(X) %>%
filter(n_distinct(Y) == n_distinct(.$Y))
# X Y
# <fct> <fct>
# 1 a w
# 2 b w
# 3 c w
# 4 a K
# 5 b K
# 6 c K
# 7 a L
# 8 b L
# 9 c L
#10 a Z
#11 b Z
#12 c Z
In base R, that would be using ave
subset(df, as.logical(ave(as.character(Y), X,
FUN = function(x) length(unique(x)) == length(unique(Y)))))
Using data.table
library(data.table)
setDT(df)[, .SD[uniqueN(Y) == uniqueN(df$Y)], by = X]

Join data but ignore missing values

I am having some trouble with joining data frames with dplyr, where I would like to ignore the NAs.
The data that I have is quite big, but a simplified version looks like:
id <- c("id1", "id2", "id3", "id4")
A <- c("E", "F", "G", NA)
B <- c("T", NA, "N", "T")
C <- c(NA, "T", "U", NA)
df <- data.frame(id, A, B, C)
id A B C
1 id1 E T NA
2 id2 F NA T
3 id3 G N U
4 id4 NA T NA
I have an entry that I would like to match with df, which is e.g.:
df2 <- data.frame(A = "E", B = "T", C = "M")
A B C
1 E T M
As a result I would like to obtain all rows from df that match with df2, but the NAs should be ignored. So the result should look like this:
id A B C
1 id1 E T NA
2 id4 NA T NA
I was trying to do this with semi_join, but it did not work so far:
result <- df %>%
group_by(n = seq(n())) %>%
do(modify_if(., is.na, ~NULL) %>%
semi_join(df2, by = c("A", "B", "C"))) %>%
ungroup %>%
select(-n)
Which results in:
Error: `by` can't contain join column `C` which is missing from LHS
Call `rlang::last_error()` to see a backtrace
Who knows the answer?
Here's a solution with a mix of tidyverse and base R. I think this is pretty clear, but I'd be interested in a pure tidyverse implementation that isn't completely contrived.
The idea is to first expand all entries in df and df2 and then filter through all the columns using a loop.
The data:
id <- c("id1", "id2", "id3", "id4")
A <- c("E", "F", "G", NA)
B <- c("T", NA, "N", "T")
C <- c(NA, "T", "U", NA)
df <- data.frame(id, A, B, C, stringsAsFactors = FALSE) # Make sure to use strings, not factors
df2 <- data.frame(A = "E", B = "T", C = "M", stringsAsFactors = F)
Code:
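library(dplyr)
library(tidyr)
# suffix df2's columns with 1 so the names stay unique after crossing
results <- crossing(df, setNames(df2, paste0(names(df2), 1)))
select_columns <- c("A", "B", "C")
for(col in select_columns) {
  # keep rows where df's value is NA or equals the value to match
  keep <- is.na(results[[col]]) | results[[col]] == results[[paste0(col, 1)]]
  results <- results[keep,, drop=FALSE]
}
results <- results %>% select(id, A:C) %>% distinct()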
results
id A B C
1 id1 E T <NA>
2 id4 <NA> T <NA>
If you only need to do this for a single set of values this is probably the most straightforward approach:
df[df$A %in% c("E", NA) & df$B %in% c("T", NA) & df$C %in% c("M", NA), ]
Another example using tidyverse and base (dplyr, tidyr, base):
Here I expand your df2 into a data frame that contains all combinations of the values you want to accept ( (E or NA) & (T or NA) & (M or NA) ), and then do an inner join with this full set. There are other ways to create a data frame of all possible combinations, but this one is fairly easy with tidyr.
library(dplyr)
library(tidyr)
id <- c("id1", "id2", "id3", "id4")
A <- c("E", "F", "G", NA)
B <- c("T", NA, "N", "T")
C <- c(NA, "T", "U", NA)
df <- data.frame(A, B, C, stringsAsFactors = FALSE)
df2 <- data.frame(A = "E", B = "T", C = "M",stringsAsFactors = FALSE)
df2_expanded <- df2 %>%
rowwise() %>%
mutate(combinations = list(expand.grid(A = c(A,NA),B = c(B,NA),C = c(C,NA),stringsAsFactors = FALSE))) %>%
select(-A,-B,-C) %>%
unnest(combinations)
# A tibble: 8 x 3
# A B C
# <chr> <chr> <chr>
# 1 E T M
# 2 NA T M
# 3 E NA M
# 4 NA NA M
# 5 E T NA
# 6 NA T NA
# 7 E NA NA
# 8 NA NA NA
df %>%
inner_join(df2_expanded)
# A B C
# 1 E T <NA>
# 2 <NA> T <NA>
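The "value matches or is NA" rule can also be applied column by column in base R without expanding any combinations. This is only a sketch of that idea, assuming df2 has a single row and that every column of df2 exists in df; the names cols, matches and keep are made up:

```r
id <- c("id1", "id2", "id3", "id4")
A <- c("E", "F", "G", NA)
B <- c("T", NA, "N", "T")
C <- c(NA, "T", "U", NA)
df <- data.frame(id, A, B, C, stringsAsFactors = FALSE)
df2 <- data.frame(A = "E", B = "T", C = "M", stringsAsFactors = FALSE)

cols <- names(df2)

# One logical vector per column: TRUE where df is NA or equals the df2 value
matches <- Map(function(col) is.na(df[[col]]) | df[[col]] == df2[[col]], cols)

# A row is kept only if it passes the check in every column
keep <- Reduce(`&`, matches)
result <- df[keep, ]

result$id
# "id1" "id4"
```

Because is.na() is checked first, a missing value in df yields TRUE rather than NA, so the comparison never produces NA in keep.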
