Create a "flag" column in a dataset based on another table in R - r

I have two datasets: dataset1 and dataset2.
zz <- "id_customer id_order order_date
1 1 2018-10
1 2 2018-11
2 3 2019-05
3 4 2019-06"
dataset1 <- read.table(text=zz, header=TRUE)
yy <- "id_customer order_date
1 2018-10
3 2019-06"
dataset2 <- read.table(text=yy, header=TRUE)
dataset2 is the result of a query with two columns: id_customer and order_date (format YYYY-mm).
These rows correspond to customers who have a different status from the others in the source dataset (dataset1) for a specified month.
dataset1 is a list of transactions with id_customer, id_order and order_date (format YYYY-mm as well).
I want to enrich dataset1 with a "flag" column for each line set to 1 if the customer id appears in dataset2, during the corresponding month.
I have tried something as follows:
dataset$flag <- ifelse(dataset1$id_customer %in% dataset2$id_customer &
dataset1$date == dataset2$date,
"1", "0")
But I get a warning message that says 'longer object length is not a multiple of shorter object length'.
I understand that but cannot come up with a solution. Could someone please help?

You can add a flag to dataset2 and then use merge(), keeping all rows from dataset1. Borrowing Chris' illustrative data, where dt1 and dt2 play the roles of dataset1 and dataset2:
dt2$flag <- 1
merge(dt1, dt2, all.x = TRUE)
ID Date flag
1 1 2018-12 NA
2 1 2019-11 NA
3 2 2018-13 NA
4 2 2019-10 NA
5 2 2019-11 1
6 2 2019-12 NA
7 2 2019-12 NA
8 3 2018-12 1
9 3 2018-12 1
10 4 2018-13 1
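merge() with all.x = TRUE leaves NA in the rows that have no match, while the question asks for a 0/1 flag. A minimal sketch on the question's own data, joining explicitly on id_customer and order_date and then replacing the NAs; the name flagged is only illustrative:

```r
# Data from the question
dataset1 <- data.frame(
  id_customer = c(1, 1, 2, 3),
  id_order    = 1:4,
  order_date  = c("2018-10", "2018-11", "2019-05", "2019-06")
)
dataset2 <- data.frame(
  id_customer = c(1, 3),
  order_date  = c("2018-10", "2019-06")
)

# Tag every row of dataset2, then left-join it onto dataset1
dataset2$flag <- 1
flagged <- merge(dataset1, dataset2,
                 by = c("id_customer", "order_date"), all.x = TRUE)

# Rows without a match come back as NA; turn them into 0
flagged$flag[is.na(flagged$flag)] <- 0
flagged
```

Note that merge() sorts the result by the join columns, so the row order may differ from dataset1.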

EDIT:
This seems to work:
Illustrative data:
set.seed(100)
dt1 <- data.frame(
ID = sample(1:4, 10, replace = T),
Date = paste0(sample(2018:2019, 10, replace = T),"-", sample(10:13, 10, replace = T))
)
dt1
ID Date
1 2 2019-12
2 2 2019-12
3 3 2018-12
4 1 2018-12
5 2 2019-11
6 2 2019-10
7 4 2018-13
8 2 2018-13
9 3 2018-12
10 1 2019-11
dt2 <- data.frame(
ID = sample(1:4, 5, replace = T),
Date = paste0(sample(2018:2019, 5, replace = T),"-", sample(10:13, 5, replace = T))
)
dt2
ID Date
1 2 2019-11
2 4 2018-13
3 2 2019-13
4 4 2019-13
5 3 2018-12
SOLUTION:
The solution uses ifelse to define a condition under which to set the 'flag' to 1 (as specified in the OP). That condition implies a match between dt1 and dt2; thus we use match. A complicating factor is that the condition requires a double match between two columns in each data frame. Therefore, we use apply with paste0 to paste the two columns of each row together, and then search for matches among these compound strings:
dt1$flag <- ifelse(match(apply(dt1[,1:2], 1, paste0, collapse = " "),
apply(dt2[,1:2], 1, paste0, collapse = " ")), 1, "NA")
RESULT:
dt1
ID Date flag
1 2 2019-12 NA
2 2 2019-12 NA
3 3 2018-12 1
4 1 2018-12 NA
5 2 2019-11 1
6 2 2019-10 NA
7 4 2018-13 1
8 2 2018-13 NA
9 3 2018-12 1
10 1 2019-11 NA
To check the results we can compare them with the results obtained from merge:
flagged_only <- merge(dt1, dt2)
flagged_only
ID Date
1 2 2019-11
2 3 2018-12
3 3 2018-12
4 4 2018-13
The data frame flagged_only contains exactly the same four rows as the ones flagged 1 in dt1. Voilà!
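Because paste() is vectorised, the compound key can also be built without apply(), and %in% then returns a logical vector with one entry per row of dt1. That sidesteps the recycling warning from the question, where == compares element by element and recycles the shorter vector. A sketch on a small hypothetical subset of the data above; key1 and key2 are illustrative names:

```r
# Small hypothetical subset of the illustrative data
dt1 <- data.frame(ID   = c(2, 3, 1, 4),
                  Date = c("2019-12", "2018-12", "2018-12", "2018-13"))
dt2 <- data.frame(ID   = c(4, 3),
                  Date = c("2018-13", "2018-12"))

# One compound key per row; the separator must not occur in the values
key1 <- paste(dt1$ID, dt1$Date, sep = "|")
key2 <- paste(dt2$ID, dt2$Date, sep = "|")

# TRUE where the (ID, Date) pair of dt1 also appears in dt2, stored as 0/1
dt1$flag <- as.integer(key1 %in% key2)
dt1
```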

It is very easy to add a corresponding flag the data.table way:
# Load library
library(data.table)
# Convert created tables to data.table object
setDT(dataset1)
setDT(dataset2)
# Add {0, 1} to dataset1 if the row can be found in dataset2
dataset1[, flag := 0][dataset2, flag := 1, on = .(id_customer, order_date)]
The result looks as follows:
> dataset1
id_customer id_order order_date flag
1: 1 1 2018-10 1
2: 1 2 2018-11 0
3: 2 3 2019-05 0
4: 3 4 2019-06 1
A bit more manipulation would be needed if the datasets contained full dates or date-times.
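If the transactions carried a full timestamp, one option is to derive the YYYY-mm key first and then join on it as before. A minimal base R sketch, assuming a hypothetical order_datetime column stored as "YYYY-mm-dd HH:MM:SS" text in UTC:

```r
# Hypothetical transactions with a full timestamp (column name is an assumption)
orders <- data.frame(
  id_customer    = c(1, 1, 3),
  order_datetime = c("2018-10-03 14:22:00",
                     "2018-11-20 09:05:00",
                     "2019-06-11 18:40:00")
)

# Parse the timestamp, then keep only year and month as the join key
parsed <- as.POSIXct(orders$order_datetime, tz = "UTC")
orders$order_date <- format(parsed, "%Y-%m")
orders
```

The resulting order_date column has the same format as in dataset2, so the joins shown above can be applied unchanged.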

Related

r- dynamically detect excel column names format as date (without df slicing)

I am trying to detect column dates that come from an excel format:
library(openxlsx)
df <- read.xlsx('path/df.xlsx', sheet=1, detectDates = T)
Which reads the data as follows:
# a b c 44197 44228 d
#1 1 1 NA 1 1 1
#2 2 2 NA 2 2 2
#3 3 3 NA 3 3 3
#4 4 4 NA 4 4 4
#5 5 5 NA 5 5 5
I tried to specify a fix index slice and then transform these specific columns as follows:
names(df)[4:5] <- format(as.Date(as.numeric(names(df)[4:5]),
origin = "1899-12-30"), "%m/%d/%Y")
This works well when the date columns sit at exactly those positions. Unfortunately, the column indices may change, say from names(df)[4:5] to names(df)[2:3], in which case the code returns coerced NA values instead of dates.
data:
Note: with this data.frame() call the column name becomes X44197 (data.frame() makes names syntactically valid), while read.xlsx() reads it as 44197
df <- data.frame(a=rep(1:5), b=rep(1:5), c=NA, "44197"=rep(1:5), '44228'=rep(1:5), d=rep(1:5))
Expected Output:
Note: this is the original excel format for these above columns:
# a b c 01/01/2021 01/02/2021 d
#1 1 1 NA 1 1 1
#2 2 2 NA 2 2 2
#3 3 3 NA 3 3 3
#4 4 4 NA 4 4 4
#5 5 5 NA 5 5 5
How can I detect these Excel-formatted column names directly and change them to dates without having to slice the data frame?
We may need to only get those column names that are numbers
i1 <- !is.na(as.integer(names(df)))
and then use
names(df)[i1] <- format(as.Date(as.numeric(names(df)[i1]),
origin = "1899-12-30"), "%m/%d/%Y")
Or with dplyr
library(dplyr)
df %>%
rename_with(~ format(as.Date(as.numeric(.),
origin = "1899-12-30"), "%m/%d/%Y"), matches('^\\d+$'))
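The as.integer() test above works but emits a coercion warning for the non-numeric names. A sketch of the same idea in base R that selects the all-digit names with a regular expression instead; is_serial is an illustrative name, and check.names = FALSE is used here only so the sample data keeps the numeric names that read.xlsx() would produce:

```r
# Sample data with Excel serial numbers as column names
df <- data.frame(a = 1:5, b = 1:5, c = NA,
                 "44197" = 1:5, "44228" = 1:5, d = 1:5,
                 check.names = FALSE)

# Names made up of digits only are treated as Excel date serials
is_serial <- grepl("^[0-9]+$", names(df))

# Convert just those names, wherever they happen to sit
names(df)[is_serial] <- format(
  as.Date(as.numeric(names(df)[is_serial]), origin = "1899-12-30"),
  "%m/%d/%Y"
)
names(df)
```

This assumes that any all-digit column name is a date serial; a genuinely numeric header would be converted as well.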

Remove duplicates making sure of NA values R

My data set(df) looks like,
ID Name Rating Score Ranking
1 abc 3 NA NA
1 abc 3 12 13
2 bcd 4 NA NA
2 bcd 4 19 20
I'm trying to remove duplicates using
df <- df[!duplicated(df[1:2]),]
which gives,
ID Name Rating Score Ranking
1 abc 3 NA NA
2 bcd 4 NA NA
but I'm trying to get,
ID Name Rating Score Ranking
1 abc 3 12 13
2 bcd 4 19 20
How do I avoid rows containing NA's when removing duplicates at the same time, some help would be great, thanks.
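First, sort so that the NAs come last within each ID/Name pair. Note that na.last = TRUE has to be passed to order(), not to with():
df <- df[with(df, order(ID, Name, Score, Ranking, na.last = TRUE)), ]
Then remove the duplicated rows, keeping the first occurrence of each pair (fromLast = FALSE):
df <- df[!duplicated(df[1:2], fromLast = FALSE), ]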
Using dplyr, keeping the last row of each ID/Name pair (in this data, the complete one):
df <- df %>% filter(!duplicated(.[,1:2], fromLast = T))
You could also keep only the rows you want with which(), testing for missing values with is.na() rather than comparing against the string "NA", and then use unique():
a <- which(!is.na(df[, 'Score']) | !is.na(df[, 'Ranking']))
df2 <- unique(df[a, ])
> df2
ID Name Rating Score Ranking
2 1 abc 3 12 13
4 2 bcd 4 19 20
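The sort-then-deduplicate idea can be made independent of which columns hold the NAs by ordering on the number of missing values per row. A minimal sketch on the question's data; n_missing, df_sorted and result are illustrative names:

```r
# Data from the question
df <- data.frame(ID      = c(1, 1, 2, 2),
                 Name    = c("abc", "abc", "bcd", "bcd"),
                 Rating  = c(3, 3, 4, 4),
                 Score   = c(NA, 12, NA, 19),
                 Ranking = c(NA, 13, NA, 20))

# Count the NAs in each row so the most complete row sorts first in its group
n_missing <- rowSums(is.na(df))
df_sorted <- df[order(df$ID, df$Name, n_missing), ]

# duplicated() keeps the first row per ID/Name pair, i.e. the most complete one
result <- df_sorted[!duplicated(df_sorted[c("ID", "Name")]), ]
result
```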

how to replace the NA in a data frame with the average number of this data frame

I have a data frame like this:
nums id
1233 1
3232 2
2334 3
3330 1
1445 3
3455 3
7632 2
NA 3
NA 1
And I can know the average "nums" of each "id" by using:
id_avg <- aggregate(nums ~ id, data = dat, FUN = mean)
What I would like to do is replace each NA with the average of the corresponding id. For example, if the average "nums" of ids 1, 2 and 3 were 1000, 2000 and 3000, respectively, the NA where id == 3 would be replaced by 3000, and the last NA, whose id == 1, would be replaced by 1000.
I tried the following code to achieve this:
temp <- dat[is.na(dat$nums),]$id
dat[is.na(dat$nums),]$nums <- id_avg[id_avg[,"id"] ==temp,]$nums
However, the second part
id_avg[id_avg[,"id"] ==temp,]$nums
is always NA, which means I always pass NA to the NAs I want to replace.
I don't know where I went wrong. Is there a better method to do this?
Thank you
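Or you can fix it by indexing id_avg directly (this works here because the ids are 1, 2, 3, so each id equals its row position in id_avg):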
dat[is.na(dat$nums),]$nums <- id_avg$nums[temp]
nums id
1 1233.000 1
2 3232.000 2
3 2334.000 3
4 3330.000 1
5 1445.000 3
6 3455.000 3
7 7632.000 2
8 2411.333 3
9 2281.500 1
What you want is contained in the zoo package.
library(zoo)
na.aggregate.default(dat, by = dat$id)
nums id
1 1233.000 1
2 3232.000 2
3 2334.000 3
4 3330.000 1
5 1445.000 3
6 3455.000 3
7 7632.000 2
8 2411.333 3
9 2281.500 1
Here is a dplyr way:
df %>%
group_by(id) %>%
mutate(nums = replace(nums, is.na(nums), as.integer(mean(nums, na.rm = T))))
# Source: local data frame [9 x 2]
# Groups: id [3]
# nums id
# <int> <int>
# 1 1233 1
# 2 3232 2
# 3 2334 3
# 4 3330 1
# 5 1445 3
# 6 3455 3
# 7 7632 2
# 8 2411 3
# 9 2281 1
You essentially want to merge the id_avg back to the original data frame by the id column, so you can also use match to follow your original logic:
dat$nums[is.na(dat$nums)] <- id_avg$nums[match(dat$id[is.na(dat$nums)], id_avg$id)]
dat
# nums id
# 1: 1233.000 1
# 2: 3232.000 2
# 3: 2334.000 3
# 4: 3330.000 1
# 5: 1445.000 3
# 6: 3455.000 3
# 7: 7632.000 2
# 8: 2411.333 3
# 9: 2281.500 1
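The same replacement can be sketched in base R without building id_avg at all, using ave() to compute the group mean alongside every row. The names missing and group_mean are illustrative:

```r
# Data from the question
dat <- data.frame(
  nums = c(1233, 3232, 2334, 3330, 1445, 3455, 7632, NA, NA),
  id   = c(1, 2, 3, 1, 3, 3, 2, 3, 1)
)

# Mean of nums within each id, ignoring NAs, repeated for every row of that id
group_mean <- ave(dat$nums, dat$id, FUN = function(x) mean(x, na.rm = TRUE))

# Fill only the missing entries with the mean of their own group
missing <- is.na(dat$nums)
dat$nums[missing] <- group_mean[missing]
dat
```

Unlike positional indexing into id_avg, this does not depend on the ids being consecutive integers starting at 1.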

How to count and flag unique values in r dataframe

I have the following dataframe:
data <- data.frame(week = c(rep("2014-01-06", 3), rep("2014-01-13", 3), rep("2014-01-20", 3)), values = c(1, 2, 3))
week values
1 2014-01-06 1
2 2014-01-06 2
3 2014-01-06 3
4 2014-01-13 1
5 2014-01-13 2
6 2014-01-13 3
7 2014-01-20 1
8 2014-01-20 2
9 2014-01-20 3
I'm wanting to create a column in data that counts the unique week and assigns it a sequential value, such that the df appears like this:
week values seq_value
1 2014-01-06 1 1
2 2014-01-06 2 1
3 2014-01-06 3 1
4 2014-01-13 1 2
5 2014-01-13 2 2
6 2014-01-13 3 2
7 2014-01-20 1 3
8 2014-01-20 2 3
9 2014-01-20 3 3
I guess the idiomatic way would be just to calculate the actual week of the year out of the date provided (in case your weeks are not starting from the first week of the year).
as.integer(format(as.Date(data$week), "%W"))
## [1] 1 1 1 2 2 2 3 3 3
Another base R solution would be using as.POSIXlt class and utilizing its yday attribute
as.POSIXlt(data$week)$yday %/% 7 + 1
## [1] 1 1 1 2 2 2 3 3 3
If you want a shorter syntax, data.table package (among many others - See #Kshashaas comment) offers a quick wrapper
library(data.table)
week(data$week)
## [1] 1 1 1 2 2 2 3 3 3
The nicest thing about this package is that you can create columns by reference (similar to @akrun's last solution, but probably more efficient because it doesn't require the by argument)
setDT(data)[, seq_value := week(week)]
You could use base R by converting the "week" column to factor and specifying the levels as the unique values of "week". Convert factor to numeric and get the numeric index of the levels.
data$seq_value <- with(data, as.numeric(factor(week,levels=unique(week) )))
data$seq_value
#[1] 1 1 1 2 2 2 3 3 3
Or match the "week" column to unique values of that column to get the numeric index.
with(data, match(week, unique(week)))
#[1] 1 1 1 2 2 2 3 3 3
Or using data.table, by first converting data.frame to data.table (setDT) and then get the index values (.GRP) of grouping variable 'week' and assign it to new column seq_value
library(data.table)
setDT(data)[,seq_value:=.GRP, week][]
A dplyr solution:
library(dplyr)
data %>%
mutate(seq_value = dense_rank(week))
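The week-of-year approaches and the index-of-unique-value approaches agree on the sample data, but they are not equivalent in general: the week number restarts each year, while a rank over the distinct weeks keeps counting. A small sketch on hypothetical, unsorted weeks that span a year boundary:

```r
# Hypothetical weeks (Mondays) spanning a year boundary, deliberately unsorted
data <- data.frame(
  week   = c("2015-01-05", "2014-12-29", "2015-01-12", "2014-12-29"),
  values = c(1, 2, 3, 4)
)

# Sequential index over the distinct weeks in chronological order;
# sorting works on the strings because they are in ISO yyyy-mm-dd format
data$seq_value <- match(data$week, sort(unique(data$week)))

# Week of the year, for comparison: it drops back to 1 in January
week_of_year <- as.integer(format(as.Date(data$week), "%W"))

data
week_of_year
```

Sorting the unique values first gives the same numbering as dense_rank(), whereas match(week, unique(week)) numbers the weeks in order of first appearance.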

R - compare rows consecutively in two data frames and return a value

I have the following two data frames:
df1 <- data.frame(month=c("1","1","1","1","2","2","2","3","3","3","3","3"),
temp=c("10","15","16","25","13","17","20","5","16","25","30","37"))
df2 <- data.frame(period=c("1","1","1","1","1","1","1","1","2","2","2","2","2","2","3","3","3","3","3","3","3","3","3","3","3","3"),
max_temp=c("9","13","16","18","30","37","38","39","10","15","16","25","30","32","8","10","12","14","16","18","19","25","28","30","35","40"),
group=c("1","1","1","2","2","2","3","3","3","3","4","4","5","5","5","5","5","6","6","6","7","7","7","7","8","8"))
I would like to:
1. Consecutively for each row, check if the value in the month column in df1 matches that in the period column of df2, i.e. df1$month == df2$period.
2. If step 1 is not TRUE, i.e. df1$month != df2$period, then repeat step 1 and compare the value in df1 with the value in the next row of df2, and so forth until df1$month == df2$period.
3. If df1$month == df2$period, check if the value in the temp column of df1 is less than or equal to that in the max_temp column of df2, i.e. df1$temp <= df2$max_temp.
4. If df1$temp <= df2$max_temp, return the value in that row of the group column in df2 and add this value to df1, in a new column called "new_group".
5. If step 3 is not TRUE, i.e. df1$temp > df2$max_temp, then go back to step 1 and compare the same row in df1 with the next row in df2.
An example of the output data frame I'd like is:
df3 <- data.frame(month=c("1","1","1","1","2","2","2","3","3","3","3","3"),
temp=c("10","15","16","25","13","17","20","5","16","25","30","37"),
new_group=c("1","1","1","2","3","4","4","5","6","7","7","8"))
I've been playing around with the ifelse function and need some help or re-direction. Thanks!
I found the procedure for computing new_group hard to follow as stated. As I understand it, you're trying to create a variable called new_group in df1. For row i of df1, the new_group value is the group value of the first row in df2 that:
Is indexed i or higher
Has a period value matching df1$month[i]
Has a max_temp value no less than df1$temp[i]
I approached this by using sapply called on the row indices of df1:
fxn = function(idx) {
# Potentially matching indices in df2
pm = idx:nrow(df2)
# Matching indices in df2
m = pm[df2$period[pm] == df1$month[idx] &
as.numeric(as.character(df1$temp[idx])) <=
as.numeric(as.character(df2$max_temp[pm]))]
# Return the group associated with the first matching index
return(df2$group[m[1]])
}
df1$new_group = sapply(seq(nrow(df1)), fxn)
df1
# month temp new_group
# 1 1 10 1
# 2 1 15 1
# 3 1 16 1
# 4 1 25 2
# 5 2 13 3
# 6 2 17 4
# 7 2 20 4
# 8 3 5 5
# 9 3 16 6
# 10 3 25 7
# 11 3 30 7
# 12 3 37 8
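The as.numeric(as.character(...)) calls above matter because the columns are stored as text, and text comparison would rank "10" below "9". A simplified sketch of the lookup, assuming the columns are stored as numbers from the start and dropping the "indexed i or higher" condition (for this data the result is the same); first_group is a hypothetical helper name:

```r
# Same values as in the question, stored as numbers instead of strings
df1 <- data.frame(
  month = c(1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3, 3),
  temp  = c(10, 15, 16, 25, 13, 17, 20, 5, 16, 25, 30, 37)
)
df2 <- data.frame(
  period   = rep(1:3, times = c(8, 6, 12)),
  max_temp = c(9, 13, 16, 18, 30, 37, 38, 39,
               10, 15, 16, 25, 30, 32,
               8, 10, 12, 14, 16, 18, 19, 25, 28, 30, 35, 40),
  group    = c(1, 1, 1, 2, 2, 2, 3, 3,
               3, 3, 4, 4, 5, 5,
               5, 5, 5, 6, 6, 6, 7, 7, 7, 7, 8, 8)
)

# Group of the first df2 row with the same period and max_temp >= temp
first_group <- function(month, temp) {
  hit <- which(df2$period == month & df2$max_temp >= temp)
  if (length(hit) == 0) NA_real_ else df2$group[hit[1]]
}

df1$new_group <- mapply(first_group, df1$month, df1$temp)
df1
```

Rows of df1 with no qualifying row in df2 would get NA rather than raising an error.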
library(data.table)
dt1 <- data.table(df1, key="month")
dt2 <- data.table(df2, key="period")
## add a row index
dt1[, rn1 := seq(nrow(dt1))]
dt3 <-
unique(dt1[dt2, allow.cartesian=TRUE][, new_group := group[min(which(temp <= max_temp))], by="rn1"], by="rn1")
## Keep only the columns you want
dt3[, c("month", "temp", "max_temp", "new_group"), with=FALSE]
month temp max_temp new_group
1: 1 1 19 1
2: 1 3 19 1
3: 1 4 19 1
4: 1 7 19 1
5: 2 2 1 3
6: 2 5 1 3
7: 2 6 1 4
8: 3 10 18 5
9: 3 4 18 5
10: 3 7 18 5
11: 3 8 18 5
12: 3 9 18 5

Resources