I have the following variables
id1 = c(1,1,1,2,3,4,4,5)
id2 = c(1,1,2,2,3,3,4)
digit1 = c(243, 888, 343, 276, 493, 024, 305, 093)
digit2 = c(343, 756, 947, 089, 390, 930, 024)
df1 = data.frame(id1, digit1)
df2 = data.frame(id2, digit2)
I'm looking for a way to see how many matches digit1 has with digit2 based on matching ids. The frequency of a given id can vary within the same data frame and also when compared with the other data frame.
I don't want it to count an extra incorrect or correct when id1 has a higher frequency than id2 for a given id. For example, when comparing the first three digits in df1 with only the first two digits in df2, the returned vector would count that as 1 correct, 1 incorrect, and 1 NA. I'm trying to merge both data frames and add a new column for the outcomes of the matching.
After aligning the columns in df1 and df2 and merging them into a new data frame, I want to add a vector like (0, NA, 1, 0, 0, 1, NA, NA) to the new data frame.
The actual data I'll be using has thousands of rows in each data frame.
One way to do this would be by joining the two data frames and then checking which ids have digit1 == digit2. Here's what I mean:
# Data from the question (before the edit)
id1 = c(1,1,2,3,4,4,5,6,7)
id2 = c(1,2,2,3,3,4,5)
digit1 = c(243, 343, 276, 493, 024, 305, 093, 393, 208)
digit2 = c(343, 947, 089, 390, 930, 024, 093)
df1 = data.frame(id1, digit1)
df2 = data.frame(id2, digit2)
library(dplyr)
df1 %>%
  full_join(df2, by = c('id1' = 'id2')) %>%
  mutate(match = (digit1 == digit2)) %>%
  group_by(id1) %>%
  summarise(match = sum(match))
you get:
# id1 match
# <dbl> <int>
# 1 1 1
# 2 2 0
# 3 3 0
# 4 4 1
# 5 5 1
# 6 6 NA
# 7 7 NA
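If you also want the per-row outcome column from the question (a 0/1/NA for each row of the merged frame), a minimal sketch is to stop before the summarise; as.integer() is just my choice for turning TRUE/FALSE into 1/0:
library(dplyr)

df1 %>%
  full_join(df2, by = c('id1' = 'id2')) %>%
  mutate(match = as.integer(digit1 == digit2))  # 1 = match, 0 = mismatch, NA = no counterpart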
I have a dataframe
idnr <- c(1,1,1,1,1,1,1,2,2,2,2,2,2,2,5,5,5,5,5,6,6,6,6,6,7,7,7)
labvalue <- c(100, 80, 75, 70, 50, 60, 55, 200, 180, 165, 160, 150, 170, 175, 300, 280, 260, 250, 255, 400, 380, 360, 350, 355, 500, 480, 460)
labdate <- as.Date(c("2022-01-01", "2022-01-02", "2022-01-03", "2022-01-04", "2022-01-05", "2022-01-06", "2022-01-07",
"2022-01-08", "2022-01-09", "2022-01-10", "2022-01-11", "2022-01-12", "2022-01-13", "2022-01-14",
"2022-01-15", "2022-01-16", "2022-01-17", "2022-01-18", "2022-01-19", "2022-01-20", "2022-01-21",
"2022-01-22", "2022-01-23", "2022-01-24", "2022-01-25", "2022-01-26", "2022-01-27"))
data <- data.frame(idnr, labvalue, labdate)
I would like to create a variable for each idnr indicating whether that idnr has had a drop in lab value of 40 or more within 2 days. To clarify: if an idnr has a lab value of 200, I want to check whether any lab value taken after the date of the 200 value, but within 48 hours, is 160 or less.
Preferably I would like it to work if the dates had time stamps as well. I understand that I probably need a for loop, but I can't get it to work.
You can make a helper function to check each row for drops within 2 days, then apply to dates and values using purrr::map2_lgl(), grouped by idnr.
library(dplyr)
library(purrr)
has_drop <- function(cur_date, cur_value, all_dates, all_values) {
  # Days from the current measurement to every other measurement
  days_diff <- as.numeric(all_dates - cur_date, units = "days")
  # Values recorded on the same day or within the next 2 days
  vals_2day <- all_values[between(days_diff, 0, 2)]
  # TRUE if any of those values is at least 40 below the current value
  any(vals_2day - cur_value <= -40)
}
data %>%
  group_by(idnr) %>%
  summarize(
    drop = any(map2_lgl(
      labdate,
      labvalue,
      \(d, v) has_drop(d, v, labdate, labvalue)
    ))
  )
# A tibble: 5 × 2
idnr drop
<dbl> <lgl>
1 1 FALSE
2 2 FALSE
3 5 TRUE
4 6 TRUE
5 7 TRUE
To get the dates of values with drops within 2 days, use filter() instead of summarize():
data %>%
  group_by(idnr) %>%
  filter(map2_lgl(
    labdate,
    labvalue,
    \(d, v) has_drop(d, v, labdate, labvalue)
  )) %>%
  ungroup()
# A tibble: 3 × 3
idnr labvalue labdate
<dbl> <dbl> <date>
1 5 300 2022-01-15
2 6 400 2022-01-20
3 7 500 2022-01-25
The same code should work for POSIXct timestamps.
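As a quick check, here is the same helper applied to two made-up POSIXct timestamps (these rows are hypothetical, not from the question's data); has_drop() is reused unchanged:
ts   <- as.POSIXct(c("2022-01-15 08:00:00", "2022-01-16 09:30:00"), tz = "UTC")
vals <- c(300, 255)

# 255 is 45 below 300 and recorded about 1.06 days later, so this returns TRUE
has_drop(ts[1], vals[1], ts, vals)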
So I have some data like this:
   Q7  ProblemGambling
  950                0
  170                0
  490                0
  500                0
  ...              ...
  780               26
23.33               27
  170               27
   10               27
It is imported from an Excel spreadsheet such that the first column contains a wide range of whole numbers, but the second column categorizes the corresponding values from the first in a range from 0 to 27. I need to first transform those second values into either "non-problem", "low-risk", "moderate-risk" or "problem", based on whether they're 0, 1-4, 5-7, or 8+. Then I need to separate the resulting list into multiple sublists where each 2 of these categories pair off for further analysis.
I am not sure I fully understood your problem, but this should be at least a partial solution:
library(dplyr)
# some dummy data from your print
df <- data.frame(Q7 = c(950, 170, 490, 500, 730, 23.33, 170, 10),
ProblemGambling = c(0,0,0,0,26,27,27,27))
df %>%
  # assign the groups according to numeric range
  dplyr::mutate(ProblemGambling = case_when(ProblemGambling == 0 ~ "non-problem",
                                            ProblemGambling <= 4 ~ "low-risk",
                                            ProblemGambling <= 7 ~ "moderate-risk",
                                            TRUE ~ "problem")) %>%
  # build groupings to be able to split by them
  dplyr::group_by(ProblemGambling) %>%
  # split into sublists according to the grouping
  dplyr::group_split()
[[1]]
# A tibble: 4 x 2
Q7 ProblemGambling
<dbl> <chr>
1 950 non-problem
2 170 non-problem
3 490 non-problem
4 500 non-problem
[[2]]
# A tibble: 4 x 2
Q7 ProblemGambling
<dbl> <chr>
1 730 problem
2 23.3 problem
3 170 problem
4 10 problem
We could use cut with split in base R
split(df, cut(df$ProblemGambling, breaks = c(-Inf, 0, 4, 7, Inf)), drop = TRUE)
data
df <- data.frame(Q7 = c(950, 170, 490, 500, 730, 23.33, 170, 10),
ProblemGambling = c(0,0,0,0,26,27,27,27))
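cut() also takes a labels argument, so the list names can carry the category names rather than the interval notation; a small variant of the same idea:
cats <- cut(df$ProblemGambling,
            breaks = c(-Inf, 0, 4, 7, Inf),
            labels = c("non-problem", "low-risk", "moderate-risk", "problem"))
split(df, cats, drop = TRUE)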
I have the following dataframe:
user_id <- c(97, 97, 97, 97, 96, 95, 95, 94, 94)
event_id <- c(42, 15, 43, 12, 44, 32, 38, 10, 11)
plan_id <- c(NA, 38, NA, NA, 30, NA, NA, 30, 25)
treatment_id <- c(NA, 20, NA, NA, NA, 28, 41, 17, 32)
system <- c(1, 1, 1, 1, NA, 2, 2, NA, NA)
df <- data.frame(user_id, event_id, plan_id, treatment_id, system)
I would like to count the number of distinct user_id values for each column, excluding the NA values. The output I am hoping for is:
user_id event_id plan_id treatment_id system
1 4 4 3 4 2
I tried to leverage mutate_all, but that was unsuccessful because my data frame is too large. In other functions, I've used the following two lines of code to get the non-null count and the distinct count for each column:
colSums(!is.empty(df[,]))
apply(df[,], 2, function(x) length(unique(x)))
Optimally, I would like to combine the two with an ifelse to minimize the mutations, as this will ultimately be thrown into a function to be applied with a number of other summary statistics to a list of data frames.
I have tried a brute-force method, where I make the values 1 if not null and 0 otherwise, and then copy the id into that column if 1. I can then just use the distinct-count line from above to get my output. However, I get the wrong values when copying it into the other columns, and the number of adjustments is suboptimal. See code:
binary <- cbind(df$user_id, !is.empty(df[,2:length(df)]))
copied <- binary %>% replace(. > 0, binary[.,1])
I'd greatly appreciate your help.
1: Base
sapply(df, function(x){
  length(unique(df$user_id[!is.na(x)]))
})
# user_id event_id plan_id treatment_id system
# 4 4 3 3 2
2: Base
aggregate(user_id ~ ind, unique(na.omit(cbind(stack(df), df[1]))[-1]), length)
# ind user_id
#1 user_id 4
#2 event_id 4
#3 plan_id 3
#4 treatment_id 3
#5 system 2
3: tidyverse
library(dplyr)
library(tidyr)

df %>%
  mutate(key = user_id) %>%
  pivot_longer(!key) %>%
  filter(!is.na(value)) %>%
  group_by(name) %>%
  summarise(value = n_distinct(key)) %>%
  pivot_wider()
# A tibble: 1 x 5
# event_id plan_id system treatment_id user_id
# <int> <int> <int> <int> <int>
#1 4 3 2 3 4
Thanks @dcarlson, I had misunderstood the question:
apply(df, 2, function(x){length(unique(df[!is.na(x), 1]))})
A data.table option with uniqueN:
library(data.table)
setDT(df)[, lapply(.SD, function(x) uniqueN(user_id[!is.na(x)]))]
user_id event_id plan_id treatment_id system
1: 4 4 3 3 2
Using dplyr you can use summarise with across:
library(dplyr)
df %>% summarise(across(.fns = ~n_distinct(user_id[!is.na(.x)])))
# user_id event_id plan_id treatment_id system
#1 4 4 3 3 2
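If you later need to restrict this to particular columns, across() takes a tidyselect specification as its first argument; everything() below is just the explicit form of the default:
library(dplyr)

# Same result, with the (default) column selection written out explicitly
df %>% summarise(across(everything(), ~ n_distinct(user_id[!is.na(.x)])))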
I have a data set
customerId <- c(101, 101, 101, 102, 102, 102, 104, 104, 106, 109, 109, 109)
Purchasedate <- c("2020-06-19", "2020-06-20", "2020-06-21", "2020-06-24", "2020-06-27", "2020-06-28",
                  "2020-06-20", "2020-06-21", "2020-06-24", "2020-06-10", "2020-06-14", "2020-06-16")
df <- data.frame(customerId, Purchasedate)
I am trying to find the following output:
101 3
104 2
since customer ids 101 and 104 are the only ones with consecutive purchase dates.
I am trying to find, using R, the customers who made purchases on consecutive days, and for how many days.
Maybe you could consider checking the difference between consecutive dates using group_by for each id, filtering to those ids where the difference is always 1, and using summarise to total up the number of rows/dates:
library(dplyr)
df %>%
  group_by(customerId) %>%
  mutate(diffDays = c(1, diff(Purchasedate))) %>%
  filter(n_distinct(diffDays) == 1 & n() > 1) %>%
  summarise(continuousDays = n())
Output
customerId continuousDays
<dbl> <int>
1 101 3
2 104 2
Data
df <- structure(list(customerId = c(101, 101, 101, 102, 102, 102, 104,
104, 106, 109, 109, 109), Purchasedate = structure(c(18432, 18433,
18434, 18437, 18440, 18441, 18433, 18434, 18437, 18423, 18427,
18429), class = "Date")), row.names = c(NA, -12L), class = "data.frame")
A similar question was answered here: https://stackoverflow.com/a/53713204/12400385
Adapting it slightly to your case
library(dplyr)
library(lubridate)
# Convert Purchasedate to a date column
df <- df %>%
  mutate(Purchasedate = ymd(Purchasedate))
# Create custom function
max_consec <- function(x) {
  y <- c(unclass(diff(x))) # c and unclass -- preparing it for rle
  r <- rle(y)
  with(r, max(lengths[values == 1]) + 1)
}
# Apply function to each customer
df %>%
  group_by(customerId) %>%
  summarize(max.consecutive = max_consec(Purchasedate))
#-------
# A tibble: 5 x 2
customerId max.consecutive
<dbl> <dbl>
1 101 3
2 102 2
3 104 2
4 106 -Inf
5 109 -Inf
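The -Inf rows appear because, for customers with no pair of consecutive days, lengths[values == 1] is empty and max() of an empty vector returns -Inf with a warning. One possible tweak, assuming you'd rather report 1 (a single-day "streak") for those customers:
# Variant of max_consec that reports 1 instead of -Inf (my assumption
# about the desired behaviour for customers with no consecutive days)
max_consec <- function(x) {
  y <- as.numeric(diff(x))               # gaps in days between successive purchases
  r <- rle(y)
  streaks <- r$lengths[r$values == 1]    # lengths of runs of 1-day gaps
  if (length(streaks) == 0) return(1)    # no consecutive days at all
  max(streaks) + 1                       # +1: a run of n gaps spans n + 1 days
}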
I am trying to add rows to a data frame based on the minimum and maximum data within each group. Suppose this is my original data frame:
df = data.frame(Date = as.Date(c("2017-12-01", "2018-01-01", "2017-12-01", "2018-01-01", "2018-02-01","2017-12-01", "2018-02-01")),
Group = c(1,1,2,2,2,3,3),
Value = c(100, 200, 150, 125, 200, 150, 175))
Notice that Group 1 has 2 consecutive dates, Group 2 has 3 consecutive dates, and Group 3 is missing the date in the middle (2018-01-01). I'd like to be able to complete the data frame by adding rows for missing dates. But the thing is, I only want to add dates that are missing between the minimum and maximum date within each group. So if I were to complete this data frame it would look like this:
df_complete = data.frame(Date = as.Date(c("2017-12-01", "2018-01-01", "2017-12-01", "2018-01-01", "2018-02-01","2017-12-01","2018-01-01", "2018-02-01")),
Group = c(1,1,2,2,2,3,3,3),
Value = c(100, 200, 150, 125, 200, 150,NA, 175))
Only one row was added because Group 3 was missing one date. There was no date added for Group 1 because it had all the dates between its minimum (2017-12-01) and maximum date (2018-01-01).
You can use tidyr::complete with dplyr to find a solution. The interval between consecutive dates appears to be one month, so the approach will be as below:
library(dplyr)
library(tidyr)
df %>%
  group_by(Group) %>%
  complete(Group, Date = seq.Date(min(Date), max(Date), by = "month"))
# A tibble: 8 x 3
# Groups: Group [3]
# Group Date Value
# <dbl> <date> <dbl>
# 1 1.00 2017-12-01 100
# 2 1.00 2018-01-01 200
# 3 2.00 2017-12-01 150
# 4 2.00 2018-01-01 125
# 5 2.00 2018-02-01 200
# 6 3.00 2017-12-01 150
# 7 3.00 2018-01-01 NA
# 8 3.00 2018-02-01 175
Data
df = data.frame(Date = as.Date(c("2017-12-01", "2018-01-01", "2017-12-01", "2018-01-01",
"2018-02-01","2017-12-01", "2018-02-01")),
Group = c(1,1,2,2,2,3,3),
Value = c(100, 200, 150, 125, 200, 150, 175))
@MKR's approach of using tidyr::complete with dplyr is good, but it will fail if the group column is not numeric. Character columns are then coerced to factors, and the complete() operation results in a tibble with a row for every factor/date combination for each group.
complete() does not need the group variable as its first argument, so the solution is:
library(dplyr)
library(tidyr)
df %>%
  group_by(Group) %>%
  complete(Date = seq.Date(min(Date), max(Date), by = "month"))
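As a small follow-up, once the missing dates are filled in you may want to drop the grouping and restore a predictable row order; a minimal sketch:
library(dplyr)
library(tidyr)

df %>%
  group_by(Group) %>%
  complete(Date = seq.Date(min(Date), max(Date), by = "month")) %>%
  ungroup() %>%                 # remove the grouping added above
  arrange(Group, Date)          # order rows to match df_complete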