How to group_by more elegantly my data frame


I have two tables:
One where I know for sure all users of this table df1 have used a feature called "Folder"
The other where I don't know if users of this table df2 have used a feature called "Folder"
I want to build a graph showing the number of users that used the Folder feature on each date.
So the main data frame I want to build is a data frame with ALL the Dates included (from df1 and df2) and for each date the number of users who used this feature "Folder".
Here is a reproducible example:
df1 <- data.frame(Date=c("2021-05-12","2021-05-15","2021-05-20"),user_ID=c("RZ625","TDH65","EJ7336"))
colnames(df1) <- c("Date", "user_ID")
df2 <- data.frame(Date=c("2021-05-12","2021-05-15","2021-05-22"),user_ID=c("IZ823","TDH65","SI826"))
colnames(df2) <- c("Date", "user_ID")
The only way I found so far was to create some kind of flag Folder_True where it's a 1 if we know this user used Folder feature on this date and 0 if we don't know. I then used it with dplyr combining group_by and sum. But I think it's not very elegant and I would like to learn a more logical/efficient way to do this data wrangling.
Thanks!
df1 <- data.frame(Date=c("2021-05-12","2021-05-15","2021-05-20"),user_ID=c("RZ625","TDH65","EJ7336"), Folder_True=c(0,0,0))
df2 <- data.frame(Date=c("2021-05-12","2021-05-15","2021-05-22"),user_ID=c("IZ823","TDH65","SI826"), Folder_True=c(1,0,1))
library(dplyr)
combined_df <- rbind(df1, df2)
combined_df <-
combined_df %>%
group_by(Date, user_ID) %>%
summarise(Folder_True = sum(Folder_True))
final_df <-
combined_df %>%
group_by(Date) %>%
summarise(Nb_Users_Folder_True = sum(Folder_True))

For each Date you can find out unique users who have Folder_True = 1.
library(dplyr)
combined_df <- rbind(df1, df2)
combined_df %>%
group_by(Date) %>%
summarise(Nb_Users_Folder_True = n_distinct(user_ID[Folder_True == 1]))
# Date Nb_Users_Folder_True
# <chr> <int>
#1 2021-05-12 1
#2 2021-05-15 0
#3 2021-05-20 0
#4 2021-05-22 1
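For readers who want to avoid a package dependency, the same per-date count can be sketched in base R. This is only an illustration built on the flagged df1 and df2 from the question; the intermediate name `counts` and the use of split() are my own choices, not part of the answer above.

```r
# Flagged data frames copied from the question
df1 <- data.frame(Date = c("2021-05-12", "2021-05-15", "2021-05-20"),
                  user_ID = c("RZ625", "TDH65", "EJ7336"),
                  Folder_True = c(0, 0, 0))
df2 <- data.frame(Date = c("2021-05-12", "2021-05-15", "2021-05-22"),
                  user_ID = c("IZ823", "TDH65", "SI826"),
                  Folder_True = c(1, 0, 1))
combined_df <- rbind(df1, df2)

# For every date, count the distinct users whose flag is 1
counts <- sapply(split(combined_df, combined_df$Date),
                 function(d) length(unique(d$user_ID[d$Folder_True == 1])))

final_df <- data.frame(Date = names(counts),
                       Nb_Users_Folder_True = unname(counts))
final_df
```

Dates where nobody has the flag set still appear, with a count of 0, because split() creates one group per date present in the combined data.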

Using uniqueN from data.table
library(data.table)
rbindlist(list(df1, df2))[,
.(Nb_Users_Folder_True = uniqueN(user_ID[as.logical(Folder_True)])), Date]
Output:
# Date Nb_Users_Folder_True
#1: 2021-05-12 1
#2: 2021-05-15 0
#3: 2021-05-20 0
#4: 2021-05-22 1

Related

How to subtract using max(date) and second latest (month) date

I'm trying to create a new variable which equals the latest month's value minus the previous month's (or 3 months prior, etc.).
A quick df:
country <- c("XYZ", "XYZ", "XYZ")
my_dates <- c("2021-10-01", "2021-09-01", "2021-08-01")
var1 <- c(1, 2, 3)
df1 <- country %>% cbind(my_dates) %>% cbind(var1) %>% as.data.frame()
df1$my_dates <- as.Date(df1$my_dates)
df1$var1 <- as.numeric(df1$var1)
For example, I've tried (partially from: How to subtract months from a date in R?)
library(tidyverse)
library(lubridate) # provides %m-%
df2 <- df1 %>%
mutate(dif_1month = var1[my_dates==max(my_dates)] - var1[my_dates==max(my_dates) %m-% months(1)])
I've also tried different variations of using lag():
df2 <- df1 %>%
mutate(dif_1month = var1[my_dates==max(my_dates)] - var1[my_dates==max(my_dates)-lag(max(my_dates), n=1L)])
Any suggestions on how to grab the value of a variable when dates equal the second latest observation?
Thanks for help, and apologies for not including any data. Can edit if necessary.
Edited with a few potential answers:
#this gives me the value of var1 of the latest date
df2 <- df1 %>%
mutate(value_1month = var1[my_dates==max(my_dates)])
#this gives me the date of the second latest date
df2 <- df1 %>%
mutate(month1 = max(my_dates) %m-%months(1))
#This gives me the second to latest value
df2 <- df1 %>%
mutate(var1_1month = var1[my_dates==max(my_dates) %m-%months(1)])
#This gives me the difference of the latest value and the second to last of var1
df2 <- df1 %>%
mutate(diff_1month = var1[my_dates==max(my_dates)] - var1[my_dates==max(my_dates) %m-%months(1)])

mutate requires the output to be of the same length as the number of rows of the original data. When we do the subsetting, the length is different. We may need ifelse or case_when
library(dplyr)
library(lubridate)
df1 %>%
mutate(diff_1month = case_when(my_dates==max(my_dates) ~
my_dates %m-% months(1)))
NOTE: Without a reproducible example, the column types and values are not clear.
Based on the OP's update, we may do an arrange first, grab the last two var1 values, and get the difference:
df1 %>%
arrange(my_dates) %>%
mutate(dif_1month = diff(tail(var1, 2)))
. my_dates var1 dif_1month
1 XYZ 2021-08-01 3 -1
2 XYZ 2021-09-01 2 -1
3 XYZ 2021-10-01 1 -1
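The question also asks about looking further back ("or 3 months prior, etc."). A minimal base R sketch of that idea follows; `diff_k_months` is a hypothetical helper of mine, and it assumes one row per month, dates that fall on the first of the month, and that the earlier month actually exists in the data (otherwise the subtraction returns a zero-length result). The data frame is rebuilt with a named `country` column rather than the `.` column produced by the cbind() chain in the question.

```r
df1 <- data.frame(country = "XYZ",
                  my_dates = as.Date(c("2021-10-01", "2021-09-01", "2021-08-01")),
                  var1 = c(1, 2, 3))

# Hypothetical helper: value at the latest date minus the value k months earlier
diff_k_months <- function(dates, values, k = 1) {
  latest <- max(dates)
  # seq() with by = "-k month" steps back k calendar months from the latest date
  earlier <- seq(latest, by = paste0("-", k, " month"), length.out = 2)[2]
  values[dates == latest] - values[dates == earlier]
}

# The single result is recycled to every row, as in the mutate() version
df1$dif_1month <- diff_k_months(df1$my_dates, df1$var1, k = 1)
df1$dif_2month <- diff_k_months(df1$my_dates, df1$var1, k = 2)
df1
```

Unlike diff(tail(var1, 2)), this looks the earlier value up by date, so it does not depend on the row order.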

Merge dataframes using an extra condition r

I know there should be an easier or smarter way of doing what I need, but I haven't found it yet after several days.
I have 2 dataframes that I need to merge using an extra condition. For example:
df1 <- data.frame(Username = c("user1", "user2", "user3", "user4", "user5", "user6"))
df2 <- data.frame(File_Name = c(rep("StudyABC", 5), rep("AnotherStudyCDE", 4)), Username = c("user1", rep(c("user2", "user3", "user4", "user5"),2)))
print(df1)
print(df2)
What I need is to create 2 new columns in df1 called ABC and CDE that includes their "File_Name" values. Of course the real data is hundreds of lines and not ordered so no way of selecting by range.
One of the solutions (not elegant) that I have found is:
library(dplyr)
library(stringr)
df2_filtered <- df2 %>% filter(str_detect(File_Name, "ABC"))
df1 <- left_join(df1, df2_filtered, by = "Username")
names(df1)[2] <- "ABC"
df2_filtered <- df2 %>% filter(str_detect(File_Name, "CDE"))
df1 <- left_join(df1, df2_filtered, by = "Username")
names(df1)[3] <- "CDE"
print(df1)
Is there a shorter way of doing it? I have to repeat the same logic 160 times.
Thanks

You can extract either "ABC" or "CDE" from File_Name and cast the data into wide format. We can join the data with df1 to get all the Username in the final dataframe.
library(dplyr)
df2 %>%
mutate(name = stringr::str_extract(File_Name, 'ABC|CDE')) %>%
tidyr::pivot_wider(names_from = name, values_from = File_Name) %>%
right_join(df1, by = 'Username')
# Username ABC CDE
# <chr> <chr> <chr>
#1 user1 StudyABC NA
#2 user2 StudyABC AnotherStudyCDE
#3 user3 StudyABC AnotherStudyCDE
#4 user4 StudyABC AnotherStudyCDE
#5 user5 StudyABC AnotherStudyCDE
#6 user6 NA NA

What you're looking for is a way of casting data from long to wide. For example, using the data.table package I would do this:
library(data.table)
# converts data.frame to data.table
dt <- as.data.table(df2)
# copy File_Name: one copy is the key for pivoting from long to wide, the other fills in the cells
dt[, study := File_Name]
dt_wide <- dcast(Username~File_Name, data=dt, value.var = "study")
# have a look at df2 in wide format
dt_wide[]
# now it's just a direct merge to pull it back into df1 and turn
# it back into a data.frame for you
out <- merge(as.data.table(df1), dt_wide, by="Username", all.x=TRUE)
setDF(out)
out
There are plenty of tutorials on melting/casting, even without data.table. It's just a matter of knowing what to search for, e.g. Google throws up https://ademos.people.uic.edu/Chapter8.html as the first result.

If one study can have more than one file path (which I assume is the case from your previous attempts), just converting your data to a wide format before joining won't work as you'll have one column per file path, not per study.
One method in this case could be to use a for-loop to create an additional column in df2 with the study name, then convert the data to a wide format using pivot_wider.
It's not a very idiomatic R approach, though, so I'd welcome suggestions for avoiding the empty study column and the for-loop.
library(dplyr)
library(tidyr)
studies <- c("ABC", "CDE")
#create empty column named "study"
df2 <- df2 %>%
mutate(study = NA_character_)
for (i in studies) {
df2 <- df2 %>%
mutate(study = if_else(grepl(i, File_Name), i, study))
}
df2 <- df2 %>%
pivot_wider(names_from = study, values_from = File_Name)
> df2
# A tibble: 5 x 3
Username ABC CDE
<chr> <chr> <chr>
1 user1 StudyABC NA
2 user2 StudyABC AnotherStudyCDE
3 user3 StudyABC AnotherStudyCDE
4 user4 StudyABC AnotherStudyCDE
5 user5 StudyABC AnotherStudyCDE
df2 is now in a wide format and you can join it to df1 as before to get your desired output.
df3 <- left_join(df1, df2)
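One possible way to avoid the placeholder column and the if_else loop, sketched in base R: build a single alternation pattern from the `studies` vector and extract the study code in one pass. The names `hit`, `out` and `rows` are mine, and the sketch assumes each file name contains at most one study code. It still loops once per study to add the output columns, and match() keeps only the first file for each user and study.

```r
df1 <- data.frame(Username = c("user1", "user2", "user3", "user4", "user5", "user6"),
                  stringsAsFactors = FALSE)
df2 <- data.frame(File_Name = c(rep("StudyABC", 5), rep("AnotherStudyCDE", 4)),
                  Username = c("user1", rep(c("user2", "user3", "user4", "user5"), 2)),
                  stringsAsFactors = FALSE)

studies <- c("ABC", "CDE")

# One alternation pattern ("ABC|CDE") finds the study code in each file name
hit <- regexpr(paste(studies, collapse = "|"), df2$File_Name)
df2$study <- NA_character_
df2$study[hit > 0] <- regmatches(df2$File_Name, hit)

# Add one column per study, aligned to the users in df1
out <- df1
for (s in studies) {
  rows <- df2[which(df2$study == s), ]
  out[[s]] <- rows$File_Name[match(out$Username, rows$Username)]
}
out
```

With 160 studies, only the `studies` vector has to grow; the rest of the code stays the same.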

How to create columns from a list in a for loop using mutate

I was wondering if there was a way to create multiple columns from a list in R using the mutate() function within a for loop.
Here is an example of what I mean:
The Problem:
I have a data frame df that has 2 columns: category and rating. I want to add a column for every element of df$category and in that column, I want a 1 if the category column matches the iterator.
library(dplyr)
df <- tibble(
category = c("Art","Technology","Finance"),
rating = c(100,95,50)
)
Doing it manually, I could do:
df <-
df %>%
mutate(art = ifelse(category == "Art", 1,0))
However, what happens when I have 50 categories? (Which is close to what I have in my original problem. That would take a lot of time!)
What I tried:
category_names <- df$category
for(name in category_names){
df <-
df %>%
mutate(name = ifelse(category == name, 1,0))
}
Unfortunately, it doesn't seem to work.
I'd appreciate any light on the subject!
Full Code:
library(dplyr)
#Creates tibble
df <- tibble(
category = c("Art","Technology","Finance"),
rating = c(100,95,50)
)
#Showcases the operation I would like to loop over df
df <-
df %>%
mutate(art = ifelse(category == "Art", 1,0))
#Creates a variable for clarity
category_names <- df$category
#For loop I tried
for(name in category_names){
df <-
df %>%
mutate(name = ifelse(category == name, 1,0))
}
I am aware that what I am essentially doing is a form of model.matrix(); however, before I found out about that function I was still perplexed why what I was doing before wasn't working.

We can use pivot_wider after creating a sequence column
library(dplyr)
library(tidyr)
df %>%
mutate(rn = row_number(), n = 1) %>%
pivot_wider(names_from = category, values_from = n,
values_fill = list(n = 0)) %>%
select(-rn)
# A tibble: 3 x 4
# rating Art Technology Finance
# <dbl> <dbl> <dbl> <dbl>
#1 100 1 0 0
#2 95 0 1 0
#3 50 0 0 1
Or another option is map
library(purrr)
map_dfc(unique(df$category), ~ df %>%
transmute(!! .x := +(category == .x))) %>%
bind_cols(df, .)
# A tibble: 3 x 5
# category rating Art Technology Finance
#* <chr> <dbl> <int> <int> <int>
#1 Art 100 1 0 0
#2 Technology 95 0 1 0
#3 Finance 50 0 0 1
If we need a for loop
for(name in category_names) df <- df %>% mutate(!! name := +(category == name))
Or in base R with table
cbind(df, as.data.frame.matrix(table(seq_len(nrow(df)), df$category)))
# category rating Art Finance Technology
#1 Art 100 1 0 0
#2 Technology 95 0 0 1
#3 Finance 50 0 1 0

Wanted to throw something in for anyone who stumbles across this question. The problem in the OP is that the "name" column name gets re-used during each iteration of the loop: you end up with only one new column, when you really wanted three (or 50). I consistently find myself wanting to create multiple new columns within loops, and I recently found out that mutate can now take "glue"-like inputs to do this. The following code now also solves the original question:
for(name in category_names){
df <-
df %>%
mutate("{name}" := ifelse(category == name, 1, 0))
}
This is equivalent to akrun's answer using a for loop, but it doesn't involve the !! operator. Note that you still need the "walrus" := operator, and that the column name needs to be a string (I think since it's using "glue" in the background). I'm thinking some people might find this format easier to understand.
Reference: https://www.tidyverse.org/blog/2020/02/glue-strings-and-tidy-eval/
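For comparison, here is a minimal base R sketch of the same loop. It relies on the fact that `[[<-` evaluates its index, so the loop variable is used as the column name, whereas `mutate(name = ...)` creates a column literally called "name". The data frame is rebuilt here so the snippet runs on its own.

```r
df <- data.frame(category = c("Art", "Technology", "Finance"),
                 rating = c(100, 95, 50),
                 stringsAsFactors = FALSE)

# The loop sequence is evaluated once, so adding columns inside the loop is safe
for (name in df$category) {
  # `[[<-` uses the value of `name` as the new column's name
  df[[name]] <- as.integer(df$category == name)
}
df
```

Each iteration adds a new 0/1 column named after one category, so three categories give three new columns.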

Select only rows where the last date is present

Let's say that I have the following data.
df = data.frame(name = c("A","A","A","B","B","B","B"),
date = c("2011-01-01","2011-03-01","2011-05-01",
"2011-01-01","2011-05-01","2011-06-01",
"2011-07-01"))
df
I know the last date in the data set and only want to pick those names where data is available for the last date. So in the above example, the last date is only available for name B. Thus, I want to select only the rows for name B.
I can do simple hacks like this to get the desired result.
last_date = "2011-07-01"
#unique(df$name[df$date %in% last_date])
df[df$name %in% unique(df$name[df$date %in% last_date]),]
However, I was wondering if there was a dplyr/tidyverse or data.table solution for this task.

There are multiple ways you can do this. With dplyr, we can keep only those groups which contain last_date:
library(dplyr)
df %>%
group_by(name) %>%
filter(last_date %in% date)
# name date
# <fct> <fct>
#1 B 2011-01-01
#2 B 2011-05-01
#3 B 2011-06-01
#4 B 2011-07-01
Or similarly in base R :
df[ave(df$date, df$name, FUN = function(x) last_date %in% x) == TRUE,]
Alternatively, we can collect every name that has a row for last_date and keep only those names from the original dataframe.
df[with(df, name %in% name[date %in% last_date]), ]
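If the last date should not be hard-coded, it can be derived from the data. The following is a small base R sketch under the assumption that "last date" means the maximum date in the whole data set; `keep` and `result` are illustrative names of mine.

```r
df <- data.frame(name = c("A", "A", "A", "B", "B", "B", "B"),
                 date = c("2011-01-01", "2011-03-01", "2011-05-01",
                          "2011-01-01", "2011-05-01", "2011-06-01",
                          "2011-07-01"),
                 stringsAsFactors = FALSE)

# Convert to Date so that max() compares dates rather than strings
df$date <- as.Date(df$date)
last_date <- max(df$date)

# Names that have an observation on the last date
keep <- unique(df$name[df$date == last_date])

# All rows belonging to those names
result <- df[df$name %in% keep, ]
result
```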

Merging two dataframes with different size

I have two data frames, each with two columns: one column for the date and another for numeric data. The two data frames have different sizes. Here is an example of what I have and what I need.
This is what I have:
DF1
2015-01-02 0
2015-01-03 0
2015-01-04 0
DF2
2015-01-03 200
This is what I need:
DF1
2015-01-02 0
2015-01-03 200
2015-01-04 0
I have tried comparing both data frames (with the compare function) but have not found a solution.
In case it helps (or makes a solution faster): the dates are sorted in both data frames.
Could someone help me?
Thank you very much,
Gobya

It's not clear how you want to choose which row to select when there are matching dates from the two data frames (per @user295691's comment), so I've provided two selection options below that give the result you specified.
DF1 <- data.frame(date = c("2015-01-02", "2015-01-03", "2015-01-04"),
value = c(0, 0, 0), stringsAsFactors=FALSE)
DF2 <- data.frame(date = c("2015-01-03"), value = c(200), stringsAsFactors=FALSE)
DF1$source = "DF1"
DF2$source = "DF2"
library(dplyr)
# Choose the greatest value for each date
newDF = DF1 %>% bind_rows(DF2) %>%
group_by(date) %>%
filter(value == max(value))
# If there is more than one row for a given date,
# choose the value(s) from DF2 for that date
newDF = DF1 %>% bind_rows(DF2) %>%
group_by(date) %>%
mutate(n=n()) %>%
filter(ifelse(n>1, source=="DF2", source=="DF1")) %>%
select(-n)
FYI, for the second approach, I thought the following would work, but it excludes rows with date=2015-01-03. I'm not sure why and would be interested in any ideas on what's going wrong:
DF1 %>% bind_rows(DF2) %>%
group_by(date) %>%
filter(ifelse(n() > 1, source=="DF2", source=="DF1"))
date value source
1 2015-01-02 0 DF1
2 2015-01-04 0 DF1
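A likely explanation for this puzzle (my reading, not something the original poster confirmed): ifelse() returns a result with the same length as its test. Within a group, n() > 1 is a single value, so ifelse() returns only the first element of source=="DF2", and filter() then applies that one value to the whole group. The base R sketch below imitates the two-row group for 2015-01-03; `source_col` and `n_rows` are stand-ins for the grouped column and n().

```r
# The two rows that share date 2015-01-03, in bind_rows() order
source_col <- c("DF1", "DF2")
n_rows <- length(source_col)  # stands in for n() within the group

# Length-1 test: ifelse() returns only the first element of `yes`
scalar_test <- ifelse(n_rows > 1, source_col == "DF2", source_col == "DF1")

# Test repeated to the group length, which is what mutate(n = n()) provides
vector_test <- ifelse(rep(n_rows, n_rows) > 1,
                      source_col == "DF2", source_col == "DF1")

# A plain if/else also keeps the full-length vector
plain_if <- if (n_rows > 1) source_col == "DF2" else source_col == "DF1"

scalar_test
vector_test
plain_if
```

Here scalar_test is a single FALSE, which would drop both rows of the group, while the other two forms keep the DF2 row.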

Using full_join() from the dplyr package:
DF1 <- data.frame(date = c("2015-01-02", "2015-01-03", "2015-01-04"),
number = c(0, 0, 0))
DF2 <- data.frame(date = c("2015-01-03"), number = c(200))
DF3 <- full_join(DF1, DF2, by="date")
DF3$newColumn <- ifelse(is.na(DF3$number.y), DF3$number.x, DF3$number.y)

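# assumes both data frames have unnamed columns V1 (date) and V2 (value)
newdf <- merge(DF1, DF2, by='V1', all=T)
# overwrite DF1's value wherever DF2 supplies one
newdf[,2][!is.na(newdf[,3])] <- newdf[,3][!is.na(newdf[,3])]
newdf[-3]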
# V1 V2.x
# 1 2015-01-02 0
# 2 2015-01-03 200
# 3 2015-01-04 0
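Another option, sketched here in base R, is to update DF1 in place with match() instead of merging. The column names `date` and `value` are assumptions carried over from the first answer, and the sketch assumes each date appears at most once in DF1.

```r
DF1 <- data.frame(date = c("2015-01-02", "2015-01-03", "2015-01-04"),
                  value = c(0, 0, 0), stringsAsFactors = FALSE)
DF2 <- data.frame(date = "2015-01-03", value = 200, stringsAsFactors = FALSE)

# Position in DF1 of every DF2 date (NA when DF1 lacks that date)
idx <- match(DF2$date, DF1$date)
found <- !is.na(idx)

# Overwrite only the matching rows; DF2 dates missing from DF1 are ignored
DF1$value[idx[found]] <- DF2$value[found]
DF1
```

This keeps DF1's row order and size, which fits the case where DF2 only carries corrections for dates already present in DF1.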
