I have found this dataframe in an Excel file, very disorganized. This is just a sample of a bigger dataset, with many jobs.
df <- data.frame(
Job = c("Frequency", "Driver", "Operator"),
Gloves = c("Daily", 1,2),
Aprons = c("Weekly", 2,0)
)
I need it to be in this format, something that I can work in a database:
df <- data.frame(
Job = c("Driver", "Driver", "Operator", "Operator"),
Frequency= c("Daily", "Weekly", "Daily", "Weekly"),
Item= c("Gloves", "Aprons", "Gloves", "Aprons"),
Quantity= c(1,2,2,0)
)
Any thoughts on how to manipulate the data? I have tried without any luck.
We could use tidyverse methods, doing this in three steps:
Remove the first row - slice(-1), reshape to 'long' format (pivot_longer)
Keep only the first row - slice(1), reshape to 'long' format (pivot_longer)
Do a join with both of the reshaped datasets
library(dplyr)
library(tidyr)
df %>%
slice(-1) %>%
pivot_longer(cols = -Job, names_to = 'Item',
values_to = 'Quantity') %>%
left_join(df %>%
slice(1) %>%
pivot_longer(cols= -Job, values_to = 'Frequency',
names_to = 'Item') %>%
select(-Job), by = 'Item')
-output
# A tibble: 4 x 4
Job Item Quantity Frequency
<chr> <chr> <chr> <chr>
1 Driver Gloves 1 Daily
2 Driver Aprons 2 Weekly
3 Operator Gloves 2 Daily
4 Operator Aprons 0 Weekly
data
df <- data.frame(
Job = c("Frequency", "Driver", "Operator"),
Gloves = c("Daily", 1,2),
Aprons = c("Weekly", 2,0))
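Note that Quantity comes out as character in the output above, because each Excel column mixes a frequency label with numbers. The following is a minimal sketch of the same three steps with an explicit numeric conversion and the column order requested in the question. The names freq and result are made up for illustration, and stringsAsFactors = FALSE is added as an assumption so that the conversion also behaves on R versions older than 4.0.

```r
library(dplyr)
library(tidyr)

df <- data.frame(
  Job = c("Frequency", "Driver", "Operator"),
  Gloves = c("Daily", 1, 2),
  Aprons = c("Weekly", 2, 0),
  stringsAsFactors = FALSE
)

# Lookup table Item -> Frequency, built from the first row only
freq <- df %>%
  slice(1) %>%
  select(-Job) %>%
  pivot_longer(cols = everything(), names_to = "Item", values_to = "Frequency")

# Remaining rows in long format, with Quantity converted to numeric
result <- df %>%
  slice(-1) %>%
  pivot_longer(cols = -Job, names_to = "Item", values_to = "Quantity") %>%
  mutate(Quantity = as.numeric(Quantity)) %>%
  left_join(freq, by = "Item") %>%
  select(Job, Frequency, Item, Quantity)

result
```

Building the lookup table separately keeps the join key (Item) explicit, which may be easier to follow when the real sheet has many item columns.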
I have a data frame where I want to sum column values with the same prefix to produce a new column. My current problem is that it's not taking into account my group_by variable and returning identical values. Is part of the problem the .cols variable I'm selecting in the across function?
Sample data
library(dplyr)
library(purrr)
library(glue)
set.seed(10)
dat <- data.frame(id = rep(1:2, 5),
var1.pre = rnorm(10),
var1.post = rnorm(10),
var2.pre = rnorm(10),
var2.post = rnorm(10)
) %>%
mutate(index = id)
var_names = c("var1", "var2")
What I've tried
sumfunction <- map(
var_names,
~function(.){
sum(dat[glue("{.x}.pre")], dat[glue("{.x}.post")], na.rm = TRUE)
}
) %>%
setNames(var_names)
dat %>%
group_by(id) %>%
summarise(
across(
.cols = index,
.fns = sumfunction,
.names = "{.fn}"
)
) %>%
ungroup
Desired output
For this and similar problems I made the 'dplyover' package (it is not on CRAN). Here we can use dplyover::across2() to loop over two series of columns, first, all columns ending with "pre" and second all columns ending with "post". To get the names correct we can use .names = "{pre}" to get the common prefix of both series of columns.
library(dplyr)
library(dplyover) # https://timteafan.github.io/dplyover/
dat %>%
group_by(id) %>%
summarise(across2(ends_with("pre"),
ends_with("post"),
~ sum(c(.x, .y)),
.names = "{pre}"
)
)
#> # A tibble: 2 × 3
#> id var1 var2
#> <int> <dbl> <dbl>
#> 1 1 -2.32 -5.55
#> 2 2 1.11 -9.54
Created on 2022-12-14 with reprex v2.0.2
Whenever operations across multiple columns get complicated, we could pivot:
library(dplyr)
library(tidyr)
dat %>%
pivot_longer(-c(id, index),
names_to = c(".value", "name"),
names_sep = "\\.") %>%
group_by(id) %>%
summarise(var1 = sum(var1), var2=sum(var2))
id var1 var2
<int> <dbl> <dbl>
1 1 -2.32 -5.55
2 2 1.11 -9.54
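As a side note on why the original attempt returns identical values: the generated functions index the global dat directly, so every group sums the full columns. The pivot approach above can also be written without hard-coding var1 and var2. This is only a sketch, and the intermediate names var, period, total and sums are chosen here for illustration.

```r
library(dplyr)
library(tidyr)

set.seed(10)
dat <- data.frame(id = rep(1:2, 5),
                  var1.pre = rnorm(10),
                  var1.post = rnorm(10),
                  var2.pre = rnorm(10),
                  var2.post = rnorm(10)) %>%
  mutate(index = id)

sums <- dat %>%
  # split "var1.pre" into var = "var1" and period = "pre"
  pivot_longer(-c(id, index),
               names_to = c("var", "period"),
               names_sep = "\\.") %>%
  # one sum per id and prefix, covering both pre and post values
  group_by(id, var) %>%
  summarise(total = sum(value, na.rm = TRUE), .groups = "drop") %>%
  # back to one column per prefix
  pivot_wider(names_from = var, values_from = total)

sums
```

Because the prefixes are taken from the column names, adding a var3.pre / var3.post pair should not require any change to the code.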
I have a dataframe which looks like this:
ID <- c(1,1,1,2,2,2,2,3,3,3,3)
Fact <- c(233,233,233,50,50,50,50,15,15,15,15)
Overall_Category <- c("Purchaser","Purchaser","Purchaser","Car","Car","Car","Car","Car","Car","Car","Car")
Descriptor <- c("Country", "Gender", "Eyes", "Color", "Financed", "Type", "Transmission", "Color", "Financed", "Type", "Transmission")
Members <- c("America", "Male", "Brown", "Red", "Yes", "Sedan", "Manual", "Blue","No", "Van", "Automatic")
df <- data.frame(ID, Fact, Overall_Category, Descriptor, Members)
The dataframe's dimensions work like this:
There will always be an ID/key which uniquely identifies a submitted fact.
There will always be a dimension for a given fact defining the Overall_Category to which a submitted fact belongs.
Most of the time - but not always - there will be a dimension for a "Descriptor".
If there is a "Descriptor" dimension for a given fact, there will be another "Members" dimension to show possible members within "Descriptor".
The problem is that a single submitted fact is duplicated for a given ID based on how many dimensions apply to the given fact. What I'd like is a way to show the fact only once, based on its ID, and have the applicable dimensions stored against that single ID.
I've achieved it by doing this:
df1 <- pivot_wider(df,
id_cols = ID,
names_from = c(Overall_Category, Descriptor, Members),
names_prefix = "zzzz",
values_from = Fact,
names_sep = "-",
names_repair = "unique")
ColumnNames <- df1 %>% select(matches("zzzz")) %>% colnames()
df2 <- df1 %>% mutate(mean_sel = rowMeans(select(., ColumnNames), na.rm = T))
df3 <- df2 %>% mutate_at(ColumnNames, function(x) ifelse(!is.na(x), deparse(substitute(x)), NA))
df3 <- df3 %>% unite('Descriptor', ColumnNames, na.rm = T, sep = "_")
df3 <- df3 %>% mutate_at("Descriptor", str_replace_all, "zzzz", "")
But it seems like it wouldn't scale well for facts with many dimensions due to the pivot_wider, and in general it doesn't seem like a very efficient approach.
Is there a better way to do this?
You can unite the three dimension columns into one, then for each ID paste the united values together and take the average of the Fact values.
library(dplyr)
library(tidyr)
df %>%
unite(Descriptor, Overall_Category:Members, sep = '-', na.rm = TRUE) %>%
group_by(ID) %>%
summarise(Descriptor = paste0(Descriptor, collapse = '_'),
mean_sel = mean(Fact, na.rm = TRUE))
# ID Descriptor mean_sel
# <dbl> <chr> <dbl>
#1 1 Purchaser-Country-America_Purchaser-Gender-Male_Purchas… 233
#2 2 Car-Color-Red_Car-Financed-Yes_Car-Type-Sedan_Car-Trans… 50
#3 3 Car-Color-Blue_Car-Financed-No_Car-Type-Van_Car-Transmi… 15
I think you want simple paste with sep and collapse arguments
library(dplyr, warn.conflicts = F)
df %>% group_by(ID, Fact) %>%
summarise(Descriptor = paste(paste(Overall_Category, Descriptor, Members, sep = '-'), collapse = '_'), .groups = 'drop')
# A tibble: 3 x 3
ID Fact Descriptor
<dbl> <dbl> <chr>
1 1 233 Purchaser-Country-America_Purchaser-Gender-Male_Purchaser-Eyes-Brown
2 2 50 Car-Color-Red_Car-Financed-Yes_Car-Type-Sedan_Car-Transmission-Manual
3 3 15 Car-Color-Blue_Car-Financed-No_Car-Type-Van_Car-Transmission-Automatic
An option with str_c
library(dplyr)
library(stringr)
df %>%
group_by(ID, Fact) %>%
summarise(Descriptor = str_c(Overall_Category, Descriptor, Members, sep= "-", collapse="_"), .groups = 'drop')
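The question mentions that a fact does not always have a Descriptor. The sketch below uses a small made-up data frame (df_na, not from the original post) to compare how the paste and unite approaches treat missing dimensions: paste turns NA into the literal text "NA", while unite with na.rm = TRUE drops it.

```r
library(dplyr)
library(tidyr)

# Hypothetical example: the fact with ID 2 has no Descriptor and no Members
df_na <- data.frame(
  ID = c(1, 1, 2),
  Fact = c(233, 233, 50),
  Overall_Category = c("Purchaser", "Purchaser", "Car"),
  Descriptor = c("Country", "Gender", NA),
  Members = c("America", "Male", NA),
  stringsAsFactors = FALSE
)

# paste() converts NA to the string "NA"
with_paste <- df_na %>%
  group_by(ID, Fact) %>%
  summarise(Descriptor = paste(paste(Overall_Category, Descriptor, Members, sep = "-"),
                               collapse = "_"),
            .groups = "drop")

# unite(na.rm = TRUE) skips missing values before joining
with_unite <- df_na %>%
  unite(Descriptor, Overall_Category:Members, sep = "-", na.rm = TRUE) %>%
  group_by(ID, Fact) %>%
  summarise(Descriptor = paste(Descriptor, collapse = "_"), .groups = "drop")

with_paste$Descriptor
with_unite$Descriptor
```

If facts without a Descriptor really occur in the data, the unite variant is probably the safer starting point.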
This is my dataset:
df <- data.frame(PatientID = c("3454","3454","3454","345","345","345"), date = c("05/01/2001", "02/06/1997", "29/03/2004", "05/2/2021", "01/06/1960", "29/03/2003"),
infarct1 = c(TRUE, NA, TRUE, NA, NA, TRUE), infarct2 = c(TRUE, TRUE, TRUE, TRUE, NA, TRUE), stringsAsFactors = FALSE)
Basically I need to keep just one row per PatientID (i.e., eliminate duplicated PatientIDs), based on the most recent infarct (the last date where any kind of infarct is TRUE).
So the outcome I want would look like:
df <- data.frame(PatientID = c("3454","345"), date = c("29/03/2004", "05/2/2021"),
infarct = c(TRUE,TRUE), stringsAsFactors = F)
Hope this makes sense.
Thanks
Try this:
library(dplyr)
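df <- df %>%
# parse dd/mm/yyyy so that max() compares dates rather than strings
mutate(date = as.Date(date, format = "%d/%m/%Y"),
infarct = infarct1 | infarct2) %>%
filter(infarct == TRUE) %>%
group_by(PatientID, infarct) %>%
summarise(date=max(date))
Parse the date and create the infarct variable.
Filter TRUE infarct.
Group.
Look for last time.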
You can convert date to the Date class, arrange the data by PatientID and date, and take the last date where infarct is TRUE.
library(dplyr)
df %>%
mutate(date = lubridate::dmy(date)) %>%
arrange(PatientID, date) %>%
group_by(PatientID) %>%
summarise(date = date[max(which(infarct))],
infarct = TRUE)
# PatientID date infarct
# <chr> <date> <lgl>
#1 345 2003-03-29 TRUE
#2 3454 2004-03-29 TRUE
For multiple infarct columns, get the data in long format first.
df %>%
mutate(date = lubridate::dmy(date)) %>%
tidyr::pivot_longer(cols = starts_with('infarct')) %>%
arrange(PatientID, date) %>%
group_by(PatientID) %>%
slice(max(which(value))) %>%
ungroup
# PatientID date name value
# <chr> <date> <chr> <lgl>
#1 345 2021-02-05 infarct2 TRUE
#2 3454 2004-03-29 infarct2 TRUE
data
I think you need quotes around data in date column.
df <- data.frame(PatientID = c("3454","3454","3454","345","345","345"),
date = c("05/01/2001", "02/06/1997", "29/03/2004", "05/2/2021", "01/06/1960", "29/03/2003"),
infarct = c(TRUE, NA, TRUE, NA, NA, TRUE), stringsAsFactors = FALSE)
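For the two-column data from the question (infarct1 and infarct2), the same idea can be sketched without reshaping. This is an illustration under a few assumptions: NA is treated as "no infarct", dates are parsed with base as.Date instead of lubridate, and the name result is made up here.

```r
library(dplyr)

df <- data.frame(
  PatientID = c("3454", "3454", "3454", "345", "345", "345"),
  date = c("05/01/2001", "02/06/1997", "29/03/2004",
           "05/2/2021", "01/06/1960", "29/03/2003"),
  infarct1 = c(TRUE, NA, TRUE, NA, NA, TRUE),
  infarct2 = c(TRUE, TRUE, TRUE, TRUE, NA, TRUE),
  stringsAsFactors = FALSE
)

result <- df %>%
  mutate(
    # parse day/month/year strings into Date values
    date = as.Date(date, format = "%d/%m/%Y"),
    # TRUE if either column is TRUE; NA counts as FALSE (assumption)
    infarct = coalesce(infarct1, FALSE) | coalesce(infarct2, FALSE)
  ) %>%
  filter(infarct) %>%
  group_by(PatientID) %>%
  # keep the most recent infarct row per patient
  slice_max(date, n = 1, with_ties = FALSE) %>%
  ungroup() %>%
  select(PatientID, date, infarct)

result
```

slice_max keeps a whole row per patient, so further columns could be carried along by adding them to the final select.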
library(tidyverse)
df <- tibble(Date = as.Date(c("2020-01-01", "2020-01-02")),
Shop = c("Store A", "Store B"),
Employees = c(5, 10),
Sales = c(1000, 3000))
#> # A tibble: 2 x 4
#> Date Shop Employees Sales
#> <date> <chr> <dbl> <dbl>
#> 1 2020-01-01 Store A 5 1000
#> 2 2020-01-02 Store B 10 3000
I'm switching from tidyr's spread/gather to pivot_* following the tidyr reference guide. I want to gather the "Employees" and "Sales" columns in the following manner:
df %>% pivot_longer(-Date, -Shop, names_to = "Names", values_to = "Values")
#> Error in build_longer_spec(data, !!cols, names_to = names_to,
#> values_to = values_to, : object 'Shop' not found
But I'm getting this error. It seems as though I'm doing everything right. Except that I'm apparently not. Do you know what went wrong?
The cols argument is all of the columns you want to pivot. You can think of it as the complement of the id.vars argument from reshape2::melt
df %>% pivot_longer(-c(Date, Shop), names_to = "Names", values_to = "Values")
same as:
reshape2::melt(df, id.vars=c("Date", "Shop"), variable.name="Names", value.name="Values")
I think a clearer syntax would be
df %>%
pivot_longer(cols = c(Employees, Sales))
as opposed to listing the columns you want to exclude.
names_to = "Names", values_to = "Values"
just renames the new columns, which are called name and value by default.
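To make the point about cols concrete: only one selection is taken as cols, so several exclusions have to be wrapped in a single c(). The sketch below checks that excluding the id columns and naming the value columns give the same result; the names by_exclusion and by_inclusion are illustrative only.

```r
library(tidyr)
library(tibble)

df <- tibble(Date = as.Date(c("2020-01-01", "2020-01-02")),
             Shop = c("Store A", "Store B"),
             Employees = c(5, 10),
             Sales = c(1000, 3000))

# Exclude the id columns: everything else is pivoted
by_exclusion <- pivot_longer(df, cols = -c(Date, Shop),
                             names_to = "Names", values_to = "Values")

# Name the columns to pivot explicitly
by_inclusion <- pivot_longer(df, cols = c(Employees, Sales),
                             names_to = "Names", values_to = "Values")

identical(by_exclusion, by_inclusion)
```

Which form reads better likely depends on whether the id columns or the value columns are the shorter list.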
I have the following data frame, obtained after applying this dplyr code:
Final_df<- df %>%
group_by(ClientID,month) %>%
summarise(test=toString(Sector)) %>%
as.data.frame()
which gives me the following output:
ClientID month test
ASD Sep Auto,Auto,Finance
DFG Oct Finance,Auto,Oil
What I want is to count the sectors as well:
ClientID month test
ASD Sep Auto:2,Finance:1
DFG Oct Finance:1,Auto:1,Oil:1
How can I achieve it with dplyr?
Here's a similar but slightly different solution to the one by @akrun:
count(df, ClientID, month, Sector) %>%
summarise(test = toString(paste(Sector, n, sep=":")))
#Source: local data frame [4 x 3]
#Groups: ClientID [?]
#
# ClientID month test
# <chr> <chr> <chr>
#1 ASD. Oct Finance:2
#2 ASD. Sep Auto:2, Finance:1
#3 DFG. Oct Oil:2
#4 DFG. Sep Auto:1, Finance:2
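In this case, count does the same as group_by + tally. Note that this relies on older dplyr behaviour, in which count removed only the outermost grouping variable (Sector) and left the result grouped by ClientID and month. In current dplyr versions, count on an ungrouped data frame returns an ungrouped result, so a group_by(ClientID, month) is needed before summarise.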
We can try
df %>%
group_by(ClientID, month, Sector) %>%
tally() %>%
group_by(ClientID, month) %>%
summarise(test = toString(paste(Sector, n, sep=":")))
Or using data.table
library(data.table)
setDT(df)[, .N, .(ClientID, month, Sector)
][, .(test = toString(paste(Sector, N, sep=":"))) , .(ClientID, month)]
If we need a base R option
aggregate(newCol~ClientID + month, transform(aggregate(n~.,
transform(df, n = 1), sum), newCol = paste(Sector, n, sep=":")), toString)
data
df <- data.frame(ClientID = rep(c("ASD.", "DFG."), each = 5),
month = rep(c("Sep", "Oct" ) , c(3,2)),
Sector = c("Auto", "Auto", "Finance", "Finance", "Finance",
"Auto", "Finance", "Finance", "Oil", "Oil"),
stringsAsFactors=FALSE)
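With current dplyr versions, count() called on an ungrouped data frame returns an ungrouped result, so an explicit group_by() is needed before summarise(). A minimal sketch on the sample data follows. It uses paste with collapse = "," instead of toString, on the assumption that the comma-without-space format shown in the question is wanted; the name result is illustrative.

```r
library(dplyr)

df <- data.frame(ClientID = rep(c("ASD.", "DFG."), each = 5),
                 month = rep(c("Sep", "Oct"), c(3, 2)),
                 Sector = c("Auto", "Auto", "Finance", "Finance", "Finance",
                            "Auto", "Finance", "Finance", "Oil", "Oil"),
                 stringsAsFactors = FALSE)

result <- df %>%
  # one row per ClientID / month / Sector with its count in column n
  count(ClientID, month, Sector) %>%
  # regroup explicitly, since count() does not keep a grouping here
  group_by(ClientID, month) %>%
  summarise(test = paste(Sector, n, sep = ":", collapse = ","),
            .groups = "drop")

result
```

Sectors appear in alphabetical order within each group because count() sorts by the counted columns; preserving the original order of appearance would need an extra step.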