I would like to solve some problems with column names that cause errors when I run my code. Here is a simple example. My data has a column called TimeofCalculate, but the code below refers to Timeofcalculate, which gives an error because the code uses a lower-case "c" (calculate) instead of an upper-case one (Calculate). I would like either spelling to work. I also have a database in which the column is called Timeofcalculâte; the â is common where I live. I would like to handle all of these cases.
library(dplyr)
Test <- structure(list(date1 = as.Date(c("2021-11-01","2021-11-01","2021-11-01","2021-11-01")),
date2 = as.Date(c("2021-10-22","2021-10-22","2021-10-28","2021-10-30")),
Week = c("Friday", "Friday", "Thursday", "thursday"),
Category = c("FDE", "FDE", "FDE", "FDE"),
TimeofCalculate = c(4, 6, 6, 3)), class = "data.frame",row.names = c(NA, -4L))
Test %>%
group_by(Week = tools::toTitleCase(Week)) %>%
summarise(Time=mean(Timeofcalculate), .groups = 'drop')
Weekdays spelled in different ways should not occur in a database, so fix this first. We can use the built-in tools::toTitleCase to capitalize the first letters.
Test <- transform(Test, Week=tools::toTitleCase(Week))
Then, we may easily aggregate by column numbers, so no names are needed.
aggregate(list(Time=Test[, 5]), list(Week=Test[, 3]), mean)
# Week Time
# 1 Friday 5.0
# 2 Thursday 4.5
If hard-coding column indices by hand is a problem, we may use agrep, which uses approximate (string-distance) matching to return the indices of the column names that are close to a given pattern.
c_tcalc <- agrep('timeofcalculate', names(Test))
c_week <- agrep('week', names(Test))
aggregate(list(Time=Test[, c_tcalc]), list(Week=Test[, c_week]), mean)
# Week Time
# 1 Friday 5.0
# 2 Thursday 4.5
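One caveat: agrep() returns the indices of all approximate matches, not only the best one, and it is case-sensitive unless ignore.case = TRUE. Here is a minimal sketch, assuming a hypothetical helper find_col() (my own name, not a base R function) that insists on exactly one match. It is run against the three spellings from the question, with the accented letter written as "\u00e2".

```r
# find_col() is a hypothetical helper, not part of base R.
# It returns the index of the single column whose name approximately
# matches `pattern`, and stops if zero or several columns match.
find_col <- function(df, pattern) {
  idx <- agrep(pattern, names(df), ignore.case = TRUE)
  if (length(idx) != 1L) {
    stop("expected exactly one match for '", pattern, "', found ", length(idx))
  }
  idx
}

Test <- data.frame(
  Week = c("Friday", "Friday", "Thursday", "Thursday"),
  TimeofCalculate = c(4, 6, 6, 3)
)

# The three spellings from the question ("\u00e2" is the accented a).
variants <- c("TimeofCalculate", "Timeofcalculate", "Timeofcalcul\u00e2te")

# Rename the time column to each spelling in turn and aggregate via find_col().
results <- lapply(variants, function(v) {
  d <- Test
  names(d)[2] <- v
  aggregate(list(Time = d[[find_col(d, "timeofcalculate")]]),
            list(Week = d[[find_col(d, "week")]]), mean)
})
```

Stopping on an ambiguous match is a deliberate choice here: with fuzzy matching, silently picking the wrong column is worse than an error.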
Data:
Test <- structure(list(date1 = structure(c(18932, 18932, 18932, 18932
), class = "Date"), date2 = structure(c(18922, 18922, 18928,
18930), class = "Date"), Week = c("Friday", "Friday", "Thursday",
"Thursday"), Category = c("FDE", "FDE", "FDE", "FDE"), TimeofCalculate = c(4,
6, 6, 3)), class = "data.frame", row.names = c(NA, -4L))
Perhaps we can take advantage of tidyselect::matches.
library(dplyr)
nms <- c('TimeofCalculate|Timeofcalculate|Timeofcalculâte')
#alternative one
Test %>%
group_by(Week = tools::toTitleCase(Week)) %>%
summarise(across(matches(nms), mean), .groups = 'drop')
#> # A tibble: 2 × 2
#> Week TimeofCalculate
#> <chr> <dbl>
#> 1 Friday 5
#> 2 Thursday 4.5
#using a purrr style lambda
Test %>%
group_by(Week = tools::toTitleCase(Week)) %>%
summarise(across(matches(nms), ~mean(., na.rm = TRUE)), .groups = 'drop')
#> # A tibble: 2 × 2
#> Week TimeofCalculate
#> <chr> <dbl>
#> 1 Friday 5
#> 2 Thursday 4.5
#this will also work
Test %>%
group_by(Week = tools::toTitleCase(Week)) %>%
summarise(across(any_of(c("Timeofcalculate", "TimeofCalculate", "Timeofcalculâte")), ~ mean(., na.rm = TRUE)), .groups = "drop")
Created on 2021-12-26 by the reprex package (v2.0.1)
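As a side note, matches() ignores case by default (ignore.case = TRUE), so a single pattern can cover the upper- and lower-case spellings. The sketch below goes one step further and renames whatever variant is present to one canonical name before summarising. The pattern and the choice of TimeofCalculate as the canonical name are my assumptions, and the approach assumes exactly one column matches.

```r
library(dplyr)

Test <- data.frame(
  Week = c("Friday", "Friday", "Thursday", "thursday"),
  Timeofcalculate = c(4, 6, 6, 3)   # lower-case "c" on purpose
)

# matches() is case-insensitive by default; the "." wildcard accepts
# either a plain or an accented letter in that position.
pattern <- "^timeofcalcul.te$"

out <- Test %>%
  rename_with(~ "TimeofCalculate", .cols = matches(pattern)) %>%
  group_by(Week = tools::toTitleCase(Week)) %>%
  summarise(Time = mean(TimeofCalculate), .groups = "drop")
```

After the rename, the rest of the pipeline can refer to TimeofCalculate directly, whichever spelling the input used.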
I have data like this:
df<-structure(list(record_id = c(1, 2, 4), alcohol = c(1, 2, 1),
ethnicity = c(1, 1, 1), bilateral_vs_unilateral = c(1, 2,
2), fat_grafting = c(1, 1, 0), number_of_adm_sheets_used = c(1,
NA, NA), number_of_adm_sheets_used_2 = c(1, 1, 1), number_of_fills = c(7,
NA, NA), number_of_fills_2 = c(7, NA, 2), total_fill_volume_ml_left = c(240,
NA, NA), total_volume_ml = c(240, 300, 550), implant_size_l = c(NA_real_,
NA_real_, NA_real_), implant_size_l_2 = c(NA_real_, NA_real_,
NA_real_)), row.names = c(NA, -3L), class = c("tbl_df", "tbl",
"data.frame"))
It is info about patients with each row representing a patient that underwent breast surgery.
I'd like to reshape it so that each row represents one breast (of the two). Several variables, everything from 'number_of_adm_sheets_used' to 'implant_size_l_2', have one column for each side, and I'd like each pair to become a single column that applies to either side. For example, 'number_of_adm_sheets_used' refers to the left side and 'number_of_adm_sheets_used_2' to the right side; I'd like to combine them into one column of sheets used, for whichever side the row represents.
My expected output would look like this (pre- and post-reshaping tables; images not available).
I figure it's some variant of pivot_longer, but I'm having trouble with a few aspects:
the real data has 68 columns
I only need a duplicate row if the column "bilateral_vs_unilateral" is a "1" (meaning bilateral)
The way I've used pivot_longer before, you'd say "cols" and pick a big range; I'm not sure how to stack pairs of columns, if that makes sense.
Luckily, despite the 68 other columns, all of the "trouble" columns are shown above. The pairs are:
'number_of_adm_sheets_used' with 'number_of_adm_sheets_used_2'
'number_of_fills' with 'number_of_fills_2'
'total_fill_volume_ml_left' with 'total_volume_ml'
and 'implant_size_l' with 'implant_size_l_2'
Thank you
Here is one possibility, if I'm understanding the issue correctly.
library(tidyverse)  # provides dplyr, tidyr and stringr, all used below
# Make long format
df.long <- df %>%
pivot_longer(cols = -record_id) %>%
mutate(subject = ifelse(str_sub(name, -2, -1) == "_2", "breast 2", NA),
name = str_remove(name, "_2$")) %>%
group_by(record_id, name) %>%
mutate(subject = case_when(
subject == "breast 2" ~ subject,
n() == 2 ~ "breast 1",
n() == 1 ~ "patient"
)) %>%
ungroup()
# statistics regarding the patient
patient <- df.long %>%
filter(subject == "patient") %>%
pivot_wider(names_from = name, values_from = value) %>%
select(-subject)
# statistics regarding each breast
breasts <- df.long %>%
filter(str_detect(subject, "breast")) %>%
pivot_wider(names_from = name, values_from = value)
# merge the two data.frames
patient %>%
inner_join(breasts, by = "record_id") %>%
select(record_id, subject, everything())
If you rename your "trouble columns" to a consistent pattern, then you can use pivot_longer()'s names_pattern argument and ".value" sentinel to pull pairs of values into rows. In my example code, I suffixed these with "_l" or "_r" for left- and right-sided variants. We can use the values_drop_na argument to keep only the valid rows for unilateral cases.
I also changed alcohol to a factor, just to demonstrate that it doesn't throw the error you noted in the bounty.
library(tidyverse)
df_long <- df %>%
mutate(alcohol = factor(alcohol)) %>%
rename(
number_of_adm_sheets_used_l = number_of_adm_sheets_used,
number_of_adm_sheets_used_r = number_of_adm_sheets_used_2,
number_of_fills_l = number_of_fills,
number_of_fills_r = number_of_fills_2,
total_fill_volume_ml_l = total_fill_volume_ml_left,
total_fill_volume_ml_r = total_volume_ml,
implant_size_l = implant_size_l,
implant_size_r = implant_size_l_2
) %>%
pivot_longer(
cols = ends_with(c("_l", "_r")),
names_to = c(".value", "side"),
names_pattern = "(.+)_(l|r)",
values_drop_na = TRUE
)
Output:
### move pivoted columns up front for illustration purposes
df_long %>%
relocate(record_id, side, number_of_adm_sheets_used:implant_size)
# A tibble: 4 x 10
record_id side number_of_adm_sheets_used number_of_fills total_fill_volume~
<dbl> <chr> <dbl> <dbl> <dbl>
1 1 l 1 7 240
2 1 r 1 7 240
3 2 r 1 NA 300
4 4 r 1 2 550
# ... with 5 more variables: implant_size <dbl>, alcohol <fct>,
# ethnicity <dbl>, bilateral_vs_unilateral <dbl>, fat_grafting <dbl>
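To make the ".value" mechanism easier to see, here is a stripped-down sketch on toy data (hypothetical columns, not the question's). The part of each column name captured by the first group of names_pattern becomes an output column, and the second group becomes the side.

```r
library(tidyr)

# Toy data (hypothetical): one row per patient, with a left ("_l") and
# right ("_r") column for each measurement.
wide <- data.frame(
  record_id = c(1, 2),
  fills_l   = c(7, NA),
  fills_r   = c(7, 2),
  volume_l  = c(240, NA),
  volume_r  = c(240, 300)
)

long <- pivot_longer(
  wide,
  cols = -record_id,
  names_to = c(".value", "side"),  # first capture group names the output column
  names_pattern = "(.+)_(l|r)$",   # second capture group is the side
  values_drop_na = TRUE            # drop rows in which every pivoted value is NA
)
```

Patient 2 has no left-side values at all, so only the right-side row survives for that patient, which is the behaviour wanted for unilateral cases.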
Note that the values of my time column in the output table are rounded to whole numbers, but I would like to keep two decimal places. How can I adjust the code below?
library(dplyr)
Test <- structure(list(date1 = as.Date(c("2021-11-01","2021-11-01","2021-11-01","2021-11-01")),
date2 = as.Date(c("2021-10-22","2021-10-22","2021-10-28","2021-10-30")),
Week = c("Friday", "Friday", "Thursday", "thursday"),
Category = c("FDE", "FDE", "FDE", "FDE"),
time = c(4, 6, 6, 3)), class = "data.frame",row.names = c(NA, -4L))
Test<-Test %>%
group_by(Week = tools::toTitleCase(Week), Category) %>%
summarise(time = mean(time, na.rm = TRUE), .groups = 'drop')
Test <- transform(Test, time = round(time))
> Test
Week Category time
1 Friday FDE 5
2 Thursday FDE 4
An alternative to Onyambu's suggestion is:
Test <- transform(Test, time = format(round(time, digits = 2), nsmall = 2))
The nsmall argument of format sets the minimum number of digits to the right of the decimal point. Note that format returns character strings, so the column is no longer numeric afterwards.
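A small sketch of the difference between rounding and formatting, using the two means from the question. The sprintf() variant is an addition of mine, not part of the suggestion above.

```r
time <- c(5, 4.5)

# round() keeps the column numeric, so trailing zeros are not stored.
rounded <- round(time, digits = 2)

# format() with nsmall = 2 pads to two decimals and returns character.
formatted <- format(rounded, nsmall = 2)

# sprintf() is another way to get fixed two-decimal strings.
via_sprintf <- sprintf("%.2f", time)
```

If the column is still needed for arithmetic, it may be better to keep it numeric with round(time, 2) and apply the formatting only when printing or exporting.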
I am trying to clean and add a new column to my data named Volume using mutate().
This is the data that I have read into R:
> df1 <- file.choose()
> data1 <- read_excel(df1)
> head(data1)
# A tibble: 5 x 3
`product id` amount `total sales`
<chr> <dbl> <dbl>
1 X180 20 200
2 X109 30 300
3 X918 20 200
4 X273 15 150
5 X988 12 120
Next, I subset and renamed the columns product id and total sales to Product Code and Net Sales respectively, and applied mutate() with my own function on Net Sales and created a new Volume column.
> data2 <- data1 %>%
+ select(`Product Code` = `product id`, `Net Sales` = `total sales`) %>%
+ replace_na(list(`Net Sales` = 0))%>%
+ arrange(desc(`Net Sales`))%>%
+ mutate(Volume = rank_volume(data1, `Net Sales`))
This is the error message I get:
Error: Problem with `mutate()` column `Volume`.
ℹ `Volume = rank_volume(data1, `Net Sales`)`.
x arrange() failed at implicit mutate() step.
* Problem with `mutate()` column `..1`.
ℹ `..1 = Net Sales`.
x object 'Net Sales' not found
And here is the function rank_volume I created:
### a function to label the products that are top one third in total sales as "H", products with the lowest third in sales as "L", and the rest as "M"
rank_volume <- function(data, column) {
column <- ensym(column)
colstr <- as_string(column)
data <- arrange(data, desc(!!column))
size <- length(data[[colstr]])
first_third <- data[[colstr]][round(size / 3)]
last_third <- data[[colstr]][round(size - (size / 3))]
case_when(data[[colstr]] > first_third ~ "H",
data[[colstr]] < last_third ~ "L",
TRUE ~ "M")
}
When I run my function separately with a simple data frame, it works perfectly. However, when I run it with mutate() the error appeared. I couldn't find the problem. Can anyone help?
EDIT: dput(head(data))
> dput(head(data1))
structure(list(`product id` = c("X180", "X109", "X918", "X273",
"X988"), amount = c(20, 30, 20, 15, 12), `total sales` = c(200,
300, 200, 150, 120)), row.names = c(NA, -5L), class = c("tbl_df",
"tbl", "data.frame"))
data1 does not have a Net Sales column; that column only exists in the transformed data inside the pipe. You can use . to refer to the current data frame in the pipe.
library(dplyr)
library(tidyr)  # for replace_na()
data1 %>%
select(`Product Code` = `product id`, `Net Sales` = `total sales`) %>%
replace_na(list(`Net Sales` = 0))%>%
arrange(desc(`Net Sales`)) %>%
mutate(Volume = rank_volume(., `Net Sales`))
# `Product Code` `Net Sales` Volume
# <chr> <dbl> <chr>
#1 X109 300 H
#2 X180 200 M
#3 X918 200 M
#4 X273 150 L
#5 X988 120 L
Alternatively, you can use cur_data() (deprecated since dplyr 1.1.0 in favor of pick()):
data1 %>%
select(`Product Code` = `product id`, `Net Sales` = `total sales`) %>%
replace_na(list(`Net Sales` = 0))%>%
arrange(desc(`Net Sales`)) %>%
mutate(Volume = rank_volume(cur_data(), `Net Sales`))
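Another way around the problem is to pass the column itself rather than a data frame plus a column name, so nothing has to be looked up by name inside mutate(). Below is a minimal sketch, assuming a hypothetical helper rank_volume_vec() that applies the same cut-offs as rank_volume() to a numeric vector. It assumes the vector has no missing values and at least two elements, and I left out the replace_na() step for brevity.

```r
library(dplyr)

# Hypothetical vector-based variant of rank_volume(): same cut-offs,
# but it takes the values directly and does not need sorted input.
rank_volume_vec <- function(x) {
  sorted <- sort(x, decreasing = TRUE)
  size <- length(sorted)
  first_third <- sorted[round(size / 3)]
  last_third  <- sorted[round(size - size / 3)]
  case_when(x > first_third ~ "H",
            x < last_third  ~ "L",
            TRUE            ~ "M")
}

data1 <- tibble(
  `product id`  = c("X180", "X109", "X918", "X273", "X988"),
  amount        = c(20, 30, 20, 15, 12),
  `total sales` = c(200, 300, 200, 150, 120)
)

data2 <- data1 %>%
  select(`Product Code` = `product id`, `Net Sales` = `total sales`) %>%
  arrange(desc(`Net Sales`)) %>%
  mutate(Volume = rank_volume_vec(`Net Sales`))
```

Because the labels are computed from the values in their current order, the result lines up with the rows whether or not the data has been sorted first.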
You could add the new column after you've done the initial clean up.
data2 <- data1 %>%
select("Product Code" = "product id", "Net Sales" = "total sales") %>%
replace_na(list("Net Sales" = 0))%>%
arrange(desc(`Net Sales`))  # backticks, not quotes: desc("Net Sales") would sort by a constant string
data2 <- data2 %>%
mutate(Volume = rank_volume(data2, "Net Sales"))
Hi, I have time series data with daily dates (variable 1), and for each date a time variable numbered from 1 to 60. On each day there is a number X of events. Is there a way to create a new dataset in which my value is summed over 2-day aggregates, so that I have 30 rows (time variables) instead of 60?
Update: Here is a reproducible example of what I want
set.seed(101)
df <- data.frame(
dte = c(as.Date("2021-01-01"),
as.Date("2021-01-02") ,
as.Date("2021-01-03"),
as.Date("2021-01-04") ,
as.Date("2021-02-01") ,
as.Date("2021-02-02") ,
as.Date("2021-02-03") ,
as.Date("2021-02-04")
),
tme = rep(c(1, 2, 3, 4)),
val1 = sample(1:8),
work_type = c("Construction Worker", "Construction Worker", "Construction Worker",
"Construction Worker", "Sales", "Sales", "Sales", "Sales"),
Work_Site = "A"
)
print(df)
df_2day <- data.frame(
tme = rep(c(1, 2)),
val1 = c(9,13,5,9),
work_type = c("Construction Worker", "Construction Worker",
"Sales", "Sales"),
Work_Site = "A"
)
print(df_2day)
I also have facilities B, C, and D.
You can create groups of 2 days and sum the val1 values.
library(lubridate)
library(dplyr)
df %>%
group_by(Work_Site, work_type, grp = ceiling_date(dte, '2 days')) %>%
summarise(val1 = sum(val1))
# Work_Site work_type grp val1
# <chr> <chr> <date> <int>
#1 A Construction Worker 2021-01-03 9
#2 A Construction Worker 2021-01-05 15
#3 A Sales 2021-02-03 5
#4 A Sales 2021-02-05 7
You can identify the groupings by dividing the row number within each day by two and rounding up to the nearest whole number. So the 3rd reading would be 3/2 = 1.5, rounded up to group 2. The 10th would be 10/2 = group 5.
Below is an implementation using dplyr, but you could use something else...
library(dplyr)
df <- data.frame(
dte = c(as.Date("2021-01-01"),
as.Date("2021-01-01") ,
as.Date("2021-01-01"),
as.Date("2021-01-01") ,
as.Date("2021-02-01") ,
as.Date("2021-02-01") ,
as.Date("2021-02-01") ,
as.Date("2021-02-01")
),
tme = rep(c(1, 2, 3, 4)),
val1 = sample(1:8),
val2 = sample(1:8)
)
print(df)
result <- df %>%
group_by(dte) %>%
mutate(dategroup=ceiling(rank(tme) / 2)) %>%
group_by(dte, dategroup) %>%
summarise_all(sum)
print(result)
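The same ceiling(tme / 2) trick can be applied to the layout from the question, grouping by site and work type instead of by date. This is a sketch with made-up val1 values (the question used sample(), so the numbers here are my assumption), chosen so that the sums reproduce the asker's expected df_2day.

```r
library(dplyr)

# Hypothetical values; only the structure follows the question.
df <- data.frame(
  tme = rep(1:4, times = 2),
  val1 = c(4, 5, 6, 7, 2, 3, 4, 5),
  work_type = rep(c("Construction Worker", "Sales"), each = 4),
  Work_Site = "A"
)

# ceiling(tme / 2) maps 1,2 -> 1 and 3,4 -> 2, i.e. consecutive pairs.
df_2day <- df %>%
  group_by(Work_Site, work_type, tme = ceiling(tme / 2)) %>%
  summarise(val1 = sum(val1), .groups = "drop")
```

Since Work_Site is part of the grouping, additional facilities such as B, C and D would be aggregated separately without any change to the code.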
I have a reproducible example. I have duplicate ids; some are suspected and some are not.
structure(list(id = c(1, 1, 1, 2, 2, 3, 3, 4, 4, 4), test = c("susp",
"susp", "neg", "pos", "pos", "neg", "pos", "susp", "susp", "neg"
)), row.names = c(NA, -10L), class = c("tbl_df", "tbl", "data.frame"
))
I am interested in getting these counts:
A total count of suspected patients
Of those patients that are suspected, the ones that underwent multiple tests, regardless of the outcome
A total count of those with two and with three suspected results
Caveat: if this could be done with the tidyverse, that would be amazing.
A sample of how the table should look is below.
structure(list(id = c(1, 4), number_of_test_for_suspected_pat = c(2,
2)), row.names = c(NA, -2L), class = c("tbl_df", "tbl", "data.frame"
))
And an additional tibble with a total of suspected patients with subsequent tests.
We can filter out the ids that don't have any 'susp' (suspected) cases and then take the sum of a logical vector.
library(dplyr)
df1 %>%
group_by(id) %>%
filter('susp' %in% test) %>%
summarise(number_of_test_for_suspected_pat = sum(test == 'susp'),
n_greater_than_3 = number_of_test_for_suspected_pat >=3) %>%
mutate(Total = sum(number_of_test_for_suspected_pat),
n_greater_than_3_count = sum(n_greater_than_3))
# A tibble: 2 x 5
# id number_of_test_for_suspected_pat n_greater_than_3 Total n_greater_than_3_count
# <dbl> <int> <lgl> <int> <int>
#1 1 2 FALSE 4 0
#2 4 2 FALSE 4 0
Or do the filter first
df1 %>%
filter(test == 'susp') %>%
count(id) %>%
mutate(Total = sum(n))
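For the request about how many patients have two or three suspected results, one possible sketch is to count per patient first and then count the counts. The column names n_susp, n_tests and n_patients are my own, and I am assuming "suspected patient" means an id with at least one 'susp' row.

```r
library(dplyr)

df1 <- tibble(
  id = c(1, 1, 1, 2, 2, 3, 3, 4, 4, 4),
  test = c("susp", "susp", "neg", "pos", "pos", "neg", "pos", "susp", "susp", "neg")
)

per_patient <- df1 %>%
  group_by(id) %>%
  filter(any(test == "susp")) %>%           # keep ids with at least one "susp"
  summarise(n_susp = sum(test == "susp"),   # TRUE counts as 1 when summed
            n_tests = n(),                  # all tests, regardless of outcome
            .groups = "drop")

# Number of patients for each distinct count of suspected results.
by_count <- count(per_patient, n_susp, name = "n_patients")
```

With the sample data, both suspected patients (ids 1 and 4) have two suspected results out of three tests, so by_count has a single row.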