I created a mapping table in R and have provided an example of what it looks like:
ex <- data.frame(id = rep(1234, 7),
                 claim = c(1234, 1367, 1234, 1869, 1234, 1367, 1234),
                 code1 = c(24, 61, 28, 21, 20, 29, 80),
                 date = c('2019-03-18', '2019-04-12', '2019-03-18', '2019-03-18',
                          '2019-03-18', '2019-04-12', '2019-03-18'),
                 code2 = c(24, 29, 24, 24, 24, 29, 24),
                 dx1 = c("M234", "M123", NA, "M434", NA, NA, NA),
                 dx2 = c(NA, NA, NA, NA, "M789", "Z123", "M999"),
                 dx3 = c(NA, NA, "M689", NA, NA, NA, NA),
                 pay = c(1000, 520, 1000, 780, 1000, 520, 1000))
Is there a way to get this as my final output:
ex2 <- data.frame(id = rep(1234, 3),
                  date = c('2019-03-18', '2019-03-18', '2019-04-12'),
                  code2 = c(24, 24, 29),
                  dx1 = c("M234", "M434", "M123"),
                  dx2 = c("M789", NA, "Z123"),
                  dx3 = c("M689", NA, NA),
                  dx4 = c("M999", NA, NA),
                  pay = c(1000, 780, 520))
Basically, I would like any values in dx2 or dx3 in the first example to be added onto the row that corresponds to that code2 value. However, if there are multiple dx1 values for the same code2, I would like to keep them as separate rows.
Is there any way I could do something like this in R?
Thanks in advance!
edit: In my mapping table (ex) there are only the columns dx1, dx2, and dx3. I would like any additional values in dx2 or dx3 to be added on as new columns (which is why ex2 now has a dx4 column). These changes are grouped by code2: the number of values in dx2 or dx3 for, say, code 24 determines how many new dx columns are created. The row order can then be determined by the max of the pay column.
Is this what you need?
library(tidyverse)

ex %>%
  pivot_longer(cols = c("dx1", "dx2", "dx3"), names_to = "code3",
               values_to = "val", values_drop_na = TRUE) %>%
  arrange(claim, code2, code3) %>%
  group_by(id, claim, date, code2, code3) %>%
  mutate(dummy = n(), dummy2 = row_number(),
         code3 = ifelse(dummy > 1 & dummy2 > 1, "dx4", code3)) %>%  # 2nd duplicate of a dx name becomes dx4
  arrange(code3) %>%
  pivot_wider(id_cols = c('id', 'claim', 'date', 'code2', 'pay'),
              names_from = 'code3', values_from = 'val', values_fn = min) %>%
  ungroup() %>%
  select(-claim)
# A tibble: 3 x 8
id date code2 pay dx1 dx2 dx3 dx4
<dbl> <chr> <dbl> <dbl> <chr> <chr> <chr> <chr>
1 1234 2019-03-18 24 1000 M234 M789 M689 M999
2 1234 2019-04-12 29 520 M123 Z123 NA NA
3 1234 2019-03-18 24 780 M434 NA NA NA
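If a code2 group can ever carry more than four diagnosis codes, here is a more general sketch of the same idea (not part of the answer above): it numbers the dx columns with row_number() instead of hard-coding "dx4". The values may land in different dx slots than in ex2, but each row ends up with the same set of codes, ordered so the highest pay comes first within each date.
library(tidyverse)

ex %>%
  pivot_longer(starts_with("dx"), values_to = "val", values_drop_na = TRUE) %>%
  group_by(id, claim, date, code2, pay) %>%
  arrange(name, .by_group = TRUE) %>%              # keep dx1 before dx2 before dx3
  mutate(name = paste0("dx", row_number())) %>%    # renumber: dx1, dx2, ... as many as needed
  ungroup() %>%
  pivot_wider(id_cols = c(id, claim, date, code2, pay),
              names_from = name, values_from = val) %>%
  arrange(date, desc(pay)) %>%
  select(-claim)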
I am trying to pair rows for use in a dumbbell plot. I have a df that looks like this:
Year   Species   Tonnes
1960   Cod          123
1961   Cod          456
1970   Cod          124
1971   Cod          457
I want to pair up the results 10 years apart, resulting in this df:
Year   Species   Tonnes   Pair
1960   Cod          123      1
1961   Cod          456      2
1970   Cod          124      1
1971   Cod          457      2
I would very much appreciate help. I wasn't too sure where to begin with the problem.
You could do
df <- structure(list(Year = c(1960L, 1961L, 1970L, 1971L), Species = c("Cod",
"Cod", "Cod", "Cod"), Tonnes = c(123, 150, 256, 450)), row.names = c(NA,
-4L), class = "data.frame")
library(tidyverse)
df %>%
mutate(year = Year %% 10,
decade = 10 * Year %/% 10) %>%
select(-Year) %>%
group_by(Species, year) %>%
summarize(from = Tonnes[which.min(decade)],
to = Tonnes[which.max(decade)],
year = paste(min(year + decade), max(year + decade), sep = '-')) %>%
ggplot(aes(from, year)) +
geom_linerange(aes(xmin = from, xmax = to), alpha = 0.5) +
geom_point(color = 'green4', size = 3) +
geom_point(aes(x = to), color = 'red3', size = 3) +
xlab('Tonnes') +
theme_minimal(base_size = 16)
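If you only need the Pair column itself, exactly as in the expected output, here is a minimal dplyr sketch (separate from the answer above), assuming a pair is defined by the year's offset within its decade; cur_group_id() needs dplyr 1.0 or later:
library(dplyr)

df %>%
  group_by(Species, offset = Year %% 10) %>%   # 1960 and 1970 share offset 0, 1961 and 1971 share offset 1
  mutate(Pair = cur_group_id()) %>%
  ungroup() %>%
  select(Year, Species, Tonnes, Pair)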
Using data.table, a join will get the pairs in wide format:
library(data.table)
dt <- setDT(df)[
, `:=`(Year2 = Year + 10, Pair = rleid(Year, Species))
][
df,
.(Year1 = i.Year, Year2 = x.Year, Species, Tonnes1 = i.Tonnes, Tonnes2 = Tonnes, Pair = i.Pair),
on = .(Year = Year2, Species), nomatch = 0
]
dt
#> Year1 Year2 Species Tonnes1 Tonnes2 Pair
#> 1: 1960 1970 Cod 123 124 1
#> 2: 1961 1971 Cod 456 457 2
which can be melted to long format, if desired:
setcolorder(
melt(dt, c("Species", "Pair"), list(c("Year1", "Year2"), c("Tonnes1", "Tonnes2")), value.name = c("Year", "Tonnes")),
c("Year", "Species", "Tonnes", "Pair")
)[, variable := NULL][]
#> Year Species Tonnes Pair
#> 1: 1960 Cod 123 1
#> 2: 1961 Cod 456 2
#> 3: 1970 Cod 124 1
#> 4: 1971 Cod 457 2
Data:
df <- data.frame(Year = c(1960, 1961, 1970, 1971), Species = "Cod", Tonnes = c(123, 456, 124, 457))
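For completeness, a sketch of how the long result could feed the dumbbell plot itself; `paired` is just a hypothetical name for the melted table above:
library(ggplot2)

# `paired` is assumed to hold the long-format result (Year, Species, Tonnes, Pair)
ggplot(paired, aes(x = Tonnes, y = factor(Pair), group = Pair)) +
  geom_line(alpha = 0.5) +                          # the bar connecting the two paired years
  geom_point(aes(color = factor(Year)), size = 3) +
  labs(y = "Pair", color = "Year") +
  theme_minimal()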
I am trying to disaggregate monthly data and spread it into weekly data in two ways.
First, find the first Monday after the start date, create a sequence of Mondays up to the last month in the data, and spread each month's value across the Mondays that fall within that month.
Second, create a weekly sequence between the start date and the end date and spread each month's value across the weeks that fall within that month.
The data which I am working with is given below:
structure(list(`Row Labels` = c("X6", "X7", "X8", "X9"), `2022-11-01` = c(100,
200, 300, 400), `2022-12-01` = c(160, 200, 300, 400), `2023-01-01` = c(500,
550, 600, 650)), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA,
-4L))
(The data and the two expected outputs were shown as screenshots in the original post: in expected output 1 every date is a Monday; expected output 2 uses a plain weekly sequence starting from the first date.)
Is this doable, or is it a bit too much to expect from R?
For the Mondays, we can create a list of Mondays covering the dates in the data frame, join it with the data in long format, count the Mondays for each variable in each month, divide the values by that count, and then pivot back to wide format:
library(dplyr)
library(tidyr)
library(lubridate)
all_dates <- as.Date(names(df1)[-1])
MON <- seq(min(floor_date(all_dates, "month")),
max(ceiling_date(all_dates, "month")),
by="1 day") %>%
.[wday(.,label = TRUE) == "Mon"] %>%
data.frame("Mondays" = .) %>%
mutate(mmm = format(Mondays, "%Y-%m"))
df1 %>%
pivot_longer(cols = -`Row Labels`, names_to = "dates") %>%
mutate(dates = as.Date(dates),
mmm = format(dates, "%Y-%m")) %>%
right_join(MON, by = "mmm") %>%
arrange(mmm) %>%
group_by(`Row Labels`, dates) %>%
mutate(value = value / n()) %>%
ungroup() %>%
select(`Row Labels`, Mondays, value) %>%
pivot_wider(`Row Labels`, names_from = "Mondays", values_from = "value")
#> # A tibble: 4 x 14
#> `Row Labels` `2022-11-07` `2022-11-14` `2022-11-21` `2022-11-28` `2022-12-05`
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 X6 25 25 25 25 40
#> 2 X7 50 50 50 50 50
#> 3 X8 75 75 75 75 75
#> 4 X9 100 100 100 100 100
#> # ... with 8 more variables: 2022-12-12 <dbl>, 2022-12-19 <dbl>,
#> # 2022-12-26 <dbl>, 2023-01-02 <dbl>, 2023-01-09 <dbl>, 2023-01-16 <dbl>,
#> # 2023-01-23 <dbl>, 2023-01-30 <dbl>
The same principle applies when doing it weekly:
WKLY <- seq(min(floor_date(all_dates, "month")),
max(ceiling_date(all_dates, "month")),
by="week") %>%
data.frame("Weekly" = .) %>%
mutate(mmm = format(Weekly, "%Y-%m"))
df1 %>%
pivot_longer(cols = -`Row Labels`, names_to = "dates") %>%
mutate(dates = as.Date(dates),
mmm = format(dates, "%Y-%m")) %>%
right_join(WKLY, by = "mmm") %>%
arrange(mmm) %>%
group_by(`Row Labels`, dates) %>%
mutate(value = value / n()) %>%
ungroup() %>%
select(`Row Labels`, Weekly, value) %>%
pivot_wider(`Row Labels`, names_from = "Weekly", values_from = "value")
#> # A tibble: 4 x 15
#> `Row Labels` `2022-11-01` `2022-11-08` `2022-11-15` `2022-11-22` `2022-11-29`
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 X6 20 20 20 20 20
#> 2 X7 40 40 40 40 40
#> 3 X8 60 60 60 60 60
#> 4 X9 80 80 80 80 80
#> # ... with 9 more variables: 2022-12-06 <dbl>, 2022-12-13 <dbl>,
#> # 2022-12-20 <dbl>, 2022-12-27 <dbl>, 2023-01-03 <dbl>, 2023-01-10 <dbl>,
#> # 2023-01-17 <dbl>, 2023-01-24 <dbl>, 2023-01-31 <dbl>
Data:
df1 <- structure(list(`Row Labels` = c("X6", "X7", "X8", "X9"),
`2022-11-01` = c(100, 200, 300, 400),
`2022-12-01` = c(160, 200, 300, 400),
`2023-01-01` = c(500, 550, 600, 650)),
class = c("tbl_df", "tbl", "data.frame"),
row.names = c(NA, -4L))
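As a quick sanity check (not part of the answer), re-aggregating the Monday-level result by month should reproduce the totals in df1; `weekly_mon` is a hypothetical name for the first pivot_wider() result above:
library(dplyr)
library(tidyr)

# `weekly_mon` is assumed to hold the Monday-wide result from the first pipeline
weekly_mon %>%
  pivot_longer(-`Row Labels`, names_to = "Monday", values_to = "value") %>%
  mutate(month = format(as.Date(Monday), "%Y-%m")) %>%
  group_by(`Row Labels`, month) %>%
  summarise(total = sum(value), .groups = "drop")
# each total should equal the corresponding monthly value in df1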
I have two data frames:
df1 <- data.frame(date = as.Date(c("2021-06-01", "2021-06-02", "2021-06-03", "2021-06-04",
                                   "2021-06-05", "2021-06-06", "2021-06-07", "2021-06-08",
                                   "2021-06-09", "2021-06-10", "2021-06-11", "2021-06-12",
                                   "2021-06-13")),
                  temperature = c(17, 30, 28, 29, 16, 21, 20, 11, 28, 29, 25, 26, 19))
and
df2 <- data.frame(ID = 1:4,
                  date.pose = as.Date(c("2021-06-01", "2021-06-03", "2021-06-06", "2021-06-10")),
                  date.withdrawal = as.Date(c("2021-06-02", "2021-06-05", "2021-06-09", "2021-06-13")))
I want to store the mean temperature for each period in df2 in a new column (df2$mean.temperature).
For ID = 1 in df2, the mean temperature would be calculated from the temperatures on 2021-06-01 and 2021-06-02, which is mean(17, 30) = 23.5.
In other words, I want to get this:
df2 <- data.frame(ID = 1:4,
                  date.pose = as.Date(c("2021-06-01", "2021-06-03", "2021-06-06", "2021-06-10")),
                  date.withdrawal = as.Date(c("2021-06-02", "2021-06-05", "2021-06-09", "2021-06-13")),
                  mean.Temperature = c(23.5, 24.3, 20.0, 24.8))
I'm trying to add the ID from df2 as a new column in df1. Once I do that, I could aggregate like this:
df3 <- aggregate(df1$temperature, list(df1$ID, df2$date.pose), FUN = mean)
I don't know how to add the corresponding ID in df1.
Or maybe there is a better way to do this?
Here's an approach using uncount from tidyr and some joins.
library(dplyr)

df2 %>%
  mutate(days = (date.withdrawal - date.pose + 1) %>% as.integer) %>%
  tidyr::uncount(days, .id = "row") %>%                 # one row per day of each period
  transmute(ID, date = date.pose + row - 1) %>%
  left_join(df1) %>%
  group_by(ID) %>%
  summarize(mean.Temperature = mean(temperature)) %>%
  right_join(df2)
Result
# A tibble: 4 × 4
ID mean.Temperature date.pose date.withdrawal
<int> <dbl> <date> <date>
1 1 23.5 2021-06-01 2021-06-02
2 2 24.3 2021-06-03 2021-06-05
3 3 20 2021-06-06 2021-06-09
4 4 24.8 2021-06-10 2021-06-13
Update, thanks to @Jon Spring. Here is how we could do it. The logic:
- pivot df2 to long format and join both data frames by date
- arrange by date and fill the IDs downward
- after grouping by ID, summarise with mean()
- finally re-join df2
library(dplyr)
library(tidyr)
df2 %>%
pivot_longer(-ID, values_to = "date") %>%
full_join(df1, by= "date") %>%
arrange(date) %>%
fill(ID, .direction = "down") %>%
group_by(ID) %>%
summarise(mean_temp = mean(temperature, na.rm = TRUE)) %>%
left_join(df2, by="ID")
ID mean_temp date.pose date.withdrawal
<int> <dbl> <date> <date>
1 1 23.5 2021-06-01 2021-06-02
2 2 24.3 2021-06-03 2021-06-05
3 3 20 2021-06-06 2021-06-09
4 4 24.8 2021-06-10 2021-06-13
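The asker's original plan — tagging each df1 row with the matching ID and then aggregating — can also be sketched in base R with findInterval(), assuming df2 is sorted by date.pose and each period starts right after the previous one ends:
# Assign each daily temperature the ID of the period it falls in
# (relies on df2$date.pose being sorted and the periods covering the dates in df1)
df1$ID <- df2$ID[findInterval(df1$date, df2$date.pose)]

# Mean temperature per period, then attach it to df2
means <- aggregate(temperature ~ ID, data = df1, FUN = mean)
merge(df2, setNames(means, c("ID", "mean.temperature")), by = "ID")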
I found a few solutions on here, but none of them seem to work for adding a summary row to dplyr output.
#mock up data
df <- data.frame("Market" = sample(c("East", "North", "West"), 100, replace = TRUE, prob = c(0.33, 0.33, 0.34)),
"var1" = sample(c("Y", "N"), 100, replace = TRUE, prob = c(0.4, 0.6)),
"var2" = sample(c("Y", "N"), 100, replace = TRUE, prob = c(0.7, 0.3)),
"var3" = sample(c("Y", "N"), 100, replace = TRUE, prob = c(0.5, 0.5)))
Here is the code:
df_report <- df %>%
group_by(Market) %>%
filter(Market == "East" | Market == "West") %>%
summarise(n = n(),
var1_y = sum(var1 == "Y"),
var1_n = sum(var1 == "N")) %>%
mutate(total = var1_y + var1_n,
var1_y_pct = (var1_y/total),
var1_n_pct = (var1_n/total),
pct_total = total/sum(total))
Here is the output:
# A tibble: 2 x 8
Market n var1_y var1_n total var1_y_pct var1_n_pct pct_total
<fct> <int> <int> <int> <int> <dbl> <dbl> <dbl>
1 East 29 13 16 29 0.448 0.552 0.453
2 West 35 16 19 35 0.457 0.543 0.547
Here are the two solutions I tried:
Option 1
df_report %>%
add_row(Market = "Total", n = sum(n), var1_y = sum(var1_y), var1_n = sum(var1_n),
total = sum(total), var1_y_pct = sum(var1_y_pct), var1_n_pct = sum(varn_y_pct), pct_total = sum(pct_total))
Option 2
df_report %>%
rbind(c("Total", sum(n), sum(var1_y), sum(var1_n), sum(total), sum(var1_y_pct), sum(varn_y_pct), sum(pct_total)))
Both give me the same error: Error in sum(n) : invalid 'type' (closure) of argument
I'm unable to determine why these solutions, while working for others and seeming very reasonable, are not working for me.
The error happens because add_row() and rbind() evaluate their arguments in the calling environment rather than inside df_report, so sum(n) picks up dplyr's function n() (a closure) instead of the column. For the totals themselves, you should try
df_report %>% janitor::adorn_totals("row")
Which produces
Market n var1_y var1_n total var1_y_pct var1_n_pct pct_total
East 30 11 19 30 0.3666667 0.6333333 0.4285714
West 40 19 21 40 0.4750000 0.5250000 0.5714286
Total 70 30 40 70 0.8416667 1.1583333 1.0000000
The long way of doing this is to build the row with summarise (watch out, you have a typo: varn_y_pct should be var1_n_pct) and then bind the rows.
row_to_add <- df_report %>%
summarise(Market = "Total",
n = sum(n),
var1_y = sum(var1_y),
var1_n = sum(var1_n),
total = sum(total),
var1_y_pct = sum(var1_y_pct),
var1_n_pct = sum(var1_n_pct),
pct_total = sum(pct_total))
df_report %>% bind_rows(row_to_add)
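A more compact sketch of the same idea, using across() (dplyr 1.0 or later) so the numeric columns don't have to be spelled out one by one:
library(dplyr)

df_report %>%
  summarise(Market = "Total", across(where(is.numeric), sum)) %>%  # sum every numeric column
  bind_rows(df_report, .)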
I have the following situation. Given the table
df <- data.frame(ID = c(1, 2, 2, 3, 3, 4),
type = c("MC", "MC", "MK", "MC", "MK", "MC"),
value1 = c(512, 261, 4523, 1004, 1221, 2556),
value2 = c(726, 4000, 280, 998, 113, 6789))
I am trying to find a way to implement the following logic: If for an ID, both types (MC and MK) occur, use value1 from MK and value2 from MC. Otherwise (only the type MC occurs), use MC.
Hence, the final result is supposed to be:
data.frame(ID = c(1, 2, 3, 4),
type = c("MC", "MC", "MC", "MC"),
value1 = c(512, 4523, 1221, 2556),
value2 = c(726, 4000, 998, 6789))
Assume the MK row is dropped after its value1 has been extracted.
Another version with dplyr
library(dplyr)
df %>%
group_by(ID) %>%
mutate(value1 = ifelse(any(type == "MK"), value1[type=="MK"],value1[type=="MC"]),
value2 = value2[type == "MC"]) %>%
filter(type == "MC")
# ID type value1 value2
# <dbl> <fct> <dbl> <dbl>
#1 1 MC 512 726
#2 2 MC 4523 4000
#3 3 MC 1221 998
#4 4 MC 2556 6789
Here, for value1 we take the "MK" value when one is present and the corresponding "MC" value otherwise; for value2 we always take the "MC" value, and we keep only the rows with type "MC". This assumes every group (ID) has an "MC" row.
For efficiency I would definitely prefer @Andre Elrico's answer, but here is a dplyr option. Try:
df <- data.frame(ID = c(1, 2, 2, 3, 3, 4),
type = c("MC", "MC", "MK", "MC", "MK", "MC"),
value1 = c(512, 261, 4523, 1004, 1221, 2556),
value2 = c(726, 4000, 280, 998, 113, 6789))
library(dplyr)
df %>%
reshape(., idvar = "ID", timevar = "type", direction = "wide") %>%
group_by(ID) %>%
mutate(value1 = ifelse(is.na(value1.MK), value1.MC, value1.MK),
value2 = ifelse(is.na(value2.MC), value2.MK, value2.MC),
type = "MC") %>%
select(ID, type, value1, value2)
# output
# A tibble: 4 x 4
# Groups: ID [4]
ID type value1 value2
<dbl> <chr> <dbl> <dbl>
1 1 MC 512 726
2 2 MC 4523 4000
3 3 MC 1221 998
4 4 MC 2556 6789
data.table solution
setDT(df)[, {x = .SD; if (all(c("MC", "MK") %in% type)) {x$value1[] = last(value1)}; first(x)}, by = ID]
result:
# ID type value1 value2
#1 1 MC 512 726
#2 2 MC 4523 4000
#3 3 MC 1221 998
#4 4 MC 2556 6789
dplyr:
df %>% group_by(ID) %>%
  do(., (function(x) {if (all(c("MC", "MK") %in% x$type)) {x$value1[] = x$value1[x$type == "MK"]}; x[1, ]})(.))
# A tibble: 4 x 4
# Groups: ID [4]
# ID type value1 value2
# <dbl> <fct> <dbl> <dbl>
#1 1 MC 512 726
#2 2 MC 4523 4000
#3 3 MC 1221 998
#4 4 MC 2556 6789
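Yet another way to express the same rule, sketched with a self-join and coalesce(), assuming every ID has exactly one MC row:
library(dplyr)

df %>%
  filter(type == "MC") %>%
  left_join(df %>% filter(type == "MK") %>% select(ID, value1_mk = value1),
            by = "ID") %>%
  mutate(value1 = coalesce(value1_mk, value1)) %>%   # prefer the MK value1 when it exists
  select(ID, type, value1, value2)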