keep first row after calculating difference between rows with dplyr::lag - r

My question is similar to this OP and this OP, with a minor difference that seems to be overly complicated.
Example of my data:
ind_id  wt  date
  1002  25  1987-07-27
  1002  15  1988-05-05
  2340  30  1987-03-18
  2340  52  1989-08-15
I am calculating the difference between wt values after group_by(ind_id).
To do this:
df <- df %>%
  group_by(ind_id) %>%
  mutate(mass_diff = wt - lag(wt))
This gives me this output:
ind_id  wt  date        mass_diff
  1002  15  1988-05-05        -10
  2340  52  1989-08-15         22
But, the output I want should keep the first wt record, not the last.
Desired output:
ind_id  wt  date        mass_diff
  1002  25  1988-05-05        -10
  2340  30  1989-08-15         22
Note that the wt column is the only one I'd like to have maintained from the first row. (Keep in mind that this example is overly simplified and I am actually working with 18 rows).
Any suggestions (using dplyr) would be appreciated!

A possible solution:
library(tidyverse)
df <- structure(list(ind_id = c(1002, 1002, 2340, 2340),
                     wt = c(25, 15, 30, 52),
                     date = structure(c(6416, 6699, 6285, 7166), class = "Date")),
                row.names = c(NA, -4L), class = "data.frame")

df %>%
  group_by(ind_id) %>%
  mutate(mass_diff = wt - lag(wt)) %>%
  mutate(wt = first(wt)) %>%
  slice_tail() %>%
  ungroup()
#> # A tibble: 2 × 4
#>   ind_id    wt date       mass_diff
#>    <dbl> <dbl> <date>         <dbl>
#> 1   1002    25 1988-05-05       -10
#> 2   2340    30 1989-08-15        22
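If each individual has exactly two measurements, as in this example, the same result can also be collapsed into a single summarise() call. This is just an alternative sketch of the approach above under that assumption (reusing the df and tidyverse loading from the reprex), not a different method:
df %>%
  group_by(ind_id) %>%
  summarise(mass_diff = last(wt) - first(wt),  # with two rows per group this equals the lagged difference
            wt        = first(wt),
            date      = last(date),
            .groups   = "drop") %>%
  select(ind_id, wt, date, mass_diff)
#> # A tibble: 2 × 4
#>   ind_id    wt date       mass_diff
#>    <dbl> <dbl> <date>         <dbl>
#> 1   1002    25 1988-05-05       -10
#> 2   2340    30 1989-08-15        22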

Related

Group by a variable in dataframe R

I have a dataframe like below,
Date      cat  cam  reg  per
22-01-05   A    60  120   50
22-01-05   B    20  100   20
22-01-08   A    30  150   20
22-01-08   B    30  100   30
But I want something like below:
Date      cam  reg   per
22-01-05   80  220  14.5
22-01-08   60  250  24
How to get this using R?
I am not sure why your expected per values are like that, but maybe you want the following:
df <- data.frame(Date = c("22-01-05", "22-01-05", "22-01-08", "22-01-08"),
                 cat = c("A", "B", "A", "B"),
                 cam = c(60, 20, 30, 30),
                 reg = c(120, 100, 150, 100),
                 per = c(50, 20, 20, 30))
library(dplyr)
df %>%
  group_by(Date) %>%
  summarise(cam = sum(cam),
            reg = sum(reg),
            per = cam / reg)
#> # A tibble: 2 × 4
#>   Date       cam   reg    per
#>   <chr>    <dbl> <dbl>  <dbl>
#> 1 22-01-05    80   220  0.364
#> 2 22-01-08    60   250  0.24
Created on 2022-07-07 by the reprex package (v2.0.1)
Using only the package dplyr (which is part of the tidyverse), just do:
df %>%
  group_by(Date) %>%
  summarise(cam = sum(cam),
            reg = sum(reg),
            per = 100 * (cam / reg))
  Date       cam   reg   per
  <chr>    <int> <int> <dbl>
1 22-01-05    80   220  36.4
2 22-01-08    60   250  24
The nice thing about this syntax is that you can modify it and add further summary variables, not only sums but also means, medians, etc., in a clean and structured way.
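For example, a rough sketch of that kind of extension, using the df from the reprex above (the per_mean and per_median column names are purely illustrative additions, not part of the original answer):
df %>%
  group_by(Date) %>%
  summarise(cam        = sum(cam),
            reg        = sum(reg),
            per_mean   = mean(per),    # average of the original per values
            per_median = median(per),  # median of the original per values
            per        = 100 * (cam / reg))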
You can try this, though I don't know how the per values of 14.5 and 24 were obtained:
library(dplyr)
aggregate(cbind(cam, reg) ~ Date, df, sum) %>%
  mutate(per = 100 * (cam / reg))
A data.frame: 2 × 4
Date      cam    reg    per
<chr>     <dbl>  <dbl>  <dbl>
22-01-05  80     220    36.36364
22-01-08  60     250    24.00000

How best to calculate a year over year difference in R

Below is the sample code. The task at hand is to create a year over year difference (2021 q4 value - 2020 q4 value) for only the fourth quarter and percentage difference. Desired result is below. Usually I would do a pivot_wider and such. However, how does one do this and not take all quarters into account?
year <- c(2020,2020,2020,2020,2021,2021,2021,2021,2020,2020,2020,2020,2021,2021,2021,2021)
qtr <- c(1,2,3,4,1,2,3,4,1,2,3,4,1,2,3,4)
area <- c(1012,1012,1012,1012,1012,1012,1012,1012,1402,1402,1402,1402,1402,1402,1402,1402)
employment <- c(100,102,104,106,108,110,114,111,52,54,56,59,61,66,65,49)
test1 <- data.frame (year,qtr,area,employment)
area  difference  percentage
1012           5        4.7%
1402         -10       -16.9%
You would use filter on quarter:
test1 |>
  filter(qtr == 4) |>
  group_by(area) |>
  mutate(employment_lag = lag(employment),
         diff = employment - employment_lag) |>
  na.omit() |>
  ungroup() |>
  mutate(percentage = diff / employment_lag)
Output:
# A tibble: 2 × 7
   year   qtr  area employment employment_lag  diff percentage
  <dbl> <dbl> <dbl>      <dbl>          <dbl> <dbl>      <dbl>
1  2021     4  1012        111            106     5     0.0472
2  2021     4  1402         49             59   -10    -0.169
Update: Adding correct percentage.
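If the percentage should also be displayed the way the question shows it (4.7%, -16.9%), one possible final formatting step is sprintf(); this is only a sketch built on the same pipeline, not part of the original answer:
library(dplyr)

test1 |>
  filter(qtr == 4) |>
  group_by(area) |>
  mutate(employment_lag = lag(employment),
         diff = employment - employment_lag) |>
  na.omit() |>
  ungroup() |>
  # format the ratio as a percentage string with one decimal place
  mutate(percentage = sprintf("%.1f%%", 100 * diff / employment_lag)) |>
  select(area, diff, percentage)
#> # A tibble: 2 × 3
#>    area  diff percentage
#>   <dbl> <dbl> <chr>
#> 1  1012     5 4.7%
#> 2  1402   -10 -16.9%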

How to find duplicate dates within a row in R, and then replace associated values with the mean?

There are some similar questions, however I haven't been able to find the solution for my data:
ID <- c(27,46,72)
Gest1 <- c(27,28,29)
Sys1 <- c(120,123,124)
Dia1 <- c(90,89,92)
Gest2 <- c(29,28,30)
Sys2 <- c(122,130,114)
Dia2 <- c(89,78,80)
Gest3 <- c(32,29,30)
Sys3 <- c(123,122,124)
Dia3 <- c(90,88,89)
Gest4 <- c(33,30,32)
Sys4 <- c(124,123,128)
Dia4 <- c(94,89,80)
df.1 <- data.frame(ID,Gest1,Sys1,Dia1,Gest2,Sys2,Dia2,Gest3,Sys3,
Dia3,Gest4,Sys4,Dia4)
df.1
What I need to do is identify where there are any cases of gestational age duplicates (variables beginning with Gest), and then find the mean of the associated Sys and Dia variables.
Once the mean has been calculated, I need to replace the duplicates with just 1 Gest variable, and the mean of the Sys variable and the mean of the Dia variable. Everything after those duplicates should then be moved up the dataframe.
Here is what it should look like:
df.2
My real data has 25 Gest variables with 25 associated Sys variables and 25 associated Dia variables.
Sorry if this is confusing! I've tried to write an ok question but it is my first time using stack overflow.
Thank you!!
This is easier to manage in long (and tidy) format.
Using tidyverse, you can use pivot_longer to put the data into long form. After grouping by ID and Gest, you can substitute the Sys and Dia values with their mean. If there is more than one Gest for a given ID, it will then use the average.
Then, you can keep that row of data with slice. After grouping by ID, you can renumber after combining those with common Gest values.
library(tidyverse)
df.1 %>%
  pivot_longer(cols = -ID, names_to = c(".value", "number"),
               names_pattern = "(\\w+)(\\d+)") %>%
  group_by(ID, Gest) %>%
  mutate(across(c(Sys, Dia), mean)) %>%
  slice(1) %>%
  group_by(ID) %>%
  mutate(number = row_number())
Output
      ID number  Gest   Sys   Dia
   <dbl>  <int> <dbl> <dbl> <dbl>
 1    27      1    27  120   90
 2    27      2    29  122   89
 3    27      3    32  123   90
 4    27      4    33  124   94
 5    46      1    28  126.  83.5
 6    46      2    29  122   88
 7    46      3    30  123   89
 8    72      1    29  124   92
 9    72      2    30  119   84.5
10    72      3    32  128   80
Note: I would keep it in long form, but if you wanted wide format again, you can add:
pivot_wider(id_cols = ID, names_from = number, values_from = c(Gest, Sys, Dia))
This involves changing the structure of the table into long format, averaging the duplicates, and then reformatting back into the desired table:
library(tidyr)
library(dplyr)

df.1 <- data.frame(ID, Gest1, Sys1, Dia1, Gest2, Sys2, Dia2, Gest3, Sys3,
                   Dia3, Gest4, Sys4, Dia4)

# convert data to long format
longdf <- df.1 %>%
  pivot_longer(!ID, names_to = c(".value", "time"),
               names_pattern = "(\\D+)(\\d)", values_to = "count")

# average duplicate rows
temp <- longdf %>%
  group_by(ID, Gest) %>%
  summarize(Sys = mean(Sys), Dia = mean(Dia)) %>%
  mutate(time = row_number())

# convert back to wide format
answer <- temp %>%
  pivot_wider(id_cols = ID, names_from = time,
              values_from = c("Gest", "Sys", "Dia"),
              names_glue = "{.value}{time}")

# resort the columns
answer <- answer[, names(df.1)]
answer
# A tibble: 3 × 13
# Groups:   ID [3]
     ID Gest1  Sys1  Dia1 Gest2  Sys2  Dia2 Gest3  Sys3  Dia3 Gest4  Sys4  Dia4
  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1    27    27  120   90      29   122  89      32   123    90    33   124    94
2    46    28  126.  83.5    29   122  88      30   123    89    NA    NA    NA
3    72    29  124   92      30   119  84.5    32   128    80    NA    NA    NA

Pivot wider to one row in R

Here is the sample code that I am using
library(dplyr)
naics <- c("000000","000000",123000,123000)
year <- c(2020,2021,2020,2021)
January <- c(250,251,6,9)
February <- c(252,253,7,16)
March <- c(254,255,8,20)
sample2 <- data.frame (naics, year, January, February, March)
Here is the intended result
        Jan2020  Feb2020  March2020  Jan2021  Feb2021  March2021
000000      250      252        254      251      253        255
123000        6        7          8        9       16         20
Is this something that is done with pivot_wider or is it more complex?
We use pivot_wider, selecting the month columns with values_from and 'year' with names_from, then change the column-name format in names_glue and, if needed, convert 'naics' to row names with column_to_rownames (from tibble).
library(tidyr)
library(tibble)
pivot_wider(sample2, names_from = year, values_from = January:March,
            names_glue = "{substr(.value, 1, 3)}{year}") %>%
  column_to_rownames('naics')
-output
        Jan2020 Jan2021 Feb2020 Feb2021 Mar2020 Mar2021
000000      250     251     252     253     254     255
123000        6       9       7      16       8      20
With the reshape function from base R:
reshape(sample2, dir = "wide", sep="",
idvar = "naics",
timevar = "year",
new.row.names = unique(naics))[,-1]
# January2020 February2020 March2020 January2021 February2021 March2021
# 000000 250 252 254 251 253 255
# 123000 6 7 8 9 16 20
This takes a longer route than @akrun's answer. I will leave it here in case it helps with intuition about the steps being taken; otherwise, @akrun's answer is more resource efficient.
sample2 %>%
  tidyr::pivot_longer(-c(naics, year), names_to = "month",
                      values_to = "value") %>%
  mutate(Month = paste0(month, year)) %>%
  select(-year, -month) %>%
  tidyr::pivot_wider(names_from = Month, values_from = value)
# A tibble: 2 x 7
  naics  January2020 February2020 March2020 January2021 February2021
  <chr>        <dbl>        <dbl>     <dbl>       <dbl>        <dbl>
1 000000         250          252       254         251          253
2 123000           6            7         8           9           16
# ... with 1 more variable: March2021 <dbl>

Create new columns in R

I am carrying out an analysis on some Italian regions. I have a dataset similar to the following:
mydata <- data.frame(date = c(2020, 2021, 2020, 2021, 2020, 2021),
                     Region = c('Sicilia', 'Sicilia', 'Sardegna', 'Sardegna', 'Campania', 'Campania'),
                     Number = c(20, 30, 50, 70, 90, 69))
Now I have to create two new columns. The first (called 'Total Population') contains a fixed number for each region (for example, each row with Sicily will have 'Total Population' = 250). The second column contains the percentage ratio between the value of the 'Number' column and the corresponding value of 'Total Population' (for example, for Sicily the value will be 20/250, and so on).
I hope I explained myself well. Thank you very much!
Like this, perhaps:
library(dplyr)
library(magrittr)  # for the %<>% assignment pipe

mydata %<>%
  group_by(Region) %>%
  mutate(
    `Total Population` = sum(Number),
    `Ratio of Total`   = sprintf("%.1f%%", 100 * Number / sum(Number)))
mydata is now:
> mydata
# A tibble: 6 x 5
# Groups:   Region [3]
   date Region   Number `Total Population` `Ratio of Total`
  <dbl> <chr>     <dbl>              <dbl> <chr>
1  2020 Sicilia      20                 50 40.0%
2  2021 Sicilia      30                 50 60.0%
3  2020 Sardegna     50                120 41.7%
4  2021 Sardegna     70                120 58.3%
5  2020 Campania     90                159 56.6%
6  2021 Campania     69                159 43.4%
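If, as the question describes, 'Total Population' is meant to be a fixed external figure per region rather than the sum of Number, a small lookup table can be joined in first. This is only a sketch under that assumption; the 250 is the example value from the question, while the Sardegna and Campania totals below are placeholders to be replaced with real figures:
library(dplyr)

# hypothetical lookup table of fixed totals per region
pop <- data.frame(Region = c('Sicilia', 'Sardegna', 'Campania'),
                  `Total Population` = c(250, 300, 400),  # 250 from the question; others are placeholders
                  check.names = FALSE)

mydata %>%
  left_join(pop, by = "Region") %>%
  mutate(`Ratio of Total` = sprintf("%.1f%%", 100 * Number / `Total Population`))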
