How to merge rows based on common columns in R

I am working in a dataframe that looks something like this:
vims <- data.frame(
patient_ID = c("a", "a", "a", "b", "b"),
Date = c(2020, 2020, 2018, 2020, 2028),
Eye = c("Right", "Left", "Right", "Right", "Right"),
V1 = c(21, 18, 30, 30, 18),
V2 = c(28, 30, 15, 45, 60)
)
As you can see, each ID may have several evaluations on different dates, and each date may have evaluations for different eyes. I am trying to merge rows so that the data is arranged by ID and date, with each row containing the ID, the date, and all the information for every eye (e.g. V1 for the right and left eye, if available).

Are you looking for this?
library(dplyr)
library(tidyr)
vims %>% pivot_wider(id_cols = c(patient_ID, Date), names_from = Eye, values_from = c(V1,V2))
# A tibble: 4 x 6
patient_ID Date V1_Right V1_Left V2_Right V2_Left
<chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 a 2020 21 18 28 30
2 a 2018 30 NA 15 NA
3 b 2020 30 NA 45 NA
4 b 2028 18 NA 60 NA
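
The question also asks for the rows to be arranged by ID and date, while pivot_wider keeps the order in which each ID/date pair first appears. A minimal sketch, assuming the same vims data as in the question, that adds an arrange() step (the name `wide` is only for illustration):

```r
library(dplyr)
library(tidyr)

# Same data as in the question
vims <- data.frame(
  patient_ID = c("a", "a", "a", "b", "b"),
  Date = c(2020, 2020, 2018, 2020, 2028),
  Eye = c("Right", "Left", "Right", "Right", "Right"),
  V1 = c(21, 18, 30, 30, 18),
  V2 = c(28, 30, 15, 45, 60)
)

wide <- vims %>%
  pivot_wider(id_cols = c(patient_ID, Date),
              names_from = Eye, values_from = c(V1, V2)) %>%
  arrange(patient_ID, Date)   # sort by ID first, then by date within each ID

wide
```

With this step the 2018 row for patient "a" should come before the 2020 row.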

A data.table alternative:
library(data.table)
vims <- as.data.table(vims)
dcast(vims, patient_ID+Date~Eye, value.var = c("V1","V2"))
patient_ID Date V1_Left V1_Right V2_Left V2_Right
1: a 2018 NA 30 NA 15
2: a 2020 18 21 30 28
3: b 2020 NA 30 NA 45
4: b 2028 NA 18 NA 60

Related

Creating a summary row for each group in a dataframe based on other variables in the group

I am fairly new to R and have ended up in the following situation: I want to create a summary row for each group in the dataframe, grouped by Year and Model, where the value of the new row is the value of one Variable minus the values of others in the same group.
df <- data.frame(Model = c(1,1,1,2,2,2,2,2,2,2,2,2,2),
Year = c(2020, 2020, 2020, 2020, 2020, 2020, 2020, 2030, 2030, 2030, 2040, 2040, 2040),
Variable = c("A", "B", "C", "A", "B", "C", "D", "A", "C", "E", "A", "C", "D"),
value = c(15, 2, 5, 25, 6, 4, 4, 41, 24,1, 15, 3, 2))
I have managed to create a new row for each group, so it already has a Year and a Variable name that I manually specified using:
df <- df %>% group_by(Model, Year) %>% group_modify(~ add_row(., Variable = "New", .before=0))
However, I am struggling to create an equation from which I want to calculate the value.
What I want to have instead of NAs: value of A-B-D in each group
Would appreciate any help. My first thread here, pardon for any inconvenience.
You could pivot wide and then back; this would add rows with zeros where missing:
library(dplyr); library(tidyr)
df %>%
pivot_wider(names_from = Variable, values_from = value, values_fill = 0) %>%
mutate(new = A - B - D) %>%
pivot_longer(-c(Model, Year), names_to = "Variable")
# A tibble: 24 × 4
Model Year Variable value
<dbl> <dbl> <chr> <dbl>
1 1 2020 A 15
2 1 2020 B 2
3 1 2020 C 5
4 1 2020 D 0
5 1 2020 E 0
6 1 2020 new 13 # 15 - 2 - 0 = 13
7 2 2020 A 25
8 2 2020 B 6
9 2 2020 C 4
10 2 2020 D 4
# … with 14 more rows
EDIT - variation where we leave the missing values and use coalesce(x, 0) to allow subtraction to treat NA's as zeroes. The pivot_wider creates NA's in the missing spots, but we can exclude these in the pivot_longer using values_drop_na = TRUE.
df %>%
pivot_wider(names_from = Variable, values_from = value) %>%
mutate(new = A - coalesce(B,0) - coalesce(D,0)) %>%
pivot_longer(-c(Model, Year), names_to = "Variable", values_drop_na = TRUE)
# A tibble: 17 × 4
Model Year Variable value
<dbl> <dbl> <chr> <dbl>
1 1 2020 A 15
2 1 2020 B 2
3 1 2020 C 5
4 1 2020 new 13
5 2 2020 A 25
6 2 2020 B 6
7 2 2020 C 4
8 2 2020 D 4
9 2 2020 new 15
10 2 2030 A 41
11 2 2030 C 24
12 2 2030 E 1
13 2 2030 new 41
14 2 2040 A 15
15 2 2040 C 3
16 2 2040 D 2
17 2 2040 new 13

How do I create a new factor level that summarizes total values of other factor levels?

I have a dataset where I have at least three columns
year sex value
1 2019 M 10
2 2019 F 20
3 2020 M 50
4 2020 F 20
I would like to group by the first column, year, and then add another level to sex that holds the total of the value column for that year. That is, I would like something like this:
year sex value
<int> <chr> <dbl>
1 2019 M 10
2 2019 F 20
3 2019 Total 30
4 2020 M 50
5 2020 F 20
6 2020 Total 70
Any help is appreciated, especially in dplyr.
Here is just another way of doing this:
library(dplyr)
library(purrr)
df %>%
group_split(year) %>%
map_dfr(~ add_row(.x, year = first(.x$year), sex = "Total", value = sum(.x$value)))
# A tibble: 6 x 3
year sex value
<int> <chr> <dbl>
1 2019 M 10
2 2019 F 20
3 2019 Total 30
4 2020 M 50
5 2020 F 20
6 2020 Total 70
You can summarise the data for each year and bind it to the original dataset.
library(dplyr)
df %>%
group_by(year) %>%
summarise(sex = 'Total',
value = sum(value)) %>%
bind_rows(df) %>%
arrange(year, sex)
# year sex value
# <int> <chr> <dbl>
#1 2019 F 20
#2 2019 M 10
#3 2019 Total 30
#4 2020 F 20
#5 2020 M 50
#6 2020 Total 70
Or in base R -
aggregate(value~year, df, sum) |>
transform(sex = 'Total') |>
rbind(df)
data
df <- data.frame(year = rep(2019:2020, each = 2),
sex = c('M', 'F'), value = c(10, 20, 50, 20))
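
The base R version above puts the Total rows first and leaves the result unsorted. A small sketch, assuming the same df as in the data block, that keeps the original rows first and then sorts by year so that each Total lands at the end of its group (the names `totals` and `out` are only for illustration; it avoids `|>` so it should also run on R versions before 4.1):

```r
df <- data.frame(year = rep(2019:2020, each = 2),
                 sex = c("M", "F"), value = c(10, 20, 50, 20))

# One Total row per year
totals <- transform(aggregate(value ~ year, df, sum), sex = "Total")

# Original rows first, totals after; reorder the columns of totals to match df
out <- rbind(df, totals[names(df)])

# order() leaves ties in their original order, so Total stays last within each year
out <- out[order(out$year), ]
rownames(out) <- NULL
out
```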

Creating a variable by group for sample data

I have a sample database (which I did not make myself) as follows:
panelID= c(1:50)
year= c(2005, 2010)
country = c("A", "B", "C", "D", "E", "F", "G", "H", "I", "J")
n <- 2
library(data.table)
set.seed(123)
DT <- data.table( country = rep(sample(country, length(panelID), replace = TRUE), each = n),
year = c(replicate(length(panelID), sample(year, n))))
DT[, uniqueID := .I] # Creates a unique ID
DT[DT == 0] <- NA
DT$sales[DT$sales< 0] <- NA
DT <- as.data.frame(DT)
I am always struggling when I want to create a new variable which has to meet certain conditions.
I would like to create a tax rate for my sample database. The tax rate has to be the same per country-year, between 10% and 40% and not more than 5% apart per country.
I cannot seem to figure out how to do it. It would be great if someone could point me in the right direction.
I am not 100% sure what you are looking for, but you could use dplyr:
library(dplyr)
DT %>%
group_by(country) %>%
mutate(base_rate = as.integer(runif(1, 12.5, 37.5))) %>%
group_by(country, year) %>%
mutate(tax_rate = base_rate + as.integer(runif(1,-2.5,+2.5)))
which returns
# A tibble: 100 x 6
# Groups: country, year [20]
country year uniqueID sales base_rate tax_rate
<chr> <dbl> <int> <lgl> <int> <int>
1 C 2005 1 NA 26 26
2 C 2010 2 NA 26 26
3 C 2010 3 NA 26 26
4 C 2005 4 NA 26 26
5 J 2005 5 NA 21 21
6 J 2010 6 NA 21 20
7 B 2010 7 NA 20 20
8 B 2005 8 NA 20 22
9 F 2010 9 NA 26 26
10 F 2005 10 NA 26 26
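I first created a random base_rate per country and then a random tax_rate per country and year. Because the base rate lies between 12.5 and 37.5 and the offset between -2.5 and +2.5, the tax rate stays between 10% and 40%, and rates within a country are at most 5 points apart.
I used integers, but you could easily replace them with real percentage values.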
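
As a rough sketch of the non-integer variant, the as.integer() calls can simply be dropped. The data frame `toy` below is a hypothetical stand-in for DT (three countries, two years, two rows per country-year), not the data from the question:

```r
library(dplyr)

set.seed(123)

# Hypothetical stand-in for DT
toy <- data.frame(
  country = rep(c("A", "B", "C"), each = 4),
  year    = rep(c(2005, 2005, 2010, 2010), times = 3)
)

taxed <- toy %>%
  group_by(country) %>%
  mutate(base_rate = runif(1, 12.5, 37.5)) %>%            # one draw per country
  group_by(country, year) %>%
  mutate(tax_rate = base_rate + runif(1, -2.5, 2.5)) %>%  # one draw per country-year
  ungroup()

taxed
```

Each runif(1, ...) call returns a single number, which mutate() recycles across all rows of the current group, so the rate is constant within a country-year.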

Extend last observed values using na.locf for specific country/variable pairs

I need to use na.locf from the zoo package to replace NA values with the last observed value. However, I need to do this only for specific country and variable pairs. These pairs are specified logically in a separate data frame, an example of which is shown below.
Country <- c("FRA", "DEU", "CHE")
acctm <- c(0, 0, 1)
acctf <- c(1, 1, 0)
df1 <- data.frame(Country, acctm, acctf)
Country acctm acctf
1 FRA 0 1
2 DEU 0 1
3 CHE 1 0
A 1 means that na.locf should be used for that pair. An example of the dataset where replacement would be needed is shown below.
Country <- c("FRA", "FRA", "DEU", "DEU", "CHE", "CHE")
Year <- c(2010, 2020, 2010, 2020, 2010, 2020)
acctm <- c(20, 30, 10, NA, 20, NA)
acctf <- c(20, NA, 15, NA, 40, NA)
df2 <- data.frame(Country, Year, acctm, acctf)
Country Year acctm acctf
1 FRA 2010 20 20
2 FRA 2020 30 NA
3 DEU 2010 10 15
4 DEU 2020 NA NA
5 CHE 2010 20 40
6 CHE 2020 NA NA
Given both of the example datasets, the result of the function executing na.locf on df2 for country/variable pairs indicated by df1 should look like this:
acctm <- c(20, 30, 10, NA, 20, 20)
acctf <- c(20, 20, 15, 15, 40, NA)
df3 <- data.frame(Country, Year, acctm, acctf)
Country Year acctm acctf
1 FRA 2010 20 20
2 FRA 2020 30 20
3 DEU 2010 10 15
4 DEU 2020 NA 15
5 CHE 2010 20 40
6 CHE 2020 20 NA
The real application is a much larger dataset, so "calls" should be generalized. Thanks.
One option is a join with data.table on the 'Country' column, then use Map to apply na.locf0 to the columns of the second dataset ('nm1') wherever the corresponding column of the first dataset is 1, and assign (:=) the output back to those columns:
library(zoo)
library(data.table)
nm1 <- c('acctm', 'acctf')
nm2 <- paste0("i.", nm1)
setDT(df2)[df1, (nm1) := Map(function(x, y) if(y == 1) na.locf0(x)
else x, mget(nm1), mget(nm2)), on = .(Country), by = .EACHI]
df2
# Country Year acctm acctf
#1: FRA 2010 20 20
#2: FRA 2020 30 20
#3: DEU 2010 10 15
#4: DEU 2020 NA 15
#5: CHE 2010 20 40
#6: CHE 2020 20 NA
One dplyr and tidyr option could be:
library(dplyr)
library(tidyr)
library(zoo)
df2 %>%
pivot_longer(-c(Country, Year)) %>%
left_join(df1 %>%
pivot_longer(names_to = "cond_names",
values_to = "cond_values", -Country),
by = c("Country" = "Country",
"name" = "cond_names")) %>%
group_by(Country, name) %>%
mutate(value = if_else(cond_values == 1, na.locf(value, na.rm = FALSE), value)) %>% # na.rm = FALSE keeps leading NAs so the length is unchanged
select(-cond_values) %>%
pivot_wider()
Country Year acctm acctf
<fct> <dbl> <dbl> <dbl>
1 FRA 2010 20 20
2 FRA 2020 30 20
3 DEU 2010 10 15
4 DEU 2020 NA 15
5 CHE 2010 20 40
6 CHE 2020 20 NA
Left join df2 to df1 on Country and then, grouping by Country, generate the appropriate value for each numeric column. Note that we use na.locf0, which ensures that the result has the same length as the input. Finally, select the original columns of df2.
library(dplyr)
library(zoo)
df2 %>%
left_join(df1, by = "Country") %>%
group_by(Country) %>%
mutate(acctm = if (first(acctm.y)) na.locf0(acctm.x) else acctm.x,
acctf = if (first(acctf.y)) na.locf0(acctf.x) else acctf.x) %>%
ungroup %>%
select(names(df2))
giving:
# A tibble: 6 x 4
Country Year acctm acctf
<fct> <dbl> <dbl> <dbl>
1 FRA 2010 20 20
2 FRA 2020 30 20
3 DEU 2010 10 15
4 DEU 2020 NA 15
5 CHE 2010 20 40
6 CHE 2020 20 NA
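
Since the question asks for something that generalizes to many variables, here is a hedged base R sketch that loops over every flag column in df1 rather than naming acctm and acctf by hand. It assumes that every column of df1 other than Country is a 0/1 flag with a same-named column in df2, and that rows within each country are already in chronological order; `vars`, `res`, `ctry` and `idx` are names introduced only for this illustration:

```r
library(zoo)

df1 <- data.frame(Country = c("FRA", "DEU", "CHE"),
                  acctm = c(0, 0, 1),
                  acctf = c(1, 1, 0),
                  stringsAsFactors = FALSE)

df2 <- data.frame(Country = c("FRA", "FRA", "DEU", "DEU", "CHE", "CHE"),
                  Year = c(2010, 2020, 2010, 2020, 2010, 2020),
                  acctm = c(20, 30, 10, NA, 20, NA),
                  acctf = c(20, NA, 15, NA, 40, NA),
                  stringsAsFactors = FALSE)

vars <- setdiff(names(df1), "Country")   # all flag columns
res <- df2

for (v in vars) {
  # countries for which variable v is flagged with 1
  for (ctry in df1$Country[df1[[v]] == 1]) {
    idx <- res$Country == ctry
    res[[v]][idx] <- na.locf0(res[[v]][idx])   # fill NAs within this country only
  }
}

res
```

On the example data this should reproduce df3 from the question; for a very large dataset the data.table join above is likely to be faster.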

Extracting row in time series in R

I'm trying to extract the rows from a data frame containing the lowest value in a specific column:
income = c(2, 3, 5, 5, -15, 2, 1)
balance = c(15, 17, 20, 25, 30, 15, 17)
date = as.Date(c("2016/02/11", "2016/02/14", "2017/02/16", "2016/03/01", "2017/03/12", "2016/04/11", "2017/04/24"))
df = data.frame(income, balance, date)
Now I want to get the rows containing the minimum "balance" value from each month, so that the outcome would be a data frame looking like this:
income balance date
1 2 15 2016-02-11
2 5 25 2016-03-01
3 2 15 2016-04-11
I have tried the aggregate function:
bymonth = aggregate(balance~months(date), data=df,FUN=min)
print(bymonth)
But this gives me the following output:
months(date) balance
1 April 15
2 Februar 15
3 Marts 25
Help!
We can do this with dplyr. After grouping by the month of 'date', we slice the row that has the minimum 'balance' and remove the 'mth' column using select:
library(dplyr)
df %>%
group_by(mth = months(date)) %>%
slice(which.min(balance)) %>%
ungroup() %>%
select(-mth)
# A tibble: 3 x 3
# income balance date
# <dbl> <dbl> <date>
#1 2 15 2016-04-11
#2 2 15 2016-02-11
#3 5 25 2016-03-01
Note that if there are ties for the 'balance', then use filter(balance == min(balance)) in place of slice
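
To illustrate the note about ties, here is a minimal sketch on made-up data in which two February rows share the minimum balance. The data frame `df_ties` is hypothetical, and the grouping uses format(date, "%m") instead of months() only so that the result does not depend on the locale:

```r
library(dplyr)

# Hypothetical data: the two February rows tie for the minimum balance
df_ties <- data.frame(
  income  = c(2, 3, 5, 1),
  balance = c(15, 15, 25, 30),
  date    = as.Date(c("2016-02-11", "2016-02-14", "2016-03-01", "2016-03-12"))
)

res_ties <- df_ties %>%
  group_by(mth = format(date, "%m")) %>%   # month number as a string
  filter(balance == min(balance)) %>%      # keeps every row tied for the minimum
  ungroup() %>%
  select(-mth)

res_ties
```

Here filter() keeps both February rows, whereas slice(which.min(balance)) would keep only the first of them.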
Or with ave from base R to create a logical vector and use that to subset the rows of 'df':
df[with(df, ave(balance, months(date), FUN = min)==balance),]
# income balance date
#1 2 15 2016-02-11
#4 5 25 2016-03-01
#6 2 15 2016-04-11
