R: Turning row data from one dataframe into column data by group in another - r

I have data in the following format:
ID
Age
Sex
1
29
M
2
32
F
3
18
F
4
89
M
5
45
M
and;
ID
subID
Type
Status
Year
1
3
Car
Y
1
11
Toyota
NULL
2011
1
23
Kia
NULL
2009
2
5
Car
N
3
2
Car
Y
3
4
Honda
NULL
2019
3
7
Fiat
NULL
2006
3
8
Mitsubishi
NULL
2020
4
1
Car
N
5
7
Car
Y
Each ID in the second table has a row specifying if they have a car, and additional rows stating the brand of car/s they own. Each person has a maximum of 3 cars. I want to simplify this data into a single table as so.
ID
Age
Sex
Car?
Car.1
Car1.year
Car.2
Car2.year
Car.3
Car3.year
1
29
M
Y
Toyota
2011
Kia
2009
NULL
NULL
2
32
F
N
NULL
NULL
NULL
NULL
NULL
NULL
3
18
F
Y
Honda
2019
Fiat
2006
Mitsubishi
2020
4
89
M
N
NULL
NULL
NULL
NULL
NULL
NULL
5
45
M
Y
NULL
NULL
NULL
NULL
NULL
NULL
I've tried using the mutate function in dplyr with the case_when function, but I can't check conditions in another dataframe. If I try to join the tables together, I would have multiple rows for each ID which I want to avoid. The non-standard set up of the second table makes things complicated. My only remaining idea is to switch to Python/Pandas and create a for loop that slowly loops through each ID, searches the second dataframe if the person has a car and the car brands, then mutates a column in the first dataframe. But given the size of my dataset, this would be inefficient and take a long time.
What is the best way to do this?

You can try the following codes:
library(tidyverse)
df1
# A tibble: 5 x 3
ID Age Sex
<dbl> <dbl> <chr>
1 1 29 M
2 2 32 F
3 3 18 F
4 4 89 M
5 5 45 M
df2
# A tibble: 10 x 5
ID subID Type Status Year
<dbl> <dbl> <chr> <chr> <dbl>
1 1 3 Car Y NA
2 1 11 Toyota Y 2011
3 1 23 Kia Y 2009
4 2 5 Car N NA
5 3 2 Car Y NA
6 3 4 Honda Y 2019
7 3 7 Fiat Y 2006
8 3 8 Mitsubishi Y 2020
9 4 1 Clothed N NA
10 5 7 Clothed Y NA
df2 <- df2 %>% mutate(Status = if_else(Status == "NULL", "Y", Status))
df3 <- df2 %>% filter(!is.na(Year)) %>% group_by(ID) %>% mutate(index = row_number())
df4 <- df3 %>% pivot_wider(id_cols = c(ID), values_from = c(Type, Year), names_from = index )
So your desired output will be produced:
df1 %>% left_join(df2 %>% select(ID, Status) %>% distinct()) %>% left_join(df4)
# A tibble: 5 x 10
ID Age Sex Status Type_1 Type_2 Type_3 Year_1 Year_2 Year_3
<dbl> <dbl> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl>
1 1 29 M Y Toyota Kia NA 2011 2009 NA
2 2 32 F N NA NA NA NA NA NA
3 3 18 F Y Honda Fiat Mitsubishi 2019 2006 2020
4 4 89 M N NA NA NA NA NA NA
5 5 45 M Y NA NA NA NA NA NA

Related

Userfunction with optional grouping argument and if else using piping in R

I recently started to write my own functions to speed up standard and repetitive task while analyzing data with R.
At the moment I'm working on a function with three arguments and ran into a challenge I could not solve yet. I would like to have an optional grouping argument. During the process the function should check if there is a grouping argument and then continue using either subfunction 1 or 2.
But I always get the error "Object not found" if the grouping argument is not NA. How can I do this?
Edit: In my case the filter usually is used to filter certain valid or invalid years. If there is a grouping argument there will follow more steps in the pipe than if there is none.
require(tidyverse)
Data <- mpg
userfunction <- function(DF,Filter,Group) {
without_group <- function(DF) {
DF %>%
count(year)
}
with_group <- function(DF) {
DF %>%
group_by({{Group}}) %>%
count(year) %>%
pivot_wider(names_from=year, values_from=n) %>%
ungroup() %>%
mutate(across(.cols=2:ncol(.),.fns=~replace_na(.x, 0))) %>%
mutate(Mittelwert=round(rowMeans(.[,2:ncol(.)],na.rm=TRUE),2))
}
Obj <- DF %>%
ungroup() %>%
{if(Filter!=FALSE) filter(.,eval(rlang::parse_expr(Filter))) else filter(.,.$year==.$year)} %>%
{if(is.na(Group)) without_group(.) else with_group(.)}
return(Obj)
}
For NA it already works:
> Data %>%
+ userfunction(FALSE,NA)
# A tibble: 2 x 2
year n
<int> <int>
1 1999 117
2 2008 117
With argument it does not work:
> Data %>%
+ userfunction(FALSE,manufacturer)
Error in DF %>% ungroup() %>% { : object 'manufacturer' not found
Edit:
What I would expect from the above function would be the following output:
> Data %>% userfunction_exp(FALSE,manufacturer)
# A tibble: 15 x 4
manufacturer `1999` `2008` Mittelwert
<chr> <dbl> <dbl> <dbl>
1 audi 9 9 9
2 chevrolet 7 12 9.5
3 dodge 16 21 18.5
4 ford 15 10 12.5
5 honda 5 4 4.5
6 hyundai 6 8 7
7 jeep 2 6 4
8 land rover 2 2 2
9 lincoln 2 1 1.5
10 mercury 2 2 2
11 nissan 6 7 6.5
12 pontiac 3 2 2.5
13 subaru 6 8 7
14 toyota 20 14 17
15 volkswagen 16 11 13.5
Data %>% userfunction_exp("cyl==4",manufacturer)
# A tibble: 9 x 4
manufacturer `1999` `2008` mean
<chr> <dbl> <dbl> <dbl>
1 audi 4 4 4
2 chevrolet 1 1 1
3 dodge 1 0 0.5
4 honda 5 4 4.5
5 hyundai 4 4 4
6 nissan 2 2 2
7 subaru 6 8 7
8 toyota 11 7 9
9 volkswagen 11 6 8.5
2021-04-01 14:55: edited to add some information and add some steps to the pipe for function with_group.
Hi this is a good question!
There are multiple ways to achieve this as the previous answers pointed out. One way to do it in the tidyverse is tidy evaluation
Omitting your filter function (which you could explain in more detail...)
my_summary <- function(df, grouping_var) {
grp_var <- enquo(grouping_var) #capture group variable
df %>% my_group_by(grp_var)
}
my_group_by <- function(df, grouping_var){
# Check if group is supplied
if(rlang::quo_is_missing(grouping_var)) {
df %>% without_group()
} else {
df %>% with_group(grouping_var)
}
}
without_group <- function(df) {
# do whatever without group
df %>%
count(year)
}
with_group <- function(df, grouping_var) {
# do whatever with group
df %>%
group_by(!!grouping_var) %>% #Note the !!
count(year) %>%
pivot_wider(names_from=year, values_from=n)
}
Which will give you without any argument
> mpg %>% my_summary()
# A tibble: 2 x 2
year n
<int> <int>
1 1999 117
2 2008 117
With group passed to pipe
> mpg %>% my_summary(model)
# A tibble: 38 x 3
# Groups: model [38]
model `1999` `2008`
<chr> <int> <int>
1 4runner 4wd 4 2
2 a4 4 3
3 a4 quattro 4 4
4 a6 quattro 1 2
5 altima 2 4
6 c1500 suburban 2wd 1 4
7 camry 4 3
8 camry solara 4 3
9 caravan 2wd 6 5
10 civic 5 4
# ... with 28 more rows
I don't know what is the use of Filter argument so I'll keep it as it is for now.
group_by(A) %>% count(B) is same as count(A, B) so you can change your function to :
library(tidyverse)
userfunction <- function(DF,Filter,Group = NULL) {
DF %>%
count(year, {{Group}}) %>%
pivot_wider(names_from=year, values_from=n)
}
Data %>% userfunction(FALSE)
# `1999` `2008`
# <int> <int>
#1 117 117
Data %>% userfunction(FALSE,manufacturer)
# A tibble: 15 x 3
# manufacturer `1999` `2008`
# <chr> <int> <int>
# 1 audi 9 9
# 2 chevrolet 7 12
# 3 dodge 16 21
# 4 ford 15 10
# 5 honda 5 4
# 6 hyundai 6 8
# 7 jeep 2 6
# 8 land rover 2 2
# 9 lincoln 2 1
#10 mercury 2 2
#11 nissan 6 7
#12 pontiac 3 2
#13 subaru 6 8
#14 toyota 20 14
#15 volkswagen 16 11
Note that I have assigned the default value to Group as NULL so when you don't mention anything it ignores that argument.

R How to lag a dataframe by groups

I have the following data set:
Name Year VarA VarB Data.1 Data.2
A 2016 L H 100 101
A 2017 L H 105 99
A 2018 L H 103 105
A 2016 L A 90 95
A 2017 L A 99 92
A 2018 L A 102 101
I want to add a lagged variable by the grouping: Name, VarA, VarB so that my data would look like:
Name Year VarA VarB Data.1 Data.2 Lg1.Data.1 Lg2.Data.1
A 2016 L H 100 101 NA NA
A 2017 L H 105 99 100 NA
A 2018 L H 103 105 105 100
A 2016 L A 90 95 NA NA
A 2017 L A 99 92 90 NA
A 2018 L A 102 101 99 90
I found the following link, which is helpful: debugging: function to create multiple lags for multiple columns (dplyr)
And am using the following code:
df <- df %>%
group_by(Name) %>%
arrange(Name, VarA, VarB, Year) %>%
do(data.frame(., setNames(shift(.[,c(5:6)], 1:2), c(seq(1:8)))))
However, the lag offsetting all data associated w/ name, instead of the grouping I want, so only the 2018 years are accurately lagged.
Name Year VarA VarB Data.1 Data.2 Lg1.Data.1 Lg2.Data.1
A 2016 L H 100 101 NA NA
A 2017 L H 105 99 100 NA
A 2018 L H 103 105 105 100
A 2016 L A 90 95 103 105
A 2017 L A 99 92 90 103
A 2018 L A 102 101 99 90
How do I get the lag to reset for each new grouping combination (e.g. Name / VarA / VarB)?
dplyr::lag lets you set the distance you want to lag by. You can group by whatever variables you want—in this case, Name, VarA, and VarB—before making your lagged variables.
library(dplyr)
df %>%
group_by(Name, VarA, VarB) %>%
mutate(Lg1.Data.1 = lag(Data.1, n = 1), Lg2.Data.1 = lag(Data.1, n = 2))
#> # A tibble: 6 x 8
#> # Groups: Name, VarA, VarB [2]
#> Name Year VarA VarB Data.1 Data.2 Lg1.Data.1 Lg2.Data.1
#> <chr> <dbl> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 A 2016 L H 100 101 NA NA
#> 2 A 2017 L H 105 99 100 NA
#> 3 A 2018 L H 103 105 105 100
#> 4 A 2016 L A 90 95 NA NA
#> 5 A 2017 L A 99 92 90 NA
#> 6 A 2018 L A 102 101 99 90
If you want a version that scales to more lags, you can use some non-standard evaluation to create new lagged columns dynamically. I'll do this with purrr::map to iterate of a set of n to lag by, make a list of data frames with the new columns added, then join all the data frames together. There are probably better NSE ways to do this, so hopefully someone can improve upon it.
I'm making up some new data, just to have a wider range of years to illustrate. Inside mutate, you can create column names with quo_name.
library(dplyr)
library(purrr)
set.seed(127)
df <- tibble(
Name = "A", Year = rep(2016:2020, 2), VarA = "L", VarB = rep(c("H", "A"), each = 5),
Data.1 = sample(1:10, 10, replace = T), Data.2 = sample(1:10, 10, replace = T)
)
df_list <- purrr::map(1:4, function(i) {
df %>%
group_by(Name, VarA, VarB) %>%
mutate(!!quo_name(paste0("Lag", i)) := dplyr::lag(Data.1, n = i))
})
You don't need to save this list—I'm just doing it to show an example of one of the data frames. You could instead go straight into reduce.
df_list[[3]]
#> # A tibble: 10 x 7
#> # Groups: Name, VarA, VarB [2]
#> Name Year VarA VarB Data.1 Data.2 Lag3
#> <chr> <int> <chr> <chr> <int> <int> <int>
#> 1 A 2016 L H 3 9 NA
#> 2 A 2017 L H 1 4 NA
#> 3 A 2018 L H 3 8 NA
#> 4 A 2019 L H 2 2 3
#> 5 A 2020 L H 4 5 1
#> 6 A 2016 L A 8 4 NA
#> 7 A 2017 L A 6 8 NA
#> 8 A 2018 L A 3 2 NA
#> 9 A 2019 L A 8 6 8
#> 10 A 2020 L A 9 1 6
Then use purrr::reduce to join all the data frames in the list. Since there are columns that are the same in each of the data frames, and those are the ones you want to join by, you can get away with not specifying join-by columns in inner_join.
reduce(df_list, inner_join)
#> Joining, by = c("Name", "Year", "VarA", "VarB", "Data.1", "Data.2")
#> Joining, by = c("Name", "Year", "VarA", "VarB", "Data.1", "Data.2")
#> Joining, by = c("Name", "Year", "VarA", "VarB", "Data.1", "Data.2")
#> # A tibble: 10 x 10
#> # Groups: Name, VarA, VarB [?]
#> Name Year VarA VarB Data.1 Data.2 Lag1 Lag2 Lag3 Lag4
#> <chr> <int> <chr> <chr> <int> <int> <int> <int> <int> <int>
#> 1 A 2016 L H 3 9 NA NA NA NA
#> 2 A 2017 L H 1 4 3 NA NA NA
#> 3 A 2018 L H 3 8 1 3 NA NA
#> 4 A 2019 L H 2 2 3 1 3 NA
#> 5 A 2020 L H 4 5 2 3 1 3
#> 6 A 2016 L A 8 4 NA NA NA NA
#> 7 A 2017 L A 6 8 8 NA NA NA
#> 8 A 2018 L A 3 2 6 8 NA NA
#> 9 A 2019 L A 8 6 3 6 8 NA
#> 10 A 2020 L A 9 1 8 3 6 8
Created on 2018-12-07 by the reprex package (v0.2.1)

Search in row for a certain value and report the date

Ciao,
Here is my replicating example.
a=c(1,2,3,4,5,6,7,8)
b=c(1,1,0,0,0,"NA",0,"NA")
c=c(11,7,9,9,5,"NA",7,"NA")
d=c(2012,2011,2012,2014,2014,"NA",2011,"NA")
e=c(1,0,1,0,0,1,"NA","NA")
f=c(10,4,11,10,10,6,"NA","NA")
g=c(2014,2012,2010,2012,2013,2011,"NA","NA")
h=c(1,0,1,0,1,0,1,"NA")
i=c(2,12,12,6,8,11,3,"NA")
j=c(2011,2012,2011,2012,2012,2014,2012,"NA")
k=c(1,1,1,0,1,1,1,"NA")
l=c(11/1/2012,"7/1/2012","11/1/2010",0 ,"8/1/2012","6/1/2012","3/1/2012","NA")
mydata = data.frame(a,b,c,d,e,f,g,h,i,j,k,l)
names(mydata) = c("id","test1","month1","year1","test2","month2","year2","test3","month3","year3","anytest","date")
I am aiming to search through each row and find the first test column that is equal to 1. The new column I am aiming to create is "anytest." This column is 1 if test1 or test2 or test3 equals to 1. If none of them do then it equals to 0. This ignores NA values..if test1 and test2 are NA but test3 equals to 0 then anytest equals to 0. Now I have made progress I think using this code:
anytestTRY = if(rowSums(mydata[,c(test1,test2,test3)] == 1, na.rm=TRUE) > 0],1,0)
But now I am at a crossroads because I am aiming to search through each row to find the first column of test1 test2 or test3 that equals to 1 and then report the month and year for that test. So if test1 equals to 0 and test2 equals to NA and test3 equals to 1 I want the column which I created called date to have the month3 and year3 in analyzable time format. Thanks a million.
a=c(1,2,3,4,5,6,7,8)
b=c(1,1,0,0,0,"NA",0,"NA")
c=c(11,7,9,9,5,"NA",7,"NA")
d=c(2012,2011,2012,2014,2014,"NA",2011,"NA")
e=c(1,0,1,0,0,1,"NA","NA")
f=c(10,4,11,10,10,6,"NA","NA")
g=c(2014,2012,2010,2012,2013,2011,"NA","NA")
h=c(1,0,1,0,1,0,1,"NA")
i=c(2,12,12,6,8,11,3,"NA")
j=c(2011,2012,2011,2012,2012,2014,2012,"NA")
mydata = data.frame(a,b,c,d,e,f,g,h,i,j)
names(mydata) = c("id","test1","month1","year1","test2","month2","year2","test3","month3","year3")
library(tidyverse)
library(lubridate)
mydata %>%
mutate_all(~as.numeric(as.character(.))) %>% # update columns to numeric
group_by(id) %>% # for each id
nest() %>% # nest data
mutate(date = map(data, ~case_when(.$test1==1 ~ ymd(paste0(.$year1,"-",.$month1,"-",1)), # get date based on first test that is 1
.$test2==1 ~ ymd(paste0(.$year2,"-",.$month2,"-",1)),
.$test3==1 ~ ymd(paste0(.$year3,"-",.$month3,"-",1)))),
anytest = map(data, ~as.numeric(case_when(sum(c(.$test1, .$test2, .$test3)==1) > 0 ~ "1", # create anytest column
sum(is.na(c(.$test1, .$test2, .$test3))) == 3 ~ "NA",
TRUE ~ "0")))) %>%
unnest() # unnestdata
which returns:
# # A tibble: 8 x 12
# id date anytest test1 month1 year1 test2 month2 year2 test3 month3 year3
# <dbl> <date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 1 2012-11-01 1 1 11 2012 1 10 2014 1 2 2011
# 2 2 2011-07-01 1 1 7 2011 0 4 2012 0 12 2012
# 3 3 2010-11-01 1 0 9 2012 1 11 2010 1 12 2011
# 4 4 NA 0 0 9 2014 0 10 2012 0 6 2012
# 5 5 2012-08-01 1 0 5 2014 0 10 2013 1 8 2012
# 6 6 2011-06-01 0 NA NA NA 1 6 2011 0 11 2014
# 7 7 2012-03-01 0 0 7 2011 NA NA NA 1 3 2012
# 8 8 NA NA NA NA NA NA NA NA NA NA NA

Calculating proportions and ignore NAs

I have a dataset similar to the following and my end goal is to make a table showing variables like mean salary per gender and the females' mean salary as a proportion of men's.
library(dplyr)
x <- data.frame(Department = c("Dep1", "Dep1","Dep2", "Dep2","Dep3"),
Gender = c("F", "M", "F", "M", "F"),
Salary = seq(10,14))
Department Gender Salary
1 Dep1 F 10
2 Dep1 M 11
3 Dep2 F 12
4 Dep2 M 13
5 Dep3 F 14
Step 1: First I calculate the needed summary statistics using summarise.
Table <- x %>% group_by(Department, Gender) %>% summarise(Count = n(),
AverageSalary = mean(Salary, na.rm = T),
MedianSalary = median(Salary, na.rm = T))
Step 2: To calculate the proportion and add the new columns to "Table" I use a tip I got from this forum a few days ago.
Table %>% group_by(Department) %>%
mutate(`AvgSalaryWomen/Men` = AverageSalary[Gender == "F"]/AverageSalary[Gender == "M"],
`MedianSalaryWomen/Men` = MedianSalary[Gender == "F"]/MedianSalary[Gender == "M"])
My challenge is that Dep3 doesn't have any males and so I get the following error message:
Error in mutate_impl(.data, dots) :
Column `AvgSalaryWomen/Men` must be length 1 (the group size), not 0
What I was hoping for was something like this
Department Gender Count AverageSalary MedianSalary AvgSalaryWomen.Men MedianSalaryWomen.Men
1 Dep1 F 1 10 10 0.9090909 0.9090909
2 Dep1 M 1 11 11 0.9090909 0.9090909
3 Dep2 F 1 12 12 0.9230769 0.9230769
4 Dep2 M 1 13 13 0.9230769 0.9230769
5 Dep3 F 1 14 14 NA NA
or this
Department Gender Count AverageSalary MedianSalary AvgSalaryWomen.Men MedianSalaryWomen.Men
1 Dep1 F 1 10 10 0.9090909 0.9090909
2 Dep1 M 1 11 11 NA NA
3 Dep2 F 1 12 12 0.9230769 0.9230769
4 Dep2 M 1 13 13 NA NA
5 Dep3 F 1 14 14 NA NA
Is there an easy way to obtain either of these two results? I'm guessing that alternative 1 would be the easiest.
Thanks in advance!
Using ifelse, you can check if both genders exist in a department before computing the ratios (and if not, returning NA). Something like this:
Table %>% group_by(Department) %>%
mutate(`AvgSalaryWomen/Men` = ifelse(length(unique(Gender)) == 2,
AverageSalary[Gender == "F"]/AverageSalary[Gender == "M"], NA),
`MedianSalaryWomen/Men` = ifelse(length(unique(Gender)) == 2,
MedianSalary[Gender == "F"]/MedianSalary[Gender == "M"], NA))
# A tibble: 5 x 7
# Groups: Department [3]
Department Gender Count AverageSalary MedianSalary `AvgSalaryWomen/Men` `MedianSalaryWomen/Men`
<fct> <fct> <int> <dbl> <int> <dbl> <dbl>
1 Dep1 F 1 10.0 10 0.909 0.909
2 Dep1 M 1 11.0 11 0.909 0.909
3 Dep2 F 1 12.0 12 0.923 0.923
4 Dep2 M 1 13.0 13 0.923 0.923
5 Dep3 F 1 14.0 14 NA NA

how to replace missing values with previous year's binned mean

I have a data frame as below
p1_bin and f1_bin are calculated by cut function by me with
Bins <- function(x) cut(x, breaks = c(0, seq(1, 1000, by = 5)), labels = 1:200)
binned <- as.data.frame (sapply(df[,-1], Bins))
colnames(binned) <- paste("Bin", colnames(binned), sep = "_")
df<- cbind(df, binned)
Now how to calculate mean/avg for previous two years and replace in NA values with in that bin
for example : at row-5 value is NA for p1 and f1 is 30 with corresponding bin 7.. now replace NA with previous 2 years mean for same bin (7) ,i.e
df
ID year p1 f1 Bin_p1 Bin_f1
1 2013 20 30 5 7
2 2013 24 29 5 7
3 2014 10 16 2 3
4 2014 11 17 2 3
5 2015 NA 30 NA 7
6 2016 10 NA 2 NA
df1
ID year p1 f1 Bin_p1 Bin_f1
1 2013 20 30 5 7
2 2013 24 29 5 7
3 2014 10 16 2 3
4 2014 11 17 2 3
5 2015 **22** 30 NA 7
6 2016 10 **16.5** 2 NA
Thanks in advance
I believe the following code produces the desired output. There's probably a much more elegant way than using mean(rev(lag(f1))[1:2]) to get the average of the last two values of f1 but this should do the trick anyway.
library(dplyr)
df %>%
arrange(year) %>%
mutate_at(c("p1", "f1"), "as.double") %>%
group_by(Bin_p1) %>%
mutate(f1 = ifelse(is.na(f1), mean(rev(lag(f1))[1:2]), f1)) %>%
group_by(Bin_f1) %>%
mutate(p1 = ifelse(is.na(p1), mean(rev(lag(p1))[1:2]), p1)) %>%
ungroup
and the output is:
# A tibble: 6 x 6
ID year p1 f1 Bin_p1 Bin_f1
<int> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 2013 20 30.0 5 7
2 2 2013 24 29.0 5 7
3 3 2014 10 16.0 2 3
4 4 2014 11 17.0 2 3
5 5 2015 22 30.0 NA 7
6 6 2016 10 16.5 2 NA

Resources