I'd like to join two data frames in R.
Here is the first one:
resno resid elety eleno
1 ILE C 3
1 ILE O 4
2 VAL C 11
2 VAL O 12
3 GLY C 18
3 GLY O 19
And the second one:
C.O dist
12-18 3.112819
27-37 2.982788
51-63 3.185184
52-62 2.771583
63-69 3.157737
70-80 2.956738
So let me explain what I need. The second data frame gives the distance ("dist") between two points, e.g. 12-18, and those numbers correspond to "eleno" in the first data frame. For these two points I also have "resno", which is what I'm interested in, because I'd like to obtain something like this:
resno resid elety eleno rescoup dist
1 ILE C 3 - -
1 ILE O 4 - -
2 VAL C 11 - -
2 VAL O 12 2-3 3.112819
3 GLY C 18 2-3 3.112819
3 GLY O 19 - -
How can I do this? Is it possible with R?
thanks!
You could first create a long dataframe from df2 where every number in the range given by C.O (from the first value to the second) gets its own row.
library(dplyr)
library(tidyr)
df3 <- df2 %>%
separate(C.O, c('col1', 'col2'), sep = '-', convert = TRUE) %>%
mutate(eleno = purrr::map2(col1, col2, seq), .before = 1,
row = row_number()) %>%
select(-col1, -col2) %>%
unnest(eleno)
df3
# A tibble: 60 x 3
# eleno row dist
# <int> <int> <dbl>
# 1 12 1 3.11
# 2 13 1 3.11
# 3 14 1 3.11
# 4 15 1 3.11
# 5 16 1 3.11
# 6 17 1 3.11
# 7 18 1 3.11
# 8 27 2 2.98
# 9 28 2 2.98
#10 29 2 2.98
# … with 50 more rows
Join this dataframe with df1 and paste resno values to create rescoup.
df1 %>%
left_join(df3, by = 'eleno') %>%
group_by(row) %>%
mutate(rescoup = paste(resno, collapse = '-'),
rescoup = replace(rescoup, is.na(dist), NA)) %>%
ungroup() %>%
select(-row)
# resno resid elety eleno dist rescoup
# <int> <chr> <chr> <int> <dbl> <chr>
#1 1 ILE C 3 NA NA
#2 1 ILE O 4 NA NA
#3 2 VAL C 11 NA NA
#4 2 VAL O 12 3.11 2-3
#5 3 GLY C 18 3.11 2-3
#6 3 GLY O 19 NA NA
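Note that expanding each C.O entry into a full range also creates rows for the numbers in between, so an eleno of 13 in df1 would pick up the 12-18 distance as well. Here is a minimal sketch of a variant that keeps only the two endpoints. It assumes df1 and df2 look like the question's data; the names pairs, pair and result are mine, not from the answer above.

```r
library(dplyr)
library(tidyr)

# Example data copied from the question
df1 <- data.frame(
  resno = c(1L, 1L, 2L, 2L, 3L, 3L),
  resid = c("ILE", "ILE", "VAL", "VAL", "GLY", "GLY"),
  elety = c("C", "O", "C", "O", "C", "O"),
  eleno = c(3L, 4L, 11L, 12L, 18L, 19L)
)
df2 <- data.frame(
  C.O  = c("12-18", "27-37", "51-63", "52-62", "63-69", "70-80"),
  dist = c(3.112819, 2.982788, 3.185184, 2.771583, 3.157737, 2.956738)
)

# One row per endpoint; `pair` remembers which C.O entry it came from
pairs <- df2 %>%
  mutate(pair = row_number()) %>%
  separate_rows(C.O, sep = "-", convert = TRUE) %>%
  rename(eleno = C.O)

# Join on eleno, then paste the resno values of each matched pair together
result <- df1 %>%
  left_join(pairs, by = "eleno") %>%
  group_by(pair) %>%
  mutate(rescoup = if_else(is.na(pair), NA_character_,
                           paste(resno, collapse = "-"))) %>%
  ungroup() %>%
  select(-pair)

result
```

If only one endpoint of a pair is present in df1, rescoup will hold a single resno; this sketch does not treat that case specially.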
Using data.table, split then merge:
library(data.table)
merge(d1,
d2[, lapply(.SD, function(x) unlist(tstrsplit(x, "-", fixed = TRUE,
type.convert = TRUE)))],
by.x = "eleno", by.y = "C.O", all.x = TRUE)
# eleno resno resid elety dist
# 1: 3 1 ILE C NA
# 2: 4 1 ILE O NA
# 3: 11 2 VAL C NA
# 4: 12 2 VAL O 3.112819
# 5: 18 3 GLY C 3.112819
# 6: 19 3 GLY O NA
Example data:
d1 <- fread("resno resid elety eleno
1 ILE C 3
1 ILE O 4
2 VAL C 11
2 VAL O 12
3 GLY C 18
3 GLY O 19")
d2 <- fread("C.O dist
12-18 3.112819
27-37 2.982788
51-63 3.185184
52-62 2.771583
63-69 3.157737
70-80 2.956738")
I am trying to find a simple way to pivot_longer a dataframe that has multiple columns containing different data for each case. Using multiple names in names_to doesn't seem to solve the problem.
Here is a worked example:
#create the dataframe:
library('dplyr')
library('tidyr') #needed for pivot_longer
set.seed(11)
x <- data.frame(case = c(1:10),
X1990 = runif(10, 0, 1),
flag.1990 = rep(c('a','b'), 5),
X2000 = runif(10, 0, 1),
flag.2000 = rep(c('c', 'd'), 5))
> x
case X1990 flag.1990 X2000 flag.2000
1 1 0.2772497942 a 0.1751129 c
2 2 0.0005183129 b 0.4407503 d
3 3 0.5106083730 a 0.9071830 c
4 4 0.0140479084 b 0.8510419 d
5 5 0.0646897766 a 0.7339875 c
6 6 0.9548492255 b 0.5736857 d
7 7 0.0864958912 a 0.4817655 c
8 8 0.2899750092 b 0.3306110 d
9 9 0.8806991728 a 0.1576602 c
10 10 0.1232162013 b 0.4801341 d
Obviously I cannot just pivot_longer using cols = -case, as that will combine year and flag data. If I try using a character vector in names_to (from here: https://dcl-wrangle.stanford.edu/pivot-advanced.html, section 6.1.3):
x %>%
setNames(c('case','value.1990', 'flag.1990', 'value.2000', 'flag.2000')) %>%
pivot_longer(cols = -case,
names_to = c('value', 'flag'),
names_sep = '.',
values_to = 'value')
Things don't work, because the flag data isn't in the variable name.
The only way I can think to solve this is to break the dataframe into two data frames, pivot them and then join them. For example:
#create temporary data frame for year data, then pivot
temp1 <- x %>%
select(1,2, 4) %>% #select year data
pivot_longer(cols = c(X1990, X2000), #pivot longer on year data
names_to = 'year',
values_to = 'value') %>%
mutate(year = gsub('X', '', year)) #remove 'X' so that I can use this to join
#create temporary data frame for flag data, then pivot
temp2 <- x %>%
select(1, 3, 5) %>% #select flag variables
pivot_longer(cols = c(flag.1990, flag.2000), #pivot longer on flag data
names_to = 'flag.year',
values_to = 'flag') %>%
mutate(year = gsub('flag.', '', flag.year)) %>% #get year data so that I can join on this
select(-flag.year) #drop flag.year as its no longer useful information
final <- full_join(temp1, temp2, by = c('case', 'year')) #full join the two datasets to get the final data
> final
# A tibble: 20 x 4
case flag year value
<int> <chr> <chr> <dbl>
1 1 a 1990 0.277
2 1 c 2000 0.175
3 2 b 1990 0.000518
4 2 d 2000 0.441
5 3 a 1990 0.511
6 3 c 2000 0.907
7 4 b 1990 0.0140
8 4 d 2000 0.851
9 5 a 1990 0.0647
10 5 c 2000 0.734
11 6 b 1990 0.955
12 6 d 2000 0.574
13 7 a 1990 0.0865
14 7 c 2000 0.482
15 8 b 1990 0.290
16 8 d 2000 0.331
17 9 a 1990 0.881
18 9 c 2000 0.158
19 10 b 1990 0.123
20 10 d 2000 0.480
I assume there is a quicker way to do this. Am I just misreading the documentation on using multiple names in names_to? Any ideas?
In this case one has to use names_to combined with names_pattern:
library(dplyr)
library(tidyr)
> head(x,3)
case X1990 flag.1990 X2000 flag.2000
1 1 0.2772497942 a 0.1751129 c
2 2 0.0005183129 b 0.4407503 d
3 3 0.5106083730 a 0.9071830 c
> x %>%
pivot_longer(cols = -case,
names_to = c(".value", "year"),
names_pattern = "([^\\.]*)\\.*(\\d{4})")
# A tibble: 20 x 4
case year X flag
<int> <chr> <dbl> <chr>
1 1 1990 0.277 a
2 1 2000 0.175 c
3 2 1990 0.000518 b
4 2 2000 0.441 d
5 3 1990 0.511 a
6 3 2000 0.907 c
7 4 1990 0.0140 b
8 4 2000 0.851 d
9 5 1990 0.0647 a
10 5 2000 0.734 c
11 6 1990 0.955 b
12 6 2000 0.574 d
13 7 1990 0.0865 a
14 7 2000 0.482 c
15 8 1990 0.290 b
16 8 2000 0.331 d
17 9 1990 0.881 a
18 9 2000 0.158 c
19 10 1990 0.123 b
20 10 2000 0.480 d
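The special ".value" entry in names_to tells pivot_longer that this part of each column name is the name of an output column rather than data, which is why X and flag end up as separate columns. Here is a hedged variation on the same idea, assuming you would rather rename X1990/X2000 first so that the numeric column gets a meaningful name and a plain separator is enough. The column name value and the object long are my choices, not part of the answer above.

```r
library(dplyr)
library(tidyr)

# Same data as in the question
set.seed(11)
x <- data.frame(case = 1:10,
                X1990 = runif(10, 0, 1),
                flag.1990 = rep(c("a", "b"), 5),
                X2000 = runif(10, 0, 1),
                flag.2000 = rep(c("c", "d"), 5))

# Rename X1990 -> value.1990 etc. so every column is "<name>.<year>"
names(x) <- sub("^X", "value.", names(x))

# The part before the dot becomes the output column name (.value),
# the part after the dot becomes the year. The dot must be escaped
# because names_sep is a regular expression.
long <- x %>%
  pivot_longer(cols = -case,
               names_to = c(".value", "year"),
               names_sep = "\\.")

head(long)
```

The unescaped names_sep = '.' in the question's attempt is treated as a regular expression matching any character, which is one reason that attempt does not split the names as intended.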
I have the following data set:
Name Year VarA VarB Data.1 Data.2
A 2016 L H 100 101
A 2017 L H 105 99
A 2018 L H 103 105
A 2016 L A 90 95
A 2017 L A 99 92
A 2018 L A 102 101
I want to add a lagged variable by the grouping: Name, VarA, VarB so that my data would look like:
Name Year VarA VarB Data.1 Data.2 Lg1.Data.1 Lg2.Data.1
A 2016 L H 100 101 NA NA
A 2017 L H 105 99 100 NA
A 2018 L H 103 105 105 100
A 2016 L A 90 95 NA NA
A 2017 L A 99 92 90 NA
A 2018 L A 102 101 99 90
I found the following link, which is helpful: debugging: function to create multiple lags for multiple columns (dplyr)
And am using the following code:
df <- df %>%
group_by(Name) %>%
arrange(Name, VarA, VarB, Year) %>%
do(data.frame(., setNames(shift(.[,c(5:6)], 1:2), c(seq(1:8)))))
However, the lag is offsetting all data associated with Name instead of the grouping I want, so only the 2018 rows are lagged correctly.
Name Year VarA VarB Data.1 Data.2 Lg1.Data.1 Lg2.Data.1
A 2016 L H 100 101 NA NA
A 2017 L H 105 99 100 NA
A 2018 L H 103 105 105 100
A 2016 L A 90 95 103 105
A 2017 L A 99 92 90 103
A 2018 L A 102 101 99 90
How do I get the lag to reset for each new grouping combination (e.g. Name / VarA / VarB)?
dplyr::lag lets you set the distance you want to lag by. You can group by whatever variables you want—in this case, Name, VarA, and VarB—before making your lagged variables.
library(dplyr)
df %>%
group_by(Name, VarA, VarB) %>%
mutate(Lg1.Data.1 = lag(Data.1, n = 1), Lg2.Data.1 = lag(Data.1, n = 2))
#> # A tibble: 6 x 8
#> # Groups: Name, VarA, VarB [2]
#> Name Year VarA VarB Data.1 Data.2 Lg1.Data.1 Lg2.Data.1
#> <chr> <dbl> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 A 2016 L H 100 101 NA NA
#> 2 A 2017 L H 105 99 100 NA
#> 3 A 2018 L H 103 105 105 100
#> 4 A 2016 L A 90 95 NA NA
#> 5 A 2017 L A 99 92 90 NA
#> 6 A 2018 L A 102 101 99 90
If you want a version that scales to more lags, you can use some non-standard evaluation to create new lagged columns dynamically. I'll do this with purrr::map to iterate over a set of n to lag by, make a list of data frames with the new columns added, then join all the data frames together. There are probably better NSE ways to do this, so hopefully someone can improve upon it.
I'm making up some new data, just to have a wider range of years to illustrate. Inside mutate, you can create column names with quo_name.
library(dplyr)
library(purrr)
set.seed(127)
df <- tibble(
Name = "A", Year = rep(2016:2020, 2), VarA = "L", VarB = rep(c("H", "A"), each = 5),
Data.1 = sample(1:10, 10, replace = T), Data.2 = sample(1:10, 10, replace = T)
)
df_list <- purrr::map(1:4, function(i) {
df %>%
group_by(Name, VarA, VarB) %>%
mutate(!!quo_name(paste0("Lag", i)) := dplyr::lag(Data.1, n = i))
})
You don't need to save this list—I'm just doing it to show an example of one of the data frames. You could instead go straight into reduce.
df_list[[3]]
#> # A tibble: 10 x 7
#> # Groups: Name, VarA, VarB [2]
#> Name Year VarA VarB Data.1 Data.2 Lag3
#> <chr> <int> <chr> <chr> <int> <int> <int>
#> 1 A 2016 L H 3 9 NA
#> 2 A 2017 L H 1 4 NA
#> 3 A 2018 L H 3 8 NA
#> 4 A 2019 L H 2 2 3
#> 5 A 2020 L H 4 5 1
#> 6 A 2016 L A 8 4 NA
#> 7 A 2017 L A 6 8 NA
#> 8 A 2018 L A 3 2 NA
#> 9 A 2019 L A 8 6 8
#> 10 A 2020 L A 9 1 6
Then use purrr::reduce to join all the data frames in the list. Since there are columns that are the same in each of the data frames, and those are the ones you want to join by, you can get away with not specifying join-by columns in inner_join.
reduce(df_list, inner_join)
#> Joining, by = c("Name", "Year", "VarA", "VarB", "Data.1", "Data.2")
#> Joining, by = c("Name", "Year", "VarA", "VarB", "Data.1", "Data.2")
#> Joining, by = c("Name", "Year", "VarA", "VarB", "Data.1", "Data.2")
#> # A tibble: 10 x 10
#> # Groups: Name, VarA, VarB [?]
#> Name Year VarA VarB Data.1 Data.2 Lag1 Lag2 Lag3 Lag4
#> <chr> <int> <chr> <chr> <int> <int> <int> <int> <int> <int>
#> 1 A 2016 L H 3 9 NA NA NA NA
#> 2 A 2017 L H 1 4 3 NA NA NA
#> 3 A 2018 L H 3 8 1 3 NA NA
#> 4 A 2019 L H 2 2 3 1 3 NA
#> 5 A 2020 L H 4 5 2 3 1 3
#> 6 A 2016 L A 8 4 NA NA NA NA
#> 7 A 2017 L A 6 8 8 NA NA NA
#> 8 A 2018 L A 3 2 6 8 NA NA
#> 9 A 2019 L A 8 6 3 6 8 NA
#> 10 A 2020 L A 9 1 8 3 6 8
Created on 2018-12-07 by the reprex package (v0.2.1)
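As one possible improvement on the map/reduce approach, here is a sketch that builds a named list of lag functions and applies them in a single grouped mutate. It assumes dplyr 1.0 or later (for across), and make_lag, lags and lag_funs are hypothetical helper names of mine. The data is the question's original six rows.

```r
library(dplyr)

df <- tibble(
  Name = "A", Year = rep(2016:2018, 2), VarA = "L",
  VarB = rep(c("H", "A"), each = 3),
  Data.1 = c(100, 105, 103, 90, 99, 102),
  Data.2 = c(101, 99, 105, 95, 92, 101)
)

# Factory that returns a function lagging by n; force() pins n's value
make_lag <- function(n) {
  force(n)
  function(x) dplyr::lag(x, n = n)
}

lags <- 1:2
lag_funs <- setNames(lapply(lags, make_lag), paste0("Lg", lags))

# Each function in lag_funs is applied to Data.1 within each group;
# .names builds column names such as Lg1.Data.1
result <- df %>%
  group_by(Name, VarA, VarB) %>%
  mutate(across(Data.1, lag_funs, .names = "{.fn}.{.col}")) %>%
  ungroup()

result
```

Changing lags to 1:4, or selecting c(Data.1, Data.2) in across, should extend this to more lags or more columns without any joins.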
I have date values in wide form and I'm trying to calculate the ratio of each date's value to the baseline, but only for dates between the Start Date and End Date.
For example:
ID Start Date End Date Baseline 1/18 2/18 3/18 4/18 5/18 6/18 7/18 8/18
A 1/1/2018 5/1/2018 5 2 4 1 3 5 2 4 5
B 6/1/2018 8/1/2018 2 4 2 4 3 6 6 2 1
C 2/1/2018 3/1/2018 8 3 5 5 3 2 7 8 2
D 5/1/2015 7/1/2018 9 1 3 5 7 4 8 9 1
I would like the output to be:
ID Start Date End Date Baseline 1/18 2/18  3/18  4/18 5/18 6/18 7/18 8/18
A  1/1/2018   5/1/2018 5        0.4  0.8   0.2   0.6  1
B  6/1/2018   8/1/2018 2                                   3    1    0.5
C  2/1/2018   3/1/2018 8             0.625 0.625
D  5/1/2015   7/1/2018 9                              0.44 0.88 1
Thank you!
A very inelegant solution with dplyr and tidyr, which someone can probably build on:
library(dplyr)
library(tidyr)
library(stringr) ## needed for str_extract
sample <- sample %>% mutate_at(vars(5:12), funs(round(./Baseline, digits = 3))) ## perform the initial simple proportion calculation
sample <- sample %>% gather(5:12, key = "day", value = "value") %>%
rowwise() %>% ## allow for rowwise operations
mutate(value_temp = case_when(any(grepl(as.numeric(str_extract(day, "^[:digit:]{1,2}(?=/)")),
as.numeric(str_extract(StartDate, "^[:digit:]{1,2}(?=/)")):as.numeric(str_extract(EndDate, "^[:digit:]{1,2}(?=/)")))) == T ~ T,
TRUE ~ NA)) ## create a logical vector which indicates TRUE if the "day" is included in the range of days of StartDate and EndDate
sample$value[is.na(sample$value_temp)] <- NA ## sets values which aren't included in the vector of days to NA
sample$value_temp <- NULL ## remove the temp variable
sample <- sample %>% spread(day, value) ## spread to original df
> sample
# A tibble: 4 x 12
ID StartDate EndDate Baseline `1/18` `2/18` `3/18` `4/18` `5/18` `6/18` `7/18` `8/18`
<chr> <chr> <chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 A 1/1/2018 5/1/2018 5 0.4 0.8 0.2 0.6 1 NA NA NA
2 B 6/1/2018 8/1/2018 2 NA NA NA NA NA 3 1 0.5
3 C 2/1/2018 3/1/2018 8 NA 0.625 0.625 NA NA NA NA NA
4 D 5/1/2015 7/1/2018 9 NA NA NA NA 0.444 0.889 1 NA
Update:
sample <- sample %>% mutate_at(vars(5:12), funs(round(./Baseline, digits = 3)))
sample <- sample %>% gather(5:12, key = "day", value = "value") %>%
rowwise() %>%
mutate(value_temp = case_when(any(grepl(as.numeric(str_extract(day, "^[:digit:]{1,2}(?=/)")),
as.numeric(str_extract(Start_Date, "^[:digit:]{1,2}(?=/)")):as.numeric(str_extract(End_Date, "^[:digit:]{1,2}(?=/)")))) == T &
any(grepl(as.numeric(str_extract(day, "[:digit:]{2}$")),
as.numeric(str_extract(Start_Date, "[:digit:]{2}$")):as.numeric(str_extract(End_Date, "[:digit:]{2}$")))) ~ T,
TRUE ~ NA))
sample$value[is.na(sample$value_temp)] <- NA
sample$value_temp <- NULL
sample$day <- forcats::as_factor(sample$day) ## as_factor comes from forcats; keeps the days in their original order
sample <- sample %>% spread(day, value)
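Another way to approach this, sketched under a few assumptions: parse the start, end and column dates as real dates and compare them directly instead of matching digits. I assume the dates are month/day/year, that a column such as 1/18 means the first day of January 2018, and that the columns are named StartDate and EndDate. I also call the data dat so it does not mask base::sample.

```r
library(dplyr)
library(tidyr)

# Data from the question
dat <- tibble(
  ID = c("A", "B", "C", "D"),
  StartDate = c("1/1/2018", "6/1/2018", "2/1/2018", "5/1/2015"),
  EndDate   = c("5/1/2018", "8/1/2018", "3/1/2018", "7/1/2018"),
  Baseline  = c(5, 2, 8, 9),
  `1/18` = c(2, 4, 3, 1), `2/18` = c(4, 2, 5, 3),
  `3/18` = c(1, 4, 5, 5), `4/18` = c(3, 3, 3, 7),
  `5/18` = c(5, 6, 2, 4), `6/18` = c(2, 6, 7, 8),
  `7/18` = c(4, 2, 8, 9), `8/18` = c(5, 1, 2, 1)
)

ratios <- dat %>%
  # Month columns are the ones named like "1/18"
  pivot_longer(cols = matches("^\\d+/\\d+$"),
               names_to = "month", values_to = "value") %>%
  mutate(
    month_date = as.Date(paste0("1/", month), format = "%d/%m/%y"),
    start = as.Date(StartDate, format = "%m/%d/%Y"),
    end   = as.Date(EndDate, format = "%m/%d/%Y"),
    # Keep the ratio only inside the [start, end] window
    ratio = if_else(month_date >= start & month_date <= end,
                    value / Baseline, NA_real_)
  ) %>%
  select(ID, StartDate, EndDate, Baseline, month, ratio) %>%
  pivot_wider(names_from = month, values_from = ratio)

ratios
```

Because this compares full dates, row D (whose start date is in 2015) keeps every month up to 7/18, which differs from the expected output in the question; that output seems to treat D's start as 5/1/2018.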
I have a dataset similar to the following and my end goal is to make a table showing variables like mean salary per gender and the females' mean salary as a proportion of men's.
library(dplyr)
x <- data.frame(Department = c("Dep1", "Dep1","Dep2", "Dep2","Dep3"),
Gender = c("F", "M", "F", "M", "F"),
Salary = seq(10,14))
Department Gender Salary
1 Dep1 F 10
2 Dep1 M 11
3 Dep2 F 12
4 Dep2 M 13
5 Dep3 F 14
Step 1: First I calculate the needed summary statistics using summarise.
Table <- x %>% group_by(Department, Gender) %>% summarise(Count = n(),
AverageSalary = mean(Salary, na.rm = T),
MedianSalary = median(Salary, na.rm = T))
Step 2: To calculate the proportion and add the new columns to "Table" I use a tip I got from this forum a few days ago.
Table %>% group_by(Department) %>%
mutate(`AvgSalaryWomen/Men` = AverageSalary[Gender == "F"]/AverageSalary[Gender == "M"],
`MedianSalaryWomen/Men` = MedianSalary[Gender == "F"]/MedianSalary[Gender == "M"])
My challenge is that Dep3 doesn't have any males and so I get the following error message:
Error in mutate_impl(.data, dots) :
Column `AvgSalaryWomen/Men` must be length 1 (the group size), not 0
What I was hoping for was something like this
Department Gender Count AverageSalary MedianSalary AvgSalaryWomen.Men MedianSalaryWomen.Men
1 Dep1 F 1 10 10 0.9090909 0.9090909
2 Dep1 M 1 11 11 0.9090909 0.9090909
3 Dep2 F 1 12 12 0.9230769 0.9230769
4 Dep2 M 1 13 13 0.9230769 0.9230769
5 Dep3 F 1 14 14 NA NA
or this
Department Gender Count AverageSalary MedianSalary AvgSalaryWomen.Men MedianSalaryWomen.Men
1 Dep1 F 1 10 10 0.9090909 0.9090909
2 Dep1 M 1 11 11 NA NA
3 Dep2 F 1 12 12 0.9230769 0.9230769
4 Dep2 M 1 13 13 NA NA
5 Dep3 F 1 14 14 NA NA
Is there an easy way to obtain either of these two results? I'm guessing that alternative 1 would be the easiest.
Thanks in advance!
Using ifelse, you can check if both genders exist in a department before computing the ratios (and if not, returning NA). Something like this:
Table %>% group_by(Department) %>%
mutate(`AvgSalaryWomen/Men` = ifelse(length(unique(Gender)) == 2,
AverageSalary[Gender == "F"]/AverageSalary[Gender == "M"], NA),
`MedianSalaryWomen/Men` = ifelse(length(unique(Gender)) == 2,
MedianSalary[Gender == "F"]/MedianSalary[Gender == "M"], NA))
# A tibble: 5 x 7
# Groups: Department [3]
Department Gender Count AverageSalary MedianSalary `AvgSalaryWomen/Men` `MedianSalaryWomen/Men`
<fct> <fct> <int> <dbl> <int> <dbl> <dbl>
1 Dep1 F 1 10.0 10 0.909 0.909
2 Dep1 M 1 11.0 11 0.909 0.909
3 Dep2 F 1 12.0 12 0.923 0.923
4 Dep2 M 1 13.0 13 0.923 0.923
5 Dep3 F 1 14.0 14 NA NA
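The second layout from the question (ratio shown only on the women's rows) can be built on the same check. Here is a rough sketch, assuming a small helper I'm calling ratio_or_na (hypothetical, not a dplyr function) that returns NA unless exactly one F row and one M row exist in the department. The column AvgRatioOnF is also my own name.

```r
library(dplyr)

x <- data.frame(Department = c("Dep1", "Dep1", "Dep2", "Dep2", "Dep3"),
                Gender = c("F", "M", "F", "M", "F"),
                Salary = seq(10, 14))

Table <- x %>%
  group_by(Department, Gender) %>%
  summarise(Count = n(),
            AverageSalary = mean(Salary, na.rm = TRUE),
            MedianSalary = median(Salary, na.rm = TRUE))

# Ratio of the F value to the M value, or NA if either one is missing
ratio_or_na <- function(values, gender) {
  f <- values[gender == "F"]
  m <- values[gender == "M"]
  if (length(f) == 1 && length(m) == 1) f / m else NA_real_
}

result <- Table %>%
  group_by(Department) %>%
  mutate(
    `AvgSalaryWomen/Men` = ratio_or_na(AverageSalary, Gender),
    `MedianSalaryWomen/Men` = ratio_or_na(MedianSalary, Gender),
    # Alternative 2: keep the ratio on the women's rows only
    AvgRatioOnF = if_else(Gender == "F", `AvgSalaryWomen/Men`, NA_real_)
  ) %>%
  ungroup()

result
```

Returning NA_real_ rather than a bare NA keeps the column type numeric in every group.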