How to add a set of values to an existing data frame?


I have a data frame containing three columns: ID, year, growth. The last one contains growth in millimeters for each year.
Example:
df <- data.frame(ID=rep(c("CHC01", "CHC02", "CHC03"), each=4),
year=rep(2015:2018, 3),
growth=c(NA, 2.3, 2.1, 3.0, NA, NA, NA, 3.2, NA, NA, 2.1, 1.2))
In another data frame, I have three other columns: ID, missing_length, missing_years. Missing length is the estimated length that was missed in the measurements. Missing years is the number of missing years in df.
estimate <- data.frame(ID=c("CHC01", "CHC02", "CHC03"),
missing_length=c(1.0, 4.4, 3.5),
missing_years=c(1,3,2))
For calculating the growth for each missing year, I tried:
missing <- rep(estimate$missing_length / estimate$missing_years, estimate$missing_years)
How can I put these values into df in place of the NA entries in growth? Does anyone have any idea how to deal with this problem?
Thank you very much!

We can do a join and then replace each NA with the calculated value:
library(dplyr)
df %>%
  left_join(estimate, by = "ID") %>%
  group_by(ID) %>%
  # within each ID, every NA gets missing_length / missing_years
  transmute(year, growth = replace(growth, is.na(growth),
                                   missing_length[1] / missing_years[1]))
# A tibble: 12 x 3
# Groups: ID [3]
# ID year growth
# <fct> <int> <dbl>
# 1 CHC01 2015 1
# 2 CHC01 2016 2.3
# 3 CHC01 2017 2.1
# 4 CHC01 2018 3
# 5 CHC02 2015 1.47
# 6 CHC02 2016 1.47
# 7 CHC02 2017 1.47
# 8 CHC02 2018 3.2
# 9 CHC03 2015 1.75
#10 CHC03 2016 1.75
#11 CHC03 2017 2.1
#12 CHC03 2018 1.2
Or with coalesce:
df %>%
  mutate(growth = coalesce(growth, with(estimate,
    setNames(missing_length / missing_years, ID))[as.character(ID)])) %>%
  as_tibble()
# A tibble: 12 x 3
# ID year growth
# <fct> <int> <dbl>
# 1 CHC01 2015 1
# 2 CHC01 2016 2.3
# 3 CHC01 2017 2.1
# 4 CHC01 2018 3
# 5 CHC02 2015 1.47
# 6 CHC02 2016 1.47
# 7 CHC02 2017 1.47
# 8 CHC02 2018 3.2
# 9 CHC03 2015 1.75
#10 CHC03 2016 1.75
#11 CHC03 2017 2.1
#12 CHC03 2018 1.2
Or a similar option in data.table:
library(data.table)
setDT(df)[estimate, growth := fcoalesce(growth,
  missing_length / missing_years), on = .(ID)]

Base R solution. Assuming the tables "df" and "estimate" are both sorted by ID (ascending CHC), that the number of NA values for each ID equals its missing_years, and that we keep your "missing" object, this should work:
df$growth <- replace(df$growth, which(is.na(df$growth)), missing)
Output:
ID year growth
1 CHC01 2015 1.000000
2 CHC01 2016 2.300000
3 CHC01 2017 2.100000
4 CHC01 2018 3.000000
5 CHC02 2015 1.466667
6 CHC02 2016 1.466667
7 CHC02 2017 1.466667
8 CHC02 2018 3.200000
9 CHC03 2015 1.750000
10 CHC03 2016 1.750000
11 CHC03 2017 2.100000
12 CHC03 2018 1.200000
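If the ordering assumption is a concern, a base R variant can look up each row's estimate by ID instead of relying on position. This is only a minimal sketch, not part of the answer above: the names idx, per_year and na_rows are made up for illustration, and it assumes (as in the example) that every NA in growth should be filled.

```r
# Example data from the question
df <- data.frame(ID = rep(c("CHC01", "CHC02", "CHC03"), each = 4),
                 year = rep(2015:2018, 3),
                 growth = c(NA, 2.3, 2.1, 3.0, NA, NA, NA, 3.2, NA, NA, 2.1, 1.2))
estimate <- data.frame(ID = c("CHC01", "CHC02", "CHC03"),
                       missing_length = c(1.0, 4.4, 3.5),
                       missing_years = c(1, 3, 2))

# Row of `estimate` that corresponds to each row of `df` (hypothetical helper)
idx <- match(df$ID, estimate$ID)

# Estimated growth per missing year, aligned with the rows of `df`
per_year <- estimate$missing_length[idx] / estimate$missing_years[idx]

# Fill only the NA positions; measured values are left untouched
na_rows <- is.na(df$growth)
df$growth[na_rows] <- per_year[na_rows]
df
```

Because the lookup goes through match(), the rows of either table can be in any order.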

Related

pivot_wider results in list column instead of expected results

I'm just going to chalk this up to my ignorance, but sometimes the pivot_* functions drive me crazy.
I have a tibble:
# A tibble: 12 x 3
year term estimate
<dbl> <chr> <dbl>
1 2018 intercept -29.8
2 2018 daysuntilelection 8.27
3 2019 intercept -50.6
4 2019 daysuntilelection 7.40
5 2020 intercept -31.6
6 2020 daysuntilelection 6.55
7 2021 intercept -19.0
8 2021 daysuntilelection 4.60
9 2022 intercept -10.7
10 2022 daysuntilelection 6.41
11 2023 intercept 120
12 2023 daysuntilelection 0
that I would like to flip to:
# A tibble: 6 x 3
year intercept daysuntilelection
<dbl> <dbl> <dbl>
1 2018 -29.8 8.27
2 2019 -50.6 7.40
3 2020 -31.6 6.55
4 2021 -19.0 4.60
5 2022 -10.7 6.41
6 2023 120 0
Normally pivot_wider should be able to do this as x %>% pivot_wider(!year, names_from = "term", values_from = "estimate"), but instead it returns a two-column tibble with list-columns and a warning.
# A tibble: 1 x 2
intercept daysuntilelection
<list> <list>
1 <dbl [6]> <dbl [6]>
Warning message:
Values from `estimate` are not uniquely identified; output will contain list-cols.
* Use `values_fn = list` to suppress this warning.
* Use `values_fn = {summary_fun}` to summarise duplicates.
* Use the following dplyr code to identify duplicates.
{data} %>%
dplyr::group_by(term) %>%
dplyr::summarise(n = dplyr::n(), .groups = "drop") %>%
dplyr::filter(n > 1L)
Where do I go wrong here? Help!

In addition to the solutions offered in the comments, data.table's dcast is a very fast way to pivot your data. If the pivot_ functions drive you crazy, maybe this is a nice alternative for you:
x <- read.table(text = "
1 2018 intercept -29.8
2 2018 daysuntilelection 8.27
3 2019 intercept -50.6
4 2019 daysuntilelection 7.40
5 2020 intercept -31.6
6 2020 daysuntilelection 6.55
7 2021 intercept -19.0
8 2021 daysuntilelection 4.60
9 2022 intercept -10.7
10 2022 daysuntilelection 6.41
11 2023 intercept 120
12 2023 daysuntilelection 0")
names(x) <- c("id", "year", "term", "estimate")
library(data.table)
dcast(as.data.table(x), year ~ term)
#> Using 'estimate' as value column. Use 'value.var' to override
#> year daysuntilelection intercept
#> 1: 2018 8.27 -29.8
#> 2: 2019 7.40 -50.6
#> 3: 2020 6.55 -31.6
#> 4: 2021 4.60 -19.0
#> 5: 2022 6.41 -10.7
#> 6: 2023 0.00 120.0

DATA
df <- read.table(text = "
1 2018 intercept -29.8
2 2018 daysuntilelection 8.27
3 2019 intercept -50.6
4 2019 daysuntilelection 7.40
5 2020 intercept -31.6
6 2020 daysuntilelection 6.55
7 2021 intercept -19.0
8 2021 daysuntilelection 4.60
9 2022 intercept -10.7
10 2022 daysuntilelection 6.41
11 2023 intercept 120
12 2023 daysuntilelection 0")
CODE
library(tidyverse)
df %>%
  pivot_wider(names_from = V3, values_from = V4, values_fill = 0) %>%
  group_by(V2) %>%
  summarise_all(sum, na.rm = TRUE)
OUTPUT
V2 V1 intercept daysuntilelection
<int> <int> <dbl> <dbl>
1 2018 3 -29.8 8.27
2 2019 7 -50.6 7.4
3 2020 11 -31.6 6.55
4 2021 15 -19 4.6
5 2022 19 -10.7 6.41
6 2023 23 120 0
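The list-columns in the question most likely come from the unnamed `!year` argument: in the tidyr version the asker appears to use, it seems to be matched to `id_cols`, so `year` is excluded from the identifying columns and nothing is left to tell the rows apart. As a hedged sketch (assuming tidyr is installed and provides `pivot_wider()` with an `id_cols` argument), naming `year` as the identifier gives one row per year without any summarising step:

```r
library(tidyr)

# Data as in the question: one row per year and term
x <- data.frame(
  year = rep(2018:2023, each = 2),
  term = rep(c("intercept", "daysuntilelection"), times = 6),
  estimate = c(-29.8, 8.27, -50.6, 7.40, -31.6, 6.55,
               -19.0, 4.60, -10.7, 6.41, 120, 0)
)

# Name `year` as the identifying column explicitly
wide <- pivot_wider(x, id_cols = year,
                    names_from = term, values_from = estimate)
wide
```

With `id_cols = year`, any extra column such as a row id is dropped rather than treated as part of the key.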

geom_line() omits the whole time series when some quarters are missing

I am trying to compare fixed asset turnover for 3 different companies. My challenge is that two of the companies (A, C) publish annual data while the other (B) publishes quarterly data, i.e. for A and C data is only available for the 4th quarter (end of the year). Here is the data:
# A tibble: 30 × 3
Company time value
<chr> <fct> <dbl>
1 A 2019 Q1 NA
2 A 2019 Q2 NA
3 A 2019 Q3 NA
4 A 2019 Q4 7.88
5 A 2020 Q1 NA
6 A 2020 Q2 NA
7 A 2020 Q3 NA
8 A 2020 Q4 8.52
9 A 2021 Q1 NA
10 A 2021 Q2 NA
11 B 2019 Q1 6.51
12 B 2019 Q2 6.48
13 B 2019 Q3 6.77
14 B 2019 Q4 6.72
15 B 2020 Q1 7.26
16 B 2020 Q2 8.33
17 B 2020 Q3 8.65
18 B 2020 Q4 8.55
19 B 2021 Q1 8.29
20 B 2021 Q2 8.59
21 C 2019 Q1 NA
22 C 2019 Q2 NA
23 C 2019 Q3 NA
24 C 2019 Q4 7.79
25 C 2020 Q1 NA
26 C 2020 Q2 NA
27 C 2020 Q3 NA
28 C 2020 Q4 8.95
29 C 2021 Q1 NA
30 C 2021 Q2 NA
Although A and C have data for their fourth quarters, geom_line() seems to ignore both series entirely.
The code:
ggplot(df, aes(x = time, y = value, color = Company, group = Company)) +
  geom_line() +
  theme_bw() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
[Graph not shown]
How can I display the other series despite the missing quarters?

You need at least two consecutive non-missing points to make a line; geom_line() breaks the line wherever it meets an NA. You can either drop the NA rows and plot with geom_line(), or just plot with geom_point().
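A minimal sketch of the "drop the NA rows" option, assuming ggplot2 is installed and rebuilding the data frame from the question as df (the names quarters, df_obs and p are made up for illustration). Adding geom_point() as well keeps each single observation visible:

```r
library(ggplot2)

quarters <- c("2019 Q1", "2019 Q2", "2019 Q3", "2019 Q4", "2020 Q1",
              "2020 Q2", "2020 Q3", "2020 Q4", "2021 Q1", "2021 Q2")

# Data from the question: A and C report only in Q4, B reports every quarter
df <- data.frame(
  Company = rep(c("A", "B", "C"), each = 10),
  time = factor(rep(quarters, 3), levels = quarters),
  value = c(NA, NA, NA, 7.88, NA, NA, NA, 8.52, NA, NA,
            6.51, 6.48, 6.77, 6.72, 7.26, 8.33, 8.65, 8.55, 8.29, 8.59,
            NA, NA, NA, 7.79, NA, NA, NA, 8.95, NA, NA)
)

# Keep only observed values so the Q4 points of A and C become neighbours
df_obs <- df[!is.na(df$value), ]

p <- ggplot(df_obs, aes(x = time, y = value, color = Company, group = Company)) +
  geom_line() +
  geom_point() +
  theme_bw() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
p
```

Note that the lines for A and C then connect one year-end value to the next, so they should be read as annual changes rather than quarterly ones.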

Computing lags but grouping by two categories with dplyr

What I want is to create var3 using a lag (dplyr package), but it should be consistent with the year and the ID. I mean, the lag should belong to the corresponding ID. The dataset is like an unbalanced panel.
YEAR ID VARS
2010 1 -
2011 1 -
2012 1 -
2010 2 -
2011 2 -
2012 2 -
2010 3 -
...
My issue is similar to the following question/post, but grouping by two categories:
dplyr: lead() and lag() wrong when used with group_by()
I tried to extend the solution, unsuccessfully (I get NAs).
Attempt #1:
data %>%
  group_by(YEAR, ID) %>%
  summarise(var1 = ...,
            var2 = ...,
            var3 = var1 - dplyr::lag(var2))
Attempt #2:
data %>%
  group_by(YEAR, ID) %>%
  summarise(var1 = ...,
            var2 = ...,
            gr = sprintf(YEAR, ID),
            var3 = var1 - dplyr::lag(var2, order_by = gr))
Minimum example:
MyData <-
data.frame(YEAR = rep(seq(2010,2014),5),
ID = rep(1:5, each=5),
var1 = rnorm(n=25,mean=10,sd=3),
var2 = rnorm(n=25,mean=1,sd=1)
)
MyData %>%
group_by(YEAR,ID) %>%
summarise(var3 = var1 - dplyr::lag(var2)
)
Thanks in advance.

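Do you mean group_by(ID) and effectively "order by YEAR"? In your minimal example every (YEAR, ID) group has a single row, so lag() has no previous value to return and gives NA.
MyData %>%
  group_by(ID) %>%
  mutate(var3 = var1 - dplyr::lag(var2)) %>%
  print(n = 99)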
# # A tibble: 25 x 5
# # Groups: ID [5]
# YEAR ID var1 var2 var3
# <int> <int> <dbl> <dbl> <dbl>
# 1 2010 1 11.1 1.16 NA
# 2 2011 1 13.5 -0.550 12.4
# 3 2012 1 10.2 2.11 10.7
# 4 2013 1 8.57 1.43 6.46
# 5 2014 1 12.6 1.89 11.2
# 6 2010 2 8.87 1.87 NA
# 7 2011 2 5.30 1.70 3.43
# 8 2012 2 6.81 0.956 5.11
# 9 2013 2 13.3 -0.0296 12.4
# 10 2014 2 9.98 -1.27 10.0
# 11 2010 3 8.62 0.258 NA
# 12 2011 3 12.4 2.00 12.2
# 13 2012 3 16.1 2.12 14.1
# 14 2013 3 8.48 2.83 6.37
# 15 2014 3 10.6 0.190 7.80
# 16 2010 4 12.3 0.887 NA
# 17 2011 4 10.9 1.07 10.0
# 18 2012 4 7.99 1.09 6.92
# 19 2013 4 10.1 1.95 9.03
# 20 2014 4 11.1 1.82 9.17
# 21 2010 5 15.1 1.67 NA
# 22 2011 5 10.4 0.492 8.76
# 23 2012 5 10.0 1.66 9.51
# 24 2013 5 10.6 0.567 8.91
# 25 2014 5 5.32 -0.881 4.76
(I have replaced your summarise with a mutate for now.)
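If the rows are not guaranteed to be sorted by YEAR within each ID, the ordering can be made explicit with the order_by argument of dplyr::lag(). A minimal sketch, assuming dplyr is installed and using a small made-up data set (toy) instead of the random MyData:

```r
library(dplyr)

# Small deterministic panel, deliberately not sorted by YEAR
toy <- data.frame(
  YEAR = c(2011, 2010, 2012, 2010, 2012, 2011),
  ID   = c(1, 1, 1, 2, 2, 2),
  var1 = c(10, 20, 30, 40, 50, 60),
  var2 = c(1, 2, 3, 4, 5, 6)
)

res <- toy %>%
  group_by(ID) %>%
  # order_by makes lag() follow YEAR instead of the current row order
  mutate(var3 = var1 - dplyr::lag(var2, order_by = YEAR)) %>%
  ungroup() %>%
  arrange(ID, YEAR)
res
```

The first year of each ID still gets NA, since there is no earlier year to lag from.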

How to add values to a column and still keep some NA?

I have a data frame containing three columns: ID, year, growth. The last one contains growth in millimeters for each year.
Example:
df <- data.frame(ID=rep(c("CHC01", "CHC02", "CHC03"), each=6),
year=rep(2013:2018, 3),
growth=c(NA, NA, NA, 2.3, 2.1, 3.0, NA, NA, NA, NA, 1.1, 4.8, 1.0, 3.2, 4.2, 2.3, 2.1, 1.2))
In another data frame, I have three other columns: ID, missing_length, missing_years. Missing length is the estimated length that was missed in the measurements. Missing years is the number of missing years in df.
estimate <- data.frame(ID=c("CHC01", "CHC02", "CHC03"),
missing_length=c(1.0, 4.4, 0),
missing_years=c(1,3,0))
For calculating the growth for each missing year, I tried:
missing <- rep(estimate$missing_length / estimate$missing_years, estimate$missing_years)
It is important to note that not all NA from df will be replaced by estimated values.
Here is an example of the data frame I am trying to get:
ID year growth
1 CHC01 2013 NA
2 CHC01 2014 NA
3 CHC01 2015 1.00
4 CHC01 2016 2.30
5 CHC01 2017 2.10
6 CHC01 2018 3.00
7 CHC02 2013 NA
8 CHC02 2014 1.47
9 CHC02 2015 1.47
10 CHC02 2016 1.47
11 CHC02 2017 1.10
12 CHC02 2018 4.80
13 CHC03 2013 1.00
14 CHC03 2014 3.20
15 CHC03 2015 4.20
16 CHC03 2016 2.30
17 CHC03 2017 2.10
18 CHC03 2018 1.20
Does anyone have any idea of how to deal with this problem?
Thank you very much!

After a left_join with 'estimate', we can use which to get the positions of the NA values, keep only the last missing_years of them with tail, and use replace to fill those positions with the ratio of 'missing_length' and 'missing_years':
library(dplyr)
df %>%
  left_join(estimate, by = "ID") %>%
  group_by(ID) %>%
  transmute(year, growth = replace(growth,
    tail(which(is.na(growth)), first(missing_years)),
    first(missing_length) / first(missing_years)))
# A tibble: 18 x 3
# Groups: ID [3]
# ID year growth
# <fct> <int> <dbl>
# 1 CHC01 2013 NA
# 2 CHC01 2014 NA
# 3 CHC01 2015 1
# 4 CHC01 2016 2.3
# 5 CHC01 2017 2.1
# 6 CHC01 2018 3
# 7 CHC02 2013 NA
# 8 CHC02 2014 1.47
# 9 CHC02 2015 1.47
#10 CHC02 2016 1.47
#11 CHC02 2017 1.1
#12 CHC02 2018 4.8
#13 CHC03 2013 1
#14 CHC03 2014 3.2
#15 CHC03 2015 4.2
#16 CHC03 2016 2.3
#17 CHC03 2017 2.1
#18 CHC03 2018 1.2
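To see how the which/tail/replace combination picks the positions, here is a small base R sketch for a single ID. The values are taken from CHC02 and CHC03 in the question; the variable names (na_pos, fill_pos, filled, unchanged) are made up for illustration.

```r
# CHC02: four NA values, but only the last three should be filled
growth <- c(NA, NA, NA, NA, 1.1, 4.8)
missing_length <- 4.4
missing_years <- 3

na_pos <- which(is.na(growth))             # positions 1 2 3 4
fill_pos <- tail(na_pos, missing_years)    # positions 2 3 4
filled <- replace(growth, fill_pos, missing_length / missing_years)
filled

# CHC03: missing_years is 0, so tail() selects no positions
# and replace() leaves the vector unchanged (0 / 0 is never used)
growth3 <- c(1.0, 3.2, 4.2, 2.3, 2.1, 1.2)
unchanged <- replace(growth3, tail(which(is.na(growth3)), 0), 0 / 0)
unchanged
```

Using tail() is what keeps the earliest NA values in place, on the assumption that the estimated years are the ones directly before the first measurement.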

Assigning different names to columns in list using tidyr::pivot_longer, and combining them

I'm extracting "wide" data which I intend to tidy with tidyr::pivot_longer().
library(tidyverse)
df1 <-
data.frame(
M = words[1:10],
N = rnorm(10, 3, 3),
O = rnorm(10, 3, 3),
P = rnorm(10, 3, 3)
)
df2 <-
data.frame(
M = words[1:10],
N = rnorm(10, 3, 3),
O = rnorm(10, 3, 3),
P = rnorm(10, 3, 3)
)
df3 <-
data.frame(
M = words[1:10],
N = rnorm(10, 3, 3),
O = rnorm(10, 3, 3),
P = rnorm(10, 3, 3)
)
df4 <-
data.frame(
M = words[1:10],
N = rnorm(10, 3, 3),
O = rnorm(10, 3, 3),
P = rnorm(10, 3, 3)
)
lst <- list(df1, df2, df3, df4)
colname <-
c("ticker", "2017", "2018", "2019")
header <- list("Leverage", "Gearing", "Capex.to.sales", "FCFex")
lst <- lst %>%
lapply(setNames, colname) %>%
lapply(pivot_longer, -ticker, names_to = "Period", values_to = header)
Using values_to = header gives me this error:
Error in `[[<-.data.frame`(`*tmp*`, ".value", value = list("Leverage", :
replacement has 4 rows, data has 3
Instead, I had to use the default values_to = "value", and subsequently use this code to rename my columns:
lst <- lst %>%
lapply(setNames, colname) %>%
lapply(pivot_longer, -ticker, names_to = "Period", values_to = "value")
lst <- map(seq_along(lst), function(i){
x <- lst[[i]]
colnames(x)[3] <- header[[i]]
x
})
My output is shown below (columns renamed), but I was wondering if there is a way to feed a vector into values_to instead of using map (as it makes for better piping)? Or is there a more efficient way of going about this?
> lst
[[1]]
# A tibble: 30 x 3
ticker Period Leverage
<fct> <chr> <dbl>
1 a 2017 6.01
2 a 2018 4.82
3 a 2019 1.58
4 able 2017 8.64
5 able 2018 6.70
6 able 2019 0.831
7 about 2017 -0.187
8 about 2018 0.549
9 about 2019 0.829
10 absolute 2017 1.26
# ... with 20 more rows
[[2]]
# A tibble: 30 x 3
ticker Period Gearing
<fct> <chr> <dbl>
1 a 2017 2.37
2 a 2018 3.58
3 a 2019 5.63
4 able 2017 0.311
5 able 2018 0.708
6 able 2019 -0.0651
7 about 2017 2.89
8 about 2018 6.25
9 about 2019 10.1
10 absolute 2017 6.48
# ... with 20 more rows
[[3]]
# A tibble: 30 x 3
ticker Period Capex.to.sales
<fct> <chr> <dbl>
1 a 2017 5.22
2 a 2018 1.88
3 a 2019 0.746
4 able 2017 -3.90
5 able 2018 3.06
6 able 2019 1.91
7 about 2017 1.35
8 about 2018 4.12
9 about 2019 11.1
10 absolute 2017 1.76
# ... with 20 more rows
[[4]]
# A tibble: 30 x 3
ticker Period FCFex
<fct> <chr> <dbl>
1 a 2017 1.76
2 a 2018 2.85
3 a 2019 1.86
4 able 2017 -3.38
5 able 2018 -3.02
6 able 2019 -1.52
7 about 2017 6.46
8 about 2018 5.39
9 about 2019 0.810
10 absolute 2017 8.08
# ... with 20 more rows
For the second part of my question, I intend to use bind_cols() to combine all four data frames into one, but the two common columns are being duplicated (as seen below).
How do I tell R to just bind the rightmost column that was renamed i.e. exclude the first two columns for the last three dataframes? Thank you.
Metrics <- bind_cols(lst)
> Metrics
# A tibble: 30 x 12
ticker Period Leverage ticker1 Period1 Gearing ticker2 Period2
<fct> <chr> <dbl> <fct> <chr> <dbl> <fct> <chr>
1 a 2017 6.01 a 2017 2.37 a 2017
2 a 2018 4.82 a 2018 3.58 a 2018
3 a 2019 1.58 a 2019 5.63 a 2019
4 able 2017 8.64 able 2017 0.311 able 2017
5 able 2018 6.70 able 2018 0.708 able 2018
6 able 2019 0.831 able 2019 -0.0651 able 2019
7 about 2017 -0.187 about 2017 2.89 about 2017
8 about 2018 0.549 about 2018 6.25 about 2018
9 about 2019 0.829 about 2019 10.1 about 2019
10 absol~ 2017 1.26 absolu~ 2017 6.48 absolu~ 2017
# ... with 20 more rows, and 4 more variables: Capex.to.sales <dbl>,
# ticker3 <fct>, Period3 <chr>, FCFex <dbl>

You could do it with purrr:
library(purrr)
lst <- map(lst, setNames, colname)
map2_dfc(lst, header, ~ pivot_longer(
  .x, -ticker, names_to = "Period", values_to = .y)) %>%
  # keep ticker, Period and the four value columns (positions 3, 6, 9, 12)
  select(c(1:3, 6, 9, 12))
Output:
ticker Period Leverage Gearing Capex.to.sales FCFex
<fct> <chr> <dbl> <dbl> <dbl> <dbl>
1 a 2017 6.20 3.43 7.87 7.52
2 a 2018 1.63 3.30 0.126 1.52
3 a 2019 2.32 1.49 -0.286 6.95
4 able 2017 6.38 3.42 7.34 2.60
5 able 2018 0.763 1.68 -0.648 -2.85
6 able 2019 5.56 2.35 -0.572 3.21
7 about 2017 -0.762 1.49 3.12 2.43
8 about 2018 9.07 -1.22 0.821 4.00
9 about 2019 1.37 8.27 -0.700 -1.05
10 absolute 2017 1.39 2.49 0.390 2.40
# … with 20 more rows
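Selecting columns by position works here because every data frame has the same rows in the same order. As a hedged alternative sketch (assuming dplyr, tidyr and purrr are installed; the tiny data frames and the helper make_df below are made up for illustration), the long tables can instead be joined on ticker and Period, so no positional select() is needed:

```r
library(dplyr)
library(tidyr)
library(purrr)

# Hypothetical helper that builds a small wide table
make_df <- function(offset) {
  data.frame(ticker = c("a", "able"),
             `2017` = offset + c(1, 2),
             `2018` = offset + c(3, 4),
             check.names = FALSE)
}

lst <- list(make_df(0), make_df(10))
header <- list("Leverage", "Gearing")

metrics <- map2(lst, header,
                # each table gets its own value column name from `header`
                ~ pivot_longer(.x, -ticker,
                               names_to = "Period", values_to = .y)) %>%
  # join on the shared key columns instead of binding side by side
  reduce(full_join, by = c("ticker", "Period"))
metrics
```

A join also keeps the result correct if one of the tables happens to have its rows in a different order.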
