How to add values to a column and still keep some NA? - r

I have a data frame containing three columns: ID, year, growth. The last one contains growth data in millimeters for each year.
Example:
df <- data.frame(ID=rep(c("CHC01", "CHC02", "CHC03"), each=6),
year=rep(2013:2018, 3),
growth=c(NA, NA, NA, 2.3, 2.1, 3.0, NA, NA, NA, NA, 1.1, 4.8, 1.0, 3.2, 4.2, 2.3, 2.1, 1.2))
In another data frame, I have three other columns: ID, missing_length, missing_years. Missing length is the estimated length that was missed in the measurements. Missing years is the number of missing years in df.
estimate <- data.frame(ID=c("CHC01", "CHC02", "CHC03"),
missing_length=c(1.0, 4.4, 0),
missing_years=c(1,3,0))
For calculating the growth for each missing year, I tried:
missing <- rep(estimate$missing_length / estimate$missing_years, estimate$missing_years)
It is important to note that not all NA from df will be replaced by estimated values.
Here is an example of the data frame I am trying to get:
ID year growth
1 CHC01 2013 NA
2 CHC01 2014 NA
3 CHC01 2015 1.00
4 CHC01 2016 2.30
5 CHC01 2017 2.10
6 CHC01 2018 3.00
7 CHC02 2013 NA
8 CHC02 2014 1.47
9 CHC02 2015 1.47
10 CHC02 2016 1.47
11 CHC02 2017 1.10
12 CHC02 2018 4.80
13 CHC03 2013 1.00
14 CHC03 2014 3.20
15 CHC03 2015 4.20
16 CHC03 2016 2.30
17 CHC03 2017 2.10
18 CHC03 2018 1.20
Does anyone have any idea of how to deal with this problem?
Thank you very much!

We can do a left_join with 'estimate' and then, within each ID, use which to get the positions of the NA values, keep only the last missing_years of those positions with tail, and use replace to fill them with the ratio of 'missing_length' to 'missing_years'.
library(dplyr)
df %>%
left_join(estimate, by = "ID") %>%
group_by(ID) %>%
transmute(year, growth = replace(growth,
tail(which(is.na(growth)), first(missing_years)),
first(missing_length)/first(missing_years)))
# A tibble: 18 x 3
# Groups: ID [3]
# ID year growth
# <fct> <int> <dbl>
# 1 CHC01 2013 NA
# 2 CHC01 2014 NA
# 3 CHC01 2015 1
# 4 CHC01 2016 2.3
# 5 CHC01 2017 2.1
# 6 CHC01 2018 3
# 7 CHC02 2013 NA
# 8 CHC02 2014 1.47
# 9 CHC02 2015 1.47
#10 CHC02 2016 1.47
#11 CHC02 2017 1.1
#12 CHC02 2018 4.8
#13 CHC03 2013 1
#14 CHC03 2014 3.2
#15 CHC03 2015 4.2
#16 CHC03 2016 2.3
#17 CHC03 2017 2.1
#18 CHC03 2018 1.2
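For readers who prefer to avoid the join, the same "fill only the last n NA values per ID" idea can be sketched in base R. This is a minimal sketch, not the answer's method: it assumes the rows of df are already sorted by year within each ID, and the helper names n_fill, na_rows and fill_rows are made up for illustration.

```r
df <- data.frame(ID = rep(c("CHC01", "CHC02", "CHC03"), each = 6),
                 year = rep(2013:2018, 3),
                 growth = c(NA, NA, NA, 2.3, 2.1, 3.0, NA, NA, NA, NA, 1.1, 4.8,
                            1.0, 3.2, 4.2, 2.3, 2.1, 1.2))
estimate <- data.frame(ID = c("CHC01", "CHC02", "CHC03"),
                       missing_length = c(1.0, 4.4, 0),
                       missing_years = c(1, 3, 0))

for (i in seq_len(nrow(estimate))) {
  n_fill <- estimate$missing_years[i]          # hypothetical helper name
  if (n_fill == 0) next                        # nothing to fill, and avoids 0/0
  # Row positions of the NA values that belong to this ID
  na_rows <- which(df$ID == estimate$ID[i] & is.na(df$growth))
  # Keep only the last n_fill of them (assumes rows are ordered by year)
  fill_rows <- tail(na_rows, n_fill)
  df$growth[fill_rows] <- estimate$missing_length[i] / n_fill
}

df
```

Skipping IDs with missing_years equal to 0 is a deliberate choice here, since it avoids computing 0/0 at all.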

Related

geom_line() omits the whole time series when some quarters are missing

I am trying to compare fixed asset turnover for 3 different companies. My challenge is that two of the companies (A and C) publish annual data while the other (B) publishes quarterly data, i.e. for A and C data is only available for the 4th quarter (end of the year). Here is the data:
# A tibble: 30 × 3
Company time value
<chr> <fct> <dbl>
1 A 2019 Q1 NA
2 A 2019 Q2 NA
3 A 2019 Q3 NA
4 A 2019 Q4 7.88
5 A 2020 Q1 NA
6 A 2020 Q2 NA
7 A 2020 Q3 NA
8 A 2020 Q4 8.52
9 A 2021 Q1 NA
10 A 2021 Q2 NA
11 B 2019 Q1 6.51
12 B 2019 Q2 6.48
13 B 2019 Q3 6.77
14 B 2019 Q4 6.72
15 B 2020 Q1 7.26
16 B 2020 Q2 8.33
17 B 2020 Q3 8.65
18 B 2020 Q4 8.55
19 B 2021 Q1 8.29
20 B 2021 Q2 8.59
21 C 2019 Q1 NA
22 C 2019 Q2 NA
23 C 2019 Q3 NA
24 C 2019 Q4 7.79
25 C 2020 Q1 NA
26 C 2020 Q2 NA
27 C 2020 Q3 NA
28 C 2020 Q4 8.95
29 C 2021 Q1 NA
30 C 2021 Q2 NA
Although A and C have data for their fourth quarters, geom_line() seems to ignore both series entirely.
The code:
ggplot(df,aes(x=`time`,y=value,color=Company,group=Company))+
geom_line()+
theme_bw()+
theme(axis.text.x = element_text(angle = 45,hjust=1))
Here is the graph.
How can I display these other series despite the missing quarters?
You need at least two consecutive non-missing points to make a line. You can either drop the NA rows and plot with geom_line, or just plot with geom_point.
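The first option can be sketched as follows. This is a minimal illustration on a subset of the question's data (companies A and B, 2019 to 2020 only); the name df_complete is an assumption introduced here. geom_line() breaks the line at NA values, so removing those rows lets the remaining Q4 points of company A become neighbours within their group.

```r
library(ggplot2)

quarters <- c("2019 Q1", "2019 Q2", "2019 Q3", "2019 Q4",
              "2020 Q1", "2020 Q2", "2020 Q3", "2020 Q4")

# Subset of the question's data: A reports only in Q4, B reports every quarter
df <- data.frame(
  Company = rep(c("A", "B"), each = 8),
  time    = factor(rep(quarters, 2), levels = quarters),
  value   = c(NA, NA, NA, 7.88, NA, NA, NA, 8.52,
              6.51, 6.48, 6.77, 6.72, 7.26, 8.33, 8.65, 8.55)
)

# Drop rows with a missing value so the line is not interrupted by NA
df_complete <- df[!is.na(df$value), ]

p <- ggplot(df_complete, aes(x = time, y = value, color = Company, group = Company)) +
  geom_line() +
  geom_point() +   # points keep the sparse series visible
  theme_bw() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

print(p)
```

Adding geom_point() alongside geom_line() is a design choice: it marks where real observations exist, so the straight segment between two annual values is not mistaken for quarterly data.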

How to draw geom_line with 4 groups and how to limit x-axis with no data

Data:
Group Year Month Mean
Group 1 2018 Jun 1.58
Group 1 2018 Jul 0.92
Group 1 2018 Aug 3.52
Group 1 2018 Sep 5.9
Group 1 2018 Oct 5.95
Group 1 2018 Nov 11.21
Group 1 2018 Dec 13.55
Group 1 2019 Jan 4.67
Group 1 2019 Feb 4.35
Group 1 2019 Mar 4.04
Group 1 2019 Apr 1.33
Group 1 2019 May 20.5
Group 1 2019 Jun 1
Group 1 2019 Jul 2.67
Group 1 2019 Aug 5.79
Group 1 2019 Sep 3.95
Group 1 2019 Oct 1.83
Group 1 2019 Nov 5
Group 1 2019 Dec 12.95
Group 1 2020 Jan 8.89
Group 1 2020 Feb 0.75
Group 2 2018 Jun 0
Group 2 2018 Jul 1.2
Group 2 2018 Aug 1.83
Group 2 2018 Sep 3.29
Group 2 2018 Oct 3.32
Group 2 2018 Nov 1
Group 2 2018 Dec 6
Group 2 2019 Jan 0
Group 2 2019 Feb 2.25
Group 2 2019 Mar 2.14
Group 2 2019 Apr 1.94
Group 2 2019 May 0
Group 2 2019 Jun 0.2
Group 2 2019 Jul 1.25
Group 2 2019 Aug 2.86
Group 2 2019 Sep 7.93
Group 2 2019 Oct 3.25
Group 2 2019 Nov 2.8
Group 2 2019 Dec 2
Group 2 2020 Jan 0.25
Group 2 2020 Feb 1.33
Group 3 2018 Jun 0.11
Group 3 2018 Jul 0.68
Group 3 2018 Aug 1.3
Group 3 2018 Sep 0.99
Group 3 2018 Oct 2.67
Group 3 2018 Nov 9.89
Group 3 2018 Dec 9.81
Group 3 2019 Jan 2.78
Group 3 2019 Feb 3.97
Group 3 2019 Mar 5.75
Group 3 2019 Apr 2.19
Group 3 2019 May 0.95
Group 3 2019 Jun 0
Group 3 2019 Jul 1.31
Group 3 2019 Aug 3.77
Group 3 2019 Sep 1.79
Group 3 2019 Oct 3.14
Group 3 2019 Nov 1.82
Group 3 2019 Dec 6.5
Group 3 2020 Jan 2.72
Group 3 2020 Feb 1.33
Group 4 2018 Jun 2.4
Group 4 2018 Jul 0.98
Group 4 2018 Aug 1.1
Group 4 2018 Sep 2.32
Group 4 2018 Oct 6.7
Group 4 2018 Nov 15.66
Group 4 2018 Dec 8.18
Group 4 2019 Jan 3.69
Group 4 2019 Feb 0.8
Group 4 2019 Mar 0.04
Group 4 2019 Apr 1.17
Group 4 2019 May 7
Group 4 2019 Jun 0.53
Group 4 2019 Jul 2.93
Group 4 2019 Aug 2.73
Group 4 2019 Sep 2.07
Group 4 2019 Oct 6.59
Group 4 2019 Nov 3.91
Group 4 2019 Dec 7.2
Group 4 2020 Jan 6.81
Group 4 2020 Feb 0.8
Data$Month <- factor(Data$Month, levels = month.abb)
Data$Year <- factor(Data$Year, levels = c("2018", "2019", "2020"))
Data %>% filter(Group == "Group 1") %>% ggplot(aes(x = Month, y = Mean))+ geom_point() + geom_line(aes(colour = Year), group = 1) + facet_grid(~ Year) + theme_minimal() +
theme(legend.position = "none")
Is it possible to draw a line graph group-wise?
How can I remove the x-axis labels that have no data (Jan to May in 2018, Mar to Dec in 2020)?
You were quite close. This should do it:
library(dplyr)
library(ggplot2)
Data <- read.table(text=
"Group Year Month Mean
1 2018 Jun 1.58
1 2018 Jul 0.92
1 2018 Aug 3.52
1 2018 Sep 5.9
1 2018 Oct 5.95
1 2018 Nov 11.21
1 2018 Dec 13.55
1 2019 Jan 4.67
1 2019 Feb 4.35
1 2019 Mar 4.04
1 2019 Apr 1.33
1 2019 May 20.5
1 2019 Jun 1
1 2019 Jul 2.67
1 2019 Aug 5.79
1 2019 Sep 3.95
1 2019 Oct 1.83
1 2019 Nov 5
1 2019 Dec 12.95
1 2020 Jan 8.89
1 2020 Feb 0.75
2 2018 Jun 0
2 2018 Jul 1.2
2 2018 Aug 1.83
2 2018 Sep 3.29
2 2018 Oct 3.32
2 2018 Nov 1
2 2018 Dec 6
2 2019 Jan 0
2 2019 Feb 2.25
2 2019 Mar 2.14
2 2019 Apr 1.94
2 2019 May 0
2 2019 Jun 0.2
2 2019 Jul 1.25
2 2019 Aug 2.86
2 2019 Sep 7.93
2 2019 Oct 3.25
2 2019 Nov 2.8
2 2019 Dec 2
2 2020 Jan 0.25
2 2020 Feb 1.33
3 2018 Jun 0.11
3 2018 Jul 0.68
3 2018 Aug 1.3
3 2018 Sep 0.99
3 2018 Oct 2.67
3 2018 Nov 9.89
3 2018 Dec 9.81
3 2019 Jan 2.78
3 2019 Feb 3.97
3 2019 Mar 5.75
3 2019 Apr 2.19
3 2019 May 0.95
3 2019 Jun 0
3 2019 Jul 1.31
3 2019 Aug 3.77
3 2019 Sep 1.79
3 2019 Oct 3.14
3 2019 Nov 1.82
3 2019 Dec 6.5
3 2020 Jan 2.72
3 2020 Feb 1.33
4 2018 Jun 2.4
4 2018 Jul 0.98
4 2018 Aug 1.1
4 2018 Sep 2.32
4 2018 Oct 6.7
4 2018 Nov 15.66
4 2018 Dec 8.18
4 2019 Jan 3.69
4 2019 Feb 0.8
4 2019 Mar 0.04
4 2019 Apr 1.17
4 2019 May 7
4 2019 Jun 0.53
4 2019 Jul 2.93
4 2019 Aug 2.73
4 2019 Sep 2.07
4 2019 Oct 6.59
4 2019 Nov 3.91
4 2019 Dec 7.2
4 2020 Jan 6.81
4 2020 Feb 0.8
", header=TRUE)
Data <- Data %>% mutate(
Group = paste("Group",Group),
Year = factor(Year),
Month = factor( Month, levels = month.abb )
)
Data %>%
ggplot(aes(x = Month, y = Mean, group=Group)) +
geom_point() +
geom_line(aes(colour = Group)) +
facet_grid(~ Year, scales = "free_x") +
theme_minimal() +
theme(legend.position = "none")
(Note that I had to omit 'Group' from my input data since I was pasting the data dump from your post; otherwise read.table can't easily make sense of the space. That's why I paste it back in afterwards. It's not important for the solution.)
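The part of the answer that removes the empty months is scales = "free_x". A minimal sketch on five of the Group 1 values shows the effect in isolation; the data frame name toy and the extra space = "free_x" argument (which makes panel widths proportional to the number of months shown) are additions for illustration, not part of the answer.

```r
library(ggplot2)

# Five Group 1 values from the question, spanning two years
toy <- data.frame(
  Year  = factor(c(2018, 2018, 2019, 2019, 2019)),
  Month = factor(c("Nov", "Dec", "Jan", "Feb", "Mar"), levels = month.abb),
  Mean  = c(11.21, 13.55, 4.67, 4.35, 4.04)
)

p <- ggplot(toy, aes(x = Month, y = Mean, group = 1)) +
  geom_point() +
  geom_line() +
  # free_x: each panel shows only the months that occur in that year
  facet_grid(~ Year, scales = "free_x", space = "free_x")

print(p)
```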

Computing lags but grouping by two categories with dplyr

What I want is to create var3 using a lag (dplyr package), but it should be consistent with the year and the ID. I mean, the lag should belong to the corresponding ID. The dataset is like an unbalanced panel.
YEAR ID VARS
2010 1 -
2011 1 -
2012 1 -
2010 2 -
2011 2 -
2012 2 -
2010 3 -
...
My issue is similar to the following question/post, but grouping by two categories:
dplyr: lead() and lag() wrong when used with group_by()
I tried to extend the solution, unsuccessfully (I get NAs).
Attempt #1:
data %>%
group_by(YEAR,ID) %>%
summarise(var1 = ...
var2 = ...
var3 = var1 - dplyr::lag(var2))
)
Attempt #2:
data %>%
group_by(YEAR,ID) %>%
summarise(var1 = ...
var2 = ...
gr = sprintf(YEAR,ID)
var3 = var1 - dplyr::lag(var2, order_by = gr))
)
Minimum example:
MyData <-
data.frame(YEAR = rep(seq(2010,2014),5),
ID = rep(1:5, each=5),
var1 = rnorm(n=25,mean=10,sd=3),
var2 = rnorm(n=25,mean=1,sd=1)
)
MyData %>%
group_by(YEAR,ID) %>%
summarise(var3 = var1 - dplyr::lag(var2)
)
Thanks in advance.
Do you mean group_by(ID) and effectively "order by YEAR"?
MyData %>%
group_by(ID) %>%
mutate(var3 = var1 - dplyr::lag(var2)) %>%
print(n=99)
# # A tibble: 25 x 5
# # Groups: ID [5]
# YEAR ID var1 var2 var3
# <int> <int> <dbl> <dbl> <dbl>
# 1 2010 1 11.1 1.16 NA
# 2 2011 1 13.5 -0.550 12.4
# 3 2012 1 10.2 2.11 10.7
# 4 2013 1 8.57 1.43 6.46
# 5 2014 1 12.6 1.89 11.2
# 6 2010 2 8.87 1.87 NA
# 7 2011 2 5.30 1.70 3.43
# 8 2012 2 6.81 0.956 5.11
# 9 2013 2 13.3 -0.0296 12.4
# 10 2014 2 9.98 -1.27 10.0
# 11 2010 3 8.62 0.258 NA
# 12 2011 3 12.4 2.00 12.2
# 13 2012 3 16.1 2.12 14.1
# 14 2013 3 8.48 2.83 6.37
# 15 2014 3 10.6 0.190 7.80
# 16 2010 4 12.3 0.887 NA
# 17 2011 4 10.9 1.07 10.0
# 18 2012 4 7.99 1.09 6.92
# 19 2013 4 10.1 1.95 9.03
# 20 2014 4 11.1 1.82 9.17
# 21 2010 5 15.1 1.67 NA
# 22 2011 5 10.4 0.492 8.76
# 23 2012 5 10.0 1.66 9.51
# 24 2013 5 10.6 0.567 8.91
# 25 2014 5 5.32 -0.881 4.76
(I have replaced your summarise with a mutate for now.)
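If the rows are not already ordered by year within each ID, the lag picks up whatever row happens to come before. A minimal sketch of one way to guard against that is to sort first; the small data frame panel below is made up for illustration and is deliberately unsorted.

```r
library(dplyr)

# Made-up, deliberately unsorted panel with two IDs
panel <- data.frame(
  YEAR = c(2011, 2010, 2012, 2010, 2011),
  ID   = c(1, 1, 1, 2, 2),
  var1 = c(10, 8, 12, 5, 7),
  var2 = c(1, 2, 3, 4, 5)
)

result <- panel %>%
  arrange(ID, YEAR) %>%                    # make row order match time order
  group_by(ID) %>%                         # lag must not cross ID boundaries
  mutate(var3 = var1 - lag(var2)) %>%      # previous row's var2 within the same ID
  ungroup()

result
```

An alternative that leaves the row order untouched is lag(var2, order_by = YEAR) inside the grouped mutate. Either way, lag() takes the previous row, not the previous calendar year, so a panel with gaps in YEAR would need extra handling that this sketch does not attempt.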

How to add a set of values to an existing data frame?

I have a data frame containing three columns: ID, year, growth. The last one contains growth data in millimeters for each year.
Example:
df <- data.frame(ID=rep(c("CHC01", "CHC02", "CHC03"), each=4),
year=rep(2015:2018, 3),
growth=c(NA, 2.3, 2.1, 3.0, NA, NA, NA, 3.2, NA, NA, 2.1, 1.2))
In another data frame, I have three other columns: ID, missing_length, missing_years. Missing length is the estimated length that was missed in the measurements. Missing years is the number of missing years in df.
estimate <- data.frame(ID=c("CHC01", "CHC02", "CHC03"),
missing_length=c(1.0, 4.4, 3.5),
missing_years=c(1,3,2))
For calculating the growth for each missing year, I tried:
missing <- rep(estimate$missing_length / estimate$missing_years, estimate$missing_years)
Does anyone have any idea of how to deal with this problem?
Thank you very much!
We can do a join and then replace the NA with the calculated value
library(dplyr)
df %>%
left_join(estimate, by = "ID") %>%
group_by(ID) %>%
transmute(year, growth = replace(growth, is.na(growth),
missing_length[1]/missing_years[1]))
# A tibble: 12 x 3
# Groups: ID [3]
# ID year growth
# <fct> <int> <dbl>
# 1 CHC01 2015 1
# 2 CHC01 2016 2.3
# 3 CHC01 2017 2.1
# 4 CHC01 2018 3
# 5 CHC02 2015 1.47
# 6 CHC02 2016 1.47
# 7 CHC02 2017 1.47
# 8 CHC02 2018 3.2
# 9 CHC03 2015 1.75
#10 CHC03 2016 1.75
#11 CHC03 2017 2.1
#12 CHC03 2018 1.2
Or with coalesce
df %>%
mutate(growth = coalesce(growth, with(estimate,
setNames(missing_length/missing_years, ID))[as.character(ID)])) %>%
as_tibble
# A tibble: 12 x 3
# ID year growth
# <fct> <int> <dbl>
# 1 CHC01 2015 1
# 2 CHC01 2016 2.3
# 3 CHC01 2017 2.1
# 4 CHC01 2018 3
# 5 CHC02 2015 1.47
# 6 CHC02 2016 1.47
# 7 CHC02 2017 1.47
# 8 CHC02 2018 3.2
# 9 CHC03 2015 1.75
#10 CHC03 2016 1.75
#11 CHC03 2017 2.1
#12 CHC03 2018 1.2
Or similar option in data.table
library(data.table)
setDT(df)[estimate, growth := fcoalesce(growth,
missing_length/missing_years), on = .(ID)]
Base R solution. Supposing the tables "df" and "estimate" are sorted by ID (ascending CHC), each ID has exactly missing_years NA values, and we keep your "missing" object, this should work:
df$growth <- replace(df$growth, which(is.na(df$growth)), missing)
Output:
ID year growth
1 CHC01 2015 1.000000
2 CHC01 2016 2.300000
3 CHC01 2017 2.100000
4 CHC01 2018 3.000000
5 CHC02 2015 1.466667
6 CHC02 2016 1.466667
7 CHC02 2017 1.466667
8 CHC02 2018 3.200000
9 CHC03 2015 1.750000
10 CHC03 2016 1.750000
11 CHC03 2017 2.100000
12 CHC03 2018 1.200000
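A base R variant that does not rely on the two tables being sorted in the same order can be sketched with match(). This is only a sketch under the same data as the question; the names per_year and fill are introduced here for illustration.

```r
df <- data.frame(ID = rep(c("CHC01", "CHC02", "CHC03"), each = 4),
                 year = rep(2015:2018, 3),
                 growth = c(NA, 2.3, 2.1, 3.0, NA, NA, NA, 3.2, NA, NA, 2.1, 1.2))
estimate <- data.frame(ID = c("CHC01", "CHC02", "CHC03"),
                       missing_length = c(1.0, 4.4, 3.5),
                       missing_years = c(1, 3, 2))

# Estimated growth per missing year, one value per ID
per_year <- estimate$missing_length / estimate$missing_years

# Look up each row's ID in 'estimate', so row order does not matter
fill <- per_year[match(df$ID, estimate$ID)]

# Keep measured values, use the estimate only where growth is NA
df$growth <- ifelse(is.na(df$growth), fill, df$growth)

df
```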

correlation between two data frames in R

I have one data frame which has sales values for the time period Oct. 2000 to Dec. 2001 (15 months). Also I have profit values for the same time period as above and I want to find the correlation between these two data frames month wise for these 15 months in R. My data frame sales is:
Month sales
Oct. 2000 24.1
Nov. 2000 23.3
Dec. 2000 43.9
Jan. 2001 53.8
Feb. 2001 74.9
Mar. 2001 25
Apr. 2001 48.5
May. 2001 18
Jun. 2001 68.1
Jul. 2001 78
Aug. 2001 48.8
Sep. 2001 48.9
Oct. 2001 34.3
Nov. 2001 54.1
Dec. 2001 29.3
My second data frame profit is:
period profit
Oct 2000 14.1
Nov 2000 3.3
Dec 2000 13.9
Jan 2001 23.8
Feb 2001 44.9
Mar 2001 15
Apr 2001 58.5
May 2001 18
Jun 2001 58.1
Jul 2001 38
Aug 2001 28.8
Sep 2001 18.9
Oct 2001 24.3
Nov 2001 24.1
Dec 2001 19.3
Now I know that for the initial two months I cannot get the correlation as there are not enough values, but from Dec. 2000 onwards I want to calculate the correlation by taking the previous months' values into consideration. So, for Dec. 2000 I will consider the values of Oct. 2000, Nov. 2000 and Dec. 2000, which gives me 3 sales values and 3 profit values. Similarly, for Jan. 2001 I will consider the values of Oct. 2000, Nov. 2000, Dec. 2000 and Jan. 2001, giving 4 sales values and 4 profit values. Thus for every month I will also consider all previous months' values to calculate the correlation, and my output should be something like this:
Month Correlation
Oct. 2000 NA or Empty
Nov. 2000 NA or Empty
Dec. 2000 x
Jan. 2001 y
. .
. .
Dec. 2001 a
I know that in R there is a function cor(sales, profit) but how can I find out the correlation for my scenario?
Make some sample data:
> sales = c(1,4,3,2,3,4,5,6,7,6,7,5)
> profit = c(4,3,2,3,4,5,6,7,7,7,6,5)
> data = data.frame(sales=sales,profit=profit)
> head(data)
sales profit
1 1 4
2 4 3
3 3 2
4 2 3
5 3 4
6 4 5
Here's the beef:
> data$runcor = c(NA,NA,
sapply(3:nrow(data),
function(i){
cor(data$sales[1:i],data$profit[1:i])
}))
> data
sales profit runcor
1 1 4 NA
2 4 3 NA
3 3 2 -0.65465367
4 2 3 -0.63245553
5 3 4 -0.41931393
6 4 5 0.08155909
7 5 6 0.47368421
8 6 7 0.69388867
9 7 7 0.78317543
10 6 7 0.81256816
11 7 6 0.80386072
12 5 5 0.80155885
So now data$runcor[3] is the correlation of the first 3 sales and profit numbers.
Note I call this runcor as it's a "running correlation", like a "running sum", which is the sum of all elements so far. This is the correlation of all pairs so far.
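The same running correlation can be wrapped in a small reusable function. This is a sketch rather than part of the answer: the function name running_cor and its min_obs argument are hypothetical. Using seq_len() instead of 3:nrow(data) also keeps it safe for inputs shorter than three rows, where 3:nrow(data) would count backwards.

```r
# Correlation of all pairs seen so far, NA until min_obs pairs are available
running_cor <- function(x, y, min_obs = 3) {
  stopifnot(length(x) == length(y))
  vapply(seq_along(x), function(i) {
    if (i < min_obs) NA_real_ else cor(x[1:i], y[1:i])
  }, numeric(1))
}

sales  <- c(1, 4, 3, 2, 3, 4, 5, 6, 7, 6, 7, 5)
profit <- c(4, 3, 2, 3, 4, 5, 6, 7, 7, 7, 6, 5)

data <- data.frame(sales = sales, profit = profit)
data$runcor <- running_cor(data$sales, data$profit)
data
```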
Another possibility, if dat1 and dat2 are the initial datasets:
dat1$Month <- gsub("\\.", "", dat1$Month)
datN <- merge(dat1, dat2, sort=FALSE, by.x="Month", by.y="period")
indx <- sequence(3:nrow(datN)) #create index to replicate the rows
indx1 <- cumsum(c(TRUE,diff(indx) <0)) #create another index to group the rows
#calculate the correlation grouped by `indx1`
datN$runcor <- setNames(c(NA, NA,by(datN[indx,-1],
list(indx1), FUN=function(x) cor(x$sales, x$profit) )), NULL)
datN
# Month sales profit runcor
#1 Oct 2000 24.1 14.1 NA
#2 Nov 2000 23.3 3.3 NA
#3 Dec 2000 43.9 13.9 0.5155911
#4 Jan 2001 53.8 23.8 0.8148546
#5 Feb 2001 74.9 44.9 0.9345166
#6 Mar 2001 25.0 15.0 0.9119941
#7 Apr 2001 48.5 58.5 0.7056301
#8 May 2001 18.0 18.0 0.6879528
#9 Jun 2001 68.1 58.1 0.7647177
#10 Jul 2001 78.0 38.0 0.7357748
#11 Aug 2001 48.8 28.8 0.7351366
#12 Sep 2001 48.9 18.9 0.7190413
#13 Oct 2001 34.3 24.3 0.7175138
#14 Nov 2001 54.1 24.1 0.7041889
#15 Dec 2001 29.3 19.3 0.7094334
