Combine some columns of two matrices but with common information transposed

I have the following two matrices:
matrix1 (first 10 rows and only some relevant columns):
Prod_Y2010 Prod_Y2011 Prod_Y2012 Prod_Y2013 Prod_Y2014 Place
1 6101 5733 5655 5803 5155 3
2 4614 4513 4322 5211 4397 1
3 5370 5295 4951 5145 4491 3
4 5689 5855 5600 5787 4848 1
5 3598 3491 3462 3765 3094 2
6 6367 6244 5838 6404 5466 7
7 2720 2635 2465 2917 2623 2
8 5077 5113 4456 5503 4749 8
9 5260 5055 4512 5691 4876 2
10 4771 4583 4202 5266 4422 2
where each column is grassland productivity from years 2010 to 2014, and the last column is the place where productivity was measured.
and matrix2:
Year Rain_Place1 Rain_Place2 Rain_Place3 Rain_Place7 Rain_Place8
11 2010 123.0 361.0 60.5 469.7 492.3
12 2011 45.5 404.4 224.8 395.4 417.3
13 2012 318.7 369.4 115.7 322.6 385.8
14 2013 93.2 378.4 155.5 398.2 413.1
15 2014 216.8 330.0 31.0 344.0 387.5
where for each of the same 5 years of matrix1 (which are the rows in matrix 2) I have data on the rainfall for each place.
I do not see how to proceed in R to join the information of the two matrices in such a way that my matrix1 has a series of additional columns intercalated (or interspersed) with the corresponding rain values matching the corresponding years and places. That is, what I need is a new matrix1 such as:
Prod_Y2010 Rain_Y2010 Prod_Y2011 Rain_Y2011 Prod_Y2012 Rain_Y2012 ... Place
1 6101 60.5 5733 224.8 5655 115.7 3
2 4614 123.0 4513 45.5 4322 318.7 1
3 5370 60.5 5295 224.8 4951 115.7 3
4 5689 123.0 5855 45.5 5600 318.7 1
5 3598 361.0 3491 404.4 3462 369.4 2
... ... ... ... ... ... ... ...
Of course the order is not important to me: if all the Rainfall columns are added as new columns at the right end of matrix1, that would be fine anyway.
Needless to say, my real matrices are several thousand rows long, and the number of years is 15.

I would second @jazzurro's comment: reformatting your data to long format would likely make it easier to work with for analysis etc. However, if you want to keep the wide format, here is a way that might work; it uses the reshape2 and plyr libraries.
Given these data frames (dput() output of your data frames above, only included for reproducibility):
m1<-structure(list(Prod_Y2010 = c(6101L, 4614L, 5370L, 5689L, 3598L,
6367L, 2720L, 5077L, 5260L, 4771L), Prod_Y2011 = c(5733L, 4513L,
5295L, 5855L, 3491L, 6244L, 2635L, 5113L, 5055L, 4583L), Prod_Y2012 = c(5655L,
4322L, 4951L, 5600L, 3462L, 5838L, 2465L, 4456L, 4512L, 4202L
), Prod_Y2013 = c(5803L, 5211L, 5145L, 5787L, 3765L, 6404L, 2917L,
5503L, 5691L, 5266L), Prod_Y2014 = c(5155L, 4397L, 4491L, 4848L,
3094L, 5466L, 2623L, 4749L, 4876L, 4422L), Place = c(3L, 1L,
3L, 1L, 2L, 7L, 2L, 8L, 2L, 2L)), .Names = c("Prod_Y2010", "Prod_Y2011",
"Prod_Y2012", "Prod_Y2013", "Prod_Y2014", "Place"), class = "data.frame", row.names = c(NA,
-10L))
m2<-structure(list(Year = 2010:2014, Rain_Place1 = c(123, 45.5, 318.7,
93.2, 216.8), Rain_Place2 = c(361, 404.4, 369.4, 378.4, 330),
Rain_Place3 = c(60.5, 224.8, 115.7, 155.5, 31), Rain_Place7 = c(469.7,
395.4, 322.6, 398.2, 344), Rain_Place8 = c(492.3, 417.3,
385.8, 413.1, 387.5)), .Names = c("Year", "Rain_Place1",
"Rain_Place2", "Rain_Place3", "Rain_Place7", "Rain_Place8"), class = "data.frame", row.names = c("11",
"12", "13", "14", "15"))
To get the place number from the column names in your rain data frame to use in a later join:
rename <- function(x) {
  y <- substr(x, nchar(x), nchar(x))
  return(y)
}
Edit: here is a better rename function that should work with more than 9 places (modified from an answer here):
rename <- function(x) {
  y <- unlist(regmatches(x, gregexpr('\\(?[0-9,.]+', x)))
  return(y)
}
sapply(names(m2[2:ncol(m2)]), FUN = rename)
names(m2) <- c(names(m2)[1], sapply(names(m2[2:ncol(m2)]), FUN = rename))
> m2
Year 1 2 3 7 8
1 2010 123.0 361.0 60.5 469.7 492.3
2 2011 45.5 404.4 224.8 395.4 417.3
3 2012 318.7 369.4 115.7 322.6 385.8
4 2013 93.2 378.4 155.5 398.2 413.1
5 2014 216.8 330.0 31.0 344.0 387.5
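As a simpler alternative to the rename helper (assuming the Rain_PlaceN naming shown above), a single sub() call can strip the prefix and keep the trailing number, including multi-digit places:

```r
# Toy copy of m2's header shape, with a multi-digit place for illustration
m2 <- data.frame(Year = 2010:2011,
                 Rain_Place1 = c(123, 45.5),
                 Rain_Place12 = c(361, 404.4))

# Drop the "Rain_Place" prefix from every column except Year
names(m2)[-1] <- sub("^Rain_Place", "", names(m2)[-1])
```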
Melt the rain data frame:
library(reshape2)
m3<-melt(m2, id.vars = "Year", variable.name = "Place", value.name = "Rain")
> head(m3)
Year Place Rain
1 2010 1 123.0
2 2011 1 45.5
3 2012 1 318.7
4 2013 1 93.2
5 2014 1 216.8
6 2010 2 361.0
Reshape the melted data frame to allow for a join by "Place", and treat "Place" as a character rather than a factor:
m4<-reshape(m3, idvar = "Place", timevar = "Year", direction = "wide")
m4$Place <- as.character(m4$Place)
> m4
Place Rain.2010 Rain.2011 Rain.2012 Rain.2013 Rain.2014
1 1 123.0 45.5 318.7 93.2 216.8
6 2 361.0 404.4 369.4 378.4 330.0
11 3 60.5 224.8 115.7 155.5 31.0
16 7 469.7 395.4 322.6 398.2 344.0
21 8 492.3 417.3 385.8 413.1 387.5
Finally, join this melted/reshaped data frame to your "Prod" data frame.
library(plyr)
m5<-join(m1, m4, by = "Place")
> m5
Prod_Y2010 Prod_Y2011 Prod_Y2012 Prod_Y2013 Prod_Y2014 Place Rain.2010 Rain.2011 Rain.2012 Rain.2013 Rain.2014
1 6101 5733 5655 5803 5155 3 60.5 224.8 115.7 155.5 31.0
2 4614 4513 4322 5211 4397 1 123.0 45.5 318.7 93.2 216.8
3 5370 5295 4951 5145 4491 3 60.5 224.8 115.7 155.5 31.0
4 5689 5855 5600 5787 4848 1 123.0 45.5 318.7 93.2 216.8
5 3598 3491 3462 3765 3094 2 361.0 404.4 369.4 378.4 330.0
6 6367 6244 5838 6404 5466 7 469.7 395.4 322.6 398.2 344.0
7 2720 2635 2465 2917 2623 2 361.0 404.4 369.4 378.4 330.0
8 5077 5113 4456 5503 4749 8 492.3 417.3 385.8 413.1 387.5
9 5260 5055 4512 5691 4876 2 361.0 404.4 369.4 378.4 330.0
10 4771 4583 4202 5266 4422 2 361.0 404.4 369.4 378.4 330.0
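For what it's worth, on current tidyverse versions the same result can be sketched with tidyr::pivot_longer()/pivot_wider() and a dplyr join (truncated stand-ins for the m1 and m2 above, for self-containment):

```r
library(dplyr)
library(tidyr)

# Truncated stand-ins for the m1 and m2 data frames above
m1 <- data.frame(Prod_Y2010 = c(6101, 4614), Prod_Y2011 = c(5733, 4513),
                 Place = c(3L, 1L))
m2 <- data.frame(Year = 2010:2011,
                 Rain_Place1 = c(123, 45.5), Rain_Place3 = c(60.5, 224.8))

# Long format: one row per (Year, Place), taking the place number from the name
rain_long <- m2 %>%
  pivot_longer(-Year, names_to = "Place", values_to = "Rain",
               names_prefix = "Rain_Place",
               names_transform = list(Place = as.integer))

# Back to wide by year: one row per Place with Rain.<year> columns, then join
rain_wide <- rain_long %>%
  pivot_wider(names_from = Year, values_from = Rain, names_prefix = "Rain.")

m5 <- left_join(m1, rain_wide, by = "Place")
```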


Pivot longer with multiple variables in columns

My data looks like this:
# A tibble: 120 x 5
age death_rate_male life_exp_male death_rate_fem life_exp_fem
<dbl> <dbl> <dbl> <dbl> <dbl>
1 0 0.00630 76.0 0.00523 81.0
2 1 0.000426 75.4 0.000342 80.4
3 2 0.00029 74.5 0.000209 79.4
4 3 0.000229 73.5 0.000162 78.4
5 4 0.000162 72.5 0.000143 77.4
6 5 0.000146 71.5 0.000125 76.5
7 6 0.000136 70.5 0.000113 75.5
8 7 0.000127 69.6 0.000104 74.5
9 8 0.000115 68.6 0.000097 73.5
10 9 0.000103 67.6 0.000093 72.5
# ... with 110 more rows
I'm trying to create a tidy table where the variables are age, gender, life expectancy, and death rate.
I managed to do this by splitting the data frame into two (one containing life expectancy, the other death rate), tidying both with pivot_longer(), and then appending the two tables.
Is there a way to do this more elegantly, with a single pivot_longer() command? Thank you in advance.
We can use names_pattern, capturing groups from the column names:
library(dplyr)
library(tidyr)
df1 %>%
  pivot_longer(cols = -age, names_to = c('.value', 'grp'),
               names_pattern = "^(\\w+_\\w+)_(\\w+)")
# A tibble: 20 x 4
# age grp death_rate life_exp
# <int> <chr> <dbl> <dbl>
# 1 0 male 0.0063 76
# 2 0 fem 0.00523 81
# 3 1 male 0.000426 75.4
# 4 1 fem 0.000342 80.4
# 5 2 male 0.00029 74.5
# 6 2 fem 0.000209 79.4
# 7 3 male 0.000229 73.5
# 8 3 fem 0.000162 78.4
# 9 4 male 0.000162 72.5
#10 4 fem 0.000143 77.4
#11 5 male 0.000146 71.5
#12 5 fem 0.000125 76.5
#13 6 male 0.000136 70.5
#14 6 fem 0.000113 75.5
#15 7 male 0.000127 69.6
#16 7 fem 0.000104 74.5
#17 8 male 0.000115 68.6
#18 8 fem 0.000097 73.5
#19 9 male 0.000103 67.6
#20 9 fem 0.000093 72.5
or names_sep (a regex that matches the last underscore: an underscore followed only by non-underscore characters until the end of the name)
df1 %>%
  pivot_longer(cols = -age, names_to = c('.value', 'grp'),
               names_sep = "_(?=[^_]+$)")
data
df1 <- structure(list(age = 0:9, death_rate_male = c(0.0063, 0.000426,
0.00029, 0.000229, 0.000162, 0.000146, 0.000136, 0.000127, 0.000115,
0.000103), life_exp_male = c(76, 75.4, 74.5, 73.5, 72.5, 71.5,
70.5, 69.6, 68.6, 67.6), death_rate_fem = c(0.00523, 0.000342,
0.000209, 0.000162, 0.000143, 0.000125, 0.000113, 0.000104, 9.7e-05,
9.3e-05), life_exp_fem = c(81, 80.4, 79.4, 78.4, 77.4, 76.5,
75.5, 74.5, 73.5, 72.5)), class = "data.frame", row.names = c("1",
"2", "3", "4", "5", "6", "7", "8", "9", "10"))
Borrowing the data from akrun, here is a base R option using reshape
reshape(
setNames(df, gsub("(.*)_(\\w+)", "\\1\\.\\2", names(df))),
direction = "long",
varying = -1
)
such that
age time death_rate life_exp id
1.male 0 male 0.006300 76.0 1
2.male 1 male 0.000426 75.4 2
3.male 2 male 0.000290 74.5 3
4.male 3 male 0.000229 73.5 4
5.male 4 male 0.000162 72.5 5
6.male 5 male 0.000146 71.5 6
7.male 6 male 0.000136 70.5 7
8.male 7 male 0.000127 69.6 8
9.male 8 male 0.000115 68.6 9
10.male 9 male 0.000103 67.6 10
1.fem 0 fem 0.005230 81.0 1
2.fem 1 fem 0.000342 80.4 2
3.fem 2 fem 0.000209 79.4 3
4.fem 3 fem 0.000162 78.4 4
5.fem 4 fem 0.000143 77.4 5
6.fem 5 fem 0.000125 76.5 6
7.fem 6 fem 0.000113 75.5 7
8.fem 7 fem 0.000104 74.5 8
9.fem 8 fem 0.000097 73.5 9
10.fem 9 fem 0.000093 72.5 10
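A data.table variant of the same reshape is also possible with melt() and patterns(); this is a sketch assuming the df1 above (truncated here for self-containment), with grp recovered from the column order:

```r
library(data.table)

# Truncated copy of df1 from above
df1 <- data.frame(age = 0:1,
                  death_rate_male = c(0.0063, 0.000426),
                  life_exp_male   = c(76, 75.4),
                  death_rate_fem  = c(0.00523, 0.000342),
                  life_exp_fem    = c(81, 80.4))

out <- melt(setDT(df1), id.vars = "age",
            measure.vars = patterns(death_rate = "^death_rate",
                                    life_exp   = "^life_exp"),
            variable.name = "grp")
# patterns() pairs the matched columns positionally, so grp is a factor with
# levels "1" (the *_male columns) and "2" (the *_fem columns)
levels(out$grp) <- c("male", "fem")
```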

Dummies with quantile() in R

I am trying to assign a dummy to each data entry depending on its quantile. I have 3 terciles (1/3, 2/3, 3/3); if Leverage is in q1, a 1 should go in a separate column, if in q2, a 1 in another column (the other columns stay 0).
This is my data sample:
k <- c("gvkey1" , "gvkey1" , "gvkey1" , "gvkey1", "gvkey2", "gvkey2", "gvkey2", "gvkey2", "gvkey2", "gvkey3", "gvkey3", "gvkey1" , "gvkey1" , "gvkey1" , "gvkey1", "gvkey2", "gvkey2", "gvkey2", "gvkey2", "gvkey2", "gvkey3", "gvkey3", "gvkey1" , "gvkey1" , "gvkey1" , "gvkey1", "gvkey2", "gvkey2", "gvkey2", "gvkey2", "gvkey2", "gvkey3", "gvkey3", "gvkey1" , "gvkey1" , "gvkey1" , "gvkey1", "gvkey2", "gvkey2", "gvkey2", "gvkey2", "gvkey2", "gvkey3", "gvkey3")
l <- c("12/1/2000", "12/1/2000", "12/3/2000", "12/4/2000" , "12/5/2000" , "12/6/2000" , "12/7/2000" , "12/8/2000" , "12/9/2000" , "12/10/2000" , "12/11/2000", "12/1/2000", "12/1/2000", "12/3/2000", "12/4/2000" , "12/5/2000" , "12/6/2000" , "12/7/2000" , "12/8/2000" , "12/9/2000" , "12/10/2000" , "12/11/2000", "12/1/2000", "12/1/2000", "12/3/2000", "12/4/2000" , "12/5/2000" , "12/6/2000" , "12/7/2000" , "12/8/2000" , "12/9/2000" , "12/10/2000" , "12/11/2000", "12/1/2000", "12/1/2000", "12/3/2000", "12/4/2000" , "12/5/2000" , "12/6/2000" , "12/7/2000" , "12/8/2000" , "12/9/2000" , "12/10/2000" , "12/11/2000", "12/1/2000", "12/1/2000", "12/3/2000", "12/4/2000" , "12/5/2000" , "12/6/2000" , "12/7/2000" , "12/8/2000" , "12/9/2000" , "12/10/2000" , "12/11/2000", "12/1/2000", "12/1/2000", "12/3/2000", "12/4/2000" , "12/5/2000" , "12/6/2000" , "12/7/2000" , "12/8/2000" , "12/9/2000" , "12/10/2000" , "12/11/2000")
m <- c(1:66)
y <- structure(list(a = l, b = k, c = m), .Names = c("Date", "gvkey" , "Leverage"),
row.names = c(NA, -66L), class = "data.frame")
y$Date <- as.Date(y$Date, format = "%m/%d/%Y")
library(data.table)
test <- data.table(y)
and this is the code which should do as described above:
# quantile function per date
d1 <- paste("d1") # first breakpoint
test <- test[, (d1) := quantile(Leverage, (1/3)), by = "Date"]
d2 <- paste("d2") #second breakpoint
test <- test[, (d2) := quantile(Leverage, (2/3)), by = "Date"]
# match companies and quantiles
dquant1 <- paste("dquant1")
test <- test[, (dquant1) := ifelse(d1 < quantile(test$Leverage, 1/3), 1, 0), by = "Date"]
dquant2 <- paste("dquant2")
test <- test[, (dquant2) := ifelse((d1 > quantile(test$Leverage, 1/3) && (d2 < quantile(test$Leverage, 2/3))),1,0), by = "Date"]
dquant3 <- paste("dquant3")
test <- test[, (dquant3) := ifelse(d1 > quantile(test$Leverage, 2/3), 1, 0), by = "Date"]
The problem in my original data set is that I sometimes get a dummy in 2 columns (e.g. 1 0 1), and that is what I want to solve. For this sample I sometimes get no dummy at all.
Any suggestions welcome!
Thanks
Johannes
How about this approach?
library(dplyr)
test %>% rowwise() %>%
  mutate(dquant = cut(Leverage,
                      breaks = c(0, d1, d2, max(Leverage)),
                      labels = c('100', '010', '001'))) %>%
  print(n = Inf)
# A tibble: 66 x 6
Date gvkey Leverage d1 d2 dquant
<date> <chr> <int> <dbl> <dbl> <fct>
1 2000-12-01 gvkey1 1 19.7 38.3 100
2 2000-12-01 gvkey1 2 19.7 38.3 100
3 2000-12-03 gvkey1 3 21.3 39.7 100
4 2000-12-04 gvkey1 4 22.3 40.7 100
5 2000-12-05 gvkey2 5 23.3 41.7 100
6 2000-12-06 gvkey2 6 24.3 42.7 100
7 2000-12-07 gvkey2 7 25.3 43.7 100
8 2000-12-08 gvkey2 8 26.3 44.7 100
9 2000-12-09 gvkey2 9 27.3 45.7 100
10 2000-12-10 gvkey3 10 28.3 46.7 100
11 2000-12-11 gvkey3 11 29.3 47.7 100
12 2000-12-01 gvkey1 12 19.7 38.3 100
13 2000-12-01 gvkey1 13 19.7 38.3 100
14 2000-12-03 gvkey1 14 21.3 39.7 100
15 2000-12-04 gvkey1 15 22.3 40.7 100
16 2000-12-05 gvkey2 16 23.3 41.7 100
17 2000-12-06 gvkey2 17 24.3 42.7 100
18 2000-12-07 gvkey2 18 25.3 43.7 100
19 2000-12-08 gvkey2 19 26.3 44.7 100
20 2000-12-09 gvkey2 20 27.3 45.7 100
21 2000-12-10 gvkey3 21 28.3 46.7 100
22 2000-12-11 gvkey3 22 29.3 47.7 100
23 2000-12-01 gvkey1 23 19.7 38.3 010
24 2000-12-01 gvkey1 24 19.7 38.3 010
25 2000-12-03 gvkey1 25 21.3 39.7 010
26 2000-12-04 gvkey1 26 22.3 40.7 010
27 2000-12-05 gvkey2 27 23.3 41.7 010
28 2000-12-06 gvkey2 28 24.3 42.7 010
29 2000-12-07 gvkey2 29 25.3 43.7 010
30 2000-12-08 gvkey2 30 26.3 44.7 010
31 2000-12-09 gvkey2 31 27.3 45.7 010
32 2000-12-10 gvkey3 32 28.3 46.7 010
33 2000-12-11 gvkey3 33 29.3 47.7 010
34 2000-12-01 gvkey1 34 19.7 38.3 010
35 2000-12-01 gvkey1 35 19.7 38.3 010
36 2000-12-03 gvkey1 36 21.3 39.7 010
37 2000-12-04 gvkey1 37 22.3 40.7 010
38 2000-12-05 gvkey2 38 23.3 41.7 010
39 2000-12-06 gvkey2 39 24.3 42.7 010
40 2000-12-07 gvkey2 40 25.3 43.7 010
41 2000-12-08 gvkey2 41 26.3 44.7 010
42 2000-12-09 gvkey2 42 27.3 45.7 010
43 2000-12-10 gvkey3 43 28.3 46.7 010
44 2000-12-11 gvkey3 44 29.3 47.7 010
45 2000-12-01 NA 45 19.7 38.3 001
46 2000-12-01 NA 46 19.7 38.3 001
47 2000-12-03 NA 47 21.3 39.7 001
48 2000-12-04 NA 48 22.3 40.7 001
49 2000-12-05 NA 49 23.3 41.7 001
50 2000-12-06 NA 50 24.3 42.7 001
51 2000-12-07 NA 51 25.3 43.7 001
52 2000-12-08 NA 52 26.3 44.7 001
53 2000-12-09 NA 53 27.3 45.7 001
54 2000-12-10 NA 54 28.3 46.7 001
55 2000-12-11 NA 55 29.3 47.7 001
56 2000-12-01 NA 56 19.7 38.3 001
57 2000-12-01 NA 57 19.7 38.3 001
58 2000-12-03 NA 58 21.3 39.7 001
59 2000-12-04 NA 59 22.3 40.7 001
60 2000-12-05 NA 60 23.3 41.7 001
61 2000-12-06 NA 61 24.3 42.7 001
62 2000-12-07 NA 62 25.3 43.7 001
63 2000-12-08 NA 63 26.3 44.7 001
64 2000-12-09 NA 64 27.3 45.7 001
65 2000-12-10 NA 65 28.3 46.7 001
66 2000-12-11 NA 66 29.3 47.7 001
A trickier solution is shown below:
d1 <- paste("d1") # first breakpoint
test <- test[, (d1) := quantile(Leverage, (1/3)), by = "Date"]
d2 <- paste("d2") #second breakpoint
test <- test[, (d2) := quantile(Leverage, (2/3)), by = "Date"]
## I will use the '|' operator in dquant
test = test %>% rowwise() %>%
  mutate(dquant = cut(Leverage,
                      breaks = c(0, d1, d2, max(Leverage)),
                      labels = c('1|0|0', '0|1|0', '0|0|1')))
> test
# A tibble: 66 x 6
Date gvkey Leverage d1 d2 dquant
<date> <chr> <int> <dbl> <dbl> <fct>
1 2000-12-01 gvkey1 1 19.7 38.3 1|0|0
2 2000-12-01 gvkey1 2 19.7 38.3 1|0|0
After this, we have to split the dquant column into multiple columns.
dummy <- data.frame(do.call('rbind',
                            strsplit(as.character(test$dquant), '|', fixed = TRUE)))
> dummy
X1 X2 X3
1 1 0 0
2 1 0 0
3 1 0 0
4 1 0 0
5 1 0 0
6 1 0 0
....
Finally, you get the answer as below:
test = cbind(test,dummy)
> test
Date gvkey Leverage d1 d2 dquant X1 X2 X3
1 2000-12-01 gvkey1 1 19.66667 38.33333 1|0|0 1 0 0
2 2000-12-01 gvkey1 2 19.66667 38.33333 1|0|0 1 0 0
3 2000-12-03 gvkey1 3 21.33333 39.66667 1|0|0 1 0 0
4 2000-12-04 gvkey1 4 22.33333 40.66667 1|0|0 1 0 0
5 2000-12-05 gvkey2 5 23.33333 41.66667 1|0|0 1 0 0
6 2000-12-06 gvkey2 6 24.33333 42.66667 1|0|0 1 0 0
7 2000-12-07 gvkey2 7 25.33333 43.66667 1|0|0 1 0 0
8 2000-12-08 gvkey2 8 26.33333 44.66667 1|0|0 1 0 0
9 2000-12-09 gvkey2 9 27.33333 45.66667 1|0|0 1 0 0
10 2000-12-10 gvkey3 10 28.33333 46.66667 1|0|0 1 0 0
11 2000-12-11 gvkey3 11 29.33333 47.66667 1|0|0 1 0 0
...
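A sketch of an alternative that avoids the string-split step entirely: bin each row into its tercile with findInterval() and build the 0/1 columns directly. Note that findInterval()'s boundary handling differs slightly from cut() (values equal to a breakpoint fall in the upper bin); the toy data and column names here are illustrative:

```r
library(data.table)

# Toy data in the same shape as `test` above
test <- data.table(Date = as.Date(rep("2000-12-01", 6)),
                   Leverage = c(1, 5, 10, 20, 40, 60))

# Tercile index 1..3 of each Leverage value within its Date
test[, bin := findInterval(Leverage, quantile(Leverage, c(1/3, 2/3))) + 1L,
     by = Date]

# Expand the index into three mutually exclusive 0/1 dummy columns
test[, c("dquant1", "dquant2", "dquant3") :=
       lapply(1:3, function(i) as.integer(bin == i))]
```

Because each row gets exactly one bin, a dummy can never appear in two columns at once.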

How to calculate difference between data in different rows? [closed]

Closed. This question needs details or clarity. It is not currently accepting answers. Closed 6 years ago.
I have got monthly data in this format
PrecipMM Date
122.7 2004-01-01
54.2 2005-01-01
31.9 2006-01-01
100.5 2007-01-01
144.9 2008-01-01
96.4 2009-01-01
75.3 2010-01-01
94.8 2011-01-01
67.6 2012-01-01
93.0 2013-01-01
184.6 2014-01-01
101.0 2015-01-01
149.3 2016-01-01
50.2 2004-02-01
46.2 2005-02-01
57.7 2006-02-01
I want to calculate the differences of PrecipMM between the same month of different years.
My dream output is like this:
PrecipMM Date PrecipMM_diff
122.7 2004-01-01 NA
54.2 2005-01-01 -68.5
31.9 2006-01-01 -22.3
100.5 2007-01-01 68.6
144.9 2008-01-01 44.4
96.4 2009-01-01 -48.5
75.3 2010-01-01 -21.2
94.8 2011-01-01 19.5
67.6 2012-01-01 -27.2
93.0 2013-01-01 25.4
184.6 2014-01-01 91.6
101.0 2015-01-01 -83.6
149.3 2016-01-01 48.3
50.2 2004-02-01 NA
46.2 2005-02-01 -4.0
57.7 2006-02-01 11.5
I think diff() can do this but I have no idea how.
I think you can do this with lag combined with group_by from dplyr. Here's how:
library(dplyr)
library(lubridate) # makes dealing with dates easier
# Load your example data
df <- structure(list(PrecipMM = c(4.4, 66.7, 48.2, 60.9, 108.1, 109.2,
101.7, 38.1, 53.8, 71.9, 75.4, 67.1, 92.7, 115.3, 68.9, 38.9),
Date = structure(5:20, .Label = c("101.7", "108.1", "109.2",
"115.3", "1766-01-01", "1766-02-01", "1766-03-01", "1766-04-01",
"1766-05-01", "1766-06-01", "1766-07-01", "1766-08-01", "1766-09-01",
"1766-10-01", "1766-11-01", "1766-12-01", "1767-01-01", "1767-02-01",
"1767-03-01", "1767-04-01", "38.1", "38.9", "4.4", "48.2",
"53.8", "60.9", "66.7", "67.1", "68.9", "71.9", "75.4", "92.7"
), class = "factor")), class = "data.frame", row.names = c(NA,
-16L), .Names = c("PrecipMM", "Date"))
results <- df %>%
mutate(years = year(Date), months = month(Date)) %>%
group_by(months) %>%
arrange(years) %>%
mutate(lagged.rain = lag(PrecipMM), rain.diff = PrecipMM - lagged.rain)
results
# Source: local data frame [16 x 6]
# Groups: months [12]
#
# PrecipMM Date years months lagged.rain rain.diff
# (dbl) (fctr) (dbl) (dbl) (dbl) (dbl)
# 1 4.4 1766-01-01 1766 1 NA NA
# 2 92.7 1767-01-01 1767 1 4.4 88.3
# 3 66.7 1766-02-01 1766 2 NA NA
# 4 115.3 1767-02-01 1767 2 66.7 48.6
# 5 48.2 1766-03-01 1766 3 NA NA
# 6 68.9 1767-03-01 1767 3 48.2 20.7
# 7 60.9 1766-04-01 1766 4 NA NA
# 8 38.9 1767-04-01 1767 4 60.9 -22.0
# 9 108.1 1766-05-01 1766 5 NA NA
# 10 109.2 1766-06-01 1766 6 NA NA
# 11 101.7 1766-07-01 1766 7 NA NA
# 12 38.1 1766-08-01 1766 8 NA NA
# 13 53.8 1766-09-01 1766 9 NA NA
# 14 71.9 1766-10-01 1766 10 NA NA
# 15 75.4 1766-11-01 1766 11 NA NA
# 16 67.1 1766-12-01 1766 12 NA NA
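The diff() idea from the question can also work in base R, combined with ave() to group by month; a sketch using a few rows of the monthly data above:

```r
# A few rows of the monthly data from the question
df <- data.frame(PrecipMM = c(122.7, 54.2, 31.9, 50.2, 46.2),
                 Date = as.Date(c("2004-01-01", "2005-01-01", "2006-01-01",
                                  "2004-02-01", "2005-02-01")))

# Sort by month then year, and take NA-padded first differences within each month
df <- df[order(format(df$Date, "%m"), df$Date), ]
df$PrecipMM_diff <- ave(df$PrecipMM, format(df$Date, "%m"),
                        FUN = function(x) c(NA, diff(x)))
```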

geom_bar: extra gaps appear on the x-axis in my bar plot

My data follows this sequence:
deptime .count
1 4.5 6285
2 14.5 5901
3 24.5 6002
4 34.5 5401
5 44.5 5080
6 54.5 4567
7 104.5 3162
8 114.5 2784
9 124.5 1950
10 134.5 1800
11 144.5 1630
12 154.5 1076
13 204.5 738
14 214.5 556
15 224.5 544
16 234.5 650
17 244.5 392
18 254.5 309
19 304.5 356
20 314.5 364
My ggplot code:
ggplot(pplot, aes(x=deptime, y=.count)) + geom_bar(stat="identity",fill='#FF9966',width = 5) + labs(x="time", y="count")
output figure
There is a gap after each 100. Does anyone know how to fix it?
Thank You
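One likely cause of those gaps, sketched under the assumption that deptime encodes clock times as HHMM numbers: the values jump from x54.5 straight to (x+1)04.5 (there is no 60-99 minute range), leaving an empty band after each 100 on a continuous axis. Converting to minutes since midnight removes the jumps; the conversion below is illustrative:

```r
library(ggplot2)

# A few rows of the pplot data from the question
pplot <- data.frame(deptime = c(34.5, 44.5, 54.5, 104.5, 114.5),
                    .count  = c(5401, 5080, 4567, 3162, 2784))

# HHMM -> minutes since midnight: hours * 60 + minutes
pplot$minutes <- (pplot$deptime %/% 100) * 60 + pplot$deptime %% 100

p <- ggplot(pplot, aes(x = minutes, y = .count)) +
  geom_bar(stat = "identity", fill = "#FF9966", width = 5) +
  labs(x = "time (minutes since midnight)", y = "count")
```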

Make a new data set based on the number of observations (years) in each group

firm year inv value capital
1 1 1935 317.60 3078.50 2.80
2 1 1936 391.80 4661.70 52.60
3 1 1937 410.60 5387.10 156.90
4 1 1938 257.70 2792.20 209.20
5 1 1939 330.80 4313.20 203.40
6 1 1940 461.20 4643.90 207.20
7 1 1941 512.00 4551.20 255.20
8 1 1942 448.00 3244.10 303.70
9 2 1936 355.30 1807.10 50.50
10 2 1937 469.90 2676.30 118.10
11 2 1938 262.30 1801.90 260.20
12 3 1935 33.10 1170.60 97.80
13 4 1935 40.29 417.50 10.50
14 4 1936 72.76 837.80 10.20
15 4 1937 66.26 883.90 34.70
16 4 1938 51.60 437.90 51.80
17 4 1939 52.41 679.70 64.30
I want to make a new data set that includes only companies with at least 4 years of observations, because I will use lags 1-4 in a regression.
In this case, firms 1 and 4 belong in the new data set and firms 2 and 3 should be removed.
How can I use the subset command to make this new data set?
Or using data.table
library(data.table)
setDT(df)[, .SD[.N >= 4L], firm]
# firm year inv value capital
# 1: 1 1935 317.60 3078.5 2.8
# 2: 1 1936 391.80 4661.7 52.6
# 3: 1 1937 410.60 5387.1 156.9
# 4: 1 1938 257.70 2792.2 209.2
# 5: 1 1939 330.80 4313.2 203.4
# 6: 1 1940 461.20 4643.9 207.2
# 7: 1 1941 512.00 4551.2 255.2
# 8: 1 1942 448.00 3244.1 303.7
# 9: 4 1935 40.29 417.5 10.5
# 10: 4 1936 72.76 837.8 10.2
# 11: 4 1937 66.26 883.9 34.7
# 12: 4 1938 51.60 437.9 51.8
# 13: 4 1939 52.41 679.7 64.3
For big data sets binary search could be useful
setkey(setDT(df)[, indx := .N >= 4L, firm], indx)[J(TRUE)]
Or maybe just
setDT(df)[df[, indx := .N >= 4L, firm]$indx]
Or (as pointed out by @Arun), this seems the best one:
setDT(df)[, if(.N >= 4L) .SD, by = firm]
If you want to subset all those firms with 4 or more observations, you can do it like this:
df[ave(df$firm, df$firm, FUN = length) >= 4,]
# firm year inv value capital
#1 1 1935 317.60 3078.5 2.8
#2 1 1936 391.80 4661.7 52.6
#3 1 1937 410.60 5387.1 156.9
#4 1 1938 257.70 2792.2 209.2
#5 1 1939 330.80 4313.2 203.4
#6 1 1940 461.20 4643.9 207.2
#7 1 1941 512.00 4551.2 255.2
#8 1 1942 448.00 3244.1 303.7
#13 4 1935 40.29 417.5 10.5
#14 4 1936 72.76 837.8 10.2
#15 4 1937 66.26 883.9 34.7
#16 4 1938 51.60 437.9 51.8
#17 4 1939 52.41 679.7 64.3
Or using dplyr:
library(dplyr)
group_by(df, firm) %>% filter(n() >= 4)
A solution using table() and simple subsetting:
z <- table(dat$firm)
idx <- names(z)[z>=4]
with(dat, dat[firm %in% idx, ])
The result:
firm year inv value capital
1 1 1935 317.60 3078.5 2.8
2 1 1936 391.80 4661.7 52.6
3 1 1937 410.60 5387.1 156.9
4 1 1938 257.70 2792.2 209.2
5 1 1939 330.80 4313.2 203.4
6 1 1940 461.20 4643.9 207.2
7 1 1941 512.00 4551.2 255.2
8 1 1942 448.00 3244.1 303.7
13 4 1935 40.29 417.5 10.5
14 4 1936 72.76 837.8 10.2
15 4 1937 66.26 883.9 34.7
16 4 1938 51.60 437.9 51.8
17 4 1939 52.41 679.7 64.3
PS. To recreate the data from the question:
dat <- read.table(header=TRUE, text=" firm year inv value capital
1 1 1935 317.60 3078.50 2.80
2 1 1936 391.80 4661.70 52.60
3 1 1937 410.60 5387.10 156.90
4 1 1938 257.70 2792.20 209.20
5 1 1939 330.80 4313.20 203.40
6 1 1940 461.20 4643.90 207.20
7 1 1941 512.00 4551.20 255.20
8 1 1942 448.00 3244.10 303.70
9 2 1936 355.30 1807.10 50.50
10 2 1937 469.90 2676.30 118.10
11 2 1938 262.30 1801.90 260.20
12 3 1935 33.10 1170.60 97.80
13 4 1935 40.29 417.50 10.50
14 4 1936 72.76 837.80 10.20
15 4 1937 66.26 883.90 34.70
16 4 1938 51.60 437.90 51.80
17 4 1939 52.41 679.70 64.30")
