"by" in data.table (group by) - what am I missing? - r

I'm working with a big data.table and using 'by' to group by 3 variables.
My data.table is d and has the key "ma" (a 10-digit integer, shortened below).
But setting by="ma,year,month" (to me the more intuitive group-by statement) does not give what I want. For example, ma = 284 has 3 entries for Nov 2011, and ma = 672 has 2 entries for Dec 2011.
> d[,list(n=length(trx_num)),by=list(ma,year,month)]
ma year month n
1: 284 2011 12 3
2: 284 2012 1 1
3: 284 2011 11 5
4: 284 2011 11 1
5: 284 2011 11 2
---
5782971: 672 2012 7 1
5782972: 672 2011 12 1
5782973: 672 2012 2 1
5782974: 672 2011 12 1
5782975: 672 2012 1 1
Reversing the 'by' order, however, gives the desired result.
> d[,list(n=length(trx_num)),by=list(month,year,ma)]
month year ma n
1: 12 2011 284 3
2: 1 2012 284 1
3: 11 2011 284 8
4: 5 2012 543 7
5: 7 2012 543 3
---
1214686: 5 2012 672 28
1214687: 4 2012 672 13
1214688: 12 2011 672 11
1214689: 7 2012 672 9
1214690: 9 2012 672 11
What am I missing here? Thanks in advance.
EDIT:
str() of the data that gives the wrong result
> str(d)
Classes 'data.table' and 'data.frame': 14688135 obs. of 3 variables:
$ ma : num 3e+10 3e+10 3e+10 3e+10 3e+10 ...
$ year : int 2011 2012 2011 2011 2011 2011 2011 2011 2011 2011 ...
$ month: int 12 1 11 12 11 11 11 11 11 11 ...
- attr(*, ".internal.selfref")=<externalptr>
- attr(*, "sorted")= chr "ma"
str() of the wrong result:
> str(d[,.N,by=list(ma,year,month)])
Classes 'data.table' and 'data.frame': 5782975 obs. of 4 variables:
$ ma : num 3e+10 3e+10 3e+10 3e+10 3e+10 ...
$ year : int 2011 2012 2011 2011 2011 2012 2012 2012 2012 2012 ...
$ month: int 12 1 11 11 11 5 7 6 9 8 ...
$ N : int 3 1 5 1 2 1 1 1 1 1 ...
- attr(*, ".internal.selfref")=<externalptr>
And str() of right result:
> str(d[,.N,by=list(month,year,ma)])
Classes 'data.table' and 'data.frame': 1214690 obs. of 4 variables:
$ month: int 12 1 11 5 7 6 9 8 11 12 ...
$ year : int 2011 2012 2011 2012 2012 2012 2012 2012 2011 2011 ...
$ ma : num 3e+10 3e+10 3e+10 3e+10 3e+10 ...
$ N : int 3 1 8 7 3 12 15 3 6 6 ...
- attr(*, ".internal.selfref")=<externalptr>

To wrap up following the comment trail: the ma column was type numeric and contained values that were genuinely different but very close together, some pairs within machine tolerance of each other and some just outside it. In other words, this situation:
x < y < z
(y - x) just under machine tolerance, so x and y are considered equal
(z - y) just under machine tolerance, so y and z are considered equal
(z - x) just over machine tolerance, so x and z are considered not equal
When such a column is grouped alongside two other columns (i.e. by= 3 columns), the order of those 3 columns can change whether such values are considered equal (and hence land in the same group) or not.
The solution is not to use type numeric (i.e. double) for such data. Use integer, or, in this case where the integers were larger than 2^31 (which caused the coercion to double and the loss of accuracy, if I understand correctly), use character instead. data.table is fast at sorting integer and character; it is not yet as fast at sorting double.
We'll try to add a new warning to data.table:
FR#2469 Add new tolerance.warning option to detect and issue warning if any numeric values are close but not quite within machine tolerance
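To make the effect concrete, here is a minimal sketch (not the OP's data; setNumericRounding() was added to data.table after this exchange and is used here only to reproduce the old tolerance behaviour when grouping doubles):
library(data.table)
setNumericRounding(2)  # round off the last 2 bytes of doubles when grouping (the old default)
x  <- 3e10             # too large for a 32-bit integer, hence stored as double
dt <- data.table(ma = c(x, x + 1e-5, x + 2e-5), v = 1:3)
dt[, .N, by = ma]      # the three near-equal doubles collapse into one group
setNumericRounding(0)  # exact comparison (the modern default)
dt[, .N, by = ma]      # three separate groups
The fix for id-like values larger than 2^31 is to read or store them as integer or character rather than double, as described above.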

I built a small test case that, at one point in this dialog, I thought exhibited the unexpected behavior (but I was reading the wrong objects for comparison):
d <- structure(list(ma = c(284L, 284L, 284L, 284L, 284L, 284L, 284L,
284L, 284L, 284L, 284L, 284L, 672L, 672L, 672L, 672L, 672L),
year = c(2011L, 2011L, 2011L, 2012L, 2011L, 2011L, 2011L,
2011L, 2011L, 2011L, 2011L, 2011L, 2012L, 2011L, 2012L, 2011L,
2012L), month = c(12L, 12L, 12L, 1L, 11L, 11L, 11L, 11L,
11L, 11L, 11L, 11L, 7L, 12L, 2L, 12L, 1L), trx_num = c(4L,
9L, 8L, 4L, 4L, 6L, 3L, 8L, 2L, 2L, 8L, 9L, 8L, 6L, 10L,
6L, 10L)), .Names = c("ma", "year", "month", "trx_num"), row.names = c(NA,
-17L), class = c("data.table", "data.frame"), .internal.selfref = <pointer: 0x105afb0>, sorted = "ma")
To build it:
dat <- read.table(text=" ma year month n
284 2011 12 3
284 2012 1 1
284 2011 11 5
284 2011 11 1
284 2011 11 2
672 2012 7 1
672 2011 12 1
672 2012 2 1
672 2011 12 1
672 2012 1 1", header=TRUE)
require(data.table)
d <- data.table( data.frame(dat[rep(rownames(dat), times=dat$n), 1:3], trx_num=unlist(sapply(dat$n, sample, x=1:10)) ) )
setkey(d, ma)
d[,list(n=length(trx_num)),by=list(ma,year,month)]
d[,list(n=length(trx_num)),by=list(month,year,ma)]
At which point it becomes clear that BlueMagister's solution is correct:
d[,.N, by=list(month,year,ma)]
d[,.N, by=list(ma,year,month)] # same result modulo row order
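A quick check (a sketch using the d built above, where ma is integer) that the two by orders really produce the same groups apart from row order:
a <- d[, .N, by = list(ma, year, month)][order(ma, year, month)]
b <- d[, .N, by = list(month, year, ma)][order(ma, year, month), list(ma, year, month, N)]
identical(as.data.frame(a), as.data.frame(b))  # TRUE: same groups, only row order differed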

Related

How to calculate year to date and 12 months variations in r

I have data for multiple years (in this example only 3) and need to calculate year-to-date and 12-month variations in R. Year-to-date variations are computed as follows:
Year-to-date variation January 2021 = Value January 2021 / Value January 2020
Year-to-date variation February 2021 = Sum(Value January 2021 : February 2021) / Sum(Value January 2020 : February 2020)
Year-to-date variation March 2021 = Sum(Value January 2021 : March 2021) / Sum(Value January 2020 : March 2020)
12-month variations are computed as follows:
12-month variation January 2021 = Sum(Value February 2020 : January 2021) / Sum(Value February 2019 : January 2020)
12-month variation February 2021 = Sum(Value March 2020 : February 2021) / Sum(Value March 2019 : February 2020)
Year  Month  Value  Year-to-date variation  12-month variation
2019      1    182
2019      2    160
2019      3    170
2019      4    123
2019      5    165
2019      6    153
2019      7    152
2019      8    182
2019      9    156
2019     10    141
2019     11    161
2019     12    193
2020      1    143   0,785714285714286
2020      2    138   0,821637426900585
2020      3    113   0,76953125
2020      4    127   0,820472440944882
2020      5    107   0,785
2020      6    185   0,853095487932844
2020      7    125   0,848868778280543
2020      8    109   0,813519813519814
2020      9    167   0,841302841302841
2020     10    128   0,847222222222222
2020     11    139   0,848710601719198
2020     12    173   0,853457172342621
2021      1    111   0,776223776223776       0,854133754607688
2021      2    169   0,99644128113879        0,88066062866276
2021      3    125   1,02791878172589        0,914835164835165
2021      4    134   1,03454894433781        0,916666666666667
2021      5    112   1,03662420382166        0,94960362400906
2021      6    114   0,940959409594096       0,89321468298109
2021      7    171   0,997867803837953       0,932806324110672
2021      8    161   1,04775549188157        1,00353356890459
2021      9    195   1,06425041186161        1,01345816266823
2021     10    150   1,07451564828614        1,03419811320755
2021     11    181   1,09588116137745        1,07287933094385
2021     12    163   1,07980652962515        1,07980652962515
You can do this by leveraging the power of data.table::shift().
library(data.table)
library(magrittr)  # provides the %>% used to chain the data.table steps below
setDT(df)
df <- df[order(Month, Year), p := shift(Value), Month] %>%
  .[, ytd_var := cumsum(Value) / cumsum(p), Year] %>%
  .[, `:=`(mon_var = Reduce(`+`, shift(Value, 0:11)) / Reduce(`+`, shift(Value, 12:23)),
           p = NULL)]
Explanation:
First, get the prior-year value (p) for each month by ordering by Month and Year and applying shift(Value) within each Month group.
Then, create the year-to-date variation (ytd_var) by dividing the cumulative sum of Value by the cumulative sum of p, by Year.
Finally, create the 12-month variation (mon_var) by dividing the sum of Value over the current and prior 11 months by the sum of Value over months 12 through 23 prior to the current month. Notice that Reduce() is used to sum over all these shifted values.
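A quick sketch of that rolling-sum idiom on a toy vector (not the OP's data): shift(x, 0:k) returns a list of the vector lagged by 0 through k positions, and Reduce(`+`, ...) adds those vectors element-wise, yielding a trailing (k+1)-period sum.
library(data.table)
x <- 1:6
shift(x, 0:2)               # list of x lagged by 0, 1 and 2 positions
Reduce(`+`, shift(x, 0:2))  # NA NA 6 9 12 15 (trailing 3-period sum)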
Output:
Year Month Value ytd_var mon_var
1: 2019 1 182 NA NA
2: 2019 2 160 NA NA
3: 2019 3 170 NA NA
4: 2019 4 123 NA NA
5: 2019 5 165 NA NA
6: 2019 6 153 NA NA
7: 2019 7 152 NA NA
8: 2019 8 182 NA NA
9: 2019 9 156 NA NA
10: 2019 10 141 NA NA
11: 2019 11 161 NA NA
12: 2019 12 193 NA NA
13: 2020 1 143 0.7857143 NA
14: 2020 2 138 0.8216374 NA
15: 2020 3 113 0.7695312 NA
16: 2020 4 127 0.8204724 NA
17: 2020 5 107 0.7850000 NA
18: 2020 6 185 0.8530955 NA
19: 2020 7 125 0.8488688 NA
20: 2020 8 109 0.8135198 NA
21: 2020 9 167 0.8413028 NA
22: 2020 10 128 0.8472222 NA
23: 2020 11 139 0.8487106 NA
24: 2020 12 173 0.8534572 0.8534572
25: 2021 1 111 0.7762238 0.8541338
26: 2021 2 169 0.9964413 0.8806606
27: 2021 3 125 1.0279188 0.9148352
28: 2021 4 134 1.0345489 0.9166667
29: 2021 5 112 1.0366242 0.9496036
30: 2021 6 114 0.9409594 0.8932147
31: 2021 7 171 0.9978678 0.9328063
32: 2021 8 161 1.0477555 1.0035336
33: 2021 9 195 1.0642504 1.0134582
34: 2021 10 150 1.0745156 1.0341981
35: 2021 11 181 1.0958812 1.0728793
36: 2021 12 163 1.0798065 1.0798065
Input:
df = structure(list(Year = c(2019L, 2019L, 2019L, 2019L, 2019L, 2019L,
2019L, 2019L, 2019L, 2019L, 2019L, 2019L, 2020L, 2020L, 2020L,
2020L, 2020L, 2020L, 2020L, 2020L, 2020L, 2020L, 2020L, 2020L,
2021L, 2021L, 2021L, 2021L, 2021L, 2021L, 2021L, 2021L, 2021L,
2021L, 2021L, 2021L), Month = c(1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L,
9L, 10L, 11L, 12L, 1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 11L,
12L, 1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 11L, 12L), Value = c(182L,
160L, 170L, 123L, 165L, 153L, 152L, 182L, 156L, 141L, 161L, 193L,
143L, 138L, 113L, 127L, 107L, 185L, 125L, 109L, 167L, 128L, 139L,
173L, 111L, 169L, 125L, 134L, 112L, 114L, 171L, 161L, 195L, 150L,
181L, 163L)), row.names = c(NA, -36L), class = "data.frame")

How to change the schedule work for a range of possible time combinations in R

This is my dataset dput()
timeset=structure(list(SAP = c("S412", "S412", "S412", "S412", "S412",
"S412", "S412", "S412", "S412", "S412", "S412", "S412", "S412",
"S412", "S412", "S412", "S412"), weekday = c(1L, 1L, 2L, 2L,
3L, 3L, 4L, 4L, 4L, 5L, 5L, 5L, 6L, 6L, 6L, 7L, 7L), tab = c(1001L,
1002L, 1001L, 1003L, 1001L, 1002L, 1001L, 1002L, 1003L, 1001L,
1002L, 1003L, 1001L, 1002L, 1003L, 1001L, 1003L), date = c(20220411L,
20220411L, 20220412L, 20220412L, 20220413L, 20220413L, 20220414L,
20220414L, 20220414L, 20220415L, 20220415L, 20220415L, 20220416L,
20220416L, 20220416L, 20220417L, 20220417L), stuff_code = c(801L,
690L, 690L, 690L, 1180L, 690L, 1180L, 690L, 690L, 1180L, 690L,
690L, 1180L, 690L, 690L, 1180L, 690L), TS = c(9L, 9L, 9L, 9L,
9L, 9L, 9L, 9L, 9L, 9L, 9L, 9L, 9L, 9L, 9L, 9L, 9L), TE = c(21L,
21L, 21L, 21L, 21L, 21L, 21L, 21L, 21L, 21L, 21L, 21L, 21L, 21L,
21L, 21L, 21L)), class = "data.frame", row.names = c(NA, -17L
))
I am having real difficulty with two possible staff-scheduling situations.
The first situation
Sometimes on a given day (for example, weekday=4) two people with stuff_code=690 work together with one person with stuff_code=1180. If the two stuff_code=690 staff have the same start time TS and end time TE, I want to stagger them over 10-hour shifts: for example, the first works 9-19 and the second 11-21, so both work 10 hours and the store stays covered over its full 9-21 range, instead of both being set to 9-19. (We do not touch stuff_code=1180; it only signals the situation of two 690s plus one 1180.)
The second situation
There are days when only the two stuff_code=690 staff work at the same time (weekday=2). Their hours need to be staggered over the same 10-hour range: the first 9-19 and the second 11-21 (or, if the day runs 10-22, one works 10-20 and the other 12-22).
We do the same if on some day, as in weekday=1, only two people work but one has stuff_code=801 and the other stuff_code=690 and they share the same hours: then we stagger them with 801 taking the earliest start, e.g. 801 works 9-19 and 690 works 11-21.
What is the simplest way to change the schedules in such situations?
Desired output, which I produced manually in Excel:
SAP weekday tab date stuff_code TS TE
**S412 1 1001 20220411 801 9 19
S412 1 1002 20220411 690 11 21
S412 2 1001 20220412 690 9 19
S412 2 1003 20220412 690 11 21**
S412 3 1001 20220413 1180 9 21
S412 3 1002 20220413 690 9 21
**S412 4 1001 20220414 1180 9 21
S412 4 1002 20220414 690 9 19
S412 4 1003 20220414 690 11 21
S412 5 1001 20220415 1180 9 21
S412 5 1002 20220415 690 9 19
S412 5 1003 20220415 690 11 21
S412 6 1001 20220416 1180 9 21
S412 6 1002 20220416 690 9 19
S412 6 1003 20220416 690 11 21**
S412 7 1001 20220417 1180 9 21
S412 7 1003 20220417 690 9 21
** marks the weekdays where I changed the hours.
You can write a small function, f, to handle the logic, and then apply it to each date:
f <- function(cd, s, e) {
  ## stagger the non-1180 staff: first keeps the earliest start, second starts 2h later
  new_hours <- function(cd, s, e) {
    s[cd != 1180] <- c(min(s), min(s) + 2)
    e[cd != 1180] <- c(max(e) - 2, max(e))
    list(s, e)
  }
  ## Situations 1 and 2: two code-690 staff sharing the same start and end times
  if (sum(cd == 690) == 2 & length(unique(s[cd == 690])) == 1 & length(unique(e[cd == 690])) == 1) {
    newh <- new_hours(cd, s, e)
    s <- newh[[1]]
    e <- newh[[2]]
  }
  ## Situation 3: exactly two staff, one code 801 and one code 690
  if (length(cd) == 2 & (690 %in% cd) & (801 %in% cd)) {
    if (length(unique(s)) == 1 & length(unique(e)) == 1) {
      newh <- new_hours(cd, s, e)
      s <- newh[[1]][order(cd, decreasing = TRUE)]  # 801 takes the earliest start
      e <- newh[[2]][order(cd, decreasing = TRUE)]
    }
  }
  return(list(as.integer(cd), as.integer(s), as.integer(e)))
}
library(data.table)
setDT(timeset)
timeset[, c("stuff_code", "TS", "TE"):=f(stuff_code, TS, TE), by=date]
Output:
SAP weekday tab date stuff_code TS TE
<char> <int> <int> <int> <int> <int> <int>
1: S412 1 1001 20220411 801 9 19
2: S412 1 1002 20220411 690 11 21
3: S412 2 1001 20220412 690 9 19
4: S412 2 1003 20220412 690 11 21
5: S412 3 1001 20220413 1180 9 21
6: S412 3 1002 20220413 690 9 21
7: S412 4 1001 20220414 1180 9 21
8: S412 4 1002 20220414 690 9 19
9: S412 4 1003 20220414 690 11 21
10: S412 5 1001 20220415 1180 9 21
11: S412 5 1002 20220415 690 9 19
12: S412 5 1003 20220415 690 11 21
13: S412 6 1001 20220416 1180 9 21
14: S412 6 1002 20220416 690 9 19
15: S412 6 1003 20220416 690 11 21
16: S412 7 1001 20220417 1180 9 21
17: S412 7 1003 20220417 690 9 21

How do I duplicate and add rows between the values of two different columns?

I'm trying to duplicate each observation for all of the years that fall between "styear" and "endyear". So, for example, there should end up being 119 USA rows covering the years 1898-2016.
Here's the data:
# A tibble: 14 x 9
stateabb ccode styear stmonth stday endyear endmonth endday version
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 USA 2 1898 8 13 2016 12 31 2016
2 UKG 200 1816 1 1 2016 12 31 2016
3 FRN 220 1816 1 1 1940 6 22 2016
4 FRN 220 1945 8 15 2016 12 31 2016
5 GMY 255 1816 1 1 1918 11 11 2016
6 GMY 255 1925 1 1 1945 5 7 2016
7 GMY 255 1991 12 11 2016 12 31 2016
8 AUH 300 1816 1 1 1918 11 3 2016
9 ITA 325 1860 1 1 1943 9 2 2016
10 RUS 365 1816 1 1 1917 12 5 2016
11 RUS 365 1922 1 1 2016 12 31 2016
12 CHN 710 1950 1 1 2016 12 31 2016
13 JPN 740 1895 4 1 1945 8 14 2016
14 JPN 740 1991 12 11 2016 12 31 2016
I have tried various combinations of slice() and seq() and complete(), but I think I'm just too new at coding to a) know what to do and b) how to really understand other answers to similar questions.
Ultimately, I am merging this data with other data and creating 0/1 dummy variable to indicate if a country was a "great power" in a given year. The easiest way I thought of was to do this by creating individual rows for each year a country was a great power (the data in this question) because the data I am merging it with is also in the country-year format. I am open to other options, though, if something else is easier!
Thank you!
I think tidyr::expand() and full_seq() can achieve what you want, with grouping on stateabb and styear since you have multiple start years for some states.
Assuming your data frame is named mydata, something like this should work. I have retained the column of expanded years, named filled_year, but you may want to remove it.
library(dplyr)
library(tidyr)
new_data <- mydata %>%
  group_by(stateabb, styear) %>%
  tidyr::expand(stateabb, full_seq(c(styear, endyear), 1)) %>%
  inner_join(mydata) %>%
  rename(filled_year = `full_seq(c(styear, endyear), 1)`) %>%
  ungroup()
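full_seq() does the filling here: it generates the complete regular sequence spanned by its input. A quick illustration:
library(tidyr)
full_seq(c(1898, 2016), period = 1)  # 1898, 1899, ..., 2016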
The top and bottom of the USA rows:
new_data %>%
  filter(stateabb == "USA") %>%
  head()
# A tibble: 6 x 10
styear stateabb filled_year ccode stmonth stday endyear endmonth endday version
<int> <chr> <dbl> <int> <int> <int> <int> <int> <int> <int>
1 1898 USA 1898 2 8 13 2016 12 31 2016
2 1898 USA 1899 2 8 13 2016 12 31 2016
3 1898 USA 1900 2 8 13 2016 12 31 2016
4 1898 USA 1901 2 8 13 2016 12 31 2016
5 1898 USA 1902 2 8 13 2016 12 31 2016
6 1898 USA 1903 2 8 13 2016 12 31 2016
new_data %>%
  filter(stateabb == "USA") %>%
  tail()
# A tibble: 6 x 10
styear stateabb filled_year ccode stmonth stday endyear endmonth endday version
<int> <chr> <dbl> <int> <int> <int> <int> <int> <int> <int>
1 1898 USA 2011 2 8 13 2016 12 31 2016
2 1898 USA 2012 2 8 13 2016 12 31 2016
3 1898 USA 2013 2 8 13 2016 12 31 2016
4 1898 USA 2014 2 8 13 2016 12 31 2016
5 1898 USA 2015 2 8 13 2016 12 31 2016
6 1898 USA 2016 2 8 13 2016 12 31 2016
Your example data:
mydata <- structure(list(stateabb = c("USA", "UKG", "FRN", "FRN", "GMY",
"GMY", "GMY", "AUH", "ITA", "RUS", "RUS", "CHN", "JPN", "JPN"
), ccode = c(2L, 200L, 220L, 220L, 255L, 255L, 255L, 300L, 325L,
365L, 365L, 710L, 740L, 740L), styear = c(1898L, 1816L, 1816L,
1945L, 1816L, 1925L, 1991L, 1816L, 1860L, 1816L, 1922L, 1950L,
1895L, 1991L), stmonth = c(8L, 1L, 1L, 8L, 1L, 1L, 12L, 1L, 1L,
1L, 1L, 1L, 4L, 12L), stday = c(13L, 1L, 1L, 15L, 1L, 1L, 11L,
1L, 1L, 1L, 1L, 1L, 1L, 11L), endyear = c(2016L, 2016L, 1940L,
2016L, 1918L, 1945L, 2016L, 1918L, 1943L, 1917L, 2016L, 2016L,
1945L, 2016L), endmonth = c(12L, 12L, 6L, 12L, 11L, 5L, 12L,
11L, 9L, 12L, 12L, 12L, 8L, 12L), endday = c(31L, 31L, 22L, 31L,
11L, 7L, 31L, 3L, 2L, 5L, 31L, 31L, 14L, 31L), version = c(2016L,
2016L, 2016L, 2016L, 2016L, 2016L, 2016L, 2016L, 2016L, 2016L,
2016L, 2016L, 2016L, 2016L)), class = "data.frame", row.names = c("1",
"2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12", "13",
"14"))
My guess is that there is a better way to do this, but here is one way with a small reduced data set. The steps are:
1. Make a minimal dataset.
2. Make a function that creates the sequence of years you want.
3. Split the dataframe into a list of small dataframes by country with dplyr's group_split().
4. Apply the function to each list element with purrr's map(), producing the expanded-date dataframes.
5. Bind the list of dataframes back into one dataframe.
library(dplyr)
library(purrr)
df <- data.frame(
  stringsAsFactors = FALSE,
  stateabb = c("USA", "UKG"),
  styear = c(1898L, 1816L),
  endyear = c(2016L, 2016L)
)
expand_dates <- function(df) {
  stateabb <- df %>% pluck("stateabb")
  styear <- df %>% pluck("styear")
  endyear <- df %>% pluck("endyear")
  years <- seq(styear, endyear)
  data.frame(years) %>%
    mutate(stateabb = stateabb, styear = styear, endyear = endyear)
}
df_new <- df %>%
  group_split(stateabb) %>%
  map(expand_dates) %>%
  bind_rows()
head(df_new)
#> years stateabb styear endyear
#> 1 1816 UKG 1816 2016
#> 2 1817 UKG 1816 2016
#> 3 1818 UKG 1816 2016
#> 4 1819 UKG 1816 2016
#> 5 1820 UKG 1816 2016
#> 6 1821 UKG 1816 2016
Created on 2022-01-13 by the reprex package (v2.0.1)

filling in missing data using fitted value in R

I have a dataframe like this:
ID year age wage
1 2 1981 22 10000
2 2 1982 23 11000
3 2 1983 24 11500
4 2 1984 25 11000
5 2 1985 26 14000
6 2 1986 27 16000
7 2 1987 28 20000
8 2 1988 29 19000
9 2 1989 30 20000
10 2 1990 31 20000
11 2 1991 32 22000
12 2 1992 33 25000
13 2 1993 34 0
14 2 1994 35 NA
15 2 1995 36 0
16 2 1996 37 NA
17 2 1997 38 0
18 2 1998 39 NA
19 2 1999 40 0
20 2 2000 41 NA
21 2 2001 42 0
22 2 2002 43 NA
23 2 2003 44 0
24 2 2004 45 NA
25 2 2005 46 5500
26 2 2006 47 NA
27 2 2007 48 5000
28 2 2008 49 NA
29 2 2009 50 6000
30 2 2010 51 NA
31 2 2011 52 19000
32 2 2012 53 NA
33 2 2013 54 21000
34 2 2014 55 NA
35 2 2015 56 23000
36 3 1984 22 1300
37 3 1985 23 0
38 3 1986 24 1500
39 3 1987 25 1000
40 3 1988 26 0
I want to use an individual-specific regression of wage on age and age-squared to impute missing wage observations, and I only want to impute when at least 5 non-missing observations are available.
As suggested by jay.sf, I tried the following but with fitted values:
df_imp <- do.call(rbind,
  by(df, df$ID, function(x) {
    IDs <- which(is.na(x$wage))
    if (length(x$wage[-IDs]) >= 5) {
      b <- lm(wage ~ poly(age, 2, raw = TRUE), x)$fitted.values
      x$wage[IDs] <- with(x, b)[IDs]
    }
    return(x)
  }))
I got the following results:
ID year age wage
36 2 1981 22 10000.000
37 2 1982 23 11000.000
38 2 1983 24 11500.000
39 2 1984 25 11000.000
40 2 1985 26 14000.000
41 2 1986 27 16000.000
42 2 1987 28 20000.000
43 2 1988 29 19000.000
44 2 1989 30 20000.000
45 2 1990 31 20000.000
46 2 1991 32 22000.000
47 2 1992 33 25000.000
48 2 1993 34 0.000
49 2 1994 35 7291.777
50 2 1995 36 0.000
51 2 1996 37 6779.133
52 2 1997 38 0.000
53 2 1998 39 7591.597
54 2 1999 40 0.000
55 2 2000 41 9729.168
56 2 2001 42 0.000
57 2 2002 43 13191.847
58 2 2003 44 0.000
59 2 2004 45 17979.633
60 2 2005 46 5500.000
61 2 2006 47 NA
62 2 2007 48 5000.000
63 2 2008 49 NA
64 2 2009 50 6000.000
65 2 2010 51 NA
66 2 2011 52 19000.000
67 2 2012 53 NA
68 2 2013 54 21000.000
69 2 2014 55 NA
70 2 2015 56 23000.000
You could use a simple if statement, without an else. Define an index vector IDs that identifies the missings, which you use both to count them and to subset your Y column wage.
For this you can use by(), which splits your data like split() does but lets you apply a function to each piece; just rbind the result.
It's probably wiser to use the coefficients rather than the fitted values, because fitted values only exist for the non-NA rows, so they cannot be indexed to fill the missing ones. And you need to use raw=TRUE in the poly.
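To see why indexing the fitted values misfires, a minimal sketch (toy vectors, not the OP's data): lm() drops NA rows by default, so the fitted vector is shorter than the data and its positions no longer line up with the original row numbers.
y <- c(1, NA, 3, 4, NA, 6)
x <- 1:6
length(lm(y ~ x)$fitted.values)  # 4, not 6: positions no longer match the rows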
DF.imp <- do.call(rbind,
  by(DF, DF$ID, function(x) {
    IDs <- which(is.na(x$wage))
    if (length(x$wage[-IDs]) >= 5) {
      b <- lm(wage ~ poly(age, 2, raw = TRUE), x)$coefficients
      x$wage[IDs] <- with(x, (b[1] + b[2]*age + b[3]*age^2))[IDs]
    }
    return(x)
  }))
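An equivalent alternative (a sketch under the same assumptions, using the DF below) is to let predict() evaluate the fitted polynomial on the missing rows rather than writing out b[1] + b[2]*age + b[3]*age^2:
DF.imp2 <- do.call(rbind,
  by(DF, DF$ID, function(x) {
    IDs <- which(is.na(x$wage))
    if (length(x$wage[-IDs]) >= 5) {
      fit <- lm(wage ~ poly(age, 2, raw = TRUE), x)
      x$wage[IDs] <- predict(fit, newdata = x[IDs, ])  # evaluates the polynomial for those rows
    }
    return(x)
  }))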
Note that I've slightly changed your example data, so that ID 3 also has missings, but less than 5 non-missings.
Result
DF.imp
# ID year age wage
# 2.1 2 1981 22 10000.000
# 2.2 2 1982 23 11000.000
# 2.3 2 1983 24 11500.000
# 2.4 2 1984 25 11000.000
# 2.5 2 1985 26 14000.000
# 2.6 2 1986 27 16000.000
# 2.7 2 1987 28 20000.000
# 2.8 2 1988 29 19000.000
# 2.9 2 1989 30 20000.000
# 2.10 2 1990 31 20000.000
# 2.11 2 1991 32 22000.000
# 2.12 2 1992 33 25000.000
# 2.13 2 1993 34 0.000
# 2.14 2 1994 35 7626.986
# 2.15 2 1995 36 0.000
# 2.16 2 1996 37 7039.387
# 2.17 2 1997 38 0.000
# 2.18 2 1998 39 6783.065
# 2.19 2 1999 40 0.000
# 2.20 2 2000 41 6858.020
# 2.21 2 2001 42 0.000
# 2.22 2 2002 43 7264.252
# 2.23 2 2003 44 0.000
# 2.24 2 2004 45 8001.761
# 2.25 2 2005 46 5500.000
# 2.26 2 2006 47 9070.546
# 2.27 2 2007 48 5000.000
# 2.28 2 2008 49 10470.609
# 2.29 2 2009 50 6000.000
# 2.30 2 2010 51 12201.948
# 2.31 2 2011 52 19000.000
# 2.32 2 2012 53 14264.565
# 2.33 2 2013 54 21000.000
# 2.34 2 2014 55 16658.458
# 2.35 2 2015 56 23000.000
# 3.36 3 1984 22 1300.000
# 3.37 3 1985 23 NA
# 3.38 3 1986 24 1500.000
# 3.39 3 1987 25 1000.000
# 3.40 3 1988 26 NA
Data
DF <- structure(list(ID = c(2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L), year = c(1981L,
1982L, 1983L, 1984L, 1985L, 1986L, 1987L, 1988L, 1989L, 1990L,
1991L, 1992L, 1993L, 1994L, 1995L, 1996L, 1997L, 1998L, 1999L,
2000L, 2001L, 2002L, 2003L, 2004L, 2005L, 2006L, 2007L, 2008L,
2009L, 2010L, 2011L, 2012L, 2013L, 2014L, 2015L, 1984L, 1985L,
1986L, 1987L, 1988L), age = c(22L, 23L, 24L, 25L, 26L, 27L, 28L,
29L, 30L, 31L, 32L, 33L, 34L, 35L, 36L, 37L, 38L, 39L, 40L, 41L,
42L, 43L, 44L, 45L, 46L, 47L, 48L, 49L, 50L, 51L, 52L, 53L, 54L,
55L, 56L, 22L, 23L, 24L, 25L, 26L), wage = c(10000L, 11000L,
11500L, 11000L, 14000L, 16000L, 20000L, 19000L, 20000L, 20000L,
22000L, 25000L, 0L, NA, 0L, NA, 0L, NA, 0L, NA, 0L, NA, 0L, NA,
5500L, NA, 5000L, NA, 6000L, NA, 19000L, NA, 21000L, NA, 23000L,
1300L, NA, 1500L, 1000L, NA)), row.names = c(NA, -40L), class = "data.frame")

Merge lines with same ID and take average value

From the table below I need to combine the lines by calculating the average value for those lines with the same ID (column 2).
I was thinking of the plyr function??
ddply(df, summarize, value = average(ID))
df:
miRNA ID 100G 100R 106G 106R 122G 122R 124G 124R 126G 126R 134G 134R 141G 141R 167G 167R 185G 185R
1 hsa-miR-106a ID7 1585 423 180 113 598 266 227 242 70 106 2703 442 715 309 546 113 358 309
2 hsa-miR-1185-1 ID2 10 1 3 3 11 8 4 4 28 2 13 3 6 3 6 4 7 5
3 hsa-miR-1185-2 ID2 2 0 2 1 5 1 1 0 4 1 1 1 3 2 2 0 2 1
4 hsa-miR-1197 ID2 2 0 0 5 3 3 0 4 16 0 4 1 3 0 0 2 2 4
5 hsa-miR-127 ID3 29 17 6 55 40 35 6 20 171 10 32 21 23 25 10 14 32 55
Summary of original data:
> str(ClusterMatrix)
'data.frame': 113 obs. of 98 variables:
$ miRNA: Factor w/ 202 levels "hsa-miR-106a",..: 1 3 4 6 8 8 14 15 15 16 ...
$ ID : Factor w/ 27 levels "ID1","ID10","ID11",..: 25 12 12 12 21 21 12 21 21 6 ...
$ 100G : Factor w/ 308 levels "-0.307749042739963",..: 279 11 3 3 101 42 139 158 215 222 ...
$ 100R : Factor w/ 316 levels "-0.138028803567403",..: 207 7 8 8 18 42 128 183 232 209 ...
$ 106G : Factor w/ 260 levels "-0.103556709881933",..: 171 4 1 3 7 258 95 110 149 162 ...
$ 106R : Factor w/ 300 levels "-0.141810346640204",..: 141 4 6 2 108 41 146 196 244 267 ...
$ 122G : Factor w/ 336 levels "-0.0409548922061764",..: 237 12 4 6 103 47 148 203 257 264 ...
$ 122R : Factor w/ 316 levels "-0.135708706475279",..: 177 1 8 6 36 44 131 192 239 244 ...
$ 124G : Factor w/ 267 levels "-0.348439853247856",..: 210 5 2 3 7 50 126 138 188 249 ...
$ 124R : Factor w/ 303 levels "-0.176414190219115",..: 193 3 7 3 21 52 167 200 238 239 ...
$ 126G : Factor w/ 307 levels "-0.227658806811544",..: 122 88 5 76 169 61 240 220 281 265 ...
$ 126R : Factor w/ 249 levels "-0.271925865853123",..: 119 1 2 3 11 247 78 110 151 193 ...
$ 134G : Factor w/ 344 levels "-0.106333543799583",..: 304 14 8 5 33 48 150 196 248 231 ...
$ 134R : Factor w/ 300 levels "-0.0997616469801097",..: 183 5 7 7 22 298 113 159 213 221 ...
$ 141G : Factor w/ 335 levels "-0.134429748398679",..: 253 7 3 3 24 29 142 137 223 302 ...
$ 141R : Factor w/ 314 levels "-0.143299688877927",..: 210 4 5 7 98 54 154 199 255 251 ...
$ 167G : Factor w/ 306 levels "-0.211181452126958",..: 222 7 4 6 11 292 91 101 175 226 ...
$ 167R : Factor w/ 282 levels "-0.0490740880560127",..: 130 2 6 4 15 282 110 146 196 197 ...
$ 185G : Factor w/ 317 levels "-0.0567841338235346",..: 218 2 7 7 33 34 130 194 227 259 ...
We can use dplyr. We group by 'ID' and use mutate_each to create columns holding the mean of '100G' through '185R', selecting those columns in mutate_each with a regex pattern in matches(). Then we cbind (bind_cols) the original dataset with the mean columns, converting to data.frame if needed. We also rename the mean columns with a 'Mean_' prefix.
library(dplyr)
out <- df1 %>%
group_by(ID) %>%
mutate_each(funs(mean=mean(., na.rm=TRUE)), matches('^\\d+')) %>%
setNames(., c(names(.)[1:2], paste0('Mean_', names(.)[3:ncol(.)]))) %>%
as.data.frame()
out1 <- bind_cols(df1, out[-(1:2)])
out1
# miRNA ID 100G 100R 106G 106R 122G 122R 124G 124R 126G 126R 134G
#1 hsa-miR-106a ID7 1585 423 180 113 598 266 227 242 70 106 2703
#2 hsa-miR-1185-1 ID2 10 1 3 3 11 8 4 4 28 2 13
#3 hsa-miR-1185-2 ID2 2 0 2 1 5 1 1 0 4 1 1
#4 hsa-miR-1197 ID2 2 0 0 5 3 3 0 4 16 0 4
#5 hsa-miR-127 ID3 29 17 6 55 40 35 6 20 171 10 32
# 134R 141G 141R 167G 167R 185G 185R Mean_100G Mean_100R Mean_106G
#1 442 715 309 546 113 358 309 1585.000000 423.0000000 180.000000
#2 3 6 3 6 4 7 5 4.666667 0.3333333 1.666667
#3 1 3 2 2 0 2 1 4.666667 0.3333333 1.666667
#4 1 3 0 0 2 2 4 4.666667 0.3333333 1.666667
#5 21 23 25 10 14 32 55 29.000000 17.0000000 6.000000
# Mean_106R Mean_122G Mean_122R Mean_124G Mean_124R Mean_126G Mean_126R
#1 113 598.000000 266 227.000000 242.000000 70 106
#2 3 6.333333 4 1.666667 2.666667 16 1
#3 3 6.333333 4 1.666667 2.666667 16 1
#4 3 6.333333 4 1.666667 2.666667 16 1
#5 55 40.000000 35 6.000000 20.000000 171 10
# Mean_134G Mean_134R Mean_141G Mean_141R Mean_167G Mean_167R Mean_185G
#1 2703 442.000000 715 309.000000 546.000000 113 358.000000
#2 6 1.666667 4 1.666667 2.666667 2 3.666667
#3 6 1.666667 4 1.666667 2.666667 2 3.666667
#4 6 1.666667 4 1.666667 2.666667 2 3.666667
#5 32 21.000000 23 25.000000 10.000000 14 32.000000
# Mean_185R
#1 309.000000
#2 3.333333
#3 3.333333
#4 3.333333
#5 55.000000
EDIT: If we need a single row of means for each 'ID', we can use summarise_each:
df1 %>%
group_by(ID) %>%
summarise_each(funs(mean=mean(., na.rm=TRUE)), matches('^\\d+'))
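mutate_each() and summarise_each() have since been deprecated in dplyr; a modern equivalent of the summary above (a sketch, assuming dplyr >= 1.0) is:
library(dplyr)
df1 %>%
  group_by(ID) %>%
  summarise(across(matches('^\\d+'), ~ mean(.x, na.rm = TRUE), .names = 'Mean_{.col}'))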
EDIT2: Based on the OP's update, the columns of the original dataset ('ClusterMatrix') are all factor class. We need to convert them to numeric before taking the mean. There are two options for converting a factor to numeric: 1) as.numeric(as.character(..)), which may be a bit slower, or 2) as.numeric(levels(..))[..], which is faster. Here I use the first method as it may be clearer.
ClusterMatrix %>%
group_by(ID) %>%
summarise_each(funs(mean= mean(as.numeric(as.character(.)),
na.rm=TRUE)), matches('^\\d+'))
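For reference, the second, faster idiom converts each unique level once and then indexes, instead of converting every element:
f <- factor(c("10", "2.5", "10"))
as.numeric(as.character(f))  # converts all 3 elements
as.numeric(levels(f))[f]     # converts the 2 levels once, then indexes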
data
df1 <- structure(list(miRNA = c("hsa-miR-106a", "hsa-miR-1185-1",
"hsa-miR-1185-2",
"hsa-miR-1197", "hsa-miR-127"), ID = c("ID7", "ID2", "ID2", "ID2",
"ID3"), `100G` = c(1585L, 10L, 2L, 2L, 29L), `100R` = c(423L,
1L, 0L, 0L, 17L), `106G` = c(180L, 3L, 2L, 0L, 6L), `106R` = c(113L,
3L, 1L, 5L, 55L), `122G` = c(598L, 11L, 5L, 3L, 40L), `122R` = c(266L,
8L, 1L, 3L, 35L), `124G` = c(227L, 4L, 1L, 0L, 6L), `124R` = c(242L,
4L, 0L, 4L, 20L), `126G` = c(70L, 28L, 4L, 16L, 171L), `126R` = c(106L,
2L, 1L, 0L, 10L), `134G` = c(2703L, 13L, 1L, 4L, 32L), `134R` = c(442L,
3L, 1L, 1L, 21L), `141G` = c(715L, 6L, 3L, 3L, 23L), `141R` = c(309L,
3L, 2L, 0L, 25L), `167G` = c(546L, 6L, 2L, 0L, 10L), `167R` = c(113L,
4L, 0L, 2L, 14L), `185G` = c(358L, 7L, 2L, 2L, 32L), `185R` = c(309L,
5L, 1L, 4L, 55L)), .Names = c("miRNA", "ID", "100G", "100R",
"106G", "106R", "122G", "122R", "124G", "124R", "126G", "126R",
"134G", "134R", "141G", "141R", "167G", "167R", "185G", "185R"
), class = "data.frame", row.names = c("1", "2", "3", "4", "5"
))
