I am new to R and I can't work out how to aggregate my weekly data into yearly data.
Week obs
1 2004-01-04 23
2 2004-01-11 36
3 2004-01-18 18
4 2004-01-25 26
5 2004-02-01 17
6 2004-02-08 17
7 2004-02-15 26
8 2004-02-22 21
9 2004-02-29 34
10 2004-03-07 21
11 2004-03-14 30
12 2004-03-21 31
13 2004-03-28 31
14 2004-04-04 38
15 2004-04-11 14
16 2004-04-18 16
17 2004-04-25 44
18 2004-05-02 17
19 2004-05-09 43
20 2004-05-16 31
21 2004-05-23 31
22 2004-05-30 33
23 2004-06-06 13
24 2004-06-13 13
25 2004-06-20 46
26 2004-06-27 34
27 2004-07-04 27
28 2004-07-11 24
29 2004-07-18 20
30 2004-07-25 29
31 2004-08-01 29
32 2004-08-08 12
33 2004-08-15 16
34 2004-08-22 26
35 2004-08-29 29
36 2004-09-05 27
37 2004-09-12 8
38 2004-09-19 18
39 2004-09-26 14
40 2004-10-03 25
41 2004-10-10 26
42 2004-10-17 11
43 2004-10-24 24
44 2004-10-31 17
45 2004-11-07 11
46 2004-11-14 19
47 2004-11-21 8
48 2004-11-28 16
49 2004-12-05 19
50 2004-12-12 14
51 2004-12-19 13
52 2004-12-26 29
I want to just retain
2004 1215
Using data.table, assuming df$Week is of class Date:
library(data.table)
setDT(df)[, .(abs = sum(obs)), by = .(year = year(Week))]
# year abs
#1: 2004 1215
In base R,
aggregate(df$obs, list(year = format(df$Week, '%Y')), sum)
# year x
# 1 2004 1215
or with lubridate
library(lubridate)
aggregate(df$obs, list(year = year(df$Week)), sum)
# year x
# 1 2004 1215
or with lubridate and dplyr
library(dplyr)
df %>% group_by(year = year(Week)) %>% summarise(obs = sum(obs))
# Source: local data frame [1 x 2]
#
# year obs
# (dbl) (int)
# 1 2004 1215
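All of the answers above assume that Week is already of class Date. If it was read in as character, a minimal base R sketch of the conversion plus the yearly sum could look like this. The four rows are a shortened stand-in for the real data, and the 2005 row is a made-up value added only to show that each year gets its own total.

```r
# Shortened stand-in for the weekly data; the 2005 row is hypothetical.
df <- data.frame(
  Week = c("2004-01-04", "2004-01-11", "2004-12-26", "2005-01-02"),
  obs  = c(23, 36, 29, 10),
  stringsAsFactors = FALSE
)

# Convert the character column to Date so that format() can extract the year.
df$Week <- as.Date(df$Week)
df$year <- format(df$Week, "%Y")

# Sum obs within each year.
yearly <- aggregate(obs ~ year, data = df, FUN = sum)
yearly
```

Here yearly has one row per year, with the summed obs in the second column.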
I am trying to create a vector where I have 3 repetitions of the number 1, then 3 repetitions of the number 2, and so on up to, for instance, 3 repetitions of the number 36.
c(1,1,1,2,2,2,3,3,3,4,4,4,5,5,5...)
I have tried the following use of rep() but got the following error:
Error in rep(3, seq(1:36)) : argument 'times' incorrect
What formulation do I need to use to properly generate the vector I want?
sort(rep(1:36, 3))
Or, even better, as @Wimpel mentioned in the comments, use the each argument of the rep function.
rep(1:36, each = 3)
output
# [1] 1 1 1 2 2 2 3 3 3 4 4 4 5 5 5 6 6 6 7 7 7 8 8 8 9 9 9 10 10 10 11 11 11 12 12 12 13 13 13 14 14 14 15 15 15 16 16 16 17 17 17 18 18 18 19 19 19 20 20 20 21 21 21 22
# [65] 22 22 23 23 23 24 24 24 25 25 25 26 26 26 27 27 27 28 28 28 29 29 29 30 30 30 31 31 31 32 32 32 33 33 33 34 34 34 35 35 35 36 36 36
This should work, although it is probably not the most elegant solution.
reps = c()
n = 36
for(i in 1:n){
reps = append(reps, rep(i, 3))
}
reps
Alternatively, use the rep function with its each argument (see the documentation, ?rep):
rep(1:36,each = 3)
The rep approach is preferable (see existing answers).
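The original call rep(3, seq(1:36)) fails because it passes the single value 3 as x and a length-36 vector as times, and times must have length 1 or the same length as x. A small sketch of how times and each differ, using 1:3 instead of 1:36 to keep the vectors short (the variable names are only for illustration):

```r
by_each  <- rep(1:3, each = 2)            # 1 1 2 2 3 3: each element repeated in place
by_times <- rep(1:3, times = 2)           # 1 2 3 1 2 3: the whole vector repeated
by_vec   <- rep(1:3, times = c(2, 2, 2))  # 1 1 2 2 3 3: one repeat count per element
```

The last form is the one to reach for when different elements need different repeat counts.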
Here are some other options:
> kronecker(1:36, rep(1, 3))
[1] 1 1 1 2 2 2 3 3 3 4 4 4 5 5 5 6 6 6 7 7 7 8 8 8 9
[26] 9 9 10 10 10 11 11 11 12 12 12 13 13 13 14 14 14 15 15 15 16 16 16 17 17
[51] 17 18 18 18 19 19 19 20 20 20 21 21 21 22 22 22 23 23 23 24 24 24 25 25 25
[76] 26 26 26 27 27 27 28 28 28 29 29 29 30 30 30 31 31 31 32 32 32 33 33 33 34
[101] 34 34 35 35 35 36 36 36
> c(outer(rep(1, 3), 1:36))
[1] 1 1 1 2 2 2 3 3 3 4 4 4 5 5 5 6 6 6 7 7 7 8 8 8 9
[26] 9 9 10 10 10 11 11 11 12 12 12 13 13 13 14 14 14 15 15 15 16 16 16 17 17
[51] 17 18 18 18 19 19 19 20 20 20 21 21 21 22 22 22 23 23 23 24 24 24 25 25 25
[76] 26 26 26 27 27 27 28 28 28 29 29 29 30 30 30 31 31 31 32 32 32 33 33 33 34
[101] 34 34 35 35 35 36 36 36
I have a very large dataset (> 200000 lines) with 6 variables (only the first two shown)
>head(gt7)
ChromKey POS
1 2447 25
2 2447 183
3 26341 75
4 26341 2213
5 26341 2617
6 54011 1868
I have converted the ChromKey variable to a factor with > 55000 levels.
> gt7[1] <- lapply(gt7[1], factor)
> is.factor(gt7$ChromKey)
[1] TRUE
I can further make a table with counts of ChromKey levels
> table(gt7$ChromKey)
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
88 88 44 33 11 11 33 22 121 11 22 11 11 11 22 11 33
18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34
22 22 44 55 22 11 22 66 11 11 11 22 11 11 11 187 77
35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51
77 11 44 11 11 11 11 11 11 22 66 11 22 11 44 22 22
... output cropped
Which I can save in table format
> table <- table(gt7$ChromKey)
> head(table)
1 2 3 4 5 6
88 88 44 33 11 11
I would like to know whether it is possible to get a table (and a histogram) of the number of levels that have each specific count. From the example above, I would expect
88 44 33 11
2 1 1 2
I would very much appreciate any hint.
We can apply table again to the output to get the frequency of each count:
table(table(gt7$ChromKey))
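As a small illustration of the nested table call, here is a sketch on a made-up factor (chrom_key is a hypothetical stand-in for gt7$ChromKey, not the real data):

```r
# Hypothetical keys: levels 1 and 2 occur twice, level 3 once, level 4 three times.
chrom_key <- factor(c(1, 1, 2, 2, 3, 4, 4, 4))

counts <- table(chrom_key)   # occurrences per level
count_freq <- table(counts)  # how many levels share each count
count_freq
```

In count_freq the names are the distinct counts and the values are the number of levels with that count. Since it is a one-dimensional table, passing it to barplot() would draw the requested chart.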
I have tried to change column names based on a vector as follows:
library(data.table)
df <- fread(
"radio1 radio2 radio3 radio4 radio5 radio6 radio7
8 12 18 32 40 36 32
6 12 18 24 30 36 30
8 16 18 24 30 36 18
4 12 12 24 30 36 24
6 16 24 32 40 48 24
8 12 18 24 30 36 30
8 12 18 24 30 36 18
8 16 24 32 40 48 40
8 16 24 24 30 48 48",
header = TRUE
)
var <- c("radio1","radio2","radio3","radio4","radio5", "radio6", "radio7")
recode <- c("A","B","C","D","E", "F", "G")
variables <- cbind(var, recode)
variables <- as.data.table(variables)
for (i in seq_len(ncol(df))) {
colnames(df[[i]]) <- variables$recode[match(names(df)[i], variables $var)]
}
However, I get the error:
Error in `colnames<-`(`*tmp*`, value = variables$recode[match(names(df)[i], :
attempt to set 'colnames' on an object with less than two dimensions
What am I doing wrong? Is there a better way to do this?
You can use match directly.
names(df) <- variables$recode[match(names(df), variables$var)]
df
# A B C D E F G
#1: 8 12 18 32 40 36 32
#2: 6 12 18 24 30 36 30
#3: 8 16 18 24 30 36 18
#4: 4 12 12 24 30 36 24
#5: 6 16 24 32 40 48 24
#6: 8 12 18 24 30 36 30
#7: 8 12 18 24 30 36 18
#8: 8 16 24 32 40 48 40
#9: 8 16 24 24 30 48 48
df[[i]] extracts the i-th column as a plain vector, which has no dimensions, hence the error. By changing colnames(df[[i]]) to colnames(df)[i], the loop works fine:
for (i in seq_len(ncol(df))) {
colnames(df)[i] <- variables$recode[match(names(df)[i], variables$var)] }
> df
A B C D E F G
1: 8 12 18 32 40 36 32
2: 6 12 18 24 30 36 30
3: 8 16 18 24 30 36 18
4: 4 12 12 24 30 36 24
5: 6 16 24 32 40 48 24
6: 8 12 18 24 30 36 30
7: 8 12 18 24 30 36 18
8: 8 16 24 32 40 48 40
9: 8 16 24 24 30 48 48
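Since df is a data.table, another option is data.table's setnames(), which renames columns by reference from old/new name vectors. This is a sketch on a cut-down version of the table, and lookup is a hypothetical name for the var/recode mapping:

```r
library(data.table)

# Cut-down version of the table from the question.
df <- data.table(radio1 = c(8, 6), radio2 = c(12, 12), radio3 = c(18, 18))

# Mapping from current names to new names.
lookup <- data.table(
  var    = c("radio1", "radio2", "radio3"),
  recode = c("A", "B", "C")
)

# Rename in place: each name in old is replaced by the name at the same position in new.
setnames(df, old = lookup$var, new = lookup$recode)
names(df)
```

Because the match is done by name, the order of the rows in lookup does not have to follow the column order of df.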
What's the name for the [1] below.
What is its significance?
Is it always only [1]? If not, then under what conditions is it something else? (example please)
> bb <- c(5,6,7)
> bb
[1] 5 6 7
It is the index of the first element printed on that line, not a count. In your case the whole vector fits on one line, so only [1] appears:
bb <- c(5,6,7)
> bb
# [1] 5 6 7
Try,
c(1:50)
#[1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34
#[35] 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50
You can also avoid that being displayed by using cat
cat(c(1:50))
#1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50
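With 1:50 the values and their positions coincide, which can hide what the bracketed number means. A sketch with values that differ from their positions, printed at a deliberately narrow console width (40 is an arbitrary choice) so that the output wraps:

```r
x <- seq(10, 300, by = 10)     # 30 values, none equal to its own position

old <- options(width = 40)     # narrow console so the output wraps
out <- capture.output(print(x))
options(old)                   # restore the previous width

cat(out, sep = "\n")
```

Each wrapped line starts with the position in x of the first value shown on that line, so the prefixes change with the console width but never with the values themselves.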
I am fairly new to R. Below is a dummy example of a bigger table. I want to split the table by id (a, b, c, d) and run a simple linear regression on every subset.
x is my x variable and columns 1:6 are the y variables, so I need one model for each id and each of the columns 1:6. It would also be great if I could write the p-values of the slopes into a new data frame.
id x 1 2 3 4 5 6
1 a 74 18 19 NA 23 29 1
2 a 77 16 19 17 22 29 2
3 a 79 16 NA 19 23 29 3
4 a 81 17 20 18 23 29 4
5 b 74 19 20 19 23 28 11
6 b 76 15 19 18 26 28 12
7 b 79 19 21 20 24 28 NA
8 b 81 19 21 20 23 28 14
9 c 68 19 20 20 23 29 8
10 c 70 17 22 22 27 29 9
11 c 73 18 22 21 23 29 10
12 c 75 19 20 19 23 29 11
13 d 65 18 18 19 22 28 5
14 d 68 18 NA 18 20 29 6
15 d 70 18 19 18 23 28 7
16 d 72 19 17 19 22 28 8
I tried to use the plyr package, but it didn't work out:
regression = NULL
for ( i in 3:ncol(dumm)){
regression[i] <- dlply(dumm, .(id), function(z) lm(dumm[,i]~dumm$x, z))
}
coefs <- ldply(regression, coef)
Thanks in advance!
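One possible base R approach, as a sketch rather than a definitive answer: split the data by id, fit one lm per response column within each group, and collect the slope p-values. The response columns are renamed y1 and y2 here (an assumption, because names like 1 and 2 are awkward in formulas), and only the first two of the six columns are included to keep the example short.

```r
# First two response columns from the question, renamed y1 and y2 (assumed names).
dumm <- data.frame(
  id = rep(c("a", "b", "c", "d"), each = 4),
  x  = c(74, 77, 79, 81, 74, 76, 79, 81, 68, 70, 73, 75, 65, 68, 70, 72),
  y1 = c(18, 16, 16, 17, 19, 15, 19, 19, 19, 17, 18, 19, 18, 18, 18, 19),
  y2 = c(19, 19, NA, 20, 20, 19, 21, 21, 20, 22, 22, 20, 18, NA, 19, 17),
  stringsAsFactors = FALSE
)
y_cols <- c("y1", "y2")

# For each id, fit y ~ x for every response column and keep the slope p-value.
pvals <- do.call(rbind, lapply(split(dumm, dumm$id), function(grp) {
  data.frame(
    id = grp$id[1],
    response = y_cols,
    p_value = sapply(y_cols, function(col) {
      fit <- lm(reformulate("x", response = col), data = grp)  # NA rows are dropped by lm
      summary(fit)$coefficients["x", "Pr(>|t|)"]
    }),
    stringsAsFactors = FALSE
  )
}))
rownames(pvals) <- NULL
pvals
```

The result has one row per id and response column. A response that is constant within a group (such as column 5 for id a in the full table) has no residual variance, so its p-value would not be meaningful and R warns about an essentially perfect fit.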