Aggregating monthly data to quarterly (averages) - r

I have data, which has multiple monthly variables. I would like to aggregate these variables to quarterly level. My initial data is:
Time A B C D . . . . . K
Jan-2004 42 57 53 28
Feb-2004 40 78 56 28
Mar-2004 68 77 53 20
Apr-2004 97 96 80 16
May-2004 84 93 76 17
Jun-2004 57 100 100 21
Jul-2004 62 100 79 22
.
.
.
.
N
So the goal is to calculate the quarterly values as monthly averages, e.g. (Jan + Feb + Mar)/3. In other words, the goal is to end up with data like this:
Time A B C D . . . . . K
2004Q1 50,0 70,7 54,0 25,3
2004Q2 79,3 96,3 85,3 18,0
2004Q3
.
.
.
N
Could anyone help me with this problem?
Thank you very much.

An option would be to convert 'Time' to the yearqtr class with as.yearqtr from zoo and then do a summarise_all:
library(zoo)
library(dplyr)
df1 %>%
  group_by(Time = format(as.yearqtr(Time, "%b-%Y"), "%YQ%q")) %>%
  summarise_all(mean)
# A tibble: 3 x 5
# Time A B C D
# <chr> <dbl> <dbl> <dbl> <dbl>
#1 2004Q1 50 70.7 54 25.3
#2 2004Q2 79.3 96.3 85.3 18
#3 2004Q3 62 100 79 22
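Note that in more recent dplyr versions summarise_all() is superseded by across(); the same call in that style (a sketch reusing the df1 and libraries above) would be:
df1 %>%
  group_by(Time = format(as.yearqtr(Time, "%b-%Y"), "%YQ%q")) %>%
  summarise(across(everything(), mean))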
data
df1 <- structure(list(Time = c("Jan-2004", "Feb-2004", "Mar-2004", "Apr-2004",
"May-2004", "Jun-2004", "Jul-2004"), A = c(42L, 40L, 68L, 97L,
84L, 57L, 62L), B = c(57L, 78L, 77L, 96L, 93L, 100L, 100L), C = c(53L,
56L, 53L, 80L, 76L, 100L, 79L), D = c(28L, 28L, 20L, 16L, 17L,
21L, 22L)), class = "data.frame", row.names = c(NA, -7L))

data.table has a quarter() function, so you can do:
library(data.table)
setDT(my_data)
my_data[ , lapply(.SD, mean), by = .(year = year(Time), quarter = quarter(Time))]
This is the gist of it. Getting it to work exactly would require a reproducible example.
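For instance, a sketch applying this to the df1 defined above (an illustration, not part of the original answer; it assumes English month abbreviations and the columns A:D from the example, and parses the month strings into dates first):
library(data.table)
library(zoo)  # only used here to parse the "Jan-2004" strings

dt <- as.data.table(df1)
dt[, Time := as.Date(as.yearmon(Time, "%b-%Y"))]  # "Jan-2004" -> 2004-01-01
dt[, lapply(.SD, mean),
   by = .(year = year(Time), quarter = quarter(Time)),
   .SDcols = c("A", "B", "C", "D")]
# gives the same quarterly means as the zoo/dplyr answer (50, 70.7, 54, 25.3 for 2004 Q1, and so on)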


Drop observations if there are inconsistent variables within same ID [duplicate]

This question already has answers here:
Select groups based on number of unique / distinct values
(4 answers)
Closed 7 months ago.
df <- structure(list(id = c(123L, 123L, 123L, 45L, 45L, 9L, 103L, 103L,
22L, 22L, 22L), age = c(69L, 23L, 70L, 29L, 29L, 37L, 25L, 54L,
40L, 40L, 41L)), class = "data.frame", row.names = c(NA, -11L
))
id age
1 123 69
2 123 23
3 123 70
4 45 29
5 45 29
6 9 37
7 103 25
8 103 54
9 22 40
10 22 40
11 22 41
I would like to drop all observations for an id if it is associated with different values for age. How can I do that?
I would be left with:
id age
45 29
45 29
9 37
A dplyr approach:
library(dplyr)
df |>
  group_by(id) |>
  filter(n_distinct(age) == 1)
Without external packages, you could use ave():
df |>
  subset(ave(age, id, FUN = \(x) length(unique(x))) == 1)
# id age
# 4 45 29
# 5 45 29
# 6 9 37
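For completeness, a data.table sketch of the same filter (an addition, not part of the original answers; it assumes the df from the dput above):
library(data.table)
setDT(df)[, if (uniqueN(age) == 1L) .SD, by = id]
# keeps only ids 45 and 9, matching the expected output shown in the question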

Assign value in vector based on presence in another vector in R?

I have tried to look for a similar question, and I'm sure other people have encountered this problem, but I still couldn't find something that helped me. I have a dataset1 with 37,000 observations like this:
id hours
130 12
165 56
250 13
11 15
17 42
and another dataset2 with 38,000 observations like this:
id hours
130 6
165 23
250 9
11 14
17 11
I want to do the following: if an id of dataset1 is in dataset2, the hours of dataset1 should override the hours of dataset2. For the ids that are in dataset1 but not in dataset2, the value for dataset2$hours should be NA.
I tried the %in% operator, ifelse(), a loop, and some base R commands, but I can't figure it out. I always get the error that the vectors don't have the same length.
Thanks for any help!
You can replace hours with NA for ids that don't match between df1 and df2. Since both of your data sets had the same set of ids, I added one row to df1 with id = 123 and hours = 12.
df1$hours <- replace(df1$hours, is.na(match(df1$id,df2$id)), NA)
df1
id hours
1 130 12
2 165 56
3 250 13
4 11 15
5 17 42
6 123 NA
data
df1 <- structure(list(id = c(130L, 165L, 250L, 11L, 17L, 123L), hours = c(12L,
56L, 13L, 15L, 42L, 12L)), row.names = c(NA, -6L), class = "data.frame")
id hours
1 130 12
2 165 56
3 250 13
4 11 15
5 17 42
6 123 12
df2 <- structure(list(id = c(130L, 165L, 250L, 11L, 17L), hours = c(6L,
23L, 9L, 14L, 11L)), class = "data.frame", row.names = c(NA,
-5L))
First, match the IDs of the replacement data with the IDs of the original data, using na.omit() for the case when replacement IDs are not contained in the original data. Then replace with the rows of the replacement data whose IDs are in the original IDs.
I expanded both data sets to fabricate cases with no matches.
dat1
# id hours
# 1 130 12
# 2 165 56
# 3 250 13
# 4 11 15
# 5 17 42
# 6 12 232
# 7 35 456
dat2
# id hours
# 1 11 14
# 2 17 11
# 3 165 23
# 4 999 99
# 5 130 6
# 6 250 9
Replacement
dat1[na.omit(match(dat2$id, dat1$id)), ]$hours <-
  dat2[dat2$id %in% dat1$id, ]$hours
dat1
# id hours
# 1 130 6
# 2 165 23
# 3 250 9
# 4 11 14
# 5 17 11
# 6 12 232
# 7 35 456
Data:
dat1 <- structure(list(id = c(130L, 165L, 250L, 11L, 17L, 12L, 35L),
hours = c(12L, 56L, 13L, 15L, 42L, 232L, 456L)), class = "data.frame", row.names = c(NA,
-7L))
dat2 <- structure(list(id = c(11L, 17L, 165L, 999L, 130L, 250L), hours = c(14L,
11L, 23L, 99L, 6L, 9L)), class = "data.frame", row.names = c(NA,
-6L))
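A more recent dplyr alternative would be rows_update(), which overwrites matching rows by key; a sketch using the dat1/dat2 above (not part of the original answers; dat2 is filtered first so ids absent from dat1, like 999, don't raise an error):
library(dplyr)
rows_update(dat1, dat2[dat2$id %in% dat1$id, ], by = "id")
# same result as the match()-based replacement above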

Create a new table from an existing one on a criteria in R

I've done a self-paced reading experiment in which 151 participants read 112 sentences divided into three lists and I'm having some problems cleaning the data in R. I'm not a programmer so I'm kind of struggling with all this!
I've got the results file which looks something like this:
results
part item word n.word rt
51 106 * 1 382
51 106 El 2 286
51 106 asistente 3 327
51 106 del 4 344
51 106 carnicero 5 394
51 106 que 6 274
51 106 abapl’a 7 2327
51 106 el 8 1104
51 106 sabor 9 409
51 106 del 10 360
51 106 pollo 11 1605
51 106 envipi— 12 256
51 106 un 13 4573
51 106 libro 14 660
51 106 *. 15 519
part = participant; item = sentence; n.word = word number; rt = reading time.
In the results file, I have the reading times of every word of every sentence read by every participant. Each participant read roughly 40 sentences. My problem is that I am interested in the reading times of specific words, such as the main verb or the last word of each sentence. But as every sentence is a bit different, the main verb is not always in the same position in each sentence. So I've made another table with the positions of the words I'm interested in for every sentence.
rules
item v1 v2 n1 n2
106 12 7 3 5
107 11 8 3 6
108 11 8 3 6
item=sentence; v1=main verb; v2=secondary verb; n1=first noun; n2=second noun.
So this should be read: for sentence 106, the main verb is word number 12, the secondary verb is word number 7, and so on.
I want to have a final table that looks like this:
results2
part item v1 v2 n1 n2
51 106 256 2327 327 394
51 107 ...
52 106 ...
Does anyone know how to do this? It's kind of a long-to-wide reshaping problem, but with a more complex scenario.
If anyone could help me, I would really appreciate it! Thanks!!
You can try the following code, which joins your results data to a reshaped rules data, and then reshapes the result into a wider form.
library(tidyr)
library(dplyr)
inner_join(select(results, -word),
           pivot_longer(rules, -item),
           by = c("item", "n.word" = "value")) %>%
  select(-n.word) %>%
  pivot_wider(names_from = name, values_from = rt) %>%
  select(part, item, v1, v2, n1, n2)
# A tibble: 1 x 6
# part item v1 v2 n1 n2
# <int> <int> <int> <int> <int> <int>
#1 51 106 256 2327 327 394
Data:
results <- structure(list(part = c(51L, 51L, 51L, 51L, 51L, 51L, 51L, 51L,
51L, 51L, 51L, 51L, 51L, 51L, 51L), item = c(106L, 106L, 106L,
106L, 106L, 106L, 106L, 106L, 106L, 106L, 106L, 106L, 106L, 106L,
106L), word = c("*", "El", "asistente", "del", "carnicero", "que",
"abapl’a", "el", "sabor", "del", "pollo", "envipi—", "un", "libro",
"*."), n.word = 1:15, rt = c(382L, 286L, 327L, 344L, 394L, 274L,
2327L, 1104L, 409L, 360L, 1605L, 256L, 4573L, 660L, 519L)), class = "data.frame", row.names = c(NA,
-15L))
rules <- structure(list(item = 106:108, v1 = c(12L, 11L, 11L), v2 = c(7L,
8L, 8L), n1 = c(3L, 3L, 3L), n2 = c(5L, 6L, 6L)), class = "data.frame", row.names = c(NA,
-3L))
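A data.table sketch of the same lookup, in case it is useful (an addition, not part of the original answer; it melts rules to long form, joins on item and word position, and casts back to wide):
library(data.table)
rules_long <- melt(as.data.table(rules), id.vars = "item",
                   variable.name = "role", value.name = "n.word")
res <- as.data.table(results)[rules_long, on = .(item, n.word), nomatch = NULL]
dcast(res, part + item ~ role, value.var = "rt")
# one row per part/item, with the rt values spread across the v1, v2, n1 and n2 columns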

Subset column and compute operations for each subset [duplicate]

This question already has answers here:
How to sum a variable by group
(18 answers)
Closed 5 years ago.
Here is a minimal example data frame to reproduce the problem.
df <- structure(list(Gene = structure(c(147L, 147L, 148L, 148L, 148L,
87L, 87L, 87L, 87L, 87L), .Label = c("genome", "k141_1189_101",
"k141_1189_104", "k141_1189_105", "k141_1189_116", "k141_1189_13",
"k141_1189_14", "k141_1189_146", "k141_1189_150", "k141_1189_18",
"k141_1189_190", "k141_1189_194", "k141_1189_215", "k141_1189_248",
"k141_1189_251", "k141_1189_252", "k141_1189_259", "k141_1189_274",
"k141_1189_283", "k141_1189_308", "k141_1189_314", "k141_1189_322",
"k141_1189_353", "k141_1189_356", "k141_1189_372", "k141_1189_373",
"k141_1189_43", "k141_1189_45", "k141_1189_72", "k141_1597_15",
"k141_1597_18", "k141_1597_23", "k141_1597_41", "k141_1597_55",
"k141_1597_66", "k141_1597_67", "k141_1597_68", "k141_1597_69",
"k141_2409_34", "k141_2409_8", "k141_3390_69", "k141_3390_83",
"k141_3390_84", "k141_3726_25", "k141_3726_31", "k141_3726_49",
"k141_3726_50", "k141_3726_62", "k141_3726_8", "k141_3726_80",
"k141_3790_1", "k141_3993_114", "k141_3993_122", "k141_3993_162",
"k141_3993_172", "k141_3993_183", "k141_3993_186", "k141_3993_188",
"k141_3993_24", "k141_3993_25", "k141_3993_28", "k141_3993_32",
"k141_3993_44", "k141_3993_47", "k141_3993_53", "k141_3993_57",
"k141_3993_68", "k141_4255_80", "k141_4255_81", "k141_4255_87",
"k141_5079_107", "k141_5079_110", "k141_5079_130", "k141_5079_14",
"k141_5079_141", "k141_5079_16", "k141_5079_184", "k141_5079_185",
"k141_5079_202", "k141_5079_24", "k141_5079_39", "k141_5079_63",
"k141_5079_65", "k141_5079_70", "k141_5079_77", "k141_5079_87",
"k141_5079_9", "k141_5313_16", "k141_5313_17", "k141_5313_20",
"k141_5313_23", "k141_5313_39", "k141_5313_5", "k141_5313_51",
"k141_5313_52", "k141_5313_78", "k141_5545_101", "k141_5545_103",
"k141_5545_104", "k141_5545_105", "k141_5545_106", "k141_5545_107",
"k141_5545_108", "k141_5545_109", "k141_5545_110", "k141_5545_111",
"k141_5545_112", "k141_5545_113", "k141_5545_114", "k141_5545_119",
"k141_5545_128", "k141_5545_130", "k141_5545_139", "k141_5545_141",
"k141_5545_145", "k141_5545_16", "k141_5545_169", "k141_5545_17",
"k141_5545_172", "k141_5545_6", "k141_5545_60", "k141_5545_62",
"k141_5545_63", "k141_5545_86", "k141_5545_87", "k141_5545_88",
"k141_5545_89", "k141_5545_91", "k141_5545_92", "k141_5545_93",
"k141_5545_94", "k141_5545_96", "k141_5545_97", "k141_5545_98",
"k141_5545_99", "k141_5734_13", "k141_5734_2", "k141_5734_4",
"k141_5734_5", "k141_5734_6", "k141_6014_124", "k141_6014_2",
"k141_6014_34", "k141_6014_75", "k141_6014_96", "k141_908_14",
"k141_908_2", "k141_908_5", "k141_957_126", "k141_957_135", "k141_957_136",
"k141_957_14", "k141_957_140", "k141_957_141", "k141_957_148",
"k141_957_179", "k141_957_191", "k141_957_35", "k141_957_47",
"k141_957_55", "k141_957_57", "k141_957_59", "k141_957_6", "k141_957_63",
"k141_957_65", "k141_957_68", "k141_957_77", "k141_957_95"), class = "factor"),
depth = c(9L, 10L, 9L, 10L, 11L, 14L, 15L, 16L, 17L, 18L),
bases_covered = c(6L, 3L, 4L, 7L, 4L, 59L, 54L, 70L, 34L,
17L), gene_length = c(1140L, 1140L, 591L, 591L, 591L, 690L,
690L, 690L, 690L, 690L), regioncoverage = c(54L, 30L, 36L,
70L, 44L, 826L, 810L, 1120L, 578L, 306L)), .Names = c("Gene",
"depth", "bases_covered", "gene_length", "regioncoverage"), row.names = c(1L,
2L, 33L, 34L, 35L, 78L, 79L, 80L, 81L, 82L), class = "data.frame")
The dataframe looks like this:
Gene depth bases_covered gene_length regioncoverage
1 k141_908_2 9 6 1140 54
2 k141_908_2 10 3 1140 30
33 k141_908_5 9 4 591 36
34 k141_908_5 10 7 591 70
35 k141_908_5 11 4 591 44
78 k141_5079_9 14 59 690 826
79 k141_5079_9 15 54 690 810
80 k141_5079_9 16 70 690 1120
81 k141_5079_9 17 34 690 578
82 k141_5079_9 18 17 690 306
What I want is, for each Gene (e.g. k141_908_2), to sum regioncoverage and divide by unique(gene_length). In fact, gene_length is always the same value within each gene.
For example, for gene k141_908_2 I would do: (54 + 30)/1140 = 0.07
For example, for gene k141_908_5 I would do: (36 + 70 + 44)/591 = 0.25
The final dataframe should report two columns.
Gene Newcoverage
1 k141_908_2 0.07
2 k141_908_5 0.25
3 ......
and so on.
Thanks for your help!
This is straightforward with dplyr:
library(dplyr)
df_final <- df %>%
  group_by(Gene) %>%
  summarize(Newcoverage = sum(regioncoverage) / first(gene_length))
df_final
# # A tibble: 3 × 2
# Gene Newcoverage
# <fctr> <dbl>
# 1 k141_5079_9 5.27536232
# 2 k141_908_2 0.07368421
# 3 k141_908_5 0.25380711
I set the first column to character and the others to numeric. After that you can just split the data frame by Gene and do the necessary calculation for each piece.
df[,2:5] = lapply(df[,2:5], as.numeric)
df$Gene = as.character(df$Gene)
sapply(split(df, df$Gene), function(x) sum(x[,5]/x[1,4]))
#k141_5079_9 k141_908_2 k141_908_5
# 5.27536232 0.07368421 0.25380711
We can use the tidyverse:
library(tidyverse)
df %>%
  group_by(Gene) %>%
  summarise(Newcoverage = sum(regioncoverage)/gene_length[1])
# A tibble: 3 × 2
# Gene Newcoverage
# <fctr> <dbl>
#1 k141_5079_9 5.27536232
#2 k141_908_2 0.07368421
#3 k141_908_5 0.25380711
Or a base R option is
by(df[4:5], list(as.character(df[,'Gene'])), FUN= function(x) sum(x[,2])/x[1,1])
A quick approach with data.table is:
require(data.table)
DT <- setDT(df)
#just to output unique rows
DT[, .(New_Coverage = unique(sum(regioncoverage)/gene_length)), by = .(Gene)]
output
Gene New_Coverage
1: k141_908_2 0.07368421
2: k141_908_5 0.25380711
3: k141_5079_9 5.27536232
I use dplyr a lot. So here's one way:
library(dplyr)
df %>%
  group_by(Gene) %>%
  mutate(Newcoverage = sum(regioncoverage)/unique(gene_length))
If you want only unique values per Gene:
df %>%
  group_by(Gene) %>%
  transmute(Newcoverage = sum(regioncoverage)/unique(gene_length)) %>%
  unique()
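For completeness, a base R aggregate() sketch of the same calculation (an addition, not from the original answers; it uses the df from the dput above):
aggregate(regioncoverage ~ Gene + gene_length, data = df, FUN = sum) |>
  transform(Newcoverage = regioncoverage / gene_length)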

Data format conversion to be combined with string split in R

I have the following data frame oridf:
test_name gp1_0month gp2_0month gp1_1month gp2_1month gp1_3month gp2_3month
Test_1 136 137 152 143 156 150
Test_2 130 129 81 78 86 80
Test_3 129 128 68 68 74 71
Test_4 40 40 45 43 47 46
Test_5 203 201 141 134 149 142
Test_6 170 166 134 116 139 125
oridf <- structure(list(test_name = structure(1:6, .Label = c("Test_1",
"Test_2", "Test_3", "Test_4", "Test_5", "Test_6"), class = "factor"),
gp1_0month = c(136L, 130L, 129L, 40L, 203L, 170L), gp2_0month = c(137L,
129L, 128L, 40L, 201L, 166L), gp1_1month = c(152L, 81L, 68L,
45L, 141L, 134L), gp2_1month = c(143L, 78L, 68L, 43L, 134L,
116L), gp1_3month = c(156L, 86L, 74L, 47L, 149L, 139L), gp2_3month = c(150L,
80L, 71L, 46L, 142L, 125L)), .Names = c("test_name", "gp1_0month",
"gp2_0month", "gp1_1month", "gp2_1month", "gp1_3month", "gp2_3month"
), class = "data.frame", row.names = c(NA, -6L))
I need to convert it to the following format:
test_name month group value
Test_1 0 gp1 136
Test_1 0 gp2 137
Test_1 1 gp1 152
Test_1 1 gp2 143
.....
Hence, the conversion would involve splitting gp1 and 0month, etc., out of columns 2:7 of the original data frame oridf so that I can plot it with the following command:
qplot(data=newdf, x=month, y=value, geom=c("point","line"), color=test_name, linetype=group)
How can I convert these data? I tried the melt command, but I cannot combine it with the strsplit command.
First I would use melt like you had done.
library(reshape2)
mm <- melt(oridf)
Then there is also a colsplit function you can use from the reshape2 library. Here we use it on the variable column to split at the underscore and at the "m" in "month" (ignoring the rest):
info <- colsplit(mm$variable, "(_|m)", c("group","month", "xx"))[,-3]
Then we can recombine the data:
newdf <- cbind(mm[,1, drop=F], info, mm[,3, drop=F])
# head(newdf)
# test_name group month value
# 1 Test_1 gp1 0 136
# 2 Test_2 gp1 0 130
# 3 Test_3 gp1 0 129
# 4 Test_4 gp1 0 40
# 5 Test_5 gp1 0 203
# 6 Test_6 gp1 0 170
And we can plot it using the qplot command you supplied above.
Use gather from the tidyr package to convert from wide to long, then use separate from the same package to split the group_month column into group and month columns. Finally, using mutate from dplyr and extract_numeric from tidyr, extract the numeric part of month.
library(dplyr)
# devtools::install_github("hadley/tidyr")
library(tidyr)
newdf <- oridf %>%
  gather(group_month, value, -test_name) %>%
  separate(group_month, into = c("group", "month")) %>%
  mutate(month = extract_numeric(month))
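Note that extract_numeric() is deprecated in current tidyr; a rough modern equivalent of the same reshape (a sketch, assuming the gp1_0month-style column names shown above) would be:
library(dplyr)
library(tidyr)
newdf <- oridf %>%
  pivot_longer(-test_name,
               names_to = c("group", "month"),
               names_pattern = "(gp\\d)_(\\d+)month") %>%
  mutate(month = as.numeric(month))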
