R - aggregate data.table columns differently

I am given a large data.table that needs to be aggregated by the first column.
The requirements are the following:
For several columns, one just has to form the sum for each category (given in column 1).
For other columns, one has to calculate the mean.
There is a 1-1 correspondence between the entries in the first and second columns, so the entries of the second column should be kept.
The following is a possible example of such a data-table. Let's assume that columns 3-9 need to be summed up and columns 10-12 need to be averaged.
library(data.table)
set.seed(1)
a<-matrix(c("cat1","text1","cat2","text2","cat3","text3"),nrow=3,byrow=TRUE)
M<-do.call(rbind, replicate(1000, a, simplify=FALSE)) # stack 1000 copies of the 3 x 2 matrix a
M<-cbind(M,matrix(sample(c(1:100),3000*10,replace=TRUE ),ncol=10))
M <- as.data.table(M)
The result should be a table of the form
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12
1: cat1 text1 27 81 78 95 27 22 12 76 18 76
2: cat2 text2 38 48 70 100 11 97 8 53 56 33
3: cat3 text3 58 18 66 24 14 73 18 27 92 70
but with the entries replaced by the corresponding sums and averages.

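# cbind() with the text columns turned every column into character, so convert
# columns 3-12 to numeric first, then sum V3-V9 and average V10-V12 per group.
M[, names(M)[-c(1,2)] := lapply(.SD, as.numeric),
.SDcols = names(M)[-c(1,2)]][,
c(lapply(.SD[, ((3:9)-2), with=FALSE], sum),
lapply(.SD[, ((10:12)-2), with=FALSE], mean)),
by = eval(names(M)[c(1,2)])]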
#> V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12
#> 1: cat1 text1 51978 49854 48476 49451 49620 49870 50248 50.193 51.516 49.694
#> 2: cat2 text2 50607 50097 50572 50507 48960 51419 48905 49.700 49.631 48.863
#> 3: cat3 text3 51033 50060 49742 50345 51532 51299 50957 50.192 50.227 50.689
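For comparison, the same split between summed and averaged columns can be sketched in base R. This is only an illustration on a smaller made-up table (4 rows per category instead of 1000); the helper names keys, sum_cols and mean_cols are assumptions, not part of the answer above.

```r
# Small stand-in for the question's table: 2 key columns + 10 numeric columns.
set.seed(1)
M <- data.frame(
  V1 = rep(c("cat1", "cat2", "cat3"), times = 4),
  V2 = rep(c("text1", "text2", "text3"), times = 4),
  matrix(sample(1:100, 12 * 10, replace = TRUE), ncol = 10)
)
names(M) <- paste0("V", 1:12)

keys      <- c("V1", "V2")        # grouping columns (hypothetical helper names)
sum_cols  <- paste0("V", 3:9)     # columns to sum
mean_cols <- paste0("V", 10:12)   # columns to average

# Aggregate each block of columns with its own function, then join on the keys.
sums  <- aggregate(M[sum_cols],  by = M[keys], FUN = sum)
means <- aggregate(M[mean_cols], by = M[keys], FUN = mean)
res   <- merge(sums, means, by = keys)
res
```

Naming the column groups once keeps the index arithmetic (such as (3:9)-2) out of the aggregation call itself.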

Related

How to create a subset for each day in time series data automatically

I have collected data from sensors and need to split the data into 1 day data frames automatically.
This is what my data looks like:
2 2021-10-20 20:17:14 151 -135.9 8.304 -339.5 8.175 23.13232 78.95514 97.10153
3 2021-10-20 20:27:15 152 -136.9 8.302 -340.6 8.175 23.89337 86.71063 98.07861
4 2021-10-20 20:37:15 153 -138.2 8.302 -340.5 8.177 23.00682 80.71004 96.15726
5 2021-10-20 20:47:16 154 -138.8 8.302 -341.0 8.176 23.76786 83.38557 98.30032
I used tibbletime and dplyr to get a tbl_time object.
So far I was able to create a subset for day 1, and it would be easy to do the same manually for days 2, 3, etc. In the end I will have about a year of data, so this will be a pain in the a to do manually.
Here's the line of code I used:
day1<- filter_time(bio2_table, time_formula = '2021-10-20' ~ '2021-10-20')
I'm an R noob but I still want to believe there is a way for R to do all the work itself.
Thanks!
What data type is your date object?
The first step is to get the date format right, then you can group_split() to get a list of data frames, each of which represents a specific date.
If interested in this approach, you could look at the setNames() function to name each data frame.
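library(dplyr)
# use "%Y" (four-digit year); with "%y" the string "21-10-20" would be read as year 21
data %>%
mutate(date = as.Date(format(date, "%Y-%m-%d"))) %>%
group_split(date)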
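The same idea can be sketched without extra packages using base R's split(), which also names each list element by its date. The column names time and value are assumptions for illustration; the four sample rows are taken from the data shown in this thread.

```r
# Hypothetical column names; timestamps and values come from the sample rows.
bio2 <- data.frame(
  time  = as.POSIXct(c("2021-10-20 20:17:14", "2021-10-20 20:27:15",
                       "2021-10-21 20:47:16", "2021-10-22 20:47:16"),
                     tz = "UTC"),
  value = c(151, 152, 155, 156)
)

# Derive the calendar day as text and split into one data frame per day.
day    <- format(bio2$time, "%Y-%m-%d")
by_day <- split(bio2, day)

names(by_day)            # one name per day
by_day[["2021-10-20"]]   # all rows recorded on that day
```

Because the list is named by date, a single day can be pulled out by name instead of being stored in separate day1, day2, ... objects.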
This returns a data frame in which each cell holds the vector of values for one date (shown transposed first):
t(aggregate( dat[,-2], by=list(dat$V2), list ))
[,1] [,2] [,3]
Group.1 "2021-10-20" "2021-10-21" "2021-10-22"
V1 Integer,4 6 7
V3 Character,4 "20:47:16" "20:47:16"
V4 Integer,4 155 156
V5 Numeric,4 -135.8 -131.8
V6 Numeric,4 8.306 10.302
V7 Numeric,4 -339 -347
V8 Numeric,4 8.166 8.187
V9 Numeric,4 25.76789 15.76786
V10 Numeric,4 91.38557 87.38557
V11 Numeric,4 102.3003 111.3003
# OR
aggregate( dat[,-2], by=list(dat$V2), list )
Group.1 V1 V3
1 2021-10-20 2, 3, 4, 5 20:17:14, 20:27:15, 20:37:15, 20:47:16
2 2021-10-21 6 20:47:16
3 2021-10-22 7 20:47:16
V4 V5 V6
1 151, 152, 153, 154 -135.9, -136.9, -138.2, -138.8 8.304, 8.302, 8.302, 8.302
2 155 -135.8 8.306
3 156 -131.8 10.302
V7 V8
1 -339.5, -340.6, -340.5, -341.0 8.175, 8.175, 8.177, 8.176
2 -339 8.166
3 -347 8.187
V9 V10
1 23.13232, 23.89337, 23.00682, 23.76786 78.95514, 86.71063, 80.71004, 83.38557
2 25.76789 91.38557
3 15.76786 87.38557
V11
1 97.10153, 98.07861, 96.15726, 98.30032
2 102.3003
3 111.3003
Data
dat <- structure(list(V1 = 2:7, V2 = c("2021-10-20", "2021-10-20", "2021-10-20",
"2021-10-20", "2021-10-21", "2021-10-22"), V3 = c("20:17:14",
"20:27:15", "20:37:15", "20:47:16", "20:47:16", "20:47:16"),
V4 = 151:156, V5 = c(-135.9, -136.9, -138.2, -138.8, -135.8,
-131.8), V6 = c(8.304, 8.302, 8.302, 8.302, 8.306, 10.302
), V7 = c(-339.5, -340.6, -340.5, -341, -339, -347), V8 = c(8.175,
8.175, 8.177, 8.176, 8.166, 8.187), V9 = c(23.13232, 23.89337,
23.00682, 23.76786, 25.76789, 15.76786), V10 = c(78.95514,
86.71063, 80.71004, 83.38557, 91.38557, 87.38557), V11 = c(97.10153,
98.07861, 96.15726, 98.30032, 102.30032, 111.30032)), class = "data.frame", row.names = c(NA,
-6L))
dat
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10
1 2 2021-10-20 20:17:14 151 -135.9 8.304 -339.5 8.175 23.13232 78.95514
2 3 2021-10-20 20:27:15 152 -136.9 8.302 -340.6 8.175 23.89337 86.71063
3 4 2021-10-20 20:37:15 153 -138.2 8.302 -340.5 8.177 23.00682 80.71004
4 5 2021-10-20 20:47:16 154 -138.8 8.302 -341.0 8.176 23.76786 83.38557
5 6 2021-10-21 20:47:16 155 -135.8 8.306 -339.0 8.166 25.76789 91.38557
6 7 2021-10-22 20:47:16 156 -131.8 10.302 -347.0 8.187 15.76786 87.38557
V11
1 97.10153
2 98.07861
3 96.15726
4 98.30032
5 102.30032
6 111.30032

Calculate Ratio of multiple pairs of columns

I have a table in R like this one:
id v1 v2 v3
1 115 116 150
2 47 50 55
3 70 77 77
I would like to calculate the ratio of each pair of consecutive columns: v2/v1 as (v2/v1)-1, v3/v2 as (v3/v2)-1, and so on (I have around 55 variables), to get values like this:
id v1 v2 v3 rat1 rat2
1 115 116 150 0.01 0.29
2 47 50 55 0.06 0.10
3 70 77 77 0.10 0.00
Is there a workaround so I don't have to code each pair independently?
Thx!
It's essentially a loop over column i and column i+1, which you could write as a for loop. Or, in R speak, use a vectorised function like Map/mapply:
vars <- paste0("v",1:3)
outs <- paste0("rat",1:2)
dat[outs] <- mapply(`/`, dat[vars[-1]], dat[vars[-length(vars)]]) - 1
dat
# id v1 v2 v3 rat1 rat2
#1 1 115 116 150 0.008695652 0.2931034
#2 2 47 50 55 0.063829787 0.1000000
#3 3 70 77 77 0.100000000 0.0000000
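With around 55 variables, the column names can be collected programmatically rather than typed out. The following is a minimal sketch under the assumption that the value columns are named v1, v2, ... and should be paired in numeric order; the grep pattern and the ordering step are illustrative choices.

```r
dat <- data.frame(id = 1:3,
                  v1 = c(115L, 47L, 70L),
                  v2 = c(116L, 50L, 77L),
                  v3 = c(150L, 55L, 77L))

# Collect v1, v2, ... and sort them numerically (so v10 follows v9, not v1).
vars <- grep("^v[0-9]+$", names(dat), value = TRUE)
vars <- vars[order(as.integer(sub("^v", "", vars)))]
n    <- length(vars)

# Divide each column by its predecessor, subtract 1, and name the results.
ratios <- dat[vars[-1]] / dat[vars[-n]] - 1
names(ratios) <- paste0("rat", seq_len(n - 1))

dat <- cbind(dat, ratios)
dat
```

Dividing two data frames of equal shape works element by element, so no explicit loop is needed.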
Because we drop the same number of columns from each copy ('id' from both, plus the last column from one and the first value column from the other), the two data frames have the same dimensions, so we can divide them directly:
dat[paste0("rat", 1:2)] <- dat[-(1:2)]/dat[-c(1, ncol(dat))] - 1
data
dat <- structure(list(id = 1:3, v1 = c(115L, 47L, 70L), v2 = c(116L,
50L, 77L), v3 = c(150L, 55L, 77L)), class = "data.frame", row.names = c(NA,
-3L))

Column Mean by Factors

I would like to create a table of column means by Strain factors
I have the following data:
Age Strain 103 3 163 39
V2 28 101CD -3.4224173012 -0.3360570164 -9.2417448649 -3.6094766494
V3 28 101CD -3.6487198656 -0.7948262475 -4.6350611123 -1.9232938265
V4 28 101CD -7.0936427264 -0.1981243536 -9.2063428591 -3.367139071
V5 28 101CD -5.9245254437 -0.1161875584 -7.3830396092 -4.7980771085
V6 30 101HFD -9.4618204696 -5.0355557149 -3.9915005349 -0.9271933496
V7 30 101HFD -8.805867863 -2.667103793 -2.2489197384 -1.5169130813
V8 30 101HFD -10.9841335945 -2.9617657815 -3.3460597574 -1.121806194
V9 30 101HFD -10.4612747952 -4.3759351258 -4.4322637085 -0.772499965
V10 30 101HFD -9.2871507889 -1.2664335711 -4.3142098012 -1.3791233817
V11 30 101HFD -10.9443983294 -2.4651954898 -4.7759052834 -1.0954401254
V12 29 103CD -2.7492530803 -2.0659306194 -2.5698186069 -1.4978280502
V13 29 103CD -6.4401905692 -2.1098420514 -3.4349220483 -0.8836564768
V14 29 103CD -6.479929929 -2.4792621691 -3.368774934 -0.7756932376
V15 29 103CD -3.6586850957 -1.9145944032 -3.0911223702 -1.2730896376
V16 29 103CD -7.1377230731 -1.413139617 -2.9203340711 -1.3152010161
V17 29 103HFD -9.4624093184 -1.3265834556 -4.1871313168 -1.0108235293
V18 29 103HFD -7.336764023 -0.8712499419 -4.204313727 -1.4450582002
V19 29 103HFD -7.036723106 -0.7546877382 -6.0432957599 -1.4161366956
V20 29 103HFD -9.4449207581 -0.9226067311 -4.6305567775 -1.320094489
V21 29 103HFD -9.6383454033 -1.9620356763 -3.0214290407 -0.8602682738
And, I want to end up with this:
Age Strain 103 3 163 39
V1 28 101CD -3.4224173012 -0.3360570164 -9.2417448649 -3.6094766494
V2 30 101HFD -9.4618204696 -5.0355557149 -3.9915005349 -0.9271933496
V3 29 103CD -2.7492530803 -2.0659306194 -2.5698186069 -1.4978280502
V4 29 103HFD -9.4624093184 -1.3265834556 -4.1871313168 -1.0108235293
Where [1,] is the mean of all columns for all samples with Strain=101CD, [2,] is the mean of all columns for samples with Strain=101HFD, etc.
I have attempted to use:
> ave <- aggregate(data, as.list(factor(data$Age)), mean)
Error in aggregate.data.frame(data, as.list(factor(data$Age)), mean) : arguments must have same length
and
> ave <- sapply(split(data, data$Strain), mean)
101CD 101HFD 103CD 103HFD 32CD 40CD 40HFD 43CD 43HFD 44CD 44HFD
NA NA NA NA NA NA NA NA NA NA NA
...
97HFD 98CD 98HFD 99CD 99HFD
NA NA NA NA NA
There were 50 or more warnings (use warnings() to see the first 50)
and
> ave <- daply(data, data$Strain, mean)
Error in parse(text = x) : <text>:1:4: unexpected symbol
1: 101CD
I feel like there should be a fairly straightforward way to accomplish this, but I have been unable to find a solution.
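You can use dplyr. Here we group_by Strain, then use summarise_each to summarise each column with the function mean, with na.rm set to TRUE (in current dplyr releases, summarise_each() and funs() are deprecated in favour of summarise(across(...))):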
library(dplyr)
data %>% group_by(Strain) %>%
summarise_each(funs(mean(., na.rm=TRUE)))
Source: local data frame [4 x 6]
Strain Age X103 X3 X163 X39
(fctr) (dbl) (dbl) (dbl) (dbl) (dbl)
1 101CD 28 -5.022326 -0.3612988 -7.616547 -3.424497
2 101HFD 30 -9.990774 -3.1286649 -3.851476 -1.135496
3 103CD 29 -5.293156 -1.9965538 -3.076994 -1.149094
4 103HFD 29 -8.583833 -1.1674327 -4.417345 -1.210476
Exploit the fact that a data.frame is a special kind of list, which is what aggregate() expects for its by argument.
aggregate(data, data[, "Age", drop = FALSE], mean)
drop = FALSE is required so that the result of the selection remains a data.frame. data[, "Age"] is equivalent to data[, "Age", drop = TRUE] and will return a vector.
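Applied to the grouping the question asks for (Strain rather than Age), the same list-style selection might look like the sketch below. It uses only the 101CD and 103CD rows and two of the columns from the question; the column name X103 follows the naming in the dplyr output above.

```r
# Subset of the question's data: two strains, columns Age and 103 (as X103).
d <- data.frame(
  Age    = c(28, 28, 28, 28, 29, 29, 29, 29, 29),
  Strain = c(rep("101CD", 4), rep("103CD", 5)),
  X103   = c(-3.4224173012, -3.6487198656, -7.0936427264, -5.9245254437,
             -2.7492530803, -6.4401905692, -6.479929929, -3.6586850957,
             -7.1377230731)
)

# d["Strain"] is a one-column data.frame, i.e. a list, so it is a valid 'by'.
# Only the numeric columns are passed as the data to be averaged.
res <- aggregate(d[c("Age", "X103")], by = d["Strain"], FUN = mean)
res
```

Leaving the grouping column out of the first argument avoids taking the mean of a non-numeric column, which is what produced the NA results in the sapply attempt.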

Spliting a row into columns using a delimiter in R

My data looks like this:
ID:10:237,204,
ID:11:257,239,
ID:12:309,291,
ID:13:310,272,
ID:14:3202,3184,
ID:15:404,388,
I would like to first separate this into different columns, then apply a function to each row to calculate the difference of the comma-separated values, such as (237-204).
This should be done without external packages.
Try this. If the data is in a file, replace the readLines(textConnection(Lines)) line with something like L <- readLines("myfile.csv"). The code replaces the colons with commas using gsub, reads the resulting text with read.table, and adds the difference with transform (the trailing comma on each line produces the all-NA column V5):
# test data
Lines <- "ID:10:237,204,
ID:11:257,239,
ID:12:309,291,
ID:13:310,272,
ID:14:3202,3184,
ID:15:404,388,"
L <- readLines(textConnection(Lines))
DF <- read.table(text = gsub(":", ",", L), sep = ",")
transform(DF, diff = V3 - V4)
giving:
V1 V2 V3 V4 V5 diff
1 ID 10 237 204 NA 33
2 ID 11 257 239 NA 18
3 ID 12 309 291 NA 18
4 ID 13 310 272 NA 38
5 ID 14 3202 3184 NA 18
6 ID 15 404 388 NA 16
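An alternative base-R sketch splits each line on both delimiters at once with strsplit(), which avoids the empty V5 column. The column names id, a and b are assumptions chosen for illustration.

```r
L <- c("ID:10:237,204,", "ID:11:257,239,", "ID:12:309,291,",
       "ID:13:310,272,", "ID:14:3202,3184,", "ID:15:404,388,")

# Split on ":" or ","; strsplit() drops the empty piece after the last comma.
parts <- do.call(rbind, strsplit(L, "[:,]"))

# Columns 2-4 hold the id and the two values (hypothetical column names).
DF <- data.frame(id = as.integer(parts[, 2]),
                 a  = as.numeric(parts[, 3]),
                 b  = as.numeric(parts[, 4]))
DF$diff <- DF$a - DF$b
DF
```

The subtraction is vectorised over the whole column, so no per-row function call is needed.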

printing a list based on range met

I would like to generate a string output in a list if some values are met. I have a table that looks like this:
grp V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14 V15 V16 V17
1: 1 go.1 142 144 132 134 0 31 11 F D T hy al qe 34 6 3
2: 2 go.1 313 315 303 305 0 31 11 q z t hr ye er 29 20 41
3: 3 go.1 316 318 306 308 0 31 11 f w y hu er es 64 43 19
4: 4 go.1 319 321 309 311 0 31 11 r a y ie uu qr 26 22 20
5: 5 go.1 322 324 312 314 0 31 11 g w y hp yu re 44 7 0
I'm using this function to generate a desired output:
library(IRanges); library(data.table)
rangeFinder = function(x){
x.ir = reduce(IRanges(x$V2, x$V3))
max.idx = which.max(width(x.ir))
ans = data.table(out = x[1,1],
start = start(x.ir)[max.idx],
end = end(x.ir)[max.idx])
return(ans)}
rangeFinder(x.out)
out start end
1: 1 313 324
I would also like to generate a list with the letters (from columns V9-V11) of the rows that fall between the start and end output from rangeFinder.
For example, the output should look like this.
out
[[go.1]]
[1] "qztfwyraygwy"
rangeFinder is looking at values in column V2 and V3 and printing the longest match of numbers. Notice how "FDT" is not included in the list output even though rangeFinder produced an output from 313-324 (and not from 142-324). How can I get the desired output?
reduce has an argument with.revmap to add a "metadata" column (accessible with mcols()) to the object. This associates with each reduced range the indexes of the original range that map to the reduced range, as an IntegerList class, basically a list where all elements are guaranteed to be integer vectors. So these are the rows you're interested in
ir <- with(x, IRanges(V2, V3))
r <- reduce(ir, with.revmap=TRUE)
i <- unlist(mcols(r)[which.max(width(r)), "revmap"])
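and the character data can be munged with something like
j <- paste0("V", 9:11)
# as.data.frame() so that selecting columns by name with drop=FALSE behaves as in base R;
# t() makes paste0 walk the letters row by row (q, z, t, f, w, y, ...)
paste0(t(as.matrix(as.data.frame(x)[i, j, drop=FALSE])), collapse="")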
It's better to ask your questions about IRanges on the Bioconductor mailing list; no subscription required.
with.revmap is a convenience argument added relatively recently; I think
h = findOverlaps(ir, r)
i = queryHits(h)[subjectHits(h) == which.max(width(r))]
is a replacement.
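To illustrate the logic without Bioconductor, here is a base-R sketch that swaps reduce() for a cumulative-maximum grouping. It assumes integer ranges that should merge when they overlap or are directly adjacent (as 313-315 and 316-318 are in the question); it is not the IRanges implementation, and the variable names are illustrative.

```r
# The relevant columns of the question's table.
x <- data.frame(
  V2  = c(142, 313, 316, 319, 322),
  V3  = c(144, 315, 318, 321, 324),
  V9  = c("F", "q", "f", "r", "g"),
  V10 = c("D", "z", "w", "a", "w"),
  V11 = c("T", "t", "y", "y", "y"),
  stringsAsFactors = FALSE
)
x <- x[order(x$V2), ]

# A new merged range starts when a start lies beyond the furthest end so far + 1.
run_end <- cummax(x$V3)
new_grp <- c(TRUE, x$V2[-1] > head(run_end, -1) + 1)
grp     <- cumsum(new_grp)

# Width of each merged range; keep the rows belonging to the widest one.
starts <- tapply(x$V2, grp, min)
ends   <- tapply(x$V3, grp, max)
best   <- names(which.max(ends - starts + 1))
rows   <- x[grp == best, c("V9", "V10", "V11")]

# t() so the letters are concatenated row by row.
letters_out <- paste0(t(as.matrix(rows)), collapse = "")
letters_out
```

The row with "FDT" ends up in its own group (142-144) and is therefore excluded, which matches the behaviour the question describes.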
