I have a data frame with "sol.grp" (non-numeric) and "age" (numeric) columns. I'm trying to store the mean of age and the count of observations in two separate columns.
I used the following code:
> summary <- data.frame(aggregate(age~sol.grp, data=na.omit(all.tkts), FUN=function(x) c(mean= mean(x), count=length(x))))
The mean and the count end up in the same column (shown below).
I do not know what's wrong. Any ideas? Thanks in advance for your help !
Edit: The example dataset is shown at the end
row sol.grp Mean
1 Account A 187.7154
2 Account B 215.7747
3 WMID 199.0201
4 Qty 254.5545
5 PM 210.7109
6 CS 165.6500
7 ED 158.5483
8 TM 271.1966
9 39.0000
10 131.0000
11 189.0000
12 149.0000
13 3533.0000
14 2.0000
15 338.0000
16 58.0000
Example data: (Top 20 rows)
sol.grp age
Account A 29.6
Account B 29.6
WMID 26.9
Qty 1.7
PM 3.0
CS 2043.8
ED 24.3
TM 24.3
Account A 24.3
Account B 133.3
WMID 27.0
Qty 2.1
PM 29.2
CS 29.4
ED 97.8
TM 192.9
Account A 651.6
Account B 148.6
WMID 125.2
Qty 31.1
You could try this using data.table
library(data.table)
res1 <- setDT(all.tkts)[, list(Mean=mean(age, na.rm=TRUE), Count=.N),
keyby=sol.grp]
aggregate shows no anomaly with the example below once its result is passed through do.call(data.frame, ...):
res2 <- do.call(data.frame,aggregate(age~sol.grp,
data=na.omit(all.tkts), FUN=function(x) c(mean= mean(x), count=length(x))))
res2
# sol.grp age.mean age.count
#1 Account A 235.16667 3
#2 Account B 103.83333 3
#3 CS 1036.60000 2
#4 ED 61.05000 2
#5 PM 16.10000 2
#6 Qty 11.63333 3
#7 TM 108.60000 2
#8 WMID 59.70000 3
data
all.tkts <- structure(list(sol.grp = structure(c(1L, 2L, 8L, 6L, 5L, 3L,
4L, 7L, 1L, 2L, 8L, 6L, 5L, 3L, 4L, 7L, 1L, 2L, 8L, 6L), .Label = c("Account A",
"Account B", "CS", "ED", "PM", "Qty", "TM", "WMID"), class = "factor"),
age = c(29.6, 29.6, 26.9, 1.7, 3, 2043.8, 24.3, 24.3, 24.3,
133.3, 27, 2.1, 29.2, 29.4, 97.8, 192.9, 651.6, 148.6, 125.2,
31.1)), .Names = c("sol.grp", "age"), class = "data.frame", row.names = c(NA,
-20L))
The following, taken from your own code, works well:
aggregate(age~sol.grp, data=na.omit(all.tkts), FUN=function(x) c(mean= mean(x), count=length(x)))
sol.grp age.mean age.count
1 Account A 235.16667 3.00000
2 Account B 103.83333 3.00000
3 CS 1036.60000 2.00000
4 ED 61.05000 2.00000
5 PM 16.10000 2.00000
6 Qty 11.63333 3.00000
7 TM 108.60000 2.00000
8 WMID 59.70000 3.00000
Just avoid wrapping aggregate in data.frame(), since aggregate already returns a data.frame.
EDIT:
The details of output are:
> dd = aggregate(age~sol.grp, data=na.omit(all.tkts), FUN=function(x) c(mean= mean(x), count=length(x)))
> str(dd)
'data.frame': 8 obs. of 2 variables:
$ sol.grp: Factor w/ 8 levels "Account A","Account B",..: 1 2 3 4 5 6 7 8
$ age : num [1:8, 1:2] 235.2 103.8 1036.6 61 16.1 ...
..- attr(*, "dimnames")=List of 2
.. ..$ : NULL
.. ..$ : chr "mean" "count"
>
> dd$sol.grp
[1] Account A Account B CS ED PM Qty TM WMID
Levels: Account A Account B CS ED PM Qty TM WMID
> dd$age
mean count
[1,] 235.16667 3
[2,] 103.83333 3
[3,] 1036.60000 2
[4,] 61.05000 2
[5,] 16.10000 2
[6,] 11.63333 3
[7,] 108.60000 2
[8,] 59.70000 3
>
> dd$age[,2]
[1] 3 3 2 2 2 3 2 3
>
> dd$age[,1]
[1] 235.16667 103.83333 1036.60000 61.05000 16.10000 11.63333 108.60000 59.70000
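The str() output above suggests why mean and count can look like a single column: aggregate stores both statistics in one matrix column called age. A minimal sketch of expanding that matrix into ordinary columns, assuming a small toy data frame tk (a hypothetical name, with values copied from the example rows):

```r
# Toy data in the same shape as all.tkts (tk is a made-up name)
tk <- data.frame(
  sol.grp = c("Account A", "Account B", "Account A", "Account B", "Account A"),
  age     = c(29.6, 29.6, 24.3, 133.3, 651.6)
)

agg <- aggregate(age ~ sol.grp, data = tk,
                 FUN = function(x) c(mean = mean(x), count = length(x)))

ncol(agg)           # 2: sol.grp plus one matrix column named age
is.matrix(agg$age)  # TRUE

# Passing each column as a separate argument lets data.frame()
# split the matrix into age.mean and age.count
flat <- do.call(data.frame, agg)
names(flat)
```

After this step flat$age.mean and flat$age.count can be used like any other column.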
I would like to filter a data frame using a numeric vector. I am using the code below:
test_data <- exp_data[exp_data$Size_Change %in% vec_data,]
This is what the example data looks like:
dput(exp_data)
structure(list(Name = c("Mark", "Greg", "Tomas", "Morka", "Pekka",
"Robert", "Tim", "Tom", "Bobby", "Terka"), Mode = c(1, 2, NA,
4, NA, 3, NA, 1, NA, 3), Change = structure(c(6L, 2L, 4L, 5L,
7L, 7L, 7L, 8L, 3L, 1L), .Label = c("D[+58], I[+12][+385]", "C[+58], K[+1206]",
"C[+58], P[+2074]", "C[+58], K[+2172]", "C[+58], K[+259]", "C[+58], K[+2665]",
"C[+58], T[+385]", "C[+58], C[+600]"), class = "factor"), Size = c(1335.261,
697.356, 1251.603, 920.43, 492.236, 393.991, 492.239, 727.696,
1218.933, 495.237), Place = c(3L, 4L, 3L, 2L, 4L, 5L, 4L, 3L,
3L, 4L), Size_Change = c(4004, 2786, 3753, 1840, 1966, 1966,
1966, 2181, 3655, 1978)), row.names = 2049:2058, class = "data.frame")
and vector used for filtering:
dput(vec_data)
c(4003, 2785, 954, 1129, 4013, 756, 1852, 2424, 1954, 246, 147,
234, 562, 1617, 2180, 888, 1176)
I mention tolerance because vec_data is not very precise: I expect differences of +1/-1 between the numbers, and the filter above will not keep rows with such a difference. The difference may also be +12/-12 or +24/-24. Can I somehow take this into account while filtering?
Of course, one solution would be to build (vec_data + 1), (vec_data - 1), (vec_data + 12), etc., filter several times and finally rbind all the outputs, but I am looking for a more "elegant" way. It would also be great to add a column indicating how each row was matched: by an exact number from vec_data, or by one modified by +1, +12, -24 or whatever. Please take into account that a combination of +1/-1 with any other modification is also possible. The additional column is not necessary if it makes things too complicated.
One option could be (tolerance = 1):
library(dplyr)
# df is exp_data and vec is vec_data from the question
df %>%
filter(sapply(Size_Change, function(x) any(abs(x - vec) %in% 0:1)))
Name Mode Change Size Place Size_Change
1 Mark 1 C[+58], K[+2665] 1335.261 3 4004
2 Greg 2 C[+58], K[+1206] 697.356 4 2786
3 Tom 1 C[+58], C[+600] 727.696 3 2181
Tolerance = 14:
df %>%
filter(sapply(Size_Change, function(x) any(abs(x - vec) %in% 0:14)))
Name Mode Change Size Place Size_Change
1 Mark 1 C[+58], K[+2665] 1335.261 3 4004
2 Greg 2 C[+58], K[+1206] 697.356 4 2786
3 Morka 4 C[+58], K[+259] 920.430 2 1840
4 Pekka NA C[+58], T[+385] 492.236 4 1966
5 Robert 3 C[+58], T[+385] 393.991 5 1966
6 Tim NA C[+58], T[+385] 492.239 4 1966
7 Tom 1 C[+58], C[+600] 727.696 3 2181
The same logic with rowwise():
df %>%
rowwise() %>%
filter(any(abs(Size_Change - vec) %in% 0:1))
The most obvious approach is to filter on an inequality rather than an exact match (always recommended when comparing non-integer numeric values):
comp <- function(x, yvec, tolerance = 1){
sapply(x, \(xi){any(abs(xi - yvec) <= tolerance)})
}
exp_data[comp(exp_data$Size_Change, vec_data),]
Name Mode Change Size Place Size_Change
2049 Mark 1 C[+58], K[+2665] 1335.261 3 4004
2050 Greg 2 C[+58], K[+1206] 697.356 4 2786
2056 Tom 1 C[+58], C[+600] 727.696 3 2181
# Tolerance = 2
# exp_data[comp(exp_data$Size_Change, vec_data, 2),]
What about using a tolerance function?
tol <- \(x, tol=1L) sapply(seq(-tol, tol, 1L), \(i) sweep(as.matrix(x), 1L, i))
exp_data[exp_data$Size_Change %in% tol(vec_data), ]
# Name Mode Change Size Place Size_Change
# 2049 Mark 1 C[+58], K[+2665] 1335.261 3 4004
# 2050 Greg 2 C[+58], K[+1206] 697.356 4 2786
# 2056 Tom 1 C[+58], C[+600] 727.696 3 2181
It defaults to tolerance ±1, if we want ±24 we may define it in the argument:
exp_data[exp_data$Size_Change %in% tol(vec_data, 24L), ]
# Name Mode Change Size Place Size_Change
# 2049 Mark 1 C[+58], K[+2665] 1335.261 3 4004
# 2050 Greg 2 C[+58], K[+1206] 697.356 4 2786
# 2052 Morka 4 C[+58], K[+259] 920.430 2 1840
# 2053 Pekka NA C[+58], T[+385] 492.236 4 1966
# 2054 Robert 3 C[+58], T[+385] 393.991 5 1966
# 2055 Tim NA C[+58], T[+385] 492.239 4 1966
# 2056 Tom 1 C[+58], C[+600] 727.696 3 2181
# 2058 Terka 3 D[+58], I[+12][+385] 495.237 4 1978
If you are wondering about the L in 24L, it is integer notation; you may also use tol=24 without any problems.
Note: R version 4.1.2 (2021-11-01)
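The question also asks for a column showing how each row was matched. One possible sketch, not taken from the answers above: record the signed difference to the closest element of vec_data and filter on its absolute value. The helper name nearest_offset and the tolerance of 12 are assumptions made for illustration.

```r
size_change <- c(4004, 2786, 3753, 1840, 1966, 1966, 1966, 2181, 3655, 1978)
vec_data <- c(4003, 2785, 954, 1129, 4013, 756, 1852, 2424, 1954, 246, 147,
              234, 562, 1617, 2180, 888, 1176)

# Hypothetical helper: signed difference between each x and its nearest vec value
nearest_offset <- function(x, vec) {
  sapply(x, function(xi) {
    d <- xi - vec
    d[which.min(abs(d))]
  })
}

offset <- nearest_offset(size_change, vec_data)
keep <- abs(offset) <= 12   # assumed tolerance

# Matched values together with the offset that explains the match
data.frame(Size_Change = size_change, offset = offset)[keep, ]
```

With the example data, an offset of 1 marks the near-exact matches and an offset of 12 or -12 marks the rows that only match under the wider tolerance.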
I have data split up into two categories:
z= Tracer time treatment
15 0 S
20 0 S
25 0 X
04 0 X
55 15 S
16 15 S
15 15 X
20 15 X
I'd like to divide each value of Tracer by the group mean depending on which group it belongs to (e.g. All values of Tracer belonging to time=0 and treatment=S are divided by their mean).
The procedure would be something like this:
Find category means as follows:
1:
aggmeanz <-aggregate(z$Tracer, list(time=z$time,treatment=z$treatment), FUN=mean)
2: Divide z$Tracer by the correct aggmeanz value
structure(list(Tracer = c(15L, 20L, 25L, 4L, 55L, 16L, 15L, 20L
), time = c(0L, 0L, 0L, 0L, 15L, 15L, 15L, 15L), treatment = structure(c(1L,
1L, 2L, 2L, 1L, 1L, 2L, 2L), .Label = c("S", "X"), class = "factor")), .Names = c("Tracer",
"time", "treatment"), class = "data.frame", row.names = c(NA,
-8L))
Alternatively, here is a dplyr solution:
library(dplyr)
group_by(z,time,treatment) %>%
mutate(pmean=Tracer/mean(Tracer))
Output:
Tracer time treatment pmean
(int) (int) (fctr) (dbl)
1 15 0 S 0.8571429
2 20 0 S 1.1428571
3 25 0 X 1.7241379
4 4 0 X 0.2758621
5 55 15 S 1.5492958
6 16 15 S 0.4507042
7 15 15 X 0.8571429
8 20 15 X 1.1428571
Data:
z <- read.table(text="Tracer time treatment
15 0 S
20 0 S
25 0 X
04 0 X
55 15 S
16 15 S
15 15 X
20 15 X",header=TRUE)
Is it ok to use non-base tools? With data.table installed and loaded:
z <- data.table(z)
z[, scaledTracer := Tracer/mean(Tracer), by = c("time","treatment")]
This computes the mean for each unique combination of time and treatment (which appear to be groups of 2 rows in your data) and scales the Tracer values in each group by the appropriate mean.
It's not the prettiest but:
groupmeans = aggregate(z$Tracer, by = list(z$time, z$treatment), FUN = mean)
Group.1 Group.2 x
0 S 17.5
15 S 35.5
0 X 14.5
15 X 17.5
names(groupmeans) = c("time", "treatment", "groupmean")
z = merge(z, groupmeans, by = c("time","treatment"))
z$tracer_div = z$Tracer/z$groupmean
time treatment groupmean Tracer tracer_div
0 S 17.5 15 0.8571429
0 S 17.5 20 1.1428571
0 X 14.5 25 1.7241379
0 X 14.5 4 0.2758621
15 S 35.5 55 1.5492958
15 S 35.5 16 0.4507042
15 X 17.5 15 0.8571429
15 X 17.5 20 1.1428571
You could reassign z$Tracer to the final step if you didn't want to create a whole new column. It can be nice to keep every step though in case you want to use it in another calculation or plot later.
A base R solution:
do.call(c, lapply(split(z[1], z[, -1]), FUN = function(x) x[[1]]/mean(x[[1]])))
# 0.S1 0.S2 15.S1 15.S2 0.X1 0.X2 15.X1 15.X2
#0.8571429 1.1428571 1.5492958 0.4507042 1.7241379 0.2758621 0.8571429 1.1428571
Split into time x treatment groups first, then divide each group by its mean, and finally glue the pieces back together with c.
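Because split() returns its pieces in group order, the vector above does not line up with the rows of z. A minimal sketch of an alternative using base R's ave(), which applies a function within groups and returns the result in the original row order (the column name pmean is borrowed from the dplyr answer):

```r
z <- data.frame(
  Tracer    = c(15, 20, 25, 4, 55, 16, 15, 20),
  time      = c(0, 0, 0, 0, 15, 15, 15, 15),
  treatment = c("S", "S", "X", "X", "S", "S", "X", "X")
)

# Divide each Tracer value by the mean of its time x treatment group
z$pmean <- ave(z$Tracer, z$time, z$treatment,
               FUN = function(x) x / mean(x))
z
```

Each group of pmean values averages to 1, which is a quick way to check the scaling.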
I have looked through other posts and I think I have an idea of what I could do, but I want to be clear!
I have a very large data frame that contains 4 variables and a number of rows.
Chain ResId ResNum Energy
1 C O17 500 -37.03670
2 A ARG 8 -0.84560
3 A LEU 24 -0.56739
4 A ASP 25 -0.98583
5 B ARG 8 -0.64880
6 B LEU 24 -0.58380
7 B ASP 25 -0.85930
Each row contains CHAIN (A, B, or C), ResID, ResNum, and Energy. I would like to sort this data so that all of the energy values belonging to a specific Resid and num in each chain are clustered together. By cluster I mean all of the values for "ARG 8" are grouped or all of the rows containing "ARG 8" are grouped. I don't know which is more efficient. Ideally, I would like the output for all residues to be
ARG 8
0.000
0.000
0.000
where the "0.000" are the energy values for ARG 8 or O17 and so on.
Sorry for the header breaks, I wanted the data to be clean, but I can't insert images.
data
structure(list(Chain = structure(c(3L, 1L, 1L, 1L, 2L, 2L, 2L
), .Label = c("A", "B", "C"), class = "factor"), ResId = structure(c(4L,
1L, 3L, 2L, 1L, 3L, 2L), .Label = c("ARG", "ASP", "LEU", "O17"
), class = "factor"), ResNum = c(500L, 8L, 24L, 25L, 8L, 24L,
25L), Energy = c(-37.0367, -0.8456, -0.56739, -0.98583, -0.6488,
-0.5838, -0.8593)), .Names = c("Chain", "ResId", "ResNum", "Energy"
), class = "data.frame", row.names = c(NA, -7L))
If you want to convert to wide format:
library(reshape2)
dcast(df, ResId+ResNum~paste0('Energy.',Chain), value.var='Energy')
# ResId ResNum Energy.A Energy.B Energy.C
#1 ARG 8 -0.84560 -0.6488 NA
#2 ASP 25 -0.98583 -0.8593 NA
#3 LEU 24 -0.56739 -0.5838 NA
#4 O17 500 NA NA -37.0367
After your edit, the output you are most likely looking for is:
library(reshape2)
dcast(df, ResId~Chain, value.var= 'Energy')
ResId A B C
1 ARG -0.84560 -0.6488 NA
2 ASP -0.98583 -0.8593 NA
3 LEU -0.56739 -0.5838 NA
4 O17 NA NA -37.0367
This will put the values together. You can further specify based on your desired output.
df[order(df$ResId), ]
Chain ResId ResNum Energy
2 A ARG 8 -0.84560
5 B ARG 8 -0.64880
4 A ASP 25 -0.98583
7 B ASP 25 -0.85930
3 A LEU 24 -0.56739
6 B LEU 24 -0.58380
1 C O17 500 -37.03670
#With dplyr
library(dplyr)
df %>%
arrange(ResId)
Chain ResId ResNum Energy
1 A ARG 8 -0.84560
2 B ARG 8 -0.64880
3 A ASP 25 -0.98583
4 B ASP 25 -0.85930
5 A LEU 24 -0.56739
6 B LEU 24 -0.58380
7 C O17 500 -37.03670
Data
df <- read.table(text = '
Chain ResId ResNum Energy
C O17 500 -37.0367
A ARG 8 -0.8456
A LEU 24 -0.56739
A ASP 25 -0.98583
B ARG 8 -0.6488
B LEU 24 -0.5838
B ASP 25 -0.8593', header=TRUE)
Try this:
df <- df[order(df$Chain, df$ResId, df$ResNum),]
where df is the name of your dataframe. This should order it for you.
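If the goal is literally one block of energy values per residue, as in the "ARG 8" sketch in the question, splitting the Energy column by a combined label is another option. This is a sketch under that reading of the question; the label format ("ARG 8") and the name by_res are assumptions.

```r
df <- data.frame(
  Chain  = c("C", "A", "A", "A", "B", "B", "B"),
  ResId  = c("O17", "ARG", "LEU", "ASP", "ARG", "LEU", "ASP"),
  ResNum = c(500, 8, 24, 25, 8, 24, 25),
  Energy = c(-37.0367, -0.8456, -0.56739, -0.98583, -0.6488, -0.5838, -0.8593)
)

# One list element per residue, holding that residue's energies across chains
by_res <- split(df$Energy, paste(df$ResId, df$ResNum))
by_res[["ARG 8"]]
```

The list can then be summarised with sapply(by_res, mean) or printed as is.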
I have a data frame that looks like the following:
system Id initial final
665 9 16001 6070 6071
683 10 16001 6100 6101
696 11 16001 6101 6113
712 10 16971 6150 6151
715 11 16971 6151 6163
4966 7 4118 10238 10242
5031 9 4118 10260 10278
5088 10 4118 10279 10304
5115 11 4118 10305 10317
structure(list(system = c(9L, 10L, 11L, 10L, 11L, 7L, 9L, 10L,
11L), Id = c(16001L, 16001L, 16001L, 16971L, 16971L, 4118L, 4118L,
4118L, 4118L), initial = c(6070, 6100, 6101, 6150, 6151, 10238,
10260, 10279, 10305), final = c(6071, 6101, 6113, 6151, 6163,
10242, 10278, 10304, 10317)), .Names = c("system", "Id", "initial",
"final"), row.names = c(665L, 683L, 696L, 712L, 715L, 4966L,
5031L, 5088L, 5115L), class = "data.frame")
I would like to get a new data frame with the following structure:
Id system length initial final
1 16001 9,10,11 3 6070 6113
2 16971 10,11 2 6150 6163
3 4118 7 1 10238 10242
4 4118 9,10,11 3 10260 10317
structure(list(Id = c(16001L, 16971L, 4118L, 4118L), system = structure(c(3L,
1L, 2L, 3L), .Label = c("10,11", "7", "9,10,11"), class = "factor"),
length = c(3L, 2L, 1L, 3L), initial = c(6070L, 6150L, 10238L,
10260L), final = c(6113, 6163, 10242, 10317)), .Names = c("Id",
"system", "length", "initial", "final"), class = "data.frame", row.names = c(NA,
-4L))
Rows are grouped by Id and by runs in which the "system" value increases by exactly one from one row to the next. I would also like to list the "system" values in each group and count how many are involved. Finally, I need columns with the first "initial" and the last "final" value of each group.
Is it possible to do that in R?
Thanks.
You could use data.table. Convert the data.frame to a data.table with setDT, then create a grouping variable "indx" that starts a new group whenever the difference between adjacent "system" values (diff(system)) is not 1, by taking the cumsum of that logical vector. Finally, use "Id" and "indx" as grouping variables to get the statistics.
library(data.table)
setDT(df)[,list(system=toString(system), length=.N, initial=initial[1L],
final=final[.N]), by=list(Id,indx=cumsum(c(TRUE, diff(system)!=1)))][,
indx:=NULL][]
# Id system length initial final
#1: 16001 9, 10, 11 3 6070 6113
#2: 16971 10, 11 2 6150 6163
#3: 4118 7 1 10238 10242
#4: 4118 9, 10, 11 3 10260 10317
Or, based on @jazzurro's comment about using the first/last functions from dplyr,
library(dplyr)
df %>%
group_by(indx=cumsum(c(TRUE, diff(system)!=1)), Id) %>%
summarise(system=toString(system), length=n(),
initial=first(initial), final=last(final))
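Both answers rely on the same run index. A small base R sketch of just that step, using the "system" values from the question (the variable names sys and runs are made up here):

```r
sys <- c(9, 10, 11, 10, 11, 7, 9, 10, 11)

# TRUE marks the first row of a new run: the very first row, and every row
# whose system value is not exactly one more than the previous one
starts <- c(TRUE, diff(sys) != 1)

# The cumulative sum turns those markers into a group id per row
indx <- cumsum(starts)
indx

# Collapse each run into a comma-separated label
runs <- tapply(sys, indx, toString)
runs
```

Grouping by Id together with indx, as in the answers, keeps runs from spilling across different Ids.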
A solution without data.table, using plyr instead:
library(plyr)
func = function(subdf)
{
bool = c(diff(subdf$system),1)==1
ldply(split(subdf, bool), function(u){
data.frame(system = paste(u$system, collapse=','),
Id = unique(u$Id),
length = nrow(u),
initial= head(u,1)$initial,
final = tail(u,1)$final)
})
}
ldply(split(df, df$Id), func)
# .id system length Id initial final
#1 FALSE 7 1 4118 10238 10242
#2 TRUE 9,10,11 3 4118 10260 10317
#3 TRUE 9,10,11 3 16001 6070 6113
#4 TRUE 10,11 2 16971 6150 6163
ds <- structure(list(GPA = c(1.78, 2.38, 2.43, 1.98, 1.56, 2.32, 1.96,
2.73, 2, 3.59), STUDY_STAGE = c(3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L,
3L, 3L), OLAGG = c(18, 14, 14, 17, 17, 16, 16, 15, 14, 15)), .Names = c("GPA",
"STUDY_STAGE", "OLAGG"), row.names = c(NA, 10L), class = "data.frame")
I referred to the post "spearman correlation by group in R".
However, when I attempted to find the correlation within each STUDY_STAGE subgroup (there are 3), I obtained the same value for every group.
ddply(ds,.(STUDY_STAGE), summarise, cor(ds$GPA, ds$OLAGG, method = "spearman"))
STUDY_STAGE ..1
1 1 -0.2805924
2 2 -0.2805924
3 3 -0.2805924
Additional information on the dataframe
str(ds)
'data.frame': 3167 obs. of 3 variables:
$ GPA : num 1.78 2.38 2.43 1.98 1.56 2.32 1.96 2.73 2 3.59 ...
$ STUDY_STAGE: int 3 3 3 3 3 3 3 3 3 3 ...
$ OLAGG : num 18 14 14 17 17 16 16 15 14 15 ...
Just to show that they should have different correlation values:
ds.yr1<-ds[ds$STUDY_STAGE=="Yr 1",]
cor(ds.yr1$GPA, ds.yr1$OLAGG)
[1] -0.3313926
ds.yr2<-ds[ds$STUDY_STAGE=="Yr 2",]
cor(ds.yr2$GPA, ds.yr2$OLAGG)
[1] -0.2905399
Full data is available here: https://dl.dropboxusercontent.com/u/64487083/R/mydata.csv
Question:
How can I find the correlation for each of the 3 study stages?
Thank you all for your time and effort!
By using ds$GPA and ds$OLAGG inside summarise, we compute the correlation over the whole columns instead of within each group. Refer to the columns by their bare names instead.
library(plyr)
ds <- read.csv("mydata.csv") #full data from the link
cor(ds$GPA, ds$OLAGG, method='spearman')
#[1] -0.2805924
ddply(ds,.(STUDY_STAGE), summarise, Cor=cor(GPA, OLAGG, method = "spearman"))
# STUDY_STAGE Cor
#1 Yr 1 -0.3337192
#2 Yr 2 -0.2803793
#3 Yr 3 -0.2090219
cor(ds.yr1$GPA, ds.yr1$OLAGG, method='spearman')
#[1] -0.3337192
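The same by-group computation can be sketched in base R without plyr. The data frame toy below is invented purely for illustration (it is not the asker's data), chosen so that the two groups have opposite correlations:

```r
# Made-up toy data: GPA rises with OLAGG in "Yr 1" and falls in "Yr 2"
toy <- data.frame(
  STUDY_STAGE = rep(c("Yr 1", "Yr 2"), each = 4),
  GPA         = c(1, 2, 3, 4, 1, 2, 3, 4),
  OLAGG       = c(10, 12, 14, 16, 16, 14, 12, 10)
)

# Split into one data frame per stage, then correlate within each piece
res <- sapply(split(toy, toy$STUDY_STAGE),
              function(g) cor(g$GPA, g$OLAGG, method = "spearman"))
res
```

Inside the function, g$GPA refers only to the rows of one stage, which is the difference from writing ds$GPA in the original call.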