Two columns' results are coming in the same column in R

I have a data frame with a "sol.grp" (non-numeric) column and an "age" (numeric) column. I'm trying to store the mean of age and the count of observations in two separate columns.
I used the following code:
> summary <- data.frame(aggregate(age~sol.grp, data=na.omit(all.tkts), FUN=function(x) c(mean= mean(x), count=length(x))))
The mean and count are coming out in the same column (shown below).
I do not know what's wrong. Any ideas? Thanks in advance for your help!
Edit: The example dataset is shown at the end.
row sol.grp Mean
1 Account A 187.7154
2 Account B 215.7747
3 WMID 199.0201
4 Qty 254.5545
5 PM 210.7109
6 CS 165.6500
7 ED 158.5483
8 TM 271.1966
9 39.0000
10 131.0000
11 189.0000
12 149.0000
13 3533.0000
14 2.0000
15 338.0000
16 58.0000
Example data (top 20 rows):
sol.grp age
Account A 29.6
Account B 29.6
WMID 26.9
Qty 1.7
PM 3.0
CS 2043.8
ED 24.3
TM 24.3
Account A 24.3
Account B 133.3
WMID 27.0
Qty 2.1
PM 29.2
CS 29.4
ED 97.8
TM 192.9
Account A 651.6
Account B 148.6
WMID 125.2
Qty 31.1

You could try this using data.table:
library(data.table)
res1 <- setDT(all.tkts)[, list(Mean=mean(age, na.rm=TRUE), Count=.N),
keyby=sol.grp]
Using the example below, the aggregate results do not show any anomaly; wrapping the call in do.call(data.frame, ...) flattens the matrix column that aggregate returns into separate age.mean and age.count columns:
res2 <- do.call(data.frame,aggregate(age~sol.grp,
data=na.omit(all.tkts), FUN=function(x) c(mean= mean(x), count=length(x))))
res2
# sol.grp age.mean age.count
#1 Account A 235.16667 3
#2 Account B 103.83333 3
#3 CS 1036.60000 2
#4 ED 61.05000 2
#5 PM 16.10000 2
#6 Qty 11.63333 3
#7 TM 108.60000 2
#8 WMID 59.70000 3
data
all.tkts <- structure(list(sol.grp = structure(c(1L, 2L, 8L, 6L, 5L, 3L,
4L, 7L, 1L, 2L, 8L, 6L, 5L, 3L, 4L, 7L, 1L, 2L, 8L, 6L), .Label = c("Account A",
"Account B", "CS", "ED", "PM", "Qty", "TM", "WMID"), class = "factor"),
age = c(29.6, 29.6, 26.9, 1.7, 3, 2043.8, 24.3, 24.3, 24.3,
133.3, 27, 2.1, 29.2, 29.4, 97.8, 192.9, 651.6, 148.6, 125.2,
31.1)), .Names = c("sol.grp", "age"), class = "data.frame", row.names = c(NA,
-20L))

The following, based on your own code, works well:
aggregate(age~sol.grp, data=na.omit(all.tkts), FUN=function(x) c(mean= mean(x), count=length(x)))
sol.grp age.mean age.count
1 Account A 235.16667 3.00000
2 Account B 103.83333 3.00000
3 CS 1036.60000 2.00000
4 ED 61.05000 2.00000
5 PM 16.10000 2.00000
6 Qty 11.63333 3.00000
7 TM 108.60000 2.00000
8 WMID 59.70000 3.00000
Just avoid wrapping aggregate in data.frame(), since aggregate already returns a data.frame.
EDIT:
The details of the output are:
> dd = aggregate(age~sol.grp, data=na.omit(all.tkts), FUN=function(x) c(mean= mean(x), count=length(x)))
> str(dd)
'data.frame': 8 obs. of 2 variables:
$ sol.grp: Factor w/ 8 levels "Account A","Account B",..: 1 2 3 4 5 6 7 8
$ age : num [1:8, 1:2] 235.2 103.8 1036.6 61 16.1 ...
..- attr(*, "dimnames")=List of 2
.. ..$ : NULL
.. ..$ : chr "mean" "count"
>
> dd$sol.grp
[1] Account A Account B CS ED PM Qty TM WMID
Levels: Account A Account B CS ED PM Qty TM WMID
> dd$age
mean count
[1,] 235.16667 3
[2,] 103.83333 3
[3,] 1036.60000 2
[4,] 61.05000 2
[5,] 16.10000 2
[6,] 11.63333 3
[7,] 108.60000 2
[8,] 59.70000 3
>
> dd$age[,2]
[1] 3 3 2 2 2 3 2 3
>
> dd$age[,1]
[1] 235.16667 103.83333 1036.60000 61.05000 16.10000 11.63333 108.60000 59.70000
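So age is a single column that is itself a two-column matrix, which is why the two statistics travel together. A minimal way to flatten it into ordinary columns, assuming the dd built above, is:
# expand the matrix column into separate age.mean and age.count columns
dd.flat <- do.call(data.frame, dd)
str(dd.flat)
# 'data.frame': 8 obs. of 3 variables:
# $ sol.grp  : Factor w/ 8 levels "Account A","Account B",..: 1 2 3 4 5 6 7 8
# $ age.mean : num 235.2 103.8 1036.6 61 16.1 ...
# $ age.count: num 3 3 2 2 2 3 2 3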

Related

Filter data frame based on numeric vector with "tolerance"

I would like to filter a data frame using a numeric vector. I am applying the function below:
test_data <- exp_data[exp_data$Size_Change %in% vec_data,]
This is what the example data looks like:
dput(exp_data)
structure(list(Name = c("Mark", "Greg", "Tomas", "Morka", "Pekka",
"Robert", "Tim", "Tom", "Bobby", "Terka"), Mode = c(1, 2, NA,
4, NA, 3, NA, 1, NA, 3), Change = structure(c(6L, 2L, 4L, 5L,
7L, 7L, 7L, 8L, 3L, 1L), .Label = c("D[+58], I[+12][+385]", "C[+58], K[+1206]",
"C[+58], P[+2074]", "C[+58], K[+2172]", "C[+58], K[+259]", "C[+58], K[+2665]",
"C[+58], T[+385]", "C[+58], C[+600]"), class = "factor"), Size = c(1335.261,
697.356, 1251.603, 920.43, 492.236, 393.991, 492.239, 727.696,
1218.933, 495.237), Place = c(3L, 4L, 3L, 2L, 4L, 5L, 4L, 3L,
3L, 4L), Size_Change = c(4004, 2786, 3753, 1840, 1966, 1966,
1966, 2181, 3655, 1978)), row.names = 2049:2058, class = "data.frame")
and vector used for filtering:
dput(vec_data)
c(4003, 2785, 954, 1129, 4013, 756, 1852, 2424, 1954, 246, 147,
234, 562, 1617, 2180, 888, 1176)
I mentioned tolerance because vec_data is not very precise: I expect +1/-1 differences in the numbers, so the filter above misses rows that are off by one. The difference may also be +12/-12 or +24/-24. Can I somehow take this into account while filtering?
Of course, one solution is to build variants like (vec_data + 1), (vec_data - 1), (vec_data + 12), etc., run several filtering passes, and finally rbind all the outputs, but I am looking for a more "elegant" way. It would also be great to add a column indicating how each row was matched: an exact number from vec_data, or one modified by +1, +12, -24, or whatever. Please take into account that a combination of +1/-1 with any other modification is also possible. The additional column is not necessary if it makes things too complicated.
One option could be (tolerance = 1):
library(dplyr)
exp_data %>%
filter(sapply(Size_Change, function(x) any(abs(x - vec_data) %in% 0:1)))
Name Mode Change Size Place Size_Change
1 Mark 1 C[+58], K[+2665] 1335.261 3 4004
2 Greg 2 C[+58], K[+1206] 697.356 4 2786
3 Tom 1 C[+58], C[+600] 727.696 3 2181
Tolerance = 14:
exp_data %>%
filter(sapply(Size_Change, function(x) any(abs(x - vec_data) %in% 0:14)))
Name Mode Change Size Place Size_Change
1 Mark 1 C[+58], K[+2665] 1335.261 3 4004
2 Greg 2 C[+58], K[+1206] 697.356 4 2786
3 Morka 4 C[+58], K[+259] 920.430 2 1840
4 Pekka NA C[+58], T[+385] 492.236 4 1966
5 Robert 3 C[+58], T[+385] 393.991 5 1966
6 Tim NA C[+58], T[+385] 492.239 4 1966
7 Tom 1 C[+58], C[+600] 727.696 3 2181
The same logic with rowwise():
exp_data %>%
rowwise() %>%
filter(any(abs(Size_Change - vec_data) %in% 0:1))
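A fully vectorized variant of the same check builds the whole difference matrix at once with outer(); a sketch using the question's exp_data and vec_data names:
# keep rows whose Size_Change is within 1 of at least one vec_data value
hit <- rowSums(abs(outer(exp_data$Size_Change, vec_data, "-")) <= 1) > 0
exp_data[hit, ]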
The most obvious approach is to filter based on an inequality rather than an exact match (always recommended when comparing numerics [not integers]):
comp <- function(x, yvec, tolerance = 1){
sapply(x, \(xi){any(abs(xi - yvec) <= tolerance)})
}
exp_data[comp(exp_data$Size_Change, vec_data),]
Name Mode Change Size Place Size_Change
2049 Mark 1 C[+58], K[+2665] 1335.261 3 4004
2050 Greg 2 C[+58], K[+1206] 697.356 4 2786
2056 Tom 1 C[+58], C[+600] 727.696 3 2181
# Tolerance = 2
# exp_data[comp(exp_data$Size_Change, vec_data, 2),]
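The question also asked for a column showing how each row was matched. A hypothetical extension (match_offset is not from any answer here, just an illustration) returns the signed offset to the nearest vec_data value instead of a logical:
# signed offset of Size_Change relative to the closest vec_data entry,
# NA when nothing lies within the tolerance
match_offset <- function(x, yvec, tolerance = 1){
  sapply(x, function(xi){
    d <- xi - yvec
    i <- which.min(abs(d))
    if (abs(d[i]) <= tolerance) d[i] else NA
  })
}
exp_data$offset <- match_offset(exp_data$Size_Change, vec_data)
exp_data[!is.na(exp_data$offset), ]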
What about using a tolerance function?
tol <- \(x, tol=1L) sapply(seq(-tol, tol, 1L), \(i) sweep(as.matrix(x), 1L, i))
exp_data[exp_data$Size_Change %in% tol(vec_data), ]
# Name Mode Change Size Place Size_Change
# 2049 Mark 1 C[+58], K[+2665] 1335.261 3 4004
# 2050 Greg 2 C[+58], K[+1206] 697.356 4 2786
# 2056 Tom 1 C[+58], C[+600] 727.696 3 2181
It defaults to a tolerance of ±1; if we want ±24 we can specify it in the argument:
exp_data[exp_data$Size_Change %in% tol(vec_data, 24L), ]
# Name Mode Change Size Place Size_Change
# 2049 Mark 1 C[+58], K[+2665] 1335.261 3 4004
# 2050 Greg 2 C[+58], K[+1206] 697.356 4 2786
# 2052 Morka 4 C[+58], K[+259] 920.430 2 1840
# 2053 Pekka NA C[+58], T[+385] 492.236 4 1966
# 2054 Robert 3 C[+58], T[+385] 393.991 5 1966
# 2055 Tim NA C[+58], T[+385] 492.239 4 1966
# 2056 Tom 1 C[+58], C[+600] 727.696 3 2181
# 2058 Terka 3 D[+58], I[+12][+385] 495.237 4 1978
If you are wondering about the L in 24L, it is integer notation; you may also use tol=24 without any problems.
Note: R version 4.1.2 (2021-11-01)

Dividing grouped data by group means in R

I have data split up into two categories:
z =
Tracer time treatment
15 0 S
20 0 S
25 0 X
04 0 X
55 15 S
16 15 S
15 15 X
20 15 X
I'd like to divide each value of Tracer by the group mean depending on which group it belongs to (e.g. All values of Tracer belonging to time=0 and treatment=S are divided by their mean).
The procedure would be something like this:
1. Find the category means:
aggmeanz <- aggregate(z$Tracer, list(time=z$time, treatment=z$treatment), FUN=mean)
2. Divide z$Tracer by the correct aggmeanz value.
Data:
structure(list(Tracer = c(15L, 20L, 25L, 4L, 55L, 16L, 15L, 20L
), time = c(0L, 0L, 0L, 0L, 15L, 15L, 15L, 15L), treatment = structure(c(1L,
1L, 2L, 2L, 1L, 1L, 2L, 2L), .Label = c("S", "X"), class = "factor")), .Names = c("Tracer",
"time", "treatment"), class = "data.frame", row.names = c(NA,
-8L))
Alternatively, here is a dplyr solution:
library(dplyr)
group_by(z,time,treatment) %>%
mutate(pmean=Tracer/mean(Tracer))
Output:
Tracer time treatment pmean
(int) (int) (fctr) (dbl)
1 15 0 S 0.8571429
2 20 0 S 1.1428571
3 25 0 X 1.7241379
4 4 0 X 0.2758621
5 55 15 S 1.5492958
6 16 15 S 0.4507042
7 15 15 X 0.8571429
8 20 15 X 1.1428571
Data:
z <- read.table(text="Tracer time treatment
15 0 S
20 0 S
25 0 X
04 0 X
55 15 S
16 15 S
15 15 X
20 15 X",head=TRUE)
Is it ok to use non-base tools? With data.table installed and loaded:
z <- data.table(z)
z[, scaledTracer := Tracer/mean(Tracer), by = c("time","treatment")]
This computes the means for each unique combination of time and treatment (which appear to be groups of 2 rows in your data) and scales the Tracer values in each group by the appropriate mean.
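For comparison, base R's ave() does the same group-wise scaling in one line, with no packages; a minimal sketch using the z from the question:
# ave() returns the group mean aligned to each row, so the division is vectorized
z$pmean <- z$Tracer / ave(z$Tracer, z$time, z$treatment, FUN = mean)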
It's not the prettiest but:
groupmeans = aggregate(z$Tracer, by = list(z$time, z$treatment), FUN = mean)
Group.1 Group.2 x
0 S 17.5
15 S 35.5
0 X 14.5
15 X 17.5
names(groupmeans) = c("time", "treatment", "groupmean")
z = merge(z, groupmeans, by = c("time", "treatment"))
z$tracer_div = z$Tracer/z$groupmean
time treatment groupmean Tracer tracer_div
0 S 17.5 15 0.8571429
0 S 17.5 20 1.1428571
0 X 14.5 25 1.7241379
0 X 14.5 4 0.2758621
15 S 35.5 55 1.5492958
15 S 35.5 16 0.4507042
15 X 17.5 15 0.8571429
15 X 17.5 20 1.1428571
You could assign the result back to z$Tracer in the final step if you didn't want to create a whole new column. It can be nice to keep every step, though, in case you want to use it in another calculation or a plot later.
A base R solution:
do.call(c, lapply(split(z[1], z[, -1]), FUN = function(x) x[[1]]/mean(x[[1]])))
# 0.S1 0.S2 15.S1 15.S2 0.X1 0.X2 15.X1 15.X2
#0.8571429 1.1428571 1.5492958 0.4507042 1.7142857 0.2857143 0.8571429 1.1428571
Split into time × treatment groups first, then divide each group by its mean; finally, glue everything back together with c. Note that the result is ordered by group rather than by the original rows.
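If you need the scaled values back in the original row order, unsplit() reverses the split; a sketch under the same grouping:
# divide within each time/treatment group, then restore the input row order
z$scaled <- unsplit(lapply(split(z$Tracer, z[, -1]), function(x) x/mean(x)), z[, -1])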

Sort data by factor and output into a matrix (or df) in R

I have looked through other posts and I think I have an idea of what I could do, but I want to be clear!
I have a very large data frame that contains 4 variables and a number of rows.
Chain ResId ResNum Energy
1 C O17 500 -37.03670
2 A ARG 8 -0.84560
3 A LEU 24 -0.56739
4 A ASP 25 -0.98583
5 B ARG 8 -0.64880
6 B LEU 24 -0.58380
7 B ASP 25 -0.85930
Each row contains Chain (A, B, or C), ResId, ResNum, and Energy. I would like to sort this data so that all of the energy values belonging to a specific ResId and ResNum in each chain are clustered together. By cluster I mean either all of the values for "ARG 8" are grouped, or all of the rows containing "ARG 8" are grouped; I don't know which is more efficient. Ideally, I would like the output for each residue to be
ARG 8
0.000
0.000
0.000
where the "0.000" are the energy values for ARG 8 or O17 and so on.
Sorry for the header breaks, I wanted the data to be clean, but I can't insert images.
data
structure(list(Chain = structure(c(3L, 1L, 1L, 1L, 2L, 2L, 2L
), .Label = c("A", "B", "C"), class = "factor"), ResId = structure(c(4L,
1L, 3L, 2L, 1L, 3L, 2L), .Label = c("ARG", "ASP", "LEU", "O17"
), class = "factor"), ResNum = c(500L, 8L, 24L, 25L, 8L, 24L,
25L), Energy = c(-37.0367, -0.8456, -0.56739, -0.98583, -0.6488,
-0.5838, -0.8593)), .Names = c("Chain", "ResId", "ResNum", "Energy"
), class = "data.frame", row.names = c(NA, -7L))
If you want to convert to wide format:
library(reshape2)
dcast(df, ResId+ResNum~paste0('Energy.',Chain), value.var='Energy')
# ResId ResNum Energy.A Energy.B Energy.C
#1 ARG 8 -0.84560 -0.6488 NA
#2 ASP 25 -0.98583 -0.8593 NA
#3 LEU 24 -0.56739 -0.5838 NA
#4 O17 500 NA NA -37.0367
After your edit, the output you are most likely looking for is:
library(reshape2)
dcast(df, ResId~Chain, value.var= 'Energy')
ResId A B C
1 ARG -0.84560 -0.6488 NA
2 ASP -0.98583 -0.8593 NA
3 LEU -0.56739 -0.5838 NA
4 O17 NA NA -37.0367
This will put the values together; you can refine further based on your desired output.
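If you literally want each residue's energies clustered under a "ResId ResNum" heading, a minimal base R sketch (using the df from the Data section below) is:
# named list with one vector of energies per residue
split(df$Energy, paste(df$ResId, df$ResNum))
# $`ARG 8`
# [1] -0.8456 -0.6488
# $`ASP 25`
# [1] -0.98583 -0.85930
# ...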
df[order(df$ResId), ]
Chain ResId ResNum Energy
2 A ARG 8 -0.84560
5 B ARG 8 -0.64880
4 A ASP 25 -0.98583
7 B ASP 25 -0.85930
3 A LEU 24 -0.56739
6 B LEU 24 -0.58380
1 C O17 500 -37.03670
#With dplyr
library(dplyr)
df %>%
arrange(ResId)
Chain ResId ResNum Energy
1 A ARG 8 -0.84560
2 B ARG 8 -0.64880
3 A ASP 25 -0.98583
4 B ASP 25 -0.85930
5 A LEU 24 -0.56739
6 B LEU 24 -0.58380
7 C O17 500 -37.03670
Data
df <- read.table(text = '
Chain ResId ResNum Energy
C O17 500 -37.0367
A ARG 8 -0.8456
A LEU 24 -0.56739
A ASP 25 -0.98583
B ARG 8 -0.6488
B LEU 24 -0.5838
B ASP 25 -0.8593', header=T)
Try this:
df <- df[order(df$Chain, df$ResId, df$ResNum),]
where df is the name of your data frame. This should order it for you.

Grouping, comparing, and counting rows in R

I have a data frame that looks like the following:
system Id initial final
665 9 16001 6070 6071
683 10 16001 6100 6101
696 11 16001 6101 6113
712 10 16971 6150 6151
715 11 16971 6151 6163
4966 7 4118 10238 10242
5031 9 4118 10260 10278
5088 10 4118 10279 10304
5115 11 4118 10305 10317
structure(list(system = c(9L, 10L, 11L, 10L, 11L, 7L, 9L, 10L,
11L), Id = c(16001L, 16001L, 16001L, 16971L, 16971L, 4118L, 4118L,
4118L, 4118L), initial = c(6070, 6100, 6101, 6150, 6151, 10238,
10260, 10279, 10305), final = c(6071, 6101, 6113, 6151, 6163,
10242, 10278, 10304, 10317)), .Names = c("system", "Id", "initial",
"final"), row.names = c(665L, 683L, 696L, 712L, 715L, 4966L,
5031L, 5088L, 5115L), class = "data.frame")
I would like to get a new data frame with the following structure:
Id system length initial final
1 16001 9,10,11 3 6070 6113
2 16971 10,11 2 6150 6163
3 4118 7 1 10238 10242
4 4118 9,10,11 3 10260 10317
structure(list(Id = c(16001L, 16971L, 4118L, 4118L), system = structure(c(3L,
1L, 2L, 3L), .Label = c("10,11", "7", "9,10,11"), class = "factor"),
length = c(3L, 2L, 1L, 3L), initial = c(6070L, 6150L, 10238L,
10260L), final = c(6113, 6163, 10242, 10317)), .Names = c("Id",
"system", "length", "initial", "final"), class = "data.frame", row.names = c(NA,
-4L))
The grouping is by Id, with consecutive rows belonging to the same group when the difference between their "system" values equals one. I would also like to get the different "system" values and how many of them are involved in each group. Finally, I want a column with the first "initial" and the last "final" involved as well.
Is it possible to do that in R?
Thanks.
You could use data.table. Convert the "data.frame" to a "data.table" (setDT), create a grouping variable "indx" by taking the difference of adjacent elements of "system" (diff(system)) and cumsum-ing the logical vector, then use "Id" and "indx" as grouping variables to get the statistics.
library(data.table)
setDT(df)[,list(system=toString(system), length=.N, initial=initial[1L],
final=final[.N]), by=list(Id,indx=cumsum(c(TRUE, diff(system)!=1)))][,
indx:=NULL][]
# Id system length initial final
#1: 16001 9, 10, 11 3 6070 6113
#2: 16971 10, 11 2 6150 6163
#3: 4118 7 1 10238 10242
#4: 4118 9, 10, 11 3 10260 10317
Or based on #jazzurro's comment about using first/last functions from dplyr,
library(dplyr)
df %>%
group_by(indx=cumsum(c(TRUE, diff(system)!=1)), Id) %>%
summarise(system=toString(system), length=n(),
initial=first(initial), final=last(final))
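To see what the run-detecting index does, evaluate it on its own (df is the question's data):
with(df, cumsum(c(TRUE, diff(system) != 1)))
# [1] 1 1 1 2 2 3 4 4 4
Each run of consecutive "system" values gets its own index, which is why Id 4118 splits into two groups (system 7 alone, then 9, 10, 11).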
A solution without data.table, using plyr:
library(plyr)
func = function(subdf)
{
bool = c(diff(subdf$system),1)==1
ldply(split(subdf, bool), function(u){
data.frame(system = paste(u$system, collapse=','),
Id = unique(u$Id),
length = nrow(u),
initial= head(u,1)$initial,
final = tail(u,1)$final)
})
}
ldply(split(df, df$Id), func)
# .id system length Id initial final
#1 FALSE 7 1 4118 10238 10242
#2 TRUE 9,10,11 3 4118 10260 10317
#3 TRUE 9,10,11 3 16001 6070 6113
#4 TRUE 10,11 2 16971 6150 6163

Using ddply to find correlation of a dataframe for separate groups for R [duplicate]

This question already has answers here:
spearman correlation by group in R
(4 answers)
Closed 8 years ago.
ds <- structure(list(GPA = c(1.78, 2.38, 2.43, 1.98, 1.56, 2.32, 1.96,
2.73, 2, 3.59), STUDY_STAGE = c(3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L,
3L, 3L), OLAGG = c(18, 14, 14, 17, 17, 16, 16, 15, 14, 15)), .Names = c("GPA",
"STUDY_STAGE", "OLAGG"), row.names = c(NA, 10L), class = "data.frame")
I've made reference to this post
spearman correlation by group in R
However, when I attempted to find the correlation for each STUDY_STAGE subgroup (there are 3), I obtained the same value for all of them.
ddply(ds,.(STUDY_STAGE), summarise, cor(ds$GPA, ds$OLAGG, method = "spearman"))
STUDY_STAGE ..1
1 1 -0.2805924
2 2 -0.2805924
3 3 -0.2805924
Additional information on the dataframe
str(ds)
'data.frame': 3167 obs. of 3 variables:
$ GPA : num 1.78 2.38 2.43 1.98 1.56 2.32 1.96 2.73 2 3.59 ...
$ STUDY_STAGE: int 3 3 3 3 3 3 3 3 3 3 ...
$ OLAGG : num 18 14 14 17 17 16 16 15 14 15 ...
Just to show that they should have different correlation values:
ds.yr1<-ds[ds$STUDY_STAGE=="Yr 1",]
cor(ds.yr1$GPA, ds.yr1$OLAGG)
[1] -0.3313926
ds.yr2<-ds[ds$STUDY_STAGE=="Yr 2",]
cor(ds.yr2$GPA, ds.yr2$OLAGG)
[1] -0.2905399
Full data is available here: https://dl.dropboxusercontent.com/u/64487083/R/mydata.csv
Question:
How can I find the correlation for each of the 3 different STUDY_STAGE groups?
Thank you all for your time and effort!
By using ds$GPA and ds$OLAGG, we are calculating the correlation over the whole columns instead of within each group; inside summarise, refer to the columns without the ds$ prefix so they are evaluated per group.
ds <- read.csv("mydata.csv") # full data from the link
cor(ds$GPA, ds$OLAGG, method='spearman')
#[1] -0.2805924
library(plyr)
ddply(ds, .(STUDY_STAGE), summarise, Cor=cor(GPA, OLAGG, method = "spearman"))
# STUDY_STAGE Cor
#1 Yr 1 -0.3337192
#2 Yr 2 -0.2803793
#3 Yr 3 -0.2090219
cor(ds.yr1$GPA, ds.yr1$OLAGG, method='spearman')
#[1] -0.3337192
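For reference, the same per-group correlation with dplyr instead of plyr; a sketch assuming ds holds the full data:
library(dplyr)
ds %>%
  group_by(STUDY_STAGE) %>%
  summarise(Cor = cor(GPA, OLAGG, method = "spearman"))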
