Dividing grouped data by group means r - r

I have data split up into two categories:
z= Tracer time treatment
15 0 S
20 0 S
25 0 X
04 0 X
55 15 S
16 15 S
15 15 X
20 15 X
I'd like to divide each value of Tracer by the group mean depending on which group it belongs to (e.g. All values of Tracer belonging to time=0 and treatment=S are divided by their mean).
The procedure would be something like this:
Find category means as follows:
1:
aggmeanz <-aggregate(z$Tracer, list(time=z$time,treatment=z$treatment), FUN=mean)
2: Divide z$Tracer by the correct aggmeanz value
structure(list(Tracer = c(15L, 20L, 25L, 4L, 55L, 16L, 15L, 20L
), time = c(0L, 0L, 0L, 0L, 15L, 15L, 15L, 15L), treatment = structure(c(1L,
1L, 2L, 2L, 1L, 1L, 2L, 2L), .Label = c("S", "X"), class = "factor")), .Names = c("Tracer",
"time", "treatment"), class = "data.frame", row.names = c(NA,
-8L))

Alternatively, here is a dplyr solution:
library(dplyr)
group_by(z,time,treatment) %>%
mutate(pmean=Tracer/mean(Tracer))
Output:
Tracer time treatment pmean
(int) (int) (fctr) (dbl)
1 15 0 S 0.8571429
2 20 0 S 1.1428571
3 25 0 X 1.7241379
4 4 0 X 0.2758621
5 55 15 S 1.5492958
6 16 15 S 0.4507042
7 15 15 X 0.8571429
8 20 15 X 1.1428571
Data:
z <- read.table(text="Tracer time treatment
15 0 S
20 0 S
25 0 X
04 0 X
55 15 S
16 15 S
15 15 X
20 15 X",head=TRUE)

Is it ok to use non-base tools? With data.table installed and loaded:
z <- data.table(z)
z[, scaledTracer := Tracer/mean(Tracer), by = c("time","treatment")]
Would compute means by each unique combination of time and treatment (which appear to be groups of 2 rows in your data), and scale the Tracer values in each group by the appropriate mean.

It's not the prettiest but:
groupmeans = aggregate(z$Tracer, by = list(z$time, z$treatment), FUN = mean)
Group.1 Group.2 x
0 S 17.5
15 S 35.5
0 X 14.5
15 X 17.5
names(groupmeans) = c("time", "treatment", "groupmean")
z = merge(z, groupmeans, id.vars = c("time","treatment" ))
time treatment groupmean Tracer tracer_div
0 S 17.5 15 0.8571429
0 S 17.5 20 1.1428571
0 X 14.5 25 1.7241379
0 X 14.5 4 0.2758621
15 S 35.5 55 1.5492958
15 S 35.5 16 0.4507042
15 X 17.5 15 0.8571429
15 X 17.5 20 1.1428571
z$tracer_div = z$Tracer/z$groupmean
time treatment groupmean Tracer tracer_div
0 S 17.5 15 0.8571429
0 S 17.5 20 1.1428571
0 X 14.5 25 1.7241379
0 X 14.5 4 0.2758621
15 S 35.5 55 1.5492958
15 S 35.5 16 0.4507042
15 X 17.5 15 0.8571429
15 X 17.5 20 1.1428571
You could reassign z$Tracer to the final step if you didn't want to create a whole new column. It can be nice to keep every step though in case you want to use it in another calculation or plot later.

a base R solution:
do.call(c, lapply(split(z[1], z[, -1]), FUN = function(x) x[[1]]/mean(x[[1]])))
# 0.S1 0.S2 15.S1 15.S2 0.X1 0.X2 15.X1 15.X2
#0.8571429 1.1428571 1.5492958 0.4507042 1.7142857 0.2857143 0.8571429 1.1428571
split into timextreatment groups first, then divide each group by mean. finally glue back together with c.

Related

Loop to create dummies out of two df R

for easier explanation I'm gonna use a smaller example.
I have two DF:
DF1: T01 T02 T03 T04 T05
1 15 20 48 25 5
2 12 18 35 30 12
3 13 15 50 60 42
DF2: MEDIAN SD
T01 13 1.24
T02 18 2.05
T03 45 6.64
T04 30 15.45
T05 12 16.04
What I want to do is create a loop that adds a dummy to DF1 for each variable, that take value 1 if DF1$T01 ≈ (almost equal) to DF2$MEDIAN[1], and 0 if it's not, and then goes to T02, T03, until it breaks.
Until now, I haven't been able to create a loop (I'm not really good at creating loops tho) that makes this. I did manage to make the dummy for one of the variables (T01), but in the real DF I have over 40 variables, so doing it by hand it´s not efficient at all. What I have right now is:
DF1$dummyt01 <- ifelse(almost.equal(DF1$T01, DF2$MEDIAN[1], tolerance = 2),1,0)
outcome expected:
DF1: T01 T02 T03 T04 T05 dummyT01 dummyT02 ... dummyT05
1 15 20 48 25 5 1 1 ... 0
2 12 18 35 30 12 1 1 ... 1
3 13 15 50 60 42 1 0 ... 0
Note: Not a native english speaker. Sorry for any mistakes.
EDIT: Expected Outcome.
We may use tidyverse. Loop across the columns of 'DF1', get the column names of that column looped (cur_column()), use that to subset the 'DF2' (as row names) 'MEDIAN' element, do the comparison with almost.equal to return a logical vector, which is coerced to binary with as.integer or +. In the .names add the prefix 'dummy' so as to create as new columns
library(dplyr)
library(berryFunctions)
DF1 <- DF1 %>%
mutate(across(everything(), ~ +(almost.equal(.,
DF2[cur_column(), "MEDIAN"], tolerance = 1)),
.names = "dummy{.col}"))
-output
DF1
T01 T02 T03 T04 T05 dummyT01 dummyT02 dummyT03 dummyT04 dummyT05
1 15 20 48 25 5 0 0 0 0 0
2 12 18 35 30 12 1 1 0 1 1
3 13 15 50 60 42 1 0 0 0 0
Or using a for loop
for(i in seq_along(DF1))
DF1[paste0('dummy', names(DF1)[i])] <- +(almost.equal(DF1[[i]],
DF2[names(DF1)[i], "MEDIAN"], tolerance = 1))
data
DF1 <- structure(list(T01 = c(15L, 12L, 13L), T02 = c(20L, 18L, 15L),
T03 = c(48L, 35L, 50L), T04 = c(25L, 30L, 60L), T05 = c(5L,
12L, 42L)), class = "data.frame", row.names = c("1", "2",
"3"))
DF2 <- structure(list(MEDIAN = c(13L, 18L, 45L, 30L, 12L), SD = c(1.24,
2.05, 6.64, 15.45, 16.04)), class = "data.frame", row.names = c("T01",
"T02", "T03", "T04", "T05"))

How to create a new table from original data and lookup table in R or matlab?

I have original temperature data in table1.txt with station number header which reads as
Date 101 102 103
1/1/2001 25 24 23
1/2/2001 23 20 15
1/3/2001 22 21 17
1/4/2001 21 27 18
1/5/2001 22 30 19
I have a lookup table file lookup.txt which reads as :
ID Station
1 101
2 103
3 102
4 101
5 102
Now, I want to create a new table (new.txt) with ID number header which should read as
Date 1 2 3 4 5
1/1/2001 25 23 24 25 24
1/2/2001 23 15 20 23 20
1/3/2001 22 17 21 22 21
1/4/2001 21 18 27 21 27
1/5/2001 22 19 30 22 30
Is there anyway I can do this in R or matlab??
I came up with a solution using tidyverse. It involves some wide to long transformation, matching the data frames on Station, and then spreading the variables.
#Recreating the data
library(tidyverse)
df1 <- read_table("text1.txt")
lookup <- read_table("lookup.txt")
#Create the output
k1 <- df1 %>%
gather(Station, value, -Date) %>%
mutate(Station = as.numeric(Station)) %>%
inner_join(lookup) %>% select(-Station) %>%
spread(ID, value)
k1
We can use base R to do this. Create a column index by matching the 'Station' column with the names of the first dataset, use that to duplicate the columns of 'df1' and then change the column names with the 'ID' column of second dataset
i1 <- with(df2, match(Station, names(df1)[-1]))
dfN <- df1[c(1, i1 + 1)]
names(dfN)[-1] <- df2$ID
dfN
# Date 1 2 3 4 5
#1 1/1/2001 25 23 24 25 24
#2 1/2/2001 23 15 20 23 20
#3 1/3/2001 22 17 21 22 21
#4 1/4/2001 21 18 27 21 27
#5 1/5/2001 22 19 30 22 30
data
df1 <- structure(list(Date = c("1/1/2001", "1/2/2001", "1/3/2001", "1/4/2001",
"1/5/2001"), `101` = c(25L, 23L, 22L, 21L, 22L), `102` = c(24L,
20L, 21L, 27L, 30L), `103` = c(23L, 15L, 17L, 18L, 19L)),
class = "data.frame", row.names = c(NA,
-5L))
df2 <- structure(list(ID = 1:5, Station = c(101L, 103L, 102L, 101L,
102L)), class = "data.frame", row.names = c(NA, -5L))
Here is an option with MatLab:
T = readtable('table1.txt','FileType','text','ReadVariableNames',1);
L = readtable('lookup.txt','FileType','text','ReadVariableNames',1);
old_header = strcat('x',num2str(L.Station));
newT = array2table(zeros(height(T),height(L)+1),...
'VariableNames',[{'Date'} strcat('x',num2cell(num2str(L.ID)).')]);
newT.Date = T.Date;
for k = 1:size(old_header,1)
newT{:,k+1} = T.(old_header(k,:));
end
writetable(newT,'new.txt','Delimiter',' ')

Matching dataframes with data.table

I need to fill a matrix (MA) with information from a long data frame (DF) using another matrix as identifier (ID.MA).
An idea of my three matrices:
MA.ID creates an identifier to look in the big DF the needed variables:
a b c
a ID.aa ID.ab ID.ac
b ID.ba ID.bb ID.bc
c ID.ca ID.cb ID.cc
The original big data frame has useless information but has also the rows that are useful for me to fill the target MA matrix:
ID 1990 1991 1992
ID.aa 10 11 12
ID.ab 13 14 15
ID.ac 16 17 18
ID.ba 19 20 21
ID.bb 22 23 24
ID.bc 25 26 27
ID.ca 28 29 30
ID.cb 31 32 33
ID.cc 34 35 36
ID.xx 40 40 55
ID.xy 50 51 45
....
MA should be filled with cross-information. In my example it should look like that for a chosen column of DF (let's say, 1990):
a b c
a 10 13 16
b 19 22 25
c 28 31 34
I've tried to use match but honestly it didn't work out:
MA$a = DF[match(MA.ID$a, DF$ID),2]
I was recommended to use the data.table package but I couldn't see how that would help me.
Anyone has any good way to approach this problem?
Supposing that your input are dataframes, then you could do the following:
library(data.table)
setDT(ma)[, lapply(.SD, function(x) x = unlist(df[match(x,df$ID), "1990"]))
, .SDcols = colnames(ma)]
which returns:
a b c
1: 10 13 16
2: 19 22 25
3: 28 31 34
Explanation:
With setDT(ma) you transform the dataframe into a datatable (which is an enhanced dataframe).
With .SDcols=colnames(ma) you specify on which columns the transformation has to be applied.
lapply(.SD, function(x) x = unlist(df[match(x,df$ID),"1990"])) performs the matching operation on each column specified with .SDcols.
An alternative approach with data.table is first transforming ma to a long data.table:
ma2 <- melt(setDT(ma), measure.vars = c("a","b","c"))
setkey(ma2, value) # set key by which 'ma' has to be indexed
setDT(df, key="ID") # transform to a datatable & set key by which 'df' has to be indexed
# joining the values of the 1990 column of df into
# the right place in the value column of 'ma'
ma2[df, value := `1990`]
which gives:
> ma2
variable value
1: a 10
2: b 13
3: c 16
4: a 19
5: b 22
6: c 25
7: a 28
8: b 31
9: c 34
The only drawback of this method is that the numeric values in the 'value' column get stored as character values. You can correct this by extending it as follows:
ma2[df, value := `1990`][, value := as.numeric(value)]
If you want to change it back to wide format you could use the rowid function within dcast:
ma3 <- dcast(ma2, rowid(variable) ~ variable, value.var = "value")[, variable := NULL]
which gives:
> ma3
a b c
1: 10 13 16
2: 19 22 25
3: 28 31 34
Used data:
ma <- structure(list(a = structure(1:3, .Label = c("ID.aa", "ID.ba", "ID.ca"), class = "factor"),
b = structure(1:3, .Label = c("ID.ab", "ID.bb", "ID.cb"), class = "factor"),
c = structure(1:3, .Label = c("ID.ac", "ID.bc", "ID.cc"), class = "factor")),
.Names = c("a", "b", "c"), class = "data.frame", row.names = c(NA, -3L))
df <- structure(list(ID = structure(1:9, .Label = c("ID.aa", "ID.ab", "ID.ac", "ID.ba", "ID.bb", "ID.bc", "ID.ca", "ID.cb", "ID.cc"), class = "factor"),
`1990` = c(10L, 13L, 16L, 19L, 22L, 25L, 28L, 31L, 34L),
`1991` = c(11L, 14L, 17L, 20L, 23L, 26L, 29L, 32L, 35L),
`1992` = c(12L, 15L, 18L, 21L, 24L, 27L, 30L, 33L, 36L)),
.Names = c("ID", "1990", "1991", "1992"), class = "data.frame", row.names = c(NA, -9L))
In base R, it can be seen as a job for outer:
> outer(1:nrow(MA.ID), 1:ncol(MA.ID), Vectorize(function(x,y) {DF[which(DF$ID==MA.ID[x,y]),'1990']}))
[,1] [,2] [,3]
[1,] 10 13 16
[2,] 19 22 25
[3,] 28 31 34
Explanations:
outer creates a matrix as the outer product of the first argument X (here a b c) and the second argument Y (here the same, a b c)
for every combination of X and Y, it applies a function that looks up the value in DF, at the row where the ID is MA.ID[X,Y], and at the column 1990
the important trick here is to wrap the function with Vectorize, because outer expects a vectorized function
the result is finally returned as matrix
Alternatively, another way to do it (still in base R) is:
to convert your data frame MA.ID into a vector
sapply a quick function that looks up correspondance with DF$ID
and convert back to a matrix.
This works:
> structure(
sapply(unlist(MA.ID),
function(id){DF[which(DF$ID==id),'1990']}),
dim=dim(MA.ID), names=NULL)
[,1] [,2] [,3]
[1,] 10 13 16
[2,] 19 22 25
[3,] 28 31 34
(here the call to structure(..., dim=dim(MA.ID), names=NULL) converts back the vector to a matrix)

Sort data by factor and output into a matrix (or df) R

I have looked through other posts and I think I have an idea of what I could do, but I want to be clear!
I have a very large data frame that contains 4 variables and a number of rows.
Chain ResId ResNum Energy
1 C O17 500 -37.03670
2 A ARG 8 -0.84560
3 A LEU 24 -0.56739
4 A ASP 25 -0.98583
5 B ARG 8 -0.64880
6 B LEU 24 -0.58380
7 B ASP 25 -0.85930
Each row contains CHAIN (A, B, or C), ResID, ResNum, and Energy. I would like to sort this data so that all of the energy values belonging to a specific Resid and num in each chain are clustered together. By cluster I mean all of the values for "ARG 8" are grouped or all of the rows containing "ARG 8" are grouped. I don't know which is more efficient. Ideally, I would like the output for all residues to be
ARG 8
0.000
0.000
0.000
where the "0.000" are the energy values for ARG 8 or O17 and so on.
Sorry for the header breaks, I wanted the data to be clean, but I can't insert images.
data
structure(list(Chain = structure(c(3L, 1L, 1L, 1L, 2L, 2L, 2L
), .Label = c("A", "B", "C"), class = "factor"), ResId = structure(c(4L,
1L, 3L, 2L, 1L, 3L, 2L), .Label = c("ARG", "ASP", "LEU", "O17"
), class = "factor"), ResNum = c(500L, 8L, 24L, 25L, 8L, 24L,
25L), Energy = c(-37.0367, -0.8456, -0.56739, -0.98583, -0.6488,
-0.5838, -0.8593)), .Names = c("Chain", "ResId", "ResNum", "Energy"
), class = "data.frame", row.names = c(NA, -7L))
If you want to convert to wide format
library(reshape2)
dcast(df, ResId+ResNum~paste0('Energy.',Chain), value.var='Energy')
# ResId ResNum Energy.A Energy.B Energy.C
#1 ARG 8 -0.84560 -0.6488 NA
#2 ASP 25 -0.98583 -0.8593 NA
#3 LEU 24 -0.56739 -0.5838 NA
#4 O17 500 NA NA -37.0367
After your edit, the output you are most likely looking for is:
library(reshape2)
dcast(df, ResId~Chain, value.var= 'Energy')
ResId A B C
1 ARG -0.84560 -0.6488 NA
2 ASP -0.98583 -0.8593 NA
3 LEU -0.56739 -0.5838 NA
4 O17 NA NA -37.0367
This will put the values together. You can further specify based on your desired output.
df[order(df$ResId), ]
Chain ResId ResNum Energy
2 A ARG 8 -0.84560
5 B ARG 8 -0.64880
4 A ASP 25 -0.98583
7 B ASP 25 -0.85930
3 A LEU 24 -0.56739
6 B LEU 24 -0.58380
1 C O17 500 -37.03670
#With dplyr
library(dplyr)
df %>%
arrange(ResId)
Chain ResId ResNum Energy
1 A ARG 8 -0.84560
2 B ARG 8 -0.64880
3 A ASP 25 -0.98583
4 B ASP 25 -0.85930
5 A LEU 24 -0.56739
6 B LEU 24 -0.58380
7 C O17 500 -37.03670
Data
df <- read.table(text = '
Chain ResId ResNum Energy
C O17 500 -37.0367
A ARG 8 -0.8456
A LEU 24 -0.56739
A ASP 25 -0.98583
B ARG 8 -0.6488
B LEU 24 -0.5838
B ASP 25 -0.8593', header=T)
Try this:
df <- df[order(df$Chain, df$ResId, df$ResNum),]
where df is the name of your dataframe. This should order it for you.

Two columns' results are coming in the same column in R

I have a data frame with "Sol.grp" (non-numeric) and "age" (numeric) columns. I'm trying to store mean of age and count of observations in two separate columns.
I used the following code:
> summary <- data.frame(aggregate(age~sol.grp, data=na.omit(all.tkts), FUN=function(x) c(mean= mean(x), count=length(x))))
Mean & Count are coming in the same column (shown below)
I do not know what's wrong. Any ideas? Thanks in advance for your help !
Edit: The example dataset is shown at the end
row sol.grp Mean
1 Account A 187.7154
2 Account B 215.7747
3 WMID 199.0201
4 Qty 254.5545
5 PM 210.7109
6 CS 165.6500
7 ED 158.5483
8 TM 271.1966
9 39.0000
10 131.0000
11 189.0000
12 149.0000
13 3533.0000
14 2.0000
15 338.0000
16 58.0000
Example data: (Top 20 rows)
sol.grp age
Account A 29.6
Account B 29.6
WMID 26.9
Qty 1.7
PM 3.0
CS 2043.8
ED 24.3
TM 24.3
Account A 24.3
Account B 133.3
WMID 27.0
Qty 2.1
PM 29.2
CS 29.4
ED 97.8
TM 192.9
Account A 651.6
Account B 148.6
WMID 125.2
Qty 31.1
You could try this using data.table
library(data.table)
res1 <- setDT(all.tkts)[, list(Mean=mean(age, na.rm=TRUE), Count=.N),
keyby=sol.grp]
The aggregate results did not show any anomaly using the below example
res2 <- do.call(data.frame,aggregate(age~sol.grp,
data=na.omit(all.tkts), FUN=function(x) c(mean= mean(x), count=length(x))))
res2
# sol.grp age.mean age.count
#1 Account A 235.16667 3
#2 Account B 103.83333 3
#3 CS 1036.60000 2
#4 ED 61.05000 2
#5 PM 16.10000 2
#6 Qty 11.63333 3
#7 TM 108.60000 2
#8 WMID 59.70000 3
data
all.tkts <- structure(list(sol.grp = structure(c(1L, 2L, 8L, 6L, 5L, 3L,
4L, 7L, 1L, 2L, 8L, 6L, 5L, 3L, 4L, 7L, 1L, 2L, 8L, 6L), .Label = c("Account A",
"Account B", "CS", "ED", "PM", "Qty", "TM", "WMID"), class = "factor"),
age = c(29.6, 29.6, 26.9, 1.7, 3, 2043.8, 24.3, 24.3, 24.3,
133.3, 27, 2.1, 29.2, 29.4, 97.8, 192.9, 651.6, 148.6, 125.2,
31.1)), .Names = c("sol.grp", "age"), class = "data.frame", row.names = c(NA,
-20L))
Following from your own code works well:
aggregate(age~sol.grp, data=na.omit(all.tkts), FUN=function(x) c(mean= mean(x), count=length(x)))
sol.grp age.mean age.count
1 Account A 235.16667 3.00000
2 Account B 103.83333 3.00000
3 CS 1036.60000 2.00000
4 ED 61.05000 2.00000
5 PM 16.10000 2.00000
6 Qty 11.63333 3.00000
7 TM 108.60000 2.00000
8 WMID 59.70000 3.00000
Just avoid putting data.frame around aggregate, since aggregate returns a data.frame.
EDIT:
The details of output are:
> dd = aggregate(age~sol.grp, data=na.omit(all.tkts), FUN=function(x) c(mean= mean(x), count=length(x)))
> str(dd)
'data.frame': 8 obs. of 2 variables:
$ sol.grp: Factor w/ 8 levels "Account A","Account B",..: 1 2 3 4 5 6 7 8
$ age : num [1:8, 1:2] 235.2 103.8 1036.6 61 16.1 ...
..- attr(*, "dimnames")=List of 2
.. ..$ : NULL
.. ..$ : chr "mean" "count"
>
> dd$sol.grp
[1] Account A Account B CS ED PM Qty TM WMID
Levels: Account A Account B CS ED PM Qty TM WMID
> dd$age
mean count
[1,] 235.16667 3
[2,] 103.83333 3
[3,] 1036.60000 2
[4,] 61.05000 2
[5,] 16.10000 2
[6,] 11.63333 3
[7,] 108.60000 2
[8,] 59.70000 3
>
> dd$age[,2]
[1] 3 3 2 2 2 3 2 3
>
> dd$age[,1]
[1] 235.16667 103.83333 1036.60000 61.05000 16.10000 11.63333 108.60000 59.70000

Resources