I am trying to divide each value in columns B and C by the sum within each level of the factor in column A.
The starting matrix could look something like this, but it has thousands of rows;
A is the factor, and B and C contain the values:
A <- c(1,1,2,2)
B <- c(0.2, 0.3, 1, 0.5)
C <- c(0.7, 0.5, 0, 0.9)
M <- cbind(A, B, C)
> M
     A   B   C
[1,] 1 0.2 0.7
[2,] 1 0.3 0.5
[3,] 2 1.0 0.0
[4,] 2 0.5 0.9
The factors can occur any number of times.
I was able to produce the sums per factor with the data.table package:
library(data.table)
M.dt <- data.table(M)
M.sum <- M.dt[, lapply(.SD, sum), by = A]
> M.sum
   A   B   C
1: 1 0.5 1.2
2: 2 1.5 0.9
but I didn't know how to go on from there while keeping the original format of the table.
The resulting table should look like this:
B.1 <- c(0.4, 0.6, 0.666, 0.333)
C.1 <- c(0.583, 0.416, 0, 1)
M.1 <- cbind(A, B.1, C.1)
> M.1
     A   B.1   C.1
[1,] 1 0.400 0.583
[2,] 1 0.600 0.416
[3,] 2 0.666 0.000
[4,] 2 0.333 1.000
The calculation for the first value in B.1 would go like this:
0.2/(0.2+0.3) = 0.4 and so on, where the values to add are given by the factor in A.
I have some basic knowledge of R, but despite trying hard, I do badly with matrix manipulations and loops.
Simply divide each value in each column by its sum within each value of A (with M converted to a data.table, e.g. M <- data.table(M)):
M[, lapply(.SD, function(x) x/sum(x)), A]
# A B C
# 1: 1 0.4000000 0.5833333
# 2: 1 0.6000000 0.4166667
# 3: 2 0.6666667 0.0000000
# 4: 2 0.3333333 1.0000000
If you want to update by reference, do
M[, c("B", "C") := lapply(.SD, function(x) x/sum(x)), A]
Or more generally
M[, names(M)[-1] := lapply(.SD, function(x) x/sum(x)), A]
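If you would rather keep B and C and add the scaled values as new columns named as in the question (B.1 and C.1), a small variation of the same idea (a sketch, not part of the original answer):
# add B.1 and C.1 next to the original columns instead of overwriting them
M[, c("B.1", "C.1") := lapply(.SD, function(x) x/sum(x)), by = A, .SDcols = c("B", "C")]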
A bonus solution for the dplyr junkies
library(dplyr)
M %>%
group_by(A) %>%
mutate_each(funs(./sum(.)))
# Source: local data table [4 x 3]
# Groups: A
#
# A B C
# 1 1 0.4000000 0.5833333
# 2 1 0.6000000 0.4166667
# 3 2 0.6666667 0.0000000
# 4 2 0.3333333 1.0000000
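Note that mutate_each() and funs() have since been deprecated in dplyr; a rough modern equivalent using across() (a sketch that gives the same values, again with M as a data.table or data frame) would be
library(dplyr)
M %>%
  group_by(A) %>%
  mutate(across(everything(), ~ .x / sum(.x))) %>%  # the grouping column A is excluded automatically
  ungroup()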
Like most problems of this type, you can use the data.table or plyr packages, or some combination of split, apply, and combine functions in base R.
For those who prefer the plyr package
library(plyr)
M <- data.table(A,B,C)
ddply(M, .(A), colwise(function(x) x/sum(x)))
Output is:
A B C
1 1 0.4000000 0.5833333
2 1 0.6000000 0.4166667
3 2 0.6666667 0.0000000
4 2 0.3333333 1.0000000
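For the base R split/apply/combine route mentioned above, a minimal sketch using ave() (my own addition, working on a plain data.frame built from the same vectors):
M.base <- data.frame(A, B, C)
# divide each column by its within-group sum, the groups being given by A
M.base[c("B", "C")] <- lapply(M.base[c("B", "C")],
                              function(x) ave(x, M.base$A, FUN = function(v) v / sum(v)))
M.base
#   A         B         C
# 1 1 0.4000000 0.5833333
# 2 1 0.6000000 0.4166667
# 3 2 0.6666667 0.0000000
# 4 2 0.3333333 1.0000000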
Related
Trying to normalize all rows in the data frame such that
A B C       A  B  C
1 2 4  =>   1 .3 .6
2 2 5       2 .3 .7
3 4 6       3 .4 .6
This returns a warning that it's coercing to an integer
outdf <- df[, names(df) := (.SD / rowSums(.SD)), .SDcols=x,by=y]
This does nothing
outdf <- df[, names(df) := as.numeric(x)][,x:=(.SD / rowSums(.SD)), .SDcols=x,by=y][]
These are both close. Is there a better way to change the column types, or a better way to normalize?
(The data is ~42 GB coming into this step, so data.table is the way to go.)
EDIT: x and y are defined as
x <- names(data)[14:ncol(data)]
y <- names(data)[1]
I think you might be overthinking it. This seems to do what is desired:
library(data.table)
X <- data.table(A=c(1,2,2), B=c(2,2,4))
X[ , .SD/rowSums(.SD)]
# .SDcols can be used to make this selective
A B
1: 0.3333333 0.6666667
2: 0.5000000 0.5000000
3: 0.3333333 0.6666667
I didn't encounter any problems with assigning to X to accomplish the expected replacement.
Demonstrating that using the .SDcols and by parameters does not affect this. (And noting that row-oriented operations would not be expected to be affected by the by parameter anyway.)
X <- data.table(ID =letters[1:3], A=c(1,2,2), B=c(2,2,4))
X <- rbind(X,X) # so there are multiple items in the groups
X <- X[ , .SD/rowSums(.SD), .SDcols=c("A", "B"), by="ID"]
# The only effect of the `by="ID"` seems to be an alphabetical sort
> X
ID A B
1: a 0.3333333 0.6666667
2: a 0.3333333 0.6666667
3: b 0.5000000 0.5000000
4: b 0.5000000 0.5000000
5: c 0.3333333 0.6666667
6: c 0.3333333 0.6666667
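If the coercion warning comes from assigning double-valued results into integer columns by reference, one fix (a sketch using the x and y names defined in the question's edit) is to convert those columns to double first and then assign:
# convert the target columns to double so := does not coerce the results back to integer
df[, (x) := lapply(.SD, as.numeric), .SDcols = x]
# then normalize each row across those columns (by = y is not needed for a purely row-wise operation)
df[, (x) := .SD / rowSums(.SD), .SDcols = x]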
I have a data frame:
x <- data.frame(id = letters[1:3], val0 = 1:3, val1 = 4:6, val2 = 7:9)
# id val0 val1 val2
# 1 a 1 4 7
# 2 b 2 5 8
# 3 c 3 6 9
Within each row, I want to calculate the corresponding proportions (ratio) for each value. E.g. for the value in column "val0", I want to calculate row-wise val0 / (val0 + val1 + val2).
Desired output:
id val0 val1 val2
1 a 0.083 0.33 0.583
2 b 0.133 0.33 0.533
3 c 0.167 0.33 0.5
Can anyone tell me the best way to do this? Here it's just three columns, but there can be a lot of columns.
The following should do the trick:
cbind(id = x[, 1], x[, -1]/rowSums(x[, -1]))
## id val0 val1 val2
## 1 a 0.08333333 0.3333333 0.5833333
## 2 b 0.13333333 0.3333333 0.5333333
## 3 c 0.16666667 0.3333333 0.5000000
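If the value columns have to be picked out by name among many other columns, a small variation of the same idea (a sketch; the "^val" pattern is just for this example):
vals <- grep("^val", names(x), value = TRUE)   # names of the value columns
cbind(x["id"], x[vals] / rowSums(x[vals]))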
And another alternative (though this is mostly a pretty version of sweep)... prop.table:
> cbind(x[1], prop.table(as.matrix(x[-1]), margin = 1))
id val0 val1 val2
1 a 0.08333333 0.3333333 0.5833333
2 b 0.13333333 0.3333333 0.5333333
3 c 0.16666667 0.3333333 0.5000000
From the "description" section of the help file at ?prop.table:
This is really sweep(x, margin, margin.table(x, margin), "/") for newbies, except that if margin has length zero, then one gets x/sum(x).
So, you can see that underneath, this is really quite similar to @Jilber's solution.
And... it's nice for the R developers to be considerate of us newbies, isn't it? :)
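To make that quoted equivalence concrete, a quick check (my own sketch, not part of the original answer):
m <- as.matrix(x[-1])
all.equal(prop.table(m, margin = 1),
          sweep(m, 1, margin.table(m, 1), "/"))
# [1] TRUE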
Another alternative using sweep
sweep(x[,-1], 1, rowSums(x[,-1]), FUN="/")
val0 val1 val2
1 0.08333333 0.3333333 0.5833333
2 0.13333333 0.3333333 0.5333333
3 0.16666667 0.3333333 0.5000000
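Another route, if you are already using dplyr (a sketch, not among the original answers; it assumes id is the only non-value column):
library(dplyr)
x %>%
  mutate(total = rowSums(across(-id))) %>%        # row totals over all non-id columns
  mutate(across(-c(id, total), ~ .x / total)) %>%
  select(-total)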
The function adorn_percentages() from the janitor package does this:
library(janitor)
x %>% adorn_percentages()
id val0 val1 val2
a 0.08333333 0.3333333 0.5833333
b 0.13333333 0.3333333 0.5333333
c 0.16666667 0.3333333 0.5000000
This is equivalent to x %>% adorn_percentages(denominator = "row"), though "row" is the default argument so is not needed in this case. An equivalent call is adorn_percentages(x) if you prefer it without the %>% pipe.
Disclaimer: I created the janitor package, but feel it's appropriate to post this; the function was built to perform exactly this task while making code clearer to read, and the package can be installed from CRAN.
I want to create multiple variables that aggregate various subsets of a dataset. As an illustrative example, say you have the following data:
DT = data.table(Group1 = c(1,1,1,1,1,2,2,2,2,2,3,3,3,3,3,4,4,4,4,4),
Group2 = c(1,1,1,2,2,1,1,2,2,2,1,1,1,1,2,1,1,2,2,2),
Var1 = c(1,1,0,0,0,1,1,0,1,0,1,0,0,0,0,0,0,0,0,0))
I want to find several averages of variable Var1. I want to know:
mean(Var1) grouped by Group1
mean(Var1) for only those with Group2 == 1, grouped by Group1
mean(Var1) for only those with Group2 == 2, grouped by Group1
Or, in data.table parlance,
DT[, mean(Var1), by=Group1]
DT[Group2==1, mean(Var1), by=Group1]
DT[Group2==2, mean(Var1), by=Group1]
Obviously, calculating any one of these is very straightforward. But I can't find a good way to calculate all three of them, since they use different subsets in i. The solution I've been using so far is generating them individually, then merging them into a unified table.
DT_all <- DT[, .(avgVar1_all = mean(Var1)), by = Group1]
DT_1 <- DT[Group2 == 1, .(avgVar1_1 = mean(Var1)), by = Group1]
DT_2 <- DT[Group2 == 2, .(avgVar1_2 = mean(Var1)), by = Group1]
group_info <- merge(DT_all, DT_1, by = "Group1")
group_info <- merge(group_info, DT_2, by = "Group1")
group_info
# Group1 avgVar1_all avgVar1_1 avgVar1_2
# 1: 1 0.4 0.6666667 0.0000000
# 2: 2 0.6 1.0000000 0.3333333
# 3: 3 0.2 0.2500000 0.0000000
# 4: 4 0.0 0.0000000 0.0000000
Is there a more elegant method I could be using?
Just do it all in one grouping operation using .SD:
DT[, .(
all = mean(Var1),
grp1 = .SD[Group2==1, mean(Var1)],
grp2 = .SD[Group2==2, mean(Var1)]
),
by = Group1,
.SDcols=c("Group2","Var1")
]
# Group1 all grp1 grp2
#1: 1 0.4 0.6666667 0.0000000
#2: 2 0.6 1.0000000 0.3333333
#3: 3 0.2 0.2500000 0.0000000
#4: 4 0.0 0.0000000 0.0000000
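An equivalent sketch that avoids .SD by subsetting Var1 directly inside j (same result, just a different style):
DT[, .(all  = mean(Var1),
       grp1 = mean(Var1[Group2 == 1]),
       grp2 = mean(Var1[Group2 == 2])),
   by = Group1]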
You can use reshape2::dcast:
reshape2::dcast(DT, Group1 ~ Group2, fun=mean, margins="Group2")
Group1 1 2 (all)
1 1 0.6666667 0.0000000 0.4
2 2 1.0000000 0.3333333 0.6
3 3 0.2500000 0.0000000 0.2
4 4 0.0000000 0.0000000 0.0
@thelatmail noted in a comment that this approach does not scale well. Eventually, margins should be available in data.table's dcast, which will probably be more efficient.
An ugly workaround:
DT[, c(
dcast(.SD, Group1 ~ Group2, fun=mean),
all = .(dcast(.SD, Group1 ~ ., fun=mean)$.)
)]
Group1 1 2 all
1: 1 0.6666667 0.0000000 0.4
2: 2 1.0000000 0.3333333 0.6
3: 3 0.2500000 0.0000000 0.2
4: 4 0.0000000 0.0000000 0.0
I have a data frame like the one below, but with a lot more rows
> df<-data.frame(x1=c(1,1,0,0,1,0),x2=c("a","a","b","a","c","c"))
> df
x1 x2
1 1 a
2 1 a
3 0 b
4 0 a
5 1 c
6 0 c
From df I want a data frame where the rows are the unique values of df$x2, the first column is the proportion of 1s associated with each letter, and the second column is the count of each letter. So my output would be:
> getprops(df)
prop count
a .6666 3
b 0 1
c 0.5 2
I can think of some elaborate, dirty ways to do this, but I'm looking for something short and efficient. Thanks
I like @RicardoSaporta's solution (+1), but you can use ?prop.table as well:
> df<-data.frame(x1=c(1,1,0,0,1,0),x2=c("a","a","b","a","c","c"))
> df
x1 x2
1 1 a
2 1 a
3 0 b
4 0 a
5 1 c
6 0 c
> tab <- table(df$x2, df$x1)
> tab
0 1
a 1 2
b 1 0
c 1 1
> ptab <- prop.table(tab, margin=1)
> ptab
0 1
a 0.3333333 0.6666667
b 1.0000000 0.0000000
c 0.5000000 0.5000000
> dframe <- data.frame(values=rownames(tab), prop=ptab[,2], count=tab[,2])
> dframe
values prop count
a a 0.6666667 2
b b 0.0000000 0
c c 0.5000000 1
If you'd like, you can put this together into a single function:
getprops <- function(values, indicator){
tab <- table(values, indicator)
ptab <- prop.table(tab, margin=1)
dframe <- data.frame(values=rownames(tab), prop=ptab[,2], count=tab[,2])
return(dframe)
}
> getprops(values=df$x2, indicator=df$x1)
values prop count
a a 0.6666667 2
b b 0.0000000 0
c c 0.5000000 1
Try installing plyr and running
library(plyr)
df <- data.frame(x1=c(1, 1, 0, 0, 1, 0),
label=c("a", "a", "b", "a", "c", "c"))
ddply(df, .(label), summarize, prop = mean(x1), count = length(x1))
# label prop count
# 1 a 0.6666667 3
# 2 b 0.0000000 1
# 3 c 0.5000000 2
which under the hood applies a split/apply/combine method similar to this in base R:
do.call(rbind, lapply(split(df, df$x2),
with, list(prop = mean(x1),
count = length(x1))))
Here is a one-liner in data.table:
> DT[, list(props=sum(x1) / .N, count=.N), by=x2]
x2 props count
1: a 0.6666667 3
2: b 0.0000000 1
3: c 0.5000000 2
where DT <- data.table(df)
I am not sure if this does what you want.
df<-data.frame(x1=c(1,1,0,0,1,0),x2=c("a","a","b","a","c","c"))
ones <- aggregate(x1 ~ x2, data = df, FUN = sum)
count <- table(df$x2)
prop <- ones$x1 / count
df2 <- data.frame(prop, count)
df2
rownames(df2) <- df2[,3]
df2 <- df2[,c(2,4)]
colnames(df2) <- c('prop', 'count')
df2
prop count
a 0.6666667 3
b 0.0000000 1
c 0.5000000 2
Try using table
tbl <- table(df$x1, df$x2)
# a b c
# 0 1 1 1
# 1 2 0 1
tbl["1",] / colSums(tbl)
# a b c
# 0.6666667 0.0000000 0.5000000
For nice output use:
data.frame(proportions=tbl["1",] / colSums(tbl))
proportions
a 0.6666667
b 0.0000000
c 0.5000000
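For completeness, a dplyr sketch (my own addition, not among the original answers) that gives both the proportion and the count in one call:
library(dplyr)
df %>%
  group_by(x2) %>%
  summarise(prop = mean(x1), count = n())   # a: 0.667/3, b: 0/1, c: 0.5/2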