I have a data frame:
x <- data.frame(id = letters[1:3], val0 = 1:3, val1 = 4:6, val2 = 7:9)
# id val0 val1 val2
# 1 a 1 4 7
# 2 b 2 5 8
# 3 c 3 6 9
Within each row, I want to calculate the corresponding proportions (ratio) for each value. E.g. for the value in column "val0", I want to calculate row-wise val0 / (val0 + val1 + val2).
Desired output:
id val0 val1 val2
1 a 0.083 0.33 0.583
2 b 0.133 0.33 0.533
3 c 0.167 0.33 0.5
Can anyone tell me what's the best way to do this? Here it's just three columns, but there can be a lot of columns.
The following should do the trick:
cbind(id = x[, 1], x[, -1]/rowSums(x[, -1]))
## id val0 val1 val2
## 1 a 0.08333333 0.3333333 0.5833333
## 2 b 0.13333333 0.3333333 0.5333333
## 3 c 0.16666667 0.3333333 0.5000000
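The division above works because a vector of length nrow(x) is recycled down each column of the data frame, so every value is divided by its own row total. A minimal sketch of the same idea that selects the value columns by type instead of by position (the assumption here is that every numeric column is a value column; num_cols and props are names made up for this example):

```r
# Same example data as in the question
x <- data.frame(id = letters[1:3], val0 = 1:3, val1 = 4:6, val2 = 7:9)

# Assumption: every numeric column is a value column to be normalized
num_cols <- vapply(x, is.numeric, logical(1))

# Divide each value column by the row totals; the totals are recycled
# down every column, so row i is always divided by total i
props <- x
props[num_cols] <- x[num_cols] / rowSums(x[num_cols])

# Sanity check: each row of proportions should sum to 1
rowSums(props[num_cols])
```

Selecting columns by type keeps the call unchanged when there are many value columns or when the id column is not the first one.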
And another alternative (though this is mostly a pretty version of sweep)... prop.table:
> cbind(x[1], prop.table(as.matrix(x[-1]), margin = 1))
id val0 val1 val2
1 a 0.08333333 0.3333333 0.5833333
2 b 0.13333333 0.3333333 0.5333333
3 c 0.16666667 0.3333333 0.5000000
From the "description" section of the help file at ?prop.table:
This is really sweep(x, margin, margin.table(x, margin), "/") for newbies, except that if margin has length zero, then one gets x/sum(x).
So, you can see that underneath, this is really quite similar to @Jilber's solution.
And... it's nice for the R developers to be considerate of us newbies, isn't it? :)
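The equivalence quoted from the help file can be checked directly. This is only a sketch on the question's example data; m, via_prop and via_sweep are names introduced for the illustration:

```r
x <- data.frame(id = letters[1:3], val0 = 1:3, val1 = 4:6, val2 = 7:9)

# Work on the numeric part only, as a matrix
m <- as.matrix(x[-1])

# Row proportions via prop.table() and via an explicit sweep()
via_prop  <- prop.table(m, margin = 1)
via_sweep <- sweep(m, 1, rowSums(m), "/")

# Compare the values of the two results
all.equal(via_prop, via_sweep, check.attributes = FALSE)
```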
Another alternative using sweep
sweep(x[,-1], 1, rowSums(x[,-1]), FUN="/")
val0 val1 val2
1 0.08333333 0.3333333 0.5833333
2 0.13333333 0.3333333 0.5333333
3 0.16666667 0.3333333 0.5000000
The function adorn_percentages() from the janitor package does this:
library(janitor)
x %>% adorn_percentages()
id val0 val1 val2
a 0.08333333 0.3333333 0.5833333
b 0.13333333 0.3333333 0.5333333
c 0.16666667 0.3333333 0.5000000
This is equivalent to x %>% adorn_percentages(denominator = "row"), though "row" is the default argument so is not needed in this case. An equivalent call is adorn_percentages(x) if you prefer it without the %>% pipe.
Disclaimer: I created the janitor package, but feel it's appropriate to post this; the function was built to perform exactly this task while making code clearer to read, and the package can be installed from CRAN.
Trying to normalize all rows in the data frame such that
A B C A B C
1 2 4 => 1 .3 .6
2 2 5 2 .3 .7
3 4 6 3 .4 .6
This returns a warning that it's coercing to an integer
outdf <- df[, names(df) := (.SD / rowSums(.SD)), .SDcols=x,by=y]
This does nothing
outdf <- df[, names(df) := as.numeric(x)][,x:=(.SD / rowSums(.SD)), .SDcols=x,by=y][]
These are both close. Is there a better way to change types or a better way to normalize?
(data is ~42GB coming into this line so data.table is the way to go)
EDIT:
x and y
x <- names(data)[14:ncol(data)]
y <- names(data)[1]
I think you might be overthinking it. This seems to do what is desired:
library(data.table)
X <- data.table(A=c(1,2,2), B=c(2,2,4))
X[ , .SD/rowSums(.SD)]
# .SDcols can be used to make this selective
A B
1: 0.3333333 0.6666667
2: 0.5000000 0.5000000
3: 0.3333333 0.6666667
I didn't encounter any problems with assigning the result back to X to accomplish the expected replacement.
The following demonstrates that using the .SDcols and by parameters does not change this. (Row-oriented operations would not be expected to be affected by the by parameter anyway.)
X <- data.table(ID =letters[1:3], A=c(1,2,2), B=c(2,2,4))
X <- rbind(X,X) # so there are multiple items in the groups
X <- X[ , .SD/rowSums(.SD), .SDcols=c("A", "B"), by="ID"]
# The only visible effect of `by="ID"` is that rows of the same group end up together
> X
ID A B
1: a 0.3333333 0.6666667
2: a 0.3333333 0.6666667
3: b 0.5000000 0.5000000
4: b 0.5000000 0.5000000
5: c 0.3333333 0.6666667
6: c 0.3333333 0.6666667
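If the columns should be replaced in place rather than returned as a new table, one possible sketch is below. It assumes the coercion warning in the question comes from assigning fractional results into integer columns, so the columns are converted to double first; DT and cols are hypothetical names for this example, not the asker's data.

```r
library(data.table)

# Hypothetical data with integer value columns
DT <- data.table(ID = letters[1:3], A = c(1L, 2L, 2L), B = c(2L, 2L, 4L))
cols <- c("A", "B")  # columns to normalize (an assumption for this sketch)

# Convert to double first so the proportions are not coerced back to integer
DT[, (cols) := lapply(.SD, as.numeric), .SDcols = cols]

# Replace each value by its share of the row total, by reference
DT[, (cols) := .SD / rowSums(.SD), .SDcols = cols]

DT
```

No by is needed here because the operation is already row-wise.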
I want to create multiple variables that are aggregating various subsets of a dataset. For an illustrating example, say you have the following data:
DT = data.table(Group1 = c(1,1,1,1,1,2,2,2,2,2,3,3,3,3,3,4,4,4,4,4),
Group2 = c(1,1,1,2,2,1,1,2,2,2,1,1,1,1,2,1,1,2,2,2),
Var1 = c(1,1,0,0,0,1,1,0,1,0,1,0,0,0,0,0,0,0,0,0))
I want to find several averages of variable Var1. I want to know:
mean(Var1) grouped by Group1
mean(Var1) for only those with Group2 == 1, grouped by Group1
mean(Var1) for only those with Group2 == 2, grouped by Group1
Or, in data.table parlance,
DT[, mean(Var1), by=Group1]
DT[Group2==1, mean(Var1), by=Group1]
DT[Group2==2, mean(Var1), by=Group1]
Obviously, calculating any one of these is very straightforward. But I can't find a good way to calculate all three of them, since they use different subsets in i. The solution I've been using so far is generating them individually, then merging them into a unified table.
DT_all <- DT[, .(avgVar1_all = mean(Var1)), by = Group1]
DT_1 <- DT[Group2 == 1, .(avgVar1_1 = mean(Var1)), by = Group1]
DT_2 <- DT[Group2 == 2, .(avgVar1_2 = mean(Var1)), by = Group1]
group_info <- merge(DT_all, DT_1, by = "Group1")
group_info <- merge(group_info, DT_2, by = "Group1")
group_info
# Group1 avgVar1_all avgVar1_1 avgVar1_2
# 1: 1 0.4 0.6666667 0.0000000
# 2: 2 0.6 1.0000000 0.3333333
# 3: 3 0.2 0.2500000 0.0000000
# 4: 4 0.0 0.0000000 0.0000000
Is there a more elegant method I could be using?
Just do it all in one grouping operation using .SD:
DT[, .(
all = mean(Var1),
grp1 = .SD[Group2==1, mean(Var1)],
grp2 = .SD[Group2==2, mean(Var1)]
),
by = Group1,
.SDcols=c("Group2","Var1")
]
# Group1 all grp1 grp2
#1: 1 0.4 0.6666667 0.0000000
#2: 2 0.6 1.0000000 0.3333333
#3: 3 0.2 0.2500000 0.0000000
#4: 4 0.0 0.0000000 0.0000000
You can use reshape2::dcast:
reshape2::dcast(DT, Group1 ~ Group2, fun=mean, margins="Group2")
Group1 1 2 (all)
1 1 0.6666667 0.0000000 0.4
2 2 1.0000000 0.3333333 0.6
3 3 0.2500000 0.0000000 0.2
4 4 0.0000000 0.0000000 0.0
@thelatmail noted in a comment below that this approach does not scale well. Eventually, margins should be available in data.table's dcast, which will probably be more efficient.
An ugly workaround:
DT[, c(
dcast(.SD, Group1 ~ Group2, fun=mean),
all = .(dcast(.SD, Group1 ~ ., fun=mean)$.)
)]
Group1 1 2 all
1: 1 0.6666667 0.0000000 0.4
2: 2 1.0000000 0.3333333 0.6
3: 3 0.2500000 0.0000000 0.2
4: 4 0.0000000 0.0000000 0.0
I am trying to divide each value in columns B and C by the sum of that column within each level of the factor in column A.
The starting matrix could look something like this, but has thousands of rows. A is a factor, and B and C contain the values:
A <- c(1,1,2,2)
B <- c(0.2, 0.3, 1, 0.5)
C <- c(0.7, 0.5, 0, 0.9)
M <- data.table(A,B,C)
> M
A B C
[1,] 1 0.2 0.7
[2,] 1 0.3 0.5
[3,] 2 1.0 0.0
[4,] 2 0.5 0.9
The factors can occur any number of times.
I was able to produce the sum per factor with library data.table:
library(data.table)
M.dt <- data.table(M)
M.sum <- M.dt[, lapply(.SD, sum), by = A]
> M.sum
A B C
1: 1 0.5 1.2
2: 2 1.5 0.9
but didn't know how to go on from here to keep the original format of the table.
The resulting table should look like this:
B.1 <- c(0.4, 0.6, 0.666, 0.333)
C.1 <- c(0.583, 0.416, 0, 1)
M.1 <- cbind(A, B.1, C.1)
> M.1
A B.1 C.1
[1,] 1 0.400 0.58333
[2,] 1 0.600 0.41666
[3,] 2 0.666 0.00000
[4,] 2 0.333 1.00000
The calculation for the first value in B.1 would go like this:
0.2/(0.2+0.3) = 0.4 and so on, where the values to add are given by the factor in A.
I have some basic knowledge of R, but despite trying hard, I do badly with matrix manipulations and loops.
Simply divide each value in each column by that column's sum within each group of A:
M[, lapply(.SD, function(x) x/sum(x)), A]
# A B C
# 1: 1 0.4000000 0.5833333
# 2: 1 0.6000000 0.4166667
# 3: 2 0.6666667 0.0000000
# 4: 2 0.3333333 1.0000000
If you want to update by reference do
M[, c("B", "C") := lapply(.SD, function(x) x/sum(x)), A]
Or more generally
M[, names(M)[-1] := lapply(.SD, function(x) x/sum(x)), A]
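A bonus solution for the dplyr junkies (mutate_each() and funs() are deprecated in current dplyr; mutate(across(everything(), ~ .x / sum(.x))) is the modern equivalent):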
library(dplyr)
M %>%
group_by(A) %>%
mutate_each(funs(./sum(.)))
# Source: local data table [4 x 3]
# Groups: A
#
# A B C
# 1 1 0.4000000 0.5833333
# 2 1 0.6000000 0.4166667
# 3 2 0.6666667 0.0000000
# 4 2 0.3333333 1.0000000
Like most problems of this type, you can use the data.table package, the plyr package, or some combination of split, apply, combine functions in base R.
For those who prefer the plyr package:
library(plyr)
M <- data.table(A,B,C)
ddply(M, .(A), colwise(function(x) x/sum(x)))
Output is:
A B C
1 1 0.4000000 0.5833333
2 1 0.6000000 0.4166667
3 2 0.6666667 0.0000000
4 2 0.3333333 1.0000000
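The base R route mentioned above can be sketched with ave(), which returns the group-wise sum aligned with the original rows, so the table keeps its original shape. This assumes a plain data frame instead of a data.table; value_cols is a name introduced for the example:

```r
A <- c(1, 1, 2, 2)
B <- c(0.2, 0.3, 1, 0.5)
C <- c(0.7, 0.5, 0, 0.9)
M <- data.frame(A, B, C)

# Every column except the grouping column A is treated as a value column
value_cols <- setdiff(names(M), "A")

# ave(x, g, FUN = sum) repeats each group's sum for every row of that group,
# so the division is element-wise and keeps the row order
M[value_cols] <- lapply(M[value_cols], function(x) x / ave(x, M$A, FUN = sum))

M
```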
Minimal example
I have: input<-data.frame(id=c(1,1,1,2,2,2),A=as.factor(c(1,1,2,2,1,3)),B=as.factor(c(0,1,1,1,0,0)))
I want: output<-data.frame(id=c(1,2), A1=c(2/3,1/3), A2=c(1/3,1/3), A3=c(0/3,1/3), B0=c(1/3,2/3), B1=c(2/3,1/3))
Motivation
I have a data frame with categorical data. I would like to turn this into a data frame with the proportional counts of each category. In the "output" data frame I would like to have a column for each variable-category combination (A1, A2, etc.). Each row gives the relative counts for one "id". For instance, id=1 has three entries in "input", two of which have category 1 under variable "A". The column "A1" should therefore show 2/3 in that row: the count is divided by three because id=1 has three entries in "input".
What I started
function(input){
#create empty dataframe
distcID<-duplicated(input$id)
output<-data.frame(id=integer(0),A1=integer(0),A2=integer(0),A3=integer(0),
B0=integer(0),B1=integer(0))
count<-0
for (i in input$id[distcID]){
df.cID<-input[input$customer_ID==i]
m<- NROW(df.cID)
count<-count+1
output$customer_ID[count]<-i
output$A1[count]<-1/m*NROW(df.cID$A==1)
output$A2[count]<-1/m*NROW(df.cID$A==2)
output$A3[count]<-1/m*NROW(df.cID$A==3)
output$B0[count]<-1/m*NROW(df.cID$B==0)
output$B1[count]<-1/m*NROW(df.cID$B==1)
}
return(output)
}
What is wrong?
- It is ugly. Given functions like apply and aggregate or a package like plyr, there should be nicer (i.e. shorter) solutions to this problem.
- R does not accept the initialization of output with empty columns.
- The column names of output are not created automatically, but by hand.
Thank you! Please tell me if my question lacks clarity. This is my first question here.
This expression creates a table for each of the non-ID columns (here, 2:3):
individuals <- lapply(2:3, function(i) {
# Table of counts, by "id"
x <- table(input[,c(1,i)])
# Scale to proportions
x <- x / rowSums(x)
# Fix the names
colnames(x) <- paste0(colnames(input)[i], colnames(x))
return(x)
}
)
individuals
## [[1]]
## A
## id A1 A2 A3
## 1 0.6666667 0.3333333 0.0000000
## 2 0.3333333 0.3333333 0.3333333
##
## [[2]]
## B
## id B0 B1
## 1 0.3333333 0.6666667
## 2 0.6666667 0.3333333
Now put them together with cbind:
do.call(cbind, individuals)
## A1 A2 A3 B0 B1
## 1 0.6666667 0.3333333 0.0000000 0.3333333 0.6666667
## 2 0.3333333 0.3333333 0.3333333 0.6666667 0.3333333
The id column is not present, but the row names can be used for this purpose.
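If a regular data frame with an explicit id column is wanted, the row names can be moved into a column. The following is a sketch of one way to do that, using prop.table() for the row scaling and treating every non-id column as a categorical variable (an assumption); vars, props and output are names chosen for the example:

```r
input <- data.frame(id = c(1, 1, 1, 2, 2, 2),
                    A = as.factor(c(1, 1, 2, 2, 1, 3)),
                    B = as.factor(c(0, 1, 1, 1, 0, 0)))

# Assumption: every column other than "id" is a categorical variable
vars <- setdiff(names(input), "id")

# One table of row proportions per variable, bound together column-wise
props <- do.call(cbind, lapply(vars, function(v) {
  p <- prop.table(table(input$id, input[[v]]), margin = 1)
  colnames(p) <- paste0(v, colnames(p))
  p
}))

# Move the ids from the row names into a proper column
output <- data.frame(id = as.numeric(rownames(props)), props, row.names = NULL)

output
```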
This isn't a complete answer, but should help you along the way (with a bit of reshape[2]-ing):
library(plyr)  # for count()
ct <- count(input, "id")
A <- data.frame(table(input[,c(1,2)])/ct[ct$id==1,]$freq)
B <- data.frame(table(input[,c(1,3)])/ct[ct$id==2,]$freq)
print(A)
id A Freq
1 1 1 0.6666667
2 2 1 0.3333333
3 1 2 0.3333333
4 2 2 0.3333333
5 1 3 0.0000000
6 2 3 0.3333333
print(B)
id B Freq
1 1 0 0.3333333
2 2 0 0.6666667
3 1 1 0.6666667
4 2 1 0.3333333
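Here's one possible solution (note that prop.table() without a margin divides by the grand total, so the values below are proportions of all 12 melted values rather than per-id proportions):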
library(reshape2)
library(qdap)
x <- prop.table(ftable(melt(input, id="id")))
x2 <- colpaste2df(data.frame(x), 2:3, keep.orig = FALSE, sep="", name.sep = "")
x3 <- dcast(x2, id ~ variablevalue, value.var = "Freq")
x3[, c(TRUE, colSums(x3[, -1]) != 0)]
## id A1 A2 A3 B0 B1
## 1 1 0.16666667 0.08333333 0.00000000 0.08333333 0.16666667
## 2 2 0.08333333 0.08333333 0.08333333 0.16666667 0.08333333
This can be seen as a pivot table (or two pivot tables):
install.packages('reshape')
library(reshape)
ct <- count(input, "id")
DF1 <- cast(input, id ~ A, value='B')
DF2 <- cast(input, id ~ B, value="A")
DF3 <- cbind(DF1$id, DF1[names(DF1)!='id']/ct[1,]$freq, DF2[names(DF2)!='id']/ct[2,]$freq)
names(DF3) <- c('id', paste('A', names(DF1)[-1], sep=''), paste('B', names(DF2)[-1], sep=''))
> DF3
id A1 A2 A3 B0 B1
1 1 0.6666667 0.3333333 0.0000000 0.3333333 0.6666667
2 2 0.3333333 0.3333333 0.3333333 0.6666667 0.3333333
This is what I think you wanted. Just add row or column names to suit your tastes.
tbB <- with(input, table(B, id))
tbA <- with(input, table(A, id))
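# Normalize within each id (the columns of tbA and tbB), then transpose so ids are rows
cbind( t(prop.table(tbA, 2)), t(prop.table(tbB, 2)) )
1 2 3 0 1
1 0.6666667 0.3333333 0.0000000 0.3333333 0.6666667
2 0.3333333 0.3333333 0.3333333 0.6666667 0.3333333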