I have a data frame in R with the following sample data:
choice Wealth
1 10
2 5
3 8
3 10
2 10
3 5
1 8
2 5
2 5
I am trying to create a summary table that shows the %choice by wealth level (i.e. of those with wealth x, how many chose 1 or 2 or 3):
Wealth %choice1 %choice2 %choice3
5 0% 75% 25%
8 50% 0% 50%
10 33% 33% 33%
I've tried a number of different methods but can't seem to get this correct (table, dplyr, etc.). As I am a novice in R, any ideas or help would be very much appreciated.
Here's a way with the tabyl function from the janitor package:
library(janitor); library(dplyr)
df %>%
tabyl(Wealth, choice) %>%
adorn_percentages("row")
# Wealth 1 2 3
# 5 0.0000000 0.7500000 0.2500000
# 8 0.5000000 0.0000000 0.5000000
# 10 0.3333333 0.3333333 0.3333333
You can also do this in base R by dividing the table by its rowSums:
tb=table(df$Wealth,df$choice)
tb/rowSums(tb)
1 2 3
5 0.0000000 0.7500000 0.2500000
8 0.5000000 0.0000000 0.5000000
10 0.3333333 0.3333333 0.3333333
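To get whole percentages as in the layout the question asks for, base R's prop.table() with margin = 1 performs the same row-wise division. A minimal sketch, assuming the data frame is called df as in the answers above (the question itself never names it):

```r
# Sample data from the question; the name `df` is an assumption
df <- data.frame(
  choice = c(1, 2, 3, 3, 2, 3, 1, 2, 2),
  Wealth = c(10, 5, 8, 10, 10, 5, 8, 5, 5)
)

# Counts of each choice within each wealth level
tb <- table(Wealth = df$Wealth, choice = df$choice)

# margin = 1 divides every row by its own total; scale to whole percentages
pct <- round(100 * prop.table(tb, margin = 1))
pct
```

Each row of pct then describes one wealth level, so the rows (not the columns) add up to roughly 100.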
I've got a model with 9 covariates. Below is an example of one of the tables used to count the "yes" (1) and "no" (0) responses in the dataset:
table(wbca1$y,wbca1$Adhes)
And the output appears as follows
How can I code this so that I get the sample proportions for each covariate so I have a new table with 10 columns each representing "yes"(1)?
Thank you in advance
Something like this:
set.seed(111)
x = sample(1:9,100,replace=TRUE)
y = sample(0:1,100,replace=TRUE)
prop.table(table(y,x),margin=2)
x
y 1 2 3 4 5 6 7
0 0.4444444 0.2857143 0.6923077 0.4666667 0.5000000 0.4615385 0.6666667
1 0.5555556 0.7142857 0.3076923 0.5333333 0.5000000 0.5384615 0.3333333
x
y 8 9
0 0.3636364 0.4615385
1 0.6363636 0.5384615
Or you can simply do:
tab = table(y,x)
tab[2,]/colSums(tab)
1 2 3 4 5 6 7 8
0.5555556 0.7142857 0.3076923 0.5333333 0.5000000 0.5384615 0.3333333 0.6363636
9
0.5384615
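The two base R approaches are the same calculation, since prop.table(tab, margin = 2) divides each column by its column total. A quick check on the same simulated data (selecting the "1" row by name instead of by position is just an illustrative choice):

```r
set.seed(111)
x <- sample(1:9, 100, replace = TRUE)
y <- sample(0:1, 100, replace = TRUE)

tab <- table(y, x)

# Proportion of "yes" (y == 1) within each level of x, computed two ways
by_prop_table <- prop.table(tab, margin = 2)["1", ]
by_col_sums   <- tab["1", ] / colSums(tab)

# Compare the values only, ignoring names and classes
all.equal(as.numeric(by_prop_table), as.numeric(by_col_sums))
```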
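Using tidyverse (group by x so the proportions are computed within each level of x rather than over the whole table):
library(dplyr)
tibble(x, y) %>%
count(x, y) %>%
group_by(x) %>%
mutate(prop = n/sum(n)) %>%
ungroup()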
I have a list of dataframes and I am trying to normalize the data in a few of the columns by dividing each row in a column by the sum. The issue is all of the sums are different.
I started by summing each column of every data frame using lapply, which resulted in a list of colSums. However, I'm not sure how to divide each data frame in the list by the corresponding sums.
Here is an example using some code I made up
df1=data.frame("total"=c(50,100,75),
"a"=c(15,50,30),
"b"=c(15,10,5),
"c"=c(20,40,40))
df2=data.frame("total"=c(100,200,400,100),
"a"=c(10,40,100,50),
"b"=c(50,100,200,30),
"c"=c(40,60,100,20))
df3=data.frame("total"=c(40,60,80),
"a"=c(15,30,50),
"b"=c(25,20,20),
"c"=c(0,10,10))
listex=list(df1=df1,df2=df2,df3=df3)
listtotal=lapply(listex,function(x) {x=colSums(x,na.rm=T)})
listex
$df1
total a b c
1 50 15 15 20
2 100 50 10 40
3 75 30 5 40
$df2
total a b c
1 100 10 50 40
2 200 40 100 60
3 400 100 200 100
4 100 50 30 20
$df3
total a b c
1 40 15 25 0
2 60 30 20 10
3 80 50 20 10
listtotal
$df1
total a b c
225 95 30 100
$df2
total a b c
800 200 380 220
$df3
total a b c
180 95 65 20
What I want to have happen is the following, but, you know, without having to write it all over again
df1n=data.frame("total"=c(50/225,100/225,75/225),"a"=c(15/95,50/95,30/95),
"b"=c(15/30,10/30,5/30),"c"=c(20/100,40/100,40/100))
df2n=data.frame("total"=c(100/800,200/800,400/800,100/800),
"a"=c(10/200,40/200,100/200,50/200),
"b"=c(50/380,100/380,200/380,30/380),
"c"=c(40/220,60/220,100/220,20/220))
df3n=data.frame('total'=c(40/180,60/180,80/180),
'a'=c(15/95,30/95,50/95),
'b'=c(25/65,20/65,20/65),
'c'=c(0/20,10/20,10/20))
listn=list(df1=df1n,df2=df2n,df3=df3n)
listn
$df1
total a b c
1 0.2222222 0.1578947 0.5000000 0.2
2 0.4444444 0.5263158 0.3333333 0.4
3 0.3333333 0.3157895 0.1666667 0.4
$df2
total a b c
1 0.125 0.05 0.13157895 0.18181818
2 0.250 0.20 0.26315789 0.27272727
3 0.500 0.50 0.52631579 0.45454545
4 0.125 0.25 0.07894737 0.09090909
$df3
total a b c
1 0.2222222 0.1578947 0.3846154 0.0
2 0.3333333 0.3157895 0.3076923 0.5
3 0.4444444 0.5263158 0.3076923 0.5
I'm thinking it has something to do with a nested apply function, but I'm not sure how exactly I would go about doing that. Any help is much appreciated!
Arithmetic works element-wise on vectors, matrices, and data frames of equal size. One option is to divide each data frame in the list by its colSums, replicated so that the dimensions match:
lapply(listex, function(x) x/colSums(x)[col(x)])
If the totals are already stored in a separate object such as 'listtotal', use Map to divide each data frame by its corresponding totals:
Map(function(x, y) x/y[col(x)], listex, listtotal)
You could use the sweep function
lapply(listex, function(x) sweep(x, 2, colSums(x), '/'))
Or convert the colSums to a list so that / can be used directly:
lapply(listex, function(x) x/as.list(colSums(x)))
Output for both methods:
# $`df1`
# total a b c
# 1 0.2222222 0.1578947 0.5000000 0.2
# 2 0.4444444 0.5263158 0.3333333 0.4
# 3 0.3333333 0.3157895 0.1666667 0.4
#
# $df2
# total a b c
# 1 0.125 0.05 0.13157895 0.18181818
# 2 0.250 0.20 0.26315789 0.27272727
# 3 0.500 0.50 0.52631579 0.45454545
# 4 0.125 0.25 0.07894737 0.09090909
#
# $df3
# total a b c
# 1 0.2222222 0.1578947 0.3846154 0.0
# 2 0.3333333 0.3157895 0.3076923 0.5
# 3 0.4444444 0.5263158 0.3076923 0.5
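The question mentions normalizing only a few of the columns. A sketch of that variant, assuming the columns of interest are a, b and c (the vector cols is a hypothetical name, not from the question) and using two of the example data frames:

```r
df1 <- data.frame(total = c(50, 100, 75), a = c(15, 50, 30),
                  b = c(15, 10, 5), c = c(20, 40, 40))
df3 <- data.frame(total = c(40, 60, 80), a = c(15, 30, 50),
                  b = c(25, 20, 20), c = c(0, 10, 10))
listex <- list(df1 = df1, df3 = df3)

cols <- c("a", "b", "c")  # hypothetical choice of columns to normalize

listn <- lapply(listex, function(x) {
  # Divide each selected column by its own sum; other columns are left alone
  x[cols] <- lapply(x[cols], function(v) v / sum(v, na.rm = TRUE))
  x
})

listn$df1
```

Here total keeps its original values, while a, b and c each sum to 1 within every data frame.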
I'm not sure what I'm missing here:
library(dplyr)
df1<-data.frame(n=c(1,1,1,2,1,1,2))
mutate(df1,foo=n/mean(c(n,lag(n)),na.rm=TRUE))
n foo
1 1 0.8125
2 1 0.8125
3 1 0.8125
4 2 1.6250
5 1 0.8125
6 1 0.8125
7 2 1.6250
What on earth is going on? The first row should be, basically, 1/mean(1), i.e., '1'. Why am I getting 0.8125? What's even stranger is in my original dataset, I'm getting yet another number - 0.608, for basically the same calculation. What am I missing?
Try summarise(df1, length(c(n,lag(n)))) — the length of the vector c(n,lag(n)) is the same as two times the number of rows and has mean 1.230769.
What I believe you want to do is:
mutate(df1,foo=n/rowMeans(cbind(n,lag(n)),na.rm=TRUE))
n foo
1 1 1.0000000
2 1 1.0000000
3 1 1.0000000
4 2 1.3333333
5 1 0.6666667
6 1 1.0000000
7 2 1.3333333
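The difference between the two calls is easier to see without dplyr. In this sketch, prev is a hand-rolled stand-in for lag(n) (an assumption made to keep the example in base R): c() stacks the two vectors end to end, while cbind() pairs them row by row.

```r
n <- c(1, 1, 1, 2, 1, 1, 2)
prev <- c(NA, head(n, -1))   # same values dplyr::lag(n) would give

# c() concatenates: one vector of length 14, hence a single overall mean
length(c(n, prev))
overall <- mean(c(n, prev), na.rm = TRUE)   # 16 / 13
n / overall                                 # 0.8125 where n == 1, 1.625 where n == 2

# cbind() keeps the pairs, so rowMeans() averages each value with its predecessor
pairwise <- rowMeans(cbind(n, prev), na.rm = TRUE)
n / pairwise
```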
I can perform a rolling division on vectors: I take my data, lag the same data, and perform the division:
# Dummy Data
sample <- c(4,5,6,7,8,4,2,6,5,4,3,2,1,2,3,4,5,6)
lagSam <- lag(sample) # Lag by 1
output <- sample / lagSam # Perform division
sample.df <- data.frame(sample, desired = output)
with the desired output:
sample desired
1 4 NA
2 5 1.2500000
3 6 1.2000000
4 7 1.1666667
5 8 1.1428571
6 4 0.5000000
7 2 0.5000000
8 6 3.0000000
9 5 0.8333333
10 4 0.8000000
11 3 0.7500000
12 2 0.6666667
13 1 0.5000000
14 2 2.0000000
15 3 1.5000000
16 4 1.3333333
17 5 1.2500000
18 6 1.2000000
My question is, how do I do the same on a data frame? I have over 100 columns and need to take a rolling division of each. I am trying to write a function and use roll apply:
# My attempt
division <- function(x) {
#tail(x,1) / head(x,1)
x / lag(x)
}
rollapplyr(sample.df$sample, 1, division, fill = NA)
I tried to lag it with head and tail and then again with x / lag of x.
Both results produce NA's.
I frequently find that I want the dplyr::lag behavior but am startled a little when stats::lag returns a tsp-like object.
You're close with tail/head functionality. Using negative n:
n: a single integer. If positive, size for the resulting object:
number of elements for a vector (including lists), rows for a
matrix or data frame or lines for a function. If negative,
all but the 'n' last/first number of elements of 'x'.
c(NA, tail(sample.df$sample,n=-1) / head(sample.df$sample,n=-1))
# [1] NA 1.2500000 1.2000000 1.1666667 1.1428571 0.5000000 0.5000000
# [8] 3.0000000 0.8333333 0.8000000 0.7500000 0.6666667 0.5000000 2.0000000
# [15] 1.5000000 1.3333333 1.2500000 1.2000000
Note that your thought to do a rolling apply is slightly problematic in that it will divide a number by the result of the previous division, not the previous value. That is, c(2,3,4) in a rolling divide should start with an initial value (say, 1, the division-identity), and go with something like c(2/1, 3/(2/1), 4/(3/(2/1))), not what (I think) you requested.
This drives the function to be:
division <- function(x) c(NA, tail(x,n=-1) / head(x,n=-1))
So then you can do
lapply(sample.df, division)
If you only want to run this on select columns, I'd do
ind <- 1
lapply(sample.df[ind], division)
# $sample
# [1] NA 1.2500000 1.2000000 1.1666667 1.1428571 0.5000000 0.5000000
# [8] 3.0000000 0.8333333 0.8000000 0.7500000 0.6666667 0.5000000 2.0000000
# [15] 1.5000000 1.3333333 1.2500000 1.2000000
cbind(sample.df, lapply(sample.df[ind], division))
# sample desired sample
# 1 4 1 NA
# 2 5 1 1.2500000
# 3 6 1 1.2000000
# 4 7 1 1.1666667
# 5 8 1 1.1428571
# 6 4 1 0.5000000
# 7 2 1 0.5000000
# 8 6 1 3.0000000
# 9 5 1 0.8333333
# 10 4 1 0.8000000
# 11 3 1 0.7500000
# 12 2 1 0.6666667
# 13 1 1 0.5000000
# 14 2 1 2.0000000
# 15 3 1 1.5000000
# 16 4 1 1.3333333
# 17 5 1 1.2500000
# 18 6 1 1.2000000
This of course generated a duplicate name, but it's a start.
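One way to avoid the duplicate name is to rename the new columns before binding. A sketch, where the _ratio suffix and the second column other are illustrative assumptions rather than anything from the question:

```r
division <- function(x) c(NA, tail(x, n = -1) / head(x, n = -1))

# Small stand-in data frame; `other` is made up to show the multi-column case
dat <- data.frame(sample = c(4, 5, 6, 7, 8), other = c(10, 20, 10, 5, 5))

# Apply the function to every column, then give the results distinct names
ratios <- as.data.frame(lapply(dat, division))
names(ratios) <- paste0(names(dat), "_ratio")   # sample_ratio, other_ratio

result <- cbind(dat, ratios)
result
```

The same pattern should carry over to a data frame with 100+ columns, since lapply visits each column in turn.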
BTW: rolling generally has to do with a cumulative process on a vector. What you are looking for is applying one function to each vector individually and capturing the response.
Here are a few ways:
1) diff
transform(sample.df, desired = c(NA, exp(diff(log(sample)))))
## sample desired
## 1 4 NA
## 2 5 1.2500000
## 3 6 1.2000000
## 4 7 1.1666667
## 5 8 1.1428571
## ... etc ...
To apply this to multiple columns using the built in data frame anscombe:
rbind(NA, exp(diff(log(as.matrix(anscombe)))))
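The diff approach relies on the identity log(a) - log(b) = log(a/b), so it only applies to strictly positive values. A quick check against direct division, using the question's sample values (stored here as x, an arbitrary name):

```r
x <- c(4, 5, 6, 7, 8, 4, 2, 6, 5, 4, 3, 2, 1, 2, 3, 4, 5, 6)

via_logs   <- c(NA, exp(diff(log(x))))        # exp(log(x[i]) - log(x[i - 1]))
via_divide <- c(NA, x[-1] / x[-length(x)])    # x[i] / x[i - 1]

# Equal up to floating-point tolerance
all.equal(via_logs, via_divide)
```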
2) diff.zoo
library(zoo)
z <- zoo(sample.df$sample)
merge(z, desired = diff(z, arith = FALSE), all = TRUE)
## z desired
## 1 4 NA
## 2 5 1.2500000
## 3 6 1.2000000
## 4 7 1.1666667
## 5 8 1.1428571
## ... etc ...
To apply it to all columns of anscombe:
z <- zoo(rbind(NA, anscombe))
diff(z, arith = FALSE)
3) dplyr
library(dplyr)
sample.df %>% mutate(desired = sample/lag(sample))
## sample desired
## 1 4 NA
## 2 5 1.2500000
## 3 6 1.2000000
## 4 7 1.1666667
## 5 8 1.1428571
## ... etc ...
To apply this to all columns of anscombe:
anscombe %>% mutate(across(everything(), ~ .x / lag(.x)))  # funs() is deprecated; across() needs dplyr >= 1.0
4) rollapplyr
library(zoo)
transform(sample.df, desired = rollapplyr(sample, 2, function(x) x[2]/x[1], fill = NA))
## sample desired
## 1 4 NA
## 2 5 1.2500000
## 3 6 1.2000000
## 4 7 1.1666667
## 5 8 1.1428571
To apply it to all columns of anscombe:
rollapplyr(anscombe, 2, function(x) x[2]/x[1], fill = NA)
5) lag.ts
transform(sample.df, desired = c(NA, lag(ts(sample)) / ts(sample)))
## sample desired
## 1 4 NA
## 2 5 1.2500000
## 3 6 1.2000000
## 4 7 1.1666667
## 5 8 1.1428571
To apply it to all columns of anscombe use the following. Note that dplyr should NOT be loaded since it annoyingly clobbers lag with its own lag. Alternately use stats::lag:
lag(ts(anscombe)) / ts(anscombe)
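Why this works: for a ts object, stats::lag() leaves the values unchanged and shifts the time index back by one, and arithmetic on two series aligns them on their common times. A small base R sketch, calling stats::lag explicitly so it behaves the same whether or not dplyr is attached:

```r
x <- c(4, 5, 6, 7, 8)
s <- ts(x)

shifted <- stats::lag(s)   # same values, time index moved from 1..5 to 0..4
time(shifted)

# On the overlapping times 1..4 this is x[t + 1] / x[t]
ratio <- shifted / s
as.numeric(ratio)

c(NA, as.numeric(ratio))   # pad with NA to match the original length
```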
I have data (prop_attack), a portion of which looks like this:
attack.type proportion class
4 0.8400000 high
5 0.9733333 high
6 0.9385151 high
7 0.9228659 high
8 0.6187500 high
9 0.9219331 high
1 0.8364853 mid
2 0.9896870 mid
3 0.9529760 mid
4 0.6666667 mid
5 0.9965636 mid
6 0.9687825 mid
attack.type is actually categorical; the categories have just been assigned the numbers 1-9. I want to create a table that rearranges the data like this:
weap.type high mid
1 corresponding proportion value corresponding proportion value
2 corresponding proportion value corresponding proportion value
3 corresponding proportion value corresponding proportion value
4 corresponding proportion value corresponding proportion value
5 corresponding proportion value corresponding proportion value
6 etc.
7
8
9
Any suggestions on how to do this?
This is a straightforward "reshape" problem. In base R, you could do:
reshape(prop_attack, direction = "wide", idvar="attack.type", timevar="class")
# attack.type proportion.high proportion.mid
# 1 4 0.8400000 0.6666667
# 2 5 0.9733333 0.9965636
# 3 6 0.9385151 0.9687825
# 4 7 0.9228659 NA
# 5 8 0.6187500 NA
# 6 9 0.9219331 NA
# 7 1 NA 0.8364853
# 8 2 NA 0.9896870
# 9 3 NA 0.9529760
Or even use xtabs:
xtabs(proportion ~ attack.type + class, prop_attack)
# class
# attack.type high mid
# 1 0.0000000 0.8364853
# 2 0.0000000 0.9896870
# 3 0.0000000 0.9529760
# 4 0.8400000 0.6666667
# 5 0.9733333 0.9965636
# 6 0.9385151 0.9687825
# 7 0.9228659 0.0000000
# 8 0.6187500 0.0000000
# 9 0.9219331 0.0000000
Using a package, many would suggest dcast from "reshape2" for its convenient syntax:
library(reshape2)
dcast(prop_attack, attack.type ~ class, value.var="proportion")
# attack.type high mid
# 1 1 NA 0.8364853
# 2 2 NA 0.9896870
# 3 3 NA 0.9529760
# 4 4 0.8400000 0.6666667
# 5 5 0.9733333 0.9965636
# 6 6 0.9385151 0.9687825
# 7 7 0.9228659 NA
# 8 8 0.6187500 NA
# 9 9 0.9219331 NA
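For a base R result that keeps NA for missing combinations (unlike xtabs, which fills them with 0), tapply is another option. This sketch rebuilds the portion of prop_attack shown in the question; using sum as the function is safe here only because each attack.type/class pair occurs at most once:

```r
# The rows shown in the question, typed in by hand
prop_attack <- data.frame(
  attack.type = c(4, 5, 6, 7, 8, 9, 1, 2, 3, 4, 5, 6),
  proportion  = c(0.8400000, 0.9733333, 0.9385151, 0.9228659, 0.6187500,
                  0.9219331, 0.8364853, 0.9896870, 0.9529760, 0.6666667,
                  0.9965636, 0.9687825),
  class       = rep(c("high", "mid"), each = 6)
)

# One row per attack.type, one column per class; empty cells stay NA
wide <- with(prop_attack, tapply(proportion, list(attack.type, class), sum))
wide
```

The result is a matrix with attack.type as row names; as.data.frame(wide) converts it if a data frame is needed.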