I have a dataset like this one:
test <-
data.frame(
variable = c("A","A","B","B","C","D","E","E","E","F","F","G"),
confidence = c(1,0.6,0.1,0.15,1,0.3,0.4,0.5,0.2,1,0.4,0.9),
freq = c(2,2,2,2,1,1,3,3,3,2,2,1),
weight = c(2,2,0,0,1,3,5,5,5,0,0,4)
)
> test
variable confidence freq weight
1 A 1.00 2 2
2 A 0.60 2 2
3 B 0.10 2 0
4 B 0.15 2 0
5 C 1.00 1 1
6 D 0.30 1 3
7 E 0.40 3 5
8 E 0.50 3 5
9 E 0.20 3 5
10 F 1.00 2 0
11 F 0.40 2 0
12 G 0.90 1 4
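I want to calculate, for each variable, the sum of the weight multiplied by the confidence, like this:
$SWC_i = \sum_{j=1}^{n_i} w_i \, c_{ij}$, where i is the variable (A, B, C…), w_i is its weight, and c_{ij} is the confidence of its j-th observation
Expanding the formula above row by row: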
w[1]c[1]+w[1]c[2]=2*1+2*0.6=3.2
w[2]c[1]+w[2]c[2]
w[3]c[3]+w[3]c[4]
w[4]c[3]+w[4]c[4]
w[5]c[5]
w[6]c[6]
w[7]c[7]+w[7]c[8]+w[7]c[9]
w[8]c[7]+w[8]c[8]+w[8]c[9]
w[9]c[7]+w[9]c[8]+w[9]c[9]
…
The result should look like this:
> test
variable confidence freq weight SWC
1 A 1.00 2 2 3.2
2 A 0.60 2 2 3.2
3 B 0.10 2 0 0.0
4 B 0.15 2 0 0.0
5 C 1.00 1 1 1.0
6 D 0.30 1 3 0.9
7 E 0.40 3 5 5.5
8 E 0.50 3 5 5.5
9 E 0.20 3 5 5.5
10 F 1.00 2 0 0.0
11 F 0.40 2 0 0.0
12 G 0.90 1 4 3.6
Note that the confidence value differs between observations, but every observation of a given variable has the same weight, so the sum I need is the same for all observations of that variable.
First, I tried to write a loop that iterates over each variable the number of times given by:
> table(test$variable)
A B C D E F G
2 2 1 1 3 2 1
but I couldn't make it work. So then I calculated the position where each variable starts, so that the for loop would iterate only over these values:
> tpos = cumsum(table(test$variable))
> tpos = tpos+1
> tpos
A B C D E F G
3 5 6 7 10 12 13
> tpos = shift(tpos, 1)
> tpos
[1] NA 3 5 6 7 10 12
> tpos[1]=1
> tpos
[1] 1 3 5 6 7 10 12
# tpos is a vector with the positions where each variable (A, B, C...) starts
> tposn = c(1:nrow(test))[-tpos]
> tposn
[1] 2 4 8 9 11
> c(1:nrow(test))[-tposn]
[1] 1 3 5 6 7 10 12
# then I came up with this loop, but it doesn't give the correct result
for(i in 1:nrow(test)[-tposn]){
a = test$freq[i]-1
test$SWC[i:i+a] = sum(test$weight[i]*test$confidence[i:i+a])
}
Maybe there is an easier way to do this? tapply?
By using dplyr:
library(dplyr)
test %>%
group_by(variable) %>%
mutate(SWC=sum(confidence*weight))
# A tibble: 12 x 5
# Groups: variable [7]
variable confidence freq weight SWC
<fctr> <dbl> <dbl> <dbl> <dbl>
1 A 1.00 2 2 3.2
2 A 0.60 2 2 3.2
3 B 0.10 2 0 0.0
4 B 0.15 2 0 0.0
5 C 1.00 1 1 1.0
6 D 0.30 1 3 0.9
7 E 0.40 3 5 5.5
8 E 0.50 3 5 5.5
9 E 0.20 3 5 5.5
10 F 1.00 2 0 0.0
11 F 0.40 2 0 0.0
12 G 0.90 1 4 3.6
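For anyone who prefers to stay in base R, here is a minimal sketch of the same grouped sum using ave(), which returns one value per row rather than one per group. It assumes the test data frame from the question (the freq column is left out because it is not needed); nothing else is introduced.

```r
# Data from the question (freq omitted, it is not used in the calculation)
test <- data.frame(
  variable   = c("A","A","B","B","C","D","E","E","E","F","F","G"),
  confidence = c(1,0.6,0.1,0.15,1,0.3,0.4,0.5,0.2,1,0.4,0.9),
  weight     = c(2,2,0,0,1,3,5,5,5,0,0,4)
)

# ave() applies FUN within each level of 'variable' and recycles the
# result back to every row of that group
test$SWC <- ave(test$confidence * test$weight, test$variable, FUN = sum)
test
```

Because ave() keeps the original row order, the new column lines up with the rows without any merge or join.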
I have a data frame with 3 columns, each containing a small number of values:
> df
# A tibble: 364 x 3
A B C
<dbl> <dbl> <dbl>
0. 1. 0.100
0. 1. 0.200
0. 1. 0.300
0. 1. 0.500
0. 2. 0.100
0. 2. 0.200
0. 2. 0.300
0. 2. 0.600
0. 3. 0.100
0. 3. 0.200
# ... with 354 more rows
> apply(df, 2, table)
$`A`
0 1 2 3 4 5 6 7 8 9 10
34 37 31 32 27 39 29 28 37 39 31
$B
1 2 3 4 5 6 7 8 9 10 11
38 28 38 37 32 34 29 33 30 35 30
$C
0.1 0.2 0.3 0.4 0.5 0.6
62 65 65 56 60 56
I would like to create a fourth column, which will contain for each row the product of the frequencies of each value within each column. So, for example, the first value of the column "Freq" would be the product of the frequency of 0 within column A, the frequency of 1 within column B and the frequency of 0.1 within column C.
How can I do this efficiently with dplyr/base R?
To emphasize, this is not the combined frequency of each whole row, but the product of the single-column frequencies.
An efficient approach using a combination of lapply, Map & Reduce from base R:
l <- lapply(df, table)
m <- Map(function(x,y) unname(y[match(x, names(y))]), df, l)
df$D <- Reduce(`*`, m)
which gives:
> head(df, 15)
A B C D
1 3 5 0.4 57344
2 5 6 0.5 79560
3 0 4 0.1 77996
4 2 6 0.1 65348
5 5 11 0.6 65520
6 3 8 0.5 63360
7 6 6 0.2 64090
8 1 9 0.4 62160
9 10 2 0.2 56420
10 5 2 0.2 70980
11 4 11 0.3 52650
12 7 6 0.5 57120
13 10 1 0.2 76570
14 7 10 0.5 58800
15 8 10 0.3 84175
What this does:
lapply(df, table) creates a list of frequency tables, one for each column.
With Map and match, a list is created in which each list item has the same length as the number of rows of df. Each list item is a vector of frequencies corresponding to the values in df.
With Reduce, the product of the vectors in the list m is calculated element-wise: the first values of the vectors in m are multiplied with each other, then the second values, etc.
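To make the three steps easier to follow, here is a small worked sketch on a hypothetical three-row data frame (the name small and its values are made up for illustration, not taken from the question). It uses as.vector() instead of unname() only so that the intermediate results print as plain integers.

```r
# Hypothetical toy data: A has 0 twice and 1 once; B has 1 once and 2 twice
small <- data.frame(A = c(0, 0, 1), B = c(1, 2, 2))

# Step 1: one frequency table per column
freq_tables <- lapply(small, table)

# Step 2: look up each value's frequency, giving one vector per column
per_row <- Map(function(x, tab) as.vector(tab[match(x, names(tab))]),
               small, freq_tables)
# per_row$A is 2 2 1 and per_row$B is 1 2 2

# Step 3: multiply the vectors element-wise
small$D <- Reduce(`*`, per_row)
small$D  # 2 4 2
```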
The same approach in tidyverse:
library(dplyr)
library(purrr)
df %>%
mutate(D = map(df, table) %>%
map2(df, ., function(x,y) unname(y[match(x, names(y))])) %>%
reduce(`*`))
Used data:
set.seed(2018)
df <- data.frame(A = sample(rep(0:10, c(34,37,31,32,27,39,29,28,37,39,31)), 364),
B = sample(rep(1:11, c(38,28,38,37,32,34,29,33,30,35,30)), 364),
C = sample(rep(seq(0.1,0.6,0.1), c(62,65,65,56,60,56)), 364))
I will use the following small example:
df
A B C
1 3 5 0.4
2 5 6 0.5
3 0 4 0.1
4 2 6 0.1
5 5 11 0.6
6 3 8 0.5
7 6 6 0.2
8 1 9 0.4
9 10 2 0.2
10 5 2 0.2
sapply(df,table)
$A
0 1 2 3 5 6 10
1 1 1 2 3 1 1
$B
2 4 5 6 8 9 11
2 1 1 3 1 1 1
$C
0.1 0.2 0.4 0.5 0.6
2 3 2 2 1
library(tidyverse)
df%>%
group_by(A)%>%
mutate(An=n())%>%
group_by(B)%>%
mutate(Bn=n())%>%
group_by(C)%>%
mutate(Cn=n(),prod=An*Bn*Cn)
A B C An Bn Cn prod
<int> <int> <dbl> <int> <int> <int> <int>
1 3 5 0.400 2 1 2 4
2 5 6 0.500 3 3 2 18
3 0 4 0.100 1 1 2 2
4 2 6 0.100 1 3 2 6
5 5 11 0.600 3 1 1 3
6 3 8 0.500 2 1 2 4
7 6 6 0.200 1 3 3 9
8 1 9 0.400 1 1 2 2
9 10 2 0.200 1 2 3 6
10 5 2 0.200 3 2 3 18
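The same per-column counts can also be sketched in base R with ave(). The helper name count_of below is hypothetical (introduced only for this illustration); the data are the ten rows of the small example above.

```r
# The ten-row example from above
df <- data.frame(A = c(3, 5, 0, 2, 5, 3, 6, 1, 10, 5),
                 B = c(5, 6, 4, 6, 11, 8, 6, 9, 2, 2),
                 C = c(0.4, 0.5, 0.1, 0.1, 0.6, 0.5, 0.2, 0.4, 0.2, 0.2))

# Hypothetical helper: for each element of x, how often its value occurs in x
count_of <- function(x) ave(seq_along(x), x, FUN = length)

# Product of the three single-column frequencies, row by row
df$prod <- count_of(df$A) * count_of(df$B) * count_of(df$C)
df
```

This avoids the intermediate An, Bn and Cn columns, at the cost of grouping on floating-point values in C, which is fine here because the values are exact repeats of the same literals.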
This is my data set.
You can get the data from this link (if you can't, please let me know):
https://www.dropbox.com/s/1n9hpyhcniaghh5/table.csv?dl=0
LABEL DATE TAU TYPE x y z
1 A 1 2 1 0.75 7 16
2 A 1 2 0 0.41 5 18
3 A 1 2 1 0.39 6 14
4 A 2 3 0 0.65 5 14
5 A 2 3 1 0.55 7 19
6 A 2 3 1 0.69 5 19
7 A 2 3 0 0.66 7 19
8 A 3 1 0 0.38 8 15
9 A 3 1 0 0.02 5 16
10 A 3 1 0 0.71 8 13
11 B 1 2 1 0.25 9 18
12 B 1 2 0 0.06 8 20
13 B 1 2 1 0.60 8 20
14 B 1 2 0 0.56 6 13
15 B 1 3 1 0.50 8 19
16 B 1 3 0 0.04 8 16
17 B 2 1 1 0.04 5 15
18 B 2 1 1 0.75 5 13
19 B 2 1 0 0.44 8 18
20 B 2 1 1 0.52 9 13
I want to filter the data by group with multiple conditions. The conditions are:
the number of rows for each type (0, 1) of the TYPE variable within a group must be greater than 1
the number of rows for each type must be equal
(For example: within each group, the number of rows of type 1 is equal to the number of rows of type 0)
I tried many times, and finally ended up with this code and this output:
table %>% group_by(label,date,tau,type) %>% filter(n()>1) %>% filter(length(type==1)==length(type==0))
# A tibble: 16 x 7
# Groups: label, date, tau, type [7]
LABEL DATE TAU TYPE x y z
<fctr> <int> <int> <int> <dbl> <int> <int>
1 A 1 2 1 0.75 7 16
2 A 1 2 1 0.39 6 14
3 A 2 3 0 0.65 5 14
4 A 2 3 1 0.55 7 19
5 A 2 3 1 0.69 5 19
6 A 2 3 0 0.66 7 19
7 A 3 1 0 0.38 8 15
8 A 3 1 0 0.02 5 16
9 A 3 1 0 0.71 8 13
10 B 1 2 1 0.25 9 18
11 B 1 2 0 0.06 8 20
12 B 1 2 1 0.60 8 20
13 B 1 2 0 0.56 6 13
14 B 2 1 1 0.04 5 15
15 B 2 1 1 0.75 5 13
16 B 2 1 1 0.52 9 13
I am confused by the output I get with this code. The rows that don't meet condition 1 are gone, but the rows that don't meet condition 2 are still there.
The result I want looks like this:
LABEL DATE TAU TYPE x y z
<fctr> <int> <int> <int> <dbl> <int> <int>
3 A 2 3 0 0.65 5 14
4 A 2 3 1 0.55 7 19
5 A 2 3 1 0.69 5 19
6 A 2 3 0 0.66 7 19
10 B 1 2 1 0.25 9 18
11 B 1 2 0 0.06 8 20
12 B 1 2 1 0.60 8 20
13 B 1 2 0 0.56 6 13
And if I want to compute a value for each row with the function below, how can I code it? Should I just use mutate()?
f(x,y,z) = 2 * x + y - z / 3 if TYPE == 1
f(x,y,z) = 4 * x - y / 2 + z / 3 if TYPE == 0
I hope someone can help me, and I appreciate your help! If you need any other information, just let me know.
# example dataset
df = read.table(text = "
LABEL DATE TAU TYPE x y z
1 A 1 2 1 0.75 7 16
2 A 1 2 0 0.41 5 18
3 A 1 2 1 0.39 6 14
4 A 2 3 0 0.65 5 14
5 A 2 3 1 0.55 7 19
6 A 2 3 1 0.69 5 19
7 A 2 3 0 0.66 7 19
8 A 3 1 0 0.38 8 15
9 A 3 1 0 0.02 5 16
10 A 3 1 0 0.71 8 13
11 B 1 2 1 0.25 9 18
12 B 1 2 0 0.06 8 20
13 B 1 2 1 0.60 8 20
14 B 1 2 0 0.56 6 13
15 B 1 3 1 0.50 8 19
16 B 1 3 0 0.04 8 16
17 B 2 1 1 0.04 5 15
18 B 2 1 1 0.75 5 13
19 B 2 1 0 0.44 8 18
20 B 2 1 1 0.52 9 13
", header=T, stringsAsFactors=F)
library(dplyr)
library(tidyr)
# function to use for each row
# (assumes that type can be only 1 or 0)
f = function(t,x,y,z) { ifelse(t == 1,
2 * x + y - z / 3,
4 * x - y / 2 + z / 3) }
df %>%
count(LABEL, DATE, TAU, TYPE) %>% # count rows for each group (based on those combinations)
filter(n > 1) %>% # keep groups with multiple rows
mutate(TYPE = paste0("TYPE_",TYPE)) %>% # update variable
spread(TYPE, n, fill = 0) %>% # reshape data
filter(TYPE_0 == TYPE_1) %>% # keep groups with equal number of rows for type 0 and 1
select(LABEL, DATE, TAU) %>% # keep variables/groups of interest
inner_join(df, by=c("LABEL", "DATE", "TAU")) %>% # join back info
mutate(f_value = f(TYPE,x,y,z)) # apply function
# # A tibble: 8 x 8
# LABEL DATE TAU TYPE x y z f_value
# <chr> <int> <int> <int> <dbl> <int> <int> <dbl>
# 1 A 2 3 0 0.65 5 14 4.76666667
# 2 A 2 3 1 0.55 7 19 1.76666667
# 3 A 2 3 1 0.69 5 19 0.04666667
# 4 A 2 3 0 0.66 7 19 5.47333333
# 5 B 1 2 1 0.25 9 18 3.50000000
# 6 B 1 2 0 0.06 8 20 2.90666667
# 7 B 1 2 1 0.60 8 20 2.53333333
# 8 B 1 2 0 0.56 6 13 3.57333333
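A note on why the attempt in the question kept too many rows: length(type == 1) is the length of a logical vector, which is simply the group size no matter how many elements are TRUE, so length(type == 1) == length(type == 0) always holds. Counting the TRUE values needs sum(). In addition, grouping by type puts each type in its own group, so the two types can never be compared there. As a rough base R sketch of the two conditions (assuming TYPE only takes the values 0 and 1, and using only the grouping columns of the example data):

```r
# Grouping columns and TYPE from the example dataset
df <- data.frame(
  LABEL = rep(c("A", "B"), each = 10),
  DATE  = c(1,1,1,2,2,2,2,3,3,3, 1,1,1,1,1,1,2,2,2,2),
  TAU   = c(2,2,2,3,3,3,3,1,1,1, 2,2,2,2,3,3,1,1,1,1),
  TYPE  = c(1,0,1,0,1,1,0,0,0,0, 1,0,1,0,1,0,1,1,0,1)
)

# Per-row counts of type 0 and type 1 within each LABEL/DATE/TAU group;
# sum() over a 0/1 indicator counts the matching rows
n0 <- ave(as.integer(df$TYPE == 0), df$LABEL, df$DATE, df$TAU, FUN = sum)
n1 <- ave(as.integer(df$TYPE == 1), df$LABEL, df$DATE, df$TAU, FUN = sum)

# Condition 1: more than one row of each type; condition 2: equal counts
keep <- n0 > 1 & n0 == n1
df[keep, ]
```

This keeps rows 4 to 7 and 11 to 14 of the original data, the same eight rows as the desired output.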
I have a data frame ‘true_set’ that I would like to sort based on the order of values in each row of the matrix ‘orders’.
true_set <- data.frame(dose1=c(rep(1,5),rep(2,5),rep(3,5)), dose2=c(rep(1:5,3)),toxicity=c(0.05,0.1,0.15,0.3,0.45,0.1,0.15,0.3,0.45,0.55,0.15,0.3,0.45,0.55,0.6),efficacy=c(0.2,0.3,0.4,0.5,0.6,0.4,0.5,0.6,0.7,0.8,0.5,0.6,0.7,0.8,0.9),d=c(1:15))
orders<-matrix(nrow=3,ncol=15)
orders[1,]<-c(1,2,6,3,7,11,4,8,12,5,9,13,10,14,15)
orders[2,]<-c(1,6,2,3,7,11,12,8,4,5,9,13,14,10,15)
orders[3,]<-c(1,6,2,11,7,3,12,8,4,13,9,5,14,10,15)
The expected result would be:
First, for orders[1,]:
dose1 dose2 toxicity efficacy d
1 1 1 0.05 0.2 1
2 1 2 0.10 0.3 2
3 2 1 0.10 0.4 6
4 1 3 0.15 0.4 3
5 2 2 0.15 0.5 7
6 3 1 0.15 0.5 11
7 1 4 0.30 0.5 4
8 2 3 0.30 0.6 8
9 3 2 0.30 0.6 12
10 1 5 0.45 0.6 5
11 2 4 0.45 0.7 9
12 3 3 0.45 0.7 13
13 2 5 0.55 0.8 10
14 3 4 0.55 0.8 14
15 3 5 0.60 0.9 15
Then for orders[2,]: as above, using the second row
Then for orders[3,]: as above, using the third row
true_set <- data.frame(dose1=c(rep(1,5),rep(2,5),rep(3,5)), dose2=c(rep(1:5,3)),toxicity=c(0.05,0.1,0.15,0.3,0.45,0.1,0.15,0.3,0.45,0.55,0.15,0.3,0.45,0.55,0.6),efficacy=c(0.2,0.3,0.4,0.5,0.6,0.4,0.5,0.6,0.7,0.8,0.5,0.6,0.7,0.8,0.9),d=c(1:15))
orders<-matrix(nrow=3,ncol=15)
orders[1,]<-c(1,2,6,3,7,11,4,8,12,5,9,13,10,14,15)
orders[2,]<-c(1,6,2,3,7,11,12,8,4,5,9,13,14,10,15)
orders[3,]<-c(1,6,2,11,7,3,12,8,4,13,9,5,14,10,15)
# Specify your order set in the row dimension
First_order <- true_set[orders[1,],]
Second_order <- true_set[orders[2,],]
Third_order <- true_set[orders[3,],]
# If you want to store all orders in a list, you can try the command below:
First_orders <- list(First_Order=true_set[orders[1,],],Second_Order=true_set[orders[2,],],Third_Order=true_set[orders[3,],])
First_orders[[1]] # OR First_orders$First_Order
First_orders[[2]] # OR First_orders$Second_Order
First_orders[[3]] # OR First_orders$Third_Order
# If you want to combine the orders column wise, try the command below:
First_orders <- cbind(First_Order=true_set[orders[1,],],Second_Order=true_set[orders[2,],],Third_Order=true_set[orders[3,],])
# If you want to combine the orders row wise, try the command below:
First_orders <- rbind(First_Order=true_set[orders[1,],],Second_Order=true_set[orders[2,],],Third_Order=true_set[orders[3,],])
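If the number of orderings may grow, the list can also be built without typing each row by hand. This is only a sketch: the name sorted is hypothetical, and true_set is cut down to three columns to keep the example short.

```r
# Reduced version of true_set (toxicity and efficacy omitted for brevity)
true_set <- data.frame(dose1 = rep(1:3, each = 5),
                       dose2 = rep(1:5, 3),
                       d     = 1:15)

orders <- rbind(c(1,2,6,3,7,11,4,8,12,5,9,13,10,14,15),
                c(1,6,2,3,7,11,12,8,4,5,9,13,14,10,15),
                c(1,6,2,11,7,3,12,8,4,13,9,5,14,10,15))

# One reordered copy of true_set per row of 'orders'
sorted <- lapply(seq_len(nrow(orders)),
                 function(i) true_set[orders[i, ], ])

sorted[[1]]$d  # same sequence as orders[1, ]
```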
Consider this data:
m = data.frame(pop=c(1,1,1,1,2,2,2,2,2,3,3,3,3,3,4,4),
id=c(0,1,1,1,1,1,0,2,1,1,1,2,1,2,2,2))
> m
pop id
1 1 0
2 1 1
3 1 1
4 1 1
5 2 1
6 2 1
7 2 0
8 2 2
9 2 1
10 3 1
11 3 1
12 3 2
13 3 1
14 3 2
15 4 2
16 4 2
I would like to get the frequency of each unique id within each unique pop. For example, the id 1 is present 3 times out of 4 when pop == 1, therefore the frequency of id 1 in pop 1 is 0.75.
I came up with this ugly solution:
out = matrix(0,ncol=3)
for (p in unique(m$pop))
{
for (i in unique(m$id))
{
m1 = m[m$pop == p,]
f = nrow(m1[m1$id == i,])/nrow(m1)
out = rbind(out, c(p, f, i))
}
}
out = out[-1,]
colnames(out) = c("pop", "freq", "id")
# SOLUTION
> out
pop freq id
[1,] 1 0.25 0
[2,] 1 0.75 1
[3,] 1 0.00 2
[4,] 2 0.20 0
[5,] 2 0.60 1
[6,] 2 0.20 2
[7,] 3 0.00 0
[8,] 3 0.60 1
[9,] 3 0.40 2
[10,] 4 0.00 0
[11,] 4 0.00 1
[12,] 4 1.00 2
I am sure there exists a more efficient solution using data.table or table but couldn't find it.
Here's what I might do:
as.data.frame(prop.table(table(m),1))
# pop id Freq
# 1 1 0 0.25
# 2 2 0 0.20
# 3 3 0 0.00
# 4 4 0 0.00
# 5 1 1 0.75
# 6 2 1 0.60
# 7 3 1 0.60
# 8 4 1 0.00
# 9 1 2 0.00
# 10 2 2 0.20
# 11 3 2 0.40
# 12 4 2 1.00
If you want it sorted by pop, you can do that afterwards. Alternately, you could transpose the table with t before converting to data.frame; or use rev(m) and prop.table on dimension 2.
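For the sorting mentioned above, one possible sketch is to order the result by pop and then id. The name freq is hypothetical; the data are those from the question.

```r
m <- data.frame(pop = c(1,1,1,1,2,2,2,2,2,3,3,3,3,3,4,4),
                id  = c(0,1,1,1,1,1,0,2,1,1,1,2,1,2,2,2))

# Row-wise proportions of the pop x id table, as a long data frame
freq <- as.data.frame(prop.table(table(m), 1))

# pop and id are factors here; order() sorts them by their level order
freq <- freq[order(freq$pop, freq$id), ]
freq
```

Unlike the grouped summaries, this keeps the zero-frequency combinations, which matches the output of the loop in the question.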
Try:
library(dplyr)
m %>%
group_by(pop, id) %>%
summarise(s = n()) %>%
mutate(freq = s / sum(s)) %>%
select(-s)
Which gives:
#Source: local data frame [8 x 3]
#Groups: pop
#
# pop id freq
#1 1 0 0.25
#2 1 1 0.75
#3 2 0 0.20
#4 2 1 0.60
#5 2 2 0.20
#6 3 1 0.60
#7 3 2 0.40
#8 4 2 1.00
A data.table solution:
library(data.table)
setDT(m)[, {div = .N; .SD[, .N/div, keyby = id]}, by = pop]
# pop id V1
#1: 1 0 0.25
#2: 1 1 0.75
#3: 2 0 0.20
#4: 2 1 0.60
#5: 2 2 0.20
#6: 3 1 0.60
#7: 3 2 0.40
#8: 4 2 1.00
So I want to subset my data frame to select rows with a daily maximum value.
Site Year Day Time Cover Size TempChange
ST1 2011 97 0.0 Closed small 0.97
ST1 2011 97 0.5 Closed small 1.02
ST1 2011 97 1.0 Closed small 1.10
A section of the data frame is shown above. I would like to select only the rows which have the maximum value of the variable TempChange for each value of Day. I want to do this because I am interested in specific variables (not shown) at these particular times.
AMENDED EXAMPLE AND REQUIRED OUTPUT
Site Day Temp Row
a 10 0.2 1
a 10 0.3 2
a 11 0.5 3
a 11 0.4 4
b 10 0.1 5
b 10 0.8 6
b 11 0.7 7
b 11 0.6 8
c 10 0.2 9
c 10 0.3 10
c 11 0.5 11
c 11 0.8 12
REQUIRED OUTPUT
Site Day Temp Row
a 10 0.3 2
a 11 0.5 3
b 10 0.8 6
b 11 0.7 7
c 10 0.3 10
c 11 0.8 12
Hope that makes it clearer.
After faffing with raw data frame code, I realised plyr (loaded with library(plyr)) could do this in one go:
> df
Day V Z
1 97 0.26575207 1
2 97 0.09443351 2
3 97 0.88097858 3
4 98 0.62241515 4
5 98 0.61985937 5
6 99 0.06956219 6
7 100 0.86638108 7
8 100 0.08382254 8
> ddply(df,~Day,function(x){x[which.max(x$V),]})
Day V Z
1 97 0.88097858 3
2 98 0.62241515 4
3 99 0.06956219 6
4 100 0.86638108 7
To get the rows with the max values for unique combinations of more than one column, just add the variable to the formula. For your modified example, it's then:
> df
Site Day Temp Row
1 a 10 0.2 1
2 a 10 0.3 2
3 a 11 0.5 3
4 a 11 0.4 4
5 b 10 0.1 5
6 b 10 0.8 6
7 b 11 0.7 7
8 b 11 0.6 8
9 c 10 0.2 9
10 c 10 0.3 10
11 c 11 0.5 11
12 c 11 0.8 12
> ddply(df,~Day+Site,function(x){x[which.max(x$Temp),]})
Site Day Temp Row
1 a 10 0.3 2
2 b 10 0.8 6
3 c 10 0.3 10
4 a 11 0.5 3
5 b 11 0.7 7
6 c 11 0.8 12
Note this isn't in the same order as your original dataframe, but you can fix that.
> dmax = ddply(df,~Day+Site,function(x){x[which.max(x$Temp),]})
> dmax[order(dmax$Row),]
Site Day Temp Row
1 a 10 0.3 2
4 a 11 0.5 3
2 b 10 0.8 6
5 b 11 0.7 7
3 c 10 0.3 10
6 c 11 0.8 12
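If you would rather not load plyr, a base R sketch of the same selection is to compare each Temp with its group maximum computed by ave(). Note the difference in behaviour: this keeps all tied rows, whereas which.max() keeps only the first. The name dmax_base is hypothetical.

```r
# The amended example from the question
df <- data.frame(Site = rep(c("a", "b", "c"), each = 4),
                 Day  = rep(c(10, 10, 11, 11), 3),
                 Temp = c(0.2, 0.3, 0.5, 0.4, 0.1, 0.8, 0.7, 0.6,
                          0.2, 0.3, 0.5, 0.8),
                 Row  = 1:12)

# Group maximum per Site/Day, repeated on every row of the group
group_max <- ave(df$Temp, df$Site, df$Day, FUN = max)

# Keep the rows that attain their group maximum (original order preserved)
dmax_base <- df[df$Temp == group_max, ]
dmax_base
```

Because the rows are filtered in place, the result is already in the order of the original data frame and needs no re-sorting.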