Problem: how can I back-fill all rows in a group that come before an occurrence of a certain value? I am not trying to fill in NA or missing values with zoo::na.locf. In the following, I would like to set A to 1.00 in every row that precedes the first 1.00 within each ID group, ideally using dplyr.
Input:
data <- data.frame(ID = c(1,1,1,1,2,2,2,3,3,3,4,4,4,4,4),
                   time = c(1,2,3,4,1,2,3,1,2,3,1,2,3,4,5),
                   A = c(0.10,0.25,1,0,0.25,1,0.25,0,1,0.10,1,0.10,0.10,0.10,0.05))
ID time A
1 1 0.10
1 2 0.25
1 3 1.00
1 4 0.00
2 1 0.25
2 2 1.00
2 3 0.25
3 1 0.00
3 2 1.00
3 3 0.10
4 1 1.00
4 2 0.10
4 3 0.10
4 4 0.10
4 5 0.05
Desired output:
ID time A
1 1 1.00
1 2 1.00
1 3 1.00
1 4 0.00
2 1 1.00
2 2 1.00
2 3 0.25
3 1 1.00
3 2 1.00
3 3 0.10
4 1 1.00
4 2 0.10
4 3 0.10
4 4 0.10
4 5 0.05
After grouping by ID you can take the cumulative count of 1's; wherever it is still below 1 (the 1 has not yet appeared), replace the A-value with 1:
library(dplyr)

data %>%
  group_by(ID) %>%
  mutate(A = replace(A, cumsum(A == 1) < 1, 1))
# Source: local data frame [15 x 3]
# Groups: ID [4]
#
# ID time A
# <dbl> <dbl> <dbl>
# 1 1 1 1.00
# 2 1 2 1.00
# 3 1 3 1.00
# 4 1 4 0.00
# 5 2 1 1.00
# 6 2 2 1.00
# 7 2 3 0.25
# 8 3 1 1.00
# 9 3 2 1.00
# 10 3 3 0.10
# 11 4 1 1.00
# 12 4 2 0.10
# 13 4 3 0.10
# 14 4 4 0.10
# 15 4 5 0.05
Quite similarly, you could also use cummax:
data %>% group_by(ID) %>% mutate(A = replace(A, !cummax(A == 1), 1))
And here's a base R approach:
transform(data, A = ave(A, ID, FUN = function(x) replace(x, !cummax(x == 1), 1)))
We can use data.table. Convert the 'data.frame' to a 'data.table' (setDT(data)), find the row where 'A' is 1 within each ID group, build the sequence of row indices up to that row, and use it as i to assign (:=) the value 1 to 'A'. (This assumes each group contains exactly one 1, as in the example, since seq_len expects a single length.)
library(data.table)
setDT(data)[data[, .I[seq_len(which(A==1))], ID]$V1, A := 1][]
# ID time A
# 1: 1 1 1.00
# 2: 1 2 1.00
# 3: 1 3 1.00
# 4: 1 4 0.00
# 5: 2 1 1.00
# 6: 2 2 1.00
# 7: 2 3 0.25
# 8: 3 1 1.00
# 9: 3 2 1.00
#10: 3 3 0.10
#11: 4 1 1.00
#12: 4 2 0.10
#13: 4 3 0.10
#14: 4 4 0.10
#15: 4 5 0.05
Or we can use ave from base R
data$A[with(data, ave(A==1, ID, FUN = cumsum)<1)] <- 1
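A hedged caveat that applies to all of the approaches above: if a group contained no 1 at all, the cumsum/cummax conditions would be TRUE for every row (and the data.table seq_len call would error), so the whole group would be overwritten. A sketch of a guard, assuming such groups should be left unchanged:

library(dplyr)

data %>%
  group_by(ID) %>%
  # only back-fill in groups that actually contain a 1
  mutate(A = if (any(A == 1)) replace(A, cumsum(A == 1) < 1, 1) else A) %>%
  ungroup()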
Related
Please see the sample data below:
data <- data.frame(q1 = c(3,4,5,2,1,2,4),
                   q2 = c(3,4,4,5,4,3,2),
                   q3 = c(2,3,2,3,1,2,3),
                   q4 = c(3,4,4,4,4,5,5))
I would like to create another column showing the proportion of each row's responses that are 4 or 5. The output I am hoping to get looks something like this. Any help is appreciated, thank you!
q1 q2 q3 q4 percent
1 3 3 2 3 0.00
2 4 4 3 4 0.75
3 5 4 2 4 0.75
4 2 5 3 4 0.50
5 1 4 1 4 0.50
6 2 3 2 5 0.25
7 4 2 3 5 0.50
Using rowMeans:
library(dplyr)

data %>%
  mutate(percent = rowMeans(across(everything(), ~ .x %in% 4:5)))
Output:
q1 q2 q3 q4 percent
1 3 3 2 3 0.00
2 4 4 3 4 0.75
3 5 4 2 4 0.75
4 2 5 3 4 0.50
5 1 4 1 4 0.50
6 2 3 2 5 0.25
7 4 2 3 5 0.50
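A hedged variant: across(everything(), ...) would also sweep in any non-question columns that might be present. Restricting the selection to the q-columns (assuming they all share the "q" prefix) keeps the calculation safe:

data %>%
  mutate(percent = rowMeans(across(starts_with("q"), ~ .x %in% 4:5)))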
One possible solution, relying on the responses being integers from 1 to 5 (so that > 3 is equivalent to being 4 or 5):
data$percent = rowMeans(data > 3)
Or, using the lambda shorthand \(x) (available from R 4.1):
data$percent = apply(data, 1, \(x) mean(x %in% 4:5))
q1 q2 q3 q4 percent
1 3 3 2 3 0.00
2 4 4 3 4 0.75
3 5 4 2 4 0.75
4 2 5 3 4 0.50
5 1 4 1 4 0.50
6 2 3 2 5 0.25
7 4 2 3 5 0.50
library(dplyr)
data <- data.frame(q1=c(3,4,5,2,1,2,4),
q2=c(3,4,4,5,4,3,2),
q3=c(2,3,2,3,1,2,3),
q4=c(3,4,4,4,4,5,5))
percent_4_5 <- function(x) {
  (sum(x == 4) + sum(x == 5)) / length(x)
}

data %>%
  rowwise() %>%
  mutate(percent = percent_4_5(c_across(starts_with("q")))) %>%
  ungroup()
Another possible solution, without using dplyr:
library(magrittr)
data$percent <- (data > 3) %>% as.data.frame() %>% apply(., 1, mean)
TRUE is treated as 1 and FALSE as 0 when the mean is computed.
Output:
q1 q2 q3 q4 percent
1 3 3 2 3 0.00
2 4 4 3 4 0.75
3 5 4 2 4 0.75
4 2 5 3 4 0.50
5 1 4 1 4 0.50
6 2 3 2 5 0.25
7 4 2 3 5 0.50
I have a data frame with several variables, to which I have merged three columns of probabilities. My question is: how can I use these columns as the prob argument of sample, so that a different column is taken depending on y? For example, for y = 1 take column a, for y = 2 take column b, and so on. My code is:
a b c y
1 0.090 0.12 0.10 1
2 0.015 0.13 0.09 1
3 0.034 0.20 0.34 1
4 0.440 0.44 0.70 1
5 0.090 0.12 0.10 2
6 0.015 0.13 0.09 2
mydata$mig <- sample(1:3, size = 7, replace = TRUE, prob = ????)
Any help would be appreciated.
Using apply over the rows:
df <- read.table(header = TRUE, text="a b c y
1 0.090 0.12 0.10 1
2 0.015 0.13 0.09 1
3 0.034 0.20 0.34 1
4 0.440 0.44 0.70 1
5 0.090 0.12 0.10 2
6 0.015 0.13 0.09 2")
set.seed(12344)
samples1<- apply(X = df[,-4], MARGIN = 1, # MARGIN = 1 indicates you are applying FUN per rows
FUN = function(x) sample( 1:3,
size = 7,
replace= TRUE ,
prob = x))
# You obtain one column per row of df, each holding 7 samples drawn with that row's probabilities
samples1
1 2 3 4 5 6
[1,] 2 3 3 1 3 2
[2,] 1 2 3 3 2 2
[3,] 2 3 3 3 1 3
[4,] 2 3 3 1 3 2
[5,] 2 2 3 2 3 2
[6,] 1 3 2 3 2 3
[7,] 3 3 3 2 1 2
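If the goal is instead a single draw per row (e.g. to fill a mig column), size = 1 inside the same apply gives one value per row; a sketch of that variant:

df$mig <- apply(df[, -4], 1, function(x) sample(1:3, size = 1, prob = x))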
Update:
Given your comment on my answer, I propose a new solution using data.table. I leave the previous version above for reference, in case anyone is interested.
library(data.table)
setDT(df)
set.seed(78787)
# Column V1 holds your 7 samples per y group, with the prob column chosen at random from a, b, c
df[, sample(1:.N,
size = 7,
replace = TRUE,
prob = unlist(.SD)),
by = y,
.SDcols = sample(names(df)[-ncol(df)], 1)]
y V1
1: 1 4
2: 1 3
3: 1 4
4: 1 3
5: 1 4
6: 1 4
7: 1 4
8: 2 2
9: 2 1
10: 2 1
11: 2 1
12: 2 2
13: 2 1
14: 2 1
While the "normal" indexing of 2d matrices and frames is the [i,j] method, one can also provide a 2-column matrix to i alone to programmatically combine the rows and columns. We can use this to create a matrix whose first column is merely counting the rows (1:6 here), and the second column is taken directly from your y column:
cbind(seq_len(nrow(mydata)), mydata$y)
# [,1] [,2]
# [1,] 1 1
# [2,] 2 1
# [3,] 3 1
# [4,] 4 1
# [5,] 5 2
# [6,] 6 2
mydata[cbind(seq_len(nrow(mydata)), mydata$y)]
# [1] 0.090 0.015 0.034 0.440 0.120 0.130
Note that, as written, your sample code is not going to work:
- true must be TRUE
- the vector of derived probabilities has length 6 (one per row), which does not match the length of 1:3
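One hedged way to finish the idea, assuming the intent is one draw per row using that row's (a, b, c) as the weights (the question is ambiguous on this point):

p <- as.matrix(mydata[, c("a", "b", "c")])
mydata$mig <- sapply(seq_len(nrow(mydata)),
                     function(i) sample(1:3, size = 1, prob = p[i, ]))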
I am trying to split the temp variable into groups in such a way that a new group starts right after each row where Dur > 5.
Further, I want to find the maximum value of temp for each group, as shown in the expected outcome.
Dur <- c(2.75,0.25,13,0.25,45.25,0.25,0.25,4.25,0.25,0.25,14)
temp <- c(2.54,5.08,0,2.54,0,5,2.54,0,2.54,0,2.54)
df <- data.frame(Dur, temp)
Expected Outcome:
group=c(1,1,1,2,2,3,3,3,3,3,3)
Colnew=c(5.08,5.08,5.08,2.54,2.54,5,5,5,5,5,5)
(output=data.frame(df,group,Colnew))
We create a grouping variable by taking the lagged cumsum of the logical vector Dur > 5, then get the max of 'temp' per group:
library(dplyr)
df %>%
group_by(group = as.integer(factor(lag(cumsum(Dur > 5), default = 0)))) %>%
mutate(Max = max(temp))
# A tibble: 11 x 4
# Groups: group [3]
# Dur temp group Max
# <dbl> <dbl> <int> <dbl>
# 1 2.75 2.54 1 5.08
# 2 0.25 5.08 1 5.08
# 3 13 0 1 5.08
# 4 0.25 2.54 2 2.54
# 5 45.2 0 2 2.54
# 6 0.25 5 3 5
# 7 0.25 2.54 3 5
# 8 4.25 0 3 5
# 9 0.25 2.54 3 5
#10 0.25 0 3 5
#11 14 2.54 3 5
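For comparison, a sketch of the same lagged-cumsum grouping and the per-group max in base R:

# lag(Dur > 5) by hand: prepend FALSE and drop the last element
grp <- cumsum(c(FALSE, head(df$Dur > 5, -1))) + 1
df$group <- grp
df$Colnew <- ave(df$temp, grp, FUN = max)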
I have a dataset like this one:
test <-
data.frame(
variable = c("A","A","B","B","C","D","E","E","E","F","F","G"),
confidence = c(1,0.6,0.1,0.15,1,0.3,0.4,0.5,0.2,1,0.4,0.9),
freq = c(2,2,2,2,1,1,3,3,3,2,2,1),
weight = c(2,2,0,0,1,3,5,5,5,0,0,4)
)
> test
variable confidence freq weight
1 A 1.00 2 2
2 A 0.60 2 2
3 B 0.10 2 0
4 B 0.15 2 0
5 C 1.00 1 1
6 D 0.30 1 3
7 E 0.40 3 5
8 E 0.50 3 5
9 E 0.20 3 5
10 F 1.00 2 0
11 F 0.40 2 0
12 G 0.90 1 4
I want to calculate, for each observation, its weight times the sum of the confidences of its variable:

SWC_k = w_k * Σ_{j ∈ i} c_j, where i is the variable (A, B, C…) to which observation k belongs.

Developing the formula above:
w[1]c[1]+w[1]c[2]=2*1+2*0.6=3.2
w[2]c[1]+w[2]c[2]
w[3]c[3]+w[3]c[4]
w[4]c[3]+w[4]c[4]
w[5]c[5]
w[6]c[6]
w[7]c[7]+w[7]c[8]+w[7]c[9]
w[8]c[7]+w[8]c[8]+w[8]c[9]
w[9]c[7]+w[9]c[8]+w[9]c[9]
…
The result should look like this:
> test
variable confidence freq weight SWC
1 A 1.00 2 2 3.2
2 A 0.60 2 2 3.2
3 B 0.10 2 0 0.0
4 B 0.15 2 0 0.0
5 C 1.00 1 1 1.0
6 D 0.30 1 3 0.9
7 E 0.40 3 5 5.5
8 E 0.50 3 5 5.5
9 E 0.20 3 5 5.5
10 F 1.00 2 0 0.0
11 F 0.40 2 0 0.0
12 G 0.90 1 4 3.6
Note that the confidence value is different for each observation, but every observation of a variable has the same weight, so the sum I need is the same for all observations of the same variable.
First, I tried to write a loop iterating over each variable the corresponding number of times, using:
> table(test$variable)
A B C D E F G
2 2 1 1 3 2 1
but I couldn't make it work. So then, I calculated the position where each variable starts, to try to make the for loop iterate only over those values:
> tpos = cumsum(table(test$variable))
> tpos = tpos+1
> tpos
A B C D E F G
3 5 6 7 10 12 13
> tpos = shift(tpos, 1)
> tpos
[1] NA 3 5 6 7 10 12
> tpos[1]=1
> tpos
[1] 1 3 5 6 7 10 12
# tpos is a vector with the positions where each variable (A, B, c...) start
> tposn = c(1:nrow(test))[-tpos]
> tposn
[1] 2 4 8 9 11
> c(1:nrow(test))[-tposn]
[1] 1 3 5 6 7 10 12
# then I came up with this loop, but it doesn't give the correct result
for(i in 1:nrow(test)[-tposn]){  # note: indexing binds tighter than ':', so this is 1:(nrow(test)[-tposn])
  a = test$freq[i]-1
  test$SWC[i:i+a] = sum(test$weight[i]*test$confidence[i:i+a])  # i:i+a parses as (i:i)+a, i.e. the single index i+a
}
Maybe there is an easier way to do this? tapply?
Using dplyr; since the weight is constant within each variable, sum(confidence * weight) equals weight * sum(confidence), which is exactly the formula above:
library(dplyr)
test %>%
group_by(variable) %>%
mutate(SWC=sum(confidence*weight))
# A tibble: 12 x 5
# Groups: variable [7]
variable confidence freq weight SWC
<fctr> <dbl> <dbl> <dbl> <dbl>
1 A 1.00 2 2 3.2
2 A 0.60 2 2 3.2
3 B 0.10 2 0 0.0
4 B 0.15 2 0 0.0
5 C 1.00 1 1 1.0
6 D 0.30 1 3 0.9
7 E 0.40 3 5 5.5
8 E 0.50 3 5 5.5
9 E 0.20 3 5 5.5
10 F 1.00 2 0 0.0
11 F 0.40 2 0 0.0
12 G 0.90 1 4 3.6
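A base R equivalent, sketched with ave, which repeats the group sum on every row of the variable:

test$SWC <- ave(test$confidence * test$weight, test$variable, FUN = sum)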
Consider this data:
m <- data.frame(pop = c(1,1,1,1,2,2,2,2,2,3,3,3,3,3,4,4),
                id = c(0,1,1,1,1,1,0,2,1,1,1,2,1,2,2,2))
> m
pop id
1 1 0
2 1 1
3 1 1
4 1 1
5 2 1
6 2 1
7 2 0
8 2 2
9 2 1
10 3 1
11 3 1
12 3 2
13 3 1
14 3 2
15 4 2
16 4 2
I would like to get the frequency of each unique id within each unique pop. For example, id 1 is present 3 times out of the 4 rows where pop == 1, so the frequency of id 1 in pop 1 is 0.75.
I came up with this ugly solution:
out = matrix(0, ncol = 3)
for (p in unique(m$pop)) {
  for (i in unique(m$id)) {
    m1 = m[m$pop == p, ]
    f = nrow(m1[m1$id == i, ]) / nrow(m1)
    out = rbind(out, c(p, f, i))
  }
}
out = out[-1,]
colnames(out) = c("pop", "freq", "id")
# SOLUTION
> out
pop freq id
[1,] 1 0.25 0
[2,] 1 0.75 1
[3,] 1 0.00 2
[4,] 2 0.20 0
[5,] 2 0.60 1
[6,] 2 0.20 2
[7,] 3 0.00 0
[8,] 3 0.60 1
[9,] 3 0.40 2
[10,] 4 0.00 0
[11,] 4 0.00 1
[12,] 4 1.00 2
I am sure there exists a more efficient solution using data.table or table, but I couldn't find it.
Here's what I might do:
as.data.frame(prop.table(table(m),1))
# pop id Freq
# 1 1 0 0.25
# 2 2 0 0.20
# 3 3 0 0.00
# 4 4 0 0.00
# 5 1 1 0.75
# 6 2 1 0.60
# 7 3 1 0.60
# 8 4 1 0.00
# 9 1 2 0.00
# 10 2 2 0.20
# 11 3 2 0.40
# 12 4 2 1.00
If you want it sorted by pop, you can do that afterwards. Alternatively, you could transpose the table with t before converting to data.frame, or use rev(m) and prop.table on dimension 2.
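A sketch of that sort-afterwards step:

res <- as.data.frame(prop.table(table(m), 1))
res[order(res$pop, res$id), ]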
Try the following (note that, unlike the loop and the table approaches, it keeps only the pop/id combinations that actually occur, so zero frequencies are dropped):
library(dplyr)
m %>%
group_by(pop, id) %>%
summarise(s = n()) %>%
mutate(freq = s / sum(s)) %>%
select(-s)
Which gives:
#Source: local data frame [8 x 3]
#Groups: pop
#
# pop id freq
#1 1 0 0.25
#2 1 1 0.75
#3 2 0 0.20
#4 2 1 0.60
#5 2 2 0.20
#6 3 1 0.60
#7 3 2 0.40
#8 4 2 1.00
A data.table solution:
setDT(m)[, {div = .N; .SD[, .N/div, keyby = id]}, by = pop]
# pop id V1
#1: 1 0 0.25
#2: 1 1 0.75
#3: 2 0 0.20
#4: 2 1 0.60
#5: 2 2 0.20
#6: 3 1 0.60
#7: 3 2 0.40
#8: 4 2 1.00
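An arguably more direct data.table variant (a sketch): count the pop/id pairs first, then divide by the pop totals:

library(data.table)
setDT(m)[, .N, by = .(pop, id)][, freq := N / sum(N), by = pop][]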