I am looking to create a new group based on two conditions. I want all of the cases until the cumulative sum of Value reaches 10 to be grouped together and I want this done within each person. I have managed to get it to work for each of the conditions separately, but not together using for loops and dplyr. However, I need both of these conditions to be applied. Below is what I would like the data to look like (I don't need an RunningSum_Value column, but I kept it in for clarification). Ideally I would like a dplyr solution, but I m not picky. Thank you in advance!
ID Value RunningSum_Value Group
PersonA 1 1 1
PersonA 3 4 1
PersonA 10 14 1
PersonA 3 3 2
PersonB 11 11 3
PersonB 12 12 4
PersonC 3 3 5
PersonD 4 4 6
PersonD 9 13 6
PersonD 5 5 7
PersonD 11 16 7
PersonD 6 6 8
PersonD 1 7 8
Here is my data:
df <- read.table(text="ID Value
PersonA 1
PersonA 3
PersonA 10
PersonA 3
PersonB 11
PersonB 12
PersonC 3
PersonD 4
PersonD 9
PersonD 5
PersonD 11
PersonD 6
PersonD 1", header=TRUE,stringsAsFactors=FALSE)
Define function sum0 which does a sum on its argument except that each time it gets to 10 or more it outputs 0. Define function is_start that returns TRUE for the start position of a group and FALSE otherwise. Finally apply is_start to each ID group using ave and then perform a cumsum on that to get the group numbers.
sum0 <- function(x, y) { if (x + y >= 10) 0 else x + y }
is_start <- function(x) head(c(TRUE, Reduce(sum0, init=0, x, acc = TRUE)[-1] == 0), -1)
cumsum(ave(DF$Value, DF$ID, FUN = is_start))
## [1] 1 1 1 2 3 4 5 6 6 7 7 8 8
UPDATE: fix
Related
I am trying to add additional data from a reference table onto my primary dataframe. I see similar questions have been asked about this however cant find anything for my specific case.
An example of my data frame is set up like this
df <- data.frame("participant" = rep(1:3,9), "time" = rep(1:9, each = 3))
lookup <- data.frame("start.time" = c(1,5,8), "end.time" = c(3,6,10), "var1" = c("A","B","A"),
"var2" = c(8,12,3), "var3"= c("fast","fast","slow"))
print(df)
participant time
1 1 1
2 2 1
3 3 1
4 1 2
5 2 2
6 3 2
7 1 3
8 2 3
9 3 3
10 1 4
11 2 4
12 3 4
13 1 5
14 2 5
15 3 5
16 1 6
17 2 6
18 3 6
19 1 7
20 2 7
21 3 7
22 1 8
23 2 8
24 3 8
25 1 9
26 2 9
27 3 9
> print(lookup)
start.time end.time var1 var2 var3
1 1 3 A 8 fast
2 5 6 B 12 fast
3 8 10 A 3 slow
What I want to do is merge or join these two dataframes in a way which also includes the times in between both the start and end time of the look up data frame. So the columns var1, var2 and var3 are added onto the df at each instance where the time lies between the start time and end time.
for example, in the above case - the look up value in the first row has a start time of 1, an end time of 3, so for times 1, 2 and 3 for each participant, the first row data should be added.
the output should look something like this.
print(output)
participant time var1 var2 var3
1 1 1 A 8 fast
2 2 1 A 8 fast
3 3 1 A 8 fast
4 1 2 A 8 fast
5 2 2 A 8 fast
6 3 2 A 8 fast
7 1 3 A 8 fast
8 2 3 A 8 fast
9 3 3 A 8 fast
10 1 4 <NA> NA <NA>
11 2 4 <NA> NA <NA>
12 3 4 <NA> NA <NA>
13 1 5 B 12 fast
14 2 5 B 12 fast
15 3 5 B 12 fast
16 1 6 B 12 fast
17 2 6 B 12 fast
18 3 6 B 12 fast
19 1 7 <NA> NA <NA>
20 2 7 <NA> NA <NA>
21 3 7 <NA> NA <NA>
22 1 8 A 3 slow
23 2 8 A 3 slow
24 3 8 A 3 slow
25 1 9 A 3 slow
26 2 9 A 3 slow
27 3 9 A 3 slow
I realise that column names don't match and they should for merging data sets.
One option would be to use the sqldf package, and phrase your problem as a SQL left join:
sql <- "SELECT t1.participant, t1.time, t2.var1, t2.var2, t2.var3
FROM df t1
LEFT JOIN lookup t2
ON t1.time BETWEEN t2.\"start.time\" AND t2.\"end.time\""
output <- sqldf(sql)
A dplyr solution:
output <- df %>%
# Create an id for the join
mutate(merge_id=1) %>%
# Use full join to create all the combinations between the two datasets
full_join(lookup %>% mutate(merge_id=1), by="merge_id") %>%
# Keep only the rows that we want
filter(time >= start.time, time <= end.time) %>%
# Select the relevant variables
select(participant,time,var1:var3) %>%
# Right join with initial dataset to get the missing rows
right_join(df, by = c("participant","time")) %>%
# Sort to match the formatting asked by OP
arrange(time, participant)
This produces the output asked by OP, but it will only work for data of reasonable size, as the full join produces a data frame with number of rows equal to the product of the number of rows of both initial datasets.
Using tidyverse and creating an auxiliary table:
df <- data.frame("participant" = rep(1:3,9), "time" = rep(1:9, each = 3))
lookup <- data.frame("start.time" = c(1,5,8), "end.time" = c(3,6,10), "var1" = c("A","B","A"),
"var2" = c(8,12,3), "var3"= c("fast","fast","slow"))
lookup_extended <- lookup %>%
mutate(time = map2(start.time, end.time, ~ c(.x:.y))) %>%
unnest(time) %>%
select(-start.time, -end.time)
df2 <- df %>%
left_join(lookup_extended, by = "time")
Hi I have dataframe with multiple columns ,I.e first 5 columns are my metadata and remaing
columns (columns count will be even) are actual columns which need to be calculated
formula : (col6*col9) + (col7*col10) + (col8*col11)
country<-c("US","US","US","US")
name <-c("A","B","c","d")
dob<-c(2017,2018,2018,2010)
day<-c(1,4,7,9)
hour<-c(10,11,2,4)
a <-c(1,3,4,5)
d<-c(1,9,4,0)
e<-c(8,1,0,7)
f<-c(10,2,5,6)
j<-c(1,4,2,7)
m<-c(1,5,7,1)
df=data.frame(country,name,dob,day,hour,a,d,e,f,j,m)
how to get final summation if i have more columns
I have tried with below code
df$final <-(df$a*df$f)+(df$d*df$j)+(df$e*df$m)
Here is one way to do generalize the computation:
x <- ncol(df) - 5
df$final <- rowSums(df[6:(5 + x/2)] * df[(ncol(df) - x/2 + 1):ncol(df)])
# country name dob day hour a d e f j m final
# 1 US A 2017 1 10 1 1 8 10 1 1 19
# 2 US B 2018 4 11 3 9 1 2 4 5 47
# 3 US c 2018 7 2 4 4 0 5 2 7 28
# 4 US d 2010 9 4 5 0 7 6 7 1 37
My data looks like this:
x y
1 1
2 2
3 2
4 4
5 5
6 6
7 6
8 8
9 9
10 9
11 11
12 12
13 13
14 13
15 14
16 15
17 14
18 16
19 17
20 18
y is a grouping variable. I would like to see how well this grouping went.
Because of this I want to extract a sample of n pairs of cases that are grouped together by variable y
and n pairs of cases that are not grouped together by variable y. In order to calculate the number of
false positives and false negatives (either falsly grouped or not). How do I extract a sample of grouped pairs
and a sample of not-grouped pairs?
I would like the samples to look like this (for n=6) :
Grouped sample:
x y
2 2
3 2
9 9
10 9
15 14
17 14
Not-grouped sample:
x y
1 1
2 2
6 8
6 8
11 11
19 17
How would I go about this in R?
I'm not entirely clear on what you like to do, partly because I feel there is some context missing as to what you're trying to achieve. I also don't quite understand your expected output (for example, the not-grouped sample contains an entry 6 8 that does not exist in your original data...)
That aside, here is a possible approach.
# Maximum number of samples per group
n <- 3;
# Set fixed RNG seed for reproducibility
set.seed(2017);
# Grouped samples
df.grouped <- do.call(rbind.data.frame, lapply(split(df, df$y),
function(x) if (nrow(x) > 1) x[sample(min(n, nrow(x))), ]));
df.grouped;
# x y
#2.3 3 2
#2.2 2 2
#6.6 6 6
#6.7 7 6
#9.10 10 9
#9.9 9 9
#13.13 13 13
#13.14 14 13
#14.15 15 14
#14.17 17 14
# Ungrouped samples
df.ungrouped <- df[sample(nrow(df.grouped)), ];
df.ungrouped;
# x y
#7 7 6
#1 1 1
#9 9 9
#4 4 4
#3 3 2
#2 2 2
#5 5 5
#6 6 6
#10 10 9
#8 8 8
Explanation: Split df based on y, then draw min(n, nrow(x)) samples from subset x containing >1 rows; rbinding gives the grouped df.grouped. We then draw nrow(df.grouped) samples from df to produce the ungrouped df.ungrouped.
Sample data
df <- read.table(text =
"x y
1 1
2 2
3 2
4 4
5 5
6 6
7 6
8 8
9 9
10 9
11 11
12 12
13 13
14 13
15 14
16 15
17 14
18 16
19 17
20 18", header = T)
I have a vector x :
x<-rpois(16,lambda=10)
and a lookup table named u3 :
minx<-min(x)
maxx<-max(x)
dg<-maxx-minx
k<-5
sa<-ceiling(dg/k)
u1=data.frame(seq(minx,maxx,sa))
colnames(u1)<-"x"
u2<-NULL
for (i in 1:k)
{
u2[i]<-u1[i,] + sa-1
}
u2<-as.data.frame(u2)
colnames(u2)<-"y"
u3<-cbind(u1,u2)
for(i in 1:nrow(u2))
{
u3$range[i]<-paste(u1[i,],u2[i,],sep="-")
}
print(u3)
my u3 data.frame like :
x y range
1 3 5 3-5
2 6 8 6-8
3 9 11 9-11
4 12 14 12-14
5 15 17 15-17
I want to do a calculation here:
I want to every x vector look in u3 data frame variables in colums 1,2
and then if condition true,so if x values are in range,count the x values which are in the range of u3 dataframe and write the count to u3 dataframe as a new column.
something like this :
count=0
for(i in 1:length(x))
{
for(j in 1:nrow(u3))
{
u3$count[j]<-if(x[i]>=u3[j,1] & x[i]<=u3[j,2]) {count=count + 1}
}
}
but i can't make it.
Do you have an idea about it ? How can I deal such a problem ?
I dont know how to tell R to look dynamically to lookup table and write the count of it.
I want such a desired output
x y range count
1 3 5 3-5 2
2 6 8 6-8 5
3 9 11 9-11 1
4 12 14 12-14 4
5 15 17 15-17 3
Thank you
with set.seed(0)
x
[1] 13 8 14 14 11 14 12 11 9 14 11 8 2 8 10 7
u3
x y range
1 2 4 2-4
2 5 7 5-7
3 8 10 8-10
4 11 13 11-13
5 14 16 14-16
cbind(u3,data.frame(table(findInterval(x,u3$x))))
x y range Var1 Freq
1 2 4 2-4 1 1
2 5 7 5-7 2 1
3 8 10 8-10 3 5
4 11 13 11-13 4 5
5 14 16 14-16 5 4
Keep in mind, I used findInterval(x,u3$x) it only works for your example because your groupings are (x,y] ie (x inclusive, y non-inclusive)
This is my first post at StackOverflow. I am relatively a newbie in programming and trying to work with the data.table in R, for its reputation in speed.
I have a very large data.table, named "Actions", with 5 columns and potentially several million rows. The column names are k1, k2, i, l1 and l2. I have another data.table, with the unique values of Actions in columns k1 and k2, named "States".
For every row in Actions, I would like to find the unique index for columns 4 and 5, matching with States. A reproducible code is as follows:
S.disc <- c(2000,2000)
S.max <- c(6200,2300)
S.min <- c(700,100)
Traces.num <- 3
Class.str <- lapply(1:2,function(x) seq(S.min[x],S.max[x],S.disc[x]))
Class.inf <- seq_len(Traces.num)
Actions <- data.table(expand.grid(Class.inf, Class.str[[2]], Class.str[[1]], Class.str[[2]], Class.str[[1]])[,c(5,4,1,3,2)])
setnames(Actions,c("k1","k2","i","l1","l2"))
States <- unique(Actions[,list(k1,k2,i)])
So if i was using data.frame, the following line would be like:
index <- apply(Actions,1,function(x) {which((States[,1]==x[4]) & (States[,2]==x[5]))})
How can I do the same with data.table efficiently ?
This is relatively simple once you get the hang of keys and the special symbols which may be used in the j expression of a data.table. Try this...
# First make an ID for each row for use in the `dcast`
# because you are going to have multiple rows with the
# same key values and you need to know where they came from
Actions[ , ID := 1:.N ]
# Set the keys to join on
setkeyv( Actions , c("l1" , "l2" ) )
setkeyv( States , c("k1" , "k2" ) )
# Join States to Actions, using '.I', which
# is the row locations in States in which the
# key of Actions are found and within each
# group the row number ( 1:.N - a repeating 1,2,3)
New <- States[ J(Actions) , list( ID , Ind = .I , Row = 1:.N ) ]
# k1 k2 ID Ind Row
#1: 700 100 1 1 1
#2: 700 100 1 2 2
#3: 700 100 1 3 3
#4: 700 100 2 1 1
#5: 700 100 2 2 2
#6: 700 100 2 3 3
# reshape using 'dcast.data.table'
dcast.data.table( Row ~ ID , data = New , value.var = "Ind" )
# Row 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27...
#1: 1 1 1 1 4 4 4 7 7 7 10 10 10 13 13 13 16 16 16 1 1 1 4 4 4 7 7 7...
#2: 2 2 2 2 5 5 5 8 8 8 11 11 11 14 14 14 17 17 17 2 2 2 5 5 5 8 8 8...
#3: 3 3 3 3 6 6 6 9 9 9 12 12 12 15 15 15 18 18 18 3 3 3 6 6 6 9 9 9...