I have a data frame Fix with many variables. Among these is CURRENT_ID, which is numeric and ranges from 1 to a number that varies (e.g., 12 in some cases, 15 in others), and a variable called nitem, which represents the item number in my experiment. For each trial and each subject, I would like to identify the minimum and the maximum CURRENT_ID. Then I would like to create a new variable called Remove, with a value of 1 if CURRENT_ID is the minimum or maximum for that trial and participant, and 0 for all other rows. Below is an example of the data I have and the output I would like to obtain (shown in the OUTPUT column):
SESSION_LABEL TRIAL_INDEX CURRENT_ID nitem OUTPUT
ppt01 1 1 4 1
ppt01 1 1 4 1
ppt01 1 4 4 0
ppt01 1 2 4 0
ppt01 1 2 4 0
ppt01 1 2 4 0
ppt01 1 4 4 0
ppt01 1 5 4 0
ppt01 1 6 4 0
ppt01 1 7 4 0
ppt01 1 8 4 0
ppt01 1 10 4 0
ppt01 1 11 4 0
ppt01 1 11 4 0
ppt01 1 12 4 0
ppt01 1 13 4 0
ppt01 1 13 4 0
ppt01 1 14 4 1
ppt01 1 1 4 1
ppt01 1 1 4 1
ppt01 2 2 2 0
ppt01 2 1 2 1
ppt01 2 5 2 0
ppt01 2 3 2 0
ppt01 2 4 2 0
ppt01 2 5 2 0
ppt01 2 5 2 0
ppt01 2 5 2 0
ppt01 2 6 2 0
ppt01 2 7 2 0
ppt01 2 8 2 0
ppt01 2 10 2 0
ppt01 2 10 2 0
ppt01 2 11 2 0
ppt01 2 13 2 0
ppt01 2 13 2 0
ppt01 2 13 2 0
ppt01 2 14 2 1
ppt01 2 3 2 0
ppt01 2 2 2 0
ppt01 2 1 2 1
ppt01 2 1 2 1
ppt01 2 1 2 1
ppt01 2 5 2 0
ppt01 2 4 2 0
ppt01 2 4 2 0
ppt01 2 5 2 0
ppt01 2 7 2 0
ppt01 2 9 2 0
ppt01 2 10 2 0
ppt01 2 12 2 0
ppt01 2 10 2 0
ppt01 2 10 2 0
ppt01 2 4 2 0
ppt01 2 5 2 0
ppt01 2 4 2 0
ppt01 2 6 2 0
ppt04 2 1 8 1
ppt04 2 1 8 1
ppt04 2 2 8 0
ppt04 2 3 8 0
ppt04 2 4 8 0
ppt04 2 5 8 0
ppt04 2 6 8 0
ppt04 2 7 8 0
ppt04 2 8 8 0
ppt04 2 7 8 0
ppt04 2 6 8 0
ppt04 2 8 8 0
ppt04 2 8 8 0
ppt04 2 10 8 0
ppt04 2 9 8 0
ppt04 2 10 8 0
ppt04 2 13 8 0
ppt04 2 14 8 1
ppt04 2 14 8 1
ppt04 2 1 8 1
ppt04 3 2 10 0
ppt04 3 2 10 0
ppt04 3 2 10 0
ppt04 3 3 10 0
ppt04 3 2 10 0
ppt04 3 4 10 0
ppt04 3 5 10 0
ppt04 3 6 10 0
ppt04 3 7 10 0
ppt04 3 9 10 0
ppt04 3 11 10 0
ppt04 3 12 10 0
ppt04 3 12 10 0
ppt04 3 13 10 0
ppt04 3 14 10 1
ppt04 3 14 10 1
Here is my attempt.
for (j in 1:nrow(Fix)){
Fix$Remove[j] <-ifelse(by(Fix$CURRENT_ID, list(Fix$SESSION_LABEL,Fix$nitem), max), 1,
ifelse(by(Fix$CURRENT_ID, list(Fix$SESSION_LABEL,Fix$nitem), min), 1,0))
}
Also, I am not sure if a for loop is the best way to do it.
Using dplyr:
library(dplyr)
your_data %>%
group_by(SESSION_LABEL, nitem) %>%
mutate(Remove = ifelse(
CURRENT_ID == min(CURRENT_ID) | CURRENT_ID == max(CURRENT_ID),
1, 0
))
You can do this with base R:
Fix <- within(Fix, {
mx <- ave(CURRENT_ID, SESSION_LABEL, nitem, FUN=max)
mn <- ave(CURRENT_ID, SESSION_LABEL, nitem, FUN=min)
Remove <- ifelse(CURRENT_ID==mx | CURRENT_ID==mn, 1, 0)
})
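But testing the result against your data shows four mismatches:
which(Fix$Remove!=Fix$OUTPUT)
# [1] 78 79 80 82
In those rows CURRENT_ID is 2, which is the minimum for that participant and item (ppt04, nitem 10), so Remove is 1 while your OUTPUT column has 0.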
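As a compact variant of the base R approach, the two comparisons can be folded into one by testing membership in range(x). This is only a sketch on made-up toy data; it groups by SESSION_LABEL and TRIAL_INDEX (an assumption based on the wording "for each trial and participant"), so swap in nitem if that is the intended grouping.

```r
# Toy data (made up for illustration): two participants, one trial each.
Fix <- data.frame(
  SESSION_LABEL = c("a", "a", "a", "a", "b", "b", "b"),
  TRIAL_INDEX   = c(1, 1, 1, 1, 1, 1, 1),
  CURRENT_ID    = c(1, 3, 5, 1, 2, 3, 4)
)

# range(x) returns c(min(x), max(x)), so %in% flags both extremes at once.
# ave() applies the function within each SESSION_LABEL x TRIAL_INDEX group
# and returns the results in the original row order.
Fix$Remove <- ave(
  Fix$CURRENT_ID, Fix$SESSION_LABEL, Fix$TRIAL_INDEX,
  FUN = function(x) as.integer(x %in% range(x))
)

print(Fix)
```

Every row that ties for the group minimum or maximum is flagged, which matches the behaviour of the dplyr and ave() answers.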
I'm having trouble using the row number as an index. For example, I want a new column that gives the total sales over a four-day window starting at the current day. I want to name the column sales_next4.
The issue with my code is that I don't know how to use row_number() as an index; what I'm doing instead fetches the actual value of the column.
# here's how to create the data
df <- read.table(text = "day price price_change sales High_sales_ind
1 5 0 12 1
2 5 0 6 0
3 5 0 5 0
4 5 0 4 0
5 5 0 10 1
6 5 0 10 1
7 5 0 10 1
8 5 0 12 1
9 5 0 14 1
10 7 2 3 0
11 7 0 2 0", header = TRUE)
#my code
df<- df %>% mutate(sales_next4 = sales[row_number():sales_rownumber()+4)
What I need:
day price price_change sales High_sales_ind sales_next4
1 5 0 12 1 27
2 5 0 6 0 25
3 5 0 5 0 29
4 5 0 4 0 34
5 5 0 10 1 42
6 5 0 10 1 46
7 5 0 10 1 39
8 5 0 12 1 31
9 5 0 14 1 19
10 7 2 3 0 5
11 7 0 2 0 2
Any help would be appreciated.
You can use rollapply from the zoo package for cases like this, assuming that the days are consecutive as in the example data provided.
You'll need to use the partial = and align = arguments to fill the column correctly, see ?rollapply for the details.
library(dplyr)
library(zoo)
df <- df %>%
mutate(sales_next4 = rollapply(sales, 4, sum, partial = TRUE, align = "left"))
Result:
day price price_change sales High_sales_ind sales_next4
1 1 5 0 12 1 27
2 2 5 0 6 0 25
3 3 5 0 5 0 29
4 4 5 0 4 0 34
5 5 5 0 10 1 42
6 6 5 0 10 1 46
7 7 5 0 10 1 39
8 8 5 0 12 1 31
9 9 5 0 14 1 19
10 10 7 2 3 0 5
11 11 7 0 2 0 2
You can use map_dbl() from purrr to compute a rolling sum based on the day column.
library(dplyr)
library(purrr)
df %>%
mutate(sales_next4 = map_dbl(day, ~ sum(sales[between(day, .x, .x+3)])))
# day price price_change sales High_sales_ind sales_next4
# 1 1 5 0 12 1 27
# 2 2 5 0 6 0 25
# 3 3 5 0 5 0 29
# 4 4 5 0 4 0 34
# 5 5 5 0 10 1 42
# 6 6 5 0 10 1 46
# 7 7 5 0 10 1 39
# 8 8 5 0 12 1 31
# 9 9 5 0 14 1 19
# 10 10 7 2 3 0 5
# 11 11 7 0 2 0 2
Using slider
library(dplyr)
library(slider)
df %>%
mutate(sales_next4 = slide_dbl(day, ~ sum(sales[.x]), .after = 3))
day price price_change sales High_sales_ind sales_next4
1 1 5 0 12 1 27
2 2 5 0 6 0 25
3 3 5 0 5 0 29
4 4 5 0 4 0 34
5 5 5 0 10 1 42
6 6 5 0 10 1 46
7 7 5 0 10 1 39
8 8 5 0 12 1 31
9 9 5 0 14 1 19
10 10 7 2 3 0 5
11 11 7 0 2 0 2
You can use Reduce() and data.table::shift()
library(data.table)
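# shift() with negative n produces leads; the padded zeros keep the last rows from becoming NA
setDT(df)[, sales_next4 := Reduce(`+`, shift(c(sales, 0, 0, 0), -3:0))[1:.N]]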
Output:
day price price_change sales High_sales_ind sales_next4
1 1 5 0 12 1 27
2 2 5 0 6 0 25
3 3 5 0 5 0 29
4 4 5 0 4 0 34
5 5 5 0 10 1 42
6 6 5 0 10 1 46
7 7 5 0 10 1 39
8 8 5 0 12 1 31
9 9 5 0 14 1 19
10 10 7 2 3 0 5
11 11 7 0 2 0 2
Or you could do this as part of a dplyr mutate() pipeline:
mutate(df, sales_next4 = Reduce(`+`, data.table::shift(c(sales,0,0,0),0:-3))[1:nrow(df)])
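For comparison, the same forward-looking window can be sketched in base R with no extra packages. The helper name forward_sum is hypothetical (it is not from any of the answers above), and the sketch assumes one row per consecutive day, as in the example data.

```r
# Example data copied from the question (only the columns needed here).
df <- data.frame(
  day   = 1:11,
  sales = c(12, 6, 5, 4, 10, 10, 10, 12, 14, 3, 2)
)

# Hypothetical helper: for each position i, sum x[i] and the next
# (width - 1) values, truncating the window at the end of the vector.
forward_sum <- function(x, width = 4) {
  n <- length(x)
  vapply(
    seq_len(n),
    function(i) sum(x[i:min(i + width - 1, n)]),
    numeric(1)
  )
}

df$sales_next4 <- forward_sum(df$sales)
print(df)
```

Truncating the window at the end plays the same role as partial = TRUE in rollapply: the last rows sum whatever days remain instead of returning NA.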
I have the following data frame:
> resident
X LOS Age Meds MHealth DietRest ReligAff NmChores Employed EdLevel Courses
1 R1 27 35 2 1 3 2 2 0 2 1
2 R2 56 43 0 0 0 1 3 1 3 2
3 R3 101 41 1 1 0 0 2 2 2 3
4 R4 19 54 3 2 4 3 1 0 1 0
5 R5 34 29 0 0 0 2 3 0 2 1
6 R6 78 46 2 0 2 1 2 1 3 2
7 R7 134 51 3 2 4 0 1 1 3 2
8 R8 112 38 0 1 1 4 2 1 2 3
9 R9 83 61 3 1 3 2 2 0 4 3
10 R10 9 50 2 0 2 1 1 2 2 0
11 R11 67 23 0 1 0 0 2 0 3 1
12 R12 30 47 2 2 0 3 2 0 4 0
13 R13 95 65 4 1 4 2 2 0 3 2
14 R14 165 63 5 2 4 1 1 0 2 2
15 R15 29 40 0 1 0 0 3 2 5 0
16 R16 44 33 2 2 1 0 2 0 3 1
17 R17 36 48 2 1 0 3 2 0 1 1
18 R18 58 57 3 0 2 1 1 1 2 1
19 R19 116 39 0 1 0 2 2 1 3 1
20 R20 73 44 1 0 0 2 1 0 4 2
21 R21 79 30 3 2 3 3 1 0 2 1
22 R22 39 41 0 0 0 0 3 2 2 2
23 R23 18 50 2 1 2 1 1 1 3 0
24 R24 60 35 1 0 0 0 2 1 4 2
25 R25 106 48 3 2 3 2 2 0 2 2
26 R26 46 31 2 1 0 0 1 1 3 1
27 R27 52 59 2 0 1 1 3 2 2 1
28 R28 28 62 6 0 4 2 1 0 5 1
29 R29 79 45 4 2 3 3 2 1 3 2
30 R30 24 42 1 1 1 0 1 0 2 1
31 R31 123 36 3 1 0 2 2 1 3 4
32 R32 11 49 2 0 2 1 2 0 1 0
33 R33 95 26 1 1 0 1 3 0 3 4
34 R34 61 24 0 0 0 2 2 1 2 1
35 R35 88 63 2 1 0 1 1 1 4 2
36 R36 64 38 1 2 1 4 1 1 2 3
37 R37 99 40 2 0 0 1 3 2 4 1
>
LOS = length of stay
I am trying to go through the data frame and create a new column that contains either a zero or a one, depending on whether the resident is completing an average of one course every thirty days. How would I go about doing this? I understand I would need to break things down so that someone who has been there between thirty and fifty-nine days and has completed at least one course receives a value of one; someone who has been there between sixty and eighty-nine days and has finished at least two courses also receives a one, and so forth; everyone else receives a zero. How would I create a function that does this and adds a value of either 1 or 0 to a new vector based on the data for each resident?
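One way to read the rule described above is "Courses must be at least the number of complete thirty-day periods in LOS", which integer division expresses directly. The following is a minimal sketch under that reading; the column name OnTrack is hypothetical, and the treatment of residents with fewer than thirty days (requirement of zero courses, so they get a 1) is an assumption the question does not settle.

```r
# A few rows copied from the resident data frame in the question.
resident <- data.frame(
  X       = c("R1", "R2", "R3", "R4", "R5", "R7", "R10", "R12"),
  LOS     = c(27, 56, 101, 19, 34, 134, 9, 30),
  Courses = c(1, 2, 3, 0, 1, 2, 0, 0)
)

# Number of complete 30-day periods: 0 for 0-29 days, 1 for 30-59, 2 for 60-89, ...
required <- resident$LOS %/% 30

# 1 if the resident has completed at least that many courses, else 0.
# Assumption: stays under 30 days require 0 courses and therefore score 1.
resident$OnTrack <- as.integer(resident$Courses >= required)

print(resident)
```

If short stays should instead be scored 0 or left missing, the last assignment could be wrapped in ifelse(resident$LOS < 30, NA, ...) or similar.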
How can I convert the following tibble to the final result posted below using dplyr?
> group_by(hth, team) %>% arrange(team)
Source: local data frame [26 x 14]
Groups: team [13]
team CSK DC DD GL KKR KTK KXIP MI PW RCB RPSG
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 CSK 0 8 11 0 11 2 9 10 4 10 0
2 CSK 0 2 5 0 5 0 8 12 2 9 0
3 DC 2 0 8 0 2 1 7 5 3 8 0
4 DC 8 0 3 0 7 0 3 5 1 3 0
5 DD 5 3 0 0 7 2 8 5 2 10 2
6 DD 11 8 0 2 10 0 10 13 4 7 0
7 GL 0 0 2 0 0 0 0 0 0 1 0
8 GL 0 0 0 0 2 0 2 2 0 2 2
9 KKR 5 7 10 2 0 0 5 10 3 15 0
10 KKR 11 2 7 0 0 2 14 8 2 3 2
# ... with 16 more rows, and 2 more variables: RR <dbl>, SH <dbl>
>
I used plyr's ddply function and was able to achieve the result.
> ddply(hth, .(team), function(x) colSums(x[,-1], na.rm = TRUE))
team CSK DC DD GL KKR KTK KXIP MI PW RCB RPSG RR SH
1 CSK 0 10 16 0 16 2 17 22 6 19 0 17 6
2 DC 10 0 11 0 9 1 10 10 4 11 0 9 0
3 DD 16 11 0 2 17 2 18 18 6 17 2 16 8
4 GL 0 0 2 0 2 0 2 2 0 3 2 0 3
5 KKR 16 9 17 2 0 2 19 18 5 18 2 15 9
6 KTK 2 1 2 0 2 0 1 1 1 2 0 2 0
7 KXIP 17 10 18 2 19 1 0 18 6 18 2 15 8
8 MI 22 10 18 2 18 1 18 0 6 19 2 16 8
9 PW 6 4 6 0 5 1 6 6 0 5 0 5 2
10 RCB 19 11 17 3 18 2 18 19 5 0 2 16 9
11 RPSG 0 0 2 2 2 0 2 2 0 2 0 0 2
12 RR 17 9 16 0 15 2 15 16 5 16 0 0 7
13 SH 6 0 8 3 9 0 8 8 2 9 2 7 0
>
How to achieve the same using just dplyr functions?
It looks like you are grouping by team and summing the remaining columns. In dplyr:
library(dplyr)
hth %>%
group_by(team) %>%
# funs() is deprecated in recent dplyr versions; pass the function directly
summarise_all(sum, na.rm = TRUE)
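In dplyr 1.0.0 and later, the same grouped column sums are usually written with across(). The sketch below uses a toy stand-in for hth built from the first four rows and two columns of the printed tibble; the name hth_toy is made up for illustration.

```r
library(dplyr)

# Toy stand-in for `hth`: first four rows, columns CSK and DC only.
hth_toy <- tibble(
  team = c("CSK", "CSK", "DC", "DC"),
  CSK  = c(0, 0, 2, 8),
  DC   = c(8, 2, 0, 0)
)

# across(everything()) covers all non-grouping columns, so `team` is left alone.
totals <- hth_toy %>%
  group_by(team) %>%
  summarise(across(everything(), ~ sum(.x, na.rm = TRUE)))

print(totals)
```

The two totals of 10 agree with the corresponding cells in the ddply output shown in the question.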
I need to create 2 columns: PRETARGET and TARGET based on several conditions.
To create PRETARGET, for each row of my data (for each participant PPT and trial TRIAL) I need to check that the CURRENT_ID is associated with a value of 0 in the column CanBePretarget, and that the CURRENT_ID in the following row equals CURRENT_ID + 1. If both conditions are fulfilled, I would like a value of 0; if not, a value of 1.
To create TARGET, for each row of my data (for each participant PPT and trial TRIAL) I need to check that the CURRENT_ID is associated with a value of 0 in the column CanBeTarget, and that the CURRENT_ID in the previous row equals CURRENT_ID - 1. If both conditions are fulfilled, I would like a value of 0; if not, a value of 1.
In addition, if the result in PRETARGET is 1, then the value of the next row in TARGET should also be 1.
I have added the desired output in the following example.
I was thinking of using for loops and ifelse statements, but I am not sure how to refer to the following or previous row of each observation.
PPT TRIAL PREVIOUS_ID CURRENT_ID NEXT_ID CURRENT_INDEX CanBePretarget CanBeTarget PRETARGET TARGET
ppt01 11 2 3 4 3 0 0 0 1
ppt01 11 3 4 3 4 1 0 1 0
ppt01 11 4 5 6 8 0 0 1 1
ppt01 11 6 7 8 10 0 0 1 1
ppt01 11 7 10 11 18 0 1 0 1
ppt01 11 10 11 12 19 0 0 0 0
ppt01 11 11 12 14 20 1 0 1 0
ppt01 12 1 2 1 2 1 0 1 1
ppt01 12 2 3 4 5 0 0 1 1
ppt01 12 5 6 6 8 0 0 0 1
ppt01 12 6 7 7 10 0 0 0 0
ppt01 12 7 8 9 12 0 0 0 0
ppt01 12 8 9 9 13 0 0 0 0
ppt01 12 9 10 11 16 0 0 0 0
ppt01 12 10 11 11 17 0 0 0 0
ppt01 13 1 2 2 2 1 0 1 1
ppt01 13 3 3 3 10 0 0 1 1
ppt01 13 4 5 6 13 0 0 0 1
ppt01 13 5 6 7 14 0 0 1 0
ppt01 13 9 9 10 19 0 0 0 1
ppt01 13 9 10 10 20 0 0 0 0
ppt01 13 10 11 12 22 0 0 0 0
ppt01 13 11 12 12 23 0 0 1 0
ppt01 14 10 11 11 15 0 0 0 1
ppt01 14 11 12 12 17 0 0 1 0
This can be achieved with lead() and lag() from dplyr:
library(dplyr)
df.new <- df %>%
mutate(PRETARGET1 = abs(as.numeric(CanBePretarget == 0 & lead(CURRENT_ID, default = 0) == (CURRENT_ID + 1)) - 1)) %>%
group_by(PPT, TRIAL) %>%
mutate(TARGET1 = abs(as.numeric((CanBeTarget == 0 & lag(CURRENT_ID, default = 0) == (CURRENT_ID - 1)) ) -1),
TARGET1 = ifelse(lag(PRETARGET1, default = 0) == 1, 1, TARGET1))
To compare to your results, I created PRETARGET1 and TARGET1.
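Since the question asks how to look at the following or previous row, here is a small sketch of what lead() and lag() return inside a grouped mutate(). The toy data and the column names next_id, prev_id, and next_is_consecutive are made up for illustration; they are not part of the answer above.

```r
library(dplyr)

# Toy data: two trials with a handful of fixations each.
toy <- data.frame(
  TRIAL      = c(1, 1, 1, 2, 2),
  CURRENT_ID = c(3, 4, 6, 2, 3)
)

res <- toy %>%
  group_by(TRIAL) %>%
  mutate(
    next_id = lead(CURRENT_ID),  # value in the following row, NA at the end of a trial
    prev_id = lag(CURRENT_ID),   # value in the previous row, NA at the start of a trial
    next_is_consecutive = !is.na(next_id) & next_id == CURRENT_ID + 1
  ) %>%
  ungroup()

print(res)
```

Because the data are grouped by TRIAL, the last row of one trial never sees the first row of the next; without group_by() the lookup would cross trial boundaries.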
I have the following type of data:
mydata <- data.frame (yvar = rnorm(200, 15, 5), xv1 = rep(1:5, each = 40),
xv2 = rep(1:10, 20))
table(mydata$xv1, mydata$xv2)
1 2 3 4 5 6 7 8 9 10
1 4 4 4 4 4 4 4 4 4 4
2 4 4 4 4 4 4 4 4 4 4
3 4 4 4 4 4 4 4 4 4 4
4 4 4 4 4 4 4 4 4 4 4
5 4 4 4 4 4 4 4 4 4 4
I want to tabulate again, this time separately for each yvar category. The following is the cutkey.
cutkey :
< 10 - group 1
10-12 - group 2
12-16 - group 3
>16 - group 4
Thus we will have a table like the one above for each cutkey element. I want to have margin sums every time.
< 10 - group 1
1 2 3 4 5 6 7 8 9 10
1 4 4 4 4 4 4 4 4 4 4
2 4 4 4 4 4 4 4 4 4 4
3 4 4 4 4 4 4 4 4 4 4
4 4 4 4 4 4 4 4 4 4 4
5 4 4 4 4 4 4 4 4 4 4
10-12 - group 2
1 2 3 4 5 6 7 8 9 10
1 4 4 4 4 4 4 4 4 4 4
2 4 4 4 4 4 4 4 4 4 4
3 4 4 4 4 4 4 4 4 4 4
4 4 4 4 4 4 4 4 4 4 4
5 4 4 4 4 4 4 4 4 4 4
and so on for all groups
(the numbers will definitely be different)
Is there an easy way to do it?
Yes, using cut, dlply (plyr package) and addmargins:
library(plyr)
mydata$yvar1 <- cut(mydata$yvar,breaks = c(-Inf,10,12,16,Inf))
> dlply(mydata,.(yvar1),function(x) addmargins(table(x$xv1,x$xv2)))
$`(-Inf,10]`
1 2 3 4 5 6 7 8 9 10 Sum
1 0 0 0 0 0 0 2 0 1 0 3
2 1 1 0 1 0 0 0 0 2 0 5
3 0 1 0 0 1 1 0 2 0 0 5
4 0 0 2 0 1 1 0 1 0 0 5
5 0 1 1 0 1 1 1 0 0 2 7
Sum 1 3 3 1 3 3 3 3 3 2 25
$`(10,12]`
1 2 3 4 6 7 8 9 10 Sum
1 0 0 0 1 2 0 0 0 0 3
2 0 0 1 0 0 1 0 0 1 3
3 0 1 0 1 1 2 0 0 1 6
4 0 1 0 0 0 0 0 0 0 1
5 1 0 1 1 1 0 1 1 2 8
Sum 1 2 2 3 4 3 1 1 4 21
$`(12,16]`
1 2 3 4 5 6 7 8 9 10 Sum
1 2 3 1 1 1 2 0 3 0 2 15
2 0 1 0 1 3 3 2 0 0 1 11
3 3 1 3 1 0 0 0 2 4 1 15
4 3 2 1 2 2 0 1 1 4 1 17
5 3 1 1 2 0 1 1 1 1 0 11
Sum 11 8 6 7 6 6 4 7 9 5 69
$`(16, Inf]`
1 2 3 4 5 6 7 8 9 10 Sum
1 2 1 3 2 3 0 2 1 3 2 19
2 3 2 3 2 1 1 1 4 2 2 21
3 1 1 1 2 3 2 2 0 0 2 14
4 1 1 1 2 1 3 3 2 0 3 17
5 0 2 1 1 3 1 2 2 2 0 14
Sum 7 7 9 9 11 7 10 9 7 9 85
attr(,"split_type")
[1] "data.frame"
attr(,"split_labels")
yvar1
1 (-Inf,10]
2 (10,12]
3 (12,16]
4 (16, Inf]
You can adjust the breaks argument to cut to get the values just how you want them. (Although the margin sums you display in your question don't look like margin sums at all.)
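The same per-category tables can also be sketched in base R with split() and lapply(), without plyr. Two things in this sketch are assumptions beyond the answer above: the seed is arbitrary, and xv1 and xv2 are converted to factors with fixed levels so that every table keeps all 5 rows and 10 columns even when a value does not occur in a category (as happens with column 5 in the (10,12] table above).

```r
set.seed(42)  # arbitrary seed so the sketch is reproducible
mydata <- data.frame(
  yvar = rnorm(200, 15, 5),
  xv1  = rep(1:5, each = 40),
  xv2  = rep(1:10, 20)
)

# Same breaks as in the answer; intervals are right-closed by default,
# so a value of exactly 10 falls in the first group.
mydata$yvar1 <- cut(mydata$yvar, breaks = c(-Inf, 10, 12, 16, Inf))

# One table per yvar category, each with row and column margin sums.
tabs <- lapply(split(mydata, mydata$yvar1), function(d) {
  addmargins(table(
    factor(d$xv1, levels = 1:5),
    factor(d$xv2, levels = 1:10)
  ))
})

print(tabs)
```

Passing right = FALSE to cut() would make the intervals left-closed instead, which may be closer to the "< 10" wording of the cutkey.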