I need to find the first two times my df meets a certain condition, grouped by two variables. I am trying to use the ddply function, but I am doing something wrong with the ".variables" argument.
So in this example, I'm trying to find the first two times x > 30 and y > 30 in each group / trial.
The way I'm using ddply is giving me the first two times in the dataset, then repeating that for every group.
set.seed(1)
df <- data.frame((matrix(nrow=200,ncol=5)))
colnames(df) <- c("group","trial","x","y","hour")
df$group <- rep(c("A","B","C","D"),each=50)
df$trial <- rep(c(rep(1,times=25),rep(2,times=25)),times=4)
df[,3:4] <- runif(400,0,50)
df$hour <- rep(1:25,times=8)
library(plyr)
ddply(.data=df, .variables=c("group","trial"), .fun=function(x) {
i <- which(df$x > 30 & df$y >30 )[1:2]
if (!is.na(i)) x[i, ]
})
Expected results:
group trial x y hour
13 A 1 34.3511423 38.161134 13
15 A 1 38.4920710 40.931734 15
36 A 2 33.4233369 34.481392 11
37 A 2 39.7119930 34.470671 12
52 B 1 43.0604738 46.645491 2
65 B 1 32.5435234 35.123126 15
But instead, my code is finding c(1,4) from the first group*trial and repeating that for every group*trial:
group trial x y hour
1 A 1 34.351142 38.161134 13
2 A 1 38.492071 40.931734 15
3 A 2 5.397181 27.745031 13
4 A 2 20.563721 22.636003 15
5 B 1 22.953286 13.898301 13
6 B 1 32.543523 35.123126 15
I would also like for there to be rows of NA if a second occurrence isn't present in a group*trial.
Thanks,
I think this is what you want:
library(tidyverse)
df %>% group_by(group, trial) %>% filter(x > 30 & y > 30) %>% slice(1:2)
Result:
# A tibble: 16 x 5
# Groups: group, trial [8]
group trial x y hour
<chr> <dbl> <dbl> <dbl> <int>
1 A 1 33.5 46.3 4
2 A 1 32.6 42.7 11
3 A 2 35.9 43.6 4
4 A 2 30.5 42.7 14
5 B 1 33.0 38.1 2
6 B 1 40.5 30.4 7
7 B 2 48.6 33.2 2
8 B 2 34.1 30.9 4
9 C 1 33.0 45.1 1
10 C 1 30.3 36.7 17
11 C 2 44.8 33.9 1
12 C 2 41.5 35.6 6
13 D 1 44.2 34.3 12
14 D 1 39.1 40.0 23
15 D 2 39.4 47.5 4
16 D 2 42.1 40.1 10
(slightly different from your results, probably a different R version)
I recommend using dplyr or data.table rather than plyr. From the plyr GitHub page:
plyr is retired: this means only changes necessary to keep it on CRAN
will be made. We recommend using dplyr (for data frames) or purrr (for
lists) instead.
Since someone has already provided a solution with dplyr, here is one option with data.table.
In the call df[i, j, by], I select the rows that match your criteria in i, group by the given variables in by, and take the first two rows (head) of each group-specific subset of the data, .SD. Everything inside the brackets is data.table-specific and only works because I first converted df to a data.table with setDT.
library(data.table)
setDT(df)
df[x > 30 & y > 30, head(.SD, 2), by = .(group, trial)]
# group trial x y hour
# 1: A 1 34.35114 38.16113 13
# 2: A 1 38.49207 40.93173 15
# 3: A 2 33.42334 34.48139 11
# 4: A 2 39.71199 34.47067 12
# 5: B 1 43.06047 46.64549 2
# 6: B 1 32.54352 35.12313 15
# 7: B 2 48.03090 38.53685 5
# 8: B 2 32.11441 49.07817 18
# 9: C 1 32.73620 33.68561 1
# 10: C 1 32.00505 31.23571 20
# 11: C 2 32.13977 40.60658 9
# 12: C 2 34.13940 49.47499 16
# 13: D 1 36.18630 34.94123 19
# 14: D 1 42.80658 46.42416 23
# 15: D 2 37.05393 43.24038 3
# 16: D 2 44.32255 32.80812 8
For a solution that is closer to what you've tried so far, we can do the following:
ddply(.data=df, .variables=c("group","trial"), .fun=function(df_temp) {
i <- which(df_temp$x > 30 & df_temp$y >30 )[1:2]
df_temp[i, ]
})
Some explanation
One problem with the code that you provided is that you used df inside of ddply. You defined .fun = function(x), but you looked for cases of x > 30 & y > 30 in df rather than in the function argument x. Further, your code uses i to index x, but i was computed from df. Finally, to my understanding there is no need for if (!is.na(i)) x[i, ]. If only one row meets your condition, you will get a row of NAs anyway, because you use which(df_temp$x > 30 & df_temp$y > 30)[1:2].
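The scoping issue can be reproduced without plyr. The following is a minimal base R sketch on made-up data (toy, g, v and piece are hypothetical names, not from the question): the function handed to the split-apply step has to test the piece it receives, not the full data frame.

```r
# Hypothetical toy data: two groups with two rows each
toy <- data.frame(g = c("A", "A", "B", "B"), v = c(1, 40, 50, 2))

# Buggy: the condition is evaluated on the full `toy`, so every group
# gets the same index (the first match in the whole data set)
buggy <- sapply(split(toy, toy$g), function(piece) which(toy$v > 30)[1])

# Fixed: the condition is evaluated on the piece handed to the function
fixed <- sapply(split(toy, toy$g), function(piece) which(piece$v > 30)[1])

buggy  # same index for A and B
fixed  # index relative to each group's own rows
```

In the buggy version the index is also relative to the full data, so using it to subset a single group picks unrelated rows, which is what the question's output shows.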
Using dplyr, you can also do:
df %>%
group_by(group, trial) %>%
slice(which(x > 30 & y > 30)[1:2])
group trial x y hour
<chr> <dbl> <dbl> <dbl> <int>
1 A 1 34.4 38.2 13
2 A 1 38.5 40.9 15
3 A 2 33.4 34.5 11
4 A 2 39.7 34.5 12
5 B 1 43.1 46.6 2
6 B 1 32.5 35.1 15
7 B 2 48.0 38.5 5
8 B 2 32.1 49.1 18
Since everything else is covered, here is a base R version using split:
output <- do.call(rbind, lapply(split(df, list(df$group, df$trial)),
function(new_df) new_df[with(new_df, head(which(x > 30 & y > 30), 2)), ]
))
rownames(output) <- NULL
output
# group trial x y hour
#1 A 1 34.351 38.161 13
#2 A 1 38.492 40.932 15
#3 B 1 43.060 46.645 2
#4 B 1 32.544 35.123 15
#5 C 1 32.736 33.686 1
#6 C 1 32.005 31.236 20
#7 D 1 36.186 34.941 19
#8 D 1 42.807 46.424 23
#9 A 2 33.423 34.481 11
#10 A 2 39.712 34.471 12
#11 B 2 48.031 38.537 5
#12 B 2 32.114 49.078 18
#13 C 2 32.140 40.607 9
#14 C 2 34.139 49.475 16
#15 D 2 37.054 43.240 3
#16 D 2 44.323 32.808 8
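The question also asks for rows of NA when a group has fewer than two matches. head(which(...), 2) simply returns fewer rows in that case, whereas which(...)[1:2] pads the index with NA. A small sketch on made-up data, assuming that padding is what is wanted (toy and first_two are hypothetical names):

```r
# Hypothetical toy data: group B has only one row with x > 30 and y > 30
toy <- data.frame(group = c("A", "A", "A", "B", "B"),
                  x = c(31, 10, 40, 35, 5),
                  y = c(32, 50, 45, 33, 40))

# Indexing a data frame with an NA row index yields a row filled with NA
first_two <- function(d) d[which(d$x > 30 & d$y > 30)[1:2], ]

padded <- do.call(rbind, lapply(split(toy, toy$group), first_two))
rownames(padded) <- NULL
padded
```

Note that the padded row is NA in every column, including group, so the group label would have to be filled in separately if it is needed.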
Related
I have a data frame containing the values of weight. I have to create a new column, the percentage change of weight, where the denominator takes the value of every third row.
df <- data.frame(weight = c(30,30,109,30,309,10,20,20,14))
# expected output
change_of_weight = c(30/109, 30/109, 109/109, 30/10,309/10,10/10,20/14,20/14,14/14)
Subset the weight column where its position %% 3 is zero and repeat each value three times.
df <- transform(df, change_of_weight=weight / rep(weight[1:nrow(df) %% 3 == 0], each=3))
df
weight change_of_weight
1 30 0.2752294
2 30 0.2752294
3 109 1.0000000
4 30 3.0000000
5 309 30.9000000
6 10 1.0000000
7 20 1.4285714
8 20 1.4285714
9 14 1.0000000
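Broken into steps, the one-liner above can be sketched as follows. It assumes the number of rows is a multiple of 3; is_third and denom are illustrative names.

```r
weight <- c(30, 30, 109, 30, 309, 10, 20, 20, 14)

# TRUE at positions 3, 6, 9: the rows that supply the denominator
is_third <- seq_along(weight) %% 3 == 0

# Repeat each denominator for the three rows of its block
denom <- rep(weight[is_third], each = 3)

change_of_weight <- weight / denom
change_of_weight
```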
You can create a group of every 3 rows and divide weight column by the last value in the group.
df$change <- with(df, ave(df$weight, ceiling(seq_len(nrow(df))/3),
FUN = function(x) x/x[length(x)]))
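One property of the grouping approach is that it does not require the row count to be a multiple of 3. A short sketch with a deliberately shortened vector (8 values, an assumption made only for illustration) shows the block index and what happens to the incomplete last block:

```r
weight <- c(30, 30, 109, 30, 309, 10, 20, 20)  # last block has only 2 rows

# Block index: 1 1 1 2 2 2 3 3
grp <- ceiling(seq_along(weight) / 3)

# Divide by the last value of each block, whatever the block's length
change <- ave(weight, grp, FUN = function(x) x / x[length(x)])
change
```

Whether dividing an incomplete block by its own last value is the desired behavior depends on the data, so treat this as something to check rather than a given.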
Or using dplyr:
library(dplyr)
df %>%
group_by(grp = ceiling(row_number()/3)) %>%
mutate(change = weight/last(weight))
# weight grp change
# <dbl> <dbl> <dbl>
#1 30 1 0.275
#2 30 1 0.275
#3 109 1 1
#4 30 2 3
#5 309 2 30.9
#6 10 2 1
#7 20 3 1.43
#8 20 3 1.43
#9 14 3 1
We can also use gl to create a grouping column
library(dplyr)
df %>%
group_by(grp = as.integer(gl(n(), 3, n()))) %>%
mutate(change = weight/last(weight))
# A tibble: 9 x 3
# Groups: grp [3]
# weight grp change
# <dbl> <int> <dbl>
#1 30 1 0.275
#2 30 1 0.275
#3 109 1 1
#4 30 2 3
#5 309 2 30.9
#6 10 2 1
#7 20 3 1.43
#8 20 3 1.43
#9 14 3 1
Or using data.table
library(data.table)
setDT(df)[, change := weight/last(weight), .(as.integer(gl(nrow(df), 3, nrow(df))))]
I have a huge table where there is information of 2 professionals in each line that goes like this:
df1 <- data.frame("Date" = c(1,2,3,4), "prof1" = c(25,59,10,5), "prof2" = c(5,7,8,25))
#   Date    prof1   prof2
#1 1 25 5
#2 2 59 7
#3 3 10 8
#4 4 5 25
... ... ...
I want to delete line 4 because it's the same as line 1, just with the values swapped.
So I created a copy of that table with the values of columns B and C switched, like this:
df2 <- data.frame("Date" = c(1,2,3,4), "prof2" = c(5,7,8,25), "prof1" = c(25,59,10,5))
# Date prof2 prof1
#1 1 5 25
#2 2 7 59
#3 3 8 10
#4 4 25 5
... ... ...
And executed the code:
df1<- df1[!do.call(paste, df1[2:3]) %in% do.call(paste, df2[2:3]), ]
But it ended up deleting line 1 as well, giving me this table:
# Date prof2 prof1
#2 2 7 59
#3 3 8 10
... ... ...
when what I wanted was this:
# Date prof2 prof1
#1 1 5 25
#2 2 7 59
#3 3 8 10
... ... ...
How can I delete only one of the lines that are similar to another?
If you don't care which of the duplicates you keep, you can make sure that prof2 >= prof1 in every row and then remove duplicates.
SWAP = which(df2$prof2 < df2$prof1)
temp = df2$prof2
df2$prof2[SWAP] = df2$prof1[SWAP]
df2$prof1[SWAP] = temp[SWAP]
df2 = df2[!duplicated(df2[,2:3]), ]
df2
Date prof2 prof1
1 1 25 5
2 2 59 7
3 3 10 8
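For reference, the paste/%in% attempt from the question removes both rows because the match is symmetric: if row 1 has a swapped twin in row 4, then row 4 has a swapped twin in row 1. A minimal sketch using the question's data (key1 and key2 are illustrative names):

```r
df1 <- data.frame(Date = 1:4, prof1 = c(25, 59, 10, 5), prof2 = c(5, 7, 8, 25))
df2 <- df1[c("Date", "prof2", "prof1")]  # same rows, value columns swapped

# One string key per row, built from the two value columns
key1 <- do.call(paste, df1[2:3])
key2 <- do.call(paste, df2[2:3])

# Rows 1 and 4 both find their mirror image, so both are flagged
flagged <- key1 %in% key2
flagged
```

This is why the approaches here normalize the order within each pair first and then use duplicated, which keeps the first occurrence.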
We can use apply to loop over the rows of the dataset, sort each row, transpose the result, and apply duplicated to it to get a logical vector for subsetting:
df1[!duplicated(t(apply(df1[-1], 1, sort))),]
# Date prof1 prof2
#1 1 25 5
#2 2 59 7
#3 3 10 8
Or another option is pmin/pmax
subset(df1, !duplicated(cbind(pmin(prof1, prof2), pmax(prof1, prof2))))
# Date prof1 prof2
#1 1 25 5
#2 2 59 7
#3 3 10 8
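The pmin/pmax idea can be unpacked into its intermediate steps. In this sketch (key and is_dup are illustrative names) each pair is rewritten as (smaller, larger), so a swapped pair produces the same key as the original:

```r
df1 <- data.frame(Date = 1:4, prof1 = c(25, 59, 10, 5), prof2 = c(5, 7, 8, 25))

# Order-independent key: element-wise minimum and maximum of the two columns
key <- cbind(lo = pmin(df1$prof1, df1$prof2),
             hi = pmax(df1$prof1, df1$prof2))

# duplicated() on a matrix works row-wise and flags later repeats only
is_dup <- duplicated(key)

df1[!is_dup, ]
```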
Or using filter from dplyr
library(dplyr)
df1 %>%
filter( !duplicated(cbind(pmin(prof1, prof2), pmax(prof1, prof2))))
We have a data frame with one column for a category and one column for discrete values. We want to get all possible intersections (number of common values) for all combinations of categories.
I came up with the following code. However, is there something shorter out there? I am sure there is a better way of doing this, a specialized function that does exactly this. The code below can be shortened, of course, for example with purrr::map, but that is not my question.
## prepare an example data set
df <- data.frame(category=rep(LETTERS[1:5], each=20),
value=sample(letters[1:10], 100, replace=T))
cats <- unique(df$category)
n <- length(cats)
## all combinations of 1...n unique elements from category
combinations <- lapply(1:n, function(i) combn(cats, i, simplify=FALSE))
combinations <- unlist(combinations, recursive=FALSE)
names(combinations) <- sapply(combinations, paste0, collapse="")
## for each combination of categories, get the values which belong
## to this category
intersections <- lapply(combinations,
function(co)
lapply(co, function(.x) df$value[ df$category == .x ]))
intersections <- lapply(intersections,
function(.x) Reduce(intersect, .x))
intersections <- sapply(intersections, length)
This brings us to my desired outcome:
> intersections
A B C D E AB AC AD AE BC
20 20 20 20 20 10 8 8 9 8
BD BE CD CE DE ABC ABD ABE ACD ACE
8 9 7 8 8 8 8 9 7 8
ADE BCD BCE BDE CDE ABCD ABCE ABDE ACDE BCDE
8 7 8 8 7 7 8 8 7 7
ABCDE
7
Question: is there a way of achieving the same result with less fuss?
Here is a possible approach with data.table to cast the data.frame and model.matrix to count the higher-order interactions:
Cast to wide format by grouping all matching values between categories in the rows (credits to @chinsoon12 for the dcast syntax).
Identify all higher-order interactions with model.matrix and sum over the columns.
library(data.table)
df_wide <- dcast(setDT(df), value + rowid(category, value) ~ category, fun.aggregate = length, fill = 0)
head(df_wide)
#> value category A B C D E
#> 1: a 1 1 1 1 1 1
#> 2: a 2 1 0 0 1 1
#> 3: a 3 0 0 0 1 0
#> 4: b 1 1 1 1 0 1
#> 5: b 2 1 0 1 0 1
#> 6: c 1 1 1 1 1 1
colSums(model.matrix(~(A + B + C + D + E)^5, data = df_wide))[-1]
#> A B C D E A:B A:C
#> 20 20 20 20 20 13 11
#> A:D A:E B:C B:D B:E C:D C:E
#> 12 12 11 13 13 11 13
#> D:E A:B:C A:B:D A:B:E A:C:D A:C:E A:D:E
#> 10 8 9 9 7 9 7
#> B:C:D B:C:E B:D:E C:D:E A:B:C:D A:B:C:E A:B:D:E
#> 8 9 7 8 5 7 5
#> A:C:D:E B:C:D:E A:B:C:D:E
#> 5 6 4
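To see what the model.matrix step counts, here is a tiny base R sketch on a hand-made 0/1 table (wide is a hypothetical stand-in for df_wide, with three categories instead of five). Each interaction column is the product of its member columns, so its column sum is the number of rows in which all of those categories are present.

```r
# Hypothetical wide table: one row per matched value, 1 = category has it
wide <- data.frame(A = c(1, 1, 0),
                   B = c(1, 0, 1),
                   C = c(1, 1, 1))

# Main effects plus all interactions up to order 3
mm <- model.matrix(~ (A + B + C)^3, data = wide)

# Drop the intercept column and count rows per combination
counts <- colSums(mm)[-1]
counts
```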
Data
set.seed(1)
df <- data.frame(category=rep(LETTERS[1:5], each=20),
value=sample(letters[1:10], 100, replace=T))
Let's say that we have the following data frame:
x<- as.data.frame(cbind(c("A","A","A","B","B","B","B","B","C","C","C","C","C","D","D","D","D","D"),
c(1,2,3,1,2,3,4,5,1,2,3,4,5,1,2,3,4,5),
c(14,28,42,14,46,64,71,85,14,28,51,84,66,22,38,32,40,42)))
colnames(x)<- c("ID","Visit", "Age")
The first column is the subject ID, the second is the visit number, and the third is the age at each of these consecutive visits.
What would be the easiest way of finding visits where the age is inconsistent with the age at the previous visit? (E.g. in row 13, subject C is 66 years old although he was already 84 at the previous visit, and in row 16, subject D is 32 years old although he was already 38 at the previous visit.)
How can I highlight the potential errors and remove rows 13 and 16?
I have tried to aggregate by IDs and look for the difference between ages across visits, but it seems hard for me since the error could occur in any visit.
How about this in base R?
df <- do.call(rbind.data.frame, lapply(split(x, x$ID), function(w)
w[c(1, which(diff(w[order(w$Visit), "Age"]) > 0) + 1), ]));
df;
# ID Visit Age
#A.1 A 1 14
#A.2 A 2 28
#A.3 A 3 42
#B.4 B 1 14
#B.5 B 2 46
#B.6 B 3 64
#B.7 B 4 71
#B.8 B 5 85
#C.9 C 1 14
#C.10 C 2 28
#C.11 C 3 51
#C.12 C 4 84
#D.14 D 1 22
#D.15 D 2 38
#D.17 D 4 40
#D.18 D 5 42
Explanation: We split the dataframe on column ID, then order every dataframe subset by Visit, calculate differences between successive Age values, and only keep those rows where the difference is > 0 (i.e. Age is increasing); rbinding gives the final dataframe.
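One caveat: diff needs a numeric Age, and as.data.frame(cbind(...)) with a character column produces character columns (factors before R 4.0). The sketch below rebuilds the question's data with data.frame so the types are kept, then flags the suspicious rows before removing them. It assumes rows are already sorted by Visit within each ID; step, suspect and cleaned are illustrative names.

```r
x <- data.frame(
  ID    = rep(c("A", "B", "C", "D"), times = c(3, 5, 5, 5)),
  Visit = c(1:3, 1:5, 1:5, 1:5),
  Age   = c(14, 28, 42, 14, 46, 64, 71, 85, 14, 28, 51, 84, 66,
            22, 38, 32, 40, 42)
)

# Change in Age relative to the previous visit of the same subject
# (0 for each subject's first visit)
step <- ave(x$Age, x$ID, FUN = function(a) c(0, diff(a)))

# Highlight rows where Age went down
x$suspect <- step < 0
which(x$suspect)

# Remove them
cleaned <- x[!x$suspect, c("ID", "Visit", "Age")]
```

Because ave returns values in the original row order, this does not depend on the rows being grouped by ID.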
You could do it by filtering out the rows where diff(Age) is negative for each ID.
Using the dplyr package:
library(dplyr)
x %>% group_by(ID) %>% filter(c(0,diff(Age))>=0)
# A tibble: 16 x 3
# Groups: ID [4]
ID Visit Age
<fctr> <fctr> <fctr>
1 A 1 14
2 A 2 28
3 A 3 42
4 B 1 14
5 B 2 46
6 B 3 64
7 B 4 71
8 B 5 85
9 C 1 14
10 C 2 28
11 C 3 51
12 C 4 84
13 D 1 22
14 D 2 38
15 D 4 40
16 D 5 42
The aggregate() approach is pretty concise.
Removing bad rows
good <- do.call(c, aggregate(Age ~ ID, x, function(z) c(z[1], diff(z)) > 0)$Age)
x[good,]
# ID Visit Age
# 1 A 1 14
# 2 A 2 28
# 3 A 3 42
# 4 B 1 14
# 5 B 2 46
# 6 B 3 64
# 7 B 4 71
# 8 B 5 85
# 9 C 1 14
# 10 C 2 28
# 11 C 3 51
# 12 C 4 84
# 14 D 1 22
# 15 D 2 38
# 17 D 4 40
# 18 D 5 42
This will only highlight which groups have an inconsistency:
aggregate(Age ~ ID, x, function(z) all(diff(z) > 0))
# ID Age
# 1 A TRUE
# 2 B TRUE
# 3 C FALSE
# 4 D FALSE
To manipulate/summarize data over time, I usually use SQL ROW_NUMBER() OVER(PARTITION by ...). I'm new to R, so I'm trying to recreate tables I otherwise would create in SQL. The package sqldf does not allow OVER clauses. Example table:
ID Day Person Cost
1 1 A 50
2 1 B 25
3 2 A 30
4 3 B 75
5 4 A 35
6 4 B 100
7 6 B 65
8 7 A 20
I want my final table to include the average of the previous 2 instances for each day after their 2nd instance (day 4 for both):
ID Day Person Cost Prev2
5 4 A 35 40
6 4 B 100 50
7 6 B 65 90
8 7 A 20 35
I've been trying to play around with aggregate, but I'm not really sure how to partition or qualify the function. Ideally, I'd prefer not to use the fact that id is sequential with the date to form my answer (i.e. original table could be rearranged with random date order and code would still work). Let me know if you need more details, thanks for your help!
You could lag zoo::rollapplyr with a width of 2. In dplyr,
library(dplyr)
df %>% arrange(Day) %>% # sort
group_by(Person) %>% # set grouping
mutate(Prev2 = lag(zoo::rollapplyr(Cost, width = 2, FUN = mean, fill = NA)))
#> Source: local data frame [8 x 5]
#> Groups: Person [2]
#>
#> ID Day Person Cost Prev2
#> <int> <int> <fctr> <int> <dbl>
#> 1 1 1 A 50 NA
#> 2 2 1 B 25 NA
#> 3 3 2 A 30 NA
#> 4 4 3 B 75 NA
#> 5 5 4 A 35 40.0
#> 6 6 4 B 100 50.0
#> 7 7 6 B 65 87.5
#> 8 8 7 A 20 32.5
or all in dplyr,
df %>% arrange(Day) %>% group_by(Person) %>% mutate(Prev2 = (lag(Cost) + lag(Cost, 2)) / 2)
which returns the same thing. In base,
df <- df[order(df$Day), ]
df$Prev2 <- ave(df$Cost, df$Person, FUN = function(x){
c(NA, zoo::rollapplyr(x, width = 2, FUN = mean, fill = NA)[-length(x)])
})
df
#> ID Day Person Cost Prev2
#> 1 1 1 A 50 NA
#> 2 2 1 B 25 NA
#> 3 3 2 A 30 NA
#> 4 4 3 B 75 NA
#> 5 5 4 A 35 40.0
#> 6 6 4 B 100 50.0
#> 7 7 6 B 65 87.5
#> 8 8 7 A 20 32.5
or without zoo,
df$Prev2 <- ave(df$Cost, df$Person, FUN = function(x){
(c(NA, x[-length(x)]) + c(NA, NA, x[-(length(x) - 1):-length(x)])) / 2
})
which does the same thing. If you want to remove the NA rows, tack on tidyr::drop_na(Prev2) or na.omit.
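For a self-contained check, the lag-based idea can be sketched in base R with a small helper. shift is a hypothetical name introduced here, and the data frame is retyped from the example table in the question.

```r
df <- data.frame(ID = 1:8,
                 Day = c(1, 1, 2, 3, 4, 4, 6, 7),
                 Person = c("A", "B", "A", "B", "A", "B", "B", "A"),
                 Cost = c(50, 25, 30, 75, 35, 100, 65, 20))

# Shift a vector down by k positions, padding the start with NA
# and keeping the original length
shift <- function(v, k) c(rep(NA, k), v)[seq_along(v)]

# Sort by Day so "previous" means previous in time, not previous by ID
df <- df[order(df$Day), ]

# Mean of the two preceding costs within each person
df$Prev2 <- ave(df$Cost, df$Person,
                FUN = function(v) (shift(v, 1) + shift(v, 2)) / 2)
df
```

Sorting on Day rather than relying on ID addresses the request that the code should still work if the table arrives in a different row order.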