Incrementing a rank counter under specific conditions in a faster way in R - r

The data frame I have looks like this.
The "rank" variable has to increase whenever the difference between the [i]th row of "start" and the [i-1]th row of "end" is over 14 (and also whenever a different "ID" is encountered).
I tried the code below and it works.
The problem is that it is far too slow, because I have over 700,000 rows.
Is there any way to make it run much faster?
df$rank <- 1
for(i in 2:nrow(df)){
df[i,"rank"] <- ifelse((df[i,"ID"]==df[i-1,"ID"])&
(df[i-1,"diff"]<=14),
df[i,"rank"] <- df[i-1,"rank"],
df[i,"rank"] <- df[i-1,"rank"] + 1)
}

You can try:
library(dplyr)
df %>% mutate(rank = cumsum(diff > 14 | ID != lag(ID, default = TRUE)))
The same logic using base R:
df$rank <- with(df, cumsum(diff > 14 | c(TRUE, tail(ID, -1) != head(ID, -1))))

You can use cumsum to get a rank that increases whenever the condition (df[i,"ID"]==df[i-1,"ID"]) & (df[i-1,"diff"]<=14) is not met.
df$rank <- cumsum(c(1,(df$ID != c(df$ID[-1], NA) | df$diff>14)[-nrow(df)]))
df
# ID diff rank
#1 a 4 1
#2 a 6 1
#3 a 8 1
#4 a 870 1
#5 a 34 2
#6 a NA 3
#7 b 4 4
#8 b 6 4
#9 b 8 4
#10 b 870 4
#11 b 34 5
#12 b NA 6
Using your code:
df$rank <- 1
for(i in 2:nrow(df)){
df[i,"rank"] <- ifelse((df[i,"ID"]==df[i-1,"ID"]) & (df[i-1,"diff"]<=14),
df[i,"rank"] <- df[i-1,"rank"], df[i,"rank"] <- df[i-1,"rank"] + 1)
}
df
# ID diff rank
#1 a 4 1
#2 a 6 1
#3 a 8 1
#4 a 870 1
#5 a 34 2
#6 a NA 3
#7 b 4 4
#8 b 6 4
#9 b 8 4
#10 b 870 4
#11 b 34 5
#12 b NA 6
Data:
df <- data.frame(ID=rep(c("a","b"), each=6), diff=c(4,6,8,870,34,NA)
, stringsAsFactors = FALSE)
df
# ID diff
#1 a 4
#2 a 6
#3 a 8
#4 a 870
#5 a 34
#6 a NA
#7 b 4
#8 b 6
#9 b 8
#10 b 870
#11 b 34
#12 b NA
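As a quick sanity check, the vectorized idea can be compared against a plain loop on the sample data. This is only a sketch: the names new_rank, rank_vec and rank_loop are made up for illustration, and treating a missing diff as a break is an assumption that the original loop does not make (it would produce NA there).

```r
# Sample data with the same shape as the example above
df <- data.frame(ID = rep(c("a", "b"), each = 6),
                 diff = rep(c(4, 6, 8, 870, 34, NA), 2),
                 stringsAsFactors = FALSE)
n <- nrow(df)

# Row i starts a new rank when its ID differs from row i-1
# or when the diff of row i-1 exceeds 14
new_rank <- (df$ID[-1] != df$ID[-n]) | (df$diff[-n] > 14)
new_rank[is.na(new_rank)] <- TRUE   # assumption: a missing diff counts as a break

# The first row always opens rank 1; cumsum counts the breaks seen so far
rank_vec <- cumsum(c(TRUE, new_rank))

# Reference implementation: explicit loop over plain vectors
rank_loop <- rep(1, n)
for (i in 2:n) {
  same <- df$ID[i] == df$ID[i - 1] && isTRUE(df$diff[i - 1] <= 14)
  rank_loop[i] <- rank_loop[i - 1] + !same
}

stopifnot(all(rank_vec == rank_loop))
rank_vec
# 1 1 1 1 2 3 4 4 4 4 5 6
```

Working on plain vectors instead of df[i, "rank"] also speeds up the loop itself, since each data frame assignment inside a loop is comparatively expensive.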

Here is a base R solution using ave + ifelse
df <- within(df,rank <- ave(diff>14, diff>14,ID,FUN = function(x) ifelse(x,seq(x),+!x)))


I want to delete redundant lines in my table in R

I have a huge table where there is information of 2 professionals in each line that goes like this:
df1 <- data.frame("Date" = c(1,2,3,4), "prof1" = c(25,59,10,5), "prof2" = c(5,7,8,25))
# Date prof1 prof2
#1 1 25 5
#2 2 59 7
#3 3 10 8
#4 4 5 25
... ... ...
I want to delete line 4 because it is the same as line 1, just with the two values swapped.
So I created a copy of that table with the values of the second and third columns switched, like this:
df2 <- data.frame("Date" = c(1,2,3,4), "prof2" = c(5,7,8,25), "prof1" = c(25,59,10,5))
# Date prof2 prof1
#1 1 5 25
#2 2 7 59
#3 3 8 10
#4 4 25 5
... ... ...
And executed the code:
df1<- df1[!do.call(paste, df1[2:3]) %in% do.call(paste, df2[2:3]), ]
But it ended up deleting line 1 as well, giving me this table:
# Date prof2 prof1
#2 2 7 59
#3 3 8 10
... ... ...
when what I wanted was this:
# Date prof2 prof1
#1 1 5 25
#2 2 7 59
#3 3 8 10
... ... ...
How can I delete only one of the lines that are similar to another?
If you don't care which one of the duplicates you keep, you can just make sure that prof2 >= prof1 in every row and then remove duplicates.
SWAP = which(df2$prof2 < df2$prof1)
temp = df2$prof2
df2$prof2[SWAP] = df2$prof1[SWAP]
df2$prof1[SWAP] = temp[SWAP]
df2 = df2[!duplicated(df2[,2:3]), ]
df2
Date prof2 prof1
1 1 25 5
2 2 59 7
3 3 10 8
We can do this with apply: loop over the rows of the dataset, sort each row, transpose the result, apply duplicated to it to get a logical vector, and use that vector to subset the rows.
df1[!duplicated(t(apply(df1[-1], 1, sort))),]
# Date prof1 prof2
#1 1 25 5
#2 2 59 7
#3 3 10 8
Or another option is pmin/pmax
subset(df1, !duplicated(cbind(pmin(prof1, prof2), pmax(prof1, prof2))))
# Date prof1 prof2
#1 1 25 5
#2 2 59 7
#3 3 10 8
Or using filter from dplyr
library(dplyr)
df1 %>%
filter( !duplicated(cbind(pmin(prof1, prof2), pmax(prof1, prof2))))
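To see why the pmin/pmax approach works, it may help to look at the intermediate key. The following is a minimal sketch on the question's df1; the names key, lo, hi and dup are chosen here purely for illustration.

```r
df1 <- data.frame(Date = 1:4,
                  prof1 = c(25, 59, 10, 5),
                  prof2 = c(5, 7, 8, 25))

# Order-independent key: smaller value first, larger value second,
# so (25, 5) and (5, 25) both become (5, 25)
key <- cbind(lo = pmin(df1$prof1, df1$prof2),
             hi = pmax(df1$prof1, df1$prof2))

# duplicated() on a matrix compares whole rows and flags
# every repeat after the first occurrence
dup <- duplicated(key)

result <- df1[!dup, ]
result
```

Only row 4 is flagged, because its key matches that of row 1; row 1 itself is kept as the first occurrence.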

find first occurrence in two variables in df

I need to find the first two times my df meets a certain condition grouped by two variables. I am trying to use the ddply function, but I am doing something wrong with the ".variables" command.
So in this example, I'm trying to find the first two times x > 30 and y > 30 in each group / trial.
The way I'm using ddply is giving me the first two times in the dataset, then repeating that for every group.
set.seed(1)
df <- data.frame((matrix(nrow=200,ncol=5)))
colnames(df) <- c("group","trial","x","y","hour")
df$group <- rep(c("A","B","C","D"),each=50)
df$trial <- rep(c(rep(1,times=25),rep(2,times=25)),times=4)
df[,3:4] <- runif(400,0,50)
df$hour <- rep(1:25,times=8)
library(plyr)
ddply(.data=df, .variables=c("group","trial"), .fun=function(x) {
i <- which(df$x > 30 & df$y >30 )[1:2]
if (!is.na(i)) x[i, ]
})
Expected results:
group trial x y hour
13 A 1 34.3511423 38.161134 13
15 A 1 38.4920710 40.931734 15
36 A 2 33.4233369 34.481392 11
37 A 2 39.7119930 34.470671 12
52 B 1 43.0604738 46.645491 2
65 B 1 32.5435234 35.123126 15
But instead, my code is finding c(1,4) from the first group*trial and repeating that for every group*trial:
group trial x y hour
1 A 1 34.351142 38.161134 13
2 A 1 38.492071 40.931734 15
3 A 2 5.397181 27.745031 13
4 A 2 20.563721 22.636003 15
5 B 1 22.953286 13.898301 13
6 B 1 32.543523 35.123126 15
I would also like for there to be rows of NA if a second occurrence isn't present in a group*trial.
Thanks,
I think this is what you want:
library(tidyverse)
df %>% group_by(group, trial) %>% filter(x > 30 & y > 30) %>% slice(1:2)
Result:
# A tibble: 16 x 5
# Groups: group, trial [8]
group trial x y hour
<chr> <dbl> <dbl> <dbl> <int>
1 A 1 33.5 46.3 4
2 A 1 32.6 42.7 11
3 A 2 35.9 43.6 4
4 A 2 30.5 42.7 14
5 B 1 33.0 38.1 2
6 B 1 40.5 30.4 7
7 B 2 48.6 33.2 2
8 B 2 34.1 30.9 4
9 C 1 33.0 45.1 1
10 C 1 30.3 36.7 17
11 C 2 44.8 33.9 1
12 C 2 41.5 35.6 6
13 D 1 44.2 34.3 12
14 D 1 39.1 40.0 23
15 D 2 39.4 47.5 4
16 D 2 42.1 40.1 10
(slightly different from your results, probably a different R version)
I recommend using dplyr or data.table rather than plyr. From the plyr GitHub page:
plyr is retired: this means only changes necessary to keep it on CRAN
will be made. We recommend using dplyr (for data frames) or purrr (for
lists) instead.
Since someone has already provided a solution with dplyr, here is one option with data.table.
In the selection df[i, j, k] I am selecting rows which match your criteria in i, grouping by the given variables in k, and selecting the first two rows (head) of each group-specific subset of the data .SD. All of this inside the brackets is data.table specific, and only works because I converted df to a data.table first with setDT.
library(data.table)
setDT(df)
df[x > 30 & y > 30, head(.SD, 2), by = .(group, trial)]
# group trial x y hour
# 1: A 1 34.35114 38.16113 13
# 2: A 1 38.49207 40.93173 15
# 3: A 2 33.42334 34.48139 11
# 4: A 2 39.71199 34.47067 12
# 5: B 1 43.06047 46.64549 2
# 6: B 1 32.54352 35.12313 15
# 7: B 2 48.03090 38.53685 5
# 8: B 2 32.11441 49.07817 18
# 9: C 1 32.73620 33.68561 1
# 10: C 1 32.00505 31.23571 20
# 11: C 2 32.13977 40.60658 9
# 12: C 2 34.13940 49.47499 16
# 13: D 1 36.18630 34.94123 19
# 14: D 1 42.80658 46.42416 23
# 15: D 2 37.05393 43.24038 3
# 16: D 2 44.32255 32.80812 8
To try a solution that is closer to what you've tried so far we can do the following
ddply(.data=df, .variables=c("group","trial"), .fun=function(df_temp) {
i <- which(df_temp$x > 30 & df_temp$y >30 )[1:2]
df_temp[i, ]
})
Some explanation
One problem with the code you provided is that you used df inside ddply. You defined .fun = function(x), but then you looked for cases of x > 30 & y > 30 in the full df rather than in the group subset x. As a result, i holds row positions from df but is used to index x. Finally, to my understanding there is no need for if (!is.na(i)) x[i, ]. If only one row meets your condition, you will get a row of NAs anyway, because you use which(df_temp$x > 30 & df_temp$y >30 )[1:2].
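The NA padding mentioned above can be checked on a toy vector. This is just an illustrative sketch; x, hits and d are made-up names and values, not part of the question's data.

```r
x <- c(10, 35, 20, 5)

# Only one element satisfies the condition
hits <- which(x > 30)

# Asking for two positions pads the missing one with NA
first_two <- hits[1:2]
first_two
# 2 NA

# Indexing a data frame with an NA row position yields a row of NAs
d <- data.frame(x = x, hour = 1:4)
picked <- d[first_two, ]
picked
```

This is why the ddply version returns a placeholder row of NAs when a group has fewer than two matches, whereas versions based on head() or on filter followed by slice simply return fewer rows.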
Using dplyr, you can also do:
df %>%
group_by(group, trial) %>%
slice(which(x > 30 & y > 30)[1:2])
group trial x y hour
<chr> <dbl> <dbl> <dbl> <int>
1 A 1 34.4 38.2 13
2 A 1 38.5 40.9 15
3 A 2 33.4 34.5 11
4 A 2 39.7 34.5 12
5 B 1 43.1 46.6 2
6 B 1 32.5 35.1 15
7 B 2 48.0 38.5 5
8 B 2 32.1 49.1 18
Since everything else is covered, here is a base R version using split:
output <- do.call(rbind, lapply(split(df, list(df$group, df$trial)),
function(new_df) new_df[with(new_df, head(which(x > 30 & y > 30), 2)), ]
))
rownames(output) <- NULL
output
# group trial x y hour
#1 A 1 34.351 38.161 13
#2 A 1 38.492 40.932 15
#3 B 1 43.060 46.645 2
#4 B 1 32.544 35.123 15
#5 C 1 32.736 33.686 1
#6 C 1 32.005 31.236 20
#7 D 1 36.186 34.941 19
#8 D 1 42.807 46.424 23
#9 A 2 33.423 34.481 11
#10 A 2 39.712 34.471 12
#11 B 2 48.031 38.537 5
#12 B 2 32.114 49.078 18
#13 C 2 32.140 40.607 9
#14 C 2 34.139 49.475 16
#15 D 2 37.054 43.240 3
#16 D 2 44.323 32.808 8

How to mutate filtered rows (using dplyr or if/else)

Similar questions have certainly been asked, but mine is much simpler and, unfortunately, I could not work out the answer from them, so here is my specific, probably simple case:
df <- data.frame("Sample" = 1:30,
"Individual" = c("a", "b", "c"),
"Repeat" = 1:3)
I would like to mutate the entry of Individual == "a" into "a_(number_of_repeat)", but only for individual a, not for b or c.
I tried:
df[df$Individual == "a", ] <-
df %>% filter(Individual == "a") %>%
df %>% mutate(Individual = paste0(Individual,"_",Repeat))
but without success. Maybe it could also be solved with an if/else or a for loop?
df$Individual <- for (df$Individual == "a") {
df %>% mutate(Individual = paste0(Individual,"_",Repeat))
}
...also a fail.
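What about something like this, with mutate and a classic ifelse? Note that if Individual is a factor, ifelse returns the integer level codes for the untouched rows (the 2 and 3 in the output below), so convert the column with as.character first if you want to keep b and c: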
library(dplyr)
df %>% mutate(Individual = ifelse(Individual=="a",
paste0(Individual,'_',Repeat),
Individual))
Sample Individual Repeat
1 1 a_1 1
2 2 2 2
3 3 3 3
4 4 a_1 1
5 5 2 2
6 6 3 3
7 7 a_1 1
8 8 2 2
9 9 3 3
10 10 a_1 1
11 11 2 2
12 12 3 3
13 13 a_1 1
14 14 2 2
15 15 3 3
16 16 a_1 1
17 17 2 2
18 18 3 3
19 19 a_1 1
20 20 2 2
21 21 3 3
22 22 a_1 1
23 23 2 2
24 24 3 3
25 25 a_1 1
26 26 2 2
27 27 3 3
28 28 a_1 1
29 29 2 2
30 30 3 3
Or in a new column:
df %>% mutate(Individual_2 = ifelse(Individual=="a",
paste0(Individual,'_',Repeat),
Individual))
We can use dplyr::if_else
library(dplyr)
df %>%
mutate_if(is.factor, as.character) %>%
mutate(Individual = if_else(
Individual == "a",
sprintf("%s_%s", Individual, Repeat),
Individual))
# Sample Individual Repeat
#1 1 a_1 1
#2 2 b 2
#3 3 c 3
#4 4 a_1 1
#5 5 b 2
#6 6 c 3
#7 7 a_1 1
#8 8 b 2
#9 9 c 3
#10 10 a_1 1
#11 11 b 2
#12 12 c 3
#13 13 a_1 1
#14 14 b 2
#15 15 c 3
#16 16 a_1 1
#17 17 b 2
#18 18 c 3
#19 19 a_1 1
#20 20 b 2
#21 21 c 3
#22 22 a_1 1
#23 23 b 2
#24 24 c 3
#25 25 a_1 1
#26 26 b 2
#27 27 c 3
#28 28 a_1 1
#29 29 b 2
#30 30 c 3
You are mixing up some syntax, and therefore your code fails.
First, your dplyr approach. Here you are close, but the additional df in the second row breaks the pipeline.
df[df$Individual == "a", ] <-
df %>% filter(Individual == "a") %>%
# don't pipe df in again; the filtered data is already the input
df %>% mutate(Individual = paste0(Individual,"_",Repeat))
The following makes it work:
Individual is stored as a factor (the default for strings in data.frame before R 4.0.0); if you want to modify the column, convert it to a character vector first.
df$Individual <- as.character(df$Individual)
df[df$Individual == "a", ] <-
df %>%
filter(Individual == "a") %>%
mutate(Individual = paste0(Individual,"_",Repeat))
There are other approaches as well:
E.g. in base R
df$Individual <- ifelse(df$Individual == "a",
paste0(df$Individual, "_", df$Repeat),
df$Individual)
Or in dplyr:
df %>%
mutate(Individual = ifelse(Individual == "a",
paste0(Individual, "_", Repeat),
Individual))
You could also fix the for loop as shown below, but I really don't recommend it in this case, since there are such nice vectorized options.
for (i in 1:nrow(df)) {
if (df$Individual[i] == "a") {
df$Individual[i] <- paste0(df$Individual[i], "_", df$Repeat[i])
}
}
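The factor pitfall mentioned in this thread is easy to reproduce in base R. The sketch below uses a made-up three-element factor f; res_factor and res_char are illustrative names.

```r
f <- factor(c("a", "b", "c"))

# On a factor, the "no" branch is coerced from the underlying integer
# codes, so the untouched levels turn into "2" and "3"
res_factor <- ifelse(f == "a", paste0(f, "_1"), f)
res_factor
# "a_1" "2" "3"

# Converting to character first keeps the original labels
ch <- as.character(f)
res_char <- ifelse(ch == "a", paste0(ch, "_1"), ch)
res_char
# "a_1" "b" "c"
```

dplyr::if_else is stricter about types than ifelse, which is presumably why the if_else answer converts factors to character before calling it.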

Multiple Conditional Cumulative Sum in R

This is my data frame as given below
rd <- data.frame(
Customer = rep("A",15),
date_num = c(3,3,9,11,14,14,15,16,17,20,21,27,28,29,31),
exp_cumsum_col = c(1,1,2,3,4,4,4,4,4,5,5,6,6,6,7))
I am trying to compute column 3 (exp_cumsum_col), but I have been unable to get the correct values after many attempts. This is the code I used:
rd<-as.data.frame(rd %>%
group_by(customer) %>%
mutate(exp_cumsum_col = cumsum(row_number(ifelse(date_num[i]==date_num[i+1],1)))))
If the date_num values are consecutive, I treat the whole run as one group; whenever there is a break in date_num, exp_cumsum_col increases by 1. exp_cumsum_col starts at 1.
We can take the difference of adjacent elements, check whether it is greater than 1, and take the cumsum of the result:
rd %>%
group_by(Customer) %>%
mutate(newexp_col = cumsum(c(TRUE, diff(date_num) > 1)))
# Customer date_num exp_cumsum_col newexp_col
#1 A 3 1 1
#2 A 3 1 1
#3 A 9 2 2
#4 A 11 3 3
#5 A 14 4 4
#6 A 14 4 4
#7 A 15 4 4
#8 A 16 4 4
#9 A 17 4 4
#10 A 20 5 5
#11 A 21 5 5
#12 A 27 6 6
#13 A 28 6 6
#14 A 29 6 6
#15 A 31 7 7
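For reference, the same run counter can be sketched in base R with ave, which applies the function separately within each customer. The two-customer data below is invented for illustration so that the per-group restart is visible; run_id is a hypothetical column name.

```r
# Toy data with two customers (made-up values)
rd2 <- data.frame(Customer = rep(c("A", "B"), each = 4),
                  date_num = c(3, 3, 9, 10, 1, 2, 5, 6))

# Within each customer: TRUE opens the first run, and every gap
# larger than 1 between adjacent dates opens another one
rd2$run_id <- ave(rd2$date_num, rd2$Customer,
                  FUN = function(d) cumsum(c(TRUE, diff(d) > 1)))
rd2$run_id
# 1 1 2 2 1 1 2 2
```

The leading TRUE is what makes the counter start at 1 instead of 0, and it also keeps the result the same length as the input, since diff returns one element fewer.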

Create indicator variable within panel data in R

I feel this should be easy, but I'm at a loss and hoping y'all can help. I have panel data, by id, with several variables (here just v1):
id v1
A 14
A 15
B 12
B 13
B 14
C 11
C 12
C 13
D 14
I would simply like to create a dummy variable indicating whether a value of v1 (say 12) exists in the panel for id. So something like:
id v1 v2
A 14 0
A 15 0
B 12 1
B 13 1
B 14 1
C 11 1
C 12 1
C 13 1
D 14 0
I feel this should be simple but can't figure out an easy one line solution.
Many many thanks!
Try
library(dplyr)
df %>% group_by(id) %>% mutate(v2 = as.numeric(any(v1 == 12)))
Or, as per @akrun's suggestion:
library(data.table)
setDT(df)[, v2 := any(v1 ==12)+0L, id]
Note: Adding 0L to the logical value returned by any() converts TRUE/FALSE to 1/0.
Another approach could be using ave:
df$v2 <- with(df, ave(v1, id, FUN = function(x) any(x == 12)))
Which gives:
# id v1 v2
#1 A 14 0
#2 A 15 0
#3 B 12 1
#4 B 13 1
#5 B 14 1
#6 C 11 1
#7 C 12 1
#8 C 13 1
#9 D 14 0
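A different way to get the same indicator, without any grouping function, is to first collect the ids that contain the value and then test membership with %in%. This is a sketch of that alternative technique rather than a restatement of the answers above; ids_with_12 is an illustrative name.

```r
df <- data.frame(id = c("A", "A", "B", "B", "B", "C", "C", "C", "D"),
                 v1 = c(14, 15, 12, 13, 14, 11, 12, 13, 14))

# ids that have at least one row with v1 == 12
ids_with_12 <- unique(df$id[df$v1 == 12])

# Membership test gives a logical vector; as.integer maps TRUE/FALSE to 1/0
df$v2 <- as.integer(df$id %in% ids_with_12)
df$v2
# 0 0 1 1 1 1 1 1 0
```

If v1 can contain NA, df$v1 == 12 yields NA for those rows, so wrapping the comparison in which() would be a safer way to build ids_with_12.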
