I have a column in my data frame which looks like this:
> df
# A tibble: 20 x 1
duration
<dbl>
1 0
2 40.0
3 247.
4 11.8
5 116.
6 10.2
7 171.
8 7.58
9 87.8
10 23.2
11 390.
12 35.8
13 4.73
14 29.1
15 0
16 36.8
17 73.8
18 12.9
19 124.
20 10.7
I need to group this data, so that all rows starting from a 0 to the last row before the next zero are in a group. I've accomplished this using a for-loop:
df$group <- NA
df$group[1] <- 1
for (i in 2:NROW(df)) {
  # start a new group at every zero, otherwise carry the previous group forward
  df$group[i] <-
    ifelse(df$duration[i] == 0, df$group[i - 1] + 1, df$group[i - 1])
}
which gives me the desired output:
> df
# A tibble: 20 x 2
duration group
<dbl> <dbl>
1 0 1
2 40.0 1
3 247. 1
4 11.8 1
5 116. 1
6 10.2 1
7 171. 1
8 7.58 1
9 87.8 1
10 23.2 1
11 390. 1
12 35.8 1
13 4.73 1
14 29.1 1
15 0 2
16 36.8 2
17 73.8 2
18 12.9 2
19 124. 2
20 10.7 2
But as my original data frame is quite big, I'm looking for a faster solution. I've been trying to get it working with dplyr, but to no avail. Other related questions count how often the current value has already appeared, not a specific one, so I haven't found a solution to this problem yet.
I'd appreciate your help in finding a vectorized solution for my problem, thanks! Here's the example data:
df <-
structure(
list(
duration = c(
0,
40.0009999275208,
247.248000144958,
11.8349997997284,
115.614000082016,
10.2449998855591,
171.426000118256,
7.58200001716614,
87.805999994278,
23.1909999847412,
390.417999982834,
35.8229999542236,
4.73100018501282,
29.0869998931885,
0,
36.789999961853,
73.8420000076294,
12.8770000934601,
123.771999835968,
10.7190001010895
)
),
row.names = c(NA,-20L),
class = c("tbl_df", "tbl", "data.frame")
)
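We can create the desired column with cumsum: duration == 0 is TRUE on every row that starts a group, and the running sum of those TRUEs increases by one at each zero, which is exactly the group number.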
df %>%
mutate(grp = cumsum(duration == 0))
# A tibble: 20 x 2
# duration grp
# <dbl> <int>
# 1 0 1
# 2 40.0 1
# 3 247. 1
# 4 11.8 1
# 5 116. 1
# 6 10.2 1
# 7 171. 1
# 8 7.58 1
# 9 87.8 1
#10 23.2 1
#11 390. 1
#12 35.8 1
#13 4.73 1
#14 29.1 1
#15 0 2
#16 36.8 2
#17 73.8 2
#18 12.9 2
#19 124. 2
#20 10.7 2
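To see why this works without dplyr, here is a minimal base R sketch on a shortened, made-up vector (the values are illustrative, not the full example data). One caveat it assumes away: if the data does not begin with a zero, the rows before the first zero end up in group 0.

```r
# shortened, illustrative vector: a zero marks the start of each group
duration <- c(0, 40, 247.2, 0, 36.8, 73.8)

# TRUE on every row that starts a new group
is_start <- duration == 0

# running count of group starts = group number for each row
group <- cumsum(is_start)
group
```

Because cumsum is vectorized, this avoids growing the column row by row as the for-loop does.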
I have two data frames with different variables, named df and df1. I want to merge df1 into df based on gender, age and district, in such a way that each age in df gets the AC value of the age group it falls into. For example, if AC is given for age group 20-24, every row in df with an age between 20 and 24 should get that same AC value. Thank you in advance.
df<-
district residence gender age weight id
1 1 1 12 26.8 1
2 2 2 14 21.4 2
3 1 1 20 24.2 3
4 2 2 23 35.8 4
5 1 1 31 42.3 5
6 2 2 16 25.2 6
7 1 1 22 35.3 7
8 2 2 45 25.3 8
9 1 1 48 36.2 9
10 2 2 39 35.5 10
df1<-
district age gender AC
1 15-19 2 0.0301
2 20-24 2 0.0934
3 25-29 2 0.108
4 30-34 2 0.0894
5 35-39 2 0.0444
6 40-44 2 0.00945
7 45-49 2 0.00226
8 15-19 2 0.0258
9 20-24 2 0.0701
10 25-29 2 0.0827
You can separate the age column of df1 into two columns and use fuzzyjoin.
library(dplyr)
library(tidyr)
library(fuzzyjoin)
df1 %>%
separate(age, c('start', 'end'), sep = '-', convert = TRUE) %>%
fuzzy_right_join(df,
by = c('district', 'gender', 'start' = 'age', 'end' = 'age'),
match_fun = c(`==`, `==`, `<=`, `>=`))
This is actually a poor minimal example, because there are no such matches in your data, so I have modified your data a little (district is set to 1 in both data frames). Also note that some ages in df have no matching label in df1.
df$district=1
df1$district=1
df$age1=cut(
df$age,
c(0,as.numeric(unlist(lapply(strsplit(unique(df1$age),"-"),"[[",2)))),
labels=sort(unique(df1$age))
)
merge(
df,
df1,
by.x=c("gender","age1","district"),
by.y=c("gender","age","district")
)
gender age1 district residence age weight id AC
1 2 15-19 1 2 14 21.4 2 0.03010
2 2 15-19 1 2 14 21.4 2 0.02580
3 2 15-19 1 2 16 25.2 6 0.03010
4 2 15-19 1 2 16 25.2 6 0.02580
5 2 20-24 1 2 23 35.8 4 0.07010
6 2 20-24 1 2 23 35.8 4 0.09340
7 2 35-39 1 2 39 35.5 10 0.04440
8 2 45-49 1 2 45 25.3 8 0.00226
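Note that in the output above, age 14 ends up labelled 15-19, because the breaks start at 0. A minimal sketch of the same cut idea with an explicit lower bound, assuming integer ages and the seven labels shown in df1 (the vector names here are made up for illustration):

```r
# ages from df and the distinct age-group labels from df1
ages   <- c(12, 14, 20, 23, 31, 16, 22, 45, 48, 39)
labels <- c("15-19", "20-24", "25-29", "30-34", "35-39", "40-44", "45-49")

# lower and upper bound of each label
lower <- as.numeric(sub("-.*", "", labels))
upper <- as.numeric(sub(".*-", "", labels))

# intervals are (14,19], (19,24], ..., (44,49]; this assumes integer ages
breaks <- c(lower[1] - 1, upper)
age_group <- cut(ages, breaks = breaks, labels = labels)

# ages outside every interval (12 and 14) become NA instead of "15-19"
data.frame(age = ages, age_group = age_group)
```

The resulting age_group column can then be used as the merge key in the same way as age1 above; rows with NA simply find no match.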
I need to find the first two times my df meets a certain condition grouped by two variables. I am trying to use the ddply function, but I am doing something wrong with the ".variables" command.
So in this example, I'm trying to find the first two times x > 30 and y > 30 in each group / trial.
The way I'm using ddply is giving me the first two times in the dataset, then repeating that for every group.
set.seed(1)
df <- data.frame((matrix(nrow=200,ncol=5)))
colnames(df) <- c("group","trial","x","y","hour")
df$group <- rep(c("A","B","C","D"),each=50)
df$trial <- rep(c(rep(1,times=25),rep(2,times=25)),times=4)
df[,3:4] <- runif(400,0,50)
df$hour <- rep(1:25,time=8)
library(plyr)
ddply(.data=df, .variables=c("group","trial"), .fun=function(x) {
i <- which(df$x > 30 & df$y >30 )[1:2]
if (!is.na(i)) x[i, ]
})
Expected results:
group trial x y hour
13 A 1 34.3511423 38.161134 13
15 A 1 38.4920710 40.931734 15
36 A 2 33.4233369 34.481392 11
37 A 2 39.7119930 34.470671 12
52 B 1 43.0604738 46.645491 2
65 B 1 32.5435234 35.123126 15
But instead, my code is finding c(1,4) from the first group*trial and repeating that for every group*trial:
group trial x y hour
1 A 1 34.351142 38.161134 13
2 A 1 38.492071 40.931734 15
3 A 2 5.397181 27.745031 13
4 A 2 20.563721 22.636003 15
5 B 1 22.953286 13.898301 13
6 B 1 32.543523 35.123126 15
I would also like for there to be rows of NA if a second occurrence isn't present in a group*trial.
Thanks,
I think this is what you want:
library(tidyverse)
df %>% group_by(group, trial) %>% filter(x > 30 & y > 30) %>% slice(1:2)
Result:
# A tibble: 16 x 5
# Groups: group, trial [8]
group trial x y hour
<chr> <dbl> <dbl> <dbl> <int>
1 A 1 33.5 46.3 4
2 A 1 32.6 42.7 11
3 A 2 35.9 43.6 4
4 A 2 30.5 42.7 14
5 B 1 33.0 38.1 2
6 B 1 40.5 30.4 7
7 B 2 48.6 33.2 2
8 B 2 34.1 30.9 4
9 C 1 33.0 45.1 1
10 C 1 30.3 36.7 17
11 C 2 44.8 33.9 1
12 C 2 41.5 35.6 6
13 D 1 44.2 34.3 12
14 D 1 39.1 40.0 23
15 D 2 39.4 47.5 4
16 D 2 42.1 40.1 10
(slightly different from your results, probably a different R version)
I recommend using dplyr or data.table rather than plyr. From the plyr GitHub page:
plyr is retired: this means only changes necessary to keep it on CRAN
will be made. We recommend using dplyr (for data frames) or purrr (for
lists) instead.
Since someone has already provided a solution with dplyr, here is one option with data.table.
In the call df[i, j, by], I select the rows that match your criteria in i, group by the given variables in by, and in j take the first two rows (head) of each group-specific subset of the data, .SD. Everything inside the brackets is data.table-specific, and only works because I first converted df to a data.table with setDT.
library(data.table)
setDT(df)
df[x > 30 & y > 30, head(.SD, 2), by = .(group, trial)]
# group trial x y hour
# 1: A 1 34.35114 38.16113 13
# 2: A 1 38.49207 40.93173 15
# 3: A 2 33.42334 34.48139 11
# 4: A 2 39.71199 34.47067 12
# 5: B 1 43.06047 46.64549 2
# 6: B 1 32.54352 35.12313 15
# 7: B 2 48.03090 38.53685 5
# 8: B 2 32.11441 49.07817 18
# 9: C 1 32.73620 33.68561 1
# 10: C 1 32.00505 31.23571 20
# 11: C 2 32.13977 40.60658 9
# 12: C 2 34.13940 49.47499 16
# 13: D 1 36.18630 34.94123 19
# 14: D 1 42.80658 46.42416 23
# 15: D 2 37.05393 43.24038 3
# 16: D 2 44.32255 32.80812 8
To try a solution that is closer to what you've tried so far, we can do the following:
ddply(.data=df, .variables=c("group","trial"), .fun=function(df_temp) {
i <- which(df_temp$x > 30 & df_temp$y >30 )[1:2]
df_temp[i, ]
})
Some explanation
One problem with the code you provided is that you used df inside of ddply. You defined .fun = function(x), but you looked for cases of x > 30 & y > 30 in the full df rather than in the group subset x. As a result, i holds row positions from df, but is then used to index x. Finally, to my understanding there is no need for if (!is.na(i)) x[i, ]: i has length two, so the condition is not a single TRUE or FALSE, and if only one row meets your condition you will get a row of NAs anyway, because of which(df_temp$x > 30 & df_temp$y > 30)[1:2].
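The NA-row behaviour can be checked on a tiny made-up data frame (d is hypothetical, not part of the question's data) in which only one row meets the condition:

```r
# toy data: only the second row has x > 30 and y > 30
d <- data.frame(x = c(10, 35, 20), y = c(40, 45, 50))

# which() finds one match; [1:2] pads the result with NA
i <- which(d$x > 30 & d$y > 30)[1:2]
i

# indexing rows with NA returns a row filled with NA
res <- d[i, ]
res
```

So every group contributes exactly two rows, and a missing second occurrence shows up as a row of NAs.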
Using dplyr, you can also do:
df %>%
group_by(group, trial) %>%
slice(which(x > 30 & y > 30)[1:2])
group trial x y hour
<chr> <dbl> <dbl> <dbl> <int>
1 A 1 34.4 38.2 13
2 A 1 38.5 40.9 15
3 A 2 33.4 34.5 11
4 A 2 39.7 34.5 12
5 B 1 43.1 46.6 2
6 B 1 32.5 35.1 15
7 B 2 48.0 38.5 5
8 B 2 32.1 49.1 18
Since everything else is covered, here is a base R version using split:
output <- do.call(rbind, lapply(split(df, list(df$group, df$trial)),
function(new_df) new_df[with(new_df, head(which(x > 30 & y > 30), 2)), ]
))
rownames(output) <- NULL
output
# group trial x y hour
#1 A 1 34.351 38.161 13
#2 A 1 38.492 40.932 15
#3 B 1 43.060 46.645 2
#4 B 1 32.544 35.123 15
#5 C 1 32.736 33.686 1
#6 C 1 32.005 31.236 20
#7 D 1 36.186 34.941 19
#8 D 1 42.807 46.424 23
#9 A 2 33.423 34.481 11
#10 A 2 39.712 34.471 12
#11 B 2 48.031 38.537 5
#12 B 2 32.114 49.078 18
#13 C 2 32.140 40.607 9
#14 C 2 34.139 49.475 16
#15 D 2 37.054 43.240 3
#16 D 2 44.323 32.808 8
Attempting to plot aggregate data from the following data.
Person Time Period Value SMA2 SMA3 SMA4
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 A 1 1 14 NA NA NA
2 A 2 1 8 11 NA NA
3 A 3 1 13 10.5 11.7 NA
4 A 4 1 12 12.5 11 11.8
5 A 5 1 19 15.5 14.7 13
6 A 6 1 9 14 13.3 13.2
7 A 7 2 14 NA NA NA
8 A 8 2 7 10.5 NA NA
9 A 9 2 11 9 10.7 NA
10 A 10 2 14 12.5 10.7 11.5
# ... with 26 more rows
I have used aggregate(DataSet[,c(4,5,6,7)], by=list(DataSet$Person), na.rm = TRUE, max) to get the following:
Group.1 Value SMA2 SMA3 SMA4
1 A 20 18.0 16.66667 15.25
2 B 20 17.0 16.66667 15.00
3 C 19 18.5 14.33333 14.50
I'd like to plot the maxes for each SMA for Person A, B, and C on the same plot.
I would also like to be able to plot the mean of these maxes for each SMA column.
Any help is appreciated.
Like so? Or are you looking for something different?
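library(ggplot2)
library(dplyr)   # group_by(), summarise()
library(tidyr)   # gather()
# maxima per person, as returned by aggregate() in the question
df <- data.frame("Group.1"=c("A","B","C"), "Value"=c(20,20,19),
  "SMA2"=c(18.0, 17.0, 18.5), "SMA3" =c(16.667, 16.667, 14.333),
  "SMA4"=c(15.25, 15.00, 14.50))
# long format: one row per person and column
df.g <- df %>%
  gather(SMA, Value, -Group.1)
df.g$SMA <- factor(df.g$SMA, levels=c("Value", "SMA2", "SMA3", "SMA4"))
# mean of the maxima for each column
means <- df.g %>%
  group_by(SMA) %>%
  summarise(m=mean(Value))
# one line per person, plus a point for each mean
ggplot(df.g, aes(x=SMA, y=Value, group=Group.1, colour=Group.1)) +
  geom_line() +
  geom_point(data=means, aes(x=SMA, y=m), inherit.aes = FALSE)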
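If you only need the numbers behind the mean points, a base R sketch like the following should give them without any reshaping. It assumes the aggregated maxima from the question are stored in a data frame (called maxes here, a made-up name):

```r
# aggregated maxima per person, copied from the aggregate() output in the question
maxes <- data.frame(
  Group.1 = c("A", "B", "C"),
  Value   = c(20, 20, 19),
  SMA2    = c(18.0, 17.0, 18.5),
  SMA3    = c(16.66667, 16.66667, 14.33333),
  SMA4    = c(15.25, 15.00, 14.50)
)

# mean of the maxima for each numeric column (drop the grouping column first)
col_means <- colMeans(maxes[, -1])
col_means
```

These are the same values that summarise(m = mean(Value)) computes per SMA level, just in wide form.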
I need some help writing a loop in R. I am having trouble selecting the previous match in which the same ID occurs, and then filling the OLD_RANK and NEW_RANK columns.
OLD_RANK must be the NEW_RANK of the previous match found.
NEW_RANK <- OLD_RANK + 0.05 * (S1 - S2)
Here is my data for this example:
JUNK<- matrix(c(1,1,10,20,3,2,30,40,1,3,60,4,3,
4,5,40,1,5,10,30,7,6,20,20),ncol=4,byrow=TRUE)
colnames(JUNK) <- c("ID1","DAY","S1","S2")
JUNK<- as.data.frame(JUNK)
What I thought could be a good start:
# Pseudocode: subset to find the previous match. Look for matches on earlier days and,
# if more than one is found, choose the row with the highest value in DAY.
# for each row:
#   s1 <- subset(JUNK, ID1 == id & DAY < day)
#   s1 <- subset(s1, DAY == max(DAY))
# if no match is found: OLD_RANK <- 35 and NEW_RANK <- OLD_RANK + 0.05 * (S1 - S2)
# if a previous match is found: OLD_RANK <- its NEW_RANK and NEW_RANK <- OLD_RANK + 0.05 * (S1 - S2)
expected result:
ID1 DAYS S1 S2 OLD_RANK NEW_RANK
1 1 10 20 35 34.5
3 2 30 40 35 34.5
1 3 60 4 34.5 37.3
3 4 5 40 34.5 32.75
1 5 10 30 37.3 36.3
7 6 20 20 35 35
Any help is appreciated.
Here's one approach:
library(dplyr)
JUNK2 <- JUNK %>%
group_by(ID1) %>%
mutate(change = 0.05*(S1-S2),
NEW_RANK = 35 + cumsum(change),
OLD_RANK = lag(NEW_RANK, default = 35)) %>%  # first match per ID starts from 35
ungroup()  # end with an ungrouped table
Result:
JUNK2
# A tibble: 6 x 7
ID1 DAY S1 S2 change NEW_RANK OLD_RANK
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 1 10 20 -0.5 34.5 35
2 3 2 30 40 -0.5 34.5 35
3 1 3 60 4 2.8 37.3 34.5
4 3 4 5 40 -1.75 32.8 34.5
5 1 5 10 30 -1 36.3 37.3
6 7 6 20 20 0 35 35
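The reason a cumulative sum is enough: every ID starts at 35 and each match adds 0.05 * (S1 - S2), so the rank after a match is 35 plus the sum of all changes for that ID so far. A base R sketch of the same idea, assuming the rows are sorted by DAY (ave is used here in place of group_by):

```r
JUNK <- as.data.frame(matrix(c(1,1,10,20, 3,2,30,40, 1,3,60,4,
                               3,4,5,40, 1,5,10,30, 7,6,20,20),
                             ncol = 4, byrow = TRUE))
colnames(JUNK) <- c("ID1", "DAY", "S1", "S2")

# the running sum below assumes chronological order
JUNK <- JUNK[order(JUNK$DAY), ]

# change in rank caused by each match
change <- 0.05 * (JUNK$S1 - JUNK$S2)

# rank after each match: 35 plus the running sum of changes within each ID
JUNK$NEW_RANK <- 35 + ave(change, JUNK$ID1, FUN = cumsum)

# rank before each match: undo the current change
JUNK$OLD_RANK <- JUNK$NEW_RANK - change
JUNK
```

Computing OLD_RANK as NEW_RANK minus the current change avoids a separate lag step and gives 35 for the first match of every ID.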
I have this data frame:
Votes <- data.frame(
VoteCreationDate = c(1,3,3,5,5,6),
GiverId = c(19,19,38,19,38,19),
CumNumUpVotes = c(1,3,1,7,2,10)
)
Votes
VoteCreationDate GiverId CumNumUpVotes
1 19 1
3 19 3
3 38 1
5 19 7
5 38 2
6 19 10
For each GiverId (19 and 38), all possible dates (number from 1 to 6) should be listed in VoteCreationDate.
Then, for each GiverId and VoteCreationDate, the corresponding CumNumUpVotes should be matched. If there is no corresponding value, the CumNumUpVotes should be taken from the immediately preceding VoteCreationDate.
For example, for VoteCreationDate = 4 and GiverId = 38 there is no corresponding CumNumUpVotes. This cell should be equal to 1, which is the CumNumUpVotes from GiverId = 38 and VoteCreationDate = 3.
Here is how it should look at the end:
VoteCreationDate GiverId CumNumUpVotes
1 19 1
2 19 1
3 19 3
4 19 3
5 19 7
6 19 10
1 38 0
2 38 0
3 38 1
4 38 1
5 38 2
6 38 2
Any idea how to get there?
A dplyr and tidyr solution.
library(dplyr)
library(tidyr)
Votes2 <- Votes %>%
complete(VoteCreationDate = full_seq(VoteCreationDate, period = 1), GiverId) %>%
arrange(GiverId, VoteCreationDate) %>%
group_by(GiverId) %>%
fill(CumNumUpVotes) %>%
replace_na(list(CumNumUpVotes = 0)) %>%
ungroup()
Votes2
# # A tibble: 12 x 3
# VoteCreationDate GiverId CumNumUpVotes
# <dbl> <dbl> <dbl>
# 1 1.00 19.0 1.00
# 2 2.00 19.0 1.00
# 3 3.00 19.0 3.00
# 4 4.00 19.0 3.00
# 5 5.00 19.0 7.00
# 6 6.00 19.0 10.0
# 7 1.00 38.0 0
# 8 2.00 38.0 0
# 9 3.00 38.0 1.00
# 10 4.00 38.0 1.00
# 11 5.00 38.0 2.00
# 12 6.00 38.0 2.00
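For intuition, what fill() followed by replace_na() does within one GiverId can be sketched in base R as a "last observation carried forward" helper. The function locf below is a made-up illustration, not part of tidyr:

```r
# carry the last non-NA value forward; use `default` before the first observation
locf <- function(v, default = 0) {
  # position of the most recent non-NA value at each step (0 if none yet)
  idx <- cummax(ifelse(is.na(v), 0L, seq_along(v)))
  # prepend the default so that position 0 maps to it
  c(default, v)[idx + 1]
}

# CumNumUpVotes for GiverId 38 on dates 1 to 6, with gaps as NA
locf(c(NA, NA, 1, NA, 2, NA))
```

Unlike a cummax over the values themselves, this does not require the values to be increasing.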
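A base R alternative: split by GiverId, merge each piece with the full range of dates, and fill the gaps with cummax. This relies on CumNumUpVotes never decreasing within a GiverId.
do.call(rbind, lapply(split(Votes, Votes$GiverId), function(x){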
temp = merge(x, data.frame(VoteCreationDate = 1:6), all = TRUE)
temp$GiverId = temp$GiverId[!is.na(temp$GiverId)][1]
temp$CumNumUpVotes = cummax(replace(temp$CumNumUpVotes, is.na(temp$CumNumUpVotes), 0))
temp
}))
# VoteCreationDate GiverId CumNumUpVotes
#19.1 1 19 1
#19.2 2 19 1
#19.3 3 19 3
#19.4 4 19 3
#19.5 5 19 7
#19.6 6 19 10
#38.1 1 38 0
#38.2 2 38 0
#38.3 3 38 1
#38.4 4 38 1
#38.5 5 38 2
#38.6 6 38 2