Add observation number by group in R [duplicate]

This question already has answers here:
Numbering rows within groups in a data frame
(10 answers)
This is a silly question but I am new to R and it would make my life so much easier if I could figure out how to do this!
So here is some sample data:
data <- read.table(text = "Category Y
A 5.1
A 3.14
A 1.79
A 3.21
A 5.57
B 3.68
B 4.56
B 3.32
B 4.98
B 5.82
",header = TRUE)
I want to add a column that counts the number of observations within a group. Here is what I want it to look like:
Category Y OBS
A 5.1 1
A 3.14 2
A 1.79 3
A 3.21 4
A 5.57 5
B 3.68 1
B 4.56 2
B 3.32 3
B 4.98 4
B 5.82 5
I have tried:
data <- data %>% group_by(Category) %>% mutate(count = c(1:length(Category)))
which just creates another column numbered from 1 to 10, and
data <- data %>% group_by(Category) %>% add_tally()
which just creates another column of all 5s.

Base R:
data$OBS <- ave(seq_len(nrow(data)), data$Category, FUN = seq_along)
data
# Category Y OBS
# 1 A 5.10 1
# 2 A 3.14 2
# 3 A 1.79 3
# 4 A 3.21 4
# 5 A 5.57 5
# 6 B 3.68 1
# 7 B 4.56 2
# 8 B 3.32 3
# 9 B 4.98 4
# 10 B 5.82 5
BTW: one can use any of the frame's columns as the first argument, including ave(data$Category, data$Category, FUN=seq_along), but ave chooses its output class based on the input class, so a character column as the first argument will return character strings:
ave(data$Category, data$Category, FUN = seq_along)
# [1] "1" "2" "3" "4" "5" "1" "2" "3" "4" "5"
While not heinous, it needs to be an intentional choice. Since it appears that you want an integer in that column, I chose the simplest integer-in, integer-out approach. The first argument could equally have been rep(1L, nrow(data)) or anything else that is both integer and the same length as the number of rows in the frame, since seq_along (the function I chose) only looks at its length.
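If you do want to start from the character Category column anyway, a small sketch: wrapping ave()'s character output in as.integer() recovers the integer sequence (reusing the question's data):

```r
# Reconstruct the question's data; stringsAsFactors = FALSE keeps
# Category as character on all R versions.
data <- read.table(text = "Category Y
A 5.1
A 3.14
A 1.79
A 3.21
A 5.57
B 3.68
B 4.56
B 3.32
B 4.98
B 5.82
", header = TRUE, stringsAsFactors = FALSE)

# ave() returns character here because its input is character;
# as.integer() converts the per-group counters back.
data$OBS <- as.integer(ave(data$Category, data$Category, FUN = seq_along))
data$OBS
# [1] 1 2 3 4 5 1 2 3 4 5
```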

library(data.table)
setDT(data)[, OBS := seq_len(.N), by = .(Category)]
data
Category Y OBS
1: A 5.10 1
2: A 3.14 2
3: A 1.79 3
4: A 3.21 4
5: A 5.57 5
6: B 3.68 1
7: B 4.56 2
8: B 3.32 3
9: B 4.98 4
10: B 5.82 5

library(dplyr)
data %>% group_by(Category) %>% mutate(Obs = row_number())
# A tibble: 10 x 3
# Groups: Category [2]
Category Y Obs
<chr> <dbl> <int>
1 A 5.1 1
2 A 3.14 2
3 A 1.79 3
4 A 3.21 4
5 A 5.57 5
6 B 3.68 1
7 B 4.56 2
8 B 3.32 3
9 B 4.98 4
10 B 5.82 5
Or, using ave() on the Category column itself (note that OBS will be character here, since ave matches its output class to its input):
data$OBS <- ave(data$Category, data$Category, FUN = seq_along)
data
Category Y OBS
1 A 5.10 1
2 A 3.14 2
3 A 1.79 3
4 A 3.21 4
5 A 5.57 5
6 B 3.68 1
7 B 4.56 2
8 B 3.32 3
9 B 4.98 4
10 B 5.82 5

Another base R option:
category <- c(rep('A',5),rep('B',5))
sequence <- sequence(rle(as.character(category))$lengths)
data <- data.frame(category=category,sequence=sequence)
head(data,10)
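The same trick can be applied to the question's original frame without rebuilding it; a sketch, with the caveat that rle() numbers consecutive runs, so this only matches the grouped numbering when each Category appears in one contiguous block:

```r
# Reconstruct the question's data
data <- read.table(text = "Category Y
A 5.1
A 3.14
A 1.79
A 3.21
A 5.57
B 3.68
B 4.56
B 3.32
B 4.98
B 5.82
", header = TRUE)

# sequence() of the run lengths restarts the counter at each new run
data$OBS <- sequence(rle(as.character(data$Category))$lengths)
data$OBS
# [1] 1 2 3 4 5 1 2 3 4 5
```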

Related

Fill value of a numeric with the entry of another row based on name

Good day,
I have a dataframe that looks like this;
Community Value Num
<chr> <dbl> <dbl>
1 A 3.54 3
2 A 4.56 3
3 A 2.22 3
4 B 0 NA
5 B 0.76 NA
6 C 1.2 5
I am hoping to fill the Num observation of Community B with the Num observation of Community A while keeping the names as is.
If this is all you need to do, you can try using an if_else() statement. If your greater problem is more complex, you might need to take another approach.
library(dplyr)
df %>%
mutate(Num = if_else(Community == "B", Num[Community == "A"][1], Num))
# Community Value Num
# 1 A 3.54 3
# 2 A 4.56 3
# 3 A 2.22 3
# 4 B 0.00 3
# 5 B 0.76 3
# 6 C 1.20 5
Data:
df <- read.table(textConnection("Community Value Num
A 3.54 3
A 4.56 3
A 2.22 3
B 0 NA
B 0.76 NA
C 1.2 5"), header = TRUE)
If the NA values will always be immediately below the positive values with which you want to replace them, you can use fill:
library(tidyr)
df %>%
fill(Num, .direction = "down")
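As a quick check, a self-contained sketch: fill() simply carries the last non-NA value downward, which happens to reproduce the if_else() result for this sample data:

```r
library(tidyr)

df <- read.table(textConnection("Community Value Num
A 3.54 3
A 4.56 3
A 2.22 3
B 0 NA
B 0.76 NA
C 1.2 5"), header = TRUE)

# Each NA in Num takes the most recent non-NA value above it
filled <- fill(df, Num, .direction = "down")
filled$Num
# [1] 3 3 3 3 3 5
```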

create new var by product of preceding result per group id in dplyr

I have the following data, and what I need is to create a new var new, obtained as the running product of the preceding row values of var z within each group id. E.g. for id x = 1, the first values of column new are 0.9, 0.9*0.1, and 0.9*0.1*0.5.
data <- data.frame(x=c(1,1,1,1,2,2,3,3,3,4,4,4,4),
y=c(4,2,2,6,5,6,6,7,8,2,1,6,5),
z=c(0.9,0.1,0.5,0.12,0.6,1.2,2.1,0.9,0.4,0.8,0.45,1.3,0.85))
desired outcome
x y z new
1 1 4 0.90 0.9000
2 1 2 0.10 0.0900
3 1 2 0.50 0.0450
4 1 6 0.12 0.0054
5 2 5 0.60 0.6000
6 2 6 1.20 0.7200
7 3 6 2.10 2.1000
8 3 7 0.90 1.8900
9 3 8 0.40 0.7560
10 4 2 0.80 0.8000
11 4 1 0.45 0.3600
12 4 6 1.30 0.4680
13 4 5 0.85 0.3978
We can use cumprod from base R:
library(dplyr)
data %>%
group_by(x) %>%
mutate(new = cumprod(z)) %>%
ungroup
Or with base R
data$new <- with(data, ave(z, x, FUN = cumprod))

mutate rnorm with multiple column output

I would like to create random variables using rnorm() with the mean and sd specified in separate columns in a tibble
n <- 2
ti <- tibble(Name = letters[1:10], mean = 1:10, sd = 1:10)
How do I use mutate to add n columns to the tibble with output from rnorm(n, mean, sd) for every row?
(I know I can do this in base R, but am curious to learn how this works using dplyr)
One dplyr and tidyr option could be:
ti %>%
rowwise() %>%
mutate(col_rnorm = list(setNames(rnorm(n, mean, sd), c("col1", "col2")))) %>%
unnest_wider(col_rnorm)
Name mean sd col1 col2
<chr> <int> <int> <dbl> <dbl>
1 a 1 1 -1.73 1.18
2 b 2 2 3.86 0.0943
3 c 3 3 3.54 -0.502
4 d 4 4 3.21 -3.90
5 e 5 5 3.61 9.48
6 f 6 6 7.07 16.1
7 g 7 7 17.4 5.95
8 h 8 8 5.32 13.6
9 i 9 9 19.2 19.8
10 j 10 10 9.67 11.3
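For comparison, a base R sketch of the same idea: draw n values per row with mapply() and bind them on as new columns (the col1/col2 names are just illustrative, matching the dplyr output above):

```r
set.seed(123)  # for reproducibility; the draws themselves are random
n <- 2
ti <- data.frame(Name = letters[1:10], mean = 1:10, sd = 1:10)

# mapply() calls rnorm once per row, giving an n x 10 matrix;
# t() flips it to one row of draws per input row.
draws <- t(mapply(function(m, s) rnorm(n, m, s), ti$mean, ti$sd))
colnames(draws) <- paste0("col", seq_len(n))
out <- cbind(ti, draws)
dim(out)
# [1] 10  5
```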

How can I leave columns untouched with aggregate in R

I have a large dataframe with experiments with different parameters. Each combination of parameters have several executions:
PROFILE TIME NTHREADS PARAM1 PARAM2 PARAM3
prof1 3.01 1 4 10 1
prof1 2.90 1 4 10 1
prof1 3.02 1 4 10 1
prof1 1.52 1 4 10 2
prof1 1.60 1 4 10 2
...
I am using aggregate to obtain the best time for each combination of profile & nthreads:
data_aggregated <- aggregate(data$TIME,
by = list(PROFILE = data$PROFILE,
NTHREADS = data$NTHREADS),
FUN = min)
That returns a new dataframe like this:
PROFILE NTHREADS TIME
prof1 1 1.52
prof1 2 0.9
prof2 1 1.41
prof2 2 0.88
...
What I want is to obtain the values of PARAM1, PARAM2, PARAM3 for the aggregated row in each case (the one with minimum time). For now, I look in the first dataframe for the row where PROFILE, TIME and NTHREADS equal the ones in the second dataframe, but maybe there is an easier way?
Alternatively, with dplyr:
library(dplyr)
dat <- dat %>%
group_by(PROFILE, NTHREADS) %>%
filter(TIME == min(TIME))
Finally, I've done it following the comment by Ronak Shah. If both dataframes share column names & values (because of aggregating with min instead of mean), the simplest solution is:
data_aggr <- merge(data_aggr, data)
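With recent dplyr (slice_min() appeared in 1.0.0), another way to keep the whole best row per group, PARAM columns included, without a separate merge; a sketch using the question's sample rows:

```r
library(dplyr)

data <- read.table(text = "PROFILE TIME NTHREADS PARAM1 PARAM2 PARAM3
prof1 3.01 1 4 10 1
prof1 2.90 1 4 10 1
prof1 3.02 1 4 10 1
prof1 1.52 1 4 10 2
prof1 1.60 1 4 10 2", header = TRUE)

# slice_min keeps the row(s) with the smallest TIME per group;
# with_ties = FALSE guarantees exactly one row per group
best <- data %>%
  group_by(PROFILE, NTHREADS) %>%
  slice_min(TIME, n = 1, with_ties = FALSE) %>%
  ungroup()
best$TIME
# [1] 1.52
```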
Consider ave, the method to aggregate across different levels of factors. You can pass multiple groupings as separate arguments:
data <- read.table(text="PROFILE TIME NTHREADS PARAM1 PARAM2 PARAM3
prof1 3.01 1 4 10 1
prof2 2.90 2 4 10 1
prof1 3.02 1 4 10 1
prof2 1.52 2 4 10 2
prof1 1.60 1 4 10 2", header=TRUE)
data$min_TIME <- ave(data$TIME, data$PROFILE, data$NTHREADS, FUN=min)
data
# PROFILE TIME NTHREADS PARAM1 PARAM2 PARAM3 min_TIME
# 1 prof1 3.01 1 4 10 1 1.60
# 2 prof2 2.90 2 4 10 1 1.52
# 3 prof1 3.02 1 4 10 1 1.60
# 4 prof2 1.52 2 4 10 2 1.52
# 5 prof1 1.60 1 4 10 2 1.60

calculate multiple columns mean in R and generate a new table

I have a data set in .csv. It contains multiple columns for example.
Group Wk1 WK2 WK3 WK4 WK5 WK6
A 1 2 3 4 5 6
B 7 8 9 1 2 3
C 4 5 6 7 8 9
D 1 2 3 4 5 6
Then I want the means of several column groups: Wk1 & WK2 together, WK3 alone, WK4 & WK5 together, and WK6 alone.
How can I do that?
The result may like
Group 1 2 3 4
mean 3.75 5.25 4.5 6
And how can I save it into a new table?
Thanks in advance.
You can melt your data.frame, create your groups using some basic indexing, and use aggregate:
library(reshape2)
X <- melt(mydf, id.vars="Group")
Match <- c(Wk1 = 1, WK2 = 1, WK3 = 2, WK4 = 3, WK5 = 3, WK6 = 4)
aggregate(value ~ Match[X$variable], X, mean)
# Match[X$variable] value
# 1 1 3.75
# 2 2 5.25
# 3 3 4.50
# 4 4 6.00
tapply is also an appropriate candidate here:
tapply(X$value, Match[X$variable], mean)
# 1 2 3 4
# 3.75 5.25 4.50 6.00
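The same result without reshaping, as a base R sketch: split.default() splits a data frame's columns (rather than its rows) by the same Match grouping, and each block's values are averaged together (column names follow the sample table):

```r
mydf <- read.table(text = "Group Wk1 WK2 WK3 WK4 WK5 WK6
A 1 2 3 4 5 6
B 7 8 9 1 2 3
C 4 5 6 7 8 9
D 1 2 3 4 5 6", header = TRUE)

Match <- c(Wk1 = 1, WK2 = 1, WK3 = 2, WK4 = 3, WK5 = 3, WK6 = 4)

# split.default() groups the week columns by Match; mean() over
# each block's matrix averages all of that block's values at once
means <- sapply(split.default(mydf[-1], Match[names(mydf[-1])]),
                function(block) mean(as.matrix(block)))
means
#    1    2    3    4
# 3.75 5.25 4.50 6.00
```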
