Creating a grouping indicator per row in R - r

I have the following data:
x1 <- rnorm(20,0,1)
x2 <- rnorm(20,0,1)
group <- sample(50:55, size=20, replace=TRUE)
data <- data.frame(x1,x2,group)
head(data)
x1 x2 group
1 -0.88001290 0.53866432 50
2 0.34228653 -0.54503078 52
3 -2.42308971 0.09542262 54
4 0.07310148 -1.03226594 50
5 -0.47786709 2.46726615 55
6 0.45224510 -1.46224926 55
I need to create a grouping indicator based on the group variable, so that rows where group = 50 get 1, rows where group = 51 get 2, and so on.
I tried to do this with the dplyr package in R, but I am not getting the correct answer because I have not defined the indicator variable correctly.
data %>% arrange(group) %>% group_by(group) %>% mutate(Indicator = n() )
Can anyone help me correct my code?
Thank you

We need cur_group_id() instead of n(); n() returns the number of rows in the current group, not an index for the group.
library(dplyr)
data %>%
arrange(group) %>%
group_by(group) %>%
mutate(indicator = cur_group_id()) %>%
ungroup()
Output:
# A tibble: 20 x 4
# x1 x2 group indicator
# <dbl> <dbl> <int> <int>
# 1 -1.24 -0.497 50 1
# 2 -0.648 1.59 50 1
# 3 0.598 -0.325 51 2
# 4 -0.721 0.510 51 2
# 5 0.259 1.62 51 2
# 6 -0.288 0.872 52 3
# 7 0.403 0.785 52 3
# 8 1.84 1.65 52 3
# 9 0.116 -0.0234 52 3
#10 -1.31 -0.244 52 3
#11 -0.615 0.994 53 4
#12 -0.469 0.695 53 4
#13 -0.324 -0.599 53 4
#14 -0.394 -0.971 53 4
#15 1.30 0.323 54 5
#16 0.0242 -1.46 54 5
#17 -0.342 -1.96 54 5
#18 1.10 -0.569 54 5
#19 -0.967 -0.863 54 5
#20 -0.396 -0.441 55 6
Another option is match(), which needs no grouping:
data %>%
mutate(indicator = match(group, sort(unique(group))))

Base R using factor():
levels <- 50:55
labels <- 1:6
data$indicator <- factor(data$group, levels, labels)
or, deriving the levels from the data (sorted, so that the smallest group gets 1):
levels <- sort(unique(data$group))
labels <- seq_along(levels)
data$indicator <- factor(data$group, levels, labels)

dplyr::dense_rank may also help, even without grouping:
data %>% mutate(indicator = dense_rank(group))
Base R way:
data$indicator <- as.numeric(as.factor(data$group))
data
x1 x2 group indicator
1 -1.453628399 -1.78776319 55 6
2 -0.119413813 -0.07656982 52 3
3 0.387951296 -0.26845052 55 6
4 3.117977719 0.69280780 51 2
5 -0.938126762 -0.16898209 50 1
6 -1.596371818 0.35289797 52 3
7 -2.291376398 -1.59385221 55 6
8 0.161164263 -0.99387565 54 5
9 -0.281744752 -0.26801191 53 4
10 0.760719223 -0.28255900 50 1
11 -0.204073022 -1.10262114 51 2
12 0.653628314 0.77778039 54 5
13 0.043736298 -0.37896178 55 6
14 0.002800531 1.17034334 55 6
15 0.451136658 -0.38459588 51 2
16 0.151793862 0.60303631 55 6
17 0.173976519 -0.41745808 53 4
18 0.282827170 -0.16794851 52 3
19 0.737444975 -0.45712603 51 2
20 0.014182869 0.99013155 51 2
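The approaches above can be compared side by side. The following is a minimal base R sketch, assuming a seed of 42 purely for reproducibility (the seed and the variable names ind_match, ind_factor and ind_fixed are illustrative, not from the answers). It also shows one caveat: all of these number only the groups that actually occur, so if a value such as 51 happens to be missing from the sample, 52 gets 2. Matching against the full 50:55 range pins the mapping.

```r
set.seed(42)  # assumed seed, only so the sketch is reproducible
group <- sample(50:55, size = 20, replace = TRUE)

# Dense numbering of the groups that occur, in sorted order
ind_match  <- match(group, sort(unique(group)))
ind_factor <- as.integer(factor(group))

# Fixed mapping 50 -> 1, 51 -> 2, ..., 55 -> 6, even if a group is absent
ind_fixed <- match(group, 50:55)

identical(ind_match, ind_factor)
```

When every value from 50 to 55 is present, ind_match and ind_fixed coincide.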

Related

Loop for variable definition R

I have a data frame and I want to define multiple columns with the same function (ntile) operated on the original version (column) of the variable. I'm not sure whether a loop or something else will work but the below example is a toy example. My actual data frame has over 20 variables that this needs to be done on.
Basically I want to make a variable called "original_name"_bin for each of the numeric variables in my data frame. These _bin variables are just the ntile function operated on the original non _bin version:
dat1 <- read.table(text = "x1 x2
10 20
20 30.5
30 40.5
40 20.12
50 25
70 86
80 75
90 45 ", header = TRUE)
num_names <- paste(colnames(dat1[sapply(dat1, is.numeric)]))
bin_names <- paste(colnames(dat1[sapply(dat1, is.numeric)]), "bin", sep = "_")
# Want to make columns in data frame where the var_bin is:
dat1$x1_bin <- ntile(dat1$x1, n = 10)
# loop
for (i in 1:length(bin_names)){
assign(paste0("dat1$", bin_names[i]), ntile(???, 10))
}
Here is one base way to do it using lapply:
dat1 <- read.table(text = "x1 x2
10 20
20 30.5
30 40.5
40 20.12
50 25
70 86
80 75
90 45 ", header = TRUE)
num_names <- paste(colnames(dat1[sapply(dat1, is.numeric)]))
bin_names <- paste(colnames(dat1[sapply(dat1, is.numeric)]), "bin", sep = "_")
dat1[bin_names] <- lapply(dat1[num_names], \(x) dplyr::ntile(x, n = 10))
dat1
#> x1 x2 x1_bin x2_bin
#> 1 10 20.00 1 1
#> 2 20 30.50 2 4
#> 3 30 40.50 3 5
#> 4 40 20.12 4 2
#> 5 50 25.00 5 3
#> 6 70 86.00 6 8
#> 7 80 75.00 7 7
#> 8 90 45.00 8 6
Created on 2021-12-07 by the reprex package (v2.0.1)
As a base R loop:
for (i in seq_along(bin_names)) {
  # [[ ]] extracts the column as a vector; [ ] would pass a one-column data frame
  dat1[[bin_names[i]]] <- dplyr::ntile(dat1[[num_names[i]]], 10)
}
dat1
#> x1 x2 x1_bin x2_bin
#> 1 10 20.00 1 1
#> 2 20 30.50 2 4
#> 3 30 40.50 3 5
#> 4 40 20.12 4 2
#> 5 50 25.00 5 3
#> 6 70 86.00 6 8
#> 7 80 75.00 7 7
#> 8 90 45.00 8 6
With dplyr::across:
library(dplyr)
dat1 %>%
mutate(across(all_of(num_names),
~ ntile(.x, n = 10),
.names = "{.col}_bin"))
#> x1 x2 x1_bin x2_bin
#> 1 10 20.00 1 1
#> 2 20 30.50 2 4
#> 3 30 40.50 3 5
#> 4 40 20.12 4 2
#> 5 50 25.00 5 3
#> 6 70 86.00 6 8
#> 7 80 75.00 7 7
#> 8 90 45.00 8 6
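For completeness, here is why the assign() loop in the question cannot work: assign() treats its first argument as a plain object name, so it creates a separate object instead of a column. The sketch below uses a shortened copy of the data, and rank() is only a base R stand-in for dplyr::ntile() so that the example runs without extra packages; created_object and column_added are illustrative names.

```r
dat1 <- data.frame(x1 = c(10, 20, 30, 40), x2 = c(20, 30.5, 40.5, 20.12))

# This creates a new variable literally called "dat1$x1_bin"
# and leaves the data frame dat1 untouched.
assign("dat1$x1_bin", 1:4)
created_object <- exists("dat1$x1_bin")
column_added   <- "x1_bin" %in% names(dat1)

# Assigning by column name with [[<- does modify the data frame.
num_names <- names(dat1)[sapply(dat1, is.numeric)]
for (nm in num_names) {
  # rank() is a stand-in for dplyr::ntile() in this sketch
  dat1[[paste0(nm, "_bin")]] <- rank(dat1[[nm]], ties.method = "first")
}
names(dat1)
```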

extract data from data frame and delete extracted data

I have a data frame named df with three variables. I want to extract a subset of df into "df1" in such a way that the extracted rows no longer exist in df. The extraction can be done with subset(), but the extracted rows still remain in df.
Any help would be appreciated.
df<-
gender age pro
1 22 0.0301
2 11 0.0934
1 44 0.108
2 56 0.0894
1 70 0.0444
2 33 0.00945
1 23 0.00226
2 32 0.0258
1 12 0.0701
2 1 0.0827
1 17 0.0657
1 9 0.0324
2 44 0.00755
1 49 0.000456
2 39 0.0255
1 18 0.0828
2 31 0.0931
1 8 0.0717
df1<- subset(df, age > 14 & age< 50 & gender==2)
You can use dplyr::anti_join to remove the extracted rows from the original data:
df1 <- subset(df, age > 14 & age < 50 & gender == 2)
df <- dplyr::anti_join(df, df1)
With base R, negate the condition to get the rows that remain after the extraction:
df1 <- subset(df, !(age > 14 & age < 50 & gender==2))
Output:
gender age pro
<dbl> <dbl> <dbl>
1 1 22 0.0301
2 2 11 0.0934
3 1 44 0.108
4 2 56 0.0894
5 1 70 0.0444
6 1 23 0.00226
7 1 12 0.0701
8 2 1 0.0827
9 1 17 0.0657
10 1 9 0.0324
11 1 49 0.000456
12 1 18 0.0828
13 1 8 0.0717
Using dplyr
library(dplyr)
filter(df, !(age > 14 & age < 50 & gender==2))
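Both steps, extracting the rows and dropping them from df, can also be driven by a single logical index. This is a minimal base R sketch using the data from the question; the name keep is just illustrative.

```r
df <- data.frame(
  gender = c(1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 1, 2, 1, 2, 1, 2, 1),
  age    = c(22, 11, 44, 56, 70, 33, 23, 32, 12, 1, 17, 9, 44, 49, 39, 18, 31, 8),
  pro    = c(0.0301, 0.0934, 0.108, 0.0894, 0.0444, 0.00945, 0.00226, 0.0258,
             0.0701, 0.0827, 0.0657, 0.0324, 0.00755, 0.000456, 0.0255, 0.0828,
             0.0931, 0.0717)
)

# Evaluate the condition once, then use it for both pieces
keep <- with(df, age > 14 & age < 50 & gender == 2)
df1 <- df[keep, ]   # extracted rows
df  <- df[!keep, ]  # everything else stays in df
```

Because the same index is used twice, every row ends up in exactly one of the two data frames, provided the condition contains no NA values.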

How to find the normalized values within each level of a variable in R

I have a categorical variable B with 3 levels (1, 2, 3) and another variable A with some values. Sample data is as follows:
A B
22 1
23 1
12 1
34 1
43 2
47 2
49 2
65 2
68 3
70 3
75 3
82 3
120 3
. .
. .
. .
. .
For every level of B (say B = 1), I need to calculate (A - min(A)) / (max(A) - min(A)), and then do the same for the other levels (2 and 3).
Solution using dplyr:
set.seed(1)
df=data.frame(A=round(rnorm(21,50,10)),B=rep(1:3,each=7))
library(dplyr)
df %>% group_by(B) %>% mutate(C= (A-min(A))/(max(A)-min(A)))
The output is like
# A tibble: 21 x 3
# Groups: B [3]
A B C
<dbl> <int> <dbl>
1 44 1 0.0833
2 52 1 0.417
3 42 1 0
4 66 1 1
5 53 1 0.458
6 42 1 0
7 55 1 0.542
8 57 2 0.784
9 56 2 0.757
10 47 2 0.514
# ... with 11 more rows
You could use the tapply function:
x = read.table(text="A B
22 1
23 1
12 1
34 1
43 2
47 2
49 2
65 2
68 3
70 3
75 3
82 3
120 3", header = TRUE)
y = tapply(x$A, x$B, function(z) (z - min(z)) / (max(z) - min(z)))
# Or using the scale() function
#y = tapply(x$A, x$B, function(z) scale(z, min(z), max(z) - min(z)))
cbind(x, unlist(y))
Not exactly sure how you want the output, but this should be a decent starting point. Note that cbind(x, unlist(y)) lines up with the rows only because x is already sorted by B.
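If the rows are not sorted by B, ave() may be a simpler base R option, since it returns the per-group result in the original row order. A minimal sketch on the sample data follows; rescale01 is a hypothetical helper name, and a group whose values are all equal would give NaN (0 / 0).

```r
x <- data.frame(
  A = c(22, 23, 12, 34, 43, 47, 49, 65, 68, 70, 75, 82, 120),
  B = c(1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 3)
)

# Min-max scaling of a vector to the range [0, 1]
rescale01 <- function(z) (z - min(z)) / (max(z) - min(z))

# ave() applies the function within each level of B and keeps row order
x$C <- ave(x$A, x$B, FUN = rescale01)
x
```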

Make column with "sample" for each row with purrr

I'm trying to create a column with a sampled value for each row of my data,
but I'm new to purrr and can't get it to work.
My code:
df<-data.frame(x=rep(1:3,each=4),y=99)
df%>%
group_by(x)%>%
mutate_(val=~purrr::map_dbl(function(x) sample(50,1)))
This didn't work.
But the same function works when purrr is used on its own:
1:5%>%purrr::map_dbl(function(x) sample(50,1))
[1] 39 30 7 18 45
Thanks for any help!
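You don't need purrr. Note that sample(50, n()) draws without replacement within each group; add replace = TRUE if repeated values should be allowed, as with independent sample(50, 1) calls: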
df <- data.frame(x = rep(1:3, each = 4), y = 99)
df %>%
group_by(x) %>%
mutate(val = sample(50, n()))
Output
# A tibble: 12 x 3
# Groups: x [3]
x y val
<int> <dbl> <int>
1 1 99.0 10
2 1 99.0 25
3 1 99.0 2
4 1 99.0 24
5 2 99.0 48
6 2 99.0 19
7 2 99.0 34
8 2 99.0 33
9 3 99.0 24
10 3 99.0 14
11 3 99.0 37
12 3 99.0 12
If you need to use purrr, you could use map_int (plain map would return a list column):
dplyr::mutate(df, val = purrr::map_int(x, ~ sample(50, 1)))
x y val
1 1 99 35
2 1 99 4
3 1 99 43
4 1 99 28
5 2 99 49
6 2 99 31
7 2 99 31
8 2 99 31
9 3 99 19
10 3 99 4
11 3 99 43
12 3 99 20
Or with the pipe:
library(dplyr)
library(purrr)
df %>%
mutate(val = map_int(x, ~ sample(50, 1)))
Data:
df <- data.frame(x = rep(1:3, each = 4), y = 99)
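If all that is needed is one independent draw per row, a plain base R call may already be enough, without dplyr or purrr. A minimal sketch, with the seed assumed only for reproducibility:

```r
set.seed(123)  # assumed seed, only so the sketch is reproducible
df <- data.frame(x = rep(1:3, each = 4), y = 99)

# One draw from 1:50 per row; replace = TRUE makes the draws independent,
# which matches calling sample(50, 1) once for each row
df$val <- sample(50, nrow(df), replace = TRUE)
df
```

Grouping only matters here if the draws must be unique within each group, which is what sample(50, n()) inside group_by() gives.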

summarise dataset conditioning on variable using dplyr

I want to summarise my dataset grouping the variable age into 5 years age groups, so instead of single age 0 1 2 3 4 5 6... I would have 0 5 10 15 etc. with 80 being my open-ended category. I could do this by categorizing everything by hand creating a new variable, but I am sure there must be a quicker way!
a <- cbind(age=c(rep(seq(0, 90, by=1), 2)), value=rnorm(182))
Any ideas?
Like this?
library(dplyr)
a %>% data.frame %>% group_by(age_group = (pmin(age, 80) %/% 5) * 5) %>%
summarize(avg_val = mean(value))
# A tibble: 17 x 2
age_group avg_val
<dbl> <dbl>
1 0 -0.151470805
2 5 0.553619149
3 10 0.198915973
4 15 -0.436646287
5 20 -0.024193193
6 25 0.102671120
7 30 -0.350059839
8 35 0.010762264
9 40 0.339268917
10 45 -0.056448481
11 50 0.002982158
12 55 0.348232262
13 60 -0.364050091
14 65 0.177551510
15 70 -0.178885909
16 75 0.664215782
17 80 -0.376929230
Example data
set.seed(1)
df <- data.frame(age=runif(1000)*100,
value=runif(1000))
Simply append the maximum age to seq(0, 80, 5) with c(..., max(age)) to get irregular breaks with an open-ended last interval:
library(dplyr)
df %>%
mutate(age = cut(age, breaks=c(seq(0,80,5), max(age)))) %>%
group_by(age) %>%
summarise(value=mean(value))
Output
age value
<fctr> <dbl>
1 (0,5] 0.4901119
2 (5,10] 0.5131055
3 (10,15] 0.5022297
4 (15,20] 0.4712481
5 (20,25] 0.5610872
6 (25,30] 0.4207203
7 (30,35] 0.5218318
8 (35,40] 0.4377102
9 (40,45] 0.5007616
10 (45,50] 0.4941768
11 (50,55] 0.5350272
12 (55,60] 0.5226967
13 (60,65] 0.5031688
14 (65,70] 0.4652641
15 (70,75] 0.5667020
16 (75,80] 0.4664898
17 (80,100] 0.4604779
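One thing to watch with cut(): by default the intervals are closed on the right, so with integer ages as in the question, age 0 falls outside (0,5] and becomes NA, and age 5 lands in the first group. The sketch below is one way to get groups 0-4, 5-9, ..., 80+ in base R. Using Inf as the last break and the lower bounds as labels are assumptions of this sketch, not part of the answers above.

```r
set.seed(1)  # assumed seed, only so the sketch is reproducible
a <- data.frame(age = rep(0:90, 2), value = rnorm(182))

# Breaks at 0, 5, ..., 80 plus an open-ended upper bound
breaks <- c(seq(0, 80, 5), Inf)

# right = FALSE gives [0,5), [5,10), ..., [80,Inf); label each by its lower bound
a$age_group <- cut(a$age, breaks = breaks, right = FALSE,
                   labels = head(breaks, -1))

res <- aggregate(value ~ age_group, data = a, FUN = mean)
res
```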

Resources