Transforming and adding a new column in R

I currently have a data frame taken from a data feed of events that happened in chronological order. I would like to add a new column to each row of my data that corresponds to the previous event's endx if the prior event type is 1, and to the previous event's x if the prior event type is not 1.
e.g
player_id <- c(12, 17, 26, 3)
event_type <- c(1, 3, 1, 10)
x <- c(65, 34, 43, 72)
endx <- c(68, NA, 47, NA)
df <- data.frame(player_id, event_type, x, endx)
df
player_id event_type x endx
1 12 1 65 68
2 17 3 34 NA
3 26 1 43 47
4 3 10 72 NA
so end result
player_id event_type x endx previous
1 12 1 65 68 NA
2 17 3 34 NA 68
3 26 1 43 47 34
4 3 10 72 NA 47

We can use if_else
library(dplyr)
df %>%
mutate(previous = if_else(lag(event_type)==1, lag(endx), lag(x)))
# player_id event_type x endx previous
#1 12 1 65 68 NA
#2 17 3 34 NA 68
#3 26 1 43 47 34
#4 3 10 72 NA 47

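I am sure this isn't the most succinct way, but you can use a loop and indexing.
df$previous <- NA
for (i in 2:nrow(df)) {
  # take endx from the previous row if that event was type 1, otherwise x
  prev <- df[i - 1, ]
  df[i, "previous"] <- if (prev$event_type == 1) prev$endx else prev$x
}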
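The rule from the question (previous endx when the prior event type is 1, previous x otherwise) can also be sketched in base R without a loop or extra packages. The helper name shift1 is made up for this illustration; it simply shifts a vector down by one row.

```r
# Data from the question
df <- data.frame(
  player_id  = c(12, 17, 26, 3),
  event_type = c(1, 3, 1, 10),
  x          = c(65, 34, 43, 72),
  endx       = c(68, NA, 47, NA)
)

# Hypothetical helper: shift a vector down by one position (a lag of 1)
shift1 <- function(v) c(NA, head(v, -1))

prev_type <- shift1(df$event_type)

# Use the previous endx where the previous event type was 1, else the previous x
df$previous <- ifelse(prev_type == 1, shift1(df$endx), shift1(df$x))
df$previous
# NA 68 34 47
```

The first row has no previous event, so its value stays NA.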

Related

create a new variable that is the multiplication of an existing variable by a number and that meets a condition

data <- structure(list(
x = c(1, 2, 1, 2, 2, 1, 3, 3, 1),
y = c(20, 30, 40, 10, 15, 34, 57, 72, 12)),
class = "data.frame",
row.names = c(NA,-9L))
Hi guys, I want to create a new variable from the data.frame above in RStudio, but it doesn't work. What I want to do is the equivalent of this Stata command, but in R:
gen var = y*3600 if x == 1
So I ran this R command, but it didn't work:
df$var[df$x == 1] <- df$y*3600
the new variable should look like this:
x y var
1 20 72000
2 30 NA
1 40 144000
2 10 NA
2 15 NA
1 34 122400
3 57 NA
3 72 NA
1 12 43200
I appreciate any help and thanks in advance
data$var <- ifelse(data$x == 1, data$y * 3600, NA)
x y var
1 1 20 72000
2 2 30 NA
3 1 40 144000
4 2 10 NA
5 2 15 NA
6 1 34 122400
7 3 57 NA
8 3 72 NA
9 1 12 43200
We can use replace like below
transform(
  data,
  var = replace(y * 3600, x != 1, NA)
)
x y var
1 1 20 72000
2 2 30 NA
3 1 40 144000
4 2 10 NA
5 2 15 NA
6 1 34 122400
7 3 57 NA
8 3 72 NA
9 1 12 43200
Another option
df$var <- df$y * 3600
df$var[df$x != 1] <- NA
df
#-------
> df
x y var
1 1 20 72000
2 2 30 NA
3 1 40 144000
4 2 10 NA
5 2 15 NA
6 1 34 122400
7 3 57 NA
8 3 72 NA
9 1 12 43200
In data.table
library(data.table)
setDT(data)  # convert to a data.table by reference; as.data.table(data) alone would discard its result
data[x == 1, var := y * 3600]
You need to subset the data on both sides of the assignment.
data$var <- NA
data$var[data$x == 1] <- data$y[data$x == 1] *3600
data
# x y var
#1 1 20 72000
#2 2 30 NA
#3 1 40 144000
#4 2 10 NA
#5 2 15 NA
#6 1 34 122400
#7 3 57 NA
#8 3 72 NA
#9 1 12 43200
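To see why the subset is needed on both sides, it may help to compare lengths: the left-hand side selects only the rows where x == 1, while an unsubsetted right-hand side offers one value per row. A small sketch using the question's data (the name sel is only for illustration):

```r
data <- data.frame(
  x = c(1, 2, 1, 2, 2, 1, 3, 3, 1),
  y = c(20, 30, 40, 10, 15, 34, 57, 72, 12)
)

sel <- data$x == 1        # logical index of the rows to fill

n_slots  <- sum(sel)                 # 4 positions on the left-hand side
n_values <- length(data$y * 3600)    # 9 values if the right-hand side is not subsetted

# Subsetting both sides with the same index keeps values aligned with their rows
data$var <- NA
data$var[sel] <- data$y[sel] * 3600
data$var
# 72000 NA 144000 NA NA 122400 NA NA 43200
```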
Another option is to use case_when in dplyr.
library(dplyr)
data <- data %>% mutate(var = case_when(x == 1 ~ y * 3600))
By default if a condition is not satisfied it returns NA.

how to subtract the next column by the previous column and create a new column after?

There are questions here on Stack Overflow about how to diff a column by the previous column. My question is a little different: I want to create a new column after that diff without modifying the existing columns.
Sample data:
dfData <- data.frame(ID = c(1, 2, 3, 4, 5),
DistA = c(10, 8, 15, 22, 15),
DistB = c(15, 35, 40, 33, 20),
DistC = c(20,40,50,45,30),
DistD = c(60,55,55,48,50))
ID DistA DistB DistC DistD
1 1 10 15 20 60
2 2 8 35 40 55
3 3 15 40 50 55
4 4 22 33 45 48
5 5 15 20 30 50
Expected output:
ID DistA DistB DiffB-A DistC DistD Diff D-C
1 1 10 15 05 20 60 40
2 2 8 35 27 40 55 15
3 3 15 40 25 50 55 05
4 4 22 33 11 45 48 03
5 5 15 20 5 30 50 20
If you want to subtract every pair of columns, we can use split.default to split the data into groups of two columns each and subtract the first column from the second one in each group.
cols <- ceiling(seq_along(dfData[-1])/2)
new_cols <- tapply(names(dfData[-1]), cols, function(x)
sprintf('diff_%s', paste0(x, collapse = '')))
dfData[new_cols] <- sapply(split.default(dfData[-1], cols), function(x)
x[[2]] - x[[1]])
dfData
# ID DistA DistB DistC DistD diff_DistADistB diff_DistCDistD
#1 1 10 15 20 60 5 40
#2 2 8 35 40 55 27 15
#3 3 15 40 50 55 25 5
#4 4 22 33 45 48 11 3
#5 5 15 20 30 50 5 20
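If the difference columns should sit right after each pair, as in the expected output, one simple base R sketch is to compute them and then reorder the columns by name. The column names DiffBA and DiffDC are chosen here for illustration; they are not from the question.

```r
dfData <- data.frame(ID = c(1, 2, 3, 4, 5),
                     DistA = c(10, 8, 15, 22, 15),
                     DistB = c(15, 35, 40, 33, 20),
                     DistC = c(20, 40, 50, 45, 30),
                     DistD = c(60, 55, 55, 48, 50))

# Compute each difference as "second column minus first column"
dfData$DiffBA <- dfData$DistB - dfData$DistA
dfData$DiffDC <- dfData$DistD - dfData$DistC

# Reorder so each difference follows the pair it was computed from
dfData <- dfData[c("ID", "DistA", "DistB", "DiffBA", "DistC", "DistD", "DiffDC")]
dfData
```

This is only practical for a handful of pairs; for many columns the split.default approach scales better.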

Replace several values and keep others same efficiently in R

I have a dataframe like the following:
combo_2 combo_4 combo_7 combo_9
12 23 14 17
21 32 41 71
2 3 1 7
1 2 4 1
21 23 14 71
2 32 1 7
Each column has two single-digit values and two double-digit values composed of the single-digit values in each possible order.
I am trying to determine how to replace certain values in the dataframe so that there is only one version of the double-digit value. For example, all values of 21 in the first column should be 12. All values of 32 in the second column should become 23.
I know I can do something like this using the following code:
df <- df %>%
mutate_at(vars(combo_2, combo_4, combo_7, combo_9), function(x)
case_when(x == 21 ~ 12, x == 32 ~ 23, x == 41 ~ 14, x == 71 ~ 17))
The problem with this is that it gives me a dataframe that contains the correct values when specified but leaves all the other values as NA. The resulting dataframe only contains values where 21, 32, 41, and 71 were. I know I could address this by specifying each value, like x == 1 ~ 1. However, I have many values and would prefer to only specify the ones that I am trying to change.
How can I replace several values in a dataframe without all the other values becoming NA? Is there a way for me to replace the values I want to replace while holding the other values the same without directly specifying those values?
You can use TRUE ~ x at the end of your case_when() sequence:
df %>%
mutate_at(vars(combo_2, combo_4, combo_7, combo_9), function(x)
case_when(x == 21 ~ 12, x == 32 ~ 23, x == 41 ~ 14, x == 71 ~ 17, TRUE ~ x))
combo_2 combo_4 combo_7 combo_9
1 12 23 14 17
2 12 23 14 17
3 2 3 1 7
4 1 2 4 1
5 12 23 14 17
6 2 23 1 7
Another option that may be more efficient would be data.table's fcase() function.
Data:
df = read.table(header = TRUE, text = "combo_2 combo_4 combo_7 combo_9
12 23 14 17
21 32 41 71
2 3 1 7
1 2 4 1
21 23 14 71
2 32 1 7")
df[] = lapply(df, as.double) # side-note: tidyverse has become very strict about types
One dplyr and stringi option may be:
df %>%
mutate(across(everything(),
~ if_else(. %in% c(21, 32, 41, 71), as.integer(stri_reverse(.)), .)))
combo_2 combo_4 combo_7 combo_9
1 12 23 14 17
2 12 23 14 17
3 2 3 1 7
4 1 2 4 1
5 12 23 14 17
6 2 23 1 7
Using mapply:
df1[] <- mapply(function(d, x1, x2){ ifelse(d == x1, x2, d) },
d = df1,
x1 = c(21, 32, 41, 71),
x2 = c(12, 23, 14, 17))
df1
# combo_2 combo_4 combo_7 combo_9
# 1 12 23 14 17
# 2 12 23 14 17
# 3 2 3 1 7
# 4 1 2 4 1
# 5 12 23 14 17
# 6 2 23 1 7
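Another way to change only the listed values and leave everything else untouched is a named lookup vector in base R. This is a sketch, not one of the original answers; recode_map is a name made up for the example, and it replaces the listed values in every column.

```r
df <- read.table(header = TRUE, text = "combo_2 combo_4 combo_7 combo_9
12 23 14 17
21 32 41 71
2 3 1 7
1 2 4 1
21 23 14 71
2 32 1 7")

# Lookup table: names are the values to replace, elements are the replacements
recode_map <- c("21" = 12, "32" = 23, "41" = 14, "71" = 17)

df[] <- lapply(df, function(col) {
  key <- as.character(col)
  hit <- key %in% names(recode_map)   # only these positions are changed
  col[hit] <- recode_map[key[hit]]
  col
})
df
```

Values that do not appear in the lookup table are never touched, so nothing has to be listed just to keep it the same.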

How to create a column with information from other columns

I am not able to create the column I want. For each new value of the event column, it should hold the flow value from three rows before the first row of that event.
I tried to approach this problem by using for loops but can't exactly replicate what I want. I'm close but not there.
To recreate the example, I generated the following data frame:
flow<- c(40, 39, 38, 37, 50, 49, 46, 44, 60, 55, 40, 70, 80, 75, 90, 88, 86, 100, 120, 118)
event<- c(1,1,1,1,2,2,2,2,3,3,3,4,5,5,6,6,6,7,8,8)
a<- data.frame(flow, event)
for (j in seq(1, length(a$event))) {
if (a$event[j] <= 1){
a$BF[a$event==j]<- NA}
else{
if (a$event[j] == a$event[j-1]){
a$BF[a$event==j]<- a$flow[j-3]
} else{
a$BF[j]<- a$flow[j-3] }
}
}
I expected to generate a column called "BF" to be like this:
flow event BF
1 40 1 NA
2 39 1 NA
3 38 1 NA
4 37 1 NA
5 50 2 39
6 49 2 39
7 46 2 39
8 44 2 39
9 60 3 49
10 55 3 49
11 40 3 49
12 70 4 60
13 80 5 55
14 75 5 55
15 90 6 70
16 88 6 70
17 86 6 70
18 100 7 90
19 120 8 88
20 118 8 88
The problem with the previous code is that it does not properly repeat the values across all rows of the same event. (The result should look like the table above.)
A tidier solution would be:
library(dplyr)
a %>%
mutate(BF = ifelse(event<=1,NA,row_number()-3)) %>%
group_by(event) %>%
mutate(BF = BF[1]) %>%
ungroup() %>%
mutate(BF = a[BF,]$flow)
# A tibble: 20 x 3
flow event BF
<dbl> <dbl> <dbl>
1 40 1 NA
2 39 1 NA
3 38 1 NA
4 37 1 NA
5 50 2 39
6 49 2 39
7 46 2 39
8 44 2 39
9 60 3 49
10 55 3 49
11 40 3 49
12 70 4 60
13 80 5 55
14 75 5 55
15 90 6 70
16 88 6 70
17 86 6 70
18 100 7 90
19 120 8 88
20 118 8 88
An alternative way to get the output with tidyverse. This breaks your problem up into two pieces. There is likely something more succinct out there:
library(tidyverse)
critical_info <- a %>%
mutate(previous = lag(flow, 3)) %>% #find the previous flow number for each
group_by(event) %>%
mutate(subevent = row_number()) %>% #to know each subevent within an event
filter(subevent == 1) %>% #filter out unimportant rows
rename(BF = previous) %>% #rename the column
select(event, BF) # get the right stuff
a %>%
left_join(critical_info, by ="event")
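For comparison, the same column can be sketched in base R with match(), which returns the row of the first occurrence of each event. This assumes, as in the example data, that rows of one event are contiguous and that events whose lookup would fall before row 1 should get NA.

```r
flow  <- c(40, 39, 38, 37, 50, 49, 46, 44, 60, 55, 40, 70, 80, 75, 90, 88, 86, 100, 120, 118)
event <- c(1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 4, 5, 5, 6, 6, 6, 7, 8, 8)
a <- data.frame(flow, event)

# Row index of the first row of each event, repeated for every row of that event
first_row <- match(a$event, a$event)

# Look three rows back from that first row; there is nothing before row 1
idx <- first_row - 3
idx[idx < 1] <- NA

a$BF <- a$flow[idx]
a$BF
```

Indexing with NA yields NA, which is what gives the first event its missing values.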

R create a column to identify the group that row belong to

Description of Data: Dataset contains information regarding users about their age, gender and membership they are holding.
Goal: Create a new column to identify the group/label for each user based on pre-defined conditions.
Age conditions: multiple age brackets :
18 <= age <= 24, 25 <= age <= 30, 31 <= age <= 41, 42 <= age <= 60, age >= 61
Gender: M/F
Membership: A,B,C,I
I created sample data frame to try out creation of new column to identify the group/label
df = data.frame(userid = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10,11, 12),
age = c(18, 61, 23, 35, 30, 25, 55, 53, 45, 41, 21, NA),
gender = c('F', 'M', 'F', 'F', 'M', 'M', 'M', 'M', 'M', 'F', '<NA>', 'M'),
membership = c('A', 'B', 'A', 'C', 'C', 'B', 'A', 'A', 'I', 'I', 'A', '<NA>'))
userid age gender membership
1 1 18 F A
2 2 61 M B
3 3 23 F A
4 4 35 F C
5 5 30 M C
6 6 25 M B
7 7 55 M A
8 8 53 M A
9 9 45 M I
10 10 41 F I
11 11 21 <NA> A
12 12 NA M <NA>
Based on above data there exist 4 * 2 * 5 options (combinations)
Final outcome:
userid age gender membership GroupID
1 1 16 F A 1
2 2 61 M B 40
3 3 23 F A 1
4 4 35 F C 4
5 5 30 M C 5
6 6 25 M B 3
7 7 55 M A 32
8 8 53 M A 32
9 9 45 M I 34
10 10 41 F I 35
userid age gender membership GroupID
1 1 18 F A 1
2 2 61 M B 40
3 3 23 F A 1
4 4 35 F C 4
5 5 30 M C 5
6 6 25 M B 3
7 7 55 M A 32
8 8 53 M A 32
9 9 45 M I 34
10 10 41 F I 35
11 11 21 <NA> A 43 (assuming it will auto-detect the combo)
12 12 NA M <NA> 46
I believe my calculation of the combinations is correct; if so, how can I use dplyr or any other option to get the above data frame?
Should I use multiple if conditions to cover all the options?
In dplyr, is there a way to provide conditions for each column to set the grouping conditions?
df %>% group_by(age, gender, membership)
Two options,
One, more automated;
# install.packages("tidyverse", dependencies = TRUE)
library(tidyverse)
df %>% mutate(ageCat = cut(age, breaks = c(-Inf, 24, 30, 41, 60, Inf))) %>%
mutate(GroupID = group_indices(., ageCat, gender, membership)) %>% select(-ageCat)
#> userid age gender membership GroupID
#> 1 1 18 F A 2
#> 2 2 61 M B 9
#> 3 3 23 F A 2
#> 4 4 35 F C 5
#> 5 5 30 M C 4
#> 6 6 25 M B 3
#> 7 7 55 M A 7
#> 8 8 53 M A 7
#> 9 9 45 M I 8
#> 10 10 41 F I 6
#> 11 11 21 <NA> A 1
#> 12 12 NA M <NA> 10
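To check how cut() assigns ages with these breaks, a small sketch on boundary ages may help. By default the intervals are closed on the right, so 24 falls in the first bracket and 25 in the second; labels = FALSE returns the bracket number instead of an interval label. The test ages below are made up for the check.

```r
ages <- c(18, 24, 25, 30, 31, 41, 42, 60, 61, NA)

# Same breaks as in the answer above; intervals are (-Inf,24], (24,30], (30,41], (41,60], (60,Inf]
bracket <- cut(ages, breaks = c(-Inf, 24, 30, 41, 60, Inf), labels = FALSE)
bracket
# 1 1 2 2 3 3 4 4 5 NA
```

A missing age stays NA rather than being placed in a bracket.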
Two, more manual;
Here I illustrate a solution for categories 1 and 4; you will have to code the rest yourself.
df %>% mutate(GroupID =
ifelse(age >= 18 & age <= 24 & gender == 'F' & membership == "A", 1,
ifelse(age >= 31 & age <= 41 & gender == 'F' & membership == "C", 4, NA)
))
#> userid age gender membership GroupID
#> 1 1 18 F A 1
#> 2 2 61 M B NA
#> 3 3 23 F A 1
#> 4 4 35 F C 4
#> 5 5 30 M C NA
#> 6 6 25 M B NA
#> 7 7 55 M A NA
#> 8 8 53 M A NA
#> 9 9 45 M I NA
#> 10 10 41 F I NA
#> 11 11 21 <NA> A NA
#> 12 12 NA M <NA> NA
the data structure in case others feel like giving it a go,
You can try this:
setDT(df)[, agegrp := ifelse(age >= 18 & age <= 24, 1,
                      ifelse(age >= 25 & age <= 30, 2,
                      ifelse(age >= 31 & age <= 41, 3,
                      ifelse(age >= 42 & age <= 60, 4, 5))))]
setDT(df)[, group := .GRP, by = .(agegrp,gender, membership)]
If you want to use base R only, you could do something like this:
# 1
allcombos <- expand.grid(c("M", "F"), c("A", "B", "C", "I"), 1:5)
allgroups <- do.call(paste0, allcombos) # 40 unique combinations
# 2
agegroups <- cut(df$age,
breaks = c(17, 24, 30, 41, 60, Inf),  # 61 and above falls in group 5
labels = c(1, 2, 3, 4, 5))
# 3
df$groupid <- paste0(df$gender, df$membership, agegroups)
df$groupid <- factor(df$groupid, levels=allgroups, labels=1:length(allgroups))
expand.grid gives you a data.frame with three columns where every row represents a unique combination of the three arguments provided. As you said, these are 40 combinations. The second line combines every row of the data frame in a single string, like "MA1", "FA1", "MB1", etc.
Then we use cut to assign each age to its relevant age group, labelled 1 to 5.
We create a column in df that contains the three character combination of the gender, membership and age group which is then converted to a factor, according to all possible combinations we found in allgroups.
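A quick look at what expand.grid and paste0 produce can make the numbering clearer. In this sketch the first argument (gender) varies fastest, so the position of a string in allgroups is its group number; the lookup of "FA1" is just an example.

```r
allcombos <- expand.grid(c("M", "F"), c("A", "B", "C", "I"), 1:5)
allgroups <- do.call(paste0, allcombos)

n_groups <- length(allgroups)     # 40 unique combinations
first3   <- head(allgroups, 3)    # "MA1" "FA1" "MB1"

# Position in allgroups = group id assigned by factor(..., levels = allgroups)
id_FA1 <- match("FA1", allgroups)
id_FA1
# 2
```

Note that the resulting ids depend on the argument order given to expand.grid, so they will not necessarily match a numbering worked out by hand.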
