How to convert factor levels to integer in R

I have the following dataframe in R:
ID Season Year Weekday
1 Winter 2017 Monday
2 Winter 2018 Tuesday
3 Summer 2017 Monday
4 Summer 2018 Wednsday
I want to convert these factor levels to integers; the following is my desired dataframe:
ID Season Year Weekday
1 1 1 1
2 1 2 2
3 2 1 1
4 2 2 3
Winter = 1, Summer = 2
2017 = 1, 2018 = 2
Monday = 1, Tuesday = 2, Wednesday = 3
Currently, I am using nested ifelse calls for each of the three columns, e.g. for Weekday:
otest_xgb$Weekday <- as.integer(ifelse(otest_xgb$Weekday == "Monday", 1,
                                ifelse(otest_xgb$Weekday == "Tuesday", 2,
                                ifelse(otest_xgb$Weekday == "Wednesday", 3,
                                ifelse(otest_xgb$Weekday == "Thursday", 4, 5)))))
Is there any way to avoid writing long ifelse ?

m <- dat
# factor(x, unique(x)) sets the levels in order of first appearance
# (Winter = 1, Summer = 2, ...) rather than alphabetical order
m[] <- lapply(dat, function(x) as.integer(factor(x, unique(x))))
m
ID Season Year Weekday
1 1 1 1 1
2 2 1 2 2
3 3 2 1 1
4 4 2 2 3

We can use match with the unique elements:
library(dplyr)
dat %>%
    mutate_all(funs(match(., unique(.))))
# ID Season Year Weekday
#1 1 1 1 1
#2 2 1 2 2
#3 3 2 1 1
#4 4 2 2 3
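Note that funs() is deprecated in dplyr 1.0.0 and later; assuming a reasonably recent dplyr, a rough equivalent with across() would be:
library(dplyr)
dat %>%
    mutate(across(everything(), ~ match(.x, unique(.x))))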

Ordered and nominal factor variables need to be handled separately. Directly converting a factor column to integer or numeric assigns values according to the lexicographic (alphabetical) order of the levels.
Here Weekday is conceptually ordinal, Year is an integer, and Season is generally nominal. However, this is subjective and depends on the kind of analysis required.
For example, when you convert directly from factor to integer, Wednesday in the Weekday column gets a higher value than both Saturday and Tuesday:
dat[] <- lapply(dat, function(x) as.integer(factor(x)))
dat
# ID Season Year Weekday
#1 1 2 1 1
#2 2 2 2 3
#3 3 1 1 2 (Saturday)
#4 4 1 2 4 (Wednesday): assigned a value greater than that of Saturday
Therefore, you can convert directly from factor to integer only for the Season and Year columns. Note that this works fine for the Year column because its lexicographic order matches its ordinal sense.
dat[c('Season', 'Year')] <- lapply(dat[c('Season', 'Year')],
function(x) as.integer(factor(x)))
Weekday needs to be converted via an ordered factor with the desired order of levels. Getting the order wrong might be harmless for general aggregation, but it will drastically affect results when fitting statistical models.
dat$Weekday <- as.integer(factor(dat$Weekday,
levels = c("Monday", "Tuesday", "Wednesday", "Thursday",
"Friday", "Saturday", "Sunday"), ordered = TRUE))
dat
# ID Season Year Weekday
#1 1 2 1 1
#2 2 2 2 2
#3 3 1 1 6 (Saturday)
#4 4 1 2 3 (Wednesday): assigned value less than that of Saturday
Data Used:
dat <- read.table(text=" ID Season Year Weekday
1 Winter 2017 Monday
2 Winter 2018 Tuesday
3 Summer 2017 Saturday
4 Summer 2018 Wednesday", header = TRUE)

You can simply use as.numeric() to convert a factor to numeric. Each value is changed to the integer code that its factor level represents:
library(dplyr)
### Change factor levels to the levels you specified
otest_xgb$Season <- factor(otest_xgb$Season , levels = c("Winter", "Summer"))
otest_xgb$Year <- factor(otest_xgb$Year , levels = c(2017, 2018))
otest_xgb$Weekday <- factor(otest_xgb$Weekday, levels = c("Monday", "Tuesday", "Wednesday"))
otest_xgb %>%
dplyr::mutate_at(c("Season", "Year", "Weekday"), as.numeric)
# ID Season Year Weekday
# 1 1 1 1 1
# 2 2 1 2 2
# 3 3 2 1 1
# 4 4 2 2 NA
(The Weekday in row 4 is NA because the question's data contains the misspelling "Wednsday", which is not among the specified levels.)

Once you have converted Season, Year and Weekday to factors, use this code to view the dummy indicator (contrast) coding for each of them:
contrasts(factor(dat$Season))
contrasts(factor(dat$Year))
contrasts(factor(dat$Weekday))
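A minimal sketch of how that coding turns into actual dummy columns, assuming the dat from the question (contrasts() only displays the coding; model.matrix() applies it, and dummies is just a throwaway name here):
dat$Season  <- factor(dat$Season)
dat$Year    <- factor(dat$Year)
dat$Weekday <- factor(dat$Weekday)
# model.matrix() expands each factor into indicator columns using its contrasts;
# the first level of each factor becomes the reference (all-zero) category
dummies <- model.matrix(~ Season + Year + Weekday, data = dat)
dummies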

Related

Calculate percent change between multiple columns of a data frame

I have a data frame as follows:
id <- c(1, 2, 3, 4, 5)
week1 <- c(234,567456, 134123, 13412421, 2345245)
week2 <- c(4234,5123456, 454123, 12342421, 8394545)
week3 <- c(1234, 234124, 12348, 9348522, 134534)
data <- data.frame(id, week1, week2, week3)
I would like to find the percent change between week1 and week2, then between week2 and week3, and so on (my dataframe is much larger, with about 27 columns).
I tried:
data$change1 <- (data$week2 - data$week1) * 100 / data$week1
However, doing this for every pair of columns would be tedious with a larger dataset.
Try the following:
library(tidyverse)
df <- data   # the data frame from the question
df <- gather(df, key = 'week', value = 'value', -id)
df$week <- as.integer(gsub('week', '', df$week))
df %>%
    group_by(id) %>%
    arrange(week) %>%
    mutate(perc_change = (value - lag(value, 1)) / lag(value, 1) * 100)
# A tibble: 15 x 4
# Groups: id [5]
id week value perc_change
<dbl> <int> <dbl> <dbl>
1 1 1 234 NA
2 2 1 567456 NA
3 3 1 134123 NA
4 4 1 13412421 NA
5 5 1 2345245 NA
6 1 2 4234 1709.
7 2 2 5123456 803.
8 3 2 454123 239.
9 4 2 12342421 -7.98
10 5 2 8394545 258.
11 1 3 1234 -70.9
12 2 3 234124 -95.4
13 3 3 12348 -97.3
14 4 3 9348522 -24.3
15 5 3 134534 -98.4
This works reasonably well, but it assumes that there is an observation every week; otherwise your percent change will be based on the last available week (so, if week 3 is missing, the value for week 4 will be a week-on-week change with week 2 as the basis).
(Edit: replaced substr with gsub)
Sense checking:
For row 6, you see id 1. This is week 2 with a value of 4234. In week 1, id 1 had a value of 234. The relative change is
(4234 - 234) / 234
[1] 17.09402
which, multiplied by 100, matches the 1709. shown for that row. So, that is aligned.
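If you would rather stay in wide format (handy with ~27 week columns), a base-R sketch along the following lines should also work. The week* column names come from the question and the change* names follow its change1; weeks, prev, curr and pct are throwaway names introduced here:
weeks <- paste0("week", 1:3)              # extend to week1 ... week27 as needed
prev  <- data[weeks[-length(weeks)]]      # week1, week2, ...
curr  <- data[weeks[-1]]                  # week2, week3, ...
pct   <- 100 * (curr - prev) / prev       # element-wise percent change
names(pct) <- paste0("change", seq_along(pct))
data <- cbind(data, pct)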

How to add a column with the most recent recurring observation within a group, but within a certain time period, in R

If I had:
person_ID visit_date
1 2/25/2001
1 2/27/2001
1 4/2/2001
2 3/18/2004
3 9/22/2004
3 10/27/2004
3 5/15/2008
and I wanted another column to indicate the earliest recurring observation within 90 days, grouped by patient ID, with the desired output:
person_ID visit_date date
1 2/25/2001 2/27/2001
1 2/27/2001 4/2/2001
1 4/2/2001 NA
2 3/18/2004 NA
3 9/22/2004 10/27/2004
3 10/27/2004 NA
3 5/15/2008 NA
Thank you!
We convert 'visit_date' to Date class, then, grouped by 'person_ID', create a binary column that returns 1 if the difference between the next and the current 'visit_date' is less than 90 days and 0 otherwise. Using this column, we get the corresponding next 'visit_date' where the value is 1.
library(dplyr)
library(lubridate)
library(tidyr)
df1 %>%
    mutate(visit_date = mdy(visit_date)) %>%
    group_by(person_ID) %>%
    mutate(i1 = replace_na(+(difftime(lead(visit_date),
                                      visit_date, units = 'days') < 90), 0),
           date = case_when(as.logical(i1) ~ lead(visit_date)),
           i1 = NULL) %>%
    ungroup
Output:
# A tibble: 7 x 3
# person_ID visit_date date
# <int> <date> <date>
#1 1 2001-02-25 2001-02-27
#2 1 2001-02-27 2001-04-02
#3 1 2001-04-02 NA
#4 2 2004-03-18 NA
#5 3 2004-09-22 2004-10-27
#6 3 2004-10-27 NA
#7 3 2008-05-15 NA
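For comparison, a rough data.table sketch of the same idea (assuming the same df1 as above; next_visit is a helper column introduced here, and shift() and fifelse() are data.table functions):
library(data.table)
dt <- as.data.table(df1)
dt[, visit_date := as.Date(visit_date, format = "%m/%d/%Y")]
# next visit for the same person, NA for that person's last visit
dt[, next_visit := shift(visit_date, type = "lead"), by = person_ID]
# keep the next visit only when it falls within 90 days
dt[, date := fifelse(next_visit - visit_date < 90, next_visit, as.Date(NA))]
dt[, next_visit := NULL]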

Using dplyr::lag to calculate days since first event

I am trying to use dplyr::lag to determine the number of days that have passed since the initial event for each subsequent event, but I am getting unexpected behavior.
Example, very simple data:
df <- data.frame(id = c("1", "1", "1", "1", "2", "2"),
date= c("4/1/2020", "4/2/2020", "4/3/2020", "4/4/2020", "4/17/2020", "4/18/2020"))
df$date <- as.Date(df$date, format = "%m/%d/%Y")
id date
1 1 4/1/2020
2 1 4/2/2020
3 1 4/3/2020
4 1 4/4/2020
5 2 4/17/2020
6 2 4/18/2020
What I was hoping to do was create a new column, days_since_first_event, that calculates the number of days between the initial event for each id and every subsequent date. This is the code I used and the output I expected:
df <- df %>%
group_by(id) %>%
mutate(days_since_first_event = as.numeric(date - lag(date)))
id date days_since_first_event
1 1 4/1/2020 0
2 1 4/2/2020 1
3 1 4/3/2020 2
4 1 4/4/2020 3
5 2 4/17/2020 0
6 2 4/18/2020 1
But instead I get this output
# A tibble: 6 x 3
# Groups: id [2]
id date days_since_first_event
<chr> <date> <dbl>
1 1 2020-04-01 NA
2 1 2020-04-02 1
3 1 2020-04-03 1
4 1 2020-04-04 1
5 2 2020-04-17 NA
6 2 2020-04-18 1
Any suggestions on what I'm doing wrong?
The first value of lag() within each group gets a default value because there is no 'older' row to look back to; that default is NA, hence the NAs in your results.
Furthermore, lag() compares each row with the previous one, so you only get the difference between consecutive events, not the days since the first event.
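To get days since the first event rather than since the previous one, compare each date with the first date in its group instead of the lagged date; a minimal sketch using the df from the question:
library(dplyr)
df %>%
    group_by(id) %>%
    mutate(days_since_first_event = as.numeric(date - first(date))) %>%
    ungroup()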

How to find mean of n consecutive days in each group r

I have a dataframe that contains id (with duplicates), date (with duplicates), and value. The values are recorded for consecutive days. I want to group the dataframe by id and by blocks of n consecutive days, find the mean of the values within each block, and return NA if the last block does not contain n days.
id date value
1 2016-10-5 2
1 2016-10-6 3
1 2016-10-7 1
1 2016-10-8 2
1 2016-10-9 5
2 2013-10-6 2
. . .
. . .
. . .
20 2012-2-6 10
Desired output with n consecutive days as 3:
id date value group_n_consecutive_days mean_n_consecutive_days
1 2016-10-5 2 1 2
1 2016-10-6 3 1 2
1 2016-10-7 1 1 2
1 2016-10-8 2 2 NA
1 2016-10-9 5 2 NA
2 2013-10-6 2 1 4
.
.
.
.
20 2012-2-6 10 6 25
The data in the question is sorted and consecutive within id, so we assume that that is the case. Also, when the question refers to duplicate dates, we assume that different id values can have the same date but that within an id the dates are unique and consecutive. Now, using the data shown reproducibly in Note 2 at the end, group by id and compute the group numbers using gl. Then, grouping by id and group_no, take the mean of each group of 3, or NA for smaller groups.
library(dplyr)
DF %>%
    group_by(id) %>%
    mutate(group_no = c(gl(n(), 3, n()))) %>%
    group_by(group_no, add = TRUE) %>%
    mutate(mean = if (n() == 3) mean(value) else NA) %>%
    ungroup
giving:
# A tibble: 6 x 5
id date value group_no mean
<int> <date> <int> <int> <dbl>
1 1 2016-10-05 2 1 2
2 1 2016-10-06 3 1 2
3 1 2016-10-07 1 1 2
4 1 2016-10-08 2 2 NA
5 1 2016-10-09 5 2 NA
6 2 2013-10-06 2 1 NA
Note 1
An alternative to gl(...) could be cumsum(rep(1:3, length.out = n()) == 1), and an alternative to if (n() == 3) mean(value) else NA could be mean(head(c(value, NA, NA), 3)).
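Assembled into a full pipeline, Note 1's alternatives would look roughly like this (same DF as in Note 2 below):
DF %>%
    group_by(id) %>%
    mutate(group_no = cumsum(rep(1:3, length.out = n()) == 1)) %>%
    group_by(id, group_no) %>%
    mutate(mean = mean(head(c(value, NA, NA), 3))) %>%
    ungroup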
Note 2
The input data in reproducible form was assumed to be:
Lines <- "id date value
1 2016-10-5 2
1 2016-10-6 3
1 2016-10-7 1
1 2016-10-8 2
1 2016-10-9 5
2 2013-10-6 2"
DF <- read.table(text = Lines, header = TRUE)
DF$date <- as.Date(DF$date)

Choose a month of a year to rank, then give the resulting ranks to the rest of the year

Sample data:
df1 <- data.frame(id=c("A","A","A","A","B","B","B","B"),
year=c(2014,2014,2015,2015),
month=c(1,2),
new.employee=c(4,6,2,6,23,2,5,34))
id year month new.employee
1 A 2014 1 4
2 A 2014 2 6
3 A 2015 1 2
4 A 2015 2 6
5 B 2014 1 23
6 B 2014 2 2
7 B 2015 1 5
8 B 2015 2 34
Desired outcome:
desired_df <- data.frame(id=c("A","A","A","A","B","B","B","B"),
year=c(2014,2014,2015,2015),
month=c(1,2),
new.employee=c(4,6,2,6,23,2,5,34),
new.employee.rank=c(1,1,2,2,2,2,1,1))
id year month new.employee new.employee.rank
1 A 2014 1 4 1
2 A 2014 2 6 1
3 A 2015 1 2 2
4 A 2015 2 6 2
5 B 2014 1 23 2
6 B 2014 2 2 2
7 B 2015 1 5 1
8 B 2015 2 34 1
The ranking rule is: I choose month 2 in each year to rank the number of new employees between A and B. Then I need to give those ranks to month 1, i.e. the month 1 rankings in each year must equal the month 2 rankings in the same year.
I tried this code to get rankings for each month and each year:
library(data.table)
df1 <- data.table(df1)
df1[,rank:=rank(new.employee), by=c("year","month")]
If anyone can roll the rank value within a column so that the month 1 rank is replaced by the month 2 rank, that might be a solution.
You've tried a data.table solution, so here's how I would do this using data.table:
library(data.table) # V1.9.6+
temp <- setDT(df1)[month == 2L, .(id, frank(-new.employee)), by = year]
df1[temp, new.employee.rank := i.V2, on = c("year", "id")]
df1
# id year month new.employee new.employee.rank
# 1: A 2014 1 4 1
# 2: A 2014 2 6 1
# 3: A 2015 1 2 2
# 4: A 2015 2 6 2
# 5: B 2014 1 23 2
# 6: B 2014 2 2 2
# 7: B 2015 1 5 1
# 8: B 2015 2 34 1
It is somewhat similar to the dplyr solution below, which basically ranks the ids per year and joins the ranks back to the original data set. I'm using data.table v1.9.6+ here.
Here's a dplyr-based solution. The idea is to reduce the data to the parts you want to compare, make the comparison, then join the results back into the original data set, expanding it to fill all of the relevant slots. Note the edits to your code for creating the sample data.
df1 <- data.frame(id=c("A","A","A","A","B","B","B","B"),
year=rep(c(2014,2014,2015,2015), 2),
month=rep(c(1,2), 4),
new.employee=c(4,6,2,6,23,2,5,34))
library(dplyr)
df1 %>%
    # Reduce the data to the slices (months) you want to compare
    filter(month == 2) %>%
    # Group the data by year, so the comparisons are within and not across years
    group_by(year) %>%
    # Create a variable that indicates the rankings within years in descending order
    mutate(rank = rank(-new.employee)) %>%
    # To prepare for merging, reduce the new data to just that ranking var plus id and year
    select(id, year, rank) %>%
    # Use left_join to merge the new data (.) with the original df, expanding the
    # new data to fill all rows with id-year matches
    left_join(df1, .) %>%
    # Order the data by id, year, and month to make it easier to review
    arrange(id, year, month)
Output:
Joining by: c("id", "year")
id year month new.employee rank
1 A 2014 1 4 1
2 A 2014 2 6 1
3 A 2015 1 2 2
4 A 2015 2 6 2
5 B 2014 1 23 2
6 B 2014 2 2 2
7 B 2015 1 5 1
8 B 2015 2 34 1
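For comparison, a rough base-R sketch of the same idea using the corrected df1 defined just above: rank the month-2 rows within each year, then merge those ranks back onto every month (ranks and df2 are throwaway names here):
ranks <- subset(df1, month == 2)
# rank in descending order within each year: the largest month-2 value gets rank 1
ranks$new.employee.rank <- ave(-ranks$new.employee, ranks$year, FUN = rank)
df2 <- merge(df1, ranks[c("id", "year", "new.employee.rank")], by = c("id", "year"))
df2[order(df2$id, df2$year, df2$month), ]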
