Add column in R with the previous values of another column

I have a data frame with 2 columns: Patient_id and time (when the patient visited the doctor).
I would like to add a new column "timestart" that has 0 in the first row for each different Patient_id, while the other rows with the same id take the previous value from the time column.
I thought of doing this with a for loop, but I am a new R user and I don't know how.
Thanks in advance.

We can group by 'Patient_id' and create the new column with the lag of 'time':
library(dplyr)
df1 %>%
  group_by(Patient_id) %>%
  mutate(timestart = lag(time, default = 0))
#   Patient_id  time timestart
#        <int> <int>     <int>
# 1          1     1         0
# 2          1     2         1
# 3          1     3         2
# 4          2     1         0
# 5          2     2         1
# 6          2     3         2
data
df1 <- data.frame(Patient_id = rep(1:2, each = 3), time = 1:3)
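For comparison, a data.table version of the same idea (not part of the original answer, just a sketch) uses shift() within each Patient_id and fills the first row of each group with 0:
library(data.table)
# shift() defaults to a lag of 1; fill = 0L supplies the value for the first row of each group
setDT(df1)[, timestart := shift(time, fill = 0L), by = Patient_id]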

Dynamically Normalize all rows with first element within a group

Suppose I have the following data frame:
  year subject grade study_time
1    1       a    30         20
2    2       a    60         60
3    1       b    30         10
4    2       b    90        100
What I would like to do is divide grade and study_time by their first record within each subject. I do the following:
df %>%
  group_by(subject) %>%
  mutate(RN = row_number()) %>%
  mutate(study_time = study_time/study_time[RN == 1],
         grade = grade/grade[RN == 1]) %>%
  select(-RN)
I would get the following output:
  year subject grade study_time
1    1       a     1          1
2    2       a     2          3
3    1       b     1          1
4    2       b     3         10
It's fairly easy to do when I know the variable names. However, I'm trying to write a generalized function that can act on any data.frame/data.table/tibble where I may not know the names of the variables that I need to mutate; I'll only know the names of the variables not to mutate. I'm trying to get this done using tidyverse/data.table and I can't get anything to work.
Any help would be greatly appreciated.
We group by 'subject' and use mutate_at to change multiple columns, dividing each element by the first element of that column:
library(dplyr)
df %>%
  group_by(subject) %>%
  mutate_at(3:4, funs(./first(.)))
# A tibble: 4 x 4
# Groups: subject [2]
#    year subject grade study_time
#   <int> <chr>   <dbl>      <dbl>
# 1     1 a           1          1
# 2     2 a           2          3
# 3     1 b           1          1
# 4     2 b           3         10
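As a side note, funs() has since been deprecated in dplyr in favour of across(). A sketch of the kind of generalized helper the question asks for might look like the following; the function name and arguments are hypothetical, not from the original answer:
library(dplyr)
# Divide every non-grouping column except those listed in `keep` by its first value within each group
normalize_by_first <- function(data, group_var, keep) {
  data %>%
    group_by(across(all_of(group_var))) %>%
    mutate(across(!all_of(keep), ~ .x / first(.x))) %>%
    ungroup()
}
normalize_by_first(df, group_var = "subject", keep = "year")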

How to count and dcast for all columns in R

I have the following data frame in R:
Company Education Health
      A        NA      1
      A         1      2
      A         1     NA
I want the count of each level (1, 2, NA) in each column, in the following format:
Company Education_1 Education_NA Health_1 Health_2 Health_NA
      A           2            1        1        1         1
How can I do it in R?
You can do the following:
library(tidyverse)
df %>%
  gather(k, v, -Company) %>%
  unite(tmp, k, v, sep = "_") %>%
  count(Company, tmp) %>%
  spread(tmp, n)
# A tibble: 1 x 6
#  Company Education_1 Education_NA Health_1 Health_2 Health_NA
#  <fct>         <int>        <int>    <int>    <int>     <int>
#1 A                 2            1        1        1         1
Sample data
df <- read.table(text =
" Company Education Health
A NA 1
A 1 2
A 1 NA ", header = T)
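In current tidyr, gather() and spread() are superseded by pivot_longer() and pivot_wider(); the same reshape written with them would look roughly like this sketch (not part of the original answer):
library(dplyr)
library(tidyr)
df %>%
  pivot_longer(-Company, names_to = "k", values_to = "v") %>%   # long format, keeping the NA values
  unite(tmp, k, v, sep = "_") %>%                               # e.g. "Education_1", "Health_NA"
  count(Company, tmp) %>%
  pivot_wider(names_from = tmp, values_from = n)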
Using DF from the Note at the end (where a company B has been added as well), this can be done in one recast call from the reshape2 package. The id.var and fun arguments can be omitted and the same answer will be given, but recast will print a message saying it used those defaults.
library(reshape2)
recast(DF, Company ~ variable + value,
id.var = "Company", fun = length)
giving this data frame:
  Company Education_1 Education_NA Health_1 Health_2 Health_NA
1       A           2            1        1        1         1
2       B           2            1        1        1         1
Note
Lines <- " Company Education Health
1 A NA 1
2 A 1 2
3 A 1 NA
4 B NA 1
5 B 1 2
6 B 1 NA"
DF <- read.table(text = Lines)
In plyr you can use a hack with ddply by transposing tables to get what appear to be new columns:
x <- data.frame(Company = "A", Education = c(NA, 1, 1), Health = c(1, 2, NA))
library(plyr)
ddply(x, .(Company), plyr::summarise,
      Education = t(table(addNA(Education))),
      Health = t(table(addNA(Health))))
  Company Education.1 Education.NA Health.1 Health.2 Health.NA
1       A           2            1        1        1         1
However, they are not really columns, but table elements in the data.frame.
You can use a do.call(data.frame,y) construct to make them proper data frame columns, but you need more than one row for it to work.

How to subset by time range?

I would like to create a subset from the following example data frame. The condition is to select the rows whose time values fall within the range from the minimum time for a given id up to, say, one hour later.
id time
1 1468696537
1 1468696637
1 1482007490
2 1471902849
2 1471902850
2 1483361074
3 1474207754
3 1474207744
3 1471446673
3 1471446693
And the output should be like this:
id time
1 1468696537
1 1468696637
2 1471902849
2 1471902850
3 1471446673
3 1471446693
Please help me: how can I do that?
We can do the following:
library(magrittr)
library(dplyr)
df %>%
  group_by(id) %>%
  filter(time <= min(time) + 3600)
#      id       time
#   <int>      <int>
# 1     1 1468696537
# 2     1 1468696637
# 3     2 1471902849
# 4     2 1471902850
# 5     3 1471446673
# 6     3 1471446693
Explanation: group entries by id, then keep the entries that fall within min(time) + 1 hour (3600 seconds).
Sample data
df <- read.table(text =
" id time
1 1468696537
1 1468696637
1 1482007490
2 1471902849
2 1471902850
2 1483361074
3 1474207754
3 1474207744
3 1471446673
3 1471446693 ", header = T)
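A data.table version of the same filter, shown as a sketch (not part of the original answer):
library(data.table)
# per id, keep the rows that fall within 3600 seconds of that id's earliest time
setDT(df)[, .SD[time <= min(time) + 3600], by = id]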

Normalize all rows with first element within group

Is there an elegant method to normalize a column with a group-specific norm with dplyr?
Example:
I have a data frame:
df = data.frame(year = c(1:2, 1:2),
                group = c("a", "a", "b", "b"),
                val = c(100, 200, 300, 900))
i.e.:
  year group val
1    1     a 100
2    2     a 200
3    1     b 300
4    2     b 900
I want to normalize val by the value in year=1 of the given group. Desired output:
  year group val val_norm
1    1     a 100        1
2    2     a 200        2
3    1     b 300        1
4    2     b 900        3
e.g. in row 4 the norm is 300 (year == 1 & group == "b"), hence val_norm = 900/300 = 3.
I can achieve this by extracting an ancillary data frame with just the norms and then doing a left join with the original data frame.
What is a more elegant way to achieve this without creating a temporary data frame?
We can group by 'group', then divide 'val' by the 'val' where 'year' is 1 (year == 1). Here, I am selecting the first observation, in case there are duplicate 'year' values of 1 within a 'group'.
library(dplyr)
df %>%
  group_by(group) %>%
  mutate(val_norm = val/val[year == 1][1L])
#    year  group   val val_norm
#   <int> <fctr> <dbl>    <dbl>
# 1     1      a   100        1
# 2     2      a   200        2
# 3     1      b   300        1
# 4     2      b   900        3
If we need elegance and efficiency, data.table can be tried:
library(data.table)
setDT(df)[, val_norm := val/val[year == 1][1L], by = group]
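For reference, the join-based approach the question describes (extract the year == 1 norms, then join them back) would look roughly like this sketch:
library(dplyr)
norms <- df %>%
  filter(year == 1) %>%
  select(group, norm = val)          # one norm per group
df %>%
  left_join(norms, by = "group") %>%
  mutate(val_norm = val / norm) %>%
  select(-norm)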

Find difference between rows by id, but place difference on first row in R

I have read a few different posts about finding the difference between two rows in R using dplyr. However, the posts I have seen do not give me quite what I want. I would like to find the difference between the times and place that difference in a new variable on the same row as n, as the duration between n and n+1. All the other posts place the elapsed time on the same row as n+1.
Here is some sample data:
df <- read.table(text = c("
id time
1 1
1 4
1 7
2 5
2 10"), header = T)
My desired output:
#  id time duration
#   1    1        3
#   1    4        3
#   1    7       NA
#   2    5        5
#   2   10       NA
I have the following code at the moment:
df %>% arrange(id, time) %>% group_by(id) %>% mutate(duration = time - lag(time))
Please let me know how I should change this around. Thanks!
You can use diff(), appending the NA to each group. Just change your mutate() call to
mutate(duration = c(diff(time), NA))
Edit: To clarify, the code above is only the mutate() call at the end of the pipe shown in the question. So, based on the code shown in the question, the entire operation is
df %>%
  arrange(id, time) %>%
  group_by(id) %>%
  mutate(duration = c(diff(time), NA))
# Source: local data frame [5 x 3]
# Groups: id [2]
#
#      id  time duration
#   <dbl> <dbl>    <dbl>
# 1     1     1        3
# 2     1     4        3
# 3     1     7       NA
# 4     2     5        5
# 5     2    10       NA
We can swap lag with lead:
df %>%
  group_by(id) %>%
  mutate(duration = lead(time) - time)
#      id  time duration
#   <int> <int>    <int>
# 1     1     1        3
# 2     1     4        3
# 3     1     7       NA
# 4     2     5        5
# 5     2    10       NA
A corresponding option in data.table would be shift with type = "lead"
library(data.table)
setDT(df)[, duration := shift(time, type = "lead") - time, by = id]
NOTE: In the example, 'id' and 'time' were already in order. If they are not, add an ordering step as the OP showed in their post.
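If ordering is needed first, one way to combine it with the data.table approach (a sketch, not from the original answer) is:
library(data.table)
# order by reference, then compute the lead difference within each id
setorder(setDT(df), id, time)[, duration := shift(time, type = "lead") - time, by = id]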
