How to populate values of one row conditional on another row in R?

I inherited a data set coded in an unusual way. I would like to learn a less verbose way of reshaping it. The data frame looks like this:
# Input.
participant = c(rep("John",6), rep("Mary",6))
day = c(rep(1,3), rep(2,3), rep(1,3), rep(2,3))
likes = c("apples", "apples", "18", "apples", "apples", "7", "bananas", "bananas", "24", "bananas", "bananas", "3")
question = rep(c(1,1,0),4)
number = c(rep(18,3), rep(7,3), rep(24,3), rep(3,3))
df = data.frame(participant, day, question, likes)
participant day question likes
1 John 1 1 apples
2 John 1 1 apples
3 John 1 0 18
4 John 2 1 apples
5 John 2 1 apples
6 John 2 0 7
7 Mary 1 1 bananas
8 Mary 1 1 bananas
9 Mary 1 0 24
10 Mary 2 1 bananas
11 Mary 2 1 bananas
12 Mary 2 0 3
As you can see, the column likes is heterogeneous. When question equals 0, likes conveys a number chosen by the participants, not their preferred fruit. So I would like to re-code it in a new column as follows:
participant day question likes number
1 John 1 1 apples 18
2 John 1 1 apples 18
3 John 1 0 18 18
4 John 2 1 apples 7
5 John 2 1 apples 7
6 John 2 0 7 7
7 Mary 1 1 bananas 24
8 Mary 1 1 bananas 24
9 Mary 1 0 24 24
10 Mary 2 1 bananas 3
11 Mary 2 1 bananas 3
12 Mary 2 0 3 3
My current solution with base R involves subsetting the initial data frame, creating a lookup table, changing the column names and then merging the lookup table with the original data frame. But this involves several steps and I suspect there is a simpler solution. I think that tidyr might be the answer, but I don't know how to use it to spread values in one column (likes) conditional on other columns (day and question).
Do you have any suggestions? Thanks a lot!

Using the data set above, you can try the following. You group your data by participant and day and look for a row with question == 0 for each group.
library(dplyr)
group_by(df, participant, day) %>%
mutate(age = as.numeric(as.character(likes[which(question == 0)])))
Or as alistaire suggested, you can use grep() too.
group_by(df, participant, day) %>%
mutate(age = as.numeric(grep('\\d+', likes, value = TRUE)))
# participant day question likes age
# (fctr) (dbl) (dbl) (fctr) (dbl)
#1 John 1 1 apples 18
#2 John 1 1 apples 18
#3 John 1 0 18 18
#4 John 2 1 apples 7
#5 John 2 1 apples 7
#6 John 2 0 7 7
#7 Mary 1 1 bananas 24
#8 Mary 1 1 bananas 24
#9 Mary 1 0 24 24
#10 Mary 2 1 bananas 3
#11 Mary 2 1 bananas 3
#12 Mary 2 0 3 3
If you want to use data.table, you can do:
library(data.table)
setDT(df)[, age := as.numeric(as.character(likes[which(question == 0)])),
by = list(participant, day)]
NOTE
The present data set is a new one. Jota's answer works for the deleted data set.

Addressing the new example data:
# create a key column, overwrite it later
df$number <- paste0(df$participant, df$day) # use as a key
# create lookup table
lookup <- df[!is.na(as.numeric(as.character(df$likes))), c("number", "likes")]
# use lookup to overwrite df$number with the appropriate number
df$number <- lookup$likes[match(df$number, lookup$number)]
# participant day question likes number
#1 John 1 1 apples 18
#2 John 1 1 apples 18
#3 John 1 0 18 18
#4 John 2 1 apples 7
#5 John 2 1 apples 7
#6 John 2 0 7 7
#7 Mary 1 1 bananas 24
#8 Mary 1 1 bananas 24
#9 Mary 1 0 24 24
#10 Mary 2 1 bananas 3
#11 Mary 2 1 bananas 3
#12 Mary 2 0 3 3
The warning about NAs being introduced by coercion is expected; it comes from converting characters to numeric (as.numeric(as.character(df$likes))).
If your data are ordered as in the example, you can use na.locf from the zoo package:
library(zoo)
df$age <- na.locf(as.numeric(as.character(df$likes)), fromLast = TRUE)
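For comparison, here is a minimal base R sketch of the same group-wise idea without extra packages. It assumes, as in the example, that each participant/day group contains exactly one numeric entry in likes. The helper name first_non_na is made up for illustration and is not part of any package.

```r
# Rebuild the example data from the question.
participant <- c(rep("John", 6), rep("Mary", 6))
day <- c(rep(1, 3), rep(2, 3), rep(1, 3), rep(2, 3))
question <- rep(c(1, 1, 0), 4)
likes <- c("apples", "apples", "18", "apples", "apples", "7",
           "bananas", "bananas", "24", "bananas", "bananas", "3")
df <- data.frame(participant, day, question, likes)

# Fruit names become NA; only the numeric answers survive the conversion.
num <- suppressWarnings(as.numeric(as.character(df$likes)))

# Hypothetical helper: first non-missing value of a vector.
first_non_na <- function(x) x[!is.na(x)][1]

# ave() applies the helper within each participant/day group and
# recycles the single result over all rows of that group.
df$number <- ave(num, df$participant, df$day, FUN = first_non_na)
df
```

Unlike na.locf, this does not depend on the numeric row coming last in each group, only on the grouping columns being correct.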

Related

My question is about R: How to number each repetition in a table in R?

In my data set, there is a column of full names (e.g. below), and I want to add another column next to it indicating whether a name has appeared one, two, three, four, ... times, using R. My output should look like the column below: Number of repetition.
Eg: Data set name: People
**Full name** **Number of repetition**
Peter 1
Peter 2
Alison
Warren
Jack 1
Jack 2
Jack 3
Jack 4
Susan 1
Susan 2
Henry 1
Walison
Tinder 1
Peter 3
Henry 2
Tinder 2
Thanks
Teena
Here is an alternative way solved with help from akrun: sum() condition in ifelse statement
library(dplyr)
df1 %>%
group_by(Fullname) %>%
mutate(newcol = row_number(),
newcol = if(sum(newcol)> 1) newcol else NA) %>%
ungroup
Fullname newcol
<chr> <int>
1 Peter 1
2 Peter 2
3 Alison NA
4 Warren NA
5 Jack 1
6 Jack 2
7 Jack 3
8 Jack 4
9 Susan 1
10 Susan 2
11 Henry 1
12 Walison NA
13 Tinder 1
14 Peter 3
15 Henry 2
16 Tinder 2
Here is one way. Do a group by 'Fullname', and create the sequence with row_number() if the number of rows is greater than 1. By default, case_when returns the other case as NA
library(dplyr)
df1 <- df1 %>%
group_by(Fullname) %>%
mutate(Number_of_repetition = case_when(n() > 1 ~ row_number())) %>%
ungroup
-output
df1
# A tibble: 16 × 2
Fullname Number_of_repetition
<chr> <int>
1 Peter 1
2 Peter 2
3 Alison NA
4 Warren NA
5 Jack 1
6 Jack 2
7 Jack 3
8 Jack 4
9 Susan 1
10 Susan 2
11 Henry 1
12 Walison NA
13 Tinder 1
14 Peter 3
15 Henry 2
16 Tinder 2
If we need to add a third column, use unite on the updated data from previous step
library(tidyr)
df1 %>%
unite(FullNameRep, Fullname, Number_of_repetition, sep="", na.rm = TRUE, remove = FALSE)
-output
# A tibble: 16 × 3
FullNameRep Fullname Number_of_repetition
<chr> <chr> <int>
1 Peter1 Peter 1
2 Peter2 Peter 2
3 Alison Alison NA
4 Warren Warren NA
5 Jack1 Jack 1
6 Jack2 Jack 2
7 Jack3 Jack 3
8 Jack4 Jack 4
9 Susan1 Susan 1
10 Susan2 Susan 2
11 Henry1 Henry 1
12 Walison Walison NA
13 Tinder1 Tinder 1
14 Peter3 Peter 3
15 Henry2 Henry 2
16 Tinder2 Tinder 2
data
df1 <- structure(list(Fullname = c("Peter", "Peter", "Alison", "Warren",
"Jack", "Jack", "Jack", "Jack", "Susan", "Susan", "Henry", "Walison",
"Tinder", "Peter", "Henry", "Tinder")), row.names = c(NA, -16L
), class = "data.frame")
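If you would rather avoid dplyr, the same numbering can be sketched in base R with ave(). This is only an illustration under the assumption that names occurring once should get NA, as in the outputs above. The data frame name people and the helper vectors are chosen here for the example.

```r
Fullname <- c("Peter", "Peter", "Alison", "Warren", "Jack", "Jack", "Jack",
              "Jack", "Susan", "Susan", "Henry", "Walison", "Tinder",
              "Peter", "Henry", "Tinder")
people <- data.frame(Fullname)

idx <- seq_along(people$Fullname)
# Running count of each name in order of appearance.
rep_no <- ave(idx, people$Fullname, FUN = seq_along)
# Total number of times each name occurs.
n_total <- ave(idx, people$Fullname, FUN = length)
# Names that occur only once get NA instead of 1.
rep_no[n_total == 1] <- NA
people$Number_of_repetition <- rep_no
people
```

The rows stay in their original order, so the third Peter near the end of the data is still numbered 3.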

Calculate number of days passed since First event R

I would like to calculate the number of days which have passed since the first event. There are different groups, so each group's starting date for an event is different, and I want to calculate each group's number of days passed since its own first event.
names = c('Ben',"Ben","Ben","Ben","Ben","Ben" ,'Dan',"Dan","Dan","Dan", 'Peter',"Peter","Peter","Peter","Peter","Peter","Peter",'Betty',"Betty","Betty",'Betty', "Betty")
dates = c('2000-02-01','2000-02-02',"2000-02-03","2000-02-04",'2000-02-05','2000-02-05', '2000-01-11','2000-01-12',"2000-01-13",'2000-01-14',
'2000-09-10','2000-09-11',"2000-09-12",'2000-09-13','2000-09-14','2000-09-15','2000-09-16','2000-11-13','2000-11-14', "2000-11-15",'2000-11-16','2000-11-17')
events = c(0,0,1,4,5,11,0,0,2,6,0,0,1,2,3,4,5,0,0,1,2,3)
newd = data.frame(names,dates,events)
newd
so the data frame looks like this:
> newd
names dates events
1 Ben 2000-02-01 0
2 Ben 2000-02-02 0
3 Ben 2000-02-03 1
4 Ben 2000-02-04 4
5 Ben 2000-02-05 5
6 Ben 2000-02-05 11
7 Dan 2000-01-11 0
8 Dan 2000-01-12 0
9 Dan 2000-01-13 2
10 Dan 2000-01-14 6
11 Peter 2000-09-10 0
12 Peter 2000-09-11 0
13 Peter 2000-09-12 1
14 Peter 2000-09-13 2
15 Peter 2000-09-14 3
16 Peter 2000-09-15 4
17 Peter 2000-09-16 5
18 Betty 2000-11-13 0
19 Betty 2000-11-14 0
20 Betty 2000-11-15 1
21 Betty 2000-11-16 2
22 Betty 2000-11-17 3
This is just an example: the 'events' values are in no specific order and are totally random, and there are also many other dates with an event of 0. So I would like to start counting days only where events > 0.
So if 'events' is 0, then the number of days counted should also be 0.
Convert dates to the Date class; you can then subtract the minimum date within each names group.
newd$dates <- as.Date(newd$dates)
library(dplyr)
newd %>% group_by(names) %>% mutate(events = as.integer(dates - min(dates)))
# names dates events
# <chr> <date> <int>
# 1 Ben 2000-02-02 0
# 2 Ben 2000-02-03 1
# 3 Ben 2000-02-04 2
# 4 Ben 2000-02-05 3
# 5 Ben 2000-02-05 3
# 6 Dan 2000-01-12 0
# 7 Dan 2000-01-13 1
# 8 Dan 2000-01-14 2
# 9 Peter 2000-09-11 0
#10 Peter 2000-09-12 1
#11 Peter 2000-09-13 2
#12 Peter 2000-09-14 3
#13 Peter 2000-09-15 4
#14 Peter 2000-09-16 5
#15 Betty 2000-11-14 0
#16 Betty 2000-11-15 1
#17 Betty 2000-11-16 2
#18 Betty 2000-11-17 3
In base R :
newd$events <- with(newd, dates - ave(dates, names, FUN = min))
and data.table :
library(data.table)
setDT(newd)[, events := dates - min(dates), names]
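The question also asks that counting start only at the first date with events > 0 and that rows with events equal to 0 get 0 days. A minimal base R sketch of that variant follows, assuming every name has at least one row with events > 0. Only the Ben and Dan rows of the example are used, and the column name days_since_first is invented for this illustration.

```r
nm <- c(rep("Ben", 6), rep("Dan", 4))
dates <- as.Date(c("2000-02-01", "2000-02-02", "2000-02-03", "2000-02-04",
                   "2000-02-05", "2000-02-05",
                   "2000-01-11", "2000-01-12", "2000-01-13", "2000-01-14"))
events <- c(0, 0, 1, 4, 5, 11, 0, 0, 2, 6)
newd <- data.frame(names = nm, dates, events)

# Earliest date with events > 0, one value per name (dates as day counts).
active <- newd[newd$events > 0, ]
first_active <- tapply(as.numeric(active$dates),
                       as.character(active$names), min)

# Look up each row's group start by name.
start <- as.vector(first_active[as.character(newd$names)])

# Rows with events == 0 get 0 days; the rest count from the group start.
newd$days_since_first <- ifelse(newd$events > 0,
                                as.numeric(newd$dates) - start, 0)
newd
```

For Ben the first positive event is on 2000-02-03, so that row counts as day 0 and 2000-02-05 counts as day 2.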

Get difference with closest previous row in a group which meets criterion

I'm trying, for each row, to calculate the difference with the closest previous row belonging to the same group which meets a certain criterion.
Suppose I have the following dataframe:
s <- read.table(text = "Visit_num Patient Day Admitted
1 1 2015/01/01 Yes
2 1 2015/01/10 No
3 1 2015/01/15 Yes
4 1 2015/02/10 No
5 1 2015/03/08 Yes
6 2 2015/01/01 Yes
7 2 2015/04/01 No
8 2 2015/04/10 No
9 3 2015/04/01 No
10 3 2015/04/10 No", header = T, sep = "")
For each Visit_num and for each Patient, I'd like to get the difference with the closest previous row for which the patient was admitted (i.e. Admitted is Yes). Note that the rows are ordered by Day, and the time unit for this example is days.
Here is what I wanted my dataframe to look like:
Visit_num Patient Day Admitted Diff_days
1 1 2015/01/01 Yes NA
2 1 2015/01/10 No 9
3 1 2015/01/15 Yes 14
4 1 2015/02/10 No 26
5 1 2015/03/08 Yes 52
6 2 2015/01/01 Yes NA
7 2 2015/04/01 No 90
8 2 2015/04/10 No 99
9 3 2015/04/01 No NA
10 3 2015/04/10 No NA
Any help is appreciated.
Here is an option with tidyverse. Convert 'Day' to Date class and arrange by 'Patient' and 'Day'. Grouped by 'Patient', get the difference between adjacent 'Day' values. Then create a group 'grp' that starts a new block after each 'Yes' in 'Admitted', and take the cumulative sum of 'Diff_days' within each block.
library(tidyverse)
library(lubridate) # provides ymd(); attached by library(tidyverse) only from tidyverse 2.0.0 on
s %>%
mutate(Day = ymd(Day)) %>%
arrange(Patient, Day) %>%
group_by(Patient) %>%
mutate(Diff_days = c(NA, diff(Day))) %>%
group_by(grp = cumsum(lag(Admitted == "Yes", default = TRUE)), .add = TRUE) %>% # 'add' was renamed '.add' in dplyr 1.0.0
mutate(Diff_days = cumsum(replace_na(Diff_days, 0))) %>%
ungroup %>%
select(-grp) %>%
mutate(Diff_days = na_if(Diff_days, 0))
# A tibble: 8 x 5
# Visit_num Patient Day Admitted Diff_days
# <int> <int> <date> <fct> <dbl>
#1 1 1 2015-01-01 Yes NA
#2 2 1 2015-01-10 No 9
#3 3 1 2015-01-15 Yes 14
#4 4 1 2015-02-10 No 26
#5 5 1 2015-03-08 Yes 52
#6 6 2 2015-01-01 Yes NA
#7 7 2 2015-04-01 No 90
#8 8 2 2015-04-10 No 99
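For patients with no earlier admission at all (Patient 3 in the question), a plain loop may be easier to reason about. The following base R sketch tracks the most recent "Yes" day per patient. The function name days_since_admit is hypothetical, and the approach assumes the visits of each patient are sorted by Day.

```r
s <- read.table(text = "Visit_num Patient Day Admitted
1 1 2015/01/01 Yes
2 1 2015/01/10 No
3 1 2015/01/15 Yes
4 1 2015/02/10 No
5 1 2015/03/08 Yes
6 2 2015/01/01 Yes
7 2 2015/04/01 No
8 2 2015/04/10 No
9 3 2015/04/01 No
10 3 2015/04/10 No", header = TRUE)
s$Day <- as.Date(as.character(s$Day), format = "%Y/%m/%d")

# Hypothetical helper: days since the most recent earlier "Yes" visit,
# for one patient's visits sorted by day.
days_since_admit <- function(day, admitted) {
  out <- rep(NA_real_, length(day))
  last_yes <- as.Date(NA)
  for (i in seq_along(day)) {
    out[i] <- as.numeric(day[i] - last_yes)  # NA until a "Yes" has been seen
    if (admitted[i] == "Yes") last_yes <- day[i]
  }
  out
}

s <- s[order(s$Patient, s$Day), ]
s$Diff_days <- NA_real_
# Fill in the result patient by patient.
for (idx in split(seq_len(nrow(s)), s$Patient)) {
  s$Diff_days[idx] <- days_since_admit(s$Day[idx], s$Admitted[idx])
}
s
```

The current row's own "Yes" is recorded only after its difference is computed, so each visit is compared with strictly earlier admissions.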

remove individuals based on their range of values

I have a df with two variables, one with IDs and one with a variable called numbers. I would like to exclude individuals who do not start their sequence of numbers with the number 1.
I have managed to do this by creating a binary indicator and excluding if the person has this indicator. However, there must be a simpler more elegant way to do this?
Example data and the code I've used to achieve desired result are below.
Thank you.
sample df:
zz<-" names numbers
1 john 1
2 john 2
3 john 3
4 john 4
5 john 5
6 john 6
7 john 7
8 john 8
9 mary 4
10 mary 5
11 mary 6
12 mary 7
13 mary 8
14 mary 9
15 mary 10
16 mary 11
17 mary 12
18 pat 1
19 pat 2
20 pat 3
21 pat 4
22 pat 5
23 pat 6
24 pat 7
25 pat 8
26 pat 9
27 pat 10
28 sue 2
29 sue 3
30 sue 4
31 sue 5
32 sue 6
33 sue 7
34 sue 8
35 sue 9
36 tom 5
37 tom 6
38 tom 7
39 tom 8
40 tom 9
41 tom 10
42 tom 11
"
Data <- read.table(text=zz, header = TRUE)
Step 1 - add binary indicator
Data$all<-ifelse(Data$numbers==1, 1,0)
Data$allperson<-ave(Data$all, Data$names, FUN=cumsum)
Step 2 - get rid of people who do not have 1 as their start number
Data[!Data$allperson==0,]
If you want elegance, I must recommend the package dplyr:
library(dplyr)
Data %>%
group_by(names) %>%
filter(min(numbers) != 1)
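It means just what it appears to mean: keep only the records of groups (defined by names) whose minimum numbers value is not equal to 1. Note that these are the people the question wants to exclude; changing != to == keeps those who start at 1 instead.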
names numbers
1 mary 4
2 mary 5
3 mary 6
4 mary 7
5 mary 8
6 mary 9
7 mary 10
8 mary 11
9 mary 12
10 sue 2
11 sue 3
You may also try:
zz1 <- Data[with(Data, names %in% unique(names)[!!table(names, numbers)[,1]]),]
head(zz1,4)
# names numbers
#1 john 1
#2 john 2
#3 john 3
#4 john 4
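Since the question is about the first value of each person's sequence rather than the minimum, here is a small base R sketch that tests the first value directly. It uses a shortened version of the example data, and the names first_number and kept are made up for illustration.

```r
# Shortened version of the example: john and pat start at 1, mary does not.
Data <- data.frame(
  names = c(rep("john", 3), rep("mary", 3), rep("pat", 2)),
  numbers = c(1:3, 4:6, 1:2)
)

# First value of each person's sequence, repeated over that person's rows.
first_number <- ave(Data$numbers, Data$names, FUN = function(x) x[1])

# Keep only people whose sequence starts with 1.
kept <- Data[first_number == 1, ]
kept
```

When each sequence is sorted in increasing order, as in the example, this gives the same result as testing the minimum.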

Using grep function to identify values from which to make a binary indicator

My question is to improve the efficiency/elegance of my code. I have a df with a list of drugs. I want to identify the drugs that start with C09 and C10. If a person has these drugs, I want to give them a binary indicator (1=yes, 0=no) of whether they have these drugs. Binary indicator will be in a new column called "statins", in the same dataframe. I used this post as a guide: What's the R equivalent of SQL's LIKE 'description%' statement?.
Here is what I have done;
names<-c("tom", "mary", "mary", "john", "tom", "john", "mary", "tom", "mary", "tom", "john")
drugs<-c("C10AA05", "C09AA03", "C10AA07", "A02BC01", "C10AA05", "C09AA03", "A02BC01", "C10AA05", "C10AA07", "C07AB03", "N02AA01")
df<-data.frame(names, drugs)
df
names drugs
1 tom C10AA05
2 mary C09AA03
3 mary C10AA07
4 john A02BC01
5 tom C10AA05
6 john C09AA03
7 mary A02BC01
8 tom C10AA05
9 mary C10AA07
10 tom C07AB03
11 john N02AA01
ptn = '^C10.*?'
get_statin = grep(ptn, df$drugs, perl=T)
stats<-df[get_statin,]
names drugs
1 tom C10AA05
3 mary C10AA07
5 tom C10AA05
8 tom C10AA05
9 mary C10AA07
ptn2='^C09.*?'
get_other=grep(ptn2, df$drugs, perl=T)
other<-df[get_other,]
other
names drugs
2 mary C09AA03
6 john C09AA03
df$statins=ifelse(df$drugs %in% stats$drugs,1,0)
df
names drugs statins
1 tom C10AA05 1
2 mary C09AA03 0
3 mary C10AA07 1
4 john A02BC01 0
5 tom C10AA05 1
6 john C09AA03 0
7 mary A02BC01 0
8 tom C10AA05 1
9 mary C10AA07 1
10 tom C07AB03 0
11 john N02AA01 0
df$statins=ifelse(df$drugs %in% other$drugs,1,df$statins)
df
names drugs statins
1 tom C10AA05 1
2 mary C09AA03 1
3 mary C10AA07 1
4 john A02BC01 0
5 tom C10AA05 1
6 john C09AA03 1
7 mary A02BC01 0
8 tom C10AA05 1
9 mary C10AA07 1
10 tom C07AB03 0
11 john N02AA01 0
So, I can get what I want - but I feel there is probably a better, nicer way to do it and would appreciate any guidance here. An obvious solution that I can feel you all shouting at your screens is just use '^C' as a pattern - and therefore catch all the drugs beginning with C. I won't be able to do this in my main analysis as the 'C' will catch things that I don't want in some instances, so I need to make it as narrow as possible.
Here you go:
transform(df, statins=as.numeric(grepl('^C(10|09)', drugs)))
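To spell out why the one-liner works: grepl() returns TRUE or FALSE for every row, '^' anchors the match at the start of the string, and '(10|09)' accepts either code. The sketch below rebuilds the example and also shows a regex-free alternative with startsWith() from base R. The per-person column any_statin is an assumption about what "if a person has these drugs" might mean and is not something the question confirms.

```r
person <- c("tom", "mary", "mary", "john", "tom", "john",
            "mary", "tom", "mary", "tom", "john")
drugs <- c("C10AA05", "C09AA03", "C10AA07", "A02BC01", "C10AA05", "C09AA03",
           "A02BC01", "C10AA05", "C10AA07", "C07AB03", "N02AA01")
df <- data.frame(names = person, drugs)

# Row-level flag: 1 if the code starts with C10 or C09, else 0.
df$statins <- as.numeric(grepl("^C(10|09)", df$drugs))

# Same flag without a regular expression (startsWith needs character input).
d <- as.character(df$drugs)
alt <- as.numeric(startsWith(d, "C10") | startsWith(d, "C09"))

# Optional per-person flag: 1 if any of that person's rows is flagged.
df$any_statin <- ave(df$statins, df$names, FUN = max)
df
```

Because the pattern names both prefixes explicitly, other codes beginning with C, such as C07AB03, are not flagged.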
