R: combining multiple vectors created using dplyr's pull

I have monthly data for 2019 and 2020, and only two months of data for 2021 (Jan and Feb). I want to make a vector of these 26 values for use as a time series.
my_dat <- data.frame(X2021 = c(1:2,rep(NA,10)), X2020 = 1:12, X2019 = 1:12)
library(dplyr)
X2021 <- my_dat %>% pull(X2021)
X2021 <- X2021[ -(3:12) ]
x <- my_dat %>% pull(X2019,X2020)
c(x, X2021)
##1 2 3 4 5 6 7 8 9 10 11 12
##1 2 3 4 5 6 7 8 9 10 11 12 1 2
I expected:
c(1:12, 1:12, 1:2)
What went wrong?

Since pull is equivalent to $ in base R, and can only be used for extracting one variable, I think you want select and then unlist. E.g.:
my_dat %>% select(X2019, X2020) %>% unlist(use.names=FALSE)
#[1] 1 2 3 4 5 6 7 8 9 10 11 12 1 2 3 4 5 6 7 8 9 10 11 12
Which would be equivalent to using the square brackets [] in base R:
unlist(my_dat[c("X2019","X2020")], use.names=FALSE)
#[1] 1 2 3 4 5 6 7 8 9 10 11 12 1 2 3 4 5 6 7 8 9 10 11 12
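Putting it together for the 26-value series the question asks for (a minimal sketch; treating the data as a monthly series starting January 2019 is an assumption):
library(dplyr)

vals <- c(
  my_dat %>% select(X2019, X2020) %>% unlist(use.names = FALSE),
  my_dat %>% pull(X2021) %>% na.omit()
)
ts(vals, start = c(2019, 1), frequency = 12)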
As to why the original code didn't work, ?pull shows the syntax is:
pull(.data, var, name)
So
my_dat %>% pull(X2019,X2020)
is just pulling / extracting X2019 and naming it with X2020. To give a clearer example:
dat <- data.frame(a=1:3, b=month.abb[1:3])
pull(dat, a, b)
#Jan Feb Mar
# 1 2 3
unname(pull(dat, a, b))
#[1] 1 2 3
names(pull(dat, a, b))
#[1] "Jan" "Feb" "Mar"

Related

Delete rows with redundant information in r (not just duplicates)

In this sample data:
id <- c(2,2,2,2,2,3,3,3,3,3,3,4,4,4,4)
time <- c(3,5,7,8,9,2,8,10,12,14,18,4,6,7,9)
status <- c('mar','mar','div','c','mar','mar','div','mar','mar','c','div','mar','mar','c','mar')
myd <- data.frame(id, time, status)
id time status
1 2 3 mar
2 2 5 mar
3 2 7 div
4 2 8 c
5 2 9 mar
6 3 2 mar
7 3 8 div
8 3 10 mar
9 3 12 mar
10 3 14 c
11 3 18 div
12 4 4 mar
13 4 6 mar
14 4 7 c
15 4 9 mar
I need to know when each person married. If there are two consecutive 'mar' rows with no 'div' anywhere in between, the person never divorced, so it is the same marriage and the repeated row adds no information. The same applies to a mar, c, mar sequence: since no div occurs, the marriage before and after the child is the same marriage, so the second mar row can be deleted. I suspect I need min(time[status == 'mar']), but that would be incorrect for a mar, mar, div, mar, div, mar sequence (only the second mar needs deletion, not every mar after the first one).
So the new data should look something like
id time status
2 2 5 mar
3 2 7 div
4 2 8 c
5 2 9 mar
6 3 2 mar
7 3 8 div
8 3 10 mar
10 3 14 c
11 3 18 div
13 4 6 mar
14 4 7 c
This was my approach, which only worked for one row
myd2 <- myd %>%
  group_by(id) %>%
  mutate(dum1 = ifelse(status == 'mar', min(time[status == 'mar']), NA),
         dum2 = cumsum(status == 'div'),
         flag = ifelse(time > dum1 & dum2 == 0, 1, 0))
If I get rid of dum2==0 then it deleted too many rows.
Using a quick helper function,
func <- function(x, vals = c("mar", "div")) {
  out <- rep(TRUE, length(x))
  last <- x[1]
  for (ind in seq_along(x)[-1]) {
    # keep the row if the status differs from the last kept mar/div value,
    # or if it is not a mar/div status at all (e.g. 'c')
    out[ind] <- x[ind] != last || !x[ind] %in% vals
    # update the reference only when a mar/div row is kept
    if (out[ind] && x[ind] %in% vals) last <- x[ind]
  }
  out
}
We can do
library(data.table)
as.data.table(myd)[, .SD[func(status),], by = .(id)]
# id time status
# <num> <num> <char>
# 1: 2 3 mar
# 2: 2 7 div
# 3: 2 8 c
# 4: 2 9 mar
# 5: 3 2 mar
# 6: 3 8 div
# 7: 3 10 mar
# 8: 3 14 c
# 9: 3 18 div
# 10: 4 4 mar
# 11: 4 7 c
If you want this in dplyr, then
library(dplyr)
myd %>%
  group_by(id) %>%
  filter(func(status))
My approach:
library(dplyr)
myd %>%
  group_by(id) %>%
  arrange(time) %>%
  filter(status != lag(status) | is.na(lag(status))) %>%
  ungroup() %>%
  arrange(id)
Returns:
# A tibble: 12 x 3
id time status
<dbl> <dbl> <chr>
1 2 3 mar
2 2 7 div
3 2 8 c
4 2 9 mar
5 3 2 mar
6 3 8 div
7 3 10 mar
8 3 14 c
9 3 18 div
10 4 4 mar
11 4 7 c
12 4 9 mar
I would delete rows in which the status is unchanged by creating a lag_status variable in grouped data:
> myd %>%
+ arrange(id, time) %>%
+ group_by(id) %>%
+ mutate(lag_status = lag(status)) %>%
+ ungroup() %>%
+ filter(is.na(lag_status) | status != lag_status) %>%
+ select(-lag_status)
# A tibble: 12 x 3
id time status
<dbl> <dbl> <fct>
1 2 3 mar
2 2 7 div
3 2 8 c
4 2 9 mar
5 3 2 mar
6 3 8 div
7 3 10 mar
8 3 14 c
9 3 18 div
10 4 4 mar
11 4 7 c
12 4 9 mar
I read two different questions in your post.
1. When the person first married
2. How to make a list that removes redundant status information
It seems like you have a solution for #1, but you actually want #2.
I read #2 as a desire to filter out rows where the id and status are the same as the previous row. That would look like:
# is.na(lag(id)) keeps the very first row, where lag() returns NA
myd %>%
  filter(is.na(lag(id)) | !(id == lag(id) & status == lag(status)))

Add a unique identifier to the same column value in R data frame

I have a data frame as follows:
index val sample_id
1 1 14 5
2 2 22 6
3 3 1 6
4 4 25 7
5 5 3 7
6 6 34 7
For rows that share a sample_id, I would like to add a unique identifier as follows:
index val sample_id
1 1 14 5
2 2 22 6-A
3 3 1 6-B
4 4 25 7-A
5 5 3 7-B
6 6 34 7-C
Any suggestion? Thank you for your help.
Base R
dat$id2 <- ave(dat$sample_id, dat$sample_id,
FUN = function(z) if (length(z) > 1) paste(z, LETTERS[seq_along(z)], sep = "-") else as.character(z))
dat
# index val sample_id id2
# 1 1 14 5 5
# 2 2 22 6 6-A
# 3 3 1 6 6-B
# 4 4 25 7 7-A
# 5 5 3 7 7-B
# 6 6 34 7 7-C
tidyverse
library(dplyr)
dat %>%
  group_by(sample_id) %>%
  mutate(id2 = if (n() > 1) paste(sample_id, LETTERS[row_number()], sep = "-") else as.character(sample_id)) %>%
  ungroup()
Minor note: it might be tempting to drop the as.character(z) from either or both of the code blocks. In the base R version nothing changes here, because base R lets you be a little sloppy; but if you rely on that and need the new field to always be character, then in the rare case where every row has a unique sample_id the column will silently remain integer. dplyr is much stricter in guarding against this: if you run the tidyverse code without as.character, you will see an error.
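For illustration (a small sketch re-creating the question's data; the error arises because the single-row group returns a numeric value while the multi-row groups return character):
library(dplyr)

dat <- data.frame(index = 1:6, val = c(14, 22, 1, 25, 3, 34),
                  sample_id = c(5, 6, 6, 7, 7, 7))

# Without as.character(): group 5 yields numeric 5, groups 6 and 7 yield
# character "6-A", "6-B", ..., and dplyr refuses to combine incompatible types
try(
  dat %>%
    group_by(sample_id) %>%
    mutate(id2 = if (n() > 1) paste(sample_id, LETTERS[row_number()], sep = "-") else sample_id)
)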
Using dplyr:
library(dplyr)
dplyr::group_by(df, sample_id) %>%
dplyr::mutate(sample_id = paste(sample_id, LETTERS[seq_along(sample_id)], sep = "-"))
index val sample_id
<int> <dbl> <chr>
1 1 14 5-A
2 2 22 6-A
3 3 1 6-B
4 4 25 7-A
5 5 3 7-B
6 6 34 7-C
If you just want to create unique tags for the same sample_id, maybe you can try make.unique like below
transform(
df,
sample_id = ave(as.character(sample_id),sample_id,FUN = function(x) make.unique(x,sep = "_"))
)
which gives
index val sample_id
1 1 14 5
2 2 22 6
3 3 1 6_1
4 4 25 7
5 5 3 7_1
6 6 34 7_2

Conditional merge or left join two dataframes in R

I am trying to add additional data from a reference table onto my primary data frame. I see similar questions have been asked about this; however, I can't find anything for my specific case.
An example of my data frame is set up like this:
df <- data.frame("participant" = rep(1:3,9), "time" = rep(1:9, each = 3))
lookup <- data.frame("start.time" = c(1,5,8), "end.time" = c(3,6,10), "var1" = c("A","B","A"),
"var2" = c(8,12,3), "var3"= c("fast","fast","slow"))
print(df)
participant time
1 1 1
2 2 1
3 3 1
4 1 2
5 2 2
6 3 2
7 1 3
8 2 3
9 3 3
10 1 4
11 2 4
12 3 4
13 1 5
14 2 5
15 3 5
16 1 6
17 2 6
18 3 6
19 1 7
20 2 7
21 3 7
22 1 8
23 2 8
24 3 8
25 1 9
26 2 9
27 3 9
> print(lookup)
start.time end.time var1 var2 var3
1 1 3 A 8 fast
2 5 6 B 12 fast
3 8 10 A 3 slow
What I want to do is merge or join these two data frames so that the times falling between each start and end time of the lookup table are also matched. In other words, the columns var1, var2 and var3 should be added onto df wherever time lies between start.time and end.time.
For example, the first lookup row has a start time of 1 and an end time of 3, so for times 1, 2 and 3 the first row's data should be added for each participant.
The output should look something like this:
print(output)
participant time var1 var2 var3
1 1 1 A 8 fast
2 2 1 A 8 fast
3 3 1 A 8 fast
4 1 2 A 8 fast
5 2 2 A 8 fast
6 3 2 A 8 fast
7 1 3 A 8 fast
8 2 3 A 8 fast
9 3 3 A 8 fast
10 1 4 <NA> NA <NA>
11 2 4 <NA> NA <NA>
12 3 4 <NA> NA <NA>
13 1 5 B 12 fast
14 2 5 B 12 fast
15 3 5 B 12 fast
16 1 6 B 12 fast
17 2 6 B 12 fast
18 3 6 B 12 fast
19 1 7 <NA> NA <NA>
20 2 7 <NA> NA <NA>
21 3 7 <NA> NA <NA>
22 1 8 A 3 slow
23 2 8 A 3 slow
24 3 8 A 3 slow
25 1 9 A 3 slow
26 2 9 A 3 slow
27 3 9 A 3 slow
I realise that column names don't match and they should for merging data sets.
One option would be to use the sqldf package, and phrase your problem as a SQL left join:
sql <- "SELECT t1.participant, t1.time, t2.var1, t2.var2, t2.var3
FROM df t1
LEFT JOIN lookup t2
ON t1.time BETWEEN t2.\"start.time\" AND t2.\"end.time\""
output <- sqldf(sql)
A dplyr solution:
library(dplyr)

output <- df %>%
  # Create an id for the join
  mutate(merge_id = 1) %>%
  # Use a full join to create all combinations between the two datasets
  full_join(lookup %>% mutate(merge_id = 1), by = "merge_id") %>%
  # Keep only the rows that we want
  filter(time >= start.time, time <= end.time) %>%
  # Select the relevant variables
  select(participant, time, var1:var3) %>%
  # Right join with the initial dataset to get back the unmatched rows
  right_join(df, by = c("participant", "time")) %>%
  # Sort to match the formatting asked for by the OP
  arrange(time, participant)
This produces the output the OP asked for, but it will only work for data of reasonable size, since the full join creates a data frame whose row count is the product of the row counts of the two initial datasets.
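A newer alternative (a sketch assuming dplyr >= 1.1.0, which introduced join_by()) avoids the cartesian product by expressing the interval condition directly in the join:
library(dplyr)

output <- df %>%
  left_join(lookup, by = join_by(between(time, start.time, end.time))) %>%
  select(participant, time, var1:var3) %>%
  arrange(time, participant)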
Using tidyverse and creating an auxiliary table:
df <- data.frame("participant" = rep(1:3,9), "time" = rep(1:9, each = 3))
lookup <- data.frame("start.time" = c(1,5,8), "end.time" = c(3,6,10), "var1" = c("A","B","A"),
"var2" = c(8,12,3), "var3"= c("fast","fast","slow"))
library(tidyverse)

lookup_extended <- lookup %>%
  mutate(time = map2(start.time, end.time, ~ .x:.y)) %>%
  unnest(time) %>%
  select(-start.time, -end.time)

df2 <- df %>%
  left_join(lookup_extended, by = "time")

How to create a table with flexible columns based on variables control in R?

I want to create a table like:
1 1 6 6 10 10 ...
2 2 7 7 11 11 ...
3 3 8 8 12 12 ...
4 4 9 9 13 13 ...
5 5 14 14 ...
15 15 ...
I want to use these variables to create the table:
n (the number of repeats), m (the total number of columns), and k (each block's starting number, i.e. the prior block's end number + 1; for example 6 = 5 + 1 and 10 = 9 + 1), with rows of different lengths.
I know I can write something like:
rep(list(1:5, 6:9, 10:15), each = 2)
but how do I turn these into parameters of a general expression, i.e. list(1:5, 6:9, 10:15, ...) written in terms of n, m and k?
I tried a loop, for (i in 1:m), etc., but cannot work it out.
Finally I want a single sequence via unlist(): 1,2,3,4,5,6,1,2,3,4,5,6,...
Many thanks.
Maybe the code below can help
len <- c(5, 4, 6)
# split 1:sum(len) into blocks of the given lengths, repeat each block twice, then flatten
res <- unlist(unname(rep(split(1:sum(len),
                               findInterval(1:sum(len), cumsum(len) + 1)),
                         each = 2)))
which gives
> res
[1] 1 2 3 4 5 1 2 3 4 5 6 7 8 9 6 7 8 9 10 11 12 13 14 15 10 11 12 13 14 15
Probably, something like this would be helpful.
# Number of times to repeat
r <- 2
# Length of each sequence
len <- c(5, 4, 6)
# Get the end of each sequence
end <- cumsum(len)
# Calculate the start of each sequence
start <- c(1, end[-length(end)] + 1)
# Create each sequence from start to end and repeat it r times
Map(function(x, y) rep(seq(x, y), r), start, end)
#[[1]]
# [1] 1 2 3 4 5 1 2 3 4 5
#[[2]]
#[1] 6 7 8 9 6 7 8 9
#[[3]]
# [1] 10 11 12 13 14 15 10 11 12 13 14 15
You could unlist to get it as one vector.
unlist(Map(function(x, y) rep(seq(x, y), r), start, end))
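To get the general, parameterised expression the question asks for, the same steps can be wrapped in a small helper (a minimal sketch; the function name make_rep_seq is just illustrative):
# Hypothetical helper: len gives each block's length, r the number of repeats
make_rep_seq <- function(len, r = 2) {
  end <- cumsum(len)
  start <- c(1, head(end, -1) + 1)
  unlist(Map(function(s, e) rep(seq(s, e), r), start, end))
}

make_rep_seq(c(5, 4, 6), r = 2)
# [1] 1 2 3 4 5 1 2 3 4 5 6 7 8 9 6 7 8 9 10 11 12 13 14 15 10 11 12 13 14 15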

Create multiple sums

Ciao,
Here is a reproducible example.
df <- data.frame("STUDENT" = c(1,2,3,4,5),
                 "TEST1A" = c(NA,5,5,6,7),
                 "TEST2A" = c(NA,8,4,6,9),
                 "TEST3A" = c(NA,10,5,4,6),
                 "TEST1B" = c(5,6,7,4,1),
                 "TEST2B" = c(10,10,9,3,1),
                 "TEST3B" = c(0,5,6,9,NA),
                 "TEST1TOTAL" = c(NA,23,14,16,22),
                 "TEST2TOTAL" = c(10,16,15,12,NA))
I have columns STUDENT through TEST3B and want to create TEST1TOTAL and TEST2TOTAL, where TEST1TOTAL = TEST1A + TEST2A + TEST3A, and likewise for TEST2TOTAL. If any score in TEST1A, TEST2A or TEST3A is missing, then TEST1TOTAL should be NA.
Here is my attempt, but is there a solution with fewer lines of code? With this approach I would need to write this line out many times, as the tests run from A through O.
TEST1TOTAL=rowSums(df[,c('TEST1A', 'TEST2A', 'TEST3A')], na.rm=TRUE)
Using just R base functions:
df1 <- df[, 1:7]  # df without the pre-filled TOTAL columns (referred to as df1 below)
output <- data.frame(df1, do.call(cbind, lapply(c("A$", "B$"),
                     function(x) rowSums(df1[, grep(x, names(df1))]))))
Customizing colnames:
> colnames(output)[(ncol(output)-1):ncol(output)] <- c("TEST1TOTAL", "TEST2TOTAL")
> output
STUDENT TEST1A TEST2A TEST3A TEST1B TEST2B TEST3B TEST1TOTAL TEST2TOTAL
1 1 NA NA NA 5 10 0 NA 15
2 2 5 8 10 6 10 5 23 21
3 3 5 4 5 7 9 6 14 22
4 4 6 6 4 4 3 9 16 16
5 5 7 9 6 1 1 NA 22 NA
Try:
library(dplyr)
df %>%
  mutate(TEST1TOTAL = TEST1A + TEST2A + TEST3A,
         TEST2TOTAL = TEST1B + TEST2B + TEST3B)
or
df %>%
  mutate(TEST1TOTAL = rowSums(select(df, ends_with("A"))),
         TEST2TOTAL = rowSums(select(df, ends_with("B"))))
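If the tests really run from A through O, a loop over the letters generalizes this (a minimal sketch assuming the TEST<number><letter> naming scheme; rowSums() without na.rm = TRUE keeps the NA-propagation the question asks for):
test_letters <- c("A", "B")   # extend to LETTERS[1:15] for tests A through O
for (i in seq_along(test_letters)) {
  cols <- grep(paste0(test_letters[i], "$"), names(df))
  df[[paste0("TEST", i, "TOTAL")]] <- rowSums(df[, cols])
}
df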
I think for what you want, Jilber Urbina's solution is the way to go. For completeness' sake (and because I learned something figuring it out), here's a tidyverse way to get the score totals by test number for any number of tests.
The advantage is you don't need to specify the identifiers for the tests (beyond that they're numbered or have a trailing letter) and the same code will work for any number of tests.
library(tidyverse)
# Note: this answer assumes df without the pre-filled TOTAL columns (e.g. df <- df[, 1:7])
df_totals <- df %>%
  gather(test, score, -STUDENT) %>%                    # Convert from wide to long format
  mutate(test_num = paste0('TEST',
                           gsub('[^0-9]', '', test),
                           'TOTAL'),                   # Extract test number from the variable name
         test_let = gsub('TEST[0-9]*', '', test)) %>%  # Extract test letter (optional)
  group_by(STUDENT, test_num) %>%                      # Group by student + test
  summarize(score_tot = sum(score)) %>%                # Sum score by student/test
  spread(test_num, score_tot)                          # Spread back to wide format
df_totals
# A tibble: 5 x 4
# Groups: STUDENT [5]
STUDENT TEST1TOTAL TEST2TOTAL TEST3TOTAL
<dbl> <dbl> <dbl> <dbl>
1 1 NA NA NA
2 2 11 18 15
3 3 12 13 11
4 4 10 9 13
5 5 8 10 NA
If you want the individual scores too, just join the totals together with the original:
left_join(df, df_totals, by = 'STUDENT')
STUDENT TEST1A TEST2A TEST3A TEST1B TEST2B TEST3B TEST1TOTAL TEST2TOTAL TEST3TOTAL
1 1 NA NA NA 5 10 0 NA NA NA
2 2 5 8 10 6 10 5 11 18 15
3 3 5 4 5 7 9 6 12 13 11
4 4 6 6 4 4 3 9 10 9 13
5 5 7 9 6 1 1 NA 8 10 NA