Spreading row indices into columns in R - r

I have a data frame in R in the following format:
> old.dat
id type minDate maxDat eventNum
1 001 A may june 1
2 002 B apr oct 1
3 002 C may nov 2
4 002 B july dec 3
I want to turn rows into columns, based on eventNum. The max eventNum is 3, so if some IDs only have 1 eventNum, I want them filled with NA.
Goal:
id type1 minDate1 maxDat1 eventNum1 type2 minDate2 maxDat2 eventNum2 type3 minDate3 maxDat3 eventNum3
1 001 A may june 1 <NA> <NA> <NA> NA <NA> <NA> <NA> NA
2 002 B apr oct 1 C may nov 2 B july dec 3
Here is a code chunk to bring in the starting point.
old.dat <- data.frame(id = c("001","002","002","002"),
type = c("A","B","C","B"),
minDate = c("may","apr","may","july"),
maxDat = c("june", "oct", "nov", "dec"),
eventNum = c(1,1,2,3))
I wrote a for loop, but it's rather slow and is taking a long time to churn through my data set, so any faster suggestions would be great. Thanks!

Why? I don't know if I can imagine a situation in which that format will be useful, but here is an approach using tidyr.
First, I am saving a list of the new column names to make things easier to pull together:
newCols <- c("type", "minDate", "MaxDat")
(I will be adding the numbers below).
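Then, I am uniting (with unite) the values you want for each event, spreading the results (with spread) to get a new column for each eventNum, and then separating the results (with separate) back into individual columns, appending the event number to each name:
library(dplyr)  # for %>%
library(tidyr)  # for unite(), spread(), separate()
old.dat %>%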
unite(toSpread, type, minDate, maxDat, sep = "::") %>%
spread(eventNum, toSpread) %>%
separate(`1`, paste0(newCols, "_1"), sep = "::") %>%
separate(`2`, paste0(newCols, "_2"), sep = "::") %>%
separate(`3`, paste0(newCols, "_3"), sep = "::")
Returns:
id type_1 minDate_1 MaxDat_1 type_2 minDate_2 MaxDat_2 type_3 minDate_3 MaxDat_3
1 001 A may june <NA> <NA> <NA> <NA> <NA> <NA>
2 002 B apr oct C may nov B july dec
Here is an alternative approach that converts the data to a long format first (with gather), then generates the new column names and does the spreading. The complicated mutate line assigning factor levels to the new columns is only needed to sort the columns and uses parse_number from readr to extract the event numbers. If you are OK with the output columns being sorted alphabetically, you can skip that step. This approach is robust to additional event numbers, as it will automatically add events for each unique value in eventNum.
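library(dplyr)
library(tidyr)
library(readr)  # for parse_number()
old.dat %>%
gather(Metric, Value, type, minDate, maxDat) %>%
unite(newColHead, Metric, eventNum) %>%
# order the factor levels by event number so the spread columns sort by event
mutate(newColHead = factor(newColHead,
levels = unique(newColHead) %>% {.[order(parse_number(.))]})) %>%
spread(newColHead, Value)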
For this use case, the output is identical to the above.
(And, if you want evidence for why this may be better; note my edit -- I originally mislabeled event numbers 2/3.)
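For comparison, base R's reshape() can produce the same wide layout without any packages. This is only a sketch of an alternative, not what the answers above use. It names the new columns with a dot (type.1, minDate.1, ...) rather than an underscore, and stringsAsFactors = FALSE is my own addition to keep the values as plain character.

```r
# Starting data from the question (stringsAsFactors = FALSE is an assumption)
old.dat <- data.frame(id = c("001", "002", "002", "002"),
                      type = c("A", "B", "C", "B"),
                      minDate = c("may", "apr", "may", "july"),
                      maxDat = c("june", "oct", "nov", "dec"),
                      eventNum = c(1, 1, 2, 3),
                      stringsAsFactors = FALSE)

# One row per id; every other column is repeated once per eventNum.
# ids that lack a given eventNum get NA in those columns.
wide <- reshape(old.dat, idvar = "id", timevar = "eventNum", direction = "wide")
wide
```

Because reshape() creates one set of columns per distinct eventNum, it should also pick up additional event numbers without changes to the call.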

Related

Aggregate week and date in R by some specific rules

I'm not used to using R. I already asked a question on stack overflow and got a great answer.
I'm sorry to post a similar question, but I tried many times and got the output that I didn't expect.
This time, I want to do slightly different from my previous question.
Merge two data with respect to date and week using R
I have two data. One has a year_month_week column and the other has a date column.
df1<-data.frame(id=c(1,1,1,2,2,2,2),
year_month_week=c(2022051,2022052,2022053,2022041,2022042,2022043,2022044),
points=c(65,58,47,21,25,27,43))
df2<-data.frame(id=c(1,1,1,2,2,2),
date=c(20220503,20220506,20220512,20220401,20220408,20220409),
temperature=c(36.1,36.3,36.6,34.3,34.9,35.3))
For df1, 2022051 means the 1st week of May 2022. Likewise, 2022052 means the 2nd week of May 2022. For df2, 20220503 means May 3rd, 2022. What I want to do now is merge df1 and df2 with respect to year_month_week. In this case, 20220503 and 20220506 both fall in the 1st week of May 2022. If more than one date falls in a year_month_week, I will just include the first of them. Now, here's the different part: even if there is no date inside a year_month_week, just leave it as NA. So my expected output has the same number of rows as df1, which includes the column year_month_week. My expected output is as follows:
df<-data.frame(id=c(1,1,1,2,2,2,2),
year_month_week=c(2022051,2022052,2022053,2022041,2022042,2022043,2022044),
points=c(65,58,47,21,25,27,43),
temperature=c(36.1,36.6,NA,34.3,34.9,NA,NA))
First we can convert the dates in df2 into year-month-week format, then join the two tables:
library(dplyr);library(lubridate)
df2$dt = ymd(df2$date)
df2$wk = day(df2$dt) %/% 7 + 1
df2$year_month_week = as.numeric(paste0(format(df2$dt, "%Y%m"), df2$wk))
df1 %>%
left_join(df2 %>% group_by(year_month_week) %>% slice(1) %>%
select(year_month_week, temperature))
Result
Joining, by = "year_month_week"
id year_month_week points temperature
1 1 2022051 65 36.1
2 1 2022052 58 36.6
3 1 2022053 47 NA
4 2 2022041 21 34.3
5 2 2022042 25 34.9
6 2 2022043 27 NA
7 2 2022044 43 NA
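One thing worth checking is how "week of the month" is defined. The formula above, day %/% 7 + 1, and the ceiling(day / 7) formula agree on the dates in this example, but they differ on days 7, 14, 21 and 28. A small sketch (the vector of days is just illustrative) makes the difference visible:

```r
# Illustrative days of the month, including the multiples of 7
days <- c(1, 6, 7, 8, 14, 15, 28, 29)

week_floor   <- days %/% 7 + 1    # day 7 lands in week 2
week_ceiling <- ceiling(days / 7) # day 7 lands in week 1

data.frame(day = days, week_floor, week_ceiling)
```

If week 1 is meant to cover days 1 to 7, the ceiling version (or equivalently (day - 1) %/% 7 + 1) is the one that matches; which definition is right depends on how year_month_week was built in df1.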
You can build off of a previous answer by taking its function to count the week of the month, then generating a join key in df2:
df1 <- data.frame(
id=c(1,1,1,2,2,2,2),
year_month_week=c(2022051,2022052,2022053,2022041,2022042,2022043,2022044),
points=c(65,58,47,21,25,27,43))
df2 <- data.frame(
id=c(1,1,1,2,2,2),
date=c(20220503,20220506,20220512,20220401,20220408,20220409),
temperature=c(36.1,36.3,36.6,34.3,34.9,35.3))
# Take the function from the previous StackOverflow question
monthweeks.Date <- function(x) {
ceiling(as.numeric(format(x, "%d")) / 7)
}
# Create a year_month_week variable to join on
df2 <-
df2 %>%
mutate(
date = lubridate::parse_date_time(
x = date,
orders = "%Y%m%d"),
# format() zero-pads the month, so October to December also work
year_month_week = paste0(
format(date, "%Y%m"),
monthweeks.Date(date)),
year_month_week = as.double(year_month_week))
# Remove duplicate year_month_weeks
df2 <-
df2 %>%
arrange(year_month_week) %>%
distinct(year_month_week, .keep_all = TRUE)
# Join dataframes
df1 <-
left_join(
df1,
df2,
by = "year_month_week")
Produces this result
id.x year_month_week points id.y date temperature
1 1 2022051 65 1 2022-05-03 36.1
2 1 2022052 58 1 2022-05-12 36.6
3 1 2022053 47 NA <NA> NA
4 2 2022041 21 2 2022-04-01 34.3
5 2 2022042 25 2 2022-04-08 34.9
6 2 2022043 27 NA <NA> NA
7 2 2022044 43 NA <NA> NA
Edit: forgot to mention that you need tidyverse loaded
library(tidyverse)
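If you would rather avoid extra packages, the same idea can be sketched in base R: build the key in df2, keep the first row per key, and look the temperature up with match(). This is a sketch under the same assumptions as the answers above: weeks are counted as ceiling(day / 7), df2 is already in date order so "first" means earliest, and id is not part of the key. The name first_per_week is mine.

```r
df1 <- data.frame(id = c(1, 1, 1, 2, 2, 2, 2),
                  year_month_week = c(2022051, 2022052, 2022053, 2022041,
                                      2022042, 2022043, 2022044),
                  points = c(65, 58, 47, 21, 25, 27, 43))
df2 <- data.frame(id = c(1, 1, 1, 2, 2, 2),
                  date = c(20220503, 20220506, 20220512, 20220401,
                           20220408, 20220409),
                  temperature = c(36.1, 36.3, 36.6, 34.3, 34.9, 35.3))

# Parse the numeric yyyymmdd values as dates
dt <- as.Date(as.character(df2$date), format = "%Y%m%d")

# Key = year, zero-padded month, week of the month
week <- ceiling(as.numeric(format(dt, "%d")) / 7)
df2$year_month_week <- as.numeric(paste0(format(dt, "%Y%m"), week))

# Keep only the first row for each key (relies on df2's row order)
first_per_week <- df2[!duplicated(df2$year_month_week), ]

# Look up the temperature; keys without a match become NA
idx <- match(df1$year_month_week, first_per_week$year_month_week)
df1$temperature <- first_per_week$temperature[idx]
df1
```

Using match() rather than merge() keeps df1's row order and row count unchanged.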

Merging Multiple (and different datasets)

I'd like to merge multiple (around ten) datasets in R. Quite a few of the datasets are different from each other, so I don't need to match them by row name or anything. I'd just like to paste them side by side, on a single dataframe so I can export them into a single sheet. For instance, I have the following two datasets:
Month Engagement Test
Jan 51 1
Feb 123 2

Variable Engagement
Hot 412
Cold 4124
Warm 4fd4
I'd simply like to put them side by side (as in left and right) in a single data frame for exporting purposes, like this:
Month Engagement Test Variable Engagement
Jan 51 1 Hot 412
Feb 123 2 Cold 4124
NA NA NA Warm 4fd4
Is there any way to accomplish this? It might seem like a strange request, but do let me know if I should provide any more info! Thank you so much.
Put the data frames in a list and find the maximum number of rows among them. Then subset each data frame by that full row index; a data frame with fewer rows is padded with NA rows.
data <- list(df1, df2)
n <- seq_len(max(sapply(data, nrow)))
result <- do.call(cbind, lapply(data, `[`, n, ))
result
# Month Engagement Test Variable Engagement
#1 Jan 51 1 Hot 412
#2 Feb 123 2 Cold 4124
#NA <NA> NA NA Warm 4fd4
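Since the question does not show how df1 and df2 are created, here is a self-contained sketch of the same padding idea. The data frames are typed in from the tables in the question, and pad_rows is a hypothetical helper name of mine, not a base R function.

```r
# Example data reconstructed from the question's tables
df1 <- data.frame(Month = c("Jan", "Feb"),
                  Engagement = c(51, 123),
                  Test = c(1, 2))
df2 <- data.frame(Variable = c("Hot", "Cold", "Warm"),
                  Engagement = c("412", "4124", "4fd4"))

# Hypothetical helper: extend a data frame to n rows.
# Indexing past the last row yields rows filled with NA.
pad_rows <- function(d, n) {
  d <- d[seq_len(n), , drop = FALSE]
  rownames(d) <- NULL  # drop the "NA" row names created by padding
  d
}

n_max <- max(nrow(df1), nrow(df2))
result <- cbind(pad_rows(df1, n_max), pad_rows(df2, n_max))
result
```

cbind() keeps the duplicated Engagement column name as is, which is fine for export but means result$Engagement only reaches the first of the two columns.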
Index both data frames, then merge by the index and drop the index column:
library(dplyr)  # for %>% and select()
df1 <- read.csv("Book1.csv", header = TRUE, na.strings = "")
df2 <- read.csv("Book2.csv", header = TRUE, na.strings = "")
# Assign index to the dataframe
rownames(df1) <- 1:nrow(df1)
rownames(df2) <- 1:nrow(df2)
# Merge by index:
merged <- merge(df1, df2, by=0, all=TRUE) %>%
select(-1)
merged
Output:
Month Engagement Test Variable Engagement
1 Jan 51 1 Hot 412
2 Feb 123 2 Cold 4124
3 <NA> NA NA Warm 4fd4

parse dates from multiple columns with NAs and dates hidden in text

I have a data.frame with dates distributed across columns and in a messy format: the year column contains years and NAs, the column date_old contains the format Month DD or DD (or a date duration) or NAs, and the column hidden_date contains text and dates either in the format .... YYYY .... or in the format .... DD Month YYYY .... (with .... representing general text of variable length).
An example data.frame looks like this:
df <- data.frame(year = c("1992", "1993", "1995", NA),
date_old = c("February 15", "October 02-24", "15", NA),
hidden_date = c(NA, NA, "The hidden date is 15 July 1995", "The hidden date is 2005"))
I want to get the dates in the format YYYY-MM-DD (take the first day of date durations) and fill unknown values with zeroes.
Using parse_date_time didn't help me so far, and the expected output would be:
year date_old hidden_date date
1 1992 February 15 <NA> 1992-02-15
2 1993 October 02-24 <NA> 1993-10-02
3 1995 15 The hidden date is 15 July 1995 1995-07-15
4 <NA> <NA> The hidden date is 2005 2005-00-00
How do I best go about this?
It's a little complicated because you have a jumble of date information in different columns which you need to extract and combine. I don't quite understand if you only have three columns, or if there could be more, so I've tried to solve the general case of an arbitrary number of columns. If you only have three columns, each of which always has the same format, then things could be a little simpler, but not much.
I would start by creating a regex pattern for month names:
# We'll use dplyr, stringr, tidyr, readr, and purrr
library(tidyverse)
# We'll use month names and abbreviations just in case.
ms <- paste(c(month.name, month.abb), collapse = "|")
# [1] "January|February|March|April|May|June|July|August|September|October|November|December|Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec"
We can then iterate over each column, extracting the year, month, and day from each row as a data frame, which we then combine into a single data frame. The digit suffixes correspond to the original columns:
df_split_ymd <- map_dfc(df,
~ map_dfr(
.,
~ tibble(
year = str_extract(., "\\b\\d{4}\\b"),
month = str_extract(., str_glue("\\b({ms})\\b")),
day = str_extract(., "\\b\\d{2}\\b")
)
)
)
#### OUTPUT ####
# A tibble: 4 x 9
year month day year1 month1 day1 year2 month2 day2
<chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 1992 NA NA NA February 15 NA NA NA
2 1993 NA NA NA October 02 NA NA NA
3 1995 NA NA NA NA 15 1995 July 15
4 NA NA NA NA NA NA 2005 NA NA
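To see what the three patterns pick up on their own, here is a small base R sketch applied to single strings. extract_first is a hypothetical helper of mine that mimics what str_extract does for one string (first match, or NA if there is none).

```r
# Month names and abbreviations, as in the answer
ms <- paste(c(month.name, month.abb), collapse = "|")

# Hypothetical helper: first regex match in one string, or NA if none
extract_first <- function(text, pattern) {
  m <- regmatches(text, regexpr(pattern, text, perl = TRUE))
  if (length(m) == 0) NA_character_ else m
}

x <- "The hidden date is 15 July 1995"
extract_first(x, "\\b\\d{4}\\b")              # four-digit year: "1995"
extract_first(x, paste0("\\b(", ms, ")\\b"))  # month name: "July"
extract_first(x, "\\b\\d{2}\\b")              # two-digit day: "15"

# A date range keeps only its first day
extract_first("October 02-24", "\\b\\d{2}\\b")  # "02"

# A string with only a year has no month or day to find
y <- "The hidden date is 2005"
extract_first(y, "\\b\\d{2}\\b")  # NA
```

The word boundaries (\\b) are what stop the two-digit pattern from matching the "19" or "95" inside "1995".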
Finally, the year*, month*, and day* columns should be coalesced and then united to make parsing easier. Note that I've replaced NA values in day with "01" and those in month with "January" because dates can't contain "00":
df_ymd <- df_split_ymd %>%
mutate(year = coalesce(!!!as.list(select(., starts_with("year")))),
month = coalesce(!!!as.list(select(., starts_with("month")))) %>%
replace_na("January"),
day = coalesce(!!!as.list(select(., starts_with("day")))) %>%
replace_na("01")
) %>%
unite(ymd, year, month, day, sep = " ") %>%
select(ymd) %>%
mutate(ymd = parse_date(ymd, "%Y %B %d"))
#### OUTPUT ####
# A tibble: 4 x 1
ymd
<date>
1 1992-02-15
2 1993-10-02
3 1995-07-15
4 2005-01-01
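If you really do need "00" for the unknown parts, as in the expected output of the question, the result has to stay a character column, because a Date cannot hold a zero month or day. A minimal sketch, assuming the year, month and day pieces have already been extracted as character vectors with full English month names (fmt_partial is a hypothetical helper name of mine):

```r
# Hypothetical helper: build "YYYY-MM-DD" strings, using "00" for unknowns
fmt_partial <- function(year, month, day) {
  m  <- match(month, month.name)  # month number, NA if unknown
  mm <- ifelse(is.na(m), "00", sprintf("%02d", m))
  dd <- ifelse(is.na(day), "00", day)
  paste(year, mm, dd, sep = "-")
}

partial <- fmt_partial(year  = c("1992", "1995", "2005"),
                       month = c("February", "July", NA),
                       day   = c("15", "15", NA))
partial

# Such a string does not parse as a real date
as.Date("2005-00-00", format = "%Y-%m-%d")
```

The trade-off is that you can no longer do date arithmetic on the column, so it may be worth keeping both the Date version from the answer and this character version.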

How to diagonally subtract different columns in R

I have a dataset of a hypothetical exam.
id <- c(1,1,3,4,5,6,7,7,8,9,9)
test_date <- c("2012-06-27","2012-07-10","2013-07-04","2012-03-24","2012-07-22", "2013-09-16","2012-06-21","2013-10-18", "2013-04-21", "2012-02-16", "2012-03-15")
result_date <- c("2012-07-29","2012-09-02","2013-08-01","2012-04-25","2012-09-01","2013-10-20","2012-07-01","2013-10-31", "2013-05-17", "2012-03-17", "2012-04-20")
data1 <- as_data_frame(id)
data1$test_date <- test_date
data1$result_date <- result_date
colnames(data1)[1] <- "id"
"id" indicates the ID of the students who have taken a particular exam. "test_date" is the date the students took the test and "result_date" is the date when the students' results are posted. I'm interested in finding out which students retook the exam BEFORE the result of that exam session was released, e.g. students who knew that they have underperformed and retook the exam without bothering to find out their scores. For example, student with "id" 1 took the exam for the second time on "2012-07-10" which was before the result date for his first exam - "2012-07-29".
I tried to:
data1%>%
group_by(id) %>%
arrange(id, test_date) %>%
filter(n() >= 2) %>% #To only get info on students who have taken the exam more than once and then merge it back in with the original data set using a join function
So essentially, I want to create a new column called "re_test" where it would equal 1 if a student retook the exam BEFORE receiving the result of a previous exam and 0 otherwise (those who retook after seeing their marks or those who did not retake).
I have tried to mutate in order to find cases where dates are either positive or negative by subtracting the 2nd test_date from the 1st result_date:
mutate(data1, re_test = result_date - lead(test_date, default = first(test_date)))
However, this leads to mixing up students with different id's. I tried to split but mutate won't work on a list of dataframes so now I'm stuck:
split(data1, data1$id)
Just to add on, this is a part of the desired result:
data2 <- as_data_frame(id <- c(1,1,3,4))
data2$test_date_result <- c("2012-06-27","2012-07-10", "2013-07-04","2012-03-24")
data2$result_date_result <- c("2012-07-29","2012-09-02","2013-08-01","2012-04-25")
data2$re_test <- c(1, 0, 0, 0)
Apologies for the verbosity and hope I was clear enough.
Thanks a lot in advance!
library(reshape2)
library(dplyr)
# first melt so that we can sequence by date
data1m <- data1 %>%
melt(id.vars = "id", measure.vars = c("test_date", "result_date"), value.name = "event_date")
# any two tests in a row is a flag - use dplyr::lag to compare with the previous event
data1mc <- data1m %>%
arrange(id, event_date) %>%
group_by(id) %>%
mutate (multi_test = (variable == "test_date" & lag(variable == "test_date"))) %>%
filter(multi_test)
# id variable event_date multi_test
# 1 1 test_date 2012-07-10 TRUE
# 2 9 test_date 2012-03-15 TRUE
## join back to the original
data1 %>%
left_join (data1mc %>% select(id, event_date, multi_test),
by=c("id" = "id", "test_date" = "event_date"))
I have a piecewise answer that may work for you. I first create a data.frame called student that contains the re-test information, and then join it with the data1 object. If students re-took the test multiple times, it will compare the last test to the first, which is a flaw, but I'm unsure if students have the ability to re-test multiple times?
student <- data1 %>%
group_by(id) %>%
summarise(retest = n() > 1 && test_date[n()] < result_date[1])
Students who took the test only once need special handling: with a single row, the comparison would be between their only test date and its own result date, which is always TRUE. The n() > 1 condition sets these students to FALSE here, but you could set them to NA instead, as that does contain information.
Join the two data.frames to a single object called data2.
data2 <- left_join(data1, student, by='id')
I am sure there are more elegant ways to approach this. I did this by taking advantage of the structure of your data (sorted by id) and the lag function that can refer to the previous records while dealing with a current record.
### Ensure Data are sorted by ID ###
data1 <- arrange(data1,id)
### Create Flag for those that repeated ###
data1$repeater <- ifelse(lag(data1$id) == data1$id,1,0)
### I chose to do this on all data, you could filter on repeater flag first ###
data1$timegap <- as.Date(data1$result_date) - as.Date(data1$test_date)
data1$lagdate <- as.Date(data1$test_date) - lag(as.Date(data1$result_date))
### Display results where your repeater flag is 1 and there is negative time lag ###
data1[data1$repeater==1 & !is.na(data1$repeater) & as.numeric(data1$lagdate) < 0,]
# A tibble: 2 × 6
id test_date result_date repeater timegap lagdate
<dbl> <chr> <chr> <dbl> <time> <time>
1 1 2012-07-10 2012-09-02 1 54 days -19 days
2 9 2012-03-15 2012-04-20 1 36 days -2 days
I went with a simple shift comparison. 1 line of code.
data1 <- data.frame(id = c(1,1,3,4,5,6,7,7,8,9,9), test_date = c("2012-06-27","2012-07-10","2013-07-04","2012-03-24","2012-07-22", "2013-09-16","2012-06-21","2013-10-18", "2013-04-21", "2012-02-16", "2012-03-15"), result_date = c("2012-07-29","2012-09-02","2013-08-01","2012-04-25","2012-09-01","2013-10-20","2012-07-01","2013-10-31", "2013-05-17", "2012-03-17", "2012-04-20"))
data1$re_test <- unlist(lapply(split(data1,data1$id), function(x)
ifelse(as.Date(x$test_date) > c(NA, as.Date(x$result_date[-nrow(x)])), 0, 1)))
data1
id test_date result_date re_test
1 1 2012-06-27 2012-07-29 NA
2 1 2012-07-10 2012-09-02 1
3 3 2013-07-04 2013-08-01 NA
4 4 2012-03-24 2012-04-25 NA
5 5 2012-07-22 2012-09-01 NA
6 6 2013-09-16 2013-10-20 NA
7 7 2012-06-21 2012-07-01 NA
8 7 2013-10-18 2013-10-31 0
9 8 2013-04-21 2013-05-17 NA
10 9 2012-02-16 2012-03-17 NA
11 9 2012-03-15 2012-04-20 1
I think there is benefit in leaving NAs but if you really want all others as zero, simply:
data1$re_test <- ifelse(is.na(data1$re_test), 0, data1$re_test)
data1
id test_date result_date re_test
1 1 2012-06-27 2012-07-29 0
2 1 2012-07-10 2012-09-02 1
3 3 2013-07-04 2013-08-01 0
4 4 2012-03-24 2012-04-25 0
5 5 2012-07-22 2012-09-01 0
6 6 2013-09-16 2013-10-20 0
7 7 2012-06-21 2012-07-01 0
8 7 2013-10-18 2013-10-31 0
9 8 2013-04-21 2013-05-17 0
10 9 2012-02-16 2012-03-17 0
11 9 2012-03-15 2012-04-20 1
Let me know if you have any questions, cheers.
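Note that the answers above flag the second sitting, whereas the desired output in the question puts re_test = 1 on the earlier exam (the one whose result had not been released yet). A base R sketch of that variant, using ave() to fetch each student's next test date; the name next_test is mine, and the dates are converted with as.Date first:

```r
data1 <- data.frame(
  id = c(1, 1, 3, 4, 5, 6, 7, 7, 8, 9, 9),
  test_date = as.Date(c("2012-06-27", "2012-07-10", "2013-07-04", "2012-03-24",
                        "2012-07-22", "2013-09-16", "2012-06-21", "2013-10-18",
                        "2013-04-21", "2012-02-16", "2012-03-15")),
  result_date = as.Date(c("2012-07-29", "2012-09-02", "2013-08-01", "2012-04-25",
                          "2012-09-01", "2013-10-20", "2012-07-01", "2013-10-31",
                          "2013-05-17", "2012-03-17", "2012-04-20")))

# Sort so that each student's sittings are in date order
data1 <- data1[order(data1$id, data1$test_date), ]

# Within each id, shift test_date up by one row (NA for the last sitting).
# Dates are handled as day counts so ave() can work on plain numbers.
next_test <- ave(as.numeric(data1$test_date), data1$id,
                 FUN = function(x) c(x[-1], NA))

# 1 if the next sitting happened before this sitting's result came out
data1$re_test <- as.integer(!is.na(next_test) &
                              next_test < as.numeric(data1$result_date))
data1
```

Because the shift happens inside each id group, students are never compared with one another, which was the problem with the ungrouped lead() attempt in the question.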

Splitting Columns by Number of Characters [duplicate]

I have a column of dates in a data table entered in 6-digit numbers as such: 201401, 201402, 201403, 201412, etc. where the first 4 digits are the year and second two digits are month.
I'm trying to split that column into two columns, one called "year" and one called "month". Been messing around with strsplit() but can't figure out how to get it to do number of characters instead of a string pattern, i.e. split in the middle of the 4th and 5th digit.
Without using any external package, we can do this with substr
transform(df1, Year = substr(dates, 1, 4), Month = substr(dates, 5, 6))
# dates Year Month
#1 201401 2014 01
#2 201402 2014 02
#3 201403 2014 03
#4 201412 2014 12
We have the option to remove or keep the column.
Or with sub
cbind(df1, read.csv(text=sub('(.{4})(.{2})', "\\1,\\2", df1$dates), header=FALSE))
Or using some package solutions
library(tidyr)
extract(df1, dates, into = c("Year", "Month"), "(.{4})(.{2})", remove=FALSE)
Or with data.table
library(data.table)
setDT(df1)[, tstrsplit(dates, "(?<=.{4})", perl = TRUE)]
tidyr::separate can take an integer for its sep parameter, which will split at a particular location:
library(tidyr)
df <- data.frame(date = c(201401, 201402, 201403, 201412))
df %>% separate(date, into = c('year', 'month'), sep = 4)
#> year month
#> 1 2014 01
#> 2 2014 02
#> 3 2014 03
#> 4 2014 12
Note the new columns are character; add convert = TRUE to coerce back to numbers.
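If the column is stored as numbers and you want numeric year and month columns anyway, integer division and modulo avoid strings altogether. A minimal sketch, assuming every value has the six-digit YYYYMM form:

```r
df1 <- data.frame(dates = c(201401, 201402, 201403, 201412))

df1$year  <- df1$dates %/% 100  # drop the last two digits
df1$month <- df1$dates %% 100   # keep only the last two digits
df1
```

The month comes out as 1 rather than "01"; use sprintf("%02d", as.integer(df1$month)) if you need the zero-padded text form.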
