Joining dataframes with different dimensions and filling the gaps - r

I want to join two dataframes of different dimensions. Here are the example datasets:
Main dataset
# Main data
id <- c(rep(1, 3), rep(3, 3), rep(10, 1))
time <- c(201601, 201602, 201603, 201601, 201602, 201603, 201601)
data1 <- c(100, 150, 160, 111, 120, 130, 150)
data2 <- c(5, 6, 9, 3, 2, 1, 0)
dataf1 <- data.frame(id, time, data1, data2)
Dataframe to be joined with the main dataset
# Additional data
id <- c(3, 10, 2)
time <- c(rep(201604, 3))
data2 <- c(20, 30, 11)
dataf2 <- data.frame(id, time, data2)
I want to join these two dataframes, dataf1 and dataf2. I have tried dplyr::full_join(dataf1, dataf2, by = "id") but it's not giving what I want. The expected join should carry dataf2's values over (its data2 values belong in the data1 column of the result), and the final output should also include the missing timestamps, i.e. every combination of id and time, with NA filling the gaps.
Is there any way I can achieve this?

Here is a data.table approach to your question:
library(data.table)
#create data.tables out of your data.frames
setDT(dataf1)
setDT(dataf2)
#row-bind all your data together
alldata <- rbindlist( list( dataf1, dataf2 ), use.names = TRUE, fill = TRUE )
#get all unique id-time combinations out of your data
DT <- CJ( alldata$id, alldata$time, unique = TRUE)
setnames(DT, names(DT), c("id", "time"))
#join your data to all unique combinations of id-time
ans <- DT[ alldata, `:=`( data1 = i.data1, data2 = i.data2), on = .(id, time)]
Output:
# id time data1 data2
# 1: 1 201601 100 5
# 2: 1 201602 150 6
# 3: 1 201603 160 9
# 4: 1 201604 NA NA
# 5: 2 201601 NA NA
# 6: 2 201602 NA NA
# 7: 2 201603 NA NA
# 8: 2 201604 NA 11
# 9: 3 201601 111 3
# 10: 3 201602 120 2
# 11: 3 201603 130 1
# 12: 3 201604 NA 20
# 13: 10 201601 150 0
# 14: 10 201602 NA NA
# 15: 10 201603 NA NA
# 16: 10 201604 NA 30
As you can see, it (almost) matches your desired output.
I am confused about why you want id = 10 & time = 201604 to give data1 = 30: in the source data that value is data2 = 30 while data1 = NA. Why that behaviour?
Of course you can easily move data2 into data1 with an update by reference like ans[is.na(data1) & !is.na(data2), `:=`(data1 = data2, data2 = NA)].
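If instead you only want to fill the gaps in data1 from data2, a sketch with data.table::fcoalesce (available in data.table >= 1.12.4):
# fill NA entries of data1 with the corresponding data2 values, by reference
ans[, data1 := fcoalesce(data1, data2)]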

Here is one way using tidyr::complete with dplyr. After doing a full_join, we convert the time column to a Date object. For every id we complete the sequence from the minimum date to '2016-04-01' and remove rows that are NA in every column except id.
library(dplyr)
full_join(dataf1, dataf2, by = "id") %>%
  select(-time.y, -data2.y) %>%
  rename_all(~ names(dataf1)) %>%
  mutate(time1 = as.Date(paste0(time, "01"), "%Y%m%d")) %>%
  tidyr::complete(id, time1 = seq(min(time1, na.rm = TRUE),
                                  as.Date('2016-04-01'), by = "1 month")) %>%
  mutate(time = format(time1, "%Y%m")) %>%
  filter_at(vars(-id), any_vars(!is.na(.))) %>%
  select(-time1)
# id time data1 data2
# <dbl> <chr> <dbl> <dbl>
# 1 1 201601 100 5
# 2 1 201602 150 6
# 3 1 201603 160 9
# 4 1 201604 NA NA
# 5 2 201601 NA NA
# 6 2 201602 NA NA
# 7 2 201603 NA NA
# 8 2 201604 NA NA
# 9 3 201601 111 3
#10 3 201602 120 2
#11 3 201603 130 1
#12 3 201604 NA NA
#13 10 201601 150 0
#14 10 201602 NA NA
#15 10 201603 NA NA
#16 10 201604 NA NA
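A simpler sketch of the same dplyr/tidyr idea is to stack the two frames instead of joining them; note that this keeps dataf2's values in the data2 column rather than moving them into data1:
library(dplyr)
library(tidyr)
bind_rows(dataf1, dataf2) %>%
  complete(id, time) %>%
  arrange(id, time)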

This matches your exact final output:
library(data.table)
setnames(dataf2, "data2", "data1") # Warning: This will modify the original dataf2
rbindlist(
  list(dataf1, dataf2),
  fill = TRUE
)[CJ(id, time, unique = TRUE), on = .(id, time)]
# id time data1 data2
# 1: 1 201601 100 5
# 2: 1 201602 150 6
# 3: 1 201603 160 9
# 4: 1 201604 NA NA
# 5: 2 201601 NA NA
# 6: 2 201602 NA NA
# 7: 2 201603 NA NA
# 8: 2 201604 11 NA
# 9: 3 201601 111 3
# 10: 3 201602 120 2
# 11: 3 201603 130 1
# 12: 3 201604 20 NA
# 13: 10 201601 150 0
# 14: 10 201602 NA NA
# 15: 10 201603 NA NA
# 16: 10 201604 30 NA
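If you would rather not modify dataf2 by reference, a sketch of the same approach on a copy (same output as above):
library(data.table)
df2 <- as.data.table(dataf2)   # as.data.table() copies, so the original dataf2 stays untouched
setnames(df2, "data2", "data1")
rbindlist(list(dataf1, df2), fill = TRUE)[CJ(id, time, unique = TRUE), on = .(id, time)]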

Related

Collapse data frame so NAs are removed

I want to collapse this data frame so the NAs are removed. How can I accomplish this? Thanks!!
library(tidyr)
id <- c(1,1,1,2,2,3,4,5,5)
q1 <- c(23,55,7,88,90,34,11,22,99)
df <- data.frame(id,q1)
df$row <- 1:nrow(df)
spread(df, id, q1)
row 1 2 3 4 5
1 23 NA NA NA NA
2 55 NA NA NA NA
3 7 NA NA NA NA
4 NA 88 NA NA NA
5 NA 90 NA NA NA
6 NA NA 34 NA NA
7 NA NA NA 11 NA
8 NA NA NA NA 22
9 NA NA NA NA 99
I want it to look like this:
1 2 3 4 5
23 88 34 11 22
55 90 NA NA 99
7 NA NA NA NA
The row should be created based on the sequence within each 'id'. Also, pivot_wider is a more general function than spread:
library(dplyr)
library(tidyr)
df %>%
  group_by(id) %>%
  mutate(row = row_number()) %>%
  ungroup %>%
  pivot_wider(names_from = id, values_from = q1) %>%
  select(-row)
Output:
# A tibble: 3 × 5
`1` `2` `3` `4` `5`
<dbl> <dbl> <dbl> <dbl> <dbl>
1 23 88 34 11 22
2 55 90 NA NA 99
3 7 NA NA NA NA
Or use dcast
library(data.table)
dcast(setDT(df), rowid(id) ~ id, value.var = 'q1')[, id := NULL][]
1 2 3 4 5
<num> <num> <num> <num> <num>
1: 23 88 34 11 22
2: 55 90 NA NA 99
3: 7 NA NA NA NA
Here's a base R solution. I sort each column so the non-NA values are at the top, find the number of non-NA values in the column with the most non-NA values (n), and return the top n rows from the data frame.
library(tidyr)
id <- c(1,1,1,2,2,3,4,5,5)
q1 <- c(23,55,7,88,90,34,11,22,99)
df <- data.frame(id,q1)
df$row <- 1:nrow(df)
df <- spread(df, id, q1)
collapse_df <- function(df) {
  move_na_to_bottom <- function(x) x[order(is.na(x))]
  sorted <- sapply(df, move_na_to_bottom)
  count_non_na <- function(x) sum(!is.na(x))
  n <- max(apply(df, 2, count_non_na))
  sorted[1:n, ]
}
collapse_df(df[, -1])
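On the example data this should return the non-NA values stacked at the top of each column:
#       1  2  3  4  5
# [1,] 23 88 34 11 22
# [2,] 55 90 NA NA 99
# [3,]  7 NA NA NA NA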

In R, use mutate() to create a new column based on conditions by group

For each person there are two types of visits, and each visit has a date record. The dataset looks like below.
p <- c(1,1,1,2,2,2,2,3,3,3)
type <- c(15,20,20,15,20,15,20,20,15,15)
date <- as.Date(c("2014-02-03","2014-02-04","2014-02-06","2014-01-28","2014-02-03","2014-03-03","2014-03-13","2014-04-03","2014-04-09","2014-12-03"))
d <- data.frame(p,type,date)
So now the dataset looks like this.
> d
p type date
1 1 15 2014-02-03
2 1 20 2014-02-04
3 1 20 2014-02-06
4 2 15 2014-01-28
5 2 20 2014-02-03
6 2 15 2014-03-03
7 2 20 2014-03-13
8 3 20 2014-04-03
9 3 15 2014-04-09
10 3 15 2014-12-03
Now, I'd like to create three new columns:
1. An indicator of whether a type 20 visit happens within 7 days after the type 15 visit: 1 if yes, otherwise 0. (For example, for p2 the value in line 4 should be 1 and in line 6 it should be 0.)
2. The first date of a type 20 visit that happens within 7 days after the type 15 visit. If there is no type 20 visit within 7 days after the type 15, leave it blank. (For example, for p1 the value should be 2014-02-04, not 2014-02-06.)
3. The number of days between the type 15 visit and the type 20 visit within 7 days of it. If there is no such type 20 visit, leave it blank. (For example, the value in line 1 should be 1.)
I'm a super newbie in R and basically have no idea what to do. I tried a for loop within each group, but it never works:
group_by(p) %>%
  for (i in 1:length(date)) {
    if (type[i] == 15 && date[i] + 7 >= date[i + 1:length(date)]) {
      indicator = 1
      first_date =
      days =            # Have no idea how to check in this part
    } else {
      indicator = 0
      first_date = NA
      days = NA
    }
The expected output is as below.
p type date ind first_date days
1 1 15 2014-02-03 1 2014-02-04 1 # = 2014-02-04 - 2014-02-03
2 1 20 2014-02-04 NA <NA> NA
3 1 20 2014-02-06 NA <NA> NA
4 2 15 2014-01-28 1 2014-02-03 6 # = 2014-02-03 - 2014-01-28
5 2 20 2014-02-03 NA <NA> NA
6 2 15 2014-03-03 0 <NA> NA # since (2014-03-13 - 2014-03-03) > 7
7 2 20 2014-03-13 NA <NA> NA
8 3 20 2014-04-03 NA <NA> NA #I don't care about the value for type 20 lines
9 3 15 2014-04-09 0 <NA> NA
10 3 15 2014-12-03 0 <NA> NA
So I came up with a new idea: what if we group records by p and by runs starting at each type 15 record? Then we can use subtraction within groups to get days, and the rest will be easy.
I found one way of doing this:
d[, group := cumsum(type == 15)]
However, this starts a new group at every type 15 record regardless of the person. How do I add p as another grouping condition?
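A minimal sketch of that grouping with data.table (assuming d is the data frame above):
library(data.table)
# restart the cumulative count within each person
setDT(d)[, group := cumsum(type == 15), by = p]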
I took a stab at this. There's one caveat though: my answer assumes that after a type 15 visit occurs, the next visit within 7 days will be a type 20 visit. If that's not the case, i.e. there's another type 15 visit within 7 days, the first type 15 visit won't be considered, and only the second type 15 visit matters:
library(dplyr)
library(tidyr)
library(lubridate)
d %>%
  mutate(rownum = 1:n()) %>%
  spread(type, date, sep = "_") %>%
  group_by(p) %>%
  mutate(ind = ifelse(lead(type_20) - type_15 <= 7, 1, 0)) %>%
  mutate(ind = ifelse(is.na(ind), 0, ind)) %>%
  mutate(ind = ifelse(is.na(type_15), NA, ind)) %>%
  mutate(first_date = ifelse(ind == 1, lead(type_20), NA)) %>%
  mutate(first_date = as.Date(first_date, origin = lubridate::origin)) %>%
  mutate(days = first_date - type_15) %>%
  gather("type", "date", type_15, type_20) %>%
  filter(!is.na(date)) %>%
  arrange(p, date) %>%
  select(p, type, date, ind, first_date, days)
# p type date ind first_date days
# <dbl> <chr> <date> <dbl> <date> <time>
#1 1 type_15 2014-02-03 1 2014-02-04 1 days
#2 1 type_20 2014-02-04 NA <NA> NA days
#3 1 type_20 2014-02-06 NA <NA> NA days
#4 2 type_15 2014-01-28 1 2014-02-03 6 days
#5 2 type_20 2014-02-03 NA <NA> NA days
#6 2 type_15 2014-03-03 0 <NA> NA days
#7 2 type_20 2014-03-13 NA <NA> NA days
#8 3 type_20 2014-04-03 NA <NA> NA days
#9 3 type_15 2014-04-09 0 <NA> NA days
#10 3 type_15 2014-12-03 0 <NA> NA days
Let me try to explain what I'm doing:
First the type and date columns are spread so that the dates of each type appear in separate columns (this makes it easier to compare dates of the two different types). Next, a couple of mutates. The first three apply the conditions outlined in the question, as follows: if lead(type_20) - type_15 <= 7, there was a type 20 visit within 7 days of a type 15 visit, so we mark that as 1, else as 0. After this, if ind is NA we assume no type 20 visit was found, so we also mark it as 0. In the third mutate we set ind to NA on the lines where type_15 is NA.
The next three mutate lines add the columns outlined in 2 and 3 in the question.
Finally, the columns are gathered back up to their previous format, redundant rows are filtered out, the dataframe is arranged by p and date, and the needed columns are selected.
I hope this is clear enough. It might be helpful to run the code line by line, stopping to view the transformed data frame after each line to see how the transformations act on the dataframe.
If you're willing to use some functions from the purrr package and some custom functions, here is another option...
Packages you'll need
library(dplyr)
library(purrr)
Set up data (as per question)
p <- c(1,1,1,2,2,2,2,3,3,3)
type <- c(15,20,20,15,20,15,20,20,15,15)
date <- as.Date(c("2014-02-03","2014-02-04","2014-02-06","2014-01-28","2014-02-03","2014-03-03","2014-03-13","2014-04-03","2014-04-09","2014-12-03"))
d <- data.frame(p, type, date)
Create custom functions that will work with the purrr map_* functions to iterate through your data frame and create ind and first_date.
# Function to manage ind
ind_manager <- function(type, date, dates_20) {
  if (type == 20)
    return (NA_integer_)
  checks <- map_lgl(dates_20, between, date, date + 7)
  return (as.integer(any(checks)))
}

# Function to manage first_date
first_date_manager <- function(ind, date, dates_20) {
  if (is.na(ind) || ind != 1)
    return (NA_character_)
  dates_20 <- dates_20[order(dates_20)]
  as.character(dates_20[which.max(date < dates_20)])
}
Save a vector of dates where type == 20 to be used as comparisons
dates_20 <- d$date[d$type == 20]
The final mutate() call
# mutate() call to create variables
d %>%
  mutate(
    ind = map2_int(type, date, ind_manager, dates_20),
    first_date = as.Date(map2_chr(ind, date, first_date_manager, dates_20)),
    days = as.integer(first_date - date)
  )
#> p type date ind first_date days
#> 1 1 15 2014-02-03 1 2014-02-04 1
#> 2 1 20 2014-02-04 NA <NA> NA
#> 3 1 20 2014-02-06 NA <NA> NA
#> 4 2 15 2014-01-28 1 2014-02-03 6
#> 5 2 20 2014-02-03 NA <NA> NA
#> 6 2 15 2014-03-03 0 <NA> NA
#> 7 2 20 2014-03-13 NA <NA> NA
#> 8 3 20 2014-04-03 NA <NA> NA
#> 9 3 15 2014-04-09 0 <NA> NA
#> 10 3 15 2014-12-03 0 <NA> NA
Here is a base R way. Generally, I prefer to create a function that does your task which can then be repeated on other pieces and debugged on test cases where it doesn't seem to work.
The first step is to define the pieces:
d <- structure(list(p = c(1, 1, 1, 2, 2, 2, 2, 3, 3, 3),
                    type = c(15, 20, 20, 15, 20, 15, 20, 20, 15, 15),
                    date = structure(c(16104, 16105, 16107, 16098, 16104, 16132, 16142, 16163, 16169, 16407), class = "Date")),
               .Names = c("p", "type", "date"),
               row.names = c(NA, -10L), class = "data.frame")

id <- with(d, {
  id <- ave(type, p, FUN = function(x) cumsum(x == 15))
  factor(paste0(p, id), unique(paste0(p, id)))
})
sp <- split(d, id)
So, sp is a list of data frames to which we will apply a function. Each piece contains a single unique p with at most one type == 15 row (plus however many type == 20 rows follow).
The first two pieces are
sp[1:2]
# $`11`
# p type date
# 1 1 15 2014-02-03
# 2 1 20 2014-02-04
# 3 1 20 2014-02-06
#
# $`21`
# p type date
# 4 2 15 2014-01-28
# 5 2 20 2014-02-03
And we can apply the function below on each one
first_date(sp[[1]])
# p type date ind first_date days
# 1 1 15 2014-02-03 1 2014-02-04 1
# 2 1 20 2014-02-04 NA <NA> NA
# 3 1 20 2014-02-06 NA <NA> NA
first_date(sp[[2]])
# p type date ind first_date days
# 4 2 15 2014-01-28 1 2014-02-03 6
# 5 2 20 2014-02-03 NA <NA> NA
Or all at once with a loop
(sp1 <- lapply(sp, first_date))
`rownames<-`(do.call('rbind', sp1), NULL)
# p type date ind first_date days
# 1 1 15 2014-02-03 1 2014-02-04 1
# 2 1 20 2014-02-04 NA <NA> NA
# 3 1 20 2014-02-06 NA <NA> NA
# 4 2 15 2014-01-28 1 2014-02-03 6
# 5 2 20 2014-02-03 NA <NA> NA
# 6 2 15 2014-03-03 0 <NA> NA
# 7 2 20 2014-03-13 NA <NA> NA
# 8 3 20 2014-04-03 NA <NA> NA
# 9 3 15 2014-04-09 0 <NA> NA
# 10 3 15 2014-12-03 0 <NA> NA
You can take advantage of the arguments, like window, or any others you add, without changing much of the function. For example, to change the window (note that this is applied to sp1, whose pieces already carry the window = 7 columns, which is why both sets of ind/first_date/days columns appear side by side):
(sp2 <- lapply(sp1, first_date, window = 14))
`rownames<-`(do.call('rbind', sp2), NULL)
# p type date ind first_date days ind first_date days
# 1 1 15 2014-02-03 1 2014-02-04 1 1 2014-02-04 1
# 2 1 20 2014-02-04 NA <NA> NA NA <NA> NA
# 3 1 20 2014-02-06 NA <NA> NA NA <NA> NA
# 4 2 15 2014-01-28 1 2014-02-03 6 1 2014-02-03 6
# 5 2 20 2014-02-03 NA <NA> NA NA <NA> NA
# 6 2 15 2014-03-03 0 <NA> NA 1 2014-03-13 10
# 7 2 20 2014-03-13 NA <NA> NA NA <NA> NA
# 8 3 20 2014-04-03 NA <NA> NA NA <NA> NA
# 9 3 15 2014-04-09 0 <NA> NA 0 <NA> NA
# 10 3 15 2014-12-03 0 <NA> NA 0 <NA> NA
first_date <- function(data, window = 7) {
  nr <- nrow(data)
  ## check at least one type 15 and > 1 row
  ty15 <- data$type == 15
  dt15 <- data$date[ty15]
  if (!any(ty15) | nr == 1L)
    return(cbind(data, ind = ifelse(any(ty15), 0, NA),
                 first_date = NA, days = NA))
  ## first date vector
  dts <- rep(min(data$date[!ty15]), nr)
  dts[!ty15] <- NA
  ## days from the type 15 date
  days <- as.numeric(data$date[!ty15] - min(dt15))
  days <- c(days, rep(NA, nr - length(days)))
  ## convert to NA if criteria not met
  to_na <- days > window | is.na(dts)
  days[to_na] <- dts[to_na] <- NA
  ## ind vector -- 1 or 0 if type 15, NA otherwise
  ind <- rep(NA, nr)
  ind[ty15] <- as.integer(!is.na(dts[ty15]))
  ## combine
  cbind(data, ind = ind, first_date = dts, days = days)
}

Find row of the next instance of the value in R

I have two columns Time and Event. There are two events A and B. Once an event A takes place, I want to find when the next event B occurs. Column Time_EventB is the desired output.
This is the data frame:
df <- data.frame(Event = sample(c("A", "B", ""), 20, replace = TRUE), Time = paste("t", seq(1,20)))
What is the code in R for finding the next instance of a value (B in this case)?
Once that instance of B is found, how do I return the value of the corresponding Time column?
The code should be something like this:
data$Time_EventB <- ifelse(data$Event == "A", <Code for returning time of next instance of B>, "")
In Excel this can be done using VLOOKUP.
Here's a simple solution:
set.seed(1)
df <- data.frame(Event = sample(c("A", "B", ""),size=20, replace=T), time = 1:20)
as <- which(df$Event == "A")
bs <- which(df$Event == "B")
next_b <- sapply(as, function(a) {
  diff <- bs - a
  if (all(diff < 0)) return(NA)
  bs[min(diff[diff > 0]) == diff]
})
df$next_b <- NA
df$next_b[as] <- df$time[next_b]
> df
Event time next_b
1 A 1 2
2 B 2 NA
3 B 3 NA
4 4 NA
5 A 5 8
6 6 NA
7 7 NA
8 B 8 NA
9 B 9 NA
10 A 10 14
11 A 11 14
12 A 12 14
13 13 NA
14 B 14 NA
15 15 NA
16 B 16 NA
17 17 NA
18 18 NA
19 B 19 NA
20 20 NA
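A more compact sketch of the same lookup with findInterval (this relies on bs being sorted increasingly, which which() guarantees here):
# position within bs of the first B strictly after each A
idx <- findInterval(as, bs) + 1
nb <- bs[replace(idx, idx > length(bs), NA)]   # NA when no later B exists
df$next_b2 <- NA
df$next_b2[as] <- df$time[nb]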
Here's an attempt using a "rolling join" from the data.table package:
library(data.table)
setDT(df)
df[Event=="B", .(time, nextb=time)][df, on="time", roll=-Inf][Event != "A", nextb := NA][]
# time nextb Event
# 1: 1 2 A
# 2: 2 NA B
# 3: 3 NA B
# 4: 4 NA
# 5: 5 8 A
# 6: 6 NA
# 7: 7 NA
# 8: 8 NA B
# 9: 9 NA B
#10: 10 14 A
#11: 11 14 A
#12: 12 14 A
#13: 13 NA
#14: 14 NA B
#15: 15 NA
#16: 16 NA B
#17: 17 NA
#18: 18 NA
#19: 19 NA B
#20: 20 NA
Using the data borrowed from @thc.
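For comparison, a sketch of the same idea with dplyr (assuming dplyr >= 1.1, which added inequality joins via join_by()):
library(dplyr)
b_times <- df %>% filter(Event == "B") %>% transmute(b_time = time)
df %>%
  left_join(b_times, by = join_by(closest(time <= b_time))) %>%
  mutate(next_b = if_else(Event == "A", b_time, NA_integer_)) %>%
  select(Event, time, next_b)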

Split the dataset in R [duplicate]

This question already has answers here:
Transpose / reshape dataframe without "timevar" from long to wide format
(9 answers)
Closed 6 years ago.
I have a dataset which contains Billno and Product columns in the following format:
Billno Product
1 123
1 176
2 189
3 1
3 2
3 44
3 46
etc
The output should be a table of the form:
Billno Prod1 Prod2 Prod3 Prod4
1 123 176
2 189
3 1 2 44 46
The split function works, but the dataset contains more than a million records. Is there an efficient way of doing this?
with dplyr:
library(dplyr)
library(tidyr)
bill <- rep(c(1,1,2,3,3,3,3), 5)
prod <- rep(c(123,176,189, 1,2,44,46), 5)
df <- data.frame(bill = bill, prod = prod)
# determine max product count (number of columns in result)
prodmax <- df %>% group_by(bill) %>% summarise(n = n())
df %>%
  group_by(bill) %>%
  mutate(prodn = paste0("prod", row_number())) %>%
  spread(prodn, prod) %>%
  # select columns in correct order
  select_(.dots = c('bill', paste0('prod', seq(1, max(prodmax$n)))))
results in:
bill prod1 prod2 prod3 prod4
(dbl) (dbl) (dbl) (dbl) (dbl)
1 1 123 176 NA NA
2 2 189 NA NA NA
3 3 1 2 44 46
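With current tidyr (>= 1.0), a sketch of the same reshape using pivot_wider, which makes the prodmax helper unnecessary:
library(dplyr)
library(tidyr)
df %>%
  group_by(bill) %>%
  mutate(prodn = paste0("prod", row_number())) %>%
  ungroup() %>%
  pivot_wider(names_from = prodn, values_from = prod)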
You can do
df <- read.table(header=T, text="Billno Product
1 123
1 176
2 189
3 1
3 2
3 44
3 46")
lst <- split(df[,-1], df[,1])
lst <- lapply(lst, "length<-", max(lengths(lst)))
df <- as.data.frame(do.call(rbind, lst))
# V1 V2 V3 V4
# 1 123 176 NA NA
# 2 189 NA NA NA
# 3 1 2 44 46
and then
names(df) <- sub("V", "prod", names(df))
df$billno <- rownames(df)
This will also do:
l <- lapply(split(df, df$Billno), function(x) t(x)[2, ])
df <- as.data.frame(do.call(rbind, lapply(lapply(l, unlist), "[",
                                          1:(max(unlist(lapply(l, length)))))))
names(df) <- paste('Prod', 1:ncol(df), sep = '')
df
Prod1 Prod2 Prod3 Prod4
1 123 176 NA NA
2 189 NA NA NA
3 1 2 44 46
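Given the size concern in the question (over a million rows), a data.table sketch of the same reshape (assuming df with columns Billno and Product, as in the answer above), which is usually the fastest option:
library(data.table)
dcast(setDT(df), Billno ~ rowid(Billno, prefix = "Prod"), value.var = "Product")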

R converting to long format, pattern

I would like to convert a data.table like this one from wide format to long.
library(data.table)
set.seed(1)
DT <- data.table(
  ID = c(1:5, NA),
  Name = c("Bob","Ana","Smith","Sam","Big","Lulu"),
  Kind_2001 = factor(sample(c(letters[1:3], NA), 6, TRUE)),
  Kind_2002 = factor(sample(c(letters[1:3], NA), 6, TRUE)),
  Kind_2003 = factor(sample(c(letters[1:3], NA), 6, TRUE)),
  Conc_2001 = sample(99, 6),
  Conc_2002 = sample(79, 6),
  Conc_2003 = sample(49, 6)
)
ID Name Kind_2001 Kind_2002 Kind_2003 Conc_2001 Conc_2002 Conc_2003
1 Bob b NA c 38 22 24
2 Ana b c b 77 31 29
3 Smith c c NA 91 2 49
4 Sam NA a b 21 30 9
5 Big a a c 62 66 38
NA Lulu NA a NA 12 26 30
And I would like to get something like this:
ID Name Year Kind Conc
1 Bob 2001 b 38
1 Bob 2002 NA 22
1 Bob 2003 c 24
2 Ana 2001 b 77
2 Ana 2002 c 31
2 Ana 2003 b 29
...
The real table has many more variables; I'm looking for a solution that detects the pattern automatically, without explicitly naming every column or column number.
There are two kinds of columns: some end with an underscore and a four-digit year, such as _2001, and the others don't.
Some can have an underscore in the middle of the name (this part will be kept untransformed).
I would like to transform only the columns ending with a year to long format.
I've tried with
melt(DT, id=1:2, variable.name = "year")
or with
melt(DT, id=1:2, measure=patterns("_2[0-9][0-9][0-9]$"))
but I'm not getting what I want.
Maybe I first need to filter the names with gsub.
P.S. I've found this solution:
posi <- grep("_[0-9][0-9][0-9][0-9]$",colnames(DT))
work <- unique(gsub("_[0-9][0-9][0-9][0-9]$","",colnames(DT)[posi]))
melt(DT, measure=patterns(paste0("^",work)), variable="year", value.name=work)
It almost works, but the year column is not populated properly. Am I missing something, or is it a bug?
And I'm sure it could be written more simply.
ID Name year Kind Conc
1 Bob 1 b 38
2 Ana 1 b 77
3 Smith 1 c 91
4 Sam 1 NA 21
5 Big 1 a 62
NA Lulu 1 NA 12
1 Bob 2 NA 22
2 Ana 2 c 31
3 Smith 2 c 2
4 Sam 2 a 30
5 Big 2 a 66
NA Lulu 2 a 26
1 Bob 3 c 24
2 Ana 3 b 29
3 Smith 3 NA 49
4 Sam 3 b 9
5 Big 3 c 38
NA Lulu 3 NA 30
Regards
Update: I've tried eddi's solution with my database and I get the error
"Error: cannot allocate vector of size 756.5 Mb"
even though I have 16 GB of memory.
We can solve this at scale using reshape() from base R, without having to explicitly name variables.
# First we get indices of colnames that have format "_1234" at the end
tomelt <- grep("_([0-9]{4})$", names(DT))
# Now we use these indices to reshape the data
reshape(DT, varying = tomelt, sep = "_",
        direction = 'long', idvar = "ID", timevar = "Year")
# ID Name Year Kind Conc
# 1: 1 Bob 2001 b 38
# 2: 2 Ana 2001 b 77
# 3: 3 Smith 2001 c 91
# 4: 4 Sam 2001 NA 21
# 5: 5 Big 2001 a 62
# 6: NA Lulu 2001 NA 12
...
If we are looking for a data.table solution, extract the prefix part from the names of 'DT' and use the unique elements as patterns in the measure argument of melt. Similarly, extract the year suffixes and use them to relabel the numeric index in the resulting 'Year' column.
nm <- unique(sub("_\\d+", "", names(DT)[-(1:2)]))
yr <- unique(sub("\\D+_", "", names(DT)[-(1:2)]))
melt(DT, measure = patterns(paste0("^", nm)), value.name = nm,
     variable.name = "Year")[, Year := yr[Year]][]
# ID Name Year Kind Conc
# 1: 1 Bob 2001 b 38
# 2: 2 Ana 2001 b 77
# 3: 3 Smith 2001 c 91
# 4: 4 Sam 2001 NA 21
# 5: 5 Big 2001 a 62
# 6: NA Lulu 2001 NA 12
# 7: 1 Bob 2002 NA 22
# 8: 2 Ana 2002 c 31
# 9: 3 Smith 2002 c 2
#10: 4 Sam 2002 a 30
#11: 5 Big 2002 a 66
#12: NA Lulu 2002 a 26
#13: 1 Bob 2003 c 24
#14: 2 Ana 2003 b 29
#15: 3 Smith 2003 NA 49
#16: 4 Sam 2003 b 9
#17: 5 Big 2003 c 38
#18: NA Lulu 2003 NA 30
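For reference, a tidyr sketch of the same reshape (assuming tidyr >= 1.0; the ".value" sentinel sends each measured prefix to its own column):
library(tidyr)
pivot_longer(DT, cols = matches("_[0-9]{4}$"),
             names_to = c(".value", "Year"), names_sep = "_")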
Here's an option that's more robust with respect to the order of your columns, as well as missing/extra years:
dcast(melt(DT, id.vars = c("ID", "Name"))
[, .(ID, Name, sub('_.*', '', variable), sub('.*_', '', variable), value)],
ID + Name + V4 ~ V3)
# ID Name V4 Conc Kind
# 1: 1 Bob 2001 38 b
# 2: 1 Bob 2002 22 NA
# 3: 1 Bob 2003 24 c
# 4: 2 Ana 2001 77 b
# 5: 2 Ana 2002 31 c
# 6: 2 Ana 2003 29 b
# 7: 3 Smith 2001 91 c
# 8: 3 Smith 2002 2 c
# 9: 3 Smith 2003 49 NA
#10: 4 Sam 2001 21 NA
#11: 4 Sam 2002 30 a
#12: 4 Sam 2003 9 b
#13: 5 Big 2001 62 a
#14: 5 Big 2002 66 a
#15: 5 Big 2003 38 c
#16: NA Lulu 2001 12 NA
#17: NA Lulu 2002 26 a
#18: NA Lulu 2003 30 NA
Edit for many id columns:
idvars = grep("_", names(DT), invert = TRUE)
dcast(melt(DT, id.vars = idvars)[
        , `:=`(var = sub('_.*', '', variable),
               year = sub('.*_', '', variable),
               variable = NULL)],
      ... ~ var, value.var = 'value')
In case anybody is interested, I post here my full solution, able to work with datasets bigger than memory. It uses some of your ideas and some of mine.
My data is in the file file.csv (you can even read a compressed file with fread("unzip -c name.zip")):
library(data.table)
## Initialization
nline <- 1500000  # total number of lines, or use wc -l to get it automatically
chunk <- 5000     # adjust according to your memory and number of columns
times <- ceiling(nline/chunk)
name <- names(fread("file.csv", stringsAsFactors = FALSE, integer64 = "character",
                    nrows = 0, na.strings = c("", "NA")))
idvars <- grep("_20[0-9][0-9]$", name, invert = TRUE)
# Now we loop over every chunk
for (iter in 0:(times-1)) {
  my <- fread("file.csv", stringsAsFactors = FALSE, integer64 = "character",
              skip = 1 + (iter*chunk), nrows = chunk, na.strings = c("", "NA"))
  colnames(my) <- name
  temp <- melt(my, id.vars = idvars)
  newfile <- dcast(
    temp[, `:=`(var = sub('_20[0-9][0-9]$', '', variable),
                year = sub('.*_', '', variable),
                variable = NULL)],
    ... ~ var, value.var = 'value')
  fwrite(newfile, "long.csv", quote = FALSE, sep = ",", append = TRUE)
  rm(temp); rm(newfile); rm(my); gc()
}
As said before, the problem with this method is that it converts all the values to character, but if you save them to a file and read the file back in (as here) you get the proper classes.
In the case of very large files this method is very slow.
I encourage you to improve this solution or suggest any generic solution with tidyr, splitstackshape or other packages.
Or even better, it would be great to do it with a database such as SQLite.
The solution should work on datasets with unordered columns or even with "_" in the middle of the name, such as:
set.seed(1)
DT <- data.table(
  ID = c(1:15),
  Name = c("Bob","Ana","Smith","Sam","Big","Lulu","Loli","Chochi","Tom","Dick","Pet","Shin","Rock","Pep","XXX"),
  Kind_2001 = factor(sample(c(letters[1:3], NA), 15, TRUE)),
  Kind_2002 = factor(sample(c(letters[1:3], NA), 15, TRUE)),
  Kind_2003 = factor(sample(c(letters[1:3], NA), 15, TRUE)),
  Conc_2004 = sample(49, 15),
  aa_Conc_2001 = c(sample(99, 14), NA),
  Conc_2002 = sample(79, 15)
)
