R: Combining several character columns into one by replacing NA-rows

I have a data frame consisting of character variables which looks like this:
V1 V2 V3 V4 V5
1 ID Date pic1 pic2 pic3
2 1 15.06.16 11:50 abc <NA> def
3 1 16.06.16 11:19 <NA> hij <NA>
4 1 17.06.16 11:41 <NA> <NA> nop
5 2 28.05.16 11:40 tuv <NA> <NA>
6 2 29.05.16 11:39 <NA> zab <NA>
7 2 30.05.16 09:07 <NA> <NA> wxy
8 3 03.06.16 07:31 lmn <NA> <NA>
9 3 04.06.16 11:01 <NA> rst <NA>
10 3 05.06.16 13:57 <NA> <NA> opq
So on each day one of the pic variables contains a value and the rest are NA.
Now I want to combine all pic values into one variable by replacing the NAs. Sorry if this is a duplicate; I've already tried a lot of suggested solutions, but nothing has worked so far.
Thanks!

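We can try with data.table. Convert the 'data.frame' to a 'data.table' (setDT(df1)). Then, grouped by 'ID' and 'Date', we unlist the Subset of Data.table (.SD) and omit the NA elements (na.omit). Note that a row with more than one non-NA value (such as the first row of the data below) produces one output row per value.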
library(data.table)
setDT(df1)[, .(pic = na.omit(unlist(.SD))), by = .(ID, Date)]
# ID Date pic
# 1: 1 15.06.16 11:50 abc
# 2: 1 15.06.16 11:50 def
# 3: 1 16.06.16 11:19 hij
# 4: 1 17.06.16 11:41 nop
# 5: 2 28.05.16 11:40 tuv
# 6: 2 29.05.16 11:39 zab
# 7: 2 30.05.16 09:07 wxy
# 8: 3 03.06.16 07:31 lmn
# 9: 3 04.06.16 11:01 rst
#10: 3 05.06.16 13:57 opq
Another option is pmax, which is appropriate if there is only a single non-NA value per row (with several, it returns the lexicographically largest one)
setDT(df1)[, pic := do.call(pmax, c(.SD, na.rm = TRUE)),
.SDcols = pic1:pic3][, paste0("pic", 1:3) := NULL][]
Or using dplyr
library(dplyr)
df1 %>%
mutate(pic = pmax(pic1, pic2, pic3, na.rm=TRUE))%>%
select(-(pic1:pic3))
data
df1 <- structure(list(ID = c(1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L), Date = c("15.06.16 11:50",
"16.06.16 11:19", "17.06.16 11:41", "28.05.16 11:40", "29.05.16 11:39",
"30.05.16 09:07", "03.06.16 07:31", "04.06.16 11:01", "05.06.16 13:57"
), pic1 = c("abc", NA, NA, "tuv", NA, NA, "lmn", NA, NA), pic2 = c(NA,
"hij", NA, NA, "zab", NA, NA, "rst", NA), pic3 = c("def", NA,
"nop", NA, NA, "wxy", NA, NA, "opq")), .Names = c("ID", "Date",
"pic1", "pic2", "pic3"), row.names = c(NA, -9L), class = "data.frame")

Assuming
on each day one of the pic-variables contains a value, the rest is NA
You can use coalesce from dplyr to get what you want:
library(dplyr)
result <- df1 %>% mutate(pic = coalesce(pic1, pic2, pic3)) %>%
select(-(pic1:pic3))
With the data supplied by akrun:
print(result)
## ID Date pic
##1 1 15.06.16 11:50 abc
##2 1 16.06.16 11:19 hij
##3 1 17.06.16 11:41 nop
##4 2 28.05.16 11:40 tuv
##5 2 29.05.16 11:39 zab
##6 2 30.05.16 09:07 wxy
##7 3 03.06.16 07:31 lmn
##8 3 04.06.16 11:01 rst
##9 3 05.06.16 13:57 opq
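For readers without dplyr, the same "first non-NA value per row" idea can be sketched in base R. This is a minimal sketch, not the answerer's method: `coalesce_base` is a hypothetical helper name, and `toy` is made-up data that mirrors the shape of df1.

```r
# Toy data (hypothetical): same layout as the pic columns in df1.
# Row 1 deliberately has two non-NA values, as in akrun's data.
toy <- data.frame(
  pic1 = c("abc", NA, NA),
  pic2 = c(NA, "hij", NA),
  pic3 = c("def", NA, "nop"),
  stringsAsFactors = FALSE
)

# coalesce_base (hypothetical name): walk the columns left to right and
# keep the current value unless it is NA, in which case take the next one.
coalesce_base <- function(...) {
  Reduce(function(x, y) ifelse(is.na(x), y, x), list(...))
}

toy$pic <- coalesce_base(toy$pic1, toy$pic2, toy$pic3)
print(toy$pic)
```

On a row with several non-NA values this keeps the leftmost one ("abc" in row 1), whereas pmax with na.rm = TRUE would return "def" and the unlist/na.omit approach would return both values as separate rows.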

Related

How to remove duplicates if specific column has value in r

I need to delete some rows in my dataset based on the given condition.
Kindly go through the sample data for reference.
ID Date Dur
123 01/05/2000 3
123 08/04/2002 6
564 04/04/2012 2
741 01/08/2011 5
789 02/03/2009 1
789 08/01/2010 NA
789 05/05/2011 NA
852 06/06/2015 3
852 03/02/2016 NA
155 03/02/2008 NA
155 01/01/2009 NA
159 07/07/2008 NA
My main concern is the Dur column. I have to delete the rows that have a non-NA Dur within IDs that have more than one row,
i.e. IDs (123, 789, 852) have more than one record/row and a Dur value, so I need to remove the rows with a Dur value for those IDs, which means the entire ID 123 and the first record of 789 and 852.
I don't want to delete any IDs (564, 741, 852) that have a Dur value with a single record, or any other IDs with NA in Dur.
Expected Output:
ID Date Dur
564 04/04/2012 2
741 01/08/2011 5
789 08/01/2010 NA
789 05/05/2011 NA
852 03/02/2016 NA
155 03/02/2008 NA
155 01/01/2009 NA
159 07/07/2008 NA
Kindly suggest a code to solve the issue.
Thanks in Advance!
One way is to keep a row if its group has only one row or if its Dur value is NA.
This can be written in dplyr as:
library(dplyr)
df %>% group_by(ID) %>% filter(n() == 1 | is.na(Dur))
# ID Date Dur
# <int> <chr> <int>
#1 564 04/04/2012 2
#2 741 01/08/2011 5
#3 789 08/01/2010 NA
#4 789 05/05/2011 NA
#5 852 03/02/2016 NA
#6 155 03/02/2008 NA
#7 155 01/01/2009 NA
#8 159 07/07/2008 NA
Using data.table :
library(data.table)
setDT(df)[, .SD[.N == 1 | is.na(Dur)], ID]
and base R :
subset(df, ave(is.na(Dur), ID, FUN = function(x) length(x) == 1 | x))
data
df <- structure(list(ID = c(123L, 123L, 564L, 741L, 789L, 789L, 789L,
852L, 852L, 155L, 155L, 159L), Date = c("01/05/2000", "08/04/2002",
"04/04/2012", "01/08/2011", "02/03/2009", "08/01/2010", "05/05/2011",
"06/06/2015", "03/02/2016", "03/02/2008", "01/01/2009", "07/07/2008"
), Dur = c(3L, 6L, 2L, 5L, 1L, NA, NA, 3L, NA, NA, NA, NA)),
class = "data.frame", row.names = c(NA, -12L))
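The condition used in the answer above (keep a row if its ID has a single row, or if Dur is NA) can be spelled out step by step. The following is only an illustrative sketch on a made-up subset of the data; `toy`, `group_size` and `keep` are hypothetical names.

```r
# Toy subset (hypothetical) of the question's data.
toy <- data.frame(
  ID  = c(123L, 123L, 564L, 789L, 789L),
  Dur = c(3L, 6L, 2L, 1L, NA)
)

# Number of rows in each ID group, repeated for every row of that group.
group_size <- ave(seq_along(toy$ID), toy$ID, FUN = length)

# Keep single-row groups, plus any row whose Dur is NA.
keep <- group_size == 1 | is.na(toy$Dur)

print(toy[keep, ])
```

Here ID 123 is dropped entirely (two rows, both with a Dur value), ID 564 is kept (single row), and only the NA row of ID 789 survives.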
We can use .I in data.table
library(data.table)
setDT(df)[df[, .I[.N == 1 | is.na(Dur)], ID]$V1]

How do I create a third column based on Character Values of other columns, excluding NA and values?

How can I create a new column called 'title' based on values of other columns attributes?
I have shown the example below, where 'title' needs to be created based on the columns Post, Tel, Surname, and Emp. 'title' just indicates which values are not NA.
I have this
ID1 ID2 Post Tel Surname Emp
<chr> <chr> <chr> <chr> <chr> <chr>
1 S04 S03 NA 369 990247 NA NA
2 S14 S08 NA 069 990351 NA NA
3 S18 S03 N165HT NA Jones NA
4 S19 S13 NA 3069 90685 NA NA
5 S20 S16 NA 3069 90954 NA NA
6 S20 S17 CO19RF NA NA Ocean
And I want to create this:
ID1 ID2 Post Tel Surname Emp title
<chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 S04 S03 NA 369 990247 NA NA Tel
2 S14 S08 NA 069 990351 NA NA Tel
3 S18 S03 N165HT NA Jones NA Post,Surname
4 S19 S13 NA 3069 90685 NA NA Tel
5 S20 S16 NA 3069 90954 NA NA Tel
6 S20 S17 CO19RF NA NA Ocean Post,Emp
In Base R:
cols <- c("Post", "Tel", "Surname", "Emp")
d$title <- apply(d[, cols], 1, function(x){
paste(cols[which(!is.na(x))], collapse = ",")
})
An option here is to create a unique row identifier ('rn') and gather into 'long' format (removing the NA elements with na.rm = TRUE). Then, grouped by 'rn', paste the 'key' elements together in summarise and bind the result to the original dataset.
library(tidyverse)
df1 %>%
rownames_to_column('rn') %>%
gather(key, val, Post:Emp, na.rm = TRUE) %>%
group_by(rn) %>%
summarise(title = toString(key)) %>%
ungroup %>%
select(-rn) %>%
bind_cols(df1, .)
# ID1 ID2 Post Tel Surname Emp title
#1 S04 S03 <NA> 369 990247 <NA> <NA> Tel
#2 S14 S08 <NA> 069 990351 <NA> <NA> Tel
#3 S18 S03 N165HT <NA> Jones <NA> Post, Surname
#4 S19 S13 <NA> 3069 90685 <NA> <NA> Tel
#5 S20 S16 <NA> 3069 90954 <NA> <NA> Tel
#6 S20 S17 CO19RF <NA> <NA> Ocean Post, Emp
data
df1 <- structure(list(ID1 = c("S04", "S14", "S18", "S19", "S20", "S20"
), ID2 = c("S03", "S08", "S03", "S13", "S16", "S17"), Post = c(NA,
NA, "N165HT", NA, NA, "CO19RF"), Tel = c("369 990247", "069 990351",
NA, "3069 90685", "3069 90954", NA), Surname = c(NA, NA, "Jones",
NA, NA, NA), Emp = c(NA, NA, NA, NA, NA, "Ocean")), row.names = c("1",
"2", "3", "4", "5", "6"), class = "data.frame")
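The two answers format the title slightly differently: the base R answer joins names with paste(collapse = ","), while the tidyverse answer uses toString, which inserts a space after each comma. A small sketch of the difference (the `keys` vector is made up for illustration):

```r
# Column names that are non-NA for one example row (hypothetical input).
keys <- c("Post", "Surname")

with_paste    <- paste(keys, collapse = ",")  # no space after the comma
with_tostring <- toString(keys)               # comma followed by a space

print(with_paste)
print(with_tostring)

# A row with no non-NA columns yields an empty string with either approach.
print(paste(character(0), collapse = ","))
```

If the exact "Post,Surname" format from the question matters, paste with collapse = "," is the safer choice.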

Imputing dates to empty cells for large dataset

I have a dataset that looks like below:
PPID join_date week date visit
A 2017-10-01 1 NA 0
A 2017-10-01 2 2017-10-08 2
A 2017-10-01 3 2017-10-15 1
A 2017-10-01 4 NA 0
B 2017-05-23 1 2017-05-21 4
B 2017-05-23 2 2017-05-28 2
B 2017-05-23 3 NA 0
week indicates the difference between the Sunday of the week of join_date and date in weeks (e.g. for participant B, the Sunday of the week of 2017-05-23 is 2017-05-21; thus participant B's week1 starts on 2017-05-21, and week2 starts on 2017-05-28).
My goal is to fill in date where it is currently NA, such that the output looks like below:
PPID join_date week date visit
A 2017-10-01 1 2017-10-01 0
A 2017-10-01 2 2017-10-08 2
A 2017-10-01 3 2017-10-15 1
A 2017-10-01 4 2017-10-22 0
B 2017-05-23 1 2017-05-21 4
B 2017-05-23 2 2017-05-28 2
B 2017-05-23 3 2017-06-04 0
The code I currently have is:
library(dplyr)
library(lubridate)
df2 <- df %>%
group_by(PPID) %>%
mutate(date = seq(unique(floor_date(as.Date(join_date), "weeks")),
unique(floor_date(as.Date(join_date), "weeks") + 7*(max(week)-1)),
by="week"))
The problem with this approach is that I'm working with a large dataset (~8 million observations) and it takes forever to run! I read in some posts that the date conversions and calculations (e.g. floor_date or as.Date) are what take so long, and was wondering if there are ways to make my code more efficient.
Thanks!
How about simply
df2$date = floor_date(df2$join_date, 'week') + 7*(df2$week-1)
# PPID join_date week date visit
# 1 A 2017-10-01 1 2017-10-01 0
# 2 A 2017-10-01 2 2017-10-08 2
# 3 A 2017-10-01 3 2017-10-15 1
# 4 A 2017-10-01 4 2017-10-22 0
# 5 B 2017-05-23 1 2017-05-21 4
# 6 B 2017-05-23 2 2017-05-28 2
# 7 B 2017-05-23 3 2017-06-04 0
Although this calculates floor_date for every row, it is vectorised rather than looping (as you did implicitly by grouping), so it should be fast enough for most purposes. If you need even more speed-up, you could subset on is.na(df2$date) to calculate only the rows you need to impute.
Data:
df2 = structure(list(PPID = c("A", "A", "A", "A", "B", "B", "B"), join_date = structure(c(17440,
17440, 17440, 17440, 17309, 17309, 17309), class = "Date"), week = c(1L,
2L, 3L, 4L, 1L, 2L, 3L), date = structure(c(NA, 17447, 17454,
NA, 17307, 17314, NA), class = "Date"), visit = c(0L, 2L, 1L,
0L, 4L, 2L, 0L)), row.names = c(NA, -7L), class = "data.frame")
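The suggestion to impute only the missing rows could look roughly like this. It is a sketch under assumptions: it uses base R instead of lubridate (taking the Sunday of the week via the weekday number from as.POSIXlt, where 0 means Sunday), and `toy`, `missing`, `wday` and `week_start` are hypothetical names on made-up data.

```r
# Toy data (hypothetical), same column meaning as df2 above.
toy <- data.frame(
  join_date = as.Date(c("2017-10-01", "2017-10-01", "2017-05-23", "2017-05-23")),
  week      = c(1L, 4L, 1L, 3L),
  date      = as.Date(c(NA, NA, "2017-05-21", NA))
)

# Only touch the rows that actually need imputing.
missing <- is.na(toy$date)

# Weekday of join_date: 0 = Sunday, ..., 6 = Saturday.
wday <- as.POSIXlt(toy$join_date[missing])$wday

# Sunday on or before join_date, then shift by (week - 1) whole weeks.
week_start <- toy$join_date[missing] - wday
toy$date[missing] <- week_start + 7 * (toy$week[missing] - 1L)

print(toy)
```

Rows that already have a date are left untouched, so the amount of date arithmetic scales with the number of NAs rather than with the size of the data.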

Efficient solution to (recursively) replace NAs with the mean of lags, by group

I need to replace NAs with the mean of previous three values, by group.
Once an NA is replaced, it will serve as input for computing the mean corresponding to the next NA (if next NA is within the next three months).
Here it is an example:
id date value
1 2017-04-01 40
1 2017-05-01 40
1 2017-06-01 10
1 2017-07-01 NA
1 2017-08-01 NA
2 2014-01-01 27
2 2014-02-01 13
Data:
dt <- structure(list(id = c(1L, 1L, 1L, 1L, 1L, 2L, 2L), date = structure(c(17257, 17287, 17318, 17348, 17379, 16071, 16102), class = "Date"), value = c(40, 40, 10, NA, NA, 27, 13)), row.names = c(1L, 2L, 3L, 4L, 5L, 8L, 9L), class = "data.frame")
The output should look like:
id date value
1 2017-04-01 40.00
1 2017-05-01 40.00
1 2017-06-01 10.00
1 2017-07-01 30.00
1 2017-08-01 26.66
2 2014-01-01 27.00
2 2014-02-01 13.00
where 26.66 = (30 + 10 + 40)/3
What is an efficient way to do this (i.e. to avoid for loops)?
The following uses base R only and does what you need.
sp <- split(dt, dt$id)
sp <- lapply(sp, function(DF){
for(i in which(is.na(DF$value))){
tmp <- DF[seq_len(i - 1), ]
DF$value[i] <- mean(tail(tmp$value, 3))
}
DF
})
result <- do.call(rbind, sp)
row.names(result) <- NULL
result
# id date value
#1 1 2017-04-01 40.00000
#2 1 2017-05-01 40.00000
#3 1 2017-06-01 10.00000
#4 1 2017-07-01 30.00000
#5 1 2017-08-01 26.66667
#6 2 2014-01-01 27.00000
#7 2 2014-02-01 13.00000
Define a roll function which takes up to 3 previous values (as a list) and the current value. It returns a list holding the previous 2 values together with the current value if the current value is not NA, or the previous 2 values together with the mean of the previous values if it is NA. Use that with Reduce and pick off the last value of each list in the result. Then apply all that to each group using ave.
roll <- function(prev, cur) {
prev <- unlist(prev)
list(tail(prev, 2), if (is.na(cur)) mean(prev) else cur)
}
reduce_roll <- function(x) {
sapply(Reduce(roll, init = x[1], x[-1], acc = TRUE), tail, 1)
}
transform(dt, value = ave(value, id, FUN = reduce_roll))
giving:
id date value
1 1 2017-04-01 40
2 1 2017-05-01 40
3 1 2017-06-01 10
4 1 2017-07-01 30
5 1 2017-08-01 26.66667
8 2 2014-01-01 27
9 2 2014-02-01 13
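The recursion itself can also be isolated into a small helper on a plain numeric vector and then applied per group with ave. This is a minimal sketch, not a faster algorithm: `fill_with_lag_mean` and its window argument `k` are hypothetical names, and a leading NA (which has no previous values) is deliberately left as NA.

```r
# fill_with_lag_mean (hypothetical helper): replace each NA with the mean of
# the up-to-k values before it; filled values feed into later NAs.
fill_with_lag_mean <- function(x, k = 3) {
  for (i in which(is.na(x))) {
    if (i > 1) {
      x[i] <- mean(tail(x[seq_len(i - 1)], k))
    }
  }
  x
}

# Toy data (hypothetical) in the same shape as dt.
toy <- data.frame(
  id    = c(1L, 1L, 1L, 1L, 1L, 2L, 2L),
  value = c(40, 40, 10, NA, NA, 27, 13)
)

toy$value <- ave(toy$value, toy$id, FUN = fill_with_lag_mean)
print(toy)
```

The loop only visits the NA positions, so the cost is driven by the number of missing values rather than by the length of each group.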

Fill Dates based on Consecutive occurrences

ID Date
1 1-1-2016
1 2-1-2016
1 3-1-2016
2 5-1-2016
3 6-1-2016
3 11-1-2016
3 12-1-2016
4 7-1-2016
5 9-1-2016
5 19-1-2016
5 20-1-2016
6 11-04-2016
6 12-04-2016
6 16-04-2016
6 04-08-2016
6 05-08-2016
6 06-08-2016
The expected data frame is based on consecutive dates pairwise:
1st_Date is the date of his first visit
2nd_Date is the date from which he visited on 2 consecutive days
3rd_Date is the date from which he visited on 3 consecutive days
For example:
For ID = 1, he visited for the first time on 1-1-2016, and his run of 2 consecutive visits also began on 1-1-2016, as did his run of 3.
Similarly, for ID = 2, he visited only once, so the rest remain blank.
For ID = 3, he visited for the first time on 6-1-2016 but visited on 2 consecutive days starting on 11-1-2016.
NOTE: this only has to be done up to the earliest 3rd date.
Expected Output
ID 1st_Date 2nd_Date 3rd_Date
1 1-1-2016 1-1-2016 1-1-2016
2 5-1-2016 NA NA
3 6-1-2016 11-1-2016 NA
4 7-1-2016 NA NA
5 9-1-2016 19-1-2016 NA
6 11-04-2016 11-04-2016 04-08-2016
Here is an attempt using dplyr and tidyr. The first thing to do is to convert your Date with as.Date and group_by the IDs. We next create a few new variables. The first one, new, checks which dates are consecutive. Date is then updated to give NA for those consecutive dates. However, if not all the dates are consecutive, we filter out the ones that were converted to NA. We then fill (replace NA with the latest non-NA date for each ID), remove unwanted columns and spread.
library(dplyr)
library(tidyr)
df %>%
mutate(Date = as.Date(Date, format = '%d-%m-%Y')) %>%
group_by(ID) %>%
mutate(new = cumsum(c(1, diff.difftime(Date, units = 'days'))),
Date = replace(Date, c(0, diff(new)) == 1, NA),
new1 = sum(is.na(Date)),
new2 = seq(n())) %>%
filter(!is.na(Date)|new1 != 1) %>%
fill(Date) %>%
select(-c(new, new1)) %>%
spread(new2, Date) %>%
select(ID:`3`)
# ID `1` `2` `3`
#* <int> <date> <date> <date>
#1 1 2016-01-01 2016-01-01 2016-01-01
#2 2 2016-01-05 <NA> <NA>
#3 3 2016-01-06 2016-01-11 <NA>
#4 4 2016-01-07 <NA> <NA>
#5 5 2016-01-09 2016-01-09 2016-01-09
With your updated data set, it gives
# ID `1` `2` `3`
#* <int> <date> <date> <date>
#1 1 2016-01-01 2016-01-01 2016-01-01
#2 2 2016-01-05 <NA> <NA>
#3 3 2016-01-06 2016-01-11 <NA>
#4 4 2016-01-07 <NA> <NA>
#5 5 2016-01-09 2016-01-19 <NA>
DATA USED
dput(df)
structure(list(ID = c(1L, 1L, 1L, 2L, 3L, 3L, 3L, 4L, 5L, 5L,
5L), Date = structure(c(1L, 5L, 7L, 8L, 9L, 2L, 3L, 10L, 11L,
4L, 6L), .Label = c("1-1-2016", "11-1-2016", "12-1-2016", "19-1-2016",
"2-1-2016", "20-1-2016", "3-1-2016", "5-1-2016", "6-1-2016",
"7-1-2016", "9-1-2016"), class = "factor")), .Names = c("ID",
"Date"), class = "data.frame", row.names = c(NA, -11L))
Use reshape. Code below assumes z is your data frame where date is a numeric date/time variable, ordered increasingly.
# a "set" variable represents a set of consecutive dates
z$set <- unsplit(tapply(z$date, z$ID, function(x) cumsum(diff(c(x[1], x)) > 1)), z$ID)
# "first.date" represents the first date in the set (of consecutive dates)
z$first.date <- unsplit(lapply(split(z$date, z[, c("ID", "set")]), min), z[, c("ID", "set")])
# "occurrence" is a consecutive occurrence #
z$occurrence <- unsplit(lapply(split(seq(nrow(z)), z$ID), seq_along), z$ID)
reshape(z[, c("ID", "first.date", "occurrence")], direction = "wide",
idvar = "ID", v.names = "first.date", timevar = "occurrence")
The result:
ID first.date.1 first.date.2 first.date.3
1 1 2016-01-01 2016-01-01 2016-01-01
4 2 2016-01-05 <NA> <NA>
5 3 2016-01-06 2016-01-11 2016-01-11
8 4 2016-01-07 <NA> <NA>
9 5 2016-01-09 2016-01-09 2016-01-09
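Both answers rely on labelling runs of consecutive dates. The core trick can be sketched on a single ID as follows; this is an illustration only, with made-up dates (those of ID 3) and hypothetical names `run_id`, `run_start` and `run_length`.

```r
# Visits for one ID (hypothetical example, already sorted by date).
dates <- as.Date(c("2016-01-06", "2016-01-11", "2016-01-12"))

# A new run starts at the first date and wherever the gap to the
# previous date is not exactly one day.
gap_days <- as.numeric(diff(dates))
run_id   <- cumsum(c(TRUE, gap_days != 1))

# First date and length of each run.
run_start  <- dates[!duplicated(run_id)]
run_length <- as.vector(table(run_id))

print(data.frame(run_start, run_length))
```

From this table, the 2nd_Date for an ID would be the earliest run_start whose run_length is at least 2, and the 3rd_Date the earliest with run_length at least 3 (NA if no such run exists).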
