If conditions and copying values from different rows - r

I have the following data:
Data <- data.frame(Project=c(123,123,123,123,123,123,124,124,124,124,124,125,125,125),
Name=c("Harry","David","David","Harry","Peter","Peter","John","Alex","Alex","Mary","Mary","Dan","Joe","Joe"),
Value=c(1,4,7,3,8,9,8,3,2,5,6,2,2,1),
OldValue=c("","Open","In Progress","Complete","Open","In Progress","Complete","Open","In Progress","System Declined","In Progress","","Open","In Progress"),
NewValue=c("Open","In Progress","Complete","Open","In Progress","Complete","Open","In Progress","System Declined","In Progress","Complete","Open","In Progress","Complete"))
The data should look like this
I want to create another column called EditedBy that applies the following logic.
IF the project in row 1 equals the project in row 2 AND the NewValue in row 1 equals "Open", THEN take the name from row 2. If either of these two conditions is false, keep the name from the first row.
So the data should look like this
How can I do this?

We can do this with data.table
library(data.table)
setDT(Data)[, EditedBy := Name[2L] ,.(Project, grp=cumsum(NewValue == "Open"|
shift(NewValue == "System Declined", fill=TRUE)))]
Data
# Project Name Value OldValue NewValue EditedBy
# 1: 123 Harry 1 Open David
# 2: 123 David 4 Open In Progress David
# 3: 123 David 7 In Progress Complete David
# 4: 123 Harry 3 Complete Open Peter
# 5: 123 Peter 8 Open In Progress Peter
# 6: 123 Peter 9 In Progress Complete Peter
# 7: 124 John 8 Complete Open Alex
# 8: 124 Alex 3 Open In Progress Alex
# 9: 124 Alex 2 In Progress System Declined Alex
#10: 124 Mary 5 System Declined In Progress Mary
#11: 124 Mary 6 In Progress Complete Mary
#12: 125 Dan 2 Open Joe
#13: 125 Joe 2 Open In Progress Joe
#14: 125 Joe 1 In Progress Complete Joe
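For comparison, the rule exactly as worded in the question (look at the next row only) can be sketched in base R without any grouping. This is a minimal sketch, not the answer's method: next_project, next_name and take_next are helper names made up here, and only the columns the rule needs are rebuilt. On the sample data it gives the same EditedBy column as the data.table output above.

```r
# Rebuild only the columns the rule needs (values copied from the question)
Data <- data.frame(
  Project  = c(123,123,123,123,123,123,124,124,124,124,124,125,125,125),
  Name     = c("Harry","David","David","Harry","Peter","Peter","John",
               "Alex","Alex","Mary","Mary","Dan","Joe","Joe"),
  NewValue = c("Open","In Progress","Complete","Open","In Progress","Complete",
               "Open","In Progress","System Declined","In Progress","Complete",
               "Open","In Progress","Complete"),
  stringsAsFactors = FALSE
)

# Shift Project and Name up by one row; the last row has no successor (NA)
next_project <- c(Data$Project[-1], NA)
next_name    <- c(Data$Name[-1], NA)

# TRUE where the next row is the same project and this row's NewValue is "Open"
take_next <- !is.na(next_project) &
  Data$Project == next_project &
  Data$NewValue == "Open"

# Take the next row's name where the rule applies, otherwise keep the own name
Data$EditedBy <- ifelse(take_next, next_name, Data$Name)
Data
```

Note that this only ever looks one row ahead, so it would differ from the grouped approach on data where an "Open" row is not directly followed by the editor's first row.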

Related

Getting Data in a single row into multiple rows

I have data showing which people work in which groups. When I ask the leader of each group, in a survey, to list those who work for them, I get a single row containing all of the team members. What I need is to reshape the data into multiple rows, one per member, each carrying the group information.
I don't know where to start.
This is what my data frame looks like,
LeaderName <- c('John','Jane','Louis','Carl')
Group <- c('3','1','4','2')
Member1 <- c('Lucy','Stephanie','Chris','Leslie')
Member1ID <- c('1','2','3','4')
Member2 <- c('Earl','Carlos','Devon','Francis')
Member2ID <- c('5','6','7','8')
Member3 <- c('Luther','Peter','','Severus')
Member3ID <- c('9','10','','11')
GroupInfo <- data.frame(LeaderName, Group, Member1, Member1ID, Member2 ,Member2ID, Member3, Member3ID)
This is the result I would like the code to produce:
LeaderName_ <- c('John','Jane','Louis','Carl','John','Jane','Louis','Carl','John','Jane','','Carl')
Group_ <- c('3','1','4','2','3','1','4','2','3','1','','2')
Member <- c('Lucy','Stephanie','Chris','Leslie','Earl','Carlos','Devon','Francis','Luther','Peter','','Severus')
MemberID <- c('1','2','3','4','5','6','7','8','9','10','','11')
ActualGroupInfor <- data.frame(LeaderName_,Group_,Member,MemberID)
An option would be melt from data.table, specifying the column name patterns in the measure.vars argument (abbreviated to measure in the call below)
library(data.table)
melt(setDT(GroupInfo), measure = patterns("^Member\\d+$",
"^Member\\d+ID$"), value.name = c("Member", "MemberID"))[, variable := NULL][]
# LeaderName Group Member MemberID
# 1: John 3 Lucy 1
# 2: Jane 1 Stephanie 2
# 3: Louis 4 Chris 3
# 4: Carl 2 Leslie 4
# 5: John 3 Earl 5
# 6: Jane 1 Carlos 6
# 7: Louis 4 Devon 7
# 8: Carl 2 Francis 8
# 9: John 3 Luther 9
#10: Jane 1 Peter 10
#11: Louis 4
#12: Carl 2 Severus 11
Here is a solution in base R:
reshape(
data=GroupInfo,
idvar=c("LeaderName", "Group"),
varying=list(
Member=which(names(GroupInfo) %in% grep("^Member[0-9]$",names(GroupInfo),value=TRUE)),
MemberID=which(names(GroupInfo) %in% grep("^Member[0-9]ID",names(GroupInfo),value=TRUE))),
direction="long",
v.names = c("Member","MemberID"),
sep="_")[,-3]
#> LeaderName Group Member MemberID
#> John.3.1 John 3 Lucy 1
#> Jane.1.1 Jane 1 Stephanie 2
#> Louis.4.1 Louis 4 Chris 3
#> Carl.2.1 Carl 2 Leslie 4
#> John.3.2 John 3 Earl 5
#> Jane.1.2 Jane 1 Carlos 6
#> Louis.4.2 Louis 4 Devon 7
#> Carl.2.2 Carl 2 Francis 8
#> John.3.3 John 3 Luther 9
#> Jane.1.3 Jane 1 Peter 10
#> Louis.4.3 Louis 4
#> Carl.2.3 Carl 2 Severus 11
Created on 2019-05-23 by the reprex package (v0.2.1)
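If the reshape() arguments feel opaque, the same wide-to-long step can also be written out by hand: build one small data frame per MemberN / MemberNID pair and stack them. This is only a sketch under the assumption that every MemberN column has a matching MemberNID column; member_cols and long are names made up here.

```r
# Sample data from the question
GroupInfo <- data.frame(
  LeaderName = c("John","Jane","Louis","Carl"),
  Group      = c("3","1","4","2"),
  Member1    = c("Lucy","Stephanie","Chris","Leslie"),
  Member1ID  = c("1","2","3","4"),
  Member2    = c("Earl","Carlos","Devon","Francis"),
  Member2ID  = c("5","6","7","8"),
  Member3    = c("Luther","Peter","","Severus"),
  Member3ID  = c("9","10","","11"),
  stringsAsFactors = FALSE
)

# Names of the member columns: Member1, Member2, Member3
member_cols <- grep("^Member[0-9]+$", names(GroupInfo), value = TRUE)

# One block of rows per member slot, then stack the blocks
long <- do.call(rbind, lapply(member_cols, function(m) {
  data.frame(
    LeaderName = GroupInfo$LeaderName,
    Group      = GroupInfo$Group,
    Member     = GroupInfo[[m]],
    MemberID   = GroupInfo[[paste0(m, "ID")]],  # assumes MemberNID exists
    stringsAsFactors = FALSE
  )
}))
long
```

Like the two answers above, this keeps the leader and group on the row with the empty third member. If those empty slots are not wanted, long[long$Member != "", ] drops them.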

How to FILL DOWN (autofill) a value, e.g. replace NA with the first value in the group, using data.table in R?

Very simple and common task:
I need to FILL DOWN in data.table (similar to autofill function in MS Excel) so that
library(data.table)
DT <- fread(
"Paul 32
NA 45
NA 56
John 1
NA 5
George 88
NA 112")
becomes
Paul 32
Paul 45
Paul 56
John 1
John 5
George 88
George 112
Thank you!
Yes, the best way to do this is to use @Rui Barradas's idea of the zoo package. You can do it in one line of code with the na.locf function.
library(zoo)
DT[, V1 := na.locf(V1, na.rm = FALSE)]  # na.rm = FALSE keeps any leading NA, so the length never changes
Replace the V1 with whatever you name your column after reading in the data with fread. Good luck!
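If you would rather not depend on zoo, the "last observation carried forward" idea is small enough to sketch in base R. This is an illustration, not what na.locf does internally: fill_down is a helper name made up here. It counts the non-NA values seen so far and uses that count as an index into the non-NA values.

```r
# Carry the last non-NA value forward; leading NAs stay NA
fill_down <- function(x) {
  idx <- cumsum(!is.na(x))        # 0 before the first non-NA value
  c(NA, x[!is.na(x)])[idx + 1]    # position 1 is the NA used for idx == 0
}

x <- c("Paul", NA, NA, "John", NA, "George", NA)
fill_down(x)
```

With a data.table this could then be used as DT[, V1 := fill_down(V1)], assuming the column is called V1 as in the answer above.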
For example 2, you can consider using stats::spline for extrapolation as follows:
DT2[is.na(V2), V2 :=
as.integer(DT2[, spline(.I[!is.na(V2)], V2[!is.na(V2)], xout=.I[is.na(V2)]), by=.(V1)]$y)]
output:
V1 V2
1: Paul 1
2: Paul 2
3: Paul 3
4: Paul 4
5: John 100
6: John 110
7: John 120
8: John 130
data:
DT2 <- fread(
"Paul, 1
Paul, 2
Paul, NA
Paul, NA
John, 100
John, 110
John, NA
John, NA")

data.table join + update with mult='first' gives unexpected result

In the below example, I have a table of users and a table of transactions where one user can have 0, 1, or more transactions. I execute a join+update with mult='first' on the users table to attempt to insert a column indicating the date of the first occurring transaction for each user.
library(data.table) # v1.10.4
# Download data
users <- fread("https://raw.githubusercontent.com/ben519/DataWrangling/master/Data/users.csv")
transactions <- fread("https://raw.githubusercontent.com/ben519/DataWrangling/master/Data/transactions.csv")
# Convert date columns to Date type
users[, `:=`(Registered = as.Date(Registered), Cancelled = as.Date(Cancelled))]
transactions[, TransactionDate := as.Date(TransactionDate)]
users
UserID User Gender Registered Cancelled FirstTransactionDate
1: 1 Charles male 2012-12-21 <NA> 2012-08-26
2: 2 Pedro male 2010-08-01 2010-08-08 2013-12-23
3: 3 Caroline female 2012-10-23 2016-06-07 2016-05-08
4: 4 Brielle female 2013-07-17 <NA> <NA>
5: 5 Benjamin male 2010-11-25 <NA> <NA>
transactions
TransactionID TransactionDate UserID ProductID Quantity
1: 1 2010-08-21 7 2 1
2: 2 2011-05-26 3 4 1
3: 3 2011-06-16 3 3 1
4: 4 2012-08-26 1 2 3
5: 5 2013-06-06 2 4 1
6: 6 2013-12-23 2 5 6
7: 7 2013-12-30 3 4 1
8: 8 2014-04-24 NA 2 3
9: 9 2015-04-24 7 4 3
10: 10 2016-05-08 3 4 4
##### For each user, insert the TransactionDate of the first matching row
users[transactions, FirstTransactionDate := i.TransactionDate, on="UserID", mult="first"]
# Unexpected result
users[UserID == 2]
UserID User Gender Registered Cancelled FirstTransactionDate
1: 2 Pedro male 2010-08-01 2010-08-08 2013-12-23 # <- shouldn't this be 2013-06-06?
Why does FirstTransactionDate 2013-12-23 get set for user 2 when an earlier transaction in the transactions table is tied to that user? Is this a bug?
Reading the documentation for data.table's mult more closely, it says that:
When i is a list (or data.frame or data.table) and multiple rows in x
match to the row in i, mult controls which are returned: "all"
(default), "first" or "last".
So if there are multiple rows in x ("users") that match to i ("transactions"), then mult will return the first row in x. However, in your case, there aren't multiple rows in x that match to i, rather there are multiple rows in i that match to x.
As @Arun suggested, the best option would be to change your join around so that mult = "first" is relevant:
users[, FirstTransactionDate := transactions[users, TransactionDate, on="UserID", mult = "first"]]
users
# UserID User Gender Registered Cancelled FirstTransactionDate
#1: 1 Charles male 2012-12-21 <NA> 2012-08-26
#2: 2 Pedro male 2010-08-01 2010-08-08 2013-06-06
#3: 3 Caroline female 2012-10-23 2016-06-07 2011-05-26
#4: 4 Brielle female 2013-07-17 <NA> <NA>
#5: 5 Benjamin male 2010-11-25 <NA> <NA>
Another option would be to change up your merge slightly:
users[transactions[,FirstTransactionDate := min(TransactionDate), by = UserID],
FirstTransactionDate := FirstTransactionDate, on="UserID"]
Here I simply compute the first transaction date within the transactions table (note that := adds the column to transactions by reference). The value is joined once per matching transaction, which should be fine because it is always the same value for a given UserID.
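To see the "first match per user" lookup without the join syntax, here is a base R sketch using match(), which returns the position of the first match only. The two data frames are retyped from the printouts in the question rather than downloaded, tx is a name made up here, and sorting by date first is an assumption needed so that "first" means "earliest".

```r
# Retyped from the printed tables in the question (subset of columns)
users <- data.frame(
  UserID = 1:5,
  User   = c("Charles","Pedro","Caroline","Brielle","Benjamin"),
  stringsAsFactors = FALSE
)
transactions <- data.frame(
  TransactionID   = 1:10,
  TransactionDate = as.Date(c("2010-08-21","2011-05-26","2011-06-16",
                              "2012-08-26","2013-06-06","2013-12-23",
                              "2013-12-30","2014-04-24","2015-04-24",
                              "2016-05-08")),
  UserID          = c(7, 3, 3, 1, 2, 2, 3, NA, 7, 3)
)

# Sort by date so the first matching row is the earliest transaction
tx <- transactions[order(transactions$TransactionDate), ]

# match() gives the first position of each user in tx, or NA if there is none
users$FirstTransactionDate <- tx$TransactionDate[match(users$UserID, tx$UserID)]
users
```

Users 4 and 5 have no transactions, so they get NA, which matches the output of the join shown above.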

Erasing duplicates with NA values

I have a data frame like this:
names <- c('Mike','Mike','Mike','John','John','John','David','David','David','David')
dates <- c('04-26','04-26','04-27','04-28','04-27','04-26','04-01','04-02','04-02','04-03')
values <- c(NA,1,2,4,5,6,1,2,NA,NA)
test <- data.frame(names,dates,values)
Which is:
names dates values
1 Mike 04-26 NA
2 Mike 04-26 1
3 Mike 04-27 2
4 John 04-28 4
5 John 04-27 5
6 John 04-26 6
7 David 04-01 1
8 David 04-02 2
9 David 04-02 NA
10 David 04-03 NA
I'd like to get rid of duplicates with NA values. So, in this case, I have a valid observation from Mike on 04-26 and also have a valid observation from David on 04-02, so rows 1 and 9 should be erased and I will end up with:
names dates values
1 Mike 04-26 1
2 Mike 04-27 2
3 John 04-28 4
4 John 04-27 5
5 John 04-26 6
6 David 04-01 1
7 David 04-02 2
8 David 04-03 NA
I tried to use duplicated function, something like this:
test[!duplicated(test[,c('names','dates')]),]
But that does not work since some NA values come before the valid value. Do you have any suggestions without trying things like merge or making another data frame?
Update: I'd like to keep rows with NA that are not duplicates.
What about this way?
library(dplyr)
test %>% group_by(names, dates) %>% filter((n()>=2 & !is.na(values)) | n()==1)
Source: local data frame [8 x 3]
Groups: names, dates [8]
names dates values
(fctr) (fctr) (dbl)
1 Mike 04-26 1
2 Mike 04-27 2
3 John 04-28 4
4 John 04-27 5
5 John 04-26 6
6 David 04-01 1
7 David 04-02 2
8 David 04-03 NA
Here is an attempt in data.table:
# set up
library(data.table)
setDT(test)
# construct condition
test[, dupes := max(duplicated(.SD)), .SDcols=c("names", "dates"), by=c("names", "dates")]
# print out result
test[dupes == 0 | !is.na(values),]
Here is a similar method using base R, except that the dupes variable is kept separately from the data.frame:
dupes <- duplicated(test[c("names", "dates")])
# this generates warnings, but works nonetheless
dupes <- ave(dupes, test$names, test$dates, FUN=max)
# print out result
test[dupes == 0 | !is.na(test$values),]
If there are duplicated rows where the values variable is NA, and these duplicates add nothing to the data, then you can drop them prior to running the code above:
testNoNADupes <- test[!(duplicated(test) & is.na(test$values)),]
This should work based on your sample.
test <- test[order(test$values),]
test <- test[!(duplicated(test$names) & duplicated(test$dates) & is.na(test$values)),]
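Another way to phrase the requirement is: drop an NA row only if its names/dates group also contains at least one non-NA value. A base R sketch of that reading, using ave() with any (has_valid and result are names made up here):

```r
# Sample data from the question
test <- data.frame(
  names  = c("Mike","Mike","Mike","John","John","John",
             "David","David","David","David"),
  dates  = c("04-26","04-26","04-27","04-28","04-27","04-26",
             "04-01","04-02","04-02","04-03"),
  values = c(NA, 1, 2, 4, 5, 6, 1, 2, NA, NA),
  stringsAsFactors = FALSE
)

# TRUE for every row whose (names, dates) group has at least one non-NA value
has_valid <- ave(!is.na(test$values), test$names, test$dates, FUN = any)

# Keep non-NA rows, plus NA rows from groups that have nothing better
result <- test[!is.na(test$values) | !has_valid, ]
result
```

This does not depend on row order. One difference to be aware of: if a group consists only of several NA rows, all of them are kept.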

In R: add rows based on a date and another condition

I have a data frame df:
df <- data.frame(names=c("john","mary","tom"),dates=c(as.Date("2010-06-01"),as.Date("2010-07-09"),as.Date("2010-06-01")),tours_missed=c(2,12,6))
names dates tours_missed
john 2010-06-01 2
mary 2010-07-09 12
tom 2010-06-01 6
I want to be able to add a row with the dates the person missed. There are 2 tours every day the person works. Each person works every 4 days.
The result should be (though the order doesn't matter):
names dates tours_missed
john 2010-06-01 2
mary 2010-07-09 12
mary 2010-07-13 12
mary 2010-07-17 12
mary 2010-07-21 12
mary 2010-07-25 12
mary 2010-07-29 12
tom 2010-06-01 6
tom 2010-06-05 6
tom 2010-06-09 6
I have already tried looking at these topics but was unable to produce the above result: Add rows to a data frame based on date in previous row, In R: Add rows with data of previous row to data frame, add new row to dataframe. Thanks for your help!
library(data.table)
dt = as.data.table(df) # or convert in-place using setDT
# all of the relevant dates
dates.all = dt[, seq(dates, length = tours_missed/2, by = "4 days"), by = names]
# set the key and merge filling in the blanks with previous observation
setkey(dt, names, dates)
dt[dates.all, roll = TRUE]
# names dates tours_missed
# 1: john 2010-06-01 2
# 2: mary 2010-07-09 12
# 3: mary 2010-07-13 12
# 4: mary 2010-07-17 12
# 5: mary 2010-07-21 12
# 6: mary 2010-07-25 12
# 7: mary 2010-07-29 12
# 8: tom 2010-06-01 6
# 9: tom 2010-06-05 6
#10: tom 2010-06-09 6
Or if merging is unnecessary (not quite clear from OP), just construct the answer:
dt[, list(dates = seq(dates, length = tours_missed/2, by = "4 days"), tours_missed)
, by = names]
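The same expansion can be sketched in base R by repeating each row once per missed working day and adding multiples of 4 days. This assumes tours_missed is always even (2 tours per working day); days_missed, idx and expanded are names made up here.

```r
df <- data.frame(
  names        = c("john", "mary", "tom"),
  dates        = as.Date(c("2010-06-01", "2010-07-09", "2010-06-01")),
  tours_missed = c(2, 12, 6),
  stringsAsFactors = FALSE
)

# Number of working days missed per person (2 tours per working day)
days_missed <- df$tours_missed / 2

# Repeat each row once per missed day
idx <- rep(seq_len(nrow(df)), days_missed)
expanded <- df[idx, ]

# Within each person: 0, 4, 8, ... days after the first missed date
expanded$dates <- expanded$dates + 4 * (sequence(days_missed) - 1)
rownames(expanded) <- NULL
expanded
```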
