How to FILL DOWN (autofill) a value, e.g. replace NA with the first value in a group, using data.table in R?

Very simple and common task:
I need to FILL DOWN in data.table (similar to the autofill function in MS Excel) so that
library(data.table)
DT <- fread(
"Paul 32
NA 45
NA 56
John 1
NA 5
George 88
NA 112")
becomes
Paul 32
Paul 45
Paul 56
John 1
John 5
George 88
George 112
Thank you!

Yes, the best way to do this is to use @Rui Barradas's idea of the zoo package. You can do it in one line of code with the na.locf function.
library(zoo)
# na.rm = FALSE keeps a leading NA, so the result always has one value per row
DT[, V1 := na.locf(V1, na.rm = FALSE)]
Replace V1 with whatever you name your column after reading in the data with fread. Good luck!
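If you would rather not add a dependency, the same last-observation-carried-forward idea can be sketched in base R. This is only a minimal sketch, not the zoo implementation: fill_down is a hypothetical helper name, and it assumes the column is a plain character or numeric vector (not a factor).

```r
# fill_down is a hypothetical helper, not part of zoo or data.table.
fill_down <- function(x) {
  # seen[i] counts the non-NA values up to and including position i
  seen <- cumsum(!is.na(x))
  # prepend NA so that positions before the first value (seen == 0) stay NA
  c(NA, x[!is.na(x)])[seen + 1L]
}

v1 <- c("Paul", NA, NA, "John", NA, "George", NA)
filled <- fill_down(v1)
print(filled)
```

Inside a data.table this could then be applied as DT[, V1 := fill_down(V1)].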

For example 2, you can consider using stats::spline for extrapolation as follows:
DT2[is.na(V2), V2 :=
as.integer(DT2[, spline(.I[!is.na(V2)], V2[!is.na(V2)], xout=.I[is.na(V2)]), by=.(V1)]$y)]
output:
V1 V2
1: Paul 1
2: Paul 2
3: Paul 3
4: Paul 4
5: John 100
6: John 110
7: John 120
8: John 130
data:
DT2 <- fread(
"Paul, 1
Paul, 2
Paul, NA
Paul, NA
John, 100
John, 110
John, NA
John, NA")
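To see what the spline call does inside a single group, here is a small base R sketch on Paul's rows only. With just two known points the fitted curve is a straight line, so the missing rows simply continue the trend. With more known points the extrapolation is no longer linear and can overshoot, so treat this as an illustration, not a general recommendation.

```r
v2 <- c(1, 2, NA, NA)      # Paul's values from DT2
rows <- seq_along(v2)      # row positions within the group
known <- !is.na(v2)

# fit on the known rows, evaluate at the missing rows
est <- spline(rows[known], v2[known], xout = rows[!known])
v2[!known] <- est$y
print(v2)
```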

Related

If conditions and copying values from different rows

I have the following data:
Data <- data.frame(Project=c(123,123,123,123,123,123,124,124,124,124,124,125,125,125),
Name=c("Harry","David","David","Harry","Peter","Peter","John","Alex","Alex","Mary","Mary","Dan","Joe","Joe"),
Value=c(1,4,7,3,8,9,8,3,2,5,6,2,2,1),
OldValue=c("","Open","In Progress","Complete","Open","In Progress","Complete","Open","In Progress","System Declined","In Progress","","Open","In Progress"),
NewValue=c("Open","In Progress","Complete","Open","In Progress","Complete","Open","In Progress","System Declined","In Progress","Complete","Open","In Progress","Complete"))
The data should look like this
I want to create another column called EditedBy that applies the following logic.
IF the project in row 1 equals the project in row 2 AND the New Value in row 1 equals "Open", THEN take the name from row 2. If either of these two conditions is false, stick with the name in the first row.
So the data should look like this
How can I do this?
We can do this with data.table
library(data.table)
setDT(Data)[, EditedBy := Name[2L] ,.(Project, grp=cumsum(NewValue == "Open"|
shift(NewValue == "System Declined", fill=TRUE)))]
Data
#     Project  Name Value        OldValue        NewValue EditedBy
#  1:     123 Harry     1                            Open    David
#  2:     123 David     4            Open     In Progress    David
#  3:     123 David     7     In Progress        Complete    David
#  4:     123 Harry     3        Complete            Open    Peter
#  5:     123 Peter     8            Open     In Progress    Peter
#  6:     123 Peter     9     In Progress        Complete    Peter
#  7:     124  John     8        Complete            Open     Alex
#  8:     124  Alex     3            Open     In Progress     Alex
#  9:     124  Alex     2     In Progress System Declined     Alex
# 10:     124  Mary     5 System Declined     In Progress     Mary
# 11:     124  Mary     6     In Progress        Complete     Mary
# 12:     125   Dan     2                            Open      Joe
# 13:     125   Joe     2            Open     In Progress      Joe
# 14:     125   Joe     1     In Progress        Complete      Joe
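For comparison, the rule as literally stated in the question (look at the next row only) can be sketched in base R without grouping. This is a hedged alternative, not the data.table answer above: next_project, next_name and take_next are illustrative names, and only the three columns the rule needs are rebuilt here.

```r
Data <- data.frame(
  Project  = c(123, 123, 123, 123, 123, 123, 124, 124, 124, 124, 124, 125, 125, 125),
  Name     = c("Harry", "David", "David", "Harry", "Peter", "Peter", "John",
               "Alex", "Alex", "Mary", "Mary", "Dan", "Joe", "Joe"),
  NewValue = c("Open", "In Progress", "Complete", "Open", "In Progress", "Complete",
               "Open", "In Progress", "System Declined", "In Progress", "Complete",
               "Open", "In Progress", "Complete"),
  stringsAsFactors = FALSE
)

# values from the following row; the last row has no successor, hence NA
next_project <- c(Data$Project[-1], NA)
next_name    <- c(Data$Name[-1], NA)

# take the next row's name only when it is the same project and this row is "Open"
take_next <- Data$NewValue == "Open" & !is.na(next_project) &
  Data$Project == next_project
Data$EditedBy <- ifelse(take_next, next_name, Data$Name)
print(Data)
```

On this sample the row-by-row rule and the grouped answer agree; they could differ on data where a status other than "Open" or "System Declined" starts a new editor.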

Erasing duplicates with NA values

I have a data frame like this:
names <- c('Mike','Mike','Mike','John','John','John','David','David','David','David')
dates <- c('04-26','04-26','04-27','04-28','04-27','04-26','04-01','04-02','04-02','04-03')
values <- c(NA,1,2,4,5,6,1,2,NA,NA)
test <- data.frame(names,dates,values)
Which is:
names dates values
1 Mike 04-26 NA
2 Mike 04-26 1
3 Mike 04-27 2
4 John 04-28 4
5 John 04-27 5
6 John 04-26 6
7 David 04-01 1
8 David 04-02 2
9 David 04-02 NA
10 David 04-03 NA
I'd like to get rid of duplicates with NA values. So, in this case, I have a valid observation from Mike on 04-26 and also have a valid observation from David on 04-02, so rows 1 and 9 should be erased and I will end up with:
names dates values
1 Mike 04-26 1
2 Mike 04-27 2
3 John 04-28 4
4 John 04-27 5
5 John 04-26 6
6 David 04-01 1
7 David 04-02 2
8 David 04-03 NA
I tried to use the duplicated function, something like this:
test[!duplicated(test[,c('names','dates')]),]
But that does not work since some NA values come before the valid value. Do you have any suggestions without trying things like merge or making another data frame?
Update: I'd like to keep rows with NA that are not duplicates.
What about this way?
library(dplyr)
test %>% group_by(names, dates) %>% filter((n()>=2 & !is.na(values)) | n()==1)
Source: local data frame [8 x 3]
Groups: names, dates [8]
names dates values
(fctr) (fctr) (dbl)
1 Mike 04-26 1
2 Mike 04-27 2
3 John 04-28 4
4 John 04-27 5
5 John 04-26 6
6 David 04-01 1
7 David 04-02 2
8 David 04-03 NA
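The same filter (keep a row if its names/dates group has a single row, or if its value is not NA) can also be sketched in base R with ave. This is a minimal sketch on the question's sample; group_size and result are illustrative names. Like the dplyr version, it would drop every row of a duplicated group whose values are all NA.

```r
test <- data.frame(
  names  = c("Mike", "Mike", "Mike", "John", "John", "John",
             "David", "David", "David", "David"),
  dates  = c("04-26", "04-26", "04-27", "04-28", "04-27", "04-26",
             "04-01", "04-02", "04-02", "04-03"),
  values = c(NA, 1, 2, 4, 5, 6, 1, 2, NA, NA),
  stringsAsFactors = FALSE
)

# number of rows sharing the same names/dates combination
group_size <- ave(seq_len(nrow(test)), test$names, test$dates, FUN = length)

# keep singletons as they are; within duplicated groups keep only non-NA values
result <- test[group_size == 1 | !is.na(test$values), ]
print(result)
```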
Here is an attempt in data.table:
# set up
library(data.table)
setDT(test)
# construct condition
test[, dupes := max(duplicated(.SD)), .SDcols=c("names", "dates"), by=c("names", "dates")]
# print out result
test[dupes == 0 | !is.na(values),]
Here is a similar method using base R, except that the dupes variable is kept separately from the data.frame:
dupes <- duplicated(test[c("names", "dates")])
# this generates warnings, but works nonetheless
dupes <- ave(dupes, test$names, test$dates, FUN=max)
# print out result
test[dupes == 0 | !is.na(test$values),]
If there are duplicated rows where the values variable is NA, and these duplicates add nothing to the data, then you can drop them prior to running the code above:
testNoNADupes <- test[!(duplicated(test) & is.na(test$values)),]
This should work based on your sample.
test <- test[order(test$values),]
test <- test[!(duplicated(test$names) & duplicated(test$dates) & is.na(test$values)),]

R - Merging two data files based on partial matching of inconsistent full name formats

I'm looking for a way to merge two data files based on partial matching of participants' full names that are sometimes entered in different formats and sometimes misspelled. I know there are some different function options for partial matches (eg agrep and pmatch) and for merging data files but I need help with a) combining the two; b) doing partial matching that can ignore middle names; c) in the merged data file store both original name formats and d) retain unique values even if they don't have a match.
For example, I have the following two data files:
File name: Employee Data
Full Name      Date Started  Orders
ANGELA MUIR    6/15/14       25
EILEEN COWIE   6/15/14       44
LAURA CUMMING  10/6/14       43
ELENA POPA     1/21/15       37
KAREN MACEWAN  3/15/99       39
File name: Assessment data
Candidate          Leading Factor  SI-D  SI-I
Angie muir         I               -3    12
Caroline Burn      S               -5    -3
Eileen Mary Cowie  S               -5    5
Elena Pope         C               -4    7
Henry LeFeuvre     C               -5    -1
Jennifer Ford      S               -3    -2
Karen McEwan       I               -4    10
Laura Cumming      S               0     6
Mandip Johal       C               -2    2
Mubarak Hussain    D               6     -1
I want to merge them based on names (Full Name in df1 and Candidate in df2), ignoring middle names (e.g. Eileen Cowie = Eileen Mary Cowie), extra spaces (e.g. Laura Cumming = Laura  Cumming), misspellings (e.g. Elena Popa = Elena Pope), etc.
The ideal output would look like this:
Name             Full Name      Candidate          Date Started  Orders  Leading Factor  SI-D  SI-I
ANGELA MUIR      ANGELA MUIR    Angie muir         6/15/14       25      I               -3    12
Caroline Burn    N/A            Caroline Burn      N/A           N/A     S               -5    -3
EILEEN COWIE     EILEEN COWIE   Eileen Mary Cowie  6/15/14       44      S               -5    5
ELENA POPA       ELENA POPA     Elena Pope         1/21/15       37      C               -4    7
Henry LeFeuvre   N/A            Henry LeFeuvre     N/A           N/A     C               -5    -1
Jennifer Ford    N/A            Jennifer Ford      N/A           N/A     S               -3    -2
KAREN MACEWAN    KAREN MACEWAN  Karen McEwan       3/15/99       39      I               -4    10
LAURA CUMMING    LAURA CUMMING  Laura Cumming      10/6/14       43      S               0     6
Mandip Johal     N/A            Mandip Johal       N/A           N/A     C               -2    2
Mubarak Hussain  N/A            Mubarak Hussain    N/A           N/A     D               6     -1
Any suggestions would be greatly appreciated!
Here's a process that may help. You will have to inspect the results and make adjustments as needed.
df1
# v1 v2
#1 ANGELA MUIR 6/15/14
#2 EILEEN COWIE 6/15/14
#3 AnGela Smith 5/3/14
df2
# u1 u2
#1 Eileen Mary Cowie I-3
#2 Angie muir -5 5
index <- sapply(df1$v1, function(x) {
agrep(x, df2$u1, ignore.case=TRUE, max.distance = .5)
}
)
index <- unlist(index)
df2$u1[index] <- names(index)
merge(df1, df2, by.x='v1', by.y='u1')
# v1 v2 u2
#1 ANGELA MUIR 6/15/14 -5 5
#2 EILEEN COWIE 6/15/14 I-3
I had to adjust the argument max.distance in the index function. It may not work for your data, but adjust and test if it works. If this doesn't help, there is a package called stringdist that may have a more robust matching function in amatch.
Data
v1 <- c('ANGELA MUIR', 'EILEEN COWIE', 'AnGela Smith')
v2 <- c('6/15/14', '6/15/14', '5/3/14')
u1 <- c('Eileen Mary Cowie', 'Angie muir')
u2 <- c('I-3', '-5 5')
df1 <- data.frame(v1, v2, stringsAsFactors = FALSE)
df2 <- data.frame(u1, u2, stringsAsFactors = FALSE)
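To cover points (b) to (d) of the question as well, one possible variation is to normalize the names first (upper case, collapse spaces, keep only first and last token), match on edit distance with adist from base R, and then do a full outer merge so unmatched candidates are kept. This is only a sketch under stated assumptions: normalize_name, max_dist and the small employees/assessment frames are hypothetical, and the threshold of 3 edits was picked for this sample and would need tuning on real data.

```r
employees <- data.frame(
  full_name = c("ANGELA MUIR", "EILEEN COWIE", "ELENA POPA"),
  orders    = c(25, 44, 37),
  stringsAsFactors = FALSE
)
assessment <- data.frame(
  candidate = c("Angie muir", "Caroline Burn", "Eileen Mary  Cowie", "Elena Pope"),
  leading   = c("I", "S", "S", "C"),
  stringsAsFactors = FALSE
)

# hypothetical helper: upper case, split on whitespace, keep first and last token
normalize_name <- function(x) {
  parts <- strsplit(toupper(trimws(x)), "\\s+")
  vapply(parts, function(p) paste(p[1], p[length(p)]), character(1))
}

# edit distance between every candidate (rows) and every employee (columns)
d <- adist(normalize_name(assessment$candidate), normalize_name(employees$full_name))
best <- apply(d, 1, which.min)
best_dist <- d[cbind(seq_len(nrow(d)), best)]

max_dist <- 3  # assumed threshold; anything further away counts as "no match"
assessment$full_name <- ifelse(best_dist <= max_dist, employees$full_name[best], NA)

# all = TRUE keeps rows from both sides, so both original name columns survive
merged <- merge(employees, assessment, by = "full_name", all = TRUE)
print(merged)
```

Both the employee spelling (full_name) and the assessment spelling (candidate) end up in the merged result, and Caroline Burn is retained with NA in the employee columns.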

How do I find last date in which a value increased in another column?

I have a data frame in R that looks something like this:
person date level
Alex 2007-06-01 3
Alex 2008-12-01 4
Alex 2009-12-01 3
Beth 2008-03-01 6
Beth 2010-10-01 6
Beth 2010-12-01 6
Mary 2009-11-04 9
Mary 2012-04-25 9
Mary 2013-09-10 10
I have sorted it first by "person" and second by "date".
I am trying to find out when the last increase in "level" occurred for each person. Ideally, the output would look something like:
person date
Alex 2008-12-01
Beth NA
Mary 2013-09-10
Using dplyr
library(dplyr)
dat %>% group_by(person) %>%
mutate(inc = c(F, diff(level) > 0)) %>%
summarize(date = last(date[inc], default = NA))
Yielding:
Source: local data frame [3 x 2]
person date
1 Alex 2008-12-01
2 Beth <NA>
3 Mary 2013-09-10
Try data.table version:
library(data.table)
setDT(dat)[order(person),diff:=c(NA,diff(level)),by=person][diff>0,tail(.SD,1),by=person][,-c(3,4),with=F]
person date
1: Alex 2008-12-01
2: Mary 2013-09-10
If na also needs to be included:
dd=setDT(dat)[order(person),diff:=c(NA,diff(level)),by=person][diff>0,tail(.SD,1),by=person][,-c(3,4),with=F]
# people with no increase at all get an NA date
dd2 = data.frame(unique(dat[!(person %in% dd$person)]$person), NA)
names(dd2) = c('person','date')
rbind(dd, dd2)
person date
1: Alex 2008-12-01
2: Mary 2013-09-10
3: Beth NA
A base-R version, using a data frame df whose columns are named Person, Date and Level (Person must be a factor):
sapply(levels(df$Person), function(p) {
s <- df[df$Person==p,]
i <- 1+nrow(s)-match(TRUE,rev(diff(s$Level)>0))
ifelse(is.na(i), NA, as.character(s$Date[i]))
})
produces the named vector
Alex Beth Mary
"2008-12-01" NA "2013-09-10"
Easy to wrap this to produce any output format you need:
last.level.up <- function(df) {
data.frame(Date=sapply(levels(df$Person), function(p) {
s <- df[df$Person==p,]
i <- 1+nrow(s)-match(TRUE,rev(diff(s$Level)>0))
ifelse(is.na(i), NA, as.character(s$Date[i]))
}))
}
last.level.up(df)
Date
Alex 2008-12-01
Beth <NA>
Mary 2013-09-10
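A variant of the same base-R idea that uses the question's lowercase column names and does not require person to be a factor could look like the sketch below. It assumes, as the question states, that rows are already sorted by date within each person; last_increase and rose are illustrative names.

```r
dat <- data.frame(
  person = rep(c("Alex", "Beth", "Mary"), each = 3),
  date   = c("2007-06-01", "2008-12-01", "2009-12-01",
             "2008-03-01", "2010-10-01", "2010-12-01",
             "2009-11-04", "2012-04-25", "2013-09-10"),
  level  = c(3, 4, 3, 6, 6, 6, 9, 9, 10),
  stringsAsFactors = FALSE
)

last_increase <- sapply(split(dat, dat$person), function(s) {
  # diff() compares each row with the previous one, so shift the hits by one
  rose <- which(diff(s$level) > 0) + 1L
  if (length(rose) == 0) NA_character_ else s$date[max(rose)]
})
print(last_increase)
```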

String tokenization inside R data frame [duplicate]

This question already has answers here:
Split comma-separated strings in a column into separate rows
I have some data that looks a little bit like this:
test.frame <- read.table(text = "name amounts
JEAN 318.5,45
GREGORY 1518.5,67,8
WALTER 518.5
LARRY 518.5,55,1
HARRY 318.5,32
",header = TRUE,sep = "")
I'd like it to look more like this ...
name amount
JEAN 318.5
JEAN 45
GREGORY 1518.5
GREGORY 67
GREGORY 8
WALTER 518.5
LARRY 518.5
LARRY 55
LARRY 1
HARRY 318.5
HARRY 32
It seems like there should be a straightforward way to break out the "amounts" column, but I'm not coming up with it. Happy to take a "RTFM page for this particular command" answer. What's the command I'm looking for?
(test.frame <- read.table(text = "name amounts
JEAN 318.5,45
GREGORY 1518.5,67,8
WALTER 518.5
LARRY 518.5,55,1
HARRY 318.5,32
",header = TRUE,sep = ""))
# name amounts
# 1 JEAN 318.5,45
# 2 GREGORY 1518.5,67,8
# 3 WALTER 518.5
# 4 LARRY 518.5,55,1
# 5 HARRY 318.5,32
tmp <- setNames(strsplit(as.character(test.frame$amounts),
split = ','), test.frame$name)
data.frame(name = rep(names(tmp), sapply(tmp, length)),
amounts = unlist(tmp), row.names = NULL)
# name amounts
# 1 JEAN 318.5
# 2 JEAN 45
# 3 GREGORY 1518.5
# 4 GREGORY 67
# 5 GREGORY 8
# 6 WALTER 518.5
# 7 LARRY 518.5
# 8 LARRY 55
# 9 LARRY 1
# 10 HARRY 318.5
# 11 HARRY 32
The fastest way (probably) will be data.table
library(data.table)
setDT(test.frame)[, lapply(.SD, function(x) unlist(strsplit(as.character(x), ','))),
.SDcols = "amounts", by = name]
## name amounts
## 1: JEAN 318.5
## 2: JEAN 45
## 3: GREGORY 1518.5
## 4: GREGORY 67
## 5: GREGORY 8
## 6: WALTER 518.5
## 7: LARRY 518.5
## 8: LARRY 55
## 9: LARRY 1
## 10: HARRY 318.5
## 11: HARRY 32
A generalization of David Arenburg's solution would be to use my cSplit function. Get it from the GitHub Gist (https://gist.github.com/mrdwab/11380733) or load it with "devtools":
# library(devtools)
# source_gist(11380733)
The "long" format would be what you are looking for...
cSplit(test.frame, "amounts", ",", "long")
# name amounts
# 1: JEAN 318.5
# 2: JEAN 45
# 3: GREGORY 1518.5
# 4: GREGORY 67
# 5: GREGORY 8
# 6: WALTER 518.5
# 7: LARRY 518.5
# 8: LARRY 55
# 9: LARRY 1
# 10: HARRY 318.5
# 11: HARRY 32
But the function can create wide output formats too:
cSplit(test.frame, "amounts", ",", "wide")
# name amounts_1 amounts_2 amounts_3
# 1: JEAN 318.5 45 NA
# 2: GREGORY 1518.5 67 8
# 3: WALTER 518.5 NA NA
# 4: LARRY 518.5 55 1
# 5: HARRY 318.5 32 NA
One advantage with this function is being able to split multiple columns at once.
This isn't a super standard format, but here is one way you can transform your data. First, I would use stringsAsFactors=F with your read.table to make sure everything is a character variable rather than a factor. Alternatively you can do as.character() on those columns.
First I split the values in the amounts column on the comma, then I combine the pieces with the name column:
md <- do.call(rbind, Map(cbind, test.frame$name,
strsplit(test.frame$amounts, ",")))
Then I paste everything back together and send it to read.table to do the variable conversion
read.table(text=apply(md,1,paste, collapse="\t"),
sep="\t", col.names=names(test.frame))
Alternatively you could just make a data.frame from the md matrix and do the class conversions yourself
data.frame(names=md[,1], amount=as.numeric(md[,2]))
Here is a plyr solution:
Split.Amounts <- function(x) {
amounts <- unlist(strsplit(as.character(x$amounts), ","))
return(data.frame(name = x$name, amounts = amounts, stringsAsFactors=FALSE))
}
library(plyr)
ddply(test.frame, .(name), Split.Amounts)
Using dplyr:
library(dplyr)
test.frame %>%
group_by(name) %>%
do(Split.Amounts(.))

Resources