String tokenization inside R data frame [duplicate] - r

This question already has answers here:
Split comma-separated strings in a column into separate rows
(6 answers)
Closed 6 years ago.
I have some data that looks a little bit like this:
test.frame <- read.table(text = "name amounts
JEAN 318.5,45
GREGORY 1518.5,67,8
WALTER 518.5
LARRY 518.5,55,1
HARRY 318.5,32
",header = TRUE,sep = "")
I'd like it to look more like this ...
name amount
JEAN 318.5
JEAN 45
GREGORY 1518.5
GREGORY 67
GREGORY 8
WALTER 518.5
LARRY 518.5
LARRY 55
LARRY 1
HARRY 318.5
HARRY 32
It seems like there should be a straightforward way to break out the "amounts" column, but I'm not coming up with it. Happy to take a "RTFM page for this particular command" answer. What's the command I'm looking for?

(test.frame <- read.table(text = "name amounts
JEAN 318.5,45
GREGORY 1518.5,67,8
WALTER 518.5
LARRY 518.5,55,1
HARRY 318.5,32
",header = TRUE,sep = ""))
# name amounts
# 1 JEAN 318.5,45
# 2 GREGORY 1518.5,67,8
# 3 WALTER 518.5
# 4 LARRY 518.5,55,1
# 5 HARRY 318.5,32
tmp <- setNames(strsplit(as.character(test.frame$amounts),
split = ','), test.frame$name)
data.frame(name = rep(names(tmp), sapply(tmp, length)),
amounts = unlist(tmp), row.names = NULL)
# name amounts
# 1 JEAN 318.5
# 2 JEAN 45
# 3 GREGORY 1518.5
# 4 GREGORY 67
# 5 GREGORY 8
# 6 WALTER 518.5
# 7 LARRY 518.5
# 8 LARRY 55
# 9 LARRY 1
# 10 HARRY 318.5
# 11 HARRY 32

The fastest way (probably) will be data.table:
library(data.table)
setDT(test.frame)[, lapply(.SD, function(x) unlist(strsplit(as.character(x), ','))),
.SDcols = "amounts", by = name]
## name amounts
## 1: JEAN 318.5
## 2: JEAN 45
## 3: GREGORY 1518.5
## 4: GREGORY 67
## 5: GREGORY 8
## 6: WALTER 518.5
## 7: LARRY 518.5
## 8: LARRY 55
## 9: LARRY 1
## 10: HARRY 318.5
## 11: HARRY 32

A generalization of David Arenburg's solution would be to use my cSplit function. Get it from the GitHub Gist (https://gist.github.com/mrdwab/11380733) or load it with "devtools":
# library(devtools)
# source_gist(11380733)
The "long" format would be what you are looking for...
cSplit(test.frame, "amounts", ",", "long")
# name amounts
# 1: JEAN 318.5
# 2: JEAN 45
# 3: GREGORY 1518.5
# 4: GREGORY 67
# 5: GREGORY 8
# 6: WALTER 518.5
# 7: LARRY 518.5
# 8: LARRY 55
# 9: LARRY 1
# 10: HARRY 318.5
# 11: HARRY 32
But the function can create wide output formats too:
cSplit(test.frame, "amounts", ",", "wide")
# name amounts_1 amounts_2 amounts_3
# 1: JEAN 318.5 45 NA
# 2: GREGORY 1518.5 67 8
# 3: WALTER 518.5 NA NA
# 4: LARRY 518.5 55 1
# 5: HARRY 318.5 32 NA
One advantage with this function is being able to split multiple columns at once.

The amounts column isn't in a standard format, but here is one way you can transform your data. First, I would pass stringsAsFactors=FALSE to read.table so that both columns are read as character vectors rather than factors. Alternatively, you can call as.character() on those columns.
First I split the values in amounts on the comma, then I bind each set of pieces to the matching value from the name column:
md <- do.call(rbind, Map(cbind, test.frame$name,
strsplit(test.frame$amounts, ",")))
Then I paste everything back together and send it to read.table to do the type conversion:
read.table(text=apply(md,1,paste, collapse="\t"),
sep="\t", col.names=names(test.frame))
Alternatively, you could just make a data.frame from the md matrix and do the class conversions yourself:
data.frame(names=md[,1], amount=as.numeric(md[,2]))
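The split-and-recombine steps described above can also be sketched without the round trip through read.table. This is only a minimal base-R sketch under the assumption that every piece of amounts is a valid number; the names pieces and long are illustrative and not from the original answer.

```r
# Same sample data as in the question; stringsAsFactors = FALSE keeps
# both columns as character vectors.
test.frame <- read.table(text = "name amounts
JEAN 318.5,45
GREGORY 1518.5,67,8
WALTER 518.5
LARRY 518.5,55,1
HARRY 318.5,32
", header = TRUE, stringsAsFactors = FALSE)

# One character vector of pieces per original row
pieces <- strsplit(test.frame$amounts, ",", fixed = TRUE)

# Repeat each name once per piece, then flatten the pieces and
# convert them to numeric (assumes every piece parses as a number)
long <- data.frame(
  name   = rep(test.frame$name, lengths(pieces)),
  amount = as.numeric(unlist(pieces)),
  stringsAsFactors = FALSE
)
long
```

Doing the numeric conversion explicitly makes it obvious where a malformed value would turn into NA, which is harder to spot when read.table guesses the types.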

Here is a plyr solution:
Split.Amounts <- function(x) {
amounts <- unlist(strsplit(as.character(x$amounts), ","))
return(data.frame(name = x$name, amounts = amounts, stringsAsFactors=FALSE))
}
library(plyr)
ddply(test.frame, .(name), Split.Amounts)
Using dplyr:
library(dplyr)
test.frame %>%
group_by(name) %>%
do(Split.Amounts(.))

Related

How to combine the use of %in% with OR operator?

I would like to look up and test whether values from one set ("set A") appear in either set B or set C. I was trying to use the %in% operator for this purpose, but couldn't figure out how to combine it with OR.
A reproducible example follows at the bottom, but just the gist of what I'm trying to get is something like:
set_a %in% (set_b | set_c)
where I want to know which values from set_a exist in either set_b or set_c, or in both.
Example
#Step 1 :: Creating the data
library(data.table) #needed for as.data.table()
set_a <- unlist(strsplit("Eden Kendall Cali Ari Madden Leo Stacy Emmett Marco Bridger Alissa Elijah Bryant Pierre Sydney Luis", split=" "))
set_b <- as.data.table(unlist(strsplit("Kathy Ryan Brice Rowan Nina Abram Miles Kristina Gabriel Madden Jasper Emmett Marco Bridger Alissa Elijah Bryant Pierre Sydney Luis", split=" ")))
set_c <- as.data.table(unlist(strsplit("Leo Stacy Emmett Marco Moriah Nola Jorden Dalia Kenna Laney Dillon Trystan Elijah Bryant Pierr", split=" ")))
NamesList <- list(set_b, set_c) #set_b and set_c will now become neighboring data.table dataframes in one list.
> NamesList
[[1]]
V1
1: Kathy
2: Ryan
3: Brice
4: Rowan
5: Nina
6: Abram
7: Miles
8: Kristina
9: Gabriel
10: Madden
11: Jasper
12: Emmett
13: Marco
14: Bridger
15: Alissa
16: Elijah
17: Bryant
18: Pierre
19: Sydney
20: Luis
[[2]]
V1
1: Leo
2: Stacy
3: Emmett
4: Marco
5: Moriah
6: Nola
7: Jorden
8: Dalia
9: Kenna
10: Laney
11: Dillon
12: Trystan
13: Elijah
14: Bryant
15: Pierr
#Step 2 :: Checking which values from set_a appear in either set_b or set_c
matches <- set_a %in% (set_b | set_c)
#doesn't work!
Any ideas? By the way, it is important to me to use a data.table format.
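You could test the conditions separately. Note that set_b and set_c are data.tables in the question, so compare against their V1 column:
set_a %in% set_b$V1 | set_a %in% set_c$V1
Or use union or unique
set_a %in% union(set_b$V1, set_c$V1)
set_a %in% unique(c(set_b$V1, set_c$V1))
We can also use Reduce over the list. %in% needs set_a as its left-hand argument, so wrap the call in a function:
Reduce(`|`, lapply(NamesList, function(s) set_a %in% s$V1))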
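To see why the OR has to go between two %in% calls rather than between the sets, here is a small self-contained sketch. The vectors a, b and c_set are hypothetical stand-ins, not the question's data; the point is only that | combines logical vectors element-wise, so it needs the results of %in%, not the name sets themselves.

```r
# Small illustrative vectors (hypothetical, not the question's data)
a     <- c("Leo", "Madden", "Eden", "Marco")
b     <- c("Madden", "Marco", "Ryan")
c_set <- c("Leo", "Marco", "Nola")

# Each %in% call returns a logical vector as long as `a`,
# so the two results can be combined element-wise with |
either <- a %in% b | a %in% c_set

# Testing against the union of both sets gives the same answer in one call
via_union <- a %in% union(b, c_set)

either      # TRUE TRUE FALSE TRUE
a[either]   # "Leo" "Madden" "Marco"
```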

Getting Data in a single row into multiple rows

I have survey data showing which people work in which groups. Each group leader was asked to list the people who work for them, so I get one row per leader containing all of the team members. What I need is to reshape the data into multiple rows, one per member, each carrying the group information.
I don't know where to start.
This is what my data frame looks like:
LeaderName <- c('John','Jane','Louis','Carl')
Group <- c('3','1','4','2')
Member1 <- c('Lucy','Stephanie','Chris','Leslie')
Member1ID <- c('1','2','3','4')
Member2 <- c('Earl','Carlos','Devon','Francis')
Member2ID <- c('5','6','7','8')
Member3 <- c('Luther','Peter','','Severus')
Member3ID <- c('9','10','','11')
GroupInfo <- data.frame(LeaderName, Group, Member1, Member1ID, Member2 ,Member2ID, Member3, Member3ID)
This is what I would like the result to look like:
LeaderName_ <- c('John','Jane','Louis','Carl','John','Jane','Louis','Carl','John','Jane','','Carl')
Group_ <- c('3','1','4','2','3','1','4','2','3','1','','2')
Member <- c('Lucy','Stephanie','Chris','Leslie','Earl','Carlos','Devon','Francis','Luther','Peter','','Severus')
MemberID <- c('1','2','3','4','5','6','7','8','9','10','','11')
ActualGroupInfor <- data.frame(LeaderName_,Group_,Member,MemberID)
An option would be melt from data.table and specify the column name patterns in the measure parameter
library(data.table)
melt(setDT(GroupInfo), measure = patterns("^Member\\d+$",
"^Member\\d+ID$"), value.name = c("Member", "MemberID"))[, variable := NULL][]
# LeaderName Group Member MemberID
# 1: John 3 Lucy 1
# 2: Jane 1 Stephanie 2
# 3: Louis 4 Chris 3
# 4: Carl 2 Leslie 4
# 5: John 3 Earl 5
# 6: Jane 1 Carlos 6
# 7: Louis 4 Devon 7
# 8: Carl 2 Francis 8
# 9: John 3 Luther 9
#10: Jane 1 Peter 10
#11: Louis 4
#12: Carl 2 Severus 11
Here is a solution in base R:
reshape(
data=GroupInfo,
idvar=c("LeaderName", "Group"),
varying=list(
Member=which(names(GroupInfo) %in% grep("^Member[0-9]$",names(GroupInfo),value=TRUE)),
MemberID=which(names(GroupInfo) %in% grep("^Member[0-9]ID",names(GroupInfo),value=TRUE))),
direction="long",
v.names = c("Member","MemberID"),
sep="_")[,-3]
#> LeaderName Group Member MemberID
#> John.3.1 John 3 Lucy 1
#> Jane.1.1 Jane 1 Stephanie 2
#> Louis.4.1 Louis 4 Chris 3
#> Carl.2.1 Carl 2 Leslie 4
#> John.3.2 John 3 Earl 5
#> Jane.1.2 Jane 1 Carlos 6
#> Louis.4.2 Louis 4 Devon 7
#> Carl.2.2 Carl 2 Francis 8
#> John.3.3 John 3 Luther 9
#> Jane.1.3 Jane 1 Peter 10
#> Louis.4.3 Louis 4
#> Carl.2.3 Carl 2 Severus 11
Created on 2019-05-23 by the reprex package (v0.2.1)

How to FILL DOWN (autofill) value, e.g. replace NA with first value in group, using data.table in R?

Very simple and common task:
I need to FILL DOWN in data.table (similar to autofill function in MS Excel) so that
library(data.table)
DT <- fread(
"Paul 32
NA 45
NA 56
John 1
NA 5
George 88
NA 112")
becomes
Paul 32
Paul 45
Paul 56
John 1
John 5
George 88
George 112
Thank you!
Yes, the best way to do this is to use @Rui Barradas's idea of the zoo package. You can do it in one line of code with the na.locf function.
library(zoo)
DT[, V1:=na.locf(V1)]
Replace V1 with whatever you name your column after reading in the data with fread. Good luck!
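If adding a dependency on zoo is not an option, the same fill-down idea can be sketched in base R. This is a swapped-in technique (a running count of non-NA values used as an index), not what na.locf does internally, and it assumes the first element is not NA. The names v, idx and filled are illustrative.

```r
# The name column from the question, with gaps marked as NA
v <- c("Paul", NA, NA, "John", NA, "George", NA)

# Running count of non-NA entries: 1 1 1 2 2 3 3.
# Each position now holds the index of the last value seen so far.
idx <- cumsum(!is.na(v))

# Look that index up in the vector of non-NA values.
# Assumes v[1] is not NA; a leading NA would give idx 0 and be dropped.
filled <- v[!is.na(v)][idx]
filled
```

With a data.table, the same expression could be used on the right-hand side of := in place of na.locf(V1).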
For example 2, you can consider using stats::spline for extrapolation as follows:
DT2[is.na(V2), V2 :=
as.integer(DT2[, spline(.I[!is.na(V2)], V2[!is.na(V2)], xout=.I[is.na(V2)]), by=.(V1)]$y)]
output:
V1 V2
1: Paul 1
2: Paul 2
3: Paul 3
4: Paul 4
5: John 100
6: John 110
7: John 120
8: John 130
data:
DT2 <- fread(
"Paul, 1
Paul, 2
Paul, NA
Paul, NA
John, 100
John, 110
John, NA
John, NA")

In R: add rows based on a date and another condition

I have a data frame df:
df <- data.frame(names=c("john","mary","tom"),dates=c(as.Date("2010-06-01"),as.Date("2010-07-09"),as.Date("2010-06-01")),tours_missed=c(2,12,6))
names dates tours_missed
john 2010-06-01 2
mary 2010-07-09 12
tom 2010-06-01 6
I want to be able to add a row with the dates the person missed. There are 2 tours every day the person works. Each person works every 4 days.
The result should be (though the order doesn't matter):
names dates tours_missed
john 2010-06-01 2
mary 2010-07-09 12
mary 2010-07-13 12
mary 2010-07-17 12
mary 2010-07-21 12
mary 2010-07-25 12
mary 2010-07-29 12
tom 2010-06-01 6
tom 2010-06-05 6
tom 2010-06-09 6
I have already tried looking at these topics but was unable to produce the above result: Add rows to a data frame based on date in previous row, In R: Add rows with data of previous row to data frame, add new row to dataframe. Thanks for your help!
library(data.table)
dt = as.data.table(df) # or convert in-place using setDT
# all of the relevant dates
dates.all = dt[, seq(dates, length = tours_missed/2, by = "4 days"), by = names]
# set the key and merge filling in the blanks with previous observation
setkey(dt, names, dates)
dt[dates.all, roll = T]
# names dates tours_missed
# 1: john 2010-06-01 2
# 2: mary 2010-07-09 12
# 3: mary 2010-07-13 12
# 4: mary 2010-07-17 12
# 5: mary 2010-07-21 12
# 6: mary 2010-07-25 12
# 7: mary 2010-07-29 12
# 8: tom 2010-06-01 6
# 9: tom 2010-06-05 6
#10: tom 2010-06-09 6
Or if merging is unnecessary (not quite clear from OP), just construct the answer:
dt[, list(dates = seq(dates, length = tours_missed/2, by = "4 days"), tours_missed)
, by = names]
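The arithmetic behind both versions is that two tours per working day means tours_missed / 2 missed days per person, spaced 4 calendar days apart. A base-R sketch of the same construction follows, assuming tours_missed is always even; the names n_days, row_id, offset and expanded are illustrative only.

```r
df <- data.frame(
  names = c("john", "mary", "tom"),
  dates = as.Date(c("2010-06-01", "2010-07-09", "2010-06-01")),
  tours_missed = c(2, 12, 6),
  stringsAsFactors = FALSE
)

# Two tours per working day, so each person missed tours_missed / 2 days
# (assumes tours_missed is even)
n_days <- df$tours_missed / 2

# Row index repeated once per missed day: 1, then 2 six times, then 3 three times
row_id <- rep(seq_len(nrow(df)), n_days)

# Position within each person's run, starting at 0
offset <- sequence(n_days) - 1

# Working days are 4 calendar days apart, so shift each start date by 4 * offset
expanded <- data.frame(
  names = df$names[row_id],
  dates = df$dates[row_id] + 4 * offset,
  tours_missed = df$tours_missed[row_id]
)
expanded
```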

How do I find last date in which a value increased in another column?

I have a data frame in R that looks something like this:
person date level
Alex 2007-06-01 3
Alex 2008-12-01 4
Alex 2009-12-01 3
Beth 2008-03-01 6
Beth 2010-10-01 6
Beth 2010-12-01 6
Mary 2009-11-04 9
Mary 2012-04-25 9
Mary 2013-09-10 10
I have sorted it first by "person" and second by "date".
I am trying to find out when the last increase in "level" occurred for each person. Ideally, the output would look something like:
person date
Alex 2008-12-01
Beth NA
Mary 2013-09-10
Using dplyr
library(dplyr)
dat %>% group_by(person) %>%
mutate(inc = c(F, diff(level) > 0)) %>%
summarize(date = last(date[inc], default = NA))
Yielding:
Source: local data frame [3 x 2]
person date
1 Alex 2008-12-01
2 Beth <NA>
3 Mary 2013-09-10
Try data.table version:
library(data.table)
setDT(dat)[order(person),diff:=c(NA,diff(level)),by=person][diff>0,tail(.SD,1),by=person][,-c(3,4),with=F]
person date
1: Alex 2008-12-01
2: Mary 2013-09-10
If NA also needs to be included:
dd=setDT(dat)[order(person),diff:=c(NA,diff(level)),by=person][diff>0,tail(.SD,1),by=person][,-c(3,4),with=F]
dd2 =data.frame(unique(dat[!(person %in% dd$person),,]$person),NA)
names(dd2) = c('person','date')
rbind(dd, dd2)
person date
1: Alex 2008-12-01
2: Mary 2013-09-10
3: Beth NA
A base-R version, using data frame df:
sapply(levels(df$Person), function(p) {
s <- df[df$Person==p,]
i <- 1+nrow(s)-match(TRUE,rev(diff(s$Level)>0))
ifelse(is.na(i), NA, as.character(s$Date[i]))
})
produces the named vector
Alex Beth Mary
"2008-12-01" NA "2013-09-10"
Easy to wrap this to produce any output format you need:
last.level.up <- function(df) {
data.frame(Date=sapply(levels(df$Person), function(p) {
s <- df[df$Person==p,]
i <- 1+nrow(s)-match(TRUE,rev(diff(s$Level)>0))
ifelse(is.na(i), NA, as.character(s$Date[i]))
}))
}
last.level.up(df)
Date
Alex 2008-12-01
Beth <NA>
Mary 2013-09-10
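The base-R answer above uses capitalised column names and relies on Person being a factor. A sketch against the lowercase names shown in the question might look like this; it assumes the rows are already sorted by person and date, as the question states, and the helper name last_increase is hypothetical.

```r
dat <- data.frame(
  person = rep(c("Alex", "Beth", "Mary"), each = 3),
  date = as.Date(c("2007-06-01", "2008-12-01", "2009-12-01",
                   "2008-03-01", "2010-10-01", "2010-12-01",
                   "2009-11-04", "2012-04-25", "2013-09-10")),
  level = c(3, 4, 3, 6, 6, 6, 9, 9, 10),
  stringsAsFactors = FALSE
)

# Hypothetical helper: date of the last row whose level exceeds the
# previous row's level, or NA if the level never increased
last_increase <- function(d) {
  up <- which(diff(d$level) > 0) + 1
  if (length(up) == 0) NA_character_ else as.character(d$date[max(up)])
}

# One call per person; split() orders the groups alphabetically
res <- sapply(split(dat, dat$person), last_increase)
res
```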
