Convert AsIs to numeric separated by comma in data frame - r

I have such data frame:
structure(list(P1 = c("Mark", "Katrin", "Kate", "Hank", "Tom",
"Marcus"), P2 = c("Tim", "Greg", "Seba", "Teqa", "Justine", "Monica"
), clique = structure(list(`930` = integer(0), `2090` = integer(0),
`3120` = c(2L, 3L, 231L), `3663` = integer(0), `3704` = integer(0),
`4156` = c(19L, 27L)), .Names = c("930", "2090", "3120",
"3663", "3704", "4156"), class = "AsIs")), .Names = c("P1", "P2",
"clique"), row.names = c(930L, 2090L, 3120L, 3663L, 3704L, 4156L
), class = "data.frame")
I have a problem with the last column, called clique. I would like to convert this column to numeric values separated by commas in one column or, better still, turn integer(0) into NA and put the numbers in separate columns, with one number per column.
I will accept both solutions.
example data:
P1 P2 clique
Mark Tim integer(0)
Katrin Greg integer(0)
Kate Seba c(2, 3, 231)
Hank Teqa integer(0)
Tom Justine integer(0)
Marcus Monica c(19, 27)
> class(data$clique)
[1] "AsIs"
Desired output:
P1 P2 clique
Mark Tim NA
Katrin Greg NA
Kate Seba 2,3,231
Hank Teqa NA
Tom Justine NA
Marcus Monica 19,27
or
P1 P2 clique New_column1 New_column2
Mark Tim
Katrin Greg
Kate Seba 2 3 231
Hank Teqa
Tom Justine
Marcus Monica 19 27

You can try listCol_w from my "splitstackshape" package:
library(splitstackshape)
listCol_w(mydf, "clique")[, lapply(.SD, as.numeric), by = .(P1, P2)]
## P1 P2 clique_fl_1 clique_fl_2 clique_fl_3
## 1: Mark Tim NA NA NA
## 2: Katrin Greg NA NA NA
## 3: Kate Seba 2 3 231
## 4: Hank Teqa NA NA NA
## 5: Tom Justine NA NA NA
## 6: Marcus Monica 19 27 NA
I recommend this because you mentioned you wanted the numeric values. You won't be able to store a value like "2,3,231" as a numeric value.
If you still want to try the approach of collapsing the values and then splitting them, you can try:
mydf$clique <- vapply(mydf$clique, function(x) paste(x, collapse = ","), character(1L))
str shows that each row of clique now holds a single character string instead of a list element of integers. You can then use cSplit on that column to get the wide form.
> str(mydf)
'data.frame': 6 obs. of 3 variables:
$ P1 : chr "Mark" "Katrin" "Kate" "Hank" ...
$ P2 : chr "Tim" "Greg" "Seba" "Teqa" ...
$ clique: chr "" "" "2,3,231" "" ...
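For readers without splitstackshape, both requested layouts can also be sketched in base R. This is only an illustration on a shortened copy of the data: the column prefix clique_ and the object names collapsed, wide and out are assumptions, not part of the answer above.

```r
# Shortened copy of the question's data with an AsIs list column
mydf <- data.frame(P1 = c("Mark", "Kate", "Marcus"),
                   P2 = c("Tim", "Seba", "Monica"),
                   stringsAsFactors = FALSE)
mydf$clique <- I(list(integer(0), c(2L, 3L, 231L), c(19L, 27L)))

# Option 1: one comma-separated string per row, NA for empty elements
collapsed <- vapply(mydf$clique, function(x) {
  if (length(x)) paste(x, collapse = ",") else NA_character_
}, character(1L))

# Option 2: one numeric column per position, padded with NA
n <- max(lengths(mydf$clique))
wide <- do.call(rbind, lapply(mydf$clique, function(x) as.numeric(x)[seq_len(n)]))
colnames(wide) <- paste0("clique_", seq_len(n))
out <- cbind(mydf[c("P1", "P2")], wide)
```

Indexing a shorter vector with seq_len(n) pads it with NA, which is what turns integer(0) into a row of NA values.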

Related

Filter data.frame by date range in R

I have a DF like this:
Date <- c("10/17/17","11/11/17","11/23/17","11/25/17","12/3/17","12/10/17","12/16/17")
Ben <- c("1294",NA,"8959","2345",NA,"0303",NA)
James <- c(NA,"4523","3246",NA,"2394","8877","1427")
Alex <- c("3754","1122","5582",NA,"0094",NA,NA)
df1 <- data.frame(Date,Ben,James,Alex)
#df1
Date Ben James Alex
10/17/17 1294 NA 3754
11/11/17 NA 4523 1122
11/23/17 8959 3246 5582
11/25/17 2345 NA NA
12/3/17 NA 2394 0094
12/10/17 0303 8877 NA
12/16/17 NA 1427 NA
As you can see, the DF is sorted by date. I'm trying to put values that are within 2 weeks of the latest date for each column into a new DF, like this:
#df2
Ben James Alex
0303 1427 0094
NA 8877 5582
NA 2394 NA
Ben only has one listed value because there's only one non NA value within 2 weeks of 12/10/17, the latest date that has a non NA value in Ben's column. James's latest non NA date is 12/16/17. He has three values that fall within two weeks of that date: 1427, 8877 and 2394. Alex's latest date is 12/3/17. He has two values within two weeks of his latest date: 0094 and 5582. The number of rows that the new data.frame has should be equal to the column that is longest. Columns with fewer entries within their respective two week ranges should use NA to fill in data, like Ben's column.
I'm currently using the following code, which simply filters the last 3 non NA in each column:
df2 <- lapply(df1[-1], function(x) tail(x[!is.na(x)], n = 3))
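Using base R to subset (Date must first be converted from character to Date class, otherwise the subtraction fails):
df1$Date <- as.Date(df1$Date, format = "%m/%d/%y")
result <- lapply(df1[-1], function(x) x[which((m <- tail(df1$Date[!is.na(x)], 1) - df1$Date) >= 0 & m <= 14)])
len <- max(lengths(result))
do.call(cbind.data.frame, lapply(result, `length<-`, len))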
Ben James Alex
1 <NA> 2394 5582
2 0303 8877 <NA>
3 <NA> 1427 0094
I just realized those are coded as characters according to the data you gave
To have it exactly as given in the expected results, we would have:
do.call(cbind.data.frame,lapply(result,function(x) `length<-`(rev(x),len)))
Ben James Alex
1 0303 1427 0094
2 <NA> 8877 <NA>
3 <NA> 2394 5582
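The outputs above still carry the NA values that fall inside each window. A minimal sketch that drops them before padding, which gives the layout asked for in the question, could look like this (the names dates, recent and df2 are illustrative, and stringsAsFactors = FALSE is assumed so the values stay character):

```r
Date  <- c("10/17/17", "11/11/17", "11/23/17", "11/25/17", "12/3/17", "12/10/17", "12/16/17")
Ben   <- c("1294", NA, "8959", "2345", NA, "0303", NA)
James <- c(NA, "4523", "3246", NA, "2394", "8877", "1427")
Alex  <- c("3754", "1122", "5582", NA, "0094", NA, NA)
df1 <- data.frame(Date, Ben, James, Alex, stringsAsFactors = FALSE)

dates <- as.Date(df1$Date, format = "%m/%d/%y")

recent <- lapply(df1[-1], function(x) {
  keep   <- !is.na(x)
  latest <- max(dates[keep])                       # latest date with a value in this column
  in_win <- keep & dates >= latest - 14 & dates <= latest
  rev(x[in_win])                                   # newest value first
})

# Pad every column with NA up to the length of the longest one
len <- max(lengths(recent))
df2 <- as.data.frame(lapply(recent, `length<-`, len), stringsAsFactors = FALSE)
```

Filtering on keep inside the window test is the only change in logic compared with the answer above; the padding step is the same `length<-` trick.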
If I have understood correctly what you are looking for, the following code will help you:
I have loaded your dataset (with dput function)
dataset <- structure(list(Date = structure(c(17456, 17481, 17493, 17495,
17499, 17510, 17516), class = "Date"), Ben = c(1294L, NA, 8959L,
2345L, NA, 303L, NA), James = c(NA, 4523L, 3246L, NA, NA, 8877L,
1427L), Alex = c(3754L, 1122L, 5582L, NA, 94L, NA, NA)), .Names = c("Date",
"Ben", "James", "Alex"), row.names = c(NA, -7L), class = "data.frame")
Then load the following packages:
library(lubridate)
library(tidyverse)
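Set last_date and, if Date is still stored as character, convert it to a Date variable (the dput above already stores it as Date, in which case calling mdy on it would return NA):
last_date <- mdy("12/16/17")
if (!inherits(dataset$Date, "Date")) dataset$Date <- mdy(dataset$Date)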
Now, let's select only rows you want:
dataset_filtered <- dataset %>%
filter(Date<=last_date & Date>=(last_date-days(14)))
You'll have:
Date Ben James Alex
1 2017-12-10 303 8877 NA
2 2017-12-16 NA 1427 NA
Please use the dput function next time; it is not always Christmas ;-)

What is the "data table" way of doing this join/merge?

I have a "dictionary" table like this:
dict <- data.table(
Nickname = c("Abby", "Ben", "Chris", "Dan", "Ed"),
Name = c("Abigail", "Benjamin", "Christopher", "Daniel", "Edward")
)
dict
# Nickname Name
# 1: Abby Abigail
# 2: Ben Benjamin
# 3: Chris Christopher
# 4: Dan Daniel
# 5: Ed Edward
And a "data" table like this:
dat <- data.table(
Friend1 = c("Abby", "Ben", "Ben", "Chris"),
Friend2 = c("Ben", "Ed", NA, "Ed"),
Friend3 = c("Ed", NA, NA, "Dan"),
Friend4 = c("Dan", NA, NA, NA)
)
dat
# Friend1 Friend2 Friend3 Friend4
# 1: Abby Ben Ed Dan
# 2: Ben Ed NA NA
# 3: Ben NA NA NA
# 4: Chris Ed Dan NA
I would like to produce a data.table that looks like this
result <- data.table(
Friend1.Nickname = c("Abby", "Ben", "Ben", "Chris"),
Friend1.Name = c("Abigail", "Benjamin", "Benjamin", "Christopher"),
Friend2.Nickname = c("Ben", "Ed", NA, "Ed"),
Friend2.Name = c("Benjamin", "Edward", NA, "Edward"),
Friend3.Nickname = c("Ed", NA, NA, "Dan"),
Friend3.Name = c("Edward", NA, NA, "Daniel"),
Friend4.Nickname = c("Dan", NA, NA, NA),
Friend4.Name = c("Daniel", NA, NA, NA)
)
result
# sorry, word wrapping makes this too annoying to copy
And this is the solution I had in mind:
friend_vars <- paste0("Friend", 1:4)
friend_nicks <- paste0(friend_vars, ".Nickname")
friend_names <- paste0(friend_vars, ".Name")
setnames(dat, friend_vars, friend_nicks)
for (i in 1:4) {
dat[, (friend_names[i]) := dict$Name[match(dat[[friend_nicks[i]]], dict$Nickname)]]
}
Is there a more "data-table-esque" way to do this? I'm sure it's nice and efficient, but it's ugly to read, and apart from data.table's in-place assignment I don't feel like I'm taking good advantage of what the package has to offer.
I'm also not a very strong SQL user, and I'm not too comfortable with join terminology. I have a feeling that Data.table - left outer join on multiple tables could be useful here but I'm not sure how to apply it to my situation.
Using data.table 1.9.5:
for (nm in names(dat)) {
on = setattr("Nickname", 'names', nm)
dat[dict, paste0(nm, ".Name") := i.Name, on=on]
}
We can join using on= instead of setting keys. Now you can use setcolorder() to reorder the names.
I avoid reshaping data unless absolutely necessary. This is where update while join comes in really handy. And now with the on= argument, I couldn't resist posting an answer :-).
I didn't come up w/ a solution that matches exactly your result, but you might be able to work w/ something like this:
dat[, id := .I]
dat.m <- melt(dat, id.vars='id', variable.name='Friend', value.name='Nickname')
setkey(dict, Nickname)
dat.m[, Name := dict[Nickname, Name]]
> dat.m
id Friend Nickname Name
1: 1 Friend1 Abby Abigail
2: 2 Friend1 Ben Benjamin
3: 3 Friend1 Ben Benjamin
4: 4 Friend1 Chris Christopher
5: 1 Friend2 Ben Benjamin
6: 2 Friend2 Ed Edward
7: 3 Friend2 NA NA
8: 4 Friend2 Ed Edward
9: 1 Friend3 Ed Edward
10: 2 Friend3 NA NA
11: 3 Friend3 NA NA
12: 4 Friend3 Dan Daniel
13: 1 Friend4 Dan Daniel
14: 2 Friend4 NA NA
15: 3 Friend4 NA NA
16: 4 Friend4 NA NA
The variable id was just a placeholder so I could melt the DT.
setkey(dict,Nickname)
dat[,paste(names(dat),"Name",sep="."):=lapply(.SD,function(x)dict[J(x)]$Name)]
setcolorder(dat,c(1,5,2,6,3,7,4,8))
dat
# Friend1 Friend1.Name Friend2 Friend2.Name Friend3 Friend3.Name Friend4 Friend4.Name
# 1: Abby Abigail Ben Benjamin Ed Edward Dan Daniel
# 2: Ben Benjamin Ed Edward NA NA NA NA
# 3: Ben Benjamin NA NA NA NA NA NA
# 4: Chris Christopher Ed Edward Dan Daniel NA NA
In base R (super ugly, and the new columns get unhelpful names):
cbind(dat, lapply(dat, function(x){dict$Name[match(x, dict$Nickname)]}))
Friend1 Friend2 Friend3 Friend4 V2 NA NA NA
1: Abby Ben Ed Dan Abigail Benjamin Edward Daniel
2: Ben Ed NA NA Benjamin Edward NA NA
3: Ben NA NA NA Benjamin NA NA NA
4: Chris Ed Dan NA Christopher Edward Daniel NA
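The base approach can be tidied so that the new columns get proper names and sit next to their source columns. The following is a rough sketch on plain data frames rather than data.tables; lookup, full and res are illustrative names.

```r
dict <- data.frame(
  Nickname = c("Abby", "Ben", "Chris", "Dan", "Ed"),
  Name = c("Abigail", "Benjamin", "Christopher", "Daniel", "Edward"),
  stringsAsFactors = FALSE
)
dat <- data.frame(
  Friend1 = c("Abby", "Ben", "Ben", "Chris"),
  Friend2 = c("Ben", "Ed", NA, "Ed"),
  Friend3 = c("Ed", NA, NA, "Dan"),
  Friend4 = c("Dan", NA, NA, NA),
  stringsAsFactors = FALSE
)

# Named vector used as a lookup table: nickname -> full name
lookup <- setNames(dict$Name, dict$Nickname)

# Translate every column; an NA nickname yields an NA name
full <- lapply(dat, function(x) unname(lookup[x]))
names(full) <- paste0(names(dat), ".Name")

res <- cbind(dat, as.data.frame(full, stringsAsFactors = FALSE))

# Interleave the columns: Friend1, Friend1.Name, Friend2, Friend2.Name, ...
res <- res[as.vector(rbind(names(dat), names(full)))]
```

Indexing a named vector by name is equivalent to the match() call used in the question, just a little shorter to read.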

R: Split Variable Column into multiple (unbalanced) columns by comma

I have a dataset of 25 variables and over 2 million observations. One of my variables is a combination of a few different "categories" that I want to split to where it shows 1 category per column (similar to what split would do in stata). For example:
# Name Age Number Events First
# Karen 24 8 Triathlon/IM,Marathon,10k,5k 0
# Kurt 39 2 Half-Marathon,10k 0
# Leah 18 0 1
And I want it to look like:
# Name Age Number Events_1 Events_2 Events_3 Events_4 First
# Karen 24 8 Triathlon/IM Marathon 10k 5k 0
# Kurt 39 2 Half-Marathon 10k NA NA 0
# Leah 18 0 NA NA NA NA 1
I have looked through stackoverflow but have not found anything that works (everything gives me an error of some sort). Any suggestions would be greatly appreciated.
Note: May not be important but the largest number of categories 1 person has is 19 therefore I would need to create Event_1:Event_19
Comment: Previous Stack Overflow answers have suggested the separate function, but it does not seem to work with my dataset. When I run it, the program finishes but nothing is changed: there is no output and no error. When I tried other suggestions from other threads I received error messages. However, I finally got it to work by using the cSplit function. Thanks for the help!!!
From Ananda's splitstackshape package:
cSplit(df, "Events", sep=",")
# Name Age Number First Events_1 Events_2 Events_3 Events_4
#1: Karen 24 8 0 Triathlon/IM Marathon 10k 5k
#2: Kurt 39 2 0 Half-Marathon 10k NA NA
#3: Leah 18 0 1 NA NA NA NA
Or with tidyr:
separate(df, 'Events', paste("Events", 1:4, sep="_"), sep=",", extra="drop")
# Name Age Number Events_1 Events_2 Events_3 Events_4 First
#1 Karen 24 8 Triathlon/IM Marathon 10k 5k 0
#2 Kurt 39 2 Half-Marathon 10k <NA> <NA> 0
#3 Leah 18 0 NA <NA> <NA> <NA> 1
With the data.table package:
setDT(df)[,paste0("Events_", 1:4) := tstrsplit(Events, ",")][,-"Events", with=F]
# Name Age Number First Events_1 Events_2 Events_3 Events_4
#1: Karen 24 8 0 Triathlon/IM Marathon 10k 5k
#2: Kurt 39 2 0 Half-Marathon 10k NA NA
#3: Leah 18 0 1 NA NA NA NA
Data
df <- structure(list(Name = structure(1:3, .Label = c("Karen", "Kurt",
"Leah "), class = "factor"), Age = c(24L, 39L, 18L), Number = c(8L,
2L, 0L), Events = structure(c(3L, 2L, 1L), .Label = c(" NA",
" Half-Marathon,10k", " Triathlon/IM,Marathon,10k,5k"
), class = "factor"), First = c(0L, 0L, 1L)), .Names = c("Name",
"Age", "Number", "Events", "First"), class = "data.frame", row.names = c(NA,
-3L))
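The separate and tstrsplit calls above hard-code four output columns, while the question mentions up to 19 categories. A base R sketch that sizes the columns from the data is shown below. It uses a simplified character version of the example data with a true NA for Leah, which is an assumption; the posted dput stores Events as a factor with padded labels.

```r
df <- data.frame(
  Name = c("Karen", "Kurt", "Leah"),
  Events = c("Triathlon/IM,Marathon,10k,5k", "Half-Marathon,10k", NA),
  First = c(0L, 0L, 1L),
  stringsAsFactors = FALSE
)

# Split each string on commas; an NA input stays a single NA
parts <- strsplit(df$Events, ",", fixed = TRUE)

# Pad every element to the longest length, then bind into a matrix
n <- max(lengths(parts))
wide <- do.call(rbind, lapply(parts, `length<-`, n))
colnames(wide) <- paste0("Events_", seq_len(n))

# Replace the original column with the new ones
out <- cbind(df[setdiff(names(df), "Events")],
             as.data.frame(wide, stringsAsFactors = FALSE))
```

With two million rows the strsplit step dominates the run time, so cSplit or tstrsplit will likely remain the faster choice; the sketch is mainly meant to show the padding logic.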

Extracting event types from last 21 day window

My dataframe looks like this. The two rightmost columns are my desired columns.
Name ActivityType ActivityDate Email(last 21 days) Webinar(last 21 days)
John Email 1/1/2014 NA NA
John Webinar 1/5/2014 NA NA
John Sale 1/20/2014 Yes Yes
John Webinar 3/25/2014 NA NA
John Sale 4/1/2014 No Yes
John Sale 7/1/2014 No No
Tom Email 1/1/2015 NA NA
Tom Webinar 1/5/2015 NA NA
Tom Sale 1/20/2015 Yes Yes
Tom Webinar 3/25/2015 NA NA
Tom Sale 4/1/2015 No Yes
Tom Sale 7/1/2015 No No
I am just trying to create a yes/no variable that denotes whether there was an email or a webinar in the last 21 days for each "Sale" transaction. I was thinking(mock code) along the lines of using dplyr this way:
custlife %>%
group_by(Name) %>%
mutate(Email(last21days)=lag(ifelse(ActivityType = "Email" & ActivityDate of email within (activity date of sale - 21),Yes,No)).
I am not sure of the way to implement this. Kindly help. Your help is sincerely appreciated!
Here's a possible data.table solution. Here I'm creating 2 temporary data sets- one for Sale and one for the rest of activity types and then joining between them by a rolling window of 21 while using by = .EACHI in order to check conditions in each join. Then, I'm joining the result to the original data set.
Convert the date column to Date class and key the data by Name and Date (for the final/rolling join)
library(data.table)
setkey(setDT(df)[, ActivityDate := as.IDate(ActivityDate, "%m/%d/%Y")], Name, ActivityDate)
Create 2 temporary data sets per each activity
Saletemp <- df[ActivityType == "Sale", .(Name, ActivityDate)]
Elsetemp <- df[ActivityType != "Sale", .(Name, ActivityDate, ActivityType)]
Join by a rolling window of 21 to the sales temporary data set while checking conditions
Saletemp[Elsetemp, `:=`(Email21 = as.logical(which(i.ActivityType == "Email")),
Webinar21 = as.logical(which(i.ActivityType == "Webinar"))),
roll = -21, by = .EACHI]
Join everything back
df[Saletemp, `:=`(Email21 = i.Email21, Webinar21 = i.Webinar21)]
df
# Name ActivityType ActivityDate Email21 Webinar21
# 1: John Email 2014-01-01 NA NA
# 2: John Webinar 2014-01-05 NA NA
# 3: John Sale 2014-01-20 TRUE TRUE
# 4: John Webinar 2014-03-25 NA NA
# 5: John Sale 2014-04-01 NA TRUE
# 6: John Sale 2014-07-01 NA NA
# 7: Tom Email 2015-01-01 NA NA
# 8: Tom Webinar 2015-01-05 NA NA
# 9: Tom Sale 2015-01-20 TRUE TRUE
# 10: Tom Webinar 2015-03-25 NA NA
# 11: Tom Sale 2015-04-01 NA TRUE
# 12: Tom Sale 2015-07-01 NA NA
Here is another option with base R:
df is first split according to Name and then, among each subset, for each Sale, it looks if there is an Email (Webinar) within 21 days from the Sale. Finally, the list is unsplit according to Name.
You just have to replace FALSE by no and TRUE by yes afterwards.
df_split <- split(df, df$Name)
df_split <- lapply(df_split, function(tab){
i_s <- which(tab[,2]=="Sale")
tab$Email21[i_s] <- sapply(tab[i_s, 3], function(d_s){d <- tab[tab$ActivityType=="Email", 3]; any(d >= d_s-21 & d <= d_s)})
tab$Webinar21[i_s] <- sapply(tab[i_s, 3], function(d_s){d <- tab[tab$ActivityType=="Webinar", 3]; any(d >= d_s-21 & d <= d_s)})
tab
})
df_res <- unsplit(df_split, df$Name)
df_res
# Name ActivityType ActivityDate Email21 Webinar21
#1 John Email 2014-01-01 NA NA
#2 John Webinar 2014-01-05 NA NA
#3 John Sale 2014-01-20 TRUE TRUE
#4 John Webinar 2014-03-25 NA NA
#5 John Sale 2014-04-01 FALSE TRUE
#6 John Sale 2014-07-01 FALSE FALSE
#7 Tom Email 2015-01-01 NA NA
#8 Tom Webinar 2015-01-05 NA NA
#9 Tom Sale 2015-01-20 TRUE TRUE
#10 Tom Webinar 2015-03-25 NA NA
#11 Tom Sale 2015-04-01 FALSE TRUE
#12 Tom Sale 2015-07-01 FALSE FALSE
data
df <- structure(list(Name = c("John", "John", "John", "John", "John",
"John", "Tom", "Tom", "Tom", "Tom", "Tom", "Tom"), ActivityType = c("Email",
"Webinar", "Sale", "Webinar", "Sale", "Sale", "Email", "Webinar",
"Sale", "Webinar", "Sale", "Sale"), ActivityDate = structure(c(16071,
16075, 16090, 16154, 16161, 16252, 16436, 16440, 16455, 16519,
16526, 16617), class = "Date")), .Names = c("Name", "ActivityType",
"ActivityDate"), row.names = c(NA, -12L), index = structure(integer(0), ActivityType = c(1L,
7L, 3L, 5L, 6L, 9L, 11L, 12L, 2L, 4L, 8L, 10L)), class = "data.frame")
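The window test can also be wrapped in a small helper that returns the "Yes"/"No" labels from the question directly. This is a sketch only: had_recent is a hypothetical helper name, the data is cut down to John's rows, and the window is taken as the 21 days up to and including the sale date.

```r
df <- data.frame(
  Name = rep("John", 6),
  ActivityType = c("Email", "Webinar", "Sale", "Webinar", "Sale", "Sale"),
  ActivityDate = as.Date(c("2014-01-01", "2014-01-05", "2014-01-20",
                           "2014-03-25", "2014-04-01", "2014-07-01")),
  stringsAsFactors = FALSE
)

# Hypothetical helper: for each Sale row, was there an event of `type`
# in the `window` days before (or on) the sale date? Non-sale rows get NA.
had_recent <- function(tab, type, window = 21) {
  ev <- tab$ActivityDate[tab$ActivityType == type]
  is_sale <- tab$ActivityType == "Sale"
  out <- rep(NA_character_, nrow(tab))
  out[is_sale] <- vapply(which(is_sale), function(i) {
    d <- tab$ActivityDate[i]
    if (any(ev >= d - window & ev <= d)) "Yes" else "No"
  }, character(1))
  out
}

# Apply per customer, then stack the pieces back together
res <- do.call(rbind, lapply(split(df, df$Name), function(tab) {
  tab$Email21   <- had_recent(tab, "Email")
  tab$Webinar21 <- had_recent(tab, "Webinar")
  tab
}))
```

Because the result vector is pre-filled with NA for every row, the helper does not depend on the last row of a group being a Sale.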

Creating a long table from a wide table using merged.stack (or reshape)

I have a data frame that looks like this:
ID rd_test_2011 rd_score_2011 mt_test_2011 mt_score_2011 rd_test_2012 rd_score_2012 mt_test_2012 mt_score_2012
1 A 80 XX 100 NA NA BB 45
2 XX 90 NA NA AA 80 XX 80
I want to write a script that would, for IDs that don't have NA's in the yy_test_20xx columns, create a new data frame with the subject taken from the column title, the test name, the test score and year taken from the column title. So, in this example ID 1 would have three entries. Expected output would look like this:
ID Subject Test Score Year
1 rd A 80 2011
1 mt XX 100 2011
1 mt BB 45 2012
2 rd XX 90 2011
2 rd AA 80 2012
2 mt XX 80 2012
I've tried both reshape and various forms of merged.stack which works in the sense that I get an output that is on the road to being right but I can't understand the inputs well enough to get there all the way:
library(splitstackshape)
merged.stack(x, id.vars='id', var.stubs=c("rd_test","mt_test"), sep="_")
I've had more success (gotten closer) with reshape:
y<- reshape(x, idvar="id", ids=1:nrow(x), times=grep("test", names(x), value=TRUE),
timevar="year", varying=list(grep("test", names(x), value=TRUE), grep("score",
names(x), value=TRUE)), direction="long", v.names=c("test", "score"),
new.row.names=NULL)
This will get your data into the right format:
df.long = reshape(df, idvar="ID", ids=1:nrow(df), times=grep("Test", names(df), value=TRUE),
timevar="Year", varying=list(grep("Test", names(df), value=TRUE),
grep("Score", names(df), value=TRUE)), direction="long", v.names=c("Test", "Score"),
new.row.names=NULL)
Then omitting NA:
df.long = df.long[!is.na(df.long$Test),]
Then splitting Year to remove Test_:
df.long$Year = sapply(strsplit(df.long$Year, "_"), `[`, 2)
And ordering by ID:
df.long[order(df.long$ID),]
ID Year Test Score
1 1 2011 A 80
5 1 2012 XX 100
2 2 2011 XX 90
9 2 2013 AA 80
6 3 2012 A 10
3 4 2011 A 50
7 4 2012 XX 60
10 4 2013 AA 99
4 5 2011 C 50
8 5 2012 A 75
Using reshape:
dat.long <- reshape(dat, direction="long", varying=list(c(2, 4,6), c(3, 5,7)),
times=2011:2013,timevar='Year',
sep="_", v.names=c("Test", "Score"))
dat.long[complete.cases(dat.long),]
ID Year Test Score id
1.2011 1 2011 A 80 1
2.2011 2 2011 XX 90 2
4.2011 4 2011 A 50 4
5.2011 5 2011 C 50 5
1.2012 1 2012 XX 100 1
3.2012 3 2012 A 10 3
4.2012 4 2012 XX 60 4
5.2012 5 2012 A 75 5
2.2013 2 2013 AA 80 2
4.2013 4 2013 AA 99 4
Considering your update, I've entirely rewritten this answer. View the history if you want to see the old version.
The main problem is that your data is "double wide" in a way. Thus, you can actually solve your problem by reshaping in the "long" direction twice. Alternatively, use melt and *cast to melt your data into a very long format and convert it to a semi-wide format.
However, I would still suggest "splitstackshape" (and not just because I wrote it). It can handle this problem fine, but it needs you to rearrange the names of your data. The part of the name that will result in the names of the new columns should come first. In your example, that means "test" and "score" should be the first part of the variable name.
For this, we can use some gsub to rearrange the existing names.
library(splitstackshape)
setnames(mydf, gsub("(rd|mt)_(score|test)_(.*)", "\\2_\\1_\\3", names(mydf)))
names(mydf)
# [1] "ID" "test_rd_2011" "score_rd_2011" "test_mt_2011"
# [5] "score_mt_2011" "test_rd_2012" "score_rd_2012" "test_mt_2012"
# [9] "score_mt_2012"
out <- merged.stack(mydf, "ID", var.stubs=c("test", "score"), sep="_")
setnames(out, c(".time_1", ".time_2"), c("Subject", "Year"))
out[complete.cases(out), ]
# ID Subject Year test score
# 1: 1 mt 2011 XX 100
# 2: 1 mt 2012 BB 45
# 3: 1 rd 2011 A 80
# 4: 2 mt 2012 XX 80
# 5: 2 rd 2011 XX 90
# 6: 2 rd 2012 AA 80
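For comparison, the same long table can be assembled in base R by walking over the test columns and pairing each with its score column. The sketch below rebuilds the two-row example inline so it runs on its own; test_cols, pieces and long are illustrative names, and it assumes every column follows the subject_test_year / subject_score_year pattern.

```r
mydf <- data.frame(
  ID = 1:2,
  rd_test_2011 = c("A", "XX"),  rd_score_2011 = c(80L, 90L),
  mt_test_2011 = c("XX", NA),   mt_score_2011 = c(100L, NA),
  rd_test_2012 = c(NA, "AA"),   rd_score_2012 = c(NA, 80L),
  mt_test_2012 = c("BB", "XX"), mt_score_2012 = c(45L, 80L),
  stringsAsFactors = FALSE
)

test_cols <- grep("_test_", names(mydf), value = TRUE)

# One small data frame per subject/year pair
pieces <- lapply(test_cols, function(tc) {
  parts <- strsplit(tc, "_", fixed = TRUE)[[1]]      # subject, "test", year
  sc <- sub("_test_", "_score_", tc, fixed = TRUE)   # matching score column
  data.frame(ID = mydf$ID, Subject = parts[1], Test = mydf[[tc]],
             Score = mydf[[sc]], Year = as.integer(parts[3]),
             stringsAsFactors = FALSE)
})

long <- do.call(rbind, pieces)
long <- long[!is.na(long$Test), ]          # drop tests that were not taken
long <- long[order(long$ID, long$Year), ]
rownames(long) <- NULL
```

This avoids renaming the columns, at the cost of a loop over the test columns; for wide data with many subject/year pairs, merged.stack is likely the more convenient route.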
For the benefit of others, "mydf" in this answer is defined as:
mydf <- structure(list(ID = 1:2, rd_test_2011 = c("A", "XX"),
rd_score_2011 = c(80L, 90L), mt_test_2011 = c("XX", NA),
mt_score_2011 = c(100L, NA), rd_test_2012 = c(NA, "AA"),
rd_score_2012 = c(NA, 80L), mt_test_2012 = c("BB", "XX"),
mt_score_2012 = c(45L, 80L)),
.Names = c("ID", "rd_test_2011", "rd_score_2011", "mt_test_2011",
"mt_score_2011", "rd_test_2012", "rd_score_2012", "mt_test_2012",
"mt_score_2012"), class = "data.frame", row.names = c(NA, -2L))
