In R, customize names of columns created by dcast.data.table

I am new to reshape2 and data.table and trying to learn the syntax.
I have a data.table that I want to cast from multiple rows per grouping variable(s) to one row per grouping variable(s). For simplicity, let's make it a table of customers, some of whom share addresses.
library(data.table)
# Input table:
cust <- data.table(name=c("Betty","Joe","Frank","Wendy","Sally"),
address=c(rep("123 Sunny Rd",2),
rep("456 Cloudy Ln",2),
"789 Windy Dr"))
I want the output to have the following format:
# Desired output looks like this:
(out <- data.table(address=c("123 Sunny Rd","456 Cloudy Ln","789 Windy Dr"),
cust_1=c("Betty","Frank","Sally"),
cust_2=c("Joe","Wendy",NA)) )
# address cust_1 cust_2
# 1: 123 Sunny Rd Betty Joe
# 2: 456 Cloudy Ln Frank Wendy
# 3: 789 Windy Dr Sally NA
I would like columns for cust_1...cust_n where n is the max customers per address. I don't really care about the order--whether Joe is cust_1 and Betty is cust_2 or vice versa.

Just pushed a commit to data.table v1.9.5. dcast now allows casting on multiple value.var columns and multiple fun.aggregate functions, and understands undefined variables/expressions in the formula.
With this, we can do:
dcast(cust, address ~ paste0("cust", cust[, seq_len(.N),
by=address]$V1), value.var="name")
# address cust1 cust2
# 1: 123 Sunny Rd Betty Joe
# 2: 456 Cloudy Ln Frank Wendy
# 3: 789 Windy Dr Sally NA

# My attempt:
setkey(cust,address)
x <- cust[,list(name, addr_cust_num=rank(name,ties.method="random")), by=address]
x[,addr_cust_num:=paste0("cust_",addr_cust_num)]
y <- dcast.data.table(x, address ~ addr_cust_num, value.var="name")
y
Note that I had to paste0 the "cust_" prefix. Before I added that step, I was using setnames(y, names(y), sub("(\\d+)","cust_\\1",names(y)) ) which seemed a clunkier (but probably faster) solution.
Wondering if there is a better way to do the prefixing.
Alternatively, you could just add the column directly to cust by reference:
# no need to set key
cust[, cust := paste("cust", seq_len(.N), sep="_"), by=address]
dcast.data.table(cust, address ~ cust, value.var="name")
# address cust_1 cust_2
# 1: 123 Sunny Rd Betty Joe
# 2: 456 Cloudy Ln Frank Wendy
# 3: 789 Windy Dr Sally NA
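With a data.table version that provides rowid(), the per-address counter can also be built inside the formula, so no helper column is needed. A minimal sketch, assuming the cust table from the question; the name wide is only an illustrative choice:

```r
library(data.table)

cust <- data.table(
  name    = c("Betty", "Joe", "Frank", "Wendy", "Sally"),
  address = c(rep("123 Sunny Rd", 2), rep("456 Cloudy Ln", 2), "789 Windy Dr")
)

# rowid(address) numbers the rows within each address (1, 2, ...);
# paste0() turns that counter into the desired column names.
wide <- dcast(cust, address ~ paste0("cust_", rowid(address)), value.var = "name")
wide
```

Because the new column names are text, an address with ten or more customers would likely get cust_10 sorted before cust_2; zero-padding the counter (for example with sprintf("cust_%02d", ...)) is one way around that.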

Related

Merge dataframe with a key value that is contained within a string in a separate dataframe

employee <- c('John','Peter', 'Gynn', 'Jolie', 'Hope', 'Sue', 'Jane', 'Sarah')
salary <- c('VT020', 'VT126', 'VT027', 'VT667', 'VC120', 'VT000', 'VA120', 'VA020')
emp <- data.frame(employee, salary)
benefit <- c('Health', 'Time', 'Bonus')
benefit_id <- c('VT020 VT126 VT667 VA020', 'VT667', 'VT126 VT667 VT000')
ben <- data.frame(benefit, benefit_id)
Above we have two dataframes: one contains names and a unique ID, the other contains a category and a list of unique IDs.
What is the most efficient way to merge the ben dataframe with the emp dataframe such that we get the appropriate benefit assigned to each employee?
tidyverse
library(dplyr)
library(tidyr) # for unnest()
ben %>%
mutate(benefit_id = strsplit(benefit_id, "\\s+")) %>%
unnest(benefit_id) %>%
left_join(emp, ., by = c(salary = "benefit_id"))
# employee salary benefit
# 1 John VT020 Health
# 2 Peter VT126 Health
# 3 Peter VT126 Bonus
# 4 Gynn VT027 <NA>
# 5 Jolie VT667 Health
# 6 Jolie VT667 Time
# 7 Jolie VT667 Bonus
# 8 Hope VC120 <NA>
# 9 Sue VT000 Bonus
# 10 Jane VA120 <NA>
# 11 Sarah VA020 Health
Depending on your needs, you may also prefer a different join. For instance, use a full_join if you want all pairings, where NA in employee indicates a benefit sans employee.
FYI: if you are running R before 4.0, then you might have factors in your data. To fix that, just convert the factor columns with as.character first. (This can be determined with sapply(ben, inherits, "factor").)
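The factor check and conversion mentioned above can be sketched in base R as follows. The stringsAsFactors = TRUE argument is only there to reproduce the pre-4.0 default, and is_fac is an illustrative name:

```r
# Simulate the pre-4.0 default by forcing stringsAsFactors = TRUE.
ben <- data.frame(
  benefit    = c("Health", "Time", "Bonus"),
  benefit_id = c("VT020 VT126 VT667 VA020", "VT667", "VT126 VT667 VT000"),
  stringsAsFactors = TRUE
)

# Which columns are factors?
is_fac <- sapply(ben, inherits, "factor")

# Convert only those columns to character, leaving the rest untouched.
ben[is_fac] <- lapply(ben[is_fac], as.character)

str(ben)
```

After this step strsplit() receives character input, which it requires.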
data.table
library(data.table)
setDT(emp)
ben_long <- setDT(ben)[, list(benefit_id = unlist(strsplit(x = benefit_id, split = " "))), by = benefit]
merge(x = emp, y = ben_long, by.x = "salary", by.y = "benefit_id", all.x = TRUE)
#     salary employee benefit
#  1:  VA020    Sarah  Health
#  2:  VA120     Jane    <NA>
#  3:  VC120     Hope    <NA>
#  4:  VT000      Sue   Bonus
#  5:  VT020     John  Health
#  6:  VT027     Gynn    <NA>
#  7:  VT126    Peter  Health
#  8:  VT126    Peter   Bonus
#  9:  VT667    Jolie  Health
# 10:  VT667    Jolie    Time
# 11:  VT667    Jolie   Bonus

How to FILL DOWN (autofill) values, e.g. replace NA with the first value in the group, using data.table in R?

Very simple and common task:
I need to FILL DOWN in data.table (similar to autofill function in MS Excel) so that
library(data.table)
DT <- fread(
"Paul 32
NA 45
NA 56
John 1
NA 5
George 88
NA 112")
becomes
Paul 32
Paul 45
Paul 56
John 1
John 5
George 88
George 112
Thank you!
Yes, the best way to do this is to use @Rui Barradas's idea of the zoo package. You can do it in one line of code with the na.locf function.
library(zoo)
DT[, V1:=na.locf(V1)]
Replace the V1 with whatever you name your column after reading in the data with fread. Good luck!
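If adding zoo is not an option, the same fill-down can be sketched with data.table alone. The idea is that cumsum(!is.na(V1)) starts a new group at every non-missing name, so the first element of each group is the value to carry forward. This assumes the default column names V1 and V2; the table is built by hand here so the example does not depend on how fread parses the NA strings.

```r
library(data.table)

DT <- data.table(
  V1 = c("Paul", NA, NA, "John", NA, "George", NA),
  V2 = c(32, 45, 56, 1, 5, 88, 112)
)

# Each non-NA value opens a new group; copy the group's first value down.
DT[, V1 := V1[1L], by = cumsum(!is.na(V1))]
DT
```

Any NA rows that come before the first name form their own group and stay NA, since there is nothing above them to carry forward.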
For example 2, you can consider using stats::spline for extrapolation as follows:
DT2[is.na(V2), V2 :=
as.integer(DT2[, spline(.I[!is.na(V2)], V2[!is.na(V2)], xout=.I[is.na(V2)]), by=.(V1)]$y)]
output:
V1 V2
1: Paul 1
2: Paul 2
3: Paul 3
4: Paul 4
5: John 100
6: John 110
7: John 120
8: John 130
data:
DT2 <- fread(
"Paul, 1
Paul, 2
Paul, NA
Paul, NA
John, 100
John, 110
John, NA
John, NA")

generate list of dates based on one date in r

I am new to R and am finding it difficult to generate a series of rows where each generated row has a calculated date.
For example, going from a dataset like this:
Name date_birth
Greg 01/02/2015
Fred 02/02/2015
...to generate the following:
Name date_birth age date_atage
Greg 01/02/2015 0 01/02/2015
Greg 01/02/2015 1 02/02/2015
Greg 01/02/2015 2 03/02/2015
Fred 02/02/2015 0 02/02/2015
Fred 02/02/2015 1 03/02/2015
Fred 02/02/2015 2 04/02/2015
I have been studying sites like R-bloggers, general instructional blogs and this site, and I have been trying to figure out a loop involving the seq function, so that for each individual (e.g. Greg, Fred, etc.) the process can be repeated, with dates calculated and placed in their own rows. Your first thought may be that this is simpler to do in Excel, but it isn't, as I need to repeat this for over 800 individuals (i.e. not just Greg and Fred), and for up to 300 days of age.
We can use data.table
library(data.table)
setDT(df1)[, .(date_birth, date_at_age = format(seq(as.Date(date_birth,
"%d/%m/%Y"), length.out=3, by = "1 day"), "%d/%m/%Y")) ,
by = Name][,age := seq_len(.N)-1 , by = Name][]
# Name date_birth date_at_age age
#1: Greg 01/02/2015 01/02/2015 0
#2: Greg 01/02/2015 02/02/2015 1
#3: Greg 01/02/2015 03/02/2015 2
#4: Fred 02/02/2015 02/02/2015 0
#5: Fred 02/02/2015 03/02/2015 1
#6: Fred 02/02/2015 04/02/2015 2
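A variation on the same idea, with the age range pulled out as a parameter so it can be raised to 300. This is a sketch under the assumption that date_birth is stored as a day/month/year string; max_age and out are illustrative names:

```r
library(data.table)

df1 <- data.table(
  Name       = c("Greg", "Fred"),
  date_birth = c("01/02/2015", "02/02/2015")
)
max_age <- 2L  # hypothetical parameter; the question mentions up to 300

# One row per person and age, then add the age (in days) to the birth date.
out <- df1[, .(age = 0:max_age), by = .(Name, date_birth)]
out[, date_at_age := format(as.Date(date_birth, format = "%d/%m/%Y") + age,
                            "%d/%m/%Y")]
out
```

Dropping the format() call keeps date_at_age as a Date column, which is usually easier to work with for later calculations.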
This is a long-form way of getting to the same place that data.table will take you.
Have a look at how you use dates in R. I've taken your original format and converted it to a date (code line 2). See http://strftime.org/ for more codes.
Set some dummy data:
df = data.frame(name=c("Gregg", "Joan"), DOB=c("01/02/2015", "02/02/2015"), stringsAsFactors=FALSE)
Make date format:
df$DOB = as.Date(df$DOB, format="%d/%m/%Y")
Loop over each name, making 301 instances (ages 0 to 300) and adding the age in days to the DoB:
df = lapply(1:nrow(df), function(i){
x = data.frame(name=rep(df[i, 1], times=301),
DoB=rep(df[i, 2], times=301),
age=0:300)
x$newDate = x$DoB + x$age
x
})
Convert list to a data frame:
df = do.call("rbind.data.frame", df)
Check output:
head(df)
Setup
df <- cbind(c("Greg","Fred"),c("01/02/2015","02/02/2015"))
max_age <- 2
start_at <- 0
Script
new_df <- data.frame(rep(NA,(max_age-start_at+1)*dim(df)[1]))
new_df[,1] <- rep(df[,1],each=max_age-start_at+1) #Names
new_df[,2] <- rep(df[,2],each=max_age-start_at+1) #Birth date
new_df[,3] <- rep(seq(from=start_at,to=max_age),dim(df)[1]) #Age
library(lubridate)
new_df[,4] <- dmy(new_df[,2]) + days(new_df[,3]) #Date at age
colnames(new_df) <- c("names","date_birth","age","date_at_age")

How to rbind when only some of the columns match

I have about 18 dataframes which are essentially frequency counts of the elements stored in the column Rptname. Some elements of the Rptname column are shared between dataframes and some are not, so they look like this:
dataframe called GroupedTableProportiondelAll
Rptname freq
bob 4324234
jane 433
ham 4324
tim 22
dataframe called GroupedTableProportiondelLUAD
Rptname freq
bob 987
jane 223
jonny 12
jim 98092
I am trying to set up a table so that the Rptname becomes the column and each row is the frequencies. This is so that I can combine all the dataframes.
I have tried the following
GroupedTableProportiondelAll_T <- as.data.frame(t(GroupedTableProportiondelAll))
GroupedTableProportiondelLUAD_T <- as.data.frame(t(GroupedTableProportiondelLUAD))
total <- rbind(GroupedTableProportiondelLUAD_T, GroupedTableProportiondelAll_T)
but I get the error
Error in rbind(deparse.level, ...) :
numbers of columns of arguments do not match
So the question is
a) how can I do rbind (cbind would also do without transposing I suppose) so that the bind can happen without needing to match.
b) would merge be better here
c) in either is there a way to enter zero for empty values
d) P'raps there's a better way to do this, like matrices, which I'm not really familiar with? I know it's 4 questions but the central question's the same: how to bind when not all the rows or columns are matching
An alternative to the rbind + dcast technique is to use the tidyverse.
Use pipes (%>%) to first use bind_rows() to bind all your dataframes together while simultaneously creating a dataframe id column (in this case I just called the variable "df"). Then use spread() to move unique "Rptname" values to become column names and spreading the values of "freq" across the new columns. "Rptname" is the key and "freq" is the value in this case.
It would look like this:
Input:
GTP_A
Rptname freq
1 bob 4324234
2 jane 433
3 ham 4324
4 tim 22
GTP_LUAD
Rptname freq
1 bob 987
2 jane 223
3 jonny 12
4 jim 98092
Code:
library(dplyr) # bind_rows() and %>%
library(tidyr) # spread()
GroupTable <- bind_rows(GTP_A, GTP_LUAD, .id = "df") %>%
spread(Rptname, freq)
Output:
GroupTable
df bob ham jane jim jonny tim
1 1 4324234 4324 433 NA NA 22
2 2 987 NA 223 98092 12 NA
UPDATE:
As of the release of tidyr 1.0.0 on 2019/09/13, spread() and gather() have been retired and replaced by pivot_wider() and pivot_longer(), respectively. In the release notes, Hadley Wickham states: "spread() and gather() won’t go away, but they’ve been retired which means that they’re no longer under active development."
In order to get the same output as above, you will now need to first arrange() by Rptname then use pivot_wider(). If you do not arrange first you will get a similar output but the column order will not be the same as the output from spread().
GroupTable <- bind_rows(GTP_A, GTP_LUAD, .id = "df") %>%
arrange(Rptname) %>%
pivot_wider(names_from = Rptname, values_from = freq)
You could first rbind the dataframes after adding a column to identify the data.frame. Then use dcast function from reshape2 package.
rpt1
## Rptname freq df
## 1 bob 4324234 rpt1
## 2 jane 433 rpt1
## 3 ham 4324 rpt1
## 4 tim 22 rpt1
rpt2
## Rptname freq df
## 1 bob 987 rpt2
## 2 jane 223 rpt2
## 3 jonny 12 rpt2
## 4 jim 98092 rpt2
rpt1$df <- "rpt1"
rpt2$df <- "rpt2"
rpt <- rbind(rpt1, rpt2)
dcast(data = rpt, df ~ Rptname, value.var = "freq")
## df bob ham jane tim jim jonny
## 1 rpt1 4324234 4324 433 22 NA NA
## 2 rpt2 987 NA 223 NA 98092 12
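For part (c) of the question, the same rbind + dcast idea can be sketched with data.table, which lets dcast fill missing combinations with zero. The tables rpt1 and rpt2 are rebuilt here from the question's sample values; with 18 tables, the named list passed to rbindlist() would simply have 18 entries:

```r
library(data.table)

rpt1 <- data.table(Rptname = c("bob", "jane", "ham", "tim"),
                   freq    = c(4324234, 433, 4324, 22))
rpt2 <- data.table(Rptname = c("bob", "jane", "jonny", "jim"),
                   freq    = c(987, 223, 12, 98092))

# Stack the tables; idcol records which list element each row came from.
rpt <- rbindlist(list(rpt1 = rpt1, rpt2 = rpt2), idcol = "df")

# fill = 0 replaces the combinations that never occur with zero instead of NA.
wide <- dcast(rpt, df ~ Rptname, value.var = "freq", fill = 0)
wide
```

Whether zero is the right filler depends on the data: it is reasonable for frequency counts, where an absent name means it was never observed.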

Fill in Blank Fields With a Value From Same Key Index

I have a set of data (10 columns, 1000 rows) that is indexed by an ID number that one or more of these rows can share. To give a small example to illustrate my point, consider this table:
ID Name Location
5014 John
5014 Kate California
5014 Jim
5014 Ryan California
5018 Pete
5018 Pat Indiana
5019 Jeff Arizona
5020 Chris Kentucky
5020 Mike
5021 Will Indiana
I need for all entries to have something in the Location field and I'm having a hell of a time trying to do it.
Things to note:
Every unique ID number has at least one row with the location field populated.
If two rows have the same ID number, they have the same location.
Two different ID numbers can have the same location.
ID numbers are not necessarily consecutive, nor are they necessarily completely numeric. The arrangement of them isn't of importance to me, since any rows that are related share the same ID number.
Any ideas for a solution? I'm currently using R with the data.table package, but I'm relatively new to it.
We can convert the 'data.frame' to a 'data.table' with setDT(df1) and, grouped by 'ID', take the elements of Location that are not '' (Location[Location != '']). If a group has more than one non-blank element, [1L] selects the first one. The result is assigned (:=) back to Location.
library(data.table)
setDT(df1)[, Location := Location[Location != ''][1L], by = ID][]
# ID Name Location
# 1: 5014 John California
# 2: 5014 Kate California
# 3: 5014 Jim California
# 4: 5014 Ryan California
# 5: 5018 Pete Indiana
# 6: 5018 Pat Indiana
# 7: 5019 Jeff Arizona
# 8: 5020 Chris Kentucky
# 9: 5020 Mike Kentucky
#10: 5021 Will Indiana
Or we can use setdiff, as suggested by @Frank:
setDT(df1)[, Location:= setdiff(Location,'')[1L], by = ID][]
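If the blank fields were read in as NA rather than as empty strings (this depends on how the file was imported, so it is an assumption here), the filter needs to exclude both. A small sketch with made-up rows in the shape of the question's table:

```r
library(data.table)

df1 <- data.table(
  ID       = c("5014", "5014", "5018", "5018", "5019"),
  Name     = c("John", "Kate", "Pete", "Pat", "Jeff"),
  Location = c("", "California", NA, "Indiana", "Arizona")
)

# Keep only values that are neither NA nor empty, then take the first one.
df1[, Location := Location[!is.na(Location) & Location != ""][1L], by = ID]
df1
```

A group with no populated Location at all would end up as NA, which makes such rows easy to spot afterwards.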
