How do I change the column names using dcast?

I'm transforming my data from long to wide. Part of the data are dates.
My problem is that I would like different column names: they come out formed like e.g. variable_1-1, and I want 1-1_variable instead.
df:
   SN specimen_isolate_no isolaat materiaal_lokatie alarmniveau afnamedatum
1:  2                 1-1  STAPEP  Bloedkweek Bloed           0  2017-04-30
2:  3                 1-1  KLEBOX        Bloedkweek           0  2018-12-30
3:  3                 2-1  KLEBOX        Bloedkweek           0  2018-12-31
I tried dcast from data.table:
setDT(df)
df.wide <- dcast(df, SN ~ specimen_isolate_no,
                 value.var = c("materiaal_lokatie", "afnamedatum", "isolaat", "alarmniveau"))
This gives me the following result:
colnames:
[1] "SN"                    "materiaal_lokatie_1-1" "materiaal_lokatie_2-1"
[4] "afnamedatum_1-1"       "afnamedatum_2-1"       "isolaat_1-1"
[7] "isolaat_2-1"           "alarmniveau_1-1"       "alarmniveau_2-1"
This result is ok, but I'd rather have the colnames formed like specimen_isolate_no_variable, e.g. 1-1_alarmniveau.
In order to achieve this, I tried
molten <- melt(df, id.vars = c("SN", "specimen_isolate_no"))
dfmolten <- dcast(molten, SN ~ specimen_isolate_no + variable)
# and
df %>%
  gather(key, value, -SN, -specimen_isolate_no) %>%
  unite(new.col, c(specimen_isolate_no, key)) %>%
  spread(new.col, value)
But both options mess up my dates and I don't know how to fix that.
#colnames:
[1] "SN" "1-1_isolaat" "1-1_materiaal_lokatie" "1-1_alarmniveau" "1-1_afnamedatum" "2-1_isolaat" "2-1_materiaal_lokatie" "2-1_alarmniveau" "2-1_afnamedatum"
dfmolten$`1-1_afnamedatum`
[1] "17286" "17895"
So my question: does anyone know how to change the forming of colnames using dcast?

As Frank mentioned, there's an outstanding feature request for this... side note: please add reactions to FRs you'd like to see; we use this to some extent to steer development time:
https://github.com/Rdatatable/data.table/issues/3189
In the meantime, you can just use setnames and a little regexing to do this:
# all columns except the id column 'SN'
old <- grep('SN', names(df.wide), value = TRUE, invert = TRUE, fixed = TRUE)
# move the part after the last '_' (the specimen id) to the front, so that
# 'materiaal_lokatie_1-1' becomes '1-1_materiaal_lokatie'
new <- sub('^(.*)_([^_]+)$', '\\2_\\1', old)
setnames(df.wide, old, new)
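Because this renames the output of the original multi-value.var dcast() call directly, the Date columns keep their class; nothing is melted. If you do go through melt(), all value columns are coerced to a common type, which is why the dates turn into day counts like "17286". A minimal sketch to restore such a column (assuming the dfmolten object from the question):
# convert the stored day count back into a Date
dfmolten$`1-1_afnamedatum` <- as.Date(as.numeric(dfmolten$`1-1_afnamedatum`),
                                      origin = "1970-01-01")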

Related

Rearrange dataframe to fit longitudinal model in R

I have a dataframe where each entry relates to a job posting in the NHS specifying the week the job was posted, and what NHS Trust (and region) the job is in.
At the moment my dataframe looks something like this:
set.seed(1)
df1 <- data.frame(
  NHS_Trust = sample(1:30, 20, replace = TRUE),
  Week = sample(1:10, 20, replace = TRUE),
  Region = sample(1:15, 20, replace = TRUE))
And I would like to count the number of jobs for each week across each NHS Trust and assign that value to a new column 'Jobs', so my dataframe looks like this:
set.seed(1)
df2 <- data.frame(
  NHS_Trust = rep(1:30, each = 10),
  Week = rep(seq(1, 10), 30),
  Region = rep(as.integer(runif(30, 1, 15)), 1, each = 10),
  Jobs = rpois(10 * 30, lambda = 2))
The dataframe may then be used to create a Poisson longitudinal multilevel model where I may model the number of jobs.
Using the data.table package you can group, count and assign to a new column in a single expression. The general syntax for a data.table is dt[i, j, by]. Here i selects rows, i.e. the subset of the data (or the row order) to operate on; it is empty in this case, so all rows are used in their original order. The j part says what is to be done: here we count the number of occurrences per group using .N and assign the result to the new variable count with the assignment operator :=. The by part takes a list of variables, and the j operation is performed within each group.
library(data.table)
setDT(df1)
df1[, count := .N, by = .(NHS_Trust, Week, Region)]
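If you also want rows for the Trust/Week combinations that never occur in the data, as in the OP's df2, here is a sketch using data.table's CJ() cross join (it assumes trusts 1:30 and weeks 1:10; Region is left NA for combinations that were never observed, since it cannot be inferred from df1):
# counts only for the combinations present in the data
counts <- df1[, .(Jobs = .N), by = .(NHS_Trust, Region, Week)]
# right-join onto the full Trust x Week grid; unmatched rows get Jobs = NA
full <- counts[CJ(NHS_Trust = 1:30, Week = 1:10), on = .(NHS_Trust, Week)]
full[is.na(Jobs), Jobs := 0L]  # fill the missing combinations with zero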
A tidyverse approach would be
library(tidyverse)
df1 <- df1 %>%
  group_by(NHS_Trust, Week, Region) %>%
  count()
You can use count() to count the number of jobs across each Region, NHS_Trust and Week, and use complete() to fill in the missing combinations.
library(dplyr)
df1 %>%
  count(Region, NHS_Trust, Week, name = 'Jobs') %>%
  tidyr::complete(Region, Week = 1:10, fill = list(Jobs = 0))
I guess I'm moving my comment to an answer:
df2 <- df1 %>% group_by(Region, NHS_Trust, Week) %>% count()
colnames(df2)[4] <- "Jobs"
df2$combo <- paste0(df2$Region, "_", df2$NHS_Trust, "_", df2$Week)
for (i in 1:length(unique(df2$Region))){
  for (j in 1:length(unique(df2$NHS_Trust))){
    for (k in 1:length(unique(df2$Week))){
      curr_combo <- paste0(unique(df2$Region)[i], "_",
                           unique(df2$NHS_Trust)[j], "_",
                           unique(df2$Week)[k])
      if (!curr_combo %in% df2$combo){
        curdat <- data.frame(unique(df2$Region)[i],
                             unique(df2$NHS_Trust)[j],
                             unique(df2$Week)[k],
                             0,
                             curr_combo,
                             stringsAsFactors = FALSE)
        names(curdat) <- names(df2)
        df2 <- rbind(as.data.frame(df2), curdat)
      }
    }
  }
}
tail(df2)
#      Region NHS_Trust Week Jobs  combo
# 4495     15         1    4    0 15_1_4
# 4496     15         1    5    0 15_1_5
# 4497     15         1    8    0 15_1_8
# 4498     15         1    3    0 15_1_3
# 4499     15         1    6    0 15_1_6
# 4500     15         1    9    0 15_1_9
The for loop here checks which Region-NHS_Trust-Week combinations are missing from df2 and appends those to df2 with a corresponding Jobs value of 0. The check is done with the help of the new variable combo, which is just a concatenation of the values of the fields mentioned above, separated by underscores.
Edit: I am plenty sure the people here can come up with something more elegant than this.
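For what it's worth, a more compact route (a sketch; it assumes the counts in df2 from the first line above, before the combo column is added) is tidyr::complete(), which fills in the missing combinations directly:
library(dplyr)
library(tidyr)
# expand to every observed Region/NHS_Trust/Week combination,
# filling the missing Jobs counts with 0
df2 <- df2 %>%
  ungroup() %>%  # count() returns a grouped tibble; complete() should see the whole table
  complete(Region, NHS_Trust, Week, fill = list(Jobs = 0))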

Insert Column Name into its Value using R

I need to insert the column name, e.g. Department, into its values. I have code like this:
Department <- c("Store1","Store2","Store3","Store4","Store5")
Department2 <- c("IT1","IT2","IT3","IT4","IT5")
x <- c(100,200,300,400,500)
Result <- data.frame(Department,Department2,x)
Result
The expected result is like:
Department <- c("Department_Store1","Departmentz_Store2","Department_Store3","Department_Store4","Department_Store5")
Department2 <- c("Department2_IT1","Department2_IT2","Department2_IT3","Department2_IT4","Department2_IT5")
x <- c(100,200,300,400,500)
Expected.Result <- data.frame(Department,Department2,x)
Expected.Result
Can somebody help? Thanks
Another way with dplyr and tidyr:
library(dplyr)
library(tidyr)
# convert all columns to character to avoid a warning message
Result[] <- lapply(Result, as.character)
Result %>%
  mutate_if(is.factor, as.character) %>% # optional: only convert factors to character, retain all other types
  gather(key, value, -x) %>%
  mutate(var = paste(key, value, sep = "_")) %>%
  select(-value) %>%
  spread(key, var)
    x        Department     Department2
1 100 Department_Store1 Department2_IT1
2 200 Department_Store2 Department2_IT2
3 300 Department_Store3 Department2_IT3
4 400 Department_Store4 Department2_IT4
5 500 Department_Store5 Department2_IT5
Data:
Result <- data.frame(
  Department = c("Store1","Store2","Store3","Store4","Store5"),
  Department2 = c("IT1","IT2","IT3","IT4","IT5"),
  x = c(100,200,300,400,500)
)
If you gather the column names in question into a vector dep_col, this is a clean base R solution with a for loop:
df <- data.frame(x = 1:5,
                 Department = paste0("Store", 1:5),
                 Department2 = paste0("IT", 1:5))
dep_col <- names(df)[-1]
for (c in dep_col)
  df[[c]] <- paste(c, df[[c]], sep = "_")
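The same idea works without an explicit loop; here is a small sketch using Map() over the names and the columns:
# paste each column name onto its own values, column by column
df[dep_col] <- Map(function(nm, col) paste(nm, col, sep = "_"),
                   dep_col, df[dep_col])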
If I understand correctly, the OP wants to prepend the values in all columns starting with "Department" with the respective column name.
Edit: By request of the OP, the code to select columns has been generalised to pick up additional column names.
Here is a solution using data.table's fast set() function:
library(data.table)
setDT(Result)
cols <- stringr::str_subset(names(Result), "^(Department|Division|Team)")
for (j in cols) {
set(Result, NULL, j, paste(j, Result[[j]], sep = "_"))
}
Result
Department Department2 x
1: Department_Store1 Department2_IT1 100
2: Department_Store2 Department2_IT2 200
3: Department_Store3 Department2_IT3 300
4: Department_Store4 Department2_IT4 400
5: Department_Store5 Department2_IT5 500
Note that set() updates by reference, i.e., without copying the whole object.
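For completeness, with current dplyr (1.0 or later) the same prepending can be sketched with across() and cur_column(), assuming the same Result data and that all target columns start with "Department":
library(dplyr)
Result <- Result %>%
  mutate(across(starts_with("Department"),
                ~ paste(cur_column(), .x, sep = "_")))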

Grouping with numeric variables

I have a dataframe like this:
name, value
stockA,Google
stockA,Yahoo
stockB,NA
stockC,Google
I would like to turn the values of the second column into columns, keep the first column, and fill the new columns with a numeric 0 or 1 depending on whether the value exists for that name. Here is an example of the expected output:
name,Google,Yahoo
stockA,1,1
stockB,0,0
stockC,1,0
I tried this:
library(reshape2)
df2 <- dcast(melt(df, 1:2, na.rm = TRUE), df + name ~ value, length)
and the error it gives me is this:
Using value as value column: use value.var to override.
Error in `[.data.frame`(x, i) : undefined columns selected
Any idea for the error?
Here is an example in which the previous code works.
Data (df):
name,nam2,value
stockA,sth1,Yahoo
stockA,sth2,NA
stockB,sth3,Google
and this works:
df2 <- dcast(melt(df, 1:2, na.rm = TRUE), name + nam2 ~ value, length)
The OP has asked to get an explanation for the error caused by
dcast(melt(df, 1:2, na.rm = TRUE), df + name ~ value, length)
(I'm quite astonished that no one so far has tried to improve the OP's reshape2 approach to return exactly the expected answer).
There are several issues with the OP's code:
df appears in the dcast() formula.
The second parameter to melt() is 1:2, which means that all columns are used as id.vars. It should read 1.
But the most crucial point is that the data.frame df already is in long format and doesn't need to be reshaped.
So, df can be used directly in dcast():
library(reshape2)
dcast(df[!is.na(df$value), ], name ~ value, length, drop = FALSE)
# name Google Yahoo
#1 stockA 1 1
#2 stockB 0 0
#3 stockC 1 0
In order to avoid a third NA column appearing in the result, the NA rows have to be filtered out of df before reshaping. On the other hand, drop = FALSE is required to ensure stockB is included in the result.
Data
df <- data.frame(name = c("stockA", "stockA", "stockB", "stockC"),
value = c("Google", "Yahoo", NA, "Google"))
df
# name value
#1 stockA Google
#2 stockA Yahoo
#3 stockB <NA>
#4 stockC Google
You can do that with spread from the tidyr package.
library(dplyr)
library(tidyr)
df <- data.frame(name = c("stockA", "stockA", "stockB", "stockC"),
                 value = c("Google", "Yahoo", NA, "Google"))
df$row <- 1
df %>%
  spread(value, row, fill = 0) %>%
  select(-`<NA>`)
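spread() is superseded in current tidyr; a rough pivot_wider() equivalent would be (a sketch, assuming the same df with the helper row column):
library(dplyr)
library(tidyr)
df %>%
  pivot_wider(names_from = value, values_from = row, values_fill = 0) %>%
  select(-`NA`)  # the NA values become a column literally named "NA"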
Try df2 <- dcast(melt(df, 1:2, na.rm = TRUE), name ~ value, length)
Just remove df + from the equation.
Though this will give you an extra column for NA values, which makes me think the na.rm argument isn't working properly in your formulation.
You can do it also with base R:
df <- read.table(header = TRUE, sep = ',', text = '
name,value
stockA,Google
stockA,Yahoo
stockB,NA
stockC,Google')
xtabs(~., data=df)
# value
#name Google Yahoo
# stockA 1 1
# stockB 0 0
# stockC 1 0
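Note that xtabs() silently drops the NA row because NA is not a factor level, while zero-count combinations such as stockB still show up. If you wanted the NAs counted in a column of their own, recent versions of R support (a sketch):
xtabs(~., data = df, addNA = TRUE)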

Iteratively create columns based on grouped variables

I've got some data (below) where I want to iteratively add columns based on sums of current columns by some grouping variable, and I want to name the columns a pasted value of the current name + "_tot". I'm thinking a combination of dplyr and lapply is the way to go about it but I can't get the structure correct.
set.seed(1234)
data <- data.frame(
  biz = sample(c("telco", "shipping", "tech"), 50, replace = TRUE),
  region = sample(c("mideast", "americas"), 50, replace = TRUE),
  june = sample(1:50, 50, replace = TRUE),
  july = sample(100:150, 50, replace = TRUE)
)
So, what I want to do is group this data by "region", then add a new column for each of the months that is the sum of that month's values (in the real dataframe, there are many more periods that follow).
Basically, I want to apply this function
library(dplyr)
data %>% group_by(region) %>% mutate(june_tot = sum(june))
across every month, without having to specify "june" or "july". My initial take:
testfun <- function(df, col) {
  name <- paste(col, "_tot", sep = "")
  data2 <- df %>% group_by(region) %>% summarise(name = sum(col))
  return(data2)
}
but lapplying this doesn't work, because I have to specify the columns to call into the initial function. Just removing the "col" argument from the initial function doesn't work either, of course.
Any ideas how to lapply this sort of argument?
Here are possible solutions to your problems using dplyr (first, since that is what you tried), and followed by data.table as well as base R solutions:
dplyr:
cols <- lapply(names(data)[-(1:2)], as.name)
names(cols) <- paste0(names(data)[-(1:2)], "_tot")
data %>% group_by(region) %>% mutate_each_q(funs(sum), cols)
Assumes every column but the first two are monthly data. An explanation by line:
we use as.name and lapply to generate a list of the columns names we want to mutate as symbols
we give the new names we want (i.e. month_tot) to the list of symbols from 1.
we use the mutate_each_q (known as mutate_each_ in dplyr 0.3.0.2) to apply sum to the list of expressions we created in 1. and 2.
This is the (sample) result:
Source: local data frame [50 x 6]
Groups: region
biz region june july june_tot july_tot
1 shipping mideast 17 124 780 3339
2 telco americas 11 101 465 2901
3 telco mideast 27 131 780 3339
4 tech americas 24 135 465 2901
... rows omitted
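As an aside, mutate_each_q has since been retired; with current dplyr (1.0 or later) the same operation can be sketched with across(), assuming every column after the first two is monthly data:
library(dplyr)
data %>%
  group_by(region) %>%
  mutate(across(where(is.numeric), sum, .names = "{.col}_tot")) %>%
  ungroup()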
data.table:
new.names <- paste0(tail(names(data), 2L), "_tot")  # make the new names
data.table(data)[,
  (new.names) := lapply(.SD, sum),  # `lapply` `sum` over the selected columns (those in `.SD`) and assign to the `new.names` columns
  by = region, .SDcols = -1         # group by `region`; exclude the first column from `.SD` (`region` is excluded as well by virtue of being in `by`)
][]  # extra `[]` just to force printing
Here, similar logic, except we use the special .SD object that represents every column in the data.table that we are not grouping by.
base:
do.call(
  cbind,
  list(
    data,
    setNames(
      lapply(data[-(1:2)], function(x) ave(x, data$region, FUN = sum)),
      paste0(names(data[-(1:2)]), "_tot")
    )
  )
)
Here we use ave to compute the per region sums, use lapply to apply ave to each column, and use do.call(cbind, ...) to reconstruct the final data frame.
Try:
> for(i in 3:4) print(tapply(data[[i]], data$region, sum))
americas  mideast
     563      768
americas  mideast
    2538     3802
You can get all outputs in a list if you want.
Restructuring the data works well for this.
require(tidyr)
# wide to long
d2 <- gather(data = data, key = month, value = monthval, -c(biz, region))
# get totals and rename month
month_tots <- aggregate(x = list(total = d2$monthval),
                        by = list(region = d2$region, month = d2$month), sum)
month_tots$month <- paste0(month_tots$month, '_tot')
# long to wide
month_tots <- spread(data = month_tots, key = month, value = total)
# recombine
merge(data, month_tots, by = 'region', all.x = TRUE)

Non-standard evaluation (NSE) in dplyr's filter_ & pulling data from MySQL

I'd like to pull some data from a sql server with a dynamic filter. I'm using the great R package dplyr in the following way:
# Create the filter
filter_criteria <- ~ column1 %in% some_vector
# Connect to the database
connection <- src_mysql(dbname = "mydbname",
                        user = "myusername",
                        password = "mypwd",
                        host = "myhost")
# Get the data
data <- connection %>%
  tbl("mytable") %>%                    # specify which table
  filter_(.dots = filter_criteria) %>%  # non-standard evaluation filter
  collect()                             # pull the data into R
This piece of code works fine, but now I'd like to loop it over all the columns of my table, so I'd like to write the filter as:
#Dynamic filter
i <- 2 #With a loop on this i for instance
which_column <- paste0("column",i)
filter_criteria <- ~ which_column %in% some_vector
And then reapply the first code with the updated filter.
Unfortunately this approach doesn't give the expected results. In fact it does not give any error but doesn't even pull any result into R.
In particular, I looked a bit into the SQL queries generated by the two pieces of code, and there is one important difference.
While the first, working, code generates a query of the form:
SELECT ... FROM ... WHERE
`column1` IN ....
(backticks around the column name, i.e. an identifier), the second one generates a query of the form:
SELECT ... FROM ... WHERE
'column1' IN ....
(single quotes around the column name, i.e. a string literal, so the WHERE clause compares a constant string instead of the column)
Does anyone have any suggestion on how to formulate the filtering condition to make it work?
It's not really related to SQL. This example in R does not work either:
df <- data.frame(
  v1 = sample(5, 10, replace = TRUE),
  v2 = sample(5, 10, replace = TRUE)
)
df %>% filter_(~ "v1" == 1)
It does not work because you need to pass to filter_ the expression ~ v1 == 1 — not the expression ~ "v1" == 1.
To solve the problem, simply use the quoting operator quo() and the unquoting operator !!:
library(dplyr)
which_column <- quo(v1)
df %>% filter(!!which_column == 1)
An alternative solution: since dplyr version 0.5.0 (possibly earlier), it is possible to pass a composed string as the .dots argument, which I find more readable than the lazyeval::interp solution:
df <- data.frame(
  v1 = sample(5, 10, replace = TRUE),
  v2 = sample(5, 10, replace = TRUE)
)
which_col <- "v1"
which_val <- 1
df %>% filter_(.dots = paste0(which_col, " == ", which_val))
v1 v2
1 1 1
2 1 2
3 1 4
UPDATE for dplyr 0.6 and later:
packageVersion("dplyr")
# [1] ‘0.5.0.9004’
df %>% filter(UQ(rlang::sym(which_col))==which_val)
#OR
df %>% filter((!!rlang::sym(which_col))==which_val)
(Similar to @Matthew's response for dplyr 0.6, but I assume that which_col is a string variable.)
2nd UPDATE: Edwin Thoen created a nice cheatsheet for tidy evaluation: https://edwinth.github.io/blog/dplyr-recipes/
Here's a slightly less verbose solution, one which uses the extract function '[' to select a column by its character name rather than converting the name to a language element:
df %>% filter(., '['(., which_column) == 1)
set.seed(123)
df <- data.frame(
  v1 = sample(5, 10, replace = TRUE),
  v2 = sample(5, 10, replace = TRUE)
)
which_column <- "v1"
df %>% filter(., '['(., which_column)==1)
# v1 v2
#1 1 5
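For completeness, in current dplyr the idiomatic way to filter on a column named by a string is the .data pronoun; a sketch, assuming which_column holds the column name as above:
library(dplyr)
df %>% filter(.data[[which_column]] == 1)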
