R: Melt with data.table into two columns

I would like to convert a data.table from wide to long format.
A "normal" melt works fine, but I would like to melt my data into two separate value columns.
I found information about this here:
https://cran.r-project.org/web/packages/data.table/vignettes/datatable-reshape.html (Point 3)
I tried this with my own code, but somehow it is not working, and so far I have not been able to figure out the problem.
It would be great if you could explain the mistake in the following example.
Thanks
#Create fake data
a = c("l1","l2","l2","l3")
b = c(10, 10, 20, 10)
c = c(30, 30, 30, 30)
d = c(40.2, 32.1, 24.1, 33.0)
e = c(1,2,3,4)
f = c(1.1, 1.2, 1.3, 1.5)
df <- data.frame(a,b,c,d,e,f)
colnames(df) <- c("fac_a", "fac_b", "fac_c", "m1", "m2.1", "m2.2")
#install.packages("data.table")
require(data.table)
TB <- setDT(df)
#Standard melt - works
TB.m1 = melt(TB, id.vars = c("fac_a", "fac_b", "fac_c"), measure.vars = c(4:ncol(TB)))
#Melt into two columns
colA = paste("m1", 4, sep = "")
colB = paste("m2", 5:ncol(TB), sep = "")
DT = melt(TB, measure = list(colA, colB), value.name = c("a", "b"))
#Not working, error: Error in melt.data.table(TB, measure = list(colA, colB), value.name = c("a", : One or more values in 'measure.vars' is invalid.
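For reference, the error arises because the constructed names do not exist in TB: paste("m1", 4, sep = "") yields "m14", and paste("m2", 5:ncol(TB), sep = "") yields "m25" and "m26". A minimal sketch of a fix, assuming the intent was to pair m1 with the m2.* columns (note that melt() pairs measure groups element-wise, so unequal group lengths may require a recent data.table version):
colA = "m1"                        # the real column name, not "m14"
colB = paste("m2", 1:2, sep = ".") # "m2.1", "m2.2"
DT = melt(TB, measure = list(colA, colB), value.name = c("a", "b"))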

Related

Fill missing values from another dataframe with the same columns

I searched various join questions and none seemed to quite answer this. I have two dataframes which each have an ID column and several information columns.
df1 <- data.frame(id = c(1:100),
                  color = c(rep("blue", 25), rep("red", 25), rep(NA, 25)),
                  phase = c(rep("liquid", 50), rep("gas", 50)),
                  rand.col = rnorm(100))
df2 <- data.frame(id = c(51:100), color = rep("green", 50), phase = rep("gas", 50))
As you can see, df1 is missing some info that is present in df2, and df2 covers only a subset of the ids, but they share several columns. Is there a way to fill the missing values in df1 based on matching ids from df2?
I found a similar question that recommended using merge, but when I tried it, it dropped all the ids that were not present in both dataframes. It also required manually dropping duplicate columns, and in my real dataset there will be a large number of these, which makes that cumbersome. Even ignoring that, both of the recommended solutions:
df1 <- setNames(merge(df1, df2)[-2], names(df1))
and
df1[is.na(df1$color), "color"] <- df2[match(df1$id, df2$id), "color"][which(is.na(df1$color))]
did not work for me, throwing various errors.
An alternative I have considered is using rbind and then dropping incomplete cases. The problem is that my real dataset has non-shared columns as well as shared ones, so I would have to create intermediate objects of just the shared columns, rbind, drop incomplete cases, and then join with the original object to regain the dropped columns. This seems unnecessarily roundabout.
In this example it would look like:
df2 = rbind(df1[,colnames(df2)], df2)
df2 = df2[complete.cases(df2),]
df2 = merge(df1[,c("id", "rand.col")], df2, by = "id")
and, in case there are any fully duplicated rows between the two dataframes, I would need to add
df2 = unique(df2)
This solution will work, but it is cumbersome and as the number of columns that are being matched on increase, it gets even worse. Is there a better solution?
-edit- fixed a problem in my example data pointed out by Sathish
-edit2- Expanded example data
df1 = data.frame(id = c(1:100), wq2 = rnorm(50), wq3 = rnorm(50),
                 wq4 = rnorm(50), wq5 = rnorm(50))
df2 = data.frame(id = c(51:100), wq2 = rnorm(50), wq3 = rnorm(50),
                 wq4 = rnorm(50), wq5 = rnorm(50))
These dataframes represent the case where many columns have incomplete data and a second dataframe holds all of the missing data. Ideally, we would not need to list each column separately with wq2 := i.wq2 etc.
If you want to join only by the id column, you can remove phase from the on clause in the code below.
Also, your data in the question has discrepancies, which are corrected in the data posted in this answer.
library('data.table')
setDT(df1) # make data table by reference
setDT(df2) # make data table by reference
df1[i = df2, color := i.color, on = .(id, phase)] # join df1 with df2 on id and phase, filling df1's color from df2 (i.color refers to df2's column)
tail(df1)
# id color phase rand.col
# 1: 95 green gas 1.5868335
# 2: 96 green gas 0.5584864
# 3: 97 green gas -1.2765922
# 4: 98 green gas -0.5732654
# 5: 99 green gas -1.2246126
# 6: 100 green gas -0.4734006
one-liner:
setDT(df1)[df2, color := i.color, on = .(id, phase)]
Data:
set.seed(1L)
df1 <- data.frame(id = c(1:100),
                  color = c(rep("blue", 25), rep("red", 25), rep(NA, 50)),
                  phase = c(rep("liquid", 50), rep("gas", 50)),
                  rand.col = rnorm(100))
df2 <- data.frame(id = c(51:100), color = rep("green", 50), phase = rep("gas", 50))
EDIT: based on new data posted in the question
Data:
set.seed(1L)
df1 = data.frame(id = c(1:100), wq2 = rnorm(50), wq3 = rnorm(50),
                 wq4 = rnorm(50), wq5 = rnorm(50))
set.seed(2423L)
df2 = data.frame(id = c(51:100), wq2 = rnorm(50), wq3 = rnorm(50),
                 wq4 = rnorm(50), wq5 = rnorm(50))
Code:
library('data.table')
setDT(df1)[ id == 52, ]
# id wq2 wq3 wq4 wq5
# 1: 52 0.1836433 -0.6120264 0.04211587 -0.01855983
setDT(df2)[ id == 52, ]
# id wq2 wq3 wq4 wq5
# 1: 52 0.3917297 -1.007601 -0.6820783 0.3153687
df1[df2, `:=` (wq2 = i.wq2,
               wq3 = i.wq3,
               wq4 = i.wq4,
               wq5 = i.wq5), on = .(id)]
setDT(df1)[ id == 52, ]
# id wq2 wq3 wq4 wq5
# 1: 52 0.3917297 -1.007601 -0.6820783 0.3153687
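To avoid listing each column by hand (the concern raised in the question), a hedged sketch using the mget() idiom to copy every non-key column of df2 in a single join:
cols <- setdiff(names(df2), "id")                        # all columns except the join key
df1[df2, (cols) := mget(paste0("i.", cols)), on = .(id)] # i.* refers to df2's columns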

Pass multiple arguments to ddply

I am attempting to create a function which takes a list as input and returns a summarised data frame. However, after trying multiple approaches, I have been unable to pass a list into the function for the aggregation.
So far I have the following, but it fails.
library(plyr)   # ddply() and .() come from plyr
library(dplyr)
random_df <- data.frame(
  region = c("A", "B", "C", "C"),
  number_of_reports = c(1, 3, 2, 1),
  report_MV = c(12, 33, 22, 12)
)
output_graph <- function(input) {
  print(input$arguments)
  DF <- input$DF
  group_by <- input$group_by
  args <- input$arguments
  flow <- ddply(DF, group_by, summarize, args)  # this is the failing call
  return(flow)
}
graph_functions <- list(
  DF = random_df,
  group_by = .(region),
  arguments = .(Reports = sum(number_of_reports),
                MV_Reports = sum(report_MV))
)
output_graph(graph_functions)
Where this works:
library(plyr)   # ddply() and .() come from plyr
library(dplyr)
random_df <- data.frame(
  region = c("A", "B", "C", "C"),
  number_of_reports = c(1, 3, 2, 1),
  report_MV = c(12, 33, 22, 12)
)
output_graph <- function(input) {
  print(input$arguments)
  DF <- input$DF
  group_by <- input$group_by
  args <- input$arguments
  flow <- ddply(
    DF,
    group_by,
    summarize,
    Reports = sum(number_of_reports),
    MV_Reports = sum(report_MV)
  )
  return(flow)
}
graph_functions <- list(
  DF = random_df,
  group_by = .(region),
  arguments = .(Reports = sum(number_of_reports),
                MV_Reports = sum(report_MV))
)
output_graph(graph_functions)
Would anyone be aware of a way to pass a list of summary expressions to ddply? Or another way to achieve the same goal of aggregating a dynamic set of variables?
In order to pass arguments into the function for use by dplyr, I recommend reading up on non-standard evaluation (NSE). Here is an edited function producing the same output as the original.
library(dplyr)
random_df <- data.frame(
  region = c('A', 'B', 'C', 'C'),
  number_of_reports = c(1, 3, 2, 1),
  report_MV = c(12, 33, 22, 12)
)
output_graph <- function(df, group, args) {
  grp_quo <- enquo(group)
  df %>%
    group_by(!!grp_quo) %>%
    summarise(!!!args)
}
args <- list(
  Reports = quo(sum(number_of_reports)),
  MV_Reports = quo(sum(report_MV))
)
output_graph(random_df, region, args)
# # A tibble: 3 x 3
# region Reports MV_Reports
# <fctr> <dbl> <dbl>
# 1 A 1.00 12.0
# 2 B 3.00 33.0
# 3 C 3.00 34.0
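The same function then accepts any set of summary expressions; for example (a hypothetical extra call):
args2 <- list(Avg_MV = quo(mean(report_MV)))  # a different, made-up summary
output_graph(random_df, region, args2)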

How do I update data frame fields in R?

I'm looking to update fields in one data table with information from another data table, like this:
dt1$name <- dt2$name where dt1$id = dt2$id
In SQL pseudocode: update dt1 set name = dt2.name where dt1.id = dt2.id
As you can see, I'm very new to R, so every little bit helps.
Update
I think it was my fault - what I really want is to update a telephone number if the usernames in both dataframes match.
So how can I compare names and update a field if both names match?
Please help :)
dt1$phone <- dt2$phone where dt1$name = dt2$name
Joran's answer assumes dt1 and dt2 can be matched by position.
If it's not the case, you may need to merge() first:
dt1 <- data.frame(id = c(1, 2, 3), name = c("a", "b", "c"), stringsAsFactors = FALSE)
dt2 <- data.frame(id = c(7, 3), name = c("f", "g"), stringsAsFactors = FALSE)
dt1 <- merge(dt1, dt2, by = "id", all.x = TRUE)
dt1$name <- ifelse( ! is.na(dt1$name.y), dt1$name.y, dt1$name.x)
dt1
Edit, per your update:
dt1 <- data.frame(id = c(1, 2, 3), name = c("a", "b", "c"), phone = c("123-123", "456-456", NA), stringsAsFactors = FALSE)
dt2 <- data.frame(name = c("f", "g", "a"), phone = c(NA, "000-000", "789-789"), stringsAsFactors = FALSE)
dt1 <- merge(dt1, dt2, by = "name", all.x = TRUE)
dt1$new_phone <- ifelse( ! is.na(dt1$phone.y), dt1$phone.y, dt1$phone.x)
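For reference, a base-R alternative that avoids merge()'s .x/.y column juggling (a sketch, starting from the un-merged dt1 and dt2 defined just above, and assuming names in dt2 are unique):
idx <- match(dt1$name, dt2$name)             # position of each dt1 name in dt2 (NA if absent)
hit <- !is.na(idx) & !is.na(dt2$phone[idx])  # matched, and dt2 actually has a phone
dt1$phone[hit] <- dt2$phone[idx[hit]]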
Try:
dt1$name <- ifelse(dt1$id == dt2$id, dt2$name, dt1$name)
Alternatively, maybe:
dt1$name[dt1$id == dt2$id] <- dt2$name[dt1$id == dt2$id]
If you're more comfortable working in SQL, you can use the sqldf package:
dt1 <- data.frame(id = c(1, 2, 3),
                  name = c("A", "B", "C"),
                  stringsAsFactors = FALSE)
dt2 <- data.frame(id = c(2, 3, 4),
                  name = c("X", "Y", "Z"),
                  stringsAsFactors = FALSE)
library(sqldf)
sqldf("SELECT dt1.id,
CASE WHEN dt2.name IS NULL THEN dt1.name ELSE dt2.name END name
FROM dt1
LEFT JOIN dt2
ON dt1.id = dt2.id")
But, computationally, it's about 150 times slower than joran's solution, and quite a bit slower in human time as well. However, if you are ever in a bind and just need to do something that you can do easily in SQL, it's an option.
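For completeness, the data.table update-join shown in earlier answers on this page applies here too (a sketch; it modifies dt1 by reference, and rows with no match in dt2 keep their name):
library(data.table)
setDT(dt1)[dt2, name := i.name, on = "id"]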

Sort data into deciles based on a rolling subset

I am trying to replicate the Fama French 1993 paper using R. I need to do the following sorting:
1. for each month,
2. calculate ME decile breakpoints on NYSE stocks only,
3. sort all stocks into the deciles created in 2.
Data generation:
set.seed(1234)
n = 120
stocks <- c("A", "B", "C", "D", "E")
exchange <- c("NYSE", "NASDAQ", "AMEX")
df <- as.data.frame(cbind(Month = 1:12,
                          exchangeCode = exchange[round(runif(n, 1, 3))],
                          Stock = stocks[round(runif(n, 1, 5))],
                          ME = floor(100*abs(rnorm(n)))))
Desired Output:
ME_NYSE_vals <- as.numeric(paste(df[df$Month==1 & df$exchangeCode=="NYSE","ME"]))
ME_ALL_vals <- as.numeric(paste(df[df$Month==1,"ME"]))
cut(x = ME_ALL_vals,
    breaks = c(-Inf, quantile(ME_NYSE_vals, probs = seq(.1, .9, .1)), +Inf),
    labels = 1:10)
The breaks should be calculated based on ME_NYSE_vals. The cut should be applied to all ME_ALL_vals for each month.
If the intention is to keep the whole data frame but generate deciles only for the NYSE values, the code below could do: it assigns deciles to the NYSE entries only while keeping the full data set, achieving a form of partial sorting.
# Libs
Vectorize(require)(package = c("dplyr", "magrittr"),
                   character.only = TRUE)
# Transformations
df %<>%
  mutate(nTileNYSE = ifelse(exchangeCode == "NYSE", ntile(ME, 10), NA)) %>%
  arrange(nTileNYSE)
The code was applied to the data:
set.seed(1)
df <- as.data.frame(cbind(exchangeCode = c("NYSE", "NASDAQ"),
                          Stock = c("A", "B", "C", "A"),
                          Month = 1:12,
                          ME = rnorm(1200)))
2nd approach
Following the discussion in the comments I would suggest the following approach:
# Libs --------------------------------------------------------------------
Vectorize(require)(package = c("tidyr", "dplyr", "magrittr", "xts", "Hmisc"),
                   character.only = TRUE)
# Data generation ---------------------------------------------------------
set.seed(1234)
n = 120
stocks <- c("A", "B", "C", "D", "E")
exchange <- c("NYSE", "NASDAQ", "AMEX")
df <- as.data.frame(cbind(Month = 1:12,
                          exchangeCode = exchange[round(runif(n, 1, 3))],
                          Stock = stocks[round(runif(n, 1, 5))],
                          ME = floor(100*abs(rnorm(n)))))
# Transformations ---------------------------------------------------------
# cbind() coerced everything to a character matrix, so ME must be converted back to numeric
df$ME <- as.numeric(as.character(df$ME))
# Generate cuts
dfNtiles <- df %>%
  arrange(exchangeCode, Month, ME) %>%
  group_by(exchangeCode, Month) %>%
  mutate(cutsBsdOnNYSE = cut(x = ME,
                             breaks = cut2(x = df$ME[df$exchangeCode == "NYSE"],
                                           g = 10, onlycuts = TRUE))) %>%
  ungroup() %>%
  group_by(cutsBsdOnNYSE) %>%
  mutate(grpBsdOnNYSE = n())
It's fairly straightforward:
1. Generate cut brackets reflecting a subset of the data (the NYSE entries).
2. Apply those brackets to the whole ME vector.
3. Number the obtained groups so a group identifier is created.
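For reference, the per-month logic from the question could also be sketched in base R (assuming df from the question with ME already converted to numeric, and that every month contains NYSE rows; unique() guards against duplicate breakpoints in small samples):
deciles_by_month <- lapply(split(df, df$Month), function(d) {
  brks <- quantile(d$ME[d$exchangeCode == "NYSE"], probs = seq(.1, .9, .1))
  cut(d$ME, breaks = c(-Inf, unique(brks), Inf), labels = FALSE)  # integer group codes
})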

Rolling average for multiple variables in R using the data.table package

I would like to get a rolling average for each of the numeric variables that I have. Using the data.table package, I know how to compute it for a single variable. But how should I revise the code so it can process multiple variables at a time, rather than changing the variable name and repeating the procedure several times? Thanks.
Suppose I also have other numeric variables named "V2", "V3", and "V4".
require(data.table)
require(zoo)  # rollmean() comes from zoo
setDT(data)
setkey(data, Receptor, date)
data[, RollConc := rollmean(AvgConc, 48, align = "left", na.pad = TRUE), by = Receptor]
A copy of my sample data can be found at:
https://drive.google.com/file/d/0B86_a8ltyoL3OE9KTUstYmRRbFk/view?usp=sharing
I would like to get 5-hour rolling means for "AvgConc","TotDep","DryDep", and "WetDep" by each receptor.
From your description you want something like this, which is similar to one example that can be found in one of the data.table vignettes:
library(data.table)
set.seed(42)
DT <- data.table(x = rnorm(10), y = rlnorm(10), z = runif(10), g = c("a", "b"), key = "g")
library(zoo)
DT[, paste0("ravg_", c("x", "y")) := lapply(.SD, rollmean, k = 3, na.pad = TRUE),
by = g, .SDcols = c("x", "y")]
Now, one can use the frollmean function in the data.table package for this.
library(data.table)
xy <- c("x", "y")
DT[, (xy) := lapply(.SD, frollmean, n = 3, fill = NA, align = "center"),
   by = g, .SDcols = xy]
Here, I am replacing the x and y columns by the rolling average.
# Data
set.seed(42)
DT <- data.table(x = rnorm(10), y = rlnorm(10), z = runif(10),
                 g = c("a", "b"), key = "g")
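To keep the original columns and add new rolling-average columns instead, frollmean() can be applied the same way (a sketch, assuming a freshly created DT; frollmean() accepts a list of columns and returns a list):
DT[, paste0("ravg_", xy) := frollmean(.SD, n = 3, fill = NA, align = "center"),
   by = g, .SDcols = xy]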
