reshape wide to long using data.table with multiple columns - r

I have a data frame in wide format, shown below, and I want to reshape it from wide to long using data.table's melt function. In the simple case I could split the data in two and rbind the two datasets, but in my case there are multiple test(i) and testgr(i) columns, so there must be a better and more efficient way to do this. Thanks in advance.
from =>
id<-c("106E1258","106E2037","104E1182","105E1248","105E1470","10241247",
"10241703")
yr<-c(2017,2017,2015,2016,2016,2013,2013)
finalgr<-c(72,76,75,71,75,77,78)
test01<-c("R0560","R0066","R0308","R0129","R0354","R0483",
"R0503")
test01gr<-c(73,74,67,80,64,80,70)
test02<-c("R0660","R0266","R0302","R0139","R0324","R0383" ,
"R0503")
test02gr<-c(71,54,67,70,68,81,61)
dt<-data.frame(id=id,yr=yr,
finalgr=finalgr,
test01=test01,test01gr=test01gr,
test02=test02,test02gr=test02gr)
to=>
id2<-c("106E1258","106E1258","104E1182","104E1182")
yr2<-c(2017,2017,2015,2015)
finalgr<-c(72,72,75,75)
testid<-c("R0560","R0660","R0308","R0302")
testgr<-c(73,71,67,67)
dt2<-data.frame(id=id2,yr=yr2,finalgr=finalgr,testid=testid,testgr=testgr)

You should indeed use melt:
library(data.table)
setDT(dt)
melt(dt, id.vars = c('id', 'yr', 'finalgr'),
measure.vars = list(testid = c('test01', 'test02'),
testgr = c('test01gr', 'test02gr')))
# id yr finalgr variable testid testgr
# 1: 106E1258 2017 72 1 R0560 73
# 2: 106E2037 2017 76 1 R0066 74
# 3: 104E1182 2015 75 1 R0308 67
# 4: 105E1248 2016 71 1 R0129 80
# 5: 105E1470 2016 75 1 R0354 64
# 6: 10241247 2013 77 1 R0483 80
# 7: 10241703 2013 78 1 R0503 70
# 8: 106E1258 2017 72 2 R0660 71
# 9: 106E2037 2017 76 2 R0266 54
# 10: 104E1182 2015 75 2 R0302 67
# 11: 105E1248 2016 71 2 R0139 70
# 12: 105E1470 2016 75 2 R0324 68
# 13: 10241247 2013 77 2 R0383 81
# 14: 10241703 2013 78 2 R0503 61
If there are many more test columns, you can use patterns:
melt(dt, id.vars = c('id', 'yr', 'finalgr'),
measure.vars = patterns(testid = 'test[0-9]+$', testgr = 'test[0-9]+gr'))
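A variant of the patterns approach, as a minimal sketch. It assumes the test columns are always named testNN and testNNgr, anchors both regular expressions so each column can match only one pattern, and supplies the output column names through value.name. The two-row dt below is a cut-down copy of the question's data, used only for illustration.

```r
library(data.table)

# Cut-down copy of the question's data (two ids only)
dt <- data.table(
  id       = c("106E1258", "104E1182"),
  yr       = c(2017, 2015),
  finalgr  = c(72, 75),
  test01   = c("R0560", "R0308"),
  test01gr = c(73, 67),
  test02   = c("R0660", "R0302"),
  test02gr = c(71, 67)
)

# One pattern per output column; ^ and $ keep "test01" from matching "test01gr"
long <- melt(dt,
             id.vars      = c("id", "yr", "finalgr"),
             measure.vars = patterns("^test[0-9]+$", "^test[0-9]+gr$"),
             value.name   = c("testid", "testgr"))

# Put the rows of each id next to each other
setorder(long, id, variable)
print(long)
```

The variable column holds the position of the matched column within each pattern (1 for test01/test01gr, 2 for test02/test02gr), so it can be dropped once the rows are paired up.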

Related

Convert values using a conversion table R

I am currently running statistical models on ACT and SAT scores. To help clean my data, I want to convert the ACT scores into their SAT equivalents. I found the following table online:
ACT SAT
<dbl> <dbl>
1 36 1590
2 35 1540
3 34 1500
4 33 1460
5 32 1430
6 31 1400
7 30 1370
8 29 1340
9 28 1310
10 27 1280
I want to replace the column ACT_Composite with the number in the SAT column of the conversion table. For instance, if one row displays an ACT_Composite score of 35, I want to input 1540.
If anyone has ideas on how to accomplish this, I would greatly appreciate it.
In base R you can use merge directly:
#Reading score table
df <- read.table(header = TRUE, text ="ACT SAT
36 1590
35 1540
34 1500
33 1460
32 1430
31 1400
30 1370
29 1340
28 1310
27 1280")
#Setting seed to reproduce df1
set.seed(1234)
# Create a data.frame with 50 sample scores
df1 <- data.frame(ACT_Composite = sample(27:36, 50, replace = TRUE))
# left-join df1 with df with keys ACT_Composite and ACT
result <- merge(df1, df,
by.x = "ACT_Composite",
by.y = "ACT",
all.x = TRUE,
sort = FALSE)
#The first 6 values of result
head(result)
ACT_Composite SAT
1 31 1400
2 31 1400
3 31 1400
4 31 1400
5 31 1400
6 36 1590
In data.table you can also use merge:
library(data.table)
#Setting seed to reproduce df1
set.seed(1234)
# Create a data.table with 50 sample scores
df1 <- data.table(ACT_Composite = sample(27:36, 50, replace = TRUE))
# left-join df1 with df with keys ACT_Composite and ACT
result <- merge(df1, df,
by.x = "ACT_Composite",
by.y = "ACT",
all.x = TRUE,
sort = FALSE)
#The first 6 values of result
head(result)
ACT_Composite SAT
1: 36 1590
2: 32 1430
3: 31 1400
4: 35 1540
5: 31 1400
6: 32 1430
Alternatively, in data.table you can join directly with on:
df1 <- data.table(ACT_Composite = sample(27:36, 50, replace = TRUE))
setDT(df) # convert the look-up table df into a data.table
result <- df[df1, on = c(ACT = "ACT_Composite")]
# the join column takes its name from df (ACT), so rename it back
setnames(result, "ACT", "ACT_Composite")
head(result)
ACT_Composite SAT
1: 36 1590
2: 32 1430
3: 31 1400
4: 35 1540
5: 31 1400
6: 32 1430
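If the goal is to add the SAT value to the existing table rather than build a new one, an update join is another option. The following is a minimal sketch, not part of the answers above: the names lookup and scores are illustrative, and the score of 20 is included only to show that values missing from the conversion table come back as NA.

```r
library(data.table)

# Conversion table from the question
lookup <- data.table(
  ACT = 36:27,
  SAT = c(1590, 1540, 1500, 1460, 1430, 1400, 1370, 1340, 1310, 1280)
)

# Illustrative scores; 20 has no entry in the conversion table
scores <- data.table(ACT_Composite = c(35L, 27L, 36L, 20L))

# Update join: add SAT to scores by reference, matching ACT_Composite to ACT.
# i.SAT refers to the SAT column of the table in the i position (lookup).
scores[lookup, on = c(ACT_Composite = "ACT"), SAT := i.SAT]
print(scores)
```

Because := modifies scores in place, the row order of the original table is kept, which merge does not guarantee.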

Using str_split to fill rows down data frame with number ranges and multiple numbers

I have a dataframe with crop names and their respective FAO codes. Unfortunately, some crop categories, such as 'other cereals', have multiple FAO codes, ranges of FAO codes or even worse - multiple ranges of FAO codes.
Snippet of the dataframe with the different formats for FAO codes.
> FAOCODE_crops
SPAM_full_name FAOCODE
1 wheat 15
2 rice 27
8 other cereals 68,71,75,89,92,94,97,101,103,108
27 other oil crops 260:310,312:339
31 other fibre crops 773:821
Using the following code successfully breaks down these numbers,
unlist(lapply(unlist(strsplit(FAOCODE_crops$FAOCODE, ",")), function(x) eval(parse(text = x))))
[1] 15 27 56 44 79 79 83 68 71 75 89 92 94 97 101 103 108
... but I fail to merge these numbers back into the dataframe, where every FAOCODE gets its own row.
> FAOCODE_crops$FAOCODE <- unlist(lapply(unlist(strsplit(MAPSPAM_crops$FAOCODE, ",")), function(x) eval(parse(text = x))))
Error in `$<-.data.frame`(`*tmp*`, FAOCODE, value = c(15, 27, 56, 44, :
replacement has 571 rows, data has 42
I fully understand why it doesn't merge successfully, but I can't figure out a way to fill the table with a new row for each FAOCODE as idealized below:
SPAM_full_name FAOCODE
1 wheat 15
2 rice 27
8 other cereals 68
8 other cereals 71
8 other cereals 75
8 other cereals 89
And so on...
Any help is greatly appreciated!
We can use separate_rows to split FAOCODE on the commas. After that, we can loop through FAOCODE with map and ~eval(parse(text = .x)) to evaluate each number or number range. Finally, we can use unnest to expand the data frame.
library(tidyverse)
dat2 <- dat %>%
separate_rows(FAOCODE, sep = ",") %>%
mutate(FAOCODE = map(FAOCODE, ~eval(parse(text = .x)))) %>%
unnest(cols = FAOCODE)
dat2
# # A tibble: 140 x 2
# SPAM_full_name FAOCODE
# <chr> <dbl>
# 1 wheat 15
# 2 rice 27
# 3 other cereals 68
# 4 other cereals 71
# 5 other cereals 75
# 6 other cereals 89
# 7 other cereals 92
# 8 other cereals 94
# 9 other cereals 97
# 10 other cereals 101
# # ... with 130 more rows
DATA
dat <- read.table(text = " SPAM_full_name FAOCODE
1 wheat 15
2 rice 27
8 'other cereals' '68,71,75,89,92,94,97,101,103,108'
27 'other oil crops' '260:310,312:339'
31 'other fibre crops' '773:821'",
header = TRUE, stringsAsFactors = FALSE)
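The same expansion can be done in base R without eval(parse()), which may be preferable when the codes come from an untrusted file. This is a sketch of a swapped-in technique, not the answer's method: expand_codes is a hypothetical helper, and it assumes every entry is either a single integer or a from:to range.

```r
# Hypothetical helper: turn "260:262,312:313" into c(260, 261, 262, 312, 313)
expand_codes <- function(x) {
  parts <- strsplit(x, ",", fixed = TRUE)[[1]]
  unlist(lapply(parts, function(p) {
    bounds <- as.integer(strsplit(p, ":", fixed = TRUE)[[1]])
    # A range has two bounds; a single code has one
    if (length(bounds) == 2L) seq(bounds[1], bounds[2]) else bounds
  }))
}

# Small illustrative input in the same format as the question's data
dat <- data.frame(
  SPAM_full_name = c("wheat", "other oil crops"),
  FAOCODE        = c("15", "260:262,312:313"),
  stringsAsFactors = FALSE
)

codes <- lapply(dat$FAOCODE, expand_codes)

# Repeat each crop name once per expanded code
long <- data.frame(
  SPAM_full_name = rep(dat$SPAM_full_name, lengths(codes)),
  FAOCODE        = unlist(codes),
  stringsAsFactors = FALSE
)
print(long)
```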

Data.table: operation with group-shifted data

Consider the following data.table:
DT <- data.table(year = c(2011,2012,2013,2011,2012,2013,2011,2012,2013),
level = c(137,137,137,136,136,136,135,135,135),
valueIn = c(13,30,56,11,25,60,8,27,51))
I would like to have the following output:
DT <- data.table(year = c(2011,2012,2013,2011,2012,2013,2011,2012,2013),
level = c(137,137,137,136,136,136,135,135,135),
valueIn = c(13,30,56, 11,25,60, 8,27,51),
valueOut = c(12,27.5,58, 9.5,26,55.5, NA,NA,NA))
In other words, I want to calculate (valueIn[level] + valueIn[level-1]) / 2 within each year. For example, the first value is calculated like this: (13+11)/2=12.
For the moment, I do that with for loops, in which I create data.table's subsets for each level:
levelDtList <- list()
levels <- sort(DT$level, decreasing = FALSE)
for (this.level in levels) {
levelDt <- DT[level == this.level]
if (this.level == min(levels)) {
valueOut <- NA
} else {
levelM1Data <- levelDtList[[this.level - 1]]
valueOut <- (levelDt$valueIn + levelM1Data$valueIn) / 2
}
levelDt$valueOut <- valueOut
levelDtList[[this.level]] <- levelDt
}
datatable <- rbindlist(levelDtList)
This is ugly and quite slow, so I am looking for a better, faster, data.table-based solution.
Using the shift-function with type = 'lead' to get the next value, sum and divide by two:
DT[, valueOut := (valueIn + shift(valueIn, type = 'lead'))/2, by = year]
you get:
year level valueIn valueOut
1: 2011 137 13 12.0
2: 2012 137 30 27.5
3: 2013 137 56 58.0
4: 2011 136 11 9.5
5: 2012 136 25 26.0
6: 2013 136 60 55.5
7: 2011 135 8 NA
8: 2012 135 27 NA
9: 2013 135 51 NA
With all the parameters of the shift-function specified:
DT[, valueOut := (valueIn + shift(valueIn, n = 1L, fill = NA, type = 'lead'))/2, by = year]
We can also use shift with Reduce
DT[, valueOut := Reduce(`+`, shift(valueIn, type = "lead", 0:1))/2, by = year]
DT
# year level valueIn valueOut
#1: 2011 137 13 12.0
#2: 2012 137 30 27.5
#3: 2013 137 56 58.0
#4: 2011 136 11 9.5
#5: 2012 136 25 26.0
#6: 2013 136 60 55.5
#7: 2011 135 8 NA
#8: 2012 135 27 NA
#9: 2013 135 51 NA
This is easier to generalize, as shift can take a vector of 'n' values.
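To illustrate that generalization, here is a minimal sketch that averages each value with the next k - 1 levels of the same year. The window size k and the column name avg3 are assumptions made for the example; the data is the question's DT.

```r
library(data.table)

DT <- data.table(
  year    = rep(2011:2013, times = 3),
  level   = rep(c(137, 136, 135), each = 3),
  valueIn = c(13, 30, 56, 11, 25, 60, 8, 27, 51)
)

k <- 3L  # window size (assumed for illustration)

# shift() with a vector n returns a list of k shifted copies;
# Reduce(`+`, ...) adds them element-wise, and NA propagates
# wherever the window runs past the last level of a year.
DT[, avg3 := Reduce(`+`, shift(valueIn, n = 0:(k - 1L), type = "lead")) / k,
   by = year]
print(DT)
```

With three levels per year, only the first level of each year has a full window, so the remaining rows are NA.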
If you don't mind using dplyr, the year is the thing that relates your items, and the structure shown is representative of your real data, then this could work for you:
library(dplyr)
DT %>% group_by(year) %>% mutate(valueOut = (valueIn + lead(valueIn)) / 2)

creating new column after joining two data.tables

I have two data.tables, main and metrics, both keyed by cid
I want to add to table main the average of each of several values located in metrics.
However, I would like to filter by code, only averaging those rows in metrics with a given code.
> metrics
cid code DZ value1 value2
1: 1001 A 101 8 21
2: 1001 B 102 11 26
3: 1001 A 103 17 25
4: 1002 A 104 25 39
5: 1002 B 105 6 30
6: 1002 A 106 23 40
7: 1003 A 107 27 32
8: 1003 B 108 16 37
9: 1003 A 109 14 42
# DESIRED OUTPUT
> main
cid A.avg.val1 A.avg.val2 B.avg.val1 B.avg.val2
1: 1001 12.5 23.0 11 26
2: 1002 24.0 39.5 6 30
3: 1003 20.5 37.0 16 37
# SAMPLE DATA
set.seed(1)
main <- data.table(cid=1e3+1:3, key="cid")
metrics <- data.table(cid=rep(1e3+1:3, each=3), code=rep(c("A", "B", "A"), 3), DZ=101:109, value1=sample(30, 9), value2=sample(20:50, 9), key="cid")
code.filters <- c("A", "B")
These lines get the desired output, but I am having difficulty assigning the new columns back into main. (Also, doing it programmatically would be preferred.)
main[metrics[code==code.filters[[1]]]][, list(mean(c(value1))), by=cid]
main[metrics[code==code.filters[[1]]]][, list(mean(c(value2))), by=cid]
main[metrics[code==code.filters[[2]]]][, list(mean(c(value1))), by=cid]
main[metrics[code==code.filters[[2]]]][, list(mean(c(value2))), by=cid]
Additionally, can someone explain why the following line only takes the last value in each group?
main[metrics[ code=="A"], A.avg.val1 := mean(c(value1))]
You don't need main. You can get it directly from metrics as follows:
> tmp.dt <- metrics[, list(A.avg.val1 = mean(value1[code=="A"]),
A.avg.val2 = mean(value2[code=="A"]),
B.avg.val1 = mean(value1[code == "B"]),
B.avg.val2 = mean(value2[code == "B"])), by=cid]
# cid A.avg.val1 A.avg.val2 B.avg.val1 B.avg.val2
# 1: 1001 12.5 23.0 11 26
# 2: 1002 24.0 39.5 6 30
# 3: 1003 20.5 37.0 16 37
If you still want to subset with main just do:
main <- data.table(cid = c(1001:1002))
> tmp.dt[main]
# cid A.avg.val1 A.avg.val2 B.avg.val1 B.avg.val2
# 1: 1001 12.5 23.0 11 26
# 2: 1002 24.0 39.5 6 30
I would do this in two steps. First, get your means, then reshape the data
foo <- main[metrics]
bar <- foo[, list(val1 = mean(value1),
val2 = mean(value2)),
by=c('cid', 'code')]
library(reshape2)
bar.melt <- melt(bar, id.var=c('cid', 'code'))
dcast(data=bar.melt,
cid ~ code + variable)
But really, I'd probably leave the data in the "long" format because I find it much easier to work with!
Working off of @Arun's answer, the following gets the desired results:
invisible(
sapply(code.filters, function(cf)
main[metrics[code==cf, list(avgv1 = mean(value1), avgv2 = mean(value2)), by=cid],
paste0(cf, c(".avg.val1", ".avg.val2")) :=list(avgv1, avgv2)]
))
> main
cid A.avg.val1 A.avg.val2 B.avg.val1 B.avg.val2
1: 1001 12.5 23.0 11 26
2: 1002 24.0 39.5 6 30
3: 1003 20.5 37.0 16 37
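A programmatic alternative is to let data.table's dcast compute the per-code means and spread them into columns in one call, then merge the result into main. This is a sketch rather than any answerer's method; the values are typed in from the metrics table printed in the question, and the resulting column names are generated by dcast from the value columns and codes, so they differ from the A.avg.val1 style and may need renaming.

```r
library(data.table)

# Values copied from the metrics table printed in the question
metrics <- data.table(
  cid    = rep(1001:1003, each = 3),
  code   = rep(c("A", "B", "A"), times = 3),
  value1 = c(8, 11, 17, 25, 6, 23, 27, 16, 14),
  value2 = c(21, 26, 25, 39, 30, 40, 32, 37, 42)
)
main <- data.table(cid = 1001:1003)

# One row per cid, one column per (value column, code) pair,
# each cell holding the mean over the matching rows
wide <- dcast(metrics, cid ~ code,
              value.var     = c("value1", "value2"),
              fun.aggregate = mean)

# Attach the averages to main
main <- merge(main, wide, by = "cid")
print(main)
```

New codes in metrics produce new columns automatically, so no list of code filters has to be maintained.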

Reshaping a data frame --- changing rows to columns

Suppose that we have a data frame that looks like
set.seed(7302012)
county <- rep(letters[1:4], each=2)
state <- rep(LETTERS[1], times=8)
industry <- rep(c("construction", "manufacturing"), 4)
employment <- round(rnorm(8, 100, 50), 0)
establishments <- round(rnorm(8, 20, 5), 0)
data <- data.frame(state, county, industry, employment, establishments)
state county industry employment establishments
1 A a construction 146 19
2 A a manufacturing 110 20
3 A b construction 121 10
4 A b manufacturing 90 27
5 A c construction 197 18
6 A c manufacturing 73 29
7 A d construction 98 30
8 A d manufacturing 102 19
We'd like to reshape this so that each row represents a (state and) county, rather than a county-industry, with columns construction.employment, construction.establishments, and analogous versions for manufacturing. What is an efficient way to do this?
One way is to subset
construction <- data[data$industry == "construction", ]
names(construction)[4:5] <- c("construction.employment", "construction.establishments")
And similarly for manufacturing, then do a merge. This isn't so bad if there are only two industries, but imagine that there are 14; this process would become tedious (though made less so by using a for loop over the levels of industry).
Any other ideas?
This can be done in base R reshape, if I understand your question correctly:
reshape(data, direction="wide", idvar=c("state", "county"), timevar="industry")
# state county employment.construction establishments.construction
# 1 A a 146 19
# 3 A b 121 10
# 5 A c 197 18
# 7 A d 98 30
# employment.manufacturing establishments.manufacturing
# 1 110 20
# 3 90 27
# 5 73 29
# 7 102 19
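Base reshape names the new columns measure.industry, whereas the question asked for industry.measure (for example construction.employment). A minimal sketch of swapping the two parts with sub follows; it assumes the measure columns are exactly employment and establishments and uses a four-row subset of the question's data.

```r
# Four-row subset of the question's data
data <- data.frame(
  state          = "A",
  county         = rep(c("a", "b"), each = 2),
  industry       = rep(c("construction", "manufacturing"), times = 2),
  employment     = c(146, 110, 121, 90),
  establishments = c(19, 20, 10, 27),
  stringsAsFactors = FALSE
)

wide <- reshape(data, direction = "wide",
                idvar = c("state", "county"), timevar = "industry")

# Swap "employment.construction" into "construction.employment";
# names that do not match the pattern (state, county) are left unchanged
names(wide) <- sub("^(employment|establishments)\\.(.+)$", "\\2.\\1", names(wide))
print(wide)
```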
Also using the reshape package:
library(reshape)
m <- reshape::melt(data)
cast(m, state + county~...)
Yielding:
> cast(m, state + county~...)
state county construction_employment construction_establishments manufacturing_employment manufacturing_establishments
1 A a 146 19 110 20
2 A b 121 10 90 27
3 A c 197 18 73 29
4 A d 98 30 102 19
I personally use the reshape package, so I forgot there was a reshape2 package (also by Wickham) and probably should have shown this with it. The code is slightly different:
library(reshape2)
m <- reshape2::melt(data)
dcast(m, state + county~...)
