I have a data table of the form
ID REGION INCOME_BAND RESIDENCY_YEARS
1 SW Under £5,000 10-15
2 Wales Over £70,000 1-5
3 Center £15,000-£19,999 6-9
4 SE £15,000-£19,999 15-19
5 North £15,000-£19,999 10-15
6 North £15,000-£19,999 6-9
created by
exp = data.table(
ID = c(1,2,3,4,5,6),
REGION=c("SW", "Wales", "Center", "SE", "North", "North"),
INCOME_BAND = c("Under £5,000", "Over £70,000", "£15,000-£19,999", "£15,000-£19,999", "£15,000-£19,999", "£15,000-£19,999"),
RESIDENCY_YEARS = c("10-15","1-5","6-9","15-19","10-15", "6-9"))
I would like to transform this into a wide table of indicator columns, one per value.
I've managed to perform the majority of the work with dcast:
exp.dcast = dcast(exp,ID~REGION+INCOME_BAND+RESIDENCY_YEARS, fun=length,
value.var=c('REGION', 'INCOME_BAND', 'RESIDENCY_YEARS'))
However I need some help creating sensible column headings.
Currently I have
["ID"
 "REGION.1_Center_£15,000-£19,999_6-9"
 "REGION.1_North_£15,000-£19,999_10-15"
 "REGION.1_North_£15,000-£19,999_6-9"
 "REGION.1_SE_£15,000-£19,999_15-19"
 "REGION.1_SW_Under £5,000_10-15"
 "REGION.1_Wales_Over £70,000_1-5"
 "INCOME_BAND.1_Center_£15,000-£19,999_6-9"
 "INCOME_BAND.1_North_£15,000-£19,999_10-15"
 "INCOME_BAND.1_North_£15,000-£19,999_6-9"
 "INCOME_BAND.1_SE_£15,000-£19,999_15-19"
 "INCOME_BAND.1_SW_Under £5,000_10-15"
 "INCOME_BAND.1_Wales_Over £70,000_1-5"
 "RESIDENCY_YEARS.1_Center_£15,000-£19,999_6-9"
 "RESIDENCY_YEARS.1_North_£15,000-£19,999_10-15"
 "RESIDENCY_YEARS.1_North_£15,000-£19,999_6-9"
 "RESIDENCY_YEARS.1_SE_£15,000-£19,999_15-19"
 "RESIDENCY_YEARS.1_SW_Under £5,000_10-15"
 "RESIDENCY_YEARS.1_Wales_Over £70,000_1-5"]
And I would like the column headings to be
ID SW Wales Center SE North Under £5,000 Over £70,000 £15,000-£19,999 1-5 6-9 10-15 15-19
Could anybody advise?
This apparently simple question is not easy to answer, so we will proceed step by step.
First, the OP has tried to reshape multiple value columns simultaneously which creates an unwanted cross product of all available combinations.
In order to treat all values in the same way, we need to melt() all value columns first before reshaping:
melt(exp, id.vars = "ID")[, dcast(.SD, ID ~ value, length)]
ID 1-5 10-15 15-19 6-9 £15,000-£19,999 Center North Over £70,000 SE SW Under £5,000 Wales
1: 1 0 1 0 0 0 0 0 0 0 1 1 0
2: 2 1 0 0 0 0 0 0 1 0 0 0 1
3: 3 0 0 0 1 1 1 0 0 0 0 0 0
4: 4 0 0 1 0 1 0 0 0 1 0 0 0
5: 5 0 1 0 0 1 0 1 0 0 0 0 0
6: 6 0 0 0 1 1 0 1 0 0 0 0 0
Now, the result has 13 columns instead of 19 and the columns are named by the respective value as requested.
Unfortunately, the columns appear in the wrong order because they are sorted alphabetically. There are two approaches to change the order:
Change order of columns after reshaping
The setcolorder() function reorders the columns of a data.table in place, i.e. without copying:
# define column order = order of values
col_order <- c("North", "Wales", "Center", "SW", "SE", "Under £5,000", "£15,000-£19,999", "Over £70,000", "1-5", "6-9", "10-15", "15-19")
melt(exp, id.vars = "ID")[, dcast(.SD, ID ~ value, length)][
# reorder columns
, setcolorder(.SD, c("ID", col_order))]
ID North Wales Center SW SE Under £5,000 £15,000-£19,999 Over £70,000 1-5 6-9 10-15 15-19
1: 1 0 0 0 1 0 1 0 0 0 0 1 0
2: 2 0 1 0 0 0 0 0 1 1 0 0 0
3: 3 0 0 1 0 0 0 1 0 0 1 0 0
4: 4 0 0 0 0 1 0 1 0 0 0 0 1
5: 5 1 0 0 0 0 0 1 0 0 0 1 0
6: 6 1 0 0 0 0 0 1 0 0 1 0 0
Now, all REGION columns appear first, followed by INCOME_BAND and RESIDENCY_YEARS columns in the specified order.
Set factor levels before reshaping
If value is turned into a factor with appropriately ordered factor levels dcast() will use the factor levels for ordering the columns:
melt(exp, id.vars = "ID")[, value := factor(value, col_order)][
, dcast(.SD, ID ~ value, length)]
ID North Wales Center SW SE Under £5,000 £15,000-£19,999 Over £70,000 1-5 6-9 10-15 15-19
1: 1 0 0 0 1 0 1 0 0 0 0 1 0
2: 2 0 1 0 0 0 0 0 1 1 0 0 0
3: 3 0 0 1 0 0 0 1 0 0 1 0 0
4: 4 0 0 0 0 1 0 1 0 0 0 0 1
5: 5 1 0 0 0 0 0 1 0 0 0 1 0
6: 6 1 0 0 0 0 0 1 0 0 1 0 0
Set factor levels before reshaping - lazy version
If it is sufficient to have the columns grouped by REGION, INCOME_BAND, and RESIDENCY_YEARS, then we can use a shortcut to avoid specifying each value in col_order. The fct_inorder() function from the forcats package reorders factor levels by their first appearance in a vector:
melt(exp, id.vars = "ID")[, value := forcats::fct_inorder(value)][
    , dcast(.SD, ID ~ value, length)]
ID SW Wales Center SE North Under £5,000 Over £70,000 £15,000-£19,999 10-15 1-5 6-9 15-19
1: 1 1 0 0 0 0 1 0 0 1 0 0 0
2: 2 0 1 0 0 0 0 1 0 0 1 0 0
3: 3 0 0 1 0 0 0 0 1 0 0 1 0
4: 4 0 0 0 1 0 0 0 1 0 0 0 1
5: 5 0 0 0 0 1 0 0 1 1 0 0 0
6: 6 0 0 0 0 1 0 0 1 0 0 1 0
This works because the output of melt() is ordered by variable:
melt(exp, id.vars = "ID")
ID variable value
1: 1 REGION SW
2: 2 REGION Wales
3: 3 REGION Center
4: 4 REGION SE
5: 5 REGION North
6: 6 REGION North
 7: 1 INCOME_BAND Under £5,000
 8: 2 INCOME_BAND Over £70,000
 9: 3 INCOME_BAND £15,000-£19,999
10: 4 INCOME_BAND £15,000-£19,999
11: 5 INCOME_BAND £15,000-£19,999
12: 6 INCOME_BAND £15,000-£19,999
13: 1 RESIDENCY_YEARS 10-15
14: 2 RESIDENCY_YEARS 1-5
15: 3 RESIDENCY_YEARS 6-9
16: 4 RESIDENCY_YEARS 15-19
17: 5 RESIDENCY_YEARS 10-15
18: 6 RESIDENCY_YEARS 6-9
Related
For a dataset similar to the one below, I need N-level dummy variables. I use dummyVars() from the caret package.
As you can see, the column names ignore the sep="-" argument, and they contain dots rather than the < or > signs.
df <- data.frame(fruit=as.factor(c("apple", "orange","orange", "carrot", "apple")),
st=as.factor(c("CA", "MN","MN", "NY", "NJ")),
wt=as.factor(c("<2","2-4",">4","2-4","<2")),
buy=c(1,1,0,1,0))
fruit st wt buy
1 apple CA <2 1
2 orange MN 2-4 1
3 orange MN >4 0
4 carrot NY 2-4 1
5 apple NJ <2 0
library(caret)
dmy <- dummyVars(buy~ ., data = df, sep="-")
df2 <- data.frame(predict(dmy, newdata = df))
df2
fruit.apple fruit.carrot fruit.orange st.CA st.MN st.NJ st.NY wt..2 wt..4 wt.2.4
1 1 0 0 1 0 0 0 1 0 0
2 0 0 1 0 1 0 0 0 0 1
3 0 0 1 0 1 0 0 0 1 0
4 0 1 0 0 0 0 1 0 0 1
5 1 0 0 0 0 1 0 1 0 0
I am confused why dummyVars() is not turning the actual levels into parts of the column names, and why it is ignoring the separator argument.
I would appreciate any hint on what I am doing wrong!
EDIT: for future readers :) ! Per akrun's note, the argument to data.frame() below solved the problem.
df2 <- data.frame(predict(dmy, newdata = df), check.names = FALSE)
fruit-apple fruit-carrot fruit-orange st-CA st-MN st-NJ st-NY wt-<2 wt->4 wt-2-4
1 1 0 0 1 0 0 0 1 0 0
2 0 0 1 0 1 0 0 0 0 1
3 0 0 1 0 1 0 0 0 1 0
4 0 1 0 0 0 0 1 0 0 1
5 1 0 0 0 0 1 0 1 0 0
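For context on why this works: the mangling does not come from dummyVars() but from data.frame(), which by default passes column names through make.names() to force syntactically valid R names. A minimal illustration:

```r
# make.names() replaces characters that are illegal in R names with "."
make.names(c("fruit-apple", "wt-<2", "wt->4"))
#> "fruit.apple" "wt..2" "wt..4"
```

Passing check.names = FALSE tells data.frame() to skip this step and keep the names produced by predict() as-is.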
I need to recode a data set of test responses for use in another application (a program called BLIMP that imputes missing values). Specifically, I need to represent the test items and subscale assignments with dummy codes.
Here I create a data frame that holds the responses to a 10-item test for two persons in a nested format. These data are a simplified version of the actual input table.
library(tidyverse)
df <- tibble(
person = rep(101:102, each = 10),
item = as.factor(rep(1:10, 2)),
response = sample(1:4, 20, replace = T),
scale = as.factor(rep(rep(1:2, each = 5), 2))
) %>% mutate(
scale_last = case_when(
as.integer(scale) != lead(as.integer(scale)) | is.na(lead(as.integer(scale))) ~ 1,
TRUE ~ NA_real_
)
)
The columns of df contain:
person: ID numbers for the persons (10 rows for each person)
item: test items 1-10 for each person. Note how the items are nested within each person.
response: score for each item
scale: the test has two subscales. Items 1-5 are assigned to subscale 1, and items 6-10 are assigned to subscale 2.
scale_last: a code of 1 in this column indicates that the item is the last item in its assigned subscale. This characteristic becomes important below.
I then create dummy codes for the items using the recipes package.
library(recipes)
dum <- df %>%
recipe(~ .) %>%
step_dummy(item, one_hot = T) %>%
prep(training = df) %>%
bake(new_data = df)
print(dum, width = Inf)
# person response scale scale_last item_X1 item_X2 item_X3 item_X4 item_X5 item_X6 item_X7
# <int> <int> <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 101 2 1 NA 1 0 0 0 0 0 0
# 2 101 3 1 NA 0 1 0 0 0 0 0
# 3 101 3 1 NA 0 0 1 0 0 0 0
# 4 101 1 1 NA 0 0 0 1 0 0 0
# 5 101 1 1 1 0 0 0 0 1 0 0
# 6 101 1 2 NA 0 0 0 0 0 1 0
# 7 101 3 2 NA 0 0 0 0 0 0 1
# 8 101 4 2 NA 0 0 0 0 0 0 0
# 9 101 2 2 NA 0 0 0 0 0 0 0
#10 101 4 2 1 0 0 0 0 0 0 0
#11 102 2 1 NA 1 0 0 0 0 0 0
#12 102 1 1 NA 0 1 0 0 0 0 0
#13 102 2 1 NA 0 0 1 0 0 0 0
#14 102 3 1 NA 0 0 0 1 0 0 0
#15 102 2 1 1 0 0 0 0 1 0 0
#16 102 1 2 NA 0 0 0 0 0 1 0
#17 102 4 2 NA 0 0 0 0 0 0 1
#18 102 2 2 NA 0 0 0 0 0 0 0
#19 102 4 2 NA 0 0 0 0 0 0 0
#20 102 3 2 1 0 0 0 0 0 0 0
# item_X8 item_X9 item_X10
# <dbl> <dbl> <dbl>
# 1 0 0 0
# 2 0 0 0
# 3 0 0 0
# 4 0 0 0
# 5 0 0 0
# 6 0 0 0
# 7 0 0 0
# 8 1 0 0
# 9 0 1 0
#10 0 0 1
#11 0 0 0
#12 0 0 0
#13 0 0 0
#14 0 0 0
#15 0 0 0
#16 0 0 0
#17 0 0 0
#18 1 0 0
#19 0 1 0
#20 0 0 1
The output shows the item dummy codes represented in the columns with the item_ prefix. For downstream processing, I need a further level of recoding. Within each subscale, the items must be dummy-coded relative to the last item of the subscale. Here’s where the scale_last variable comes into play; this variable identifies the rows in the output that need to be recoded.
For example, the first of these rows is row 5, the row for the last item (item 5) in subscale 1 for person 101. In this row the value of column item_X5 needs to be recoded from 1 to 0. In the next row to be recoded (row 10), it is the value of item_X10 that needs to be recoded from 1 to 0. And so on.
I’m struggling for the right combination of dplyr verbs to accomplish this. What’s tripping me up is the need to isolate specific cells within specific rows to be recoded.
Thanks in advance for any help!
We can use mutate_at() to replace the values in the "item" columns with 0 where scale_last == 1:
library(dplyr)
dum %>% mutate_at(vars(starts_with("item")), ~replace(., scale_last == 1, 0))
# A tibble: 20 x 14
# person response scale scale_last item_X1 item_X2 item_X3 item_X4 item_X5
# <int> <int> <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 101 2 1 NA 1 0 0 0 0
# 2 101 3 1 NA 0 1 0 0 0
# 3 101 1 1 NA 0 0 1 0 0
# 4 101 1 1 NA 0 0 0 1 0
# 5 101 3 1 1 0 0 0 0 0
# 6 101 4 2 NA 0 0 0 0 0
# 7 101 4 2 NA 0 0 0 0 0
# 8 101 3 2 NA 0 0 0 0 0
# 9 101 2 2 NA 0 0 0 0 0
#10 101 4 2 1 0 0 0 0 0
#11 102 2 1 NA 1 0 0 0 0
#12 102 1 1 NA 0 1 0 0 0
#13 102 4 1 NA 0 0 1 0 0
#14 102 4 1 NA 0 0 0 1 0
#15 102 4 1 1 0 0 0 0 0
#16 102 3 2 NA 0 0 0 0 0
#17 102 4 2 NA 0 0 0 0 0
#18 102 1 2 NA 0 0 0 0 0
#19 102 4 2 NA 0 0 0 0 0
#20 102 4 2 1 0 0 0 0 0
# … with 5 more variables: item_X6 <dbl>, item_X7 <dbl>, item_X8 <dbl>,
# item_X9 <dbl>, item_X10 <dbl>
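Note that mutate_at() is superseded in recent dplyr releases. Assuming the same dum tibble as above, the replacement can be written with across():

```r
library(dplyr)

# zero out every item_* dummy on the rows flagged as the last item of a subscale
dum %>%
  mutate(across(starts_with("item"), ~ replace(.x, scale_last == 1, 0)))
```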
In base R, we can use lapply
cols <- grep("^item", names(dum))
dum[cols] <- lapply(dum[cols], function(x) replace(x, dum$scale_last == 1, 0))
Appreciate your help. I need to split a column of delimited values into one column per distinct value, with each new column filled with 1 or 0 depending on whether that value is present in the row.
state <-
c('ACT',
'ACT|NSW|NT|QLD|SA|VIC',
'ACT|NSW|NT|QLD|TAS|VIC|WA',
'ACT|NSW|NT|SA|TAS|VIC',
'ACT|NSW|QLD|VIC',
'ACT|NSW|SA',
'ACT|NSW|NT|QLD|TAS|VIC|WA|SA',
'NSW',
'NT',
'NT|SA',
'QLD',
'SA',
'TAS',
'VIC',
'WA')
df <- data.frame(id = 1:length(state),state)
id state
1 1 ACT
2 2 ACT|NSW|NT|QLD|SA|VIC
3 3 ACT|NSW|NT|QLD|TAS|VIC|WA
4 4 ACT|NSW|NT|SA|TAS|VIC
...
Desired state is a dataframe with the same dimensions plus the additional columns based on state populated with a 1 or 0 depending on the rows.
tq,
James
You can do something like this:
library(tidyr)
library(dplyr)
df %>%
separate_rows(state) %>%
unique() %>% # in case you have duplicated states for a single id
mutate(exist = 1) %>%
spread(state, exist, fill=0)
# id ACT NSW NT QLD SA TAS VIC WA
#1 1 1 0 0 0 0 0 0 0
#2 2 1 1 1 1 1 0 1 0
#3 3 1 1 1 1 0 1 1 1
#4 4 1 1 1 0 1 1 1 0
#5 5 1 1 0 1 0 0 1 0
#6 6 1 1 0 0 1 0 0 0
#7 7 1 1 1 1 1 1 1 1
#8 8 0 1 0 0 0 0 0 0
#9 9 0 0 1 0 0 0 0 0
#10 10 0 0 1 0 1 0 0 0
#11 11 0 0 0 1 0 0 0 0
#12 12 0 0 0 0 1 0 0 0
#13 13 0 0 0 0 0 1 0 0
#14 14 0 0 0 0 0 0 1 0
#15 15 0 0 0 0 0 0 0 1
separate_rows() splits state and converts the data frame to long format;
mutate() adds a constant-value column for reshaping purposes;
spread() transforms the result to wide format.
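spread() is retired in current tidyr; assuming the same df, an equivalent with pivot_wider() looks like this:

```r
library(tidyr)
library(dplyr)

df %>%
  separate_rows(state) %>%
  distinct() %>%                 # in case of duplicated states for a single id
  mutate(exist = 1) %>%
  pivot_wider(names_from = state, values_from = exist, values_fill = 0)
```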
Here is a base R option: split the 'state' column by |, convert the list of vectors into a two-column data.frame (stack), get the frequencies with table, and cbind the result with the first column of 'df':
cbind(df[1], as.data.frame.matrix(
  table(stack(setNames(strsplit(as.character(df$state), "[|]"), df$id))[2:1])))
# id ACT NSW NT QLD SA TAS VIC WA
#1 1 1 0 0 0 0 0 0 0
#2 2 1 1 1 1 1 0 1 0
#3 3 1 1 1 1 0 1 1 1
#4 4 1 1 1 0 1 1 1 0
#5 5 1 1 0 1 0 0 1 0
#6 6 1 1 0 0 1 0 0 0
#7 7 1 1 1 1 1 1 1 1
#8 8 0 1 0 0 0 0 0 0
#9 9 0 0 1 0 0 0 0 0
#10 10 0 0 1 0 1 0 0 0
#11 11 0 0 0 1 0 0 0 0
#12 12 0 0 0 0 1 0 0 0
#13 13 0 0 0 0 0 1 0 0
#14 14 0 0 0 0 0 0 1 0
#15 15 0 0 0 0 0 0 0 1
I have a csv file containing 600k lines and 3 columns: the first contains a disease name, the second a gene, and the third a number, something like this:
cholera xx45 12
Cancer xx65 1
cholera xx65 0
I would like to make a DTM (document-term matrix) using R. I've been trying to use Corpus() from the tm library, but it doesn't collapse the duplicated diseases, so the result stays around 600k entries; my data have only roughly 4k diseases and 16k genes, so the names are redundant across rows. I'd love to understand how to transform this file into a DTM.
I'm sorry for not being more precise; I'm totally new to computer-science things, as a bio guy :)
Cheers!
If you're not concerned with the number in the third column, then you can accomplish what I think you're trying to do using only the first two columns (gene and disease).
Example with some simulated data:
library(data.table)
# Create a table with 100k combinations of ~6.5k different genes and 40 different diseases
df <- data.frame(gene = sapply(1:100000, function(x) paste(c(sample(LETTERS, size = 2), sample(10, size = 1)), collapse = "")),
                 disease = sample(40, size = 100000, replace = TRUE))
table(df) creates a large matrix, nGenes rows long and nDiseases columns wide. Looking at just the first 6 rows (because it is so large and sparse):
head(table(df))
disease
gene 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27
AB10 0 0 1 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0
AB2 1 1 0 0 0 0 1 0 0 0 0 0 0 0 2 0 0 2 0 0 0 0 1 0 1 0 1
AB3 0 1 0 0 2 1 1 0 0 1 0 0 0 0 0 2 1 0 0 1 0 0 1 0 3 0 1
AB4 0 0 1 0 0 1 0 2 1 1 0 1 0 0 1 1 1 1 0 1 0 2 0 0 0 1 1
AB5 0 1 0 1 0 0 2 2 0 1 1 1 0 1 0 0 2 0 0 0 0 0 0 1 1 1 0
AB6 0 0 2 0 2 1 0 0 0 0 0 0 0 0 0 0 1 0 1 1 0 1 0 0 0 0 0
disease
gene 28 29 30 31 32 33 34 35 36 37 38 39 40
AB10 0 0 1 2 1 0 0 1 0 0 0 0 0
AB2 0 0 0 0 0 0 0 0 0 0 0 0 0
AB3 0 0 1 1 1 0 0 0 0 0 1 1 0
AB4 0 0 1 2 1 1 1 1 1 2 0 3 1
AB5 0 2 1 1 0 0 3 4 0 1 1 0 2
AB6 0 0 0 0 0 0 0 1 0 0 0 0 0
Alternatively, you can exclude the counts of 0 and only include combinations that actually exist. Easy aggregation can be done with data.table, e.g. (continuing from the above example)
library(data.table)
dt <- data.table(df)
dt[, .N, by=list(gene, disease)]
which gives a frequency table like the following:
gene disease N
1: HA5 20 2
2: RF9 10 3
3: SD8 40 2
4: JA7 35 4
5: MJ2 1 2
---
75872: FR10 26 1
75873: IC5 40 1
75874: IU2 20 1
75875: IG5 13 1
75876: DW7 21 1
I have a data frame with the structure below, from which I am looking to spread the categorical variables into indicator columns. The intent is to find the weighted mix of the variables.
data <- read.table(header=T, text='
subject weight sex test
1 2 M control
2 3 F cond1
3 2 F cond2
4 4 M control
5 3 F control
6 2 F control
')
data
Expected output:
subject weight control_F control_M cond1_F cond1_M cond2_F cond2_M
1 2 0 1 0 0 0 0
2 3 0 0 1 0 0 0
3 2 0 0 0 0 1 0
4 4 0 1 0 0 0 0
5 3 1 0 0 0 0 0
6 2 1 0 0 0 0 0
I tried using a combination of ifelse and cut, but just couldn't produce the output.
Any ideas on how I can do this?
TIA
You may use
model.matrix(~ subject + weight + sex:test - 1, data)
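model.matrix() names the interaction columns in the sexF:testcond1 style. If the control_F-style headings from the question are wanted, the names can be rewritten afterwards; the sub() pattern below is an assumption about those default names, so check colnames(mm) first:

```r
mm <- model.matrix(~ subject + weight + sex:test - 1, data)
# turn e.g. "sexF:testcond1" into "cond1_F"
colnames(mm) <- sub("^sex(.):test(.*)$", "\\2_\\1", colnames(mm))
as.data.frame(mm)
```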
I think model.matrix is most natural here (see @Julius' answer), but here's an alternative:
library(data.table)
setDT(data)
dcast(data, subject+weight~test+sex, fun=length, drop=c(TRUE,FALSE))
subject weight cond1_F cond1_M cond2_F cond2_M control_F control_M
1: 1 2 0 0 0 0 0 1
2: 2 3 1 0 0 0 0 0
3: 3 2 0 0 1 0 0 0
4: 4 4 0 0 0 0 0 1
5: 5 3 0 0 0 0 1 0
6: 6 2 0 0 0 0 1 0
To get the columns in the "right" order (with the control first), set factor levels before casting:
data[, test := relevel(factor(test), "control")]
dcast(data, subject+weight~test+sex, fun=length, drop=c(TRUE,FALSE))
subject weight control_F control_M cond1_F cond1_M cond2_F cond2_M
1: 1 2 0 1 0 0 0 0
2: 2 3 0 0 1 0 0 0
3: 3 2 0 0 0 0 1 0
4: 4 4 0 1 0 0 0 0
5: 5 3 1 0 0 0 0 0
6: 6 2 1 0 0 0 0 0
(Note: reshape2's dcast isn't so good here, since its drop option applies to both rows and cols.)