The melt/cast functions in the reshape package are great, but I'm not sure there is a simple way to apply them when the measured variables are of different types. For example, here is a snippet from a data set in which each MD provides the gender and weight of three patients:
ID PT1 WT1 PT2 WT2 PT3 WT3
1 "M" 170 "M" 175 "F" 145
...
where the objective is to reshape so each row is a patient:
ID PTNUM GENDER WEIGHT
1 1 "M" 170
1 2 "M" 175
1 3 "F" 145
...
Using the reshape function in the stats package is one option I'm aware of, but I'm posting here in the hope that R users more experienced than I am will suggest other, better methods. Many thanks!
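For reference, a sketch of what that stats::reshape call might look like on the snippet above (assuming the wide data frame is called md and has the columns shown):
long <- reshape(md,
                idvar   = "ID",
                varying = list(c("PT1","PT2","PT3"),   # genders
                               c("WT1","WT2","WT3")),  # weights
                v.names = c("GENDER","WEIGHT"),
                timevar = "PTNUM",
                direction = "long")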
--
@Vincent Zoonekynd:
I liked your example a lot, so I generalized it to multiple variables.
# Sample data
n <- 5
d <- data.frame(
id = 1:n,
p1 = sample(c("M","F"),n,replace=TRUE),
q1 = sample(c("Alpha","Beta"),n,replace=TRUE),
w1 = round(runif(n,100,200)),
y1 = round(runif(n,100,200)),
p2 = sample(c("M","F"),n,replace=TRUE),
q2 = sample(c("Alpha","Beta"),n,replace=TRUE),
w2 = round(runif(n,100,200)),
y2 = round(runif(n,100,200)),
p3 = sample(c("M","F"),n,replace=TRUE),
q3 = sample(c("Alpha","Beta"),n,replace=TRUE),
w3 = round(runif(n,100,200)),
y3 = round(runif(n,100,200))
)
# Reshape the data.frame, one variable at a time
library(reshape)
d1 <- melt(d, id.vars="id", measure.vars=c("p1","p2","p3","q1","q2","q3"))
d2 <- melt(d, id.vars="id", measure.vars=c("w1","w2","w3","y1","y2","y3"))
# Split the variable name into the measure ("p","q","w","y") and the
# patient number; colsplit's default split works here because each
# name is a single letter followed by a single digit.
d1 <- cbind(d1, colsplit(d1$variable, names=c("var","ptnum")))
d2 <- cbind(d2, colsplit(d2$variable, names=c("var","ptnum")))
d1$variable <- NULL
d2$variable <- NULL
# Spread the character and the numeric measures back out into columns
d1c <- cast(d1, ... ~ var)
d2c <- cast(d2, ... ~ var)
# Join the two data.frames
d3 <- merge(d1c, d2c, by=c("id","ptnum"), all=TRUE)
--
Final thoughts: my motivation for this question was to learn about alternatives to the reshape package beyond the stats::reshape function. For the moment, I've reached the following conclusions:
Stick to stats::reshape when you can. As long as you remember to use a list rather than a simple vector for the "varying" argument, you'll stay out of trouble. For smaller data sets--a few thousand patient cases with fewer than 200 variables in total is what I was dealing with this time--the function's slower speed is a fair price for the simpler code.
To use the cast/melt approach in Hadley Wickham's reshape (or reshape2) package, you have to split your variables into two sets, one consisting of numeric variables and the other of character variables. When your data set is large enough that you find stats::reshape unbearably slow, I imagine the extra step of dividing your variables into two sets won't seem so bad.
You can process each variable separately and join the resulting two data.frames.
# Sample data
n <- 5
d <- data.frame(
id = 1:n,
pt1 = sample(c("M","F"),n,replace=TRUE),
wt1 = round(runif(n,100,200)),
pt2 = sample(c("M","F"),n,replace=TRUE),
wt2 = round(runif(n,100,200)),
pt3 = sample(c("M","F"),n,replace=TRUE),
wt3 = round(runif(n,100,200))
)
# Reshape the data.frame, one variable at a time
library(reshape2)
d1 <- melt(d,
id.vars="id", measure.vars=c("pt1","pt2","pt3"),
variable.name="patient", value.name="gender"
)
d2 <- melt(d,
id.vars="id", measure.vars=c("wt1","wt2","wt3"),
variable.name="patient", value.name="weight"
)
d1$patient <- as.numeric(gsub("pt", "", d1$patient))
d2$patient <- as.numeric(gsub("wt", "", d2$patient))
# Join the two data.frames
merge(d1, d2, by=c("id","patient"), all=TRUE)
I think the reshape function in the stats package is simplest. Here is a simple example; does this do what you want?
> tmp
id val val2 cat
1 1 1 14 a
2 1 2 13 b
3 2 3 12 b
4 2 4 11 a
> tmp2 <- tmp
> tmp2$t <- ave(tmp2$val, tmp2$id, FUN=seq_along)
> tmp2
id val val2 cat t
1 1 1 14 a 1
2 1 2 13 b 2
3 2 3 12 b 1
4 2 4 11 a 2
> reshape(tmp2, idvar='id', timevar='t', direction='wide')
id val.1 val2.1 cat.1 val.2 val2.2 cat.2
1 1 1 14 a 2 13 b
3 2 3 12 b 4 11 a
Hopefully your patients' sex is not changing between appointments, but there could be other categorical variables that do change between visits.
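A convenient property of stats::reshape is that the wide result stores the reshaping parameters as an attribute, so the long format can be recovered without re-specifying anything:
wide <- reshape(tmp2, idvar='id', timevar='t', direction='wide')
reshape(wide)   # uses the stored "reshapeWide" attribute to undo the reshape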
Related
I'm trying to merge 2 datasets on a key, but if there is no match then I want to try another key, and so on.
df1 <- data.frame(a=c(5,1,7,3),
b=c("T","T","T","F"),
c=c("F","T","F","F"))
df2 <- data.frame(x1=c(4,5,3,9),
x2=c(7,8,1,2),
x3=c("g","w","t","o"))
df1
a b c
1 5 T F
2 1 T T
3 7 T F
4 3 F F
df2
x1 x2 x3 ..
1 4 7 g ..
2 5 8 w ..
3 3 1 t ..
4 9 2 o ..
The desired output is something like
a b c x3 ..
1 5 T F w ..
2 1 T T t ..
3 7 T F g ..
4 3 F F t ..
I tried something along the lines of
dfm <- merge(df1,df2, by.x = "a", by.y = "x1", all.x = TRUE)
dfm <- merge(dfm,df2, by.x = "a", by.y = "x2", all.x = TRUE)
but that isn't quite right.
This really isn't a standard sort of merge. You can make it more standard by reshaping df2 so that you have just one field to merge on:
df2long <- rbind(
data.frame(a = df2$x1, df2[,-(1:2), drop=FALSE]),
data.frame(a = df2$x2, df2[,-(1:2), drop=FALSE])
)
dfm <- merge(df1, df2long, by = "a", all.x = TRUE)
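If there are more than two candidate key columns, the same reshaping idea generalizes; a sketch, assuming the key columns are named x1, x2, and so on:
key_cols <- c("x1", "x2")   # add further key column names here
other    <- setdiff(names(df2), key_cols)
df2long  <- do.call(rbind, lapply(key_cols, function(k)
  data.frame(a = df2[[k]], df2[, other, drop = FALSE])
))
dfm <- merge(df1, df2long, by = "a", all.x = TRUE)
Note that if a value of a matched more than one key column, merge() would return one row per match; deduplicate df2long (or dfm) first if the earliest key column should win.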
You could do something like this:
matches <- lapply(df2[, c("x1", "x2")], function(x) match(df1$a, x))
# finding matches in df2$x1 and df2$x2
# notice that the code below should work with any number of columns to be matched:
# you just need to add the names here eg. df2[, paste0("x", 1:100)]
matches
$x1
[1] 2 NA NA 3
$x2
[1] NA 3 1 NA
combo <- Reduce(function(a,b) "[<-"(a, is.na(a), b[is.na(a)]), matches)
# combining the matches on "first come first served" basis
combo
[1] 2 3 1 3
cbind(df1, df2[combo,])
a b c x1 x2 x3
2 5 T F 5 8 w
3 1 T T 3 1 t
1 7 T F 4 7 g
3.1 3 F F 3 1 t
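The "[<-"(a, is.na(a), b[is.na(a)]) call is just the functional spelling of in-place NA filling; an equivalent, more explicit version of the same step:
fill_na <- function(a, b) {
  a[is.na(a)] <- b[is.na(a)]   # positions still unmatched take b's match
  a
}
combo <- Reduce(fill_na, matches)   # earlier columns keep precedence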
If I understand correctly, the OP has requested to try matching a with x1 first and then, if that fails, to try matching a with x2. So any match of a with x1 should take precedence over a match of a with x2.
Unfortunately, the sample data set provided by the OP does not include a use case to prove this. Therefore, I have modified the sample dataset accordingly (see Data section).
The approach suggested here is to reshape df2 from wide to long format (much like MrFlick's answer) but to use a data.table join with the parameter mult = "first".
The columns of df2 to be treated as key columns, and their order of precedence, are controlled by the measure.vars parameter to melt(). After reshaping, melt() arranges the rows in the column order given in measure.vars:
library(data.table)
# define cols of df2 to use as key, in order of precedence
key_cols <- c("x1", "x2")
# reshape df2 from wide to long format
long <- melt(setDT(df2), measure.vars = key_cols, value.name = "a")
# join long with df1, pick first matches
result <- long[setDT(df1), on = "a", mult = "first"]
# clean up
setcolorder(result, names(df1))
result[, variable := NULL]
result
a b c x3
1: 5 T F w
2: 1 T T t
3: 7 T F g
4: 3 F F t
5: 0 F F <NA>
Please, note that the original row order of df1 has been preserved.
Also, note that the code works for an arbitrary number of key columns. The precedence of key columns can be easily changed. E.g., if the order is reversed, i.e., key_cols <- c("x2", "x1") matches of a with x2 will be picked first.
Data
Enhanced sample datasets:
df1 has an additional row with no match in df2.
df1 <- data.frame(a=c(5,1,7,3,0),
b=c("T","T","T","F","F"),
c=c("F","T","F","F","F"))
df1
a b c
1: 5 T F
2: 1 T T
3: 7 T F
4: 3 F F
5: 0 F F
df2 has an additional row to prove that a match in x1 takes precedence over a match in x2. The value 5 appears twice: In row 2 of column x1 and in row 5 of column x2.
df2 <- data.frame(x1=c(4,5,3,9,6),
x2=c(7,8,1,2,5),
x3=c("g","w","t","o","n"))
df2
x1 x2 x3
1: 4 7 g
2: 5 8 w
3: 3 1 t
4: 9 2 o
5: 6 5 n
Not sure I understood your question, but rather than merging repeatedly I'd compare the keys of the potential merge; if the number of shared keys is greater than 0, then you have a match. If you want to take the first column with a match, you can try this:
library(tidyr)
library(purrr)
(df1 <- data.frame(a=c(5,1,7,3),
b=c("T","T","T","F"),
c=c("F","T","F","F")) )
(df2 <- data.frame(x1=c(4,5,3,9),
x2=c(7,8,1,2),
x3=c("g","w","t","o")) )
FirstColMatch<-1:ncol(df2) %>%
map(~intersect(df1$a, df2[[.x]])) %>%
map(length) %>%
detect_index(function(x)x>0)
NewDF<-merge(df1,df2,by.x="a", by.y =names(df2)[FirstColMatch])
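One caveat: detect_index() finds the first column of df2 that shares any values with df1$a, and the merge then uses that single column for every row; it does not fall back to x2 for individual rows that fail to match in x1. With the sample data, x1 already intersects df1$a, so:
FirstColMatch   # 1: x1 is the first column sharing values with df1$a
NewDF           # merged on x1 only; rows of df1 with no x1 match are dropped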
I've noticed that aggregate() appears to return its result ordered by the grouping column(s). Is this a guarantee? Can this be relied upon in surrounding logic?
A couple of examples:
set.seed(1); df <- data.frame(group=sample(letters[1:3],10,replace=T),value=1:10);
aggregate(value~group,df,sum);
## group value
## 1 a 16
## 2 b 22
## 3 c 17
And with two groups (notice the second group is ordered first, then the first group to break ties):
set.seed(1); df <- data.frame(group1=sample(letters[1:3],10,replace=T),group2=sample(letters[4:6],10,replace=T),value=1:10);
aggregate(value~group1+group2,df,sum);
## group1 group2 value
## 1 a d 1
## 2 b d 2
## 3 b e 9
## 4 c e 10
## 5 a f 15
## 6 b f 11
## 7 c f 7
Note: I'm asking because I just came up with an answer for Aggregating while merging two dataframes in R which, at least in its current form at the time of writing, depends on aggregate() returning its result ordered by the grouping column.
Yes, as long as you understand the natural ordering of factors to be by their integer keys. You can see this in the code:
y <- as.data.frame(by, stringsAsFactors = FALSE)
... # y becomes the "integerized" dataframe of index vectors
grp <- rank(do.call(paste, c(lapply(rev(y), ident), list(sep = "."))),
ties.method = "min")
y <- y[match(sort(unique(grp)), grp, 0L), , drop = FALSE]
...
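A quick way to convince yourself that the ordering follows the factor's integer codes (i.e., its level order) rather than the values themselves is to supply non-alphabetical levels; a small sketch:
dd <- data.frame(group = factor(c("b","a","b","a"), levels = c("b","a")),
                 value = 1:4)
aggregate(value ~ group, dd, sum)
##   group value
## 1     b     4
## 2     a     6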
I have two data sets with some overlapping variables. One dataset is basically a subset of the other but needs an additional variable added based on some of the overlapping variables. For example
varA <- c(rep(c("a","b"), each=5))
blah <- c(11:20)
varB <- c(1:10)
speed <- rnorm(10)
dataset1 <- data.frame(varA,blah,varB,speed)
varA.2 <- c("a","a","b","b")
varB.2 <- c(2,10,11,7)
speed.2 <- rep(NA, 4)
dataset2 <- data.frame(varA.2, varB.2, speed.2)
dataset2
I would like the "speed.2" variable to contain the speed values for the lines where varA and varB are matching between the two sets.
I've tried something with "merge" but am having issues.
Thank you!
Maybe:
colnames(dataset2) <- gsub("\\..*","", colnames(dataset2))
library(dplyr)
left_join(dataset2[,-3],dataset1[,-2])
# Joining by: c("varA", "varB")
# varA varB speed
#1 a 2 -1.3243815
#2 a 10 NA
#3 b 11 NA
#4 b 7 -0.6026936
Or without changing the column names.
merge(dataset1[,-2],dataset2[,-3], by.x=c("varA","varB"), by.y=c("varA.2", "varB.2"), all.y=TRUE)
# varA varB speed
# 1 a 2 -0.6797753
# 2 a 10 NA
# 3 b 7 -2.1838454
# 4 b 11 NA
The speed values in the two outputs differ because the example data were generated without set.seed().
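To make such examples reproducible across runs, fix the RNG seed before generating the data; a minimal sketch:
set.seed(123)        # any fixed value; 123 is arbitrary
speed <- rnorm(10)   # now identical on every run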
You can use the match() function for the "where varA and varB are matching" part:
dataset2$speed.2 = dataset1[match(paste(dataset2$varA.2,dataset2$varB.2),
paste(dataset1$varA, dataset1$varB)),]$speed
dataset2
varA.2 varB.2 speed.2
1 a 2 0.3917783
2 a 10 NA
3 b 11 NA
4 b 7 1.3265439
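One caveat with building composite keys via paste(): if the pasted values can themselves contain the separator (a space by default), distinct combinations can collide. interaction() is a slightly safer way to build the key; a sketch of the same lookup:
k1 <- interaction(dataset2$varA.2, dataset2$varB.2, drop = TRUE)
k2 <- interaction(dataset1$varA,   dataset1$varB,   drop = TRUE)
dataset2$speed.2 <- dataset1$speed[match(k1, k2)]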
I have to make a set of selections that vary by day on this dataset (dat), which is composed of species (sp), day (day, in POSIXct), and area (ar):
sp day ar
A 1-Jan-00 2
B 1-Jan-00 6
C 2-Jan-00 2
A 2-Jan-00 1
D 2-Jan-00 4
E 2-Jan-00 12
F 3-Jan-00 8
A 4-Jan-00 3
G 4-Jan-00 2
B 4-Jan-00 1
I need to subset where species "A" occurs. However, the areas to be selected will vary by day, given by this matrix (dat.ar):
day ar.select
1-Jan-00 (1,6)
2-Jan-00 (1,12)
3-Jan-00 (4,8)
4-Jan-00 (3,12)
More specifically, for areas where species "A" occurs: on 1-Jan-00 I need only areas 1 and 6; for 2-Jan-00, areas 1 and 12; and so on.
As an example, the desired output on this example for this selection is given below:
sp day ar
A 2-Jan-00 1
A 4-Jan-00 3
I haven't had much success getting a for loop to work, as I am still learning R's semantics. I have a rough idea of what must be done, but I'm still struggling with the language. Here is a sketch of where I think this should go (it does not run as written):
dat1 = with(dat,sapply(day[sp=="A" & dat.ar$day.s[i] ],
function(x) ar == (ar[sp=="A" & day == x]==dat.ar$ar.select[j])
final=dat[rowSums(dat1) > 0, ]
I believe I need a for loop that goes through dat.ar, specifying the areas to be selected in dat, but despite my efforts I haven't gotten anywhere near a working version. I am not even sure whether combining sapply with a for loop is the right way to go about this.
In case someone wishes to reproduce the problem:
sp=c("A","B","C","A","D","E","F","A","G","B")
day=c("1-Jan-00", "1-Jan-00", "2-Jan-00", "2-Jan-00", "2-Jan-00",
"2-Jan-00", "3-Jan-00", "4-Jan-00", "4-Jan-00", "4-Jan-00")
day=as.POSIXct(day, format="%d-%b-%y")
ar=c(2,6,2,1,4,12,8,3,2,1)
dat= as.data.frame(cbind(sp, day, ar))
day.s=c("1-Jan-00", "2-Jan-00", "3-Jan-00", "4-jan-00")
day.s=as.POSIXct(day.s, format="%d-%b-%y")
a.s=c(1,1,4,3)
a.e=c(6,12,8,12)
ar.select=paste(a.s, a.e, sep=",")
dat.ar=cbind(day.s, ar.select)
Any help is much appreciated.
You could merge your table of conditions onto the original dataset and filter rows conditionally. Consider a1 and a2 to be like your sp and day values, and obs to be like your ar value.
library(data.table)
dataset <- data.table(
a1 = c("A","B","C","B","A","A","A","A"),
a2 = c("P","Q","Q","Q","R","R","P","Q"),
obs = c(3,2,3,4,2,4,8,0)
)
constraints <- data.table(
a1 = c("A","B","C","A","B","C","A","B","C"),
a2 = c("P","P","P","Q","Q","Q","R","R","R"),
lower = c(1,2,3,4,3,2,3,2,5),
upper = c(6,4,5,7,5,6,5,3,7)
)
checkingdataset <- merge(dataset,constraints, by = c("a1","a2"), all.x = TRUE)
checkingdataset[obs <= upper & obs >= lower, obs.keep := TRUE]
# a1 a2 obs lower upper obs.keep
#1: A P 3 1 6 TRUE
#2: A P 8 1 6 NA
#3: A Q 0 4 7 NA
#4: A R 2 3 5 NA
#5: A R 4 3 5 TRUE
#6: B Q 2 3 5 NA
#7: B Q 4 3 5 TRUE
#8: C Q 3 2 6 TRUE
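To finish, keep only the flagged rows and drop the helper columns; data.table treats NA in a logical subset as FALSE, so the unflagged rows fall away (a small extension of the sketch above):
result <- checkingdataset[obs.keep == TRUE, .(a1, a2, obs)]
result
#    a1 a2 obs
# 1:  A  P   3
# 2:  A  R   4
# 3:  B  Q   4
# 4:  C  Q   3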
First, I would not use as.data.frame(cbind(...)) to make your data.frames. Second, I would create dat.ar with much the same structure as dat. Third, I would then just use merge to get the result you are looking for.
dat <- data.frame(sp=c("A","B","C","A","D","E","F","A","G","B"),
day=c("1-Jan-00", "1-Jan-00", "2-Jan-00", "2-Jan-00",
"2-Jan-00", "2-Jan-00", "3-Jan-00", "4-Jan-00",
"4-Jan-00", "4-Jan-00"),
ar=c(2,6,2,1,4,12,8,3,2,1))
dat$day <- as.POSIXct(dat$day, format="%d-%b-%y")
day.s <- c("1-Jan-00", "2-Jan-00", "3-Jan-00", "4-jan-00")
day.s <- as.POSIXct(day.s, format="%d-%b-%y")
a.s <- c(1,1,4,3)
a.e <- c(6,12,8,12)
ar.select <- paste(a.s, a.e, sep=",")
dat.ar <- data.frame(sp = "A", day = day.s, ar = ar.select)
dat.ar <- cbind(dat.ar[-3],
read.csv(text = as.character(dat.ar$ar), header = FALSE))
library(reshape2)
dat.ar <- melt(dat.ar, id.vars=1:2, value.name="ar")
dat.ar
# sp day variable ar
# 1 A 2000-01-01 V1 1
# 2 A 2000-01-02 V1 1
# 3 A 2000-01-03 V1 4
# 4 A 2000-01-04 V1 3
# 5 A 2000-01-01 V2 6
# 6 A 2000-01-02 V2 12
# 7 A 2000-01-03 V2 8
# 8 A 2000-01-04 V2 12
merge(dat, dat.ar)
# sp day ar variable
# 1 A 2000-01-02 1 V1
# 2 A 2000-01-04 3 V1
Of course, I would just suggest that you make your dat.ar object in a more friendly manner to begin with. Why paste values together if you are going to separate them out later anyway? ;)
dat.ar <- data.frame(sp = "A",
day = c("1-Jan-00", "2-Jan-00", "3-Jan-00", "4-jan-00"),
a.s = c(1,1,4,3), a.e = c(6,12,8,12))
dat.ar$day <- as.POSIXct(dat.ar$day, format="%d-%b-%y")
library(reshape2)
dat.ar <- melt(dat.ar, id.vars=1:2, value.name="ar")
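The same final step as before then returns the two matching rows; the only difference is that the variable column is labelled a.s/a.e rather than V1/V2.
merge(dat, dat.ar)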
I have an aggregation problem which I cannot figure out how to perform efficiently in R.
Say I have the following data:
group1 <- c("a","b","a","a","b","c","c","c","c",
"c","a","a","a","b","b","b","b")
group2 <- c(1,2,3,4,1,3,5,6,5,4,1,2,3,4,3,2,1)
value <- c("apple","pear","orange","apple",
"banana","durian","lemon","lime",
"raspberry","durian","peach","nectarine",
"banana","lemon","guava","blackberry","grape")
df <- data.frame(group1,group2,value)
I am interested in sampling from the data frame df such that I randomly pick only a single row from each combination of factors group1 and group2.
As you can see, the results of table(df$group1,df$group2)
1 2 3 4 5 6
a 2 1 2 1 0 0
b 2 2 1 1 0 0
c 0 0 1 1 2 1
shows that some combinations are seen more than once, while others are never seen. For those that are seen more than once (e.g., group1="a" and group2=3), I want to randomly pick only one of the corresponding rows and return a new data frame that has only that subset of rows. That way, each possible combination of the grouping factors is represented by only a single row in the data frame.
One important aspect here is that my actual data sets can contain anywhere from 500,000 rows to >2,000,000 rows, so it is important to be mindful of performance.
I am relatively new at R, so I have been having trouble figuring out how to generate this structure correctly. One attempt looked like this (using the plyr package):
choice <- function(x, label) {
  cbind(x[sample(1:nrow(x), 1), ], data.frame(state = label))
}
df <- ddply(df[, c("group1","group2","value")],
            .(group1, group2),
            choice,        # the function defined above
            label = "test")
Note that in this case I am also adding an extra column (state) to the data frame, whose value is passed through the extra label argument to ddply. However, I killed this run after about 20 minutes.
In other cases, I have tried using aggregate or by or tapply, but I never know exactly what the specified function is getting, what it should return, or what to do with the result (especially for by).
I am trying to switch from Python to R for exploratory data analysis, but this type of aggregation is crucial for me. In Python, I can perform these operations very rapidly, but it is inconvenient because I have to generate a separate script/data structure for each different type of aggregation I want to perform.
I want to love R, so please help! Thanks!
Uri
Here is the plyr solution
set.seed(1234)
ddply(df, .(group1, group2), summarize,
value = value[sample(length(value), 1)])
This gives us
group1 group2 value
1 a 1 apple
2 a 2 nectarine
3 a 3 banana
4 a 4 apple
5 b 1 grape
6 b 2 blackberry
7 b 3 guava
8 b 4 lemon
9 c 3 durian
10 c 4 durian
11 c 5 raspberry
12 c 6 lime
EDIT: With a data frame that big, you are better off using data.table:
library(data.table)
dt = data.table(df)
dt[,list(value = value[sample(length(value), 1)]),'group1, group2']
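With current data.table versions, the same operation can also be spelled with .SD and .N, which avoids naming the value column explicitly (a sketch):
dt[, .SD[sample(.N, 1)], by = .(group1, group2)]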
EDIT 2: Performance comparison: data.table is ~15x faster
group1 = sample(letters, 1000000, replace = T)
group2 = sample(LETTERS, 1000000, replace = T)
value = runif(1000000, 0, 1)
df = data.frame(group1, group2, value)
dt = data.table(df)
f1_dtab = function() {
dt[,list(value = value[sample(length(value), 1)]),'group1, group2']
}
f2_plyr = function() {
  ddply(df, .(group1, group2), summarize,
        value = value[sample(length(value), 1)])
}
f3_by = function() {
  do.call(rbind, by(df, list(grp1 = df$group1, grp2 = df$group2),
                    FUN = function(x) x[sample(nrow(x), 1), ]))
}
library(rbenchmark)
benchmark(f1_dtab(), f2_plyr(), f3_by(), replications = 10)
test replications elapsed relative
f1_dtab() 10 4.764 1.00000
f2_plyr() 10 68.261 14.32851
f3_by() 10 67.369 14.14127
One more way:
with(df, tapply(value, list( group1, group2), length))
1 2 3 4 5 6
a 2 1 2 1 NA NA
b 2 2 1 1 NA NA
c NA NA 1 1 2 1
# Now use tapply to sample within groups.
# `resample` is from the ?sample help page; it avoids sample()'s
# surprising special case when a group contains a single numeric value.
resample <- function(x, ...) x[sample.int(length(x), ...)]
#Create a row index
df$idx <- 1:NROW(df)
rowidxs <- with(df, c(  # the `c` function will make a matrix into a vector
  tapply(idx, list(group1, group2),
         function(x) resample(x, 1))))
rowidxs
# [1] 1 5 NA 12 16 NA 3 15 6 4 14 10 NA NA 7 NA NA 8
df[rowidxs[!is.na(rowidxs)] , ]
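For completeness, a modern alternative outside the original answers: dplyr (version 1.0 or later) can sample one row per group directly; a sketch:
library(dplyr)
df %>%
  group_by(group1, group2) %>%
  slice_sample(n = 1) %>%
  ungroup()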