reshaping a dataframe for prediction - r

I just picked up the reshape package today and I'm having some trouble understanding how it works.
I have the following dataframe:
name workoutnum time weight raceid final_position
tommy 1 12 140 1 2
tommy 2 14 140 1 2
tommy 3 11 140 1 2
sarah 1 10 115 1 1
sarah 2 10 115 1 1
sarah 3 11 115 1 1
sarah 4 15 115 1 1
How would I put all of this in one row per person? The dataframe would then look like:
name workoutnum1 workoutnum2 workoutnum3 workoutnum4 time1 time2 time3 time4 weight raceid final_position
tommy 1 1 1 0 12 14 11 NA 140 1 2
sarah 1 1 1 1 10 10 11 15 115 1 1
So all columns would be attached to the workout values.
Is this even the proper way to do it?

reshape seems like a natural part of what you want to do, but won't get you all the way there.
Here's a reshape2 approach that fully melts the data, then casts it back to data.frame, with some tweaks along the way to get the desired output.
Note that in the call to melt(), the variables in the id.vars argument will remain wide. Then in dcast(), the variable that'll be cast wide is on the RHS of the ~.
library(reshape2)
library(dplyr)
# fully melt the data
d_melt <- melt(d, id.vars = c("name", "raceid", "position", "weight"))
# index the variables within name and variable
d_melt <- d_melt %>%
group_by(name, variable) %>%
mutate(i = row_number(),
wide_variable = paste0(variable, i))
# cast as wide
d_wide <- dcast(d_melt, name + raceid + position + weight ~ wide_variable, value.var = "value")
# replace the workoutnum indices with indicators for missingness
d_wide %>% mutate_each(funs(ifelse(!is.na(.), 1L, 0L)), matches("workoutnum\\d"))
# name raceid position weight time1 time2 time3 time4 workoutnum1 workoutnum2
# 1 sarah 1 1 115 10 10 11 15 1 1
# 2 tommy 1 2 140 12 14 11 NA 1 1
# workoutnum3 workoutnum4
# 1 1 1
# 2 1 0
Data:
d <- structure(list(name = structure(c(2L, 2L, 2L, 1L, 1L, 1L, 1L), .Label = c("sarah", "tommy"), class = "factor"), workoutnum = c(1L, 2L, 3L, 1L, 2L, 3L, 4L), time = c(12L, 14L, 11L, 10L, 10L, 11L, 15L), weight = c(140L, 140L, 140L, 115L, 115L, 115L, 115L), raceid = c(1L, 1L, 1L, 1L, 1L, 1L, 1L), position = c(2L, 2L, 2L, 1L, 1L, 1L, 1L)), .Names = c("name", "workoutnum", "time", "weight", "raceid", "position"), class = "data.frame", row.names = c(NA, -7L))
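The mutate_each() and funs() helpers used in the last step above have since been deprecated in dplyr. A minimal sketch of the same indicator step with across(), assuming dplyr 1.0.0 or later. The small d_wide below is a hand-built stand-in for a few columns of the cast result, not the real output:

```r
library(dplyr)

# Hand-built stand-in for part of the cast result (an assumption, for illustration)
d_wide <- data.frame(
  name        = c("sarah", "tommy"),
  time4       = c(15L, NA),
  workoutnum3 = c(3L, 3L),
  workoutnum4 = c(4L, NA)
)

# Turn every workoutnum<digit> column into a 1/0 "workout happened" indicator
d_ind <- d_wide %>%
  mutate(across(matches("workoutnum\\d"), ~ ifelse(!is.na(.x), 1L, 0L)))

d_ind
```

Only the columns matched by the regular expression are touched, so time4 keeps its NA for tommy.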

Here's an approach using dcast from "data.table", which reshapes a little more like the reshape function in base R.
The only change I've made to the data is the inclusion of another "time" variable, though as pointed out by @rawr in the comments, it almost seems like your "workoutnum" is the time variable.
I've used getanID from my "splitstackshape" package to generate the "time" variable, but you can create this variable in many different ways.
library(data.table) # provides the dcast() method that accepts several value.var columns
library(splitstackshape)
dcast(getanID(mydf, c("name", "raceid", "final_position")),
name + raceid + final_position ~ .id,
value.var = c("workoutnum", "time", "weight"))
## name raceid final_position workoutnum_1 workoutnum_2 workoutnum_3
## 1: sarah 1 1 1 2 3
## 2: tommy 1 2 1 2 3
## workoutnum_4 time_1 time_2 time_3 time_4 weight_1 weight_2 weight_3 weight_4
## 1: 4 10 10 11 15 115 115 115 115
## 2: NA 12 14 11 NA 140 140 140 NA
If you're using getanID, you can also use reshape like this:
reshape(getanID(mydf, c("name", "raceid", "final_position")),
idvar = c("name", "raceid", "final_position"), timevar = ".id",
direction = "wide")
## name raceid final_position workoutnum.1 time.1 weight.1 workoutnum.2 time.2
## 1: tommy 1 2 1 12 140 2 14
## 2: sarah 1 1 1 10 115 2 10
## weight.2 workoutnum.3 time.3 weight.3 workoutnum.4 time.4 weight.4
## 1: 140 3 11 140 NA NA NA
## 2: 115 3 11 115 4 15 115
but dcast would be more efficient in general.
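As one of the "many different ways" of creating the id variable mentioned above, here is a sketch in base R only, using ave() in place of getanID(). The data frame is rebuilt from the question with the column named final_position, and numbering rows within name alone is an assumption that happens to fit this data:

```r
# Same values as the question's data; "mydf" mirrors the name used above
mydf <- data.frame(
  name           = c("tommy", "tommy", "tommy", "sarah", "sarah", "sarah", "sarah"),
  workoutnum     = c(1L, 2L, 3L, 1L, 2L, 3L, 4L),
  time           = c(12L, 14L, 11L, 10L, 10L, 11L, 15L),
  weight         = c(140L, 140L, 140L, 115L, 115L, 115L, 115L),
  raceid         = 1L,
  final_position = c(2L, 2L, 2L, 1L, 1L, 1L, 1L)
)

# Number the rows within each name: 1, 2, 3 for tommy and 1, 2, 3, 4 for sarah
mydf$.id <- ave(seq_along(mydf$name), mydf$name, FUN = seq_along)

# Spread every non-id column across the .id values
wide <- reshape(mydf,
                idvar = c("name", "raceid", "final_position"),
                timevar = ".id",
                direction = "wide")
wide
```

The column names follow the same pattern as the reshape() output above (workoutnum.1, time.1, weight.1, and so on), with NA in the fourth set of columns for tommy.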

Merging two datasets by an ID without adding new columns that say ".x" or ".y"

Suppose I have two datasets. One main dataset, with many columns of metadata, and one new dataset which will be used to fill in some of the gaps in concentrations in the main dataset:
Main dataset:
study_id timepoint age occupation concentration1 concentration2
1 1 21 0 3 7
1 2 21 0 4 6
1 3 22 0 NA NA
1 4 22 0 NA NA
2 1 36 3 0 4
2 2 36 3 2 11
2 3 37 3 NA NA
2 4 37 3 NA NA
New data set to merge:
study_id timepoint concentration1 concentration2
1 3 11 20
1 4 21 35
2 3 7 17
2 4 14 25
Whenever I merge by "study_id" and "timepoint", I get two new columns that are "concentration1.y" and "concentration2.y" while the original columns get renamed as "concentration1.x" and "concentration2.x". I don't want this.
This is what I want:
study_id timepoint age occupation concentration1 concentration2
1 1 21 0 3 7
1 2 21 0 4 6
1 3 22 0 11 20
1 4 22 0 21 35
2 1 36 3 0 4
2 2 36 3 2 11
2 3 37 3 7 17
2 4 37 3 14 25
In other words, I want to merge by "study_id" and "timepoint" AND merge the two concentration columns so the data end up in the same columns. Please note that the two datasets do not have identical columns (dataset 1 has 1000 columns of metadata, while dataset 2 has only the study id, timepoint, and the concentration columns that match those in dataset 1).
Thanks so much in advance.
Using coalesce (from the dplyr package) is one option. The join still adds the concentration1 and concentration2 columns from the second data frame as .x/.y pairs; these are dropped once the NA values have been filled in.
library(tidyverse)
df1 %>%
left_join(df2, by = c("study_id", "timepoint")) %>%
mutate(concentration1 = coalesce(concentration1.x, concentration1.y),
concentration2 = coalesce(concentration2.x, concentration2.y)) %>%
select(-concentration1.x, -concentration1.y, -concentration2.x, -concentration2.y)
Or to generalize with multiple concentration columns:
df1 %>%
left_join(df2, by = c("study_id", "timepoint")) %>%
split.default(str_remove(names(.), "\\.x|\\.y")) %>%
map_df(reduce, coalesce)
Edit: To prevent split.default from alphabetizing the resulting column names, you can add an intermediate step that reorders the list to match the first data frame's column order.
df3 <- df1 %>%
left_join(df2, by = c("study_id", "timepoint")) %>%
split.default(str_remove(names(.), "\\.x|\\.y"))
df3[names(df1)] %>%
map_df(reduce, coalesce)
Output
study_id timepoint age occupation concentration1 concentration2
1 1 1 21 0 3 7
2 1 2 21 0 4 6
3 1 3 22 0 11 20
4 1 4 22 0 21 35
5 2 1 36 3 0 4
6 2 2 36 3 2 11
7 2 3 37 3 7 17
8 2 4 37 3 14 25
Data
df1 <- structure(list(study_id = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L),
timepoint = c(1L, 2L, 3L, 4L, 1L, 2L, 3L, 4L), age = c(21L,
21L, 22L, 22L, 36L, 36L, 37L, 37L), occupation = c(0L, 0L,
0L, 0L, 3L, 3L, 3L, 3L), concentration1 = c(3L, 4L, NA, NA,
0L, 2L, NA, NA), concentration2 = c(7L, 6L, NA, NA, 4L, 11L,
NA, NA)), class = "data.frame", row.names = c(NA, -8L))
df2 <- structure(list(study_id = c(1L, 1L, 2L, 2L), timepoint = c(3L,
4L, 3L, 4L), concentration1 = c(11L, 21L, 7L, 14L), concentration2 = c(20L,
35L, 17L, 25L)), class = "data.frame", row.names = c(NA, -4L))
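A different technique from the coalesce answers, offered only as a sketch: dplyr's rows_patch() (dplyr 1.0.0 or later) fills NA values in the first data frame from matching rows in the second, without creating .x/.y columns. By default it expects every key in df2 to exist in df1, which holds for this example:

```r
library(dplyr)

df1 <- data.frame(
  study_id       = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L),
  timepoint      = c(1:4, 1:4),
  age            = c(21L, 21L, 22L, 22L, 36L, 36L, 37L, 37L),
  occupation     = c(0L, 0L, 0L, 0L, 3L, 3L, 3L, 3L),
  concentration1 = c(3L, 4L, NA, NA, 0L, 2L, NA, NA),
  concentration2 = c(7L, 6L, NA, NA, 4L, 11L, NA, NA)
)
df2 <- data.frame(
  study_id       = c(1L, 1L, 2L, 2L),
  timepoint      = c(3L, 4L, 3L, 4L),
  concentration1 = c(11L, 21L, 7L, 14L),
  concentration2 = c(20L, 35L, 17L, 25L)
)

# Only NA cells in df1 are overwritten; existing values are left alone
patched <- rows_patch(df1, df2, by = c("study_id", "timepoint"))
patched
```

Row and column order stay as in df1, so no reordering step is needed afterwards.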

Creating a new data set by combining some rows

This is not a question about reshaping between long and wide format, so please don't mark it as a duplicate.
Suppose I have:
HouseholdID. PersonID. time. dur. age
1 1 3 4 19
1 2 3 4 29
1 3 5 5 30
1 1 5 5 18
2 1 21 30 18
2 2 21 30 30
In each household, some people have the same time and dur. I want to combine only the rows that have the same HouseholdID, time and dur.
OUTPUT:
HouseholdID. PersonID. time. dur. age. HouseholdID. PersonID. time. dur. age
1 1 3 4 19 1 2 3 4 29
1 3 5 5 30 1 1 5 5 18
2 1 21 30 18 2 2 21 30 30
An option would be dcast from data.table which can take multiple value.var columns
library(data.table)
dcast(setDT(df1), HouseholdID. ~ rowid(HouseholdID.),
value.var = c("PersonID.", "time.", "dur.", "age"), sep="")
# HouseholdID. PersonID.1 PersonID.2 time.1 time.2 dur.1 dur.2 age1 age2
#1: 1 1 2 3 3 4 4 19 29
#2: 2 1 2 21 21 30 30 18 30
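The rowid() call on the right-hand side of the formula above is what supplies the 1, 2, ... suffixes: it numbers each row within its group. A small sketch on made-up vectors (the names household and time are illustrative only):

```r
library(data.table)

household <- c(1L, 1L, 2L, 2L, 1L)

# Running count within each distinct value
idx <- rowid(household)
idx
# 1 2 1 2 3

# With several arguments, the count restarts for each combination of values
time <- c(3L, 3L, 21L, 21L, 5L)
idx2 <- rowid(household, time)
idx2
# 1 2 1 2 1
```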
Or an option with pivot_wider from tidyr (only in the devel version when this was written; released in tidyr 1.0.0)
library(tidyr) # ‘0.8.3.9000’
library(dplyr)
df1 %>%
group_by(HouseholdID.) %>%
mutate(rn = row_number()) %>%
pivot_wider(id_cols= HouseholdID., names_from = rn,
values_from = c(PersonID., time., dur., age), names_sep="")
# A tibble: 2 x 9
# HouseholdID. PersonID.1 PersonID.2 time.1 time.2 dur.1 dur.2 age1 age2
# <int> <int> <int> <int> <int> <int> <int> <int> <int>
#1 1 1 2 3 3 4 4 19 29
#2 2 1 2 21 21 30 30 18 30
Update
With the new dataset, extend the id columns by including the 'time.' and 'dur.'
dcast(setDT(df2), HouseholdID. + time. + dur. ~ rowid(HouseholdID., time., dur.),
value.var = c("PersonID.", "age"), sep="")
If we need duplicate columns for 'time.' and 'dur.' (not clear why it is needed though)
dcast(setDT(df2), HouseholdID. + time. + dur. ~ rowid(HouseholdID., time., dur.),
value.var = c("PersonID.", "time.", "dur.", "age"), sep="")[,
c('time.', 'dur.') := NULL][]
# HouseholdID. PersonID.1 PersonID.2 time..11 time..12 dur..11 dur..12 age1 age2
#1: 1 1 2 3 3 4 4 19 29
#2: 1 3 1 5 5 5 5 30 18
#3: 2 1 2 21 21 30 30 18 30
Or with tidyverse
df2 %>%
group_by(HouseholdID., time., dur.) %>%
mutate(rn = row_number()) %>%
pivot_wider(id_cols= c(HouseholdID., time., dur.), names_from = rn,
values_from = c(PersonID., age), names_sep = "")
# A tibble: 3 x 7
# HouseholdID. time. dur. PersonID.1 PersonID.2 age1 age2
# <int> <int> <int> <int> <int> <int> <int>
#1 1 3 4 1 2 19 29
#2 1 5 5 3 1 30 18
#3 2 21 30 1 2 18 30
NOTE: duplicate column names are not recommended as it can lead to confusion in identification of columns.
data
df1 <- structure(list(HouseholdID. = c(1L, 1L, 2L, 2L), PersonID. = c(1L,
2L, 1L, 2L), time. = c(3L, 3L, 21L, 21L), dur. = c(4L, 4L, 30L,
30L), age = c(19L, 29L, 18L, 30L)), class = "data.frame", row.names = c(NA,
-4L))
df2 <- structure(list(HouseholdID. = c(1L, 1L, 1L, 1L, 2L, 2L), PersonID. = c(1L,
2L, 3L, 1L, 1L, 2L), time. = c(3L, 3L, 5L, 5L, 21L, 21L), dur. = c(4L,
4L, 5L, 5L, 30L, 30L), age = c(19L, 29L, 30L, 18L, 18L, 30L)),
class = "data.frame", row.names = c(NA,
-6L))

Subsetting and repetition of rows in a dataframe using R

Suppose we have the following data with column names "id", "time" and "x":
df<-
structure(
list(
id = c(1L, 1L, 1L, 2L, 2L, 3L, 3L),
time = c(20L, 6L, 7L, 11L, 13L, 2L, 6L),
x = c(1L, 1L, 0L, 1L, 1L, 1L, 0L)
),
.Names = c("id", "time", "x"),
class = "data.frame",
row.names = c(NA,-7L)
)
Each id has multiple observations for time and x. I want to extract the last observation for each id and form a new dataframe that repeats this observation as many times as the id appears in the original data. I am able to extract the last observation for each id using the following code:
library(dplyr)
df<-df%>%
group_by(id) %>%
filter( ((x)==0 & row_number()==n())| ((x)==1 & row_number()==n()))
What is left unresolved is the repetition aspect. The expected output would look like
df <-
structure(
list(
id = c(1L, 1L, 1L, 2L, 2L, 3L, 3L),
time = c(7L, 7L, 7L, 13L, 13L, 6L, 6L),
x = c(0L, 0L, 0L, 1L, 1L, 0L, 0L)
),
.Names = c("id", "time", "x"),
class = "data.frame",
row.names = c(NA,-7L)
)
Thanks for your help in advance.
We can use ave to find the max row number for each ID and subset it from the data frame.
df[ave(1:nrow(df), df$id, FUN = max), ]
# id time x
#3 1 7 0
#3.1 1 7 0
#3.2 1 7 0
#5 2 13 1
#5.1 2 13 1
#7 3 6 0
#7.1 3 6 0
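To see why the ave() one-liner above works, it can help to look at the index vector on its own. A sketch using the id column from the question:

```r
id <- c(1L, 1L, 1L, 2L, 2L, 3L, 3L)

# For each row, the largest row number found within the same id
idx <- ave(seq_along(id), id, FUN = max)
idx
# 3 3 3 5 5 7 7
```

Indexing the data frame with this vector repeats rows 3, 5 and 7, which is also why the row names in the output above come out as 3, 3.1, 3.2 and so on.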
You can do this by using last() to grab the last row within each id.
df %>%
group_by(id) %>%
mutate(time = last(time),
x = last(x))
Because last(x) returns a single value, it gets expanded out to fill all the rows in the mutate() call.
This can also be applied to an arbitrary number of variables using mutate_at:
df %>%
group_by(id) %>%
mutate_at(vars(-id), ~ last(.))
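In current dplyr, mutate_at() has been superseded by across(). A sketch of the equivalent call, assuming dplyr 1.0.0 or later; inside a grouped mutate(), everything() covers only the non-grouping columns, so id is left untouched:

```r
library(dplyr)

df <- data.frame(
  id   = c(1L, 1L, 1L, 2L, 2L, 3L, 3L),
  time = c(20L, 6L, 7L, 11L, 13L, 2L, 6L),
  x    = c(1L, 1L, 0L, 1L, 1L, 1L, 0L)
)

# Replace every non-grouping column by its last value within each id
out <- df %>%
  group_by(id) %>%
  mutate(across(everything(), last)) %>%
  ungroup()

out
```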
slice will be your friend in the tidyverse I reckon:
df %>%
group_by(id) %>%
slice(rep(n(),n()))
## A tibble: 7 x 3
## Groups: id [3]
# id time x
# <int> <int> <int>
#1 1 7 0
#2 1 7 0
#3 1 7 0
#4 2 13 1
#5 2 13 1
#6 3 6 0
#7 3 6 0
In data.table, you could also use the mult= argument of a join:
library(data.table)
setDT(df)
df[df[,.(id)], on="id", mult="last"]
# id time x
#1: 1 7 0
#2: 1 7 0
#3: 1 7 0
#4: 2 13 1
#5: 2 13 1
#6: 3 6 0
#7: 3 6 0
And in base R, a merge will get you there too:
merge(df["id"], df[!duplicated(df$id, fromLast=TRUE),])
# id time x
#1 1 7 0
#2 1 7 0
#3 1 7 0
#4 2 13 1
#5 2 13 1
#6 3 6 0
#7 3 6 0
Using data.table you can try
library(data.table)
setDT(df)[,.(time=rep(time[.N],.N), x=rep(x[.N],.N)), by=id]
id time x
1: 1 7 0
2: 1 7 0
3: 1 7 0
4: 2 13 1
5: 2 13 1
6: 3 6 0
7: 3 6 0
Following @thelatemai, to avoid naming the columns you can also try
df[, .SD[rep(.N,.N)], by=id]
id time x
1: 1 7 0
2: 1 7 0
3: 1 7 0
4: 2 13 1
5: 2 13 1
6: 3 6 0
7: 3 6 0

Subsetting a data frame by a value of one of its columns

I have a rather large data frame. Here is a simplified example:
Group Element Value Note
1 AAA 11 Good
1 ABA 12 Good
1 AVA 13 Good
2 CBA 14 Good
2 FDA 14 Good
3 JHA 16 Good
3 AHF 16 Good
3 AKF 17 Good
Here it is as a dput:
dat <- structure(list(Group = c(1L, 1L, 1L, 2L, 2L, 3L, 3L, 3L), Element = structure(c(1L,
2L, 5L, 6L, 7L, 8L, 3L, 4L), .Label = c("AAA", "ABA", "AHF",
"AKF", "AVA", "CBA", "FDA", "JHA"), class = "factor"), Value = c(11L,
12L, 13L, 14L, 14L, 16L, 16L, 17L), Note = structure(c(1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L), .Label = "Good", class = "factor")), .Names = c("Group",
"Element", "Value", "Note"), class = "data.frame", row.names = c(NA,
-8L))
I'm trying to split it into separate data frames based on the group. So, for example,
Group 1 will be a data frame:
Group Element Value Note
1 AAA 11 Good
1 ABA 12 Good
1 AVA 13 Good
Group 2:
2 CBA 14 Good
2 FDA 14 Good
and so on.
You can use split for this.
> dat
## Group Element Value Note
## 1 1 AAA 11 Good
## 2 1 ABA 12 Good
## 3 1 AVA 13 Good
## 4 2 CBA 14 Good
## 5 2 FDA 14 Good
## 6 3 JHA 16 Good
## 7 3 AHF 16 Good
## 8 3 AKF 17 Good
> x <- split(dat, dat$Group)
Then you can access each individual data frame by group number with x[[1]], x[[2]], etc.
For example, here is group 2:
> x[[2]] ## x[2] would instead return a one-element list holding this data frame
## Group Element Value Note
## 4 2 CBA 14 Good
## 5 2 FDA 14 Good
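The pieces returned by split() are named after the group values, so they can also be looked up by name. A short sketch on a cut-down version of the data (fewer rows and columns than the original) that also shows the difference between [[ and [:

```r
dat <- data.frame(
  Group = c(1L, 1L, 2L, 2L, 3L),
  Value = c(11L, 12L, 14L, 14L, 16L)
)
x <- split(dat, dat$Group)

by_name <- x[["2"]]  # the data frame for Group 2, looked up by name
as_list <- x["2"]    # a list of length one that contains that data frame

class(by_name)
class(as_list)
```

Looking up by name is the safer habit when the group values are not simply 1, 2, 3, because x[[2]] means "the second piece", not "the piece for group 2".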
ADD: Since you asked about it in the comments, you can write each individual data frame to file with write.csv and lapply. The invisible wrapper is simply to suppress the output of lapply.
> invisible(lapply(seq_along(x), function(i){
write.csv(x[[i]], file = paste0(i, ".csv"), row.names = FALSE)
}))
We can see that the files were created by looking at list.files
> list.files(pattern = "^[0-9]\\.csv$")
## [1] "1.csv" "2.csv" "3.csv"
And we can see the data frame of the third group with read.csv
> read.csv("3.csv")
## Group Element Value Note
## 1 3 JHA 16 Good
## 2 3 AHF 16 Good
## 3 3 AKF 17 Good
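The file-writing code above names each file after the position of the piece in the list. When the group values are not 1, 2, 3, it may be preferable to name files after the group itself. A sketch of that variation, with made-up group values and a temporary directory (both assumptions for illustration):

```r
dat <- data.frame(
  Group = c(10L, 10L, 20L),
  Value = c(1L, 2L, 3L)
)
x <- split(dat, dat$Group)

# Write into a fresh temporary directory so the working directory stays clean
out_dir <- tempfile("groups_")
dir.create(out_dir)

# One file per group, named after the group value rather than its position
for (g in names(x)) {
  write.csv(x[[g]], file = file.path(out_dir, paste0(g, ".csv")), row.names = FALSE)
}

written <- sort(list.files(out_dir))
written
```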
Obligatory plyr version (pretty much equivalent to Richard's, but I'll bet it's slower, too):
library(plyr)
groups <- dlply(dat, .(Group), function(x) { return(x) })
length(groups)
## [1] 3
groups$`1` # can also do groups[[1]]
## Group Element Value Note
## 1 1 AAA 11 Good
## 2 1 ABA 12 Good
## 3 1 AVA 13 Good
groups[[2]]
## Group Element Value Note
## 1 2 CBA 14 Good
## 2 2 FDA 14 Good

Subset of data with criteria of two columns

I would like to create a subset of data that consists of Units that have a higher score in QTR 4 than QTR 1 (upward trend). Doesn't matter if QTR 2 or 3 are present.
Unit QTR Score
5 4 34
1 1 22
5 3 67
2 4 78
3 2 39
5 2 34
1 2 34
5 1 67
1 3 70
1 4 89
3 4 19
Subset would be:
Unit QTR Score
1 1 22
1 2 34
1 3 70
1 4 89
I've tried variants of something like this:
upward_subset <- subset(mydata,Unit if QTR=4~Score > QTR=1~Score)
Thank you for your time
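If the dataframe is named "d", then this succeeds on your test set. It computes one TRUE/FALSE per Unit (FALSE when QTR 1 or QTR 4 is missing), then keeps the rows whose Unit is flagged TRUE:
up <- sapply( split(d, d["Unit"]),
function(dd) isTRUE(dd[dd$QTR ==4, "Score"] > dd[dd$QTR ==1, "Score"]))
d[ d$Unit %in% names(up)[up] , ]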
#-------------
Unit QTR Score
2 1 1 22
7 1 2 34
9 1 3 70
10 1 4 89
An alternative in two steps (the data frame is named "test" here):
result <- unlist(
by(
test,
test$Unit,
function(x) x$Score[x$QTR==4] > x$Score[x$QTR==1])
)
test[test$Unit %in% names(result[result==TRUE]),]
Unit QTR Score
2 1 1 22
7 1 2 34
9 1 3 70
10 1 4 89
A solution using data.table (Probably there are better versions than what I have at the moment).
Note: Assuming a QTR value for a given Unit is unique
Data:
df <- structure(list(Unit = c(5L, 1L, 5L, 2L, 3L, 5L, 1L, 5L, 1L, 1L,
3L), QTR = c(4L, 1L, 3L, 4L, 2L, 2L, 2L, 1L, 3L, 4L, 4L), Score = c(34L,
22L, 67L, 78L, 39L, 34L, 34L, 67L, 70L, 89L, 19L)), .Names = c("Unit",
"QTR", "Score"), class = "data.frame", row.names = c(NA, -11L
))
Solution:
library(data.table)
dt <- data.table(df, key=c("Unit", "QTR"))
dt[, Score[Score[QTR == 4] > Score[QTR == 1]], by=Unit]
Unit V1
1: 1 22
2: 1 34
3: 1 70
4: 1 89
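For comparison, the same idea can be written as a grouped filter in dplyr. This is a sketch rather than one of the original answers; like the data.table version, it assumes each QTR appears at most once per Unit, and it explicitly treats Units that lack QTR 1 or QTR 4 as not trending upward:

```r
library(dplyr)

df <- data.frame(
  Unit  = c(5L, 1L, 5L, 2L, 3L, 5L, 1L, 5L, 1L, 1L, 3L),
  QTR   = c(4L, 1L, 3L, 4L, 2L, 2L, 2L, 1L, 3L, 4L, 4L),
  Score = c(34L, 22L, 67L, 78L, 39L, 34L, 34L, 67L, 70L, 89L, 19L)
)

# Keep all rows of a Unit when both quarters exist and QTR 4 beats QTR 1
upward <- df %>%
  group_by(Unit) %>%
  filter(any(QTR == 1) && any(QTR == 4) &&
           Score[QTR == 4][1] > Score[QTR == 1][1]) %>%
  ungroup() %>%
  arrange(Unit, QTR)

upward
```

The any() checks come first so that the comparison is only evaluated for Units that actually have both quarters.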
