Say I have the following data frame:
ID <- c(1,1,1,1,1, 2,2,2,2,2, 3,3,3,3,3, 4,4,4,4,4, 5,5,5,5,5)
Score <- sample(1:20, 25, replace = TRUE)
days <- rep(c("Mon", "Tue", "Wed", "Thu", "Fri"), times = 5)
t <- cbind(ID, Score, days)
I would like to reshape it so that the new columns are ID plus the actual weekday names (meaning six columns), with the Score values distributed according to their ID and day name. Something like this:
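  ID Mon Tue Wed Thu Fri
   1   .   .   .   .   .
   2   .   .   .   .   .
 ...
   5   .   .   .   .   .
(one row per ID, each day's Score in its own column; the dots stand for the Score values)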
I found that the reshape package might do it. I tried melt and cast, but they did not produce the result I wanted, just something like in this post: Melt data for one column
A base R solution that uses the built-in reshape command.
set.seed(12345)
t <- data.frame(id = c(1,1,1,1,1, 2,2,2,2,2, 3,3,3,3,3, 4,4,4,4,4, 5,5,5,5,5),
                score = sample(x = 1:20, size = 25, replace = TRUE),
                days = rep(x = c("Mon","Tue","Wed","Thu","Fri"), times = 5))
t.wide <- reshape(data = t,
                  v.names = "score",
                  timevar = "days",
                  idvar = "id",
                  direction = "wide")
# strip the "score." prefix that reshape() adds to the new columns
names(t.wide) <- gsub(pattern = "score.", replacement = "", x = names(t.wide), fixed = TRUE)
t.wide
id Mon Tue Wed Thu Fri
1 1 15 18 16 18 10
6 2 4 7 11 15 20
11 3 1 4 15 1 8
16 4 10 8 9 4 20
21 5 10 7 20 15 13
You can use reshape2 to do this, but you need a data.frame for that. Using cbind produces a matrix (and converts all your numeric variables to character in this case, as a matrix can hold only one data type).
I've changed your code to produce a data frame, which is already in long format (one row per observation).
set.seed(123)
ID <- c(1,1,1,1,1, 2,2,2,2,2, 3,3,3,3,3, 4,4,4,4,4, 5,5,5,5,5)
Score <- sample(1:20, 25, replace = TRUE)
days <- rep(c("Mon", "Tue", "Wed", "Thu", "Fri"), times = 5)
dat <- data.frame(ID, Score, days)
Changing it to wide using reshape2 is then quite straightforward:
library(reshape2)
res <- dcast(ID ~ days, value.var = "Score", data = dat)
> res
ID Fri Mon Thu Tue Wed
1 1 16 3 2 12 6
2 2 19 13 12 7 19
3 3 19 19 17 8 15
4 4 15 3 8 1 20
5 5 3 11 18 8 15
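Note the day columns come out in alphabetical order (Fri, Mon, Thu, Tue, Wed) because days is sorted alphabetically when casting. If you want weekday order, one small tweak is to set the factor levels before casting:
dat$days <- factor(dat$days, levels = c("Mon", "Tue", "Wed", "Thu", "Fri"))
res <- dcast(ID ~ days, value.var = "Score", data = dat)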
You could also use unstack if your data are complete (same number of days per id).
Here's an example (using the data from TARehman's answer):
unstack(t, score ~ days)
# Fri Mon Thu Tue Wed
# 1 10 15 18 18 16
# 2 20 4 15 7 11
# 3 8 1 1 4 15
# 4 20 10 4 8 9
# 5 13 10 15 7 20
Here's the clean-up for the column ordering, and for adding in the ID column:
cbind(ID = unique(t$id), unstack(t, score ~ days)[c("Mon", "Tue", "Wed", "Thu", "Fri")])
## ID Mon Tue Wed Thu Fri
## 1 1 15 18 16 18 10
## 2 2 4 7 11 15 20
## 3 3 1 4 15 1 8
## 4 4 10 8 9 4 20
## 5 5 10 7 20 15 13
Rather than reshape, I'd move to the newer tidyr package and also make use of dplyr, like so:
library(dplyr)
library(tidyr)
tdf <- as.data.frame(t) %>%
  # t is a character matrix, so convert Score back to numeric via character
  mutate(Score = as.numeric(as.character(Score))) %>%
  spread(days, Score, fill = NA)
glimpse(tdf)
HTH
Just another option, using splitstackshape (note this relies on the days appearing in the same order within every ID):
library(splitstackshape)
data <- data.frame(t)
# collapse each ID's scores into one comma-separated string, then split it into columns
out <- setnames(cSplit(setDT(data)[, .(x = toString(Score)), by = ID], 'x', ','),
                c('ID', unique(days)))
#> out
# ID Mon Tue Wed Thu Fri
#1: 1 8 14 11 5 10
#2: 2 16 1 4 14 8
#3: 3 8 18 19 13 3
#4: 4 16 9 19 16 6
#5: 5 7 2 1 2 13
Using the dplyr and tidyr packages, spread achieves the following:
library(dplyr)
library(tidyr)
t <- as_tibble(as.data.frame(t))  # tbl_df() is deprecated in favour of as_tibble()
t %>% spread(days, Score)         # a third argument would be taken as spread's fill, so only key and value are passed
and you get the following output:
ID Fri Mon Thu Tue Wed
(fctr) (fctr) (fctr) (fctr) (fctr) (fctr)
1 1 10 10 18 17 10
2 2 18 11 14 3 16
3 3 11 13 9 15 17
4 4 13 13 16 17 11
5 5 7 14 9 15 20
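Note that every column prints as (fctr): t was built with cbind(), so it is a character matrix, and converting it to a data frame turns each column into a factor (character on R >= 4.0). Building the data frame directly, as in the reshape2 answer above, keeps Score numeric:
dat <- data.frame(ID, Score, days)  # ID, Score, days as defined in the question
dat %>% spread(days, Score)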
I have started using h2o for aggregating large datasets, and I have found peculiar behaviour when trying to aggregate the maximum value using h2o's h2o.group_by function. My dataframe often has variables which comprise some or all NAs for a given grouping. Below is an example dataframe.
df <- data.frame("ID" = 1:16)
df$Group<- c(1,1,1,1,2,2,2,3,3,3,4,4,5,5,5,5)
df$VarA <- c(NA_real_,1,2,3,12,12,12,12,0,14,NA_real_,14,16,16,NA_real_,16)
df$VarB <- c(NA_real_,NA_real_,NA_real_,NA_real_,10,12,14,16,10,12,14,16,10,12,14,16)
df$VarD <- c(10,12,14,16,10,12,14,16,10,12,14,16,10,12,14,16)
ID Group VarA VarB VarD
1 1 1 NA NA 10
2 2 1 1 NA 12
3 3 1 2 NA 14
4 4 1 3 NA 16
5 5 2 12 10 10
6 6 2 12 12 12
7 7 2 12 14 14
8 8 3 12 16 16
9 9 3 0 10 10
10 10 3 14 12 12
11 11 4 NA 14 14
12 12 4 14 16 16
13 13 5 16 10 10
14 14 5 16 12 12
15 15 5 NA 14 14
16 16 5 16 16 16
In this dataframe, Group == 1 is completely missing data for VarB (this is important information to know, so aggregating for the maximum should return NA), while VarA for Group == 1 has only one missing value, so its maximum should be 3.
This link documents the behaviour of the na.methods argument: https://docs.h2o.ai/h2o/latest-stable/h2o-docs/data-munging/groupby.html
If I set na.methods = 'all' as below, then the aggregated output is NA for Group 1 for both VarA and VarB (which is not what I want, but I completely understand this behaviour).
df_h2o <- as.h2o(df)  # df loaded into h2o first
h2o_agg <- h2o.group_by(data = df_h2o, by = 'Group', max(), gb.control = list(na.methods = "all"))
Group max_ID max_VarA max_VarB max_VarD
1 1 4 NaN NaN 16
2 2 7 12 14 14
3 3 10 14 16 16
4 4 12 NaN 16 16
5 5 16 NaN 16 16
If I set na.methods = 'rm' as below, then the aggregated output for Group 1 is 3 for VarA (which is the desired output and makes complete sense), but for VarB it is -1.80e308 (which is not what I want, and I do not understand this behaviour).
h2o_agg <- h2o.group_by(data = df_h2o, by = 'Group', max(), gb.control = list(na.methods = "rm"))
  Group max_ID max_VarA  max_VarB max_VarD
  <int>  <int>    <int>     <dbl>    <int>
1     1      4        3 -1.80e308       16
2     2      7       12        14       14
3     3     10       14        16       16
4     4     12       14        16       16
5     5     16       16        16       16
Similarly, I get the same output if I set na.methods = 'ignore'.
h2o_agg <- h2o.group_by(data = df_h2o, by = 'Group', max(), gb.control = list(na.methods = "ignore"))
  Group max_ID max_VarA  max_VarB max_VarD
  <int>  <int>    <int>     <dbl>    <int>
1     1      4        3 -1.80e308       16
2     2      7       12        14       14
3     3     10       14        16       16
4     4     12       14        16       16
5     5     16       16        16       16
I am not sure why something as common as completely missing data for a given variable within a specific group is being given a value of -1.80e308. I tried the same workflow in dplyr and got results which match my expectations (but this is not a solution, as I cannot process datasets of this size in dplyr, hence my need for a solution in h2o).

I realise dplyr is giving me -Inf values rather than NA, and I can easily recode both -1.80e308 and -Inf to NA, but I am trying to make sure this isn't a symptom of a larger problem in h2o (or that I am not doing something fundamentally wrong in my code when attempting to aggregate in h2o). I also have to aggregate normalised datasets which often contain values approximately similar to -1.80e308, so I do not want to accidentally recode legitimate values to NA.
library(dplyr)
df %>%
group_by(Group) %>%
summarise(across(everything(), ~max(.x, na.rm = TRUE)))
Group ID VarA VarB VarD
<dbl> <int> <dbl> <dbl> <dbl>
1 1 4 3 -Inf 16
2 2 7 12 14 14
3 3 10 14 16 16
4 4 12 14 16 16
5 5 16 16 16 16
This is happening because H2O considers the value -Double.MAX_VALUE to be the lowest possible representable floating-point number; this value corresponds to -1.80e308. I agree this is confusing and I would consider it a bug. You can file an issue in our bug tracker: https://h2oai.atlassian.net/ (PUBDEV project)
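Until that is fixed, one way to guard against it without touching legitimate values is to recode only the exact sentinel, which in R is -.Machine$double.xmax. A minimal sketch, assuming the aggregate has been pulled back into R with as.data.frame():
agg <- as.data.frame(h2o_agg)
sentinel <- -.Machine$double.xmax  # R's equivalent of -Double.MAX_VALUE, about -1.797693e308
agg[] <- lapply(agg, function(col) {
  # only exact matches of the sentinel are recoded, so nearby legitimate
  # values in normalised data are left alone
  if (is.numeric(col)) col[col == sentinel] <- NA
  col
})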
Not sure how to achieve that with h2o.group_by() – I get the same weird value when running your code. If you are open to a somewhat hacky workaround, you might want to try the following (I included the part on H2O initialization for future reference):
1. convert your frame to long format, i.e. key-value representation
2. split by group and apply the aggregate function using h2o.ddply()
3. convert your frame back to wide format
## initialize h2o
library(h2o)
h2o.init(
nthreads = parallel::detectCores() * 0.5
)
df_h2o = as.h2o(
df
)
## aggregate per group
df_h2o |>
# convert to long format
h2o.melt(
id_vars = "Group"
, skipna = TRUE # does not include `NA` in the result
) |>
# calculate `max()` per group
h2o.ddply(
.variables = c("Group", "variable")
, FUN = function(df) {
max(df[, 3])
}
) |>
# convert back to wide format
h2o.pivot(
index = "Group"
, column = "variable"
, value = "ddply_C1"
)
# Group ID VarA VarB VarD
# 1 4 3 NaN 16
# 2 7 12 14 14
# 3 10 14 16 16
# 4 12 14 16 16
# 5 16 16 16 16
#
# [5 rows x 5 columns]
## shut down h2o instance
h2o.shutdown(
prompt = FALSE
)
So I have a dataset of parents and their children, of the following form:
Children_id Parent_id
10 1
11 1
12 1
13 2
14 2
What I want is a dataset of each child's siblings in long format, i.e.,
id sibling_id
10 11
10 12
11 10
11 12
12 10
12 11
13 14
14 13
What's the best way to achieve this, preferably using data.table?
Example data:
df <- data.frame("Children_id" = c(10,11,12,13,14), "Parent_id" = c(1,1,1,2,2))
The graph experts out there will probably have better solutions, but here is a data.table solution:
library(data.table)
library(magrittr)  # for the %>% pipe

# the self-join on Parent_id pairs every child with every child of the same parent;
# allow.cartesian permits the many-to-many match, and self-pairs are then dropped
setDT(df)[df, on = .(Parent_id), allow.cartesian = TRUE] %>%
  .[Children_id != i.Children_id, .(id = i.Children_id, sibling = Children_id)]
Output:
id sibling
<num> <num>
1: 10 11
2: 10 12
3: 11 10
4: 11 12
5: 12 10
6: 12 11
7: 13 14
8: 14 13
In base R, we can use expand.grid after splitting:
out <- do.call(rbind, lapply(split(df$Children_id, df$Parent_id), \(x)
subset(expand.grid(x, x), Var1 != Var2)[2:1]))
row.names(out) <- NULL
colnames(out) <- c("id", "sibling_id")
-output
> out
id sibling_id
1 10 11
2 10 12
3 11 10
4 11 12
5 12 10
6 12 11
7 13 14
8 14 13
Or using data.table with CJ (a cross join of the children within each Parent_id group, with self-pairs then filtered out):
library(data.table)
setDT(df)[, CJ(id = Children_id, sibling_id = Children_id),
Parent_id][id != sibling_id, .(id, sibling_id)]
id sibling_id
<num> <num>
1: 10 11
2: 10 12
3: 11 10
4: 11 12
5: 12 10
6: 12 11
7: 13 14
8: 14 13
A dplyr solution with inner_join:
library(dplyr)
inner_join(df, df, by = "Parent_id") %>%
select(id = Children_id.x, siblings = Children_id.y) %>%
filter(id != siblings)
id siblings
1 10 11
2 10 12
3 11 10
4 11 12
5 12 10
6 12 11
7 13 14
8 14 13
or another strategy:
library(dplyr)
library(tidyr)  # for unnest()
df %>%
  group_by(Parent_id) %>%
  mutate(siblings = list(unique(Children_id))) %>%
  unnest(siblings) %>%
  filter(Children_id != siblings)
I want to take differences for each pair of consecutive columns but for an arbitrary number of columns. For example...
df <- as_tibble(data.frame(group = rep(c("a", "b", "c"), each = 4),
                           subgroup = rep(c("adam", "boy", "charles", "david"), times = 3),
                           iter1 = 1:12,
                           iter2 = c(13:22, NA, 24),
                           iter3 = c(25:35, NA)))
I want to calculate the differences by column. I would normally use...
df %>%
mutate(diff_iter2 = iter2 - iter1,
diff_iter3 = iter3 - iter2)
But... I'd like to:
accommodate an arbitrary number of columns, and
treat NAs such that:
if the number we're subtracting from is NA, then the result should be NA. E.g. NA - 11 = NA
if the number we're subtracting is NA, then that NA is effectively treated as a 0. E.g. 35 - NA = 35
The result should look like this...
group subgroup iter1 iter2 iter3 diff_iter2 diff_iter3
<chr> <chr> <int> <dbl> <int> <dbl> <dbl>
1 a adam 1 13 25 12 12
2 a boy 2 14 26 12 12
3 a charles 3 15 27 12 12
4 a david 4 16 28 12 12
5 b adam 5 17 29 12 12
6 b boy 6 18 30 12 12
7 b charles 7 19 31 12 12
8 b david 8 20 32 12 12
9 c adam 9 21 33 12 12
10 c boy 10 22 34 12 12
11 c charles 11 NA 35 NA 35
12 c david 12 24 NA 12 NA
Originally, this df was in long format, but the problem is that lag() operates on position within groups, and the groups aren't all the same length because some have missing records (hence the NA values in the wider table shown above).
Starting from long format would do, but then please assume that the records shown above with NA values would not exist in that longer dataframe.
Any help is appreciated.
An option in tidyverse would be to loop across the 'iter' columns other than iter1 with across(). For each column, look up the previous column's values: take the current column name (cur_column()), subtract 1 from its digits with str_replace, and fetch that column with get(). Replace the NA elements of the previous column with 0 (replace_na), per the OP's logic, subtract it from the looped column, and create new columns by adding a prefix in .names ('diff_{.col}', where {.col} is the original column name).
library(dplyr)
library(stringr)
library(tidyr)
df <- df %>%
  mutate(across(iter2:iter3,
                ~ . - replace_na(get(str_replace(cur_column(), '\\d+',
                                                 function(x) as.numeric(x) - 1)), 0),
                .names = 'diff_{.col}'))
-output
df
# A tibble: 12 × 7
group subgroup iter1 iter2 iter3 diff_iter2 diff_iter3
<chr> <chr> <int> <dbl> <int> <dbl> <dbl>
1 a adam 1 13 25 12 12
2 a boy 2 14 26 12 12
3 a charles 3 15 27 12 12
4 a david 4 16 28 12 12
5 b adam 5 17 29 12 12
6 b boy 6 18 30 12 12
7 b charles 7 19 31 12 12
8 b david 8 20 32 12 12
9 c adam 9 21 33 12 12
10 c boy 10 22 34 12 12
11 c charles 11 NA 35 NA 35
12 c david 12 24 NA 12 NA
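To see the column-name arithmetic in isolation, here is a small sketch (function replacements in str_replace need a recent stringr, 1.5.0 or later):
library(stringr)
# "iter3" -> "iter2": the matched digits are replaced with themselves minus one
str_replace("iter3", "\\d+", function(x) as.character(as.numeric(x) - 1))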
Find the columns whose names start with iter, ix; then take all but the first as df1 and all but the last as df2, and replace the NAs in df2 with 0. Then subtract them and cbind df to the result. No packages are used.
ix <- grep("^iter", names(df))
df1 <- df[tail(ix, -1)]
df2 <- df[head(ix, -1)]
df2[is.na(df2)] <- 0
cbind(df, diff = df1 - df2)
giving:
group subgroup iter1 iter2 iter3 diff.iter2 diff.iter3
1 a adam 1 13 25 12 12
2 a boy 2 14 26 12 12
3 a charles 3 15 27 12 12
4 a david 4 16 28 12 12
5 b adam 5 17 29 12 12
6 b boy 6 18 30 12 12
7 b charles 7 19 31 12 12
8 b david 8 20 32 12 12
9 c adam 9 21 33 12 12
10 c boy 10 22 34 12 12
11 c charles 11 NA 35 NA 35
12 c david 12 24 NA 12 NA
My dataframe needs to be expanded:
df1 <- structure(list(TotalTime = c(0, 15, 16, 23, 24, 29),
                      PhaseName = structure(c(1L, 1L, 2L, 2L, 2L, 3L),
                                            .Label = c("A", "B", "C"), class = "factor")),
                 .Names = c("TotalTime", "Phase"), row.names = c(NA, 6L),
                 class = "data.frame")
df1:
TotalTime Phase
1 0 A
2 15 A
3 16 B
4 23 B
5 24 B
6 29 C
So that it becomes the following dataframe, with rows duplicated based on TotalTime; TotalTime should be filled in for every number (second). (I put .. in the example to reduce space, but it should be filled with 6, 7, 8, 9-15, etc.):
TotalTime Phase
1 0 A
2 1 A
3 2 A
4 3 A
5 4 A
6 5 A
..
16 15 A
17 16 B
18 17 B
.. B
24 23 B
25 24 B
26 25 B
27 26 B
28 27 B
29 28 B
30 29 C
Using both the zoo and dplyr packages:
library(dplyr)
library(zoo)
data.frame(TotalTime=0:max(df1$TotalTime)) %>% left_join(df1) %>% na.locf
It first creates a data.frame that has the whole sequence from 0 to 29 (here) and merges it with your data. Then I simply do a "last observation carried forward" imputation on the missing values created by the merge.
It can also be done with the data.table library, like this (see also this answer, which I adapted):
library(data.table)
df1 <- data.table(df1, key = "TotalTime")
df2 <- data.table(TotalTime = 0:max(df1$TotalTime))
# rolling join: each second takes the Phase of the last observed TotalTime
df1[df2, roll = TRUE]
You can get it done with dplyr and tidyr:
library(tidyverse)
df1 %>% do(data.frame(TotalTime = first(.$TotalTime):last(.$TotalTime))) %>%
left_join(df1, by="TotalTime") %>%
fill(Phase)
Output:
TotalTime Phase
0 A
1 A
2 A
3 A
4 A
5 A
6 A
7 A
8 A
9 A
10 A
11 A
12 A
13 A
14 A
15 A
16 B
17 B
18 B
19 B
20 B
21 B
22 B
23 B
24 B
25 B
26 B
27 B
28 B
29 C
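For reference, do() is superseded in current dplyr; a sketch of the same idea, using the tidyverse already loaded above, builds the full sequence of seconds with tidyr's complete() and full_seq():
df1 %>%
  complete(TotalTime = full_seq(TotalTime, period = 1)) %>%
  fill(Phase)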
I hope this helps.
In case you want to see a base R solution:
phases <- with(aggregate(TotalTime~Phase, df1, FUN=min),
rep(Phase, c(diff(TotalTime),
max(df1$TotalTime[df1$Phase == tail(Phase, 1)]) -
min(df1$TotalTime[df1$Phase == tail(Phase, 1)])+1)))
The main "trick" here is in that the second argument of rep can be a vector, which then repeats each element of the first argument that many times. The second argument is constructed using the difference of the minimum values of each phase diff(TotalTime) and concatenating the difference of the min and max value (+1) of the final phase level (here, "C"). The minimum values are found with aggregate, and I use with to simplify notation.
The phases vector can then be fed into data.frame():
data.frame(period = seq_along(phases) - 1, phase = phases)
period phase
1 0 A
2 1 A
3 2 A
4 3 A
5 4 A
6 5 A
7 6 A
8 7 A
9 8 A
10 9 A
11 10 A
12 11 A
13 12 A
14 13 A
15 14 A
16 15 A
17 16 B
18 17 B
19 18 B
20 19 B
21 20 B
22 21 B
23 22 B
24 23 B
25 24 B
26 25 B
27 26 B
28 27 B
29 28 B
30 29 C
The table below has 366 days of data:
od
month dayofmonth total ad aont
1 1 1 27 9 18
2 1 2 31 24 7
3 1 3 30 25 5
4 1 4 29 15 14
5 1 5 27 1 26
6 1 6 30 18 12
7 1 7 31 8 23
8 1 8 30 9 21
9 1 9 25 23 2
10 1 10 31 15 16
11 1 11 27 17 7
12 1 12 27 3 24
13 1 13 26 11 15
14 1 14 28 12
library(zoo)
require(xts)
Dates <- seq(as.Date(f, "%Y - %m - %d"), as.Date(t, "%Y - %m - %d"), "day")
total <- xts(od$total, order.by = Dates)
dont <- xts(od$ad, order.by = Dates)
adont <- xts(od$aont, order.by = Dates)
I used the zoo package; now I want to draw multiple lines in the same plot.
Using plot.type="single" in the plot call can help with this.
# open libraries
library(xts)
library(zoo)

# some random variables
a <- rnorm(100)
b <- rnorm(100)

# and a weekly time index (renamed from c to avoid masking base::c)
dates <- seq(as.Date("2000/1/1"), by = "week", length.out = 100)
d <- cbind(a, b)

# make it into a zoo object
d <- zoo(d, order.by = dates)

plot.zoo(d,
         plot.type = "single",
         col = c("red", "blue"))
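Since both series end up in a single panel, you may also want a legend; a small optional addition, with names and colours matching the call above:
legend("topright", legend = colnames(d), col = c("red", "blue"), lty = 1)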
Create the zoo object and then plot:
library(ggplot2)
library(zoo)
z <- zoo(od[3:5], as.Date(paste(2012, od$month, od$dayofmonth, sep = "-")))
autoplot(z, facet = NULL)
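For reference, autoplot.zoo draws one panel per series by default; the argument is actually called facets, but partial matching means facet = NULL works here and overlays all three series in a single panel with a colour legend.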