'Complex' aggregation function in dcast from reshape2 - R

I have a dataframe in long form for which I need to aggregate several observations taken on a particular day.
Example data:
long <- structure(list(Day = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L,
2L, 2L, 2L, 2L), .Label = c("1", "2"), class = "factor"),
Genotype = structure(c(1L, 1L, 1L, 2L, 2L, 2L, 1L, 1L, 1L,
2L, 2L, 2L), .Label = c("A", "B"), class = "factor"), View = structure(c(1L,
2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L), .Label = c("1",
"2", "3"), class = "factor"), variable = c(1496L, 1704L,
1738L, 1553L, 1834L, 1421L, 1208L, 1845L, 1325L, 1264L, 1920L,
1735L)), .Names = c("Day", "Genotype", "View", "variable"), row.names = c(NA, -12L),
class = "data.frame")
> long
Day Genotype View variable
1 1 A 1 1496
2 1 A 2 1704
3 1 A 3 1738
4 1 B 1 1553
5 1 B 2 1834
6 1 B 3 1421
7 2 A 1 1208
8 2 A 2 1845
9 2 A 3 1325
10 2 B 1 1264
11 2 B 2 1920
12 2 B 3 1735
I need to aggregate each genotype for each day by taking the cube root of the product of each view. So for genotype A on day 1, (1496 * 1704 * 1738)^(1/3). Final dataframe would look like:
Day Genotype summary
1 1 A 1642.418
2 1 B 1593.633
3 2 A 1434.695
4 2 B 1614.790
Have been going round and round with reshape2 for the last couple of days, but not getting anywhere. Help appreciated!

I'd probably use plyr and ddply for this task:
library(plyr)
ddply(long, .(Day, Genotype), summarize,
summary = prod(variable) ^ (1/3))
#-----
Day Genotype summary
1 1 A 1642.418
2 1 B 1593.633
3 2 A 1434.695
4 2 B 1614.790
Or this with dcast:
dcast(data = long, Day + Genotype ~ .,
      value.var = "variable", fun.aggregate = function(x) prod(x) ^ (1/3))
#-----
Day Genotype NA
1 1 A 1642.418
2 1 B 1593.633
3 2 A 1434.695
4 2 B 1614.790
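For completeness, the same aggregation can also be written with dplyr's group_by()/summarise(); a minimal sketch, assuming dplyr is installed:
library(dplyr)
long %>%
  group_by(Day, Genotype) %>%
  summarise(summary = prod(variable) ^ (1/3)) %>%
  ungroup()
If a group ever contains many values, computing the geometric mean as exp(mean(log(variable))) avoids overflowing the product.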

Another solution, without additional packages:
aggregate(list(Summary = long$variable),
          by = list(Day = long$Day, Genotype = long$Genotype),
          function(x) prod(x)^(1/length(x)))
Day Genotype Summary
1 1 A 1642.418
2 2 A 1434.695
3 1 B 1593.633
4 2 B 1614.790

Related

Subsetting data based on the condition of the current and previous entity in R

I have data with a status column. I want to subset my data to the rows with status 'f' together with the row immediately preceding each 'f' row.
To simplify:
df
id status time
1 n 1
1 n 2
1 f 3
1 n 4
2 f 1
2 n 2
3 n 1
3 n 2
3 f 3
3 f 4
my result should be:
id status time
1 n 2
1 f 3
2 f 1
3 n 2
3 f 3
3 f 4
How can I do this in R?
Here's a solution using dplyr -
df %>%
group_by(id) %>%
filter(status == "f" | lead(status) == "f") %>%
ungroup()
# A tibble: 6 x 3
id status time
<int> <fct> <int>
1 1 n 2
2 1 f 3
3 2 f 1
4 3 n 2
5 3 f 3
6 3 f 4
Data -
df <- structure(list(id = c(1L, 1L, 1L, 1L, 2L, 2L, 3L, 3L, 3L, 3L),
status = structure(c(2L, 2L, 1L, 2L, 1L, 2L, 2L, 2L, 1L,
1L), .Label = c("f", "n"), class = "factor"), time = c(1L,
2L, 3L, 4L, 1L, 2L, 1L, 2L, 3L, 4L)), .Names = c("id", "status",
"time"), class = "data.frame", row.names = c(NA, -10L))

R - replace values by row given some statement in if loop with another value in same df

I have a dataset with which I want to conduct a multilevel analysis. Therefore I have two rows for every patient, and a 'couple' column with 1's and 2's (1 = patient, 2 = partner of patient).
Now, I have variables with date of birth and age for both patient and partner, in different columns that are currently on the same row.
What I want to do is to write a code that does:
if mydata$couple == 2, then replace mydata$dateofbirthpatient with mydata$dateofbirthpartner
And do that for every row. Since I have multiple variables that I want to replace, it would be lovely if I could get this in a loop and just 'add' variables that I want to replace.
What I tried so far:
mydf_longer <- if (mydf_long$couple == 2) {
mydf_long$pgebdat <- mydf_long$prgebdat
}
Of course this wasn't working, but simply stated this is what I want.
And I started with this code, following the example in By row, replace values equal to value in specified column, but don't know how to finish:
mydf_longer[6:7][mydf_longer[,1:4]==mydf_longer[2,2]] <-
Any ideas? Let me know if you need more information.
Example of data:
# id couple groep_MNC zkhs fbeh pgebdat p_age pgesl prgebdat pr_age
# 1 3 1 1 1 1 1955-12-01 42.50000 1 <NA> NA
# 1.1 3 2 1 1 1 1955-12-01 42.50000 1 <NA> NA
# 2 5 1 1 1 1 1943-04-09 55.16667 1 1962-04-18 36.5
# 2.1 5 2 1 1 1 1943-04-09 55.16667 1 1962-04-18 36.5
# 3 7 1 1 1 1 1958-04-10 40.25000 1 <NA> NA
# 3.1 7 2 1 1 1 1958-04-10 40.25000 1 <NA> NA
mydf_long <- structure(
list(id = c(3L, 3L, 5L, 5L, 7L, 7L),
couple = c(1L, 2L, 1L, 2L, 1L, 2L),
groep_MNC = c(1L, 1L, 1L, 1L, 1L, 1L),
zkhs = c(1L, 1L, 1L, 1L, 1L, 1L),
fbeh = c(1L, 1L, 1L, 1L, 1L, 1L),
pgebdat = structure(c(-5145, -5145, -9764, -9764, -4284, -4284), class = "Date"),
p_age = c(42.5, 42.5, 55.16667, 55.16667, 40.25, 40.25),
pgesl = c(1L, 1L, 1L, 1L, 1L, 1L),
prgebdat = structure(c(NA, NA, -2815, -2815, NA, NA), class = "Date"),
pr_age = c(NA, NA, 36.5, 36.5, NA, NA)),
.Names = c("id", "couple", "groep_MNC", "zkhs", "fbeh", "pgebdat",
"p_age", "pgesl", "prgebdat", "pr_age"),
row.names = c("1", "1.1", "2", "2.1", "3", "3.1"),
class = "data.frame"
)
The following for loop should work if you only want to change the values based on a condition:
for(i in 1:nrow(mydata)){
if(mydata$couple[i] == 2){
mydata$pgebdat[i] <- mydata$prgebdat[i]
}
}
OR
As suggested by @lmo, the following vectorized version will work faster.
mydata$pgebdat[mydata$couple == 2] <- mydata$prgebdat[mydata$couple == 2]
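If several patient/partner column pairs need the same treatment, the vectorized replacement can be wrapped in a loop over the pairs. A sketch using the mydf_long object from the question; the pairs list is an assumption you would extend with your own column names:
# each entry names a patient column and the partner column whose value should
# overwrite it on the partner's rows (couple == 2)
pairs <- list(c(patient = "pgebdat", partner = "prgebdat"))
rows <- mydf_long$couple == 2
for (p in pairs) {
  mydf_long[rows, p[["patient"]]] <- mydf_long[rows, p[["partner"]]]
}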

R - dplyr map slice for repeat rows

I have trouble combining slice and map.
I am interested in doing something similar to this: in my case, transforming a compact person-period file into a long (sequential) person-period one. However, because my file is too big, I need to split the data first.
My data look like this
group id var ep dur
1 A 1 a 1 20
2 A 1 b 2 10
3 A 1 a 3 5
4 A 2 b 1 5
5 A 2 b 2 10
6 A 2 b 3 15
7 B 1 a 1 20
8 B 1 a 2 10
9 B 1 a 3 10
10 B 2 c 1 20
11 B 2 c 2 5
12 B 2 c 3 10
What I need is simply this (answer from this)
library(dplyr)
dt %>% slice(rep(1:n(),.$dur))
However, I am interested in introducing a split(.$group).
How am I supposed to do that?
dt %>% split(.$group) %>% map_df(slice(rep(1:n(),.$dur)))
does not work, for example.
My desired output is the same as dt %>% slice(rep(1:n(),.$dur))
which is
group id var ep dur
1 A 1 a 1 20
2 A 1 a 1 20
3 A 1 a 1 20
4 A 1 a 1 20
5 A 1 a 1 20
6 A 1 a 1 20
7 A 1 a 1 20
8 A 1 a 1 20
9 A 1 a 1 20
10 A 1 a 1 20
.....
But I need to split this operation because the file is too big.
data
dt = structure(list(group = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 2L,
2L, 2L, 2L, 2L, 2L), .Label = c("A", "B"), class = "factor"),
id = structure(c(1L, 1L, 1L, 2L, 2L, 2L, 1L, 1L, 1L, 2L,
2L, 2L), .Label = c("1", "2"), class = "factor"), var = structure(c(1L,
2L, 1L, 2L, 2L, 2L, 1L, 1L, 1L, 3L, 3L, 3L), .Label = c("a",
"b", "c"), class = "factor"), ep = structure(c(1L, 2L, 3L,
1L, 2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L), .Label = c("1", "2",
"3"), class = "factor"), dur = c(20, 10, 5, 5, 10, 15, 20,
10, 10, 20, 5, 10)), .Names = c("group", "id", "var", "ep",
"dur"), row.names = c(NA, -12L), class = "data.frame")
map takes two arguments: a vector or list .x and a function .f. It then applies .f to each element of .x.
The function you are passing to map is not formatted correctly. Try this:
f <- function(x) x %>% slice(rep(1:n(), .$dur))
dt %>%
split(.$group) %>%
map_df(f)
You could also use it like this:
dt %>%
split(.$group) %>%
map_df(slice, rep(1:n(), dur))
This time you directly pass the slice function to map with additional parameters.
I'm not quite sure what your desired final output is, but you could use tidyr to nest the data that you want to repeat and a simple function to expand levels of your nested data, very similar to Tutuchan's answer.
expand_df <- function(df, repeats) {
df %>% slice(rep(1:n(), repeats))
}
dt %>%
tidyr::nest(var:ep) %>%
mutate(expanded = purrr::map2(data, dur, expand_df)) %>%
select(-data) %>%
tidyr::unnest()
Tutuchan's answer gives exactly the same output as your original approach - is that what you were looking for? I don't know if it will have any advantage over your original method.
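If the split is only there to keep intermediate objects small, a grouped slice produces the same expanded result; a minimal sketch (the tidyr::uncount() line assumes tidyr >= 0.8.0):
library(dplyr)
# expand each row dur times, group by group
dt %>%
  group_by(group) %>%
  slice(rep(1:n(), dur)) %>%
  ungroup()
# or, with tidyr >= 0.8.0:
tidyr::uncount(dt, dur, .remove = FALSE)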

R - sort semi-numeric column

I created a dataset to illustrate the problem that I have.
My data looks like this
id time act
1 1 time1 a
2 1 time2 a
3 1 time3 a
4 1 time101 a
5 1 time103 a
6 1 time1001 b
7 1 time1003 b
9 1 time10000 b
10 1 time100010 c
What I want is to spread the data with time in the correct order, like this :
id 1 2 3 101 103 1001 1003 1004 10000 100010
1 a a a a a b b b b c
Here is what I do not fully understand. When I spread my data I get something like
library(dplyr)
library(tidyr)
dt %>% spread(time, act)
id time1 time10000 time100010 time1001 time1003 time1004 time101 time103 time2 time3
1 1 a b c b b b a a a a
So R seems to recognise some kind of order, but considers that time10000 comes before time2 or time3.
Why is that, and how can I solve this problem?
What I would like is this :
id time1 time2 time3 time101 time103 time1001 time1003 time1004 time10000 time100010
1 1 a a a a a b b b b c
The data
dt = structure(list(id = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L),
time = structure(c(1L, 9L, 10L, 7L, 8L, 4L, 5L, 6L, 2L, 3L
), .Label = c("time1", "time10000", "time100010", "time1001",
"time1003", "time1004", "time101", "time103", "time2", "time3"
), class = "factor"), act = structure(c(1L, 1L, 1L, 1L, 1L,
2L, 2L, 2L, 2L, 3L), .Label = c("a", "b", "c"), class = "factor")), .Names = c("id",
"time", "act"), class = "data.frame", row.names = c(NA, -10L))
Reorder your factor levels:
> dt$time<-factor(dt$time, as.character(dt$time))
> dt %>% spread(time, act)
id time1 time2 time3 time101 time103 time1001 time1003 time1004 time10000
1 1 a a a a a b b b b
time100010
1 c
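Note that this works because the rows of dt already appear in the desired order. A sketch that instead orders the levels by the numeric part of the label (assuming every level starts with "time"):
num <- as.numeric(sub("^time", "", levels(dt$time)))
dt$time <- factor(dt$time, levels = levels(dt$time)[order(num)])
dt %>% spread(time, act)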

Adding same observations from 2 different groups. Plyr or tapply?

Looking to create a function.
I would like to count the number of occurrences of an observation within a given group (e.g. the value 5 occurring 2 times). Identical Days values within a Week, by Business, are to be tallied, and the counts placed in a new column, 'Total-Occurrences'.
I suspect tapply or plyr works its way into this, but I'm stuck on a few nuances.
Thanks!
14X3 matrix
Business Week Days
A **1** 3
A **1** 3
A **1** 1
A 2 4
A 2 1
A 2 1
A 2 6
A 2 1
B **1** 1
B **1** 2
B **1** 7
B 2 2
B 2 2
B 2 na
**AND BECOME**
10X4 matrix
Business Week Days Total-Occurrences
A **1** 3 2
A **1** 1 1
A 2 1 3
A 2 4 1
A 2 6 1
B **1** 1 1
B **1** 2 1
B **1** 7 1
B 3 2 2
B 2 na 0
If I understand your question correctly, you want to group your data frame by Business, Week and Days and record the number of occurrences of each group in a new column, Total.Occurrences.
df <- structure(list(Business = structure(c(1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L), .Label = c("A", "B"), class = "factor"),
Week = structure(c(1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 1L, 1L,
1L, 2L, 2L, 2L), .Label = c("**1**", "2"), class = "factor"),
Days = structure(c(3L, 3L, 1L, 4L, 1L, 1L, 5L, 1L, 1L, 2L,
6L, 2L, 2L, 7L), .Label = c("1", "2", "3", "4", "6", "7",
"na"), class = "factor")), .Names = c("Business", "Week",
"Days"), class = "data.frame", row.names = c(NA, -14L))
There are certainly different ways of doing this. One way would be to use dplyr:
require(dplyr)
result <- df %>%
  group_by(Business, Week, Days) %>%
  summarize(Total.Occurrences = n())
#> result
# Business Week Days Total.Occurrences
#1 A **1** 1 1
#2 A **1** 3 2
#3 A 2 1 3
#4 A 2 4 1
#5 A 2 6 1
#6 B **1** 1 1
#7 B **1** 2 1
#8 B **1** 7 1
#9 B 2 2 2
#10 B 2 na 1
You could also use plyr:
require(plyr)
ddply(df, .(Business, Week, Days), nrow)
Note that, based on these functions, the output is slightly different from what you posted in your question. I assume this is a typo, because there is no Week 3 in your original data but there is one in your desired output.
Between the two solutions, the dplyr approach is probably faster.
I guess there are also other ways of doing this (but I'm not sure about tapply).
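Since the question also mentions tapply, for reference a base-R count with aggregate() (no extra packages) gives the same result; a sketch, with column names following the dplyr answer above:
aggregate(list(Total.Occurrences = rep(1, nrow(df))),
          by = list(Business = df$Business, Week = df$Week, Days = df$Days),
          FUN = sum)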
