Using the following dataframe I would like to group the data by replicate and group and then calculate a ratio of treatment values to control values.
structure(list(group = structure(c(1L, 1L, 1L, 1L, 2L, 2L, 2L,
2L), .Label = c("case", "controls"), class = "factor"), treatment = structure(c(1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = "EPA", class = "factor"),
replicate = structure(c(2L, 4L, 3L, 1L, 2L, 4L, 3L, 1L), .Label = c("four",
"one", "three", "two"), class = "factor"), fatty_acid_family = structure(c(1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = "saturated", class = "factor"),
fatty_acid = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = "14:0", class = "factor"),
quant = c(6.16, 6.415, 4.02, 4.05, 4.62, 4.435, 3.755, 3.755
)), .Names = c("group", "treatment", "replicate", "fatty_acid_family",
"fatty_acid", "quant"), class = "data.frame", row.names = c(NA,
-8L))
I have tried using dplyr as follows:
group_by(dataIn, replicate, group) %>% transmute(ratio = quant[group=="case"]/quant[group=="controls"])
but this results in Error: incompatible size (%d), expecting %d (the group size) or 1
Initially I thought this was because I was trying to create 4 ratios from a data frame with 8 rows, so I thought summarise might be the answer (collapsing each group to one ratio), but that doesn't work either (my understanding is clearly lacking).
group_by(dataIn, replicate, group) %>% summarise(ratio = quant[group=="case"]/quant[group=="controls"])
replicate group ratio
1 four case NA
2 four controls NA
3 one case NA
4 one controls NA
5 three case NA
6 three controls NA
7 two case NA
8 two controls NA
I would appreciate some advice on where I'm going wrong or even if this can be done with dplyr.
Thanks.
You can try:
group_by(dataIn, replicate) %>%
summarise(ratio = quant[group=="case"]/quant[group=="controls"])
#Source: local data frame [4 x 2]
#
# replicate ratio
#1 four 1.078562
#2 one 1.333333
#3 three 1.070573
#4 two 1.446449
Because you grouped by replicate and group, you could not access data from different groups at the same time.
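To see why the extra grouping variable matters, here is a minimal base R sketch. It assumes a hand-rebuilt copy of the question's data with only the three relevant columns, and uses split() as a stand-in for group_by(); the helper names by_both, by_rep and ratios are made up for illustration.

```r
# Rebuilt from the dput in the question (only the columns needed here).
dataIn <- data.frame(
  group     = rep(c("case", "controls"), each = 4),
  replicate = rep(c("one", "two", "three", "four"), times = 2),
  quant     = c(6.16, 6.415, 4.02, 4.05, 4.62, 4.435, 3.755, 3.755)
)

# Splitting by replicate AND group: every piece holds a single row,
# so no piece contains both a "case" and a "controls" value.
by_both <- split(dataIn, list(dataIn$replicate, dataIn$group))
piece   <- by_both[["one.case"]]
missing_controls <- piece$quant[piece$group == "controls"]  # empty vector

# Splitting by replicate only: each piece holds one case and one control row.
by_rep <- split(dataIn, dataIn$replicate)
ratios <- sapply(by_rep, function(d) {
  d$quant[d$group == "case"] / d$quant[d$group == "controls"]
})
ratios
```

Within a (replicate, group) piece the denominator is empty, which is why no usable ratio comes out; within a replicate-only piece both values are present.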
@talat's answer solved it for me. I created a minimal reproducible example to help my own understanding:
df <- structure(list(a = c("a", "a", "b", "b", "c", "c", "d", "d"),
b = c(1, 2, 1, 2, 1, 2, 1, 2), c = c(22, 15, 5, 0.2, 107,
6, 0.2, 4)), row.names = c(NA, -8L), class = c("tbl_df",
"tbl", "data.frame"))
# a b c
# 1 a 1 22.0
# 2 a 2 15.0
# 3 b 1 5.0
# 4 b 2 0.2
# 5 c 1 107.0
# 6 c 2 6.0
# 7 d 1 0.2
# 8 d 2 4.0
library(dplyr)
df %>%
group_by(a) %>%
summarise(prop = c[b == 1] / c[b == 2])
# a prop
# 1 a 1.466667
# 2 b 25.000000
# 3 c 17.833333
# 4 d 0.050000
Related
I have the following dataset
structure(list(Var1 = structure(c(1L, 2L, 1L, 2L, 1L, 2L, 1L,
2L), .Label = c("0", "1"), class = "factor"), Var2 = structure(c(1L,
1L, 2L, 2L, 1L, 1L, 2L, 2L), .Label = c("congruent", "incongruent"
), class = "factor"), Var3 = structure(c(1L, 1L, 1L, 1L, 2L,
2L, 2L, 2L), .Label = c("spoken", "written"), class = "factor"),
Freq = c(8L, 2L, 10L, 2L, 10L, 2L, 10L, 2L)), class = "data.frame", row.names = c(NA,
-8L))
I would like to add another column reporting the sum of each pair of consecutive rows. Thus the final result would look like this:
I have proceeded like this
Table = as.data.frame(table(data_1$unimodal,data_1$cong_cond, data_1$presentation_mode)) %>%
mutate(Var1 = factor(Var1, levels = c('0', '1')))
row = Table %>% #is.factor(Table$Var1)
summarise(across(where(is.numeric),
~ .[Var1 == '0'] + .[Var1 == '1'],
.names = "{.col}_sum"))
column = c(rbind(row$Freq_sum,rep(NA, 4)))
Table$column = column
But I am looking for the quickest way possible, without splitting the code into separate steps. Here I have used the dplyr package, but if you know of other approaches, using map(), a for loop, or whatever method you consider best, please let me know.
This should do:
df$column <-
rep(colSums(matrix(df$Freq, 2)), each=2) * c(1, NA)
If you are fine with no NAs in the data frame, you can use:
df %>%
group_by(Var2, Var3) %>%
mutate(column = sum(Freq))
# A tibble: 8 × 5
# Groups: Var2, Var3 [4]
Var1 Var2 Var3 Freq column
<fct> <fct> <fct> <int> <int>
1 0 congruent spoken 8 10
2 1 congruent spoken 2 10
3 0 incongruent spoken 10 12
4 1 incongruent spoken 2 12
5 0 congruent written 10 12
6 1 congruent written 2 12
7 0 incongruent written 10 12
8 1 incongruent written 2 12
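If the NA pattern is wanted without relying on row order, one possible base R sketch is to compute the group sum with ave() and then blank out the rows where Var1 is "1". That placement rule is an assumption inferred from the first answer (sum on the first row of each pair, NA on the second); the data frame is rebuilt by hand from the dput in the question.

```r
# Rebuilt from the dput in the question.
df <- data.frame(
  Var1 = factor(rep(c("0", "1"), times = 4)),
  Var2 = factor(rep(rep(c("congruent", "incongruent"), each = 2), times = 2)),
  Var3 = factor(rep(c("spoken", "written"), each = 4)),
  Freq = c(8L, 2L, 10L, 2L, 10L, 2L, 10L, 2L)
)

# Sum of Freq within each Var2 x Var3 combination, repeated on every row of the group.
df$column <- ave(df$Freq, df$Var2, df$Var3, FUN = sum)

# Assumption: keep the sum only on the Var1 == "0" row of each pair.
df$column[df$Var1 == "1"] <- NA
df
```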
I have a vector of numbers:
a <- c(54, 456, 23432, 4868, 34, 245634, 37, 46453, 1342354)
In my existing data frame (head included via dput below), I would like to create a new variable. Each row of the new variable will contain a single element from the vector, so there would be one value (e.g. 54) in each row.
structure(list(Phone = structure(c(1L,
1L, 1L, 1L, 1L, 1L), .Label = "a", class = "factor"), Frame = structure(c(1L,
3L, 2L, 4L, 6L, 5L), .Label = c("[-4.46225397 -4.14727267 -4.45203785 -4.67251549 -5.13750066 -4.92839463\n -5.03957588 -5.68530479]",
"[-6.14532579 -4.38918589 -4.12275354 -4.19263549 -4.30380823 -4.35621995\n -4.4079389 -4.47339504]",
"[-6.43104195 -4.75506178 -4.2324676 -4.21878988 -4.1635973 -4.11186806\n -4.05023489 -4.08204198]",
"[-7.1528423 -5.46190925 -5.94873845 -6.635839 -6.84179002 -6.85955335\n -6.83714326 -6.87621415]",
"[-7.23901353 -4.61522546 -3.25206619 -3.38407075 -3.63762837 -3.85352927\n -3.94250123 -4.04015791]",
"[-7.34451319 -5.58664694 -4.69929752 -4.621823 -4.51670576 -4.48494125\n -4.39512713 -4.26553646]"
), class = "factor"), Previous = structure(c(1L, 1L, 1L, 1L,
1L, 1L), .Label = "ch", class = "factor"), Following = structure(c(1L,
1L, 1L, 1L, 1L, 1L), .Label = "p", class = "factor"), Word = structure(c(1L,
1L, 1L, 1L, 1L, 1L), .Label = "juk'ucha-pi", class = "factor"),
Note = structure(c(1L, 1L, 1L, 1L, 1L, 1L), .Label = "", class = "factor"),
"[-10.79197258 -7.97949955 -7.10253093 -7.07957825 -6.98695923\n -6.90015207 -6.79672506 -6.85010073",
"[-10.31251047 -7.36552088 -6.91841906 -7.0356884 -7.2222481\n -7.31020053 -7.39699043 -7.5068328 ",
"[-12.00323036 -9.16566481 -9.982616 -11.13564383 -11.48125155\n -11.51106031 -11.47345379 -11.5390189 ",
"[-12.32487451 -9.37498793 -7.8859212 -7.7559107 -7.5795128\n -7.52620857 -7.37549093 -7.15802398",
"[-12.14783486 -7.74483933 -5.45731306 -5.67883075 -6.10432742\n -6.46663209 -6.61593651 -6.77981481"
), Morph_status = structure(c(1L, 1L, 1L, 1L, 1L, 1L), .Label = "", class = "factor"),
row.names = c(NA, 6L), class = "data.frame")
When working with data frames, each variable (column) has as many entries as there are rows. What you are describing then is not a data frame and, if I understand your question correctly, the best you can do is go back to general lists:
df <- data.frame(a = 1:3, b = 1:3)
c(as.list(df), c = list(a))
# $a
# [1] 1 2 3
#
# $b
# [1] 1 2 3
#
# $c
# [1] 54 456 23432 4868 34 245634 37 46453 1342354
Another option, to still have a data frame, would be to fill all the shorter columns with NAs:
library(rowr)
cbind.fill(df, a, fill = NA)
# a b object
# 1 1 1 54
# 2 2 2 456
# 3 3 3 23432
# 4 NA NA 4868
# 5 NA NA 34
# 6 NA NA 245634
# 7 NA NA 37
# 8 NA NA 46453
# 9 NA NA 1342354
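Since rowr may no longer be installable from CRAN, the same padding can be sketched in base R. This is only an illustration under that assumption; the names n and padded are hypothetical, and the new column is called object merely to mirror the output above.

```r
a  <- c(54, 456, 23432, 4868, 34, 245634, 37, 46453, 1342354)
df <- data.frame(a = 1:3, b = 1:3)

# Target number of rows: the longer of the data frame and the vector.
n <- max(nrow(df), length(a))

# Indexing past the last row of a data frame yields rows filled with NA.
padded <- df[seq_len(n), , drop = FALSE]
rownames(padded) <- NULL

# Pad the vector too, in case it is the shorter of the two.
padded$object <- c(a, rep(NA, n - length(a)))
padded
```

If the vector already has exactly as many elements as the data frame has rows, none of this is needed and a plain assignment such as df$new <- a is enough.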
I am trying to compute the column desired_output from the value column, grouped by grp_1 and grp_2, according to these rules:
if all values in the group are unique, the whole group should be NA
if one value is repeated more often than any other, the whole group gets that repeated value
if the values are repeated equally often, the whole group gets the largest value
grp_1 = c("A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A")
grp_2 = c("a","a","a","a","a","b","b","b","b","c","c","c","c","d","d","d","d","e","e","e","e")
value =c(1,2,3,3,4,1,2,3,4,1,1,2,2,1,2,4,4,1,3,3,3)
desired_output =c(3,3,3,3,3,NA,NA,NA,NA,2,2,2,2,4,4,4,4,3,3,3,3)
df = data.frame(grp_1,grp_2,value,desired_output)
I got stuck after getting the count of repeated values:
func <- function(x) {
unlist(lapply(rle(x)$lengths, seq_len))
}
df <- group_by(df,grp_1,grp_2)
df_1 <- mutate(df, common=as.numeric(func(value)) )
In case someone prefers data.table:
data.table::setDT(df)
df[,desired_outcome:= max(value[duplicated(value)]), by=c("grp_1","grp_2")
][is.infinite(desired_outcome),desired_outcome:=NA]
library(dplyr)
library(modeest)
final_df <- df %>%
group_by(grp_1,grp_2) %>%
mutate(desired_output = ifelse(n()==length(unique(value)),
NA,
ifelse(length(unique(table(value)))==1,
max(value),
mlv(value, method='mfv')[['M']]))) %>%
data.frame()
final_df
Output is:
grp_1 grp_2 value desired_output
1 A a 1 3
2 A a 2 3
3 A a 3 3
4 A a 3 3
5 A a 4 3
6 A b 1 NA
7 A b 2 NA
8 A b 3 NA
9 A b 4 NA
10 A c 1 2
11 A c 1 2
12 A c 2 2
13 A c 2 2
14 A d 1 4
15 A d 2 4
16 A d 4 4
17 A d 4 4
18 A e 1 3
19 A e 3 3
20 A e 3 3
21 A e 3 3
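The value returned by mlv() (and therefore the [['M']] extraction) can depend on the installed modeest version, so here is a dependency-free sketch of the same three rules in base R. The helper name mode_or_na is hypothetical, and resolving any tie for the highest count by taking the largest tied value is an assumption that goes slightly beyond the rules stated in the question.

```r
df <- data.frame(
  grp_1 = rep("A", 21),
  grp_2 = rep(c("a", "b", "c", "d", "e"), times = c(5, 4, 4, 4, 4)),
  value = c(1, 2, 3, 3, 4, 1, 2, 3, 4, 1, 1, 2, 2, 1, 2, 4, 4, 1, 3, 3, 3)
)

# Hypothetical helper: NA if all values are unique, otherwise the most
# frequent value, with ties resolved by taking the largest tied value.
mode_or_na <- function(x) {
  counts <- table(x)
  if (all(counts == 1)) return(NA_real_)
  top <- names(counts)[counts == max(counts)]
  max(as.numeric(top))
}

# ave() applies the helper per group and repeats the result on every row.
df$desired_output <- ave(df$value, df$grp_1, df$grp_2, FUN = mode_or_na)
df
```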
#sample data
structure(list(grp_1 = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = "A", class = "factor"),
grp_2 = structure(c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 3L,
3L, 3L, 3L, 4L, 4L, 4L, 4L, 5L, 5L, 5L, 5L), .Label = c("a",
"b", "c", "d", "e"), class = "factor"), value = c(1, 2, 3,
3, 4, 1, 2, 3, 4, 1, 1, 2, 2, 1, 2, 4, 4, 1, 3, 3, 3)), .Names = c("grp_1",
"grp_2", "value"), row.names = c(NA, -21L), class = "data.frame")
I am having trouble combining slice and map.
I am interested in doing something similar to this, which in my case means transforming a compact person-period file into a long (sequential) person-period one. However, because my file is too big, I need to split the data first.
My data look like this
group id var ep dur
1 A 1 a 1 20
2 A 1 b 2 10
3 A 1 a 3 5
4 A 2 b 1 5
5 A 2 b 2 10
6 A 2 b 3 15
7 B 1 a 1 20
8 B 1 a 2 10
9 B 1 a 3 10
10 B 2 c 1 20
11 B 2 c 2 5
12 B 2 c 3 10
What I need is simply this (answer from this)
library(dplyr)
dt %>% slice(rep(1:n(),.$dur))
However, I am interested in introducing a split(.$group).
How am I supposed to do so?
For example, this does not work:
dt %>% split(.$group) %>% map_df(slice(rep(1:n(),.$dur)))
My desired output is the same as dt %>% slice(rep(1:n(),.$dur))
which is
group id var ep dur
1 A 1 a 1 20
2 A 1 a 1 20
3 A 1 a 1 20
4 A 1 a 1 20
5 A 1 a 1 20
6 A 1 a 1 20
7 A 1 a 1 20
8 A 1 a 1 20
9 A 1 a 1 20
10 A 1 a 1 20
.....
But I need to split this operation because the file is too big.
data
dt = structure(list(group = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 2L,
2L, 2L, 2L, 2L, 2L), .Label = c("A", "B"), class = "factor"),
id = structure(c(1L, 1L, 1L, 2L, 2L, 2L, 1L, 1L, 1L, 2L,
2L, 2L), .Label = c("1", "2"), class = "factor"), var = structure(c(1L,
2L, 1L, 2L, 2L, 2L, 1L, 1L, 1L, 3L, 3L, 3L), .Label = c("a",
"b", "c"), class = "factor"), ep = structure(c(1L, 2L, 3L,
1L, 2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L), .Label = c("1", "2",
"3"), class = "factor"), dur = c(20, 10, 5, 5, 10, 15, 20,
10, 10, 20, 5, 10)), .Names = c("group", "id", "var", "ep",
"dur"), row.names = c(NA, -12L), class = "data.frame")
map takes two arguments: a vector/list in .x and a function in .f. It then applies .f on all elements in .x.
What you are passing to map is not a function: slice(rep(1:n(), .$dur)) is a call that R tries to evaluate immediately. Wrap it in a function instead:
f <- function(x) x %>% slice(rep(1:n(), .$dur))
dt %>%
split(.$group) %>%
map_df(f)
You could also use it like this:
dt %>%
split(.$group) %>%
map_df(slice, rep(1:n(), dur))
This time you directly pass the slice function to map with additional parameters.
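The same split, apply, combine pattern can be sketched in base R, with lapply() playing the role of map and do.call(rbind, ...) playing the role of the _df suffix. The small data frame and the helper expand_rows below are hypothetical stand-ins, not the question's data.

```r
# Hypothetical miniature of the person-period data.
dt <- data.frame(
  group = c("A", "A", "B"),
  id    = c(1, 2, 1),
  dur   = c(2, 1, 3)
)

# The function applied to each piece: repeat every row `dur` times.
expand_rows <- function(d) d[rep(seq_len(nrow(d)), d$dur), , drop = FALSE]

# Split by group, expand each piece, then bind the pieces back together.
pieces <- lapply(split(dt, dt$group), expand_rows)
long   <- do.call(rbind, pieces)
rownames(long) <- NULL

# For comparison: expanding the whole table in one go.
whole <- expand_rows(dt)
rownames(whole) <- NULL
```

Because the input is already ordered by group, long and whole contain the same rows in the same order; the split only changes how much data is processed at once.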
I'm not quite sure what your desired final output is, but you could use tidyr to nest the data that you want to repeat and a simple function to expand levels of your nested data, very similar to Tutuchan's answer.
expand_df <- function(df, repeats) {
df %>% slice(rep(1:n(), repeats))
}
dt %>%
tidyr::nest(var:ep) %>%
mutate(expanded = purrr::map2(data, dur, expand_df)) %>%
select(-data) %>%
tidyr::unnest()
Tutuchan's answer gives exactly the same output as your original approach - is that what you were looking for? I don't know if it will have any advantage over your original method.
Because I am working on a very large dataset, I need to slice my dataset by groups in order to carry out my computations.
I have a person-period (melted) dataset that looks like this
group id var time
1 A 1 a 1
2 A 1 b 2
3 A 1 a 3
4 A 2 b 1
5 A 2 b 2
6 A 2 b 3
7 B 1 a 1
8 B 1 a 2
9 B 1 a 3
10 B 2 c 1
11 B 2 c 2
12 B 2 c 3
I need to do this simple transformation
library(reshape2)
library(dplyr)
dt %>% dcast(group + id ~ time, value.var = 'var')
In order to get
group id 1 2 3
1 A 1 a b a
2 A 2 b b b
3 B 1 a a a
4 B 2 c c c
So far, so good.
However, because my dataset is too big, I need to do this separately for each group, such as
a = dt %>% filter(group == 'A') %>% dcast(group + id ~ time, value.var ='var')
b = dt %>% filter(group == 'B') %>% dcast(group + id ~ time, value.var = 'var')
bind_rows(a,b)
My problem is that I would like to avoid doing it by hand, that is, having to store each group separately (a = ..., b = ..., c = ..., and so on).
Any idea how I could write a single pipeline that separates each group, computes the transformation and puts everything back together in one data frame?
dt = structure(list(group = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 2L,
2L, 2L, 2L, 2L, 2L), .Label = c("A", "B"), class = "factor"),
id = structure(c(1L, 1L, 1L, 2L, 2L, 2L, 1L, 1L, 1L, 2L, 2L, 2L), .Label = c("1", "2"), class = "factor"), var = structure(c(1L,
2L, 1L, 2L, 2L, 2L, 1L, 1L, 1L, 3L, 3L, 3L), .Label = c("a",
"b", "c"), class = "factor"), time = structure(c(1L, 2L,
3L, 1L, 2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L), .Label = c("1",
"2", "3"), class = "factor")), .Names = c("group", "id",
"var", "time"), row.names = c(NA, -12L), class = "data.frame")
The purrr package can be useful for working with lists. First split the dataset by group and then use map_df to dcast each list element and return everything in a single data.frame.
library(purrr)
dt %>%
split(.$group) %>%
map_df(~dcast(.x, group + id ~ time, value.var = "var"))
group id 1 2 3
1 A 1 a b a
2 A 2 b b b
3 B 1 a a a
4 B 2 c c c
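If neither purrr nor reshape2 is available, a rough base R equivalent can be sketched with split(), lapply() and reshape(). Note that reshape() names the new columns var.1, var.2, var.3 rather than 1, 2, 3; the helper name widen is hypothetical, and the data frame is rebuilt by hand from the question's dput using character columns instead of factors.

```r
# Rebuilt from the dput in the question (characters instead of factors).
dt <- data.frame(
  group = rep(c("A", "B"), each = 6),
  id    = rep(rep(c("1", "2"), each = 3), times = 2),
  var   = c("a", "b", "a", "b", "b", "b", "a", "a", "a", "c", "c", "c"),
  time  = rep(c("1", "2", "3"), times = 4)
)

# Long-to-wide transformation applied to one group at a time.
widen <- function(d) {
  reshape(d, idvar = c("group", "id"), timevar = "time", direction = "wide")
}

# Split by group, widen each piece, bind the results.
wide <- do.call(rbind, lapply(split(dt, dt$group), widen))
rownames(wide) <- NULL
wide
```

This rbind step assumes every group contains the same set of time values; if a group lacked one, its piece would have fewer columns and the pieces would need aligning first.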
lapply is your friend here:
# The column is named `group` (lower case); `dt$Group` would be NULL
do.call(rbind, lapply(unique(dt$group), function(grp, dt){
dt %>% filter(group == grp) %>% dcast(group + id ~ time, value.var = "var")
}, dt = dt))