De-aggregate a data frame

De-aggregate a data frame - r

There have been many similar questions (e.g. Repeat each row of data.frame the number of times specified in a column, De-aggregate / reverse-summarise / expand a dataset in R, Repeating rows of data.frame in dplyr), but my data set is of a different structure than the answers to these questions assume.
I have a data frame with the frequencies of measurements within each group and the total number of observations for each outcome per group total_N:
tibble(group=c("A", "B"), total_N=c(4,5), measure_A=c(1,4), measure_B=c(2,3))
# A tibble: 2 x 4
group total_N outcome_A outcome_B
<chr> <dbl> <dbl> <dbl>
1 A 4 1 2
2 B 5 4 3
I want to de-aggregate the data, so that the data frame has as many rows as total observations and each outcome has a 1 for all observations with the outcome and a 0 for all observations without the outcome. Thus the final result should be a data frame like this:
# A tibble: 9 x 3
group outcome_A outcome_B
<chr> <dbl> <dbl>
1 A 1 1
2 A 0 1
3 A 0 0
4 A 0 0
5 B 1 1
6 B 1 1
7 B 1 1
8 B 1 0
9 B 0 0
As the aggregated data does not contain any information about the frequency of combinations (i.e., the correlation) of outcome_A and outcome_B, this can be ignored.

Here's a tidyverse solution.
As you say, it's easy to repeat a row an arbitrary number of times. If you know that row_number() counts rows within groups when a data frame is grouped, then it's easy to convert grouped counts to presence/absence flags. across gives you a way to succinctly convert multiple count columns.
library(tidyverse)
tibble(group=c("A", "B"), total_N=c(4,5), measure_A=c(1,4), measure_B=c(2,3)) %>%
uncount(total_N) %>%
group_by(group) %>%
mutate(
across(
starts_with("measure"),
function(x) as.numeric(row_number() <= x)
)
) %>%
ungroup()
# A tibble: 9 × 3
group measure_A measure_B
<chr> <dbl> <dbl>
1 A 1 1
2 A 0 1
3 A 0 0
4 A 0 0
5 B 1 1
6 B 1 1
7 B 1 1
8 B 1 0
9 B 0 0
As you say, this approach takes no account of correlations between the outcome columns, as this cannot be deduced from the grouped data.

Related

Select values based on conditions

I have the following data set:
id pnum t1 t2 t3 w1 w2 w3
1 1 w r r 1 1 1
1 2 o o w 0 0 1
1 3 o w w 1 1 1
2 1 o w t 1 0 1
2 2 s s s 1 0 1
2 3 s s s 1 0 1
Id defines the group membership.
Based on id and pnum I would like to identify the common measurements reported with pnum 3 at time t with respect to w.
In other words id and pnum define different individuals who took some measurements. In some cases the measurement was taken together in other cases the measurement was taken alone. If taken together than at 'w' we have value 1.
For example:
Common activities at time t:
Id 1 pnum 1 at t1 reported (eg. 1) that a measurement was taken with from the group, more specifically with id1/pnum3. If the measurement data taken together is common in both groups I would like to save it.
Uncommon activities at time t:
Id 2 pnum 1 at t1 reported (eg. 1) that a measurement was taken with from the group, more specifically with id2/pnum2 and pnum 3. In this case the measurement data taken together is uncommon between id2/pnum1 and pnum 2 as well between id2/pnum1 and pnum 3. I don't want to save these measurements. But, I would like to save the common one reported between id2/pnum2 and pnum 3.
Generic example id 1:
In the group with id 1 pnum1 and pnum3 at t1 took together a measurement. Pnum 1 reported w, pnum 2 and pnum 3 reported o. This means that pnum 2 and pnum 3 reported the same measurement. However, when I look at w1 I could observe that they were not together when they did as at w1 pnum 2 is 0 and pnum 3 is 1. In other words, even though the measurement are common for pnum 2 and pnum 3 as they were not taken together I don't want to save the case. I would need to report if they reported the same measurements or not. In this case pnum1 reported w while pnum 3 reported o, so the measurements don't match. Therefore I coded 0. I don't want to save the case.
I would like to identify the common measurements that were taken together at time t.
Output:
id pnum t1 t2 t3
1 1 0 0 0
1 2 0 0 w
1 3 0 s w
2 1 0 0 0
2 2 s 0 s
2 3 s 0 s
Sample data:
df<-structure(list(id=c(1,1,1,2,2,2),pnum=c(1,2,3,1,2,3), t1=c("w","o","o","o","s","s"), t2=c("r","o","w","w","s","s"),t3 = c("r","w","w","t","s","s"), w1= c(1,0,1,1,1,1), w2 = c(1,0,1,0,0,0), w3 = c(1,1,1,1,1,1)), row.names = c(NA, 6L), class = "data.frame")

I don't get why in your expected output there is an "s" at [id1, pnum3, t2] - other than that, I think the following might help you:
First of all, pivoting your data to a "longer" format, where you can group by time, helps you to generalize your code.
library(dplyr)
library(tidyr)
df_longer <- df %>%
pivot_longer(
cols = matches("^[tw]\\d+$"),
names_to = c(".value","time"),
names_pattern = "([tw])(\\d+)"
)
The above pivots your data to look like this:
> head(df_longer)
# A tibble: 6 x 5
id pnum time t w
<dbl> <dbl> <chr> <chr> <dbl>
1 1 1 1 w 1
2 1 1 2 r 1
3 1 1 3 r 1
4 1 2 1 o 0
5 1 2 2 o 0
6 1 2 3 w 1
Now, you can easily group it up and identify those individuals that have given common answers at any given time:
common_answers <- df_longer %>%
arrange(id, time, pnum) %>%
filter(w == 1) %>% # throw out if the answer was given individually
select(-w) %>% # w not needed anymore
group_by(id, time, t) %>% # group by selected answer
filter(n() > 1) %>% # keep only answers given >1 times
ungroup()
This presents you with only only a filtered set of your data where answers were given commonly in group:
> common_answers
# A tibble: 6 x 4
id pnum time t
<dbl> <dbl> <chr> <chr>
1 1 2 3 w
2 1 3 3 w
3 2 2 1 s
4 2 3 1 s
5 2 2 3 s
6 2 3 3 s
// ADDITION:
In case you have to rely on the "wide" format in your output, you can obviously retain all data, modify t so that it only retains its value when it is given commonly by >1 subject and then widen your df again:
common_answers_wide <- df_longer %>%
group_by(id, time, w, t) %>%
mutate(
# retain t only when the response has been given by >1 subject
t = case_when(
w == 0 ~ "0",
n() > 1 ~ t,
T ~ "0"
)
) %>%
ungroup() %>%
select(-w) %>%
pivot_wider(
names_from = time, names_prefix = "t", names_sort = T,
values_from = t
)
That gives you exactly the desired output:
> common_answers_wide
# A tibble: 6 x 5
id pnum t1 t2 t3
<dbl> <dbl> <chr> <chr> <chr>
1 1 1 0 0 0
2 1 2 0 0 w
3 1 3 0 0 w
4 2 1 0 0 0
5 2 2 s 0 s
6 2 3 s 0 s

Filter a tibble by summarize data

I have a tibble with lots of values. I know that some of the values are "bad", because they are overrepresented in the data.
What I'd like to do is filter out any value that occurs more than, say, 10 times.
I can easily get the count of occurrences by doing
values %>% group_by(value) %>% summarize(count=n())
# A tibble: 1,000 x 2
value count
<dbl> <int>
1 1.40e15 1
2 1.40e15 2
3 1.40e15 1
4 1.40e15 17
5 1.40e15 2
6 1.40e15 7
7 1.40e15 1
But how do I now filter the original values tibble to remove any value that occurs more than 10 times in the summary?

Dynamically Normalize all rows with first element within a group

Suppose I have the following data frame:
year subject grade study_time
1 1 a 30 20
2 2 a 60 60
3 1 b 30 10
4 2 b 90 100
What I would like to do is be able to divide grade and study_time by their first record within each subject. I do the following:
df %>%
group_by(subject) %>%
mutate(RN = row_number()) %>%
mutate(study_time = study_time/study_time[RN ==1],
grade = grade/grade[RN==1]) %>%
select(-RN)
I would get the following output
year subject grade study_time
1 1 a 1 1
2 2 a 2 3
3 1 b 1 1
4 2 b 3 10
It's fairly easy to do when I know what the variable names are. However, I'm trying to write a generalize function that would be able to act on any data.frame/data.table/tibble where I may not know the name of the variables that I need to mutate, I'll only know the variables names not to mutate. I'm trying to get this done using tidyverse/data.table and I can't get anything to work.
Any help would be greatly appreciated.

We group by 'subject' and use mutate_at to change multiple columns by dividing the element by the first element
library(dplyr)
df %>%
group_by(subject) %>%
mutate_at(3:4, funs(./first(.)))
# A tibble: 4 x 4
# Groups: subject [2]
# year subject grade study_time
# <int> <chr> <dbl> <dbl>
#1 1 a 1 1
#2 2 a 2 3
#3 1 b 1 1
#4 2 b 3 10

count total and positive samples by group

I have a dataframe like this;
df <- data.frame(concentration=c(0,0,0,0,2,2,2,2,4,4,6,6,6),
result=c(0,0,0,0,0,0,1,0,1,0,1,1,1))
I want to count the total number of results for each concentration level.
I want to count the number of positive samples for each concentration level.
And I want to create a new dataframe with concentration level, total results, and number positives.
conc pos_c total_c
0 0 4
2 1 4
4 1 2
6 3 3
This is what I've come up with so far using plyr;
c <- count(df, "concentration")
r <- count(df, "concentration","result")
names(c)[which(names(c) == "freq")] <- "total_c"
names(r)[which(names(r) == "freq")] <- "pos_c"
cbind(c,r)
concentration total_c concentration pos_c
1 0 4 0 0
2 2 4 2 1
3 4 2 4 1
4 6 3 6 3
Repeating concentration column. I think there is probably a way better/easier way to do this I'm missing. Maybe another library. I'm not sure how to do this in R and it's relatively new to me. Thanks.

We need a group by sum. Using tidyverse, we group by 'concentration (group_by), then summarise to get the two columns - 1) sum of the logical expression (result > 0), 2) number of rows (n())
library(dplyr)
df %>%
group_by(conc = concentration) %>%
summarise(pos_c = sum(result > 0), # in the example just sum(result)
total_c = n())
# A tibble: 4 x 3
# conc pos_c total_c
# <dbl> <int> <int>
#1 0 0 4
#2 2 1 4
#3 4 1 2
#4 6 3 3
Or using base R with table and addmargins
addmargins(table(df), 2)[,-1]

Finding individual values from cumulative mean in R Dataframe

I am having troubles finding how to find individual values from the running mean in an R dataframe.
I have an R dataframe:
x ID Mean
1 1 1
1 2 5
2 1 3
2 2 6
Where the mean is the mean for the x measurements for the specific ID in the dataframe.
To find the individual values at each x value rather than the mean, I was thinking that I needed to apply a recursive function on the dataframe and group by the ID. How could I do this in a dataframe while grouping by one of the values when any apply function wouldn't have access to the previous entry in the dataframe?
When completed and appended to the dataframe, I am hoping it to look like this:
x ID Mean IndivValues
1 1 1 1
1 2 5 5
2 1 3 5
2 2 6 7

It's much easier to calculate this from totals -> to individual observation, as below:
Example data.frame:
df <- read.table(text='
x ID Mean
1 1 1
1 2 5
2 1 3
2 2 6
', header=T)
Solution:
library(dplyr); library(magrittr)
df %>%
group_by(id) %>%
mutate(
total = mean * x,
ind_value = total - lag(total, default=0) )
## A tibble: 4 x 5
## Groups: ID [2]
# x ID Mean total ind_value
# <int> <int> <int> <int> <int>
#1 1 1 1 1 1
#2 1 2 5 5 5
#3 2 1 3 6 5
#4 2 2 6 12 7

Develop Reference

r css asp.net wordpress firebase qt symfony nginx http apache-flex

De-aggregate a data frame - r

Related

Select values based on conditions

Filter a tibble by summarize data

Dynamically Normalize all rows with first element within a group

count total and positive samples by group

Finding individual values from cumulative mean in R Dataframe

Categories

Resources