R: Producing frequency table by selecting certain rows

I have a minimal example of a data set D that looks something like:
score person freq
10 1 3
10 2 5
10 3 4
8 1 3
7 2 2
6 4 1
Now, I want to be able to plot frequency of score=10 against person.
However, if I do:
#My bad, turns out the next line only works for matrices anyway:
#D = D[which(D[,1] == 10)]
D = subset(D, score == 10)
then I get:
score person freq
10 1 3
10 2 5
10 3 4
However, this is what I would like to get:
score person freq
10 1 3
10 2 5
10 3 4
10 4 0
Is there any quick and painless way for me to do this in R?

Here's a base R approach:
subset(as.data.frame(xtabs(freq ~ score + person, D)), score == 10)
# score person Freq
#4 10 1 3
#8 10 2 5
#12 10 3 4
#16 10 4 0

You can use complete() from the tidyr package to create the missing rows and then you can simply subset:
library(tidyr)
D2 <- complete(D, score, person, fill = list(freq = 0))
D2[D2$score == 10, ]
## Source: local data frame [4 x 3]
##
## score person freq
## (int) (int) (dbl)
## 1 10 1 3
## 2 10 2 5
## 3 10 3 4
## 4 10 4 0
complete() takes as the first argument the data frame that it should work with. Then follow the names of the columns that should be completed. The argument fill is a list that gives for each of the remaining columns (which is only freq here) the value they should be filled with.
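To make the effect of complete() concrete, here is a minimal base R sketch of the same idea: build every score/person combination, join the observed rows onto it, and fill the gaps with 0. The names grid, full and res are illustrative only and are not part of the original answer.

```r
# Example data from the question
D <- data.frame(score  = c(10, 10, 10, 8, 7, 6),
                person = c(1, 2, 3, 1, 2, 4),
                freq   = c(3, 5, 4, 3, 2, 1))

# Every combination of the observed scores and persons
grid <- expand.grid(score = unique(D$score), person = unique(D$person))

# Left join: combinations that never occurred get NA in freq
full <- merge(grid, D, by = c("score", "person"), all.x = TRUE)

# Replace the missing frequencies with 0, as fill = list(freq = 0) does
full$freq[is.na(full$freq)] <- 0

# Keep only score 10, ordered by person
res <- full[full$score == 10, ]
res <- res[order(res$person), ]
res
```

This is only meant to illustrate the mechanics; complete() handles factors, grouping and multiple fill columns more conveniently.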
As suggested by docendo-discimus, this can be simplified further by also using the dplyr package:
library(tidyr)
library(dplyr)
complete(D, score, person, fill = list(freq = 0)) %>% filter(score == 10)

Here is a dplyr approach:
library(dplyr)
D %>% mutate(freq = ifelse(score == 10, freq, 0),
score = 10) %>%
group_by(score, person) %>%
summarise(freq = max(freq))
Source: local data frame [4 x 3]
Groups: score [?]
score person freq
(dbl) (int) (dbl)
1 10 1 3
2 10 2 5
3 10 3 4
4 10 4 0

Related

Stepwise column sum in data frame based on another column in R

I have a data frame like this:
Team GF
A 3
B 5
A 2
A 3
B 1
B 6
Looking for output like this (just an additional column):
Team x avg(x)
A 3 0
B 5 0
A 2 3
A 3 2.5
B 1 5
B 6 3
avg(x) is the average of all previous instances of x where Team is the same. I have the following R code which gets the overall average, however I'm looking for the "step-wise" average.
new_df <- df %>% group_by(Team) %>% summarise(avg_x = mean(x))
Is there a way to vectorize this while only evaluating the previous rows on each "iteration"?
You want the cummean() function from dplyr, combined with lag() (replace_na() comes from tidyr):
df %>% group_by(Team) %>% mutate(avg_x = replace_na(lag(cummean(x)), 0))
Producing the following:
# A tibble: 6 × 3
# Groups: Team [2]
Team x avg_x
<chr> <dbl> <dbl>
1 A 3 0
2 B 5 0
3 A 2 3
4 A 3 2.5
5 B 1 5
6 B 6 3
As required.
Edit 1:
As @Ritchie Sacramento pointed out, the following is cleaner and clearer:
df %>% group_by(Team) %>% mutate(avg_x = lag(cummean(x), default = 0))
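For readers who want to see what lag(cummean(x)) computes within each group, here is a rough base R sketch of the same calculation. The helper name prev_mean is hypothetical, and the data frame simply re-creates the example from the question with the column named x.

```r
# Example data from the question
df <- data.frame(Team = c("A", "B", "A", "A", "B", "B"),
                 x    = c(3, 5, 2, 3, 1, 6))

# Mean of all *previous* values: the running mean shifted down by one,
# with 0 used where there is no previous value
prev_mean <- function(v) {
  running <- cumsum(v) / seq_along(v)  # same values as dplyr::cummean(v)
  c(0, head(running, -1))              # same as lag(running, default = 0)
}

# ave() applies the helper within each Team and keeps the original row order
df$avg_x <- ave(df$x, df$Team, FUN = prev_mean)
df
```

The dplyr version is shorter; this sketch only spells out the shift-by-one step.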

How to use a for loop to change consecutive values in R?

How can I run a loop over multiple columns changing consecutive values to true values?
For example, if I have a dataframe like this...
Time Value Bin Subject_ID
1 6 1 1
3 10 2 1
7 18 3 1
8 20 4 1
I want to show the binned values...
Time Value Bin Subject_ID
1 6 1 1
2 4 2 1
4 8 3 1
1 2 4 1
Is there a way to do it in a loop?
I tried this code...
for (row in 2:nrow(df)) {
if(df[row - 1, "Subject_ID"] == df[row, "Subject_ID"]) {
df[row,1:2] = df[row,1:2] - df[row - 1,1:2]
}
}
But the code changed it line by line and did not give the correct values for each bin.
If you still insist on using a for loop, you can use the following solution. It is simple, but you first have to copy the relevant columns, because each output value is the difference between two rows of the original data. The copy DF is created outside the loop so that its values stay intact; otherwise every iteration would overwrite values in df, later differences would be computed from already-changed rows, and the final output would be incorrect. Note that this version does not check Subject_ID, which is fine for the single-subject example:
df <- read.table(header = TRUE, text = "
Time Value Bin Subject_ID
1 6 1 1
3 10 2 1
7 18 3 1
8 20 4 1")
DF <- df[, c("Time", "Value")]
for(i in 2:nrow(df)) {
df[i, c("Time", "Value")] <- DF[i, ] - DF[i-1, ]
}
df
Time Value Bin Subject_ID
1 1 6 1 1
2 2 4 2 1
3 4 8 3 1
4 1 2 4 1
The problem with the code in the question is that after row i is changed, the changed row, rather than the original row i, is used in calculating row i+1. To fix that, run the loop in reverse order, that is, use nrow(df):2 in the for statement. Alternatively, try one of these, which do not use any loops and also have the advantage of not overwriting the input, something that makes the code easier to debug.
1) Base R Use ave to perform Diff by group where Diff uses diff to actually perform the differencing.
Diff <- function(x) c(x[1], diff(x))
transform(df,
Time = ave(Time, Subject_ID, FUN = Diff),
Value = ave(Value, Subject_ID, FUN = Diff))
giving:
Time Value Bin Subject_ID
1 1 6 1 1
2 2 4 2 1
3 4 8 3 1
4 1 2 4 1
2) dplyr Using dplyr we write the above except we use lag:
library(dplyr)
df %>%
group_by(Subject_ID) %>%
mutate(Time = Time - lag(Time, default = 0),
Value = Value - lag(Value, default = 0)) %>%
ungroup
giving:
# A tibble: 4 x 4
Time Value Bin Subject_ID
<dbl> <dbl> <int> <int>
1 1 6 1 1
2 2 4 2 1
3 4 8 3 1
4 1 2 4 1
or using across:
library(dplyr)
df %>%
group_by(Subject_ID) %>%
mutate(across(Time:Value, ~ .x - lag(.x, default = 0))) %>%
ungroup
Note
Lines <- "Time Value Bin Subject_ID
1 6 1 1
3 10 2 1
7 18 3 1
8 20 4 1"
df <- read.table(text = Lines, header = TRUE)
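The reverse-order fix mentioned at the start of this answer can be sketched as follows. This is the loop from the question with only the iteration order changed, so that each row is differenced against a previous row that has not been modified yet.

```r
# Example data from the question
df <- read.table(header = TRUE, text = "
Time Value Bin Subject_ID
1 6 1 1
3 10 2 1
7 18 3 1
8 20 4 1")

# Walk from the last row up to row 2; row - 1 still holds its original
# values when row is processed, because it has not been visited yet
for (row in nrow(df):2) {
  if (df[row - 1, "Subject_ID"] == df[row, "Subject_ID"]) {
    df[row, 1:2] <- df[row, 1:2] - df[row - 1, 1:2]
  }
}
df
```

Like the original loop, this overwrites df in place, so keep a copy if the cumulative values are still needed.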
Here is a base R one-liner with diff in a lapply loop.
df[1:2] <- lapply(df[1:2], function(x) c(x[1], diff(x)))
df
# Time Value Bin Subject_ID
#1 1 6 1 1
#2 2 4 2 1
#3 4 8 3 1
#4 1 2 4 1
Data
df <- read.table(text = "
Time Value Bin Subject_ID
1 6 1 1
3 10 2 1
7 18 3 1
8 20 4 1
", header = TRUE)
dplyr one liner
library(dplyr)
df %>% mutate(across(c(Time, Value), ~c(first(.), diff(.))))
#> Time Value Bin Subject_ID
#> 1 1 6 1 1
#> 2 2 4 2 1
#> 3 4 8 3 1
#> 4 1 2 4 1

Wrangle data in R to add variables (columns) and observations (rows) by group

I am trying to rearrange a dataset with a few thousand observations (to eventually use the drm function in package drc), and I am tired of doing it in Excel. Within a dataframe I am looking to add "start" and "end" times (up to Inf) based on the intervals found in a vector within the df. This means I would have to end up adding an observation (row) where the last "end" time is Inf. For that last row (the one with Inf) I ALSO need to subtract the total of "value" from an arbitrary number (in my example below this would be 50). All this grouped by two variables ("names" and "reps" in my example). I am hoping there is a solution using group_by, but honestly I'll be overjoyed at any solution!
I have a data set that looks like this;
# data
names<-c(rep("Luke",30), rep("Han", 30), rep("Leia", 30), rep("OB1", 30))
reps<-c(rep("A", 10), rep("B", 10), rep("C", 10))
time<-rep(seq(1:10), 4)
value<-rep(sample(0:5,10,replace=T), 4)
df<-data.frame(names, reps, time, value)
but need it to look like this:
[Image: example of the data structure I need]
I'm at a loss. Please help!
If I have understood you correctly, we can do
library(dplyr)
df1 <- df %>%
group_by(names, reps) %>%
mutate(start = lag(time, default = 0),
end = time)
bind_rows(df1, df1 %>%
group_by(names, reps) %>%
summarise(start = last(time),
end = Inf,
value = 50 - sum(value))) %>%
select(-time) %>%
arrange(names, reps)
# names reps value start end
# <fct> <fct> <dbl> <dbl> <dbl>
# 1 Han A 2 0 1
# 2 Han A 2 1 2
# 3 Han A 1 2 3
# 4 Han A 1 3 4
# 5 Han A 3 4 5
# 6 Han A 2 5 6
# 7 Han A 0 6 7
# 8 Han A 2 7 8
# 9 Han A 2 8 9
#10 Han A 5 9 10
#11 Han A 30 10 Inf
#.....
We can do this in data.table. After grouping by 'names' and 'reps', shift 'time' to create the start, append Inf to 'time' to create the end, and append 50 minus the sum of 'value' to 'value':
library(data.table)
setDT(df)[, {stL <- last(time)
enL <- Inf
vL <- 50- sum(value)
.(start = c(shift(time, fill = 0), stL),
end = c(time, enL),
value = c(value, vL))}, .(names, reps)]
# names reps start end value
# 1: Luke A 0 1 0
# 2: Luke A 1 2 3
# 3: Luke A 2 3 3
# 4: Luke A 3 4 4
# 5: Luke A 4 5 0
# ---
#128: OB1 C 6 7 3
#129: OB1 C 7 8 0
#130: OB1 C 8 9 2
#131: OB1 C 9 10 5
#132: OB1 C 10 Inf 27
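The same per-group construction can also be sketched in base R with split() and lapply(). The smaller data set, the fixed seed and the names total, pieces and out are assumptions made for illustration; they do not come from the original post.

```r
set.seed(1)  # assumption: fixed seed so the sketch is reproducible

# Smaller stand-in for the question's data: 2 names x 2 reps x 3 time points
df <- data.frame(names = rep(c("Luke", "Han"), each = 6),
                 reps  = rep(rep(c("A", "B"), each = 3), 2),
                 time  = rep(1:3, 4),
                 value = sample(0:5, 12, replace = TRUE))

total <- 50  # the arbitrary number from the question

# Build the start/end/value rows for one group at a time
pieces <- lapply(split(df, list(df$names, df$reps), drop = TRUE), function(g) {
  data.frame(names = g$names[1],
             reps  = g$reps[1],
             start = c(0, g$time),                        # previous time, 0 first
             end   = c(g$time, Inf),                      # extra open-ended row
             value = c(g$value, total - sum(g$value)))    # remainder up to total
})

out <- do.call(rbind, pieces)
rownames(out) <- NULL
out
```

Each group gains exactly one extra row whose end is Inf and whose value brings the group total up to 50.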

Dense Rank by Multiple Columns in R

How can I get a dense rank of multiple columns in a dataframe? For example,
# I have:
df <- data.frame(x = c(1,1,1,1,2,2,2,3,3,3),
y = c(1,2,3,4,2,2,2,1,2,3))
# I want:
res <- data.frame(x = c(1,1,1,1,2,2,2,3,3,3),
y = c(1,2,3,4,2,2,2,1,2,3),
r = c(1,2,3,4,5,5,5,6,7,8))
res
x y r
1 1 1 1
2 1 2 2
3 1 3 3
4 1 4 4
5 2 2 5
6 2 2 5
7 2 2 5
8 3 1 6
9 3 2 7
10 3 3 8
My hack approach works for this particular dataset:
df %>%
arrange(x,y) %>%
mutate(r = if_else(y - lag(y,default=0) == 0, 0, 1)) %>%
mutate(r = cumsum(r))
But there must be a more general solution, maybe using functions like dense_rank() or row_number(). But I'm struggling with this.
dplyr solutions are ideal.
Right after posting, I think I found a solution here. In my case, it would be:
mutate(df, r = dense_rank(interaction(x,y,lex.order=T)))
But if you have a better solution, please share.
data.table
data.table has you covered with frank().
library(data.table)
frank(df, x, y, ties.method = 'dense')
[1] 1 2 3 4 5 5 5 6 7 8
You can use df$r <- frank(df, x, y, ties.method = 'dense') to add the ranks as a new column.
tidyr/dplyr
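Another option (though clunkier) is to use tidyr::unite to collapse your columns into one and then apply dplyr::dense_rank. Note that the united column is character, so it is ranked as text; this matches numeric order only while all values have the same number of digits, as they do here.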
library(tidyverse)
df %>%
# add a single column with all the info
unite(xy, x, y) %>%
cbind(df) %>%
# dense rank on that
mutate(r = dense_rank(xy)) %>%
# now drop the helper col
select(-xy)
You can use cur_group_id:
library(dplyr)
df %>%
group_by(x, y) %>%
mutate(r = cur_group_id())
# x y r
# <dbl> <dbl> <int>
# 1 1 1 1
# 2 1 2 2
# 3 1 3 3
# 4 1 4 4
# 5 2 2 5
# 6 2 2 5
# 7 2 2 5
# 8 3 1 6
# 9 3 2 7
# 10 3 3 8
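For comparison, a dense rank over several columns can also be sketched in base R without extra packages: sort the rows, take the distinct key combinations in that order, and look each row up among them. The names ord and key are illustrative only.

```r
# Example data from the question
df <- data.frame(x = c(1, 1, 1, 1, 2, 2, 2, 3, 3, 3),
                 y = c(1, 2, 3, 4, 2, 2, 2, 1, 2, 3))

# Row order when sorting by x, then y (numeric sort, not text sort)
ord <- order(df$x, df$y)

# One key per row identifying its (x, y) combination
key <- paste(df$x, df$y, sep = "|")

# Distinct keys in sorted order; a row's dense rank is its key's position there
df$r <- match(key, unique(key[ord]))
df
```

Because the ordering comes from order() on the numeric columns, this avoids the text-sorting issue that can affect approaches ranking a pasted key directly.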

Filter all rows of a group according to specific member of group [duplicate]

I want to filter an entire group based on a value at a specified row.
In the data below, I'd like to remove all rows of a group ID, according to the value of Metric for Hour == '2'. (Note that I am not trying to filter based on two conditions here; I'm trying to filter based on one condition, but at a specific row.)
Sample data:
ID <- c('A','A','A','A','A','B','B','B','B','C','C')
Hour <- c('0','2','5','6','9','0','2','5','6','0','2')
Metric <- c(3,4,1,6,7,8,8,3,6,1,1)
x <- data.frame(ID, Hour, Metric)
ID Hour Metric
1 A 0 3
2 A 2 4
3 A 5 1
4 A 6 6
5 A 9 7
6 B 0 8
7 B 2 8
8 B 5 3
9 B 6 6
10 C 0 1
11 C 2 1
I want to filter each ID based on whether Metric > 5 for Hour == '2'. The result should look like this (all rows of ID B are removed):
ID Hour Metric
1 A 0 3
2 A 2 4
3 A 5 1
4 A 6 6
5 A 9 7
10 C 0 1
11 C 2 1
A dplyr-based solution would be preferred, but any help is much appreciated.
Adapting "How to filter (with dplyr) for all values of a group if variable limit is reached?", we get:
x %>%
group_by(ID) %>%
filter(any(Metric[Hour == '2'] <= 5))
# # A tibble: 7 x 3
# # Groups: ID [2]
# ID Hour Metric
# <fctr> <fctr> <dbl>
# 1 A 0 3
# 2 A 2 4
# 3 A 5 1
# 4 A 6 6
# 5 A 9 7
# 6 C 0 1
# 7 C 2 1
These types of problems can also be solved by first creating a per-group intermediate variable that flags whether rows should be removed.
Method 1:
x %>%
group_by(ID) %>%
mutate(keep_group = (any(Metric[Hour == '2'] <= 5))) %>%
ungroup %>%
filter(keep_group) %>%
select(-keep_group)
Method 2:
groups_to_keep <-
x %>%
filter(Hour == '2', Metric <= 5) %>%
select(ID) %>%
distinct() # N.B. distinct() keeps IDs in order of first appearance; it does not sort them
# ID
# 1 A
# 2 C
x %>%
inner_join(groups_to_keep, by = 'ID')
# ID Hour Metric
# 1 A 0 3
# 2 A 2 4
# 3 A 5 1
# 4 A 6 6
# 5 A 9 7
# 6 C 0 1
# 7 C 2 1
Method 3 - as suggested by @thelatemail (safe with respect to duplicates in ID):
groups_not_to_keep <-
x %>%
filter(Hour == '2', Metric > 5) %>%
select(ID)
x %>%
anti_join(groups_not_to_keep, by = 'ID')
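The per-group flag idea behind these methods can also be sketched in base R with ave(). The names hit, drop_group and kept are hypothetical and used only for illustration.

```r
# Sample data from the question
ID     <- c('A','A','A','A','A','B','B','B','B','C','C')
Hour   <- c('0','2','5','6','9','0','2','5','6','0','2')
Metric <- c(3, 4, 1, 6, 7, 8, 8, 3, 6, 1, 1)
x <- data.frame(ID, Hour, Metric)

# 1 for rows that trigger removal of their group: Hour '2' with Metric > 5
hit <- as.integer(x$Hour == '2' & x$Metric > 5)

# Spread the flag to every row of the same ID (max within each group)
drop_group <- ave(hit, x$ID, FUN = max) == 1

# Keep only rows whose group was not flagged
kept <- x[!drop_group, ]
kept
```

As with the anti_join version, a group is dropped if any of its Hour '2' rows exceeds the threshold, and the original row order is preserved.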
Negating %in%, that is !(ID %in% ...), is useful here. Try this:
library(dplyr)
filter(x, Metric > 5 & Hour == '2')$ID # gives B
subset(x, !(ID %in% filter(x, Metric > 5 & Hour == '2')$ID))
