I want to subset a data.frame based on a condition in R. I have the following data.frame:
df
id | message | cluster
-------+-----------------+----------------
1 | Test A | 1
2 | Test B | 1
3 | Test C | 3
4 | Test D | 1
5 | Test E | 2
6 | Test F | 2
7 | Test G | 3
8 | Test H | 3
9 | Test I | 1
10 | Test K | 2
11 | Test L | 4
12 | Test M | 4
I want to construct a new data.frame with 4 rows (one per distinct cluster), taking the first message of each cluster as its representative. So I want to get the following data.frame:
df2
id | message | cluster
-------+-----------------+----------------
1 | Test A | 1
3 | Test C | 3
5 | Test E | 2
11 | Test L | 4
As an alternative approach, the dplyr package is nice for these kinds of things.
text <- "id | message | cluster
1 | Test A | 1
2 | Test B | 1
3 | Test C | 3
4 | Test D | 1
5 | Test E | 2
6 | Test F | 2
7 | Test G | 3
8 | Test H | 3
9 | Test I | 1
10 | Test K | 2
11 | Test L | 4
12 | Test M | 4"
library(readr)
df <- read_delim(text, delim = "|", trim_ws=TRUE)
library(dplyr)
df2 <-
  df %>%
  group_by(cluster) %>%
  summarize(message = first(message))
And here's the result:
> df2
# A tibble: 4 x 2
cluster message
<int> <chr>
1 1 Test A
2 2 Test E
3 3 Test C
4 4 Test L
(It might be useful to arrange the data so that "first" is predictable.)
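To make "first" well defined and also keep the id column from the requested output, one option is to sort first and then keep one row per cluster. This is a minimal sketch, assuming the question's table is rebuilt by hand as df; it uses dplyr's arrange() and distinct() with .keep_all = TRUE rather than summarize():

```r
library(dplyr)

# Example data rebuilt from the question's table
df <- data.frame(
  id      = 1:12,
  message = paste("Test", c("A", "B", "C", "D", "E", "F",
                            "G", "H", "I", "K", "L", "M")),
  cluster = c(1, 1, 3, 1, 2, 2, 3, 3, 1, 2, 4, 4)
)

# Sort so that "first" means "lowest id", then keep the first row per cluster
df2 <- df %>%
  arrange(id) %>%
  distinct(cluster, .keep_all = TRUE)

df2
```

The rows come back in order of first appearance of each cluster, so all three columns of the original are preserved.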
Get the indices of the rows you want to collect:
indices <- !duplicated(df$cluster)
Use that to subset the dataframe:
df2 <- df[indices, ]
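For reference, here is the same idea as a complete script. It is only a sketch that assumes the question's table has been typed in as df. duplicated() returns TRUE for the second and later occurrences of a value, so negating it keeps the first row of each cluster:

```r
# Example data rebuilt from the question's table
df <- data.frame(
  id      = 1:12,
  message = paste("Test", c("A", "B", "C", "D", "E", "F",
                            "G", "H", "I", "K", "L", "M")),
  cluster = c(1, 1, 3, 1, 2, 2, 3, 3, 1, 2, 4, 4)
)

# TRUE for the first occurrence of each cluster value, FALSE afterwards
indices <- !duplicated(df$cluster)

# Keep only those rows (all columns)
df2 <- df[indices, ]
df2
```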
Suppose a respondent (id) makes a binary (discrete) choice, selecting either 1 or 2, in each of five tasks (t = 1, 2, 3, 4, 5), which gives a panel dataset with five observations per respondent.
If a respondent selects choice 1, the outcome is a fixed value (let's say always 30). If a respondent selects choice 2, the outcome depends on which treatment the respondent is in (each respondent is randomly assigned to exactly one treatment). Let's say there are four treatments, each a vector of five possible outcomes for choice 2.
That is,
treat1= 1,2,3,4,5
treat2= 6,7,8,9,10
treat3= 11,12,13,14,15
treat4= 16,17,18,19,20
For example, in the case of treat1, if a respondent selects choice 2 in the first task, the outcome is 1. If the respondent selects choice 1 in the second task, the outcome is 30 (as always). If the respondent selects choice 2 in the third task, the outcome is 2 (and not 3). That is, when choice 2 is selected for the first time in treat1, pick the first value from the treat1 sequence; when choice 2 is selected for the second time in treat1, pick the second value from the treat1 sequence, and so on.
The outcome looks like the below.
+----+---+-----------+--------+---------+
| id | t | treatment | choice | outcome |
+----+---+-----------+--------+---------+
| 1 | 1 | 1 | 2 | 1 |
| 1 | 2 | 1 | 1 | 30 |
| 1 | 3 | 1 | 2 | 2 |
| 1 | 4 | 1 | 1 | 30 |
| 1 | 5 | 1 | 2 | 3 |
| 2 | 1 | 3 | 1 | 30 |
| 2 | 2 | 3 | 2 | 11 |
| 2 | 3 | 3 | 2 | 12 |
| 2 | 4 | 3 | 1 | 30 |
| 2 | 5 | 3 | 2 | 13 |
| 3 | 1 | 2 | 2 | 6 |
| 3 | 2 | 2 | 1 | 30 |
| 3 | 3 | 2 | 1 | 30 |
| 3 | 4 | 2 | 1 | 30 |
| 3 | 5 | 2 | 2 | 7 |
| 4 | 1 | 4 | 1 | 30 |
| 4 | 2 | 4 | 1 | 30 |
| 4 | 3 | 4 | 1 | 30 |
| 4 | 4 | 4 | 2 | 16 |
| 4 | 5 | 4 | 1 | 30 |
| 5 | 1 | 2 | 1 | 30 |
| 5 | 2 | 2 | 1 | 30 |
| 5 | 3 | 2 | 1 | 30 |
| 5 | 4 | 2 | 1 | 30 |
| 5 | 5 | 2 | 2 | 6 |
| . | . | . | . | . |
| . | . | . | . | . |
| . | . | . | . | . |
| . | . | . | . | . |
| . | . | . | . | . |
+----+---+-----------+--------+---------+
Since my data has thousands of observations, I was wondering what would be an efficient way to generate the variable outcome.
The id, t, treatment, and choice variables are available in my dataset.
Any thoughts would be appreciated. Thanks.
Another possible approach is to organize the treatments into a data.table (treat, defined under "data" below), then do a join and update by reference when choice == 2:
#the sequence of treatment when choice==2
DT[choice==2, ri := rowid(id)]
#look up treatment for the sequence
DT[choice==2, outcome := treat[.SD, on=.(treatment, ri), val]]
#set outcome to 30 for choice=1
DT[choice==1, outcome := 30]
#delete column
DT[, ri := NULL]
data:
library(data.table)
treat <- data.table(treatment=rep(1:4, each=5),
ri=rep(1:5, times=4),
val=1:20)
DT <- fread("id,t,treatment,choice,outcome
1,1,1,2,1
1,2,1,1,30
1,3,1,2,2
1,4,1,1,30
1,5,1,2,3")
DT[, outcome := NULL]
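The same lookup can also be sketched in base R, in case data.table is not available. This is not the approach above, just an illustration of the same logic: it assumes the rows are already sorted by t within id, stores the treatments as a 4 x 5 matrix (a layout chosen here for illustration), and uses the first two respondents from the question as test data.

```r
# Row = treatment, column = k-th time choice 2 is selected
treat_mat <- matrix(1:20, nrow = 4, byrow = TRUE)

# First two respondents from the question's table
d <- data.frame(
  id        = rep(1:2, each = 5),
  t         = rep(1:5, times = 2),
  treatment = rep(c(1, 3), each = 5),
  choice    = c(2, 1, 2, 1, 2,  1, 2, 2, 1, 2)
)

# Running count of choice == 2 within each respondent
k <- ave(as.integer(d$choice == 2), d$id, FUN = cumsum)

# Default outcome is 30; overwrite the choice == 2 rows via matrix indexing
d$outcome <- 30
pick <- d$choice == 2
d$outcome[pick] <- treat_mat[cbind(d$treatment[pick], k[pick])]
d
```

Indexing a matrix with a two-column matrix of (row, column) pairs is vectorized, so this avoids any explicit loop over rows.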
You did not provide any sample data, so I create some fake data first
Data
set.seed(1)
treat_lkp <- list(trt1 = 1:5, trt2 = 6:10, trt3 = 11:15, trt4 = 16:20)
d_in <- expand.grid(task = 1:5, id = 1:5)
d_in$treatment <- paste0("trt", d_in$id %% 4 + 1)
d_in$choice <- sample(2, NROW(d_in), TRUE)
tidyverse solution
I use a simple tidyverse solution.
library(purrr)
library(dplyr)
d_out <- d_in %>%
  group_by(id) %>%
  mutate(task_new = cumsum(choice == 2)) %>%
  ungroup() %>%
  mutate(outcome = {
    # one lookup vector per row, selected by that row's treatment
    l <- treat_lkp[as.character(treatment)]
    pmap_dbl(list(task = task_new, choice = choice, set = l),
             function(task, choice, set)
               ifelse(choice == 1, 30, set[task])
    )}
  )
head(d_out)
# # A tibble: 6 x 6
# task id treatment choice task_new outcome
# <int> <int> <chr> <int> <int> <dbl>
# 1 1 1 trt2 1 0 30
# 2 2 1 trt2 1 0 30
# 3 3 1 trt2 2 1 6
# 4 4 1 trt2 2 2 7
# 5 5 1 trt2 1 2 30
# 6 1 2 trt3 2 1 11
Explanation
First, create a list l with the relevant lookup values for your outcome (these depend on the treatment). Then loop over task, treatment and choice to select either 30 (for choice == 1) or the right lookup value from l.
Update
Taking the comment into account, we now first need to create a task_new variable that holds the correct position: the first choice == 2 should result in 1, the second in 2, and so on. So we group_by id and add the counter via cumsum. We then use task_new in the mutate call after ungrouping the data.
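To see the counter in isolation, here is a small sketch that uses the choices of the first two respondents from the question (the data frame d is made up for this illustration):

```r
library(dplyr)

# Choices of respondents 1 and 2 from the question's table
d <- data.frame(
  id     = rep(1:2, each = 5),
  choice = c(2, 1, 2, 1, 2,  1, 2, 2, 1, 2)
)

# Within each id, count how many times choice 2 has been selected so far
d_cnt <- d %>%
  group_by(id) %>%
  mutate(task_new = cumsum(choice == 2)) %>%
  ungroup()

d_cnt
```

On rows where choice == 1 the counter simply carries the previous value (or 0), which does not matter because those rows get the fixed outcome 30 anyway.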
Consider the following data frame, with x being increasing numbers (they don't have to be integers) and y varying values:
x | y | d.x | d.y | abs(d.y)/d.x
---------------------------------
1 | 5 | 1 | 5 | 5
2 | 9 | 1 | 4 | 4
3 | 1 | 1 | -8 | 8
4 | 7 | 1 | 6 | 6
5 | 5 | 1 | -2 | 2
6 | 1 | 1 | -4 | 4
7 | 3 | 1 | 2 | 2
8 | 9 | 1 | 6 | 6
Here, d.x and d.y represent the differenced values. I want to identify the first value of abs(d.y)/d.x that is above a certain threshold (e.g. >5). This would be at x = 3. From the next row I want to reset abs(d.y)/d.x so that it skips x = 3, i.e.:
x | y | d.x | d.y | abs(d.y)/d.x
---------------------------------
1 | 5 | 1 | 5 | 5
2 | 9 | 1 | 4 | 4
4 | 7 | 2 | -2 | 1
5 | 5 | 1 | -2 | 2
6 | 1 | 1 | -4 | 4
7 | 3 | 1 | 2 | 2
8 | 9 | 1 | 6 | 6
Now x = 8 would be identified as being above the threshold. This problem is easy to implement when there are no consecutive values of abs(d.y)/d.x that overlap at some point. It can be implemented with a while loop, for example, but I'm certain that someone knows how to do this with e.g. ave, dplyr, etc. I'm doing this with hundreds of vectors, each with more than 10^6 entries, so speed is important.
The values above the threshold do not have to be removed; I simply want to end up with a vector that contains the index positions (i.e. 3, 8, etc.).
Thanks in advance.
My MWE with some random numbers.
# Random initialization
library(pracma) # provides randi()
set.seed(1)
y <- randi(20,20000,1)
x <- 1:length(y)
d.x <- diff(x)
d.y <- diff(y)
dy_dx <- abs(d.y)/d.x
idx <- array(0,length(d.y)) # stores if dy_dx above threshold
# Randomly chosen threshold
threshold <- 14
while(1==1){
# Identify values above threshold
idx.new <- match(TRUE, dy_dx > threshold) # identifies first index
if (is.na(idx.new)){
break
}
idx[idx.new] <- 1 # flags value above threshold
d.y[idx.new+1] <- d.y[idx.new]+d.y[idx.new+1]
d.x[idx.new+1] <- d.x[idx.new]+d.x[idx.new+1]
d.y[idx.new] <- 0 # zero the flagged difference so it is not found again (setting d.x to 0 would give Inf)
dy_dx <- abs(d.y)/d.x
}
idx <- c(0,idx)
UPDATE: running it as a for-loop is significantly faster:
for (i in 1:length(dy_dx)){
if (dy_dx[i] <= threshold){
next
} else {
idx[i] <- 1
d.y[i+1] <- d.y[i]+d.y[i+1]
d.x[i+1] <- d.x[i]+d.x[i+1]
dy_dx[i+1] <- abs(d.y[i+1])/(d.x[i+1])
dy_dx[i] <- 0
}
}
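The single-pass loop can be wrapped in a function and checked against the small table from the question. This is a sketch only: the function name flag_jumps is made up here, the differences are taken with diff() (so the first row has no difference), and the returned positions refer to rows of x, which is why 1 is added at the end.

```r
flag_jumps <- function(x, y, threshold) {
  d.x <- diff(x)
  d.y <- diff(y)
  n <- length(d.y)
  flagged <- logical(n)
  for (i in seq_len(n)) {
    if (abs(d.y[i]) / d.x[i] > threshold) {
      flagged[i] <- TRUE
      # Skip the flagged point: fold its difference into the next one
      if (i < n) {
        d.y[i + 1] <- d.y[i] + d.y[i + 1]
        d.x[i + 1] <- d.x[i] + d.x[i + 1]
      }
    }
  }
  # Difference i belongs to row i + 1 of the original data
  which(flagged) + 1L
}

# Example from the question, threshold 5
flag_jumps(1:8, c(5, 9, 1, 7, 5, 1, 3, 9), threshold = 5)
```

For the question's example this flags x = 3 and x = 8. The guard i < n avoids writing past the end of the difference vectors when the last point is flagged.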
I have multiple data frames like the ones below, with a unique Id for each row. I am trying to find the common rows, i.e. those that appear in at least two data frames, and collect them in a new data frame.
For example, the row with Id = 2 appears in all three data frames. Similarly, the row with Id = 3 is in df1 and df3.
I want to write a loop that finds the common rows and creates a new data frame from them.
df1 <- data.frame(Id=c(1,2,3,4),a=c(0,1,0,2),b=c(1,0,1,0),c=c(0,0,4,0))
df2 <- data.frame(Id=c(7,2,5,9),a=c(4,1,9,2),b=c(1,0,1,5),c=c(3,0,7,0))
df3 <- data.frame(Id=c(5,3,2,6),a=c(9,0,1,5),b=c(1,1,0,0),c=c(7,4,0,0))
> df1 > df2
Id | a | b | c | Id | a | b | c |
---|---|---|---| ---|---|---|---|
1 | 0 | 1 | 0 | 7 | 4 | 1 | 3 |
---|---|---|---| ---|---|---|---|
2 | 1 | 0 | 0 | 2 | 1 | 0 | 0 |
---|---|---|---| ---|---|---|---|
3 | 0 | 1 | 4 | 5 | 9 | 1 | 7 |
---|---|---|---| ---|---|---|---|
4 | 2 | 0 | 0 | 9 | 2 | 5 | 0 |
> df3
Id | a | b | c |
---|---|---|---|
5 | 9 | 1 | 7 |
---|---|---|---|
3 | 0 | 1 | 4 |
---|---|---|---|
2 | 1 | 0 | 0 |
---|---|---|---|
6 | 5 | 0 | 0 |
> expected_output
Id | a | b | c |
---|---|---|---|
5 | 9 | 1 | 7 |
---|---|---|---|
3 | 0 | 1 | 4 |
---|---|---|---|
2 | 1 | 0 | 0 |
---|---|---|---|
Note: Id is unique.
Also, I want to remove from the original data frames the duplicated rows that I use to create the new data frame.
Since no ID appears twice in the same table, we can tabulate the IDs and keep any found twice:
library(data.table)
DTs = lapply(list(df1,df2,df3), data.table)
Id_keep = rbindlist(lapply(DTs, `[`, j = "Id"))[, .N, by=Id][N >= 2L, Id]
DT_keep = Reduce(funion, DTs)[Id %in% Id_keep]
# Id a b c
# 1: 2 1 0 0
# 2: 3 0 1 4
# 3: 5 9 1 7
Your data should be in an object like DTs to begin with, not a bunch of separate named objects.
How it works
To get a sense of how it works, examine intermediate objects like
list(df1,df2,df3)
lapply(DTs, `[`, j = "Id")
Reduce(funion, DTs)
Also, read the help files, like ?lapply, ?rbindlist, ?funion.
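The tabulate-and-filter idea can also be sketched in base R without data.table. This is an illustration of the same logic, not the code above; the names dfs, counts and common are chosen here for the example, and it relies on the question's note that an Id appears at most once per data frame.

```r
df1 <- data.frame(Id = c(1, 2, 3, 4), a = c(0, 1, 0, 2), b = c(1, 0, 1, 0), c = c(0, 0, 4, 0))
df2 <- data.frame(Id = c(7, 2, 5, 9), a = c(4, 1, 9, 2), b = c(1, 0, 1, 5), c = c(3, 0, 7, 0))
df3 <- data.frame(Id = c(5, 3, 2, 6), a = c(9, 0, 1, 5), b = c(1, 1, 0, 0), c = c(7, 4, 0, 0))

dfs <- list(df1, df2, df3)

# Count in how many data frames each Id occurs
counts <- table(unlist(lapply(dfs, function(d) d$Id)))
Id_keep <- as.numeric(names(counts)[counts >= 2])

# Stack everything, drop exact duplicate rows, keep the shared Ids
common <- unique(do.call(rbind, dfs))
common <- common[common$Id %in% Id_keep, ]
common
```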
Combine all of the data frames:
combined <- rbind(df1, df2, df3)
Extract the duplicates:
duplicate_rows <- unique(combined[duplicated(combined), ])
(duplicated(combined) returns a logical vector that is TRUE for every row that repeats an earlier row)
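Put together with the question's data, and extended to the second part of the question (removing the shared rows from the originals), this might look as follows. It is a sketch that assumes rows with the same Id are identical across data frames, as in the example; the name df1_rest is made up here.

```r
df1 <- data.frame(Id = c(1, 2, 3, 4), a = c(0, 1, 0, 2), b = c(1, 0, 1, 0), c = c(0, 0, 4, 0))
df2 <- data.frame(Id = c(7, 2, 5, 9), a = c(4, 1, 9, 2), b = c(1, 0, 1, 5), c = c(3, 0, 7, 0))
df3 <- data.frame(Id = c(5, 3, 2, 6), a = c(9, 0, 1, 5), b = c(1, 1, 0, 0), c = c(7, 4, 0, 0))

combined <- rbind(df1, df2, df3)

# Rows seen before; unique() collapses rows that occur three or more times
duplicate_rows <- unique(combined[duplicated(combined), ])

# Remove the shared rows from an original data frame (same pattern for df2, df3)
df1_rest <- df1[!df1$Id %in% duplicate_rows$Id, ]

duplicate_rows
df1_rest
```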
My data looks like this:
Item | Category
A | 1
A |
A |
A | 1
A |
A |
A | 1
B | 2
B |
B |
B | 2
B |
B |
B | 2
B |
B |
I want to impute values and fill the "Category" column with the values corresponding to each "Item", wherever it isn't blank. The end result should be like this:
Item | Category
A | 1
A | 1
A | 1
A | 1
A | 1
A | 1
A | 1
B | 2
B | 2
B | 2
B | 2
B | 2
B | 2
B | 2
B | 2
B | 2
How can I do this in R?
We can use fill from the tidyverse (it fills NA values, so the blanks must be read in as NA):
library(tidyverse)
df1 %>%
fill(Category)
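Since the fill is meant to happen per Item, it may be safer to group first so that a value never carries over from one Item to the next. A minimal sketch, assuming the blanks are NA and the table from the question is rebuilt as df1; .direction = "downup" also covers a group whose first row is blank:

```r
library(dplyr)
library(tidyr)

# Example data rebuilt from the question, with blanks as NA
df1 <- data.frame(
  Item     = rep(c("A", "B"), times = c(7, 9)),
  Category = c(1, NA, NA, 1, NA, NA, 1,
               2, NA, NA, 2, NA, NA, 2, NA, NA)
)

# Fill within each Item: downwards first, then upwards for leading NAs
filled <- df1 %>%
  group_by(Item) %>%
  fill(Category, .direction = "downup") %>%
  ungroup()

filled
```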
I have a data.frame, df, as below:
id | name | value
1 | team1 | 3
1 | team2 | 1
2 | team1 | 1
2 | team2 | 4
3 | team1 | 0
3 | team2 | 6
4 | team1 | 1
4 | team2 | 2
5 | team1 | 3
5 | team2 | 0
How do we subset the data frame to get the rows for all values of id in 2:4?
We can subset with a condition like df[df$id >= 2 & df$id <= 4, ]. But is there a way to directly use a vector of integer ranges like ids <- c(2:4) to subset a data frame?
One way to do this is df[df$id >= min(ids) & df$id <= max(ids), ].
Is there a more elegant R way of doing this?
The most typical way is mentioned already, but there are also variations using match:
with(df, df[match(id, 2:4, nomatch = 0L) > 0, ])
or, similar
with(df, df[is.element(id, 2:4), ])
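For completeness, the usual idiom for this is %in%, which is what is.element() calls internally. A small sketch with the question's data typed in as df and the wanted ids stored in ids:

```r
# Example data rebuilt from the question's table
df <- data.frame(
  id    = rep(1:5, each = 2),
  name  = rep(c("team1", "team2"), times = 5),
  value = c(3, 1, 1, 4, 0, 6, 1, 2, 3, 0)
)

ids <- 2:4

# Keep the rows whose id is one of the wanted values
sub <- df[df$id %in% ids, ]
sub
```

Unlike the min()/max() comparison, this also works when ids is not a contiguous range, e.g. ids <- c(2, 4).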