ifelse with condition on rows [duplicate]

This question already has answers here:
Remove duplicated rows
(10 answers)
Closed 5 years ago.
I want to group the data by ID and slice the first row, based on a condition.
I have the following dataset:
head(data)
ID Cond1
A 10
A 10
B 20
B 30
Now, I want to slice the rows based on a condition:
If the values in Cond1 are unique within an ID, keep them all;
if a value in Cond1 is duplicated, keep only the top row.
Any ideas?

You can use the base R function ave like this:
datafr[!(ave(datafr$Cond1, datafr$ID, FUN=duplicated)),]
ID Cond1
1 A 10
3 B 20
4 B 30
ave returns a numeric vector by ID, with a 1 if the element of Cond1 is duplicated and a 0 if it is not. The ! performs two roles: first, it converts the resulting vector to a logical vector appropriate for subsetting; second, it reverses the result, keeping the non-duplicated elements.
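To see what the ! is converting, here is the intermediate ave() result (a quick sketch, run against the datafr object defined in the data section below):
ave(datafr$Cond1, datafr$ID, FUN = duplicated)
# [1] 0 1 0 0
!ave(datafr$Cond1, datafr$ID, FUN = duplicated)
# [1]  TRUE FALSE  TRUE  TRUE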
In data.table, you could subset with a grouped logical vector.
setDT(datafr)[datafr[, !duplicated(Cond1), by=ID]$V1]
ID Cond1
1: A 10
2: B 20
3: B 30
The inner expression returns a logical vector flagging the non-duplicated elements by ID, which is pulled out as a vector via $V1. This logical vector is fed to the original data.table to perform the subsetting.
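The intermediate grouped expression looks like this (a sketch, after the setDT above has converted datafr):
datafr[, !duplicated(Cond1), by = ID]
#    ID    V1
# 1:  A  TRUE
# 2:  A FALSE
# 3:  B  TRUE
# 4:  B  TRUE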
data
datafr <-
structure(list(ID = c("A", "A", "B", "B"), Cond1 = c(10L, 10L,
20L, 30L)), .Names = c("ID", "Cond1"), row.names = c(NA, -4L), class = "data.frame")

We can use n_distinct to filter
library(dplyr)
data %>%
  group_by(ID) %>%
  filter(n_distinct(Cond1) == n() | row_number() == 1)
Or just
data[!duplicated(data),]
# ID Cond1
#1 A 10
#3 B 20
#4 B 30
Based on the description in the OP's post, if we include another row with B 20, the first solution should give
data %>%
  group_by(ID) %>%
  filter(n_distinct(Cond1) == n() | row_number() == 1)
# A tibble: 2 x 2
# Groups: ID [2]
# ID Cond1
# <chr> <int>
#1 A 10
#2 B 20
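Constructing that case explicitly (a sketch; the extra B 20 row is an assumption based on the description above):
data2 <- rbind(data, data.frame(ID = "B", Cond1 = 20))
data2 %>%
  group_by(ID) %>%
  filter(n_distinct(Cond1) == n() | row_number() == 1)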


Filtering out specific rows - taking the first of each type of subject information [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 1 year ago.
I'm looking to take the first row of each SubjectID and make it into a new dataset. Essentially, the numbers and letters represent a longitudinal study, but I only want the first test from each subject as a baseline dataframe.
My goal is to start from the data below; let's name this dataframe DF1:
SubjectID    Sex  Age  Race  100s of more columns...
WS11Q6Y-01   F    32   C
WS11Q6Y-02   F    32   C
SEES45W-101  M    12   B
SEES45W      M    12   B
SEES45W      M    12   B
JWE98UW-03   F    45   H
JWE98UW-W4   F    45   H
I want it to look like this, taking only the first row of each type of name. What would be the most effective way of doing this?
Ideally, I would like to make a new dataframe called DF2 (shown below)
SubjectID    Sex  Age  Race  100s of more columns...
WS11Q6Y-01   F    32   C
SEES45W-101  M    12   B
JWE98UW-03   F    45   H
Essentially, the goal is to extract the first row of each letter-series prefix (e.g., WS11Q6Y), together with the other variables, so that I get one row per subject.
Any suggestions on addressing this would be much appreciated! Thank you!
We get the substring of 'SubjectID' and use that as a grouping variable to slice out the first row. Here, the regex matches the - followed by any characters (.*) up to the end ($) of the string; removing that suffix makes the 'SubjectID' values comparable. We then group by this key and return the first row of each group with slice_head:
library(dplyr)
library(stringr)
df1 %>%
  group_by(grp = str_remove(SubjectID, "-.*$")) %>%
  slice_head(n = 1) %>%
  ungroup %>%
  select(-grp)
-output
# A tibble: 3 x 4
SubjectID Sex Age Race
<chr> <chr> <int> <chr>
1 JWE98UW-03 F 45 H
2 SEES45W-101 M 12 B
3 WS11Q6Y-01 F 32 C
Or, without using group_by, make use of duplicated, which returns a logical vector marking duplicate elements of the substring of 'SubjectID'; negating it (!) turns TRUE into FALSE and vice versa, keeping the first unique rows:
df1 %>%
  filter(!duplicated(str_remove(SubjectID, "-.*$")))
Or this can be done in base R as well
subset(df1, !duplicated(sub("-.*$", "", SubjectID)))
SubjectID Sex Age Race
1 WS11Q6Y-01 F 32 C
3 SEES45W-101 M 12 B
6 JWE98UW-03 F 45 H
data
df1 <- structure(list(SubjectID = c("WS11Q6Y-01", "WS11Q6Y-02", "SEES45W-101",
"SEES45W", "SEES45W", "JWE98UW-03", "JWE98UW-W4"), Sex = c("F",
"F", "M", "M", "M", "F", "F"), Age = c(32L, 32L, 12L, 12L, 12L,
45L, 45L), Race = c("C", "C", "B", "B", "B", "H", "H")),
class = "data.frame", row.names = c(NA,
-7L))
data.table
Using the approach of @akrun:
setDT(df)[, .SD[1], by = gsub("-.*$", "", SubjectID)][]
gsub Sex Age Race
1: WS11Q6Y F 32 C
2: SEES45W M 12 B
3: JWE98UW F 45 H
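An equivalent data.table sketch uses unique(), which keeps the first row per key; the grp helper column here is an assumption added for illustration:
library(data.table)
setDT(df1)[, grp := sub("-.*$", "", SubjectID)]
unique(df1, by = "grp")[, grp := NULL][]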

How to iterate column values to find out all possible combinations in R? [duplicate]

This question already has answers here:
Count common sets of items between different customers
(4 answers)
Intersect all possible combinations of list elements
(3 answers)
Closed 1 year ago.
Suppose you have a dataframe with ids and elements assigned to each id. For example:
example <- data.frame(id = c(1,1,1,1,1,2,2,2,3,4,4,4,4,4,4,4,5,5,5,5),
vals = c("a","b",'c','d','e','a','b','d','c',
'd','f','g','h','a','k','l','m', 'a',
'b', 'c'))
I want to find all possible pair combinations. The main struggle here is not the functionality of the R language, but the logic. How can I iterate through all elements and find patterns? For instance, a was picked together with b 3 times in my sample dataframe. But the original dataframe has more than 30k rows, so I cannot count these combinations manually. How do I automate the process of counting the number of picks for each pair of elements?
I was thinking about widening my df with pivot_wider and then using map_lgl to find matches. Then I faced the problem that it will take a lot of time for me to find all possible combinations, applying map_lgl for every pair of elements.
I asked nearly the same question less than a month ago; fellow users answered it, but the result is not what I actually need.
Do you have any ideas how to create a dataframe with all possible combinations of values for all ids?
I understand that this code is slow, but here is another example to get the expected output, based on the tidyverse packages.
What I do here is first create a nested dataframe by id, then produce all pair combinations for each id, unnest the dataframe, and finally count the pairs.
library(tidyverse)
example <- data.frame(
id = c(1,1,1,1,1,2,2,2,3,4,4,4,4,4,4,4,5,5,5,5),
vals = c("a","b",'c','d','e','a','b','d','c','d','f','g','h','a','k','l','m','a','b', 'c')
)
example %>%
  nest(dataset = -id) %>%
  mutate(dataset = map(dataset, function(d) {
    if (nrow(d) > 1) {
      combn(d$vals, 2) %>% t() %>% as_tibble(.name_repair = ~c("val1", "val2"))
    } else NULL
  })) %>%
  unnest(cols = dataset) %>%
  group_by(val1, val2) %>%
  summarize(n = n(), .groups = "drop") %>%
  arrange(desc(n), val1, val2)
#> # A tibble: 34 x 3
#> val1 val2 n
#> <chr> <chr> <int>
#> 1 a b 3
#> 2 a c 2
#> 3 a d 2
#> 4 b c 2
#> 5 b d 2
#> 6 a e 1
#> 7 a k 1
#> 8 a l 1
#> 9 b e 1
#> 10 c d 1
#> # … with 24 more rows
Created on 2021-03-04 by the reprex package (v1.0.0)
This won't (can't) be fast for many IDs. If it is too slow, you need to parallelize or implement it in a compiled language (e.g., using Rcpp).
We sort vals. We can then create all combinations of two items grouped by ID, excluding IDs with only one item, and finally tabulate the result. Because vals are sorted first, the pairs here are order-independent; the tidyverse version above does not sort, so it counts, for instance, (a, d) and (d, a) separately, which is why some counts differ between the two outputs.
library(data.table)
setDT(example)
setorder(example, id, vals)
example[, if (.N > 1) split(combn(vals, 2), 1:2), by = id][, .N, by = c("1", "2")]
# 1 2 N
# 1: a b 3
# 2: a c 2
# 3: a d 3
# 4: a e 1
# 5: b c 2
# 6: b d 2
# 7: b e 1
#<...>
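If combn over many groups becomes the bottleneck, one possibly faster alternative is a data.table self-join; this is only a sketch, and it assumes vals are unique within each id (as in the example):
library(data.table)
setDT(example)
# join the table to itself on id, then keep each unordered pair once
self <- example[example, on = "id", allow.cartesian = TRUE]
self[vals < i.vals, .N, by = .(val1 = vals, val2 = i.vals)][order(-N, val1, val2)]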

Subsetting by counts [duplicate]

This question already has answers here:
Select groups based on number of unique / distinct values
(4 answers)
Closed last month.
I have a data.frame
library(dplyr)
ID <- c(1,1,1,1,2,2,3,3,3,3,4,4,5)
Score <- c(20,22,34,56,78,98,56,43,45,33,24,54,22)
Quarter <- c("Q1","Q2","Q3","Q4","Q1","Q2","Q1","Q2","Q3","Q4","Q1","Q2","Q1")
df <- data.frame(ID,Score,Quarter)
I only want to deal with the data that has all 4 quarters (Q1, Q2, Q3, Q4 in column "Quarter"). One way I thought I could do this is to subset where the ID is present 4 times, because it is repeated in each quarter. I am having a hard time subsetting on the count of IDs. I tried:
filter(df, count(df, vars = ID)==4)
But it did not work and guidance would be greatly appreciated.
Thank you
One way we can do this is by using n_distinct to get the number of unique values for each ID and filter the groups which have all 4 values:
library(dplyr)
df %>%
  group_by(ID) %>%
  filter(n_distinct(Quarter) == 4)
# ID Score Quarter
# <dbl> <dbl> <fct>
#1 1.00 20.0 Q1
#2 1.00 22.0 Q2
#3 1.00 34.0 Q3
#4 1.00 56.0 Q4
#5 3.00 56.0 Q1
#6 3.00 43.0 Q2
#7 3.00 45.0 Q3
#8 3.00 33.0 Q4
Equivalent base R implementation using ave would be
df[as.numeric(ave(df$Quarter, df$ID, FUN = function(x) length(unique(x)))) == 4, ]
Here are a few alternatives. The last three are base solutions.
#1 is an SQL solution which creates a one-column data frame df0 with only those IDs having 4 quarters which is then joined to df thereby eliminating all other IDs.
#2 is a dplyr solution which filters the groups retaining only those with 4 rows.
#3 is a data.table solution which returns the rows for those ID groups having 4 rows and NULL for the other groups. This has the effect of eliminating the other groups.
#4 is a zoo solution which converts df to a wide form zoo object with quarters along the top and ID as the time index. It then removes any row having an NA and reshapes back to the original using fortify.zoo also reordering back to a sorted order. The last line of the solution could be omitted if the row order does not matter. Interestingly it does not use knowledge of the number 4.
#5 is a base solution which splits df into a list of data frames, one per ID, and then uses Filter to extract those having 4 rows. Finally it puts it all back together.
#6 is a base solution which creates a vector having one element per row of df containing the number of rows (including the current row) having the ID in that row. Then use subset to reduce df to those rows for which that vector equals 4.
#7 is a base solution which splits df into a list of data frames, one per ID, and then uses Reduce to iterate over it appending the current data frame to what we have so far if it has 4 rows or just keeping what we have so far if not.
# 1
library(sqldf)
sqldf("with df0 as (
select ID from df group by ID having count(*) = 4
)
select * from df join df0 using (ID)")
# 2
library(dplyr)
df %>% group_by(ID) %>% filter(n() == 4) %>% ungroup
# 3
library(data.table)
as.data.table(df)[, if (nrow(.SD) == 4) .SD, by = ID]
# 4
library(zoo)
z <- read.zoo(df, split = "Quarter")
df2 <- fortify.zoo(na.omit(z), melt = TRUE, names = names(df)[c(1, 3:2)])
df2 <- df2[order(df2$ID, df2$Quarter), ]
# 5
do.call("rbind", Filter(function(x) nrow(x) == 4, split(df, df$ID)))
# 6
subset(df, ave(ID, ID, FUN = length) == 4)
# 7
Reduce(function(x, y) if (nrow(y) == 4) rbind(x, y) else x, split(df, df$ID))
Here is another base R method using table, rowSums and %in%. We get the frequency count of the 'ID' and 'Quarter' columns with table, negate it into a logical matrix in which a count of 0 becomes TRUE (!table(...)), and take the row sums (rowSums), which gives the number of missing quarters per ID. Negating the row sums again gives TRUE for IDs with no missing quarter; we take the names of those elements and compare them with the ID column via %in% to subset the dataset:
subset(df, ID %in% names(which(!rowSums(!table(df[c(1,3)])))))
# ID Score Quarter
#1 1 20 Q1
#2 1 22 Q2
#3 1 34 Q3
#4 1 56 Q4
#7 3 56 Q1
#8 3 43 Q2
#9 3 45 Q3
#10 3 33 Q4
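Spelling out the intermediate steps (a sketch):
tab <- table(df[c(1, 3)])      # ID x Quarter frequency table
rowSums(!tab)                  # number of missing quarters per ID
# 1 2 3 4 5
# 0 2 0 2 3
names(which(!rowSums(!tab)))   # IDs with no missing quarter
# [1] "1" "3"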
I just figured out I can do this as well:
df[df$ID %in% names(table(df$ID))[table(df$ID)==4],]
It gets the desired result using only the counts of ID.
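A sketch of the same idea with dplyr's add_count, which avoids an explicit group_by:
library(dplyr)
df %>%
  add_count(ID) %>%
  filter(n == 4) %>%
  select(-n)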

Product of several columns on a data frame by a vector using dplyr

I would like to multiply several columns of a dataframe by the values of a vector (all values within the same column should be multiplied by the same value, which will differ by column), while keeping the other columns as they are.
Since I'm using dplyr extensively, I thought it might be useful to use the mutate_each function so I can modify all columns at the same time, but I am completely lost on the syntax of the funs() part.
On the other hand, I've read this solution, which is simple and works fine, but it applies to all columns rather than just the selected ones.
That's what I've done so far:
Imagine that I want to multiply all columns in df except letters by the weight_df vector, as follows:
df = data.frame(
letters = c("A", "B", "C", "D"),
col1 = c(3, 3, 2, 3),
col2 = c(2, 2, 3, 1),
col3 = c(4, 1, 1, 3)
)
> df
letters col1 col2 col3
1 A 3 2 4
2 B 3 2 1
3 C 2 3 1
4 D 3 1 3
weight_df = c(1:3)
If I use select before applying mutate_each, I get rid of the letters column (as expected), and that's not what I want (apart from the fact that the vector is applied per row rather than per column, and I want the opposite):
df = df %>%
  select(-letters) %>%
  mutate_each(funs(. * weight_df))
> df
col1 col2 col3
1 3 2 4
2 6 4 2
3 6 9 3
4 3 1 3
But if I don't select any particular columns, all values within letters are removed (which makes a lot of sense, by the way), and that's not what I want either (again, the vector is applied per row rather than per column):
df = df %>%
  mutate_each(funs(. * weight_df))
> df
letters col1 col2 col3
1 NA 3 2 4
2 NA 6 4 2
3 NA 6 9 3
4 NA 3 1 3
(Please note that this is a very simple dataframe; the original one has many more rows and columns, which unfortunately are not labeled in such an easy way, so no naming pattern can be exploited.)
The problem here is that you are basically trying to operate over rows, rather than columns, hence methods such as mutate_* won't work. If you are not satisfied with the many vectorized approaches proposed in the linked question, then using the tidyverse (and assuming that letters is a unique identifier), one way to achieve this is to convert to long form first, multiply a single column by group, and then convert back to wide (don't expect this to be terribly efficient, though):
library(tidyr)
library(dplyr)
df %>%
  gather(variable, value, -letters) %>%
  group_by(letters) %>%
  mutate(value = value * weight_df) %>%
  spread(variable, value)
#Source: local data frame [4 x 4]
#Groups: letters [4]
# letters col1 col2 col3
# * <fctr> <dbl> <dbl> <dbl>
# 1 A 3 4 12
# 2 B 3 4 3
# 3 C 2 6 3
# 4 D 3 2 9
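gather and spread are superseded in current tidyr; here is a sketch of the same approach with pivot_longer/pivot_wider (same assumption that each letters group keeps its rows in col1, col2, col3 order):
library(dplyr)
library(tidyr)
df %>%
  pivot_longer(-letters, names_to = "variable") %>%
  group_by(letters) %>%
  mutate(value = value * weight_df) %>%
  ungroup() %>%
  pivot_wider(names_from = variable, values_from = value)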
Using base R's sweep through the pipe. This selects the numeric columns only, giving flexibility for choosing columns, and returns the new values along with all the other (non-numeric) columns:
index <- which(sapply(df, is.numeric))
df[, index] <- df[, index] %>% sweep(2, weight_df, FUN = "*")
> df
letters col1 col2 col3
1 A 3 4 12
2 B 3 4 3
3 C 2 6 3
4 D 3 2 9
try this
library(plyr)
library(dplyr)
df %>% select_if(is.numeric) %>% adply(., 1, function(x) x * weight_df)
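Note that mutate_each is superseded in current dplyr; here is a sketch using across() with cur_column() (dplyr >= 1.0.0), where naming the weights after the columns they scale is an assumption for illustration:
library(dplyr)
w <- setNames(weight_df, c("col1", "col2", "col3"))
df %>% mutate(across(all_of(names(w)), ~ .x * w[[cur_column()]]))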

R code to generate numbers in sequence and insert rows [duplicate]

This question already has answers here:
R code to insert rows based on a column's value and increment it by 1
(3 answers)
Closed 6 years ago.
I have a dataset with 2 columns. The first column is an ID and the second column is the total number of quarters. If column B (Quarters) has the value 8, then 8 rows should be created, numbered 1 to 8. The ID in column A should be the same for all of those rows. The dataset shown below is an example.
ID Quarters
A 5
B 2
C 1
Expected output
ID Quarters
A 1
A 2
A 3
A 4
A 5
B 1
B 2
C 1
Here is what I tried.
library(data.table)
setDT(df.WQuarter)[, (Quarters=1:Quarters), ID]
I get this error. Can you please help? I have been stuck on this for the whole day; I am just learning the basics of R.
We can use base R to replicate the 'ID' by 'Quarters' and create the 'Quarters' by taking the sequence of that column.
with(df1, data.frame(ID= rep(ID, Quarters), Quarters = sequence(Quarters)))
# ID Quarters
#1 A 1
#2 A 2
#3 A 3
#4 A 4
#5 A 5
#6 B 1
#7 B 2
#8 C 1
If we are using data.table, convert the 'data.frame' to 'data.table' (setDT(df1)), then, grouped by 'ID', get the sequence of 'Quarters' with sequence(Quarters) or just seq(Quarters).
library(data.table)
setDT(df1)[, .(Quarters=sequence(Quarters)) , by = ID]
As @PierreLaFortune commented on the post, if we have NA values, we need to remove them first:
setDT(df1)[, .(Quarters = seq_len(Quarters[!is.na(Quarters)])), by = ID]
Or using dplyr/tidyr:
library(dplyr)
library(tidyr)
df1 %>%
  group_by(ID) %>%
  mutate(Quarters = list(seq(Quarters))) %>%
  ungroup() %>%
  unnest(Quarters)
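tidyr also has uncount(), which replicates rows directly; a sketch (.remove = FALSE keeps the original Quarters column so it can be overwritten):
library(dplyr)
library(tidyr)
df1 %>%
  uncount(Quarters, .remove = FALSE) %>%
  group_by(ID) %>%
  mutate(Quarters = row_number()) %>%
  ungroup()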
If the OP's "Quarters" column is non-numeric, it should be converted to 'numeric' before proceeding
df1$Quarters <- as.numeric(as.character(df1$Quarters))
The as.character is needed in case the column is a factor; if it is already of character class, as.numeric alone is enough.
data
df1 <- structure(list(ID = c("A", "B", "C"), Quarters = c(5L, 2L, 1L
)), .Names = c("ID", "Quarters"), class = "data.frame", row.names = c(NA,
-3L))
