I would like to find the values of var1 for which the corresponding values of var2 are not all the same.
df <- data.frame(var1=c("a","a","a","b","b","c","c","c"),
var2=c("X","X","X","Y","Z","W","W","W"),
stringsAsFactors = FALSE)
Expected result:
data.frame(var1=c("b"))
Many thanks in advance !
Does this work:
library(dplyr)
df %>% group_by(var1) %>% filter(length(unique(var2))>1) %>% distinct(var1)
# A tibble: 1 x 1
# Groups: var1 [1]
var1
<chr>
1 b
Following @Karthik S, here is an alternative approach, if I understood the question correctly.
library(dplyr)
df <- data.frame(var1=c("a","a","a","b","b","c","c","c"), var2=c("X","X","X","Y","Z","W","W","W"), stringsAsFactors = F)
dplyr::select(
distinct(df)[duplicated(distinct(df)$var1),], var1
)
Call select only if you need to guarantee that the output is a dataframe/tibble.
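For comparison, the same idea can be sketched in base R without any packages. This assumes the data frame from the question is stored as df; the names n_var2 and result are only illustrative.

```r
# Sample data from the question
df <- data.frame(var1 = c("a","a","a","b","b","c","c","c"),
                 var2 = c("X","X","X","Y","Z","W","W","W"),
                 stringsAsFactors = FALSE)

# Count the distinct var2 values within each var1 group
n_var2 <- tapply(df$var2, df$var1, function(x) length(unique(x)))

# Keep the var1 values that have more than one distinct var2
result <- data.frame(var1 = names(n_var2)[n_var2 > 1])
result
```

Unlike the duplicated() approach, this returns each qualifying var1 only once, even when a group has three or more distinct var2 values.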
I have a large dataset with the two first columns that serve as ID (one is an ID and the other one is a year variable). I would like to compute a count by group and to loop over each variable that is not an ID one. This code below shows what I want to achieve for one variable:
library(tidyverse)
df <- tibble(
ID1 = c(rep("a", 10), rep("b", 10)),
year = c(2001:2020),
var1 = rnorm(20),
var2 = rnorm(20))
df %>%
select(ID1, year, var1) %>%
filter(if_any(starts_with("var"), ~!is.na(.))) %>%
group_by(year) %>%
count() %>%
print(n = Inf)
I cannot use a loop that starts with for(i in names(df)) since I want to keep the variables "ID1" and "year". How can I run this piece of code for all the columns that start with "var"? I tried using quosures, but that did not work: I received the error "select() doesn't handle lists". I also tried select(starts_with("var")), but with no success.
Many thanks!
Another possible solution:
library(tidyverse)
df %>%
group_by(ID1) %>%
summarise(across(starts_with("var"), ~ length(na.omit(.x))))
#> # A tibble: 2 × 3
#> ID1 var1 var2
#> <chr> <int> <int>
#> 1 a 10 10
#> 2 b 10 10
for(i in names(df)[grepl('var',names(df))])
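The loop header above can be fleshed out roughly as follows. This is only a sketch: it assumes the goal is the number of non-missing values per year for each var column, and it adds a seed and two NA values (neither is in the question) so that the counts are reproducible and not all identical.

```r
set.seed(1)  # assumption: seed added only to make the sketch reproducible
df <- data.frame(ID1 = rep(c("a", "b"), each = 10),
                 year = 2001:2020,
                 var1 = rnorm(20),
                 var2 = rnorm(20))
df$var1[c(2, 5)] <- NA  # assumption: a few missing values for illustration

# Loop only over the columns whose names start with "var",
# so ID1 and year are never touched
var_cols <- names(df)[grepl("^var", names(df))]
counts <- list()
for (i in var_cols) {
  # Number of non-missing values of column i within each year
  counts[[i]] <- tapply(!is.na(df[[i]]), df$year, sum)
}
counts$var1
```

Each element of counts is a named vector indexed by year, so counts$var1[["2002"]] gives the count for var1 in 2002.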
From a data frame, I need a list of all unique values of one column. For a possible later check, we need to keep the information from a second column, combined into a single string for simplicity.
Sample data
df <- data.frame(id=c(1,3,1),source =c("x","y","z"))
df
id source
1 1 x
2 3 y
3 1 z
The desired outcome is
df2
id source
1 1 x,z
2 3 y
It should be pretty easy, still I cannot find the proper function / grammar?
E.g. something like
df %>%
group_by(id) %>%
summarise(vlist = paste0(source, collapse = ","))
or
df %>%
distinct(id) %>%
summarise(vlist = paste0(source, collapse = ","))
What am I missing? Thanks for any advice!
You can use aggregate from stats to combine per group.
aggregate(source ~ id, df, paste, collapse = ",")
# id source
#1 1 x,z
#2 3 y
Using your code here is a solution:
library(dplyr)
df <- data.frame(id=c(1,3,1),source =c("x","y","z"))
df %>%
group_by(id) %>%
summarise(vlist = paste0(source, collapse = ",")) %>%
distinct(id, .keep_all = TRUE)
# A tibble: 2 x 2
id vlist
<dbl> <chr>
1 1 x,z
2 3 y
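Your second approach doesn't work because distinct(id) runs before the data is aggregated; without .keep_all = TRUE it also drops the source column.
Your first approach already returns one row per id, because summarise() collapses each group; the distinct() call above is only a safeguard and does not change the result.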
aggregate(source ~ id, df, toString)
What's the easiest way to drop a string before a certain character?
The data looks as follows:
library(tidyverse)
df <- data.frame(var1 = c("lang:10,q1:10,m2:20,q3:20,m5:10",
"lang:1,q1:10,m2:20,m3:20,q3:10",
"lang:100,q1:10,m2:20"))
Now, I'd like to remove the "lang:xy," part at the beginning of each row.
I tried to use "separate", but the comma is also used afterwards (everything that comes after the first comma should stay together).
So my desired output is:
var1
-------------------------
q1:10,m2:20,q3:20,m5:10
q1:10,m2:20,m3:20,q3:10
q1:10,m2:20
Thanks!
You can use str_remove from the stringr package:
df %>%
mutate(
var1 = var1 %>% str_remove("^lang:[0-9]*,")
)
Or try this:
library(tidyverse)
# Split into one row per item, drop the "lang" items,
# then paste the remaining items back together per original row
df %>%
mutate(id = 1:n()) %>%
separate_rows(var1, sep = ',') %>%
filter(!grepl('lang', var1)) %>%
group_by(id) %>%
summarise(var1 = paste0(var1, collapse = ',')) %>%
ungroup() %>%
select(-id)
Output:
# A tibble: 3 x 1
var1
<chr>
1 q1:10,m2:20,q3:20,m5:10
2 q1:10,m2:20,m3:20,q3:10
3 q1:10,m2:20
Just to round out the answers, the sub function from base R can also work here:
df$var1 <- sub("^lang:\\d+,", "", df$var1)
df
var1
1 q1:10,m2:20,q3:20,m5:10
2 q1:10,m2:20,m3:20,q3:10
3 q1:10,m2:20
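To illustrate why the pattern is anchored with ^, the sketch below compares the anchored sub() call with an unanchored gsub() on a made-up string in which lang: also appears later. The string x is hypothetical and not part of the question's data.

```r
# Hypothetical input where "lang:" occurs both at the start and in the middle
x <- "lang:10,q1:10,lang:7,m2:20"

# Anchored: removes only the leading "lang:<digits>," part
leading_only <- sub("^lang:\\d+,", "", x)

# Unanchored with gsub(): removes every "lang:<digits>," occurrence
all_removed <- gsub("lang:\\d+,", "", x)

leading_only
all_removed
```

With the anchor, later lang: entries are left alone, which matches the requirement that everything after the first comma stays together.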
We can use trimws from base R
df$var1 <- trimws(df$var1, whitespace = "lang:\\d+,")
dplyr provides the function top_n(); however, when values are tied it returns all tied rows, so a group can yield more than one row. I would like to return exactly one row per group. See the example below.
df <- data.frame(id1=c(rep("A",3),rep("B",3),rep("C",3)),id2=c(8,8,4,7,7,4,5,5,5))
df %>% group_by(id1) %>% top_n(n=1)
You can use a combination of arrange and slice
df %>%
group_by(id1) %>%
arrange(desc(id2)) %>%
slice(1)
Use desc() within arrange() if you want the largest element; otherwise leave it out.
slice_head() (available since dplyr 1.0.0) can be used in place of slice(1):
df %>%
group_by(id1) %>%
arrange(desc(id2)) %>%
slice_head(n = 1)
Use slice_max() with the argument with_ties = FALSE:
library(dplyr)
df %>%
group_by(id1) %>%
slice_max(id2, with_ties = FALSE)
# A tibble: 3 x 2
# Groups: id1 [3]
id1 id2
<chr> <dbl>
1 A 8
2 B 7
3 C 5
If you don't want to remember so many {dplyr} function names that are prone to be changed anyway, I can recommend the {data.table} package for such tasks. Plus, it's faster.
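library(data.table)
df <- data.frame(id1=c(rep("A",3),rep("B",3),rep("C",3)),id2=c(8,8,4,7,7,4,5,5,5))
setDT(df)
# head(id2, 1) takes the first row of each group; it is the largest id2 here
# only because the sample data is already sorted in decreasing order within id1
df[ ,
.(id2_head = head(id2, 1)),
by = id1 ]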
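The same "exactly one row per group, ties broken by position" result can also be sketched in base R, assuming the df from the question: sort by id2 in decreasing order within id1, then keep the first row seen for each id1. The names sorted and top are only illustrative.

```r
df <- data.frame(id1 = c(rep("A", 3), rep("B", 3), rep("C", 3)),
                 id2 = c(8, 8, 4, 7, 7, 4, 5, 5, 5))

# Sort by group, then by id2 in decreasing order within each group
sorted <- df[order(df$id1, -df$id2), ]

# duplicated() flags every row after the first one of each id1,
# so negating it keeps exactly one row per group
top <- sorted[!duplicated(sorted$id1), ]
top
```

Because the explicit sort comes first, this does not depend on the input rows already being ordered.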
I would like to find the minimum value of a variable (time) at which each of several other variables equals 1 (or any other value). Basically, my application is finding the first year in which x == 1, for several x. I know how to do this for one x, but I would like to avoid generating multiple reduced data frames of minima and then merging them together. Is there an efficient way to do this? Here is my example data and my solution for one variable.
d <- data.frame(cat = c(rep("A",10), rep("B",10)),
time = c(1:10),
var1 = c(0,0,0,1,1,1,1,1,1,1,0,0,0,0,0,0,1,1,1,1),
var2 = c(0,0,0,0,1,1,1,1,1,1,0,0,0,0,0,0,0,1,1,1))
library(plyr)
ddply(d[d$var1==1,], .(cat), summarise,
start= min(time))
How about this, using dplyr:
library(dplyr)
d %>%
group_by(cat) %>%
# across() replaces summarise_at() with the deprecated funs() (dplyr >= 1.0.0)
summarise(across(contains("var"), ~ time[which(.x == 1)[1]]))
Which gives
# A tibble: 2 x 3
# cat var1 var2
# <fct> <int> <int>
# 1 A 4 5
# 2 B 7 8
We can use base R to get the minimum 'time' among all the columns of 'var' grouped by 'cat'
sapply(split(d[-1], d$cat), function(x)
x$time[min(which(x[-1] ==1, arr.ind = TRUE)[, 1])])
#A B
#4 7
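Note that the snippet above returns the earliest time across all var columns combined. If one value per variable is wanted, a base R version could look like the sketch below; it assumes the d from the question, and the names var_cols and first_time are only illustrative.

```r
d <- data.frame(cat = c(rep("A", 10), rep("B", 10)),
                time = rep(1:10, 2),
                var1 = c(0,0,0,1,1,1,1,1,1,1,0,0,0,0,0,0,1,1,1,1),
                var2 = c(0,0,0,0,1,1,1,1,1,1,0,0,0,0,0,0,0,1,1,1))

# Columns to scan: everything whose name starts with "var"
var_cols <- grep("^var", names(d), value = TRUE)

# For each var column, keep the rows where it equals 1 and
# take the minimum time within each cat
first_time <- sapply(var_cols, function(v) {
  hit <- d[d[[v]] == 1, ]
  tapply(hit$time, hit$cat, min)
})
first_time
```

The result is a matrix with one row per cat and one column per variable, so first_time["B", "var2"] is the first time at which var2 equals 1 in group B.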
Is this something you are expecting?
library(dplyr)
df <- d %>%
group_by(cat, var1, var2) %>%
summarise(start = min(time)) %>%
filter()
I have left a blank filter argument that you can use to specify any filter condition you want (say var1 == 1 or cat == "A")