Say I have a dataset like this:
id <- c(1, 1, 2, 2, 3, 3)
code <- c("a", "b", "a", "a", "b", "b")
dat <- data.frame(id, code)
I.e.,
id code
1 1 a
2 1 b
3 2 a
4 2 a
5 3 b
6 3 b
Using dplyr, how would I get a count of how many a's there are for each id
i.e.,
id countA
1 1 1
2 2 2
3 3 0
I'm trying stuff like this which isn't working,
countA<- dat %>%
group_by(id) %>%
summarise(cip.completed= count(code == "a"))
The above gives me an error, "Error: no applicable method for 'group_by_' applied to an object of class "logical""
Thanks for your help!
Try the following instead:
library(dplyr)
dat %>% group_by(id) %>%
summarise(cip.completed= sum(code == "a"))
Source: local data frame [3 x 2]
id cip.completed
(dbl) (int)
1 1 1
2 2 2
3 3 0
This works because the logical condition code == "a" evaluates to a vector of TRUE/FALSE values, which sum() treats as ones and zeros, so the sum of this vector is the number of occurrences.
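To see the coercion in isolation, here is a minimal base R sketch using the same vectors as the question. The names is_a, total_a and counts_by_id are introduced only for this illustration.

```r
id <- c(1, 1, 2, 2, 3, 3)
code <- c("a", "b", "a", "a", "b", "b")

# The comparison returns a logical vector, one element per row
is_a <- code == "a"

# sum() coerces TRUE to 1 and FALSE to 0, so this counts the matches overall
total_a <- sum(is_a)

# The same sum taken within each id (base R stand-in for group_by + summarise)
counts_by_id <- tapply(is_a, id, sum)
```

Because the sum runs over every row of a group, an id with no "a" rows still appears, with a count of 0.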
Note that you would not normally use dplyr::count inside summarise anyway, as it is a wrapper for summarise calling either n() or sum() itself. See ?dplyr::count. If you really want to use count, you could first filter the dataset to retain only the rows in which code == "a"; count would then give you only the strictly positive (i.e. non-zero) counts, so id 3 drops out of the result. For instance,
dat %>% filter(code == "a") %>% count(id)
Source: local data frame [2 x 2]
id n
(dbl) (int)
1 1 1
2 2 2
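If the zero counts matter, one possible alternative is to skip the filter and pass the condition to count() as a weight, since count() sums wt within each group. This is only a sketch, assuming an installed dplyr whose count() accepts an expression for wt; the name counts is made up for the example.

```r
library(dplyr)

dat <- data.frame(id = c(1, 1, 2, 2, 3, 3),
                  code = c("a", "b", "a", "a", "b", "b"))

# wt is summed per id, so an id with no "a" rows gets 0 instead of vanishing
counts <- dat %>% count(id, wt = code == "a")
```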
I need to find common values between different groups ideally using dplyr and R.
From my dataset here:
group val
<fct> <dbl>
1 a 1
2 a 2
3 a 3
4 b 3
5 b 4
6 b 5
7 c 1
8 c 3
the expected output is
group val
<fct> <dbl>
1 a 3
2 b 3
3 c 3
as only number 3 occurs in all groups.
This code does not seem to work:
# Filter the data
dd %>%
group_by(group) %>%
filter(all(val)) # does not work
The example here solves a similar issue but has a defined vector of shared values. What if I do not know which ones are shared?
Dummy example:
# Reproducible example: filter all id by group
group = c("a", "a", "a",
"b", "b", "b",
"c", "c")
val = c(1,2,3,
3,4,5,
1,3)
dd <- data.frame(group,
val)
group_by isolates each group, so we can't very well group_by(group) and compare between groups. Instead, we can group_by(val) and see which values occur in all the groups:
dd %>%
group_by(val) %>%
filter(n_distinct(group) == n_distinct(dd$group))
# # A tibble: 3 x 2
# # Groups: val [1]
# group val
# <chr> <dbl>
# 1 a 3
# 2 b 3
# 3 c 3
This is one of the rare cases where we want to use data$column in a dplyr verb - n_distinct(dd$group) refers explicitly to the ungrouped original data to get the total number of groups. (It could also be pre-computed.) Whereas n_distinct(group) is using the grouped data piped in to filter, thus it gives the number of distinct groups for each value (because we group_by(val)).
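The pre-computed variant mentioned above might look like the following sketch. It assumes dplyr is installed; n_groups and shared are names chosen for the example only.

```r
library(dplyr)

dd <- data.frame(group = c("a", "a", "a", "b", "b", "b", "c", "c"),
                 val   = c(1, 2, 3, 3, 4, 5, 1, 3))

# Total number of groups, computed once on the ungrouped data
n_groups <- n_distinct(dd$group)

# Inside filter(), n_distinct(group) is evaluated separately for each val
shared <- dd %>%
  group_by(val) %>%
  filter(n_distinct(group) == n_groups) %>%
  ungroup()
```

Pre-computing keeps the dd$group reference out of the pipeline, which can help if the pipeline is later reused on a data frame with a different name.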
A base R approach can be:
#Code
newd <- dd[dd$val %in% Reduce(intersect, split(dd$val, dd$group)),]
Output:
group val
3 a 3
4 b 3
8 c 3
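For readers who want to see what the one-liner does, here is the same computation split into steps. It is a sketch on the question's data; vals_by_group and shared_vals are illustrative names.

```r
dd <- data.frame(group = c("a", "a", "a", "b", "b", "b", "c", "c"),
                 val   = c(1, 2, 3, 3, 4, 5, 1, 3))

# One vector of values per group: a = 1 2 3, b = 3 4 5, c = 1 3
vals_by_group <- split(dd$val, dd$group)

# Fold intersect() over the list: ((a intersect b) intersect c)
shared_vals <- Reduce(intersect, vals_by_group)

# Keep only the rows whose value is shared by every group
newd <- dd[dd$val %in% shared_vals, ]
```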
A similar option in data.table to @GregorThomas's solution is
library(data.table)
setDT(dd)[dd[, .I[uniqueN(group) == uniqueN(dd$group)], val]$V1]
I'm new to R and trying to use it in place of Excel (where I have more experience). I'm still working out the full 'for' logic, but not being able to look up the values I need to check whether it works the way I think it should is stopping me in my tracks. The goal is to generate what will be used as a factor with 3 levels: 0 = no duplicates, 1 = if duplicate, oldest, 2 = if duplicate, newest.
I have a dataframe that looks like this
Person <- c("A", "B", "C", "C", "D", "E","E")
Date <- c(1/1/20, 1/1/20,12/25/19, 1/1/20, 1/1/20, 12/25/19, 1/1/20)
ID <- c(1,2,3,4,5,6,7)
DuplicateStatus <- c(0,0,0,0,0,0,0)
IdealResult <- c(0,0,1,2,0,1,2)
mydata <- cbind(Person, Date, ID, DuplicateStatus, IdealResult)
I am trying to use a for loop to evaluate if person duplicates. If a person does not duplicate, value= 0 and if they do duplicate, they should have a 1 for the oldest value and a 2 for the newest value (see ideal result). NOTE: I have already sorted the data to be by person and then date, so if duplicated, first appearance is oldest.
Previous answers about Vlookup in R here are aimed at merging datasets based on identical values in multiple datasets. Here, I am attempting to modify a column based on the relationship between columns, within a single dataset.
currentID = 0
nextID =0
for(i in mydata$ID){
currentID = i
nextID = currentID++1
CurrentPerson ##Vlookup function that does - find currentID in ID, return associated value in Person column in same position.
NextPerson ##Vlookup function that does - find nextID in ID, return associated value in Person column in same position.
if CurrentPerson = NextPerson, then DuplicateStatus at ID associated with current person should be 1, and DuplicateStatus at ID associated with NextPerson = 2.
**This should end when current person = total number of people
Thanks!
You really need to spend some time with a simple tutorial on R. Your cbind() function converts all of your data to a character matrix which is probably not what you want. Look at the results of str(mydata). Instead of looping, this creates an index number within each Person group and then zeros out the groups with a single observation:
mydata <- data.frame(Person, Date, ID, DuplicateStatus, IdealResult)
IR <- ave(mydata$ID, mydata$Person, FUN=seq_along)
IR
# [1] 1 1 1 2 1 1 2
tbl <- table(mydata$Person)
tozero <- mydata$Person %in% names(tbl[tbl == 1])
IR[tozero] <- 0
IR
# [1] 0 0 1 2 0 1 2
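The cbind() remark above can be checked with a small sketch. The three-row vectors and the names as_matrix and as_frame are made up for the illustration.

```r
Person <- c("A", "B", "C")
ID <- c(1, 2, 3)

# cbind() on plain vectors builds a matrix, and a matrix holds a single type,
# so the numbers are converted to character strings
as_matrix <- cbind(Person, ID)
matrix_type <- typeof(as_matrix)

# data.frame() keeps each column's own type
as_frame <- data.frame(Person, ID)
id_is_numeric <- is.numeric(as_frame$ID)
```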
Is what you are looking for just to count the number of observations for a person, in one column (like a column ID)? If so, this will work using tidyverse:
Person <- c("A", "B", "C", "C", "D", "E","E")
Date <- c(1/1/20, 1/1/20,12/25/19, 1/1/20, 1/1/20, 12/25/19, 1/1/20)
ID <- c(1,2,3,4,5,6,7)
DuplicateStatus <- c(0,0,0,0,0,0,0)
IdealResult <- c(0,0,1,2,0,1,2)
mydata <- data.frame(Person, Date, ID, DuplicateStatus, IdealResult)
library(tidyverse)
mydata <- mydata %>%
group_by(Person) %>%
mutate(Duplicate = seq_along(Person))
mydata
# A tibble: 7 x 6
# Groups: Person [5]
Person Date ID DuplicateStatus IdealResult Duplicate
<fct> <dbl> <dbl> <dbl> <dbl> <int>
1 A 0.05 1 0 0 1
2 B 0.05 2 0 0 1
3 C 0.0253 3 0 1 1
4 C 0.05 4 0 2 2
5 D 0.05 5 0 0 1
6 E 0.0253 6 0 1 1
7 E 0.05 7 0 2 2
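The Date column prints as 0.05 and 0.0253 because an unquoted 1/1/20 is evaluated as arithmetic (1 divided by 1 divided by 20). A sketch of one way to store real dates instead, assuming the values are meant as month/day/two-digit-year; date_text and parsed are illustrative names.

```r
# Unquoted, this is just division
as_number <- 1/1/20

# Quoted strings can be parsed into Date objects with an explicit format
date_text <- c("1/1/20", "12/25/19")
parsed <- as.Date(date_text, format = "%m/%d/%y")
```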
You could assign the row number within each group, provided there is more than one row in the group.
This can be implemented in base R, dplyr, as well as data.table.
In base R :
mydata$ans <- with(mydata, ave(ID, Person, FUN = function(x)
seq_along(x) * (length(x) > 1)))
# Person Date ID IdealResult ans
#1 A 0.0500000 1 0 0
#2 B 0.0500000 2 0 0
#3 C 0.0252632 3 1 1
#4 C 0.0500000 4 2 2
#5 D 0.0500000 5 0 0
#6 E 0.0252632 6 1 1
#7 E 0.0500000 7 2 2
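The multiplication by (length(x) > 1) is what zeroes out the single-row groups: the logical is coerced to 0 or 1. A minimal sketch, with number_if_duplicated as a hypothetical helper name:

```r
# Row index within a group, multiplied by 0 when the group has only one row
number_if_duplicated <- function(x) seq_along(x) * (length(x) > 1)

single <- number_if_duplicated(5)        # one row  -> 1 * FALSE
pair   <- number_if_duplicated(c(3, 4))  # two rows -> c(1, 2) * TRUE
```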
Using dplyr:
library(dplyr)
mydata %>% group_by(Person) %>% mutate(ans = row_number() * (n() > 1))
and with data.table
library(data.table)
setDT(mydata)[, ans := seq_along(ID) * (.N > 1), Person]
data
mydata <- data.frame(Person, Date, ID, IdealResult)
I would argue that n(), which returns the number of rows in each group, is the ideal function for your problem
library(tidyverse)
mydata <- mydata %>%
group_by(Person) %>%
mutate(Duplicate = n())
I'm learning the tidyverse and ran into a problem with the simplest of operations: reading and assigning a value to a single cell. I need to do this by matching a specific value in another column and calling the name of the column whose value I'd like to change (so I can't use numeric row and column numbers).
I've searched online and on SO and read the tibble documentation (this seems the most applicable https://tibble.tidyverse.org/reference/subsetting.html?q=cell) and haven't found the answer. (I'm probably missing something - apologies for the simplicity of this question and if it's been answered elsewhere)
test<-tibble(x = 1:5, y = 1, z = x ^ 2 + y)
Yields:
A tibble: 5 x 3
x y z
<int> <dbl> <dbl>
1 1 1 2
2 2 1 5
3 3 1 10
4 4 1 17
5 5 1 26
test["x"==3,"z"]
Yields:
A tibble: 0 x 1
… with 1 variable: z <dbl>
But doesn't tell me the value of that cell.
And when I try to assign a value...
test["x"==3,"z"]<-20
...it does not work.
test[3,3] This works, but as stated above I need to call the cell by names not numbers.
What is the right way to do this?
It is not a data.table. With base R methods, the column 'x' is extracted with test$x or test[["x"]]:
test[test$x == 3, "z"]
# A tibble: 1 x 1
# z
# <dbl>
#1 10
Or use subset
subset(test, x == 3, select = 'z')
Or with dplyr
library(dplyr)
test %>%
filter(x == 3) %>%
select(z)
Or if we want to pass a string as column name, convert to symbol and evaluate
test %>%
filter(!! rlang::sym("x") == 3) %>%
select(z)
Or with data.table
library(data.table)
as.data.table(test)[x == 3, .(z)]
# z
#1: 10
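The question also asked about assignment. The sketch below uses a plain data.frame as a stand-in for the tibble so that it runs without extra packages; the same logical row index is expected to work on a tibble, but that is an assumption here. The names string_test and before are illustrative.

```r
test <- data.frame(x = 1:5, y = 1)
test$z <- test$x^2 + test$y

# "x" == 3 compares the literal string "x" with 3, so it is a single FALSE
# and selects no rows; this is why the original attempt returned 0 rows
string_test <- "x" == 3

# Reading the cell: the row condition must refer to the column itself
before <- test[test$x == 3, "z"]

# Assigning to the same cell by condition and column name
test[test$x == 3, "z"] <- 20
```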
I have a dataframe with an id, an ordering time value and a value. And for each group of ids, I would like to remove rows having a smaller value than rows having smaller time value.
data <- data.frame(id = c(rep(c("a", "b"), each = 3L), "b"),
time = c(0, 1, 2, 0, 1, 2, 3),
value = c(1, 1, 2, 3, 1, 2, 4))
> data
id time value
1 a 0 1
2 a 1 1
3 a 2 2
4 b 0 3
5 b 1 1
6 b 2 2
7 b 3 4
So the result would be :
> data
id time value
1 a 0 1
2 a 2 2
3 b 0 3
4 b 3 4
(For id == "b", the rows where time %in% c(1, 2) are removed because their value is smaller than the value at a lower time.)
I was thinking about lag
data %>%
group_by(id) %>%
filter(time == 0 | lag(value, order_by = time) < value)
Source: local data frame [5 x 3]
Groups: id [2]
id time value
<fctr> <dbl> <dbl>
1 a 0 1
2 a 2 2
3 b 0 3
4 b 2 2
5 b 3 4
But it doesn't work as expected since it's a vectorized function, so instead the idea would be to use a "recursive lag function" or to check the last maximal value. I can do it recursively with a loop but I'm sure there is a more straightforward and high level way to do it.
Any help would be appreciated, thank you !
Here is a data.table solution:
library(data.table)
setDT(data)
data[, myVal := cummax(c(0, shift(value)[-1])), by=id][value > myVal][, myVal := NULL][]
id time value
1: a 0 1
2: a 2 2
3: b 0 3
4: b 3 4
The first part of the chain uses shift and cummax to create the cumulative maximum of the lagged value variable. In c(0, shift(value)[-1]), 0 is added to supply a value lower than any in the variable. More generally, you could use min(value)-1. The [-1] subsetting removes the first element of shift, which is NA. The second part of the chain selects observations where value is greater than the cumulative maximum. The final two chains remove the cumulative maximum variable and print out the result.
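The intermediate vectors are easier to follow on a single group. This base R sketch walks through id "b" from the question; the lag is built by hand as a stand-in for shift(value), and all variable names are illustrative.

```r
# Values for id "b", already ordered by time
value <- c(3, 1, 2, 4)

# Hand-made lag: NA first, then each previous value (stand-in for shift(value))
lagged <- c(NA, head(value, -1))

# Replace the leading NA with 0, a value lower than anything in the data
baseline <- c(0, lagged[-1])

# Largest value seen at any earlier time
running_max <- cummax(baseline)

# A row is kept only if it beats everything that came before it
keep <- value > running_max
```

Here running_max is 0 3 3 3, so only the rows with value 3 and 4 survive, matching the expected output for id "b".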
Another option is to perform a self anti/non-equi join using data.table
library(data.table) # v1.10.0
setDT(data)[!data, on = .(id, time > time, value <= value)]
# id time value
# 1: a 0 1
# 2: a 2 2
# 3: b 0 3
# 4: b 3 4
Which is basically saying: "If time is larger but value is less-equal, then I don't want these rows (! sign)"
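The rule in that sentence can be cross-checked with a slow but explicit base R sketch. It is not the data.table join itself, just the same condition written out row by row; keep and result are illustrative names.

```r
data <- data.frame(id = c(rep(c("a", "b"), each = 3L), "b"),
                   time = c(0, 1, 2, 0, 1, 2, 3),
                   value = c(1, 1, 2, 3, 1, 2, 4))

# For each row, look at the earlier rows of the same id and drop the row
# if any of them already has a value at least as large
keep <- sapply(seq_len(nrow(data)), function(i) {
  earlier <- data$id == data$id[i] & data$time < data$time[i]
  !any(data$value[earlier] >= data$value[i])
})

result <- data[keep, ]
```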
Here is an option with dplyr. After grouping by 'id', we filter the rows where the 'value' is greater than the cumulative maximum of the 'lag' of the 'value' column
library(dplyr)
data %>%
group_by(id) %>%
filter(value > cummax(lag(value, default = 0)) )
# id time value
# <fctr> <dbl> <dbl>
#1 a 0 1
#2 a 2 2
#3 b 0 3
#4 b 3 4
Or another option is slice after arranging by 'id' and 'time' (as the OP mentioned the order):
data %>%
group_by(id) %>%
arrange(id, time) %>%
slice(which(value > cummax(lag(value, default = 0))))
I would like to multiply several columns on a dataframe by the values of a vector (all values within the same column should be multiplied by the same value, which will be different according to the column), while keeping the other columns as they are.
Since I'm using dplyr extensively I thought that it might be useful to use mutate_each function, so I can modify all columns at the same time, but I am completely lost on the syntax on the fun() part.
On the other hand, I've read this solution which is simple and works fine, but only works for all columns instead of the selected ones.
That's what I've done so far:
Imagine that I want to multiply all columns in df but letters by weight_df vector as follows:
df = data.frame(
letters = c("A", "B", "C", "D"),
col1 = c(3, 3, 2, 3),
col2 = c(2, 2, 3, 1),
col3 = c(4, 1, 1, 3)
)
> df
letters col1 col2 col3
1 A 3 2 4
2 B 3 2 1
3 C 2 3 1
4 D 3 1 3
>
weight_df = c(1:3)
If I use select before applying mutate_each I get rid of the letters column (as expected), and that's not what I want (apart from the fact that the vector is applied on a per-row basis rather than per column, and I want the opposite):
df = df %>%
select(-letters) %>%
mutate_each(funs(. * weight_df))
> df
col1 col2 col3
1 3 2 4
2 6 4 2
3 6 9 3
4 3 1 3
But if I don't select any particular columns, all values within letters are removed (which makes a lot of sense, by the way), and that's not what I want either (again, the vector is applied on a per-row basis rather than per column):
df = df %>%
mutate_each(funs(. * weight_df))
> df
letters col1 col2 col3
1 NA 3 2 4
2 NA 6 4 2
3 NA 6 9 3
4 NA 3 1 3
(Please note that this is a very simple dataframe and the original one has way more rows and columns -which unfortunately are not labeled in such an easy way and no patterns can be obtained)
The problem here is that you are basically trying to operate over rows rather than columns, hence methods such as mutate_* won't work. If you are not satisfied with the many vectorized approaches proposed in the linked question, I think that with the tidyverse (and assuming that letters is a unique identifier) one way to achieve this is to convert to long form first, multiply a single column by group, and then convert back to wide (I don't think this will be overly efficient though)
library(tidyr)
library(dplyr)
df %>%
gather(variable, value, -letters) %>%
group_by(letters) %>%
mutate(value = value * weight_df) %>%
spread(variable, value)
#Source: local data frame [4 x 4]
#Groups: letters [4]
# letters col1 col2 col3
# * <fctr> <dbl> <dbl> <dbl>
# 1 A 3 4 12
# 2 B 3 4 3
# 3 C 2 6 3
# 4 D 3 2 9
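The row-wise behaviour seen in the question comes from vector recycling: each column (length 4) is multiplied element by element with the weights (length 3). The sketch below shows that, followed by one possible base R way to pair each column with its own weight using Map(). The names row_wise, cols and weighted are illustrative.

```r
df <- data.frame(letters = c("A", "B", "C", "D"),
                 col1 = c(3, 3, 2, 3),
                 col2 = c(2, 2, 3, 1),
                 col3 = c(4, 1, 1, 3))
weight_df <- 1:3

# Recycling: the weights run down the rows (1, 2, 3, 1); R also warns that
# the lengths do not match, which is suppressed here
row_wise <- suppressWarnings(df$col1 * weight_df)

# Map() walks over the columns and the weights in parallel instead
cols <- c("col1", "col2", "col3")
weighted <- df
weighted[cols] <- Map(`*`, df[cols], weight_df)
```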
Using dplyr. This selects the numeric columns only, which gives flexibility in choosing columns, and returns the new values along with all the other (non-numeric) columns.
index <- which(sapply(df, is.numeric) == TRUE)
df[,index] <- df[,index] %>% sweep(2, weight_df, FUN="*")
> df
letters col1 col2 col3
1 A 3 4 12
2 B 3 4 3
3 C 2 6 3
4 D 3 2 9
Try this:
library(plyr)
library(dplyr)
df %>% select_if(is.numeric) %>% adply(., 1, function(x) x * weight_df)
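With more recent dplyr releases, where mutate_each() and funs() are deprecated, a similar result can probably be reached with across() and cur_column(). This is a sketch only, assuming dplyr 1.0.0 or later; the named vector col_weights is an assumption introduced here so that each weight can be looked up by column name.

```r
library(dplyr)

df <- data.frame(letters = c("A", "B", "C", "D"),
                 col1 = c(3, 3, 2, 3),
                 col2 = c(2, 2, 3, 1),
                 col3 = c(4, 1, 1, 3))

# One weight per column, keyed by column name (hypothetical helper vector)
col_weights <- c(col1 = 1, col2 = 2, col3 = 3)

# cur_column() gives the name of the column currently being processed,
# so each column is multiplied by its own weight and letters is left alone
out <- df %>%
  mutate(across(all_of(names(col_weights)),
                ~ .x * col_weights[[cur_column()]]))
```

Naming the weights avoids relying on column order, which matters when the columns to be weighted are scattered across a wider data frame.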