R grouped counter that copes with NAs or conditions [duplicate]

This question already has answers here: cumsum by group.
I have an R data frame where I need a counter that starts a fresh number for each new set of circumstances and then carries that number forward down the rows (respecting the order of the data).
There are quite a few previous posts on this, but none seems to fit my problem. I've tried combinations of row_number, ave and rleid, and none hits the spot.
id <- c("A","A","A","A","A","B","B","B","B","B","B","B","B","B","B","C","C","C","C","D","D")
marker_new <- c(1,0,0,0,0,1,0,1,0,0,0,0,1,0,1,1,0,1,0,1,0)
counter_result <- c(1,1,1,1,1,1,1,2,2,2,2,2,3,3,4,1,1,2,2,1,1)
df <- data.frame(id,marker_new, counter_result)
df <- df %>%
  group_by(id, marker_new) %>%
  mutate(counter = ifelse(marker_new != 0,
                          row_number(),
                          lag(marker_new, lag(marker_new)))) %>%
  ungroup()
The code above gets me part of the way: it gives me a fresh number at each marker, but it won't carry that number down the subsequent rows (as in the counter_result I've included).
Any help much appreciated!

Since the marker_new column is 1/0, we can take its cumulative sum (cumsum) within each id group to get the counter.
Base R:
df$result <- with(df, ave(marker_new, id, FUN = cumsum))
dplyr:
library(dplyr)
df %>% group_by(id) %>% mutate(result = cumsum(marker_new))
data.table:
library(data.table)
setDT(df)[, result := cumsum(marker_new), by = id]
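A quick sanity check against the expected column, using the df defined in the question:
library(dplyr)
df <- df %>%
  group_by(id) %>%
  mutate(result = cumsum(marker_new)) %>%
  ungroup()
all(df$result == df$counter_result)
# [1] TRUE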

Related

How to find duplicated values in a column in R [duplicate]

There is a similar question for PHP, but I'm working with R and am unable to translate the solution to my problem.
I have a data frame with 10 rows and 50 columns, where some of the rows are absolutely identical. If I use unique on it, I get one row per, let's say, "type", but what I actually want are only those rows which appear exactly once. Does anyone know how I can achieve this?
I can have a look at clusters and heatmaps to sort it out manually, but I have bigger data frames than the one mentioned above (with up to 100 rows) where this gets a bit tricky.
This will extract the rows which appear only once (assuming your data frame is named df):
df[!(duplicated(df) | duplicated(df, fromLast = TRUE)), ]
How it works: duplicated marks each row that has already appeared earlier in the data frame (scanning from the first row). With fromLast = TRUE, it scans from the last row instead, so it marks rows that occur again later.
The two boolean vectors are combined with | (logical 'or') into a vector flagging every row that occurs more than once. Negating this with ! gives a boolean vector marking the rows that appear exactly once.
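A small worked illustration, using a hypothetical one-column data frame in which rows 1 and 3 are identical:
d <- data.frame(x = c(1, 2, 1, 3))
duplicated(d)                   # FALSE FALSE TRUE FALSE - marks later copies
duplicated(d, fromLast = TRUE)  # TRUE FALSE FALSE FALSE - marks earlier copies
# the negated union keeps rows appearing exactly once (rows 2 and 4)
d[!(duplicated(d) | duplicated(d, fromLast = TRUE)), , drop = FALSE]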
A possibility involving dplyr could be:
df %>%
group_by_all() %>%
filter(n() == 1)
Or:
df %>%
group_by_all() %>%
filter(!any(row_number() > 1))
Since dplyr 1.0.0, the preferable way is:
df %>%
  group_by(across(everything())) %>%
  filter(n() == 1)
Try it:
library(dplyr)
DF1 <- data.frame(Part = c(1,2,3,4,5), Age = c(23,34,23,25,24), B.P = c(87,76,75,75,78))
DF2 <- data.frame(Part =c(3,5), Age = c(23,24), B.P = c(75,78))
DF3 <- rbind(DF1,DF2)
DF3 <- DF3[!(duplicated(DF3) | duplicated(DF3, fromLast = TRUE)), ]

Tidyverse filter by width of variable [duplicate]

This question already has answers here: Remove all rows where length of string is more than n.
I'm working with an untidy dataset and want to filter out any object with an ID shorter than 6 digits (these rows contain errors).
I created a new column that calculates the number of characters for each ID, and then I filter for all objects with 6 or more digits, like so:
clean_df <- df %>%
mutate(chars = nchar(id)) %>%
filter(chars >= 6)
This is working just fine, but I'm wondering if there's an easier way.
Using str_length() from the stringr package (part of the tidyverse):
library(tidyverse)
clean_df <- df %>%
filter(str_length(id) >= 6)
If the ids are numeric (and positive), you can use log10: a number with at least 6 digits is at least 100000, so log10(id) >= 5 does the job.
df %>%
  filter(log10(id) >= 5)
You can skip mutate
df %>%
filter(nchar(id) >= 6)
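A quick check of both approaches on made-up ids (assuming positive integer ids; with character ids only the nchar/str_length versions apply):
library(dplyr)
demo <- data.frame(id = c(123, 123456, 9999999))  # hypothetical sample ids
demo %>% filter(nchar(id) >= 6)   # keeps 123456 and 9999999
demo %>% filter(log10(id) >= 5)   # same rows, numeric ids only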

Deleting every last row in every group in R [duplicate]

This question already has an answer here: R delete last row in dataframe for each group.
I need to delete every last row in a group after applying group_by.
I have tried something like that, but it does not work.
data <- data %>%
  group_by(isin) %>%
  summarise(data = data[-length(isin), ])
Thanks for your help!
We use the built-in iris data set as an example. It has three groups of 50 rows each, defined by the Species column. Next time please provide sample data in the question; see the top of the r tag page for info.
1) group_modify We can use group_modify from dplyr.
library(dplyr)
iris %>%
group_by(Species) %>%
group_modify(~ head(., -1)) %>%
ungroup
2) slice Another dplyr solution is to use slice
library(dplyr)
iris %>%
group_by(Species) %>%
slice(-n()) %>%
ungroup
3) by A base solution is to use by. It produces a list of data frames which we rbind back together.
do.call("rbind", by(iris, iris$Species, head, -1))
4) subset/ave Another base solution is to create a vector of numbers which count down to 1 for each group and then only keep those rows corresponding to a number greater than 1.
subset(iris, ave(1:nrow(iris), Species, FUN = function(x) length(x):1) > 1)
4a) or keep all rows except the one having the maximum row number in each group:
n <- nrow(iris)
subset(iris, ave(1:n, Species, FUN = max) != 1:n)
5) duplicated Yet another base solution uses duplicated. With fromLast = TRUE it keeps only rows whose Species value occurs again later on, so the last row of each group is dropped.
subset(iris, duplicated(Species, fromLast = TRUE))
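Each of these drops exactly one row per group, so with iris (150 rows, three species of 50) every variant returns 147 rows. A quick check for, e.g., the slice version:
library(dplyr)
res <- iris %>% group_by(Species) %>% slice(-n()) %>% ungroup()
nrow(res)           # 147
table(res$Species)  # 49 rows per species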
Try using the base function by:
new_data <- do.call(rbind, by(data, data[, 'isin'], function(x) x[-nrow(x), ]))
by returns the groups as a list, and do.call(rbind, ...) binds that list back into a data.frame. Note the use of nrow(x) rather than length(x): length on a data frame counts columns, not rows.

How to duplicate a specific number of rows per group level in R [duplicate]

This question already has answers here: Repeat each row of data.frame the number of times specified in a column.
Here is my data:
For each x1 level, I am trying to duplicate a number of rows equal to number.class, and I would like the length class in each row to run from Lmin..cm. to Lmax..cm., increasing by 1 per row. I came up with this code:
test <- A.M %>% filter(x1 == "Crenimugil crenilabis")
for (i in 1:test$number.class) {test <- test %>% add_row()}
for (i in 1:nrow(test)) {test[i, ] <- test[1, ]}
for (i in 1:nrow(test)) {test$length.class[i] <- i + test$Lmin..cm.[1]}
test$length.class <- test$length.class - 1
which basically works and gives me the expected results.
However, this script does not allow me to run this for every species.
Thank you.
Here, we can use uncount from tidyr to replicate the rows, group by 'x1', and update 'Lmin..cm.' by adding the row number minus 1 (so the first row in each group keeps its original Lmin..cm.):
library(dplyr)
library(tidyr)
A.M %>%
  uncount(number.class) %>%
  group_by(x1) %>%
  mutate(`Lmin..cm.` = `Lmin..cm.` + row_number() - 1)
If we need a full sequence from Lmin..cm. to Lmax..cm., then instead of uncount we can use map2 to build the sequence for each row and then unnest it:
library(purrr)
A.M %>%
  mutate(new = map2(`Lmin..cm.`, `Lmax..cm.`, ~ seq(.x, .y, by = 1))) %>%
  unnest(c(new))
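Since the question's data is not reproduced here, a hypothetical A.M with the same column names can be used to check the approach; the values below are made up for illustration:
library(dplyr)
library(tidyr)
# hypothetical sample data, column names as in the question
A.M <- data.frame(
  x1           = c("Crenimugil crenilabis", "Chanos chanos"),
  Lmin..cm.    = c(10, 20),
  Lmax..cm.    = c(13, 22),
  number.class = c(4, 3)   # number of 1-cm length classes per species
)
A.M %>%
  uncount(number.class) %>%
  group_by(x1) %>%
  mutate(`Lmin..cm.` = `Lmin..cm.` + row_number() - 1) %>%
  ungroup()
# the length classes run 10:13 for the first species and 20:22 for the second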

Remove all duplicates by multiple variables with dplyr [duplicate]

This question already has answers here: Remove all copies of rows with duplicate values in R.
I'm trying to remove all duplicate values based on multiple variables using dplyr. Here's how I do it without dplyr:
dat <- data.frame(id = c(1, 1, 2), date = c(1, 1, 1))
dat <- dat[!(duplicated(dat[c('id', 'date')]) | duplicated(dat[c('id', 'date')], fromLast = TRUE)), ]
It should only return id number 2.
This can be done with a group_by/filter operation in the tidyverse. Group by the columns of interest: group_by_all is used here because every column in the dataset defines a duplicate; use group_by_at instead if only a subset of the columns is needed.
library(dplyr)
dat %>%
group_by_all() %>%
filter(n()==1)
Or simply group_by
dat %>%
group_by(id, date) %>%
filter(n() == 1)
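With dat as defined above, either grouped version keeps only the unduplicated row:
library(dplyr)
dat %>%
  group_by(id, date) %>%
  filter(n() == 1)
# returns the single remaining row: id = 2, date = 1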
If the OP intended to use the duplicated function
dat %>%
filter_at(vars(id, date),
any_vars(!(duplicated(.)|duplicated(., fromLast = TRUE))))
# id date
#1 2 1
