I have some data that I am trying to group by consecutive values in R. This solution is similar to what I am looking for; however, my data is structured like this:
line_num
1
2
3
1
2
1
2
3
4
What I want to do is group each time the number returns to 1 such that I get groups like this:
line_num group_num
       1         1
       2         1
       3         1
       1         2
       2         2
       1         3
       2         3
       3         3
       4         3
Any ideas on the best way to accomplish this using dplyr or base R?
Thanks!
We could use cumsum on a logical vector
library(dplyr)
df2 <- df1 %>%
  mutate(group_num = cumsum(line_num == 1))
or with base R
df1$group_num <- cumsum(df1$line_num == 1)
data
df1 <- structure(list(line_num = c(1L, 2L, 3L, 1L, 2L, 1L, 2L, 3L, 4L
)), class = "data.frame", row.names = c(NA, -9L))
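For reference, running either version on this data reproduces the expected grouping:
df1$group_num <- cumsum(df1$line_num == 1)
df1
#   line_num group_num
# 1        1         1
# 2        2         1
# 3        3         1
# 4        1         2
# 5        2         2
# 6        1         3
# 7        2         3
# 8        3         3
# 9        4         3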
Despite using R and dplyr on a regular basis, I have not managed to calculate the sum of the absolute differences between consecutive columns:
sum_diff = |A - B| + |B - C| + |C - D| + ...
A B C D sum_diff
1 2 3 4        3
2 1 3 4        4
1 2 1 1        2
4 1 2 1        5
I know I could iterate using a for loop over all columns, but given the size of my data frame, I prefer a more elegant and fast solution.
Any help?
Thank you
In base R, we can subtract the data frame minus its last column from the data frame minus its first column, and use rowSums on the absolute values. This should be very efficient compared to a package solution:
df1$sum_diff <- rowSums(abs(df1[-ncol(df1)] - df1[-1]))
Output:
> df1
A B C D sum_diff
1 1 2 3 4 3
2 2 1 3 4 4
3 1 2 1 1 2
4 4 1 2 1 5
Or another option is rowDiffs from matrixStats
library(matrixStats)
rowSums(abs(rowDiffs(as.matrix(df1))))
[1] 3 4 2 5
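A plain base R alternative, sketched here for comparison and assuming df1 holds only the columns A through D as in the data below, applies diff to each row; it is typically slower than the vectorised rowSums version on wide data but reads naturally:
apply(as.matrix(df1), 1, function(x) sum(abs(diff(x))))
# [1] 3 4 2 5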
data
df1 <- structure(list(A = c(1L, 2L, 1L, 4L), B = c(2L, 1L, 2L, 1L),
C = c(3L, 3L, 1L, 2L), D = c(4L, 4L, 1L, 1L)), row.names = c(NA,
-4L), class = "data.frame")
Data from akrun (many thanks)!
This is complicated. The idea is to generate a list of the column combinations; I tried combn, but that returns all possible combinations, so I created the list by hand.
With these combinations we can then use purrr's map_dfc and do some data wrangling afterwards:
library(tidyverse)
combinations <- list(c("A", "B"), c("B", "C"), c("C", "D"))
purrr::map_dfc(combinations, ~ {
  df <- tibble(a = data[[.[[1]]]] - data[[.[[2]]]])
  names(df) <- paste0(.[[1]], "_v_", .[[2]])
  df
}) %>%
  transmute(sum_diff = rowSums(abs(.))) %>%
  bind_cols(data)
sum_diff A B C D
<dbl> <int> <int> <int> <int>
1 3 1 2 3 4
2 4 2 1 3 4
3 2 1 2 1 1
4 5 4 1 2 1
data:
data <- structure(list(A = c(1L, 2L, 1L, 4L), B = c(2L, 1L, 2L, 1L),
C = c(3L, 3L, 1L, 2L), D = c(4L, 4L, 1L, 1L)), row.names = c(NA,
-4L), class = "data.frame")
Here is a dplyr version of @akrun's elegant approach, which calculates the diff of the data frame with its shifted variant:
df %>%
  mutate(sum_diff = rowSums(abs(identity(.) %>% select(1:last_col(1)) -
                                  identity(.) %>% select(2:last_col()))))
And here is the rowwise variant, which basically follows the same idea, but this time every row is treated as a vector that gets subtracted from its shifted self:
df %>%
  rowwise() %>%
  mutate(sum_diff = map2_int(c_across(1:last_col(1)),
                             c_across(2:last_col()),
                             ~ abs(.x - .y)) %>%
           sum())
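A more compact rowwise sketch along the same lines lets diff do the shifting (assuming the columns are named A through D as in the sample data):
df %>%
  rowwise() %>%
  mutate(sum_diff = sum(abs(diff(c_across(A:D))))) %>%
  ungroup()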
I have a data frame that looks like this
column1
1
1
2
3
3
and I would like to give a unique ID to each element. My problem is that I cannot find a way to make the unique IDs start from zero, like this:
column1 column2
1 0
1 0
2 1
3 2
3 2
Any help is appreciated
Try this: cur_group_id() from dplyr will create the id starting from 1, but you can easily shift it to start from zero:
library(dplyr)
#Data
df <- structure(list(column1 = c(0L, 1L, 2L, 3L, 3L)), class = "data.frame", row.names = c(NA,-5L))
#Mutate
df %>% group_by(column1) %>% mutate(id = cur_group_id() - 1)
# A tibble: 5 x 2
# Groups: column1 [4]
column1 id
<int> <dbl>
1 0 0
2 1 1
3 2 2
4 3 3
5 3 3
We could use match
library(dplyr)
df1 %>%
  mutate(column2 = match(column1, unique(column1)) - 1)
data
df1 <- structure(list(column1 = c(1L, 1L, 2L, 3L, 3L)),
                 class = "data.frame", row.names = c(NA, -5L))
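For a base R route, a factor built on the order of first appearance gives the same zero-based id (a sketch on the df1 above):
df1$column2 <- as.integer(factor(df1$column1, levels = unique(df1$column1))) - 1
df1
#   column1 column2
# 1       1       0
# 2       1       0
# 3       2       1
# 4       3       2
# 5       3       2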
I am trying to remove rows whose swapped counterpart also exists in the data frame.
For example, if I have a data frame:
1 2
1 3
1 4
2 4
4 2
2 1
Then the rows (1,2) and (2,4) should be removed, because (2,1) and (4,2) are also in the df.
Is there any fast and neat way to do it? Thank you!
You can sort the two columns row-wise and then keep only the unique combinations:
library(dplyr)
df %>%
  mutate(col1 = pmin(V1, V2),
         col2 = pmax(V1, V2)) %>%
  distinct(col1, col2)
# col1 col2
#1 1 2
#2 1 3
#3 1 4
#4 2 4
Using base R:
df1 <- transform(df, col1 = pmin(V1, V2), col2 = pmax(V1, V2))
df[!duplicated(df1[3:4]), ]
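The same pmin/pmax idea carries over to data.table, sketched here as one more option (it returns the sorted pairs rather than the original rows):
library(data.table)
unique(as.data.table(df)[, .(col1 = pmin(V1, V2), col2 = pmax(V1, V2))])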
data
df <- structure(list(V1 = c(1L, 1L, 1L, 2L, 4L, 2L), V2 = c(2L, 3L,
4L, 4L, 2L, 1L)), class = "data.frame", row.names = c(NA, -6L))
Another base R solution uses rowSums and duplicated:
df[!duplicated(rowSums(df)),]
V1 V2
1 1 2
2 1 3
3 1 4
4 2 4
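One caveat: the rowSums shortcut only works while all row sums happen to be distinct; two unrelated rows that share a sum would wrongly collide. A minimal counter-example (df_bad is a hypothetical name):
df_bad <- data.frame(V1 = c(1L, 2L), V2 = c(4L, 3L))
df_bad[!duplicated(rowSums(df_bad)), ]
#   V1 V2
# 1  1  4
# (2, 3) is dropped even though it is not a swap of (1, 4)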
I am relatively new to R but slowly finding my way. I encountered a problem, however, and hope someone can help me.
Let's say I have two data frames (let's call them A and B), both containing survey responses. A contains all responses from the first set of people. B contains the responses of the second set of people, plus the people of the first set but with their responses set to NA. An example:
Dataframe A:
Household Individual Answer_A Answer_b
1 2 5 6
1 3 6 6
2 1 2 3
Dataframe B:
Household Individual Answer_A Answer_b
1 1 3 6
1 2 NA NA
1 3 NA NA
2 1 NA NA
2 2 4 7
I want to get one dataframe with all individuals and their responses:
Dataframe C:
Household Individual Answer_A Answer_b
1 1 3 6
1 2 5 6
1 3 6 6
2 1 2 3
2 2 4 7
If I only have two datasets I can use rbind.fill, with rbind.fill(B, A), to get dataframe C, as the NAs in B are then overwritten with the answers in A.
But if I had to add a third dataset, D, consisting of NAs for the people in A and B, I would not be able to use this solution. What could I do then? I've looked at intersect, outersect, and different forms of join, but can't seem to find a good solution.
Any thoughts?
Maybe you can left_join and then use coalesce:
library(dplyr)
left_join(B, A, by = c("Household", "Individual")) %>%
  mutate(Answer_A = coalesce(Answer_A.x, Answer_A.y),
         Answer_B = coalesce(Answer_b.x, Answer_b.y)) %>%
  select(-matches("\\.x|\\.y"))
# Household Individual Answer_A Answer_B
#1 1 1 3 6
#2 1 2 5 6
#3 1 3 6 6
#4 2 1 2 3
#5 2 2 4 7
data
A <- structure(list(Household = c(1L, 1L, 2L), Individual = c(2L,
3L, 1L), Answer_A = c(5L, 6L, 2L), Answer_b = c(6L, 6L, 3L)), class = "data.frame",
row.names = c(NA, -3L))
B <- structure(list(Household = c(1L, 1L, 1L, 2L, 2L), Individual = c(1L,
2L, 3L, 1L, 2L), Answer_A = c(3L, NA, NA, NA, 4L), Answer_b = c(6L,
NA, NA, NA, 7L)), class = "data.frame", row.names = c(NA, -5L))
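If a third data frame D enters the picture, the same patching step can be folded over a list of frames. A sketch using dplyr::rows_patch (available in dplyr 1.0.0 and later), which fills NA cells in its first argument with matching values from its second, keyed by the id columns; start the fold with the frame that lists every individual:
library(dplyr)
library(purrr)
C <- reduce(list(B, A), rows_patch, by = c("Household", "Individual"))
# With a hypothetical third frame D covering everyone:
# reduce(list(D, B, A), rows_patch, by = c("Household", "Individual"))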
a      b #Encounter
1 112233          1
2 334455          1
1 112233          2
3 445566          1
2 334455          2
2 334455          3
3 445566          2
3 445566          3
3 445566          4
How would I calculate #Encounter from columns a and b in R?
The equivalent Excel formula would be: =COUNTIFS(a(Range), a, b(Range), b)
An option in base R would be to use ave
df1$Encounter <- with(df1, ave(seq_along(a), a, b, FUN = seq_along))
df1$Encounter
#[1] 1 1 2 1 2 3 2 3 4
Or in data.table
library(data.table)
setDT(df1)[, Encounter := rowid(a, b)]
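Or the dplyr equivalent of this running count, as a quick sketch:
library(dplyr)
df1 %>%
  group_by(a, b) %>%
  mutate(Encounter = row_number()) %>%
  ungroup()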
data
df1 <- structure(list(a = c(1L, 2L, 1L, 3L, 2L, 2L, 3L, 3L, 3L), b = c(112233L,
334455L, 112233L, 445566L, 334455L, 334455L, 445566L, 445566L,
445566L)), row.names = c(NA, -9L), class = "data.frame")