Set unique IDs which start from zero in R data.frame

I have a data frame that looks like this
column1
1
1
2
3
3
and I would like to give a unique ID to each distinct value. My problem is that I cannot
find a way to make the unique IDs start from zero, like this
column1 column2
1 0
1 0
2 1
3 2
3 2
Any help is appreciated

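Try this: cur_group_id() from dplyr numbers the groups starting from 1, so subtracting 1 makes the IDs start from zero. Note that the example data below is 0, 1, 2, 3, 3 rather than the data in the question: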
library(dplyr)
#Data
df <- structure(list(column1 = c(0L, 1L, 2L, 3L, 3L)), class = "data.frame", row.names = c(NA,-5L))
#Mutate
df %>% group_by(column1) %>% mutate(id=cur_group_id()-1)
# A tibble: 5 x 2
# Groups: column1 [4]
column1 id
<int> <dbl>
1 0 0
2 1 1
3 2 2
4 3 3
5 3 3

We could use match
library(dplyr)
df1 %>%
mutate(column2 = match(column1, unique(column1)) - 1)
data
df1 <- structure(list(column1 = c(1L, 1L, 2L, 3L, 3L)), class = "data.frame",
row.names = c(NA,
-5L))
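For comparison, here is a minimal base R sketch of the same idea without dplyr. It re-creates df1 so it runs on its own, and the column names id_match and id_factor are made up for illustration. One thing to be aware of: match() numbers values in order of first appearance, while factor() numbers them in sorted order. The two agree here only because column1 is already sorted.

```r
df1 <- data.frame(column1 = c(1L, 1L, 2L, 3L, 3L))

# Position of each value among the unique values, shifted to start at 0
df1$id_match <- match(df1$column1, unique(df1$column1)) - 1L

# Alternative: integer codes of a factor (levels are sorted), shifted to 0
df1$id_factor <- as.integer(factor(df1$column1)) - 1L

df1
```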

Related

Creating loop to count the number of unique values in column based on values in another column

So, for example, I have the following dataframe, data:
col1 col2
1 5
1 5
1 3
2 10
2 11
3 11
Now, I want to make a new column, col3, which gives me the number of unique values in col2 for every grouping in col1.
So far, I have the following code:
length(unique(data$col2[data$col1 == 1]))
Which would here return the number 2.
However, I'm having a hard time making a loop that goes through all the values in col1 to create the new column, col3.
We can use n_distinct after grouping
library(dplyr)
data <- data %>%
group_by(col1) %>%
mutate(col3 = n_distinct(col2)) %>%
ungroup
-output
data
# A tibble: 6 × 3
col1 col2 col3
<int> <int> <int>
1 1 5 2
2 1 5 2
3 1 3 2
4 2 10 2
5 2 11 2
6 3 11 1
Or with data.table
library(data.table)
setDT(data)[, col3 := uniqueN(col2), col1]
data
data <- structure(list(col1 = c(1L, 1L, 1L, 2L, 2L, 3L), col2 = c(5L,
5L, 3L, 10L, 11L, 11L)), class = "data.frame", row.names = c(NA,
-6L))
You want the counts for every row, so using a for loop you would do
data$col3 <- NA_real_
for (i in seq_len(nrow(data))) {
data$col3[i] <- length(unique(data$col2[data$col1 == data$col1[i]]))
}
data
# col1 col2 col3
# 1 1 5 2
# 2 1 5 2
# 3 1 3 2
# 4 2 10 2
# 5 2 11 2
# 6 3 11 1
However, for loops in R are often slower than vectorized alternatives, and in this case we can use the grouping function ave, which comes with base R.
data <- transform(data, col3=ave(col2, col1, FUN=\(x) length(unique(x))))
data
# col1 col2 col3
# 1 1 5 2
# 2 1 5 2
# 3 1 3 2
# 4 2 10 2
# 5 2 11 2
# 6 3 11 1
Data:
data <- structure(list(col1 = c(1L, 1L, 1L, 2L, 2L, 3L), col2 = c(5L,
5L, 3L, 10L, 11L, 11L)), class = "data.frame", row.names = c(NA,
-6L))
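Another base R route, sketched here as an assumption rather than taken from the answers above, is to compute one count per group with tapply and then look each row's group up by name. The names per_group and col3 are only illustrative.

```r
data <- data.frame(col1 = c(1L, 1L, 1L, 2L, 2L, 3L),
                   col2 = c(5L, 5L, 3L, 10L, 11L, 11L))

# One count per group: number of distinct col2 values within each col1
per_group <- tapply(data$col2, data$col1, function(x) length(unique(x)))

# Look each row's group up by name to expand back to one value per row
data$col3 <- as.integer(per_group[as.character(data$col1)])
data
```

This keeps the per-group summary around as its own object, which can be handy if you also need the counts without the row-level expansion.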

How can I calculate the sum of the column wise differences using dplyr

Despite using R and dplyr on a regular basis, I encountered the issue of not being able to calculate the sum of the absolute differences between all columns:
sum_diff=ABS(A-B)+ABS(B-C)+ABS(C-D)...
A B C D sum_diff
1 2 3 4 3
2 1 3 4 4
1 2 1 1 2
4 1 2 1 5
I know I could iterate using a for loop over all columns, but given the size of my data frame, I prefer a more elegant and fast solution.
Any help?
Thank you
We can subtract the data frame without its first column from the data frame without its last column, which pairs each column with its right-hand neighbour, and then use rowSums on the absolute values in base R. This could be very efficient compared to a package solution
df1$sum_diff <- rowSums(abs(df1[-ncol(df1)] - df1[-1]))
-output
> df1
A B C D sum_diff
1 1 2 3 4 3
2 2 1 3 4 4
3 1 2 1 1 2
4 4 1 2 1 5
Or another option is rowDiffs from matrixStats
library(matrixStats)
rowSums(abs(rowDiffs(as.matrix(df1))))
[1] 3 4 2 5
data
df1 <- structure(list(A = c(1L, 2L, 1L, 4L), B = c(2L, 1L, 2L, 1L),
C = c(3L, 3L, 1L, 2L), D = c(4L, 4L, 1L, 1L)), row.names = c(NA,
-4L), class = "data.frame")
Data from akrun (many thanks)!
This is more complicated. The idea is to generate a list of the column pairs to compare. I tried combn, but that returns all possible combinations rather than only adjacent columns, so I wrote the list by hand.
With these combinations we can then use purrr's map_dfc and do some data wrangling afterwards:
library(tidyverse)
combinations <-list(c("A", "B"), c("B", "C"), c("C","D"))
purrr::map_dfc(combinations, ~{df <- tibble(a=data[[.[[1]]]]-data[[.[[2]]]])
names(df) <- paste0(.[[1]],"_v_",.[[2]])
df}) %>%
transmute(sum_diff = rowSums(abs(.))) %>%
bind_cols(data)
sum_diff A B C D
<dbl> <int> <int> <int> <int>
1 3 1 2 3 4
2 4 2 1 3 4
3 2 1 2 1 1
4 5 4 1 2 1
data:
data <- structure(list(A = c(1L, 2L, 1L, 4L), B = c(2L, 1L, 2L, 1L),
C = c(3L, 3L, 1L, 2L), D = c(4L, 4L, 1L, 1L)), row.names = c(NA,
-4L), class = "data.frame")
Here is a dplyr version of @akrun's elegant approach, which calculates the difference between the data frame and its shifted variant:
df %>%
mutate(sum_diff = rowSums(abs(identity(.) %>% select(1:last_col(1))
- identity(.) %>% select(2:last_col()))))
And here is the rowwise variant, which basically follows the same idea, but this time every row is treated as a vector from which its shifted self is subtracted (map2_int comes from purrr).
df %>%
rowwise() %>%
mutate(sum_diff = map2_int(c_across(1:last_col(1)),
c_across(2:last_col()),
~ abs(.x - .y)) %>% sum())
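If you just want to check the numbers without loading any packages, a small base R sketch using diff() per row gives the same result. This is only an illustration (by_row and shifted are made-up names), and apply() over rows will usually be slower than the vectorised rowSums version on large data.

```r
df1 <- data.frame(A = c(1L, 2L, 1L, 4L), B = c(2L, 1L, 2L, 1L),
                  C = c(3L, 3L, 1L, 2L), D = c(4L, 4L, 1L, 1L))

# Row by row: diff() gives B-A, C-B, D-C; sum their absolute values
by_row <- apply(as.matrix(df1), 1, function(r) sum(abs(diff(r))))

# Vectorised: subtract the frame without its first column from the
# frame without its last column, then sum absolute values per row
shifted <- rowSums(abs(df1[-ncol(df1)] - df1[-1]))

by_row
```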

group ordered line numbers

I have some data that I am trying to group by consecutive values in R. This solution is similar to what I am looking for, however my data is structured like this:
line_num
1
2
3
1
2
1
2
3
4
What I want to do is group each time the number returns to 1 such that I get groups like this:
line_num group_num
1 1
2 1
3 1
1 2
2 2
1 3
2 3
3 3
4 3
Any ideas on the best way to accomplish this using dplyr or base R?
Thanks!
We could use cumsum on a logical vector
library(dplyr)
df2 <- df1 %>%
mutate(group_num = cumsum(line_num == 1))
or with base R
df1$group_num <- cumsum(df1$line_num == 1)
data
df1 <- structure(list(line_num = c(1L, 2L, 3L, 1L, 2L, 1L, 2L, 3L, 4L
)), class = "data.frame", row.names = c(NA, -9L))
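To see why this works, it may help to look at the intermediate logical vector. The sketch below uses plain vectors instead of a data frame. The last line is a hypothetical variant, not from the answer above, that starts a new group whenever the value fails to increase, which would also cover sequences that do not restart exactly at 1.

```r
line_num <- c(1L, 2L, 3L, 1L, 2L, 1L, 2L, 3L, 4L)

# TRUE marks the start of each new run
starts <- line_num == 1

# TRUE counts as 1 and FALSE as 0, so the running sum is the group number
group_num <- cumsum(starts)

# Hypothetical variant: start a new group whenever the value does not increase
group_alt <- cumsum(c(TRUE, diff(line_num) <= 0))

data.frame(line_num, starts, group_num, group_alt)
```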

Filtering using dplyr package

My dataset is set up as follows:
User Day
10 2
1 3
15 1
3 1
1 2
15 3
1 1
I'm trying to find the users that are present on all three days. I'm using the code below with the dplyr package:
MAU%>%
group_by(User)%>%
filter(c(1,2,3) %in% Day)
# but get this error message:
# Error in filter_impl(.data, quo) : Result must have length 12, not 3
any idea how to fix?
Using the input shown reproducibly in the Note at the end, remove duplicate rows, count the rows per User, and keep the users that have 3 days:
library(dplyr)
DF %>%
distinct %>%
count(User) %>%
filter(n == 3) %>%
select(User)
giving:
# A tibble: 1 x 1
User
<int>
1 1
Note
Lines <- "
User Day
10 2
1 3
15 1
3 1
1 2
15 3
1 1"
DF <- read.table(text = Lines, header = TRUE)
We can use all to get a single TRUE/FALSE from the logical vector 1:3 %in% Day
library(dplyr)
MAU %>%
group_by(User)%>%
filter(all(1:3 %in% Day))
# A tibble: 3 x 2
# Groups: User [1]
# User Day
# <int> <int>
#1 1 3
#2 1 2
#3 1 1
data
MAU <- structure(list(User = c(10L, 1L, 15L, 3L, 1L, 15L, 1L), Day = c(2L,
3L, 1L, 1L, 2L, 3L, 1L)), class = "data.frame", row.names = c(NA,
-7L))
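The original error comes from c(1,2,3) %in% Day returning three logicals per group, while filter expects either one value or one per row. A rough base R sketch of the same check, with illustrative names (days_user15, has_all, complete_users) that are not from the question:

```r
MAU <- data.frame(User = c(10L, 1L, 15L, 3L, 1L, 15L, 1L),
                  Day  = c(2L, 3L, 1L, 1L, 2L, 3L, 1L))

# For one user, %in% answers "is day 1 present? day 2? day 3?"
days_user15 <- MAU$Day[MAU$User == 15]
1:3 %in% days_user15        # one logical per day, length 3
all(1:3 %in% days_user15)   # collapsed to a single logical

# One logical per user, then keep rows of users where it is TRUE
has_all <- tapply(MAU$Day, MAU$User, function(d) all(1:3 %in% d))
complete_users <- as.integer(names(has_all)[has_all])
result <- MAU[MAU$User %in% complete_users, ]
result
```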

delete the rows with duplicated ids

I want to delete the rows with duplicated ids
data
id V1 V2
1 a 1
1 b 2
2 a 2
2 c 3
3 a 4
The problem is that some people took the test several times, which generated multiple scores in V2. I want to delete the duplicated ids and keep one of the V2 scores at random.
output
id V1 V2
1 a 1
2 a 2
3 a 4
I tried this:
neu <- unique(neu$userid)
but it didn't work
Using dplyr:
library(dplyr)
set.seed(1)
df %>% sample_frac(., 1) %>% arrange(id) %>% distinct(id, .keep_all = TRUE)
Output:
id V1 V2
1 1 b 2
2 2 c 3
3 3 a 4
Data:
df <- structure(list(id = c(1L, 1L, 2L, 2L, 3L), V1 = structure(c(1L,
2L, 1L, 3L, 1L), .Label = c("a", "b", "c"), class = "factor"),
V2 = c(1L, 2L, 2L, 3L, 4L)), .Names = c("id", "V1", "V2"), class = "data.frame", row.names = c(NA,
-5L))
Creating the data frame based on your example:
df <- read.table(text =
"id V1 V2
1 a 1
1 b 2
2 a 2
2 c 3
3 a 4", h = T)
Since you want to remove rows randomly, first sort the rows of your data frame randomly:
df <- df[sample(nrow(df)),]
Then remove duplicates in order of appearance:
df <- df[!duplicated(df$id),]
Now sort your data frame back:
df <- df[with(df, order(id)),]
Remember to replace df with the name of your data frame.
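The three steps can also be wrapped in a small function so they are easy to reuse. This is only a sketch: keep_random_per_id and its id_col argument are hypothetical names, and set.seed is there purely so the random pick is repeatable.

```r
# Hypothetical helper: keep one randomly chosen row per value of id_col
keep_random_per_id <- function(df, id_col) {
  shuffled <- df[sample.int(nrow(df)), , drop = FALSE]    # random row order
  deduped  <- shuffled[!duplicated(shuffled[[id_col]]), , drop = FALSE]
  deduped[order(deduped[[id_col]]), , drop = FALSE]       # restore id order
}

df <- data.frame(id = c(1L, 1L, 2L, 2L, 3L),
                 V1 = c("a", "b", "a", "c", "a"),
                 V2 = c(1L, 2L, 2L, 3L, 4L))

set.seed(1)  # only to make the random choice repeatable
out <- keep_random_per_id(df, "id")
out
```

Which row survives for each id depends on the seed, but the result always has exactly one row per id, taken unchanged from the original data.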
