I need to create a data frame summarising the responses to each categorical variable from a previous data frame. Conveniently, these variables are all coded as numbers from 1 to 5 rather than as text.
So I would like a new data frame whose first column contains the numbers 1 to 5 and whose remaining columns give, for each variable in the original data frame, the frequency of that number weighted by its value (count times value, as in the expected output below).
For example, we have an original df defined as:
df1 <- data.frame(
  Z = c(4, 1, 2, 1, 5, 4, 2, 5, 1, 5),
  Y = c(5, 1, 5, 5, 2, 1, 4, 1, 3, 3),
  X = c(4, 2, 2, 1, 5, 1, 5, 1, 3, 2),
  W = c(2, 1, 4, 2, 3, 2, 4, 2, 1, 2),
  V = c(5, 1, 3, 3, 3, 3, 2, 4, 4, 1))
I would need a second df containing the following table:
fq Z Y X W V
1 3 3 3 2 2
2 4 2 6 10 2
3 0 6 3 3 12
4 8 4 4 8 8
5 15 15 10 0 5
I saw some answers showing how to do something like this using plyr, but not in a systematic way. Can someone help me out?
table(stack(df1)) * 1:5
ind
values Z Y X W V
1 3 3 3 2 2
2 4 2 6 10 2
3 0 6 3 3 12
4 8 4 4 8 8
5 15 15 10 0 5
If you need a data.frame, you could do:
as.data.frame.matrix(table(stack(df1)) * 1:5)
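Note that table(stack(df1)) has exactly five rows only because every value from 1 to 5 actually occurs somewhere in df1. If a level could be missing entirely, a small variant that fixes the factor levels first avoids a dropped row (a sketch; s is just an intermediate name):
s <- stack(df1)
s$values <- factor(s$values, levels = 1:5)  # keep all five levels, even if one never occurs
as.data.frame.matrix(table(s) * 1:5)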
We may use sapply together with tapply:
sapply(df1, function(x) tapply(x, factor(x, levels = 1:5), FUN = sum))
Z Y X W V
1 3 3 3 2 2
2 4 2 6 10 2
3 NA 6 3 3 12
4 8 4 4 8 8
5 15 15 10 NA 5
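If you would rather see 0 than NA for levels that never occur (here 3 never appears in Z and 5 never appears in W), one option is to replace the NAs afterwards (a minimal sketch building on the same call; out is just an intermediate name):
out <- sapply(df1, function(x) tapply(x, factor(x, levels = 1:5), FUN = sum))
out[is.na(out)] <- 0  # absent levels contribute 0 instead of NA
out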
Another possible solution, based on purrr::map_dfc:
library(tidyverse)
map_dfc(df1, ~ 1:5 * table(factor(.x, levels = 1:5)) %>% as.vector)
#> # A tibble: 5 × 5
#> Z Y X W V
#> <int> <int> <int> <int> <int>
#> 1 3 3 3 2 2
#> 2 4 2 6 10 2
#> 3 0 6 3 3 12
#> 4 8 4 4 8 8
#> 5 15 15 10 0 5
Suppose I have data as follows:
tibble(
  A = c(1, 2, 2, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4, 4, 5),
  B = c(1, 1, 2, 1, 2, 3, 1, 2, 1, 1, 1, 2, 3, 4, 1, 1)
)
i.e.,
# A tibble: 16 x 2
A B
<dbl> <dbl>
1 1 1
2 2 1
3 2 2
4 2 1
5 2 2
6 2 3
7 3 1
8 3 2
9 3 1
10 3 1
11 4 1
12 4 2
13 4 3
14 4 4
15 4 1
16 5 1
How do I create a sub_id that increments each time a new sequence begins (that is, each time B restarts at 1) within the group defined by variable A, i.e.,
tibble(
  A = c(1, 2, 2, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4, 4, 5),
  B = c(1, 1, 2, 1, 2, 3, 1, 2, 1, 1, 1, 2, 3, 4, 1, 1),
  sub_id = c(1, 1, 1, 2, 2, 2, 1, 1, 2, 3, 1, 1, 1, 1, 2, 1)
)
# A tibble: 16 x 3
A B sub_id
<dbl> <dbl> <dbl>
1 1 1 1
2 2 1 1
3 2 2 1
4 2 1 2
5 2 2 2
6 2 3 2
7 3 1 1
8 3 2 1
9 3 1 2
10 3 1 3
11 4 1 1
12 4 2 1
13 4 3 1
14 4 4 1
15 4 1 2
16 5 1 1
Hopefully that's well defined. I suppose I'm after a kind of inverse of row_number().
Thanks in advance,
James.
Using base R:
df$sub_id <- with(df, ave(B == 1, A, FUN = cumsum))
You have the "ingredients" already laid out:
(i) for each group of column A,
(ii) check whether a new sequence starts.
The following is based on {dplyr}. For demo purposes, I create an additional column/variable to show the "start condition"; you can combine everything into one call.
It relies on the fact that summing over TRUE/FALSE values counts each TRUE as 1. If that is not obvious to you, you can use as.numeric(B == 1) instead. (A short standalone illustration of this coercion follows the output below.)
library(dplyr)
library(tibble)
# load example data
df <- tibble(
  A = c(1, 2, 2, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4, 4, 5),
  B = c(1, 1, 2, 1, 2, 3, 1, 2, 1, 1, 1, 2, 3, 4, 1, 1),
  sub_id = c(1, 1, 1, 2, 2, 2, 1, 1, 2, 3, 1, 1, 1, 1, 2, 1)
)
# perform group-wise operations
df %>%
  group_by(A) %>%
  mutate(
    # --------------- highlight start of new sequence --------------
    start = B == 1,
    # --------------- create cumsum over TRUEs ----------------------
    sub_id2 = cumsum(start)
  )
This yields what you were looking for:
# A tibble: 16 x 5
# Groups: A [5]
A B sub_id start sub_id2
<dbl> <dbl> <dbl> <lgl> <int>
1 1 1 1 TRUE 1
2 2 1 1 TRUE 1
3 2 2 1 FALSE 1
4 2 1 2 TRUE 2
5 2 2 2 FALSE 2
6 2 3 2 FALSE 2
7 3 1 1 TRUE 1
8 3 2 1 FALSE 1
9 3 1 2 TRUE 2
10 3 1 3 TRUE 3
11 4 1 1 TRUE 1
12 4 2 1 FALSE 1
13 4 3 1 FALSE 1
14 4 4 1 FALSE 1
15 4 1 2 TRUE 2
16 5 1 1 TRUE 1
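Here is the standalone illustration of the TRUE-as-1 coercion mentioned above (not part of the answer itself, just a base R fact):
cumsum(c(TRUE, FALSE, TRUE, TRUE))
#> [1] 1 1 2 3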
We could use group_by and cumsum:
library(dplyr)
df %>%
  group_by(A) %>%
  mutate(sub_id = cumsum(B == 1))
Output:
# Groups: A [5]
A B sub_id
<dbl> <dbl> <int>
1 1 1 1
2 2 1 1
3 2 2 1
4 2 1 2
5 2 2 2
6 2 3 2
7 3 1 1
8 3 2 1
9 3 1 2
10 3 1 3
11 4 1 1
12 4 2 1
13 4 3 1
14 4 4 1
15 4 1 2
16 5 1 1
A data.table option:
library(data.table)
setDT(df)[, sub_id := cumsum(B == 1), A][]
A B sub_id
1: 1 1 1
2: 2 1 1
3: 2 2 1
4: 2 1 2
5: 2 2 2
6: 2 3 2
7: 3 1 1
8: 3 2 1
9: 3 1 2
10: 3 1 3
11: 4 1 1
12: 4 2 1
13: 4 3 1
14: 4 4 1
15: 4 1 2
16: 5 1 1
I have a dataset with the following layout:
ABC1a_1 <- c(1, 5, 3, 4, 3, 4, 5, 2, 2, 1)
ABC1b_1 <- c(4, 2, 1, 1, 5, 3, 2, 1, 1, 5)
ABC1a_2 <- c(4, 5, 5, 4, 2, 5, 5, 1, 2, 4)
ABC1b_2 <- c(2, 3, 3, 2, 2, 3, 2, 1, 4, 2)
ABC2a_1 <- c(2, 5, 3, 5, 3, 4, 5, 3, 2, 3)
ABC2b_1 <- c(1, 2, 2, 4, 5, 3, 2, 4, 1, 4)
ABC2a_2 <- c(2, 5, 5, 1, 2, 1, 5, 1, 3, 4)
ABC2b_2 <- c(2, 3, 3, 2, 1, 3, 1, 1, 2, 2)
df <- data.frame(ABC1a_1, ABC1b_1, ABC1a_2, ABC1b_2, ABC2a_1, ABC2b_1, ABC2a_2, ABC2b_2)
I want to collapse all of the ABC[N][x]_[n] variables into a single ABC[N]_[n] variable like this:
ABC1_1 <- c(1, 5, 3, 4, 3, 4, 5, 2, 2, 1, 4, 2, 1, 1, 5, 3, 2, 1, 1, 5)
ABC1_2 <- c(4, 5, 5, 4, 2, 5, 5, 1, 2, 4, 2, 3, 3, 2, 2, 3, 2, 1, 4, 2)
ABC2_1 <- c(2, 5, 3, 5, 3, 4, 5, 3, 2, 3, 1, 2, 2, 4, 5, 3, 2, 4, 1, 4)
ABC2_2 <- c(2, 5, 5, 1, 2, 1, 5, 1, 3, 4, 2, 3, 3, 2, 1, 3, 1, 1, 2, 2)
df2 <- data.frame(ABC1_1, ABC1_2, ABC2_1, ABC2_2)
What's the best way to achieve this, ideally with a tidyverse solution?
You could also use pivot_longer:
library(tidyverse)

df %>%
  rename_with(~ str_replace(.x, "(.)(_\\d)", "\\2:\\1")) %>%
  pivot_longer(everything(), names_sep = ":", names_to = c(".value", "group")) %>%
  arrange(group)
# A tibble: 20 x 5
group ABC1_1 ABC1_2 ABC2_1 ABC2_2
<chr> <dbl> <dbl> <dbl> <dbl>
1 a 1 4 2 2
2 a 5 5 5 5
3 a 3 5 3 5
4 a 4 4 5 1
5 a 3 2 3 2
6 a 4 5 4 1
7 a 5 5 5 5
8 a 2 1 3 1
9 a 2 2 2 3
10 a 1 4 3 4
11 b 4 2 1 2
12 b 2 3 2 3
13 b 1 3 2 3
14 b 1 2 4 2
15 b 5 2 5 1
16 b 3 3 3 3
17 b 2 2 2 1
18 b 1 1 4 1
19 b 1 4 1 2
20 b 5 2 4 2
If you want to go the base R way, you could do:
reshape(df, split(names(df), sub("._", "_", names(df))), dir="long")
time ABC1a_1 ABC1a_2 ABC2a_1 ABC2a_2 id
1.1 1 1 4 2 2 1
2.1 1 5 5 5 5 2
3.1 1 3 5 3 5 3
4.1 1 4 4 5 1 4
5.1 1 3 2 3 2 5
6.1 1 4 5 4 1 6
7.1 1 5 5 5 5 7
8.1 1 2 1 3 1 8
9.1 1 2 2 2 3 9
10.1 1 1 4 3 4 10
1.2 2 4 2 1 2 1
2.2 2 2 3 2 3 2
3.2 2 1 3 2 3 3
4.2 2 1 2 4 2 4
5.2 2 5 2 5 1 5
6.2 2 3 3 3 3 6
7.2 2 2 2 2 1 7
8.2 2 1 1 4 1 8
9.2 2 1 4 1 2 9
10.2 2 5 2 4 2 10
Then you can change the names.
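For example, one possible clean-up after the reshape (a sketch; long is just an illustrative name, and reshape's time/id bookkeeping columns are dropped):
long <- reshape(df, split(names(df), sub("._", "_", names(df))), dir = "long")
long <- long[setdiff(names(long), c("time", "id"))]  # drop reshape's bookkeeping columns
names(long) <- sub("a(_\\d)$", "\\1", names(long))   # ABC1a_1 -> ABC1_1, etc.
rownames(long) <- NULL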
If you care about the names from the very beginning:
df1 <- setNames(df, gsub("(.)(_\\d)", "\\2.\\1", names(df)))
reshape(df1, names(df1), dir = "long")
time ABC1_1 ABC1_2 ABC2_1 ABC2_2 id
1 a 1 4 2 2 1
2 a 5 5 5 5 2
3 a 3 5 3 5 3
4 a 4 4 5 1 4
5 a 3 2 3 2 5
6 a 4 5 4 1 6
7 a 5 5 5 5 7
8 a 2 1 3 1 8
9 a 2 2 2 3 9
10 a 1 4 3 4 10
11 b 4 2 1 2 1
12 b 2 3 2 3 2
13 b 1 3 2 3 3
14 b 1 2 4 2 4
15 b 5 2 5 1 5
16 b 3 3 3 3 6
17 b 2 2 2 1 7
18 b 1 1 4 1 8
19 b 1 4 1 2 9
20 b 5 2 4 2 10
A base R solution to collapse it:
res <- as.data.frame(lapply(split.default(df, sub('._', '_', names(df))), unlist))
rownames(res) <- NULL
# >res
# ABC1_1 ABC1_2 ABC2_1 ABC2_2
# 1 1 4 2 2
# 2 5 5 5 5
# 3 3 5 3 5
# 4 4 4 5 1
# 5 3 2 3 2
# 6 4 5 4 1
# 7 5 5 5 5
# 8 2 1 3 1
# 9 2 2 2 3
# 10 1 4 3 4
# 11 4 2 1 2
# 12 2 3 2 3
# 13 1 3 2 3
# 14 1 2 4 2
# 15 5 2 5 1
# 16 3 3 3 3
# 17 2 2 2 1
# 18 1 1 4 1
# 19 1 4 1 2
# 20 5 2 4 2
identical(df2, res)
# [1] TRUE
If, instead of stacking the paired columns, you wanted to combine them by summing row-wise, you could pass rowSums as the combining function:
> as.data.frame(lapply(split.default(df, sub('._', '_', names(df))), rowSums))
ABC1_1 ABC1_2 ABC2_1 ABC2_2
1 5 6 3 4
2 7 8 7 8
3 4 8 5 8
4 5 6 9 3
5 8 4 8 3
6 7 8 7 4
7 7 7 7 6
8 3 2 7 2
9 3 6 3 5
10 6 6 7 6
You could do:
library(tidyverse)
map_dfr(c("a", "b"), function(letter) {
  df %>%
    select(contains(letter, ignore.case = FALSE)) %>%
    rename_with(~ str_remove_all(.x, letter))
})
#ABC1_1 ABC1_2 ABC2_1 ABC2_2
#1 1 4 2 2
#2 5 5 5 5
#3 3 5 3 5
#4 4 4 5 1
# ..
Depending on your actual data, you could replace c("a", "b") with letters[1:2] or unique(str_extract(colnames(df), "[a-z]")).
library(tidyr)
library(dplyr)
df %>%
  pivot_longer(everything()) %>%
  arrange(name) %>%
  mutate(name = gsub("[a-z]_", "_", name)) %>%
  pivot_wider(values_fn = list) %>%
  unchop(everything())
pivot_longer puts all of the column names into a single name column, which you can then edit by removing the lowercase letter that precedes the underscore.
When you pivot back to a wide format, columns with the same edited name are grouped automatically. pivot_wider returns list-columns, and unchop expands those lists into a longer data frame.
Output
ABC1_1 ABC1_2 ABC2_1 ABC2_2
<dbl> <dbl> <dbl> <dbl>
1 1 4 2 2
2 5 5 5 5
3 3 5 3 5
4 4 4 5 1
5 3 2 3 2
6 4 5 4 1
7 5 5 5 5
8 2 1 3 1
9 2 2 2 3
10 1 4 3 4
11 4 2 1 2
12 2 3 2 3
13 1 3 2 3
14 1 2 4 2
15 5 2 5 1
16 3 3 3 3
17 2 2 2 1
18 1 1 4 1
19 1 4 1 2
20 5 2 4 2
I have two data frames, a and b.
I want to remove from b the rows whose combination of X and Y appears in a.
example a =
X Y
1 1 3
2 2 4
3 3 5
example b =
X Y Z
1 3 5 4 --- want to remove this
2 4 6 2
3 1 3 2 --- want to remove this
4 2 3 4
5 5 3 4
6 2 4 2 --- want to remove this
7 4 3 4
8 2 4 6 ---- want remove this
9 6 9 6
10 2 0 3
So I only keep the rows of b whose (X, Y) combination does not appear in a.
The final result would be this:
X Y Z
1 4 6 2
2 2 3 4
3 5 3 4
4 4 3 4
5 6 9 6
6 2 0 3
Thanks
anti_join from the dplyr package can be very helpful here.
library(tidyverse)
a <- tibble(X = c(1, 2, 3), Y = c(3, 4, 5))
b <- tibble(X = c(3, 4, 1, 2, 5, 2, 4, 2, 6, 2),
            Y = c(5, 6, 3, 3, 3, 4, 3, 4, 9, 0),
            Z = c(4, 2, 2, 4, 4, 2, 4, 6, 6, 3))
c <- b %>% anti_join(a, by = c("X", "Y"))
c
Gives
# A tibble: 6 x 3
X Y Z
<dbl> <dbl> <dbl>
1 4 6 2
2 2 3 4
3 5 3 4
4 4 3 4
5 6 9 6
6 2 0 3
I have a data.frame(v1, v2, y):
v1: 1 5 8 6 1 1 6 8
v2: 2 6 9 8 4 5 2 3
y: 1 1 2 2 3 3 4 4
and now I want it sorted by y like this:
y: 1 2 3 4 1 2 3 4
v1: 1 8 1 6 5 6 1 8
v2: 2 9 4 2 6 8 5 3
I tried:
sorted <- df[,,sort(df$y)]
but this does not work. Please help.
You can try a tidyverse solution
library(tidyverse)
data.frame(y, v1, v2) %>%
  group_by(y) %>%
  mutate(n = 1:n()) %>%
  arrange(n, y) %>%
  select(-n) %>%
  ungroup()
# A tibble: 8 x 3
y v1 v2
<dbl> <dbl> <dbl>
1 1 1 2
2 2 8 9
3 3 1 4
4 4 6 2
5 1 5 6
6 2 6 8
7 3 1 5
8 4 8 3
data:
v1 <- c(1, 5, 8, 6, 1, 1, 6, 8)
v2 <- c(2, 6, 9, 8, 4, 5, 2, 3)
y <- c(1, 1, 2, 2, 3, 3, 4, 4)
The idea is to add an index within each y group and then arrange by that index and y.
We can use ave from base R to create a sequence within each 'y' group and then order by it:
df[order(with(df, ave(y, y, FUN = seq_along))),]
# v1 v2 y
#1 1 2 1
#3 8 9 2
#5 1 4 3
#7 6 2 4
#2 5 6 1
#4 6 8 2
#6 1 5 3
#8 8 3 4
data
df <- data.frame(v1 = c(1, 5, 8, 6, 1, 1, 6, 8),
                 v2 = c(2, 6, 9, 8, 4, 5, 2, 3),
                 y = c(1, 1, 2, 2, 3, 3, 4, 4))
You could also take the two alternating subsets and rbind them together:
rbind(df[c(TRUE,FALSE),], df[c(FALSE,TRUE),])
The result:
v1 v2 y
1 1 2 1
3 8 9 2
5 1 4 3
7 6 2 4
2 5 6 1
4 6 8 2
6 1 5 3
8 8 3 4
You can use matrix() to reorder the indices of the rows:
df <- data.frame(v1 = c(1, 5, 8, 6, 1, 1, 6, 8),
                 v2 = c(2, 6, 9, 8, 4, 5, 2, 3),
                 y = c(1, 1, 2, 2, 3, 3, 4, 4))
df[c(matrix(1:nrow(df), ncol=2, byrow=TRUE)),]
# v1 v2 y
# 1 1 2 1
# 3 8 9 2
# 5 1 4 3
# 7 6 2 4
# 2 5 6 1
# 4 6 8 2
# 6 1 5 3
# 8 8 3 4
The solution relies on the order in which matrix elements are stored: R, like Fortran, uses column-major order, so the index of the first dimension varies fastest. In Fortran terminology, the "leading dimension" is the extent of that first dimension (for a two-dimensional array, i.e. a matrix, it is the number of rows).
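A minimal illustration of that storage order, showing why c() over the byrow matrix produces the interleaved row indices:
m <- matrix(1:8, ncol = 2, byrow = TRUE)
m
#      [,1] [,2]
# [1,]    1    2
# [2,]    3    4
# [3,]    5    6
# [4,]    7    8
c(m)  # column-major flattening: first column, then second
# [1] 1 3 5 7 2 4 6 8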