I need to create a data frame containing the frequency of each categorical variable from a previous data frame. Fortunately, these variables are all coded as numbers from 1 to 5 rather than as text.
Therefore, I could create a new data frame with a first column containing the numbers 1 to 5, and each following column counting the frequency of that number as the response for each variable in the original data frame.
For example, we have an original df defined as:
df1 <- data.frame(
Z = c(4, 1, 2, 1, 5, 4, 2, 5, 1, 5),
Y = c(5, 1, 5, 5, 2, 1, 4, 1, 3, 3),
X = c(4, 2, 2, 1, 5, 1, 5, 1, 3, 2),
W = c(2, 1, 4, 2, 3, 2, 4, 2, 1, 2),
V = c(5, 1, 3, 3, 3, 3, 2, 4, 4, 1))
I would need a second df containing the following table:
fq Z Y X W V
1 3 3 3 2 2
2 4 2 6 10 2
3 0 6 3 3 12
4 8 4 4 8 8
5 15 15 10 0 5
I saw some answers on how to do something like this using plyr, but not in a systematic way. Can someone help me out?
table(stack(df1)) * 1:5
ind
values Z Y X W V
1 3 3 3 2 2
2 4 2 6 10 2
3 0 6 3 3 12
4 8 4 4 8 8
5 15 15 10 0 5
If you need a data.frame, you could do:
as.data.frame.matrix(table(stack(df1)) * 1:5)
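Note that the requested table holds each value multiplied by its count (Z contains three 1s, giving 3, and two 2s, giving 4), which is why the table is multiplied by 1:5. A minimal sketch of the same approach, with the intermediate steps named and the fq column from the question added back; the variable names counts, weighted and res are only illustrative:

```r
df1 <- data.frame(
  Z = c(4, 1, 2, 1, 5, 4, 2, 5, 1, 5),
  Y = c(5, 1, 5, 5, 2, 1, 4, 1, 3, 3),
  X = c(4, 2, 2, 1, 5, 1, 5, 1, 3, 2),
  W = c(2, 1, 4, 2, 3, 2, 4, 2, 1, 2),
  V = c(5, 1, 3, 3, 3, 3, 2, 4, 4, 1))

# stack() puts all values in one column and the source column name in 'ind';
# table() then counts every (value, column) pair.
counts <- table(stack(df1))

# 1:5 is recycled down each column, so row k is multiplied by k.
# This assumes every value 1..5 occurs somewhere in df1, so the table has 5 rows.
weighted <- counts * 1:5

# Convert to a data frame and prepend the value column.
res <- cbind(fq = 1:5, as.data.frame.matrix(weighted))
res
```

If some value never appeared in any column, the table would have fewer than five rows and the multiplication would no longer line up, so converting the values to a factor with levels = 1:5 first may be safer.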
We may use
sapply(df1, function(x) tapply(x, factor(x, levels = 1:5), FUN = sum))
Z Y X W V
1 3 3 3 2 2
2 4 2 6 10 2
3 NA 6 3 3 12
4 8 4 4 8 8
5 15 15 10 NA 5
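The NA entries appear because tapply() returns NA for factor levels that have no observations. If zeros are wanted instead, as in the question, one possible follow-up is to replace them afterwards (a sketch, not part of the original answer):

```r
df1 <- data.frame(
  Z = c(4, 1, 2, 1, 5, 4, 2, 5, 1, 5),
  Y = c(5, 1, 5, 5, 2, 1, 4, 1, 3, 3),
  X = c(4, 2, 2, 1, 5, 1, 5, 1, 3, 2),
  W = c(2, 1, 4, 2, 3, 2, 4, 2, 1, 2),
  V = c(5, 1, 3, 3, 3, 3, 2, 4, 4, 1))

# Sum the values within each level 1..5; empty levels come back as NA.
res <- sapply(df1, function(x) tapply(x, factor(x, levels = 1:5), FUN = sum))

# Replace the NAs for empty levels with 0.
res[is.na(res)] <- 0
res
```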
Another possible solution, based on purrr::map_dfc:
library(tidyverse)
map_dfc(df1, ~ 1:5 * table(factor(.x, levels = 1:5)) %>% as.vector)
#> # A tibble: 5 × 5
#> Z Y X W V
#> <int> <int> <int> <int> <int>
#> 1 3 3 3 2 2
#> 2 4 2 6 10 2
#> 3 0 6 3 3 12
#> 4 8 4 4 8 8
#> 5 15 15 10 0 5
I have a column like that :
a = c(3, 1, 2, 3, 3, 3, 1, 3, 2, 3, 3, 1, 3, 2, 1, 3, 1)
I want to have a column that counts 1 and 2 sequentially to make a column like this:
a b
1 3 0
2 1 1
3 2 2
4 3 2
5 3 2
6 3 2
7 1 3
8 3 3
9 2 4
10 3 4
11 3 4
12 1 5
13 3 5
14 2 6
15 1 7
16 3 7
We can use cumsum on a logical vector
df1$b <- cumsum(df1$a %in% c(1, 2))
data
df1 <- data.frame(a)
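To see why this works, it may help to look at the intermediate logical vector: TRUE counts as 1 and FALSE as 0, so the running sum only increases at positions holding a 1 or a 2. A small sketch using the vector from the question (is_hit is just an illustrative name):

```r
a <- c(3, 1, 2, 3, 3, 3, 1, 3, 2, 3, 3, 1, 3, 2, 1, 3, 1)

# TRUE wherever the value is 1 or 2, FALSE otherwise.
is_hit <- a %in% c(1, 2)

# Running total of the TRUEs; unchanged wherever a is 3.
b <- cumsum(is_hit)

data.frame(a, is_hit, b)
```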
Hi, I have a data frame like this:
df= structure(list(a = c(1, 3, 4, 6, 3, 2, 5, 1), b = c(1, 3, 4,
2, 6, 7, 2, 6), c = c(6, 3, 6, 5, 3, 6, 5, 3), d = c(6, 2, 4,
5, 3, 7, 2, 6), e = c(1, 2, 4, 5, 6, 7, 6, 3), f = c(2, 3, 4,
2, 2, 7, 5, 2)), .Names = c("a", "b", "c", "d", "e", "f"), row.names = c(NA,
8L), class = "data.frame")
df$total <- apply(df, 1, sum)
df$row <- seq_len(nrow(df))
so the dataframe looks like this.
> df
a b c d e f total row
1 1 1 6 6 1 2 17 1
2 3 3 3 2 2 3 16 2
3 4 4 6 4 4 4 26 3
4 6 2 5 5 5 2 25 4
5 3 6 3 3 6 2 23 5
6 2 7 6 7 7 7 36 6
7 5 2 5 2 6 5 25 7
8 1 6 3 6 3 2 21 8
What I want to do is find, for each row, the first later row whose total is greater than the current row's total. For example, the total for row 1 is 17, and the nearest later row with a total >= 17 is row 3.
I could loop through each row, but it gets really messy. Is this possible?
Thanks in advance.
We can do this in 2 steps with dplyr. First we set grouping to rowwise, which applies the operation on each row (basically it makes it work like we were doing an apply loop through the rows), then we find all the rows where total is larger than that row's total. Then we drop those that come before the current row and pick the first (which is the next one):
library(dplyr)
df %>%
rowwise() %>%
mutate(nxt = list(which(.$total > total)),
nxt = nxt[nxt > row][1])
# A tibble: 8 × 9
# Rowwise:
a b c d e f total row nxt
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <int> <int>
1 1 1 6 6 1 2 17 1 3
2 3 3 3 2 2 3 16 2 3
3 4 4 6 4 4 4 26 3 6
4 6 2 5 5 5 2 25 4 6
5 3 6 3 3 6 2 23 5 6
6 2 7 6 7 7 7 36 6 NA
7 5 2 5 2 6 5 25 7 NA
8 1 6 3 6 3 2 21 8 NA
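For comparison, the same two steps (find all rows with a larger total, then keep the first one after the current row) could be sketched in base R. This is only an illustration on the totals from the question, using the strict > comparison from the dplyr code above:

```r
total <- c(17, 16, 26, 25, 23, 36, 25, 21)

nxt <- sapply(seq_along(total), function(i) {
  # All rows whose total is strictly greater than row i's total ...
  larger <- which(total > total[i])
  # ... restricted to rows that come after row i.
  larger <- larger[larger > i]
  # The first of those, or NA when no later row qualifies.
  if (length(larger) > 0) larger[1] else NA_integer_
})
nxt
```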
I have these two data frames a and b
I want to remove what is in a from b
example a =
X Y
1 1 3
2 2 4
3 3 5
example b =
X Y Z
1 3 5 4 --- want to remove this
2 4 6 2
3 1 3 2 --- want to remove this
4 2 3 4
5 5 3 4
6 2 4 2 --- want to remove this
7 4 3 4
8 2 4 6 --- want to remove this
9 6 9 6
10 2 0 3
So I only want to keep the rows of b whose X and Y combination does not appear in a.
The final result would be this:
X Y Z
1 4 6 2
2 2 3 4
3 5 3 4
4 4 3 4
5 6 9 6
6 2 0 3
Thanks
anti_join() from the dplyr package can be very helpful here.
library(tidyverse)
a <- tibble(X=c(1, 2, 3), Y=c(3, 4, 5))
b <- tibble(X=c(3, 4, 1, 2, 5, 2, 4, 2, 6, 2),
Y=c(5, 6, 3, 3, 3, 4, 3, 4, 9, 0),
Z=c(4, 2, 2, 4, 4, 2, 4, 6, 6, 3))
c <- b %>% anti_join(a, by=c("X", "Y"))
c
Gives
# A tibble: 6 x 3
X Y Z
<dbl> <dbl> <dbl>
1 4 6 2
2 2 3 4
3 5 3 4
4 4 3 4
5 6 9 6
6 2 0 3
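If dplyr is not available, a rough base R equivalent is to build a key from X and Y in both data frames and keep the rows of b whose key is not in a. This is a sketch under the assumption that pasting the two columns with a separator gives an unambiguous key, which holds for simple numeric columns like these:

```r
a <- data.frame(X = c(1, 2, 3), Y = c(3, 4, 5))
b <- data.frame(X = c(3, 4, 1, 2, 5, 2, 4, 2, 6, 2),
                Y = c(5, 6, 3, 3, 3, 4, 3, 4, 9, 0),
                Z = c(4, 2, 2, 4, 4, 2, 4, 6, 6, 3))

# One key string per row, e.g. "3 5" for X = 3, Y = 5.
key_a <- paste(a$X, a$Y)
key_b <- paste(b$X, b$Y)

# Keep only the rows of b whose (X, Y) key does not occur in a.
res <- b[!(key_b %in% key_a), ]
rownames(res) <- NULL
res
```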
I have a data.frame(v1,v2,y)
v1: 1 5 8 6 1 1 6 8
v2: 2 6 9 8 4 5 2 3
y: 1 1 2 2 3 3 4 4
and now I want it sorted by y like this:
y: 1 2 3 4 1 2 3 4
v1: 1 8 1 6 5 6 1 8
v2: 2 9 4 2 6 8 5 3
I tried:
sorted <- df[,,sort(df$y)]
but this does not work. Please help.
You can try a tidyverse solution
library(tidyverse)
data.frame(y, v1, v2) %>%
group_by(y) %>%
mutate(n=1:n()) %>%
arrange(n, y) %>%
select(-n) %>%
ungroup()
# A tibble: 8 x 3
y v1 v2
<dbl> <dbl> <dbl>
1 1 1 2
2 2 8 9
3 3 1 4
4 4 6 2
5 1 5 6
6 2 6 8
7 3 1 5
8 4 8 3
data:
v1 <- c(1, 5, 8, 6, 1, 1, 6, 8)
v2<- c( 2, 6, 9, 8, 4, 5, 2, 3)
y<- c(1, 1, 2, 2, 3, 3, 4, 4 )
The idea is to add an index within each y group and then arrange by that index and y.
We can use ave from base R to create a sequence by 'y' group and order on it
df[order(with(df, ave(y, y, FUN = seq_along))),]
# v1 v2 y
#1 1 2 1
#3 8 9 2
#5 1 4 3
#7 6 2 4
#2 5 6 1
#4 6 8 2
#6 1 5 3
#8 8 3 4
data
df <- data.frame(v1 = c(1, 5, 8, 6, 1, 1, 6, 8),
v2 = c(2, 6, 9, 8, 4, 5, 2, 3),
y = c(1, 1, 2, 2, 3, 3, 4, 4))
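Splitting the one-liner into steps may make it clearer what ave() contributes. In this sketch, idx and ord are illustrative names, and y is added as a second sort key so the result does not depend on the rows already being sorted by y:

```r
df <- data.frame(v1 = c(1, 5, 8, 6, 1, 1, 6, 8),
                 v2 = c(2, 6, 9, 8, 4, 5, 2, 3),
                 y  = c(1, 1, 2, 2, 3, 3, 4, 4))

# Within each y group, number the rows 1, 2, ...
idx <- ave(df$y, df$y, FUN = seq_along)

# Sort by that within-group index first, then by y.
ord <- order(idx, df$y)

df[ord, ]
```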
You could also take the two alternating subsets of rows (the logical index is recycled along the rows) and rbind them together:
rbind(df[c(TRUE,FALSE),], df[c(FALSE,TRUE),])
The result:
v1 v2 y
1 1 2 1
3 8 9 2
5 1 4 3
7 6 2 4
2 5 6 1
4 6 8 2
6 1 5 3
8 8 3 4
You can use matrix() to reorder the row indices:
df <- data.frame(v1 = c(1, 5, 8, 6, 1, 1, 6, 8),
v2 = c(2, 6, 9, 8, 4, 5, 2, 3),
y = c(1, 1, 2, 2, 3, 3, 4, 4))
df[c(matrix(1:nrow(df), ncol=2, byrow=TRUE)),]
# v1 v2 y
# 1 1 2 1
# 3 8 9 2
# 5 1 4 3
# 7 6 2 4
# 2 5 6 1
# 4 6 8 2
# 6 1 5 3
# 8 8 3 4
The solution relies on the order in which the elements of a matrix are stored: R, like Fortran, uses column-major order, so the index of the first dimension varies fastest. In Fortran, the term "leading dimension" refers to the number of values along this first dimension (for a two-dimensional array, i.e. a matrix, this is the number of rows).
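A small sketch of the storage order at work, using the eight row indices from the example (m and idx are illustrative names): the matrix is filled row by row, but c() reads it back column by column.

```r
# Fill the indices 1..8 row-wise into two columns:
#      [,1] [,2]
# [1,]    1    2
# [2,]    3    4
# [3,]    5    6
# [4,]    7    8
m <- matrix(1:8, ncol = 2, byrow = TRUE)

# c() flattens in storage (column-major) order: first column, then second.
idx <- c(m)
idx
```

The odd indices come first, followed by the even ones, which is the row order used to subset df above.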
I have two columns - a unique id column id and the day of travel day. My objective is to create a matrix of counts per id per day (and to include all days even if the count is zero)
> test
id day
1 3 3
2 4 4
3 1 4
4 2 3
5 2 5
6 2 4
7 1 1
8 5 4
9 1 1
10 3 2
11 2 2
12 4 2
13 2 4
14 2 5
15 4 5
16 3 4
17 5 3
18 3 2
19 5 5
20 3 4
21 1 3
22 2 3
23 2 5
24 5 2
25 3 2
The output should be the following, where rows represent id and columns represent day:
> output
1 2 3 4 5
1 2 0 1 1 0
2 0 1 2 2 3
3 0 3 1 2 0
4 0 1 0 1 1
5 0 1 1 1 1
I have tried the following with the reshape2 package
output <- reshape2::dcast(test, day ~ id, sum)
but it throws the following error:
Error in unique.default(x) : unique() applies only to vectors
Why does this happen and what would the right solution be in dplyr or using base R? Any tips would be appreciated.
Here is the data:
> dput(test)
structure(list(id = c(3, 4, 1, 2, 2, 2, 1, 5, 1, 3, 2, 4, 2,
2, 4, 3, 5, 3, 5, 3, 1, 2, 2, 5, 3), day = c(3, 4, 4, 3, 5, 4,
1, 4, 1, 2, 2, 2, 4, 5, 5, 4, 3, 2, 5, 4, 3, 3, 5, 2, 2)), .Names = c("id",
"day"), row.names = c(NA, -25L), class = "data.frame")
It is easier to see what's going on with character variables:
id <- c('a', 'a', 'b', 'f', 'b', 'a')
day <- c('x', 'x', 'x', 'y', 'z', 'x')
test <- data.frame(id, day)
output <- as.data.frame.matrix(table(test))
This is probably the simplest way to do it: use the table() function, then convert the result to a data.frame.
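Applied to the numeric data from the question, the same idea might look like the sketch below. Converting id and day to factors with levels 1:5 is an addition here, made on the assumption that days 1 to 5 should always appear as columns even when a day has no trips at all:

```r
test <- data.frame(
  id  = c(3, 4, 1, 2, 2, 2, 1, 5, 1, 3, 2, 4, 2,
          2, 4, 3, 5, 3, 5, 3, 1, 2, 2, 5, 3),
  day = c(3, 4, 4, 3, 5, 4, 1, 4, 1, 2, 2, 2, 4,
          5, 5, 4, 3, 2, 5, 4, 3, 3, 5, 2, 2))

# Fixing the factor levels guarantees a 5 x 5 table, with zeros for
# id/day combinations that never occur.
counts <- table(id  = factor(test$id,  levels = 1:5),
                day = factor(test$day, levels = 1:5))

# Rows are ids, columns are days.
output <- as.data.frame.matrix(counts)
output
```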
ans <- tapply(test$id, test$day,
function(x) {
y <- table(x)
z <- rep(0, 5)
z[as.numeric(names(y))] <- y
z
} )
do.call("cbind", ans)
1 2 3 4 5
[1,] 2 0 1 1 0
[2,] 0 1 2 2 3
[3,] 0 3 1 2 0
[4,] 0 1 0 1 1
[5,] 0 1 1 1 1