Say I have
Name<- c("A", "A", "A", "A", "A", "B", "B", "B", "B", "C", "C", "C")
Cate<- c("a", "a", "b", "b", "c", "a", "a", "a", "c", "b", "b", "c")
I want to reproduce the following:
Name fra frb frc
A 2 2 1
B 3 0 1
C 0 2 1
where fra, frb and frc are the frequencies of the values a, b and c in Cate for each category (A, B, C) of Name.
I am looking for faster code than what I am using now (subsetting Name by category and then calculating the frequencies).
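We can use dcast from data.table, which is very efficient:
library(data.table)
# fun.aggregate = length is spelled out to avoid the "defaulting to length" message
dcast(data.table(Name, Cate), Name ~ paste0("fr", Cate), fun.aggregate = length)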
# Name fra frb frc
#1: A 2 2 1
#2: B 3 0 1
#3: C 0 2 1
A simple base R option would be (Name first, so that it forms the rows)
table(Name, Cate)
data
Name <- c("A", "A", "A", "A", "A", "B", "B", "B", "B", "C", "C", "C")
Cate <- c("a", "a", "b", "b", "c", "a", "a", "a", "c", "b", "b", "c")
You can also use the xtabs() function:
xtabs(~Name + Cate)
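If the fr-prefixed column names from the question matter, the table can be reshaped into a data frame. This is only a minimal base R sketch, assuming the Name and Cate vectors from the question; the object names tab and res are illustrative:

```r
Name <- c("A", "A", "A", "A", "A", "B", "B", "B", "B", "C", "C", "C")
Cate <- c("a", "a", "b", "b", "c", "a", "a", "a", "c", "b", "b", "c")

# Cross-tabulate with Name as rows and Cate as columns
tab <- table(Name, Cate)

# Turn the table into a plain data frame, one column per Cate value
res <- as.data.frame.matrix(tab)
names(res) <- paste0("fr", names(res))

# Move the row names into a regular Name column
res <- data.frame(Name = rownames(tab), res, row.names = NULL)
res
```

For very large inputs the data.table approach is probably still the faster one; this variant just avoids extra packages.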
For completeness' sake, here's a Hadleyverse solution:
library(dplyr)
library(tidyr)
data.frame(Name, Cate) %>%
count(Name, Cate) %>%
spread(key = Cate, value = n, fill = 0)
Imagine you have a vector x:
x <- c("C", "A", "B", "B", "A", "D", "B", "B", "A", "A", "A", "A", "A", "D", "C", "A", "C", "A", "A", "C", "A", "A", "D", "A", "D", "A", "D", "A", "A", "D", "D", "B", "B", "A", "A", "C", "A", "A", "B", "B", "B", "B", "B", "B", "B", "A", "C", "A", "C", "B")
You can make a table using:
table(x)
# x
# A B C D
# 22 14 7 7
What if you only want the table to include certain values (e.g. 'A' and 'B'), or you want it to include values that might not exist in x?
This is my attempt:
tab_specific_values <- function(vector, values) `names<-`(rowSums(outer(values, vector, `==`)), values)
For example:
tab_specific_values(vector = x, values = c('A', 'B'))
# A B
# 22 14
Or, if we specify a value that does not exist in x:
tab_specific_values(vector = x, values = c('A', 'B', 'E'))
# A B E
# 22 14 0
Is there an existing dedicated function that does this, or do you have a better approach? I suspect my function tab_specific_values might not be the best approach.
Convert to factor with certain levels, then table:
#my values
v <- c("A", "B", "E")
table(factor(x, levels = v))
# A B E
# 22 14 0
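Another option that avoids building a factor is to combine match() and tabulate(). This is only a sketch on a small made-up vector (not the x from the question), and the name counts is illustrative:

```r
# Small illustrative vector (not the 50-element x from the question)
x <- c("A", "B", "A", "C", "A", "B")
v <- c("A", "B", "E")

# match() maps each element of x to its position in v (NA if absent);
# tabulate() counts positions 1..nbins and silently ignores the NAs
counts <- setNames(tabulate(match(x, v), nbins = length(v)), v)
counts
```

The result is a named integer vector rather than a table object, which may or may not matter depending on what is done with it afterwards.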
Benchmarking:
library(microbenchmark)
microbenchmark(
a = table(x, exclude = c('A', 'B')),
b = table(factor(x, levels = c('C', 'D'))),
c = tab_specific_values(vector = x, values = c('C', 'D')),
times = 1000
)
Unit: microseconds
expr min lq mean median uq max neval
a 116.401 131.6505 177.20030 145.201 236.8010 604.701 1000
b 49.302 60.0010 92.33422 66.501 109.4510 10974.101 1000
c 13.301 20.1005 29.09018 24.201 36.3015 134.901 1000
When x is 1,000,000 long:
Unit: milliseconds
expr min lq mean median uq max neval
a 119.3651 131.24110 142.63383 137.50385 144.07945 233.1265 100
b 43.9441 48.18640 58.24316 54.75485 59.12390 129.5087 100
c 48.9598 55.33825 67.03932 62.64145 65.93755 152.9490 100
From a single dataset I created two datasets by filtering on the target variable. Now I'd like to compare all the features between them using a chi-square test. The problem is that one of the two datasets is much smaller than the other, so for some features there are values that do not appear in the smaller one, and when I try to apply the chi-square test I get this error: "all arguments must have the same length".
How can I add the missing values to the smaller dataset so that I can use the chi-square test?
Example:
I want to use the chi-square test on the same feature in the two datasets:
chisq.test(table(df1$var1, df2$var1))
but I get the error "all arguments must have the same length" because table(df1$var1) is:
a b c d
2 5 7 18
while table(df2$var1) is:
a b c
8 1 12
so what I would like to do is add the value d to df2 with a count of 0, so that I can use the chi-square test.
The table output for df2 can be modified if we convert to a factor with the levels specified
table(factor(df2$var1, levels = letters[1:4]))
a b c d
8 1 12 0
But table() with two inputs requires them to have the same length. For this, we can bind the datasets and then use table()
library(dplyr)
table(bind_rows(df1, df2, .id = 'grp'))
var1
grp a b c d
1 2 5 7 18
2 8 1 12 0
Or in base R
table(data.frame(col1 = rep(1:2, c(nrow(df1), nrow(df2))),
col2 = c(df1$var1, df2$var1)))
col2
col1 a b c d
1 2 5 7 18
2 8 1 12 0
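To connect this back to the test itself, here is a hedged sketch that builds the 2 x 4 table from shared factor levels and passes it to chisq.test(). The data are rebuilt from the counts shown in the question, and the names lev, tab and res are illustrative:

```r
# Rebuild the example data from the counts shown in the question
df1 <- data.frame(var1 = rep(c("a", "b", "c", "d"), c(2, 5, 7, 18)))
df2 <- data.frame(var1 = rep(c("a", "b", "c"), c(8, 1, 12)))

# Use the union of observed values as the common set of levels
lev <- sort(union(df1$var1, df2$var1))

# One row of counts per dataset; values missing from a dataset get 0
tab <- rbind(df1 = table(factor(df1$var1, levels = lev)),
             df2 = table(factor(df2$var1, levels = lev)))
tab

# Some expected counts are small here, so chisq.test() warns that the
# approximation may be poor; the warning is suppressed for this sketch
res <- suppressWarnings(chisq.test(tab))
res$parameter  # degrees of freedom
```

With sparse cells like these, chisq.test(tab, simulate.p.value = TRUE) or fisher.test(tab) may be worth considering instead.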
data
df1 <- structure(list(var1 = c("a", "a", "b", "b", "b", "b", "b", "c",
"c", "c", "c", "c", "c", "c", "d", "d", "d", "d", "d", "d", "d",
"d", "d", "d", "d", "d", "d", "d", "d", "d", "d", "d")), class = "data.frame",
row.names = c(NA,
-32L))
df2 <- structure(list(var1 = c("a", "a", "a", "a", "a", "a", "a",
"a",
"b", "c", "c", "c", "c", "c", "c", "c", "c", "c", "c", "c", "c"
)), class = "data.frame", row.names = c(NA, -21L))
dat <- data.frame(Comp1Letter = c("A", "B", "D", "F", "U", "A*", "B", "C"),
Comp2Letter = c("B", "C", "E", "U", "A", "C", "A*", "E"),
Comp3Letter = c("D", "A", "C", "D", "F", "D", "C", "A"))
GradeLevels <- c("A*", "A", "B", "C", "D", "E", "F", "G", "U")
I have a data frame that looks something like the above (but with many other columns I don't want to change).
The columns I am interested in changing contain letter grades, but they are currently character vectors and not in the right order.
I need to convert each of these columns into a factor with the correct level order. I've been able to get this to work using the code below:
factordat <-
dat %>%
mutate(Comp1Letter = factor(Comp1Letter, levels = GradeLevels)) %>%
mutate(Comp2Letter = factor(Comp2Letter, levels = GradeLevels)) %>%
mutate(Comp3Letter = factor(Comp3Letter, levels = GradeLevels))
However, this is very verbose and takes up a lot of space.
Looking at some other questions, I've tried to use a combination of mutate() and across(), as seen below:
factordat <-
dat %>%
mutate(across(c(Comp1Letter, Comp2Letter, Comp3Letter) , factor(levels = GradeLetters)))
However, when I do this the vectors remain character vectors.
Could someone please tell me what I'm doing wrong or offer another option?
You can pass factor() to across() as an anonymous function, like this:
dat <- data.frame(Comp1Letter = c("A", "B", "D", "F", "U", "A*", "B", "C"),
Comp2Letter = c("B", "C", "E", "U", "A", "C", "A*", "E"),
Comp3Letter = c("D", "A", "C", "D", "F", "D", "C", "A"))
GradeLevels <- c("A*", "A", "B", "C", "D", "E", "F", "G", "U")
dat %>%
tibble::as_tibble() %>%
dplyr::mutate(dplyr::across(c(Comp1Letter, Comp2Letter, Comp3Letter), ~ factor(.x, levels = GradeLevels)))
# # A tibble: 8 × 3
# Comp1Letter Comp2Letter Comp3Letter
# <fct> <fct> <fct>
# 1 A B D
# 2 B C A
# 3 D E C
# 4 F U D
# 5 U A F
# 6 A* C D
# 7 B A* C
# 8 C E A
You were close: all that was left was to wrap factor() in an anonymous function so that across() can call it on each column. That can be done either with ~ and .x in the tidyverse or with function(x) and x in base R.
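For comparison, the same conversion can be sketched without dplyr by passing the extra levels argument through lapply(). The vector cols is an illustrative name, and the data are the ones from the question:

```r
dat <- data.frame(Comp1Letter = c("A", "B", "D", "F", "U", "A*", "B", "C"),
                  Comp2Letter = c("B", "C", "E", "U", "A", "C", "A*", "E"),
                  Comp3Letter = c("D", "A", "C", "D", "F", "D", "C", "A"))
GradeLevels <- c("A*", "A", "B", "C", "D", "E", "F", "G", "U")

# Columns to convert; any other columns in dat are left untouched
cols <- c("Comp1Letter", "Comp2Letter", "Comp3Letter")

# lapply() calls factor(column, levels = GradeLevels) on each selected column
dat[cols] <- lapply(dat[cols], factor, levels = GradeLevels)
str(dat)
```

Here the function is passed by name and its extra argument is forwarded, which is the step the original across() call was missing.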
I have a data table like this:
a group
1: 1 a
2: 2 a
3: 3 a
4: 4 a
5: 5 a
6: 6 a
The sample can be created from the code below:
structure(list(a = 1:100, group = c("a", "a", "a", "a", "a",
"a", "a", "a", "a", "a", "a", "a", "a", "a", "a", "a", "a", "a",
"a", "a", "a", "a", "a", "a", "a", "a", "a", "a", "a", "a", "a",
"a", "a", "a", "a", "a", "a", "a", "a", "a", "a", "a", "a", "a",
"a", "a", "a", "a", "a", "a", "b", "b", "b", "b", "b", "b", "b",
"b", "b", "b", "b", "b", "b", "b", "b", "b", "b", "b", "b", "b",
"b", "b", "b", "b", "b", "b", "b", "b", "b", "b", "b", "b", "b",
"b", "b", "b", "b", "b", "b", "b", "b", "b", "b", "b", "b", "b",
"b", "b", "b", "b")), .Names = c("a", "group"), row.names = c(NA,
-100L), class = c("data.table", "data.frame"))
For each row in each group I would like to:
1. take the value in column a
2. divide it by the value in column a lagged by 2 and subtract 1
3. divide it by the value in column a lagged by 4 and subtract 1
4. divide it by the value in column a lagged by 6 and subtract 1
5. sum the results of steps 2-4 and return the sum in a new column
So for rows 1-6, I would have NA, and then 7/5 + 7/3 + 7/1 - 3, 8/6 + 8/4 + 8/2 - 3, 9/7 + 9/5 + 9/3 - 3, 10/8 + 10/6 + 10/4 - 3
So, based on the table shown in the first chunk, I would like to get a new column, say metric_1, which would have the value 2.416667 in the 10th row.
Please note that in practice the values in column a will not correspond to row numbers; they will be measurements.
The final output would then look like this:
a group metric_1
1: 1 a NA
2: 2 a NA
3: 3 a NA
4: 4 a NA
5: 5 a NA
6: 6 a NA
7: 7 a 7.733333
8: 8 a 4.333333
9: 9 a 3.085714
10: 10 a 2.416667
I have already tried some versions with Reduce, which works very well when I need to sum values in a vector, but I haven't been able to adapt it to do the division like this.
I'm not sure if this is exactly what you're looking for but perhaps it will help:
library(dplyr)
the_data %>% group_by(group) %>%
mutate(metric_1 = (a/lag(a, 2)-1)+( a/lag(a,4)-1) + (a/lag(a, 6) - 1 )) %>%
ungroup()
I found one possible solution:
dt[,
list(a, Reduce(`+`, lapply(shift(a, seq(2, 6, by = 2)),
function(x) a/x - 1))),
by = "group"]
But it is rather slow.
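The same calculation can also be sketched in base R. The helpers lag_vec() and metric() are hypothetical names introduced here, and the data frame is a plain stand-in for the data.table in the question:

```r
# Stand-in for the question's data: a = 1..100, two groups of 50 rows
dt <- data.frame(a = as.numeric(1:100), group = rep(c("a", "b"), each = 50))

# Shift x down by n positions, padding the start with NA
lag_vec <- function(x, n) c(rep(NA, n), x)[seq_along(x)]

# Sum of (a / lagged a - 1) over the requested lags
metric <- function(a, lags = c(2, 4, 6)) {
  Reduce(`+`, lapply(lags, function(k) a / lag_vec(a, k) - 1))
}

# ave() applies metric() within each group and keeps the row order
dt$metric_1 <- ave(dt$a, dt$group, FUN = metric)
head(dt, 10)
```

No claim is made about speed relative to the data.table version; the point is only to show the lag-and-sum logic in isolation.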
I need to select the 3 top-selling products in each category, but if a category does not have 3 products I should add more products from the best available category ("a" being the best category, "c" the worst).
The products change every day, so I would like to do this automatically. Previously I chose the top 3 products in each category and did not bother when fewer were available, but unfortunately the conditions have changed. For that I used the following code:
Selected <- items %>% group_by(Cat) %>% dplyr::filter(row_number() <= 3) %>% ungroup()
Sample data:
items <- data.frame(Cat = c("a", "a", "a", "b", "b", "c", "c", "c", "c", "c", "c", "c", "c", "c", "c"),
ranking = 1:15)
Desired results:
"a", "a", "a", "b", "b", "c", "c", "c", "c"
Sample data - 2:
items <- data.frame(Cat = c("a", "a", "a", "a", "b", "c", "c", "c", "c", "c", "c", "c", "c", "c", "c"),
ranking = 1:15)
Desired results - 2:
"a", "a", "a", "a", "b", "c", "c", "c", "c"
Here is a possible answer. I'm not entirely sure I understand what you are after, so let me know if this is not it.
items <- data.frame(Cat = c("a", "a", "a",
"b", "b",
"c", "c", "c", "c", "c", "c", "c", "c", "c", "c"),
ranking = 1:15)
First we order the data from best to worst category and add a running row number within each category.
Selected <- items %>% group_by(Cat) %>%
mutate(id = row_number()) %>%
ungroup() %>% arrange(Cat)
Then we can filter the top 3 per category and fill up with the remaining rows, from best to worst category
Selected %>% filter(id<=3) %>% # Select top 3 in each group
bind_rows(Selected %>% filter(id>3)) %>% # Merge with the ones that weren't selected
mutate(id=row_number()) %>%
filter(id <= 3*length(unique(Cat))) # Extract the right number
This produces
# A tibble: 9 x 3
Cat ranking id
<fctr> <int> <int>
1 a 1 1
2 a 2 2
3 a 3 3
4 b 4 4
5 b 5 5
6 c 6 6
7 c 7 7
8 c 8 8
9 c 9 9
The second data example yields
# A tibble: 9 x 3
Cat ranking id
<fctr> <int> <int>
1 a 1 1
2 a 2 2
3 a 3 3
4 b 5 4
5 c 6 5
6 c 7 6
7 c 8 7
8 a 4 8
9 c 9 9
which seems to be what you were after.
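The same fill-up logic can be sketched in base R. This assumes that the target size is 3 times the number of categories, as in the answer above; the names pos, top, rest, need and selected are illustrative, and the data are the second sample from the question:

```r
items <- data.frame(Cat = c("a", "a", "a", "a", "b",
                            "c", "c", "c", "c", "c", "c", "c", "c", "c", "c"),
                    ranking = 1:15, stringsAsFactors = FALSE)

# Best category first, best ranking first within each category
items <- items[order(items$Cat, items$ranking), ]

# Position of each row within its category
pos <- ave(items$ranking, items$Cat, FUN = seq_along)

top  <- items[pos <= 3, ]   # up to 3 products per category
rest <- items[pos > 3, ]    # leftovers, still ordered best to worst

# Fill the shortfall from the leftovers, best category first
need <- 3 * length(unique(items$Cat)) - nrow(top)
selected <- rbind(top, head(rest, need))
sort(selected$Cat)
```

For the second sample this picks the fourth "a" product and one extra "c" product to reach nine rows.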