Say I have a range of exam grades in a data frame column:
grades <- c("B", "C", "C", "C", "D", "D", "E", "F", "F")
grades.df <- data.frame(grades)
When illustrating this, the illustration would be a bit misleading, since it doesn't show 0 for a grade that everyone would expect to be there: "A":
barplot(table(grades))
How can I add "A" with 0 occurrences to this table, such that it would appear in the bar plot with zero height?
Use a factor with appropriate levels:
grades <- factor(c("B", "C", "C", "C", "D", "D", "E", "F", "F"),levels=LETTERS[1:6])
table(grades)
grades
A B C D E F
0 1 3 2 1 2
barplot(table(grades))
Related
From a single dataset I created two dataset filtering on the target variable. Now I'd like to compare all the features in the dataset using chi square. The problem is that one of the two dataset is much smaller than the other one so in some features I have some values that are not present in the second one and when I try to apply the chi square test I get this error: "all arguments must have the same length".
How can I add to the dataset with less value the missing value in order to be able to use chi square test?
Example:
I want to use chi square on a the same feature in the two dataset:
chisq.test(table(df1$var1, df2$var1))
but I get the error "all arguments must have the same length" because table(df1$var1) is:
a b c d
2 5 7 18
while table(df2$var1) is:
a b c
8 1 12
so what I would like to do is adding the value d in df2 and set it equal to 0 in order to be able to use the chi square test.
The table output of df2 can be modified if we convert to factor with levels specified
table(factor(df2$var1, levels = letters[1:4]))
a b c d
8 1 12 0
But, table with two inputs, should have the same length. For this, we may need to bind the datasets and then use table
library(dplyr)
table(bind_rows(df1, df2, .id = 'grp'))
var1
grp a b c d
1 2 5 7 18
2 8 1 12 0
Or in base R
table(data.frame(col1 = rep(1:2, c(nrow(df1), nrow(df2))),
col2 = c(df1$var1, df2$var1)))
col2
col1 a b c d
1 2 5 7 18
2 8 1 12 0
data
df1 <- structure(list(var1 = c("a", "a", "b", "b", "b", "b", "b", "c",
"c", "c", "c", "c", "c", "c", "d", "d", "d", "d", "d", "d", "d",
"d", "d", "d", "d", "d", "d", "d", "d", "d", "d", "d")), class = "data.frame",
row.names = c(NA,
-32L))
df2 <- structure(list(var1 = c("a", "a", "a", "a", "a", "a", "a",
"a",
"b", "c", "c", "c", "c", "c", "c", "c", "c", "c", "c", "c", "c"
)), class = "data.frame", row.names = c(NA, -21L))
dat <- data.frame(Comp1Letter = c("A", "B", "D", "F", "U", "A*", "B", "C"),
Comp2Letter = c("B", "C", "E", "U", "A", "C", "A*", "E"),
Comp3Letter = c("D", "A", "C", "D", "F", "D", "C", "A"))
GradeLevels <- c("A*", "A", "B", "C", "D", "E", "F", "G", "U")
I have a dataframe that looks something like the above (but with many other columns I don't want to change).
The columns I am interested in changing contains lists of letter grades, but are currently character vectors and not in the right order.
I need to convert each of these columns into factors with the correct order. I've been able to get this to work using the code below:
factordat <-
dat %>%
mutate(Comp1Letter = factor(Comp1Letter, levels = GradeLevels)) %>%
mutate(Comp2Letter = factor(Comp2Letter, levels = GradeLevels)) %>%
mutate(Comp3Letter = factor(Comp3Letter, levels = GradeLevels))
However this is super verbose and chews up a lot of space.
Looking at some other questions, I've tried to use a combination of mutate() and across(), as seen below:
factordat <-
dat %>%
mutate(across(c(Comp1Letter, Comp2Letter, Comp3Letter) , factor(levels = GradeLetters)))
However when I do this the vectors remain character vectors.
Could someone please tell me what I'm doing wrong or offer another option?
You can do across as an anonymous function like this:
dat <- data.frame(Comp1Letter = c("A", "B", "D", "F", "U", "A*", "B", "C"),
Comp2Letter = c("B", "C", "E", "U", "A", "C", "A*", "E"),
Comp3Letter = c("D", "A", "C", "D", "F", "D", "C", "A"))
GradeLevels <- c("A*", "A", "B", "C", "D", "E", "F", "G", "U")
dat %>%
tibble::as_tibble() %>%
dplyr::mutate(dplyr::across(c(Comp1Letter, Comp2Letter, Comp3Letter) , ~forcats::parse_factor(., levels = GradeLevels)))
# # A tibble: 8 × 3
# Comp1Letter Comp2Letter Comp3Letter
# <fct> <fct> <fct>
# 1 A B D
# 2 B C A
# 3 D E C
# 4 F U D
# 5 U A F
# 6 A* C D
# 7 B A* C
# 8 C E A
You were close, all that was left to be done was make the factor function anonymous. That can be done either with ~ and . in tidyverse or function(x) and x in base R.
So, if I have two lists, one being a "master list" without repeats, and the other being a subset with possible repeats, I would like to be able to check how many of each element are in the secondary subset list.
So if I have these lists:
a <- (a, b, c, d, e, f, g)
b <- (a, d, c, d, a, f, f, g, c, c)
I'd like to determine how many times each element from list a appear in list b and the frequency of each. My ideal output would be an r table that looks like:
c <- a b c d e f g
2 0 3 1 0 2 1
I've been trying to think through it with %in% and table()
You can use table and match - but first make the vectors factors so levels not present are included in the output:
a <- factor(c("a", "b", "c", "d", "e", "f", "g"))
b <- factor(c("a", "d", "c", "d", "a", "f", "f", "g", "c", "c"))
table(a[match(b, a)])
a b c d e f g
2 0 3 2 0 2 1
If for some reason you want a tidyverse solution. This method preserves the original data type in the lists.
library(tidyverse)
a <- c("a", "b", "c", "d", "e", "f", "g")
b <- c("a", "d", "c", "d", "a", "f", "f", "g", "c", "c")
tibble(letters = a, count = unlist(map(a, function(x) sum(b %in% x))))
# A tibble: 7 x 2
letters count
<chr> <int>
1 a 2
2 b 0
3 c 3
4 d 2
5 e 0
6 f 2
7 g 1
The following code:
df <- data.frame(
"letter" = c("a", "b", "c", "d", "e", "f"),
"score" = seq(1,6)
)
Results in the following dataframe:
letter score
1 a 1
2 b 2
3 c 3
4 d 4
5 e 5
6 f 6
I want to get the scores for a sequence of letters, for example the scores of c("f", "a", "d", "e"). It should result in c(6, 1, 4, 5).
What's more, I want to get the scores for c("c", "o", "f", "f", "e", "e"). Now the o is not in the letter column so it should return NA, resulting in c(3, NA, 6, 6, 5, 5).
What is the best way to achieve this? Can I use dplyr for this?
We can use match to create an index and extract the corresponding 'score' If there is no match, then by default it gives NA
df$score[match(v1, df$letter)]
#[1] 3 NA 6 6 5 5
df$score[match(v2, df$letter)]
#[1] 6 1 4 5
data
v1 <- c("c", "o", "f", "f", "e", "e")
v2 <- c("f", "a", "d", "e")
If you want to use dplyr I would use a join:
df <- data.frame(
"letter" = c("a", "b", "c", "d", "e", "f"),
"score" = seq(1:6)
)
library(dplyr)
df2 <- data.frame(letter = c("c", "o", "f", "f", "e", "e"))
left_join(df2, df, by = "letter")
letter score
1 c 3
2 o NA
3 f 6
4 f 6
5 e 5
6 e 5
Say I have
Name<- c("A", "A", "A", "A", "A", "B", "B", "B", "B", "C", "C", "C")
Cate<- c("a", "a", "b", "b", "c", "a", "a", "a", "c", "b", "b", "c")
I want to reproduce the following:
Nam fra frb frc
A 2 2 1
B 3 0 1
C 0 2 1
Where fra, frb and frc are the frequency values of a, b and c values respectively in Cate for each category (A,B,C) of Nam.
I am looking for a faster code than the one I am using (subsetting Nam in each category and then calculate the frequencies)
We can do a dcast from data.table which is very efficient and quick
library(data.table)
dcast(data.table(Name, Cate), Name ~paste0("fr", Cate))
# Name fra frb frc
#1: A 2 2 1
#2: B 3 0 1
#3: C 0 2 1
A simple base R option would be
table(Cate, Name)
data
Name <- c("A", "A", "A", "A", "A", "B", "B", "B", "B", "C", "C", "C")
Cate <- c("a", "a", "b", "b", "c", "a", "a", "a", "c", "b", "b", "c")
You can also use the xtabs() function:
xtabs(~Name + Cate)
For completeness' sake, here's a Hadleyverse solution:
library(dplyr)
library(tidyr)
data.frame(Name, Cate) %>%
count(Name, Cate) %>%
spread(key = Cate, value = n, fill = 0)