Extract list of values from column based upon other column - r

The following code:
df <- data.frame(
"letter" = c("a", "b", "c", "d", "e", "f"),
"score" = seq(1,6)
)
Results in the following dataframe:
letter score
1 a 1
2 b 2
3 c 3
4 d 4
5 e 5
6 f 6
I want to get the scores for a sequence of letters, for example the scores of c("f", "a", "d", "e"). It should result in c(6, 1, 4, 5).
What's more, I want to get the scores for c("c", "o", "f", "f", "e", "e"). Now the o is not in the letter column so it should return NA, resulting in c(3, NA, 6, 6, 5, 5).
What is the best way to achieve this? Can I use dplyr for this?

We can use match to create an index and extract the corresponding 'score' If there is no match, then by default it gives NA
df$score[match(v1, df$letter)]
#[1] 3 NA 6 6 5 5
df$score[match(v2, df$letter)]
#[1] 6 1 4 5
data
v1 <- c("c", "o", "f", "f", "e", "e")
v2 <- c("f", "a", "d", "e")

If you want to use dplyr I would use a join:
df <- data.frame(
"letter" = c("a", "b", "c", "d", "e", "f"),
"score" = seq(1:6)
)
library(dplyr)
df2 <- data.frame(letter = c("c", "o", "f", "f", "e", "e"))
left_join(df2, df, by = "letter")
letter score
1 c 3
2 o NA
3 f 6
4 f 6
5 e 5
6 e 5

Related

Add new value to table() in order to be able to use chi square test

From a single dataset I created two dataset filtering on the target variable. Now I'd like to compare all the features in the dataset using chi square. The problem is that one of the two dataset is much smaller than the other one so in some features I have some values that are not present in the second one and when I try to apply the chi square test I get this error: "all arguments must have the same length".
How can I add to the dataset with less value the missing value in order to be able to use chi square test?
Example:
I want to use chi square on a the same feature in the two dataset:
chisq.test(table(df1$var1, df2$var1))
but I get the error "all arguments must have the same length" because table(df1$var1) is:
a b c d
2 5 7 18
while table(df2$var1) is:
a b c
8 1 12
so what I would like to do is adding the value d in df2 and set it equal to 0 in order to be able to use the chi square test.
The table output of df2 can be modified if we convert to factor with levels specified
table(factor(df2$var1, levels = letters[1:4]))
a b c d
8 1 12 0
But, table with two inputs, should have the same length. For this, we may need to bind the datasets and then use table
library(dplyr)
table(bind_rows(df1, df2, .id = 'grp'))
var1
grp a b c d
1 2 5 7 18
2 8 1 12 0
Or in base R
table(data.frame(col1 = rep(1:2, c(nrow(df1), nrow(df2))),
col2 = c(df1$var1, df2$var1)))
col2
col1 a b c d
1 2 5 7 18
2 8 1 12 0
data
df1 <- structure(list(var1 = c("a", "a", "b", "b", "b", "b", "b", "c",
"c", "c", "c", "c", "c", "c", "d", "d", "d", "d", "d", "d", "d",
"d", "d", "d", "d", "d", "d", "d", "d", "d", "d", "d")), class = "data.frame",
row.names = c(NA,
-32L))
df2 <- structure(list(var1 = c("a", "a", "a", "a", "a", "a", "a",
"a",
"b", "c", "c", "c", "c", "c", "c", "c", "c", "c", "c", "c", "c"
)), class = "data.frame", row.names = c(NA, -21L))

Converting multiple columns to factors and releveling with mutate(across)

dat <- data.frame(Comp1Letter = c("A", "B", "D", "F", "U", "A*", "B", "C"),
Comp2Letter = c("B", "C", "E", "U", "A", "C", "A*", "E"),
Comp3Letter = c("D", "A", "C", "D", "F", "D", "C", "A"))
GradeLevels <- c("A*", "A", "B", "C", "D", "E", "F", "G", "U")
I have a dataframe that looks something like the above (but with many other columns I don't want to change).
The columns I am interested in changing contains lists of letter grades, but are currently character vectors and not in the right order.
I need to convert each of these columns into factors with the correct order. I've been able to get this to work using the code below:
factordat <-
dat %>%
mutate(Comp1Letter = factor(Comp1Letter, levels = GradeLevels)) %>%
mutate(Comp2Letter = factor(Comp2Letter, levels = GradeLevels)) %>%
mutate(Comp3Letter = factor(Comp3Letter, levels = GradeLevels))
However this is super verbose and chews up a lot of space.
Looking at some other questions, I've tried to use a combination of mutate() and across(), as seen below:
factordat <-
dat %>%
mutate(across(c(Comp1Letter, Comp2Letter, Comp3Letter) , factor(levels = GradeLetters)))
However when I do this the vectors remain character vectors.
Could someone please tell me what I'm doing wrong or offer another option?
You can do across as an anonymous function like this:
dat <- data.frame(Comp1Letter = c("A", "B", "D", "F", "U", "A*", "B", "C"),
Comp2Letter = c("B", "C", "E", "U", "A", "C", "A*", "E"),
Comp3Letter = c("D", "A", "C", "D", "F", "D", "C", "A"))
GradeLevels <- c("A*", "A", "B", "C", "D", "E", "F", "G", "U")
dat %>%
tibble::as_tibble() %>%
dplyr::mutate(dplyr::across(c(Comp1Letter, Comp2Letter, Comp3Letter) , ~forcats::parse_factor(., levels = GradeLevels)))
# # A tibble: 8 × 3
# Comp1Letter Comp2Letter Comp3Letter
# <fct> <fct> <fct>
# 1 A B D
# 2 B C A
# 3 D E C
# 4 F U D
# 5 U A F
# 6 A* C D
# 7 B A* C
# 8 C E A
You were close, all that was left to be done was make the factor function anonymous. That can be done either with ~ and . in tidyverse or function(x) and x in base R.

Create ID variable per chain of values

I have a dataset that looks like this:
data <- data.frame(Name1 = c("A", "B", "D", "E", "H"),
Name2 = c("B", "C", "E", "G", "I"))
I would like to add an ID column to help me trace groups of names, i.e. who references who? So with the example data, the groups would be:
Name1 Name2 GroupID
A B 1
B C 1
D E 2
E G 2
H I 3
Please note that my original data is not ordered as this example is. Thanks in advance for any help!
You can use the igraph package to make a network from your data set and determine clusters:
data <- data.frame(Name1 = c("A", "B", "D", "E", "H"),
Name2 = c("B", "C", "E", "G", "I"))
library(igraph)
graph <- graph_from_data_frame(data, directed = FALSE)
clusters <- components(graph)
#data$GroupId <- sapply(data$Name1, function(x) clusters$membership[which(names(clusters$membership) == x)])
# Simpler version
data$GroupId <- clusters$membership[data$Name1]
That gives:
> data
Name1 Name2 GroupId
1 A B 1
2 B C 1
3 D E 2
4 E G 2
5 H I 3

Find Count of Elements from One List in Another List

So, if I have two lists, one being a "master list" without repeats, and the other being a subset with possible repeats, I would like to be able to check how many of each element are in the secondary subset list.
So if I have these lists:
a <- (a, b, c, d, e, f, g)
b <- (a, d, c, d, a, f, f, g, c, c)
I'd like to determine how many times each element from list a appear in list b and the frequency of each. My ideal output would be an r table that looks like:
c <- a b c d e f g
2 0 3 1 0 2 1
I've been trying to think through it with %in% and table()
You can use table and match - but first make the vectors factors so levels not present are included in the output:
a <- factor(c("a", "b", "c", "d", "e", "f", "g"))
b <- factor(c("a", "d", "c", "d", "a", "f", "f", "g", "c", "c"))
table(a[match(b, a)])
a b c d e f g
2 0 3 2 0 2 1
If for some reason you want a tidyverse solution. This method preserves the original data type in the lists.
library(tidyverse)
a <- c("a", "b", "c", "d", "e", "f", "g")
b <- c("a", "d", "c", "d", "a", "f", "f", "g", "c", "c")
tibble(letters = a, count = unlist(map(a, function(x) sum(b %in% x))))
# A tibble: 7 x 2
letters count
<chr> <int>
1 a 2
2 b 0
3 c 3
4 d 2
5 e 0
6 f 2
7 g 1

loop for working with individual values in r

Here is my small dataset.
Indvidual <- c("A", "B", "C", "D", "E", "F", "G", "H", "I", "J")
Parent1 <- c(NA, NA, "A", "A", "C", "C", "C", "E", "A", NA)
Parent2 <- c(NA, NA, "B", "C", "D", "D", "D", NA, "D", NA)
mydf <- data.frame (Indvidual, Parent1, Parent2)
Indvidual Parent1 Parent2
1 A <NA> <NA>
2 B <NA> <NA>
3 C A B
4 D A C
5 E C D
6 F C D
7 G C D
8 H E <NA>
9 I A D
10 J <NA> <NA>
Just consider people who has two or one known parents. I need to compare and derieve score by calculating scores that their parents have.
The rules is that either one of parent (names in parent1 or parent2 column) is known (not NA), will get 1 one additional score plus score their parents have. If there are two parents known, the highest scorer will be taken into consideration.
Here is an example:
Individual "A", has both parents unknown so will get score 0
Indiviudal "C", has both parents known (i.e. A, B)
will get 0 score (maximum of their parents)
plus 1 (as it has either one of parents known)
Thus expected output from above dataframe (with explanation) is:
Indvidual Parent1 Parent2 Scores Explanation
1 A <NA> <NA> 0 0 (Max of parent Scores NA) + 0 (neither parent knwon)
2 B <NA> <NA> 0 0 (Max of parent Scores NA) + 0 (neither parent knwon)
3 C A B 1 0 (Max of parent Scores) + 1 (either parent knwon)
4 D A C 2 1 (Max of parent scores) + 1 (either parent knwon)
5 E C D 3 2 (Max of parent scores) + 1 (either parent knwon)
6 F C D 3 2 (Max of parent scores) + 1 (either parent knwon)
7 G C D 3 2 (Max of parent scores) + 1 (either parent knwon)
8 H E <NA> 4 3 (Max of parent scores) + 1 (either parent knwon)
9 I A D 3 2 (Max of parent scores) + 1 (either parent knwon)
10 J <NA> <NA> 0 0 (Max of parent scores NA) + 0 (neither parent knwon)
Explanation: As loop goes on, it takes into account on the Scores already calculated.
Max of parent scores
Edits: based on chase's question
For example:
Individual C has two parents A and B, each of which has Scores calculated as 0 and 0
(in row 1 and 2 and column Scores), means that max (c(0,0)) will be 0
Individual E has parents C and D, whose scores in Scores column is (in row 3 and 4),
1 and 2, respectively. So maximum of max(c(1,2)) will be 2.
Example using plyr and a recursive argument
library(plyr)
Indvidual <- c("A", "B", "C", "D", "E", "F", "G", "H", "I", "J")
Parent1 <- c(NA, NA, "A", "A", "C", "C", "C", "E", "A", NA)
Parent2 <- c(NA, NA, "B", "C", "D", "D", "D", NA, "D", NA)
mydf <- data.frame (Indvidual, Parent1, Parent2)
scor.fun<-function(x,mydf){
Explanation<-0
P1<-as.character(x$Parent1)
P2<-as.character(x$Parent2)
score<-as.numeric(!(is.na(P1)||is.na(P1)))
if(!(is.na(P1)||is.na(P2))){
Explanation<-max(scor.fun(subset(mydf,Indvidual==P1),mydf)[1],scor.fun(subset(mydf,Indvidual==P2),mydf)[1])
score<-score+Explanation
}else{
Explanation<-ifelse(is.na(P1),0,scor.fun(subset(mydf,Indvidual==P1),mydf)[1])
Explanation<-max(Explanation,ifelse(is.na(P2),0,scor.fun(subset(mydf,Indvidual==P2),mydf)[1]))
score<-score+Explanation
}
c(score,Explanation)
}
adply(mydf,1,scor.fun,mydf)
Probably not the best idea with the recursion on a big dataframe.
Individual <- c("A", "B", "C", "D", "E", "F", "G", "H", "I", "J")
Parent1 <- c(NA, NA, "A", "A", "C", "C", "C", "E", "A", NA)
Parent2 <- c(NA, NA, "B", "C", "D", "D", "D", NA, "D", NA)
mydf <- data.frame (Individual, Parent1, Parent2, stringsAsFactors = FALSE)
mydf$Scores <- NA
mydf$Scores[rowSums(is.na(mydf[, c("Parent1", "Parent2")])) == 2] <- 0
while(any(is.na(mydf$Scores))){
KnownScores <- mydf[!is.na(mydf$Scores), c(1, 4)]
ToCalculate <- mydf[
mydf$Parent1 %in% c(KnownScores$Individual, NA) &
mydf$Parent2 %in% c(KnownScores$Individual, NA) &
is.na(mydf$Scores),
-4]
ToCalculate$Score <- apply(
merge(
merge(
ToCalculate,
KnownScores,
by.x = "Parent1",
by.y = "Individual",
all.x = TRUE
),
KnownScores,
by.x = "Parent2",
by.y = "Individual",
all.x = TRUE
)[, 4:5],
1,
max,
na.rm = TRUE) + 1
mydf <- merge(mydf, ToCalculate[, c(1, 4)], all.x = TRUE)
mydf$Scores[!is.na(mydf$Score)] <- mydf$Score[!is.na(mydf$Score)]
mydf$Score <- NULL
}

Resources