Match words in a data frame to a string in R

I have a data frame from a recall task where participants recall as many words as they can from a list they learned earlier. Here's a mock up of the data. Each row is a subject and each column (w1-w5) is a word recalled:
df <- data.frame(subject = 1:5,
w1 = c("screen", "toad", "toad", "witch", "toad"),
w2 = c("package", "tuna", "tuna", "postage", "dinosaur"),
w3 = c("tuna", "postage", "toast", "athlete", "ranch"),
w4 = c("toad", "witch", "tuna", "package", "NA"),
w5 = c("windwo", "mermaid", "NA", "NA", "NA")
)
Which produces the following data frame:
subject w1 w2 w3 w4 w5
1 1 screen package tuna toad windwo
2 2 toad tuna postage witch mermaid
3 3 toad tuna toast tuna NA
4 4 witch postage athlete package NA
5 5 toad dinosaur ranch NA NA
I want to match each word produced (columns w1 - w5) to a list of the correct words, which are:
words <- c("screen", "package", "tuna", "toad", "window",
"postage", "witch", "mermaid", "toast", "dinosaur")
I only want to award points for words that are spelled correctly and are not repeated. So for example, for the data above I'd like to end up with a data frame that looks like this:
subject nCorrect
1 1 4
2 2 5
3 3 3
4 4 3
5 5 2
Subject 1 would get four points because they misspelled one word.
Subject 2 would get five points.
Subject 3 would get three points because they repeated tuna and are missing one word.
Subject 4 would get three points because they have one incorrect word and one missing word.
Subject 5 would get two points because they have one incorrect word and two missing words.

data.frame(subject = df$subject
, nCorrect = apply(df[, -1], 1, function(x) sum(unique(x) %in% words)))
# subject nCorrect
# 1 1 4
# 2 2 5
# 3 3 3
# 4 4 3
# 5 5 2
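To see why this one-liner handles both repeats and misspellings, here is a minimal sketch that walks through the same steps for a single row. It uses subject 3's answers from the mock data; the intermediate variable names are only for illustration.

```r
words <- c("screen", "package", "tuna", "toad", "window",
           "postage", "witch", "mermaid", "toast", "dinosaur")

# Subject 3's recalled words, as stored in the mock data
# (the missing answer is the string "NA", not a real NA)
recalled <- c("toad", "tuna", "toast", "tuna", "NA")

# Step 1: drop repeats so each word can score at most once
distinct_words <- unique(recalled)

# Step 2: test each distinct word for an exact match in the answer key;
# misspellings and the "NA" placeholder do not match
is_correct <- distinct_words %in% words

# Step 3: count the matches
n_correct <- sum(is_correct)
n_correct
```

A real NA would also fail the %in% test, so the same code should work if the missing answers are stored as NA rather than "NA".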
With data.table (same result)
library(data.table)
setDT(df)
# name the count column nCorrect instead of the default V1
df[, .(nCorrect = sum(unique(unlist(.SD)) %in% words)), by = subject]

Another option is to convert the data to long format, group by subject, and use dplyr::summarise to count the distinct recalled words that appear in the list of correct words.
library(tidyverse)
words <- c("screen", "package", "tuna", "toad", "window",
"postage", "witch", "mermaid", "toast", "dinosaur")
df %>% gather(key, value, -subject) %>%
group_by(subject) %>%
summarise(nCorrect = sum(unique(value) %in% words))
# # A tibble: 5 x 2
# subject nCorrect
# <int> <int>
# 1 1 4
# 2 2 5
# 3 3 3
# 4 4 3
# 5 5 2
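In current versions of tidyr, gather() is superseded by pivot_longer(). The following is a sketch of the same long-format approach using pivot_longer(); the column names position and word are chosen here for illustration and are not part of the original answer.

```r
library(dplyr)
library(tidyr)

df <- data.frame(subject = 1:5,
                 w1 = c("screen", "toad", "toad", "witch", "toad"),
                 w2 = c("package", "tuna", "tuna", "postage", "dinosaur"),
                 w3 = c("tuna", "postage", "toast", "athlete", "ranch"),
                 w4 = c("toad", "witch", "tuna", "package", "NA"),
                 w5 = c("windwo", "mermaid", "NA", "NA", "NA"),
                 stringsAsFactors = FALSE)
words <- c("screen", "package", "tuna", "toad", "window",
           "postage", "witch", "mermaid", "toast", "dinosaur")

result <- df %>%
  # one row per subject and recalled word
  pivot_longer(cols = -subject, names_to = "position", values_to = "word") %>%
  group_by(subject) %>%
  # count distinct words that match the answer key exactly
  summarise(nCorrect = sum(unique(word) %in% words))

result
```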

Related

R - Parse text string into multiple columns & extract data values

I have a large dataset in the form shown below:
ID Scores
1  English 3, French 7, Geography 8
2  Spanish 7, Classics 4
3  Physics 5, English 5, PE 7, Art 4
I need to parse the text string from the Scores column into separate columns for each subject with the scores for each individual stored as the data values, as below:
ID English French Geography Spanish Classics Physics PE Art
1  3       7      8         -       -        -       -  -
2  -       -      -         7       4        -       -  -
3  5       -      -         -       -        5       7  4
I cannot manually predefine the columns as there are 100s in the full dataset. So far I have cleaned the data to remove inconsistent capitalisation and separated each subject-mark pairing into a distinct column as follows:
df$scores2 <- str_to_lower(df$Scores)
split <- separate(
df,
scores2,
into = paste0("Subject", 1:8),
sep = "\\,",
remove = FALSE,
convert = FALSE,
extra = "warn",
fill = "warn",
)
I have looked at multiple questions on the subject, such as Split irregular text column into multiple columns in r, but I cannot find another case where the column titles and data values are mixed in the text string. How can I generate the full set of columns required and then populate the data value?
You can first split the Scores column into one row per subject-score pair with separate_rows, then use separate to split each pair into Subject and Score columns. Finally, pivot the data from a "long" format to a "wide" format with pivot_wider.
Thanks @G. Grothendieck for improving my code :)
library(tidyverse)
df %>%
separate_rows(Scores, sep = ", ") %>%
separate(Scores, sep = " ", into = c("Subject", "Score")) %>%
pivot_wider(names_from = "Subject", values_from = "Score")
# A tibble: 3 × 9
ID English French Geography Spanish Classics Physics PE Art
<int> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 1 3 7 8 NA NA NA NA NA
2 2 NA NA NA 7 4 NA NA NA
3 3 5 NA NA NA NA 5 7 4
Using data.table
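library(data.table)
# dt is the data frame defined under "Data" below
setDT(dt)
# one row per "subject score" pair; base strsplit() avoids needing stringr
dt <- dt[, .(class_grade = unlist(strsplit(Scores, ", "))), by = ID]
# split each pair into the subject name and the grade
dt[, c("class", "grade") := tstrsplit(class_grade, " ")]
# reshape to one column per subject
dcast(dt, ID ~ class, value.var = "grade")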
Results
# ID Art Classics English French Geography PE Physics Spanish
# 1: 1 <NA> <NA> 3 7 8 <NA> <NA> <NA>
# 2: 2 <NA> 4 <NA> <NA> <NA> <NA> <NA> 7
# 3: 3 4 <NA> 5 <NA> <NA> 7 5 <NA>
Data
dt <- structure(list(ID = 1:3, Scores = c("English 3, French 7, Geography 8",
"Spanish 7, Classics 4", "Physics 5, English 5, PE 7, Art 4")), row.names = c(NA,
-3L), class = c("data.frame"))
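For comparison, here is a base R sketch of the same split, separate, and widen steps with no extra packages. It assumes each pair ends in a single score preceded by whitespace, so a multi-word subject name would stay intact; the helper names (scores_df, pairs, long, wide) are illustrative only.

```r
scores_df <- data.frame(
  ID = 1:3,
  Scores = c("English 3, French 7, Geography 8",
             "Spanish 7, Classics 4",
             "Physics 5, English 5, PE 7, Art 4"),
  stringsAsFactors = FALSE
)

# Step 1: split each string into "subject score" pairs
pairs <- strsplit(scores_df$Scores, ",\\s*")

# Step 2: one row per pair, repeating the ID as needed
long <- data.frame(
  ID = rep(scores_df$ID, lengths(pairs)),
  pair = unlist(pairs),
  stringsAsFactors = FALSE
)

# Step 3: the score is the last whitespace-separated token,
# the subject is everything before it
long$Subject <- sub("\\s+\\S+$", "", long$pair)
long$Score <- as.integer(sub("^.*\\s+", "", long$pair))

# Step 4: long to wide, one column per subject
wide <- reshape(long[, c("ID", "Subject", "Score")],
                idvar = "ID", timevar = "Subject", direction = "wide")
names(wide) <- sub("^Score\\.", "", names(wide))
wide
```

Subjects a student did not take come out as NA, and the columns appear in order of first appearance, so no column list has to be predefined.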

Check if all the elements in the Vector are available in the groups in R data frame

I have a data frame in R as follows:
df <- data.frame("location" = c("IND","IND","IND","US","US","US"), type = c("butter","milk","cheese","milk","cheese","yogurt"), quantity = c(2,3,4,5,6,7))
I have a vector as follows:
typeVector <- c("butter","milk","cheese","yogurt")
I need to check if all the 4 types mentioned in the vector are available in the data frame for each group based on the location. If any of the types are missing in a group, I need to add a row with the missing element and the corresponding location with the quantity as 0 in the data frame.
This is my expected output
dfOutput <- data.frame("location" = c("IND","IND","IND","IND","US","US","US","US"), type = c("butter","milk","cheese","yogurt","butter","milk","cheese","yogurt"), quantity = c(2,3,4,0,0,5,6,7))
How can I achieve this in R using dplyr package?
library(dplyr)
distinct(df, location) %>%
tidyr::crossing(type = typeVector) %>%
full_join(df, ., by = c("location", "type")) %>%
ungroup() %>%
mutate(quantity = coalesce(quantity, 0))
# location type quantity
# 1 IND butter 2
# 2 IND milk 3
# 3 IND cheese 4
# 4 US milk 5
# 5 US cheese 6
# 6 US yogurt 7
# 7 IND yogurt 0
# 8 US butter 0
Steps:
Create a temporary frame that is an expansion of location with your types in typeVector;
distinct(df, location) %>%
tidyr::crossing(type = typeVector)
# # A tibble: 8 x 2
# location type
# <chr> <chr>
# 1 IND butter
# 2 IND cheese
# 3 IND milk
# 4 IND yogurt
# 5 US butter
# 6 US cheese
# 7 US milk
# 8 US yogurt
Join this back onto the original data, which will produce NAs in the new rows
... %>%
full_join(df, ., by = c("location", "type"))
# location type quantity
# 1 IND butter 2
# 2 IND milk 3
# 3 IND cheese 4
# 4 US milk 5
# 5 US cheese 6
# 6 US yogurt 7
# 7 IND yogurt NA
# 8 US butter NA
Change these new fields from NA to 0 with the mutate. (Note: if you have previously-existing NA and want to keep them that way, then this process needs to be adjusted.)
I tend to ungroup all grouped processes when done. This is not necessary for this task, but if you forget it's grouped and do some future work on it, it is possible that you will get different results, or at least it will be slightly less efficient.
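The note about previously-existing NA values can be made concrete. Below is a base R sketch of one possible adjustment: it marks the original rows before joining, so that only rows added by the expansion are set to 0. The present column and the NA in the US yogurt row are hypothetical and introduced here purely for illustration.

```r
df <- data.frame(
  location = c("IND", "IND", "IND", "US", "US", "US"),
  type = c("butter", "milk", "cheese", "milk", "cheese", "yogurt"),
  quantity = c(2, 3, 4, 5, 6, NA),  # hypothetical: US yogurt is a genuine NA
  stringsAsFactors = FALSE
)
typeVector <- c("butter", "milk", "cheese", "yogurt")

# Mark rows that exist in the original data (illustrative helper column)
df$present <- TRUE

# Every location paired with every type
grid <- expand.grid(location = unique(df$location), type = typeVector,
                    stringsAsFactors = FALSE)

# Keep all grid rows; rows missing from df get NA in quantity and present
out <- merge(grid, df, by = c("location", "type"), all.x = TRUE)

# Only rows added by the expansion are set to 0
added <- is.na(out$present)
out$quantity[added] <- 0
out$present <- NULL
out
```

With this data, IND yogurt and US butter become 0, while the US yogurt row keeps its original NA.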

Numerical difference between all rows within a group in R

I have a dataframe that looks a bit like
Indices<-data.frame("Animal"=c("Cat", "Cat", "Cat", "Dog", "Dog", "Dog", "Dog", "Bird",
"Bird"), "Trend"=c(1,3,5,-3,1,2,4,2,1), "Project"=c("ABC", "ABC2",
"EDF", "ABC", "EDF", "GHI", "ABC2", "ABC", "GHI"))
I want to find out whether two or more trend estimates differ by >= 3 within each animal group. I tried using mutate and lag:
Indices %>%
group_by(Animal) %>%
mutate(Diff = Trend - lag(Trend))
But this only shows me the difference between the rows that are right after each other, and I am trying to see the difference between all of the rows within a group. It also gives me the differences but doesn't tell me if the value is >=3.
I would prefer to have the end result being a list of the animals and project names that have an absolute trend difference >=3.
Animal TrendDiff Projects
Cat 4 ABC-EDF
Dog 7 ABC-ABC2
Dog 3 ABC2-EDF
Dog 4 ABC-EDF
Dog 5 ABC-GHI
I have well over 200 different "animal" groups and over 400 rows, so I need something that doesn't require specifying each row. I am still very new to R, so please be specific with your answers. Thanks!
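One approach would be to left_join your Indices data.frame with itself, so that every pair of projects within an animal appears on one row, and then keep the pairs whose trend difference is at least 3: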
library(dplyr)
Indices %>%
left_join(Indices, by = "Animal") %>%
filter(Project.x != Project.y) %>%
mutate(TrendDiff = Trend.x - Trend.y) %>%
filter(TrendDiff >= 3)
# A tibble: 5 x 6
# Groups: Animal [2]
# Animal Trend.x Project.x Trend.y Project.y TrendDiff
# Cat 5 EDF 1 ABC 4
# Dog 1 EDF -3 ABC 4
# Dog 2 GHI -3 ABC 5
# Dog 4 ABC2 -3 ABC 7
# Dog 4 ABC2 1 EDF 3
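The self-join above returns signed differences with the two projects in separate columns. To get closer to the layout requested in the question (absolute difference plus a combined project label), here is a base R sketch that uses combn() to visit each unordered pair once per animal. The helper pair_diffs is hypothetical and written only for this example.

```r
Indices <- data.frame(
  Animal = c("Cat", "Cat", "Cat", "Dog", "Dog", "Dog", "Dog", "Bird", "Bird"),
  Trend = c(1, 3, 5, -3, 1, 2, 4, 2, 1),
  Project = c("ABC", "ABC2", "EDF", "ABC", "EDF", "GHI", "ABC2", "ABC", "GHI"),
  stringsAsFactors = FALSE
)

# All pairwise absolute trend differences within one animal's rows
pair_diffs <- function(d) {
  if (nrow(d) < 2) return(NULL)   # a single row has no pairs
  idx <- combn(nrow(d), 2)        # 2 x n_pairs matrix of row indices
  data.frame(
    Animal = d$Animal[1],
    TrendDiff = abs(d$Trend[idx[1, ]] - d$Trend[idx[2, ]]),
    Projects = paste(d$Project[idx[1, ]], d$Project[idx[2, ]], sep = "-"),
    stringsAsFactors = FALSE
  )
}

res <- do.call(rbind, lapply(split(Indices, Indices$Animal), pair_diffs))
res <- res[res$TrendDiff >= 3, ]
rownames(res) <- NULL
res
```

The project label follows the row order within each animal, so one pair appears as "EDF-ABC2" rather than "ABC2-EDF".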

How to count multiple variables in a dataset in a pipe by incorporating the grep function?

I need to count multiple variables in a dataset in one go using a pipe.
I have used the following code:
#R
NonComp_Strat <- Minor_Behaviours %>%
filter(Categories == "Non compliant with routine") %>%
group_by(Strategies) %>%
summarise(frequency= n())
However, in my data frame some cells contain multiple entries separated by a comma.
For example, it treats the behaviour entries "Disruptive" and "Disruptive, Off Task" as different values.
Both entries contain the variable I am looking for, but I don't know how to wrap the grep or grepl function into the pipe to count all the individual variables. There are over 20 of them, and writing over 20 individual grep calls sounds terrible. Any help is greatly appreciated.
thanks,
Dan
You will first have to split the comma separated values and make new rows out of them. Then you can group_by as you were doing:
library(splitstackshape)
df <- data.frame(id = c(1:4), Strategies = c("Disruptive", "Disruptive, Off Task", "Off Task", "Off Task, Interview"))
df
id Strategies
1 1 Disruptive
2 2 Disruptive, Off Task
3 3 Off Task
4 4 Off Task, Interview
df <- cSplit(df, "Strategies", ",", "long")
df
id Strategies
1: 1 Disruptive
2: 2 Disruptive
3: 2 Off Task
4: 3 Off Task
5: 4 Off Task
6: 4 Interview
In one dplyr and tidyr workflow:
df %>%
separate(Strategies, paste("Strategies", 1:5, sep = "_"), extra = "drop", sep = ",") %>%
gather(Stacked, Strategies, Strategies_1:Strategies_5) %>%
select(-Stacked) %>%
na.omit() %>%
mutate(Strategies = as.factor(trimws(Strategies))) %>%
group_by(Strategies) %>%
summarise(count = n())
Strategies count
<fct> <int>
1 Disruptive 2
2 Interview 1
3 Off Task 3
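If no extra packages are wanted, the same count can be sketched in base R: split on commas, trim whitespace, and tabulate. This assumes the sample data frame from the first answer; counts and strategies are illustrative names.

```r
df <- data.frame(
  id = 1:4,
  Strategies = c("Disruptive", "Disruptive, Off Task", "Off Task", "Off Task, Interview"),
  stringsAsFactors = FALSE
)

# Split every cell on commas, flatten, and trim surrounding whitespace
strategies <- trimws(unlist(strsplit(df$Strategies, ",")))

# Tabulate how often each individual strategy occurs
counts <- as.data.frame(table(Strategies = strategies),
                        responseName = "frequency",
                        stringsAsFactors = FALSE)
counts
```

Because the cells are split before counting, "Disruptive" and "Disruptive, Off Task" both contribute to the Disruptive total, and no per-strategy grepl call is needed.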
More generally, we could write a split function that produces data ready for reshaping.
spltCol <- function(x) {
l <- strsplit(as.character(x), ", ?")
l <- lapply(l, function(y) c(y, rep(NA, max(lengths(l)) - length(y))))
return(as.data.frame(do.call(rbind, l)))
}
Example
df1
# id x z
# 1 1 alpha, beta, gamma 0.7281856
# 2 2 alpha, beta -0.3149730
# 3 3 alpha -2.6412875
# 4 4 <NA> 0.6412990
df12 <- data.frame(append(df1[-2], spltCol(df1$x)))
# id z V1 V2 V3
# 1 1 0.7281856 alpha beta gamma
# 2 2 -0.3149730 alpha beta <NA>
# 3 3 -2.6412875 alpha <NA> <NA>
# 4 4 0.6412990 <NA> <NA> <NA>
reshape(df12, direction="long", varying=cbind("V1", "V2", "V3"), v.names=names(df1)[2])
# id z time x
# 1.1 1 0.7281856 1 alpha
# 2.1 2 -0.3149730 1 alpha
# 3.1 3 -2.6412875 1 alpha
# 4.1 4 0.6412990 1 <NA>
# 1.2 1 0.7281856 2 beta
# 2.2 2 -0.3149730 2 beta
# 3.2 3 -2.6412875 2 <NA>
# 4.2 4 0.6412990 2 <NA>
# 1.3 1 0.7281856 3 gamma
# 2.3 2 -0.3149730 3 <NA>
# 3.3 3 -2.6412875 3 <NA>
# 4.3 4 0.6412990 3 <NA>
Data
df1 <- structure(list(id = 1:4, x = structure(c(3L, 2L, 1L, NA), .Label = c("alpha",
"alpha, beta", "alpha, beta, gamma"), class = "factor"), z = c(0.72818559355044,
-0.314973049072542, -2.64128753187138, 0.641298995312115)), class = "data.frame", row.names = c(NA,
-4L))

pattern matching R

ca.df
id Category
1 Noun
2 Negative
3 Positive
4 adj
5 word
Each term is assigned to more than one category and therefore corresponds to more than one id. In terms.df, all the ids for a term are in one column.
terms.df
Terms id
Love 1 4 5 3
Hate 2 4 5
ice 1 5
The id values in terms.df correspond to the categories in ca.df. I want an output like this:
x.df
Category terms
Noun ice Love
Negative Hate
Positive Love
adj Hate Love
word ice Hate Love
How to do this?
Here's a possible data.table/splitstackshape packages solution
library(splitstackshape) ## loads `data.table` package too
terms.df <- cSplit(terms.df, "id", sep = " ", direction = "long")
setkey(terms.df, id)[ca.df, .(Category , Terms = toString(Terms)), by = .EACHI]
# id Category Terms
# 1: 1 Noun Love, ice
# 2: 2 Negative Hate
# 3: 3 Positive Love
# 4: 4 adj Love, Hate
# 5: 5 word Love, Hate, ice
Some explanations
We first split the id column on spaces into one row per id, repeating the matching Terms value on each row
Then we perform a binary left join between the two data sets on the id column
While joining, we concatenate the Terms values back together for each id, using by = .EACHI, which evaluates the expression separately for every row of ca.df being joined
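The three steps above can also be sketched in base R, which may make each stage easier to inspect. This sketch assumes the data shown in the question (with Love assigned to ids 1, 4, 5 and 3) and that id is stored as a space-separated string; long and merged are illustrative names.

```r
ca.df <- data.frame(
  id = 1:5,
  Category = c("Noun", "Negative", "Positive", "adj", "word"),
  stringsAsFactors = FALSE
)
terms.df <- data.frame(
  Terms = c("Love", "Hate", "ice"),
  id = c("1 4 5 3", "2 4 5", "1 5"),
  stringsAsFactors = FALSE
)

# Step 1: split the ids on spaces, one row per (term, id) pair
ids <- strsplit(terms.df$id, " ")
long <- data.frame(
  Terms = rep(terms.df$Terms, lengths(ids)),
  id = as.integer(unlist(ids)),
  stringsAsFactors = FALSE
)

# Step 2: join the categories onto the long table by id
merged <- merge(ca.df, long, by = "id", all.x = TRUE)

# Step 3: collapse the terms back into one string per category
x.df <- aggregate(Terms ~ id + Category, data = merged, FUN = toString)
x.df
```

Note that the formula interface of aggregate() drops rows with NA, so a category with no terms at all would not appear in x.df.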
A solution using tidyr and dplyr.
library(tidyr)
library(dplyr)
ca.df$id <- as.character(ca.df$id)
terms.df %>% separate(id,into=paste0("V",1:3),sep = " ",extra = "merge") %>%
gather(var,id,-Terms) %>%
filter(!is.na(id)) %>%
left_join(ca.df,by="id") %>%
select(-var,-id) %>%
group_by(Category) %>%
summarize(Terms=paste(Terms,collapse=" "))
Output :
Source: local data frame [4 x 2]
Category Terms
1 Negative Hate
2 Noun Love ice
3 adj Love Hate
4 word ice Love Hate
Data :
ca.df <- read.table(text =
"id Category
1 Noun
2 Negative
3 Positive
4 adj
5 word",head=TRUE,stringsAsFactors=FALSE)
terms.df <- read.table(text =
"Terms id
Love '1 4 5'
Hate '2 4 5'
ice '1 5'
",head=TRUE,stringsAsFactors=FALSE)
You can use merge to combine based on id
ca.df <- data.frame(id=1:5, Category=c("Noun", "Negative", "Positive", "adj", "word"))
terms.df <- data.frame(Terms=c(rep("Love", 3), rep("Hate", 3), rep("ice", 2)),
id = c(1,4,5,2,4,5,1,5))
x.df <- merge(ca.df, terms.df, by="id")
x.df
id Category Terms
1 1 Noun Love
2 1 Noun ice
3 2 Negative Hate
4 4 adj Love
5 4 adj Hate
6 5 word Love
7 5 word Hate
8 5 word ice
