Finding a way to tally soccer/football passes in R

I am trying to find a way to take a sequence of passes and show how many times each player passes to another player.
So for example, if the pass sequence was: Jordan to Emma to Molly to Emily bad, that means Jordan's and Emma's passes were successful but Molly's was not.
I have an example of a few lines of data I put in R (in a 2x2 dataset):
Passes
1 jordan to karlie karlie turnover unforced
2 jlin to gray bad
3 alia to kiersten to lilly to kiersten bad
4 mandy to karlie bad
5 kelsey to mccarter to jordan to emma emma fouled
6 mandy to karlie bad
7 mandy to kiersten cleared
I am trying to come up with a way that can convert those lines into a table like this:
Players Mandy-G Jlin-G Gray-G Kiersten-G Kelsey-G Karlie-G Jordan-G Lilly-G Mccarter-G Emma-G Alia-G Mandy-B Jlin-B Gray-B Kiersten-B Kelsey-B Karlie-B Jordan-B Lilly-B Mccarter-B Emma-B Alia-B
Mandy 1 2
Jlin 1
Gray
Kiersten 1
Kelsey 1
Karlie
Jordan 1 1
Lilly 1
McCarter 1
Emma
Alia 1
*I don't know how to insert a screenshot, so the copy and paste messed up the formatting but you can still get the idea of what I want it to look like.
If Mandy passed to Gray and it was good there should be a 1 in the Mandy and Gray-G intersection. If Mandy passed to Gray and it was bad there should be a 1 in the Mandy and Gray-B intersection.
There are only numbers in that table because I did it by hand and it was only for about 10 minutes of a game. Ultimately, doing it for the full 90 minutes and for about 25 games, I'm going to need to create a way to go through the first table and have R sort and add a mark for each successful and unsuccessful pass.
# Split each sequence into words; the checks below compare whole words
dat2 <- strsplit(dat[,1], " ")
numPass <- rep(0, length(dat2))
# Count the passes in each sequence; a "bad" means the last pass failed
for (i in seq_along(dat2)) {
  temp <- sum(dat2[[i]] == "to")
  if ("bad" %in% dat2[[i]]) {
    temp <- temp - 1
  }
  numPass[i] <- temp
}
maxPass <- max(numPass) + 1
# roster and seqPass are not defined in this snippet
#for (i in 1:length(dat2)){
for (i in 5){
  keep <- dat2[[i]] %in% roster[,1]
  pls <- dat2[[i]][keep]
  #add statements to remove last name if there is a "bad"
  for (j in 1:length(pls)) {
    cols <- which(substr(names(seqPass), 1, nchar(pls[j])) == pls[j])
    seqPass[i, cols[j]] <- j
  }
}
seqPass[c(1,5),]
I have tried the above code to go through the first five lines and count the number of passes in each sequence. It adds a mark under each player's name if they were involved in the pass, but if the pass was bad they need to be removed, which it does not do.
Is there a way for R to automatically check whether the first and second names in a sequence made a good pass, add a mark in their intersection, and do the same for a bad pass, which is indicated by the word "bad" following the second name?
Any help would be much appreciated!
Thanks
Sample data
structure(list(VT = c("jordan to karlie karlie turnover unforced",
"jlin to gray bad", "alia to kiersten to lilly to kiersten bad",
"mandy to karlie bad", "kelsey to mccarter to jordan to emma emma fouled",
"mandy to karlie bad", "mandy to kiersten cleared bad")), row.names = c(NA,
7L), class = "data.frame", na.action = structure(8:19, .Names = c("8",
"9", "10", "11", "12", "13", "14", "15", "16", "17", "18", "19"
), class = "omit"))

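You could use regular expressions. The table will also be smaller and faster to build if it only includes the players who touched the ball. So something like:
# Keep everything up to the last "to name" and tag it _bad if "bad" follows, else _good
pass = sub('_$','_good',sub("(.*\\w+ to (?:\\w+(?=.*(bad))|\\w+)).*",'\\1_\\2',dat$VT,perl = TRUE))
# Tag each earlier pass _good and start a new line with the receiver as the next passer
pass1 = gsub('(to(\\s[^_ ]+(?=\\s)))','\\1_good\n\\2',pass,perl = TRUE)
# Turn "to" into a comma, read the passer/receiver pairs as CSV, and cross-tabulate
results = xtabs(V3~.,cbind(read.csv(text=gsub('to',',',pass1),header = FALSE,strip.white = TRUE),V3=1))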
results
V2
V1 emma_good gray_bad jordan_good karlie_bad karlie_good kiersten_bad kiersten_good lilly_good mccarter_good
alia 0 0 0 0 0 0 1 0 0
jlin 0 1 0 0 0 0 0 0 0
jordan 1 0 0 0 1 0 0 0 0
kelsey 0 0 0 0 0 0 0 0 1
kiersten 0 0 0 0 0 0 0 1 0
lilly 0 0 0 0 0 1 0 0 0
mandy 0 0 0 2 0 1 0 0 0
mccarter 0 0 1 0 0 0 0 0 0
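For readers who find the nested regular expressions hard to follow, here is a minimal base R sketch of the same tally. It assumes, as the regex above does, that the word "bad" anywhere in a sequence marks only the final pass as unsuccessful, and that every sequence starts with a player name. The names tally_one and edges are illustrative and not part of the original answer.

```r
passes <- c(
  "jordan to karlie karlie turnover unforced",
  "jlin to gray bad",
  "alia to kiersten to lilly to kiersten bad",
  "mandy to karlie bad",
  "kelsey to mccarter to jordan to emma emma fouled",
  "mandy to karlie bad",
  "mandy to kiersten cleared bad"
)

# Turn one sequence into passer/receiver pairs (illustrative helper)
tally_one <- function(s) {
  w <- strsplit(s, " ", fixed = TRUE)[[1]]
  to_idx <- which(w == "to")
  if (length(to_idx) == 0) return(NULL)
  from <- w[to_idx - 1]   # word before each "to" is the passer
  to <- w[to_idx + 1]     # word after each "to" is the receiver
  outcome <- rep("good", length(to_idx))
  # Assumption: "bad" only ever refers to the last pass of the sequence
  if ("bad" %in% w) outcome[length(outcome)] <- "bad"
  data.frame(from = from, to = paste(to, outcome, sep = "_"),
             stringsAsFactors = FALSE)
}

edges <- do.call(rbind, lapply(passes, tally_one))
results <- table(edges$from, edges$to)
results
```

Because each pass becomes one row of edges before tabulating, extra columns (game, minute) could be carried along if the raw data has them.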

It seems that you have done a lot of the work already, so I will just add in my two cents. It would make your table generally smaller if you didn't separate out good and bad as two tables. You could generally have one table with combinations of players like you have created, but add a column with a 1 or 0 stating if the pass was good or bad, in which case you could just have your code above but with
dat$pass <- as.numeric(grepl(".*(bad)", dat$VT))
This adds a column with 1 if the row has 'bad' in it. Imagine the complexity of a good and bad table over multiple decades and different players!
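As a quick check on the sample data, the flag can be sketched as below. Note that it marks the whole sequence rather than an individual pass. The \\b word boundaries are an added assumption, so that a player name containing the letters "bad" would not be flagged.

```r
dat <- data.frame(
  VT = c(
    "jordan to karlie karlie turnover unforced",
    "jlin to gray bad",
    "alia to kiersten to lilly to kiersten bad",
    "mandy to karlie bad",
    "kelsey to mccarter to jordan to emma emma fouled",
    "mandy to karlie bad",
    "mandy to kiersten cleared bad"
  ),
  stringsAsFactors = FALSE
)

# 1 if the sequence contains the standalone word "bad", else 0
dat$pass <- as.numeric(grepl("\\bbad\\b", dat$VT, perl = TRUE))
dat$pass
```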

Related

R create_matrix() - "ngramLength" option is not working

I want to create a document-term matrix by using the create_matrix-function.
This works so far with my example:
library(RTextTools)
library(tidyverse)
pos_tweets = rbind(
c("I love this car", "positive"),
c("This view is amazing", "positive"),
c("I feel great this morning", "positive"),
c("I am so excited about the concert", "positive"),
c("He is my best friend", "positive"))
neg_tweets = rbind(
c("I do not like this car", "negative"),
c("This view is horrible", "negative"),
c("I feel tired this morning", "negative"),
c("I am not looking forward to the concert", "negative"),
c("He is my enemy", "negative"))
tweets = rbind(pos_tweets, neg_tweets)
matrix = create_matrix(tweets[, 1], language = "english", removeStopwords = FALSE,
removeNumbers = TRUE, stemWords = FALSE, ngramLength = 1)
mat = as.matrix(matrix)
mat[, 1:5]
##Result:
Docs about amazing best car concert
I love this car 0 0 0 1 0
This view is amazing 0 1 0 0 0
I feel great this morning 0 0 0 0 0
I am so excited about the concert 1 0 0 0 1
He is my best friend 0 0 1 0 0
I do not like this car 0 0 0 1 0
This view is horrible 0 0 0 0 0
I feel tired this morning 0 0 0 0 0
I am not looking forward to the concert 0 0 0 0 1
He is my enemy 0 0 0 0 0
The function create_matrix has the option ngramLength=, with which one can determine the length of the n-grams. For example, 1 returns unigrams (single words, e.g.: "computer") and 2 returns bigrams (two adjacent words, e.g.: "my computer").
However, this option does not seem to work. No matter what number I enter, the function only gives me unigrams (ngramLength=1).
I would also like to have bigrams (ngramLength=2) as a result.
The result would look like this (heavily shortened):
Docs this car this view feel great
I love this car 1 0 0
This view is amazing 0 1 0
I feel great this morning 0 0 1
Can anyone help me and solve my problem?
I am also very open to other functions from other packages.
Many Thanks in advance!
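Since other approaches are welcome, one hedged option is to build the bigram matrix in base R, without relying on create_matrix at all. This is only a sketch: the helper name bigrams is illustrative, the tokenizer simply lowercases and splits on anything that is not a letter a-z, and it does not explain why ngramLength is ignored.

```r
docs <- c("I love this car",
          "This view is amazing",
          "I feel great this morning")

# Return the adjacent word pairs of one document (illustrative helper)
bigrams <- function(text) {
  w <- strsplit(tolower(text), "[^a-z]+")[[1]]
  w <- w[nzchar(w)]                       # drop empty tokens
  if (length(w) < 2) return(character(0))
  paste(head(w, -1), tail(w, -1))         # word i pasted with word i + 1
}

grams <- lapply(docs, bigrams)
vocab <- sort(unique(unlist(grams)))

# Count each vocabulary bigram per document: one column per document, then transpose
counts <- vapply(grams,
                 function(g) tabulate(match(g, vocab), nbins = length(vocab)),
                 integer(length(vocab)))
dtm <- t(counts)
dimnames(dtm) <- list(docs, vocab)
dtm[, c("this car", "this view", "feel great")]
```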

Learning R: foreach value in col-A, list unique values in Col-B and then associated values in Col C

I'm learning R and have not coded in a long time (and no this is not a school assignment). My intent is to run NRC_Sentiment on notes, which I can do for only the 1st instance of the Cat. I've read doc on for loop and foreach loop (which I would've used in Perl).
I have created a data frame from a csv file
Cat | Sub-Cat | Notes
Cat1 | Apple | This is a fruit called apple and I love it
Cat1 | Orange | This is a fruit called orange and I don't like it
Cat2 | Tomato | This is a Veg called tomato and I like it
Cat2 | Pepper | This a Veg called pepper and I don't like it
Cat1 | Banana | This a fruit banana and I have no opinion about it
dataIn = read.csv(...)[,1:3] # Read columns 1 to 3
df = data.frame(dataIn)
uCat = data.frame(uCatR=c(df$Cat))
uCat = unique(uCat)
I've tried a for loop, but it stops at the 1st instance of uCat, so any help is appreciated.
In Perl I could create a hash of hashes and loop through each one with foreach.
It's hard to tell exactly what you're trying to achieve, but it seems like you might be trying to run a sentiment analysis on the unique instances of Notes. You can use map from the purrr package (part of tidyverse). map will return a list, map_df will return a dataframe, and there are other map functions to return other object types.
library(tidyverse)
library(syuzhet)
map_df(unique(df$Notes), get_nrc_sentiment)
Which returns a dataframe:
anger anticipation disgust fear joy sadness surprise trust negative positive
1 0 0 0 0 1 0 0 0 0 1
2 0 0 0 0 0 0 0 0 0 0
3 0 0 0 0 0 0 0 0 0 0
4 0 0 0 0 0 0 0 0 0 0
5 0 0 0 0 0 0 0 0 0 0
dput:
structure(list(Cat = c("Cat1", "Cat1", "Cat2", "Cat2", "Cat1"
), `Sub-Cat` = c("Apple", "Orange", "Tomato", "Pepper", "Banana"
), Notes = c("This is a fruit called apple and I love it", "This is a fruit called orange and I don't like it",
"This is a Veg called tomato and I like it", "This a Veg called pepper and I don't like it",
"This a fruit banana and I have no opinion about it")), row.names = c(NA,
-5L), class = c("tbl_df", "tbl", "data.frame"))
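If the goal is instead to walk through each Cat and its Sub-Cat values, roughly what a Perl hash of hashes would give, one hedged base R sketch is to split the data frame by category and build a nested list. Here score_note is a hypothetical stand-in that just counts words; a real sentiment function such as get_nrc_sentiment would be called in its place.

```r
df <- data.frame(
  Cat = c("Cat1", "Cat1", "Cat2", "Cat2", "Cat1"),
  `Sub-Cat` = c("Apple", "Orange", "Tomato", "Pepper", "Banana"),
  Notes = c("This is a fruit called apple and I love it",
            "This is a fruit called orange and I don't like it",
            "This is a Veg called tomato and I like it",
            "This a Veg called pepper and I don't like it",
            "This a fruit banana and I have no opinion about it"),
  check.names = FALSE, stringsAsFactors = FALSE
)

# Hypothetical stand-in for a sentiment scorer: number of words in the note
score_note <- function(note) lengths(strsplit(note, "\\s+"))

# One list element per Cat; inside it, one named score per Sub-Cat
scores <- lapply(split(df, df$Cat), function(block) {
  setNames(vapply(block$Notes, score_note, integer(1), USE.NAMES = FALSE),
           block[["Sub-Cat"]])
})

scores$Cat1[["Apple"]]
```

split() visits every category, so nothing stops after the first one, and the result can be indexed as scores[[cat]][[subcat]].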

How to do this?

I have a problem and would like to ask whether there is a function or an easy way to do the operation below.
I have a data.frame like this
customer item
-------------------
smith a
smith b
smith c
johnson a
bush NA
regan d
How to create matrix like this
customer a b c d
--------------------------------------
smith 1 1 1 0
johnson 1 0 0 0
bush 0 0 0 0
regan 0 0 0 1
Is a loop obligatory? Is there an easier way to create this?
Thank you in advance!
You should use the table function. The call would look something like this. The arguments go x, then y; depending on what the full data.frame looks like, you may want to add more parameters to handle NA values and such.
table(df$customer, df$item)
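One detail worth sketching: with default settings, table() drops rows where either value is NA, so a customer whose only item is NA (bush here) disappears if customer is a plain character column. A minimal sketch, assuming the data frame from the question, is to make customer a factor first so that every customer keeps a row:

```r
df <- data.frame(
  customer = c("smith", "smith", "smith", "johnson", "bush", "regan"),
  item = c("a", "b", "c", "a", NA, "d"),
  stringsAsFactors = FALSE
)

# Fix the customer levels so customers with only NA items keep an all-zero row
df$customer <- factor(df$customer, levels = unique(df$customer))

tab <- table(df$customer, df$item)
tab
```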

r string parsing challenge

I am dealing with a column that contains strings as follows
Col1
------------------------------------------------------------------
Department of Mechanical Engineering, Department of Computer Science
Division of Advanced Machining, Center for Mining and Metallurgy
Department of Aerospace, Center for Science and Delivery
What I am trying to do is separate out the strings that start with Department, Division, or Center, up to the comma (,). The final output should look like this:
Dept_Mechanical_Eng Dept_Computer_Science Div_Adv_Machining Cntr_Mining_Metallurgy Dept_Aerospace Cntr_Science_Delivery
1 1 0 0 0 0
0 0 1 1 0 0
0 0 0 0 1 1
I have butchered the actual names just for aesthetic purpose in the expected output. Any help on parsing this string is much appreciated.
This is very similar to a question I just answered, tabulating another text example: "Count the number of times (frequency) a string occurs". Are you in the same class as that questioner?
inp <- "Department of Mechanical Engineering, Department of Computer Science
Division of Advanced Machining, Center for Mining and Metallurgy
Department of Aerospace, Center for Science and Delivery"
inp2 <- factor(scan(text=inp,what="",sep=","))
#Read 6 items
inp3 <- readLines(textConnection(inp))
as.data.frame( setNames( lapply(levels(inp2), function(ll) as.numeric(grepl(ll, inp3) ) ), trimws(levels(inp2) )) )
Department.of.Aerospace Division.of.Advanced.Machining
1 0 0
2 0 1
3 1 0
Center.for.Mining.and.Metallurgy Center.for.Science.and.Delivery
1 0 0
2 1 0
3 0 1
Department.of.Computer.Science Department.of.Mechanical.Engineering
1 1 1
2 0 0
3 0 0
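An alternative sketch that avoids pattern matching is to split each row on the commas and test membership directly. This assumes every unit is separated by a comma and that exact string equality is the intended match. The names parts, units, and ind are illustrative.

```r
x <- c("Department of Mechanical Engineering, Department of Computer Science",
       "Division of Advanced Machining, Center for Mining and Metallurgy",
       "Department of Aerospace, Center for Science and Delivery")

# Split each row on commas, dropping the whitespace after the comma
parts <- strsplit(x, ",\\s*")
units <- unique(unlist(parts))

# One row per input string, one 0/1 column per distinct unit
ind <- t(sapply(parts, function(p) as.integer(units %in% p)))
colnames(ind) <- units
ind
```

Using %in% rather than grepl means a unit whose name is a substring of another unit's name is not counted by mistake.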

How to clean and re-code check-all-that-apply responses in R survey data?

I've got survey data with some multiple-response questions like this:
HS18 Why is it difficult to get medical care in South Africa? (Select all that apply)
1 Too expensive
2 No transportation to the hospital/clinic
3 Hospital/clinic is too far away
4 Hospital/clinic staff do not speak my language
5 Hospital/clinic staff do not like foreigners
6 Wait time too long
7 Cannot take time off of work
8 None of these. I have no problem accessing medical care
where multiple responses were entered with commas and are recorded as different levels i.e.:
unique(HS18)
[1] 888 1 6 4 5 8 2 3,5 4,6 3,6 3,4 3
[13] 4,5,6 7 999 4,5 2,6 4,8 7,8 1,6 1,2,3 5,7,8 4,5,6,7 1,4
[25] 0 5,6,7 5,6 2,3 1,4,6,7 1,4,5
30 Levels: 0 1 1,2,3 1,4 1,4,5 1,4,6,7 1,6 2 2,3 2,6 3 3,4 3,5 3,6 4 4,5 4,5,6 4,5,6,7 4,6 4,8 ... 999
This is as much a data-cleaning protocol question as an R question...I'm doing the cleaning, but not the analysis, so everything needs to be transparent and user-friendly when I pass it back...and the PI doesn't use R. Basically I'd like to split the multiples into levels and re-name them while keeping them together as a single observation...not sure how to do this, or even if it's the right approach.
How do you generally deal with this issue? Is there an elegant way to process this for analysis in STATA (simple descriptives, regressions, odds ratios)?
Thanks everyone!!!
My best thought for analyzing multi-select questions like this is to convert the possible answers into indicator variables: take all of your possible answers (1 to 8 in this example) and create data columns named HS18.1, HS18.2, etc. (You can optionally include something more in the column name, but that's completely between you and the PI.)
Your sample data here looks like it includes data that is not legal: 0, 888, and 999 are not listed in the options. It's possible/likely that these include DK/NR responses, but I can't be certain. As such:
Your data cleaning should be taking care of these anomalies before this step of converting 0+ length lists into indicator variables.
My code below arbitrarily ignores this fact and you will lose data. This is obviously not "A Good Thing™" in the long run. More robust checks are warranted (and not difficult). (I've added an other column to indicate something was lost.)
The code:
ss <- '888 1 6 4 5 8 2 3,5 4,6 3,6 3,4 3 4,5,6 7 999 4,5 2,6 4,8 7,8 1,6 1,2,3 5,7,8 4,5,6,7 1,4 0 5,6,7 5,6 2,3 1,4,6,7 1,4,5'
dat <- lapply(strsplit(ss, ' '), strsplit, ',')[[1]]
lvls <- as.character(1:8)
## lvls <- sort(unique(unlist(dat))) # alternative method
ret <- structure(lapply(lvls, function(lvl) sapply(dat, function(xx) lvl %in% xx)),
.Names = paste0('HS18.', lvls),
row.names = c(NA, -length(dat)), class = 'data.frame')
ret$HS18.other <- sapply(dat, function(xx) !all(xx %in% lvls))
ret <- 1 * ret ## convert from TRUE/FALSE to 1/0
head(ret)
## HS18.1 HS18.2 HS18.3 HS18.4 HS18.5 HS18.6 HS18.7 HS18.8 HS18.other
## 1 0 0 0 0 0 0 0 0 1
## 2 1 0 0 0 0 0 0 0 0
## 3 0 0 0 0 0 1 0 0 0
## 4 0 0 0 1 0 0 0 0 0
## 5 0 0 0 0 1 0 0 0 0
## 6 0 0 0 0 0 0 0 1 0
The resulting data.frame can be cbinded (or even matrixized) to whatever other data you have.
(I use 1 and 0 instead of TRUE and FALSE because you said the PI will not be using R; this can easily be changed to a character string or something that makes more sense to them.)
