How to do this? - r

I have a problem and would like to ask whether there is a function or an easy way to do the operation below.
I have a data.frame like this:
customer  item
-------------------
smith     a
smith     b
smith     c
johnson   a
bush      NA
regan     d
How can I create a matrix like this?
customer  a  b  c  d
--------------------------------------
smith     1  1  1  0
johnson   1  0  0  0
bush      0  0  0  0
regan     0  0  0  1
Is a loop obligatory? Is there an easier way to create this?
Thank you in advance!

You should use the table function. The call would look something like this. The arguments go x, y (rows, then columns), but depending on what the full data.frame looks like you may want to add some more parameters to handle NA values and such.
table(df$customer, df$item)
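For what it's worth, here is a minimal sketch of that call on the data from the question. The name df and the way the data.frame is typed in are assumptions, since the question only shows the printed data.

```r
# Assumed reconstruction of the data.frame shown in the question
df <- data.frame(
  customer = c("smith", "smith", "smith", "johnson", "bush", "regan"),
  item     = c("a", "b", "c", "a", NA, "d"),
  stringsAsFactors = FALSE
)

# Cross-tabulate customers (rows) against items (columns).
# With the default useNA = "no", the NA item is not counted,
# but bush still gets a row because he appears in the customer vector.
tab <- table(df$customer, df$item)

# Plain integer matrix with the same row and column names
m <- unclass(tab)
print(tab)
```

Rows come out in alphabetical order rather than in order of appearance. If a data.frame is needed instead of a table, as.data.frame.matrix(tab) should do it.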

Related

Finding a way to tally soccer/football passes in R

I am trying to find a way to take a sequence of passes and show how many times each player passes to another player.
So for example, if the pass sequence was: Jordan to Emma to Molly to Emily bad, that means Jordan's and Emma's passes were successful but Molly's was not.
I have an example of a few lines of data I put in R (in a 2x2 dataset):
Passes
1 jordan to karlie karlie turnover unforced
2 jlin to gray bad
3 alia to kiersten to lilly to kiersten bad
4 mandy to karlie bad
5 kelsey to mccarter to jordan to emma emma fouled
6 mandy to karlie bad
7 mandy to kiersten cleared
I am trying to come up with a way that can convert those lines into a table like this:
Players Mandy-G Jlin-G Gray-G Kiersten-G Kelsey-G Karlie-G Jordan-G Lilly-G Mccarter-G Emma-G Alia-G Mandy-B Jlin-B Gray-B Kiersten-B Kelsey-B Karlie-B Jordan-B Lilly-B Mccarter-B Emma-B Alia-B
Mandy 1 2
Jlin 1
Gray
Kiersten 1
Kelsey 1
Karlie
Jordan 1 1
Lilly 1
McCarter 1
Emma
Alia 1
*I don't know how to insert a screenshot, so the copy and paste messed up the formatting but you can still get the idea of what I want it to look like.
If Mandy passed to Gray and it was good there should be a 1 in the Mandy and Gray-G intersection. If Mandy passed to Gray and it was bad there should be a 1 in the Mandy and Gray-B intersection.
There are only numbers in that table because I did it by hand and it was only for about 10 minutes of a game. Ultimately, doing it for the full 90 minutes and for about 25 games, I'm going to need to create a way to go through the first table and have R sort and add a mark for each successful and unsuccessful pass.
dat3 <- strsplit(dat[,1], "to")
# NOTE: dat2, roster and seqPass are used below but are not defined in this snippet
numPass <- rep(0, length(dat3))
for (i in 1:length(dat2)) {
  temp <- sum(dat2[[i]] == "to")
  if ("bad" %in% dat2[[i]]) {
    temp <- temp-1
  }
  numPass[i] <- temp
}
maxPass <- max(numPass)+1
#for (i in 1:length(dat2)){
for (i in 5){
  keep<-dat2[[i]]%in%roster[,1]
  pls<-dat2[[i]][keep]
  #add statements to remove last name if there is a "bad"
  for (j in 1:length(pls)) {
    cols<-which(substr(names(seqPass),1,nchar(pls[j]))==pls[j])
    seqPass[i,cols[j]]<-j
  }
}
seqPass[c(1,5),]
I have tried the above code to go through the first five lines and count the number of passes in each sequence. It adds a mark under each player's name if they were involved in the pass, but if the pass was bad they need to be removed, which it does not do.
Is there a way for R to automatically check whether the first and second names in the sequence made a good pass and add a mark in their intersection, and to do the same for a bad pass, which is indicated by the word "bad" following the second name?
Any help would be much appreciated!
Thanks
Sample data
structure(list(VT = c("jordan to karlie karlie turnover unforced",
"jlin to gray bad", "alia to kiersten to lilly to kiersten bad",
"mandy to karlie bad", "kelsey to mccarter to jordan to emma emma fouled",
"mandy to karlie bad", "mandy to kiersten cleared bad")), row.names = c(NA,
7L), class = "data.frame", na.action = structure(8:19, .Names = c("8",
"9", "10", "11", "12", "13", "14", "15", "16", "17", "18", "19"
), class = "omit"))
You could use regular expressions. It will also be faster if you only keep the data for the players who touched the ball. So something like:
pass = sub('_$','_good',sub("(.*\\w+ to (?:\\w+(?=.*(bad))|\\w+)).*",'\\1_\\2',dat$VT,perl = T))
pass1 = gsub('(to(\\s[^_ ]+(?=\\s)))','\\1_good\n\\2',pass,perl=T)
results = xtabs(V3~.,cbind(read.csv(text=gsub('to',',',pass1),h=F,strip.white = T),V3=1))
results
         V2
V1       emma_good gray_bad jordan_good karlie_bad karlie_good kiersten_bad kiersten_good lilly_good mccarter_good
alia             0        0           0          0           0            0             1          0             0
jlin             0        1           0          0           0            0             0          0             0
jordan           1        0           0          0           1            0             0          0             0
kelsey           0        0           0          0           0            0             0          0             1
kiersten         0        0           0          0           0            0             0          1             0
lilly            0        0           0          0           0            1             0          0             0
mandy            0        0           0          2           0            1             0          0             0
mccarter         0        0           1          0           0            0             0          0             0
It seems that you have done a lot of the work already, so I will just add my two cents. Your table would generally be smaller if you didn't separate good and bad passes into two sets of columns. You could have one table with the combinations of players, as you have created, and add a column with a 1 or 0 stating whether the pass was bad or not. In that case you could keep your code above and add:
dat$pass <- as.numeric(grepl(".*(bad)", dat$VT))
This adds a column with 1 if the row has 'bad' in it. Imagine the complexity of a good and bad table over multiple decades and different players!
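To make the one-table idea a bit more concrete, here is a rough sketch that turns each sequence into one row per pass, with a good/bad flag, and then tallies it. The helper parse_sequence and the names passes, long and tally are made up for illustration. It assumes every sequence is separated by " to " (with spaces) and that the word "bad" anywhere in the line means only the last pass failed.

```r
# Sample sequences, copied from the dput() in the question
passes <- c("jordan to karlie karlie turnover unforced",
            "jlin to gray bad",
            "alia to kiersten to lilly to kiersten bad",
            "mandy to karlie bad",
            "kelsey to mccarter to jordan to emma emma fouled",
            "mandy to karlie bad",
            "mandy to kiersten cleared bad")

# Hypothetical helper: one row per pass (from, to, good)
parse_sequence <- function(line) {
  # Split on " to " with spaces so words like "turnover" are left alone
  parts   <- strsplit(line, " to ", fixed = TRUE)[[1]]
  # The player is the first word of each piece; trailing notes are dropped
  players <- sub(" .*$", "", parts)
  n <- length(players)
  if (n < 2) return(NULL)
  good <- rep(TRUE, n - 1)
  # Assumption: "bad" marks only the final pass of the sequence as failed
  if (grepl("\\bbad\\b", line, perl = TRUE)) good[n - 1] <- FALSE
  data.frame(from = players[-n], to = players[-1], good = good,
             stringsAsFactors = FALSE)
}

long <- do.call(rbind, lapply(passes, parse_sequence))

# Passer in rows, receiver plus outcome in columns
tally <- table(long$from,
               paste(long$to, ifelse(long$good, "good", "bad"), sep = "_"))
print(tally)
```

The long table is the thing to keep around: for 25 games you can rbind the rows from every game, add a game column, and only cross-tabulate at the end.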

R: How do you run a function to get multiple columns?

So my data looks like this
id first middle last Age
1 Carol Jenny Smith 15
2 Sarah Carol Roberts 20
3 Josh David Richardson 22
I have a function that creates a new column which gives you how many times the name was found for each row in previous columns that I specified (2nd-4th columns or 'first':'last' columns). I have a function that outputs the result below,
funname <- function(df, cols, value, newcolumn) {
  # flag rows where any of the chosen columns equals `value`
  df[[newcolumn]] <- as.integer(rowSums(df[cols] == value) > 0)
  df
}
id first middle last Age Carol
1 Carol Jenny Smith 15 1
2 Sarah Carol Roberts 20 1
3 Josh David Richardson 22 0
But my real data is more complicated and I want to create at least 20 new, different columns (ex: Carol, Robert, Jenny, Anna, Richard, Daniel, Eric...)
So how can I incorporate multiple new columns into the existing function?
I can only think of adding function(df, cols, value, newcolumn1, newcolumn2, newcolumn3, ...), but this would be impossible if I want a hundred columns later. Any help? Thank you in advance! :)
EDIT:
function(df, cols, value, newcol) {
  df[[newcol]] <- as.integer(rowSums(df[cols] == value) > 0)
  df
}
I read the comments below, but let me change my question:
How would I map this function so that I can generate multiple columns with the new names that I want to assign?
I think this is just one giant table operation if you get your data converted to two long vectors, one representing row number and the other the value:
tab <- as.data.frame.matrix(table(row(dat[2:4]), unlist(dat[2:4])))
cbind(dat, tab)
# id first middle last Age Carol David Jenny Josh Richardson Roberts Sarah Smith
#1 1 Carol Jenny Smith 15 1 0 1 0 0 0 0 1
#2 2 Sarah Carol Roberts 20 1 0 0 0 0 1 1 0
#3 3 Josh David Richardson 22 0 1 0 1 1 0 0 0
This method would also allow you to map the new output columns to variations of the names if required:
tab <- as.data.frame.matrix(table(row(dat[2:4]), unlist(dat[2:4])))
dat[paste0(colnames(tab),"_n")] <- tab
dat
# id first middle last Age Carol_n David_n Jenny_n Josh_n Richardson_n Roberts_n Sarah_n Smith_n
#1 1 Carol Jenny Smith 15 1 0 1 0 0 0 0 1
#2 2 Sarah Carol Roberts 20 1 0 0 0 0 1 1 0
#3 3 Josh David Richardson 22 0 1 0 1 1 0 0 0
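If you would rather keep a per-name function and only flag a chosen set of names, one possible sketch is to fold the function over a vector of names with Reduce. The helper add_flag and the vector wanted are made-up names, and dat is typed in from the printed data in the question.

```r
# Assumed reconstruction of the example data
dat <- data.frame(
  id     = 1:3,
  first  = c("Carol", "Sarah", "Josh"),
  middle = c("Jenny", "Carol", "David"),
  last   = c("Smith", "Roberts", "Richardson"),
  Age    = c(15, 20, 22),
  stringsAsFactors = FALSE
)

# Hypothetical helper: add one 0/1 column named after the value searched for
add_flag <- function(df, cols, value, newcol = value) {
  df[[newcol]] <- as.integer(rowSums(df[cols] == value) > 0)
  df
}

# Names to flag; this can be as long as needed
wanted <- c("Carol", "Jenny", "Robert")

# Apply add_flag once per name, carrying the growing data.frame along
flagged <- Reduce(
  function(d, nm) add_flag(d, c("first", "middle", "last"), nm),
  wanted,
  dat
)
print(flagged)
```

Matching is exact, so "Robert" does not match "Roberts" and that column is all zeros here.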

counting occurrence of strings across multiple columns in R [duplicate]

I have a dataset in R which looks like the following (only relevant columns shown). It has sex-disaggregated data on which crops respondents wanted more information about and how much of a priority each crop is for them.
sex wantcropinfo1 priority1 wantcropinfo2 priority2
m wheat high eggplant medium
m rice low cabbage high
m rice high
f eggplant medium
f cotton low
...
I want to be able to (a) count the total occurrences of each crop across all the wantcropinfoX columns; and (b) get the same count but sort them by priority; and (c) do the same thing but disaggregated by sex.
(a) output should look like this:
crop count
wheat 1
eggplant 2
rice 2
...
(b) output should look like this:
crop countm countf
wheat 1 0
eggplant 1 1
rice 2 0
...
(c) should look like this:
crop high_m med_m low_m high_f med_f low_f
wheat 1 0 0 0 0 0
eggplant 0 1 0 0 1 0
rice 1 0 1 0 0 0
...
I'm a bit of an R newbie and the manuals are slightly bewildering. I've googled a lot but couldn't find anything that was quite like this even though it seems like a fairly common thing one might want to do. Similar questions on stackoverflow seemed to be asking something a bit different.
We can use melt from data.table to convert from 'wide' to 'long' format. It can take multiple measure columns.
library(data.table)
dM <- melt(setDT(df1), measure = patterns("^want", "priority"),
value.name = c("crop", "priority"))[crop!='']
In the 'long' format, we get the 3 expected results either by grouping by 'crop' and getting the number of rows, or by converting to 'wide' with dcast, specifying the fun.aggregate as length.
dM[,.(count= .N) , crop]
# crop count
#1: wheat 1
#2: rice 2
#3: eggplant 2
#4: cotton 1
#5: cabbage 1
dcast(dM, crop~sex, value.var='sex', length)
# crop f m
#1: cabbage 0 1
#2: cotton 1 0
#3: eggplant 1 1
#4: rice 0 2
#5: wheat 0 1
dcast(dM, crop~priority+sex, value.var='priority', length)
# crop high_m low_f low_m medium_f medium_m
#1: cabbage 1 0 0 0 0
#2: cotton 0 1 0 0 0
#3: eggplant 0 0 0 1 1
#4: rice 1 0 1 0 0
#5: wheat 1 0 0 0 0
Use ddply function in the plyr package.
The structure of how you use this function is the following:
ddply(dataframe,.(var1,var2,...), summarize, function)
In this case you might want to do the following (shown here for the first pair of columns, wantcropinfo1 and priority1):
a) ddply(df, .(wantcropinfo1), summarize, count = length(wantcropinfo1))
b) ddply(df, .(wantcropinfo1, priority1), summarize, count = length(wantcropinfo1))
c) ddply(df, .(wantcropinfo1, priority1, sex), summarize, count = length(wantcropinfo1))
Note that the output will not have the same structure you mention in your question, but the information will be the same. For the structure you mention, use the table function.
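As a rough illustration of the table route in base R, the sketch below stacks the two crop/priority pairs into one long data.frame and cross-tabulates it. The object names (df1, long, crop_counts and so on) are assumptions, and the data is typed in from the five rows shown in the question with empty strings for the blanks.

```r
# Assumed reconstruction of the five rows shown in the question
df1 <- data.frame(
  sex           = c("m", "m", "m", "f", "f"),
  wantcropinfo1 = c("wheat", "rice", "rice", "eggplant", "cotton"),
  priority1     = c("high", "low", "high", "medium", "low"),
  wantcropinfo2 = c("eggplant", "cabbage", "", "", ""),
  priority2     = c("medium", "high", "", "", ""),
  stringsAsFactors = FALSE
)

# Stack the (crop, priority) pairs on top of each other: wide to long
long <- rbind(
  data.frame(sex = df1$sex, crop = df1$wantcropinfo1,
             priority = df1$priority1, stringsAsFactors = FALSE),
  data.frame(sex = df1$sex, crop = df1$wantcropinfo2,
             priority = df1$priority2, stringsAsFactors = FALSE)
)
long <- long[long$crop != "", ]  # drop the empty second choices

# Total occurrences of each crop
crop_counts <- table(long$crop)

# Crop by sex
by_sex <- table(long$crop, long$sex)

# Crop by priority and sex combined
by_priority_sex <- table(long$crop,
                         paste(long$priority, long$sex, sep = "_"))

print(crop_counts)
print(by_sex)
print(by_priority_sex)
```

With more than two wantcropinfoX columns the rbind step gets repetitive, which is where a proper reshape helper starts to pay off.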

Compare previous list element using the apply function family R, get rid of for loops

I am always looking to minimize my use of for loops in R. Is there any way to compare the current element to the previous element in a list without a for loop? Here is a simplified version of the problem I am working on.
I want to mark the First_Transaction column as a 1 if it is the person's first transaction. The data is already sorted by person and date.
Name Amount Date First_Transaction
1 Joe 50 01/05/15 0
2 Joe 43 02/05/15 0
3 Joe 40 03/05/15 0
4 Tom 40 01/03/15 0
5 Tom 34 01/29/15 0
6 Tom 22 02/05/15 0
7 Tom 49 02/10/15 0
8 Kim 28 03/10/15 0
9 Kim 19 03/20/15 0
10 Kim 24 04/13/15 0
11 Kim 35 04/20/15 0
Using a for loop, I mark the first row a 1 then use logic to check if the current name matches the previous name. If it does not, mark the First_Transaction column 1.
test$First_Transaction[1]=1
for(i in 2:length(test$Name)){
  if(test$Name[i] != test$Name[i-1]){
    test$First_Transaction[i]=1
  }
}
Is there an apply family function that can implement this logic? I really want to figure out how to do this without the loop. Thanks!
If the first observation of "First_Transaction" should be changed to "1" for each group of "Name", we could use ave
df1$First_Transaction <- as.numeric(with(df1,
ave(seq_along(First_Transaction),Name, FUN=seq_along)==1))
or we could compare the current element of "Name" with the previous element
as.numeric(with(df1, c(TRUE, Name[-1]!=Name[-nrow(df1)])))
Or use duplicated
as.numeric(!duplicated(df1$Name))
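Here is a small self-contained check of the three approaches side by side. The data.frame is typed in from the question (Name and Amount only) and the result names via_ave, via_shift and via_dup are just for illustration.

```r
# Assumed reconstruction of the example data, already sorted by person
df1 <- data.frame(
  Name   = c("Joe", "Joe", "Joe", "Tom", "Tom", "Tom", "Tom",
             "Kim", "Kim", "Kim", "Kim"),
  Amount = c(50, 43, 40, 40, 34, 22, 49, 28, 19, 24, 35),
  stringsAsFactors = FALSE
)

# 1) Number the rows within each Name and flag the first one
via_ave <- as.numeric(
  ave(seq_along(df1$Name), df1$Name, FUN = seq_along) == 1
)

# 2) Compare each Name with the one before it; the first row is always TRUE
via_shift <- as.numeric(c(TRUE, df1$Name[-1] != df1$Name[-nrow(df1)]))

# 3) Flag the first time each Name is seen
via_dup <- as.numeric(!duplicated(df1$Name))

df1$First_Transaction <- via_dup
print(df1)
```

All three agree on sorted data. The shifted comparison relies on the rows being grouped by Name, while the ave and duplicated versions should still flag the first appearance of each name if the rows are interleaved.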

How to clean and re-code check-all-that-apply responses in R survey data?

I've got survey data with some multiple-response questions like this:
HS18 Why is it difficult to get medical care in South Africa? (Select all that apply)
1 Too expensive
2 No transportation to the hospital/clinic
3 Hospital/clinic is too far away
4 Hospital/clinic staff do not speak my language
5 Hospital/clinic staff do not like foreigners
6 Wait time too long
7 Cannot take time off of work
8 None of these. I have no problem accessing medical care
where multiple responses were entered with commas and are recorded as different levels i.e.:
unique(HS18)
[1] 888 1 6 4 5 8 2 3,5 4,6 3,6 3,4 3
[13] 4,5,6 7 999 4,5 2,6 4,8 7,8 1,6 1,2,3 5,7,8 4,5,6,7 1,4
[25] 0 5,6,7 5,6 2,3 1,4,6,7 1,4,5
30 Levels: 0 1 1,2,3 1,4 1,4,5 1,4,6,7 1,6 2 2,3 2,6 3 3,4 3,5 3,6 4 4,5 4,5,6 4,5,6,7 4,6 4,8 ... 999
This is as much a data-cleaning protocol question as an R question...I'm doing the cleaning, but not the analysis, so everything needs to be transparent and user-friendly when I pass it back...and the PI doesn't use R. Basically I'd like to split the multiples into levels and re-name them while keeping them together as a single observation...not sure how to do this, or even if it's the right approach.
How do you generally deal with this issue? Is there an elegant way to process this for analysis in STATA (simple descriptives, regressions, odds ratios)?
Thanks everyone!!!
My best thought for analyzing multi-select questions like this is to convert the possible answers into indicator variables: take all of your possible answers (1 to 8 in this example) and create data columns named HS18.1, HS18.2, etc. (You can optionally include something more in the column name, but that's completely between you and the PI.)
Your sample data here looks like it includes data that is not legal: 0, 888, and 999 are not listed in the options. It's possible/likely that these include DK/NR responses, but I can't be certain. As such:
Your data cleaning should be taking care of these anomalies before this step of converting 0+ length lists into indicator variables.
My code below arbitrarily ignores this fact and you will lose data. This is obviously not "A Good Thing™" in the long run. More robust checks are warranted (and not difficult). (I've added an other column to indicate something was lost.)
The code:
ss <- '888 1 6 4 5 8 2 3,5 4,6 3,6 3,4 3 4,5,6 7 999 4,5 2,6 4,8 7,8 1,6 1,2,3 5,7,8 4,5,6,7 1,4 0 5,6,7 5,6 2,3 1,4,6,7 1,4,5'
dat <- lapply(strsplit(ss, ' '), strsplit, ',')[[1]]
lvls <- as.character(1:8)
## lvls <- sort(unique(unlist(dat))) # alternative method
ret <- structure(lapply(lvls, function(lvl) sapply(dat, function(xx) lvl %in% xx)),
.Names = paste0('HS18.', lvls),
row.names = c(NA, -length(dat)), class = 'data.frame')
ret$HS18.other <- sapply(dat, function(xx) !all(xx %in% lvls))
ret <- 1 * ret ## convert from TRUE/FALSE to 1/0
head(ret)
## HS18.1 HS18.2 HS18.3 HS18.4 HS18.5 HS18.6 HS18.7 HS18.8 HS18.other
## 1 0 0 0 0 0 0 0 0 1
## 2 1 0 0 0 0 0 0 0 0
## 3 0 0 0 0 0 1 0 0 0
## 4 0 0 0 1 0 0 0 0 0
## 5 0 0 0 0 1 0 0 0 0
## 6 0 0 0 0 0 0 0 1 0
The resulting data.frame can be cbinded (or even matrixized) to whatever other data you have.
(I use 1 and 0 instead of TRUE and FALSE because you said the PI will not be using R; this can easily be changed to a character string or something that makes more sense to them.)
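In case it helps, here is a sketch of the same idea starting from a factor column, which is what the "30 Levels" line in the question suggests HS18 is. The six sample values and the names picked, ind, other and out are assumptions for illustration. The original response is kept next to the indicators so the recoding stays transparent.

```r
# A few of the values from the question, stored as a factor (assumed)
HS18 <- factor(c("888", "1", "3,5", "4,5,6", "999", "1,4,6,7"))

# Split each response into its individual codes
picked <- strsplit(as.character(HS18), ",", fixed = TRUE)

# Legal answer codes according to the questionnaire
lvls <- as.character(1:8)

# One row per respondent, one 0/1 column per legal code
ind <- t(sapply(picked, function(x) as.integer(lvls %in% x)))
colnames(ind) <- paste0("HS18.", lvls)

# Flag responses containing anything outside 1 to 8 (0, 888, 999, ...)
other <- as.integer(!sapply(picked, function(x) all(x %in% lvls)))

# Keep the raw response alongside the indicators
out <- data.frame(HS18 = HS18, ind, HS18.other = other)
print(out)
```

From there something like write.csv(out, ...) gives a file that can be imported into other software. What 0, 888 and 999 actually mean still has to be settled with the PI before the other column is interpreted.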
