Subset using 2 data frames of different sizes in R

I have 2 data frames of different sizes:
Season: a fixture list for Australian Rules Football
Strength: ratings for different aspects of a team, for each team in the league
I want to create a for loop that, for each row of Season, finds the row in Strength matching the home team and assigns it to a variable HOME, and then does the same for the away team with a variable AWAY.
HOME and AWAY will then be used to compute a probability, which is inserted in a new column of the Season data frame.
But I cannot get Strength to filter by Season in the loop. This is what I tried:
for (row in 1:nrow(Season)) {
  HOME <- Strength %>%
    filter(Season$HomeTeam == Strength$Team)
  AWAY <- Strength %>%
    filter(Season$AwayTeam == Strength$Team)
}
I just keep receiving this error message:
longer object length is not a multiple of shorter object length
Any help would be appreciated
Thanks,
Dave

Answered in comments:
library(dplyr)
HOME <- Strength %>%
filter(Strength$Team %in% Season$HomeTeam)
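The %in% filter above keeps every Strength row whose team appears anywhere in Season, so it does not pair a rating with each fixture. A minimal sketch of one alternative, joining Strength onto Season once for the home team and once for the away team, is shown here. The toy team names, the single Rating column, and the Elo-style probability formula are all assumptions for illustration; the real Strength table and probability model will differ.

```r
library(dplyr)

# Hypothetical toy data: team names and the Rating column are assumptions
Season <- data.frame(
  HomeTeam = c("Carlton", "Geelong", "Richmond"),
  AwayTeam = c("Richmond", "Carlton", "Geelong")
)
Strength <- data.frame(
  Team = c("Carlton", "Geelong", "Richmond"),
  Rating = c(1450, 1600, 1550)
)

Season <- Season %>%
  # Attach the home team's rating to each fixture
  left_join(Strength, by = c("HomeTeam" = "Team")) %>%
  rename(HomeRating = Rating) %>%
  # Attach the away team's rating to each fixture
  left_join(Strength, by = c("AwayTeam" = "Team")) %>%
  rename(AwayRating = Rating) %>%
  # Placeholder probability (logistic curve), not the asker's actual formula
  mutate(HomeWinProb = 1 / (1 + 10^((AwayRating - HomeRating) / 400)))

print(Season)
```

Because the join is vectorised over all fixtures, no for loop is needed and the length-mismatch warning from comparing Season$HomeTeam with Strength$Team does not arise.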

Related

Using R to create a collection of dataframes extracted from a large database

Okay so I have a large collection of data I'm trying to analyze. It contains ~2 million police reports with information such as jurisdiction, type of offense, race, age, etc. My end goal is to determine the top 20 jurisdictions with the most reports, and then chart only the entries from those locations in various ways.
top20 <- police %>%
group_by(JURISDICTION) %>%
tally(sort=TRUE) %>%
filter(row_number() <= 20)
I'm using this to tally up the totals in each jurisdiction and then trim down to the top 20 entries. What I'd like to have happen next is, using each of the entries in column one of this top20 data.frame, create a new data frame with the name of the locality, as well as all the entries from police with a matching jurisdiction.
I've been experimenting with something along the lines of:
for (i in 1:20) {
assign(paste(top20[i,1]),
filter(police, JURISDICTION == top20[i,1]))
}
which does create data frames with the correct names, but the second portion isn't reading correctly and it's just creating blank data frames at the moment. Any advice on how to streamline this would be appreciated. I'm very capable of just making each frame individually, but if I can do it succinctly I'll be more satisfied.
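One likely cause (not verified against the asker's data) is that top20 is a tibble, so top20[i,1] returns a one-cell tibble rather than a plain value, and the comparison inside filter does not behave as intended. A minimal sketch that avoids assign altogether is to pull the top jurisdiction names into a vector, filter once, and split the result into a named list. The toy police data frame, the OFFENSE column, and n_top are assumptions for illustration.

```r
library(dplyr)

# Hypothetical toy stand-in for the ~2 million row police data
police <- data.frame(
  JURISDICTION = c("A", "A", "A", "B", "B", "C"),
  OFFENSE = c("x", "y", "x", "z", "x", "y"),
  stringsAsFactors = FALSE
)
n_top <- 2  # would be 20 in the question

# Names of the jurisdictions with the most reports, as a plain character vector
top_names <- police %>%
  count(JURISDICTION, sort = TRUE) %>%
  head(n_top) %>%
  pull(JURISDICTION)

# Keep only rows from those jurisdictions, then split into one data frame each
top_police <- filter(police, JURISDICTION %in% top_names)
by_jurisdiction <- split(top_police, top_police$JURISDICTION)

names(by_jurisdiction)
```

Each data frame can then be reached by name, for example by_jurisdiction[["A"]], and the whole list can be passed to lapply for charting.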

Is there an R function where I can get the names within a specific column in my dataset

Edit: using the aid from one of the users, I was able to use "table(ArrestData$CHARGE)", yet, since there are over 2400 entries, many of the entries are being omitted. I am looking for the top 5 charges, is there code for this? Additionally, I am looking at a particular council district (which is another variable titled "CITY_COUNCIL_DIST"). I want to see which are the top 5 charges given out within a specific council district. Is there code for this?
Thanks for the help!
Original post follows
Just like how I can use "names(MyData)" to see the names of my variables, I am wondering if I can use some code to see the names/responses/data points of a specific column.
In other words, I am attempting to see the distinct values that appear in the rows of a specific column of data.
After I find this, I would like to know how many times each value is used, whether as a count or a percentage. After this, I would like to see how many times each value is used on the condition that another column/variable equals a given numeric value.
Apologies if this, in any way, is confusing.
To go further in depth, I am playing around with the Los Angeles Police Data that I got via the Office of the Mayor's website. From 2017-2018, I am attempting to see what charges and the amount of each specific charge were given out in Council District 5. CHARGE and CITY_COUNCIL_DIST are the two variables I am looking at.
Any and all help will be appreciated.
To get all the distinct values, you can use the unique function, as in:
> x <- c(1,1,2,3,3,4,5,5,5,6)
> unique(x)
[1] 1 2 3 4 5 6
To count how many times each distinct value occurs you can use table, as in:
> x <- c(1,1,2,3,3,4,5,5,5,6)
> table(x)
x
1 2 3 4 5 6
2 1 2 1 3 1
The first row gives you the distinct values and the second row the counts for each of them.
EDIT
This edit aims to answer your second question, continuing with my previous example.
To find the five most repeated values of a variable we can use base R. To do so, I would first create a dataframe from your table of frequencies:
df <- as.data.frame(table(x))
Then order it by the column Freq in descending order and keep the first five rows:
head(df[order(-df$Freq), ], 5)
To find the top five most repeated values of a variable within a group, however, it is easier to go beyond base R. I would use dplyr to create an augmented dataframe with frequencies for each value of the variable of interest, call it count_variable:
library(dplyr)
x_or <- x %>%
  group_by(group_variable, count_variable) %>%
  summarise(freq=n())
where x is your original dataframe (not the vector from the earlier example), group_variable is the variable for your groups and count_variable is the variable you want to count. Now you just have to sort the result so that, within each group, the most frequent values come first:
x_or %>%
  arrange(group_variable, desc(freq))
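Applied to the question, the same idea can be sketched for a single council district. The column names CHARGE and CITY_COUNCIL_DIST come from the question; the toy ArrestData values below are made up for illustration.

```r
library(dplyr)

# Hypothetical toy stand-in for the Los Angeles arrest data
ArrestData <- data.frame(
  CITY_COUNCIL_DIST = c(5, 5, 5, 5, 5, 5, 4, 4),
  CHARGE = c("A", "A", "A", "B", "B", "C", "A", "C"),
  stringsAsFactors = FALSE
)

top_charges <- ArrestData %>%
  filter(CITY_COUNCIL_DIST == 5) %>%      # keep one council district
  count(CHARGE, sort = TRUE) %>%          # frequency of each charge, largest first
  mutate(percent = 100 * n / sum(n)) %>%  # share of that district's charges
  head(5)                                 # top five

print(top_charges)
```

In base R, head(sort(table(ArrestData$CHARGE), decreasing = TRUE), 5) gives the top five charges over the whole dataset.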

How to compute questionnaire total score and subscores by summing all and a selection of columns in R?

I'm new in R and I'm having a little issue. I hope some of you can help me!
I have a data.frame including answers at a single questionnaire.
The rows indicate the participants.
The first column indicates the participant ID.
The following columns include the answers to each item of the questionnaire (item.1 up to item.20).
I need to create two new vectors:
total.score <- sum of all 20 values for each participant
subscore <- sum of some of the items
I would like to use a function, like a sum(A:T) in Excel.
Just to recap, I'm using R and not other software.
I already did it by summing each vector just with the symbol +
(data$item.1 + data$item.2 + data$item.3 etc...)
but it is a slow way to do it.
Answers range from 0 to 3 for each item, so I expect a total score ranging from 0 to 60.
Thank you in advance!!
Let's use as an example this data from a national survey with a questionnaire.
If you download the .csv file to your working directory, you can read it with:
data <- read.csv("2016-SpanishSurveyBreastfeedingKnowledge-AELAMA.csv", sep = "\t")
Item names are p01, p02, p03...
Imagine you want a subtotal of the first five questions (from p01 to p05)
You can give a name to the group:
FirstFive <- c("p01", "p02", "p03", "p04", "p05")
I think this is worthwhile because you will probably want to perform more tasks with this group (analysis, adding or deleting a question from the group...), and because it lets you give it a meaningful name (for instance "knowledge", "attitudes"...)
And then create the subtotal variable:
data$subtotal1 <- rowSums(data[ , FirstFive])
You can check that the new variable is the sum
head(data[ , c(FirstFive, "subtotal1")])
(notice that FirstFive is not quoted, because it is an object outside data, but subtotal1 is quoted, because it is the name of a variable in data)
You can compute more subtotals and use them to compute a global score
You could maybe save some keystrokes if you know that these variables are columns 20 to 24:
names(data)[20:24]
And then sum them as
rowSums(data[ , c(20:24)])
I think this is what you asked for, but I would avoid doing it this way, as it is easier to make mistakes, which can be hard to detect.
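For the column names used in the question (item.1 up to item.20), the same rowSums approach might look like the sketch below. The randomly generated answers and the choice of items 1, 3, 5 and 7 for the subscore are assumptions for illustration only.

```r
set.seed(1)

# Hypothetical questionnaire data: 4 participants, 20 items scored 0 to 3
data <- data.frame(id = 1:4)
for (i in 1:20) {
  data[[paste0("item.", i)]] <- sample(0:3, 4, replace = TRUE)
}

# Build the column names instead of typing them out
all_items <- paste0("item.", 1:20)
sub_items <- paste0("item.", c(1, 3, 5, 7))  # assumed subscale

data$total.score <- rowSums(data[, all_items])
data$subscore <- rowSums(data[, sub_items])

head(data[, c("id", "total.score", "subscore")])
```

If some answers are missing, rowSums accepts na.rm = TRUE, although that treats a missing answer as 0, which may or may not suit the questionnaire's scoring rules.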

Converting factor value to new variable which stores count of how many times it occured in the factor

I have an ATP data set from Kaggle and I am working on it in R. The data set has various variables such as match date, city, tournament name, winner name, loser name, total sets won by the match winner, total sets won by the match loser, total games won by the winner, total games won by the loser and so on.
My attention is on the match winner and match loser columns.
These columns are factor variables whose values are player names.
What I want is to plot a graph of the match win-loss ratio for different players (say the top 5 or top 10 players with the highest win-loss ratio), where the x-axis represents the name of the player and the y-axis represents the win-loss ratio of that player.
How do I create this specific graph? I have tried using pipelining in the dplyr package as follows:
Winner and Loser are factor variables.
roger_wins <- atp %>% filter(Winner == "Federer R.") %>% count(Winner)
roger_loss <- atp %>% filter(Loser == "Federer R.") %>% count(Loser)
But this way it has to be hard coded for each player. How do I do this in code for the top 5 or top 10 players (according to win-loss ratio)?
Please provide the solution in R.
This is the page where the data set can be found:
https://www.kaggle.com/jordangoblet/atp-tour-20002016
If I understand your problem, you can do something like this:
1. Use the table function to collapse the data into counts per player.
2. Then use the apply function over the output of the first step.
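A minimal base R sketch of the table idea follows. It counts wins and losses per player over a shared set of player names and divides the two; plain vector division is used here in place of apply. The column names Winner and Loser come from the question, while the toy matches are invented for illustration.

```r
# Hypothetical toy stand-in for the Kaggle ATP data
atp <- data.frame(
  Winner = c("Federer R.", "Federer R.", "Nadal R.",
             "Djokovic N.", "Federer R.", "Nadal R."),
  Loser  = c("Nadal R.", "Djokovic N.", "Djokovic N.",
             "Federer R.", "Murray A.", "Murray A."),
  stringsAsFactors = FALSE
)

# Use one common set of levels so both tables line up player by player
players <- union(as.character(atp$Winner), as.character(atp$Loser))
wins <- table(factor(atp$Winner, levels = players))
losses <- table(factor(atp$Loser, levels = players))

# Win-loss ratio per player (a player with no losses gets Inf)
ratio <- as.numeric(wins) / as.numeric(losses)
names(ratio) <- players

# Top 5 players by ratio, then a bar chart with players on the x-axis
top <- head(sort(ratio, decreasing = TRUE), 5)
barplot(top, xlab = "Player", ylab = "Win-loss ratio")
```

On real data it is probably worth requiring a minimum number of matches before ranking, since a player with one win and no losses would otherwise come out on top.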

recording time a taxa first appears: nested loops and conditional statements in R

Here is my example. Here is some hypothetical data resembling my own. Environmental data describes the metadata of the community data, which is made up of taxa abundances over years in different treatments.
#Elements of Environmental (meta) data
nTrt<-2
Trt<-c("High","High","High","Low","Low","Low")
Year<-c(1,2,3,1,2,3)
EnvData<-cbind(Trt,Year)
#Elements of community data
nTaxa<-2
Taxa1<-c(0,0,2,50,3,4)
Taxa2<-c(0,34,0,0,0,23)
CommData<-cbind(Taxa1,Taxa2)
#Elements of ideal data produced
Ideal_YearIntroduced<-array(0,dim=c(nTrt,nTaxa))
Taxa1_i<-c(2,1)
Taxa2_i<-c(2,3)
IdealData<-cbind(Taxa1_i,Taxa2_i)
rownames(IdealData)<-c("High","Low")
I want to know what the Year is (in EnvData) when a given taxa first appears in a particular treatment. ie The "introduction year". That is, if the taxa is there at year 1, I want it to record "1" in an array of Treatment x Taxa, but if that taxa in that treatment does not arrive until year 3 (which means it meets the condition that it is absent in year 2), I want it to record Year 3.
So I want these conditional statements to only loop within a treatment. In other words, I do not want it to record a taxa as being "introduced" if it is 0 in year 3 of one treatment and present in year 1 of the next.
I've approached this by doing several for loops, but the loops are getting out of hand with the conditional statements, and there is now an error that I can't figure out. I may not be thinking of the i and j's correctly.
The data itself is more complicated than this...has 6 years, 1102 taxa, many treatments.
#Get the index number where each treatment starts
Index<-which(EnvData[,2]==1)
TaxaIntro<-array(0,dim=dim(Comm_0)) #Array to hold results
for (i in 1:length(Index)) { #Loop through treatment (start at year 1 each time)
for (j in 1:3) { #Loop through years within a treatment
for (k in 1:ncol(CommData)) { #Loop through Taxa
if (CommData[Index[i],1]>0 ) { #If Taxa is present in Year 1...want to save that it was introduced at Year 1
TaxaIntro[i,k]<-EnvData[Index[i],2]
}
if (CommData[Index[i+j]]>0 && CommData[Index[((i+j)-j)]] ==0) { #Or if taxa is present in a year AND absent in the previous year
TaxaIntro[i,k]<-EnvData[Index[i+j],2]
}
}
}
}
With this example, I get an error related to my second conditional statement...I may be going about this the wrong way.
Any help would be greatly appreciated. I am open to other (non-loop) approaches, but please explain thoroughly as I'm not so well-versed.
Current error:
Error in if (CommData[Index[i + j]] > 0 & CommData[Index[((i + j) - j)]] == :
missing value where TRUE/FALSE needed
Based on your example, I think you could combine your environmental and community data into a single data.frame. Then you might approach your problem using functions from the package dplyr.
# Make combined dataset
dat = data.frame(EnvData, CommData)
Since you want to do the work separately for each Trt, you'll want to group_by that variable so that everything is done separately by group.
Then the problem is to find the first time each one of your Taxa columns contains a value greater than 0 and record which year that is. Because you want to do the same thing for many columns, you can use summarise_each. To get the desired summary, I used the function first to choose the first instance of Year where whatever Taxa column you are working with is greater than 0. The . refers to the Taxa columns. The last thing I did in summarise_each is to choose which columns I wanted to do this work on. In this case, you want to do this for all your Taxa columns, so I chose all columns that starts_with the word Taxa.
With chaining, this looks like:
library(dplyr)
dat %>%
group_by(Trt) %>%
summarise_each(funs(first(Year[. > 0])), starts_with("Taxa"))
The result is slightly different than yours, but I think this is correct based on the data provided (Taxa1 in High first seen in year 3 not year 2).
Source: local data frame [2 x 3]
   Trt Taxa1 Taxa2
1 High     3     2
2  Low     1     3
The above code assumes that your dataset is already in order by Year. If it isn't, you can use arrange to set the order before summarising.
If you aren't used to chaining, the following code is the equivalent to above.
groupdat = group_by(dat, Trt)
summarise_each(groupdat, funs(first(Year[. > 0])), starts_with("Taxa"))
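In newer versions of dplyr, summarise_each and funs are deprecated in favour of across (available from dplyr 1.0.0). A sketch of the same summary in that style, using the example data from the question, might look like this. The data frame is built directly with data.frame here so that Year stays numeric, since cbind(Trt, Year) produces a character matrix.

```r
library(dplyr)

# Example data from the question, built as one data frame
dat <- data.frame(
  Trt = c("High", "High", "High", "Low", "Low", "Low"),
  Year = c(1, 2, 3, 1, 2, 3),
  Taxa1 = c(0, 0, 2, 50, 3, 4),
  Taxa2 = c(0, 34, 0, 0, 0, 23)
)

intro <- dat %>%
  arrange(Trt, Year) %>%   # make sure years are in order within each treatment
  group_by(Trt) %>%
  # For every Taxa column, take the first Year where abundance is above 0
  summarise(across(starts_with("Taxa"), ~ first(Year[.x > 0])))

print(intro)
```

A taxon that never appears in a treatment gets NA, because there is no year with abundance above 0 to pick from.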
