How to Compare Rows and Delete by Matching Strings in R

Match Up     Date        Points   Opponent Points   Reb   Opponent Reb
Dal vs Den   8/16/2015   20       21                10    15
Den vs Dal   8/16/2015   21       20                15    10
I have a dataframe with sports data. However, I have two rows for every game because of the way the data had to be collected. For example, the two rows above are the same game, but the data had to be collected twice for each game: once for Dal and once for Den.
I'd like to find a way to delete duplicate games. I figure that one of the conditions to compare will have to be the game date. How else would I tell R to check for and delete duplicate rows? I assume that I should be able to tell R to:
Check that the game date matches
If the game date matches and the "Teams" match, then delete the duplicate. (Can this be done even though the strings are not an exact match, i.e. "Den vs Dal" and "Dal vs Den" would not match as strings?)
Move on to the next row and repeat until the end of the spreadsheet.
R would not need to check more than 50 rows down before moving on to the next row.
Is there a function to test for matching individual words? For example, I do not want to have to tell R "if cell contains Den..." or "if cell contains Dal...", as this would involve too many teams. R needs to be able to check the cells for ANY value that could be in them and then look for the same value as a string in later rows.
Please help.
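One possible approach (a sketch only; it assumes the data frame is called games and the columns are named Match.Up and Date as in the sample above, so adjust the names to match your data) is to build a key that is identical for both rows of a game by sorting the two team codes and appending the date, then dropping rows whose key has already been seen:
# split "Dal vs Den" / "Den vs Dal" into the two team codes
teams <- strsplit(games$Match.Up, " vs ", fixed = TRUE)
# sort the codes so both versions of the matchup give the same key, then add the date
game_key <- vapply(teams, function(x) paste(sort(x), collapse = "-"), character(1))
game_key <- paste(game_key, games$Date)
# keep only the first row seen for each game
games_unique <- games[!duplicated(game_key), ]
This avoids checking cells for any particular team name: whatever two codes appear around " vs " are sorted into a canonical order before the comparison.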

Related

Perform operation on all substrings of a string in SQL (MariaDB)

Disclaimer: This is not a database administration or design question. I did not design this database and I do not have rights to change it.
I have a database in which many fields are compound. For example, a single column is used for acre usage for a district. Many districts have one primary crop and the value is a single number, such as 14. Some have two primary crops and it has two numbers separated by a comma like "14,8". Some have three, four, or even five primary crops resulting in a compound value like "14,8,7,4,3".
I am pulling data out of this database for analytical research. Right now, I am pulling columns like that into R, splitting them into 5 values (padding nulls if there aren't 5 values), and performing work on the values. I want to do it in the database itself. I want to split the value on the comma, perform an operation on the resulting values, and then concatenate the result of the operation back into the original column format.
For example, I have a column that is in acres. I want it in square meters. So I want to take "14,8", temporarily turn it into 14 and 8, multiply each of those by 4046.86, and get "56656.04,32374.88" as my result. What I am currently doing is using regexp_replace. I start with all rows where "acres REGEXP '^[0-9.]+,[0-9.]+,[0-9.]+,[0-9.]+,[0-9.]+$'" as the where clause. That gives me the rows with 5 numbers in the field. Then I can handle the first number with "cast(regexp_replace(acres, ',.*$', '') as float) * 4046.86". I can do each of the 5 using a different regexp_replace and concatenate those values back together. Then I run a query for the rows with 4 numbers, then 3, then 2, and finally the single-number rows.
Is this possible as a single query?
Use a stored function to parse the string and convert it to the desired result. That will allow you to use a single query for the job.
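For reference, a minimal R sketch of the client-side split/convert/re-concatenate step the question describes (the acres vector here is made up for illustration; the single-query version would wrap equivalent logic in a stored function on the MariaDB side, as suggested above):
acres <- c("14", "14,8", "14,8,7,4,3")
# split on the comma, convert each acreage to square metres, and paste back together
sq_m <- sapply(strsplit(acres, ",", fixed = TRUE),
               function(x) paste(round(as.numeric(x) * 4046.86, 2), collapse = ","))
sq_m
# "56656.04" "56656.04,32374.88" "56656.04,32374.88,28328.02,16187.44,12140.58"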

Generating 'weight' column for network analysis

I'm very new to R, so forgive any omissions or errors.
I have a dataframe that contains a series of events (called 'incidents') represented by a column named 'INCIDENT_NUM'. These are strings (ex. 2016111111), and there are multiple cells per incident if there are multiple employees involved in the incident. Employees are represented in their own string column ('EMPL_NO'), and they can be in the column multiple times if they're involved in multiple incidents.
So, the data I have looks like:
Incident Number   EMPL_NO
201611111         EID0012
201611111         EID0013
201611112         EID0012
201611112         EID0013
201611112         EID0011
What I am aiming to do is see which employees are connected to one another by how many incidents they're co-involved with. Looking at tutorials for network analysis, folks have data that looks like this, which is what I ultimately want:
From      To        Weight
EID0011   EID0012   2
EID0012   EID0013   1
Is there any easy process for this? My data has thousands of rows, so doing this by hand doesn't feel feasible.
Thanks in advance!!!
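One way to get the From/To/Weight table in base R is to self-join the data on the incident number and count how often each pair of employees co-occurs. This is a sketch that assumes the data frame is called df with the two columns INCIDENT_NUM and EMPL_NO, so adjust the names to whatever your data actually uses:
# pair every employee with every other employee on the same incident
pairs <- merge(df, df, by = "INCIDENT_NUM")
# keep each unordered pair once and drop self-pairs
pairs <- pairs[pairs$EMPL_NO.x < pairs$EMPL_NO.y, ]
# count the incidents shared by each pair
edges <- aggregate(INCIDENT_NUM ~ EMPL_NO.x + EMPL_NO.y, data = pairs, FUN = length)
names(edges) <- c("From", "To", "Weight")
The resulting edges data frame is in the from/to/weight format that network packages such as igraph expect.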

How to subset IDs with partial match into data frame

I am trying to subset data to create a list of possible duplicates in a new data frame. The problem is that the names are in different formats and possibly only a small part of the ID may actually match.
I need R to output a list of possible duplicates for me to then check
I've found a few examples for formatting issues, or for when it's the first few characters that you are trying to match. I am not sure how to put the code together, and the characters that match may be anywhere in the name.
So far, this seems to get me the closest, but I'm still not sure how to adapt the code to work for me.
Subset a df using partial match with multiple criteria
This is what my data looks like (but with 1000000s of lines):
Supplier.Name               Date.of.Record   BMCC.avg
SG & JM Hammond             2018-07-21       292.2381
Mileshan Nominees Pty Ltd   2018-12-21       130.0000
RW & GJ Brown & Sons        2018-02-21       162.8333
BD & BA Smith               2018-02-21       478.0000
In the end, I would like a list of possible duplicates based on partial matches (maybe 4 or 5 characters in a row?).
Right now I can't seem to put together any code at all. Even a few starting-point suggestions would be helpful.
Thanks!
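As a starting point, one option is to compare names by overall edit distance with base R's adist() and flag pairs that fall under a threshold (note this is a different criterion from "4 or 5 characters in a row", so treat it as one possible sketch; it also assumes the data frame is called df with the Supplier.Name column shown above):
nm <- unique(df$Supplier.Name)
d  <- adist(nm, ignore.case = TRUE)     # matrix of pairwise edit distances
d[lower.tri(d, diag = TRUE)] <- NA      # look at each pair of names only once
close <- which(d <= 5, arr.ind = TRUE)  # "close" = within 5 edits; tune this threshold
possible_dupes <- data.frame(name1 = nm[close[, "row"]],
                             name2 = nm[close[, "col"]])
Because this builds an n-by-n matrix of unique names, with millions of rows you would want to block the comparison first (for example, only compare names that share a first letter or a common word).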

Complex Search in R

This is an unusual and difficult question which has perplexed me for a number of days, and I hope I explain it correctly. I have two databases, i.e. data frames in R. The first is approx 90,000 rows and is a record of every race-horse in the UK. It contains numerous fields, most importantly the NAME of each horse and its SIRE; one record per horse (first database: sample and fields). The second database contains over one million rows and is a history of every race a horse has taken part in over the last ten years, i.e. races it has run, or as I call it 'appearances'. It contains NAME, DATE, TRACK etc.; one record per appearance (second database: sample and fields).
What I am attempting to do is to write a few lines of code - not a loop - that will provide me with a total number of every appearance made by the siblings of a particular horse, i.e. one grand total. The first step is easy - finding the siblings, i.e. horses with a common sire - and you can see it below (N.B. FindSire is my own function which does what it says and finds the sire of a horse by referencing the same data frame; I have simplified the code somewhat for clarity):
TestHorse <- "Save The Bees"
Siblings <- which(FindSire(TestHorse) == Horses$Sire)
Sibsname <- Horses[Siblings, 1]
This produces Sibsname, which is 636 names long (snippet below), although the average horse will only have 50 or so siblings. I could construct a loop, search the second 'appearances' data frame, individually match the sibling names and then total the appearances of all the siblings combined. However, I would like to know if I could avoid a loop - and the time associated with it - and write a few lines of code to achieve the same end, i.e. search for all 636 horses in the appearances database, calculate the number of times each appears, and get a total of all these appearances; or to put it another way, how many races have the siblings of "Save The Bees" taken part in? Thanks in advance.
[1] "abdication " "aberdonian " "acclamatory " "accolation " ..... to [636]
Using dplyr, calling your "first database" horses and your "second database" races:
library(dplyr)
test_horse = "Save The Bees"
select(horses, Name, Sire) %>%
  filter(Sire == Sire[Name == tolower(test_horse)]) %>%
  inner_join(races, c("Name" = "SELECTION_NAME")) %>%
  summarize(horse = test_horse, sibling_group_races = n())
I am making the assumption that you want the number of appearances of the sibling group to include the appearances of the test horse - to omit them instead add , Name != tolower(test_horse) to the filter() command.
As you haven't shared data reproducibly, I cannot test the code. If you have additional problems I will not be able to help you solve them unless you share data reproducibly. ycw's comment has a helpful link for doing that - I would encourage you to edit your question to include either (a) code to simulate a small sample of data, or (b) the output of dput() on a small sample of your data, to share a few rows in a copy/pasteable format.
The code above will do for querying one horse at a time - if you intend to use it frequently it would be much simpler to just create a table where each row represents a sibling group and contains the number of races. Then you could just reference the table instead of calculating on the fly every time. That would look like this:
sibling_appearances =
  left_join(horses, races, by = c("Name" = "SELECTION_NAME")) %>%
  group_by(Sire) %>%
  summarize(offspring_appearances = n())

Creating New Variables in R that relate to

I have 7 different variables in an Excel spreadsheet that I have imported into R. They are each columns with a length of 3331. They are:
'Tribe' - there are 8 of them
'Month' - when the sampling was carried out
'Year' - the year when the sampling was carried out
'ID" - an identifier for each snail
'Weight' - weight of a snail in grams
'Length' - length of a snail shell in millimetres
'Width' - width of a snail shell in millimetres
This is a case where 8 different tribes have been asked to record data on a suspected endangered species of snail to see if they are getting rarer, or changing in size or weight.
This happened at different frequencies between 1993 and 1998.
I would like to know how to add new variables to the data so that if I entered names(Snails) it would list the 7 given variables plus any added variables that I have.
The dataset is limited, so I would like to add new variables, such as the count of snails recorded in any given month.
This would rely on me using Tribe, Month, Year and ID: if the IDs (snail identifiers) were listed for any given month, then I would be able to sum them to see if there are any changes in counts. I have tried:
count=c(Tribe,Year,Month,ID)
count
But after doing things like that, R just gives a large vector that is 4X the size of the dataset. I would like to be able to create a new variable that is a column of size n = 3331.
Or maybe I would like to create a simpler variable so I can see whether a tribe collected in any given month. I don't know how to do this.
I have looked at other forums and searched, but there is nothing I can see that helps in my case. I appreciate any help. Thanks
I'm guessing you need to organise your variables in a single structure, such as a data.frame.
See ?data.frame for the help file.
To get you started, you could do something like:
snails <- data.frame(Tribe,Year,Month,ID)
snails
# or for just the first few rows
head(snails)
Then your data would look similar to your Excel file:
  Tribe Year Month ID
1     1    1     1  a
2     2    2     2  b
3     3    3     3  c
<<etc>>
Then if you do names(snails) it will list out your column names.
You could possibly avoid some of this mucking about by just importing your Excel file either directly from Excel, or saving as a csv (comma separated values) file first and then using read.csv("name_of_your_file.csv")
See http://www.statmethods.net/input/importingdata.html for some more specifics on this.
To tabulate your data, you can do things like...
table(snails$Tribe)
...to see the number of snail records collected by each tribe. Or...
table(snails$Tribe,snails$Year)
...to see the trends in each tribe by each year. The $ character will let you access the named variable (column) inside a data.frame in the same way you are currently using the free floating variables. This might seem like more work initially, but it will pay off greatly when you need to do some more involved analysis.
For example, if you want to analyse only the weights from tribe "1", you could do:
snails$Weight[snails$Tribe==1]
# mean of these weights
mean(snails$Weight[snails$Tribe==1])
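If you also want the per-month counts mentioned in the question as a new column of length 3331 (a sketch; the column name MonthlyCount is just an example), base R's ave() repeats each group's count across every row in that group:
# number of snail records for each Tribe/Year/Month combination, one value per row
snails$MonthlyCount <- ave(rep(1, nrow(snails)), snails$Tribe, snails$Year, snails$Month, FUN = sum)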
There are a lot more things I could explain but you would probably be better served by reading an excellent website like Quick-R here: http://www.statmethods.net/management/index.html to get you doing some more advanced analysis and plotting.
