Complex Search in R

This is an unusual and difficult question that has perplexed me for a number of days, and I hope I explain it correctly. I have two databases, i.e. data frames in R. The first has approximately 90,000 rows and is a record of every racehorse in the UK. It contains numerous fields, most importantly the NAME of each horse and its SIRE, with one record per horse. The second contains over one million rows and is a history of every race a horse has taken part in over the last ten years, i.e. races it has run or, as I call them, 'appearances'. It contains NAME, DATE, TRACK etc., with one record per appearance.
What I am attempting to do is to write a few lines of code, not a loop, that will give me the total number of appearances made by the siblings of a particular horse, i.e. one grand total. The first step is easy: finding the siblings, i.e. horses with a common sire. You can see it below. (N.B. FindSire is my own function, which does what it says: it finds the sire of a horse by referencing the same data frame. I have simplified the code somewhat for clarity.)
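TestHorse <- "Save The Bees"
# Row indices of all horses that share the test horse's sire
Siblings <- which(FindSire(TestHorse) == Horses$Sire)
# Horse names are in the first column
Sibsname <- Horses[Siblings, 1]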
This produces Sibsname, which is 636 names long (snippet below), although the average horse will only have 50 or so siblings. I could construct a loop that searches the second 'appearances' data frame, matches each sibling name individually, and then totals the appearances of all the siblings combined. However, I would like to know whether I can avoid a loop, and the time associated with it, and write a few lines of code to achieve the same end: search for all 636 horses in the appearances database, count how many times each appears, and total those appearances. To put it another way, how many races have the siblings of "Save The Bees" taken part in? Thanks in advance.
[1] "abdication " "aberdonian " "acclamatory " "accolation " ..... to [636]

Using dplyr, calling your "first database" horses and your "second database" races:
library(dplyr)
test_horse = "Save The Bees"
select(horses, Name, Sire) %>%
  # keep only horses that share the test horse's sire
  filter(Sire == Sire[Name == tolower(test_horse)]) %>%
  # one row per race run by any of those horses
  inner_join(races, by = c("Name" = "SELECTION_NAME")) %>%
  summarize(horse = test_horse, sibling_group_races = n())
I am assuming that you want the sibling group's appearance count to include the test horse's own appearances. To omit them, add Name != tolower(test_horse) as a second condition in the filter() call.
As you haven't shared data reproducibly, I cannot test the code. If you have additional problems I will not be able to help you solve them unless you share data reproducibly. ycw's comment has a helpful link for doing that. I would encourage you to edit your question to include either (a) code to simulate a small sample of data, or (b) the output of dput() on a small sample of your data, which shares a few rows in a copy/pasteable format.
The code above will do for querying one horse at a time - if you intend to use it frequently it would be much simpler to just create a table where each row represents a sibling group and contains the number of races. Then you could just reference the table instead of calculating on the fly every time. That would look like this:
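sibling_appearances =
  # inner_join keeps only horses with at least one race, so n() counts actual
  # appearances (a left_join would add one NA row per unraced horse)
  inner_join(horses, races, by = c("Name" = "SELECTION_NAME")) %>%
  group_by(Sire) %>%
  summarize(offspring_appearances = n())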
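If you would rather stay in base R, here is a minimal sketch of the same count that also gives the per-sibling breakdown asked for in the question. The two small data frames are made-up stand-ins (only the column names Name, Sire and SELECTION_NAME are taken from the code above), and it assumes names are stored in a consistent case with no stray whitespace; if not, normalise them with tolower() and trimws() first.

```r
# Toy stand-ins for the two data frames (made-up rows, not the real data)
horses <- data.frame(
  Name = c("save the bees", "abdication", "aberdonian", "other horse"),
  Sire = c("sire a", "sire a", "sire a", "sire b")
)
races <- data.frame(
  SELECTION_NAME = c("abdication", "abdication", "aberdonian",
                     "other horse", "save the bees")
)

test_horse <- "save the bees"

# Siblings: same sire as the test horse, excluding the test horse itself
sire <- horses$Sire[horses$Name == test_horse]
sib_names <- horses$Name[horses$Sire == sire & horses$Name != test_horse]

# Vectorised membership test: TRUE for every race row run by a sibling
is_sib <- races$SELECTION_NAME %in% sib_names

# Appearances per sibling (siblings with no races show up as 0)
per_sibling <- table(factor(races$SELECTION_NAME[is_sib], levels = sib_names))

# Grand total across all siblings
total <- sum(is_sib)
```

The %in% test is vectorised, so there is no explicit loop over the 636 names.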

Related

How to scrape a complex table which has columns spanning multiple rows (a pedigree chart) in R?

I've looked at all the other related Stack Overflow questions, and none of them are close enough to what I'm trying to do to be useful. In part because, while some of those questions address dealing with tables where the leftward columns span multiple rows (as in a pedigree chart), they don't address how to handle the messy HTML which is somehow generating the chart. When I try the usual ways of ingesting the table with rvest it really doesn't work.
The table I'm trying to scrape looks like this:
When I extract the HTML of the first row (tr) of the table, I see that it contains: Betty, Jack, Bo, Bob, Jim, Dan, b 1932 (the very top of the table).
Great, you say, but not so fast. Because with this structure there's no way to know that Betty's mom is Sue (because Sue is on a different row).
Sue's row doesn't include Betty, but instead starts with Sue herself.
So in this example, Sue's row would be: Sue, Owen, Jacob, Luca, Blane, b 1940.
Furthermore, row #2 in the HTML is actually just Ava b 1947.
I.e., here's the content of each HTML row:
I tried using rvest to download the page and then extract the table.
A la:
pedigree <- read_html(page) %>% html_nodes("#PedigreeTable") %>% html_table
It really didn't work. Oddly, I got every column duplicated, which is not too bad, but I'd rather have a tibble/dataframe/matrix with the first column being 32 Bettys, the next column being 16 each of Jack and Sue, etc.
I hope this is all clear as mud!
Ideally, as far as output, I'd get a nice neat dataframe with the columns person, father, mother. Like so:
Thanks in advance!
Maybe writing an algorithm can do it, like this:
Select only the last two columns:
father_name = first value of the penultimate column
then browse the column to find the next non-NA value, counting each row
count = 1 + number of NA values
mother_name = second non-NA value
then count all rows until you find a name
count = count + 1 + number of NA values
Create your final table with:
name | father | mother
Isolate all of the family's child names and save them in your final table.
Assign father_name and mother_name to the corresponding columns in your final table.
Delete all rows used and start again.
Once you have assigned all the people in the last column to their parents, delete the last column.
Then delete all blank rows to get a structure similar to the one needed in the first step, and start the algorithm again.
Hope that helps!
PS: I suggest that you give a unique ID to each person at some point, to avoid confusion between people who have the same name.
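If the chart is a complete binary pedigree, the row arithmetic can also be done with plain indexing instead of deleting rows and columns step by step. This is only a sketch of that alternative, not the procedure above, and it assumes you have already got the table into a character matrix with one column per generation, NA wherever a cell is covered by a rowspan, and each father listed above the mother. The toy matrix is hypothetical: Ann and Eve are made-up names, and pedigree_to_parents is an illustrative helper, not an rvest function.

```r
# Toy 3-generation chart. NA marks a cell covered by a rowspan from above.
# "Ann" and "Eve" are made-up names used only for illustration.
ped <- matrix(
  c("Betty", "Jack", "Bo",
    NA,      NA,     "Ann",
    NA,      "Sue",  "Owen",
    NA,      NA,     "Eve"),
  ncol = 3, byrow = TRUE
)

pedigree_to_parents <- function(ped) {
  n <- nrow(ped)  # assumed to equal 2^(ncol(ped) - 1)
  rows <- lapply(seq_len(ncol(ped) - 1), function(j) {
    block  <- n / 2^(j - 1)           # rows spanned by each person in column j
    starts <- seq(1, n, by = block)   # first row of each person's block
    data.frame(
      person = ped[starts, j],
      father = ped[starts, j + 1],              # top half of the block
      mother = ped[starts + block / 2, j + 1]   # bottom half of the block
    )
  })
  do.call(rbind, rows)
}

parents <- pedigree_to_parents(ped)
```

People in the last column get no row of their own because their parents are not on the chart. As the PS says, real data would want unique IDs rather than names.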

Is there an R function to remove repetition **within** observation?

I have a large dataset that contains one column called "TYPE_DESCRIPTION" that describes the type of activity of each observation.
However, the raw dataset that I obtained somehow may contain more than one repetition of the same activity within the "TYPE_DESCRIPTION" column.
Let's say for one observation, the activity (or value) shown within the "TYPE_DESCRIPTION" column can contain "Walking, Walking, Walking, Walking", instead of just "Walking". How do I remove the repetition of "Walking" within that column so I only have the value once?
I have tried the distinct() function, but it defines the "Walking, Walking, Walking, Walking" as one unique value. Whereas what I want is just "Walking".
This becomes a problem later, when I want to add a new column using mutate() that groups the activities into higher-order categories. Since I only write "Walking" in the code, it does not recognize the variations of "Walking" with different numbers of repetitions, and it puts them under a different category from the one I need.
Thanks.
In base R:
df <- data.frame(TYPE_DESCRIPTION = c("Walking,Walking, Walking, Walking", "Running, Walking"))
transform(df, uniq = sapply(strsplit(TYPE_DESCRIPTION, ', ?'), \(x) toString(unique(x))))
                   TYPE_DESCRIPTION             uniq
1 Walking,Walking, Walking, Walking          Walking
2                  Running, Walking Running, Walking

R: Reclin Package: Is there a way to keep the weights generated in score_problink() and used in select_n_to_m() after using the link() function?

I am trying to perform a record linkage on 2 datasets containing company names. While Reclin does a very good job indeed, the linked data will need some manual cleaning and because I will most likely have to clean about 3000 rows in a day or 2 it would be great to keep the weights generated in the reclin process as shown below:
CH_ecorda_to_Patstat_left <- pair_blocking(companies_x, companies_y) %>%
  compare_pairs(by = "nameor", default_comparator = jaro_winkler()) %>%
  score_problink() %>%
  select_n_to_m() %>%
  link(all_x = TRUE, all_y = FALSE)
I know these weights are kept up until I call the link() function. I would like to add the weights from comparing the variable "nameor" to the linked data, so that I can sort it in ascending order of weight and find mistakes in the attempted matches more quickly.
For context: I need to find out how many companies in companies_x have filed patents in the patent database companies_y. I don't need to know how often they filed, just whether there are any at all. So I need matches of x to y. However, I don't know the true number of matches, and not every company in companies_x will have a match, so some manual cleaning will be necessary because n_to_m forces a match for each entry even if there should be none.
Try doing something like this, where paired is the output of compare_pairs():
weight <- problink_em(paired)
paired <- score_problink(paired, weight)
The fitted model is now stored in weight, and paired holds the scored pairs.

Set up automatic process for R to read directory and process?

I am so very very new to R. Like had to look up how to open a file in R new. Diving in the deep end. Anyway
I have a bunch of .csv files with results that I need to analyse. Really, I would like to set up some kind of automation so I can just say "go" (a function?)
Basically I have results in one file that are -particle.csv and another that are -ROI.csv. They have the same names so I know which ones match up (e.g. brain1 section1 -particle.csv and brain1 section1 -ROI.csv). I need to do some maths using these two datasets - Divide column 2 rows 2-x in -particle.csv (the row number might change but is there a way of saying row "2-No more content"?) by column 1, 5, 10, etc. row 2 in -ROI.csv (the column number will always stay the same but if it helps they are all called Area1, Area2, Area3,... the number of Area columns can vary but surely there's a way I can say "every column that begins with Area"? Also the area count and the row count will always match up)
Okay, I'm fine to do that manually for each set of results, but I have over 300 brains to analyse! Is there any way I can set it up as a process that I can apply to these and future results that will be in the same format?
Sorry if this is a huge ask!
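One possible shape for this, as a rough sketch rather than a tested solution for your real files: list every -particle.csv in a folder, find its matching -ROI.csv by name, and apply the same division to each pair. It assumes "rows 2-x" means every data row under the header, that the ROI file has a single data row, and that the i-th particle value is divided by the i-th Area column. process_pair and process_dir are hypothetical helper names, and the demo files are made up.

```r
# Divide column 2 of the particle file by the Area* columns of the ROI file
process_pair <- function(particle_file, roi_file) {
  particle <- read.csv(particle_file)
  roi <- read.csv(roi_file)
  # "every column that begins with Area"
  area_cols <- grep("^Area", names(roi), value = TRUE)
  areas <- unlist(roi[1, area_cols], use.names = FALSE)
  # the question says area count and row count always match; fail loudly if not
  stopifnot(length(areas) == nrow(particle))
  data.frame(area = area_cols, ratio = particle[[2]] / areas)
}

# Run process_pair on every matching pair of files in a directory
process_dir <- function(dir) {
  particle_files <- list.files(dir, pattern = "-particle\\.csv$", full.names = TRUE)
  roi_files <- sub("-particle\\.csv$", "-ROI.csv", particle_files)
  has_roi <- file.exists(roi_files)  # skip particle files with no ROI partner
  results <- Map(process_pair, particle_files[has_roi], roi_files[has_roi])
  names(results) <- sub(" ?-particle\\.csv$", "", basename(particle_files[has_roi]))
  results
}

# Demo with made-up files in a temporary folder
demo_dir <- file.path(tempdir(), "brains_demo")
dir.create(demo_dir, showWarnings = FALSE)
write.csv(data.frame(id = 1:2, count = c(10, 20)),
          file.path(demo_dir, "brain1 section1 -particle.csv"), row.names = FALSE)
write.csv(data.frame(Area1 = 2, Mean1 = 0, Area2 = 5),
          file.path(demo_dir, "brain1 section1 -ROI.csv"), row.names = FALSE)

results <- process_dir(demo_dir)
```

The result is a named list with one small data frame per brain section, so "go" becomes a single call such as process_dir("path/to/results").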

Using R, randomly pairing rows in different categories without repeats

Thank you to everyone who commented already! I've edited my post with better code and hopefully some clarity on what I'm trying to do. (I appreciate all of the feedback - this is my first time asking a question here!)
I have a very similar question to this one here (Random Pairings that don't Repeat) but am trying to come up with a function or piece of code that I can run in R to create the pairings. Essentially, I have a pool of employees and want to come up with a way to randomly generate pairs of employees to meet every month, with no pairs repeating in future months/runs of the function. (I will need to maintain the history of previous pairings.) The catch is that each employee is assigned to a working location, and I only want matches between employees from different locations.
I've gone through a number of previous queries on randomly sampling data sets in R and am comfortable with generating a random pair from my data, or pulling out an existing working group, but it's the "generating a pair that ALWAYS comes from a different group" that's tripping me up, especially since the different groups/locations have different numbers of employees so it's hard to sort the groups evenly.
Here's my dummy data, which currently has 10 "employees". The actual data set currently has over 100 employees with more being added to the pool each month:
ID <- (1:10)
Name <- c("Sansa", "Arya", "Hodor", "Jamie", "Cersei", "Tyrion", "Jon", "Sam", "Dany", "Drogo")
Email <- c("a#a.com","b#b.com", "c#c.com", "d#d.com", "e#e.com", "f#f.com", "g#g.com", "h#h.com", "i#i.com", "j#j.com")
Location <- c("Winterfell", "Winterfell", "Winterfell", "Kings Landing", "Kings Landing", "Kings Landing",
              "The Wall", "The Wall", "Essos", "Essos")
df <- data.frame(ID, Name, Email, Location)
Basically, I want to write something that would say that Sansa could be randomly paired with anyone who is not Arya or Hodor, because they're in the same location, and that Jon could be paired with anyone but Sam (i.e., anyone whose location is NOT Winterfell.) If Jon and Arya were paired up once, I would like them to not be paired up again going forward, so at that point Jon could be paired with anyone but Sam or Arya. I hope I'm making sense.
I was able to run the combn function on the ID column to generate groups of two, but it doesn't account for the historical pairings that we're trying to avoid.
Is there a way to do this in R? I've tried this one (Using R, Randomly Assigning Students Into Groups Of 4) but it wasn't quite right for my needs, again because of the historical pairings.
Thank you for any help you can provide!
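One way this could be approached, sketched under some assumptions: build every possible pair with combn(), drop same-location pairs and pairs already in the history, shuffle what is left, and greedily accept pairs whose members are both still free. pair_employees and the id1/id2 history columns are hypothetical names. A greedy pass does not guarantee the largest possible number of pairs, so some people may be left unpaired in a given month.

```r
ID <- 1:10
Location <- c("Winterfell", "Winterfell", "Winterfell",
              "Kings Landing", "Kings Landing", "Kings Landing",
              "The Wall", "The Wall", "Essos", "Essos")
df <- data.frame(ID, Location)

# Order-independent key so that the pair (1, 7) equals the pair (7, 1)
pair_key <- function(a, b) paste(pmin(a, b), pmax(a, b))

pair_employees <- function(df, history = data.frame(id1 = integer(), id2 = integer())) {
  # all unordered pairs of IDs
  cand <- as.data.frame(t(combn(df$ID, 2)))
  names(cand) <- c("id1", "id2")
  # look up each ID's location (case normalised as a precaution)
  loc <- setNames(tolower(df$Location), df$ID)
  cand <- cand[loc[as.character(cand$id1)] != loc[as.character(cand$id2)], ]
  # drop pairs that already met
  cand <- cand[!pair_key(cand$id1, cand$id2) %in% pair_key(history$id1, history$id2), ]
  # random order, then greedily keep pairs whose members are both unused
  cand <- cand[sample(nrow(cand)), ]
  used <- integer()
  keep <- logical(nrow(cand))
  for (i in seq_len(nrow(cand))) {
    ids <- c(cand$id1[i], cand$id2[i])
    if (!any(ids %in% used)) {
      keep[i] <- TRUE
      used <- c(used, ids)
    }
  }
  cand[keep, ]
}

set.seed(1)
month1 <- pair_employees(df)
month2 <- pair_employees(df, history = month1)
```

To keep the history going, append each month's result to the history data frame (for example with rbind()) and save it between runs.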
