Is there an R function to remove repetition **within** observation? - r

I have a large dataset that contains one column called "TYPE_DESCRIPTION" that describes the type of activity of each observation.
However, the raw dataset that I obtained somehow may contain more than one repetition of the same activity within the "TYPE_DESCRIPTION" column.
Let's say for one observation, the activity (or value) shown within the "TYPE_DESCRIPTION" column can contain "Walking, Walking, Walking, Walking", instead of just "Walking". How do I remove the repetition of "Walking" within that column so I only have the value once?
I have tried the distinct() function, but it defines the "Walking, Walking, Walking, Walking" as one unique value. Whereas what I want is just "Walking".
This became a problem when later I want to add a new column using mutate() that groups the activity into higher order and write "Walking" in the codes. Since I only write "Walking" on the code, it does not recognize the variation of 'Walking' with different repetition and put it under different category that I need it to be.
Thanks.

in Base R:
transform(df, uniq=sapply(strsplit(TYPE_DESCRIPTION, ', ?'), \(x)toString(unique(x))))
TYPE_DESCRIPTION uniq
1 Walking,Walking, Walking, Walking Walking
2 Running, Walking Running, Walking

Related

How do I detect a specific string within five consecutive columns in a df with R, and mutate the value or copy it into another column?

I'm trying to write a script in which within my df, I want to find in several consecutive columns (e.g. 16:25) a specific string which is a combination of "SA (string and spaces within the bracket)". If the string is present in any of the columns, it will only appear in one of them.
When that string is detected, I would like it to be moved or copied to a different column named "SA_status".
If it is not found in any of the columns, then I'd like the value for that row on SA_status to appear as NA.
I've tried to create a for loop to search in the first column, and if it wasn't found, in the next one and so on; and if it was found to stop there. Then, I have used a combination of the functions mutate(), case_when(), str_detect() to copy that value into the new column.
Finally, if after doing this process the values in the new column were still blank, I've assigned them to NA.
However, I must be doing something wrong because it is not working and I'm getting really desperate.
Could you please give me a hand? Cheers!

How can I return a vector with a dataframe inside in R?

Here is a challenge for you: I was trying to make a tic tac toe based on R. First, the players have to configure putting in the name of the players, and the game should check if the name exists in a file called "Players.txt" (if not, the game will create one), if the name exists, the game will ask for a new one. The last part of the game is that the game should record all the punctuation of the players (each gambling chip used will subtract 5 points of 100 that the player has at the beginning of the game). The problem is when a player wins, the game shows the following error: "Error in table[location_name1, 3]: Incorrect number of dimension in R".
A vector can either be atomic or a list. Atomic vectors can only contain elements of one and the same data type. That means, you are "accidentally" creating a list with
vector=c(win,name1,name2,table)
with the result that each column of the data frame should become an entry.
You can solve it with
vector <- list(win, name1, name2, table)
vector is still a list but now it has the format I believe you want.
Having done that you still get errors. The reason is that these assignments fail.
location_name1=which(grepl(name1,table$gamers))
location_name2=which(grepl(name2,table$gamers))
They return an empty vector because earlier in the code you set win=vector[1]... table=vector[4]. Since vector is now a list, you have to subset it accordingly. That means you have to chance the statements to table=vector[[4]].
Now you are going to get another problem. The reason is that you treat the columns table$scores as text. When you read the data you need to make sure that this columns is not interpreted as text. You also have to eliminate all statements that coerce the column into text. Otherwise table[location_name1,3]=table[location_name1,3]+pointsx will obviously fail because you cannot add a number to a string.
For example, you coerce the column into a character column with this statement:
name1 <- data.frame(gamers=name1,games="1",scores="100")
games and scores are strings not numbers. Another example is the assigment after reading the table from the file. You can make sure that scoresare numeric by doing this.
scores <- as.numeric(table[,3])
Please get familiar with Rstudio debugging capabilities (https://support.rstudio.com/hc/en-us/articles/205612627-Debugging-with-RStudio). This way you can go through your code line by line and check consequences of each assignment to the data frame.

Conditionally selecting duplicates and selectively choosing which duplicate to keep

I am currently working with a large data frame for a study involving 486 observations with 7 variables in R studio (see attached image for a sample of dummy data). Two of the variables are ID numbers: the first ID number is unique for each individual person. However, sometimes a person appears more than once on the list, so there are duplicates of this first ID number. The second ID number is also unique for each person, but it lists a study visit number as a suffix (to signify that they came in over a period of time). Thus, this second ID number does not have duplicates. For example, say there was someone with a first ID number of 0001, and they have two second ID numbers of xx0001_1 (for their first visit to the lab) and xx0001_2 (for their second visit to the lab). However, the vast majority of people in the data frame only came in for one visit, so they do not have a duplicate first ID number (or second ID number, in that case).
My problem is this: I want to select only those that have duplicate first ID numbers and then filter out each of their data points such that I only keep their second visit. For example, I need to only select those with duplicate first ID numbers such as "0001", and among those I want to filter out the "xx0001_1" visit and only keep the "xx0001_2" visit.
I know that I can use the "distinct" function for the first ID number, but the problem is that this eliminates the second of the second ID numbers (e.g. xx0001_2) and keeps the first one (e.g. xx0001_1). I want to do the opposite: I want to eliminate duplicates of the first ID number, but only keep the rows that have the second visit in their second ID number (e.g. keep xx0001_2 but remove xx0001_1). I'm very new to R/coding. How can I do this? Thank you!
Example data sample
distinct(.data, ID 1, .keep_all = T)
However, this code only preserves the first row. I want to eliminate the first row and instead only preserve the second row. Thanks!

Complex Search in R

This is an unusual and difficult question which has perplexed me for a number of days and I hope I explain it correctly. I have two databases i.e. data-frames in R, the first is approx 90,000 rows and is a record of every race-horse in the UK. It contains numerous fields, and most importantly the NAME of each horse and its SIRE; one record per horse First database, sample and fields. The second database contains over one-million rows and is a history of every race a horse has taken part in over the last ten years i.e. races it has run or as I call it 'appearances', it contains NAME, DATE, TRACK etc..; one record per appearance.Second database, sample and fields
What I am attempting to do is to write a few lines of code - not a loop - that will provide me with a total number of every appearance made by the siblings of a particular horse i.e. one grand total. The first step is easy - finding the siblings i.e. horses with a common sire - and you can see it below (N.B FindSire is my own function which does what it says and finds the sire of a horse by referencing the same dataframe. I have simplified the code somewhat for clarity)
TestHorse <- "Save The Bees"
Siblings <- which(FindSire(TestHorse) == Horses$Sire)
Sibsname <- Horses[sibs,1]
The produces Sibsname which is a 636 names long (snippet below), although the average horse will only have 50 or so siblings. I could construct a loop and search the second 'appearances' data-frame and individually match the sibling names and then total the appearances of all the siblings combined. However, I would like to know if I could avoid a loop - and the time associated with it - and write a few lines of code to achieve the same end i.e. search all 636 horses in the appearances database and calculate the times each appears in the database and a total of all these appearances, or to put it another way, how many races have the siblings of "save the bees" taken part in. Thanks in advance.
[1] "abdication " "aberdonian " "acclamatory " "accolation " ..... to [636]
Using dplyr, calling your "first database" horses and your "second database" races:
library(dplyr)
test_horse = "Save The Bees"
select(horses, Name, Sire) %>%
filter(Sire == Sire[Name == tolower(test_horse)]) %>%
inner_join(races, c("Name" = "SELECTION_NAME")) %>%
summarize(horse = test_horse, sibling_group_races = n())
I am making the assumption that you want the number of appearances of the sibling group to include the appearances of the test horse - to omit them instead add , Name != tolower(test_horse) to the filter() command.
As you haven't shared data reproducibly, I cannot test the code. If you have additional problems I will not be able to help you solve them unless you share data reproducibly. ycw's comment has a helpful link for doing that - I would encourage you to edit your question to include either (a) code to simulate a small sample of data, or (b) use dput() on an small sample of your data to share a few rows in a copy/pasteable format.
The code above will do for querying one horse at a time - if you intend to use it frequently it would be much simpler to just create a table where each row represents a sibling group and contains the number of races. Then you could just reference the table instead of calculating on the fly every time. That would look like this:
sibling_appearances =
left_join(horses, races, by = c("Name" = "SELECTION_NAME")) %>%
group_by(Sire) %>%
summarize(offspring_appearances = n())

Subsetting rows, changing values, and placing them back into matrix?

I hope this has not been answered, but when I search for a solution to my problem I am not getting any results.
I have a data.frame of 2000+ observations and 20+ columns. Each row represents a different observation and each column represents a different facet of data for that observation. My objective is to iterate through the data.frames and select observations which match criteria (eg. I am trying to pick out observations that are in certain states). After this, I need to subtract or add time to convert it to its appropriate time zone (all of the times are in CST). What I have so far is an exorbitant amount of subsetting commands that pick out the rows that are of the state being checked against. When I try to write a for loop I can only get one value returned, not the whole row.
I was wondering if anyone had any suggestions or knew of any functions that could help me. I've tried just about everything, but I really don't want to have to go through each state of observations and modify the time. I would prefer a loop that could easily go through the data, select rows based on their state, subtract or add time, and then place the row back into its original data.frame (replacing the old value).
I appreciate any help.

Resources