Conditionally selecting duplicates and selectively choosing which duplicate to keep

I am currently working with a large data frame for a study involving 486 observations with 7 variables in RStudio (see attached image for a sample of dummy data). Two of the variables are ID numbers: the first ID number is unique for each individual person. However, sometimes a person appears more than once on the list, so there are duplicates of this first ID number. The second ID number is also unique for each person, but it carries a study visit number as a suffix (to signify that they came in over a period of time). Thus, this second ID number does not have duplicates. For example, say there was someone with a first ID number of 0001, and they have two second ID numbers of xx0001_1 (for their first visit to the lab) and xx0001_2 (for their second visit to the lab). However, the vast majority of people in the data frame only came in for one visit, so they do not have a duplicate first ID number (and only one second ID number).
My problem is this: I want to select only those that have duplicate first ID numbers and then filter out each of their data points such that I only keep their second visit. For example, I need to only select those with duplicate first ID numbers such as "0001", and among those I want to filter out the "xx0001_1" visit and only keep the "xx0001_2" visit.
I know that I can use the "distinct" function for the first ID number, but the problem is that this eliminates the second of the second ID numbers (e.g. xx0001_2) and keeps the first one (e.g. xx0001_1). I want to do the opposite: I want to eliminate duplicates of the first ID number, but only keep the rows that have the second visit in their second ID number (e.g. keep xx0001_2 but remove xx0001_1). I'm very new to R/coding. How can I do this? Thank you!
distinct(.data, `ID 1`, .keep_all = TRUE)
However, this code only preserves the first row. I want to eliminate the first row and instead only preserve the second row. Thanks!
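One possible approach, as a minimal base R sketch: pull the visit number out of the suffix of the second ID, keep only people whose first ID appears more than once, and then keep each person's highest-numbered visit. The column names id1, id2 and score and the dummy rows are assumptions for illustration, not the real data.

```r
# Hypothetical dummy data; id1, id2 and score are assumed column names
df <- data.frame(
  id1   = c("0001", "0001", "0002", "0003", "0003"),
  id2   = c("xx0001_1", "xx0001_2", "xx0002_1", "xx0003_1", "xx0003_2"),
  score = c(10, 12, 7, 5, 9),
  stringsAsFactors = FALSE
)

# Visit number = everything after the last underscore in the second ID
df$visit <- as.integer(sub(".*_", "", df$id2))

# First IDs that occur more than once (people with repeat visits)
dup_ids <- unique(df$id1[duplicated(df$id1)])

# Keep only those people ...
repeaters <- df[df$id1 %in% dup_ids, ]

# ... and, within each person, only the row with the largest visit number
latest <- ave(repeaters$visit, repeaters$id1, FUN = max)
second_visits <- repeaters[repeaters$visit == latest, ]

print(second_visits)
```

If every repeat visitor has exactly two visits, filtering on visit == 2 would give the same rows; taking the maximum per person is just a little more forgiving if someone came in three times.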

Related

Is there an R function to remove repetition within observation?

I have a large dataset that contains one column called "TYPE_DESCRIPTION" that describes the type of activity of each observation.
However, the raw dataset that I obtained sometimes contains the same activity repeated within a single value of the "TYPE_DESCRIPTION" column.
Let's say for one observation, the activity (or value) shown within the "TYPE_DESCRIPTION" column can contain "Walking, Walking, Walking, Walking", instead of just "Walking". How do I remove the repetition of "Walking" within that column so I only have the value once?
I have tried the distinct() function, but it treats "Walking, Walking, Walking, Walking" as one unique value, whereas what I want is just "Walking".
This became a problem later, when I wanted to add a new column using mutate() that groups the activities into higher-order categories. Since I only write "Walking" in the code, it does not recognize the variants of "Walking" with different numbers of repetitions, and it does not put them under the category I need.
Thanks.

In base R:
transform(df, uniq=sapply(strsplit(TYPE_DESCRIPTION, ', ?'), \(x)toString(unique(x))))
                   TYPE_DESCRIPTION             uniq
1 Walking,Walking, Walking, Walking          Walking
2                  Running, Walking Running, Walking
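For anyone who wants to try this end to end, here is a self-contained sketch of the same idea that avoids the \(x) shorthand (which needs R 4.1 or later). The sample rows, the helper name dedupe and the is_walking_only column are made up for illustration.

```r
# Hypothetical sample data mirroring the output shown above
df <- data.frame(
  TYPE_DESCRIPTION = c("Walking,Walking, Walking, Walking", "Running, Walking"),
  stringsAsFactors = FALSE
)

# Split one value on commas, trim stray spaces, drop repeats, re-join
dedupe <- function(x) {
  parts <- trimws(strsplit(x, ",", fixed = TRUE)[[1]])
  toString(unique(parts))
}

df$uniq <- vapply(df$TYPE_DESCRIPTION, dedupe, character(1), USE.NAMES = FALSE)

# The cleaned column can now be matched against a plain "Walking"
df$is_walking_only <- df$uniq == "Walking"

print(df)
```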

Complex Search in R

This is an unusual and difficult question which has perplexed me for a number of days, and I hope I explain it correctly. I have two databases, i.e. data frames in R. The first is approx. 90,000 rows and is a record of every racehorse in the UK. It contains numerous fields, most importantly the NAME of each horse and its SIRE; one record per horse. The second database contains over one million rows and is a history of every race a horse has taken part in over the last ten years, i.e. races it has run, or as I call it, 'appearances'. It contains NAME, DATE, TRACK etc.; one record per appearance.
What I am attempting to do is to write a few lines of code - not a loop - that will provide me with a total number of every appearance made by the siblings of a particular horse i.e. one grand total. The first step is easy - finding the siblings i.e. horses with a common sire - and you can see it below (N.B FindSire is my own function which does what it says and finds the sire of a horse by referencing the same dataframe. I have simplified the code somewhat for clarity)
TestHorse <- "Save The Bees"
Siblings <- which(FindSire(TestHorse) == Horses$Sire)
Sibsname <- Horses[Siblings, 1]
This produces Sibsname, which is 636 names long (snippet below), although the average horse will only have 50 or so siblings. I could construct a loop and search the second 'appearances' data frame, individually match the sibling names, and then total the appearances of all the siblings combined. However, I would like to know if I could avoid a loop - and the time associated with it - and write a few lines of code to achieve the same end, i.e. search for all 636 horses in the appearances database and calculate the number of times each appears, plus a total of all these appearances. To put it another way: how many races have the siblings of "save the bees" taken part in? Thanks in advance.
[1] "abdication " "aberdonian " "acclamatory " "accolation " ..... to [636]

Using dplyr, calling your "first database" horses and your "second database" races:
library(dplyr)
test_horse = "Save The Bees"
select(horses, Name, Sire) %>%
filter(Sire == Sire[Name == tolower(test_horse)]) %>%
inner_join(races, c("Name" = "SELECTION_NAME")) %>%
summarize(horse = test_horse, sibling_group_races = n())
I am making the assumption that you want the number of appearances of the sibling group to include the appearances of the test horse - to omit them instead add , Name != tolower(test_horse) to the filter() command.
As you haven't shared data reproducibly, I cannot test the code. If you have additional problems I will not be able to help you solve them unless you share data reproducibly. ycw's comment has a helpful link for doing that - I would encourage you to edit your question to include either (a) code to simulate a small sample of data, or (b) the output of dput() on a small sample of your data, to share a few rows in a copy/pasteable format.
The code above will do for querying one horse at a time - if you intend to use it frequently it would be much simpler to just create a table where each row represents a sibling group and contains the number of races. Then you could just reference the table instead of calculating on the fly every time. That would look like this:
sibling_appearances =
left_join(horses, races, by = c("Name" = "SELECTION_NAME")) %>%
group_by(Sire) %>%
summarize(offspring_appearances = n())
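If you would rather stay in base R, the loop-free lookup can be sketched with %in% and table(). The tiny data frames below are invented for illustration, and the column names Name, Sire and SELECTION_NAME are the same assumptions as in the dplyr code above. Unlike the first dplyr snippet, this version leaves the test horse itself out of the count.

```r
# Hypothetical miniature versions of the two data frames
horses <- data.frame(
  Name = c("save the bees", "abdication", "aberdonian", "other horse"),
  Sire = c("sire a", "sire a", "sire a", "sire b"),
  stringsAsFactors = FALSE
)
races <- data.frame(
  SELECTION_NAME = c("abdication", "abdication", "aberdonian",
                     "other horse", "save the bees"),
  stringsAsFactors = FALSE
)

test_horse <- "Save The Bees"

# Sire of the test horse, then every other horse with the same sire
sire <- horses$Sire[horses$Name == tolower(test_horse)]
sibling_names <- horses$Name[horses$Sire == sire &
                             horses$Name != tolower(test_horse)]

# Appearances per sibling; the factor levels keep siblings with zero races
per_sibling <- table(factor(races$SELECTION_NAME, levels = sibling_names))

# Grand total across all siblings
total_appearances <- sum(per_sibling)

print(per_sibling)
print(total_appearances)
```

Both table() and %in% are vectorized, so no explicit loop over the sibling names is needed.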

Subsetting rows, changing values, and placing them back into matrix?

I hope this has not been answered, but when I search for a solution to my problem I am not getting any results.
I have a data.frame of 2000+ observations and 20+ columns. Each row represents a different observation and each column represents a different facet of data for that observation. My objective is to iterate through the data.frames and select observations which match criteria (eg. I am trying to pick out observations that are in certain states). After this, I need to subtract or add time to convert it to its appropriate time zone (all of the times are in CST). What I have so far is an exorbitant amount of subsetting commands that pick out the rows that are of the state being checked against. When I try to write a for loop I can only get one value returned, not the whole row.
I was wondering if anyone had any suggestions or knew of any functions that could help me. I've tried just about everything, but I really don't want to have to go through each state of observations and modify the time. I would prefer a loop that could easily go through the data, select rows based on their state, subtract or add time, and then place the row back into its original data.frame (replacing the old value).
I appreciate any help.
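One way to avoid both the per-state subsetting and the loop, as a rough sketch: keep a named vector of hour offsets per state and index it by the state column, so the whole time column is adjusted in one vectorized step. The column names state and time, the sample rows and the offsets are assumptions for illustration; the sketch treats the times as plain clock times and ignores daylight saving and states that span more than one time zone.

```r
# Hypothetical data: every time is recorded as a Central (CST) clock time
obs <- data.frame(
  state = c("IL", "NY", "CA", "CO"),
  time  = as.POSIXct(rep("2020-01-15 12:00:00", 4), tz = "UTC"),
  stringsAsFactors = FALSE
)

# Assumed hour offsets relative to Central time, one entry per state
offset_hours <- c(IL = 0, NY = 1, CA = -2, CO = -1)

# Look up each row's offset by state and shift the time in one step
obs$local_time <- obs$time + unname(offset_hours[obs$state]) * 3600

print(obs)
```

Because the result is written into a new column of the same data frame, nothing needs to be pulled out and placed back afterwards; assigning to obs$time instead would overwrite the old values in place.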

ISSP data: calculating percentage of respondent answers on a particular item

Probably a pretty basic question, and hopefully one not repeated elsewhere. I’m looking at some ISSP survey data in R, and I made a separate data frame for respondents who answered “Government agencies” on one of the questions:
gov.child<-data[data$"V33"=="Government agencies",]
Then I used the table function to see how many total respondents answered that way in each country (C_ALPHAN is the variable name for country):
table(gov.child$C_ALPHAN)
Then I made a matrix of this table:
gov.child.matrix<-as.matrix(table(gov.child$C_ALPHAN))
So I now have a two-column matrix with just the two-letter country code (the C_ALPHAN code) and the number of people who answered “Government agencies.” But I want to know what percentage of respondents in those countries answered that way, so I need to divide this number by the total number of respondents for that country.
Is there some way (a function maybe?) to, after adding a new column, tell R that for each row, it has to divide the number in column two by the total number of rows in the original data set that correspond to the country code in column one (i.e., the n for that country)? Or should I just manually make a vector with the n for each country, which is available on the ISSP website, and add it to the matrix? I'm loath to do that because of the possibility of making a data entry error, but maybe that's the best way.
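A possible way to get the percentages without typing in the country totals by hand, sketched in base R: the mean of a logical vector is the share of TRUE values, so tapply() can compute that share per country directly from the original data frame. The rows below are invented, and the answer categories other than "Government agencies" are placeholders.

```r
# Hypothetical miniature survey data using the variable names from the question
data <- data.frame(
  C_ALPHAN = c("DE", "DE", "DE", "DE", "US", "US"),
  V33 = c("Government agencies", "Family members", "Private providers",
          "Family members", "Government agencies", "Family members"),
  stringsAsFactors = FALSE
)

# Share of respondents per country who answered "Government agencies"
pct_gov <- tapply(data$V33 == "Government agencies", data$C_ALPHAN, mean) * 100

print(pct_gov)
```

If V33 contains missing values, mean() returns NA for the affected countries; whether to drop those respondents (for example by passing na.rm = TRUE through tapply()) or to count them in the denominator is a judgment call about the survey.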

BIRT Designer: Determining Percentage of Total for Values in a Column

I have a data set in BIRT Designer with two columns, one with day of week abbreviation names (Su, M, Tu, etc.) and the other with numerical representations of those days of the week starting at 0 and going to 6 (0, 1, 2, etc.). I want to determine what percentage of the total number of rows that each day of week represents. For example, if I have 100 total rows and 12 of those rows correspond to Su/0, 12% of the total rows are made up of Su.
I would like to perform this same calculation within BIRT and graph (bar graph) those percentages that each day consists of out of the total. I'm just learning how to use BIRT and assume that I need to do some scripting either when making my data set or when specifying the rows when making the chart. Any tips would be greatly appreciated.

Use computed columns.
Edit Data set > Computed Columns
The simplest way is to add one computed column per day of the week that counts the rows for that day. Each column adds 1 to the count only if the day of the week is a specific value:
if (row["Day"] == "Su"){
1
}
I should add that you can use a 'data' element in your table to compute the percentage. A 'Dynamic Text' item could also be used, but the data item gives you a bound value that you can make better use of later if needed.
Edit
To get a total row count, use a computed column. I name mine 'All'.
For the Expression use the value "1"

With some inspiration from James Jenkins, I think I found my answer. It turned out to be pretty simple: all I needed to do was make a new computed column and, instead of adding an expression, set the Aggregation to "COUNT". That counts all of the rows in your table and puts that total on each row, so you can use the total in any calculations that you may need to do. I have added a screenshot for clarity.
