library(tidyverse)
Using the sample data below, I want to use dplyr::distinct() based on a condition. I want to eliminate duplicates in the ID column, keeping only the row with the lowest value of "Rate" within each ID. For example, for "A1A1", the row with a Rate of 2 should be removed, while for "CC33", the rows with Rate equal to 2 and 3 should be removed. I also want to end up with all columns by using dplyr::distinct() with .keep_all = TRUE.
I tried the code below, but this removes the Subject column.
DF2 %>% group_by(ID) %>% summarise(Min_rate = min(Rate))
I also played around with a group_by, mutate, and if_else, but couldn't get it to work...
DF2 %>% group_by(ID) %>% mutate(if_else(Rate == min(Rate), Rate, distinct(ID)))
Help would be appreciated...
Sample Data:
ID <- c("A1A1", "A22B", "CC33", "D33D", "A1A1", "4DD8", "4DD8", "CC33", "CC33", "56DK", "F4G5", "8Y0R")
Subject <- c("Subject1", "Subject2", "Subject3", "Subject4", "Subject5", "Subject6", "Subject7", "Subject8", "Subject9", "Subject10", "Subject11", "Subject12")
Rate <- c(1, 2, 3, 2, 2, 3, 2, 1, 2, 2, 2, 3)
DF2 <- tibble(ID, Subject, Rate)  # tibble() replaces the deprecated data_frame()
I found a way to accomplish what I want by first using dplyr's "group_by" and "mutate" functions together with "if_else" to flag the smallest value of the Rate variable within each ID group with a 1, and all other values with a 0.
DF2 <- DF2 %>% group_by(ID) %>% mutate(Rate_Min = if_else(Rate == min(Rate), 1, 0))
I then use dplyr's "filter" to remove the 0's.
DF2 <- DF2 %>% filter(Rate_Min == 1)
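For reference, a more direct route is possible. The sketch below starts from the original sample DF2 and assumes nothing beyond it: the first version filters to the minimum Rate within each ID, and the second sorts first so that distinct() with .keep_all = TRUE keeps the lowest-Rate row per ID.
# Sketch: keep only the lowest-Rate row(s) within each ID
DF2 %>%
  group_by(ID) %>%
  filter(Rate == min(Rate)) %>%
  ungroup()
# Sketch: the distinct() route - sort so the lowest Rate comes first,
# then keep the first row per ID with all columns retained
DF2 %>%
  arrange(ID, Rate) %>%
  distinct(ID, .keep_all = TRUE)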
Related
I have a huge dataset which has been difficult to work with.
I want to find the median of a second column, but only based on one value in the first column. I have used this code to find the medians per group, without restricting to a specific value in the first column:
df %>% group_by(column1) %>% summarise(Median = median(column2))
However, there is a specific value in column1 I am hoping to sort by and I only want the medians of the second column based on this first value. Would I do something similar to the below?
df %>% group_by(column1, specificvalue) %>% summarise(Median = median(column2))
Is there an easier way to do this? Would it be easier to make a new dataframe containing only the rows where column1 has the specific value, so that I could easily determine the median of column2 for those rows? How would that be done?
Thanks!!
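One way this could work (a minimal sketch; df, column1, column2, and "specific_value" are placeholder names taken from the question) is to filter down to the rows of interest before summarising:
library(dplyr)
# Sketch: median of column2 restricted to one value of column1
df %>%
  filter(column1 == "specific_value") %>%
  summarise(Median = median(column2, na.rm = TRUE))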
I have a data table with 3 variables and 1 frequency column, and I want to add a proportion column.
Variable 1 has 4 unique values, Variable 2 has 5, and Variable 3 has 2. The frequency column captures the number of times each combination occurs.
But if I apply prop.table to it, it calculates the proportion over the whole data.table, when I really want the proportion within the subsets of Variable 2.
I thought of iterating, but it seems complicated in tables.
You could use the aggregate function (or tapply) to sum all the counts within the categories of variable 2, then use prop.table or similar on the result.
If you want to use the tidyverse instead of base R, then this would be a group_by followed by summarise to sum within each group, then prop.table (or a simple division) again to calculate the proportions.
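A minimal sketch of that tidyverse route, assuming the frequency column is called Freq and the grouping variable is Variable2 (both placeholder names):
library(dplyr)
# Sketch: each row's Freq as a proportion of its Variable2 group total
df %>%
  group_by(Variable2) %>%
  mutate(Prop = Freq / sum(Freq)) %>%
  ungroup()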
I am still new to R and I am attempting to solve a seemingly simple problem. I would like to identify all of the unique combinations of values across several columns, and update an additional column in my df to annotate whether or not each row is unique.
Given a df with columns A-Z, I have used the following code to identify unique combinations of columns A, B, C, D, and E. I am trying to update column F with this information.
unique(df[, c("A", "B", "C", "D", "E")])
This returns each of the rows with unique combinations as expected, but I cannot figure out what the next step should be in order to update column "F" with a value indicating that it is a unique row. Thanks in advance for any pointers!
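One way this could be done (a sketch; it assumes "unique" means the A-E combination appears exactly once in df):
# Sketch: flag rows whose A-E combination occurs only once
cols <- c("A", "B", "C", "D", "E")
dup <- duplicated(df[, cols]) | duplicated(df[, cols], fromLast = TRUE)
df$F <- !dup  # TRUE when the combination is unique, FALSE when it repeats
If "unique" is instead meant as "the first occurrence of each combination", then df$F <- !duplicated(df[, cols]) alone would do.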
This is probably easy, but in a grouped data frame, I'm trying to find the difference in diff.col between the last row and the row where var.col is 'B'. That condition only appears once within each group. I'd like to make that difference a new variable using summarize from dplyr.
my.data <- data.frame(diff.col = 1:10, var.col = c(rep('A', 5), 'B', rep('A', 4)))
I'd like to keep this in dplyr, and I know how to code it except for selecting diff.col where var.col == 'B'.
my.data %>% summarize(new.var = last(diff.col) - ????)
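One way the gap could be filled (a sketch, relying on the stated guarantee that 'B' appears exactly once per group) is to subset diff.col with a logical condition inside summarize:
library(dplyr)
# Sketch: subtract the diff.col value on the row where var.col is 'B'
my.data %>%
  summarize(new.var = last(diff.col) - diff.col[var.col == 'B'])
In a grouped data frame, the same call after group_by() computes the difference within each group.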
I have a dataframe with multiple columns and I want to apply different functions on each column.
I want to calculate the count of column pq110a for each country in the qcountry2 column (me = Mexico, br = Brazil, ar = Argentina). The problem I face here is that I have to filter on these columns. For example, for the sample patients I want -
Count of pq110 when the values are 1 and 2 (for some patients)
Count of pq110 when the value is 3 (for other patients)
Similarly when the value is 6.
For all patients, I want the total count of pq110.
The output I am expecting is a count for each of these groups, and similarly I want this output for each country.
Please suggest how I can do this for the other columns as well, country-wise.
Thanks !!
I guess what you want to do is count how many rows of 'pq110' have each value within each different 'qcountry2'.
So I'll try to use 'tapply' to divide the data into subsets and then use 'table' to count the rows for each distinct value.
tapply(my_data[, "pq110"], INDEX = as.factor(my_data[, "qcountry2"]), FUN = table)
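A dplyr sketch of the same idea, assuming the data frame is called my_data with columns qcountry2 and pq110 as above; the second call mirrors the specific breakdown asked for in the question:
library(dplyr)
# Sketch: count of every pq110 value within each country
my_data %>%
  count(qcountry2, pq110)
# Sketch: per-country counts for specific values plus the non-missing total
my_data %>%
  group_by(qcountry2) %>%
  summarise(
    n_1_2 = sum(pq110 %in% c(1, 2)),
    n_3 = sum(pq110 == 3, na.rm = TRUE),
    n_6 = sum(pq110 == 6, na.rm = TRUE),
    total = sum(!is.na(pq110))
  )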