Count occurrences in one variable based on another when there are duplicated values - r

I know there are many threads answering similar questions to mine, but none of the answers out there are specific enough to what I want to obtain.
I've got the following dataset:
I want to count the number of patients (found in "Var_name") that harbour each mutation (found in "var_id") and display the count in a new column ("var_freq"). I've tried things like:
y <- ALL_merged %>%
  group_by(var_id, Var_name) %>%
  summarise(n_counts = n(), var_freq = sum(var_id == Var_name))
NOTE: In case it's relevant for the answers... I had to convert "var_id" and "Var_name" into characters to make this work because they were factors.
However, this does not give me the output I want. Instead, I get the count of how many times each "var_id" appears per patient since, for each "var_id", the same "Var_name" appears many times (because rows contain additional columns with different information), so the final outcome gives me a higher count than I would expect:
I also want to add this new column to the original dataset; I believe this could be done, for example, by using "mutate", but I'm not sure how to set everything up...
So, in sum, what I want to get is: for each "var_id", how many different "Var_name" values I have, taking into account that the data is duplicated...
Thanks in advance!

It is not entirely clear what you are looking for. It would help to provide data without images (such as by using dput) and to clearly show what your output should be for your example data.
Based on what you describe this might be helpful. First, group_by just var_id alone. Then in summarise, you can include n() to get the number of observations/rows of data for each var_id, and then n_distinct to get the number of unique Var_name for each var_id:
library(dplyr)
df %>%
  group_by(var_id) %>%
  summarise(n_counts = n(),
            var_freq = n_distinct(Var_name))
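Since you also mentioned wanting to add the new column to the original dataset with mutate, here is a minimal sketch of that variant (assuming the ALL_merged data and column names from your question); it keeps every row and attaches the count as a new column:
library(dplyr)
ALL_merged <- ALL_merged %>%
  group_by(var_id) %>%
  mutate(var_freq = n_distinct(Var_name)) %>%  # distinct patients per mutation, repeated on every row
  ungroup()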

Related

group_by and summarize usage in tidyverse package in r

I am analyzing COVID-19 data in R and I want to get the aggregate result of total cases per continent.
total_cases_continent <- covid_data %>%
  select(continent, new_cases) %>%
  group_by(continent) %>%
  summarize(total_cases = sum(new_cases))
I get the result below: instead of presenting total cases for each continent, it only shows the total cases across all continents in one row.
It looks like there might be some issues with the values of your variable "continent". I would recommend checking the class of the variable, as well as making sure all of the values are what you expect them to be. This is probably what is causing the issues in your group_by statement.
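A minimal sketch of those checks, using the covid_data and continent names from the question:
class(covid_data$continent)       # expect "character" or "factor"
unique(covid_data$continent)      # look for NA, "", or unexpected values
sum(is.na(covid_data$new_cases))  # NAs here would also make sum(new_cases) return NA without na.rm = TRUE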

Find all unique rows based on a single column and exclude all duplicate rows

I have two requirements:
find all duplicate values in a single column
find all unique rows [the opposite of the first requirement]; this should not include even a single row from the duplicated pairs
I've been learning for the last two weeks, watching YouTube videos and referring to Stack Overflow and other websites, so I don't know much yet. Please do refer me to any material or courses.
The answer to my first question I found here:
(Find duplicated elements with dplyr)
# All duplicated elements
mtcars %>%
  filter(carb %in% unique(.[["carb"]][duplicated(.[["carb"]])]))
So I want the opposite of this.
Thanks
P.S. I have a non-technical background. I went through a couple of questions and answers here, so I might have seen the answer, or something that needed only a few tweaks, and totally missed it.
As you probably realised, unique and duplicated don't quite do what you need, because they essentially cause the retention of all distinct values, and just collapse "multiple copies" of such values.
For your first question, you can group_by the column that you’re interested in, and then retain just those groups (via filter) which have more than one row:
mtcars %>%
  group_by(mpg) %>%
  filter(length(mpg) > 1) %>%
  ungroup()
This example selects all rows for which the mpg value is duplicated. This works because, when applied to groups, dplyr operations such as filter work on each group individually. This means that length(mpg) in the above code will return the length of the mpg column vector of each group, separately.
To invert the logic, it’s enough to invert the filtering condition:
mtcars %>%
  group_by(mpg) %>%
  filter(length(mpg) == 1) %>%
  ungroup()
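Equivalently, dplyr's n() helper returns the number of rows in the current group, so it can replace length(mpg) in either filter; a quick sketch of the unique-rows case:
library(dplyr)
mtcars %>%
  group_by(mpg) %>%
  filter(n() == 1) %>%  # keep only groups with exactly one row
  ungroup()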

In R, how to sum multiple columns based on value of another column?

In R, I have a data frame with one variable (the name of a country), a number of variables (population, number of cars, etc.), and then a column that represents region.
I would like to sum the variables (1, 2, ...) based on the value of the last column (region). I think this should be possible with dplyr and summarise_each, but I cannot get it to work.
Would someone be able to help me please? Thanks a lot.
Reading the question as written (although this may change if you can put together a sample of your data frame):
library(dplyr)
summarized_df <- df %>%
  group_by(region) %>%
  summarise(var1 = sum(variable1), var2 = sum(variable2), var3 = sum(variable3))
If this doesn't seem to work, maybe you can post your code and the errors even if you can't post the dataframe.
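If there are many numeric columns to sum, a hedged sketch using across() (available in dplyr 1.0.0 and later) avoids spelling each one out; df and region are the names assumed above:
library(dplyr)
summarized_df <- df %>%
  group_by(region) %>%
  summarise(across(where(is.numeric), ~ sum(.x, na.rm = TRUE)))  # sum every numeric column per region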

R code for creating a variable for accuracy/percentages

I am having some trouble with R code for a variable I am trying to add to my data frame. Essentially, participants responded to two classes of stimuli (A and B) and their responses could either be correct or incorrect. The important variables (columns) in my data set are: ID (participants' ID), stimtype (A or B), and response (correct or incorrect).
What I want to do is, for each participant, create two "accuracy score" variables (columns): one listing the accuracy percentage for stimulus type A, and one for stimulus type B.
I can get those percentages fairly easily using table functions, but am having difficulty creating those variables in my dataset. Any advice very much appreciated, thank you!!!
If you have a data.frame mydata with character stimtypes and a TRUE/FALSE response, you can use
library(dplyr)
result <- mydata %>%
  group_by(ID, stimtype) %>%
  summarize(pct_response = 100 * mean(response, na.rm = TRUE))
This interprets the logical responses (TRUE/FALSE) as 1/0, so taking the mean gives you the percentage for a given ID and stimtype. However, the result will have two rows per ID, one for each stimtype. If you want the results in two columns, you can use tidyr::spread:
library(tidyr)
result %>%
  spread(key = stimtype, value = pct_response)
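On tidyr 1.0.0 and later, pivot_wider() is the recommended successor to spread() and performs the same reshaping here:
library(tidyr)
result %>%
  pivot_wider(names_from = stimtype, values_from = pct_response)  # one column per stimulus type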

How to avoid for loop in R when altering a column

I'm working with a data frame that looks very similar to the below:
(Image of the data frame here; unfortunately I don't have enough reputation yet to embed it.)
This is a 600,000 row data frame. What I want to do is, for every repeated instance within the same date, divide the cost by the total number of repeated instances. I would also like to consider only those rows falling under the "Sales" tactic.
So for example, in 1/1/16, there are 2 "Help Packages" that are also under the "Sales" tactic. Because there are 2 instances within the same date, I'd like to divide the cost of each by 2 (so the cost would come out as $5 for each).
This is the code I have:
for(i in 1:length(dfExample$Date)){
  if(dfExample$Tactic[i] == "Sales"){
    # Rows whose Package approximately matches this row's Package
    pkg_matches <- agrep(dfExample$Package[i], dfExample$Package)
    for(j in pkg_matches){
      # Rows that share (approximately) the matched row's Date
      date_repeats <- agrep(dfExample$Date[j], dfExample$Date)
      dfExample$Cost[date_repeats] <- dfExample$Cost[i] / length(date_repeats)
    }
  }
}
It is incredibly inefficient and slow. I know there's got to be a better way to achieve this. Any help would be much appreciated. Thank you!
ave() can give a solution without additional packages:
with(dfExample, Cost / ave(Cost, Date, Package, Tactic, FUN=length))
Using dplyr:
library(dplyr)
dfExample %>%
  group_by(Date, Package, Tactic) %>%
  mutate(Cost = Cost / n())
I'm a little unclear what you mean by "instance". This (pretty clearly) groups by Date, Package, and Tactic, and so will consider each unique combination of those columns as a grouper. If you don't include Tactic in the definition of an "instance", then you can remove it to group only by Date and Package.
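If the adjustment should apply only to rows under the "Sales" tactic, as the question suggests, a hedged sketch building on the grouped mutate above (column names taken from the question):
library(dplyr)
dfExample %>%
  group_by(Date, Package, Tactic) %>%
  mutate(Cost = if_else(Tactic == "Sales", Cost / n(), Cost)) %>%  # divide only within "Sales" groups
  ungroup()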
