How to exclude most dissimilar value of set in R?

I have a df looking like this but larger:
values <- c(22,16,23,15,14.5,19)
groups <- rep(c("a","b"), each = 3)
df <- data.frame(groups, values)
I have between 1 and 3 values per group (in the example, 3 values for group a and 3 values for group b). I now want to exclude the most dissimilar value from each group.
In this example I would want to exclude a 16 and b 19.
Thank you for your help!

If you're looking for one value to discard, you can remove the observation that has the highest distance from the mean value per group:
library(dplyr)
df %>%
group_by(groups) %>%
mutate(dist = abs(values - mean(values))) %>%
filter(dist != max(dist))
# A tibble: 4 × 3
# Groups: groups [2]
groups values dist
<chr> <dbl> <dbl>
1 a 22 1.67
2 a 23 2.67
3 b 15 1.17
4 b 14.5 1.67
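One caveat with the distance-from-the-mean filter above: in a group of two values both observations are equally far from the mean, and in a group of one the single value is its own maximum, so `dist != max(dist)` drops those groups entirely. The sketch below is one possible way around that, assuming groups with fewer than three values should be kept untouched (that policy, and the extra groups `c` and `d`, are assumptions for illustration and not part of the question).

```r
library(dplyr)

# Example data from the question, plus two hypothetical small groups:
# "c" has two values and "d" has one.
values <- c(22, 16, 23, 15, 14.5, 19, 7, 8, 30)
groups <- c(rep(c("a", "b"), each = 3), "c", "c", "d")
df <- data.frame(groups, values)

result <- df %>%
  group_by(groups) %>%
  mutate(dist = abs(values - mean(values))) %>%
  # Keep every row of groups with fewer than 3 values; otherwise drop
  # exactly one row, the first one with the largest distance.
  filter(n() < 3 | row_number() != which.max(dist)) %>%
  ungroup()

result
```

Using `which.max()` instead of comparing against `max(dist)` also means that ties remove only one row rather than all tied rows.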

Related

Using dplyr - how can I create a new category for one column when another column has duplicates?

I have a dataframe of coordinates for different studies that have been conducted. Each study is either an experiment or an observation; however, at some locations both an experiment AND an observation occur. For these sites, I would like to create a new study category called both. How can I do this using dplyr?
Example Data
df1 <- data.frame(matrix(ncol = 4, nrow = 6))
colnames(df1)[1:4] <- c("value", "study", "lat","long")
df1$value <- c(1,1,2,3,4,4)
df1$study <- rep(c('experiment','observation'),3)
df1$lat <- c(37.541290,37.541290,38.936604,29.9511,51.509865,51.509865)
df1$long <- c(-77.434769,-77.434769,-119.986649,-90.0715,-0.118092,-0.118092)
df1
value study lat long
1 1 experiment 37.54129 -77.434769
2 1 observation 37.54129 -77.434769
3 2 experiment 38.93660 -119.986649
4 3 observation 29.95110 -90.071500
5 4 experiment 51.50986 -0.118092
6 4 observation 51.50986 -0.118092
Note that the value above is duplicated when study has experiment AND observation.
The ideal output would look like this
value study lat long
1 1 both 37.54129 -77.434769
2 2 experiment 38.93660 -119.986649
3 3 observation 29.95110 -90.071500
4 4 both 51.50986 -0.118092
For each 'value' group in which both experiment and observation are present, we can set study to 'both' and then keep only the distinct rows
library(dplyr)
df1 %>%
group_by(value) %>%
mutate(study = if(all(c("experiment", "observation") %in% study))
"both" else study) %>%
ungroup %>%
distinct
-output
# A tibble: 4 × 4
value study lat long
<dbl> <chr> <dbl> <dbl>
1 1 both 37.5 -77.4
2 2 experiment 38.9 -120.
3 3 observation 30.0 -90.1
4 4 both 51.5 -0.118
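Since the question talks about locations where both study types occur, a variant that groups by the coordinates instead of `value` may be closer to the intent. This is only a sketch, and it assumes that identical `lat`/`long` pairs identify the same site and that `value` is constant within a site, as in the example data.

```r
library(dplyr)

# Example data from the question
df1 <- data.frame(
  value = c(1, 1, 2, 3, 4, 4),
  study = rep(c("experiment", "observation"), 3),
  lat   = c(37.541290, 37.541290, 38.936604, 29.9511, 51.509865, 51.509865),
  long  = c(-77.434769, -77.434769, -119.986649, -90.0715, -0.118092, -0.118092)
)

out <- df1 %>%
  group_by(lat, long) %>%
  # More than one distinct study type at a site means both types occur there
  mutate(study = if (n_distinct(study) > 1) "both" else study) %>%
  ungroup() %>%
  distinct()

out
```

With only two possible study types, `n_distinct(study) > 1` is equivalent to checking that both "experiment" and "observation" are present.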

R tables with proportion

ID <- c(1,2,3,4,5,6,7,8)
Hospital <- c("A","A","A","A","B","B","B","B")
risk <- c("Low","Low","High","High","Low","Low","High","High")
retest <- c(1,0,1,1,1,1,0,1)
df <- data.frame(ID, Hospital, risk, retest)
# freq. table
df %>% group_by(risk, Hospital) %>%
summarise(n=n())%>%
spread(Hospital,n)
# A tibble: 2 × 3
# Groups: risk [2]
risk A B
<chr> <int> <int>
1 High 2 2
2 Low 2 2
#freq. table of retest by risk and Hospital
df %>%
group_by(risk, Hospital) %>%
#summarise(n=n()) %>%
summarise(retestsum = sum(retest))%>%
spread(Hospital, retestsum)
# A tibble: 2 × 3
# Groups: risk [2]
risk A B
<chr> <dbl> <dbl>
1 High 2 1
2 Low 1 2
I want to get the proportions of retest by Hospital and by risk categories.
For example, for Hospital A, low risk: 1 person retested / 2 people = 50%.
I need to create A% and B% columns to get the final result of the table below.
Please help me get the proportion columns and also the (n=x) part in the final table.
Just divide the second table's numeric values by those of the first. Fortunately elementwise division does not destroy the structure if the two tibbles have the same dimensions:
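library(dplyr)
library(tidyr)
# counts per risk category and hospital (df is the data frame from the question)
d2 <- df %>% group_by(risk, Hospital) %>%
summarise(n=n())%>%
spread(Hospital,n)
`summarise()` has grouped output by 'risk'. You can override using the `.groups` argument.
# number of retests per risk category and hospital
d3 <- df %>%
group_by(risk, Hospital) %>%
#summarise(n=n()) %>%
summarise(retestsum = sum(retest))%>%
spread(Hospital, retestsum)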
You can deliver a proportion or a percentage
# proportion
> d3[-1]/d2[-1]
A B
1 1.0 0.5
2 0.5 1.0
#percentage
> 100*d3[-1]/d2[-1]
A B
1 100 50
2 50 100
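The counts and the percentages can also be computed in one pass and combined into a single display cell. The sketch below is one way to do that; the cell layout `"50% (n=2)"` is an assumption, since the target table from the question is not shown.

```r
library(dplyr)
library(tidyr)

# Example data from the question
df <- data.frame(
  ID       = 1:8,
  Hospital = rep(c("A", "B"), each = 4),
  risk     = rep(c("Low", "Low", "High", "High"), 2),
  retest   = c(1, 0, 1, 1, 1, 1, 0, 1)
)

tab <- df %>%
  group_by(risk, Hospital) %>%
  # n = group size, pct = share of retested people in percent
  summarise(n = n(), pct = 100 * mean(retest), .groups = "drop") %>%
  # Assumed display format: percentage followed by the group size
  mutate(cell = sprintf("%.0f%% (n=%d)", pct, n)) %>%
  select(risk, Hospital, cell) %>%
  pivot_wider(names_from = Hospital, values_from = cell)

tab
```

Because `retest` is coded 0/1, `mean(retest)` is the retest proportion, so no separate division of two tables is needed.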

Check if values of one dataframe exist in another dataframe in exact order

I have one dataframe of data and multiple "reference" dataframes. I'm trying to automate checking whether the values of the dataframe match the values of the reference dataframes. Importantly, the values must also be in the same order as the values in the reference dataframes. The columns shown below are the ones that matter, but my real dataset contains many more columns.
Below is a toy dataset.
Dataframe
group type value
1 A Teddy
1 A William
1 A Lars
2 B Dolores
2 B Elsie
2 C Maeve
2 C Charlotte
2 C Bernard
Reference_A
type value
A Teddy
A William
A Lars
Reference_B
type value
B Elsie
B Dolores
Reference_C
type value
C Maeve
C Hale
C Bernard
For example, in the toy dataset, group 1 would score 1.0 (100% correct) for type A because all its values match the values, and the order of values, in Reference_A. Group 2, however, would score 0.0 for type B because its values are out of order compared to Reference_B, and 0.66 for type C because 2 of its 3 values match the values and order in Reference_C.
Desired output
group type score
1 A 1.0
2 B 0.0
2 C 0.66
This was helpful, but does not take order into account:
Check whether values in one data frame column exist in a second data frame
Update: Thank you to everyone who has provided solutions! These solutions are great for the toy dataset, but I have not yet been able to adapt them to datasets with more columns. Again, as I wrote in my post, the columns listed above are the important ones, but I'd prefer not to have to drop the other columns.
We may also do this with mget to return a list of data.frames, bind them together, and do a group by mean of logical vector
library(dplyr)
mget(ls(pattern = '^Reference_[A-Z]$')) %>%
bind_rows() %>%
bind_cols(df1) %>%
group_by(group, type = type...1) %>%
summarise(score = mean(value...2 == value...5))
# Groups: group [2]
# group type score
# <int> <chr> <dbl>
#1 1 A 1
#2 2 B 0
#3 2 C 0.667
This is another tidyverse solution. Here, I am adding a counter (i.e. rowname) to both reference and data. Then I join them together on type and rowname. At the end, I summarize them on type to get the desired output.
library(dplyr)
library(purrr)
library(tibble)
list(Reference_A, Reference_B, Reference_C) %>%
map(., rownames_to_column) %>%
bind_rows %>%
left_join({Dataframe %>%
group_split(type) %>%
map(., rownames_to_column) %>%
bind_rows},
. , by=c("type", "rowname")) %>%
group_by(type) %>%
dplyr::summarise(group = head(group,1),
score = sum(value.x == value.y)/n())
#> # A tibble: 3 x 3
#> type group score
#> <chr> <int> <dbl>
#> 1 A 1 1
#> 2 B 2 0
#> 3 C 2 0.667
Here's a "tidy" method:
library(dplyr)
# library(purrr) # map2_dbl
Reference <- bind_rows(Reference_A, Reference_B, Reference_C) %>%
nest_by(type, .key = "ref") %>%
ungroup()
Reference
# # A tibble: 3 x 2
# type ref
# <chr> <list<tbl_df[,1]>>
# 1 A [3 x 1]
# 2 B [2 x 1]
# 3 C [3 x 1]
Dataframe %>%
nest_by(group, type, .key = "data") %>%
left_join(Reference, by = "type") %>%
mutate(
score = purrr::map2_dbl(data, ref, ~ {
if (length(.x) == 0 || length(.y) == 0) return(numeric(0))
if (length(.x) != length(.y)) return(0)
sum((is.na(.x) & is.na(.y)) | .x == .y) / length(.x)
})
) %>%
select(-data, -ref) %>%
ungroup()
# # A tibble: 3 x 3
# group type score
# <int> <chr> <dbl>
# 1 1 A 1
# 2 2 B 0
# 3 2 C 0.667
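For datasets with more columns, one option is to join on an explicit position index rather than relying on column positions or nesting. The following is a minimal sketch under a few assumptions: the `extra` column is hypothetical and only stands in for the additional columns mentioned in the update, and a data row with no reference row at the same position is counted as a mismatch.

```r
library(dplyr)

# Toy data from the question; `extra` is a hypothetical additional column
Dataframe <- data.frame(
  group = c(1, 1, 1, 2, 2, 2, 2, 2),
  type  = c("A", "A", "A", "B", "B", "C", "C", "C"),
  value = c("Teddy", "William", "Lars", "Dolores", "Elsie",
            "Maeve", "Charlotte", "Bernard"),
  extra = 1:8
)
Reference_A <- data.frame(type = "A", value = c("Teddy", "William", "Lars"))
Reference_B <- data.frame(type = "B", value = c("Elsie", "Dolores"))
Reference_C <- data.frame(type = "C", value = c("Maeve", "Hale", "Bernard"))

# Stack the references and record each value's position within its type
Reference <- bind_rows(Reference_A, Reference_B, Reference_C) %>%
  group_by(type) %>%
  mutate(pos = row_number()) %>%
  ungroup() %>%
  rename(ref_value = value)

scores <- Dataframe %>%
  group_by(group, type) %>%
  mutate(pos = row_number()) %>%   # position within each group/type block
  ungroup() %>%
  left_join(Reference, by = c("type", "pos")) %>%
  group_by(group, type) %>%
  # A row counts as correct only if the value at the same position matches
  summarise(score = mean(!is.na(ref_value) & value == ref_value),
            .groups = "drop")

scores
```

Because the join uses only `type` and `pos`, any other columns in `Dataframe` are carried along unchanged until the final `summarise()`.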

dplyr / base R: compute new columns using logic combinations of row indices

I am analysing a data set from an experiment and would like to calculate effect sizes for each variable. My dataframe consists of multiple variables (= columns) for 8 treatments t (= rows), with t1 - t4 being the controls for t5 - t8, respectively (t1 is the control for t5, t2 for t6, ...). The original data set is much larger, so I would like to solve the following two tasks:
I would like to calculate the log(treatment/control) for each t5 - t8 for one variable, e.g. effect size for t5 = log(t5/t1), effect size for t6 = log(t6/t2), ... . The name of the resulting column should be variablename_effect and the new column would only have 4 rows instead of 8.
The most tricky part is, that I need to implement the combination of specific rows into my code, so that the correct control is used for each treatment.
I would like to calculate the effect sizes for all my variables within one code, so create multiple new columns with the correct names (variablename_effect).
I would prefer to solve the problem in dplyr or base R to keep it simple.
So far, the only related question I found was /r-dplyr-mutate-refer-new-column-itself (shows combination of multiple if else()). I would be very thankful for a solution, links to similar questions, or pointers to which packages I should use in case it's not possible within dplyr / base R!
Sample data:
df <- data.frame("treatment" = c(1:8), "Var1" = c(9:16), "Var2" = c(17:24))
Edit: this is the df_effect I would expect to receive as output, thanks @Martin_Gal for the hint!
df_effect <- data.frame("treatment" = c(5:8), "Var1_effect" = c(log(13/9), log(14/10), log(15/11), log(16/12)), "Var2_effect" = c(log(21/17), log(22/18), log(23/19), log(24/20)))
My ideas so far:
For calculating the effect size:
mutate() and for function:
# 1st option:
for (i in 5:8) {
dt_effect <- df %>%
mutate(Var1_effect = log(df[i, "Var1"]/df[i - 4, "Var1"]))
}
#2nd option:
for (i in 5:8){
dt_effect <- df %>%
mutate(Var1_effect = log(df[treatment == i , "Var1"]/df[treatment == i - 4 , "Var1"]))
}
problem: both return the result for i = 8 for every row!
mutate() and ifelse():
df_effect <- df %>%
mutate(Var1_effect = ifelse(treatment >= 5, log(df[, "Var1"]/df[ , "Var1"]), NA))
seems to work, but so far I couldn't implement which row to pick for the control, so it returns NA for t1 - t4 (correct) and 0 for t5 - t8 (mathematically correct as I calculate log(t5/t5), ... but not what I want).
maybe I should use summarise() instead of mutate() because I create fewer rows than in my original dataframe?
Make this work for every variable at the same time
My only idea would be to index the columns within a second for function and use the paste() to create the new column names, but I don't know exactly how to do this ...
I don't know if this will solve your problem, but I want to make a suggestion similar to Limey:
library(dplyr)
library(tidyr)
df %>%
mutate(control = 1 - (treatment-1) %/% (nrow(.)/2),
group = ifelse(treatment %% (nrow(.)/2) == 0, nrow(.)/2, treatment %% (nrow(.)/2))) %>%
select(-treatment) %>%
pivot_wider(names_from = c(control), values_from=c(Var1, Var2)) %>%
group_by(group) %>%
mutate(Var1_effect = log(Var1_0/Var1_1))
This yields
# A tibble: 4 x 6
# Groups: group [4]
group Var1_1 Var1_0 Var2_1 Var2_0 Var1_effect
<dbl> <int> <int> <int> <int> <dbl>
1 1 9 13 17 21 0.368
2 2 10 14 18 22 0.336
3 3 11 15 19 23 0.310
4 4 12 16 20 24 0.288
What happened here?
I expected the first half of your data.frame to be the control variables for the second half. So I created an indicator variable and a grouping variable based on the treatment id's/numbers.
Now the treatment id isn't used anymore, so I dropped it.
Next I used pivot_wider to create a dataset with Var1_1 (i.e. Var1 for your control variable) and Var1_0 (i.e. Var1 for your "ordinary" variable).
Finally I calculated Var1_effect per group.
In response to OP's comment on @MartinGal's solution (which is perfectly fine in its own right):
First convert the input data to a more convenient form:
library(dplyr)
library(tidyr)
library(tibble) # for add_column()
# Original input dataset
df <- data.frame("treatment" = c(1:8), "Var1" = c(9:16), "Var2" = c(17:24))
# Revised input dataset
revisedDF <- df %>%
select(-treatment) %>%
add_column(
Treatment=rep(c("Control", "Test"), each=4),
Experiment=rep(1:4, times=2)
) %>%
pivot_longer(
names_to="Variable",
values_to="Value",
cols=c(Var1, Var2)
) %>%
arrange(Experiment, Variable, Treatment)
revisedDF %>% head(6)
Giving
# A tibble: 6 x 4
Treatment Experiment Variable Value
<chr> <int> <chr> <int>
1 Control 1 Var1 9
2 Test 1 Var1 13
3 Control 1 Var2 17
4 Test 1 Var2 21
5 Control 2 Var1 10
6 Test 2 Var1 14
I like this format because it makes the analysis code completely independent of the number of variables, the number of experiments and the number of treatments.
The analysis is straightforward, too:
result <- revisedDF %>% pivot_wider(
names_from=Treatment,
values_from=Value
) %>%
mutate(Effect=log(Test/Control))
result
Giving
Experiment Variable Control Test Effect
<int> <chr> <int> <int> <dbl>
1 1 Var1 9 13 0.368
2 1 Var2 17 21 0.211
3 2 Var1 10 14 0.336
4 2 Var2 18 22 0.201
5 3 Var1 11 15 0.310
6 3 Var2 19 23 0.191
7 4 Var1 12 16 0.288
8 4 Var2 20 24 0.182
pivot_wider and pivot_longer are relatively new tidyr functions. If you're unable to use a recent version of the package, spread and gather do the same job with slightly different argument names.
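If the goal is exactly the df_effect shown in the question (one row per treatment 5 to 8, one `_effect` column per variable), a shorter route is possible with `across()`. This is only a sketch: it assumes the rows are ordered so that the first half are the controls and row i + 4 is the treatment belonging to control row i, as in the sample data; `half` is a helper name introduced here.

```r
library(dplyr)

df <- data.frame(treatment = 1:8, Var1 = 9:16, Var2 = 17:24)

# Number of control rows; assumes controls occupy the first half of the data
half <- nrow(df) %/% 2L

df_effect <- df %>%
  # For every variable, divide each row by the row `half` positions above it
  mutate(across(-treatment,
                ~ log(.x / dplyr::lag(.x, half)),
                .names = "{.col}_effect")) %>%
  # The first `half` rows have no control above them (NA), so drop them
  filter(treatment > half) %>%
  select(treatment, ends_with("_effect"))

df_effect
```

The `.names` argument builds the `variablename_effect` column names automatically, so adding more variables to `df` needs no change to the code.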

How to run a for loop for each group in a dataframe?

This question is similar to this one asked earlier but not quite. I would like to iterate through a large dataset (~500,000 rows) and for each unique value in one column, I would like to do some processing of all the values in another column.
Here is code that I have confirmed to work:
df = matrix(nrow=783,ncol=2)
counts = table(csvdata$value)
p = (as.vector(counts))/length(csvdata$value)
D = 1 - sum(p**2)
The only problem with it is that it returns the value D for the entire dataset, rather than returning a separate D value for each set of rows where ID is the same.
Say I had data like this:
How would I be able to do the same thing as the code above, but return a D value for each group of rows where ID is the same, rather than for the entire dataset? I imagine this requires a loop, and creating a matrix to store all the D values in with ID in one column and the value of D in the other, but not sure.
Ok, let's work with "In short, I would like whatever is in the for loop to be executed for each block of data with a unique value of "ID"".
In general you can group rows by values in one column (e.g. "ID") and then perform some transformation based on values/entries in other columns per group. In the tidyverse this would look like this
library(tidyverse)
df %>%
group_by(ID) %>%
mutate(value.mean = mean(value))
## A tibble: 8 x 3
## Groups: ID [3]
# ID value value.mean
# <fct> <int> <dbl>
#1 a 13 12.6
#2 a 14 12.6
#3 a 12 12.6
#4 a 13 12.6
#5 a 11 12.6
#6 b 12 15.5
#7 b 19 15.5
#8 cc4 10 10.0
Here we calculate the mean of value per group, and add these values to every row. If instead you wanted to summarise values, i.e. keep only the summarised value(s) per group, you would use summarise instead of mutate.
library(tidyverse)
df %>%
group_by(ID) %>%
summarise(value.mean = mean(value))
## A tibble: 3 x 2
# ID value.mean
# <fct> <dbl>
#1 a 12.6
#2 b 15.5
#3 cc4 10.0
The same can be achieved in base R using one of tapply, ave, by. As far as I understand your problem statement there is no need for a for loop. Just apply a function (per group).
Sample data
df <- read.table(text =
"ID value
a 13
a 14
a 12
a 13
a 11
b 12
b 19
cc4 10", header = T)
Update
To conclude from the comments and chat, this should be what you're after.
# Sample data
set.seed(2017)
csvdata <- data.frame(
microsat = rep(c("A", "B", "C"), each = 8),
allele = sample(20, 3 * 8, replace = T))
csvdata %>%
group_by(microsat) %>%
summarise(D = 1 - sum(prop.table(table(allele))^2))
## A tibble: 3 x 2
# microsat D
# <fct> <dbl>
#1 A 0.844
#2 B 0.812
#3 C 0.812
Note that prop.table returns fractions and is shorter than your (as.vector(counts))/length(csvdata$value). Note also that you can reproduce your results for all values (irrespective of ID) if you omit the group_by line.
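The base R route mentioned earlier (tapply) could look roughly like the sketch below, which computes the same D per group without any packages. The helper name `diversity_d` is made up for this example, and the sample data simply reuse the construction from the update above.

```r
# Sample data built the same way as in the update above
set.seed(2017)
csvdata <- data.frame(
  microsat = rep(c("A", "B", "C"), each = 8),
  allele   = sample(20, 3 * 8, replace = TRUE)
)

# Hypothetical helper: D = 1 - sum(p^2), with p the relative frequencies of x
diversity_d <- function(x) 1 - sum(prop.table(table(x))^2)

# Apply the helper to the alleles of each microsat group
D <- tapply(csvdata$allele, csvdata$microsat, diversity_d)

# Store the result with the group in one column and D in the other
D_table <- data.frame(microsat = names(D), D = as.vector(D))
D_table
```

`tapply()` splits the first argument by the second and applies the function to each piece, which is the "for loop per group" from the question without writing the loop.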
A base R option would be
df$value.mean <- with(df, ave(value, ID))
