R tables with proportions

library(dplyr)
library(tidyr)
ID <- c(1,2,3,4,5,6,7,8)
Hospital <- c("A","A","A","A","B","B","B","B")
risk <- c("Low","Low","High","High","Low","Low","High","High")
retest <- c(1,0,1,1,1,1,0,1)
df <- data.frame(ID, Hospital, risk, retest)
# freq. table
df %>%
  group_by(risk, Hospital) %>%
  summarise(n = n()) %>%
  spread(Hospital, n)
# A tibble: 2 × 3
# Groups: risk [2]
risk A B
<chr> <int> <int>
1 High 2 2
2 Low 2 2
# freq. table of retest by risk and Hospital
df %>%
  group_by(risk, Hospital) %>%
  #summarise(n=n()) %>%
  summarise(retestsum = sum(retest)) %>%
  spread(Hospital, retestsum)
# A tibble: 2 × 3
# Groups: risk [2]
risk A B
<chr> <dbl> <dbl>
1 High 2 1
2 Low 1 2
I want to get the proportion of retests by Hospital and by risk category.
For example, for Hospital A, low risk: 1 person retested / 2 persons = 50%.
I need to create A% and B% columns for the final table.
Please help me get the proportion columns and also the (n=x) part in the final table.

Just divide the second table's numeric values by those of the first. Fortunately elementwise division does not destroy the structure if the two tibbles have the same dimensions:
d2 <- df %>%
  group_by(risk, Hospital) %>%
  summarise(n = n()) %>%
  spread(Hospital, n)
`summarise()` has grouped output by 'risk'. You can override using the `.groups` argument.
d3 <- df %>%
  group_by(risk, Hospital) %>%
  #summarise(n=n()) %>%
  summarise(retestsum = sum(retest)) %>%
  spread(Hospital, retestsum)
You can then report either a proportion or a percentage:
# proportion
> d3[-1]/d2[-1]
A B
1 1.0 0.5
2 0.5 1.0
#percentage
> 100*d3[-1]/d2[-1]
A B
1 100 50
2 50 100
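The division above gives the percentages but not the (n=x) part the question asks for. A minimal sketch of one way to get both in a single pipeline follows. The label format "50% (n=2)" is an assumption, since the desired final table is not shown, and `pivot_wider()` is used here in place of `spread()`.

```r
library(dplyr)
library(tidyr)

# Same data as in the question
df <- data.frame(
  ID = 1:8,
  Hospital = rep(c("A", "B"), each = 4),
  risk = rep(c("Low", "Low", "High", "High"), times = 2),
  retest = c(1, 0, 1, 1, 1, 1, 0, 1)
)

result <- df %>%
  group_by(risk, Hospital) %>%
  # n = group size, pct = share of retested persons in the group
  summarise(n = n(), pct = 100 * mean(retest), .groups = "drop") %>%
  # assumed label format: percentage followed by the group size
  mutate(label = sprintf("%.0f%% (n=%d)", pct, n)) %>%
  select(risk, Hospital, label) %>%
  pivot_wider(names_from = Hospital, values_from = label)

result
```

Because `retest` is coded 0/1, `mean(retest)` is the proportion retested, so no second table is needed for the denominator.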

Related

Calculate the frequency of species occurrence across sites

I need to calculate the relative frequency of species occurrence across sites. Let's say species a was found in 5 out of 8 sampling sites; its relative frequency is then 62.5%. I wonder how to do it in R, ideally using dplyr?
Dummy example:
d <-data.frame(site = c(1,1,2,2,3,3,4,4),
species = c('a','b', 'a','b', 'a','d', 'a', 'e'))
I know that I can calculate the sum of unique sites by counting distinct ones:
d %>%
group_by(site) %>%
summarize(n_sites = n_distinct(site))
I can get the number of occurrences of each species using this:
d %>%
count(species)
But how can I get the frequency of occurrence of each species across sites?
Desired output:
species freq
a 100 # species a is present in each plot
b 50 # b occurs in half of plots
d 25 # d&e occur only in 1 out of 4 plots
e 25
We can use
library(dplyr)
# count distinct sites per species, then divide by the total number of sites
d |> group_by(species) |>
  summarise(freq = n_distinct(site)) |>
  mutate(freq = freq / n_distinct(d$site) * 100)
Output
# A tibble: 4 × 2
species freq
<chr> <dbl>
1 a 100
2 b 50
3 d 25
4 e 25
I would break this into two steps, as follows:
d%>%
group_by(species)%>%
# Step 1: count sites by species
summarise(sites_by_species=n_distinct(site))%>%
# Step 2; divide by total number of sites
mutate(frequency=100*sites_by_species/n_distinct(d$site))
Output of which is:
# A tibble: 4 × 3
species sites_by_species frequency
<chr> <int> <dbl>
1 a 4 100
2 b 2 50
3 d 1 25
4 e 1 25
Since the data are already grouped by species, n_distinct(site) inside summarize() would only count the sites within each group, so I used length(unique(d$site)) to get the total number of sites.
library(dplyr)
d %>% group_by(species) %>% summarize(freq = n()*100/length(unique(d$site)))
Or more lengthy (trying to stay in dplyr as much as possible):
d %>%
mutate(sites_n = n_distinct(site)) %>%
group_by(species) %>%
summarize(freq = n()*100/max(sites_n))
Output
# A tibble: 4 × 2
species freq
<chr> <dbl>
1 a 100
2 b 50
3 d 25
4 e 25
d %>%
count(species) %>%
mutate(freq=n/n_distinct(d$site)*100) %>%
select(-n)
species freq
1 a 100
2 b 50
3 d 25
4 e 25
I needed to use d$site since the site column is no longer available in the pipe after count().
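The answers that use n() or count() rely on each species appearing at most once per site. A minimal sketch of a variant that also works when a species is recorded several times at the same site is shown here. The data frame `d_dup` is a hypothetical extension of the dummy example with one duplicated row.

```r
library(dplyr)

# Hypothetical data: species 'a' is recorded twice at site 1
d_dup <- data.frame(
  site    = c(1, 1, 1, 2, 2, 3, 3, 4, 4),
  species = c("a", "a", "b", "a", "b", "a", "d", "a", "e")
)

freq <- d_dup %>%
  distinct(site, species) %>%   # keep one row per site/species pair
  count(species) %>%            # number of sites per species
  mutate(freq = 100 * n / n_distinct(d_dup$site)) %>%
  select(-n)

freq
```

Without the `distinct()` step, species a would be counted five times over four sites and exceed 100%.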

How to exclude most dissimilar value of set in R?

I have a df looking like this but larger:
values <- c(22,16,23,15,14.5,19)
groups <- rep(c("a","b"), each = 3)
df <- data.frame(groups, values)
I have between 1 and 3 values per group (in the example, 3 values for group a and 3 values for group b). I now want to exclude the most dissimilar value from each group.
In this example I would want to exclude 16 from group a and 19 from group b.
Thank you for your help!
If you're looking for one value to discard, you can remove the observation that has the highest distance from the mean value per group:
df %>%
group_by(groups) %>%
mutate(dist = abs(values - mean(values))) %>%
filter(dist != max(dist))
# A tibble: 4 × 3
# Groups: groups [2]
groups values dist
<chr> <dbl> <dbl>
1 a 22 1.67
2 a 23 2.67
3 b 15 1.17
4 b 14.5 1.67
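The question mentions groups with only one or two values. With the filter above, a single value has distance 0, which is also the maximum, and two values are equally far from their mean, so such groups would be dropped entirely. A minimal sketch of one way to handle this, assuming that groups with fewer than three values should be kept unchanged (group c below is hypothetical):

```r
library(dplyr)

# Example data from the question plus a hypothetical one-value group "c"
df <- data.frame(
  groups = c("a", "a", "a", "b", "b", "b", "c"),
  values = c(22, 16, 23, 15, 14.5, 19, 7)
)

trimmed <- df %>%
  group_by(groups) %>%
  mutate(dist = abs(values - mean(values))) %>%
  # assumption: only trim groups with at least 3 values;
  # which.max() drops exactly one row, the first one in case of ties
  filter(n() < 3 | row_number() != which.max(dist)) %>%
  ungroup()

trimmed
```

Using `which.max()` instead of `dist != max(dist)` also guarantees that only one row is removed when two values tie for the largest distance.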

How to sum all values of a cell if it corresponds with a specific value in another cell?

I might just be going about it the wrong way, but I'm having trouble pulling all of the female scores and all of the male scores into their own respective data frames.
I don't need any of the exam information, so really I could just get every 'f' and its corresponding score and every 'm' and its corresponding score into a data frame.
library(dplyr)
library(tibble)
library(stringr)
data <- tribble(~"X",~"Exam1",~"X.1",~"Exam2",~"X.2",
"n","Score","Gender","Score","Gender",
"1","45","m","66","f",
"2","60","f","73","m")
# Create informative column names
Colnames <- colnames(data) %>% str_c(.,dplyr::slice(data,1) %>% unlist,sep = "_")
# Set column names
data <- data %>%
setNames(Colnames) %>%
dplyr::slice(-1)
# Arrange data by exam type by first getting exam "number"
colnames(data) %>%
str_extract("\\d+") %>%
str_subset("\\d") %>%
unique %>%
# Split and arrange data by exams
purrr::map_df(~{
data %>%
dplyr::select(matches(str_c("X_n|",.x))) %>%
dplyr::mutate(Exam = str_c("Exam ",.x)) %>%
dplyr::rename_all(~c("Serial number","Exam score","Gender","Exam"))
}) %>%
# Split data by gender
dplyr::group_by(Gender) %>%
dplyr::group_split()
Output:
[[1]]
# A tibble: 2 × 4
`Serial number` `Exam score` Gender Exam
<chr> <chr> <chr> <chr>
1 2 60 f Exam 1
2 1 66 f Exam 2
[[2]]
# A tibble: 2 × 4
`Serial number` `Exam score` Gender Exam
<chr> <chr> <chr> <chr>
1 1 45 m Exam 1
2 2 73 m Exam 2
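The scores in the output above are still character columns, so they need converting before they can be summed by gender as the title asks. A minimal sketch, assuming the long format shown above has already been built (the table `long` and its simplified column names are hypothetical stand-ins for that result):

```r
library(dplyr)

# Hypothetical stand-in for the reshaped data shown in the output above
long <- tibble::tribble(
  ~serial, ~score, ~gender, ~exam,
  "2",     "60",   "f",     "Exam 1",
  "1",     "66",   "f",     "Exam 2",
  "1",     "45",   "m",     "Exam 1",
  "2",     "73",   "m",     "Exam 2"
)

totals <- long %>%
  mutate(score = as.numeric(score)) %>%   # scores were read as text
  group_by(gender) %>%
  summarise(total = sum(score), .groups = "drop")

totals
```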

How to get percentages of observations above a certain value for individuals and groups?

I am new to R and looking for some help with my thesis!
The data I have are participant ID, the group they belong to (control or patient), and coordinates in the column “gaze”, where values >0 are right and <0 are left.
The goal is to calculate the percentage of coordinates at the right and left side of space for each participant and the two groups.
Sample data:
df <- data.frame(personID=rep(1,6),gaze=c(-0.104,-0.105,0.00550,0.00407,0.00119,0.0411),group=rep('control',6))
df
# personID gaze group
#1 1 -0.10400 control
#2 1 -0.10500 control
#3 1 0.00550 control
#4 1 0.00407 control
#5 1 0.00119 control
#6 1 0.04110 control
You can use the dplyr package to get your answer:
library(dplyr)
# create a new logical column that is TRUE where gaze > 0 (right side)
df <- df %>% mutate(positive_gaze = (gaze > 0))
# group by personID and calculate the mean of the new column
df %>% group_by(personID) %>% summarise(pct_positive = 100*mean(positive_gaze))
# A tibble: 1 x 2
# personID pct_positive
# <dbl> <dbl>
#1 1 66.7
# similarly you could group by group
df %>% group_by(group) %>% summarise(pct_positive = 100*mean(positive_gaze))
# A tibble: 1 x 2
# group pct_positive
# <fct> <dbl>
#1 control 66.7
# or group by both group and personID
df %>% group_by(group,personID) %>% summarise(pct_positive = 100*mean(positive_gaze))
# A tibble: 1 x 3
# Groups: group [1]
# group personID pct_positive
# <fct> <dbl> <dbl>
#1 control 1 66.7
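The question asks for both the right and the left share. A minimal sketch that reports the two side by side, using the sample data from the question (the column names `pct_right` and `pct_left` are made up for illustration):

```r
library(dplyr)

# Sample data from the question
df <- data.frame(
  personID = rep(1, 6),
  gaze = c(-0.104, -0.105, 0.00550, 0.00407, 0.00119, 0.0411),
  group = rep("control", 6)
)

sides <- df %>%
  group_by(group, personID) %>%
  summarise(
    pct_right = 100 * mean(gaze > 0),   # share of coordinates on the right
    pct_left  = 100 * mean(gaze < 0),   # share of coordinates on the left
    .groups = "drop"
  )

sides
```

Dropping `personID` from `group_by()` gives the same two columns per group. A gaze of exactly 0 would count toward neither side, so the two percentages need not sum to 100.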

Assigning Label based on quantile for every sub group

My data.frame looks like this:
Region Store Sales
A 1 ***
A 2 ***
B 1 ***
B 2 ****
I want to label each store based on its sales performance: if a store's Sales is higher than the 75% quantile, assign "High", otherwise "Low".
Applying ddply using the code
R3 <- ddply(dat, .(REGION), function(x) quantile(x$Sales, na.rm = TRUE))
returns a dataframe with all quantile numbers for the regions.
I can join that frame with the original and do an if-else for each cluster, but I am sure that's not an efficient way. Is there a better approach?
Is this what you want?
df %>% group_by(Region) %>%
mutate(Performance = ifelse(Sales > quantile(Sales, 0.75), 'High', 'Low'))
#> # A tibble: 4 x 4
#> # Groups: Region [2]
#> Region Store Sales Performance
#> <chr> <int> <int> <chr>
#> 1 A 1 100 High
#> 2 A 2 10 Low
#> 3 B 1 90 High
#> 4 B 2 10 Low
Data Input
df = read.table(text = 'Region Store Sales
A 1 100
A 2 10
B 1 90
B 2 10', header = T, stringsAsFactors = F)
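The ddply call in the question passes na.rm = TRUE, which suggests Sales may contain missing values, and `quantile()` stops with an error on NA input unless na.rm = TRUE is set. A minimal sketch of the same labelling with missing values tolerated (the third store in region A with an NA is hypothetical):

```r
library(dplyr)

# Data from the answer plus a hypothetical store with missing Sales
df <- data.frame(
  Region = c("A", "A", "A", "B", "B"),
  Store  = c(1, 2, 3, 1, 2),
  Sales  = c(100, 10, NA, 90, 10)
)

labelled <- df %>%
  group_by(Region) %>%
  mutate(
    # the per-region 75% quantile ignores NA; an NA Sales stays NA
    Performance = if_else(Sales > quantile(Sales, 0.75, na.rm = TRUE),
                          "High", "Low")
  ) %>%
  ungroup()

labelled
```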
