Combining frequency tables in R

I have a vector containing the frequencies of molecules within their respective molecular class for all molecules measured. I also have a vector that contains the per class frequency of significant molecules identified by variable selection. How can I merge these 2 vectors into a data frame and fill in empty frequencies with 0's (in R)?
Here is a workable example:
full = rep(letters[1:4], 4:7)
fullTable = table(full)
sub = rep(letters[1:2], c(2, 4))
subTable = table(sub)
I would like the table to look like:
print(data.frame(Letter=letters[1:4], fullFreq=c(4, 5, 6, 7), subFreq=c(2, 4, 0, 0)))

Try this (I suppose you meant subTable = table(sub) in your last line):
# merge the two tables on the common letter column, keeping all letters from both
res <- merge(as.data.frame(fullTable), as.data.frame(subTable), by.x = 1, by.y = 1, all = TRUE)
colnames(res) <- c("Letter", "fullFreq", "subFreq")
# letters missing from the sub table come through as NA; replace them with 0
res[is.na(res)] <- 0
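With the example data, the result should look like this (a and b appear in both tables; c and d get 0 for subFreq):
res
#   Letter fullFreq subFreq
# 1      a        4       2
# 2      b        5       4
# 3      c        6       0
# 4      d        7       0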

With the dplyr package:
library(dplyr)
full=rep(letters[1:4], 4:7)
sub=rep(letters[1:2], c(2,4))
df <- data.frame(Letter=unique(c(full, sub)))
df <- df %>%
  left_join(as.data.frame(table(full)), by = c("Letter" = "full")) %>%
  left_join(as.data.frame(table(sub)), by = c("Letter" = "sub"))
df[is.na(df)] <- 0
df
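Note that both joins bring in a column named Freq, so dplyr disambiguates them as Freq.x and Freq.y; the result should look roughly like the output below, and the columns can be renamed afterwards:
df
#   Letter Freq.x Freq.y
# 1      a      4      2
# 2      b      5      4
# 3      c      6      0
# 4      d      7      0
names(df) <- c("Letter", "fullFreq", "subFreq")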

Related

How to identify and remove outliers in a data.frame using R?

I have a dataframe that has multiple outliers. I suspect that these outliers have produced different results than expected.
I tried to use this tip but it didn't work as I still have very different values: https://www.r-bloggers.com/2020/01/how-to-remove-outliers-in-r/
I tried the solution with the rstatix package, but I can't remove the outliers from my data.frame
library(rstatix)
library(dplyr)
df <- data.frame(
  sample = 1:20,
  score = c(rnorm(19, mean = 5, sd = 2), 50))
View(df)
out_df <- identify_outliers(df$score)            #identify outliers
df2 <- df                                        #copy df
df2 <- df2[-which(df2$score %in% out_df), ]      #remove outliers from df2
View(df2)
identify_outliers() expects a data.frame as input, i.e. the usage is
identify_outliers(data, ..., variable = NULL)
where
... - one unquoted expression (or variable name), used to select the variable of interest; an alternative to the variable argument.
df2 <- subset(df, !score %in% identify_outliers(df, "score")$score)
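As far as I recall, identify_outliers() returns the flagged rows themselves (with is.outlier and is.extreme columns added), so the one-liner above is equivalent to a two-step version like:
out <- identify_outliers(df, "score")  # data.frame of the rows flagged as outliers
df2 <- df[!df$score %in% out$score, ]  # keep every row whose score was not flagged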
A rule of thumb is that data points above Q3 + 1.5*IQR or below Q1 - 1.5*IQR are considered outliers.
Therefore you just have to identify them and remove them. I don't know how to do it with the rstatix dependency, but with base R it can be achieved following the example below:
# Generate a demo data
set.seed(123)
demo.data <- data.frame(
  sample = 1:20,
  score = c(rnorm(19, mean = 5, sd = 2), 50),
  gender = rep(c("Male", "Female"), each = 10)
)
#identify outliers
outliers <- which(demo.data$score > quantile(demo.data$score)[4] + 1.5 * IQR(demo.data$score) |
                  demo.data$score < quantile(demo.data$score)[2] - 1.5 * IQR(demo.data$score))
# remove them from your dataframe
df2 = demo.data[-outliers,]
Or wrap it in a small function that returns the indices of the outliers:
get_outliers <- function(x) {
  which(x > quantile(x)[4] + 1.5 * IQR(x) | x < quantile(x)[2] - 1.5 * IQR(x))
}
outliers <- get_outliers(demo.data$score)
df2 = demo.data[-outliers,]
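One caveat with the negative-index pattern: if a column happens to contain no outliers, outliers is integer(0) and demo.data[-outliers, ] drops every row. A small guard avoids that:
outliers <- get_outliers(demo.data$score)
# only drop rows when there is actually something to drop
df2 <- if (length(outliers) > 0) demo.data[-outliers, ] else demo.data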

Comparing each row of one dataframe with a row in another dataframe using R

I'm relatively new to R and I have looked for an answer for my problem but didn't find one. I want to compare two dataframes.
library(dplyr)
library(gtools)
v1 <- LETTERS[1:10]
combinations_from_4_letters <- as.data.frame(combinations(n = 10, r = 4, v = v1),
                                             stringsAsFactors = FALSE)
combinations_from_4_letters$group <- rep(1:15, each = 14)
combinations_from_2_letters <- as.data.frame(combinations(n = 10, r = 2, v = v1),
                                             stringsAsFactors = FALSE)
Dataframe 'combinations_from_4_letters' contains all combinations that can be made from 10 letters, without repetitions or permutations. The combinations are binned into groups 1-15. I want to find out how often each pair of the 10 letters (stored in dataframe 'combinations_from_2_letters') is found in each group (basically a frequency table). I started writing a complicated loop over both dataframes, but I think there must be a more 'R'-like solution, similar to comparing a dataframe and a vector like:
combinations_from_4_letters %in% combinations_from_2_letters[i, ]
Thank you in advance for your help!
I recommend an approach like the following:
# add a dummy column for a complete cross-join
combinations_from_4_letters = combinations_from_4_letters %>%
  mutate(ones = 1)
combinations_from_2_letters = combinations_from_2_letters %>%
  mutate(ones = 1)
joined = combinations_from_2_letters %>%
  inner_join(combinations_from_4_letters, by = "ones") %>%
  # comparison goes here
  mutate(within = ifelse(comb2 %in% comb4, 1, 0)) %>%
  group_by(comb2) %>%
  summarise(freq = sum(within))
You'll probably need to modify this to match your exact column names and comparison condition; see the concrete sketch after the key ideas below.
Key ideas:
adding a filler column so we have a complete cross-join
mutate a new indicator column for whether the two-letter pair is within the four-letter combination
sum the indicators for each two-letter pair
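For concreteness, here is one way the sketch above could be filled in for this data, assuming the default column names produced by gtools (V1..V4 for the four-letter combinations, V1..V2 for the pairs) and counting per group as the question asks; treat it as a sketch rather than a drop-in answer:
library(dplyr)
# dummy column for the complete cross-join, as described above
pairs <- combinations_from_2_letters %>% mutate(ones = 1)
quads <- combinations_from_4_letters %>%
  rename(W1 = V1, W2 = V2, W3 = V3, W4 = V4) %>%
  mutate(ones = 1)
freq_per_group <- pairs %>%
  inner_join(quads, by = "ones") %>%   # recent dplyr may warn about a many-to-many join; that is expected here
  rowwise() %>%
  # the pair is "within" a 4-letter combination if both of its letters appear there
  mutate(within = all(c(V1, V2) %in% c(W1, W2, W3, W4))) %>%
  ungroup() %>%
  group_by(V1, V2, group) %>%
  summarise(freq = sum(within), .groups = "drop")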

pass a list of variable names as an argument to an R function

I am trying achieve the following: I have a dataset, and a function that subsets this dataset and then performs a series of operations on the subset. Subsetting happens based on row names. I am able to do it step by step (i.e. running this function for each subset separately), but I have a list of desired subsets, and I would like to loop over this list. It sounds complicated - please check the example below.
This is what I can do:
#dataframe with rownames
whole_dataset <- data.frame(wt1 = c(1, 2, 3, 6, 6),
                            wt2 = c(2, 3, 4, 4, 2))
row.names(whole_dataset) <- c("HTA1", "HTA2", "HTB2", "CSE1", "CSE2")
# two different non-overlapping subsets
his <- c("HTA1", "HTA2", "HTB2")
cse <- c("CSE1", "CSE2")
#this is the function I have
fav_complex <- function(data, complex) {
  small_data <- data[complex, ]   #subset only the rows that you need
  sum.all <- colSums(small_data)  #calculate sum of columns
  return(sum.all)
}
#I generate two separate named vectors
his_data <- fav_complex(data = whole_dataset, complex = his)
cse_data <- fav_complex(data = whole_dataset, complex = cse)
#and merge them
merged_data<- rbind(his_data,cse_data)
it looks like this
> merged_data
wt1 wt2
his_data 6 9
cse_data 12 6
I would like to somehow generate the merged_data dataframe without having to call the 'fav_complex' function multiple times. In real life I have about 20 subsets, so it is a lot of code. This is my attempt, which doesn't work:
#I first have a character vector listing all the variable names
subset_list <- c("his", "cse")
#then create a loop that goes over this list
#make an empty dataframe
merged_data2 <- data.frame()
#fill it with a for loop output
for (element in subset_list) {
  result <- fav_complex(data = whole_dataset, element)
  merged_data2 <- rbind(merged_data2, result)
}
I know this is wrong. In this loop, 'element' is just a string rather than a variable with stuff in it, but I don't know how to make it a variable. noquote(element) didn't work. I tried reading about non-standard evaluation, eval() and substitute(), but it is too abstract for me - I think I am not there yet with my R expertise.
Consider by to run the needed operation across all subsets. But first, create a group column:
# ANY FUNCTION TO APPLY ON SUBSETS (REMOVE GROUP COL)
fav_complex_new <- function(sub) {
  sum.all <- colSums(transform(sub, group = NULL))  # drop the group column before summing
  return(sum.all)
}
# ASSIGN GROUPING
whole_dataset$group <- ifelse(row.names(whole_dataset) %in% his, "his",
                       ifelse(row.names(whole_dataset) %in% cse, "cse", NA))
# BY CALL
df_list <- by(whole_dataset, whole_dataset$group, FUN=fav_complex_new)
# COMBINE ALL DFs IN LIST
merged_data <- do.call(rbind, df_list)
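With the example data, the combined result should look roughly like:
merged_data
#     wt1 wt2
# cse  12   6
# his   6   9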
Rextester demo (includes OP's original and above solution)
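Alternatively, if you want to keep the original fav_complex() and the loop from the question, the missing piece is get(), which looks an object up by its name. A minimal sketch (run against the original whole_dataset, before the group column is added):
# look each subset vector up by name and apply fav_complex to it
results <- lapply(subset_list, function(nm) fav_complex(data = whole_dataset, complex = get(nm)))
merged_data2 <- do.call(rbind, results)
rownames(merged_data2) <- subset_list
merged_data2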
Following #Gregor's suggestion of a modified workflow, would you consider this solution, including some bonus data wrangling?
1. Put the data that's currently in row names into its own column.
2. Add a column for complex. We can do this programmatically in case the data are large.
3. Use dplyr to create split-apply-combine summaries of the data, grouped by complex.
It could work like this
library(dplyr)
whole_dataset <- tibble(wt1 = c(1, 2, 3, 6, 6),
                        wt2 = c(2, 3, 4, 4, 2),
                        id = factor(c("HTA1", "HTA2", "HTB2", "CSE1", "CSE2")))
whole_dataset <- mutate(whole_dataset,
                        complex = case_when(
                          grepl("^HT", id) ~ "his",
                          grepl("^CSE", id) ~ "cse")) %>%
  group_by(factor(complex))
whole_dataset %>% summarize(sum_wt1 = sum(wt1),
                            sum_wt2 = sum(wt2))
# # A tibble: 2 x 3
# `factor(complex)` sum_wt1 sum_wt2
# <fct> <dbl> <dbl>
# 1 cse 12 6
# 2 his 6 9

Making Calculations on Several Textfiles and making a Dataframe from it R

I am trying to create a table from calculations that I am doing on several text files. I think this might require a loop of some sort, but I am stuck on how to proceed. I have tried different loops but none seem to be working. I have managed to do what I want with one file. Here is my working code:
flare <- read.table("C:/temp/HD3_Bld_CD8_TEM.txt",
                    header = TRUE)
head(flare[,c(1,2)])
#sum of the freq column, check to see if close to 1
sum(flare$freq)
#Sum of top 10
ten <- sum(flare$freq[1:10])
#Sum of 11-100
to100 <- sum(flare$freq[11:100])
#Sum of 101-1000
to1000 <- sum(flare$freq[101:1000])
#sum of 1001+
rest <- sum(flare$freq[-c(1:1000)])
#place the values of the sum in a table
df <- data.frame(matrix(ncol = 1, nrow = 4))
x <- c("Sum")
colnames(df) <- x
y <- c("10", "11-100", "101-1000", "1000+")
row.names(df) <- y
df[,1] <- c(ten,to100,to1000,rest)
The dataframe ends up looking like this:
>View(df)
Sum
10 0.1745092
11-100 0.2926735
101-1000 0.4211533
1000+ 0.1116640
This is perfect for making a stacked barplot, which I did. However, this is only for one text file. I have several of the same files. All of them have the same column names, so I know that all of them will be using the freq column for the calculations. How do I make a table after doing the calculations for each file? I want to keep the names of the text files as the sample names, so that when I make a joint stacked barplot all the names will be there. Also, what is the best way to orient the data when writing the new table/dataframe?
I am still new to R, so any help, any explanation would be most welcome. Thank you.
How about something like this? Your example is not reproducible, so I made a dummy example which you can adjust:
library(tidyverse)
###load ALL your dataframes
test_df_1 <- data.frame(var1 = matrix(c(1,2,3,4,5,6), nrow = 6, ncol = 1))
test_df_1
test_df_2 <- data.frame(var2 = matrix(c(7,8,9,10,11,12), nrow = 6, ncol = 1))
test_df_2
### Bind them into one big wide dataframe
df <- cbind(test_df_1, test_df_2)
### Add an id column which repeats (in your case, adjust this to repeat for the grouping you want, i.e. replace each = 2 with each = 10, and each = 4 with each = 100)
df <- df %>%
  mutate(id = paste0("id_", c(rep(1, each = 2), rep(2, each = 4))))
### Gather your dataframes into long format by the id
df_gathered <- df %>%
  gather(value = value, key = key, -id)
df_gathered
### use group_by to group data by id and summarise to get the sum of each group
df_gathered_sum <- df_gathered %>%
  group_by(id, key) %>%
  summarise(sigma = sum(value))
df_gathered_sum
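With the dummy data above, the grouped sums should come out roughly as:
df_gathered_sum
# # A tibble: 4 x 3
#   id    key   sigma
# 1 id_1  var1      3
# 2 id_1  var2     15
# 3 id_2  var1     18
# 4 id_2  var2     42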
You might have some issues with the id column if your dataframes are not of equal length, so this is only a partial answer. I could do better with a shortened example of your dataset. Can anyone else weigh in on creating an id column? I may have sorted it with a couple of edits...
I think I solved it! It gives me the dataframe I want, and from it, I can make the stacked barplot to display the data.
sumfunction <- function(x) {
  wow <- read.table(x, header = TRUE)
  #Sum of top 10
  ten <- sum(wow$freq[1:10])
  #Sum of 11-100
  to100 <- sum(wow$freq[11:100])
  #Sum of 101-1000
  to1000 <- sum(wow$freq[101:1000])
  #sum of 1001+
  rest <- sum(wow$freq[-c(1:1000)])
  #return the four sums as a vector
  c(ten, to100, to1000, rest)
}
library(data.table)
library(tools)
dir = "C:/temp/"
filenames <- list.files(path = dir, pattern = "*.txt", full.names = FALSE)
alltogether <- lapply(filenames, function(x) sumfunction(x))
data <- as.data.frame(data.table::transpose(alltogether),
                      col.names = c("Top 10", "From 11 to 100", "From 101 to 1000", "From 1000 on"),
                      row.names = file_path_sans_ext(basename(filenames)))
This gives me the dataframe that I want. Instead of putting "top 10, 11-100, 101-1000, 1000+" as the row names, I made them the column names and used the names of the text files as the row names. file_path_sans_ext(basename(filenames)) keeps just the file name and removes the extension.
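From that data frame, the joint stacked barplot I mentioned can be drawn directly, for example with one bar per file stacked by the four bins (just a sketch):
# transpose so the four bins become the stacked segments and each file gets its own bar
barplot(t(as.matrix(data)), legend.text = colnames(data), las = 2)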
I hope this helps anyone who reads this! Thank you again! I love this platform because just being part of this environment gets me thinking and striving to better myself at R.
If anyone has any input, that would be great!!! <3

Serial Subsetting in R

I am working with large datasets. I have to extract values from one dataset; the identifiers for those values are stored in another dataset. So basically I am subsetting twice for each value of one category. For multiple categories, I have to combine such double-subsetted values. I am doing something similar to what is shown below, but I think there must be a better way to do it.
Example datasets:
set.seed(1)
df <- data.frame(number = seq(5020, 5035, 1), value = rnorm(16, 20, 5),
                 type = rep(c("food", "bar", "sleep", "gym"), each = 4))
df2 <- data.frame(number = seq(5020, 5035, 1), type = rep(LETTERS[1:4], 4))
Extract the values for grade A:
asub_df2 <- subset(df2, type == "A")
asub_df <- subset(df, number %in% asub_df2$number)
new_a <- cbind(asub_df, grade = rep("A", nrow(asub_df)))
Similarly, extract the values for grade B into new_b, and combine them to do the analysis. Can something like lapply be used here?
You can split 'df2' by type and use lapply:
Filter(Negate(is.null),
       lapply(split(df2, df2$type), function(x) {
         x1 <- subset(df, number %in% x$number)
         if (nrow(x1) > 0) {
           transform(x1, grade = x$type[1])
         }
       }))
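The call above returns a named list with one data frame per grade (grades with no matching rows are dropped by the Filter step). To stack the pieces into a single data frame, like new_a and new_b combined, assign the result to a name (say out_list, a name introduced here for illustration) and rbind it:
out_list <- Filter(Negate(is.null),
                   lapply(split(df2, df2$type), function(x) {
                     x1 <- subset(df, number %in% x$number)
                     if (nrow(x1) > 0) transform(x1, grade = x$type[1])
                   }))
# combine the per-grade pieces into one data frame
combined <- do.call(rbind, out_list)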
