Serial Subsetting in R - r

I am working with a large datasets. I have to extract values from one datasets, the identifiers for the values are stored in another dataset. So basically I am subsetting twice for each value of one category. For multiple category, I have to combine such double-subsetted values. So I am doing something similar to this shown below, but I think there must be a better way to do it.
example datasets
set.seed(1)
df <- data.frame(number= seq(5020, 5035, 1), value =rnorm(16, 20, 5),
type = rep(c("food", "bar", "sleep", "gym"), each = 4))
df2 <- data.frame(number= seq(5020, 5035, 1), type = rep(LETTERS[1:4], 4))
extract value for grade A
asub_df2 <-subset(df2, type == "A" )
asub_df <-subset(df, number == asub_df2$number)
new_a <- cbind(asub_df, grade = rep(c("A"),nrow(asub_df)))
similarly extract value for grade B in new_b and combine to do any analysis.
can we use

You can split the 'df2' and use lapply
Filter(Negate(is.null),
lapply(split(df2, df2$type), function(x) {
x1 <- subset(df, number==x$number)
if(nrow(x1)>0) {
transform(x1, grade=x$type[1])
}
}))

Related

How to identify and remove outliers in a data.frame using R?

I have a dataframe that has multiple outliers. I suspect that these ouliers have produced different results than expected.
I tried to use this tip but it didn't work as I still have very different values: https://www.r-bloggers.com/2020/01/how-to-remove-outliers-in-r/
I tried the solution with the rstatix package, but I can't remove the outliers from my data.frame
library(rstatix)
library(dplyr)
df <- data.frame(
sample = 1:20,
score = c(rnorm(19, mean = 5, sd = 2), 50))
View(df)
out_df<-identify_outliers(df$score)#identify outliers
df2<-df#copy df
df2<- df2[-which(df2$score %in% out_df),]#remove outliers from df2
View(df2)
The identify_outliers expect a data.frame as input i.e. usage is
identify_outliers(data, ..., variable = NULL)
where
... - One unquoted expressions (or variable name). Used to select a variable of interest. Alternative to the argument variable.
df2 <- subset(df, !score %in% identify_outliers(df, "score")$score)
A rule of thumb is that data points above Q3 + 1.5xIQR or below Q1 - 1.5xIQR are considered outliers.
Therefore you just have to identify them and remove them. I don't know how to do it with the dependency rstatix, but with base R can be achived following the example below:
# Generate a demo data
set.seed(123)
demo.data <- data.frame(
sample = 1:20,
score = c(rnorm(19, mean = 5, sd = 2), 50),
gender = rep(c("Male", "Female"), each = 10)
)
#identify outliers
outliers <- which(demo.data$score > quantile(demo.data$score)[4] + 1.5*IQR(demo.data$score) | demo.data$score < quantile(demo.data$score)[2] - 1.5*IQR(demo.data$score))
# remove them from your dataframe
df2 = demo.data[-outliers,]
Do a cooler function that returns to you the index of the outliers:
get_outliers = function(x){
which(x > quantile(x)[4] + 1.5*IQR(x) | x < quantile(x)[2] - 1.5*IQR(x))
}
outliers <- get_outliers(demo.data$score)
df2 = demo.data[-outliers,]

pass a list of variable names as an argument to an R function

I am trying achieve the following: I have a dataset, and a function that subsets this dataset and then performs a series of operations on the subset. Subsetting happens based on row names. I am able to do it step by step (i.e. running this function for each subset separately), but I have a list of desired subsets, and I would like to loop over this list. It sounds complicated - please check the example below.
This is what I can do:
#dataframe with rownames
whole_dataset <- data.frame(wt1 = c(1, 2, 3, 6, 6),
wt2 = c(2, 3, 4, 4, 2))
row.names(whole_dataset) = c("HTA1", "HTA2", "HTB2", "CSE1", "CSE2")
# two different non-overlapping subsets
his <- c("HTA1", "HTA2", "HTB2")
cse <- c("CSE1", "CSE2")
#this is the function I have
fav_complex <- function (data, complex) {
small_data<- data[complex,] #subset only the rows that you need
sum.all<-colSums(small_data) #calculate sum of columns
return(sum.all)
}
#I generate two deparate named vectors
his_data <- fav_complex(data = whole_dataset, complex = his)
cse_data <- fav_complex(data = whole_dataset, complex = cse)
#and merge them
merged_data<- rbind(his_data,cse_data)
it looks like this
> merged_data
wt1 wt2
his_data 6 9
cse_data 12 6
I would like to somehow generate the merged_data dataframe without having to call the 'fav_complex' function multiple times. In real life I have about 20 subsets, and it is a lot of code. This is my solution that doesn't work
#I first have a character vector listing all the variable names
subset_list <- c("his", "cse")
#then create a loop that goes over this list
#make an empty dataframe
merged_data2 <- data.frame()
#fill it with a for loop output
for (element in subset_list) {
result <- fav_complex(data = whole_dataset, element)
merged_data2 <-rbind(merged_data2, result)
}
I know this is wrong. In this loop, 'element' is just a string, rather than a variable with stuff in it. But I don't know how to make it a variable. noquote(element) didn't work. I tried reading about non standard evaluation and eval(), substitute(), but it is too abstract for me - I think I am not there yet with my R expertise.
Consider by to run needed operation across all subsets. But first create a group column:
# ANY FUNCTION TO APPLY ON SUBSETS (REMOVE GROUP COL)
fav_complex_new <- function (sub) {
sum.all <- colSums(transform(sub, group=NULL))
return(sum.all)
}
# ASSIGN GROUPING
whole_dataset$group <- ifelse(row.names(whole_dataset) %in% his, "his",
ifelse(row.names(whole_dataset) %in% cse, "cse", NA))
# BY CALL
df_list <- by(whole_dataset, whole_dataset$group, FUN=fav_complex_new)
# COMBINE ALL DFs IN LIST
merged_data <- do.call(rbind, df_list)
Rextester demo (includes OP's original and above solution)
Following #Gregor's suggestion of a modified workflow, would you consider this solution, including some bonus data wrangling?
Put the data that's currently in row names in its own column.
Add a column for complex. We can do this programmatically in case the data are large.
Use dplyr to created split-apply-combine summaries of data grouped by complex.
It could work like this
library(dplyr)
whole_dataset <- tibble(wt1 = c(1, 2, 3, 6, 6),
wt2 = c(2, 3, 4, 4, 2),
id = factor(c("HTA1", "HTA2", "HTB2", "CSE1", "CSE2")))
whole_dataset <- mutate(whole_dataset,
complex = case_when(
grepl("^HT", id) ~ "his",
grepl("^CSE", id) ~ "cse")
) %>%
group_by(factor(complex))
whole_dataset %>% summarize(sum_wt1 = sum(wt1),
sum_wt2 = sum(wt2))
# # A tibble: 2 x 3
# `factor(complex)` sum_wt1 sum_wt2
# <fct> <dbl> <dbl>
# 1 cse 12 6
# 2 his 6 9

Loop through rows in list of dataframes and extract data. (Nested "apply" functions)

I am new to R and trying to do things the "R" way, which means no for loops. I would like to loop through a list of dataframes, loop through each row in the dataframe, and extract data based on criteria and store in a master dataframe.
Some issues I am having are with accessing the "global" dataframe. I am unsure the best approach (global variable, pass by reference).
I have created an abstract example to try to show what needs to be done:
rm(list=ls())## CLEAR WORKSPACE
assign("last.warning", NULL, envir = baseenv())## CLEAR WARNINGS
# Generate a descriptive name with name and size
generateDescriptiveName <- function(animal.row, animalList.vector){
name <- animal.row["animal"]
size <- animal.row["size"]
# if in list of interest prepare name for master dataframe
if (any(grepl(name, animalList.vector))){
return (paste0(name, "Sz-", size))
}
}
# Animals of interest
animalList.vector <- c("parrot", "cheetah", "elephant", "deer", "lizard")
jungleAnimals <- c("ants", "parrot", "cheetah")
jungleSizes <- c(0.1, 1, 50)
jungle.df <- data.frame(jungleAnimals, jungleSizes)
fieldAnimals <- c("elephant", "lion", "hyena")
fieldSizes <- c(1000, 100, 80)
field.df <- data.frame(fieldAnimals, fieldSizes)
forestAnimals <- c("squirrel", "deer", "lizard")
forestSizes <- c(1, 40, 0.2)
forest.df <- data.frame(forestAnimals, forestSizes)
ecosystems.list <- list(jungle.df, field.df, forest.df)
# Final master list
descriptiveAnimal.df <- data.frame(name = character(), descriptive.name = character())
# apply to all dataframes in list
lapply(ecosystems.list, function(ecosystem.df){
names(ecosystem.df) <- c("animal", "size")
# apply to each row in dataframe
output <- apply(ecosystem.df, 1, function(row){generateDescriptiveName(row, animalList.vector)})
if(!is.null(output)){
# Add generated names to unique master list (no duplicates)
}
})
The end result would be:
name descriptive.name
1 "parrot" "parrot Sz-0.1"
2 "cheetah" "cheetah Sz-50"
3 "elephant" "elephant Sz-1000"
4 "deer" "deer Sz-40"
5 "lizard" "lizard Sz-0.2"
I did not use your function generateDescriptiveName() because I think it is a bit too laborious. I also do not see a reason to use apply() within lapply(). Here is my attempt to generate the desired output. It is not perfect but I hope it helps.
df_list <- lapply(ecosystems.list, function(ecosystem.df){
names(ecosystem.df) <- c("animal", "size")
temp <- ecosystem.df[ecosystem.df$animal %in% animalList.vector, ]
if(nrow(temp) > 0){
data.frame(name = temp$animal, descriptive.name = paste0(temp$animal, " Sz-", temp$size))
}
})
do.call("rbind",df_list)

Mapply to Add Column to Each Dataframe in a List

Implemented some code from previous question:
Lapply to Add Columns to Each Dataframe in a List
Using the method above, I receive corrupt data. While I cannot provide actual data, I am wondering if additional arguments need to be implemented to prevent shuffling.
Basically, this:
Require: data.table
df1 <- data.frame(x = runif(3), y = runif(3))
df2 <- data.frame(x = runif(3), y = runif(3))
dfs <- list(df1, df2)
years <- list(2013, 2014)
a<-Map(cbind, dfs, year = years)
final<-rbindlist(a)
But applied to a list of thousands of data frame lists has incorrect results. Assume that some data frames, say df 1.5 somewhere between two above data frames, are empty. Would that affect the order in which the Map binds the years to the dfs? Essentially, I have an output with some data belonging to different years than the Map attached it to. I tested the length and order of years list, and compared it to the output year in final. They are identical. Any thoughts?
We create a logical index based on the length of each element in 'dfs', use that to subset both the 'dfs' and the 'years' and then do the cbind with Map
i1 <- sapply(dfs, length)>1
Or to make it more stringent
i1 <- sapply(dfs, function(x) is.data.frame(x) & !is.null(x) & length(x) >0 )
a <- Map(cbind, dfs[i1], year = years[i1])
and then do the rbindlist with fill = TRUE in case the number of columns are not the same in all the data.frames in the `list.
rbindlist(a, fill = TRUE)
data
dfs[[3]] <- list(NULL)
dfs[[4]] <- data.frame()
years <- 2013:2016
Use the idcol argument to rbindlist and add the year column afterwards:
res = rbindlist(dfs, idcol=TRUE)
res[.(.id = 1:2, year = 2013:2014), on=".id", year := i.year]
X[i, on=cols, z := i.z] merges X with i on cols and then copies z from i to X.

R: Looping through list of dataframes in a vector

I have a dataset where I only want to loop through certain columns in a dataframe one at a time to create a graph. The structure of my dataframe consists of data that I parsed from a larger dataset into a vector containing multiple dataframes.
I want to call one column from one dataframe in the vector. I want to loop on the dataframes to call each column.
See example below:
d1 <- data.frame(y1=c(1,2,3),y2=c(4,5,6))
d2 <- data.frame(y1=c(3,2,1),y2=c(6,5,4))
my.list <- list(d1, d2)
All I have to work with is my.list
How would I do this?
You can use lapply to plot each of the individual data frames in your list. For example,
d1 <- data.frame(y1=c(1,2,3),y2=c(4,5,6),y3=c(7,8,9))
d2 <- data.frame(y1=c(3,2,1),y2=c(6,5,4),y3=c(11,12,13))
mylist <- list(d1, d2)
par(mfrow=c(2,1))
# lapply on a subset of columns
lapply(mylist, function(x) plot(x$y2, x$y3))
You don't need a for loop to get their data points. You can call the column by their column names.
# a toy dataframe
d <- data.frame(A = 1:20, B = sample(c(FALSE, TRUE), 20, replace = TRUE),
C = LETTERS[1:20], D = rnorm(20, 0, 1))
col_names <- c("A", "B", "D") # names of columns I want to get
d[,col_names] # returns a dataset with the values of the columns you want
Here is a solution to your problem using a for loop:
# a toy dataframe
mylist <- list(dat1 = data.frame(A = 1:20, B = LETTERS[1:20]),
dat2 = data.frame(A = 21:40, B = LETTERS[1:20]),
dat3 = data.frame(A = 41:60, B = LETTERS[1:20]))
col_names <- c("A") # name of columns I want to get
for (i in 1:length(mylist)){
# you can do whatever you want with what is returned;
# here I am just print them out
print(names(mylist)[i]) # name of the data frame
print(mylist[[i]][,col_names]) # values in Column A
}
I think the simplest answer to your question is to use double brackets.
for (i in 1:length(my.list)) {
print(my.list[[i]]$column)
}
That works assuming all of the columns in your list of data frames have the same names. You could also call the position of the column in the data frame if you wanted.
Yes, lapply can be more elegant, but in some situations a for loop makes more sense.

Resources