Subsetting dataset on dynamic columns - r

I have a question on data subset based on dynamic column class. For example:
#Coming from other source. Dont exaclty know about their names and number of classes.
#But following are two demography, which will help in imagining the problem
gender <- c(1,2)
agegroup <- c(1,2,3,4,5,6,7,8)
#moredemo.................
# reproducible data
set.seed(1)
col1 <- as.data.frame(rep(gender, 100))
col2 <- as.data.frame(rep(agegroup, 25))
col3 <- runif(200)
datafile <- cbind(col1, col2, col3)
names(datafile)[1] = "gender"
names(datafile)[2] = "agegroup"
datafile <- as.data.frame(datafile)
#Subset is only for gender = 1 and agegroup = 3
#Subset is for every combination of classes in each demography
#No hardcoded name is required, because demography name will not be know
dat_gender_1_agegroup_3 <- datafile[datafile$gender == 1 & datafile$agegroup == 3, ]
But there can be more demography and not just gender and agegroup. There can be income or education or race and so on. each of the demography has varying number of class. Kindly help me in getting the subset of the dataset datafile on the varying number of columns. Thanks in advance

Using expand grid for combos then apply to subset:
#dummy data
set.seed(123)
mydata <- data.frame(gender = sample(1:2, 100, replace = TRUE),
agegroup = sample(1:10, 100, replace = TRUE))
#groups
gender <- c(1,2)
agegroup <- c(1,2,3,4,5,6,7,8)
#get all combo
myCombo <- expand.grid(gender, agegroup)
#result is a list object
apply(myCombo, 1, function(i){
mydata[ mydata$gender == i[1] &
mydata$agegroup == i[2], ]
})
Edit: Based on update, I think you just need split command
split(datafile, datafile[, 1:2])

What about (assuming the column names are "gender" and "agegroup"):
gender <- c(1,2)
agegroup <- c(1,2,3,4,5,6,7,8)
data_subset <- subset(full_data, gender%in%gender | agegroup%in%agegroup | [AND SO ON])
You can add as many [column_name]%in%[values] as you want.
HTH a little!
EDIT: you can very well use & instead of |, obviously.

Related

How to identify and remove outliers in a data.frame using R?

I have a dataframe that has multiple outliers. I suspect that these ouliers have produced different results than expected.
I tried to use this tip but it didn't work as I still have very different values: https://www.r-bloggers.com/2020/01/how-to-remove-outliers-in-r/
I tried the solution with the rstatix package, but I can't remove the outliers from my data.frame
library(rstatix)
library(dplyr)
df <- data.frame(
sample = 1:20,
score = c(rnorm(19, mean = 5, sd = 2), 50))
View(df)
out_df<-identify_outliers(df$score)#identify outliers
df2<-df#copy df
df2<- df2[-which(df2$score %in% out_df),]#remove outliers from df2
View(df2)
The identify_outliers expect a data.frame as input i.e. usage is
identify_outliers(data, ..., variable = NULL)
where
... - One unquoted expressions (or variable name). Used to select a variable of interest. Alternative to the argument variable.
df2 <- subset(df, !score %in% identify_outliers(df, "score")$score)
A rule of thumb is that data points above Q3 + 1.5xIQR or below Q1 - 1.5xIQR are considered outliers.
Therefore you just have to identify them and remove them. I don't know how to do it with the dependency rstatix, but with base R can be achived following the example below:
# Generate a demo data
set.seed(123)
demo.data <- data.frame(
sample = 1:20,
score = c(rnorm(19, mean = 5, sd = 2), 50),
gender = rep(c("Male", "Female"), each = 10)
)
#identify outliers
outliers <- which(demo.data$score > quantile(demo.data$score)[4] + 1.5*IQR(demo.data$score) | demo.data$score < quantile(demo.data$score)[2] - 1.5*IQR(demo.data$score))
# remove them from your dataframe
df2 = demo.data[-outliers,]
Do a cooler function that returns to you the index of the outliers:
get_outliers = function(x){
which(x > quantile(x)[4] + 1.5*IQR(x) | x < quantile(x)[2] - 1.5*IQR(x))
}
outliers <- get_outliers(demo.data$score)
df2 = demo.data[-outliers,]

Horizontal table for many variables with the same categories

I'm looking for an easy way to make a table in R that shows each variable as a row in the dataframe and then each variable category as the column of the dataframe. In each cell the frequency of that category should be displayed and then the sum is the last column. The point is to display distribution for different variables with the same categories easily. I have included to a picture to show what I'm looking for.
I have managed to produce some code that achieves what I want, but it takes a lot of time to do this for each variable i want to include in the table.
mydata <- as.data.frame((table(mydat$var)))
mydata <- as.data.frame(t(mydata))
mydata <- lapply(mydata, as.numeric)
mydata <- as.data.frame(mydata)
mydata$sum <- (mydata$category 1 + mydata$category 2 + mydata$category 3)
mydata[-c(1), ]
The result looks like this:
To add more variables I imagine that i could use rbind(), but there might be some easier way to achieve something similar?
Here is a reproducible example using the mtcars dataset.
data("mtcars")
tdata <- as.data.frame(table(mtcars$cyl))
tdata1 <- as.data.frame(t(tdata))
tdata2 <- lapply(tdata1, as.numeric)
tdata3 <- as.data.frame(tdata2)
tdata3$sum <- (tdata3$V1 + tdata3$V2 + tdata3$V3)
tdata3 <- tdata3[-c(1),]
tdata3
Assuming you have a data.frame where each variable has the same categories (as in your example):
df <- data.frame(Var1 = c(rep("Cat1", 30),
rep("Cat2", 10),
rep("Cat3", 20) ),
Var2 = c(rep("Cat1", 10),
rep("Cat2", 20),
rep("Cat3", 30) ),
Var3 = c(rep("Cat1", 5),
rep("Cat2", 25),
rep("Cat3", 30) ) )
You could use lapply() to apply the table() function to every column in your data.frame:
tab <- lapply(colnames(df), function(x) table(df[, x]))
As lapply() outputs a list, use do.call to bind them, and rowSums() to create the sum column:
tab <- data.frame(do.call(rbind, t(tab)))
tab$Sum <- rowSums(tab)
# add variable labels as rows
rownames(tab) <- colnames(df)
The output will look like this:
Cat1 Cat2 Cat3 Sum
Var1 30 10 20 60
Var2 10 20 30 60
Var3 5 25 30 60
And, you could throw all this in a function:
my_tab_fun <- function(df) {
tab <- lapply(colnames(df),
function(x) table(df[, x]))
tab <- data.frame(
do.call(rbind, t(tab)))
tab$Sum <- rowSums(tab)
rownames(tab) <- colnames(df)
return(tab)
}
my_tab_fun(df)

Combining rows based on conditions and saving others (in R)

I have a question regarding combining columns based on two conditions.
I have two datasets from an experiment where participants had to type in a code, answer about their gender and eyetracking data was documented. The experiment happened twice (first: random1, second: random2).
eye <- c(1000,230,250,400)
gender <- c(1,2,1,2)
code <- c("ABC","DEF","GHI","JKL")
random1 <- data.frame(code,gender,eye)
eye2 <- c(100,250,230,450)
gender2 <- c(1,1,2,2)
code2 <- c("ABC","DEF","JKL","XYZ")
random2 <- data.frame(code2,gender2,eye2)
Now I want to combine the two dataframes. For all rows where code and gender match, the rows should be combined (so columns added). Code and gender variables of those two rows should become one each (gender3 and code3) and the eyetracking data should be split up into eye_first for random1 and eye_second for random2.
For all rows where there was not found a perfect match for their code and gender values, a new dataset with all of these rows should exist.
#this is what the combined data looks like
gender3 <- c(1,2)
eye_first <- c(1000,400)
eye_second <- c(100, 230)
code3 <- c("ABC", "JKL")
random3 <- data.frame(code3,gender3,eye_first,eye_second)
#this is what the data without match should look like
gender4 <- c(2,1,2)
eye4 <- c(230,250,450)
code4 <- c("DEF","GHI","XYZ")
random4 <- data.frame(code4,gender4,eye4)
I would greatly appreciate your help! Thanks in advance.
Use the same column names for your 2 data.frames and use merge
random1 <- data.frame(code = code, gender = gender, eye = eye)
random2 <- data.frame(code = code2, gender = gender2, eye = eye2)
df <- merge(random1, random2, by = c("code", "gender"), suffixes = c("_first", "_second"))
For your second request, you can use anti_join from dplyr
df2 <- merge(random1, random2, by = c("code", "gender"), suffixes = c("_first", "_second"), all = TRUE) # all = TRUE : keep rows with ids that are only in one of the 2 data.frame
library(dplyr)
anti_join(df2, df, by = c("code", "gender"))

Locate a value in R dataframe based on other column values

I have a data frame in R
data.frame(age = 18,19,29,
rate = 1.2,4.5,6.8
sex = "male","female","male")
I would like to get the rate associated with values age =18 and sex = male. Is there a way I can index with those values and be able to do this with any pair of age and sex values.
I can do this in dpylr using filter and select commands but this is too slow for what I'm trying to do
assuming that df is your dataframe:
df[(df$age == 18 & df$sex == 'male'),]
Alternatively, you can use subset.
Assuming your dataframe is called df:
df1 <- subset(df,df$age==18 & df$sex=='male')
And then
View(df1)
your example data.frame is not properly working, here's one ;)
first you can subset the data, then calculate how many rows you have in that subset versus the main set.
df <- data.frame(age = c(18,19,29),
rate = c(1.2,4.5,6.8),
sex = c("male","female","male"),
stringsAsFactors = F)
df_sub <- subset(df, age==18 & sex %in% "male")
df_rate <- nrow(df_sub)/nrow(df)
Though if you say filter and select are too slow, you might want to convert your data.frame into a data.table, they are normally faster than data.frames.
library(data.table)
dt <- as.data.table(df)
nrow(dt[age==18 & sex %in% "male"])/nrow(dt)
# or more data.table-like:
dt[age==18 & sex %in% "male", .N] / dt[,.N]

Merging Long-Form Data that has NAs with Wide-Form Complete Data To Override NAs

So I have three data sets that I need to merge. These contain school data and read/math scores for grades 4 and 5. One of them is a long form data set that has a lot of missingness in some variables (yes, I do need the data in long form) and the other two have the full missing data in wide form. All of these data frames contain a column that has an unique ID number for each individual in the database.
Here is a full reproducible example that generates a small example of the types of data.frames I am working with... The three data frames that I need to use are the following: school_lf, school4 and school5. school_lf has the long form data with NAs and school4 and school5 are the dfs I need to use to populate the NA's in this long form data (by id and grade)
set.seed(890)
school <- NULL
school$id <-sample(102938:999999, 100)
school$selected <-sample(0:1, 100, replace = T)
school$math4 <- sample(400:500, 100)
school$math5 <- sample(400:500, 100)
school$read4 <- sample(400:500, 100)
school$read5 <- sample(400:500, 100)
school <- as.data.frame(school)
# Delete observations at random from the school df
indm4 <- which(school$math4 %in% sample(school$math4, 25))
school$math4[indm4] <- NA
indm5 <- which(school$math5 %in% sample(school$math5, 50))
school$math5[indm5] <- NA
indr4 <- which(school$read4 %in% sample(school$read4, 70))
school$read4[indr4] <- NA
indr5 <- which(school$read5 %in% sample(school$read5, 81))
school$read5[indr5] <- NA
# Separate Read and Math
read <- as.data.frame(subset(school, select = -c(math4, math5)))
math <- as.data.frame(subset(school, select = -c(read4, read5)))
# Now turn this into long form data...
clr <- melt(read, id.vars = c("id", "selected"), variable.name = "variable", value.name = "readscore")
clm <- melt(math, id.vars = c("id", "selected"), value.name = "mathscore")
# Clean up the grades for each of these...
clr$grade <- ifelse(clr$variable == "read4", 4,
ifelse(clr$variable == "read5", 5, NA))
clm$grade <- ifelse(clm$variable == "math4", 4,
ifelse(clm$variable == "math5", 5, NA))
# Put all these in one df
school_lf <-cbind(clm, clr$readscore)
school_lf$readscore <- school_lf$`clr$readscore` # renames
school_lf$`clr$readscore` <- NULL # deletes
school_lf$variable <- NULL # deletes
###############
# Generate the 2 data frames with IDs that have the full data
set.seed(890)
school4 <- NULL
school4$id <-sample(102938:999999, 100)
school4$selected <-sample(0:1, 100, replace = T)
school4$math4 <- sample(400:500, 100)
school4$read4 <- sample(400:500, 100)
school4$grade <- 4
school4 <- as.data.frame(school4)
set.seed(890)
school5 <- NULL
school5$id <-sample(102938:999999, 100)
school5$selected <-sample(0:1, 100, replace = T)
school5$math5 <- sample(400:500, 100)
school5$read5 <- sample(400:500, 100)
school5$grade <- 5
school5 <- as.data.frame(school5)
I need to merge the wide-form data into the long-form data to replace the NAs with the actual values. I have tried the code below, but it introduces several columns instead of merging the read scores and the math scores where there's NA's. I simply need one column with the read scores and one with the math scores, instead of six separate columns (read.x, read.y, math.x, math.y, mathscore and readscore).
sch <- merge(school_lf, school4, by = c("id", "grade", "selected"), all = T)
sch <- merge(sch, school5, by = c("id", "grade", "selected"), all = T)
Any help is highly appreciated! I've been trying to solve this for hours now and haven't made any progress (so figured I'd ask here)
You can use the coalesce function from dplyr. If a value in the first vector is NA, it will see if the value at the same position in the second vector is not NA and select it. If again NA, it goes to the third.
library(dplyr)
sch %>% mutate(mathscore = coalesce(mathscore, math4, math5)) %>%
mutate(readscore = coalesce(readscore, read4, read5)) %>%
select(id:readscore)
EDIT: I just tried to do this approach on my actual data and it does not work because the replacement data also has some NAs and, as a result, the dfs I try to do coalesce with have differing number of rows... Back to square one.
I was able to figure this out with the following code (albeit it's not the most elegant or straight-forward ,and #Edwin's response helped point me in the right direction. Any suggestions on how to make this code more elegant and efficient are more than welcome!
# Idea: put both in long form and stack on top of one another... then merge like that!
sch4r <- as.data.frame(subset(school4, select = -c(mathscore)))
sch4m <- as.data.frame(subset(school4, select = -c(readscore)))
sch5r <- as.data.frame(subset(school5, select = -c(mathscore)))
sch5m <- as.data.frame(subset(school5, select = -c(readscore)))
# Put these in LF
sch4r_lf <- melt(sch4r, id.vars = c("id", "selected", "grade"), value.name = "readscore")
sch4m_lf <- melt(sch4m, id.vars = c("id", "selected", "grade"), value.name = "mathscore")
sch5r_lf <- melt(sch5r, id.vars = c("id", "selected", "grade"), value.name = "readscore")
sch5m_lf <- melt(sch5m, id.vars = c("id", "selected", "grade"), value.name = "mathscore")
# Combine in one DF
sch_full_4 <-cbind(sch4r_lf, sch4m_lf$mathscore)
sch_full_4$mathscore <- sch_full_4$`sch4m_lf$mathscore`
sch_full_4$`sch4m_lf$mathscore` <- NULL # deletes
sch_full_4$variable <- NULL
sch_full_5 <- cbind(sch5r_lf, sch5m$mathscore)
sch_full_5$mathscore <- sch_full_5$`sch5m$mathscore`
sch_full_5$`sch5m$mathscore` <- NULL
sch_full_5$variable <- NULL
# Stack together
sch_full <- rbind(sch_full_4,sch_full_5)
sch_full$selected <- NULL # delete this column...
# MERGE together
final_school_math <- mutate(school_lf, mathscore = coalesce(school_lf$mathscore, sch_full$mathscore))
final_school_read <- mutate(school_lf, readscore = coalesce(school_lf$readscore, sch_full$readscore))
final_df <- cbind(final_school_math, final_school_read$readscore)
final_df$readscore <- final_df$`final_school_read$readscore`
final_df$`final_school_read$readscore` <- NULL

Resources