Locate a value in an R data frame based on other column values

I have a data frame in R
data.frame(age = 18,19,29,
rate = 1.2,4.5,6.8
sex = "male","female","male")
I would like to get the rate associated with age = 18 and sex = male. Is there a way to index with those values, so that I can do this for any pair of age and sex values?
I can do this in dplyr using the filter and select commands, but this is too slow for what I'm trying to do.

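Assuming that df is your data frame, this returns the matching rows:
df[df$age == 18 & df$sex == 'male', ]
To get only the rate, select that column as well:
df[df$age == 18 & df$sex == 'male', "rate"]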

Alternatively, you can use subset.
Assuming your data frame is called df:
df1 <- subset(df, age == 18 & sex == 'male')
And then, to inspect the result:
View(df1)
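If the same lookup has to run many times, one option is to build an index once and then look rates up by name. This is only a minimal base R sketch, assuming each age/sex pair occurs once in the data; rate_lookup and get_rate are names made up for this example.

```r
df <- data.frame(age = c(18, 19, 29),
                 rate = c(1.2, 4.5, 6.8),
                 sex = c("male", "female", "male"))

# Build the index once: a numeric vector named by "age_sex" keys
rate_lookup <- setNames(df$rate, paste(df$age, df$sex, sep = "_"))

# Hypothetical helper: vectorised lookup, returns NA for unknown pairs
get_rate <- function(age, sex) {
  unname(rate_lookup[paste(age, sex, sep = "_")])
}

get_rate(18, "male")                       # 1.2
get_rate(c(29, 19), c("male", "female"))   # 6.8 4.5
```

Each call is then a lookup in a named vector rather than a fresh scan and subset of the whole data frame.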

Your example data.frame does not run as written, so here's a working one ;)
First you can subset the data, then calculate how many rows that subset has relative to the full set.
df <- data.frame(age = c(18,19,29),
                 rate = c(1.2,4.5,6.8),
                 sex = c("male","female","male"),
                 stringsAsFactors = FALSE)
df_sub <- subset(df, age==18 & sex %in% "male")
df_rate <- nrow(df_sub)/nrow(df)
If filter and select are too slow, though, you might want to convert your data.frame into a data.table; data.tables are usually faster than data.frames for this kind of filtering.
library(data.table)
dt <- as.data.table(df)
nrow(dt[age==18 & sex %in% "male"])/nrow(dt)
# or more data.table-like:
dt[age==18 & sex %in% "male", .N] / dt[,.N]
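If what is needed is the value in the rate column rather than a proportion of rows, a keyed data.table can return it directly. The following is a sketch under that assumption; the table wanted is made up for the example.

```r
library(data.table)

dt <- data.table(age = c(18, 19, 29),
                 rate = c(1.2, 4.5, 6.8),
                 sex = c("male", "female", "male"))

# Sort by age and sex and mark them as the key, so lookups use binary search
setkey(dt, age, sex)

# Rate for a single age/sex pair
one_rate <- dt[.(18, "male"), rate]

# Rates for several pairs at once, returned in the order of `wanted`
wanted <- data.table(age = c(29, 19), sex = c("male", "female"))
many_rates <- dt[wanted, rate]
```

Setting the key costs one sort up front, so this pays off mainly when many lookups follow.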

Related

How to identify and remove outliers in a data.frame using R?

I have a data frame that has multiple outliers. I suspect that these outliers have produced different results than expected.
I tried to use this tip, but it didn't work, as I still have very different values: https://www.r-bloggers.com/2020/01/how-to-remove-outliers-in-r/
I tried the solution with the rstatix package, but I can't remove the outliers from my data.frame.
library(rstatix)
library(dplyr)
df <- data.frame(
sample = 1:20,
score = c(rnorm(19, mean = 5, sd = 2), 50))
View(df)
out_df<-identify_outliers(df$score)#identify outliers
df2<-df#copy df
df2<- df2[-which(df2$score %in% out_df),]#remove outliers from df2
View(df2)
identify_outliers expects a data.frame as input, i.e. the usage is
identify_outliers(data, ..., variable = NULL)
where
... - One unquoted expression (or variable name). Used to select a variable of interest. Alternative to the argument variable.
df2 <- subset(df, !score %in% identify_outliers(df, "score")$score)
A rule of thumb is that data points above Q3 + 1.5 * IQR or below Q1 - 1.5 * IQR are considered outliers.
Therefore you just have to identify them and remove them. I don't know how to do it with rstatix, but it can be achieved in base R, following the example below:
# Generate a demo data
set.seed(123)
demo.data <- data.frame(
sample = 1:20,
score = c(rnorm(19, mean = 5, sd = 2), 50),
gender = rep(c("Male", "Female"), each = 10)
)
# identify outliers
outliers <- which(demo.data$score > quantile(demo.data$score)[4] + 1.5*IQR(demo.data$score) |
                  demo.data$score < quantile(demo.data$score)[2] - 1.5*IQR(demo.data$score))
# remove them from your data frame
df2 = demo.data[-outliers,]
Or write a reusable function that returns the indices of the outliers:
get_outliers = function(x){
  which(x > quantile(x)[4] + 1.5*IQR(x) | x < quantile(x)[2] - 1.5*IQR(x))
}
outliers <- get_outliers(demo.data$score)
df2 = demo.data[-outliers,]
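One caveat with negative indexing: if which() finds no outliers, -outliers is an empty index and demo.data[-outliers, ] returns zero rows instead of the full data. A logical mask avoids that. This is a sketch of the same 1.5 * IQR rule, assuming the column has no missing values; is_outlier is a name invented here.

```r
# Hypothetical helper: TRUE for values outside the [Q1 - k*IQR, Q3 + k*IQR] fences
is_outlier <- function(x, k = 1.5) {
  q <- unname(quantile(x, c(0.25, 0.75)))
  fence <- k * (q[2] - q[1])
  x < q[1] - fence | x > q[2] + fence
}

set.seed(123)
demo.data <- data.frame(
  sample = 1:20,
  score = c(rnorm(19, mean = 5, sd = 2), 50)
)

# Keep the rows that are not flagged; nothing is lost when no row is flagged
df2 <- demo.data[!is_outlier(demo.data$score), ]
```

Because the mask always has one entry per row, the result is the unchanged data frame when nothing is flagged.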

Combining rows based on conditions and saving others (in R)

I have a question regarding combining columns based on two conditions.
I have two datasets from an experiment where participants had to type in a code, answer about their gender and eyetracking data was documented. The experiment happened twice (first: random1, second: random2).
eye <- c(1000,230,250,400)
gender <- c(1,2,1,2)
code <- c("ABC","DEF","GHI","JKL")
random1 <- data.frame(code,gender,eye)
eye2 <- c(100,250,230,450)
gender2 <- c(1,1,2,2)
code2 <- c("ABC","DEF","JKL","XYZ")
random2 <- data.frame(code2,gender2,eye2)
Now I want to combine the two data frames. All rows where code and gender match should be combined (so columns are added). The code and gender variables of those two rows should each become one column (code3 and gender3), and the eyetracking data should be split into eye_first for random1 and eye_second for random2.
All rows without a perfect match on both code and gender should go into a separate new dataset.
#this is what the combined data looks like
gender3 <- c(1,2)
eye_first <- c(1000,400)
eye_second <- c(100, 230)
code3 <- c("ABC", "JKL")
random3 <- data.frame(code3,gender3,eye_first,eye_second)
#this is what the data without match should look like
gender4 <- c(2,1,2)
eye4 <- c(230,250,450)
code4 <- c("DEF","GHI","XYZ")
random4 <- data.frame(code4,gender4,eye4)
I would greatly appreciate your help! Thanks in advance.
Use the same column names for your two data.frames and use merge:
random1 <- data.frame(code = code, gender = gender, eye = eye)
random2 <- data.frame(code = code2, gender = gender2, eye = eye2)
df <- merge(random1, random2, by = c("code", "gender"), suffixes = c("_first", "_second"))
For your second request, you can use anti_join from dplyr:
# all = TRUE: keep rows whose ids appear in only one of the two data.frames
df2 <- merge(random1, random2, by = c("code", "gender"), suffixes = c("_first", "_second"), all = TRUE)
library(dplyr)
anti_join(df2, df, by = c("code", "gender"))
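The same split into matched and unmatched rows can also be sketched in base R alone. This version keeps the unmatched rows in the original three-column layout and adds a source column, which is an addition made here and not part of the question.

```r
random1 <- data.frame(code = c("ABC", "DEF", "GHI", "JKL"),
                      gender = c(1, 2, 1, 2),
                      eye = c(1000, 230, 250, 400))
random2 <- data.frame(code = c("ABC", "DEF", "JKL", "XYZ"),
                      gender = c(1, 1, 2, 2),
                      eye = c(100, 250, 230, 450))

# Rows present in both experiments, one row per code/gender pair
matched <- merge(random1, random2, by = c("code", "gender"),
                 suffixes = c("_first", "_second"))

# Combined code/gender keys, used to find rows without a partner
key1 <- paste(random1$code, random1$gender)
key2 <- paste(random2$code, random2$gender)

un1 <- random1[!key1 %in% key2, ]
un1$source <- rep("random1", nrow(un1))
un2 <- random2[!key2 %in% key1, ]
un2$source <- rep("random2", nrow(un2))
unmatched <- rbind(un1, un2)
```

With this data, matched has two rows (ABC and JKL) and unmatched has four. The fourth is the DEF row from random2 (gender 1), which also has no partner, although the hand-written example in the question does not list it.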

Subset of list in Dataframe R in categorical variable

My data looks like this, but the number of observations is approximately 10000.
Part <- c(1,2,3,4,5,6,7)
Disease_codes <- factor(c("A101.12","A111.12","A121.13","A130.0","B102","C132","D156"))
df <- data.frame(Part, Disease_codes)
Observations whose Disease_codes start with A10 to A13 are BloodCancer patients. I need to make a subset of them, and I am trying the following:
BloodCancer <- subset(df, grepl('^A10', Disease_codes), select = Part)
Part_without_Blood_cancer <- subset(df, !grepl('^A10', Disease_codes))
But this gives me only the participants with A10 codes. If I try the following, it does not work:
BloodCancer <- subset(df, grepl('^A10-A13', Disease_codes), select = Part)
I want the BloodCancer variable to contain everything from A10 to A13. How can I do this in one command?
The syntax for grepl to return TRUE for any of several strings (e.g. A10, A11) is as follows:
grepl("A10|A11", variable). Note that there must be no spaces around the |, since a space would become part of the pattern. To keep it as one statement, you can do the following:
BloodCancer = subset(df, grepl(paste(paste("^A1", 0:3, sep = ""), collapse = "|"), Disease_codes), select = Part)
Try it this way:
BloodCancer <- subset(df, grepl("^A1[0-3]", as.character(Disease_codes)), select = Part)
An option with dplyr
library(dplyr)
library(stringr)
df %>%
filter(str_detect(Disease_codes, "^A1[0-3]")) %>%
select(Part)
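To see why the character class works where '^A10-A13' does not: inside a regular expression, A10-A13 is read literally, hyphen included, while [0-3] matches exactly one character in the range 0 to 3. A small base R check on the question's data; the variable is_blood is introduced here only for illustration.

```r
Part <- c(1, 2, 3, 4, 5, 6, 7)
Disease_codes <- factor(c("A101.12", "A111.12", "A121.13", "A130.0",
                          "B102", "C132", "D156"))
df <- data.frame(Part, Disease_codes)

# "^A1[0-3]": starts with "A1", followed by one of 0, 1, 2 or 3
is_blood <- grepl("^A1[0-3]", as.character(df$Disease_codes))

# The literal pattern matches nothing, because no code contains "A10-A13"
literal_hits <- grepl("^A10-A13", as.character(df$Disease_codes))

BloodCancer <- df[is_blood, "Part", drop = FALSE]
Part_without_Blood_cancer <- df[!is_blood, ]
```

Computing the mask once and negating it gives both subsets from a single pattern match.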

Subsetting dataset on dynamic columns

I have a question about subsetting data based on dynamic column classes. For example:
#Coming from another source. I don't know exactly their names or number of classes.
#But the following are two demographics, which will help in imagining the problem
gender <- c(1,2)
agegroup <- c(1,2,3,4,5,6,7,8)
#moredemo.................
# reproducible data
set.seed(1)
col1 <- as.data.frame(rep(gender, 100))
col2 <- as.data.frame(rep(agegroup, 25))
col3 <- runif(200)
datafile <- cbind(col1, col2, col3)
names(datafile)[1] = "gender"
names(datafile)[2] = "agegroup"
datafile <- as.data.frame(datafile)
#Subset shown here is only for gender = 1 and agegroup = 3
#A subset is needed for every combination of classes in each demographic
#No hardcoded names can be used, because the demographic names will not be known
dat_gender_1_agegroup_3 <- datafile[datafile$gender == 1 & datafile$agegroup == 3, ]
But there can be more demographics than just gender and agegroup: there can be income, education, race and so on, and each demographic has a varying number of classes. Kindly help me get subsets of the dataset datafile over the varying number of columns. Thanks in advance.
Using expand.grid for the combinations, then apply to subset:
#dummy data
set.seed(123)
mydata <- data.frame(gender = sample(1:2, 100, replace = TRUE),
agegroup = sample(1:10, 100, replace = TRUE))
#groups
gender <- c(1,2)
agegroup <- c(1,2,3,4,5,6,7,8)
#get all combo
myCombo <- expand.grid(gender, agegroup)
#result is a list object
apply(myCombo, 1, function(i){
  mydata[ mydata$gender == i[1] &
            mydata$agegroup == i[2], ]
})
Edit: Based on the update, I think you just need the split command:
split(datafile, datafile[, 1:2])
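What about this (assuming the column names are "gender" and "agegroup")? The value vectors need names that differ from the column names, otherwise subset() compares each column with itself and every row passes:
gender_values <- c(1,2)
agegroup_values <- c(1,2,3,4,5,6,7,8)
data_subset <- subset(full_data, gender %in% gender_values | agegroup %in% agegroup_values)  # | ... and so on
You can add as many [column_name] %in% [values] conditions as you want.
HTH a little!
EDIT: you can very well use & instead of |, obviously.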
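When the demographic column names are only known at run time, the split approach can take them from a character vector instead of column positions. A minimal sketch, assuming every column except the measurement column (called value here, a name chosen for this example) is a demographic:

```r
set.seed(1)
datafile <- data.frame(gender = rep(1:2, 100),
                       agegroup = rep(1:8, 25),
                       value = runif(200))

# Column names discovered at run time, nothing hardcoded apart from the measurement
demo_cols <- setdiff(names(datafile), "value")

# One data frame per observed combination; drop = TRUE skips empty combinations
subsets <- split(datafile, datafile[demo_cols], drop = TRUE)

# Elements are named by the class values joined with ".", e.g. gender 1, agegroup 3
dat_gender_1_agegroup_3 <- subsets[["1.3"]]
```

In this toy data gender and agegroup cycle together, so only 8 of the 16 possible combinations occur, each with 25 rows.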

dynamic subsetting in r

I have a data set that is something like the following, but with many more columns and rows:
a<-c("Fred","John","Mindy","Mike","Sally","Fred","Alex","Sam")
b<-c("M","M","F","M","F","M","M","F")
c<-c(40,35,25,50,25,40,35,40)
d<-c(9,7,8,10,10,9,5,8)
df<-data.frame(a,b,c,d)
colnames(df)<-c("Name", "Gender", "Age", "Score")
I need to create a function that will let me sum the scores for selected subsets of the data. However, the subsets selected may have different numbers of variables each time. One subset could be Name=="Fred" and another could be Gender == "M" & Age == 40. In my actual data set, there could be up to 20 columns used in a selected subset, so I need to make this as general as possible.
I tried using a sapply command that included eval(parse(text=...), but it takes a long time with only a sample of 20,000 or so records. I'm sure there's a much faster way, and I'd appreciate any help in finding it.
There are several ways to represent these subset definitions. One way is as distinct objects, another is as elements of a list.
However, using a named list might be the easiest:
# df is a function for the F distribution. Avoid using "df" as a variable name
DF <- df
example1 <- list(Name = c("Fred")) # c() not needed, used for emphasis
example2 <- list(Gender = c("M"), Age=c(40, 50))
## notice that the key portion is `DF[[nm]] %in% ll[[nm]]`
subByNmList <- function(ll, DF, colsToSum=c("Score")) {
  ret <- vector("list", length(ll))
  names(ret) <- names(ll)
  for (nm in names(ll))
    ret[[nm]] <- colSums(DF[DF[[nm]] %in% ll[[nm]] , colsToSum, drop=FALSE])
  # optional
  if (length(ret) == 1)
    return(unlist(ret, use.names=FALSE))
  return(ret)
}
subByNmList(example1, DF)
subByNmList(example2, DF)
lapply( subset( df, Gender == "M" & Age == 40, select=Score), sum)
#$Score
#[1] 18
I could have written just:
sum( subset( df, Gender == "M" & Age == 40, select=Score) )
But that would not generalize very well.
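Note that subByNmList above returns one sum per list element rather than a single sum over rows that satisfy all conditions together. If the conditions should be combined with &, as in Gender == "M" & Age == 40, the named-list idea can be adapted as follows. This is a sketch, with sum_where as a made-up name, and it avoids eval(parse(...)) entirely.

```r
DF <- data.frame(
  Name = c("Fred", "John", "Mindy", "Mike", "Sally", "Fred", "Alex", "Sam"),
  Gender = c("M", "M", "F", "M", "F", "M", "M", "F"),
  Age = c(40, 35, 25, 50, 25, 40, 35, 40),
  Score = c(9, 7, 8, 10, 10, 9, 5, 8)
)

# Hypothetical helper: sum `col` over rows matching every entry of `criteria`,
# a named list mapping column names to the values allowed in that column
sum_where <- function(data, criteria, col = "Score") {
  keep <- rep(TRUE, nrow(data))
  for (nm in names(criteria)) {
    keep <- keep & data[[nm]] %in% criteria[[nm]]
  }
  sum(data[[col]][keep])
}

sum_where(DF, list(Name = "Fred"))                    # 18
sum_where(DF, list(Gender = "M", Age = 40))           # 18
sum_where(DF, list(Gender = "M", Age = c(40, 50)))    # 28
```

The loop runs once per condition, not once per row, so the cost grows with the number of columns used, not with repeated parsing of text.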
