Subsetting data frames while dropping rows which fulfill certain conditions

I want to subset a data frame. Most of the time one reduces an original data frame by keeping the observations which fulfil certain conditions in its variables and dropping the rest.
A working example is:
Companies.Exchanges.1 <- subset(Companies.Exchanges.0,
(Frankfurt == 1 & London == 1))
I want to do it the other way round: drop all observations which fulfil certain conditions and keep the rest, i.e. those which violate at least one condition, in a new data frame.
How do I have to reformulate the above code to do this?

Try negating your filtering conditions with !
Companies.Exchanges.1 <- subset(Companies.Exchanges.0,
!(Frankfurt == 1 & London == 1))
When you specify filtering conditions for subset (or in general), R takes all of your rows and checks them against the conditions you set. Think of it as adding an invisible logical vector to your data frame where rows matching the criteria are TRUE and rows not matching are FALSE. The ! operator reverses this vector, so you keep exactly the rows the original condition would have dropped.
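A minimal sketch with made-up values standing in for the real Companies.Exchanges.0 shows the effect:
Companies.Exchanges.0 <- data.frame(Frankfurt = c(1, 1, 0, 0),
                                    London    = c(1, 0, 1, 0))
subset(Companies.Exchanges.0,  (Frankfurt == 1 & London == 1))   # the one row listed on both exchanges
subset(Companies.Exchanges.0, !(Frankfurt == 1 & London == 1))   # the other three rows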

Related

How to create a subset from a set of values within a column in R

I have a dataframe with 62 columns and 110 rows. In the column "date_observed" I have 57 dates with some of them having multiple records for the same date.
I am trying to extract only 12 dates out of this. They are not in any given order.
I tried this:
datesubset <- original %>% select (original$date_observed == c("13-Jun-21","21-Jun-21", "28-Jun-21", "13-Jul-21", "20-Jul-21", "8-Aug-21", "9-Aug-21", "25-Aug-21", "31-Aug-21", "8-Sep-21", "27-Sep-21"))
But, I got the following error:
Error: Must subset columns with a valid subscript vector.
x Subscript has the wrong type logical.
i It must be numeric or character.
I did try searching here and on Google, but I could only find results for how to subset a set of columns, not for specific values within columns. I am still new to R, so please pardon me if this is a very simple question.
In {dplyr}, the select() function is for selecting particular columns; if you want to subset particular rows, use filter().
The comparison operator == also works element by element, recycling the shorter vector, so each row of date_observed would only be compared against one of your dates in turn rather than checked against all of them, which is not what you are after.
What I think you are after is the operator %in%, which checks whether each value on the left appears anywhere on the right and returns a single TRUE or FALSE per row.
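A quick illustration with made-up vectors (hypothetical names, not your data):
dates_observed <- c("13-Jun-21", "21-Jun-21", "28-Jun-21")
wanted <- c("21-Jun-21", "13-Jun-21")
dates_observed == wanted    # element-wise with recycling (plus a length warning): FALSE FALSE FALSE
dates_observed %in% wanted  # membership test per element: TRUE TRUE FALSE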
As was mentioned, inside tidyverse functions you don't need the $; you can just give the column name, as in the example below.
I don't have your original data to double check, but the example below should work with your original data frame.
specific_dates <- c(
"13-Jun-21",
"21-Jun-21",
"28-Jun-21",
"13-Jul-21",
"20-Jul-21",
"8-Aug-21",
"9-Aug-21",
"25-Aug-21",
"31-Aug-21",
"8-Sep-21",
"27-Sep-21"
)
datesubset <- original %>%
filter(date_observed %in% specific_dates)

Issue with identifying negative values in R

I have a column that has positive and negative values. I'm trying to identify certain rows that meet two different conditions. The first condition is identifying numbers over a certain value; the line of code I have for this works. However, I am having trouble identifying the rows that are less than a certain (negative) number. They are not being identified at all and I'm not sure why.
taskvariables2$PC_LambdaAmbig[taskvariables2$PC_LambdaAmbig>upperbound[5,1]] <- "OB"
taskvariables2$PC_LambdaAmbig[taskvariables2$PC_LambdaAmbig<lowerbound[5,1]] <- "OB"
When the first assignment puts a character value ("OB") into the numeric column, the whole column is coerced to character, so the second comparison is done on strings rather than on numbers and the negative values no longer compare the way you expect. Instead, use ifelse and do both comparisons at once:
taskvariables2$new_variable <- with(taskvariables2, ifelse(PC_LambdaAmbig > upperbound[5,1]|
PC_LambdaAmbig < lowerbound[5,1],
"OB", PC_LambdaAmbig))
NOTE: Here, we are creating a new column instead of assigning to same old column (in case there are more comparisons to be made on the old column)
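A small sketch with made-up numbers shows the coercion problem:
x <- c(-5, -0.5, 2, 10)   # stand-in for the numeric column
x[x > 3] <- "OB"          # first assignment: x is now character ("-5" "-0.5" "2" "OB")
class(x)                  # "character"
x < -1                    # string comparison: "-5" < "-1" is FALSE, so the -5 is missed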
I just did the upper and lower bounds together instead of separately and it worked (both comparisons are evaluated while the column is still numeric, before any "OB" is assigned):
taskvariables2$PC_LambdaAmbig[taskvariables2$PC_LambdaAmbig > upperbound[5,1] | taskvariables2$PC_LambdaAmbig < lowerbound[5,1]] <- "OB"

Remove outlier rows by column and factor in R

I am working with a data-frame in R. I have the following function which removes all rows of a data-frame df where, for a specified column index/attribute, the value at that row is outside mean (of column) plus or minus n*stdev (of column).
remove_outliers <- function(df, attr, n) {
  m <- mean(df[, attr], na.rm = TRUE)
  s <- sd(df[, attr], na.rm = TRUE)
  outliersgone <- df[df[, attr] <= (m + n * s) & df[, attr] >= (m - n * s), ]
  return(outliersgone)
}
There are two parts to my question.
(1) My data-frame df also has a column 'Group', which specifies a class label. I would like to be able to remove outliers according to mean and standard deviation within their group within the column, i.e. organised by factor (within the column). So you would remove from the data-frame a row labelled with group A if, in the specified column/attribute, the value at that row is outside mean (of group A rows in that column) plus/minus n*stdev (of group A rows in that column). And the same for groups B, C, D, E, F, etc.
How can I do this? (Preferably using only base R and dplyr.) I have tried to use df %>% group_by(Group) followed by mutate but I'm not sure what to pass to mutate, given my function remove_outliers seems to require the whole data-frame to be passed into it (so it can return the whole data-frame with rows only removed based on the chosen attribute attr).
I am open to hearing suggestions for changing the function remove_outliers as well, as long as they also return the whole data-frame as explained. I'd prefer solutions that avoid loops if possible (unless inevitable and no more efficient method presents itself in base R / dplyr).
(2) Is there a straightforward way I could combine outlier considerations across multiple columns? E.g. remove from the data-frame df those rows which are outliers with respect to at least N attributes out of a specified vector of attributes/column indices (of length ≥ N), or a more complex condition like: remove from the data-frame df those rows which are outliers with respect to Attribute 1 and at least 2 of Attributes 2, 4, 6 and 8.
(Ideally the definition of outlier would again be within-group within column, as specified in question 1 above, but a solution working in terms of just within column without considering the groups would also be useful for me.)
Ok - part 1 (and trying to avoid loops wherever possible):
Here's some test data:
test_data=data.frame(
group=c(rep("a",100),rep("b",100)),
value=rnorm(200)
)
We'll find the groups:
groups=levels(test_data[,1]) # or unique(test_data[,1]) if it isn't a factor
And we'll calculate the outlier limits (here I'm specifying only 1 sd) - sorry for the loop, but it's only over the groups, not the data:
outlier_sds=1
outlier_limits=sapply(groups,function(g) {
m=mean(test_data[test_data[,1]==g,2])
s=sd(test_data[test_data[,1]==g,2])
return(c(m-outlier_sds*s,m+outlier_sds*s))
})
So we can define the limits for each row of test_data:
test_data_limits=outlier_limits[,test_data[,1]]
And use this to determine the outliers:
outliers=test_data[,2]<test_data_limits[1,] | test_data[,2]>test_data_limits[2,]
(or, combining those last steps):
outliers=test_data[,2]<outlier_limits[1,test_data[,1]] | test_data[,2]>outlier_limits[2,test_data[,1]]
Finally:
test_data_without_outliers=test_data[!outliers,]
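As a quick sanity check (optional), compare the row counts before and after; the exact numbers depend on the rnorm() draw:
nrow(test_data)                           # 200
nrow(test_data_without_outliers)          # fewer rows remain
table(test_data_without_outliers$group)   # rows remaining per group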
EDIT: now part 2 (apply part 1 with a loop over all the columns in the data):
Some test data with more than one column of values:
test_data2=data.frame(
group=c(rep("a",100),rep("b",100)),
value1=rnorm(200),
value2=2*rnorm(200),
value3=3*rnorm(200)
)
Combine all the steps of part 1 into a new function find_outliers that returns a logical vector indicating whether any value is an outlier for its respective column & group:
find_outliers = function(values,n_sds,groups) {
group_names=levels(groups)
outlier_limits=sapply(group_names,function(g) {
m=mean(values[groups==g])
s=sd(values[groups==g])
return(c(m-n_sds*s,m+n_sds*s))
})
return(values < outlier_limits[1,groups] | values > outlier_limits[2,groups])
}
And then apply this function to each of the data columns:
test_groups=test_data2[,1]
test_data_outliers=apply(test_data2[,-1],2,function(d) find_outliers(values=d,n_sds=1,groups=test_groups))
The rowSums of test_data_outliers indicate how many times each row is considered an 'outlier' in the various columns, with respect to its own group:
rowSums(test_data_outliers)
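So, as a sketch of the "outlier in at least N attributes" condition from part 2 (taking N = 2 here), you can keep only the rows flagged in fewer than N columns:
N <- 2
test_data2_keep <- test_data2[rowSums(test_data_outliers) < N, ]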

Using R data.table for selecting from a column containing vector elements

I am trying to learn the recommended ways to create a new column for a data table when the column of interest is a list (or vector), and the selection is done relative to another column, and there may be a preliminary selection done as part of a chain.
Consider these data (named tmp). We want to find the minimum value of sacStartT greater than stimTime (in the real data one or the other of these could be empty and no minimum would exist).
tmp = data.table("pid" = c(14,14,9,9),"trialNumber" = c(25,26,25,26),"stimTime" = c(100,200,1,2),"sacStartT" = list(c(98,99,101,102), c(201,202), c(5), c(-2,-3,3)))
This works:
tmp[,"mintime" := as.integer(min(unlist(sacStartT)[unlist(sacStartT)>stimTime])),by=seq_len(nrow(tmp))]
But if I wanted to first subselect the data I don't know how to get that row number for the row-by-row analysis, e.g.
tmp[pid == 9][,"mintime" := as.integer(min(unlist(sacStartT)[unlist(sacStartT)>stimTime])),by=seq_len(nrow(.N))]
fails because .N refers to the number of rows in tmp, and not the subset in the chain.
In summary, the question is a combination of:
Recommendations for doing this row-by-row analysis?
How to find the right number for the by argument in a chain?
Recommendations for dealing with data.table elements that contain lists? Do you just have to manually unlist them all?
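One possible sketch (assuming the structure above, not a definitive recommendation): iterate over the list column pairwise with mapply, which avoids the row counter and the .N issue entirely, and returns NA when no element exceeds stimTime:
tmp[pid == 9,
    mintime := mapply(function(s, st) {
      ok <- s[s > st]                                   # values of sacStartT after stimTime
      if (length(ok)) as.integer(min(ok)) else NA_integer_
    }, sacStartT, stimTime)]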

Return all rows of a data frame with a certain value

I have a data frame with multiple columns, one of which (called drift.N) is a series of TRUEs and FALSEs. How would I go about separating the TRUE rows from the FALSE rows, or asking R to tell me for which rows drift.N == TRUE?
If you have a data.frame called df:
df[df$column_name,]
gets you the subset of the data.frame where column_name equals TRUE. To get the FALSE subset:
df[!df$column_name,]
(note the exclamation mark !), where ! means NOT. To get the indices where column_name is TRUE or FALSE, respectively:
which(df$column_name)
which(!df$column_name)
Finally, I recommend you go online and download some basic R tutorials and work through them. This question, and many other basics, will be covered in them. See e.g.:
http://www.cyclismo.org/tutorial/R/
http://cran.r-project.org/manuals.html
It is really quite easy because R can use logical indexing. So if drift.N already contains TRUE/FALSE, then simply:
yourdata[yourdata[, "drift.N"], ]
should work. Basically, pass the column vector yourdata[, "drift.N"] as the row subset you want from your whole data frame, yourdata. The rows where drift.N == TRUE will be returned.
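For instance, with a toy data frame (made-up values alongside drift.N):
yourdata <- data.frame(speed = c(1.2, 3.4, 5.6, 7.8),
                       drift.N = c(TRUE, FALSE, TRUE, FALSE))
yourdata[yourdata[, "drift.N"], ]    # rows 1 and 3, where drift.N is TRUE
yourdata[!yourdata[, "drift.N"], ]   # rows 2 and 4, where drift.N is FALSE
which(yourdata$drift.N)              # 1 3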
