Nested subsetting with "[" in R

I recently discovered that, after subsetting an object (e.g. a data frame) with "[", the resulting object can be subset with "[" again on the same line of code (I should have realized it earlier!). Here is an example:
# Create a data frame
df1 <- as.data.frame(matrix(1:9, nrow = 3))
# Take a look at the data frame
df1
V1 V2 V3
1 1 4 7
2 2 5 8
3 3 6 9
# If I want the value which is on the 3rd row and 2nd column
df1[3,2]
[1] 6
# But I could also
df1[,2][3]
[1] 6
A few words on the second alternative: df1[,2] returns an atomic vector (the second column of df1), which is then subset with [3] to extract its third element.
The following data frame will help illustrate my issue. It is a simple data frame containing the names of 26 students, their respective departments, and a numeric value. A seed is set for reproducibility.
set.seed(123)
df2 <- data.frame(name = letters, dept = sample(c("econ", "stat", "math"), 26, replace = TRUE), value = runif(26, 0, 100))
head(df2)
name dept value
1 a econ 54.40660
2 b math 59.41420
3 c stat 28.91597
4 d math 14.71136
5 e math 96.30242
6 f econ 90.22990
I would like to know who has the lowest value in the econ department. The first thing I tried was:
df2[df2$dept == "econ" & df2$value == min(df2$value),]
[1] name dept value
<0 rows> (or 0-length row.names)
It took me a while to understand what I was doing wrong, but I finally realized that my code assumed that the person with the lowest value overall was also in the econ department. That is not the case (hence the zero-row result R returned); the person with the lowest value overall is actually in the stat department:
i <- which(df2$value == min(df2$value))
df2[i,]
name dept value
9 i stat 2.461368
Of course, I can easily find the answer to my question with:
df_econ <- df2[df2$dept == "econ",]
df_econ
name dept value
1 a econ 54.40660
6 f econ 90.22990
15 o econ 14.28000
17 q econ 41.37243
18 r econ 36.88455
19 s econ 15.24447
df_econ[df_econ$value == min(df_econ$value),]
name dept value
15 o econ 14.28
But I would like to know if I can get the same result using "nested" subsetting with the [ operator. What I mean is with a code like this:
df2[df2$dept == "econ",][... ,]
I do not know how to refer to the value column at this point, since the result of the first subsetting operation df2[df2$dept == "econ",] is a data frame different from df2. I also know that the value column is the 3rd column, but I do not know how to write subsetting conditions using column indexes rather than column names.
Thank you for your help.

Here are some options:
library(dplyr)
# also in #bramtayl's answer:
df2 %>% filter(dept == "econ") %>% filter(value==min(value))
# or
df2 %>% filter(dept == "econ") %>% slice(which.min(value))
# or...
library(data.table)
setDT(df2)[dept == "econ"][value==min(value)]
# or
setDT(df2)[dept == "econ"][which.min(value)]
These packages offer convenient chaining that base R only supports awkwardly, for example:
subset(subset(df2, dept=="econ"), value == min(value))
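If you want to keep the literal nested-"[" style from the question in a single base R expression, one workaround (a sketch of my own, not from the answers above) is to apply an anonymous function to the intermediate result so it can be referred to by name:
# The anonymous function receives the econ-only subset as d, so its value
# column can be used without repeating the first subsetting step
# (assumes df2 is still a plain data.frame, i.e. before setDT() above runs).
(function(d) d[d$value == min(d$value), ])(df2[df2$dept == "econ", ])
#    name dept value
# 15    o econ 14.28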
There may be other packages, but these two are widely used lately.
Comment. If you're just browsing data, I'd recommend aggregating at the dept level:
# dplyr:
df2 %>% group_by(dept) %>% slice(which.min(value))
# data.table:
df2[, .SD[which.min(value)], by=dept]
dept name value
1: econ o 14.280002
2: math t 13.880606
3: stat i 2.461368

Agreed that chaining is necessary:
library(magrittr)
df2 %>%
  `[`(.$dept == "econ", ) %>%
  `[`(.$value == min(.$value), )
Why not stick with dplyr though?
library(dplyr)
df2 %>%
  filter(dept == "econ") %>%
  filter(value == min(value))

Related

How to return the range of values shared between two data frames in R?

I have several data frames that have the same column names: an ID, a from and a to column giving the start and end of a range, and a group label for each of them.
What I want is to find which values of from and to from one of the data frames are included in the range of the other one. I leave an example picture to illustrate what I want to achieve (no graph is needed for the moment).
I thought I could accomplish this using between() from the dplyr package, but no luck. The idea would be: if between() returns TRUE, then return the maximum of the two from values and the minimum of the two to values across the data frames.
Here are example data frames and the result I want to obtain.
a <- data.frame(ID = c(1,1,1,2,2,2,3,3,3),from=c(1,500,1000,1,500,1000,1,500,1000),
to=c(400,900,1400,400,900,1400,400,900,1400),group=rep("a",9))
b <- data.frame(ID = c(1,1,1,2,2,2,3,3,3),from=c(300,1200,1900,1400,2800,3700,1300,2500,3500),
to=c(500,1500,2000,2500,3000,3900,1400,2800,3900),group=rep("b",9))
results <- data.frame(ID = c(1,1,1,2,3),from=c(300,500,1200,1400,1300),
to=c(400,500,1400,1400,1400),group=rep("a, b",5))
I tried this function, which returns the values when there is a match, but it doesn't return the range shared between the two data frames:
f <- function(vec, id) {
if(length(.x <- which(vec >= a$from & vec <= a$to & id == a$ID))) .x else NA
}
b$fromA <- a$from[mapply(f, b$from, b$ID)]
b$toA <- a$to[mapply(f, b$to, b$ID)]
We can play with the idea that the starting and ending points are in different columns and that the ranges for the same group (a or b) do not overlap. This is my solution. I have called your mutated 'from' and 'to' columns 'point_1' and 'point_2' for clarity.
You can bind the two data frames and compare the from column with the previous row's lag(to) to see whether the current range starts before the previous one ends. You also compare the previous lag(to) with the current to column to see whether the previous range extends past the current one, keeping the smaller end point.
Important: these operations do not distinguish whether the two rows being compared belong to the same group (a or b). Filtering out the NAs in point_1 (the new mutated 'from' column) removes those wrongly mutated values.
Also, note that I assume a range in 'a' cannot overlap two rows in 'b'. That doesn't happen in your 'results' table, but you should check it in your data frames.
library(dplyr)
res = rbind(a, b) %>%      # bind by rows
  arrange(ID, from) %>%    # arrange by ID and starting point (from)
  group_by(ID) %>%         # perform the following operations grouped by ID
  # Here is the trick. If the ranges for the same ID and group (i.e. 1,a) do
  # not overlap, the mutated point_1 will be NA.
  mutate(point_1 = ifelse(from <= lag(to), from, NA),
         point_2 = ifelse(lag(to) >= to, to, lag(to)),
         groups = paste(lag(group), group, sep = ',')) %>%
  filter(!is.na(point_1)) %>%            # remove NAs in point_1
  select(ID, point_1, point_2, groups)   # get the result data frame
If you play a bit with the code without the filter() and select() steps, you will see how it works.
> res
# A tibble: 5 x 4
# Groups: ID [3]
ID point_1 point_2 groups
<dbl> <dbl> <dbl> <chr>
1 1 300 400 a,b
2 1 500 500 b,a
3 1 1200 1400 a,b
4 2 1400 1400 a,b
5 3 1300 1400 a,b
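Alternatively, the pmax()/pmin() idea from the question can be written with a join; a minimal sketch of mine (not part of the answer above), assuming each shared range is the intersection of one overlapping 'a'/'b' pair:
library(dplyr)
# Pair every 'a' range with every 'b' range for the same ID, keep the pairs
# that overlap, and report the intersection as pmax(from) / pmin(to).
overlaps <- inner_join(a, b, by = "ID", suffix = c(".a", ".b")) %>%
  filter(from.b <= to.a, from.a <= to.b) %>%   # the two ranges overlap
  transmute(ID,
            from = pmax(from.a, from.b),       # later of the two starts
            to = pmin(to.a, to.b),             # earlier of the two ends
            group = paste(group.a, group.b, sep = ", "))
overlaps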

Extract first value after a specific observation

I couldn't quite find what I was looking for; the closest I found was: Extract rows for the first occurrence of a variable in a data frame.
So I'm trying to figure out how I could extract the row directly following a specific observation. So for every year, there will be a place in the data where the observation is "over" and then I want the first numeric value following that "over." How would I do that?
So in the minimal example below, I would want to pluck the "7" (from the threshold variable) which directly follows the "over."
Thanks much!
library(dplyr)
other.values = c(13,10,10,9,8,4,5,7,7,5)
values = c(12,15,16,7,6,4,5,8,8,4)
df = data.frame(values, other.values) %>%
  mutate(threshold = ifelse(values - other.values > 0, "over", values))
You can do :
library(dplyr)
df %>%
  mutate(grp = cumsum(threshold != 'over')) %>%
  filter(lag(threshold) == 'over' & lag(grp) != grp)
# values other.values threshold grp
#1 7 9 7 2
#2 4 5 4 6
You could try something like this, but surely there is a better way:
df$threshold[max(which(df$threshold == "over"))+1]
Basically, this returns the row index of the last match to "over" and then adds 1.
EDIT:
You can subset to those rows which do not have the value "over" in threshold AND (&) which do have "over" in the prior row (lag):
library(dplyr)
df[which(!df$threshold=="over" & lag(df$threshold)=="over"),]
values other.values threshold
4 7 9 7
10 4 5 4

R writing a function to avoid for loop

Hi, I am trying to learn ways to avoid loops in my code.
Here is some example data:
options(warn=-1) #Turning warnings off here
Company=c("A","C","B","B","A","C","C","A","B","C","B","A")
CityID=as.character(c(1,1,1,2,2,2,3,3,3,4,4,4))
Value=c(120.5,123,125,122.5,122.1,121.7,123.2,123.7,120.7,122.3,120.1,122)
Sales=c(1,1,0,0,0,1,1,0,1,0,1,0)
df=data.frame(Company,CityID,Sales,Value)
df$new_value=0
I also created a custom function (simple example only for testing purposes) as below.
funcCity12 = function(data){
  data_new = data[which(data$CityID == '1' | data$CityID == '2'),]
  for (i in 1:nrow(data_new)){
    data_company = df[(df$Company) == data_new[i,'Company'] & !df$CityID == 1 & !df$CityID == 2,]
    data_new[i,'new_value'] = max(data_company[data_company$Sales == 1,]$Value) # Note we take the maximum value here
  }
  data_new
}
df2=funcCity12(data=df) # obtaining the result here
Now I am trying to write a function to avoid the for loop in the previous function.
funcCity12_no_loop = function(x, df){
  data_company = df[(df$Company) == x[,'Company'] & !df$CityID == 1 & !df$CityID == 2,]
  x[,'new_value'] = max(data_company[data_company$Sales == 1,]$Value) # Note we take the maximum value here
  x
}
funcCity12_no_loop(x=df[1,],df=df) #Output for the first row of df1
This seems to be working when I input the rows individually. What I am stuck at is how to run this function for all rows of the dataframe. I am not sure if the 2nd function requires more changes for this purpose. Any help is appreciated. Thanks in advance.
P.S. For the second function, my initial reaction was to create a for loop and loop through the observations, but that defeats the whole purpose.
EDIT
This is based on #eonurk's answer
zz = apply(df, 1, function(x){
  data_company = df[(df$Company) == x[1] & !df$CityID == 1 & !df$CityID == 2,]
  x[5] = max(data_company[data_company$Sales == 1,]$Value) # Note we take the maximum value here
  x
})
You can use the apply function to reach each individual row of your data frame.
For instance, you could multiply the Value and Sales columns (for no particular reason) with the following:
apply(df,1, function(x){ as.numeric(x["Sales"])*as.numeric(x["Value"])})
Edit:
Now you just need to use the dplyr package:
zz = apply(df, 1, function(x){
  data_company = df[(df$Company) == x[1] & !df$CityID == 1 & !df$CityID == 2,]
  x[5] = max(data_company[data_company$Sales == 1,]$Value) # Note we take the maximum value here
  x
}) %>% as.data.frame %>% t
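If you'd rather reuse funcCity12_no_loop() exactly as written, another loop-free option (a sketch of my own, not from the answers) is lapply() over the row indices, binding the one-row results back together:
# Apply the existing row-wise function to each row, then stack the results.
rows <- lapply(seq_len(nrow(df)), function(i) funcCity12_no_loop(x = df[i, ], df = df))
df3 <- do.call(rbind, rows)
Unlike apply(), this keeps each row as a one-row data frame, so the numeric columns are not coerced to character.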
Here is one way without a loop. First we filter based on your criteria, then we group by company and calculate the max, then we join the dataframe to the original dataset (also filtered based on your criteria). I didn't make it a function, but the building blocks are all there.
library(tidyverse)
list(
  df %>%
    filter(CityID %in% 1:2) %>%
    select(-new_value),
  df %>%
    filter(!CityID %in% 1:2 & Sales == 1) %>%
    group_by(Company) %>%
    summarise(new_value = max(Value))
) %>%
  reduce(full_join, by = "Company")
#> Company CityID Sales Value new_value
#> 1 A 1 1 120.5 NA
#> 2 C 1 1 123.0 123.2
#> 3 B 1 0 125.0 120.7
#> 4 B 2 0 122.5 120.7
#> 5 A 2 0 122.1 NA
#> 6 C 2 1 121.7 123.2

R selecting rows by conditions given in an external table

Given the following data
data_min <- data.frame("cond"=c("a","b","c"),"min"=c(1,3,1))
data <- data.frame("cond"=c("a","b","b","a","c"),"val"=c(0,2,4,7,0))
I would like to select all rows from data for which the value in val is bigger than the minimum value specified in data_min for that condition. Thus, in the given example, I expect to end up with this table:
cond val
b 4
a 7
So far, I have tried
datanew <- data[which(data$cond==data_min$cond & data$val > data_min$min),]
which gives me a 7 but not b 4. I have two questions: (1) why do I get the result I get, and (2) how do I get the desired result?
You need to use match because the data.frames have different numbers of rows: in your attempt, data_min$cond and data_min$min are simply recycled to the length of data's columns, so rows are compared positionally rather than matched by condition. match() lines up each row of data with the corresponding row of data_min:
data[data_min[match(data$cond, data_min$cond),]$min <= data$val,]
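To see why the original attempt misbehaves, evaluate the first comparison on its own (a quick illustration, not part of the original answer):
# data$cond has length 5 and data_min$cond has length 3, so the shorter vector
# is recycled (with a warning) before the element-wise comparison:
data$cond == data_min$cond
# effectively compares c("a","b","b","a","c") with c("a","b","c","a","b")
# [1]  TRUE  TRUE FALSE  TRUE FALSE
# i.e. rows are paired by position, not matched by condition.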
You could just merge the two data frames together to make things easier:
> m=merge(data,data_min,by='cond')
> m[which(m$val > m$min), c('cond','val')]
cond val
2 a 7
4 b 4
A solution using dplyr: we can perform a join first and then filter on the condition between the val and min columns.
library(dplyr)
data2 <- data %>%
  left_join(data_min, by = "cond") %>%
  filter(val > min) %>%
  select(-min)
data2
cond val
1 b 4
2 a 7

Remove an entire column from a data.frame in R

Does anyone know how to remove an entire column from a data.frame in R? For example if I am given this data.frame:
> head(data)
chr genome region
1 chr1 hg19_refGene CDS
2 chr1 hg19_refGene exon
3 chr1 hg19_refGene CDS
4 chr1 hg19_refGene exon
5 chr1 hg19_refGene CDS
6 chr1 hg19_refGene exon
and I want to remove the 2nd column.
You can set it to NULL.
> Data$genome <- NULL
> head(Data)
chr region
1 chr1 CDS
2 chr1 exon
3 chr1 CDS
4 chr1 exon
5 chr1 CDS
6 chr1 exon
As pointed out in the comments, here are some other possibilities:
Data[2] <- NULL # Wojciech Sobala
Data[[2]] <- NULL # same as above
Data <- Data[,-2] # Ian Fellows
Data <- Data[-2] # same as above
You can remove multiple columns via:
Data[1:2] <- list(NULL) # Marek
Data[1:2] <- NULL # does not work!
Be careful with matrix-subsetting though, as you can end up with a vector:
Data <- Data[,-(2:3)] # vector
Data <- Data[,-(2:3),drop=FALSE] # still a data.frame
To remove one or more columns by name, when the column names are known (as opposed to being determined at run-time), I like the subset() syntax. E.g. for the data frame
Data <- data.frame(a=1:3, d=2:4, c=3:5, b=4:6)
to remove just the a column you could do
Data <- subset( Data, select = -a )
and to remove the b and d columns you could do
Data <- subset( Data, select = -c(d, b ) )
You can remove all columns between d and b with:
Data <- subset( Data, select = -c( d : b ) )
As I said above, this syntax works only when the column names are known. It won't work when say the column names are determined programmatically (i.e. assigned to a variable). I'll reproduce this Warning from the ?subset documentation:
Warning:
This is a convenience function intended for use interactively.
For programming it is better to use the standard subsetting
functions like '[', and in particular the non-standard evaluation
of argument 'subset' can have unanticipated consequences.
(For completeness) If you want to remove columns by name, you can do this:
cols.dont.want <- "genome"
cols.dont.want <- c("genome", "region") # if you want to remove multiple columns
data <- data[, ! names(data) %in% cols.dont.want, drop = F]
Including drop = F ensures that the result will still be a data.frame even if only one column remains.
The posted answers are very good when working with data.frames. However, these tasks can be pretty inefficient from a memory perspective. With large data, removing a column can take an unusually long amount of time and/or fail due to out of memory errors. Package data.table helps address this problem with the := operator:
library(data.table)
> dt <- data.table(a = 1, b = 1, c = 1)
> dt[,a:=NULL]
b c
[1,] 1 1
I should put together a bigger example to show the differences. I'll update this answer at some point with that.
There are several options for removing one or more columns with dplyr::select() and some helper functions. The helper functions can be useful because some do not require naming all the specific columns to be dropped. Note that to drop columns using select() you need to use a leading - to negate the column names.
Using the dplyr::starwars sample data for some variety in column names:
library(dplyr)
starwars %>%
  select(-height) %>%                  # a specific column name
  select(-one_of('mass', 'films')) %>% # any columns named in one_of()
  select(-(name:hair_color)) %>%       # the range of columns from 'name' to 'hair_color'
  select(-contains('color')) %>%       # any column name that contains 'color'
  select(-starts_with('bi')) %>%       # any column name that starts with 'bi'
  select(-ends_with('er')) %>%         # any column name that ends with 'er'
  select(-matches('^v.+s$')) %>%       # any column name matching the regex pattern
  select_if(~!is.list(.)) %>%          # not by column name but by data type
  head(2)
# A tibble: 2 x 2
homeworld species
<chr> <chr>
1 Tatooine Human
2 Tatooine Droid
You can also drop by column number:
starwars %>%
  select(-2, -(4:10)) # column 2 and columns 4 through 10
With this you can remove the column and store the result in another variable:
df = subset(data, select = -c(genome) )
Using dplyr, the following works:
data <- select(data, -genome)
as per the documentation found here: https://www.marsja.se/how-to-remove-a-column-in-r-using-dplyr-by-name-and-index/#:~:text=select(starwars%2C%20%2Dheight)
I just thought I'd add one in that wasn't mentioned yet. It's simple but also interesting because in all my perusing of the internet I did not see it, even though the highly related %in% appears in many places.
df <- df[ , -which(names(df) == 'removeCol')]
Also, I didn't see anyone post grep alternatives. These can be very handy for removing multiple columns that match a pattern.
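For example, a minimal sketch of the grep/grepl approach (the pattern here is just an illustration):
# grepl() gives a logical vector over names(df), so negating it keeps the
# non-matching columns; this is safe even when nothing matches the pattern.
df_keep <- df[ , !grepl("color", names(df)), drop = FALSE]
# grep() gives integer positions instead; note that if nothing matches,
# -integer(0) selects zero columns, so prefer the grepl() form when the
# pattern might be absent.
df_keep <- df[ , -grep("color", names(df))]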
