Subset Data Frame Rows by value in row.names in R

I have seen "Subsetting a data frame based on a logical condition on a subset of rows" and https://statisticsglobe.com/filter-data-frame-rows-by-logical-condition-in-r
I want to subset a data.frame according to a specific value in the row.names.
data <- data.frame(x1 = c(3, 7, 1, 8, 5), # Create example data
x2 = letters[1:5],
group = c("ga1", "ga2", "gb1", "gc3", "gb1"))
data # Print example data
# x1 x2 group
# 3 a ga1
# 7 b ga2
# 1 c gb1
# 8 d gc3
# 5 e gb1
I want to subset data according to group. One subset should be the rows containing a in their group, one containing b in their group and one c. Maybe something with grepl?
The result should look like this
data.a
# x1 x2 group
# 3 a ga1
# 7 b ga2
data.b
# x1 x2 group
# 1 c gb1
# 5 e gb1
data.c
# 8 d gc3
I would be interested in how to subset one of these output examples, or perhaps a loop would work too.
I modified the example from here https://statisticsglobe.com/filter-data-frame-rows-by-logical-condition-in-r

Extract the part of the data that you want to split on:
sub('\\d+', '', data$group)
#[1] "ga" "ga" "gb" "gc" "gb"
and use the above in split to divide the data into groups.
new_data <- split(data, sub('\\d+', '', data$group))
new_data
#$ga
# x1 x2 group
#1 3 a ga1
#2 7 b ga2
#$gb
# x1 x2 group
#3 1 c gb1
#5 5 e gb1
#$gc
# x1 x2 group
#4 8 d gc3
It is better to keep the data in a list; however, if you want a separate data frame for each group, you can use list2env.
list2env(new_data, .GlobalEnv)

We can use group_split with str_remove from the tidyverse:
library(dplyr)
library(stringr)
data %>%
group_split(grp = str_remove(group, "\\d+$"), .keep = FALSE)

Good question. This solution uses inputs and outputs that closely match the request: "I want to subset data according to group. One subset should be the rows containing a in their group, one containing b in their group and one c. Maybe something with grepl?".
The code below uses the data frame that was provided (named data), and uses grep(), and subsets by group.
code:
ga <- grep("ga", data$group) # separate the data by group type
gb <- grep("gb", data$group)
gc <- grep("gc", data$group)
ga1 <- data[ga,] # subset ga
gb1 <- data[gb,] # subset gb
gc1 <- data[gc,] # subset gc
print(ga1)
print(gb1)
print(gc1)
Windows and Jupyter Lab were used. This output here closely matches the output that was shown above.
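The three grep calls above can be folded into a single lapply over grepl patterns, which is closer to the "maybe something with grepl" idea in the question. This is only a sketch: the pattern names a, b, c and the anchored patterns "^ga", "^gb", "^gc" are assumptions about how the groups should be matched.

```r
data <- data.frame(x1 = c(3, 7, 1, 8, 5),
                   x2 = letters[1:5],
                   group = c("ga1", "ga2", "gb1", "gc3", "gb1"))

# A single subset: rows whose group starts with "gb"
data.b <- data[grepl("^gb", data$group), ]

# All subsets at once, kept together in a named list
# (the names and patterns are assumptions for this example)
patterns <- c(a = "^ga", b = "^gb", c = "^gc")
subsets <- lapply(patterns, function(p) data[grepl(p, data$group), ])

subsets$c  # the rows for group "gc"
```

Keeping the result in a named list means a new group only needs a new entry in patterns rather than another pair of assignments.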

Related

How can I apply the decile cuts from one dataframe to another using R

I have a dataframe (df1) and have calculated the deciles for each row using the following:
#create a function to calculate the deciles
decilefun <- function(x) as.integer(cut(x, unique(quantile(x, probs=0:10/10)), include.lowest=TRUE))
# convert df1 to matrix
mat1 <- as.matrix(df1)
#apply the function I created above to calculate deciles
df1_deciles <- apply(mat1, 1, decilefun)
#add the rownames back in
rownames(df1_deciles) <- row.names(df1)
#convert to dataframe
df1_deciles <- as.data.frame(df1_deciles)
str(df1_deciles) # to show what the data looks like
#'data.frame': 157 obs. of 3321 variables:
# $ Variable1 : int 10 10 4 4 5 8 8 8 6 3 ...
# $ Variable2 : int 8 3 9 7 2 8 9 5 8 2 ...
# $ Variable3 : int 8 4 7 7 2 9 10 3 8 3 ...
I have another dataframe (df2) with the same rownames (Variable1, Variable2,etc...) but different number of columns.
I would like to use the same decile cuts which were used for df1 on this second dataframe but I'm not sure how to do it. I am actually not even sure how to determine/export what the cuts were on the original data which resulted in the df1_deciles dataframe I created. What I mean by this is, how do I export an object which tells me what range of values for Variable1 on df1 were assigned to a decile value = 1 or a decile value = 2, and so on.
I do not want to use the 'decilefun' function I created on df2, but instead want to use the variability and range information from df1.
This is my first question on the platform so I hope it is clear and I hope I have provided enough information. I have tried to find answers on the platform but have not found one. I appreciate any help on this.
Using data.table:
##
# create an artificial dataset with the structure you describe
#
set.seed(1)
df1 <- data.frame(Variable.1=rnorm(1000), variable.2=runif(1000), variable.3=rgamma(1000, scale=10, shape=5))
df1 <- t(df1)
##
#
df2 <- data.frame(Variable.1=rnorm(1000, -1), variable.2=runif(1000), variable.3=rgamma(1000, scale=20, shape=5))
df2 <- t(df2)
##
# you start here
# assumes df1 and df2 have structure described in problem
# data in rows, not columns
#
library(data.table)
df1 <- as.data.table(t(df1)) # transpose: put data in columns
brks <- lapply(df1, quantile, probs=(0:10)/10) # named list of decile break points, one element per variable (row of the original df1)
df2 <- as.data.table(df2, keep.rownames = TRUE) # keep df2 data in rows: 1000 columns here
result <- df2[ # this does all the work
, .(value= unlist(.SD),
decile=cut(unlist(.SD), breaks=c(-Inf, brks[[rn]], +Inf), labels=c('below', names(brks[[rn]])[2:11], 'above'))
)
, by=.(rn)]
result[, .N, keyby=.(rn, decile)] # validate that result is reasonable
Applying deciles from one dataset to another has the nuance that some values in the new dataset might be outside the range of the original data. The test data here demonstrates this problem. Variable.1 in df2 has values lower than any in df1, and variable.3 in df2 has values larger than any in df1.
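For a single variable, the same idea can be sketched in base R: store the break points computed from the old data, then reuse them with cut on the new data. The vectors x_old and x_new are hypothetical stand-ins for one row of df1 and the matching row of df2. Unlike the data.table answer above, this sketch assumes out-of-range values should be folded into decile 1 or 10 rather than labelled separately.

```r
set.seed(1)
x_old <- rnorm(1000)       # hypothetical: one variable from df1
x_new <- rnorm(1000, -1)   # hypothetical: the same variable from df2

# Break points from the old data; this is the object to save or export
brks <- quantile(x_old, probs = 0:10 / 10)

# Assumption: open the outer breaks so that values outside the old range
# fall into decile 1 or decile 10 instead of becoming NA
brks_open <- brks
brks_open[1] <- -Inf
brks_open[11] <- Inf

# Decile (1 to 10) of each new value, relative to the old data's cuts
decile_new <- cut(x_new, breaks = brks_open, labels = FALSE, include.lowest = TRUE)
table(decile_new)
```

Printing brks shows which range of old values maps to each decile, which is the "export the cuts" part of the question.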

randomly ordering across groups (not within group) in data.table

Let's say I want to order the iris dataset (as a data.table) by Species, keeping observations grouped by species and randomly ordering across species.
How do I do that?
I am not talking about generating a random order within groups (species).
My intuition was to write the code below. But it actually creates the within-species random variable. Well, at least it makes the question reproducible.
d <- iris %>% data.table
set.seed(12345)
d[,g:=runif(.N),Species]
You may do a binary search in i. A smaller example:
d <- data.table(Species = rep(letters[1:4], each = 2), ri = 1:8)
set.seed(1)
d[.(sample(unique(Species))), on = "Species"]
# Species ri
# 1: b 3
# 2: b 4
# 3: d 7
# 4: d 8
# 5: c 5
# 6: c 6
# 7: a 1
# 8: a 2
Alternatively you could do:
e <- d[, .N, Species]
e[, g2 := runif(.N)]
d <- e[, .(Species, g2)][d, on = 'Species']
We can randomly sample from a series 1...N where N is the # of levels of the factor (Species) in question.
We then map the new order to a column and sort by it. Broken apart into steps for illustration it looks like this:
tmp <- sample_n(as.data.frame(seq(1,length(unique(d$Species)))),3)[,1]
d$index <- tmp[as.numeric(d$Species)]
d <- d[order(d$index),]
You could compact this into 1 line/step:
d <- d[order(sample_n(as.data.frame(seq(1,length(unique(d$Species)))),3)[,1][as.numeric(d$Species)]),]
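The same mapping idea can be sketched without sample_n and without hard-coding the number of groups, by shuffling the unique group labels and ordering on each row's position in that shuffled vector. This is a base R sketch on a small hypothetical data frame rather than on iris; the names grp_order and d_shuffled are made up for the example.

```r
set.seed(1)
d <- data.frame(Species = rep(letters[1:4], each = 2), ri = 1:8)

# Random order of the groups themselves
grp_order <- sample(unique(d$Species))

# match() gives each row the position of its group in the shuffled order;
# order() keeps tied rows in their original order, so rows within a group stay put
d_shuffled <- d[order(match(d$Species, grp_order)), ]
d_shuffled
```

With a data.table the same expression should work inside i, for example d[order(match(Species, grp_order))].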

Find the mean of one variable subseted by another variable

I have a list of identical dataframes. Each data frame contains columns with unique variables (temp/DO) and with repeated variables (eg-t1).
[[1]]
temp DO t1
1 4 1
3 9 1
5 7 1
I want to find the mean of DO when the temperature is equal to t1.
t1 represents a specific temperature, but the value varies for each data frame in the list so I can't specify an actual value.
So far I've tried writing a function
hvod<-function(DO, temp, depth){
hDO<-DO[which(temp==t1[1])]
mHDO<-mean(hDO)
htemp<-temp[which(temp==t1[1])]
mhtemp<-mean(htemp)
}
hfit<-hvod(data$DO, data$temp, data$depth)
But for whatever reason t1 is not recognized. Any ideas on the function OR
a way to combine select (dplyr function) and lapply to solve this?
I've seen similar posts but none that apply to the issue of a specific value (t1) that changes for each data frame.
I would just take the dataframe as argument and do rest of the logic inside function as it gives more control to the function. Something like this would work,
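hvod <- function(data){
temp <- data$temp
t1 <- data$t1
DO <- data$DO
hDO <- DO[which(temp == t1[1])] # DO values where temp equals this data frame's t1
mHDO <- mean(hDO)
htemp <- temp[which(temp == t1[1])] # note ==, not =
mhtemp <- mean(htemp)
c(mHDO = mHDO, mhtemp = mhtemp) # return both means
}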
You can try using dplyr::bind_rows function to combine all data.frames from list in one data.frame.
Then group on data.frame number to find the mean of DO for rows having temp==t1 as:
library(dplyr)
bind_rows(ll, .id = "DF_Name") %>%
group_by(DF_Name) %>%
filter(temp==t1) %>%
summarise(MeanDO = mean(DO)) %>%
as.data.frame()
# DF_Name MeanDO
# 1 1 4.0
# 2 2 6.5
# 3 3 6.7
Data:
df1 <- read.table(text =
"temp DO t1
1 4 1
3 9 1
5 7 1",
header = TRUE)
df2 <- read.table(text =
"temp DO t1
3 4 3
3 9 3
5 7 1",
header = TRUE)
df3 <- read.table(text =
"temp DO t1
2 4 2
2 9 2
2 7 2",
header = TRUE)
ll <- list(df1, df2, df3)
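If dplyr is not wanted, the same per-data-frame mean can be sketched with sapply over the list. The list below re-creates the sample data so the block runs on its own. It compares temp against the first t1 value of each data frame, as the original function does, which is an assumption that t1 is meant to be constant within a data frame.

```r
ll <- list(
  data.frame(temp = c(1, 3, 5), DO = c(4, 9, 7), t1 = c(1, 1, 1)),
  data.frame(temp = c(3, 3, 5), DO = c(4, 9, 7), t1 = c(3, 3, 1)),
  data.frame(temp = c(2, 2, 2), DO = c(4, 9, 7), t1 = c(2, 2, 2))
)

# Mean DO over the rows where temp equals that data frame's t1 value
mean_do <- sapply(ll, function(df) mean(df$DO[df$temp == df$t1[1]]))
mean_do
```

The result is one number per data frame, in list order, so it can be attached back to whatever identifies each element of the list.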
Thank you Thiloshon and MKR for the help! I had initially combined the data I needed into one list of data frames, but to answer this I actually had my data in separate data frames (fitObs and df1).
The variables I was working with in the code were 1 to 1, so by finding the range where depth and d2 were the same (I used temp and t1 in the example), I could find the mean over that range.
for(i in 1:1044){
df1 <- GLNPOsurveyCTD$data[[i]]
fitObs <- fitTp2(-df1$depth, df1$temp)
deptho <- -abs(df1$depth) #defining temp and depth in the loop
to <- df1$temp
do <- df1$DO
xx <- which(deptho <= fitObs$d2) #mean over range xx
mhtemp <- mean(to[xx], na.rm=TRUE)
mHDO <- mean(do[xx], na.rm=TRUE)
}

Sorting a column in descending order in R excluding the first row

I have a dataframe with 5 columns and a very large dataset. I want to sort by column 3. How do you sort everything after the first row? (When calling this function I want to end it with nrows)
Example output:
Original:
4
7
9
6
8
New:
4
9
8
7
6
Thanks!
If I'm correctly understanding what you want to do, this approach should work:
z <- data.frame(x1 = seq(10), x2 = rep(c(2,3), 5), x3 = seq(14, 23))
zsub <- z[2:nrow(z),]
zsub <- zsub[order(-zsub[,3]),]
znew <- rbind(z[1,], zsub)
Basically, snip off the rows you want to sort, sort them in descending order on column 3, then reattach the first row.
And here's a piped version using dplyr, so you don't clutter the workspace with extra objects:
library(dplyr)
z <- z %>%
slice(2:nrow(z)) %>%
arrange(-x3) %>%
rbind(slice(z, 1), .)
You might try this single line of code to modify the third column in your data frame df as described:
df[,3] <- c(df[1,3], sort(df[-1,3], decreasing = TRUE))
df$x[-1] <- df$x[-1][order(df$x[-1], decreasing = TRUE)]
# x
# 1 4
# 2 9
# 3 8
# 4 7
# 5 6
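The two one-liners above rewrite only the sorted column, so the other columns no longer line up with it. To move whole rows while keeping the first row in place, one option is to build a row index and subset once. A minimal sketch on hypothetical data, where the id column is invented to show that the rows travel together:

```r
df <- data.frame(id = letters[1:5], x = c(4, 7, 9, 6, 8))  # hypothetical data

# Row 1 stays first; the remaining rows are ordered by x, descending.
# order() is computed on rows 2..n, so 1 is added to shift back to full-row indices.
idx <- c(1, 1 + order(df$x[-1], decreasing = TRUE))
df_sorted <- df[idx, ]
df_sorted
```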

Finding unique tuples in R but ignoring order

Since my data is much more complicated, I made a smaller sample dataset (I left the reshape in to show how I generated the data).
set.seed(7)
x = rep(seq(2010,2014,1), each=4)
y = rep(seq(1,4,1), 5)
z = matrix(replicate(5, sample(c("A", "B", "C", "D"))))
temp_df = cbind.data.frame(x,y,z)
colnames(temp_df) = c("Year", "Rank", "ID")
head(temp_df)
require(reshape2)
dcast(temp_df, Year ~ Rank)
which results in...
> dcast(temp_df, Year ~ Rank)
Using ID as value column: use value.var to override.
Year 1 2 3 4
1 2010 D B A C
2 2011 A C D B
3 2012 A B D C
4 2013 D A C B
5 2014 C A B D
Now I essentially want to use a function like unique, but ignoring order to find where the first 3 elements are unique.
Thus in this case:
I would have A,B,C in row 5
I would have A,B,D in rows 1&3
I would have A,C,D in rows 2&4
Also I need counts of these "unique" events
Also 2 more things. First, my values are strings, and I need to leave them as strings.
Second, if possible, I would have a column between year and 1 called Weighting, and then when counting these unique combinations I would include each's weighting. This isn't as important because all weightings will be small positive integer values, so I can potentially duplicate the rows earlier to account for weighting, and then tabulate unique pairs.
You could do something like this:
df <- dcast(temp_df, Year ~ Rank)
combos <- apply(df[, 2:4], 1, function(x) paste0(sort(x), collapse = ""))
combos
# 1 2 3 4 5
# "ABD" "ACD" "ABD" "ACD" "ABC"
For each row of the data frame, the values in columns 1, 2, and 3 (as labeled in the post) are sorted using sort, then concatenated using paste0. Since order doesn't matter, this ensures that identical cases are labeled consistently.
Note that the paste0 function is equivalent to paste(..., sep = ""). The collapse argument says to concatenate the values of a vector into a single string, with vector values separated by the value passed to collapse. In this case, we're setting collapse = "", which means there will be no separation between values, resulting in "ABC", "ACD", etc.
Then you can get the count of each combination using table:
table(combos)
# ABC ABD ACD
# 1 2 2
This is the same solution as @Alex_A's, but using tidyverse functions:
library(purrr)
library(dplyr)
df <- dcast(temp_df, Year ~ Rank)
distinct(df, ID = pmap_chr(select(df, num_range("", 1:3)),
~paste0(sort(c(...)), collapse="")))
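Neither answer covers the weighted count mentioned at the end of the question. One way to sketch it is to sum a weight column per combination with tapply instead of counting rows with table. The data frame below re-types the table from the question by hand, and the Weighting values are invented for illustration.

```r
df <- data.frame(Year = 2010:2014,
                 Weighting = c(2, 1, 3, 1, 1),   # hypothetical weights
                 `1` = c("D", "A", "A", "D", "C"),
                 `2` = c("B", "C", "B", "A", "A"),
                 `3` = c("A", "D", "D", "C", "B"),
                 check.names = FALSE, stringsAsFactors = FALSE)

# Order-independent key for the first three ranks, as in the answer above
combos <- apply(df[, c("1", "2", "3")], 1,
                function(x) paste0(sort(x), collapse = ""))

# Sum of weights per combination instead of a plain row count
weighted <- tapply(df$Weighting, combos, sum)
weighted
```

The IDs stay as strings throughout, and with all weights equal to 1 the result matches table(combos).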
