Subsetting data.frame upon two constraints - r

Say I want to subset using two constraints:
1. the values in the first column must be identical
2. and, at the same time, the values in the second column must be the same
For example, I have a data frame
a <- rep(1:5)
b <- c(1,2,2,2,1,1,1,2,2,2)
data <- data.frame(a,b)
Say a is the pair identification number and b represents the gender.
Now we want to subset to create a dataset where each pair ID is matched on gender.
Would one create a loop using the while command, or use duplicated?
The expected result is the subset of rows where both members of a pair share the same gender (shown highlighted in green in a screenshot that is not reproduced here).

You can try
data[with(data, !!ave(b, a, FUN = function(x) length(unique(x)) == 1)), ]
Or
library(dplyr)
data %>%
  group_by(a) %>%
  filter(n_distinct(b) == 1)
Or
library(data.table)
setDT(data)[, .(b = b[length(unique(b)) == 1]), a]
Or another data.table solution provided by @David Arenburg
setDT(data)[, if (length(unique(b)) == 1) .SD, a]
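For reference, here is a minimal, self-contained check of the dplyr approach with the sample data as constructed above (note that a, being length 5, is recycled to length 10 by data.frame(), so each pair ID occurs twice):
library(dplyr)
a <- rep(1:5)                        # recycled to 1,...,5,1,...,5 inside data.frame()
b <- c(1, 2, 2, 2, 1, 1, 1, 2, 2, 2)
data <- data.frame(a, b)
data %>%
  group_by(a) %>%
  filter(n_distinct(b) == 1)
# pairs 1, 3 and 4 have a single gender value and are kept;
# pairs 2 and 5 mix genders and are dropped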

Related

Join smaller data frame to larger data frame by index in tidyverse?

Suppose I have the following data:
df <- data.frame(a=c(1,2,3,4))
index <- data.frame(a=c(1,3), data=c('x', 'y'))
I want to join df and index such that I end up with a result that has the rows of df, but with index$data joined for appropriate index$a. For some reason, English words fail me, but 'x' should be applied to 1 and 2 (because index$a has 1, and 3 is the "next" index value), and 'y' should be applied to 3 and 4.
Here is the data I'd like to end up with:
df2 <- data.frame(a=c(1,2,3,4), data=c('x', 'x', 'y', 'y'))
Ideally this solution is compatible with tidyverse without loading any other libraries.
Suggestions?
Is this what you want? First, join df and index and keep all observations in df. Then we fill in each NA value with the last non-NA observation.
df %>% left_join(index, by = "a") %>% fill(data)
We can use data.table
library(data.table)
library(zoo)
setDT(df)[index, data := i.data, on = .(a)][, data := na.locf0(data)]
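For completeness, a small self-contained sketch of the tidyverse route; note that fill() comes from tidyr, so both packages need to be attached:
library(dplyr)
library(tidyr)
df <- data.frame(a = c(1, 2, 3, 4))
index <- data.frame(a = c(1, 3), data = c('x', 'y'))
df %>%
  left_join(index, by = "a") %>%   # introduces NA for a = 2 and a = 4
  fill(data)                       # carries the last non-NA value downwards
#   a data
# 1 1    x
# 2 2    x
# 3 3    y
# 4 4    y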

How to subset a data frame based on a selected variable, keeping only some columns?

I would like to subset on a selected variable and keep only a few columns, as I have many columns in my data frame.
my sample data:
df <- data.frame('ID'=c('A','B','C'),'YEAR'=c('2020','2020','2020'),'MONTH'=c('1','1','1'),'DAY'=c('16','16','16'),'HOUR'=c('15','15','15'),'VALUE1'=c(1,2,3))
I would like to subset where ID == 'C' and keep only the columns 'ID' and 'VALUE1'.
Expected output:
  ID VALUE1
1  C      3
Appreciate any help...!
What I have tried so far is:
df1 <- subset(df, df$ID == 'C')
df2 <- subset(df1, select = c('ID', 'VALUE1'))
Is there a more efficient way to do this? Creating multiple intermediate data frames is not ideal when there are many of them.
You can use dplyr chaining too:
df %>% select(ID,VALUE1) %>% filter(ID=="C")
We can do both the subsetting and the selection in a single subset() call:
subset(df, subset = ID=='C', select = c('ID', 'VALUE1'))
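If you prefer plain indexing, the same result can be had in one step with base R bracket notation (a small sketch; note that the original row name 3 is retained rather than renumbered to 1):
df[df$ID == 'C', c('ID', 'VALUE1')]
#   ID VALUE1
# 3  C      3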

Variance of a complete group of a dataframe in R

Let's say I have a dataframe with 10+1 columns and 10 rows, and every value has the same units except for one column (the "grouping" column A).
I'm trying to accomplish the following: given a grouping of the data frame based on the last column, how do I compute the standard deviation of the whole block as a single, monolithic value?
Let's say I do the grouping (in reality it's a cut in intervals):
df %>% group_by(A)
From what I have gathered throughout this site, you can use aggregate or other dplyr methods to calculate the variance per column, i.e. as in this linked picture (SO won't let me embed images since I have <10 rep).
In that picture we can see the grouping as colors, but by using aggregate I would get one standard deviation per specified column (I know you can use cbind to get more than one variable, for example aggregate(cbind(V1,V2)~A, df, sd)) and per group (and similarly with dplyr and %>%, with summarise(..., FUN=sd) appended at the end).
However what I want is this: just like in Matlab when you do
group1 = df(row_group,:) % row_group would be df(:,end)==1 in this case
stdev(group1(:)) % operator (:) is key here
% iterate for every group
I have my reasons for wanting it that specific way, and of course the real dataframe is bigger than this mock example.
Minimum working example:
df <- data.frame(cbind(matrix(rnorm(100),10,10),c(1,2,1,1,2,2,3,3,3,1)))
colnames(df) <- c(paste0("V",seq(1,10)),"A")
df %>% group_by(A) %>% summarise_at(vars(V1), funs(sd(.))) # no good
aggregate(V1~A, data=df, sd) # no good
aggregate(cbind(V1,V2,V3,V4,V5,V6,V7,V8,V9,V10)~A, data=df, sd) # nope
df %>% group_by(A) %>% summarise_at(vars(V1,V2,V3,V4,V5,V6,V7,V8,V9,V10), funs(sd(.))) # same as above...
The result should be 3 doubles, one sd per group (each should be close to 1 if enough columns are added).
If you want a base R solution, try the following.
sp <- split(df[-1], cut(df$A, breaks=c(2.1)))
lapply(sp, function(x) var(unlist(x)))
#$`(0.998,2]`
#[1] 0.848707
#
#$`(2,3]`
#[1] 1.80633
I have coded it in two lines to make it clearer, but you can avoid the creation of sp and write the one-liner
lapply(split(df[-1], cut(df$A, breaks=c(2.1))), function(x) var(unlist(x)))
Or, for a result in another form,
sapply(sp, function(x) var(unlist(x)))
#(0.998,2] (2,3]
# 0.848707 1.806330
DATA
set.seed(6322) # make the results reproducible
df <- data.frame(cbind(matrix(rnorm(100),10,10),c(1,2,1,1,2,2,3,3,3,1)))
colnames(df) <- c(paste0("V",seq(1,10)),"A")
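If you want exactly one value per level of A (three groups here, matching the three doubles the question asks for) rather than per cut interval, a sketch along the same lines is to split by A directly and exclude the grouping column from the pooled vector (assuming A is the last column, as in the DATA block above):
vals <- df[names(df) != "A"]   # keep only the ten V columns
sapply(split(vals, df$A), function(x) sd(unlist(x)))
# returns three numbers: the pooled standard deviation of each group of A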

Dividing columns by group (Grouping in data frame)

I would like to calculate relative response values by dividing each response/column by its group mean.
I have managed to produce an exhaustive (and thus unsatisfying) method. My data set is very large and contains multiple groups and responses.
###############
# example
# used packages
require(plyr)
# sample data
group <- c(rep("alpha", 3), rep("beta", 3), rep("gamma", 3))
a <- rnorm(9, 10,1) #some random data as response
b <- rnorm(9, 10,1)
df <- data.frame(group, a, b)
# my approach
# means for each group and response
df.means <- ddply(df, "group", colwise(mean))
# clunky method
df$rel.a[df$group=="alpha"] <-
df$a[df$group=="alpha"]/df.means$a[df.means$group=="alpha"]
df$rel.a[df$group=="beta"] <-
df$a[df$group=="beta"]/df.means$a[df.means$group=="beta"]
# ... etc
df$rel.b[df$group=="gamma"] <-
df$b[df$group=="gamma"]/df.means$b[df.means$group=="gamma"]
#desired outcome (well, perhaps with no missing values)
df
###############
I have been using R for a while now, but I still struggle with trivial data handling procedures. I believe I must be missing something. How can I better address these groups?
It's quite easy with the dplyr package, the next iteration of plyr for data frames:
library(dplyr)
df %>% group_by(group) %>% mutate_each(funs(./mean(.)))
The . represents the data in each column (by group). mutate_each is used to modify each column except the grouping variables. You specify inside the funs argument which functions should be applied to each column.
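In newer dplyr versions, mutate_each() and funs() are deprecated; the same idea can be written with across(). A sketch, assuming dplyr >= 1.0 (inside a grouped mutate, everything() automatically excludes the grouping variable):
library(dplyr)
df %>%
  group_by(group) %>%
  mutate(across(everything(), ~ .x / mean(.x)))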
With the data.table package you can do this whole thing fast and easily in one line (without creating df.means at all):
library(data.table)
setDT(df)[, paste0("rel.", names(df)[-1]) :=
            lapply(.SD, function(x) x / mean(x)),
          group]
This will run over all the columns in df (except group), by group, and divide each value by its group mean.
Edit: If you want to overwrite the original columns (as in the dplyr answer), you can do this with a small modification (remove the paste0 part):
setDT(df)[, names(df)[-1] := lapply(.SD, function(x) x/mean(x)), group]
If I understand you correctly, you can also do this easily in dplyr. Given the above data,
library(dplyr)
df %>% group_by(group) %>% mutate(aresp = a/ mean(a), bresp= b/mean(b))
returns:
  group         a         b     aresp     bresp
1 alpha 10.052847  8.076405 1.0132828 0.8288214
2 alpha 10.002243 11.447665 1.0081822 1.1747888
3 alpha  9.708111  9.709265 0.9785350 0.9963898
4  beta 10.732693  7.483065 0.9751125 0.8202278
5  beta 11.719656 11.270522 1.0647824 1.2353754
6  beta 10.567513  8.615878 0.9601051 0.9443968
7 gamma 10.221040 11.181763 1.0035630 0.9723315
8 gamma 10.302611 11.286443 1.0115721 0.9814341
9 gamma 10.030605 12.031643 0.9848649 1.0462344

Split dataframe, select random observations from the list, then merge the list back into a dataframe

I have a data frame with 3 variables (subject, trialtype, and RT), and I need to randomly select half of the RT observations for each subject, and then re-create the data frame from that selection.
From browsing the list, I have got up to here:
split_df <- split(bucnidata_rt,
list(bucnidata_rt$Subject, bucnidata_rt$trialtype))
(this gives a series of split_df[1], split_df[2], ....)
But then I can not subset using this
split_df[1] <- split_df[1][sample(nrow(split_df[1]), 24), ]
I think because sample only works on data frames and this split_df[1] is a list.
To re-merge I would do:
remerged_df <- unsplit(split_df[1],
list(bucnidata_rt$Subject, bucnidata_rt$trialtype))
Could you please help me with step 2?
I propose a slightly different approach using dplyr if you don't mind. You can group by subject and then randomly select 50% of observations of each group:
library(dplyr)
bucnidata_rt %>%
group_by(Subject) %>%
sample_frac(size = 0.5)
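In recent dplyr releases sample_frac() has been superseded by slice_sample(); the equivalent grouped call would be (a sketch, assuming dplyr >= 1.0):
bucnidata_rt %>%
  group_by(Subject) %>%
  slice_sample(prop = 0.5)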
Edit
Here's another way, closer to what you started. I use the mtcars dataset in this case:
split_df <- split(mtcars, mtcars$cyl) #split by `cyl`
#randomly select 50% of rows per group, without replacement
split_df <- lapply(split_df, function(x) x[sample(seq_len(nrow(x)), nrow(x)/2, replace=FALSE),])
#merge the randomly selected list elements back into one data.frame
remerged_df <- do.call(rbind, split_df)
#check the result
nrow(remerged_df)
#[1] 15
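To see how many rows each group contributed before re-merging, you can inspect the sampled pieces directly; with the mtcars example the counts are fixed by the group sizes (11, 7 and 14 cars, halved and truncated), even though the particular rows drawn are random:
sapply(split_df, nrow)
# 4 6 8
# 5 3 7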
Edit #2: corrected dplyr method after comment by @Gregor
