What exactly does the logical parameter of the `subset` function do in R?

I am learning R with the book Learning R by Richard Cotton (Chapter 5: Lists and Data Frames), and I don't understand one of the examples. I have this data frame and the following script:
(a_data_frame <- data.frame(
  x = letters[1:5],
  y = rnorm(5),
  z = runif(5) > 0.5
))
x y z
1 a 0.6395739 FALSE
2 b -1.1645383 FALSE
3 c -1.3616093 FALSE
4 d 0.5658254 FALSE
5 e 0.4345538 FALSE
subset(a_data_frame, y > 0 | z, x) # what exactly does y > 0 | z mean?
The book says:
subset takes up to three arguments: a data frame to subset, a
logical vector of conditions for rows to include, and a vector of
column names to keep
It gives no further information about the second, logical argument.

It's a tricky example because the second argument in (a_data_frame, y > 0 | z, x) combines two conditions: y > 0 keeps the rows where y is positive, and "| z" means "or the rows where the z column is TRUE".
y > 0 is evaluated on the values produced by rnorm(5); your values differ from the book's because they are randomly generated. The "or" symbol "|" additionally keeps a row whenever z is TRUE. In your case every value of z is FALSE, so you can't see what it does. As a didactic example, if we use z = rnorm(5) instead of runif(5) > 0.5, it is easier to see how the function works.
(a_data_frame <- data.frame(
  x = letters[1:5],
  y = rnorm(5),
  z = rnorm(5)
))
x y z
1 a -0.91016367 2.04917552
2 b 0.01591093 0.03070526
3 c 0.19146220 -0.42056236
4 d 1.07171934 1.31511485
5 e 1.14760483 -0.09855757
So with the condition y < 0 or z < 0, the output is column x for rows a, c and e:
> subset(a_data_frame, y < 0 | z < 0, x)
x
1 a
3 c
5 e
> subset(a_data_frame, y < 0 & z<0, x)
[1] x
<0 rows> (or 0-length row.names) # no row has both y < 0 and z < 0
> subset(a_data_frame, y < 0 & z, x) # a numeric z is coerced to logical (nonzero is TRUE), so this is TRUE only for row 1
x
1 a
> subset(a_data_frame, y < 0 | z, x) # every z is nonzero, so this is TRUE for all rows
x
1 a
2 b
3 c
4 d
5 e
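To make the second argument concrete, here is a minimal sketch with fixed values instead of random ones. The numbers and the names keep, via_subset and via_index are made up for illustration and are not from the book. It builds the logical vector by hand, uses it to index the rows, and compares that with the subset call.

```r
# Fixed, made-up values so the result is reproducible (not the book's data)
a_data_frame <- data.frame(
  x = letters[1:5],
  y = c(0.64, -1.16, -1.36, 0.57, 0.43),
  z = c(FALSE, TRUE, FALSE, FALSE, TRUE)
)

# The second argument is evaluated inside the data frame, element by element:
# y > 0     is TRUE FALSE FALSE TRUE TRUE
# z         is FALSE TRUE FALSE FALSE TRUE
# y > 0 | z is TRUE TRUE FALSE TRUE TRUE
keep <- a_data_frame$y > 0 | a_data_frame$z

via_subset <- subset(a_data_frame, y > 0 | z, x)
via_index  <- a_data_frame[keep, "x", drop = FALSE]

print(via_subset)  # column x for rows 1, 2, 4 and 5
stopifnot(identical(via_subset$x, via_index$x))
```

Row 3 is the only one dropped, because its y is negative and its z is FALSE.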

Related

extracting specific column with certain condition on one column only

I have data (in R) in the form below:
A B C D
x alpha sine 0
y gama cos 1
z beta tan 2
and I want to extract only columns A and B for the rows where column D > 0.
I tried data %>% filter(D > 0), which gives me the last two rows, where D > 0, but it also gives me column C, which I don't want.
How can I get only columns A and B with the condition applied to column D only?
Data in text:
A B C D
x alpha sine 0
y gama cos 1
z beta tan 2
library(dplyr)
data %>% filter(D > 0) %>% select(A, B, D)
A B D
1 y gama 1
2 z beta 2
or even:
data %>% filter(D > 0) %>% select(-C)
A B D
1 y gama 1
2 z beta 2
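If only columns A and B are wanted, as the question asks, D can be used for the row condition and then left out of the selection. A minimal base R sketch, assuming the data frame is rebuilt from the table in the question under the name data (result is a name introduced here):

```r
# Data frame rebuilt from the table in the question
data <- data.frame(
  A = c("x", "y", "z"),
  B = c("alpha", "gama", "beta"),
  C = c("sine", "cos", "tan"),
  D = c(0, 1, 2)
)

# The row condition uses D, but only A and B are kept
result <- subset(data, D > 0, select = c(A, B))
print(result)
```

With dplyr, the same idea would be data %>% filter(D > 0) %>% select(A, B), since the filter step runs before D is dropped.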

R collapse duplicate pairs (in any order) across dataframe columns and edit 3rd column?

I used rbind to join 2 dataframes, with a column denoting its source, resulting in
from | to | source
1 A B X
2 C D Y
3 B A Y
...
I would like to look for overlapping pairs, regardless of "order", combine those pairs, then edit the source column to something else, e.g. "Z".
In the above example, rows 1 and 3 would be flagged as overlapping, so they will be combined and modified.
So the desired output would look something like
from | to | source
1 A B Z
2 C D Y
...
How can this be done?
You can try the code below
unique(
  transform(
    transform(
      df,
      from = pmin(from, to),
      to = pmax(from, to)
    ),
    source = ave(source, from, to, FUN = function(x) ifelse(length(x) > 1, "Z", x))
  )
)
which gives
from to source
1 A B Z
2 C D Y
Example
set.seed(1)
df = data.frame(
  "from" = sample(LETTERS[1:4], 10, replace = TRUE),
  "to" = sample(LETTERS[1:4], 10, replace = TRUE),
  "source" = sample(c("X", "Y"), 10, replace = TRUE)
)
from to source
1 A C X
2 D C X
3 C A X
4 A A X
5 B A X
6 A B X
7 C B Y
8 C B X
9 B B X
10 B C Y
and then
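# sort each (from, to) pair so that A-B and B-A compare equal
tmp = t(
  apply(df, 1, function(x) {
    sort(x[1:2])
  })
)
t1 = duplicated(tmp, fromLast = FALSE) # TRUE for every repeat after the first occurrence
t2 = duplicated(tmp, fromLast = TRUE)  # TRUE for every occurrence that is repeated later
df[t2, "source"] = "Z"                 # relabel pairs that occur again further down
df[!t1, ]                              # keep only the first occurrence of each pair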
from to source
1 A C Z
2 D C X
4 A A X
5 B A Z
7 C B Z
9 B B X
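Both answers rest on the same idea: build an order-independent key for each pair, then relabel and deduplicate on that key. A minimal sketch of that idea on the three-row example from the question, assuming character columns; the names key, n_per_key and out are introduced here for illustration:

```r
# The three rows shown in the question
df <- data.frame(
  from = c("A", "C", "B"),
  to = c("B", "D", "A"),
  source = c("X", "Y", "Y"),
  stringsAsFactors = FALSE
)

# Order-independent key: "A|B" for both A-B and B-A
key <- paste(pmin(df$from, df$to), pmax(df$from, df$to), sep = "|")

# Number of rows sharing each key, repeated for every row
n_per_key <- ave(seq_along(key), key, FUN = length)

df$source[n_per_key > 1] <- "Z"  # relabel overlapping pairs
out <- df[!duplicated(key), ]    # keep the first row of each pair
print(out)
```

Unlike the transform version, this keeps the from/to orientation of the first row of each pair rather than rewriting it in sorted order.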

Loop within a loop with column names in R

I have the following data:
id A B C
1 1 1 0
2 1 1 1
3 0 1 1
I would like to create a function that computes the following three quantities for a pair of columns:
the number of individuals i) with A and B, ii) with A but not B, iii) with B but not A. Similarly, I would like a loop that computes these three numbers for A and C, and for B and C. Is there a smart way to do so, perhaps a loop within a loop? So far, I have tried the following:
for(ii in colnames(df)){
for(jj in (ii+1):df){
print(ii,jj)
}}
Perhaps something like this:
# function to return your metrics
foo = function(x, y) {
  c(
    "x and y" = sum(x & y),
    "x not y" = sum(x & !y),
    "y not x" = sum(!x & y)
  )
}
# generate combinations of columns
col_combos = combn(names(df)[-1], 2)
result = apply(col_combos, 2, function(x) foo(df[[x[1]]], df[[x[2]]]))
colnames(result) = apply(col_combos, 2, toString)
result
# A, B A, C B, C
# x and y 2 1 2
# x not y 0 1 1
# y not x 1 1 0
Using this data:
df = read.table(text = 'id A B C
1 1 1 0
2 1 1 1
3 0 1 1 ', header = TRUE)
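The attempt in the question fails because ii is a column name (a character string), so ii + 1 is not defined. For comparison with the combn approach, here is a minimal sketch of the loop within a loop, iterating over column positions instead of names. The names cols and counts, and the labels both, first_only and second_only, are introduced here for illustration:

```r
df <- data.frame(id = 1:3, A = c(1, 1, 0), B = c(1, 1, 1), C = c(0, 1, 1))

cols <- names(df)[-1]  # drop the id column
counts <- list()

# Outer loop: first column of the pair; inner loop: every later column
for (i in seq_len(length(cols) - 1)) {
  for (j in (i + 1):length(cols)) {
    a <- df[[cols[i]]] == 1
    b <- df[[cols[j]]] == 1
    counts[[paste(cols[i], cols[j], sep = ", ")]] <- c(
      both = sum(a & b),
      first_only = sum(a & !b),
      second_only = sum(!a & b)
    )
  }
}
print(counts)
```

The combn version is shorter, but the explicit loops make it easier to see that each pair of columns is visited exactly once.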

Flag based on multiple conditions

This is my initial dataset:
x <- c("a","a","b","b","c","c","d","d")
y <- c("a","a","a","b","c","c", "d", "d")
z <- c(5,1,2,6,1,1,5,6)
df <- data.frame(x,y,z)
I am trying to create a column in the data frame that flags whether the dataset contains rows meeting the following condition:
The rows have the same "x" and "y" values as the current row, and at least one of those rows has a "z" value >= 5.
With the example provided, the output should be:
x y z flag
1 a a 5 TRUE
2 a a 1 TRUE
3 b a 2 FALSE
4 b b 6 TRUE
5 c c 1 FALSE
6 c c 1 FALSE
7 d d 5 TRUE
8 d d 6 TRUE
Thank you!
I use the data.table package for all my aggregations. With this package I would do the following:
library(data.table)
dt <- as.data.table(df)
# by = .(x, y): group by x and y
# flag all groups where
# 1. the maximum z value is >= 5
# 2. there is more than one entry for that (x, y) combination; .N is data.table syntax for the number of rows in the group
# := is data.table syntax for assigning by reference into the original data.table
dt[, flag := max(z) >= 5 & .N > 1, by = .(x, y)]
# Does x need to equal y? If so, use this instead
dt[, flag := max(z) >= 5 & .N > 1 & x == y, by = .(x, y)]
# view the result
dt[]
# return back to df
df <- as.data.frame(dt)
df
You can try the code below
> within(df, flag <- x==y & z>=5)
x y z flag
1 a a 5 TRUE
2 a a 1 FALSE
3 b a 2 FALSE
4 b b 6 TRUE
5 c c 1 FALSE
6 c c 1 FALSE
7 d d 5 TRUE
8 d d 6 TRUE
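The expected output in the question marks row 2 (z = 1) as TRUE, and also row 4, whose (x, y) group has a single row. That suggests the condition is meant per (x, y) group and without a minimum group size, which is an interpretation rather than something the question states outright. A minimal base R sketch under that reading, with group_key as a helper name introduced here:

```r
df <- data.frame(
  x = c("a", "a", "b", "b", "c", "c", "d", "d"),
  y = c("a", "a", "a", "b", "c", "c", "d", "d"),
  z = c(5, 1, 2, 6, 1, 1, 5, 6)
)

# One key per (x, y) combination
group_key <- paste(df$x, df$y, sep = "|")

# ave() repeats the group maximum for every row of the group
df$flag <- ave(df$z, group_key, FUN = max) >= 5
print(df)
```

With data.table, the same reading would correspond to dropping the .N > 1 term from the grouped expression.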

Operating between columns and classifying values per group in R

I am trying to obtain percentages, grouped by the values of one variable.
For this I used sapply to obtain the percentage for each column relative to another one, but I don't know how to group these values by type (another variable).
x <- data.frame("A" = c(0,0,1,1,1,1,1), "B" = c(0,1,0,1,0,1,1), "C" = c(1,0,1,1,0,0,1),
"type" = c("x","x","x","y","y","y","x"), "yes" = c(0,0,1,1,0,1,1))
x
A B C type yes
1 0 0 1 x 0
2 0 1 0 x 0
3 1 0 1 x 1
4 1 1 1 y 1
5 1 0 0 y 0
6 1 1 0 y 1
7 1 1 1 x 1
I need to obtain the following value (a percentage): sum(A == 1 & yes == 1) / sum(A == 1). For this I use the following code:
result <- as.data.frame(sapply(x[,1:3],
function(i) (sum(i & x$yes)/sum(i))*100))
result
sapply(x[, 1:3], function(i) (sum(i & x$yes)/sum(i)) * 100)
A 80
B 75
C 75
Now I need the same calculation, but taking the variable "type" into account, that is, the same percentage broken down by type. My expected table is:
type sapply(x[, 1:3], function(i) (sum(i & x$yes)/sum(i)) * 100)
A x 40
A y 40
B x 25
B y 50
C x 50
C y 25
In the example you can see that, for each letter, the percentages sum to the value obtained in the first result; here they are just broken down by type.
Thanks a lot.
You can do the following using data.table:
Code
library(data.table)
setDT(df)
cols = c('A', 'B', 'C')
mat = df[yes == 1, lapply(.SD, function(x){
100 * sum(x)/df[, lapply(.SD, sum), .SDcols = cols][[substitute(x)]]
# Here, the numerator is sum(x | yes == 1) for x == columns A, B, C
# If we look at the denominator, it equals sum(x) for x == columns A, B, C
# The reason why we need to apply substitute(x) is because df[, lapply(.SD, sum)]
# generates a list of column sums, i.e. list(A = sum(A), B = sum(B), ...).
# Hence, for each x in the column names we must subset the list above using [[substitute(x)]]
# Ultimately, the operation equals sum(x | yes == 1)/sum(x) for A, B, C.
}), .(type), .SDcols = cols]
# '.(type)' simply means that we apply this for each type group,
# i.e. once for x and once for y, for each ABC column.
# The dot is just shorthand for 'list()'.
# .SDcols assigns the subset that I want to apply my lapply statement onto.
Result
> mat
type A B C
1: x 40 25 50
2: y 40 50 25
Long format (your example)
> melt(mat)
type variable value
1: x A 40
2: y A 40
3: x B 25
4: y B 50
5: x C 50
6: y C 25
Data
df <- data.frame("A" = c(0,0,1,1,1,1,1), "B" = c(0,1,0,1,0,1,1), "C" = c(1,0,1,1,0,0,1),
"type" = c("x","x","x","y","y","y","x"), "yes" = c(0,0,1,1,0,1,1))

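For comparison, the same table can be sketched in base R, staying close to the sapply call from the question. This is a minimal sketch that assumes the columns are coded 0/1, so that col * yes is 1 exactly where both are 1; the names cols and pct are introduced here:

```r
df <- data.frame(
  "A" = c(0, 0, 1, 1, 1, 1, 1), "B" = c(0, 1, 0, 1, 0, 1, 1), "C" = c(1, 0, 1, 1, 0, 0, 1),
  "type" = c("x", "x", "x", "y", "y", "y", "x"), "yes" = c(0, 0, 1, 1, 0, 1, 1)
)
cols <- c("A", "B", "C")

# Numerator: per-type count of rows where the column and yes are both 1
# Denominator: overall count of rows where the column is 1
pct <- sapply(df[cols], function(col) {
  100 * tapply(col * df$yes, df$type, sum) / sum(col)
})
print(pct)  # one row per type, one column per letter
```

Because the denominator is the overall column sum rather than the per-type sum, the values in each column add up to the ungrouped percentages (80, 75, 75).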
Resources