How to sort a data frame on multiple variables whose names are given in vectors, using a base R function?

I have a data frame like the one below:
df <- data.frame(v1 = c("A", "B", "A", "A", "B", "B", "B", "B", "A", "A", "A", "A"),
v2 = c("X", "Y", "X", "Y", "Z", "X", "X", "Y", "X", "Y", "Z", "Z"),
v3 = c(2, 1, 3, 1, 1, 2, 1, 2, 1, 2, 2, 1))
In this data frame, v1 and v2 are so-called grouping variables (character vectors in this case), within which I'd like to sort my counter variable v3 in ascending order using base R function(s). There's no requirement for the order in which the grouping variables are sorted (both ascending and descending would be OK). In this special case that is easy:
df <- df[order(df$v1, df$v2, df$v3),]
Or alternatively:
df <- df[do.call(what = order, args = df),]
What I'd like is a more general solution for any data frame with n grouping variables whose names are contained in one vector, with the name of the counter variable contained in another. I want this because the data is passed as an argument to a user-defined function and can therefore vary.
grouping_vars <- c("v1", "v2", ..., "vn") #not actual code. Data frame contains *n* variables.
counter <- "vi" #not actual code. One of them, the i-th, is the counter variable.
Again, I'd like to use a base R function here (most likely order) and not a solution from data.table or the tidyverse, for example.

Your code is almost there. Just index df with [] to extract the grouping and counter columns, and pass those to order.
df[do.call(what = order, args = df[,c(grouping_vars, counter)]), ]
PeterD: I added a comma in front of the vector that contains the selected columns to be explicit about the selection of columns of data frame df.
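Since the column names arrive as function arguments, the one-liner above could be wrapped roughly as follows. This is only a sketch: the function name sort_within_groups is hypothetical, and the unname() call is an added precaution that is not part of the answer. It keeps column names such as "decreasing" from being matched to arguments of order() when do.call() expands the list.

```r
# Hypothetical helper (name and unname() precaution are assumptions, not from the answer).
sort_within_groups <- function(data, grouping_vars, counter) {
  # Select the sort keys by name; drop the names so order() sees them positionally.
  keys <- unname(as.list(data[c(grouping_vars, counter)]))
  data[do.call(order, keys), , drop = FALSE]
}

df <- data.frame(v1 = c("A", "B", "A", "A", "B", "B", "B", "B", "A", "A", "A", "A"),
                 v2 = c("X", "Y", "X", "Y", "Z", "X", "X", "Y", "X", "Y", "Z", "Z"),
                 v3 = c(2, 1, 3, 1, 1, 2, 1, 2, 1, 2, 2, 1))

sorted <- sort_within_groups(df, grouping_vars = c("v1", "v2"), counter = "v3")
```

The result should have the same row order as df[order(df$v1, df$v2, df$v3), ].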

Related

R add all combinations of three values of a vector to a three-dimensional array

I have a data frame with two columns. The first one "V1" indicates the objects on which the different items of the second column "V2" are found, e.g.:
V1 <- c("A", "A", "A", "A", "B", "B", "B", "C", "C", "C", "C")
V2 <- c("a","b","c","d","a","c","d","a","b","d","e")
df <- data.frame(V1, V2)
"A" for example contains "a", "b", "c", and "d". What I am looking for is a three dimensional array with dimensions of length(unique(V2)) (and the names "a" to "e" as dimnames).
For each unique value of V1 I want all possible combinations of three V2 items (e.g. for "A" these would include c("a", "b", "c"), c("a", "b", "d"), and c("b", "c", "d")).
Each of these "three-item-co-occurrences" should be regarded as a coordinate in the three-dimensional array and therefore be added to the frequency count which the values in the array should display. The outcome should be the following array
ar <- array(data = c(0,0,0,0,0,0,0,1,2,1,0,1,0,2,0,0,2,2,0,1,0,1,0,1,0,
0,0,1,2,1,0,0,0,0,0,1,0,0,1,0,2,0,1,0,1,1,0,0,1,0,
0,1,0,2,0,1,0,0,1,0,0,0,0,0,0,2,1,0,0,0,0,0,0,0,0,
0,2,2,0,1,2,0,1,0,1,2,1,0,0,0,0,0,0,0,0,1,1,0,0,0,
0,1,0,1,0,1,0,0,1,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0),
dim = c(5, 5, 5),
dimnames = list(c("a", "b", "c", "d", "e"),
c("a", "b", "c", "d", "e"),
c("a", "b", "c", "d", "e")))
I was wondering about the 3D symmetry of your result. It took me a while to understand that you want to have all permutations of all combinations.
library(gtools) #for the permutations
foo <- function(x) {
#all combinations:
combs <- combn(x, 3, simplify = FALSE)
#all permutations for each of the combinations:
combs <- do.call(rbind, lapply(combs, permutations, n = 3, r = 3))
#tabulate:
do.call(table, lapply(asplit(combs, 2), factor, levels = letters[1:5]))
}
#apply grouped by V1, then sum the results
res <- Reduce("+", tapply(df$V2, df$V1, foo))
#check
all((res - ar)^2 == 0)
#[1] TRUE
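For readers who prefer to avoid the gtools dependency, the same counting can be sketched in base R. This is an alternative formulation under stated assumptions, not the answerer's code: the names items, counts and perms are hypothetical, and the six permutations of three positions are written out by hand. It relies on the fact that an array with dimnames can be indexed by a character matrix with one column per dimension.

```r
V1 <- c("A", "A", "A", "A", "B", "B", "B", "C", "C", "C", "C")
V2 <- c("a", "b", "c", "d", "a", "c", "d", "a", "b", "d", "e")

items  <- sort(unique(V2))
counts <- array(0, dim = rep(length(items), 3),
                dimnames = list(items, items, items))

# The six orderings of three positions, written out by hand.
perms <- rbind(c(1, 2, 3), c(1, 3, 2), c(2, 1, 3),
               c(2, 3, 1), c(3, 1, 2), c(3, 2, 1))

for (g in split(V2, V1)) {
  if (length(g) < 3) next          # no triple possible for this object
  triples <- t(combn(g, 3))        # one row per combination of three items
  for (p in seq_len(nrow(perms))) {
    idx <- triples[, perms[p, ], drop = FALSE]  # character matrix index
    counts[idx] <- counts[idx] + 1
  }
}
```

Within one object and one ordering, every row of idx is distinct, so the vectorised increment does not lose counts.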
I used to use the cross join CJ() from data.table to obtain the pairwise count of all combinations of two different V2 items:
library(data.table) # provides setDT() and CJ()
res <- setDT(df)[,CJ(unique(V2), unique(V2)), V1][V1!=V2,
.N, .(V1,V2)][order(V1,V2)]
This code creates a data frame res with three columns. V1 and V2 contain the respective items of V2 from the original data frame df, and N contains the count (how many times the two items appear with the same value of V1 in the original data frame df).
Now, I found that I could perform this crossjoin with three 'dimensions' as well by just adding another unique(V2) and adapting the rest of the code accordingly.
The result is a data frame with four columns. V1, V2, and V3 indicate the original V2 items and N again shows the number of mutual appearances with the same original V1 objects.
res <- setDT(df)[,CJ(unique(V2), unique(V2), unique(V2)), V1][V1!=V2 & V1 != V3 & V2 != V3,
.N, .(V1,V2,V3)][order(V1,V2,V3)]
The advantage of this code is that empty combinations (those which do not appear at all) are not considered. It worked with 1,000,000 unique values in V1 and over 600 unique items in V2, which would otherwise have required an extremely large 600 x 600 x 600 array.

How to count the number of rows in a data frame where two cells match each other

I have two columns, one with predicted values and one with real values (both as strings), and I want to count the number of rows in which the real value matches the predicted value in the same row.
I was wondering whether it is possible to do something like that with R?
# create sample dataset
df <- data.frame(
col1 = c("a", "b", "c", "d", "e"),
col2 = c("a", "x", "y", "z", "e"),
stringsAsFactors = FALSE
)
# count the number of rows where two columns equal each other
sum( df$col1 == df$col2 )
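If either column can contain missing values, the comparison yields NA for those rows and sum() returns NA unless told otherwise. The following is a small sketch with made-up vectors (pred and real are hypothetical names, not from the question) showing one way to handle that:

```r
# Hypothetical data with missing values on both sides.
pred <- c("a", "b", NA, "d", "e")
real <- c("a", "x", "y", NA, "e")

matches <- pred == real                 # NA wherever either side is NA

n_match <- sum(matches, na.rm = TRUE)   # matching rows, skipping NA comparisons
share   <- mean(matches, na.rm = TRUE)  # share of matches among comparable rows
```

Whether rows with a missing value should be skipped or counted as mismatches depends on the analysis; this sketch skips them.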

Using the Character of a Range in Subset()/Coercing Range from Character to Numeric

I'm struggling with having the subset() function use a range (i.e. 4:7) that is being called as a character from a variable.
Is there a way for me to coerce the input, which is the variable DayVar and has different days I want the function to subset, to be numeric while avoiding the following issues:
1.) keeping the 4:7 as such instead of as 4, 5, 6, 7, and
2.) converting the character "1:4" into numeric format that the subset evaluation can use as though it were 1:4.
Here is a sample data frame:
DayVar = c("1", "2", "3", "4:7")
a <- c("a", "b", "c", "d", "e", "f", "g", "h", "i", "j")
b <- c(61:70)
Day <- c(1:10)
df <- data.frame("a" = a, "b" = b, "Day" = Day)
Subset <- list()
for(i in 1:length(DayVar)){
Subset[[i]] = subset(df, Day %in% DayVar[i])
}
As thelatemail suggested, a list works, but you have to drop the quotes in DayVar so the entries are numeric, and index the list with [[i]] to get each element:
DayVar <- list(1,2,3,4:7)
Subset <- list()
for(i in 1:length(DayVar)){
Subset[[i]] = subset(df, Day %in% DayVar[[i]])
}
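If DayVar has to stay a character vector (for example because it is read from a file), one option is to split each entry on ":" and rebuild the range. This is a sketch only: parse_days is a hypothetical helper, and it assumes every entry is either a single integer or a "from:to" pair.

```r
df <- data.frame(a = letters[1:10], b = 61:70, Day = 1:10)
DayVar <- c("1", "2", "3", "4:7")

# Hypothetical helper: "4:7" -> 4 5 6 7, "3" -> 3
parse_days <- function(s) {
  bounds <- as.integer(strsplit(s, ":", fixed = TRUE)[[1]])
  seq(bounds[1], bounds[length(bounds)])
}

day_sets <- lapply(DayVar, parse_days)
Subset <- lapply(day_sets, function(days) df[df$Day %in% days, ])
```

This avoids eval(parse(text = ...)), which would also work on "4:7" but executes arbitrary input.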

Most efficient to append some columns of a data frame to some other columns

Suppose I have the following data frame:
foo <- data.frame(a=letters,b=seq(1,26),
n1=rnorm(26),n2=rnorm(26),
u1=runif(26),u2=runif(26))
I want to append columns u1 and u2 to columns n1 and n2. For now, I found the following way:
df1 <- foo[,c("a","b","n1","n2")]
df2 <- foo[,c("a","b","u1","u2")]
names(df2) <- names(df1)
bar <- rbind(df1,df2)
That does the trick. However, it seems a little bit involved. Am I too picky? Or is there a faster/simpler way to do this in R?
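Here is one way using full_join() from dplyr. Note that it assumes df2 still has its original column names u1 and u2, i.e. that the names(df2) <- names(df1) step has not been run: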
library(dplyr)
full_join(df1, df2, by = c("a", "b", "n1" = "u1", "n2" = "u2"))
From the documentation:
full_join
return all rows and all columns from both x and y. Where
there are not matching values, returns NA for the one missing.
by
a character vector of variables to join by. If NULL, the default,
join will do a natural join, using all variables with common names
across the two tables. A message lists the variables so that you can
check they're right.
To join by different variables on x and y use a named vector. For
example, by = c("a" = "b") will match x.a to y.b.
Use Map() to concatenate the columns, and cbind() with recycling to arrive at the final data frame.
cbind(foo[1:2], Map(c, foo[3:4], foo[5:6]))
Substitute numerical indexes with column names, if desired.
cbind(foo[c("a", "b")], Map(c, foo[c("n1", "n2")], foo[c("u1", "u2")]))
Short-hand:
rbind(foo[1:4], setNames(foo[c(1, 2, 5, 6)], names(foo[1:4])))
Long-winded:
rbind(foo[c("a", "b", "n1", "n2")], setNames(foo[c("a", "b", "u1", "u2")], c("a", "b", "n1", "n2")))
Long-winded (more DRY):
nms <- c("a", "b", "n1", "n2")
rbind(foo[nms], setNames(foo[c("a", "b", "u1", "u2")], nms))

Order data frame by two columns in R

I'm trying to reorder the rows of a data frame by two factors. For the first factor I'm happy with the default ordering. For the second factor I'd like to impose my own custom order on the rows. Here's some dummy data:
dat <- data.frame(apple=rep(LETTERS[1:10], 3),
orange=c(rep("agg", 10), rep("org", 10), rep("fut", 10)),
pear=rnorm(30, 10),
grape=rnorm(30, 10))
I'd like to order "apple" in a specific way:
appleOrdered <- c("E", "D", "J", "A", "F", "G", "I", "B", "H", "C")
I've tried this:
dat <- dat[with(dat, order(orange, rep(appleOrdered, 3))), ]
But it seems to put "apple" into a random order. Any suggestions? Thanks.
Reordering the factor levels:
dat[with(dat, order(orange, as.integer(factor(apple, appleOrdered)))), ]
Try using a factor with the levels in the desired order and the arrange function from plyr:
library(plyr) # provides arrange()
dat$apple <- factor(dat$apple,levels=appleOrdered)
arrange(dat,orange,apple)
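The attempt in the question fails because rep(appleOrdered, 3) is sorted as an independent vector; its values are never compared with the contents of dat$apple. A base R alternative to the factor approach is to rank each row by its position in the desired order using match(). A minimal sketch on the question's dummy data (set.seed() is added here only to make it reproducible):

```r
set.seed(1)  # only for reproducibility of the dummy data
dat <- data.frame(apple  = rep(LETTERS[1:10], 3),
                  orange = c(rep("agg", 10), rep("org", 10), rep("fut", 10)),
                  pear   = rnorm(30, 10),
                  grape  = rnorm(30, 10))

appleOrdered <- c("E", "D", "J", "A", "F", "G", "I", "B", "H", "C")

# match() returns each apple's position in appleOrdered, which order() then sorts on.
sorted <- dat[order(dat$orange, match(dat$apple, appleOrdered)), ]
```

Unlike the factor approach, this leaves the apple column itself unchanged.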
