Sort data.frame or data.table using vector of column names [duplicate] - r

This question already has answers here:
Sort a data.table fast by Ascending/Descending order
(2 answers)
Order data.table by a character vector of column names
(2 answers)
Sort a data.table programmatically using character vector of multiple column names
(1 answer)
Closed 2 years ago.
I have a data.frame (a data.table in fact) that I need to sort by multiple columns. The names of columns to sort by are in a vector. How can I do it? E.g.
DF <- data.frame(A= 5:1, B= 11:15, C= c(3, 3, 2, 2, 1))
DF
A B C
5 11 3
4 12 3
3 13 2
2 14 2
1 15 1
sortby <- c('C', 'A')
DF[order(sortby),] ## How to do this?
The desired output is the following but using the sortby vector as input.
DF[with(DF, order(C, A)),]
A B C
1 15 1
2 14 2
3 13 2
4 12 3
5 11 3
(Solutions for data.table are preferable.)
EDIT: I'd rather avoid importing additional packages provided that base R or data.table don't require too much coding.

With data.table:
setorderv(DF, sortby)
which gives:
> DF
A B C
1: 1 15 1
2: 2 14 2
3: 3 13 2
4: 4 12 3
5: 5 11 3
For completeness, with setorder:
setorder(DF, C, A)
The advantage of using setorder/setorderv is that the data is reordered by reference and thus very fast and memory efficient. Both functions work on data.table's as wel as on data.frame's.
If you want to combine ascending and descending ordering, you can use the order-parameter of setorderv:
setorderv(DF, sortby, order = c(1L, -1L))
which subsequently gives:
> DF
A B C
1: 1 15 1
2: 3 13 2
3: 2 14 2
4: 5 11 3
5: 4 12 3
With setorder you can achieve the same with:
setorder(DF, C, -A)

Using dplyr, you can use arrange_at which accepts string column names :
library(dplyr)
DF %>% arrange_at(sortby)
# A B C
#1 1 15 1
#2 2 14 2
#3 3 13 2
#4 4 12 3
#5 5 11 3
Or with the new version
DF %>% arrange(across(sortby))
In base R, we can use
DF[do.call(order, DF[sortby]), ]

Also possible with dplyr:
DF %>%
arrange(get(sort_by))
But Ronaks answer is more elegant.

Related

count unique combinations of variable values in an R dataframe column [duplicate]

This question already has answers here:
Collapse / concatenate / aggregate a column to a single comma separated string within each group
(6 answers)
Count number of rows within each group
(17 answers)
Closed 2 years ago.
I want to count the unique combinations of a variable that appear per group.
For example:
df <- data.frame(id = c(1,1,1,2,2,2,3,3,4,4,4,5,6,6,7,7,7),
status = c("a","b","c","a","b","c","b","c","b","c","d","b","b","c","b","c", "d"))
> df
id status
1 1 a
2 1 b
3 1 c
4 2 a
5 2 b
6 2 c
7 3 b
8 3 c
9 4 b
10 4 c
11 4 d
12 5 b
13 6 b
14 6 c
15 7 b
16 7 c
17 7 d
So that, for example, I can tally how many times a given combination of "status" appears.
By hand, for example, I see that "a,b,c" appears twice total (id's 1 and 2).
These seem to be similar questions, but I couldn't work out how to do it and with clearer explanation in R:
Counting unique combinations
Count of unique combinations despite order
The result I think I am looking for would be something like:
abc 2
bc 3
b 1
...
An option with tidyverse where group by 'id', paste the 'status' and get the count
library(dplyr)
library(stringr)
df %>%
group_by(id) %>%
summarise(status = str_c(status, collapse="")) %>%
count(status)
# A tibble: 4 x 2
# status n
# <chr> <int>
#1 abc 2
#2 b 1
#3 bc 2
#4 bcd 2
Here is a base R option via aggregate
> aggregate(.~status,rev(aggregate(.~id,df,paste0,collapse = "")),length)
status id
1 abc 2
2 b 1
3 bc 2
4 bcd 2
You can use the apply family of functions too with tapply and lapply to get there with table.
tap <- tapply(df$status, df$id ,FUN= function(x) unique(x))
lap <- lapply(tap,FUN = function(x) paste0(x,collapse=""))
status <- unlist(lap)
df1 <- data.frame(table(status))
> df1
status Freq
1 abc 2
2 b 1
3 bc 2
4 bcd 2

Merge and sum two different data.tables [duplicate]

This question already has answers here:
Adding values in two data.tables
(2 answers)
Closed 5 years ago.
I've 2 different data.tables. I need to merge and sum based on a row values. The examples of two tables are given as Input below and expected output shown below.
Input
Table 1
X A B
A 3
B 4 6
C 5
D 9 12
Table 2
X A B
A 1 5
B 6 8
C 7 14
D 5
E 1 1
F 2 3
G 5 6
Expected Output:
X A B
A 4 5
B 10 14
C 12 14
D 14 12
E 1 1
F 2 3
G 5 6
We can do this by rbinding the two tables and then do a group by sum
library(data.table)
rbindlist(list(df1, df2))[, lapply(.SD, sum, na.rm = TRUE), by = X]
# X A B
#1: A 4 5
#2: B 10 14
#3: C 12 14
#4: D 14 12
#5: E 1 1
#6: F 2 3
#7: G 5 6
Or using a similar approach with dplyr
library(dplyr)
bind_rows(df1, df2) %>%
group_by(X) %>%
summarise_all(funs(sum(., na.rm = TRUE)))
Note: Here, we assume that the blanks are NA and the 'A' and 'B' columns are numeric/integer class
Merge your tables together first, then do the sum. If you later want to drop the individual values you can do so easily.
out <- merge(df1, df2, by.x="X", by.y="X", all.x=T, all.y=T)
out$sum <- rowSums(out[2:3])
out$A <- out$B <- NULL # drop original values
Below code will help you to do required job for all numeric columns at once
library(dplyr)
Table = Table1 %>% full_join(Table2) %>%
group_by(X) %>% summarise_all(funs(sum(.,na.rm = T)))

R: Converting wide format to long format with multiple 3 time period variables [duplicate]

This question already has answers here:
Reshaping multiple sets of measurement columns (wide format) into single columns (long format)
(8 answers)
Closed 4 years ago.
Apologies if this is a simple question, but I haven't been able to find a simple solution after searching. I'm fairly new to R, and am having trouble converting wide format to long format using either the melt (reshape2) or gather(tidyr) functions. The dataset that I'm working with contains 22 different time variables that are each 3 time periods. The problem occurs when I try to convert all of these from wide to long format at once. I have had success in converting them individually, but it's a very inefficient and long, so I was wondering if anyone could suggest a simpler solution. Below is a sample dataset I created that is formatted in a similar way as the dataset I am working with:
Subject <- c(1, 2, 3)
BlueTime1 <- c(2, 5, 6)
BlueTime2 <- c(4, 6, 7)
BlueTime3 <- c(1, 2, 3)
RedTime1 <- c(2, 5, 6)
RedTime2 <- c(4, 6, 7)
RedTime3 <- c(1, 2, 3)
GreenTime1 <- c(2, 5, 6)
GreenTime2 <- c(4, 6, 7)
GreenTime3 <- c(1, 2, 3)
sample.df <- data.frame(Subject, BlueTime1, BlueTime2, BlueTime3,
RedTime1, RedTime2, RedTime3,
GreenTime1,GreenTime2, GreenTime3)
A solution that has worked for me is to use the gather function from tidyr, arranging the data by Subject (so that each subject's data is grouped together), and then selecting only the subject, time period, and rating. This was done for each variable (in my case 22).
install.packages("dplyr")
install.packages("tidyr")
library(dplyr)
library(tidyr)
BlueGather <- gather(sample.df, Time_Blue, Rating_Blue, c(BlueTime1,
BlueTime2,
BlueTime3))
BlueSorted <- arrange(BlueGather, Subject)
BlueSubtracted <- select(BlueSorted, Subject, Time_Blue, Rating_Blue)
After this code, I combine everything into one data frame. This seems very slow and inefficient to me, and was hoping that someone could help me find a simpler solution. Thank you!
The idea here is to gather() all the time variables (all variables but Subject), use separate() on key to split them into a label and a time and then spread() the label and value to obtain your desired output.
library(dplyr)
library(tidyr)
sample.df %>%
gather(key, value, -Subject) %>%
separate(key, into = c("label", "time"), "(?<=[a-z])(?=[0-9])") %>%
spread(label, value)
Which gives:
# Subject time BlueTime GreenTime RedTime
#1 1 1 2 2 2
#2 1 2 4 4 4
#3 1 3 1 1 1
#4 2 1 5 5 5
#5 2 2 6 6 6
#6 2 3 2 2 2
#7 3 1 6 6 6
#8 3 2 7 7 7
#9 3 3 3 3 3
Note
Here we use the regex in separate() from this answer by #RichardScriven to split the column on the first encountered digit.
Edit
I understand from your comments that your dataset column names are actually in the form ColorTime_Pre, ColorTime_Post, ColorTime_Final. If that is the case, you don't have to specify a regex in separate() as the default one sep = "[^[:alnum:]]+" will match your _ and split the key into label and time accordingly:
sample.df %>%
gather(key, value, -Subject) %>%
separate(key, into = c("label", "time")) %>%
spread(label, value)
Will give:
# Subject time BlueTime GreenTime RedTime
#1 1 Final 1 1 1
#2 1 Post 4 4 4
#3 1 Pre 2 2 2
#4 2 Final 2 2 2
#5 2 Post 6 6 6
#6 2 Pre 5 5 5
#7 3 Final 3 3 3
#8 3 Post 7 7 7
#9 3 Pre 6 6 6
We can use melt from data.table which can take multiple measure columns as a regex pattern
library(data.table)
melt(setDT(sample.df), measure = patterns("^Blue", "^Red", "^Green"),
value.name = c("BlueTime", "RedTime", "GreenTime"), variable.name = "time")
# Subject time BlueTime RedTime GreenTime
#1: 1 1 2 2 2
#2: 2 1 5 5 5
#3: 3 1 6 6 6
#4: 1 2 4 4 4
#5: 2 2 6 6 6
#6: 3 2 7 7 7
#7: 1 3 1 1 1
#8: 2 3 2 2 2
#9: 3 3 3 3 3
Or as #StevenBeaupré mentioned in the comments, if there are many patterns, one option would be to use the names of the dataset after extracting the substring as the patterns argument
melt(setDT(sample.df), measure = patterns(as.list(unique(sub("\\d+", "",
names(sample.df)[-1])))),value.name = c("BlueTime", "RedTime",
"GreenTime"), variable.name = "time")
If your goal is to convert the three colors to long this can be accomplished with the base R reshape function:
reshape(sample.df, idvar="subject", varying=2:length(sample.df), sep="", direction="long")
Subject time BlueTime RedTime GreenTime subject
1.1 1 1 2 2 2 1
2.1 2 1 5 5 5 2
3.1 3 1 6 6 6 3
1.2 1 2 4 4 4 1
2.2 2 2 6 6 6 2
3.2 3 2 7 7 7 3
1.3 1 3 1 1 1 1
2.3 2 3 2 2 2 2
3.3 3 3 3 3 3 3
The time variable captures the 1,2,3 in the names of the wide variables. The varying argument tells reshape which variables should be converted to long. The sep argument tells reshape to look for numbers at the end of the varying variables that are not separated by any characters, while the direction argument tells the function to attempt a long conversion.
I always add the id variable, even if it is not necessary for future reference.
If your data.frame doesn't have actually have the numbers for the time variable, a fairly simple solution is to change the variable names so that they do. For example, the following would replace "_Pre" with "1" at the end of any such variables.
names(df)[grep("_Pre$", names(df))] <- gsub("_Pre$", "1",
names(df)[grep("_Pre$", names(df))])

How to do something to each element in the group

Suppose I have a dataframe like so
a b c
1 2 3
1 3 4
1 4 5
2 5 6
2 6 7
3 7 8
4 8 9
What I want is the following:
a b c d
1 2 3 a
1 3 4 b
1 4 5 c
2 5 6 a
2 6 7 b
3 7 8 a
4 8 9 a
Essentially, I want to do a cycling, for each group by the column a, I want to create a new column which cycles the letters from a to z in order. Group 1 has three elements, so the letter goes from 'a' to 'c'. Group 3 and 4 has only 1 element, so the letter only gets assigned 'a'.
A data.table option is
library(data.table)
setDT(dd)[, d:= letters[seq_len(.N)], by = a]
One way to do this is with a split-apply-combine paradigm, as in plyr (or dplyr or data.table or ...
Create data:
dd <- data.frame(a=rep(1:4,c(3,2,1,1)),
b=2:8,c=3:9)
Use ddply to split the data frame by variable a, transforming each piece by adding an appropriate variable, then recombine:
library("plyr")
ddply(dd,"a",
transform,
d=letters[1:length(b)])
Or in dplyr:
library("dplyr")
dd %>% group_by(a) %>%
mutate(d=letters[1:n()])
Or in base R (thanks #thelatemail):
dd$d <- ave(rownames(dd), dd$a,
FUN=function(x) letters[seq_along(x)] )

r get mean of n columns by row

I have a simple data.frame
> df <- data.frame(a=c(3,5,7), b=c(5,3,7), c=c(5,6,4))
> df
a b c
1 3 5 5
2 5 3 6
3 7 7 4
Is there a simple and efficient way to get a new data.frame with the same number of rows but with the mean of, for example, column a and b by row? something like this:
mean.of.a.and.b c
1 4 5
2 4 6
3 7 4
Use rowMeans() on the first two columns only. Then cbind() to the third column.
cbind(mean.of.a.and.b = rowMeans(df[-3]), df[3])
# mean.of.a.and.b c
# 1 4 5
# 2 4 6
# 3 7 4
Note: If you have any NA values in your original data, you may want to use na.rm = TRUE in rowMeans(). See ?rowMeans for more.
Another option using the dplyr package:
library("dplyr")
df %>%
rowwise()%>%
mutate(mean.of.a.and.b = mean(c(a, b))) %>%
## Then if you want to remove a and b:
select(-a, -b)
I think the best option is using rowMeans() posted by Richard Scriven. rowMeans and rowSums are equivalent to use of apply with FUN = mean or FUN = sum but a lot faster. I post the version with apply just for reference, in case we would like to pass another function.
data.frame(mean.of.a.and.b = apply(df[-3], 1, mean), c = df[3])
Output:
mean.of.a.and.b c
1 4 5
2 4 6
3 7 4
Very verbose using SQL with sqldf
library(sqldf
sqldf("SELECT (sum(a)+sum(b))/(count(a)+count(b)) as mean, c
FROM df group by c")
Output:
mean c
1 7 4
2 4 5
3 4 6

Resources