Sort data frame by two columns (with condition) [duplicate] - r

This question already has answers here:
Sort (order) data frame rows by multiple columns
(19 answers)
Closed 7 years ago.
I have the following data frame in R:
DataTable <- data.frame( Name = c("Nelle","Alex","Thomas","Jeff","Rodger","Michi"), Age = c(17, 18, 18, 16, 16, 16), Grade = c(1,5,3,2,2,4) )
Name Age Grade
1 Nelle 17 1
2 Alex 18 5
3 Thomas 18 3
4 Jeff 16 2
5 Rodger 16 2
6 Michi 16 4
Now I want to sort this data frame by its Age column. No problem so far:
DataTable_sort_age <- DataTable[order(DataTable$Age), ]
Name Age Grade
4 Jeff 16 2
5 Rodger 16 2
6 Michi 16 4
1 Nelle 17 1
2 Alex 18 5
3 Thomas 18 3
Several people in the Name column share the same age. Whenever more than one person has the same age, those rows should additionally be sorted alphabetically by Name. The output should look like this:
Name Age Grade
1 Jeff 16 2
2 Michi 16 4
3 Rodger 16 2
4 Nelle 17 1
5 Alex 18 5
6 Thomas 18 3
Hope you can help me by sorting the data frame alphabetically.

As per @Stezzo's comment, updating the answer.
Just add DataTable[, 1] to the order() call:
DataTable[order(DataTable[,2], DataTable[, 1]),]
# Name Age Grade
# 4 Jeff 16 2
# 6 Michi 16 4
# 5 Rodger 16 2
# 1 Nelle 17 1
# 2 Alex 18 5
# 3 Thomas 18 3
Remember that the order in which the arguments are passed matters. order() first sorts the DataTable data frame by the second column (Age) and, in case of a tie, falls back on the next argument, which is the first column (Name).
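A minimal sketch of the same sort using column names instead of column positions, assuming the DataTable from the question. Resetting the row names is an optional extra step, included only because the desired output in the question is numbered 1 to 6; the variable name sorted is hypothetical.

```r
DataTable <- data.frame(
  Name  = c("Nelle", "Alex", "Thomas", "Jeff", "Rodger", "Michi"),
  Age   = c(17, 18, 18, 16, 16, 16),
  Grade = c(1, 5, 3, 2, 2, 4),
  stringsAsFactors = FALSE
)

# Sort by Age first; ties in Age are broken by Name
sorted <- DataTable[order(DataTable$Age, DataTable$Name), ]

# Optional: renumber the rows 1..n instead of keeping the original row names
rownames(sorted) <- NULL
print(sorted)
```

Referring to columns by name keeps the call valid if the column order of the data frame ever changes.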

In addition to @Ronak Shah's answer, you can also use arrange() from dplyr.
It looks a bit simpler to me.
library(dplyr)
arrange(DataTable, Age, Name)
gives
Name Age Grade
1 Jeff 16 2
2 Michi 16 4
3 Rodger 16 2
4 Nelle 17 1
5 Alex 18 5
6 Thomas 18 3
Here, it sorts by Age first and then by Name; you can add further columns as additional tie-breakers in the same way.

Related

Creating a summary dataset with multiple objects and multiple observations per object

I have a dataset with the reports from a local shop, where each line has a client's ID, date of purchase and total value per purchase.
I want to create a new plot where for each client ID I have all the purchases in the last month or even just sample purchases in a range of dates I choose.
The main problem is that certain customers might buy once a month, while others can come daily - so the number of observations per period of time can vary.
I have tried subsetting my dataset to a specific range of time, but either I choose a specific date - and then I only get a small % of all customers, or I choose a range and get multiple observations for certain customers.
(In this case - I wouldn't mind getting the earliest observation)
An important note: I know how to create a for loop to solve this problem, but since the dataset is over 4 million observations it isn't practical since it would take an extremely long time to run.
A basic example of what the dataset looks like:
ID Date Sum
1 1 1 234
2 1 2 45
3 1 3 1
4 2 4 223
5 3 5 546
6 4 6 12
7 2 1 20
8 4 3 30
9 6 2 3
10 3 5 45
11 7 6 456
12 3 7 65
13 8 8 234
14 1 9 45
15 3 2 1
16 4 3 223
17 6 6 546
18 3 4 12
19 8 7 20
20 9 5 30
21 11 6 3
22 12 6 45
23 14 9 456
24 15 10 65
....
And the new data set would look something like this:
ID 1Date 1Sum 2Date 2Sum 3Date 3Sum
1 1 234 2 45 3 1
2 1 20 4 223 NA NA
3 2 1 5 546 5 45
...
Thanks for your help!
I think you can do this with a bit of help from dplyr and tidyr:
library(dplyr)
library(tidyr)
dd %>%
  group_by(ID) %>%
  mutate(seq = row_number()) %>% # number each client's purchases 1, 2, 3, ...
  pivot_wider(id_cols = ID, names_from = seq, values_from = c(Date, Sum))
Where dd is your sample data frame above.
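The question also mentions that keeping only the earliest observation per client would be acceptable. A minimal base R sketch of that variant follows; it is not part of the answer above, the small dd below is made-up sample data in the same shape as the question's, and the names dd_sorted and earliest are hypothetical. It uses vectorised calls only, with no explicit for loop.

```r
# Small made-up sample in the same shape as the question's data
dd <- data.frame(
  ID   = c(1, 1, 1, 2, 3, 2, 3),
  Date = c(1, 2, 3, 4, 5, 1, 2),
  Sum  = c(234, 45, 1, 223, 546, 20, 1)
)

# Sort by client and date so each client's earliest purchase comes first
dd_sorted <- dd[order(dd$ID, dd$Date), ]

# duplicated() flags every repeat of an ID, so negating it keeps the first row per ID
earliest <- dd_sorted[!duplicated(dd_sorted$ID), ]
print(earliest)
```

To restrict the result to a date range, filter dd on Date before sorting.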

How to sort with multiple conditions in R [duplicate]

This question already has answers here:
Sort (order) data frame rows by multiple columns
(19 answers)
Closed 3 years ago.
I have a very simple dataframe in R:
x <- data.frame("SN" = 1:7, "Age" = c(21,15,22,33,21,15,25), "Name" = c("John","Dora","Paul","Alex","Bud","Chad","Anton"))
My goal is to sort the dataframe by Age and Name. I can partially achieve this with the following command:
x[order(x[, 'Age']),]
which returns:
SN Age Name
2 2 15 Dora
6 6 15 Chad
1 1 21 John
5 5 21 Bud
3 3 22 Paul
7 7 25 Anton
4 4 33 Alex
As you can see, the dataframe is ordered by Age but not by Name.
Question: how can I order the dataframe by Age and Name at the same time? This is what the result should look like:
SN Age Name
6 6 15 Chad
2 2 15 Dora
5 5 21 Bud
1 1 21 John
3 3 22 Paul
7 7 25 Anton
4 4 33 Alex
Note: I would like to avoid additional packages and use only the default ones.
With dplyr:
library(dplyr)
x %>%
arrange(Age, Name)
SN Age Name
1 6 15 Chad
2 2 15 Dora
3 5 21 Bud
4 1 21 John
5 3 22 Paul
6 7 25 Anton
7 4 33 Alex
With base R:
x[with(x, order(Age, Name)), ]
SN Age Name
6 6 15 Chad
2 2 15 Dora
5 5 21 Bud
1 1 21 John
3 3 22 Paul
7 7 25 Anton
4 4 33 Alex
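If the two columns need different sort directions, base R can still do it without extra packages. The sketch below is an illustration only (the question asks for ascending order on both columns): it sorts Age ascending and Name descending, and the name mixed is hypothetical. A vector-valued decreasing argument requires method = "radix", which compares strings in the C locale.

```r
x <- data.frame(
  SN   = 1:7,
  Age  = c(21, 15, 22, 33, 21, 15, 25),
  Name = c("John", "Dora", "Paul", "Alex", "Bud", "Chad", "Anton"),
  stringsAsFactors = FALSE
)

# Age ascending, Name descending within each age.
# A vector-valued `decreasing` is only supported with method = "radix".
mixed <- x[order(x$Age, x$Name, decreasing = c(FALSE, TRUE), method = "radix"), ]
print(mixed)
```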

How to create column based on subsets from another dataframe with differing number of rows

I am trying to create a new column in a dataframe based on subsets from a different dataframe.
df<-data.frame(Name=c('Team 1','Team 1','Team 1', 'Team 2', 'Team 3','Team 3', 'Team 3','Team 4','Team 4'), In=c(8,25,2,2,1,3,3,9,24), Out=c(40,40,20,3,20,1,1,100,1), Group=c(1,1,1,-1,NA,NA,NA,1,1))
df1<-data.frame(Name=c('Team 1','Team 1','Team 1', 'Team 2','Team 2','Team 2','Team 2', 'Team 3','Team 3', 'Team 3','Team 4','Team 4'), In=c(8,25,2,2,1,3,3,9,24,35,14,19), Out=c(40,40,20,3,20,1,1,1,18,29,31,11))
df1$Group<-''
a<-subset(df,Group=='-1')
b<-subset(df,Group=='1')
head(df)
Name In Out Group
1 Team 1 8 40 1
2 Team 1 25 40 1
3 Team 1 2 20 1
4 Team 2 2 3 -1
5 Team 3 1 20 NA
6 Team 3 3 1 NA
7 Team 3 3 1 NA
8 Team 4 9 100 1
9 Team 4 24 1 1
head(df1)
Name In Out Group
1 Team 1 5 4
2 Team 1 5 4
3 Team 1 22 2
4 Team 2 21 13
5 Team 2 14 21
6 Team 2 13 11
7 Team 2 13 21
8 Team 3 19 13
9 Team 3 21 18
10 Team 3 13 29
11 Team 4 14 31
12 Team 4 19 11
I found what I thought was my answer here, and this approach doesn't use subsets, but it also doesn't work because the two data frames have different numbers of rows.
df1$Group <- df$Group[match(df$Name,df1$Name)]
Error in `$<-.data.frame`(`*tmp*`, "Group", value = c(1, 1, 1, -1, 1, :
replacement has 9 rows, data has 12
What I want for my outcome is to create a column ('Group') in df1 so that if the "Name" is found in subset 'a', then it receives a '-1', and if the name is found in subset 'b', then it receives a '1', and everything else that does not fit the category is either left blank or 'NA'.
Example of wanted outcome:
head(df1)
Name In Out Group
1 Team 1 5 4 1
2 Team 1 5 4 1
3 Team 1 22 2 1
4 Team 2 21 13 -1
5 Team 2 14 21 -1
6 Team 2 13 11 -1
7 Team 2 13 21 -1
8 Team 3 19 13 NA
9 Team 3 21 18 NA
10 Team 3 13 29 NA
11 Team 4 14 31 1
12 Team 4 19 11 1
The datasets I am working with are still being added to, and are very large, and so it's nonsensical to do it manually. I've been stuck on this for a while, so hopefully one of you can help me figure this out. Thanks.
@JasonAizkalns pretty much nailed it, but I'll try to expand on what he wrote.
df1$Group <- ifelse(df1$Name %in% a$Name, -1, ifelse(df1$Name %in% b$Name, 1, NA))
ifelse() is a very useful function. It takes three arguments: the condition to test, the value to return where the condition is TRUE, and the value to return otherwise (the 'else' value). As you can see, he has nested a second ifelse() in the 'else' position.
As regards %in%, from the documentation:
%in% is a more intuitive interface as a binary operator, which returns a logical vector indicating if there is a match or not for its left operand.
So, putting it together:
If a value of df1$Name appears in a$Name, df1$Group is set to -1.
Otherwise, if it appears in b$Name, df1$Group is set to 1.
Otherwise, df1$Group is set to NA.
Hope that's clear.
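As an alternative sketch, the match() attempt from the question also works once its arguments are swapped: match(x, table) returns one position per element of x, so the vector being filled has to come first. Only the Name and Group columns are reproduced below, and the approach assumes every Name has a single Group value in df; names that do not occur in df would get NA.

```r
df <- data.frame(
  Name  = c("Team 1", "Team 1", "Team 1", "Team 2", "Team 3", "Team 3",
            "Team 3", "Team 4", "Team 4"),
  Group = c(1, 1, 1, -1, NA, NA, NA, 1, 1)
)
df1 <- data.frame(
  Name = c("Team 1", "Team 1", "Team 1", "Team 2", "Team 2", "Team 2",
           "Team 2", "Team 3", "Team 3", "Team 3", "Team 4", "Team 4")
)

# match(x, table) returns one position per element of x, so the vector being
# filled (df1$Name) goes first and the lookup table (df$Name) second.
df1$Group <- df$Group[match(df1$Name, df$Name)]
print(df1)
```

The result has 12 values, one per row of df1, which avoids the "replacement has 9 rows, data has 12" error.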

subsetting a dataframe by a condition in R [duplicate]

This question already has answers here:
Filtering a data frame by values in a column [duplicate]
(3 answers)
Closed 3 years ago.
I have the following data with the ID of subjects.
V1
1 2
2 2
3 2
4 2
5 2
6 2
7 2
8 2
9 2
10 2
11 2
12 2
13 2
14 2
15 2
16 4
17 4
18 4
19 4
20 4
21 4
22 4
23 4
24 4
I want to subset all the rows of the data where V1 == 4. This way I can see which observations relate to subject 4.
For example, the correct output would be
16 4
17 4
18 4
19 4
20 4
21 4
22 4
23 4
24 4
However, the output I get after subsetting does not show the correct row numbers. It simply gives me:
V1
1 4
2 4
3 4
4 4
5 4
6 4
7 4
8 4
I'm unable to tell which observations relate to subject 4, as observations 1:8 are for subject 2.
I've tried the usual methods, such as
condition<- df == 4
df[condition]
How can I subset the data so that I get back a dataset that shows the correct row numbers for subject 4?
You can also use the subset function:
subset(df,df$V1==4)
I've managed to find a solution since posting:
newdf <- subset(df, V1 == 4)
However, I'm still very interested in other solutions to this problem, so please post if you're aware of another method.
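Another method, sketched here with plain bracket indexing: a logical row index keeps the original row names, and drop = FALSE stops the one-column result from collapsing into a bare vector. The data frame is rebuilt from the values listed in the question; the names subject4 and rows4 are hypothetical.

```r
# Recreate the question's data: 15 rows for subject 2, then 9 rows for subject 4
df <- data.frame(V1 = c(rep(2, 15), rep(4, 9)))

# Logical row index; drop = FALSE keeps the result a data frame
# even though df has only one column
subject4 <- df[df$V1 == 4, , drop = FALSE]
print(subject4)

# The matching row positions on their own
rows4 <- which(df$V1 == 4)
print(rows4)
```

which() is handy when only the observation numbers are needed rather than the rows themselves.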

averaging subsets within a data frame while retaining comments [duplicate]

This question already has answers here:
Aggregate / summarize multiple variables per group (e.g. sum, mean)
(10 answers)
Closed 7 years ago.
Say I have a data frame, data, that contains multiple sites, indicated by integer site codes. Within those sites are samples from multiple horizons, A, B and C, which have observations of some type, indicated in the column value:
site<- c(12,12,12,12,45,45,45,45)
horizon<-c('A','A','B','C','A','A','B','C')
value<- c(19,14,3,2,18,19,4,5)
comment<- c('pizza','pizza','pizza','pizza','taco','taco','taco','taco')
data<- data.frame(site,horizon,value,comment)
Which looks like this:
site horizon value comment
1 12 A 19 pizza
2 12 A 14 pizza
3 12 B 3 pizza
4 12 C 2 pizza
5 45 A 18 taco
6 45 A 19 taco
7 45 B 4 taco
8 45 C 5 taco
In this case both sites have multiple A observations. I would like to average the values of duplicate horizons within a site. I would like to retain the comment column within the data frame as well. All observations within a site have the same entry in the comment vector. I would like the output to look like this:
site horizon value comment
1 12 A 16.5 pizza
3 12 B 3 pizza
4 12 C 2 pizza
5 45 A 18.5 taco
7 45 B 4 taco
8 45 C 5 taco
d <- read.table(header=TRUE, text=
' site horizon value comment
1 12 A 19 pizza
2 12 A 14 pizza
3 12 B 3 pizza
4 12 C 2 pizza
5 45 A 18 taco
6 45 A 19 taco
7 45 B 4 taco
8 45 C 5 taco')
merge(aggregate(value ~ site+horizon, FUN=mean, data=d), unique(d[,-3]))
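The one-liner above can be read as three steps. The sketch below spells them out on the same data, built with data.frame() instead of read.table(); the intermediate names means, comments and result are hypothetical.

```r
d <- data.frame(
  site    = c(12, 12, 12, 12, 45, 45, 45, 45),
  horizon = c("A", "A", "B", "C", "A", "A", "B", "C"),
  value   = c(19, 14, 3, 2, 18, 19, 4, 5),
  comment = c("pizza", "pizza", "pizza", "pizza", "taco", "taco", "taco", "taco")
)

# Step 1: mean value per site/horizon combination
means <- aggregate(value ~ site + horizon, data = d, FUN = mean)

# Step 2: one row per distinct site/horizon/comment (column 3, value, is dropped)
comments <- unique(d[, -3])

# Step 3: join the two on their shared columns (site and horizon)
result <- merge(means, comments)
print(result)
```

Because the comment is constant within a site, adding comment to the right-hand side of the aggregate() formula should give the same rows without the merge step.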
