How to sort with multiple conditions in R [duplicate]

This question already has answers here:
Sort (order) data frame rows by multiple columns
(19 answers)
Closed 3 years ago.
I have a very simple dataframe in R:
x <- data.frame("SN" = 1:7, "Age" = c(21,15,22,33,21,15,25), "Name" = c("John","Dora","Paul","Alex","Bud","Chad","Anton"))
My goal is to sort the dataframe by Age and then by Name. I can achieve this partially with the following command:
x[order(x[, 'Age']),]
which returns:
SN Age Name
2 2 15 Dora
6 6 15 Chad
1 1 21 John
5 5 21 Bud
3 3 22 Paul
7 7 25 Anton
4 4 33 Alex
As you can see, the dataframe is ordered by Age but not by Name.
Question: how can I order the dataframe by Age and Name at the same time? This is what the result should look like:
SN Age Name
6 6 15 Chad
2 2 15 Dora
5 5 21 Bud
1 1 21 John
3 3 22 Paul
7 7 25 Anton
4 4 33 Alex
Note: I would like to avoid additional packages and use only the default ones.

With dplyr:
library(dplyr)
x %>%
  arrange(Age, Name)
SN Age Name
1 6 15 Chad
2 2 15 Dora
3 5 21 Bud
4 1 21 John
5 3 22 Paul
6 7 25 Anton
7 4 33 Alex
In base R, without any additional packages:
x[with(x, order(Age, Name)), ]
SN Age Name
6 6 15 Chad
2 2 15 Dora
5 5 21 Bud
1 1 21 John
3 3 22 Paul
7 7 25 Anton
4 4 33 Alex
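If one of the keys should run in the opposite direction, base R's order() can still handle it. The following is a minimal sketch, assuming you want Age ascending and Name descending within ties (that requirement is only an illustration, it is not part of the question). Passing method = "radix" is what allows one decreasing flag per key.

```r
x <- data.frame(
  SN   = 1:7,
  Age  = c(21, 15, 22, 33, 21, 15, 25),
  Name = c("John", "Dora", "Paul", "Alex", "Bud", "Chad", "Anton")
)

# Both keys ascending: Age first, Name breaks ties within the same Age
asc <- x[order(x$Age, x$Name), ]

# Age ascending, Name descending (one flag per key; requires method = "radix")
mixed <- x[order(x$Age, x$Name, decreasing = c(FALSE, TRUE), method = "radix"), ]

asc
mixed
```

The first call reproduces the result asked for in the question; the second only differs in how rows with equal Age are arranged.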

Related

How to make a histogram with a data frame

I was trying to make a histogram of the frequencies of a name list. The list looks like this:
> x[1:15,]
X x
1 1 JUAN DOMINGOGONZALEZDELGADO
2 2 FRANCISCO JAVIERVELARAMIREZ
3 3 JUAN CARLOSPEREZMEDINA
4 4 ARMANDOSALINASSALINAS
5 5 JOSE FELIXZAMBRANOAMEZQUITA
6 6 GABRIELMONTIELRIVAS
7 7 DONACIANOSANCHEZHERRERA
8 8 JUAN MARTINXHUERTA
9 9 ALVARO ALEJANDROGONZALEZRAMOS
10 10 OMAR ROMANCASTAÑEDALOPEZ
11 11 IGNACIOBUENOCANO
12 12 RAFAELBETANCOURTDELGADO
13 13 LUIS ALBERTOCASTILLOESCOBEDO
14 14 VICTORHERNANDEZGONZALEZ
15 15 FATIMAROMOTORRES
To do that, I converted it to a frequency table, which looks like this:
> y[1:15,]
X x Freq
1 1 15
2 2 JULIO CESAR ORDAZFLORES 1
3 3 MARCOS ANTONIOCUEVASNAVARRO 1
4 4 DULEY DILTON TRIBOUILLIERLOARCA 1
5 5 ANTONIORAMIREZLOPEZ 2
6 6 BRAYAN ALEJANDROOJEDARAMIREZ 1
7 7 JOSE DE JESUSESCOTOCORTEZ 1
8 8 AARONFLORESGARCIA 1
9 9 ABIGAILNAVARROAMBRIZ 1
10 10 ABILENYRODRIGUEZORTEGA 1
11 11 ABRAHAMHERNANDEZRAMIREZ 1
12 12 ABRAHAMPONCEALCANTARA 1
13 13 ADRIAN VAZQUEZ BUSTAMANTE 2
14 14 ADRIANHERNANDEZBERMUDEZ 28
15 15 ALAN ORLANDOCASTILLALOPEZ 11
When I try hist(x) or hist(x[,2]), I get:
Error in hist.default(x) : 'x' must be numeric
If I try hist(y[,3]), I get a strange histogram that is not what I want. How can I make a histogram of the frequencies of the name list?
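The error arises because hist() only accepts numeric input, while the name column is text. One reading of the question is that a bar per name, with its count as the height, is what is wanted. A minimal sketch under that assumption, using a small made-up vector (names_vec is hypothetical and stands in for the name column):

```r
# Hypothetical stand-in for the name column of the data frame
names_vec <- c("ANA", "LUIS", "ANA", "OMAR", "ANA", "LUIS")

# Count how often each name occurs, most frequent first
freq <- sort(table(names_vec), decreasing = TRUE)

# One bar per name; las = 2 rotates the axis labels so long names stay readable
barplot(freq, las = 2, ylab = "Count")
```

If instead the distribution of the counts themselves is of interest, hist(as.vector(freq)) would plot that, which is roughly what hist(y[,3]) already does.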

R: Sum column from table 2 based on value in table 1, and store result in table 1

I am an R beginner and hope some of you can help me.
I have two data sets:
- store (containing store data, including location coordinates (x,y). The location are integer values, corresponding to GridIds)
- grid (containing all gridIDs (x,y) as well as a population variable TOT_P for each grid point)
What I want to achieve is this:
For each store I want to loop over the grid data and sum the population of the grid IDs close to the store's grid ID.
I.e., basically a SUMIF over the grid population variable, with the condition that
grid(x) <= store(x) + 1 &
grid(x) >= store(x) - 1 &
grid(y) <= store(y) + 1 &
grid(y) >= store(y) - 1
How can I accomplish that? I have tried different things like merge, sapply, etc., but my inexperience with R stops me from getting it right.
Thanks in advance!
Edit:
Sample data:
StoreName StoreX StoreY
Store1 3 6
Store2 5 2
TOT_P GridX GridY
8 1 1
7 2 1
3 3 1
3 4 1
22 5 1
20 6 1
9 7 1
28 1 2
8 2 2
3 3 2
12 4 2
12 5 2
15 6 2
7 7 2
3 1 3
3 2 3
3 3 3
4 4 3
13 5 3
18 6 3
3 7 3
61 1 4
25 2 4
5 3 4
20 4 4
23 5 4
72 6 4
14 7 4
178 1 5
407 2 5
26 3 5
167 4 5
58 5 5
113 6 5
73 7 5
76 1 6
3 2 6
3 3 6
3 4 6
4 5 6
13 6 6
18 7 6
3 1 7
61 2 7
25 3 7
26 4 7
167 5 7
58 6 7
113 7 7
The output I am looking for is
StoreName StoreX StoreY SUM_P
Store1 3 6 721
Store2 5 2 119
I.e., for Store1 it is the sum of TOT_P for grid cells with X in [2, 4] and Y in [5, 7].
One approach would be to use dplyr to calculate the difference between each store and all grid points and then group and sum based on these new columns.
#import library
library(dplyr)
#create example store table
StoreName<-paste0("Store",1:2)
StoreX<-c(3,5)
StoreY<-c(6,2)
df.store<-data.frame(StoreName,StoreX,StoreY)
#example population data: df.pop is assumed to hold the OP's grid table (columns TOT_P, GridX, GridY)
df.pop
#add dummy column to each table to enable cross join
df.store$k=1
df.pop$k=1
#dplyr to join, calculate absolute distance, filter and sum
df.store %>%
  inner_join(df.pop, by='k') %>%
  mutate(x.diff = abs(StoreX-GridX), y.diff=abs(StoreY-GridY)) %>%
  filter(x.diff<=1, y.diff<=1) %>%
  group_by(StoreName) %>%
  summarise(StoreX=max(StoreX), StoreY=max(StoreY), tot.pop = sum(TOT_P) )
#output:
StoreName StoreX StoreY tot.pop
<fctr> <dbl> <dbl> <int>
1 Store1 3 6 721
2 Store2 5 2 119
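The same neighbourhood sum can also be sketched in base R without a cross join. This is only an illustration, assuming the inclusive window of one cell in each direction used in the answer above; the data frame names store and grid are chosen here for the example, and the grid is rebuilt from the sample data with expand.grid(), which varies GridX fastest just like the posted table.

```r
store <- data.frame(
  StoreName = c("Store1", "Store2"),
  StoreX = c(3, 5),
  StoreY = c(6, 2)
)

# Sample grid from the question: GridX runs 1..7 within each GridY
grid <- expand.grid(GridX = 1:7, GridY = 1:7)
grid$TOT_P <- c(
    8,   7,  3,   3,  22,  20,   9,
   28,   8,  3,  12,  12,  15,   7,
    3,   3,  3,   4,  13,  18,   3,
   61,  25,  5,  20,  23,  72,  14,
  178, 407, 26, 167,  58, 113,  73,
   76,   3,  3,   3,   4,  13,  18,
    3,  61, 25,  26, 167,  58, 113
)

# For each store, sum TOT_P over grid cells at most 1 away in both x and y
store$SUM_P <- mapply(function(sx, sy) {
  near <- abs(grid$GridX - sx) <= 1 & abs(grid$GridY - sy) <= 1
  sum(grid$TOT_P[near])
}, store$StoreX, store$StoreY)

store
```

On the sample data this gives 721 for Store1 and 119 for Store2, matching the dplyr output above. It avoids materialising every store/grid pair, which may matter when both tables are large.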

reshape / restructure the data frame in R [duplicate]

This question already has answers here:
Reshaping multiple sets of measurement columns (wide format) into single columns (long format)
(8 answers)
Closed 6 years ago.
I'm cleaning a dataset whose layout is not ideal. I have to reshape it, but I don't know how. This is the original data frame:
Rater Rater ID Ratee1 Ratee2 Ratee3 Ratee1.item1 Ratee1.item2 Ratee2.item1 Ratee2.item2 Ratee3.item1 Ratee3.item2
A 12 701 702 800 1 2 3 4 5 6
B 23 45 46 49 3 3 3 3 3 3
C 24 80 81 28 2 3 4 5 6 9
I would like to reshape it as follows:
Rater Rater ID Ratee item1 item2
A 12 701 1 2
A 12 702 3 4
A 12 800 5 6
B 23 45 3 3
B 23 46 3 3
B 23 49 3 3
C 24 80 2 3
C 24 81 4 5
C 24 28 6 9
This reshaping is a little different from this one (Reshaping data.frame from wide to long format), because the original data has three parts.
The first part is the rater's ID (Rater and Rater ID).
The second is the ratee's ID (Ratee1, Ratee2, Ratee3).
The third is the rater's rating of each ratee (Ratee*.item1 or Ratee*.item2).
To make it clearer, let me briefly describe the data collection process.
First, a rater types in their own name and ID,
then nominates three persons (Ratee1 to Ratee3),
and then answers the questions about each ratee (for each ratee, there are two questions).
Does anyone know how to reshape this? Thanks!
We can use melt from data.table
library(data.table)
melt(setDT(df1), measure = patterns("^Ratee\\d+$", "^Ratee\\d+\\.item1",
"^Ratee\\d+\\.item2"), value.name = c("Ratee", "item1", "item2"))[,
variable := NULL][order(Rater)]
# Rater RaterID Ratee item1 item2
#1: A 12 701 1 2
#2: A 12 702 3 4
#3: A 12 800 5 6
#4: B 23 45 3 3
#5: B 23 46 3 3
#6: B 23 49 3 3
#7: C 24 80 2 3
#8: C 24 81 4 5
#9: C 24 28 6 9
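For readers who prefer to stay in base R, the same wide-to-long step can be sketched by building one block per ratee slot and stacking the blocks. This is an illustrative alternative, not the answer's method; df1 is rebuilt from the question's table, with the second column named RaterID as in the output above, and the helper column slot is introduced here only to keep the rows in order.

```r
df1 <- data.frame(
  Rater = c("A", "B", "C"),
  RaterID = c(12, 23, 24),
  Ratee1 = c(701, 45, 80),
  Ratee2 = c(702, 46, 81),
  Ratee3 = c(800, 49, 28),
  Ratee1.item1 = c(1, 3, 2),
  Ratee1.item2 = c(2, 3, 3),
  Ratee2.item1 = c(3, 3, 4),
  Ratee2.item2 = c(4, 3, 5),
  Ratee3.item1 = c(5, 3, 6),
  Ratee3.item2 = c(6, 3, 9)
)

# Build one block of rows per ratee slot (1, 2, 3), then stack the blocks
blocks <- lapply(1:3, function(i) {
  data.frame(
    Rater   = df1$Rater,
    RaterID = df1$RaterID,
    slot    = i,
    Ratee   = df1[[paste0("Ratee", i)]],
    item1   = df1[[paste0("Ratee", i, ".item1")]],
    item2   = df1[[paste0("Ratee", i, ".item2")]]
  )
})
long <- do.call(rbind, blocks)

# Group each rater's rows together, keep slot order, then drop the helper column
long <- long[order(long$Rater, long$slot), ]
long$slot <- NULL
rownames(long) <- NULL

long
```

This relies on the column naming pattern Ratee<i>, Ratee<i>.item1, Ratee<i>.item2; if the real column names differ, the paste0() calls need adjusting.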

Sort data frame by two columns (with condition) [duplicate]

This question already has answers here:
Sort (order) data frame rows by multiple columns
(19 answers)
Closed 7 years ago.
I have the following data frame in R:
DataTable <- data.frame( Name = c("Nelle","Alex","Thomas","Jeff","Rodger","Michi"), Age = c(17, 18, 18, 16, 16, 16), Grade = c(1,5,3,2,2,4) )
Name Age Grade
1 Nelle 17 1
2 Alex 18 5
3 Thomas 18 3
4 Jeff 16 2
5 Rodger 16 2
6 Michi 16 4
Now I want to sort this data frame by its Age column. No problem so far:
DataTable_sort_age <- DataTable[with(DataTable, order(DataTable[,2])),]
Name Age Grade
4 Jeff 16 2
5 Rodger 16 2
6 Michi 16 4
1 Nelle 17 1
2 Alex 18 5
3 Thomas 18 3
Several people in the Name column have the same age, and they should be sorted alphabetically. That is, whenever more than one person has the same age, those rows should be ordered alphabetically by Name. The output should look like this:
Name Age Grade
1 Jeff 16 2
2 Michi 16 4
3 Rodger 16 2
4 Nelle 17 1
5 Alex 18 5
6 Thomas 18 3
I hope you can help me sort the data frame alphabetically.
Updated as per @Stezzo's comment.
Just add DataTable[, 1] to the order() call:
DataTable[order(DataTable[,2], DataTable[, 1]),]
# Name Age Grade
# 4 Jeff 16 2
# 6 Michi 16 4
# 5 Rodger 16 2
# 1 Nelle 17 1
# 2 Alex 18 5
# 3 Thomas 18 3
Remember that the order in which the arguments are passed matters: order() first sorts DataTable by the second column (Age) and, in case of a tie, falls back to the next argument, which is the first column (Name).
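To make the effect of argument order concrete, here is a small sketch on the question's data that compares both orderings. It refers to columns by name rather than by position, which is a stylistic choice made for this example.

```r
DataTable <- data.frame(
  Name  = c("Nelle", "Alex", "Thomas", "Jeff", "Rodger", "Michi"),
  Age   = c(17, 18, 18, 16, 16, 16),
  Grade = c(1, 5, 3, 2, 2, 4)
)

# Age is the primary key; Name only breaks ties within the same Age
by_age_then_name <- order(DataTable$Age, DataTable$Name)

# Swapping the arguments makes Name the primary key instead
by_name_then_age <- order(DataTable$Name, DataTable$Age)

DataTable[by_age_then_name, ]
DataTable[by_name_then_age, ]
```

The first index vector groups the three 16-year-olds together and sorts them by name; the second is a plain alphabetical listing, because no two names are equal and Age never gets to break a tie.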
In addition to @Ronak Shah's answer, you can also use arrange from dplyr.
It looks a bit simpler to me.
arrange(DataTable,Age,Name)
gives
Name Age Grade
1 Jeff 16 2
2 Michi 16 4
3 Rodger 16 2
4 Nelle 17 1
5 Alex 18 5
6 Thomas 18 3
Here, it first sorts by Age and then by Name; you can add more variables in the same way.

dplyr summarise over nested group_by [duplicate]

This question already has answers here:
How to sum a variable by group
(18 answers)
Closed 4 years ago.
I have a data frame like this:
Date Amount Category
1 02.07.15 1 1
2 02.07.15 2 1
3 02.07.15 3 1
4 02.07.15 4 2
5 03.07.15 5 2
6 04.07.15 6 3
7 05.07.15 7 3
8 06.07.15 8 3
9 07.07.15 9 4
10 08.07.15 10 5
11 09.07.15 11 6
12 10.07.15 12 4
13 11.07.15 13 4
14 12.07.15 14 5
15 13.07.15 15 5
16 14.07.15 16 6
17 15.07.15 17 6
18 16.07.15 18 5
19 17.07.15 19 4
I would like to calculate the sum of Amount for each day within each category. Neither of my attempts below does this.
summarise(group_by(testData, Category), sum(Amount))
Wrong output: here the sum is calculated over each category, and the dates are not considered.
Category sum(Amount)
1 1 6
2 2 9
3 3 21
4 4 53
5 5 57
6 6 44
summarise(group_by(testData, Date), sum(Amount), categories = toString(Category))
Wrong output: here the sum is calculated over each day, but the categories are not considered.
Date sum(Amount) categories
1 02.07.15 10 1, 1, 1, 2
2 03.07.15 5 2
3 04.07.15 6 3
4 05.07.15 7 3
5 06.07.15 8 3
6 07.07.15 9 4
7 08.07.15 10 5
8 09.07.15 11 6
9 10.07.15 12 4
10 11.07.15 13 4
11 12.07.15 14 5
12 13.07.15 15 5
13 14.07.15 16 6
14 15.07.15 17 6
15 16.07.15 18 5
16 17.07.15 19 4
So far I did not succeed in combining both statements.
How can I nest both group_by statements to calculate the sum of the amount for each single day in each category?
Nesting the groups like:
summarise(group_by(group_by(testData, Date), Category), sum(Amount), dates = toString(Date))
Category sum(Amount) dates
1 1 6 02.07.15, 02.07.15, 02.07.15
2 2 9 02.07.15, 03.07.15
3 3 21 04.07.15, 05.07.15, 06.07.15
4 4 53 07.07.15, 10.07.15, 11.07.15, 17.07.15
5 5 57 08.07.15, 12.07.15, 13.07.15, 16.07.15
6 6 44 09.07.15, 14.07.15, 15.07.15
does not work as intended.
I have heard of summarise_each (see "dplyr - summarise weighted data") but could not get it to work:
summarise_each(testData, funs(Category))
Error could not find function Category
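A second group_by() call replaces the existing grouping by default instead of nesting inside it, so group by both columns in a single call: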
testData %>%
group_by(Date,Category) %>%
summarise(Amount= sum(Amount))
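The same grouped sum can be sketched in base R with aggregate() and a formula that lists both grouping columns. This is an illustrative alternative to the dplyr call, shown on only the first six rows of the question's data to keep it short.

```r
# First six rows of the question's data
testData <- data.frame(
  Date     = c("02.07.15", "02.07.15", "02.07.15", "02.07.15", "03.07.15", "04.07.15"),
  Amount   = 1:6,
  Category = c(1, 1, 1, 2, 2, 3)
)

# Sum Amount within every combination of Date and Category that occurs
res <- aggregate(Amount ~ Date + Category, data = testData, FUN = sum)

res
```

On this subset, 02.07.15 appears twice in the result: once for category 1 (sum 6) and once for category 2 (sum 4), which is the per-day, per-category split the question asks for.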
