Newbie to R here; I've tried Googling but I'm failing to find a solution.
Here's my data frame:
Name Value
Bob 50
Mary 55
John 51
Todd 50
Linda 56
Tom 55
I've sorted it, but now I need to add a rank column so it looks like this:
Name Value Rank
Bob 50 1
Todd 50 1
John 51 2
Mary 55 3
Tom 55 3
Linda 56 4
So what I found is:
resultset$Rank <- ave(resultset$Name, resultset$Value, FUN = rank)
But this gives me:
Name Value Rank
Bob 50 1
Todd 50 2
John 51 1
Mary 55 1
Tom 55 2
Linda 56 1
So close but yet so far...
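For what it's worth, the ave() call ranks the Name values within each Value group, which is why the counter restarts at 1 for every distinct Value. If what you want is a dense rank of Value itself, a minimal base-R sketch (assuming your sorted data frame is called resultset, as in the question) is:

```r
# factor() orders the unique Values; the level index is
# exactly the dense rank described above.
resultset$Rank <- as.numeric(factor(resultset$Value))

# Equivalent, and perhaps more explicit:
# resultset$Rank <- match(resultset$Value, sort(unique(resultset$Value)))
```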
Here's a base-R solution:
uv <- unique(df$Value)
merge(df, data.frame(uv, r = rank(uv)), by.x = "Value", by.y = "uv")
which gives
Value Name r
1 50 Bob 1
2 50 Todd 1
3 51 John 2
4 55 Mary 3
5 55 Tom 3
6 56 Linda 4
This is memory-inefficient and has the side effect of re-sorting your data. You could alternatively do:
require(data.table)
DT <- data.table(df)
DT[order(Value),r:=.GRP,by=Value]
which gives
Name Value r
1: Bob 50 1
2: Mary 55 3
3: John 51 2
4: Todd 50 1
5: Linda 56 4
6: Tom 55 3
No need to sort...
You can use dense_rank from "dplyr":
> library(dplyr)
> mydf %>% mutate(rank = dense_rank(Value))
Name Value rank
1 Bob 50 1
2 Mary 55 3
3 John 51 2
4 Todd 50 1
5 Linda 56 4
6 Tom 55 3
I think your rank variable can be obtained from 1:length(unique(df$value)). Below is my attempt:
df <- data.frame(name = c("Bob", "Mary", "John", "Todd", "Linda", "Tom"),
value = c(50, 55, 51, 50, 56, 55))
# rank by lengths of unique values
rank <- data.frame(rank = 1:length(unique(df$value)), value = sort(unique(df$value)))
merge(df, rank, by="value")
value name rank
1 50 Bob 1
2 50 Todd 1
3 51 John 2
4 55 Mary 3
5 55 Tom 3
6 56 Linda 4
Consider the code:
import pandas as pd

df = pd.DataFrame({
    "name": ["john", "jim", "eric", "jim", "john", "jim", "jim", "eric", "eric", "john"],
    "category": ["a", "b", "c", "b", "a", "b", "c", "c", "a", "c"],
    "amount": [100, 200, 13, 23, 40, 2, 43, 92, 83, 1]
})
df
When I copy the output, I get a table that is not nicely formatted here on Stack Overflow:
name category amount sum count
0 john a 100 140 2
1 jim b 200 225 3
2 eric c 13 105 2
3 jim b 23 225 3
4 john a 40 140 2
5 jim b 2 225 3
6 jim c 43 43 1
7 eric c 92 105 2
8 eric a 83 83 1
9 john c 1 1 1
How to fix this problem?
From an IPython terminal session I can paste:
In [73]: df
Out[73]:
name category amount
0 john a 100
1 jim b 200
2 eric c 13
3 jim b 23
4 john a 40
5 jim b 2
6 jim c 43
7 eric c 92
8 eric a 83
9 john c 1
From a notebook, it looks like print(df) gives a better result:
name category amount
0 john a 100
1 jim b 200
2 eric c 13
3 jim b 23
4 john a 40
5 jim b 2
6 jim c 43
7 eric c 92
8 eric a 83
9 john c 1
The copy selection should be a solid color block.
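As a further option (my suggestion, not from the thread): pandas can also emit text formats that survive copy-paste. DataFrame.to_string() reproduces the same fixed-width layout as print(df), and DataFrame.to_markdown() (pandas >= 1.0, which needs the optional tabulate package) emits a pipe table that many sites render directly:

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["john", "jim", "eric"],
    "category": ["a", "b", "c"],
    "amount": [100, 200, 13],
})

# Fixed-width text, identical to what print(df) shows;
# paste it as a code block so the columns stay aligned.
text = df.to_string()
print(text)

# If the optional 'tabulate' dependency is installed,
# to_markdown() gives a pipe table instead:
# print(df.to_markdown())
```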
I have a very simple dataframe in R:
x <- data.frame("SN" = 1:7, "Age" = c(21,15,22,33,21,15,25), "Name" = c("John","Dora","Paul","Alex","Bud","Chad","Anton"))
My goal is to sort the dataframe by Age and then by Name. I can achieve this task partially if I type the following command:
x[order(x[, 'Age']),]
which returns:
SN Age Name
2 2 15 Dora
6 6 15 Chad
1 1 21 John
5 5 21 Bud
3 3 22 Paul
7 7 25 Anton
4 4 33 Alex
As you can see, the dataframe is ordered by Age but not by Name.
Question: how can I order the dataframe by age and name at the same time? This is what the result should look like:
SN Age Name
6 6 15 Chad
2 2 15 Dora
5 5 21 Bud
1 1 21 John
3 3 22 Paul
7 7 25 Anton
4 4 33 Alex
Note: I would like to avoid using additional packages and rely only on the default ones.
With dplyr:
library(dplyr)
x %>%
arrange(Age, Name)
SN Age Name
1 6 15 Chad
2 2 15 Dora
3 5 21 Bud
4 1 21 John
5 3 22 Paul
6 7 25 Anton
7 4 33 Alex
In base R, without additional packages:
x[with(x, order(Age, Name)), ]
SN Age Name
6 6 15 Chad
2 2 15 Dora
5 5 21 Bud
1 1 21 John
3 3 22 Paul
7 7 25 Anton
4 4 33 Alex
Name Trial# Result ResultsSoFar
1 Bob 1 14 14
2 Bob 2 22 36
3 Bob 3 3 39
4 Bob 4 18 57
5 Nancy 2 33 33
6 Nancy 3 87 120
Hello, say I have the dataframe above. What's the best way to generate the "ResultsSoFar" column, which is a running sum of each person's results up to and including that trial (Bob's results do not include Nancy's and vice versa)?
With data.table you can do:
library(data.table)
setDT(df)[, ResultsSoFar:=cumsum(Result), by=Name]
df
Name Trial. Result ResultsSoFar
1: Bob 1 14 14
2: Bob 2 22 36
3: Bob 3 3 39
4: Bob 4 18 57
5: Nancy 2 33 33
6: Nancy 3 87 120
Note:
If Trial# is not sorted, sort first with setorder(df, Name, Trial.) so that the cumsum accumulates in trial order; computing cumsum(Result[order(Trial.)]) alone would assign the running totals back to rows in their original positions.
I have a data set in R with a series of people, events that occur and an assigned time that they occur in seconds, starting from 0. It looks similar to this:
event seconds person
1 0.0 Bob
2 15.0 Bob
3 28.5 Bob
4 32.0 Joe
5 38.0 Joe
6 41.0 Joe
7 42.5 Joe
8 55.0 Anne
9 58.0 Anne
I need to filter by each name, which means the event numbers will no longer be sequential for each person.
An example of what this looks like (notice how Bob is not involved in events 4-40, etc.):
event seconds person
1 0.0 Bob
2 15.0 Bob
3 28.5 Bob
41 256.0 Bob
42 261.0 Bob
43 266.0 Bob
44 268.5 Bob
45 272.0 Bob
46 273.0 Bob
49 569.0 Bob
80 570.5 Bob
81 581.0 Bob
The events that are sequential and related are separated by an increment of 1. I would like to find the duration of the related events, for example, events 1-3 is a group that would be 28.5 seconds. Events 41-46 is another group that lasts 17 seconds. This would be required for all the names that are listed in the person column.
I have tried filtering the names using dplyr and then taking differences between event rows (via as.matrix) to find where the increment is greater than 1, which indicates the current sequence of events has ended. However, I haven't found a way to take the max and min within each such group to determine the duration of related events. The solution does not need to involve this step, though; it was just the closest I could come.
The end goal is to plot the non-contiguous time durations for each person to have a visual representation of each person's event involvement for the entire data set.
Thank you in advance.
Use this:
DF <- read.table(text = "event seconds person
1 0.0 Bob
2 15.0 Bob
3 28.5 Bob
41 256.0 Bob
42 261.0 Bob
43 266.0 Bob
44 268.5 Bob
45 272.0 Bob
46 273.0 Bob
49 569.0 Bob
80 570.5 Bob
81 581.0 Bob", header = TRUE)
DF$personEvent <- cumsum(c(1L, diff(DF$event)) != 1L)
# event seconds person personEvent
#1 1 0.0 Bob 0
#2 2 15.0 Bob 0
#3 3 28.5 Bob 0
#4 41 256.0 Bob 1
#5 42 261.0 Bob 1
#6 43 266.0 Bob 1
#7 44 268.5 Bob 1
#8 45 272.0 Bob 1
#9 46 273.0 Bob 1
#10 49 569.0 Bob 2
#11 80 570.5 Bob 3
#12 81 581.0 Bob 3
Since I'm not a follower of the great pipe, I leave the rest to you.
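If it helps, the durations can then be read off with base aggregate(), grouping on person and personEvent (a sketch assuming, as in your examples, that a run's duration is max(seconds) - min(seconds)):

```r
# Duration of each run: the spread of 'seconds' within
# each (person, personEvent) group.
durations <- aggregate(seconds ~ person + personEvent, data = DF,
                       FUN = function(s) diff(range(s)))
names(durations)[names(durations) == "seconds"] <- "duration"
durations
```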
Suppose first we have just Bob's rows of the dataframe, called bob.
We will assume bob is already ordered by event, increasing.
Along the same lines as you mentioned (looking at diff(event) > 1), you can additionally use cumsum to group each event to the 'run' of events it belongs to:
library(plyr)
bob2 <- mutate(bob, start = c(1, diff(bob$event) > 1), run=cumsum(start))
event seconds person start run
1 1 0.0 Bob 1 1
2 2 15.0 Bob 0 1
3 3 28.5 Bob 0 1
4 41 256.0 Bob 1 2
5 42 261.0 Bob 0 2
6 43 266.0 Bob 0 2
7 44 268.5 Bob 0 2
8 45 272.0 Bob 0 2
9 46 273.0 Bob 0 2
10 49 569.0 Bob 1 3
11 80 570.5 Bob 1 4
12 81 581.0 Bob 0 4
start indicates whether this starts a run of sequential events, and run is which such set of events we are in.
Then you can just find the duration:
ddply(bob2, .(run), summarize, length=diff(range(seconds)))
run length
1 1 28.5
2 2 17.0
3 3 0.0
4 4 10.5
Now supposing you have your original dataframe with everyone mixed together in it, we can use ddply again to split it up by person:
tmp <- ddply(df, .(person), transform, run=cumsum(c(1, diff(event) != 1)))
ddply(tmp, .(person, run), summarize, length=diff(range(seconds)), start_event=min(event), end_event=max(event))
person run length start_event end_event
1 Anne 1 3.0 8 9
2 Bob 1 28.5 1 3
3 Bob 2 17.0 41 46
4 Bob 3 0.0 49 49
5 Bob 4 10.5 80 81
6 Joe 1 10.5 4 7
Note: my df is your bob table rbind-ed to your other table, unique()d (just to show it works when there are more than one run per person).
There is probably a clever way to do this that combines the two ddply calls (or uses the dplyr pipe-y syntax that I am not familiar with), but I do not know what it is.
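For the curious, an untested dplyr sketch of the same two steps (swapping the ddply calls for group_by/mutate/summarize) might look like:

```r
library(dplyr)

df %>%
  group_by(person) %>%
  # order events within each person before taking differences
  arrange(event, .by_group = TRUE) %>%
  # a new run starts wherever the event number jumps by more than 1
  mutate(run = cumsum(c(1, diff(event) != 1))) %>%
  group_by(person, run) %>%
  summarize(length = diff(range(seconds)),
            start_event = first(event),
            end_event = last(event))
```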
I have data that looks something like this:
name age profit
Ann -3 10
Ann -2 5
Ann 1 23
Ann 2 15
Josh -2 12
Josh -1 34
Josh 0 1
Josh 1 21
Josh 2 26
I want to remove those rows for which age is negative.
After using
subset(profitData,age>0,select=c(name,age,profit))
I get this:
name age profit
Ann 1 10
Ann 2 5
Ann 3 23
Ann 4 15
Josh 1 12
Josh 2 34
Josh 3 1
Josh 4 21
Josh 5 26
So, only the values from the age column are removed but not the entire row.
Any suggestions?
It seems like this will be OK:
profitData[profitData$age>0,]
To answer your specific subset question...
There's something hinky going on: I ran your code and got your desired output. Maybe start a clean session:
profitData <- read.table(text="name age profit
Ann -3 10
Ann -2 5
Ann 1 23
Ann 2 15
Josh -2 12
Josh -1 34
Josh 0 1
Josh 1 21
Josh 2 26", header=T)
subset(profitData,age>0,select=c(name,age,profit))
## > subset(profitData,age>0,select=c(name,age,profit))
## name age profit
## 3 Ann 1 23
## 4 Ann 2 15
## 8 Josh 1 21
## 9 Josh 2 26