I'm having trouble cleaning my dataset in R. I have a dataset with three variables (name, day, data). The third variable actually contains all of my data, but it needs to be parsed: I need to split this column into multiple columns based on a value in the column. For example, in the following data frame:
x <- data.frame("name" = c("John","John","John","John","John","Sarah","Sarah","Sarah"), "Day" = c(1,1,1,1,1,2,2,2), "Data" = c("Map 28", 2,3,"Transfer","Time","Map 18",2,3))
which looks like:
name Day Data
1 John 1 Map 28
2 John 1 2
3 John 1 3
4 John 1 Transfer
5 John 1 Time
6 Sarah 2 Map 18
7 Sarah 2 2
8 Sarah 2 3
I need to look through the 'Data' column, find every row where the word 'Map' appears, and move the values that follow it into new columns, like so:
name Day Data Val1 Val2 Val3 Val4
1 John 1 Map 28 2 3 Transfer Time
2 Sarah 2 Map 18 2 3 <NA> <NA>
Any help with this would be greatly appreciated!
[EDIT]
Sorry all, I think I oversimplified my example. The issue is that each person on each day will have multiple "Map" values that need to be located, so it looks more like the following:
x <- data.frame("name" = c("John","John","John","John","John","John","John","John","John","John","John","John","Sarah","Sarah","Sarah","Sarah","Sarah","Sarah"),
"Day" = c(1,1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2),
"Data" = c("Map 28", 2,3,"Transfer","Time","Map 15",2,3,"Text","Map3",2,4,"Map 18",2,3,"Map 22",2,3))
name Day Data
1 John 1 Map 28
2 John 1 2
3 John 1 3
4 John 1 Transfer
5 John 1 Time
6 John 1 Map 15
7 John 1 2
8 John 1 3
9 John 1 Text
10 John 1 Map3
11 John 1 2
12 John 1 4
13 Sarah 2 Map 18
14 Sarah 2 2
15 Sarah 2 3
16 Sarah 2 Map 22
17 Sarah 2 2
18 Sarah 2 3
and then the final output would be...
y <- data.frame("name" = c("John","John","John","Sarah", "Sarah"),
"Day" =c(1,1,1,2,2),
"Data"= c("Map 28","Map 15","Map 3","Map 18","Map 22"),
"Val1" =c(2,2,2,2,2),
"Val2"=c(3,3,4,3,3),
"Val3"=c("Transfer","Text",NA,NA,NA),
"Val4"=c("Time",NA,NA,NA,NA))
name Day Data Val1 Val2 Val3 Val4
1 John 1 Map 28 2 3 Transfer Time
2 John 1 Map 15 2 3 Text <NA>
3 John 1 Map 3 2 4 <NA> <NA>
4 Sarah 2 Map 18 2 3 <NA> <NA>
5 Sarah 2 Map 22 2 3 <NA> <NA>
You could use reshape() from base R after adding a time variable with ave():
reshape(transform(x,time=ave(Day,Day,FUN=seq_along)),v.names="Data",dir="wide",idvar = "name")
name Day Data.1 Data.2 Data.3 Data.4 Data.5
1 John 1 Map 28 2 3 Transfer Time
6 Sarah 2 Map 18 2 3 <NA> <NA>
With your new edit, using base R you could do:
d = transform(x,ID=cumsum(grepl('Map',Data))->a,time=ave(a,a,FUN=seq_along))
reshape(d,v.names="Data",idvar = "ID",dir="wide")
name Day ID Data.1 Data.2 Data.3 Data.4 Data.5
1 John 1 1 Map 28 2 3 Transfer Time
6 John 1 2 Map 15 2 3 Text <NA>
10 John 1 3 Map3 2 4 <NA> <NA>
13 Sarah 2 4 Map 18 2 3 <NA> <NA>
16 Sarah 2 5 Map 22 2 3 <NA> <NA>
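As a rough illustration of why the grouping step works, the following base R sketch isolates it: cumsum(grepl("Map", Data)) increases by one at every "Map" row, so all rows up to the next "Map" share a block number. The Val1 to Val4 column names and the helper variables (block, vals, width, wide, out) are my own choices for illustration, not part of the answer above.

```r
x <- data.frame(
  name = c(rep("John", 12), rep("Sarah", 6)),
  Day  = c(rep(1, 12), rep(2, 6)),
  Data = c("Map 28", 2, 3, "Transfer", "Time", "Map 15", 2, 3, "Text",
           "Map3", 2, 4, "Map 18", 2, 3, "Map 22", 2, 3),
  stringsAsFactors = FALSE
)

# Each "Map" row starts a new block; cumsum turns the TRUE/FALSE flags into block ids
block <- cumsum(grepl("Map", x$Data))

# Collect the values of each block and pad shorter blocks with NA
vals  <- split(x$Data, block)
width <- max(lengths(vals))
wide  <- t(sapply(vals, function(v) c(v, rep(NA, width - length(v)))))
colnames(wide) <- c("Data", paste0("Val", seq_len(width - 1)))

# One output row per block, keeping name and Day from the block's first row
out <- data.frame(x[!duplicated(block), c("name", "Day")], wide,
                  row.names = NULL, stringsAsFactors = FALSE)
out
```

Because the block id ignores name and Day, this sketch assumes every person's rows begin with a "Map" row, as in the example data.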
With tidyverse you could do:
library(tidyverse)
x%>%
group_by(ID = cumsum(str_detect(Data,"Map")))%>%
mutate(time=1:n())%>%
spread(time,Data)
# A tibble: 5 x 8
# Groups: ID [5]
name Day ID `1` `2` `3` `4` `5`
<fct> <dbl> <int> <fct> <fct> <fct> <fct> <fct>
1 John 1 1 Map 28 2 3 Transfer Time
2 John 1 2 Map 15 2 3 Text NA
3 John 1 3 Map3 2 4 NA NA
4 Sarah 2 4 Map 18 2 3 NA NA
5 Sarah 2 5 Map 22 2 3 NA NA
Here is a representation of my dataset
set.seed(1)
library (tidyverse)
Date<-c(1:6,1:8,1:10)
ID<-c(rep(1,3*2),rep(2,4*2),rep(3,5*2))
Surgery<-c(c("Surg1",NA,NA,NA,"Surg2",NA),
c(NA,NA,NA,"Surg.a",NA,NA,"Surg.f",NA),
c("Surg.C",NA,NA,"Surg.A",NA,NA,"Surg.X",NA,NA,NA))
Complication<-sample(c(rep("Infection",8),rep("Pain",7),rep("bleeding",5),rep("Oedema",4)))
NumberOfSurgery<-c(rep(2,6),rep(2,8),rep(3,10))
OrderOfSurgery<-c(1,rep(NA,3),2,rep(NA,4),1,NA,NA,2,NA,1,NA,NA,2,NA,NA,3,rep(NA,3))
mydata<-data.frame(ID,Date,Surgery,Complication,NumberOfSurgery,OrderOfSurgery)
mydata
There are three patients. The first and second have each had two surgeries over time, and the third has had three.
For every individual, I would like to select all complications from the day of surgery 1 up to and including the day of surgery 2, to obtain the dataset below:
ID Date Surgery Complication NumberOfSurgery OrderOfSurgery
1 1 1 Surg1 Infection 2 1
2 1 2 <NA> Infection 2 NA
3 1 3 <NA> Infection 2 NA
4 1 4 <NA> Infection 2 NA
5 1 5 Surg2 Pain 2 2
10 2 4 Surg.a bleeding 2 1
11 2 5 <NA> Pain 2 NA
12 2 6 <NA> Infection 2 NA
13 2 7 Surg.f bleeding 2 2
15 3 1 Surg.C Pain 3 1
16 3 2 <NA> Pain 3 NA
17 3 3 <NA> Pain 3 NA
18 3 4 Surg.A bleeding 3 2
Here is how I proceeded.
The dates are already sorted.
First, for the second individual, I removed all the observations that occurred before the first surgery (that is, dates 1 to 3).
mydata2<-mydata%>%mutate(Surgery2=Surgery,OrderOfSurgery2=OrderOfSurgery)%>%
group_by(ID)%>%fill(Surgery2,OrderOfSurgery2)%>%filter(!is.na(Surgery2))
Then I could select only the observations that follow the first surgery by doing this:
mydata3<-mydata2%>%filter(OrderOfSurgery2==1)
What I want to do is also include the row at the date of the second surgery, as I mentioned above.
Using lag() to shift OrderOfSurgery2 down by one position within each group, filling the resulting blank with a default value of 1, should do it.
mydata3 <- mydata2 %>%
mutate(OrderOfSurgery2 = lag(OrderOfSurgery2, default = 1)) %>%
filter(OrderOfSurgery2 == 1)
# # A tibble: 13 x 8
# # Groups: ID [3]
# ID Date Surgery Complication NumberOfSurgery OrderOfSurgery Surgery2
# <dbl> <int> <chr> <chr> <dbl> <dbl> <chr>
# 1 1 1 Surg1 Infection 2 1 Surg1
# 2 1 2 NA Infection 2 NA Surg1
# 3 1 3 NA Infection 2 NA Surg1
# 4 1 4 NA Infection 2 NA Surg1
# 5 1 5 Surg2 Pain 2 2 Surg2
# 6 2 4 Surg.a bleeding 2 1 Surg.a
# 7 2 5 NA Pain 2 NA Surg.a
# 8 2 6 NA Infection 2 NA Surg.a
# 9 2 7 Surg.f bleeding 2 2 Surg.f
# <Omitted>
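For comparison, here is a minimal base R sketch of the same selection that does not rely on fill() or lag(). It is an alternative formulation rather than the method above: it counts surgeries cumulatively within each patient and keeps the rows from the first surgery through the second, inclusive. The toy data frame omits the Complication column, and the names is_surg, n_surg, keep and result are assumptions for illustration.

```r
mydata <- data.frame(
  ID   = c(rep(1, 6), rep(2, 8), rep(3, 10)),
  Date = c(1:6, 1:8, 1:10),
  Surgery = c("Surg1", NA, NA, NA, "Surg2", NA,
              NA, NA, NA, "Surg.a", NA, NA, "Surg.f", NA,
              "Surg.C", NA, NA, "Surg.A", NA, NA, "Surg.X", NA, NA, NA),
  stringsAsFactors = FALSE
)

is_surg <- !is.na(mydata$Surgery)

# Running count of surgeries per patient (0 before the first surgery)
n_surg <- ave(as.integer(is_surg), mydata$ID, FUN = cumsum)

# Keep the rows after the first surgery, plus the row of the second surgery itself
keep   <- n_surg == 1 | (n_surg == 2 & is_surg)
result <- mydata[keep, ]
result
```

Like the approach above, this assumes the rows are already sorted by date within each patient.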
I have a data.frame with some individuals, like this:
Animal Score Weight
John 5 4
John 5 3
John 5 3
Peter 3 2
Peter 3 2
Louis 4 2
Louis 4 4
Louis 4 1
Sammy 3 2
Sammy 3 2
Sammy 3 2
John 1 5
John 1 5
John 1 5
And I would like to select the top 40% of animals based on the score, keeping all measurements for those animals, like this:
Animal Score Weight
John 5 4
John 5 3
John 5 3
Louis 4 2
Louis 4 4
Louis 4 1
I tried this code:
top40=subset(df, Score > quantile(Score, prob = 1 - 40/100))
but it didn't work: it selected rows based only on the Score value, like this:
Animal Score Weight
John 5 4
John 5 3
John 5 3
Louis 4 2
Assuming each animal has a single score which is repeated, and you want to select animals whose score is in the top 40%, one option is
subset(df, Score > quantile(unique(Score), 1 - 40/100))
# Animal Score Weight
# 1: John 5 4
# 2: John 5 3
# 3: John 5 3
# 4: Louis 4 2
# 5: Louis 4 4
# 6: Louis 4 1
If you don't use unique (as in all the other answers currently), you get what I would think is an unexpected result. Without unique, the animals in the "top 40% by Score" change even when the scores are the same, just because some animals have more rows.
df[c(rep(1, 10), 2:nrow(df)),] %>%
filter(
Score > Score %>% quantile(0.4)
)
# Animal Score Weight
# 1 John 5 4
# 2 John 5 4
# 3 John 5 4
# 4 John 5 4
# 5 John 5 4
# 6 John 5 4
# 7 John 5 4
# 8 John 5 4
# 9 John 5 4
# 10 John 5 4
# 11 John 5 3
# 12 John 5 3
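To make the effect concrete, here is a small sketch that compares a threshold computed from all rows with one computed from the distinct scores. It uses 0.6 as the cut-off, on the assumption that "top 40%" means scores above the 60th percentile; the names inflated, cut_rows, cut_unique, top_rows and top_unique are mine.

```r
df <- data.frame(
  Animal = c(rep("John", 3), rep("Peter", 2), rep("Louis", 3),
             rep("Sammy", 3), rep("John", 3)),
  Score  = c(5, 5, 5, 3, 3, 4, 4, 4, 3, 3, 3, 1, 1, 1),
  Weight = c(4, 3, 3, 2, 2, 2, 4, 1, 2, 2, 2, 5, 5, 5),
  stringsAsFactors = FALSE
)

# Same scores as before, but the first row is repeated ten times
inflated <- df[c(rep(1, 10), 2:nrow(df)), ]

# Threshold from all rows: depends on how many rows each animal has
cut_rows   <- quantile(inflated$Score, 0.6)
# Threshold from the distinct scores: unaffected by repeated rows
cut_unique <- quantile(unique(inflated$Score), 0.6)

top_rows   <- inflated[inflated$Score > cut_rows, ]
top_unique <- inflated[inflated$Score > cut_unique, ]
unique(top_unique$Animal)
```

With this inflated data the row-based threshold rises to the maximum score, so no row passes it, whereas the threshold from the distinct scores still keeps the score-5 and score-4 animals.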
Use this:
library(dplyr)
animals %>%
filter(
Score > quantile(Score, 0.6) # the top 40% lies above the 60th percentile
)
BTW, animals is your data frame.
You can use:
n <- nrow(data)
head(data[order(data$Score, decreasing = TRUE), ], round(n * 0.4))
Suppose I have two data frames, Df1 and Df2, as follows:
Df1
Id Price Profit Month
10 5 2 1
10 5 3 2
10 5 2 3
11 7 3 1
11 7 1 2
12 0 0 1
12 5 1 2
Df2
Id Name
9 Kane
10 Jack
10 Jack
11 Will
12 Matt
13 Lee
14 Han
Now I want to add a new column named Name to Df1, taking its value from Df2 based on the matching Id.
So the modified Df1 will be:
Id Price Profit Month Name
10 5 2 1 Jack
10 5 3 2 Jack
10 5 2 3 Jack
11 7 3 1 Will
11 7 1 2 Will
12 0 0 1 Matt
12 5 1 2 Matt
df1 <- data.frame(Id=c(10L,10L,10L,11L,11L,12L,12L),Price=c(5L,5L,5L,7L,7L,0L,5L),Profit=c(2L,3L,2L,3L,1L,0L,1L),Month=c(1L,2L,3L,1L,2L,1L,2L),stringsAsFactors=F);
df2 <- data.frame(Id=c(9L,10L,10L,11L,12L,13L,14L),Name=c('Kane','Jack','Jack','Will','Matt','Lee','Han'),stringsAsFactors=F);
df1$Name <- df2$Name[match(df1$Id,df2$Id)];
df1;
## Id Price Profit Month Name
## 1 10 5 2 1 Jack
## 2 10 5 3 2 Jack
## 3 10 5 2 3 Jack
## 4 11 7 3 1 Will
## 5 11 7 1 2 Will
## 6 12 0 0 1 Matt
## 7 12 5 1 2 Matt
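Use left_join() from dplyr. Df2 contains a duplicate row for Jack, so deduplicate it first with distinct(); otherwise the rows for Id 10 are repeated in the result.
library(dplyr)
left_join(df1, distinct(df2), by = "Id")
e.g.:
> left_join(df1, distinct(df2))
Joining by: "Id"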
Id Price Profit Month Name
1 10 5 2 1 Jack
2 10 5 3 2 Jack
3 10 5 2 3 Jack
4 11 7 3 1 Will
5 11 7 1 2 Will
6 12 0 0 1 Matt
7 12 5 1 2 Matt
Data wrangling cheatsheet by RStudio is a very helpful resource.
Here is an option using data.table
library(data.table)
setDT(Df1)[unique(Df2), on = "Id", nomatch=0]
# Id Price Profit Month Name
#1: 10 5 2 1 Jack
#2: 10 5 3 2 Jack
#3: 10 5 2 3 Jack
#4: 11 7 3 1 Will
#5: 11 7 1 2 Will
#6: 12 0 0 1 Matt
#7: 12 5 1 2 Matt
Or, as @Arun mentioned in the comments, we can assign (:=) the "Name" column after joining on "Id" to reflect the changes in the original dataset "Df1".
setDT(Df1)[Df2, Name:= Name, on = "Id"]
Df1
A simple base R option could be merge()
merge(Df1,unique(Df2), by="Id")
# Id Price Profit Month Name
#1 10 5 2 1 Jack
#2 10 5 3 2 Jack
#3 10 5 2 3 Jack
#4 11 7 3 1 Will
#5 11 7 1 2 Will
#6 12 0 0 1 Matt
#7 12 5 1 2 Matt
The function unique() is used here because of the duplicate entry in Df2 concerning "Jack". For the example data described in the OP the option by="Id" can be omitted, but it might be necessary in a more general case.
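The effect of the duplicate "Jack" row can be checked with a short base R sketch. It compares merge() with and without unique(), and also shows that a match()-based lookup is unaffected because match() returns only the first matching position. The names dup_join, clean_join and by_match are hypothetical.

```r
df1 <- data.frame(
  Id     = c(10L, 10L, 10L, 11L, 11L, 12L, 12L),
  Price  = c(5L, 5L, 5L, 7L, 7L, 0L, 5L),
  Profit = c(2L, 3L, 2L, 3L, 1L, 0L, 1L),
  Month  = c(1L, 2L, 3L, 1L, 2L, 1L, 2L)
)
df2 <- data.frame(
  Id   = c(9L, 10L, 10L, 11L, 12L, 13L, 14L),
  Name = c("Kane", "Jack", "Jack", "Will", "Matt", "Lee", "Han"),
  stringsAsFactors = FALSE
)

# The duplicate row in df2 multiplies the Id 10 rows of df1
dup_join   <- merge(df1, df2, by = "Id")
# Deduplicating df2 first gives exactly one output row per row of df1
clean_join <- merge(df1, unique(df2), by = "Id")
# match() returns the first matching position only, so duplicates do no harm
by_match   <- df2$Name[match(df1$Id, df2$Id)]

c(nrow(dup_join), nrow(clean_join))
```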
I have the following individual data and I want to make a unique household identifier. Every individual already has a rank within the household, so rank 1 marks the start of a new household.
e.g.
rank name
1 John
2 Lisa
3 Stu
1 Phil
1 Mike
1 Florence
2 George
3 David
4 Diana
1 Eleanor
The result I am looking for is this:
rank name id
1 John 1
2 Lisa 1
3 Stu 1
1 Phil 2
1 Mike 3
1 Florence 4
2 George 4
3 David 4
4 Diana 4
1 Eleanor 5
There are around 320 000 individuals, so the group id should go from 1 to sum(df$rank == 1) or something similar. Any other type of unique ID also works; it doesn't have to be seq(1,n,1).
df$id <- cumsum(df$rank == 1)
# rank name id
# 1 1 John 1
# 2 2 Lisa 1
# 3 3 Stu 1
# 4 1 Phil 2
# 5 1 Mike 3
# 6 1 Florence 4
# 7 2 George 4
# 8 3 David 4
# 9 4 Diana 4
# 10 1 Eleanor 5
As @Andre Elrico noted, if rank is NA for any row, the method above would give you NA for id in that row and all subsequent rows, so you could use the option below instead if you know rank may be NA (but not when it should be 1).
df$id <- cumsum(df$rank %in% 1)
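A tiny sketch of the difference, using a made-up rank vector with one missing value (rnk, id_eq and id_in are hypothetical names):

```r
rnk <- c(1, 2, NA, 1, 2)   # hypothetical ranks with one missing value

# NA == 1 is NA, and cumsum() carries that NA forward to every later element
id_eq <- cumsum(rnk == 1)

# NA %in% 1 is FALSE, so the running count simply continues
id_in <- cumsum(rnk %in% 1)

id_eq
id_in
```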
Data used:
df <- read.table(text = '
rank name
1 John
2 Lisa
3 Stu
1 Phil
1 Mike
1 Florence
2 George
3 David
4 Diana
1 Eleanor
', header = T)
OK, check out this data frame...
customer_name order_dates order_values
1 John 2010-11-01 15
2 Bob 2008-03-25 12
3 Alex 2009-11-15 5
4 John 2012-08-06 15
5 John 2015-05-07 20
Let's say I want to add a variable that ranks each customer's orders by order value, highest first, using the most recent order date as the tie breaker. So, ultimately the data should look like this:
customer_name order_dates order_values ranked_order_values_by_max_value_date
1 John 2010-11-01 15 3
2 Bob 2008-03-25 12 1
3 Alex 2009-11-15 5 1
4 John 2012-08-06 15 2
5 John 2015-05-07 20 1
Where everyone's single order gets 1, and all subsequent orders are ranked based on the value, and the tie breaker is the last order date getting priority.
In this example, John's 8/6/2012 order gets the #2 rank because it was placed after 11/1/2010. The 5/7/2015 order is 1 because it was the biggest. So, even if that order was placed 20 years ago, it should be the #1 Rank because it was John's highest order value.
Does anyone know how I can do this in R? Where I can Rank within a group of specified variables in a data frame?
Thanks for your help!
The top-rated answer (by cdeterman) is actually incorrect. The order function returns the positions of the 1st, 2nd, 3rd, etc. ranked values, not the ranks of the values in their current order.
Let's take a simple example where we want to rank, starting with the largest, grouping by customer name. I have included a manual ranking so we can check the values:
> df
customer_name order_values manual_rank
1 John 2 5
2 John 5 2
3 John 9 1
4 John 1 6
5 John 4 3
6 John 3 4
7 Lucy 4 4
8 Lucy 9 1
9 Lucy 6 3
10 Lucy 2 6
11 Lucy 8 2
12 Lucy 3 5
If I run the code suggested by cdeterman I get the following incorrect ranks:
> df %>%
+ group_by(customer_name) %>%
+ mutate(my_ranks = order(order_values, decreasing=TRUE))
Source: local data frame [12 x 4]
Groups: customer_name [2]
customer_name order_values manual_rank my_ranks
<fctr> <dbl> <dbl> <int>
1 John 2 5 3
2 John 5 2 2
3 John 9 1 5
4 John 1 6 6
5 John 4 3 1
6 John 3 4 4
7 Lucy 4 4 2
8 Lucy 9 1 5
9 Lucy 6 3 3
10 Lucy 2 6 1
11 Lucy 8 2 6
12 Lucy 3 5 4
Order is used to re-order dataframes into decreasing or increasing order. What we actually want is to run the order function twice, with the second order function giving us the actual ranks we want.
> df %>%
+ group_by(customer_name) %>%
+ mutate(good_ranks = order(order(order_values, decreasing=TRUE)))
Source: local data frame [12 x 4]
Groups: customer_name [2]
customer_name order_values manual_rank good_ranks
<fctr> <dbl> <dbl> <int>
1 John 2 5 5
2 John 5 2 2
3 John 9 1 1
4 John 1 6 6
5 John 4 3 3
6 John 3 4 4
7 Lucy 4 4 4
8 Lucy 9 1 1
9 Lucy 6 3 3
10 Lucy 2 6 6
11 Lucy 8 2 2
12 Lucy 3 5 5
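The difference between positions and ranks can be seen on a single vector. This sketch uses John's values from the example; the variable names positions, ranks and via_rank are mine, and rank(-x) is included only as a cross-check that agrees when there are no ties.

```r
order_values <- c(2, 5, 9, 1, 4, 3)   # John's values from the example

# Where the 1st, 2nd, ... largest values sit in the vector
positions <- order(order_values, decreasing = TRUE)

# Applying order() again turns those positions into the rank of each element
ranks <- order(positions)

# Cross-check: descending ranks via rank() on the negated values
via_rank <- rank(-order_values)

rbind(order_values, positions, ranks, via_rank)
```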
You can do this pretty cleanly with dplyr
library(dplyr)
df %>%
group_by(customer_name) %>%
mutate(my_ranks = order(order(order_values, order_dates, decreasing=TRUE)))
Source: local data frame [5 x 4]
Groups: customer_name
customer_name order_dates order_values my_ranks
1 John 2010-11-01 15 3
2 Bob 2008-03-25 12 1
3 Alex 2009-11-15 5 1
4 John 2012-08-06 15 2
5 John 2015-05-07 20 1
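This can be achieved with ave and rank, assuming order_dates is of class Date. ave passes the proper groups to rank, and negating the values reverses the ranking so that the latest date gets rank 1. Note that this ranks by date only, which matches the requested result here because John's largest order is also his latest:
with(x, ave(as.numeric(order_dates), customer_name, FUN=function(x) rank(-x)))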
## [1] 3 1 1 2 1
In base R you can do this with the slightly unwieldy
# order(order(...)) converts sort positions into ranks within each group
transform(df,rank=ave(1:nrow(df),customer_name,
FUN=function(x) order(order(order_values[x],order_dates[x],decreasing=TRUE))))
customer_name order_dates order_values rank
1 John 2010-11-01 15 3
2 Bob 2008-03-25 12 1
3 Alex 2009-11-15 5 1
4 John 2012-08-06 15 2
5 John 2015-05-07 20 1
where order is provided both the primary and tie-breaker values for each group.
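Here is a self-contained sketch of ranking within groups with a tie-breaker, assuming order_dates is stored as class Date. It wraps order() twice so that sort positions are converted into ranks, which matters whenever a group's values are not already arranged so that the two coincide. The data frame is rebuilt from the question's example.

```r
df <- data.frame(
  customer_name = c("John", "Bob", "Alex", "John", "John"),
  order_dates   = as.Date(c("2010-11-01", "2008-03-25", "2009-11-15",
                            "2012-08-06", "2015-05-07")),
  order_values  = c(15, 12, 5, 15, 20),
  stringsAsFactors = FALSE
)

# For each customer, i holds that customer's row indices.
# Sort by value, then by date, both descending, and turn positions into ranks.
df$rank <- ave(seq_len(nrow(df)), df$customer_name, FUN = function(i) {
  order(order(df$order_values[i], df$order_dates[i], decreasing = TRUE))
})
df
```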
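# Sort each customer's orders by value, then by date, both descending;
# row_number() then gives the rank within each group (requires dplyr)
df %>%
group_by(customer_name) %>%
arrange(customer_name, desc(order_values), desc(order_dates)) %>%
mutate(rank2 = row_number())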
Similar to @t-himmel's answer, you can get the ranks with data.table.
dt[ , rnk := order(order(order_values, decreasing = TRUE)), customer_name ]