Calculate value using row with pattern - r

I have these inputs:
df
# A tibble: 53 x 2
   Task              Frame
   <chr>             <int>
 1 S101-10061            6
 2 S101-10061-74716     16
 3 S101-10065           18
 4 S101-10065-104934    16
 5 S101-10071           32
 6 S101-10071-104898    74
 7 S101-10072            8
 8 S101-10072-79124     58
 9 S101-10074           38
10 S101-10075           82
As you can see, the first 10 characters of "Task" are sometimes the same. I need to detect when two tasks are the same, for example task 1 (S101-10061) is the same as task 2 (S101-10061-74716) in the first 10 characters, and when they match, take the absolute difference of their Frame values, here 16 - 6 = 10. So I expect something like:
   Task              Frame  Diff
   <chr>             <int> <int>
 1 S101-10061            6     6
 2 S101-10061-74716     16
 3 S101-10065           18     2
 4 S101-10065-104934    16
 5 S101-10071           32    24
 6 S101-10071-104898    74
 7 S101-10072            8    50
 8 S101-10072-79124     58
 9 S101-10074           38
10 S101-10075           82
I tried:
df %>% mutate(
  Diff = accumulate(
    Frame[1:n()], function(x, y) abs(x - y)
  )
)
But it doesn't help. How can I compare rows by pattern? Any ideas?

Here is a dplyr solution.
library(dplyr)
df %>%
  mutate(Task = substr(Task, 1, 10)) %>%
  group_by(Task) %>%
  mutate(Diff = abs(Frame - lead(Frame)))
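If the original Task values should stay intact (as in the expected output above), one variation, just a sketch assuming the same df, is to group on the ten-character prefix without overwriting the column:
library(dplyr)
df %>%
  group_by(prefix = substr(Task, 1, 10)) %>%  # group on the 10-character prefix only
  mutate(Diff = abs(Frame - lead(Frame))) %>%
  ungroup() %>%
  select(-prefix)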
Data.
df <-
structure(list(Task = c("S101-10061", "S101-10061-74716", "S101-10065",
"S101-10065-104934", "S101-10071", "S101-10071-104898", "S101-10072",
"S101-10072-79124", "S101-10074", "S101-10075"), Frame = c(6L,
16L, 18L, 16L, 32L, 74L, 8L, 58L, 38L, 82L)), class = "data.frame",
row.names = c("1", "2", "3", "4", "5", "6", "7", "8", "9", "10"))

Related

R: Add new column based on character vector and existing column in dataframe with unique items

I want to assign elements in the character vector to the dataframe based on matching information in the existing column.
Data frame with one column
head(df, 5)
items
1 1
2 1
3 1
4 1
5 1
tail(df, 5)
120001 44
120002 44
120003 44
120004 44
120005 44
The character vector chr_v consists of 44 unique items.
chr_v <- c("T1_1", "C1_1", "T1_2", "A_1", "C_2", "C_3", "T1_3", "A_2", "C_4",
"C_5", "C_6", "C_7", "C_8", "A_3", "C_9", 'C_10', "C_11", "A_4", 'C_12', "A_5",
"C_13", "A_6", "A_7", "C_14", "C_15", "C_16", "T_4", "C_17", "C_18", "C_19", 'T_5',
"C_20", "C_21", "T_6", "A_8", "C_22", "C_23", "C_24", "C_25", "C_26", "T_7", "T_8",
'C_27', 'C_28')
The length of chr_v is
length(chr_v)
[1] 44
There are 44 unique, ordered values in the items column of the data frame and 44 elements in the character vector. I want to create a new column by assigning each element of the character vector to the corresponding unique ordered value in the items column, repeating it across all of that value's rows.
Expected Output:
head(df, 5)
items newitem
1 1 T1_1
2 1 T1_1
3 1 T1_1
4 1 T1_1
5 1 T1_1
tail(df, 5)
items newitem
120001 44 C_28
120002 44 C_28
120003 44 C_28
120004 44 C_28
120005 44 C_28
I checked the count of each items value in the df with the table() command, but the output is not ordered (I even tried to sort it). Therefore, I cannot use that output to simply repeat the items sequentially.
Martin provided a tidyverse solution. Here is a base R one: because items already holds the integer positions 1 to 44, plain vector indexing looks up the matching label for every row.
df$newitem <- chr_v[df$items]
And here is the dplyr equivalent:
df %>%
  mutate(newitem = chr_v[items])
output:
items newitem
1 1 T1_1
2 1 T1_1
3 1 T1_1
4 1 T1_1
5 1 T1_1
6 44 C_28
7 44 C_28
8 44 C_28
9 44 C_28
10 44 C_28
data:
df <- structure(list(items = c(1L, 1L, 1L, 1L, 1L, 44L, 44L, 44L, 44L,
44L), newitem = c("T1_1", "T1_1", "T1_1", "T1_1", "T1_1", "C_28",
"C_28", "C_28", "C_28", "C_28")), row.names = c("1", "2", "3",
"4", "5", "6", "7", "8", "9", "10"), class = "data.frame")
You could use enframe() combined with a left_join():
library(tidyverse)
df %>%
  left_join(enframe(chr_v), by = c("items" = "name"))
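For completeness, a slightly expanded sketch (assuming df and chr_v from the question) that shows what enframe() produces and renames the joined column to match the expected output:
library(tidyverse)
# enframe() turns the unnamed vector into a two-column tibble:
# name (the integer position 1..44) and value (the label itself)
enframe(chr_v)
df %>%
  left_join(enframe(chr_v), by = c("items" = "name")) %>%
  rename(newitem = value)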

How to summarize the top 3 highest values in a dataset when there are ties

I have a data frame (my_data) and want to calculate, for each city, the sum of only the 3 highest values, even though there might be ties. I am quite new to R and have been using dplyr.
A tibble: 15 x 3
city month number
<chr> <chr> <dbl>
1 Lund jan 12
2 Lund feb 12
3 Lund mar 18
4 Lund apr 28
5 Lund may 28
6 Stockholm jan 15
7 Stockholm feb 15
8 Stockholm mar 30
9 Stockholm apr 30
10 Stockholm may 10
11 Uppsala jan 22
12 Uppsala feb 30
13 Uppsala mar 40
14 Uppsala apr 60
15 Uppsala may 30
This is the code I have tried:
# For each city, count the top 3 of variable number
my_data %>% group_by(city) %>% top_n(3, number) %>% summarise(top_nr = sum(number))
The expected (wanted) output is:
# A tibble: 3 x 2
city top_nr
<chr> <dbl>
1 Lund 86
2 Stockholm 75
3 Uppsala 130
but the actual R output is:
# A tibble: 3 x 2
city top_nr
<chr> <dbl>
1 Lund 86
2 Stockholm 90
3 Uppsala 160
It seems that when there are ties, all of the tied values are included in the sum. I wanted only the 3 unique instances with the highest values to be counted.
Any help would be much appreciated! :)
We can use distinct() to remove the duplicate elements first. The way top_n() works is that if values at the cutoff are tied, it keeps all of the tied rows:
my_data %>%
  distinct(city, number, .keep_all = TRUE) %>%
  group_by(city) %>%
  top_n(3, number) %>%
  summarise(top_nr = sum(number))
Update
Based on the OP's updated expected output: after the top_n() step (whose output is not arranged), arrange 'number' in descending order and sum the first 3 values of 'number':
my_data %>%
  group_by(city) %>%
  top_n(3, number) %>%
  arrange(city, desc(number)) %>%
  summarise(number = sum(head(number, 3)))
# A tibble: 3 x 2
# city number
# <chr> <int>
#1 Lund 74
#2 Stockholm 75
#3 Uppsala 130
data
my_data <- structure(list(city = c("Lund", "Lund", "Lund", "Lund", "Lund",
"Stockholm", "Stockholm", "Stockholm", "Stockholm", "Stockholm",
"Uppsala", "Uppsala", "Uppsala", "Uppsala", "Uppsala"), month = c("jan",
"feb", "mar", "apr", "may", "jan", "feb", "mar", "apr", "may",
"jan", "feb", "mar", "apr", "may"), number = c(12L, 12L, 18L,
28L, 28L, 15L, 15L, 30L, 30L, 10L, 22L, 30L, 40L, 60L, 30L)),
class = "data.frame", row.names = c("1",
"2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12", "13",
"14", "15"))
Life might be way simpler without top_n():
my_data %>%
  group_by(city) %>%
  summarize(
    top_nr = sum(tail(sort(number), 3))
  )
This tidyverse (actually, dplyr) solution is very similar to akrun's, but it filters the data frame instead of using top_n().
library(tidyverse)
my_data %>%
  group_by(city) %>%
  arrange(desc(number), .by_group = TRUE) %>%
  filter(row_number() %in% 1:3) %>%
  summarise(top_nr = sum(number))
## A tibble: 3 x 2
# city top_nr
# <chr> <int>
#1 Lund 74
#2 Stockholm 75
#3 Uppsala 130
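As a side note, in dplyr 1.0.0 and later slice_max() expresses the same intent more directly and can drop ties explicitly; a minimal sketch using the same my_data:
library(dplyr)
my_data %>%
  group_by(city) %>%
  slice_max(number, n = 3, with_ties = FALSE) %>%  # keep exactly 3 rows per city
  summarise(top_nr = sum(number))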

Create new column using condition on another existing column

I have data like this
Time chamber
9 1
10 2
11 3
12 4
13 5
14 6
15 7
16 8
17 9
18 10
19 11
20 12
21 1
22 2
23 3
24 4
I want to create a new column using conditions on another existing column (chamber).
It should look something like this
Time chamber treatment
9 1 c2t2
10 2 c2t2
11 3 c0t0r
12 4 c2t2r
13 5 c2t2r
14 6 c0t0
15 7 c0t0r
16 8 c0t0r
17 9 c2t2
18 10 c2t2r
19 11 c0t0
20 12 c0t0
21 1 c2t2
22 2 c2t2
23 3 c0t0r
24 4 c2t2r
For chambers 1, 2, 9: Treatment is c2t2.
For chambers 3, 7, 8: Treatment is c0t0r.
For chambers 4, 5, 10: Treatment is c2t2r.
For chambers 6, 11, 12: Treatment is c0t0.
I have also made a lookup table, but I don't know how to use it:
lookup_table <- data.frame(
  row.names = c("1", "2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12"),
  new_col = c("C2T2", "C2T2", "C0T0R", "C2T2R", "C2T2R", "C0T0",
              "C0T0R", "C0T0R", "C2T2", "C2T2R", "C0T0", "C0T0"),
  stringsAsFactors = FALSE
)
Assuming "dt" is your dataframe name, then you can use dplyr with case_when
library(tidyverse)
dt %>%
  mutate(newcol = case_when(
    chamber %in% c(1, 2, 9) ~ "c2t2",
    chamber %in% c(3, 7, 8) ~ "c0t0r",
    chamber %in% c(4, 5, 10) ~ "c2t2r",
    chamber %in% c(6, 11, 12) ~ "c0t0"
  ))
Output:
Time chamber newcol
1 9 1 c2t2
2 10 2 c2t2
3 11 3 c0t0r
4 12 4 c2t2r
5 13 5 c2t2r
6 14 6 c0t0
7 15 7 c0t0r
8 16 8 c0t0r
9 17 9 c2t2
10 18 10 c2t2r
11 19 11 c0t0
12 20 12 c0t0
13 21 1 c2t2
14 22 2 c2t2
15 23 3 c0t0r
16 24 4 c2t2r
You can merge your df with the lookup_table. In my experience, if you want to combine different data.frames, merge() is the command I like to use. Do note that there are many different ways and specialised packages you can use for the same purpose!
You need to specify which column you use as the 'matching column' and also that you want to keep all records in df:
merge(df, lookup_table, all.x = TRUE, by.x = "chamber", by.y = "row.names")
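If you prefer to stay in the tidyverse, a rough equivalent (a sketch, assuming the df and lookup_table given under Data below) is to turn the lookup table's row names into a regular column and left_join():
library(dplyr)
library(tibble)
lookup_tbl <- lookup_table %>%
  rownames_to_column(var = "chamber") %>%
  mutate(chamber = as.integer(chamber))  # match the integer chamber column in df
df %>%
  left_join(lookup_tbl, by = "chamber")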
Data:
df <- structure(list(Time = 9:24, chamber = c(1L, 2L, 3L, 4L, 5L, 6L,
7L, 8L, 9L, 10L, 11L, 12L, 1L, 2L, 3L, 4L)),
.Names = c("Time", "chamber"), class = "data.frame",
row.names = c(NA, -16L))
lookup_table <- structure(list(new_col = c("C2T2", "C2T2", "C0T0R", "C2T2R",
"C2T2R", "C0T0", "C0T0R", "C0T0R",
"C2T2", "C2T2R", "C0T0", "C0T0")),
.Names = "new_col",
row.names = c("1", "2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12"), class = "data.frame")

R - ddply summarise using nlevels() does not work

When using the plyr package to summarise my data, it seems impossible to use the nlevels() function.
The structure of my data set is as follows:
>aer <- read.xlsx("XXXX.xlsx", sheetIndex=1)
>aer$ID <- as.factor(aer$ID)
>aer$description <- as.factor(aer$description)
>head(aer)
ID SOC start end days count severity relation
1 1 410 2015-04-21 2015-04-28 7 1 1 3
2 1 500 2015-01-30 2015-05-04 94 1 1 3
3 1 600 2014-11-25 2014-11-29 4 1 1 3
4 1 600 2015-01-02 2015-01-07 5 1 1 3
5 1 600 2015-01-26 2015-03-02 35 1 1 3
6 1 600 2015-04-14 2015-04-17 3 1 1 3
> dput(head(aer,4))
structure(list(ID = structure(c(1L, 1L, 1L, 1L), .Label = c("1",
"2", "3", "4", "5", "6", "7", "8", "9", "10", "12", "13", "14",
"15"), class = "factor"), SOC = c(410, 500, 600, 600),
start = structure(c(16546, 16465, 16399, 16437), class = "Date"),
end = structure(c(16553, 16559, 16403, 16442), class = "Date"),
days = c(7, 94, 4, 5), count = c(1, 1, 1, 1), severity = c(1,
1, 1, 1), relation = c(3, 3, 3, 3)), .Names = c("ID", "SOC",
"description", "start", "end", "days", "count", "severity", "relation"
), row.names = c(NA, 4L), class = "data.frame")
What I would like to know is how many levels of the "ID" variable exist in each subset created when dividing the data set by the variable "SOC". I want to summarise this information together with some other variables in a new data set. Therefore, I would like to use the plyr package like so:
summaer2 <- ddply(aer, c("SOC"), summarise,
                  participants = nlevels(ID),
                  events = sum(count),
                  min_duration = min(days),
                  max_duration = max(days),
                  max_severity = max(severity))
This returns the following error:
Error in Summary.factor(c(4L, 5L, 11L, 11L, 14L, 14L), na.rm = FALSE) :
‘max’ not meaningful for factors
Could someone give me advice on how to reach my goal? Or what I'm doing wrong?
Many thanks in advance!
Update:
Substituting nlevels(ID) with length(unique(ID)) gives me the desired output. nlevels() counts every level defined on the factor, even levels that are absent from a given SOC subset, whereas length(unique(ID)) counts only the IDs that actually occur there:
> head(summaer2)
SOC participants events min_duration max_duration max_severity
1 100 4 7 1 62 2
2 410 9 16 1 41 2
3 431 2 2 109 132 1
4 500 5 9 23 125 2
5 600 8 19 1 35 1
6 1040 1 1 98 98 2
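For reference, a dplyr equivalent, just a sketch assuming the aer data frame from the question, that sidesteps the factor-level pitfall with n_distinct():
library(dplyr)
aer %>%
  group_by(SOC) %>%
  summarise(
    participants = n_distinct(ID),  # counts only the IDs present in each SOC group
    events       = sum(count),
    min_duration = min(days),
    max_duration = max(days),
    max_severity = max(severity)
  )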

Identifying Duplicate/Unique Teams (and Restructuring Data) in R

I have a data set that looks like this:
Person Team
1 30
2 30
3 30
4 30
11 40
22 40
1 50
2 50
3 50
4 50
15 60
16 60
17 60
1 70
2 70
3 70
4 70
11 80
22 80
My overall goal is to organize the team identification codes so that it is easy to see which teams are duplicates of one another and which teams are unique. I want to summarize the data so that it looks like this:
Team Duplicate1 Duplicate2
30 50 70
40 80
60
As you can see, teams 30, 50, and 70 have identical members, so they share a row. Similarly, teams 40 and 80 have identical members, so they share a row. Only team 60 (in this example) is unique.
In situations where teams are duplicated, I don't care which team id goes in which column. Also, there may be more than 2 duplicates of a team. Teams range in size from 2 members to 8 members.
This answer gives the output data format you asked for. I left the duplicate teams in a single variable because I think it's a better way to handle an arbitrary number of duplicates.
require(dplyr)
df %>%
  arrange(Team, Person) %>% # this line is necessary in case the rest of your data isn't sorted
  group_by(Team) %>%
  summarize(players = paste0(Person, collapse = ",")) %>%
  group_by(players) %>%
  summarize(teams = paste0(Team, collapse = ",")) %>%
  mutate(
    original_team = ifelse(grepl(",", teams), substr(teams, 1, gregexpr(",", teams)[[1]][1] - 1), teams),
    dup_teams = ifelse(grepl(",", teams), substr(teams, gregexpr(",", teams)[[1]][1] + 1, nchar(teams)), NA)
  )
The result:
Source: local data frame [3 x 4]
players teams original_team dup_teams
1 1,2,3,4 30,50,70 30 50,70
2 11,22 40,80 40 80
3 15,16,17 60 60 NA
Not exactly the format you're wanting, but pretty useful:
# using MrFlick's data
library(dplyr)
dd %>%
  group_by(Team) %>%
  arrange(Person) %>%
  summarize(team.char = paste(Person, collapse = "_")) %>%
  group_by(team.char) %>%
  arrange(team.char, Team) %>%
  mutate(duplicate = 1:n())
Source: local data frame [6 x 3]
Groups: team.char
Team team.char duplicate
1 40 11_22 1
2 80 11_22 2
3 60 15_16_17 1
4 30 1_2_3_4 1
5 50 1_2_3_4 2
6 70 1_2_3_4 3
(I edited in the arrange(Person) line in case the data isn't already sorted; got the idea from @Reed's answer.)
Using this for your sample data
dd<-structure(list(Person = c(1L, 2L, 3L, 4L, 11L, 22L, 1L, 2L, 3L,
4L, 15L, 16L, 17L, 1L, 2L, 3L, 4L, 11L, 22L), Team = c(30L, 30L,
30L, 30L, 40L, 40L, 50L, 50L, 50L, 50L, 60L, 60L, 60L, 70L, 70L,
70L, 70L, 80L, 80L)), .Names = c("Person", "Team"),
class = "data.frame", row.names = c(NA, -19L))
You could try table()/interaction() to find the duplicate groups. For example:
tt <- with(dd, table(Team, Person))
grp <- do.call("interaction", c(data.frame(unclass(tt)), drop=TRUE))
split(rownames(tt), grp)
this returns
$`1.1.1.1.0.0.0.0.0`
[1] "30" "50" "70"
$`0.0.0.0.0.1.1.1.0`
[1] "60"
$`0.0.0.0.1.0.0.0.1`
[1] "40" "80"
so the group "names" are really just indicators of membership for each person. You could easily rename them with setNames() if you like, but either way this collapses the appropriate teams together.
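For instance, a small sketch continuing from the tt and grp objects above, replacing the membership-pattern names with simple labels:
grp_list <- split(rownames(tt), grp)
# give the groups readable names instead of the 0/1 membership patterns
setNames(grp_list, paste0("group", seq_along(grp_list)))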
Two more base R options (though not exactly the desired output):
DF2 <- aggregate(Person ~ Team, DF, toString)
> split(DF2$Team, DF2$Person)
$`1, 2, 3, 4`
[1] 30 50 70
$`11, 22`
[1] 40 80
$`15, 16, 17`
[1] 60
Or
( DF2$DupeGroup <- as.integer(factor(DF2$Person)) )
Team Person DupeGroup
1 30 1, 2, 3, 4 1
2 40 11, 22 2
3 50 1, 2, 3, 4 1
4 60 15, 16, 17 3
5 70 1, 2, 3, 4 1
6 80 11, 22 2
Note that the expected output as shown in the question would require adding NAs or empty strings to some of the column entries, because in a data.frame all columns must have the same number of rows. That is different for lists, as you can see in some of the other answers.
The second option, but using data.table, since aggregate tends to be slow for large data:
library(data.table)
setDT(DF)[, toString(Person), by=Team][,DupeGroup := .GRP, by=V1][]
Team V1 DupeGroup
1: 30 1, 2, 3, 4 1
2: 40 11, 22 2
3: 50 1, 2, 3, 4 1
4: 60 15, 16, 17 3
5: 70 1, 2, 3, 4 1
6: 80 11, 22 2
Using uniquecombs from the mgcv package:
library(mgcv)
library(magrittr) # for the pipe %>%
# Using MrFlick's data
team_names <- sort(unique(dd$Team))
unique_teams <- with(dd, table(Team, Person)) %>% uniquecombs %>% attr("index")
printout <- unstack(data.frame(team_names, unique_teams))
> printout
$`1`
[1] 60
$`2`
[1] 40 80
$`3`
[1] 30 50 70
Now you could use something like this answer to print it in tabular form (note that the groups are column-wise, not row-wise as in your question):
attributes(printout) <- list(names = names(printout),
                             row.names = 1:max(sapply(printout, length)),
                             class = "data.frame")
> printout
1 2 3
1 60 40 30
2 <NA> 80 50
3 <NA> <NA> 70
Warning message:
In format.data.frame(x, digits = digits, na.encode = FALSE) :
corrupt data frame: columns will be truncated or padded with NAs
