I need help with regular expressions to do the following. I have a list of study subjects with names such as:
subject <- c('x-010', 'x-011', 'x-012', 'x-013', 'x-014', 'x-015', 'x-016', 'x-017', 'x-018', 'x-019', 'x-020', 'x-021', 'x-022', 'x-023', 'x-024', 'x-025', 'x-026', 'x-027', 'x-028', 'x-029', 'x-030')
df <- data.frame(subject)
I want to add a column that classifies the subjects into groups according to their number, such that 1-10 are in Group A, 11-20 in Group B, 21-30 in Group C, and so on. I don't know how to do this using regular expressions; I only know how to start with:
df <- data.frame(subject) %>%
mutate(case_when(group = str_detect(subject,
but need to understand how to describe this pattern.
We can extract the numeric part with str_extract and create the group with integer division (%/%):
library(tidyverse)
df %>%
group_by(grp = paste0("Group ", LETTERS[(as.numeric(str_extract(subject,
"[0-9]+"))-1) %/% 10 + 1]))
# A tibble: 21 x 2
# Groups: grp [3]
# subject grp
# <fct> <chr>
# 1 x-010 Group A
# 2 x-011 Group B
# 3 x-012 Group B
# 4 x-013 Group B
# 5 x-014 Group B
# 6 x-015 Group B
# 7 x-016 Group B
# 8 x-017 Group B
# 9 x-018 Group B
#10 x-019 Group B
# ... with 11 more rows
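If you want the group as a regular column rather than a grouping variable (closer to the mutate/case_when attempt in the question), the same expression can go inside mutate. A minimal sketch, assuming the tidyverse is loaded and df is defined as above:

library(tidyverse)
# Same idea, but adding the group label as an ordinary column with mutate()
df %>%
  mutate(grp = paste0("Group ",
                      LETTERS[(as.numeric(str_extract(subject, "[0-9]+")) - 1) %/% 10 + 1]))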
I am struggling to count the number of unique combinations in my data. I would like to first group the rows by id and then count how many times each combination of values occurs. Here, it does not matter whether the elements are combined as d-f or f-d; they belong to the same category because they contain the same elements:
combinations:  n
c-f:           2   # also f-c
c-d-f:         1   # also c-f-d, f-d-c, etc.
d-f:           2   # also f-d; the dash is only for visualization purposes
Dummy example:
# my data
dd <- data.frame(id = c(1,1,2,2,2,3,3,4, 4, 5,5),
cat = c('c','f','c','d','f','c','f', 'd', 'f', 'f', 'd'))
> dd
id cat
1 1 c
2 1 f
3 2 c
4 2 d
5 2 f
6 3 c
7 3 f
8 4 d
9 4 f
10 5 f
11 5 d
Using paste is a great solution provided by @benson23, but it treats f-d and d-f as distinct categories. I would like the order not to matter. Thank you!
Create a "combination" column in summarize, then count that column afterwards. An easy way to get a consistent label is to order the rows at the beginning; the elements of each combination will then always appear in the same order.
library(dplyr)
dd %>%
group_by(id) %>%
arrange(id, cat) %>%
summarize(combination = paste0(cat, collapse = "-"), .groups = "drop") %>%
count(combination)
# A tibble: 3 x 2
combination n
<chr> <int>
1 c-d-f 1
2 c-f 2
3 d-f 2
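As a small variant (my own sketch, not from the original answer), you can sort the categories inside summarize() instead of arranging the rows first; sort(cat) puts every combination into a canonical order per id:

library(dplyr)
# Sketch: sorting inside summarize() makes d-f and f-d collapse to the same label
dd %>%
  group_by(id) %>%
  summarize(combination = paste(sort(cat), collapse = "-"), .groups = "drop") %>%
  count(combination)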
I need to find common values between different groups, ideally using dplyr in R.
From my dataset here:
group val
<fct> <dbl>
1 a 1
2 a 2
3 a 3
4 b 3
5 b 4
6 b 5
7 c 1
8 c 3
The expected output is:
group val
<fct> <dbl>
1 a 3
2 b 3
3 c 3
as only number 3 occurs in all groups.
This code does not work:
# Filter the data
dd %>%
group_by(group) %>%
filter(all(val)) # does not work
The example here solves a similar issue but uses a predefined vector of shared values. What if I do not know which values are shared?
Dummy example:
# Reproducible example: filter all id by group
group = c("a", "a", "a",
"b", "b", "b",
"c", "c")
val = c(1,2,3,
3,4,5,
1,3)
dd <- data.frame(group,
val)
group_by isolates each group, so we can't very well group_by(group) and compare between groups. Instead, we can group_by(val) and see which values occur in all the groups:
dd %>%
group_by(val) %>%
filter(n_distinct(group) == n_distinct(dd$group))
# # A tibble: 3 x 2
# # Groups: val [1]
# group val
# <chr> <dbl>
# 1 a 3
# 2 b 3
# 3 c 3
This is one of the rare cases where we want to use data$column in a dplyr verb: n_distinct(dd$group) refers explicitly to the ungrouped original data to get the total number of groups. (It could also be pre-computed.) In contrast, n_distinct(group) uses the grouped data piped into filter, so it gives the number of distinct groups for each value (because we group_by(val)).
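For completeness, here is a minimal sketch of the pre-computed variant mentioned above; the helper name n_groups is my own:

library(dplyr)
# Compute the total number of groups once, outside the pipe, then reuse it
n_groups <- n_distinct(dd$group)
dd %>%
  group_by(val) %>%
  filter(n_distinct(group) == n_groups)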
A base R approach can be:
#Code
newd <- dd[dd$val %in% Reduce(intersect, split(dd$val, dd$group)),]
Output:
group val
3 a 3
4 b 3
8 c 3
A similar option in data.table to @GregorThomas's solution is:
library(data.table)
# For each val, keep the row indices (.I) when the value appears in every group (uniqueN); $V1 extracts them for subsetting
setDT(dd)[dd[, .I[uniqueN(group) == uniqueN(dd$group)], val]$V1]
I have some code written using the dplyr package. I want to calculate the mode. Currently I get results back with a column that says "Character" all the way down. The mode should be the most frequently occurring value, which in my case could be a letter, a number, or a symbol.
eth.data<-data.comb %>%
group_by(Ethnicity, `Qualification Title`, `Qualification Number`, `OutGrade`)%>%
summarise(`Number of Learners`=n(), `Mode` = mode(`OutGrade`)) %>%
group_by(`Qualification Number`)%>%
mutate(`Total Number of Learners`= sum(`Number of Learners`)) %>%
arrange(`Total Number of Learners`)
Take a look at ?mode. mode tells you the storage mode of an object (e.g. "character" for character vectors). If you want the statistical mode, write your own function; see this question.
Also, if you group_by OutGrade, then you will have precisely one unique OutGrade within each group in summarise, so don't do that.
Let us set up an example (which you should do when you are asking a question!).
df <- data.frame(group = rep(LETTERS[1:5], each = 20),
                 grade = sample(letters[1:15], 100, replace = TRUE))
mymode <- function(x) {
  t <- table(x)            # frequency of each value
  names(t)[which.max(t)]   # first value with the highest count (ties go to the first)
}
df %>% group_by(group) %>% summarise(mode=mymode(grade))
The result is what you want (your exact letters will differ from run to run, since sample() was not seeded):
# A tibble: 5 x 2
group mode
<chr> <chr>
1 A l
2 B f
3 C g
4 D g
5 E c
Note that if you did group_by(group, grade), the summarise function would be called for each combination of group and grade, so the results would have been very different:
# A tibble: 55 x 3
# Groups: group [5]
group grade mode
<chr> <chr> <chr>
1 A a a
2 A b b
3 A f f
4 A h h
5 A i i
6 A k k
7 A l l
8 A m m
9 A n n
10 B a a
# … with 45 more rows
I am working with a data frame corresponding to the example below:
set.seed(1)
dta <- data.frame("CatA" = rep(c("A","B","C"), 4), "CatNum" = rep(1:2,6),
"SomeVal" = runif(12))
I would like to quickly build a data frame containing sum values for all combinations of the categories derived from CatA and CatNum, as well as for the categories derived from each column separately. For the primitive example above, and for the first couple of combinations, this can be achieved with simple code:
df_sums <- data.frame(
"Category" = c("Total for A",
"Total for A and 1",
"Total for A and 2"),
"Sum" = c(sum(dta$SomeVal[dta$CatA == 'A']),
sum(dta$SomeVal[dta$CatA == 'A' & dta$CatNum == 1]),
sum(dta$SomeVal[dta$CatA == 'A' & dta$CatNum == 2]))
)
This produces an informative data frame of sums:
Category Sum
1 Total for A 2.1801780
2 Total for A and 1 1.2101839
3 Total for A and 2 0.9699941
This solution would be grossly inefficient when applied to a data frame with multiple categories. I would like to achieve the following:
Cycle through all the categories, including categories derived from each column separately as well as from both columns at the same time
Achieve some flexibility with respect to how the function is applied; for instance, I may want to apply mean instead of sum
Save the "Total for" string as a separate object that I could easily edit when applying a function other than sum.
I was initially thinking of using dplyr, on the lines:
require(dplyr)
df_sums_experiment <- dta %>%
group_by(CatA, CatNum) %>%
summarise(TotVal = sum(SomeVal))
But it's not clear to me how I could apply multiple groupings simultaneously. As stated, I'm interested in grouping by each column separately and by the combination of both columns. I would also like to create a string column that would indicate what is combined and in what order.
You could use tidyr to unite the columns and gather the data. Then use dplyr to summarise:
library(dplyr)
library(tidyr)
dta %>% unite(measurevar, CatA, CatNum, remove=FALSE) %>%
gather(key, val, -SomeVal) %>%
group_by(val) %>%
summarise(sum(SomeVal))
val sum(SomeVal)
(chr) (dbl)
1 1 2.8198078
2 2 3.0778622
3 A 2.1801780
4 A_1 1.2101839
5 A_2 0.9699941
6 B 1.4405782
7 B_1 0.4076565
8 B_2 1.0329217
9 C 2.2769138
10 C_1 1.2019674
11 C_2 1.0749464
Just loop over the column combinations, compute the quantities you want and then rbind them together:
library(data.table)
dt = as.data.table(dta) # or setDT to convert in place
cols = c('CatA', 'CatNum')
# combn() over c(cols, "") yields the grouping sets: both columns, CatA alone, CatNum alone (the "" placeholder is dropped by i != "")
rbindlist(apply(combn(c(cols, ""), length(cols)), 2,
                function(i) dt[, sum(SomeVal), by = c(i[i != ""])]), fill = TRUE)
# CatA CatNum V1
# 1: A 1 1.2101839
# 2: B 2 1.0329217
# 3: C 1 1.2019674
# 4: A 2 0.9699941
# 5: B 1 0.4076565
# 6: C 2 1.0749464
# 7: A NA 2.1801780
# 8: B NA 1.4405782
# 9: C NA 2.2769138
#10: NA 1 2.8198078
#11: NA 2 3.0778622
Split, then use lapply and rbind:
# result
res <- do.call(rbind,
               lapply(c(split(dta, dta$CatA),
                        split(dta, dta$CatNum),
                        split(dta, dta[, 1:2])),
                      function(i) sum(i[, "SomeVal"])))
# prettify the result
res1 <- data.frame(Category = paste0("Total for ", rownames(res)),
                   Sum = res[, 1])
res1$Category <- sub(".", " and ", res1$Category, fixed = TRUE)
row.names(res1) <- seq_along(row.names(res1))
res1
# Category Sum
# 1 Total for A 2.1801780
# 2 Total for B 1.4405782
# 3 Total for C 2.2769138
# 4 Total for 1 2.8198078
# 5 Total for 2 3.0778622
# 6 Total for A and 1 1.2101839
# 7 Total for B and 1 0.4076565
# 8 Total for C and 1 1.2019674
# 9 Total for A and 2 0.9699941
# 10 Total for B and 2 1.0329217
# 11 Total for C and 2 1.0749464
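To address the flexibility points in the question (swapping sum for mean and keeping the "Total for" label editable), here is a minimal sketch built on the split/lapply idea above. The function name summarise_all_groupings and the arguments FUN and label are my own, and the grouping columns CatA/CatNum are hard-coded for this example:

# Sketch: make the aggregating function and the label prefix arguments
summarise_all_groupings <- function(data, value_col, FUN = sum, label = "Total for ") {
  res <- do.call(rbind,
                 lapply(c(split(data, data$CatA),
                          split(data, data$CatNum),
                          split(data, data[, c("CatA", "CatNum")])),
                        function(i) FUN(i[[value_col]])))
  out <- data.frame(Category = paste0(label, rownames(res)), Value = res[, 1])
  out$Category <- sub(".", " and ", out$Category, fixed = TRUE)  # "A.1" -> "A and 1"
  row.names(out) <- NULL
  out
}

summarise_all_groupings(dta, "SomeVal", FUN = mean)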
I have the following data frame:
id <- c(1,1,1,1,1,2,2,2,2)
spent <- c(10,10,20,10,10,5,5,5,20)
period <- c("f","c","c","v","v","f","c","c","v")
mean.spent <- c(10,15,15,10,10,5,5,5,20)
df <- data.frame(id,spent,period,mean.spent)
What I want is to aggregate the mean spent for each id in each period, as follows:
id f c v
1 10 15 10
2 5 5 20
Can you please help me to do this?
Use xtabs() along with aggregate() as follows:
df <- data.frame(id = c(1,1,1,1,1,2,2,2,2),
spent = c(10,10,20,10,10,5,5,5,20),
period = c("f","c","c","v","v","f","c","c","v"),
mean.spent = c(10,15,15,10,10,5,5,5,20))
xtabs(spent ~ id + period, aggregate(spent ~ id + period, df, mean))
# period
# id c f v
# 1 15 10 10
# 2 5 5 20
aggregate calculates the mean per group (as grouped by "id" and "period"), and xtabs does the transformation into this wider format.
Here's how to make it into a data.frame:
temp1 <- xtabs(spent ~ id + period,
aggregate(spent ~ id + period, df, mean))
data.frame(id = dimnames(temp1)$id, as.data.frame.matrix(temp1))
# id c f v
# 1 1 15 10 10
# 2 2 5 5 20
Update: a more direct approach
I always forget about tapply, but this example is a good candidate for when it is convenient.
tapply(df$spent, list(df$id, df$period), mean)
# c f v
# 1 15 10 10
# 2 5 5 20
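Since the rest of this page leans on dplyr, one more sketch (not part of the original answers) does the same reshaping with dplyr and tidyr's pivot_wider:

library(dplyr)
library(tidyr)
# Mean per id/period, then spread period out into columns
df %>%
  group_by(id, period) %>%
  summarise(mean.spent = mean(spent), .groups = "drop") %>%
  pivot_wider(names_from = period, values_from = mean.spent)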