Sum numbers within a string in R

I have a dataset like this simple example:
df <- data.frame(ID = c("A","B","C","D"),
                 Score = c("15","16/18/19+2/6","3/+2","19/18/14"))
I want to end up with a dataset where the Score values are split on "/". My problem is the "/+2" part: "3/+2" actually means "3/3+2", which should finally give "3/5". So what I would like help with is ending up with a dataset like this:
ID Score
A 15
B 16/18/19/21/6
C 3/5
D 19/18/14
I already found out that I can separate the score with
df <- df %>%
  mutate(Score = strsplit(as.character(Score), "/")) %>%
  unnest(Score)
But I don't know how to duplicate the preceding number and then sum when "/+" occurs. Could someone help me?

It could probably be solved in a more elegant way, but here is one possibility:
df %>%
  mutate(Score = strsplit(as.character(Score), "/")) %>%
  unnest() %>%
  rowwise() %>%
  mutate(Score = eval(parse(text = paste0(Score)))) %>%
  group_by(ID) %>%
  mutate(Score = paste0(Score, collapse = "/")) %>%
  distinct()
ID Score
<fct> <chr>
1 A 15
2 B 16/18/21/6
3 C 3/5
4 D 19/18/14
Sample data:
df <- data.frame(ID = c("A","B","C","D"),
                 Score = c("15","16/18/19+2/6","3/3+2","19/18/14"))
It splits "Score" on "/", converts each piece from character to an expression with parse(), evaluates it, and then pastes the results back together per ID.
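For readers unfamiliar with the parse()/eval() trick, a minimal base R illustration of the step described above:
eval(parse(text = "19+2"))
# [1] 21
parse() turns the character string into an unevaluated expression and eval() computes it, which is what collapses "19+2" into 21 within each row.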
Using the data you provided and the pattern from @A. Suliman:
df %>%
  mutate(Score = strsplit(gsub("(\\d+)/*\\+(\\d+)", "\\1/\\1+\\2", Score), "/")) %>%
  unnest() %>%
  rowwise() %>%
  mutate(Score = eval(parse(text = paste0(Score)))) %>%
  group_by(ID) %>%
  mutate(Score = paste0(Score, collapse = "/")) %>%
  distinct()
ID Score
<fct> <chr>
1 A 15
2 B 16/18/19/21/6
3 C 3/5
4 D 19/18/14

We can use gsubfn to do this in a compact way:
library(gsubfn)
library(tidyverse)
df %>%
  mutate(Score = gsubfn("\\d+\\+\\d+", ~ eval(parse(text = x)), Score))
# ID Score
#1 A 15
#2 B 16/18/21/6
#3 C 3/5
#4 D 19/18/14
data
df <- data.frame(ID = c("A","B","C","D"),
                 Score = c("15","16/18/19+2/6","3/3+2","19/18/14"),
                 stringsAsFactors = FALSE)
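As a quick sanity check of the replacement formula on a single string (same pattern as above; the free variable x in the formula is the matched "digits+digits" substring):
library(gsubfn)
gsubfn("\\d+\\+\\d+", ~ eval(parse(text = x)), "3/3+2")
# [1] "3/5"
Only the matched "3+2" part is replaced by its evaluated sum; the rest of the string is left untouched.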

library(dplyr)
library(tidyr)  # for separate_rows, no need for unnest
df %>%
  rowwise() %>%
  mutate(Score_upd = paste0(sapply(unlist(strsplit(gsub('(\\d+)/*\\+(\\d+)', '\\1/\\1+\\2', Score), '/')),
                                   function(x) eval(parse(text = x))),
                            collapse = '/')) %>%
  separate_rows(Score_upd, sep = '/')
# short version
df %>%
  mutate(Score = gsub('(\\d+)/*\\+(\\d+)', '\\1/\\1+\\2', Score)) %>%
  separate_rows(Score, sep = '/') %>%
  rowwise() %>%
  mutate(Score = eval(parse(text = Score))) %>%
  group_by(ID) %>%
  summarise(Score = paste0(Score, collapse = '/'))
# A tibble: 4 x 2
ID Score
<fct> <chr>
1 A 15
2 B 16/18/19/21/6
3 C 3/5
4 D 19/18/14
The main idea is using gsub to separate "2+3" correctly, e.g.:
gsub('(\\d+)/*\\+(\\d+)', '\\1/\\1+\\2', '20/8/2+3')  # /* matches zero or more occurrences of "/", covering both 19+2 and 3/+2
[1] "20/8/2/2+3"
Then
valid_str <- gsub('(\\d+)/*\\+(\\d+)', '\\1/\\1+\\2', '20/8/2+3')
sapply(unlist(strsplit(valid_str, '/')), function(x) eval(parse(text = x)))
#  20    8    2  2+3
#  20    8    2    5
# OR
sapply(unlist(strsplit(valid_str, '/')), function(x) sum(as.numeric(unlist(strsplit(x, '\\+')))))
#  20    8    2  2+3
#  20    8    2    5
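If it helps to see both steps combined, here is a small base R sketch; the helper name fix_scores() is my own and not part of the answer above:
fix_scores <- function(s) {
  # expand "19+2" / "3/+2" into "19/19+2" / "3/3+2", then evaluate each piece
  expanded <- gsub('(\\d+)/*\\+(\\d+)', '\\1/\\1+\\2', s)
  sapply(strsplit(expanded, '/'), function(parts) {
    paste(sapply(parts, function(x) eval(parse(text = x))), collapse = '/')
  })
}
fix_scores(c("15", "16/18/19+2/6", "3/+2", "19/18/14"))
# [1] "15"            "16/18/19/21/6" "3/5"           "19/18/14"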

Related

Dplyr pipe groupby top_n does not get top_n in group

I'm trying to obtain the top 2 names, sorted alphabetically, per group. I would think that top_n() would select this after I perform a group_by. However, this does not seem to be the case. This code shows the problem.
df <- data.frame(Group = c(0, 0, 0, 1, 1, 1),
                 Name = c("a", "c", "b", "e", "d", "f"))
df <- df %>%
  arrange(Name, Group) %>%
  group_by(Group) %>%
  top_n(2)
df
# A tibble: 2 x 2
# Groups: Group [1]
Group Name
<dbl> <chr>
1 1 e
2 1 f
Expected output from the same code would be:
Group Name
1 0 a
2 0 b
3 1 d
4 1 e
Or something similar. Thanks.
top_n() selects the top n maximum values. You seem to need the top n minimum values, which you can get by passing a negative n. Additionally, you don't need to arrange the data when using top_n().
library(dplyr)
df %>% group_by(Group) %>% top_n(-2, Name)
# Group Name
# <dbl> <chr>
#1 0 a
#2 0 b
#3 1 e
#4 1 d
Another way is to arrange the data and select the first two rows in each group.
df %>% arrange(Group, Name) %>% group_by(Group) %>% slice(1:2)
We can use
library(dplyr)
df %>%
  arrange(Group, Name) %>%
  group_by(Group) %>%
  filter(row_number() < 3)
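A side note, not from the answers above: with dplyr 1.0.0 or later, slice_head() expresses "first n rows per group" directly:
library(dplyr)
df %>%
  arrange(Group, Name) %>%
  group_by(Group) %>%
  slice_head(n = 2)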

Filter data by group & preserve empty groups

I wonder how I can filter my data by group and preserve the groups that are empty.
Example:
year = c(1,2,3,1,2,3,1,2,3)
site = rep(c("a", "b", "d"), each = 3)
value = c(3,3,0,1,8,5,10,18,27)
df <- data.frame(year, site, value)
I want to subset the rows where the value is more than 5. For some groups this is never true, and filter() simply drops those groups.
How can I keep my empty groups and have NA instead? Ideally, I would like to use dplyr functions instead of base R.
My filtering approach, where .preserve does not preserve empty groups:
df %>%
  group_by(site) %>%
  filter(value > 5, .preserve = TRUE)
Expected output:
year site value
<dbl> <fct> <dbl>
1 NA a NA
2 2 b 8
3 1 d 10
4 2 d 18
5 3 d 27
With the addition of tidyr, you can do:
df %>%
  group_by(site) %>%
  filter(value > 5) %>%
  ungroup() %>%
  complete(site = df$site)
site year value
<fct> <dbl> <dbl>
1 a NA NA
2 b 2 8
3 d 1 10
4 d 2 18
5 d 3 27
Or if you want to keep it in dplyr:
df %>%
  group_by(site) %>%
  filter(value > 5) %>%
  bind_rows(df %>%
              group_by(site) %>%
              filter(all(value <= 5)) %>%
              summarise_all(~ NA))
Using the nesting functionality of tidyr and applying purrr::map:
df %>%
  group_by(site) %>%
  tidyr::nest() %>%
  mutate(data = purrr::map(data, . %>% filter(value > 5))) %>%
  tidyr::unnest(cols = c(data), keep_empty = TRUE)
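Yet another possibility, a sketch of my own rather than one of the answers above, is group_modify() (dplyr >= 0.8.1): filter within each group and return an all-NA row when nothing survives:
library(dplyr)
df %>%
  group_by(site) %>%
  group_modify(~ {
    res <- dplyr::filter(.x, value > 5)
    if (nrow(res) == 0) tibble::tibble(year = NA_real_, value = NA_real_) else res
  })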

Rename a dataframe column with text from within the column itself

Given a (simplified) dataframe with this format:
df <- data.frame(a = c(1,2,3,4),
                 b = c(4,3,2,1),
                 temp1 = c("-","-","-","foo: 3"),
                 temp2 = c("-","bar: 10","-","bar: 4"))
a b temp1 temp2
1 4 - -
2 3 - bar: 10
3 2 - -
4 1 foo: 3 bar: 4
I need to rename all temp columns with the name contained within the column itself. My end goal is to end up with this:
a b foo bar
1 4 - -
2 3 - 10
3 2 - -
4 1 3 4
The df column names and the data contained within them will be unknown; however, the columns that need changing will contain "temp", and the delimiter will always be a ":".
As such, I can easily remove the name from within the columns using dplyr like this:
df <- df %>%
  mutate_at(vars(contains("temp")), ~ (substr(., str_locate(., ":") + 1, str_length(.))))
but first I need to rename the columns based on some function that scans the column and returns the value(s) within it, i.e.
rename_at(vars(contains("temp")), ~(...some function.....))
As per the example given, there's no guarantee that specific rows will have data, so I can't simply grab the value from row 1.
Any ideas welcome.
Thanks in advance
One possibility involving dplyr and tidyr could be:
df %>%
  pivot_longer(names_to = "variables", values_to = "values", -c(a:b)) %>%
  mutate(values = replace(values, values == "-", NA_character_)) %>%
  separate(values, into = c("variables2", "values"), sep = ": ") %>%
  group_by(variables) %>%
  fill(variables2, .direction = "downup") %>%
  ungroup() %>%
  select(-variables) %>%
  pivot_wider(names_from = "variables2", values_from = "values")
a b foo bar
<dbl> <dbl> <chr> <chr>
1 1 4 <NA> <NA>
2 2 3 <NA> 10
3 3 2 <NA> <NA>
4 4 1 3 4
If you want to further replace the NAs with -:
df %>%
  pivot_longer(names_to = "variables", values_to = "values", -c(a:b)) %>%
  mutate(values = replace(values, values == "-", NA_character_)) %>%
  separate(values, into = c("variables2", "values"), sep = ": ") %>%
  group_by(variables) %>%
  fill(variables2, .direction = "downup") %>%
  ungroup() %>%
  select(-variables) %>%
  pivot_wider(names_from = "variables2", values_from = "values") %>%
  mutate_at(vars(-a, -b), ~ replace_na(., "-"))
a b foo bar
<dbl> <dbl> <chr> <chr>
1 1 4 - -
2 2 3 - 10
3 3 2 - -
4 4 1 3 4
This will do the job:
colnames(df)[which(grepl("temp", colnames(df)))] <-
  unique(unlist(sapply(df[, grepl("temp", colnames(df))],
                       function(x) gsub("[:].*", "", grep("\\w+", x, value = TRUE)))))
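For a more step-by-step version of the same idea, here is a base R sketch; the helper extract_label() is hypothetical and not part of the answer above:
# pull the text before ":" from the first labelled entry of a column
extract_label <- function(x) {
  labelled <- grep(":", x, value = TRUE)[1]  # e.g. "foo: 3"
  sub(":.*", "", labelled)                   # -> "foo"
}
temp_cols <- grepl("temp", names(df))
names(df)[temp_cols] <- vapply(df[temp_cols], extract_label, character(1))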

Find duplicate values and have references

My data is data.lab:
data.lab <- data.frame(Name = c("A","e","b","c","d"),
                       bp = c(12,12,11,12,11),
                       sugar = c(19,21,23,19,23))
I want to keep only the rows with duplicated values, together with a reference to which rows belong together.
Desired output:
lab.data <- data.frame(Name = c("A","b","c","d"),
                       bp = c(12,11,12,11),
                       sugar = c(19,23,19,23),
                       pair = c(1,1,2,2))
dub.data <- duplicated(data.lab) | duplicated(data.lab, fromLast = TRUE)
out.1 <- data.lab[dub.data, ]
This gives the duplicated rows, but I also need a column indicating which rows form a duplicate pair.
With dplyr, you can do:
data.lab %>%
  group_by(bp, sugar) %>%
  filter(n() == 2) %>%
  mutate(pair = seq_along(Name))
Name bp sugar pair
<fct> <dbl> <dbl> <int>
1 A 12 19 1
2 b 11 23 1
3 c 12 19 2
4 d 11 23 2
Or:
data.lab %>%
  group_by(bp, sugar) %>%
  filter(n() == 2) %>%
  mutate(pair = row_number())
Or if a group could contain more than two duplicated rows:
data.lab %>%
  group_by(bp, sugar) %>%
  filter(n() > 1) %>%
  mutate(pair = seq_along(Name))
Or:
data.lab %>%
  group_by(bp, sugar) %>%
  filter(n() > 1) %>%
  mutate(pair = row_number())
Or to group by all variables except "Name":
data.lab %>%
  group_by_at(vars(-matches("(Name)"))) %>%
  filter(n() > 1) %>%
  mutate(pair = seq_along(Name))
Or:
data.lab %>%
  group_by_at(vars(-matches("(Name)"))) %>%
  filter(n() > 1) %>%
  mutate(pair = row_number())
Continuing from your approach, we can use ave in base R:
dat1 <- data.lab[duplicated(data.lab[c("bp", "sugar")]) |
                   duplicated(data.lab[c("bp", "sugar")], fromLast = TRUE), ]
dat1$pair <- with(dat1, ave(Name, bp, sugar, FUN = seq_along))
dat1
# Name bp sugar pair
#1 A 12 19 1
#2 b 11 23 1
#3 c 12 19 2
#4 d 11 23 2

How to pass a variable name to dplyr's group_by()

I can calculate the rank of the values (val) in my dataframe df within the group name1 with the code:
res <- df %>% arrange(val) %>% group_by(name1) %>% mutate(RANK=row_number())
Instead of writing the column "name1" in the code, I want to pass it as a variable, e.g. crit1 = "name1". However, the code below does not work, since crit1 is assumed to be the column name instead of a variable name.
res <- df %>% arrange(val) %>% group_by(crit1) %>% mutate(RANK=row_number())
How can I pass crit1 in the code?
Thanks.
We can use group_by_
library(dplyr)
df %>%
  arrange(val) %>%
  group_by_(.dots = crit1) %>%
  mutate(RANK = row_number())
#Source: local data frame [10 x 4]
#Groups: name1, name2 [7]
# val name1 name2 RANK
# <dbl> <chr> <chr> <int>
#1 -0.848370044 b c 1
#2 -0.583627199 a a 1
#3 -0.545880758 a a 2
#4 -0.466495124 b b 1
#5 0.002311942 a c 1
#6 0.266021979 c a 1
#7 0.419623149 c b 1
#8 0.444585270 a c 2
#9 0.536585304 b a 1
#10 0.847460017 a c 3
Update
group_by_ is deprecated in recent versions (as of dplyr 0.8.1), so we can use group_by_at, which takes a vector of strings as input variables:
df %>%
  arrange(val) %>%
  group_by_at(crit1) %>%
  mutate(RANK = row_number())
Or another option is to convert to symbols (syms from rlang) and evaluate (!!!):
df %>%
  arrange(val) %>%
  group_by(!!! rlang::syms(crit1)) %>%
  mutate(RANK = row_number())
data
set.seed(24)
df <- data.frame(val = rnorm(10),
                 name1 = sample(letters[1:3], 10, replace = TRUE),
                 name2 = sample(letters[1:3], 10, replace = TRUE),
                 stringsAsFactors = FALSE)
crit1 <- c("name1", "name2")
Update with dplyr 1.0.0
The new across() syntax eliminates the need for !!! rlang::syms(), so you can now simplify the code to:
df %>%
  arrange(val) %>%
  group_by(across(all_of(crit1))) %>%
  mutate(RANK = row_number())
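A side note of my own: the same pattern drops cleanly into a helper function, with the column names still passed as strings (the function name rank_within is hypothetical):
library(dplyr)
rank_within <- function(data, crit) {
  data %>%
    arrange(val) %>%
    group_by(across(all_of(crit))) %>%
    mutate(RANK = row_number())
}
rank_within(df, c("name1", "name2"))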
Facing a similar task, I was able to make it work with these two options.
Use across():
for (crit in names(df)) {
  print(df |>
          # all_of() is not needed here
          group_by(across(crit)) |>
          count())
}
Use syms() and !!:
crits <- syms(names(df))
for (crit in crits) {
  print(df |>
          # the use of !! instead of !!! is now encouraged
          group_by(!!crit) |>
          count())
}
