I have a df that looks like this:
group sequence link
90 1 11|S1
90 2 10|S1
90 3 12|10
91 1 9|10
91 2 13|9
93 1 15|20
...
How can I store the first and last value of the link variable in each group as a new variable?
Desired output is:
group sequence link Key
90 1 11|S1 11|S1, 12|10
90 2 10|S1 11|S1, 12|10
90 3 12|10 11|S1, 12|10
91 1 9|10 9|10, 13|9
91 2 13|9 9|10, 13|9
93 1 15|20
....
You could do:
library(dplyr)
df %>%
group_by(group) %>%
mutate(
Key = paste(link[1], link[n()], sep = ", ")
)
Though that wouldn't match your desired output. In your example data frame there is, e.g., group 93, which has only one value. The code above would give you 15|20 twice, as both the beginning and the end.
If you'd like to only display one value in such cases, you can do:
df %>%
group_by(group) %>%
mutate(
Key = case_when(
n() > 1 ~ paste(link[1], link[n()], sep = ", "),
TRUE ~ as.character(link)
)
)
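If the rows are not guaranteed to be sorted by sequence, link[1] and link[n()] pick whatever happens to come first and last in row order. A minimal sketch that orders by sequence explicitly, assuming the data frame from the question (rebuilt by hand here) and dplyr's first() and last() with their order_by argument:

```r
library(dplyr)

# Data copied by hand from the question
df <- data.frame(
  group    = c(90, 90, 90, 91, 91, 93),
  sequence = c(1, 2, 3, 1, 2, 1),
  link     = c("11|S1", "10|S1", "12|10", "9|10", "13|9", "15|20"),
  stringsAsFactors = FALSE
)

res <- df %>%
  group_by(group) %>%
  mutate(
    # order_by makes "first" and "last" follow `sequence`, not row order;
    # single-row groups keep their only link as the key
    Key = if (n() > 1) {
      paste(first(link, order_by = sequence),
            last(link, order_by = sequence), sep = ", ")
    } else {
      link
    }
  ) %>%
  ungroup()

res
```

A plain if/else works here because n() is a single value per group, so there is no need for case_when().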
I think you could use arrange() and slice() to find the first/last links in your data. My solution is lengthier than @arg0naut91's, but is perhaps more intuitive.
Create toy data frame...
library(dplyr)
df <- data.frame(group=rep(letters,3), # create toy data frame
sequence=rep(1:3,26),
link=sample(9:13,78,TRUE)) %>%
arrange(group,sequence) %>% # arrange data
group_by(group,link) %>% sample_n(1) %>% # remove any duplicate link values (to create uneven sequence var)
ungroup() %>% arrange(group,sequence) # arrange again to view
glimpse(df)
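Find the first and last links and add them as new columns to the data frame. Note that the code below sorts by link, so link.first and link.last are the smallest and largest link in each group; use arrange(group,sequence) instead if you want the first and last links in sequence order.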
df <- df %>% arrange(group,link) %>% group_by(group) %>%
slice(1) %>% mutate(link.first=link) %>% # find first link for each group
select(group,link.first) %>% left_join(df,.) # add to original data frame
df <- df %>% arrange(group,link) %>% group_by(group) %>%
slice(n()) %>% mutate(link.last=link) %>% # find last link for each group
select(group,link.last) %>% left_join(df,.) # add to original data frame
df %>% mutate(key=paste(link.first,link.last,sep=', ')) # paste links to form key
# A tibble: 62 x 6
group sequence link link.first link.last key
<fct> <int> <int> <int> <int> <chr>
1 a 1 10 10 12 10, 12
2 a 2 12 10 12 10, 12
3 b 2 9 9 11 9, 11
4 b 3 11 9 11 9, 11
5 c 1 13 9 13 9, 13
6 c 2 12 9 13 9, 13
7 c 3 9 9 13 9, 13
8 d 1 9 9 13 9, 13
9 d 3 13 9 13 9, 13
10 e 1 11 9 11 9, 11
Since I used sample() with replacement to generate the data, there may be some groups with only one row (i.e., the same first and last link values), which can be filtered out.
df %>% filter(link.first==link.last)
# A tibble: 2 x 5
group sequence link link.first link.last
<fct> <int> <int> <int> <int>
1 k 2 9 9 9
2 z 1 9 9 9
df %>% count(group) %>% filter(n==1)
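The two slice() plus left_join() passes above can also be folded into one summary table that is joined back once. This is only a sketch under the assumption that "first" and "last" mean first and last in sequence order, using the data from the original question rather than the toy data; the names ends, out and single are made up for illustration.

```r
library(dplyr)

# Data copied by hand from the question
df <- data.frame(
  group    = c(90, 90, 90, 91, 91, 93),
  sequence = c(1, 2, 3, 1, 2, 1),
  link     = c("11|S1", "10|S1", "12|10", "9|10", "13|9", "15|20"),
  stringsAsFactors = FALSE
)

# One row per group: first and last link in `sequence` order, plus the group size
ends <- df %>%
  arrange(group, sequence) %>%
  group_by(group) %>%
  summarise(link.first = first(link), link.last = last(link), n_rows = n())

# Join the summary back onto every row and build the key
out <- df %>%
  left_join(ends, by = "group") %>%
  mutate(key = paste(link.first, link.last, sep = ", "))

# Groups that consist of a single row
single <- ends %>% filter(n_rows == 1)
```

Keeping n_rows in the summary makes it easy to spot single-row groups without comparing link.first and link.last.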
I'm working with a "movie" dataset. I have a movie "title" column (col no 1) and an "overall_score" column (col no 13).
Apparently multiple movies have scored 10, so when I make the top 10, it only shows me movies with a score of 10.
But I want each score (10, 9, 8 and so on down to 1) to appear only 3 times. I tried using the slice function but wasn't successful. What do you think I'm doing wrong?
Here's my code:
movie2 <- movie_reviews %>%
arrange(desc(Overall)) %>%
group_by(uid, title) %>%
head(10) %>% slice(13:3)
If you don't care about which movies are within the score subgroups, then you could just use row_number to assign a unique number per Overall group.
library(dplyr)
set.seed(1)
movie_reviews <- data.frame(
uid = 1:100,
title = paste("title", 1:100),
Overall = sample(1:10, 100, replace = TRUE)
)
movie2 <- movie_reviews %>%
group_by(Overall) %>%
mutate(rn = row_number()) %>%
ungroup() %>%
filter(rn < 4)%>%
select(-rn) %>%
arrange(Overall)
> movie2
# A tibble: 30 × 4
uid title Overall rn
<int> <chr> <int> <int>
1 4 title 4 1 3
2 9 title 9 1 2
3 64 title 64 1 1
4 23 title 23 2 1
5 82 title 82 2 2
6 87 title 87 2 3
7 8 title 8 3 3
8 57 title 57 3 2
9 80 title 80 3 1
10 27 title 27 4 1
# … with 20 more rows
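The row_number() plus filter() pair can also be written with slice_head(), which keeps the first n rows of each group. A sketch assuming dplyr 1.0.0 or later and the same simulated movie_reviews as above; the name top3 is made up, and the scores are sorted in descending order here because the question asks for a "top" list.

```r
library(dplyr)

set.seed(1)
movie_reviews <- data.frame(
  uid = 1:100,
  title = paste("title", 1:100),
  Overall = sample(1:10, 100, replace = TRUE)
)

top3 <- movie_reviews %>%
  group_by(Overall) %>%
  slice_head(n = 3) %>%     # keep at most three rows per score
  ungroup() %>%
  arrange(desc(Overall))    # highest scores first

top3
```

As with the row_number() version, which three movies survive within each score depends only on row order.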
Let's say I have a data frame. I would like to create new columns with mutate by subtracting pairs of existing columns. The pairing follows a rule. In the code below, the first operand of each subtraction always has the prefix base_g00 and the second always has the prefix allow_m00. The first operand's id runs from 27 to 43 and the second's from 20 to 36, i.e. the second id is the first id minus 7. Can I rewrite the following code with an apply function or a loop inside mutate to make it simpler? Thanks in advance for any suggestions!
pred_error<-y07_13%>%mutate(annual_util_1=base_g0027-allow_m0020,
annual_util_2=base_g0028-allow_m0021,
annual_util_3=base_g0029-allow_m0022,
annual_util_4=base_g0030-allow_m0023,
annual_util_5=base_g0031-allow_m0024,
annual_util_6=base_g0032-allow_m0025,
annual_util_7=base_g0033-allow_m0026,
annual_util_8=base_g0034-allow_m0027,
annual_util_9=base_g0035-allow_m0028,
annual_util_10=base_g0036-allow_m0029,
annual_util_11=base_g0037-allow_m0030,
annual_util_12=base_g0038-allow_m0031,
annual_util_13=base_g0039-allow_m0032,
annual_util_14=base_g0040-allow_m0033,
annual_util_15=base_g0041-allow_m0034,
annual_util_16=base_g0042-allow_m0035,
annual_util_17=base_g0043-allow_m0036)
I think a more idiomatic tidyverse approach would be to reshape your data so those column groups are encoded as a variable instead of as separate columns which have the same semantic meaning.
For instance,
library(dplyr); library(tidyr); library(stringr)
y07_13 <- tibble(allow_m0021 = 1:5,
allow_m0022 = 2:6,
allow_m0023 = 11:15,
base_g0028 = 5,
base_g0029 = 3:7,
base_g0030 = 100)
y07_13 %>%
mutate(row = row_number()) %>%
pivot_longer(-row) %>%
mutate(type = str_extract(name, "allow_m|base_g"),
num = str_remove(name, type) %>% as.numeric(),
group = num - if_else(type == "allow_m", 20, 27)) %>%
select(row, type, group, value) %>%
pivot_wider(names_from = type, values_from = value) %>%
mutate(annual_util = base_g - allow_m)
Result
# A tibble: 15 x 5
row group allow_m base_g annual_util
<int> <dbl> <dbl> <dbl> <dbl>
1 1 1 1 5 4
2 1 2 2 3 1
3 1 3 11 100 89
4 2 1 2 5 3
5 2 2 3 4 1
6 2 3 12 100 88
7 3 1 3 5 2
8 3 2 4 5 1
9 3 3 13 100 87
10 4 1 4 5 1
11 4 2 5 6 1
12 4 3 14 100 86
13 5 1 5 5 0
14 5 2 6 7 1
15 5 3 15 100 85
Here is a vectorised base R approach:
base_cols <- paste0("base_g00", 27:43)
allow_cols <- paste0("allow_m00", 20:36)
new_cols <- paste0("annual_util_", 1:17)
y07_13[new_cols] <- y07_13[base_cols] - y07_13[allow_cols]
y07_13
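To see the block subtraction run end to end, here is a small self-contained sketch. It assumes a toy data frame y with only three column pairs (the real data has 17) and builds the names with sprintf() so that the "second id is first id minus 7" rule is stated once; y and base_ids are hypothetical names.

```r
# Toy data with three column pairs; the real data has ids 27:43 and 20:36
y <- data.frame(
  base_g0027  = c(10, 20), base_g0028  = c(30, 40), base_g0029  = c(50, 60),
  allow_m0020 = c(1, 2),   allow_m0021 = c(3, 4),   allow_m0022 = c(5, 6)
)

base_ids   <- 27:29                                   # 27:43 in the question
base_cols  <- sprintf("base_g%04d", base_ids)
allow_cols <- sprintf("allow_m%04d", base_ids - 7L)   # second id = first id - 7
new_cols   <- paste0("annual_util_", seq_along(base_ids))

# Fail early if a generated name does not exist in the data
stopifnot(all(c(base_cols, allow_cols) %in% names(y)))

# Data frames are subtracted column by column, in the order given
y[new_cols] <- y[base_cols] - y[allow_cols]
y
```

The pairing is purely positional, so base_cols and allow_cols must be in matching order.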
An example data.frame:
library(tidyverse)
example <- data.frame(matrix(sample.int(15),5,3),
sample(c("A","B","C"),5,replace=TRUE) ) %>%
`colnames<-`( c("A","B","C","choose") ) %>% print()
Output:
A B C choose
1 9 12 4 A
2 7 8 13 C
3 5 1 2 A
4 15 3 11 C
5 14 6 10 B
The column "choose" indicates which value should be selected from the columns A,B,C
My humble solution for the column "result" :
cols <- c(A=1,B=2,C=3)
col_index <- cols[example$choose]
xy <- cbind(1:nrow(example),col_index)
example %>% mutate(result = example[xy])
Output:
A B C choose result
1 9 12 4 A 9
2 7 8 13 C 13
3 5 1 2 A 5
4 15 3 11 C 11
5 14 6 10 B 6
I'm sure there is a more elegant solution with dplyr,
but my attempts with "rowwise" or "across" failed.
Is it possible to get a one-line solution here?
The efficient option is to make use of row/column indexing
example$result <- example[1:3][cbind(seq_len(nrow(example)),
match(example$choose, names(example)))]
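To make the row/column indexing concrete: a two-column matrix used as an index picks one cell per row of that matrix, the first column being the row and the second the column. A sketch using the values printed in the question (typed in by hand, since the original data were random); value_cols and idx are names made up for illustration.

```r
# Values copied from the printed example in the question
example <- data.frame(
  A = c(9, 7, 5, 15, 14),
  B = c(12, 8, 1, 3, 6),
  C = c(4, 13, 2, 11, 10),
  choose = c("A", "C", "A", "C", "B"),
  stringsAsFactors = FALSE
)

value_cols <- c("A", "B", "C")

# One (row, column) pair per row of `example`
idx <- cbind(
  row = seq_len(nrow(example)),
  col = match(example$choose, value_cols)
)

# Converting to a matrix first keeps the lookup purely numeric
example$result <- as.matrix(example[value_cols])[idx]
example$result
#> [1]  9 13  5 11  6
```

Matching against value_cols rather than names(example) avoids relying on A, B and C being the first three columns.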
With dplyr, we may use get with rowwise
library(dplyr)
example %>%
rowwise %>%
mutate(result = get(choose)) %>%
ungroup
Or, instead of get, use cur_data() (deprecated as of dplyr 1.1.0 in favor of pick())
example %>%
rowwise %>%
mutate(result = cur_data()[[choose]]) %>%
ungroup
Or the vectorized option with row/column indexing
example %>%
mutate(result = select(., where(is.numeric))[cbind(row_number(),
match(choose, names(example)))])
Here is an alternative way:
library(dplyr)
library(tidyr)
example %>%
pivot_longer(-choose) %>%
filter(choose == name) %>%
select(result=value) %>%
bind_cols(example)
result A B C choose
<int> <int> <int> <int> <chr>
1 9 6 9 1 B
2 14 5 2 14 C
3 7 8 7 3 B
4 15 15 4 12 A
5 11 13 10 11 C
I have a number of large data frames which have the occasional string value, and I would like to know what the unique string values are (ignoring the numeric values) and, if possible, count these strings.
df <- data.frame(1:16)
df$A <- c("Name",0,0,0,0,0,12,12,0,14,NA_real_,14,NA_real_,NA_real_,16,16)
df$B <- c(10,0,"test",0,12,12,12,12,0,14,NA_real_,14,16,16,16,16)
df$C <- c(10,12,14,16,10,12,14,16,10,12,14,16,10,12,14,16)
X1.16 A B C
1 1 Name 10 10
2 2 0 0 12
3 3 0 test 14
4 4 0 0 16
5 5 0 12 10
6 6 0 12 12
7 7 12 12 14
8 8 12 12 16
9 9 0 0 10
10 10 14 14 12
11 11 <NA> <NA> 14
12 12 14 14 16
13 13 <NA> 16 10
14 14 <NA> 16 12
15 15 16 16 14
16 16 16 16 16
I know I can use the count function in dplyr, but I have too many unique numeric values, so this is not a great solution. With the code below I was able to filter my data to retain only the rows that contain an alphabetical character (although this isn't a solution either).
df %>% filter_all(any_vars(str_detect(., pattern = "[:alpha:]")))
X1.16 A B C
1 1 Name 10 10
2 3 0 test 14
My desired output would be something to the effect of:
Variable n
"Name" 1
"test" 1
You can get the string values with grep and count them using table:
stack(table(grep('[[:alpha:]]', unlist(df), value = TRUE)))[2:1]
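The one-liner is easier to follow when split into steps. A sketch on the data frame from the question, using as.data.frame() on the table instead of stack() so the columns can be named directly; the names vals, strings and counts are made up for illustration.

```r
# Data frame from the question
df <- data.frame(1:16)
df$A <- c("Name", 0, 0, 0, 0, 0, 12, 12, 0, 14, NA, 14, NA, NA, 16, 16)
df$B <- c(10, 0, "test", 0, 12, 12, 12, 12, 0, 14, NA, 14, 16, 16, 16, 16)
df$C <- c(10, 12, 14, 16, 10, 12, 14, 16, 10, 12, 14, 16, 10, 12, 14, 16)

vals    <- unlist(df)                                # every cell, as one character vector
strings <- grep("[[:alpha:]]", vals, value = TRUE)   # cells containing a letter; NA never matches
counts  <- as.data.frame(table(Variable = strings),
                         responseName = "n",
                         stringsAsFactors = FALSE)
counts
```

Because unlist() flattens all columns together, the counts are over the whole data frame, not per column.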
If you want a tidyverse answer you can get the data in long format, keep only the rows with characters in it and count them.
library(dplyr)
df %>%
mutate(across(everything(), as.character)) %>%
tidyr::pivot_longer(cols = everything()) %>%
filter(grepl('[[:alpha:]]', value)) %>%
count(value)
# value n
# <chr> <int>
#1 Name 1
#2 test 1
@Ronak and @akrun above beat me to the punch. My solution is very similar, with an extension if you want a count within columns.
library(tidyverse)
# Coerce to tibble for ease of reading
df <- df %>%
as_tibble() %>%
mutate(across(everything(), as.character))
df %>%
pivot_longer(cols = everything()) %>%
summarise(Variable = str_subset(value, "[:alpha:]")) %>%
count(Variable, sort = TRUE)
# A tibble: 2 x 2
Variable n
<chr> <int>
1 Name 1
2 test 1
# str_subset is a convenient wrapper around filter & str_detect
Add some extra words to test
# Test on extra word counts - replace 12 and 14 with words
df2 <- df
df2[df2 == 12] <- 'Name'
df2[df2 == 14] <- 'test'
df2
df2 %>%
pivot_longer(cols = everything()) %>%
summarise(Variable = str_subset(value, "[:alpha:]")) %>%
count(Variable, sort = TRUE)
# A tibble: 2 x 2
Variable n
<chr> <int>
1 Name 12
2 test 10
If you want counts by column
df2 %>%
select(-1) %>%
pivot_longer(everything(), names_to = 'col') %>%
group_by(col) %>%
summarise(Variable = str_subset(value, "[:alpha:]")) %>%
count(col, Variable)
# A tibble: 6 x 3
# Groups: col [3]
col Variable n
<chr> <chr> <int>
1 A Name 3
2 A test 2
3 B Name 4
4 B test 3
5 C Name 4
6 C test 4
We can use filter with across
library(dplyr)
library(tidyr)
library(stringr)
library(purrr)
df %>%
select(-1) %>%
mutate(across(everything(), as.character)) %>%
filter(across(everything(), ~ str_detect(., '[:alpha:]')) %>% reduce(`|`)) %>%
pivot_longer(everything()) %>%
filter(str_detect(value, '[:alpha:]')) %>%
count(value)
# A tibble: 2 x 2
# value n
# <chr> <int>
#1 Name 1
#2 test 1
library(tidyverse)
df <- tibble(a = as.factor(1:20), b = c(50, 20, 13, rep(2, 10), rep(1, 7)))
How do I make dplyr look at this data frame df and collapse all these occurrences of 2 into a single summed group, and collapse all the occurrences of 1 into a single summed group, while keeping the rest of the data frame?
Turn this:
# A tibble: 20 x 2
a b
<fct> <dbl>
1 1 50
2 2 20
3 3 13
4 4 2
5 5 2
6 6 2
7 7 2
8 8 2
9 9 2
10 10 2
11 11 2
12 12 2
13 13 2
14 14 1
15 15 1
16 16 1
17 17 1
18 18 1
19 19 1
20 20 1
into this:
# A tibble: 5 x 2
a b
<fct> <dbl>
1 1 50
2 2 20
3 3 13
4 grp2 20
5 grp1 7
[Edit] - I fixed the example data. Sorry about that.
We group by a manufactured sortkey to maintain sort order. We used the fact that b is in descending order in the input, but if that is not the case in your actual data then replace sortkey = -b with the more general sortkey = data.table::rleid(b) or the longer sortkey = cumsum(coalesce(b != lag(b), FALSE)).
We also convert b to the group names giving a new a. It wasn't clear which groups are to be converted to grp... form. Hard-coded 1 and 2? Any group with more than one row? Groups at the end with more than one row? At any rate it would be easy enough to change the condition in the if_else once that were clarified.
Finally perform the summation and then remove the sortkey.
df %>%
group_by(sortkey = -b, a = paste0(if_else(b %in% 1:2, "grp", ""), b)) %>%
summarize(b = sum(b)) %>%
ungroup %>%
select(-sortkey)
giving:
# A tibble: 5 x 2
a b
<chr> <int>
1 50 50
2 20 20
3 13 13
4 grp2 20
5 grp1 7
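For input that is not sorted, the cumsum() form of the sortkey mentioned above can be sketched as follows. The example assumes that any run of more than one consecutive equal value should be collapsed (one of the possible readings listed above), and it uses a made-up tibble in which b is deliberately out of order.

```r
library(dplyr)

# Hypothetical input: b is not in descending order
df <- tibble(a = as.character(1:8), b = c(13, 2, 2, 2, 50, 1, 1, 20))

res <- df %>%
  # Run id: increases by one every time b differs from the previous row
  mutate(sortkey = cumsum(coalesce(b != lag(b), FALSE))) %>%
  group_by(sortkey) %>%
  summarise(
    # Runs longer than one row get a "grp" label, single rows keep their a
    a = if (n() > 1) paste0("grp", first(b)) else first(a),
    b = sum(b)
  ) %>%
  select(-sortkey)

res
```

Since the key identifies consecutive runs, the same value appearing in two separate runs yields two separate groups with the same label.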
Here's a way. I have converted a from factor to character to make things easier. You can convert it back to factor if you want. Also your test data was a bit wrong.
df <- tibble(a = as.character(1:20), b = c(50, 20, 13, rep(2, 10), rep(1, 7)))
df %>%
mutate(
a = case_when(
b == 1 ~ "grp1",
b == 2 ~ "grp2",
TRUE ~ a
)
) %>%
group_by(a) %>%
summarise(b = sum(b))
# A tibble: 5 x 2
a b
<chr> <dbl>
1 1 50
2 2 20
3 3 13
4 grp1 7
5 grp2 20
This is an approach which gives you the desired names for groups & where you don't need to think in advance how many cases like that you would need (e.g. it would create grp3, grp4, ... depending on the number in b).
library(dplyr)
df %>%
mutate(
grp = as.numeric(lag(df$b) != df$b),
grp = cumsum(ifelse(is.na(grp), 0, grp))
) %>% group_by(grp) %>%
mutate(
a = ifelse(n() > 1, paste0("grp", b), a),
b = sum(b)
) %>% ungroup() %>% distinct(a, b)
Output:
a b
<chr> <dbl>
1 1 50
2 2 20
3 3 13
4 grp2 20
5 grp1 7
Note that the code could be also condensed but that leads to a certain lack of readability in my opinion:
df %>%
group_by(grp = cumsum(ifelse(is.na(as.numeric(lag(df$b) != df$b)), 0, as.numeric(lag(df$b) != df$b)))) %>%
mutate(
a = ifelse(n() > 1, paste0("grp", b), a),
b = sum(b)
) %>% ungroup() %>% distinct(a, b)
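The same run-based idea can be written in base R with rle(), which returns the length and value of each consecutive run directly. This is only a sketch on the example data from the question, with a already converted to character; runs, starts and out are names made up for illustration.

```r
# Example data from the question, with `a` as character
a <- as.character(1:20)
b <- c(50, 20, 13, rep(2, 10), rep(1, 7))

runs   <- rle(b)                                  # lengths and values of consecutive runs
starts <- cumsum(c(1, head(runs$lengths, -1)))    # index of the first row of each run

out <- data.frame(
  # Runs longer than one row become "grp<value>"; single rows keep their a
  a = ifelse(runs$lengths > 1, paste0("grp", runs$values), a[starts]),
  # Every value in a run is identical, so the run sum is value * length
  b = runs$values * runs$lengths,
  stringsAsFactors = FALSE
)
out
```

Like the lag()-based version, this creates as many grp labels as there are multi-row runs, without listing the values in advance.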