I have a grouped tibble with several columns. I want to add a new column that holds the same value for every row within a group but a different value for each group, essentially giving the groups names. These per-group values are supplied from a vector.
Ideally I want to do this in a generic way, so that it works inside a function regardless of how many groups the input has.
Any help would be much appreciated. Here is a very basic, reduced example of the tibble and the vector (the original tibble has character, int, and dbl columns):
df <- tibble(a = c(1,2,3,1,3,2)) %>% group_by(a)
names <- c("owl", "newt", "zag")
desired_output <- tibble(a = c(1, 2, 3, 1, 3, 2),
                         name = c("owl", "newt", "zag", "owl", "zag", "newt"))
As the output, I would like the same tibble with one additional column, where every row in group 1 = owl, 2 = newt, and 3 = zag.
Just take a as indices:
library(dplyr)
df %>%
  mutate(name = names[a])
# # A tibble: 6 × 2
# a name
# <dbl> <chr>
# 1 1 owl
# 2 2 newt
# 3 3 zag
# 4 1 owl
# 5 3 zag
# 6 2 newt
You can also use recode() if a cannot be used as indices.
df %>%
  mutate(name = recode(a, !!!setNames(names, 1:3)))
Data
df <- tibble(a = c(1,2,3,1,3,2))
names <- c("owl", "newt", "zag")
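For the generic case asked about in the question, one option is dplyr's cur_group_id(), which returns the index of the current group inside mutate(). The sketch below assumes dplyr 1.0.0 or later and that the labels are supplied in the order of the sorted group keys; the helper name add_group_names and the argument names are made up for illustration.

```r
library(dplyr)

# Hypothetical helper: attach one label per group to a grouped tibble.
# Assumes group_labels is ordered like the sorted group keys.
add_group_names <- function(grouped_df, group_labels) {
  # Fail early if the number of labels does not match the number of groups
  stopifnot(n_groups(grouped_df) == length(group_labels))
  grouped_df %>%
    mutate(name = group_labels[cur_group_id()])
}

df <- tibble(a = c(1, 2, 3, 1, 3, 2)) %>% group_by(a)
out <- add_group_names(df, c("owl", "newt", "zag"))
out
```

Because the label is looked up by group index rather than by the value of a, this should also work when the grouping column is not a run of integers starting at 1.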
Something like this?
library(dplyr)
names = c("owl", "newt", "zag")
df %>%
  group_by(a) %>%
  mutate(new_col = case_when(a == 1 ~ names[1],
                             a == 2 ~ names[2],
                             a == 3 ~ names[3]))
a new_col
<dbl> <chr>
1 1 owl
2 2 newt
3 3 zag
4 1 owl
5 2 newt
6 3 zag
7 2 newt
8 3 zag
9 1 owl
10 2 newt
11 1 owl
12 3 zag
13 2 newt
14 3 zag
data:
df <- structure(list(a = c(1, 2, 3, 1, 2, 3, 2, 3, 1, 2, 1, 3, 2, 3
)), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA,
-14L))
You could use factor() with mutate():
names = c("owl", "newt", "zag")
dat = data.frame(a = c(1,2,3,1,2,3,2,3,1,2,1,3,2,3))
dat %>% mutate(label = factor(a, levels = c(1,2,3), labels = names))
Just make sure the order in levels corresponds to the order in labels (i.e. 1 = "owl").
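If relying on the order of levels and labels feels fragile, a named vector can act as an explicit lookup table. This is a minimal base R sketch; the names group_labels and lookup are hypothetical, and the keys must match the values of a after conversion to character.

```r
group_labels <- c("owl", "newt", "zag")
dat <- data.frame(a = c(1, 2, 3, 1, 2, 3, 2, 3, 1, 2, 1, 3, 2, 3))

# Hypothetical lookup table: names are the group keys, values are the labels
lookup <- setNames(group_labels, c("1", "2", "3"))

# Index by the character form of a; unname() drops the key names from the result
dat$label <- unname(lookup[as.character(dat$a)])
head(dat, 3)
```

A key with no entry in lookup yields NA rather than an error, which makes unmapped groups easy to spot.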
I am doing topic modeling but need to remove certain characters. Specifically bullet points remain in my terms list.
USAID_stops <- c("performance", "final", "usaidgov", "kaves", "evaluation", "*", "[[:punct:]]", "U\2022")
#for (i in 1:length(chapters_1)) {
a <- SimpleCorpus(VectorSource(chapters_1[1]))
dtm_questions <- DocumentTermMatrix(a)
report_topics <- LDA(dtm_questions, k = 4)
topic_weights <- tidy(report_topics, matrix = "beta")
top_terms <- topic_weights %>%
  group_by(topic) %>%
  slice_max(beta, n = 10) %>%
  ungroup() %>%
  arrange(topic, -beta) %>%
  filter(!term %in% stop_words$word) %>%
  filter(!term %in% USAID_stops)
topic term beta
<int> <chr> <dbl>
1 chain 0.009267748
2 • 0.009766040
2 chain 0.009593995
2 change 0.008294549
3 nutrition 0.017117040
3 related 0.009621772
3 strategy 0.008523203
4 • 0.021312755
4 chain 0.010974153
4 ftf 0.008146484
These remain. How and where can I remove them from?
You can use mutate and str_remove to remove the bullets.
library(tidyverse)
df %>%
  mutate(across(term, ~ str_remove(., "•")))
Output
topic term beta
1 1 chain 0.009267748
2 2 0.009766040
3 2 chain 0.009593995
4 2 change 0.008294549
5 3 nutrition 0.017117040
6 3 related 0.009621772
7 3 strategy 0.008523203
8 4 0.021312755
9 4 chain 0.010974153
10 4 ftf 0.008146484
Or you can use gsub from base R.
df$term <- gsub("•","",as.character(df$term))
You could also replace earlier before running LDA.
dtm_questions[["dimnames"]][["Terms"]] <-
gsub("•","NA",dtm_questions[["dimnames"]][["Terms"]])
If you want to replace the bullets with something else, then you can do this:
df %>%
  mutate(across(term, ~ str_replace(., "•", "NA")))
# Or in base R
df$term <- gsub("•","NA",as.character(df$term))
Output
topic term beta
1 1 chain 0.009267748
2 2 NA 0.009766040
3 2 chain 0.009593995
4 2 change 0.008294549
5 3 nutrition 0.017117040
6 3 related 0.009621772
7 3 strategy 0.008523203
8 4 NA 0.021312755
9 4 chain 0.010974153
10 4 ftf 0.008146484
Data
df <-
structure(list(
topic = c(1, 2, 2, 2, 3, 3, 3, 4, 4, 4),
term = c(
"chain", "•", "chain", "change", "nutrition",
"related", "strategy", "•", "chain", "ftf"
),
beta = c(
0.009267748, 0.00976604, 0.009593995, 0.008294549,
0.01711704, 0.009621772, 0.008523203, 0.021312755,
0.010974153, 0.008146484
)
),
class = "data.frame",
row.names = c(NA, -10L))
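As for why the bullets survive the filter in the question: %in% compares whole strings exactly, so entries such as "[[:punct:]]" are treated as literal text rather than as patterns, and the R escape for the bullet character is "\u2022". If the goal is to drop those rows rather than blank out the term, a sketch along these lines may help. It rebuilds the sample df from the Data block above; the name cleaned is just for illustration.

```r
df <- data.frame(
  topic = c(1, 2, 2, 2, 3, 3, 3, 4, 4, 4),
  term = c("chain", "\u2022", "chain", "change", "nutrition",
           "related", "strategy", "\u2022", "chain", "ftf"),
  beta = c(0.009267748, 0.00976604, 0.009593995, 0.008294549,
           0.01711704, 0.009621772, 0.008523203, 0.021312755,
           0.010974153, 0.008146484)
)

# Keep only rows whose term does not contain the bullet character;
# fixed = TRUE matches it literally instead of as a regular expression
cleaned <- df[!grepl("\u2022", df$term, fixed = TRUE), ]
nrow(cleaned)
```

Since the filter in the question runs after slice_max(), dropping rows at that point leaves fewer than ten terms for some topics; filtering topic_weights before slice_max() would avoid that.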
I have the following data frame:
library(dplyr)
old_data = data.frame(id = c(1,2,3), var1 = c(11,12,13))
> old_data
id var1
1 1 11
2 2 12
3 3 13
I want to replace the values in the 2nd row of "old_data" with the data in "new_data" (i.e. update the rows in "old_data" whose id matches an id in "new_data"):
new_data = data.frame(id = c(4,2,5), var1 = c(11,15,13))
> new_data
id var1
1 4 11
2 2 15
3 5 13
Using the answer found here (Update rows of data frame in R), I tried to do this with the "dplyr" library:
update = old_data %>%
rows_update(new_data, by = "id")
But this gave me the following error:
Error: Attempting to update missing rows.
Run `rlang::last_error()` to see where the error occurred.
This is what I am trying to get:
id var1
1 1 11
2 2 15
3 3 13
Can someone please tell me what I am doing wrong?
Thanks!
A little bit messy but this works (on this sample data at least)
old_data %>%
  left_join(new_data, by = "id") %>%
  mutate(var1 = if_else(!is.na(var1.y), var1.y, var1.x)) %>%
  select(id, var1)
# id var1
#1 1 11
#2 2 15
#3 3 13
A base R approach using match -
inds <- match(old_data$id, new_data$id)
old_data$var1[!is.na(inds)] <- na.omit(new_data$var1[inds])
old_data
# id var1
#1 1 11
#2 2 15
#3 3 13
A data.table approach (with turning the data table back into a dataframe):
library(data.table)
as.data.frame(setDT(old_data)[new_data, var1 := .(i.var1), on = "id"])
Output
id var1
1 1 11
2 2 15
3 3 13
An alternative tidyverse option using rows_update. You can filter new_data to only have ids that appear in old_data. Then, you can update those values, like you had previously tried. Essentially, new_data must only have id values that appear in old_data.
library(tidyverse)
old_data %>%
  rows_update(., new_data %>% filter(id %in% old_data$id), by = "id")
Data
old_data <-
structure(list(id = c(1, 2, 3), var1 = c(11, 12, 13)),
class = "data.frame",
row.names = c(NA,-3L))
new_data <-
structure(list(id = c(4, 2, 5), var1 = c(11, 15, 13)),
class = "data.frame",
row.names = c(NA,-3L))
We can use dplyr::rows_update if we first use a semi_join on new_data to filter only those ids that are included in old_data.
library(dplyr)
old_data %>%
  rows_update(new_data %>%
                semi_join(old_data, by = "id"),
              by = "id")
#> id var1
#> 1 1 11
#> 2 2 15
#> 3 3 13
Created on 2021-12-29 by the reprex package (v0.3.0)
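Newer dplyr releases can skip the pre-filtering step. As far as I know, rows_update() gained an unmatched argument in dplyr 1.1.0; treat the version number as an assumption and check ?rows_update on your installation before relying on it.

```r
library(dplyr)

old_data <- data.frame(id = c(1, 2, 3), var1 = c(11, 12, 13))
new_data <- data.frame(id = c(4, 2, 5), var1 = c(11, 15, 13))

# unmatched = "ignore" skips ids in new_data that are absent from old_data
# (assumes dplyr >= 1.1.0)
updated <- old_data %>%
  rows_update(new_data, by = "id", unmatched = "ignore")
updated
```

On older versions the argument does not exist, so the filter() or semi_join() variants above remain the portable choice.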
I have a dataframe containing the results of a competition. In this example competitors b and c have tied for second place. The actual dataframe is very large and could contain multiple ties.
df <- data.frame(name = letters[1:4],
place = c(1, 2, 2, 4))
I also have point values for the respective places, where first place gets 4 points, 2nd gets 3, 3rd gets 1 and 4th gets 0.
points <- c(4, 3, 1, 0)
names(points) <- 1:4
I can match points to place to get each competitor's score
df %>%
mutate(score = points[place])
name place score
1 a 1 4
2 b 2 3
3 c 2 3
4 d 4 0
What I would like to do though is award points to b and c that are the mean of the point values for 2nd and 3rd, such that each receives 2 points like this:
name place score
1 a 1 4
2 b 2 2
3 c 2 2
4 d 4 0
How can I accomplish this programmatically?
A solution using nested data frames and purrr.
library(dplyr)
library(tidyr)
library(purrr)
df <- data.frame(name = letters[1:4],
place = c(1, 2, 2, 4))
points <- c(4, 3, 1, 0)
names(points) <- 1:4
# a function to help expand the dataframe based on the number of ties
expand_all <- function(x, n) {
  x:(x + n - 1)
}
df %>%
  group_by(place) %>%
  tally() %>%
  mutate(new_place = purrr::map2(place, n, expand_all)) %>%
  unnest(new_place) %>%
  mutate(score = points[new_place]) %>%
  group_by(place) %>%
  summarize(score = mean(score)) %>%
  inner_join(df)
Robert Wilson's answer gave me an idea. Rather than mapping over nested data frames, the rank() function from base R can get to the same result:
df %>%
  mutate(new_place = rank(place, ties.method = "first")) %>%
  mutate(score = points[new_place]) %>%
  group_by(place) %>%
  summarize(score = mean(score)) %>%
  inner_join(df)
place score name
<dbl> <dbl> <chr>
1 1 4 a
2 2 2 b
3 2 2 c
4 4 0 d
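This can be accomplished in a few lines with ifelse() inside mutate(). Note that the 1 added for ties is the third-place point value, so as written this only handles a two-way tie for second place:
df %>%
  group_by(place) %>%
  mutate(n_ties = n()) %>%
  ungroup %>%
  mutate(score = (points[place] + ifelse(n_ties > 1, 1, 0)) / n_ties)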
# A tibble: 4 x 4
name place n_ties score
<chr> <dbl> <int> <dbl>
1 a 1 1 4
2 b 2 2 2
3 c 2 2 2
4 d 4 1 0
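For arbitrary ties, the rank() idea also works without any join. A minimal base R sketch, assuming points is ordered by place and has at least one entry per competitor; the name slot is illustrative.

```r
df <- data.frame(name = letters[1:4],
                 place = c(1, 2, 2, 4))
points <- c(4, 3, 1, 0)

# Give tied competitors consecutive slots (here b -> 2, c -> 3)
slot <- rank(df$place, ties.method = "first")

# ave() replaces each slot's points with the mean over everyone sharing that place
df$score <- ave(points[slot], df$place)
df
```

Each tied competitor receives the mean of the point values for the slots the tie occupies, so a three-way tie for first would average the first three entries of points.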
I have data like this:
function person
1 hr 1
2 sls 5
3 mktg 3
4 qlt 7
5 rev 5
I want to combine the rows where the column "function" is "sls" or "mktg" into a single row holding the sum of their "person" values, using R.
The desired output is:
Person function
1 1 hr
2 8 sls & mktg
3 7 qlt
4 5 rev
A base R solution:
merg <- c("sls", "mktg")
dat$func[dat$func %in% merg] <- paste(merg, collapse = " & ")
aggregate(person ~ func, dat, sum)
func person
1 hr 1
2 qlt 7
3 rev 5
4 sls & mktg 8
Data
dat <- data.frame(
func = c("hr", "sls", "mktg", "qlt", "rev"),
person = c(1, 5, 3, 7, 5),
stringsAsFactors = FALSE
)
Note that this assumes dat$func is a character vector. If it is not, first convert it with as.character().
library(dplyr)
dat <- data.frame(func = c("hr", "sls", "mktg", "qlt", "rev"),
                  person = c(1, 5, 3, 7, 5))
dat %>%
  mutate(func = func %>% as.factor() %>% as.character(),
         func = ifelse(func %in% c("sls", "mktg"), "sls & mktg", func)) %>%
  group_by(func) %>%
  summarize(Person = sum(person))
returns
# A tibble: 4 x 2
func Person
<chr> <dbl>
1 hr 1
2 qlt 7
3 rev 5
4 sls & mktg 8
Another approach with dplyr:
Code:
dfr %>%
  group_by(Function = sub("sls|mktg", "sls & mktg", functn)) %>%
  summarise(Person = sum(person))
Output:
# A tibble: 4 x 2
Function Person
<chr> <dbl>
1 hr 1.
2 qlt 7.
3 rev 5.
4 sls & mktg 8.
Data
stringsAsFactors = TRUE or FALSE: this works in both cases.
dfr <- data.frame(
functn = c("hr", "sls", "mktg", "qlt", "rev"),
person = c(1, 5, 3, 7, 5)
)
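The answers above return the groups in alphabetical order, while the desired output keeps the original row order. One way to preserve it, sketched here in base R, is to turn the merged column into a factor whose levels follow first appearance. The variable names follow the base R answer; out is illustrative.

```r
dat <- data.frame(
  func = c("hr", "sls", "mktg", "qlt", "rev"),
  person = c(1, 5, 3, 7, 5),
  stringsAsFactors = FALSE
)
merg <- c("sls", "mktg")

# Relabel the rows to be merged with a shared name
dat$func[dat$func %in% merg] <- paste(merg, collapse = " & ")

# Levels in order of first appearance, so aggregate() keeps that order
dat$func <- factor(dat$func, levels = unique(dat$func))
out <- aggregate(person ~ func, dat, sum)
out
```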
I am trying to convert an LDA prediction result, which is a list object containing hundreds of lists (each holding the numeric topics assigned to the tokens of one document), such as the following example:
assignments <- list(
as.integer(c(1, 1, 1, 1, 1, 1, 2, 2, 2, 3, 3)),
as.integer(c(1, 1, 1, 2, 2, 2, 2, 2, 3, 3)),
as.integer(c(1, 3, 3, 3, 3, 3, 3, 2, 2))
)
Each element of the list has a different length, corresponding to the length of each tokenized document.
What I want to do is 1) get the most frequent topic (1, 2, 3) from each element, and 2) convert the result into tbl or data.frame format like this:
document topic freq
1 1 6
2 2 5
3 3 6
so that I can use inner_join() to merge this "consensus" prediction with the topic assignment results generated by the tm or topicmodels packages and compare their precision, etc. Since assignments is a list, I cannot apply top_n() to get the most frequent topic for each element. I tried using lapply(unlist(assignments), count), but it didn't give me what I want.
You can iterate over the list with sapply, get frequency with table and extract first value from sorted result:
result <- sapply(assignments, function(x) sort(table(x), decreasing = TRUE)[1])
data.frame(document = seq_along(assignments),
topic = as.integer(names(result)),
freq = result)
document topic freq
1 1 1 6
2 2 2 5
3 3 3 6
We can loop through the list, get the frequency of elements with tabulate, find the index of maximum elements, extract those along with the frequency as a data.frame and rbind the list elements
do.call(rbind, lapply(seq_along(assignments), function(i) {
  x <- assignments[[i]]
  ux <- unique(x)
  i1 <- tabulate(match(x, ux))
  data.frame(document = i, topic = ux[which.max(i1)], freq = max(i1))
}))
# document topic freq
#1 1 1 6
#2 2 2 5
#3 3 3 6
Or another option is to convert it to a two column dataset and then do group by to find the index of max values
library(data.table)
setDT(stack(setNames(assignments, seq_along(assignments))))[,
.(freq = .N), .(document = ind, topic = values)][, .SD[freq == max(freq)], document]
# document topic freq
#1: 1 1 6
#2: 2 2 5
#3: 3 3 6
Or we can use tidyverse
library(tidyverse)
map(assignments, as_tibble) %>%
  bind_rows(.id = 'document') %>%
  count(document, value) %>%
  group_by(document) %>%
  filter(n == max(n)) %>%
  ungroup %>%
  rename_at(2:3, ~c('topic', 'freq'))
# A tibble: 3 x 3
# document topic freq
# <chr> <int> <int>
#1 1 1 6
#2 2 2 5
#3 3 3 6
Using purrr::imap_dfr():
library(tidyverse)
imap_dfr(assignments, ~ tibble(
  document = .y,
  Topic = names(which.max(table(.x))),
  freq = max(tabulate(.x))))
# # A tibble: 3 x 3
# document Topic freq
# <int> <chr> <int>
# 1 1 1 6
# 2 2 2 5
# 3 3 3 6
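The answers differ in how they treat ties: the which.max() variants keep only the first maximum, while the filter(n == max(n)) and .SD[freq == max(freq)] variants return every tied topic. If exactly one row per document is needed for the later inner_join(), a type-checked base R sketch could look like this; the helper name top_topic is hypothetical.

```r
assignments <- list(
  as.integer(c(1, 1, 1, 1, 1, 1, 2, 2, 2, 3, 3)),
  as.integer(c(1, 1, 1, 2, 2, 2, 2, 2, 3, 3)),
  as.integer(c(1, 3, 3, 3, 3, 3, 3, 2, 2))
)

# Hypothetical helper: most frequent topic and its count for one document
top_topic <- function(x) {
  counts <- table(x)
  best <- which.max(counts)  # on ties, the first (smallest) topic wins
  c(topic = as.integer(names(counts)[best]),
    freq = as.integer(counts[best]))
}

# vapply() checks that every document yields exactly two integers
res <- t(vapply(assignments, top_topic, c(topic = 0L, freq = 0L)))
out <- data.frame(document = seq_along(assignments), res)
out
```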