Suppose I have a few sentences describing how John spends his days stored in a dataframe in R:
df <- data_frame(sentence = c("John went to work this morning", "John likes to jog", "John is hungry"))
I want to identify which words appear most often in sentences that contain "John". I can use unnest_tokens() to identify consecutive word pairs. How can I identify recurring pairings that are not consecutive?
The goal is to obtain a result that counts how many times each other word appears alongside "John":
df2 <- data_frame(word1 = c("John", "John", "John", "John", "John", "John", "John", "John", "John"),
word2 = c("went", "to", "work", "this", "morning", "likes", "jog", "is", "hungry"),
n = c(1, 2, 1, 1, 1, 1, 1, 1, 1))
We can try
library(dplyr)
lst <- lapply(strsplit(df$sentence, " "), \(x) list(x[1], x[-1])) |>  # first word vs the rest of each sentence
  lapply(\(x) data.frame(x[1], x[2]))                                 # recycle the first word against the rest

ans <- lapply(lst, \(x) {colnames(x) <- c("word1", "word2"); x}) |>
  do.call(rbind, args = _) |>                                         # stack the per-sentence pairs
  group_by(word1, word2) |>
  summarise(n = n())
Output
# A tibble: 9 × 3
# Groups: word1 [1]
word1 word2 n
<chr> <chr> <int>
1 John hungry 1
2 John is 1
3 John jog 1
4 John likes 1
5 John morning 1
6 John this 1
7 John to 2
8 John went 1
9 John work 1
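For comparison, a tidytext-based route is also possible. This is a minimal sketch, assuming the widyr package is available; pairwise_count() counts how often two words occur in the same sentence regardless of their positions (unnest_tokens() lowercases by default, so the filter uses "john"):
library(dplyr)
library(tidytext)
library(widyr)

df %>%
  mutate(id = row_number()) %>%      # one id per sentence
  unnest_tokens(word, sentence) %>%  # one row per word, lowercased
  pairwise_count(word, id) %>%       # co-occurrences within the same sentence
  filter(item1 == "john")
For this small example the counts should match df2 above, up to the lowercasing.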
I have two dataframes organised like this.
df1 <- data.frame(lastname = c("Miller", "Smith", "Grey"),
firstname = c("John", "Jane", "Hans")
)
df2 <- data.frame(lastname =c("Smith", "Grey"),
firstname = c("Jane", "Hans")
)
df2 is not necessarily a subset of df1. Duplicated entries are also possible.
My goal is to keep a copy of df1 containing only the entries that are represented in both data frames. Alternatively, I would like to end up with a subset of df1 with a new variable indicating that the name is also an element of df2.
Can someone suggest a way to do this? A {dplyr} attempt is totally fine.
Desired output for this particular simple case:
res <- data.frame(lastname = c("Smith", "Grey"),
firstname = c("Jane", "Hans")
)
Covering the "alternatively" part of the question as well, this is an approach with left_join, adding a grouping variable grp to distinguish the two sets.
library(dplyr)
left_join(cbind(df1, grp = "A"), cbind(df2, grp = "B"),
c("lastname", "firstname"), suffix=c("_A", "_B"))
lastname firstname grp_A grp_B
1 Miller John A <NA>
2 Smith Jane A B
3 Grey Hans A B
or with base R merge
merge(cbind(df1, grp = "A"), cbind(df2, grp = "B"),
      c("lastname", "firstname"), suffixes = c("_A", "_B"), all = TRUE)
lastname firstname grp_A grp_B
1 Grey Hans A B
2 Miller John A <NA>
3 Smith Jane A B
To remove the NA rows and compact the grp columns:
na.omit(left_join(cbind(df1, grp = "A"), cbind(df2, grp = "B"),
c("lastname", "firstname"), suffix=c("_A", "_B"))) %>%
summarize(lastname, firstname,
grp = list(across(starts_with("grp"), ~ unique(.x))))
lastname firstname grp
1 Smith Jane A, B
2 Grey Hans A, B
The other part of the question (keeping only the rows present in both) is simply
merge(df1, df2)
lastname firstname
1 Grey Hans
2 Smith Jane
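Since a {dplyr} attempt is explicitly welcome, a shorter route for the main goal is semi_join(), which keeps only the rows of df1 that have a match in df2. A sketch covering both variants (the in_df2 flag column is a made-up name):
library(dplyr)

# rows of df1 that also occur in df2
semi_join(df1, df2, by = c("lastname", "firstname"))

# or keep all of df1 and flag the overlap instead
df1 %>%
  left_join(mutate(distinct(df2), in_df2 = TRUE),
            by = c("lastname", "firstname")) %>%
  mutate(in_df2 = !is.na(in_df2))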
For a large database, I would like to find a solution where I could predefine the strings to be searched and then get a table that would contain the frequency of these strings and their possible variations per row.
strings <- c("dog", "cat", "mouse")
var1 <- c("black dog", "white dog", "angry dog", "dogs and cats are nice", "dog")
var2 <- c("white cat", "black cat", "tiny cat", NA, "cow")
var3 <- c("little mouse", "big mouse", NA, NA, "mouse")
data <- data.frame(var1, var2, var3)
The result should look like this when I am looking for dog, cat and mouse:
dog&cat 4
mouse 3
We may try
v1 <- do.call(paste, data)   # collapse the columns of each row into one string
stack(setNames(lapply(c("\\bdog.*\\bcat|\\bcat.*\\bdog", "mouse"),
  \(pat) sum(grepl(pat, v1))), c("dog&cat", "mouse")))[2:1]
ind values
1 dog&cat 4
2 mouse 3
Or if we need all the combinations
lst1 <- lapply(c(strings, combn(strings, 2, FUN = \(x)
sprintf("\\b%1$s.*\\b%2$s|\\b%2$s.*\\b%1$s", x[1], x[2]))),
\(pat) sum(grepl(pat, v1)))
names(lst1) <- c(strings, combn(strings, 2, FUN = paste, collapse = "&"))
stack(lst1)[2:1]
ind values
1 dog 5
2 cat 4
3 mouse 3
4 dog&cat 4
5 dog&mouse 3
6 cat&mouse 2
For more combinations, it may be better to use Reduce, applying grepl individually for each string:
lst1 <- lapply(1:3, \(n) {
vals <- colSums(combn(strings, n,
FUN = \(pats) Reduce(`&`, lapply(pats, \(pat) grepl(pat, v1)))))
nms <- combn(strings, n, FUN = paste, collapse = "&")
setNames(vals, nms)
})
stack(unlist(lst1))[2:1]
ind values
1 dog 5
2 cat 4
3 mouse 3
4 dog&cat 4
5 dog&mouse 3
6 cat&mouse 2
7 dog&cat&mouse 2
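If this has to be run repeatedly, the Reduce/grepl idea above can be wrapped in a small helper; a sketch, where count_terms() and its arguments are made-up names rather than anything from an existing package:
count_terms <- function(data, strings) {
  v1 <- do.call(paste, data)                       # collapse each row into one string
  lst1 <- lapply(seq_along(strings), \(n) {
    vals <- colSums(combn(strings, n,
      FUN = \(pats) Reduce(`&`, lapply(pats, \(pat) grepl(pat, v1)))))
    nms <- combn(strings, n, FUN = paste, collapse = "&")
    setNames(vals, nms)
  })
  stack(unlist(lst1))[2:1]                         # long format: ind, values
}

count_terms(data, strings)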
Or with tidyverse
library(dplyr)
library(stringr)
library(tidyr)
data %>%
unite(var, everything(), na.rm = TRUE, sep = " ") %>%
summarise(`dog&cat` = sum(str_detect(var,
"\\bdog.*\\bcat|\\bcat.*\\bdog")),
mouse = sum(str_detect(var, 'mouse'))) %>%
pivot_longer(everything())
-output
# A tibble: 2 × 2
name value
<chr> <int>
1 dog&cat 4
2 mouse 3
I have 3 data frames, each with the player's name, the year, and three variables:
a <- rnorm(16, 3, 2)
b <- rnorm(16, 1, 3)
c <- rpois(16, 3)
year <- c(rep(2015, 5), rep(2016, 5), rep(2017, 6))
player <- c("Alex", "CT", "Bill", "Brian", "Collin", "Chez", "Adam", "Danny III", "Lee", "Chris",
"Erik", "Axel", "Louis", "Justin", "Dustin", "Johnson")
df_1 <- data.frame(player, year, a, b, c)
d <- rnorm(16, 3, 2)
e <- rnorm(16, 1, 3)
f <- rpois(16, 3)
year <- c(rep(2015, 5), rep(2016, 5), rep(2017, 6))
player <- c("Alexander", "C.T.", "Bill", "Brian", "Collin", "Chez", "Adam", "Danny IV", "Lee", "Chris",
"Erik", "Axel", "Louis", "Justin", "Dustin", "Johnson")
df_2 <- data.frame(player, year, d, e, f)
g <- rnorm(16, 3, 2)
h <- rnorm(16, 1, 3)
i <- rpois(16, 3)
year <- c(rep(2015, 5), rep(2016, 5), rep(2017, 6))
player <- c("Alex", "CT", "Bill", "Brian", "Collin", "Chez", "Adam", "Danny III", "Lee", "Chris",
"Erik", "Axel", "Louis", "Justin", "Dustin", "Johnson")
df_3 <- data.frame(player, year, g, h, i)
The data frame below contains the name of the player as it appears in each set of variables.
For example, Alex is named Alexander in the variables d to f and Alex again in the variables g to i; Danny III is named Danny IV in the variables d to f and Danny III in the variables g to i.
a_to_c <- c("Alex", "CT", "Danny III")
d_to_f <- c("Alexander", "C.T.", "Danny IV")
g_to_i <- c("Alex", "CT", "Danny III")
names_palyer <- data.frame(a_to_c, d_to_f, g_to_i)
I want to merge the three data frames by year and player into a single data frame, using the information from names_palyer to correctly match each player with the data.
I made this example for simplicity; in reality I have thousands of observations, so I need a way to match the player names automatically and end up with a single data frame holding the information of all three data frames.
Initialize the output ('out') as the first data set ('df_1'). Loop over the column index of 'names_palyer' (excluding the last column) and get the corresponding 'df_' object (incrementing by one, i.e. i + 1, assuming the objects are named df_1, df_2, etc.). Select the relevant pair of columns of 'names_palyer' ('keydat'), use match to find which of its second-column values occur in the 'player' column of the 'tmp' data, and replace those 'player' values with the first-column values of 'keydat'. Then do the merge (left join, all.x = TRUE), and at the end change the output 'player' values that match the first column of 'keydat' to the corresponding second-column values, so that they line up with the key used in the next iteration.
out <- df_1
for(i in 1:(ncol(names_palyer)-1)) {
tmp <- data.table::copy(get(paste0('df_', i + 1)))
keydat <- names_palyer[c(i, i + 1)]
keydat <- keydat[keydat[[2]] %in% tmp$player,, drop = FALSE]
i1 <- match(keydat[[2]], tmp$player, nomatch = 0)
tmp$player[i1] <- keydat[[1]]
print(tmp)
out <- merge(out, tmp, by = c('player', 'year'), all.x = TRUE)
i2 <- match(keydat[[1]], out$player, nomatch = 0)
out$player[i2] <- keydat[[2]]
}
-output
out
player year a b c d e f g h i
1 Adam 2016 0.03587367 -0.57907496 3 5.1149009 2.47064240 2 2.3325348 0.62526907 6
2 Alex 2015 1.27778013 0.05809471 0 4.1932959 4.37934704 0 4.3226737 -0.33523019 5
3 Axel 2017 2.56466723 0.43108713 2 5.9970138 -2.19947169 4 0.9717511 2.05843957 3
4 Bill 2015 2.05594607 3.96167974 3 2.5232810 3.87191286 3 3.1726895 3.43683108 0
5 Brian 2015 3.44690732 0.35032810 4 4.7287671 0.08108714 2 2.8519495 -0.08249603 2
6 CT 2015 5.85679299 -1.57623304 2 3.9653678 1.68389034 3 3.0328709 1.04315644 2
7 Chez 2016 0.73604605 -2.58101736 1 4.0642894 0.04941299 3 5.4688474 -1.82831432 3
8 Chris 2016 0.95621081 2.05206411 4 2.7249987 2.42911270 8 -0.4515070 -2.12097504 0
9 Collin 2015 7.14194691 0.74030236 5 4.7879545 5.41397214 4 1.4835656 0.92897125 2
10 Danny III 2016 4.59832890 0.60355092 5 4.4822495 4.15865653 0 2.4950848 3.31059942 3
11 Dustin 2017 0.26640646 -0.23381080 4 5.3164916 3.67001803 1 0.7011976 2.59135173 4
12 Erik 2017 0.27363760 -4.50110125 3 4.9495033 3.31417537 3 4.1907692 5.57914934 6
13 Johnson 2017 7.12013083 2.52775367 3 1.9192381 4.33916287 2 3.3836699 -2.37444447 3
14 Justin 2017 3.41710305 -3.82843506 4 5.5590782 0.56030426 1 0.1670448 5.99934712 6
15 Lee 2016 -1.02002976 -3.24576311 4 0.9538381 -0.91783716 5 2.5668076 -0.67247680 2
16 Louis 2017 1.94420093 0.47369179 3 2.8249960 -1.28630731 7 3.0070664 1.25132019 5
With the OP's new data
out <- data.table::copy(df_1)
for(i in 1:(ncol(names_palyer)-1)) {
tmp <- data.table::copy(get(paste0('df_', i + 1)))
keydat <- names_palyer[c(i, i + 1)]
keydat <- keydat[keydat[[2]] %in% tmp$player,, drop = FALSE]
i1 <- match(keydat[[2]], tmp$player, nomatch = 0)
tmp$player[i1] <- keydat[[1]]
print(tmp)
out <- merge(out, tmp, by = c('player', 'year'), all.x = TRUE)
i2 <- match(keydat[[1]], out$player, nomatch = 0)
out$player[i2] <- keydat[[2]][keydat[[1]] %in% out$player]
}
Since the merges can leave duplicated measure columns (with suffixed names), they are collapsed into single columns by grouping the columns on their base name and coalescing:
library(dplyr)
library(purrr)
split.default(out[-(1:2)], sub("\\..*", "", names(out)[-(1:2)])) %>%  # group columns by base name
  map_dfc(reduce, coalesce) %>%                                       # collapse each group into one column
  bind_cols(out[1:2], .)
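An alternative sketch for the same task, assuming (as in the example) that df_3 already shares df_1's spellings: recode the player names of df_2 to the df_1 spelling via a named lookup built from names_palyer, then merge everything in one pass (lookup, df_2_std and out2 are made-up names):
# map the d_to_f spellings to the a_to_c spellings
lookup <- setNames(names_palyer$a_to_c, names_palyer$d_to_f)

df_2_std <- df_2
hit <- df_2_std$player %in% names(lookup)
df_2_std$player[hit] <- lookup[df_2_std$player[hit]]

# successive left joins on the now-consistent keys
out2 <- Reduce(function(x, y) merge(x, y, by = c("player", "year"), all.x = TRUE),
               list(df_1, df_2_std, df_3))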
I've been trying to figure this out for a few hours and I'm hoping someone can point me in the right direction. I'm trying to get from the data set below, named "current_data",
current_data <-
tribble(
~ID, ~grade_Q1, ~points_Q1,
"1", c("D-", "C-", "C-", "C-"), c(1, 2, 2, 2),
"2", c("A", "B", "B+", "B+"), c(4, 3, 3, 3),
)
to the data set below, named "updated_data":
updated_data <-
tribble(
~ID, ~grade_Q1, ~points_Q1, ~n_grades,
"1", "D- C C- C-", "1 2 2 2", 4,
"2", "A B B+ B+ A", "4 3 3 3 4", 5
)
The "n_grades" column is simply a count of the number of letter grades in the "grade_Q1" column. Does anyone have any ideas on how to proceed?
We can get the lengths of 'grade_Q1' to create 'n_grades', then loop over the list columns with map and concatenate each into a single string with str_c:
library(dplyr)
library(stringr)
library(purrr)
current_data %>%
mutate(n_grades = lengths(grade_Q1),
grade_Q1 = map_chr(grade_Q1, str_c, collapse= ' '),
points_Q1 = map_chr(points_Q1, str_c, collapse = ' '))
-output
# A tibble: 2 x 4
# ID grade_Q1 points_Q1 n_grades
# <chr> <chr> <chr> <int>
#1 1 D- C- C- C- 1 2 2 2 4
#2 2 A B B+ B+ A 4 3 3 3 4 5
If there are many columns, it can be simplified with across
current_data %>%
mutate(n_grades = lengths(grade_Q1),
across(c(grade_Q1, points_Q1), ~ map_chr(., str_c, collapse= ' ')))
Or using base R
current_data$n_grades <- lengths(current_data$grade_Q1)
current_data[c("grade_Q1", "points_Q1")] <-
lapply(current_data[c("grade_Q1", "points_Q1")],
sapply, paste, collapse= ' ')
A data.table option
library(data.table)
setDT(current_data)[
,
c(
lapply(.SD, function(x) paste0(unlist(x), collapse = " ")),
n_grades = lengths(grade_Q1)
),
ID
][]
gives
ID grade_Q1 points_Q1 n_grades
1: 1 D- C- C- C- 1 2 2 2 4
2: 2 A B B+ B+ A 4 3 3 3 4 5
I have a dataset that has a column called QTY in which most of the values are already summed, but a few are several integers separated by commas. How can I replace those rows with the sums of the values?
I have:
ID Name QTY
1 Abc 2
2 Bac 3
3 Cba 2, 4, 5, 8
4 Bcb 4, 1
Desired result:
ID Name QTY
1 Abc 2
2 Bac 3
3 Cba 19
4 Bcb 5
I've tried messing around with for loops a bit and using ifelse(), but I can't quite figure it out.
This looks a bit ugly but should work. Assuming column QTY is a character -
your_df$QTY_new <- sapply(strsplit(your_df$QTY, ", "), function(x) sum(as.numeric(x)))
Using a for loop, it could be done this way:
library(data.table)
data <- data.table(ID = 1:4,
Name = c("Abc", "Bac", "Cba", "Bcb"),
QTY = c("2", "3", "2, 4, 5, 8", "4, 1"),
QTY2 = numeric(4))
for(i in 1:nrow(data)){
data$QTY2[i] <- sum(as.numeric(unlist(strsplit(as.character(data$QTY[i]), ', '))))
}
and the resulting DF is:
ID Name QTY QTY2
1: 1 Abc 2 2
2: 2 Bac 3 3
3: 3 Cba 2, 4, 5, 8 19
4: 4 Bcb 4, 1 5
I made a function to solve your question. Let me explain how it works:
sumInRow = function(row_value, split = ",") {
# 1. split the values
row_value = strsplit(row_value, split = split)
# 2. Convert them to numeric and sum
row_sum = sapply(row_value, function(single_row) {
single_row = as.numeric(single_row)
return(sum(single_row))
})
return(row_sum)
}
The row_value argument will be a character vector because of the commas.
First, we split each value:
row_value = strsplit(row_value, split = split)
This returns a list containing the split pieces for every element of row_value; we will use it in the next step.
row_sum = sapply(row_value, function(single_row) {
single_row = as.numeric(single_row)
return(sum(single_row))
})
sapply works as an iterator: for each element of the list, it applies the function above, converting the pieces to numeric and returning their sum.
To use it, you call:
sumInRow(<your data frame>$QTY)
I hope it helps you.
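Applied to the example data built earlier (a sketch assuming QTY is stored as character; QTY_sum is a made-up column name), writing the sums back would look like:
data$QTY_sum <- sumInRow(data$QTY)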
Here is one option with tidyverse. We split the 'QTY' column by the delimiter "," to expand the rows (separate_rows), group by 'ID' and 'Name', and get the sum of 'QTY':
library(tidyverse)
df1 %>%
separate_rows(QTY, convert = TRUE) %>%
group_by(ID, Name) %>%
summarise(QTY = sum(QTY))
# A tibble: 4 x 3
# Groups: ID [4]
# ID Name QTY
# <int> <chr> <int>
#1 1 Abc 2
#2 2 Bac 3
#3 3 Cba 19
#4 4 Bcb 5
data
df1 <- structure(list(ID = 1:4, Name = c("Abc", "Bac", "Cba", "Bcb"),
QTY = c("2", "3", "2, 4, 5, 8", "4, 1")), class = "data.frame", row.names = c(NA,
-4L))