I've been trying to figure this out for a few hours and I'm hoping someone can point me in the right direction. I'm trying to get from the data set below, named "current_data",
current_data <-
tribble(
~ID, ~grade_Q1, ~points_Q1,
"1", c("D-", "C-", "C-", "C-"), c(1, 2, 2, 2),
"2", c("A", "B", "B+", "B+"), c(4, 3, 3, 3),
)
to the data set below, named "updated_data":
updated_data <-
tribble(
~ID, ~grade_Q1, ~points_Q1, ~n_grades,
"1", "D- C C- C-", "1 2 2 2", 4,
"2", "A B B+ B+ A", "4 3 3 3 4", 5
)
The "n_grades" column is literally just a count of the number of letter grades in the "grade_q1" column. Anyone have any ideas how to proceed?
We can get the lengths of 'grade_Q1' to create 'n_grades', then loop over the list columns with map_chr and concatenate each into a single string with str_c.
library(dplyr)
library(stringr)
library(purrr)
current_data %>%
mutate(n_grades = lengths(grade_Q1),
grade_Q1 = map_chr(grade_Q1, str_c, collapse= ' '),
points_Q1 = map_chr(points_Q1, str_c, collapse = ' '))
-output
# A tibble: 2 x 4
# ID grade_Q1 points_Q1 n_grades
# <chr> <chr> <chr> <int>
#1 1 D- C- C- C- 1 2 2 2 4
#2 2 A B B+ B+ A 4 3 3 3 4 5
If there are many columns, it can be simplified with across
current_data %>%
mutate(n_grades = lengths(grade_Q1),
across(c(grade_Q1, points_Q1), ~ map_chr(., str_c, collapse= ' ')))
Or using base R
current_data$n_grades <- lengths(current_data$grade_Q1)
current_data[c("grade_Q1", "points_Q1")] <-
lapply(current_data[c("grade_Q1", "points_Q1")],
sapply, paste, collapse= ' ')
A data.table option
library(data.table)
setDT(current_data)[
,
c(
lapply(.SD, function(x) paste0(unlist(x), collapse = " ")),
n_grades = lengths(grade_Q1)
),
ID
][]
gives
ID grade_Q1 points_Q1 n_grades
1: 1 D- C- C- C- 1 2 2 2 4
2: 2 A B B+ B+ A 4 3 3 3 4 5
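If you prefer to update the table by reference instead of building a grouped summary, a minimal sketch (with data.table loaded as above, and assuming current_data still holds the original list columns):
# add the count, then overwrite the list columns with collapsed strings, by reference
setDT(current_data)[, n_grades := lengths(grade_Q1)][
  , c("grade_Q1", "points_Q1") := lapply(.SD, function(x) sapply(x, paste, collapse = " ")),
  .SDcols = c("grade_Q1", "points_Q1")
][]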
I am trying to solve the following: here is the top of my df
Col1 Col2
1 Basic ABC
2 B ABCD
3 B abc
4 B ab c
5 B AB12
Col2 is a string column. I now want to convert the strings to unique numbers, based on the specific string content
Like this:
Col1 Col2 Col3
1 Basic ABC 123
2 B ABCD 1234
3 B abc 272829
4 B ab c 2728029
5 B AB12 1212
...
As you see, there can be capital letters, numbers, lower-case letters, and spaces that need to be converted to a specific numeric value. It doesn't matter what numbers are generated; they only need to be unique.
The difficult part is that I need static numeric IDs, but my df is dynamic.
Meaning: strings can be added or removed over time, but if, for example, the string "dog" is added, it gets an ID (e.g. "789") that was never and will never be used by another string. So the generated IDs are not influenced by the size of Col2, the position of strings in that column, or any ordering, only by the content of the string itself.
Help is much appreciated
If you are just mapping characters within some master vector, then perhaps this:
chrs <- c(LETTERS, letters, 0:9)
quux$Col3 <- sapply(strsplit(quux$Col2, ""), function(z) paste(match(z, chrs, nomatch = 0L), collapse = ""))
quux
# Col1 Col2 Col3
# 1 Basic ABC 123
# 2 B ABCD 1234
# 3 B abc 272829
# 4 B ab c 2728029
# 5 B AB12 125455
or a dplyr variant, if you're already using it (but this varies very little):
library(dplyr)
quux %>%
mutate(Col3 = sapply(strsplit(Col2, ""), function(z) paste(match(z, chrs, nomatch = 0L), collapse = "")))
# Col1 Col2 Col3
# 1 Basic ABC 123
# 2 B ABCD 1234
# 3 B abc 272829
# 4 B ab c 2728029
# 5 B AB12 125455
However, as MrFlick suggested, perhaps what you really need is a hashing function?
sapply(quux$Col2, digest::digest, algo = "sha256")
# ABC
# "8fe32130ce14a3fb071473f9b718e403752f56c0f13081943d126ffb28a7b923"
# ABCD
# "942a0d444d8cf73354e5316517909d5f34b17963214a8f5b271375fe1da43013"
# abc
# "9f7b8da9f3abe2caaf5212f6b224448706de57b3c7b5dda916ee8d6005d9f24b"
# ab c
# "a3f1c49979af0fffa22f68028a42d302e0a675798ac4ac8a76bed392880af8f2"
# AB12
# "9ffbe9825833ab3c6b183f9986ab194a7aefcc06f5c940549a2c799dd4cd15b1"
Data
quux <- structure(list(Col1 = c("Basic", "B", "B", "B", "B"), Col2 = c("ABC", "ABCD", "abc", "ab c", "AB12")), row.names = c("1", "2", "3", "4", "5"), class = "data.frame")
Ah, r2 beat me to it, but same concept:
dd <- read.table(header = TRUE, text = "a Col1 Col2
1 Basic ABC
2 B ABCD
3 B abc
4 B 'ab c'
5 B AB12")
dd$Col2
f <- function(x) {
  # split each string into single characters
  x <- strsplit(x, '')
  sapply(x, function(y)
    # map space -> 0, A-Z -> 1:26, a-z -> 27:52, digits -> themselves
    # (factor() accepts duplicated labels, so the overlaps are fine)
    factor(y, c(' ', LETTERS, letters, 0:9), c(0, 1:26, 27:52, 0:9)) |>
      as.character() |> paste0(`...` = _, collapse = ''))
}
f(dd$Col2)
# [1] "123" "1234" "272829" "2728029" "1212"
For a large database, I would like to find a solution where I could predefine the strings to be searched and then get a table containing how many rows contain these strings and their possible combinations.
strings <- c("dog", "cat", "mouse")
var1 <- c("black dog", "white dog", "angry dog", "dogs and cats are nice", "dog")
var2 <- c("white cat", "black cat", "tiny cat", NA, "cow")
var3 <- c("little mouse", "big mouse", NA, NA, "mouse")
data <- data.frame(var1, var2, var3)
The result should look like this when I am looking for dog, cat, and mouse:
dog&cat 4
mouse 3
We may try
v1 <- do.call(paste, data)
stack(setNames(lapply(c( "\\bdog.*\\bcat|\\bcat.*\\bdog", "mouse"),
\(pat) sum(grepl(pat, v1))), c("dog&cat", "mouse")))[2:1]
ind values
1 dog&cat 4
2 mouse 3
Or if we need all the combinations
lst1 <- lapply(c(strings, combn(strings, 2, FUN = \(x)
sprintf("\\b%1$s.*\\b%2$s|\\b%2$s.*\\b%1$s", x[1], x[2]))),
\(pat) sum(grepl(pat, v1)))
names(lst1) <- c(strings, combn(strings, 2, FUN = paste, collapse = "&"))
stack(lst1)[2:1]
ind values
1 dog 5
2 cat 4
3 mouse 3
4 dog&cat 4
5 dog&mouse 3
6 cat&mouse 2
For more combinations, it may be better to use Reduce, applying grepl individually for each pattern
lst1 <- lapply(1:3, \(n) {
vals <- colSums(combn(strings, n,
FUN = \(pats) Reduce(`&`, lapply(pats, \(pat) grepl(pat, v1)))))
nms <- combn(strings, n, FUN = paste, collapse = "&")
setNames(vals, nms)
})
stack(unlist(lst1))[2:1]
ind values
1 dog 5
2 cat 4
3 mouse 3
4 dog&cat 4
5 dog&mouse 3
6 cat&mouse 2
7 dog&cat&mouse 2
Or with tidyverse
library(dplyr)
library(stringr)
library(tidyr)
data %>%
unite(var, everything(), na.rm = TRUE, sep = " ") %>%
summarise(`dog&cat` = sum(str_detect(var,
"\\bdog.*\\bcat|\\bcat.*\\bdog")),
mouse = sum(str_detect(var, 'mouse'))) %>%
pivot_longer(everything())
-output
# A tibble: 2 × 2
name value
<chr> <int>
1 dog&cat 4
2 mouse 3
I have a df where one variable is an integer. I'd like to split this column into its individual digits. See my example below
Group Number
A 456
B 3
C 18
To
Group Number Digit1 Digit2 Digit3
A 456 4 5 6
B 3 3 NA NA
C 18 1 8 NA
We can use read.fwf from base R. Find the maximum number of characters (nchar) in the 'Number' column (mx). Read the 'Number' column after converting it to character (as.character), specify the widths as 1 replicated mx times, and assign the output to new 'Digit' columns in the data
mx <- max(nchar(df1$Number))
df1[paste0("Digit", seq_len(mx))] <- read.fwf(textConnection(
as.character(df1$Number)), widths = rep(1, mx))
-output
df1
# Group Number Digit1 Digit2 Digit3
#1 A 456 4 5 6
#2 B 3 3 NA NA
#3 C 18 1 8 NA
data
df1 <- structure(list(Group = c("A", "B", "C"), Number = c(456L, 3L,
18L)), class = "data.frame", row.names = c(NA, -3L))
Another base R option (I think @akrun's approach using read.fwf is much simpler)
cbind(
df,
with(
df,
type.convert(
`colnames<-`(do.call(
rbind,
lapply(
strsplit(as.character(Number), ""),
`length<-`, max(nchar(Number))
)
), paste0("Digit", seq(max(nchar(Number))))),
as.is = TRUE
)
)
)
which gives
Group Number Digit1 Digit2 Digit3
1 A 456 4 5 6
2 B 3 3 NA NA
3 C 18 1 8 NA
Using splitstackshape::cSplit
splitstackshape::cSplit(df, 'Number', sep = '', stripWhite = FALSE, drop = FALSE)
# Group Number Number_1 Number_2 Number_3
#1: A 456 4 5 6
#2: B 3 3 NA NA
#3: C 18 1 8 NA
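If the Digit1/Digit2/Digit3 names from the desired output are wanted, a small rename afterwards should do; a sketch (cSplit returns a data.table, and setnames renames it by reference):
out <- splitstackshape::cSplit(df, 'Number', sep = '', stripWhite = FALSE, drop = FALSE)
data.table::setnames(out, sub("^Number_", "Digit", names(out)))
out
# Group Number Digit1 Digit2 Digit3
#1: A 456 4 5 6
#2: B 3 3 NA NA
#3: C 18 1 8 NA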
Updated
I realized I could use the max function to get the character-count limit for each row, so I could include it in my map2 call and save some lines of code, thanks to an accident that led to an inspiration by dear @ThomasIsCoding.
library(dplyr)
library(tidyr)
library(purrr)
library(stringr)
df %>%
rowwise() %>%
mutate(map2_dfc(Number, 1:max(nchar(Number)), ~ str_sub(.x, .y, .y))) %>%
unnest(cols = !c(Group, Number)) %>%
rename_with(~ str_replace(., "\\.\\.\\.", "Digit"), .cols = !c(Group, Number)) %>%
mutate(across(!c(Group, Number), as.numeric))
# A tibble: 3 x 5
Group Number Digit1 Digit2 Digit3
<chr> <dbl> <dbl> <dbl> <dbl>
1 A 456 4 5 6
2 B 3 3 NA NA
3 C 18 1 8 NA
Data
df <- tribble(
~Group, ~Number,
"A", 456,
"B", 3,
"C", 18
)
Two base R methods:
no_cols <- max(nchar(as.character(df1$Number)))
# Using `strsplit()`:
cbind(df1, setNames(data.frame(do.call(rbind,
lapply(strsplit(as.character(df1$Number), ""),
function(x) {
length(x) <- no_cols
x
}
)
)
), paste0("Digit", seq_len(no_cols))))
# Using `regmatches()` and `gregexpr()`:
cbind(df1, setNames(data.frame(do.call(rbind,
lapply(regmatches(df1$Number, gregexpr("\\d", df1$Number)),
function(x) {
length(x) <- no_cols
x
}
)
)
), paste0("Digit", seq_len(no_cols))))
I want to aggregate one column (C) in a data frame according to one grouping variable A, separating the individual values by a comma, while keeping the other column B. However, B can either contain a character (which is always the same for all the rows) or be empty. In that case, I would like to keep the character whenever it is present in one of the rows of the group.
Here is a simplified example:
data <- data.frame(A = c(rep(111, 3), rep(222, 3)), B = c("", "", "", "a" , "", "a"), C = c(5:10))
data
Based on this question Collapse / concatenate / aggregate a column to a single comma separated string within each group, I have the following code:
library(dplyr)
data %>%
group_by(A) %>%
summarise(test = toString(C)) %>%
ungroup()
Here is what I would like to obtain:
A B C
1 111 5,6,7
2 222 a 8,9,10
Use summarise_all()
To keep all your columns, you can use summarise_all():
data %>%
group_by(A) %>%
summarise_all(toString)
# A tibble: 2 x 3
A B C
<dbl> <chr> <chr>
1 111 ", , " 5, 6, 7
2 222 a, , a 8, 9, 10
Edit for updated question
You can add a B column to summarise() to achieve the desired results:
data <- data.frame(A = c(rep(111, 3), rep(222, 3)), B = c("", "", "", "a" , "", "a"), C = c(5:10))
data
library(dplyr)
data %>%
group_by(A) %>%
summarise(B = names(sort(table(B),decreasing=TRUE))[1],
C = toString(C)) %>%
ungroup()
# A tibble: 2 x 3
A B C
<dbl> <chr> <chr>
1 111 "" 5, 6, 7
2 222 a 8, 9, 10
This will return the most frequent value in the B column (sort(table(B), decreasing = TRUE) puts the most frequent value first).
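For example, for group 222, where B is c("a", "", "a"), the expression picks the most frequent value:
names(sort(table(c("a", "", "a")), decreasing = TRUE))[1]
# [1] "a"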
Hope this helps.
You could write one function to return unique values
library(dplyr)
get_common_vars <- function(x) {
if(n_distinct(x) > 1) unique(x[x !='']) else unique(x)
}
and then use it on all the columns you are interested in:
data %>%
group_by(A) %>%
mutate(C = toString(C)) %>%
summarise_at(vars(B:C), get_common_vars)
# ^------ Include all columns here
# A tibble: 2 x 3
# A B C
# <dbl> <fct> <chr>
#1 111 "" 5, 6, 7
#2 222 a 8, 9, 10
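If you are on dplyr 1.0 or later, where the _at()/_all() verbs are superseded, the same idea can be written with across(); a sketch using the get_common_vars() helper above:
data %>%
  group_by(A) %>%
  mutate(C = toString(C)) %>%
  summarise(across(B:C, get_common_vars))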
You can also use the paste() function and leverage the collapse argument.
data %>%
group_by(A) %>%
summarise(
B = paste(unique(B), collapse = ""),
C = paste(C, collapse = ", "))
# A tibble: 2 x 3
A B C
<dbl> <chr> <chr>
1 111 "" 5, 6, 7
2 222 a 8, 9, 10
I have a dataset that has a column called QTY in which most of the values are already summed, but a few are several integers separated by commas. How can I replace those rows with the sums of the values?
I have:
ID Name QTY
1 Abc 2
2 Bac 3
3 Cba 2, 4, 5, 8
4 Bcb 4, 1
Desired result:
ID Name QTY
1 Abc 2
2 Bac 3
3 Cba 19
4 Bcb 5
I've tried messing around with for loops a bit and using ifelse(), but I can't quite figure it out.
This looks a bit ugly but should work, assuming column QTY is a character:
your_df$QTY_new <- sapply(strsplit(your_df$QTY, ", "), function(x) sum(as.numeric(x)))
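For example, on the data from the question (your_df is a hypothetical name here, with QTY stored as character):
your_df <- data.frame(ID = 1:4,
  Name = c("Abc", "Bac", "Cba", "Bcb"),
  QTY = c("2", "3", "2, 4, 5, 8", "4, 1"))
your_df$QTY_new <- sapply(strsplit(your_df$QTY, ", "), function(x) sum(as.numeric(x)))
your_df$QTY_new
# [1]  2  3 19  5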
Using a for loop, it could be done this way:
library(data.table)
data <- data.table(ID = 1:4,
Name = c("Abc", "Bac", "Cba", "Bcb"),
QTY = c("2", "3", "2, 4, 5, 8", "4, 1"),
QTY2 = numeric(4))
for(i in 1:nrow(data)){
data$QTY2[i] <- sum(as.numeric(unlist(strsplit(as.character(data$QTY[i]), ', '))))
}
and the resulting DF is:
ID Name QTY QTY2
1: 1 Abc 2 2
2: 2 Bac 3 3
3: 3 Cba 2, 4, 5, 8 19
4: 4 Bcb 4, 1 5
I made a function to solve your question; let me explain how it works:
sumInRow = function(row_value, split = ",") {
# 1. split the values
row_value = strsplit(row_value, split = split)
# 2. Convert them to numeric and sum
row_sum = sapply(row_value, function(single_row) {
single_row = as.numeric(single_row)
return(sum(single_row))
})
return(row_sum)
}
row_value will be a character vector by default, because of the commas.
Then, for each value, we need to split it:
row_value = strsplit(row_value, split = split)
It returns a list containing the split pieces for every element of row_value; don't worry, we'll use it next.
row_sum = sapply(row_value, function(single_row) {
single_row = as.numeric(single_row)
return(sum(single_row))
})
sapply() works as an iterator: for each element of the list, the function converts the pieces to numeric and returns their sum.
[EDIT_1]
To use it, you have to call:
sumInRow(<your data frame>$QTY)
I hope it helps you.
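For example, on the QTY values from the question (assuming they are stored as character):
sumInRow(c("2", "3", "2, 4, 5, 8", "4, 1"))
# [1]  2  3 19  5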
Here is one option with tidyverse. We split the 'QTY' column by the delimiter to expand the rows (separate_rows), group by 'ID' and 'Name', and get the sum of 'QTY'
library(tidyverse)
df1 %>%
separate_rows(QTY, convert = TRUE) %>%
group_by(ID, Name) %>%
summarise(QTY = sum(QTY))
# A tibble: 4 x 3
# Groups: ID [4]
# ID Name QTY
# <int> <chr> <int>
#1 1 Abc 2
#2 2 Bac 3
#3 3 Cba 19
#4 4 Bcb 5
data
df1 <- structure(list(ID = 1:4, Name = c("Abc", "Bac", "Cba", "Bcb"),
QTY = c("2", "3", "2, 4, 5, 8", "4, 1")), class = "data.frame", row.names = c(NA,
-4L))