Iterating (Looping) through columns of a dataframe in R - r

I am struggling in R and hope that someone can help me out. I am trying to write a for loop to iterate over the columns of a data frame, but unfortunately, I am not successful.
So here is my problem:
I have 10 data frames (dt1, dt2 ,dt3,…,dt10). For example, dt1 looks like this:
dt1<-data.frame(Topic1=c(1,2,3,4,5,6,7,8,9),Topic2=c(9,8,7,6,5,4,3,2,1), Topic3=c(1,9,2,8,3,7,4,6,5), Name=c("A","A","A","A","A","B","B","B","B"))
I want to check whether the Name variable still contains both “A” and “B” after I filter Topic1 (then Topic2, Topic3, …) to values greater than 5. At the moment, I do the following:
library(dplyr)
dt.new<-dt1 %>% filter(Topic1>5)
isTRUE("A" %in% dt.new$Name && "B" %in% dt.new$Name)
At the end of the day, for each data frame, I want to have a new table (data frame) that looks like this:
result<-data.frame(Topic=c("Topic1","Topic2","Topic3"),Return=c("FALSE","FALSE","TRUE"))
The problem is that I have several data frames (dt1, dt2, …), each of which has more than 50 variables (Topic1, …, Topic50).
I have written and tried some loops so far, but unfortunately without success. I would therefore be happy to receive any hint or tip.
Thank you very much!

One option is to group by 'Name' and summarise the columns whose names start with 'Topic' by checking whether any value is greater than 5. Then use gather (superseded in newer tidyr; use pivot_longer instead) to convert from 'wide' to 'long', group by the 'Topic' column, and summarise by checking whether all the 'val' elements are TRUE.
library(dplyr)
library(tidyr)
dt1 %>%
group_by(Name) %>%
summarise_at(vars(starts_with('Topic')), ~ any(. > 5)) %>%
gather(Topic, val, -Name) %>%
group_by(Topic) %>%
summarise(Return = all(val))
# A tibble: 3 x 2
# Topic Return
# <chr> <lgl>
#1 Topic1 FALSE
#2 Topic2 FALSE
#3 Topic3 TRUE
Or reshape it to 'long' format first and then do the summarisation:
dt1 %>%
pivot_longer(cols = -Name, names_to = "Topic") %>%
filter(value > 5) %>%
group_by(Topic) %>%
summarise(result = n_distinct(Name) == 2)
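Since the question involves several data frames, the same check can be wrapped in a function and applied to a list of them. The following is only a minimal base R sketch: the helper name check_topics, its threshold argument, and the list dt_list are made up for illustration, and it assumes every data frame has a Name column and topic columns whose names start with "Topic".

```r
dt1 <- data.frame(Topic1 = c(1, 2, 3, 4, 5, 6, 7, 8, 9),
                  Topic2 = c(9, 8, 7, 6, 5, 4, 3, 2, 1),
                  Topic3 = c(1, 9, 2, 8, 3, 7, 4, 6, 5),
                  Name = c("A", "A", "A", "A", "A", "B", "B", "B", "B"))

# Hypothetical helper: for each Topic column, test whether every Name
# that occurs in the data still occurs among rows above the threshold.
check_topics <- function(dt, threshold = 5) {
  topic_cols <- grep("^Topic", names(dt), value = TRUE)
  all_names <- unique(dt$Name)
  keeps_all <- vapply(topic_cols, function(col) {
    all(all_names %in% dt$Name[dt[[col]] > threshold])
  }, logical(1))
  data.frame(Topic = topic_cols, Return = unname(keeps_all))
}

# In practice the list would hold dt1, dt2, ..., dt10.
dt_list <- list(dt1 = dt1)
results <- lapply(dt_list, check_topics)
results$dt1
```

Each element of results is one table per input data frame, so nothing has to be copied by hand for dt2 onwards.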

Related

R - Identifying only strings ending with A and B in a column

I have a column in a data frame in R that contains sample names. Some names are identical except that they end in A or B at the end, and some samples repeat themselves, like this:
df <- data.frame(Samples = c("S_026A", "S_026B", "S_028A", "S_028B", "S_038A", "S_040_B", "S_026B", "S_38A"))
What I am trying to do is to isolate all sample names that have an A and B at the end and not include the sample names that only have either A or B.
The end result of what I'm looking for would look like this:
"S_026" and "S_028" as these are the only ones that have A and B at the end.
All I seem to find is how to remove duplicates, and removing duplicates would only give me "S_026B" and "S_38A" in this case.
Alternatively, I have tried stripping the A's and B's at the end and then counting how often each stripped name occurs (keeping those with a count > 2), but again, this does not give me the desired results.
Any suggestions?
We could group by the sample name without its last character, use substring to get the last character, and keep only the groups where both 'A' and 'B' occur among those last characters
library(dplyr)
df %>%
group_by(grp = substr(Samples, 1, nchar(Samples)-1)) %>%
filter(all(c("A", "B") %in% substring(Samples, nchar(Samples)))) %>%
ungroup %>%
select(-grp)
-output
# A tibble: 5 x 1
Samples
<chr>
1 S_026A
2 S_026B
3 S_028A
4 S_028B
5 S_026B
You can extract the last character of Samples into a separate column, keep only the groups that have both 'A' and 'B', and then keep only the unique values.
library(dplyr)
library(tidyr)
df %>%
extract(Samples, c('value', 'last'), '(.*)(.)') %>%
group_by(value) %>%
filter(all(c('A', 'B') %in% last)) %>%
ungroup %>%
distinct(value)
# value
# <chr>
#1 S_026
#2 S_028
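The same idea can also be sketched in base R, without dplyr or tidyr. This is just an illustration; the variable names stem, suffix and has_both are made up here, and it assumes the suffix of interest is always the final character of the sample name.

```r
df <- data.frame(Samples = c("S_026A", "S_026B", "S_028A", "S_028B",
                             "S_038A", "S_040_B", "S_026B", "S_38A"))

samples <- as.character(df$Samples)
# Everything except the last character, and the last character itself.
stem <- substr(samples, 1, nchar(samples) - 1)
suffix <- substring(samples, nchar(samples))

# For each stem, check whether both "A" and "B" occur among its suffixes.
has_both <- vapply(split(suffix, stem),
                   function(s) all(c("A", "B") %in% s),
                   logical(1))
both_names <- names(has_both)[has_both]
both_names
```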

R: What is the expected output of passing a character vector to dplyr::all_of()?

I am trying to understand the expected output of dplyr::group_by() in conjunction with the use of dplyr::all_of(). My understanding is that using dplyr::all_of() should convert character vectors containing variable names to the bare names so that group_by() can use them, but this doesn't appear to happen.
Below, I generate some fake data, pass different objects to group_by() with(out) all_of() and calculate the number of observations in each group. In the example, passing a single bare column name without dplyr::all_of() produces the correct output: one row per unique value of the column. However, passing character vectors or using dplyr::all_of() produces incorrect output: one row regardless of the number of values in a column.
What is expected when using all_of and how might I alternatively pass a character vector to group_by to process as a vector of bare names?
library(dplyr)
# Create a 40-row data.frame (var is recycled to match bar) with
# 2 variables each with 2 unique values.
df <- data.frame(var = rep(c("a", "b"), 10),
bar = rep(c(1, 2), 20))
# Output 1: 2x2 tibble - GOOD
df %>% group_by(var) %>% summarize(n = n())
# Output 2: 1x2 tibble - BAD
foo <- "var"
df %>% group_by(all_of(foo)) %>% summarize(n = n())
# Output 3: 1x2 tibble
df %>% group_by("var") %>% summarize(n = n())
# Output 4: Error in_var not found - BAD
foo2 <- list("var", "bar")
lapply(foo2, function(in_var) {
df %>%
group_by(in_var) %>%
summarize(n = n())
})
# Output 5: list of length 2 where
# each element is a 1x2 tibble - BAD
foo2 <- list("var", "bar")
lapply(foo2, function(in_var) {
df %>%
group_by(all_of(in_var)) %>%
summarize(n = n())
})
We can use group_by_at
lapply(foo2, function(in_var) df %>%
group_by_at(all_of(in_var)) %>%
summarise(n = n()))
-output
#[[1]]
# A tibble: 2 x 2
# var n
#* <chr> <int>
#1 a 20
#2 b 20
#[[2]]
# A tibble: 2 x 2
# bar n
#* <dbl> <int>
#1 1 20
#2 2 20
As across replaces some of the functionality of group_by_at, we can use it instead with all_of:
lapply(foo2, function(in_var) df %>%
group_by(across(all_of(in_var))) %>%
summarise(n = n()))
Or convert to symbol and evaluate (!!)
lapply(foo2, function(in_var) df %>%
group_by(!! rlang::sym(in_var)) %>%
summarise(n = n()))
Or use map
library(purrr)
map(foo2, ~ df %>%
group_by(!! rlang::sym(.x)) %>%
summarise(n = n()))
Or instead of group_by, it can be count
map(foo2, ~ df %>%
count(across(all_of(.x))))
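As a rough illustration of why passing a plain string gives a single row: group_by() treats a quoted string as an expression to compute, so it creates one constant column and every row lands in the same group. The sketch below contrasts that with the .data pronoun, which looks the column up by name. The object names by_string and by_column are made up for this example.

```r
library(dplyr)

df <- data.frame(var = rep(c("a", "b"), 10),
                 bar = rep(c(1, 2), 20))

# The string is evaluated as a constant, so there is only one group.
by_string <- df %>% group_by("var") %>% summarise(n = n())
nrow(by_string)

# .data[["var"]] refers to the column called var, so there are two groups.
in_var <- "var"
by_column <- df %>% group_by(.data[[in_var]]) %>% summarise(n = n())
nrow(by_column)
```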
To add to @akrun's answer showing multiple ways to achieve the desired output: my understanding of all_of() is that it is a selection helper for variables stored as a character vector, meant for dplyr functions that select columns, and that it uses vctrs underneath. any_of() is a less strict version of all_of() that is convenient in some use cases.
Reading ?tidyselect::all_of is helpful. This page is also helpful for keeping up with changes in dplyr and tidy evaluation: https://dplyr.tidyverse.org/articles/programming.html.
The scoped dplyr verbs are being superseded by across(), based on decisions by the devs at RStudio. See ?group_by_at or the other *_if, *_at, *_all documentation. So I guess it really depends on which version of dplyr you are using in your workflow and what works best for you.
This SO post also gives context of changes in solutions over time with passing characters into dplyr functions, and there's probably more posts out there.

Determine the size of string in a particular cell in dataframe: R

In a data frame, I have a column (type: chr) that contains answers separated by a comma. I want to create another column based on the size of the string and award points. For example, some of the entries in a column are:
Column1
word1,word2,word3
word1,word2
word1
Now, for the first cell, I want the size of the cell to be evaluated as 3 (as it contains three distinct words and there are no duplicates in the cell values). I'm not sure how to achieve this.
An option is to split the column with strsplit into a list of vectors, get the unique elements by looping over the list with lapply and get the lengths
df1$Size <- lengths(lapply(strsplit(df1$Column1, ",\\s*"), unique))
Another option is separate_rows from tidyr
library(dplyr)
library(tidyr)
df1 %>%
mutate(rn = row_number()) %>%
separate_rows(Column1) %>%
group_by(rn) %>%
summarise(Size = n_distinct(Column1), .groups = 'drop') %>%
select(Size) %>%
bind_cols(df1, .)
-output
# Column1 Size
#1 word1,word2,word3 3
#2 word1,word2 2
#3 word1 1
data
df1 <- data.frame(Column1 = c('word1,word2,word3', 'word1,word2', 'word1'))
Original Answer:
Another option:
library(dplyr)
library(stringr)
df %>%
mutate(Lengths = str_count(Column1, ",") + 1)
Edit:
I hadn't read the OP's requirements properly (about non-duplicates). As @Onyambu pointed out in the comments, this chunk only works if there are no duplicated words in the data.
It basically counts how many words there are.
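To make the difference concrete, here is a small base R sketch comparing a plain word count with a count of distinct words. The vector answers is made-up example data containing one duplicated word, and the comma count is done with gregexpr instead of stringr so that the example has no package dependencies.

```r
# Made-up example data; the second entry repeats word1.
answers <- c("word1,word2,word3", "word1,word1,word2", "word1")

# Number of words = number of commas + 1 (duplicates are counted).
n_words <- lengths(regmatches(answers, gregexpr(",", answers, fixed = TRUE))) + 1

# Number of distinct words after splitting on commas.
n_unique <- lengths(lapply(strsplit(answers, ",\\s*"), unique))

data.frame(answers, n_words, n_unique)
```

The two counts agree whenever a cell has no repeated words and differ only for the entry with the duplicate.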

Sum columns based on index in a different data frame in R

I have two data frames similar to this:
df<-data.frame("A1"=c(1,2,3), "A2"=c(3,4,5), "A3"=c(6,7,8), "B1"=c(3,4,5))
ref_df<-data.frame("Name"=c("A1","A2","A3","B1"),code=c("Blue" ,"Blue","Green","Green"))
I would like to sum the values in the columns of df based on the code in the ref_df. I would like to store the results in a new data frame with column names matching the code in the ref_df
i.e. I would like a new data frame with Blue and Green as columns and the values representing the sums A1+A2 and A3+B1, respectively. Like the one here:
result<-data.frame("Blue"=c(4,6,8), "Green"=c(9,11,13))
There are lots of posts on summing columns based on conditions, but after a morning of research I cannot find anything that solves my exact problem.
We can split the columns in df based on values in ref_df$code and then take row-wise sum.
sapply(split.default(df, ref_df$code), rowSums)
# Blue Green
#[1,] 4 9
#[2,] 6 11
#[3,] 8 13
If the rows of ref_df do not follow the same order as the column names in df, reorder them first.
ref_df <- ref_df[match(names(df), ref_df$Name),]
We can use tidyverse
library(dplyr)
library(tidyr)
df %>%
mutate(rn = row_number()) %>%
pivot_longer(cols = -rn, names_to = 'Name') %>%
left_join(ref_df, by = 'Name') %>%
group_by(code, rn) %>%
summarise(Sum = sum(value), .groups = 'drop') %>%
pivot_wider(names_from = code, values_from = Sum) %>%
select(-rn)
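A base R variant that looks the columns up by name, so that the row order of the reference table does not matter, could look roughly like this. The shuffled reference table ref_shuffled and the names cols_by_code and result are made up for illustration.

```r
df <- data.frame(A1 = c(1, 2, 3), A2 = c(3, 4, 5),
                 A3 = c(6, 7, 8), B1 = c(3, 4, 5))

# Reference table deliberately in a different order than names(df).
ref_shuffled <- data.frame(Name = c("B1", "A3", "A1", "A2"),
                           code = c("Green", "Green", "Blue", "Blue"))

# List of column names per code, e.g. Blue -> A1, A2.
cols_by_code <- split(as.character(ref_shuffled$Name), ref_shuffled$code)

# Row-wise sum over the columns belonging to each code.
result <- as.data.frame(lapply(cols_by_code, function(cols) rowSums(df[cols])))
result
```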

How can I select the values that only appear once in the column of a data table in R?

Like the title, the question is very straightforward. (pardon my ignorance)
I have a column, character type, in a data table.
And there are several different words/values stored, some of them only appear once, others appear multiple times.
How can I select out the ones that only appear once??
Any help is appreciated! Thank you!
One option would be to do a group by and then select the groups having single row
library(data.table)
dt1 <- dt[, .SD[.N == 1], .(col)]
library(dplyr)
df %>%
group_by(column) %>%
dplyr::filter(n() == 1) %>%
ungroup()
Example:
data = tibble(text = c("a","a","b","c","c","c"))
data %>%
group_by(text) %>%
dplyr::filter(n() == 1) %>%
ungroup()
# A tibble: 1 x 1
text
<chr>
1 b
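For a plain character vector, the same selection can be sketched in base R with duplicated(), scanning from both ends so that every copy of a repeated value is flagged. The vector text and the name once are made up for this example.

```r
text <- c("a", "a", "b", "c", "c", "c")

# A value is repeated if it is a duplicate when scanning from the front
# or from the back; keep only the values that are neither.
is_repeated <- duplicated(text) | duplicated(text, fromLast = TRUE)
once <- text[!is_repeated]
once
```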
