Pull out column names of cells which match logical criteria - r

I've got a table such as this:
structure(list(Suggested.Symbol = c("CCT4", "DHRS2", "PMS2",
"FARSB", "RPL31", "ASNS"), p_onset = c(0.9378, 0.5983, 7.674e-10,
0.09781, 0.5495, 0.7841), p_dc14 = c(0.3975, 0.3707, 6.117e-17,
0.2975, 0.4443, 0.7661), p_tfc6 = c(0.2078, 0.896, 7.388e-19,
0.5896, 0.3043, 0.6696), p_tms30 = c(0.5724, 0.3409, 4.594e-13,
0.2403, 0.1357, 0.3422)), row.names = c(NA, 6L), class = "data.frame")
I'd like to create a new column called 'summary'. In it, on a row-wise basis, I'd like to return the column names of the cells with values < 0.05, comma-separated. Is that possible?

We can loop over the rows with apply: for each row, build a logical vector marking the values below 0.05, use it to subset the column names, and collapse the result into one comma-separated string with toString
df1$summary <- apply(df1[-1], 1, \(x) toString(names(x)[x < 0.05]))
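As a quick check, here is the same one-liner applied to the sample data from the question. This is only a sketch: the name df1 is taken from the answer, and function(x) is used in place of \(x) on the assumption that the code may have to run on an R version older than 4.1.

```r
df1 <- data.frame(
  Suggested.Symbol = c("CCT4", "DHRS2", "PMS2", "FARSB", "RPL31", "ASNS"),
  p_onset = c(0.9378, 0.5983, 7.674e-10, 0.09781, 0.5495, 0.7841),
  p_dc14  = c(0.3975, 0.3707, 6.117e-17, 0.2975, 0.4443, 0.7661),
  p_tfc6  = c(0.2078, 0.896, 7.388e-19, 0.5896, 0.3043, 0.6696),
  p_tms30 = c(0.5724, 0.3409, 4.594e-13, 0.2403, 0.1357, 0.3422)
)

# df1[-1] drops the symbol column so that only the numeric p-value columns are compared
df1$summary <- apply(df1[-1], 1, function(x) toString(names(x)[x < 0.05]))

# PMS2 (row 3) is the only row with values below 0.05, so it lists all four
# column names; every other row gets an empty string
df1$summary
```

Rows with no value below the threshold end up as "" rather than NA, because toString of an empty character vector is an empty string.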

Related

Unnest in R conditional on the cell's content

The main dataframe has a column "passings", which is its only nested variable. Each cell of that column holds a dataframe. The number of rows in these nested dataframes varies, but the columns are always the same: "date" and "title". For each row, I need to take the date whose title is "Закон прийнято" ("A passed law" in translation) and add it to the main dataframe as a new variable.
I'm a newbie in coding.
I will appreciate your help!
[Image: the main dataframe]
[Image: an example of a dataframe within a nested cell]
Here is one option: loop over the 'passings' list column with map (based on the image, each element is a two-column data.frame), filter the rows where 'title' is "Закон прийнято" (assuming exactly one such row per element), and pull the 'date' value to create a new 'date' column in the original dataset
library(dplyr)
library(purrr)
df1 %>%
  mutate(date = map_chr(passings, ~ .x %>%
    filter(title == "Закон прийнято") %>%
    pull(date)))
# id passed passings date
#1 54949 TRUE 2015-06-10, 2015-06-08, abcb, Закон прийнято 2015-06-08
#2 55009 TRUE 2015-06-10, 2015-09-08, bcb, Закон прийнято 2015-09-08
NOTE: It works as expected.
data
df1 <- structure(list(id = c(54949, 55009), passed = c(TRUE, TRUE),
passings = list(structure(list(date = c("2015-06-10", "2015-06-08"
), title = c("abcb", "Закон прийнято")), class = "data.frame", row.names = c(NA,
-2L)), structure(list(date = c("2015-06-10", "2015-09-08"
), title = c("bcb", "Закон прийнято")), class = "data.frame", row.names = c(NA,
-2L)))), row.names = c(NA, -2L), class = "data.frame")
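The map_chr call above relies on every nested data.frame having exactly one "Закон прийнято" row. If that cannot be guaranteed, a variant along the following lines may be more forgiving. It is a base R sketch, not the answer's method, and the third row of the data is hypothetical, added only to show the no-match case.

```r
# Hypothetical data: the third row has no "Закон прийнято" entry
df1 <- data.frame(id = c(54949, 55009, 55010))
df1$passings <- list(
  data.frame(date = c("2015-06-10", "2015-06-08"), title = c("abcb", "Закон прийнято")),
  data.frame(date = c("2015-06-10", "2015-09-08"), title = c("bcb", "Закон прийнято")),
  data.frame(date = "2015-07-01", title = "abcb")
)

# match() gives the position of the first matching title, or NA when none matches;
# indexing with NA returns NA, so vapply always receives a length-one value
df1$date <- vapply(
  df1$passings,
  function(p) as.character(p$date)[match("Закон прийнято", p$title)],
  character(1)
)

df1[c("id", "date")]
```

When a nested data.frame contains several matching rows, this keeps only the first one; whether that is the right choice depends on the data.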

Loop through columns and split fields automatically to new column

I have been comparing two data frames in R using a package called daff and this is the final table I get:
dput(df)
structure(list(v1 = c("Silva->Silva/Mark", "Brandon->Brandon/Livo", "Mango->Mango or Apple"),
v2 = c("James->James=Jacy","NA->Na/Jane", "Egg->Egg and Orange")),
class = "data.frame", row.names = c(NA, -3L))
Each cell contains -> (an arrow) to show that the value was modified from the previous data frame to the current one. From here I separated each column on the -> separator so that I have an old column and a new, changed column, adding the suffixes _Old and _New to the new columns. I used this code and got the output below:
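library(data.table)
library(magrittr)  # for %>%
setDT(df)
df1 <- lapply(names(df), function(x) {
  # split each column on the arrow, tolerating surrounding spaces
  mDT <- df[, tstrsplit(get(x), " *-> *")]
  if (ncol(mDT) == 2L) setnames(mDT, paste0(x, c("_Old", "_New")))
}) %>% as.data.table()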
OUTPUT
dput(df)
structure(list(v1_Old = c("Silva", "Brandon", "Mango"),
v1_New = c("Silva/Mark", "Brandon/Livo", "Mango or Apple"),
v2_Old = c("James","NA", "Egg"),
v2_New = c("James=Jacy","Na/Jane", "Egg and Orange")),
class = "data.frame", row.names = c(NA, -3L))
My next step is to compare each pair of columns with the _Old and _New suffixes, identify what was modified, and store it in new columns called diff_v1 and diff_v2. I did this with the code below (I realise I have to do this manually by writing a separate splitting line for each pair, which is tedious with over 20 separated columns):
df$diff_v1 <- mapply(function(x, y) paste(setdiff(y, x), collapse = '| '), strsplit(df$v1_Old, '\\||, | | -| \\+'), strsplit(df$v1_New, '\\||, | | -| \\+'))
df$diff_v2 <- mapply(function(x, y) paste(setdiff(y, x), collapse = '| '), strsplit(df$v2_Old, '\\||, | | -| \\+'), strsplit(df$v2_New, '\\||, | | -| \\+'))
OUTPUT
dput(df)
structure(list(v1_Old = c("Silva", "Brandon", "Mango"),
v1_New = c("Silva/Mark", "Brandon/Livo", "Mango or Apple"),
diff_v1 = c("/Mark", "/Livo", "or Apple"),
v2_Old = c("James","NA", "Egg"),
v2_New = c("James=Jacy","Na/Jane", "Egg and Orange"),
diff_v2 = c("=Jacy","/Jane", "and Orange")),
class = "data.frame", row.names = c(NA, -3L))
My question: can I loop through the columns with the _Old and _New suffixes and create the new columns diff_v1 and diff_v2 respectively, without running the code line by line? I have multiple columns, and they keep changing depending on the data frames I am comparing. I would like the code to identify the columns with the _Old and _New suffixes automatically and create the diff column for each pair, placed after the two columns it compares.
Currently I have to open the data frame, check which columns have old and new versions, and then manually change the code that splits them and creates the diff column.
We can identify the "Old" and "New" columns by name using grep. str_remove is vectorized over both string and pattern, so we can use it to remove from each "New" value the part that is also present in the matching "Old" value, which gives the new columns.
old_cols <- grep("Old$", names(df), value = TRUE)
new_cols <- grep("New$", names(df), value = TRUE)
df[sub("New$", "diff", new_cols)] <- Map(stringr::str_remove,
df[new_cols], df[old_cols])
To get the names in order, we can do
df <- df[order(sub("_.*", "", names(df)))]
df
# v1_Old v1_New v1_diff v2_Old v2_New v2_diff
#1 Silva Silva/Mark /Mark James James=Jacy =Jacy
#2 Brandon Brandon/Livo /Livo NA Na/Jane Na/Jane
#3 Mango Mango or Apple or Apple Egg Egg and Orange and Orange
Using tidyverse, we can do
library(tidyverse)
df %>%
bind_cols(map2(df %>% select(ends_with("New")),
df %>% select(ends_with("Old")), stringr::str_remove))
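One caveat with the approach above: str_remove interprets each "Old" value as a regular expression, so values containing characters such as "+" or "(" could behave unexpectedly. The following base R sketch removes the old value as a literal string instead. The helper remove_literal and the variable diff_cols are names introduced here for illustration, and the trimws call is an added assumption that leading spaces in the difference are not wanted.

```r
df <- data.frame(
  v1_Old = c("Silva", "Brandon", "Mango"),
  v1_New = c("Silva/Mark", "Brandon/Livo", "Mango or Apple"),
  v2_Old = c("James", "NA", "Egg"),
  v2_New = c("James=Jacy", "Na/Jane", "Egg and Orange"),
  stringsAsFactors = FALSE
)

# Pair the columns by name rather than by position
old_cols  <- grep("_Old$", names(df), value = TRUE)
new_cols  <- sub("_Old$", "_New", old_cols)
diff_cols <- sub("_Old$", "_diff", old_cols)

# Hypothetical helper: drop the first literal occurrence of `old` from `new`
remove_literal <- function(new, old) {
  out <- mapply(function(n, o) sub(o, "", n, fixed = TRUE),
                new, old, USE.NAMES = FALSE)
  trimws(out)
}

df[diff_cols] <- Map(remove_literal, df[new_cols], df[old_cols])

# Group each diff column with its Old/New pair
df <- df[order(sub("_.*", "", names(df)))]
df
```

Deriving new_cols from old_cols keeps each pair aligned even if the columns are not stored in matching order.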

How to identify the position of string element stored in one column in a vector stored in another column of a tibble?

I have a tibble with two columns named ID and VEC.
ID stores a specific string, whereas VEC stores a vector including the string that is stored in column ID.
I would like to identify the position of the string in ID in the vector stored in VEC for each specific row.
Usually, when just looking for a string in a vector, I would do this:
which(ID == VEC), which would return the position.
However, whenever I try to do this inside mutate, R returns an error.
df <- structure(list(ID = 1:7, VEC = list(1:7, 1:7, 1:7, 1:7, 1:7,
1:7, 1:7)), row.names = c(NA, -7L), class = c("tbl_df", "tbl",
"data.frame"))
df %>%
mutate(POS = which(ID == VEC))
I would like to add a new column with the position of the string in ID within the vector stored in VEC.
Unfortunately I get this error message:
Error in mutate_impl(.data, dots) :
Evaluation error: (list) object cannot be coerced to type 'integer'.
Is there any way to do this using mutate?
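No answer is recorded for this question, so the following is only one possible approach, sketched in base R. The comparison ID == VEC fails because VEC is a list column, so the lookup has to be done one row at a time. The sketch uses match, which returns the first position of the ID in the row's vector, or NA if it is absent.

```r
df <- data.frame(ID = 1:7)
df$VEC <- rep(list(1:7), 7)

# Compare each ID with the vector stored in the same row
df$POS <- mapply(function(id, vec) match(id, vec), df$ID, df$VEC)

df$POS
```

Inside mutate, the same idea should work with a row-wise mapper such as purrr::map2_int(ID, VEC, ~ match(.x, .y)).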

Cleaning xlsx files

I am trying to wrangle messy large datasets from xlsx sheets. The table structures are such that the column headers are a combination of three rows.
I am using RStudio and trying to write a function that fills each empty cell with the value from the previous filled cell and finally concatenates the header rows into one column header joined with underscores: e.g. Employment, Number, Males on three different rows should become Employment_Number_Male.
Any suggestions?
Please see the sample xlsx table I am working with.
Taking this data.frame:
df <- data.frame(..1 = c("year", NA, NA),
..2 = c(NA, "males", "all"),
..3 = c(NA, NA, "half"),
..4 = c(NA, NA, "some"),
..5 = c(NA, "females", "all"),
..6 = c(NA, NA, "half"),
..7 = c(NA, NA, "some"))
Here is an attempt to convert empty cells to NAs:
library(dplyr)
# convert empty cells to NA
empty_as_na <- function(x) {
  if (is.factor(x)) x <- as.character(x)  # ifelse won't work with factors
  ifelse(as.character(x) != "", x, NA)
}
# transform all columns (across() replaces the deprecated mutate_each/funs)
df <- df %>% mutate(across(everything(), empty_as_na))
# for rows that are entirely NA, copy the row above
na.rows <- which(apply(df, 1, function(z) all(is.na(z))))
df[na.rows, ] <- df[na.rows - 1, ]
The issue is filling each empty cell with the value of the cell beside it.
a reprex render
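No answer is recorded for this question. As a rough sketch of the fill-then-concatenate idea, the base R code below carries each header label to the right until the next label appears, then joins the rows column by column. The header matrix hdr is hypothetical, modelled on the Employment / Number / Males example, and the sketch assumes that a blank header cell always belongs to the nearest label on its left.

```r
# Hypothetical three-row header, as it might look after reading merged cells
hdr <- rbind(
  c("Employment", NA, NA, NA),
  c("Number", NA, "Rate", NA),
  c("Males", "Females", "Males", "Females")
)

# Carry the last non-NA value to the right within one header row
fill_right <- function(v) {
  for (i in seq_along(v)[-1]) {
    if (is.na(v[i])) v[i] <- v[i - 1]
  }
  v
}

# apply() over rows returns the result transposed, so flip it back
filled <- t(apply(hdr, 1, fill_right))

# Join the non-missing parts of each column with underscores
new_names <- apply(filled, 2, function(col) paste(col[!is.na(col)], collapse = "_"))
new_names
```

If a top-level label should not extend over every column to its right, as with "year" in the sample data above, the fill would need a stopping rule that this sketch does not attempt.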

If in a data.frame

I would like to choose between the values of two columns in the same row, depending on the values in other columns.
My function would work like this: if the values in shapiro1, shapiro2 and F_test are less than 0.05, choose the value in t_test; otherwise choose the wilcox value. Does it seem possible to write a function like this and apply it to a larger set of columns?
structure(list(modalities = structure(1:3, .Label = c("BS1",
"HW1", "PG"), class = "factor"), shapiro1 = c(0.0130672654432492,
0.305460485386201, 0.148320635833262), shapiro2 = c(0.920315823302857,
0.1354174735521, 0.148320635833262), F_test = c(0.20353475323665,
0.00172897172228584, 1), t_test = c(2.88264982135322e-06, 5.75374264225996e-05,
NaN), wilcox = c(0.00909069801592506, 0.00902991076269246, NaN
)), class = "data.frame", row.names = c(NA, -3L))
You could select the relevant columns, use rowSums to count how many values in each row are below 0.05, and pick the t_test value when at least one is, otherwise the wilcox value.
cols <- c("shapiro1", "shapiro2", "F_test")
ifelse(rowSums(df[cols] < 0.05) > 0, df$t_test, df$wilcox)
#[1] 2.882650e-06 5.753743e-05 NaN
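The question can be read as requiring either any or all of the three values to be below 0.05, and the answer implements the "any" reading. The sketch below shows both on the question's data so the difference is visible. The column names p_any and p_all are introduced here for illustration only.

```r
df <- data.frame(
  modalities = c("BS1", "HW1", "PG"),
  shapiro1 = c(0.0130672654432492, 0.305460485386201, 0.148320635833262),
  shapiro2 = c(0.920315823302857, 0.1354174735521, 0.148320635833262),
  F_test   = c(0.20353475323665, 0.00172897172228584, 1),
  t_test   = c(2.88264982135322e-06, 5.75374264225996e-05, NaN),
  wilcox   = c(0.00909069801592506, 0.00902991076269246, NaN)
)

cols  <- c("shapiro1", "shapiro2", "F_test")
below <- df[cols] < 0.05  # logical matrix, one column per test

# "any" reading: at least one of the three values is below 0.05
df$p_any <- ifelse(rowSums(below) > 0, df$t_test, df$wilcox)

# "all" reading: every one of the three values is below 0.05
df$p_all <- ifelse(rowSums(below) == length(cols), df$t_test, df$wilcox)

df[c("modalities", "p_any", "p_all")]
```

On this data the first two rows take the t_test value under the "any" reading and the wilcox value under the "all" reading; the third row is NaN either way.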
