Select columns based on string values in column content - r

I have a tibble and want to select only those columns that contain at least one value that matches a regular expression. It took me a while to figure out how to do this, so I'm sharing my solution here.
My use case: I want to select only those columns that include media filenames, from a tibble like the one below. Importantly, I don't know ahead of time what columns the tibble consists of, and whether or not there are any columns that include media filenames.
condition  picture   sound      video      description
A          cat.png   meow.mp3   cat.mp4    A cat
A          dog.png   woof.mp3   dog.mp4    A dog
B          NA        NA         NA         NA
B          bird.png  tjirp.mp3  tjirp.mp4  A bird
R code to reproduce tibble:
dat = structure(list(condition = c("A", "A", "B", "B"), picture = c("cat.png",
"dog.png", NA, "bird.png"), sound = c("meow.mp3", "woof.mp3",
NA, "tjirp.mp3"), video = c("cat.mp4", "dog.mp4", NA, "tjirp.mp4"
), description = c("A cat", "A dog", NA, "A bird")), row.names = c(NA,
-4L), class = c("tbl_df", "tbl", "data.frame"))

Solution:
> dat %>% select_if(~any(grepl("\\.png|\\.mp3|\\.mp4", .)))
# A tibble: 4 x 3
  picture  sound     video
  <chr>    <chr>     <chr>
1 cat.png  meow.mp3  cat.mp4
2 dog.png  woof.mp3  dog.mp4
3 NA       NA        NA
4 bird.png tjirp.mp3 tjirp.mp4
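In recent dplyr versions select_if() is superseded, and the same selection can be written with select() and where(). The following is a minimal sketch of that variant, assuming dplyr 1.0.0 or later; the names media_pattern and media_cols are only illustrative, and the pattern is anchored to the end of the string, which is slightly stricter than the one above.

```r
library(dplyr)

# Same data as above, rebuilt with tibble() for a self-contained example
dat <- tibble::tibble(
  condition   = c("A", "A", "B", "B"),
  picture     = c("cat.png", "dog.png", NA, "bird.png"),
  sound       = c("meow.mp3", "woof.mp3", NA, "tjirp.mp3"),
  video       = c("cat.mp4", "dog.mp4", NA, "tjirp.mp4"),
  description = c("A cat", "A dog", NA, "A bird")
)

# Illustrative pattern: a media extension at the end of the value
media_pattern <- "\\.(png|mp3|mp4)$"

# Keep a column if at least one of its values matches the pattern;
# grepl() returns FALSE for NA, so missing values do not count as matches
media_cols <- dat %>%
  select(where(~ any(grepl(media_pattern, .x))))

names(media_cols)
```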

Related

Replace string by match to another data frame

I want to replace the strings in column ID of df2 with the column genus of df1 based on the matching string in column species in df1. Any tips appreciated, especially dplyr. Maybe left_join?
> df1
             genus                 species
1  Orthobunyavirus           Variola virus
2 Alphatorquevirus     Torque teno virus 6
3     Yatapoxvirus Yaba-like disease virus
> df2
                       ID
1           Variola virus
2     Torque teno virus 6
3 Yaba-like disease virus
Desired output:
                ID
1  Orthobunyavirus
2 Alphatorquevirus
3     Yatapoxvirus
> dput(df1)
structure(list(genus = c("Orthobunyavirus", "Alphatorquevirus",
"Yatapoxvirus"), species = c("Variola virus", "Torque teno virus 6",
"Yaba-like disease virus")), class = "data.frame", row.names = c(NA,
-3L))
> dput(df2)
structure(list(ID = c("Variola virus", "Torque teno virus 6",
"Yaba-like disease virus")), class = "data.frame", row.names = c(NA,
-3L))
You could simply use match
df2$ID <- df1$genus[match(df2$ID, df1$species)]
df2
#> ID
#> 1 Orthobunyavirus
#> 2 Alphatorquevirus
#> 3 Yatapoxvirus
Note that
df2$ID <- df1$genus[match(df2$ID, df1$species)]
overwrites the ID column, so the original df2 values are lost, whereas
df3 <- data.frame(ID = df1$genus[match(df2$ID, df1$species)])
creates a third data frame with the results and leaves df2 untouched.
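Since the question asks about dplyr and left_join, here is a minimal sketch of that route. It assumes each species occurs only once in df1 (otherwise the join would duplicate rows); df3 is just an illustrative name for the result.

```r
library(dplyr)

df1 <- data.frame(
  genus   = c("Orthobunyavirus", "Alphatorquevirus", "Yatapoxvirus"),
  species = c("Variola virus", "Torque teno virus 6", "Yaba-like disease virus"),
  stringsAsFactors = FALSE
)
df2 <- data.frame(
  ID = c("Variola virus", "Torque teno virus 6", "Yaba-like disease virus"),
  stringsAsFactors = FALSE
)

# Join on df2$ID == df1$species, then keep only the genus, renamed to ID
df3 <- df2 %>%
  left_join(df1, by = c("ID" = "species")) %>%
  transmute(ID = genus)

df3
```

IDs without a matching species end up as NA, which is the same behaviour as the match() version.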

Pre-processing data in R: filtering and replacing using wildcards

Good day!
I have a dataset in which I have values like "Invalid", "Invalid(N/A)", "Invalid(1.23456)", lots of them in different columns and they are different from file to file.
Goal is to make script file to process different CSVs.
I tried read.csv and read_csv, but faced errors with data types or no errors, but no action either.
All columns are col_character except one - col_double.
Tried this:
is.na(df) <- startsWith(as.character(df, "Inval")
no luck
Tried this:
is.na(df) <- startsWith(df, "Inval")
no luck, some error about non char object
Tried this:
df %>%
mutate(across(everything(), .fns = ~str_replace(., "invalid", NA_character_)))
no luck
And other google stuff - no luck, again, errors with data types or no errors, but no action either.
So R is incapable of simple find and replace in data frame, huh?
Data frame example:
Output of dput(dtype_Result[1:20, 1:4])
structure(list(Location = c("1(1,A1)", "2(1,B1)", "3(1,C1)",
"4(1,D1)", "5(1,E1)", "6(1,F1)", "7(1,G1)", "8(1,H1)", "9(1,A2)",
"10(1,B2)", "11(1,C2)", "12(1,D2)", "13(1,E2)", "14(1,F2)", "15(1,G2)",
"16(1,H2)", "17(1,A3)", "18(1,B3)", "19(1,C3)", "20(1,D3)"),
Sample = c("Background0", "Background0", "Standard1", "Standard1",
"Standard2", "Standard2", "Standard3", "Standard3", "Standard4",
"Standard4", "Standard5", "Standard5", "Standard6", "Standard6",
"Control1", "Control1", "Control2", "Control2", "Unknown1",
"Unknown1"), EGF = c(NA, NA, "6.71743640129069", "2.66183193679533",
"16.1289784536322", "16.1289784536322", "78.2706654825781",
"78.6376213069722", "382.004087907716", "447.193928257862",
"Invalid(N/A)", "1920.90297258996", "7574.57784103579", "29864.0308009592",
"167.830723655146", "109.746615928611", "868.821939675054",
"971.158518683179", "9.59119569511596", "4.95543581398464"
), `FGF-2` = c(NA, NA, "25.5436745776637", NA, "44.3280630362038",
NA, "91.991708192168", "81.9459159768959", "363.563899234418",
"425.754478700876", "Invalid(2002.97340881547)", "2027.71958119836",
"9159.40221389147", "11138.8722428849", "215.58494072476",
"70.9775438699825", "759.798876479002", "830.582605561901",
"58.7007261370257", "70.9775438699825")), row.names = c(NA,
-20L), class = c("tbl_df", "tbl", "data.frame"))
The error is in the use of startsWith, which expects a character vector rather than a whole data frame. The following grepl solution is simpler and works column by column.
is.na(df) <- sapply(df, function(x) grepl("^Invalid", x))
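To see what that one-liner does, here is a small sketch on made-up data (the columns id, x and y are hypothetical, not taken from the question). sapply() builds a logical matrix with one column per data frame column, and is.na<- sets every TRUE cell to NA.

```r
# Hypothetical example data: one numeric column and two character columns
df <- data.frame(
  id = 1:3,
  x  = c("a", "Invalid(N/A)", "c"),
  y  = c("Invalid(1.23456)", NA, "e"),
  stringsAsFactors = FALSE
)

# Logical matrix, TRUE where a cell starts with "Invalid";
# grepl() returns FALSE for NA, so existing NAs are left alone
mask <- sapply(df, function(x) grepl("^Invalid", x))

# Set all flagged cells to NA in one step
is.na(df) <- mask

df
```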
The str_replace function is meant for editing the matched part of a string rather than for flagging whole values. Note also that the pattern "invalid" in your attempt is lowercase, so it does not match "Invalid" (matching is case-sensitive), and the result of the pipe is never assigned back to df. In addition, across(everything()) targets all of the columns, including the numeric one. The following code works, building on the tidyverse example you provided.
To fix it, use where to identify the columns of interest, then use if_else to overwrite the data with NA values when there is a partial string match, using str_detect to spot the target text.
Example data
library(tidyverse)
df <- tibble(
  id = 1:3,
  x = c("a", "invalid", "c"),
  y = c("d", "e", "Invalid/NA")
)
df
# A tibble: 3 x 3
     id x       y
  <int> <chr>   <chr>
1     1 a       d
2     2 invalid e
3     3 c       Invalid/NA
Solution
df <- df %>%
  mutate(
    across(where(is.character),
           .fns = ~if_else(str_detect(tolower(.x), "invalid"), NA_character_, .x))
  )
print(df)
Result
# A tibble: 3 x 3
     id x     y
  <int> <chr> <chr>
1     1 a     d
2     2 NA    e
3     3 c     NA

Create new column in string partial match-based dataframe without repeats

I have a dataframe with 2 columns GL and GLDESC and want to add a 3rd column called KIND based on some data that is inside of column GLDESC.
DF:
      GL      GLDESC
1 515100 Payroll-ISL
2 515900 Payroll-ICA
3 532300    Bulk Gas
4 551000   Supply AB
5 551000 Supply XPTO
6 551100   Supply AB
7 551300      Intern
For each row of the data table:
If GLDESC contains the word Payroll anywhere in the string then I want KIND to be Payroll.
If GLDESC contains the word Supply anywhere in the string then I want KIND to be Supply.
In all other cases I want KIND to be Other.
Then, I found this:
DF$KIND <- ifelse(grepl("supply", DF$GLDESC, ignore.case = TRUE), "Supply",
                  ifelse(grepl("payroll", DF$GLDESC, ignore.case = TRUE), "Payroll", "Other"))
With that, every row that matches Supply, for example, gets classified. However, as in rows 4 and 5 of DF, the same GL can have two Supply entries, and I don't need both classified. When the string is repeated for the same GL, I need only one of the matching GLDESC rows to be classified.
Edit: I cannot delete any rows. I want to have this as output:
GL  GLDESC    KIND
A   Supply1   Supply
A   Supply2   N/A
A   Supply3   N/A
A   Supply4   N/A
A   Supply5   N/A
A   Supply6   N/A
A   Payroll1  Payroll
B   Supply2   Supply
B   Payroll   Payroll
If the repeated elements should be NA, use duplicated on 'GLDESC' to get a logical vector and set the matching elements of 'KIND' (created with ifelse) to NA:
DF$KIND[duplicated(DF$GLDESC)] <- NA_character_
If we need to change the values by a grouping variable
library(dplyr)
DF %>%
  group_by(GL) %>%
  mutate(KIND = replace(KIND, duplicated(KIND) & KIND == "Supply", NA_character_))
# A tibble: 9 x 3
# Groups:   GL [2]
#   GL    GLDESC   KIND
#   <chr> <chr>    <chr>
# 1 A     Supply1  Supply
# 2 A     Supply2  <NA>
# 3 A     Supply3  <NA>
# 4 A     Supply4  <NA>
# 5 A     Supply5  <NA>
# 6 A     Supply6  <NA>
# 7 A     Payroll1 Payroll
# 8 B     Supply2  Supply
# 9 B     Payroll  Payroll
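Or with the full changes, starting from the raw data (DF1 below):
library(stringr)
DF1 %>%
  # Strip the trailing digits so "Supply1" becomes "Supply"; anything else is "Other"
  mutate(KIND = str_remove(GLDESC, "\\d+"),
         KIND = replace(KIND, !KIND %in% c("Supply", "Payroll"), "Other")) %>%
  group_by(GL) %>%
  mutate(KIND = replace(KIND, duplicated(KIND) & KIND == "Supply", NA_character_))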
data
DF1 <- structure(list(GL = c("A", "A", "A", "A", "A", "A", "A", "B",
"B"), GLDESC = c("Supply1", "Supply2", "Supply3", "Supply4",
"Supply5", "Supply6", "Payroll1", "Supply2", "Payroll")), row.names = c(NA,
-9L), class = "data.frame")
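The str_remove step above relies on the sample values being "Supply" or "Payroll" followed by digits. For descriptions like "Supply AB" or "Payroll-ISL" from the original question, a sketch that keeps the asker's grepl classification and then blanks repeated Supply rows per GL might look like this. The use of case_when and the name out are choices made for this illustration, not part of the answer above.

```r
library(dplyr)

DF1 <- data.frame(
  GL     = c("A", "A", "A", "A", "A", "A", "A", "B", "B"),
  GLDESC = c("Supply1", "Supply2", "Supply3", "Supply4", "Supply5",
             "Supply6", "Payroll1", "Supply2", "Payroll"),
  stringsAsFactors = FALSE
)

out <- DF1 %>%
  # Classify by partial, case-insensitive match anywhere in GLDESC
  mutate(KIND = case_when(
    grepl("supply",  GLDESC, ignore.case = TRUE) ~ "Supply",
    grepl("payroll", GLDESC, ignore.case = TRUE) ~ "Payroll",
    TRUE                                         ~ "Other"
  )) %>%
  # Within each GL, keep the first Supply and set later repeats to NA
  group_by(GL) %>%
  mutate(KIND = replace(KIND, duplicated(KIND) & KIND == "Supply", NA_character_)) %>%
  ungroup()

out
```

No rows are dropped; only the KIND value of repeated Supply rows is set to NA.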

Data wrangling - data spread over three rows - dplyr

I have a very untidy data set something like this
A tibble: 200000 x 2
ChatData
<chr>
1 Sep 30, 2018 7:12pm
2 Person A
3 Hello
4 Sep 30, 2018 7:11pm
5 Person B
6 Hello there
7 Sep 30, 2018 7:10pm
8 Person A
...
As you can see it goes date, person name, comment, and repeats.
I am working on the problem and have a very complex method that adds a score column depending on the names etc....
I would like to transform this into something like this
Person A     Person B
Hello        NA
NA           Hello there
how's you    NA
...
(The date as a row name or third column would be great but not essential to the question)
Optimally I am looking for a dplyr/tidyverse solution
I am working with lots of data so no slow for loops etc..
Raw data to work with:
structure(list(ChatData = c("Sep 30, 2018 7:12pm", "Person A", "Hello", "Sep 30, 2018 7:11pm", "Person B", "Hello there")), row.names = c(NA, -6L), class = c("tbl_df", "tbl", "data.frame"))
If anyone is wondering I am analysing facebook messenger data, and this is the form it comes in when you download it.
Thank you.
Your starting data set has only one column (aka feature), but three types of data are encoded in it for each message: a timestamp, the label of the person, and the message itself. It will be more useful to transform these into a table where each message is in its own row, and each column represents a different aspect of each observation, i.e. in long, or "tidy", format: https://cran.r-project.org/web/packages/tidyr/vignettes/tidy-data.html
In the approach below, the user first defines what features are repeated in the data set. I call them "headers" here, since I'm working toward a table where these are the column headers. Then the script adds that information to the data and converts the single-column data into a tidy format with one row per message, and one aspect of each message in each column.
Your requested output is a minor variation of this, addressed in the last line below: %>% spread(person, msg), which separates out the Person A and Person B data into separate columns.
library(tidyverse)
header_names <- c("timestamp", "person", "msg")
rows_per <- length(header_names)
data_length <- length(data$ChatData) / rows_per
data2 <- data %>%
  mutate(msg_number = rep(1:(nrow(data)/rows_per), each=rows_per),
         # This line repeats the header_names sequence for each msg
         header = rep(header_names, data_length)) %>%
  spread(header, ChatData) %>%
  mutate(timestamp = lubridate::mdy_hm(timestamp)) %>%
  spread(person, msg)
head(data2)
# A tibble: 2 x 4
  msg_number timestamp           `Person A` `Person B`
       <int> <dttm>              <chr>      <chr>
1          1 2018-09-30 19:12:00 Hello      NA
2          2 2018-09-30 19:11:00 NA         Hello there
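spread() still works but is superseded in current tidyr. The following sketch does the same reshaping with pivot_wider(), assuming tidyr 1.0.0 or later; the names chat, tidy_chat and wide_chat are illustrative, and the timestamp is left as text here to keep the example free of extra dependencies.

```r
library(dplyr)
library(tidyr)

# Sample data from the question
chat <- tibble::tibble(
  ChatData = c("Sep 30, 2018 7:12pm", "Person A", "Hello",
               "Sep 30, 2018 7:11pm", "Person B", "Hello there")
)

header_names <- c("timestamp", "person", "msg")
rows_per <- length(header_names)

# One row per message: number each block of three rows, label the rows
# within each block, then spread the labels into columns
tidy_chat <- chat %>%
  mutate(msg_number = (row_number() - 1) %/% rows_per + 1,
         header     = rep(header_names, times = n() / rows_per)) %>%
  pivot_wider(names_from = header, values_from = ChatData)

# Requested layout: one column per person, NA where the other person spoke
wide_chat <- tidy_chat %>%
  pivot_wider(names_from = person, values_from = msg)

wide_chat
```

This assumes the number of rows is an exact multiple of three, as in the original approach.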
As you basically just have a character vector that you would like to convert into a 3-column data.frame, one other option is to simply use matrix and specify ncol=3 and byrow=TRUE.
# your sample data
d <- structure(list(ChatData = c("Sep 30, 2018 7:12pm", "Person A", "Hello", "Sep 30, 2018 7:11pm", "Person B", "Hello there")), row.names = c(NA, -6L), class = c("tbl_df", "tbl", "data.frame"))
matrix( d$ChatData, ncol=3, byrow=TRUE,
        dimnames=list( NULL, c("date_time", "person", "message")) )
Result is a character matrix:
     date_time             person     message
[1,] "Sep 30, 2018 7:12pm" "Person A" "Hello"
[2,] "Sep 30, 2018 7:11pm" "Person B" "Hello there"
But you can wrap that in as.data.frame() to convert to a data.frame and continue working from there with dplyr if that's what you want.
Put it together for a whole solution:
It becomes a nice short, readable bit of code IMO:
library(dplyr)
library(lubridate)
result_df <-
  matrix( d$ChatData, ncol=3, byrow=TRUE,
          dimnames=list(NULL, c("date_time", "person", "message")) ) %>%
  as.data.frame() %>%
  mutate(date_time=lubridate::mdy_hm(date_time))
Here is one approach:
data %>%
  group_by(msg_number = rep(1:(nrow(data)/3), each=3)) %>%
  summarize(msg_data = list(ChatData)) %>%
  as.data.frame
  msg_number                                   msg_data
1          1       Sep 30, 2018 7:12pm, Person A, Hello
2          2 Sep 30, 2018 7:11pm, Person B, Hello there
This numbers each message and puts the data into a list column.

to find count of distinct values across two columns in r

I have two columns, both of character data type.
One column has plain strings and the other has strings wrapped in quotes.
I want to compare both columns and find the number of distinct names across the data frame.
string  f.string.name
john    NA
bravo   NA
NA      "john"
NA      "hulk"
Here the count should be 2, as john is common.
Somehow I am not able to remove the quotes from the second column, and I'm not sure why.
Thanks
The main problem I'm seeing are the NA values.
First, let's get rid of the quotes you mention.
dat$f.string.name <- gsub('["]', '', dat$f.string.name)
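Now, count the non-missing entries of each column that also appear in the other column. For the sample data this gives 2, since john is counted once per column.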
i1 <- complete.cases(dat$string)
i2 <- complete.cases(dat$f.string.name)
sum(dat$string[i1] %in% dat$f.string.name[i2]) + sum(dat$f.string.name[i2] %in% dat$string[i1])
DATA
dat <-
structure(list(string = c("john", "bravo", NA, NA), f.string.name = c(NA,
NA, "\"john\"", "\"hulk\"")), .Names = c("string", "f.string.name"
), class = "data.frame", row.names = c(NA, -4L))
library(stringr)
# df is the data frame from the question (called dat in the answer above)
table(str_replace_all(unlist(df), '["]', ''))
# bravo  hulk  john
#     1     1     2
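The question can be read in two ways, so here is a base R sketch that computes both readings separately: the number of distinct names over both columns, and the number of names shared by the two columns. The variable names clean, n_distinct_names and n_shared are made up for this example.

```r
dat <- data.frame(
  string        = c("john", "bravo", NA, NA),
  f.string.name = c(NA, NA, "\"john\"", "\"hulk\""),
  stringsAsFactors = FALSE
)

# Remove the literal double quotes; NA values stay NA
clean <- gsub('"', '', dat$f.string.name, fixed = TRUE)

# Reading 1: distinct non-missing names across both columns
all_names <- c(dat$string, clean)
n_distinct_names <- length(unique(all_names[!is.na(all_names)]))

# Reading 2: names that occur in both columns
n_shared <- length(intersect(dat$string[!is.na(dat$string)],
                             clean[!is.na(clean)]))

n_distinct_names
n_shared
```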
