Finding and replacing strings in cells of a matrix in R - r

I'm trying to process a survey in which one of the questions asks the respondents to name a friend. Now I have a matrix like this:
I want to save these results in a relational database. I have assigned every person a unique ID and want the answers to be saved as a list of IDs, so that the table looks like this:
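A minimal reconstruction of the data (names and IDs taken from the answer's output further down, so treat them as assumptions) might look like:
library(tibble)

# Assumed input: friends recorded as free-text names
df <- tibble(
  id      = c("a1d", "b2e", "c3f"),
  name    = c("John", "Anna", "Denise"),
  friends = c("Anna, Denise", "John", "")
)

# Desired result: friends stored as a comma-separated list of IDs, e.g.
#   id  name   friends
#   a1d John   b2e,c3f
#   b2e Anna   a1d
#   c3f Denise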
My code so far:
I've tried
df$name %in% df$friends
which did not give any results. I'm now trying to use a for loop with str_detect:
friends <- df$friends
names <- df$name

for (i in 1:length(names)) {
  friends_called <- str_detect(friends, names[i])
  id_index <- grep(names[i], df$name)
  id <- df$id[id_index]
  for (j in 1:length(friends_called)) {
    if (friends_called[j] == TRUE) {
      df$friends_id[j] <- paste(df$friends_id[j], id, ",", sep = "")
    }
  }
}
df$friends <- df$friends_id
But I have some issues with it:
It's not working.
It uses two loops, which I'm used to from writing Python, but I've read that I should avoid them in R.
The string matching needs to be fuzzy (if Anna wrote "Jon" instead of "John", it should still match).
Does anyone have suggestions on how to tackle this?

You can do this without a loop in tidyverse as follows:
df %>%
  mutate(friends = map(friends, ~ df %>%
                         filter(str_detect(.x, name)) %>%
                         select(id) %>%
                         unlist() %>%
                         paste(collapse = ',')))
gives
id name friends
1 a1d John b2e,c3f
2 b2e Anna a1d
3 c3f Denise
or with base R you can use sapply:
df$friends <- sapply(df$friends, function(x) paste(df$id[str_detect(x, df$name)], collapse = ','))
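The fuzzy-matching requirement isn't covered above. As a rough sketch, base R's agrepl() does approximate (edit-distance) matching, so "Jon" can still hit "John"; the 0.2 tolerance is an assumption you may need to tune:
# Allow up to ~20% of characters to differ between each name and the friends text
df$friends_id <- sapply(df$friends, function(x) {
  hits <- sapply(df$name, function(nm) agrepl(nm, x, max.distance = 0.2))
  paste(df$id[hits], collapse = ",")
})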

Related

How to run a loop creating new columns?

I have a dataset with columns that contain a code plus a name, which I would like to separate into two columns. Just as an example:
Column E5000_A contains values like '0080002. ALB - Democratic Party' in one cell; I would like two columns, one containing the code 0080002 and the other containing the rest of the information.
I have 8 more columns with very similar values (E5000_A through E5000_H). This is the code that I am writing:
cols2 <- c("E5000_A", "E5000_B", "E5000_C", "E5000_D",
           "E5000_E", "E5000_F", "E5000_G", "E5000_H")

for (i in cols2) {
  cses_imd_m <- cses_imd_m %>% mutate(substr(i, 1L, 7L))
}
But for some reason it only generates a new column for E5000_A, and the loop does not move on to the other variables. What am I doing wrong? Let me know if you need more details about the code or data frame.
data.frame approach
# to extract codes
df %>%
  mutate_at(.vars = vars(c("E5000_A", "E5000_B", "E5000_C", "E5000_D", "E5000_E",
                           "E5000_F", "E5000_G", "E5000_H")),
            .funs = function(x) str_extract(x, "^\\d+"))
You can also use across() inside of mutate().
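For example, a sketch of the same extraction with across(), assuming all the target columns share the E5000_ prefix and that dplyr and stringr are loaded:
df %>%
  mutate(across(starts_with("E5000_"),
                ~ str_extract(.x, "^\\d+"),
                .names = "code_{.col}"))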
If you want to use a for loop:
col_names <- c("E5000_A", "E5000_B", "E5000_C", "E5000_D", "E5000_E", "E5000_F", "E5000_G", "E5000_H")

for (i in col_names) {
  df[, sprintf("code_%s", i)]  <- str_extract(df[, i], "^\\d+")
  df[, sprintf("party_%s", i)] <- gsub(".*\\.", "", df[, i]) %>% str_trim()  # remove everything before the dot (.)
}

How to make purrr recognize colnames of df in a list?

I am trying to unite first and last names in each dataframe in a list of dataframes. The problem is that purrr doesn't seem to recognize colnames within each df.
Each df in data$authors_list looks something like
authid   surname   given-name
12345    Smith     John
85858    Scott     Jane
I want to unite the "surname" and "given-name" columns into a column called AuN.
data <- data %>%
  mutate(authors_list = map(authors_list,
                            unite(col = AuN,
                                  c(`given-name`,
                                    surname),
                                  sep = " ")))
However, I get the following error.
Error in unite(col = AuN, c(`given-name`, surname), sep = " ") :
object 'given-name' not found
I am new to using purrr, and I haven't been able to find solutions to a similar problem online. Any help would be appreciated!
I think this is what you're after. You need to use .x in the unite() call to stand in for each data frame in the list. For each one, it will unite using the parameters you specified.
library(tidyverse)

# Set up the data (but please in the future give us data so we don't have to set it up)
df <- tibble(authid = c(12345, 85858),
             surname = c("Smith", "Scott"),
             `given-name` = c("John", "Jane"))
list_df <- list(df, df, df)

list_df_unite <- map(list_df, ~ unite(.x, AuN, c(`given-name`, surname), sep = " "))
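Applied back to the pipeline from the question, the same fix would presumably look like this (assuming authors_list is a list-column of data frames in data):
data <- data %>%
  mutate(authors_list = map(authors_list,
                            ~ unite(.x, col = AuN, c(`given-name`, surname), sep = " ")))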

How do I use grepl to filter based on matches from a vector of strings?

I have a list of gene names that I am trying to filter out of a larger data set using grepl. For example:
gene_list <- c(geneA, geneB, geneC)
data <- c(XXXgene1, XXXgene2, XXXgeneF, XXXgeneA, XXXgeneB)
select_grepl <- data %>% filter(grepl(c(gene_list), data)==T)
I have tried the grepl code above, but since the pattern has length > 1, it only uses the first element (geneA) to search within the strings. If I change c(gene_list) to a single pattern like "geneA", then the code works. Is there another solution?
You need to separate your keywords in grepl with a |. You can do this easily by using paste(..., collapse = "|") as shown below. Note that you need to transform the data vector from your example into a data.frame in order to apply filter(). The following code works.
Code:
gene_list <- c("geneA", "geneB", "geneC")
data <- c("XXXgene1", "XXXgene2", "XXXgeneF", "XXXgeneA", "XXXgeneB")
gene_vec <- paste(gene_list, collapse = "|")
select_grepl <- as.data.frame(data) %>%
  filter(grepl(gene_vec, data))
Output:
> select_grepl
data
1 XXXgeneA
2 XXXgeneB
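As a side note, if data stays a plain character vector rather than a data frame, you could skip filter() entirely and subset directly (a sketch using the same gene_vec):
data[grepl(gene_vec, data)]
# [1] "XXXgeneA" "XXXgeneB"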

How to lookup data and print values based on criteria in R?

So I have a CSV file that has 12 columns of data. What I want to do is get specific values from the CSV file based on the desired criteria.
A snip of the data is provided, so I have this list of Maps:
Maps <- c("Nuke","Vertigo","Inferno","Mirage","Train","Overpass","Dust2")
The goal is to get the CTWinProb & TWinProb values for each of the maps in the Maps list, e.g.
CTWinProbs:
Nuke = 0.5758
Dust2 = 0.4965
Inferno = 0.4885
etc., and vice versa for TWinProb.
So far I have been using the sqldf library, which is very tedious. This is what I am currently doing:
T1NukeCT <- sqldf("select CTWinProb from Team1 where MapName like '%Nuke%'")
which outputs T1NukeCT = 0.5758
and repeating for each Map and then again for TWinProb
I am sure there is an easier way; I'm just quite new to R, so I'm not 100% sure of the best method here or how to go about doing it in a less tedious manner.
You may use a WHERE IN (...) clause:
Maps <- c("Nuke","Vertigo","Inferno","Mirage","Train","Overpass","Dust2")
where_in <- paste0("('", paste(Maps, collapse="','"), "')")
sql <- paste0("SELECT CTWinProb FROM Team1 WHERE MapName IN ", where_in)
T1NukeCT <- sqldf(sql)
To be clear, the SQL query generated by the above script is:
SELECT CTWinProb
FROM Team1
WHERE MapName IN ('Nuke','Vertigo','Inferno','Mirage','Train','Overpass','Dust2')
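Note that, as written, the query returns only CTWinProb with no map labels; selecting MapName and TWinProb as well (a sketch, assuming those are the column names in Team1) keeps each probability tied to its map:
sql <- paste0("SELECT MapName, CTWinProb, TWinProb FROM Team1 WHERE MapName IN ", where_in)
probs <- sqldf(sql)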
What output/results are you looking for exactly?
If you want results in R, these are two simple functions to return the desired values.
They require the dplyr package to be loaded.
library(dplyr)
library(readr)  # read_csv() comes from readr

YourData <- read_csv("./yourfile.csv")
CTWinFunc <- function(x){
  YourData %>% filter(MapName == x) %>% pull(CTWinProb)
}
TWinFunc <- function(x){
  YourData %>% filter(MapName == x) %>% pull(TWinProb)
}
Now CTWinFunc("Nuke") should return CTWinProb result for Nuke, ie: 0.5758
And TWinFunc("Nuke") should return TWinProb result for Nuke, ie: 0.4242
If you want to return a vector with all the results together, I guess you could use the sapply() function. Something like this...
TWins <- sapply(Maps, TWinFunc)
TWins[lengths(TWins)==0] <- NA
TWins <- unlist(TWins)
And this should give you a table with the results:
cbind(Maps, TWins)
Of course, it seems like all this data is already in the original table and you could just subset that.
YourData[,c(4,11,12)]
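For example, a dplyr version of that subsetting, selecting by column name instead of position (a sketch, assuming those column names match your file):
YourData %>%
  filter(MapName %in% Maps) %>%
  select(MapName, CTWinProb, TWinProb)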

Using str_detect (or some other function) and some way to loop through a list to essentially perform a vlookup

I have been searching for a way to do this, and while some results on here seem similar, nothing seems to be working, nor can I find a method that will loop through a list like a VLOOKUP in Excel. I apologize if I have missed it.
I am trying to add a new column to a data set with mutate(). It should look at one column using str_replace (or some other function if necessary), loop through another list, and replace what it finds with the corresponding value in another column. Essentially a VLOOKUP in Excel. It cannot be done in Excel, however, because the file is simply too large.
I can do a simple str_replace one at a time, but there are 502 possible options that I need to choose from, so writing the code for that would take a very long time. Here is what I have so far:
testVendor <- vendorData %>%
  select(toupper(Addr1)) %>%
  mutate('NewAdd' = str_replace(Addr1, 'STREET', 'ST'))
However, rather than me specifying STREET and then ST, I want it to loop through a list of common postal abbreviations and return the standard abbreviation.
An example would be
addr1 <- c('123 MAIN STREET', '123 GARDEN ROAD', '123 CHARLESTON BOULEVARD')
state_abbrv <- c('FL', 'CA', 'NY')
vendor <- data.frame(addr1, state_abbrv)
usps_name <- c('STREET', 'LANE', 'BOULEVARD', 'ROAD', 'TURNPIKE')
usps_abbrv <- c('ST', 'LN', 'BLVD', 'RD', 'TPKE')
usps <- data.frame(usps_name, usps_abbrv)
The ideal output would be a new column on the vendor data frame and would look like this:
Any assistance with this is wonderful, and please allow me to expand on the question if it is unclear what I am looking for.
Thank you in advance.
I would use a for loop:
usps[] = lapply(usps, as.character)
vendor$new_addr1 = as.character(vendor$addr1)

for(i in 1:nrow(usps)) {
  vendor$new_addr1 = str_replace_all(
    vendor$new_addr1,
    pattern = usps$usps_name[i],
    replacement = usps$usps_abbrv[i])
}
vendor
# addr1 state_abbrv new_addr1
# 1 123 MAIN STREET FL 123 MAIN ST
# 2 123 GARDEN ROAD CA 123 GARDEN RD
# 3 123 CHARLESTON BOULEVARD NY 123 CHARLESTON BLVD
To be extra safe, I'd add regex word boundaries to your patterns, as below, so that only whole words are replaced. (I assume you want AIRPLANE RD to stay AIRPLANE RD, not become AIRPLN RD.)
for(i in 1:nrow(usps)) {
  vendor$new_addr1 = str_replace_all(
    vendor$new_addr1,
    pattern = paste0("\\b", usps$usps_name[i], "\\b"),
    replacement = usps$usps_abbrv[i])
}
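As a loop-free alternative, a sketch using stringr's named-vector form of str_replace_all, where the names are patterns and the values are replacements:
library(stringr)

# Build a named vector: "\\bSTREET\\b" -> "ST", "\\bLANE\\b" -> "LN", ...
replacements <- setNames(as.character(usps$usps_abbrv),
                         paste0("\\b", usps$usps_name, "\\b"))
vendor$new_addr1 <- str_replace_all(as.character(vendor$addr1), replacements)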
This might be some of the most confusing R code I have ever written, but it kind of solves the problem:
library(tidyverse)

df_phrases <- tribble(~phrases,
                      "testing this street for pests",
                      "this street better be lit")

df_lookup <- tribble(~word, ~replacement,
                     "street", "st",
                     "pests", "rats",
                     "lit", "well iluminated")

lookup_function <- function(phrase, df_lookup){
  wordss <- phrase %>%
    str_split(" ")
  table_to_join <- tibble(word = wordss) %>% unnest()
  table_to_join %>%
    left_join(df_lookup) %>%
    mutate(new_vector = if_else(replacement %>% is.na,
                                word,
                                replacement)) %>%
    pull(new_vector) %>%
    str_flatten(collapse = " ")
  # words_to_replace <- map(wordss, function(x) x %in% c(df_lookup$word))
  # tibble(wordss, words_to_replace) %>%
  #   unnest()
}

df_phrases %>%
  mutate(test = phrases %>% map_chr(lookup_function, df_lookup))
