How to replace multiple string responses in multiple columns - r

I have imported a csv dataset with many columns and many responses. I want to look at specific columns and replace a set of responses.
In my dataset, I have: hairtypeDad, hairtypeMom, hairtypeBro1, hairtypeSis1, which are all located in different areas of my file. Within these are many responses that I want to change, including but not limited to:
Straight = straightened
Curly = curled
Wavy = waved
wavyy = waved
cruley = curled
and so on.
Below is the code that I have tried so far:
hairdata <- read.csv('alldata.csv', header = TRUE, stringsAsFactors = FALSE)
hair_vars <- c("hairtypeDad", "hairtypeMom", "hairtypeBro1", "hairtypeSis1")
hairdata[hair_vars] <- str_replace_all(hairdata[hair_vars],
                                       c("Straight" = "straightened",
                                         "Curly" = "curled",
                                         "Wavy" = "waved",
                                         "wavyy" = "waved"))
#I also tried:
hairdata %>% mutate(across(c("hairtypeDad", "hairtypeMom", "hairtypeBro1", "hairtypeSis1"),
fns= ~ str_replace_all(.,
c("Straight"= "straightened",
"Curly"= "curled",
"Wavy" = "waved",
"wavyy"= "waved"))
Ultimately, I want it to go from:

id  hairtypeMom  hairtypeDad  hairtypeBro1
1   Straight     Curly        wavyy
2   Wavy         Curly        Curly

to:

id  hairtypeMom   hairtypeDad  hairtypeBro1
1   straightened  curled       waved
2   waved         curled       curled

and am not getting what I need. Please help!

You were very close: you were just missing the period in .fns = (you had fns =), and you were also missing a couple of closing parentheses.
library(tidyverse)
df %>%
  mutate(across(
    c("hairtypeDad", "hairtypeMom", "hairtypeBro1"),
    .fns = ~ str_replace_all(.,
                             c("Straight" = "straightened",
                               "Curly" = "curled",
                               "Wavy" = "waved",
                               "wavyy" = "waved"))
  ))
Output
id hairtypeMom hairtypeDad hairtypeBro1
1 1 straightened curled waved
2 2 waved curled curled
Data
df <- structure(list(id = 1:2, hairtypeMom = c("Straight", "Wavy"),
hairtypeDad = c("Curly", "Curly"), hairtypeBro1 = c("wavyy",
"Curly")), class = "data.frame", row.names = c(NA, -2L))

Related

Clever way to avoid for loop in R

I have a data file that follows roughly this format:
HEADER:001,v1,v2,v3...,v10
v1,v2,v3,STATUS,v5...v6
.
.
.
HEADER:006,v1,v2,v3...v10
HEADER:012,v1,v2,v3...v10
v1,v2,v3,STATUS,v5...v6
v1,v2,v3,STATUS,v5...v6
.
.
.
etc
where each block or chunk of data leads off with a comma-separated line that includes the header and a unique (not necessarily sequential) number, and then there may be zero or more lines identified by the STATUS keyword in the body of the chunk.
I am reading this block in using readLines and then splitting it into header lines and status lines to be read in as CSV separately, since they have a different number of variables:
datablocks <- readLines(filename, skipNul = TRUE)
headers <- datablocks[grepl("HEADER", datablocks, useBytes = TRUE)]
headers <- read.csv(text = headers, header = FALSE, stringsAsFactors = FALSE)
statuses <- datablocks[grepl("STATUS", datablocks, useBytes = TRUE)]
statuses <- read.csv(text = statuses, header = FALSE, stringsAsFactors = FALSE)
Eventually, I would like to inner join this data, so that the variables from the header are included in each status line:
all <- headers %>% inner_join(statuses, by = c("ID" = "ID"))
But I need a way to add the unique ID of the header to each status line below it, until the next header. The only way I can think of doing this is with a for loop that runs over the initial full text datablock:
header_id <- NA
for (i in seq(1:length(datablocks))) {
  is_header_line <- str_extract(datablocks[i], "HEADER:([^,]*)")
  if (!is.na(is_header_line)) {
    header_id <- is_header_line
  }
  datablocks[i] <- paste(datablocks[i], header_id, sep = ",")
}
This works fine, but it's ugly, and not very... R-ish. I can't think of a way to vectorize this operation, since it needs to keep an external variable.
Am I missing something obvious here?
Edit
If the input looks literally like this
HEADER:001,a0,b0,c0,d0
e0,f0,g0,STATUS,h0,i0,j0,k0,l0,m0
HEADER:006,a1,b1,c1,d1
HEADER:012,a2,b2,c2,d2
e1,f1,g1,STATUS,h1,i1,j1,k1,l1,m1
e2,f2,g2,STATUS,h2,i2,j2,k2,l2,m2
The output should look like this:
e0,f0,g0,h0,i0,j0,k0,l0,m0,a0,b0,c0,d0,001
e1,f1,g1,h1,i1,j1,k1,l1,m1,a2,b2,c2,d2,012
e2,f2,g2,h2,i2,j2,k2,l2,m2,a2,b2,c2,d2,012
So there needs to be a column propagated from the parent (HEADER) to the children (STATUS) to inner join on.
EDIT:
Thanks for the clarification. The specific input and output makes it dramatically easier to avoid misunderstandings.
Here I use tidyr::separate to separate out the header label from the "a0,b0,c0,d0" part, and tidyr::fill to propagate header info down into the following status rows.
library(tidyverse)
read_table(col_names = "text",
           "HEADER:001,a0,b0,c0,d0
e0,f0,g0,STATUS,h0,i0,j0,k0,l0,m0
HEADER:006,a1,b1,c1,d1
HEADER:012,a2,b2,c2,d2
e1,f1,g1,STATUS,h1,i1,j1,k1,l1,m1
e2,f2,g2,STATUS,h2,i2,j2,k2,l2,m2") %>%
  mutate(status_row = str_detect(text, "STATUS"),
         header_row = str_detect(text, "HEADER"),
         header = if_else(header_row, str_remove(text, "HEADER:"), NA_character_)) %>%
  separate(header, c("header", "stub"), sep = ",", extra = "merge") %>%
  fill(header, stub) %>%
  filter(status_row) %>%
  mutate(output = paste(str_remove(text, "STATUS,"), stub, header, sep = ",")) %>%
  select(output)
Result
# A tibble: 3 x 1
output
<chr>
1 e0,f0,g0,h0,i0,j0,k0,l0,m0,a0,b0,c0,d0,001
2 e1,f1,g1,h1,i1,j1,k1,l1,m1,a2,b2,c2,d2,012
3 e2,f2,g2,h2,i2,j2,k2,l2,m2,a2,b2,c2,d2,012
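If you would rather stay in base R, the external-variable loop can also be vectorized with cumsum: the running count of header lines maps every line to its most recent header. A sketch on the same sample input:

```r
lines <- c("HEADER:001,a0,b0,c0,d0",
           "e0,f0,g0,STATUS,h0,i0,j0,k0,l0,m0",
           "HEADER:006,a1,b1,c1,d1",
           "HEADER:012,a2,b2,c2,d2",
           "e1,f1,g1,STATUS,h1,i1,j1,k1,l1,m1",
           "e2,f2,g2,STATUS,h2,i2,j2,k2,l2,m2")

is_header <- grepl("^HEADER", lines)
# cumsum(is_header) gives, for every line, the index of the latest header seen
hdr_id <- sub("HEADER:([^,]*),.*", "\\1", lines[is_header])[cumsum(is_header)]
# Append the propagated ID, then drop the header lines themselves
tagged <- paste(lines, hdr_id, sep = ",")[!is_header]
```

From here, tagged can go straight into read.csv(text = tagged) and be joined as before.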

How to build a new variable from a col with a lot of words

I have data that looks like this:
I would like to build a new variable that keeps only the music ones. I tried to use gsub to build it, but it did not work. Any suggestion on how to do this? It does not have to use gsub.
My code is: df$music<-gsub("Sawing"|"Cooking", "", df$Hobby)
The outcome should be something that looks like this:
Sample data can be built using:
df<- structure(list(Hobby = c("cooking, sawing, piano, violin", "cooking, violin",
"piano, sawing", "sawing, cooking")), row.names = c(NA, -4L), class = c("tbl_df",
"tbl", "data.frame"))
The pattern should be a single quoted string, "Sawing|Cooking", not "Sawing"|"Cooking".
df$music<- trimws(gsub("Sawing|Cooking", "", df$Hobby, ignore.case = TRUE),
whitespace ="(,\\s*){1,}")
trimws will remove the leading/trailing commas and spaces (if any)
The opposite would be to extract the words of interest and paste them
library(stringr)
sapply(str_extract_all(df$Hobby, 'piano|violin'), toString)
Another way to do this would be:
library(dplyr)
library(tidyr)

df %>%
  mutate(index = row_number()) %>%
  separate_rows(Hobby, sep = ',\\s*') %>%
  group_by(index) %>%
  summarise(Music = toString(setdiff(Hobby, c('sawing', 'cooking'))),
            Hobby = toString(Hobby)) %>%
  select(Hobby, Music)
# Hobby Music
# <chr> <chr>
#1 cooking, sawing, piano, violin "piano, violin"
#2 cooking, violin "violin"
#3 piano, sawing "piano"
#4 sawing, cooking ""
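The same extract-and-paste idea also works without stringr, using base R's gregexpr/regmatches, assuming the sample df above:

```r
df <- data.frame(Hobby = c("cooking, sawing, piano, violin", "cooking, violin",
                           "piano, sawing", "sawing, cooking"),
                 stringsAsFactors = FALSE)

# Pull out every music keyword per row; rows with no match give character(0)
m <- regmatches(df$Hobby, gregexpr("piano|violin", df$Hobby))
# Collapse each match set back into one comma-separated string ("" when empty)
df$music <- vapply(m, paste, character(1), collapse = ", ")
```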

R arulessequence - Preparing data for cspade mining

I am trying to mine sequences in R via the arulessequence implementation of cspade.
My data frame looks like this:
items sequenceId eventId size
A 1 1 1
B 2 1 1
C 2 2 1
A 3 1 1
This data frame was created from an existing data set via the following code (removing unnecessary columns and creating the sequences):
data %>%
  select(seqId, sequence, items) %>%
  group_by(seqId) %>%
  mutate(basketSize = 1, sequence = rank(sequence)) %>%
  ungroup() %>%
  mutate(seqId = ordered(seqId), sequence = ordered(sequence)) %>%
  write.table("data.txt", sep = " ", row.names = FALSE, col.names = FALSE, quote = FALSE)
data <- read_baskets("data.txt", info = c("sequenceID", "eventID", "size"))
as(data, "data.frame") #shows the data frame above!
So far so good!
However when I try:
s1 <- cspade(data, parameter = list(support = 0.4), control = list(verbose = TRUE))
I get the following error:
Error in makebin(data, file) : 'eid' invalid (strict order)
I have read elsewhere that this is because cspade needs the event and sequence id to be ordered. But how do I specify this? Clearly ordering the factors before exporting them to ".txt" does not work.
Edit:
Some further details
Just to explain the code that creates the cspade input a bit more: originally the sequence variable had some missing steps (e.g. 1, 3, 4 for some sequences) because I had filtered out some events, so I ran a rank function on it to re-index the events within each sequence. The size column is totally unnecessary (it is constant), but it was included in the sample code in the arulessequence documentation, which is why I included it too.
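In case it helps others hitting the same error: cspade's reader expects the transaction file to be sorted by sequence ID and, within each sequence, by strictly increasing event ID, so sorting the data frame before write.table is usually the fix. A base-R sketch with hypothetical sample rows, using the question's column names:

```r
dat <- data.frame(items    = c("C", "A", "B", "A"),
                  seqId    = c(2L, 1L, 2L, 3L),
                  sequence = c(2L, 1L, 1L, 1L),
                  stringsAsFactors = FALSE)

# Sort rows by sequence ID, then by event ID within each sequence,
# so the exported file satisfies cspade's strict-order requirement
dat <- dat[order(dat$seqId, dat$sequence), ]
```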

R not producing the same result when the data set source is changed

If I manually create two data frames, then the code does what it is intended to do:
df1 <- structure(list(CompanyName = c("Google", "Tesco")), .Names = "CompanyName", class = "data.frame", row.names = c(NA, -2L))
df2 <- structure(list(CompanyVariationsNames = c("google plc", "tesco bank", "tesco insurance", "google finance", "google play")), .Names = "CompanyVariationsNames", class = "data.frame", row.names = c(NA, -5L))
library(splitstackshape)  # for cSplit()

test <- df2 %>%
  rowwise() %>%
  mutate(CompanyName = as.character(Filter(length,
    lapply(df1$CompanyName,
           function(x) x[grepl(x, CompanyVariationsNames, ignore.case = TRUE)])))) %>%
  group_by(CompanyName) %>%
  summarise(Variation = paste(CompanyVariationsNames, collapse = ",")) %>%
  cSplit("Variation", ",")
This produces the following result:
CompanyName Variation_1 Variation_2 Variation_3
1: Google google plc google finance google play
2: Tesco tesco bank tesco insurance NA
But if I import a data set (using read.csv), then I get the following error: Error in mutate_impl(.data, dots) : Column CompanyName must be length 1 (the group size), not 0. My data sets are rather large: df1 has about 1000 rows and df2 about 54k rows.
Is there a specific reason why the code works when the data set is manually created but not when the data is imported?
df1 contains company names and df2 contains variation names of those companies.
Help please!
Importing from CSV can be tricky. Check whether the default separator (comma) actually applies to your file. If not, you can change it by setting the sep argument (e.g. read.csv(file_path, sep = ";"), a common problem in my country due to our local conventions). In fact, if your standard is semicolons, read.csv2(file_path) will suffice.
Also, to avoid further trouble: it is very common for CSV files to mangle columns with decimal values, because some locales use commas as decimal separators rather than dots. It is worth checking whether this affects any of your other columns too.
If that is your case, you can set the appropriate parameter in either read.csv or read.csv2 with dec = "," (e.g. read.csv(file_path, sep = ";", dec = ",")).
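Separately, the error text ("must be length 1 (the group size), not 0") is what you get when a variation row matches none of the company names, which is easy to hit in a 54k-row import (stray whitespace, encoding differences, or companies missing from df1). A sketch of a matching helper that returns NA instead of a zero-length result (match_company is a hypothetical name, not from your code):

```r
df1 <- data.frame(CompanyName = c("Google", "Tesco"), stringsAsFactors = FALSE)
variations <- c("google plc", "tesco bank", "unknown co")

# Return the first company whose name appears in the variation, or NA if none do
match_company <- function(variation, companies) {
  hit <- companies[vapply(companies,
                          function(co) grepl(co, variation, ignore.case = TRUE),
                          logical(1))]
  if (length(hit) == 0) NA_character_ else hit[1]
}

matched <- vapply(variations, match_company, character(1),
                  companies = df1$CompanyName, USE.NAMES = FALSE)
```

Rows that come back NA can then be inspected or filtered out before the group_by/summarise step, instead of crashing mutate.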

Write csv file with non-numeric columns quoted and no row names

I'm trying to write a csv file from a data frame, i.e.:
Col_A Col_B Col_C
Hello World 4
Once More 21
Hi Data 23
So far I use this code:
ds = dataf
write.csv(ds,"test.csv", row.names = FALSE, quote = c(1,2), sep = ",")
However, the result is:
Col_A,"Col_B","Col_C"
Hello,"World",4
Once,"More",21
Hi,"Data",23
But I really need to have something like this:
"Col_A","Col_B","Col_C"
"Hello","World",4
"Once","More",21
"Hi","Data",23
Note that everything is between double quotes except the numeric values, separated by commas. I can do that if I also write the row names, but I really don't want them.
There is no point in setting sep to "," since it is the default for write.csv (which ignores a user-supplied sep anyway, with a warning).
Anyway, are you sure about your data frame design?
This seems to work:
df <- rbind(c("Hello", "Once", "Hi"), c("World", "More", "Data"), c(4,21,23))
df <- as.data.frame(t(df))
write.csv(df,"test.csv", row.names = FALSE, quote = c(1,2))
