Finding a Pattern in R

I am trying to clean some data. Below is an example of my data.
test1    test2       test3
jsb cjn  kd N069W j  N9DSW
I want to indicate what column has the pattern N0{num}{num}W in it. The {num} part can be any number between 0-9. This pattern can also appear anywhere in the string. Hence in this case my results would be as follows.
test1    test2       test3  col
jsb cjn  kd N069W j  N9DSW  2
Thanks in advance for any help.

We apply grepl to each column to get a logical index, then use max.col to get the index of the matching column for each row.
max.col(data.frame(lapply(df1, grepl, pattern = "N0\\d{2}W")))
#[1] 2
data
df1 <- structure(list(test1 = "jsb cjn", test2 = "kd N069W j",
test3 = "N9DSW"), class = "data.frame", row.names = c(NA,
-1L))
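One caveat with max.col: it breaks ties at random by default, so a row with no match at all (every value FALSE) gets an arbitrary column index. Below is a minimal sketch of a variant that returns NA for such rows instead. The three-row data frame is made up for illustration and is not from the question.

```r
# Hypothetical three-row data: the last row contains no match at all.
df1 <- data.frame(test1 = c("jsb cjn", "abc", "foo"),
                  test2 = c("kd N069W j", "xyz", "bar"),
                  test3 = c("N9DSW", "N012W", "baz"),
                  stringsAsFactors = FALSE)

# One logical vector per column; cbind keeps a matrix even for a single row.
hits <- do.call(cbind, lapply(df1, grepl, pattern = "N0\\d{2}W"))

# Index of the first matching column per row, or NA when nothing matches.
df1$col <- apply(hits, 1, function(r) which(r)[1])
df1$col
```

If a row can match in several columns, this keeps only the first one; whether that is the right choice depends on the data.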

You can also use the function str_detect() from the stringr package.
library(stringr)
str_detect('kd N069W j', pattern = "N0\\d{2}W")
# [1] TRUE

Using apply:
df$col <- apply(df, 1, function(x) grep("N0\\d{2}W", x))
Data:
df <- structure(list(test1 = structure(1L, .Label = "jsb cjn", class = "factor"),
test2 = structure(1L, .Label = "kd N069W j", class = "factor"),
test3 = structure(1L, .Label = "N9DSW ", class = "factor")), class = "data.frame", row.names = c(NA,
-1L))

Related

Check if value from one data frame can be found in another data frame in R

I have two data.frames as follows:
a <- data.frame(id = c("1-23-2", "2-3-231-2", "122-121"))
b <- data.frame(id = c("1-23-2", "122-121", "12-1223-12", "1221-12"))
I want to check, if all values of a can be found in b.
I tried this:
if (a$id %in% b$id){a$test <- "yes"} else {a$test <- "no"}
Which gives a warning message and the wrong result unfortunately.
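Use ifelse, which is vectorized. if expects a single TRUE or FALSE, so it cannot test every element of a$id %in% b$id.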
a$test <- ifelse(a$id %in% b$id, "yeah", "no")
a
# id test
# 1 1-23-2 yeah
# 2 2-3-231-2 no
# 3 122-121 yeah
Data
a <- structure(list(id = structure(c(1L, 3L, 2L), .Label = c("1-23-2",
"122-121", "2-3-231-2"), class = "factor")), class = "data.frame", row.names = c(NA,
-3L))
b <- structure(list(id = structure(c(1L, 3L, 2L, 4L), .Label = c("1-23-2",
"12-1223-12", "122-121", "1221-12"), class = "factor")), class = "data.frame", row.names = c(NA,
-4L))
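If the goal is a single yes/no answer for the whole data frame rather than one flag per row, the logical vector can be collapsed with all() before it reaches if. A minimal sketch, assuming the same a and b as in the question (the names in_b and all_found are mine):

```r
a <- data.frame(id = c("1-23-2", "2-3-231-2", "122-121"))
b <- data.frame(id = c("1-23-2", "122-121", "12-1223-12", "1221-12"))

# Element-wise membership test: one TRUE/FALSE per row of a.
in_b <- a$id %in% b$id

# Collapse to a single value, which is what if() needs.
all_found <- all(in_b)

if (all_found) {
  message("every id in a is present in b")
} else {
  message("missing from b: ", paste(a$id[!in_b], collapse = ", "))
}
```

any(in_b) works the same way when the question is whether at least one value is present.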
There are several base R approaches, e.g.,
a <- within(a,test <- ifelse(id %in% b$id,"yes","no"))
or
a <- within(a,test <- c("yes","no")[(!id%in% b$id) + 1])
or
a <- within(a,test <- c("yes","no")[is.na(match(id,b$id))+1])
such that
> a
id test
1 1-23-2 yes
2 2-3-231-2 no
3 122-121 yes
DATA
a <- data.frame(id = c("1-23-2", "2-3-231-2", "122-121"))
b <- data.frame(id = c("1-23-2", "122-121", "12-1223-12", "1221-12"))

The first two columns defined as "rownames"

I want to define the first two columns of a data frame as rownames. Actually I want to do some calculations and the data frame has to be numeric for that.
data.frame <- data_frame(id=c("A1","B2"),name=c("julia","daniel"),BMI=c("20","49"))
The values for BMI are numerical (checked with is.numeric), but the overall data frame is not. How can I define the first two columns (id and name) as rownames?
Thank you in advance for any suggestions
You can combine id and name column and then assign rownames
data.frame %>%
tidyr::unite(rowname, id, name) %>%
tibble::column_to_rownames()
# BMI
#A1_julia 20
#B2_daniel 49
In base R, you can do the same in steps as
data.frame <- as.data.frame(data.frame)
rownames(data.frame) <- paste(data.frame$id, data.frame$name, sep = "_")
data.frame[c('id', 'name')] <- NULL
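Not sure if the code and result below are what you are after. Note that BMI is a factor in the data below, so it is converted via as.character first; as.numeric on the factor itself would return the level codes 1 and 2.
dfout <- `rownames<-`(data.frame(BMI = as.numeric(as.character(df$BMI))),paste(df$id,df$name))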
such that
> dfout
BMI
A1 julia 20
B2 daniel 49
DATA
df <- structure(list(id = structure(1:2, .Label = c("A1", "B2"), class = "factor"),
name = structure(2:1, .Label = c("daniel", "julia"), class = "factor"),
BMI = structure(1:2, .Label = c("20", "49"), class = "factor")), class = "data.frame", row.names = c(NA,
-2L))
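Since the stated goal is a numeric object for calculations, here is a minimal base R sketch that builds a numeric matrix with the combined id and name as rownames. It assumes BMI is stored as a factor (or character); the names codes, bmi and m are only for illustration.

```r
df <- data.frame(id = c("A1", "B2"),
                 name = c("julia", "daniel"),
                 BMI = c("20", "49"),
                 stringsAsFactors = TRUE)

# Pitfall: as.numeric() on a factor returns the internal level codes.
codes <- as.numeric(df$BMI)

# Going through as.character() recovers the actual values.
bmi <- as.numeric(as.character(df$BMI))

# Numeric one-column matrix with "id_name" as rownames.
m <- matrix(bmi, ncol = 1,
            dimnames = list(paste(df$id, df$name, sep = "_"), "BMI"))
m
```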

How do I split a column in R into two columns when I have no delimiter?

I have a dataset called data1 whose first column I need to split into two columns. The issue I'm having is that there is no delimiter between the parts I need to split, and the character lengths differ in many rows.
I would like to split it by the date and sex.
E.g
12/1/09male
1/9/20female
13/1/19female
4/12/12male
I've been trying this but because the values have a different amount of characters I'm stuck.
separate(data1, col = 1, into = c("date","sex"), sep = "")
Any help would be hugely appreciated!
An option is a positive look-behind and look-ahead to split on a digit followed by an "m" or "f".
df %>% separate(1, c("date", "sex"), sep = "(?<=\\d)(?=[mf])")
# date sex
#1 12/1/09 male
#2 1/9/20 female
#3 13/1/19 female
#4 4/12/12 male
For what it's worth, the same regexp pattern works in base R's strsplit
setNames(do.call(
  rbind.data.frame,
  strsplit(as.character(df[, 1]), "(?<=\\d)(?=[mf])", perl = TRUE)),
  c("date", "sex"))
Sample data
df <- read.table(text =
'12/1/09male
1/9/20female
13/1/19female
4/12/12male')
I am fairly new to R so I am sure this is not the most elegant solution. I first add a comma between the date and sex and then separate on the comma.
library(stringr)
library(tidyr)
a <- data.frame(row_1 = c("12/1/09male", "1/9/20female", "13/1/19female", "4/12/12male"))
a[, "row_1"] <- str_replace(a$row_1, "(male|female)", ",\\1")
separate(a, row_1, into = c("date", "sex"), sep = ",")
Using tidyr::extract, we can capture data into two parts. First capture the date (in the format d/m/y) and second capture all the remaining part of the string.
tidyr::extract(df, V1, c("date", "sex"), "(\\d+/\\d+/\\d+)(.*)")
# date sex
#1 12/1/09 male
#2 1/9/20 female
#3 13/1/19 female
#4 4/12/12 male
data
df <- structure(list(V1 = structure(c(2L, 1L, 3L, 4L), .Label = c("1/9/20female",
"12/1/09male", "13/1/19female", "4/12/12male"), class = "factor")),
class = "data.frame", row.names = c(NA,-4L))
Base R solution using gsub and some regex:
df_clean <- within(df, {
date <- as.Date(gsub("[A-Za-z]+", "", V1), format = "%d/%m/%y")
sex <- as.factor(gsub("\\d+|\\/", "", V1))
rm(V1)
}
)
Data:
df <- structure(list(V1 = structure(c(2L, 1L, 3L, 4L), .Label = c("1/9/20female",
"12/1/09male", "13/1/19female", "4/12/12male"), class = "factor")), class = "data.frame", row.names = c(NA,
-4L))
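For completeness, the capture-group idea also works with plain sub() in base R, with no extra packages. This is only a sketch under the assumption that every value starts with a d/m/y date followed by the sex; the names x, pattern and out are mine.

```r
x <- c("12/1/09male", "1/9/20female", "13/1/19female", "4/12/12male")

# Group 1: the date (digits separated by slashes). Group 2: whatever follows.
pattern <- "^(\\d+/\\d+/\\d+)(.*)$"

out <- data.frame(date = sub(pattern, "\\1", x),
                  sex  = sub(pattern, "\\2", x),
                  stringsAsFactors = FALSE)
out
```

Values that do not match the pattern are returned unchanged by sub(), so it may be worth checking the result with grepl(pattern, x) on real data.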

Split Columns in a List of Dataframes R

I have a list of data frames in which some columns contain the special character -> (arrow). I want to loop through this list, locate the columns containing ->, and split each of them into two new columns named with the suffixes _old and _new. This is a sample of the data frames:
dput(df1)
df1 <- structure(list(v1 = c("reg->joy", "ress", "mer->dls"),
t2 = c("James","Jane", "Egg")),
class = "data.frame", row.names = c(NA, -3L))
dput(df2)
df2 <- structure(list(v1 = c("me", "df", "kl"),
t2 = c("James","Jane->dlt", "Egg"),
t3 = c("James ->may","Jane", "Egg")),
class = "data.frame", row.names = c(NA, -3L))
dput(df3)
df3 <- structure(list(v1 = c("56->34", "df23-> ", "mkl"),
t2 = c("James","Jane", "Egg"),
d3 = c("James->","Jane", "Egg")),
class = "data.frame", row.names = c(NA, -3L))
This is what I have tried
dfs <- list(df1,df2,df3)
for (y in 1:length(dfs)){
setDT(dfs[[y]])
df1<- lapply(names(dfs[[y]]), function(x) {
mDT <- df2[[y]][, tstrsplit(get(x), " *-> *")]
if (ncol(mDT) == 2L) setnames(mDT, paste0(x, c("_old", "_new")))
}) %>% as.data.table()
}
This only splits one data frame, I need to split all of the data frames.
NOTE: The code I have splits so well on one dataframe, what I want is how to implement it on a List of data frames
EXPECTED OUTPUT
dput(df1)
df1 <- structure(list(v1_old = c("reg", "mer"),
v1_new = c("joy", "dls")),
class = "data.frame", row.names = c(NA, -3L))
dput(df2)
df2 <- structure(list(t2_old = c("dlt"),
t2_new = c("dlt"),
t3_old = c("James"),
t3_new = c("may")),
class = "data.frame", row.names = c(NA, -3L))
dput(df3)
df3 <- structure(list(v1_old = c("56", "df23 "),
v1_new = c("34", " "),
d3 = c("James"),
d3 = c(" ")),
class = "data.frame", row.names = c(NA, -3L))
I add below a solution using the tidyverse.
Select the columns if one of the strings in the columns contains an arrow:
col_arrow_ls <- purrr::map(dfs, ~select_if(., ~any(str_detect(., "->"))))
Then split the columns using tidyr::separate. Since each element of the output is a data frame, purrr::map_dfc is used to column-bind them together:
split_df_fn <- function(df1){
names(df1) %>%
map_dfc(~ df1 %>%
select(.x) %>%
tidyr::separate(.x,
into = paste0(.x, c("_old", "_new")),
sep = "->")
)
}
Apply the function to the list of data frames.
purrr::map(col_arrow_ls, split_df_fn)
[[1]]
v1_old v1_new
1 reg joy
2 ress <NA>
3 mer dls
[[2]]
t2_old t2_new t3_old t3_new
1 James <NA> James may
2 Jane dlt Jane <NA>
3 Egg <NA> Egg <NA>
[[3]]
v1_old v1_new d3_old d3_new
1 56 34 James
2 df23 Jane <NA>
3 mkl <NA> Egg <NA>
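The same idea can be sketched in base R with lapply over the list. This is an illustration rather than a drop-in replacement: split_arrows is a hypothetical helper name, and it trims spaces around the arrow, so an empty right-hand side becomes NA.

```r
df1 <- data.frame(v1 = c("reg->joy", "ress", "mer->dls"),
                  t2 = c("James", "Jane", "Egg"),
                  stringsAsFactors = FALSE)
df2 <- data.frame(v1 = c("me", "df", "kl"),
                  t2 = c("James", "Jane->dlt", "Egg"),
                  t3 = c("James ->may", "Jane", "Egg"),
                  stringsAsFactors = FALSE)

# Hypothetical helper: keep only columns containing "->" and split each in two.
split_arrows <- function(d) {
  has_arrow <- vapply(d, function(col) any(grepl("->", col, fixed = TRUE)),
                      logical(1))
  pieces <- lapply(names(d)[has_arrow], function(nm) {
    parts <- strsplit(as.character(d[[nm]]), " *-> *")
    out <- data.frame(
      old = vapply(parts, function(p) p[1], character(1)),
      new = vapply(parts, function(p) if (length(p) > 1) p[2] else NA_character_,
                   character(1)),
      stringsAsFactors = FALSE)
    names(out) <- paste0(nm, c("_old", "_new"))
    out
  })
  do.call(cbind, pieces)
}

res <- lapply(list(df1, df2), split_arrows)
res
```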

Get single column of values comparing multiple columns

I have just started my journey with R. I want to test values across multiple columns for the same condition and return 5 if any of the values is "hello" within a row:
result = ifelse((myData[1] == "hello") | (myData[2] == "hello") | (myData[3] == "hello"), 5, 0)
This works fine, but code seems to be redundant. When I do:
resultSec = ifelse(myData[1:3] == "hello", 5, 0)
Then all 3 columns are checked against the condition, but the result is 3 columns rather than a single column. I would then have to perform an additional comparison across those columns, which takes more lines of code than the first, redundant method.
How can I get a single column of values in an efficient way?
You can use the function apply() to iterate over a data.frame or matrix, by either columns or rows. The margin argument determines which one you use.
Here we want to check the rows, so we use margin = 1:
dat <- data.frame(col1 = c("happy", "sad", "mad"),
col2 = c("tired", "sleepy", "happy"),
col3 = c("relaxed", "focused", "fine"))
dat$res <- apply(X = dat, MARGIN = 1,
FUN = function(x) ifelse("happy" %in% x, 5, 0))
dat
col1 col2 col3 res
1 happy tired relaxed 5
2 sad sleepy focused 0
3 mad happy fine 5
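We can use rowSums here. Comparing the count against 0 keeps the result at 5 even when more than one column in a row matches.
df1$res <- (rowSums(df1 == "happy") > 0) * 5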
df1$res
#[1] 5 0 5
data
df1 <- structure(list(col1 = structure(c(1L, 3L, 2L), .Label = c("happy",
"mad", "sad"), class = "factor"), col2 = structure(c(3L, 2L,
1L), .Label = c("happy", "sleepy", "tired"), class = "factor"),
col3 = structure(c(3L, 2L, 1L), .Label = c("fine", "focused",
"relaxed"), class = "factor")), .Names = c("col1", "col2",
"col3"), row.names = c(NA, -3L), class = "data.frame")
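To see why the count needs to be turned into a yes/no test, here is a small sketch with a made-up data frame in which one row contains "hello" twice. The names dat, counts, any_hello and res are only for illustration.

```r
# Hypothetical data: row 3 has "hello" in two columns.
dat <- data.frame(col1 = c("hello", "x", "hello"),
                  col2 = c("a", "b", "hello"),
                  col3 = c("c", "d", "e"),
                  stringsAsFactors = FALSE)

# Number of matching columns per row: 1, 0, 2.
counts <- rowSums(dat == "hello")

# TRUE if at least one column matches, then map to 5 / 0.
any_hello <- counts > 0
res <- ifelse(any_hello, 5, 0)
res
```

Multiplying counts by 5 directly would give 10 for the third row, which is not what the question asks for.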
