changing non character letters in selected variables - r

I've got a data frame with text. I'd like to change all "," to "-" in all observations of selected variables, and I'd like to select those variables by whether their names contain the word "date".
I've tried to incorporate various variations of grep() expressions into MyFunc but haven't been able to get it to work.
Thanks!
starting point:
df <- data.frame(dateofbirth=c("25,06,1939","15,04,1941","21,06,1978","06,07,1946","14,07,1935"),recdate=c("26,06,1945","03,04,1964","21,06,1949","15,07,1923","07,12,1945"),b=c("8,ted,st","99,tes,rd","6,ldk,dr","2,sdd,jun","asd,2,st"),disdatenow=c("25,06,1975","25,05,1996","21,06,1932","26,07,1934","07,07,1965"), stringsAsFactors = FALSE)
desired outcome:
df <- data.frame(dateofbirth=c("25-06-1939","15-04-1941","21-06-1978","06-07-1946","14-07-1935"),recdate=c("26-06-1945","03-04-1964","21-06-1949","15-07-1923","07-12-1945"),b=c("8,ted,st","99,tes,rd","6,ldk,dr","2,sdd,jun","asd,2,st"),disdatenow=c("25-06-1975","25-05-1996","21-06-1932","26-07-1934","07-07-1965"), stringsAsFactors = FALSE)
Current code:
MyFunc <- function(x) {gsub(",","-",df$x)}

You can use mutate_at from dplyr:
library(dplyr)
df %>%
  mutate_at(vars(contains("date")), function(x) gsub(",", "-", x))
and that gives you this:
dateofbirth recdate b disdatenow
1 25-06-1939 26-06-1945 8,ted,st 25-06-1975
2 15-04-1941 03-04-1964 99,tes,rd 25-05-1996
3 21-06-1978 21-06-1949 6,ldk,dr 21-06-1932
4 06-07-1946 15-07-1923 2,sdd,jun 26-07-1934
5 14-07-1935 07-12-1945 asd,2,st 07-07-1965

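Your function MyFunc also works once it operates on its argument x rather than on df$x. With data.table:
MyFunc <- function(x) {gsub(",", "-", x)}
library(data.table)
setDT(df)
# pick the columns whose names contain "date"
cols <- grep("date", names(df), value = TRUE)
# update those columns by reference
df[, (cols) := lapply(.SD, MyFunc), .SDcols = cols]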
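For comparison, here is a minimal base R sketch of the same idea. It assumes only that the target columns can be identified by the substring "date" in their names; the shortened df and the name date_cols are illustrative and not from the question.

```r
# Shortened version of the data from the question
df <- data.frame(
  dateofbirth = c("25,06,1939", "15,04,1941"),
  b           = c("8,ted,st", "99,tes,rd"),
  recdate     = c("26,06,1945", "03,04,1964"),
  stringsAsFactors = FALSE
)

# Logical index of the columns whose names contain "date"
date_cols <- grepl("date", names(df), fixed = TRUE)

# Replace every comma with a hyphen in those columns only
df[date_cols] <- lapply(df[date_cols], function(x) gsub(",", "-", x, fixed = TRUE))

print(df)
```

Column b is left untouched because its name does not contain "date".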

Related

How do I convert all numeric columns to character type in my dataframe?

I would like to do something more efficient than
dataframe$col <- as.character(dataframe$col)
since I have many numeric columns.
In base R, we can either loop over all the columns and use an if/else condition to convert only the numeric ones
dataframe[] <- lapply(dataframe, function(x) if(is.numeric(x))
as.character(x) else x)
Or create a logical index of the numeric columns, loop over only those columns, and assign the result back
i1 <- sapply(dataframe, is.numeric)
dataframe[i1] <- lapply(dataframe[i1], as.character)
It may be more flexible in dplyr
library(dplyr)
dataframe <- dataframe %>%
mutate(across(where(is.numeric), as.character))
All said by master akrun! Here is a data.table alternative. Note it converts all columns to character class:
library(data.table)
data.table::setDT(df)
df[, (colnames(df)) := lapply(.SD, as.character), .SDcols = colnames(df)]
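A small worked check of the index-based base R approach may help here. This is a sketch on made-up data (the columns id, score and label are assumptions for illustration), showing that only the numeric columns change class.

```r
# Illustrative data: two numeric columns and one character column
dataframe <- data.frame(
  id    = 1:3,
  score = c(1.5, 2, 3.25),
  label = c("a", "b", "c"),
  stringsAsFactors = FALSE
)

# Logical index of numeric columns (integer columns count as numeric)
i1 <- vapply(dataframe, is.numeric, logical(1))

# Convert only those columns
dataframe[i1] <- lapply(dataframe[i1], as.character)

# Inspect the resulting column classes
classes <- vapply(dataframe, class, character(1))
print(classes)
```

vapply is used instead of sapply only so that the result type is guaranteed; sapply behaves the same on this input.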

How do I get the column number from a dataframe which contains specific strings?

I have a data frame df with 7 columns and a character vector z containing multiple strings.
I want a data frame containing only the columns of df whose names contain a string from z.
df <- data.frame("a_means","b_means","c_means","d_means","e_mean","f_means","g_means")
z <- c("a_m","c_m","f_m")
How do I get the column number of the z strings in df? Or how do I get a dataframe with only the columns which contains the z strings.
What I want is:
print(df)
"a_means" "c_m" "f_m"
What I tried:
match(a, names(df)
and
df[,which(colnames(df) %in% colnames(df[ ,grepl(z,names(df)])]
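You can use the following, assuming the column names are a_means, b_means, and so on, so that the first three characters of each name can be compared with z: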
df[,match(z, substring(colnames(df), 1, 3))]
With base R:
z <- paste(z, collapse = "|")
df[, grepl(z, names(df))] # you could use grep as well
Combine the search patterns and use that as a pattern for stringr::str_detect() function.
library(dplyr)
library(stringr)
df <- data.frame(a_means = "a_means",
b_means = "b_means",
c_means = "c_means",
d_means = "d_means",
e_means = "e_means",
f_means = "f_means",
g_means = "g_means"
)
z <- c("a_m","c_m","f_m")
z <- paste(z, collapse = "|")
df %>% select_if(str_detect(names(df), z))
#> a_means c_means f_means
#> 1 a_means c_means f_means
You can simply do this:
library(dplyr)
df %>%
select(contains(z))
Check out help("starts_with"). You can also match to a starting prefix with starts_with() among other things.
You can use select and matches to subset the columns based on z
library(dplyr)
df <- data.frame("a_means","b_means","c_means","d_means","e_mean","f_means","g_means")
z <- c("a_m","c_m","f_m")
df %>%
select(matches(z))
#> X.a_means. X.c_means. X.f_means.
#> 1 a_means c_means f_means
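The question also asks for the column numbers themselves. A minimal base R sketch, assuming the columns are actually named a_means, b_means, and so on (the names pattern, col_idx and subset_df are illustrative):

```r
# Data frame with explicitly named columns (an assumption for this sketch)
df <- data.frame(a_means = 1, b_means = 2, c_means = 3, d_means = 4,
                 e_means = 5, f_means = 6, g_means = 7)
z <- c("a_m", "c_m", "f_m")

# One regular expression that matches any of the prefixes at the start of a name
pattern <- paste0("^(", paste(z, collapse = "|"), ")")

# grep() returns the positions of the matching column names
col_idx <- grep(pattern, names(df))

# Use the positions to subset; drop = FALSE keeps a data frame even for one column
subset_df <- df[, col_idx, drop = FALSE]

print(col_idx)
print(names(subset_df))
```

grep returns positions while grepl returns a logical vector; either can be used to index the columns.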

How to group column names and add suffixes to them?

I would appreciate it if someone could help me with the task described below.
I have an R data frame with the following columns:
id
cols_len.max.(1,5]
cols_len.max.(1,55]
cols_width.min.(1,55]
cols_width.min.(2,15]
cols_width.upper.(1,15]
I want to rename these columns to get the following column names:
id
cols_len.max_1
cols_len.max_2
cols_width.min_1
cols_width.min_2
cols_width.upper
This is my current code:
colnames(df) <- gsub("\\(.*\\]*-*.","",colnames(df))
colnames(df) <- gsub("\\.","",colnames(df))
colnames(df) <- gsub("-","",colnames(df))
colnames(df) <- gsub("\\_","",colnames(df))
But this gives me duplicate column names (cols_len.max and cols_width.min):
id
cols_len.max
cols_len.max
cols_width.min
cols_width.min
cols_width.upper
How can I append _N to them, where N is assigned as shown above? I am looking for an automated approach because my real data frame contains hundreds of columns.
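An option is to remove the substring at the end and wrap the result with make.unique. Note that make.unique leaves the first occurrence unchanged and appends .1, .2, ... to the repeats, so it yields cols_len.max and cols_len.max.1 rather than the _1/_2 suffixes.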
v2 <- make.unique(sub("\\.\\(.*", "", v1))
Or another option is to use the sub output as a grouping variable and then append the sequence at the end
tmp <- sub("\\.\\(.*", "", v1)
t1 <- ave(seq_along(tmp), tmp, FUN = function(x)
if(length(x) == 1) "" else seq_along(x))
and paste it at the end of 'tmp'
i1 <- nzchar(t1)
tmp[i1] <- paste(tmp[i1], t1[i1], sep="_")
tmp
#[1] "id" "cols_len.max_1" "cols_len.max_2" "cols_width.min_1" "cols_width.min_2" "cols_width.upper"
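The grouping idea can also be written so that both helper vectors stay integer, which avoids mixing character and integer values inside ave. This is a sketch of a variant, not the answer's exact code; the names base, idx, n and new_names are illustrative.

```r
v1 <- c("id", "cols_len.max.(1,5]", "cols_len.max.(1,55]",
        "cols_width.min.(1,55]", "cols_width.min.(2,15]",
        "cols_width.upper.(1,15]")

# Strip the trailing ".(...]" interval part
base <- sub("\\.\\(.*", "", v1)

# Position of each name within its group of identical base names
idx <- ave(seq_along(base), base, FUN = seq_along)

# Size of the group each name belongs to
n <- ave(seq_along(base), base, FUN = length)

# Append _N only where the base name occurs more than once
new_names <- ifelse(n > 1, paste(base, idx, sep = "_"), base)

print(new_names)
```

The result could then be assigned with colnames(df) <- new_names.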
data
v1 <- c("id", "cols_len.max.(1,5]", "cols_len.max.(1,55]", "cols_width.min.(1,55]",
"cols_width.min.(2,15]", "cols_width.upper.(1,15]")

Matching columns in 2 data frames when numbers don't exactly match

How do I match two different data frames when the values I am comparing are not exactly the same?
I was thinking of using merge() but I am not sure.
Table1:
ID Value.1
10001 x
18273-9 y
12824/5/6/7 z
10283/5/9 d
Table2:
ID Value.2
10001 a
18274 b
12826 c
10289 u
How do I merge Table 1 and 2 based on ID?
Which specific function of fuzzyjoin package would I use, especially with the "/" & "-" cases? How do I expand the "-" case from 18273-9 so that R will register 18273 / 18274 / 18275 / ...?
You can write a function to extract the corresponding sequences from the strings containing "/" or "-" and recombine them into a new data.frame as follows:
df1 <- data.frame(ID=c("10001","18273-9","15273-8", "15170-4", "12824/5/6/7","10283/5/9"),
value=c("a","c","c", "d","k", "l"), stringsAsFactors = FALSE)
df2 <- data.frame(ID=c("10001","18274","12826","10289"),
value=c("o","p","q","r"), stringsAsFactors = F)
doIt <- function(df){
listAsDF <- function(l) {
x <- stack(setNames(l, temp$value))
names(x) <- c("ID", "value")
return(x)
}
Base <- df[!grepl("\\/", df$ID) & !grepl("\\-", df$ID), ]
#1 cases when - present
temp <- df[grep("\\-", df$ID),]
temp <- listAsDF(lapply(strsplit(temp$ID, "-"), function(e) seq(e[1], paste0(strtrim(e[1], nchar(e[1])-1), e[2]), 1)))
Base <- rbind(Base, temp)
#2 cases when / present
temp <- df[grep("\\/", df$ID),]
temp <- listAsDF(lapply(strsplit(temp$ID, "/"), function(a) c(a[1], paste0(strtrim(a[1], nchar(a[1])-1), a[-1]))))
Base <- rbind(Base, temp)
return(Base)
}
Then you can merge df2 and df1:
merge(doIt(df1), df2, by = "ID", all.x = TRUE)
Hope this helps!
You could use the fuzzy string matching function "agrep" from base R.
df1 <- data.frame(ID=c("10001","18273-9","12824/5/6/7","10283/5/9"),
value=c("a","c","d","k"))
df2 <- data.frame(ID=c("10001","18274","12826","10289"),
value=c("o","p","q","r"))
apply(df1, 1, function(x) agrep(x["ID"], df2$ID, max.distance = 3.5))
As you see it struggles to find the match for row 4. So it might make sense to clean your ID variable (e.g., take out the "/") before running agrep.
One option is to extract the part of ID you want to keep and then do your merge.
You can format your ID column as follows:
library(stringr)
library(dplyr)
If you want only the digits before any symbols
Table1 %>% mutate(ID = str_extract(ID, "[0-9]*"))
If you want to keep the first sequence of 5 digits
Table1 %>% mutate(ID = str_extract(ID, "[0-9]{5}"))
This answers your second question, but does not use the fuzzyjoin package
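The expansion of the "-" and "/" shorthands asked about in the question can be isolated in a small helper. This is a base R sketch under stated assumptions: expand_id is a hypothetical name, IDs have no leading zeros, and each part after "/" replaces exactly the last digit of the first ID.

```r
# Expand one shorthand ID into the full set of IDs it stands for
expand_id <- function(id) {
  if (grepl("-", id, fixed = TRUE)) {
    # "18273-9": range from 18273 to 18279
    parts <- strsplit(id, "-", fixed = TRUE)[[1]]
    first <- parts[1]
    stem  <- substr(first, 1, nchar(first) - nchar(parts[2]))
    last  <- paste0(stem, parts[2])
    # Integers are used so that as.character() never switches to scientific notation
    as.character(seq(as.integer(first), as.integer(last)))
  } else if (grepl("/", id, fixed = TRUE)) {
    # "12824/5/6/7": the listed digits replace the last digit of 12824
    parts <- strsplit(id, "/", fixed = TRUE)[[1]]
    first <- parts[1]
    stem  <- substr(first, 1, nchar(first) - 1)
    c(first, paste0(stem, parts[-1]))
  } else {
    id
  }
}

ids <- c("10001", "18273-9", "12824/5/6/7", "10283/5/9")
expanded <- lapply(ids, expand_id)

# Long lookup table: one row per expanded ID, keeping the original shorthand
lookup <- data.frame(ID = unlist(expanded),
                     source = rep(ids, lengths(expanded)),
                     stringsAsFactors = FALSE)
print(lookup)
```

A table like lookup can then be combined with the second table using an ordinary merge on ID, since every expanded ID is now an exact string.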

select part of a string after certain number of special character

I have a data.table with a column
V1
a_b_c_las_poi
a_b_c_kiosk_pran
a_b_c_qwer_ok
I would like to add a new column to this data.table which does not include the last part of string after the "_".
UPDATE
So i would like the output to be
a_b_c_las
a_b_c_kiosk
a_b_c_qwer
If k is the number of fields to keep (4 for the desired output above):
k <- 4
DT[, V1 := do.call(paste, c(read.table(text=V1, fill=TRUE, sep="_")[1:k], sep = "_"))]
fill=TRUE can be omitted if all rows have the same number of fields.
Note: DT in a reproducible form is:
library(data.table)
DF <- data.frame(V1 = c("a_b_c_las_poi", "a_b_c_kiosk_pran", "a_b_c_qwer_ok"),
stringsAsFactors = FALSE)
DT <- as.data.table(DF)
You can do this with sub and a regular expression.
sub("(.*)_.*", "\\1", V1)
[1] "a_b_c_las" "a_b_c_kiosk" "a_b_c_qwer"
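Both ideas, dropping the last field and keeping the first k fields, can be sketched in base R on a plain character vector. The helper keep_first is a hypothetical name introduced for illustration.

```r
V1 <- c("a_b_c_las_poi", "a_b_c_kiosk_pran", "a_b_c_qwer_ok")

# Drop everything from the last underscore onwards
drop_last <- sub("_[^_]*$", "", V1)

# Keep only the first k underscore-separated fields
keep_first <- function(x, k) {
  vapply(strsplit(x, "_", fixed = TRUE),
         function(p) paste(head(p, k), collapse = "_"),
         character(1))
}

print(drop_last)
print(keep_first(V1, 4))
```

In a data.table the same expression can be used on the right-hand side of :=, for example to create a new column instead of overwriting V1.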
