How to separate a column into two columns - r

df <- data.frame(PATIENT_ID=c(1,2,3,4),
CODE=c('N18','N180','N190','M1920'))
I want to separate the variable 'CODE' into two variables. One variable holds the first letter of 'CODE' ('N' or 'M' in my case), the other holds the remaining digits. If there are more than two digits, insert a '.' after the second digit.
The output should be
df <- data.frame(PATIENT_ID=c(1,2,3,4),
CODE=c('N18','N180','N190','M1920'),
VOR_1=c('N','N','N','M'),
VOR_2=c('18','18.0','19.0','19.20'))
Finally, define the variable of 'VOR_2' as a numeric variable.

Using sub for a base R solution:
df$VOR_1 <- sub("^([A-Z]).*$", "\\1", df$CODE)
df$VOR_2 <- sub("^([0-9]{2})(?=[0-9])", "\\1.", sub("^[A-Z]([0-9]+)$", "\\1", df$CODE), perl=TRUE)
df$VOR_2 <- as.numeric(df$VOR_2) # if desired
df
PATIENT_ID CODE VOR_1 VOR_2
1 1 N18 N 18
2 2 N180 N 18.0
3 3 N190 N 19.0
4 4 M1920 M 19.20
An explanation of the logic behind VOR_2 is warranted. We first extract all the digits from the second character onwards using the simple regex ^[A-Z]([0-9]+)$. Then we make a second call to sub on the digit string to insert a decimal point after the second digit. The pattern uses a positive lookahead, which ensures that a dot is inserted only when there are three or more digits.
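The steps described above can be traced one at a time. This is a minimal sketch using the four codes from the question; the intermediate names digits, with_dot and vor_2 are only illustrative.

```r
# Worked example of the two-step sub() logic, using the codes from the question.
codes <- c("N18", "N180", "N190", "M1920")

# Step 1: drop the leading letter and keep only the digit string.
digits <- sub("^[A-Z]([0-9]+)$", "\\1", codes)

# Step 2: insert "." after the first two digits, but only if a third digit follows.
with_dot <- sub("^([0-9]{2})(?=[0-9])", "\\1.", digits, perl = TRUE)

# Step 3: convert to numeric.
vor_2 <- as.numeric(with_dot)

print(digits)
print(with_dot)
print(vor_2)
```

Note that once the strings are converted with as.numeric, trailing zeros are no longer stored, so "18.0" and "19.20" become the numbers 18 and 19.2.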

An idea via tidyr using separate can be,
library(dplyr)
library(tidyr) #separate
df %>%
separate(CODE, into = c("text", "num"), sep = "(?<=[A-Za-z])(?=[0-9])") %>%
mutate(num = as.numeric(num),
num = num / (10 ^ (nchar(num) - 2))
)
# PATIENT_ID text num
#1 1 N 18.0
#2 2 N 18.0
#3 3 N 19.0
#4 4 M 19.2
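The arithmetic in the mutate step may be easier to follow on a plain vector. The sketch below assumes the digit strings have already been converted to numbers; n_digits and scaled are illustrative names.

```r
# Numbers as they look after as.numeric() on the digit part of CODE.
num <- c(18, 180, 190, 1920)

# nchar() coerces each number to a string and counts its characters.
n_digits <- nchar(num)

# Dividing by 10^(digits - 2) leaves exactly two digits before the decimal point.
scaled <- num / 10^(n_digits - 2)

print(n_digits)
print(scaled)
```

Because this approach goes through numbers, a code with leading zeros in its digit part (for example a hypothetical "N0180") would lose them on conversion, so the string-based solutions are probably the safer choice for such data.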

You can use str_extract and sub:
library(stringr)
df$VOR1 <- str_extract(df$CODE, "^[A-Z]")
Here, you simply grab the capital letter at the beginning of the string; the beginning is marked by ^.
df$VOR2 <- sub("(\\d{2})(\\d{1,2})", "\\1.\\2", str_extract(df$CODE, "\\d+"))
Here, you first extract just the digits using str_extract and then insert the period where appropriate.
Result:
df
PATIENT_ID CODE VOR1 VOR2
1 1 N18 N 18
2 2 N180 N 18.0
3 3 N190 N 19.0
4 4 M1920 M 19.20

Related

filter with a list of string conditions

This is an example what the data looks like:
height <- c("T_0.1", "T_0.2", "T_0.3", "T_0.11", "T_0.12", "T_0.13", "T_10.1", "T_10.2",
"T_10.3", "T_10.11", "T_10.12", "T_10.13","T_36.1", "T_36.2", "T_36.3", "T_36.11", "T_36.12",
"T_36.13")
value <- c(1,12,14,15,20,22,5,9,4,0.0,0.45,0.7,1,2,7,100,9,45)
df <- data.frame(height,value)
I want to filter all the values in height that end with ".1", ".2", and ".3". However, I want to do that using a "list of patterns" because the actual data frame has more than 1000 values.
Here is what I tried:
vars_list <- c(".1", ".2",".3")
df_new<-df[grepl(paste(vars_list, collapse = "|"), df$height),]
matchPattern <- paste(vars_list, collapse = "|")
df_new <- df %>% select(matches(matchPattern))
Both attempts return 0 observations. I am not sure what the issue is and I couldn't find a post that would help. So any help is very much appreciated!
The dot is a regex metacharacter, which matches any character except a new line. You need to escape it (i.e. tell R you are looking for a literal dot), by prepending it with \\.
However, your pattern will then match all rows in your sample data.
I assume you do not want to match, for example, "T_0.13", because it does not end with ".1", ".2" or ".3". In which case, you should add a dollar sign to indicate that you want your string to end with the desired match, rather than just contain it.
vars_list <- c("\\.1$", "\\.2$","\\.3$")
df_new<-df[grepl(paste(vars_list, collapse = "|"), df$height),]
df_new
# height value
# 1 T_0.1 1
# 2 T_0.2 12
# 3 T_0.3 14
# 7 T_10.1 5
# 8 T_10.2 9
# 9 T_10.3 4
# 13 T_36.1 1
# 14 T_36.2 2
# 15 T_36.3 7
Incidentally, another way you could express this is:
df[grepl("\\.[1-3]$", df$height),]
You can read more about the syntax used in regular expressions on the ?regex help page.
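With more than 1000 suffixes, writing the escaped patterns by hand is impractical. The sketch below builds the pattern from a vector of literal suffixes. It assumes the dot is the only regex metacharacter in the suffixes; the shortened height vector and the names escaped, pattern and keep are illustrative.

```r
# A shortened version of the sample data, plus the literal suffixes to keep.
height <- c("T_0.1", "T_0.2", "T_0.3", "T_0.11", "T_0.13", "T_10.1", "T_36.12")
suffixes <- c(".1", ".2", ".3")

# Escape each literal dot so that it no longer matches "any character".
escaped <- gsub("\\.", "\\\\.", suffixes)

# Join the alternatives and anchor the whole group at the end of the string.
pattern <- paste0("(", paste(escaped, collapse = "|"), ")$")

keep <- grepl(pattern, height)
print(pattern)
print(height[keep])
```

Grouping the alternatives in parentheses means a single $ anchors all of them, rather than repeating it for each suffix.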
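Alternatively, use the base function endsWith with the literal suffixes. endsWith recycles its suffix argument element by element, so test each suffix separately and combine the results with |:
library(dplyr)
vars_list <- c(".1", ".2", ".3")
df <- data.frame(height,value) %>% filter(Reduce(`|`, lapply(vars_list, function(s) endsWith(height, s))))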
Created on 2023-02-12 with reprex v2.0.2
height value
1 T_0.1 1
2 T_0.2 12
3 T_0.3 14
4 T_10.1 5
5 T_10.2 9
6 T_10.3 4
7 T_36.1 1
8 T_36.2 2
9 T_36.3 7
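One point worth checking when passing several suffixes to endsWith: it compares the strings and suffixes pairwise (recycling the shorter vector) rather than testing every string against every suffix. The following sketch uses a small reordered sample to show the difference; pairwise and any_suffix are illustrative names.

```r
# Rows deliberately not in the same order as the suffixes.
height <- c("T_0.2", "T_0.1", "T_0.13")
suffixes <- c(".1", ".2", ".3")

# Pairwise comparison: "T_0.2" vs ".1", "T_0.1" vs ".2", "T_0.13" vs ".3".
pairwise <- endsWith(height, suffixes)

# Test every string against each suffix in turn, then combine with OR.
any_suffix <- Reduce(`|`, lapply(suffixes, function(s) endsWith(height, s)))

print(pairwise)
print(any_suffix)
```

On the sample data in the question the pairwise form happens to give the expected rows only because the heights repeat in the same order as the suffixes.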

R: explode a character-string and get the last element (row-wise)

I have the following data-frame
df <- data.frame(var1 = c("f253.02.ds.a01", "f253.02.ds.a02", "f253.02.ds.x.a01", "f253.02.ds.x.a02", "f253.02.ds.a10", "test"))
df
What's the easiest way to extract the last two digits of the variable var1? (e.g. 1, 2, 10, NA) I was experimenting with separate(), but the number of points in the characters is not always the same. Maybe with regular expressions?
With separate, we can use a regex lookaround
library(dplyr)
library(tidyr)
df %>%
separate(var1, into = c('prefix', 'suffix'),
sep="(?<=[a-z])(?=\\d+$)", remove = FALSE, convert = TRUE)
-output
# var1 prefix suffix
#1 f253.02.ds.a01 f253.02.ds.a 1
#2 f253.02.ds.a02 f253.02.ds.a 2
#3 f253.02.ds.x.a01 f253.02.ds.x.a 1
#4 f253.02.ds.x.a02 f253.02.ds.x.a 2
#5 f253.02.ds.a10 f253.02.ds.a 10
#6 test test NA
The expected output shown in the question has 4 elements but the input has 6 rows, so we assume that the expected output in the question is erroneous and that the correct output is the one shown below.
Assuming that the 2 digits are preceded by a non-digit, and noting that \D means non-digit (the backslash must be doubled within double quotes), we can remove everything up to and including the last non-digit:
df %>% mutate(last2 = as.numeric(sub(".*\\D", "", var1)))
giving:
var1 last2
1 f253.02.ds.a01 1
2 f253.02.ds.a02 2
3 f253.02.ds.x.a01 1
4 f253.02.ds.x.a02 2
5 f253.02.ds.a10 10
6 test NA
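To see why this works, it can help to look at the intermediate strings before the numeric conversion. A minimal base R sketch on a subset of the values from the question (tail_digits and last2 are illustrative names):

```r
var1 <- c("f253.02.ds.a01", "f253.02.ds.x.a02", "f253.02.ds.a10", "test")

# ".*" is greedy, so ".*\\D" consumes everything up to the LAST non-digit.
tail_digits <- sub(".*\\D", "", var1)

# An empty string converts to NA, which covers values without trailing digits.
last2 <- as.numeric(tail_digits)

print(tail_digits)
print(last2)
```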

Remove prefix letter from column variables

I have all column names that start with 'm'. Example: mIncome, mAge. I want to remove the prefix. So far, I have tried the following:
df %>%
rename_all(~stringr::str_replace_all(.,"m",""))
This removes every 'm' in the column names. I just need it removed from the start. Any suggestions?
You can use sub in base R to remove "m" from the beginning of the column names.
names(df) <- sub('^m', '', names(df))
We need to specify the location. The ^ matches the start of the string (or here the column name). So, if we use ^m, it will only match 'm' at the beginning or start of the string and not elsewhere.
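The effect of the anchor is easy to see side by side. In this sketch mIncome and mAge come from the question, while mFamilySummary is a made-up name added to show an 'm' in the middle of a name.

```r
nms <- c("mIncome", "mAge", "mFamilySummary")

# Without an anchor, every "m" is removed, including those inside the name.
no_anchor <- gsub("m", "", nms)

# With "^m", only an "m" at the very start of the name is removed.
anchored <- sub("^m", "", nms)

print(no_anchor)
print(anchored)
```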
library(dplyr)
library(stringr)
df %>%
rename_all(~stringr::str_replace(.,"^m",""))
# ba Mbgeg gmba cfor
#1 1 2 4 6
#2 2 3 5 7
#3 3 4 6 8
Also, if the case should be ignored, wrap with regex and specify ignore_case = TRUE
df %>%
rename_all(~ stringr::str_replace(., regex("^m", ignore_case = TRUE), ""))
# ba bgeg gmba cfor
#1 1 2 4 6
#2 2 3 5 7
#3 3 4 6 8
Another option is word boundary (\\bm), but this could match the beginning of words where there are multi word column names
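A small sketch of that difference, using base R sub with perl = TRUE instead of stringr; the multi-word name total.mIncome is hypothetical.

```r
nms <- c("mIncome", "total.mIncome")

# "^m" only matches at the start of the whole name.
start_only <- sub("^m", "", nms)

# "\\bm" matches at the start of any word, so the "m" after the dot goes too.
word_start <- sub("\\bm", "", nms, perl = TRUE)

print(start_only)
print(word_start)
```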
NOTE: str_replace_all is used when we want to replace multiple occurrences of the pattern. Here, we just need to replace the first instance, and for that str_replace is enough.
data
df <- data.frame(mba = 1:3, Mbgeg = 2:4, gmba = 4:6, cfor = 6:8)
Another way you can try
library(tidyverse)
df <- data.frame(mma = 1:2, mbapbe = 1:2)
df2 <- df %>%
rename_at(vars(c("mma", "mbapbe")) ,function(x) gsub("^m", "", x))
# ma bapbe
# 1 1 1
# 2 2 2
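In dplyr 1.0.0 and later, rename_all and rename_at are superseded by rename_with, so the same idea can be sketched as follows. This assumes a recent dplyr is installed and reuses the sample data from the earlier answer; renamed is an illustrative name.

```r
library(dplyr)

df <- data.frame(mba = 1:3, Mbgeg = 2:4, gmba = 4:6, cfor = 6:8)

# Apply the anchored replacement to every column name.
renamed <- df %>%
  rename_with(~ sub("^m", "", .x))

print(names(renamed))
```

Only the lowercase leading "m" is removed here; Mbgeg and gmba are left unchanged because the pattern is case sensitive and anchored.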

Separate Million and Billion Data from one column

I am trying below code for separating "M" and "B" with their values in 2 different column.
I want output like this:
level 1    level 2
M 3.2      B 3.6
M 4        B 2.8
           B 3.5
Input:
reve=c("M 3.2","B 3.6","B 2.8","B 3.5","M 4")
#class(reve)
data=data.frame(reve)
Here is what I have tried.
index=which(grepl("M ",data$reve))
data$reve=gsub("M ","",data$reve)
data$reve=gsub("B ","",data$reve)
data$reve=as.numeric(data$reve)
If you have a data frame, you can do that with separate() from tidyr.
Here is an example:
library(dplyr)
library(tidyr) # separate
df <- tibble(coupe = c("M 2.3", "M 4.5", "B 1"))
df %>% separate(coupe, c("MorB","Quant"), " ")
OUTPUT
# MorB Quant
# <chr> <chr>
#1 M 2.3
#2 M 4.5
#3 B 1
Hope it helps!
For counting the number of "M" rows:
df %>% separate(YourColumn, c("MorB","Quant"), " ") %>%
filter(MorB == "M") %>% nrow()
Here is a base R approach.
lst <- split(reve, substr(reve, 1, 1))
df1 <- as.data.frame(lapply(lst, `length<-`, max(lengths(lst))))
df1
# B M
#1 B 3.6 M 3.2
#2 B 2.8 M 4
#3 B 3.5 <NA>
Split the vector in two by the first letter. This gives you a list with entries of unequal length. Use lapply to give the entries the same length, i.e. pad the shorter one with NAs. Then call as.data.frame.
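The padding step relies on the replacement function `length<-`, which can be called like an ordinary function. A minimal sketch on the question's data (padded_m is an illustrative name):

```r
reve <- c("M 3.2", "B 3.6", "B 2.8", "B 3.5", "M 4")

# Group the values by their first letter: $B has 3 entries, $M has 2.
lst <- split(reve, substr(reve, 1, 1))

# Extending a vector's length fills the new positions with NA.
padded_m <- `length<-`(lst$M, 3)

print(lengths(lst))
print(padded_m)
```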
If you want to change the names, you can use setNames
setNames(df1, c("level_2", "level_1"))
In case I misunderstood your desired output, try
df1 <- data.frame(do.call(rbind, (strsplit(reve, " "))), stringsAsFactors = FALSE)
df1[] <- lapply(df1, type.convert, as.is = TRUE)
df1
# X1 X2
#1 M 3.2
#2 B 3.6
#3 B 2.8
#4 B 3.5
#5 M 4.0
I think options rooted in regex may also be helpful for these types of problems
reve=c("M 3.2","B 3.6","B 2.8","B 3.5","M 4")
data=data.frame(reve, stringsAsFactors = FALSE) # handle your data as strings, not factors
# regex to extract M vals and B vals from the reve column (not from the whole data frame)
mvals <- unlist(stringi::stri_extract_all_regex(data$reve, "M+\\s[0-9]\\.[0-9]|M+\\s[0-9]", omit_no_match = TRUE))
bvals <- unlist(stringi::stri_extract_all_regex(data$reve, "B+\\s[0-9]\\.[0-9]|B+\\s[0-9]", omit_no_match = TRUE))
# gluing things together into a single df
len <- max(length(mvals), length(bvals)) # find the length
data.frame(M = c(mvals, rep(NA, len - length(mvals))) # ensure vectors are the same size
,B = c(bvals, rep(NA, len - length(bvals)))) # ensure vectors are the same size
In case regex is unfamiliar, the first expression searches for "M" followed by a space, then a single digit (0 through 9), then a period, then another single digit. The vertical pipe is an "or" operator, so the expression also searches for "M" followed by a space and then a single digit. The second half of the expression accounts for cases like "M 4". The second expression does the same thing, just for lines that contain "B" in lieu of "M".
These are quick and dirty regex statements. I'm sure cleaner formulations are possible to get the same results.
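One possible cleaner formulation makes the decimal part optional instead of spelling out both alternatives. This is a base R sketch, not the answer's original code; pattern_for is a hypothetical helper introduced only for illustration.

```r
reve <- c("M 3.2", "B 3.6", "B 2.8", "B 3.5", "M 4")

# Hypothetical helper: letter, one whitespace, digits, optional ".digits".
pattern_for <- function(unit) paste0("^", unit, "\\s[0-9]+(\\.[0-9]+)?$")

# regmatches() with regexpr() returns only the elements that matched.
mvals <- regmatches(reve, regexpr(pattern_for("M"), reve))
bvals <- regmatches(reve, regexpr(pattern_for("B"), reve))

print(mvals)
print(bvals)
```

The optional group (\\.[0-9]+)? covers both "M 3.2" and "M 4", and [0-9]+ also allows values with more than one digit on either side of the point.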
We can count Millions or Billions as follows:
Input dataset:
reve=c("M 3.2","B 3.6","B 2.8","B 3.5","M 4")
data=data.frame(reve)
Code
library(dplyr)
library(tidyr)
data %>%
separate(reve, c("Label", "Value"),extra = "merge") %>%
group_by(Label) %>%
summarise(n = n())
Output
# A tibble: 2 x 2
Label n
<chr> <int>
1 B 3
2 M 2

How to rename data frame column by taking only the first value of the splitter string

I have the following data frame:
df <- structure(list(n.foldchange = c(2, 3, 5), s.foldchange = c(4,
0.2, 100.3)), .Names = c("n.foldchange", "s.foldchange"), row.names = c(NA,
-3L), class = "data.frame")
Which looks like this:
n.foldchange s.foldchange
1 2 4.0
2 3 0.2
3 5 100.3
What I want to do is to rename the columns by removing everything from the . onward.
Yielding
n s
1 2 4.0
2 3 0.2
3 5 100.3
How can I do that? (with tidyverse possible?)
We can rename the columns using setNames in a dplyr pipeline.
library(dplyr)
df <- df %>%
setNames(sub("\\..*", "", names(.)))
df
# n s
#1 2 4.0
#2 3 0.2
#3 5 100.3
We can do this with sub: match a literal . (escaped as \\. because an unescaped . is a metacharacter that matches any character) followed by any other characters (.*), and replace the match with an empty string.
names(df) <- sub("\\..*", "", names(df))
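The importance of the escape shows up when the same call is made without it. A short sketch on the column names from the question (unescaped and escaped_dot are illustrative names):

```r
nms <- c("n.foldchange", "s.foldchange")

# Unescaped: "." matches any character, so "..*" swallows the whole name.
unescaped <- sub("..*", "", nms)

# Escaped: only the literal dot and what follows it are removed.
escaped_dot <- sub("\\..*", "", nms)

print(unescaped)
print(escaped_dot)
```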
Or, since every prefix here is a single character, using substring or substr
names(df) <- substring(names(df), 1,1)
Another solution uses base R functions substring and regexpr.
names(df) <- substring(names(df), 1, regexpr(".", names(df), fixed=TRUE)-1)
df
n s
1 2 4.0
2 3 0.2
3 5 100.3
Here, regexpr is used to identify the positions of the first dot in the variable names. This position (minus one) is given to substring which returns a substring of the original variable names starting at the first character and ending right before the first dot.
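The positions involved can be inspected directly. The sketch below adds a hypothetical name without a dot (plain) to show an edge case: regexpr returns -1 when nothing matches, so a guard is needed if such names should be kept unchanged.

```r
nms <- c("n.foldchange", "s.foldchange", "plain")

# Position of the first literal dot; -1 where the name contains no dot.
pos <- as.integer(regexpr(".", nms, fixed = TRUE))

# Without a guard, a name with no dot is reduced to an empty string.
unguarded <- substring(nms, 1, pos - 1)

# With a guard, names without a dot are kept as they are.
guarded <- ifelse(pos > 0, substring(nms, 1, pos - 1), nms)

print(pos)
print(unguarded)
print(guarded)
```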
