Hi, I have a question about R programming; I am a newbie in R.
I have a dataset in Excel with a particular column containing values like these:
123456
123456789
123456789123
Now my requirement is to split the values into groups of 3 characters across separate columns. For example, my first row would split into 2 columns and my second row into 3 columns. The desired output:
colA colB colC
123  456
123  456  789
Here are a few solutions. The first 5 do not use any packages. nc (number of columns) and cn (column names) defined in (1) are used in the others as well.
1) read.fwf Using the input DF shown reproducibly in the Note at the end, count the maximum number of characters in a line and divide by 3 to get the number of columns, nc. Next compute the column names, cn. Finally, use read.fwf to read the fields in. No packages are used.
nc <- max(nchar(DF[[1]]))/3
cn <- paste0("col", head(LETTERS, nc))
read.fwf(textConnection(as.character(DF[[1]])), widths = rep(3, nc),
         col.names = cn)
giving:
colA colB colC colD
1 123 456 NA NA
2 123 456 789 NA
3 123 456 789 123
2) formatC A variation on the above is to use formatC to insert commas every 3 digits, giving the character vector ch, and then read that in using read.csv.
ch <- formatC(DF[[1]], format = "f", digits = 0, big.mark = ",")
read.csv(text = ch, header = FALSE, col.names = cn)
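For reference, the intermediate ch is then a character vector of comma-separated strings (the thousands separators happen to fall every 3 digits here because each value's length is a multiple of 3):
ch  # "123,456" "123,456,789" "123,456,789,123"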
3) strsplit Another variation is to split the column using strsplit with the indicated regular expression, then use toString to collapse the split components of each element into a comma-separated string, giving the character vector ch. Finally use read.csv as before.
ch <- sapply(strsplit(as.character(DF[[1]]), "(?<=...)", perl = TRUE), toString)
read.csv(text = ch, header = FALSE, col.names = cn)
4) gsub Yet another variation is to use gsub to insert commas every 3 characters and then use read.csv as in (2) and (3).
ch <- gsub("(...)(?=.)", "\\1,", DF[[1]], perl = TRUE)
read.csv(text = ch, header = FALSE, col.names = cn)
5) strcapture This one does not use any read.* routine. It also uses only base R.
strcapture(strrep("(...)?", nc), DF[[1]], setNames(double(nc), cn))
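For intuition, strrep simply repeats the optional capture group nc times, so the pattern passed to strcapture (with nc = 4 from above) is:
strrep("(...)?", nc)  # "(...)?(...)?(...)?(...)?"
The proto argument makes the resulting columns numeric, with the names in cn.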
6) strapplyc This is the only variation that uses a package. strapplyc can be used to pick off successive 3-character substrings, and it uses a simpler regular expression than some of the other solutions. read.csv is then used as before.
library(gsubfn)
ch <- sapply(strapplyc(DF[[1]], "..."), toString)
read.csv(text = ch, header = FALSE, col.names = cn)
Note
The input in reproducible form:
Lines <- "
123456
123456789
123456789123"
DF <- read.table(text = Lines)
Here is one option with separate
library(tidyverse)
df %>%
  separate(a, into = c('b', 'c', 'd'), sep = c(3, 6), remove = FALSE)
# a b c d
#1 123 123
#2 123456 123 456
#3 123456789 123 456 789
Using convert = TRUE changes the type of the new columns automatically:
df %>%
  separate(a, into = c('b', 'c', 'd'), sep = c(3, 6),
           remove = FALSE, convert = TRUE)
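With the example data, the result should look roughly like this, with b, c and d now integer columns and the empty pieces turned into NA:
#          a   b   c   d
#1       123 123  NA  NA
#2    123456 123 456  NA
#3 123456789 123 456 789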
data
df <- data.frame(a = c(123, 123456, 123456789))
Using the data.table package:
library(data.table)
setDT(df1)
df1[, tstrsplit(col1, "(?:.{3}+\\K)", perl = TRUE)]  # change {3} to another number if you don't want to split after every 3 characters
# V1 V2 V3 V4
#1: 123 456 <NA> <NA>
#2: 123 456 789 <NA>
#3: 123 456 789 123
data:
df1 <- structure(list(col1 = c("123456", "123456789", "123456789123")),
                 class = c("data.table", "data.frame"),
                 row.names = c(NA, -3L))
There's probably a method that involves less repetition, but one option may be:
library(tidyverse)
df <- data.frame (a = c(123,123456,123456789))
df %>%
  mutate(b = substr(a, 1, 3),
         c = substr(a, 4, 6),
         d = substr(a, 7, 9))
a b c d
1 123 123
2 123456 123 456
3 123456789 123 456 789
Related
I'm trying to select rows in a data.table. I need the values in the variable dt$s to start with any of the strings in the vector y.
dt <- data.table(x = (c(1:5)), s = c("a", "ab", "b.c", "db", "d"))
y <- c("a", "b")
Desired result:
x s
1: 1 a
2: 2 ab
3: 3 b.c
I would use dt[s %in% y] for a full match, and %like% or "^a*" for a partial match with a single string, but I'm not sure how to get a strict "starts with" match against a character vector.
My real dataset and character vector are quite large, so I'd appreciate an efficient solution.
Thanks.
You can create a pattern dynamically from y.
library(data.table)
pat <- sprintf('^(%s)', paste0(y, collapse = '|'))
pat
#[1] "^(a|b)"
and use it to subset the data.
dt[grepl(pat, s)]
# x s
#1: 1 a
#2: 2 ab
#3: 3 b.c
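If the prefixes in y are plain strings with no regex metacharacters, a non-regex alternative (just a sketch, not benchmarked) is to combine one base startsWith() test per prefix:
dt[Reduce(`|`, lapply(y, function(p) startsWith(s, p)))]  # logical OR across the prefix tests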
Using glue and filter
library(glue)
library(dplyr)
library(stringr)
dt %>%
filter(str_detect(s, glue("^({str_c(y, collapse = '|')})")))
# x s
#1: 1 a
#2: 2 ab
#3: 3 b.c
My data frame looks like this:
x <- data.frame(
  id = c("123_1", "987_123")
)
I'd like to create the result data frame below via dplyr's mutate function. I just want to take the part before the underscore and the part right after the underscore.
result <- data.frame(
  id = c("123_1", "987_123"),
  af = c("123", "987"),
  ad = c("1", "123")
)
1) tidyverse Use separate like this (its default sep splits on any non-alphanumeric characters, so the underscore is handled without specifying sep):
library(dplyr)
library(tidyr)
x %>%
separate(id, c("af", "ad"), remove = FALSE)
## id af ad
## 1 123_1 123 1
## 2 987_123 987 123
2) Base R
2a) read.table Without any packages use read.table
cbind(x, read.table(text = x$id, sep = "_", col.names = c("af", "ad"),
colClasses = "character"))
## id af ad
## 1 123_1 123 1
## 2 987_123 987 123
2b) sub or use sub:
transform(x, af = sub("_.*", "", id), ad = sub(".*_", "", id))
## id af ad
## 1 123_1 123 1
## 2 987_123 987 123
Given this vector:
vector <- c("Superman1000", "Batman35", "Wonderwoman240")
I want to split the superhero's name and age.
df=data.frame(vector= c("Superman1000", "Batman35", "Wonderwoman240"))
library(stringr)
library(stringi)
library(dplyr)
df %>% separate(vector, c("A", "B"))
I tried this but it doesn't work.
If the data is the same as shown, we can remove all the digits to get the superhero name and remove all the non-digits to get the age.
df$super_hero <- gsub("\\d+", "", df$vector)
df$super_hero_age <- gsub("\\D", "", df$vector)
Or with tidyr::extract
tidyr::extract(df, vector, into = c("name", "age"), regex = "(.*\\D)(\\d+)")
# name age
#1 Superman 1000
#2 Batman 35
#3 Wonderwoman 240
As mentioned by #A5C1D2H2I1M1N2O1R2T1, we can also use strcapture
strcapture("(.*\\D)(\\d+)", df$vector,
proto = data.frame(superhero = character(), age = integer()))
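This should return a data frame with the age column already converted to integer, along the lines of:
#    superhero  age
#1    Superman 1000
#2      Batman   35
#3 Wonderwoman  240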
We can use read.csv from base R after creating a delimiter before the numeric part with sub
read.csv(text = sub("(\\d+)", ",\\1", df$vector), header = FALSE,
stringsAsFactors = FALSE, col.names = c('name', 'age'))
# name age
#1 Superman 1000
#2 Batman 35
#3 Wonderwoman 240
Or another option is separate where we specify a regex lookaround
library(tidyr)
separate(df, vector, into = c("name", "age"), sep= "(?<=[a-z])(?=\\d)")
# name age
#1 Superman 1000
#2 Batman 35
#3 Wonderwoman 240
I know there are some answers here about splitting a string every nth character, such as this one and this one. However, these are pretty question-specific and mostly relate to a single string rather than a data frame of multiple strings.
Example data
df <- data.frame(id = 1:2, seq = c('ABCDEFGHI', 'ZABCDJHIA'))
Looks like this:
id seq
1 1 ABCDEFGHI
2 2 ZABCDJHIA
Splitting on every third character
I want to split the string in each row every third character, such that the resulting data frame looks like this:
id 1 2 3
1 ABC DEF GHI
2 ZAB CDJ HIA
What I tried
I have used the splitstackshape package before to split a string on a single character, like so: df %>% cSplit('seq', sep = '', stripWhite = FALSE, type.convert = FALSE). I would love to have a similar function (or perhaps it is possible with cSplit) to split on every third character.
An option would be separate
library(tidyverse)
df %>%
separate(seq, into = paste0("x", 1:3), sep = c(3, 6))
# id x1 x2 x3
#1 1 ABC DEF GHI
#2 2 ZAB CDJ HIA
If we want to make it more generic:
n1 <- nchar(as.character(df$seq[1])) - 3
s1 <- seq(3, n1, by = 3)
nm1 <- paste0("x", seq_len(length(s1) + 1))
df %>%
  separate(seq, into = nm1, sep = s1)
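For the example data (9-character strings), these helpers work out to:
n1   # 6
s1   # 3 6
nm1  # "x1" "x2" "x3"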
Or, using base R, split the 'seq' column with strsplit after every 3 characters (via a regex lookbehind), which gives a list, and then rbind the list elements:
df[paste0("x", 1:3)] <- do.call(rbind,
strsplit(as.character(df$seq), "(?<=.{3})", perl = TRUE))
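The result should then look like this, with the id and seq columns kept and the three new columns appended:
#  id       seq  x1  x2  x3
#1  1 ABCDEFGHI ABC DEF GHI
#2  2 ZABCDJHIA ZAB CDJ HIA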
NOTE: It is better to avoid column names that start with non-standard characters such as numbers. For that reason, 'x' is prepended to the names.
You can also split a string every n characters in base R with read.fwf (Read Fixed Width Format Files), which needs either a file or a connection.
read.fwf(file=textConnection(as.character(df$seq)), widths=c(3,3,3))
V1 V2 V3
1 ABC DEF GHI
2 ZAB CDJ HIA
Best
Basically, I have a data table and a smaller vocabulary table.
What I would like is for the vocabulary values to be mapped onto the data values, and to do this within a function, in such a way that it can be used more or less dynamically.
Given:
dt: data.csv
V1   V2   sex   V4   V5
abc  abc  jeny  abc  123
abc  abc  eric  abc  123
abc  abc  bob   abc  123
vocabulary1: sex.csv
old    new
jeny   f
eric   m
bob    m
Wanted result:
V1   V2   sex  V4   V5
abc  abc  f    abc  123
abc  abc  m    abc  123
abc  abc  m    abc  123
What I have:
replace_by_vocabulary <- function(dt,voc,col_name){
dt[,col_name] <- tolower(dt[,col_name])
**** something something ***
return(dt)
}
How I would like to use it ...
dt <- replace_by_vocabulary(dt,vocabulary1,"sex")
dt <- replace_by_vocabulary(dt,vocabulary2,"date")
dt <- replace_by_vocabulary(dt,vocabulary3,"mood")
An alternative to merge that is more in line with what you had:
replace_by_vocabulary <- function(dt, voc, col_name) {
  col <- which(colnames(dt) == col_name)
  dt[, col] <- voc$new[match(tolower(dt[, col]), voc$old)]
  return(dt)
}
First, locate the column in dt from the col_name string input. Then use match to find, for each element of tolower(dt[, col]), the index of the matching entry in voc$old, and use those indices to retrieve the replacement values from voc$new. The dt[, col] column is converted to lower case inside the function, as in your sample code, so that it matches the lower-case entries in the vocabulary table. The advantage over merge is that we do not have to rename and remove columns afterwards to get the result you want.
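A minimal sketch of the match step, assuming a vocabulary data frame with old/new columns as in the question:
voc <- data.frame(old = c("jeny", "eric", "bob"),
                  new = c("f", "m", "m"),
                  stringsAsFactors = FALSE)
match(c("jeny", "eric", "bob"), voc$old)            # 1 2 3
voc$new[match(c("jeny", "eric", "bob"), voc$old)]   # "f" "m" "m"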
Using your data:
replace_by_vocabulary(dt,vocabulary,"sex")
## V1 V2 sex V4 V5
##1 abc abc f abc 123
##2 abc abc m abc 123
##3 abc abc m abc 123
This post seems to be a duplicate of the one listed below.
VLookup type method in R
You should be able to work out a function that does what you want using the merge function:
string = c("abc", "abc", "abc")
names = c("jeny", "eric", "bob")
sex = c("f", "m", "m")
data = data.frame(cbind(string, string, names, string, c(1, 2, 3)))
vocabulary1 = data.frame(cbind(names, sex))
dt = merge(data, vocabulary1, by.x = "names")
dt
If I understood your aim correctly, you want to merge two data frames together?
You should look at ?merge
For instance:
merge(x = dt, y = vocabulary1, by.x = "sex", by.y = "old")
If you want a dynamic function, you could do the following (note that col_name refers to the column in dt, so it belongs in by.x, while the vocabulary is always joined on its old column):
replace_by_vocabulary <- function(dt, voc, col_name) {
  merged_df <- merge(x = dt, y = voc, by.x = col_name, by.y = "old")
  return(merged_df)
}
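As a usage sketch (assuming the vocabulary's replacement column is named new, as in the question): the merge keeps the original values under the key column's name, so you still need to swap in the new values and drop the helper column afterwards.
merged <- replace_by_vocabulary(dt, vocabulary1, "sex")
merged$sex <- merged$new   # overwrite the old values with the mapped ones
merged$new <- NULL         # drop the helper column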
Have you considered merging and then dropping the unwanted column? Like so (dplyr is needed for mutate and select; after the merge, the vocabulary values arrive in the new column while the key keeps the name sex):
library(dplyr)
dt <- merge(x = dt, y = vocabulary1, by.x = "sex", by.y = "old")
dt <- dt %>%
  mutate(sex = new) %>%
  select(-new)