Split a column into two columns - R

I have a column with alphanumeric strings separated by '->' that I'm trying to split into two columns.
Df:
column e
1. asd1->ref2
2. fde4 ->fre4
3. dfgt-fgr ->frt5
4. ftr5 -> lkh-oiut
5. rey6->usre-lynng->usre-lkiujh->kiuj-bunny
6. dge1->fgt4->okiuj-dfet
Desired output
   col 1  col 2
1. asd1   ref2
2. fde4   fre4
3.        frt5
4. ftr5
5. rey6
6. dge1   fgt4
I tried out <- strsplit(as.character(Df$column.e), '_->_') with no output, then str_extract(m1$column.e, "(?<=\\[)[[:alnum:]]") -> m1$column.f, and also strsplit(as.character(Df$column.e), ' -> ', fixed = T)[[1]][[1]], but I'm not getting the desired output.
The column is of integer type and the letters are all capitals (I'm not sure if this is important).
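For reference, a reproducible version of the example data might be built like this (the data frame and column names, Df and column.e, are assumptions that match the answers below):
# reconstruction of the example data shown above; the names Df and column.e are assumed
Df <- data.frame(
  column.e = c("asd1->ref2", "fde4 ->fre4", "dfgt-fgr ->frt5",
               "ftr5 -> lkh-oiut", "rey6->usre-lynng->usre-lkiujh->kiuj-bunny",
               "dge1->fgt4->okiuj-dfet"),
  stringsAsFactors = FALSE
)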

Here is one way with tidyverse
library(tidyverse)
df1 %>%
  separate(columne, into = c('col1', 'col2'), sep = "->", extra = 'drop') %>%
  mutate_all(funs(replace(., str_detect(., '-'), "")))
#  col1 col2
#1 asd1 ref2
#2 fde4 fre4
#3      frt5
#4 ftr5
#5 rey6
#6 dge1 fgt4
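Note that funs() is deprecated in more recent dplyr releases. Assuming dplyr >= 1.0 is installed, a sketch of the same pipeline with across() would be:
# same logic as above, with across() replacing the deprecated funs()
df1 %>%
  separate(columne, into = c('col1', 'col2'), sep = "->", extra = 'drop') %>%
  mutate(across(everything(), ~ replace(.x, str_detect(.x, '-'), "")))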

A base R solution as well, though a fair bit less concise than #akrun's tidyverse one:
# split as appropriate
out <- strsplit( as.character( Df$column.e ), '->' )
out <- lapply( out, function(x) {
  # I assume you don't want the white space
  y <- trimws( x )
  # take the first two "columns"
  y <- y[1:2]
  # remove any items containing a hyphen
  y[ grepl( "-", y ) ] <- ""
  y
} )
# then bind it all rowwise
out <- do.call( rbind, out )
data.frame( out )
    X1   X2
1 asd1 ref2
2 fde4 fre4
3      frt5
4 ftr5
5 rey6
6 dge1 fgt4

Related

If string has a certain character, fill an empty cell in the same row with a certain value

Say I have the following data frame:
# S/N a b
# 1 L1-S2 <blank>
# 2 T1-T3 <blank>
# 3 T1-L2 <blank>
How do I turn the above data frame into this:
# S/N a b
# 1 L1-S2 LS
# 2 T1-T3 T
# 3 T1-L2 TL
I am thinking of writing a loop, where
For x in column a,
If first character in x == L AND 4th character in x == S,
fill the corresponding cell in b with LS
and so on...
However, I am not sure how to implement it, or if there is a more elegant way of doing this.
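A literal translation of that loop idea could look like the sketch below, assuming the two letters always sit at the 1st and 4th characters as in the examples (the answers that follow are more general):
df1 <- data.frame(S_N = 1:3, a = c("L1-S2", "T1-T3", "T1-L2"),
                  b = NA_character_, stringsAsFactors = FALSE)

for (i in seq_len(nrow(df1))) {
  # pull the 1st and 4th characters and drop a duplicate (e.g. "T", "T" -> "T")
  letters_i <- c(substr(df1$a[i], 1, 1), substr(df1$a[i], 4, 4))
  df1$b[i] <- paste(unique(letters_i), collapse = "")
}
df1$b
# [1] "LS" "T"  "TL"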
We can extract the upper case letters and remove the repeated ones
library(stringr)
library(dplyr)
df1 %>%
  mutate(b = str_replace(str_replace(a, "^([A-Z])\\d+-([A-Z])\\d+",
                                     "\\1\\2"), "(.)\\1+", "\\1"))
-output
# S_N a b
#1 1 L1-S2 LS
#2 2 T1-T3 T
#3 3 T1-L2 TL
Or another option is str_extract_all to extract the upper case letters, loop over the list with map, paste the unique elements
library(purrr)
df1 %>%
  mutate(b = str_extract_all(a, "[A-Z]") %>%
           map_chr(~ str_c(unique(.x), collapse="")))
Or using a corresponding base R option for the first tidyverse option
df1$b <- sub("(.)\\1+", "\\1", gsub("[0-9-]+", "", df1$a))
Or with strsplit
df1$b <- sapply(strsplit(df1$a, "[0-9-]+"),
                function(x) paste(unique(x), collapse=""))
data
df1 <- structure(list(S_N = 1:3, a = c("L1-S2", "T1-T3", "T1-L2"),
                      b = c(NA, NA, NA)),
                 class = "data.frame", row.names = c(NA, -3L))

Merge Alternating Rows

I extracted data from a PDF and have a small issue. Some cells got separated across two rows, so the data has NA in some cells and values in others. I want to simply merge these values into the cell above.
Interestingly, every "real" row into which I want to merge the others starts with the same symbol, namely "§".
I have around 1,000 observations, so an automated solution would be amazing.
first  <- c("§", "3", "4")
second <- c(NA, "2", NA)
third  <- c("§", "2", 5)
fourth <- c(NA, "2", "3")
# ... and so on
df <- as.data.frame(rbind(first, second, third, fourth))
expected output:
first_e <- c("§", "32", "4")
second_e <- c("§", "22", "53")
df_e <- as.data.frame(rbind(first_e, second_e))
It would be truly amazing if someone had an idea (:
Best from Berlin
One possible solution, if there is always a second line, would be this:
library(dplyr)

df %>%
  # use the row number as a column
  dplyr::mutate(ID = dplyr::row_number()) %>%
  # subtract 1 from every even row number to build groups
  dplyr::mutate(ID = ifelse(ID %% 2 == 0, ID - 1, ID)) %>%
  # group by the new ID
  dplyr::group_by(ID) %>%
  # convert all NAs to "" (empty string)
  dplyr::mutate_all(~ ifelse(is.na(.), "", .)) %>%
  # concatenate all strings per group
  dplyr::mutate_all(~ paste(., collapse = "")) %>%
  # keep only distinct cases (to eliminate the "seconds", as they are now identical to the "firsts")
  dplyr::distinct()
  V1    V2    V3       ID
  <chr> <chr> <chr> <dbl>
1 §     32    4         1
2 §     22    53        3
I left the created ID column in the result, but you can drop/delete it after the calculations if you prefer.
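For example, if the pipeline result above were stored in an object (the name res is assumed here), the helper column could be dropped like this:
res %>%
  dplyr::ungroup() %>%   # drop the grouping before removing the grouping column
  dplyr::select(-ID)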
Simply paste the odd elements of a column with even elements:
# vectors of TRUEs in odd or even positions
odd <- rep(c(T,F), length.out=nrow(df))
evn <- rep(c(F,T), length.out=nrow(df))
# for each column...
result <- lapply(df, function(col) {
  paste0(ifelse(is.na(col[odd]), '', col[odd]),
         ifelse(is.na(col[evn]), '', col[evn]))
})
as.data.frame(result)
Consider flagging the § rows with an ifelse + cumsum to generate a grouping field for aggregate with paste:
# BUILD DATA FRAME
df <- setNames(rbind.data.frame(first, second, third, fourth, stringsAsFactors=FALSE),
               c("col1", "col2", "col3"))
# CONVERT ALL NAs TO EMPTY STRING
df[is.na(df)] <- ""
# GENERATE GROUPING COLUMN
df$section <- cumsum(ifelse(df$col1 == "§", 1, 0))
df
#   col1 col2 col3 section
# 1    §    3    4       1
# 2         2            1
# 3    §    2    5       2
# 4         2    3       2
# AGGREGATE BY GROUPING COLUMNS
clean_df <- aggregate(. ~ section, df, paste, collapse="")[-1]
clean_df
#   col1 col2 col3
# 1    §   32    4
# 2    §   22   53
Online Demo

Manipulate string in column of a dataframe

I have a data frame
a = data.frame("a" = c("aaa|abbb", "bbb|aaa", "bbb|aaa|ccc"), "b" = c(1,2,3))
a b
aaa|abbb 1
bbb|aaa 2
bbb|aaa|ccc 3
I want to split the column value by "|", sort the pieces, and merge them back together to look like this:
a b
aaa|abbb 1
aaa|bbb 2
aaa|bbb|ccc 3
I tried the following:
paste(sort(ignore.case(unlist(strsplit(as.character(a$a), "\\|")))), collapse = ", ")
but that just combines everything together. How can I apply it to each value of column a and get the result as a data frame? I tried to use lapply but still got the same result: one combined list.
We could use separate_rows to split the values in 'a', then grouped by 'b', sort 'a' and paste the elements together
library(tidyverse)
a %>%
  separate_rows(a) %>%
  group_by(b) %>%
  summarise(a = paste(sort(a), collapse="|")) %>%
  select(names(a))
# A tibble: 3 x 2
# a b
# <chr> <dbl>
#1 aaa|abbb 1
#2 aaa|bbb 2
#3 aaa|bbb|ccc 3
An idea via base R,
sapply(strsplit(as.character(a$a), '|', fixed = TRUE), function(i) paste(sort(i), collapse = '|'))
#[1] "aaa|abbb" "aaa|bbb" "aaa|bbb|ccc"
So to update your column a, just assign it back to it, i.e.
a$a <- sapply(strsplit(as.character(a$a), '|', fixed = TRUE), function(i) paste(sort(i), collapse = '|'))
Similar to Sotos's answer:
a$clean <- sapply(as.character(a$a), function(i) paste(sort(tolower(unlist(strsplit(i, split = "|", fixed = TRUE)))), collapse = "|"))
# a b clean
# 1 aaa|abbb 1 aaa|abbb
# 2 bbb|aaa 2 aaa|bbb
# 3 bbb|aaa|ccc 3 aaa|bbb|ccc
If you want to do it with data.table:
library(data.table)
dat <- fread("a b
aaa|abbb 1
bbb|aaa 2
bbb|aaa|ccc 3")
dat[, a_sorted := sapply(lapply(strsplit(a, "\\|"), sort), paste, collapse = "|")]

Spacing vector by regular pattern

I have a vector
vec <- c("ab", "#4", "gw", "#29", "mp", "jq", "#35", "ez")
which generally follows the pattern of alternating between two different sequences of strings (the first sequence being all alphabetical, the second being numerical with the symbol #).
However, there are cases where no # term appears: in the example above, between mp and jq, and then again after ez. I would like to define a function which "fills the gaps" with the character string "#", so that I would have the output:
[1] "ab" "#4" "gw" "#29" "mp" "#" "jq" "#35" "ez" "#"
which I would then convert to a data frame
V1 V2
1 ab #4
2 gw #29
3 mp #
4 jq #35
5 ez #
My attempt so far is rather clunky and relies on looping through the vector and filling the gaps. I'd be interested to see more elegant solutions.
My Solution
greplSpace <- function(pattern, replacement, x){
  j <- 1
  while( j < length(x) ){
    if( grepl(pattern, x[j+1]) ){
      j <- j+2
    } else {
      x <- c( x[1:j], replacement, x[(j+1):length(x)] )
      j <- j+2
    }
  }
  if( !grepl(pattern, tail(x,1)) ){ x <- c(x, replacement) }
  return(x)
}
library(magrittr)
vec <- c("ab", "#4", "gw", "#29", "mp", "jq", "#35", "ez")
vec %>% greplSpace("#", "#", . ) %>%
  matrix(ncol = 2, byrow = TRUE) %>%
  as.data.frame
Starting with your vec, we can create your expected data frame directly with some functions from dplyr, tidyr, and stringr.
library(dplyr)
library(tidyr)
library(stringr)
vec <- c("ab", "#4", "gw", "#29", "mp", "jq", "#35", "ez")
dat <- data_frame(Value = vec)
dat2 <- dat %>%
  mutate(String = !str_detect(vec, "#"),
         Key = ifelse(String, "V1", "V2"),
         Row = cumsum(String)) %>%
  select(-String) %>%
  spread(Key, Value, fill = "#") %>%
  select(-Row)
dat2
# # A tibble: 5 x 2
# V1 V2
# <chr> <chr>
# 1 ab #4
# 2 gw #29
# 3 mp #
# 4 jq #35
# 5 ez #
Here is a base R option with split. Create a logical index by checking for "#" in each of the strings, take the cumulative sum, and split the original vector by this grouping variable into a list ('lst'). List elements that don't have two elements (the maximum length) are padded with NA at the end by assignment with `length<-`. Then rbind the list elements into a two-column matrix. If needed, convert the NAs to "#".
lst <- split(vec, cumsum(!grepl("#", vec)))
out <- do.call(rbind, lapply(lst, `length<-`, max(lengths(lst))))
out[,2][is.na(out[,2])] <- "#" #not recommended though
out
# [,1] [,2]
#1 "ab" "#4"
#2 "gw" "#29"
#3 "mp" "#"
#4 "jq" "#35"
#5 "ez" "#"
Wrap it with as.data.frame if we need a data.frame output
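For example, continuing from the out matrix above:
# convert the two-column matrix into a data frame
as.data.frame(out, stringsAsFactors = FALSE)
#   V1   V2
# 1 ab   #4
# 2 gw  #29
# 3 mp    #
# 4 jq  #35
# 5 ez    #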
You can use Base R:
First collapse the vector into a string while replacing "#" where needed.
Then just read it using read.csv:
vec1=gsub("([a-z]),\\s*([a-z])|$","\\1,#,\\2",toString(vec))
read.csv(text=gsub("(#.*?),","\\1\n",vec1),h=F)
V1 V2
1 ab #4
2 gw #29
3 mp #
4 jq #35
5 ez #
Explanation:
First collapse the vector into a string with toString.
Then, if there are letters on both sides of a comma (i.e. [a-z],\s*[a-z]) or we are at the end of the string (i.e. |$), insert a "#".
Then create line breaks after the numbers or "#" and read the data in as a table.
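To make the two steps easier to follow, the intermediate string produced by the first gsub() should look roughly like this (a sketch, not verified output):
toString(vec)
# "ab, #4, gw, #29, mp, jq, #35, ez"
vec1 <- gsub("([a-z]),\\s*([a-z])|$", "\\1,#,\\2", toString(vec))
vec1
# "ab, #4, gw, #29, mp,#,jq, #35, ez,#,"   (a "#" is inserted after "mp" and after "ez")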
You can also do:
a=read.csv(h=F,text=toString(sub("([a-z]+)","\n\\1",vec)),na=c(" ",""))[1:2]
a
V1 V2
1 ab #4
2 gw #29
3 mp <NA>
4 jq #35
5 ez <NA>
data.frame(replace(as.matrix(a),is.na(a),"#"))
V1 V2
1 ab #4
2 gw #29
3 mp #
4 jq #35
5 ez #
Another base possibility:
do.call(rbind, tapply(vec, cumsum(!grepl("^#", vec)), FUN = function(x){
if(length(x) == 1) c(x, "#") else x}))
# [,1] [,2]
# 1 "ab" "#4"
# 2 "gw" "#29"
# 3 "mp" "#"
# 4 "jq" "#35"
# 5 "ez" "#"
Explanation:
Check if elements in vec starts with #, and negate it: !grepl("^#", vec); creates a logical vector.
Create a grouping variable by applying cumsum to the logical vector (note: 1 & 2 similar to #akrun).
Use tapply to apply a function to each subset of vec, defined by the grouping variable. Check if the length is 1. If so, pad by a trailing #, else just return the subset: if(length(x) == 1) c(x, "#") else x
Bind the resulting list together by row: do.call(rbind, ...).
Another one:
# create a row index
ri <- cumsum(!grepl("^#", vec))
# create a column index
ci <- ave(ri, ri, FUN = seq_along)
# create an empty matrix of desired dimensions
m <- matrix(nrow = max(ri), ncol = 2)
# assign 'vec' to matrix at relevant indices
m[cbind(ri, ci)] <- vec
# replace NA with '#'
m[is.na(m)] <- "#"
Using data.table. Create a grouping variable as above, and reshape from long to wide.
library(data.table)
d <- data.table(vec)
d[ , g := cumsum(!grepl("^#", vec))]
dcast(d, g ~ rowid(g), value.var = "vec", fill = "#")
# g 1 2
# 1: 1 ab #4
# 2: 2 gw #29
# 3: 3 mp #
# 4: 4 jq #35
# 5: 5 ez #

R: Apply a function over only character columns without type coercion

I have a data frame with many columns. The columns differ in their types: some are numeric, some are character, etc. Here's a small example where we just have 3 variables with 2 types:
# Generate data
dat <- data.frame(x = c("1","2","3"),
                  y = c(1.0,2.5,3.3),
                  z = c(1,2,3),
                  stringsAsFactors = FALSE)
I want to replace the value 3 with a space, but only for character columns. Here's my current code:
out <- as.data.frame(lapply(dat, function(x) {
  ifelse(is.character(x),
         gsub("3", " ", x),
         x)}),
  stringsAsFactors = FALSE)
The problem is that the ifelse() function ignores that y and z are numeric, and it also seems to coerce the numeric variables to character anyway.
An idea has been to pull out the character columns, gsub() them, and then bind them back to the original data frame. This, however, changes the ordering of the columns. Key to any solution is that I should not have to specify variables by name, only by type.
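A base R sketch of that idea which keeps the original column order is to overwrite the character columns in place by logical indexing instead of re-binding them (one possible approach, not necessarily the best):
# identify the character columns, then replace them in place so the order is preserved
chr_cols <- vapply(dat, is.character, logical(1))
dat[chr_cols] <- lapply(dat[chr_cols], function(x) gsub("3", " ", x))
dat
#   x   y z
# 1 1 1.0 1
# 2 2 2.5 2
# 3   3.3 3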
One can also do this trivially using dplyr:
# Load package
library(dplyr)
# Create data
dat <- data.frame(x = c("1","2","3"),
                  y = c(1.0,2.5,3.3),
                  z = c(1,2,3),
                  stringsAsFactors = FALSE)
# Replace 3's with spaces for character columns
dat <- dat %>% mutate_if(is.character, function(x) gsub(pattern = "3", " ", x))
I tried your code and for me it seems like ifelse did not work, but separating if and else does. Below is the code which works:
# Generate data
dat <- data.frame(x = c("1","2","3"),
                  y = c(1.0,2.5,3.3),
                  z = c(1,2,3),
                  stringsAsFactors = FALSE)
> lapply(dat, function(x) { if(is.character(x)) gsub("3", " ", x) else x })
$x
[1] "1" "2" " "
$y
[1] 1.0 2.5 3.3
$z
[1] 1 2 3
> as.data.frame(lapply(dat, function(x) { if(is.character(x)) gsub("3", " ", x) else x }))
  x   y z
1 1 1.0 1
2 2 2.5 2
3   3.3 3
It comes down to this line in ?ifelse
ifelse returns a value with the same shape as test ...
is.character(x) has length one, so the returned value has length 1. You can use if (...) yes else no instead, as #Heikki has suggested.
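A quick illustration of that behaviour: because the test has length 1, ifelse() only returns the first element of the yes branch, which is why whole columns get collapsed.
x <- c("1", "2", "3")
ifelse(is.character(x),    # length-1 test ...
       gsub("3", " ", x),  # ... so only the first element of this branch is returned
       x)
# [1] "1"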
Similar to #user3614648's solution:
library(dplyr)
dat %>%
  mutate_if(is.character, funs(ifelse(. == "3", " ", .)))
  x   y z
1 1 1.0 1
2 2 2.5 2
3   3.3 3
