R: Split a column into multiple columns by pattern [closed]

Closed. This question needs details or clarity. It is not currently accepting answers.
Closed 7 years ago.
I want to separate the digits and character in a column of a dataframe d.df:
col1
ab 12 14 56
xb 23 234 2342 2
ad 23 45
Expected output:
col1 col2
ab 12 14 56
xb 23 234 2342 2
ad 23 45
I recognize it will be something similar to this, but I'm not sure about the separators
t <- as.data.frame(str_match(d$col1,"^(.*)"))
I tried many methods and the output was:
col1 col2
a b 12 14 56
x b 23 234 2342 2
a d 23 45

You can use separate from tidyr.
library(tidyr)
d.df %>% separate(col1, c("col1", "col2"), sep="(?<=[a-z]{2} )")
# col1 col2
# 1 ab 12 14 56
# 2 xb 23 234 2342 2
# 3 ad 23 45
The regex, "(?<=[a-z]{2} )", is a look-behind, meaning "split at the position in the string after two lower case characters followed by a space". Look-behinds must have a bounded length in the regex engine used here, so {2} fixes the number of letters instead of an open-ended quantifier such as +.
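If the split point is simply the first space after the leading letters, the same result can be sketched in base R with capture groups instead of a look-behind. This is a minimal sketch, not the method from the answer above: it assumes every value starts with lowercase letters followed by a single space, and the names x, pattern and out are illustrative only.

```r
# Sample values from the question (assumed to be a character vector)
x <- c("ab 12 14 56", "xb 23 234 2342 2", "ad 23 45")

# Group 1 captures the leading letters, group 2 everything after the first space
pattern <- "^([a-z]+) (.*)$"

out <- data.frame(
  col1 = sub(pattern, "\\1", x),
  col2 = sub(pattern, "\\2", x),
  stringsAsFactors = FALSE
)
out
```

Because the space is consumed by the pattern rather than matched with a zero-width assertion, neither column carries a stray leading or trailing space.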

Here is an option with data.table.
library(data.table) # v1.9.5+
setnames(
  setDT(df1)[, tstrsplit(col1, '(?<=[^0-9]) (?=[0-9])', perl = TRUE)],
  paste0('col', 1:2)
)[]
# col1 col2
#1: ab 12 14 56
#2: xb 23 234 2342 2
#3: ad 23 45
We convert the 'data.frame' to 'data.table' (setDT(df1)). Using tstrsplit from 'data.table' (v1.9.5 or later), we split 'col1' at the space that comes after a letter and before a numeric part. We use regex lookarounds ((?<=[^0-9]) and (?=[0-9])) for matching.
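To see what the lookarounds select, the same pattern can be tried in base R without data.table. This is only an illustrative sketch: it assumes every string contains such a space (regexpr returns -1 otherwise), and the names x, pos and out are made up for the example.

```r
x <- c("ab 12 14 56", "xb 23 234 2342 2", "ad 23 45")

# Position of the first space with a non-digit before it and a digit after it
pos <- as.integer(regexpr("(?<=[^0-9]) (?=[0-9])", x, perl = TRUE))

# Cut each string on either side of that space
out <- data.frame(
  col1 = substring(x, 1, pos - 1),
  col2 = substring(x, pos + 1),
  stringsAsFactors = FALSE
)
out
```

The spaces between the numbers are left alone because each of them has a digit in front, so the look-behind (?<=[^0-9]) rejects them.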

The approach here will vary significantly depending on whether this is what your strings actually look like or just an example. If they are always two letters followed by a space and the numbers, you can use substring:
> df <- data.frame(col1 = c("ab 12 14 56", "xb 23 234 2342 2", "ad 23 45"), stringsAsFactors = FALSE)
> # substring is vectorised; positions start at 1, and position 3 is the space
> df$col1.1 <- substring(df$col1, 1, 2)
> df$col1.2 <- substring(df$col1, 4)
>
> df
col1 col1.1 col1.2
1 ab 12 14 56 ab 12 14 56
2 xb 23 234 2342 2 xb 23 234 2342 2
3 ad 23 45 ad 23 45
If the length and positions of the strings change, regex might be better suited. Using a base R approach, you can extract only the numbers or letters (keeping the white spaces):
> df <- data.frame(col1 = c("ab 12 14 56", "xb 23 234 2342 2", "ad 23 45"))
> df$col1.1 <- sapply(regmatches(df$col1, gregexpr("[a-zA-Z]", df$col1)), paste, collapse = "")
> df$col1.2 <- sapply(regmatches(df$col1, gregexpr("[0-9]\\s*", df$col1)), paste, collapse = "")
> df
col1 col1.1 col1.2
1 ab 12 14 56 ab 12 14 56
2 xb 23 234 2342 2 xb 23 234 2342 2
3 ad 23 45 ad 23 45

Related

Split numeric variables by decimals in R

I have a data frame with a column that contains numeric values, which represent the price.
ID    Total
1124  12.34
1232  12.01
1235  13.10
I want to split the column Total by "." and create 2 new columns with the euro and cent amount. Like this:
ID    Total  Euro  Cent
1124  12.34  12    34
1232  12.01  12    01
1235  13.10  13    10
1225  13.00  13    00
The euro and cent column should also be numeric.
I tried:
df[c('Euro', 'Cent')] <- str_split_fixed(df$Total, "(\\.)", 2)
But I get 2 new columns of type character that looks like this:
ID    Total  Euro  Cent
1124  12.34  12    34
1232  12.01  12    01
1235  13.10  13    1
1225  13.00  13
If I convert the character columns (euro and cent) to numeric like this:
as.numeric(df$Euro)
the 00 cent value turns into NULL and the 10 cent value turns into 1 cent.
Any help is welcome.
Two methods:
If class(dat$Total) is numeric, you can do this:
dat <- transform(dat, Euro = Total %/% 1, Cent = round(100 * (Total %% 1)))
dat
# ID Total Euro Cent
# 1 1124 12.34 12 34
# 2 1232 12.01 12 1
# 3 1235 13.10 13 10
%/% is the integer-division operator, %% the modulus operator.
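Most decimal fractions are not exactly representable in binary floating point, so 100 * (Total %% 1) can land slightly off a whole number even though it prints as one. The sketch below, with illustrative names total, euro, cent and cent_label, rounds the cent value and shows one way to get a zero-padded label for display while keeping the numeric columns numeric. It assumes non-negative prices with at most two decimals.

```r
total <- c(12.34, 12.01, 13.10)

# Integer division by 1 keeps the whole euros
euro <- total %/% 1

# round() guards against results that are a hair below or above a whole cent
cent <- round(100 * (total %% 1))

# Zero-padded text for display only; euro and cent stay numeric
cent_label <- sprintf("%02d", as.integer(cent))

data.frame(total, euro, cent, cent_label)
```

A numeric column cannot store a leading zero, so "01" or "00" only exists in the character label.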
If class(dat$Total) is character, then
dat <- transform(dat, Euro = sub("\\..*", "", Total), Cent = sub(".*\\.", "", Total))
dat
# ID Total Euro Cent
# 1 1124 12.34 12 34
# 2 1232 12.01 12 01
# 3 1235 13.10 13 10
The two new columns are also character. For this, you may want one of two more steps:
Removing leading 0s, and keep them character:
dat[,c("Euro", "Cent")] <- lapply(dat[,c("Euro", "Cent")], sub, pattern = "^0+", replacement = "")
dat
# ID Total Euro Cent
# 1 1124 12.34 12 34
# 2 1232 12.01 12 1
# 3 1235 13.10 13 10
Convert to numbers:
dat[,c("Euro", "Cent")] <- lapply(dat[,c("Euro", "Cent")], as.numeric)
dat
# ID Total Euro Cent
# 1 1124 12.34 12 34
# 2 1232 12.01 12 1
# 3 1235 13.10 13 10
(You can also use as.integer if you know both columns will always be such.)
Just use standard numeric functions:
df$Euro <- floor(df$Total)
df$Cent <- round(df$Total %% 1 * 100)

Concatenating levels of one column and merging the values of another column [duplicate]

This question already has answers here:
Collapse / concatenate / aggregate a column to a single comma separated string within each group
(6 answers)
Closed 3 years ago.
I have a column with common levels as reps (1-4). I have data that goes with them in col3. Some of the levels don't contain information but for the ones that do, I would like to merge the values into one column by each common level in col1. The values in col3 are not consistent.
I have tried removing duplicates but this does not merge col3 values.
train <- data.table(col1=c(rep('a0001',4),rep('b0002',4)), col2=c(seq(1,4,1),seq(1,4,1)), col3=c("12 43 543 1232 43 543", "","","","15 24 85 64 85 25 46","","658 1568 12 584 15684",""))
this is reproducible code
I have about 40000 lines to do.
result<-data.frame(col1=c("a0001","b0002"),col3=c("12 43 543 1232 43 543",'15 24 85 64 85 25 46 658 1568 12 584 15684'))
This is the result I am looking for...
We can split the col3 values into separate rows with separate_rows, remove the empty values, group_by col1, and paste the col3 values back together.
library(dplyr)
train %>%
  tidyr::separate_rows(col3) %>%
  filter(col3 != '') %>%
  group_by(col1) %>%
  summarise(col3 = paste(col3, collapse = " "))
# col1 col3
# <chr> <chr>
#1 a0001 12 43 543 1232 43 543
#2 b0002 15 24 85 64 85 25 46 658 1568 12 584 15684
I am learning from @Ronak Shah's answer. This could be a variation:
library(dplyr)
train %>% group_by(col1) %>% summarise(col3 = paste(col3, collapse = " "))
col1 col3
<chr> <chr>
1 a0001 "12 43 543 1232 43 543 "
2 b0002 "15 24 85 64 85 25 46 658 1568 12 584 15684 "
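The variation above pastes the empty strings as well, which is why its output ends in trailing spaces. A minimal base R sketch that drops the empty values first is shown here. It rebuilds the question's data as a plain data.frame, and the names keep, merged and result are illustrative.

```r
train <- data.frame(
  col1 = c(rep("a0001", 4), rep("b0002", 4)),
  col2 = rep(1:4, 2),
  col3 = c("12 43 543 1232 43 543", "", "", "",
           "15 24 85 64 85 25 46", "", "658 1568 12 584 15684", ""),
  stringsAsFactors = FALSE
)

# Keep only rows that actually carry values
keep <- train$col3 != ""

# Paste the remaining strings together within each level of col1
merged <- tapply(train$col3[keep], train$col1[keep], paste, collapse = " ")

result <- data.frame(
  col1 = names(merged),
  col3 = as.vector(merged),
  stringsAsFactors = FALSE
)
result
```

Filtering before pasting avoids the need to trim whitespace afterwards.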

Splitting a column with multiple and unevenly distributed delimiters in R

I have a column/vector of character data that I need to separate into different columns. The problem? There are different delimiters (which mean different things), and different lengths between each delimiter. For example:
column_name
akjhaa 1-29 y 12-30
bsd, 14-20
asdf asdf del 2-5 y 6
dkljwv 3-31
joikb 6-22
sqwzsxcryvyde jd de 1-2
pk, ehde 1-2
jsd 1-15
asdasd asedd 1,3
The numbers need to be separated into columns apart from the characters. However, the numbers can be separated by a comma or dash or 'y'. Moreover, the numbers separated by dash should be somehow designated, as eventually, I need to make a document/vector where each of the numbers in that range is in their own column also (such that the split aaa column would become aaa 1 2 3 4 5 .... 29 12 13 ... 30).
So far, I have tried separating into columns based on the different delimiters, but because sometimes the values have more than one '-','y', or the 'y' falls as a word in one of the first character parts, it is starting to get a bit complicated...is there an easier way?
For clarification, in the particular "column_name" I gave, the final output would be such that i would have n columns, where n = (the highest number of numbers + 1 (the character string of the column name)). So, in the example of the provided "column_name," it would look like:
column_name n1 n2 n3 n4 n5 n6 n7 n8 n9 n10 n11 n12 n13 n14 n15 n16 n17 n18 n19 n20 n21 n22 n23 n24 n25 n26 n27 n28 n29 n30 n31 n32 n33 n34 n35 n36 n37 n38 n39 n40 n41 n42 n43 n44 n45 n46 n47 n48 n49 n50 n51 n52 n53 n54 n55 n56 n57 n58
akjhaa 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30
bsd 14 15 16 17 18 19 20
asdf asdf del 2 3 4 5 6
dkljwv 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
joikb 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22
sqwzsxcryvyde jd de 1 2
pk ehde 1 2
jsd 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
asdasd asedd 1 3
This isn't pretty, but it works. The result is a list column with the relevant values.
library(magrittr)
library(data.table) # for setDT() and rowid()
library(splitstackshape)
setDT(mydf)[, CN := gsub("(.*?) ([0-9].*)", "\\1SPLIT\\2", column_name)] %>%
cSplit("CN", "SPLIT") %>%
cSplit("CN_2", "[y,]", "long", fixed = FALSE) %>%
cSplit("CN_2", "-") %>%
.[, list(values = list(if (is.na(CN_2_2)) CN_2_1 else CN_2_1:CN_2_2)),
.(CN_1, rowid(CN_1))] %>%
.[, list(values = list(unlist(values))), .(CN_1)]
# CN_1 values
# 1: akjhaa 1,2,3,4,5,6,...
# 2: bsd, 14,15,16,17,18,19,...
# 3: asdf asdf del 2,3,4,5,6
# 4: dkljwv 3,4,5,6,7,8,...
# 5: joikb 6, 7, 8, 9,10,11,...
# 6: sqwzsxcryvyde jd de 1,2
# 7: pk, ehde 1,2
# 8: jsd 1,2,3,4,5,6,...
# 9: asdasd asedd 1,3
To get the extra columns instead of a list, you would need one more line: cbind(., .[, data.table::transpose(values)]):
as.data.table(mydf)[, CN := gsub("(.*?) ([0-9].*)", "\\1SPLIT\\2", column_name)] %>%
cSplit("CN", "SPLIT") %>%
cSplit("CN_2", "[y,]", "long", fixed = FALSE) %>%
cSplit("CN_2", "-") %>%
.[, list(values = list(if (is.na(CN_2_2)) CN_2_1 else CN_2_1:CN_2_2)),
.(CN_1, rowid(CN_1))] %>%
.[, list(values = list(unlist(values))), .(CN_1)] %>%
cbind(., .[, data.table::transpose(values)])
The basic idea is to do the following steps:
1. Split the column names from the values.
2. Split values separated by "y" or by a "," into new rows.
3. Split values separated by "-" into multiple columns.
4. Create your list of vectors according to the rule that if any values in the second split column are NA, return just the value from the first column, otherwise, create the sequence from the value in the first column to the value in the second column. Since you have duplicated "id" values because you've converted the data into a longer form, use rowid() to help with the grouping.
5. Consolidate the values in the list column according to the actual IDs.
6. (Optionally, in my opinion) transform the list data into multiple columns.
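The same steps can be sketched for a single string in base R. This is an illustrative alternative, not the splitstackshape code above: expand_ranges is a hypothetical helper, and it assumes the label part contains no digits, so that everything from the first digit onwards is the numeric part.

```r
# Hypothetical helper: turn "name 1-3 y 7" into c(1, 2, 3, 7)
expand_ranges <- function(x) {
  # Drop the label: remove everything before the first digit
  tail_part <- sub("^[^0-9]*", "", x)
  # Split on "y" or ",", allowing spaces around the delimiter
  pieces <- strsplit(tail_part, "[[:space:]]*[y,][[:space:]]*")[[1]]
  unlist(lapply(pieces, function(p) {
    bounds <- as.integer(strsplit(p, "-", fixed = TRUE)[[1]])
    # Two bounds mean a range; a single number is kept as it is
    if (length(bounds) == 2) seq(bounds[1], bounds[2]) else bounds
  }))
}

expand_ranges("asdf asdf del 2-5 y 6")
expand_ranges("asdasd asedd 1,3")
```

Applying the helper over the whole column with lapply gives a list of vectors of different lengths, comparable to the list column in the answer.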

How can I transpose dataset in R [duplicate]

This question already has answers here:
Transpose / reshape dataframe without "timevar" from long to wide format
(9 answers)
Closed 5 years ago.
I have a dataset A shown as below. How can I transform dataset A to dataset B. Dataset A contains over 10,000 observations in my file. Is there any easy way to do it?
Dataset A:
Line 1:AB 12 23
Line 2:AB 34 56
Line 3:CD 78 90
Line 4:EF 13 45
Dataset B:
Line 1:AB 12 23 34 56
Line 2:CD 78 90 NA NA
Line 3:EF 13 45 NA NA
Try this, using cSplit:
library(splitstackshape)
library(dplyr)
DatA['new'] <- apply(DatA[, -1], 1, paste, collapse = ",")
DatA <- DatA %>% group_by(Alphabet) %>% summarise(new = paste(new, collapse = ','))
cSplit(DatA, 2, drop = TRUE, sep = ',')
Alphabet new_1 new_2 new_3 new_4
1: AB 12 23 34 56
2: CD 78 90 NA NA
3: EF 13 45 NA NA
Data input
DatA <- data.frame(Alphabet = c("AB", "AB", "CD","EF"),
Value1 = c(12,34,78,13),Value2 = c(23,56,90,45),stringsAsFactors = F)
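For comparison, here is a base R sketch of the same reshaping that avoids the paste-and-split round trip. It is an illustrative alternative under the assumption that all value columns are numeric; the names vals, width and wide are made up for the example.

```r
DatA <- data.frame(Alphabet = c("AB", "AB", "CD", "EF"),
                   Value1 = c(12, 34, 78, 13),
                   Value2 = c(23, 56, 90, 45),
                   stringsAsFactors = FALSE)

# For each group, read the value columns row by row into one vector
vals <- lapply(split(DatA[-1], DatA$Alphabet),
               function(g) as.vector(t(as.matrix(g))))

# Pad every group with NA up to the length of the longest one
width <- max(lengths(vals))
wide <- t(sapply(vals, function(v) c(v, rep(NA, width - length(v)))))
wide
```

The result is a numeric matrix with one row per group, so the values keep their numeric type.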

How to gather series of columns with data into rows [duplicate]

This question already has answers here:
Reshaping multiple sets of measurement columns (wide format) into single columns (long format)
(8 answers)
Closed 5 years ago.
I'm just trying to get my head around tidying my data and I have this problem:
I have data as follows:
ID Tx1 Tx1Date Tx1Details Tx2 Tx2Date Tx2Details Tx3 Tx1Date Tx1Details
1 14 12/3/14 blabla 1e 12/5/14 morebla r 14/2/14 grrr
2 23 14/5/16 albalb 342 1/4/5 teeee s 5/6/17 purrr
I want the data to be in the format
ID Tx TxDate TxDetails
1 14 12/3/14 blabla
1 1e 12/5/14 morebla
1 r 14/2/14 grrr
2 23 14/5/16 albalb
2 342 1/4/5 teeee
2 s 5/6/17 purrr
I have used
library(tidyr)
library(dplyr)
NewData<-mydata %>% gather(key, value, "ID", 2:10)
but I'm not sure how to rename the columns as per the intended output to see if this will work
You can rename the data frame's columns to more conventional, separable names and then use the base reshape function. This assumes your initial data frame looks like the one below (the last two column names are changed to Tx3Date and Tx3Details, since otherwise they duplicate columns 3 and 4):
df
# ID Tx1 Tx1Date Tx1Details Tx2 Tx2Date Tx2Details Tx3 Tx3Date Tx3Details
#1 1 14 12/3/14 blabla 1e 12/5/14 morebla r 14/2/14 grrr
#2 2 23 14/5/16 albalb 342 1/4/5 teeee s 5/6/17 purrr
names(df) <- gsub("(\\d)(\\w*)", "\\2\\.\\1", names(df))
df
# ID Tx.1 TxDate.1 TxDetails.1 Tx.2 TxDate.2 TxDetails.2 Tx.3 TxDate.3 TxDetails.3
#1 1 14 12/3/14 blabla 1e 12/5/14 morebla r 14/2/14 grrr
#2 2 23 14/5/16 albalb 342 1/4/5 teeee s 5/6/17 purrr
reshape(df, varying = 2:10, idvar = "ID", direction = "long")
# ID time Tx TxDate TxDetails
#1.1 1 1 14 12/3/14 blabla
#2.1 2 1 23 14/5/16 albalb
#1.2 1 2 1e 12/5/14 morebla
#2.2 2 2 342 1/4/5 teeee
#1.3 1 3 r 14/2/14 grrr
#2.3 2 3 s 5/6/17 purrr
Drop the redundant time variable if you don't need it.
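A cut-down, self-contained version of the rename-and-reshape approach is sketched below, including the step that drops the time column. The small data frame (two treatments, no Details columns) and the name long are assumptions made for brevity.

```r
df <- data.frame(ID = 1:2,
                 Tx1 = c("14", "23"), Tx1Date = c("12/3/14", "14/5/16"),
                 Tx2 = c("1e", "342"), Tx2Date = c("12/5/14", "1/4/5"),
                 stringsAsFactors = FALSE)

# Move the digit to the end of each name: "Tx1Date" becomes "TxDate.1"
names(df) <- gsub("(\\d)(\\w*)", "\\2.\\1", names(df))

# reshape() splits the names on "." to find the variable and the time index
long <- reshape(df, varying = 2:5, idvar = "ID", direction = "long")

# Drop the time index and sort by ID to match the requested layout
long$time <- NULL
long <- long[order(long$ID), ]
long
```

The rename matters because reshape guesses the long-format variable names from whatever precedes the separator.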
The data.table package handles this pretty well.
library(data.table)
setDT(df)
melt(df, measure = list(Tx = grep("^Tx[0-3]$", names(df)),
Date = grep("Date", names(df)),
Details = grep("Details", names(df))),
value.name = c("Tx", "TxDate", "TxDetails"))
Or more concisely
melt(df, measure = patterns("^Tx[0-3]$", "Date", "Details"),
value.name = c("Tx", "TxDate", "TxDetails"))
