Programmatically split non-delimited strings and generate new columns - r

I have a 1 column data table containing non-delimited strings like so
d1 = data.table(x = c("2728661941-1945", "2657461921-1925", "2786161921-1925"))
d1
#> x
#> 1: 2728661941-1945
#> 2: 2657461921-1925
#> 3: 2786161921-1925
I have another data table of the form
dic = data.table(field = c("ID","group","year"),start=c(1,6,7), length=c(5,1,9))
dic
#> field start length
#> 1: ID 1 5
#> 2: group 6 1
#> 3: year 7 9
I want to split the strings in the data table d1 using the information in dic, and end up with a new data frame of the form
d2 = data.table(ID = c("27286", "26574", "27861"),
group = c(6, 6, 6),
year = c("1941-1945", "1921-1925", "1921-1925"))
d2
#> ID group year
#> 1: 27286 6 1941-1945
#> 2: 26574 6 1921-1925
#> 3: 27861 6 1921-1925
I have tried
d2 = copy(d1)[,(dic$field) := transpose(
lapply(x, stri_sub, from = dic$start, length = dic$length))]
But the underlying data is in list form, not in table form. I want to be able to refer to the created fields as columns.
I have to admit I am not entirely sure what I am doing, and I don't really have to use data table for this, but I can't think of another way to do it. The easiest dataset I have contains strings of 79 characters, and there are 25 fields that would be generated, so I would prefer not to have to pull each field individually.
I hope this makes sense. Any suggestions are appreciated.

1) read.fwf Try read.fwf. No packages are used.
read.fwf(textConnection(d1$x), dic$length, col.names = dic$field)
giving:
ID group year
1 27286 6 1941-1945
2 26574 6 1921-1925
3 27861 6 1921-1925
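Note that read.fwf guesses column types, so ID and group come back as integers rather than the strings shown in the desired d2. A minimal sketch of one way to keep every field as text, assuming the fields are contiguous and start at position 1 (plain data frames are used here only to keep the example self-contained):

```r
# Sample data copied from the question.
x <- c("2728661941-1945", "2657461921-1925", "2786161921-1925")
dic <- data.frame(field = c("ID", "group", "year"),
                  start = c(1, 6, 7),
                  length = c(5, 1, 9),
                  stringsAsFactors = FALSE)

# colClasses is passed through read.fwf to read.table, so nothing is
# converted to a number and leading zeros would survive.
con <- textConnection(x)
fw <- read.fwf(con, widths = dic$length, col.names = dic$field,
               colClasses = "character")
close(con)

str(fw)
```

If some fields should be numeric after all, individual columns can be converted afterwards with as.integer or as.numeric.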
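2) separate This also works and gives the same answer. A numeric sep gives the positions to break after, so it needs one fewer value than there are fields:
library(tidyr)
d1 %>%
separate(x, into = dic$field, sep = dic$start[-1] - 1, remove = TRUE)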

A regex is useful here, particularly since you can programmatically define both the pattern to search for and the replacement to output:
library(dplyr)
library(tidyr)
d1 %>%
mutate(x=gsub(paste0("(.{", dic$length, "})", collapse=""), paste0("\\", seq_along(dic$length), collapse=" "), x)) %>%
separate(x, into=dic$field, sep=" ")
# ID group year
# 1 27286 6 1941-1945
# 2 26574 6 1921-1925
# 3 27861 6 1921-1925
Explanation
# Pattern to search for
paste0("(.{", dic$length, "})", collapse="")
# "(.{5})(.{1})(.{9})"
# (.{5}) - group that contains any 5 characters - will be group 1
# (.{1}) - group that contains any 1 character - will be group 2
# (.{9}) - group that contains any 9 characters - will be group 3
# Pattern to output
paste0("\\", seq_along(dic$length), collapse=" ")
# "\\1 \\2 \\3"
# \\1 - output group 1
# \\2 - output group 2
# \\3 - output group 3
# each group is separated by a space
Use tidyr::separate to split the resulting space-delimited string into distinct fields

Not using the dic table, but this can be easily done with extract from tidyr:
library(tidyr)
extract(d1, x, c("ID", "group", "year"), "^(.{5})(.{1})(.{9})$")
Result:
ID group year
1: 27286 6 1941-1945
2: 26574 6 1921-1925
3: 27861 6 1921-1925

Using the dic table as reference (note that mutate_ is deprecated in recent dplyr versions):
library(dplyr)
breaks <- setNames(as.list(paste0("substr(x", ", ", dic$start, ", ", dic$start+dic$length-1, ")")), dic$field)
d1 %>%
mutate_(.dots = breaks)
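The same idea, taking start and end positions straight from dic, can be sketched in base R without building code strings. The helper name split_by_dic is hypothetical and is not part of any package:

```r
# Hypothetical helper: cut each string into the fields described by dic.
split_by_dic <- function(x, dic) {
  # substr() is vectorised over x, so each call returns one whole column.
  cols <- Map(function(s, l) substr(x, s, s + l - 1), dic$start, dic$length)
  as.data.frame(setNames(cols, dic$field), stringsAsFactors = FALSE)
}

# Sample data copied from the question.
x <- c("2728661941-1945", "2657461921-1925", "2786161921-1925")
dic <- data.frame(field = c("ID", "group", "year"),
                  start = c(1, 6, 7),
                  length = c(5, 1, 9),
                  stringsAsFactors = FALSE)

res <- split_by_dic(x, dic)
res
```

Because it only relies on start and length, this also works when the fields in dic overlap or leave gaps.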

setNames(data.frame(do.call(rbind, lapply(d1$x, function(X) sapply(1:NROW(dic),
function(i) substring(X, dic$start[i], dic$start[i] + dic$length[i] - 1))))), dic$field)
# ID group year
#1 27286 6 1941-1945
#2 26574 6 1921-1925
#3 27861 6 1921-1925

We can use the strcapture function from base R to capture the substrings. Then we put them into a data frame whose columns have been predefined.
strcapture("(\\d{5})(\\d)(.*)",d1$x,data.frame(Id=numeric(),group=numeric(),year=character()))
Id group year
1 27286 6 1941-1945
2 26574 6 1921-1925
3 27861 6 1921-1925
Explanation: (\\d{5}) captures the first 5 digits, (\\d) captures the next digit, and (.*) captures everything after that.
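With 25 fields, writing the pattern and the prototype by hand gets tedious. A sketch of building both from dic, assuming the fields are contiguous and cover the whole string (all columns are kept as character here, which is a choice made for this example):

```r
# Sample data copied from the question.
x <- c("2728661941-1945", "2657461921-1925", "2786161921-1925")
dic <- data.frame(field = c("ID", "group", "year"),
                  start = c(1, 6, 7),
                  length = c(5, 1, 9),
                  stringsAsFactors = FALSE)

# One capture group per field: "^(.{5})(.{1})(.{9})$"
pattern <- paste0("^", paste0("(.{", dic$length, "})", collapse = ""), "$")

# Zero-row prototype with one character column per field.
proto <- as.data.frame(
  setNames(rep(list(character()), nrow(dic)), dic$field),
  stringsAsFactors = FALSE
)

res <- strcapture(pattern, x, proto)
res
```

Strings that do not match the pattern (for example, ones with the wrong length) come back as a row of NA values.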

Related

Split columns considering only the first dot in R using separate

This is my dataframe:
df <- tibble(col1 = c("1. word","2. word","3. word","4. word","5. N. word","6. word","7. word","8. word"))
I need to split this into two columns using the separate function, one called Numbers and the other called Words. I've tried this, but it's not working:
df %>% separate(col = col1 , into = c('Number','Words'), sep = "^. ")
The problem is that the fifth row has two dots, and I don't know how to handle this in the regex.
Any help?
Here is an alternative using readr's parse_number and a regex:
library(dplyr)
library(readr)
df %>%
mutate(Numbers = parse_number(col1), .before=1) %>%
mutate(col1 = gsub('\\d+\\. ','',col1))
Numbers col1
<dbl> <chr>
1 1 word
2 2 word
3 3 word
4 4 word
5 5 N. word
6 6 word
7 7 word
8 8 word
A tidyverse approach would be to first clean the data then separate.
df %>%
mutate(col1 = gsub("\\s.*(?=word)", "", col1, perl=TRUE)) %>%
tidyr::separate(col1, into = c("Number", "Words"), sep="\\.")
Result:
# A tibble: 8 x 2
Number Words
<chr> <chr>
1 1 word
2 2 word
3 3 word
4 4 word
5 5 word
6 6 word
7 7 word
8 8 word
I'm assuming that you would like to keep the cumbersome "N." in the result. For that, my advice is to use extract instead of separate:
df %>%
extract(
col = col1 ,
into = c('Number','Words'),
regex = "([0-9]+)\\. (.*)")
The regular expression ([0-9]+)\\. (.*) means that you are looking first for a number, that you want to put in a first column, followed by a dot and a space (\\. ) that should be discarded, and the rest should go in a second column.
The result:
# A tibble: 8 × 2
Number Words
<chr> <chr>
1 1 word
2 2 word
3 3 word
4 4 word
5 5 N. word
6 6 word
7 7 word
8 8 word
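For comparison, the same regular expression can be applied in base R with regexec and regmatches. This is only a sketch and assumes every string matches the pattern; a non-matching string would yield an empty match and break the rbind step:

```r
# Sample data copied from the question.
col1 <- c("1. word", "2. word", "3. word", "4. word",
          "5. N. word", "6. word", "7. word", "8. word")

# Each list element is c(full match, group 1, group 2).
m <- regmatches(col1, regexec("^([0-9]+)\\. (.*)$", col1))
parts <- do.call(rbind, m)

res <- data.frame(Number = parts[, 2], Words = parts[, 3],
                  stringsAsFactors = FALSE)
res
```

Only the first "dot plus space" is consumed because ([0-9]+) can match nothing but the leading digits, so "N. word" stays intact in the second column.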
Try read.table + sub
> read.table(text = sub("\\.", ",", df$col1), sep = ",")
V1 V2
1 1 word
2 2 word
3 3 word
4 4 word
5 5 N. word
6 6 word
7 7 word
8 8 word
I am not sure how to do this with tidyr, but the following should work with base R.
df$col1 <- gsub('N. ', '', df$col1, fixed = TRUE) # fixed = TRUE so "." is matched literally
df$Numbers <- as.numeric(sapply(strsplit(df$col1, ' '), '[', 1))
df$Words <- sapply(strsplit(df$col1, ' '), '[', 2)
df$col1 <- NULL
Result
> head(df)
Numbers Words
1 1 word
2 2 word
3 3 word
4 4 word
5 5 word
6 6 word

pivot_longer: passing "names_sep" as character yields different result than passing it as character index

Say we have a data frame which looks something like
df <- data.frame(x_A = c(1, 2), x_B = c(3, 4), y_A = c(5, 6), y_B = c(7, 8))
df
x_A x_B y_A y_B
1 1 3 5 7
2 2 4 6 8
Using library(tidyr), I'm wondering why passing names_sep = "_" to pivot_longer yields a different result than names_sep = 2, as seen in the following:
pivot_longer(df, x_A:y_B, names_to = c("name1", "name2"), names_sep = "_")
# A tibble: 8 x 3
name1 name2 value
<chr> <chr> <dbl>
1 x A 1
2 x B 3
3 y A 5
4 y B 7
5 x A 2
6 x B 4
7 y A 6
8 y B 8
pivot_longer(df, x_A:y_B, names_to = c("name1", "name2"), names_sep = 2)
# A tibble: 8 x 3
name1 name2 value
<chr> <chr> <dbl>
1 x_ A 1
2 x_ B 3
3 y_ A 5
4 y_ B 7
5 x_ A 2
6 x_ B 4
7 y_ A 6
8 y_ B 8
When passing a string with the character to break on, that character itself is dropped. When passing the index of the character, it is not. Could someone explain why there is a difference?
From the online doc: "names_sep takes the same specification as separate(), and can either be a numeric vector (specifying positions to break on), or a single string (specifying a regular expression to split on)".
Note the difference in wording "split" for regular expression and "position to break" for numerics. So with a regular expression, the split string is taken as a word separator (and not included in the output column names). With a numeric, there is no "separator" and all characters in the original column name appear in the output.
As @KonradRudolph says, this is intuitive. If you wanted the separator to appear in the output when using a regex, you would have an inconsistency: does the separator get associated with the "name" to its left or to its right? You can only resolve that by convention (which is sub-optimal) or with an additional parameter (which, to me, is unnecessary and overly complicated).
Different results, but documented and intentional. I wouldn't describe the behaviour as inconsistent.
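The distinction between splitting on a pattern and breaking at a position can be illustrated in base R on the column names alone. This is an analogy using strsplit and substr, not tidyr's actual implementation:

```r
nm <- c("x_A", "x_B", "y_A", "y_B")

# Splitting on a pattern: the matched separator is consumed.
by_regex <- do.call(rbind, strsplit(nm, "_", fixed = TRUE))

# Breaking after position 2: every character ends up in one of the pieces.
by_pos <- cbind(substr(nm, 1, 2), substr(nm, 3, nchar(nm)))

by_regex
by_pos
```

In the first matrix the underscore is gone; in the second it stays attached to the left-hand piece, which matches the "x_" and "y_" values seen with names_sep = 2.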

Remove unwanted letter in data column names in R environment

I have a dataset that contains a large number of columns. Every column name is a date in the form x2019.10.10.
I want to remove the letter x and change the date format to 2019-10-10.
How can this be done in R?
One solution would be:
Get rid of x
Replace . with -.
Here I create a dataframe that has similar columns to yours:
df = data.frame(x2019.10.10 = c(1, 2, 3),
x2020.10.10 = c(4, 5, 6))
df
x2019.10.10 x2020.10.10
1 1 4
2 2 5
3 3 6
And then, using dplyr (looks much tidier):
library(dplyr)
names(df) = names(df) %>%
gsub("^x", "", .) %>% # Get rid of the leading x and then (%>%):
gsub("\\.", "-", .) # replace "." with "-"
df
2019-10-10 2020-10-10
1 1 4
2 2 5
3 3 6
If you do not want to use dplyr, here is how you would do the same thing in base R:
names(df) = gsub("^x", "", names(df))
names(df) = gsub("\\.", "-", names(df))
df
2019-10-10 2020-10-10
1 1 4
2 2 5
3 3 6
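A variation on the same idea is to parse the names as dates, which also checks that every column name really has the expected form. This is a sketch assuming all names follow the x2019.10.10 pattern from the question:

```r
df <- data.frame(x2019.10.10 = c(1, 2, 3),
                 x2020.10.10 = c(4, 5, 6))

# The literal "x" and the dots are part of the format string.
dates <- as.Date(names(df), format = "x%Y.%m.%d")

# A name that does not fit the pattern parses to NA.
if (anyNA(dates)) stop("some column names are not in the form x2019.10.10")

# Column names must be strings, so format the dates back as YYYY-MM-DD.
names(df) <- format(dates)
names(df)
```

The dates vector itself is a true Date object and can be kept around if the dates are needed for sorting or arithmetic.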

convert data frame of "missed" numbers into data frame of numbers "hit"

I have quite a specific question. It should be easy to solve, but I just cannot think how...
I have a simple data frame like this:
mydf <- data.frame(Shooter=1:3, Targets.missed=c(paste(sample(1:10,4),collapse=";"), paste(sample(1:10,5),collapse=";"), paste(sample(1:10,8),collapse=";")))
mydf
Shooter Targets.missed
1 1 3;8;4;7
2 2 10;1;5;7;4
3 3 5;9;4;10;8;1;6;7
This data frame tells me the Targets (from 1 to 10) that are missed by each Shooter.
I would like to obtain a different data frame that tells me, per Target, which Shooter(s) hit it.
The result would be:
Target hit.by.Shooters
1 1
2 1;2;3
3 2;3
4 NA
5 1
6 1;2
7 NA
8 2
9 1;2
10 1
We expand the data into 'long' format by splitting 'Targets.missed' at the ;. Then, grouped by 'Shooter', we summarise into a list of the numbers from 1:10 that are not in 'Targets.missed' and unnest that list column. Finally, grouped by 'Target', we summarise by pasteing the unique 'Shooter' elements into a single string, and use complete to fill the targets from 1:10 that nobody hit with NA.
library(tidyverse)
mydf %>%
separate_rows(Targets.missed) %>%
group_by(Shooter) %>%
summarise(Target = list(setdiff(1:10, Targets.missed))) %>%
unnest(Target) %>%
group_by(Target) %>%
summarise(hit.by.Shooters = paste(unique(Shooter), collapse=";")) %>%
complete(Target = 1:10)
# A tibble: 10 x 2
# Target hit.by.Shooters
# <int> <chr>
# 1 1 1
# 2 2 1;2;3
# 3 3 2;3
# 4 4 <NA>
# 5 5 1
# 6 6 1;2
# 7 7 <NA>
# 8 8 2
# 9 9 1;2
#10 10 1
Another option is base R: split 'Targets.missed' (assuming character class) into a list of vectors, loop through the list to get the values of 1:10 that are not among the missed ones (with setdiff), set the names of the list from the 'Shooter' column, stack the key/value list pairs into a two-column data.frame, take the unique rows, aggregate by pasteing the 'ind' column grouped by 'values', and merge with a full 'values' dataset from 1:10:
out <- aggregate(ind ~ values,
unique(stack(setNames(lapply(strsplit(mydf$Targets.missed, ';'),
setdiff, x= 1:10), mydf$Shooter))), FUN = paste, collapse=";")
out1 <- merge(data.frame(values = 1:10), out, all.x = TRUE)
and change the column names if necessary
names(out1) <- c('Target', 'hit.by.Shooters')
data
mydf <- structure(list(Shooter = 1:3, Targets.missed = c("3;8;4;7", "10;1;5;7;4",
"5;9;4;10;8;1;6;7")), class = "data.frame", row.names = c("1",
"2", "3"))
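The same inversion can also be written step by step, which may be easier to follow than the nested one-liner. This is a sketch under the same assumptions (targets numbered 1 to 10, 'Targets.missed' stored as character); the intermediate names are made up for the example:

```r
mydf <- data.frame(
  Shooter = 1:3,
  Targets.missed = c("3;8;4;7", "10;1;5;7;4", "5;9;4;10;8;1;6;7"),
  stringsAsFactors = FALSE
)
all_targets <- 1:10

# Step 1: one integer vector of missed targets per shooter.
missed <- lapply(strsplit(mydf$Targets.missed, ";", fixed = TRUE), as.integer)

# Step 2: for each target, keep the shooters who did not miss it.
hit_by <- vapply(all_targets, function(target) {
  did_hit <- !vapply(missed, function(m) target %in% m, logical(1))
  shooters <- mydf$Shooter[did_hit]
  if (length(shooters) > 0) paste(shooters, collapse = ";") else NA_character_
}, character(1))

res <- data.frame(Target = all_targets, hit.by.Shooters = hit_by,
                  stringsAsFactors = FALSE)
res
```

Looping over targets rather than shooters means targets that nobody hit get their NA directly, without a separate merge or complete step.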
Another tidyverse possibility. We first create a data frame with all possible combinations of Shooter and Targets, then remove the rows that are present in mydf using anti_join, fill in the missing Targets by adding them as NA, and finally summarise by Targets to get the Shooters who actually hit each target.
library(tidyverse)
crossing(Shooter = unique(mydf$Shooter), Targets.missed = 1:10) %>%
anti_join(mydf %>% separate_rows(Targets.missed) %>% mutate_all(as.numeric)) %>%
complete(Targets.missed = 1:10) %>%
group_by(Targets.missed) %>%
summarise(hit.by.Shooters = paste0(Shooter, collapse = ";"))
# Targets.missed hit.by.Shooters
# <int> <chr>
# 1 1 1;2
# 2 2 1;2
# 3 3 1
# 4 4 1
# 5 5 2
# 6 6 1;3
# 7 7 1;2
# 8 8 2
# 9 9 NA
#10 10 3
data
set.seed(987)
mydf <- data.frame(Shooter=1:3,
Targets.missed=c(paste(sample(1:10,4),collapse=";"),
paste(sample(1:10,5),collapse=";"), paste(sample(1:10,8),collapse=";")))
data.table approach
library( data.table )
#vector with all possible targets
targets.v <- 1:10
#split the missed targets to a list
missed.list <- strsplit( mydf$Targets.missed, ";")
#inverse, to get all hit targets
hit.list <- lapply( missed.list, function(x) as.data.table( targets.v[!targets.v %in% x] ) )
#bind hit targets to data.table
dt <- rbindlist( hit.list, idcol = "shooter" )
#summarise (paste with collapse), and join on all possible targets
dt[, .(hit.by.shooters = paste(shooter, collapse = ";")), by = .(target = V1)][data.table(target = targets.v), on = c("target")]
# target hit.by.shooters
# 1: 1 1
# 2: 2 1;2;3
# 3: 3 2;3
# 4: 4 <NA>
# 5: 5 1
# 6: 6 1;2
# 7: 7 <NA>
# 8: 8 2
# 9: 9 1;2
# 10: 10 1

Average of Columns by Unique ID in R

I would like to average columns in a data set based on a unique identifier. I do not know ahead of time how many columns I will have for each unique identifier or what order they will come in. The unique IDs are all known beforehand and are lists of weeks. I have found solutions for regular patterns, but not solutions that use the actual column headers to work out the average. Thanks for any and all help.
I present the original data and desired result. In the example there are only 2 unique IDs
x = read.table(text = "
site wk1 wk2 wk1 wk1
1 2 4 6 8
2 10 20 30 40
3 5 NA 2 3
4 100 100 NA NA",
sep = "", header = TRUE)
x
desired.outcome = read.table(text = "
site wk1avg wk2avg
1 5.3 4
2 26.6 20
3 3.3 NA
4 NA 100",
sep = "", header = TRUE)
If your original data file has duplicated column names, read.table will change them so all the columns have unique values (as you can see by checking x in your example after it's loaded). In fact, the code below depends on that happening, because melt will drop columns with duplicated names. Then we use mutate to remove the extra text added by read.table to de-duplicate the column names so that we can group properly by week.
library(reshape2)
library(dplyr)
x %>% melt(id.var="site") %>% # Convert to long format
mutate(variable = gsub("\\..*", "", variable)) %>% # "re-duplicate" original column names
group_by(site, variable) %>%
summarise(mn = mean(value)) %>%
dcast(site ~ variable)
site wk1 wk2
1 1 5.333333 4
2 2 26.666667 20
3 3 3.333333 NA
4 4 NA 100
Here's a tidyr and dplyr approach:
library(dplyr)
library(tidyr)
x %>% gather(wk, val, -site) %>% # gather wk* columns into key-value pairs
extract(wk, 'wk', '(wk\\d+).*?') %>% # trim suffixes added by read.table
group_by(site, wk) %>%
summarise(mean_val = mean(val)) %>% # calculate grouped means
spread(wk, mean_val) # spread back into wk* columns
# Source: local data frame [4 x 3]
# Groups: site [4]
#
# site wk1 wk2
# (int) (dbl) (dbl)
# 1 1 5.333333 4
# 2 2 26.666667 20
# 3 3 3.333333 NA
# 4 4 NA 100
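For reference, a base R sketch of the same grouping idea, relying on the fact that read.table de-duplicates the headers to wk1, wk2, wk1.1 and wk1.2. The variable names wk, weeks and avg are made up for the example:

```r
x <- read.table(text = "
site wk1 wk2 wk1 wk1
1 2 4 6 8
2 10 20 30 40
3 5 NA 2 3
4 100 100 NA NA",
sep = "", header = TRUE)

weeks <- x[-1]                         # everything except the site column
wk <- sub("\\..*$", "", names(weeks))  # strip ".1", ".2" added by read.table

# One column of row means per distinct week label; NA propagates as in mean().
avg <- sapply(unique(wk), function(w) rowMeans(weeks[, wk == w, drop = FALSE]))

res <- data.frame(site = x$site, avg)
res
```

Passing na.rm = TRUE to rowMeans would instead average over the non-missing weeks only, which changes the result for site 4.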

Resources