Extracting list elements from within a dataframe in R

I am working with data where text comments are used to record a change in field contents rather than have an extra record and start/end dates. So the data looks like this:
Study Fob
1 100
2 101 now 102
3 103
Note: test data can be constructed with:
df <- data.frame(Study = 1:3,
                 Fob = c("100", "101 now 102", "103"),
                 stringsAsFactors = FALSE)
I want to end up with the following form so I can process it essentially as a many-to-one conversion from Fob signal data to Study IDs:
Study Fob
1 100
2 101
2 102
3 103
I can get rid of the superfluous text with:
df$IDs <- strsplit(df$Fob, "[^0-9]+")
which gets me to:
Study Fob IDs
1 100 100
2 101 now 102 c("101", "102")
3 103 103
but I can't get any further. My first thought was to replicate the rows with multiple IDs (like row 2) using a counter based on the length of the IDs, but adding df$counter <- length(df$IDs) just gives me a column where every value is 3, because it takes the length of the IDs column itself, not the length of each element within it.

One option is cSplit from library(splitstackshape). We specify the pattern to split on, use fixed=FALSE (the default, fixed=TRUE, treats the separator as a literal string), and set direction = 'long':
library(splitstackshape)
cSplit(df, 'Fob', '[^0-9]+', fixed=FALSE, 'long')
# Study Fob
#1: 1 100
#2: 2 101
#3: 2 102
#4: 3 103
[^0-9]+ matches one or more characters that are not digits, so the split happens on every non-numeric run, leaving only the numeric parts. By default type.convert=TRUE, so we get a numeric column class after the split.
Or, instead of [^0-9]+, a more compact version is \\D+, which likewise matches runs of non-digit characters (from @David Arenburg's comments):
cSplit(df, 'Fob', '\\D+', fixed=FALSE, 'long')
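If you prefer to stay in base R, the row-replication idea from the question can be completed with rep() and lengths(); a minimal sketch (no extra packages assumed):
ids <- strsplit(df$Fob, "[^0-9]+")
data.frame(Study = rep(df$Study, lengths(ids)),  # repeat each Study once per extracted ID
           Fob   = as.numeric(unlist(ids)))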

Related

Split columns in a dataframe into a column that contains text not numbers and a column that contains numbers not text in R

Here is a simplified version of data I am working with:
a<-c("There are 5 programs", "2 - adult programs, 3- youth programs","25", " ","there are a number of programs","other agencies run our programs")
b<-c("four", "we don't collect this", "5 from us, more from others","","","")
c<-c(2,6,5,8,2,"")
df<-cbind.data.frame(a,b,c)
df$c<-as.numeric(df$c)
I want to keep both the text and the numbers from the data because some of the text is important.
expected output:
What I think makes sense is the following:
1. Identify all columns that have text in them, perhaps in a list (because some columns are just numbers).
2. Subset the columns from step 1 into a new dataframe; let's call this df1.
3. Delete the subsetted df1 columns from df.
4. Split all the columns in df1 into 2 columns, one that keeps the text and one that has the number.
5. Bind the new split columns from df1 back into the original df.
What I am struggling with is steps 1-2 and 4. I am okay with the characters (e.g., - and ') being excluded or included. There is additional processing I have to do after (e.g., when there are multiple numbers in a column after splitting I will need to split and add these and also address the written numbers), but those are things I can do.
Here's a dplyr solution using regular expressions:
library(stringr)
library(dplyr)
df %>%
  mutate(
    a.text = gsub("(^|\\s)\\d+", "", a),
    a.num = str_extract_all(a, "\\d+"),
    b.text = gsub("(^|\\s)\\d+", "", b),
    b.num = str_extract_all(b, "\\d+")
  ) %>%
  select(c(4:7, 3))
a.text a.num b.text b.num c
1 There are programs 5 four 2
2 - adult programs,- youth programs 2, 3 we don't collect this 6
3 25 from us, more from others 5 5
4 8
5 there are a number of programs 2
6 other agencies run our programs NA
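Note that str_extract_all() returns a list-column (hence entries like 2, 3 above). If you want plain character columns instead, you could collapse each element; a sketch:
df %>%
  mutate(a.num = sapply(str_extract_all(a, "\\d+"), paste, collapse = ", "))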
Here is what I would do with my preferred tools. The solution will work with arbitrary numbers of arbitrarily named character and non-character columns.
library(data.table) # development version 1.14.3 used here
library(magrittr) # piping used to improve readability
num <- \(x) stringr::str_extract_all(x, "\\d+", simplify = TRUE) %>%
  apply(1L, \(x) sum(as.integer(x), na.rm = TRUE))
txt <- \(x) stringr::str_remove_all(x, "\\d+") %>%
  stringr::str_squish()
setDT(df)[, lapply(
  .SD, \(x) if (is.character(x)) data.table(txt = txt(x), num = num(x)) else x)]
which returns
a.txt a.num b.txt b.num c
<char> <int> <char> <int> <num>
1: There are programs 5 four 0 2
2: - adult programs, - youth programs 5 we don't collect this 0 6
3: 25 from us, more from others 5 5
4: 0 0 8
5: there are a number of programs 0 0 2
6: other agencies run our programs 0 0 NA
Explanation
num() is a function which uses the regular expression \\d+ to extract all strings of contiguous digits (i.e., integer numbers), coerces them to type integer, and computes the row-wise sum of the extracted numbers (as requested in the OP's last sentence).
txt() is a function which removes all strings of contiguous digits, removes whitespace from the start and end of each string, and reduces repeated whitespace inside the strings.
\(x) is a shorthand for function(x), introduced with R version 4.1.0.
The next steps implement OP's proposed approach in data.table syntax, by and large:
lapply(.SD, ...) loops over each column of df.
If a column is character, both txt() and num() are applied to it. The two resulting vectors are turned into a data.table as a partial result. Note that cbind() cannot be used here, as it would return a character matrix.
If a column is non-character, it is returned as is.
The final result is a data.table where the column names have been renamed automagically.
This approach keeps the relative position of columns.
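To see what the two helpers do on a single value (compare row 2 of the result above):
num("2 - adult programs, 3- youth programs")  # 5, i.e. the row-wise sum 2 + 3
txt("2 - adult programs, 3- youth programs")  # "- adult programs, - youth programs"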

Using value in 1 column to fill in values in 2 other columns

When entering behavior data in a different system, I wrote the subjects in a form such as 3-2 (to mean rank 3 to rank 2). I exported these to Excel, which took these entries as dates (so 2-Mar for this example).
I now have thousands of entries in this format. I have added two columns ("Actor" and "Recipient") and would like to fill in the rank numbers for these, based on what is in the "Subject" column.
A couple of lines of what I'm hoping my R output will give me:
Subject Actor Recipient
2-Mar 3 2
5-Jun 6 5
6-Feb 2 6
etc.
So I already have the "Subject" columns and need help figuring out code to fill in the "Actor" and "Recipient" columns. Rank numbers only go up to 6.
I've tried a couple of things but just keep getting error messages... Any help with this would be GREATLY appreciated!
Here you can use tstrsplit() after converting to a date format:
# Recreate your data
x <- data.frame("Subject" = c("2-Mar", "5-Jun", "6-Feb"))
# Change the format of your Subject column
x[, "Subject"] <- format(as.POSIXct(x[, "Subject"], format = "%d-%b"), "%m %d")
# Split into the two strings
library(data.table) # to get tstrsplit() function
x[, c("Actor", "Recipient")] <- tstrsplit(x[, "Subject"], " ")
# Convert to numeric
x[, "Actor"] <- as.numeric(x[, "Actor"])
x[, "Recipient"] <- as.numeric(x[, "Recipient"])
This returns
> x
Subject Actor Recipient
1 02 03 3 2
2 05 06 6 5
3 06 02 2 6
And if you want Subject in its original format
# Return Subject to original format
x[, "Subject"] <- format(as.POSIXct(x[, "Subject"], format = "%m %d"), "%d-%b")
Giving
> x
Subject Actor Recipient
1 02-Mar 3 2
2 05-Jun 6 5
3 06-Feb 2 6
Explained:
Your vector/variable "Subject" was imported as a character-type atomic vector (atomic vectors are a 1-dimensional structure of one or more elements, where all elements must be the same type). The solution was to convert that to something R would interpret as a date using the as.POSIXct(..., format = "...") function, where format tells R how the string is formatted (see ?strptime for the codes). I then wrapped that in the format() function, telling it to reformat the date as numeric month and day. That was then split into two columns using the tstrsplit() function, but R interpreted those as character-type data, so I converted them with the as.numeric() function to double-type data.
You could convert Subject to a date and extract the month and day from it.
temp <- as.Date(df$Subject, "%d-%b")
df$Actor <- as.integer(format(temp, "%m"))
df$Recipient <- as.integer(format(temp, "%d"))
df
# Subject Actor Recipient
#1 2-Mar 3 2
#2 5-Jun 6 5
#3 6-Feb 2 6
This can also be done using lubridate functions.
library(lubridate)
df$Actor <- month(temp)
df$Recipient <- day(temp)
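If you'd rather avoid the date round-trip entirely, the same result can be had by splitting the string directly; a date-free sketch (assuming every entry has the day-month-abbreviation form like "2-Mar"):
parts <- strsplit(df$Subject, "-")
df$Actor     <- match(sapply(parts, `[`, 2), month.abb)  # month abbreviation -> rank
df$Recipient <- as.integer(sapply(parts, `[`, 1))        # day number -> rank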

mutate_at using multiple conditions

I would like to transform the columns of my dataframe test that contain 1 or 2 in their names by dividing them by the column Unit.
It seems to work with one condition, but I do not know how to add the condition OR 1:
test = test %>% mutate_at(vars(contains('2')), funs(./Unit))
any idea?
Identifier Source 196001 200006 Unit
1: top HH NA NA 1e-06
2: top2 BB NA 4569.6 1e+00
This should work:
test = read.table(header = T, text = "
Identifier Source 196001 200006 Unit
top HH 65 888 3
top2 BB 0111 9886 8") # I modified your values so you can see the divisions
test %>% mutate_at(vars(contains('2'), contains('1')), funs(./Unit))
Basically you say "select variables that contain 2, oh and also select variables than contain 1".
If you wanted to select variables that contain 1 AND 2, this would be a different problem and might require a regular expression (with dplyr::matches()).
Also, please note that funs is soft-deprecated, you should now use lambda functions:
test = test %>% mutate_at(vars(contains('2'), contains('1')), ~./Unit)
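mutate_at() itself has also since been superseded; in dplyr >= 1.0 the equivalent uses across() with the matches() helper (a sketch):
test = test %>% mutate(across(matches("1|2"), ~ .x / Unit))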

Perform row-wise operations on a data.table for a vector-valued column

EDIT:
(I apologize for the fact that my example was oversimplified, and I will try to remedy this, as well as format my more relevant example in a more convenient format for copying directly into R. In particular, there are multiple value columns, and some preceding columns with other information that does not need to be parsed.)
I am fairly new to R, and to data.table as well, so I would appreciate input on an issue I am finding. I am working with a data table where one column is a colon-separated format string that serves as a legend for values in other colon-separated columns. In order to parse it, I have to first split it into its components, and then search for the indices of the components I need to later index the value strings. Here is a simplified example of the sort of situation I might be working with
DT <- data.table(number = c(1:5),
                 format = c("name:age", "age:name", "age:name:height", "height:age:name", "weight:name:age"),
                 person1 = c("john:30", "40:bill", "20:steve:100", "300:70:george", "140:fred:20"),
                 person2 = c("jane:31", "42:ivan", "21:agnes:120", "320:72:vivian", "143:rose:22"))
When evaluated, we get
> DT
number format person1 person2
1: 1 name:age john:30 jane:31
2: 2 age:name 40:bill 42:ivan
3: 3 age:name:height 20:steve:100 21:agnes:120
4: 4 height:age:name 300:70:george 320:72:vivian
5: 5 weight:name:age 140:fred:20 143:rose:22
Let's say that for each person, I need to know ONLY their name and age, and don't need their height or weight; in this example, and in my actual data, every format string has fields for name and age, but possibly in different positions (the fields that I am actually looking for are usually fixed in certain columns, but I am reluctant to hard-code any indices as I am not completely familiar with the production of the data files I am working with). I would first split up the format string and then do a match() search for the names of the fields I want.
DT[, format.split := strsplit(format, ":")]
At this point, the only method I used that worked to perform the match was a vapply:
DT[, index.name := vapply(format.split, function (x) match('name', x), 0L)]
DT[, index.age := vapply(format.split, function (x) match('age', x), 0L)]
because I don't know of any other way to tell R to look at each row's entry individually, rather than at the whole column bunched together as a vector, and to perform the match on the vector-valued format.split entry of each row rather than trying to match the whole column of rows. Even then, once I find the indices for each row, I have to perform another strsplit and then an mapply to parse the name-value and age-value out of each person's value-string:
DT[, person1.split := strsplit(person1, ':')]
DT[, person1.name := mapply(function(x, y) x[y], person1.split, index.name)]
DT[, person1.age := mapply(function(x, y) x[y], person1.split, index.age)]
DT[, person2.split := strsplit(person2, ':')]
DT[, person2.name := mapply(function(x, y) x[y], person2.split, index.name)]
DT[, person2.age := mapply(function(x, y) x[y], person2.split, index.age)]
(And, of course, I would do the same thing for age as well)
I am working with fairly large data sets, so I'd like my code to be as efficient as possible. Does anyone have recommendations for ways I can speed up or otherwise optimize my code?
(NOTE: I am really looking for the right approach to take, not the right *apply or *ply or Map function to use. If *(ap)ply or Map really is the right approach, I would appreciate knowing which is the most efficient or appropriate for my situation, but if there is a better way of testing for intra-row duplication, I would prefer recommendations about that to function suggestions. Suggestions are welcome, though).
EDIT 2:
It turns out that my example was much more general than it need have been. I only need two fields, which are always going to be the first two fields in the format string, without variation. The first field is just a literal character string. The second field, however, consists of at least 2 numbers, separated by commas (ultimately, I filter out any rows with more than 2 numbers in the second field, so the possibility of more is only relevant if the filtering happens after the parsing). For each of the (3) value strings, I only need to create three columns: a character column for the first field, and two numeric columns, one for each member of the comma-separated pair in the second field. Any other fields are irrelevant. My current method, which is probably quite inefficient, is to use sub() to pattern-match on the desired fields and subfields with back-references.
> DT <- data.table(id = 1:5,
                   format = c(rep("A:B:C:D:E", 5)),
                   person1 = paste(paste0("foo", LETTERS[1:5]), paste(1:5, 10:6, sep = ','), "blah", "bleh", "bluh", sep = ':'),
                   person2 = paste(paste0("bar", LETTERS[1:5]), paste(16:20, 5:1, sep = ','), "blah", "bleh", "bluh", sep = ':'),
                   person3 = paste(paste0("baz", LETTERS[1:5]), paste(0:4, 12:8, sep = ','), "blah", "bleh", "bluh", sep = ':'))
> DT
id format person1 person2 person3
1: 1 A:B:C:D:E fooA:1,10:blah:bleh:bluh barA:16,5:blah:bleh:bluh bazA:0,12:blah:bleh:bluh
2: 2 A:B:C:D:E fooB:2,9:blah:bleh:bluh barB:17,4:blah:bleh:bluh bazB:1,11:blah:bleh:bluh
3: 3 A:B:C:D:E fooC:3,8:blah:bleh:bluh barC:18,3:blah:bleh:bluh bazC:2,10:blah:bleh:bluh
4: 4 A:B:C:D:E fooD:4,7:blah:bleh:bluh barD:19,2:blah:bleh:bluh bazD:3,9:blah:bleh:bluh
5: 5 A:B:C:D:E fooE:5,6:blah:bleh:bluh barE:20,1:blah:bleh:bluh bazE:4,8:blah:bleh:bluh
My code then does this:
DT[, `:=`(person1.A = sub("^([^:]*):.*$", "\\1", person1),
          person2.A = sub("^([^:]*):.*$", "\\1", person2),
          person3.A = sub("^([^:]*):.*$", "\\1", person3),
          person1.B.first  = sub("^[^:]*:([^:,]*),.*$", "\\1", person1),
          person1.B.second = sub("^[^:]*:[^:,]*,([^:,]*)(,[^:,]*)*:.*$", "\\1", person1),
          person2.B.first  = sub("^[^:]*:([^:,]*),.*$", "\\1", person2),
          person2.B.second = sub("^[^:]*:[^:,]*,([^:,]*)(,[^:,]*)*:.*$", "\\1", person2),
          person3.B.first  = sub("^[^:]*:([^:,]*),.*$", "\\1", person3),
          person3.B.second = sub("^[^:]*:[^:,]*,([^:,]*)(,[^:,]*)*:.*$", "\\1", person3))]
for the splitting, and filters by
DT <- DT[grepl("^[^:]*:[^:,]*,[^:,]*:.*$", person1) &
         grepl("^[^:]*:[^:,]*,[^:,]*:.*$", person2) &
         grepl("^[^:]*:[^:,]*,[^:,]*:.*$", person3)]
I understand that this method is probably very inefficient, but it was the first improvement I came up with over my old approach of repeatedly applying strsplit. With the new conditions in mind, is there an even better way of doing things than melt, cSplit, dcast?
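(As an aside, since the same pattern is repeated for every person column, the filter can be factored out; a sketch, assuming the relevant columns all start with "person":)
pat  <- "^[^:]*:[^:,]*,[^:,]*:.*$"
cols <- grep("^person", names(DT), value = TRUE)
DT   <- DT[Reduce(`&`, lapply(cols, function(cn) grepl(pat, DT[[cn]])))]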
EDIT 3:
Since I only needed the first two fields, I ended up trimming all the value strings, removing those with more than two commas (i.e., more than three second-field numbers), changing the commas to colons, replacing the format string of every line with the names of the (now 3) fields, and performing the dcast(cSplit(melt)) as suggested by @AnandaMahto. It seems to work well.
@bskaggs has the right idea that it might just make more sense to put your data into a long form, or even a structured wide form.
I'll show you two options, but first, it's always better to share your data in a way that others can actually use it:
DT <- data.table(
  format = c("name:age", "name:age:height", "age:height:name",
             "height:weight:name:age", "name:age:weight:height",
             "name:age:height:weight"),
  values = c("john:30", "rene:33:183", "100:10:speck",
             "100:400:sumo:11", "james:43:120:120",
             "plink:2:300:400"))
I'm also going to suggest you use my cSplit function.
Here's how you would easily convert this dataset into a long form:
cSplit(DT, c("format", "values"), ":", "long")
# format values
# 1: name john
# 2: age 30
# 3: name rene
# 4: age 33
# 5: height 183
# 6: age 100
# 7: height 10
# 8: name speck
# 9: height 100
# 10: weight 400
# 11: name sumo
# 12: age 11
# 13: name james
# 14: age 43
# 15: weight 120
# 16: height 120
# 17: name plink
# 18: age 2
# 19: height 300
# 20: weight 400
Once the data are in a "long" form, you can convert it easily to a "wide" form using dcast.data.table, like this. (I've also reordered the columns using setcolorder, which lets you rearrange the data without copying.)
X <- dcast.data.table(
  cSplit(cbind(id = 1:nrow(DT), DT),
         c("format", "values"), ":", "long"),
  id ~ format, value.var = "values")
setcolorder(X, c("id", "name", "age", "height", "weight"))
X
# id name age height weight
# 1: 1 john 30 NA NA
# 2: 2 rene 33 183 NA
# 3: 3 speck 100 10 NA
# 4: 4 sumo 11 100 400
# 5: 5 james 43 120 120
# 6: 6 plink 2 300 400
How does this fare in terms of speed?
First, a very moderate dataset:
DT <- rbindlist(replicate(2000, DT, FALSE))
dim(DT)
# [1] 12000 2
## @bskaggs's suggestion
system.time(colonMelt(DT))
# user system elapsed
# 0.27 0.00 0.27
## cSplit. It would be even faster if you already had
## an id column and didn't need to cbind one in
system.time(cSplit(cbind(id = 1:nrow(DT), DT),
                   c("format", "values"), ":", "long"))
# user system elapsed
# 0.02 0.00 0.01
## cSplit + dcast.data.table
system.time(dcast.data.table(
  cSplit(cbind(id = 1:nrow(DT), DT),
         c("format", "values"), ":", "long"),
  id ~ format, value.var = "values"))
# user system elapsed
# 0.08 0.00 0.08
Update
For your updated problem, you can melt the "data.table" first, and then proceed similarly:
library(reshape2)
## Melting, but no reshaping -- a nice long format
cSplit(melt(DT, id.vars = c("number", "format")),
       c("format", "value"), ":", "long")
## Try other combinations for the LHS and RHS of the
## formula. This seems to be what you might be after
dcast.data.table(
  cSplit(melt(DT, id.vars = c("number", "format")),
         c("format", "value"), ":", "long"),
  number ~ variable + format, value.var = "value")
I think you may be better served by using a tall tidy format:
colonMelt <- function(DT) {
  formats <- strsplit(DT$format, ":")
  rows <- rep(row.names(DT), sapply(formats, length))
  data.frame(row = rows,
             key = unlist(formats),
             value = unlist(strsplit(DT$values, ":")))
}
newDT <- colonMelt(DT)
The result is a format that is much easier to search and filter without string-splitting all the time:
row key value
1 1 name john
2 1 age 30
3 2 name rene
4 2 age 33
5 2 height 183
6 3 age 100
7 3 height 10
8 3 name speck
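For example, pulling just the names no longer requires any string splitting:
subset(newDT, key == "name")  # all name rows, straight from the tidy table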

Regular expression to convert raw text into columns of data

I have a raw text output from a program that I want to convert into a DataFrame. The text file is not formatted and is as shown below.
10037 149439Special Event 11538.00 13542.59 2004.59
10070 10071Weekday 8234.00 9244.87 1010.87
10216 13463Weekend 145.00 0 -145.00
I am able to read the data into R using readLines() from the base package. How can I convert this into data that looks like the following (column names can be anything)?
A B C D E F
10037 149439 Special Event 11538.00 13542.59 2004.59
10070 10071 Weekday 8234.00 9244.87 1010.87
10216 13463 Weekend 145.00 0 -145.00
What regular expression should I use to achieve this? I know that this is ideal for applying a combination of regexec() and regmatches(). But I am unable to come up with an expression that splits the line into the desired components.
Here's a simple solution:
raw <- readLines("filename.txt")
data.frame(do.call(rbind, strsplit(raw, " {2,}|(?<=\\d)(?=[A-Z])", perl = TRUE)))
# X1 X2 X3 X4 X5 X6
# 1 10037 149439 Special Event 11538.00 13542.59 2004.59
# 2 10070 10071 Weekday 8234.00 9244.87 1010.87
# 3 10216 13463 Weekend 145.00 0 -145.00
The regular expression " {2,}|(?<=\\d)(?=[A-Z])" consists of two parts, combined with "|" (logical or).
" {2,}" means at least two spaces. This will split between the different columns only, since the text in the third column has a single space.
"(?<=\\d)(?=[A-Z])" denotes the positions that are preceded by a digit and followed by an uppercase letter. This is used to split between the second and the third column.
I created "txt.txt" from your data. Then we work some with a regular expression.
> read <- readLines("txt.txt")
> S <- strsplit(read, "[A-Za-z]|\\s")
> W <- do.call(rbind, lapply(S, function(x) x[nzchar(x)]))
> col <- gsub("[^A-Za-z]", "", read)  # recover the text column (this definition was missing; reconstructed from the output shown below)
> D <- data.frame(W[, 1:2], col, W[, 3:5])
> names(D) <- LETTERS[seq(D)]
> D
## A B C D E F
## 1 10037 149439 SpecialEvent 11538.00 13542.59 2004.59
## 2 10070 10071 Weekday 8234.00 9244.87 1010.87
## 3 10216 13463 Weekend 145.00 0 -145.00
Toss it all into some curly brackets and you've got yourself a function to parse your files.
PS: If the space between "Special" and "Event" is important, please comment and I'll revise.
Something like this at least works on your example but I don't know all your corner cases...
([0-9]+) +([0-9]+)(.+) ([0-9.-]+) +([0-9.-]+) +([0-9.-]+)
The captured groups 1 to 6 are, respectively, your columns A to F.
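And to wire that pattern into the regexec()/regmatches() combination the question mentions, a minimal sketch:
raw <- readLines("filename.txt")
pat <- "([0-9]+) +([0-9]+)(.+) ([0-9.-]+) +([0-9.-]+) +([0-9.-]+)"
m   <- regmatches(raw, regexec(pat, raw))
df  <- data.frame(do.call(rbind, lapply(m, `[`, -1)))  # drop the full match, keep the six groups
names(df) <- LETTERS[1:6]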
