extract irregular numeric data from strings


I have data like below. I wish to extract the first and last year from each string here called my.string. Some strings only contain one year and some strings contain no years. No strings contain more than two years. I have provided the desired result in the object named desired.result below the example data set. I am using R.
When a string contains two years, those years are contained within a portion of the string that looks like this: ga49.51 or ea22.24.
When a string contains only one year, that year is contained in a portion of the string that looks like this: time11.
I know a bit about regex, but this problem seems too irregular and complex for me to figure out. I am not even sure where to begin. Thank you for any advice.
EDIT
Perhaps delete the numbers before the first colon (:) and the remaining numbers are what I want.
my.data <- read.table(text = '
my.string cov1 cov2
42:Alpha:ga6.8 -0.1 2.2
43:Alpha:ga9.11 -2.5 0.6
44:Alpha:ga30.32 -1.3 0.5
45:Alpha:ga49.51 -2.5 0.6
50:Alpha:time1:ga.time -1.7 0.9
51:Alpha:time2:ga.time -1.5 0.8
52:Alpha:time3:ga.time -1.0 1.0
2:Beta:ea2.9 -1.7 0.6
3:Beta:ea17.19 -5.0 0.8
4:Beta:ea22.24 -6.4 1.0
8:Beta:as 0.2 0.6
9:Beta:sd 1.7 0.4
12:Beta:time1:ea.tim -2.6 1.8
13:Beta:time10:ea.ti -3.6 1.1
14:Beta:time11:ea.ti -3.1 0.7
', header = TRUE, stringsAsFactors = FALSE, na.strings = "NA")
desired.result <- read.table(text = '
my.string cov1 cov2 time1 time2
42:Alpha:ga6.8 -0.1 2.2 6 8
43:Alpha:ga9.11 -2.5 0.6 9 11
44:Alpha:ga30.32 -1.3 0.5 30 32
45:Alpha:ga49.51 -2.5 0.6 49 51
50:Alpha:time1:ga.time -1.7 0.9 1 NA
51:Alpha:time2:ga.time -1.5 0.8 2 NA
52:Alpha:time3:ga.time -1.0 1.0 3 NA
2:Beta:ea2.9 -1.7 0.6 2 9
3:Beta:ea17.19 -5.0 0.8 17 19
4:Beta:ea22.24 -6.4 1.0 22 24
8:Beta:as 0.2 0.6 NA NA
9:Beta:sd 1.7 0.4 NA NA
12:Beta:time1:ea.tim -2.6 1.8 1 NA
13:Beta:time10:ea.ti -3.6 1.1 10 NA
14:Beta:time11:ea.ti -3.1 0.7 11 NA
', header = TRUE, stringsAsFactors = FALSE, na.strings = "NA")

I suggest using the stringr library to extract the data you need, since it returns NA when there is no match and also allows a constrained-width lookbehind:
> library(stringr)
> my.data$time1 <- str_extract(my.data$my.string, "(?<=time)\\d+|(?<=\\b[ge]a)\\d+")
> my.data$time2 <- str_extract(my.data$my.string, "(?<=\\b[ge]a\\d{1,100}\\.)\\d+")
> my.data
my.string cov1 cov2 time1 time2
1 42:Alpha:ga6.8 -0.1 2.2 6 8
2 43:Alpha:ga9.11 -2.5 0.6 9 11
3 44:Alpha:ga30.32 -1.3 0.5 30 32
4 45:Alpha:ga49.51 -2.5 0.6 49 51
5 50:Alpha:time1:ga.time -1.7 0.9 1 <NA>
6 51:Alpha:time2:ga.time -1.5 0.8 2 <NA>
7 52:Alpha:time3:ga.time -1.0 1.0 3 <NA>
8 2:Beta:ea2.9 -1.7 0.6 2 9
9 3:Beta:ea17.19 -5.0 0.8 17 19
10 4:Beta:ea22.24 -6.4 1.0 22 24
11 8:Beta:as 0.2 0.6 <NA> <NA>
12 9:Beta:sd 1.7 0.4 <NA> <NA>
13 12:Beta:time1:ea.tim -2.6 1.8 1 <NA>
14 13:Beta:time10:ea.ti -3.6 1.1 10 <NA>
15 14:Beta:time11:ea.ti -3.1 0.7 11 <NA>
The first regex matches:
(?<=time)\\d+ - 1+ digits that have time before them
| - or
(?<=\\b[ge]a)\\d+ - 1+ digits that have ga or ea in front, starting at a word boundary
The second regex matches:
(?<=\\b[ge]a\\d{1,100}\\.) - check if the current position is preceded by ga or ea at a word boundary, followed by 1 to 100 digits (I believe that should be enough for your scenario, since 100-digit chunks are hardly expected here; you may even decrease the value), and then a .
\\d+ - 1+ digits
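The EDIT in the question suggests another route: drop everything up to the first colon and treat the remaining digit runs as the years. A minimal base R sketch of that idea follows. It assumes the label between the colons (Alpha, Beta) never contains digits, and the four sample strings are taken from the question.

```r
# Sample strings from the question
my.string <- c("42:Alpha:ga6.8", "50:Alpha:time1:ga.time",
               "8:Beta:as", "14:Beta:time11:ea.ti")

# Drop the leading id together with the first colon
rest <- sub("^[^:]*:", "", my.string)

# Collect every run of digits that remains (a list, one element per string)
nums <- regmatches(rest, gregexpr("[0-9]+", rest))

# Indexing past the end of a character vector gives NA,
# so a missing first or second year becomes NA
time1 <- as.numeric(sapply(nums, `[`, 1))
time2 <- as.numeric(sapply(nums, `[`, 2))
```

Wrapping the result in as.numeric() also yields numeric columns, as in desired.result, whereas str_extract() returns character values.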

Here's a regex that will extract either of the two types, and output them to different columns at the end of the lines:
Search: .*(?:time(\d+)|(?:[ge]a)(\d+)\.(\d+)).*
Replace: $0\t$1\t$2\t$3
Breakdown:
.*(?: ... ).* ensures that the whole line is matched, and uses a non-capturing group for the main alternation
time(\d+): this is the first half of the alternation, capturing any digits after a "time"
(?:[ge]a)(\d+)\.(\d+): the second half of the alternation matches "ga" or "ea" followed by two sets of digits, each in its own capture group
Replacement: $0 puts the whole line back. Each of the other capture groups is then appended, with tabs in between.
See regex101 example
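Since the question is about R, the same capture-group idea can be sketched with sub() and perl = TRUE. This is an adaptation rather than part of the answer above: the pattern is the one from the Search line, the helper name extract_group is made up for illustration, and strings without a match are mapped to NA explicitly because sub() returns them unchanged.

```r
pat <- ".*(?:time(\\d+)|[ge]a(\\d+)\\.(\\d+)).*"
x <- c("45:Alpha:ga49.51", "14:Beta:time11:ea.ti", "8:Beta:as")

# Hypothetical helper: replace the whole string with the chosen backreferences
extract_group <- function(x, replacement) {
  hit <- grepl(pat, x, perl = TRUE)
  out <- rep(NA_character_, length(x))
  out[hit] <- sub(pat, replacement, x[hit], perl = TRUE)
  as.numeric(out)  # "" (group took no part in the match) becomes NA
}

time1 <- extract_group(x, "\\1\\2")  # only one of groups 1 and 2 can be set
time2 <- extract_group(x, "\\3")
```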

Related

R read_excel reads numeric data incorrectly

I'm trying to download and parse the Data worksheet in the file ie_data.xls from Professor Robert Shiller's home page (http://www.econ.yale.edu/~shiller/data.htm). I download the file from http://www.econ.yale.edu/~shiller/data/ie_data.xls, and then run the following script:
library(tidyverse)
library(readxl)  # read_excel() lives in readxl, which library(tidyverse) does not attach
ie_data <- read_excel("ie_data.xls", sheet = "Data", col_names = TRUE,
col_types = "numeric", na = "", skip = 7) %>%
select(Date,E) %>%
drop_na()
A bunch of warnings are generated, but more bothersome is the output
> names(ie_data)
[1] "Date" "E"
> ie_data
# A tibble: 1,791 x 2
Date E
<dbl> <dbl>
1 1871. 0.4
2 1871. 0.4
3 1871. 0.4
4 1871. 0.4
5 1871. 0.4
6 1871. 0.4
7 1871. 0.4
8 1871. 0.4
9 1871. 0.4
10 1871. 0.4
# ... with 1,781 more rows
Warning message:
`...` is not empty.
We detected these problematic arguments:
* `needs_dots`
These dots only exist to allow future extensions and should be empty.
Did you misspecify an argument?
The contents of both columns should have two decimal places (1871.01 represents January 1871, 1871.02 represents February 1871 and so on, and the second column is earnings per share rounded to the nearest penny), but everything after the decimal point is gone in the first column at the head of the dataframe! Even more mysterious is its tail:
> tail(ie_data)
# A tibble: 6 x 2
Date E
<dbl> <dbl>
1 2019. 135.
2 2019. 137.
3 2019. 139.
4 2020. 132.
5 2020. 124.
6 2020. 116.
Warning message:
`...` is not empty.
We detected these problematic arguments:
* `needs_dots`
These dots only exist to allow future extensions and should be empty.
Did you misspecify an argument?
Now both columns have lost their fractional part! What change do I need to make to my code in order to read these columns correctly?
Sincerely and with many thanks in advance
Thomas Philips

You can do the following to see more significant digits in your console when printing your data with ie_data. This doesn't affect your data, only the way it is shown when printed to your console.
options(pillar.sigfig = 10)
ie_data
Which will show:
Date E
<dbl> <dbl>
1 1871.01 0.4
2 1871.02 0.4
3 1871.03 0.4
4 1871.04 0.4
5 1871.05 0.4
6 1871.06 0.4
7 1871.07 0.4
8 1871.08 0.4
9 1871.09 0.4
10 1871.1 0.4
# ... with 1,781 more rows
If you use the following:
options(pillar.sigfig = 1)
ie_data
You will get:
# A tibble: 1,791 x 2
Date E
<dbl> <dbl>
1 1871. 0.4
2 1871. 0.4
3 1871. 0.4
4 1871. 0.4
5 1871. 0.4
6 1871. 0.4
7 1871. 0.4
8 1871. 0.4
9 1871. 0.4
10 1871. 0.4
# ... with 1,781 more rows
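A quick way to confirm that only the display is rounded is to compute on the stored values. The sketch below uses a plain numeric vector as a stand-in for ie_data$Date (the values are illustrative and follow the YYYY.MM convention described in the question) and recovers the year and month from it.

```r
# Stand-in for ie_data$Date; 1871.1 is how 1871.10 (October) is stored
date_num <- c(1871.01, 1871.02, 1871.1, 2019.12)

year <- floor(date_num)

# Round before taking the remainder to absorb floating-point error
month <- round(date_num * 100) %% 100
```

If the fractional part had really been lost on import, month would be 0 for every row.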

Try it with col_types = "text".
I don't really know why numeric gives you trimmed numbers, but it seems to work for me with text (provided you later convert the values to rounded numbers).

How do I split or create a new column for a list of data in a dataframe?

Please have a look at the preview of the data in the image. I would like to create 3 new columns, i.e. Start, End and Density, and create a new row for each record in these 3 columns.

In accordance with the comments above, you can convert the list into a data.frame as below:
# simulation of data.frame with one row and one cell with histogram
z <- hist(rnorm(1000))
z$start <- z$breaks[-length(z$breaks)]
z$end <- z$breaks[-1]
z[c("mids", "xname", "breaks", "equidist", "counts")] <- NULL
names_z <- names(z)
attributes(z) <- NULL
df <- data.frame(a = 1, b = 2, x = I(list((z))))
# Conversion of list to dataframe
setNames(as.data.frame(unlist(df["x"], recursive = FALSE)), names_z)
Output:
density start end
1 0.012 -3.0 -2.5
2 0.042 -2.5 -2.0
3 0.082 -2.0 -1.5
4 0.182 -1.5 -1.0
5 0.288 -1.0 -0.5
6 0.354 -0.5 0.0
7 0.418 0.0 0.5
8 0.300 0.5 1.0
9 0.172 1.0 1.5
10 0.088 1.5 2.0
11 0.050 2.0 2.5
12 0.012 2.5 3.0
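When the data frame has several rows, each holding one histogram in a list column, the same conversion can be applied row by row and the pieces stacked. A minimal sketch, assuming a hypothetical id column and a list column h of hist() results (the actual column names in the question's image are not known):

```r
set.seed(1)

# Two rows, each with one histogram stored in a list column
df <- data.frame(id = 1:2)
df$h <- I(list(hist(rnorm(100), plot = FALSE),
               hist(rnorm(100), plot = FALSE)))

# Build one small data frame per row, then bind them together
long <- do.call(rbind, lapply(seq_len(nrow(df)), function(i) {
  h <- df$h[[i]]
  data.frame(id      = df$id[i],
             Start   = head(h$breaks, -1),  # left edge of each bin
             End     = h$breaks[-1],        # right edge of each bin
             Density = h$density)
}))
```

Each bin becomes one row, and the id column records which original row it came from.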

Selecting rows with time in R

I have a data frame that looks like this:
Subject Time Freq1 Freq2 ...
A 6:20 0.6 0.1
A 6:30 0.1 0.5
A 6:40 0.6 0.1
A 6:50 0.6 0.1
A 7:00 0.3 0.4
A 7:10 0.1 0.5
A 7:20 0.1 0.5
B 6:00 ... ...
I need to delete the rows whose time is not in the range from 7:00 to 7:30. So in this case, all the 6:00, 6:10, 6:20... rows.
I have tried creating a data frame with just the times I want to keep, but R does not seem to recognize the times as numbers or as names. I get the same error when trying to directly remove the ones I don't need. It is probably quite simple, but I haven't found any solution.
Any suggestions?

We can convert the time column to the Period class from the lubridate package and then filter the data frame based on that column.
library(dplyr)
library(lubridate)
dat2 <- dat %>%
mutate(HM = hm(Time)) %>%
filter(HM < hm("7:00") | HM > hm("7:30")) %>%
select(-HM)
dat2
# Subject Time Freq1 Freq2
# 1 A 6:20 0.6 0.1
# 2 A 6:30 0.1 0.5
# 3 A 6:40 0.6 0.1
# 4 A 6:50 0.6 0.1
# 5 B 6:00 NA NA
DATA
dat <- read.table(text = "Subject Time Freq1 Freq2
A '6:20' 0.6 0.1
A '6:30' 0.1 0.5
A '6:40' 0.6 0.1
A '6:50' 0.6 0.1
A '7:00' 0.3 0.4
A '7:10' 0.1 0.5
A '7:20' 0.1 0.5
B '6:00' NA NA",
header = TRUE)
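The filter above keeps the rows outside 7:00 to 7:30. If the goal is the opposite, keeping only the rows within that window as the question's wording suggests, the condition can be inverted. Below is a base R sketch that avoids extra packages by converting H:MM strings to minutes. to_minutes is a hypothetical helper, and the data is a shortened version of dat.

```r
# Hypothetical helper: "H:MM" -> minutes since midnight
to_minutes <- function(x) {
  parts <- strsplit(x, ":", fixed = TRUE)
  vapply(parts,
         function(p) as.numeric(p[1]) * 60 + as.numeric(p[2]),
         numeric(1))
}

dat <- data.frame(Subject = c("A", "A", "A", "A", "B"),
                  Time    = c("6:50", "7:00", "7:10", "7:20", "6:00"),
                  stringsAsFactors = FALSE)

mins <- to_minutes(dat$Time)

# Keep only rows from 7:00 to 7:30 inclusive
kept <- dat[mins >= 7 * 60 & mins <= 7 * 60 + 30, ]
```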

Create xts object from CSV

I'm trying to generate an xts from a CSV file. The output looks okay as a simple vector i.e. Date and Value columns are character and numeric, respectively.
However, if I want to make it into an xts, the output seems dubious
I'm wondering what is the output on the furthest left column on the xts?
> test <- read.csv("Test.csv", header = TRUE, as.is = TRUE)
> test
Date Value
1 1/12/2014 1.5
2 2/12/2014 0.9
3 1/12/2015 -0.1
4 2/12/2015 -0.3
5 1/12/2016 -0.7
6 2/12/2016 0.2
7 7/12/2016 -1.0
8 8/12/2016 -0.2
9 9/12/2016 -1.1
> xts(test, order.by = as.POSIXct(test$Date), format = "%d/%m/%Y")
Date Value
0001-12-20 "1/12/2014" " 1.5"
0001-12-20 "1/12/2015" "-0.1"
0001-12-20 "1/12/2016" "-0.7"
0002-12-20 "2/12/2014" " 0.9"
0002-12-20 "2/12/2015" "-0.3"
0002-12-20 "2/12/2016" " 0.2"
0007-12-20 "7/12/2016" "-1.0"
0008-12-20 "8/12/2016" "-0.2"
0009-12-20 "9/12/2016" "-1.1"
I'd simply like to set an xts ordered by Date, rather than the mystery column on the left. I've tried as.Date for the xts as well but have the same results.

I recommend you use read.zoo to read the data from CSV, then convert the result to xts using as.xts.
library(xts)  # also attaches zoo, which provides read.zoo()
Text <- "Date,Value
1/12/2014,1.5
2/12/2014,0.9
1/12/2015,-0.1
2/12/2015,-0.3
1/12/2016,-0.7
2/12/2016,0.2
7/12/2016,-1.0
8/12/2016,-0.2
9/12/2016,-1.1"
z <- read.zoo(text=Text, sep=",", header=TRUE, format="%d/%m/%Y", drop=FALSE)
x <- as.xts(z)
# Value
# 2014-12-01 1.5
# 2014-12-02 0.9
# 2015-12-01 -0.1
# 2015-12-02 -0.3
# 2016-12-01 -0.7
# 2016-12-02 0.2
# 2016-12-07 -1.0
# 2016-12-08 -0.2
# 2016-12-09 -1.1
Note that you will need to omit text = Text from your actual call, and replace it with file = "your_file_name.csv".

The issue appears to be twofold. One, there is a misplaced parenthesis in one of your calls, so format is passed to xts rather than to the date conversion; two, the leftmost column is the index, making the Date column superfluous.
df <- read.table(text="
Date Value
1/12/2014 1.5
2/12/2014 0.9
1/12/2015 -0.1
2/12/2015 -0.3
1/12/2016 -0.7
2/12/2016 0.2
7/12/2016 -1.0
8/12/2016 -0.2
9/12/2016 -1.1",
header=TRUE)
df$Date <- as.Date(df$Date, format="%d/%m/%Y")
library(xts)
xts(df[-1], order.by=df[,1])
# Value
# 2014-12-01 1.5
# 2014-12-02 0.9
# 2015-12-01 -0.1
# 2015-12-02 -0.3
# 2016-12-01 -0.7
# 2016-12-02 0.2
# 2016-12-07 -1.0
# 2016-12-08 -0.2
# 2016-12-09 -1.1
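The odd 0001-12-20 index in the question appears to come from the date strings being parsed without the intended format. A small sketch of the difference, using one date string from the question and as.Date() in place of as.POSIXct() for brevity:

```r
s <- "1/12/2014"

# Without a format, R tries "%Y-%m-%d" and then "%Y/%m/%d",
# so the leading "1" is read as the year
d_default <- as.Date(s)

# With the format passed to as.Date() itself,
# the string is read as day/month/year
d_parsed <- as.Date(s, format = "%d/%m/%Y")
format(d_parsed)  # "2014-12-01"
```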

dynamic column names in data.table correlation

I've combined the outputs for each user and item (for a recommendation system) into this all x all R data.table. For each row in this table, I need to calculate the correlation between user scores 1,2,3 & item scores 1,2,3 (e.g. for the first row what is the correlation between 0.5,0.6,-0.2 and 0.2,0.8,-0.3) to see how well the user and the item match.
user item user_score_1 user_score_2 user_score_3 item_score_1 item_score_2 item_score_3
A 1 0.5 0.6 -0.2 0.2 0.8 -0.3
A 2 0.5 0.6 -0.2 0.4 0.1 -0.8
A 3 0.5 0.6 -0.2 -0.2 -0.4 -0.1
B 1 -0.6 -0.1 0.9 0.2 0.8 -0.3
B 2 -0.6 -0.1 0.9 0.4 0.1 -0.8
B 3 -0.6 -0.1 0.9 -0.2 -0.4 -0.1
I have a solution that works - which is:
scoresDT[, cor(c(user_score_1,user_score_2,user_score_3), c(item_score_1,item_score_2,item_score_3)), by= .(user, item)]
...where scoresDT is my data.table.
This is all well and good, and it works...but I can't get it to work with dynamic variables instead of hard coding in the variable names.
Normally in a data.frame I could create a list and just input that, but as it's character format, the data.table doesn't like it. I've tried using a list with "with=FALSE" and have had some success when trying basic subsetting of the data.table but not with the correlation syntax that I need...
Any help is much, much appreciated!
Thanks,
Andrew

Here's what I would do:
mDT = melt(scoresDT,
id.vars = c("user","item"),
measure.vars = patterns("item_score_", "user_score_"),
value.name = c("item_score", "user_score")
)
mDT[, cor(item_score, user_score), by=.(user,item)]
user item V1
1: A 1 0.8955742
2: A 2 0.9367659
3: A 3 -0.8260332
4: B 1 -0.6141324
5: B 2 -0.9958706
6: B 3 0.5000000
I'd keep the data in its molten/long form, which fits more naturally with R and data.table functionality.
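If the wide layout has to stay, the column names can be built at run time and used to pull out two numeric matrices, after which the correlation is computed row by row. This is a sketch of one possible approach rather than the answer's method. The names user_cols, item_cols, row_cor and the r column are introduced here, and scoresDT is rebuilt from the first and last rows of the question's table.

```r
library(data.table)

scoresDT <- data.table(
  user = c("A", "B"), item = c(1, 3),
  user_score_1 = c(0.5, -0.6), user_score_2 = c(0.6, -0.1), user_score_3 = c(-0.2, 0.9),
  item_score_1 = c(0.2, -0.2), item_score_2 = c(0.8, -0.4), item_score_3 = c(-0.3, -0.1)
)

# Column names discovered at run time instead of being hard-coded
user_cols <- grep("^user_score_", names(scoresDT), value = TRUE)
item_cols <- grep("^item_score_", names(scoresDT), value = TRUE)

# with = FALSE lets a character vector select columns
U <- as.matrix(scoresDT[, user_cols, with = FALSE])
V <- as.matrix(scoresDT[, item_cols, with = FALSE])

# One correlation per row, pairing score k of the user with score k of the item
row_cor <- vapply(seq_len(nrow(scoresDT)),
                  function(i) cor(U[i, ], V[i, ]),
                  numeric(1))
scoresDT[, r := row_cor]
```

This relies on the two grep() calls returning the score columns in the same order, which holds for the naming scheme shown in the question.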
