Extract specific numbers from a string in R

I have this example:
> exemplo
V1 V2
local::/raiz/diretorio/adminadmin/ 1
local::/raiz/diretorio/jatai_p_user/ 2
local::/raiz/diretorio/adminteste/ 3
local::/raiz/diretorio/adminteste2/ 4
local::/raiz/diretorio/48808032191/ 5
local::/raiz/diretorio/85236250110/ 6
local::/raiz/diretorio/92564593100/ 7
local::/raiz/diretorio/AACB/036/03643936451/ 331
home::22723200159 3894
home::98476963300 3895
home::15239136149 3896
home::01534562567 3897
I would like to extract just the numbers with exactly 11 characters (in the first column), producing results like this:
> exemplo
V1 V2
48808032191 5
85236250110 6
92564593100 7
03643936451 331
22723200159 3894
98476963300 3895
15239136149 3896
01534562567 3897
Any help would be great :-)

Here's one way using stringr, where d is your dataframe:
library(stringr)
m <- str_extract(d$V1, '\\d{11}')
na.omit(data.frame(V1=m, V2=d$V2))
# V1 V2
# 5 48808032191 5
# 6 85236250110 6
# 7 92564593100 7
# 8 03643936451 331
# 9 22723200159 3894
# 10 98476963300 3895
# 11 15239136149 3896
# 12 01534562567 3897
The approach above will match any run of at least 11 digits (and extract the first 11 of them). In response to @JoshO'Brien's comment, if you only want to match exactly 11 digits, then you can use lookarounds:
m <- str_extract(d$V1, perl('(?<!\\d)\\d{11}(?!\\d)'))

DF <- read.table(text = "V1 V2
local::/raiz/diretorio/adminadmin/ 1
local::/raiz/diretorio/jatai_p_user/ 2
local::/raiz/diretorio/adminteste/ 3
local::/raiz/diretorio/adminteste2/ 4
local::/raiz/diretorio/48808032191/ 5
local::/raiz/diretorio/85236250110/ 6
local::/raiz/diretorio/92564593100/ 7
local::/raiz/diretorio/AACB/036/03643936451/ 331
home::22723200159 3894
home::98476963300 3895
home::15239136149 3896
home::01534562567 3897", header = TRUE)
pattern <- "\\d{11}"
m <- regexpr(pattern, DF$V1)
DF1 <- DF[attr(m, "match.length") > -1,]
DF1$V1<- regmatches(DF$V1, m)
# V1 V2
#5 48808032191 5
#6 85236250110 6
#7 92564593100 7
#8 03643936451 331
#9 22723200159 3894
#10 98476963300 3895
#11 15239136149 3896
#12 01534562567 3897

Here's how I'd approach it. This can be done in base R, but stringi's consistent naming makes it easy to use, not to mention fast. I'd store the 11 digits as a new column rather than overwrite the old one.
dat <- read.table(text="V1 V2
local::/raiz/diretorio/adminadmin/ 1
local::/raiz/diretorio/jatai_p_user/ 2
local::/raiz/diretorio/adminteste/ 3
local::/raiz/diretorio/adminteste2/ 4
local::/raiz/diretorio/48808032191/ 5
local::/raiz/diretorio/85236250110/ 6
local::/raiz/diretorio/92564593100/ 7
local::/raiz/diretorio/AACB/036/03643936451/ 331
home::22723200159 3894
home::98476963300 3895
home::15239136149 3896
home::01534562567 3897", header=TRUE)
library(stringi)
dat[["V3"]] <- unlist(stri_extract_all_regex(dat[["V1"]], "\\d{11}"))
dat[!is.na(dat[["V3"]]), 3:2]
## V3 V2
## 5 48808032191 5
## 6 85236250110 6
## 7 92564593100 7
## 8 03643936451 331
## 9 22723200159 3894
## 10 98476963300 3895
## 11 15239136149 3896
## 12 01534562567 3897

The function you are looking for is grep(), which identifies the matching rows; paired with regexpr() and regmatches() you can pull out the matched text itself. The pattern to use would be something like \d{11} or [0-9]{11}.
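As a sketch of that approach in base R (assuming the dataframe is named d, as in the stringr answer above):

```r
# keep rows whose V1 contains an 11-digit run
hits <- grepl("\\d{11}", d$V1)
d2 <- d[hits, ]
# extract the first 11-digit run from each remaining row
d2$V1 <- regmatches(d2$V1, regexpr("\\d{11}", d2$V1))
d2
```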

Related

Adding a second column as a function of first with data.table's (`:=`) [duplicate]

I want to create a new data.table, or maybe just add some columns to a data.table. It is easy to specify multiple new columns, but what happens if I want a third column that calculates a value based on one of the columns I am creating? I think the plyr package can do something like that. Can we perform such iterative (sequential) column creation in data.table?
I want to do as follows
dt <- data.table(shop = 1:10, income = 10:19*70)
dt[ , list(hope = income * 1.05, hopemore = income * 1.20, hopemorerealistic = hopemore - 100)]
or maybe
dt[ , `:=`(hope = income*1.05, hopemore = income*1.20, hopemorerealistic = hopemore-100)]
You can also use `<-` within the call to list, e.g.
DT <- data.table(a=1:5)
DT[, c('b','d') := list(b1 <- a*2, b1*3)]
DT
a b d
1: 1 2 6
2: 2 4 12
3: 3 6 18
4: 4 8 24
5: 5 10 30
Or
DT[, `:=`(hope = hope <- a+1, z = hope-1)]
DT
a b d hope z
1: 1 2 6 2 1
2: 2 4 12 3 2
3: 3 6 18 4 3
4: 4 8 24 5 4
5: 5 10 30 6 5
It is possible by using curly braces and semicolons in j
There are multiple ways to go about it, here are two examples:
# If you simply want to output:
dt[ , {
  hope <- income * 1.05
  hopemore <- income * 1.20
  list(hope = hope, hopemore = hopemore, hopemorerealistic = hopemore - 100)
}]
# if you want to save the values
dt[ , c("hope", "hopemore", "hopemorerealistic") := {
  hope <- income * 1.05
  hopemore <- income * 1.20
  list(hope, hopemore, hopemore - 100)
}]
dt
# shop income hope hopemore hopemorerealistic
# 1: 1 700 735.0 840 740
# 2: 2 770 808.5 924 824
# 3: 3 840 882.0 1008 908
# 4: 4 910 955.5 1092 992
# 5: 5 980 1029.0 1176 1076
# 6: 6 1050 1102.5 1260 1160
# 7: 7 1120 1176.0 1344 1244
# 8: 8 1190 1249.5 1428 1328
# 9: 9 1260 1323.0 1512 1412
# 10: 10 1330 1396.5 1596 1496

Automate regression by rows

I have a data.frame
set.seed(100)
exp <- data.frame(exp = c(rep(LETTERS[1:2], each = 10)), re = c(rep(seq(1, 10, 1), 2)), age1 = seq(10, 29, 1), age2 = seq(30, 49, 1),
h = c(runif(20, 10, 40)), h2 = c(40 + runif(20, 4, 9)))
I'd like to fit an lm for each row in the data set (h and h2 ~ age1 and age2).
I do it with a loop:
exp$modelh <- 0
for (i in 1:length(exp$exp)) {
  age <- c(exp$age1[i], exp$age2[i])
  h <- c(exp$h[i], exp$h2[i])
  model <- lm(age ~ h)
  exp$modelh[i] <- coef(model)[1] + 100 * coef(model)[2]
}
It works well, but takes some time with very large files. I would be grateful for a faster solution, e.g. with dplyr.
Using dplyr, we can try with rowwise() and do. Inside the do, we concatenate (c) the 'age1', 'age2' to create 'age', likewise, we can create 'h', apply lm, extract the coef to create the column 'modelh'.
library(dplyr)
exp %>%
  rowwise() %>%
  do({
    age <- c(.$age1, .$age2)
    h <- c(.$h, .$h2)
    model <- lm(age ~ h)
    data.frame(., modelh = coef(model)[1] + 100 * coef(model)[2])
  })
gives the output
# exp re age1 age2 h h2 modelh
#1 A 1 10 30 19.23298 46.67906 68.85506
#2 A 2 11 31 17.73018 47.55402 66.17050
#3 A 3 12 32 26.56967 46.69174 84.98486
#4 A 4 13 33 11.69149 47.74486 61.98766
#5 A 5 14 34 24.05648 46.10051 82.90167
#6 A 6 15 35 24.51312 44.85710 89.21053
#7 A 7 16 36 34.37208 47.85151 113.37492
#8 A 8 17 37 21.10962 48.40977 74.79483
#9 A 9 18 38 26.39676 46.74548 90.34187
#10 A 10 19 39 15.10786 45.38862 75.07002
#11 B 1 20 40 28.74989 46.44153 100.54666
#12 B 2 21 41 36.46497 48.64253 125.34773
#13 B 3 22 42 18.41062 45.74346 81.70062
#14 B 4 23 43 21.95464 48.77079 81.20773
#15 B 5 24 44 32.87653 47.47637 115.95097
#16 B 6 25 45 30.07065 48.44727 101.10688
#17 B 7 26 46 16.13836 44.90204 84.31080
#18 B 8 27 47 20.72575 47.14695 87.00805
#19 B 9 28 48 20.78425 48.94782 84.25406
#20 B 10 29 49 30.70872 44.65144 128.39415
We could do this with the devel version of data.table i.e. v1.9.5. Instructions to install the devel version are here.
We convert the 'data.frame' to 'data.table' (setDT), create a column 'rn' with the option keep.rownames=TRUE. We melt the dataset by specifying the patterns in the measure to convert from 'wide' to 'long' format. Grouped by 'rn', we do the lm and get the coef. This can be assigned as a new column in the original dataset ('exp') while removing the unwanted 'rn' column by assigning (:=) it to NULL.
library(data.table)#v1.9.5+
modelh <- melt(setDT(exp, keep.rownames = TRUE), measure = patterns('^age', '^h'),
               value.name = c('age', 'h'))[, {
                 model <- lm(age ~ h)
                 coef(model)[1] + 100 * coef(model)[2]
               }, rn]$V1
exp[, modelh := modelh][, rn := NULL]
exp
# exp re age1 age2 h h2 modelh
# 1: A 1 10 30 19.23298 46.67906 68.85506
# 2: A 2 11 31 17.73018 47.55402 66.17050
# 3: A 3 12 32 26.56967 46.69174 84.98486
# 4: A 4 13 33 11.69149 47.74486 61.98766
# 5: A 5 14 34 24.05648 46.10051 82.90167
# 6: A 6 15 35 24.51312 44.85710 89.21053
# 7: A 7 16 36 34.37208 47.85151 113.37492
# 8: A 8 17 37 21.10962 48.40977 74.79483
# 9: A 9 18 38 26.39676 46.74548 90.34187
#10: A 10 19 39 15.10786 45.38862 75.07002
#11: B 1 20 40 28.74989 46.44153 100.54666
#12: B 2 21 41 36.46497 48.64253 125.34773
#13: B 3 22 42 18.41062 45.74346 81.70062
#14: B 4 23 43 21.95464 48.77079 81.20773
#15: B 5 24 44 32.87653 47.47637 115.95097
#16: B 6 25 45 30.07065 48.44727 101.10688
#17: B 7 26 46 16.13836 44.90204 84.31080
#18: B 8 27 47 20.72575 47.14695 87.00805
#19: B 9 28 48 20.78425 48.94782 84.25406
#20: B 10 29 49 30.70872 44.65144 128.39415
Great (double) answer from @akrun.
Just a suggestion for your future analysis, as you mentioned "it's an example of a bigger problem". Obviously, if you are really interested in building models rowwise, then you'll create more and more columns as your age and h observations increase. With N observations you'll need 2×N columns for those 2 variables alone.
I'd suggest using a long data format in order to increase your rows instead of your columns.
Something like:
exp[1,] # how your first row (model building info) looks like
# exp re age1 age2 h h2
# 1 A 1 10 30 19.23298 46.67906
reshape(exp[1,], # how your model building info is transformed
        varying = list(c("age1","age2"),
                       c("h","h2")),
        v.names = c("age_value","h_value"),
        direction = "long")
# exp re time age_value h_value id
# 1.1 A 1 1 10 19.23298 1
# 1.2 A 1 2 30 46.67906 1
Apologies if the "bigger problem" refers to something else and this answer is irrelevant.
In base R, the function sprintf can help us create the formulas, and lapply carries out the calculation.
strings <- sprintf("c(%f,%f) ~ c(%f,%f)", exp$age1, exp$age2, exp$h, exp$h2)
lst <- lapply(strings, function(x) {model <- lm(as.formula(x));coef(model)[1] + 100 * coef(model)[2]})
exp$modelh <- unlist(lst)
exp
# exp re age1 age2 h h2 modelh
# 1 A 1 10 30 19.23298 46.67906 68.85506
# 2 A 2 11 31 17.73018 47.55402 66.17050
# 3 A 3 12 32 26.56967 46.69174 84.98486
# 4 A 4 13 33 11.69149 47.74486 61.98766
# 5 A 5 14 34 24.05648 46.10051 82.90167
# 6 A 6 15 35 24.51312 44.85710 89.21053
# 7 A 7 16 36 34.37208 47.85151 113.37493
# 8 A 8 17 37 21.10962 48.40977 74.79483
# 9 A 9 18 38 26.39676 46.74548 90.34187
# 10 A 10 19 39 15.10786 45.38862 75.07002
# 11 B 1 20 40 28.74989 46.44153 100.54666
# 12 B 2 21 41 36.46497 48.64253 125.34773
# 13 B 3 22 42 18.41062 45.74346 81.70062
# 14 B 4 23 43 21.95464 48.77079 81.20773
# 15 B 5 24 44 32.87653 47.47637 115.95097
# 16 B 6 25 45 30.07065 48.44727 101.10688
# 17 B 7 26 46 16.13836 44.90204 84.31080
# 18 B 8 27 47 20.72575 47.14695 87.00805
# 19 B 9 28 48 20.78425 48.94782 84.25406
# 20 B 10 29 49 30.70872 44.65144 128.39416
In the lapply function the expression as.formula(x) is what converts the formulas created in the first line into a format usable by the lm function.
Benchmark
library(dplyr)
library(microbenchmark)
set.seed(100)
big.exp <- data.frame(age1=sample(30, 1e4, T),
age2=sample(30:50, 1e4, T),
h=runif(1e4, 10, 40),
h2= 40 + runif(1e4,4,9))
microbenchmark(
  plafort = {
    strings <- sprintf("c(%f,%f) ~ c(%f,%f)", big.exp$age1, big.exp$age2,
                       big.exp$h, big.exp$h2)
    lst <- lapply(strings, function(x) {
      model <- lm(as.formula(x))
      coef(model)[1] + 100 * coef(model)[2]
    })
    big.exp$modelh <- unlist(lst)
  },
  akdplyr = {
    big.exp %>%
      rowwise() %>%
      do({
        age <- c(.$age1, .$age2)
        h <- c(.$h, .$h2)
        model <- lm(age ~ h)
        data.frame(., modelh = coef(model)[1] + 100 * coef(model)[2])
      })
  },
  times = 5)
Unit: seconds
expr min lq mean median uq max neval cld
plafort 13.00605 13.41113 13.92165 13.56927 14.53814 15.08366 5 a
akdplyr 26.95064 27.64240 29.40892 27.86258 31.02955 33.55940 5 b
(Note: I downloaded the newest 1.9.5 devel version of data.table today, but continued to receive errors when trying to test it.
The results also differ fractionally (1.93 x 10^-8). Rounding likely accounts for the difference.)
all.equal(pl, ak)
[1] "Attributes: < Component “class”: Lengths (1, 3) differ (string compare on first 1) >"
[2] "Attributes: < Component “class”: 1 string mismatch >"
[3] "Component “modelh”: Mean relative difference: 1.933893e-08"
Conclusion
The lapply approach seems to perform well compared to dplyr with respect to speed, but its 5-digit rounding may be an issue. Improvements may be possible; perhaps converting to a matrix and using apply would increase speed and efficiency.
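One such improvement: since each rowwise model here is a straight line through only two points, the coefficients have a closed form, so the whole calculation can be vectorized with no lm() calls at all. A sketch using the same exp data as above (modelh2 should agree with the loop's modelh up to floating-point error):

```r
# slope and intercept of the line through (h, age1) and (h2, age2)
b <- (exp$age2 - exp$age1) / (exp$h2 - exp$h)
a <- exp$age1 - b * exp$h
exp$modelh2 <- a + 100 * b
```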

CSV conversion in R for standard calculations

I have a problem calculating the mean of columns for a dataset imported from this CSV file
I import the file using the following command:
dataGSR = read.csv("ShimmerData.csv", header = TRUE, sep = ",",stringsAsFactors=T)
dataGSR$X=NULL #don't need this column
Then I take a subset of this
dati=dataGSR[4:1000,]
I check that they are correct:
head(dati)
Shimmer Shimmer.1 Shimmer.2 Shimmer.3 Shimmer.4 Shimmer.5 Shimmer.6 Shimmer.7
4 31329 0 713 623.674691281028 2545 3706.5641025641 2409 3529.67032967033
5 31649 9.765625 713 623.674691281028 2526 3678.89230769231 2501 3664.46886446886
6 31969 19.53125 712 638.528829576655 2528 3681.80512820513 2501 3664.46886446886
7 32289 29.296875 713 623.674691281028 2516 3664.3282051282 2498 3660.07326007326
8 32609 39.0625 711 654.10779696494 2503 3645.39487179487 2496 3657.14285714286
9 32929 48.828125 713 623.674691281028 2505 3648.30769230769 2496 3657.14285714286
When I type
means=colMeans(dati)
Error in colMeans(dati) : 'x' must be numeric
In order to solve this problem I convert everything into a matrix
datiM=data.matrix(dati)
But when I check the new variable, the data values are different
head(datiM)
Shimmer Shimmer.1 Shimmer.2 Shimmer.3 Shimmer.4 Shimmer.5 Shimmer.6 Shimmer.7
4 370 1 10 1 65 65 1 1
5 375 3707 10 1 46 46 24 24
6 381 1025 9 2 48 48 24 24
7 386 2162 10 1 36 36 21 21
8 392 3126 8 3 23 23 19 19
9 397 3229 10 1 25 25 19 19
My question here is:
How do I correctly convert the dati variable in order to perform colMeans()?
In addition to @akrun's advice, another option is to convert the columns to numeric yourself (rather than having read.csv do it):
dati <- data.frame(
lapply(dataGSR[-c(1:3),-9],as.numeric))
##
R> colMeans(dati)
Shimmer Shimmer.1 Shimmer.2 Shimmer.3 Shimmer.4 Shimmer.5 Shimmer.6 Shimmer.7
33004.2924 18647.4609 707.4335 718.3989 2521.3626 3672.1383 2497.9013 3659.9287
Where dataGSR was read in with stringsAsFactors=F,
dataGSR <- read.csv(
file="F:/temp/ShimmerData.csv",
header=TRUE,
stringsAsFactors=F)
Unless you know for sure that you need character columns to be factors, you are better off setting this option to FALSE.
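This also explains the strange values data.matrix() produced above: with stringsAsFactors=TRUE the columns are factors, and converting a factor to numeric yields its integer level codes, not the original numbers. A tiny illustration:

```r
f <- factor(c("713", "623.674691281028"))
as.numeric(f)                # the level codes (2 1), not the numbers
as.numeric(as.character(f))  # the actual numeric values
```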
The header lines ("character") in the dataset span the first 4 lines. We could skip those 4 lines, use header=FALSE, and then set the column names based on the info from the first 4 lines.
dataGSR <- read.csv('ShimmerData.csv', header=FALSE,
stringsAsFactors=FALSE, skip=4)
lines <- readLines('ShimmerData.csv', n=4)
colnames(dataGSR) <- do.call(paste, c(strsplit(lines, ','),
list(sep="_")))
dataGSR <- dataGSR[,-9]
unname(colMeans(dataGSR))
# [1] 33004.2924 18647.4609   707.4335   718.3989  2521.3626  3672.1383  2497.9013
# [8]  3659.9287

R - "find and replace" integers in a column with character labels

I have two data frames the first (DF1) is similar to this:
Ba Ram You Sheep
30 1 33.2 120.9
27 3 22.1 121.2
22 4 39.1 99.1
11 1 20.0 101.6
9 3 9.8 784.3
The second (DF2) contains titles for column "Ram":
V1 V2
1 RED
2 GRN
3 YLW
4 BLU
I need to replace the DF1$Ram with corresponding character strings of DF2$V2:
Ba Ram You Sheep
30 RED 33.2 120.9
27 YLW 22.1 121.2
22 BLU 39.1 99.1
11 RED 20.0 101.6
9 YLW 9.8 784.3
I can do this with a nested for loop, but it feels REALLY inefficient:
x <- 1:nrow(DF1)
y <- 1:4
for (i in x) {
  for (j in y) {
    if (DF1$Ram[i] == DF2$V1[j]) {
      DF1$Ram[i] <- DF2$V2[j]
    }
  }
}
Is there a way to do this more efficiently??!?! I know there is. I'm a noob.
Use merge
> result <- merge(df1, df2, by.x="Ram", by.y="V1")[,-1] # merging data.frames
> colnames(result)[4] <- "Ram" # setting name
The following is just for getting the output in the order you showed us
> result[order(result$Ba, decreasing = TRUE), c("Ba", "Ram", "You", "Sheep")]
Ba Ram You Sheep
1 30 RED 33.2 120.9
3 27 YLW 22.1 121.2
5 22 BLU 39.1 99.1
2 11 RED 20.0 101.6
4 9 YLW 9.8 784.3
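An alternative sketch that avoids merge and keeps the original row order is a lookup via match() (assuming DF2$V2 is character; wrap it in as.character() if it was read in as a factor):

```r
# for each value in DF1$Ram, find its position in DF2$V1 and take the label there
DF1$Ram <- DF2$V2[match(DF1$Ram, DF2$V1)]
```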
Usually, when you encode some character strings with integers, you likely want factor. They offer some benefits you can read about in the fine manual.
df1 <- data.frame(V2 = c(3,3,2,3,1))
df2 <- data.frame(V1=1:4, V2=c('a','b','c','d'))
df1 <- within(df1, {
  f <- factor(V2, levels = df2$V1, labels = df2$V2)
  aschar <- as.character(f)
  asnum <- as.numeric(f)
})

R Read CSV file that has timestamp

I have a CSV file that has a timestamp column stored as a string
15,1035,4530,3502,2,892,482,0,20060108081608,2,N
15,1034,7828,3501,3,263,256,0,20071124175519,3,N
15,1035,7832,4530,2,1974,1082,0,20071124193818,7,N
15,2346,8381,8155,3,2684,649,0,20080207131002,9,N
I use read.csv, but the problem is that once the import finishes, the date column looks like:
1 15 1035 4530 3502 2 892 482 0 2.006011e+13 2 N
2 15 1034 7828 3501 3 263 256 0 2.007112e+13 3 N
3 15 1035 7832 4530 2 1974 1082 0 2.007112e+13 7 N
4 15 2346 8381 8155 3 2684 649 0 2.008021e+13 9 N
Is there a way to parse the date from the string as it gets read? (The CSV file does have headers; I removed them here to keep the data anonymous.) If we can't parse it as it gets read, what is the best way to do so afterwards?
Here are 2 methods:
Using the zoo package. Personally I prefer this one, since it treats your data as a time series.
library(zoo)
read.zoo(text='15,1035,4530,3502,2,892,482,0,20060108081608,2,N
15,1034,7828,3501,3,263,256,0,20071124175519,3,N
15,1035,7832,4530,2,1974,1082,0,20071124193818,7,N
15,2346,8381,8155,3,2684,649,0,20080207131002,9,N',
index=9,tz='',format='%Y%m%d%H%M%S',sep=',')
V1 V2 V3 V4 V5 V6 V7 V8 V10 V11
2006-01-08 08:16:08 15 1035 4530 3502 2 892 482 0 2 N
2007-11-24 17:55:19 15 1034 7828 3501 3 263 256 0 3 N
2007-11-24 19:38:18 15 1035 7832 4530 2 1974 1082 0 7 N
2008-02-07 13:10:02 15 2346 8381 8155 3 2684 649 0 9 N
Using the colClasses argument in read.table, as mentioned in the comments:
dat <- read.table(text='15,1035,4530,3502,2,892,482,0,20060108081608,2,N
15,1034,7828,3501,3,263,256,0,20071124175519,3,N
15,1035,7832,4530,2,1974,1082,0,20071124193818,7,N
15,2346,8381,8155,3,2684,649,0,20080207131002,9,N',
colClasses=c(rep('numeric',8),
'character','numeric','character')
,sep=',')
strptime(dat$V9,'%Y%m%d%H%M%S')
# [1] "2006-01-08 08:16:08" "2007-11-24 17:55:19" "2007-11-24 19:38:18" "2008-02-07 13:10:02"
As Ricardo says, you can set the column classes with read.csv. In this case I recommend importing these as characters and once the csv is loaded, converting them to dates with strptime().
for example:
test <- '20080207131002'
strptime(x = test, format = "%Y%m%d%H%M%S")
Which will return a POSIXlt object w/ the date/time info.
You can use the lubridate package:
test <- '20080207131002'
lubridate::as_datetime(test)
You can also specify the format explicitly for each case, depending on your needs.
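For example, with lubridate's parse_date_time() you can state the order of the components explicitly (a sketch for the compact %Y%m%d%H%M%S layout from the question):

```r
library(lubridate)
test <- "20080207131002"
parse_date_time(test, orders = "ymd HMS")
# "2008-02-07 13:10:02 UTC"
```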
