Trying to extract specific characters in a column in R?

The content in the column looks like this: $1,521+ 2 bds. I want to extract 1521 and put it in a new column. I know this can be done in Alteryx using regex; can I do it in R?

How about the following?:
library(tidyverse)
x <- '$1,521+ 2 bds'
parse_number(x)

For example:
library(tidyverse)
#generate some data
tbl <- tibble(string = str_c('$', as.character(seq(1521, 1541, 1)), '+', ' 2bds'))
new_col <-
tbl$string %>%
str_split('\\+',simplify = TRUE) %>%
`[`(, 1) %>%
str_sub(2, -1) #get rid of '$' at the start
mutate(tbl, number = new_col)
#> # A tibble: 21 x 2
#> string number
#> <chr> <chr>
#> 1 $1521+ 2bds 1521
#> 2 $1522+ 2bds 1522
#> 3 $1523+ 2bds 1523
#> 4 $1524+ 2bds 1524
#> 5 $1525+ 2bds 1525
#> 6 $1526+ 2bds 1526
#> 7 $1527+ 2bds 1527
#> 8 $1528+ 2bds 1528
#> 9 $1529+ 2bds 1529
#> 10 $1530+ 2bds 1530
#> # … with 11 more rows
Created on 2021-06-12 by the reprex package (v2.0.0)

We can use sub() from base R (x is defined under "data" below):
as.numeric( sub("\\$(\\d+),(\\d+).*", "\\1\\2", x))
#[1] 1521
data
x <- '$1,521+ 2 bds'
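A slightly more general base R variant, as a sketch only: the pattern above expects exactly one comma, so a price such as $950 would not match and as.numeric() would return NA. Assuming the price is always the part before the "+", one could drop everything from the "+" onward and then strip the remaining non-numeric characters. The data frame df and its listing column are made-up names for illustration, and only the first value comes from the question.

```r
# Hypothetical example data; only the first string appears in the question
df <- data.frame(
  listing = c("$1,521+ 2 bds", "$950+ 1 bd", "$12,300+ 4 bds"),
  stringsAsFactors = FALSE
)

# Step 1: keep only the part before the "+" (assumed to hold the price)
price_part <- sub("\\+.*$", "", df$listing)

# Step 2: remove everything that is not a digit or a decimal point,
# then convert to numeric and store the result in a new column
df$price <- as.numeric(gsub("[^0-9.]", "", price_part))

df
```

Splitting the work into two steps keeps the bedroom count after the "+" from leaking into the extracted number.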

Related

Creating serial number for unique entries in R

I want to assign the same serial number to all identical Submission_Ids within one Batch_No. Could someone please help me figure this out?
Submission_Id <- c(619295,619295,619295,619295,619296,619296,619296,619296,619296,556921,556921,559254,647327,647327,647327,646040,646040,646040,646040,646040,64604)
Batch_No <- c(633,633,633,633,633,633,633,633,633,633,633,633,634,634,634,650,650,650,650,650,650)
Expected result
Sl.No <- c(1,1,1,1,2,2,2,2,2,3,3,4,1,1,1,1,1,1,1,1,1)
One way to do it is to create run-length IDs with data.table::rleid(Submission_Id), grouped by Batch_No. We can use this inside 'dplyr'. To show this, I created a tibble() from the two given vectors, Batch_No and Submission_Id.
library(dplyr)
library(data.table)
dat <- tibble(Submission_Id = Submission_Id,
Batch_No = Batch_No)
dat %>%
group_by(Batch_No) %>%
mutate(S1.No = data.table::rleid(Submission_Id))
#> # A tibble: 21 x 3
#> # Groups: Batch_No [3]
#> Submission_Id Batch_No S1.No
#> <dbl> <dbl> <int>
#> 1 619295 633 1
#> 2 619295 633 1
#> 3 619295 633 1
#> 4 619295 633 1
#> 5 619296 633 2
#> 6 619296 633 2
#> 7 619296 633 2
#> 8 619296 633 2
#> 9 619296 633 2
#> 10 556921 633 3
#> # ... with 11 more rows
The original data
Submission_Id <- c(619295,619295,619295,619295,619296,619296,619296,619296,619296,556921,556921,559254,647327,647327,647327,646040,646040,646040,646040,646040,64604)
Batch_No <- c(633,633,633,633,633,633,633,633,633,633,633,633,634,634,634,650,650,650,650,650,650)
Created on 2022-12-16 by the reprex package (v2.0.1)
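Note that rleid() numbers consecutive runs, so an ID that reappears later within a batch, after other IDs, would get a new number. If the rows are not guaranteed to be ordered that way, a base R sketch using ave() and match() numbers the IDs by first appearance within each batch instead. The small vectors ids and batch below are invented for illustration and are not the data from the question.

```r
# Hypothetical data: ID 10 reappears in batch 1 after ID 20
ids   <- c(10, 10, 20, 10, 30, 30)
batch <- c(1, 1, 1, 1, 2, 2)

# Within each batch, number the IDs in order of first appearance:
# match(x, unique(x)) gives the position of each value among the unique values
sl_no <- ave(ids, batch, FUN = function(x) match(x, unique(x)))

data.frame(ids, batch, sl_no)
```

With this approach the fourth row gets serial number 1 again, whereas a run-length ID would assign it 3.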

Add leading zeros to column names

I'm surprised to find no one asked this question on Stackoverflow before. Maybe it's too stupid to ask?
So I have a dataframe that contains 48 weather variables, each representing a weather value for a month. I have drawn a simplified table shown below:
weather 1  weather 2  weather 3  weather 4  weather 5  weather 6  weather 7  weather 8  weather 9  weather 10  weather 11  weather 12
12         6          34         9          100        .01        -4         38         64         77          21          34
99         42         -3         34         34         .5         27         19         7          18          NA          20
My objective is to change the column names from "weather 1, weather 2, ..." to "weather 01, weather 02, ...", so I wrote a loop like this:
for (i in 1:9){
colnames(df) = gsub(i, 0+i, colnames(df))
}
However, instead of replacing the single-digit numbers with a leading zero, R replaced the actual letter "i" with "0+i". Can anyone let me know what's going on here and how to fix it? Or is there a better way to add leading zeros to column names?
Thank you very much!
We can use
library(stringr)
colnames(df) <- str_replace(colnames(df), "\\d+",
function(x) sprintf("%02d", as.integer(x)))
Here is another option:
library(tidyverse)
set.seed(35)
example <- tibble(`weather 1` = runif(2),
`weather 2` = runif(2),
`weather 3` = runif(2))
rename_with(example, ~str_replace(., "(weather )(\\d+)", "\\10\\2"), everything())
#> # A tibble: 2 x 3
#> `weather 01` `weather 02` `weather 03`
#> <dbl> <dbl> <dbl>
#> 1 0.857 0.553 0.486
#> 2 0.0108 0.950 0.0939
or with base R
colnames(example) <- gsub("(weather )(\\d+)", "\\10\\2", colnames(example))
example
#> # A tibble: 2 x 3
#> `weather 01` `weather 02` `weather 03`
#> <dbl> <dbl> <dbl>
#> 1 0.857 0.553 0.486
#> 2 0.0108 0.950 0.0939
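One caveat about the "\\10\\2" patterns above: because (\\d+) also matches two-digit numbers, a name such as "weather 10" would become "weather 010". As for the loop in the question, 0+i is ordinary addition and evaluates to i, so each digit is replaced by itself. A minimal base R sketch that pads only single-digit suffixes, assuming the number always sits at the end of the name after a space (the vector nms is made up for illustration):

```r
# Hypothetical column names, including two-digit months
nms <- c("weather 1", "weather 9", "weather 10", "weather 12")

# Match a space followed by exactly one digit at the end of the string,
# and insert a zero in front of that digit; longer numbers do not match
padded <- sub(" (\\d)$", " 0\\1", nms)

padded
```

In practice this would be applied as colnames(df) <- sub(" (\\d)$", " 0\\1", colnames(df)).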

Aggregate character string into vector in R

I have a data table test:
id  key
1   2365
1   2365
1   3709
2   6734
2   1908
2   4523
I want to aggregate the unique key values by id into a vector, using the data.table package.
Expected output:
id  key_array
1   "2365", "3709"
2   "6734", "1908", "4523"
So, this should work like array_agg sql function.
I tried:
res <- test[, list(key_array = paste(unique(key), collapse = ", ")), by = "id"]
but I get just a single string. I need to be able to find the length of each vector and work with its individual elements (for example, find the intersection of two vectors).
1. Base R
This is an aggregate() one-liner.
x <- 'id key
1 2365
1 2365
1 3709
2 6734
2 1908
2 4523'
test <- read.table(textConnection(x), header = TRUE)
aggregate(key ~ id, test, \(x) c(unique(x)))
#> id key
#> 1 1 2365, 3709
#> 2 2 6734, 1908, 4523
Created on 2022-06-14 by the reprex package (v2.0.1)
But if user @Chris's comment is right, then the right solution is as follows.
aggregate(key ~ id, test, \(x) paste(unique(x), collapse = ", "))
Note that both c(unique(x)) and as.character(c(unique(x))) will output a list column, so the latter solution is right anyway.
2. Package data.table
Once again a one-liner.
The output is a list column, with each list member a character vector. To keep the keys as integers, use
list(unique(key))
instead.
suppressPackageStartupMessages(library(data.table))
res <- setDT(test)[, .(key_array = list(as.character(unique(key)))), by = id]
res
#> id key_array
#> 1: 1 2365,3709
#> 2: 2 6734,1908,4523
str(res)
#> Classes 'data.table' and 'data.frame': 2 obs. of 2 variables:
#> $ id : int 1 2
#> $ key_array:List of 2
#> ..$ : chr "2365" "3709"
#> ..$ : chr "6734" "1908" "4523"
#> - attr(*, ".internal.selfref")=<externalptr>
Created on 2022-06-14 by the reprex package (v2.0.1)
Then, in order to access the vectors use two extractors, one to extract the column and the other one to extract the vectors.
res$key_array[[1]]
#> [1] "2365" "3709"
res$key_array[[2]]
#> [1] "6734" "1908" "4523"
Created on 2022-06-14 by the reprex package (v2.0.1)
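The question also asks about vector lengths and intersections. As a rough base R sketch of the same idea, without data.table, the keys can be split by id into a named list of unique values, on which lengths() and intersect() work directly. The name key_list is made up for this example, and the data are retyped from the question.

```r
# Data from the question
test <- data.frame(
  id  = c(1, 1, 1, 2, 2, 2),
  key = c(2365, 2365, 3709, 6734, 1908, 4523)
)

# One vector of unique keys per id, stored in a named list
key_list <- lapply(split(test$key, test$id), unique)

# Number of unique keys per id
n_keys <- lengths(key_list)

# Keys shared by id 1 and id 2 (none in this data)
common <- intersect(key_list[["1"]], key_list[["2"]])

n_keys
common
```

The same lengths() and intersect() calls should also work on the key_array list column of res, for example lengths(res$key_array).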
3. dplyr solution
Group by id and collapse the unique keys into a single string.
suppressPackageStartupMessages(library(dplyr))
test %>%
group_by(id) %>%
summarise(key_array = paste(unique(key), collapse = ", "))
#> # A tibble: 2 × 2
#> id key_array
#> <int> <chr>
#> 1 1 2365, 3709
#> 2 2 6734, 1908, 4523
Created on 2022-06-14 by the reprex package (v2.0.1)

Long to wide with multiple columns

I have a data frame with the columns sample_ID, method, parameter and value.
set.seed(123)
mydata <- data.frame(sample_ID = rep(1:100, each=4),
method = rep(LETTERS[1:2], 100),
parameter = rep(c("M1","M2"),times=c(2,2)),
value = round(runif(100, min = 100, max = 5000)),
stringsAsFactors = FALSE)
This data frame is in long format, and I would like to convert it to wide format: sample_ID should identify each row, and the method and parameter columns should be combined into new column names that hold the corresponding value. For example,
Sample_ID 1 has the value
1509 for method A and parameter M1
3963 for method B and parameter M1
2104 for method A and parameter M2
4427 for method B and parameter M2
Now I would like to convert these 4 rows to a single row like this:
sample_ID = 1, A_M1 = 1509, B_M1 = 3963, A_M2 = 2104, B_M2 = 4427
The next row would consist of those variables with sample_ID = 2, ...
I'm sorry but I was not able to do this with spread() or melt().
Thank you in advance!
Making use of tidyr::pivot_wider you could do:
tidyr::pivot_wider(mydata, names_from = c("method", "parameter"), values_from = value)
#> # A tibble: 100 × 5
#> sample_ID A_M1 B_M1 A_M2 B_M2
#> <int> <dbl> <dbl> <dbl> <dbl>
#> 1 1 1509 3963 2104 4427
#> 2 2 4708 323 2688 4473
#> 3 3 2802 2337 4788 2321
#> 4 4 3420 2906 604 4509
#> 5 5 1306 306 1707 4777
#> 6 6 4459 3495 3238 4972
#> 7 7 3313 3572 2766 3011
#> 8 8 1517 821 4819 4521
#> 9 9 3484 3998 221 2441
#> 10 10 3816 1160 1659 1235
#> # … with 90 more rows
Using dcast() from the data.table package (convert the data.frame to a data.table first with setDT()):
library(data.table)
setDT(mydata)
dcast(mydata, sample_ID ~ ...)
The same call can also be written in data.table's own notation:
mydata[, dcast(.SD, sample_ID ~ ...)]
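For comparison, a base R sketch of the same long-to-wide step using reshape(): the method and parameter columns are first pasted into a single key, which then becomes the set of new column names. The small data frame long and the helper column key are illustrative names; the values are the first two samples shown in the question and in the pivot_wider() output.

```r
# First two samples, in long format
long <- data.frame(
  sample_ID = rep(1:2, each = 4),
  method    = rep(c("A", "B"), 4),
  parameter = rep(c("M1", "M1", "M2", "M2"), 2),
  value     = c(1509, 3963, 2104, 4427, 4708, 323, 2688, 4473)
)

# Combine method and parameter into one key such as "A_M1"
long$key <- paste(long$method, long$parameter, sep = "_")

# Spread the keys into columns, one row per sample_ID
wide <- reshape(long[, c("sample_ID", "key", "value")],
                idvar = "sample_ID", timevar = "key", direction = "wide")

# reshape() prefixes the new columns with "value."; strip that prefix
names(wide) <- sub("^value\\.", "", names(wide))

wide
```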

use model object, e.g. panelmodel, to flag data used

Is it possible to use a fit object, specifically the regression object I get from a plm() model, to flag which observations in the data were actually used in the regression? I realize this could be done by looking for complete observations in my original data, but I am curious whether there is a way to use the fit/reg object to flag the data.
Let me illustrate my issue with a minimal working example,
First some packages needed,
# install.packages(c("stargazer", "plm", "tidyverse"), dependencies = TRUE)
library(plm); library(stargazer); library(tidyverse)
Second some data, this example is drawing heavily on Baltagi (2013), table 3.1, found in ?plm,
data("Grunfeld", package = "plm")
dta <- Grunfeld
now I create some semi-random missing values in my data object, dta
dta[c(3:13),3] <- NA; dta[c(22:28),4] <- NA; dta[c(30:33),5] <- NA
final step in the data preparation is to create a data frame with an index attribute that describes its individual and time dimensions, using tidyverse,
dta.p <- dta %>% group_by(firm, year)
Now to the regression
plm.reg <- plm(inv ~ value + capital, data = dta.p, model = "pooling")
the results, using stargazer,
stargazer(plm.reg, type="text") # stargazer(dta, type="text")
#> ============================================
#> Dependent variable:
#> ---------------------------
#> inv
#> ----------------------------------------
#> value 0.114***
#> (0.008)
#>
#> capital 0.237***
#> (0.028)
#>
#> Constant -47.962***
#> (9.252)
#>
#> ----------------------------------------
#> Observations 178
#> R2 0.799
#> Adjusted R2 0.797
#> F Statistic 348.176*** (df = 2; 175)
#> ===========================================
#> Note: *p<0.1; **p<0.05; ***p<0.01
Say I know my data has 200 observations, and I want to find the 178 that were used in the regression.
I am wondering whether there is some vector in plm.reg that I can (easily) use to create a flag in my original data, dta, indicating whether each observation was used or not, i.e. to identify the semi-random missing values I created above. Maybe some broom-like tool.
I imagine something like,
dta <- dta %>% valid_reg_obs(plm.reg)
The desired outcome would look something like this, the new element is the vector plm.reg at the end, i.e.,
dta %>% as_tibble()
#> # A tibble: 200 x 6
#> firm year inv value capital plm.reg
#> * <int> <int> <dbl> <dbl> <dbl> <lgl>
#> 1 1 1935 318 3078 2.80 T
#> 2 1 1936 392 4662 52.6 T
#> 3 1 1937 NA 5387 157 F
#> 4 1 1938 NA 2792 209 F
#> 5 1 1939 NA 4313 203 F
#> 6 1 1940 NA 4644 207 F
#> 7 1 1941 NA 4551 255 F
#> 8 1 1942 NA 3244 304 F
#> 9 1 1943 NA 4054 264 F
#> 10 1 1944 NA 4379 202 F
#> # ... with 190 more rows
Update: I tried to use broom's augment(), but unfortunately, instead of creating the flag I had hoped for, it gave me this error message:
# install.packages(c("broom"), dependencies = TRUE)
library(broom)
augment(plm.reg, dta)
#> Error in data.frame(..., check.names = FALSE) :
#> arguments imply differing number of rows: 200, 178
The vector is plm.reg$residuals. Not sure of a nice broom solution, but this seems to work:
library(tidyverse)
dta.p %>%
as.data.frame %>%
rowid_to_column %>%
mutate(plm.reg = rowid %in% names(plm.reg$residuals))
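The idea behind this is that the residual vector of a fitted model is named after the rows that survived the removal of missing values. A self-contained sketch of the same technique, using base R's lm() as a stand-in for plm() so that it runs without extra packages (the data frame df and its values are invented for illustration):

```r
# Hypothetical data with missing values in both the response and the predictor
df <- data.frame(
  y = c(1.0, 2.1, NA, 4.2, 5.1, 5.9),
  x = c(1, 2, 3, NA, 5, 6)
)

# Rows with an NA in y or x are dropped by the default na.action (na.omit)
fit <- lm(y ~ x, data = df)

# The residuals are named by the row names of the observations actually used,
# so membership in those names flags the rows that entered the regression
df$used <- rownames(df) %in% names(residuals(fit))

df
```

For this simple case the flag coincides with complete.cases() on the model variables, which gives a quick sanity check.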
For people who use pdata.frame() to create an index attribute that describes the individual and time dimensions, the following code works; it is based on another Baltagi example in ?plm:
# == Baltagi (2013), pp. 204-205
data("Produc", package = "plm")
pProduc <- pdata.frame(Produc, index = c("state", "year", "region"))
form <- log(gsp) ~ log(pc) + log(emp) + log(hwy) + log(water) + log(util) + unemp
Baltagi_reg_204_5 <- plm(form, data = pProduc, model = "random", effect = "nested")
pProduc %>% mutate(reg.re = rownames(pProduc) %in% names(Baltagi_reg_204_5$residuals)) %>%
as_tibble() %>% select(state, year, region, reg.re)
#> # A tibble: 816 x 4
#> state year region reg.re
#> <fct> <fct> <fct> <lgl>
#> 1 CONNECTICUT 1970 1 T
#> 2 CONNECTICUT 1971 1 T
#> 3 CONNECTICUT 1972 1 T
#> 4 CONNECTICUT 1973 1 T
#> 5 CONNECTICUT 1974 1 T
#> 6 CONNECTICUT 1975 1 T
#> 7 CONNECTICUT 1976 1 T
#> 8 CONNECTICUT 1977 1 T
#> 9 CONNECTICUT 1978 1 T
#> 10 CONNECTICUT 1979 1 T
#> # ... with 806 more rows
Finally, if you are running the first Baltagi example without index attributes, i.e. the unmodified example from the help file, the code should be as follows (p is assumed to be the fitted plm object):
Grunfeld %>% rowid_to_column %>%
mutate(plm.reg = rowid %in% names(p$residuals)) %>% as_tibble()
