How to split a string after the nth character in R

I am working with the following data:
District <- c("AR01", "AZ03", "AZ05", "AZ08", "CA01", "CA05", "CA11", "CA16", "CA18", "CA21")
I want to split the string after the second character and put them into two columns.
So that the data looks like this:
state district
AR 01
AZ 03
AZ 05
AZ 08
CA 01
CA 05
CA 11
CA 16
CA 18
CA 21
Is there a simple bit of code to get this done? Thanks so much for your help.

You can use substr if you always want to split by the second character.
District <- c("AR01", "AZ03", "AZ05", "AZ08", "CA01", "CA05", "CA11", "CA16", "CA18", "CA21")
# take characters 1 through 2
state <- substr(District, 1, 2)
# take characters 3 through 4
district <- substr(District, 3, 4)
#put in data frame if needed.
st_dt <- data.frame(state = state, district = district, stringsAsFactors = FALSE)

You could use strcapture from base R:
strcapture("(\\w{2})(\\w{2})", District,
           data.frame(state = character(), District = character()))
   state District
1     AR       01
2     AZ       03
3     AZ       05
4     AZ       08
5     CA       01
6     CA       05
7     CA       11
8     CA       16
9     CA       18
10    CA       21
where \\w{2} matches two word characters (letters, digits, or underscore), not two words.
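As a usage note, the proto data frame passed to strcapture also controls the column types, so the second capture can be converted on the fly. A small sketch (note that an integer proto drops the leading zeros):

```r
District <- c("AR01", "AZ03", "AZ05")
res <- strcapture("(\\w{2})(\\w{2})", District,
                  proto = data.frame(state = character(),
                                     district = integer()))
str(res)
# state stays character; district is converted to integer,
# so "01" becomes 1 and the zero-padding is lost
```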

The OP has written:
"I'm more familiar with strsplit(). But since there is nothing to split on, it's not applicable in this case."
Au contraire! There is something to split on, and it's called a lookbehind:
strsplit(District, "(?<=[A-Z]{2})", perl = TRUE)
The lookbehind works like "inserting an invisible break" after 2 capital letters and splits the strings there.
The result is a list of vectors
[[1]]
[1] "AR" "01"
[[2]]
[1] "AZ" "03"
[[3]]
[1] "AZ" "05"
[[4]]
[1] "AZ" "08"
[[5]]
[1] "CA" "01"
[[6]]
[1] "CA" "05"
[[7]]
[1] "CA" "11"
[[8]]
[1] "CA" "16"
[[9]]
[1] "CA" "18"
[[10]]
[1] "CA" "21"
which can be turned into a matrix, e.g., by
do.call(rbind, strsplit(District, "(?<=[A-Z]{2})", perl = TRUE))
      [,1] [,2]
 [1,] "AR" "01"
 [2,] "AZ" "03"
 [3,] "AZ" "05"
 [4,] "AZ" "08"
 [5,] "CA" "01"
 [6,] "CA" "05"
 [7,] "CA" "11"
 [8,] "CA" "16"
 [9,] "CA" "18"
[10,] "CA" "21"

We can use str_match to capture first two characters and the remaining string in separate columns.
stringr::str_match(District, "(..)(.*)")[, -1]
# [,1] [,2]
# [1,] "AR" "01"
# [2,] "AZ" "03"
# [3,] "AZ" "05"
# [4,] "AZ" "08"
# [5,] "CA" "01"
# [6,] "CA" "05"
# [7,] "CA" "11"
# [8,] "CA" "16"
# [9,] "CA" "18"
#[10,] "CA" "21"

With the tidyverse this is very easy using the function separate from tidyr:
library(tidyverse)
tibble(value = District) %>%
  separate(value, c("state", "district"), sep = "(?<=[A-Z]{2})")
# A tibble: 10 × 2
   state district
   <chr> <chr>
 1 AR    01
 2 AZ    03
 3 AZ    05
 4 AZ    08
 5 CA    01
 6 CA    05
 7 CA    11
 8 CA    16
 9 CA    18
10 CA    21
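Note that separate() also accepts an integer position for sep, which sidesteps the lookbehind regex entirely; a minimal sketch, assuming tidyr is installed:

```r
library(tidyr)
District <- c("AR01", "AZ03", "AZ05")
# sep = 2 splits each string after the second character
out <- separate(data.frame(value = District), value,
                into = c("state", "district"), sep = 2)
out
```

Positive positions count from the left and negative ones from the right, so sep = -2 would give the same split for these four-character codes.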

Treat it as a fixed-width file and import it:
# read fixed width file
read.fwf(textConnection(District), widths = c(2, 2), colClasses = "character")
# V1 V2
# 1 AR 01
# 2 AZ 03
# 3 AZ 05
# 4 AZ 08
# 5 CA 01
# 6 CA 05
# 7 CA 11
# 8 CA 16
# 9 CA 18
# 10 CA 21

Related

How to obtain values from a matrix using stored numbers as indexes in R

I am really new to R and I can't find a way to subset matrix rows given a list of indexes.
I have a dataframe called 'demo' with 855 rows and 3 columns that looks like this:
## Subject AGE DX
## 1 011_S_0002_bl 74.3 0
## 2 011_S_0003_bl 81.3 1
## 3 011_S_0005_bl 73.7 0
## 4 022_S_0007_bl 75.4 1
## 5 011_S_0008_bl 84.5 0
## 6 011_S_0010_bl 73.9 1
From this, I want to extract the indexes for all the rows that match DX == 1. So I do:
rownames(demo[demo$DX == 1,])
Which returns:
## [1] "2" "4" "6" "14" "20" "31" "33" "34" "36" "39" "40" "41"
## [13] "46" "47" "53" "54" "55" "58" "64" "67" "69" "70" "72" "81"
## [25] "84" "87" "88" "92" "96" "98" "100" "101" "106" "108" "109" "112"
....
Now I have a matrix called T_hat with 855 rows and 1 column that looks like this:
## [,1]
## [1,] 5.812925
## [2,] 10.477721
## [3,] 1.519726
## [4,] -0.221328
## [5,] 1.784920
What I want is to use these row numbers to subset T_hat, keeping only the rows at the corresponding positions, to get something like this:
## [,1]
## [2,] 10.477721
## [4,] -0.221328
...and so on.
I've tried all these options:
T_hat_a <- T_hat[rownames(demo[demo$DX == 1,]),1]
T_hat_b <- T_hat[is.numeric(rownames(demo[demo$DX == 1,])),1]
T_hat_c <- T_hat[rownames(T_hat) %in% rownames(demo[demo$DX == 1,]),1]
T_hat_d <- T_hat[rownames(T_hat) %in% is.numeric(rownames(demo[demo$DX == 1,])),1]
But none returns what I expect.
T_hat_a = Error: no 'dimnames' attribute for array
T_hat_b = numeric(0)
T_hat_c = numeric(0)
T_hat_d = numeric(0)
I've also tried converting my matrix to a df; then only the T_hat_a option returns a result, but it is not at all what I want, since it returns different values...
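Not part of the original thread, but the likely culprit is worth sketching: rownames() returns a character vector, and T_hat has no dimnames, hence the 'no dimnames' error. Converting the names to numeric positions, or indexing with the logical condition directly, both work (made-up miniature data for illustration):

```r
# Made-up miniature versions of the real 855-row objects
demo <- data.frame(Subject = paste0("S", 1:6),
                   AGE = c(74.3, 81.3, 73.7, 75.4, 84.5, 73.9),
                   DX  = c(0, 1, 0, 1, 0, 1))
T_hat <- matrix(c(5.81, 10.48, 1.52, -0.22, 1.78, 3.00), ncol = 1)

# Option 1: coerce the character row names to numeric positions
idx <- as.numeric(rownames(demo[demo$DX == 1, ]))
T_hat[idx, , drop = FALSE]

# Option 2 (simpler): index the matrix with the logical condition
T_hat[demo$DX == 1, , drop = FALSE]
```

Option 1 only works because default row names are the row positions; if demo ever gets non-default row names, prefer which(demo$DX == 1) or the logical form.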

Dropping the last two numbers from every entry in a column of data.table

Preface: I am a beginner to R that is eager to learn. Please don't mistake the simplicity of the question (if it is a simple answer) for lack of research or effort!
Here is a look at the data I am working with:
year state age POP
1: 90 1001 0 239
2: 90 1001 0 203
3: 90 1001 1 821
4: 90 1001 1 769
5: 90 1001 2 1089
The state column contains the FIPS codes for all states. For the purpose of merging, I need the state column to match the one in my other dataset. To achieve this, all I have to do is drop the last two digits of each FIPS code so that the table looks like this:
year state age POP
1: 90 10 0 239
2: 90 10 0 203
3: 90 10 1 821
4: 90 10 1 769
5: 90 10 2 1089
I can't figure out how to accomplish this on a numeric column; substr() makes it easy on a character column.
In case your numbers are not always 4 digits long, you can omit the last two characters by making use of the vectorized behavior of substr():
x <- rownames(mtcars)[1:5]
x
#> [1] "Mazda RX4" "Mazda RX4 Wag" "Datsun 710"
#> [4] "Hornet 4 Drive" "Hornet Sportabout"
substr(x, 1, nchar(x)-2)
#> [1] "Mazda R" "Mazda RX4 W" "Datsun 7" "Hornet 4 Dri"
#> [5] "Hornet Sportabo"
# dummy code for inside a data.table
dt[, x_new := substr(x, 1, nchar(x)-2)]
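Since the OP's state column is numeric and exactly two trailing digits are dropped, integer division is a regex-free alternative that also keeps the column numeric (a sketch on a plain vector; inside a data.table it would be dt[, state := state %/% 100]):

```r
state <- c(1001, 1001, 4013, 6037)  # hypothetical FIPS-style codes
state %/% 100   # integer division drops the last two digits
# [1] 10 10 40 60
```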
Just to generalize this to the case when you have a very large numeric column and need to substr() it correctly. (Which is probably a good argument for storing/importing it as a character column in the first place, but it's an imperfect world...)
x <- c(10000000000, 1000000000, 100000000, 10000000, 1000000,100000,10000,1000,100)
substr(x, 1, nchar(x)-2 )
#[1] "1e+" "1e+" "1e+" "1e+" "1e+" "1e+" "100" "10" "1"
as.character(x)
#[1] "1e+10" "1e+09" "1e+08" "1e+07" "1e+06" "1e+05" "10000" "1000"
#[9] "100"
xsf <- sprintf("%.0f", x)
substr(xsf, 1, nchar(xsf)-2)
#[1] "100000000" "10000000" "1000000" "100000" "10000"
#[6] "1000" "100" "10" "1"
cbind(x, xsf, xsfsub=substr(xsf, 1, nchar(xsf)-2) )
# x xsf xsfsub
# [1,] "1e+10" "10000000000" "100000000"
# [2,] "1e+09" "1000000000" "10000000"
# [3,] "1e+08" "100000000" "1000000"
# [4,] "1e+07" "10000000" "100000"
# [5,] "1e+06" "1000000" "10000"
# [6,] "1e+05" "100000" "1000"
# [7,] "10000" "10000" "100"
# [8,] "1000" "1000" "10"
# [9,] "100" "100" "1"

Creating vectors from regular expressions in a column name

I have a dataframe in which the columns represent species. The species affiliation is encoded in the column name's suffix:
Ac_1234_AnyString
The string after the second underscore (_) represents the species affiliation.
I want to plot some networks based on rank correlations, and I want to color the species according to their affiliation when I later create Fruchterman-Reingold graphs with library(qgraph).
I've previously done this by sorting the df by the name suffix and then creating the vectors by manually counting:
list.names <- c("SG01", "SG02")
list <- vector("list", length(list.names))
names(list) <- list.names
list$SG01 <- c(1:12)
list$SG02 <- c(13:25)
str(list)
List of 2
$ SG01 : int [1:12] 1 2 3 4 5 6 7 8 9 10 ...
$ SG02 : int [1:13] 13 14 15 16 17 18 19 20 21 22 ...
This was very tedious for the big datasets I am working with.
The question is: how can I avoid the manual sorting and counting and extract the vectors (or a list) according to the suffix and the position in the dataframe? I know I can create a vector with the suffix information by
indx <- gsub(".*_", "", names(my_data))
str(indx)
chr [1:29]
"4" "6" "6" "6" "6" "6" "11" "6" "6" "6" "6" "6" "3" "18" "6" "6" "6" "5" "5"
"6" "3" "6" "3" "6" "NA" "6" "5" "4" "11"
Now I would need to create vectors with the positions of all the "4"s, "6"s and so on:
List of 7
$ 4: int[1:2] 1 28
$ 6: int[1:17] 2 3 4 5 6 8 9 10 11 12 15 16 17 20 22 24 26
$ 11: int[1:2] 7 29
....
Thank you.
You can try:
sapply(unique(indx), function(x, vec) which(vec==x), vec=indx)
# $`4`
# [1] 1 28
# $`6`
# [1] 2 3 4 5 6 8 9 10 11 12 15 16 17 20 22 24 26
# $`11`
# [1] 7 29
# $`3`
# [1] 13 21 23
# $`18`
# [1] 14
# $`5`
# [1] 18 19 27
# $`NA`
# [1] 25
Another option is
setNames(split(seq_along(indx),match(indx, unique(indx))), unique(indx))
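For completeness, a plain split() does the same job when you do not care about the order of the list elements; split() coerces the grouping vector to a factor, which is why the answer above re-matches against unique(indx) to keep the order of first appearance. A minimal sketch with a shortened indx:

```r
indx <- c("4", "6", "6", "11", "6", "NA", "4")
res <- split(seq_along(indx), indx)
res
# split() sorts the groups by level name ("11", "4", "6", "NA")
# rather than by first appearance; e.g. res[["4"]] is c(1, 7)
```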

Access the levels of a factor in R

I have a 5-level factor that looks like the following:
tmp
[1] NA
[2] 1,2,3,6,11,12,13,18,20,21,22,26,29,33,40,43,46
[3] NA
[4] NA
[5] 5,9,16,24,35,36,42
[6] 4,7,10,14,15,17,19,23,25,27,28,30,31,32,34,37,38,41,44,45,47,48,49,50
[7] 8,39
5 Levels: 1,2,3,6,11,12,13,18,20,21,22,26,29,33,40,43,46 ...
I want to access the items within each level except NA. So I use the levels() function, which gives me:
> levels(tmp)
[1] "1,2,3,6,11,12,13,18,20,21,22,26,29,33,40,43,46"
[2] "4,7,10,14,15,17,19,23,25,27,28,30,31,32,34,37,38,41,44,45,47,48,49,50"
[3] "5,9,16,24,35,36,42"
[4] "8,39"
[5] "NA"
Then I would like to access the elements in each level, and store them as numbers. However, for example,
>as.numeric(cat(levels(tmp)[3]))
5,9,16,24,35,36,42numeric(0)
Can you help me remove the commas between the numbers and the numeric(0) at the very end? I would like a vector of numerics 5, 9, 16, 24, 35, 36, 42 that I can use as indices into a data frame. Thanks!
You need to use a combination of unlist, strsplit and unique. (Note that cat() only prints and returns NULL, which is why as.numeric(cat(...)) yields numeric(0).)
First, recreate your data:
dat <- read.table(text="
NA
1,2,3,6,11,12,13,18,20,21,22,26,29,33,40,43,46
NA
NA
5,9,16,24,35,36,42
4,7,10,14,15,17,19,23,25,27,28,30,31,32,34,37,38,41,44,45,47,48,49,50
8,39")$V1
Next, find all the unique levels, after using strsplit:
sort(unique(unlist(
  sapply(levels(dat), function(x) unlist(strsplit(x, split = ",")))
)))
[1] "1" "10" "11" "12" "13" "14" "15" "16" "17" "18" "19" "2" "20" "21" "22" "23" "24" "25" "26"
[20] "27" "28" "29" "3" "30" "31" "32" "33" "34" "35" "36" "37" "38" "39" "4" "40" "41" "42" "43"
[39] "44" "45" "46" "47" "48" "49" "5" "50" "6" "7" "8" "9"
Does this do what you want?
levels_split <- strsplit(levels(tmp), ",")
lapply(levels_split, as.numeric)
Using Andrie's dat:
val <- scan(text = levels(dat), sep = ",")
#Read 50 items
Here cumsum(c(TRUE, diff(val) < 0)) starts a new group wherever the scanned sequence decreases, i.e. at each level boundary:
split(val, cumsum(c(TRUE, diff(val) < 0)))
#$`1`
#[1] 1 2 3 6 11 12 13 18 20 21 22 26 29 33 40 43 46
#$`2`
#[1] 4 7 10 14 15 17 19 23 25 27 28 30 31 32 34 37 38 41 44 45 47 48 49 50
#$`3`
#[1] 5 9 16 24 35 36 42
#$`4`
#[1] 8 39

R regex / gsub : extract part of pattern

I have a list of weather stations and their locations by latitude and longitude. There was a formatting issue: some of them have hours and minutes while others have hours, minutes and seconds. I can find the pattern with a regex, but I'm having trouble extracting the individual pieces.
Here's the data:
> head(wthrStat1 )
Station lat lon
1940 K01R 31-08N 092-34W
1941 K01T 28-08N 094-24W
1942 K03Y 48-47N 096-57W
1943 K04V 38-05-50N 106-10-07W
1944 K05F 31-25-16N 097-47-49W
1945 K06D 48-53-04N 099-37-15W
I'd like something like this:
Station latHr latMin latSec latDir lonHr lonMin lonSec lonDir
1940 K01R 31 08 00 N 092 34 00 W
1941 K01T 28 08 00 N 094 24 00 W
1942 K03Y 48 47 00 N 096 57 00 W
1943 K04V 38 05 50 N 106 10 07 W
1944 K05F 31 25 16 N 097 47 49 W
1945 K06D 48 53 04 N 099 37 15 W
I can get matches to this regex:
data.format <- "\\d{1,3}-\\d{1,3}(?:-\\d{1,3})?[NSWE]{1}"
grep(data.format, wthrStat1$lat)
But am unsure how to get the individual parts into columns. I've tried a few things like:
wthrStat1$latHr <- ifelse(grepl(data.format, wthrStat1$lat), gsub(????), NA)
but with no luck.
Here's a dput():
> dput(wthrStat1[1:10,] )
structure(list(Station = c("K01R", "K01T", "K03Y", "K04V", "K05F",
"K06D", "K07G", "K07S", "K08D", "K0B9"), lat = c("31-08N", "28-08N",
"48-47N", "38-05-50N", "31-25-16N", "48-53-04N", "42-34-28N",
"47-58-27N", "48-18-03N", "43-20N"), lon = c("092-34W", "094-24W",
"096-57W", "106-10-07W", "097-47-49W", "099-37-15W", "084-48-41W",
"117-25-42W", "102-24-23W", "070-24W")), .Names = c("Station",
"lat", "lon"), row.names = 1940:1949, class = "data.frame")
Any suggestions?
strapplyc in the gsubfn package will extract each group in the regular expression surrounded with parentheses:
library(gsubfn)
data.format <- "(\\d{1,3})-(\\d{1,3})-?(\\d{1,3})?([NSWE]{1})"
parts <- strapplyc(wthrStat1$lat, data.format, simplify = rbind)
parts[parts == ""] <- "00"
which gives:
> parts
[,1] [,2] [,3] [,4]
[1,] "31" "08" "00" "N"
[2,] "28" "08" "00" "N"
[3,] "48" "47" "00" "N"
[4,] "38" "05" "50" "N"
[5,] "31" "25" "16" "N"
[6,] "48" "53" "04" "N"
[7,] "42" "34" "28" "N"
[8,] "47" "58" "27" "N"
[9,] "48" "18" "03" "N"
[10,] "43" "20" "00" "N"
This is admittedly inefficient; I hope someone else has a better solution:
dat <- read.table(text =' Station lat lon
1940 K01R 31-08N 092-34W
1941 K01T 28-08N 094-24W
1942 K03Y 48-47N 096-57W
1943 K04V 38-05-50N 106-10-07W
1944 K05F 31-25-16N 097-47-49W
1945 K06D 48-53-04N 099-37-15W', head=T)
pattern <- '([0-9]+)[-]([0-9]+)([-|A-Z]+)([0-9]*)([A-Z]*)'
dat$latHr <- gsub(pattern,'\\1',dat$lat)
dat$latMin <- gsub(pattern,'\\2',dat$lat)
latSec <- gsub(pattern,'\\4',dat$lat)
latSec[nchar(latSec)==0] <- '00'
dat$latSec <- latSec
latDir <- gsub(pattern,'\\5',dat$lat)
latDir[nchar(latDir)==0] <- latDir[nchar(latDir)!=0][1]
dat$latDir <- latDir
dat
Station lat lon latHr latMin latSec latDir
1940 K01R 31-08N 092-34W 31 08 00 N
1941 K01T 28-08N 094-24W 28 08 00 N
1942 K03Y 48-47N 096-57W 48 47 00 N
1943 K04V 38-05-50N 106-10-07W 38 05 50 N
1944 K05F 31-25-16N 097-47-49W 31 25 16 N
1945 K06D 48-53-04N 099-37-15W 48 53 04 N
Another answer, using stringr:
# example data
data <-
"Station lat lon
1940 K01R 31-08N 092-34W
1941 K01T 28-08N 094-24W
1942 K03Y 48-47N 096-57W
1943 K04V 38-05-50N 106-10-07W
1944 K05F 31-25-16N 097-47-49W
1945 K06D 48-53-04N 099-37-15W"
## read string into a data.frame
df <- read.table(text=data, head=T, stringsAsFactors=F)
pattern <- "(\\d{1,3})-(\\d{1,3})(?:-(\\d{1,3}))?([NSWE]{1})"
library(stringr)
str_match(df$lat, pattern)
This produces a character matrix with one column for the whole matching string and an additional column for each capture group.
     [,1]        [,2] [,3] [,4] [,5]
[1,] "31-08N"    "31" "08" ""   "N"
[2,] "28-08N"    "28" "08" ""   "N"
[3,] "48-47N"    "48" "47" ""   "N"
[4,] "38-05-50N" "38" "05" "50" "N"
[5,] "31-25-16N" "31" "25" "16" "N"
[6,] "48-53-04N" "48" "53" "04" "N"
R's string processing ability has progressed a lot in the past few years.
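Indeed, and for completeness the same capture-group extraction can be done in base R with regexec()/regmatches(), no packages needed; a sketch using a few lat values:

```r
lat <- c("31-08N", "48-47N", "38-05-50N")
pattern <- "(\\d{1,3})-(\\d{1,3})-?(\\d{1,3})?([NSWE])"
m <- regmatches(lat, regexec(pattern, lat))
parts <- do.call(rbind, m)[, -1]   # drop the full-match column
parts[parts == ""] <- "00"         # default the missing seconds
parts
#      [,1] [,2] [,3] [,4]
# [1,] "31" "08" "00" "N"
# [2,] "48" "47" "00" "N"
# [3,] "38" "05" "50" "N"
```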
