Creating vectors from regular expressions in a column name - r

I have a dataframe in which the columns represent species. The species affiliation is encoded in the column name's suffix:
Ac_1234_AnyString
The string after the second underscore (_) represents the species affiliation.
I want to plot some networks based on rank correlations, and I want to color the species according to their species affiliation later, when I create Fruchterman-Reingold graphs with library(qgraph).
I've done it previously by sorting the df by the name suffix and then creating vectors by manually counting them:
list.names <- c("SG01", "SG02")
list <- vector("list", length(list.names))
names(list) <- list.names
list$SG01 <- c(1:12)
list$SG02 <- c(13:25)
str(list)
List of 2
$ SG01 : int [1:12] 1 2 3 4 5 6 7 8 9 10 ...
$ SG02 : int [1:13] 13 14 15 16 17 18 19 20 21 22 ...
This was very tedious for the big datasets I am working with.
The question is: how can I avoid the manual sorting and counting, and extract vectors (or a list) according to the suffix and the position in the dataframe? I know I can create a vector with the suffix information by
indx <- gsub(".*_", "", names(my_data))
str(indx)
chr [1:29]
"4" "6" "6" "6" "6" "6" "11" "6" "6" "6" "6" "6" "3" "18" "6" "6" "6" "5" "5"
"6" "3" "6" "3" "6" "NA" "6" "5" "4" "11"
Now I would need to create vectors with the positions of all "4"s, "6"s and so on:
List of 7
$ 4: int[1:2] 1 28
$ 6: int[1:17] 2 3 4 5 6 8 9 10 11 12 15 16 17 20 22 24 26
$ 11: int[1:2] 7 29
....
Thank you.

you can try:
sapply(unique(indx), function(x, vec) which(vec==x), vec=indx)
# $`4`
# [1] 1 28
# $`6`
# [1] 2 3 4 5 6 8 9 10 11 12 15 16 17 20 22 24 26
# $`11`
# [1] 7 29
# $`3`
# [1] 13 21 23
# $`18`
# [1] 14
# $`5`
# [1] 18 19 27
# $`NA`
# [1] 25

Another option is
setNames(split(seq_along(indx),match(indx, unique(indx))), unique(indx))
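Both answers can be condensed further by splitting the positions on a factor whose levels follow the order of first appearance. A minimal sketch, assuming made-up column names in the Ac_1234_Suffix pattern (col_names and groups are illustrative names, not from the question):

```r
# Hypothetical column names; the part after the last underscore is the group
col_names <- c("Ac_1_4", "Ac_2_6", "Ac_3_6", "Ac_4_11", "Ac_5_4")

# Strip everything up to and including the last underscore
indx <- sub(".*_", "", col_names)

# Split the column positions by suffix, keeping groups in order of first appearance
groups <- split(seq_along(indx), factor(indx, levels = unique(indx)))
str(groups)
```

Unlike sapply, split always returns a list, even when all groups happen to have the same size.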

Related

How to split a string after the nth character in r

I am working with the following data:
District <- c("AR01", "AZ03", "AZ05", "AZ08", "CA01", "CA05", "CA11", "CA16", "CA18", "CA21")
I want to split the string after the second character and put them into two columns.
So that the data looks like this:
state district
AR 01
AZ 03
AZ 05
AZ 08
CA 01
CA 05
CA 11
CA 16
CA 18
CA 21
Is there simple code to get this done? Thanks so much for your help.
You can use substr if you always want to split after the second character.
District <- c("AR01", "AZ03", "AZ05", "AZ08", "CA01", "CA05", "CA11", "CA16", "CA18", "CA21")
# extract characters 1 to 2 (the state)
state <- substr(District,1,2)
# extract characters 3 to 4 (the district number)
district <- substr(District,3,4)
# put in a data frame if needed
st_dt <- data.frame(state = state, district = district, stringsAsFactors = FALSE)
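The same idea generalizes to any split position. A small sketch, where split_at is a hypothetical helper (not a base R function) that cuts each string after its n-th character:

```r
# Hypothetical helper: split each string after its n-th character
split_at <- function(x, n) {
  data.frame(
    left  = substr(x, 1, n),
    right = substring(x, n + 1),   # substring() defaults to the end of the string
    stringsAsFactors = FALSE
  )
}

District <- c("AR01", "AZ03", "CA21")
res <- split_at(District, 2)
names(res) <- c("state", "district")
res
```

Because the district part stays a character column, leading zeros such as "01" are preserved.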
you could use strcapture from base R:
strcapture("(\\w{2})(\\w{2})",District,
data.frame(state = character(),District = character()))
state District
1 AR 01
2 AZ 03
3 AZ 05
4 AZ 08
5 CA 01
6 CA 05
7 CA 11
8 CA 16
9 CA 18
10 CA 21
where \\w{2} matches exactly two word characters (letters, digits or underscore)
The OP has written
I'm more familiar with strsplit(). But since there is nothing to split
on, its not applicable in this case
Au contraire! There is something to split on and it's called lookbehind:
strsplit(District, "(?<=[A-Z]{2})", perl = TRUE)
The lookbehind works like "inserting an invisible break" after 2 capital letters and splits the strings there.
The result is a list of vectors
[[1]]
[1] "AR" "01"
[[2]]
[1] "AZ" "03"
[[3]]
[1] "AZ" "05"
[[4]]
[1] "AZ" "08"
[[5]]
[1] "CA" "01"
[[6]]
[1] "CA" "05"
[[7]]
[1] "CA" "11"
[[8]]
[1] "CA" "16"
[[9]]
[1] "CA" "18"
[[10]]
[1] "CA" "21"
which can be turned into a matrix, e.g., by
do.call(rbind, strsplit(District, "(?<=[A-Z]{2})", perl = TRUE))
[,1] [,2]
[1,] "AR" "01"
[2,] "AZ" "03"
[3,] "AZ" "05"
[4,] "AZ" "08"
[5,] "CA" "01"
[6,] "CA" "05"
[7,] "CA" "11"
[8,] "CA" "16"
[9,] "CA" "18"
[10,] "CA" "21"
We can use str_match to capture first two characters and the remaining string in separate columns.
stringr::str_match(District, "(..)(.*)")[, -1]
# [,1] [,2]
# [1,] "AR" "01"
# [2,] "AZ" "03"
# [3,] "AZ" "05"
# [4,] "AZ" "08"
# [5,] "CA" "01"
# [6,] "CA" "05"
# [7,] "CA" "11"
# [8,] "CA" "16"
# [9,] "CA" "18"
#[10,] "CA" "21"
With the tidyverse this is very easy using the function separate from tidyr:
library(tidyverse)
District %>%
as_tibble() %>%
separate(value, c("state", "district"), sep = "(?<=[A-Z]{2})")
# A tibble: 10 × 2
state district
<chr> <chr>
1 AR 01
2 AZ 03
3 AZ 05
4 AZ 08
5 CA 01
6 CA 05
7 CA 11
8 CA 16
9 CA 18
10 CA 21
Treat it as a fixed-width file and import:
# read fixed width file
read.fwf(textConnection(District), widths = c(2, 2), colClasses = "character")
# V1 V2
# 1 AR 01
# 2 AZ 03
# 3 AZ 05
# 4 AZ 08
# 5 CA 01
# 6 CA 05
# 7 CA 11
# 8 CA 16
# 9 CA 18
# 10 CA 21

Change global environment to read in numerical order

I have a dataset in which I have 22 animals. Each animal has been named as follows: c("Shark1", "Shark2", "Shark3", ...).
I am trying to plot two categorical variables against each other to determine the proportion of time each shark spent at separate depths:
Sharks<-table(merge$DepthCat, merge$ID2) #Depth category vs. ID
merge$DepthCat[merge$Depth2>200]<-"4"
Sharks<-table(merge$DepthCat, merge$ID2)
plot(t(Sharks), main="",
col=c("whitesmoke", "slategray3", "slategray", "slategray4"),
ylab="Depth category", xlab="Month")
axis(side=4)
While the plot works, the sharks are plotted in alphabetical rather than numerical order.
Does anyone know how to resolve this for the plot? I have researched the array method but am unsure how it would be implemented here.
You didn't provide your complete data set, so I generated my own random data. Given that the bar headers derived from ID2 are sorting lexicographically, I assumed they are stored as characters in your data.frame merge, so I generated them thusly.
set.seed(2L);
NR <- 300L;
merge <- data.frame(ID2=sample(as.character(1:22),NR,T),Depth2=pmax(0,rnorm(NR,100,50)),stringsAsFactors=F);
merge$DepthCat <- as.character(findInterval(merge$Depth2,c(0,66,133,200)));
str(merge);
## 'data.frame': 300 obs. of 3 variables:
## $ ID2 : chr "5" "16" "13" "4" ...
## $ Depth2 : num 148.8 91.5 136.1 57.8 163.9 ...
## $ DepthCat: chr "3" "2" "3" "1" ...
And sure enough, we can reproduce the problem with this test data:
Sharks <- table(merge$DepthCat,merge$ID2);
plot(t(Sharks),main='',col=c('whitesmoke','slategray3','slategray','slategray4'),ylab='Depth category',xlab='Month');
axis(side=4L);
The solution is to coerce the ID2 vector to numeric so it sorts numerically.
merge$ID2 <- as.integer(merge$ID2);
str(merge);
## 'data.frame': 300 obs. of 3 variables:
## $ ID2 : int 5 16 13 4 21 21 3 19 11 13 ...
## $ Depth2 : num 148.8 91.5 136.1 57.8 163.9 ...
## $ DepthCat: chr "3" "2" "3" "1" ...
Sharks <- table(merge$DepthCat,merge$ID2);
plot(t(Sharks),main='',col=c('whitesmoke','slategray3','slategray','slategray4'),ylab='Depth category',xlab='Month');
axis(side=4L);
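If the IDs are strings such as "Shark1", "Shark2", ... as described in the question, as.integer would return NA for them. One possible alternative, sketched here with made-up IDs, is to keep the labels and turn them into a factor whose levels are ordered by the embedded number; table() then follows the level order:

```r
# Made-up IDs in the "SharkN" style described in the question
ids <- c("Shark10", "Shark2", "Shark1", "Shark2", "Shark10")

# Pull out the numeric part by removing the leading non-digits
num <- as.integer(sub("\\D+", "", ids))

# Order the factor levels by that number instead of alphabetically
ids_f <- factor(ids, levels = unique(ids[order(num)]))

levels(ids_f)
table(ids_f)
```

Applied to the question's data, the same conversion on merge$ID2 before calling table() should put the columns in numerical order, assuming every ID contains exactly one run of digits.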

converting string to numeric in R

I have a problem regarding data conversion using R language.
I have two data objects stored in variables named lung.X and lung.y; their structure is shown below.
> str(lung.X)
chr [1:86, 1:7129] " 170.0" " 104.0" " 53.7" " 119.0" " 105.5" " 130.0" ...
- attr(*, "dimnames")=List of 2
..$ : chr [1:86] "V3" "V4" "V5" "V6" ...
..$ : chr [1:7129] "A28102_at" "AB000114_at" "AB000115_at" "AB000220_at" ...
and
> str(lung.y)
num [1:86] -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 ...
lung.X is a matrix (86 rows, 7129 columns) and lung.y is a numeric vector (86 entries).
Does anyone know how to convert the above data into the format below?
> str(lung.X)
num [1:86, 1:7129] 170 104 53.7 119 105.5 130...
I thought I should do like this
lung.X <- as.numeric(lung.X)
but I got this instead
> str(lung.X)
num [1:613094] 170 104 53.7 119 105.5 130...
The reason for doing this is that I need lung.X to be numeric only.
Thank you.
You could change the mode of your matrix to numeric:
## example data
m <- matrix(as.character(1:10), nrow=2,
dimnames = list(c("R1", "R2"), LETTERS[1:5]))
m
# A B C D E
# R1 "1" "3" "5" "7" "9"
# R2 "2" "4" "6" "8" "10"
str(m)
# chr [1:2, 1:5] "1" "2" "3" "4" ...
# - attr(*, "dimnames")=List of 2
# ..$ : chr [1:2] "R1" "R2"
# ..$ : chr [1:5] "A" "B" "C" "D" ...
# NULL
mode(m) <- "numeric"
str(m)
# num [1:2, 1:5] 1 2 3 4 5 6 7 8 9 10
# - attr(*, "dimnames")=List of 2
# ..$ : chr [1:2] "R1" "R2"
# ..$ : chr [1:5] "A" "B" "C" "D" ...
# NULL
m
# A B C D E
# R1 1 3 5 7 9
# R2 2 4 6 8 10
Give this a try: m <- matrix(as.numeric(lung.X), nrow = 86, ncol = 7129)
If you need it in dataframe/list format, df <- data.frame(m)
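Note that rebuilding the matrix with matrix(as.numeric(...)) drops the row and column names. A sketch of an alternative that converts in place and keeps dim and dimnames, using a tiny made-up stand-in for lung.X:

```r
# Small stand-in for lung.X: a character matrix with space-padded numbers
x <- matrix(c(" 170.0", " 104.0", " 53.7", " 119.0"), nrow = 2,
            dimnames = list(c("V3", "V4"), c("A28102_at", "AB000114_at")))

# Change the storage type only; dim and dimnames are kept
storage.mode(x) <- "double"
str(x)
```

The leading spaces in the strings are not a problem, since the conversion to numeric ignores surrounding whitespace.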

Access the levels of a factor in R

I have a 5-level factor that looks like the following:
tmp
[1] NA
[2] 1,2,3,6,11,12,13,18,20,21,22,26,29,33,40,43,46
[3] NA
[4] NA
[5] 5,9,16,24,35,36,42
[6] 4,7,10,14,15,17,19,23,25,27,28,30,31,32,34,37,38,41,44,45,47,48,49,50
[7] 8,39
5 Levels: 1,2,3,6,11,12,13,18,20,21,22,26,29,33,40,43,46 ...
I want to access the items within each level except NA. So I use the levels() function, which gives me:
> levels(tmp)
[1] "1,2,3,6,11,12,13,18,20,21,22,26,29,33,40,43,46"
[2] "4,7,10,14,15,17,19,23,25,27,28,30,31,32,34,37,38,41,44,45,47,48,49,50"
[3] "5,9,16,24,35,36,42"
[4] "8,39"
[5] "NA"
Then I would like to access the elements in each level, and store them as numbers. However, for example,
>as.numeric(cat(levels(tmp)[3]))
5,9,16,24,35,36,42numeric(0)
Can you help me remove the commas within the numbers and the numeric(0) at the very end? I would like to have a vector of numerics 5, 9, 16, 24, 35, 36, 42 so that I can use them as indices to access a data frame. Thanks!
You need to use a combination of unlist, strsplit and unique.
First, recreate your data:
dat <- read.table(text="
NA
1,2,3,6,11,12,13,18,20,21,22,26,29,33,40,43,46
NA
NA
5,9,16,24,35,36,42
4,7,10,14,15,17,19,23,25,27,28,30,31,32,34,37,38,41,44,45,47,48,49,50
8,39", stringsAsFactors = TRUE)$V1
Next, find all the unique levels, after using strsplit:
sort(unique(unlist(
sapply(levels(dat), function(x)unlist(strsplit(x, split=",")))
)))
[1] "1" "10" "11" "12" "13" "14" "15" "16" "17" "18" "19" "2" "20" "21" "22" "23" "24" "25" "26"
[20] "27" "28" "29" "3" "30" "31" "32" "33" "34" "35" "36" "37" "38" "39" "4" "40" "41" "42" "43"
[39] "44" "45" "46" "47" "48" "49" "5" "50" "6" "7" "8" "9"
Does this do what you want?
levels_split <- strsplit(levels(tmp), ",")
lapply(levels_split, as.numeric)
Using Andrie's dat
val <- scan(text=levels(dat),sep=",")
#Read 50 items
split(val,cumsum(c(T,diff(val) <0)))
#$`1`
#[1] 1 2 3 6 11 12 13 18 20 21 22 26 29 33 40 43 46
#$`2`
#[1] 4 7 10 14 15 17 19 23 25 27 28 30 31 32 34 37 38 41 44 45 47 48 49 50
#$`3`
#[1] 5 9 16 24 35 36 42
#$`4`
#[1] 8 39
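To keep track of which numbers belong to which level, the split levels can also be stored as a named list. A minimal sketch with a shortened, made-up factor; the explicit removal of "NA" covers the case where it is a literal level, as in the question:

```r
# Shortened stand-in for tmp; "NA" is a literal level here, as in the question
tmp <- factor(c("NA", "1,2,3", "NA", "5,9,16", "8,39"))

# Drop the "NA" level, then split each remaining level on commas
lv  <- setdiff(levels(tmp), "NA")
idx <- lapply(strsplit(lv, ",", fixed = TRUE), as.numeric)
names(idx) <- lv

idx[["5,9,16"]]   # numeric vector usable as row indices
```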

sort column in data frame according to levels of another column

I'm new to R and stuck with the following data. Either I searched with the wrong terms or this question has not been raised (maybe due to its simplicity?).
I have a data frame containing factor and numerical columns:
> head(PAMdata1)
salt cultivar Ratiod13 StdErr
1 50 1 1.0760163 0.02915785
2 100 1 0.9814083 0.04914316
3 50 2 0.9617199 0.06571578
4 100 2 0.7878740 0.10270647
5 50 4 0.9551830 0.04134652
6 100 4 0.8429793 0.10993336
> str(PAMdata1)
'data.frame': 36 obs. of 4 variables:
$ salt : Factor w/ 2 levels "50","100": 1 2 1 2 1 2 1 2 1 2 ...
$ cultivar: Factor w/ 18 levels "27","26","21",..: 7 7 15 15 13 13 11 11 9 9 ...
$ Ratiod13: num 1.076 0.981 0.962 0.788 0.955 ...
$ StdErr : num 0.0292 0.0491 0.0657 0.1027 0.0413 ...
The column 'cultivar' contains factors, whose levels are ordered using another data frame:
PAMdata1$cultivar <- factor(PAMdata1$cultivar, levels = unique(as.character(my_other_df$cultivar)))
levels(PAMdata1$cultivar)
[1] "27" "26" "21" "52" "14" "25" "1" "23" "7" "8" "5" "28" "4" "22" "2"
[16] "53" "51" "50"
What I would like to have is PAMdata1$Ratiod13 ordered by the levels of cultivar. How do I transform the vector of levels into a vector of the row numbers each level is located in?
I would appreciate any help on this.
Thanks a lot
After talking to an office mate and seeing your comments, I noticed my confusion between sort() and order():
PAMdata1[order(PAMdata1$cultivar),]
or even
PAMdata1[with(PAMdata1,order(cultivar)),]
would do the job.
Thanks a lot for your help.
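This works because order() on a factor sorts by the underlying integer codes, that is, by level order rather than alphabetically. A small sketch with made-up values (df and sorted are illustrative names):

```r
# Made-up data: cultivar levels are given in a custom, non-alphabetical order
df <- data.frame(
  cultivar = factor(c("1", "27", "2", "26"), levels = c("27", "26", "1", "2")),
  value    = c(1.08, 0.98, 0.96, 0.79)
)

# order() follows the level order of the factor
sorted <- df[order(df$cultivar), ]
sorted
```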
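A data.table option (after library(data.table)) would be to convert the data frame by reference with setDT and set the key to cultivar, which also sorts the rows by that column:
setkey(setDT(PAMdata1), cultivar)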
