Recode factors to numbers of my choosing in R

I would like to convert NG to 0, SG to 1.25, LG to 7.25, MG to 26, and HG to 40.
My actual data, which is causing the problems, looks exactly like the t below:
t<-rep(c("NG","SG","LG","MG","HG"),each=5)
colnames(t)<-c("X.1","X1","X2","X4","X8","X12","X24","X48")
Why doesn't this work?
t[t=="NG"] <- "0"
t[t=="SG"] <- "1.25"
t[t=="LG"] <- "7.25"
t[t=="MG"] <- "26"
or this:
factor(t, levels=c("NG","SG","LG","MG", "HG"), labels=c("0","1.25","7.25","26","40"))
or this:
t <- sapply(t,switch,"NG"=0,"SG"=1.25,"LG"=7.25,"MG"=26, "HG"=40)

You may want this:
t <- rep(c(NG = 0, SG = 1.25, LG = 7.25, MG = 26, HG = 40), each = 5)
t <- factor(t)
levels(t)
# [1] "0" "1.25" "7.25" "26" "40"
labels(t)
# [1] "NG" "NG" "NG" "NG" "NG" "SG" "SG" "SG" "SG" "SG" "LG" "LG" "LG" "LG" "LG"
# [16] "MG" "MG" "MG" "MG" "MG" "HG" "HG" "HG" "HG" "HG"
The internal codes for a factor are always integers, so you can't create a factor whose internal codes are double precision floats.
unclass(t)
# NG NG NG NG NG SG SG SG SG SG LG LG LG LG LG MG MG MG MG MG HG HG HG HG HG
# 1 1 1 1 1 2 2 2 2 2 3 3 3 3 3 4 4 4 4 4 5 5 5 5 5
# attr(,"levels")
# [1] "0" "1.25" "7.25" "26" "40"
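A related gotcha: because those internal codes are integers, calling as.numeric() directly on the factor returns the codes, not the values. Recovering the numbers needs a round-trip through the levels (a sketch using the same t as above):

```r
t <- factor(rep(c(NG = 0, SG = 1.25, LG = 7.25, MG = 26, HG = 40), each = 5))
as.numeric(t)[1:6]             # [1] 1 1 1 1 1 2  -- internal codes, not values
as.numeric(levels(t))[t][1:6]  # [1] 0.00 0.00 0.00 0.00 0.00 1.25  -- the values
```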
You can still extract the numerical value using the label for a level:
t["SG"]
# SG
# 1.25
# Levels: 0 1.25 7.25 26 40
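If the end goal is a plain numeric vector rather than a factor, a named lookup table indexed by the original labels is a minimal alternative sketch (assuming t is the character vector from the question):

```r
t <- rep(c("NG", "SG", "LG", "MG", "HG"), each = 5)
# lookup table: names are the old labels, values are the new numbers
map <- c(NG = 0, SG = 1.25, LG = 7.25, MG = 26, HG = 40)
t_num <- unname(map[t])   # numeric: 0 0 0 0 0 1.25 1.25 ... 40
```

If t is a factor in your real data, index with map[as.character(t)] instead, since indexing by a factor would use its integer codes.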


Dropping the last two numbers from every entry in a column of data.table

Preface: I am a beginner to R that is eager to learn. Please don't mistake the simplicity of the question (if it is a simple answer) for lack of research or effort!
Here is a look at the data I am working with:
year state age POP
1: 90 1001 0 239
2: 90 1001 0 203
3: 90 1001 1 821
4: 90 1001 1 769
5: 90 1001 2 1089
The state column contains the FIPS codes for all states. For the purpose of merging, I need the state column to match another dataset. To achieve this, all I have to do is omit the last two digits of each FIPS code, so that the table looks like this:
year state age POP
1: 90 10 0 239
2: 90 10 0 203
3: 90 10 1 821
4: 90 10 1 769
5: 90 10 2 1089
I can't figure out how to accomplish this task on a numeric column. substr() makes this easy on a character column.
In case your number is not always 4 digits long, you can omit the last two characters by making use of the vectorized behavior of substr():
x <- rownames(mtcars)[1:5]
x
#> [1] "Mazda RX4" "Mazda RX4 Wag" "Datsun 710"
#> [4] "Hornet 4 Drive" "Hornet Sportabout"
substr(x, 1, nchar(x)-2)
#> [1] "Mazda R" "Mazda RX4 W" "Datsun 7" "Hornet 4 Dri"
#> [5] "Hornet Sportabo"
# dummy code for inside a data.table
dt[, x_new := substr(x, 1, nchar(x)-2)]
Just to generalize this for the case when you have a very large numeric column and need to substr() it correctly. (Which is probably a good argument for storing/importing it as a character column to start with, but it's an imperfect world...)
x <- c(10000000000, 1000000000, 100000000, 10000000, 1000000,100000,10000,1000,100)
substr(x, 1, nchar(x)-2 )
#[1] "1e+" "1e+" "1e+" "1e+" "1e+" "1e+" "100" "10" "1"
as.character(x)
#[1] "1e+10" "1e+09" "1e+08" "1e+07" "1e+06" "1e+05" "10000" "1000"
#[9] "100"
xsf <- sprintf("%.0f", x)
substr(xsf, 1, nchar(xsf)-2)
#[1] "100000000" "10000000" "1000000" "100000" "10000"
#[6] "1000" "100" "10" "1"
cbind(x, xsf, xsfsub=substr(xsf, 1, nchar(xsf)-2) )
# x xsf xsfsub
# [1,] "1e+10" "10000000000" "100000000"
# [2,] "1e+09" "1000000000" "10000000"
# [3,] "1e+08" "100000000" "1000000"
# [4,] "1e+07" "10000000" "100000"
# [5,] "1e+06" "1000000" "10000"
# [6,] "1e+05" "100000" "1000"
# [7,] "10000" "10000" "100"
# [8,] "1000" "1000" "10"
# [9,] "100" "100" "1"
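Since the FIPS codes in the question are numeric to begin with, an arithmetic alternative is worth noting: dropping the last two digits of a non-negative integer is just integer division by 100, with no string conversion and no scientific-notation pitfalls at all (a sketch; assumes the codes are non-negative whole numbers):

```r
state <- c(1001, 1002, 4013)
state %/% 100   # integer division drops the last two digits
# [1] 10 10 40
```

Inside a data.table, mirroring the dummy substr() code above, this would be dt[, state := state %/% 100L].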

Extract character-level n-grams from text in R

I have a dataframe with text and I want to extract the character-level bigrams (n = 2), e.g. "st", "ac", "ck", for each text in R.
I also want to count the frequency of each character-level bigram in the text.
Data:
df$text
[1] "hy my name is"
[2] "stackover flow is great"
[3] "how are you"
I'm not quite sure of your expected output here. I would have thought that the bigrams for "stack" would be "st", "ta", "ac", and "ck", since this captures each consecutive pair.
For example, if you wanted to know how many instances of the bigram "th" the word "brothers" had in it, and you split it into the bigrams "br", "ot", "he" and "rs", then you would get the answer 0, which is wrong.
You can build up a single function to get all bigrams like this:
# This function takes a vector of single characters and creates all the bigrams
# within that vector. For example "s", "t", "a", "c", "k" becomes
# "st", "ta", "ac", and "ck"
pair_chars <- function(char_vec) {
  all_pairs <- paste0(char_vec[-length(char_vec)], char_vec[-1])
  return(as.vector(all_pairs[nchar(all_pairs) == 2]))
}
# This function splits a single word into a character vector and gets its bigrams
word_bigrams <- function(words) {
  unlist(lapply(strsplit(words, ""), pair_chars))
}
# This function splits a string or vector of strings into words and gets their bigrams
string_bigrams <- function(strings) {
  unlist(lapply(strsplit(strings, " "), word_bigrams))
}
So now we can test this on your example:
df <- data.frame(text = c("hy my name is", "stackover flow is great",
                          "how are you"), stringsAsFactors = FALSE)
string_bigrams(df$text)
#> [1] "hy" "my" "na" "am" "me" "is" "st" "ta" "ac" "ck" "ko" "ov" "ve" "er" "fl"
#> [16] "lo" "ow" "is" "gr" "re" "ea" "at" "ho" "ow" "ar" "re" "yo" "ou"
If you want to count occurrences, you can just use table:
table(string_bigrams(df$text))
#> ac am ar at ck ea er fl gr ho hy is ko lo me my na ou ov ow re st ta ve yo
#> 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1 2 2 1 1 1 1
However, if you are going to be doing a fair bit of text mining, you should look into specific R packages like stringi, stringr, tm, and quanteda that help with these basic tasks.
For example, all of the base R functions I wrote above can be replaced using the quanteda package like this:
library(quanteda)
char_ngrams(unlist(tokens(df$text, "character")), concatenator = "")
#> [1] "hy" "ym" "my" "yn" "na" "am" "me" "ei" "is" "ss" "st" "ta" "ac" "ck"
#> [15] "ko" "ov" "ve" "er" "rf" "fl" "lo" "ow" "wi" "is" "sg" "gr" "re" "ea"
#> [29] "at" "th" "ho" "ow" "wa" "ar" "re" "ey" "yo" "ou"
Created on 2020-06-13 by the reprex package (v0.3.0)
In addition to Allen's answer, you could use the qgrams function from the stringdist package in combination with gsub to remove the spaces:
library(stringdist)
qgrams(gsub(" ", "", df$text), q = 2)
hy ym yn yo my na st ta ve wi wa ov rf sg ow re ou me is ko lo am ei er fl gr ho ey ck ea at ar ac
V1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1
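For completeness, the sliding-window idea in pair_chars can also be written with base substring(), which is vectorized over its start and stop arguments (a sketch for a single word; char_bigrams is just an illustrative name, and words shorter than two characters yield nothing):

```r
char_bigrams <- function(word) {
  n <- nchar(word)
  if (n < 2) return(character(0))
  # each bigram starts at position i and ends at i + 1
  substring(word, 1:(n - 1), 2:n)
}
char_bigrams("stack")
# [1] "st" "ta" "ac" "ck"
```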

How to split a string after the nth character in r

I am working with the following data:
District <- c("AR01", "AZ03", "AZ05", "AZ08", "CA01", "CA05", "CA11", "CA16", "CA18", "CA21")
I want to split the string after the second character and put them into two columns.
So that the data looks like this:
state district
AR 01
AZ 03
AZ 05
AZ 08
CA 01
CA 05
CA 11
CA 16
CA 18
CA 21
Is there a simple code to get this done? Thanks so much for your help!
You can use substr if you always want to split by the second character.
District <- c("AR01", "AZ03", "AZ05", "AZ08", "CA01", "CA05", "CA11", "CA16", "CA18", "CA21")
#split district starting at the first and ending at the second
state <- substr(District,1,2)
#split district starting at the 3rd and ending at the 4th
district <- substr(District,3,4)
#put in data frame if needed.
st_dt <- data.frame(state = state, district = district, stringsAsFactors = FALSE)
you could use strcapture from base R:
strcapture("(\\w{2})(\\w{2})", District,
           data.frame(state = character(), District = character()))
state District
1 AR 01
2 AZ 03
3 AZ 05
4 AZ 08
5 CA 01
6 CA 05
7 CA 11
8 CA 16
9 CA 18
10 CA 21
where \\w{2} means two word characters
The OP has written:
I'm more familiar with strsplit(). But since there is nothing to split
on, it's not applicable in this case.
Au contraire! There is something to split on and it's called lookbehind:
strsplit(District, "(?<=[A-Z]{2})", perl = TRUE)
The lookbehind works like "inserting an invisible break" after 2 capital letters and splits the strings there.
The result is a list of vectors
[[1]]
[1] "AR" "01"
[[2]]
[1] "AZ" "03"
[[3]]
[1] "AZ" "05"
[[4]]
[1] "AZ" "08"
[[5]]
[1] "CA" "01"
[[6]]
[1] "CA" "05"
[[7]]
[1] "CA" "11"
[[8]]
[1] "CA" "16"
[[9]]
[1] "CA" "18"
[[10]]
[1] "CA" "21"
which can be turned into a matrix, e.g., by
do.call(rbind, strsplit(District, "(?<=[A-Z]{2})", perl = TRUE))
[,1] [,2]
[1,] "AR" "01"
[2,] "AZ" "03"
[3,] "AZ" "05"
[4,] "AZ" "08"
[5,] "CA" "01"
[6,] "CA" "05"
[7,] "CA" "11"
[8,] "CA" "16"
[9,] "CA" "18"
[10,] "CA" "21"
We can use str_match to capture first two characters and the remaining string in separate columns.
stringr::str_match(District, "(..)(.*)")[, -1]
# [,1] [,2]
# [1,] "AR" "01"
# [2,] "AZ" "03"
# [3,] "AZ" "05"
# [4,] "AZ" "08"
# [5,] "CA" "01"
# [6,] "CA" "05"
# [7,] "CA" "11"
# [8,] "CA" "16"
# [9,] "CA" "18"
#[10,] "CA" "21"
With the tidyverse this is very easy using the function separate from tidyr:
library(tidyverse)
District %>%
  as_tibble() %>%
  separate(value, c("state", "district"), sep = "(?<=[A-Z]{2})")
# A tibble: 10 × 2
state district
<chr> <chr>
1 AR 01
2 AZ 03
3 AZ 05
4 AZ 08
5 CA 01
6 CA 05
7 CA 11
8 CA 16
9 CA 18
10 CA 21
Treat it as fixed width file, and import:
# read fixed width file
read.fwf(textConnection(District), widths = c(2, 2), colClasses = "character")
# V1 V2
# 1 AR 01
# 2 AZ 03
# 3 AZ 05
# 4 AZ 08
# 5 CA 01
# 6 CA 05
# 7 CA 11
# 8 CA 16
# 9 CA 18
# 10 CA 21
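One more base R variant: instead of a lookbehind, you can insert a real separator after the first two characters with sub() and then hand strsplit() something concrete to split on (a sketch, equivalent to the lookbehind approach for this data):

```r
District <- c("AR01", "AZ03", "CA21")
# "\\1 " keeps the first two captured characters and appends a space
parts <- sub("^(..)", "\\1 ", District)   # "AR 01" "AZ 03" "CA 21"
do.call(rbind, strsplit(parts, " ", fixed = TRUE))
#      [,1] [,2]
# [1,] "AR" "01"
# [2,] "AZ" "03"
# [3,] "CA" "21"
```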

Creating vectors from regular expressions in a column name

I have a dataframe in which the columns represent species. The species affiliation is encoded in the column name's suffix:
Ac_1234_AnyString
The string after the second underscore (_) represents the species affiliation.
I want to plot some networks based on rank correlations, and I want to color the species according to their species affiliation later, when I create Fruchterman-Reingold graphs with library(qgraph).
I've done it previously by sorting the df by the name suffix and then creating vectors by manually counting them:
list.names <- c("SG01", "SG02")
list <- vector("list", length(list.names))
names(list) <- list.names
list$SG01 <- c(1:12)
list$SG02 <- c(13:25)
str(list)
List of 2
$ SG01 : int [1:12] 1 2 3 4 5 6 7 8 9 10 ...
$ SG02 : int [1:13] 13 14 15 16 17 18 19 20 21 22 ...
This was very tedious for the big datasets I am working with.
The question is: how can I avoid the manual sorting and counting, and extract vectors (or a list) according to the suffix and the position in the dataframe? I know I can create a vector with the suffix information with
indx <- gsub(".*_", "", names(my_data))
str(indx)
chr [1:29]
"4" "6" "6" "6" "6" "6" "11" "6" "6" "6" "6" "6" "3" "18" "6" "6" "6" "5" "5"
"6" "3" "6" "3" "6" "NA" "6" "5" "4" "11"
Now I would need to create vectors with the positions of all "4"s, "6"s, and so on:
List of 7
$ 4: int[1:2] 1 28
$ 6: int[1:17] 2 3 4 5 6 8 9 10 11 12 15 16 17 20 22 24 26
$ 11: int[1:2] 7 29
....
Thank you.
You can try:
sapply(unique(indx), function(x, vec) which(vec==x), vec=indx)
# $`4`
# [1] 1 28
# $`6`
# [1] 2 3 4 5 6 8 9 10 11 12 15 16 17 20 22 24 26
# $`11`
# [1] 7 29
# $`3`
# [1] 13 21 23
# $`18`
# [1] 14
# $`5`
# [1] 18 19 27
# $`NA`
# [1] 25
Another option is
setNames(split(seq_along(indx),match(indx, unique(indx))), unique(indx))
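Equivalently, split() can group the positions directly; fixing the factor levels to unique(indx) keeps the list in order of first appearance instead of alphabetical order (a sketch with a shortened stand-in for the real indx):

```r
indx <- c("4", "6", "6", "11", "6", "3")
# split the positions 1..n by suffix, in order of first appearance
split(seq_along(indx), factor(indx, levels = unique(indx)))
# $`4`
# [1] 1
# $`6`
# [1] 2 3 5
# $`11`
# [1] 4
# $`3`
# [1] 6
```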

Parallel Processing in R using "parallel" package

I have two data frames:
> head(k)
V1
1 1814338070
2 1199215279
3 1283239083
4 1201972527
5 404900682
6 3093614019
> head(g)
start end state value
1 16777216 16777471 queensland 15169
2 16777472 16778239 fujian 0
3 16778240 16779263 victoria 56203
4 16779264 16781311 guangdong 0
5 16781312 16781823 tokyo 0
6 16781824 16782335 aichi 0
> dim(k)
[1] 624979 1
> dim(g)
[1] 5510305 4
I want to compare each value in data.frame k against the ranges in data.frame g: if a value falls between a row's start and end, return that row's state and value.
The problem is that, given the dimensions of the two data frames, doing the match and returning my desired values takes 5 hours on my computer. I've used the following method, but I'm unable to make use of all cores on my computer, and I can't even make it work correctly:
return_first_match_position <- function(int, start, end) {
  match <- which(int >= start & int <= end)
  if (length(match) > 0) {
    return(match[1])
  } else {
    return(match)
  }
}
library(parallel)
cl = makeCluster(detectCores())
matches = Vectorize(return_first_match_position, 'int')(k$V1,g$start, g$end)
p = parSapply(cl, Vectorize(return_first_match_position, 'int')(k$V1,g$start, g$end), return_first_match_position)
stopCluster(cl)
The desired output is the percentage of times each state and value shows up, over all matches of the numbers from data.frame k in data.frame g.
I was wondering whether there is an intelligent way of doing parallel processing in R?
And can anyone please suggest (any sources) how I can learn/improve writing functions in R?
I think you want to do a rolling join. This can be done very efficiently with data.table:
DF1 <- data.frame(V1 = c(1.5, 2, 0.3, 1.7, 0.5))
DF2 <- data.frame(start = 0:3, end = 0.9:3.9,
                  state = c("queensland", "fujian", "victoria", "guangdong"),
                  value = 1:4)
library(data.table)
DT1 <- data.table(DF1, key="V1")
DT1[, pos:=V1]
# V1 pos
#1: 0.3 0.3
#2: 0.5 0.5
#3: 1.5 1.5
#4: 1.7 1.7
#5: 2.0 2.0
DT2 <- data.table(DF2, key="start")
# start end state value
#1: 0 0.9 queensland 1
#2: 1 1.9 fujian 2
#3: 2 2.9 victoria 3
#4: 3 3.9 guangdong 4
DT2[DT1, roll=TRUE]
# start end state value pos
#1: 0 0.9 queensland 1 0.3
#2: 0 0.9 queensland 1 0.5
#3: 1 1.9 fujian 2 1.5
#4: 1 1.9 fujian 2 1.7
#5: 2 2.9 victoria 3 2.0
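Continuing the toy example, once the rolling join has attached a state to every value, the percentages asked about in the question fall out of table() and prop.table() (a sketch; assumes every value of DT1 actually falls in some range):

```r
library(data.table)
DT1 <- data.table(V1 = c(1.5, 2, 0.3, 1.7, 0.5), key = "V1")
DT2 <- data.table(start = 0:3, end = c(0.9, 1.9, 2.9, 3.9),
                  state = c("queensland", "fujian", "victoria", "guangdong"),
                  value = 1:4, key = "start")
res <- DT2[DT1, roll = TRUE]   # one matched row per value in DT1
prop.table(table(res$state))   # share of values landing in each state's range
#     fujian queensland   victoria
#        0.4        0.4        0.2
```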
So instead of editing the last answer a lot (pretty much making a new one)... is this what you want?
I noticed that your end is always 1 before the next row's start, so what you want (I think) is just to find out how many values fall within each interval, and to give each interval the state:value label for that range. So:
set.seed(123)
c1 <- seq(1, 25, 4)
c2 <- seq(4, 30, 4)
c3 <- letters[1:7]
c4 <- sample(seq(1, 7), 7)
c.all <- cbind(c1, c2, c3, c4)
> c.all ### example data (a character matrix, since cbind mixes types) that looks similar to yours
c1 c2 c3 c4
[1,] "1" "4" "a" "3"
[2,] "5" "8" "b" "7"
[3,] "9" "12" "c" "2"
[4,] "13" "16" "d" "1"
[5,] "17" "20" "e" "6"
[6,] "21" "24" "f" "5"
[7,] "25" "28" "g" "4"
k1 <- sample(seq(1,18),20,replace=T)
k1
[1] 2 1 15 14 4 15 3 17 18 1 4 3 16 15 2 4 8 11 7 16
# note: c.all is a character matrix, so the breaks must be converted back to
# numeric; max() on the character column compares lexicographically and would
# pick "8" instead of 28
fallsin <- cut(k1, c(as.numeric(c.all[, 1]), max(as.numeric(c.all[, 2]))),
               labels = paste(c.all[, 3], c.all[, 4], sep = ":"), right = FALSE)
fallsin
[1] a:3 a:3 d:1 d:1 a:3 d:1 a:3 e:6 e:6 a:3 a:3 a:3 d:1 d:1 a:3 a:3 b:7 c:2 b:7 d:1
Levels: a:3 b:7 c:2 d:1 e:6 f:5 g:4
prop.table(table(fallsin))
a:3 b:7 c:2 d:1 e:6 f:5 g:4
0.45 0.10 0.05 0.30 0.10 0.00 0.00
where the names of the columns are the 'state:value' and the numbers are the percent of k1 that fall within the range of that label
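Finally, since g's intervals appear to be sorted and non-overlapping, the whole lookup can also be done without any parallelism via findInterval(), a vectorized binary search (a sketch with made-up ranges; assumes g is sorted by start and every value of k falls inside some interval):

```r
g <- data.frame(start = c(0, 10, 20), end = c(9, 19, 29),
                state = c("queensland", "fujian", "victoria"),
                value = c(15169, 0, 56203))
k <- data.frame(V1 = c(3, 12, 25, 18))
i <- findInterval(k$V1, g$start)   # index of the range each value falls into
g$state[i]
# [1] "queensland" "fujian"     "victoria"   "fujian"
prop.table(table(g$state[i]))      # the percentages the question asks for
```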
