Separating a column by the first 3 characters - r

I have a set of data below and I would like to separate the first three characters from the bm_id column into a separate column with the rest of the characters in another column.
  bm_id
1 popCL20TE
2 agrST20
3 agrST20-09SE
I have tried solutions to a similar question asked on Stack Overflow, but I end up with extra empty columns while my data stays together in one column.
bm_id[c('species', 'id')] <- tstrsplit(bm_id$bm_id, '(?<=.{3})', perl = TRUE)
The same happens with this code:
bm_id2 <- tidyr::separate(bm_id, bm_id, into = c("species", "id"), sep = 3)

How about substr?
df <- data.frame(vec= c("popCL20TE", "agrST20"))
df$first3 <- substr(df$vec, 1, 3)
df$last <- substr(df$vec, 4, nchar(df$vec))
df
        vec first3   last
1 popCL20TE    pop CL20TE
2   agrST20    agr   ST20
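The same idea applied to the three values from the question, as a minimal sketch. It assumes the data frame and its column are both called bm_id, as in the question, and borrows the column names species and id from the question's own attempts. substring() is used for the remainder because its end position defaults to the end of the string, so nchar() is not needed.

```r
# Sample data rebuilt from the question
bm_id <- data.frame(bm_id = c("popCL20TE", "agrST20", "agrST20-09SE"),
                    stringsAsFactors = FALSE)

# The first three characters go into one column
bm_id$species <- substr(bm_id$bm_id, 1, 3)

# Everything from the fourth character onward goes into the other
bm_id$id <- substring(bm_id$bm_id, 4)

bm_id
```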

Related

how to find the frequency of a tag or a word in r

I'm working on a Stack Overflow data dump .csv file and I need to find:
The top 8 most frequent tags in the dataset.
To do this, I look at the set of tags associated with each row in the data1.PostTypeId column. The frequency of a tag is the number of questions that have that tag (that is, the number of rows that have that tag).
Note 1: The file is very large; it has over 1 million rows.
Note 2: I'm a beginner in R, so I need the simplest way. My attempt was to use the table function, but what I got was a list of tags and I couldn't figure out the top ones.
A sample of the table I'm using is below:
Say, for example, that "java" has the highest frequency (because it appears in the most rows),
then the tag "python-3.x" has the second-highest frequency (because it appears in the second-most rows),
so basically I need to go over the second column in the table and find the top 8 tags there,
etc.
Using base R with (optional) magrittr pipes for readability:
library(magrittr)
# Make a vector of all the tags present in data
tags_sep <- tags %>%
strsplit("><") %>%
unlist()
# Clean out the remaining < and >
tags_sep <- gsub("<|>", "", tags_sep)
# Frequency table sorted
tags_table <- tags_sep %>%
table() %>%
sort(decreasing = TRUE)
# Print the top 10 tags
tags_table[1:10]
               java             android          amazon-ec2 amazon-web-services android-mediaplayer
                  4                   2                   1                   1                   1
              antlr              antlr4        apache-kafka              appium             asp.net
                  1                   1                   1                   1                   1
Data
tags <- c(
"<java><android><selenium><appium>",
"<java><javafx><javafx-2>",
"<apache-kafka>",
"<java><spring><eclipse><gradle><spring-boot>",
"<c><stm32><led>",
"<asp.net>",
"<python-3.x><python-2.x>",
"<http><server><Iocalhost><ngrok>",
"<java><android><audio><android-mediaplayer>",
"<antlr><antlr4>",
"<ios><firebase><swift3><push-notification>",
"<amazon-web-services><amazon-ec2><terraform>",
"<xamarin.forms>",
"<gnuplot>",
"<rx-java><rx-android><rx-binding>",
"<vim><vim-plugin><syntastic>",
"<plot><quantile>",
"<node.js><express-handlebars>",
"<php><html>"
)
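Since the question asks for the top 8 and defines frequency as the number of rows containing a tag, here is a small base R sketch along the same lines. The four-row tags vector is made up for illustration; in practice it would be the tags column of the data dump. unique() guards against a tag being counted twice for the same row, and head() avoids the NA entries that indexing with [1:8] produces when fewer than 8 distinct tags exist.

```r
# Made-up sample; replace with the real tags column
tags <- c("<java><android>", "<java><javafx>", "<python-3.x>", "<java><python-3.x>")

# Pull out everything between < and > for each row
tag_list <- regmatches(tags, gregexpr("[^<>]+", tags))

# Count each tag at most once per row, then tabulate over all rows
tag_counts <- table(unlist(lapply(tag_list, unique)))

# Keep the 8 most frequent tags (or fewer, if there are not that many)
top_tags <- head(sort(tag_counts, decreasing = TRUE), 8)
top_tags
```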
If I understood correctly, this should solve your problem:
library(stringr)
library(data.table)
# some dummy data
dat = data.table(id = 1:3, tags = c("<java><android><selenium>",
"<java><javafx>",
"<apache><android>"))
# separate one tag into each column; str_replace_all strips both < and >
tags = apply(str_split(dat$tags, pattern = "><", simplify = TRUE),
2, function(x) str_replace_all(x, "<|>", ""))
foo = cbind(dat[, .(id)], tags) # add the separated tags to the data
foo[foo==""] = NA # substitute empty strings with NA
foo = melt(foo, id.vars = "id", na.rm = TRUE) # transform to long format, dropping the NA padding
foo = foo[, .N, by = value][order(-N)] # calculate frequency, most frequent first
head(foo, n = 2) # change the value of "n" to the number you want
     value N
1:    java 2
2: android 2

Need to find most common combination of letters

Let's say for simplicity that I have 10 rows of 5 characters, where each character can be A-Z.
E.g.
KJGXI
GDGQT
JZKDC
YOTQD
SSDIQ
PLUWC
TORHC
PFJSQ
IIZMO
BRPOJ
WLMDX
AZDIJ
ARNUA
JEXGA
VFPIP
GXOXM
VIZEM
TFVQJ
OFNOG
QFNJR
ZGUBZ
CCTMB
HZPGV
ORQTJ
I want to know which 3-letter combination is most common. However, the letters of the combination do not need to be in order, nor next to each other. E.g.
ABCXY
CQDBA
=ABC
I could probably brute-force it with endless loops but I was wondering if there was a better way of doing it!
Here is a solution:
x <- c("KJGXI", "GDGQT", "JZKDC", "YOTQD", "SSDIQ", "PLUWC", "TORHC", "PFJSQ", "IIZMO", "BRPOJ", "WLMDX", "AZDIJ",
"ARNUA", "JEXGA", "VFPIP", "GXOXM", "VIZEM", "TFVQJ", "OFNOG", "QFNJR", "ZGUBZ", "CCTMB", "HZPGV", "ORQTJ")
temp <- do.call(cbind, lapply(strsplit(x, ""), combn, m = 3))
temp <- apply(temp, 2, sort)
temp <- apply(temp, 2, paste0, collapse = "")
sort(table(temp), decreasing = TRUE)
which returns the number of times each combination appears. You can then use names(which.max(sort(table(temp), decreasing = TRUE))) to get the most frequent combination (in this case, "FJQ").
In this case, two combinations appear 3 times, so you can do
result <- sort(table(temp), decreasing = TRUE)
names(which(result == max(result)))
# [1] "FJQ" "IMZ"
to get both combinations that appear most often.
The code works as follows:
split each element of x into its five letters, then generate every possible combination of 3 elements from the 5 letters
sort each of those combinations alphabetically
paste the 3 letters together
generate the count for each of those combinations, and sort the result
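One caveat: a row with a repeated letter, such as IIZMO, yields the same sorted combination more than once. That is how "IMZ" reaches a count of 3 while occurring in only two rows (IIZMO and VIZEM). If each row should count a combination at most once, the steps above can be adapted as in the sketch below. row_combos is a hypothetical helper name, not part of the original answer.

```r
x <- c("KJGXI", "GDGQT", "JZKDC", "YOTQD", "SSDIQ", "PLUWC", "TORHC", "PFJSQ",
       "IIZMO", "BRPOJ", "WLMDX", "AZDIJ", "ARNUA", "JEXGA", "VFPIP", "GXOXM",
       "VIZEM", "TFVQJ", "OFNOG", "QFNJR", "ZGUBZ", "CCTMB", "HZPGV", "ORQTJ")

# Hypothetical helper: the distinct, alphabetically sorted 3-letter
# combinations contained in a single string
row_combos <- function(s) {
  letters_sorted <- sort(strsplit(s, "")[[1]])
  combos <- combn(letters_sorted, 3, FUN = paste0, collapse = "")
  unique(as.vector(combos))
}

# Count in how many rows each combination occurs
counts <- sort(table(unlist(lapply(x, row_combos))), decreasing = TRUE)

# FJQ keeps its count of 3; IMZ drops to 2
counts[c("FJQ", "IMZ")]
names(which(counts == max(counts)))
```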
I would split each string into letters, sort them, then use combn to get all combinations. Use paste0 to collapse these back into strings and count.
txt <- c("KJGXI", "GDGQT", "JZKDC", "YOTQD", "SSDIQ", "PLUWC", "TORHC",
"PFJSQ", "IIZMO", "BRPOJ", "WLMDX", "AZDIJ", "ARNUA", "JEXGA",
"VFPIP", "GXOXM", "VIZEM", "TFVQJ", "OFNOG", "QFNJR", "ZGUBZ",
"CCTMB", "HZPGV", "ORQTJ")
txt2 <- strsplit(txt, split = "")
txt2 <- lapply(txt2, sort)
txt3 <- lapply(txt2, combn, m = 3)
txt4 <- lapply(txt3, function(x){apply(x, 2, paste0, collapse = "")})
table(unlist(txt4))
Several steps here could be combined.

Get indexes from a vector with multiple elements based on a specific value

I have a vector:
lst <- c("2,1","7,10","11,0","7,0","10,0","1,1","1,0","4,0","4,1","0,1","6,0")
Each element contains two numbers, separated by ",". I would like to get the indexes of the elements containing the number 1.
The expected index list is:
1, 6, 7, 9, 10
grep() will work nicely for this. By default, it returns the indices of the matched pattern.
grep("^1,|,1$", lst)
# [1] 1 6 7 9 10
The regular expression ^1,|,1$ looks to match a string that
^1, = starts with 1,
| OR
,1$ = ends with ,1
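To see why the anchors matter, compare the anchored pattern with a bare "1", which also matches elements containing 10 or 11. The split-based check at the end is an alternative sketch that tests whole numbers rather than characters; the variable names loose, strict and by_split are made up for this illustration.

```r
lst <- c("2,1","7,10","11,0","7,0","10,0","1,1","1,0","4,0","4,1","0,1","6,0")

# Unanchored: matches any element with the character 1, including 10 and 11
loose <- grep("1", lst)
loose
# [1]  1  2  3  5  6  7  9 10

# Anchored: only elements whose first or second number is exactly 1
strict <- grep("^1,|,1$", lst)
strict
# [1]  1  6  7  9 10

# Alternative: split on the comma and test for an exact "1" among the parts
by_split <- which(vapply(strsplit(lst, ","), function(p) "1" %in% p, logical(1)))
identical(by_split, strict)
```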
Each element contains two numbers. My answer is not ideal, but I got what I need.
m <- as.numeric(unlist(lapply(strsplit(as.character(lst), "\\,"),"[[",1)))
n <- as.numeric(unlist(lapply(strsplit(as.character(lst), "\\,"),"[[",2)))
sort(unique(c(which(m==1),which(n==1))))
Depending on background and context of this task it might be prudent to turn this vector into a data.frame:
lst <- c("2,1","7,10","11,0","7,0","10,0","1,1","1,0","4,0","4,1","0,1","6,0")
DF <- read.table(text = do.call(paste, list(lst, collapse = "\n")), sep = ",")
which(DF$V1 == 1L | DF$V2 == 1L)
#[1] 1 6 7 9 10

how can i read a csv file containing some additional text data

I need to read a csv file in R, but the file contains text information in some rows instead of comma-separated values, so I cannot read it using read.csv(fileName).
The content of the file is as follows:
name:russel date:21-2-1991
abc,2,saa
anan,3,ds
ama,ds,az
,,
name:rus date:23-3-1998
snans,32,asa
asa,2,saz
I need to store only the values belonging to each name/date pair as a data frame. How can I read the file to do that?
My required output is
>dataFrame1
abc,2,saa
anan,3,ds
ama,ds,az
>dataFrame2
snans,32,asa
asa,2,saz
You can read the data with scan and use grep and sub functions to extract the important values.
The text:
text <- "name:russel date:21-2-1991
abc,2,saa
anan,3,ds
ama,ds,az
,,
name:rus date:23-3-1998
snans,32,asa
asa,2,saz"
These commands generate a data frame with name and date values.
# read the text
lines <- scan(text = text, what = character())
# find strings starting with 'name' or 'date'
nameDate <- grep("^name|^date", lines, value = TRUE)
# extract the values
values <- sub("^name:|^date:", "", nameDate)
# create a data frame
dat <- as.data.frame(matrix(values, ncol = 2, byrow = TRUE,
dimnames = list(NULL, c("name", "date"))))
The result:
> dat
    name      date
1 russel 21-2-1991
2    rus 23-3-1998
Update
To extract the values from the strings, which do not contain name and date information, the following commands can be used:
# read data
lines <- readLines(textConnection(text))
# split lines
splitted <- strsplit(lines, ",")
# find positions of 'name' lines
idx <- grep("^name", lines)[-1]
# create grouping variable
grp <- cut(seq_along(lines), c(0, idx, length(lines)))
# extract values
values <- tapply(splitted, grp, FUN = function(x)
lapply(x, function(y)
if (length(y) == 3) y))
# create a list of data frames
dat <- lapply(values, function(x) as.data.frame(matrix(unlist(x),
ncol = 3, byrow = TRUE)))
The result:
> dat
$`(0,6]`
     V1 V2  V3
1   abc  2 saa
2  anan  3  ds
3   ama ds  az
$`(6,8]`
     V1 V2  V3
1 snans 32 asa
2   asa  2 saz
I would read the entire file first as a character vector, i.e. one string for each line in the file; this can be done using readLines. Next you have to find the places where the data for a new date starts, i.e. look for the ",," lines; see grep for that. Then take the first entry of each data block, e.g. using str_extract from the stringr package. Finally, you need to split all the remaining data strings; see strsplit for that.
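A rough base R sketch of this line-by-line approach follows. It deviates from the description in two ways, both assumptions of this sketch: blocks are detected by their name: header lines rather than by the ",," separator, and read.csv(text = ...) parses each block instead of strsplit. The lines vector stands in for readLines() on the real file.

```r
# Stand-in for readLines() on the real file
lines <- c("name:russel date:21-2-1991", "abc,2,saa", "anan,3,ds", "ama,ds,az", ",,",
           "name:rus date:23-3-1998", "snans,32,asa", "asa,2,saz")

is_header <- grepl("^name:", lines)      # lines that start a new block
block     <- cumsum(is_header)           # block number for every line
keep      <- !is_header & lines != ",,"  # drop headers and the empty separator row

# One data frame per block
frames <- lapply(split(lines[keep], block[keep]), function(b) {
  read.csv(text = b, header = FALSE, stringsAsFactors = FALSE)
})

# Label each data frame with the name from its header line
# (assumes every header is followed by at least one data row)
names(frames) <- sub("^name:([^ ]+).*$", "\\1", lines[is_header])
frames
```

Each element of frames is an ordinary data frame, so frames$russel and frames$rus correspond to dataFrame1 and dataFrame2 in the question.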

Importing one long line of data with spaces into R

This question is a followup to my previous question, Importing one long line of data into R.
I have a large data file consisting of a single line of text. The format resembles
Cat    14         15  Horse  16  
I'd eventually like to get it into a data.frame. In the above example I would end up with two variables, Animal and Number. The number of characters in each "line" is fixed, so in the example above each line contains 11 characters, animals being the first 7 and numbers being the next four.
So what I'd like is a data frame that looks like:
Animal Number
Cat    14
NA     15
Horse  16
You can read the file with read.fwf, specifying the column widths and the number of columns:
inp.fwf <- read.fwf("tmp.txt", widths = rep(c(7, 4), times = 3), as.is = TRUE)
Here the argument times = 3 works for your sample data; for your real file, you'll have to indicate how many pairs there are and change times accordingly. If you don't know how many entries you have, this might work:
inp.rl <- readLines("tmp.txt")
nchar(inp.rl)/11
read.fwf will give you a data.frame with one row and many columns. You need to break that into many rows and two columns:
inp.mat <- matrix(inp.fwf, byrow = TRUE, ncol = 2)
This will get you the correct shape for your data. The animal names are stored as character vectors, which you'll probably want to change into factors, but at this point all the data is in R, so you can easily tweak it.
Solution with vectorized substring function.
x <- readLines(textConnection("Cat    14         15  Horse  16  "))
idx <- seq.int(1,nchar(x),by=11)
vsubstr <- Vectorize(substr,vectorize.args=c("start","stop"))
dat <- data.frame(Animal= vsubstr(x,idx,idx+6),
Number= as.numeric(vsubstr(x,idx+7,idx+10)))
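Base R's substring() is already vectorized over its start and stop positions, so the same slicing can be sketched without Vectorize. The input string is rebuilt here with sprintf() from the widths stated in the question (7 characters for the animal, 4 for the number); the left-justified padding is an assumption, and records, animal and number are names made up for this sketch. Blank animal fields are turned into NA to match the desired output.

```r
# Rebuild a fixed-width sample: 7 characters for the animal, 4 for the number
x <- paste0(sprintf("%-7s%-4s", c("Cat", "", "Horse"), c("14", "15", "16")),
            collapse = "")

# Cut the single long line into 11-character records
starts  <- seq(1, nchar(x), by = 11)
records <- substring(x, starts, starts + 10)

# Slice each record into its two fields
animal <- trimws(substr(records, 1, 7))
animal[animal == ""] <- NA                # a blank animal field becomes NA
number <- as.numeric(substr(records, 8, 11))

dat <- data.frame(Animal = animal, Number = number, stringsAsFactors = FALSE)
dat
```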
I'm not sure what the 15 is all about; from the way you described the data, it should be animal-space-count-space-animal...
Anyway, if the 15 should not be there, here is one approach.
list1<-"Cat 14 Horse 16"
x <- unlist(strsplit(list1, " "))
x <- as.data.frame(matrix(x, length(x)/2, 2, byrow = TRUE))
x[, 2] <- as.numeric(as.character(x[, 2]))
x[, 1] <- as.character(x[, 1])
names(x) <-c('animal', 'count')
x
Assume you have a text file, test.dat, with repeated Animal Number pairs.
x <- scan("test.dat", what=list("", 0))
my.df <- data.frame(Animal = x[[1]], Number = x[[2]])
Tyler's use of read.fwf is perhaps cleaner, but here's another possible method.
x <- readLines(textConnection("Cat    14         15  Horse  16  "))
x <- matrix(strsplit(x, "")[[1]], nrow=11)
d <- data.frame(Animal = apply(x[1:7,], 2, paste, collapse=""),
Number = as.numeric(apply(x[8:11,], 2, paste, collapse="")))
