Importing data into R

So I have a set of data here (note: ignore the first line; the data starts from the second line). There are 311,522 characters in total. I wish to import this into R such that each single character is in one cell, so I end up with a 311,522 by 1 column vector. However, when I copied the data into a text file and then imported that into R, each line is recognized as one single "character", and instead I end up with a column vector where each entry is an entire line rather than a single character.
How can I get around this?

Just use readLines and strsplit. This is pretty straightforward stuff in R:
x <- readLines("Your_Actual_URL_Here")
Check for any junk:
head(x)
# [1] ""
# [2] "<PRE>"
# [3] ">hg19_knownGene_uc003qec.4 range=chr6:133551736-133863257 5'pad=0 3'pad=0 strand=+ repeatMasking=none"
# [4] "AGGGAGAGGAGTATCTTGTCTTGGGGAGGGTGGAGACAGACAACCATTTC"
# [5] "TGTTTTTGTTATATTGAATTGTACATCTTCCTAGGCATAAATACTCTTCA"
# [6] "TGATTTCAGGCCAGGTCCAAATGATACCTCCTACATTCCTTCAGCTGGAA"
tail(x)
# [1] "CTTGCTTTTCACAAAAAGAGATCCAAGAGGAAGAGGTGGAGCAAGCTAGC"
# [2] "AAGAGAGCACCCAAGATGGAAGCTGCAGTCTTTTACCCTAACCTCAGAAG"
# [3] "TGGTGTACCTTTTGCCATATGCCATTTGTCATATAGCTCAAGCATGGTAC"
# [4] "AGTGTGGGAGGGGGCTACATGGGATGTTAATACCAGGATGCAGGGGATCG"
# [5] "CTGGGGCTACTTTGGAGGCTGG"
# [6] "</PRE>"
So, we want from the fourth line to one less than the length of the vector:
y <- unlist(strsplit(x[4:(length(x)-1)], ""), use.names=FALSE)
head(y)
# [1] "A" "G" "G" "G" "A" "G"
tail(y)
# [1] "G" "G" "C" "T" "G" "G"
length(y)
# [1] 311522
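A minimal self-contained sketch of the same approach, using an inline copy of a few lines of the file's structure rather than the real URL (the sequence lines here are abbreviated):

```r
# Simulate the downloaded page: blank line, <PRE> tag, FASTA-style
# header, sequence lines, closing </PRE> tag.
x <- c("",
       "<PRE>",
       ">hg19_knownGene_uc003qec.4 range=chr6:133551736-133863257",
       "AGGGAGAGGAGTATCTTGTCTTGGGGAGGGTGGAGACAGACAACCATTTC",
       "CTGGGGCTACTTTGGAGGCTGG",
       "</PRE>")
# Sequence lines run from the fourth line to one before the last:
y <- unlist(strsplit(x[4:(length(x) - 1)], ""), use.names = FALSE)
head(y)    # [1] "A" "G" "G" "G" "A" "G"
length(y)  # [1] 72
```

With the real URL in place of the inline vector, `length(y)` comes out to 311522 as shown above.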

Related

Extracting non-capitalized words using Regex

I'm trying to extract non-capitalized words using Regex in R. The data contains several columns (e.g. word, word duration, syllable, syllable duration ...), and in the Word column, there are tons of words that are either capitalized (e.g. EAT), non-capitalized (e.g. see), or in curly brackets (e.g. {VAO}). I want to extract all the words that are not capitalized in the word column. The following is a small example data frame with an expected outcome.
file word
1 sp
2 WHAT
3 ISN'
4 'EM
5 O
6 {PPC}
OUTCOME:
"sp", "{PPC}"
> unique(full_dat$word[!grepl("^[A-Z].*[A-Z]|\\d", full_dat$word) & !grepl(" [[:punct:]] ", full_dat$word)])
This results in the following:
[1] "sp" "{OOV}"
[3] "O" "I"
[5] "A" NA
[7] "{XX}" "'S"
[9] "{LG}" "Y"
[11] "B" "'VE"
[13] "N" "{GAP_ANONYMIZATION_NAME}"
[15] "'EM" "W"
[17] "{GAP_ANONYMIZATION}" "K"
This looks good, since I can easily recognize the non-capitalized words, but there are still some capitalized words in this list. How can I modify the code so it shows only lower-case words and curly-bracketed words?
With the stringr library you can simply do this:
library(stringr)
x <- c("HELLO WORLD", "hello world", "Hello World", "hello World", "HeLLo wOrlD")
str_extract(x, "[A-Z]+")
which returns the first run of upper-case letters found in each element:
[1] "HELLO" NA "H" "W" "H"
You can drop the NAs by applying the na.omit function, which also records the positions of the NAs, that is, the positions holding no capitalized words:
na.omit(str_extract(x, "[A-Z]+"))
[1] "HELLO" "H" "W" "H"
attr(,"na.action")
[1] 2
attr(,"class")
[1] "omit"
But you can also see which positions hold no capitalized words by doing:
is.na(str_extract(x, "[A-Z]+"))
[1] FALSE TRUE FALSE FALSE FALSE
I hope this is helpful 😀
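If the goal is the asker's exact expected outcome (keep only all-lower-case words and curly-bracketed tokens), a base-R sketch along the same lines, using the example column from the question:

```r
word <- c("sp", "WHAT", "ISN'", "'EM", "O", "{PPC}")
# Keep entries that are entirely lower-case letters, or wrapped in {}:
keep <- grepl("^[a-z]+$", word) | grepl("^\\{[^}]*\\}$", word)
word[keep]
# [1] "sp"    "{PPC}"
```

Requiring the whole token to match (`^...$`) is what excludes single capitals like "O" and apostrophe forms like "'EM" that slipped through the original negated pattern.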

How to retain character strings using positional indexing?

What I need to do is very similar to what the function below does
x = c("abcde", "ghij", "klmnopq")
library(data.table)  # tstrsplit() comes from data.table
tstrsplit(x, "", fixed=TRUE, keep=c(1,3,5), names=c('first','second','third'))
However, I would like to be able to return strings using ranges of values. For example, I would like to specify that in first I want to have the first two letters for each element.
Thus instead of having:
$first
[1] "a" "g" "k"
$second
[1] "c" "i" "m"
$third
[1] "e" NA "o"
The output should look like
$first
[1] "ab" "gh" "kl"
$second
[1] "c" "i" "m"
$third
[1] "e" NA "o"
Background:
I have a large .txt file of records and a lookup table that tells at which position each attribute starts and its expected maximum width. The txt file looks like:
James Brown M 01-01-1970
And then in a separate file I have a lookup table that says:
Field Start width
Name 1 7
FamilyN 9 7
Gender 11 1
Incidentally, I would appreciate any feedback on the best way to import this type of large .txt file. I feel like read.table is inappropriate since it tries to reduce to a dataframe format which is not what these files really are.
Something like this maybe:
x = c("abcde", "ghij", "klmnopq")
library(tidyverse)
list(c(1,3,5), c(2,1,1)) %>%
  pmap(~ substr(x, .x, .x + .y - 1) %>% replace(., . == "", NA))
[[1]]
[1] "ab" "gh" "kl"
[[2]]
[1] "c" "i" "m"
[[3]]
[1] "e" NA "o"
I've hardcoded the positions. Per @MrFlick's comment, if you have a large number of strings, you'll need some strategy for deriving the character positions so that you can automate it rather than hardcoding it.
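For the background question about importing the fixed-width record file: base R's read.fwf (in utils) is built for exactly this layout, taking the widths from your lookup table. A sketch with illustrative widths (the field positions here are assumptions, not the real lookup values):

```r
# Write one fixed-width record to a temp file to stand in for the real .txt:
tmp <- tempfile()
writeLines("James   Brown   M 01-01-1970", tmp)

# widths come from the lookup table (start/width per field);
# extra arguments like strip.white are passed through to read.table.
rec <- read.fwf(tmp, widths = c(8, 8, 2, 10),
                col.names = c("Name", "FamilyN", "Gender", "DOB"),
                strip.white = TRUE)
rec$Name  # [1] "James"
unlink(tmp)
```

Unlike read.table, read.fwf never looks for separators, so it matches the "positions, not delimiters" nature of these files.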

Modifying the suffix of a specific range of columns in R

I have a data frame with N columns, in this case 25, and I would like to change the suffix only for column variables 15 to 30.
t0 is the dataframe with the 30 column variables
For all variables 1 to 30, the following command works perfectly:
t0<-data.frame(a=c(1),b=c(1),c=c(1),d=c(1),e=c(1),f=c(1),g=c(1),h=c(1))
colnames(t0) <- paste( colnames(t0), "Sub",sep = "_")
names(t0)
[1] "a_Sub" "b_Sub" "c_Sub" "d_Sub" "e_Sub"
[6] "f_Sub" "g_Sub" "h_Sub" "i_Sub" "ii_Sub"
[15] "j_Sub" "k_Sub" "l_Sub" "m_Sub" "n_Sub"
Desired output:
names(t0)
[1] "a" "b" "c" "d" "e"
[6] "f" "g" "h" "i" "ii"
[15] "j_Sub" "k_Sub" "l_Sub" "m_Sub" "n_Sub"
Any idea how to get this done in R?
Thanks,
Albit
The reason it didn't work is that the dataset was subset before getting the column names. Instead, we can get the column names of the entire dataset directly and subset them with a numeric index:
colnames(t0)[15:30] <- paste(colnames(t0)[15:30], "Sub", sep="_")
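A self-contained sketch with a 30-column frame (column names here are illustrative placeholders, not the asker's real names):

```r
# Build a one-row, 30-column data frame with generic names v1..v30:
t0 <- as.data.frame(matrix(1, nrow = 1, ncol = 30))
names(t0) <- paste0("v", 1:30)

# Suffix only columns 15 through 30:
names(t0)[15:30] <- paste(names(t0)[15:30], "Sub", sep = "_")
names(t0)[14:16]
# [1] "v14"     "v15_Sub" "v16_Sub"
```

Columns 1 to 14 keep their original names; only the indexed slice is rewritten.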

How do I apply an index vector over a list of vectors?

I want to apply a long index vector (50+ non-sequential integers) to a long list of vectors (50+ character vectors containing 100+ names) in order to retrieve specific values (as a list, vector, or data frame).
A simplified example is below:
> my.list <- list(c("a","b","c"),c("d","e","f"))
> my.index <- 2:3
Desired Output
[[1]]
[1] "b"
[[2]]
[1] "f"
##or
[1] "b"
[1] "f"
##or
[1] "b" "f"
I know I can get the same value from each element using:
> lapply(my.list, function(x) x[2])
##or
> lapply(my.list,'[', 2)
I can pull the second and third values from each element by:
> lapply(my.list,'[', my.index)
[[1]]
[1] "b" "c"
[[2]]
[1] "e" "f"
##or
> for(j in my.index) for(i in seq_along(my.list)) print(my.list[[i]][[j]])
[1] "b"
[1] "e"
[1] "c"
[1] "f"
I don't know how to pull just the one value from each element.
I've been looking for a few days and haven't found any examples of this being done, but it seems fairly straightforward. Am I missing something obvious here?
Thank you,
Scott
Whenever you have a problem that is like lapply but involves multiple parallel lists/vectors, consider Map or mapply (Map simply being a wrapper around mapply with SIMPLIFY=FALSE hardcoded).
Try this:
Map("[",my.list,my.index)
#[[1]]
#[1] "b"
#
#[[2]]
#[1] "f"
..or:
mapply("[",my.list,my.index)
#[1] "b" "f"
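For comparison, the same parallel indexing can be spelled out in base R without Map, iterating over positions with vapply (a sketch, equivalent to the mapply call above):

```r
my.list <- list(c("a","b","c"), c("d","e","f"))
my.index <- 2:3

# Pair the i-th vector with the i-th index explicitly:
vapply(seq_along(my.list),
       function(i) my.list[[i]][my.index[i]],
       character(1))
# [1] "b" "f"
```

Map/mapply hide exactly this bookkeeping, which is why they are the idiomatic choice here.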

how to split a vector with mixed variables into two separate vectors in R

I extracted a mixed variable, which includes both numeric and string values, from a data file using the strsplit function. I ended up with a variable just as seen below:
> sample3
[[1]]
[1] "1200" "A"
[[2]]
[1] "1193" "A"
[[3]]
[1] "1117" "B"
[[4]]
[1] "5663"
[[5]]
[1] "7003" "C"
[[6]]
[1] "1205" "A"
[[7]]
[1] "2100" "D"
[[8]]
[1] "1000" "D"
[[9]]
[1] "D"
[[10]]
[1] "1000" "B"
I need to split this into two variables/vectors (or convert it to a two-column matrix). I tried unlist(sample3) and then put all the values into a matrix with ncol=2; however, since some data points are missing, the result is misaligned. I think I need to solve the missing-data issue before putting the values into a two-column matrix. Does anyone have any idea on this issue? Any help will be greatly appreciated.
Something like this will work:
# dummy data
x <- list(c('100','a'), '100', c('a'), c('1000','b'))
# first element of each entry, coerced to numeric (NA when non-numeric)
numeric_x <- unlist(lapply(x, function(el) suppressWarnings(as.numeric(head(el, 1)))))
# last element of each entry, kept only when it is not numeric
character_x <- unlist(lapply(x, function(el) {
  last <- tail(el, 1)
  if (is.na(suppressWarnings(as.numeric(last)))) last else NA
}))
There will be a much nicer regex answer, I'm sure.
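For what it's worth, one regex-based base-R sketch along those lines: flatten each entry to a single string, then pull out the first numeric and first alphabetic match (falling back to NA when a pattern is absent):

```r
x <- list(c('100','a'), '100', c('a'), c('1000','b'))
flat <- vapply(x, paste, character(1), collapse = " ")

# First match of a pattern in a string, or NA when there is none:
get_match <- function(s, pat) {
  m <- regmatches(s, regexpr(pat, s))
  if (length(m)) m else NA_character_
}

numeric_x   <- as.numeric(vapply(flat, get_match, character(1), pat = "[0-9]+"))
character_x <- vapply(flat, get_match, character(1), pat = "[A-Za-z]+")
unname(numeric_x)    # [1]  100  100   NA 1000
unname(character_x)  # [1] "a" NA  "a" "b"
```

Both result vectors keep one slot per list entry, so they line up correctly even where a value is missing, and cbind(numeric_x, character_x) gives the two-column layout the asker wanted.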
