How to match a string from the input and read it as a number? - R

I take a string given as input, match it with a number, and return that number as output.
Example:
my.string <- readline(prompt="Enter string 1: ")
print(my.string)
#[1] "coffee"
DataFrame:
ID name
22 coffee
23 tea
24 milk
Task :
I need to convert 'coffee' into 22 because I want to pass this number into a
function.
How do I get that?

If your data.frame has only two columns, a named vector is an option worth considering, since it provides an easy way to subset.
The name column appears to be unique in your data.frame. If so, you can convert the name column to row names, which makes selection much simpler.
#change the 'name' to rownames
rownames(df) <- df$name
df$name <- NULL
#Now selection will be much easier
df[my.string, "ID"]
#22
Using "named vector":
itemVector <- c(22,23,24)
names(itemVector) <- c("coffee", "tea", "milk")
itemVector[my.string]
# coffee
# 22
Data:
df <- read.table(text =
"ID name
22 coffee
23 tea
24 milk",
header = TRUE, stringsAsFactors = FALSE)
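As a minimal sketch of the named-vector idea, the vector can also be built from the two columns instead of being typed by hand. This assumes the name column is unique; setNames and unname are base R, and id_by_name is a name chosen here only for illustration.

```r
# Rebuild the example data so the sketch is self-contained
df <- read.table(text = "ID name
22 coffee
23 tea
24 milk", header = TRUE, stringsAsFactors = FALSE)

# Named vector: values are the IDs, names are the strings to look up
id_by_name <- setNames(df$ID, df$name)

my.string <- "coffee"                 # stands in for the readline() input
id <- unname(id_by_name[my.string])   # drop the name so only the number is passed on
print(id)
```

Dropping the name with unname is optional; it only matters if the receiving function is sensitive to names on its argument.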

Like my comment said, this is rather straightforward.
my.string <- readline(prompt="Enter string 1: ")
inx <- which(DataFrame$name == my.string)
DataFrame$ID[inx]
#[1] 22
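With which(), an input that is not in the table yields integer(0), so the lookup silently returns an empty vector. A small sketch, assuming you would rather get NA for an unknown string, uses match() instead; lookup_id is a hypothetical helper name, not part of the original answer.

```r
DataFrame <- read.table(text = "ID name
22 coffee
23 tea
24 milk", header = TRUE, stringsAsFactors = FALSE)

# Return the ID for `name`, or NA if the name is not in the table
lookup_id <- function(name, table) {
  idx <- match(name, table$name)      # position of the first match, NA if none
  if (is.na(idx)) return(NA_integer_)
  table$ID[idx]
}

lookup_id("coffee", DataFrame)
lookup_id("juice", DataFrame)
```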
Data.
DataFrame <- read.table(text = "
ID name
22 coffee
23 tea
24 milk
", header = TRUE)

Related

Order a list of strings in R

The data I have include two variables: id and income (a list of characters)
id <- seq(1,6)
income <- c("2322;5125",
"0110;2012",
"2212;0912",
"1012;0145",
"1545;1102",
"1010;2028")
df <- data.frame(id, income)
df$income <- as.character(df$income)
I need to add a third column, income_order, which holds the values of the income column sorted within each row.
NOTE: I still need to keep the leading zeros.
We could split the string on ";", sort and paste the string back.
df$income_order <- sapply(strsplit(df$income, ";"), function(x)
paste(sort(x), collapse = ";"))
df
# id income income_order
#1 1 2322;5125 2322;5125
#2 2 0110;2012 0110;2012
#3 3 2212;0912 0912;2212
#4 4 1012;0145 0145;1012
#5 5 1545;1102 1102;1545
#6 6 1010;2028 1010;2028
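One caveat worth illustrating: sort() on a character vector compares the pieces as text, which matches numeric order here only because every value is zero-padded to four digits. A minimal sketch, assuming the pieces might have different widths, orders them by numeric value while keeping the original strings (and so their leading zeros). sort_income is a hypothetical helper name.

```r
# Sort the ";"-separated pieces of each string by numeric value,
# but paste back the original text so leading zeros survive
sort_income <- function(x) {
  parts <- strsplit(x, ";", fixed = TRUE)
  vapply(parts,
         function(p) paste(p[order(as.numeric(p))], collapse = ";"),
         character(1))
}

sort_income(c("2212;0912", "100;20"))
```

For the fixed-width data in the question this gives the same result as the sort() version above.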
We can use gsubfn
library(gsubfn)
df$income_order <- gsubfn("(\\d+);(\\d+)", ~ paste(sort(c(x, y)), collapse=";"), df$income)
df$income_order
#[1] "2322;5125" "0110;2012" "0912;2212" "0145;1012" "1102;1545" "1010;2028"

How to add the title of a gene to the output in R?

I have strings of length 9 and a list of longer strings with titles.
Example data:
String <- c("ABCDEFGHI", "ACBDGHIEF")
Data in text file contains 'longer strings with titles' like
>name
ABCDEFGHIJKLMNOPQRSTUVWXYX
>name1
TUVWXYACBDGHIEFXGHIJKLMIJK
>name2
ABFNOCDEPQRXYXGSTUVWHIMJKL
I use library(stringr) to locate the positions of each string.
Code in R
loc <- str_locate(textfile, pattern = String)
write.csv(loc, "locate.csv")
EXPECTED Output:
string | locate | source of longer string
1      | 1-9    | name1
2      | 7-15   | name2
3      | NA     | NA
QUESTION:
I would like to add the name of the longer string in which each "string" is located. How can I do this in R? I want to have the last column shown in the EXPECTED Output.
Thank you for help
Venkata
Here is an option with the tidyverse. After reading the data with readLines, the lines alternate between a title and a value, so one option is to separate them into two columns with a recycled logical vector ('i1'), apply str_locate only to the values ('col2'), create a row-number column, and fill 'source_longer_string' with the title only where 'locate' is not NA.
library(dplyr)
library(stringr)
i1 <- c(TRUE, FALSE)
df1 <- tibble(col1 = textfile[i1], col2 = textfile[!i1])
str_locate(df1$col2, str_c(String, collapse="|")) %>%
as.data.frame %>%
transmute(string = row_number(),
locate = str_c(start, end, sep="-"),
source_longer_string = case_when(is.na(locate) ~ NA_character_,
TRUE ~ df1$col1))
# string locate source_longer_string
#1 1 1-9 >name
#2 2 7-15 >name1
#3 3 <NA> <NA>
data
textfile <- readLines(textConnection(">name
ABCDEFGHIJKLMNOPQRSTUVWXYX
>name1
TUVWXYACBDGHIEFXGHIJKLMIJK
>name2
ABFNOCDEPQRXYXGSTUVWHIMJKL"))
String <- c("ABCDEFGHI", "ACBDGHIEF")
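For comparison, here is a rough base R sketch of the same idea. It also strips the leading ">" from the titles, which is an assumption about the desired output, not something the answer above does. The variable names (is_title, titles, seqs, res) are illustrative only.

```r
textfile <- c(">name", "ABCDEFGHIJKLMNOPQRSTUVWXYX",
              ">name1", "TUVWXYACBDGHIEFXGHIJKLMIJK",
              ">name2", "ABFNOCDEPQRXYXGSTUVWHIMJKL")
String <- c("ABCDEFGHI", "ACBDGHIEF")

# Title lines start with ">"; the remaining lines are the sequences
is_title <- startsWith(textfile, ">")
titles <- sub("^>", "", textfile[is_title])
seqs <- textfile[!is_title]

# First match of any of the short strings in each sequence (-1 if none)
m <- regexpr(paste(String, collapse = "|"), seqs)
pos <- as.integer(m)
len <- attr(m, "match.length")
found <- pos > 0

res <- data.frame(
  string = seq_along(seqs),
  locate = ifelse(found, paste(pos, pos + len - 1L, sep = "-"), NA_character_),
  source_longer_string = ifelse(found, titles, NA_character_),
  stringsAsFactors = FALSE
)
res
```

Like the tidyverse version, this reports only the first match per sequence and does not say which of the short strings matched.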

Parsing names in mixed formats using R

I have a list of names in mixed formats, which I would like to separate into columns containing first and last names in R. An example dataset:
Names <- c("Mary Smith","Hernandez, Maria","Bonds, Ed","Michael Jones")
The goal is to assemble a dataframe that contains names in a format like this:
FirstNames <- c("Mary","Maria","Ed","Michael")
LastNames <- c("Smith","Hernandez","Bonds","Jones")
FinalData <- data.frame (FirstNames, LastNames)
I tried a few approaches to select either the First or Last name based on whether the names are separated by a space only versus comma-space. For instance I wanted to use regular expressions in gsub to copy first names from rows in which a comma-space separates the names:
FirstNames2 <- gsub (".*,\\s","",Names)
This worked for rows that contained names in the LastName, FirstName format, but gsub collected the entire contents in rows with names in FirstName LastName format.
So my request for help is to please advise: How would you tackle this problem? Thanks to all in advance!
Here is a one-liner. The pattern tries Firstname lastname first and if that fails it tries lastname, firstname. No packages are used.
read.table(text = sub("(\\w+) (\\w+)|(\\w+), (\\w+)", "\\1\\4 \\2\\3", Names), as.is=TRUE)
giving:
V1 V2
1 Mary Smith
2 Maria Hernandez
3 Ed Bonds
4 Michael Jones
You could rearrange the "Last, First" entries into "First Last" order and then just strsplit.
FirstNames <- sapply(strsplit(gsub('(\\w+), (\\w+)', '\\2 \\1', Names), ' '), `[[`, 1)
LastNames <- sapply(strsplit(gsub('(\\w+), (\\w+)', '\\2 \\1', Names), ' '), `[[`, 2)
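The two lines above repeat the same gsub call. A small sketch of the same approach, assuming every entry has exactly one first and one last name, normalizes once and binds the pieces into the requested data frame; normalized and parts are illustrative names.

```r
Names <- c("Mary Smith", "Hernandez, Maria", "Bonds, Ed", "Michael Jones")

# Turn "Last, First" into "First Last"; entries without a comma are left unchanged
normalized <- sub("^(\\w+), (\\w+)$", "\\2 \\1", Names)

# Split on the single space and stack the pieces into a two-column matrix
parts <- do.call(rbind, strsplit(normalized, " ", fixed = TRUE))

FinalData <- data.frame(FirstNames = parts[, 1],
                        LastNames = parts[, 2],
                        stringsAsFactors = FALSE)
FinalData
```

Names with middle names, hyphens, or apostrophes would need a more permissive pattern than \\w+.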
temp = strsplit(x = Names, split = "(, | )")
do.call(rbind, lapply(1:length(temp), function(i){
if (grepl(pattern = ", ", x = Names[i])){
data.frame(F = temp[[i]][2], L = temp[[i]][1])
}else{
data.frame(F = temp[[i]][1], L = temp[[i]][2])
}
}))
# F L
#1 Mary Smith
#2 Maria Hernandez
#3 Ed Bonds
#4 Michael Jones

Read CSV in R with first column as dataframe header

I have a simple text file where the first column is names (strings) and the second column is values (floats). As an example, names and ages:
Name, Age
John, 32
Heather, 46,
Jake, 23
Sally, 19
I'd like to read this in as a dataframe (call this df) but transposed so that I can access ages by names such that df$John would return 32. How can I do this?
Previously I tried creating a new dataframe, tdf, looping through the data in a for loop, assigning each name and age and then inserting into the empty dataframe as tdf[name] = age, but this did not work as I expected.
You can read your data using read.table().
Then you can transpose it using t() and set colnames after.
Example:
If df is:
df=read.table("dummydata", header=T, sep=",")
df
Name Age
1 John 32
2 Heather 46
3 Jake 23
4 Sally 19
You transpose the Age column and then transform it into a dataframe:
tdf=as.data.frame(t(df$Age))
colnames(tdf)=t(df$Name)
So tdf will return:
tdf
John Heather Jake Sally
1 32 46 23 19
And, as you asked, tdf$John will return:
tdf$John
[1] 32
Now, if you have more than two columns you can do the same but instead of indicating the name of the column you can simply indicate the position using brackets.
df=read.table("dummydata", header=T, sep=",")
With t(df[2:ncol(df)]) you transpose the whole table starting from the second column, no matter the number of columns. The first column will be the names after the transpose.
tdf=as.data.frame(t(df[2:ncol(df)]))
Then you set the column names.
colnames(tdf)=t(df[1])
tdf$John
[1] 32
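This should add the row as header when you read from the file (note that read.csv2 expects semicolon-separated fields; for a comma-separated file like the one in the question, read.csv takes the same arguments):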
read.csv2(filename, as.is = TRUE, header = TRUE)
Read the data into a data frame, DF (see Note).
1) Assign the names to the rows of DF in which case this will give John's age without having to create a new data structure:
rownames(DF) <- DF$Name
DF["John", "Age"]
## [1] 32
2) Alternatively, split DF into a named list in which case you can get the precise syntax requested:
ages <- with(DF, split(Age, Name))
ages$John
## [1] 32
3) This alternative would also create the same list:
ages <- with(DF, setNames(as.list(Age), Name))
Note: DF in reproducible form is as follows. (We have removed the trailing comma on one line in the question but if it is really there add fill = TRUE to the read.csv line.)
Lines <- "Name, Age
John, 32
Heather, 46
Jake, 23
Sally, 19"
DF <- read.csv(text = Lines)
A bit late but hopefully helpful. The "row.names" parameter allows you to select the desired column as header:
read.csv("df.csv", header = TRUE, row.names = 1)
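A short sketch of how the row.names approach could be combined with a transpose to get the df$John syntax from the question. It reads from inline text instead of a file and assumes the trailing comma has been removed. Lines, DF and tdf are illustrative names.

```r
Lines <- "Name,Age
John,32
Heather,46
Jake,23
Sally,19"

# Use the first column as row names; only the Age column remains as data
DF <- read.csv(text = Lines, header = TRUE, row.names = 1)
DF["John", "Age"]

# Transpose so that each name becomes a column
tdf <- as.data.frame(t(DF))
tdf$John
```

Note that row names must be unique, so this only works if no name appears twice in the file.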

Merging two data frames with columns with certain patterns in strings

(I have been stuck with this problem for the past two days, so if it already has an answer on SO please bear with me.)
I have two data frames A and B. I want to merge them on Name column. Suppose, A has two columns Name and Numbers. The Name column of A df has values ".tony.x.rds", ".tom.x.rds" and so on.
Name Numbers
.tony.x.rds 15.6
.tom.x.rds 14.5
The B df has two columns Name and ChaR. The Name column of B has values "tony.x","tom.x" and so on.
Name ChaR
tony.x ENG
tom.x US
The main element in column Name of both dfs is "tony", "tom" and so on.
So, ".tony.x.rds" is equal to "tony.x" and ".tom.x.rds" is equal to "tom.x".
I have tried gsub with various options, leaving me with "tony", "tom", and so on in column Name of both A and B data frames. But when I use
StoRe<-merge(A,B, all=T)
I get all the rows of A and B rather than single rows. That is, there are two rows for each of "tony", "tom" and so on, with their respective values in the Numbers and ChaR columns. For example:
Name Numbers ChaR
tony 15.6 NA
tony NULL ENG
tom 14.5 NA
tom NULL US
It has been giving me splitting headache. I request you to help.
One possible solution. I am not completely sure what you want to do with the 'x' in the strings, so I have kept it in the linkage key; by changing \\1\\2 to \\1 you keep only the first part (e.g. 'tony').
a <- data.frame(
Name = paste0(".", c("tony", "tom", "foo", "bar", "foobar"), ".x.rds"),
Numbers = rnorm(5)
)
b <- data.frame(
Name = paste0(c("tony", "tom", "bar", "foobar", "company"), ".x"),
ChaR = LETTERS[11:15]
)
# String consists of 'point letter1 point letter2 point rds'; replace by
# 'letter1 letter2'
a$Name_stand <- gsub("^\\.([a-z]+)\\.([a-z]+)\\.rds$", "\\1\\2", a$Name)
# String consists of 'letter1 point letter2'; replace by 'letter1 letter2'
b$Name_stand <- gsub("^([a-z]+)\\.([a-z]+)$", "\\1\\2", b$Name)
result <- merge(a, b, all = TRUE, by = "Name_stand")
Output:
#> result
# Name_stand Name.x Numbers Name.y ChaR
#1 barx .bar.x.rds 1.38072696 bar.x M
#2 companyx <NA> NA company.x O
#3 foobarx .foobar.x.rds -1.53076596 foobar.x N
#4 foox .foo.x.rds 1.40829287 <NA> <NA>
#5 tomx .tom.x.rds -0.01204651 tom.x L
#6 tonyx .tony.x.rds 0.34159406 tony.x K
Another option, perhaps somewhat more robust: variations of the strings such as 'tom.rds' and 'tom' will still be linked (this can of course also be a disadvantage).
# Remove the rds from a$Name
a$Name_stand <- gsub("rds$" , "", a$Name)
# Remove all non alpha numeric characters from the strings
a$Name_stand <- gsub("[^[:alnum:]]", "", a$Name_stand)
b$Name_stand <- gsub("[^[:alnum:]]", "", b$Name)
result2 <- merge(a, b, all = TRUE, by = "Name_stand")
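If the goal is one row per name, a minimal sketch on the question's own two-row example could look like the following. It assumes that a Name in B is exactly the Name in A without the leading dot and the trailing ".rds", and it keeps only matching rows by omitting all = TRUE. The key column is a name introduced here for illustration.

```r
A <- data.frame(Name = c(".tony.x.rds", ".tom.x.rds"),
                Numbers = c(15.6, 14.5),
                stringsAsFactors = FALSE)
B <- data.frame(Name = c("tony.x", "tom.x"),
                ChaR = c("ENG", "US"),
                stringsAsFactors = FALSE)

# Strip the leading "." and the trailing ".rds" so A's key matches B's Name
A$key <- sub("\\.rds$", "", sub("^\\.", "", A$Name))

# Merge on the cleaned key in A and the Name column in B
StoRe <- merge(A, B, by.x = "key", by.y = "Name")
StoRe
```

Passing by.x and by.y explicitly avoids merging on every shared column name, which is what merge does by default.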

Resources