R: Combine two elements of vector based on pattern

I would like to know how I could use a function like paste to combine two strings (character elements) of a vector into one element of a new vector.
My problem looks like this:
a) My initial data, stored as a txt file:
10_x_R1_001.fastq.gz
10_x_R2_001.fastq.gz
11_x_R1_001.fastq.gz
11_x_R2_001.fastq.gz
I then have this data as a character vector:
x= c("10_x_R1_001.fastq.gz", "10_x_R2_001.fastq.gz", "11_x_R1_001.fastq.gz", "11_x_R2_001.fastq.gz")
So my question is: how can I combine the elements that start with the same indicator ("10" or "11") into a single new element, so that the result looks like this?
x= c("10_x_R1_001.fastq.gz 10_x_R2_001.fastq.gz", "11_x_R1_001.fastq.gz 11_x_R2_001.fastq.gz")
Because the two elements are always next to each other, I already solved the problem with rollapply from the zoo package, but I would like to know how to do it another way.
Thanks

A base R approach is to take the first 2 characters with substring, use them as the grouping variable in tapply, and paste each group together:
unname(tapply(x, substring(x, 1, 2), FUN = paste, collapse= ' '))
Or, if the prefixes can have a variable number of digits, use sub to strip everything from the first underscore onward:
unname(tapply(x, sub("_.*", "", x), FUN = paste, collapse= " "))
#[1] "10_x_R1_001.fastq.gz 10_x_R2_001.fastq.gz" "11_x_R1_001.fastq.gz 11_x_R2_001.fastq.gz"
If the values are always next to each other, use a recycled logical index to extract alternate elements and paste them together:
paste(x[c(TRUE, FALSE)], x[c(FALSE, TRUE)])
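One caveat with the grouped approach is that tapply returns the groups in sorted order of the grouping variable, which may differ from the order in the file once prefixes such as "9" and "10" are mixed. The following is a minimal sketch, not part of the original answer, that keeps the order of first appearance by turning the prefix into a factor with explicit levels; the names prefix, grp and pairs are only illustrative.

```r
x <- c("10_x_R1_001.fastq.gz", "10_x_R2_001.fastq.gz",
       "11_x_R1_001.fastq.gz", "11_x_R2_001.fastq.gz")

# Everything before the first underscore identifies the sample
prefix <- sub("_.*", "", x)

# Factor levels in order of first appearance, so groups are not re-sorted
grp <- factor(prefix, levels = unique(prefix))

# Paste the members of each group into one string
pairs <- unname(vapply(split(x, grp), paste, character(1), collapse = " "))
pairs
```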

Related

How do I multiply columns of a dataset by a constant?

I am quite new to R, so excuse my lack of ability. I have tried and failed a fair bit, and would appreciate any input.
I am asked to get rid of inconsistent use of "." and "," to indicate decimals by multiplying every number in certain columns by some multiple of 10. I have tried to simply multiply using the binary operator *, but it obviously doesn't work because some columns are factors, which is required in this case.
I have also tried this code, but I get the error: subscript "Var" can't be "NA"
data %>% mutate_if(is.numeric, ~ . * 1000)
Below is the code I have for my dataset
datat <- c("Starting_year" , "Rank" , "Team" , "Home_total_Games", "Home_Total_Attendance" , "Home_Avg_Attendance" , "Home_capacity" , "Away_Total_Attendance" , "Away_Avg_Attendance" , "Away_Capacity")
names(data) <- datat
Factors assigned
data$Rank <- as.factor(data$Rank)
data$Starting_year <- as.factor(data$Starting_year)
Thanks in advance
I can't embed it, but there is a picture of the data below. I am asked to use a function in dplyr to multiply the columns by 1000 to remove all the . and ,
dataset
What is the format of numbers?
If the format is: 1.000.000,5, where . is a thousand separator, while , is a decimal separator, just use gsub:
foo = "1.000.000,5"
bar = gsub("\\.", "", foo) # "1000000,5"
baz = gsub(",", "\\.", bar) # "1000000.5"
as.numeric(baz)
In this case, factor is not a problem because gsub will de-factor the vector.
If you need to multiply the numbers after that, it is not a problem. Transform this into a function (such as convert_decimal) and apply it to columns you want:
data$column = convert_decimal(data$column)
For multiple selected columns (let's call the vector of names selection):
data[selection] = lapply(data[selection], convert_decimal)
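The answer leaves convert_decimal undefined, so here is one possible sketch of it. It assumes the format described above ("." as thousand separator, "," as decimal separator); the explicit as.character step is added here so that factor columns are handled by their labels.

```r
# Hypothetical helper matching the name used in the answer
convert_decimal <- function(v) {
  v <- as.character(v)                 # factors become their labels
  v <- gsub(".", "", v, fixed = TRUE)  # drop thousand separators
  v <- sub(",", ".", v, fixed = TRUE)  # decimal comma -> decimal point
  as.numeric(v)
}

res <- convert_decimal(factor(c("1.000.000,5", "12,25")))
res
```

Values that use "." as the decimal mark would be misread by this function, so it only applies once the format of each column is known.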

gsub extracting a specific number from a string regex optional comma

I need to extract a specific number from strings in a vector that look like this:
V1 V2 info
XX YY AB=414312;CD=0.5555;EF=1234;GH=2346;IJ=551;AA_CD=0.4633
VV ZZ AB=1093;CD=0.4444,0.78463;EF=1654;GH=6546;IJ=1241;AA_CD=0.4366
I only want to extract the number from "CD=XXX" (notice there is also a "AA_CD=XXXX" in every row)
I currently have:
df$info <- as.numeric(gsub("^.*;CD=[0-9, ],?|;.*$", "", df$info))
Which grabs the number after "CD=" in instances where there is not more than one number separated by a comma.
I need this to include the rows in which there are more than one number separated by commas.
My regex only works for rows in which there is only one number in that spot, like so:
0.5555
0.4444,0.78463
0.0123
0.34,0.54,0.765
I know it is probably a silly mistake I am making...Thanks in advance!!!
Here is an approach
lapply(strsplit(gsub("^.*;CD=(0\\.[0-9]),?|;.*$", "\\1", vec), ","), as.numeric)
gsub("^.*;CD=(0\\.[0-9]),?|;.*$", "\\1", vec) #extracts the numbers
#output
#[1] "0.5555"         "0.4444,0.78463"
These strings are then split at "," with strsplit, which produces a list,
and lapply applies as.numeric to each list element.
If you do not need to keep track of which vector element each number came from:
as.numeric(unlist(strsplit(gsub("^.*;CD=(0\\.[0-9]),?|;.*$", "\\1", vec), ",")))
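The pattern above relies on the value starting with "0." followed by a digit. A variant that does not depend on the leading digits is sketched below; it is an alternative to the answer's regex, not the same one, and it assumes, like the original, that CD= is always preceded by a semicolon and that fields are separated by ";". The names cd_field and cd_values are illustrative.

```r
vec <- c("AB=414312;CD=0.5555;EF=1234;GH=2346;IJ=551;AA_CD=0.4633",
         "AB=1093;CD=0.4444,0.78463;EF=1654;GH=6546;IJ=1241;AA_CD=0.4366")

# Capture everything between ";CD=" and the next ";"
# (";CD=" cannot match inside ";AA_CD=" because of the "AA_" in between)
cd_field <- sub("^.*;CD=([^;]*).*$", "\\1", vec)

# Split multi-valued fields on the comma and convert each piece
cd_values <- lapply(strsplit(cd_field, ",", fixed = TRUE), as.numeric)
cd_values
```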

Split a character column from a dataframe based on specific token

I have a dataframe df and the first column looks like this:
[1] "760–563" "01455–1" "4672–04" "11–31234" "22–12" "11111–53" "111–21" "17–356239" "14–22352" "531–353"
I want to split that column on -.
What I'm doing is
strsplit(df[,1], "-")
The problem is that it's not working. It returns a list without splitting the elements. I already tried adding the parameter fixed = TRUE and putting a regular expression in the split parameter, but nothing worked.
What is weird is that if I replicate the column on my own, for example:
myVector <- c("760–563", "01455–1", "4672–04", "11–31234", "22–12", "11111–53", "111–21", "17–356239", "14–22352", "531–353")
and then apply the strsplit, it works.
I already checked my column type and class with
class(df[,1]) and typeof(df[,1]), and both return character, so that's fine.
I was also using the dataframe with dplyr, so it was of the type tbl_df. I converted it back to a dataframe, but that didn't work either.
I also tried apply(df, 2, function(x) strsplit(x, "-", fixed = TRUE)), but that didn't work either.
Any clues?
I don't know how you did it, but you have two different types of dashes:
charToRaw(substr("760–563", 4, 4))
#[1] 96   (in a Windows-1252 locale; in UTF-8 the en dash is e2 80 93)
charToRaw("-")
#[1] 2d
So the strsplit() is working just fine, it's just that the dash isn't there in your original data. Adjust this, and away you go:
strsplit("760–563", "–")
#[[1]]
#[1] "760" "563"
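If the data may contain a mix of en dashes and ordinary hyphens, one option is to normalize the dash first and then split. This is a small sketch under that assumption; "\u2013" is the Unicode escape for the en dash, and the vector x is made up for illustration.

```r
# One element with an en dash, one with a plain hyphen
x <- c("760\u2013563", "22-12")

# Replace every en dash with a hyphen, then split on the hyphen
normalized <- gsub("\u2013", "-", x, fixed = TRUE)
parts <- strsplit(normalized, "-", fixed = TRUE)
parts
```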
You can just split on a non-numeric character:
library(dplyr)
library(tidyr)
data %>%
  separate(your_column,
           c("first_number", "second_number"),
           sep = "[^0-9]")

Concatenate rows of a data frame

I would like to take a data frame with characters and numbers, and concatenate all of the elements of each row into a single string, which would be stored as a single element in a vector. As an example, I make a data frame of letters and numbers, and then I would like to concatenate the first row via the paste function, and hopefully return the value "A1".
df <- data.frame(letters = LETTERS[1:5], numbers = 1:5)
df
## letters numbers
## 1 A 1
## 2 B 2
## 3 C 3
## 4 D 4
## 5 E 5
paste(df[1,], sep =".")
## [1] "1" "1"
So paste is converting each element of the row into an integer that corresponds to the 'index of the corresponding level', as if it were a factor, and it keeps it as a vector of length two. (I know/believe that factors that are coerced to characters behave in this way, but since R is not storing df[1,] as a factor at all, as tested by is.factor(), I can't verify that it is actually an index for a level.)
is.factor(df[1,])
## [1] FALSE
is.vector(df[1,])
## [1] FALSE
So if it is not a vector then it makes sense that it is behaving oddly, but I can't coerce it into a vector
> is.vector(as.vector(df[1,]))
[1] FALSE
Using as.character did not seem to help in my attempts.
Can anyone explain this behavior?
While others have focused on why your code isn't working and how to improve it, I'm going to try and focus more on getting the result you want. From your description, it seems you can readily achieve what you want using paste:
df <- data.frame(letters = LETTERS[1:5], numbers = 1:5, stringsAsFactors=FALSE)
paste(df$letters, df$numbers, sep="")
## [1] "A1" "B2" "C3" "D4" "E5"
You can change df$letters to character using df$letters <- as.character(df$letters) if you don't want to use the stringsAsFactors argument.
But let's assume that's not what you want. Let's assume you have hundreds of columns and you want to paste them all together. We can do that with your minimal example too:
df_args <- c(df, sep="")
do.call(paste, df_args)
## [1] "A1" "B2" "C3" "D4" "E5"
EDIT: Alternative method and explanation:
I realised the problem you're having is a combination of the fact that you're using a factor and that you're using the sep argument instead of collapse (as @adibender picked up). The difference is that sep gives the separator between two separate vectors and collapse gives the separator within a vector. When you use df[1,], you supply a single vector to paste and hence you must use the collapse argument. Using your idea of getting every row and concatenating them, the following line of code will do exactly what you want:
apply(df, 1, paste, collapse="")
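The difference between sep and collapse can be seen on two small vectors. This is only an illustration with made-up vectors a and b:

```r
a <- c("A", "B")
b <- 1:2

# sep joins corresponding elements of different vectors; length is kept
by_sep <- paste(a, b, sep = "")           # "A1" "B2"

# collapse joins the elements within the result into one string
by_collapse <- paste(a, collapse = "")    # "AB"

# Both together: first join element-wise with sep, then collapse
both <- paste(a, b, sep = "-", collapse = "+")   # "A-1+B-2"
```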
Ok, now for the explanations:
Why won't as.list work?
as.list converts an object to a list. So it does work. It will convert your dataframe to a list and subsequently ignore the sep="" argument. c combines objects together. Technically, a dataframe is just a list where every column is an element and all elements have to have the same length. So when I combine it with sep="", it just becomes a regular list with the columns of the dataframe as elements.
Why use do.call?
do.call allows you to call a function using a named list as its arguments. You can't just throw the list straight into paste, because it doesn't like dataframes. It's designed for concatenating vectors. So remember that df_args is a list containing a vector of letters, a vector of numbers and sep, which is a length 1 vector containing only "". When I use do.call, the resulting paste call is essentially paste(letters, numbers, sep = "").
But what if my original dataframe had columns "letters", "numbers", "squigs", "blargs" after which I added the separator like I did before? Then the paste function through do.call would look like:
paste(letters, numbers, squigs, blargs, sep = "")
So you see it works for any number of columns.
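As a quick check of the four-column case, here is a sketch with invented data; the column names squigs and blargs come from the text above, while the values and the name df4 are made up.

```r
df4 <- data.frame(letters = c("A", "B"),
                  numbers = 1:2,
                  squigs  = c("x", "y"),
                  blargs  = c(TRUE, FALSE),
                  stringsAsFactors = FALSE)

# c() appends sep to the list of columns; do.call spreads them as arguments
pasted <- do.call(paste, c(df4, sep = ""))
pasted
```

Note that a column that is itself named sep or collapse would clash with paste's own arguments, so this trick assumes no such column names.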
For those using library(tidyverse), you can simply use the unite function.
new.df <- df %>%
  unite(together, letters, numbers, sep="")
This will give you a new column called together with A1, B2, etc.
This is indeed a little weird, but this is also what is supposed to happen.
When you create the data.frame as you did, the column letters is stored as a factor (in R versions before 4.0.0, stringsAsFactors defaults to TRUE). A factor is stored internally as integer codes that index its levels, so when as.numeric() is applied to a factor it returns those codes. For example:
> df[, 1]
[1] A B C D E
Levels: A B C D E
> as.numeric(df[, 1])
[1] 1 2 3 4 5
A is the first level of the factor df[, 1], so A is converted to the value 1 when as.numeric is applied. Something similar happens when you call paste(df[1, ]): the one-row data frame is coerced to character element by element, and the factor element is turned into its integer code rather than its label.
To concatenate both columns, you first need to convert the first column to character:
df[, 1] <- as.character(df[, 1])
paste(df[1,], collapse = "")
As @sebastian-c pointed out, you can also use stringsAsFactors = FALSE in the creation of the data.frame; then you can omit the as.character() step.
If you want to start with
df <- data.frame(letters = LETTERS[1:5], numbers = 1:5, stringsAsFactors=TRUE)
...then there is no general rule about how df$letters will be interpreted by any given function. It's a factor for modelling functions, a character vector for some and an integer vector for some others. Even the same function, such as paste, may interpret it differently depending on how you use it:
paste(df[1,], collapse="") # "11"
apply(df, 1, paste, collapse="") # "A1" "B2" "C3" "D4" "E5"
No logic in it except that it will probably make sense once you know the internals of every function.
The factors seem to be converted to integers when an argument is converted to a vector. As you know, data frames are lists of vectors of equal length, so the first row of a data frame is also a list, and when it is forced to be a vector, something like this happens:
df[1,]
# letters numbers
# 1 A 1
unlist(df[1,])
# letters numbers
# 1 1
I don't know how apply achieves what it does (i.e., factors are represented by character values) -- if you're interested, look at its source code. It may be useful to know, though, that you can trust (in this specific sense) apply (in this specific occasion). More generally, it is useful to store every piece of data in a sensible format, that includes storing strings as strings, i.e., using stringsAsFactors=FALSE.
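One likely explanation is that apply first coerces a data frame with as.matrix, and as.matrix turns factor columns into their labels when it builds a character matrix. The sketch below, using a shortened three-row version of the example, illustrates that behaviour; it is a reading of how base R behaves rather than something stated in the answer.

```r
df <- data.frame(letters = LETTERS[1:3], numbers = 1:3,
                 stringsAsFactors = TRUE)

# as.matrix yields a character matrix holding the factor labels
m <- as.matrix(df)
first_row <- unname(m[1, ])

# apply then works row by row on that character matrix
by_row <- unname(apply(df, 1, paste, collapse = ""))
by_row
```

Be aware that numeric columns are formatted to a common width during this coercion, so columns with values of different widths can pick up leading spaces; do.call(paste, ...) avoids that.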
Btw, every introductory R book should have this idea in a subtitle. For example, my plan for retirement is to write "A (not so) gentle introduction to the zen of data fishery with R, the stringsAsFactors=FALSE way".
