I have one vector created using the following code:
vectorA<-c(1.125,2.250,3.501)
I have another vector stored in a data frame:
vectordf<-data.frame()
vectordf[1,1]<-'1.125,2.250,3.501'
vectorB<-vectordf[1,1]
I need vectorB to be the same as vectorA so I can use it in another function. Right now the two vectors are different as shown below:
printerA<-paste("vectorA=",vectorA)
printerB<-paste("vectorB=",vectorB)
print(printerA)
print(printerB)
dput(vectorA)
dput(vectorB)
[1] "vectorA= 1.125" "vectorA= 2.25" "vectorA= 3.501"
[1] "vectorB= 1.125,2.250,3.501"
c(1.125, 2.25, 3.501)
"1.125,2.250,3.501"
How can I get vectorB into the same format as vectorA? I have tried using as.numeric, as.list, as.array, as.matrix.
This can be done with scan.
printerB<-paste("vectorB=", scan(text = vectordf[1,1], sep = ','))
And now printerA and printerB are
printerA
#[1] "vectorA= 1.125" "vectorA= 2.25" "vectorA= 3.501"
printerB
#[1] "vectorB= 1.125" "vectorB= 2.25" "vectorB= 3.501"
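As a rough check (a sketch, not part of the original answer), the scan() result can be compared with vectorA directly. The data frame is rebuilt here in one call purely for illustration, and quiet = TRUE only suppresses the "Read 3 items" message.

```r
vectorA <- c(1.125, 2.250, 3.501)

# Assumed stand-in for the question's data frame: one cell holding the string
vectordf <- data.frame(V1 = "1.125,2.250,3.501", stringsAsFactors = FALSE)

# Parse the comma-separated string into a numeric vector
vectorB <- scan(text = vectordf[1, 1], sep = ",", quiet = TRUE)

is.numeric(vectorB)                  # TRUE
isTRUE(all.equal(vectorA, vectorB))  # TRUE
```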
The problem is that what you've called "vectorB" isn't quite the vector you imagine it to be: it's a character vector of length 1 whose single element is a string of numbers separated by commas.
Your idea to use as.numeric() is good, but as.numeric() doesn't quite know how to parse the string with commas as a vector of distinct numbers. So, you first want to split the string:
vectorB <- unlist(strsplit(vectorB, ",", fixed = TRUE))
The strsplit() call chops vectorB into separate pieces wherever it finds a comma. It returns a list, so we flatten that back down to a vector with unlist(). Then your as.numeric() idea will work:
vectorB <- as.numeric(vectorB)
Obviously you can clean that up into a single line if you wish, but I wanted to clearly illustrate where the hole in your strategy was.
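For reference, a single-line version might look like the sketch below. The name vectorB_raw is only an illustrative stand-in for the string taken from the data frame.

```r
vectorA <- c(1.125, 2.250, 3.501)
vectorB_raw <- "1.125,2.250,3.501"  # stand-in for vectordf[1, 1]

# Split on commas, take the first (only) list element, convert to numeric
vectorB <- as.numeric(strsplit(vectorB_raw, ",", fixed = TRUE)[[1]])

isTRUE(all.equal(vectorA, vectorB))  # TRUE
```

Indexing with [[1]] does the same job as unlist() here because the input has length 1.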
To make the answer more complete: the reason why this mismatch happened at all was in this line early on in your code:
vectordf[1,1]<-'1.125,2.250,3.501'
The object on the right side of <- is a character vector of length 1, not a numeric vector of length 3. To fix this, you could have used
vectordf[1:3, 1] <- c(1.125, 2.25, 3.501)
because the type of the object on the right is now a numeric vector of length 3. Note that we had to adjust the indexing on the left side by changing the row index to be 1:3.
I'd like to sort a list based on more than the first character of each item in that list. The list contains chr data though some of those characters are digits. I've been trying to use a combination of substr() and order but to no avail.
For example:
mylist <- c('0_times','3-10_times','11-20_times','1-2_times','more_than_20_times')
mylist[order(substr(mylist,1,2))]
However, this results in 11-20_times being placed prior to 3-10_times:
[1] "0_times" "1-2_times" "11-20_times" "3-10_times" "more_than_20_times"
Update
To provide further detail on the use case.
My data is similar to the following:
mydf <- data.frame(X1=c("0_times","3-10_times", "11-20_times", "1-2_times","3-10_times",
"0_times","3-10_times", "11-20_times", "1-2_times","3-10_times" ),
X2=c('a','b','c','d','e','a','b','c','d','e'))
mydf2 <- data.frame(names = colnames(mydf))
mydf2$vals <- lapply(mydf, unique)
It is the vectors in mydf2$vals that I would like to sort. While the solution from @AllanCameron works perfectly on a single vector, I'd like to apply it to each vector contained within mydf2$vals but cannot figure out how.
I have attempted to use unlist to access the lists contained but again can only do this on an individual row basis:
unlist(mydf2[1,'vals'], use.names=FALSE)
My inexperience is evident here, but I've been struggling with this all day.
This requires a bit of string parsing and converting to numeric:
o <- sapply(strsplit(mylist, '\\D+'), function(x) min(as.numeric(x[nzchar(x)])))
mylist[order(o)]
#> [1] "0_times" "1-2_times" "3-10_times"
#> [4] "11-20_times" "more_than_20_times"
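To apply the same idea to every vector in a list such as mydf2$vals, one option is to wrap the sorting step in a function and call it through lapply(). The following is a minimal sketch: sort_by_min_number is a hypothetical helper name, vals is a small stand-in for mydf2$vals, and sending elements without digits to the end is an assumption rather than something the question specifies.

```r
# Hypothetical helper: sort a character vector by the smallest number
# found in each element; elements without digits get Inf and sort last.
sort_by_min_number <- function(x) {
  x <- as.character(x)  # also copes with factor input
  key <- vapply(strsplit(x, "\\D+"), function(parts) {
    nums <- as.numeric(parts[nzchar(parts)])
    if (length(nums) == 0) Inf else min(nums)
  }, numeric(1))
  x[order(key)]
}

# Stand-in for mydf2$vals: a list of character vectors
vals <- list(
  X1 = c("0_times", "3-10_times", "11-20_times", "1-2_times"),
  X2 = c("a", "b", "c", "d", "e")
)

sorted_vals <- lapply(vals, sort_by_min_number)
sorted_vals$X1
#> [1] "0_times"     "1-2_times"   "3-10_times"  "11-20_times"
```

With the real data this would be something like mydf2$vals <- lapply(mydf2$vals, sort_by_min_number).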
I have 12 sample names consisting of a long character string, but most sample names are different lengths and all have different sample identifiers (e.g. some are labelled JKT-n and some are labelled sample_n or Sample_n). I want to extract only the sample identifier part which is in the middle of the string, labelled as either "JKT-n", "Sample_n" or "sample_n". I'm having difficulty as the identifiers aren't consistent. Here is an example for 3 of them:
data$sample
[1] "Monocytes DF 2_E18_016e_20180411_JKT-6_01_normalized_Ungated_viSNE_FlowSOM.fcs"
[2] "Monocytes DHF 2_E19_014b_20190731_sample_32_01_normalized_Ungated_viSNE_FlowSOM.fcs"
[3] "Monocytes DF 2_E19_014b_20190730_Sample_21_01_normalized_Ungated_viSNE_FlowSOM.fcs"
This is the method I've used to split the strings which got me what I wanted in the end. However, I'm wondering if there's a neater way to do this as it's a bit clunky.
library(readxl)   # read_excel()
library(janitor)  # clean_names()
data <- as.data.frame(clean_names(read_excel("Significant Citrus clusters.xlsx")))
data$tmp <- substr(data$sample,34,nchar(data$sample)-40)
data$tmp2 <- gsub(pattern = c("Sample_"), replacement = "JKT-", x=data$tmp)
data$tmp3 <- gsub(pattern=c("_sample_"), replacement="JKT-", x=data$tmp2)
data$tmp4 <- gsub(pattern="_", replacement="", x=data$tmp3)
data$CohortID <- data$tmp4
data$CohortID
[1] "JKT-6" "JKT-8" "JKT-12" "JKT-21" "JKT-26" "JKT-27" "JKT-4" "JKT-9" "JKT-22" "JKT-30" "JKT-32" "JKT-33"
Thanks
You can combine your third and fourth lines into one line of code by using the regex alternation operator |, applied to the output of the substr() step:
gsub(pattern = "Sample_|_sample_", replacement = "JKT-", x = data$tmp)
Alternatively, the digits which appear in the middle of the strings right after _JKT-, _Sample_, or _sample_ independent of the actual position can be extracted by a regular expression with lookbehind (see https://www.regular-expressions.info/lookaround.html for detailed explanations).
paste0("JKT-", stringr::str_extract(data$sample, "(?<=_(JKT-|[sS]ample_))\\d+"))
[1] "JKT-6" "JKT-32" "JKT-21"
For string manipulation, I prefer the stringr package over the base R functions because of its consistent user interface and function naming.
Sample data
data <- data.frame(sample = c(
"Monocytes DF 2_E18_016e_20180411_JKT-6_01_normalized_Ungated_viSNE_FlowSOM.fcs",
"Monocytes DHF 2_E19_014b_20190731_sample_32_01_normalized_Ungated_viSNE_FlowSOM.fcs",
"Monocytes DF 2_E19_014b_20190730_Sample_21_01_normalized_Ungated_viSNE_FlowSOM.fcs"
))
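If you would rather stay in base R, roughly the same extraction can be sketched with sub() and a capture group instead of a lookbehind. This assumes every sample name contains exactly one of the three markers followed by digits and an underscore; names that do not match are returned unchanged by sub().

```r
sample_names <- c(
  "Monocytes DF 2_E18_016e_20180411_JKT-6_01_normalized_Ungated_viSNE_FlowSOM.fcs",
  "Monocytes DHF 2_E19_014b_20190731_sample_32_01_normalized_Ungated_viSNE_FlowSOM.fcs",
  "Monocytes DF 2_E19_014b_20190730_Sample_21_01_normalized_Ungated_viSNE_FlowSOM.fcs"
)

# Capture the digits that follow _JKT-, _sample_ or _Sample_ and
# rebuild the identifier as "JKT-<digits>"
cohort_id <- sub(
  ".*_(?:JKT-|[sS]ample_)(\\d+)_.*",
  "JKT-\\1",
  sample_names,
  perl = TRUE
)
cohort_id
#> [1] "JKT-6"  "JKT-32" "JKT-21"
```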
I'm trying to extract a specific index in a vector, and I keep getting a strange output. I'm using RStudio and it works fine with string vectors, but I get strange numbers with an "L" after them when I input integers. The same thing happens when I define all_numbers using c(), :, and seq(). Am I doing something incorrectly? I thought I was doing it exactly as my textbook describes it.
# Extracts "Anne" correctly
all_names <- c("Sally", "Pedro", "Anne", "Molly")
extract <- all_names [3]
# Extracts "3L" not 3
all_numbers <- 1:30
extract <- all_numbers[3]
# Extracts "7L" not 7
all_numbers <- 5:30
extract <- all_numbers[3]
# Extracts "12L" not 12
all_numbers <- 10:30
extract <- all_numbers[3]
The L suffix is how R marks a value as an integer rather than a double.
class(1L)
#[1] "integer"
class(1)
#[1] "numeric"
In R, indexing starts at 1, so all_numbers[3] in the 2nd and 3rd cases is correctly 7 and 12 respectively.
I can't find the relevant documentation at the moment, but if I remember correctly an integer takes up less space than a numeric (double) value: 4 bytes versus 8 bytes per element.
If you don't want the L in the output, convert all_numbers to the numeric class.
all_numbers <- as.numeric(all_numbers)
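A short sketch of how the two types behave may help here. The value is the same either way; only the storage type differs, and the L shows up in deparsed output such as dput().

```r
all_numbers <- 5:30
extract <- all_numbers[3]

print(extract)          # [1] 7   (no suffix when printed)
dput(extract)           # 7L      (deparsed form keeps the suffix)
typeof(extract)         # "integer"

extract == 7            # TRUE:  the values compare equal
identical(extract, 7)   # FALSE: integer vs double
identical(extract, 7L)  # TRUE

typeof(as.numeric(extract))  # "double"
```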
I have a list of vectors with the following structure
[953] "c(\"15768\", \"11999\")"
[954] "c(\"18012\", \"4761\", \"1792\", \"18085\", \"18002\", \"18018\", \"8818\", \"8696\")"
[955] "c(\"735\", \"6073\", \"18007\", \"18046\", \"18087\")"
As you can see, each number is a string. These strings repeat across different vectors. What I need is to find out how often each string occurs over the data I have.
I have tried to table, but it doesn't work the way I need.
If it is a character vector in which each element is a single string, one option is to use str_extract_all to extract all the numeric parts, unlist the result, and tabulate it with table:
library(stringr)
tbl <- sort(table(as.numeric(unlist(str_extract_all(vec1,
"\\d+")))), decreasing = TRUE)
Or using base R
sort(table(unlist(regmatches(vec1, gregexpr("\\d+", vec1)))), decreasing = TRUE)
data
vec1 <- c("c(\"15768\", \"11999\")", "c(\"18012\", \"4761\", \"1792\", \"18085\", \"18002\", \"18018\", \"8818\", \"8696\")",
"c(\"735\", \"6073\", \"18007\", \"18046\", \"18087\")")
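To see the counting at work, here is a small check on made-up input in which some IDs repeat across elements. vec_demo and its values are invented for illustration and are not taken from the question's data.

```r
# Made-up input: "11999" appears three times, "735" twice
vec_demo <- c(
  'c("15768", "11999")',
  'c("11999", "735")',
  'c("735", "11999", "6073")'
)

# Pull out every run of digits, flatten, and count occurrences
ids <- unlist(regmatches(vec_demo, gregexpr("[0-9]+", vec_demo)))
counts <- sort(table(ids), decreasing = TRUE)

counts[["11999"]]  # 3
counts[["735"]]    # 2
```

Leaving the IDs as character (no as.numeric()) keeps any leading zeros intact, if that matters for your data.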
I have a data set in which I want to pad zeroes in front of a set of dates that don't have six characters. For example, I have a date that reads 91003 (October 3rd, 2009) and I want it to read 091003, as well as any other date that is missing a zero in front. When I use the sprintf function, the code is:
data1$entrydate <- sprintf("%06d", data1$entrydate)
But what it spits out is something like 000127, or some other seemingly random number, for all the dates in the column. I don't understand what's going on, and I would appreciate some help on the issue. Thanks.
PS. I am sometimes also getting an error message that sprintf is only for character values; I don't know if there is any code for numerical values.
I guess you got different results than expected because the column class was factor. You can convert the column to numeric either by as.numeric(as.character(datacolumn)) or as.numeric(levels(datacolumn))[datacolumn]. According to ?factor
To transform a factor ‘f’ to approximately its
original numeric values, ‘as.numeric(levels(f))[f]’ is recommended
and slightly more efficient than ‘as.numeric(as.character(f))’.
So, you can use
levels(data1$entrydate) <- sprintf('%06d', as.numeric(levels(data1$entrydate)))
Example
Here is an example that shows the problem
v1 <- factor(c(91003, 91104,90103))
sprintf('%06d', v1)
#[1] "000002" "000003" "000001"
Or, it is equivalent to
sprintf('%06d', as.numeric(v1)) #the formatted numbers are
# the numeric index of factor levels.
#[1] "000002" "000003" "000001"
When you convert the levels back to numeric, it works as expected (the result is in level order, not the original element order)
sprintf('%06d', as.numeric(levels(v1)))
#[1] "090103" "091003" "091104"
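If you want the padded values in the original element order rather than in level order, a sketch along these lines should do it. It converts the factor back through its labels, using the same v1 as above.

```r
v1 <- factor(c(91003, 91104, 90103))

# Go through the labels, not the internal integer codes
padded <- sprintf("%06d", as.integer(as.character(v1)))
padded
#[1] "091003" "091104" "090103"

# The idiom recommended in ?factor gives the same result
padded2 <- sprintf("%06d", as.integer(levels(v1))[v1])
identical(padded, padded2)
#[1] TRUE
```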