Getting last character/number of data frame column - r

I'm trying to get the last character (or digit) of each string in a data frame column so I can filter some categories afterwards, but I'm not getting the expected result.
names = as.character(c("ABC Co","DEF Co","XYZ Co"))
code = as.character(c("ABCN1","DEFMO2","XYZIOIP4")) #variable length
my_df = as.data.frame(cbind(names,code))
First Approach:
my_df[,3] = substr(my_df[,2],length(my_df[,2]),length(my_df[,2]))
What I expected to receive was: c("1","2","4")
What I am really receiving is : c("C","F","Z")
Then, I realized that length(my_df[,2]) is the number of rows of my data frame, and not the length of each cell. So, I decided to create this loop:
for (i in length(nrow(my_df))){
my_df[i,3] = substr(my_df[i,2],length(my_df[i,2]),length(my_df[i,2]))
}
What I expected to receive was: c("1","2","4")
What I am really receiving is : c("A","F","Z")
So then I tried:
for (i in length(nrow(my_df))){
my_df[i,3] = substr(my_df[i,2],-1,-1)
}
What I expected to receive was: c("1","2","4")
What I am really receiving is : c("","F","Z")
No luck so far. Any thoughts on what I am missing? Thank you very much!

length counts the elements of a vector (or list), whereas substr needs the number of characters in each string. Base R's nchar gives you that.
my_df = as.data.frame(cbind(names, code), stringsAsFactors = FALSE)
substr(my_df[,2], nchar(my_df[,2]), nchar(my_df[,2]))
# [1] "1" "2" "4"
(I added stringsAsFactors = FALSE, otherwise you'll need to add as.character.)
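To make the difference between length and nchar concrete, here is a minimal sketch in base R. The helper name last_char is made up for illustration; it converts its input with as.character first, on the assumption that the column might be a factor.

```r
code <- c("ABCN1", "DEFMO2", "XYZIOIP4")

length(code)  # number of elements in the vector: 3
nchar(code)   # number of characters in each string: 5 6 8

# Hypothetical helper: last character of every element of x
last_char <- function(x) {
  x <- as.character(x)  # also handles factor columns
  n <- nchar(x)
  substr(x, n, n)
}

last_char(code)          # "1" "2" "4"
last_char(factor(code))  # same result for a factor
```

Because nchar and substr are both vectorized, no loop over the rows is needed.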

If the last character is always a number you can do:
library(stringr)
str_extract(my_df$code, "\\d$")
[1] "1" "2" "4"
If the last character can be anything you can do this:
str_extract(my_df$code, ".$")

You can use substr:
my_df$last_char <- substr(code, nchar(code), nchar(code))
# or my_df$last_char <- substr(my_df$code, nchar(my_df$code), nchar(my_df$code))
Output
my_df
# names code last_char
# 1 ABC Co ABCN1 1
# 2 DEF Co DEFMO2 2
# 3 XYZ Co XYZIOIP4 4

We can use sub
sub(".*(\\d+$)", "\\1", my_df$code)

Related

Checking if a vector starts with a number

I have a pretty straight forward question. Sorry if this has already been asked somewhere, but I could not find the answer...
I want to check if gene names start with a number, and if they do, I want to add 'aaaa_' to the gene name. Therefore I used the following code:
geneName <- "2310067B10Rik"
if (is.numeric(substring(geneName, 1, 1))) {
geneName <<- paste("aaaa_", geneName, sep="")
}
What I want to get back is aaaa_2310067B10Rik. However, is.numeric returns FALSE, because substring gives "2" in quotation marks, i.e. as a character. I've also tried noquote(), but that didn't work, and wrapping the substring in as.numeric(), but then the if code is also applied to genes that don't start with a number. Any suggestions? Thanks!
Here is a solution with regex (Learning Regular Expressions):
geneName <- c("2310067B10Rik", "Z310067B10Rik")
sub("^(\\d)", "aaaa_\\1", geneName)
or as a Perl-flavoured variant (thanks to @snoram):
sub("^(?=\\d)", "aaaa_", geneName, perl = TRUE)
Using the replace() function:
start_nr <- grep("^\\d", geneName)
replace(geneName, start_nr, paste0("aaaa_", geneName[start_nr]))
[1] "aaaa_2310067B10Rik" "foo" "aaaa_9bar"
Where:
geneName <- c("2310067B10Rik", "foo", "9bar")
geneName <- c("2310067B10Rik", "foo")
ifelse(substring(geneName, 1,1) %in% c(0:9), paste0("aaaa_", geneName), geneName)
[1] "aaaa_2310067B10Rik" "foo"
Or based on above comment, you could replace substring(geneName, 1,1) %in% c(0:9) by grepl("^\\d", geneName)
Using regex:
You can check whether geneName starts with a digit and, if it does, prepend the prefix as follows:
geneName <- "2310067B10Rik"
ifelse(grepl("^[0-9]", geneName), paste("aaaa", geneName, sep = "_"), geneName)
Output:
[1] "aaaa_2310067B10Rik"
geneName=function(x){
if( grepl("^[0-9]",x) ){
as.character(glue::glue('aaaa_{x}'))
}else{x}
}
> geneName("2310067B10Rik")
[1] "aaaa_2310067B10Rik"
> geneName("sdsad")
[1] "sdsad"
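None of the answers above spells out why the original is.numeric check fails, so here is a minimal sketch of that, plus a vectorized variant of the prefixing. The function name add_prefix and its prefix argument are made up for illustration.

```r
first <- substring("2310067B10Rik", 1, 1)
first               # "2", a character string
is.numeric(first)   # FALSE: is.numeric tests the type, not the content

# Hypothetical helper: prepend a prefix to every element starting with a digit
add_prefix <- function(x, prefix = "aaaa_") {
  starts_with_digit <- grepl("^[0-9]", x)
  x[starts_with_digit] <- paste0(prefix, x[starts_with_digit])
  x
}

add_prefix(c("2310067B10Rik", "foo", "9bar"))
# "aaaa_2310067B10Rik" "foo" "aaaa_9bar"
```

Testing the content of the string with grepl avoids any conversion to a number, so names that do not start with a digit pass through unchanged.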

R: For loop works on list, not individual element

I'm trying to learn by writing a function. It should convert the UOM (unit of measure) into a fraction of the standard UOM. In this case, 1/10 or 0.1
I'm trying to loop through a list generated from strsplit, but I only get the whole list, not each element in the list. I can't figure out what I'm doing wrong. Is strsplit the wrong function? I don't think the problem is in strsplit, but I can't figure out what I'm doing wrong in the For loop:
qty<-0
convf<-0
uom <- "EA"
std <- "CA"
pack <-"1EA/10CA"
if(uom!=std){
s<-strsplit(pack,split = '/')
for (i in s){
print(i)
if(grep(uom,i)){
qty<- regmatches(i,regexpr('[0-9]+',i))
}
if(grep(std,i)){
convf<-regmatches(i, regexpr('[0-9]+',i))
}
} #end for
qty<-as.numeric(qty)
convf<-as.numeric(convf)
}
return(qty/convf)
Maybe the problem is the indexing of the list. Have you tried using [[1]] after the strsplit call?
Example:
string <- "Hello/World"
mylist <- strsplit(string, "/")
mylist
## [[1]]
## [1] "Hello" "World"
But if we explicitly ask for the first element of the list with [[1]], we get the character vector of pieces itself:
Example:
string <- "Hello/World"
mylist <- strsplit(string, "/")[[1]]
mylist
## [1] "Hello" "World"
Hope this helps with your problem.
There are a few issues here. The main problem you are having is that s is a list of length 1. Within that list, the first (only) element is a vector of length 2. Consequently, you would need to set i in s[[1]].
However, we can go one step further. Try the following code:
library(stringr)
lapply(strsplit(pack,split = '/'), # works within the list, can handle larger vectors for `pack`
function(x, uom, std) {
reg_expr <- paste(uom,std, sep = "|") # call this on its own, it's just searching for the text saved in uom or std
qty <- as.numeric(str_remove(x, reg_expr)) # removes that text and converts the string to a number
names(qty) <- str_extract(x, reg_expr) # extracts the text and uses it to name elements in qty
qty[uom] / qty[std] # your desired result.
},
uom = uom, # since these are part of the function call, we need to specify what they are. This is where you should change them.
std = std)
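For comparison, the original loop can also be kept and repaired with the s[[1]] fix described above. This is only a sketch: the function name convert_uom is made up, returning 1 when both units are equal is an assumption about the intended behaviour, and grepl replaces grep because if (grep(...)) fails when there is no match.

```r
# Hypothetical wrapper around the question's loop
convert_uom <- function(pack, uom, std) {
  if (uom == std) return(1)  # assumption: same unit means a factor of 1

  # strsplit returns a list; [[1]] extracts the character vector of pieces
  parts <- strsplit(pack, split = "/", fixed = TRUE)[[1]]

  qty <- NA_real_
  convf <- NA_real_
  for (part in parts) {
    # leading number of this piece, e.g. "10CA" -> 10
    n <- as.numeric(regmatches(part, regexpr("[0-9]+", part)))
    # grepl returns TRUE/FALSE, which is what if() needs
    if (grepl(uom, part, fixed = TRUE)) qty <- n
    if (grepl(std, part, fixed = TRUE)) convf <- n
  }
  qty / convf
}

convert_uom("1EA/10CA", uom = "EA", std = "CA")  # 0.1
```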
I don't know if this is what you're trying to practice, but I'd avoid loops while extracting the digits from a string like "1EA/10CA". If it helps, the column lst is actually a list inside of a dataset.
library(magrittr)
ds <- data.frame(pack = c("1EA/10CA", "1EA/4CA", "2EA/2CA"))
pattern <- "^(\\d+)EA/(\\d+)CA$"
ds %>%
dplyr::mutate(
qty = as.numeric(sub(pattern, "\\1", pack)),
convf = as.numeric(sub(pattern, "\\2", pack)),
ratio = qty / convf,
lst = purrr::map2(qty, convf, ~list(qty=.x[[1]], convf=.y[[1]]))
)
Result:
      pack qty convf ratio   lst
1 1EA/10CA   1    10  0.10 1, 10
2  1EA/4CA   1     4  0.25  1, 4
3  2EA/2CA   2     2  1.00  2, 2

R typecast NA vs string

I had a problem with some of my code and I fixed it but don't fully understand why the error was an error
The code looked like:
for(i in 1:3){df = rbind.fill(z, data.frame(id=i,
data=if(is.null(x$results[[i]]$synopsis$data))
{NA}else{x$results[[i]]$synopsis$data}))}
The issue I had was, if the first data value was indeed null, I would get NA but then for the second and third I would either get another NA or if there was data I wouldn't get it, I would get 1.
If the first value was data, then I would get the data and for the other two I would either get NA or the correct data.
I'm not a computer scientist, but a dev who sits near me (and doesn't know R) suggested it had something to do with NA and a string having different types. To solve the issue I changed the NA to "0" (I suppose "NA" would work too).
I'd just like a more thorough explanation of what was happening. My layman's understanding is that if NA is the first result, then every result is stored in that "format", where something is either NA or not, and "not" is handled as 1, which is kind of like a Boolean response?
Example:
my.list <- list(list(),structure(
list(
experience = structure(
list(
start = "Hi"
),.Names = c("start")),
`_meta` = structure(
list(weight = 1L, `_sources` = list(structure(
list(`_origin` = "a"), .Names = "_origin"
))),.Names = c("weight", "_sources"))),.Names = c("experience", "_meta")))
my.list[[1]]$experience$start
NULL
my.list[[2]]$experience$start
[1] "Hi"
df <- NULL
for(i in 1:2){df = rbind.fill(df, data.frame(id=i,
data=if(is.null(my.list[[i]]$experience$start))
{NA}else{my.list[[i]]$experience$start}))}
Then
df2 <- NULL
for(i in 1:2){df2 = rbind.fill(df2, data.frame(id=i,
data=if(is.null(my.list[[i]]$experience$start))
{"NA"}else{my.list[[i]]$experience$start}))}
Results:
df:          df2:
id data      id data
1  NA        1  NA
2  1         2  Hi
Olivia, thanks for clarifications.
You are nearly there. As you loop, indeed the first iteration will determine the class of the column data of your output data.frame df.
In scenario 1, you can have a better idea by going through the loop step by step:
df <- NULL
i=1
df = rbind.fill(df, data.frame(id=i,
data=if(is.null(my.list[[i]]$experience$start)) {NA}
else{my.list[[i]]$experience$start}))
df
id data
1 1 NA
Then, have a look at class of df$data
class(df$data)
[1] "logical"
Which is derived from: mode(NA) (logical).
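The type issue can be reproduced in a few lines of base R. This is a sketch under two assumptions: that strings were being converted to factors (the data.frame default before R 4.0.0), which is probably where the 1 in df comes from, and that base rbind is an acceptable stand-in for rbind.fill here.

```r
class(NA)             # "logical"
class(NA_character_)  # "character"

# Assigning a factor into a logical vector keeps only its integer code
x <- c(NA, NA)
x[2] <- factor("Hi")
x                     # NA 1

# Using a typed missing value and no factors keeps the column character
rows <- list(
  data.frame(id = 1, data = NA_character_, stringsAsFactors = FALSE),
  data.frame(id = 2, data = "Hi", stringsAsFactors = FALSE)
)
df <- do.call(rbind, rows)
df$data               # NA "Hi"
class(df$data)        # "character"
```

Replacing NA with NA_character_ keeps a real missing value in the column, which the "0" or "NA" string workarounds do not.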
Alternatively, when you store the data from a set of experiments in a list, try to manipulate that list in a more idiomatic R way.
For instance, you can try:
sapply(my.list, FUN=function(element)element$experience$start)
[[1]]
NULL
[[2]]
[1] "Hi"
This highlights that you are trying to combine incompatible contents: the result cannot be simplified beyond this list, and if you unlist it you lose the meaningful NULL.

Get name of every 6th element out of string vector using R

Hi guys, I am trying to use R for my data analysis. I read in a data table, grab the columns I need and get rid of all blanks. I then compute the median of each block of 6 measurements and go on until the end. My problem is that I also need the name of the sample. So I tried to get every 6th entry from the iC_name vector and transfer it to my iC_name_short vector, but that doesn't work at all. Any suggestions?
daten = read.csv(file="test.csv",head=TRUE,sep=",")
trash <- grep("Blank", daten[,3])
daten <- daten[-c(trash),]
iAv_temp <- grep("Intensity average" , daten[,2])
iAv <- daten[c(iAv_temp),]
iAv_name <- iAv[,3]
iAv <- iAv[,4]
iC_temp <- grep("Concentration average" , daten[,2])
iC <- daten[c(iC_temp),]
iC_name <-iC[,3]
iC <- iC[,4]
i=1
x=6
j=1
leng=length(iC)
leng_t=leng/6
iC_MW <- c(1:leng_t)
while (j<=leng_t)
{
iC_name_short<-iC_name[i]
temp <- iC[i:x]
iC_MW[j]<-median(temp)
i<-i+6
x<-x+6
j<-j+1
}
You can use the by argument of the seq function to do that. For example, to extract every sixth letter:
letters[seq(1,26,by =6)]
[1] "a" "g" "m" "s" "y"
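Applied to the question, the same seq call can pick the sample names, and the while loop can be replaced by a grouped median. This is a minimal sketch: the contents of iC_name and iC below are made-up stand-ins for the real data, and it assumes the vector length is an exact multiple of 6.

```r
# Made-up stand-ins for the vectors from the question
iC_name <- rep(c("S1", "S2", "S3"), each = 6)
iC <- 1:18

# Index of the first measurement in each block of 6
first_idx <- seq(1, length(iC), by = 6)

# One name per block
iC_name_short <- iC_name[first_idx]

# Block number for every measurement, then the median per block
group <- rep(seq_along(first_idx), each = 6)
iC_MW <- as.numeric(tapply(iC, group, median))

iC_name_short  # "S1" "S2" "S3"
iC_MW          # 3.5 9.5 15.5
```

In the original loop, iC_name_short is overwritten on every pass, so only the last name survives; indexing with the whole first_idx vector keeps all of them.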

Losing data by using the function unlist

I have a simple but strange problem.
indices.list is a list containing 118,771 elements (integer or numeric). By applying the function unlist I lose about 500 elements.
Look at the following code:
> indices <- unlist(indices.list, use.names = FALSE)
>
> length(indices.list)
[1] 118771
> length(indices)
[1] 118248
How is that Possible?? I checked if indices.list contains any NA. But it does not:
> any(is.na(indices.list) == TRUE)
[1] FALSE
data.set.merged is a dataframe containing more than 200,000 rows. When I use the vector indices (which apparently has the length 118,248) in order to get a subset of data.set.merged, I get a dataframe with 118,771 rows!?? That's so strange!
data.set.merged.2 <- data.set.merged[indices, ]
> nrow(data.set.2)
[1] 118771
Any ideas what's going on here?
Well, for your first mystery, the likely explanation is that some elements of indices.list are NULL, which means they will disappear when you use unlist:
unlist(list(a = 1,b = "test",c = 2,d = NULL, e = 5))
a b c e
"1" "test" "2" "5"

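To check whether that explanation applies, lengths() can locate the empty elements. This is a minimal sketch with a made-up five-element list standing in for indices.list; whether the real list contains NULL or zero-length entries is an assumption.

```r
# Made-up stand-in for the real list
indices.list <- list(1L, NULL, 3L, integer(0), 5L)

length(indices.list)               # 5
length(unlist(indices.list))       # 3: NULL and integer(0) contribute nothing

# is.na does not flag empty elements, so this check reports FALSE
any(is.na(indices.list))           # FALSE

# lengths() gives the length of every element; zero marks the dropped ones
which(lengths(indices.list) == 0)  # 2 4

# To keep one value per position, fill the empty slots with NA first
filled <- indices.list
filled[lengths(filled) == 0] <- list(NA_integer_)
unlist(filled)                     # 1 NA 3 NA 5
```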