Match unlist output against set of column names - r

Here is sample data:
main.data <- c("id","num","open","close","char","gene","valid")
data.step.1 <- list(id="12",num="00",open="01-01-2015",char="yes",gene="1234",valid="NA")
match.step.1 <- unlist(data.step.1)
main.data holds the names of all possible columns.
I have a loop that streams data step by step, and any step may be missing some columns (list names).
I would like to match each step (data.step.n) against the master column names (main.data).
Desired output:
id num open close char gene valid
"12" "00" "01-01-2015" "" "yes" "1234" "NA"
How can I unlist the data and match it against the names so that a missing entry (close, in this case) is filled with an empty string?

Try
v1 <- setNames(rep('', length(main.data)), main.data)
v1[main.data %in% names(match.step.1)] <- match.step.1
Or use match
v1[match(names(match.step.1), main.data)] <- match.step.1
Or just use [
v2 <- setNames(match.step.1[main.data], main.data)
v2[is.na(v2)] <- ''
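Since the question mentions a loop over several steps, the name-based lookup can be wrapped in a small helper. This is only a sketch: align_step and the second step in steps are made up for illustration and are not part of the original post. Because it matches by name, it does not depend on the order of the entries within a step.

```r
main.data <- c("id", "num", "open", "close", "char", "gene", "valid")

# Hypothetical helper: align one step's list to the master column names,
# filling columns that are absent from the step with "".
align_step <- function(step, columns) {
  values <- unlist(step)
  out <- setNames(rep("", length(columns)), columns)
  known <- intersect(names(values), columns)  # names present in both
  out[known] <- values[known]
  out
}

# The first step is the sample data; the second is an invented step
# with different missing columns and a different ordering.
steps <- list(
  list(id = "12", num = "00", open = "01-01-2015", char = "yes",
       gene = "1234", valid = "NA"),
  list(num = "01", id = "13", close = "02-01-2015")
)

# One row per step, one column per master column name
aligned <- t(sapply(steps, align_step, columns = main.data))
```

Each row of aligned can then be appended to whatever structure collects the stream.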

Related

subset the string matches in the middle of the column from dataframe in R

I need to extract the uniprot/swiss-prot: ID from a data frame column in R. The column also contains other IDs.
Below is an example:
biogrid:107054|entrez gene/locuslink:BAK1|uniprot/swiss-prot:Q16611|refseq:NP_001179
I need the below output:
Q16611
You can use -
x <- 'biogrid:107054|entrez gene/locuslink:BAK1|uniprot/swiss-prot:Q16611|refseq:NP_001179'
sub('.*swiss-prot:(\\w+)\\|.*', '\\1', x)
#[1] "Q16611"
This extracts the word between swiss-prot: and the following | in the text.
To apply this to a dataframe column you can do -
df$result <- sub('.*swiss-prot:(\\w+)\\|.*', '\\1', df$col)
Using str_extract
library(stringr)
str_extract(x, "(?<=prot:)\\w+")
#[1] "Q16611"
data
x <- 'biogrid:107054|entrez gene/locuslink:BAK1|uniprot/swiss-prot:Q16611|refseq:NP_001179'
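One thing to watch with the sub() pattern is that it requires a | after the ID, so a string where the swiss-prot entry comes last is returned unchanged. A base R sketch that handles that case and returns NA when no ID is present might look like this; the second and third elements of x are invented test values, not data from the question.

```r
x <- c(
  "biogrid:107054|entrez gene/locuslink:BAK1|uniprot/swiss-prot:Q16611|refseq:NP_001179",
  "biogrid:1|uniprot/swiss-prot:P12345",  # hypothetical: ID is the last field
  "biogrid:2|refseq:NP_000001"            # hypothetical: no swiss-prot ID
)

# Locate the word characters that directly follow "swiss-prot:"
m <- regexpr("(?<=swiss-prot:)\\w+", x, perl = TRUE)

# regexpr() returns -1 where nothing matched; keep NA for those elements
ids <- rep(NA_character_, length(x))
ids[m > 0] <- regmatches(x, m)
ids
```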

How select columns, where the first row equals TRUE in R?

I am using following code
summary.out$which[which.max(summary.out$adjr2),]
This gives me the column names with one row, whose values are TRUE or FALSE.
What should I add to this code in order to get only the column names where the first row is TRUE?
Additionally, how can I combine all column names of this output into one string delimited by a "+" sign (one string showing "Column1"+"Column2"+"Column3")?
Since the OP did not provide a minimal reproducible example, we'll assume that summary.out is a data frame, and therefore we can use the colnames() function to extract the column names once we know which elements of the first row have the value TRUE.
Here is a reproducible example illustrating how to use TRUE / FALSE values in the first row of data frame to extract its column names.
textData <- "v1,v2,v3,v4
TRUE,FALSE,TRUE,TRUE
FALSE,TRUE,FALSE,TRUE
TRUE,TRUE,TRUE,TRUE"
data <- read.csv(text=textData)
# select column names where first row = TRUE
colnames(data)[data[1,]==TRUE]
The preceding line of code uses the [ form of the extract operator to extract the required elements from the result of the colnames() function.
...and the output:
> colnames(data)[data[1,]==TRUE]
[1] "v1" "v3" "v4"
We can print the names as a single object separated by + signs with the paste() function and its collapse = argument.
# separated by + sign
paste(colnames(data)[data[1,]==TRUE],collapse = " + ")
...and the output:
> # separated by + sign
> paste(colnames(data)[data[1,]==TRUE],collapse = " + ")
[1] "v1 + v3 + v4"
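If summary.out$which turns out to be a logical matrix rather than a data frame (an assumption; the question does not say), a single row is a named logical vector and can be used to index its own names. The matrix below is a made-up stand-in for the real object.

```r
# Hypothetical stand-in for summary.out$which: a logical matrix with column names
which_matrix <- matrix(
  c(TRUE, FALSE, TRUE, TRUE,
    FALSE, TRUE, FALSE, TRUE),
  nrow = 2, byrow = TRUE,
  dimnames = list(NULL, c("v1", "v2", "v3", "v4"))
)

best_row <- which_matrix[1, ]            # named logical vector
selected <- names(best_row)[best_row]    # names whose value is TRUE
joined <- paste(selected, collapse = " + ")
joined
```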

Replace Column Names With String Right of "_"

I have a dataframe (d3) in which some column names have the form "Date_Month.Year". I want to replace those column names with just "Month.Year", so that multiple columns with the same "Month.Year" can then be summed into a single column.
Below is the code I tried and the output
library(stringr)
print(colnames(d3))
#below is output of the print statement
#[1] "ProductCategoryDesc" "RegionDesc" "SourceDesc" "variable"
#[5] "2019-02-28_Feb.2019" "2019-03-01_Mar.2019" "2019-03-04_Mar.2019" "2019-03-05_Mar.2019"
#[9] "2019-03-06_Mar.2019" "2019-03-07_Mar.2019" "2019-03-08_Mar.2019"
d3 <- d3 %>% mutate(col = str_remove(col, '*._'))
Here is the error I get:
Evaluation error: argument str should be a character vector (or an object coercible to).
The first part of my problem is now answered. I used the following to get all column names in Month.Year format:
colnames(d3) <- gsub('.*_', '', colnames(d3))
I am now having issues with summing the columns that have the same name. For that I looked at "Sum and replace columns with same name R for a data frame containing different classes". Below is the code I used to sum the columns that have a duplicate name; however, this code does not necessarily put the summed values in the correct columns.
indx <- sapply(d3, is.numeric)#check which columns are numeric
nm1 <- which(indx)#get the numeric index of the column
indx2 <- duplicated(names(nm1))|duplicated(names(nm1),fromLast=TRUE)
nm2 <- nm1[indx2]
indx3 <- duplicated(names(nm2))
d3[nm2[!indx3]] <- Map(function(x,y) rowSums(x[y],na.rm = FALSE),
list(d3),split(nm2, names(nm2)))
d3 <- d3[ -nm2[indx3]]
If you want to change the column names, you should be changing colnames:
colnames(d3) <- gsub('.*_', '', colnames(d3))
Note that in your regex, quantifiers (i.e. *) go after the thing they quantify, so it should be .*_ rather than *._
An example where we remove text before a . in iris:
colnames(iris)
[1] "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width" "Species"
# In regex, . means any character, so to match an actual '.',
# we need to 'escape' it with \\.
colnames(iris) <- gsub('.*\\.', '', colnames(iris))
colnames(iris)
[1] "Length" "Width" "Length" "Width" "Species"
colnames(d3) <- sapply(colnames(d3), function(colname){
return( str_remove(colname, '.*_') )
})
The regex should be ".*_" (quantifier after the dot) to match the case you need.
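The answers above cover the renaming but not the summing step from the edited question. One possible sketch, working by column position so that duplicated names are never looked up by name, is shown below. The small d3 here is invented for illustration; the real d3 is assumed to have the same shape (some non-numeric columns with unique names, plus numeric columns whose names may repeat).

```r
# Invented example data with a duplicated column name after renaming
d3 <- data.frame(RegionDesc = c("East", "West"),
                 a = c(1, 2), b = c(3, 4), c = c(5, 6),
                 stringsAsFactors = FALSE)
names(d3) <- c("RegionDesc", "Feb.2019", "Mar.2019", "Mar.2019")

is_num <- vapply(d3, is.numeric, logical(1))
num_pos <- which(is_num)                      # positions of numeric columns
num_names <- names(d3)[num_pos]

# Group column positions by name, keeping the order of first appearance
groups <- split(num_pos, factor(num_names, levels = unique(num_names)))

# Add up the columns in each group (NA propagates, as with na.rm = FALSE)
summed <- lapply(groups, function(pos) {
  Reduce(`+`, lapply(pos, function(p) d3[[p]]))
})

result <- cbind(d3[!is_num], data.frame(summed, check.names = FALSE))
result
```

Indexing with d3[[p]] by position sidesteps any ambiguity about which of the identically named columns is being read.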

Replacing an element in a character string by the previous value

I have a character string looking like this:
string <- c("1","2","3","","5","6","")
I would like to replace the gaps by the previous value, obtaining a string similar to this:
string <- c("1","2","3","3","5","6","6")
I have adjusted this solution (Replace NA with previous and next rows mean in R) and I do get the correct result:
string <- as.data.frame(string)
ind <- which(string == "")
string$string[ind] <- sapply(ind, function(i) with(string, string[i-1]))
This way is however quite cumbersome and there must be an easier way that does not require me to transform the string to a data frame first. Thanks for your help!
We can use na.locf from zoo after changing the blanks ("") to NA, so that each NA is replaced by the nearest preceding non-NA value.
library(zoo)
na.locf(replace(string, string =="", NA))
#[1] "1" "2" "3" "3" "5" "6" "6"
If there is at most one blank between elements (no consecutive blanks, and the first element is not blank), create an index as in the OP's post and replace each blank with the element at the index minus 1.
i1 <- which(string == "")
string[i1] <- string[i1-1]
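For completeness, a base R sketch that also copes with consecutive blanks, without zoo, could use cummax to track the position of the last non-blank element. The helper name fill_blank_with_previous is made up, and the sketch assumes the vector contains no NA values; leading blanks are left as they are because there is nothing before them to copy.

```r
# Hypothetical helper: carry the last non-blank value forward
fill_blank_with_previous <- function(x) {
  # Position of the most recent non-blank element (0 if none seen yet)
  last_filled <- cummax(ifelse(x != "", seq_along(x), 0L))
  out <- x
  has_prev <- last_filled > 0
  out[has_prev] <- x[last_filled[has_prev]]
  out
}

string <- c("1", "2", "3", "", "5", "6", "")
filled <- fill_blank_with_previous(string)
filled
```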

R: How to replace only particular strings in a dataframe column

I have a dataframe column which has values like Americ0, Indi0, Data 2.0, ...
While cleaning the data I need to replace "0" with "an".
df$column <- lapply(df$column, function(x){
str_replace(x,"0","an")
})
I am using the above code to replace 0 with "an", which works as expected. The problem is that certain values in df$column, such as Data 2.0, must not be changed. I would appreciate any help with this.
You can use str_replace from stringr, assuming x is df$column:
library(stringr)
x <- c("Americ0","Indi0","Data 2.0")
str_replace(x,"([:alpha:]+)(0)","\\1an")
Or, using base R:
gsub("([[:alpha:]]+)(0)","\\1an",x)
Output:
> str_replace(x,"([:alpha:]+)(0)","\\1an")
[1] "American" "Indian" "Data 2.0"
> gsub("([[:alpha:]]+)(0)","\\1an",x)
[1] "American" "Indian" "Data 2.0"
Items inside parentheses are captured as a capture group. Here, one or more alphabetic characters before the 0 are captured into group 1 and written back with \\1, so a 0 that does not directly follow a letter (as in 2.0) is not matched.
From documentation:
[:alpha:] Alphabetic characters: [:lower:] and [:upper:].
For more you can search ?regex on your console
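As a variation on the two patterns above, a lookbehind can express the same rule ("a 0 directly after a letter") without capturing the letters. This is only a sketch; x repeats the sample values, and the last element is a made-up value added to show that gsub replaces every match within a string.

```r
x <- c("Americ0", "Indi0", "Data 2.0", "Americ0 Indi0")

# (?<=...) is a lookbehind: the 0 must be preceded by a letter,
# but the letter itself is not part of the match. Needs perl = TRUE.
cleaned <- gsub("(?<=[[:alpha:]])0", "an", x, perl = TRUE)
cleaned
```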
I'm not sure how you would do this without some rule for what you do and do not want to replace, for example "don't replace if 0 is at the beginning" or "don't replace if the value is in this set of strings".
With your current setup you could do something like this (assuming only "Data 2.0" is something you want to skip):
library(stringr)
df <- data.frame(column = c("Americ0","Indi0","Data 2.0"), stringsAsFactors = FALSE)
do_not_replace <- c("Data 2.0")
# leave values listed in do_not_replace untouched; replace the first "0" elsewhere
df$column <- ifelse(df$column %in% do_not_replace,
df$column,
str_replace(df$column, "0", "an"))

Resources