R matching patterns: vector and column - r

I have a vector:
vector_1 <- c('aa1/10', 'aa1/20', 'aa2/10')
And I have a data frame, with the column: product (some rows are empty)
product
hello123
hello123;aa1/20
World
I want to have another column, called: check.
If one of the values in my vector_1 is in the column product, then I want to have a 1, else a 0.
I tried different things, but they didn't work out:
df$check <- ifelse(df$product %in% vector_1, 1,0)
Unfortunately, no results... So I tried:
df$check <- grepl(vector_1, df$product)
But there I received a warning message: In grep: argument 'pattern' has length > 1 and only the first element will be used.
How can I solve this?
Result:
product check
hello123 0
0
hello123;aa1/20 1
World 0

df$check <- as.numeric(grepl(pattern = paste0(vector_1, collapse = "|"), x = df$product))
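For context on why the two attempts in the question fail: %in% tests whole-value equality, so "hello123;aa1/20" never equals "aa1/20", and grepl accepts only a single pattern. The answer above joins the patterns into one regular expression. A minimal alternative sketch, assuming the vector entries should be matched as literal text rather than as regular expressions (the sample df is rebuilt here from the question), checks each pattern with fixed = TRUE and combines the results:

```r
vector_1 <- c("aa1/10", "aa1/20", "aa2/10")
df <- data.frame(product = c("hello123", "", "hello123;aa1/20", "World"),
                 stringsAsFactors = FALSE)

# One logical vector per pattern; fixed = TRUE treats the pattern as a literal string
per_pattern <- lapply(vector_1, function(p) grepl(p, df$product, fixed = TRUE))

# A row gets 1 if any of the patterns was found in it
df$check <- as.integer(Reduce(`|`, per_pattern))
df
```

This form only matters when the search terms can contain characters such as "." or "+" that have a special meaning in a regular expression; for the values shown in the question, both approaches give the same result.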

Related

Filter out all data frames which don't have the column Z in a list of data frames?

I have a list of six data frames, from which 5/6 data frames have a column "Z". To proceed with my script, I need to remove the data frame which doesn't have column Z, so I tried the following code:
for(i in 1:length(df)){
if(!("Z" %in% colnames(df[[i]])))
{
df[[i]] = NULL
}
}
This seemed to do the job (it removed the one data frame from the list that didn't have column Z), but I still got the message "Error in df[[i]] : subscript out of bounds". Why is that, and how can I get around the error?
The base Filter function works well here:
df <- Filter(\(x) "Z" %in% names(x), df)
As to why your method doesn't work, for(i in 1:length(df)) iterates over each item in the original length(df). As soon as df[[i]] = NULL happens once, then df is shorter than it was when the loop started, so the last iteration will be out of bounds. And you'll also skip some items: if df[[2]] is removed then the original df[[3]] is now df[[2]], and the current df[[3]] was originally df[[4]], so you hop over the original df[[3]] without checking it. Lesson: don't change the length of objects in the midst of iterating over them.
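If a loop is still preferred, one way around the shifting indices described above is to walk the list from the last element to the first, so that a removal never moves an element that has not been visited yet. This is only a sketch with a made-up three-element list named dfs (not the asker's data):

```r
# Hypothetical example list: the second data frame lacks column Z
dfs <- list(
  data.frame(A = 1, Z = 2),
  data.frame(A = 1, B = 2),
  data.frame(Z = 3)
)

# Iterate backwards: deleting element i only shifts elements after i,
# and those have already been checked
for (i in rev(seq_along(dfs))) {
  if (!("Z" %in% colnames(dfs[[i]]))) {
    dfs[[i]] <- NULL
  }
}

length(dfs)
```

The Filter call is still the shorter and safer choice; the loop is shown only to illustrate the indexing problem.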
If df is your list of 6 dataframes, you can do this:
df <- df[sapply(df, \(i) "Z" %in% colnames(i))]
The reason you get the error is that your loop will reduce the length of df, such that i will eventually be beyond the (new) length of df. There will be no error if the only frame in df without column Z is the last frame.
Using discard:
list_df <- list(df1, df2)
purrr::discard(list_df, ~any(colnames(.x) == "Z"))
Output:
[[1]]
A B
1 1 3
2 3 4
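As you can see, it removed the first data frame, which had column Z. Note that discard drops the elements for which the predicate is TRUE; to keep only the data frames that have column Z, as the question asks, purrr::keep with the same predicate does the opposite.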
data
df1 <- data.frame(A = c(1,2),
Z = c(1,4))
df2 <- data.frame(A = c(1,3),
B = c(3,4))

Count number of entries in each column with result in dataframe

I have a dataframe with many columns. I want to count the number of times something is entered into each column.
#Example data
Gender<-c("","Male","Male","","Female","Female")
location<-c("UK","France","USA","","","")
dataset<-data.frame(Gender,location, stringsAsFactors = FALSE)
There are 4 entries in the gender column and 3 entries in the location column.
I want the results to be in a dataframe such as:
result<-data.frame(Results=c("Gender","location"), Totals=c(4,3))
Can anyone suggest an approach to do this?
You can use the names of dataset as one column for result and calculate the Totals by counting how often grep matches anything that is a character (as opposed to nothing in an empty cell):
result <- data.frame(
Results = names(dataset),
Totals = sapply(dataset, function(x) length(grep(".", x)))
)
rownames(result) <- NULL
Result:
result
Results Totals
1 Gender 4
2 location 3
A base R option using stack + colSums
setNames(
rev(stack(colSums(dataset != ""))),
c("Results", "Total")
)
gives
Results Total
1 Gender 4
2 location 3
This should work for you:
ngen <- sum(dataset$Gender != "") #sum number entries in column that are not empty
nloc <- sum(dataset$location != "") #same thing
Totals <- c(ngen,nloc)
result<-data.frame(Results=c("Gender","location"), Totals)
You can simplify some of the steps if you want, but that would be the detailed way.
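The answers above count non-empty strings. If some cells could also be NA (an assumption; the example data in the question has only empty strings), a comparison such as != "" returns NA for those cells. A possible sketch that treats both NA and "" as "not entered", for any number of columns:

```r
# Example data modelled on the question, with two NA cells added for illustration
dataset <- data.frame(
  Gender   = c("", "Male", "Male", NA, "Female", "Female"),
  location = c("UK", "France", "USA", "", NA, ""),
  stringsAsFactors = FALSE
)

# TRUE where a cell holds a real entry: not NA and not the empty string
filled <- !is.na(dataset) & dataset != ""

result <- data.frame(
  Results = names(dataset),
  Totals  = unname(colSums(filled))
)
result
```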

How to replace the last character of particular strings in a dataframe column?

I have a dataframe which includes a column of identifier codes. Where the code ends in a 0, I want to replace it with a 1.
Through a lot of trial and error I have a for loop which almost works. It works when there is only one code which ends in a 0 and it's in the last row of the dataframe. If there's another row of data, the for loop doesn't produce the desired output.
library(stringr)
df_a <- data.frame(a = c("02.1.1", "02.1.1.0"))
df_b <- data.frame(a = c("02.1.1", "02.1.1.0", "02.1.2"))
for (i in nrow(df_a)){
df_a$adj <- ""
df_a$code_adj <- ""
if (str_sub(df_a[i, "a"], -1, -1) == "0"){
df_a[i, "adj"] <- "1"
df_a[i, "code_adj"] <- paste0(str_sub(df_a[i, "a"], 1, -2), df_a[i, "adj"])
}
}
When I run the for loop on the dataframe df_a, it produces the desired result. When I run it on df_b it does not.
I'm open to better ways of approaching this problem, but I would also like to know why the for loop behaves as it does on the different dataframes.
We can create a function with sub and reuse it on multiple datasets. Match the 0 at the end ($) of the string and replace with 1 for the specific column in the dataset, update the column and return the dataset
f1 <- function(dat, colNm) {
dat[[colNm]] <- sub("0$", "1", dat[[colNm]])
dat
}
f1(df_a, "a")
# a
#1 02.1.1
#2 02.1.1.1
f1(df_b, "a")
# a
#1 02.1.1
#2 02.1.1.1
#3 02.1.2
Could you not use the stringr package and do something like this?
df_b$a <- str_replace(df_b$a, "0$", "1")
This looks for a 0 at the end of the string and replaces it with a 1. Note that if the column is stored as a factor, you should convert it to character first:
df_b$a <- as.character(df_b$a)
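On why the loop in the question behaves differently on the two data frames: nrow(df) is a single number, so for (i in nrow(df)) runs exactly once, with i equal to the last row index. In df_a the code ending in 0 happens to be in the last row; in df_b it is not. The sketch below makes that visible and adds a vectorised version; the helper names visited_wrong and visited_right are made up for illustration.

```r
df_b <- data.frame(a = c("02.1.1", "02.1.1.0", "02.1.2"),
                   stringsAsFactors = FALSE)

# nrow(df_b) is just 3, so this loop body runs once, for row 3 only
visited_wrong <- c()
for (i in nrow(df_b)) visited_wrong <- c(visited_wrong, i)

# seq_len(nrow(df_b)) is 1, 2, 3, so every row is visited
visited_right <- c()
for (i in seq_len(nrow(df_b))) visited_right <- c(visited_right, i)

# No loop is needed at all: sub() is vectorised over the whole column
df_b$code_adj <- sub("0$", "1", df_b$a)
df_b
```

Unlike the loop in the question, which leaves code_adj empty for rows that do not end in 0, this version copies the unchanged code into code_adj for those rows.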

How to get matrix column number based on predetermined value, by row

I am new to coding and am using R 3.4.0 on Windows 10. I have a matrix with 64 columns and 17000 rows. Within each row, the columns contain the numbers 1 to 64 with no duplicates or missing values. I want to search each row for the value 1 and return the name or number of the column that contains it.
Here is what I have tried:
LargeVector1 <- c(1)
which(apply(matrix, 1, function(x) any(x == LargeVector1)))
This returns the row number instead of the column number
I also tried this to try and return the col names:
colnames(matrix)[apply(matrix, 1, function(x) any(x == LargeVector1))]
This is returning all NA's. Any help would be greatly appreciated.
res <- apply(matrix, 1, function(x) {
  if (1 %in% x) colnames(matrix)[which(x == 1)] else NA
})
This gives a vector in which each element is the name of the column that holds the 1 in that row, with NA for rows that contain no 1.
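The attempts in the question return row information because any(x == 1) collapses each row to a single TRUE or FALSE. To get the column position instead, one option is match, which returns the index of the first occurrence of a value. A small sketch on a made-up 3 x 3 matrix m (the column names c1 to c3 are placeholders):

```r
m <- matrix(c(2, 1, 3,
              1, 3, 2,
              3, 2, 1),
            nrow = 3, byrow = TRUE,
            dimnames = list(NULL, c("c1", "c2", "c3")))

# For each row, the position of the value 1 (NA if the row has no 1)
col_idx <- apply(m, 1, function(r) match(1, r))

# Translate positions into column names
col_names <- colnames(m)[col_idx]

col_idx
col_names
```

Naming the matrix something other than matrix also avoids confusion with the base function of the same name.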

How to convert single column data into two-column matrix using conditional/for loop in R

I have a single column data frame - example data:
1 >PROKKA_00002 Alpha-ketoglutarate permease
2 MTESSITERGAPELADTRRRIWAIVGASSGNLVEWFDFYVYSFCSLYFAHIFFPSGNTTT
3 QLLQTAGVFAAGFLMRPIGGWLFGRIADRRGRKTSMLISVCMMCFGSLVIACLPGYAVIG
4 >PROKKA_00003 lipoprotein
5 MRTIIVIASLLLTGCSHMANDAWSGQDKAQHFLASAMLSAAGNEYAQHQGYSRDRSAAIG
Each sequence of letters is associated with the ">" line above it. I need a two-column data frame with lines starting in ">" in the first column, and the respective lines of letters concatenated as one sequence in the second column. This is what I've tried so far:
y <- matrix(0,5836,2) #empty matrix with 5836 rows and two columns
z <- 0
for(i in 1:nrow(df)){
if((grepl(pattern = "^>", x = df)) == TRUE){ #tried to set the conditional "if a line starts with ">", execute code"
z <- z + 1
y[z,1] <- paste(df[i])
} else{
y[z,2] <- paste(df[i], collapse = "")
}
}
I would eventually convert the matrix y back to a data.frame using as.data.frame, but my loop keeps getting Error: unexpected '}' in "}". I'm also not sure if my conditional is right. Can anyone help? It would be greatly appreciated!
Although I would normally stick with packages for this, here is a solution in base R.
initialize data
mydf <- data.frame(x=c(">PROKKA_00002 Alpha-ketoglutarate","MTESSITERGAPEL", "MTESSITERGAPEL",">PROKKA_00003 lipoprotein", "MTESSITERGAPEL" ,"MRTIIVIASLLLT"), stringsAsFactors = F)
process
ind <- grep(">", mydf$x)
temp<-data.frame(ind=ind, from=ind+1, to=c((ind-1)[-1], nrow(mydf)))
seqs<-rep(NA, length(ind))
for(i in 1:length(ind)) {
seqs[i]<-paste(mydf$x[temp$from[i]:temp$to[i]], collapse="")
}
fastatable<-data.frame(name=gsub(">", "", mydf[ind,1]), sequence=seqs)
> fastatable
                              name                     sequence
1 PROKKA_00002 Alpha-ketoglutarate MTESSITERGAPELMTESSITERGAPEL
2         PROKKA_00003 lipoprotein  MTESSITERGAPELMRTIIVIASLLLT
Try creating a logical index of the header rows (those containing the target symbol ">"), then split the data on that index. In the code below, cumsum(ind1) coerces the logical vector to numeric and produces a running group id for every row, and [!ind1, ] then drops the header rows themselves.
ind1 <- grepl(">", mydf$x)
#split data on the index created
newdf <- data.frame(mydf$x[ind1][cumsum(ind1)], mydf$x)[!ind1,]
#Add names
names(newdf) <- c("Name", "Value")
newdf
#            Name               Value
# 2 >PROKKA_00002 Alpha-ketoglutarate
# 3 >PROKKA_00002      MTESSITERGAPEL
# 5 >PROKKA_00003         lipoprotein
# 6 >PROKKA_00003       MRTIIVIASLLLT
Data
mydf <- data.frame(x=c(">PROKKA_00002","Alpha-ketoglutarate","MTESSITERGAPEL", ">PROKKA_00003", "lipoprotein" ,"MRTIIVIASLLLT"))
You can use plyr to accomplish this if you are able to assign a section number to your rows appropriately:
library(plyr)
df <- data.frame(v1=c(">PROKKA_00002 Alpha-ketoglutarate permease",
"MTESSITERGAPELADTRRRIWAIVGASSGNLVEWFDFYVYSFCSLYFAHIFFPSGNTTT",
"QLLQTAGVFAAGFLMRPIGGWLFGRIADRRGRKTSMLISVCMMCFGSLVIACLPGYAVIG",
">PROKKA_00003 lipoprotein",
"MRTIIVIASLLLTGCSHMANDAWSGQDKAQHFLASAMLSAAGNEYAQHQGYSRDRSAAIG"))
df$hasMark <- ifelse(grepl(">",df$v1,fixed=TRUE),1, 0)
df$section <- cumsum(df$hasMark)
t <- ddply(df, "section", function(x){
data.frame(v2=head(x,1),v3=paste(x$v1[2:nrow(x)], collapse=''))
})
t <- subset(t, select=-c(section,v2.hasMark,v2.section)) #drop the extra columns
If you then view t, I believe this is what you were looking for in your original post.
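The same idea of numbering sections with cumsum can also be written in base R without a package. The following is a sketch only, using shortened versions of the sequences from the question and assuming every ">" header is followed by at least one sequence line (otherwise the two columns would differ in length); the variable names are illustrative.

```r
lines <- c(">PROKKA_00002 Alpha-ketoglutarate permease",
           "MTESSITERGAPEL",
           "QLLQTAGVFAAGFL",
           ">PROKKA_00003 lipoprotein",
           "MRTIIVIASLLLT")

# TRUE for header lines, FALSE for sequence lines
is_header <- startsWith(lines, ">")

# Running count of headers: every line gets the number of its section
section <- cumsum(is_header)

# Paste the sequence lines of each section into one string
sequences <- tapply(lines[!is_header], section[!is_header],
                    paste, collapse = "")

fasta <- data.frame(name = sub("^>", "", lines[is_header]),
                    sequence = as.vector(sequences),
                    stringsAsFactors = FALSE)
fasta
```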
