Hi I have a data frame df and wish to find out are there any palindromes in one name column.
I have test data which has 12 records in it. I know 2 of the column records for name are palindromes.
The code below will return a list using lapply of true false values.
How do I return the name that is a palindrome with the true values and how would i find out which is the most frequently occuring palindrome name?
is_palindrome = function(x){
charsplit = strsplit(x, "")[[1]]
revchar = rev(charsplit)
all(charsplit==revchar)
}
dfnamelc = tolower(as.character(df$Name))
listtest = as.list(dfnamelc)
lapply(listtest,is_palindrome)
example df
Linda,F,100
Mary,F,150
Patrick,M,200
Barbara,F,300
Susan,F,100
Norman,M,40
Deborah,F,500
Sandra,F,23
Conor,M,80
anna,F,40
Otto,M,30
anna,M,40
It will probably be more convenient to use sapply() to return the results as a vector, and incorporate the results back into the data frame.
df <- transform(df,
is_pal=sapply(tolower(Name),is_palindrome))
df$Name[df$is_pal] ## which names are palindromes?
paltab <- table(df$Name[df$is_pal]) ## count palindromic names
names(paltab)[which.max(paltab)] ## "anna"
I'm not sure what your third column signifies, so I'm ignoring it.
Related
Given the below two vectors is there a way to produce the desired data frame? This represents a real world situation which I have to data frames the first contains a col with database values (keys) and the second contains a col of 1000+ rows each a file name (potentials) which I need to match. The problem is there can be multiple files (potentials) matched to any given key. I have worked with grep, merge, inner join etc. but was unable to incorporate them into one solution. Any advise is appreciated!
potentials <- c("tigerINTHENIGHT",
"tigerWALKINGALONE",
"bearOHMY",
"bearWITHME",
"rat",
"imatchnothing")
keys <- c("tiger",
"bear",
"rat")
desired <- data.frame(keys, c("tigerINTHENIGHT, tigerWALKINGALONE", "bearOHMY, bearWITHME", "rat"))
names(desired) <- c("key", "matches")
Psudo code for what I think of as the solution:
#new column which is comma separated potentials
# x being the substring length i.e. x = 4 means true if first 4 letters match
function createNewColumn(keys, potentials, x){
str result = na
foreach(key in keys){
if(substring(key, 0, x) == any(substring(potentals, 0 ,x))){ //search entire potential vector
result += potential that matched + ', '
}
}
return new column with result as the value on the current row
}
We can write a small functions to extract matches and then loop over the keys:
return_matches <- function(keys, potentials, fixed = TRUE) {
vapply(keys, function(k) {
paste(grep(k, potentials, value = TRUE, fixed = fixed), collapse = ", ")
}, FUN.VALUE = character(1))
}
vapply is just a typesafe version of sapply meaning it will never return anything but a character vector. When you set fixed = TRUE the function will run a lot faster but does not recognise regular expressions anymore. Then we can easily make the desired data.frame:
df <- data.frame(
key = keys,
matches = return_matches(keys, potentials),
stringsAsFactors = FALSE
)
df
#> key matches
#> tiger tiger tigerINTHENIGHT, tigerWALKINGALONE
#> bear bear bearOHMY, bearWITHME
#> rat rat rat
The reason for putting the loop in a function instead of running it directly is just to make the code look cleaner.
You can interate using grep
> Match <- sapply(keys, function(item) {
paste0(grep(item, potentials, value = TRUE), collapse = ", ")
} )
> data.frame(keys, Match, row.names = NULL)
keys Match
1 tiger tigerINTHENIGHT, tigerWALKINGALONE
2 bear bearOHMY, bearWITHME
3 rat rat
I want to create a dataframe with 3 columns.
#First column
name_list = c("ABC_D1", "ABC_D2", "ABC_D3",
"ABC_E1", "ABC_E2", "ABC_E3",
"ABC_F1", "ABC_F2", "ABC_F3")
df1 = data.frame(C1 = name_list)
These names in column 1 are a bunch of named results of the cor.test function. The second column should consist of the correlation coefficents I get by writing ABC_D1$estimate, ABC_D2$estimate.
My problem is now that I dont want to add the $estimate manually to every single name of the first column. I tried this:
df1$C2 = paste0(df1$C1, '$estimate')
But this doesnt work, it only gives me this back:
"ABC_D1$estimate", "ABC_D2$estimate", "ABC_D3$estimate",
"ABC_E1$estimate", "ABC_E2$estimate", "ABC_E3$estimate",
"ABC_F1$estimate", "ABC_F2$estimate", "ABC_F3$estimate")
class(df1$C2)
[1] "character
How can I get the numeric result for ABC_D1$estimate in my dataframe? How can I convert these characters into Named num? The 3rd column should constist of the results of $p.value.
As pointed out by #DSGym there are several problems, including the it is not very convenient to have a list of character names, and it would be better to have a list of object instead.
Anyway, I think you can get where you want using:
estimates <- lapply(name_list, function(dat) {
dat_l <- get(dat)
dat_l[["estimate"]]
}
)
cbind(name_list, estimates)
This is not really advisable but given those premises...
Ok I think now i know what you need.
eval(parse(text = paste0("ABC_D1", '$estimate')))
You connect the two strings and use the functions parse and eval the get your results.
This it how to do it for your whole data.frame:
name_list = c("ABC_D1", "ABC_D2", "ABC_D3",
"ABC_E1", "ABC_E2", "ABC_E3",
"ABC_F1", "ABC_F2", "ABC_F3")
df1 = data.frame(C1 = name_list)
df1$C2 <- map_dbl(paste0(df1$C1, '$estimate'), function(x) eval(parse(text = x)))
I am trying to search for data inside of a vector. So I have two tables (turned into vectors), and I am trying to search the info of the vector "b" inside of the vector "a". Below my code is provided, does anyone knows how to fix this? I only get a TRUE/FALSE when in reallity I want to create a new vector. The column 2 of the "a" vector contains the info I am trying to search from vector "b".
a = read.table("data.txt",stringsAsFactors=FALSE,sep="\t")
a = as.vector(a[[2]])
b <- read.table("info.txt", stringsAsFactors = FALSE, sep = "\t")
b = as.vector(b[[1]])
f <- a[unlist(lapply(b, function(x) any(x %in% b)))]
You just need to tell r to use the TRUE and FALSE to construct a subset in some way:
c<-subset(a, a%in% b)
I have the following data frame from which I would like to extract rows based on matching strings.
> GEMA_EO5
gene_symbol fold_EO p_value RefSeq_ID BH_p_value
KNG1 3.433049 8.56e-28 NM_000893,NM_001102416 1.234245e-24
REXO4 3.245317 1.78e-27 NM_020385 2.281367e-24
VPS29 3.827665 2.22e-25 NM_057180,NM_016226 2.560770e-22
CYP51A1 3.363149 5.95e-25 NM_000786,NM_001146152 6.239386e-22
TNPO2 4.707600 1.60e-23 NM_001136195,NM_001136196,NM_013433 1.538000e-20
NSDHL 2.703922 6.74e-23 NM_001129765,NM_015922 5.980454e-20
DPYSL2 5.097382 1.29e-22 NM_001386 1.062868e-19
So I would like to extract e.g. two rows based on matching strings in $RefSeq_ID, that works fine with the following:
> list<-c("NM_001386", "NM_020385")
> GEMA_EO6<-subset(GEMA_EO5, GEMA_EO5$RefSeq_ID %in% list, drop = TRUE)
> GEMA_EO6
gene_symbol fold_EO p_value RefSeq_ID BH_p_value
REXO4 3.245317 1.78e-27 NM_020385 2.281367e-24
DPYSL2 5.097382 1.29e-22 NM_001386 1.062868e-19
But some of the rows have several RefSeq_IDs separated with commas, so I am looking for a general way of telling if $RefSeq_ID contains a certain string pattern and then subset that row.
To do partial matching you'll need to use regular expressions (see ?grepl). Here's a solution to your particular problem:
##Notice that the first element appears in
##a row containing commas
l = c( "NM_013433", "NM_001386", "NM_020385")
To test one sequence at a time, we just select a particular seq id:
R> subset(GEMA_EO5, grepl(l[1], GEMA_EO5$RefSeq_ID))
gene_symbol fold_EO p_value RefSeq_ID BH_p_value
5 TNPO2 4.708 1.6e-23 NM_001136195,NM_001136196,NM_013433 1.538e-20
To test for multiple genes, we use the | operator:
R> paste(l, collapse="|")
[1] "NM_013433|NM_001386|NM_020385"
R> grepl(paste(l, collapse="|"),GEMA_EO5$RefSeq_ID)
[1] FALSE TRUE FALSE FALSE TRUE FALSE TRUE
So
subset(GEMA_EO5, grepl(paste(l, collapse="|"),GEMA_EO5$RefSeq_ID))
should give you what you want.
A different approach is to recognize the duplicate entries in RefSeq_ID as an attempt to represent two data base tables in a single data frame. So if the original table is csv, then normalize the data into two tables
Anno <- cbind(key = seq_len(nrow(csv)), csv[,names(csv) != "RefSeq_ID"])
key0 <- strsplit(csv$RefSeq_ID, ",")
RefSeq <- data.frame(key = rep(seq_along(key0), sapply(key0, length)),
ID = unlist(key0))
and recognize that the query is a subset (select) on the RefSeq table, followed by a merge (join) with Anno
l <- c( "NM_013433", "NM_001386", "NM_020385")
merge(Anno, subset(RefSeq, ID %in% l))[, -1]
leading to
> merge(Anno, subset(RefSeq, ID %in% l))[, -1]
gene_symbol fold_EO p_value BH_p_value ID
1 REXO4 3.245317 1.78e-27 2.281367e-24 NM_020385
2 TNPO2 4.707600 1.60e-23 1.538000e-20 NM_013433
3 DPYSL2 5.097382 1.29e-22 1.062868e-19 NM_001386
Perhaps the goal is to merge with a `Master' table, then
Master <- cbind(key = seq_len(nrow(csv)), csv)
merge(Master, subset(RefSeq, ID %in% l))[,-1]
or similar.
I have 9880 records in a data frame, I am trying to split it into 9 groups of 1000 each and the last group will have 880 records and also name them accordingly. I used for-loop for 1-9 groups but manually for the last 880 records, but i am sure there are better ways to achieve this,
library(sqldf)
for (i in 0:8)
{
assign(paste("test",i,sep="_"),as.data.frame(final_9880[((1000*i)+1):(1000*(i+1)), (1:53)]))
}
test_9<- num_final_9880[9001:9880,1:53]
also am unable to append all the parts in one for-loop!
#append all parts
all_9880<-rbind(test_0,test_1,test_2,test_3,test_4,test_5,test_6,test_7,test_8,test_9)
Any help is appreciated, thanks!
A small variation on this solution
ls <- split(final_9880, rep(0:9, each = 1000, length.out = 9880)) # edited to Roman's suggestion
for(i in 1:10) assign(paste("test",i,sep="_"), ls[[i]])
Your command for binding should work.
Edit
If you have many dataframes you can use a parse-eval combo. I use the package gsubfn for readability.
library(gsubfn)
nms <- paste("test", 1:10, sep="_", collapse=",")
eval(fn$parse(text='do.call(rbind, list($nms))'))
How does this work? First I create a string containing the comma-separated list of the dataframes
> paste("test", 1:10, sep="_", collapse=",")
[1] "test_1,test_2,test_3,test_4,test_5,test_6,test_7,test_8,test_9,test_10"
Then I use this string to construct the list
list(test_1,test_2,test_3,test_4,test_5,test_6,test_7,test_8,test_9,test_10)
using parse and eval with string interpolation.
eval(fn$parse(text='list($nms)'))
String interpolation is implemented via the fn$ prefix of parse, its effect is to intercept and substitute $nms with the string contained in the variable nms. Parsing and evaluating the string "list($mns)" creates the list needed. In the solution the rbind is included in the parse-eval combo.
EDIT 2
You can collect all variables with a certain pattern, put them in a list and bind them by rows.
do.call("rbind", sapply(ls(pattern = "test_"), get, simplify = FALSE))
ls finds all variables with a pattern "test_"
sapply retrieves all those variables and stores them in a list
do.call flattens the list row-wise.
No for loop required -- use split
data <- data.frame(a = 1:9880, b = sample(letters, 9880, replace = TRUE))
splitter <- (data$a-1) %/% 1000
.list <- split(data, splitter)
lapply(0:9, function(i){
assign(paste('test',i,sep='_'), .list[[(i+1)]], envir = .GlobalEnv)
return(invisible())
})
all_9880<-rbind(test_0,test_1,test_2,test_3,test_4,test_5,test_6,test_7,test_8,test_9)
identical(all_9880,data)
## [1] TRUE