I am trying to search for data inside a vector. I have two tables (converted to vectors), and I want to look up the values of vector "b" in vector "a". My code is below; does anyone know how to fix it? I only get TRUE/FALSE, when in reality I want to create a new vector. Column 2 of the "a" table contains the values I am trying to match against vector "b".
a = read.table("data.txt",stringsAsFactors=FALSE,sep="\t")
a = as.vector(a[[2]])
b <- read.table("info.txt", stringsAsFactors = FALSE, sep = "\t")
b = as.vector(b[[1]])
f <- a[unlist(lapply(b, function(x) any(x %in% b)))]
You just need to tell R to use the TRUE and FALSE values to construct a subset in some way:
c <- subset(a, a %in% b)
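A minimal sketch of the same idea with made-up vectors (the values below are hypothetical stand-ins for column 2 of data.txt and column 1 of info.txt). `a %in% b` returns one TRUE/FALSE per element of `a`, and indexing `a` with that logical vector keeps only the matches:

```r
# Hypothetical stand-ins for the two columns read from the files
a <- c("gene1", "gene2", "gene3", "gene4")
b <- c("gene2", "gene4", "gene9")

hits <- a %in% b              # logical vector, same length as a
found <- a[hits]              # elements of a that also occur in b
not_found <- b[!(b %in% a)]   # elements of b that never occur in a

print(found)
print(not_found)
```

Plain bracket indexing and `subset()` should give the same result for a vector like this; the bracket form is just the more common idiom.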
I want to create a dataframe with 3 columns.
#First column
name_list = c("ABC_D1", "ABC_D2", "ABC_D3",
"ABC_E1", "ABC_E2", "ABC_E3",
"ABC_F1", "ABC_F2", "ABC_F3")
df1 = data.frame(C1 = name_list)
The names in column 1 are the names of objects holding results of the cor.test function. The second column should contain the correlation coefficients I get by writing ABC_D1$estimate, ABC_D2$estimate, and so on.
My problem is that I don't want to add $estimate manually to every single name in the first column. I tried this:
df1$C2 = paste0(df1$C1, '$estimate')
But this doesn't work; it only gives me this back:
"ABC_D1$estimate", "ABC_D2$estimate", "ABC_D3$estimate",
"ABC_E1$estimate", "ABC_E2$estimate", "ABC_E3$estimate",
"ABC_F1$estimate", "ABC_F2$estimate", "ABC_F3$estimate")
class(df1$C2)
[1] "character"
How can I get the numeric result for ABC_D1$estimate into my dataframe? How can I convert these character strings into a Named num? The 3rd column should consist of the results of $p.value.
As pointed out by @DSGym, there are several problems here, including that a vector of character names is not very convenient to work with; it would be better to have a list of the objects themselves.
Anyway, I think you can get where you want using:
estimates <- lapply(name_list, function(dat) {
dat_l <- get(dat)
dat_l[["estimate"]]
}
)
cbind(name_list, estimates)
This is not really advisable but given those premises...
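A sketch of how the same lookup could fill both the estimate and the p-value columns. The three cor.test objects below are hypothetical, built from random data only so that the block runs on its own; `mget()` fetches several objects by name at once and returns them as a named list, and `vapply()` pulls one number out of each.

```r
# Hypothetical cor.test results standing in for ABC_D1, ABC_D2, ...
set.seed(1)
ABC_D1 <- cor.test(rnorm(20), rnorm(20))
ABC_D2 <- cor.test(rnorm(20), rnorm(20))
ABC_D3 <- cor.test(rnorm(20), rnorm(20))

name_list <- c("ABC_D1", "ABC_D2", "ABC_D3")

# Look up all objects by name; the result is a named list of test results
tests <- mget(name_list)

df1 <- data.frame(
  C1 = name_list,
  C2 = vapply(tests, function(tst) unname(tst$estimate), numeric(1)),
  C3 = vapply(tests, function(tst) tst$p.value, numeric(1)),
  row.names = NULL
)
print(df1)
```

If the tests can be stored in a list from the start, the `mget()` step disappears and the rest stays the same.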
OK, I think I now know what you need.
eval(parse(text = paste0("ABC_D1", '$estimate')))
You concatenate the two strings and use the functions parse and eval to get your results.
This is how to do it for your whole data.frame:
name_list = c("ABC_D1", "ABC_D2", "ABC_D3",
"ABC_E1", "ABC_E2", "ABC_E3",
"ABC_F1", "ABC_F2", "ABC_F3")
df1 = data.frame(C1 = name_list)
library(purrr)  # provides map_dbl
df1$C2 <- map_dbl(paste0(df1$C1, '$estimate'), function(x) eval(parse(text = x)))
I would like to remove the elements containing '_1' and '_3' in the vector using the discard function from purrr. Here the example:
library(purrr)
x <- c("ABAC_13", "ZDRF73", "UYDS_12", "FGSH41", "GFSC_35" , "JHSC_29")
With discard we need to provide a logical vector indicating which values we need to discard.
To create a logical vector we use grepl giving TRUE values to the elements which have '_1' or '_3'
library(purrr)
discard(x, grepl("_1|_3", x))
#[1] "ZDRF73" "FGSH41" "JHSC_29"
As @Lazarus Thurston commented, str_subset (from stringr) is probably a better choice here:
str_subset(x, '_(1|3)', negate = TRUE)
As this is specific to tidyverse, we can use the syntax specific to it
library(tidyverse)
str_detect(x, "_[13]") %>%
discard(x, .)
#[1] "ZDRF73" "FGSH41" "JHSC_29"
If we need to remove every element that contains an underscore followed by digits
grep("_\\d+", x, invert = TRUE, value = TRUE)
#[1] "ZDRF73" "FGSH41"
or if it is specific to 1, 3
grep("_[13]", x, invert = TRUE, value = TRUE)
#[1] "ZDRF73" "FGSH41" "JHSC_29"
If we need to remove the substring part,
sub("_\\d+", '', x)
This task can be performed using grepl(). We want to find the elements that contain _1 or _3. The grepl output is a logical vector of TRUE/FALSE values. We then remove those elements from x by subsetting with the negation operator, i.e. x[!grepl("_1|_3", x)].
x <- c("ABAC_13", "ZDRF73", "UYDS_12", "FGSH41", "GFSC_35" , "JHSC_29")
x[!grepl("_1|_3", x)]
This is actually a series of questions about referencing objects through character values in R. I will add more items if I recall other related questions. To keep things simple, I use some small made-up examples to explain what I mean:
When building a set of datasets in a for loop, I want to output a series of objects whose names are stored in a character vector name_list = c("a", "b", "c", "d", "e", "f"). Inside the loop I would like to define them as
for(i in 1:4){
a <- data[data$Year == 2010,]
b <- unique(data$Name)
c <- summarise(group_by(data,Year,Name), avg = mean(quantity))
...
f <- left_join(data, data1, by = c("Year", "Names"))
}
Is there any function that lets me write function(name_list[1]) through function(name_list[6]) in place of a through f in the for loop? The same question applies to creating columns from column names stored as strings when a table or data frame is built inside a chunk of code. (as.name and noquote work when I only reference the vector or dataset, but not when I try to assign values to the target variable. If possible, could anyone explain why this happens?)
When we extract information from SQL or other data sources, a single variable may hold several values separated by commas or some other delimiter. How can we test whether a certain value is among the comma-separated values? See the example below:
1567 %in% c(1567,1456,123)
TRUE
a <- "c(1567,1456,123)"
noquote(a)
c(1567,1456,123)
1567 %in% noquote(a)
FALSE
1567 %in% list(noquote(a))
FALSE
b <- "1567,1456,123"
noquote(b)
1567,1456,123
1567 %in% noquote(strsplit(a,","))
FALSE
1567 %in% list(noquote(strsplit(a,",")))
FALSE
I more or less understand why %in% doesn't work here: it seems R treats 1567,1456,123 as a single element. So I used strsplit to separate the values, but that still doesn't work. Is there any way to make R treat the string as code?
If all you need to do is convert comma-separated lists like "1567,1456,123" into R vectors like c(1567, 1456, 123), you definitely do not need to wrap them in c(...) and try to evaluate them directly as vectors. You should just use strsplit to split the data:
data_str <- "1567,1456,123"
data_vec <- as.integer(strsplit(data_str, ",")[[1]])  # [[1]]: strsplit returns a list
stopifnot(1567 %in% data_vec)
Note that strsplit returns a list, because it can also handle character vectors of length greater than one:
stopifnot(
all.equal(
list(c("a", "b"), c("x", "y")),
strsplit(c("a,b", "x,y"), ",")) == TRUE)
which makes it useful for operating on columns of SQL output:
| id | concatenated_field |
|----|--------------------|
| 1 | 5362,395,9000,7 |
| 2 | 319,75624,63 |
(etc.)
d <- data.frame(
id = c(1, 2),
concatenated_field = c("5362,395,9000,7", "319,75624,63"))
d$split_field <- strsplit(d$concatenated_field, ",")
sapply(d, class)
# id concatenated_field split_field
# "numeric" "character" "list"
d$split_field[[1]]
# [1] "5362" "395" "9000" "7"
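Once the field is split, a per-row membership test is one `sapply()` call away. This is a sketch that rebuilds the example `d` so that it runs on its own; the column name `has_395` is made up for illustration. The pieces returned by strsplit are character strings, so they are converted with `as.integer()` before comparing against a number.

```r
d <- data.frame(
  id = c(1, 2),
  concatenated_field = c("5362,395,9000,7", "319,75624,63"),
  stringsAsFactors = FALSE)

# Split each row and convert the pieces to integers
d$split_field <- lapply(strsplit(d$concatenated_field, ","), as.integer)

# One TRUE/FALSE per row: does that row's list contain 395?
d$has_395 <- sapply(d$split_field, function(v) 395 %in% v)
print(d$has_395)
```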
Alternatively, if you're reading in one big stream of comma-separated data, you can use scan:
data_vec <- scan(
what = 0, # arcane way to say "expect numeric input"
sep = ",",
text = "1,2,3,4,5,6,7,8,9,10")
stopifnot(all.equal(data_vec, 1:10) == TRUE)
scan is more heavy-duty than strsplit and can handle more complicated inputs as well, such as data with quoted fields:
weird_data <- scan(what="", sep=",", text='marvin,ruby,"joe,joseph",dean')
print(weird_data)
# [1] "marvin" "ruby" "joe,joseph" "dean"
If you are really, really sure you need to accept and evaluate R code passed as input (this can be VERY DANGEROUS, since it means you will be executing arbitrary unverified R code), you can use parse and eval:
r_code_string <- 'list(c("a", "b"), c("x", "y"))'
stopifnot(
all.equal(
list(c("a", "b"), c("x", "y")),
eval(parse(text = r_code_string))) == TRUE)
parse converts raw text into an unevaluated "expression", which is a representation of R code in the form of a special R object. eval then passes the expression to the interpreter for execution.
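The two steps can be seen separately in a small sketch that uses the string from the question. The variable names `expr` and `vec` are only for illustration, and this should only ever be done with trusted input.

```r
expr <- parse(text = "c(1567, 1456, 123)")  # text is parsed, not yet run
print(class(expr))

vec <- eval(expr)                           # now the code is executed
print(vec)
print(1567 %in% vec)
```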
As for noquote, it doesn't do what you think it does. It doesn't actually modify the string, it just adds a flag to the variable so that it will print without quotation marks. You can emulate this behavior with print(..., quote = FALSE).
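A short sketch of that point: the value returned by `noquote()` is still a single character string, just with an extra class that changes how it prints. The variable names here are illustrative.

```r
a <- "1567,1456,123"
nq <- noquote(a)

print(class(nq))                   # the added class
print(length(nq))                  # still one element, not three
print(identical(unclass(nq), a))   # the underlying string is unchanged
print(a, quote = FALSE)            # same display without adding the class
```

This is why `1567 %in% noquote(a)` is FALSE: the comparison is still against one string.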
Hi, I have a data frame df and want to find out whether there are any palindromes in its name column.
My test data has 12 records. I know that 2 of the names are palindromes.
The code below uses lapply to return a list of TRUE/FALSE values.
How do I return the names that are palindromes (the TRUE values), and how would I find out which palindrome name occurs most frequently?
is_palindrome = function(x){
charsplit = strsplit(x, "")[[1]]
revchar = rev(charsplit)
all(charsplit==revchar)
}
dfnamelc = tolower(as.character(df$Name))
listtest = as.list(dfnamelc)
lapply(listtest,is_palindrome)
example df
Linda,F,100
Mary,F,150
Patrick,M,200
Barbara,F,300
Susan,F,100
Norman,M,40
Deborah,F,500
Sandra,F,23
Conor,M,80
anna,F,40
Otto,M,30
anna,M,40
It will probably be more convenient to use sapply() to return the results as a vector, and incorporate the results back into the data frame.
df <- transform(df,
is_pal=sapply(tolower(Name),is_palindrome))
df$Name[df$is_pal] ## which names are palindromes?
paltab <- table(df$Name[df$is_pal]) ## count palindromic names
names(paltab)[which.max(paltab)] ## "anna"
I'm not sure what your third column signifies, so I'm ignoring it.
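For reference, here is a self-contained sketch of the whole approach, with the Name column typed in from the example data. It uses `vapply()` rather than `sapply()` so the result is guaranteed to be a logical vector; `pal_counts` is an illustrative name. The counts are taken on the original spelling, so names that differ only in case would be counted separately.

```r
is_palindrome <- function(x) {
  chars <- strsplit(x, "")[[1]]
  all(chars == rev(chars))
}

# Name column from the example data in the question
df <- data.frame(
  Name = c("Linda", "Mary", "Patrick", "Barbara", "Susan", "Norman",
           "Deborah", "Sandra", "Conor", "anna", "Otto", "anna"),
  stringsAsFactors = FALSE)

# Compare lower-cased names so that "Otto" counts as a palindrome
df$is_pal <- vapply(tolower(df$Name), is_palindrome, logical(1),
                    USE.NAMES = FALSE)

# Frequency table of the palindromic names, most frequent first
pal_counts <- sort(table(df$Name[df$is_pal]), decreasing = TRUE)
print(pal_counts)
```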
I have six dataframes (a-f), each of which has a column 'nq'. I want to find the max, min and average of the nq columns
classes <- c("a","b","c","d","e","f")
for (i in classes){
format(max(i$nq), scientific = TRUE)
format(min(i$nq), scientific = TRUE)
format(mean(i$nq), scientific = TRUE)
}
But the code is not working. Can you please help?
You can't use a character value as a data.frame name. The value "a" is not the same as the data.frame a.
You probably shouldn't have a bunch of data.frames lying around. You probably want to have them all in a list. Then you can lapply over them to get results.
mydata <- list(
a = data.frame(nq=runif(10)),
b = data.frame(nq=runif(10)),
c = data.frame(nq=runif(10)),
d = data.frame(nq=runif(10))
)
then you can do
lapply(mydata, function(x)
format(c(max(x$nq), min(x$nq), mean(x$nq)), scientific = TRUE)
)
to get all the values at once.
The reason it is not working is that 'i' is a character string. As already mentioned by Mr.Flick, it is better to put the data frames into a list.
Alternatively, instead of writing i$nq in your loop you can write get(i)$nq. The get() function searches the workspace for an object by name and returns the object itself. However, this is not as clean as putting the data frames into a list and using lapply.
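Both routes are sketched below with hypothetical data frames (random numbers, and only three of them to keep the block short). Note that a for loop does not auto-print the value of each expression, so the `format()` calls need an explicit `print()`; that is a second reason the original loop shows nothing useful.

```r
# Hypothetical data frames standing in for the real ones
set.seed(42)
a <- data.frame(nq = runif(10))
b <- data.frame(nq = runif(10))
d <- data.frame(nq = runif(10))

classes <- c("a", "b", "d")

# Option 1: get() inside the loop, with an explicit print()
for (i in classes) {
  nq <- get(i)$nq
  print(format(c(max = max(nq), min = min(nq), mean = mean(nq)),
               scientific = TRUE))
}

# Option 2: collect the objects into a named list once, then summarise
mydata <- mget(classes)
stats <- sapply(mydata, function(x) {
  c(max = max(x$nq), min = min(x$nq), mean = mean(x$nq))
})
print(format(stats, scientific = TRUE))
```

The second option returns one matrix with a column per data frame, which is usually easier to reuse than printed output.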