How to plot based on a wildcard - r

I have data that looks like this:
A 2 3 LOGIC:A
B 3 3 LOGIC:B
C 2 2 COMBO:A
plot(Data$V2[Data$V4 == "LOGIC:A"], Data$V3[Data$V4 == "LOGIC:A"])
However, I want to plot every row where column 4 is a LOGIC value: when I provide "LOGIC" inside the plot command, it should plot both "LOGIC:A" and "LOGIC:B". Right now it only accepts the exact column 4 value. Can I use wildcards?

You can use grepl to find occurrences of your string.
x <- c("LOGIC: A", "COMBO: B")
x[grepl("LOGIC", x)]
[1] "LOGIC: A"
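Applied to the question's data, the same grepl call can drive the subsetting directly. This is a minimal sketch, assuming the data frame is called Data with the default column names V1 to V4; keep and keep_glob are names introduced here for illustration. Anchoring the pattern with ^ restricts the match to values that start with LOGIC, and glob2rx() from utils converts a shell-style wildcard such as "LOGIC*" into an equivalent regular expression.

```r
# Data from the question; read.table names the columns V1..V4 by default
Data <- read.table(text = "A 2 3 LOGIC:A
B 3 3 LOGIC:B
C 2 2 COMBO:A", stringsAsFactors = FALSE)

# TRUE for rows whose V4 starts with "LOGIC" (regular-expression form)
keep <- grepl("^LOGIC", Data$V4)

# Same test written as a shell-style wildcard, converted by glob2rx()
keep_glob <- grepl(glob2rx("LOGIC*"), Data$V4)

# Plot only the matching rows
plot(Data$V2[keep], Data$V3[keep])
```

Either logical vector can be reused for any other row-wise operation, for example Data[keep, ].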

Using Data, shown reproducibly in the Note at the end, this will plot those rows for which V4 contains the substring LOGIC, using the character after the colon to represent each point. If you want all points to be represented by the same character, omit the pch argument from plot.
plot(V3 ~ V2, Data, subset = grep("LOGIC", V4), pch = sub("LOGIC:", "", V4))
Note
Lines <- "A 2 3 LOGIC:A
B 3 3 LOGIC:B
C 2 2 COMBO:A"
Data <- read.table(text = Lines, as.is = TRUE, strip.white = TRUE)

Why won't R recognize data frame column names within lists?

HEADLINE: Is there a way to get R to recognize data.frame column names contained within lists in the same way that it can recognize free-floating vectors?
SETUP: Say I have a vector named varA:
(varA <- 1:6)
# [1] 1 2 3 4 5 6
To get the length of varA, I could do:
length(varA)
#[1] 6
and if the variable was contained within a larger list, the variable and its length could still be found by doing:
list <- list(vars = "varA")
length(get(list$vars[1]))
#[1] 6
PROBLEM:
This is not the case when I substitute the vector for a dataframe column and I don't know how to work around this:
rows <- 1:6
cols <- c("colA")
(df <- data.frame(matrix(NA,
nrow = length(rows),
ncol = length(cols),
dimnames = list(rows, cols))))
# colA
# 1 NA
# 2 NA
# 3 NA
# 4 NA
# 5 NA
# 6 NA
list <- list(vars = "varA",
cols = "df$colA")
length(get(list$vars[1]))
#[1] 6
length(get(list$cols[1]))
#Error in get(list$cols[1]) : object 'df$colA' not found
Though this contrived example seems inane, because I could always use the simple length(variable) approach, I'm actually interested in writing data from hundreds of variables varying in lengths onto respective dataframe columns, and so keeping them in a list that I could iterate through would be very helpful. I've tried everything I could think of, but it may be the case that it's just not possible in R, especially given that I cannot find any posts with solutions to the issue.
You could try:
> length(eval(parse(text = list$cols[1])))
[1] 6
Or:
list <- list(vars = "varA",
cols = "colA")
length(df[, list$cols[1]])
[1] 6
Or with regex:
list <- list(vars = "varA",
cols = "df$colA")
length(df[, sub(".*\\$", "", list$cols[1])])
[1] 6
If you are truly working with a data frame d, then nrow(d) is the length of all of the variables in d. There should be no reason to use length in this case.
If you are actually working with a list x containing variables of potentially different lengths, then you should use the [[ operator to extract those variables by name (see ?Extract):
x <- list(a = 1:10, b = rnorm(20L))
l <- list(vars = "a")
length(x[[l$vars[1L]]]) # 10
If you insist on using get (you shouldn't), then you need to supply a second argument telling it where to look for the variable (see ?get):
length(get(l$vars[1L], x)) # 10
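For the stated goal of writing many variables of different lengths into data frame columns, one possible pattern is to keep the vectors themselves in a named list and loop over the names with [[. This is only a sketch under assumptions: the names vars, n_rows and out are hypothetical, and padding shorter vectors with NA is one choice among several.

```r
# Hypothetical input: vectors of different lengths kept in one named list
vars <- list(varA = 1:6, varB = c(2.5, 3.5, 4.5))

# A data frame needs equal-length columns, so size it to the longest vector
n_rows <- max(lengths(vars))
out <- data.frame(row = seq_len(n_rows))

for (nm in names(vars)) {
  v <- vars[[nm]]                                  # extract by name with [[
  out[[nm]] <- c(v, rep(NA, n_rows - length(v)))   # pad the tail with NA
}

out
```

Because the list holds the data rather than strings such as "df$colA", neither get nor eval(parse()) is needed.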

Standardize group names using a vector of possible matches

I need to standardize how subgroups are referred to in a data set. To do this I need to identify when a variable matches one of several strings and then set a new variable with the standardized name. I am trying to do that with the following:
df <- data.frame(a = c(1,2,3,4), b = c("depression_male", "depression_female", "depression_hsgrad", "depression_collgrad"))
TestVector <- "male"
for (i in TestVector) {
df$grpl <- grepl(paste0(i), df$b)
df[ which(df$grpl == TRUE),]$standard <- "male"
}
The test vector will frequently have multiple elements. The grepl works (I was going to deal with the male/female match confusion later but I'll take suggestions on that) but the subsetting and setting a new variable doesn't. It would be better (and work) if I could transform the grepl output directly into the standard name variable.
Your only real issue is that you need to initialize the standard column. But we can simplify your code a bit:
df <- data.frame(a = c(1,2,3,4), b = c("depression_male", "depression_female", "depression_hsgrad", "depression_collgrad"))
TestVector <- "male"
df$standard <- NA
for (i in TestVector) {
df[ grepl(i, df$b), "standard"] <- "male"
}
df
# a b standard
# 1 1 depression_male male
# 2 2 depression_female male
# 3 3 depression_hsgrad <NA>
# 4 4 depression_collgrad <NA>
Then you've got the issue that the "male" pattern matches "female" as well.
Perhaps you're looking for sub instead? It works like find/replace:
df$standard = sub(pattern = "depression_", replacement = "", df$b)
df
# a b standard
# 1 1 depression_male male
# 2 2 depression_female female
# 3 3 depression_hsgrad hsgrad
# 4 4 depression_collgrad collgrad
It's hard to generalize what will be best in your case without more example input/output pairs. If all your data is of the form "depression_" this will work well. Or maybe the standard name is always after an underscore, so you could use pattern = ".*_" to replace everything before the last underscore. Or maybe something else... Hopefully these ideas give you a good start.
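If the test vector has several elements, one way to sidestep the male/female overlap is to map anchored patterns to standard names. The sketch below assumes the group label always follows an underscore at the end of the string; the patterns vector and its entries are hypothetical and would need to reflect the real data.

```r
df <- data.frame(
  a = c(1, 2, 3, 4),
  b = c("depression_male", "depression_female",
        "depression_hsgrad", "depression_collgrad"),
  stringsAsFactors = FALSE
)

# Hypothetical lookup: regular expression -> standardized name.
# "_male$" cannot match "depression_female" because of the leading underscore.
patterns <- c("_male$" = "male", "_female$" = "female")

df$standard <- NA_character_          # initialize the column first
for (p in names(patterns)) {
  df$standard[grepl(p, df$b)] <- patterns[[p]]
}

df
```

Rows that match none of the patterns keep NA, which makes unmapped groups easy to spot.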

Comparing two columns in a dataframe using R or Excel

I have a csv file containing two columns, "Taxon" in column A and "Tip" in column C. I would like to compare column A against column C, and if the string matches another string in column C I'd like it to print "y" or something similar in column B next to the string in column A, if not I would like to print "n" or equivalent. Here is the beginning of my data:
Taxon B Tip
Nitrosotalea devanaterra Methanothermobacter thermautotrophicus
Nitrososphaera gargensis Methanobacterium beijingense
Nitrososphaera sca5445 Methanobacterium bryantii
Nitrososphaera sca2170 Methanosarcina mazei
Methanobacterium beijingense Persephonella marina
Methanobacterium bryantii Sulfurihydrogenibium azorense
Methanothermobacter thermautotrophicus Balnearium lithotrophicum
Methanosarcina mazei Isosphaera pallida
Koribacter versatilis Methanobacterium beijingense
Acidicapsa borealis Parachlamydia acanthamoebae
Acidobacterium capsulatum Leptospira biflexa
This is only a small part of the data, but the idea is that "n" would be printed in column B for all of the bacteria apart from "Methanobacterium beijingense" and "Methanobacterium bryantii", which are also found in the "Tip" column, and so "y" would be posted there. These could also just be "1" and "0".
I know dplyr has some good functions for filtering and joining data, however I can't find anything that exactly matches my needs. If there is an alternative method of using Excel to do this that's fine too.
Thanks.
For Excel, use the following formula in B2:
=if(isnumber(match(a2, c:c, 0)), "y", "n")
Fill down or double-click the 'drag button'.
A method using R and dplyr:
# create example data
x = read.table(header = TRUE, stringsAsFactors = FALSE, text =
"Taxon B Tip
Nitrosotalea_devanaterra 1 Methanothermobacter_thermautotrophicus
Nitrososphaera_gargensis 1 Methanobacterium_beijingense
Nitrososphaera_sca5445 1 Methanobacterium_bryantii
Nitrososphaera_sca2170 1 Methanosarcina_mazei
Methanobacterium_beijingense 1 Persephonella_marina
Methanobacterium_bryantii 1 Sulfurihydrogenibium_azorense
Methanothermobacter_thermautotrophicus 1 Balnearium_lithotrophicum
Methanosarcina_mazei 1 Isosphaera_pallida
Koribacter_versatilis 1 Methanobacterium_beijingense
Acidicapsa_borealis 1 Parachlamydia_acanthamoebae
Acidobacterium_capsulatum 1 Leptospira_biflexa")
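# Data management part
library(dplyr)
x1 = data.frame(A = x$Taxon, B = x$B)
x2 = data.frame(A = x$Tip, B = x$B)
# anti_join() keeps the rows of x1 with no match in x2, i.e. taxa absent from Tip
unmatched = anti_join(x1, x2, by = c("A", "B"))$A
x$B[x$Taxon %in% unmatched] = 0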
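The same membership test as the Excel MATCH formula can also be sketched in base R with %in%, without any join. The example uses a shortened, hypothetical subset of the question's data, with a data frame called x holding Taxon and Tip columns.

```r
# Shortened example data (a few names taken from the question)
x <- data.frame(
  Taxon = c("Nitrosotalea devanaterra", "Methanobacterium beijingense",
            "Methanobacterium bryantii", "Koribacter versatilis"),
  Tip   = c("Methanothermobacter thermautotrophicus", "Methanobacterium beijingense",
            "Methanobacterium bryantii", "Methanosarcina mazei"),
  stringsAsFactors = FALSE
)

# "y" when the taxon appears anywhere in Tip, otherwise "n"
x$B <- ifelse(x$Taxon %in% x$Tip, "y", "n")

x[, c("Taxon", "B", "Tip")]
```

Replacing "y" and "n" with 1 and 0 gives the numeric flag mentioned in the question.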

Find similar strings and reconcile them within one dataframe

Another question for me as a beginner. Consider this example here:
n = c(2, 3, 5)
s = c("ABBA", "ABA", "STING")
b = c(TRUE, "STING", "STRING")
df = data.frame(n,s,b)
n s b
1 2 ABBA TRUE
2 3 ABA STING
3 5 STING STRING
How can I search within this dataframe for similar strings, i.e. ABBA and ABA as well as STING and STRING and make them the same (doesn't matter whether ABBA or ABA, either fine) that would not require me knowing any variations? My actual data.frame is very big so that it would not be possible to know all the different variations.
I would want something like this returned:
> n = c(2, 3, 5)
> s = c("ABBA", "ABBA", "STING")
> b = c(TRUE, "STING", "STING")
> df = data.frame(n,s,b)
> print(df)
n s b
1 2 ABBA TRUE
2 3 ABBA STING
3 5 STING STING
I have looked around for agrep, or stringdist, but those refer to two data.frames or are able to name the column which I can't since I have many of those.
Anyone an idea? Many thanks!
Best regards,
Steffi
This worked for me, but there might be a better solution.
The idea is to use a recursive function, special, that uses agrepl, which is the logical version of approximate grep, https://www.rdocumentation.org/packages/base/versions/3.4.1/topics/agrep. Note that you can specify the 'error tolerance' to group similar strings with agrep. Using agrepl, I split off rows with similar strings into x, mutate the s column to the first-occurring string, and then add a grouping variable grp. The remaining rows that were not included in the ith group are stored in y and recursively passed through the function until y is empty.
You need the dplyr package, install.packages("dplyr")
library(dplyr)
desired <- NULL
grp <- 1
special <- function(x, y, grp) {
  if (nrow(y) < 1) { # if y is empty return data
    return(x)
  } else {
    similar <- agrepl(y$s[1], y$s) # find similar occurring strings
    x <- rbind(x, y[similar,] %>% mutate(s=head(s,1)) %>% mutate(grp=grp))
    y <- setdiff(y, y[similar,])
    special(x, y, grp+1)
  }
}
desired <- special(desired,df,grp)
To change the stringency of the string matching, set max.distance, for example agrepl(x, y, max.distance = 0.5).
Output
n s b grp
1 2 ABBA TRUE 1
2 3 ABBA STING 1
3 5 STING STRING 2
To remove the grouping variable
withoutgrp <- desired %>% select(-grp)
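A different technique from the recursive agrepl approach, shown here only as a sketch, is to cluster the unique strings by edit distance and replace each string with the first member of its cluster. It uses adist() from utils plus hclust() and cutree() from stats; the cut height of 1 (at most one edit within a cluster) and the names s, grp and canonical are assumptions for illustration.

```r
# Strings pooled from any number of columns
s <- c("ABBA", "ABA", "STING", "STRING")

# Pairwise Levenshtein distances between the strings
d <- adist(s)

# Complete-linkage clustering, cut so that clusters merge only at distance <= 1
grp <- cutree(hclust(as.dist(d)), h = 1)

# Replace every string by the first string that occurs in its cluster
canonical <- s[match(grp, grp)]

data.frame(original = s, group = grp, canonical = canonical)
```

The resulting original-to-canonical table can then be applied to each column with match(), so no column has to be named in advance.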

column names have periods inserted where there should be spaces

In the plot generated by ggplot, each label along the x-axis is a string, e.g., “the product in 1990”. However, in the generated plot there is a period between each pair of words. In other words, the above string is shown as “the.product.in.1990”.
How can I ensure the above “.” is not added?
The following code is what I used to add string for each point along the x-axis
last_plot()+scale_x_discrete(limits=ddata$labels$text)
Sample code:
library(ggdendro)
x <- read.csv("test.csv",header=TRUE)
d <- as.dist(x,diag=FALSE,upper=FALSE)
hc <- hclust(d,"ave")
dhc <- as.dendrogram(hc)
ddata <- dendro_data(dhc,type="rectangle")
ggplot(segment(ddata)) + geom_segment(aes(x=x0,y=y0,xend=x1,yend=y1))
last_plot() + scale_x_discrete(limits=ddata$labels$text)
Each row of ddata$labels$text is a string, like "the product in 1990".
I would like to keep the same format in the generated plot rather than "the.product.in.1990".
The issue arises because you are trying to read data with column names that contain spaces.
When you read this data with read.csv these column names are converted to syntactically valid R names. Here is an example to illustrate the issues:
some.file <- '
"Col heading A", "Col heading B"
A, 1
B, 2
C, 3
'
Read it with the default read.csv settings:
> x1 <- read.csv(text=some.file)
> x1
Col.heading.A Col.heading.B
1 A 1
2 B 2
3 C 3
4 NA
> names(x1)
[1] "Col.heading.A" "Col.heading.B"
To avoid this, use the argument check.names=FALSE:
> x2 <- read.csv(text=some.file, check.names=FALSE)
> x2
Col heading A Col heading B
1 A 1
2 B 2
3 C 3
4 NA
> names(x2)
[1] "Col heading A" "Col heading B"
Now, the remaining issue is that a syntactically valid R name cannot contain spaces. So to refer to these columns with $, you need to wrap the column name in backticks:
> x2$`Col heading A`
[1] A B C
Levels: A B C
For more information, see ?read.csv and specifically the information for check.names.
There is also some information about backticks in ?Quotes
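The renaming that read.csv performs by default is the same transformation as make.names(), which makes the effect easy to check in isolation. The sketch below uses a small hypothetical CSV string (csv_text) and also shows [[ as an alternative to backticks for reaching a column whose name contains spaces.

```r
# Hypothetical CSV content with spaces in the header
csv_text <- '"Col heading A","Col heading B"
A,1
B,2'

x_default <- read.csv(text = csv_text)                      # names are sanitized
x_raw     <- read.csv(text = csv_text, check.names = FALSE) # names kept verbatim

names(x_default)
names(x_raw)

# What check.names = TRUE does to a label such as the one in the question
make.names("the product in 1990")

# [[ takes the name as an ordinary string, so no backticks are needed
x_raw[["Col heading A"]]
```

With check.names = FALSE on the original read.csv("test.csv", ...) call, the labels passed on to the dendrogram should keep their spaces.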
