R plot frequency of strings with specific pattern

Given a data frame with a column that contains strings. I would like to plot the frequency of strings that bear a certain pattern. For example
strings <- c("abcd","defd","hfjfjcd","kgjgcdjrye","yryriiir","twtettecd")
df <- as.data.frame(strings)
df
strings
1 abcd
2 defd
3 hfjfjcd
4 kgjgcdjrye
5 yryriiir
6 twtettecd
I would like to plot the frequency of the strings that contain the pattern "cd".
Anyone with a quick solution?

I presume from your question that you meant to have some entries that appear more than once, so I've added one duplicate string:
x <- c("abcd","abcd","defd","hfjfjcd","kgjgcdjrye","yryriiir","twtettecd")
To find only those strings that contain a specific pattern, use grep or grepl:
y <- x[grepl("cd", x)]
To get a table of frequencies, you can use table
table(y)
y
abcd hfjfjcd kgjgcdjrye twtettecd
2 1 1 1
And you can plot it using plot or barplot as follows:
barplot(table(y))
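If you prefer ggplot2, an equivalent bar chart is easy too; this is a sketch of my own, not part of the original answer, and assumes ggplot2 is installed:
library(ggplot2)
matched <- data.frame(string = x[grepl("cd", x)])   # keep only strings containing "cd"
ggplot(matched, aes(x = string)) + geom_bar()       # geom_bar() counts each distinct string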

Others have already mentioned grepl. Here is an approach with plot.density, using grepl to get a logical vector of matches (coerced to 0/1):
plot( density(0+grepl("cd", strings)) )
If you don't like the density plot extending beyond the range of the data, the 'logspline' package offers methods that give a sharp border at the range extremes; RSiteSearch() will turn up examples.

check "Kernlab" package.
You can define a kernel (pattern) which could any kind of string and count them later on.

Related

How to label CCA-Plot with row.names in R

I've been trying to solve the following problem which I am sure is an easy one (I am just not able to find a solution). I am using the package vegan and want to perform a cca that shows the actual row names as labels (instead of the default "sit1", "sit2", ...).
I created a dataframe (ls_Treat1) with cast(), showing plot treatments (AB, DB, DL etc.) as row names and species occurrences. The dataframe looks as follows:
     species 1   species 2   species 3
AB           0           3           1
DB           1           6           0
DL           3           4           2
I created the data frame with the following code to set the treatments (AB, DB, DL, ...) as row names:
ls_Treat1 <- cast(fungi_ls, Treatment ~ species)
row.names(ls_Treat1)<- ls_Treat1$Treatment
ls_Treat1 <- ls_Treat1[,-1]
When I perform a cca with the following code:
ca <- cca(ls_Treat1)
plot(ca,display="sites")
R puts the default labels "sit1", "sit2", ... into the plot instead of the actual row names, even though I have done it this way before and the plots normally showed the right labels. Does this have anything to do with how I created the data frame? I tried changing the treatments (characters) into numbers (integers or factors), but the plot still won't be labelled with my row names.
Can anyone help me with this?
Thank you very very much!!
The problem is that reshape::cast() does not produce a data.frame but something else. It claims to be a data.frame, but it is not. cca() does matrix algebra and therefore casts its input to a matrix; that works for a standard data.frame, but not for the object you supplied. In particular, when you remove the first column with ls_Treat1 <- ls_Treat1[,-1], you also drop the attributes that preserve the names; it would have worked without removing that column (if the reshape package was still loaded). Upgrading to the reshape2 package and using reshape2::acast() should solve this.
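For illustration, here is a minimal sketch of that fix. The long-format input is invented here to mirror the table in the question, and the column name count is an assumption; with your real fungi_ls you only need the acast() call, with value.var set to whatever your count column is called:
library(reshape2)
library(vegan)
# made-up long-format data mimicking the question's table
fungi_ls <- data.frame(Treatment = rep(c("AB", "DB", "DL"), each = 3),
                       species   = rep(paste("species", 1:3), times = 3),
                       count     = c(0, 3, 1, 1, 6, 0, 3, 4, 2))
# acast() returns a matrix with Treatment as row names, which cca() keeps
ls_Treat1 <- acast(fungi_ls, Treatment ~ species, value.var = "count")
ca <- cca(ls_Treat1)
plot(ca, display = "sites")   # sites are now labelled AB, DB, DL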

R programming using the cut() function to split a variable into 3 classes

I have a data frame called wine with several variables, among which is the variable Spice, of integer type. I would like to use the cut() function to split this variable (Spice) into 3 classes (<2; between 2 and 2.5; >3).
Assuming you mean Spice is numeric (and not integer, which would only fall between 2 and 2.5 if it were exactly 2), it might be hard to do with just cut() if you're trying to make the middle interval both left- and right-inclusive.
You can get close with something like
dat <- data.frame(spice=5*runif(100))
dat$lvl <- cut(dat$spice, breaks=c(0,2,2.5,1e6), right=FALSE)
dat$lvl <- as.factor(as.numeric(dat$lvl))
though if the value for spice is exactly 2.5 it will be placed in group 3 instead of group 2.
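If catching exactly 2.5 in the middle class matters more than catching exactly 2 in the first one, a small variant (my sketch, not part of the answer above) uses right-closed intervals instead:
dat <- data.frame(spice = 5 * runif(100))
# right = TRUE (the default) gives intervals (-Inf, 2], (2, 2.5], (2.5, Inf)
dat$lvl <- cut(dat$spice,
               breaks = c(-Inf, 2, 2.5, Inf),
               labels = c("<2", "2 to 2.5", ">2.5"))
table(dat$lvl)
This simply flips the boundary behaviour: exactly 2.5 now lands in the middle class, while exactly 2 lands in the first one.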

Adding a new column in R based on maximum occurrence of words from a CSV

I am working with two CSV files. They are formatted like this:
File 1
able,2
gobble,3
highway,3
test,6
zoo,10
File 2
able,6
gobble,10
highway,3
speed,7
test,8
upper,3
zoo,10
In my program I want to do the following:
Create a keyword list by combining the values from two CSV files and keeping only unique keywords
Compare that keyword list to each individual CSV file to determine the maximum number of occurrences of a given keyword, then append that information to the keyword list.
The first step I have done already.
I am getting confused by R reading things as vectors/factors/data frames etc...and "coercion to lists". For example in my files given above, the maximum occurrence for the word "gobble" should be 10 (its value is 3 in file 1 and 10 in file 2)
So basically two things need to happen. First, I need to create a column in "keywords" that holds information about the maximum number of occurrences of a word from the CSV files. Second, I need to populate that column with the maximum value.
Here is my code:
# Read in individual data sets
keywordset1=as.character(read.csv("set1.csv",header=FALSE,sep=",")$V1)
keywordset2=as.character(read.csv("set2.csv",header=FALSE,sep=",")$V1)
exclude_list=as.character(read.csv("exclude.csv",header=FALSE,sep=",")$V1)
# Sort, capitalize, and keep unique values from the two keyword sets
keywords <- sapply(unique(sort(c(keywordset1, keywordset2))), toupper)
# Keep keywords greater than 2 characters in length (basically exclude in at etc...)
keywords <- keywords[nchar(keywords) > 2]
# Keep keywords that are not in the exclude list
keywords <- setdiff(keywords, sapply(exclude_list, toupper))
# HERE IS WHERE I NEED HELP
# Compare the read keyword list to the master keyword list
# and keep the frequency column
key1=read.csv("set1.csv",header=FALSE,sep=",")
key1$V1=sapply(key1[[1]], toupper)
keywords$V2=key1[which(keywords[[1]] %in% key1$V1),2]
return(keywords)
The reason that your last command fails is that you try to use the $ operator on a vector. It only works on lists or data frames (which are a special case of lists).
A remark regarding toupper (and many other functions in R): it works on vectors, such that you don't need to use sapply. toupper(c(keywordset1, keywordset2)) is perfectly fine.
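A tiny illustration of both remarks (my sketch):
v <- c("able", "gobble")
# v$anything   # would fail: the $ operator is invalid for atomic vectors
toupper(v)     # vectorised, no sapply() needed: returns "ABLE" "GOBBLE"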
But I would like to propose an entirely different solution to your problem. First, I create the data as follows:
keywords1 <- read.table(text="able,2
gobble,3
highway,3
test,6
zoo,10",sep=",",stringsAsFactors=FALSE)
keywords2 <- read.table(text="gobble,10
highway,3
speed,7
test,8
upper,3
zoo,10",sep=",",stringsAsFactors=FALSE)
Note that I use stringsAsFactors=FALSE. This prevents read.table from converting characters to factors, such that there is no need to call as.character later.
The next steps are to capitalize the keyword columns in both tables. At the same time, I put both tables in a list. This is often a good way to simplify calculations in R, because you can use lapply to apply a function on all the list elements. Then I put both tables into a single table.
keyword_list <- lapply(list(keywords1,keywords2),function(kw)
transform(kw,V1=toupper(V1)))
keywords_all <- do.call(rbind,keyword_list)
The next step is to sort the data frame in decreasing order by the number in the second column:
keywords_sorted <- keywords_all[order(keywords_all$V2,decreasing=TRUE),]
keywords_sorted looks as follows:
V1 V2
5 ZOO 10
6 GOBBLE 10
11 ZOO 10
9 TEST 8
8 SPEED 7
4 TEST 6
2 GOBBLE 3
3 HIGHWAY 3
7 HIGHWAY 3
10 UPPER 3
1 ABLE 2
As you notice, some keywords appear only once, and for those that appear twice, the first appearance is the one you want to keep. There is a function in R that can be used to extract exactly these elements: duplicated() (run ?duplicated to learn more). Basically, the function returns TRUE if an element appears for at least the second time in a vector. These are the elements you don't want. To convert TRUE to FALSE (and vice versa), you use the operator !. So the following gives your desired result:
keep <- !duplicated(keywords_sorted$V1)
keywords_max <- keywords_sorted[keep,]
V1 V2
5 ZOO 10
6 GOBBLE 10
9 TEST 8
8 SPEED 7
3 HIGHWAY 3
10 UPPER 3
1 ABLE 2
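As an aside (a sketch of my own, not part of the answer above), aggregate() gets to the same per-keyword maxima in one step, without sorting and de-duplicating; the rows just come back ordered by keyword rather than by count:
keywords_max2 <- aggregate(V2 ~ V1, data = keywords_all, FUN = max)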

merge and plot multiple text files

I have sixty text files, each with two columns as shown below, each representing a unique sample, and headed 'Coverage' and 'Count'. The length of each file differs by a few rows, because for some values of Coverage the Count is zero and therefore not printed. Each file is about 1000 rows long. Each file is named in the format "B001.BaseCovDist.txt" to "B060.BaseCovDist.txt", and in R I have them as "B001" to "B060".
How can I combine the data frames by Coverage? This is complicated by missing rows. I've tried various approaches in bash, base R, reshape(2), and dplyr.
How can I make a single graph of Count (y-axis) against Coverage (x-axis), with each unique sample as a different series? ggplot2 seems ideal, but I seem to need a loop or a list to add the series without having to type out all of the names in full (which would be ridiculous).
One approach that seemed good was to add a third column that contains the unique sample name because this creates a molten dataset. However this didn't work in bash (awk) because the number of whitespace delimiters varies by row.
Any help would be very welcome.
Coverage Count
1 0 7089359
2 1 983611
3 2 658253
4 3 520767
5 4 448916
6 5 400904
A good starting point is to consider a long format for the data versus a wide format. Since you mentioned reshape2, this should make sense, but check out tidyr as well, as the docs for both describe the differences between long and wide.
Going with a long format, try the following:
library(dplyr)
allfiles <- lapply(list.files(pattern='BaseCovDist\\.txt$'),
                   function(fname) cbind(fname=fname, read.table(fname, header=TRUE)))
dat <- rbind_all(allfiles)
dat
## fname Coverage Count
## 1 B001.BaseCovDist.txt 0 7089359
## 2 B001.BaseCovDist.txt 1 983611
## 3 B001.BaseCovDist.txt 2 658253
## 4 B001.BaseCovDist.txt 3 520767
## 5 B001.BaseCovDist.txt 4 448916
## 6 B001.BaseCovDist.txt 5 400904
ggplot(data=dat, aes(x=Coverage, y=Count, group=fname)) + geom_line()
Just to add to your answer, r2evans: I added a gsub() call so that the filename suffix is removed from the added column (plus some boring import modifiers).
allfiles <- lapply(list.files(pattern='BaseCovDist\\.txt$'),
                   function(sample) cbind(sample=gsub("[.]BaseCovDist[.]txt$", "", sample),
                                          read.table(sample, header=TRUE, skip=3)))
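A small follow-up sketch of my own: after rebuilding dat from the modified allfiles (do.call(rbind, ...) is a base-R stand-in for the dplyr call above), mapping the sample column to colour gives each series its own colour and legend entry:
library(ggplot2)
dat <- do.call(rbind, allfiles)   # same shape as before, but with a 'sample' column
ggplot(dat, aes(x = Coverage, y = Count, colour = sample)) + geom_line()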

Filling Gaps in Time Series Data in R

So this question has been bugging me for a while since I've been looking for an efficient way of doing it. Basically, I have a dataframe, with a data sample from an experiment in each row. I guess this should be looked at more as a log file from an experiment than the final version of the data for analyses.
The problem that I have is that, from time to time, certain events get logged in a column of the data. To make the analyses tractable, what I'd like to do is "fill in the gaps" for the empty cells between events so that each row in the data can be tied to the most recent event that has occurred. This is a bit difficult to explain but here's an example:
Now, I'd like to take that and turn it into this:
Doing so will enable me to split the data up by the current event. In any other language I would jump into using a for loop to do this, but I know that R isn't great with loops of that type, and, in this case, I have hundreds of thousands of rows of data to sort through, so am wondering if anyone can offer suggestions for a speedy way of doing this?
Many thanks.
This question has been asked in various forms on this site many times. The standard answer is to use zoo::na.locf. Search [r] for na.locf to find examples of how to use it.
Here is an alternative way in base R using rle:
d <- data.frame(LOG_MESSAGE=c('FIRST_EVENT', '', 'SECOND_EVENT', '', ''))
within(d, {
  # ensure character data
  LOG_MESSAGE <- as.character(LOG_MESSAGE)
  CURRENT_EVENT <- with(rle(LOG_MESSAGE),   # rle() gives a list with 'values' and 'lengths'
                        rep(replace(values,
                                    nchar(values) == 0,
                                    values[nchar(values) != 0]),
                            lengths))
})
#    LOG_MESSAGE CURRENT_EVENT
# 1  FIRST_EVENT   FIRST_EVENT
# 2                FIRST_EVENT
# 3 SECOND_EVENT  SECOND_EVENT
# 4               SECOND_EVENT
# 5               SECOND_EVENT
The na.locf() function in package zoo is useful here, e.g.
require(zoo)
dat <- data.frame(ID = 1:5, sample_value = c(34, 56, 78, 98, 234),
                  log_message = c("FIRST_EVENT", NA, "SECOND_EVENT", NA, NA))
dat <- transform(dat,
                 Current_Event = sapply(strsplit(as.character(na.locf(log_message)),
                                                 "_"),
                                        `[`, 1))
Gives
> dat
  ID sample_value  log_message Current_Event
1  1           34  FIRST_EVENT         FIRST
2  2           56         <NA>         FIRST
3  3           78 SECOND_EVENT        SECOND
4  4           98         <NA>        SECOND
5  5          234         <NA>        SECOND
To explain the code:
1. na.locf(log_message) returns a factor (that was how the data were created in dat) with the NAs replaced by the previous non-NA value (the "last observation carried forward" part).
2. The result of 1. is then converted to a character vector.
3. strsplit() is run on this character vector, breaking it apart on the underscore. strsplit() returns a list with as many elements as there were elements in the character vector. In this case each component is a vector of length two, and we want the first elements of these vectors.
4. So I use sapply() to run the subsetting function `[`() and extract the 1st element from each list component.
5. The whole thing is wrapped in transform() so that i) I don't need to refer to dat$ and ii) I can add the result as a new variable directly into the data frame dat.
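As a more recent alternative (a sketch, not part of the answer above), tidyr::fill() carries the last non-missing value forward without needing zoo:
library(tidyr)
dat2 <- data.frame(ID = 1:5, sample_value = c(34, 56, 78, 98, 234),
                   log_message = c("FIRST_EVENT", NA, "SECOND_EVENT", NA, NA),
                   stringsAsFactors = FALSE)
dat2 <- fill(dat2, log_message, .direction = "down")   # NAs replaced by the last event seen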
