I am reading this from a CSV file, and i need to write a function that churns out a final data frame, so given a particular entry, i have
x
[1] {2,4,5,11,12}
139 Levels: {1,2,3,4,5,6,7,12,17} ...
i can change it to
x2<-as.character(x)
which gives me
x
[1] "{2,4,5,11,12}"
how do i extract 2,4,5,11,12 out? (having 5 elements)
i have tried to use various ways, like gsub, but to no avail
can anyone please help me?
It sounds like you're trying to import a database table that contains arrays. Since R doesn't know about such data structures, it treats them as text.
Try this. I assume the column in question is x. The result will be a list, with each element being the vector of array values for that row in the table.
dat <- read.csv("<file>", stringsAsFactors=FALSE)
dat$x <- strsplit(gsub("\\{(.*)\\}", "\\1", dat$x), ",")
Related
How can I in R predefine patterns that I would like to keep in a string a then in a column of a data frame?
g <- c("3+kk120", "3+1121", "1+170", "1+kk5")
# I want to get
c("3+kk", "3+1", "1+1", "1+kk")
I am not quite sure if I understand you but, after replacing two digits (kk) into one (X) you can use substr(). Then you can replace back the previous one as follows,
sub("X","kk",substr(sub("kk","X",g),1,3))
gives,
# [1] "3+kk" "3+1" "1+1" "1+kk"
Task
I am attempting to use better functionality (loop or vector) to parse down a larger list into 26(maybe 27) smaller lists based on each letter of the alphabet (i.e. the first list contains all entries of the larger list that start with the letter A, the second list with the letter B ... the possible 27th list contains all remaining entries that use either numbers of other characters).
I am then attempting to ID which names on the list are similar by using the adist function (for instance, I need to correct company names that are misspelled. e.g. Companyy A needs to be corrected to Company A).
Code thus far
#creates a vector for all uniqueID/stakeholders whose name starts with "a" or "A"
stakeA <- grep("^[aA].*", uniqueID, value=TRUE)
#creates a distance matrix for all stakeholders whose name starts with "a" or "A"
stakeAdist <- (adist(stakeA), ignore.case=TRUE)
write.table(stakeAdist, "test.csv", quote=TRUE, sep = ",", row.names=stakeA, col.names=stakeA)
Explanation
I was able to complete the first step of my task using the above code; I have created a list of all the entries that begin with the letter A and then calculated the "distance" between each entry (appears in a matrix).
Ask One
I can copy and paste this code 26 times and move my way through the alphabet, but I figure that is likely a more elegant way to do this, and I would like to learn it!
Ask Two
To "correct" the entries, thus far I have resorted to writing a table and moving to Excel. In Excel I have to insert a row entry to have the matrix properly align (I suppose this is a small flaw in my code). To correct the entries, I use conditional formatting to highlight all instances where adist is between say 1 and 10 and then have to manually go through the highlights and correct the lists.
Any help on functions / methods to further automate this / better strategies using R would be great.
It would help to have an example of your data, but this might work.
EDIT: I am assuming your data is in a data.frame named df
for(i in 1:26) {
stake <- subset(df, uniqueID==grep(paste0('^[',letters[i],LETTERS[i],'].*'), df$uniqueID, value=T))
stakeDist <- adist(stakeA,ignore.case=T)
write.table(stakeDist, paste0("stake_",LETTERS[i],".csv"), quote=T, sep=',')
}
Using a combination of paste0, and the builtin letters and LETTERS this creates your grep expression.
Using subset, the correct IDs are extracted
paste0 will also create a unique filename for write.table().
And it is all tied together using a for()-loop
I am working with a long list of data frames.
Here is a simple hypothetical example of a data frame:
DFrame<-data.frame(c(1,0),c("Yes","No"))
colnames(DFrame)<-c("ColOne","ColTwo")
I am trying to retrieve a specified column of the data frame using paste function.
get(paste("DFrame","$","ColTwo",sep=""))
The get function returns the following error, when trying to retrieve a specified column:
Error in get(paste("DFrame", "$", "ColTwo", sep = "")) :object 'DFrame$ColTwo' not found
When I enter the constructed name of the data frame DFrame$ColTwo it returns the desired output of the second column.
If I reconstruct an example without the '$' sign then I get the desired answer from the get function. For example the code yields 2:
enter code here
Ans <- 2
get(paste("An","s",sep=""))
[1] 2
I am looking for the same desired outcome, but struggling to get past the error that the object could not be found.
I also attempted using the following format, but the quotation in the column name breaks the paste function:
paste("DFrame","[,"ColTwo"]",sep="")
Thank you very much for the input,
Kind regards
You can do that using the following syntax:
get("DFrame")[,"ColTwo"]
You can use paste() in both of these strings, for example:
get(paste("D", "Frame", sep=""))[,paste("Col", "Two", sep="")]
Edit: Despite someone downvoting this answer without leaving a comment, this does exactly what the original poster asked for. If you feel that it does not or is in some way dangerous, I would encourage you to leave a comment.
Stop trying to use paste and get entirely.
The whole point of having a list (of data frames, say) is that you can reference them using names:
DFrame<-data.frame(c(1,0),c("Yes","No"))
colnames(DFrame)<-c("ColOne","ColTwo")
#A list of data frames
l <- list(DFrame,DFrame)
#The data frames in the list can have names
names(l) <- c("DF1",'DF2')
# Now you just use `[[`
> l[["DF1"]][["ColOne"]]
[1] 1 0
> l[["DF1"]][["ColTwo"]]
[1] Yes No
Levels: No Yes
If you have to, you can use paste to construct the indices passed inside [[.
I have several datafiles, which I need to process in a particular order. The pattern of the names of the files is, e.g. "Ad_10170_75_79.txt".
Currently they are sorted according to the first numbers (which differ in length), see below:
f <- as.matrix (list.files())
f
[1] "Ad_10170_75_79.txt" "Ad_10345_76_79.txt" "Ad_1049_25_79.txt" "Ad_10531_77_79.txt"
But I need them to be sorted by the middle number, like this:
> f
[1] "Ad_1049_25_79.txt" "Ad_10170_75_79.txt" "Ad_10345_76_79.txt" "Ad_10531_77_79.txt"
As I just need the middle number of the filename, I thought the easiest way is, to get rid of the rest of the name and renaming all files. For this I tried using strsplit (plyr).
f2 <- strsplit (f,"_79.txt")
But I'm sure there is a way to sort the files directly, without renaming all files. I tried using sort and to describe the name with regex but without success. This has been a problem for many days, and I spent several hours searching and trying, to solve this presumably easy task. Any help is very much appreciated.
old example dataset:
f <- c("Ad_10170_75_79.txt", "Ad_10345_76_79.txt",
"Ad_1049_25_79.txt", "Ad_10531_77_79.txt")
Thank your for your answers. I think I have to modify my example, because the solution should work for all possible middle numbers, independent of their digits.
new example dataset:
f <- c("Ad_10170_75_79.txt", "Ad_10345_76_79.txt",
"Ad_1049_9_79.txt", "Ad_10531_77_79.txt")
Here's a regex approach.
f[order(as.numeric(gsub('Ad_\\d+_(\\d+)_\\d+\\.txt', '\\1', f)))]
# [1] "Ad_1049_9_79.txt" "Ad_10170_75_79.txt" "Ad_10345_76_79.txt" "Ad_10531_77_79.txt"
Try this:
f[order(as.numeric(unlist(lapply(strsplit(f, "_"), "[[", 3))))]
[1] "Ad_1049_25_79.txt" "Ad_10170_75_79.txt" "Ad_10345_76_79.txt" "Ad_10531_77_79.txt"
First we split by _, then select the third element of every list element, find the order and subset f based on that order.
I would create a small dataframe containing filenames and their respective extracted indices:
f<- c("Ad_10170_75_79.txt","Ad_10345_76_79.txt","Ad_1049_25_79.txt","Ad_10531_77_79.txt")
f2 <- strsplit (f,"_79.txt")
mydb <- as.data.frame(cbind(f,substr(f2,start=nchar(f2)-1,nchar(f2))))
names(mydb) <- c("filename","index")
library(plyr)
arrange(mydb,index)
Take the first column of this as your filename vector.
ADDENDUM:
If a numeric index is required, simply convert character to numeric:
mydb$index <- as.numeric(mydb$index)
I have a column of gene symbols that I have retrieved directly from a database, and some of the rows contain two or more symbols which are comma separated (see example below).
SLC6A13
ATP5J2-PTCD1,BUD31,PTCD1
ACOT7
BUD31,PDAP1
TTC26
I would like to remove the commas, and place the separated symbols into new rows like so:
SLC6A13
ATP5J2-PTCD1
BUD31
PTCD1
ACOT7
BUD3
PDAP1
TTC26
I haven't been able to find a straight forward way to do this in R, does anyone have any suggestions?
You can use this vector result to put into a matrix or a data.frame:
vec <- scan(text="SLC6A13
ATP5J2-PTCD1,BUD31,PTCD1
ACOT7
BUD31,PDAP1
TTC26", what=character(), sep=",")
Read 8 items
vec
[1] "SLC6A13" "ATP5J2-PTCD1" "BUD31" "PTCD1" "ACOT7" "BUD31" "PDAP1"
[8] "TTC26"
Perhaps:
as.matrix(vec)
(The scan function can also read from files. The "text" parameter was only added relatively recently, but it saves typing file=textConnection("...").)
Another option is to use readLines and strsplit :
unlist(strsplit(readLines(textConnection(txt)),','))
"SLC6A13" "ATP5J2-PTCD1" "BUD31" "PTCD1" "ACOT7"
"BUD31" "PDAP1" "TTC26"