I'm not sure that I understand the different outputs in these two scenarios:
(1)
pioneers <- c("GAUSS:1777", "BAYES:1702", "PASCAL:1623", "PEARSON:1857")
split <- strsplit(pioneers, split = ":")
split
(2)
pioneers <- c("GAUSS:1777", "BAYES:1702", "PASCAL:1623", "PEARSON:1857")
split <- lapply(pioneers, strsplit, split = ":")
split
In both cases, the output is a list but I'm not sure when I'd use the one notation (simply applying a function to a vector) or the other (using lapply to loop the function over the vector).
Thanks for the help.
Greg
To me it comes down to how the output is returned. The l in lapply stands for list, i.e. the output is returned as a list. strsplit already returns a list because, if an element of your pioneers vector contained several :s, a list is the only data structure that makes sense: one list element for each of the 4 elements of the vector, each holding a character vector of the split pieces.
So using lapply(x, strsplit, ...) will always return a list inside a list, which you probably don't want in this case.
Using lapply is useful in cases where you expect the result of the function you're applying to be a vector of an undefined or variable length. As strsplit can see this coming already, the use of lapply is redundant. You should know what form you expect/want your answer to be in, and use the appropriate functions to coerce the output into the right data structure.
To be clear, the output of the two examples you gave is not the same. One is a list of character vectors, the other is a list of lists. To get the identical result from lapply you would need
lapply(pioneers, function(x, split) strsplit(x, split)[[1]], split = ":")
i.e. taking the first list element of the inner list (which is only 1 element anyway) in each case.
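To see the difference for yourself, here is a small sketch that reuses the pioneers vector from the question. The names flat, nested and unnested are just illustrative.

```r
pioneers <- c("GAUSS:1777", "BAYES:1702", "PASCAL:1623", "PEARSON:1857")

# strsplit is already vectorized: one character vector per input string
flat <- strsplit(pioneers, split = ":")

# lapply calls strsplit once per string, and each call returns a list of length 1
nested <- lapply(pioneers, strsplit, split = ":")

is.character(flat[[1]])   # TRUE: the pieces of "GAUSS:1777"
is.list(nested[[1]])      # TRUE: a one-element list wrapping those pieces

# peeling off the inner list recovers the plain strsplit result
unnested <- lapply(nested, `[[`, 1)
identical(flat, unnested) # TRUE
```

str(flat) and str(nested) show the extra level of nesting at a glance.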
Task
I am attempting to use better functionality (loop or vector) to parse down a larger list into 26 (maybe 27) smaller lists based on each letter of the alphabet (i.e. the first list contains all entries of the larger list that start with the letter A, the second list with the letter B ... the possible 27th list contains all remaining entries that start with either numbers or other characters).
I am then attempting to ID which names on the list are similar by using the adist function (for instance, I need to correct company names that are misspelled. e.g. Companyy A needs to be corrected to Company A).
Code thus far
#creates a vector for all uniqueID/stakeholders whose name starts with "a" or "A"
stakeA <- grep("^[aA].*", uniqueID, value=TRUE)
#creates a distance matrix for all stakeholders whose name starts with "a" or "A"
stakeAdist <- adist(stakeA, ignore.case=TRUE)
write.table(stakeAdist, "test.csv", quote=TRUE, sep = ",", row.names=stakeA, col.names=stakeA)
Explanation
I was able to complete the first step of my task using the above code; I have created a list of all the entries that begin with the letter A and then calculated the "distance" between each entry (appears in a matrix).
Ask One
I can copy and paste this code 26 times and move my way through the alphabet, but I figure there is likely a more elegant way to do this, and I would like to learn it!
Ask Two
To "correct" the entries, thus far I have resorted to writing a table and moving to Excel. In Excel I have to insert a row entry to have the matrix properly align (I suppose this is a small flaw in my code). To correct the entries, I use conditional formatting to highlight all instances where adist is between say 1 and 10 and then have to manually go through the highlights and correct the lists.
Any help on functions / methods to further automate this / better strategies using R would be great.
It would help to have an example of your data, but this might work.
EDIT: I am assuming your data is in a data.frame named df
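for(i in 1:26) {
  # IDs of the rows whose uniqueID starts with the i-th letter, in either case
  stake <- subset(df, grepl(paste0('^[', letters[i], LETTERS[i], ']'), uniqueID))$uniqueID
  if (length(stake) == 0) next  # nothing to compare for this letter
  stakeDist <- adist(stake, ignore.case=TRUE)
  write.table(stakeDist, paste0("stake_", LETTERS[i], ".csv"), quote=TRUE, sep=',')
}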
Using a combination of paste0, and the builtin letters and LETTERS this creates your grep expression.
Using subset, the correct IDs are extracted
paste0 will also create a unique filename for write.table().
And it is all tied together using a for()-loop
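For the second ask (avoiding the round trip through Excel), one possible approach is to group the names with split() and list only the pairs whose edit distance is small. This is a rough sketch, not a tested solution for your data: the sample names, the helper closePairs and the threshold maxDist are all made up for illustration.

```r
# hypothetical sample data standing in for the real uniqueID vector
uniqueID <- c("Company A", "Companyy A", "Acme Corp",
              "Beta LLC", "Beta L.L.C.", "3M")

# group by first letter; anything not starting with A-Z goes to "other"
first <- toupper(substr(uniqueID, 1, 1))
group <- ifelse(first %in% LETTERS, first, "other")
byLetter <- split(uniqueID, group)

# hypothetical helper: pairs within one group that differ by at most maxDist edits
closePairs <- function(x, maxDist = 2) {
  d <- adist(x, ignore.case = TRUE)
  # upper.tri keeps each pair once and drops the diagonal
  idx <- which(d > 0 & d <= maxDist & upper.tri(d), arr.ind = TRUE)
  data.frame(a = x[idx[, "row"]], b = x[idx[, "col"]], dist = d[idx],
             stringsAsFactors = FALSE)
}

# one data frame of candidate misspellings across all groups
candidates <- do.call(rbind, lapply(byLetter, closePairs))
candidates
```

The result is a short table of likely duplicates to review, instead of 26 distance matrices. The right value for maxDist depends on how long and how similar your names are.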
I am trying to concatenate two character vectors in a way that the following output is produced
aggmodes<-c("17x8","17x7x8","17x28x8")
listdata<-c("Motion.Age","res.Context.Only")
The output should be like this
"Motion.Age,17x8"
"Motion.Age,17x7x8"
"Motion.Age,17x28x8"
"res.Context.Only,17x8"
"res.Context.Only,17x7x8"
"res.Context.Only,17x28x8"
I have written following code:
c<-as.vector(sapply(1:length(listdata), function(i){
sapply(1:length(aggmodes),function(j){paste(aggmodes,listdata)})
}))
but it gives me a much longer vector than the six elements I expect. I am sorry if this is a duplicate, but I could not find a correct answer for solving my problem.
c(sapply(listdata,paste,aggmodes,sep=","))
# [1] "Motion.Age,17x8" "Motion.Age,17x7x8" "Motion.Age,17x28x8"
# [4] "res.Context.Only,17x8" "res.Context.Only,17x7x8" "res.Context.Only,17x28x8"
We paste each element of listdata to all of aggmodes with sapply, and then unwrap it all.
Your code is suboptimal because you don't leverage the fact paste is vectorized, however it can work with a slight modification:
as.vector(sapply(1:length(listdata), function(i){
sapply(1:length(aggmodes),function(j){paste(listdata[i],aggmodes[j],sep=",")})
}))
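as.character(t(outer(listdata, aggmodes, paste, sep = ",")))
outer takes three main arguments: X, Y, and FUN. It applies FUN to every pairing of an element of X with an element of Y, in this case pasting them together. Since outer returns a matrix (here with one row per element of listdata), transpose it with t so the values come out grouped by listdata, then wrap it in as.character to get a plain vector!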
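Since paste is vectorized, you can also skip the apply family altogether. A minimal sketch using rep() to line up the two vectors, with the data from the question (the name combined is just illustrative):

```r
aggmodes <- c("17x8", "17x7x8", "17x28x8")
listdata <- c("Motion.Age", "res.Context.Only")

# repeat each listdata entry once per aggmode; paste recycles aggmodes to match
combined <- paste(rep(listdata, each = length(aggmodes)), aggmodes, sep = ",")
combined
# [1] "Motion.Age,17x8"          "Motion.Age,17x7x8"
# [3] "Motion.Age,17x28x8"       "res.Context.Only,17x8"
# [5] "res.Context.Only,17x7x8"  "res.Context.Only,17x28x8"
```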
In RHadoop, when we make the results readable, it will use the code:
results <- data.frame(words=unlist(lapply(Output_data,"[[",1)), count=unlist(lapply(Output_data,"[[",2)))
But what does lapply(Output_data,"[[",1) mean? Especially the "[[" and the 1 in lapply.
The syntax of extracting list elements with [ or [[ is often used in R. It is not specific to any packages. The meaning of the syntax
lapply(Output_data,"[[",1)
is: loop through the object 'Output_data' and extract ([[) the first element of each item. So, if 'Output_data' is a list of data.frames, it will extract the first column of each data.frame, and if it is a list of vectors, it extracts the first element of each vector. It does the same thing as an anonymous function, i.e.
lapply(Output_data, function(x) x[[1]])
The latter syntax is more clear and easier to understand but the former is compact and a bit more stylish...
More info about the [[ can be found in ?Extract
Operators like [[ , [ and <- are actually functions.
list[[1]]
is equal to
`[[`(list,1)
In your case, lapply(Output_data,"[[",1) means to extract the first element of every item (column or sublist) of Output_data. The 1 is an argument passed on to the [[ function.
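Here is a small sketch you can run to check this. The contents of Output_data below are made up; the real RHadoop output will look different, but the indexing works the same way on any list of (word, count) pairs.

```r
# hypothetical stand-in for the RHadoop output: a list of (word, count) pairs
Output_data <- list(list("apple", 3L), list("pear", 5L))

# "[[" is called once per element, with 1 (or 2) passed as the index
words <- unlist(lapply(Output_data, "[[", 1))
count <- unlist(lapply(Output_data, "[[", 2))
results <- data.frame(words = words, count = count)

# the string form and the anonymous-function form give the same result
identical(lapply(Output_data, "[[", 1),
          lapply(Output_data, function(x) x[[1]]))  # TRUE
```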
Is it possible to convert a list of strings so that it will return the value it's named after?
For example, I have this list of strings that I made with paste:
mylist <- c("nhdata$Credit", "nhdata$Honey", "nhdata$Plants")
mylist
The list I'm working with is a lot bigger (about 35). So is it possible to print these strings in a way that it will actually call the value they are named after?
I appreciate any help; this is my first question on Stack Overflow.
You can use the get function:
temp <- 1:10
get("temp")
In your example, you may do better to use the following, though:
mylist <- c("Credit", "Honey", "Plants")
nhdata[, mylist[1]]
or similarly,
nhdata[[mylist[1]]]
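If the strings already have the nhdata$ prefix baked in, one option is to strip it and then index by column name. A sketch, assuming nhdata is a data frame with those columns (the values here are invented, and cols, values and means are illustrative names):

```r
# hypothetical data frame standing in for the real nhdata
nhdata <- data.frame(Credit = c(1, 2), Honey = c(3, 4), Plants = c(5, 6))
mylist <- c("nhdata$Credit", "nhdata$Honey", "nhdata$Plants")

# drop the literal "nhdata$" prefix to get plain column names
cols <- sub("nhdata$", "", mylist, fixed = TRUE)

# look each column up by name; this scales to 35 names without any get() calls
values <- lapply(cols, function(nm) nhdata[[nm]])
names(values) <- cols

# e.g. summarise every selected column in one go
means <- sapply(values, mean)
means
```

nhdata[cols] returns the same columns as a data frame, if that is more convenient than a list.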
I have a data frame with a column called listA, and a listB. I want to pull out only those rows in the data frame which match to an entry in listB, so I have:
newData <- mydata[mydata$listA %in% listB,]
However, some entries of listA are in the format "ABC /// DEF", where both ABC and DEF are possible entries in listB.
I want to pull out the rows of the data frame which have a listA for which any of the words match to an entry in listB. So if listB had "ABC" in it, that entry would be included in newData. I found the strsplit function, but things like
strsplit(mydata$listA," ") %in% listB
always returns FALSE, presumably because it's checking if the whole list returned by strsplit is an entry in listB.
match(word_vector, target_vector) allows both arguments to be vectors, which is what you want (note: that's vectors, not lists). In fact, the %in% operator is just a thin wrapper around match(), as its help page tells you.
But stringi package's methods stri_match_* may well directly do what you want, are all vectorized, and are way more performant than either match() or strsplit():
stri_match_all stri_match_all_regex stri_match_first stri_match_first_regex stri_match_last stri_match_last_regex
Also, you probably won't need to use an explicit split function, but if you must, then use stringi::stri_split_*(), avoid base::strsplit()
Note on performance: avoid splitting strings like the plague in R whenever possible; it creates a lot of unnecessary temporary objects for the garbage collector to clean up, as gc() will show you. That's yet another reason why stringi is very efficient.
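For comparison, here is a base R sketch of the per-row check the question is after. It does use strsplit, so it ignores the performance advice above, but it shows why the original attempt returned FALSE: %in% has to be applied to each split result separately. The sample data and the names parts and keep are made up.

```r
# hypothetical sample data
mydata <- data.frame(listA = c("ABC /// DEF", "GHI", "JKL /// ABC", "XYZ"),
                     value = 1:4, stringsAsFactors = FALSE)
listB <- c("ABC", "GHI")

# split on the literal separator; the result is a list with one vector per row
parts <- strsplit(mydata$listA, " /// ", fixed = TRUE)

# keep a row if any of its pieces appears in listB
keep <- vapply(parts, function(p) any(p %in% listB), logical(1))
newData <- mydata[keep, ]
newData
```

With this sample, the first three rows are kept and "XYZ" is dropped.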