Trying to learn some R after doing mostly Haskell for rather a long time, I got stuck on a problem I would usually solve using unzip[1] and map.
I have a sequence of strings, each containing two substrings separated by an underscore. I want to "unzip" this sequence into something like a data frame or a matrix, where the first column is the sequence of all the first substrings and the second column the sequence of all the second substrings.
Is there any analogue to unzip in R, and would it be considered idiomatic to use it here, or am I approaching this from altogether the wrong direction?
[1] Given a list (or more generally any kind of sequence) of pairs unzip produces a pair of lists, in the obvious way.
You're on the right track. You want strsplit:
vec <- paste(letters,letters[26:1],sep='_')
out <- strsplit(vec,'_')
That's a list, and sapply will get the vectors out:
data.frame(one = sapply(out,'[',1), two = sapply(out,'[',2))
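A closer analogue to unzip, sketched here under the assumption that every string contains exactly one underscore, is to stack the split pieces into a matrix with do.call and rbind. The column names one and two are illustrative only.

```r
# Same toy input as above: "a_z", "b_y", ..., "z_a"
vec <- paste(letters, letters[26:1], sep = "_")

# strsplit() returns a list of length-2 character vectors;
# do.call(rbind, ...) stacks them into a 26 x 2 character matrix
parts <- do.call(rbind, strsplit(vec, "_", fixed = TRUE))

# Each matrix column is one half of the "unzipped" pairs
unzipped <- data.frame(one = parts[, 1], two = parts[, 2],
                       stringsAsFactors = FALSE)
head(unzipped, 3)
```

If some strings had more or fewer pieces, rbind would recycle values, so the sapply version is the safer choice for ragged input.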
I am very new to R and coding in general. So far, I have done two projects that are big for me and took several hours to finish. In both projects I had the situation that I needed a vector as an input like this c("Washington", "Dakota", "New York")
In both projects, the vector consisted of 20+ entries. The information, such as Washington or Dakota, that I need to write in the vector, is also available as a column in a dataframe.
I basically had to type out everything that was already in a column again. Since I am a bit lazy sometimes and would rather spend my time and concentration on finding the right functions etc., I wondered if it is possible to turn the column of a dataframe into a vector and then just pass the vector as input, instead of having to type everything again.
I tried turning the df column into a vector with as.vector, but it did not work when I replaced c() with the name of the vector.
Is that not possible at all, or how do I have to write it?
A data.frame is a list of vectors that are all the same length. You can access a specific vector with $ or [[.
Example: iris$Species or more safely iris[['Species']].
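As a minimal sketch using the built-in iris data (the variable names species, species_levels and subset_rows are made up for illustration), the extracted column can stand in for a hand-typed c(...) vector. Note that Species is a factor, so as.character is used to get plain strings.

```r
# iris ships with R; Species is a factor column
species <- as.character(iris[["Species"]])  # plain character vector, one entry per row
species_levels <- unique(species)           # each distinct value once

# Use the vector wherever a hand-typed c("setosa", "versicolor") would go
subset_rows <- iris[iris[["Species"]] %in% species_levels[1:2], ]
nrow(subset_rows)
```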
A function in a package gives me a character, where the original strings are merged together. I need to separate them, in other words I have to find the original elements. Here is an example and what I have tried:
orig<-c("answer1","answer2","answer3")
result<-"answer3answer2"
What I need as an outcome is:
c("answer2","answer3")
I have tried to split() result, but there is no string to base it on, especially that I have no former knowledge of what the answers will be.
I have tried to match() the result to the orig, but I would need to do that with all substrings.
There has to be an easy solution, but I haven't found it.
index <- gregexpr(paste(orig,collapse='|'),result)[[1]]
starts <- as.numeric(index)
stops <- starts + attributes(index)$match.length - 1
substring(result, starts, stops)
This should work for well defined and reversible inputs. Alternatively, is it possible to append some strings to the input of the function, such that it can be easily separated afterwards?
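The same idea can be written more compactly with regmatches, which extracts the matched substrings directly. This is only a sketch and assumes, as above, that the original strings contain no regex metacharacters and that none is a prefix of another.

```r
orig   <- c("answer1", "answer2", "answer3")
result <- "answer3answer2"

# Alternation pattern: "answer1|answer2|answer3"
pattern <- paste(orig, collapse = "|")

# gregexpr() finds every match; regmatches() pulls out the matched text
found <- regmatches(result, gregexpr(pattern, result))[[1]]
found        # matches in order of appearance in result
sort(found)  # sorted, if the order of orig is preferred
```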
What you're describing seems to be exactly string matching, and for your strings, grepl seems to be just the thing, in particular:
FindSubstrings <- function(orig, result){
orig[sapply(orig, grepl, result)]
}
In more detail: grepl takes a pattern argument and looks whether it occurs in your string (result in our case), and returns a TRUE/FALSE value. We subset the original values by the logical vector - does the value occur in the string?
Possible improvements:
fixed=TRUE may be a bright idea, because you don't need the full regex power for simple string matching
some match patterns may contain others, for example, "answer10" contains "answer1"
stringi may be faster for such tasks (just rumors floating around, haven't rigorously tested), so you may want to look into it if you do this a lot.
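The first two points can be sketched as follows. FindSubstringsFixed is a hypothetical name for a variant that passes fixed=TRUE; the second call shows the prefix caveat, which this simple approach does not solve.

```r
# Variant of the function above with literal (non-regex) matching
FindSubstringsFixed <- function(orig, result) {
  hits <- vapply(orig, function(p) grepl(p, result, fixed = TRUE), logical(1))
  orig[hits]
}

kept <- FindSubstringsFixed(c("answer1", "answer2", "answer3"), "answer3answer2")
kept

# Caveat: "answer1" is reported even though only "answer10" was merged in
overlap <- FindSubstringsFixed(c("answer1", "answer10"), "answer10")
overlap
```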
I have 75 matrices that I want to search through. The matrices are named a1r1, a1r2, a1r3, a1r4, a1r5, a2r1,...a15r5, and I have a list with all 75 of those names in it; each matrix has the same number of rows and columns. Inside some nested for loops, I also have a line of code that, for the first matrix looks like this:
total <- (a1r1[row,i]) + (a1r1[row,j]) + (a1r1[row,k])
(i, j, k, and row are all variables that I am looping over.) I would like to automate this line so that the for loops would fully execute using the first matrix in the list, then fully execute using the second matrix and so on. How can I do this?
(I'm an experienced programmer, but new to R, so I'm willing to be told I shouldn't use a list of the matrix names, etc. I realize too that there's probably a better way in R than for loops, but I was hoping for sort of quick and dirty at my current level of R expertise.)
Thanks in advance for the help.
Here is the R way to do this:
lapply(ls(pattern='^a[0-9]+r[0-9]+$'),
function(nn) {
x <- get(nn)
sum(x[row,c(i,j,k)])
})
ls returns the names of all variables matching a pattern.
lapply loops over the resulting vector of names.
get turns a name into the variable it refers to.
Indexing with c(i,j,k) and the vectorized sum function replaces the three separate additions.
It's not bad practice to automatically build lists of names designating your objects. You can build such lists with paste, rep, and sequences such as 0:10. Once you have a vector of object names (let's call it mylist), applying get to each name, or mget to the whole vector, gives the objects themselves.
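A small sketch of that idea, using two tiny stand-in matrices instead of the real 75 and made-up values for row, i, j and k:

```r
# All 75 names, a1r1 ... a15r5, built without typing them out
all_names <- paste0("a", rep(1:15, each = 5), "r", rep(1:5, times = 15))

# Stand-in data: only two of the matrices exist in this sketch
a1r1 <- matrix(1:6, nrow = 2)
a1r2 <- matrix(7:12, nrow = 2)

# mget() looks up several objects by name and returns them as a named list
mats <- mget(all_names[1:2])

# Made-up loop values for illustration
row <- 1; i <- 1; j <- 2; k <- 3
totals <- vapply(mats, function(m) sum(m[row, c(i, j, k)]), numeric(1))
totals
```

Storing the matrices in a list from the start, rather than as 75 separate variables, would avoid the name lookup altogether.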
OK, I'm stuck in a dumbness loop. I've read through the helpful ideas at How to sort a dataframe by column(s)?, but need one more hint. I'd like a function that takes a matrix with an arbitrary number of columns and sorts by all columns in sequence, i.e., one that, for a matrix foo with N columns,
does the equivalent of foo[order(foo[,1],foo[,2],...foo[,N]),]. I am happy to use a with or by construction, and if necessary define the colnames of my matrix, but I can't figure out how to automate the collection of arguments to order (or to with).
Or, I should say, I could build the entire bloody string with paste and then call it, but I'm sure there's a more straightforward way.
The most elegant (for certain values of "elegant") way would be to turn it into a data frame, and use do.call:
foo[do.call(order, as.data.frame(foo)), ]
This works because a data frame is just a list of variables with some associated attributes, and can be passed to functions expecting a list.
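A quick check on a made-up 4 x 2 matrix shows the effect. Ties in the first column are broken by the second.

```r
# Made-up example: column 1 is 2,1,2,1 and column 2 is 9,8,7,6
foo <- matrix(c(2, 1, 2, 1,
                9, 8, 7, 6), ncol = 2)

# Each data frame column becomes one argument to order()
sorted <- foo[do.call(order, as.data.frame(foo)), , drop = FALSE]
sorted
```

One caveat, stated as an assumption about unusual input: if the matrix had column names that clash with order's own arguments (such as decreasing), wrapping the data frame in unname() would avoid the clash.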
I am pretty new to R and have a couple of questions about a loop I am attempting to execute. I will try to explain myself as best as possible regarding what I wish the loop to do.
for(i in c(1988:1999,2000:2006)){
yearerrors=NULL
binding=do.call("rbind.fill",x[grep(names(x), pattern ="1988.* 4._ data=")])
cmeans=lapply(binding[,2:ncol(binding)],mean)
datcmeans=as.data.frame(cmeans)
finvec=datcmeans[1,]
kk=0
result=RMSE2(yields[(kk+1):(kk+ncol(binding))],finvec)
kk=kk+ncol(binding)
yearerrors=c(result)
}
yearerrors
First I wish for the loop to iterate over file names of data.
Specifically over the years 1988-2006 in the place where 1988 is
placed right now in the binding statement. x is a list of data files
inputted into R and the 1988 is part of the file name. So, I have
file names starting with 1988,1989,...,2006.
yields is a numeric vector and I would like to input the indices of
the vector into the function RMSE2 as indicated in the loop. For
example, over the first iteration I wish for the indices 1 to the
number of columns in binding to be used. Then for the next iteration
I want the first index to be 1 more than what the previous iteration
ended with and continue to a number equal to the number of columns in the next binding
statement. I just don't know if what I have written will accomplish
this.
Finally, I wish to store each of these results in the vector
yearerrors and then access this vector afterwards.
Thanks so much in advance!
OK, there's a heck of a lot of guesswork here because the structure of your data is extremely unclear, and I have no idea what the RMSE2 function is (you've given no detail). Based on your question the other day, I'm going to assume that your data is in .csv files. I'm going to have a stab at your problem.
I would start by building the combined dataframe while reading the files in, not doing one then the other. Like so:
#Set your working directory to the folder containing the .csv files
#I'm assuming they're all in the form "YEAR.something.csv" based on your pattern matching
filenames <- list.files(".", pattern="\\.csv$") #if you only want to match a specific year then add it to the pattern match
years <- gsub("([0-9]+).*", "\\1", filenames)
library(plyr) #mdply comes from the plyr package
df <- mdply(filenames, read.csv)
df$year <- as.numeric(years[df$X1]) #Adds the year
#Your column mean dataframe didn't work for me
cmeans <- as.data.frame(t(colMeans(df[,2:ncol(df)])))
It then gets difficult to know what you're trying to achieve. Since your datcmeans is a one row data.frame, datcmeans[1,] doesn't change anything. So if a one row from a dataframe (or a numeric vector) is an argument required for your RMSE2 function, you can just pass it datcmeans (cmeans in my example).
Your code from then on is pretty much indecipherable to me. Without knowing what yields looks like, or how RMSE2 works, it's pretty much impossible to help more.
If you're going to do a loop here, note that setting kk=kk+ncol(binding) at the end of each iteration is not going to help you: because kk=0 is set inside the loop, kk is reset to zero at the start of every iteration and never carries the previous offset forward, which is, I'm guessing, not what you want. Here's my guess at what you need here (assuming looping is required).
yearerrors=vector("numeric", ncol(df)) #Create empty vector ahead of loop
for(i in 1:ncol(df)) {
yearerrors[i] <- RMSE2(yields[i:ncol(df)], finvec) #finvec as in your code (cmeans in my example)
}
yearerrors
I honestly can't imagine a function that would work like this, but it seems the most logical adaption of your code.
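For the running-index behaviour described in the question (each iteration picks up where the previous one ended), a rough sketch follows. Everything in it is made up for illustration: rmse_sketch is a hypothetical stand-in for RMSE2, which is not shown, and yields and preds are toy data. The point is only that the offset kk and the result vector are initialised once, before the loop.

```r
# Hypothetical stand-in; the real RMSE2 is not shown in the question
rmse_sketch <- function(observed, predicted) sqrt(mean((observed - predicted)^2))

yields <- c(1, 2, 3, 4, 5, 6)           # made-up observations
preds  <- list(c(1, 2), c(3, 5, 5), 7)  # made-up per-year means, differing widths

yearerrors <- numeric(length(preds))    # one slot per year, created before the loop
kk <- 0                                 # running offset, initialised once
for (n in seq_along(preds)) {
  width <- length(preds[[n]])
  idx <- (kk + 1):(kk + width)          # next block of indices into yields
  yearerrors[n] <- rmse_sketch(yields[idx], preds[[n]])
  kk <- kk + width                      # carry the offset into the next iteration
}
yearerrors
```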