I'm trying to replace the date in some POSIXct data I have in a data frame. Essentially I thought of using something like this:
as.POSIXct(sub("\\S+", "2018-07-02", x))
x being the value in the row of my original data frame. I thought that the most efficient way would be to use something like apply to iterate through the rows, something like:
apply(df$original.date,1,function(x) as.POSIXct(sub("\\S+", "2018-07-02", x)))
However, it doesn't seem to like it, giving me an error regarding positive length. I was wondering first of all whether my approach was sound, and if so how to fix it. Alternatively I'm all ears if there is a better approach.
Thank you.
The functions sub and as.POSIXct are vectorized, hence a single call
df$original.date <- as.POSIXct(sub("\\S+", "2018-07-02", df$original.date))
will replace the original.date column in the data frame.
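To make the vectorized call concrete, here is a minimal sketch on made-up data. The sample timestamps, the explicit format() step, and the "UTC" time zone are assumptions for illustration; the original question does not say what the data looks like.

```r
# Hypothetical data frame with a POSIXct column (values are made up).
df <- data.frame(
  original.date = as.POSIXct(c("2018-01-15 08:30:00", "2018-03-20 17:45:10"),
                             tz = "UTC")
)

# Turn the timestamps into text explicitly, so the pattern always sees
# "date<space>time" regardless of how the R version prints POSIXct values.
stamp <- format(df$original.date, "%Y-%m-%d %H:%M:%S")

# sub() and as.POSIXct() both work on the whole vector at once:
# the first run of non-space characters (the date) is swapped out.
df$original.date <- as.POSIXct(sub("\\S+", "2018-07-02", stamp), tz = "UTC")

print(df)
```

The time of day is kept and only the date part changes, with no loop over rows.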
As for your apply code:
> apply(df$original.date,1,function(x) as.POSIXct(sub("\\S+", "2018-07-02", x)))
Error in apply(df$original.date, 1, function(x) as.POSIXct(sub("\\S+", :
dim(X) must have a positive length
The problem is that apply expects a matrix or array, while you give it a column from a data frame, i.e. a vector. You could use lapply or sapply. But that is unnecessary since as shown above the entire column can be changed in one go.
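The point about dimensions can be checked directly. The sketch below uses a plain character vector as a stand-in for the column (an assumption, to keep the example small) and shows that it has no dim attribute, which is what apply complains about, and that lapply accepts it without trouble.

```r
# A stand-in for one data frame column: a plain vector has no dimensions.
x <- c("2018-01-15 08:30:00", "2018-03-20 17:45:10")
print(is.null(dim(x)))  # a vector carries no dim attribute

# lapply iterates over the elements of a vector and returns a list,
# one POSIXct value per input element.
swapped <- lapply(x, function(s) {
  as.POSIXct(sub("\\S+", "2018-07-02", s), tz = "UTC")
})
print(swapped)
```

This works, but it does element by element what the single vectorized call does in one step.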
Related
I have a dataframe with cases that repeat on the rows. Some rows have more complete data than others. I would like to group cases and then assign the first non-missing value to all NA cells in that column for that group. This seems like a simple enough task but I'm stuck. I have working syntax but when I try to use apply to apply the code to all columns in the dataframe I get a list back instead of a dataframe. Using do.call(rbind) or rbindlist or unlist doesn't quite fix things either.
Here's the syntax.
df$groupid<-group_indices (df,id1,id2) #creates group id on the basis of a combination of two variables
df%<>%group_by(id1,id2) #actually groups the dataframe according to these variables
df<-summarise(df, xvar1=xvar1[which(!is.na(xvar1))[1]]) #this code works great to assign the first non missing value to all missing values but it only works on 1 column at a time (X1).
I have many columns so I try using apply to make this a manageable task..
df<-apply(df, MARGIN=2, FUN=function(x) {summarise(df, x=x[which(!is.na(x))[1]])
}
)
This gets me a list for each variable, I wanted a dataframe (which I would then de-duplicate). I tried rbindlist and do.call(rbind) and these result in a long dataframe with only 3 columns - the two group_by variables and 'x'.
I know the problem is simply how I'm using apply, probably the indexing with 'which', but I'm stumped.
What about using lapply with do.call and cbind, like the following:
df <- do.call(cbind, lapply(df, function(x) {summarise(df, x=x[which(!is.na(x))[1]])}))
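One caveat, as far as I can tell: inside that lapply call x is the full column rather than a per-group slice, so every group may end up with the same value. As an alternative sketch, base R's aggregate applies a function to every non-grouping column within each group. This swaps in a different technique from the dplyr code in the question; the sample data and the helper name first_non_na are made up for illustration.

```r
# Hypothetical data: two grouping columns and two columns with gaps.
df <- data.frame(
  id1   = c(1, 1, 2, 2),
  id2   = c("a", "a", "b", "b"),
  xvar1 = c(NA, 5, 7, NA),
  xvar2 = c("u", NA, NA, "v"),
  stringsAsFactors = FALSE
)

# Helper (hypothetical name): first non-missing value of a vector,
# or NA if every value is missing.
first_non_na <- function(x) x[which(!is.na(x))[1]]

# One row per (id1, id2) group, each data column reduced separately.
out <- aggregate(df[c("xvar1", "xvar2")],
                 by  = df[c("id1", "id2")],
                 FUN = first_non_na)
print(out)
```

The result is already a de-duplicated data frame with one row per group, so no rbind step is needed afterwards.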
colnames gives me the column names for a whole dataframe. Is there any way to get the name of one specified column? I would need this for naming labels when plotting data in ggplot.
So say my data is like this:
df1 <- data.frame(a=sample(1:50,10), b=sample(1:50,10), c=sample(1:50,10))
I would need something like paste(colnames(df1[,1])) which obviously won't work.
Any ideas?
You get the name like this:
colnames(df1)[1]
# i.e. call the first element of colnames not colnames of the first vector
However, by removing your comma, e.g.:
colnames(df1[1])
you also get the name, because indexing with [x] rather than [,x] or [[x]] keeps the data.frame structure instead of reducing the column to a vector, as $x and [,x] do.
names(df1)[1]
will give you the name of the first column. So too will
names(df1[1])
Neither uses a comma.
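The variants discussed in these answers can be compared side by side. This is a small sketch using the df1 from the question; set.seed is added only so the sample values are reproducible.

```r
set.seed(1)  # only to make the sample data reproducible
df1 <- data.frame(a = sample(1:50, 10), b = sample(1:50, 10), c = sample(1:50, 10))

# Index the vector of names: works for any column position.
print(colnames(df1)[1])
print(names(df1)[1])

# Single-bracket indexing without a comma keeps a one-column data frame,
# so it still has a name to report.
print(names(df1[1]))

# With the comma the column is reduced to a plain vector,
# which has no column names left.
print(is.null(colnames(df1[, 1])))
```

For a ggplot label, the resulting string can be passed on directly, for example as the axis title.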
Would colnames(df1)[1] solve the problem?
I am having trouble turning my data.frame into a matrix format. Because I wanted to change my data.frame with mostly factor variables into a numeric matrix, I used the following code
UN2010frame <- data.matrix(lapply(UN2010, as.numeric))
However, when I checked the mode of UN2010frame, it still showed up as a list. Because the code I want to run (Ordrating) does not accept data in list format, I used UN2010matrix <- unlist(UN2010frame) to unlist my matrix. When I did this, my first row (which was formerly a row with column names) turned into NAs. This was a problem for me because when I tried to run an ordinal IRT model using this data set, I got the following error message.
> Error in 1:nrow(Y) : argument of length 0
I think it is because all the values in my first row are now gone.
If you could help me on any front, it would be deeply appreciated.
Thank you very much!
Haillie
First, the correct use of data.matrix is :
data.matrix(UN2010)
as it converts automatically to numeric. The lapply in your code is the first source for the error you get. You put a list in the data.matrix function, not a dataframe. So it returns a list of matrices, and not a matrix.
Second, unlist returns a vector, not a matrix. So pretty sure you won't find a "first row with NA", as you have a vector. Which might explain part of your confusion.
You probably have a character column somewhere. Converting this to numeric gives NA. If you don't want this, then exclude them from the further analysis. One possibility is to use colwise() from the plyr package to convert only the factors:
colwise(as.numeric,is.factor)(UN2010)
Which returns a dataframe with only the factors. This can be easily converted by data.matrix() or as.matrix(). Alternatively you use the base solution :
id <- sapply(UN2010,is.character)
sapply(UN2010[!id],as.numeric)
which will return you a matrix with all non-character columns converted to numeric. If you really want to keep the dataframe with all original columns, you can do :
UN2010frame <- UN2010
UN2010frame[!id] <- lapply(UN2010[!id],as.numeric)
Toy example code :
UN2010 <- data.frame(
F1 = factor(rep(letters[1:3],10)),
F2 = factor(rep(letters[5:10],5)),
Char = rep(letters[11:16],each=5),
Num = 1:30,
stringsAsFactors=FALSE
)
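Running the base solution on this toy data shows what each step returns. This is only a sketch that repeats the toy definition so it runs on its own; the object names num_matrix and UN2010frame follow the answer's naming where one exists and are otherwise made up.

```r
UN2010 <- data.frame(
  F1   = factor(rep(letters[1:3], 10)),
  F2   = factor(rep(letters[5:10], 5)),
  Char = rep(letters[11:16], each = 5),
  Num  = 1:30,
  stringsAsFactors = FALSE
)

# Flag the character columns so they can be left out of the conversion.
id <- sapply(UN2010, is.character)

# sapply simplifies the equal-length numeric columns into a numeric matrix.
num_matrix <- sapply(UN2010[!id], as.numeric)
print(dim(num_matrix))
print(mode(num_matrix))

# Variant that keeps every original column in a data frame.
UN2010frame <- UN2010
UN2010frame[!id] <- lapply(UN2010[!id], as.numeric)
str(UN2010frame)
```

Factors are converted to their integer level codes here, which is usually what an ordinal model expects, but it is worth checking that the level order matches the intended ordering.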
Try as.data.frame instead of data.matrix.
I've got a data frame with far too many rows to be able to do a spatial correlogram. Instead, I want to grab 40 rows for each species and run my correlogram on that subset.
I wrote a function to subset a data frame as follows:
samp <- function(dataf)
{
dataf[sample(1:dim(dataf)[1], size=40, replace=FALSE),]
}
Now I want to apply this function to each species in a larger data frame.
When I try something like
culled_data = ddply (larger_data, .(species), subset, samp)
I get this error:
Error in subset.data.frame(piece, ...) :
'subset' must evaluate to logical
Anyone got ideas on how to do this?
It looks like it should work once you remove the subset argument from your call, so that samp itself is the function ddply applies to each piece.
Dirk's answer is of course correct, but I'm posting my own to add some explanation.
Why doesn't your call work?
First of all, your syntax is a shorthand. It's equivalent to
ddply(larger_data, .(species), function(dfrm) subset(dfrm, samp))
so you can clearly see that you provide a function (see class(samp)) as the second argument of subset. You could use samp(dfrm) instead, but that won't work either, because samp returns a data.frame and subset needs a logical vector. So you could only use samp(dfrm) if it returned a logical index.
How to use subset in this case?
Make subset work by feeding it a logical vector:
ddply (larger_data, .(species), subset, sample(seq_along(species)<=40))
I create a logical vector with 40 TRUE values and shuffle it. (By the way, this also works when a species has fewer than 40 cases; it then returns all of them.)
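The shuffled-logical-vector idea also works without plyr. The sketch below swaps in base R's split, lapply and rbind for ddply; the data frame, its column names and the group sizes are made up for illustration.

```r
set.seed(42)  # only to make the random draw reproducible

# Hypothetical data: species "a" has 100 rows, species "b" only 25.
larger_data <- data.frame(
  species = rep(c("a", "b"), times = c(100, 25)),
  value   = seq_len(125)
)

# Split into one data frame per species.
pieces <- split(larger_data, larger_data$species)

culled <- lapply(pieces, function(piece) {
  # At most 40 TRUE values, shuffled into random positions.
  keep <- sample(seq_len(nrow(piece)) <= 40)
  piece[keep, , drop = FALSE]
})

# Stack the per-species samples back into one data frame.
culled_data <- do.call(rbind, culled)
print(table(culled_data$species))
```

Unlike sampling 40 row indices without replacement, this does not fail for a species with fewer than 40 rows; it simply keeps all of them.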
What if one wants to apply a function, e.g. to each row of a matrix, but also wants to pass the number of that row as an argument to the function? As an example, suppose you wanted to get the n-th root of the numbers in each row of a matrix, where n is the row number. Is there another way (using apply only) than column-binding the row numbers to the initial matrix, like this?
test <- data.frame(x=c(26,21,20),y=c(34,29,28))
t(apply(cbind(as.numeric(rownames(test)),test),1,function(x) x[2:3]^(1/x[1])))
P.S. Actually if test was really a matrix : test <- matrix(c(26,21,20,34,29,28),nrow=3) , rownames(test) doesn't help :(
Thank you.
What I usually do is to run sapply on the row numbers 1:nrow(test) instead of test, and use test[i,] inside the function:
t(sapply(1:nrow(test), function(i) test[i,]^(1/i)))
I am not sure this is really efficient, though.
If you give the function a name rather than making it anonymous, you can pass arguments more easily. We can use nrow to get the number of rows and pass a vector of the row numbers in as a parameter, along with the frame to be indexed this way.
For clarity I used a different example function; this example multiplies column x by column y for a 2 column matrix:
test <- data.frame(x=c(26,21,20),y=c(34,29,28))
myfun <- function(position, df) {
print(df[position,1] * df[position,2])
}
positions <- 1:nrow(test)
lapply(positions, myfun, test)
cbind()ing the row numbers seems a pretty straightforward approach. For a matrix (or a data frame) the following should work:
apply( cbind(1:(dim(test)[1]), test), 1, function(x) plot(x[-1], main=x[1]) )
or whatever you want to plot.
Actually, in the case of a matrix, you don't even need apply. Just:
test^(1/row(test))
does what you want, I think. I think the row() function is the thing you are looking for.
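As a quick check of what row() returns and how it lines up with the values, here is a small sketch using the matrix from the question's P.S.:

```r
test <- matrix(c(26, 21, 20, 34, 29, 28), nrow = 3)

# row() returns a matrix of the same shape in which every entry
# is the row number of that position.
print(row(test))

# Element-wise power: each value is raised to 1 / (its row number),
# i.e. row 1 is unchanged, row 2 gets square roots, row 3 cube roots.
n_root <- test^(1 / row(test))
print(n_root)
```

Because the arithmetic is element-wise, no apply call or explicit loop over rows is involved.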
I'm a little confused, so excuse me if I get this wrong, but you want to work out the n-th root of the numbers in each row of a matrix, where n is the row number. If this is the case then it's really simple: create a new array with the same dimensions as the original, in which every entry equals its row number:
test_row_order = array(seq_len(nrow(test)), dim = dim(test))
Then simply apply a function (the n-th root in this case):
n_root = test^(1/test_row_order)