Error in 'colsplit' function? - r

Im am trying to split a column of a dataframe into 2 columns using transform and colsplit from reshape package. I don't get what I am doing wrong. Here's an example...
library(reshape)
df1 <- data.frame(col1=c("x-1","y-2","z-3"))
Now I am trying to split the col1 into col1.a and col1.b at the delimiter '-'. the following is my code...
df1 <- transform(df1,col1 = colsplit(col1,split='-',names = c('a','b')))
Now in my RStudio when I do View(df1) I do get to see col1.a and col1.b split the way I want to.
But when I run...
df1$col1.a or head(df1$col1.a) I get NULL. Apparently I am not able to make any further operations on these split columns. What exactly is wrong with this?

colsplit returns a list, the easiest (and idiomatic) way to assign these to multiple columns in the data frame is to use [<-
eg
df1[c('col1.a','col1.b')] <- colsplit(df1$col1,'-',c('a','b'))
it will be much harder to do this within transform (see Assign multiple new variables on LHS in a single line in R)

Related

Dcast() weird output

I have two dataframes. Applying the same dcast() function to the two get me different results in the output. Both the dataset have the same structure but different size. The first one has more than 950 rows:
The code I apply is:
trans_matrix_complete <- mod_attrib$transition_matrix
trans_matrix_complete[which(trans_matrix_complete$channel_from=="_3RDLIVE"),]
trans_matrix_complete <- rbind(trans_matrix_complete, df_dummy)
trans_matrix_complete$channel_to <- factor(trans_matrix_complete$channel_to,
levels = c(levels(trans_matrix_complete$channel_to)))
trans_matrix_complete <- dcast(trans_matrix_complete,
channel_from ~ channel_to,value.var = 'transition_probability')
And the trans_matrix_complete output I get is the following:
Something is not working as it should be as with the smaller dataframe of just few lines I get the following outcome:
Where
a) the row number is different. I'm not sure why there are two dots listed in the first case
b) and too, trying to assign rownames to the dataframe by
row.names(trans_matrix_complete) <- trans_matrix_complete$channel_from
does not work for the large dataframe, as despite the row.names contact the dataframe show up exactly as in the first image, without names assigned to rows.
Any idea about this weird behavior?
I resolved moving from dcast() to spread() of the package tidyverse using the following function:
trans_matrix_complete<-spread(trans_matrix_complete,
channel_to,transition_probability)
By applying spread() the two dataframe the matrix output is of the same format and accept rownames without any issue.
So I suspect it is all realted to the fact that dcast() and reshape2 package are not maintained anymore
Regards

R friendly way to convert lots of R data frame columns to lots of vectors

I looked at this solution: R-friendly way to convert R data.frame column to a vector?
but each solution seems to involve manually declaring the name of the vector being created.
I have a large dataframe with about 224 column names. I would like to break up the data frame and turn it into 224 different vectors which preserve their label without typing them all manually. Is there a way to step through the columns in the data frame and produce a vector which has the same name as the column or am I dreaming?
I think it's a bad idea but this would work (using mtcars data set):
list2env(mtcars, .GlobalEnv)
attach is another dangerous command that people use to be able to access the columns of a data frame directly with their names. If you don't know why it's dangerous, though, don't do it.
Here's another bad idea:
for(i in names(mtcars)) assign(i, mtcars[,i])
Just for Richard:
for (x in names(mtcars))
eval(parse(text=paste(x, '<- c(', paste(mtcars[[x]], collapse=',') ,')')))

Applying a function to a dataframe to trim empty columns within a list environment R

I am a naive user of R and am attempting to come to terms with the 'apply' series of functions which I now need to use due to the complexity of the data sets.
I have large, ragged, data frame that I wish to reshape before conducting a sequence of regression analyses. It is further complicated by having interlaced rows of descriptive data(characters).
My approach to date has been to use a factor to split the data frame into sets with equal row lengths (i.e. a list), then attempt to remove the trailing empty columns, make two new, matching lists, one of data and one of chars and then use reshape to produce a common column number, then recombine the sets in each list. e.g. a simplified example:
myDF <- as.data.frame(rbind(c("v1",as.character(1:10)),
c("v1",letters[1:10]),
c("v2",c(as.character(1:6),rep("",4))),
c("v2",c(letters[1:6], rep("",4)))))
myDF[,1] <- as.factor(myDF[,1])
myList <- split(myDF, myDF[,1])
myList[[1]]
I can remove the empty columns for an individual set and can split the data frame into two sets from the interlacing rows but have been stumped with the syntax in writing a function to apply the following function to the list - though 'lapply' with 'seq_along' should do it?
Thus for the individual set:
DF <- myList[[2]]
DF <- DF[,!sapply(DF, function(x) all(x==""))]
DF
(from an earlier answer to a similar, but simpler example on this site). I have a large data set and would like an elegant solution (I could use a loop but that would not use the capabilities of R effectively). Once I have done that I ought to be able to use the same rationale to reshape the frames and then recombine them.
regards
jac
Try
lapply(split(myDF, myDF$V1), function(x) x[!colSums(x=='')])

How should I apply the same formatting to a list of dataframes in R?

Here is what I've done so far. So, that's basically grabbing some tables off the internet using XML, putting them into a list of dataframes and then some mess trying (and failing) to format them in an efficient and consistent way.
I can't work out how to apply the same changes to all of the dataframes. I think I need to use llply, but I can't get it right. Overall I am trying to achieve:
Column names all legitimate R names using make.names, then use the
str_replace_all towards the end of the file to strip all non-alpha
characters so the names are the same
Next I want to remove all but the first four columns from all of the dataframes
Then I want to add a column with the title for each book. I guess I'll have to do this manually.
Finally, I want to do an rbind to join all of the dataframes together
What's really got me stumped is how to apply the same transformations to each dataframe in the list such as modifying their column names and cutting off rows. Is llply the right tool for the job? How do I use it?
So far the most I've been able to achieve is turning my list of dataframes into a list of vectors with the right names. I believe this is because when I tried using names() it returned the vector of correct names, rather than a dataframe with the correct names. This was my attempt:
tlist <- llply(tabs, function(x) as.data.frame(str_replace_all(make.names(names(x)), "[^[:alpha:]]", "")))
I don't think I'm a million miles away here, but I can't think how to get it to return the full df.
Use this instead:
f <- function(x)
{
y <- x[,1:4]
names(y) <- str_replace_all(make.names(names(y)), "[^[:alpha:]]", "")
y
}
result <- rbind.fill(llply(tabs, f))
EDIT: following #baptiste, this may be better:
result <- ldply(tabs, f)

Split dataframe using two columns of data and apply common transformation on list of resulting dataframes

I want to split a large dataframe into a list of dataframes according to the values in two columns. I then want to apply a common data transformation on all dataframes (lag transformation) in the resulting list. I'm aware of the split command but can only get it to work on one column of data at a time.
You need to put all the factors you want to split by in a list, eg:
split(mtcars,list(mtcars$cyl,mtcars$gear))
Then you can use lapply on this to do what else you want to do.
If you want to avoid having zero row dataframes in the results, there is a drop parameter whose default is the opposite of the drop parameter in the "[" function.
split(mtcars,list(mtcars$cyl,mtcars$gear), drop=TRUE)
how about this one:
library(plyr)
ddply(df, .(category1, category2), summarize, value1 = lag(value1), value2=lag(value2))
seems like an excelent job for plyr package and ddply() function. If there are still open questions please provide some sample data. Splitting should work on several columns as well:
df<- data.frame(value=rnorm(100), class1=factor(rep(c('a','b'), each=50)), class2=factor(rep(c('1','2'), 50)))
g <- c(factor(df$class1), factor(df$class2))
split(df$value, g)
You can also do the following:
split(x = df, f = ~ var1 + var2...)
This way, you can also achieve the same split dataframe by many variables without using a list in the f parameter.

Resources