I have a function that looks like this
calc_df <- function(A_df, B_df){
  C_df <- filter(A_df, Type == "Animal") %>%
    left_join(B_df) %>%
    as.numeric(C$Count)
I cannot get the last line to work; the first three lines work properly, but I would like the last line to take the column "Count" from the new data frame calculated in the function and make it numeric. (Right now it is a character vector.)
Note: I have to do this at the end of the function because, before the filter command, the Count column contains letters and cannot be coerced with as.numeric.
It looks like you're using dplyr and want to change or add a column. That is exactly what the dplyr::mutate function does.
Replace
as.numeric(C$Count)
with
mutate(Count = as.numeric(Count))
to replace the old, non-numeric Count column with the coerced-to-numeric replacement.
As to why your code didn't work, there are a few problems:
dplyr is made for working with data frames, and the main dplyr functions (select, filter, mutate, summarize, group_by, *_join, ...) expect data frames as the first argument, and then return data frames. By piping the result of a left join into as.numeric, you are really calling as.numeric(unnamed_data_frame_from_your_join, C$Count), which clearly doesn't make much sense.
You are trying to reference a data frame called C inside the definition of a data frame called C_df, which I think you mean to be the same thing. There are two issues here: (1) the mismatch between the names C and C_df, and (2) you can't reference C_df inside its own definition.
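Putting it together, a minimal sketch of the corrected function (using the column names from your question, and letting left_join guess the key as in your original) could look like this:
library(dplyr)
calc_df <- function(A_df, B_df){
  filter(A_df, Type == "Animal") %>%
    left_join(B_df) %>%
    mutate(Count = as.numeric(Count))   # coerce Count to numeric after the filter
}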
OK, so I cannot figure this out for the life of me. I want to filter my data based on a partial string match. Here is my data; I am just showing the column I want to filter, but there are more rows in the overall set. I only want to show the rows that begin with "CAO" -- this is easily achievable in the viewer.
[data viewer screenshot of the filtered column]
Basically I want the R "code" that would reproduce this exact result. I have tried using grepl like so
filter(longdata, grepl("^CAO",longdata[,1]))
I have tried using subset
subset(longdata,longdata[,1]=="^CAO")
I have tried subset with grepl, and no matter what I do I can't figure it out. I am new to R, so please try to explain it thoroughly.
The second argument of grepl wasn't recognized in your first attempt.
library(tidyverse) # in this case, access to dplyr and to tibble's data_frame() function, which preserves the spaces in the column names
longdata <- data_frame(`Issue ID` = c("CAO-2017-20", "CAO-2017-20", "CAO-2017-20", "AO-2017-20", "CA-2017-20"))
longdata %>% filter(grepl("CAO", `Issue ID`)) # the pattern "^CAO" also works
%>% is a piping operator that passes the outcome of the previous operation on to the next one; it comes from magrittr and is re-exported by dplyr.
Basically what I did was load the tidyverse set of packages (read more on the tidyverse here). The ones of interest here are tibble and dplyr.
Then I created a sample data frame with tibble's data_frame() function.
Then I applied an adjusted version of the call you suggested, namely
filter(longdata, grepl("^CAO",`Issue ID`))
which is the same in its piped form:
longdata %>% filter(grepl("CAO", `Issue ID`))
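On the sample data above, either form should return only the three rows whose Issue ID begins with "CAO", printed roughly as:
# A tibble: 3 x 1
#   `Issue ID`
# 1 CAO-2017-20
# 2 CAO-2017-20
# 3 CAO-2017-20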
My data frame looks like this (only the header row is shown):
gene  HSC_7256.bam  HSC_6792.bam  HSC_7653.bam  HSC_5852
I can do this the normal way, by taking the columns out into another data frame and averaging them, but I want to do it in dplyr and I'm having a hard time; I'm not sure what the problem is.
I am doing something like this:
HSC <- EPIGENETIC_FACTOR_SEQMONK %>%
  select(EPIGENETIC_FACTOR_SEQMONK, gene)
I get this error
Error: EPIGENETIC_FACTOR_SEQMONK must resolve to integer column positions, not a list
So I need to take all the HSC samples and average them.
Can anyone suggest what I am doing incorrectly? That would be helpful.
The %>% operator pulls whatever is to the left of it into the first argument position of the following function. If your data frame is EPIGENETIC_FACTOR_SEQMONK, then these two statements are equivalent:
HSC <- EPIGENETIC_FACTOR_SEQMONK %>%
  select(gene)
HSC <- select(EPIGENETIC_FACTOR_SEQMONK, gene)
In the first, we pass EPIGENETIC_FACTOR_SEQMONK into select using %>%, which is how dplyr chains are generally written, since the first argument of every dplyr verb is a data frame. Because your code both piped the data frame in and named it again inside select(), dplyr tried to interpret the second EPIGENETIC_FACTOR_SEQMONK as a column selection, which is what produced the error.
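For the averaging step itself (not part of the original answer), one possible sketch, assuming the HSC sample columns are numeric and all share the "HSC" prefix shown in your header (HSC_mean is just an illustrative name), could be:
library(dplyr)
# keep gene plus the HSC sample columns, then add a per-gene mean across those samples
HSC <- EPIGENETIC_FACTOR_SEQMONK %>%
  select(gene, starts_with("HSC")) %>%
  mutate(HSC_mean = rowMeans(select(., starts_with("HSC"))))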
I need to bind two data frames using a user-defined function. As an example, let's imagine that the data frames look like this:
library(dplyr)
library(lazyeval)
df <- data.frame(type1 = c("a","b","c","a","b","c",NA),
                 type2 = c("d","e","f","d","e","f","f"))
f <- function(x){
  y <- df %>%
    dplyr::filter_(lazyeval::interp(~!is.na(x), x = as.name(x))) %>%
    dplyr::group_by_(x) %>%
    dplyr::summarize("Sum" = sum(type2 == "d"))
  y <- dplyr::bind_rows(y, data.frame(x = "Total", Sum = sum(y$Sum)))
  return(y)
}
result_f <- f("type1")
The problem is that the function uses the literal name x as the column name in the second data frame (data.frame(x = "Total", ...)) instead of the value of x (here "type1"), so bind_rows creates an additional column due to the mismatch with the first data frame.
How can the function interpret x as a variable instead of a string? Unquoting? How?
You can change the last line in the function to
y <- dplyr::bind_rows(y,setNames(data.frame("Total",sum(y$Sum)), c(x, "Sum")))
That will set the names of the data.frame you are trying to bind in to the original names.
Before you spend too much time learning all the underscore functions in dplyr, note that in the next version (0.6) they are being superseded by a completely different method of non-standard evaluation. Read more here: https://blog.rstudio.org/2017/04/13/dplyr-0-6-0-coming-soon/
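For reference, a rough sketch of how the same function might look with that newer interface (this assumes dplyr 0.7+ with rlang and tibble, and is not part of the original answer):
library(dplyr)
library(rlang)
library(tibble)
f2 <- function(x){
  x_sym <- sym(x)  # turn the column-name string into a symbol
  y <- df %>%
    dplyr::filter(!is.na(!!x_sym)) %>%
    dplyr::group_by(!!x_sym) %>%
    dplyr::summarize(Sum = sum(type2 == "d"))
  # name the first column after the value of x rather than the literal "x"
  dplyr::bind_rows(y, tibble(!!x := "Total", Sum = sum(y$Sum)))
}
result_f2 <- f2("type1")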
I am trying to split a column of a data frame into 2 columns using transform and colsplit from the reshape package. I don't get what I am doing wrong. Here's an example...
library(reshape)
df1 <- data.frame(col1=c("x-1","y-2","z-3"))
Now I am trying to split col1 into col1.a and col1.b at the delimiter '-'. The following is my code...
df1 <- transform(df1,col1 = colsplit(col1,split='-',names = c('a','b')))
Now in my RStudio when I do View(df1) I do get to see col1.a and col1.b split the way I want to.
But when I run...
df1$col1.a or head(df1$col1.a), I get NULL. Apparently I am not able to perform any further operations on these split columns. What exactly is wrong with this?
colsplit returns a data frame, and transform ends up storing it as a single nested column, so the viewer shows col1.a and col1.b even though df1 has no top-level columns with those names (which is why df1$col1.a is NULL). The easiest (and idiomatic) way to assign the pieces to multiple columns in the data frame is to use [<-, e.g.
df1[c('col1.a','col1.b')] <- colsplit(df1$col1,'-',c('a','b'))
It will be much harder to do this within transform (see "Assign multiple new variables on LHS in a single line in R").
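A quick check on the example data, assuming the assignment above has been run:
df1[c('col1.a','col1.b')] <- colsplit(df1$col1, '-', c('a','b'))
head(df1$col1.a)  # now returns the "x", "y", "z" values instead of NULL
head(df1$col1.b)  # the numeric part: 1, 2, 3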
I have a data.frame where each row is a tweet and each column is an attribute ("text", "user", etc.).
I have written a function processTweet() that takes a row of the data.frame, changes 3 of its columns ("X", "Y" and "Z"), and returns the modified single-row data.frame.
I'm currently trying to find out how to use something like dplyr or an apply-like function to actually reflect these modifications back in the original data.frame.
I'm aware that I could split the processTweet function into 3, but this would be inefficient since I'd have to do the same logical lookup multiple times.
I've tried using dplyr with rowwise, but I'm obviously doing something wrong, as the changes are not reflected in the tweets data.frame, whereas mutate seems to allow modifying one column, but not several:
tweets %>% rowwise() %>% processTweet()
I seem to have found an answer using plyr:
tweets = adply(.data = tweets, .margins = 1, .fun = processTweet)
but a dplyr implementation is still a mystery.
The following question/answer works when the result is saved into a single column, but it is unclear what to do when the function returns a whole data.frame:
Applying a function to every row of a table using dplyr?
After some trial and a lot of error, the dplyr way that seems to work is:
tweets = as.data.frame(tweets %>% rowwise() %>% do(processTweet(.)) %>% rbind())
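To make this concrete, here is a self-contained sketch with a made-up processTweet() (the columns text, user, X, Y, Z and the logic inside the function are purely illustrative assumptions):
library(dplyr)
# hypothetical per-row processor: takes a one-row data frame, modifies
# three of its columns, and returns the modified one-row data frame
processTweet <- function(row){
  row$X <- toupper(row$X)
  row$Y <- nchar(row$text)
  row$Z <- is.na(row$user)
  row
}
tweets <- data.frame(text = c("hello", "wide world"),
                     user = c("alice", NA),
                     X = c("a", "b"), Y = 0, Z = FALSE,
                     stringsAsFactors = FALSE)
tweets <- as.data.frame(tweets %>% rowwise() %>% do(processTweet(.)))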