Difference between cbind with dataframe subset or indicating each column separately? - r

What is the difference between these two lines of code?
varname1 <- cbind(df.name$var1, df.name$var2, df.name$var3)
varname2 <- cbind(df.name[1:3])
When I pass varname2 to the next function, I get the error "invalid type (list) for variable 'varname2'".
This is the next function I try to use:
manova(varname ~ indepvar.snack+judge+rep,data = df.name)
So why does varname1 work and varname2 not?

I retracted my previous answer, as I originally thought you were binding a series of columns into a single-column data frame.
Check str(varname1) and str(varname2): the first is a matrix, while the second is a data frame.
manova() expects a matrix as the response variable.
So convert it:
varname2 <- as.matrix(varname2)
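To see the difference concretely, here is a minimal sketch that uses the built-in iris data as a stand-in for df.name (an assumption, since the original data is not shown). cbind() on individual vectors builds a matrix, while cbind() on a single data frame subset returns a data frame.

```r
# iris stands in for df.name; the chosen columns are illustrative only
m1 <- cbind(iris$Sepal.Length, iris$Sepal.Width, iris$Petal.Length)
m2 <- cbind(iris[1:3])

is.matrix(m1)      # matrix built from three numeric vectors
is.data.frame(m2)  # still a data frame, i.e. a list of columns

# Converting the data frame subset gives manova() a matrix response
m3 <- as.matrix(m2)
fit <- manova(m3 ~ Species, data = iris)
summary(fit)
```

Skipping cbind() and calling as.matrix(df.name[1:3]) directly should give the same matrix, with the column names kept.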

Related

Write Loop To Perform Function through Column Names

I have a dataset with a quantitative column for which I want to calculate the mean based on groups. The other columns in the dataset are titled [FY2001,FY2002,...,FY2018]. These columns are populated with either a 1 or 0.
I want to calculate the mean of the first column for each of the FY columns when they equal 1. So, I want 18 different means.
I am used to using macros in SAS where I can replace parts of a dataset name or column name using a let statement. This is my attempt at writing a loop in R to solve this problem:
vector = c("01","02","03","04","05","06","07","08","09","10",
"11","12","13","14","15","16","17","18")
varlist = paste("FY20", vector, sep = "")
abc = for (i in length(varlist)){
table(ALL_FY2$paste(varlist)[i])
}
abc
This doesn't work since it treats the paste function as a column. What am I missing? Any help would be appreciated.
We can use [[ instead of $ to subset the column, because [[ accepts a column name stored in a variable. In addition, 'abc' should be a list that is assigned the corresponding table output of each column inside the for loop.
abc <- vector("list", length(varlist)) # initialize a `list` object
Loop through the sequence of 'varlist' and not the length(varlist) (it is a single number)
for(i in seq_along(varlist)) abc[[i]] <- table(ALL_FY2[[varlist[i]]])
However, if we need to have a single table output from all the columns mentioned in the 'varlist', unlist the columns to a vector and replicate the sequence of columns before applying the table
ind <- rep(seq_along(varlist), each = nrow(ALL_FY2))
table(ind, unlist(ALL_FY2[varlist]))
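The question asked for a mean per FY column rather than a table. The same [[ indexing covers that case. The sketch below uses a small made-up ALL_FY2 with a hypothetical first column named amount and only two FY columns, so the names and values are assumptions.

```r
# Hypothetical miniature of ALL_FY2: one quantitative column, two 0/1 flags
ALL_FY2 <- data.frame(
  amount = c(10, 20, 30, 40),
  FY2001 = c(1, 0, 1, 0),
  FY2002 = c(0, 1, 1, 1)
)

# Build the column names without typing each suffix by hand
varlist <- sprintf("FY20%02d", 1:2)

# For each FY column, average the first column over rows where the flag is 1
means <- sapply(varlist, function(v) {
  mean(ALL_FY2[[1]][ALL_FY2[[v]] == 1])
})
means
```

With the real data, sprintf("FY20%02d", 1:18) would generate all 18 names.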

Selecting unique values from single column of a data frame

I have a data frame consisting of five character variables which represent specific bacteria. I then have thousands of observations of each variable that all begin with the letter K, e.g.:
x <- c("K0001","K0001","K0003","K0006")
y <- c("K0001","K0001","K0002","K0003")
z <- c("K0001","K0002","K0007","K0008")
r <- c("K0001","K0001","K0001","K0001")
o <- c("K0003","K0009","K0009","K0009")
I need to identify unique observations in the first column that don't appear in any of the remaining four columns. I have tried the approach suggested here which I think would work if I could create individual vectors using select ...
How to tell what is in one vector and not another?
but when I try to create a vector for analysis using the code ...
x <- select(data$x)
I get the error
Error in UseMethod("select_") :
no applicable method for 'select_' applied to an object of class "character"
I have tried to convert the vectors using as.factor and as.numeric, but neither approach works: the first gives an equivalent error to the one above, and as.numeric returns NAs.
Thanks in advance
The reference that you cited recommended using setdiff. The only thing that you need to do to apply that solution is to combine the other four columns into one vector, so that it can be treated as a set. You can do that with unlist:
setdiff(data$x, unlist(data[,2:5]))
[1] "K0006"

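As a self-contained check of the setdiff approach, the sketch below rebuilds the example vectors from the question as a data frame. The name data follows the answer; the construction itself is an assumption about how the real data is laid out.

```r
# Data frame built from the five example vectors in the question
data <- data.frame(
  x = c("K0001", "K0001", "K0003", "K0006"),
  y = c("K0001", "K0001", "K0002", "K0003"),
  z = c("K0001", "K0002", "K0007", "K0008"),
  r = c("K0001", "K0001", "K0001", "K0001"),
  o = c("K0003", "K0009", "K0009", "K0009"),
  stringsAsFactors = FALSE
)

# Flatten columns 2 to 5 into one character vector
others <- unlist(data[, 2:5], use.names = FALSE)

# Unique values of x that never occur in the other columns
only_in_x <- setdiff(data$x, others)
only_in_x
```

setdiff already drops duplicates, so no separate call to unique() is needed.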
Data.table column generated with function taking multiple, cumulative and lagged arguments

I'm trying to add a column to a data.table, where that column of the data.table is populated by passing the cumulative (lag 1) vector of values per group as well as a group-level attribute to a function, and then returning the appropriate value.
I have 8M rows, one per agent-day. My function is more complicated than myFun, but the key thing is that it takes two arguments from the agent-day table: a vector (Vector) of values for a particular agent from all days prior to the particular day, and a vector of agent-level attributes (PerAgent) that are all the same per agent.
library(data.table)
library(dplyr)
library(zoo)
DT <- data.table(Agent=LETTERS[1:3],day=c(1,1,1,2,2,2,3,3,3), PerAgent=c(.2,.4,.6),Vector=1:9,Answer=c(NA,NA,NA,.2,.8,1.8,1,2.8,5.4))
myFun = function(Vector,PerAgent){
  PerAgent=PerAgent[1]
  Answer=PerAgent*sum(Vector)
  return(Answer)
}
The "Answer" column is what I'm trying to generate (obviously not manually as I've done here).
What I have right now, which doesn't work because I'm trying to pass the second argument, is:
DT[,Answer:=lag(rollapplyr(Vector,PerAgent,seq_along(Vector),myFun),1),by=.(Agent)]
If I didn't need to pass the second argument to the (simplified) function, this would work:
myFun = function(Vector){
  Answer=.1*sum(Vector)
  return(Answer)
}
DT[,Answer:=lag(rollapplyr(Vector,seq_along(Vector),myFun),1),by=.(Agent)]
Your help is VERY appreciated.
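One possible approach, offered only as a sketch: index the group's values explicitly inside the by-group expression instead of going through rollapplyr, so that both arguments reach the function. The helper lagged_cumulative is hypothetical (not part of data.table or zoo), and the code assumes rows are already ordered by day within each agent.

```r
library(data.table)

# Same example data as in the question, with the recycling written out
DT <- data.table(
  Agent    = rep(LETTERS[1:3], times = 3),
  day      = rep(1:3, each = 3),
  PerAgent = rep(c(0.2, 0.4, 0.6), times = 3),
  Vector   = 1:9
)

myFun <- function(Vector, PerAgent) {
  PerAgent[1] * sum(Vector)
}

# Hypothetical helper: for each position i, call f on all values before i.
# The first position has no prior values, so it gets NA.
lagged_cumulative <- function(v, p, f) {
  vapply(seq_along(v), function(i) {
    if (i == 1L) NA_real_ else f(v[seq_len(i - 1L)], p)
  }, numeric(1))
}

DT[, Answer := lagged_cumulative(Vector, PerAgent, myFun), by = Agent]
DT
```

For this simplified myFun, PerAgent * shift(cumsum(Vector)) within each group should give the same numbers much faster. The general helper does work quadratic in the number of days per agent, which may matter at 8M rows.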

correlation of several columns need to be calculated

I'm trying to get the correlation coefficient for corresponding columns of two CSV files. I simply use the following, but I get errors. Each CSV file has 50 columns.
first_values <- read.csv("")
second_values <- read.csv("")
correlation.csv <- cor(x = first_values, y = second_values, method = "spearman")
But I get the error: 'x' must be numeric
[Image: subset of one CSV file]
Thanks for your help
The read.table function and all of its derivatives return a data.frame, which is an R list object. The mapply function processes lists in "parallel". If the matching columns are in the same order in the two datasets, have the same number of rows, and do not have spaces in their names, it would be as simple as:
mapply(cor, first_values , second_values)
If it's more complicated than that, then you need to fill in the missing details with example data by editing the question (not by responding in comments).
There must be a categorical (non-numeric) variable in x. Separate that variable out first, and then pass the remaining numeric columns to cor().
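Both suggestions can be combined in one sketch: keep only the numeric columns, then correlate matching columns pairwise while passing method = "spearman" through MoreArgs. The two data frames below are made-up stand-ins for the CSV files, and the id column is a hypothetical non-numeric column of the kind that triggers the error.

```r
# Made-up stand-ins for the two CSV files
first_values <- data.frame(
  id = c("s1", "s2", "s3", "s4", "s5"),
  a  = c(1, 2, 3, 4, 5),
  b  = c(2, 1, 4, 3, 5)
)
second_values <- data.frame(
  id = c("s1", "s2", "s3", "s4", "s5"),
  a  = c(2, 4, 6, 8, 10),
  b  = c(5, 4, 3, 2, 1)
)

# Names of the numeric columns; non-numeric ones such as id are dropped
num_cols <- names(first_values)[sapply(first_values, is.numeric)]

# One Spearman coefficient per pair of matching columns
rho <- mapply(
  cor,
  first_values[num_cols],
  second_values[num_cols],
  MoreArgs = list(method = "spearman")
)
rho
```

Calling cor() on the two whole data frames would instead return the full cross-correlation matrix, which is a different result from one coefficient per column pair.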

Retaining a value in an R dataset if it's present in another dataset

I am currently working on code which applies to various datasets from an experiment that looks at a wide range of variables, which might not be present in every repetition. My first step is to create an empty dataset with all the possible variables, and then write a function which retains the columns that are in the dataset being inputted and deletes the rest. Here is an example of what I want to achieve:
x<-c("a","b","c","d","e","f","g")
y<-c("c","f","g")
Is there a way of removing elements of x that aren't present in y and/or retaining values of x that are present in y?
For your first question: "My first step is to create an empty dataset with all the possible variables", I would use factor on the concatenation of all the vectors, for example:
all_vect = c(x, y)
possible = levels(factor(all_vect))
Then, for the second part " write a function which retains columns that are in the dataset being inputted and delete the rest", I would write:
df[, names(df) %in% possible]
As akrun wrote, use intersect(x, y) or
x[x %in% y]

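The options above can be compared side by side. This is a small sketch using the x and y from the question; the data frame df and its values are hypothetical, added only to show the column-filtering case.

```r
x <- c("a", "b", "c", "d", "e", "f", "g")
y <- c("c", "f", "g")

# Values of x that also occur in y
kept    <- intersect(x, y)
kept_in <- x[x %in% y]

# Values of x that do not occur in y
dropped <- setdiff(x, y)

# Hypothetical data frame: keep only the columns named in y
df <- data.frame(a = 1:2, c = 3:4, f = 5:6, g = 7:8)
df_kept <- df[, names(df) %in% y, drop = FALSE]
names(df_kept)
```

intersect() removes duplicates while x[x %in% y] keeps them, so the two differ when x contains repeated values. drop = FALSE keeps the result a data frame even if only one column matches.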