I am trying to write an R program that can create a vector.
Suppose I have 3 factors (X with levels 1,2,3, Y with levels 1,2, and Z with levels 1,2,3,4). If I represent them in a contingency table, there are 3x2x4=24 cells (for example, 111, 112, 121 and 222 are typical cells).
I want to write a for loop whose output is a vector of all the cells, that is, a vector of length 24.
vector1 <- factor(x = c(1,2,3))
vector2 <- factor(x = c(1,2))
vector3 <- factor(x = c(1,2,3,4,3,2))
df1 <- expand.grid(levels(vector1),levels(vector2),levels(vector3))
results <- paste0(df1$Var1,df1$Var2,df1$Var3)
factor(results)
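Since the question asks specifically for a for loop, here is a minimal sketch of the same result built with explicit nested loops. The names x_levels, y_levels, z_levels and cells are made up for illustration; the expand.grid approach above is usually the shorter option.

```r
# Levels of the three factors described in the question
x_levels <- 1:3
y_levels <- 1:2
z_levels <- 1:4

# Preallocate the result, then fill in one cell label per iteration
cells <- character(length(x_levels) * length(y_levels) * length(z_levels))
k <- 1
for (x in x_levels) {
  for (y in y_levels) {
    for (z in z_levels) {
      cells[k] <- paste0(x, y, z)
      k <- k + 1
    }
  }
}

length(cells)   # 24
head(cells, 4)  # "111" "112" "113" "114"
```

The loop makes the last digit vary fastest, whereas expand.grid varies the first factor fastest, so the two vectors contain the same 24 labels in a different order.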
Here's another solution using lapply :
create dummy data and assign it to a list
set.seed(10)
listofvectors<-list(
factor(sample(1:3,20, replace = TRUE)),
factor(sample(1:2,10, replace = TRUE)),
factor(sample(1:4,15, replace = TRUE))
)
generate a table of combinations
combinations <-expand.grid(lapply(listofvectors, levels))
combine row-wise then create vector
unlist(do.call("paste0", combinations))
I'm combining 12 CSV files into one dataframe in R. Before doing this I want to ensure all the column names are an exact match with each other. I've made a dataframe where each column is the column names of the 12 CSV files.
jul21_cols <- data.frame(colnames(jul21))
aug21_cols <- data.frame(colnames(aug21))
sep21_cols <- data.frame(colnames(sep21))
oct21_cols <- data.frame(colnames(oct21))
nov21_cols <- data.frame(colnames(nov21))
dec21_cols <- data.frame(colnames(dec21))
jan22_cols <- data.frame(colnames(jan22))
feb22_cols <- data.frame(colnames(feb22))
mar22_cols <- data.frame(colnames(mar22))
apr22_cols <- data.frame(colnames(apr22))
may22_cols <- data.frame(colnames(may22))
jun22_cols <- data.frame(colnames(jun22))
col_df <- cbind(jul21_cols,aug21_cols,sep21_cols,oct21_cols,nov21_cols,dec21_cols,
jan22_cols,feb22_cols,mar22_cols,apr22_cols,may22_cols,jun22_cols)
I've tried using the identical function to compare 2 columns at a time.
identical(col_df[['jul21']], col_df[['aug21']])
identical(col_df[['aug21']], col_df[['sep21']])
identical(col_df[['sep21']], col_df[['oct21']])
identical(col_df[['oct21']], col_df[['nov21']])
identical(col_df[['nov21']], col_df[['dec21']])
identical(col_df[['dec21']], col_df[['jan22']])
identical(col_df[['jan22']], col_df[['feb22']])
identical(col_df[['feb22']], col_df[['mar22']])
identical(col_df[['mar22']], col_df[['apr22']])
identical(col_df[['apr22']], col_df[['may22']])
identical(col_df[['may22']], col_df[['jun22']])
All of the identical() calls return TRUE.
I'm just trying to verify that this code is telling me all my column names are identical in every CSV file before I move on. I'd also like to know if there is a more efficient way to solve this problem.
First, identical() will only return TRUE if the two dataframes have all the same column names in the same order. If you don’t care about order, just that all the same names are in both dataframes, you can sort() the names before comparing as shown below.
Second, you can often use the base::lapply() or purrr::map() families of functions for operations requiring iteration.
For your case, let’s put your dataframes in a list (which they probably should be to begin with), then use sapply() to compare the column names of the first df in the list to the column names of all other dfs.
jul21 <- data.frame(x = 1, y = 2)
aug21 <- data.frame(x = 3, y = 4)
sep21 <- data.frame(y = 6, x = 5)
dfs <- list(jul21,aug21,sep21)
all(sapply(
dfs[-1],
\(x) identical(sort(colnames(x)), sort(colnames(dfs[[1]])))
))
# TRUE
And as another test case, we’ll add a df with a non-matching column.
oct22 <- data.frame(x = 1, y = 2, z = 3)
dfs[[4]] <- oct22
all(sapply(
dfs[-1],
\(x) identical(sort(colnames(x)), sort(colnames(dfs[[1]])))
))
# FALSE
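One caveat about the checks in the question: data.frame(colnames(jul21)) names its single column after the expression, not after the data frame, so col_df has no column called jul21. Indexing a data frame with [[ by a name that does not exist returns NULL, and identical(NULL, NULL) is TRUE regardless of the real column names. A small sketch, using two deliberately mismatched toy data frames (not the real CSV data):

```r
jul21 <- data.frame(x = 1, y = 2)
aug21 <- data.frame(x = 3, z = 4)  # toy data with a different column name

# Same construction as in the question, shortened to two months
col_df <- cbind(data.frame(colnames(jul21)), data.frame(colnames(aug21)))
names(col_df)  # neither name is "jul21" or "aug21"

# Both lookups return NULL, so this comparison is TRUE for the wrong reason
is.null(col_df[["jul21"]])
identical(col_df[["jul21"]], col_df[["aug21"]])

# Comparing the column names directly gives the real answer
identical(colnames(jul21), colnames(aug21))
```

Comparing colnames() of the data frames directly, as in the sapply() call above, avoids the problem.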
We assume that the goal is to determine whether the column names are the same and in the same order and, if not, which data frames differ.
First get a character vector, Names, containing the names of the data frames, and from it build a named list L containing the data frames themselves.
Then get a character vector nms whose elements are comma-separated strings of column names, one for each data frame.
Finally group the names of the data frames using tapply with nms as the grouping so we can see which data frames contain which columns. In the example below jul21 and aug21 have one set of columns, i.e. Time and demand, and sep21 has a different set, i.e. Time and DEMAND. If the result had only one row then all data frames would have the same column names in the same order.
Names <- c("jul21", "aug21", "sep21") # using example in Note
L <- mget(Names)[Names]
nms <- sapply(names(L), function(x) toString(names(L[[x]])))
tab <- stack(tapply(names(nms), nms, toString))
names(tab) <- c("data.frames", "column.names")
nrow(tab)
## [1] 2
tab
##    data.frames column.names
## 1 jul21, aug21 Time, demand
## 2        sep21 Time, DEMAND
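If only a yes/no answer is needed, a shorter check is to count the distinct name vectors. This is a minimal sketch that rebuilds the same test data so it runs on its own; distinct_sets is an illustrative name.

```r
# Same test data as in the Note: BOD ships with R
jul21 <- aug21 <- sep21 <- BOD
names(sep21) <- c("Time", "DEMAND")
L <- list(jul21 = jul21, aug21 = aug21, sep21 = sep21)

# One list element per distinct vector of column names (order-sensitive)
distinct_sets <- unique(lapply(L, names))

length(distinct_sets)       # 2 here, because sep21 differs
length(distinct_sets) == 1  # TRUE only when every data frame matches
```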
graph
Another approach, which could be used alternatively or in conjunction with the one above, is to create a graph such that each vertex is a data frame and each edge means that the two vertices on either end of the edge have the same column names in the same order. Each connected component represents a distinct set or order of column names. From the example below we see that jul21 and aug21 form one connected component and sep21 forms a second connected component.
To investigate how data frame column names differ, note that setdiff(names(jul21), names(sep21)) will show names that are in jul21 but not in sep21, and the reverse can be used for the other direction. If the setdiff in both directions is a zero-length vector and the name vectors are still not identical, then they differ only by order.
library(igraph)
set.seed(123)
isSame <- function(x, y) +identical(names(x), names(y))
A <- outer(L, L, Vectorize(isSame))
diag(A) <- 0
g <- graph_from_adjacency_matrix(A, "undirected")
plot(g, vertex.color = "white", vertex.size = 30)
Note
Test data. BOD comes with R.
jul21 <- aug21 <- sep21 <- BOD
names(sep21) <- c("Time", "DEMAND")
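To make the setdiff check concrete, here is a small sketch on the same test data. The data frame reordered is a hypothetical extra case added only to show the "same names, different order" situation.

```r
jul21 <- BOD
sep21 <- BOD
names(sep21) <- c("Time", "DEMAND")
reordered <- BOD[, c("demand", "Time")]  # hypothetical: same names, swapped

# Names present on one side only
only_in_jul <- setdiff(names(jul21), names(sep21))  # "demand"
only_in_sep <- setdiff(names(sep21), names(jul21))  # "DEMAND"

# Both setdiffs empty but names not identical: they differ only by order
same_set <- length(setdiff(names(jul21), names(reordered))) == 0 &&
  length(setdiff(names(reordered), names(jul21))) == 0
differs_by_order <- same_set && !identical(names(jul21), names(reordered))
differs_by_order  # TRUE
```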
I have data in multiple vectors that I would like to convert to a data.frame with one ID column (vector name) and one data column (vector values). Here's a toy example:
data.1 <- c(1, 2)
data.2 <- c(10, 20, 30)
df <- bind_rows(data.frame(ID="data.1", value=data.1), data.frame(ID="data.2", value=data.2))
If I have another vector (or any other data structure) that contains the name of the variables as a character string, how can I elegantly shorten the code? One time I would need to retrieve the entry as a character string (for ID) and the other time as the variable name (for value).
studies <- c("data.1", "data.2")
You can define a function f, which returns a data.frame whose ID column holds the name of the object you pass into the function:
f <- function(x){
return(data.frame(ID=deparse(substitute(x)), value=x))
}
So, you can define your new data.frame as follows:
require(dplyr)
data.1 <- c(1, 2)
data.2 <- c(10, 20, 30)
bind_rows(f(data.1), f(data.2))
It looks much more elegant to me because you don't need to write the name of each source twice.
I think I found a general solution via lists that only uses base R functions and is pretty simple. The most complicated part is keeping the names intact when they have different lengths, or when the number of values per study spans more than one order of magnitude (unlist then appends a different number of characters to each name).
Thank you @Gregor for pointing me toward working with lists.
# data input
studies <- c("study1", "std2", "This name is very long")
data_1 <- c(1, 2)
data_2 <- c(10, 20, 30, 40, 50, 60, 70, 80, 90, 100)
data_3 <- c(3, 6, 9, 12, 15)
# generate list with data and assign names of identical length
data_list <- list(data_1, data_2, data_3)
names(data_list) <- format(studies, width=max(nchar(studies)))
# get names from 'unlist', make them the same length, remove last chars
names <- names(unlist(data_list))
names <- format(names, width=max(nchar(names)))
names <- substr(names,1,nchar(names)-2)
# get values from 'unlist'
values <- unname(unlist(data_list))
# make data.frame
data <- data.frame(names, values)
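A shorter variant of the same list idea, sketched here as one possible alternative, avoids trimming the names that unlist generates: repeat each name by the length of its vector instead. It uses the toy vectors and the studies vector from the question; mget() looks the variables up by name.

```r
data.1 <- c(1, 2)
data.2 <- c(10, 20, 30)
studies <- c("data.1", "data.2")

# Named list of the vectors, looked up from their names
data_list <- mget(studies)

# Repeat each name once per value; drop the names unlist() would create
df <- data.frame(
  ID = rep(names(data_list), lengths(data_list)),
  value = unlist(data_list, use.names = FALSE)
)
df
```

Because the ID column comes straight from names(data_list), names of different lengths or vectors with 10 or more values need no special handling.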
I am new to R and I don't know how to create multiple data frames in a loop. For example:
I have a data frame "Data" with 20 rows and 4 columns:
Data <- data.frame(matrix(NA, nrow = 20, ncol = 4))
names(Data) <- c("A","B","C","D")
I want to choose the rows of Data whose values in column T are closest to the elements of the vector X.
X = c(X1,X2,X3,X4,X5)
Finally, I want to assign them to a separate data frames with their associated X name:
for(i in 1:length(X)){
data_X[i] <- data.frame(matrix(NA))
data_X[i] <- subset(data2, 0 <= A-X[i] | A-X[i]< 0.000001 )
}
Thank you!
Since you didn't give us any numbers, it is difficult to say exactly what condition the for loop needs to check, so you will need to sort that part out yourself, but here is a basic example of what you could do. The important part that I think you are missing is assign, which sends each created data frame to your global environment (or wherever else you want it to go). paste0 is a handy way to give each one its own name. Note that some of the data frames may be empty; it may be worthwhile to add an if statement that skips the assignment when nrow(data3) == 0.
Data <- data.frame(matrix(sample(1:10,80,replace = TRUE), nrow = 20, ncol = 4))
names(Data) <- c("A","B","C","D")
X = c(1:10)
for(i in 1:length(X)){
data2 <- Data
data3 <- subset(data2, A == X[i])
assign(paste0("SubsetData",i), data3, envir = .GlobalEnv)
}
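An alternative sketch, if you would rather not fill the global environment with numbered objects, is to keep the subsets in one named list. The matching condition (A == x) is the same placeholder as above, and the name subsets is illustrative.

```r
set.seed(1)  # only so the random example is repeatable
Data <- data.frame(matrix(sample(1:10, 80, replace = TRUE), nrow = 20, ncol = 4))
names(Data) <- c("A", "B", "C", "D")
X <- 1:10

# One data frame per element of X, stored in a list instead of via assign()
subsets <- lapply(X, function(x) Data[Data$A == x, , drop = FALSE])
names(subsets) <- paste0("SubsetData", X)

# Drop the empty data frames
subsets <- Filter(function(d) nrow(d) > 0, subsets)
names(subsets)
```

Individual results are then available as subsets[["SubsetData3"]] (when that subset is not empty), and the whole collection can be processed with lapply().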
I need to run a chi squared test on data from each row of a data frame in R. So far I have a function that can create a matrix and run the test on the matrix. This is working fine when I manually enter data into the function.
chisquare.table <- function(var1, var2, var3, var4){
t <- matrix(c(var1, var2, var3, var4), nrow = 2)
chisq.test(t)
}
chisquare.table(80, 99920, 85, 99915)
However, what I want to do is apply this function to each row of a data frame such that var1 is row x column 1, var2 is row x column 2, var3 is row x column 3, and var4 is row x column 4.
I've tried a few different ways with the apply() function but I can't find one that allows me to take the data from the row in the way I want. I'd really appreciate any help or advice on this as I haven't found much online about using apply() with multiple inputs.
If we are applying the function on each row, use apply. Also, instead of specifying the row elements one by one as arguments (as their number can differ for each dataset), use ..., which can take any number of elements as arguments, and create the matrix out of it.
chisquare.tableMod <- function(...){
t <- matrix(c(...), nrow = 2)
chisq.test(t)
}
out <- apply(df1, 1, chisquare.tableMod)
Testing with the output from OP's function
chisquare.table <- function(var1, var2, var3, var4){
t <- matrix(c(var1, var2, var3, var4), nrow = 2)
chisq.test(t)
}
outOld <- chisquare.table(80, 99920, 85, 99915)
identical(out[[1]], outOld)
#[1] TRUE
As @42- mentioned in the comments, apply first converts the data frame to a matrix, and a matrix can hold only a single class. So, select only the numeric columns (or columns of a single class) before using apply.
data
df1 <- data.frame(v1 = c(80, 79, 49), v2 = c(99920, 98230, 43240),
v3 = c(85, 40, 35), v4 = c(99915, 43265, 43238))
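If only the p-value of each row's test is needed, one possible follow-up is to extract it inside the function passed to apply, which gives a plain numeric vector. A minimal sketch on the data above; row_p and p_values are illustrative names.

```r
df1 <- data.frame(v1 = c(80, 79, 49), v2 = c(99920, 98230, 43240),
                  v3 = c(85, 40, 35), v4 = c(99915, 43265, 43238))

# Build the 2x2 matrix from one row and return just the p-value
row_p <- function(row) chisq.test(matrix(row, nrow = 2))$p.value

# One p-value per row of df1
p_values <- apply(df1, 1, row_p)
p_values
```

The vector can be attached back to the data with df1$p.value <- p_values.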
I have created a genind object from a table containing SNPs information.
I need to insert population information into this genind.
I know which individuals (which are identified by numbers) should go into each population.
How do I pick the correct individuals and place them into separate populations?
It's always helpful to make a reproducible example when asking a question.
First, loading the necessary library (pretty sure it's adegenet)
library(adegenet)
Making some fake data by first getting a vector of alleles
alleles <- paste0("0",1:4)
Setting number of loci, individuals per population, and the number of populations
nloci <- 10
nind <- 10
npops <- 2
Using a for loop to make the fake dataset
i <- NULL
out <- NULL
for(i in 1:npops){
#there are nind*nloci genotypes in each population
#make a vector of them, two alleles per genotype
gts <- replicate(n = nind*nloci,
expr = paste0(sample(x = alleles,size = 1,replace = T),
sample(x = alleles,size = 1,replace = T)))
gts <- as.data.frame(matrix(data = gts,
nrow = nind, ncol = nloci, byrow = T))
#making generic locus colnames()
colnames(gts) <- paste0("locus_",1:nloci)
out <- rbind(out,gts)
} #end of for loop
head(out)
Now converting that data.frame into a genind
obj <- df2genind(out, ploidy=2, ncode=2)
obj
Note that the row.names() are considered individual IDs
Now for setting the populations, note it's empty right now
obj@pop
You just need a vector that represents the populations corresponding to each individual.
Option 1
If your individual IDs are clustered by population (e.g. 1-10 are from pop1 and 11-20 are from pop2), then something like this should work
pops<- paste0("pop",1:npops)
Set the populations using that vector, make sure it's a factor
obj@pop <- as.factor(rep(pops,each=nind))
obj@pop
Option 2
If the original data.frame (table) that contained your SNP information also contained population information, you could use that as your vector
e.g. If out looked like this
out$pops <- sample(x = pops,size = nrow(out),replace = T)
head(out)
Then you could use that column as your vector
obj@pop <- as.factor(out$pops)
obj@pop
Option 3
Alternatively, if you had another table that enabled you to identify which individuals corresponded to which population, then you could use that information. It assumes that the second table (data.frame) has the same number of rows as out
Here is an example second table
df <- data.frame(pops = rep(pops,each=nind),
id = sample(x = 1:nrow(out),size = nrow(out),replace = F))
head(df)
Note that the IDs are not in order, but they were in out and therefore are in the obj, so df needs to be ordered by df$id
df <- df[order(df$id),]
head(df)
After they are in the correct order
obj@pop <- as.factor(df$pops)
obj@pop
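As a variation on Option 3, match() can line the populations up with the individual IDs without reordering the second table. This sketch uses only base R with a small made-up table, so ind_ids stands in for the individual IDs (the row names of out) and the final assignment to the genind object is left as a comment.

```r
# Stand-in for the individual IDs, in the order they appear in the genind
ind_ids <- as.character(1:6)

# Made-up second table: IDs are not in the same order as ind_ids
df <- data.frame(pops = rep(c("pop1", "pop2"), each = 3),
                 id = c(4, 2, 6, 1, 5, 3))

# For each individual, find its row in df and take that row's population
pop_vec <- df$pops[match(ind_ids, df$id)]
pop_vec

# With the real object this would then be: obj@pop <- as.factor(pop_vec)
```

match() returns NA for any ID missing from the table, so checking anyNA(pop_vec) before assigning is a cheap safeguard.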