I already asked a related question here, which was probably not sufficiently specified and thus the answers did not solve my problem. Will try to do it better this time.
I have created dataframe df:
df <- data.frame(names1=c('mouse','dog','cat','cat','mouse','cat','cat','dog','cat','mouse'), names2=c('cat','dog','dog','mouse','cat','cat','mouse','mouse','mouse','mouse'), values=c(11,5,41,25,101,78,12,41,6,77))
There are purposely no "birds" in either the names1 or the names2 column.
and the following names vector:
dims <- c('dog','mouse','bird','cat')
and an initially empty matrix:
my_matrix <- matrix(data=0, nrow=length(unique(dims)), ncol=length(unique(dims)))
rownames(my_matrix) <- dims
colnames(my_matrix) <- dims
So the empty matrix looks like this:
> dog mouse bird cat
> dog 0 0 0 0
> mouse 0 0 0 0
> bird 0 0 0 0
> cat 0 0 0 0
**The Goal:**
The goal is to populate the empty matrix with the information coming from dataframe df
Count Matrix
In the first matrix, I want to count the occurrences where the vector elements match. So the first matrix should look exactly like this one (based on df$names1 and df$names2)
> dog mouse bird cat
> dog 1 1 0 0
> mouse 0 1 0 2
> bird 0 0 0 0
> cat 1 3 0 1
Sum Matrix
In the second matrix, I want to show the sums provided by column df$values and based on the vector match element given by df$names1 and df$names2. So the matrix should look exactly like this one
> dog mouse bird cat
> dog 5 41 0 0
> mouse 0 77 0 112
> bird 0 0 0 0
> cat 41 43 0 78
Restrictions:
There is one restriction regarding the shape of the matrix:
The order of rows and columns is given by the dims vector. The purpose is that the matrix should show that there are no birds in df$names1 and df$names2
My approach:
I tried to place the count and sum into each element of the empty matrix by using a for loop which looks like this:
for(i in 1:nrow(my_matrix)){
for(j in 1:ncol(my_matrix)){
my_matrix[i,j] <- sum(df$values & df$names1[i] == df$names2[j] & df$names1[j] == df$names2[i])
}
}
and which provides me with this undesired result
dog mouse bird cat
dog 0 0 0 10
mouse 0 10 0 0
bird 0 0 0 0
cat 10 0 0 0
My intuition tells me that the approach with the for loop is ok but I am not sure how exactly to address each matrix element in the loop and how to define the restrictions within the sum function to get the counts (goal matrix1) and sums (goal matrix2).
Your help will be much appreciated and I am very open to other solutions considering the above-mentioned restrictions (still knowing how to loop through a matrix and assign values to each position would be cool).
To continue your for loop approach -
The code you posted only loops through the first 4 elements of the dataframe (i.e. the maximum of nrow() or ncol() of the matrix), and it does not match the row and column names of the matrix to the names1 and names2 vectors in the dataframe.
I'm not sure if the following is exactly what you want, but it does loop through all of the dataframe, relate dataframe and matrix names and count/take sums, so you can modify it as you need.
my_matrix_count <- my_matrix
for(i in seq_len(nrow(my_matrix))){
  for(j in seq_len(ncol(my_matrix))){
    # rows of df whose names1/names2 match the names of this matrix cell
    hit <- df$names1 == rownames(my_matrix)[i] &
           df$names2 == colnames(my_matrix)[j]
    my_matrix_count[i,j] <- sum(hit)           # count matrix
    my_matrix[i,j] <- sum(df$values[hit])      # sum matrix
  }
}
It's good to know looping and indexing, in my opinion, but yes, it can be done in many better ways, such as the answer by jblood94.
For another version (this one gives the sums only), you can also try:
xtabs(values ~names1 + names2, data=df)
Regards, Lars
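One caveat with the xtabs call above: on plain character columns it only tabulates the names that actually occur, so there would be no bird row or column. Here is a minimal sketch, assuming the df and dims from the question, that first converts both name columns to factors with levels taken from dims. That keeps the requested order and the empty bird entries; table() then gives the counts and xtabs() the sums.

```r
df <- data.frame(
  names1 = c('mouse','dog','cat','cat','mouse','cat','cat','dog','cat','mouse'),
  names2 = c('cat','dog','dog','mouse','cat','cat','mouse','mouse','mouse','mouse'),
  values = c(11,5,41,25,101,78,12,41,6,77)
)
dims <- c('dog','mouse','bird','cat')

# factor levels fix the row/column order and keep the unused 'bird' level
df$names1 <- factor(df$names1, levels = dims)
df$names2 <- factor(df$names2, levels = dims)

m_count <- table(df$names1, df$names2)                 # occurrences per pair
m_sum   <- xtabs(values ~ names1 + names2, data = df)  # summed values per pair

m_count
m_sum
```

Both results are table objects with names1 in the rows and names2 in the columns.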
Using match and data.table grouping operations:
library(data.table)
n <- length(dims)
# initialize matrices
m_count <- m_sum <- matrix(0L, n, n, 0, list(dims, dims))
dt <- setDT(df)[
, .(
# the matrix index for the name1-name2 combination
idx = (match(names2, dims) - 1L)*n + match(names1, dims),
count = .N, # value for the count matrix
sum = sum(values) # value for the sum matrix
),
c("names1", "names2") # group by
]
# update matrices
m_count[dt$idx] <- dt$count
m_sum[dt$idx] <- dt$sum
m_count
#> dog mouse bird cat
#> dog 1 1 0 0
#> mouse 0 1 0 2
#> bird 0 0 0 0
#> cat 1 3 0 1
m_sum
#> dog mouse bird cat
#> dog 5 41 0 0
#> mouse 0 77 0 112
#> bird 0 0 0 0
#> cat 41 43 0 78
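The idx column above works because R stores matrices column by column, so the cell in row i and column j of an n-row matrix sits at linear position (j - 1) * n + i. A tiny sketch of just that step, using the same dims (the 'cat'/'mouse' cell is picked only for illustration):

```r
dims <- c('dog','mouse','bird','cat')
n <- length(dims)
m <- matrix(0L, n, n, dimnames = list(dims, dims))

i <- match('cat', dims)     # row position (names1)
j <- match('mouse', dims)   # column position (names2)
idx <- (j - 1L) * n + i     # column-major linear index

m[idx] <- 1L
m
```

Only the cell in row cat, column mouse is set, which is the same cell that m['cat', 'mouse'] addresses.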
Data:
df <- data.frame(
names1=c('mouse','dog','cat','cat','mouse','cat','cat','dog','cat','mouse'),
names2=c('cat','dog','dog','mouse','cat','cat','mouse','mouse','mouse','mouse'),
values=c(11,5,41,25,101,78,12,41,6,77)
)
dims <- unique(c('dog','mouse','bird','cat'))
Related
I have a list of three data frames that are similar (same number of columns but different number of rows), and were split from a larger data set.
Here is some example code to make three data frames and put them in a list. It is really hard to make an exact replicate of my data since the files are so large (over 400 columns and the first 6 columns are not numerical)
a <- c(0,1,0,1,0,0,0,0,0,1,0,1)
b <- c(0,0,0,0,0,0,0,0,0,0,0,0)
c <- c(1,0,1,1,1,1,1,1,1,1,0,1)
d <- c(0,0,0,0,0,0,0,0,0,0,0,0)
e <- c(1,1,1,1,0,1,0,1,0,1,1,1)
f <- c(0,0,0,0,0,0,0,0,0,0,0,0)
g <- c(1,0,1,0,1,1,1,1,1,1)
h <- c(0,0,0,0,0,0,0,0,0,0)
i <- c(1,0,0,0,0,0,0,0,0,0)
j <- c(0,0,0,0,1,1,1,1,1,0)
k <- c(0,0,0,0,0)
l <- c(1,0,1,0,1)
m <- c(1,0,1,0,0)
n <- c(0,0,0,0,0)
o <- c(1,0,1,0,1)
df1 <- data.frame(a,b,c,d,e,f)
df2 <- data.frame(g,h,i,j)
df3 <- data.frame(k,l,m,n,o)
my.list <- list(df1,df2,df3)
I am looking to remove all the columns in each data frame whose total == 0. The code is below:
list2 <- lapply(my.list, function(x) {x[, colSums(x) != 0];x})
list2 <- lapply(my.list, function(x) {x[, colSums(x != 0) > 0];x})
Both of the above lines run, but neither actually removes the all-zero columns.
I am not sure why that is. Any tips are greatly appreciated.
The OP found a solution by exchanging comments with me, but I want to leave the following for future readers. In lapply(my.list, function(x) {x[, colSums(x) != 0];x}), the OP was asking R to do two things: first, subset each data frame in my.list; second, return each data frame. I think the OP assumed that each data frame was updated by the subsetting. But the result of the subsetting was never assigned to anything, and a function returns the value of its last expression, which here is the unchanged x. (So, on the surface, nothing changed.) Following the OP's approach, I would do something like this.
lapply(my.list, function(x) {foo <- x[, colSums(x) != 0]; foo})
That is, create a temporary object in the anonymous function and return that object. Alternatively, return the subset directly:
lapply(my.list, function(x) x[, colSums(x) != 0])
For each data frame in my.list, run a logical check for each column. If colSums(x) != 0 is TRUE, keep the column. Otherwise remove it. Hope this will help future readers.
[[1]]
a c e
1 0 1 1
2 1 0 1
3 0 1 1
4 1 1 1
5 0 1 0
6 0 1 1
7 0 1 0
8 0 1 1
9 0 1 0
10 1 1 1
11 0 0 1
12 1 1 1
[[2]]
g i j
1 1 1 0
2 0 0 0
3 1 0 0
4 0 0 0
5 1 0 1
6 1 0 1
7 1 0 1
8 1 0 1
9 1 0 1
10 1 0 0
[[3]]
l m o
1 1 1 1
2 0 0 0
3 1 1 1
4 0 0 0
5 1 0 1
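One thing to watch, as an aside that was not raised in the thread: if only a single column survives the filter, x[, cond] returns a plain vector rather than a data frame. A small sketch with made-up data (not the OP's) showing how drop = FALSE keeps the data frame structure in that case:

```r
# toy list: in the first data frame only column 'a' is non-zero
my.list <- list(
  data.frame(a = c(0, 1, 0), b = c(0, 0, 0)),
  data.frame(k = c(0, 0), l = c(1, 0), m = c(1, 1))
)

# drop = FALSE prevents a one-column result from collapsing to a vector
list2 <- lapply(my.list, function(x) x[, colSums(x) != 0, drop = FALSE])
list2
```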
I have a dataframe which contains 100,000 rows. It looks like this:
Value
1
2
-1
-2
0
3
4
-1
3
I want to create an extra column (column B), which consists of 0s and 1s.
It is 0 by default, but when there are 5 consecutive data points that are all positive OR all negative, it should be 1. This only holds while the run is unbroken (e.g. when the run is positive and a negative number appears, the count starts again).
Value B
1 0
2 0
1 0
2 0
2 1
3 1
4 1
-1 0
3 0
I tried different loops, but they didn't work. I also tried to convert the whole DF to a list (and loop over the list), unfortunately without success.
Here's an approach that uses the rollmean function from the zoo package.
set.seed(1000)
df = data.frame(Value = sample(-9:9, 1000, replace=TRUE))
# named sgn so the variable does not mask the sign() function
sgn = sign(df$Value)
library(zoo)
rolling = rollmean(sgn, k=5, fill=0, align="right")
df$B = as.numeric(abs(rolling) == 1)
I generated 1000 values with positive and negative sets.
Extract the sign of the values - this will be -1 for negative, 1 for positive and 0 for 0
Calculate the right aligned rolling mean of 5 values (it will average x[1:5], x[2:6], ...). This will be 1 or -1 if all the values in a row are positive or negative (respectively)
Take the absolute value and store the comparison against 1. This is a logical vector that turns into 0s and 1s based on your conditions.
Note - there's no need for loops. This can all be vectorised (once we have the rolling mean calculated).
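The same idea can be sketched in base R without zoo. This is a swapped-in variant, not the rollmean call above: it uses stats::filter to take a trailing sum of five signs and checks whether the absolute sum is 5, which avoids comparing a floating-point mean against 1. The Value vector is the small example from the question's expected output.

```r
Value <- c(1, 2, 1, 2, 2, 3, 4, -1, 3)
sgn <- sign(Value)

# trailing sum of the current and the 4 previous signs (NA for the first 4 rows)
run5 <- as.numeric(stats::filter(sgn, rep(1, 5), sides = 1))

# 1 only where all five signs agree and are non-zero
B <- as.integer(!is.na(run5) & abs(run5) == 5)
B
```

For this input, B is 0 0 0 0 1 1 1 0 0, matching column B in the question.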
This will work. Not the most efficient way to do it but the logic is pretty transparent -- just check if there's only one unique sign (i.e. +, -, or 0) for each sequence of five adjacent rows:
dat <- data.frame(Value=c(1,2,1,2,2,3,4,-1,3))
dat$new_col <- NA
dat$new_col[1:4] <- 0
for (x in 5:nrow(dat)){
if (length(unique(sign(dat$Value[(x-4):x])))==1){
dat$new_col[x] <- 1
} else {
dat$new_col[x] <- 0
}
}
Use the cumsum(...diff(...) <condition>) idiom to create a grouping variable, and ave to calculate the indices within each group.
d$B2 <- ave(d$Value, cumsum(c(0, diff(sign(d$Value)) != 0)), FUN = function(x){
as.integer(seq_along(x) > 4)})
# Value B B2
# 1 1 0 0
# 2 2 0 0
# 3 1 0 0
# 4 2 0 0
# 5 2 1 1
# 6 3 1 1
# 7 4 1 1
# 8 -1 0 0
# 9 3 0 0
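The grouping idea can also be written with run-length encoding, which may make the "position within a run" step easier to see. A minimal sketch, assuming the same nine values as in the output above (rle and sequence are base R; the variable names are only illustrative):

```r
Value <- c(1, 2, 1, 2, 2, 3, 4, -1, 3)

r <- rle(sign(Value))           # runs of identical sign
pos <- sequence(r$lengths)      # 1, 2, 3, ... restarting at each new run
B2 <- as.integer(pos > 4)       # 1 from the fifth element of a run onwards
B2
```

Here the runs have lengths 7, 1 and 1, so only elements 5 to 7 of the first run are flagged.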
I have an ordered table, similar to the following:
df <- read.table(text =
"A B C Size
1 0 0 1
0 1 1 2
0 0 1 1
1 1 0 2
0 1 0 1",
header = TRUE)
In reality there will be many more columns, but this is fine for a solution.
I wish to sort this table first by Size (ascending), then by each other column in priority sequence (descending), i.e. by column A first, then B, then C, etc.
The problem is that I will not know the column names in advance, so I cannot name them; in effect I need "all columns except Size".
End result should be:
A B C Size
1 0 0 1
0 1 0 1
0 0 1 1
1 1 0 2
0 1 1 2
I've seen examples of sorting by two columns, but I just can't find the correct syntax to sort by 'all other columns sequentially'.
Many thanks
With the names use order like this. No packages are used.
o <- with(df, order(Size, -A, -B, -C))
df[o, ]
This gives:
A B C Size
1 1 0 0 1
5 0 1 0 1
3 0 0 1 1
4 1 1 0 2
2 0 1 1 2
or without the names just use column numbers:
o <- order(df[[4]], -df[[1]], -df[[2]], -df[[3]])
or
k <- 4
o <- do.call("order", data.frame(df[[k]], -df[-k]))
If Size is always the last column use k <- ncol(df) instead or if it is not necessarily last but always called Size then use k <- match("Size", names(df)) instead.
Note: Negating works in the example from the question because all columns are numeric. If the columns were not numeric, one could not negate them directly, so a more general solution is to replace the first line above with the following, where xtfrm is an R function that converts objects to numeric values that sort in the expected order.
o <- with(df, order(Size, -xtfrm(A), -xtfrm(B), -xtfrm(C)))
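The xtfrm note and the "column names unknown" variant can be combined. The following is a sketch under assumptions: the toy data frame with a character column A is made up for illustration, and Size is located by name as suggested above.

```r
df <- data.frame(
  A    = c("x", "y", "x", "y"),
  B    = c(1, 0, 0, 1),
  Size = c(2, 1, 1, 1),
  stringsAsFactors = FALSE
)

k <- match("Size", names(df))

# Size ascending first, then every other column descending via -xtfrm()
keys <- c(list(df[[k]]), lapply(df[-k], function(col) -xtfrm(col)))
o <- do.call(order, unname(keys))

df[o, ]
```

Because xtfrm returns sortable numbers for character and factor columns too, the negation works regardless of column type.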
We can use arrange from dplyr
library(dplyr)
arrange(df, Size, desc(A), desc(B), desc(C))
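For a larger number of columns, arrange_ can be used (note that arrange_ is deprecated in recent dplyr releases, where arrange(df, Size, across(-Size, desc)) is one replacement)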
cols <- paste0("desc(", names(df)[1:3], ")")
arrange_(df, .dots = c("Size", cols))
I am trying to create an adjacency matrix M from a list pList containing the indices that have to be equal to 1 in the matrix M.
For example, M is a 5x4 matrix
The variable pList contains 5 elements, each one is a vector of indices
Example :
s <- list("1210", c("254", "534"), "254", "534", "364")
M <- matrix(c(rep(0)),nrow = 5, ncol = length(unique(unlist(s))), dimnames=list(1:5,unique(unlist(s))))
Currently, my overly simple solution is the brute-force one, with a for loop over the rows of the matrix:
for (i in 1:nrow(M)){
M[i, as.character(s[[i]])] <- 1
}
So that the expected result is :
M
1210 254 534 364
1 1 0 0 0
2 0 1 1 0
3 0 1 0 0
4 0 0 1 0
5 0 0 0 1
The problem is that I have to manipulate matrices with several thousand rows and it takes too much time.
I am not an "apply" expert, but I wonder if there is a quicker solution.
Thanks
Regards
We can convert the list to a two-column matrix of row/column indices and use that index matrix to set the corresponding elements of 'M' to 1.
M[as.matrix(stack(setNames(s, seq_along(s)))[,2:1])] <- 1
M
# 1210 254 534 364
#1 1 0 0 0
#2 0 1 1 0
#3 0 1 0 0
#4 0 0 1 0
#5 0 0 0 1
Or instead of using stack to convert to a data.frame, we can unlist the 's' to get the column index, cbind with row index created by replicating the sequence of list with length of each list element (using lengths) and assign the elements in 'M' to 1.
M[cbind(rep(seq_along(s), lengths(s)), unlist(s))] <- 1
Or yet another option would be to create a sparseMatrix
library(Matrix)
Un1 <- unlist(s)
sparseMatrix(i = rep(seq_along(s), lengths(s)),
j=as.integer(factor(Un1, levels = unique(Un1))),
x=1)
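To see that indexing with a two-column matrix gives the same result as the loop from the question, here is a self-contained check. It is a slight variant of the cbind approach above: it uses integer column positions from match() instead of column names, and M_loop and cols are names introduced only for this sketch.

```r
s <- list("1210", c("254", "534"), "254", "534", "364")
cols <- unique(unlist(s))
M <- matrix(0, nrow = length(s), ncol = length(cols),
            dimnames = list(seq_along(s), cols))

# reference result: the loop from the question
M_loop <- M
for (i in seq_along(s)) M_loop[i, s[[i]]] <- 1

# one (row, column) pair per list entry, then a single vectorised assignment
idx <- cbind(rep(seq_along(s), lengths(s)), match(unlist(s), cols))
M[idx] <- 1

identical(M, M_loop)
```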
I have a large dataset (DF), a subset of which looks like this:
Site Event HardwareID Species Day1 Day2 Day3 Day4 Day5 Day6
1 1 16_11 x 0 0 0 0 0 0
1 1 29_11 y 0 0 6 2 0 1
1 1 36_11 d 0 0 0 0 0 1
1 1 41_11 y 0 0 2 4 1 1
1 1 41_11 x 0 0 0 0 0 1
1 1 58_11 a 0 0 1 0 0 0
1 1 62_11 y 0 0 0 1 0 0
1 1 62_11 z 0 0 0 0 0 0
1 1 62_11 x 0 0 0 0 0 1
2 1 40_AR b 0 0 0 0 0 0
2 1 12_11 z 0 0 1 0 0 0
I'd like to examine the minimum number of HardwareIDs needed to produce the most Species over the shortest amount of time, by calculating species accumulation curves (which intrinsically incorporate the Days columns) for each HardwareID at each site, and bootstrapping the HardwareID selection (so, look at accumulation curves using two HardwareIDs, then 3, then 4, etc., at each site).
I have written a function to create species accumulation curves (using specaccum) for a subset of these, such as:
Sites<-subset(DF,DF$Site==1)
samples<-function (x) {
specurve_sample<-(ddply(Sites[,4:length(colnames(Sites))],"Species",numcolwise(sum)))
specurve_sample<-specurve_sample[-1,]
n<-specurve_sample$Species
n<-drop.levels(n,reorder=FALSE)
specurve_sample<-specurve_sample[,-1]
specurve_sample <-t(specurve_sample)
colnames(specurve_sample)<-n
specurve_sample<-as.data.frame(specurve_sample)
sample_k<-specaccum(specurve_sample)
out<-rbind(sample_k$richness,sample_k$sd)
outnames<-c("Richness","SD")
st<-rep(Sites$Site[1],2)
out<-as.data.frame(cbind(outnames,st,out))
colnames(out)<-c("label","site","Days")
out
}
The function works fine if I subset my data beforehand, but the bootstrapping part does not work. I know I need to create a function(x, j) but cannot figure out where to place the j in my function. Here is the rest of my code. Many thanks for any assistance. James
all_data<-c()
for (i in 1:length(unique(DF$Site))) {
Sites<-subset(DF,DF$Site==i)
boots<-boot(Sites,samples, strata=Sites$HardwareID,R=1000)
all_data<-rbind(all_data,boots)
all_data
}
One straightforward way to do this is to create a function of x and j (as you have started to do), and have the first line of that function identify the relevant bootstrap subset from the whole collection, bootsub <- x[j, ]. Then, you can refer to this subset, bootsub, throughout the rest of the function, and you need not refer to j again.
In your case, you don't want your function to refer back to your original data frame, Sites. So, everywhere that you have Sites in your function, change it to bootsub. For example:
samples <- function(x, j) {
bootsub <- x[j, ]
specurve_sample <- (ddply(bootsub[, 4:length(colnames(bootsub))], "Species", numcolwise(sum)))
specurve_sample <- specurve_sample[-1, ]
n <- specurve_sample$Species
n <- drop.levels(n, reorder=FALSE)
specurve_sample <- specurve_sample[, -1]
specurve_sample <- t(specurve_sample)
colnames(specurve_sample) <- n
specurve_sample <- as.data.frame(specurve_sample)
sample_k <- specaccum(specurve_sample)
out <- rbind(sample_k$richness, sample_k$sd)
outnames <- c("Richness", "SD")
st <- rep(bootsub$Site[1], 2)
out <- as.data.frame(cbind(outnames, st, out))
colnames(out) <- c("label", "site", "Days")
out
}
...
A follow-up to the first two comments below. It's a little hard to troubleshoot without data, but this is my best guess. It may be that you have an issue with your subset() call, because you use i as an index of unique sites in the for() loop, but then refer to i as the value of the site in the call to subset(). Also, it is likely more efficient to combine the results with a single do.call(rbind, ...) after the for() loop, rather than calling rbind() repeatedly inside the loop. Give this untested code a try.
# vector of unique sites
usite <- unique(DF$Site)
# empty list in which to put the bootstrap results
alldatlist <- vector("list", length(usite))
# loop through every site separately, save the bootstrap replicates ($t)
for(i in 1:length(usite)) {
Sites <- subset(DF, DF$Site==usite[i])
alldatlist[[i]] <- boot(Sites, samples, strata=Sites$HardwareID, R=1000)$t
}
# combine the list of results into a single matrix
all_data <- do.call(rbind, alldatlist)
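For reference, here is a stripped-down sketch of the statistic(x, j) pattern on toy data, so the role of j can be seen in isolation. The data frame dat, its columns g and y, and the function stat are made up for illustration; g stands in for HardwareID. As far as I know, boot() expects the statistic to return a numeric vector, so a function that returns a data frame (like out above) would probably need to return something like c(sample_k$richness, sample_k$sd) instead.

```r
library(boot)  # recommended package, shipped with standard R installations

set.seed(1)
# toy data: 'g' plays the role of the strata variable, 'y' is a measured value
dat <- data.frame(g = rep(c("a", "b"), each = 5), y = 1:10)

# x is the data passed to boot(); j holds the resampled row indices
stat <- function(x, j) {
  bootsub <- x[j, ]
  mean(bootsub$y)   # return a numeric vector
}

# resample rows within each level of g, 50 replicates
b <- boot(dat, stat, R = 50, strata = factor(dat$g))
dim(b$t)   # one row per replicate, one column per returned statistic
```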