This table is based on a species sampling procedure that starts at a given location in a forest and records the number of species that occur at that exact spot. The surveyor then walks on and, each time a new species is found, records the distance between the place where it was found and the starting point.
I would like to create a new column that includes the cumulative number of species based on the traveled distance. Here's what the new column should look like.
Example data:
data<-structure(list(id = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1), binomial = c("Dromicodryas bernieri",
"Dromicodryas quadrilineatus", "Erymnochelys madagascariensis",
"Furcifer lateralis", "Furcifer oustaleti", "Hemidactylus mercatorius",
"Langaha pseudoalluaudi", "Leioheterodon madagascariensis", "Lycodryas pseudogranuliceps",
"Liophidium torquatum", "Liopholidophis sexlineatus", "Madagascarophis colubrinus",
"Madatyphlops decorsei", "Madascincus polleni", "Mimophis mahfalensis",
"Pelusios castanoides", "Phelsuma madagascariensis", "Thamnosophis lateralis",
"Trachylepis elegans", "Trachylepis gravenhorstii", "Zonosaurus madagascariensis",
"Hemidactylus frenatus", "Calumma nasutum", "Trachylepis madagascariensis",
"Amphiglossus macrocercus", "Zonosaurus aeneus", "Phelsuma lineata",
"Pelomedusa subrufa", "Calumma crypticum", "Furcifer viridis",
"Lygodactylus blancae", "Calumma gastrotaenia", "Trachylepis boettgeri",
"Zonosaurus ornatus", "Sanzinia madagascariensis", "Oplurus cyclurus",
"Leioheterodon modestus", "Oplurus cuvieri", "Madascincus igneocaudatus",
"Acrantophis dumerili", "Furcifer campani", "Pseudoxyrhopus imerinae",
"Lygodactylus mirabilis", "Phelsuma barbouri", "Furcifer minor",
"Compsophis infralineatus", "Pseudoxyrhopus quinquelineatus",
"Calumma hilleniusi", "Paroedura bastardi", "Brookesia brygooi"
), distance = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 3714.77402549982, 6249.49093233716, 7067.80424387549,
7715.0303317613, 13769.1057463018, 17206.1480236598, 18733.5237644898,
21923.789153995, 27314.2085865309, 31154.1890492383, 35460.0864839256,
35822.0263564291, 36933.3736660544, 39735.6007540156, 40983.6673876956,
43032.8409122139, 43793.3004333338, 44063.3992480126, 44657.9183000201,
44723.8214805486, 45184.0884859559, 46785.9008560645, 48994.7048866502,
55332.621992021, 57746.4142325833, 58866.2845249788, 60839.811988087,
65560.1987963227)), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA,
-50L))
One option is to take the cumsum of a logical vector that is TRUE where distance is not equal to 0, and then add the number of rows with distance 0 (sum(distance == 0)) to it:
library(dplyr)
data <- data %>%
  mutate(new = cumsum(distance != 0) + sum(distance == 0))
Or we can do the same in base R:
data$new <- with(data, cumsum(distance!=0) + sum(distance == 0))
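As a quick check of the idea, here is a minimal sketch on a made-up five-row table (the toy values are hypothetical, not taken from the question). It assumes the rows are already sorted by distance and that each non-zero distance marks exactly one new species.

```r
# Toy data (hypothetical): three species at the starting point, two found later
toy <- data.frame(distance = c(0, 0, 0, 5, 9))

# All species at distance 0 are counted from the first row on;
# each later row adds one more species to the running total
toy$new <- cumsum(toy$distance != 0) + sum(toy$distance == 0)

toy$new
# 3 3 3 4 5
```

If the real table holds several id values, the same expression would presumably have to be applied within each id, for example after group_by(id) in dplyr.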
Though this problem has been 'solved' many times, it turns out there's always another problem.
Without the print function it runs with no errors, but with it I get the following:
Error in .subset2(x, i) : recursive indexing failed at level 2
Which I'm taking to mean it doesn't like graphs being created in two layers of iteration? Changing the method to 'qplot(whatever:whatever)' has the exact same problem.
It's designed to print a graph for every pairing of the variables I'm looking at. There are too many to fit in a single picture, as the pairs function would produce, and I need to be able to see the actual variable names on the axes.
load("Transport_Survey.RData")
variables <- select(Transport, "InfOfReceievingWeather", "InfOfReceievingTraffic", "InfOfSeeingTraffic", "InfWeather.Ice", "InfWeather.Rain", "InfWeather.Wind", "InfWeather.Storm", "InfWeather.Snow", "InfWeather.Cold", "InfWeather.Warm", "InfWeather.DarkMorn", "InfWeather.DarkEve", "HomeParking", "WorkParking", "Disability", "Age", "CommuteFlexibility", "Gender", "PassionReduceCongest")
varnames <- list("InfOfReceivingWeather", "InfOfReceivingTraffic", "InfOfSeeingTraffic", "InfWeather.Ice", "InfWeather.Rain", "InfWeather.Wind", "InfWeather.Storm", "InfWeather.Snow", "InfWeather.Cold", "InfWeather.Warm", "InfWeather.DarkMorn", "InfWeather.DarkEve", "HomeParking", "WorkParking", "Disability", "Age", "CommuteFlexibility", "Gender", "PassionReduceCongest")
counterx = 1
countery = 1
for (a in variables) {
for (b in variables) {
print(ggplot(variables, mapping=aes(x=variables[[a]], y=variables[[b]],
xlab=varnames[counterx], ylab=varnames[countery]))+
geom_point())
countery = countery+1
counterx = counterx+1
}
}
#variables2 <- select(Transport, one_of(InfOfReceivingWeather, InfOfReceivingTraffic, InfOfSeeingTraffic, InfWeather.Ice, InfWeather.Rain, InfWeather.Wind, InfWeather.Storm, InfWeather.Snow, InfWeather.Cold, InfWeather.Warm, InfWeather.DarkMorn, InfWeather.DarkEve, HomeParking, WorkParking, Disability, Age, CommuteFlexibility, Gender, PassionReduceCongest))
Here is a mini-data frame for reference, sampled from the columns I'm using:
structure(list(InfOfReceievingWeather = c(1, 1, 1, 1, 4), InfOfReceievingTraffic = c(1,
1, 1, 1, 4), InfOfSeeingTraffic = c(1, 1, 1, 1, 4), InfWeather.Ice = c(3,
1, 3, 5, 5), InfWeather.Rain = c(1, 1, 2, 2, 4), InfWeather.Wind = c(1,
1, 2, 2, 4), InfWeather.Storm = c(1, 1, 1, 2, 5), InfWeather.Snow = c(1,
1, 2, 5, 5), InfWeather.Cold = c(1, 1, 1, 2, 5), InfWeather.Warm = c(1,
1, 1, 1, 3), InfWeather.DarkMorn = c(1, 1, 1, 1, 1), InfWeather.DarkEve = c(1,
1, 1, 1, 1), HomeParking = c(1, 1, 3, 1, 1), WorkParking = c(1,
4, 4, 5, 4), Disability = c(1, 1, 1, 1, 1), Age = c(19, 45, 35,
40, 58), CommuteFlexibility = c(2, 1, 5, 1, 2), Gender = c(2,
2, 2, 2, 1), PassionReduceCongest = c(0, 0, 2, 0, 2)), row.names = c(NA,
-5L), class = c("tbl_df", "tbl", "data.frame"))
The error comes from how a and b are defined. With for (a in variables), a takes the values stored in each column of variables in turn, not the column name. So when you call variables[[a]] in your aes mapping, on the first iteration you are effectively writing:
variables[[c(1, 1, 1, 1, 4)]] instead of variables[["InfOfReceievingWeather"]]. So, it can't work.
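To see the difference without ggplot2, here is a small base R sketch. The two-column data frame df is a hypothetical stand-in for variables; it only illustrates what the loop variable holds in each style of loop.

```r
# Small stand-in for `variables` (hypothetical values)
df <- data.frame(u = c(1, 1, 4), v = c(3, 1, 5))

# Looping over the data frame itself yields each column's values
by_value <- list()
for (a in df) by_value[[length(by_value) + 1]] <- a

# Looping over names(df) yields the column names, which work with [[ ]]
by_name <- character(0)
for (nm in names(df)) by_name <- c(by_name, nm)

by_value[[1]]      # 1 1 4
by_name            # "u" "v"
df[[by_name[1]]]   # 1 1 4
```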
To get around this issue, you can choose between:
for (a in variables) {
for (b in variables) {
print(ggplot(variables, mapping=aes(x=a, y=b)) ...
or
for (a in 1:ncol(variables)) {
for (b in 1:ncol(variables)) {
print(ggplot(variables, mapping=aes(x=variables[[a]], y=variables[[b]])) ...
Although the first one looks simpler, I would prefer the second option because it lets you reuse a and b as column indices to extract the column names of variables for xlab and ylab.
In the end, writing something like this should work:
for (a in 1:ncol(variables)) {
  for (b in 1:ncol(variables)) {
    print(ggplot(variables, mapping = aes(x = variables[[a]], y = variables[[b]])) +
            xlab(colnames(variables)[a]) +
            ylab(colnames(variables)[b]) +
            geom_point())
  }
}
Does this answer your question?
I have a file that represents the gene structure of bacteria models. Each row represents a model. A row is a fixed-length binary string showing which genes are present (1 for present and 0 for absent). My task is to compare the gene sequence for each pair of models, get a score of how similar they are, and compute a dissimilarity matrix.
In total there are 450 models (rows) in one file and there are 250 files. I have working code; however, it takes roughly 1.6 hours to do the whole thing for only one file.
#Sample Data
Generation: 0
[0, 1, 0, 1, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0]
[1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1]
[1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0]
[0, 1, 1, 0, 1, 0, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 0, 0]
[0, 1, 0, 1, 1, 1, 1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0]
[1, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0]
What my code does:
1. Reads the file
2. Converts the binary strings into a data frame: Gene, Model_1, Model_2, Model_3, … Model_450
3. Runs a nested for loop to do the pair-wise comparison (only the top half of the matrix): I take the two corresponding columns and add them, then count the positions where the sum is 2 (meaning present in both models)
4. Writes the data to a file
5. Creates the matrix later
comparison code
generationFiles = list.files(pattern = "^Generation.*\\_\\d+.txt$")
start.time = Sys.time()
for(a in 1:length(generationFiles)){
fname = generationFiles[a]
geneData = read.table(generationFiles[a], sep = "\n", header = T, stringsAsFactors = F)
geneCount = str_count(geneData[1,1],"[1|0]")
geneDF <- data.frame(Gene = paste0("Gene_", c(1:geneCount)), stringsAsFactors = F)
#convert the string into a data frame
for(i in 1:nrow(geneData)){
#remove the square brackets
dataRow = substring(geneData[i,1], 2, nchar(geneData[i,1]) - 1)
#removing white spaces
dataRow = gsub(" ", "", dataRow, fixed = T)
#splitting the string
dataRow = strsplit(dataRow, ",")
#converting to numeric
dataRow = as.numeric(unlist(dataRow))
colName = paste("M_",i,sep = "")
geneDF <- cbind(geneDF, dataRow)
colnames(geneDF)[colnames(geneDF) == 'dataRow'] <- colName
dataRow <- NULL
}
summaryDF <- data.frame(Model1 = character(), Model2 = character(), Common = integer(),
Uncommon = integer(), Absent = integer(), stringsAsFactors = F)
modelNames = paste0("M_",c(1:450))
secondaryLevel = modelNames
fileName = paste0("D://BellosData//GC_3//Summary//",substr(fname, 1, nchar(fname) - 4),"_Summary.txt")
for(x in 1:449){
secondaryLevel = secondaryLevel[-1]
for(y in 1:length(secondaryLevel)){
result = geneDF[modelNames[x]] + geneDF[secondaryLevel[y]]
summaryDF <- rbind(summaryDF, data.frame(Model1 = modelNames[x],
Model2 = secondaryLevel[y],
Common = sum(result == 2),
Uncommon = sum(result == 1),
Absent = sum(result == 0)))
}
}
write.table(summaryDF, fileName, sep = ",", quote = F, row.names = F)
geneDF <- NULL
summaryDF <- NULL
geneData <-NULL
}
converting to matrix
maxNum = max(summaryDF$Common)
normalizeData = summaryDF[,c(1:3)]
normalizeData[c('Common')] <- lapply(normalizeData[c('Common')], function(x) 1 - x/maxNum)
normalizeData[1:2] <- lapply(normalizeData[1:2], factor, levels=unique(unlist(normalizeData[1:2])))
distMatrixN = xtabs(Common~Model1+Model2, data=normalizeData)
distMatrixN = distMatrixN + t(distMatrixN)
Is there a way to make the process run faster? Is there a more efficient way to do the comparison?
This code should be faster. Nested loops are painfully slow in R, and rbind-ing one row at a time is among the slowest things you can do, because a new data frame is built on every iteration.
Generate 450 rows with 20 elements of 0, 1 on each row.
M = do.call(rbind, replicate(450, sample(0:1, 20, replace = T), simplify = F))
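With the real files, M would come from the bracketed rows rather than from sample(). Here is a minimal sketch of that parsing step, assuming each model sits on its own line in the `[0, 1, ...]` form shown in the question and that header lines such as `Generation: 0` should be skipped. The raw vector is a hypothetical stand-in for the result of readLines() on one file, and M_small is just an illustrative name.

```r
# Stand-in for readLines("<one generation file>") (hypothetical content)
raw <- c("Generation: 0",
         "[0, 1, 0, 1, 1]",
         "[1, 1, 1, 0, 1]",
         "[1, 1, 0, 0, 1]")

# Keep only the lines that hold a bracketed gene vector
rows <- raw[grepl("^\\[", raw)]

# Drop brackets and spaces, leaving "0,1,0,1,1"
digits <- gsub("[^01,]", "", rows)

# One integer vector per model, stacked into a models-by-genes matrix
M_small <- do.call(rbind, lapply(strsplit(digits, ",", fixed = TRUE), as.integer))

dim(M_small)
# 3 5
```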
Generate the list of all choose(450, 2) = 101,025 row pairs
L = split(v<-t(utils::combn(450, 2)), seq(nrow(v))); rm(v)
Apply whatever comparison function you want. In this case, it is the number of positions where both rows of a pair have a 1. If you want to calculate a different metric, just write another function(x) where M[x[1],] is the first row and M[x[2],] is the second row.
O = lapply(L, function(x) sum(M[x[1],]&M[x[2],]))
The code takes ~4 seconds on a fairly slow 2.6 GHz Sandy Bridge
Get a clean data.frame with your results, three columns : row 1, row 2, metric between the two rows
data.frame(row1 = sapply(L, `[`, 1),
row2 = sapply(L, `[`, 2),
similarity_metric = do.call(rbind, O))
To be honest, I didn't thoroughly comb through your code to replicate exactly what you were doing. If this is not what you are looking for (or can't be modified to achieve what you are looking for), leave a comment.
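As a possible further step, which is not part of the approach above, the pair list can be avoided altogether: for a 0/1 matrix, the matrix product of M with its own transpose counts, for every pair of rows, the positions where both are 1. The sketch below uses randomly generated data and follows the normalisation from the question (1 - Common / max(Common)); the variable names are illustrative.

```r
set.seed(1)  # illustrative random data only
M <- matrix(sample(0:1, 450 * 20, replace = TRUE), nrow = 450)

common   <- tcrossprod(M)               # genes present in both models
absent   <- tcrossprod(1 - M)           # genes absent from both models
uncommon <- ncol(M) - common - absent   # genes present in exactly one model

# Spot-check one pair against the direct computation
stopifnot(common[1, 2] == sum(M[1, ] & M[2, ]))

# Dissimilarity as in the question: scale by the largest pairwise count
max_common <- max(common[upper.tri(common)])
dissim <- 1 - common / max_common
diag(dissim) <- 0   # a model has no distance to itself

dim(dissim)
# 450 450
```

Because this works on whole matrices, it sidesteps both the nested loop and the row-by-row rbind.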
I have a data frame
testdf <- data.frame(predicted1 = c(1, 0, 1, 3, 2, 1, 1, 0, 1, 0), predicted2 = c(1, 0, 2, 2, 2, 1, 1, 0, 0, 0), predicted3 = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1), actual = c(1, 0, 2, 3, 2, 1, 1, 1, 0, 0))
I want to add another column to this data frame which tells me the total percentage accuracy when looking at all predicted values. So for example, row 1 of this would have an accuracy of 100%, because all prediction columns predicted the correct value (1).
How can this be done?
Thanks!
We can compare the prediction columns with 'actual', take the rowMeans, multiply by 100, and round if needed:
round(100*rowMeans(testdf[1:3] == testdf$actual), 2)
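To store the result as a new column, the same expression can be assigned back to the data frame. A short sketch using the testdf from the question; the column name accuracy and the helper pred_cols are arbitrary choices:

```r
testdf <- data.frame(predicted1 = c(1, 0, 1, 3, 2, 1, 1, 0, 1, 0),
                     predicted2 = c(1, 0, 2, 2, 2, 1, 1, 0, 0, 0),
                     predicted3 = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1),
                     actual     = c(1, 0, 2, 3, 2, 1, 1, 1, 0, 0))

pred_cols <- c("predicted1", "predicted2", "predicted3")

# Share of prediction columns that match 'actual' in each row, as a percentage
testdf$accuracy <- round(100 * rowMeans(testdf[pred_cols] == testdf$actual), 2)

testdf$accuracy
# 100.00 66.67 33.33 33.33 66.67 100.00 100.00 33.33 33.33 66.67
```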
I have the following data.table
structure(list(xi = c(1, 1, 1, 2, 2, 2, 3, 3, 3), yi = c(1, 2,
3, 1, 2, 3, 1, 2, 3), flag = c(0, 0, 0, 0, 0, 0, 0, 0, 0)), .Names = c("xi",
"yi", "flag"), row.names = c(NA, -9L), class = c("data.table",
"data.frame"), .internal.selfref = <pointer: 0x11a1a78>)
I also have a 3x3 matrix as below.
structure(c(1, 1, 0.4, 1, 0, 0, 1, 0, 0.2), .Dim = c(3L, 3L))
I want to set the third column of the data.table, flag, such that if the element of the matrix at row xi and column yi is less than 1, then flag = 1, else 0. I wrote a function for this,
func <- function (x, y, m) {
if (m[x, y] < 1) {
return(1)
}
else {
return(0)
}
}
However, if I try
y[,flag := func(xi,yi,m)]
my flag values are always 0. Could someone point out what I'm doing wrong here?
Thanks in advance.
You don't need a custom function...
dt[ , flag := as.integer( m[cbind(xi,yi)] < 1 ) ]
You do need to be careful to index the matrix in the correct way: m[cbind(xi, yi)] picks one element per (xi, yi) pair, whereas the m[xi, yi] form returns the whole submatrix of those rows and columns.
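The difference between the two indexing forms can be checked in base R without data.table. This sketch rebuilds the matrix from the question and uses plain vectors xi and yi in place of the table columns:

```r
m  <- matrix(c(1, 1, 0.4, 1, 0, 0, 1, 0, 0.2), nrow = 3)
xi <- rep(1:3, each = 3)
yi <- rep(1:3, times = 3)

# Vector row and column indices cross every xi with every yi
dim(m[xi, yi])
# 9 9

# A two-column index matrix picks exactly one element per row of cbind(xi, yi)
m[cbind(xi, yi)]
# 1.0 1.0 1.0 1.0 0.0 0.0 0.4 0.0 0.2

flag <- as.integer(m[cbind(xi, yi)] < 1)
flag
# 0 0 0 0 1 1 1 1 1
```

This also suggests why the original function gives only zeros: m[x, y] is a 9 x 9 matrix there, so the if condition has more than one value. Older versions of R use only the first one (m[1, 1] = 1, which is not below 1) and return 0 for the whole column; since R 4.2.0 such a condition is an error.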