I'm new to R and I'm stuck on a problem I can't solve by myself.
A friend recommended that I use one of the apply functions, but I just don't get how to use it in this case. Anyway, on to the problem! =)
Inside the inner while loop, I have an ifelse. That is the bottleneck. It takes on average 1 second to run each iteration. The slow part is marked with #slow part start/end in the code.
Given that we will run it 2000*100 = 200,000 times, it will take approximately 55.5 hours to finish each time we run this code. And the bigger problem is that this will be reused a lot, so x*55.5 hours is just not doable.
Below is the fraction of the code relevant to the question:
#dt is data.table with close to 1.5million observations of 11 variables
#rand.mat is a 110*100 integer matrix
j <- 1
while (j <= 2000)
{
  # other code is executed here, not relevant to the question
  i <- 1
  while (i <= 100)
  {
    # slow part start
    dt$column2 = ifelse(dt$datecolumn %in% c(rand.mat[,i]) & dt$column4==index[i], NA, dt$column2)
    # slow part end
    i <- i + 1
  }
  # other code is executed here, not relevant to the question
  j <- j + 1
}
Please, any advice would be greatly appreciated.
EDIT - Run the code below to reproduce the problem
library(data.table)
dt = data.table(
  datecolumn = c("20121101", "20121101", "20121104", "20121104", "20121130", "20121130",
                 "20121101", "20121101", "20121104", "20121104", "20121130", "20121130"),
  column2 = c("5", "3", "4", "6", "8", "9", "2", "4", "3", "5", "6", "8"),
  column3 = c("5", "3", "4", "6", "8", "9", "2", "4", "3", "5", "6", "8"),
  column4 = c("1", "1", "1", "1", "1", "1", "2", "2", "2", "2", "2", "2")
)
unq_date <- c(20121101L,
20121102L, 20121103L, 20121104L, 20121105L, 20121106L, 20121107L,
20121108L, 20121109L, 20121110L, 20121111L, 20121112L, 20121113L,
20121114L, 20121115L, 20121116L, 20121117L, 20121118L, 20121119L,
20121120L, 20121121L, 20121122L, 20121123L, 20121124L, 20121125L,
20121126L, 20121127L, 20121128L, 20121129L, 20121130L
)
index <- as.numeric(dt$column4)
numberOfRepititions <- 2
set.seed(131107)
rand.mat <- replicate(numberOfRepititions, sample(unq_date, numberOfRepititions))
i <- 1
while (i <= numberOfRepititions)
{
  dt$column2 = ifelse(dt$datecolumn %in% c(rand.mat[,i]) & dt$column4==index[i], NA, dt$column2)
  i <- i + 1
}
Notice that we won't be able to run the loop more than 2 times now, unless dt grows in rows so that we have the initial 100 types of column4 (which is just an integer value 1-100).
Here is one proposal based on your small example dataset. I tried to vectorize the operations. As in your example, numberOfRepititions represents the number of loop runs.
First, create matrices for all necessary evaluations. dt$datecolumn is compared with all columns of rand.mat:
rmat <- apply(rand.mat[, seq(numberOfRepititions)], 2, "%in%", x = dt$datecolumn)
Here, dt$column4 is compared with all values of the vector index:
imat <- sapply(head(index, numberOfRepititions), "==", dt$column4)
Both matrices are combined with a logical AND. Afterwards, we calculate whether there is at least one TRUE per row:
replace_idx <- rowSums(rmat & imat) != 0
Use the created index to replace corresponding values with NA:
is.na(dt$column2) <- replace_idx
Done.
The code in one chunk:
rmat <- apply(rand.mat[, seq(numberOfRepititions)], 2, "%in%", x = dt$datecolumn)
imat <- sapply(head(index, numberOfRepititions), "==", dt$column4)
replace_idx <- rowSums(rmat & imat) != 0
is.na(dt$column2) <- replace_idx
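For completeness, here is a sketch of a data.table-native alternative on the same example objects (the as.character conversions are my assumption, based on the sample data where datecolumn and column4 are character columns): collect every (datecolumn, column4) pair that should be blanked into one lookup table, then update dt by reference in a single join instead of rewriting the full column once per iteration.
library(data.table)
# every (date, column4) pair to blank out, collected once
blank <- rbindlist(lapply(seq_len(numberOfRepititions), function(i)
  data.table(datecolumn = as.character(rand.mat[, i]),
             column4    = as.character(index[i]))))
# single update-by-reference join; NA_character_ matches column2's type
dt[blank, on = .(datecolumn, column4), column2 := NA_character_]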
I think you can do it in 1 line like this:
dt[which(apply(dt, 1, function(x) x[1] %in% rand.mat[,as.numeric(x[4])])),]$column3<-NA
Basically, the apply call works as follows, by argument:
1) uses the data from "dt"
2) "1" means apply by row
3) the function receives each row as 'x' and returns TRUE if your criteria are met (see the sanity check below)
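To see the intermediate logical vector this builds on the example data above (a quick sanity check; TRUE marks the rows whose column3 would be set to NA):
# apply coerces each row to character, hence the as.numeric() around x[4]
apply(dt, 1, function(x) x[1] %in% rand.mat[, as.numeric(x[4])])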
Related
I am trying to subset 1000 loci from a vcf or genind file; however, I am unable to do so. Is there a way to subset 1000 randomly selected loci from a vcf or genind file in R? An example of my code is below...
Load libraries and Get Data
library(OutFLANK)
library(SNPRelate)
library(dartR)
library(vcfR)
library(poppr)
library(hierfstat)
library(reshape2)
data(vcfR_example) #get the data
vcfd = vcfR2genind(vcf) #convert vcf file to genind
vcfd <- vcfd[,c(sample(1:5083, 1000, replace=T))] #subset 1000 loci ???
pop(vcfd) <- as.factor(c("5", "5", "7", "7", "7", "7", "7", "7", "8",
"8", "8", "8", "8", "8", "8", "9", "9", "9")) #assign populations
DOES NOT WORK
basic_vcfd = basic.stats(vcfd, diploid = TRUE) #compute basic.stats
Error in rep(lab, vec) : invalid 'times' argument
Check subsetted Genind object for loci number
/// GENIND OBJECT /////////
// 18 individuals; 836 loci; 1,000 alleles; size: 480 Kb
// Basic content
#tab: 18 x 1000 matrix of allele counts
#loc.n.all: number of alleles per locus (range: 1-4)
#loc.fac: locus factor for the 1000 columns of #tab
#all.names: list of allele names for each locus
#ploidy: ploidy of each individual (range: 2-2)
#type: codom
#call: .local(x = x, i = i, j = j, drop = drop)
// Optional content
- empty -
This code does not work. You can see above that there are 836 loci, but I need 1000. I need the 1000 loci to calculate the basic.stats function. Looking for a solution.
This may be what you're looking for:
locs = locNames(vcfd)[1:1000]
new_vcfd = vcfd[loc = locs]
It does indeed return a genind object with exactly 1,000 loci.
Alternatively, sample 1,000 loci at random, without replacement:
vcfd[loc=sample(nLoc(vcfd), 1000, replace=F)]
This worked!
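For reference, two things kept the original attempt from returning 1000 loci: column-indexing a genind object subsets the allele columns of @tab rather than loci (here each locus has 1-4 alleles), and sample(1:5083, 1000, replace=T) draws with replacement, so the 1000 draws can contain duplicates. A quick illustration of the sampling issue:
set.seed(42)  # any seed; the exact count varies
length(unique(sample(1:5083, 1000, replace = TRUE)))
# around 900 on average, i.e. fewer than 1000 distinct indices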
I am working with the R programming language.
I have the following data set (route_1):
route_1
id long lat
1 1 -74.56048 40.07051
3 3 -72.44129 41.71506
4 4 -77.53908 41.55434
2 2 -74.23018 40.12929
6 6 -78.68685 42.35981
5 5 -79.26506 43.22408
Based on this data, I want to make a directed network graph in which each row is only linked to the row that comes right after. Using the "igraph" library, I was able to do this manually:
library(igraph)
my_data <- data.frame(
"node_a" = c("1", "3", "4", "2", "6"),
"node_b" = c("3", "4", "2", "6", "5")
)
graph <- graph.data.frame(my_data, directed=TRUE)
graph <- simplify(graph)
plot(graph)
My Question: Is it possible to make this above network graph directly using the "route_1" dataset, and without manually creating a new data set that contains information on which node is connected to what node?
Thanks!
Is the dataset always going to be ordered correctly, so the plot will go from row 1->2->3 etc. in a single line? If so, we can make the node info data frame by simply subsetting the ID column. If we put the steps in a function, it becomes a simple one-liner:
plot_nodes <- function(x) {
  id <- x$id
  a  <- head(id, -1)  # every node except the last
  b  <- tail(id, -1)  # every node except the first
  graph.data.frame(data.frame(a, b), directed = TRUE)
}
graph <- plot_nodes(route_1)
plot(simplify(graph))
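If you prefer not to wrap the steps in a function, the same pairwise edge list can be built inline (a sketch, assuming route_1 is ordered as shown above):
edges <- data.frame(from = head(route_1$id, -1),  # every node except the last
                    to   = tail(route_1$id, -1))  # every node except the first
graph <- graph.data.frame(edges, directed = TRUE)
plot(simplify(graph))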
I don't understand spatial data at all. I have been studying, but I'm missing something.
What I have: data.frame enterprises with the columns: id, parent_subsidiary, city_cod.
What I need: the mean and the max distance from the parent's city to the subsidiary cities.
Ex:
id   | mean_dist | max_dist
1111 | 25km      | 50km
232  | 110km     | 180km
333  | 0km       | 0km
What I did:
library("tidyverse")
library("sf")
# library("brazilmaps") not working anymore
library("geobr")
parent <- enterprises %>% filter(parent_subsidiary==1)
subsidiary <- enterprises %>% filter(parent_subsidiary==2)
# Cities - polygons
m_city_br <- read_municipality(code_muni="all", year=2019)
# or shp_city<- st_read("/BR_Municipios_2019.shp")
# data.frame with the column geom
map_parent <- left_join(parent, m_city_br, by=c("city_cod"="code_muni"))
map_subsidiary <- left_join(subsidiary, m_city_br, by=c("city_cod"="code_muni"))
st_distance(map_parent$geom[1],map_subsidiary$geom[2]) %>% units::set_units(km)
# it took a long time and the result is different from google.maps
# is it ok?!
# To do by ID -- I'm also stuck here
distance_p_s <- data.frame(id=as.numeric(), subsidiary=as.numeric(),
                           mean_dist=as.numeric(), max_dist=as.numeric())
id_v <- as.vector(parent$id)
for (i in 1:length(id_v)){
  test_p <- map_parent %>% filter(id==id_v[i])
  test_s <- map_subsidiary %>% filter(id==id_v[i])
  total <- 0
  value <- 0
  max <- 0
  l <- nrow(test_s)
  for (j in 1:l){
    value <- as.numeric(round(st_distance(test_p$geom[1], test_s$geom[j]) %>% units::set_units(km), 2))
    total <- total + value
    if (value > max) max <- value
  }
  mean_dist <- total/l
  done <- data.frame(id=id_v[i], subsidiary=l, mean_dist=round(mean_dist,2), max_dist=max)
  distance_p_s <- rbind(distance_p_s, done)
  rm(done)
}
Is it right?
Can I calculate the centroid of the cities and then calculate the distance?
I noticed that the distance from code_muni==4111407 to code_muni==4110102 is 0, even though they are different cities (Imbituva, PR, Brasil and Ivaí, PR, Brasil). Why?
Data example: structure(list(id = c("1111", "1111", "1111", "1111", "232", "232", "232", "232", "3123", "3123", "4455", "4455", "686", "333", "333", "14112", "14112", "14112", "3633", "3633"), parent_subsidiary = c("1","2", "2", "2", "1", "2", "2", "2", "1", "2", "1", "2", "1", "2", "1", "1", "2", "2", "1", "2"), city_cod = c(4305801L,4202404L, 4314803L, 4314902L, 4318705L, 1303403L, 4304507L, 4314100L, 2408102L, 3144409L, 5208707L, 4205407L, 5210000L, 3203908L, 3518800L, 3118601L, 4217303L, 3118601L, 5003702L, 5205109L)), row.names = c(NA, 20L), class = "data.frame")
PS: these are Brazilian cities.
https://github.com/ipeaGIT/geobr/tree/master/r-package
Great problem. I looked at it for a little while, then came back and looked some more after thinking about it. The mean was not calculated; only the distances were determined from each parent to its subsidiaries.
The city data and the enterprise data were joined, and the new df was then mutated to add the centroid of each city.
The df was split by id, which resulted in a list of 8 df's, each containing a separate parent with its related subsidiaries (1:4, 1:3, 1:4, 1:2, ....).
A loop with a function cleaned up the 8 df's and calculated the distance from each parent to each subsidiary.
I checked the distances of the first df in the list against a website's values; they were nearly identical.
The output is shown at [link]
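A sketch of that approach, using the object names from the question (enterprises and m_city_br); this is an outline under those assumptions, not the exact code behind the linked output:
library(sf)
library(dplyr)
# join city polygons to the enterprises, then reduce each city to its centroid
# (st_centroid warns on lon/lat data; projecting first is more precise)
ent_sf <- enterprises %>%
  left_join(m_city_br, by = c("city_cod" = "code_muni")) %>%
  st_as_sf() %>%
  st_centroid()
# one small data frame per enterprise id
by_id <- split(ent_sf, ent_sf$id)
# distances from each parent (parent_subsidiary == "1") to its subsidiaries
dists <- lapply(by_id, function(d) {
  p <- d[d$parent_subsidiary == "1", ]
  s <- d[d$parent_subsidiary == "2", ]
  if (nrow(p) == 0 || nrow(s) == 0) return(NULL)
  units::set_units(st_distance(p, s), km)
})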
I did something like this:
distance_p_s <- data.frame(id=as.character(),
                           qtd_subsidiary=as.numeric(),
                           dist_min=as.numeric(),
                           dist_media=as.numeric(),
                           dist_max=as.numeric())
# named ids (not id) so it doesn't collide with the id column inside filter()
ids <- as.vector(mparentid$id)
for (i in 1:length(ids)){
  print(paste0("Filtering id: ", ids[i], " (", i, " of ", length(ids), ")"))
  teste_m <- mparentid %>% filter(id==ids[i]) %>% st_as_sf()
  teste_f <- msubsidiaryid %>% filter(id==ids[i]) %>% st_as_sf()
  teste_f <- st_centroid(teste_f)
  teste_m <- st_centroid(teste_m)
  teste_f <- st_transform(teste_f, 4674)
  teste_m <- st_transform(teste_m, 4674)
  total <- 0
  value <- 0
  min <- 0
  max <- 0
  l <- nrow(teste_f)
  for (j in 1:l){
    print(paste0("Processing id: ", ids[i], " (", i, " of ", length(ids), "), subsidiary: ", j, " of ", l))
    value <- as.numeric(round(st_distance(teste_m$geom[1], teste_f$geom[j]) %>% units::set_units(km), 2))
    total <- total + value
    if (value > max) max <- value
    if (j == 1 || value < min) min <- value
  }
  dist_med <- total/l
  done <- data.frame(id=ids[i], qtd_subsidiary=l, dist_min=min, dist_media=round(dist_med,2), dist_max=max)
  distance_p_s <- rbind(distance_p_s, done)
  print(paste0("Finished id: ", ids[i], " (", i, " of ", length(ids), "), subsidiary: ", j, " of ", l))
  rm(done)
}
Probably this is not the best way, but it solved my problem for now.
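One possible simplification (a sketch, reusing the teste_m and teste_f objects from the loop above): st_distance accepts whole geometry sets, so the inner j loop and the running min/mean/max bookkeeping can be replaced by a single call per id:
# all parent-to-subsidiary distances at once, in km
d <- as.numeric(units::set_units(st_distance(teste_m[1, ], teste_f), km))
done <- data.frame(id = ids[i], qtd_subsidiary = nrow(teste_f),
                   dist_min = round(min(d), 2), dist_media = round(mean(d), 2),
                   dist_max = round(max(d), 2))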
I believe this is fairly simple, although I am new to using R and code. I have a dataset which has a single row for each rodent trap site. There were, however, 8 occasions of trapping over 4 years. What I wish to do is to expand the trap site data and append a number 1 to 8 to each row.
Then I can then label them with the trap visit for a subsequent join with the obtained trap data.
I have managed to replicate the rows with the following code. And while the rows are expanded in the data frame to 1, 1.1...1.7, 2, 2.1...2.7 etc., I cannot figure out how to convert this into a usable column-based ID.
structure(list(TrapCode = c("IA1sA", "IA2sA", "IA3sA", "IA4sA",
"IA5sA"), Y = c(-12.1355987315, -12.1356879776, -12.1357664998,
-12.1358823313, -12.1359720852), X = c(-69.1335789865, -69.1335225279,
-69.1334668485, -69.1333847769, -69.1333226532)), row.names = c(NA,
5L), class = "data.frame")
gps_1 <- gps_1[rep(seq_len(nrow(gps_1)), 3), ]
gives
"IA5sA", "IA1sA", "IA2sA", "IA3sA", "IA4sA", "IA5sA", "IA1sA",
"IA2sA", "IA3sA", "IA4sA", "IA5sA"), Y = c(-12.1355987315, -12.1356879776,
-12.1357664998, -12.1358823313, -12.1359720852, -12.1355987315,
-12.1356879776, -12.1357664998, -12.1358823313, -12.1359720852,
-12.1355987315, -12.1356879776, -12.1357664998, -12.1358823313,
-12.1359720852), X = c(-69.1335789865, -69.1335225279, -69.1334668485,
-69.1333847769, -69.1333226532, -69.1335789865, -69.1335225279,
-69.1334668485, -69.1333847769, -69.1333226532, -69.1335789865,
-69.1335225279, -69.1334668485, -69.1333847769, -69.1333226532
)), row.names = c("1", "2", "3", "4", "5", "1.1", "2.1", "3.1",
"4.1", "5.1", "1.2", "2.2", "3.2", "4.2", "5.2"), class = "data.frame")
I have a column with Trap_ID currently being a unique identifier. I hope that after the replication I could append an iteration number to this to keep it as a unique ID.
For example:
Trap_ID
IA1sA.1
IA1sA.2
IA1sA.3
IA2sA.1
IA2sA.2
IA2sA.3
Simply use a cross join (i.e., a join with no by columns, returning the Cartesian product of both sets):
mdf <- merge(data.frame(Trap_ID = 1:8), trap_side_df, by=NULL)
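From there, the combined unique identifier from the question (e.g. IA1sA.1) can be pasted together; a sketch, with column names taken from the dput and the merge above:
# one row per site-visit combination, e.g. "IA1sA.1", "IA1sA.2", ...
mdf$Trap_ID <- paste(mdf$TrapCode, mdf$Trap_ID, sep = ".")
head(mdf[order(mdf$Trap_ID), "Trap_ID"])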
I want to classify my data by the minimum distance to known centers. How can I implement this in R?
The centers data:
> centers
X
1 -0.78998176
2 2.40331380
3 0.77320007
4 -1.64054294
5 -0.05343331
6 -1.14982180
7 1.67658736
8 -0.44575567
9 0.36314671
10 1.18697840
The data to be classified:
> Y
[1] -0.7071068 0.7071068 -0.3011463 -0.9128686 -0.5713978 NA
The result I expect:
1. Find the closest center (the minimum absolute difference) for each item in Y.
2. Assign the sequence number of that center to each item in Y.
Expected result:
> Y
[1] 1 3 8 1 8 NA
Y <- c(-0.707106781186548, 0.707106781186548, -0.301146296962689,
-0.912868615826101, -0.571397763410073, NA)
centers <- structure(c(-0.789981758587318, 2.40331380121291, 0.773200070034431,
-1.64054294268215, -0.0534333085941505, -1.14982180092619, 1.67658736336158,
-0.445755672120908, 0.363146708827924, 1.18697840480949), .Dim = c(10L,
1L), .Dimnames = list(c("1", "2", "3", "4", "5", "6", "7", "8",
"9", "10"), "X"))
sapply(Y, function(y) {r=which.min(abs(y-centers)); ifelse(is.na(y), NA, r)})
Essentially, you are applying which.min to each element of Y and determining which center has the smallest absolute distance. Ties go to the earlier element in the list. NA values need to be handled separately, which is why there is a second statement with the ifelse.
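A vectorized alternative that avoids the per-element function call (a sketch, using the Y and centers objects defined above):
# 6 x 10 matrix of absolute distances: one row per element of Y, one column per center
d <- abs(outer(Y, as.vector(centers), "-"))
apply(d, 1, function(r) if (all(is.na(r))) NA else which.min(r))
# [1] 1 3 8 1 8 NA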
This is not clustering, but nearest-neighbor classification. See the knn function.
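For example, with the knn function from the class package (a sketch: the centers serve as the training set, their row names as class labels, and NA inputs are handled separately):
library(class)
ok  <- !is.na(Y)
cls <- rep(NA_integer_, length(Y))
cls[ok] <- as.integer(as.character(
  knn(train = centers, test = matrix(Y[ok], ncol = 1),
      cl = rownames(centers), k = 1)
))
cls
# [1] 1 3 8 1 8 NA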