Replicate and append to dataframe in R

I believe this is fairly simple, although I am new to R and to writing code. I have a dataset which has a single row for each rodent trap site. There were, however, 8 occasions of trapping over 4 years. What I wish to do is expand the trap site data and append a number from 1 to 8 to each row.
I can then label each row with the trap visit for a subsequent join with the obtained trap data.
I have managed to replicate the rows with the following code, and while the row names in the expanded data frame become 1, 1.1...1.7, 2, 2.1...2.7, etc., I cannot figure out how to convert this into a usable ID column.
structure(list(TrapCode = c("IA1sA", "IA2sA", "IA3sA", "IA4sA",
"IA5sA"), Y = c(-12.1355987315, -12.1356879776, -12.1357664998,
-12.1358823313, -12.1359720852), X = c(-69.1335789865, -69.1335225279,
-69.1334668485, -69.1333847769, -69.1333226532)), row.names = c(NA,
5L), class = "data.frame")
gps_1 <- gps_1[rep(seq_len(nrow(gps_1)), 3), ]
gives
"IA5sA", "IA1sA", "IA2sA", "IA3sA", "IA4sA", "IA5sA", "IA1sA",
"IA2sA", "IA3sA", "IA4sA", "IA5sA"), Y = c(-12.1355987315, -12.1356879776,
-12.1357664998, -12.1358823313, -12.1359720852, -12.1355987315,
-12.1356879776, -12.1357664998, -12.1358823313, -12.1359720852,
-12.1355987315, -12.1356879776, -12.1357664998, -12.1358823313,
-12.1359720852), X = c(-69.1335789865, -69.1335225279, -69.1334668485,
-69.1333847769, -69.1333226532, -69.1335789865, -69.1335225279,
-69.1334668485, -69.1333847769, -69.1333226532, -69.1335789865,
-69.1335225279, -69.1334668485, -69.1333847769, -69.1333226532
)), row.names = c("1", "2", "3", "4", "5", "1.1", "2.1", "3.1",
"4.1", "5.1", "1.2", "2.2", "3.2", "4.2", "5.2"), class = "data.frame")
I have a Trap_ID column that is currently a unique identifier. I hope that after the replication I can append an iteration number to it to keep it unique.
For example:
Trap_ID
IA1sA.1
IA1sA.2
IA1sA.3
IA2sA.1
IA2sA.2
IA2sA.3

Simply use a cross join (i.e., a join with no by columns, returning the Cartesian product of both sets):
mdf <- merge(data.frame(Trap_ID = 1:8), trap_side_df, by = NULL)
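To get the combined site-by-visit identifier shown above, paste the visit number onto the trap code after the cross join. A minimal sketch, assuming the site data is the gps_1 frame from the question (the visit column and gps_long name are introduced here for illustration):
# cross join: every trap site paired with visits 1 to 8
gps_long <- merge(data.frame(visit = 1:8), gps_1, by = NULL)
# build the unique per-visit ID, e.g. "IA1sA.1", "IA1sA.2", ...
gps_long$Trap_ID <- paste(gps_long$TrapCode, gps_long$visit, sep = ".")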

Related

Hierarchy across rows for the same id

So, I have a data set with a lot of observations for X individuals, with multiple rows for some individuals. For each row, I have assigned a classification (the variable clinical_significance) that takes three values in prioritized order: definite disease, possible, colonization. Now, I would like to have only one row for each individual, carrying the "highest" classification across that individual's rows, i.e. definite disease if present, then possible, then colonization. Any good suggestions on how to achieve this?
For instance, as seen in the example, I would like all of ID 23's clinical_significance values to be 'definite disease', as this outranks 'possible'.
id  id_row  number_of_samples  species_ny    clinical_significance
18  1       2                  MAC           possible
18  2       2                  MAC           possible
20  1       2                  scrofulaceum  possible
20  2       2                  scrofulaceum  possible
23  1       2                  MAC           possible
23  2       2                  MAC           definite disease
Making a reproducible example:
df <- structure(
  list(
    id = c("18", "18", "20", "20", "23", "23"),
    id_row = c("1", "2", "1", "2", "1", "2"),
    number_of_samples = c("2", "2", "2", "2", "2", "2"),
    species_ny = c("MAC", "MAC", "scrofulaceum", "scrofulaceum", "MAC", "MAC"),
    clinical_significance = c("possible", "possible", "possible", "possible", "possible", "definite disease")
  ),
  row.names = c(NA, -6L), class = "data.frame"
)
The idea is to turn clinical_significance into a factor, which is stored as an integer rather than a character (i.e. 1 = definite disease, 2 = possible, 3 = colonization). Then, for each id, take the row with the lowest level:
library(dplyr)

df_prio <- df |>
  mutate(
    fct_clin_sig = factor(
      clinical_significance,
      levels = c("definite disease", "possible", "colonization")
    )
  ) |>
  group_by(id) |>
  slice_min(fct_clin_sig, with_ties = FALSE)  # with_ties = FALSE keeps a single row per id even when levels tie
I fixed it using
df <- df %>%
  group_by(id) %>%
  mutate(clinical_significance_new = ifelse(any(clinical_significance == "definite disease"),
                                            "definite disease",
                                            as.character(clinical_significance)))
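Note that this recodes the column but keeps every row. If a single row per individual is still wanted, a small follow-up sketch (df_one is a name introduced here; it assumes the recode above has run):
df_one <- df %>%
  group_by(id) %>%
  slice(1) %>%   # keep the first row per id; with the recode, ids containing "definite disease" are already uniform
  ungroup()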

Subsetting 1000 random loci from vcf/genind file in R

I am trying to subset 1000 loci from a vcf or genind file; however, I am unable to do so. Is there a way to subset 1000 randomly selected loci from a vcf or genind file in R? An example of my code is below.
Load libraries and Get Data
library(OutFLANK)
library(SNPRelate)
library(dartR)
library(vcfR)
library(poppr)
library(hierfstat)
library(reshape2)
data(vcfR_example) #get the data
vcfd = vcfR2genind(vcf) #convert vcf file to genind
vcfd <- vcfd[,c(sample(1:5083, 1000, replace=T))] #subset 1000 loci ???
pop(vcfd) <- as.factor(c("5", "5", "7", "7", "7", "7", "7", "7", "8",
"8", "8", "8", "8", "8", "8", "9", "9", "9")) #assign populations
DOES NOT WORK
basic_vcfd = basic.stats(vcfd, diploid = TRUE) #compute basic.stats
Error in rep(lab, vec) : invalid 'times' argument
Check subsetted Genind object for loci number
/// GENIND OBJECT /////////
// 18 individuals; 836 loci; 1,000 alleles; size: 480 Kb
// Basic content
@tab:  18 x 1000 matrix of allele counts
@loc.n.all: number of alleles per locus (range: 1-4)
@loc.fac: locus factor for the 1000 columns of @tab
@all.names: list of allele names for each locus
@ploidy: ploidy of each individual (range: 2-2)
@type: codom
@call: .local(x = x, i = i, j = j, drop = drop)
// Optional content
- empty -
This code does not work. You can see above that there are 836 loci, but I need 1000: subsetting the columns of a genind object selects alleles rather than loci, which is why only 836 loci remain. I need the 1000 loci to run the basic.stats function. Looking for a solution.
This may be what you're looking for:
locs <- locNames(vcfd)[1:1000]
new_vcfd <- vcfd[loc = locs]
It does indeed return a genind object with exactly 1,000 loci. Note that this takes the first 1,000 loci; for a random selection instead:
vcfd[loc = sample(nLoc(vcfd), 1000, replace = FALSE)]
This worked!
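Putting the pieces together, a minimal sketch of the subset-then-analyse step (assuming vcfd is the genind object from above; set.seed and the keep/vcfd_sub names are introduced here for illustration):
set.seed(1)                              # make the random draw reproducible
keep <- sample(locNames(vcfd), 1000)     # 1000 locus names, sampled without replacement
vcfd_sub <- vcfd[loc = keep]
basic_vcfd <- basic.stats(vcfd_sub, diploid = TRUE)  # as in the question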

Generate dummy according to quanteda dictionary lookup

I'm using quanteda to create dictionaries and look up terms.
Here is a reproducible example of my data:
dput(tweets[1:4, ])
structure(list(tweet_id = c("174457180812_10156824364270813",
"174457180812_10156824136360813", "174457180812_10156823535820813",
"174457180812_10156823868565813"), tweet_message = c("Climate change is a big issue",
"We should care about the environment", "Let's rethink environmental policies",
"#Davos WEF"
), date = c("2019-03-25T23:03:56+0000", "2019-03-25T21:10:36+0000",
"2019-03-25T21:00:03+0000", "2019-03-25T20:00:03+0000"), group = c("1",
"2", "3", "4")), row.names = c(NA, -4L), class = c("tbl_df",
"tbl", "data.frame"))
Here is how I use my dictionary following a suggestion I got from this forum:
climate_corpus <- corpus(tweets, text_field = "tweet_message")
climatechange_dict <-
dictionary(list(climate = c("environment*", "climate change")))
groupeddfm <- tokens(climate_corpus) %>%
tokens_lookup(dictionary = climatechange_dict) %>%
dfm(groups = "group")
convert(groupeddfm, to = "data.frame")
What I need to do is create a dummy in my original dataset "tweets" equal to 1 when tokens_lookup identifies a word from my dictionary in a given observation (tweet). Using my reproducible example, I would like to generate a dummy equal to 1 for the first three observations (they include dictionary words) and equal to 0 for the fourth one (no dictionary words).
I would really appreciate your help on this.
Many thanks!
library("quanteda")
## Package version: 2.0.1
tweets <- structure(
list(tweet_id = c(
"174457180812_10156824364270813",
"174457180812_10156824136360813", "174457180812_10156823535820813",
"174457180812_10156823868565813"
), tweet_message = c(
"Climate change is a big issue",
"We should care about the environment", "Let's rethink environmental policies",
"#Davos WEF"
), date = c(
"2019-03-25T23:03:56+0000", "2019-03-25T21:10:36+0000",
"2019-03-25T21:00:03+0000", "2019-03-25T20:00:03+0000"
), group = c(
"1",
"2", "3", "4"
)),
row.names = c(NA, -4L), class = c(
"tbl_df",
"tbl", "data.frame"
)
)
climate_corpus <- corpus(tweets, text_field = "tweet_message")
climatechange_dict <-
dictionary(list(climate = c("environment*", "climate change")))
groupeddfm <- tokens(climate_corpus) %>%
tokens_lookup(dictionary = climatechange_dict) %>%
dfm(groups = "group")
tweets$mentions_climate <- as.logical(groupeddfm[, "climate"])
tweets
## # A tibble: 4 x 5
## tweet_id tweet_message date group mentions_climate
## <chr> <chr> <chr> <chr> <lgl>
## 1 174457180812_1015… Climate change is a b… 2019-03-25T2… 1 TRUE
## 2 174457180812_1015… We should care about … 2019-03-25T2… 2 TRUE
## 3 174457180812_1015… Let's rethink environ… 2019-03-25T2… 3 TRUE
## 4 174457180812_1015… #Davos WEF 2019-03-25T2… 4 FALSE
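If a numeric 0/1 dummy is preferred over the logical TRUE/FALSE above, one further line converts it (mentions_climate_dummy is a name introduced here for illustration):
tweets$mentions_climate_dummy <- as.integer(tweets$mentions_climate)  # TRUE/FALSE -> 1/0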

Conditionally remove items from a list in R

I know there are many similar questions about removing items from a list, but I've been unable to solve my problem in particular - and I appreciate the help.
Simply put, I'd like to remove any entry (row) that has a value greater than -74.
list(structure(c(40.7571907043457, 40.7601699829102, 40.761848449707,
40.7660789489746, -73.9972381591797, -74.0038146972656, -74.0072479248047,
-74.0172576904297), .Dim = c(4L, 2L), .Dimnames = list(c("1",
"2", "3", "4"), c("lat", "lon"))), structure(c(40.7582893371582,
40.760498046875, 40.7620582580566, 40.7662887573242, -73.9975280761719,
-74.0031967163086, -74.0070190429688, -74.0170593261719), .Dim = c(4L,
2L), .Dimnames = list(c("1", "2", "3", "4"), c("lat", "lon"))))
Thanks so much.
If you only need to look at the lon column with the negative values, then simply:
lapply(your_list, function(i) i[i[, 2] <= -74, , drop = FALSE])
In case you want to check both columns:
lapply(your_list, function(i) i[rowSums(i <= -74) > 0, , drop = FALSE])
(drop = FALSE keeps single-row results as matrices rather than collapsing them to vectors.)
Both give the same result,
[[1]]
lat lon
2 40.76017 -74.00381
3 40.76185 -74.00725
4 40.76608 -74.01726
[[2]]
lat lon
2 40.76050 -74.00320
3 40.76206 -74.00702
4 40.76629 -74.01706
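If the filter can leave some list entries with zero rows and those should be dropped as well, a small follow-up sketch (the filtered name is introduced here):
filtered <- lapply(your_list, function(i) i[i[, 2] <= -74, , drop = FALSE])
filtered <- Filter(function(m) nrow(m) > 0, filtered)  # remove entries left empty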

R - Extremely slow code

I'm new to R and I'm stuck with a problem I can't solve by myself.
A friend recommended one of the apply functions; I just don't get how to use it in this case. Anyway, on to the problem! =)
Inside the inner while loop, I have an ifelse. That is the bottleneck: it takes on average 1 second per iteration. The slow part is marked with #slow part start/end in the code.
Given that it will run 2000*100 = 200,000 times, it will take approximately 55.5 hours to finish each time we run this code. The bigger problem is that this code will be reused a lot, so x*55.5 hours is just not doable.
Below is a fraction of the code relevant to the question
#dt is data.table with close to 1.5million observations of 11 variables
#rand.mat is a 110*100 integer matrix
j <- 1
while(j <= 2000)
{
#other code is executed here, not relevant to the question
i <- 1
while(i <= 100)
{
#slow part start
dt$column2 = ifelse(dt$datecolumn %in% c(rand.mat[,i]) & dt$column4==index[i], NA, dt$column2)
#slow part end
i <- i + 1
}
#other code is executed here, not relevant to the question
j <- j + 1
}
Please, any advice would be greatly appreciated.
EDIT - Run below code to reproduce problem
library(data.table)
dt = data.table(
  datecolumn = c("20121101", "20121101", "20121104", "20121104", "20121130",
                 "20121130", "20121101", "20121101", "20121104", "20121104",
                 "20121130", "20121130"),
  column2 = c("5", "3", "4", "6", "8", "9", "2", "4", "3", "5", "6", "8"),
  column3 = c("5", "3", "4", "6", "8", "9", "2", "4", "3", "5", "6", "8"),
  column4 = c("1", "1", "1", "1", "1", "1", "2", "2", "2", "2", "2", "2")
)
unq_date <- c(20121101L,
20121102L, 20121103L, 20121104L, 20121105L, 20121106L, 20121107L,
20121108L, 20121109L, 20121110L, 20121111L, 20121112L, 20121113L,
20121114L, 20121115L, 20121116L, 20121117L, 20121118L, 20121119L,
20121120L, 20121121L, 20121122L, 20121123L, 20121124L, 20121125L,
20121126L, 20121127L, 20121128L, 20121129L, 20121130L
)
index <- as.numeric(dt$column4)
numberOfRepititions <- 2
set.seed(131107)
rand.mat <- replicate(numberOfRepititions, sample(unq_date, numberOfRepititions))
i <- 1
while(i <= numberOfRepititions)
{
dt$column2 = ifelse(dt$datecolumn %in% c(rand.mat[,i]) & dt$column4==index[i], NA, dt$column2)
i <- i + 1
}
Notice that we won't be able to run the loop more than 2 times now, unless dt grows in rows so that we have the initial 100 values of column4 (which is just an integer value 1-100).
Here is one proposal, based on your small example dataset. I tried to vectorize the operations. As in your example, numberOfRepititions represents the number of loop runs.
First, create matrices for all the necessary evaluations. dt$datecolumn is compared with all columns of rand.mat:
rmat <- apply(rand.mat[, seq(numberOfRepititions)], 2, "%in%", x = dt$datecolumn)
Here, dt$column4 is compared with the first numberOfRepititions values of the vector index:
imat <- sapply(head(index, numberOfRepititions), "==", dt$column4)
Both matrices are combined with logical AND. Afterwards, we calculate for each row whether there is at least one TRUE:
replace_idx <- rowSums(rmat & imat) != 0
Use the created index to replace corresponding values with NA:
is.na(dt$column2) <- replace_idx
Done.
The code in one chunk:
rmat <- apply(rand.mat[, seq(numberOfRepititions)], 2, "%in%", x = dt$datecolumn)
imat <- sapply(head(index, numberOfRepititions), "==", dt$column4)
replace_idx <- rowSums(rmat & imat) != 0
is.na(dt$column2) <- replace_idx
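Since dt is a data.table, another option is to update the column by reference with :=, which avoids materializing a full ifelse vector on every pass. A sketch under the same setup as the reproduction above (not benchmarked here):
for (k in seq_len(numberOfRepititions)) {
  # := modifies column2 in place, only on the rows matching the filter;
  # %in% coerces types, matching the behaviour of the original ifelse line
  dt[datecolumn %in% rand.mat[, k] & column4 == index[k], column2 := NA_character_]
}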
I think you can do it in one line, like this:
dt[which(apply(dt, 1, function(x) x[1] %in% rand.mat[, as.numeric(x[4])])), ]$column2 <- NA
Basically, the apply call works as follows, by argument:
1) uses the data from "dt"
2) "1" means apply by row
3) the function receives each row as 'x' and returns TRUE if your criteria are met
(Note that apply coerces each row to a character vector here, which is why x[4] needs as.numeric.)
