I have temperature profiler (tp) data with date, depth and temperature. The depth for each date is not exactly the same, so I need to unify it to the same depths and set the temperature for each depth by linear interpolation. I was able to do this with a loop using the 'approx' function (see the first part of the enclosed code). But I know that I should do it better without a loop (considering I will have about 600,000 rows). I tried to do it with the 'by' function but was not successful transforming the result (a list) into a data frame or matrix (see the second part of the code).
Keep in mind that length of the rounded depth is not always the same as in the example.
The rounded depth is in the Depth2 column; the interpolated temperature goes in Temp2.
What is the ‘right’ way to solve this?
# create df manually
tp <- data.frame(Date=double(31), Depth=double(31), Temperature=double(31))
tp$Date[1:11] <- '2009-12-17' ; tp$Date[12:22] <- '2009-12-18'; tp$Date[23:31] <- '2009-12-19'
tp$Depth <- c(24.92,25.50,25.88,26.33,26.92,27.41,27.93,28.37,28.82,29.38,29.92,25.07,25.56,26.06,26.54,27.04,27.53,28.03,28.52,29.02,29.50,30.01,25.05,25.55,26.04,26.53,27.02,27.52,28.01,28.53,29.01)
tp$Temperature <- c(19.08,19.06,19.06,18.87,18.67,17.27,16.53,16.43,16.30,16.26,16.22,17.62,17.43,17.11,16.72,16.38,16.28,16.20,16.15,16.13,16.11,16.08,17.54,17.43,17.32,17.14,16.89,16.53,16.28,16.20,16.13)
# create rounded depth column
tp$Depth2 <- round(tp$Depth)
# loop over dates to calculate the linear approximation at rounded depths
dtgrp <- unique(tp$Date)
for (i in dtgrp) {
  x1 <- tp[tp$Date == i, "Depth"]
  y1 <- tp[tp$Date == i, "Temperature"]
  x2 <- tp[tp$Date == i, "Depth2"]
  tpa <- approx(x=x1, y=y1, xout=x2, rule=2)
  tp[tp$Date == i, "Temp2"] <- tpa$y
}
# reduce result to rounded depth
tp1 <- tp[!duplicated(tp[,-c(2:3)]),-c(2:3)]
# not part of the question, but the end need is for a matrix, so this completes it:
library(reshape2)
tpbydt <- acast(tp1, Date~Depth2, value.var="Temp2")
# second part: I tried to use the by function (instead of a loop) but got lost when trying to convert the result to a data frame or matrix
rdpth <- function(x1, y1, x2) {
  tpa <- approx(x=x1, y=y1, xout=x2, rule=2)
  return(tpa)
}
tp2 <- by(tp, tp$Date,function(tp) rdpth(tp$Depth,tp$Temperature,tp$Depth2), simplify = TRUE)
You are very close with the by call, but remember that it returns a list of objects. Therefore, consider building a list of data frames to be row-bound at the very end:
df_list <- by(tp, tp$Date, function(sub) {
  tpa <- approx(x=sub$Depth, y=sub$Temperature, xout=sub$Depth2, rule=2)
  df <- unique(data.frame(Date = sub$Date,
                          Depth2 = sub$Depth2,
                          Temp2 = tpa$y,
                          stringsAsFactors = FALSE))
  return(df)
})
tp2 <- do.call(rbind, unname(df_list))
tp2
# Date Depth2 Temp2
# 1 2009-12-17 25 19.07724
# 2 2009-12-17 26 19.00933
# 5 2009-12-17 27 18.44143
# 7 2009-12-17 28 16.51409
# 9 2009-12-17 29 16.28714
# 11 2009-12-17 30 16.22000
# 12 2009-12-18 25 17.62000
# 21 2009-12-18 26 17.14840
# 4 2009-12-18 27 16.40720
# 6 2009-12-18 28 16.20480
# 8 2009-12-18 29 16.13080
# 10 2009-12-18 30 16.08059
# 13 2009-12-19 25 17.54000
# 22 2009-12-19 26 17.32898
# 41 2009-12-19 27 16.90020
# 61 2009-12-19 28 16.28510
# 81 2009-12-19 29 16.13146
And if you reset row.names, this is exactly identical to your tp1 output:
identical(data.frame(tp1, row.names = NULL),
data.frame(tp2, row.names = NULL))
# [1] TRUE
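As an aside, with ~600,000 rows a grouped data.table approach may be faster still. A minimal sketch (not benchmarked; tp1_dt and tpbydt2 are new names introduced here, and this assumes the same tp columns as above):
library(data.table)
setDT(tp)
# one approx() per Date group, evaluated only at the unique rounded depths
tp1_dt <- tp[, {
  d2 <- unique(Depth2)
  list(Depth2 = d2,
       Temp2 = approx(x = Depth, y = Temperature, xout = d2, rule = 2)$y)
}, by = Date]
# wide layout, mirroring the acast step; dcast returns a data.table,
# so convert afterwards if a true matrix is required
tpbydt2 <- dcast(tp1_dt, Date ~ Depth2, value.var = "Temp2")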
I have a data.table with approximately 400 columns and 800,000 rows. The columns represent samples and the rows represent CpG sites. Example data here:
require(data.table)
samples <- replicate(200,replicate(1000,runif(1)))
cpgs <- paste0('cpg',1:1000)
n <- c('cpg',paste0('sample',1:200))
data <- data.table(cbind(cpgs,samples))
colnames(data) <- n
I want to run a wilcox.test() on randomly selected columns of this data 1000 times. I've currently implemented this in the following way, but it's very slow for large numbers of permutations.
cases <- paste0('sample',1:10)
controls <- paste0('sample',30:40)
data[,wilcox_p:=wilcox.test( as.numeric(.SD[,mget(cases)]), as.numeric(.SD[,mget(controls)]) )$p.value,by=cpg]
Is there a more efficient way to do this? My complete use case, where getCpGSites() is the function described above, is here:
iterations_vec <- 1:1000
labels <- paste0('sample',1:200)
permutations <- foreach(i = iterations_vec, .combine='rbind', .multicombine = TRUE) %dopar% {
  case_labels <- sample(labels, num_cases, replace=FALSE)
  control_labels <- labels[!labels %in% case_labels]
  signature_cpgs <- getCpGSites(case_labels, control_labels)
  num_signature_cpgs <- length(signature_cpgs)
  out <- data.table('gene' = gene,
                    'iteration' = i,
                    'num_signature_cpgs' = num_signature_cpgs)
  return(out)
}
Here's one approach, based on the tidyverse. First, convert the character sample columns to numeric (leaving the cpg identifiers alone), rather than delegating the conversion to your function.
library(tidyverse)
numericData <- data %>% mutate(across(-cpg, as.numeric))
Now write a function to perform a Wilcoxon test on a randomly selected pair of columns:
randomWilcox <- function(d) {
  cols <- sample(2:ncol(d), size=2, replace=FALSE)
  d1 <- d %>% select(cpg, all_of(cols))
  tibble(
    col1 = cols[1],
    col2 = cols[2],
    p.value = wilcox.test(d1 %>% pull(2), d1 %>% pull(3))$p.value
  )
}
Now use lapply to run the function 1000 times, with a very crude measure of speed:
startTime <- Sys.time()
lapply(1:1000, function(x) numericData %>% randomWilcox) %>% bind_rows()
endTime <- Sys.time()
# A tibble: 1,000 × 3
col1 col2 p.value
<int> <int> <dbl>
1 15 172 0.124
2 26 58 0.202
3 200 60 0.840
4 124 94 0.344
5 180 200 0.723
6 122 155 0.987
7 122 174 0.173
8 83 146 0.921
9 135 95 0.0605
10 168 174 0.0206
# … with 990 more rows
Each row of the output tibble contains the indices of the two columns selected and the p-value obtained from the corresponding wilcox.test.
The time taken is about 13 seconds on my machine. Is that quick enough?
endTime - startTime
Time difference of 13.1156 secs
Edit
Removing the intermediate data frame reduces the time taken to just over nine seconds:
randomWilcox <- function(d) {
  cols <- sample(2:ncol(d), size=2, replace=FALSE)
  tibble(
    col1 = cols[1],
    col2 = cols[2],
    p.value = wilcox.test(d %>% pull(cols[1]), d %>% pull(cols[2]))$p.value
  )
}
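If that is still too slow, a further possibility (not from the original answer) is to drop the tibble machinery inside the loop and work on a plain numeric matrix; randomWilcoxMatrix is a hypothetical helper:
m <- as.matrix(numericData[, -1])    # sample columns only, cpg dropped
randomWilcoxMatrix <- function(m) {
  cols <- sample(ncol(m), size = 2)  # two distinct column indices
  data.frame(col1 = cols[1],
             col2 = cols[2],
             p.value = wilcox.test(m[, cols[1]], m[, cols[2]])$p.value)
}
res <- do.call(rbind, lapply(1:1000, function(i) randomWilcoxMatrix(m)))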
We have a set of 50 csv files from participants, currently being read into a list as
library(fs)
library(readr)  # read_csv
library(purrr)  # set_names

file_paths <- fs::dir_ls("data")
file_contents <- list()
for (i in seq_along(file_paths)) {
  file_contents[[i]] <- read_csv(file = file_paths[[i]])
}
dt <- set_names(file_contents, file_paths)
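(As an aside, the same read can be written more compactly with purrr; a minimal sketch, assuming the same data directory:)
dt <- file_paths %>%
  set_names() %>%  # name each element by its file path
  map(read_csv)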
My data looks like this:
level time X Y Type
1 1 355. -10.6 22.36 P
1 1 371. -33 24.85 O
1 2 389. -10.58 17.23 P
1 2 402. -16.7 30.46 O
1 3 419. -29.41 17.32 P
1 4 429. -10.28 26.36 O
2 5 438. -26.86 32.98 P
2 6 451. -21 17.06 O
2 7 463. -21 32.98 P
2 8 474. -19.9 17.06 O
We have 70 sets of coordinates per csv.
Time does not matter for this, but I would like to split up by the level column at some stage.
For every 'P' I want to compare it to 'O' and get the distance between the coordinates. The first P will always match with the first O, and so on.
For now, I have them split into two different lists, though this may be the completely wrong way to do it! I'm having trouble figuring out how to take all of these csv files and get the distances for all of them; the list seems to cause issues with most functions (like dist).
Here is how I've pulled out the relevant information so far (column names as in the sample data above):
pLoc <- list(); oLoc <- list(); pX <- list(); pY <- list(); pCoordinates <- list()
for (i in seq_along(dt)) {
  pLoc[[i]] <- dplyr::filter(dt[[i]], grepl("P", Type))
  oLoc[[i]] <- dplyr::filter(dt[[i]], grepl("O", Type))
  pX[[i]] <- pLoc[[i]] %>% pull(X) %>% as.numeric()
  pY[[i]] <- pLoc[[i]] %>% pull(Y) %>% as.numeric()
  pCoordinates[[i]] <- cbind(pX[[i]], pY[[i]])
}
[EDITED] Following comments, here is how you can do it with the raster library:
library(raster)
library(dplyr)
df = data.frame(
x = c(10, 20 ,15,9),
y = c(45,34,54,24),
type = c("P","O","P","O")
)
df = cbind(df[df$type=="P",] %>%
dplyr::select(-type) %>%
dplyr::rename(xP = x,
yP = y),
df[df$type=="O",] %>%
dplyr::select(-type) %>%
dplyr::rename(xO = x,
yO = y))
The following could probably be achieved more efficiently with some form of the apply() function:
v = c()
for(i in 1:nrow(df)){
  dist = raster::pointDistance(lonlat = F,
                               p1 = c(df$xP[i], df$yP[i]),
                               p2 = c(df$xO[i], df$yO[i]))
  v = c(v, dist)
}
df$dist = v
print(df)
xP yP xO yO dist
1 10 45 20 34 14.86607
3 15 54 9 24 30.59412
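As hinted above, the explicit loop is avoidable: pointDistance also accepts matrices of points, and the per-file work can then be wrapped in lapply over the list of csv data frames. A sketch (dist_one is a hypothetical helper; it assumes equal numbers of P and O rows per file, matched in order, with the columns shown in the question):
# vectorized over the reshaped df built above
df$dist <- raster::pointDistance(p1 = as.matrix(df[, c("xP", "yP")]),
                                 p2 = as.matrix(df[, c("xO", "yO")]),
                                 lonlat = FALSE)

# applied across the list of csv data frames from the question
dist_one <- function(d) {
  p <- d[d$Type == "P", c("X", "Y")]
  o <- d[d$Type == "O", c("X", "Y")]
  raster::pointDistance(p1 = as.matrix(p), p2 = as.matrix(o), lonlat = FALSE)
}
all_dists <- lapply(dt, dist_one)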
In "Zero frequent items" when using the eclat to mine frequent itemsets, the OP is interested in the groupings/clusterings based on how frequent they are ordered together. This grouping can be inspected by the arules::inspect function.
library(arules)
dataset <- read.transactions("8GbjnHK2.txt", sep = ";", rm.duplicates = TRUE)
f <- eclat(dataset,
           parameter = list(supp = 0.001,
                            maxlen = 17,
                            tidLists = TRUE))
inspect(head(sort(f, by = "support"), 10))
The data set can be downloaded from https://pastebin.com/8GbjnHK2.
However, the output cannot be easily saved to another object as a data frame.
out <- inspect(f)
So how do we capture the output of inspect(f) for use as data frame?
We can use the methods labels() and quality() to extract the associations/groupings and the quality measures (support and count), respectively. We can then use cbind to store these in a data frame.
out <- cbind(labels = labels(f), quality(f))
head(out)
# labels support count
# 1 {3031093,3059242} 0.001010 16
# 2 {3031096,3059242} 0.001073 17
# 3 {3060614,3060615} 0.001010 16
# 4 {3022540,3072091} 0.001010 16
# 5 {3061698,3061700} 0.001073 17
# 6 {3031087,3059242} 0.002778 44
Coercing the itemsets to a data.frame also creates the required output.
> head(as(f, "data.frame"))
items support count
1 {3031093,3059242} 0.001010101 16
2 {3031096,3059242} 0.001073232 17
3 {3060614,3060615} 0.001010101 16
4 {3022540,3072091} 0.001010101 16
5 {3061698,3061700} 0.001073232 17
6 {3031087,3059242} 0.002777778 44
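Recent versions of arules also ship a DATAFRAME() convenience coercion that does essentially the same thing (hedged, as availability depends on your installed version; check ?DATAFRAME first):
out2 <- DATAFRAME(f)
head(out2)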
tl;dr: I have a simple question at the bottom.
I'm trying to turn XML files into usable tables in R.
<toes copyright='(C)' version='1.1'>
  <generated date='2017-01-21 07:45:04' timestamp='1485006304'/>
  <description>
    Active TOE vehicle levels and adjustments for the current campaign
    up to the RDP cycle in progress. c0 = the cycle 0 capacity, adj =
    comma-separated list of cycle:capacity adjustments, cur = current
    capacity
  </description>
  <defaults><def att='adj' value=''/></defaults>
  <r toe="deairfor" veh="22" c0="30" cur="30"/>
  <r toe="deairfor" veh="23" c0="40" cur="20" adj="1:35,2:20"/>
  <r toe="deairfor" veh="26" c0="2" cur="2" adj="2:10,3:30"/>
</toes>
My intended format is this:
"TOE" "Veh" "c0" "cur" "adj1" "adj2" "adj3"
"deairfor" 22 30 30 NA NA NA
"deairfor" 23 40 20 35 20 NA
"deairfor" 26 2 2 NA 10 30
I have zero experience with importing XML files, but I think this file is not formatted properly, as I haven't encountered any XML example with data inside the tags like in <r toe="...data..."/>. I have been able to extract the data with the following:
library(XML)
source <- "http://wiretap.wwiionline.com/xml/toes.sheet.xml"
xmlfile <- xmlTreeParse(source, useInternalNodes = TRUE)
nodes <- getNodeSet(xmlfile, "/toes//r")
library(gtools)  # for smartbind
Df1 <- NULL; Df2 <- NULL
for(i in 1:length(nodes)) {
  Df1 <- t(xmlToList(nodes[[i]]))
  Df2 <- smartbind(Df2, Df1[1,])
}
I was only able to extract one row at a time, so I use smartbind to bind the rows together. I needed the Df1/Df2 pair, otherwise it would error out on i = 1. This is probably much easier a different way, but I couldn't get it working.
This leaves me with a data frame Df2, with all the variables as "factor" (why?):
       TOE Veh c0 cur       adj
  deairfor  22 30  30      <NA>
  deairfor  23 40  20 1:35,2:20
  deairfor  26  2   2 2:10,3:30
So the difficulty now lies in this "adj" column. I can separate it with the following:
Df2 <- separate(data = Df2, col = adj, into = c("adj1","adj2","adj3"), sep = "\\,")
Df2 <- separate(data = Df2, col = adj1, into = c("adj1","adj1value"), sep = "\\:")
Df2 <- separate(data = Df2, col = adj2, into = c("adj2","adj2value"), sep = "\\:")
Df2 <- separate(data = Df2, col = adj3, into = c("adj3","adj3value"), sep = "\\:")
But the cells are not in the right columns. Df2 now is as below:
"TOE" "Veh" "c0" "cur" "adj1" "adj1value" "adj2" "adj2value" "adj3" "adj3value"
deairfor 22 30 30 NA NA NA NA NA NA
deairfor 23 40 20 1 35 2 20 NA NA
deairfor 26 2 2 2 10 3 30 NA NA
While this last row needs to be (once the adj values are in the proper value columns, we can also drop adj1/adj2/adj3):
deairfor 26 2 2 NA NA 2 10 3 30
I've tried numerous ways to move these cells to the right, but I constantly get errors, e.g. (the adj* columns are characters after separating, hence the "1"):
Df2$adj3[Df2$adj1 == "1"] <- Df2$adj2
Df2$adj3value[Df2$adj1 == "1"] <- Df2$adj2value
"NAs are not allowed in subscripted assignments"
So question: How do I move these values to the proper column?
"TOE" "Veh" "c0" "cur" "adj"
deairfor 26 2 2 2:10,3:30
Should become
"TOE" "Veh" "c0" "cur" "adj1" "adj2" "adj3"
deairfor 26 2 2 NA 10 30
Bonus question: I get the feeling I need so many lines because the XML import at the beginning isn't quite optimal; is there any way to do that better, given the goal that I have?
I would write a function that can add NAs to the prefixed adj string, and then use tidyr's separate.
add_NAs <- function(x, n=3){
  # split each "cycle:capacity,..." string on commas
  y <- strsplit(x, ",")
  sapply(y, function(z){
    # match cycle numbers 1..n against the first character of each piece
    n <- match(1:n, substr(z,1,1))
    # keep only the capacity part (after "x:"), filling gaps with NA
    paste(substring(z, 3)[n], collapse=",")
  })
}
add_NAs( c(NA, "1:35,2:20", "2:10,3:30", "1:20,3:5") )
[1] "NA,NA,NA" "35,20,NA" "NA,10,30" "20,NA,5"
You can also use xmlAttrsToDataFrame to parse the attributes (here doc is the internally parsed document, e.g. from xmlParse):
x <- XML:::xmlAttrsToDataFrame(doc["//r[@toe]"], stringsAsFactors=FALSE)
x$adj <- add_NAs(x$adj)
separate(x, adj, c("adj1", "adj2", "adj3"), sep="," , convert=TRUE)
toe veh c0 cur adj1 adj2 adj3
1 deairfor 22 30 30 NA NA NA
2 deairfor 23 40 20 35 20 NA
3 deairfor 26 2 2 NA 10 30
Thanks Chris for your help, that really answered all my questions!
Final code shown below for anyone who is interested.
I only had to insert a line that downloads the xml file first, otherwise it wouldn't pick it up. The topic I used for this: https://stackoverflow.com/questions/24139221/reading-and-understanding-xml-in-r
Additionally, for this table I wanted the level to 'continue' after the adjustments, which is what the five similar rows at the end do (a compact alternative follows the code). So if c0 = 10, adj1 = 20 and adj2 = NA, then adj2/Tier2 = 20.
library(XML)
library(tidyr)
add_NAs <- function(x, n=5){
  y <- strsplit(x, ",")
  sapply(y, function(z){
    n <- match(1:n, substr(z,1,1))
    paste(substring(z, 3)[n], collapse=",")
  })
}
fileURL <- "http://wiretap.wwiionline.com/xml/toes.sheet.xml"
download.file(fileURL, destfile=tf <- tempfile(fileext=".xml"))
doc <- xmlParse(tf)
Test <- XML:::xmlAttrsToDataFrame(doc["//r[@toe]"], stringsAsFactors=FALSE)
Test$adj <- add_NAs(Test$adj)
Test <- separate(data = Test, col = adj, into = c("Tier1","Tier2","Tier3","Tier4","Tier5"), sep = "\\,")
Test$Tier1 <- ifelse(Test$Tier1=="NA",Test$c0,Test$Tier1)
Test$Tier2 <- ifelse(Test$Tier2=="NA",Test$Tier1,Test$Tier2)
Test$Tier3 <- ifelse(Test$Tier3=="NA",Test$Tier2,Test$Tier3)
Test$Tier4 <- ifelse(Test$Tier4=="NA",Test$Tier3,Test$Tier4)
Test$Tier5 <- ifelse(Test$Tier5=="NA",Test$Tier4,Test$Tier5)
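As a hedged aside, the five ifelse() lines above can be collapsed into a loop that fills each Tier from the previous one (same behaviour, just more compact):
Test$Tier1 <- ifelse(Test$Tier1 == "NA", Test$c0, Test$Tier1)
for (k in 2:5) {
  this <- paste0("Tier", k)
  prev <- paste0("Tier", k - 1)
  Test[[this]] <- ifelse(Test[[this]] == "NA", Test[[prev]], Test[[this]])
}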
I am attempting to repeatedly add a "fixed number" to a numeric vector depending on a specified bin size. However, the "fixed number" is dependent on the data range.
For instance: I have a data range of 10 to 1010, and I wish to separate the data into 100 bins.
Since 1010 - 10 = 1000, and 1000 / 100 (the number of bins specified) = 10, the ideal data would look like this:
bin1 - 10 (initial data)
bin2 - 20 (initial data + 10)
bin3 - 30 (initial data + 20)
bin4 - 40 (initial data + 30)
...
bin100 - 1010 (initial data + 1000)
Now the real data is slightly more complex: there is not just one data range but multiple data ranges. Hopefully the example below clarifies.
# Some fixed values
start <- c(10, 5000, 4857694)
end <- c(1010, 6500, 4897909)
Ideally I wish to get something like
10      20
20      30
30      40
..      ..
5000    5015
5015    5030
5030    5045
..      ..
4857694 4858096   # note: theoretically these would have decimal places,
4858096 4858498   # but I do not want any decimal places
..      ..
So far I was thinking along the lines of the function below, but it seems inefficient because:
1) I have to retype the function 100 times (because my number of bins is 100), and
2) I can't find a way to repeat the function along my values. In other words, my function can only deal with the data 10-1010 and not the next range, 5000-6500.
# the range of each variable
width <- end - start
# the number of required bins
bin_size <- 100
# the resulting step between consecutive bin edges
bin_count <- width/bin_size
# Create a function
f1 <- function(x, y){
  c(x[1],
    x[1] + y[1],
    x[1] + y[1]*2,
    x[1] + y[1]*3)
}
f1(x= start,y=bin_count)
[1] 10 20 30 40
Any hints or ideas would be greatly appreciated. Thanks in advance!
After a few hours of trying, I managed to answer my own question, so I thought I'd share it. I used the package "binr" and its function "bins" to get the required bins. Please find my attempt below; it's slightly different from the intended output, but for my purpose it is still okay.
library(binr)
# Some fixed values
start <- c(10, 5000, 4857694)
end <- c(1010, 6500, 4897909)
tmp_list_start <- list() # Create an empty list
# this just extracts the output from the "bins" function into a list
for (i in seq_along(start)){
  tmp <- bins(start[i]:end[i], target.bins = 100, max.breaks = 100)
  # now I need to convert one of the outputs from bins into numeric values
  s <- gsub(",.*", "", names(tmp$binct))
  s <- gsub("\\[", "", s)
  tmp_list_start[[i]] <- as.numeric(s)
}
# Repeating the same thing with slight modification to get the end value of the bin
tmp_list_end <- list()
for (i in seq_along(end)){
  tmp <- bins(start[i]:end[i], target.bins = 100, max.breaks = 100)
  e <- gsub(".*,", "", names(tmp$binct))
  e <- gsub("]", "", e)
  tmp_list_end[[i]] <- as.numeric(e)
}
v1 <- unlist(tmp_list_start)
v2 <- unlist(tmp_list_end)
df <- data.frame(start=v1, end=v2)
head(df)
start end
1 10 20
2 21 30
3 31 40
4 41 50
5 51 60
6 61 70
Pardon my crappy code. Please share if there is a better way of doing this. It would be nice if someone could comment on how to wrap this into a function; one possible sketch is below.
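Following up on that request, here is one possible way to wrap the binr steps above into a single function. bin_ranges is a hypothetical name, and this assumes bins() keeps naming its counts like "[10, 20]":
bin_ranges <- function(start, end, n = 100) {
  do.call(rbind, Map(function(s, e) {
    tmp <- bins(s:e, target.bins = n, max.breaks = n)
    nm  <- names(tmp$binct)
    data.frame(start = as.numeric(gsub("\\[|,.*", "", nm)),
               end   = as.numeric(gsub(".*,|\\]", "", nm)))
  }, start, end))
}
df <- bin_ranges(start, end)  # same start/end vectors as above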
Here's a way that may help with base R:
bin_it <- function(START, END, BINS = 100) {  # default of 100 bins matches the question
  range <- END - START
  jump <- range / BINS
  v1 <- c(START, seq(START + jump + 1, END, jump))
  v2 <- seq(START + jump - 1, END, jump) + 1
  data.frame(v1, v2)
}
It uses the function seq to create the vectors of numbers leading up to the ending number. It may not work for every case, but for the ranges you gave, it should produce the desired output.
bin_it(10, 1010)
v1 v2
1 10 20
2 21 30
3 31 40
4 41 50
5 51 60
bin_it(5000, 6500)
v1 v2
1 5000 5015
2 5016 5030
3 5031 5045
4 5046 5060
5 5061 5075
bin_it(4857694, 4897909)
v1 v2
1 4857694 4858096
2 4858097 4858498
3 4858499 4858900
4 4858901 4859303
5 4859304 4859705
6 4859706 4860107