R algorithm for deleting rows with artefacts

I'm analyzing DEM data of rivers with R and need assistance with the data processing. The DEM data include many artifacts where the river's longitudinal profile goes slightly uphill, which is physically nonsense. So I would like an algorithm that deletes every row from the data set whose Z value (elevation) is higher than its predecessor's. To illustrate, consider the following data rows:
data.frame
ID Z
1 105.2
2 105.4
3 105.3
4 105.1
5 105.1
6 105.2
7 104.9
I would like to delete rows 2, 3 and 6 from the list. I wrote the following code but it doesn't work:
i <- data.frame[1,2]
for (n in data.frame[,2]) {if(n-i>0) data.frame[i,2]=0 else i <- n}
I would very much appreciate it if anybody could help.

Apparently, you want to repeatedly remove values until there are no increasing values. It would be easiest to simply use a while loop:
DF <- read.table(text = "ID Z
1 105.2
2 105.4
3 105.3
4 105.1
5 105.1
6 105.2
7 104.9", header = TRUE)
while(any(diff(DF$Z) > 0)) DF <- DF[c(TRUE, diff(DF$Z) <= 0),]
# ID Z
#1 1 105.2
#4 4 105.1
#5 5 105.1
#7 7 104.9
Test for yourself whether this is sufficiently efficient.
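For what it's worth, the loop's end state keeps exactly those rows whose Z is no higher than every earlier value, so a single vectorised pass with cummin() should give the same result; worth verifying against the loop on your real data:
# Keep rows whose elevation equals the running minimum seen so far.
DF[DF$Z == cummin(DF$Z), ]
#  ID     Z
#1  1 105.2
#4  4 105.1
#5  5 105.1
#7  7 104.9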
I would also like to comment on your whole idea of data cleaning here. I find it very dubious. How do you know that there is no error in the values that don't increase? You might remove perfectly valid values because the error is actually in a (strongly) decreasing value.


Search for value within a range of values in two separate vectors

This is my first time posting to Stack Exchange; my apologies, as I'm certain I will make a few mistakes. I am trying to assess false detections in a dataset.
I have one data frame with "true" detections
truth=
ID Start Stop SNR
1 213466 213468 10.08
2 32238 32240 10.28
3 218934 218936 12.02
4 222774 222776 11.4
5 68137 68139 10.99
And another data frame with a list of times that represent possible 'real' detections:
possible=
ID Times
1 32239.76
2 32241.14
3 68138.72
4 111233.93
5 128395.28
6 146180.31
7 188433.35
8 198714.7
I am trying to see if the values in my 'possible' data frame lie between the start and stop values. If so, I'd like to create a third column in possible called "between" and a column in the "truth" data frame called "match". For every value from possible that falls between a pair, I'd like a 1, otherwise a 0. For every row in "truth" that finds a match, I'd like a 1, otherwise a 0.
Neither ID nor SNR is important; I'm not looking to match on ID. Instead I want to run through the data frames entirely. Output should look something like:
ID Times Between
1 32239.76 1
2 32241.14 0
3 68138.72 1
4 111233.93 0
5 128395.28 0
6 146180.31 0
7 188433.35 0
8 198714.7 0
Alternatively, knowing whether any of my 'possible' time values fall within 2 seconds of the start or end times would also do the trick (again with 1/0 outputs).
(Thanks for the feedback on the original post)
Thanks in advance for your patience with me as I navigate this system.
I think this can be conceptualised as a rolling join in data.table. Take this simplified example:
truth
# id start stop
#1: 1 1 5
#2: 2 7 10
#3: 3 12 15
#4: 4 17 20
#5: 5 22 26
possible
# id times
#1: 1 3
#2: 2 11
#3: 3 13
#4: 4 28
setDT(truth)
setDT(possible)
melt(truth, measure.vars = c("start", "stop"), value.name = "times")[
  possible, on = "times", roll = TRUE
][, .(id = i.id, truthid = id, times,
      status = factor(variable, labels = c("in", "out")))]
# id truthid times status
#1: 1 1 3 in
#2: 2 2 11 out
#3: 3 3 13 in
#4: 4 5 28 out
The source datasets were:
truth <- read.table(text="id start stop
1 1 5
2 7 10
3 12 15
4 17 20
5 22 26", header=TRUE)
possible <- read.table(text="id times
1 3
2 11
3 13
4 28", header=TRUE)
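If all you need are the 1/0 flags rather than the nearest bound, foverlaps() is an alternative to the rolling join. This is a sketch against the same simplified tables (treating each candidate time as a zero-width interval is my own device here; widening truth's start/stop by 2 first would give the "within 2 seconds" variant from the question):
library(data.table)
setDT(truth); setDT(possible)
# Each candidate time becomes a zero-width interval so foverlaps() can test containment.
possible[, c("start", "stop") := .(times, times)]
setkey(truth, start, stop)
hits <- foverlaps(possible, truth, by.x = c("start", "stop"),
                  type = "within", which = TRUE, nomatch = 0L)
# Flag the rows of each table that took part in at least one overlap.
possible[, between := as.integer(.I %in% hits$xid)]
truth[, match := as.integer(.I %in% hits$yid)]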
I'll post a solution that I'm pretty sure works the way you want, in order to get you started. Maybe someone else can post a more efficient answer.
Anyway, first I needed to generate some example data; next time, please provide a sample from your own data set in your post, using dput(head(truth, n = 25)) and dput(head(possible, n = 25)). I used:
#generate random test data with named columns
set.seed(7)
truth <- data.frame(ID    = 1:100,
                    Start = sample(5:20, size = 100, replace = TRUE),
                    Stop  = sample(21:50, size = 100, replace = TRUE))
possible <- data.frame(Times = sample(1:15, size = 15, replace = FALSE))
After getting sample data to work with, the following solution provides what I believe you are asking for, and it should scale directly to your own dataset as it seems to be laid out. Respond below if the comments are unclear.
#need the %between% operator
library(data.table)
#initialize flag vectors - 0 (false) by default
truth.match <- rep(0, times = nrow(truth))
possible.between <- rep(0, times = nrow(possible))
#iterate through the 'possible' data frame
for (i in 1:nrow(possible)){
  #get a boolean vector showing which 'truth' rows are a match
  match.vec <- apply(truth[, 2:3],
                     MARGIN = 1,
                     FUN = function(x) possible$Times[i] %between% x)
  #if any are true, update the match and between vectors
  if (any(match.vec)){
    truth.match[match.vec] <- 1
    possible.between[i] <- 1
  }
}
#I think this should be called anyMatch for clarity
truth$anyMatch <- truth.match
#similarly, betweenAny
possible$betweenAny <- possible.between
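If the loop gets slow on bigger inputs, the per-row apply() can be replaced with one vectorised pass. A sketch against the same generated frames (data.table is already loaded for %between%; inrange() is its interval-membership cousin, and the Start/Stop names are the ones used above):
# Vectorised 'between' flag: every time is checked against all intervals at once.
possible$betweenAny <- as.integer(inrange(possible$Times, truth$Start, truth$Stop))
# Match flag per truth row: does any candidate time fall inside its interval?
truth$anyMatch <- as.integer(mapply(
  function(lo, hi) any(possible$Times >= lo & possible$Times <= hi),
  truth$Start, truth$Stop))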

Is there a way to multiply two columns in R so that, e.g., 17.5 x 4 gives a list of 4 rows: 17.5, 17.5, 17.5, 17.5?

I need to multiply two columns so that the result, columnC, is a list of the columnA value repeated columnB times (sorry if that is confusing; I don't know how else to say it). So columnA (17.5) * columnB (4) gives columnC (17.5, 17.5, 17.5, 17.5).
Is this possible? I need to make a histogram in R, but the data are entered in the A B format (i.e. there were 4 ind at 17.5, 2 ind at 16.8, 5 ind at 15.9, etc.) and I cannot get the plotting to work that way, so I thought that if I changed it to just a list of values it would work. It is a very large data set, and doing this manually is prohibitive. Is there a better way to do this? I'm new to R, so any help is greatly appreciated.
As far as I can tell, the following code will do what you want:
dat <- data.frame(x = c(17.5, 16.8, 15.9), y = c(4, 2, 5))
newDat <- data.frame(x = rep(dat$x, dat$y), y = rep(1, sum(dat$y)))
# load ggplot2, installing it first if it is not already installed
if (!require("ggplot2")) {
  install.packages("ggplot2", repos = "http://ftp.heanet.ie/mirrors/cran.r-project.org/",
                   dependencies = TRUE)
  library("ggplot2")
}
ggplot(newDat, aes(x = x, y = y, fill = factor(x))) + geom_bar(stat = "identity")
Depending on the size of your data this might not make sense, and you might want to do something other than appending a column of 1s to your data frame, but for this toy example it functions fine. You should get a stacked bar chart with one bar per distinct x value, whose height is the count (the plot itself is not reproduced here).
Suppose your data frame (called df) is as follows:
A B
1 17.5 5
2 16.8 8
One way to expand (i.e. replicate) the rows is:
df <- df[rep(rownames(df), df$B),]
# A B
#1 17.5 5
#1.1 17.5 5
#1.2 17.5 5
#1.3 17.5 5
#1.4 17.5 5
#2 16.8 8
#2.1 16.8 8
#2.2 16.8 8
#2.3 16.8 8
#2.4 16.8 8
#2.5 16.8 8
#2.6 16.8 8
#2.7 16.8 8
If you want to tidy up your row names, you can just do:
rownames(df) <- NULL
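Either way, once every value has been repeated according to its count, the histogram from the question needs nothing special:
# Base-R histogram of the expanded values (df as expanded above).
hist(df$A, xlab = "A", main = "Histogram of expanded values")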

Replace values in one data frame from values in another data frame

I need to change individual identifiers that are currently alphabetical to numerical. I have created a data frame (g4) where each alphabetical identifier is associated with a number:
individuals num.individuals
1 ZYO 64
2 KAO 24
3 MKU 32
4 SAG 42
What I need is to replace ZYO with the number 64 in my main data frame (g3), and likewise for all the other codes.
My main data frame (g3) looks like this:
SAG YOG GOG BES ATR ALI COC CEL DUN EVA END GAR HAR HUX ISH INO JUL
1 2
2 2 EVA
3 SAG 2 EVA
4 2
5 SAG 2
6 2
Now, on a small scale, I can write code to change it, as I did with ATR:
g3$ATR <- as.character(g3$ATR)
g3[g3$target == "ATR" | g3$ATR == "ATR","ATR"] <- 2
But this is time-consuming and increases the chance of human error.
I know there are ways to do this on a broad scale with NAs.
I think maybe we could do a for loop for this, but I am not good enough to write one myself.
I have also been trying to use the following approach, which I feel could work, though I am not sure how to logically build the argument; it was posted on the questions board here:
Fast replacing values in dataframe in R
df <- as.data.frame(lapply(df, function(x){replace(x, x < 0, 0)}))
I have tried to work my data into this by
df <- as.data.frame(lapply(g4, function(g3){replace(x, x <0,0)})
Here is one approach using the data.table package:
First, create a reproducible example similar to your data:
require(data.table)
ref <- data.table(individuals=1:4,num.individuals=c("ZYO","KAO","MKU","SAG"),g4=c(64,24,32,42))
g3 <- data.table(SAG=c("","SAG","","SAG"),KAO=c("KAO","KAO","",""))
Here is the ref table:
individuals num.individuals g4
1: 1 ZYO 64
2: 2 KAO 24
3: 3 MKU 32
4: 4 SAG 42
And here is your g3 table:
SAG KAO
1: KAO
2: SAG KAO
3:
4: SAG
And now we do our find-and-replace:
g3[ , lapply(.SD,function(x) ref$g4[chmatch(x,ref$num.individuals)])]
And the final result:
SAG KAO
1: NA 24
2: 42 24
3: NA NA
4: 42 NA
And if you need more speed, the fastmatch package might help with its fmatch function:
require(fastmatch)
g3[ , lapply(.SD,function(x) ref$g4[fmatch(x,ref$num.individuals)])]
SAG KAO
1: NA 24
2: 42 24
3: NA NA
4: 42 NA
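For completeness, the same lookup works in base R with match(), without any extra packages; a sketch reusing the ref and g3 objects from above:
# Look each code up in ref and take the matching g4 number; "" matches nothing
# and therefore becomes NA, as in the data.table output above.
g3df <- as.data.frame(g3)
g3df[] <- lapply(g3df, function(x) ref$g4[match(x, ref$num.individuals)])
g3df
#   SAG KAO
# 1  NA  24
# 2  42  24
# 3  NA  NA
# 4  42  NA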

Populate a column with forecasts of panel data using data.table in R

I have panel data with "entity" and "year". I have a column "x" whose values I treat as a time series. I want to create a new column "xp" in which, for each entity and each year, I store the value forecast from the previous 5 years. If fewer than 5 previous values are available, xp = NA.
For the sake of generality, the forecast is the output of a function built in R from a couple of predefined functions found in packages like "forecast". If it is easier with a specific function, let's use forecast(auto.arima(x.L5:x.L1), h = 1).
For now, I use data.table in R because it is so much faster for all the other manipulations I make on my dataset.
However, what I want to do is not data.table 101, and I am struggling with it.
I would very much appreciate a bit of your time to help me with this.
Thanks.
Here is an extract of what I would like to do:
entity year x xp
1 1980 21 NA
1 1981 23 NA
1 1982 32 NA
1 1983 36 NA
1 1984 38 NA
1 1985 45 42.3 =f((21,23,32,36,38))
1 1986 50 48.6 =f((23,32,36,38,45))
2 1991 2 NA
2 1992 4 NA
2 1993 6 NA
2 1994 8 NA
2 1995 10 NA
2 1996 12 12.4 =f((2,4,6,8,10))
2 1997 14 13.9 =f((4,6,8,10,12))
...
As suggested by Eddi, I found a way using rollapply:
library(data.table)
library(zoo)       # rollapply
library(forecast)  # auto.arima, forecast
DT <- data.table(mydata)
DT <- DT[order(entity, year)]
DT[, xp := rollapply(x, 5, timeseries, align = "right", fill = NA), by = entity]
with:
timeseries <- function(x){
  fit <- auto.arima(x)
  value <- as.data.frame(forecast(fit, h = 1))[1, 1]
  return(value)
}
For a sample of mydata, it works perfectly. However, when I use the whole dataset (150k lines), after some computing time, I have the following error message:
Error in seq.default(start.at,NROW(data),by = by) : wrong sign in 'by' argument
Where does it come from?
Can it come from the "5" parameter in rollapply, combined with specificities of certain entities in the dataset (not enough data, for example)?
Thanks again for your time and help.
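Your guess is very likely right: rollapply() builds an index sequence with seq(start.at, NROW(data), by = by) internally, and when a group has fewer rows than the window width the start lands past the end, which is exactly the "wrong sign" error. One way to guard against short entities (a sketch, reusing the DT and timeseries objects from above):
# Entities with fewer rows than the window just get NA instead of being
# handed to rollapply, which errors on too-short input.
DT[, xp := if (.N >= 5) rollapply(x, 5, timeseries, align = "right", fill = NA)
           else NA_real_,
   by = entity]
One more caveat: with align = "right" the window ends at the current row, so xp for 1985 would use 1981:1985. If xp must use only the five previous years, as in your example, wrap the rollapply() call in data.table's shift().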

Altering a large distance matrix to be just three columns

I have a large data frame/.csv that is a matrix with 42 columns and 110,357,407 rows. It was derived from the x and y coordinates of two datasets of points, one with 41 points and another with 110,357,407, and the values in the rows represent the distances between these two sets of points (the distance of each point in list 1 to every single point in list 2). The first column is a list of point IDs (from 1 to 110,357,407). An excerpt from the matrix is below.
V1 V2 V3 V4 V5 V6 V7
1 38517.05 38717.8 38840.16 38961.37 39281.06 88551.03 88422.62
2 38514.05 38714.79 38837.15 38958.34 39278 88545.48 88417.09
3 38511.05 38711.79 38834.14 38955.3 39274.94 88539.92 88411.56
4 38508.05 38708.78 38831.13 38952.27 39271.88 88534.37 88406.03
5 38505.06 38705.78 38828.12 38949.24 39268.83 88528.82 88400.5
6 38502.07 38702.78 38825.12 38946.21 39265.78 88523.27 88394.97
7 38499.08 38699.78 38822.12 38943.18 39262.73 88517.72 88389.44
8 38496.09 38696.79 38819.12 38940.15 39259.68 88512.17 88383.91
9 38493.1 38693.8 38816.12 38937.13 39256.63 88506.62 88378.38
10 38490.12 38690.8 38813.12 38934.11 39253.58 88501.07 88372.85
11 38487.14 38687.81 38810.13 38931.09 39250.54 88495.52 88367.33
12 38484.16 38684.83 38807.14 38928.07 39247.5 88489.98 88361.8
13 38481.18 38681.84 38804.15 38925.06 39244.46 88484.43 88356.28
14 38478.21 38678.86 38801.16 38922.04 39241.43 88478.88 88350.75
15 38475.23 38675.88 38798.17 38919.03 39238.39 88473.34 88345.23
16 38472.26 38672.9 38795.19 38916.03 39235.36 88467.8 88339.71
My issue is that I would like to change this matrix into just 3 columns: the first would be similar to the first column of the matrix, with the 110,357,407 row IDs; the second would identify which of the 41 points the distance was measured to; and the third would be the distance between those two points. So it would look something like this:
Back Pres Dist
1 1 3486
2 1 3456
3 1 3483
4 1 3456
5 1 3429
6 1 3438
7 1 3422
8 1 3427
9 1 3428
(After the distances between back and the first pres value are complete, pres changes to 2, and it eventually works its way up to 41.)
I realize that this will output a hugely ridiculous number of rows, but this is the format that I need to run some processes that are outside of R.
I tried using this code:
cols.Output <- data.frame(col   = rep(colnames(output3), each = nrow(output3)),
                          row   = rep(rownames(output3), ncol(output3)),
                          value = as.vector(output3))
But there won't be the same number of rows for each column, so I received an error (and I don't think it would have really worked with my pres column needs anyway). I tried experimenting with some of the rbind.fill and cbind.fill functions (the one in plyr and ones that others have come up with on the forum). I also looked into melting and reshaping, but I was very confused by the functions and couldn't figure out how to implement them appropriately (or whether they are even appropriate for what I need). I would really appreciate any help with this, as I've been struggling with it for a long time.
Edit: Just to be a little clearer about what I need, take these two smaller data sets:
back <- 1 dataset with 5 sets of x, y points
pres <- 1 dataset with 3 sets of x, y points
Calculating distances between these two data frames generates the initial matrix:
Back 1 2 3
1 3427 3444 3451
2 3432 3486 3476
3 3486 3479 3486
4 3449 3438 3484
5 3483 3486 3486
And my desired output would look like this:
Back Pres Dist
1 1 3427
2 1 3432
3 1 3486
4 1 3449
5 1 3483
1 2 3444
2 2 3486
3 2 3479
4 2 3438
5 2 3486
1 3 3451
2 3 3476
3 3 3486
4 3 3484
5 3 3486
Yes, it looks like this is the kind of problem generally solved with some combination of melt and cast in the reshape2 package. That said, with 100+ million rows, I'm not sure that's the most efficient way to go in this case.
You could do it all manually as follows. I'll assume your data frame is called df and that the distances are in columns 2 to 42. See if this works:
d <- unlist(df[-1]) # put all the distances into a vector
newdf <- cbind(expand.grid(back=seq_len(nrow(df)), pres=seq_len(ncol(df) - 1)), d)
This will probably die unless you have tons of memory. The same holds for any simple solution, though, since you have more than 4.5 billion elements in the vector of distances (110,357,407 × 41). You can work on subsets of the full dataset at a time to get around this problem.
Here's how to use melt on a small example:
require(reshape2)
a <- matrix(rnorm(9), nrow = 3)
a[, 1] <- 1:3 ## Pretending these are one set of points
rownames(a) <- a[, 1] ## We'll put them as rownames instead of a column
melt(a[, -1]) ## And omit that column when melting
If you have memory issues, you could write a for loop and do it in pieces, writing each to a file when they're completed.
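Here is a sketch of that chunked variant, under the same assumptions as above (IDs in column 1, the 41 distances in the remaining columns; the chunk size and the output filename dist_long.csv are placeholders to adjust):
chunk <- 1e6                                # rows of df per piece; tune to memory
n <- nrow(df)
for (s in seq(1, n, by = chunk)) {
  idx <- s:min(s + chunk - 1, n)
  piece <- cbind(
    expand.grid(back = df[[1]][idx], pres = seq_len(ncol(df) - 1)),
    dist = unname(unlist(df[idx, -1]))      # column-major, matches expand.grid order
  )
  write.table(piece, "dist_long.csv", sep = ",", row.names = FALSE,
              col.names = (s == 1), append = (s > 1))
}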
