Plotting data using vectors of different length in R - r

I want to plot several files in the same figure; each file has two-column data.
The problem is that each file has a different number of rows (529, 567, 660, etc.).
For data with same number of rows I did the following:
data1 <- read.table(file="ro0.2/T0.1/sq_Ave.dat")
x1 <- data1[1]
y1 <- data1[2]
data2 <- read.table(file="ro0.4/T0.1/sq_Ave.dat")
x2 <- data2[1]
y2 <- data2[2]
max_valuex = max(x1,x2,x3,x4,x5)
max_valuey = max(y1,y2,y3,y4,y5)
matplot(x1,cbind(y1,y2,y3,y4,y5),type="l",
col=c("black","red","green","blue","orange"),
lwd = 2,xlab = expression(q*sigma), ylab="S(q)", col.lab="black",
cex.lab=1.5,font.lab=4, xaxt = "n", yaxt = "n", xlim = c(0,max_valuex),
ylim = c(0,max_valuey), xaxs = "i", yaxs = "i")
However, this does not work for files with different number of rows.
R complains with:
Error in data.frame(..., check.names = FALSE) :
arguments imply differing number of rows: 529, 567, 661
Calls: matplot -> ncol -> as.matrix -> cbind -> cbind -> data.frame
Any idea or suggestion would be greatly appreciated!
Thanks a lot in advance
S H-V

I guess you could enlarge your vectors by padding them with NAs. I believe the padding won't matter for the subsequent handling of your data. E.g.:
a= 1:10
b=1:5
d=1:7
data.frame(a,b,d) #different length
#Error in data.frame(a, b, d) :
#arguments imply differing number of rows: 10, 5, 7
length(b) = length(d) = length(a)
data.frame(a,b,d) # no error now
a b d
1 1 1 1
2 2 2 2
3 3 3 3
4 4 4 4
5 5 5 5
6 6 NA 6
7 7 NA 7
8 8 NA NA
9 9 NA NA
10 10 NA NA
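
The same padding idea can be carried over to plotting. The following is only a sketch on simulated vectors (x_list, y_list and the helper pad_to are made-up names, not from the question); it pads both the x and the y values so that every series keeps its own x coordinates, and relies on matplot() skipping NA entries.

```r
# Hypothetical helper: pad a vector with NA up to length n.
pad_to <- function(v, n) c(v, rep(NA, n - length(v)))

# Simulated stand-ins for the two columns of three files of different length.
x_list <- list(seq(0, 5, length.out = 10),
               seq(0, 6, length.out = 5),
               seq(0, 4, length.out = 7))
y_list <- lapply(x_list, sin)

# Pad everything to the length of the longest series and bind into matrices.
n_max <- max(lengths(x_list))
x_mat <- sapply(x_list, pad_to, n = n_max)
y_mat <- sapply(y_list, pad_to, n = n_max)

# Column k of x_mat is plotted against column k of y_mat; NA rows are skipped.
matplot(x_mat, y_mat, type = "l", lty = 1, lwd = 2,
        col = c("black", "red", "green"), xlab = "x", ylab = "y")
```

Padding only the y values and plotting against a single x vector would line the series up by row number, which is right only if all files share the same x grid.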

Even if you manage to load the y-axis data into a list object, which would be the natural data type in R for storing vectors of variable length, in the next step you will get something like:
> matplot(matrix(1:100, nrow=10, ncol=10)[, 1], matrix(1:100, nrow=12, ncol=10))
Error in matplot(matrix(1:100, nrow = 10, ncol = 10)[, 1], matrix(1:100, :
'x' and 'y' must have same number of rows
A scatterplot function like plot or matplot needs complete x,y tuples, but in your code you are only using x1 as the x-values. Your code example is not complete. You are also loading x2, x3, ... but only use them to calculate xlim. Why would you calculate xlim including the maxima of all x if you do not intend to plot them?
It is therefore not clear to me what the final plot should look like, and whether or not a scatterplot is the correct visualization of the data. You might want to give more details about what your data in fact consists of and how incomplete data should be handled.
Could it be that you want to draw several line plots in a single figure, using add=TRUE?
# first column is x, second column is y
matplot(data1[,1], data1[,2], type="l", xlim = c(0,max_valuex), ylim = c(0,max_valuey))
matplot(data2[,1], data2[,2], type="l", add=TRUE)
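
An equivalent way to overlay series of different length is to open an empty plot with the common axis limits and then add each series with lines(). This is a minimal sketch on simulated data; the list name series and the generated curves are assumptions, and in practice each element would come from read.table() on one of the files.

```r
# Simulated two-column data sets with different numbers of rows (assumed data).
n_rows <- c(529, 567, 661)
series <- lapply(seq_along(n_rows), function(k) {
  x <- seq(0, 10, length.out = n_rows[k])
  data.frame(x = x, y = abs(sin(x)) + 0.1 * k)
})
cols <- c("black", "red", "green")

# Axis limits taken over all series, as in the question.
x_max <- max(sapply(series, function(d) max(d$x)))
y_max <- max(sapply(series, function(d) max(d$y)))

# Empty plot with the common limits, then one line per data set.
plot(NA, xlim = c(0, x_max), ylim = c(0, y_max),
     xlab = expression(q * sigma), ylab = "S(q)", xaxs = "i", yaxs = "i")
for (k in seq_along(series)) {
  lines(series[[k]]$x, series[[k]]$y, col = cols[k], lwd = 2)
}
```

Because every series is drawn with its own x and y columns, the number of rows per file no longer has to match.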

Related

How do I use a for loop to create a list of data frames, based off two existing data frames, in R?

I am VERY new to loops. Some of my loops have been successful, others... not so much.
I have some observed data (df_obs) that I'd like to test against my model predictions (df_pred).
MY CURRENT AIM: write a loop which makes a list of data frames, so that I can use this list in future loops assessing model performance. I will probably be back for help with THOSE loops...
YES: I do want a list of data frames. I'm working with 50+ species and have a bunch of tests to run on these values.
MAYBE: I think I want a for() loop, but if a different method is easier e.g. lapply(), I'm open to suggestions.
I've done my best to create a reproducible data set and code that mimics what I am working with:
#observed presence (1) and absence (0)
set.seed(733)
df_obs <- data.frame(plot = 1:10,
sp1 = sample(0:1, 10, replace = TRUE),
sp2 = sample(0:1, 10, replace = TRUE),
sp3 = sample(0:1, 10, replace = TRUE))
#predicted probability of occurrence (ranges from 0 to 1)
set.seed(733)
df_preds <- data.frame(plot = 1:10,
sp1 = runif(10, 0, 1),
sp2 = runif(10, 0, 1),
sp3 = runif(10, 0, 1))
sppcodes <- c("sp1", "sp2", "sp3")
test.eval.list <- vector("list", length = length(sppcodes))
names(test.eval.list) <- sppcodes
for(i in seq_along(sppcodes)){
sppn <- sppcodes[i]
plot = df_obs$plot
obs = df_obs[,sppn]
pred = df_preds[,sppn]
df <- data.frame(plot, obs, pred) #produces dataframe as expected
test.eval.list[sppn] <- df #problem seems to be here, it ends up assigning a vector of numbers...
}
Could someone please help me understand why I am not ending up with a list of data frames, and give a correct way of doing so?
Please note - I know there are areas which could be done in a single line of code, I prefer this way of spreading the code out to understand which parts are/are not working.
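You have a small mistake in the for loop: use [[ instead of [ when assigning to a list element. With single brackets, R treats the data frame as a list of columns and stores only the first one (with a warning), which is why you end up with a vector of numbers. You may want to read ?Extract if you are interested in the different ways of accessing elements.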
for(i in seq_along(sppcodes)){
sppn <- sppcodes[i]
plot = df_obs$plot
obs = df_obs[,sppn]
pred = df_preds[,sppn]
df <- data.frame(plot, obs, pred)
test.eval.list[[sppn]] <- df
}
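
The difference between the two assignment forms can be seen on a toy list. This is only an illustration; the names lst, single, double and wrapped are made up for the example.

```r
df <- data.frame(x = 1:2, y = 3:4)

lst <- vector("list", 1)
names(lst) <- "a"

# Single brackets: the data frame is treated as a list of two columns, but
# only one slot is being replaced, so just the first column is stored
# (R also issues a warning, silenced here).
single <- lst
suppressWarnings(single["a"] <- df)

# Double brackets: the whole data frame is stored as one element.
double <- lst
double[["a"]] <- df

# Single brackets also work if the value is wrapped in a list.
wrapped <- lst
wrapped["a"] <- list(df)

str(single)
str(double)
```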
Alternatively, you can use Map:
Map(cbind.data.frame, plot = list(df_obs$plot),obs=df_obs[-1],pred = df_preds[-1])
#[[1]]
# plot obs pred
#1 1 1 0.3266487
#2 2 1 0.3745092
#3 3 0 0.8633161
#4 4 0 0.1970302
#5 5 1 0.3017755
#6 6 0 0.9154151
#7 7 0 0.6193044
#8 8 0 0.4020479
#9 9 1 0.9947362
#10 10 1 0.7975380
#...
#....
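
Since the question mentions lapply() as an option, here is a sketch of that variant. It reuses the question's simulated data and object names; setNames() is used so that the list elements are named by species code.

```r
# Simulated data, as in the question.
set.seed(733)
df_obs <- data.frame(plot = 1:10,
                     sp1 = sample(0:1, 10, replace = TRUE),
                     sp2 = sample(0:1, 10, replace = TRUE),
                     sp3 = sample(0:1, 10, replace = TRUE))
set.seed(733)
df_preds <- data.frame(plot = 1:10,
                       sp1 = runif(10, 0, 1),
                       sp2 = runif(10, 0, 1),
                       sp3 = runif(10, 0, 1))
sppcodes <- c("sp1", "sp2", "sp3")

# One data frame per species; lapply() returns them as a list.
test.eval.list <- setNames(
  lapply(sppcodes, function(sppn) {
    data.frame(plot = df_obs$plot,
               obs  = df_obs[[sppn]],
               pred = df_preds[[sppn]])
  }),
  sppcodes
)

str(test.eval.list, max.level = 1)
```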

Make a data frame from a list of sequences

With a list of sequences, for example,
datList <- list(One = seq(1,5, length.out = 20),
Two = seq(1,10, length.out = 20),
Three = seq(5,50, length.out = 20))
Is it possible to make a data frame so that the sequences are converted into columns? As in,
datDF <- data.frame(One = datList[[1]], Two = datList[[2]], Three = datList[[3]] )
> head(datDF)
One Two Three
1 1.000000 1.000000 5.000000
2 1.210526 1.473684 7.368421
3 1.421053 1.947368 9.736842
4 1.631579 2.421053 12.105263
5 1.842105 2.894737 14.473684
6 2.052632 3.368421 16.842105
Within the context of my 'real' data I am working with many sequences and was hoping an apply (or similar) function could be used rather than manually creating the desired data frame.
We can use data.frame
datDFN <- data.frame(datList)
identical(datDF, datDFN)
#[1] TRUE
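
Two closely related base R options, shown here as a small sketch on the question's datList: as.data.frame() for a data frame and do.call(cbind, ...) if a plain numeric matrix is enough. Both assume, like data.frame(), that all sequences have the same length.

```r
datList <- list(One   = seq(1, 5,  length.out = 20),
                Two   = seq(1, 10, length.out = 20),
                Three = seq(5, 50, length.out = 20))

# Data frame with one column per list element.
datDF2 <- as.data.frame(datList)

# Numeric matrix with the same columns.
datMat <- do.call(cbind, datList)

head(datDF2)
```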

Average xy points with conditional distance

I have xy coordinates of points and I want to use the distance between them to decide which points to average. My data is named qq and I obtain the distance matrix using the dist function
qq
X Y
2 4237.5 4411.5
3 4326.5 4444.5
4 4382.0 4418.0
5 4204.0 4487.5
6 4338.5 4515.0
mydist = as.matrix(dist(qq))
2 3 4 5 6
2 0.00000 94.92102 144.64612 83.0557 144.61414
3 94.92102 0.00000 61.50203 129.8278 71.51398
4 144.64612 61.50203 0.00000 191.0870 106.30734
5 83.05570 129.82777 191.08702 0.0000 137.28256
6 144.61414 71.51398 106.30734 137.2826 0.00000
What I want to do is to average points that are closer than a certain threshold; for this example we could use 80. The only pairwise distances that fall below that limit are 3-4 and 3-6. The question is how to go back to the original matrix and average the xy coordinates to make the 3-4 pair one point and the 3-6 pair another one (discarding the former points 3, 4 and 6).
here's the dput of my data.frame
dput(qq)
structure(list(X = c(4237.5, 4326.5, 4382, 4204, 4338.5), Y = c(4411.5,
4444.5, 4418, 4487.5, 4515)), .Names = c("X", "Y"), row.names = 2:6, class = "data.frame")
UPDATE
Using some of the provided code with modifications, I get the 2 points I need to put in place of the 3-4 pair and the 3-6 pair. This means my points 3, 4 and 6 will have to disappear from qq and these two points should be appended to it
pairs <- which(as.matrix(dist(qq)) < 80 & upper.tri(as.matrix(dist(qq))), arr.ind = T)
t(apply(pairs,1,function(i) apply(qq[i,],2,mean)))
X Y
3 4354.25 4431.25
3 4332.50 4479.75
I think this should do it for you, if I understand the problem correctly.
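y <- dist(qq) # y was not defined in the original answer; assumed to be the distance object
pairs <- which(as.matrix(y) > 140 & upper.tri(as.matrix(y)), arr.ind = T)
result <- apply(pairs,1,function(i) apply(qq[i,],2,mean))
#optionally, I think this is the form you will want it in.
result <- data.frame(t(result))
The result is a data frame with the same structure as qq, containing the averages of the pairs of points selected by the threshold. As written, the condition > 140 picks pairs that are far apart; use < 80 instead to average points that are close together.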
UPDATE
qq <- qq[-unique(c(pairs)),]
qq <- rbind(qq,result)
Ok so I was able to merge strategies and solve the issue but not in a fancy way
# Search pairs less than threshold
pairs <- which(as.matrix(dist(qq)) < 80 & upper.tri(as.matrix(dist(qq))), arr.ind = T)
# Get the row numbers for subsetting the original matrix
indx=unique(c(pairs[,1],pairs[,2]))
# Get result dataframe
out = data.frame(rbind(qq[-indx,],t(apply(pairs,1,function(i) apply(qq[i,],2,mean)))),row.names=NULL)
dim(out)
[1] 4 2
out
X Y
1 4237.50 4411.50
2 4204.00 4487.50
3 4354.25 4431.25
4 4332.50 4479.75
The row.names get dropped because they mean nothing now that I've removed original points and added new ones. I'm still open to better ways to do it and to check everything is done correctly.
UPDATE
I made a function that could be more useful than doing things step-wise and lets you play with the threshold.
distance_fix = function(dataframe,threshold){
mydist = as.matrix(dist(dataframe))
# Which pairs in the upper triangle are below threshold
pairs <- which(mydist < threshold & upper.tri(mydist), arr.ind = T)
# Get the row numbers for subsetting the original matrix
indx=unique(c(pairs))
# Get result dataframe
out = data.frame(rbind(dataframe[-indx,],t(apply(pairs,1,function(i) apply(dataframe[i,],2,mean)))),row.names=NULL)
return(out)
}
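
One edge case worth guarding against: when no pair falls below the threshold, indx is empty and dataframe[-indx,] returns zero rows instead of all rows. The following is a sketch of a guarded variant; the name distance_fix_safe is made up, and the logic is otherwise the same as above.

```r
# Hypothetical guarded variant of distance_fix().
distance_fix_safe <- function(dataframe, threshold) {
  mydist <- as.matrix(dist(dataframe))
  # Pairs in the upper triangle that are closer than the threshold
  pairs <- which(mydist < threshold & upper.tri(mydist), arr.ind = TRUE)
  # No close pairs: return the data unchanged (row names reset)
  if (nrow(pairs) == 0) {
    rownames(dataframe) <- NULL
    return(dataframe)
  }
  # Rows taking part in at least one close pair
  indx <- unique(c(pairs))
  # One averaged point per pair
  means <- do.call(rbind, lapply(seq_len(nrow(pairs)), function(k) {
    colMeans(dataframe[pairs[k, ], , drop = FALSE])
  }))
  out <- rbind(dataframe[-indx, , drop = FALSE], as.data.frame(means))
  rownames(out) <- NULL
  out
}

qq <- data.frame(X = c(4237.5, 4326.5, 4382, 4204, 4338.5),
                 Y = c(4411.5, 4444.5, 4418, 4487.5, 4515),
                 row.names = 2:6)
distance_fix_safe(qq, 80)  # points 3, 4 and 6 replaced by two averaged points
distance_fix_safe(qq, 10)  # no pair is that close, all five points are kept
```

Note that a point belonging to several close pairs (point 3 here) contributes to several averages, exactly as in the original function.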

R : Select either or, but not both

I am absolutely new to coding so please forgive me if this should be very easy to solve or to find - maybe it's so simple that nobody has bothered explaining so far or I just haven't been searching with the right keywords.
I have a column in my dataset that contains the letters f, n, i in all possible combinations. Now I want to find only those rows that contain either f or n, but not both of them. So that could be f, or fi, or n, or ni.
Then I want to compare those two sets of rows to each other in a boxplot. So ideally I would have two boxes: one with all the data points belonging to group f, including fi, and one with all the data points belonging to group n, including ni.
Example of my dataset:
df <- data.frame(D = c("f", "f", "fi", "n", "ni", "ni", "fn", "fn"), y = c(1, 0.8, 1.1, 2.1, 0.9, 8.8, 1.7, 5.4))
D y
1 f 1.0
2 f 0.8
3 fi 1.1
4 n 2.1
5 ni 0.9
6 ni 8.8
7 fn 1.7
8 fn 5.4
Now what I want to get is this subset:
D y
1 f 1.0
2 f 0.8
3 fi 1.1
4 n 2.1
5 ni 0.9
6 ni 8.8
and then somehow have 1,2,3 and 4,5,6 in a group each, to plot in a boxplot.
So far I have only succeeded in getting a subset that has only entries with either f or n, but not fi, ni etc, which is not what I want, with this code:
df2<-df[df$D==c("f","n"),]
and in creating a subset that has all different groups with f and n:
df2 <- df[grepl("f", df$D) | grepl("n", df$D),]
I read about the "exclusive or" operator xor but when I try to use that like this:
df2 <- df[xor(match("n", df$D), match("f", df$D)),]
it just gives me a dataframe full of NAs. But even if that did work, I guess I would only be able to make a boxplot with four groups, f, n, fi and ni, where I want only two groups. So how can I get that code to work, and how do I go on from there?
I hope this is not too terrible for a first question! I am kind of bleary eyed after spending far too much time on this. Any help, about my problem, on where to look for the answer or on how to improve the question is very much appreciated!
I think your last example is pretty close. xor only works with things that return logical like TRUE and FALSE, but match actually returns the integer position. So just use grepl with xor:
xor(grepl("f", df$D), grepl("n", df$D))
Or you could get fancy:
# Reduce and lapply are base R, so no extra package is needed
Reduce(xor, lapply(c("f", "n"), grepl, df$D))
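
To get from the logical vector to the two-group boxplot asked for, one possible continuation is sketched below on the question's example data. The column name group is an assumption introduced for the example.

```r
df <- data.frame(D = c("f", "f", "fi", "n", "ni", "ni", "fn", "fn"),
                 y = c(1, 0.8, 1.1, 2.1, 0.9, 8.8, 1.7, 5.4))

has_f <- grepl("f", df$D, fixed = TRUE)
has_n <- grepl("n", df$D, fixed = TRUE)

# Keep rows containing f or n, but not both.
df2 <- df[xor(has_f, has_n), ]

# Collapse f/fi into group "f" and n/ni into group "n".
df2$group <- factor(ifelse(grepl("f", df2$D, fixed = TRUE), "f", "n"),
                    levels = c("f", "n"))

boxplot(y ~ group, data = df2)
```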
We all cut our teeth on R at some point, so I'll try to construct an example for you that fits the question. How about:
# simulate a data.frame with "all possible combinations" of singles and pairs
df <- data.frame(txt = as.character(outer(c("i", "f", "n"), c("", "i", "f", "n"), paste0)),
stringsAsFactors = FALSE)
# create an empty factor variable to contain the result
df$has_only <- factor(rep(NA, nrow(df)), levels = 1:2, labels = c("f", "n"))
# replace with codes if contains either f or n, not both(f, n)
df$has_only[which(grepl("f", df$txt) & !grepl("f.*n|n.*f", df$txt))] <- "f"
df$has_only[which(grepl("n", df$txt) & !grepl("f.*n|n.*f", df$txt))] <- "n"
df
## txt has_only
## 1 i <NA>
## 2 f f
## 3 n n
## 4 ii <NA>
## 5 fi f
## 6 ni n
## 7 if f
## 8 ff f
## 9 nf <NA>
## 10 in n
## 11 fn <NA>
## 12 nn n
plot(df$has_only)
Note that this is a bar plot, not a box plot, since a box plot would only plot the range of continuous values, and you have not specified what are the continuous values or what they would look like. But if you did have such a variable, say df$myvalue, then you could produce a box plot with:
# simulate some continuous data
set.seed(50)
df$myvalue <- runif(nrow(df))
boxplot(myvalue ~ has_only, data = df)

How to multiply across columns in a dataset and save to a new dataset without changing the original data in R

I want to multiply all the values in columns e.g. by 5, and then save the results into a new dataset, without changing the data being read in.
Using a loop I use the following R code:
raw_data[,i]<-raw_data[,i]*5
What I want is to keep the original data as it is, raw_data, and save the multiplied data into e.g. new_data:
new_data[,i]<-raw_data[,i]*5
I get an error saying the object 'new_data' is not found.
Is there a neat way of doing this, or do you have to create the new_data object first as an empty dataset?
No need for loops here.
# a toy data frame
raw_data <- data.frame(x = 1:2, y = 3:4, z = 5:6)
# same applies if you have your data in a matrix
# raw_data <- matrix(1:6, ncol = 3)
raw_data
# x y z
# 1 1 3 5
# 2 2 4 6
new_data <- raw_data * 5
new_data
# x y z
# 1 5 15 25
# 2 10 20 30
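
If the real data also contains non-numeric columns, multiplying the whole data frame would produce NAs or errors for those columns. A possible sketch for that case, assuming a made-up id column: copy the data first, then overwrite only the numeric columns.

```r
# Toy data with one non-numeric column (assumed for the example).
raw_data <- data.frame(id = c("a", "b"), x = 1:2, y = 3:4,
                       stringsAsFactors = FALSE)

# Start from a copy so that raw_data itself is never modified.
new_data <- raw_data

# Multiply only the numeric columns.
is_num <- vapply(raw_data, is.numeric, logical(1))
new_data[is_num] <- raw_data[is_num] * 5

new_data
```

This also answers the question about the loop: new_data has to exist before columns can be assigned into it, and copying raw_data is the simplest way to create it.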
