Randomly replacing values in an existing matrix in R

I have an existing matrix and I want to replace some of the existing values with NAs, chosen uniformly at random.
I tried to use the following, but it only replaced 392 values with NA, not 452 as I expected. What am I doing wrong?
N <- 452
ind1 <- (runif(N,2,length(macro_complet$Sod)))
macro_complet$Sod[ind1] <- NA
summary(macro_complet$Sod)
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
0.3222 0.9138 1.0790 1.1360 1.3010 2.8610 392.0000
My data looks like this
> str(macro_complet)
'data.frame': 1504 obs. of 26 variables:
$ Sod : num 8.6 13.1 12 13.8 12.9 10 7 14.8 11.3 4.9 ...
$ Azo : num 2 1.7 2.2 1.9 1.89 1.61 1.72 2.1 1.63 2 ...
$ Cal : num 26 28.1 24 28.5 24.5 24 17.4 26.6 24.8 10.5 ...
$ Bic : num 72 82 81 84 77 68 66 81 70 37.8 ...
$ DBO : num 3 2.2 3 2.7 3.3 3 3.2 2.9 2.8 2 ...
$ AzoK : num 0.7 0.7 0.9 0.8 0.7 0.7 0.7 0.9 0.7 0.7 ...
$ Orho : num 0.3 0.2 0.31 0.19 0.19 0.2 0.16 0.24 0.2 0.01 ...
$ Ammo : num 0.12 0.16 0.15 0.13 0.19 0.22 0.19 0.16 0.17 0.08 ...
$ Carb : num 0.3 0.3 2 0.3 0.3 0.3 0.3 0.3 0.3 0.5 ...
$ Ox : num 10.2 9.7 9.8 9.6 9.7 9.1 9.1 8.1 9.7 10.6 ...
$ Mag : num 5.5 6.5 6.3 7 6.4 5.1 6 6.7 5.7 2 ...
$ Nit : num 4.2 4.7 5.7 4.6 4.2 3.5 4.9 4.5 4.2 2.8 ...
$ Matsu : num 17 9 24 15 17 19 20 19 13 3.9 ...
$ Tp : num 10.5 9.7 11.9 12 12.9 11.2 12.8 13.7 11.5 10.6 ...
$ Co : num 3 3.45 3.3 3.54 2.7 2.7 3.3 3.49 2.8 1.8 ...
$ Ch : num 17 24 22 28 25 19 13 28 23 6.4 ...
$ Cu : num 25 15 20 20 15 20 15 15 20 15 ...
$ Po : num 3.5 3.8 4 3.6 3.8 3.7 3 4.2 3.7 0.4 ...
$ Ph : num 0.2 0.17 0.2 0.14 0.18 0.2 0.17 0.17 0.17 0.01 ...
$ Cnd : int 226 275 285 295 272 225 267 283 251 61 ...
$ Txs : num 93 88 89 86 87 88 84 80 91 94 ...
$ Niti : num 0.06 0.09 0.07 0.06 0.08 0.07 0.08 0.11 0.1 0.01 ...
$ Dt : num 9 9.7 9 10.2 8 8 7 9.4 8.5 3 ...
$ H : num 7.6 7.7 7.6 7.7 7.55 7.4 7.3 7.5 7.5 7.6 ...
$ Dco : int 17 12 15 13 15 20 16 14 12 7 ...
$ Sf : num 22 20.5 18 22.2 22.1 21 11.6 21.7 21.9 6.8 ...
I also tried to do this for only a single variable, but got the same result.
I converted my data frame into a matrix using
as.matrix(n1)
then I replaced some values for only one variable
N <- 300
ind <- (runif(N,1,length(n1$Sodium)))
n1$Sodium[ind] <- NA
However, using summary() I observed that only 262 values were replaced instead of 300 as expected. What am I doing wrong?
summary(n1$Sodium)
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
0.3222 0.8976 1.0790 1.1320 1.3010 2.8610 262.0000

Try this. This will sample your matrix uniformly without replacement (so the same value is not chosen and replaced twice). If you want some other distribution, you can modify the weights using the prob argument (see ?sample)
vec <- matrix(1:25, nrow = 5)
vec[sample(1:length(vec), 4, replace = FALSE)] <- NA
vec
[,1] [,2] [,3] [,4] [,5]
[1,] NA 6 NA 16 NA
[2,] NA 7 12 17 22
[3,] 3 8 13 18 23
[4,] 4 9 14 19 24
[5,] 5 10 15 20 25
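The shortfall in the question comes from duplicate indices: runif() returns non-integer values, `[` truncates them to whole numbers, and some positions are then hit more than once. With 452 draws over roughly 1,500 positions, about 390 distinct positions are expected, which is consistent with the 392 NAs observed. A minimal sketch of the difference, using a simulated vector as a stand-in for macro_complet$Sod (the data and the seed are assumptions for illustration only):

```r
set.seed(1)                      # seed chosen only to make the sketch repeatable
x <- rnorm(1504)                 # hypothetical stand-in for macro_complet$Sod
N <- 452

# runif() indices: non-integer, truncated by `[`, and possibly repeated
idx_runif  <- runif(N, 1, length(x))
n_distinct <- length(unique(trunc(idx_runif)))   # at most N, usually fewer

# sample() without replacement: exactly N distinct positions
idx_sample <- sample(length(x), N)
x[idx_sample] <- NA
sum(is.na(x))                    # number of NAs actually introduced
```

Because sample() draws without replacement by default, the number of NAs equals N whenever N does not exceed length(x).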

You must apply runif in the right spot, which is the index to vec. (The way you have it now, you are asking R to draw random numbers from a uniform distribution between NA and NA, which of course does not make sense, so it gives you back NaNs.)
Try instead:
N <- 5 # the number of random values to replace
inds <- round ( runif(N, 1, length(vec)) ) # draw random values from [1, length(vec)]
vec[inds] <- NA # use the random values as indices to vec, for which to replace
Note that it is not strictly necessary to use round(.), since [ accepts numeric indices and truncates them toward zero, which makes the result only approximately uniform over the positions. Either way the indices can repeat, so fewer than N distinct values may end up replaced.

We could use
vec[sample(seq_along(vec), 4, replace = FALSE)] <- NA

Related

How to use arithmetic with tapply() in R

I'm calling height, diameter and age from a csv file. I'm trying to calculate the volume of the tree using pi x h x r^2. In order to calculate the radius, I'm taking dbh and dividing it by 2. Then I get this error.
Error in dbh/2 : non-numeric argument to binary operator
setwd("/Users/user/Desktop/")
treeg <- read.csv("treeg.csv",row.names=1)
head(treeg)
heights <- tapply(treeg$height.ft,treeg$forest, identity)
ages <- tapply(treeg$age,treeg$forest, identity)
dbh <- tapply(treeg$dbh.in,treeg$forest, identity)
radius <- dbh / 2
The vector dbh stores the diameters from the csv file, grouped by forest, which is the ID.
How can I divide dbh by 2 while still keeping each value stored under its respective ID (the forest, i.e. treeg$forest)? treeg is the data frame read from the csv file.
> head(treeg)
tree.ID forest habitat dbh.in height.ft age
1 1 4 5 14.6 71.4 55
2 1 4 5 12.4 61.4 45
3 1 4 5 8.8 40.1 35
4 1 4 5 7.0 28.6 25
5 1 4 5 4.0 19.6 15
6 2 4 5 20.0 103.4 107
str(dbh)
List of 9
$ 1: num [1:36] 19.9 18.6 16.2 14.2 12.3 9.4 6.8 4.9 2.6 22 ...
$ 2: num [1:60] 16.5 15.5 14.5 13.7 12.7 11.4 9.5 8 5.9 4.1 ...
$ 3: num [1:50] 18.4 17.2 15.6 13.7 11.6 8.5 5.3 2.8 13.3 10.6 ...
$ 4: num [1:81] 14.6 12.4 8.8 7 4 20 18.8 17 15.9 14 ...
$ 5: num [1:153] 28 27.2 26.1 25 23.7 21.3 19 16.7 12.2 9.8 ...
$ 6: num [1:22] 21.3 20.2 19.1 18 16.9 15.6 14.8 13.3 11.3 9.2 ...
$ 7: num [1:63] 13.9 12.4 10.6 8.1 5.8 3.4 27 25.6 23 20.2 ...
$ 8: num [1:27] 20.8 17.7 15.6 13.2 10.5 7.5 4.8 2.9 12.9 11.3 ...
$ 9: num [1:50] 23.6 20.5 16.9 14.1 11.1 8 5.1 2.9 24.1 20.9 ...
- attr(*, "dim")= int 9
- attr(*, "dimnames")=List of 1
..$ : chr [1:9] "1" "2" "3" "4" ...
Are you just trying to create a radius column that is dbh.in divided by two?
treeg <- read.table(textConnection("tree.ID forest habitat dbh.in height.ft age
1 1 4 5 14.6 71.4 55
2 1 4 5 12.4 61.4 45
3 1 4 5 8.8 40.1 35
4 1 4 5 7.0 28.6 25
5 1 4 5 4.0 19.6 15
6 2 4 5 20.0 103.4 107"), header=TRUE)
treeg$radius <- treeg$dbh.in / 2
Or do you need that dbh list for something...
dbh <- tapply(treeg$dbh.in,treeg$forest, identity)
> dbh
$`4`
[1] 14.6 12.4 8.8 7.0 4.0 20.0
lapply(dbh, function(x)x/2)
List of 1
$ 4: num [1:6] 7.3 6.2 4.4 3.5 2 10
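The original error arises because tapply() with identity returns a list (one numeric vector per forest), and `/` is not defined for lists. A minimal sketch of doing the arithmetic on the data frame columns first and grouping afterwards; the rows, the column names radius.ft and volume.ft3, and the inch-to-foot conversion are assumptions for illustration:

```r
# Hypothetical sample rows in the same layout as treeg
treeg <- data.frame(
  forest    = c(4, 4, 4, 2, 2),
  dbh.in    = c(14.6, 12.4, 8.8, 20.0, 18.8),
  height.ft = c(71.4, 61.4, 40.1, 103.4, 95.0)
)

# Vectorised arithmetic works directly on numeric columns
treeg$radius.ft  <- treeg$dbh.in / 2 / 12            # inches to feet (assumed unit handling)
treeg$volume.ft3 <- pi * treeg$height.ft * treeg$radius.ft^2

# Group afterwards: a named list with one numeric vector per forest ID
volume_by_forest <- split(treeg$volume.ft3, treeg$forest)
str(volume_by_forest)
```

Grouping last keeps every value attached to its forest ID without having to do arithmetic on a list.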

How to match across 2 data frames IDs and run operations in R loop?

I have 2 data frames, the sampling ("samp") and the coordinates ("coor").
The "samp" data frame:
Plot X Y H L
1 6.4 0.6 3.654 0.023
1 19.1 9.3 4.998 0.023
1 2.4 4.2 5.568 0.024
1 16.1 16.7 5.32 0.074
1 10.8 15.8 6.58 0.026
1 1 16 4.968 0.023
1 9.4 12.4 6.804 0.078
2 3.6 0.4 4.3 0.038
3 12.2 19.9 7.29 0.028
3 2 18.2 7.752 0.028
3 6.5 19.9 7.2 0.028
3 3.7 13.8 5.88 0.042
3 4.9 10.3 9.234 0.061
3 3.7 13.8 5.88 0.042
3 4.9 10.3 9.234 0.061
4 16.3 2.4 5.18 0.02
4 15.7 9.8 10.92 0.096
4 6 12.6 6.96 0.16
5 19.4 16.4 8.2 0.092
10 4.8 5.16 7.38 1.08
11 14.7 16.2 16.44 0.89
11 19 19 10.2 0.047
12 10.8 2.7 19.227 1.2
14 0.6 6.4 12.792 0.108
14 4.6 1.9 12.3 0.122
15 12.2 18 9.6 0.034
16 13 18.3 4.55 0.021
The "coor" data frame:
Plot X Y
1 356154.007 501363.546
2 356154.797 501345.977
3 356174.697 501336.114
4 356226.469 501336.816
5 356255.24 501352.714
10 356529.313 501292.4
11 356334.895 501320.725
12 356593.271 501255.297
14 356350.029 501314.385
15 356358.81 501285.955
16 356637.29 501227.297
17 356652.157 501263.238
18 356691.68 501262.403
19 356755.386 501242.501
20 356813.735 501210.59
22 356980.118 501178.974
23 357044.996 501168.859
24 357133.365 501158.418
25 357146.781 501158.866
26 357172.485 501161.646
I want to use a for loop to register the "samp" data frame with the GPS coordinates from the "coor" data frame; e.g. the "new_x" variable is the sum of "X" from "samp" and "X" from "coor" for the same "Plot" ID.
This is what I tried, but it is not working.
for (i in 1:nrow(samp)){
if (samp$Plot[i]==coor$Plot[i]){
(samp$new_x[i]<-(coor$X[i] + samp$X[i]))
} else (samp$new_x[i]<-samp$X[i])
}
The final output I want is the "samp" data frame with a proper coordinate variable ("new_x") added. It should look like this:
Plot X Y H L new_x
1 6.4 0.6 3.654 0.023 356160.407
1 19.1 9.3 4.998 0.023 356173.107
1 2.4 4.2 5.568 0.024 356156.407
1 16.1 16.7 5.32 0.074 356170.107
1 10.8 15.8 6.58 0.026 356164.807
1 1 16 4.968 0.023 356155.007
1 9.4 12.4 6.804 0.078 356163.407
2 3.6 0.4 4.3 0.038 356158.397
3 12.2 19.9 7.29 0.028 356186.897
3 2 18.2 7.752 0.028 356176.697
3 6.5 19.9 7.2 0.028 356181.197
3 3.7 13.8 5.88 0.042 356178.397
3 4.9 10.3 9.234 0.061 356179.597
3 3.7 13.8 5.88 0.042 356178.397
3 4.9 10.3 9.234 0.061 356179.597
4 16.3 2.4 5.18 0.02 356242.769
4 15.7 9.8 10.92 0.096 356242.169
4 6 12.6 6.96 0.16 356232.469
5 19.4 16.4 8.2 0.092 356274.64
10 4.8 5.16 7.38 1.08 356534.113
11 14.7 16.2 16.44 0.89 356349.595
11 19 19 10.2 0.047 356353.895
Any suggestion will be appreciated. Thanks.
You could merge the two datasets and create a new column by summing the X.x and X.y variables.
res <- transform(merge(samp, coor, by='Plot'), new_x=X.x+X.y)[,-c(6:7)]
colnames(res) <- colnames(out) # `out` is the expected result shown in the question
all.equal(res[1:22,], out, check.attributes=FALSE)
#[1] TRUE
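The loop in the question fails because samp$Plot[i] == coor$Plot[i] compares row i of both data frames, and the two are not aligned row by row. A minimal sketch of a lookup with match(), which keeps the row order of samp; only a few rows of each data frame are reproduced here, and the names row_in_coor and new_y are illustrative:

```r
# A few rows from each data frame in the question
samp <- data.frame(Plot = c(1, 1, 2, 3),
                   X = c(6.4, 19.1, 3.6, 12.2),
                   Y = c(0.6, 9.3, 0.4, 19.9))
coor <- data.frame(Plot = c(1, 2, 3),
                   X = c(356154.007, 356154.797, 356174.697),
                   Y = c(501363.546, 501345.977, 501336.114))

# For each row of samp, find the row of coor with the same Plot ID
row_in_coor <- match(samp$Plot, coor$Plot)

# Add the plot origin to the local coordinates (NA if a Plot has no match)
samp$new_x <- samp$X + coor$X[row_in_coor]
samp$new_y <- samp$Y + coor$Y[row_in_coor]
samp
```

Unlike merge(), this does not reorder rows or rename columns, so no clean-up of X.x and X.y is needed.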

How to calculate widths for read.fwf when columns have different widths in R

I am trying to load a file into R (skipping the first 4 lines):
http://www.cpc.ncep.noaa.gov/data/indices/wksst8110.for
This is a fixed-width file, and I don't know how to work out the widths from the file.
Can someone please tell me how to load a fixed-width file into R?
Create a ruler on your console:
cat(">",paste0(rep(c(1:9,"+"),6),collapse=""))
Paste in the first line, then count:
> cat(">",paste0(rep(c(1:9,"+"),6),collapse=""))
> 123456789+123456789+123456789+123456789+123456789+123456789+
> 03JAN1990 23.4-0.4 25.1-0.3 26.6 0.0 28.6 0.3
Error: unexpected symbol in "03JAN1990"
If you look at the file you see that the only places where there are missing whitespaces are the columns with minus-signs. So another way would be to replace all the instances of "-" with " -" , i.e. to create whitespace where it is needed and then read with read.table:
dat <- read.table(text= gsub("\\-", " -",
readLines(url("http://www.cpc.ncep.noaa.gov/data/indices/wksst8110.for"))),
skip=4)
> str(dat)
'data.frame': 1284 obs. of 9 variables:
$ V1: Factor w/ 1284 levels "01APR1992","01APR1998",..: 98 394 689 984 1266 265 560 855 1150 279 ...
$ V2: num 23.4 23.4 24.2 24.4 25.1 25.8 25.9 26.1 26.1 26.7 ...
$ V3: num -0.4 -0.8 -0.3 -0.5 -0.2 0.2 -0.1 -0.1 -0.2 0.3 ...
$ V4: num 25.1 25.2 25.3 25.5 25.8 26.1 26.4 26.7 26.7 26.7 ...
$ V5: num -0.3 -0.3 -0.3 -0.4 -0.2 -0.1 0 0.2 -0.1 -0.2 ...
$ V6: num 26.6 26.6 26.5 26.5 26.7 26.8 26.9 27.1 27.2 27.3 ...
$ V7: num 0 0.1 -0.1 -0.1 0.1 0.1 0.2 0.3 0.3 0.2 ...
$ V8: num 28.6 28.6 28.6 28.4 28.4 28.4 28.5 28.9 29 28.9 ...
$ V9: num 0.3 0.3 0.3 0.2 0.2 0.3 0.4 0.8 0.8 0.7 ...
You could even skip only the first three lines and get headers:
> dat <- read.table(text= gsub("\\-", " -", readLines(url("http://www.cpc.ncep.noaa.gov/data/indices/wksst8110.for"))),
header=TRUE, skip=3)
> str(dat)
'data.frame': 1284 obs. of 9 variables:
$ Week : Factor w/ 1284 levels "01APR1992","01APR1998",..: 98 394 689 984 1266 265 560 855 1150 279 ...
$ SST : num 23.4 23.4 24.2 24.4 25.1 25.8 25.9 26.1 26.1 26.7 ...
$ SSTA : num -0.4 -0.8 -0.3 -0.5 -0.2 0.2 -0.1 -0.1 -0.2 0.3 ...
$ SST.1 : num 25.1 25.2 25.3 25.5 25.8 26.1 26.4 26.7 26.7 26.7 ...
$ SSTA.1: num -0.3 -0.3 -0.3 -0.4 -0.2 -0.1 0 0.2 -0.1 -0.2 ...
$ SST.2 : num 26.6 26.6 26.5 26.5 26.7 26.8 26.9 27.1 27.2 27.3 ...
$ SSTA.2: num 0 0.1 -0.1 -0.1 0.1 0.1 0.2 0.3 0.3 0.2 ...
$ SST.3 : num 28.6 28.6 28.6 28.4 28.4 28.4 28.5 28.9 29 28.9 ...
$ SSTA.3: num 0.3 0.3 0.3 0.2 0.2 0.3 0.4 0.8 0.8 0.7 ...
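Once the widths have been counted with a ruler like the one above, they can be passed directly to read.fwf (from the utils package). A minimal sketch on two in-memory lines; the exact spacing of these sample lines is an assumption modelled on the first data row, not the real layout of the file, so the widths would need to be recounted for the actual data:

```r
# Two hypothetical lines: a 9-character date, then value/anomaly pairs
# in which a minus sign can touch the preceding number
lines <- c("03JAN1990 23.4-0.4 25.1-0.3",
           "10JAN1990 23.4-0.8 25.2-0.3")

# Widths counted against a ruler: 9 (date), 5 (" 23.4"), 4 ("-0.4"),
# 5 (" 25.1"), 4 ("-0.3")
widths <- c(9, 5, 4, 5, 4)

dat <- read.fwf(textConnection(lines), widths = widths)
str(dat)
```

Leading spaces inside a numeric field are stripped when the column is converted, so a field such as " 23.4" is read as a number.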
I'm a complete newbie in R, so don't be too harsh.
I got stuck on this quiz too and searched everywhere I could. Still, I couldn't find a function that calculates this argument fully programmatically (for example, how would I know, as in the answer above, that there are minus signs that need special handling?). So I wrote a simple function that does it. I assumed that every new column in the file starts with a non-space character in the header line, and that if a header is shorter than the width of its column, the header is padded with spaces at the end.
It works, though perhaps awkwardly, and it helped for my task. Anyway, you are welcome to take a look at my "widths.R" and use it, correct it and so on if you like.
## example url: https://d396qusza40orc.cloudfront.net/getdata%2Fwksst8110.for
## or (the same) http://www.cpc.ncep.noaa.gov/data/indices/wksst8110.for
myurl <- "url"
l <- readLines(myurl)
head(l) ## looking for headers line number
myh <- NUMBER ## WRITE your headers line NUMBER (in my ex. myh <- 4)
widths.fwf <- function(url = myurl, h = myh) ## h: headers line number
{
x <- readLines(url, n = h)
y <- strsplit(x[[h]], "") ## headers line, split into characters
v <- as.vector(y[[1]]) ## vector of headers line characters
b <- ifelse(v == " ", 0, 1) ## binary var for every character: empty (0) or filled (1)
p <- numeric() ## vector to find the places of every header start
## isTRUE() guards the last step, where b[i+1] is past the end of b (NA)
for (i in 2:length(b)) if (isTRUE(b[i] == 0 && b[i+1] == 1)) p[i] <- i else p[i] <- 0
pp <- which(p != 0) ## only places of every header start
ppp <- numeric() ## to be vector of "widths"
ppp[1] <- pp[1]
for (i in 2:length(pp)) ppp[i] <- pp[i] - pp[i-1]
ppp[length(pp)+1] <- length(p) - pp[length(pp)]
return(ppp)}
myppp <- widths.fwf()
## read.fwf is in utils, which is attached by default, so no extra package is needed
t <- read.fwf(myurl, widths = myppp, skip = myh) ## our table ".for"
head(t)
You can use readr::read_fwf.
Set the widths to match the fields you want to parse:
library(readr)
nao <- read_fwf("https://www.cpc.ncep.noaa.gov/data/indices/wksst8110.for",
fwf_widths(c(15, 4, 9, 4, 9, 4, 9, 4,4),
col_names = c("week",
"Nino1+2_sst",
"Nino1+2_ssta",
"Nino3_sst",
"Nino3_ssta",
"Nino34_sst",
"Nino34_ssta",
"Nino4_sst",
"Nino4_ssta")),
skip =4)

Setting the number of digits to a given value

Is there a way that one can set the number of digits of a full data frame to 2? Case in point, how would you set the number of digits to 2 for the following data using R?
Distance Age Height Coning
1 21.4 18 3.3 Yes
2 13.9 17 3.4 Yes
3 23.9 16 2.9 Yes
4 8.7 18 3.6 No
5 241.8 6 0.7 No
6 44.5 17 1.3 Yes
7 30.0 15 2.5 Yes
8 32.3 16 1.8 Yes
9 31.4 17 5.0 No
10 32.8 13 1.6 No
11 53.3 12 2.0 No
12 54.3 6 0.9 No
13 96.3 11 2.6 No
14 133.6 4 0.6 No
15 32.1 15 2.3 No
16 57.9 12 2.4 Yes
17 30.8 17 1.8 No
18 59.9 7 0.8 No
19 42.7 15 2.0 Yes
20 20.6 18 1.7 Yes
21 62.0 8 1.3 No
22 53.1 7 1.6 No
23 28.9 16 2.2 Yes
24 177.4 5 1.1 No
25 24.8 14 1.5 Yes
26 75.3 14 2.3 Yes
27 51.6 7 1.4 No
28 36.1 9 1.1 No
29 116.1 6 1.1 No
30 28.1 16 2.5 Yes
31 8.7 19 2.2 Yes
32 105.1 6 0.8 No
33 46.0 15 3.0 Yes
34 102.6 7 1.2 No
35 15.8 15 2.2 No
36 60.0 7 1.3 No
37 96.4 13 2.6 No
38 24.2 14 1.7 No
39 14.5 15 2.4 No
40 36.6 14 1.5 No
41 65.7 5 0.6 No
42 116.3 7 1.6 No
43 113.6 8 1.0 No
44 16.7 15 4.3 Yes
45 66.0 7 1.0 No
46 60.7 7 1.0 No
47 90.6 7 0.7 No
48 91.3 7 1.3 No
49 14.4 18 3.1 Yes
50 72.8 14 3.0 Yes
It sounds like you're just looking for format, which has a data.frame method.
A small example:
mydf <- mydf2 <- data.frame(
Distance = c(21.4, 13.9, 23.9, 8.7, 241.8, 44.5),
Age = c(18, 17, 16, 18, 6, 17),
Height = c(3.3, 3.4, 2.9, 3.6, 0.7, 1.3),
Coning = c("Y", "Y", "Y", "N", "N", "Y"))
format(mydf, nsmall = 2)
# Distance Age Height Coning
# 1 21.40 18.00 3.30 Y
# 2 13.90 17.00 3.40 Y
# 3 23.90 16.00 2.90 Y
# 4 8.70 18.00 3.60 N
# 5 241.80 6.00 0.70 N
# 6 44.50 17.00 1.30 Y
As you should expect, if the data are integers, they won't be printed as decimals.
mydf2$Age <- as.integer(mydf2$Age)
format(mydf2, nsmall = 2)
# Distance Age Height Coning
# 1 21.40 18 3.30 Y
# 2 13.90 17 3.40 Y
# 3 23.90 16 2.90 Y
# 4 8.70 18 3.60 N
# 5 241.80 6 0.70 N
# 6 44.50 17 1.30 Y
An alternative to format is to globally set the option for the number of digits to be displayed:
options(digits=2)
This means that from that point forward numerics will be printed with a minimum of 2 significant digits, which is not the same as 2 decimal places (the default is 7).
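Both format() and options(digits = ) only affect how numbers are displayed. If the goal is instead to change the stored values, one option is to round the numeric columns in place. A minimal sketch, assuming a small data frame whose extra decimals were added here purely for illustration:

```r
# Hypothetical data with more than two decimals in some cells
mydf <- data.frame(
  Distance = c(21.437, 13.9, 241.8),
  Height   = c(3.333, 3.4, 0.7),
  Coning   = c("Yes", "Yes", "No")
)

# Identify numeric columns, then round only those to 2 decimal places
num_cols <- vapply(mydf, is.numeric, logical(1))
mydf[num_cols] <- lapply(mydf[num_cols], round, digits = 2)
mydf
```

Rounding discards information, so format() is usually the safer choice when only the printed output matters.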

How to convert a dataframe to a time series

I have the following R dataframe:
> str(H.dark)
'data.frame': 86400 obs. of 18 variables:
$ groupname: Factor w/ 8 levels "rowA","rowB",..: 8 8 8 8 8 8 8 8 8 8 ...
$ location : Factor w/ 96 levels "c1","c10","c11",..: 84 85 86 87 88 90 91 92 93 94 ...
$ starttime: int 7200 7200 7200 7200 7200 7200 7200 7200 7200 7200 ...
$ dark : int 1 1 1 1 1 1 1 1 1 1 ...
$ inadist : num 2.2 2.6 3.9 3.8 2.5 2.2 3 1.3 2.7 1.2 ...
$ smldist : num 11.5 22.6 9 18.6 19.1 18.9 11.7 28 13.9 9.8 ...
$ lardist : num 9.8 5.8 1.9 3.1 2.1 1.3 6 13.6 3.4 0.9 ...
$ emptydur : num 1.5 0.4 0 0.2 0.3 0.7 0.4 0.4 0.1 2 ...
$ inadur : num 2.1 2 3.3 2 2.5 1.8 2.8 1.2 2.6 1.7 ...
$ smldur : num 1.2 2.4 1.6 2.7 2.1 2.5 1.6 2.8 2.1 1.2 ...
$ lardur : num 0.2 0.3 0.1 0.1 0.1 0.1 0.3 0.6 0.1 0.1 ...
$ emptyct : int 5 7 1 1 5 1 1 3 1 2 ...
$ entct : int 0 0 0 0 0 0 0 0 0 0 ...
$ inact : int 8 10 9 11 13 9 12 7 7 6 ...
$ smlct : int 10 16 11 14 16 12 16 15 9 7 ...
$ larct : int 2 3 1 1 1 1 3 5 1 1 ...
I wish to turn it into a timeseries using ts() because ts has some nice functions for lags and other time specific analyses. The starttime column is the time. The min value is 7200 seconds and the max is 43200. Observations occur every 5 seconds. I've read the ts documentation and I think I may first have to convert my dataframe to a matrix. But I'm completely unclear on how I need to rearrange the data, or if it indeed needs to be rearranged.
Can anyone offer some clarity?
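One possible approach, sketched under the assumption that each location forms its own regularly spaced series: subset a single location, sort by starttime, and pass the measurement to ts() with the start time and the 5-second spacing. The small data frame below is a hypothetical stand-in for H.dark, not the real data:

```r
# Hypothetical stand-in for H.dark: two locations, one observation every 5 seconds
H.dark <- data.frame(
  location  = rep(c("c1", "c2"), each = 4),
  starttime = rep(seq(7200, by = 5, length.out = 4), times = 2),
  inadist   = c(2.2, 2.6, 3.9, 3.8, 2.5, 2.2, 3.0, 1.3)
)

# One series per location: subset, then make sure rows are in time order
one <- H.dark[H.dark$location == "c1", ]
one <- one[order(one$starttime), ]

# start is the first time stamp; deltat is the spacing between observations
inadist_ts <- ts(one$inadist, start = min(one$starttime), deltat = 5)
time(inadist_ts)   # time stamps carried by the series
```

Several numeric columns can be passed at once as a matrix, for example ts(as.matrix(one[, c("inadist")]), ...), which gives a multivariate series sharing the same time index.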
