How to use arithmetic with tapply() in R - r

I'm reading height, diameter and age from a csv file. I'm trying to calculate the volume of each tree using pi x h x r^2. To get the radius, I take dbh and divide it by 2, but then I get this error:
Error in dbh/2 : non-numeric argument to binary operator
setwd("/Users/user/Desktop/")
treeg <- read.csv("treeg.csv",row.names=1)
head(treeg)
heights <- tapply(treeg$height.ft,treeg$forest, identity)
ages <- tapply(treeg$age,treeg$forest, identity)
dbh <- tapply(treeg$dbh.in,treeg$forest, identity)
radius <- dbh / 2
The vector dbh stores the diameters from the csv file grouped by forest, which is the ID.
How can I divide dbh by 2 while keeping each value stored under its respective ID (the forest, treeg$forest)? treeg is the data frame read from the csv file.
> head(treeg)
tree.ID forest habitat dbh.in height.ft age
1 1 4 5 14.6 71.4 55
2 1 4 5 12.4 61.4 45
3 1 4 5 8.8 40.1 35
4 1 4 5 7.0 28.6 25
5 1 4 5 4.0 19.6 15
6 2 4 5 20.0 103.4 107
str(dbh)
List of 9
$ 1: num [1:36] 19.9 18.6 16.2 14.2 12.3 9.4 6.8 4.9 2.6 22 ...
$ 2: num [1:60] 16.5 15.5 14.5 13.7 12.7 11.4 9.5 8 5.9 4.1 ...
$ 3: num [1:50] 18.4 17.2 15.6 13.7 11.6 8.5 5.3 2.8 13.3 10.6 ...
$ 4: num [1:81] 14.6 12.4 8.8 7 4 20 18.8 17 15.9 14 ...
$ 5: num [1:153] 28 27.2 26.1 25 23.7 21.3 19 16.7 12.2 9.8 ...
$ 6: num [1:22] 21.3 20.2 19.1 18 16.9 15.6 14.8 13.3 11.3 9.2 ...
$ 7: num [1:63] 13.9 12.4 10.6 8.1 5.8 3.4 27 25.6 23 20.2 ...
$ 8: num [1:27] 20.8 17.7 15.6 13.2 10.5 7.5 4.8 2.9 12.9 11.3 ...
$ 9: num [1:50] 23.6 20.5 16.9 14.1 11.1 8 5.1 2.9 24.1 20.9 ...
- attr(*, "dim")= int 9
- attr(*, "dimnames")=List of 1
..$ : chr [1:9] "1" "2" "3" "4" ...

Are you just trying to create a radius column that is dbh.in divided by two?
treeg <- read.table(textConnection("tree.ID forest habitat dbh.in height.ft age
1 1 4 5 14.6 71.4 55
2 1 4 5 12.4 61.4 45
3 1 4 5 8.8 40.1 35
4 1 4 5 7.0 28.6 25
5 1 4 5 4.0 19.6 15
6 2 4 5 20.0 103.4 107"), header=TRUE)
treeg$radius <- treeg$dbh.in / 2
Or do you need that dbh list for something...
dbh <- tapply(treeg$dbh.in,treeg$forest, identity)
> dbh
$`4`
[1] 14.6 12.4 8.8 7.0 4.0 20.0
lapply(dbh, function(x)x/2)
List of 1
$ 4: num [1:6] 7.3 6.2 4.4 3.5 2 10
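The error arises because tapply() with identity returns a list (one numeric vector per forest), and `/` is not defined for lists. As a minimal sketch of both routes (dividing inside the list with lapply, or computing on the data frame columns and splitting afterwards), assuming a small made-up data frame whose forest 5 rows are purely illustrative and leaving any inch/foot unit conversion aside:

```r
# Toy data: forest 4 values are taken from the question, forest 5 is made up
treeg <- data.frame(
  forest    = c(4, 4, 4, 5, 5),
  dbh.in    = c(14.6, 12.4, 8.8, 28.0, 27.2),
  height.ft = c(71.4, 61.4, 40.1, 90.0, 85.0)
)

# Route 1: keep the per-forest list and divide each element
dbh    <- tapply(treeg$dbh.in, treeg$forest, identity)  # a list, one vector per forest
radius <- lapply(dbh, function(x) x / 2)

# Route 2: do the arithmetic on the columns, then group the result
treeg$radius <- treeg$dbh.in / 2
treeg$volume <- pi * treeg$height.ft * treeg$radius^2   # pi x h x r^2, units not converted
volume_by_forest <- split(treeg$volume, treeg$forest)
```

Route 2 tends to be simpler because the columns stay aligned row by row, so height and radius never need to be matched up across separate lists.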

Related

qtgrace/xmgrace non-overlapping data sets

I'm using qtgrace on macOS, and when I plotted two data sets in qtgrace I got something like this:
Overlapping data sets
However, I would like to plot something like this:
Non-overlapping data sets
My data 1:
0 14
0.1 6
0.2 14
0.3 14
0.4 14
0.5 14
0.6 14
0.7 14
0.8 6
0.9 6
1 6
1.1 6
1.2 6
1.3 6
1.4 6
1.5 6
1.6 6
1.7 6
1.8 6
1.9 6
2 6
2.1 6
2.2 6
2.3 6
2.4 6
2.5 6
2.6 6
2.7 6
2.8 6
2.9 6
3 6
3.1 6
3.2 6
3.3 6
3.4 6
3.5 6
3.6 6
3.7 6
3.8 6
3.9 6
4 6
4.1 6
4.2 6
4.3 6
4.4 6
4.5 6
4.6 6
4.7 6
4.8 6
4.9 6
5 6
5.1 6
5.2 6
5.3 6
5.4 6
5.5 6
5.6 6
5.7 6
5.8 6
5.9 6
6 6
6.1 6
6.2 6
6.3 6
6.4 6
6.5 6
6.6 6
6.7 6
6.8 6
6.9 6
7 6
7.1 6
7.2 6
7.3 2
7.4 6
7.5 2
7.6 2
7.7 2
7.8 2
7.9 6
8 2
8.1 6
8.2 2
8.3 2
8.4 6
8.5 6
8.6 6
8.7 2
8.8 6
8.9 19
9 19
9.1 6
9.2 6
9.3 6
9.4 2
9.5 2
9.6 2
9.7 2
9.8 2
9.9 2
10 2
10.1 2
10.2 2
10.3 2
10.4 2
10.5 2
10.6 2
10.7 2
10.8 2
10.9 2
11 2
11.1 2
11.2 2
11.3 2
11.4 2
11.5 2
11.6 2
11.7 2
11.8 2
11.9 2
12 2
12.1 2
12.2 2
12.3 2
12.4 2
12.5 2
12.6 2
12.7 2
12.8 2
12.9 2
13 2
13.1 2
13.2 2
13.3 2
13.4 2
13.5 2
13.6 2
13.7 2
13.8 2
13.9 2
14 2
14.1 2
14.2 2
14.3 2
14.4 2
14.5 2
14.6 2
14.7 2
14.8 2
14.9 2
15 2
15.1 2
15.2 2
15.3 2
15.4 2
15.5 2
15.6 2
15.7 2
15.8 2
15.9 2
16 2
16.1 2
16.2 2
16.3 2
16.4 2
16.5 2
16.6 2
16.7 2
16.8 2
16.9 2
17 2
17.1 2
17.2 2
17.3 2
17.4 2
17.5 2
17.6 2
17.7 2
17.8 2
17.9 2
18 2
18.1 2
18.2 2
18.3 2
18.4 2
18.5 2
18.6 2
18.7 2
18.8 2
18.9 2
19 2
19.1 2
19.2 2
19.3 2
19.4 2
19.5 2
19.6 2
19.7 2
19.8 2
19.9 2
20 2
20.1 2
20.2 2
20.3 2
20.4 2
20.5 2
20.6 2
20.7 2
20.8 2
20.9 2
21 2
21.1 2
21.2 2
21.3 2
21.4 2
21.5 2
21.6 2
21.7 2
21.8 7
21.9 2
22 2
22.1 2
22.2 2
22.3 7
22.4 7
22.5 7
22.6 7
22.7 7
22.8 2
22.9 2
23 7
23.1 7
23.2 7
23.3 7
23.4 7
23.5 2
23.6 2
23.7 2
23.8 2
23.9 2
24 2
24.1 2
24.2 2
24.3 2
24.4 2
24.5 2
24.6 2
24.7 2
24.8 2
24.9 2
25 2
. .
. .
. .
Data 2:
0 4
0.1 4
0.2 4
0.3 4
0.4 4
0.5 4
0.6 4
0.7 4
0.8 4
0.9 4
1 2
1.1 4
1.2 4
1.3 4
1.4 4
1.5 4
1.6 4
1.7 4
1.8 4
1.9 4
2 4
2.1 4
2.2 4
2.3 4
2.4 4
2.5 4
2.6 4
2.7 4
2.8 4
2.9 4
3 4
3.1 4
3.2 4
3.3 4
3.4 4
3.5 4
3.6 4
3.7 4
3.8 4
3.9 4
4 4
4.1 4
4.2 4
4.3 4
4.4 4
4.5 4
4.6 4
4.7 4
4.8 4
4.9 4
5 4
5.1 4
5.2 4
5.3 4
5.4 4
5.5 4
5.6 4
5.7 4
5.8 4
5.9 4
6 4
6.1 4
6.2 4
6.3 4
6.4 4
6.5 4
6.6 4
6.7 4
6.8 4
6.9 4
7 4
7.1 4
7.2 4
7.3 4
7.4 4
7.5 4
7.6 4
7.7 4
7.8 4
7.9 4
8 4
8.1 4
8.2 4
8.3 4
8.4 2
8.5 4
8.6 4
8.7 4
8.8 4
8.9 4
9 4
9.1 4
9.2 4
9.3 4
9.4 4
9.5 4
9.6 4
9.7 4
9.8 4
9.9 4
10 4
10.1 4
10.2 4
10.3 4
10.4 4
10.5 2
10.6 2
10.7 4
10.8 2
10.9 2
11 2
11.1 2
11.2 4
11.3 4
11.4 2
11.5 2
11.6 2
11.7 2
11.8 2
11.9 2
12 2
12.1 2
12.2 2
12.3 2
12.4 4
12.5 4
12.6 2
12.7 2
12.8 4
12.9 2
13 2
13.1 4
13.2 4
13.3 4
13.4 4
13.5 10
13.6 2
13.7 2
13.8 2
13.9 2
14 2
14.1 2
14.2 2
14.3 10
14.4 2
14.5 2
14.6 4
14.7 2
14.8 2
14.9 4
15 2
15.1 10
15.2 2
15.3 2
15.4 2
15.5 2
15.6 2
15.7 2
15.8 2
15.9 2
16 2
16.1 2
16.2 2
16.3 2
16.4 2
16.5 2
16.6 2
16.7 2
16.8 2
16.9 2
17 2
17.1 2
17.2 2
17.3 2
17.4 2
17.5 2
17.6 2
17.7 2
17.8 2
17.9 2
18 2
18.1 2
18.2 2
18.3 2
18.4 2
18.5 2
18.6 2
18.7 2
18.8 2
18.9 2
19 2
19.1 2
19.2 2
19.3 2
19.4 2
19.5 2
19.6 2
19.7 2
19.8 2
19.9 2
20 2
20.1 2
20.2 2
20.3 2
20.4 2
20.5 2
20.6 2
20.7 2
20.8 2
20.9 2
21 2
21.1 2
21.2 2
21.3 2
21.4 2
21.5 2
21.6 2
21.7 2
21.8 2
21.9 2
22 2
22.1 2
22.2 2
22.3 2
22.4 2
22.5 2
22.6 2
22.7 2
22.8 2
22.9 2
23 2
23.1 2
23.2 2
23.3 2
23.4 2
23.5 2
23.6 2
23.7 2
23.8 2
23.9 2
24 2
24.1 2
24.2 2
24.3 2
24.4 2
24.5 2
24.6 2
24.7 2
24.8 2
24.9 2
25 2
. .
. .
. .
The data are in two separate xvg files from a GROMACS cluster analysis. I want to plot five different sets in such a way that I can see all the data without them being superimposed.
Thank you!
I think the best approach would be to write a script that takes the original files and spits out new files with shifted y values. However, since you have asked for a qt/xmgrace solution, here is how you do it:
Load up all the datasets into qtgrace
Open the "Data -> Transformations -> Evaluate expression..." dialog
Select in the left and right columns a dataset and in the textbox below enter the formula y = y + 0.1. Click "apply". This will shift the dataset up by 0.1
Select the next dataset in the same way and use the formula y = y + 0.2. Click apply
Rinse and repeat for all the datasets (changing the shift accordingly)
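The script-based approach mentioned at the start of this answer could look roughly like the sketch below. It assumes two-column xvg files in which header lines start with # or @; the function name shift_xvg and the file names in the comments are hypothetical.

```r
# Copy an xvg file, adding a constant offset to the y column
shift_xvg <- function(infile, outfile, offset) {
  lines <- readLines(infile)
  # Data lines are those that are non-empty and do not start with '#' or '@'
  is_data <- !grepl("^\\s*[#@]", lines) & nzchar(trimws(lines))
  dat <- read.table(text = lines[is_data])
  dat[[2]] <- dat[[2]] + offset
  # Write the untouched header lines first, then the shifted data
  writeLines(c(lines[!is_data], paste(dat[[1]], dat[[2]])), outfile)
}

# Hypothetical usage, one offset per data set:
# shift_xvg("cluster1.xvg", "cluster1_shifted.xvg", 0)
# shift_xvg("cluster2.xvg", "cluster2_shifted.xvg", 0.1)
```

The shifted files can then be loaded into qtgrace as usual; the size of the offset is a matter of taste and depends on the range of the y values.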

Multiplying matrices of different sizes

Good evening,
In RStudio
I have a problem multiplying these two matrices of different sizes. It is made harder because the values in the row with d2$ID == 1 must multiply only the rows of w with w$sample == 1.
sample and ID identify the same sample.
In other words, from the row d2$ID == 1, every single value ("L1", "ST", "GR", "CB", "HSK", "DDM") has to multiply the whole subset w$sample == 1 (4 rows in this case, but not always), that is, all the values "G2", "G4", "G6", "G8", "G12".
>d2
ID L1 ST GR CB HSK DDM
1 1 0.1662000 0.2337000 0.3637000 0.11110000 0.10100000 0.024300000
2 2 0.1896576 0.2280830 0.3705740 0.09406879 0.09319434 0.024422281
3 3 0.1110259 0.2217769 0.4180797 0.11122498 0.10902635 0.028866094
4 4 0.1558785 0.2008862 0.4222565 0.09805538 0.10218119 0.020742172
5 5 0.1536421 0.1674096 0.4205395 0.14362176 0.08635519 0.028431849
6 6 0.1841964 0.1514189 0.4603306 0.10243621 0.08928011 0.012337688
> w
sample G2 G4 G6 G8 G12
1 1 10.9 15.9 21.4 28.0 37.8
2 1 11.5 16.6 22.2 29.5 38.3
3 1 10.3 15.1 20.7 28.3 36.7
4 1 11.7 18.1 24.8 31.2 39.5
5 2 11.0 16.8 22.4 30.6 38.0
6 2 10.1 15.9 22.5 30.2 36.7
7 2 12.8 17.8 22.8 28.7 37.1
8 2 11.8 16.3 20.8 27.3 34.7
9 2 11.9 16.7 21.6 28.3 34.6
10 3 12.0 18.1 24.2 30.9 40.0
11 3 12.2 17.7 24.2 31.7 40.5
12 4 11.1 16.5 22.7 31.0 39.2
13 4 12.5 19.8 27.4 32.8 38.8
14 4 12.4 19.2 25.8 33.0 39.9
15 4 12.4 19.2 26.2 33.4 38.9
16 4 13.4 18.3 23.7 30.0 38.2
17 5 13.3 18.6 24.0 30.7 38.4
18 5 13.3 18.1 22.9 30.1 36.8
19 5 13.7 19.9 26.5 33.8 43.0
20 5 12.7 18.2 24.6 32.5 41.3
21 6 12.1 17.5 24.3 33.7 42.2
22 6 14.5 20.8 28.4 35.3 43.7
I have already checked a lot of questions but I can't figure it out, especially because most of the information is for matrices of the same size.
I tried filtering the data from d2, but the data set is really big, so that is really inefficient.
I am a beginner; if you think this is easy, I would appreciate at least a hint, please!
I have several data sets like these...
Thanks in advance!
This seems to perform as requested:
res <- apply(w, 1, function(x) {
  # x[1] is the sample ID; match() finds the corresponding row of d2
  unclass(outer(as.matrix(x[-1]),
                as.matrix(d2[match(x[1], d2$ID), c("L1", "ST", "GR", "CB", "HSK", "DDM")])))
})
str(res)
# result
# num [1:30, 1:22] 1.81 2.64 3.56 4.65 6.28 ...
# - attr(*, "dimnames")=List of 2
# ..$ : NULL
# ..$ : chr [1:22] "1" "2" "3" "4" ...
I almost got it right on the first pass but after some debugging found that I needed to add the as.matrix call to both arguments inside outer (so to speak ;-). To explain my logic ... I wanted to run down each row of w with apply and then use match on the value of the first column (of each row of w) to the unique row of d2. The match function is designed for just this purpose, to return a suitable number to be used for indexing. Then with the rest of the row (x[-1] by the time it was passed through the function call), I would use outer on the row values crossed with the desired row and columns of d2. If you do it without the as.matrix calls you get an error message:
Error in tcrossprod(x, y) :
requires numeric/complex matrix/vector arguments
I don't think that's a very informative error message. Both of the arguments were numeric vectors.
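That error typically shows up when outer() is handed a one-row data frame (which is a list) rather than a plain numeric vector. As an alternative sketch that sidesteps it with unlist() and keeps one small matrix per row of w, assuming cut-down versions of d2 and w with only two coefficient columns and two G columns:

```r
# Cut-down versions of the question's data (first rows and columns only)
d2 <- data.frame(ID = 1:2,
                 L1 = c(0.1662, 0.1896576),
                 ST = c(0.2337, 0.2280830))
w  <- data.frame(sample = c(1, 1, 2),
                 G2 = c(10.9, 11.5, 11.0),
                 G4 = c(15.9, 16.6, 16.8))

coef_cols <- setdiff(names(d2), "ID")      # columns of d2 to multiply with
g_cols    <- setdiff(names(w), "sample")   # columns of w to be multiplied
rows      <- match(w$sample, d2$ID)        # row of d2 belonging to each row of w

# One (coefficients x G) matrix of products per row of w
res <- lapply(seq_len(nrow(w)), function(i) {
  outer(unlist(d2[rows[i], coef_cols]), unlist(w[i, g_cols]))
})
```

Each element of res holds every coefficient of the matching d2 row multiplied by every G value of that w row. A sample with no matching ID would give a matrix of NA values, so it may be worth checking anyNA(rows) first.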

How to convert a character array to data frame

I have a character array dat which I want to convert to a data frame df but it is not working
head(dat)
[1] " 1931 1 5.0 0.6 11 78.4 43.4"
[2] " 1931 2 6.7 0.7 7 48.9 63.6"
[3] " 1931 4 10.4 3.1 3 44.6 110.1"
[4] " 1931 5 13.2 6.1 1 63.7 167.4"
[5] " 1931 6 15.4 8.0 0 87.8 150.3"
[6] " 1931 7 17.3 10.6 0 121.4 111.2"
> df<-as.data.frame(dat)
> head(df)
dat
1 1931 1 5.0 0.6 11 78.4 43.4
2 1931 2 6.7 0.7 7 48.9 63.6
3 1931 4 10.4 3.1 3 44.6 110.1
4 1931 5 13.2 6.1 1 63.7 167.4
5 1931 6 15.4 8.0 0 87.8 150.3
6 1931 7 17.3 10.6 0 121.4 111.2
df[,c(3)]
Error in `[.data.frame`(df, , c(3)) : undefined columns selected
Reading with read.table: You can rename as desired.
df<-read.table(text = " dat
1 1931 1 5.0 0.6 11 78.4 43.4
2 1931 2 6.7 0.7 7 48.9 63.6
3 1931 4 10.4 3.1 3 44.6 110.1
4 1931 5 13.2 6.1 1 63.7 167.4
5 1931 6 15.4 8.0 0 87.8 150.3
6 1931 7 17.3 10.6 0 121.4 111.2",
header=F,fill=T,as.is=T,skip = 1)
df[3]
V3
1 1
2 2
3 4
4 5
5 6
6 7
If dat is as shown reproducibly in the Note at the end then as.data.frame(dat) creates a data frame with one column called dat and then when there is an attempt to take the 3rd column an error results since there is only one column.
Instead, use read.table and get the third column like this. Omit the comma if you want a data frame result.
read.table(text = dat)[, 3]
## [1] 5.0 6.7 10.4 13.2 15.4 17.3
Note
dat <- c(" 1931 1 5.0 0.6 11 78.4 43.4",
" 1931 2 6.7 0.7 7 48.9 63.6",
" 1931 4 10.4 3.1 3 44.6 110.1",
" 1931 5 13.2 6.1 1 63.7 167.4",
" 1931 6 15.4 8.0 0 87.8 150.3",
" 1931 7 17.3 10.6 0 121.4 111.2")
Here's a tidyverse approach:
dat <- c(" 1931 1 5.0 0.6 11 78.4 43.4",
" 1931 2 6.7 0.7 7 48.9 63.6",
" 1931 4 10.4 3.1 3 44.6 110.1",
" 1931 5 13.2 6.1 1 63.7 167.4",
" 1931 6 15.4 8.0 0 87.8 150.3",
" 1931 7 17.3 10.6 0 121.4 111.2")
library(tidyverse)
str_trim(dat) %>% # trim leading space
tibble(x = .) %>% # put into tibble (data.frame)
separate(x, # separate x into 7 columns, named below
into = c("year","v1","v2","v3","v4","v5","v6"),
sep = "[ ]{1,}") # separate by one or more spaces ("[ ]{1,}")
That leads to:
# A tibble: 6 x 7
year v1 v2 v3 v4 v5 v6
<chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 1931 1 5.0 0.6 11 78.4 43.4
2 1931 2 6.7 0.7 7 48.9 63.6
3 1931 4 10.4 3.1 3 44.6 110.1
4 1931 5 13.2 6.1 1 63.7 167.4
5 1931 6 15.4 8.0 0 87.8 150.3
6 1931 7 17.3 10.6 0 121.4 111.2
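All columns in the tibble above are character. If numeric columns with the same names are wanted, a base R sketch using read.table with col.names may be enough; the names v1 to v6 are placeholders, since the question does not say what the columns mean.

```r
dat <- c(" 1931 1 5.0 0.6 11 78.4 43.4",
         " 1931 2 6.7 0.7 7 48.9 63.6",
         " 1931 4 10.4 3.1 3 44.6 110.1",
         " 1931 5 13.2 6.1 1 63.7 167.4",
         " 1931 6 15.4 8.0 0 87.8 150.3",
         " 1931 7 17.3 10.6 0 121.4 111.2")

# read.table splits on whitespace and infers numeric types per column
df <- read.table(text = dat,
                 col.names = c("year", "v1", "v2", "v3", "v4", "v5", "v6"))

df$v2      # third column as a numeric vector
str(df)    # check the inferred column types
```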

Draw histograms per row over multiple columns in R

I'm using R for the analysis of my master thesis
I have the following data frame: STOF: Student to staff ratio
HEI.ID X2007 X2008 X2009 X2010 X2011 X2012
1 OP 41.8 147.6 90.3 82.9 106.8 63.0
2 MO 20.0 20.8 21.1 20.9 12.6 20.6
3 SD 21.2 32.3 25.7 23.9 25.0 40.1
4 UN 51.8 39.8 19.9 20.9 21.6 22.5
5 WS 18.0 19.9 15.3 13.6 15.7 15.2
6 BF 11.5 36.9 20.0 23.2 18.2 23.8
7 ME 34.2 30.3 28.4 30.1 31.5 25.6
8 IM 7.7 18.1 20.5 14.6 17.2 17.1
9 OM 11.4 11.2 12.2 11.1 13.4 19.2
10 DC 14.3 28.7 20.1 17.0 22.3 16.2
11 OC 28.6 44.0 24.9 27.9 34.0 30.7
Then I rank colleges using this command:
HEIrank1<-(STOF[,-c(1)])
rank1 <- apply(HEIrank1,2,rank)
> HEIrank11
HEI.ID X2007 X2008 X2009 X2010 X2011 X2012
1 OP 18.0 20 20.0 20.0 20.0 20
2 MO 14.0 9 13.0 13.5 2.0 12
3 SD 15.0 16 17.0 16.0 16.0 19
4 UN 20.0 18 8.0 13.5 14.0 13
5 WS 12.0 8 4.0 7.0 6.0 8
6 BF 6.5 17 9.5 15.0 10.0 14
7 ME 17.0 15 19.0 19.0 17.0 15
8 IM 2.0 6 12.0 8.0 8.5 10
9 OM 4.5 3 2.5 3.0 3.0 11
10 DC 11.0 14 11.0 9.0 15.0 9
11 OC 16.0 19 16.0 18.0 19.0 17
I would like to draw a histogram for each HEI (that is, for each row). How can I do this?
If you use ggplot2 you won't need a loop; you can plot them all at once. You do need to reshape your data from wide format to long format first, which the melt function from the reshape2 package can do.
library(reshape2)
new.df<-melt(HEIrank11,id.vars="HEI.ID")
names(new.df)=c("HEI.ID","Year","Rank")
substring is just getting rid of the X in each year
library(ggplot2)
ggplot(new.df, aes(x = HEI.ID, y = Rank, fill = substring(Year, 2))) +
  geom_bar(stat = "identity", position = "dodge")
Here's a solution in lattice:
require(lattice)
barchart(X2007+X2008+X2009+X2010+X2011+X2012 ~ HEI.ID,
data=HEIrank11,
auto.key=list(space='right')
)
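If one separate panel per HEI is preferred and no extra packages are available, a base R sketch along these lines might do. It assumes a cut-down copy of the STOF data frame (three rows, three years), and plot_per_hei is a hypothetical helper name.

```r
# Cut-down copy of the question's data
STOF <- data.frame(
  HEI.ID = c("OP", "MO", "SD"),
  X2007  = c(41.8, 20.0, 21.2),
  X2008  = c(147.6, 20.8, 32.3),
  X2009  = c(90.3, 21.1, 25.7)
)

# Numeric matrix with HEIs as row names and plain years as column names
year_cols <- grep("^X[0-9]{4}$", names(STOF), value = TRUE)
vals <- as.matrix(STOF[, year_cols])
rownames(vals) <- STOF$HEI.ID
colnames(vals) <- sub("^X", "", year_cols)

# Draw one bar chart per row, side by side
plot_per_hei <- function(m) {
  old <- par(mfrow = c(1, nrow(m)))
  on.exit(par(old))
  for (id in rownames(m)) {
    barplot(m[id, ], main = id, xlab = "Year", ylab = "Value")
  }
  invisible(NULL)
}
```

Calling plot_per_hei(vals) draws the panels; with many HEIs, a grid layout such as par(mfrow = c(3, 4)) would likely read better than a single row.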

replacing randomly values in an existing matrix in R

I have an existing matrix and I want to replace some of the existing values by NA's in a random uniform way.
I tried to use the following, but it only replaced 392 values with NA, not 452 as I expected. What am I doing wrong?
N <- 452
ind1 <- (runif(N,2,length(macro_complet$Sod)))
macro_complet$Sod[ind1] <- NA
summary(macro_complet$Sod)
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
0.3222 0.9138 1.0790 1.1360 1.3010 2.8610 392.0000
My data looks like this
> str(macro_complet)
'data.frame': 1504 obs. of 26 variables:
$ Sod : num 8.6 13.1 12 13.8 12.9 10 7 14.8 11.3 4.9 ...
$ Azo : num 2 1.7 2.2 1.9 1.89 1.61 1.72 2.1 1.63 2 ...
$ Cal : num 26 28.1 24 28.5 24.5 24 17.4 26.6 24.8 10.5 ...
$ Bic : num 72 82 81 84 77 68 66 81 70 37.8 ...
$ DBO : num 3 2.2 3 2.7 3.3 3 3.2 2.9 2.8 2 ...
$ AzoK : num 0.7 0.7 0.9 0.8 0.7 0.7 0.7 0.9 0.7 0.7 ...
$ Orho : num 0.3 0.2 0.31 0.19 0.19 0.2 0.16 0.24 0.2 0.01 ...
$ Ammo : num 0.12 0.16 0.15 0.13 0.19 0.22 0.19 0.16 0.17 0.08 ...
$ Carb : num 0.3 0.3 2 0.3 0.3 0.3 0.3 0.3 0.3 0.5 ...
$ Ox : num 10.2 9.7 9.8 9.6 9.7 9.1 9.1 8.1 9.7 10.6 ...
$ Mag : num 5.5 6.5 6.3 7 6.4 5.1 6 6.7 5.7 2 ...
$ Nit : num 4.2 4.7 5.7 4.6 4.2 3.5 4.9 4.5 4.2 2.8 ...
$ Matsu : num 17 9 24 15 17 19 20 19 13 3.9 ...
$ Tp : num 10.5 9.7 11.9 12 12.9 11.2 12.8 13.7 11.5 10.6 ...
$ Co : num 3 3.45 3.3 3.54 2.7 2.7 3.3 3.49 2.8 1.8 ...
$ Ch : num 17 24 22 28 25 19 13 28 23 6.4 ...
$ Cu : num 25 15 20 20 15 20 15 15 20 15 ...
$ Po : num 3.5 3.8 4 3.6 3.8 3.7 3 4.2 3.7 0.4 ...
$ Ph : num 0.2 0.17 0.2 0.14 0.18 0.2 0.17 0.17 0.17 0.01 ...
$ Cnd : int 226 275 285 295 272 225 267 283 251 61 ...
$ Txs : num 93 88 89 86 87 88 84 80 91 94 ...
$ Niti : num 0.06 0.09 0.07 0.06 0.08 0.07 0.08 0.11 0.1 0.01 ...
$ Dt : num 9 9.7 9 10.2 8 8 7 9.4 8.5 3 ...
$ H : num 7.6 7.7 7.6 7.7 7.55 7.4 7.3 7.5 7.5 7.6 ...
$ Dco : int 17 12 15 13 15 20 16 14 12 7 ...
$ Sf : num 22 20.5 18 22.2 22.1 21 11.6 21.7 21.9 6.8 ...
I also tried to do this for only a single variable, but got the same result.
I converted my data frame into a matrix using
as.matrix(n1)
then I replaced some values for only one variable
N <- 300
ind <- (runif(N,1,length(n1$Sodium)))
n1$Sodium[ind] <- NA
However, using summary() I observed that only 262 values were replaced instead of 300 as expected. What am I doing wrong?
summary(n1$Sodium)
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
0.3222 0.8976 1.0790 1.1320 1.3010 2.8610 262.0000
Try this. This will sample your matrix uniformly without replacement (so the same value is not chosen and replaced twice). If you want some other distribution, you can modify the weights using the prob argument (see ?sample)
vec <- matrix(1:25, nrow = 5)
vec[sample(1:length(vec), 4, replace = FALSE)] <- NA
vec
[,1] [,2] [,3] [,4] [,5]
[1,] NA 6 NA 16 NA
[2,] NA 7 12 17 22
[3,] 3 8 13 18 23
[4,] 4 9 14 19 24
[5,] 5 10 15 20 25
You must apply runif in the right spot, which is the index to vec. (The way you have it now, you are asking R to draw random numbers from a uniform distribution between NA and NA, which of course does not make sense, so it gives you back NaNs.)
Try instead:
N <- 5 # the number of random values to replace
inds <- round ( runif(N, 1, length(vec)) ) # draw random values from [1, length(vec)]
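vec[inds] <- NA # use the random values as indices into vec, marking which elements to replace
Note that it is not necessary to use round(.) since [ will accept non-integer numerics, but they are truncated toward zero, in which case the last index is effectively never drawn. Either way, runif can produce the same index more than once, so fewer than N distinct values may end up replaced.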
We could use
vec[sample(seq_along(vec), 4, replace = FALSE)] <- NA
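To see why the runif approach in the question replaced fewer values than expected, the sketch below compares it with sample() on a made-up numeric vector of the same length as the question's column (1504). The seed is arbitrary and only there for reproducibility.

```r
set.seed(42)                 # arbitrary seed
x <- rnorm(1504)             # stand-in for macro_complet$Sod
N <- 452

# runif: fractional indices are truncated, and duplicates are possible
idx_runif <- runif(N, 1, length(x))
n_distinct_runif <- length(unique(trunc(idx_runif)))
x_runif <- x
x_runif[idx_runif] <- NA

# sample: N distinct integer positions, drawn without replacement
idx_sample <- sample(seq_along(x), N)
x_sample <- x
x_sample[idx_sample] <- NA

sum(is.na(x_runif))    # equals n_distinct_runif, which can be below N
sum(is.na(x_sample))   # exactly N
```

Repeated indices simply overwrite the same position with NA again, which is why the count of missing values falls short of N with runif.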

Resources