add random missing values to a complete data frame (in R) [duplicate] - r

This question already has answers here:
Randomly insert NAs into dataframe proportionaly
(6 answers)
Closed 7 years ago.
For testing purposes I need to add missing values to a data frame that has none. How can I set a random 10% of the values in my data frame to NA?
dat <- data.frame(v1=rnorm(20),v2=rnorm(20),v3=rnorm(20))
my idea was something like:
a <- sample(1:nrow(dat),3,replace=F)
b <- sample(1:ncol(dat),2,replace=F)
dat[a,b] <- NA
But this is not random enough: it always blanks a rectangular block (the same rows in each chosen column). Thanks.

It seems like you're asking for a way to create true random numbers, rather than how to implement it. If this is the case, you could use the random package available through CRAN which can sample random integers from random.org
install.packages("random")
require("random")
The details of the package: http://cran.r-project.org/web/packages/random/index.html
I suggest you look especially at the vignette, 'random: an R package for true random numbers' for how to obtain random integers.
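If the goal is simply to scatter NAs over individual cells rather than over a block of rows and columns, base R's pseudo-random sample() may be enough. The following is a minimal sketch, not the answerer's method; it assumes all columns are numeric (so the round trip through a matrix is harmless), and the names n_cells, n_na, idx and dat_na are made up for illustration.

```r
set.seed(42)  # only so the example is reproducible
dat <- data.frame(v1 = rnorm(20), v2 = rnorm(20), v3 = rnorm(20))

n_cells <- nrow(dat) * ncol(dat)   # 60 cells in total
n_na    <- round(0.10 * n_cells)   # 10% of the cells
idx     <- sample(n_cells, n_na)   # cell positions, drawn without replacement

m <- as.matrix(dat)                # assumption: every column is numeric
m[idx] <- NA                       # linear indexing hits individual cells
dat_na <- as.data.frame(m)
```

Because the positions are drawn over all cells, the NAs are not tied to particular rows or columns.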

Related

r: How to sample from a population of size 1? [duplicate]

This question already has answers here:
Sample from vector of varying length (including 1)
(4 answers)
Closed 6 months ago.
I appreciate that sampling from a list of length 1 has little practical use yet all the same I tried the following:
When I run the r snippet sample(c(1,2,3),1) then I obtain a random single value from the list c(1,2,3).
When I run the r snippet sample(c(3),1) then I would expect the number 3 to always be output but I don't, I seem to obtain the same behaviour as above.
Why is this? How can I sample from a list of length 1?
I found that sample(c(3,3),1) does indeed output the intended, but feels not what I had in mind.
See documentation for sample:
If x has length 1, is numeric (in the sense of is.numeric) and x >= 1, sampling via sample takes place from 1:x.
You can use resample() from the gdata package. This saves you having to redefine resample in each new script. Just call
gdata::resample(c(3), 1)
https://www.rdocumentation.org/packages/gdata/versions/2.18.0/topics/resample
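If you would rather not depend on a package, the examples in ?sample define a small helper that samples by position, so a length-one vector is handled like any other. A sketch along those lines (draws is just an illustrative name):

```r
# Sample positions rather than values, so x = 3 is not expanded to 1:3
resample <- function(x, ...) x[sample.int(length(x), ...)]

draws <- replicate(10, resample(c(3), 1))  # every draw should be 3
```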

Difference between two lists to create a dataset [duplicate]

This question already has answers here:
Find complement of a data frame (anti - join)
(7 answers)
Closed 5 years ago.
I have a dataset, like this mushrooms <- read.csv("mushrooms.csv") and now I already have a mushrooms.training_set which is 1/3 of the whole dataset. For both variables, typeof() returns list.
Now, I want to select the rows in the original dataset mushrooms, that are not in the mushrooms.training_set. How would I do this? I have tried the following:
mushrooms[c(!mushrooms.training_set),] but this returns something in the order of 64K rows.
mushrooms[!mushrooms.training_set,]
mushrooms[!duplicated(mushrooms.training_set)]
Who helps me out?
From where you are in the question, you can use dplyr::setdiff:
library(dplyr)
mushrooms.test = setdiff(mushrooms, mushrooms.training_set)
But most of the time it's easier to create the test set at the same time as the training set. There are lots of examples at How to split data into training and test sets?
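A base R alternative, sketched here under the assumption that the training set was produced by row-subsetting the original data frame (so its row names still match), is to keep the rows whose names do not appear in the training set. The data frame below is a made-up stand-in for mushrooms.csv.

```r
set.seed(1)
full <- data.frame(id = 1:9, cap = letters[1:9])   # hypothetical stand-in data
training <- full[sample(nrow(full), 3), ]          # 1/3 of the rows

# Rows of the original whose row names are absent from the training set
test <- full[!rownames(full) %in% rownames(training), ]
```

Unlike setdiff, this keeps duplicated rows, since it matches on row names rather than on row contents.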

Sudden changes on the original digits of data frame in R, when tapply() applied [duplicate]

This question already has answers here:
Display exact value of a variable in R
(2 answers)
Closed 5 years ago.
I've used tapply() a lot in R, but I have no idea why the number of digits shown suddenly changes after tapply() is applied.
When I load the original CSV data, the data shows as follows.
Barcode Group Price
1002-01-23 A 10.23568975
1002-01-24 A 2356.25
1002-01-25 A 123.54897
1002-01-26 A 200.1548794
However, after I run my R code, the digits of Price are displayed as follows.
Barcode Group Price mean
1002-01-23 A 10.23569 672.5474
1002-01-24 A 2356.25000 672.5474
1002-01-25 A 123.54897 672.5474
1002-01-26 A 200.15488 672.5474
I would like to have 672.5473847875(=(10.23568975+2356.25+123.54897+200.1548794)/4) as a result of mean. How could I solve the problem? Let me show you my R codes.
barcode <- read.csv("barcode.csv",header=T)
barcode$Group <- as.factor(barcode$Group)
barcode$Price <- as.numeric(barcode$Price)
test <- tapply(barcode$Price, barcode$Group, mean)
test1 <- data.frame(Group=names(test), mean=test)
barcode$mean <- test1$mean[match(barcode$Group, test1$Group)]
I really need your help. Thank you so much.
The means are calculated correctly. The simplest way to see this is to test for it:
barcode$mean == 672.5473847875
[1] TRUE TRUE TRUE TRUE
You can change the default number of digits being printed by e.g.
options(digits=15)
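To check this without changing a global option, you can ask for more digits in a single print() or format() call. The sketch below uses the four prices from the question; it also compares with all.equal() rather than ==, since exact equality on floating-point numbers can be fragile. The names price, m, shown and same are illustrative.

```r
price <- c(10.23568975, 2356.25, 123.54897, 200.1548794)
m <- mean(price)

print(m)                         # default: 7 significant digits
print(m, digits = 13)            # same value, more digits displayed
shown <- format(m, digits = 13)  # the same as a character string

# Tolerance-based comparison instead of ==
same <- isTRUE(all.equal(m, 672.5473847875))
```

The stored value is the same in every case; only the number of digits displayed differs.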

addition of a data point from a different column to the preceding data point [duplicate]

This question already has an answer here:
Vector of cumulative sums in R
(1 answer)
Closed 6 years ago.
I'm having real difficulty performing a calculation that is incredibly easy to perform in Excel. What I require is a kind of rolling addition whereby each value in one column is added to the running total of the preceding values. For example:
column a: 1,2,3,5,16,18,3,11
would produce:
column b: 1,3,6,11,27,45,48,59
i.e. 1, (2+1=3), (3+3=6), (5+6=11)...
I have a feeling I'm missing something really obvious but have tried various iterations of rollapply and shift with no success... How can I do this in R? What am I missing?
The function you are looking for is cumsum:
df = data.frame(a=1:10)
df$b = cumsum(df$a)
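Applied to the numbers from the question, a quick sketch (the data frame name running is made up):

```r
running <- data.frame(a = c(1, 2, 3, 5, 16, 18, 3, 11))
running$b <- cumsum(running$a)  # running total of column a
running$b
# 1 3 6 11 27 45 48 59
```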

how to find the complement of a random selection of rows of a dataframe? [duplicate]

This question already has answers here:
How to split a data frame?
(8 answers)
Closed 8 years ago.
I have a data set called data, which I am splitting into 2 new data sets, which I will call test and train.
I want the splitting to be random, without replacement.
Using the code below, I get train to be a new data frame with 35 elements:
rows_in_test <- 35 # number of rows to randomly select
rows_in_train <- nrow(data) - rows_in_test
train <- data[sample(nrow(data), rows_in_test), ]
Is there a nice way in R to assign the complement of train to a new data set called test? I am thinking there must be a function for this?
myData <- data.frame(a = 1:20, b = 101:120)
set.seed(123)  # to be able to replicate the random sampling later
# Each row lands in test with probability 0.25, so the split sizes are only approximate
trainRows <- runif(nrow(myData)) > 0.25
train <- myData[trainRows, ]    # roughly 75% of the rows
test <- myData[!trainRows, ]    # the complement of train
# Alternative: select a fixed number of test rows, in this case 5
testRows2 <- sort(sample(nrow(myData), 5, replace = FALSE))
train2 <- myData[-testRows2, ]  # the remaining 15 rows
test2 <- myData[testRows2, ]    # the 5 sampled rows
