Sampling from a data table in R

I'm new to R and I would like to know how to take a certain number of samples from a .csv file made entirely of numbers exported from Excel. I managed to import the data into R, use each number as a row, and then take random rows as samples, but it seems impractical. The whole file is displayed as a single column, and I took some samples with the following code:
Heights[sample(nrow(Heights), 5), ]
[1] 1.84 1.65 1.73 1.70 1.72
Also, please let me know if there is a way to repeat this step at least 100 times and save each sample somewhere (another table, maybe) to work with later.

This is how you'd take 100 samples and store them:
my_samples <- replicate(100, Heights[sample(nrow(Heights), 5), ])
If your .csv file is just comma-separated values of one type (the heights), and not structured as a table, you may want to turn it into a vector instead. Most R functions that read textual data formats will turn the data into a data frame or some other table-like structure.
heights <- unlist(strsplit(readLines("yourfile.csv"), ","))
readLines("yourfile.csv") with a .csv file of comma separated values will turn it into a character vector. strsplit() then does the separating work for you.
To put this all together, with a dummy example:
writeLines(c("1,2,3,4,5", "6,7,8,9,10"), "test.csv")
heights <- as.numeric(unlist(strsplit(readLines("test.csv"), ",")))
set.seed(123)
my_samples <- replicate(100, sample(heights, 5))
dim(my_samples)
# [1] 5 100
You can see that my_samples is a matrix of 5 rows (with each row corresponding to a single element sampled from heights), and 100 columns (with each column corresponding to one of one hundred sampling events).
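If you want to work with each sample afterwards (the "work with it later" part of the question), you can operate on the columns of that matrix, or ask replicate() for a list instead; a minimal sketch continuing the dummy example above:
sample_means <- apply(my_samples, 2, mean)  # mean of each of the 100 samples (one per column)
my_samples_list <- replicate(100, sample(heights, 5), simplify = FALSE)  # a list of 100 samples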

You can use the infer package, which is used for bootstrapping and resampling.
library(infer)
rep_sample_n(Heights, size = 100, replace = TRUE, reps = 1)
Here "size" is the number of samples. "replace" (if true) allows you to replace an observation when sampling - that is, you spin the roulette wheel without taking numbers off the wheel once they come up. 'reps' allows you to repeat the sampling process.

Related

Write an excel or csv file in a way that the dataframes are listed on the same sheet, instead of multiple sheets in R

I have an object that contains multiple dataframes and want to produce a single Excel worksheet with the data. I know there are ways of dealing with this problem in Excel, but is there a way to handle it from the R side, so that people don't have to worry about extra steps that weren't in my R script? I have been using the function below, but am open to another one. It produces one Excel file with a worksheet for every dataframe, and since I have 119 dataframes this is not really practical.
write_xlsx(results1, "hpresponse1.MinimallyAdjusted")
I used bind_rows, but some of the data was lost. I am not sure how to retain it, especially as I don't even know what kind of data it is. I turned my logistic-regression results into a dataframe so that I could perform certain manipulations. There are labels of some kind off to the left that are not variables. Can I turn these into a variable so that they are retained when I use bind_rows?
(Up front, I'm assuming that the 119 frames are all different, or at least that they are structured or used such that you must not combine them into a single frame, as stefan has suggested in the comments.)
I suggest you use the openxlsx package and offset each table individually.
L <- list(mtcars[1:3,1:3], mtcars[4:5,1:5], mtcars[6:10,1:4])
wb <- openxlsx::createWorkbook("quux.xlsx")
sapply(L, nrow)
# [1] 3 2 5
starts <- cumsum(c(1L, 2L + sapply(L[-length(L)], nrow)))
starts
# [1] 1 6 10
wb <- openxlsx::createWorkbook()
openxlsx::addWorksheet(wb, "quuxsheet")
ign <- Map(function(x, st) openxlsx::writeData(wb, sheet = "quuxsheet", x = x, startRow = st), L, starts)
openxlsx::saveWorkbook(wb, file = "quux.xlsx")
sapply(L, nrow) gives us the number of rows in each table in the list. This is used so that we know how many rows to skip after one table before starting the next. Since we don't care about the number of rows in the last frame, we omit it with L[-length(L)].
2L + sapply(..) gives us a gap of 1 row between each frame in the worksheet. Change it to suit your needs.
cumsum(c(1L, 2L + sapply(..))) is because we need an absolute row number, not just the counts for each frame. This results in starts holding the first Excel row number for each frame in L.
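If the 119 frames currently live as separate objects rather than in a single list, one way to collect them (assuming, purely for illustration, that they are named results1 through results119 - substitute your actual names) is mget:
L <- mget(paste0("results", 1:119))  # hypothetical object names; adjust to yours
starts <- cumsum(c(1L, 2L + sapply(L[-length(L)], nrow)))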

Beginner: how can I repeat this function?

I need RStudio for analysing some data, but haven't used it for 4 years now.
Now I've got a problem and don't know how to solve it. I want to calculate the variance of a set of columns in every row. With some experimentation I've found this:
var(as.numeric(data[1,8:33]))
and I get: 1.046154
As far as I know this should be right. It should at least give me the variance of columns 8 to 33 in the row for the first person. It also works for any other row:
var(as.numeric(data[5,8:33])) => 1.046154
Now I could of course run the same thing for every row individually, but I have 111 participants and several surveys. I tried to find a way to repeat the same command for every row but it didn't work.
How can I use the command from above and repeat it for all 111 participants?
Without the data it is difficult to help, but I created some dummy data using rnorm(). You can use apply() to obtain a vector containing the variance of each row. Since it appears that your data is in character format rather than numeric, I created a simple function that converts it and calculates the variance.
set.seed(20)
data <- matrix(as.character(rnorm(3663)),
               ncol = 33,
               nrow = 111)
## basic function: convert to numeric, then take the variance
obtain_variance_from_character <- function(x){
  return(var(as.numeric(x)))
}
## calculate variances by row
variances <- apply(data, MARGIN = 1, FUN = obtain_variance_from_character)
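Applied to the data described in the question (columns 8 to 33 for all 111 rows, assuming the object is called data as there), the same pattern would be:
variances <- apply(data[, 8:33], MARGIN = 1, FUN = obtain_variance_from_character)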

Calculate ratios of all column combinations from a dataframe

I have a CSV file imported as df in R. The dimension of this df is 18x11. I want to calculate all possible ratios between the columns. Can you please help me with this? I understand that either a for loop or a vectorized function will do the job. The row names will remain the same, while column-name combinations can be merged using paste. However, I don't know how to execute this. I did this in Excel as it is still a small data set; a larger one would be tedious and error-prone in Excel, so I would like to try it in R.
It would be a great help indeed. Thanks. Let's say below is a data frame like a subset of my data:
dfn = data.frame(replicate(18,sample(100:1000,15,rep=TRUE)))
If you do:
do.call("cbind", lapply(seq_along(dfn), function(y) apply(dfn, 2, function(x) dfn[[y]]/x)))
You will get a matrix that is 15 x 324, with the first 18 columns holding the first column divided by each of the columns, the next 18 holding the second column divided by each of the columns, and so on.
You can keep track of them by labelling the columns with the following names:
apply(expand.grid(names(dfn), names(dfn))[2:1], 1, paste, collapse = " / ")
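Putting it together with the dummy dfn above, a sketch of the whole thing:
ratios <- do.call("cbind", lapply(seq_along(dfn), function(y) apply(dfn, 2, function(x) dfn[[y]]/x)))
colnames(ratios) <- apply(expand.grid(names(dfn), names(dfn))[2:1], 1, paste, collapse = " / ")
dim(ratios)
# [1]  15 324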

How do I import data from a .csv file into R without repeating the values of the first column into all the other ones?

I want to import data into R from a .csv file.
So far I have done the following:
#Clear environment
rm(list=ls())
#Read my data into R
myData <- read.csv("C:/Users/.../flow.csv", header=TRUE)
#Convert from list to array
myData <- array(as.numeric(unlist(myData)), dim=c(264,3))
#Create vectors with specific values of interest: qMax, qMin
qMax <- myData[,2]
qMin <- myData[,3]
#Transform vectors into matrices
qMax <- matrix(qMax,nrow = 12, ncol = round((length(qMax)/12)))
qMin <- matrix(qMin,nrow = 12, ncol = round((length(qMin)/12)))
After importing the data using read.csv, I have a list (a data frame). I then transform this list into an array with 264 rows of data spread across 3 columns. Here I have my first problem.
I know that each column of my list holds a different set of data; the values are not the same. However, when I check what I imported, it seems that only the first column is imported correctly, and its values are then repeated in the other two columns.
The matrix has the right layout, yet the wrong data: columns 2 and 3 should have different values from each other and from column 1.
How do I correct that? I have checked the source and the original document has all the correct values.
Also, assuming I will eventually get rid of this mistake, will the lines of code in the block "#Transform vectors into matrices" deliver a 12 x 22 matrix? The first six elements of both qMax and qMin are NA and I wish to keep it this way in the matrix. Will R do that with these lines of code, or will I need to change them?
Thank you.
Edit: As suggested by akrun, here are the results of str(myData) and dput(droplevels(head(myData))).
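On the last part of the question: matrix() fills column-wise by default, so a 264-element vector with nrow = 12 does give a 12 x 22 matrix, and leading NAs stay where they are. A quick check with dummy values:
qMax <- c(rep(NA, 6), 1:258)  # 264 values, first six NA
m <- matrix(qMax, nrow = 12, ncol = length(qMax) / 12)
dim(m)
# [1] 12 22
m[1:6, 1]
# [1] NA NA NA NA NA NA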

R populate list with samples

I have a numeric vector stock_data containing thousands of floating-point numbers. I know I can sample it using
sample(stock_data, sample_size)
I want to take 100 different samples and collect them in a list of samples.
How do I do that without using a loop to append the samples to a list?
I thought of creating a list replicating the stock data 100 times and then using lapply on it.
I tried:
all_repl <- as.list(rep(stock_data,100))
all_samples <- lapply(all_repl, sample, size=100)
But all_repl doesn't contain a list of data, it contains a single numeric vector which has replicated the data 100 times.
Can anyone suggest what's wrong and point out a better method to do what I want?
We can use replicate
replicate(100, sample(stock_data, sample_size))
Using simplify=FALSE gets the output as a list. Using a reproducible example:
replicate(5, sample(1:9, 5), simplify=FALSE)
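As an aside on why the lapply attempt did not behave as expected: rep(stock_data, 100) concatenates the values into one long vector. To get a list containing 100 copies of the whole vector, wrap it in list() first:
all_repl <- rep(list(stock_data), 100)  # list of 100 copies of stock_data
all_samples <- lapply(all_repl, sample, size = 100)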
