Calculate ratios of all column combinations from a dataframe - r

I have a CSV file imported as df in R. The dimension of this df is 18x11. I want to calculate all possible ratios between the columns. Can you please help me with this? I understand that either a for loop or a vectorized function will do the job. The row names will remain the same, while the column name combinations can be merged using paste. However, I don't know how to execute this. I did this in Excel, as it is still a small data set. A larger one would make it tedious and error-prone in Excel, so I would like to try it in R.
Any help would be greatly appreciated. Thanks. Let's say the data frame below is a subset of my data.
dfn = data.frame(replicate(18,sample(100:1000,15,rep=TRUE)))

If you do:
do.call("cbind", lapply(seq_along(dfn), function(y) apply(dfn, 2, function(x) x / dfn[[y]])))
You will get a matrix that is 15 x 324: the first 18 columns are all columns divided by the first column, the next 18 are all columns divided by the second column, and so on.
You can keep track of them by labelling the columns with the following names:
apply(expand.grid(names(dfn), names(dfn)), 1, paste, collapse = " / ")
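As a minimal sketch of how the ratios and their labels can be kept in step, the example below builds the numerator/denominator pairs first and computes each ratio from the pair itself. The smaller 5 x 4 data frame and the names pairs, num, den, and ratios are assumptions made for illustration, not part of the original answer.

```r
# Small stand-in for dfn (5 rows, 4 columns) so the result is easy to inspect
set.seed(1)
dfn <- data.frame(replicate(4, sample(100:1000, 5, replace = TRUE)))

# Every ordered (numerator, denominator) pair of column names
pairs <- expand.grid(num = names(dfn), den = names(dfn),
                     stringsAsFactors = FALSE)

# One ratio column per pair; mapply simplifies the result to a 5 x 16 matrix
ratios <- mapply(function(n, d) dfn[[n]] / dfn[[d]], pairs$num, pairs$den)

# Label each column from the same pair that produced it
colnames(ratios) <- paste(pairs$num, pairs$den, sep = " / ")
rownames(ratios) <- rownames(dfn)
```

Because each label comes from the same row of pairs as its ratio, a column such as ratios[, "X2 / X1"] can be looked up by name without tracking positions.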

Related

Loop over even/odd columns & stack them under specific ones

I have the following data set from Douglas Montgomery's book Introduction to Time Series Analysis & Forecasting:
I created a data frame called pharm from this spreadsheet. We only have two variables but they're repeated over several columns. I'd like to take all odd "Week" columns past the 2nd column and stack them under the 1st Week column in order. Conversely I'd like to do the same thing with the even "Sales, in thousands" columns. Here's what I've tried so far:
pharm2 <- data.frame(week=c(pharm$week, pharm[,3], pharm[,5], pharm[,7]), sales=c(pharm$sales, pharm[,4], pharm[,6], pharm[,8]))
This works because there aren't many columns, but I need a way to do this more efficiently because hard coding won't be practical with many columns. Does anyone know a more efficient way to do this?
If the columns are alternating, just subset with a recycling logical vector, unlist, and create a new data.frame:
out <- data.frame(week = unlist(pharm[c(TRUE, FALSE)]),
sales = unlist(pharm[c(FALSE, TRUE)]))
You can also use the seq function to generate the indices of the alternating columns:
pharm2 <- data.frame(week = unlist(pharm[seq(1, ncol(pharm), 2)]),
sales = unlist(pharm[seq(2, ncol(pharm), 2)]))
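For a quick check of the stacking approach, here is a small sketch on a made-up pharm data frame. The column names and values are hypothetical; only the alternating week/sales layout is taken from the question.

```r
# Hypothetical stand-in for pharm: week/sales pairs repeated across columns
pharm <- data.frame(week   = 1:3, sales   = c(10, 11, 12),
                    week.1 = 4:6, sales.1 = c(13, 14, 15),
                    week.2 = 7:9, sales.2 = c(16, 17, 18))

odd  <- seq(1, ncol(pharm), by = 2)  # positions of the week columns
even <- seq(2, ncol(pharm), by = 2)  # positions of the sales columns

# use.names = FALSE drops the "week1", "week2", ... names that unlist adds
pharm2 <- data.frame(week  = unlist(pharm[odd],  use.names = FALSE),
                     sales = unlist(pharm[even], use.names = FALSE))
```

The result has one week column and one sales column with 9 rows, in the original column order.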

Beginner: how can I repeat this function?

I need RStudio for analysing some data, but I haven't used it for 4 years.
Now I've got a problem and don't know how to solve it. I want to calculate the variance across several columns within every row. With some experimentation I've found this:
var(as.numeric(data[1,8:33]))
and I get: 1.046154
As far as I know this should be right. It should at least give me the variance of items 8 to 33 (the columns) for the first person (row 1). It also works for any other row:
var(as.numeric(data[5,8:33])) => 1.046154
Now I could of course use the same thing for every row individually, but I have 111 participants and several surveys. I tried to find a way to repeat the same command with every row but it didn't work.
How can I apply the command above to all 111 participants?
Without the data it is difficult to help, but I created some dummy data using rnorm. You can use apply to obtain a vector containing the variance for each row. Since it appears that your data is in character format and not numeric, I created a simple function to automatically transform it and calculate the variance.
set.seed(20)
data <- matrix(as.character(rnorm(3663)),
               ncol = 33,
               nrow = 111)
## basic function: convert a character vector to numeric and return its variance
obtain_variance_from_character <- function(x){
  return(var(as.numeric(x)))
}
## Calculate variances by row (data is already a matrix, so apply works on it directly)
variances <- apply(data, MARGIN = 1, FUN = obtain_variance_from_character)
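To tie this back to the question, the sketch below restricts the calculation to columns 8 to 33 and compares the first result with the single-row command from the question. The dummy data frame and the names items and row_var are assumptions for illustration only.

```r
# Dummy survey data: 111 participants, 33 columns stored as character
set.seed(20)
data <- as.data.frame(matrix(as.character(round(rnorm(111 * 33), 2)),
                             nrow = 111, ncol = 33),
                      stringsAsFactors = FALSE)

# Keep only the item columns named in the question
items <- data[, 8:33]

# One variance per row; each row is converted to numeric before var()
row_var <- apply(items, 1, function(x) var(as.numeric(x)))

# Same value as the single-row command for participant 1
first_by_hand <- var(as.numeric(unlist(data[1, 8:33])))
```

row_var has one entry per participant, so it can be added back to the data as a new column if needed.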

Sampling from a data table in R

I'm new to R and I would like to know how to take a certain number of samples from a csv file made entirely of numbers in Excel. I managed to import the data to R and use each number as a row and then take random rows as samples but it seems impractical. The whole file is displayed as a column and I took some samples with the next code:
Heights[sample(nrow(Heights), 5), ]
[1] 1.84 1.65 1.73 1.70 1.72
Also please let me know if there is a way to repeat this step at least 100 times and save each sample in another chart maybe, to work with it later.
This is how you'd take 100 samples and store them:
my_samples <- replicate(100, Heights[sample(nrow(Heights), 5), ])
If your .csv file is just comma-separated values of one type (the heights) and not structured as a table, you may want to turn it into a vector instead. Most R functions that read textual data formats will turn the data into a data frame or some other table-like format.
heights <- unlist(strsplit(readLines("yourfile.csv"), ","))
readLines("yourfile.csv") with a .csv file of comma separated values will turn it into a character vector. strsplit() then does the separating work for you.
To put this all together, with a dummy example:
writeLines(c("1,2,3,4,5", "6,7,8,9,10"), "test.csv")
heights <- as.numeric(unlist(strsplit(readLines("test.csv"), ",")))
set.seed(123)
my_samples <- replicate(100, sample(heights, 5))
dim(my_samples)
# [1] 5 100
You can see that my_samples is a matrix of 5 rows (with each row corresponding to a single element sampled from heights), and 100 columns (with each column corresponding to one of one hundred sampling events).
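Once the samples are stored this way, each column can be used on its own or summarised. A short sketch follows; the heights values after the first five are made up, and sample_means and third_sample are hypothetical names.

```r
# First five values come from the question's output; the rest are invented
heights <- c(1.84, 1.65, 1.73, 1.70, 1.72, 1.68, 1.91, 1.77)

set.seed(123)
my_samples <- replicate(100, sample(heights, 5))  # 5 x 100 matrix

sample_means <- colMeans(my_samples)  # mean of each of the 100 samples
third_sample <- my_samples[, 3]       # the five values drawn in sample 3
```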
You can use the infer package that is used for bootstrapping.
library(infer)
rep_sample_n(Heights, size = 100, replace = TRUE, reps = 1)
The first argument is the data frame to sample from. "size" is the number of observations drawn in each sample. "replace" (if TRUE) samples with replacement, that is, you spin the roulette wheel without taking numbers off the wheel once they come up. "reps" is the number of times the sampling process is repeated.

How do I import data from a .csv file into R without repeating the values of the first column into all the other ones?

I want to import data into R from a .csv file.
So far I have done the following:
#Clear environment
rm(list=ls())
#Read my data into R
myData <- read.csv("C:/Users/.../flow.csv", header=TRUE)
#Convert from list to array
myData <- array(as.numeric(unlist(myData)), dim=c(264,3))
#Create vectors with specific values of interest: qMax, qMin
qMax <- myData[,2]
qMin <- myData[,3]
#Transform vectors into matrices
qMax <- matrix(qMax,nrow = 12, ncol = round((length(qMax)/12)))
qMin <- matrix(qMin,nrow = 12, ncol = round((length(qMin)/12)))
After importing the data using read.csv, I have a list. I then transform this list into an array with 264 rows of data spread across 3 columns. Here I have my first problem.
I know that each column of my list holds a different set of data; the values are not the same. However, when I check what I imported, it seems that only the first column is imported correctly, and its values are then repeated in columns two and three.
Here's an image for better explanation:
The matrix has the right layout, yet wrong data. Columns 2 and 3 should have different values from each other and from column 1.
How do I correct that? I have checked the source and the original document has all the correct values.
Also, assuming I eventually get rid of this mistake, will the lines of code in the block "#Transform vectors into matrices" deliver a 12 x 22 matrix? The first six elements of both qMax and qMin are NA and I wish to keep them that way in the matrix. Will R do that with these lines of code, or will I need to change them?
Thank you.
Edit: As suggested by akrun, here are the results for str(myData) and for dput(droplevels(head(myData)))
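On the reshaping part of the question, a small sketch with made-up data suggests that 264 values do fill a 12 x 22 matrix and that leading NA values are kept in place. The column names month, qMax, and qMin and the random values are assumptions, and data.matrix is used here as one possible way to get a numeric matrix from a data frame; it is not presented as the fix for the import problem itself.

```r
# Made-up stand-in for flow.csv: 264 rows, 3 columns, first six flows missing
set.seed(1)
myData <- data.frame(month = rep(1:12, 22),
                     qMax  = c(rep(NA, 6), runif(258, 50, 100)),
                     qMin  = c(rep(NA, 6), runif(258, 1, 50)))

# Numeric matrix with the same 264 x 3 layout, one column per variable
m <- data.matrix(myData)

# matrix() fills column by column, so each column holds 12 consecutive values
qMax <- matrix(m[, "qMax"], nrow = 12)
qMin <- matrix(m[, "qMin"], nrow = 12)
dim(qMax)
```

Since 264 / 12 = 22, leaving out ncol gives the same shape as the round(length(qMax)/12) expression in the question.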

Changing values of multiple column elements for dataframe in R

I'm trying to update a bunch of columns by adding the SD to, and subtracting it from, each value in the column. The SD is that of the given column.
Below is the reproducible code I came up with, but I feel this is not the most efficient way to do it. Could someone suggest a better way?
Essentially, there are 20 rows and 9 columns. I just need two separate data frames: one with each column's values adjusted by adding the SD of that column, and the other with the SD subtracted from each value.
##Example
##data frame containing 9 columns and 20 rows
Hi<-data.frame(replicate(9,sample(0:20,20,rep=TRUE)))
##Standard deviation calculated for each column and stored in an object - I don't know what this object is - vector, list, dataframe?
Hi_SD<-apply(Hi,2,sd)
#data frame converted to matrix to allow addition of SD to each value
Hi_Matrix<-as.matrix(Hi,rownames.force=FALSE)
#a new object created that will store values(original+1SD) for each variable
Hi_SDValues<-NULL
#variable re-created -contains sum of first column of matrix and first element of list. I have only done this for 2 columns for the purposes of this example. however, all columns would need to be recreated
Hi_SDValues$X1<-Hi_Matrix[,1]+Hi_SD[1]
Hi_SDValues$X2<-Hi_Matrix[,2]+Hi_SD[2]
#convert the object back to a dataframe
Hi_SDValues<-as.data.frame(Hi_SDValues)
##Repeat for one SD less
Hi_SDValues_Less<-NULL
Hi_SDValues_Less$X1<-Hi_Matrix[,1]-Hi_SD[1]
Hi_SDValues_Less$X2<-Hi_Matrix[,2]-Hi_SD[2]
Hi_SDValues_Less<-as.data.frame(Hi_SDValues_Less)
This is a job for sweep (type ?sweep in R for the documentation)
Hi <- data.frame(replicate(9,sample(0:20,20,rep=TRUE)))
Hi_SD <- apply(Hi,2,sd)
Hi_SD_subtracted <- sweep(Hi, 2, Hi_SD)
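sweep subtracts by default; passing FUN = "+" gives the added version as well. A minimal sketch producing both data frames follows, where the name Hi_SD_added and the use of sapply instead of apply are choices made for this example.

```r
set.seed(42)
Hi <- data.frame(replicate(9, sample(0:20, 20, replace = TRUE)))

# One SD per column, as a named numeric vector
Hi_SD <- sapply(Hi, sd)

# MARGIN = 2 matches each SD to its column
Hi_SD_added      <- sweep(Hi, 2, Hi_SD, FUN = "+")
Hi_SD_subtracted <- sweep(Hi, 2, Hi_SD, FUN = "-")
```

Both results keep the 20 x 9 shape and the column names of Hi.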
You don't need to convert the dataframe to a matrix in order to add the SD
Hi<-data.frame(replicate(9,sample(0:20,20,rep=TRUE)))
Hi_SD<-apply(Hi,2,sd) # Hi_SD is a named numeric vector
Hi_SDValues<-Hi # Creating a new dataframe that we will add the SDs to
# Loop through all columns (there are many ways to do this)
for (i in 1:9){
Hi_SDValues[,i]<-Hi_SDValues[,i]+Hi_SD[i]
}
# Do pretty much the same thing for the next dataframe
Hi_SDValues_Less <- Hi
for (i in 1:9){
Hi_SDValues_Less[,i]<-Hi_SDValues_Less[,i]-Hi_SD[i]
}
