Subsetting and Altering Only Certain Elements of a Vector in R - r

I am a student working with the iris dataset in R, which has 3 flower species.
I am supposed to create, in one statement, a new vector from the Petal.Length vector that is the same except that for the virginica species I take the log base 10 value. I am not sure how to tell R to take the log base 10 of only the virginica values in the Petal.Length column while keeping the values for the other two species unchanged.

Use square brackets in R to subset data. The generic form is object[condition]. For example, iris$Petal.Length[iris$Species == "virginica"] is equivalent to saying "show me the Petal.Length values only for rows where Species equals virginica".
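Putting that together for the one-statement requirement, one option is ifelse(), which chooses between two vectors element-wise. This is a sketch of one approach, not the only one (a logical-subset assignment works too, but takes two statements):

```r
# One statement: log10-transform Petal.Length only where Species is virginica,
# keeping the other species' values unchanged
new_pl <- ifelse(iris$Species == "virginica",
                 log10(iris$Petal.Length),
                 iris$Petal.Length)
```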

Related

R: How to subset with filtering multiple (date-)variables at once

I have a dataset with multiple date-variables and want to create subsets, where I can filter out certain rows by defining the wanted date of the date-variables.
To be more precise: Each row in the dataset represents a patient case in a psychiatric hospital and contains all the applied seclusions. So for each case there is either no seclusion, or the seclusions are documented as seclusion_date1, seclusion_date2, ..., seclusion_enddate1, seclusion_enddate2, ... (depending on how many seclusions occurred).
My plan is to create a subset with only those cases, where there is either no seclusion documented or the seclusion_date1 (first seclusion) is after 2019-06-30 and all the possible seclusion_enddates (1, 2, 3....) are before 2020-05-01. Cases with seclusions happening before 2019-06-30 and after 2020-05-01 would be excluded.
I'm very new to the R language, so my attempts are possibly very wrong. I appreciate any help or ideas.
I tried it with the subset function in R.
To filter all possible seclusion_enddates at once, I tried to use starts_with and I tried writing a loop.
all_seclusion_enddates <- function() { c(WMdata, any_of(c("seclusion_enddate")), starts_with("seclusion_enddate")) }
Error: `any_of()` must be used within a selecting function.
and then my plan would have been: cohort_2_before <- subset(WMdata, seclusion_date1 >= "2019-07-01" & all_seclusion_enddates <= "2020-04-30")
loop:
for(i in 1:53) { cohort_2_before <- subset(WMdata, seclusion_date1 >= "2019-07-01" & ((paste0("seclusion_enddate", i))) <= "2020-04-30" & restraint_date1 >= "2019-07-01" & ((paste0('seclusion_enddate', i))) <= "2020-04-30") }
Result: A subset with 0 obs. was created.
Since you don't provide a reproducible example, I can't see your specific problem, but I can help with the core issue.
any_of, starts_with and the like are functions used by the tidyverse set of packages to select columns within their functions. They can only be used within tidyverse selector functions to control their behavior, which is why you got that error. They probably are the tools I'd use to solve this problem, though, so here's how you can use them:
Starting with the built-in dataset iris, we use the filter_at function from dplyr (enter ?filter_at in the R console to read the help). This function filters (selects specific rows of) a data.frame (given to the .tbl argument) based on a criterion (given to the .vars_predicate argument), which is applied to the columns selected by the .vars argument.
library(dplyr)
iris %>%
filter_at(vars(starts_with('Sepal')), all_vars(.>4))
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.7 4.4 1.5 0.4 setosa
2 5.2 4.1 1.5 0.1 setosa
3 5.5 4.2 1.4 0.2 setosa
In this example, we take the dataframe iris, pass it into filter_at with the %>% pipe operator, tell it to look only in columns whose names start with 'Sepal', then tell it to select rows where all the selected columns match the given condition: value > 4. If we wanted rows where any column matched the condition, we could use any_vars(. > 4).
You can add multiple conditions by piping it into other filter functions:
iris %>%
filter_at(vars(starts_with('Sepal')), all_vars(.>4)) %>%
filter(Petal.Width > 0.3)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.7 4.4 1.5 0.4 setosa
Here we filter the previous result again to get rows that also have Petal.Width > 0.3.
In your case, you'd want to make sure your date values are formatted as dates (with as.Date), then filter on seclusion_date1 and vars(starts_with('seclusion_enddate')).
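Applied to the question's data, that might look like the sketch below. The WMdata column names are taken from the question, and the toy data frame only stands in for the real one, so treat this as untested scaffolding rather than a ready-made answer:

```r
library(dplyr)

# Toy stand-in for WMdata with the question's (assumed) column layout
WMdata <- data.frame(
  seclusion_date1    = as.Date(c(NA, "2019-08-01", "2019-05-01")),
  seclusion_enddate1 = as.Date(c(NA, "2020-03-01", "2020-06-01"))
)

# Keep cases with no seclusion at all, or whose first seclusion starts
# after 2019-06-30 and whose every enddate falls before 2020-05-01
cohort_2_before <- WMdata %>%
  filter(is.na(seclusion_date1) | seclusion_date1 >= as.Date("2019-07-01")) %>%
  filter_at(vars(starts_with("seclusion_enddate")),
            all_vars(is.na(.) | . <= as.Date("2020-04-30")))
```

The is.na() terms keep the "no seclusion documented" cases, which the plain comparisons would otherwise drop.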

Apply wilcox.test to all paired columns in a dataframe

For my M.Sc. project I am trying to order all columns of a user-supplied dataframe by their median, apply wilcox.test to the columns in a specific scheme (described below), and then plot each column's values in a box-whisker plot.
The ordering and the plotting works just fine, but I have trouble finding a way to apply the wilcox.test to the dataframe in the following schema:
wilcox.test(i, j, paired=TRUE)
where i=1 and j=2, both incrementing until j=ncol(dataframe). So I want to run the function with columns 1 and 2 as parameters, then with columns 2 and 3, and so on, until j is the last column of the dataframe.
I also want to store all the p-values in a data frame with one row (containing the p-values), with each column named after the two columns that were the parameters of its wilcox.test. That's because I don't only want to plot all the columns (each representing a "solution"); I also want to print the p-values for each test in the console (something like: "Wilcoxon-test with 'Solution1' and 'Solution2' resulted in the p-value: 'p-value from wilcox.test of Solution1 and Solution2', which means the solutions are/aren't significantly different").
I tried to adjust some code from other posts concerning this problem, but nothing worked out. Unfortunately I am also very inexperienced in R, so I hope that what I wrote above makes sense.
I also tried to iterate over the columns of the dataframe with for-loops and increments in a Java manner, as that is the only programming language I was taught, but that didn't work at all (what a surprise).
Thanks for any advices you guys can give me, it's very much appreciated!
Seems like a job for the matrixTests package. Here is a demonstration using the iris dataset:
library(matrixTests)
col_wilcoxon_twosample(iris[,1:3], iris[,2:4])
obs.x obs.y obs.tot statistic pvalue alternative location.null exact corrected
Sepal.Length 150 150 300 22497.5 9.812123e-51 two.sided 0 FALSE TRUE
Sepal.Width 150 150 300 7793.5 4.151103e-06 two.sided 0 FALSE TRUE
Petal.Length 150 150 300 19348.5 3.735718e-27 two.sided 0 FALSE TRUE
The returned results match wilcox.test() done on each pair. For example, 1st vs 2nd columns:
w <- wilcox.test(iris[,1], iris[,2])
w$p.value
[1] 9.812123e-51
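If you'd rather stay in base R (or can't install matrixTests), the adjacent-pair scheme from the question can also be sketched with a short sapply loop; the paired=TRUE option follows the question's description, and iris stands in for the user-supplied dataframe:

```r
# Sketch: paired wilcox.test on each pair of adjacent numeric columns
df <- iris[, 1:4]
pvals <- sapply(seq_len(ncol(df) - 1), function(i) {
  wilcox.test(df[[i]], df[[i + 1]], paired = TRUE)$p.value
})
# Name each p-value after the two columns it compares
names(pvals) <- paste(names(df)[-ncol(df)], names(df)[-1], sep = " vs ")
```

Wrapping the result in data.frame(t(pvals)) then gives the one-row data frame of p-values the question asks for.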

Calculating range for all variables

For someone new to R, what is the best way to view the range for a number of variables? I've run the summary command on the entire dataset; can I run range() on the entire dataset as well, or do I need to call it for each variable in the dataset?
For an individual variable, you can use range. To see the ranges of multiple variables, you can combine range with one of the apply functions. See below for an example.
range(iris$Sepal.Length)
# [1] 4.3 7.9
sapply(iris[, 1:4], range)
# Sepal.Length Sepal.Width Petal.Length Petal.Width
#[1,] 4.3 2.0 1.0 0.1
#[2,] 7.9 4.4 6.9 2.5
(Only the first four columns were selected from iris, since the 5th is a factor and range doesn't apply to factors.)
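If you don't want to pick the numeric columns by index, a small variation selects them programmatically; this is just a convenience sketch using base R's Filter:

```r
# Apply range() to every numeric column, whatever the data frame's layout
num_ranges <- sapply(Filter(is.numeric, iris), range)
```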

Importing a data frame vs creating one in R

I am trying to create a specialized summary 'matrix' for my supervisor, and would like R to export it in a clean, readable form. As such, I am basically creating it from scratch to tailor it to our project. My problem is that I can't figure out how to get a created data frame to behave like an imported one, specifically with respect to headers.
I am most comfortable dealing with imported data frames with headers, and calling specific rows by name instead of column number:
iris$Sepal.Length
with(iris,Sepal.Length)
iris['Sepal.Length']
Now, if I want to create a data frame (or matrix, I'm not entirely sure what the difference is), I have tried the following:
groups<-c("Group 1", "Group 2")
factors<-c("Fac 1", "Fac 2", "Fac 3","Fac 4", "Fac 5")
x<-1:10
y<-11:20
z<-21:30
data<-cbind(groups, factors, x, y, z)
names(data) #returns NULL
data$x #clearly doesn't return the column 'x' since the matrix 'data' has no names
data<-data.frame(cbind(groups, factors, x, y, z))
names(data) #confirms that there are header names
So, I have created a data frame that has the columns x, y and z, but in reality I don't have a premade column to start off with. If I knew how many rows of data there would be I could simply do:
data<-data.frame(1:10)
data$x<-x
data$y<-y
data$z<-z
I tried creating an empty data frame, but it is one element big, and if I try to append a vector to it (of any length greater than 1), I get an error:
data<-data.frame(0)
data$x<-x #returns an error
My best guess at what to do is to pass through the data once to find out how many rows of data I will have (there are several factor levels, and the summary matrix will have a row for each possible combination of factors). Then I can get the data frame started with a simple:
data<-data.frame(length(n)) #where n would be how many rows of data I would have
And follow through by creating individual vectors for each summary statistic I want and appending each to the data frame with $.
Another solution I tried to play with was creating a matrix and filling in each element as I calculate it within a loop. I know the apply family is better than a loop, but to make my summary table tailored to my needs I would need to run an apply function then try to pull the individual data:
means<-with(iris,tapply(iris[,4],Species,mean))
means[1] #This returns the species and the mean petal width. What I need is the numeric part of this, as I will have my own headers, or possibly a separate summary table for each species.
I'm not sure if extracting the numerical information from the apply output is better or any easier than simply constructing my own loop to calculate the required statistics. It would be a nested loop: an outer loop over groups (2 runs) and an internal loop over factors (5 runs), for a total of 10 passes through the data. I was thinking of creating an empty matrix and simply saving each value in the appropriate cell as it is calculated. My problem, again, is calling a specific row in a matrix. I have tried:
m<-matrix(0,ncol=5)
m[1,1]<-'Groups'
m[1,2]<-'Factors'
m[1,3]<-'Mean.x'
m[1,4]<-'Mean.y'
m[1,5]<-'Mean.z'
names(m) #Returns NULL
My desired output would look like:
Groups Factors Mean.x Mean.y Mean.z
Group 1 Fac 1
Group 1 Fac 2
Group 1 Fac 3
Etc, for all combinations of groups and factors.
You can use ddply from the plyr package for that. Assume your original data frame is mydata and your new data frame where you store the result is newdata:
library(plyr)
newdata<-ddply(mydata,.(Groups,Factors),summarize,mean.x=mean(x),mean.y=mean(y),mean.z=mean(z))
Example: mydata<-iris
> newdata<-ddply(mydata,.(Species),colwise(mean))
> newdata
Species Sepal.Length Sepal.Width Petal.Length Petal.Width
1 setosa 5.006 3.428 1.462 0.246
2 versicolor 5.936 2.770 4.260 1.326
3 virginica 6.588 2.974 5.552 2.026
I think this is what you're looking for, but I'm a little confused by your question in general. This will basically give you a pivot table of the means in each of the columns x, y, and z, grouped by the columns 'groups' and 'factors':
aggregate(.~groups+factors, data=data, FUN="mean")
groups factors x y z
1 Group 1 Fac 1 1 1 1
2 Group 2 Fac 1 7 6 1
3 Group 1 Fac 2 8 7 1
4 Group 2 Fac 2 3 2 1
5 Group 1 Fac 3 4 3 1
6 Group 2 Fac 3 9 8 1
7 Group 1 Fac 4 10 9 1
8 Group 2 Fac 4 5 4 1
9 Group 1 Fac 5 6 5 1
10 Group 2 Fac 5 2 10 1
or with the iris data grouped by Species:
aggregate(.~Species, data=iris, FUN="mean")
Species Sepal.Length Sepal.Width Petal.Length Petal.Width
1 setosa 5.006 3.428 1.462 0.246
2 versicolor 5.936 2.770 4.260 1.326
3 virginica 6.588 2.974 5.552 2.026
UPDATE: To calculate the mean of only certain columns, you can either pass only the appropriate columns of your dataset to the aggregate function (perhaps using subset) or modify the formula like this:
aggregate(cbind(Sepal.Length,Sepal.Width)~Species, data=iris, FUN="mean")
I am not entirely sure if that's what you are looking for but there are several options to add “stuff” to data frames:
To add a variable, just type data$newname <- NA (no need to know the length of the data frame or pass a vector, all rows will be filled with NA)
To append data use rbind (the data you are adding should be another data frame with the same variables)
To fix your example, first create an empty data frame and append data as it comes:
data <- data.frame(x=numeric())
data <- rbind(data, data.frame(x))
The previous example had only one variable (x) but you can also define a data frame with several variables and no rows:
data <- data.frame(x=numeric(),
y=numeric(),
a=character(),
b=factor(levels=c("Factor 1", "Factor 2")))
You don't need to know how many rows you will have, but the data you are adding needs to have the same structure. If that's not the case, you need to create columns with missing values in both data frames as needed, e.g.
data1 <- data.frame(x=1:10, y=1)
data2 <- data.frame(y=2, z=100:110)
rbind(data1, data2) # Error
data1$z <- NA
data2$x <- NA
rbind(data1, data2) # Now it works

using ffdfdply to split data and get characteristics of each id in the split

Within R I'm using ffdf to work with a large dataset. I want to use ffdfdply from the ffbase package to split the data according to a certain variable (var) and then compute some characteristics for all the observations with a unique value for var (for example: the number of observations for each unique value of var). To see if this is possible using ffdfdply I executed the example described below.
I expected that it would split on each Species, calculate the minimum Petal.Width for each Species, and then return two columns, each with three entries, listing the Species and the minimum Petal.Width for that Species. Expected output:
Species min_pw
1 setosa 0.1
2 versicolor 1.0
3 virginica 1.4
However for BATCHBYTES=5000 it will use two splits, one containing two Species and the other containing one Species. This results in the following:
Species min_pw
1 setosa 0.1
2 virginica 1.4
When I change BATCHBYTES to 2000, this forces ffdfdply to use three splits and thus produces the expected output posted above. However, I want another way of enforcing a split for each unique value of the variable assigned to 'split'. Is there any way to make this happen? Or do you have any other suggestions to get the result I need?
ffiris <- as.ffdf(iris)
result <- ffdfdply(x = ffiris,
split = ffiris$Species,
FUN = function(x) {
min_pw <- min(x$Petal.Width)
data.frame(Species=x$Species, min_pw= min_pw)
},
BATCHBYTES = 5000,
trace=TRUE
)
dim(result)
dim(iris)
result
The function ffdfdply was designed for situations where you have a lot of split elements, e.g. when you have 1000000 customers and you want to have the data in memory at least split by customer, but possibly covering more customers per chunk if your RAM allows, such that the internals do not need to do an ffwhich 1000000 times.
That is why the doc of ffdfdply states:
Please make sure your FUN covers the fact that several split elements can be in one chunk of data on which FUN is applied.
So the solution for your issue is to cover this in FUN, namely as follows:
FUN=function(x){
require(doBy)
summaryBy(Petal.Width ~ Species, data=x, keep.names=TRUE, FUN=min)
}
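Putting the corrected FUN back into the question's call might look like the sketch below. It is untested here, since it needs the ffbase and doBy packages installed:

```r
library(ffbase)
library(doBy)

ffiris <- as.ffdf(iris)
result <- ffdfdply(x = ffiris,
                   split = ffiris$Species,
                   FUN = function(x) {
                     # summaryBy groups within the chunk itself, so several
                     # split elements landing in one chunk are handled correctly
                     summaryBy(Petal.Width ~ Species, data = x,
                               keep.names = TRUE, FUN = min)
                   },
                   BATCHBYTES = 5000)
```

With this FUN, each chunk returns one row per Species it contains, so the result no longer depends on how BATCHBYTES happens to split the data.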
