Learning R-Code Review on Subsetting Data


Instructions:
Create an object called response that contains the name of the variable that will be considered the response variable.
Create a vector called predictors that contains the names of the variables that will be used as predictors.
Assignment Code:
# Create response object
response = 'price'
# Create predictors object
predictors = c("carat","cut","clarity","color","depth")
Feedback:
Now that you've identified what variables are of interest, you can subset the data to only include those columns.
Subsetting the Data
Context:
There are various ways to subset a data frame to only include specific columns. Here, you can use the objects response and predictors to indicate the names of the columns to keep. The function select(), from the {dplyr} package, is a good way to accomplish this task. It requires two arguments:
the name of the data frame being used
the column names to be included in the subset, or the objects that contain that information (response and predictors), separated by a comma
Instructions:
Use select() to create a subset of myData that contains only the columns of interest, and store this subset in an object called myData_subset.
Assignment Code:
# Subset the data
myData_subset %>% select(response,predictors)
Could anyone tell me where I am going wrong on subsetting? Is it that my objects were created incorrectly? Thanks a lot.

Try this: the assignment is missing myData, and selecting with an external vector is ambiguous, so wrap each vector in all_of():
library(dplyr)
response = 'price'
predictors = c("carat","cut","clarity","color","depth")
# Assign the result, and start the pipe from the full data frame
myData_subset <- myData %>%
  select(all_of(response), all_of(predictors))
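For comparison, here is a minimal base R sketch of the same subset. It assumes a small stand-in for myData, since the real data is not shown in the question; the values and the extra table column are made up for illustration.

```r
# Toy stand-in for myData; values are invented for illustration only
myData <- data.frame(
  price   = c(326, 327, 334),
  carat   = c(0.23, 0.21, 0.29),
  cut     = c("Ideal", "Premium", "Good"),
  clarity = c("SI2", "SI1", "VS2"),
  color   = c("E", "E", "I"),
  depth   = c(61.5, 59.8, 62.4),
  table   = c(55, 61, 58)   # an extra column that should be dropped
)

response   <- "price"
predictors <- c("carat", "cut", "clarity", "color", "depth")

# A character vector of column names can index a data frame directly
myData_subset <- myData[, c(response, predictors)]
names(myData_subset)
```

Because the names are passed as strings, there is no ambiguity between a column called predictors and an external object of that name.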

Related

How to dynamically create and name data frames in a for loop

I am trying to generate data frame subsets for each respondent in a data frame using a for loop.
I have a large data frame with columns titled "StandardCorrect", "NameProper", "StartTime", "EndTime", "AScore", and "StandardScore" and several thousand rows.
I want to make a subset data frame for each person's name so I can generate statistics for each respondent.
I tried using a for loop
for(name in 1:length(NamesList)){ name <- DigiNONA[DigiNONA$NameProper == NamesList[name], ] }
NamesList is just a list containing all the levels of NameProper (which is a factor variable).
All I want the loop to do is each iteration, generate a new data frame with the name "NamesList[name]" and I want that data frame to contain a subset of the main data frame where NameProper corresponds to the name in the list for that iteration.
This seems like it should be simple; I just can't figure out how to get R to dynamically generate data frames with different names for each iteration.
Any advice would be appreciated, thank you.

The advice to use assign for this purpose is technically feasible, but it is widely discouraged by experienced R users. Instead, create a single list with named elements, each of which contains the data for a single individual. That way you don't need to keep a separate data object with the names of the resulting objects for later access.
named_Dlist <- setNames(split(DigiNONA, DigiNONA$NameProper),
                        NamesList)
This would allow you to access individual dataframes within the named_Dlist object:
named_Dlist[[ NamesList[1] ]] # The dataframe with the first person in that NamesList vector.
It's probably better to use the term list only for true R lists and not for atomic character vectors.
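As a rough sketch of how the list approach leads to per-respondent statistics, the example below uses a toy stand-in for DigiNONA with only two of the question's columns. The names and scores are invented, and the mean of AScore is just one possible summary.

```r
# Toy stand-in for DigiNONA; only two of the question's columns are used
DigiNONA <- data.frame(
  NameProper = factor(c("Ann", "Bob", "Ann", "Cy", "Bob")),
  AScore     = c(10, 20, 30, 40, 50)
)

# One list element per respondent, named after the factor levels
by_person <- split(DigiNONA, DigiNONA$NameProper)

# Apply the same summary to every respondent's data frame
mean_scores <- sapply(by_person, function(d) mean(d$AScore))
mean_scores
```

split() already names the elements after the factor levels, so the result can be indexed by name, for example by_person[["Ann"]].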

Dropping Columns of Specific Name in R

I'm working, in RStudio, with data for patients that are either normal, have Crohn's disease, or ulcerative colitis. Now, the data is structured in such a way that patient information is in a separate data frame (called sampleInfo), and the data I want to use for analysis is in a different data frame (called expressionData). For my analysis, I would like to remove the patients that are 'normal' from the dataset and only keep those with Crohn's disease or ulcerative colitis.
So, what I did was first run the following command to make a new data frame from sampleInfo containing all the patients (aka rows) with the normal disease state, using the following command:
bad_patients <- sampleInfo[sampleInfo$characteristics_ch1.3 == "disease state: normal", ]
bad_patients has a column called geo_accession, which contains the patient IDs; these also correspond to the column names for the same patients in expressionData.
I save these IDs using
patient_names <- bad_patients$geo_accession
Now, I want to remove the columns with these names from expressionData. I looked at a lot of different StackOverflow posts, as well as posts on the R help forum, and found two main ways, both of which I have tried. The first is done with the following command:
newDataFrame <- expressionData[ , !names(expressionData) %in% patient_names]
Though this method does produce a new matrix called newDataFrame, attempting to view this matrix in RStudio gives the following error:
Error in View : 'names' attribute [1] must be the same length as the vector [0]
I also tried a second subset method with the following command:
newDataFrame <- subset(expressionData, -patient_names)
which raises the error: Error in -patient_names : invalid argument to unary operator
I also tried this subset method by explicitly typing out the columns I wanted to remove, as follows:
newDataFrame <- subset(expressionData, -c('ID090190', ...)) (where ... corresponds to the rest of the IDs) and got the exact same error.
Can someone tell me what I'm doing wrong, or how to work around this?

Couple of solutions:
Subsetting based on names
newDataFrame <- expressionData[!(names(expressionData) %in% patient_names)]
I've wrapped the expression negated by ! in parentheses to make the intent explicit. Strictly speaking they are optional: %in% binds more tightly than !, so !names(expressionData) %in% patient_names is already evaluated as !(names(expressionData) %in% patient_names). If this still fails, the likelier cause is that expressionData is not a data frame (see the edit below).
I've subset with only one dimension (x[this] rather than x[,this]). You can do this with the columns of data frames because a data frame is a list of its columns. This subsetting method preserves the data.frame class of the returned object, whereas the two-dimensional subset will just return a vector if you select only one column. (Tibbles will return a tibble with both methods, which is one big advantage of tibbles)
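A small illustration of that difference, using a made-up three-column data frame (the column names are placeholders):

```r
df <- data.frame(a = 1:3, b = 4:6, c = 7:9)
keep <- !(names(df) %in% c("b", "c"))   # only column "a" survives

one_dim      <- df[keep]                 # list-style subsetting
two_dim      <- df[, keep]               # matrix-style, drop = TRUE by default
two_dim_kept <- df[, keep, drop = FALSE] # matrix-style, but keep the data frame

class(one_dim)       # "data.frame"
class(two_dim)       # "integer"
class(two_dim_kept)  # "data.frame"
```

Passing drop = FALSE is the usual way to keep the two-dimensional form from collapsing to a vector.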
Tidyverse solution: use dplyr::select with dplyr::all_of
newDataFrame <- dplyr::select(expressionData, -dplyr::all_of(patient_names))
Edit: Make sure your data really is a data.frame
If you're getting this error Error in UseMethod("select_") : no applicable method for 'select_' applied to an object of class "c('matrix', 'array', 'double', 'numeric')", it's because your data is a matrix, rather than a data frame. You may have inadvertently coerced it in processing.
Use as.data.frame to return to a data frame object, which will be compatible with the methods above. If you wish to keep your data as a matrix, use colnames to subset the columns:
expressionData[ , !(colnames(expressionData) %in% patient_names)]
If expressionData is a matrix, you'll need to subset the columns with colnames, rather than names. The names of a data.frame are identical to its colnames (because a df is a list of its columns), but the names of a matrix are the names of every element in the matrix, because a matrix is just an array with dimensionality. You'll want to check colnames(expressionData) to make sure that there are colnames to subset.
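The matrix case can be reproduced with a toy example. The sample IDs S1 to S4 below are hypothetical and the values are arbitrary.

```r
# Toy expression matrix; the sample IDs here are hypothetical
expressionData <- matrix(1:12, nrow = 3,
                         dimnames = list(NULL, c("S1", "S2", "S3", "S4")))
patient_names <- c("S2", "S4")

names(expressionData)      # NULL: a matrix has no per-column names()
colnames(expressionData)   # "S1" "S2" "S3" "S4"

# Drop the unwanted columns by colnames
kept <- expressionData[, !(colnames(expressionData) %in% patient_names)]
colnames(kept)

# After as.data.frame(), names() returns the column names again
names(as.data.frame(expressionData))
```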

You might want to try:
newDataFrame <- expressionData[ , !colnames(expressionData) %in% patient_numbers]
names(expressionData) is NULL, hence your error; you want the column names
in your example, your list of sample names was called patient_numbers, not patient_names

Adding numeric list object of values with row.names to dataframe of same length without row.names

I have performed an operation using the mclust package on a nonmissing data frame. The nonmissing data frame was created using the dplyr package by using the select function. As such, row.names appears as a vector in the data frame passed to the mclust function.
I next have extracted some critical values (the case 'classification') from this function as:
class<-functionobject$classification
Thus, the numeric list of classification values is associated with row.names.
When I attempt to append this list of values to a new data frame of the same length (the same cases) without row.names, I lose important ordering, it seems. I know this as when I compare classification groups on other variables in the new data frame, they are not equal to the values obtained in the mclust function using those same variables.
The reason I cannot simply append to the nonmissing data frame (with row.names) used in the mclust function is that I require other variables from the data set that were not used in the function and that needed to be merged on ID variables, as:
NEW_DF=merge(mclust_DF, other_DF, by=c("X1", "X2"))
So I end up with a data frame of the same length but which no longer has row.names on which I want to append the classification values from the mclust function described above. Although no errors are thrown when I use:
FINAL_DF<- cbind.data.frame(NEW_DF, class)
The data are off: on inspection, the group (class) means on relevant variables do NOT equal those from the mclust function (which they should, as it is the same core input data).
I realize I am missing something obvious here, but I have not found an answer despite an exhaustive search of the archives. What is the correct way to go about this rather tedious wrangling?

FWIW: a simple, though perhaps still inefficient, solution was to bind the saved classification data from the mclust function to the nonmissing data frame BEFORE merging with the additional validation variables. When the merge occurs, the 'row.names' vector induced by dplyr in the select step is lost and the cases are re-sorted.
This solution dawned on me when I realized that the mclust function was run on the nonmissing data frame (created using dplyr), so the resulting data objects follow the case ordering of the input data (by row.names).
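The re-sorting can be seen in a small sketch. Everything here is a toy stand-in: the score and extra columns are hypothetical, only one key column (X1) is used, and class_vec plays the role of the classification vector extracted from mclust.

```r
# Toy stand-ins; column names other than X1 are hypothetical
mclust_DF <- data.frame(X1 = c(3, 1, 2), score = c(30, 10, 20))
other_DF  <- data.frame(X1 = c(1, 2, 3), extra = c("a", "b", "c"))
class_vec <- c("hi", "lo", "mid")   # labels in mclust_DF row order

# Binding after the merge misaligns rows: merge() sorts by the key by default
wrong <- cbind(merge(mclust_DF, other_DF, by = "X1"), class = class_vec)

# Binding before the merge keeps each label with its own case
mclust_DF$class <- class_vec
right <- merge(mclust_DF, other_DF, by = "X1")

wrong[, c("X1", "class")]
right[, c("X1", "class")]
```

In the wrong version the label "hi" ends up on the case with X1 equal to 1, although it was assigned to the case with X1 equal to 3.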

How to convert all factor variables into numeric variables (in multiple data frames at once)?

I have n data frames, each corresponding to data from a city.
There are 3 variables per data frame and currently they are all factor variables.
I want to transform all of them into numeric variables.
I have started by creating a vector with the names of all the data frames in order to use in a for loop.
cities <- as.vector(objects())
for ( i in cities){
i <- as.data.frame(lapply(i, function(x) as.numeric(levels(x))[x]))
}
Although the code runs and I get no error, I don't see any changes to my data frames: all three variables remain factor variables.
The strangest thing is that when doing them one by one (as below) it works:
df <- as.data.frame(lapply(df, function(x) as.numeric(levels(x))[x]))

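What you're essentially trying to do is modify the type of each field if it is a factor (to a numeric type). One approach using purrr would be:
library(purrr)
# cities holds object names, so fetch the data frames with mget() first
to_num <- function(f) as.numeric(as.character(f))
city_dfs <- map(mget(cities), ~ modify_if(.x, is.factor, to_num))
Note that modify() is like lapply(), but it doesn't change the underlying data structure of the objects you are modifying (in this case, data frames). modify_if() simply takes a predicate as an additional argument. Calling as.numeric() directly on a factor would return the integer level codes rather than the original values, hence the conversion through as.character(). The result is a new named list of data frames; the original objects are left unchanged.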

for anyone who's interested in my question, I worked out the answer:
for ( i in cities){
assign(i, as.data.frame(lapply(get(i), function(x) as.numeric(levels(x))[x])))
}
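To see why the levels(x) idiom matters, and how the same conversion looks without assign() and get(), here is a small base R sketch. The city names and values are invented, and keeping the data frames in a named list is an assumption about how the data could be organised.

```r
f <- factor(c("10", "30", "20", "30"))

as.numeric(f)                 # integer codes of the levels: 1 3 2 3
as.numeric(levels(f))[f]      # the original values: 10 30 20 30
as.numeric(as.character(f))   # same values, converting every element

# Keeping the data frames in a named list avoids assign() and get()
city_dfs <- list(
  paris = data.frame(x = factor(c("1.5", "2.5"))),
  rome  = data.frame(x = factor(c("7", "9")))
)
city_dfs <- lapply(city_dfs, function(df) {
  # df[] <- keeps the data frame structure while replacing every column
  df[] <- lapply(df, function(x) as.numeric(levels(x))[x])
  df
})
str(city_dfs$paris)
```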

create many subsets at once

I have a large dataset based on some medical records so I cannot post a sample due to privacy restrictions, but I am trying to subset the one data frame into many. The goal is for each unique facility to be its own data frame so I can identify efficiency rates for each facility. I have tried the following code where df is the name of the data frame, Name is the name I will give to the subset, Location is the value of interest from the variable "Facility" from the original dataframe:
ratefunct <- function(df, Name, Facility) {Name <- subset(df, Facility, == "Location")
Name <- within(Name, {rate <- <-cumsum(Complete)/ cumsum(Complete+Incomplete) })}
but don't seem to be getting any results in my environment

Based on your comment, it sounds like you're trying to store the results of split as separate data frames.
You can do so like this, using assign:
dfL <- split(iris, iris$Species)
for (i in seq_along(dfL)) {
  # dfL[[i]] extracts the data frame itself; dfL[i] would be a one-element list
  assign(paste0("df_", names(dfL)[i]), dfL[[i]])
  # added the print line so you can see the names of the objects that are created
  print(paste0("df_", names(dfL)[i]))
}
[1] "df_setosa"
[1] "df_versicolor"
[1] "df_virginica"
Which will create data frames df_setosa, df_virginica, and df_versicolor
Alternatively, if you're happy with the current object names, you could simply use:
list2env(dfL,envir=.GlobalEnv)
Which will save each list item as an object, using the object's name in the list. So instead of having the df_ prefix, you would just have setosa, virginica, and versicolor objects.
Edit: as a simpler way to assign custom names to each created object, directly specifying the names of dfL is a nice clean solution:
names(dfL) <- paste0("df_",names(dfL))
list2env(dfL,envir=.GlobalEnv)
This way you avoid having to write the for loop, and still get object names with a useful prefix.
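If the goal is the rate calculation from the question rather than separate objects, the split list can be processed in place. This is only a sketch under assumptions: the toy data below is invented, and the column names Facility, Complete, and Incomplete are taken from the question's code.

```r
# Toy stand-in for the medical data; values are invented
df <- data.frame(
  Facility   = c("A", "A", "B", "B", "B"),
  Complete   = c(1, 3, 2, 2, 4),
  Incomplete = c(1, 1, 0, 2, 0)
)

# Split once, then add the cumulative completion rate within each facility
by_facility <- lapply(split(df, df$Facility), function(d) {
  d$rate <- cumsum(d$Complete) / cumsum(d$Complete + d$Incomplete)
  d
})

by_facility$A$rate
```

Each element of by_facility is an ordinary data frame, so list2env() can still be applied afterwards if separate objects are really needed.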
