Subset from subset of a dataframe in R

I have a dataframe df that contains 10 different variables, including group.name, group.student, and others.
I first want to select the group named "papaya", so I do dfByGroup = df[group.name=="papaya",], and the result is fine.
Then I want to select a specific person called "julia" from that subset, so I do dfByJulia <- dfByGroup[group.student=="julia",] and view the result with View(dfByJulia). Unfortunately, I can still see rows with student names other than "julia".
The student name "julia" is actually unique in my data, so I also tried selecting julia's rows directly from the original data with dfByJulia<- df[student.name=="julia",]. This time, the subset is correct.
Why does this happen? Why can't I subset a subset with the [ operator? Why must I do it on the original dataframe?
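One possible explanation, sketched below with made-up data: if the bare name group.student resolves to a full-length vector (for example after attach(df), or because a vector with that name exists in the workspace), the logical index has one entry per row of df, not per row of dfByGroup, so it picks the wrong rows of the subset. That the question's setup works this way is an assumption, and the toy values are hypothetical.

```r
# Toy data (hypothetical values, only to illustrate the mechanism)
df <- data.frame(
  group.name    = c("mango", "papaya", "papaya", "papaya"),
  group.student = c("anna", "julia", "tom", "bob"),
  stringsAsFactors = FALSE
)

# First subset: 3 rows (julia, tom, bob)
dfByGroup <- df[df$group.name == "papaya", ]

# Correct: the index is built from the subset's own column,
# so it has one entry per row of dfByGroup
dfByJulia <- dfByGroup[dfByGroup$group.student == "julia", ]

# Problematic: the index is built from the full-length column (4 entries),
# which is what a bare `group.student` gives after attach(df)
full_length_index <- df$group.student == "julia"
wrong <- dfByGroup[full_length_index, ]

dfByJulia$group.student
wrong$group.student
```

Referring to columns through the data frame being subset (dfByGroup$group.student) keeps the index and the rows aligned, which is why subsetting the original df directly appeared to work.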

Related

How to dynamically create and name data frames in a for loop

I am trying to generate data frame subsets for each respondent in a data frame using a for loop.
I have a large data frame with columns titled "StandardCorrect", "NameProper", "StartTime", "EndTime", "AScore", and "StandardScore" and several thousand rows.
I want to make a subset data frame for each person's name so I can generate statistics for each respondent.
I tried using a for loop:
for(name in 1:length(NamesList)){ name <- DigiNONA[DigiNONA$NameProper == NamesList[name], ] }
NamesList is just a list containing all the levels of NameProper (which is a factor variable).
All I want the loop to do is, on each iteration, generate a new data frame with the name "NamesList[name]", containing the subset of the main data frame where NameProper corresponds to the name in the list for that iteration.
This seems like it should be simple, but I just can't figure out how to get R to dynamically generate data frames with different names for each iteration.
Any advice would be appreciated, thank you.
The advice to use assign for this purpose is technically feasible, but it is widely discouraged by experienced users of R. Instead, create a single list with named elements, each of which contains the data for a single individual. That way you don't need to keep a separate data object with the names of the resulting objects for later access.
named_Dlist <- setNames(split(DigiNONA, DigiNONA$NameProper), NamesList)
This would allow you to access individual dataframes within the named_Dlist object:
named_Dlist[[ NamesList[1] ]] # The dataframe with the first person in that NamesList vector.
It's probably better to use the term list only for true R lists and not for atomic character vectors.
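As a minimal sketch of the list approach on toy data (the rows and scores below are made up; only the column names NameProper and AScore come from the question): split() already names each element after the factor level, so the per-person data frames can be looked up by name and summarised in one pass.

```r
# Hypothetical stand-in for the real DigiNONA data frame
DigiNONA <- data.frame(
  NameProper = factor(c("Ann", "Bob", "Ann", "Cara")),
  AScore     = c(10, 20, 30, 40)
)

# One data frame per respondent; element names are the factor levels
named_Dlist <- split(DigiNONA, DigiNONA$NameProper)

# Access one respondent's rows by name
ann_rows <- named_Dlist[["Ann"]]

# Compute a statistic for every respondent without creating separate objects
mean_scores <- sapply(named_Dlist, function(d) mean(d$AScore))
```

Because the results stay in one list, lapply() or sapply() can replace the for loop entirely.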

mutate function in R

I am trying to add a column from one dataframe to another. The data is long repeated measures data, with each ID having two rows. Both my main dataset (d) and my secondary dataset (d2) use the same column (ID) to link cases to participants. However, when I use mutate like this d <- mutate(d, x = d2$x) the column binds to the dataframe but the values are not tied to the ID. This means that data gets mixed up between participants.
Is there a way to make sure that the values are referenced by ID when I add the column?
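One way this is often handled is a keyed join instead of binding the column by position. The sketch below uses base R merge() on toy data. Since each ID has two rows, ID alone would not identify a row, so it assumes a second key column, here a hypothetical time variable that is not mentioned in the question.

```r
# Toy long-format data: two rows per ID, distinguished by a hypothetical `time` column
d  <- data.frame(ID = c(1, 1, 2, 2), time = c(1, 2, 1, 2), y = c(5, 6, 7, 8))
# Secondary data in a different row order, so positional binding would mix up values
d2 <- data.frame(ID = c(2, 2, 1, 1), time = c(2, 1, 2, 1), x = c(80, 70, 60, 50))

# Match rows on both keys; all.x = TRUE keeps every row of d even without a match
d_joined <- merge(d, d2[, c("ID", "time", "x")],
                  by = c("ID", "time"), all.x = TRUE)
```

With dplyr loaded, left_join(d, d2, by = c("ID", "time")) does the same kind of keyed match while keeping the row order of d.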

Is there an R method to select the columns from a dataframe that are listed in a separate array

I have a dataframe with over 100 columns. After applying certain conditions, I need a subset of the dataframe with only the columns that are listed in a separate array.
The array has 50 entries with 2 columns. The first column has the selected variable names and the second column has some associated values.
I wish to build a new data frame with just the variables mentioned in the first column of the separate array. Could you please point me to how to proceed?
Try this:
library(dplyr)
# all_of() matches the listed names exactly; contains() would also match partial names
iris <- iris %>% select(all_of(dataframe_with_names$names))
In R you can use square brackets [rows, columns] to select specific rows or specific columns. (Leaving either blank selects all).
If you had a vector of column names you wanted to keep called important_columns you could select only those columns with:
myData[,important_columns]
In your case the vector of column names is actually a column in your array. So you select that column and use it as your vector:
myData[, array$names]
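A small self-contained sketch of the bracket approach, using made-up data. The object name selection and its column names are hypothetical. The as.character() call is a precaution: if the names are stored as a factor, indexing with the factor directly would use its integer codes rather than its labels.

```r
# Toy data frame standing in for the 100-column original
myData <- data.frame(a = 1:3, b = 4:6, c = 7:9, d = 10:12)

# Hypothetical lookup table: first column holds the names to keep
selection <- data.frame(names = c("a", "c"), value = c(0.1, 0.2),
                        stringsAsFactors = TRUE)

# Convert to a plain character vector before indexing
important_columns <- as.character(selection$names)

# drop = FALSE keeps a data frame even if only one column is selected
subset_df <- myData[, important_columns, drop = FALSE]
```

If the lookup is a matrix and not a data frame, the $ operator will not work on it; selection[, 1] would be the equivalent way to pull out the first column.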

Remove multiple rows from a list of names in R (a list of 187 names to remove)?

I have a data frame in R containing over 29,000 rows. I need to remove multiple rows using only a list of names (187 names).
My dataset is about airlines, and I need to remove specific airlines from my data set that contains over 200 types of airlines. My first column contains all airline names, and I need to remove the entire row for those specific airlines.
I singled out all airline names that I want removed by this code: transmute(a_name_remove, airline_name). This gave me a table of all names of airlines that I want removed, now I have to remove that list of names from my original dataset named airlines.
I know there is a way to do this manually, for example: mydata[-c("a", "b"), ]. But writing out each name would be tedious.
Can you please help me find a way to use the list that I have to directly remove those rows from my dataset?
I cannot write out each name on its own.
I also tried this: airlines[!(row.names(airlines) %in% c(remove)), ], where I tried my list of names to remove both as a data frame and as a vector, then used that code to remove the rows from my original dataset "airlines". It still did not work.
Thank you!
You can create a function that negates %in%, e.g.
'%not_in%' <- Negate('%in%')
So, per your code, it would look like this:
airlines[row.names(airlines) %not_in% remove, ]
Additionally, I do not recommend using remove as a variable name, since it is a base function in R. If possible, rename the variable, e.g. to discard_airlines:
airlines[row.names(airlines) %not_in% discard_airlines, ]
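Note that the code above filters on row names. The question says the airline names are in the first column, so filtering on that column may be what is needed. The following is a sketch on made-up data, assuming the column is called airline_name (as in the transmute call) and that a_name_remove is a data frame holding the names to drop.

```r
# Toy stand-in for the real airlines data
airlines <- data.frame(
  airline_name = c("Alpha Air", "Beta Jet", "Gamma Wings", "Beta Jet"),
  rating       = c(4, 2, 5, 3),
  stringsAsFactors = FALSE
)

# Hypothetical table of names to remove, as produced in the question
a_name_remove <- data.frame(airline_name = "Beta Jet", stringsAsFactors = FALSE)

# Pull the names out as a plain character vector
discard_airlines <- a_name_remove$airline_name

# Keep only rows whose airline name is NOT in the discard vector
airlines_kept <- airlines[!(airlines$airline_name %in% discard_airlines), ]
```

The %in% operator compares against a vector, so passing the whole data frame a_name_remove on the right-hand side would not behave as intended; extracting the column first avoids that.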

Assigning variables from variable values in a data frame to another data frame in R

I have 10 topics. For each topic name I have a results_topic_df data frame. This data frame has 2 columns: index, which is the name of another data frame, and var_name, which is the name of a variable inside the corresponding data frame (the one indicated by index).
What I want to do is take the corresponding original data frame (whose name is given by results_topic_df$index), look at the value of results_topic_df$var_name in the same row, go to the original data frame, and copy the relevant variable to a data frame named container_df.
Eventually container_df will hold only the selected variables from all the data frames that appear in results_topic_df.
I want to repeat this procedure for each of the 10 topics.
I have tried to do this with a loop, but because my data frames' names change, I got really confused with all the combinations of assign(), paste0(), and eval(). Is there a simpler way to accomplish my goal? Thanks.
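One possible simplification, sketched on toy data: look each data frame up by name with get() and pull the column with [[ ]], so no assign() or eval() is needed. The data frames df_a and df_b and their contents are hypothetical, and the sketch assumes every source data frame has the same number of rows so the columns can sit side by side.

```r
# Hypothetical source data frames that results_topic_df refers to by name
df_a <- data.frame(x = 1:3, y = 4:6)
df_b <- data.frame(z = 7:9)

results_topic_df <- data.frame(
  index    = c("df_a", "df_b"),
  var_name = c("y", "z"),
  stringsAsFactors = FALSE
)

# For each row, fetch the named data frame and extract the named column
cols <- Map(function(df_name, var) get(df_name)[[var]],
            results_topic_df$index, results_topic_df$var_name)

# Label each column with its source so the origin stays traceable
names(cols) <- paste(results_topic_df$index, results_topic_df$var_name, sep = ".")

container_df <- as.data.frame(cols)
```

To repeat this for all 10 topics, the same steps could be wrapped in a function and applied with lapply() over a list of the results_topic_df data frames, which gives a list of container data frames rather than ten separately named objects.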
