This question already has answers here:
Find complement of a data frame (anti - join)
(7 answers)
Closed 5 years ago.
I have a dataset, loaded with mushrooms <- read.csv("mushrooms.csv"), and I already have a mushrooms.training_set that contains 1/3 of the whole dataset. For both variables, typeof() returns "list".
Now I want to select the rows of the original dataset mushrooms that are not in mushrooms.training_set. How would I do this? I have tried the following:
mushrooms[c(!mushrooms.training_set),] but this returns something in the order of 64K rows.
mushrooms[!mushrooms.training_set,]
mushrooms[!duplicated(mushrooms.training_set)]
Can anyone help me out?
From where you are in the question, you can use dplyr::setdiff:
library(dplyr)
mushrooms.test = setdiff(mushrooms, mushrooms.training_set)
But most of the time it's easier to create the test set at the same time as the training set. There are lots of examples at How to split data into training and test sets?
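As a minimal sketch of creating both sets at once, assuming the 1/3 training share from the question and a small made-up data frame in place of mushrooms.csv (the columns and the seed are illustrative only), the row indices can be sampled once and reused for both sets:

```r
# Stand-in for the mushrooms data; the real file is not available here.
mushrooms <- data.frame(id = 1:12, cap = rep(c("flat", "bell", "convex"), 4))

set.seed(42)  # arbitrary seed, only for reproducibility
n_train <- floor(nrow(mushrooms) / 3)
train_idx <- sample(nrow(mushrooms), n_train)

# Positive indices select the training rows, negative indices drop them.
mushrooms.training_set <- mushrooms[train_idx, ]
mushrooms.test_set <- mushrooms[-train_idx, ]
```

Because both sets come from the same index vector, no row can end up in both, and no set difference is needed afterwards.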
This question already has answers here:
Sort (order) data frame rows by multiple columns
(19 answers)
Closed 3 months ago.
I am trying to order a data frame/matrix by the variable VIRUS, in descending or ascending order; it does not matter which.
Anncol<-data.frame(Metadata$VIRUS) ## From left to case, add more if needed
sortedAnncol<-Anncol[order(Anncol$VIRUS),]
sAnncol<-as.matrix(sortedAnncol)
This is what I have tried so far, but I lose the first column of data, i.e. the corresponding data points in the data frame. How can I order the 'Anncol' data frame by the variable 'VIRUS' while simultaneously ordering the rest of the data frame?
Any help would be greatly appreciated! Thank you in advance
Here is a solution using dplyr:
library(dplyr)
Metadata <- Metadata %>% arrange(VIRUS)
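A base R equivalent, sketched on a made-up Metadata data frame (the SAMPLE column and all values are hypothetical): indexing the whole data frame with order() keeps every column aligned. Note that row-indexing a data frame that has only one column returns a plain vector unless drop = FALSE is given, which may be part of why the attempt in the question lost its structure.

```r
# Hypothetical stand-in for Metadata.
Metadata <- data.frame(
  SAMPLE = c("s1", "s2", "s3", "s4"),
  VIRUS = c("HPV", "EBV", "HBV", "EBV")
)

# order() returns the row permutation; applying it to the whole data frame
# reorders every column together.
sorted_metadata <- Metadata[order(Metadata$VIRUS), ]

# For descending order, pass decreasing = TRUE.
sorted_desc <- Metadata[order(Metadata$VIRUS, decreasing = TRUE), ]
```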
This question already has answers here:
Simultaneously merge multiple data.frames in a list
(9 answers)
Reshaping multiple sets of measurement columns (wide format) into single columns (long format)
(8 answers)
Closed 6 months ago.
I'm a complete beginner and I'm already struggling with just creating the panel data set. I'm working with data from a longitudinal study, so I have datasets from 6 timepoints. I wanted to merge these datasets using left_join from the dplyr package:
df_merged <- df0 %>%
left_join(df9, by='ID02_01') %>%
left_join(df10, by='ID02_01')%>%
left_join(df11, by='ID02_01')%>%
left_join(df12, by='ID02_01')%>%
left_join(df13, by='ID02_01')%>%
left_join(df14, by='ID02_01')
Well, it seems to work, but I keep reading that R needs a long format for longitudinal analysis, so I guess I'm on a totally wrong path here. Does anyone have a step-by-step recommendation on how to prepare data for multilevel modelling?
Also if someone offers tutoring on this I'm very thankful for every offer or recommendation!
Thank you all so much in advance.
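One possible way to get a long format, sketched under the assumption that every wave's data frame uses the same column names (the score column, the values, and the wave labels below are made up, and only three waves are shown): label each data frame with its timepoint and stack them row-wise instead of joining them side by side.

```r
# Hypothetical per-wave data frames sharing the ID column from the question;
# the "score" column is invented for illustration.
df0 <- data.frame(ID02_01 = c(1, 2, 3), score = c(10, 12, 9))
df9 <- data.frame(ID02_01 = c(1, 2, 3), score = c(11, 13, 8))
df10 <- data.frame(ID02_01 = c(1, 3), score = c(14, 7))

waves <- list(t0 = df0, t9 = df9, t10 = df10)

# Add a wave label to each data frame, then stack them row-wise.
labelled <- Map(function(d, w) cbind(wave = w, d), waves, names(waves))
df_long <- do.call(rbind, labelled)
rownames(df_long) <- NULL

# One row per person per timepoint, sorted by person.
df_long <- df_long[order(df_long$ID02_01), ]
```

If the columns are named differently in each wave, they would need to be renamed to a common scheme first, or the wide result of the joins would have to be reshaped instead.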
This question already has answers here:
Extracting specific columns from a data frame
(10 answers)
Closed 4 years ago.
I have to extract two columns from this data set (Cars93 from the MASS package) and create a separate folder consisting only of the two columns MPG.highway and EngineSize. How do I go about doing this?
You can look at Cars93 in MASS and just get the first ten rows to see it.
You can create the subset by indexing with the column names directly or, alternatively, with the select argument of the subset function:
library(MASS) # provides the Cars93 data set
new_df <- Cars93[,c("MPG.highway","EngineSize")]
#or
new_df <- subset(Cars93, select = c("MPG.highway","EngineSize"))
This question already has an answer here:
Vector of cumulative sums in R
(1 answer)
Closed 6 years ago.
I'm having real difficulty performing a calculation that is incredibly easy to perform in Excel. What I require is a kind of rolling addition whereby each value in a column is added to the running total of the preceding data points. For example:
column a: 1,2,3,5,16,18,3,11
would produce:
column b: 1,3,6,11,27,45,48,59
i.e. (1),(1+2=3),(3+3=6),(6+5=11)...
I have a feeling I'm missing something really obvious but have tried various iterations of rollapply and shift with no success... How can I do this in R? What am I missing?
The function you are looking for is cumsum:
df = data.frame(a=1:10)
df$b = cumsum(df$a)
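Applied to the numbers from the question, as a quick check (the variable names a and b simply mirror the column names used there):

```r
# Column a from the question.
a <- c(1, 2, 3, 5, 16, 18, 3, 11)

# Running total: each element is the sum of all values up to that position.
b <- cumsum(a)
# b is 1 3 6 11 27 45 48 59
```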
This question already has answers here:
Find complement of a data frame (anti - join)
(7 answers)
Closed 7 years ago.
I'm having a problem with a very simple issue and I don't know how to sort it out. Here's the deal: I have two one-column data frames
a <- data.frame(C=c("c1","c2","c3","c4","c5","c6","c7","c8"))
b <- data.frame(C=c("c1","c4","c5","c8"))
I would like to get a one-column data frame with the entries that are in a but do NOT appear in b, i.e. a data frame with "c2","c3","c6","c7".
I tried
c <- setdiff(a,b)
but I got the whole a data frame back. I also tried
c <- merge(a,b,all.x=TRUE)
but that doesn't give me what I want either. Do you know where I am going wrong?
We can use anti_join
library(dplyr)
anti_join(a,b)
Or
data.frame(C= setdiff(a[,1], b[,1]))
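A base R sketch of the same idea that keeps the result as a data frame. The likely reason setdiff(a,b) returned a unchanged is that base setdiff treats a data frame as a list of columns and compares whole columns rather than rows; the name only_in_a below is just illustrative.

```r
a <- data.frame(C = c("c1", "c2", "c3", "c4", "c5", "c6", "c7", "c8"))
b <- data.frame(C = c("c1", "c4", "c5", "c8"))

# Keep the rows of a whose C value does not occur in b.
# drop = FALSE stops R from simplifying the one-column result to a vector.
only_in_a <- a[!(a$C %in% b$C), , drop = FALSE]
```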