Generate dplyr arguments using values in another dataframe - r

I have data where the factor labels have been provided in separate files. As a result, when I read things in I have data that looks like this:
id <- seq(1,10,1)
factor_x <- as.factor(sample(x = 1:7, size = 10, replace = T))
data <- data.frame(id, factor_x)
And a separate data frame containing the labels for factor_x that looks like this:
code <- seq(1,7,1)
label <- letters[1:7]
factor_x_labels <- data.frame(code, label)
factor_x_labels$label <- as.character(factor_x_labels$label)
I am looking for an efficient way to update factor_x in data frame 'data' with the labels in data frame 'factor_x_labels'.
I have been trying to work with fct_recode from the forcats package or recode from dplyr, but I am running into trouble because (for example) the existing and updated labels are stored as strings, whereas these functions expect them as arguments of the form old = new, with = as a symbol rather than part of a string.

@Ronak's comment clearly works (and should perhaps be an answer), but since this post is tagged dplyr, here is a dplyr solution as well:
library(dplyr)
# The join keys need compatible types: this won't work if one of "code" and "factor_x" is numeric but not the other
factor_x_labels$code <- as.character(factor_x_labels$code)
data %>%
  left_join(factor_x_labels, by = c("factor_x" = "code")) %>%
  rename(factor_x_label = label)
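If the goal is to relabel factor_x in place rather than add a column, one base R option is to rebuild the factor with levels and labels taken from the lookup table. This is only a minimal sketch: it assumes every code in data appears in factor_x_labels$code, and the seed and the original_codes helper are additions for illustration, not part of the question.

```r
# Rebuild the example data from the question (seed added for reproducibility)
set.seed(42)
data <- data.frame(id = 1:10,
                   factor_x = as.factor(sample(x = 1:7, size = 10, replace = TRUE)))
factor_x_labels <- data.frame(code = 1:7, label = letters[1:7],
                              stringsAsFactors = FALSE)

# Hypothetical helper: keep the original codes so the result can be checked
original_codes <- as.integer(as.character(data$factor_x))

# Map each code to its label; a code missing from the lookup would become NA
data$factor_x <- factor(as.character(data$factor_x),
                        levels = as.character(factor_x_labels$code),
                        labels = factor_x_labels$label)
```

If recode from dplyr is preferred, its documentation describes splicing a named vector with !!!, so something like recode(x, !!!setNames(factor_x_labels$label, factor_x_labels$code)) may avoid pasting old = new pairs by hand.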

Related

How do I compare variables from different datasets and mutate the variable accordingly in RStudio?

Comparing two items from two different datasets and mutate variable accordingly
summary
Dear Stackoverflow-Community,
I'm trying to compare a column/variable (item1) from one dataset (data1) with a column/variable (item1) from a different dataset (data2). I would like to mutate the compared column/variable (item1) in dataset1 to a third variable (letter) of dataset data2.
Unfortunately, my code produces the error message "Error in UseMethod("mutate_") : inapplicable method for 'mutate_' applied to object of class "logical"".
I've created two example datasets, plus a dataset showing the output that I'm trying to generate with R. You will find them at the Dropbox link below.
Download the example datasets (+ visualization of the desired output)
https://www.dropbox.com/sh/eido04eiocuw06l/AABiCr2EpRf4PPsb1HYLLGFna?dl=0
My Code
data1 <- read.csv2("data 1.csv")
data2 <- read.csv2("data 2.csv")
attach(data1)
attach(data2)
data1 <- as.data.frame(data1)
data2 <- as.data.frame(data2)
if(data1$item.1 = data2$item.1) %>%
mutate(data1$item.1 == data2$letter)
Background
I downloaded a big dataset from moodle and I need to transform the dataset in order to do my analyses. This afternoon I've been trying this for way too long with my colleague and now we hope for some advice (as we just started with R).
Thanks in advance and have a great day!
Karla
data1 <- read.csv2("stackoverflow/data_1.csv")
data2 <- read.csv2("stackoverflow/data_2.csv")
# Get data in format where there are only two columns
long_data1 <- tidyr::gather(data1, key = "key", value = "value", -person)
long_data2 <- tidyr::gather(data2, key = "key", value = "value", -letter)
# Merge on those two columns
merged_data <- merge(long_data1, long_data2, by = c("key", "value"))
# Tidy up the results
merged_data <- subset(merged_data, select = c(person, letter, key))
final_data <- tidyr::spread(merged_data, key = key, value = letter)
The cleanest solution I can come up with is getting the data into the long format, where each observation has its own row, and then merging on the shared columns. The tidyr package does this best; if you don't have it installed already, install it with install.packages("tidyr").
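For a single item column, the same lookup can also be sketched in base R with match(). The data below is made up for illustration (the real files are only available via the link in the question), so the column layout of data1 and data2 is an assumption.

```r
# Hypothetical stand-ins for the two CSV files
data1 <- data.frame(person = c("p1", "p2", "p3"),
                    item.1 = c(10, 30, 20),
                    stringsAsFactors = FALSE)
data2 <- data.frame(letter = c("A", "B", "C"),
                    item.1 = c(10, 20, 30),
                    stringsAsFactors = FALSE)

# For each value in data1$item.1, find the row of data2 with the same value
# and take that row's letter; values without a match become NA
data1$item.1 <- data2$letter[match(data1$item.1, data2$item.1)]
```

With many item columns, the long-format merge above scales better than repeating this line per column.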

Reordering rows in a dataframe in r

I would like to reorder rows in a dataframe based on a specific order. Here is a dummy dataframe (in the long format) that pretty much looks like my data:
library(ggplot2)
library(dplyr)
library(tidyr) # provides gather()
#data frame
MV<-c(rnorm(50,mean=10, sd=1),rnorm(50,mean=9, sd=1))
ML<-c(rnorm(50,mean=12, sd=1),rnorm(50,mean=10, sd=1))
NL<-c(rnorm(50,mean=10, sd=1),rnorm(50,mean=8,sd=1))
ID<-rep(1:50,1)
Type<-rep(c("BM","NBM"),times=1, each=50)
df<-data.frame(ID, Type, MV, ML, NL)
#Here is the dataframe:
df.gat<-gather(df, "Tests", "Value", 3:5)
My data is already in the long format to start with (df.gat). The code before that is just to get you a similar dataframe.
Basically, I'd like to have my data ordered in my dataframe in the following order: NL, MV, and ML
I have tried various methods, such as those in "Reorder rows using custom order" and "How does one reorder columns in a data frame?", but they are not very convenient considering the number of rows in my dataset.
The solution also needs to work if some participants didn't do all the tests.
Any solution?
In that case you could just tweak what I suggested above into:
df.gat[rev(order(df.gat$Tests)),]
which happens to do the trick for me here but not necessarily generically.
If you want something generic you could (re-)create (the/a) factor:
df.gat$tests2 <- factor(df.gat$Tests, levels = c('NL', 'MV', 'ML'))
df.gat[order(df.gat$tests2),]
which should give you the same ordering as above.
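A variant that avoids adding a helper column is to order on the position of each test in the desired sequence using match(). The sketch below uses a small made-up data frame (not the question's data) in which participant 2 skipped the MV test, to show that missing tests do not break the ordering; test_order and ordered are names chosen here for illustration.

```r
# Small hypothetical long-format data; participant 2 has no MV row
df.gat <- data.frame(ID    = c(1, 1, 1, 2, 2),
                     Tests = c("MV", "ML", "NL", "ML", "NL"),
                     Value = c(10.1, 12.3, 9.8, 11.7, 8.9),
                     stringsAsFactors = FALSE)

# Desired order of the tests
test_order <- c("NL", "MV", "ML")

# Sort by position in test_order first, then by participant ID
ordered <- df.gat[order(match(df.gat$Tests, test_order), df.gat$ID), ]
```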
If I am understanding you correctly, you simply want to reorder the columns of your dataframe. Why not do something like this:
library(ggplot2)
library(dplyr)
library(tidyr) # provides gather()
#data frame
MV<-c(rnorm(50,mean=10, sd=1),rnorm(50,mean=9, sd=1))
ML<-c(rnorm(50,mean=12, sd=1),rnorm(50,mean=10, sd=1))
NL<-c(rnorm(50,mean=10, sd=1),rnorm(50,mean=8,sd=1))
ID<-rep(1:50,1)
Type<-rep(c("BM","NBM"),times=1, each=50)
df<-data.frame(ID, Type, MV, ML, NL)
df.gat<-gather(df, "Tests", "Value", 3:5)
df <- df %>% select(NL, MV, ML, everything())
Did you try just placing multiple parameters in an order() call like so?
df[order(MV,ML,NL),]
Your df isn't the best demonstration of this as they're all decimals.
Here is a simpler alternative example using discrete values:
df2 <- data.frame(
C1=sample(rep(c(10,20,30) ,20)),
C2=sample(rep(c('A','B','C'),20)))
df2[order(df2$C1,df2$C2),]
I'm not sure why you'd need tidyr::gather() in your example if you're just working on reordering df, right?

Vector addition with vector indexing

This may well have an answer elsewhere but I'm having trouble formulating the words of the question to find what I need.
I have two dataframes, A and B, with A having many more rows than B. I want to look up a value from B based on a column of A, and add it to another column of A. Something like:
A$ColumnToAdd + B[ColumnToMatch == A$ColumnToMatch,]$ColumnToAdd
But I get, with a load of NAs:
Warning in `==.default`: longer object length is not a multiple of shorter object length
I could do it with a messy for-loop but I'm looking for something faster & elegant.
Thanks
If I understood your question correctly, you're looking for a merge or a join, as suggested in the comments.
Here's a simple example for both using dummy data that should fit what you described.
library(tidyverse)
# Some dummy data
ColumnToAdd <- c(1,1,1,1,1,1,1,1)
ColumnToMatch <- c('a','b','b','b','c','a','c','d')
A <- data.frame(ColumnToAdd, ColumnToMatch)
ColumnToAdd <- c(1,2,3,4)
ColumnToMatch <- c('a','b','c','d')
B <- data.frame(ColumnToAdd, ColumnToMatch)
# Example using merge
A %>%
merge(B, by = c("ColumnToMatch")) %>%
mutate(sum = ColumnToAdd.x + ColumnToAdd.y)
# Example using join
A %>%
inner_join(B, by = c("ColumnToMatch")) %>%
mutate(sum = ColumnToAdd.x + ColumnToAdd.y)
The advantages of the dplyr versions over merge are that rows are kept in their existing order, they are much faster, they tell you what keys you're merging by (if you don't supply them), and they also work with database tables.
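If a full join is more than is needed, the original expression can also be repaired with match(), which returns, for each row of A, the index of the matching row in B. A minimal sketch using the same dummy data as above; the result name total is an addition for illustration.

```r
# Same dummy data as in the answer above
A <- data.frame(ColumnToAdd   = c(1, 1, 1, 1, 1, 1, 1, 1),
                ColumnToMatch = c("a", "b", "b", "b", "c", "a", "c", "d"),
                stringsAsFactors = FALSE)
B <- data.frame(ColumnToAdd   = c(1, 2, 3, 4),
                ColumnToMatch = c("a", "b", "c", "d"),
                stringsAsFactors = FALSE)

# Row index in B for every row of A (NA where A's key is absent from B)
idx <- match(A$ColumnToMatch, B$ColumnToMatch)

# Element-wise sum; both vectors now have nrow(A) elements, so no recycling warning
total <- A$ColumnToAdd + B$ColumnToAdd[idx]
```

Unlike ==, which compares the two columns position by position and recycles the shorter one, match() performs a lookup, which is what the question describes.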

How to Rename Column Headers in R

I have two separate datasets: one has the column headers and another has the data.
The first one looks like this:
where I want to make the 2nd column as the column headers of the next dataset:
How can I do this? Thank you.
In general you can use colnames, which returns a character vector of the column names of your dataframe or matrix. You can then rename the columns of your dataframe with:
colnames(df) <- *listofnames*
Also it is possible just to rename one name by using the [] brackets.
This would rename the first column:
colnames(df2)[1] <- "name"
For your example, we are going to take the values of your second column. Try this:
colnames(df2) <- as.character(df1[,2])
Make sure that the number of new names matches the number of columns.
The equivalent for rows is rownames().
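Put together, the renaming step might look like the sketch below. The two data frames are made up for illustration, since the originals from the question are not shown; the explicit length check is an addition so that a mismatch fails early with a clear error.

```r
# Hypothetical stand-ins: df1 holds the headers in its 2nd column, df2 holds the data
df1 <- data.frame(id = 1:3,
                  header = c("name", "age", "city"),
                  stringsAsFactors = FALSE)
df2 <- data.frame(V1 = c("Ann", "Bob"),
                  V2 = c(31, 45),
                  V3 = c("Oslo", "Lima"),
                  stringsAsFactors = FALSE)

# Take the new names from the second column of df1
new_names <- as.character(df1[, 2])

# Stop if the number of names does not match the number of columns
stopifnot(length(new_names) == ncol(df2))

colnames(df2) <- new_names
```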
dplyr way w/ reproducible code:
library(dplyr)
df <- tibble(x = 1:5, y = 11:15)
df_n <- tibble(x = 1:2, y = c("col1", "col2"))
names(df) <- df_n %>% select(y) %>% pull()
I think the select() %>% pull() syntax is easier to remember than list indexing. Also I used names over colnames function. When working with a dataframe, colnames simply calls the names function, so better to cut out the middleman and be more explicit that we are working with a dataframe and not a matrix. Also shorter to type.
You can simply do this:
names(data)[3]<- 'Newlabel'
Where names(data)[3] is the column you want to rename.

Applying functions on columns in nested data frame

I have data that I'm nesting into list columns, then I'd like to use purrr::map() to apply a plotting function separately to each column within the nested data frames. Minimal reproducible example:
library(dplyr)
library(tidyr)
library(purrr)
data=data.frame(Type=c(rep('Type1',20),
rep('Type2',20),
rep('Type3',20)),
Result1=rnorm(60),
Result2=rnorm(60),
Result3=rnorm(60)
)
dataNested=data%>%group_by(Type)%>%nest()
Say, I wanted to generate a histogram for Result1:Result3 for each element of dataNested$data:
dataNested%>%map(data,hist)
Any iteration of my code won't separately iterate over the columns within each nested dataframe.
Why would you need to complicate things in such a way when you're already in the tidyverse? List columns are rather a last-resort solution to problems.
library(tidyverse)
data %>%
gather(result, value, -Type) %>%
ggplot(aes(value)) +
geom_histogram() +
facet_grid(Type ~ result)
gather reformats the wide dataset into a long one, with Type column, result column and a value column, where all the numbers are.
Perhaps do not create a nested data frame. We can split the data frame by the Type column and plot the histogram.
library(tidyverse)
dt %>%
split(.$Type) %>%
map(~walk(.[-1], ~hist(.)))
DATA
library(tidyverse)
set.seed(1)
dt <- data.frame(Type = c(rep('Type1', 20),
rep('Type2', 20),
rep('Type3', 20)),
Result1 = rnorm(60),
Result2 = rnorm(60),
Result3 = rnorm(60),
stringsAsFactors = FALSE)
So I think you are thinking about this the right way. Running this code:
dataNested$data[[1]]
You can see that you have data that you can iterate. You can loop through it like:
for(i in dataNested) {
print(i)
}
This clearly demonstrates that the structure is nothing too complicated to work with. Okay so how to create the histograms? We can create a helper function:
helper_hist <- function(df) {
lapply(df, hist)
}
And run using:
map(dataNested$data, helper_hist)
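The same per-group, per-column iteration can be sketched in base R without nesting at all. This version is an illustration rather than the answer's own method: it uses split() in place of nest(), and passes plot = FALSE so that hist() returns the histogram objects (breaks, counts and so on) instead of drawing them, which makes the result easy to inspect. The names by_type and hists are chosen here.

```r
# Example data in the same shape as the question's (seed added for reproducibility)
set.seed(1)
dt <- data.frame(Type = rep(c("Type1", "Type2", "Type3"), each = 20),
                 Result1 = rnorm(60),
                 Result2 = rnorm(60),
                 Result3 = rnorm(60),
                 stringsAsFactors = FALSE)

# One data frame per Type, holding only the Result columns
by_type <- split(dt[-1], dt$Type)

# Outer lapply walks the groups, inner lapply walks the columns of each group
hists <- lapply(by_type, function(df) lapply(df, hist, plot = FALSE))
```

Dropping plot = FALSE draws the nine histograms instead; hists$Type1$Result1$counts gives the bin counts for one of them.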
Hope this helps.
