I want to revalue multiple variables at the same time in R - r

For my thesis research I am trying to revalue multiple variables/columns at the same time. I have tried using the following function, but this would require me to specify each column separately:
full_df$CR39 <- revalue(full_df$CR39, c("1"="0", "2" ="1"))
I have around 65 variables (named CR00, CR01, CR02, ...) that I want to recode. The value "1" must become "0" and the value "2" must become "1". I also have variables (named CR00FAM, CR01FAM, CR02FAM, ...) which I do not wish to revalue at the same time.
I have tried using the 'select' function, but this does not seem to help: full_df%>% select(starts_with("DF"), -contains("FAM")).
Does anyone know a possible solution? I have searched a lot of stackoverflow topics, but none of the proposed solutions worked out for me.

We can loop over the variables and do this. Select the columns of interest with a regex, i.e. those column names that start (^) with 'CR' followed by one or more digits (\\d+) up to the end ($) of the string. Loop over the selected columns with lapply, apply revalue, and assign the output back to the selected columns of the dataset:
nm1 <- grep("^CR\\d+$", names(full_df), value = TRUE)
full_df[nm1] <- lapply(full_df[nm1], function(x) revalue(x, c("1"="0", "2" ="1")))
Or using dplyr
library(dplyr)
# revalue() comes from plyr; calling it as plyr::revalue avoids attaching plyr after dplyr
full_df <- full_df %>%
  mutate(across(matches("^CR\\d+$"),
                ~ plyr::revalue(., c("1" = "0", "2" = "1"))))
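If plyr is not available, the same idea can be sketched in base R with a named lookup vector. This is only a sketch: the toy data frame and the name recode_map are made up for illustration, and it assumes the columns hold character (or factor) values.

```r
# Toy data standing in for full_df (hypothetical values)
full_df <- data.frame(
  CR00    = c("1", "2", "2"),
  CR01    = c("2", "1", "1"),
  CR00FAM = c("1", "2", "1"),
  stringsAsFactors = FALSE
)

# Old value -> new value
recode_map <- c("1" = "0", "2" = "1")

# Columns named CR followed only by digits; CR00FAM does not match
nm1 <- grep("^CR\\d+$", names(full_df), value = TRUE)

# Look each value up by name; as.character() also copes with factor columns
full_df[nm1] <- lapply(full_df[nm1], function(x) unname(recode_map[as.character(x)]))
```

Note that, unlike revalue, this lookup turns any value missing from recode_map into NA, so it only fits when every value is covered by the map.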


How do I multiply columns of a dataset by a constant?

I am quite new to R, so excuse my lack of ability. I have tried and failed a fair bit, and would appreciate any input.
I am asked to get rid of inconsistent use of "." and "," to indicate decimals by multiplying every number in certain columns by some multiple of 10. I have tried to simply multiply using the binary operator *, but it obviously doesn't work as some columns are factors, which is required in this case.
I have tried using this code as well but get the error: subscript "Var" can't be "NA"
data %>% mutate_if(is.numeric, ~ . * 1000)
Below is the code I have for my dataset
datat <- c("Starting_year" , "Rank" , "Team" , "Home_total_Games", "Home_Total_Attendance" , "Home_Avg_Attendance" , "Home_capacity" , "Away_Total_Attendance" , "Away_Avg_Attendance" , "Away_Capacity")
names(data) <- datat
Factors assigned
data$Rank <- as.factor(data$Rank)
data$Starting_year <- as.factor(data$Starting_year)
Thanks in advance
I can't embed it, but there is a picture of the data below. I am asked to use a function in dplyr to multiply the columns by 1000 to remove all the "." and ",".
[image: dataset]
What is the format of numbers?
If the format is: 1.000.000,5, where . is a thousand separator, while , is a decimal separator, just use gsub:
foo = "1.000.000,5"
bar = gsub("\\.", "", foo) # "1000000,5"
baz = gsub(",", "\\.", bar) # "1000000.5"
as.numeric(baz)
In this case, factor is not a problem because gsub will de-factor the vector.
If you need to multiply the numbers after that, it is not a problem. Transform this into a function (such as convert_decimal) and apply it to columns you want:
data$column = convert_decimal(data$column)
For multiple selected columns (let's call the vector of names selection):
data[selection] = lapply(data[selection], convert_decimal)
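The answer leaves convert_decimal undefined, so here is one way it could look. It is a sketch that assumes "." is always a thousands separator and "," always the decimal separator; the toy data and column names are illustrative only.

```r
# Hypothetical helper: "1.000.000,5" -> 1000000.5
# Assumes "." is a thousands separator and "," the decimal separator
convert_decimal <- function(x) {
  x <- gsub(".", "", as.character(x), fixed = TRUE)  # drop thousands separators
  x <- sub(",", ".", x, fixed = TRUE)                # decimal comma -> decimal point
  as.numeric(x)
}

# Toy data; strings become factors here to show that factors are handled
data <- data.frame(
  Team = c("A", "B"),
  Home_Total_Attendance = c("1.000.000,5", "250.300"),
  stringsAsFactors = TRUE
)

selection <- "Home_Total_Attendance"
data[selection] <- lapply(data[selection], convert_decimal)
```

If the separators really are used inconsistently within a column, this rule will misread some values, so it is worth checking a few rows by hand first.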

Change complicated strings in R with gsub or stringr

I have a column of a data frame that has thousands of complicated sample names like this
sample <- c("16_3_S16_R1_001", "16_3_S16_R2_001", "2_3_S2_R1_001","2_3_S2_R2_001")
I am trying, with no success, to change the sample names to the following
16.3R1, 16.3R2, 2.3R1, 2.3R2
I am thinking of solving the problem with gsub or stringr.
Any suggestions? I have tried gsub but it does not return the desired names.
You can use sub to extract the parts :
sample <- c("16_3_S16_R1_001","16_3_S16_R2_001","2_3_S2_R1_001","2_3_S2_R2_001")
sub('(\\d+)_(\\d+)_.*(R\\d+).*', '\\1.\\2\\3', sample)
#[1] "16.3R1" "16.3R2" "2.3R1" "2.3R2"
\\d+ refers to one or more digits. The values captured between () are called capture groups. So here we are capturing one or more digits (1), followed by an underscore and another run of digits (2), and finally "R" with its digits (3). The captured values are referred to using backreferences, so \\1 is the first value, \\2 the second value and so on.
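To see what each capture group picks up, the groups can be pulled out with base R's regexec and regmatches. This is just an illustrative sketch using the same pattern on two of the sample names.

```r
sample  <- c("16_3_S16_R1_001", "2_3_S2_R2_001")
pattern <- "(\\d+)_(\\d+)_.*(R\\d+).*"

# Each list element holds the full match followed by capture groups 1, 2 and 3
groups <- regmatches(sample, regexec(pattern, sample))

groups[[1]]
# full match "16_3_S16_R1_001", then groups "16", "3", "R1"
```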
If you split the string sample into substrings according to the pattern "_", you need only the 1st, 2nd and 4th parts:
sample <- c("16_3_S16_R1_001",
"16_3_S16_R2_001",
"2_3_S2_R1_001",
"2_3_S2_R2_001")
x <- strsplit(sample, "_")
sapply(x, function(y) paste0(y[1], ".", y[2], y[4]))
Here is one way you could do it.
It helps to create a data frame with a header column, so it's what I did below, and I called the column "cats"
trial <- data.frame( "cats" = character(0))
x <- c("16_3_S16_R1_001", "16_3_S16_R2_001", "2_3_S2_R1_001","2_3_S2_R2_001")
df <- data.frame("cats" = x)
The data needs to be in the right structure, in our case, as.factor()
df$cats <- as.factor(df$cats)
levels(df$cats)[levels(df$cats)=="16_3_S16_R1_001"] <- "16.3R1"
levels(df$cats)[levels(df$cats)=="16_3_S16_R2_001"] <- "16.3R2"
levels(df$cats)[levels(df$cats)=="2_3_S2_R1_001"] <- "2.3R1"
levels(df$cats)[levels(df$cats)=="2_3_S2_R2_001"] <- "2.3R2"
And voilà

How to label factors but still preserve its original levels' values - R

I'll divide this question into two parts, being the first a general question, and the second a specific one.
First - I would like to know if there is a way to label numeric factors but still keep their original numeric levels. This is especially confusing since I realised that when we pass a labels argument to factor, the labels become the factor's levels, for example:
x<- factor(c(1,2,3, 2, 3, 1, 2), levels = c(1, 2, 3), labels = c("a", "b", "c"))
levels(x)
#[1] "a" "b" "c"
labels(x)
#[1] "1" "2" "3" "4" "5" "6" "7"
I would like to know if there is a way, like there is in Stata, to label the categories of a factor. I want to be able to sum x while its elements show as "a", "b" or "c", but keep the values 1, 2, or 3.
Second- I'm asking this because I have a very large data set which has columns with numeric categories. This data set comes with a dictionary in xlsx which I read and treat into R, so each column has its numeric categories and their respective labels. I'm attempting to read the dictionary, create a list of categories and labels inside a list of columns and then read the data set, loop through the columns and label the variables. These labels are important so I don't have to look at the dictionary every time I have to interpret something on the data set. And the numeric levels are important because since I have a lot of dummy variables (yes or no variables) I want to be able to sum them.
Here's my code (I use the data.table package):
dic<- readRDS(dictionary_filename)
# Reading data set #
data <- fread(dataset_filename, header = T, sep = "|", encoding = "UTF-8", na.strings = c("NA", ""))
# Treating the data.set #
# Identifying which lines of the dictionary have categorized variables. This is very specific to my dictionary structure #
index<- which(!is.na(dic$num.categoria))
# storing the names of columns that have categorized variables #
names_var<- dic$`Var name`[index]
names_var<- names_var[!is.na(names_var)]
# Creating a data frame with categorized variables which will be later split into lists #
df<- as.data.frame(dic[index,])
# Transforming the index column to factor so it is possible to split the data frame into a list with sublists for each categorized column #
df$N<- as.factor(df$N)
# Splitting the data frame to list
lst<- split(df, df$N)
# Creating a labels list and a levels list #
lbs<- list()
lvs<- list()
for (i in 1:length(lst)){
lbs[[i]]<- as.vector(lst[[i]]$category)
lvs[[i]]<- as.vector(lst[[i]]$category.number)
}
# Changing the data set columns into factors with their respective levels and labels #
k<- 1
for (var in names_var){
set(data, j =var, value = factor(data[[var]], levels = lvs[[k]], labels = lbs[[k]]))
k<- k +1
}
I realize the code is a bit abstract since I don't provide the data set or the dictionary, but it is just so you can get an idea. My code works, it runs with no errors and it does what I hoped it would do (all the categorized columns are now showing their labels, for example, "yes" or "no" when before it was 1 or 0). Except for the fact that I can no longer access the original numbers in the levels, which I need in the next part of my project.
It would be preferable if there is a general way of doing so, since I run this code in a function, with many columns with different data sets and different dictionaries. Is there a way to accomplish this?
PS.: I have read the documentation in R and the answers to those questions:
Factor, levels, and original values
Having issues using order function in R
But unfortunately I wasn't able to figure it out by myself, it just became obvious that using the "labels" argument in "factor" was not the way to get it done.
Thank you so much!
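One possible approach, which is not from the thread and is only a sketch: keep the dictionary's category numbers next to the factor and map back through the level positions. The names raw, codes and labs are made up for illustration.

```r
raw   <- c(1, 0, 1, 1, 0)   # a yes/no dummy as stored in the data
codes <- c(0, 1)            # category numbers from the dictionary
labs  <- c("no", "yes")     # matching labels from the dictionary

x <- factor(raw, levels = codes, labels = labs)

# as.integer(x) is the position of each value's level, so indexing
# codes with it maps every element back to its original number
original <- codes[as.integer(x)]
sum(original)
```

The factor still prints as "no"/"yes", while the original numbers stay recoverable as long as codes is kept in the same order as the levels.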

Assigning automatic class based on various columns in R [duplicate]

I would like to take a data frame with characters and numbers, and concatenate all of the elements of each row into a single string, which would be stored as a single element in a vector. As an example, I make a data frame of letters and numbers, and then I would like to concatenate the first row via the paste function, and hopefully return the value "A1"
df <- data.frame(letters = LETTERS[1:5], numbers = 1:5)
df
## letters numbers
## 1 A 1
## 2 B 2
## 3 C 3
## 4 D 4
## 5 E 5
paste(df[1,], sep =".")
## [1] "1" "1"
So paste is converting each element of the row into an integer that corresponds to the 'index of the corresponding level' as if it were a factor, and it keeps it as a vector of length two. (I know/believe that factors that are coerced to characters behave in this way, but as R is not storing df[1,] as a factor at all (tested by is.factor()), I can't verify that it is actually an index for a level.)
is.factor(df[1,])
## [1] FALSE
is.vector(df[1,])
## [1] FALSE
So if it is not a vector then it makes sense that it is behaving oddly, but I can't coerce it into a vector
> is.vector(as.vector(df[1,]))
[1] FALSE
Using as.character did not seem to help in my attempts
Can anyone explain this behavior?
While others have focused on why your code isn't working and how to improve it, I'm going to try and focus more on getting the result you want. From your description, it seems you can readily achieve what you want using paste:
df <- data.frame(letters = LETTERS[1:5], numbers = 1:5, stringsAsFactors=FALSE)
paste(df$letters, df$numbers, sep="")
## [1] "A1" "B2" "C3" "D4" "E5"
You can change df$letters to character using df$letters <- as.character(df$letters) if you don't want to use the stringsAsFactors argument.
But let's assume that's not what you want. Let's assume you have hundreds of columns and you want to paste them all together. We can do that with your minimal example too:
df_args <- c(df, sep="")
do.call(paste, df_args)
## [1] "A1" "B2" "C3" "D4" "E5"
EDIT: Alternative method and explanation:
I realised the problem you're having is a combination of the fact that you're using a factor and that you're using the sep argument instead of collapse (as @adibender picked up). The difference is that sep gives the separator between two separate vectors and collapse gives separators within a vector. When you use df[1,], you supply a single vector to paste and hence you must use the collapse argument. Using your idea of getting every row and concatenating them, the following line of code will do exactly what you want:
apply(df, 1, paste, collapse="")
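The difference between sep and collapse is easy to check on a tiny example (the variable names are just for illustration):

```r
# sep: separator between the corresponding elements of several vectors
joined <- paste("A", 1, sep = "")

# collapse: separator between the elements of one vector, giving a single string
collapsed <- paste(c("A", "1"), collapse = "")

# sep alone has nothing to separate when only one vector is supplied
unchanged <- paste(c("A", "1"), sep = "")
length(unchanged)  # still 2
```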
Ok, now for the explanations:
Why won't as.list work?
as.list converts an object to a list. So it does work. It will convert your dataframe to a list and subsequently ignore the sep="" argument. c combines objects together. Technically, a dataframe is just a list where every column is an element and all elements have to have the same length. So when I combine it with sep="", it just becomes a regular list with the columns of the dataframe as elements.
Why use do.call?
do.call allows you to call a function using a named list as its arguments. You can't just throw the list straight into paste, because it doesn't like dataframes. It's designed for concatenating vectors. So remember that df_args is a list containing a vector of letters, a vector of numbers and sep, which is a length 1 vector containing only "". When I use do.call, the resulting paste function is essentially paste(letters, numbers, sep).
But what if my original dataframe had columns "letters", "numbers", "squigs", "blargs" after which I added the separator like I did before? Then the paste function through do.call would look like:
paste(letters, numbers, squigs, blargs, sep)
So you see it works for any number of columns.
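As a quick sketch of the four-column case, with made-up values for the hypothetical squigs and blargs columns:

```r
df <- data.frame(
  letters = LETTERS[1:3],
  numbers = 1:3,
  squigs  = c("x", "y", "z"),
  blargs  = c("p", "q", "r"),
  stringsAsFactors = FALSE
)

# The data frame's columns plus sep become the arguments of paste
df_args <- c(df, sep = "")
result  <- do.call(paste, df_args)
result
```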
For those using library(tidyverse), you can simply use the unite function.
new.df <- df%>%
unite(together, letters, numbers, sep="")
This will give you a new column called together with A1, B2, etc.
This is indeed a little weird, but this is also what is supposed to happen.
When you create the data.frame as you did, the column letters is stored as a factor. Factors are stored internally as integer codes indexing their levels, so when as.numeric() is applied to a factor it returns those codes. For example:
> df[, 1]
[1] A B C D E
Levels: A B C D E
> as.numeric(df[, 1])
[1] 1 2 3 4 5
A is the first level of the factor df[, 1] therefore A gets converted to the value 1, when as.numeric is applied. This is what happens when you call paste(df[1, ]). Since columns 1 and 2 are of different class, paste first transforms both elements of row 1 to numeric then to characters.
When you want to concatenate both columns, you first need to transform the first row to character:
df[, 1] <- as.character(df[, 1])
paste(df[1,], collapse = "")
As @sebastian-c pointed out, you can also use stringsAsFactors = FALSE in the creation of the data.frame, then you can omit the as.character() step.
if you want to start with
df <- data.frame(letters = LETTERS[1:5], numbers = 1:5, stringsAsFactors=TRUE)
.. then there is no general rule about how df$letters will be interpreted by any given function. It's a factor for modelling functions, character for some and integer for some others. Even the same function such as paste may interpret it differently, depending on how you use it:
paste(df[1,], collapse="") # "11"
apply(df, 1, paste, collapse="") # "A1" "B2" "C3" "D4" "E5"
No logic in it except that it will probably make sense once you know the internals of every function.
The factors seem to be converted to integers when an argument is converted to a vector. As you know, data frames are lists of vectors of equal length, so the first row of a data frame is also a list, and when it is forced to be a vector, something like this happens:
df[1,]
# letters numbers
# 1 A 1
unlist(df[1,])
# letters numbers
# 1 1
I don't know how apply achieves what it does (i.e., factors are represented by character values) -- if you're interested, look at its source code. It may be useful to know, though, that you can trust (in this specific sense) apply (in this specific occasion). More generally, it is useful to store every piece of data in a sensible format, that includes storing strings as strings, i.e., using stringsAsFactors=FALSE.
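For what it's worth, apply coerces a data frame with as.matrix before looping over it, and as.matrix turns factor columns into their character labels. A small sketch of the contrast with unlist:

```r
df <- data.frame(letters = LETTERS[1:5], numbers = 1:5, stringsAsFactors = TRUE)

# as.matrix() gives a character matrix; the factor shows up as its labels
m <- as.matrix(df)
m[1, ]

# unlist() on the one-row data frame falls back to the integer codes
unlist(df[1, ])
```

One caveat: as.matrix formats numeric columns as text, so numbers of different widths can come out padded with spaces.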
Btw, every introductory R book should have this idea in a subtitle. For example, my plan for retirement is to write "A (not so) gentle introduction to the zen of data fishery with R, the stringsAsFactors=FALSE way".

Conditionally Splitting Dataframes Using ifelse

I have a large dataset called "inputs". One of the columns in the dataset is a flag called "constrained" with either "Y" or "N". I want to create two datasets where one is the rows where the flag is "Y" and one is the rows where the flag is "N".
I tried:
ifelse(inputs$constrained == "N",unconstrained <- inputs,constrained <- inputs)
but both datasets unconstrained and constrained are identical to inputs.
What am I doing wrong?
# [[ ]] returns the data frame itself; [ ] would return a one-element list
first <- split(inputs, inputs$constrained)[[1]]
second <- split(inputs, inputs$constrained)[[2]]
If you wanted to use "[" you could do this:
unconstrd <- inputs[ inputs$constrained == "N" , ]
constrd <- inputs[ ! inputs$constrained == "N" , ]
Both results of that second option may contain entries where 'constrained' is NA, due to the screwy way R handles NA in logical indexing, although those entries come back filled with NA rather than as a faithful reflection of the original rows. (I admit I was not sure what the split method does with NAs.) I just tested the split method and it might be superior, since (like subset) it does not return the is.na(inputs$constrained) rows.
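The NA behaviour can be checked on a toy version of inputs (the values here are made up). Wrapping the condition in which() is one common way to keep "[" from returning NA rows.

```r
inputs <- data.frame(
  id = 1:5,
  constrained = c("Y", "N", NA, "Y", "N"),
  stringsAsFactors = FALSE
)

# Logical indexing: the NA in the condition yields an extra all-NA row
unconstrd <- inputs[inputs$constrained == "N", ]
nrow(unconstrd)        # 3, not 2

# which() keeps only the positions where the condition is TRUE
unconstrd_clean <- inputs[which(inputs$constrained == "N"), ]
nrow(unconstrd_clean)  # 2

# split() drops rows whose grouping value is NA
parts <- split(inputs, inputs$constrained)
names(parts)
```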
