Best practice for factor and reverse in R - r

I have a variable with one column of strings; for further processing I converted that column into a factor:
myVar$strCol <- as.factor(myVar$strCol)
Now I want to get the strings back for writing the output. I have tested, and there seem to be several ways to reverse as.factor. I have found:
as.character(myVar$strCol)
and
factor(myVar$strCol)
I am confused now. Which one is best? Which one is fastest? Which one should I use? Is there another one that is better?
Any help is appreciated; I am new to R.

Although the print output from those two objects is identical if they are in a data.frame, the results are completely different. Furthermore, in most cases where an R newcomer looks at a data frame with "character variables", those variables turn out to be factors. (That presumption may now be incorrect. Since R 4.0.0, data.frame() and the read.* functions no longer convert character columns to factors by default, because stringsAsFactors now defaults to FALSE, so factors need to be created explicitly.)
Bottom line: Only the first option you presented will deliver what you asked for.
You should learn to examine R objects with str() and present them to SO audiences with dput()-output so that the ambiguity of the console-print method can be avoided.
> test <- factor(1:10)
> test
[1] 1 2 3 4 5 6 7 8 9 10
Levels: 1 2 3 4 5 6 7 8 9 10
> dput( as.character ( test) )
c("1", "2", "3", "4", "5", "6", "7", "8", "9", "10")
> dput( factor (test) )
structure(1:10, .Label = c("1", "2", "3", "4", "5", "6", "7",
"8", "9", "10"), class = "factor")
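Although the printed "character" column gives no hint that it is a factor, it still is one in the dd object below (built with stringsAsFactors = TRUE, which was the default before R 4.0.0):
> dd <- data.frame(test = letters[1:10], num = 1:10, stringsAsFactors = TRUE)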
> dd
test num
1 a 1
2 b 2
3 c 3
4 d 4
5 e 5
6 f 6
7 g 7
8 h 8
9 i 9
10 j 10
> dput(dd)
structure(list(test = structure(1:10, .Label = c("a", "b", "c",
"d", "e", "f", "g", "h", "i", "j"), class = "factor"), num = 1:10), .Names = c("test",
"num"), row.names = c(NA, -10L), class = "data.frame")
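As a minimal sketch of the round trip (the vectors x and g and their values are made up for illustration), as.character() recovers the original strings, while factor() applied to a factor simply returns another factor:

```r
# Hypothetical example values, not taken from the question
x <- c("pear", "apple", "pear", "fig")

f <- as.factor(x)          # strings -> factor
back <- as.character(f)    # factor -> strings again

identical(back, x)         # TRUE: the original strings are recovered
class(factor(f))           # "factor": still a factor, not character

# Pitfall: as.numeric() on a factor returns the internal integer codes,
# so go through as.character() first when the labels are numbers
g <- factor(c("10", "20", "10"))
as.numeric(g)                 # 1 2 1 (codes)
as.numeric(as.character(g))   # 10 20 10 (values)
```

The same check works on a data frame column, for example class(myVar$strCol) before and after the conversion.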

Related

How to format data from excel containing two rows of column headers to be able to use in R?

I am importing the following table 1 into R but am struggling with the formatting, as each column has two headers. My desired output is the second table 2. I plan to use tidyr to gather the data.
Another obstacle I have is the merged cells. I have been using fillMergedCells=TRUE to duplicate this.
read.xlsx(xlsxFile ="C:/Users/X/X/Desktop/X.xlsx",fillMergedCells = TRUE)
One option would be to:
1. Read your Excel file with the option colNames = FALSE.
2. Paste the first two rows together and use the result as the column names. Here I use an underscore as the separator, which makes it easy to split the names later on.
3. Get rid of the first two rows.
4. Use tidyr::pivot_longer to convert to long format.
# df <- openxlsx::read.xlsx(xlsxFile ="data/test2.xlsx", fillMergedCells = TRUE, colNames = FALSE)
# Use first two rows as names
names(df) <- paste(df[1, ], df[2, ], sep = "_")
names(df)[1] <- "category"
# Get rid of first two rows and columns containing year average
df <- df[-c(1:2), ]
df <- df[, !grepl("^Year", names(df))]
library(tidyr)
library(dplyr)
df %>%
  pivot_longer(-category, names_to = c("Time", ".value"), names_pattern = "^(.*?)_(.*)$") %>%
  arrange(Time)
#> # A tibble: 16 × 4
#> category Time Y Z
#> <chr> <chr> <chr> <chr>
#> 1 Total Feb-21 1 1
#> 2 A Feb-21 2 2
#> 3 B Feb-21 3 3
#> 4 C Feb-21 4 4
#> 5 D Feb-21 5 5
#> 6 E Feb-21 6 6
#> 7 F Feb-21 7 7
#> 8 G Feb-21 8 8
#> 9 Total Jan-21 1 1
#> 10 A Jan-21 2 2
#> 11 B Jan-21 3 3
#> 12 C Jan-21 4 4
#> 13 D Jan-21 5 5
#> 14 E Jan-21 6 6
#> 15 F Jan-21 7 7
#> 16 G Jan-21 8 8
DATA
df <- structure(list(X1 = c(
NA, NA, "Total", "A", "B", "C", "D", "E",
"F", "G"
), X2 = c(
"Year Rolling Avg.", "Share", NA, "1", "1",
"1", "1", "1", "1", "1"
), X3 = c(
"Year Rolling Avg.", "Y", "1",
"2", "3", "4", "5", "6", "7", "8"
), X4 = c(
"Year Rolling Avg.",
"Z", "1", "2", "3", "4", "5", "6", "7", "8"
), X5 = c(
"Jan-21",
"Y", "1", "2", "3", "4", "5", "6", "7", "8"
), X6 = c(
"Jan-21",
"Z", "1", "2", "3", "4", "5", "6", "7", "8"
), X7 = c(
"Feb-21",
"Y", "1", "2", "3", "4", "5", "6", "7", "8"
), X8 = c(
"Feb-21",
"Z", "1", "2", "3", "4", "5", "6", "7", "8"
)), row.names = c(
NA,
10L
), class = "data.frame")

Why do I get the error message "Error: unexpected '=' in:" when using the fct_collapse function?

Data frame = qog_std3
factor = btid
I am trying to collapse this ordinal-level factor using the following code:
I get the following error message:
Error: unexpected '=' in:
"btid4 <- fct_collapse(qog_std3$btid,
1="
Can anyone explain to me why the use of "=" provides this error and what I can do about it?
Any alternative solution would also be deeply appreciated.
If the column is a factor or character, the new level name has to be quoted when it is numeric, because a bare number is not a valid argument name in R. Quoting is not needed when the name is non-numeric.
fct_collapse(df1$btid, "1" = c("1", "2"))
#[1] 1 1 3 3 4 5 1 1
#Levels: 1 3 4 5
Backquotes also work:
fct_collapse(df1$btid, `1` = c("1", "2"))
#[1] 1 1 3 3 4 5 1 1
#Levels: 1 3 4 5
whereas an unquoted numeric name fails to parse:
fct_collapse(df1$btid, 1 = c("1", "2"))
Error: unexpected '=' in " fct_collapse(df1$btid, 1 ="
However, this is not an issue when the name is non-numeric:
fct_collapse(df1$id2, AB = c("A", "B"))
#[1] AB AB C D AB AB C AB
#Levels: AB C D
data
df1 <- structure(list(btid = c("1", "1", "3", "3", "4", "5", "1", "2"
), id2 = c("A", "B", "C", "D", "A", "B", "C", "A")), row.names = c(NA,
-8L), class = "data.frame")
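Since the question also asks for an alternative, here is a minimal base R sketch that avoids the argument-name problem entirely. It is not the forcats approach shown above, just one possible equivalent; it relies on the fact that assigning a duplicated level name merges the levels:

```r
# Same btid values as in df1 above, rebuilt so the sketch is self-contained
btid <- c("1", "1", "3", "3", "4", "5", "1", "2")
f <- factor(btid)

# Rename level "2" to "1"; duplicated level names are merged into one level
levels(f)[levels(f) %in% c("1", "2")] <- "1"

f
#[1] 1 1 3 3 4 5 1 1
#Levels: 1 3 4 5
```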

count_if (EXPSS) with multiple conditions in R

I am using expss::count_if.
While something like this works fine (i.e., counting values only where value is equal to "1"):
(number_unemployed = count_if("1",unemployed_field,na.rm = TRUE)),
This does not (i.e., counting values only where value is equal to "1" or "2" or "3"):
(number_unemployed = count_if("1", "2", "3", unemployed_field,na.rm = TRUE)),
What is the correct syntax for using multiple conditions for count_if? I cannot find anything in the expss package documentation.
You need to put them into a vector. This works:
(number_unemployed = count_if(c("1", "2", "3"), unemployed_field, na.rm = TRUE)),
Example (sample data is provided below):
library(expss)
count_if(c("1","2","3"),dt$Encounter)
#> 9
Data:
dt <- structure(list(Location = c("A", "B", "A", "A", "C", "B", "A", "B", "A", "A", "A"),
Encounter = c("1", "2", "3", "1", "2", "3", "4", "1", "2", "3", "4")),
row.names = c(NA, -11L), class = "data.frame")
# Location Encounter
# 1 A 1
# 2 B 2
# 3 A 3
# 4 A 1
# 5 C 2
# 6 B 3
# 7 A 4
# 8 B 1
# 9 A 2
# 10 A 3
# 11 A 4
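For comparison, the same count can be sketched in base R without expss. This is only an equivalent formulation, not part of the expss API; it uses the Encounter values from dt above:

```r
# Encounter column from the sample data, rebuilt so the sketch is self-contained
dt <- data.frame(
  Encounter = c("1", "2", "3", "1", "2", "3", "4", "1", "2", "3", "4")
)

# %in% tests membership in the set and never returns NA,
# so no na.rm handling is needed here
n <- sum(dt$Encounter %in% c("1", "2", "3"))
n
#> 9
```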

Subsetting Data in R not working through a vector

Scenario.
In the DataCamp course on cleaning data with R (case studies), there is an exercise at the very end of the course with a five-column dataset, att5. Only column 1 holds actual characters; columns 2:5 hold numbers but are stored as type character. The exercise asks for a vector cols containing the indices of those columns (2, 3, 4, 5) and for sapply to apply the as.numeric function to them.
My solution is not working although it makes sense to me. I'm sharing their solution first and then mine. Please help me understand what is going on.
DataCamp solution (working)
# Define vector containing numerical columns: cols
cols <- -1
# Use sapply to coerce cols to numeric
att5[, cols] <- sapply(att5[, cols], as.numeric)
My solution (not working)
# Define vector containing numerical columns: cols
cols <- c(2:5)
# Use sapply to coerce cols to numeric
att5[, cols] <- sapply(att5[, cols], as.numeric)
I'm getting this error: invalid subscript type list
Help me understand. Newbie in R.
Your solution works perfectly on my machine. The only difference I can see is that cols <- -1 is of class "numeric", whereas cols <- c(2:5) is of class "integer". If you want to know the difference between the two, have a look at "What's the difference between integer class and numeric class in R".
So, one way to reverse-engineer their solution is to generate cols as class numeric, and seq can help with that.
cols <- seq(2,5,1)
#class(cols)
#[1] "numeric"
att5[, cols] <- sapply(att5[, cols], as.numeric)
# str(att5)
# 'data.frame': 5 obs. of 5 variables:
# $ att1: Factor w/ 5 levels "A","B","C","D",..: 1 2 3 4 5
# $ att2: num 1 2 3 4 5
# $ att3: num 1 2 3 4 5
# $ att4: num 1 2 3 4 5
# $ att5: num 1 2 3 4 5
Data
dput(att5)
att5 <- structure(list(att1 = structure(1:5, .Label = c("A", "B", "C",
"D", "E"), class = "factor"), att2 = structure(1:5, .Label = c("1",
"2", "3", "4", "5"), class = "factor"), att3 = structure(1:5, .Label = c("1",
"2", "3", "4", "5"), class = "factor"), att4 = structure(1:5, .Label = c("1",
"2", "3", "4", "5"), class = "factor"), att5 = structure(1:5, .Label = c("1",
"2", "3", "4", "5"), class = "factor")), class = "data.frame", row.names = c(NA,
-5L))
Hope it works on your end.
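A hedged alternative sketch, assuming the goal is simply to turn number-like columns into numerics: lapply with list-style indexing (att[cols]) returns one converted vector per column, and going through as.character() first keeps the values correct even if a column is a factor. The data frame att and its values are a hypothetical stand-in for att5, and this is not DataCamp's solution:

```r
# Hypothetical stand-in for att5: number-like columns stored as character
att <- data.frame(
  id = c("A", "B", "C"),
  v1 = c("1", "2", "3"),
  v2 = c("10", "20", "30"),
  stringsAsFactors = FALSE
)

cols <- 2:3  # integer indices of the columns to convert

# as.character() first, so factor columns yield their labels, not their codes
att[cols] <- lapply(att[cols], function(x) as.numeric(as.character(x)))

str(att)
```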

Error when subsetting based on adjusted values of different data frame in R

I am asking a side question about the method I learned here from @redmode:
Subsetting based on values of a different data frame in R
When I try to dynamically adjust the level I want to subset by:
N <- nrow(A)
cond <- sapply(3:N, function(i) sum(A[i,] > 0.95*B[i,])==2)
rbind(A[1:2,], subset(A[3:N,], cond))
I get an error
Error in FUN(left, right) : non-numeric argument to binary operator.
Can you think of a way I can get rows pertaining to values in A that are greater than 95% of the value in B? Thank you.
Here is code for A and B to play with.
A <- structure(list(name1 = c("trt", "0", "1", "10", "1", "1", "10"
), name2 = c("ctrl", "3", "1", "1", "1", "1", "10")), .Names = c("name1",
"name2"), row.names = c("cond", "hour", "A", "B", "C", "D", "E"
), class = "data.frame")
B <- structure(list(name1 = c("trt", "0", "1", "1", "1", "1", "9.4"),
name2 = c("ctrl", "3", "1", "10", "1", "1", "9.4")), .Names = c("name1",
"name2"), row.names = c("cond", "hour", "A", "B", "C", "D", "E"
), class = "data.frame")
You have some serious formatting issues with your data.
First, columns should be of the same data type, and rows should be observations (not always true, but a very good way to start). Here you have a row called cond, then a row called hour, then what I am guessing is a series of classifications. The way your data is presented to begin with doesn't make much sense and doesn't lend itself to easy manipulation. But all is not lost. This is what I would do:
Reorganize my data:
C <- data.frame(matrix(as.numeric(unlist(A)), ncol=2)[-(1:2), ])
colnames(C) <- c('A.trt', 'A.cntr')
rownames(C) <- LETTERS[1:nrow(C)]
D <- data.frame(matrix(as.numeric(unlist(B)), ncol=2)[-(1:2), ])
colnames(D) <- c('B.trt', 'B.cntr')
(df <- cbind(C, D))
Which gives:
# A.trt A.cntr B.trt B.cntr
# A 1 1 1.0 1.0
# B 10 1 1.0 10.0
# C 1 1 1.0 1.0
# D 1 1 1.0 1.0
# E 10 10 9.4 9.4
Then your problem is easily solved by:
df[which(df[, 1] > 0.95*df[, 3] & df[, 2] > 0.95*df[, 4]), ]
# A.trt A.cntr B.trt B.cntr
# A 1 1 1.0 1.0
# C 1 1 1.0 1.0
# D 1 1 1.0 1.0
# E 10 10 9.4 9.4
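The error in the question comes from the fact that every column of A and B is character, so the multiplication by 0.95 is attempted on strings. A minimal sketch of that failure and of the conversion that avoids it (the names msg and threshold are only for illustration):

```r
# Arithmetic on a character value fails with the error from the question
msg <- tryCatch(0.95 * "9.4", error = function(e) conditionMessage(e))
msg
# "non-numeric argument to binary operator"

# Converting to numeric first makes the comparison possible
threshold <- 0.95 * as.numeric("9.4")
as.numeric("10") > threshold
# TRUE
```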
