Subsetting Data in R not working through a vector - r

Scenario.
In the DataCamp course "cleaning data with R: case studies", there is an exercise at the very end of the course where the dataset "att5" has 5 columns (say 1, 2, 3, 4, 5). Only column 1 actually holds text; columns 2:5 hold numbers, but they are stored as character. The exercise asks me to define a vector cols containing the indices of those columns (2, 3, 4, 5) and use sapply to apply as.numeric to them.
My solution is not working, although it makes sense to me. I'm sharing their solution first and then mine. Please help me understand what is going on.
DataCamp solution (working)
# Define vector containing numerical columns: cols
cols <- -1
# Use sapply to coerce cols to numeric
att5[, cols] <- sapply(att5[, cols], as.numeric)
My solution (not working)
# Define vector containing numerical columns: cols
cols <- c(2:5)
# Use sapply to coerce cols to numeric
att5[, cols] <- sapply(att5[, cols], as.numeric)
I'm getting this error: invalid subscript type list
Help me understand. Newbie in R.

Your solution works perfectly on my machine. The only difference I can see is that cols <- -1 is of class "numeric", whereas cols <- c(2:5) is of class "integer". If you want to know the difference between the two, have a look at What's the difference between integer class and numeric class in R.
So, one way to mimic their solution is to generate cols as class numeric, and seq can help do that.
cols <- seq(2,5,1)
#class(cols)
#[1] "numeric"
att5[, cols] <- sapply(att5[, cols], as.numeric)
# str(att5)
# 'data.frame': 5 obs. of 5 variables:
# $ att1: Factor w/ 5 levels "A","B","C","D",..: 1 2 3 4 5
# $ att2: num 1 2 3 4 5
# $ att3: num 1 2 3 4 5
# $ att4: num 1 2 3 4 5
# $ att5: num 1 2 3 4 5
Data
dput(att5)
att5 <- structure(list(att1 = structure(1:5, .Label = c("A", "B", "C",
"D", "E"), class = "factor"), att2 = structure(1:5, .Label = c("1",
"2", "3", "4", "5"), class = "factor"), att3 = structure(1:5, .Label = c("1",
"2", "3", "4", "5"), class = "factor"), att4 = structure(1:5, .Label = c("1",
"2", "3", "4", "5"), class = "factor"), att5 = structure(1:5, .Label = c("1",
"2", "3", "4", "5"), class = "factor")), class = "data.frame", row.names = c(NA,
-5L))
Hope it works on your end.
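As a rough sketch of the point above, the classes of the three index vectors can be checked directly, and lapply can be used in place of sapply so the result stays a list of columns. The small att5 below is a hypothetical stand-in for the course data (character columns rather than factors), not the real dataset.

```r
# Hypothetical stand-in for the course data: column 1 is text,
# columns 2-5 hold numbers stored as character.
att5 <- data.frame(
  att1 = c("A", "B", "C"),
  att2 = c("1", "2", "3"),
  att3 = c("4", "5", "6"),
  att4 = c("7", "8", "9"),
  att5 = c("10", "11", "12"),
  stringsAsFactors = FALSE
)

# The index vectors differ only in class, not in which columns they select.
class(-1)            # "numeric"
class(c(2:5))        # "integer"
class(seq(2, 5, 1))  # "numeric"

# lapply returns one list element per column, so no matrix simplification
# is involved when assigning back into the data frame.
cols <- 2:5
att5[cols] <- lapply(att5[cols], as.numeric)
str(att5)
```

If the columns were factors rather than character, as.numeric would return the internal level codes, so as.numeric(as.character(x)) would be the safer conversion.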

Related

Why do I get the error message "Error: unexpected '=' in:" when using the fct_collapse function?

Data frame = qog_std3
factor = btid
I am trying to collapse this ordinal-level factor using the following code:
I get the following error message:
Error: unexpected '=' in:
"btid4 <- fct_collapse(qog_std3$btid,
1="
Can anyone explain to me why the use of "=" provides this error and what I can do about it?
Any alternative solution would also be deeply appreciated.
If the column is a factor or character, the new level name has to be quoted when it is numeric, because a bare number is not a valid argument name in R. Non-numeric names do not need quoting.
fct_collapse(df1$btid, "1" = c("1", "2"))
#[1] 1 1 3 3 4 5 1 1
#Levels: 1 3 4 5
Backquotes also work:
fct_collapse(df1$btid, `1` = c("1", "2"))
#[1] 1 1 3 3 4 5 1 1
#Levels: 1 3 4 5
whereas an unquoted numeric name fails:
fct_collapse(df1$btid, 1 = c("1", "2"))
Error: unexpected '=' in " fct_collapse(df1$btid, 1 ="
However, this is not an issue when the name is non-numeric:
fct_collapse(df1$id2, AB = c("A", "B"))
#[1] AB AB C D AB AB C AB
#Levels: AB C D
data
df1 <- structure(list(btid = c("1", "1", "3", "3", "4", "5", "1", "2"
), id2 = c("A", "B", "C", "D", "A", "B", "C", "A")), row.names = c(NA,
-8L), class = "data.frame")
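The error is raised by the R parser, not by fct_collapse itself. A minimal base-R sketch (using list() as a stand-in for any function call, which is an assumption made only for illustration) shows the same behaviour without forcats:

```r
# An unquoted number cannot be used as an argument name, so this text
# fails to parse before any function is called.
bad <- tryCatch(
  parse(text = 'list(1 = "a")'),
  error = function(e) "parse error"
)
bad

# Quoted or backquoted names are accepted by the parser.
ok <- list("1" = "a", `2` = "b")
names(ok)
```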

count_if (EXPSS) with multiple conditions in R

I am using expss::count_if.
While something like this works fine (i.e., counting values only where value is equal to "1"):
(number_unemployed = count_if("1",unemployed_field,na.rm = TRUE)),
This does not (i.e., counting values only where value is equal to "1" or "2" or "3"):
(number_unemployed = count_if("1", "2", "3", unemployed_field,na.rm = TRUE)),
What is the correct syntax for using multiple conditions for count_if? I cannot find anything in the expss package documentation.
You need to put them into a single vector, passed as the first argument. This works:
(number_unemployed = count_if(c("1", "2", "3"), unemployed_field, na.rm = TRUE)),
Example: Sample data is provided below;
library(expss)
count_if(c("1","2","3"),dt$Encounter)
#> 9
Data:
dt <- structure(list(Location = c("A", "B", "A", "A", "C", "B", "A", "B", "A", "A", "A"),
Encounter = c("1", "2", "3", "1", "2", "3", "4", "1", "2", "3", "4")),
row.names = c(NA, -11L), class = "data.frame")
# Location Encounter
# 1 A 1
# 2 B 2
# 3 A 3
# 4 A 1
# 5 C 2
# 6 B 3
# 7 A 4
# 8 B 1
# 9 A 2
# 10 A 3
# 11 A 4
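As a cross-check that does not rely on expss, the same count can be sketched in base R with %in%. The data frame is the sample data from above; n_match is a name introduced here for illustration.

```r
# Sample data from the answer above.
dt <- data.frame(
  Location  = c("A", "B", "A", "A", "C", "B", "A", "B", "A", "A", "A"),
  Encounter = c("1", "2", "3", "1", "2", "3", "4", "1", "2", "3", "4"),
  stringsAsFactors = FALSE
)

# %in% tests each value against the whole set of accepted values,
# and sum() counts the TRUE results.
n_match <- sum(dt$Encounter %in% c("1", "2", "3"))
n_match  # 9
```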

r dataframe using rank

I would like to rank the rows of a dataframe (with 30 columns) whose numerical values range from -inf to +inf.
This is what I have:
df <- structure(list(StockA = c("-5", "3", "6"),
StockB = c("2", "-1", "3"),
StockC = c("-3", "-4", "4")),
.Names = c( "StockA","StockB", "StockC"),
class = "data.frame", row.names = c(NA, -3L))
> df
StockA StockB StockC
1 -5 2 -3
2 3 -1 -4
3 6 3 4
This is what I would like to have:
> df_rank
StockA StockB StockC
1 3 1 2
2 1 2 3
3 1 3 2
I am using this command:
> rank(df[1,])
StockA StockB StockC
2 3 1
The resulting rank variables are not correct though as you can see.
rank() assigns the lowest rank to the smallest value.
So the short answer to your question is to rank the vector multiplied by -1:
rank(-c(-5, 2, -3))
[1] 3 1 2
Here is the full code:
# Data frame definition. As pointed out in the comments, the values should
# really be numbers; otherwise rank() will order them as strings.
# In the real world you should define them as integers from the start,
# but to stay with your data I convert them to integers in the next step.
df <- structure(list(StockA = c("-5", "3", "6"),
StockB = c("2", "-1", "3"),
StockC = c("-3", "-4", "4")),
.Names = c( "StockA","StockB", "StockC"),
class = "data.frame", row.names = c(NA, -3L))
# since you plan to rank them not as strings, but numbers, you need to convert
# them to integers:
df[] <- lapply(df,as.integer)
# apply will return a matrix or a list and you need to
# transpose the result and convert it back to a data.frame if needed
result <- as.data.frame(t( apply(df, 1, FUN=function(x){ return(rank(-x)) }) ))
result
# StockA StockB StockC
# 1 3 1 2
# 2 1 2 3
# 3 1 3 2
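One detail the code above leaves open is how ties are ranked. By default rank() averages tied ranks, which gives values such as 1.5. A hedged sketch with a helper, rank_rows_desc (a hypothetical name, not part of any package), and data altered to contain a tie in the third row:

```r
# Rank each row in descending order; ties get the smallest shared rank.
rank_rows_desc <- function(d, ties = "min") {
  m <- t(apply(as.matrix(d), 1, function(x) rank(-x, ties.method = ties)))
  as.data.frame(m)
}

# Hypothetical numeric data; row 3 has a tie between StockA and StockB.
df_num <- data.frame(
  StockA = c(-5, 3, 6),
  StockB = c(2, -1, 6),
  StockC = c(-3, -4, 4)
)

rank(-c(6, 6, 4))          # default "average": 1.5 1.5 3.0
res <- rank_rows_desc(df_num)
res
```

With ties = "min", both tied stocks in row 3 receive rank 1 and the remaining stock receives rank 3.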

Best practice for factor and reverse in R

I have a variable with a column that contains strings; for further processing I have converted them into a factor:
myVar$strCol <- as.factor(myVar$strCol)
Now I want to get the strings back for writing the output. I have tested, and it seems there is more than one way to reverse as.factor. I have found:
as.character(myVar$strCol)
and
factor(myVar$strCol)
I am confused now. Which is the best one? Which is the fastest one? Which one should I use? Is there another one that is better?
Any help please? I am new to R.
Although the printed output of those two objects is identical when they are in a data.frame, the results are completely different. Furthermore, an R newbie who inspects a data frame that appears to hold "character variables" will in most cases find that they are actually factors. (That presumption may now be incorrect: since R 4.0.0, data.frame() and the read.* functions no longer convert character columns to factors by default, because stringsAsFactors now defaults to FALSE. Factors therefore have to be created explicitly.)
Bottom line: Only the first option you presented will deliver what you asked for.
You should learn to examine R objects with str() and present them to SO audiences with dput()-output so that the ambiguity of the console-print method can be avoided.
> test <- factor(1:10)
> test
[1] 1 2 3 4 5 6 7 8 9 10
Levels: 1 2 3 4 5 6 7 8 9 10
> dput( as.character ( test) )
c("1", "2", "3", "4", "5", "6", "7", "8", "9", "10")
> dput( factor (test) )
structure(1:10, .Label = c("1", "2", "3", "4", "5", "6", "7",
"8", "9", "10"), class = "factor")
Although the printed "character" column gives no hint that it is a factor, it still is one in the dd object below (output from an R version before 4.0.0):
> dd <- data.frame(test=letters[1:10], num =1:10)
> dd
test num
1 a 1
2 b 2
3 c 3
4 d 4
5 e 5
6 f 6
7 g 7
8 h 8
9 i 9
10 j 10
> dput(dd)
structure(list(test = structure(1:10, .Label = c("a", "b", "c",
"d", "e", "f", "g", "h", "i", "j"), class = "factor"), num = 1:10), .Names = c("test",
"num"), row.names = c(NA, -10L), class = "data.frame")
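A related pitfall, sketched here with a small hypothetical factor f, is converting a factor of digit strings back to numbers: as.numeric() on the factor returns the internal level codes, so going through as.character() first is the usual route.

```r
# Hypothetical factor whose labels look like numbers.
f <- factor(c("10", "20", "10"))

is.factor(factor(f))        # TRUE: factor() does not undo as.factor()
is.character(as.character(f))  # TRUE: this is the actual reverse

codes  <- as.numeric(f)                # level codes: 1 2 1
values <- as.numeric(as.character(f))  # original values: 10 20 10
```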

Error when subsetting based on adjusted values of different data frame in R

I am asking a side-question about the method I learned here from @redmode:
Subsetting based on values of a different data frame in R
When I try to dynamically adjust the level I want to subset by:
N <- nrow(A)
cond <- sapply(3:N, function(i) sum(A[i,] > 0.95*B[i,])==2)
rbind(A[1:2,], subset(A[3:N,], cond))
I get an error
Error in FUN(left, right) : non-numeric argument to binary operator.
Can you think of a way I can get rows pertaining to values in A that are greater than 95% of the value in B? Thank you.
Here is code for A and B to play with.
A <- structure(list(name1 = c("trt", "0", "1", "10", "1", "1", "10"
), name2 = c("ctrl", "3", "1", "1", "1", "1", "10")), .Names = c("name1",
"name2"), row.names = c("cond", "hour", "A", "B", "C", "D", "E"
), class = "data.frame")
B <- structure(list(name1 = c("trt", "0", "1", "1", "1", "1", "9.4"),
name2 = c("ctrl", "3", "1", "10", "1", "1", "9.4")), .Names = c("name1",
"name2"), row.names = c("cond", "hour", "A", "B", "C", "D", "E"
), class = "data.frame")
You have some serious formatting issues with your data.
First, each column should hold a single data type and each row should be an observation (not always true, but a very good way to start). Here you have a row called cond, then a row called hour, then, I'm guessing, a series of classifications. The way your data is presented to begin with doesn't make much sense and doesn't lend itself to easy manipulation. But all is not lost. This is what I would do:
Reorganize my data:
C <- data.frame(matrix(as.numeric(unlist(A)), ncol=2)[-(1:2), ])
colnames(C) <- c('A.trt', 'A.cntr')
rownames(C) <- LETTERS[1:nrow(C)]
D <- data.frame(matrix(as.numeric(unlist(B)), ncol=2)[-(1:2), ])
colnames(D) <- c('B.trt', 'B.cntr')
(df <- cbind(C, D))
Which gives:
# A.trt A.cntr B.trt B.cntr
# A 1 1 1.0 1.0
# B 10 1 1.0 10.0
# C 1 1 1.0 1.0
# D 1 1 1.0 1.0
# E 10 10 9.4 9.4
Then your problem is easily solved by:
df[which(df[, 1] > 0.95*df[, 3] & df[, 2] > 0.95*df[, 4]), ]
# A.trt A.cntr B.trt B.cntr
# A 1 1 1.0 1.0
# C 1 1 1.0 1.0
# D 1 1 1.0 1.0
# E 10 10 9.4 9.4
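The original error comes from multiplying a character value by a number. Once the values are numeric, the comparison can also be written for any number of columns with rowSums. The following is only a sketch: A_num and B_num are hypothetical, already-cleaned versions of A and B with the cond and hour rows dropped.

```r
# Multiplying a string by a number reproduces the error from the question.
msg <- tryCatch(0.95 * "1", error = function(e) conditionMessage(e))
msg  # "non-numeric argument to binary operator"

# Hypothetical cleaned data: numeric columns, one observation per row.
A_num <- data.frame(trt = c(1, 10, 1, 1, 10), ctrl = c(1, 1, 1, 1, 10),
                    row.names = LETTERS[1:5])
B_num <- data.frame(trt = c(1, 1, 1, 1, 9.4), ctrl = c(1, 10, 1, 1, 9.4),
                    row.names = LETTERS[1:5])

# Keep the rows where every column of A exceeds 95% of the matching value in B.
keep <- rowSums(A_num > 0.95 * B_num) == ncol(A_num)
A_num[keep, ]
```

This keeps rows A, C, D and E, matching the result of the which()-based subset above.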
