Convert dataframe column to character vector - r

I have a dataframe and I'd like to run a sentiment analysis on a particular column.
mysentiment <- get_nrc_sentiment(hud['review_body'])
However, when I run the sentiment analysis in RStudio using the
get_nrc_sentiment function, I get the error "Error in get_nrc_sentiment(hud["review_body"]) :
Data must be a character vector."
I tried converting the dataframe column to a vector using
as.vector(hud['review_body'])
That doesn't seem to work either. I've just started learning R.

Sub-setting a data.frame this way: hud['review_body'] merely creates a new data.frame with that one column. The result is still a data.frame, not a vector.
class(hud['review_body'])
## "data.frame"
You can extract the column as a vector in any of the following ways:
class(hud$review_body)
## "character"
class(hud[, 'review_body'])
## "character"
class(hud[['review_body']])
## "character"
Any of these will work with get_nrc_sentiment(...).
The double square bracket notation confuses a lot of people. A data.frame is a list of vectors. Using single brackets extracts that element of the list, but it's still a list (with one element). To extract the contents of the element (the vector), you need to use double brackets.
x = list(A=1:3, B=4:6, C=7:9)
x # list with 3 elements, "A", "B", and "C"
x["A"] # list with one element (the "A" element)
x[["A"]] # contents of the "A" element of the list
In future, please do not include pictures of data. Include the data.

Related

R code to extract two columns (one of which has multiple text strings which need to be parsed) into a named list of character vectors

I currently have a dataframe in which I need to convert two of the columns into a specified format. Example of the data in each column:
Column 1: Some_text_String
Column 2:
GO:0048046^cellular_component^apoplast`GO:0005618^cellular_component^cell wall`GO:0005576^cellular_component^extracellular region`GO:0099503^cellular_component^secretory vesicle`GO:0004252^molecular_function^serine-type endopeptidase activity`GO:0080001^biological_process^mucilage extrusion from seed coat`GO:0048359^biological_process^mucilage metabolic process involved in seed coat development`GO:0010214^biological_process^seed coat development
So I have two problems. I need to parse the second column so that only the GO:XXXXXXXX text is included. A partial solution that gets the first term is stringr::str_extract(mydataframe[1,2], ".{0,8}GO.{0,8}") but this only captures the first term.
Secondly the final output needs to be a named list of character vectors, with the list names being the first column and each element of the list being a character vector. This is direct from the vignette of the R package I'm trying to use (topGO).
The object returned by readMappings is a named list of character
vectors. The list names give the genes identifiers. Each element of
the list is a character vector and contains the GO identifiers
annotated to the specific gene
I know this is simple but I'm just getting stuck trying to use apply or some other solution and my brain is on strike.
Reprex:
myvector1 <- c("Some_text_String")
myvector2 <- c("GO:0048046^cellular_component^apoplast`GO:0005618^cellular_component^cell wall`")
mydataframe <- data.frame(myvector1,myvector2)
# parse myvector2 to remove everything except GO terms.
# This code only gets the first term, but I need all of them as a vector
stringr::str_extract(mydataframe [1,2], ".{0,8}GO.{0,8}")
# At this point the desired result is named list of character vectors, with the list names being the first column and each element of the list being a character vector.
You can use str_extract_all to extract all the values that satisfy the pattern and use setNames to get a named list.
library(stringr)
setNames(str_extract_all(mydataframe [1,2], "GO.{0,8}"), mydataframe$myvector1)
#$Some_text_String
#[1] "GO:0048046" "GO:0005618"
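The call above handles only the first row. To build the named list for every row at once, the same idea can be applied to the whole column. The following is a minimal base R sketch rather than the answerer's code: it assumes every GO identifier is "GO:" followed by exactly seven digits (as in the sample data), and the second row of the data frame is invented for illustration.

```r
# Hypothetical two-row version of the question's data (second row invented)
mydataframe <- data.frame(
  myvector1 = c("Some_text_String", "Another_text_String"),
  myvector2 = c(
    "GO:0048046^cellular_component^apoplast`GO:0005618^cellular_component^cell wall`",
    "GO:0004252^molecular_function^serine-type endopeptidase activity"
  ),
  stringsAsFactors = FALSE
)

# gregexpr() finds every match in each string; regmatches() returns them
# as a list with one character vector per row
go_terms <- regmatches(
  mydataframe$myvector2,
  gregexpr("GO:[0-9]{7}", mydataframe$myvector2)
)

# Name each list element after the corresponding entry in the first column
gene2go <- setNames(go_terms, mydataframe$myvector1)
gene2go
```

The result is a named list of character vectors, which is the shape the topGO vignette quoted in the question describes.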

Vector from tibble has length 0

I have a tibble ('df') with
> dim(df)
[1] 55 144
of which I extract a vector test <- c(df[,39]). I would expect the following result:
> length(test)
[1] 55
as I basically took column 39 from my tibble. Instead, I get
> length(test)
[1] 1
Now, class(test) yielded list, so I thought the class might be the reason; however, with class set to char, I get the same result.
I'm especially confused since length(df[39,]) yields [1] 155.
Background is I am searching in the vector using grep, which doesn't work with a vector taken from a column. Of course, as I am trying to recode all lines in my tibble, I can recode them by row instead of by column, so I think there is a workaround. However, what causes R to assume that test has length 1? What is the difference in the treatment of rows and columns?
Whenever you apply the [ ] operator to a tibble, it returns another tibble. This is one of the differences between a tibble and a base R data.frame.
For example:
library(tibble)
a <- 1:5
df <- tibble(a, b = a * 2, c = a^2)
df2 <- as.data.frame(df) # convert to a base data.frame
df[, 2] # gives a tibble; its dim is 5 1
df2[, 2] # gives a vector; its dim is NULL and its length is 5
As you can see, the data.frame returns a different type (a vector) from the one you started with, whereas the tibble is designed to keep the input and output types consistent.
There are two ways to extract a column of a tibble as a vector:
pull()
[[ ]]
Personally, I use pull(), which is also very intuitive.
Why does length(df[39,]) yield 155?
My understanding is that df[39,] gives you a tibble whose dim is 1 155, and its length equals the number of columns. Why? Because length() also works on lists. Under the hood, a tibble or data.frame is a list with one element per column, and each column is a vector. That's why a single tibble or data.frame can hold columns of different types.
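The same reasoning explains the length of 1 in the question: wrapping a one-column table in c() gives a list with a single element, and that element is the column. A minimal sketch in base R, using drop = FALSE on a small made-up data.frame as a stand-in for the tibble behaviour so that it runs without extra packages:

```r
# Small illustrative data.frame (not the asker's data)
df <- data.frame(a = 1:5, b = (1:5) * 2)

# drop = FALSE keeps the one-column table structure, as a tibble does by default
one_col <- df[, "b", drop = FALSE]

test <- c(one_col)   # a list holding one element: the column
length(test)         # 1, the number of list elements (columns)
length(test[[1]])    # 5, the length of the column itself
length(df[3, ])      # 2, a one-row table still has one element per column
```

Extracting with [[ ]] first, as suggested above, yields a plain vector that functions such as grep can search element by element.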

Clarification in colnames function in R

I am new to R and wanted to ask experts about the colnames function. I noticed that it returns NULL when used on a single column of a matrix object, although it works perfectly well on more than one column. To illustrate, say I have a matrix test
>test<-matrix(0,ncol=4,nrow=5)
>colnames(test)<-c("A","B","C","D")
>colnames(test[,1]) or colnames(test[,c(1)]) gives output as NULL
NULL
whereas the following works fine,
colnames(test[,c(1:2)])
[1] "A" "B"
I understand that an alternative is to use colnames(test)[c(1:2)]. Am I missing something in the case where I get NULL?
If you look at the description in ?colnames, you'll see that it takes an argument x, which is a matrix-like R object with at least two dimensions for colnames.
When you call colnames(test[,1]), you are giving colnames a vector, not a matrix. Compare class(test[,1]) vs. class(test[,c(1:2)]). Vectors don't have columns or rows and therefore have no column or row names. You can have named elements within a vector, but that is definitely not equivalent to the column names of a matrix.
The best way to extract a single column name (or several) is to index into the full vector of column names:
colnames(test) # gives you all column names
colnames(test)[1] # gives you the column name 1
colnames(test)[c(1,2)] # gives you column names 1 and 2
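If the goal is to keep the column name while selecting a single column, one option not mentioned above is drop = FALSE, which stops [ from simplifying the result to a vector. A small sketch using the matrix from the question:

```r
test <- matrix(0, ncol = 4, nrow = 5)
colnames(test) <- c("A", "B", "C", "D")

one_col <- test[, 1, drop = FALSE]  # still a 5 x 1 matrix
dim(one_col)
colnames(one_col)                   # the name "A" is preserved

dim(test[, 1])                      # NULL: the default drops to a plain vector
```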
Does this clarify this issue for you?

How to get a matrix element without the column name in R?

This seems to be simple but I can't find the answer.
I combine two vectors using cbind().
> first = c(1:5)
> second = c(6:10)
> values = cbind(first,second)
When I want to retrieve a single element using values[1,2] I always get the column name in addition to the actual element.
> values[1,2]
second
6
How can I get the value without the column name?
I know I can remove the column names in the matrix like in this post: How to remove column names from a matrix in R? But how can I leave the matrix as is and only get the value I want?
We can use unname
unname(values[1,2])
#[1] 6
Or as.vector
as.vector(values[1,2])
You can use the [[ operator to extract a single element:
values[[1,2]]
# [1] 6
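The three suggestions can be compared side by side. This sketch rebuilds the matrix from the question and shows where the name comes from and how each approach removes it:

```r
first <- 1:5
second <- 6:10
values <- cbind(first, second)

names(values[1, 2])       # "second": the column name travels with the element
unname(values[1, 2])      # strips the name
as.vector(values[1, 2])   # also strips the name
values[[1, 2]]            # extracts the element itself, without a name
```

None of these calls modifies values, so the matrix keeps its column names.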

Data frame typecasting entire column to character from numeric

Suppose I have a data.frame that's completely numeric. If I make one entry of the first column a character (for example), then the entire first column will become character.
Question: How do I reverse this. That is, how do I make it such that any character objects inside the data.frame that are "obviously" numeric objects are forced to be numeric?
MWE:
test <- data.frame(matrix(rnorm(50),10))
is(test[3,1])
test[1,1] <- "TEST"
is(test[3,1])
print(test)
So my goal here would be to go FROM the way that test is now, TO a state of affairs where test[2:10] is numeric. So I guess I'm asking for a function that does this over an entire data.frame.
The short answer is that you cannot.
As was mentioned in the comments, in a data frame, all elements of a column must have the same mode.
If you would like to specifically find the values that are "number like" you can use the following (where vec here would be, say, a data frame column)
vec[!is.na(suppressWarnings(as.numeric(vec)))]
You can then convert these, but unfortunately you cannot put the converted values back into the same column. As soon as you do, they will be coerced back to character.
As for a function that can convert the whole dataframe to numeric (realizing that isolating specific entries as exceptions is not possible), you can use sapply
sapply(dataFrameName, as.numeric)
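Note that sapply returns a matrix here rather than a data.frame. If the data.frame structure should be kept, one common alternative is to assign the result of lapply back into the existing data.frame. The following is a sketch based on the MWE from the question (set.seed is added only for reproducibility); entries that cannot be parsed, such as "TEST", become NA:

```r
set.seed(1)
test <- data.frame(matrix(rnorm(50), 10))
test[1, 1] <- "TEST"          # the first column is now character

# test[] keeps the data.frame shell and replaces each column;
# suppressWarnings() hides the "NAs introduced by coercion" warning
test[] <- lapply(test, function(col) suppressWarnings(as.numeric(col)))

sapply(test, class)           # every column is numeric again
test[1, 1]                    # NA, since "TEST" is not a number
```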
You are allowed to have vectors of type list in a data.frame, and the list can contain any type of object except functions as long as it is of the same length as the other columns in the data.frame, e.g.:
mydataframe <- data.frame(numbers=1:3)
mydataframe$mylist <- list(1, 'plum', 5)
mydataframe
# numbers mylist
#1 1 1
#2 2 plum
#3 3 5
sapply(mydataframe, typeof)
# numbers mylist
#"integer" "list"
sapply(mydataframe$mylist, typeof)
#[1] "double" "character" "double"

Resources