R remove NA value from factor in split function - r

I'm using the split function to group my data.frame into three categories (C, Q or S). Now, when I execute the split function, I notice that there are now 4 lists in the variable (C, Q, S and empty string).
I expect this has to do with an NA value, or an empty string. How do I filter this correctly?
Currently, my code looks like this:
# Read the data from the CSV file.
train.csv <- read.csv("train.csv")
# Create some handy variables
ship.embarked <- split(train.csv, train.csv$Embarked)
ship.pclass <- split(train.csv, train.csv$Pclass)
ship.embarked returns 4 lists (C, Q, S and empty string), while I expect to have 3 (C, Q and S). How do I solve this correctly?

To remove the rows where Embarked is "", convert the column to character, use nzchar to get a logical vector that is TRUE for non-empty strings, subset the rows with it, and remove the now-unused level with droplevels:
train.csv <- droplevels(train.csv[nzchar(as.character(train.csv$Embarked)), ])
Now we can do the split, and there won't be a "" group.
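The same steps can be tried on a small stand-in for the Titanic data. This is only a sketch: the data frame toy and its values are invented for illustration, and Embarked is created as a factor explicitly so that the empty string shows up as a level.

```r
# Toy stand-in for train.csv; names and values are made up for illustration.
toy <- data.frame(
  Name = c("a", "b", "c", "d", "e"),
  Embarked = factor(c("C", "Q", "S", "", "S"))
)

# The empty string is a level of its own, so split() would create a group for it.
n_levels_before <- nlevels(toy$Embarked)

# Keep only rows with a non-empty Embarked value, then drop the unused "" level.
keep <- nzchar(as.character(toy$Embarked))
toy_clean <- droplevels(toy[keep, ])

groups <- split(toy_clean, toy_clean$Embarked)
names(groups)
```

After the cleanup, names(groups) should contain only C, Q and S.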

Related

Replace special characters in a dataframe with the normal variant dataframe is empty

I'm new to R and want to replace all the special characters in my dataframe.
I've looked it up on Stack Overflow and it partially works: all the special characters are replaced with their normal counterparts, for example ä --> a. The problem I'm encountering is that the data frame doesn't exist anymore.
Function to replace
plain_text = function(x) {
old1 <- "šžþàáâãäåçèéêëìíîïðñòóôõöùúûüý"
new1 <- "szyaaaaaaceeeeiiiidnooooouuuuy"
x = apply(x,2,function(x) gsub(old1,new1,x))
}
R script
df1 = plain_text(df)
dataframe
V1
1 c("Foo","Bar","Foo","Bar")
2 c("Foo","Bar","Foo","Bar")
3 c("fixed","fixed","not fixed","fixed")
Don't use apply; it can be (and is here) destructive to a data.frame. In this case, you can use chartr. I caution that your function should take into consideration whether a column is character or not (since fixing letters on a numeric column breaks it).
plain_text = function(x) {
old1 <- "šžþàáâãäåçèéêëìíîïðñòóôõöùúûüý"
new1 <- "szyaaaaaaceeeeiiiidnooooouuuuy"
ischr <- sapply(x, is.character)
x[ischr] <- lapply(x[ischr], chartr, old = old1, new = new1)
x
}
The chartr function translates from one set of characters to another (so that old= and new= must be strings with the same number of characters). It is analogous to the shell command tr.
The reason apply is bad is that it converts its arguments to a matrix before doing anything. If the frame is all character, then this does not destroy any data, but it does lose the data.frame structure (perhaps easily re-applied with as.data.frame). A more idiomatic way for a frame is to lapply over its columns (analogous to MARGIN=2 for apply), and it returns a list. (A data.frame is effectively just a special-case list.) If we just ran lapply(x, ...) and reassigned it to x, then x would now be a list; however, by reassigning to specific columns with x[ischr]<- (or all columns using x[]<-, not shown here), then x is still a frame albeit with those columns changed.
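A minimal sketch of the difference, using a made-up frame df_demo and toupper as a stand-in for the real transformation (both are assumptions for illustration only):

```r
# Hypothetical two-column frame: one integer column, one character column.
df_demo <- data.frame(id = 1:3, txt = c("ab", "cd", "ef"),
                      stringsAsFactors = FALSE)

# apply() coerces the frame to a matrix first, so the result is a character
# matrix and the integer column is no longer an integer.
via_apply <- apply(df_demo, 2, toupper)
is.data.frame(via_apply)

# lapply() over the character columns, assigned back into the frame,
# keeps the data.frame structure and leaves the other columns alone.
via_lapply <- df_demo
is_chr <- sapply(via_lapply, is.character)
via_lapply[is_chr] <- lapply(via_lapply[is_chr], toupper)
is.data.frame(via_lapply)
```

Here via_apply is a matrix, while via_lapply is still a data.frame whose id column is still an integer.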
Lastly, gsub is not used well there because it is looking for the entire string "šžþàáâãäåçèéêëìíîïðñòóôõöùúûüý", not just one of its characters. What this job needs (I believe) is a one-by-one look at the characters, and replace them in-kind.
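To see the difference between the two functions, here is a small sketch with plain ASCII stand-ins for the accented characters (the strings are invented; the same reasoning should apply to old1 and new1 from the question):

```r
old_chars <- "abc"   # stand-in for the string of accented characters
new_chars <- "xyz"   # stand-in for the string of plain replacements
word <- "cabbage"

# gsub() looks for the whole substring "abc", which does not occur in "cabbage",
# so nothing is replaced.
whole <- gsub(old_chars, new_chars, word)

# chartr() maps character by character: a -> x, b -> y, c -> z.
per_char <- chartr(old_chars, new_chars, word)

whole
per_char
```

The gsub call returns the word unchanged, while chartr translates every a, b and c individually.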

Subset my R list by character vector from dataframe

I have an r object, 'd' that is a list. I want a dataframe of references to subsets of this list to make as variables for a function, 'myfunction'. This function will be called thousands of times using rslurm each using a different subset of d.
example: d[['1']][[3]] references a data matrix within the list.
myfunction(d[['1']][[3]])
works fine, but I want to be able to call these subsets from a dataframe.
I want to be able to have a dataframe, 'ds' containing all of my subset references.
>ds
d
1 d[['1']][[3]]
2 d[['1']][[4]]
>myfunction(get(ds[1,1]))
Error in get(ds[1, 1]) : object 'd[['1']][[3]]' not found
Is there something like 'get' that will let me call a subset of my object, d?
Or something I can put in 'myfunction' that will clarify that this string references a subset of d?
stack_overflow 'get'
A list:
my_list <- list('peanut', 'butter', 'is', 'amazing')
A dataframe containing subset references:
my_dataframe <- data.frame(keys=c("my_list[[1]]", "my_list[[2]]", "my_list[[3]]", "my_list[[4]]"), stringsAsFactors=FALSE)
A function that extracts the value from a list based on a passed value:
my_function <- function(key, my_list) {
from_list <- eval(parse(text=key))
print(from_list)
}
Getting the value from a list by passing in the dataframe row choice and the list:
my_function(my_dataframe[1,1], my_list)
I solved this by changing myfunction to take two arguments, c and w, and extracting the subset of d with bracket notation in the first line of the updated function. My ds now has two columns, c and w, with column c stored as character, and it works!
myfunction <- function(c, w) {
d <- d[[c]][[w]]
# ....rest of function
}
>ds
c w
1 1 3
2 1 4
>test <- myfunction(ds[1,1],ds[1,2])
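As a rough sketch of how this index-table approach could be run over every row of ds, the following uses an invented nested list d and a placeholder function body (sum of the selected matrix); neither comes from the original post.

```r
# Invented nested list: component "1" holds two placeholder items and two matrices.
d <- list(`1` = list("x", "y", matrix(1:4, 2), matrix(5:8, 2)))

# One row per subset: c is the (character) name in d, w the position inside it.
ds <- data.frame(c = c("1", "1"), w = c(3L, 4L), stringsAsFactors = FALSE)

# Placeholder for the real function: look up the subset, then work on it.
myfunction <- function(c, w) {
  m <- d[[c]][[w]]
  sum(m)
}

# Call myfunction once per row of ds.
results <- mapply(myfunction, ds$c, ds$w)
unname(results)
```

Storing the indices rather than a string such as "d[['1']][[3]]" avoids having to parse and evaluate code at run time.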

Concatenating strings from different rows in R

I have a R data frame which looks like
data.1 data.character
a **str1**,str2,str2,str3,str4,str5,str6
b str3,str4,str5
c **str1**,str6
I am currently using grepl to identify whether the column data.character contains my search string "<str>", and if so I want all the matching row values in data.1 to be concatenated into one string with a separator.
E.g., if I use grepl("str1", data.character) it will return two rows of df$data.1, and I want an output like
a,c (the rows which contain str1 in data.character)
I am currently using two for loops, but I know this is not an efficient method. I was wondering if someone could suggest a more elegant and less time-consuming method.
You were almost there - (now my long-winded answer)
# Data
df <- read.table(text="data.1 data.character
a **str1**,str2,str2,str3,str4,str5,str6
b str3,str4,str5
c **str1**,str6", header=TRUE, stringsAsFactors=FALSE)
Match string
# In your question you used grepl which produces a logical vector (TRUE if
#string is present)
grepl("str1" , df$data.character)
#[1] TRUE FALSE TRUE
# In my comment I used grep which produces a positional index of the vector if
# string is present (this was due to me not reading your grepl properly rather
# than because of any property)
grep("str1" , df$data.character)
# [1] 1 3
Then subset the vector that you want at these positions resulting from grep (or grepl)
(s <- df$data.1[grepl("str1" , df$data.character)])
# [1] "a" "c" first and third elements are selected
Paste these together into the required format (collapse argument is used to define the separator between the elements)
paste(s,collapse=",")
# [1] "a,c"
So more succinctly
paste(df$data.1[grep("str1" , df$data.character)],collapse=",")
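If the lookup is needed for several search strings, the one-liner can be wrapped in a small helper. This is only a sketch: the function name rows_matching and the simplified data are assumptions, and fixed = TRUE is used so that the search string is treated literally rather than as a regular expression.

```r
df <- data.frame(
  data.1 = c("a", "b", "c"),
  data.character = c("str1,str2,str3", "str3,str4,str5", "str1,str6"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: collapse the data.1 values of all rows whose
# data.character contains `pattern` (literal match) into a single string.
rows_matching <- function(pattern, data, sep = ",") {
  hit <- grepl(pattern, data$data.character, fixed = TRUE)
  paste(data$data.1[hit], collapse = sep)
}

rows_matching("str1", df)
rows_matching("str3", df)
rows_matching("zzz", df)
```

A search string that matches no row gives an empty string. Note that a literal substring match on "str1" would also hit a value such as "str10", so real data may need a stricter pattern.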

Concatenate rows of a data frame

I would like to take a data frame with characters and numbers, and concatenate all of the elements of the each row into a single string, which would be stored as a single element in a vector. As an example, I make a data frame of letters and numbers, and then I would like to concatenate the first row via the paste function, and hopefully return the value "A1"
df <- data.frame(letters = LETTERS[1:5], numbers = 1:5)
df
## letters numbers
## 1 A 1
## 2 B 2
## 3 C 3
## 4 D 4
## 5 E 5
paste(df[1,], sep =".")
## [1] "1" "1"
So paste is converting each element of the row into an integer that corresponds to the index of the corresponding level, as if it were a factor, and it keeps it as a vector of length two. (I know/believe that factors that are coerced to character behave in this way, but as R is not storing df[1,] as a factor at all, as tested by is.factor(), I can't verify that it is actually an index for a level.)
is.factor(df[1,])
## [1] FALSE
is.vector(df[1,])
## [1] FALSE
So if it is not a vector then it makes sense that it is behaving oddly, but I can't coerce it into a vector
> is.vector(as.vector(df[1,]))
[1] FALSE
Using as.character did not seem to help in my attempts
Can anyone explain this behavior?
While others have focused on why your code isn't working and how to improve it, I'm going to try and focus more on getting the result you want. From your description, it seems you can readily achieve what you want using paste:
df <- data.frame(letters = LETTERS[1:5], numbers = 1:5, stringsAsFactors=FALSE)
paste(df$letters, df$numbers, sep="")
## [1] "A1" "B2" "C3" "D4" "E5"
You can change df$letters to character using df$letters <- as.character(df$letters) if you don't want to use the stringsAsFactors argument.
But let's assume that's not what you want. Let's assume you have hundreds of columns and you want to paste them all together. We can do that with your minimal example too:
df_args <- c(df, sep="")
do.call(paste, df_args)
## [1] "A1" "B2" "C3" "D4" "E5"
EDIT: Alternative method and explanation:
I realised the problem you're having is a combination of the fact that you're using a factor and that you're using the sep argument instead of collapse (as @adibender picked up). The difference is that sep gives the separator between separate vectors and collapse gives the separator between the elements within a vector. When you use df[1,], you supply a single object to paste, and hence you must use the collapse argument. Using your idea of getting every row and concatenating them, the following line of code will do exactly what you want:
apply(df, 1, paste, collapse="")
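The difference between sep and collapse is easiest to see on plain vectors. A minimal sketch, with variable names chosen only for illustration:

```r
# sep joins the corresponding elements of several vectors.
two_args <- paste("A", 1, sep = "")

# With a single vector there is nothing for sep to join, so each element
# comes back on its own.
one_arg_sep <- paste(c("A", "1"), sep = "")

# collapse joins the elements of the result into one string.
one_arg_collapse <- paste(c("A", "1"), collapse = "")

# Both together: sep works across vectors first, then collapse across elements.
both <- paste(c("A", "B"), 1:2, sep = "-", collapse = "+")
```

Here two_args and one_arg_collapse are both "A1", one_arg_sep stays a vector of length two, and both is "A-1+B-2".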
Ok, now for the explanations:
Why won't as.list work?
as.list converts an object to a list. So it does work. It will convert your dataframe to a list and subsequently ignore the sep="" argument. c combines objects together. Technically, a dataframe is just a list where every column is an element and all elements have to have the same length. So when I combine it with sep="", it just becomes a regular list with the columns of the dataframe as elements.
Why use do.call?
do.call allows you to call a function using a named list as its arguments. You can't just throw the list straight into paste, because it doesn't like dataframes. It's designed for concatenating vectors. So remember that df_args is a list containing a vector of letters, a vector of numbers and sep, which is a length 1 vector containing only "". When I use do.call, the resulting paste call is essentially paste(letters, numbers, sep="").
But what if my original dataframe had columns "letters", "numbers", "squigs", "blargs" after which I added the separator like I did before? Then the paste function through do.call would look like:
paste(letters, numbers, squigs, blargs, sep="")
So you see it works for any number of columns.
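A sketch of that four-column case; the frame wide and its values are invented here, with the column names taken from the sentence above:

```r
# Hypothetical frame with four columns of different types.
wide <- data.frame(
  letters = LETTERS[1:3],
  numbers = 1:3,
  squigs  = c("x", "y", "z"),
  blargs  = c(TRUE, FALSE, TRUE),
  stringsAsFactors = FALSE
)

# c() turns the frame into a plain list of its columns plus the sep element.
paste_args <- c(wide, sep = "")

# Equivalent to paste(wide$letters, wide$numbers, wide$squigs, wide$blargs, sep = "").
pasted <- do.call(paste, paste_args)
pasted
```

Each row becomes one string regardless of how many columns the frame has.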
For those using library(tidyverse), you can simply use the unite function (from tidyr).
new.df <- df %>%
  unite(together, letters, numbers, sep="")
This will give you a new column called together with A1, B2, etc.
This is indeed a little weird, but this is also what is supposed to happen.
When you create the data.frame as you did, the column letters is stored as a factor (the default before R 4.0.0). Internally, a factor is a vector of integer codes that index its levels, so when as.numeric() is applied to a factor it returns those codes rather than the labels. For example:
> df[, 1]
[1] A B C D E
Levels: A B C D E
> as.numeric(df[, 1])
[1] 1 2 3 4 5
A is the first level of the factor df[, 1], therefore A gets converted to the value 1 when as.numeric is applied. Something similar happens when you call paste(df[1, ]): the row is a list whose columns are of different classes, and when paste coerces it to character, the factor element is represented by its integer code.
When you want to concatenate both columns, you first need to convert the first column to character:
df[, 1] <- as.character(df[, 1])
paste(df[1,], collapse = "")
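The relation between a factor's labels and its integer codes can be checked directly. A small sketch with an invented factor f:

```r
f <- factor(c("B", "A", "C"))   # levels are sorted: "A", "B", "C"

codes  <- as.integer(f)     # positions of each value within levels(f)
labels <- as.character(f)   # the original strings

# Indexing the levels by the codes recovers the labels.
recovered <- levels(f)[codes]
```

Because the levels are sorted alphabetically, "B" gets code 2, "A" code 1 and "C" code 3. Converting with as.character before pasting is what keeps the labels.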
As @sebastian-c pointed out, you can also use stringsAsFactors = FALSE in the creation of the data.frame, then you can omit the as.character() step.
If you want to start with
df <- data.frame(letters = LETTERS[1:5], numbers = 1:5, stringsAsFactors=TRUE)
... then there is no general rule about how df$letters will be interpreted by any given function. It's a factor for modelling functions, character for some and integer for some others. Even the same function such as paste may interpret it differently, depending on how you use it:
paste(df[1,], collapse="") # "11"
apply(df, 1, paste, collapse="") # "A1" "B2" "C3" "D4" "E5"
No logic in it except that it will probably make sense once you know the internals of every function.
The factors seem to be converted to integers when an argument is converted to a vector. As you know, data frames are lists of vectors of equal length, so the first row of a data frame is also a list, and when it is forced to be a vector, something like this happens:
df[1,]
# letters numbers
# 1 A 1
unlist(df[1,])
# letters numbers
# 1 1
I don't know how apply achieves what it does (i.e., factors are represented by character values) -- if you're interested, look at its source code. It may be useful to know, though, that you can trust (in this specific sense) apply (in this specific occasion). More generally, it is useful to store every piece of data in a sensible format, that includes storing strings as strings, i.e., using stringsAsFactors=FALSE.
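One way to look into the apply behaviour: apply coerces a data frame with as.matrix before it calls the function, and as.matrix turns a frame with a factor column into a character matrix that uses the factor labels. A sketch, with the frame named df_f here only to keep it apart from df, and stringsAsFactors set explicitly because the default changed to FALSE in R 4.0.0:

```r
df_f <- data.frame(letters = LETTERS[1:5], numbers = 1:5,
                   stringsAsFactors = TRUE)

# The matrix that apply() works on: everything is character,
# and the factor column is represented by its labels.
m <- as.matrix(df_f)
typeof(m)
first_row <- unname(m[1, ])

# So pasting row by row sees "A" and "1", not the factor codes.
pasted_rows <- unname(apply(df_f, 1, paste, collapse = ""))
pasted_rows
```

The first row of the matrix holds "A" and "1", which is why the row-wise paste gives "A1" and so on.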
Btw, every introductory R book should have this idea in a subtitle. For example, my plan for retirement is to write "A (not so) gentle introduction to the zen of data fishery with R, the stringsAsFactors=FALSE way".
