Splitting 1 column into 1 to 3 columns in R

I have been wrestling with some code for a personal project and have been hitting some roadblocks.
I have some restaurant data and there is a column for the table with information separated by "/".
For example: 4/1 means table 4 and the first check at that table for the day. 10/A/2 means table 10, where the check was split into 2 or more checks (A, B, C, etc.); this is check 10/A, on turnover 2.
Checks can also be togo orders which may be denoted by the name of the order.
For example, here are some possible orders:
1/1
1/2
10/A/3
10/B/3
Togo
Bob Togo
And I want to split them into 1 to 3 columns that are organized by table (or Togo), split, and turnover. Like so:
> check <- c("1/1", "1/2", "10/A/3", "10/B/3", "Togo", "Bob Togo")
> checknum <- seq(1:6)
> dat <- cbind(checknum,check)
> dat
checknum check
[1,] "1" "1/1"
[2,] "2" "1/2"
[3,] "3" "10/A/3"
[4,] "4" "10/B/3"
[5,] "5" "Togo"
[6,] "6" "Bob Togo"
And ideally I want them to look like this:
> Table <- c(1,1,10,10,"Togo","Bob Togo")
> Split <- c(NA,NA,"A","B",NA,NA)
> Turn <- c(1,2,3,3,NA,NA)
> Ideal <- cbind(checknum,Table,Split,Turn)
> Ideal
checknum Table Split Turn
[1,] "1" "1" NA "1"
[2,] "2" "1" NA "2"
[3,] "3" "10" "A" "3"
[4,] "4" "10" "B" "3"
[5,] "5" "Togo" NA NA
[6,] "6" "Bob Togo" NA NA
Where all columns are for a specific aspect of the check with NAs for missing values.
Numeric values can be left as factors because each acts as a factor more than an integer. Ideally, the "Bob Togo" would be renamed "Togo" as well so that all Togo orders share the same factor.
I know this is a lot at once, but I've been hitting roadblocks for over 2 weeks now and I feel I'm missing something simple.
I'm relatively new to R, so any additional explanation with your answer is greatly appreciated.

We can do this with tidyverse by mutating the 'check' column with str_replace so that two-part values get an NA placeholder in the middle, and then using separate to split 'check' into three columns
library(tidyverse)
dat %>%
mutate(check = str_replace(check, "^(\\d+)/(\\d+)$", "\\1/NA/\\2")) %>%
separate( check, into = c("Table", "Split", "Turn"), sep="/", convert = TRUE)
# checknum Table Split Turn
#1 1 1 NA 1
#2 2 1 NA 2
#3 3 10 A 3
#4 4 10 B 3
#5 5 Togo <NA> <NA>
#6 6 Bob Togo <NA> <NA>
NOTE 1: It is better to create a data.frame as the initial dataset than a matrix, to accommodate columns of different classes
NOTE 2: tidyverse is a collection of packages, so loading it attaches all the core packages in that bundle. As @mt1022 suggested, we don't need to load the whole tidyverse; we can instead load dplyr (mutate), tidyr (separate) and stringr (str_replace).
data
dat <- data.frame(checknum,check, stringsAsFactors=FALSE)
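The question also asks for "Bob Togo" to be folded into "Togo", which the output above leaves as is. The following is a minimal base R sketch of one way to get there, not the answerer's method. It assumes every value has at most three "/"-separated pieces and that any table label containing "togo" (case-insensitive) is a to-go order; the grepl pattern is an assumption about the data.

```r
check <- c("1/1", "1/2", "10/A/3", "10/B/3", "Togo", "Bob Togo")
dat <- data.frame(checknum = seq_along(check), check, stringsAsFactors = FALSE)

# Split each value on the literal "/" into 1, 2 or 3 pieces
parts <- strsplit(dat$check, "/", fixed = TRUE)

# The first piece is always the table (or the to-go label)
dat$Table <- vapply(parts, function(p) p[1], character(1))
# The split letter exists only when there are three pieces
dat$Split <- vapply(parts, function(p) if (length(p) == 3) p[2] else NA_character_,
                    character(1))
# The turnover is the last piece when there are at least two
dat$Turn <- vapply(parts, function(p) if (length(p) >= 2) p[length(p)] else NA_character_,
                   character(1))

# Assumption: any label containing "togo" is a to-go order
dat$Table[grepl("togo", dat$Table, ignore.case = TRUE)] <- "Togo"
dat
```

The columns are left as character here; wrap them in factor() if you want the factor behaviour described in the question.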

Related

How to change the class of a column in a list of a list from character to numeric in r?

The code for producing the sample dataset and converting from character to numeric is below:
ff = data.frame(a = c('1','2','3'),b = 1:3, c = 5:7)
#data.frame is a type of list.
fff = list(ff,ff,ff,ff)
k = fff %>% map(~map(.x,function(x){x['a'] %<>% as.numeric
return(x)}))
However, the result is something like this...:
Three lists appear in each of the nested lists ==> 3*3 = 9, which is very strange.
I think the result should have 3 lists in a nested list ==> 3*1 = 3.
What I want is to convert every a in each data frame to numeric.
> k
[[1]]
[[1]]$a
a
"1" "2" "3" NA
[[1]]$b
a
1 2 3 NA
[[1]]$c
a
5 6 7 NA
[[2]]
[[2]]$a
a
"1" "2" "3" NA
[[2]]$b
a
1 2 3 NA
[[2]]$c
a
5 6 7 NA
[[3]]
[[3]]$a
a
"1" "2" "3" NA
[[3]]$b
a
1 2 3 NA
[[3]]$c
a
5 6 7 NA
[[4]]
[[4]]$a
a
"1" "2" "3" NA
[[4]]$b
a
1 2 3 NA
[[4]]$c
a
5 6 7 NA
I cannot understand why I cannot convert a into numeric...
Like this, with mutate:
fff %>%
map(~ mutate(.x, a = as.numeric(a)))
Or, more base R style:
fff %>%
map(\(x) {x$a <- as.numeric(x$a); x})
You should use map only once, because you don't have a nested list. With the first map, you access each data frame, and then you can convert to numeric. With a second map, you are accessing the columns of each data frame (which you don't want).
With two maps, it's also preferable to use \ or function rather than ~, because it becomes confusing to use .x and x for different objects. In your question, .x is the data frame, while x is each of its columns.
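To make the point above concrete, here is a small base R sketch (lapply standing in for map, so no packages are needed). It shows the single-pass conversion and then reproduces, on one column, why the nested version produced a trailing NA labelled a.

```r
# Sample data from the question (a is a character column)
ff <- data.frame(a = c("1", "2", "3"), b = 1:3, c = 5:7, stringsAsFactors = FALSE)
fff <- list(ff, ff, ff, ff)

# One pass over the list: each element is a whole data frame
k <- lapply(fff, function(df) {
  df$a <- as.numeric(df$a)
  df
})
vapply(k, function(df) class(df$a), character(1))

# Inside a second, inner loop x would be a single column (a plain vector),
# so x["a"] looks up an element *named* "a". None exists, so the lookup
# returns NA and the assignment appends a new element named "a".
x <- ff$a
x["a"] <- as.numeric(x["a"])
x
```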

How can I order a column of a matrix?

I have created a matrix out of two vectors
x<-c(1,118,3,220)
y<-c("A","B","C","D")
z<-c(x,y)
m<-matrix(z,ncol=2)
Now I want to order the second row, but it doesn't work properly.
I tried:
m[order(m[,2]),]
The order should be 1,3,118,220, but it shows 1,118,220,3
A matrix can only hold one class, which in this case is character because of "A","B","C","D".
So if you still want to order the rows of the matrix, you need to subset the first column, convert it to numeric, call order on it, and use the result to reorder the rows.
m[order(as.numeric(m[, 1])), ]
# [,1] [,2]
#[1,] "1" "A"
#[2,] "3" "C"
#[3,] "118" "B"
#[4,] "220" "D"
Since your data has mixed types, why not store it in a data frame instead?
x<-c(1,118,3,220)
y<-c("A","B","C","D")
df <- data.frame(x,y)
df[order(df[,1]),]
# x y
#1 1 A
#3 3 C
#2 118 B
#4 220 D
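The coercion described in this answer can be checked directly. The sketch below only inspects the matrix from the question: it confirms that the numbers were stored as strings and compares the ordering of the strings with the ordering of their numeric values.

```r
x <- c(1, 118, 3, 220)
y <- c("A", "B", "C", "D")
m <- matrix(c(x, y), ncol = 2)

# c(x, y) mixes numbers and strings, so everything becomes character
typeof(m)

# Ordering the strings compares them character by character,
# which is why "220" sorts before "3"
chr_order <- order(m[, 1])

# Ordering the converted values gives the numeric order 1, 3, 118, 220
num_order <- order(as.numeric(m[, 1]))
m[num_order, ]
```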

count of records within levels of a factor

I am trying to populate a field in a table (or create a separate vector altogether, whichever is easier) with consecutive numbers from 1 to n, where n is the total number of records that share the same factor level, and then back to 1 for the next level, etc. That is, for a table like this
data<-matrix(c(rep('A',4),rep('B',3),rep('C',4),rep('D',2)),ncol=1)
the result should be a new column (e.g. "sample") as follows:
sample<-c(1,2,3,4,1,2,3,1,2,3,4,1,2)
You can get it as follows, using ave:
data <- data.frame(data)
new <- ave(rep(1,nrow(data)),data$data,FUN=cumsum)
all.equal(new,sample) # check if it's right.
You can use rle function together with lapply :
sample <- unlist(lapply(rle(data[,1])$lengths,FUN=function(x){1:x}))
data <- cbind(data,sample)
Or even better, you can combine rle and sequence in the following one-liner (thanks to @Arun's suggestion)
data <- cbind(data,sequence(rle(data[,1])$lengths))
> data
[,1] [,2]
[1,] "A" "1"
[2,] "A" "2"
[3,] "A" "3"
[4,] "A" "4"
[5,] "B" "1"
[6,] "B" "2"
[7,] "B" "3"
[8,] "C" "1"
[9,] "C" "2"
[10,] "C" "3"
[11,] "C" "4"
[12,] "D" "1"
[13,] "D" "2"
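One caveat worth checking before choosing between the approaches above: rle counts consecutive runs, so the rle/sequence version restarts whenever the level changes, while ave counts within each level wherever its rows sit. The sketch below uses a small unsorted vector g (a made-up example, not the question's data) to show the difference; on data already grouped by level, as in the question, both give the same result.

```r
# Hypothetical unsorted grouping vector
g <- c("A", "B", "A", "A", "B")

# Counter within each level, regardless of row position
by_group <- ave(seq_along(g), g, FUN = seq_along)

# Counter within each consecutive run of identical values
by_run <- sequence(rle(g)$lengths)

rbind(g, by_group, by_run)
```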
There are lots of different ways of achieving this, but I prefer to use ddply() from plyr because the logic seems very consistent to me. I think it makes more sense to be working with a data.frame (your title talks about levels of a factor):
dat <- data.frame(ID = c(rep('A',4),rep('B',3),rep('C',4),rep('D',2)))
library(plyr)
ddply(dat, .(ID), summarise, sample = 1:length(ID))
# ID sample
# 1 A 1
# 2 A 2
# 3 A 3
# 4 A 4
# 5 B 1
# 6 B 2
# 7 B 3
# 8 C 1
# 9 C 2
# 10 C 3
# 11 C 4
# 12 D 1
# 13 D 2
My answer:
sample <- unlist(lapply(levels(factor(data)), function(x)seq_len(sum(factor(data)==x))))
factors <- unique(data)
f1 <- length(which(data == factors[1]))
...
fn <- length(which(data == factors[length(factors)]))
You can use a for loop or 'apply' family to speed that part up.
Then,
sample <- c(1:f1, 1:f2, ..., 1:fn)
Once again you can use a for loop for that part. Here is the full script you can use:
data<-matrix(c(rep('A',4),rep('B',3),rep('C',4),rep('D',2)),ncol=1)
factors <- unique(data)
f <- c()
for(i in 1:length(factors)) {
f[i] <- length(which(data == factors[i]))
}
sample <- c()
for(i in 1:length(f)) {
sample <- c(sample, 1:f[i])
}
> sample
[1] 1 2 3 4 1 2 3 1 2 3 4 1 2

Getting a row from a data frame as a vector in R

I know that to get a row from a data frame in R, we can do this:
data[row,]
where row is an integer. But that spits out an ugly-looking data structure where every value is labeled with its column name. How can I just get a row as a list of values?
In R versions before 4.0.0, data frames created by importing data from an external source had their character data converted to factors by default (since R 4.0.0 the default is stringsAsFactors = FALSE). If you do not want factors in an older version, set stringsAsFactors=FALSE
In this case to extract a row or a column as a vector you need to do something like this:
as.numeric(as.vector(DF[1,]))
or like this
as.character(as.vector(DF[1,]))
You can't necessarily get it as a vector because each column might have a different mode. You might have numerics in one column and characters in the next.
If you know the mode of the whole row, or can convert to the same type, you can use the mode's conversion function (for example, as.numeric()) to convert to a vector. For example:
> state.x77[1,]
Population Income Illiteracy Life Exp Murder HS Grad Frost
3615.00 3624.00 2.10 69.05 15.10 41.30 20.00
Area
50708.00
> as.numeric(state.x77[1,])
[1] 3615.00 3624.00 2.10 69.05 15.10 41.30 20.00 50708.00
This would work even if some of the columns were integers, although they would be converted to numeric floating-point numbers.
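For a data frame rather than a matrix, one common way to get the same effect is unlist on the one-row data frame. This is a sketch under the assumption that the columns are numeric or character (no factors); df_num and df_mix are made-up examples. It also shows what happens when the row mixes types: everything is coerced to character.

```r
# All-numeric row: integer and double columns become one double vector
df_num <- data.frame(a = 1:2, b = c(0.5, 1.5))
row_num <- unlist(df_num[1, ], use.names = FALSE)

# Mixed row: the number is coerced to a string
df_mix <- data.frame(a = 1:2, s = c("x", "y"), stringsAsFactors = FALSE)
row_mix <- unlist(df_mix[1, ], use.names = FALSE)

row_num
row_mix
```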
There is a problem with what you propose; namely that the components of data frames (what you call columns) can be of different data types. If you want a single row as a vector, that must contain only a single data type - they are atomic vectors!
Here is an example:
> set.seed(2)
> dat <- data.frame(A = 1:10, B = sample(LETTERS[1:4], 10, replace = TRUE))
> dat
A B
1 1 A
2 2 C
3 3 C
4 4 A
5 5 D
6 6 D
7 7 A
8 8 D
9 9 B
10 10 C
> dat[1, ]
A B
1 1 A
If we force it to drop the empty (column), the only recourse for R is to convert the row to a list to maintain the disparate data types.
> dat[1, , drop = TRUE]
$A
[1] 1
$B
[1] A
Levels: A B C D
The only logical solution to this is to get the data frame into a common type by coercing it to a matrix. This is done via data.matrix() for example:
> mat <- data.matrix(dat)
> mat[1,]
A B
1 1
data.matrix() converts factors to their internal numeric codes. The above allows the first row to be extracted as a vector.
However, if you have character data in the data frame, the only recourse is to create a character matrix, which may or may not be useful. data.matrix() can no longer be used; we need as.matrix() instead:
> dat$String <- LETTERS[1:10]
> str(dat)
'data.frame': 10 obs. of 3 variables:
$ A : int 1 2 3 4 5 6 7 8 9 10
$ B : Factor w/ 4 levels "A","B","C","D": 1 3 3 1 4 4 1 4 2 3
$ String: chr "A" "B" "C" "D" ...
> mat <- data.matrix(dat)
Warning message:
NAs introduced by coercion
> mat
A B String
[1,] 1 1 NA
[2,] 2 3 NA
[3,] 3 3 NA
[4,] 4 1 NA
[5,] 5 4 NA
[6,] 6 4 NA
[7,] 7 1 NA
[8,] 8 4 NA
[9,] 9 2 NA
[10,] 10 3 NA
> mat <- as.matrix(dat)
> mat
A B String
[1,] " 1" "A" "A"
[2,] " 2" "C" "B"
[3,] " 3" "C" "C"
[4,] " 4" "A" "D"
[5,] " 5" "D" "E"
[6,] " 6" "D" "F"
[7,] " 7" "A" "G"
[8,] " 8" "D" "H"
[9,] " 9" "B" "I"
[10,] "10" "C" "J"
> mat[1, ]
A B String
" 1" "A" "A"
> class(mat[1, ])
[1] "character"
How about this?
library(tidyverse)
dat <- as_tibble(iris)
pulled_row <- dat %>% slice(3) %>% flatten_chr()
If you know all the values are same type, then use flatten_xxx.
Otherwise, I think flatten_chr() is safer.
As user "Reinstate Monica" notes, this problem has two parts:
A data frame will often have different data types in each column that need to be coerced to character strings.
Even after coercing the columns to character format, the data.frame "shell" needs to be stripped off to create a vector via a command like unlist.
With a combination of dplyr and base R this can be done in two lines. First, mutate_all converts all columns to character format. Second, the unlist command extracts the vector out of the data.frame structure.
My particular issue was that the second line of a csv included the actual column names. So, I wanted to extract the second row to a vector and use that to assign column names. The following worked to extract the row as a character vector:
library(dplyr)
data_col_names <- data[2, ] %>%
mutate_all(as.character) %>%
unlist(., use.names=FALSE)
# example of using extracted row to rename cols
names(data) <- data_col_names
# only for this example, you'd want to remove row 2
# data <- data[-2, ]
(Note: Using as.character() in place of unlist will work too but it's less intuitive to apply as.character twice.)
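If you would rather avoid the dplyr dependency, the same two steps can be sketched in base R. The data frame raw below is a made-up stand-in for an imported file whose second row holds the real column names; vapply converts each cell of that row to character and returns a plain vector in one call.

```r
# Hypothetical import: row 2 contains the intended column names
raw <- data.frame(V1 = c("x1", "id", "1"),
                  V2 = c("x2", "score", "9.5"),
                  stringsAsFactors = FALSE)

# Convert each cell of row 2 to character and drop the data.frame shell
new_names <- vapply(raw[2, ], as.character, character(1), USE.NAMES = FALSE)

names(raw) <- new_names
# Remove the row that held the names
raw <- raw[-2, , drop = FALSE]
raw
```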
The shortest variant I can see is
c(t(data[row,]))
However, if at least one column in data is a column of strings, it will return a character vector.

Row names & column names in R

Do the following function pairs generate exactly the same results?
Pair 1) names() & colnames()
Pair 2) rownames() & row.names()
As Oscar Wilde said
Consistency is the last refuge of the unimaginative.
R is more of an evolved rather than designed language, so these things happen. names() and colnames() work on a data.frame but names() does not work on a matrix:
R> DF <- data.frame(foo=1:3, bar=LETTERS[1:3])
R> names(DF)
[1] "foo" "bar"
R> colnames(DF)
[1] "foo" "bar"
R> M <- matrix(1:9, ncol=3, dimnames=list(1:3, c("alpha","beta","gamma")))
R> names(M)
NULL
R> colnames(M)
[1] "alpha" "beta" "gamma"
R>
Just to expand a little on Dirk's example:
It helps to think of a data frame as a list with equal length vectors. That's probably why names works with a data frame but not a matrix.
The other useful function is dimnames which returns the names for every dimension. You will notice that the rownames function actually just returns the first element from dimnames.
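A quick sketch of the two points above, reusing Dirk's DF and M: a data frame is a list with one element per column, so names() sees the columns, whereas a matrix is a single vector with a dim attribute, so its labels live in dimnames.

```r
DF <- data.frame(foo = 1:3, bar = LETTERS[1:3])
M <- matrix(1:9, ncol = 3, dimnames = list(1:3, c("alpha", "beta", "gamma")))

# A data frame is a list of columns; names() and colnames() agree
is.list(DF)
length(DF)                              # number of columns
identical(names(DF), colnames(DF))

# A matrix is one atomic vector; it has no names, only dimnames
is.list(M)
length(M)                               # number of cells
is.null(names(M))

# rownames() returns the first element of dimnames()
identical(rownames(M), dimnames(M)[[1]])
```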
Regarding rownames and row.names: I can't tell the difference, although rownames uses dimnames while row.names was written outside of R. They both also seem to work with higher dimensional arrays:
> a <- array(1:5, 1:4)
> a[1,,,]
> rownames(a) <- "a"
> row.names(a)
[1] "a"
> a
, , 1, 1
[,1] [,2]
a 1 2
> dimnames(a)
[[1]]
[1] "a"
[[2]]
NULL
[[3]]
NULL
[[4]]
NULL
I think that using colnames and rownames makes the most sense; here's why.
Using names has several disadvantages. You have to remember that it means "column names", and it only works with data frames, so you'll need to call colnames whenever you use matrices. By calling colnames, you only have to remember one function. Finally, if you look at the code for colnames, you will see that it calls names in the case of a data frame anyway, so the output is identical.
rownames and row.names return the same values for data frames and matrices; the only difference that I have spotted is that where there aren't any names, rownames will print "NULL" (as does colnames), but row.names returns it invisibly. Since there isn't much to choose between the two functions, rownames wins on the grounds of aesthetics, since it pairs more prettily with colnames. (Also, for the lazy programmer, you save a character of typing.)
And another expansion:
# create dummy matrix
set.seed(10)
m <- matrix(round(runif(25, 1, 5)), 5)
d <- as.data.frame(m)
If you want to assign new column names, you can do the following on a data.frame:
# an identical effect can be achieved with colnames()
names(d) <- LETTERS[1:5]
> d
A B C D E
1 3 2 4 3 4
2 2 2 3 1 3
3 3 2 1 2 4
4 4 3 3 3 2
5 1 3 2 4 3
If, however, you run the previous command on a matrix, you'll mess things up:
names(m) <- LETTERS[1:5]
> m
[,1] [,2] [,3] [,4] [,5]
[1,] 3 2 4 3 4
[2,] 2 2 3 1 3
[3,] 3 2 1 2 4
[4,] 4 3 3 3 2
[5,] 1 3 2 4 3
attr(,"names")
[1] "A" "B" "C" "D" "E" NA NA NA NA NA NA NA NA NA NA NA NA NA NA
[20] NA NA NA NA NA NA
Since a matrix can be regarded as a two-dimensional vector, you'll assign names only to its first five values (you don't want to do that, do you?). In this case, you should stick with colnames().
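The contrast can be verified on a smaller matrix. In this sketch (the matrices m_ok and m_bad are made up for illustration), colnames<- labels the columns so they can be selected by name, while names<- labels individual cells, pads the rest with NA, and leaves the column names unset.

```r
# colnames<- labels the three columns
m_ok <- matrix(1:6, nrow = 2)
colnames(m_ok) <- c("A", "B", "C")
m_ok[, "B"]               # second column, selected by name

# names<- labels the first three *cells* and pads the rest with NA
m_bad <- matrix(1:6, nrow = 2)
names(m_bad) <- c("A", "B", "C")
names(m_bad)
is.null(colnames(m_bad))  # the columns are still unnamed
```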
So there...
