mutate string into numeric, ignore alphabetical order of factor - r

I am trying to recode factor levels into numbers using the mutate function, but I want to ignore the alphabetical order of the levels. Some values occur more than once, and each value should get a number according to the order in which it first appears in the data frame; repeated values reuse the number assigned at their first appearance.
Example:
library(stringi)
library(dplyr) # provides %>% and mutate()
set.seed(234)
data<-stri_rand_strings(20,1)
data<-as.data.frame(data)
data2<-data %>% mutate(num=(as.numeric(factor(data))))
data2
Expected outcome:
dat<-data2[,-2]
order<-c(1,2,3,2,4,5)
expected_result<-cbind.data.frame(head(dat), order)
expected_result

I think you can just create a new factor and set the levels as unique values of data2$data in your example:
new_fac <- factor(data2$data, levels = unique(data2$data))
The numeric values can be obtained:
new_order <- as.numeric(new_fac)
And this is what your final result would look like:
head(data.frame(new_fac, new_order))
new_fac new_order
1 k 1
2 m 2
3 1 3
4 m 2
5 4 4
6 d 5
Or in your example with dplyr, you can do:
data %>%
mutate(num = as.numeric(factor(data, levels = unique(data))))
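To see why setting the levels matters, here is a minimal sketch on a small hypothetical vector x (the first six values from the output above). By default factor() sorts the levels, whereas levels = unique(x) keeps them in order of first appearance.

```r
# Hypothetical input: the first six values shown in the output above
x <- c("k", "m", "1", "m", "4", "d")

# Default: levels are sorted, so the codes follow the sort order
sorted_codes <- as.integer(factor(x))

# Levels in order of first appearance: the codes follow the data
appearance_codes <- as.integer(factor(x, levels = unique(x)))
appearance_codes
# [1] 1 2 3 2 4 5
```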

You could accomplish this with a helper table that contains the row number of the first time a string appears in your table. I.e.
library(stringi)
library(tidyverse)
# generate data
data<-stri_rand_strings(20,1)
data<-as.data.frame(data)
Create helper table:
factorlevels <- data %>% unique() %>% mutate(order = row_number())
... and inner join to data
data %>% inner_join(factorlevels)
Output:
> data %>% inner_join(factorlevels)
Joining, by = "data"
data order
1 k 1
2 m 2
3 1 3
4 m 2
5 4 4
6 d 5
7 v 6
8 i 7
9 v 6
10 H 8
11 Y 9
12 X 10
13 a 11
14 a 11
15 0 12
16 R 13
17 J 14
18 j 15
19 8 16
20 s 17
I am sure that there is a one-liner approach to this, but I could not figure it out right away.
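One candidate for such a one-liner, offered as a sketch rather than the definitive answer: base R's match() returns, for each value, its position in unique(x), which is the order of first appearance. The vector x below is a hypothetical stand-in for data$data.

```r
# Hypothetical stand-in for data$data
x <- c("k", "m", "1", "m", "4", "d", "v", "i", "v")

# Position of each value among the unique values, in order of first appearance
first_seen <- match(x, unique(x))
first_seen
# [1] 1 2 3 2 4 5 6 7 6
```

Inside a dplyr pipeline the same idea would presumably read data %>% mutate(order = match(data, unique(data))), which avoids the helper table and the join.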

Related

Add non occurent factors to data frame in R

I have a dataframe of factors and corresponding values like this:
df <- data.frame(week = factor(c(1,2,49,50)), occurrences = c(1,4,2,3))
week occurrences
1 1 1
2 2 4
3 49 2
4 50 3
I want to add factors for all the "missing" weeks in (1-53) with the corresponding occurrences value of 0. What is the best way to do this? I have to do this to several data frames that may not be "missing" the same factors so I would like to generalize it in a function.
You can use rbind() to append the necessary rows to your df. In this example, I first create the data frame to be added before appending it, for clarity. setdiff() returns the numbers not currently present in your week column:
df_to_app = data.frame(week = factor(setdiff(1:52, df$week)), occurrences = 0)
df = rbind(df, df_to_app)
I hope that helps!
Here's an approach with tidyr::complete. First, we need to add the additional levels to our week column. We can use forcats::fct_expand. Then tidyr::complete will fill the data.frame with those levels and we can use the fill = argument to indicate that we want 0.
library(tidyverse)
df %>%
mutate(week = fct_expand(week,paste0(1:52))) %>%
complete(week, fill = list(occurrences = 0))
# A tibble: 52 x 2
week occurrences
<fct> <dbl>
1 1 1
2 2 4
3 49 2
4 50 3
5 3 0
6 4 0
7 5 0
8 6 0
9 7 0
10 8 0
# … with 42 more rows
Or with a right join to a data.frame containing all weeks:
library(dplyr)
library(tidyr) # provides replace_na()
df %>%
right_join(data.frame(week = as.factor(1:52))) %>%
mutate(occurrences = replace_na(occurrences,0))
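Since the question asks for something that can be reused across several data frames, the rbind()/setdiff() idea above could be wrapped in a function. This is only a sketch in base R: the name fill_missing_weeks and the weeks argument are hypothetical, and it assumes the columns are called week and occurrences as in the example.

```r
# Hypothetical helper: add a zero-count row for every week not present in df
fill_missing_weeks <- function(df, weeks = 1:53) {
  # as.character() first: as.integer() on a factor would return the level codes
  present <- as.integer(as.character(df$week))
  missing <- setdiff(weeks, present)
  filled <- rbind(
    data.frame(week = present, occurrences = df$occurrences),
    data.frame(week = missing, occurrences = rep(0, length(missing)))
  )
  filled <- filled[order(filled$week), ]
  # Restore the factor, with one level per week in numeric order
  filled$week <- factor(filled$week, levels = weeks)
  rownames(filled) <- NULL
  filled
}

df <- data.frame(week = factor(c(1, 2, 49, 50)), occurrences = c(1, 4, 2, 3))
res <- fill_missing_weeks(df)
nrow(res)
# [1] 53
```

Pass weeks = 1:52 if the data never contains a week 53.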

How to merge columns in R with different levels of values

I have been given a dataset that I am attempting to perform logistic regression on. However, to do so, I need to merge some columns in R.
For instance in the carevaluations data set, I am given (BuyingPrice_low, BuyingPrice_medium, BuyingPrice_high, BuyingPrice_vhigh, MaintenancePrice_low MaintenancePrice_medium MaintenancePrice_high MaintenancePrice_vhigh)
How would I combine the columns buying price_low, medium, etc. into one column called "BuyingPrice" with the order and their respective data in each column and the same with the maintenanceprice column?
library(dplyr)
library(tidyr) # provides gather()
df <- data.frame(Buy_low=rep(c(0,1), 10),
Buy_high=rep(c(0,1), 10))
one_column <- df %>%
gather(var, value)
head(one_column)
var value
1 Buy_low 0
2 Buy_low 1
3 Buy_low 0
4 Buy_low 1
5 Buy_low 0
6 Buy_low 1
It can be done with stack in base R :
df1 <- data.frame(a=1:3,b=4:6,c=7:9)
stack(df1)
# values ind
# 1 1 a
# 2 2 a
# 3 3 a
# 4 4 b
# 5 5 b
# 6 6 b
# 7 7 c
# 8 8 c
# 9 9 c
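If the BuyingPrice_* columns are 0/1 indicator columns with exactly one 1 per row (an assumption, since the question does not show the data), another reading of the question is to collapse them into a single factor with one value per row. A minimal sketch in base R, using a small made-up data frame:

```r
# Made-up indicator columns; assumes exactly one 1 in each row
df <- data.frame(
  BuyingPrice_low    = c(1, 0, 0),
  BuyingPrice_medium = c(0, 1, 0),
  BuyingPrice_high   = c(0, 0, 1)
)

cols <- grep("^BuyingPrice_", names(df), value = TRUE)
lvls <- sub("^BuyingPrice_", "", cols)

# max.col() gives, per row, the index of the column holding the largest value
idx <- max.col(as.matrix(df[cols]), ties.method = "first")
df$BuyingPrice <- factor(lvls[idx], levels = lvls)
as.character(df$BuyingPrice)
# [1] "low"    "medium" "high"
```

The same steps with the prefix "^MaintenancePrice_" would handle the second group of columns.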

sum up certain variables (columns) by variable names

I want to sum up certain variables (columns in a data frame).
I would like to select those variables by parts of their names.
The complication is that I have several conditions, so a single contains() from dplyr does not work.
Here is an example:
ab_yy <- c(1:5)
bc_yy <- c(5:9)
cd_yy <- c(2:6)
de_xx <- c(3:7)
dat <- data.frame(ab_yy,bc_yy,cd_yy,de_xx)
dat
ab_yy bc_yy cd_yy de_xx
1 1 5 2 3
2 2 6 3 4
3 3 7 4 5
4 4 8 5 6
5 5 9 6 7
#sum up all variables that contain yy and certain extra conditions
#may look something like this: rowSums(select(dat, contains(("yy&ab")|("yy&bc")) ) )
desired result:
6 8 10 12 14
EDIT: Fixed, sorry, low on caffeine
If you want to use dplyr, try using matches:
library(dplyr)
dat %>%
select(matches("yy$")) %>%
select(matches("^(ab|bc)")) %>%
rowSums()
[1] 6 8 10 12 14
I don't think it's the best way, but you can do it like this with grepl:
rowSums(dat[, grepl("ab.*yy|bc.*yy", colnames(dat))])
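When the conditions multiply, it may be easier to keep each one as its own logical vector and combine them with & and |, instead of packing everything into a single regular expression. A sketch on the question's data (the name keep is just illustrative):

```r
dat <- data.frame(ab_yy = 1:5, bc_yy = 5:9, cd_yy = 2:6, de_xx = 3:7)

# One logical vector per condition, combined with &
keep <- grepl("yy$", names(dat)) & grepl("^(ab|bc)", names(dat))

# drop = FALSE keeps a data frame even if only one column matches
row_totals <- rowSums(dat[, keep, drop = FALSE])
unname(row_totals)
# [1]  6  8 10 12 14
```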

How to split a data frame in the order I want?

I have a data frame df like this
df
x y id
10 5 2
12 10 2
15 0 1
I want to split by the id. I used split(df, df$id) and I get
x y id
15 0 1
and
x y id
10 5 2
12 10 2
But I want the one with id = 2 to come before the one with id = 1.
So basically I want the output to be
x y id
10 5 2
12 10 2
and
x y id
15 0 1
According to the documentation of split(), "The components of the list are named by the levels of f (after converting to a factor ...)". Here f is the second parameter of split(), so the chunks appear in the order of the factor levels after splitting.
The OP has requested that the chunks should be returned in the same order as they appear in df. This can be achieved conveniently with the fct_inorder() function of Hadley's forcats package:
split(df, forcats::fct_inorder(factor(df$id)))
#$`2`
# x y id
#1 10 5 2
#2 12 10 2
#
#$`1`
# x y id
#3 15 0 1
Note that id itself remains unchanged; fct_inorder() is only used for defining the split.
Also, the additional call to factor() is only required because id is of type integer.
Edit: This can also be achieved without any packages:
split(df, factor(df$id, levels = unique(df$id)))
Just switch the order of the elements in the list.
Sdf = split(df, df$id)
Sdf = Sdf[c(2,1)]
$`2`
x y id
1 10 5 2
2 12 10 2
$`1`
x y id
3 15 0 1
You could also use rev (reverse)
Sdf = rev(Sdf)
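Hard-coding c(2,1) or calling rev() only fits this two-group example. A sketch that should carry over to any number of groups is to index the split list by the ids in order of first appearance. The fourth row below (id 3) is made up to show a third group.

```r
# The question's data plus one made-up row with a third id
df <- data.frame(x = c(10, 12, 15, 7), y = c(5, 10, 0, 3), id = c(2, 2, 1, 3))

Sdf <- split(df, df$id)                  # list names are "1", "2", "3"
Sdf <- Sdf[as.character(unique(df$id))]  # reorder by first appearance
names(Sdf)
# [1] "2" "1" "3"
```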

How to count how many values per level in a given factor?

I have a data.frame mydf with about 2500 rows. These rows correspond to 69 classes of objects in column 1 (mydf$V1), and I want to count how many rows I have per object class.
I can get a factor of these classes with:
objectclasses = unique(factor(mydf$V1, exclude="1"));
What's the terse R way to count the rows per object class? If this were any other language I'd be traversing an array with a loop and keeping count but I'm new to R programming and am trying to take advantage of R's vectorised operations.
Or using the dplyr library:
library(dplyr)
set.seed(1)
dat <- data.frame(ID = sample(letters,100,rep=TRUE))
dat %>%
group_by(ID) %>%
summarise(no_rows = length(ID))
Note the use of %>%, which is similar to the use of pipes in bash. Effectively, the code above pipes dat into group_by, and the result of that operation is piped into summarise.
The result is:
Source: local data frame [26 x 2]
ID no_rows
1 a 2
2 b 3
3 c 3
4 d 3
5 e 2
6 f 4
7 g 6
8 h 1
9 i 6
10 j 5
11 k 6
12 l 4
13 m 7
14 n 2
15 o 2
16 p 2
17 q 5
18 r 4
19 s 5
20 t 3
21 u 8
22 v 4
23 w 5
24 x 4
25 y 3
26 z 1
See the dplyr introduction for some more context, and the documentation for details regarding the individual functions.
Here are 2 ways to do it:
set.seed(1)
tt <- sample(letters,100,rep=TRUE)
## using table
table(tt)
tt
a b c d e f g h i j k l m n o p q r s t u v w x y z
2 3 3 3 2 4 6 1 6 5 6 4 7 2 2 2 5 4 5 3 8 4 5 4 3 1
## using tapply
tapply(tt,tt,length)
a b c d e f g h i j k l m n o p q r s t u v w x y z
2 3 3 3 2 4 6 1 6 5 6 4 7 2 2 2 5 4 5 3 8 4 5 4 3 1
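If the counts are needed as a data frame instead of a table object, the result of table() can be converted. A small sketch with a made-up vector; the column names class and n_rows are illustrative.

```r
# Made-up input vector
tt <- c("a", "b", "a", "c", "a")

# table() counts each value; as.data.frame() yields one row per value
counts <- as.data.frame(table(tt), stringsAsFactors = FALSE)
names(counts) <- c("class", "n_rows")
counts$n_rows
# [1] 3 1 1
```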
Using plyr package:
library(plyr)
count(mydf$V1)
It will return you a frequency of each value.
Using data.table
library(data.table)
setDT(dat)[, .N, keyby=ID] # (using @Paul Hiemstra's `dat`)
Or using dplyr 0.3
res <- count(dat, ID)
head(res)
#Source: local data frame [6 x 2]
# ID n
#1 a 2
#2 b 3
#3 c 3
#4 d 3
#5 e 2
#6 f 4
Or
dat %>%
group_by(ID) %>%
tally()
Or
dat %>%
group_by(ID) %>%
summarise(n=n())
We can use summary on factor column:
summary(myDF$factorColumn)
One more approach is to use the n() function, which counts the number of observations in each group:
library(dplyr)
library(magrittr)
data %>%
group_by(columnName) %>%
summarise(Count = n())
In case I just want to know how many unique factor levels exist in the data, I use:
length(unique(df$factorcolumn))
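One caveat worth illustrating: length(unique()) counts the values actually present, while nlevels() counts the declared levels, including unused ones. A sketch with a made-up factor f:

```r
# Made-up factor with one unused level ("c")
f <- factor(c("a", "b", "a"), levels = c("a", "b", "c"))

n_present  <- length(unique(f))      # values that occur in the data
n_declared <- nlevels(f)             # all levels, used or not
n_used     <- nlevels(droplevels(f)) # levels left after dropping unused ones
c(n_present, n_declared, n_used)
# [1] 2 3 2
```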
Use the package plyr with lapply to get frequencies for every value (level) and every variable (factor) in your data frame.
library(plyr)
lapply(df, count)
This is an old post, but you can do this with base R and no data frames/data tables:
sapply(levels(yTrain), function(sLevel) sum(yTrain == sLevel))
