Convert All Variables in a Dataframe to Numbers - r

Is there a fast way to convert all variables in a column to numbers, regardless of variable type? For example, if a column only had the values "Yes" and "No", they would be converted to 0 and 1; columns with the 3 values "a", "b" and "c" would be converted to 0, 1, 2, etc.
The current df that I am using has the 9th column as "Yes/No".
EDIT:
Using Moody_Mudskipper's suggestion, I have tried:
RawData1 <- as.matrix(as.numeric(factor(RawData[[9]], levels = c("Yes","No"))) - 1)
dput(head(df,10))
structure(c("function (x, df1, df2, ncp, log = FALSE) ", "{",
" if (missing(ncp)) ", " .Call(C_df, x, df1, df2, log)",
" else .Call(C_dnf, x, df1, df2, ncp, log)", "}"), .Dim = c(6L,
1L), .Dimnames = list(c("1", "2", "3", "4", "5", "6"), ""), class =
"noquote")

Moody's answer (+1) explains that you need to convert to factors, then to numeric.
You can use mutate_all to change the class of all columns in your data frame:
library(dplyr)
df %>%
  # funs() is deprecated since dplyr 0.8.0; a formula lambda does the same job
  mutate_all(~ as.numeric(as.factor(.)))

You can use factors for this:
df <- data.frame(yn = sample(c("yes","no"), 10, TRUE),
                 abc = sample(c("a","b","c"), 10, TRUE),
                 stringsAsFactors = FALSE
)
# explicit levels fix the coding: the first level becomes 0, the second 1, ...
df$yn2 <- as.numeric(factor(df$yn, levels = c("yes","no"))) - 1
df$abc2 <- as.numeric(factor(df$abc, levels = c("a","b","c"))) - 1
# yn abc yn2 abc2
# 1 no b 1 1
# 2 yes b 0 1
# 3 no b 1 1
# 4 yes a 0 0
# 5 yes c 0 2
# 6 yes c 0 2
# 7 yes c 0 2
# 8 yes a 0 0
# 9 no c 1 2
# 10 yes b 0 1
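One detail worth keeping in mind with the factor approach above: when levels are not given, factor() sorts the distinct values, so the coding follows that sort order rather than the order in which the values appear. A minimal sketch (the vector yn_example is made up for illustration):

```r
yn_example <- c("yes", "no", "no", "yes")

# default levels are sorted, so "no" is level 1 and "yes" is level 2
default_codes <- as.integer(factor(yn_example)) - 1L

# explicit levels make "yes" the first level instead
explicit_codes <- as.integer(factor(yn_example, levels = c("yes", "no"))) - 1L

default_codes   # 1 0 0 1
explicit_codes  # 0 1 1 0
```

If a particular value has to map to 0, passing levels explicitly is the safer choice.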

Another Base R solution to convert all columns:
# Added a numeric column to @Moody_Mudskipper's data example
set.seed(1)
df <- data.frame(yn = sample(c("yes","no"),10,T),
abc = sample(c("a","b","c"),10,T),
num = 1:10,
stringsAsFactors = F
)
df = data.frame(lapply(df, function(x) as.numeric(as.factor(x))))
One issue with this though is that it gives:
yn abc num
1 2 1 1
2 2 1 2
3 1 3 3
4 1 2 4
5 2 3 5
6 1 2 6
7 1 3 7
8 1 3 8
9 1 2 9
10 2 3 10
which is not what OP wants, as he wanted factor/character variables to be converted to 0,1,2,3,... One can try to do this:
df = data.frame(lapply(df, function(x) as.numeric(as.factor(x))-1))
but then every numeric column would also have 1 subtracted from it, which is wrong. Using mutate_all (as in @CPak's answer) has the same issue. What you can do instead is use mutate_if to convert only the columns that are factors or characters:
library(dplyr)
# funs() is deprecated since dplyr 0.8.0; a formula lambda does the same job
df %>%
  mutate_if(function(x) is.factor(x) | is.character(x), ~ as.numeric(as.factor(.)) - 1)
# or this...
df %>%
  mutate_if(function(x) !is.numeric(x), ~ as.numeric(as.factor(.)) - 1)
Now, columns are correctly converted:
yn abc num
1 1 0 1
2 1 0 2
3 0 2 3
4 0 1 4
5 1 2 5
6 0 1 6
7 0 2 7
8 0 2 8
9 0 1 9
10 1 2 10
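The same "only touch non-numeric columns" idea can also be sketched in base R. The helper name to_codes and the small data frame toy are assumptions made for this illustration, not part of the answers above:

```r
# Recode character/factor columns to 0, 1, 2, ... and leave other columns alone.
to_codes <- function(data) {
  is_cat <- vapply(data,
                   function(col) is.character(col) || is.factor(col),
                   logical(1))
  data[is_cat] <- lapply(data[is_cat],
                         function(col) as.integer(factor(col)) - 1L)
  data
}

toy <- data.frame(yn = c("yes", "no", "yes"),
                  abc = c("a", "c", "b"),
                  num = 1:3,
                  stringsAsFactors = FALSE)
toy_codes <- to_codes(toy)
toy_codes
```

Because the levels are not given explicitly here, the codes follow the sorted order of the values in each column.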

Related

Encode string column as several dummy columns [duplicate]

I'd like to take data of the form
names label
1 A/B V
2 A W
3 A/C/D X
4 B/C Y
5 B/D Z
and encode the 'names' column into several columns containing a dummy variable which shows whether a particular name is included, i.e.
A B C D label
1 1 1 0 0 V
2 1 0 0 0 W
3 1 0 1 1 X
4 0 1 1 0 Y
5 0 1 0 1 Z
It feels like there should be an R function which takes care of this easily, but I have not been able to find one. Thanks for any pointers!
An option would be to split the string column by / and use mtabulate
library(qdapTools)
cbind(mtabulate(strsplit(df1$names, "/")), df1['label'])
# A B C D label
#1 1 1 0 0 V
#2 1 0 0 0 W
#3 1 0 1 1 X
#4 0 1 1 0 Y
#5 0 1 0 1 Z
Or in base R
table(stack(setNames(strsplit(df1$names, "/"), df1$label))[2:1])
No packages are used here.
Data:
df1 <- structure(list(names = c("A/B", "A", "A/C/D", "B/C", "B/D"),
label = c("V", "W", "X", "Y", "Z")), class = "data.frame",
row.names = c("1", "2", "3", "4", "5"))
Use separate_rows to put it in long form and then table will produce the output. Transpose to get it in the orientation shown in the question.
library(dplyr)
library(tidyr)
DF %>%
separate_rows(names) %>%
table %>%
t
giving:
names
label A B C D
V 1 1 0 0
W 1 0 0 0
X 1 0 1 1
Y 0 1 1 0
Z 0 1 0 1
Note
The input in reproducible form:
Lines <- "names label
1 A/B V
2 A W
3 A/C/D X
4 B/C Y
5 B/D Z"
DF <- read.table(text = Lines, as.is = TRUE)
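If the result should be a data frame with the label column kept as a regular column (the shape shown in the question), one possible base R sketch is the following. The intermediate names parts, lvls and dummies are made up for this illustration:

```r
df1 <- data.frame(names = c("A/B", "A", "A/C/D", "B/C", "B/D"),
                  label = c("V", "W", "X", "Y", "Z"),
                  stringsAsFactors = FALSE)

# split each string into its components
parts <- strsplit(df1$names, "/", fixed = TRUE)
# all distinct names, sorted, become the dummy columns
lvls <- sort(unique(unlist(parts)))

# one row per input row, one 0/1 column per distinct name
dummies <- t(vapply(parts,
                    function(p) as.integer(lvls %in% p),
                    integer(length(lvls))))
colnames(dummies) <- lvls

result <- cbind(as.data.frame(dummies), label = df1$label)
result
```

The sketch assumes at least two distinct names; with a single one, vapply returns a plain vector and the transpose step would need adjusting.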

Create dummy-column based on another columns

Let's say I have this dataset
> example <- data.frame(a = 1:10, b = 10:1, c = 1:5 )
I want to create a new variable d. d should be 1 when at least one of the variables a, b, c contains the value 1, 2 or 3, and 0 otherwise.
d should look like this:
d <- c(1, 1, 1, 0, 0, 1, 1, 1, 1, 1)
Thanks in advance.
You can use rowSums to count, for each row, how many cells equal 1, 2 or 3, test whether that count is positive, and wrap the result in as.integer to convert it to 0 and 1, i.e.
as.integer(rowSums(example == 1 | example == 2 | example == 3) > 0)
#[1] 1 1 1 0 0 1 1 1 1 1
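The == comparison and %in% behave differently when the data contain missing values: == returns NA for a missing cell, while %in% returns FALSE. A small sketch of the difference (example_na is a made-up data frame, not the one from the question):

```r
example_na <- data.frame(a = c(1, NA, 5), b = c(9, 8, NA))
x <- 1:3

# == propagates NA, so rows with a missing cell and no match end up NA
eq_based <- as.integer(
  rowSums(example_na == 1 | example_na == 2 | example_na == 3) > 0
)

# %in% treats NA as "no match", so those rows become 0
in_based <- as.integer(Reduce(`|`, lapply(example_na, `%in%`, x)))

eq_based  # 1 NA NA
in_based  # 1 0 0
```

Which behaviour is preferable depends on whether a missing value should count as "unknown" or as "not present".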
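This will work for any number of variables:
example <- data.frame(a = 1:10, b = 10:1, c = 1:5 )
x <- c(1, 2, 3)
# test every column against x first, then OR the logical vectors together;
# with more than two columns, feeding the running logical result back through
# %in% gives wrong answers unless x contains 1 and not 0
as.integer(Reduce(`|`, lapply(example, `%in%`, x)))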
With the dplyr package:
library(dplyr)
x <- 1:3
example %>% mutate(d = as.integer(a %in% x | b %in% x | c %in% x))
Two other possibilities which work with any number of columns:
#option 1
example$d <- +(rowSums(sapply(example, `%in%`, 1:3)) > 0)
#option 2
library(matrixStats)
example$d <- rowMaxs(+(sapply(example, `%in%`, 1:3)))
which both give:
> example
a b c d
1 1 10 1 1
2 2 9 2 1
3 3 8 3 1
4 4 7 4 0
5 5 6 5 0
6 6 5 1 1
7 7 4 2 1
8 8 3 3 1
9 9 2 4 1
10 10 1 5 1
You can do this using apply (although it is a little slow).
Logic: any checks whether 1, 2 or 3 is present, and apply runs that check on each row. Finally, adding +0 converts the logical result to numeric (you may use as.numeric here if you want to be more explicit).
d <- apply(example, 1, function(x) any(x==1|x==2|x==3))+0
To restrict the check to certain columns, you can do this instead:
d <- apply(example[,c("a","b","c")], 1, function(x) any(x==1|x==2|x==3))+0
This gives you control over which columns to include or ignore, depending on your needs.
Output:
> d
[1] 1 1 1 0 0 1 1 1 1 1
A general solution (with x <- 1:3 defined as above; %>% needs dplyr or magrittr loaded):
example %>%
sapply(function(i)i %in% x) %>% apply(1,any) %>% as.integer
#[1] 1 1 1 0 0 1 1 1 1 1
Try this method, which checks whether any column holds at least one element present in x.
x<-c(1,2,3)
example$d<-as.numeric(example$a %in% x | example$b %in% x | example$c %in% x)
example
a b c d
1 1 10 1 1
2 2 9 2 1
3 3 8 3 1
4 4 7 4 0
5 5 6 5 0
6 6 5 1 1
7 7 4 2 1
8 8 3 3 1
9 9 2 4 1
10 10 1 5 1

Find consecutive values in dataframe

I have a dataframe. I wish to detect consecutive numbers and populate a new column as 1 or 0.
ID Val
1 a 8
2 a 7
3 a 5
4 a 4
5 a 3
6 a 1
Expected output
ID Val outP
1 a 8 0
2 a 7 1
3 a 5 0
4 a 4 1
5 a 3 1
6 a 1 0
You could do this with the diff function in combination with abs and see whether the outcome is 1 or another value:
d$outP <- c(0, abs(diff(d$Val)) == 1)
which gives:
> d
ID Val outP
1 a 8 0
2 a 7 1
3 a 5 0
4 a 4 1
5 a 3 1
6 a 1 0
If you only want to take decreasing consecutive values into account, you can use:
c(0, diff(d$Val) == -1)
When you want to do this for each ID, you can also do this in base R or with dplyr:
# base R
d$outP <- ave(d$Val, d$ID, FUN = function(x) c(0, abs(diff(x)) == 1))
# dplyr
library(dplyr)
d %>%
group_by(ID) %>%
mutate(outP = c(0, abs(diff(Val)) == 1))
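To see why grouping matters, here is a small sketch with two IDs. The data frame d2 and the helper flag_consecutive are made up for this illustration; the helper just wraps the diff/abs expression used above:

```r
d2 <- data.frame(ID = c("a", "a", "a", "b", "b"),
                 Val = c(8, 7, 5, 4, 3))

# 1 when the value differs from the previous one by exactly 1, else 0
flag_consecutive <- function(v) c(0L, as.integer(abs(diff(v)) == 1))

# ignoring ID: the first row of group "b" is compared with the last row of "a"
d2$outP_all <- flag_consecutive(d2$Val)
# per ID: the first row of every group is always 0
d2$outP_grp <- ave(d2$Val, d2$ID, FUN = flag_consecutive)

d2$outP_all  # 0 1 0 1 1
d2$outP_grp  # 0 1 0 0 1
```

The only difference is row 4: without grouping, 5 followed by 4 is flagged even though the two values belong to different IDs.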
We can also use a faster option that compares the previous value with the current one:
with(df1, as.integer(c(FALSE, Val[-length(Val)] - Val[-1]) ==1))
#[1] 0 1 0 1 1 0
If we need to group by "ID", one option is data.table
library(data.table)
setDT(df1)[, outP := as.integer((shift(Val, fill =Val[1]) - Val)==1) , by = ID]

How to transform a key/value string into separate columns?

I've got a data.frame with key/value string column containing information about features and their values for a set of users. Something like this:
data<-data.frame(id=1:3,statid=c("s003e","s093u","s085t"),str=c("a:1,7:2","a:1,c:4","a:3,b:5,c:33"))
data
# id statid str
# 1 1 s003e a:1,7:2
# 2 2 s093u a:1,c:4
# 3 3 s085t a:3,b:5,c:33
What I'm trying to do is to create a data.frame containing column for every feature. Like this:
data_after<-data.frame(id=1:3,statid=c("s003e","s093u","s085t"),
a=c(1,1,3),b=c(0,0,5),c=c(0,4,33),"7"=c(2,0,0))
data_after
# id statid a b c X7
# 1 1 s003e 1 0 0 2
# 2 2 s093u 1 0 4 0
# 3 3 s085t 3 5 33 0
I was trying to use str_split from the stringr package and then transform the elements of the resulting list into data.frames (to bind them later using, for example, rbind.fill from plyr), but couldn't get it to work. Any help will be appreciated!
You can use dplyr and tidyr:
library(dplyr); library(tidyr)
data %>% mutate(str = strsplit(str, ",")) %>% unnest(str) %>%
separate(str, into = c('var', 'val'), sep = ":") %>% spread(var, val, fill = 0)
# id statid 7 a b c
# 1 1 s003e 2 1 0 0
# 2 2 s093u 0 1 0 4
# 3 3 s085t 0 3 5 33
We can use cSplit to do this in a cleaner way. Convert the data to 'long' format by splitting at ,, then do the split at : and dcast from 'long' to 'wide'
library(splitstackshape)
library(data.table)
dcast(cSplit(cSplit(data, "str", ",", "long"), "str", ":"),
id+statid~str_1, value.var="str_2", fill = 0)
# id statid 7 a b c
#1: 1 s003e 2 1 0 0
#2: 2 s093u 0 1 0 4
#3: 3 s085t 0 3 5 33
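For comparison, the same long-then-wide idea can be sketched without extra packages. This is only one possible base R route; the intermediate names pairs, long and wide are made up for the illustration, and stringsAsFactors = FALSE is set explicitly so that strsplit receives character input on older R versions too:

```r
data <- data.frame(id = 1:3,
                   statid = c("s003e", "s093u", "s085t"),
                   str = c("a:1,7:2", "a:1,c:4", "a:3,b:5,c:33"),
                   stringsAsFactors = FALSE)

# split each string into "key:value" pieces
pairs <- strsplit(data$str, ",", fixed = TRUE)

# long format: one row per (input row, key, value)
long <- do.call(rbind, lapply(seq_along(pairs), function(i) {
  kv <- do.call(rbind, strsplit(pairs[[i]], ":", fixed = TRUE))
  data.frame(row = i, key = kv[, 1], val = as.numeric(kv[, 2]),
             stringsAsFactors = FALSE)
}))

# wide format: xtabs sums val per cell and fills absent keys with 0
wide <- xtabs(val ~ row + key, long)
result <- cbind(data[c("id", "statid")], as.data.frame.matrix(wide))
result
```

Unlike the data_after example in the question, the column for key 7 keeps the name "7" here rather than becoming X7, so it has to be accessed as result[["7"]].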

How to sum by group an "Origin-Destination" data frame?

I have this kind of data frame:
df<-data.frame(Origin=c(1,1,1,2,2,3,3,3),
Var= c(2,4,1,3,5,6,2,1),
Desti= c(2,2,3,2,1,2,1,3))
I would like to get the sum of Var, for each value of Origin, grouped by Desti (Out.x) and by Origin (In.x). The result would be for df:
Out.1 Out.2 Out.3 In.1 In.2 In.3
1 0 6 1 0 5 2
2 5 3 0 6 3 6
3 2 6 1 1 0 1
Any ideas ?
Maybe this helps:
res <- cbind(xtabs(Var~., df), xtabs(Var~Desti+Origin, df))
colnames(res) <- paste(rep(c("Out", "In"), each=3), 1:3, sep=".")
res
# Out.1 Out.2 Out.3 In.1 In.2 In.3
#1 0 6 1 0 5 2
#2 5 3 0 6 3 6
#3 2 6 1 1 0 1
Or, the above can be simplified:
r1 <- xtabs(Var~., df)
res <- cbind(r1, t(r1)) #change the `column names` accordingly
Or using reshape2
library(reshape2)
res1 <- cbind(acast(df, Origin~Desti, value.var='Var', sum),
acast(df, Desti~Origin, value.var='Var', sum))
colnames(res1) <- colnames(res)
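A self-contained sketch of the xtabs route, with the column names built from the table's own dimnames rather than hard-coded as 1:3 (the names out and res_od are made up for this illustration):

```r
df <- data.frame(Origin = c(1, 1, 1, 2, 2, 3, 3, 3),
                 Var    = c(2, 4, 1, 3, 5, 6, 2, 1),
                 Desti  = c(2, 2, 3, 2, 1, 2, 1, 3))

# rows = Origin, columns = Desti, cells = sum of Var
out <- xtabs(Var ~ Origin + Desti, df)

# the "In" block is the transpose: rows = Desti, columns = Origin
res_od <- cbind(out, t(out))
colnames(res_od) <- c(paste0("Out.", colnames(out)),
                      paste0("In.", rownames(out)))
res_od
```

Deriving the names from the table keeps the labels correct if the set of origins or destinations changes, although binding the two blocks side by side still assumes both share the same set of values.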
