Convert lines from factor to numeric? - r

Consider this example:
dat=structure(list(X = structure(c(1L, 2L,3L), .Label = c("A", "B", "C"), class = "factor"), X10 = structure(c(1L,2L,3L), .Label = c("3","0", "2"), class = "factor"), X11 = structure(c(1L, 2L,3L), .Label = c("0", "2", "0"), class = "factor")), class = "data.frame", row.names = c(NA, -3L))
dat=dat[,-1]
fi=as.numeric(as.character(dat[1,] ))
> fi
[1] 1 1
This is not correct. What is wrong?

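as.numeric works on vectors, but dat[1,] is a one-row data frame (a list of factors), so you get the factors' internal integer codes rather than their labels. Use apply if you want to convert the whole data frame (the result is a numeric matrix):
apply(dat, MARGIN=2,FUN=as.numeric)
Result: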
X10 X11
[1,] 3 0
[2,] 0 2
[3,] 2 0
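The difference between a factor's codes and its labels can be seen on a single vector. The following is a minimal sketch in base R; the names f, dat2 and first_row are made up for illustration, and dat2 is a rebuilt stand-in for dat after dropping column X, not the original object.

```r
# A factor stores integer codes that index into its sorted levels ("0", "2", "3").
f <- factor(c("3", "0", "2"))

codes  <- as.numeric(f)                 # 3 1 2: the level codes, not the labels
values <- as.numeric(as.character(f))   # 3 0 2: the labels parsed as numbers

# Hypothetical stand-in for dat (assumption: two factor columns, three rows).
dat2 <- data.frame(X10 = factor(c("3", "0", "2")),
                   X11 = factor(c("0", "2", "0")))

# Convert the first row element by element, so each factor is handled on its own.
first_row <- vapply(dat2[1, ],
                    function(x) as.numeric(as.character(x)),
                    numeric(1))
```

Here first_row is a named numeric vector holding the values of the first row, which is what the question was trying to obtain with fi.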

For multiple columns of different classes, check whether each column is a factor before converting it:
library(dplyr)
# funs() is deprecated since dplyr 0.8.0; a lambda such as ~ as.numeric(as.character(.)) is the current form
dat %>%
  mutate_if(is.factor, funs(as.numeric(as.character(.))))
If all the columns are factors, use mutate_all:
dat %>%
  mutate_all(funs(as.numeric(as.character(.))))
In base R, if all columns are factors, use lapply and assign the result back to the original object:
dat[] <- lapply(dat, function(x) as.numeric(as.character(x)))
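When only some columns are factors, the same base R idea can be restricted to those columns. This is a sketch under assumptions: the data frame mixed and its column names are invented for the example.

```r
# Hypothetical data frame with one character column and two factor columns.
mixed <- data.frame(
  id    = c("a", "b", "c"),
  score = factor(c("10", "2.5", "7")),
  count = factor(c("3", "0", "2")),
  stringsAsFactors = FALSE
)

# Logical vector marking which columns are factors.
is_fac <- vapply(mixed, is.factor, logical(1))

# Convert only the factor columns; other columns are left untouched.
mixed[is_fac] <- lapply(mixed[is_fac],
                        function(x) as.numeric(as.character(x)))
```

Labels that cannot be parsed as numbers would become NA with a warning, so it may be worth checking for NA values afterwards.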

Related

The first two columns defined as "rownames"

I want to define the first two columns of a data frame as row names. I want to do some calculations, and the data frame has to be numeric for that.
data.frame <- data_frame(id=c("A1","B2"),name=c("julia","daniel"),BMI=c("20","49"))
The values for BMI are numeric (checked with is.numeric), but the data frame as a whole is not. How can I define the first two columns (id and name) as row names?
Thank you in advance for any suggestions.
You can combine the id and name columns and then assign the result as row names:
data.frame %>%
  tidyr::unite(rowname, id, name) %>%
  tibble::column_to_rownames()
# BMI
#A1_julia 20
#B2_daniel 49
In base R, you can do the same in steps:
data.frame <- as.data.frame(data.frame)
rownames(data.frame) <- paste(data.frame$id, data.frame$name, sep = "_")
data.frame[c('id', 'name')] <- NULL
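The base R steps leave BMI as it was stored, so if it arrives as text it still has to be converted before any calculation. A minimal sketch, assuming BMI is stored as character strings as in the question's example; the object names people, all_numeric and bmi_matrix are hypothetical.

```r
# Hypothetical copy of the question's data, built with base R.
people <- data.frame(id   = c("A1", "B2"),
                     name = c("julia", "daniel"),
                     BMI  = c("20", "49"),
                     stringsAsFactors = FALSE)

# Combine id and name into row names, then drop the two columns.
rownames(people) <- paste(people$id, people$name, sep = "_")
people[c("id", "name")] <- NULL

# Convert the remaining column from text to numbers.
people$BMI <- as.numeric(people$BMI)

# Check that every column is numeric, and build a numeric matrix if needed.
all_numeric <- all(vapply(people, is.numeric, logical(1)))
bmi_matrix  <- data.matrix(people)
```

The row names carry over to bmi_matrix, so values can still be looked up by the combined id and name.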
I'm not sure whether the code below is what you are after:
dfout <- `rownames<-`(data.frame(BMI = as.numeric(df$BMI)),paste(df$id,df$name))
which gives
> dfout
BMI
A1 julia 20
B2 daniel 49
DATA
df <- structure(list(id = structure(1:2, .Label = c("A1", "B2"), class = "factor"),
name = structure(2:1, .Label = c("daniel", "julia"), class = "factor"),
BMI = structure(1:2, .Label = c("20", "49"), class = "factor")), class = "data.frame", row.names = c(NA,
-2L))

How do I split a column in R into two columns when I have no delimiter?

I have a dataset called data1 and need to split its first column into two columns. The problem is that there is no delimiter between the parts I need to split, and the character lengths differ in many rows.
I would like to split it into date and sex.
E.g.
12/1/09male
1/9/20female
13/1/19female
4/12/12male
I've been trying this, but because the values have different numbers of characters I'm stuck.
separate(data1, col = 1, into = c("date","sex"), sep = "")
Any help would be hugely appreciated!
One option is to split at a zero-width position using a positive look-behind and look-ahead: the point that is preceded by a digit and followed by an "m" or "f".
df %>% separate(1, c("date", "sex"), sep = "(?<=\\d)(?=[mf])")
# date sex
#1 12/1/09 male
#2 1/9/20 female
#3 13/1/19 female
#4 4/12/12 male
For what it's worth, the same regexp pattern works in base R's strsplit:
setNames(do.call(
  rbind.data.frame,
  strsplit(as.character(df[, 1]), "(?<=\\d)(?=[mf])", perl = TRUE)),
  c("date", "sex"))
Sample data
df <- read.table(text =
'12/1/09male
1/9/20female
13/1/19female
4/12/12male')
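The split point can also be located without look-arounds, by finding where the first letter starts. This is a sketch in base R, assuming every value consists of a date followed directly by letters; the names x, pos and parts are made up for the example.

```r
# Example values copied from the question.
x <- c("12/1/09male", "1/9/20female", "13/1/19female", "4/12/12male")

# Position of the first letter in each string.
pos <- as.integer(regexpr("[A-Za-z]", x))

# Everything before that position is the date, the rest is the sex.
parts <- data.frame(date = substr(x, 1, pos - 1),
                    sex  = substr(x, pos, nchar(x)),
                    stringsAsFactors = FALSE)
```

If a value contains no letters, regexpr returns -1 for it, so such rows would need separate handling.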
I am fairly new to R, so I am sure this is not the most elegant solution. I first add a comma between the date and the sex and then separate on the comma:
library(stringr)
library(tidyr)
a <- data.frame(row_1 = c("12/1/09male", "1/9/20female", "13/1/19female", "4/12/12male"))
a[, "row_1"] <- str_replace(a$row_1, "(male|female)", ",\\1")
separate(a, row_1, into = c("date", "sex"), sep = ",")
Using tidyr::extract, we can capture the data in two parts: the first group captures the date (in d/m/y format) and the second captures the remainder of the string.
tidyr::extract(df, V1, c("date", "sex"), "(\\d+/\\d+/\\d+)(.*)")
# date sex
#1 12/1/09 male
#2 1/9/20 female
#3 13/1/19 female
#4 4/12/12 male
data
df <- structure(list(V1 = structure(c(2L, 1L, 3L, 4L), .Label = c("1/9/20female",
"12/1/09male", "13/1/19female", "4/12/12male"), class = "factor")),
class = "data.frame", row.names = c(NA,-4L))
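The same two capture groups can be used in base R with regexec and regmatches. The sketch below assumes every value matches the pattern and that the dates are day/month/two-digit year; the names x, m, mat and extracted are hypothetical.

```r
# Example values copied from the question.
x <- c("12/1/09male", "1/9/20female", "13/1/19female", "4/12/12male")

# Each list element holds the full match followed by the two captured groups.
m <- regmatches(x, regexec("^([0-9]+/[0-9]+/[0-9]+)(.*)$", x))

# Stack the matches into a character matrix: full match, date, sex.
mat <- do.call(rbind, m)

extracted <- data.frame(date = mat[, 2],
                        sex  = mat[, 3],
                        stringsAsFactors = FALSE)

# Assumption: dates are day/month/year with a two-digit year.
extracted$date <- as.Date(extracted$date, format = "%d/%m/%y")
```

A value that does not match the pattern yields an empty element in m, which would misalign the rbind step, so it is worth checking lengths(m) first on real data.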
Base R solution using gsub and some regex:
df_clean <- within(df, {
  date <- as.Date(gsub("[A-Za-z]+", "", V1), format = "%d/%m/%y")
  sex <- as.factor(gsub("\\d+|\\/", "", V1))
  rm(V1)
})
Data:
df <- structure(list(V1 = structure(c(2L, 1L, 3L, 4L), .Label = c("1/9/20female",
"12/1/09male", "13/1/19female", "4/12/12male"), class = "factor")), class = "data.frame", row.names = c(NA,
-4L))

Get single column of values comparing multiple columns

I have just started my journey with R. I want to test values across multiple columns for the same condition and return 5 if any of the values is "hello" within a row:
result = ifelse((myData[1] == "hello") | (myData[2] == "hello") | (myData[3] == "hello"), 5, 0)
This works fine, but the code seems redundant. When I do:
resultSec = ifelse(myData[1:3] == "hello", 5, 0)
all 3 columns are checked against the condition, but the result is 3 columns rather than a single column. I would then have to combine the columns with an additional comparison, which takes more lines of code than the first, redundant method.
How can I get a single column of values in an efficient way?
You can use the function apply() to iterate over a data.frame or matrix, by either rows or columns. The MARGIN argument determines which one is used.
Here we want to check the rows, so we use MARGIN = 1:
dat <- data.frame(col1 = c("happy", "sad", "mad"),
col2 = c("tired", "sleepy", "happy"),
col3 = c("relaxed", "focused", "fine"))
dat$res <- apply(X = dat, MARGIN = 1,
FUN = function(x) ifelse("happy" %in% x, 5, 0))
dat
col1 col2 col3 res
1 happy tired relaxed 5
2 sad sleepy focused 0
3 mad happy fine 5
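We can use rowSums here. It counts the matches per row; comparing the count with 0 keeps the result at 5 even when several columns match:
df1$res <- (rowSums(df1 == "happy") > 0) * 5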
df1$res
#[1] 5 0 5
data
df1 <- structure(list(col1 = structure(c(1L, 3L, 2L), .Label = c("happy",
"mad", "sad"), class = "factor"), col2 = structure(c(3L, 2L,
1L), .Label = c("happy", "sleepy", "tired"), class = "factor"),
col3 = structure(c(3L, 2L, 1L), .Label = c("fine", "focused",
"relaxed"), class = "factor")), .Names = c("col1", "col2",
"col3"), row.names = c(NA, -3L), class = "data.frame")
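Applied to the question's own condition, the rowSums idea could look like the sketch below. The contents of myData are invented for the example (including a row where "hello" appears twice and a row with a missing value); only the name myData and the value "hello" come from the question.

```r
# Hypothetical data: row 3 matches twice, row 4 contains an NA.
myData <- data.frame(a = c("hello", "x", "hello", NA),
                     b = c("y", "z", "hello", "w"),
                     c = c("q", "r", "s", "t"),
                     stringsAsFactors = FALSE)

# Count matches per row across the first three columns, ignoring NA.
hits <- rowSums(myData[1:3] == "hello", na.rm = TRUE)

# 5 if at least one column matched, 0 otherwise.
result <- ifelse(hits > 0, 5, 0)
```

Testing hits > 0 rather than multiplying the count by 5 matters for row 3, where two columns match.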

Order column names in ascending order within dplyr chain

I have this data.frame:
df <- structure(list(att_number = structure(1:3, .Label = c("0", "1",
"2"), class = "factor"), `1` = structure(c(2L, 3L, 1L), .Label = c("1026891",
"412419", "424869"), class = "factor"), `10` = structure(c(2L,
1L, 3L), .Label = c("235067", "546686", "92324"), class = "factor"),
`2` = structure(c(3L, 1L, 2L), .Label = c("12729", "7569",
"9149"), class = "factor")), .Names = c("att_number", "1",
"10", "2"), row.names = c(NA, -3L), class = "data.frame")
It looks like this, with numbers as the column names:
att_number 1 10 2
0 412419 546686 9149
1 424869 235067 12729
2 1026891 92324 7569
Within a dplyr chain, I would like to order the columns in ascending order, like this:
att_number 1 2 10
0 412419 9149 546686
1 424869 12729 235067
2 1026891 7569 92324
I've tried using select_, but it doesn't work as planned. Any idea how I can do this? Here's my attempt:
names_order <- names(df)[-1] %>%
as.numeric %>%
.[order(.)] %>%
as.character %>%
c('att_number', .)
df %>%
select_(.dots = names_order)
Error: Position must be between 0 and n
Update:
For newer versions of dplyr (>= 0.7.0):
library(tidyverse)
sort_names <- function(data) {
  name <- names(data)
  chars <- keep(name, grepl, pattern = "[^0-9]") %>% sort()
  nums <- discard(name, grepl, pattern = "[^0-9]") %>%
    as.numeric() %>%
    sort() %>%
    sprintf("%s", .)
  select(data, !!!c(chars, nums))
}
sort_names(df)
Original:
You need backticks around the numeric column names to stop select from trying to interpret them as column positions:
library(tidyverse)
sort_names <- function(data) {
  name <- names(data)
  chars <- keep(name, grepl, pattern = "[^0-9]") %>% sort()
  nums <- discard(name, grepl, pattern = "[^0-9]") %>%
    as.numeric() %>%
    sort() %>%
    sprintf("`%s`", .)
  select_(data, .dots = c(chars, nums))
}
sort_names(df)
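The ordering logic itself does not depend on dplyr. The following base R sketch shows it outside a pipe, under the assumption that plain column indexing is acceptable; the data frame wide is a hypothetical numeric version of the question's data, and the other names are made up for the example.

```r
# Hypothetical numeric version of the question's data; check.names = FALSE
# keeps the column names "1", "10" and "2" as they are.
wide <- data.frame(att_number = 0:2,
                   `1`  = c(412419, 424869, 1026891),
                   `10` = c(546686, 235067, 92324),
                   `2`  = c(9149, 12729, 7569),
                   check.names = FALSE)

nm     <- names(wide)
is_num <- grepl("^[0-9]+$", nm)   # names made up of digits only

# Non-numeric names first (alphabetical), then numeric names by value.
ordered_names <- c(sort(nm[!is_num]),
                   nm[is_num][order(as.numeric(nm[is_num]))])

# Indexing with a character vector selects by name, never by position.
wide_sorted <- wide[ordered_names]
```

Because the names are passed as strings, "10" is never mistaken for the tenth column, which is the problem the backticks work around in select_.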

How to save the column names and their corresponding type in R into excel?

I have an R data set with more than 200 columns. I need to get the class of each column and export that to Excel, with the column name and its class as two columns.
1. Using lapply/sapply with stack/melt
You could do this using lapply/sapply to get the class of each column and then using stack from base R or melt from reshape2 to get the 2 column data.frame.
res <- stack(lapply(df, class))
#or
library(reshape2)
res1<- melt(lapply(df, class))
Then use write.csv, or one of the specialized packages for writing Excel files, e.g. XLConnect or WriteXLS.
write.csv(res, file="file1.csv", row.names=FALSE, quote=FALSE)
.csv files can be opened in Excel.
2. From the output of str
Or you could use capture.output and a regex to extract the required information from the output of str, then convert it to a data.frame using read.table:
v1 <- capture.output(str(df))
v2 <- grep("\\$", v1, value=TRUE)
res2 <- read.table(text=gsub(" +\\$ +(.*)\\: +([A-Za-z]+) +.*", "\\1 \\2", v2),
sep="",header=FALSE,stringsAsFactors=FALSE)
head(res2,2)
# V1 V2
#1 t02.clase Factor
#2 Std_A_CLI_monto_sucursal_1 chr
data
df <-structure(list(t02.clase = structure(c(1L, 1L, 1L), .Label = "AK",
class = "factor"),Std_A_CLI_monto_sucursal_1 = c("0", "0", "0"),
Std_A_CLI_monto_sucursal_2 = c(0, 0.01303586, 0), Std_A_CLI_monto_sucursal_3 =
c(0.051311597, 0.003442244, 0.017347593), Std_A_CLI_monto_sucursal_4 = c(0L,
0L, 0L), Std_A_CLI_promociones = c(0.4736842, 0.5, 0), Std_A_CLI_dias_cliente =
c(0.57061341, 0.55492154, 0.05991441), Std_A_CLI_sucursales = c(0.05555556,
0.05555556, 0.05555556)), .Names = c("t02.clase", "Std_A_CLI_monto_sucursal_1",
"Std_A_CLI_monto_sucursal_2", "Std_A_CLI_monto_sucursal_3",
"Std_A_CLI_monto_sucursal_4", "Std_A_CLI_promociones", "Std_A_CLI_dias_cliente",
"Std_A_CLI_sucursales"), row.names = c("1", "2", "3"), class = "data.frame")
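A column can have more than one class (for example, date-time columns), in which case stacking the raw class vectors gives more than one row for that column. The sketch below collapses the classes so that each column gets exactly one row; the data frame example, the output names and the temporary file are assumptions made for illustration.

```r
# Hypothetical small data frame with four different column types.
example <- data.frame(grp = factor(c("AK", "AK")),
                      txt = c("0", "0"),
                      num = c(0, 0.013),
                      int = c(0L, 0L),
                      stringsAsFactors = FALSE)

# One row per column; multiple classes are joined with "/".
col_classes <- data.frame(
  column = names(example),
  class  = unname(vapply(example,
                         function(x) paste(class(x), collapse = "/"),
                         character(1))),
  stringsAsFactors = FALSE
)

# Write to a temporary CSV file, which Excel can open.
out_file <- tempfile(fileext = ".csv")
write.csv(col_classes, out_file, row.names = FALSE)
```

In real use, out_file would be replaced by a path of your choice.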
