Convert certain columns from char to numeric in R

I have this data frame, which ended up with every column as character. I need to convert the Date column to a date format and the rest to numeric.
> df <- data.frame(Date = c("1996-01-01", "1996-01-05", "1996-01-29"),
+ SD = c("11", "12", "13"),
+ SF = c("624", "625", "626"),
+ LA = c("1", "2", "3"),
+ IR = c("107", "108", "109"))
> df
Date SD SF LA IR
1 1996-01-01 11 624 1 107
2 1996-01-05 12 625 2 108
3 1996-01-29 13 626 3 109
> str(df)
'data.frame': 3 obs. of 5 variables:
$ Date: chr "1996-01-01" "1996-01-05" "1996-01-29"
$ SD : chr "11" "12" "13"
$ SF : chr "624" "625" "626"
$ LA : chr "1" "2" "3"
$ IR : chr "107" "108" "109"
I tried this to convert only columns 2:5, but Date ended up as num with every value coerced to NA.
> df$Date <- as.Date(df$Date)
> df2 <- df
> columns <- c(1, 2:5)
> df2[ , columns] <- apply(df[ , columns], 2, function(x) as.numeric(x))
Warning message:
In FUN(newX[, i], ...) : NAs introduced by coercion
> df2
Date SD SF LA IR
1 NA 11 624 1 107
2 NA 12 625 2 108
3 NA 13 626 3 109
> str(df2)
'data.frame': 3 obs. of 5 variables:
$ Date: num NA NA NA
$ SD : num 11 12 13
$ SF : num 624 625 626
$ LA : num 1 2 3
$ IR : num 107 108 109
Any ideas where I went wrong, or how I can do this better?
Thanks in advance.

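For this I would suggest using type.convert() on the whole data.frame and then as.Date() on the Date column.
Use the as.is = TRUE argument to ensure strings (your dates) are not converted to factors. Note that type.convert() returns integer columns here; wrap them in as.numeric() if you specifically need doubles.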
df <- data.frame(
Date = c("1996-01-01", "1996-01-05", "1996-01-29"),
SD = c("11", "12", "13"),
SF = c("624", "625", "626"),
LA = c("1", "2", "3"),
IR = c("107", "108", "109")
)
str(df)
#> 'data.frame': 3 obs. of 5 variables:
#> $ Date: chr "1996-01-01" "1996-01-05" "1996-01-29"
#> $ SD : chr "11" "12" "13"
#> $ SF : chr "624" "625" "626"
#> $ LA : chr "1" "2" "3"
#> $ IR : chr "107" "108" "109"
df2 <- type.convert(df, as.is = TRUE)
str(df2)
#> 'data.frame': 3 obs. of 5 variables:
#> $ Date: chr "1996-01-01" "1996-01-05" "1996-01-29"
#> $ SD : int 11 12 13
#> $ SF : int 624 625 626
#> $ LA : int 1 2 3
#> $ IR : int 107 108 109
df2$Date <- as.Date(df2$Date)
str(df2)
#> 'data.frame': 3 obs. of 5 variables:
#> $ Date: Date, format: "1996-01-01" "1996-01-05" ...
#> $ SD : int 11 12 13
#> $ SF : int 624 625 626
#> $ LA : int 1 2 3
#> $ IR : int 107 108 109

Currently your logic is including all columns:
columns <- c(1, 2:5) # same as c(1:5)
But you want to exclude the first column of dates, so use this version:
columns <- c(2:5)
df2[ , columns] <- apply(df[ , columns], 2, function(x) as.numeric(x))
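As a side note on why Date turned into NA: apply() first coerces the data frame to a character matrix, so the dates become strings like "1996-01-01", which as.numeric() cannot parse. A minimal sketch of an alternative, assuming the same data as in the question, uses lapply() so that each column is converted on its own (num_cols is just a name picked for illustration):

```r
# Same data as in the question; stringsAsFactors = FALSE keeps strings as
# character on R versions before 4.0 as well.
df <- data.frame(
  Date = c("1996-01-01", "1996-01-05", "1996-01-29"),
  SD = c("11", "12", "13"),
  SF = c("624", "625", "626"),
  LA = c("1", "2", "3"),
  IR = c("107", "108", "109"),
  stringsAsFactors = FALSE
)

# Convert the date column first.
df$Date <- as.Date(df$Date)

# Every column except Date (num_cols is a hypothetical helper name).
num_cols <- setdiff(names(df), "Date")

# lapply() works column by column, so Date is never touched.
df[num_cols] <- lapply(df[num_cols], as.numeric)

str(df)
```

Selecting the columns by name rather than by position also keeps the code working if the column order changes.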

Related

Converting columns from character to integer, but only if the title of the column/variable name includes the word "Average"

Basically what the title says! I have columns like name, age, year, average_points, average_steals, average_rebounds etc. But all the average columns (there are a lot) are stored as characters. Thanks!
First I created some random data. You can mutate across the columns selected by starts_with("average") and convert them with as.integer. You can use the following code:
df <- data.frame(name = c("A", "B"),
age = c(10, 51),
year = c(2001, 1980),
average_points = c("3", "5"),
average_steals = c("4","6"),
average_bounds = c("6","7"))
str(df)
#> 'data.frame': 2 obs. of 6 variables:
#> $ name : chr "A" "B"
#> $ age : num 10 51
#> $ year : num 2001 1980
#> $ average_points: chr "3" "5"
#> $ average_steals: chr "4" "6"
#> $ average_bounds: chr "6" "7"
library(dplyr)
library(tidyr)
result <- df %>%
mutate(across(starts_with("average"), as.integer))
str(result)
#> 'data.frame': 2 obs. of 6 variables:
#> $ name : chr "A" "B"
#> $ age : num 10 51
#> $ year : num 2001 1980
#> $ average_points: int 3 5
#> $ average_steals: int 4 6
#> $ average_bounds: int 6 7
Created on 2022-07-20 by the reprex package (v2.0.1)
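If you would rather avoid the dplyr dependency, or need to match "average" anywhere in the column name rather than only at the start, here is a minimal base R sketch. The data frame is made up for this example (points_average is a hypothetical column added to show a match in the middle of a name), and avg_cols is just an illustrative name.

```r
# Made-up example data; points_average has "average" at the end of the name.
df <- data.frame(
  name = c("A", "B"),
  age = c(10, 51),
  average_points = c("3", "5"),
  points_average = c("4", "6"),
  stringsAsFactors = FALSE
)

# Logical vector: TRUE for every column whose name contains "average",
# regardless of case or position.
avg_cols <- grepl("average", names(df), ignore.case = TRUE)

# Convert only the matching columns, one column at a time.
df[avg_cols] <- lapply(df[avg_cols], as.integer)

str(df)
```

Use startsWith(names(df), "average") instead of grepl() if you only want a prefix match, which mirrors starts_with().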

Creating a list of dataframes based on filter criteria

I have a data set with a column ID. I filter the data frame into winter and summer, and I would like to split the data further based on the ID. In my actual data set there are over 100 IDs, so I don't want to make 100 data frames by hand. Instead I would like to make a list of data frames. I used the group_split function to do this, but the number of list elements comes out uneven between winter and summer. I know for certain that the same number of IDs should be in winter and summer. Is there a better way of doing this?
library(dplyr)
library(lubridate)
date <- rep_len(seq(dmy("26-12-2010"), dmy("20-12-2011"), by = "days"), 500)
ID <- rep(seq(1, 5), 100)
df <- data.frame(date = date,
x = runif(length(date), min = 60000, max = 80000),
y = runif(length(date), min = 800000, max = 900000),
ID)
df$month <- month(df$date)
summer <- df%>% arrange(ID, date) %>%
filter(month %in% 07:09) %>%
group_by(ID, .add = TRUE) %>%
group_split(ID)
winter <- df%>%
arrange(ID, date) %>%
filter(month %in% c(01,02,03)) %>%
group_by(ID, .add = TRUE) %>%
group_split(ID)
Thank you!
I think split will do what you want: produce a list of frames.
summer <- filter(df, month(date) %in% 7:9)
head(summer)
# date x y ID
# 1 2011-07-01 74958.44 842429.7 3
# 2 2011-07-02 64223.78 897607.8 4
# 3 2011-07-03 78843.54 829362.2 5
# 4 2011-07-04 60703.31 822962.0 1
# 5 2011-07-05 71328.44 872268.8 2
# 6 2011-07-06 68827.96 880618.3 3
str(split(summer, summer$ID))
# List of 5
# $ 1:'data.frame': 18 obs. of 4 variables:
# ..$ date: Date[1:18], format: "2011-07-04" "2011-07-09" ...
# ..$ x : num [1:18] 60703 64986 79477 67815 70387 ...
# ..$ y : num [1:18] 822962 858762 897413 817728 838251 ...
# ..$ ID : int [1:18] 1 1 1 1 1 1 1 1 1 1 ...
# $ 2:'data.frame': 18 obs. of 4 variables:
# ..$ date: Date[1:18], format: "2011-07-05" "2011-07-10" ...
# ..$ x : num [1:18] 71328 65414 64275 74286 76995 ...
# ..$ y : num [1:18] 872269 862579 818690 825991 847360 ...
# ..$ ID : int [1:18] 2 2 2 2 2 2 2 2 2 2 ...
# $ 3:'data.frame': 19 obs. of 4 variables:
# ..$ date: Date[1:19], format: "2011-07-01" "2011-07-06" ...
# ..$ x : num [1:19] 74958 68828 69431 76959 68538 ...
# ..$ y : num [1:19] 842430 880618 852488 874800 839197 ...
# ..$ ID : int [1:19] 3 3 3 3 3 3 3 3 3 3 ...
# $ 4:'data.frame': 19 obs. of 4 variables:
# ..$ date: Date[1:19], format: "2011-07-02" "2011-07-07" ...
# ..$ x : num [1:19] 64224 66977 75101 64189 73444 ...
# ..$ y : num [1:19] 897608 845062 809777 850364 822869 ...
# ..$ ID : int [1:19] 4 4 4 4 4 4 4 4 4 4 ...
# $ 5:'data.frame': 18 obs. of 4 variables:
# ..$ date: Date[1:18], format: "2011-07-03" "2011-07-08" ...
# ..$ x : num [1:18] 78844 77418 79762 78613 77485 ...
# ..$ y : num [1:18] 829362 867594 860007 819956 815058 ...
# ..$ ID : int [1:18] 5 5 5 5 5 5 5 5 5 5 ...
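To get both seasons at once, one option is to label each row with its season and then split twice. This is only a sketch in base R under a few assumptions: the data mimics the question's example, the month is extracted with format() instead of lubridate, and season and nested are names invented for the illustration.

```r
set.seed(42)

# Example data shaped like the question's: 360 consecutive days recycled to
# 500 rows, with IDs 1..5 repeating.
dates <- seq(as.Date("2010-12-26"), as.Date("2011-12-20"), by = "day")
df <- data.frame(
  date = rep(dates, length.out = 500),
  x = runif(500, min = 60000, max = 80000),
  ID = rep(1:5, 100)
)

# Month as an integer, then a season label; other months become NA.
m <- as.integer(format(df$date, "%m"))
df$season <- ifelse(m %in% 7:9, "summer",
                    ifelse(m %in% 1:3, "winter", NA_character_))

# split() drops rows whose grouping value is NA, so only the two seasons remain.
by_season <- split(df, df$season)

# Split each season again by ID: a list of lists of data frames.
nested <- lapply(by_season, function(d) split(d, d$ID))

# Number of per-ID data frames in each season.
lengths(nested)
```

Comparing lengths(nested) for the two seasons is a quick way to check whether some IDs are missing from one of them.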

Dealing with categorical data in R

This is the code that I was using for my data mining assignment in RStudio, where I was preprocessing the data.
setwd('C:/Users/user/OneDrive/assignments/Data mining/individual')
dataset = read.csv('Dataset.csv')
dataset[dataset == '?'] <- NA
View(dataset)
x <- na.omit(dataset)
library(tidyr)
library(dplyr)
library(outliers)
View(gather(x))
x$Age[x$Age <= 30] <- 3
x$Age[(x$Age <=49) & (x$Age >= 31)] <- 2
x$Age[(x$Age != 3) & (x$Age !=2)] <- 1
x$Hours_Per_week[x$Hours_Per_week <= 30] <- 3
x$Hours_Per_week[(x$Hours_Per_week <= 49)& (x$Hours_Per_week >= 31)] <- 2
x$Hours_Per_week[(x$Hours_Per_week != 3) & (x$Hours_Per_week != 2)] <- 1
x$Work_Class <- factor(x$Work_Class,
levels = c("Federal-gov","Local-gov","Private","Self-emp-inc","Self-emp-not-inc","State-gov"),
labels = c(1,2,3,4,5,6))
I have attached the result of str(x).
As you can see in the result, after the last line of code all the data in the column Hours_Per_week suddenly changes to NA. I don't really know why this occurs, since every other example that I saw online replaced the data with the labels.
The dataset is linked with the question.
Unfortunately I do not know the original data. Possibly you just have to swap the levels and labels content:
x$Work_Class <- factor(x$Work_Class, levels = c(1,2,3,4,5,6), labels = c("Federal-gov","Local-gov","Private","Self-emp-inc","Self-emp-not-inc","State-gov") )
The problem is the factor() statement. The Dataset.csv file does not have character strings surrounded by quotation marks so you get a leading space on every character field.
str(dataset)
# 'data.frame': 100 obs. of 7 variables:
# $ Age : int 39 50 38 53 28 37 49 52 31 42 ...
# $ Work_Class : chr " State-gov" " Self-emp-not-inc" " Private" NA ...
# $ Education : chr " Bachelors" " Bachelors" " HS-grad" " 11th" ...
# $ Marital_Status: chr " Never-married" " Married-civ-spouse" " Divorced" " Married-civ-spouse" ...
# $ Sex : chr " Male" " Male" " Male" " Male" ...
# $ Hours_Per_week: int 40 13 40 40 40 40 16 45 50 40 ...
# $ Income : chr " <=50K" " <=50K" " <=50K" " <=50K" ...
Notice the blank space before each label in Work_Class, Education, Marital_Status, Sex, and Income. You need to trim the white space when you read the file:
dataset = read.csv('Dataset.csv', strip.white=TRUE)
Then change the last line by removing the labels= argument:
x$Work_Class <- factor(x$Work_Class, levels = c("Federal-gov", "Local-gov", "Private", "Self-emp-inc", "Self-emp-not-inc", "State-gov"))
str(x)
# 'data.frame': 93 obs. of 7 variables:
# $ Age : num 2 1 2 3 2 2 1 2 2 3 ...
# $ Work_Class : Factor w/ 6 levels "Federal-gov",..: 6 5 3 3 3 3 5 3 3 6 ...
# $ Education : chr "Bachelors" "Bachelors" "HS-grad" "Bachelors" ...
# $ Marital_Status: chr "Never-married" "Married-civ-spouse" "Divorced" "Married-civ-spouse" ...
# $ Sex : chr "Male" "Male" "Male" "Female" ...
# $ Hours_Per_week: num 2 3 2 2 2 3 2 2 1 2 ...
# $ Income : chr "<=50K" "<=50K" "<=50K" "<=50K" ...
# - attr(*, "na.action")= 'omit' Named int [1:7] 4 9 28 62 70 78 93
# ..- attr(*, "names")= chr [1:7] "4" "9" "28" "62" ...
table(x$Work_Class)
#
# Federal-gov Local-gov Private Self-emp-inc Self-emp-not-inc State-gov
# 6 6 67 3 7 4
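To see the effect in isolation, here is a small sketch with made-up values that carry the same leading space. The vector work_class and the names lv, bad, and good are only for illustration; trimws() is the base R alternative if the data has already been read without strip.white = TRUE.

```r
# Values with a leading space, as read from the CSV without strip.white.
work_class <- c(" State-gov", " Private", " Local-gov")

lv <- c("Federal-gov", "Local-gov", "Private",
        "Self-emp-inc", "Self-emp-not-inc", "State-gov")

# No value matches a level exactly, so every element becomes NA.
bad <- factor(work_class, levels = lv)

# Trimming the whitespace first lets the values match the levels.
good <- factor(trimws(work_class), levels = lv)

print(bad)
print(good)
```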

Change variable types in data frame [duplicate]

I have a dataframe with all the columns being character like this.
ID <- c("A","A","A","A","A","A","A","A","B","B","B","B","B","B","B","B")
ToolID <- c("CCP_A","CCP_A","CCQ_A","CCQ_A","IOT_B","CCP_B","CCQ_B","IOT_B",
"CCP_A","CCP_A","CCQ_A","CCQ_A","IOT_B","CCP_B","CCQ_B","IOT_B")
Step <- c("Step_A","Step_A","Step_B","Step_C","Step_D","Step_D","Step_E","Step_F",
"Step_A","Step_A","Step_B","Step_C","Step_D","Step_D","Step_E","Step_F")
Measurement <- c("Length","Breadth","Width","Height",NA,NA,NA,NA,
"Length","Breadth","Width","Height",NA,NA,NA,NA)
Passfail <- c("Pass","Pass","Fail","Fail","Pass","Pass","Pass","Pass",
"Pass","Pass","Fail","Fail","Pass","Pass","Pass","Pass")
Points <- as.character(c(7,5,3,4,0,0,0,0,17,15,13,14,0,0,0,0))
Average <- as.character(c(7.5,6.5,7.1,6.6,NA,NA,NA,NA,17.5,16.5,17.1,16.6,NA,NA,NA,NA))
Sigma <- as.character(c(2.5,2.5,2.1,2.6,NA,NA,NA,NA,12.5,12.5,12.1,12.6,NA,NA,NA,NA))
Tool <- c("ABC_1","ABC_2","ABD_1","ABD_2","COB_1","COB_2","COB_1","COB_2",
"ABC_1","ABC_2","ABD_1","ABD_2","COB_1","COB_2","COB_1","COB_2")
Dose <- as.character(c(NA,NA,NA,NA,17.1,NA,NA,17.3,NA,NA,NA,NA,117.1,NA,NA,117.3))
Machine <- c("CO2","CO6","CO3","CO6","CO2,CO6","CO2,CO3,CO4","CO2,CO3","CO2",
"CO2","CO6","CO3","CO6","CO2,CO6","CO2,CO3,CO4","CO2,CO3","CO2")
df <- data.frame(ID,ToolID,Step,Measurement,Passfail,Points,Average,Sigma,Tool,Dose,Machine)
I am trying to check these character vectors for numeric values and then convert those with numeric values to numeric. I use the "varhandle" package in R to do it:
library(varhandle)
if(all(check.numeric(df$Machine, na.rm=TRUE))){
# convert the vector to numeric
df$Machine <- as.numeric(df$Machine)
}
This works but is inefficient because I have to manually enter the column names like above. How can I do it more efficiently in a loop or use vectorization over multiple columns? My actual dataset has around 350 columns. Can someone point me in the right direction?
We can use the parse_guess function from the readr package, which tries to guess the type of each column.
library(readr)
library(dplyr)
df1 <- df %>% mutate_all(parse_guess)
str(df1)
#'data.frame': 16 obs. of 11 variables:
# $ ID : chr "A" "A" "A" "A" ...
# $ ToolID : chr "CCP_A" "CCP_A" "CCQ_A" "CCQ_A" ...
# $ Step : chr "Step_A" "Step_A" "Step_B" "Step_C" ...
# $ Measurement: chr "Length" "Breadth" "Width" "Height" ...
# $ Passfail : chr "Pass" "Pass" "Fail" "Fail" ...
# $ Points : int 7 5 3 4 0 0 0 0 17 15 ...
# $ Average : num 7.5 6.5 7.1 6.6 NA NA NA NA 17.5 16.5 ...
# $ Sigma : num 2.5 2.5 2.1 2.6 NA NA NA NA 12.5 12.5 ...
# $ Tool : chr "ABC_1" "ABC_2" "ABD_1" "ABD_2" ...
# $ Dose : num NA NA NA NA 17.1 NA NA 17.3 NA NA ...
# $ Machine : chr "CO2" "CO6" "CO3" "CO6" ...
We can do this in base R:
df[] <- lapply(df, function(x) type.convert(as.character(x), as.is = TRUE))
str(df)
#'data.frame': 16 obs. of 11 variables:
# $ ID : chr "A" "A" "A" "A" ...
# $ ToolID : chr "CCP_A" "CCP_A" "CCQ_A" "CCQ_A" ...
# $ Step : chr "Step_A" "Step_A" "Step_B" "Step_C" ...
# $ Measurement: chr "Length" "Breadth" "Width" "Height" ...
# $ Passfail : chr "Pass" "Pass" "Fail" "Fail" ...
# $ Points : int 7 5 3 4 0 0 0 0 17 15 ...
# $ Average : num 7.5 6.5 7.1 6.6 NA NA NA NA 17.5 16.5 ...
# $ Sigma : num 2.5 2.5 2.1 2.6 NA NA NA NA 12.5 12.5 ...
# $ Tool : chr "ABC_1" "ABC_2" "ABD_1" "ABD_2" ...
# $ Dose : num NA NA NA NA 17.1 NA NA 17.3 NA NA ...
# $ Machine : chr "CO2" "CO6" "CO3" "CO6" ...
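If you want the check-then-convert logic from the question spelled out without varhandle, a rough base R sketch could look like this. convert_numeric_cols is a hypothetical helper written for this answer, not a function from any package, and the small data frame is made up. It treats a column as numeric when every non-NA value parses with as.numeric().

```r
# Hypothetical helper: convert only the columns whose non-missing values
# can all be parsed as numbers.
convert_numeric_cols <- function(d) {
  is_num <- vapply(d, function(x) {
    x <- as.character(x)
    parsed <- suppressWarnings(as.numeric(x))
    # A value is acceptable if it was NA to begin with or parsed cleanly.
    all(is.na(x) | !is.na(parsed))
  }, logical(1))
  d[is_num] <- lapply(d[is_num], function(x) as.numeric(as.character(x)))
  d
}

# Made-up example with the same flavour as the question's data.
small <- data.frame(
  ID = c("A", "B"),
  Points = c("7", "5"),
  Average = c("7.5", NA),
  Machine = c("CO2", "CO2,CO6"),
  stringsAsFactors = FALSE
)

out <- convert_numeric_cols(small)
sapply(out, class)
```

Unlike type.convert(), this always produces double columns, and a column that is entirely NA is also converted to numeric.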
With varhandle and tidyverse:
df %>% mutate_if(purrr::compose(all,check.numeric),as.numeric)
I think that the easiest solution is to use all.is.numeric from Hmisc. Here's a simple example:
Hmisc::all.is.numeric(c("A", "B", "1"), what = "vector", extras = NA)
## [1] "A" "B" "1"
Hmisc::all.is.numeric(c("3", "2", "1", NA), what = "vector", extras = NA)
## [1] 3 2 1 NA
Then you can use mutate_all from dplyr to do the whole job for a data.frame:
library(dplyr)
ID <- c("A","A","A","A","A","A","A","A","B","B","B","B","B","B","B","B")
ToolID <- c("CCP_A","CCP_A","CCQ_A","CCQ_A","IOT_B","CCP_B","CCQ_B","IOT_B",
"CCP_A","CCP_A","CCQ_A","CCQ_A","IOT_B","CCP_B","CCQ_B","IOT_B")
Step <- c("Step_A","Step_A","Step_B","Step_C","Step_D","Step_D","Step_E","Step_F",
"Step_A","Step_A","Step_B","Step_C","Step_D","Step_D","Step_E","Step_F")
Measurement <- c("Length","Breadth","Width","Height",NA,NA,NA,NA,
"Length","Breadth","Width","Height",NA,NA,NA,NA)
Passfail <- c("Pass","Pass","Fail","Fail","Pass","Pass","Pass","Pass",
"Pass","Pass","Fail","Fail","Pass","Pass","Pass","Pass")
Points <- as.character(c(7,5,3,4,0,0,0,0,17,15,13,14,0,0,0,0))
Average <- as.character(c(7.5,6.5,7.1,6.6,NA,NA,NA,NA,17.5,16.5,17.1,16.6,NA,NA,NA,NA))
Sigma <- as.character(c(2.5,2.5,2.1,2.6,NA,NA,NA,NA,12.5,12.5,12.1,12.6,NA,NA,NA,NA))
Tool <- c("ABC_1","ABC_2","ABD_1","ABD_2","COB_1","COB_2","COB_1","COB_2",
"ABC_1","ABC_2","ABD_1","ABD_2","COB_1","COB_2","COB_1","COB_2")
Dose <- as.character(c(NA,NA,NA,NA,17.1,NA,NA,17.3,NA,NA,NA,NA,117.1,NA,NA,117.3))
Machine <- c("CO2","CO6","CO3","CO6","CO2,CO6","CO2,CO3,CO4","CO2,CO3","CO2",
"CO2","CO6","CO3","CO6","CO2,CO6","CO2,CO3,CO4","CO2,CO3","CO2")
df <- data.frame(ID,ToolID,Step,Measurement,Passfail,Points,Average,Sigma,Tool,Dose,Machine)
dt2 <- df %>% mutate_all(function(x) Hmisc::all.is.numeric(x, what = "vector", extras = NA))
## check classes
sapply(dt2, class)
## ID ToolID Step Measurement Passfail Points
## "character" "character" "character" "character" "character" "numeric"
## Average Sigma Tool Dose Machine
## "numeric" "numeric" "character" "numeric" "character"
Another solution is retype from the hablar package:
library(hablar)
df %>% retype()
which gives:
# A tibble: 16 x 11
ID ToolID Step Measurement Passfail Points Average Sigma Tool Dose Machine
<chr> <chr> <chr> <chr> <chr> <int> <dbl> <dbl> <chr> <dbl> <chr>
1 A CCP_A Step_A Length Pass 7 7.50 2.50 ABC_1 NA CO2
2 A CCP_A Step_A Breadth Pass 5 6.50 2.50 ABC_2 NA CO6
3 A CCQ_A Step_B Width Fail 3 7.10 2.10 ABD_1 NA CO3
4 A CCQ_A Step_C Height Fail 4 6.60 2.60 ABD_2 NA CO6
5 A IOT_B Step_D NA Pass 0 NA NA COB_1 17.1 CO2,CO6
6 A CCP_B Step_D NA Pass 0 NA NA COB_2 NA CO2,CO3,CO4
7 A CCQ_B Step_E NA Pass 0 NA NA COB_1 NA CO2,CO3

How to change the type of columns whose names are stored in a vector in R

I have the following data frame in R:
qty_1 qty_2 qty_3 make_1 make_2 make_3 qty_4
1 22 33 21 5 55 6
2 33 92 83 76 65 23
I have a vector as follows:
qty_vec <- c("qty_1","qty_2","qty_3")
I want to change the data type of the columns that match qty_vec to character.
I am doing the following in R, but it does not work:
final_df[,names(final_df)[names(final_df) %in% qty_vec]] <- lapply(final_df[,names(final_df)[names(final_df) %in% qty_vec]], function(x)
type.convert(as.character(x)))
As per Sotos' comment above (see also the linked SO question):
df <- structure(list(qty_1 = 1:2, qty_2 = c(22L, 33L), qty_3 = c(33L,
92L), make_1 = c(21L, 83L), make_2 = c(5L, 76L), make_3 = c(55L,
65L), qty_4 = c(6L, 23L)), .Names = c("qty_1", "qty_2", "qty_3",
"make_1", "make_2", "make_3", "qty_4"), class = "data.frame", row.names = c(NA,
-2L))
str(df)
#> 'data.frame': 2 obs. of 7 variables:
#> $ qty_1 : int 1 2
#> $ qty_2 : int 22 33
#> $ qty_3 : int 33 92
#> $ make_1: int 21 83
#> $ make_2: int 5 76
#> $ make_3: int 55 65
#> $ qty_4 : int 6 23
qty_vec <- c("qty_1","qty_2","qty_3")
df[qty_vec] <- lapply(df[qty_vec], as.character)
str(df)
#> 'data.frame': 2 obs. of 7 variables:
#> $ qty_1 : chr "1" "2"
#> $ qty_2 : chr "22" "33"
#> $ qty_3 : chr "33" "92"
#> $ make_1: int 21 83
#> $ make_2: int 5 76
#> $ make_3: int 55 65
#> $ qty_4 : int 6 23
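As for why the attempt in the question appears to do nothing: type.convert() guesses the type from the strings, so wrapping as.character() in it turns the values straight back into integers. A tiny sketch with made-up values shows the round trip (x and round_trip are illustrative names):

```r
x <- c(22L, 33L)

# as.character() alone gives the character column that was asked for.
class(as.character(x))
#> [1] "character"

# type.convert() re-parses the strings and returns integers again.
round_trip <- type.convert(as.character(x), as.is = TRUE)
class(round_trip)
#> [1] "integer"
```

Dropping type.convert() and applying only as.character(), as in the answer above, is enough.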

Resources