Convert certain columns from char to numeric in R
I have this data frame in which every column ended up as character. I need to convert the Date column to a date format and the rest to numeric.
> df <- data.frame(Date = c("1996-01-01", "1996-01-05", "1996-01-29"),
+ SD = c("11", "12", "13"),
+ SF = c("624", "625", "626"),
+ LA = c("1", "2", "3"),
+ IR = c("107", "108", "109"))
> df
Date SD SF LA IR
1 1996-01-01 11 624 1 107
2 1996-01-05 12 625 2 108
3 1996-01-29 13 626 3 109
> str(df)
'data.frame': 3 obs. of 5 variables:
$ Date: chr "1996-01-01" "1996-01-05" "1996-01-29"
$ SD : chr "11" "12" "13"
$ SF : chr "624" "625" "626"
$ LA : chr "1" "2" "3"
$ IR : chr "107" "108" "109"
I tried this to convert only columns 2:5, but Date ended up as num with every value coerced to NA.
> df$Date <- as.Date(df$Date)
> df2 <- df
> columns <- c(1, 2:5)
> df2[ , columns] <- apply(df[ , columns], 2, function(x) as.numeric(x))
Warning message:
In FUN(newX[, i], ...) : NAs introduced by coercion
> df2
Date SD SF LA IR
1 NA 11 624 1 107
2 NA 12 625 2 108
3 NA 13 626 3 109
> str(df2)
'data.frame': 3 obs. of 5 variables:
$ Date: num NA NA NA
$ SD : num 11 12 13
$ SF : num 624 625 626
$ LA : num 1 2 3
$ IR : num 107 108 109
Any ideas where I got it wrong or any ideas how I can do this better?
Thanks in advance.
For this I would suggest using type.convert() on the whole data.frame, and then use as.Date() on the Date column.
Use the as.is = TRUE argument to ensure strings (your dates) are not converted to factors.
df <- data.frame(
Date = c("1996-01-01", "1996-01-05", "1996-01-29"),
SD = c("11", "12", "13"),
SF = c("624", "625", "626"),
LA = c("1", "2", "3"),
IR = c("107", "108", "109")
)
str(df)
#> 'data.frame': 3 obs. of 5 variables:
#> $ Date: chr "1996-01-01" "1996-01-05" "1996-01-29"
#> $ SD : chr "11" "12" "13"
#> $ SF : chr "624" "625" "626"
#> $ LA : chr "1" "2" "3"
#> $ IR : chr "107" "108" "109"
df2 <- type.convert(df, as.is = TRUE)
str(df2)
#> 'data.frame': 3 obs. of 5 variables:
#> $ Date: chr "1996-01-01" "1996-01-05" "1996-01-29"
#> $ SD : int 11 12 13
#> $ SF : int 624 625 626
#> $ LA : int 1 2 3
#> $ IR : int 107 108 109
df2$Date <- as.Date(df2$Date)
str(df2)
#> 'data.frame': 3 obs. of 5 variables:
#> $ Date: Date, format: "1996-01-01" "1996-01-05" ...
#> $ SD : int 11 12 13
#> $ SF : int 624 625 626
#> $ LA : int 1 2 3
#> $ IR : int 107 108 109
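Note that type.convert() produces integer columns here, while the question asked for "numeric". A minimal sketch, assuming a cut-down copy of the example data, of checking that integer columns already count as numeric and of promoting them to double if a later step insists on it (the names int_ok, num_cols and col_types are only illustrative):

```r
# Cut-down copy of the example data (two value columns instead of four).
df <- data.frame(
  Date = c("1996-01-01", "1996-01-05", "1996-01-29"),
  SD = c("11", "12", "13"),
  SF = c("624", "625", "626"),
  stringsAsFactors = FALSE
)

df2 <- type.convert(df, as.is = TRUE)
df2$Date <- as.Date(df2$Date)

# Integer columns already satisfy is.numeric() ...
int_ok <- sapply(df2[-1], is.numeric)

# ... but can be promoted to double if that is really required.
num_cols <- setdiff(names(df2), "Date")
df2[num_cols] <- lapply(df2[num_cols], as.numeric)
col_types <- sapply(df2[num_cols], typeof)
```

For most arithmetic the integer result is fine as it stands, so the second step is optional.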
Currently your logic includes all columns, so as.numeric() is also applied to Date:
columns <- c(1, 2:5) # same as c(1:5)
But you want to exclude the first column of dates, so use this version:
columns <- c(2:5)
df2[ , columns] <- apply(df[ , columns], 2, function(x) as.numeric(x))
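To see why Date turned into NA in the question, it may help to look at what apply() does first: it coerces the data frame to a matrix, and with mixed column types that is a character matrix. The sketch below reproduces that step on a cut-down copy of the data and then shows a column-wise alternative with lapply(), which never builds the matrix (date_as_num is just an illustrative name):

```r
df <- data.frame(
  Date = c("1996-01-01", "1996-01-05", "1996-01-29"),
  SD = c("11", "12", "13"),
  IR = c("107", "108", "109"),
  stringsAsFactors = FALSE
)
df$Date <- as.Date(df$Date)

# apply() works on as.matrix(df): the dates become strings again,
# and as.numeric() cannot parse "1996-01-01", hence the NAs.
date_as_num <- suppressWarnings(as.numeric(as.matrix(df)[, "Date"]))

# Column-wise alternative: convert only the selected columns.
columns <- 2:3
df2 <- df
df2[columns] <- lapply(df[columns], as.numeric)
```

With lapply() the Date column is left untouched and keeps its Date class.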
Related
Converting columns from character to integer, but only if the title of the column/variable name includes the word "Average"
Basically what the title says! I have columns like name, age, year, average_points, average_steals, average_rebounds etc. But all the average columns (there are a lot) are stored as characters. Thanks!
First I created some random data. You can mutate across the columns whose names start with "average" (selected with starts_with) and convert them with as.integer. You can use the following code:
df <- data.frame(name = c("A", "B"),
                 age = c(10, 51),
                 year = c(2001, 1980),
                 average_points = c("3", "5"),
                 average_steals = c("4", "6"),
                 average_bounds = c("6", "7"))
str(df)
#> 'data.frame': 2 obs. of 6 variables:
#> $ name          : chr "A" "B"
#> $ age           : num 10 51
#> $ year          : num 2001 1980
#> $ average_points: chr "3" "5"
#> $ average_steals: chr "4" "6"
#> $ average_bounds: chr "6" "7"
library(dplyr)
library(tidyr)
result <- df %>%
  mutate(across(starts_with("average"), as.integer))
str(result)
#> 'data.frame': 2 obs. of 6 variables:
#> $ name          : chr "A" "B"
#> $ age           : num 10 51
#> $ year          : num 2001 1980
#> $ average_points: int 3 5
#> $ average_steals: int 4 6
#> $ average_bounds: int 6 7
Created on 2022-07-20 by the reprex package (v2.0.1)
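If dplyr is not available, roughly the same selection can be sketched in base R. This assumes, as in the example, that the relevant column names literally start with "average"; avg_cols is an illustrative name:

```r
df <- data.frame(
  name = c("A", "B"),
  age = c(10, 51),
  average_points = c("3", "5"),
  average_steals = c("4", "6"),
  stringsAsFactors = FALSE
)

# Pick the columns by name prefix, then convert only those.
avg_cols <- names(df)[startsWith(names(df), "average")]
df[avg_cols] <- lapply(df[avg_cols], as.integer)
```

If the word can appear anywhere in the name, grepl("average", names(df), ignore.case = TRUE) could replace the startsWith() call.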
Creating a list of dataframes based on filter criteria
I have a data set with a column ID. I filter the data frame into winter and summer, and I would like to split the data further based on the ID. In my actual data set there are over 100 IDs, so I don't want to make 100 data frames. Instead I would like to make a list of data frames. I used the group_split function to do this, but the number of list elements comes out uneven between winter and summer. I know for certain that the same number of IDs should be in winter and summer. Is there a better way of doing this?
library(lubridate)
library(dplyr)
date <- rep_len(seq(dmy("26-12-2010"), dmy("20-12-2011"), by = "days"), 500)
ID <- rep(seq(1, 5), 100)
df <- data.frame(date = date,
                 x = runif(length(date), min = 60000, max = 80000),
                 y = runif(length(date), min = 800000, max = 900000),
                 ID)
df$month <- month(df$date)
summer <- df %>%
  arrange(ID, date) %>%
  filter(month %in% 07:09) %>%
  group_by(ID, .add = TRUE) %>%
  group_split(ID)
winter <- df %>%
  arrange(ID, date) %>%
  filter(month %in% c(01,02,03)) %>%
  group_by(ID, .add = TRUE) %>%
  group_split(ID)
Thank you!
I think split will do what you want: produce a list of frames.
summer <- filter(df, month(date) %in% 7:9)
head(summer)
#         date        x        y ID
# 1 2011-07-01 74958.44 842429.7  3
# 2 2011-07-02 64223.78 897607.8  4
# 3 2011-07-03 78843.54 829362.2  5
# 4 2011-07-04 60703.31 822962.0  1
# 5 2011-07-05 71328.44 872268.8  2
# 6 2011-07-06 68827.96 880618.3  3
str(split(summer, summer$ID))
# List of 5
#  $ 1:'data.frame': 18 obs. of 4 variables:
#   ..$ date: Date[1:18], format: "2011-07-04" "2011-07-09" ...
#   ..$ x   : num [1:18] 60703 64986 79477 67815 70387 ...
#   ..$ y   : num [1:18] 822962 858762 897413 817728 838251 ...
#   ..$ ID  : int [1:18] 1 1 1 1 1 1 1 1 1 1 ...
#  $ 2:'data.frame': 18 obs. of 4 variables:
#   ..$ date: Date[1:18], format: "2011-07-05" "2011-07-10" ...
#   ..$ x   : num [1:18] 71328 65414 64275 74286 76995 ...
#   ..$ y   : num [1:18] 872269 862579 818690 825991 847360 ...
#   ..$ ID  : int [1:18] 2 2 2 2 2 2 2 2 2 2 ...
#  $ 3:'data.frame': 19 obs. of 4 variables:
#   ..$ date: Date[1:19], format: "2011-07-01" "2011-07-06" ...
#   ..$ x   : num [1:19] 74958 68828 69431 76959 68538 ...
#   ..$ y   : num [1:19] 842430 880618 852488 874800 839197 ...
#   ..$ ID  : int [1:19] 3 3 3 3 3 3 3 3 3 3 ...
#  $ 4:'data.frame': 19 obs. of 4 variables:
#   ..$ date: Date[1:19], format: "2011-07-02" "2011-07-07" ...
#   ..$ x   : num [1:19] 64224 66977 75101 64189 73444 ...
#   ..$ y   : num [1:19] 897608 845062 809777 850364 822869 ...
#   ..$ ID  : int [1:19] 4 4 4 4 4 4 4 4 4 4 ...
#  $ 5:'data.frame': 18 obs. of 4 variables:
#   ..$ date: Date[1:18], format: "2011-07-03" "2011-07-08" ...
#   ..$ x   : num [1:18] 78844 77418 79762 78613 77485 ...
#   ..$ y   : num [1:18] 829362 867594 860007 819956 815058 ...
#   ..$ ID  : int [1:18] 5 5 5 5 5 5 5 5 5 5 ...
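One practical advantage of split() is that the list elements are named after the ID, so the two seasons can be compared directly, which an unnamed group_split() result does not allow. A rough base R sketch on simulated data in the spirit of the question (the seed, the reduced set of columns and the names only_summer / only_winter are assumptions made for illustration):

```r
set.seed(1)  # only to make the illustrative data reproducible
date <- rep_len(seq(as.Date("2010-12-26"), as.Date("2011-12-20"), by = "day"), 500)
df <- data.frame(
  date = date,
  x = runif(length(date), min = 60000, max = 80000),
  ID = rep(1:5, 100)
)
mon <- as.integer(format(df$date, "%m"))

summer <- df[mon %in% 7:9, ]
winter <- df[mon %in% 1:3, ]

# One data frame per ID; the list names are the IDs as strings.
summer_by_id <- split(summer, summer$ID)
winter_by_id <- split(winter, winter$ID)

# IDs present in one season but not the other (empty if they match).
only_summer <- setdiff(names(summer_by_id), names(winter_by_id))
only_winter <- setdiff(names(winter_by_id), names(summer_by_id))
```

If either setdiff() result is non-empty on the real data, those are the IDs that explain the uneven list lengths.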
Dealing with categorical data in R
This is the code that I was using for my data mining assignment in RStudio. I was preprocessing the data.
setwd('C:/Users/user/OneDrive/assignments/Data mining/individual')
dataset = read.csv('Dataset.csv')
dataset[dataset == '?'] <- NA
View(dataset)
x <- na.omit(dataset)
library(tidyr)
library(dplyr)
library(outliers)
View(gather(x))
x$Age[x$Age <= 30] <- 3
x$Age[(x$Age <=49) & (x$Age >= 31)] <- 2
x$Age[(x$Age != 3) & (x$Age !=2)] <- 1
x$Hours_Per_week[x$Hours_Per_week <= 30] <- 3
x$Hours_Per_week[(x$Hours_Per_week <= 49)& (x$Hours_Per_week >= 31)] <- 2
x$Hours_Per_week[(x$Hours_Per_week != 3) & (x$Hours_Per_week != 2)] <- 1
x$Work_Class <- factor(x$Work_Class,
                       levels = c("Federal-gov","Local-gov","Private","Self-emp-inc","Self-emp-not-inc","State-gov"),
                       labels = c(1,2,3,4,5,6))
I checked the result with str(x). After the last statement, all the data in the column Hours_Per_week is suddenly changed to NA. I don't really know why this occurs, since every other example that I saw online changed the data to the labels.
Unfortunately I do not know the original data. Possibly you just have to change the levels and labels content:
x$Work_Class <- factor(x$Work_Class,
                       levels = c(1,2,3,4,5,6),
                       labels = c("Federal-gov","Local-gov","Private","Self-emp-inc","Self-emp-not-inc","State-gov"))
The problem is the factor() statement. The Dataset.csv file does not have character strings surrounded by quotation marks so you get a leading space on every character field.
str(dataset)
# 'data.frame': 100 obs. of 7 variables:
#  $ Age           : int 39 50 38 53 28 37 49 52 31 42 ...
#  $ Work_Class    : chr " State-gov" " Self-emp-not-inc" " Private" NA ...
#  $ Education     : chr " Bachelors" " Bachelors" " HS-grad" " 11th" ...
#  $ Marital_Status: chr " Never-married" " Married-civ-spouse" " Divorced" " Married-civ-spouse" ...
#  $ Sex           : chr " Male" " Male" " Male" " Male" ...
#  $ Hours_Per_week: int 40 13 40 40 40 40 16 45 50 40 ...
#  $ Income        : chr " <=50K" " <=50K" " <=50K" " <=50K" ...
Notice the blank space before each label in Work_Class, Education, Marital_Status, Sex, and Income. You need to trim the white space when you read the file:
dataset = read.csv('Dataset.csv', strip.white=TRUE)
Then change the last line by removing the labels= argument:
x$Work_Class <- factor(x$Work_Class,
                       levels = c("Federal-gov", "Local-gov", "Private", "Self-emp-inc", "Self-emp-not-inc", "State-gov"))
str(x)
# 'data.frame': 93 obs. of 7 variables:
#  $ Age           : num 2 1 2 3 2 2 1 2 2 3 ...
#  $ Work_Class    : Factor w/ 6 levels "Federal-gov",..: 6 5 3 3 3 3 5 3 3 6 ...
#  $ Education     : chr "Bachelors" "Bachelors" "HS-grad" "Bachelors" ...
#  $ Marital_Status: chr "Never-married" "Married-civ-spouse" "Divorced" "Married-civ-spouse" ...
#  $ Sex           : chr "Male" "Male" "Male" "Female" ...
#  $ Hours_Per_week: num 2 3 2 2 2 3 2 2 1 2 ...
#  $ Income        : chr "<=50K" "<=50K" "<=50K" "<=50K" ...
#  - attr(*, "na.action")= 'omit' Named int [1:7] 4 9 28 62 70 78 93
#   ..- attr(*, "names")= chr [1:7] "4" "9" "28" "62" ...
table(x$Work_Class)
#
#      Federal-gov        Local-gov          Private     Self-emp-inc Self-emp-not-inc        State-gov
#                6                6               67                3                7                4
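The effect of the leading space can be reproduced without the CSV file. In this small sketch the vector work_class and the two-level set lv are made up for illustration; the point is only that factor() maps any value not found in levels to NA, and that trimming fixes the match for data that is already loaded:

```r
# Values as they would look after reading the file without strip.white.
work_class <- c(" State-gov", " Private", " Private")
lv <- c("Private", "State-gov")

# No value matches a level exactly, so every entry becomes NA.
bad <- factor(work_class, levels = lv)

# trimws() removes the leading space, and the levels match again.
good <- factor(trimws(work_class), levels = lv)
```

trimws() is an alternative to re-reading the file with strip.white = TRUE when the data frame already exists in the session.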
Change variable types in data frame [duplicate]
I have a dataframe with all the columns being character like this.
ID <- c("A","A","A","A","A","A","A","A","B","B","B","B","B","B","B","B")
ToolID <- c("CCP_A","CCP_A","CCQ_A","CCQ_A","IOT_B","CCP_B","CCQ_B","IOT_B",
            "CCP_A","CCP_A","CCQ_A","CCQ_A","IOT_B","CCP_B","CCQ_B","IOT_B")
Step <- c("Step_A","Step_A","Step_B","Step_C","Step_D","Step_D","Step_E","Step_F",
          "Step_A","Step_A","Step_B","Step_C","Step_D","Step_D","Step_E","Step_F")
Measurement <- c("Length","Breadth","Width","Height",NA,NA,NA,NA,
                 "Length","Breadth","Width","Height",NA,NA,NA,NA)
Passfail <- c("Pass","Pass","Fail","Fail","Pass","Pass","Pass","Pass",
              "Pass","Pass","Fail","Fail","Pass","Pass","Pass","Pass")
Points <- as.character(c(7,5,3,4,0,0,0,0,17,15,13,14,0,0,0,0))
Average <- as.character(c(7.5,6.5,7.1,6.6,NA,NA,NA,NA,17.5,16.5,17.1,16.6,NA,NA,NA,NA))
Sigma <- as.character(c(2.5,2.5,2.1,2.6,NA,NA,NA,NA,12.5,12.5,12.1,12.6,NA,NA,NA,NA))
Tool <- c("ABC_1","ABC_2","ABD_1","ABD_2","COB_1","COB_2","COB_1","COB_2",
          "ABC_1","ABC_2","ABD_1","ABD_2","COB_1","COB_2","COB_1","COB_2")
Dose <- as.character(c(NA,NA,NA,NA,17.1,NA,NA,17.3,NA,NA,NA,NA,117.1,NA,NA,117.3))
Machine <- c("CO2","CO6","CO3","CO6","CO2,CO6","CO2,CO3,CO4","CO2,CO3","CO2",
             "CO2","CO6","CO3","CO6","CO2,CO6","CO2,CO3,CO4","CO2,CO3","CO2")
df <- data.frame(ID,ToolID,Step,Measurement,Passfail,Points,Average,Sigma,Tool,Dose,Machine)
I am trying to check these character vectors for numeric values and then convert those with numeric values to numeric. I use the "varhandle" package in R to do it
library(varhandle)
if(all(check.numeric(df$Machine, na.rm=TRUE))){
  # convert the vector to numeric
  df$Machine <- as.numeric(df$Machine)
}
This works but is inefficient because I have to manually enter the column names like above. How can I do it more efficiently in a loop or use vectorization over multiple columns? My actual dataset has around 350 columns. Can someone point me in the right direction?
We can use the parse_guess function from the readr package, which basically tries to guess the type of each column.
library(readr)
library(dplyr)
df1 <- df %>% mutate_all(parse_guess)
str(df1)
#'data.frame': 16 obs. of 11 variables:
# $ ID         : chr "A" "A" "A" "A" ...
# $ ToolID     : chr "CCP_A" "CCP_A" "CCQ_A" "CCQ_A" ...
# $ Step       : chr "Step_A" "Step_A" "Step_B" "Step_C" ...
# $ Measurement: chr "Length" "Breadth" "Width" "Height" ...
# $ Passfail   : chr "Pass" "Pass" "Fail" "Fail" ...
# $ Points     : int 7 5 3 4 0 0 0 0 17 15 ...
# $ Average    : num 7.5 6.5 7.1 6.6 NA NA NA NA 17.5 16.5 ...
# $ Sigma      : num 2.5 2.5 2.1 2.6 NA NA NA NA 12.5 12.5 ...
# $ Tool       : chr "ABC_1" "ABC_2" "ABD_1" "ABD_2" ...
# $ Dose       : num NA NA NA NA 17.1 NA NA 17.3 NA NA ...
# $ Machine    : chr "CO2" "CO6" "CO3" "CO6" ...
We can do this in base R:
df[] <- lapply(df, function(x) type.convert(as.character(x), as.is = TRUE))
str(df)
#'data.frame': 16 obs. of 11 variables:
# $ ID         : chr "A" "A" "A" "A" ...
# $ ToolID     : chr "CCP_A" "CCP_A" "CCQ_A" "CCQ_A" ...
# $ Step       : chr "Step_A" "Step_A" "Step_B" "Step_C" ...
# $ Measurement: chr "Length" "Breadth" "Width" "Height" ...
# $ Passfail   : chr "Pass" "Pass" "Fail" "Fail" ...
# $ Points     : int 7 5 3 4 0 0 0 0 17 15 ...
# $ Average    : num 7.5 6.5 7.1 6.6 NA NA NA NA 17.5 16.5 ...
# $ Sigma      : num 2.5 2.5 2.1 2.6 NA NA NA NA 12.5 12.5 ...
# $ Tool       : chr "ABC_1" "ABC_2" "ABD_1" "ABD_2" ...
# $ Dose       : num NA NA NA NA 17.1 NA NA 17.3 NA NA ...
# $ Machine    : chr "CO2" "CO6" "CO3" "CO6" ...
With varhandle and tidyverse:
df %>% mutate_if(purrr::compose(all, check.numeric), as.numeric)
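The same "convert only if every value parses" idea can be sketched without any package. The helper all_numeric below is hypothetical (it is not part of varhandle or base R) and treats a column as numeric when it is not entirely NA and every non-NA entry survives as.numeric(); the small df is a cut-down stand-in for the question's data:

```r
# Hypothetical helper: TRUE if all non-missing values parse as numbers.
all_numeric <- function(x) {
  parsed <- suppressWarnings(as.numeric(x))
  # Every value that was not NA to begin with must have parsed.
  !all(is.na(x)) && all(is.na(x) | !is.na(parsed))
}

df <- data.frame(
  ID = c("A", "B", "C"),
  Points = c("7", "5", "0"),
  Dose = c(NA, "17.1", "117.3"),
  Machine = c("CO2", "CO2,CO6", "CO3"),
  stringsAsFactors = FALSE
)

# Decide per column, then convert only the qualifying ones.
convert <- vapply(df, all_numeric, logical(1))
df[convert] <- lapply(df[convert], as.numeric)
```

Because the decision is made per column up front, this scales to a few hundred columns without naming any of them by hand.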
I think that the easiest solution is to use all.is.numeric from Hmisc. Here's a simple example:
Hmisc::all.is.numeric(c("A", "B", "1"), what = "vector", extras = NA)
## [1] "A" "B" "1"
Hmisc::all.is.numeric(c("3", "2", "1", NA), what = "vector", extras = NA)
## [1] 3 2 1 NA
Then you can use mutate_all from dplyr to do the whole job for the data.frame:
library(dplyr)
ID <- c("A","A","A","A","A","A","A","A","B","B","B","B","B","B","B","B")
ToolID <- c("CCP_A","CCP_A","CCQ_A","CCQ_A","IOT_B","CCP_B","CCQ_B","IOT_B",
            "CCP_A","CCP_A","CCQ_A","CCQ_A","IOT_B","CCP_B","CCQ_B","IOT_B")
Step <- c("Step_A","Step_A","Step_B","Step_C","Step_D","Step_D","Step_E","Step_F",
          "Step_A","Step_A","Step_B","Step_C","Step_D","Step_D","Step_E","Step_F")
Measurement <- c("Length","Breadth","Width","Height",NA,NA,NA,NA,
                 "Length","Breadth","Width","Height",NA,NA,NA,NA)
Passfail <- c("Pass","Pass","Fail","Fail","Pass","Pass","Pass","Pass",
              "Pass","Pass","Fail","Fail","Pass","Pass","Pass","Pass")
Points <- as.character(c(7,5,3,4,0,0,0,0,17,15,13,14,0,0,0,0))
Average <- as.character(c(7.5,6.5,7.1,6.6,NA,NA,NA,NA,17.5,16.5,17.1,16.6,NA,NA,NA,NA))
Sigma <- as.character(c(2.5,2.5,2.1,2.6,NA,NA,NA,NA,12.5,12.5,12.1,12.6,NA,NA,NA,NA))
Tool <- c("ABC_1","ABC_2","ABD_1","ABD_2","COB_1","COB_2","COB_1","COB_2",
          "ABC_1","ABC_2","ABD_1","ABD_2","COB_1","COB_2","COB_1","COB_2")
Dose <- as.character(c(NA,NA,NA,NA,17.1,NA,NA,17.3,NA,NA,NA,NA,117.1,NA,NA,117.3))
Machine <- c("CO2","CO6","CO3","CO6","CO2,CO6","CO2,CO3,CO4","CO2,CO3","CO2",
             "CO2","CO6","CO3","CO6","CO2,CO6","CO2,CO3,CO4","CO2,CO3","CO2")
df <- data.frame(ID,ToolID,Step,Measurement,Passfail,Points,Average,Sigma,Tool,Dose,Machine)
dt2 <- df %>% mutate_all(function(x) Hmisc::all.is.numeric(x, what = "vector", extras = NA))
## check classes
sapply(dt2, class)
##          ID      ToolID        Step Measurement    Passfail      Points
## "character" "character" "character" "character" "character"   "numeric"
##     Average       Sigma        Tool        Dose     Machine
##   "numeric"   "numeric" "character"   "numeric" "character"
Another solution is retype from the hablar package:
library(hablar)
df %>% retype()
which gives:
# A tibble: 16 x 11
  ID    ToolID Step   Measurement Passfail Points Average Sigma Tool   Dose Machine
  <chr> <chr>  <chr>  <chr>       <chr>     <int>   <dbl> <dbl> <chr> <dbl> <chr>
1 A     CCP_A  Step_A Length      Pass          7    7.50  2.50 ABC_1  NA   CO2
2 A     CCP_A  Step_A Breadth     Pass          5    6.50  2.50 ABC_2  NA   CO6
3 A     CCQ_A  Step_B Width       Fail          3    7.10  2.10 ABD_1  NA   CO3
4 A     CCQ_A  Step_C Height      Fail          4    6.60  2.60 ABD_2  NA   CO6
5 A     IOT_B  Step_D NA          Pass          0   NA    NA    COB_1  17.1 CO2,CO6
6 A     CCP_B  Step_D NA          Pass          0   NA    NA    COB_2  NA   CO2,CO3,CO4
7 A     CCQ_B  Step_E NA          Pass          0   NA    NA    COB_1  NA   CO2,CO3
How to change the type of columns stored in a vector in R
I have the following data frame in R:
qty_1 qty_2 qty_3 make_1 make_2 make_3 qty_4
    1    22    33     21      5     55     6
    2    33    92     83     76     65    23
I have the following vector:
qty_vec <- c("qty_1","qty_2","qty_3")
I want to change the data type of the columns that match qty_vec to character. I am doing the following in R, but it does not work:
final_df[,names(final_df)[names(final_df) %in% qty_vec]] <- lapply(final_df[,names(final_df)[names(final_df) %in% qty_vec]], function(x) type.convert(as.character(x)))
As per Sotos' comment above (see also this SO question):
df <- structure(list(qty_1 = 1:2, qty_2 = c(22L, 33L), qty_3 = c(33L, 92L),
                     make_1 = c(21L, 83L), make_2 = c(5L, 76L), make_3 = c(55L, 65L),
                     qty_4 = c(6L, 23L)),
                .Names = c("qty_1", "qty_2", "qty_3", "make_1", "make_2", "make_3", "qty_4"),
                class = "data.frame", row.names = c(NA, -2L))
str(df)
#> 'data.frame': 2 obs. of 7 variables:
#> $ qty_1 : int 1 2
#> $ qty_2 : int 22 33
#> $ qty_3 : int 33 92
#> $ make_1: int 21 83
#> $ make_2: int 5 76
#> $ make_3: int 55 65
#> $ qty_4 : int 6 23
qty_vec <- c("qty_1","qty_2","qty_3")
df[qty_vec] <- lapply(df[qty_vec], as.character)
str(df)
#> 'data.frame': 2 obs. of 7 variables:
#> $ qty_1 : chr "1" "2"
#> $ qty_2 : chr "22" "33"
#> $ qty_3 : chr "33" "92"
#> $ make_1: int 21 83
#> $ make_2: int 5 76
#> $ make_3: int 55 65
#> $ qty_4 : int 6 23
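The attempt in the question most likely failed because type.convert() re-guesses the type, which turns the freshly made character vector straight back into integer. The sketch below shows that round trip on a cut-down data frame, and also guards against names in qty_vec that do not exist in the data by using intersect() (the deliberately missing "qty_3" and the names round_trip and cols are assumptions for illustration):

```r
df <- data.frame(qty_1 = 1:2, qty_2 = c(22L, 33L), make_1 = c(21L, 83L))
qty_vec <- c("qty_1", "qty_2", "qty_3")  # "qty_3" is deliberately absent here

# type.convert() re-guesses the type, so the as.character() step is undone.
round_trip <- type.convert(as.character(df$qty_1), as.is = TRUE)

# Restrict to names that actually exist before converting.
cols <- intersect(qty_vec, names(df))
df[cols] <- lapply(df[cols], as.character)
```

Without the intersect() step, df[qty_vec] would stop with an "undefined columns selected" error as soon as one requested name is missing.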