Converting multiple columns from character to numeric format in R
What is the most efficient way to convert multiple columns in a data frame from character to numeric format?
I have a dataframe called DF with all character variables.
I would like to do something like
for (i in names(DF){
DF$i <- as.numeric(DF$i)
}
Thank you
You could try
DF <- data.frame("a" = as.character(0:5),
"b" = paste(0:5, ".1", sep = ""),
"c" = letters[1:6],
stringsAsFactors = FALSE)
# Check columns classes
sapply(DF, class)
# a b c
# "character" "character" "character"
cols.num <- c("a","b")
DF[cols.num] <- sapply(DF[cols.num],as.numeric)
sapply(DF, class)
# a b c
# "numeric" "numeric" "character"
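A closely related variant, sketched here on the same sample data: lapply() always returns a list with one element per column, so the assignment does not rely on sapply() simplifying its result to a matrix. The column selection in cols.num is the same assumption as above, namely that you know by name which columns hold numbers.

```r
DF <- data.frame(a = as.character(0:5),
                 b = paste0(0:5, ".1"),
                 c = letters[1:6],
                 stringsAsFactors = FALSE)

cols.num <- c("a", "b")
# lapply() returns a list of converted columns; assigning it into
# DF[cols.num] replaces only those columns and leaves "c" untouched
DF[cols.num] <- lapply(DF[cols.num], as.numeric)

classes <- sapply(DF, class)
print(classes)
```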
If you're already using the tidyverse, there are a few solutions depending on the exact situation.
Basic case, if you know every character column holds numbers and there are no NAs:
library(dplyr)
# solution
dataset %>% mutate_if(is.character,as.numeric)
Test cases
df <- data.frame(
x1 = c('1','2','3'),
x2 = c('4','5','6'),
x3 = c('1','a','x'), # vector with alpha characters
x4 = c('1',NA,'6'), # numeric and NA
x5 = c('1',NA,'x'), # alpha and NA
stringsAsFactors = F)
# display starting structure
df %>% str()
Convert all character vectors to numeric (non-numeric values become NA with a warning)
df %>%
select(-x3) %>% # this removes the alpha column, in case all remaining character columns need to be converted to numeric
mutate_if(is.character,as.numeric) %>%
str()
Check whether each column can be converted. This can be an anonymous function. It returns FALSE if any value is neither numeric nor NA. It also checks that the column is a character vector, so factors are ignored. na.omit removes the original NAs first, so that only the "bad" NAs created by coercion are detected.
is_all_numeric <- function(x) {
!any(is.na(suppressWarnings(as.numeric(na.omit(x))))) & is.character(x)
}
df %>%
mutate_if(is_all_numeric,as.numeric) %>%
str()
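The same check can be sketched in base R, without dplyr. This is only an illustration on a reduced copy of the test data; convertible is a name introduced here, and && replaces & so that non-character columns are skipped before any coercion is attempted.

```r
df <- data.frame(
  x1 = c("1", "2", "3"),
  x3 = c("1", "a", "x"),   # contains alpha characters
  x4 = c("1", NA, "6"),    # numeric strings and NA
  x5 = c("1", NA, "x"),    # alpha and NA
  stringsAsFactors = FALSE)

is_all_numeric <- function(x) {
  is.character(x) && !any(is.na(suppressWarnings(as.numeric(na.omit(x)))))
}

# one TRUE/FALSE per column: TRUE where conversion creates no new NAs
convertible <- vapply(df, is_all_numeric, logical(1))
df[convertible] <- lapply(df[convertible], as.numeric)
str(df)
```

Columns x3 and x5 stay character because at least one of their values cannot be parsed as a number.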
If you want to convert specific named columns, then mutate_at is better.
df %>% mutate_at('x1', as.numeric) %>% str()
You can use index of columns:
dataset[,1:9] <- sapply(dataset[,1:9],as.numeric)
I used this code to convert all columns to numeric except the first one:
library(dplyr)
# check structure, row and column number with: glimpse(df)
# convert to numeric e.g. from 2nd column to 10th column
df <- df %>%
mutate_at(c(2:10), as.numeric)
Using the across() function from dplyr 1.0
df <- df %>% mutate(across(everything(), ~as.numeric(.)))
You could use convert from the hablar package:
library(dplyr)
library(hablar)
# Sample df (stolen from the solution by Luca Braglia)
df <- tibble("a" = as.character(0:5),
"b" = paste(0:5, ".1", sep = ""),
"c" = letters[1:6])
# insert variable names in num()
df %>% convert(num(a, b))
Which gives you:
# A tibble: 6 x 3
a b c
<dbl> <dbl> <chr>
1 0. 0.100 a
2 1. 1.10 b
3 2. 2.10 c
4 3. 3.10 d
5 4. 4.10 e
6 5. 5.10 f
Or if you are lazy, let retype() from hablar guess the right data type:
df %>% retype()
which gives you:
# A tibble: 6 x 3
a b c
<int> <dbl> <chr>
1 0 0.100 a
2 1 1.10 b
3 2 2.10 c
4 3 3.10 d
5 4 4.10 e
6 5 5.10 f
type.convert()
Convert a data object to logical, integer, numeric, complex, character or factor as appropriate.
Add the as.is argument type.convert(df,as.is = T) to prevent character vectors from becoming factors when there is a non-numeric in the data set.
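A minimal sketch of type.convert() applied to a whole data frame, using hypothetical sample data. With as.is = TRUE, the column that contains non-numeric text stays character instead of becoming a factor, while the other columns get the narrowest type that fits.

```r
df <- data.frame(a = c("1", "2", "3"),
                 b = c("1.5", "2.5", NA),
                 c = c("x", "y", "z"),
                 stringsAsFactors = FALSE)

# each column is converted independently: a -> integer, b -> double,
# c stays character because as.is = TRUE suppresses factor conversion
converted <- type.convert(df, as.is = TRUE)
str(converted)
```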
Slight adjustment to answers from ARobertson and Kenneth Wilson that worked for me.
Running R 3.6.0, with library(tidyverse) and library(dplyr) in my environment:
library(tidyverse)
library(dplyr)
> df %<>% mutate_if(is.character, as.numeric)
Error in df %<>% mutate_if(is.character, as.numeric) :
could not find function "%<>%"
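The %<>% operator comes from magrittr; library(tidyverse) and library(dplyr) only make %>% available, so %<>% is not found unless library(magrittr) is loaded as well. I did some quick research and found this note in Hadley's "The tidyverse style guide".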
The magrittr package provides the %<>% operator as a shortcut for modifying an object in place. Avoid this operator.
# Good
x <- x %>%
abs() %>%
sort()
# Bad
x %<>%
abs() %>%
sort()
Solution
Based on that style guide:
df_clean <- df %>% mutate_if(is.character, as.numeric)
Working example
> df_clean <- df %>% mutate_if(is.character, as.numeric)
Warning messages:
1: NAs introduced by coercion
2: NAs introduced by coercion
3: NAs introduced by coercion
4: NAs introduced by coercion
5: NAs introduced by coercion
6: NAs introduced by coercion
7: NAs introduced by coercion
8: NAs introduced by coercion
9: NAs introduced by coercion
10: NAs introduced by coercion
> df_clean
# A tibble: 3,599 x 17
stack datetime volume BQT90 DBT90 DRT90 DLT90 FBT90 RT90 HTML90 RFT90 RLPP90 RAT90 SRVR90 SSL90 TCP90 group
<dbl> <dttm> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
I think I figured it out. Here's what I did (perhaps not the most elegant solution; suggestions on how to improve this are very much welcome):
#names of columns in data frame
cols <- names(DF)
# character variables
cols.char <- c("fx_code","date")
#numeric variables
cols.num <- cols[!cols %in% cols.char]
DF.char <- DF[cols.char]
DF.num <- as.data.frame(lapply(DF[cols.num],as.numeric))
DF2 <- cbind(DF.char, DF.num)
I realize this is an old thread but wanted to post a solution similar to your request for a function (just ran into the similar issue myself trying to format an entire table to percentage labels).
Assume you have a df with 5 character columns you want to convert. First, I create a table containing the names of the columns I want to manipulate:
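col_to_convert <- data.frame(row = 1:5
,col = c("col1","col2","col3","col4","col5")
,stringsAsFactors = FALSE)
for (i in 1:max(col_to_convert$row))
{
colname <- col_to_convert$col[i]
colnum <- which(colnames(df) == colname)
# collect the converted cells in a numeric vector; writing them back into
# the character column one cell at a time would coerce them to character again
converted <- numeric(nrow(df))
for (j in 1:nrow(df))
{
converted[j] <- as.numeric(df[j,colnum])
}
df[[colnum]] <- converted
}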
This is not ideal for large tables as it goes cell by cell, but it would get the job done.
like this?
DF <- data.frame("a" = as.character(0:5),
"b" = paste(0:5, ".1", sep = ""),
"c" = paste(10:15),
stringsAsFactors = FALSE)
DF <- as.data.frame(apply(DF, 2, as.numeric)) # apply() alone returns a matrix
If there are "real" characters in the data frame like 'a', 'b', 'c', I would recommend the answer from davsjob.
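One detail worth checking with this approach: apply() on its own returns a numeric matrix rather than a data frame. The sketch below shows the difference on the same sample data and, as an alternative that is not part of the answer above, uses DF[] <- lapply(...) to convert every column while keeping the data frame structure.

```r
DF <- data.frame(a = as.character(0:5),
                 b = paste0(0:5, ".1"),
                 c = paste(10:15),
                 stringsAsFactors = FALSE)

m <- apply(DF, 2, as.numeric)    # numeric matrix, no longer a data frame
DF[] <- lapply(DF, as.numeric)   # DF[] keeps the data.frame and its names

class(m)
class(DF)
```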
Use data.table set function
library(data.table)
setDT(DF)
for (j in YourColumns)
set(DF, j = j, value = as.numeric(DF[[j]]))
If you need to keep it as a data.frame, just call setDF(DF) afterwards.
Try this to change numeric column to character:
df[,1:11] <- sapply(df[,1:11],as.character)
DF[,6:11] <- sapply(DF[,6:11], as.numeric)
or
DF[,6:11] <- sapply(DF[,6:11], as.character)
for (i in names(DF)) {
DF[[i]] <- as.numeric(DF[[i]])
}
I solved this using double brackets [[]]
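Why the double brackets matter can be sketched as follows, on hypothetical sample data: DF$i looks for a column literally named "i", whereas DF[[i]] evaluates i first and uses its value as the column name.

```r
DF <- data.frame(a = c("1", "2"), b = c("3.5", "4.5"),
                 stringsAsFactors = FALSE)

i <- "a"
dollar_lookup <- DF$i          # NULL: there is no column called "i"
bracket_lookup <- DF[[i]]      # the contents of column "a"

# with [[ ]], the loop variable selects a different column on each pass
for (i in names(DF)) {
  DF[[i]] <- as.numeric(DF[[i]])
}
classes <- sapply(DF, class)
print(classes)
```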
Since we can index a data frame column by its name, a simple change can be made:
for (i in names(DF)){ DF[i] <- as.data.frame(as.numeric(as.matrix(DF[i]))) }
A<- read.csv("Environment_Temperature_change_E_All_Data_NOFLAG.csv",header = F)
Now, convert to character
A<- type.convert(A,as.is=T)
Convert some columns to numeric from character
A[,c(1,3,5,c(8:66))]<- as.numeric(as.character(unlist(A[,c(1,3,5,c(8:66))])))
Related
Renaming all columns in a batch - dplyr
Hi, I want to replace all column names in the old dataset "olddata" with new names saved in a data frame "newnames".
In base R it's simple and works:
colnames(olddata) <- t(as.vector(newnames))
However, an attempt with dplyr:
olddata <- olddata %>% rename(vars(everything()), ~t(newnames))
returns an error:
Error: Must rename columns with a valid subscript vector.
x Subscript has the wrong type `quosures`.
ℹ It must be numeric or character.
What might be wrong here? Thank you!
Assuming that newnames is a one-column data.frame, you can convert it to a vector using:
newnames %>% pull(1)
Then you can rename your olddata with:
olddata <- olddata %>% rename_with(~ newnames %>% pull(1))
Here is some testing with some hypothetical data:
newnames <- data.frame(letters[1:3])
# letters.1.3.
# 1 a
# 2 b
# 3 c
olddata <- data.frame(col_1 = 1, col_2 = 2, col_3 = 3)
# col_1 col_2 col_3
# 1 1 2 3
olddata <- olddata %>% rename_with(~ newnames %>% pull(1))
# a b c
# 1 1 2 3
R Subsetting text from a comma separated column in a data-frame
I have a data.frame with a column that looks like this:
diagnosis
F.31.2,A.43.2,R.45.2,F.43.1
I want to split this column into two columns, one containing all the values with F and one for all the other values, resulting in a df that looks like this:
F other
F.31.2,F.43.1 A.43.2,R.45.2
Thanks in advance
Try the following tidyverse approach. You can separate the rows by , and then create a group according to the pattern in order to reshape to wide and obtain the expected result:
library(dplyr)
library(tidyr)
#Data
df <- data.frame(diagnosis='F.31.2,A.43.2,R.45.2,F.43.1',stringsAsFactors = F)
#Code
new <- df %>% separate_rows(diagnosis,sep = ',') %>%
  mutate(Group=ifelse(grepl('F',diagnosis),'F','Other')) %>%
  pivot_wider(values_fn = toString,names_from=Group,values_from=diagnosis)
Output:
# A tibble: 1 x 2
F Other
<chr> <chr>
1 F.31.2, F.43.1 A.43.2, R.45.2
First, use strsplit at the commas. Then, using grep, find the indexes of F, select/antiselect them by multiplying by 1 or -1, and paste them.
tmp <- el(strsplit(d$diagnosis, ","))
res <- lapply(c(1, -1), function(x) paste(tmp[grep("F", tmp)*x], collapse=","))
res <- setNames(as.data.frame(res), c("F", "other"))
res
# F other
# 1 F.31.2,F.43.1 A.43.2,R.45.2
Data:
d <- setNames(read.table(text="F.31.2,A.43.2,R.45.2,F.43.1"), "diagnosis")
Map readr::type_convert to specific columns only
readr::type_convert guesses the class of each column in a data frame. I would like to apply type_convert to only some columns in a data frame (to preserve other columns as character).
MWE:
# A data frame with multiple character columns containing numbers.
df <- data.frame(A = letters[1:10], B = as.character(1:10), C = as.character(1:10))
# This works
df %>% type_convert()
Parsed with column specification:
cols(
  A = col_character(),
  B = col_double(),
  C = col_double()
)
A B C
1 a 1 1
2 b 2 2
...
However, I would like to only apply the function to column B (this is a stylised example; there may be multiple columns to try and convert). I tried using purrr::map_at as well as sapply, as follows:
# This does not work
map_at(df, "B", type_convert)
Error in .f(.x[[i]], ...) : is.data.frame(df) is not TRUE
# This does not work
sapply(df["B"], type_convert)
Error in FUN(X[[i]], ...) : is.data.frame(df) is not TRUE
Is there a way to apply type_convert selectively to only some columns of a data frame?
Edit: @ekoam provides an answer for type_convert. However, applying this answer to many columns would be tedious. It might be better to use the base::type.convert function, which can be mapped:
purrr::map_at(df, "B", type.convert) %>% bind_cols()
# A tibble: 10 x 3
A B C
<chr> <int> <chr>
1 a 1 1
2 b 2 2
Try this:
df %>% type_convert(cols(B = "?", C = "?", .default = "c"))
Guess the type of B; any other character column stays as is. The tricky part is that if any column is not of a character type, then type_convert will also leave it as is. So if you really have to type_convert, maybe you have to first convert all columns to characters.
type_convert does not seem to support it. One trick which I have used a few times is using a combination of select & bind_cols as shown below.
df %>% select(B) %>% type_convert() %>% bind_cols(df %>% select(-B))
Purrr Implementation of For-Loop
I am trying to develop a logistic regression model in R. I am trying to loop over rows of a data frame (or tibble) so that I can multiply a subset of the columns in that row by another vector as a dot product. I initially tried to accomplish some preparatory work using purrr's vector functions, but was having difficulty and decided to implement it in a for-loop. This is the working design I have with a for-loop:
library(tidyverse)
# Define necessary functions
lambdaFunc <- function(factors,theta){
  return((1+exp(sum(factors*theta)))^(-1))
}
# y is 0 or 1
# x and theta are numeric vectors
indiv_likhd <- function(y,x,theta){
  return(lambdaFunc(x,theta)^y*(1-lambdaFunc(x,theta))^(1-y))
}
# Assuming df is dataframe of the form
# Col1 Col2 ... ColN
# isDefault(0 or 1) factor1 ... factorN
likhds <- function(df,theta){
  df <- as.data.frame(df)
  likhds <- vector("numeric",nrow(df))
  for (i in 1:nrow(df)) {
    likhds[i] <- indiv_likhd(df[i,1],df[i,2:ncol(df)],theta)
  }
  return(likhds)
}
So
testdf <- tibble(y=c(1,0),x_1=c(1,1),x_2=c(1,1),x_3=c(1,1))
testTheta <- c(1,1,1)
likhds(testdf,testTheta)
yields
[1] 0.04742587 0.95257413
Is there a way to implement this with vector functions, specifically the purrr package? This is my first real question on stackoverflow so I apologize if there is something missing or unclear, in which case, please let me know. Thank you.
Without changing your lambdaFunc and indiv_likhd, we could rewrite your for loop with pmap:
library(dplyr)
library(purrr)
testdf %>% mutate(new_col = pmap_dbl(., ~indiv_likhd(c(...)[1], c(...)[-1], testTheta)))
# y x_1 x_2 x_3 new_col
# <dbl> <dbl> <dbl> <dbl> <dbl>
#1 1 1 1 1 0.0474
#2 0 1 1 1 0.953
c(...) is used to capture all the values passed to pmap (here the entire row), so c(...)[1] means the first value in the row, and c(...)[-1] means everything other than the first value in the row.
Here is an option:
f <- function(df, theta) {
  df %>%
    group_by(y) %>%
    nest() %>%
    mutate(likhds = map2_dbl(y, data, function(y, x) indiv_likhd(y, x, theta))) %>%
    pull(likhds)
}
f(testdf, testTheta)
#[1] 0.04742587 0.95257413
Explanation: We nest data by y, then use map2_dbl to loop through the pairs of y and data (which are your x values) for every row, and return the output of indiv_likhd as a double vector.