Losing information after converting from factor to numeric in R

I have a data frame in which some of the numeric columns are stored as factors, and I want to convert them to numeric values. I tried the code below, but it still loses information.
> str(xdate1)
'data.frame': 6 obs. of 1 variable:
$ Amount.in.doc..curr.: Factor w/ 588332 levels "-0.5","-1","-1,000",..: 5132 57838 81064 98277 76292 71982
After converting to numeric I am losing the information. Below is the output:
> xdate1$Amount.in.doc..curr.<-as.numeric(as.character(xdate1$Amount.in.doc..curr.))
Warning message:
NAs introduced by coercion
> str(xdate1)
'data.frame': 6 obs. of 1 variable:
$ Amount.in.doc..curr.: num -150 NA NA NA NA NA

You have values with commas (','), which turn into NA when converted to numeric. Remove the commas before converting.
xdate1$Amount.in.doc..curr. <- as.numeric(gsub(',', '', xdate1$Amount.in.doc..curr.))
Or use parse_number from readr
xdate1$Amount.in.doc..curr. <- readr::parse_number(as.character(xdate1$Amount.in.doc..curr.))
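To see the effect in isolation, here is a minimal base R sketch. The factor `amount` and its values are made up for illustration; they are not taken from the asker's data.

```r
# Hypothetical toy factor with thousands separators, standing in for the real column
amount <- factor(c("-150", "-1,000", "2,345.50"))

# Naive conversion: strings containing commas cannot be parsed and become NA
naive <- suppressWarnings(as.numeric(as.character(amount)))

# Strip the commas first, then convert
cleaned <- as.numeric(gsub(",", "", as.character(amount), fixed = TRUE))

print(naive)
print(cleaned)
```

Note that this only handles commas used as thousands separators. If the data used a comma as the decimal mark, removing it would silently change the values.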

Related

Converting data frame into numeric sparseMatrix in R

I have a data.frame with 3 columns. The structure of the data.frame is as below:
str(data)
'data.frame': 76971772 obs. of 3 variables:
$ V1: chr "XH104_AACGAGAGCTAAACTAGCCCTA" "XH104_AACGAGAGCTAAACTAGCCCTA" "XH104_AACGAGAGCTAAACTAGCCCTA" "XH104_AACGAGAGCTAAACTAGCCCTA" ...
$ V2: chr "10:100175000-100180000" "10:101065000-101070000" "10:101550000-101555000" "10:101585000-101590000" ...
$ V3: int 2 2 2 2 10 1 2 2 2 2 ...
I am trying to convert it into sparseMatrix such that the row name of sparseMatrix is data$V1 and the column name is data$V2. I am using the command given below to do that.
sparse.data <- with(data, sparseMatrix(i=as.numeric(V1), j=as.numeric(V2), x=V3, dimnames=list(levels(V1), levels(V2))))
I keep getting this error:
Error in sparseMatrix(i = as.numeric(V1), j = as.numeric(V2), x = V3, :
NA's in (i,j) are not allowed
I realized that when I use i=as.numeric(V1) in my command, all the values of V1 become NA.
Can someone suggest how I can solve this error?
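One likely cause, judging from the str output above: V1 and V2 are character columns, so as.numeric() cannot parse them and levels() returns NULL. A minimal sketch of one possible fix is to convert both columns to factors first and use their integer codes as indices. The toy data below is hypothetical and only mimics the shape of the real data.

```r
library(Matrix)

# Hypothetical stand-in for the real 3-column data frame
data <- data.frame(
  V1 = c("cellA", "cellA", "cellB"),
  V2 = c("10:100-200", "10:300-400", "10:100-200"),
  V3 = c(2L, 10L, 1L),
  stringsAsFactors = FALSE
)

# Character vectors have no levels, so build factors explicitly
rows <- factor(data$V1)
cols <- factor(data$V2)

# The integer codes of the factors serve as row and column indices
sparse.data <- sparseMatrix(
  i = as.integer(rows),
  j = as.integer(cols),
  x = data$V3,
  dimnames = list(levels(rows), levels(cols))
)

print(sparse.data)
```

With tens of millions of rows, creating the two factors is probably the most expensive step, so it may be worth timing it on a subset first.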

Pivot_longer on all columns

I am using pivot_longer from tidyr to transform a data frame from wide to long. I wish to use all the columns and keep the row names in a column as well. The older melt function works perfectly for this:
w1 <- reshape2::melt(w)
str(w1)
'data.frame': 900 obs. of 3 variables:
$ Var1 : Factor w/ 30 levels "muscle system process",..: 1 2 3 4 5 6 7 8 9 10 ...
$ Var2 : Factor w/ 30 levels "muscle system process",..: 1 1 1 1 1 1 1 1 1 1 ...
$ value: num NA NA NA NA NA NA NA NA NA NA ...
But pivot_longer doesn't:
w %>% pivot_longer()
Error in UseMethod("pivot_longer") :
no applicable method for 'pivot_longer' applied to an object of class "c('matrix', 'array', 'double', 'numeric')"
Any suggestion is appreciated
Obviously some data would be helpful, but your problem is that you are calling pivot_longer() on an object of class matrix, not data.frame.
library(tidyr)
# your error
mycars <- as.matrix(mtcars)
pivot_longer(mycars)
Error in UseMethod("pivot_longer") :
no applicable method for 'pivot_longer' applied to an object of class
"c('matrix', 'array', 'double', 'numeric')"
pivot_longer() will work on a data frame
> class(mycars)
[1] "matrix" "array"
> class(mtcars)
[1] "data.frame"
Remember to specify the cols argument, which was not required in reshape2::melt() (more info in the documentation). You want all the columns, so use cols = everything():
pivot_longer(mtcars, cols = everything())
(Disclaimer: Of course, mtcars is not the best dataset to convert to long format)
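To also keep the row names, as the question asks, one option is to move them into a regular column before pivoting. This is a sketch under assumptions: the small matrix `w` is invented for illustration, and the column names Var1, Var2 and value are chosen only to mirror the melt() output above.

```r
library(tidyr)
library(tibble)

# Hypothetical small matrix standing in for the asker's `w`
w <- matrix(1:6, nrow = 2,
            dimnames = list(c("r1", "r2"), c("a", "b", "c")))

# Convert to a data frame and move the row names into a column
w_df <- rownames_to_column(as.data.frame(w), var = "Var1")

# Pivot every column except the row-name column
w_long <- pivot_longer(w_df, cols = -Var1,
                       names_to = "Var2", values_to = "value")

print(w_long)
```

The result has one row per matrix cell, much like melt(), although Var1 and Var2 come out as character columns rather than factors.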

Convert delimited string to numeric vector in dataframe

This is such a basic question, I'm embarrassed to ask.
Let's say I have a dataframe full of columns which contain data of the following form:
test <-"3000,9843,9291,2161,3458,2347,22925,55836,2890,2824,2848,2805,2808,2775,2760,2706,2727,2688,2727,2658,2654,2588"
I want to convert this to a numeric vector, which I have done like so:
test <- as.numeric(unlist(strsplit(test, split=",")))
I now want to convert a large dataframe containing a column full of this data into a numeric vector equivalent:
mutate(data,
converted = as.numeric(unlist(strsplit(badColumn, split=","))),
)
This doesn't work because presumably it's converting the entire column into a numeric vector and then replacing a single row with that value:
Error in mutate_impl(.data, dots) : Column converted must be
length 20 (the number of rows) or one, not 1274
How do I do this?
Here's some sample data that reproduces your error:
data <- data.frame(a = 1:3,
badColumn = c("10,20,30,40,50", "1,2,3,4,5,6", "9,8,7,6,5,4,3"),
stringsAsFactors = FALSE)
Here's the error:
library(tidyverse)
mutate(data, converted = as.numeric(unlist(strsplit(badColumn, split=","))))
# Error in mutate_impl(.data, dots) :
# Column `converted` must be length 3 (the number of rows) or one, not 18
A straightforward way would be to just use strsplit on the entire column, and lapply ... as.numeric to convert the resulting list values from character vectors to numeric vectors.
x <- mutate(data, converted = lapply(strsplit(badColumn, ",", fixed = TRUE), as.numeric))
str(x)
# 'data.frame': 3 obs. of 3 variables:
# $ a : int 1 2 3
# $ badColumn: chr "10,20,30,40,50" "1,2,3,4,5,6" "9,8,7,6,5,4,3"
# $ converted:List of 3
# ..$ : num 10 20 30 40 50
# ..$ : num 1 2 3 4 5 6
# ..$ : num 9 8 7 6 5 4 3
This might help:
library(purrr)
mutate(data, converted = map(badColumn, function(txt) as.numeric(unlist(strsplit(txt, split = ",")))))
What you get is a list column which contains the numeric vectors.
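Base R:
A <- as.numeric(strsplit(test, ',')[[1]])
A
[1] 3000 9843 9291 2161 3458 2347 22925 55836 2890 2824 2848 2805 2808 2775 2760 2706 2727 2688 2727 2658 2654 2588
For a data frame df with a character column NEw, split and convert each row separately:
df$NEw2 <- lapply(df$NEw, function(x) as.numeric(strsplit(x, ',')[[1]]))
Or with dplyr (strsplit is vectorised, so each row gets its own numeric vector):
df %>% mutate(NEw2 = lapply(strsplit(NEw, ','), as.numeric))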

Read in Excel column with numbers and characters to R

I'm trying to read an Excel file into R using read_excel (it's an .xlsx file). I have columns that contain letters and numbers, for example values like P765876. These columns also have cells with just numbers, e.g. 234654, so when the file is read into R the column is treated as Unknown (neither character nor numeric), and any cell that contains both a letter and a number gets a value of NA. How can I read this in correctly?
My code at the moment is
tenant<-read_excel("C:/Users/MPritchard/Repairs Projects/May 2017/Tenant Info/R data 1.xlsx")
I would also recommend using the col_types argument. Specifying it as "text" should avoid the NAs introduced by coercion. Your code would then look like this:
tenant<-read_excel("C:/Users/MPritchard/Repairs Projects/May 2017/Tenant Info/R data 1.xlsx", col_types = "text")
Please let me know if this solved your problem.
Regards,
/Michael
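Reading everything as text means the purely numeric columns also arrive as character. A possible follow-up step, sketched here in base R, is to let type.convert() re-guess each column after import. The data frame below is a hypothetical stand-in for what read_excel(..., col_types = "text") might return, and its column names are illustrative.

```r
# Hypothetical result of importing every column as text
tenant <- data.frame(
  only_integer = c("1", "2", "34"),
  int_char     = c("P765876", "234654", "2d"),
  stringsAsFactors = FALSE
)

# Re-guess each column's type; as.is = TRUE keeps strings as character
tenant[] <- lapply(tenant, type.convert, as.is = TRUE)

str(tenant)
```

Columns that contain only numbers become numeric again, while mixed columns such as int_char stay character, so no values are lost to NA.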
Not really an answer but too much for a comment...
1:
> library(xlsx)
> tenant <- read.xlsx("returns.xlsx", sheetIndex = 1)
> str(tenant)
'data.frame': 9 obs. of 3 variables:
$ only_integer: num 1 2 34 5 546931 ...
$ int_char : Factor w/ 9 levels "2545","2a","2d",..: 6 4 9 3 5 1 7 2 8
$ only_char : Factor w/ 6 levels "af","dd","e",..: 2 1 5 6 3 2 4 3 1
2:
> library(readxl)
> tenant2 <- read_excel("returns.xlsx")
> str(tenant2)
Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 9 obs. of 3 variables:
$ only_integer: num 1 2 34 5 546931 ...
$ int_char : chr "d5" "5" "ff2ad2f" "2d" ...
$ only_char : chr "dd" "af" "h" "ha" ...
The column int_char is a mixture of both, with values starting or ending with numbers or characters.

Using "NA" as a legitimate nonmissing value

I'm working with a data set that includes first names entered in all capital letters. I need to work with the names as character variables, not as factors.
One person in the data set has the first name "NA". Can I get R to accept "NA" as a legitimate character value? My work-around solution was to rename that person NAA, but I am interested to see if there is a better way.
As a demonstration of my comment, consider the following sample CSV file:
x <- tempfile()
cat("v1,v2", "NA,1", "AB,3", sep = "\n", file = x)
cat(readLines(x), sep = "\n")
# v1,v2
# NA,1
# AB,3
Here's the str of a basic read.csv. Note that the string "NA" is read as a missing value (NA):
str(read.csv(x))
# 'data.frame': 2 obs. of 2 variables:
# $ v1: Factor w/ 1 level "AB": NA 1
# $ v2: int 1 3
Now, specify a different character as your na.strings argument:
str(read.csv(x, na.strings = ""))
# 'data.frame': 2 obs. of 2 variables:
# $ v1: Factor w/ 2 levels "AB","NA": 2 1
# $ v2: int 1 3
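Since the question asks for character variables rather than factors, the same idea can be combined with stringsAsFactors = FALSE. This is a minimal sketch that recreates the sample file above; the name `d` is introduced here for illustration.

```r
# Recreate the small sample CSV used in the demonstration
x <- tempfile()
cat("v1,v2", "NA,1", "AB,3", sep = "\n", file = x)

# Treat only empty fields as missing, and keep strings as character
d <- read.csv(x, na.strings = "", stringsAsFactors = FALSE)

str(d)
anyNA(d$v1)
```

With na.strings = "", genuinely empty cells are still read as missing, so this only works if the file does not also use the literal text NA to mark missing values.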
