Read in Excel column with numbers and characters to R - r

I'm trying to read an Excel file into R using read_excel (it's an .xlsx file). Some columns contain values that mix letters and numbers, such as P765876, alongside cells with only numbers, such as 234654. When the file is read into R, these columns are typed as unknown (neither character nor numeric), and every cell that contains both letters and numbers becomes NA. How can I read this in correctly?
My code at the moment is:
tenant<-read_excel("C:/Users/MPritchard/Repairs Projects/May 2017/Tenant Info/R data 1.xlsx")

I would also recommend using the col_types argument. Specifying it as "text" should avoid NAs introduced by coercion, so your code would be:
tenant<-read_excel("C:/Users/MPritchard/Repairs Projects/May 2017/Tenant Info/R data 1.xlsx", col_types = "text")
Please let me know if this solved your problem.
Regards,
/Michael
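One side effect of col_types = "text" is that every column arrives as character, including the purely numeric ones. The following is a minimal sketch of converting those back afterwards using base R's type.convert. The data frame and its column names (ref, count) are hypothetical stand-ins for what read_excel would return, not the asker's real data.

```r
# Hypothetical stand-in for the result of read_excel(..., col_types = "text"):
# every column is character, even the purely numeric one.
tenant <- data.frame(
  ref   = c("P765876", "234654", "P111222"),  # mixed letters and digits
  count = c("1", "2", "34"),                  # digits only
  stringsAsFactors = FALSE
)

# Convert each column to the most specific type that fits all of its values.
# as.is = TRUE keeps non-numeric columns as character instead of factor.
tenant[] <- lapply(tenant, type.convert, as.is = TRUE)

str(tenant)
```

The mixed column stays character because at least one value cannot be parsed as a number, while the digits-only column becomes integer.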

Not really an answer but too much for a comment...
1:
> library(xlsx)
> tenant <- read.xlsx("returns.xlsx", sheetIndex = 1)
> str(tenant)
'data.frame': 9 obs. of 3 variables:
$ only_integer: num 1 2 34 5 546931 ...
$ int_char : Factor w/ 9 levels "2545","2a","2d",..: 6 4 9 3 5 1 7 2 8
$ only_char : Factor w/ 6 levels "af","dd","e",..: 2 1 5 6 3 2 4 3 1
2:
> library(readxl)
> tenant2 <- read_excel("returns.xlsx")
> str(tenant2)
Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 9 obs. of 3 variables:
$ only_integer: num 1 2 34 5 546931 ...
$ int_char : chr "d5" "5" "ff2ad2f" "2d" ...
$ only_char : chr "dd" "af" "h" "ha" ...
The column int_char is a mixture of both, starting/ending with numbers or characters

Related

Converting data frame into numeric sparseMatrix in R

I have a data.frame with 3 columns. The structure of the data.fame is as below
str(data)
'data.frame': 76971772 obs. of 3 variables:
$ V1: chr "XH104_AACGAGAGCTAAACTAGCCCTA" "XH104_AACGAGAGCTAAACTAGCCCTA" "XH104_AACGAGAGCTAAACTAGCCCTA" "XH104_AACGAGAGCTAAACTAGCCCTA" ...
$ V2: chr "10:100175000-100180000" "10:101065000-101070000" "10:101550000-101555000" "10:101585000-101590000" ...
$ V3: int 2 2 2 2 10 1 2 2 2 2 ...
I am trying to convert it into a sparseMatrix whose row names are the values of data$V1 and whose column names are the values of data$V2. I am using the command below to do that.
sparse.data <- with(data, sparseMatrix(i=as.numeric(V1), j=as.numeric(V2), x=V3, dimnames=list(levels(V1), levels(V2))))
I keep getting this error.
Error in sparseMatrix(i = as.numeric(V1), j = as.numeric(V2), x = V3, :
NA's in (i,j) are not allowed
I realized that when I use i=as.numeric(V1) in my command, all the values of V1 become NA.
Can someone suggest how I can solve this error?
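The NA values arise because V1 and V2 are character vectors here, not factors: as.numeric() on non-numeric strings returns NA, and levels() on a character vector returns NULL. A minimal sketch of one possible fix, on a small made-up data frame (the values below are illustrative, not the asker's data), is to convert to factors first and use their integer codes as indices. The Matrix call is guarded in case the package is not installed.

```r
# Small made-up example with the same column layout as the question.
dat <- data.frame(
  V1 = c("cellA", "cellA", "cellB"),
  V2 = c("10:100-200", "10:300-400", "10:100-200"),
  V3 = c(2L, 10L, 1L),
  stringsAsFactors = FALSE
)

# Reproduce the problem: non-numeric strings cannot be coerced to numbers.
na_idx <- suppressWarnings(as.numeric(dat$V1))

# Convert to factors; the integer codes serve as row/column indices.
f1 <- factor(dat$V1)
f2 <- factor(dat$V2)
i <- as.integer(f1)
j <- as.integer(f2)

if (requireNamespace("Matrix", quietly = TRUE)) {
  sparse.data <- Matrix::sparseMatrix(
    i = i, j = j, x = dat$V3,
    dimnames = list(levels(f1), levels(f2))
  )
  print(sparse.data)
}
```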

Numbers in quotes read as numerical variable in R

I have a dataset with many columns of numbers in quotes (e.g. "8"); the quotes indicate that the variable is a factor.
read.table automatically converts them to numeric variables, even when stringsAsFactors is set to TRUE.
Supposing I cannot convert them manually with as.factor, how can I import this dataset with those variables coded directly as factors?
That's because of the quote option. Set quote="". Example:
t <- '"1" "3"
"2" "4"'
> str(read.table(text=t))
'data.frame': 2 obs. of 2 variables:
$ V1: int 1 2
$ V2: int 3 4
> str(read.table(text=t, quote=""))
'data.frame': 2 obs. of 2 variables:
$ V1: Factor w/ 2 levels "\"1\"","\"2\"": 1 2
$ V2: Factor w/ 2 levels "\"3\"","\"4\"": 1 2
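Note that with quote="" the quote characters become part of the factor levels. As an alternative sketch (not what the answer above proposes), the colClasses argument of read.table can request factor columns directly while leaving the default quote handling in place, so the levels come out without embedded quotes. The variable name txt is just for illustration.

```r
# Same two-row input as above, with quoted numbers.
txt <- '"1" "3"
"2" "4"'

# colClasses = "factor" is recycled across all columns; quotes are
# stripped as usual because the quote argument keeps its default.
df <- read.table(text = txt, colClasses = "factor")
str(df)
```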

Remove duplicates in R without converting to numeric

I have 2 variables in a data frame with 300 observations.
$ imagelike: int 3 27 4 5370 ...
$ user: Factor w/ 24915 levels "\"0.1gr\"","\"008bla\"", ..
I then tried to remove the duplicates (for example, "- " appears twice):
testclean <- data1[!duplicated(data1), ]
This gives me the warning message:
In Ops.factor(left): "-"not meaningful for factors
I then converted it to a matrix:
data2 <- data.matrix(data1)
testclean2 <- data2[!duplicated(data2), ]
This does the trick; however, it converts the user names to numeric values.
=========================================================================
I am new but I have tried looking at previous posts on this topic (including the one below) but it did not work out:
Convert data.frame columns from factors to characters
Some sample data, from your image (please don't post images of data!):
data1 <- data.frame(imageLikeCount = c(3,27,4,4,16,103),
userName = c("\"testblabla\"", "test_00", "frenchfries", "frenchfries", "test.inc", "\"parmezan_pizza\""))
str(data1)
# 'data.frame': 6 obs. of 2 variables:
# $ imageLikeCount: num 3 27 4 4 16 103
# $ userName : Factor w/ 5 levels "\"parmezan_pizza\"",..: 2 5 3 3 4 1
To fix the problem with factors as well as the embedded quotes:
data1$userName <- gsub('"', '', as.character(data1$userName))
str(data1)
# 'data.frame': 6 obs. of 2 variables:
# $ imageLikeCount: num 3 27 4 4 16 103
# $ userName : chr "testblabla" "test_00" "frenchfries" "frenchfries" ...
As @DanielWinkler suggested, if you can change how the data is read in or defined, you might choose to include stringsAsFactors = FALSE (this argument is accepted by many functions, including read.csv, read.table, and most data.frame functions, including as.data.frame and rbind; since R 4.0.0 it is the default):
data1 <- data.frame(imageLikeCount = c(3,27,4,4,16,103),
userName = c("\"testblabla\"", "test_00", "frenchfries", "frenchfries", "test.inc", "\"parmezan_pizza\""),
stringsAsFactors = FALSE)
str(data1)
# 'data.frame': 6 obs. of 2 variables:
# $ imageLikeCount: num 3 27 4 4 16 103
# $ userName : chr "\"testblabla\"" "test_00" "frenchfries" "frenchfries" ...
(Note that this still has embedded quotes, so you'll still need something like data1$userName <- gsub('"', '', data1$userName).)
Now, we have data that looks like this:
data1
# imageLikeCount userName
# 1 3 testblabla
# 2 27 test_00
# 3 4 frenchfries
# 4 4 frenchfries
# 5 16 test.inc
# 6 103 parmezan_pizza
and removing duplicates works as expected:
data1[! duplicated(data1), ]
# imageLikeCount userName
# 1 3 testblabla
# 2 27 test_00
# 3 4 frenchfries
# 5 16 test.inc
# 6 103 parmezan_pizza
Try
data$userName <- as.character(data$userName)
And then
data<-unique(data)
You could also pass the argument stringsAsFactors = FALSE when reading the data. This is usually a good idea.

Dplyr - Error: column '' has unsupported type

I have an odd issue when using dplyr on a data.frame to compute the number of missing observations for each group of a character variable. It produces the error "Error: column '' has unsupported type".
To replicate it I have created a subset. The subset rdata file is available here:
rdata file including dftest data.frame
First. Using the subset I have provided, the code:
dftest %>%
group_by(file) %>%
summarise(missings=sum(is.na(v131)))
Will create the error:
Error: column 'file' has unsupported type
The str(dftest) returns:
'data.frame': 756345 obs. of 2 variables:
$ file: atomic bjir31fl.dta bjir31fl.dta bjir31fl.dta bjir31fl.dta ...
..- attr(*, "levels")= chr
$ v131: Factor w/ 330 levels "not of benin",..: 6 6 6 6 1 1 1 9 9 9 ...
However, taking a subset of the subset, and running the dplyr command again, will create the expected output.
dftest <- dftest[1:756345,]
dftest %>%
group_by(file) %>%
summarise(missings=sum(is.na(v131)))
The str(dftest) now returns:
'data.frame': 756345 obs. of 2 variables:
$ file: chr "bjir31fl.dta" "bjir31fl.dta" "bjir31fl.dta" "bjir31fl.dta" ...
$ v131: Factor w/ 330 levels "not of benin",..: 6 6 6 6 1 1 1 9 9 9 ...
Does anyone have any suggestions about what might cause this error and what to do about it? In my original file I have 300 variables, and dplyr states that most of these are of unsupported type.
Thanks.
This seems to be an issue with using filter when a column of the data frame has an attribute. For example,
> df = data.frame(x=1:10, y=1:10)
> filter(df, x==3) # Works
x y
1 3 3
Add an attribute to the x column. Notice that str(df) shows x as atomic now, and filter doesn't work:
> attr(df$x, 'width')='broad'
> str(df)
'data.frame': 10 obs. of 2 variables:
$ x: atomic 1 2 3 4 5 6 7 8 9 10
..- attr(*, "width")= chr "broad"
$ y: int 1 2 3 4 5 6 7 8 9 10
> filter(df, x==3)
Error: column 'x' has unsupported type
To make it work, remove the attribute:
> attr(df$x, 'width') = NULL
> filter(df, x==3)
x y
1 3 3
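With many affected columns, as in the question's 300-variable file, removing attributes one at a time is tedious. The following is a rough sketch in base R of stripping extra attributes from every non-factor column at once. The helper name strip_extra_attributes is hypothetical, and the approach assumes the non-factor columns are plain atomic vectors: as.vector() would also drop the class of Date or similar columns, so those would need to be excluded.

```r
df <- data.frame(x = 1:10, y = 1:10)
attr(df$x, "width") <- "broad"

# Hypothetical helper: drop all attributes from non-factor columns.
# Factors are skipped because their levels and class are attributes too.
strip_extra_attributes <- function(d) {
  plain <- !vapply(d, is.factor, logical(1))
  d[plain] <- lapply(d[plain], as.vector)
  d
}

df_clean <- strip_extra_attributes(df)
str(df_clean)
```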

Using "NA" as a legitimate nonmissing value

I'm working with a data set that includes first names entered in all capital letters. I need to work with the names as character variables, not as factors.
One person in the data set has the first name "NA". Can I get R to accept "NA" as a legitimate character value? My work-around solution was to rename that person NAA, but I am interested to see if there is a better way.
As a demonstration of my comment, consider the following sample CSV file:
x <- tempfile()
cat("v1,v2", "NA,1", "AB,3", sep = "\n", file = x)
cat(readLines(x), sep = "\n")
# v1,v2
# NA,1
# AB,3
Here's the str of a basic read.csv. Note the NA is seen as NA
str(read.csv(x))
# 'data.frame': 2 obs. of 2 variables:
# $ v1: Factor w/ 1 level "AB": NA 1
# $ v2: int 1 3
Now, specify a different character as your na.strings argument:
str(read.csv(x, na.strings = ""))
# 'data.frame': 2 obs. of 2 variables:
# $ v1: Factor w/ 2 levels "AB","NA": 2 1
# $ v2: int 1 3
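Since the question asks for character variables rather than factors, the same na.strings approach can be combined with stringsAsFactors = FALSE. A minimal sketch using the same sample file contents (the variable name df is only for illustration):

```r
# Recreate the sample CSV from above.
x <- tempfile()
cat("v1,v2", "NA,1", "AB,3", sep = "\n", file = x)

# na.strings = "" stops the literal text NA from being treated as missing;
# stringsAsFactors = FALSE keeps v1 as a character column.
df <- read.csv(x, na.strings = "", stringsAsFactors = FALSE)
str(df)
```

The trade-off is that only empty fields are now treated as missing, so any genuinely missing values written as NA in the file would be read as the string "NA" as well.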
