Remove duplicates in R without converting to numeric - r

I have 2 variables in a data frame with 300 observations.
$ imagelike: int 3 27 4 5370 ...
$ user: Factor w/ 24915 levels "\"0.1gr\"","\"008bla\"", ..
I then tried to remove the duplicates, such as "- " appears 2 times:
testclean <- data1[!duplicated(data1), ]
This gives me the warning message:
In Ops.factor(left): "-"not meaningful for factors
I have then converted it to a maxtrix:
data2 <- data.matrix(data1)
testclean2 <- data2[!duplicated(data2), ]
This does the trick - however - it converts the userNames to a numeric.
=========================================================================
I am new but I have tried looking at previous posts on this topic (including the one below) but it did not work out:
Convert data.frame columns from factors to characters

Some sample data, from your image (please don't post images of data!):
data1 <- data.frame(imageLikeCount = c(3,27,4,4,16,103),
userName = c("\"testblabla\"", "test_00", "frenchfries", "frenchfries", "test.inc", "\"parmezan_pizza\""))
str(data1)
# 'data.frame': 6 obs. of 2 variables:
# $ imageLikeCount: num 3 27 4 4 16 103
# $ userName : Factor w/ 5 levels "\"parmezan_pizza\"",..: 2 5 3 3 4 1
To fix the problem with factors as well as the embedded quotes:
data1$userName <- gsub('"', '', as.character(data1$userName))
str(data1)
# 'data.frame': 6 obs. of 2 variables:
# $ imageLikeCount: num 3 27 4 4 16 103
# $ userName : chr "testblabla" "test_00" "frenchfries" "frenchfries" ...
Like #DanielWinkler suggested, if you can change how the data is read-in or defined, you might choose to include stringsAsFactors = FALSE (this argument is accepted in many functions, including read.csv, read.table, and most data.frame functions including as.data.frame and rbind):
data1 <- data.frame(imageLikeCount = c(3,27,4,4,16,103),
userName = c("\"testblabla\"", "test_00", "frenchfries", "frenchfries", "test.inc", "\"parmezan_pizza\""),
stringsAsFactors = FALSE)
str(data1)
# 'data.frame': 6 obs. of 2 variables:
# $ imageLikeCount: num 3 27 4 4 16 103
# $ userName : chr "\"testblabla\"" "test_00" "frenchfries" "frenchfries" ...
(Note that this still has embedded quotes, so you'll still need something like data1$userName <- gsub('"', '', data1$userName).)
Now, we have data that looks like this:
data1
# imageLikeCount userName
# 1 3 testblabla
# 2 27 test_00
# 3 4 frenchfries
# 4 4 frenchfries
# 5 16 test.inc
# 6 103 parmezan_pizza
and your need to remove duplicates works:
data1[! duplicated(data1), ]
# imageLikeCount userName
# 1 3 testblabla
# 2 27 test_00
# 3 4 frenchfries
# 5 16 test.inc
# 6 103 parmezan_pizza

Try
data$userName <- as.character(data$userName)
And then
data<-unique(data)
You could also pass the argument stringAsFactor = FALSE when reading the data. This is usually a good idea.

Related

Read in Excel column with numbers and characters to R

I'm trying to read in an excel file to R using read_excel(it's a xlsx file), I have columns that contain letters and numbers, for example things like P765876. These columns also have cells with just numbers i.e 234654, so when it reads in to R it reads as an Unknown (not character or numeric) but this means that it gives any cell which has a letter and number a value of NA, how can I read this in correctly?
My code at the moment is
tenant<-read_excel("C:/Users/MPritchard/Repairs Projects/May 2017/Tenant Info/R data 1.xlsx")
Would also recommend to use the col_types argument, by specifying it as "text" you should avoid getting NAs introduced by coercion. So your code would be like:
tenant<-read_excel("C:/Users/MPritchard/Repairs Projects/May 2017/Tenant Info/R data 1.xlsx", col_types = "text")
Please let me know if this solved your problem.
Regards,
/Michael
Not really an answer but too much for a comment...
1:
> library(xlsx)
> tenant <- read.xlsx("returns.xlsx", sheetIndex = 1)
> str(tenant)
'data.frame': 9 obs. of 3 variables:
$ only_integer: num 1 2 34 5 546931 ...
$ int_char : Factor w/ 9 levels "2545","2a","2d",..: 6 4 9 3 5 1 7 2 8
$ only_char : Factor w/ 6 levels "af","dd","e",..: 2 1 5 6 3 2 4 3 1
2:
> library(readxl)
> tenant2 <- read_excel("returns.xlsx")
> str(tenant2)
Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 9 obs. of 3 variables:
$ only_integer: num 1 2 34 5 546931 ...
$ int_char : chr "d5" "5" "ff2ad2f" "2d" ...
$ only_char : chr "dd" "af" "h" "ha" ...
The column int_char is a mixture of both, starting/ending with numbers or characters

How can I store a value in a name?

I use the neotoma package where I get data from a geographical site, which is marked by an ID. What I want to do is to "store" the number in a term, like Sitenum, so I can just need to write down the ID once and then use it.
What I did:
Site<-get_download(20131, verbose = TRUE)
taxa<-as.vector(Site$'20131'$taxon.list$taxon.name)
What I want to do:
Sitenum <-20131
Site<-get_download(Sitenum, verbose = TRUE) # this obv. works
taxa<-as.vector(Site$Sitenum$taxon.list$taxon.name) # this doesn't work
The structure of the dataset:
str(Site)
List of 1
$ 20131:List of 6
..$ taxon.list :'data.frame': 84 obs. of 6 variables:
.. ..$ taxon.name : Factor w/ 84 levels "Alnus","Amaranthaceae",..: 1 2 3 4 5 6 7 8 9 10 ...
I constructed an object that mimics yours as follows:
Site <- list("2043"=list(other=data.frame(that=1:10)))
Note that the structure is essentially identical.
str(Site)
List of 1
$ 2043:List of 1
..$ other:'data.frame': 10 obs. of 1 variable:
.. ..$ that: int [1:10] 1 2 3 4 5 6 7 8 9 10
Now, I save the value of the first term:
temp <- 2043
Then use the code in my comment to access the inner vector:
Site[[as.character(temp)]]$other$that
[1] 1 2 3 4 5 6 7 8 9 10
I could also use recursive referencing like this
Site[[c(temp,"other", "that")]]
[1] 1 2 3 4 5 6 7 8 9 10
because c will coerce temp to be a character vector in the presence of "other" and "that" character vectors.

Dplyr - Error: column '' has unsupported type

I have a odd issue when using dplyr on a data.frame to compute the number of missing observations for each group of a character variable. This creates the error "Error: column "" has unsupported type.
To replicate it I have created a subset. The subset rdata file is available here:
rdata file including dftest data.frame
First. Using the subset I have provided, the code:
dftest %>%
group_by(file) %>%
summarise(missings=sum(is.na(v131)))
Will create the error:
Error: column 'file' has unsupported type
The str(dftest) returns:
'data.frame': 756345 obs. of 2 variables:
$ file: atomic bjir31fl.dta bjir31fl.dta bjir31fl.dta bjir31fl.dta ...
..- attr(*, "levels")= chr
$ v131: Factor w/ 330 levels "not of benin",..: 6 6 6 6 1 1 1 9 9 9 ...
However, taking a subset of the subset, and running the dplyr command again, will create the expected output.
dftest <- dftest[1:756345,]
dftest %>%
group_by(file) %>%
summarise(missings=sum(is.na(v131)))
The str(dftest) now returns:
'data.frame': 756345 obs. of 2 variables:
$ file: chr "bjir31fl.dta" "bjir31fl.dta" "bjir31fl.dta" "bjir31fl.dta" ...
$ v131: Factor w/ 330 levels "not of benin",..: 6 6 6 6 1 1 1 9 9 9 ...
Anyone have any suggestions about what might cause this error, and what to do about it. In my original file I have 300 variables, and dplyr states that most of these are of unsupported type.
Thanks.
This seems to be an issue with using filter when a column of the data frame has an attribute. For example,
> df = data.frame(x=1:10, y=1:10)
> filter(df, x==3) # Works
x y
1 3 3
Add an attribute to the x column. Notice that str(df) shows x as atomic now, and filter doesn't work:
> attr(df$x, 'width')='broad'
> str(df)
'data.frame': 10 obs. of 2 variables:
$ x: atomic 1 2 3 4 5 6 7 8 9 10
..- attr(*, "width")= chr "broad"
$ y: int 1 2 3 4 5 6 7 8 9 10
> filter(df, x==3)
Error: column 'x' has unsupported type
To make it work, remove the attribute:
> attr(df$x, 'width') = NULL
> filter(df, x==3)
x y
1 3 3

tapply function complains that args are unequal length yet they appear to match

Here is the failing call, error messages and some displays to show the lengths in question:
it <- tapply(molten, c(molten$Activity, molten$Subject, molten$variable), mean)
# Error in tapply(molten, c(molten$Activity, molten$Subject, molten$variable), :
# arguments must have same length
length(molten$Activity)
# [1] 679734
length(molten$Subject)
# [1] 679734
length(molten$variable)
# [1] 679734
dim(molten)
# [1] 679734 4
str(molten)
# 'data.frame': 679734 obs. of 4 variables:
# $ Activity: Factor w/ 6 levels "WALKING","WALKING_UPSTAIRS",..: 5 5 5 5 5 5 5 5 5 5 ...
# $ Subject : Factor w/ 30 levels "1","2","3","4",..: 2 2 2 2 2 2 2 2 2 2 ...
# $ variable: Factor w/ 66 levels "tBodyAcc-mean()-X",..: 1 1 1 1 1 1 1 1 1 1 ...
# $ value : num 0.257 0.286 0.275 0.27 0.275 ...
If you have a look at ?tapply you will see that X should be "an atomic object, typically a vector". You feed tapply with a data frame ("molten"), which is not an atomic object. See is.atomic, and try is.atomic(molten). Furthermore, your grouping variables should be provided as a list (see INDEX argument).
Something like this works:
tapply(X = warpbreaks$breaks, INDEX = list(warpbreaks$wool, warpbreaks$tension), mean)
# L M H
# A 44.55556 24.00000 24.55556
# B 28.22222 28.77778 18.77778
You need to have a single object for INDEX, butc( )will string them all together which is the source of the eror, so use a list:
it <- tapply(molten$value, list(Act=molten$Activity, sub=molten$Subject, var=molten$variable), mean)
Better would be:
it <- with(molten , tapply(value, list(Act=Activity, Sub=Subject, var=variable), mean) )
Ever got this solved? Because I had the same issue reading in a CSV file and could fix the issue by saving the original CSV file as CSV(delimiter seperated) instead of CSV(delimiter seperated-UTF-8). My dataset had German Umlauts in it though so that might play a role aswell.

Using "NA" as a legitimate nonmissing value

I'm working with a data set that includes first names entered in all capital letters. I need to work with the names as character variables, not as factors.
One person in the data set has the first name "NA". Can I get R to accept "NA" as a legitimate character value? My work-around solution was to rename that person NAA, but I am interested to see if there is a better way.
As a demonstration of my comment, consider the following sample CSV file:
x <- tempfile()
cat("v1,v2", "NA,1", "AB,3", sep = "\n", file = x)
cat(readLines(x), sep = "\n")
# v1,v2
# NA,1
# AB,3
Here's the str of a basic read.csv. Note the NA is seen as NA
str(read.csv(x))
# 'data.frame': 2 obs. of 2 variables:
# $ v1: Factor w/ 1 level "AB": NA 1
# $ v2: int 1 3
Now, specify a different character as your na.strings argument:
str(read.csv(x, na.strings = ""))
# 'data.frame': 2 obs. of 2 variables:
# $ v1: Factor w/ 2 levels "AB","NA": 2 1
# $ v2: int 1 3

Resources