R: not meaningful as factors - r

what is best practice to handle this particular problem when it comes up? for example I have created a dataframe:
dat<- sqlQuery(con,"select * from mytable")
in which my table looks like:
ID RESULT GROUP
-- ------ -----
1 Y A
2 N A
3 N B
4 Y B
5 N A
in which ID is an int, Result and Group are both factors.
problem is that when I want to do something like:
tapply(dat$RESULT,dat$GROUP,sum)
I get complaints about columns being a factor:
Error in Summary.factor(c(2L,2L,2L,2L,1L,2L,1L,2L,2L,1L,1L, :
sum not meaningful for factors
Given that factors are essential for use in things like ggplot, how does everyone else handle this?
Setting stringsAsFactors=FALSE and rerunning gives
tapply(dat$RESULT,dat$GROUP,sum)
Error in FUN(X[[1L]], ...) : invalid "type" (character) or argument
so I'm not sure merely setting stringsAsFactors=FALSE is the right approach

I assume you want to sum up the "Y"s in the RESULT column.
As suggested by #akrun, one possibility is to use table()
with(dat,table(GROUP,RESULT))
If you want to stick with the tapply(), you can change the type of the RESULT column to a boolean:
dat$RESULT <- dat$RESULT=="Y"
tapply(dat$RESULT,dat$GROUP,sum)
If your goal is to have some columns as factors and other as strings, you can convert to factors only selected columns in the result, e.g. with
dat<- sqlQuery(con,"select ID,RESULT,GROUP from mytable",as.is=2)
As in the read.table man page (recalled by the sqlQuery man page) : as.is is either a vector of logicals (values are recycled if necessary), or a vector of numeric or character indices which specify which columns should not be converted to factors.
But then again, you need either to use table() or to turn the result into a boolean.

I'm not clear what your question is, either. If you're just trying to sum the Y's, how about:
library(dplyr)
df <- data.frame(ID = 1:5,
RESULT = as.factor(c("Y","N","N","Y","N")),
GROUP = as.factor(c("A", "A", "B", "B", "A")))
df %>% mutate(logRes = (RESULT == "Y")) %>%
summarise(sum=sum(logRes))

Related

How do I replace specific cell values in dataframe using continuous (sequential) indexing?

I have two dataframes of equal dimensions.
One has some value in cells (i.e. 'abc') that i need to index. Other has all different values. And I need to replace the values in other dataframe with the same index as 'abc'.
Examples:
df1 <- data.frame('1'=c('abc','bbb','rweq','dsaf','cxc','rwer','anc','ewr','yuje','gda'),
'2'=c(NA,NA,'bbb','dsaf','rwer','dsaf','ewr','cxc','dsaf','cxc'),
'3'=c(NA,NA,'dsaf','abc','bbb','cxc','yuje',NA,'ewr','anc'),
'4'=c(NA,NA,'cxc',NA,'abc','anc',NA,NA,'yuje','rweq'),
'5'=c(NA,NA,'anc',NA,'abc',NA,NA,NA,'rwer','rwer'),
'6'=c(NA,NA,'rweq',NA,'dsaf',NA,NA,NA,'bbb','bbb'),
'7'=c(NA,NA,'abc',NA,'ewr',NA,NA,NA,'abc','abc'),
'8'=c(NA,NA,'abc',NA,'rweq',NA,NA,NA,'cxc','bbb'),
'9'=c(NA,NA,NA,NA,'abc',NA,NA,NA,'anc',NA),
'10'=c(NA,NA,NA,NA,'abc',NA,NA,NA,'rweq',NA))
df2 <- data.frame('1'=c('green','black','white','yelp','help','green','red','brown','green','crack'),
'2'=c(NA,NA,'black','yelp','green','yelp','brown','help','yelp','help'),
'3'=c(NA,NA,'yelp','green','black','help','green',NA,'brown','red'),
'4'=c(NA,NA,'help',NA,'green','red',NA,NA,'green','white'),
'5'=c(NA,NA,'red',NA,'green',NA,NA,NA,'green','green'),
'6'=c(NA,NA,'white',NA,'yelp',NA,NA,NA,'black','black'),
'7'=c(NA,NA,'green',NA,'brown',NA,NA,NA,'green','green'),
'8'=c(NA,NA,'green',NA,'white',NA,NA,NA,'help','black'),
'9'=c(NA,NA,NA,NA,'green',NA,NA,NA,'red',NA),
'10'=c(NA,NA,NA,NA,'green',NA,NA,NA,'white',NA))
I can find sequential index of 'abc', but it returns one-sized vector
which(df1 == 'abc')
#[1] 1 24 35 45 63 69 70 73 85 95
And i don't know how to replace values using this method
In output expected to view df2 with replaced values 'green' only on the same indexes as values 'abc' in df1.
But note!! that 'green' values in df2 are not only in the same indexes as in df1
I don't think your problem is appropriately approached with the data in a data.frame. That introduces several complications. First, each variable (column) in the data frame is a factor with different levels! Second, your code is making a comparison between a list (data.frame) and a factor (which is coerced into an atomic vector). The help function for the == operator states ..if the other is a list R attempts to coerce it to the type of the atomic vector.. The help function also points out that factors get special handling in comparisons where it first assumes you are comparing factor levels, which your code is doing.
I think you want to convert your data frames of identical dimensions to a matrix first. If you need the results in a data.frame, convert it back after as I show here but realize that the factor levels may have changed.
# Starting with the values assigned to df1 and df2
m1 <- as.matrix(df1)
m2 <- as.matrix(df2)
index <- which(m1 == "abc")
m2[index] <- "abc"
df2 <- as.data.frame(m2)
Here is a way to. Learn about the *apply family in R: I think it is the most useful group of functions in this language, whatever you plan to do ;) Also know that data.frame are of 'list' type.
df1 <- lapply(df1, function(frame, pattern, replace){ # for each frame = column:
matches <- which(pattern %in% frame) # what are the matching indexes of the frame
if(length(matches) > 0) # If there is at least one index matching,
frame[matches] <- replace # give it the value you want
return(frame) # Commit your changes back to df1
}, pattern="abc", replace= "<whatYouWant>") # don't forget this part: the needed arguments !

Vector gets stored as a dataframe instead of being a vector

I am new to r and rstudio and I need to create a vector that stores the first 100 rows of the csv file the programme reads . However , despite all my attempts my variable v1 ends up becoming a dataframe instead of an int vector . May I know what I can do to solve this? Here's my code:
library(readr)
library(readr)
cup_data <- read_csv("C:/Users/Asus.DESKTOP-BTB81TA/Desktop/STUDY/YEAR 2/
YEAR 2 SEM 2/PREDICTIVE ANALYTICS(1_PA_011763)/Week 1 (Intro to PA)/
Practical/cup98lrn variable subset small.csv")
# Retrieve only the selected columns
cup_data_small <- cup_data[c("AGE", "RAMNTALL", "NGIFTALL", "LASTGIFT",
"GENDER", "TIMELAG", "AVGGIFT", "TARGET_B", "TARGET_D")]
str(cup_data_small)
cup_data_small
#get the number of columns and rows
ncol(cup_data_small)
nrow(cup_data_small)
cat("No of column",ncol(cup_data_small),"\nNo of Row :",nrow(cup_data_small))
#cat
#Concatenate and print
#Outputs the objects, concatenating the representations.
#cat performs much less conversion than print.
#Print the first 10 rows of cup_data_small
head(cup_data_small, n=10)
#Create a vector V1 by selecting first 100 rows of AGE
v1 <- cup_data_small[1:100,"AGE",]
Here's what my environment says:
cup_data_small is a tibble, a slightly modified version of a dataframe that has slightly different rules to try to avoid some common quirks/inconsistencies in standard dataframes. E.g. in a standard dataframe, df[, c("a")] gives you a vector, and df[, c("a", "b")] gives you a dataframe - you're using the same syntax so arguably they should give the same type of result.
To get just a vector from a tibble, you have to explicitly pass drop = TRUE, e.g.:
library(dplyr)
# Standard dataframe
iris[, "Species"]
iris_tibble = iris %>%
as_tibble()
# Remains a tibble/dataframe
iris_tibble[, "Species"]
# This gives you just the vector
iris_tibble[, "Species", drop = TRUE]

Order and subset a multi-column dataframe in R?

I wanted to order by some column, and subset, a multi-column dataframe but the command used did not work
print(df[order(df$x) & df$x < 5,])
This does not order the results.
To debug this I generated a test dataframe with 1 column but this 'simplification' had unexpected effects
df <- data.frame(x = sample(1:50))
print(df[order(df$x) & df$x < 5,])
This does not order the results so I felt I had reproduced the problem but with simpler data.
Breaking down the process to first ordering and then subsetting led me to discover the ordering in this case does not generate a dataframe object
df <- data.frame(x = sample(1:50))
ndf <- df[order(df$x),]
print(class(ndf))
produces
[1] "integer"
Attempting to subset the resultant "integer" ndf object using dataframe syntax e.g.
print(ndf[ndf$x < 5, ])
obviously generates an error:
Error in ndf$x : $ operator is invalid for atomic vectors.
Simplifying even further, I found subsetting alone (not applying the order function ) does not generate a dataframe object
ndf <- df[df$x < 5,]
class(ndf)
[1] "integer"
It turns out for the multicolumn dataframe that separating the ordering and the subsetting does work as expected
df <- data.frame(x = sample(1:50), y = rnorm(50))
ndf <- df[order(df$x),]
print(ndf[ndf$x < 5, ])
and this solved my original problem, but led to two further questions:
Why is the type of object returned, as described above based on the 1 column dataframe test case, not a dataframe? ( I appreciate a 1 column dataframe just contains a single vector but it's still wrapped in a dataframe ?)
Is it possible to order and subset a multicolumn dataframe in 1 step?
A data.frame in R automatically simplifies to vectors when selecting just one column. This is a common and useful simplification and is better described in this question. Of course you can prevent that with drop=FALSE.
Subsetting and ordering are two different operations. You should do them in two logical steps (but possibly one line of code). This line doesn't make a lot of sense
df[order(df$x) & df$x < 5,]
Subsetting in R can either be done with a vector of row indices (which order() returns) or boolean values (which the < comparison returns). Mixing them (with just an &) doesn't make it clear how R should perform the subset. But you can break that out into two steps with subset()
subset(df[order(df$x),], x < 5)
This does the ordering first and then the subsetting. Note that the condition no longer directory references the value of df specfically, it's will filter the data from the re-ordered data.frame.
Operations like this is one of the reasons many people perfer the dplyr library for data manipulations. For example this can be done with
library(dplyr)
dd <- data.frame(x = sample(1:50))
dd %>% filter(x<5) %>% arrange(x)

How to assign a subset from a data frame `a' to a subset of data frame `b'

It might be a trivial question (I am new to R), but I could not find a answer for my question, either here in SO or anywhere else. My scenario is the following.
I have an data frame df and i want to update a subset df$tag values. df is similar to the following:
id = rep( c(1:4), 3)
tag = rep( c("aaa", "bbb", "rrr", "fff"), 3)
df = data.frame(id, tag)
Then, I am trying to use match() to update the column tag from the subsets of the data frame, using a second data frame (e.g., aux) that contains two columns, namely, key and value. The subsets are defined by id = n, according to n in unique(df$id). aux looks like the following:
> aux
key value
"aaa" "valueAA"
"bbb" "valueBB"
"rrr" "valueRR"
"fff" "valueFF"
I have tried to loop over the data frame, as follows:
for(i in unique(df$id)){
indexer = df$id == i
# here is how I tried to update the dame frame:
df[indexer,]$tag <- aux[match(df[indexer,]$tag, aux$key),]$value
}
The expected result was the df[indexer,]$tag updated with the respective values from aux$value.
The actual result was df$tag fulfilled with NA's. I've got no errors, but the following warning message:
In '[<-.factor'('tmp', df$id == i, value = c(NA, :
invalid factor level, NA generated
Before, I was using df$tag <- aux[match(df$tag, aux$key),]$value, which worked properly, but some duplicated df$tags made the match() produce the misplaced updates in a number of rows. I also simulate the subsetting and it works fine. Can someone suggest a solution for this update?
UPDATE (how the final dataset should look like?):
> df
id tag
1 "valueAA"
2 "valueBB"
3 "valueRR"
4 "valueFF"
(...) (...)
Thank you in advance.
Does this produce the output you expect?
df$tag <- aux$value[match(df$tag, aux$key)]
merge() would work too unless you have duplicates in aux.
It turned out that my data was breaking all the available built-in functions providing me a wrong dataset in the end. Then, my solution (at least, a preliminary one) was the following:
to process each subset individually;
add each data frame to a list;
use rbindlist(a.list, use.names = T) to get a complete data frame with the results.

Apply function to each column in a data frame observing each columns existing data type

I'm trying to get the min/max for each column in a large data frame, as part of getting to know my data. My first try was:
apply(t,2,max,na.rm=1)
It treats everything as a character vector, because the first few columns are character types. So max of some of the numeric columns is coming out as " -99.5".
I then tried this:
sapply(t,max,na.rm=1)
but it complains about max not meaningful for factors. (lapply is the same.) What is confusing me is that apply thought max was perfectly meaningful for factors, e.g. it returned "ZEBRA" for column 1.
BTW, I took a look at Using sapply on vector of POSIXct and one of the answers says "When you use sapply, your objects are coerced to numeric,...". Is this what is happening to me? If so, is there an alternative apply function that does not coerce? Surely it is a common need, as one of the key features of the data frame type is that each column can be a different type.
If it were an "ordered factor" things would be different. Which is not to say I like "ordered factors", I don't, only to say that some relationships are defined for 'ordered factors' that are not defined for "factors". Factors are thought of as ordinary categorical variables. You are seeing the natural sort order of factors which is alphabetical lexical order for your locale. If you want to get an automatic coercion to "numeric" for every column, ... dates and factors and all, then try:
sapply(df, function(x) max(as.numeric(x)) ) # not generally a useful result
Or if you want to test for factors first and return as you expect then:
sapply( df, function(x) if("factor" %in% class(x) ) {
max(as.numeric(as.character(x)))
} else { max(x) } )
#Darrens comment does work better:
sapply(df, function(x) max(as.character(x)) )
max does succeed with character vectors.
The reason that max works with apply is that apply is coercing your data frame to a matrix first, and a matrix can only hold one data type. So you end up with a matrix of characters. sapply is just a wrapper for lapply, so it is not surprising that both yield the same error.
The default behavior when you create a data frame is for categorical columns to be stored as factors. Unless you specify that it is an ordered factor, operations like max and min will be undefined, since R is assuming that you've created an unordered factor.
You can change this behavior by specifying options(stringsAsFactors = FALSE), which will change the default for the entire session, or you can pass stringsAsFactors = FALSE in the data.frame() construction call itself. Note that this just means that min and max will assume "alphabetical" ordering by default.
Or you can manually specify an ordering for each factor, although I doubt that's what you want to do.
Regardless, sapply will generally yield an atomic vector, which will entail converting everything to characters in many cases. One way around this is as follows:
#Some test data
d <- data.frame(v1 = runif(10), v2 = letters[1:10],
v3 = rnorm(10), v4 = LETTERS[1:10],stringsAsFactors = TRUE)
d[4,] <- NA
#Similar function to DWin's answer
fun <- function(x){
if(is.numeric(x)){max(x,na.rm = 1)}
else{max(as.character(x),na.rm=1)}
}
#Use colwise from plyr package
colwise(fun)(d)
v1 v2 v3 v4
1 0.8478983 j 1.999435 J
If you want to learn your data summary (df) provides the min, 1st quantile, median and mean, 3rd quantile and max of numerical columns and the frequency of the top levels of the factor columns.
The best way to do this is avoid base *apply functions, which coerces the entire data frame to an array, possibly losing information.
If you wanted to apply a function as.numeric to every column, a simple way is using mutate_all from dplyr:
t %>% mutate_all(as.numeric)
Alternatively use colwise from plyr, which will "turn a function that operates on a vector into a function that operates column-wise on a data.frame."
t %>% (colwise(as.numeric))
In the special case of reading in a data table of character vectors and coercing columns into the correct data type, use type.convert or type_convert from readr.
Less interesting answer: we can apply on each column with a for-loop:
for (i in 1:nrow(t)) { t[, i] <- parse_guess(t[, i]) }
I don't know of a good way of doing assignment with *apply while preserving data frame structure.
building on #ltamar's answer:
Use summary and munge the output into something useful!
library(tidyr)
library(dplyr)
df %>%
summary %>%
data.frame %>%
select(-Var1) %>%
separate(data=.,col=Freq,into = c('metric','value'),sep = ':') %>%
rename(column_name=Var2) %>%
mutate(value=as.numeric(value),
metric = trimws(metric,'both')
) %>%
filter(!is.na(value)) -> metrics
It's not pretty and it is certainly not fast but it gets the job done!
these days loops are just as fast so this is more than sufficient:
for (I in 1L:length(c(1,2,3))) {
data.frame(c("1","2","3"),c("1","3","3"))[,I] <-
max(as.numeric(data.frame(c("1","2","3"),c("1","3","3"))[,I]))
}
A solution using retype() from hablar to coerce factors to character or numeric type depending on feasability. I'd use dplyr for applying max to each column.
Code
library(dplyr)
library(hablar)
# Retype() simplifies each columns type, e.g. always removes factors
d <- d %>% retype()
# Check max for each column
d %>% summarise_all(max)
Result
Not the new column types.
v1 v2 v3 v4
<dbl> <chr> <dbl> <chr>
1 0.974 j 1.09 J
Data
# Sample data borrowed from #joran
d <- data.frame(v1 = runif(10), v2 = letters[1:10],
v3 = rnorm(10), v4 = LETTERS[1:10],stringsAsFactors = TRUE)
df <- head(mtcars)
df$string <- c("a","b", "c", "d","e", "f"); df
my.min <- unlist(lapply(df, min))
my.max <- unlist(lapply(df, max))

Resources