as.numeric vs chr - r

I have the following code:
fig4 <- data.frame(chads=NA,age=NA,treatment=NA,mean=NA,lower=NA,upper=NA)
fig4$chads <- as.factor(fig4$chads)
levels(fig4$chads) <- c(0,1,2,3,4,5,6)
fig4$age <- as.factor(fig4$age)
levels(fig4$age ) <- c("u80","o80")
fig4$treatment <- as.factor(fig4$treatment)
levels(fig4$treatment) <- c("OAC","OAP")
fig4$mean <- as.numeric(fig4$mean)
fig4$lower <- as.numeric(fig4$lower)
fig4$upper <- as.numeric(fig4$upper)
> str(fig4)
'data.frame': 1 obs. of 6 variables:
$ chads : Factor w/ 7 levels "0","1","2","3",..: NA
$ age : Factor w/ 2 levels "u80","o80": NA
$ treatment: Factor w/ 2 levels "OAC","OAP": NA
$ mean : num NA
$ lower : num NA
$ upper : num NA
So far so good. But then I do this:
vc <- as.vector(c(6,"o80","OAC",0.1,0.02,0.25), mode = "any")
fig4 <- rbind(fig4,vc)
which results in this:
> str(fig4)
'data.frame': 2 obs. of 6 variables:
$ chads : Factor w/ 7 levels "0","1","2","3",..: NA 7
$ age : Factor w/ 2 levels "u80","o80": NA 2
$ treatment: Factor w/ 2 levels "OAC","OAP": NA 1
$ mean : chr NA "0.1"
$ lower : chr NA "0.02"
$ upper : chr NA "0.25"
Why did the numeric vectors turn into character ones?

Lists can hold objects of multiple types, so to avoid your new data being converted to character, you can do:
fig4[nrow(fig4) + 1, ] <- list(6,"o80","OAC",0.1,0.02,0.25)

For the same reason a matrix would: an atomic vector, like a matrix, can hold only one type. c() has already coerced every element to character before rbind() sees it, so the numeric columns turn into character as well.
Use a data.frame to hold "columns" of different types, then subset individual columns.
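A short sketch of the coercion described above, and of one way around it. The one-row data.frame passed to rbind() is an illustration added here rather than code from the answers, and it assumes the factor levels are declared up front so that both rows agree on them.

```r
# c() returns an atomic vector, so mixed input is coerced to a single type
vc <- c(6, "o80", "OAC", 0.1, 0.02, 0.25)
vc_class <- class(vc)   # "character"

# A list keeps each element's own type
new_values <- list(6, "o80", "OAC", 0.1, 0.02, 0.25)
value_classes <- vapply(new_values, class, character(1))

# Start from the same empty-ish frame as in the question
fig4 <- data.frame(
  chads     = factor(NA, levels = 0:6),
  age       = factor(NA, levels = c("u80", "o80")),
  treatment = factor(NA, levels = c("OAC", "OAP")),
  mean = NA_real_, lower = NA_real_, upper = NA_real_
)

# Append a one-row data.frame whose columns already have the right types
new_row <- data.frame(
  chads     = factor(6, levels = 0:6),
  age       = factor("o80", levels = c("u80", "o80")),
  treatment = factor("OAC", levels = c("OAC", "OAP")),
  mean = 0.1, lower = 0.02, upper = 0.25
)
fig4 <- rbind(fig4, new_row)
str(fig4)   # mean, lower and upper should still be num
```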

Related

Removing all special characters from an entire dataframe but keeping factor level definitions

I'm trying to remove special characters (e.g. "-", "/", ")", "(") entirely from my dataframe. However, my dataframe contains only one observation, as it feeds into a model that will be used in production. I've defined the factor levels explicitly for the data frame.
I've tried the following:
library(magrittr)  # provides %>%
sanitize_string <- function(string) {
  gsub('\\s+', "_", string) %>%
    gsub("[(]", "_", .) %>%
    gsub("[)]", "_", .) %>%
    gsub("[/]", "_", .) %>%
    gsub("[-]", "_", .)
}
and then:
df <- as.data.frame(lapply(df, function(dataframe) sapply(dataframe, sanitize_string)), stringsAsFactors=FALSE)
But when I do this, I lose my factor levels: every factor is treated as having a single level. That causes problems later when I try to get predictions from my model, because sparse.model.matrix needs two or more levels for each factor, whereas in production it will only ever be sent one observation.
Thanks.
Here is my dataframe:
$ children_under16 : Factor w/ 2 levels "No","Yes": 1
$ ft_employment_status : Factor w/ 5 levels "Employed","Full-Time Education(Student)",..: 1
$ fuel_type : Factor w/ 2 levels "D","P": 2
$ homeowner : Factor w/ 2 levels "FALSE","TRUE": 2
$ marital_status : Factor w/ 6 levels "Married","Separated",..: 1
$ overnight_loc : Factor w/ 7 levels "In a private Driveway",..: NA
$ usage_type : Factor w/ 3 levels "CLASS_1","SDPC",..: 1
$ licence_type : Factor w/ 3 levels "UK","European",..: 1
$ yad_relationship_to_policyholder: Factor w/ 8 levels "Spouse","No_YAD",..: 1
$ A : Factor w/ 7 levels "1","2","5","3",..: 1
$ B : Factor w/ 19 levels "C","E","Q","D",..: 1
$ C : Factor w/ 63 levels "11","19","58",..: 1
$ region : Factor w/ 12 levels "Yorkshire and The Humber",..: 1
$ D : Factor w/ 28 levels "Semi-Detached Suburbia",..: 27
$ E : Factor w/ 77 levels "Families in Terraces and Flats",..: 77
$ F : Factor w/ 9 levels "Suburbanites",..: 1
$ industry_band : Factor w/ 18 levels "13","14","15",..: 14
$ occ_band_goco : Factor w/ 17 levels "0","1","2","3",..: 2
$ transmission : Factor w/ 2 levels "A","M": 2
$ vehicle_make : Factor w/ 19 levels "OTHER","AUDI",..: 1
$ vehicle_type : Factor w/ 17 levels "Mid Exec Saloon/Estate/Coupe",..: 1
$ rural_urban : Factor w/ 19 levels "Urban major conurbation",..: 2
$ water_company : Factor w/ 23 levels "Affinity Water",..: 23
$ seats : Factor w/ 6 levels "-99","2","4",..:
You can sanitize the levels of the factor, rather than the column. This preserves the order of the levels. Be aware that if your sanitization maps two different levels to the same string, those levels are merged into one. I would just use a for loop:
for (i in seq_along(df)) {
  if (is.factor(df[[i]])) {
    levels(df[[i]]) <- sanitize_string(levels(df[[i]]))
  }
}
I can't test this on the structure you've posted, but if you have problems please share some data with dput() so I can copy/paste it (e.g., dput(df[1:10, ]), or some other small subset that illustrates the problem) and I'll be happy to test and refine.
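As a minimal sketch of the levels-only approach on a one-row data frame: the two columns below are made up for illustration (their names are hypothetical, their level strings are borrowed from the str() output above), and sanitize_string is rewritten here with nested base gsub() calls so the block runs without magrittr.

```r
# Same replacements as the question's function, without the pipe
sanitize_string <- function(string) {
  string <- gsub("\\s+", "_", string)   # runs of whitespace -> one underscore
  gsub("[()/-]", "_", string)           # each ( ) / - -> underscore
}

# One observation, but with the full set of levels declared
df <- data.frame(
  status = factor("Employed",
                  levels = c("Employed", "Full-Time Education(Student)")),
  seats  = factor("2", levels = c("-99", "2", "4"))
)

# Rename the levels in place; the stored value and level count are untouched
for (i in seq_along(df)) {
  if (is.factor(df[[i]])) {
    levels(df[[i]]) <- sanitize_string(levels(df[[i]]))
  }
}

levels(df$status)
nlevels(df$seats)
```

If there is any chance that two levels sanitize to the same string, checking anyDuplicated(sanitize_string(levels(x))) before assigning would flag the merge.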

R - Aggregate function creating sub-lists

I'm using the aggregate function to summarise some loans data, which includes ContractNum and LoanAmount. I want to aggregate the data by StartDate, count the number of loans, and average the loan amount.
Here is a sample of the data and the function that I use:
ContractNum <- c("RHL-1","RHL-2","RHL-3","RHL-3")
StartDate <- c("2016-11-01","2016-11-01","2016-12-01","2016-12-01")
LoanPurpose <- c("Personal","Personal","HomeLoan","Investment")
LoanAmount <- c(200,500,600,150)
dat <- data.frame(ContractNum,StartDate,LoanPurpose,LoanAmount)
aggr.data <- aggregate(
cbind(LoanAmount,ContractNum) ~ StartDate + LoanPurpose
,data = dat
,FUN = function(x)c(count = mean(x),length(x))
)
When I look at the results of the aggregate function, they look OK:
> aggr.data
StartDate LoanPurpose LoanAmount.count LoanAmount.V2 ContractNum.count ContractNum.V2
1 2016-12-01 HomeLoan 600 1 3.0 1.0
2 2016-12-01 Investment 150 1 3.0 1.0
3 2016-11-01 Personal 350 2 1.5 2.0
But when I look at its structure, it seems to have created a sub-list:
> str(aggr.data)
'data.frame': 3 obs. of 4 variables:
$ StartDate : Factor w/ 2 levels "2016-11-01","2016-12-01": 2 2 1
$ LoanPurpose: Factor w/ 3 levels "HomeLoan","Investment",..: 1 2 3
$ LoanAmount : num [1:3, 1:2] 600 150 350 1 1 2
..- attr(*, "dimnames")=List of 2
.. ..$ : NULL
.. ..$ : chr "count" ""
$ ContractNum: num [1:3, 1:2] 3 3 1.5 1 1 2
..- attr(*, "dimnames")=List of 2
.. ..$ : NULL
.. ..$ : chr "count" ""
How do I get rid of this sub-list so that I can access each column the way I would normally access a DF? I understand that in the code I've asked to give me a mean on a ContractNum which is not meaningful, but I can just get rid of that column.
Thank you
Just do do.call(data.frame, ...) on aggr.data to unnest the matrices.
aggr.data <- do.call(data.frame, aggr.data)
str(aggr.data)
#'data.frame': 3 obs. of 6 variables:
# $ StartDate : Factor w/ 2 levels "2016-11-01","2016-12-01": 2 2 1
# $ LoanPurpose : Factor w/ 3 levels "HomeLoan","Investment",..: 1 2 3
# $ LoanAmount.count : num 600 150 350
# $ LoanAmount.V2 : num 1 1 2
# $ ContractNum.count: num 3 3 1.5
# $ ContractNum.V2 : num 1 1 2
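To see why the columns are nested, the sketch below aggregates LoanAmount only and names both summary values. The names mean and n are chosen here for illustration. When FUN returns more than one value, aggregate() stores the result as a single matrix column, and do.call(data.frame, ...) expands that matrix into ordinary columns.

```r
dat <- data.frame(
  StartDate   = c("2016-11-01", "2016-11-01", "2016-12-01", "2016-12-01"),
  LoanPurpose = c("Personal", "Personal", "HomeLoan", "Investment"),
  LoanAmount  = c(200, 500, 600, 150)
)

# FUN returns two values, so LoanAmount becomes one matrix column
aggr <- aggregate(
  LoanAmount ~ StartDate + LoanPurpose,
  data = dat,
  FUN  = function(x) c(mean = mean(x), n = length(x))
)
is.matrix(aggr$LoanAmount)   # TRUE: three columns in total, one of them a matrix

# Expand the matrix column into plain numeric columns
flat <- do.call(data.frame, aggr)
names(flat)
flat$LoanAmount.mean         # now accessible like any other column
```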

Replace integer(0) by NA

I have a function that I apply to a column and puts results in another column and it sometimes gives me integer(0) as output. So my output column will be something like:
45
64
integer(0)
78
How can I detect these integer(0)'s and replace them by NA? Is there something like is.na() that will detect them?
Edit: Ok I think I have a reproducible example:
df1 <-data.frame(c("267119002","257051033",NA,"267098003","267099020","267047006"))
names(df1)[1]<-"ID"
df2 <-data.frame(c("257051033","267098003","267119002","267047006","267099020"))
names(df2)[1]<-"ID"
df2$vals <-c(11,22,33,44,55)
fetcher <-function(x){
y <- df2$vals[which(match(df2$ID,x)==TRUE)]
return(y)
}
sapply(df1$ID,function(x) fetcher(x))
The output from this sapply is the source of the problem.
> str(sapply(df1$ID,function(x) fetcher(x)))
List of 6
$ : num 33
$ : num 11
$ : num(0)
$ : num 22
$ : num 55
$ : num 44
I don't want this to be a list. I want a vector, and instead of num(0) I want NA (note that this toy data gives num(0); my real data gives integer(0)).
Here's a way to (a) replace integer(0) with NA and (b) transform the list into a vector.
# a regular data frame
> dat <- data.frame(x = 1:4)
# add a list including integer(0) as a column
> dat$col <- list(45,
+ 64,
+ integer(0),
+ 78)
> str(dat)
'data.frame': 4 obs. of 2 variables:
$ x : int 1 2 3 4
$ col:List of 4
..$ : num 45
..$ : num 64
..$ : int
..$ : num 78
# find zero-length values
> idx <- !(sapply(dat$col, length))
# replace these values with NA
> dat$col[idx] <- NA
# transform list to vector
> dat$col <- unlist(dat$col)
# now the data frame contains vector columns only
> str(dat)
'data.frame': 4 obs. of 2 variables:
$ x : int 1 2 3 4
$ col: num 45 64 NA 78
It is best to handle this inside your function. I'll call it myFunctionForApply here, but it stands for your current function. Before you return, check the length of the result and, if it is 0, return NA:
myFunctionForApply <- function(x, ...) {
  # Do your processing
  # Let's say it ends up in variable 'ret':
  if (length(ret) == 0)
    return(NA)
  return(ret)
}
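Applied to the reproducible example from the question, the length check might look like the sketch below. The rewritten fetcher body (a plain == comparison instead of the original which(match(...) == TRUE)) and the use of vapply() are choices made here, not part of either answer. The last line shows a vectorised alternative: match() already returns NA for IDs it cannot find.

```r
df1 <- data.frame(
  ID = c("267119002", "257051033", NA, "267098003", "267099020", "267047006"),
  stringsAsFactors = FALSE
)
df2 <- data.frame(
  ID   = c("257051033", "267098003", "267119002", "267047006", "267099020"),
  vals = c(11, 22, 33, 44, 55),
  stringsAsFactors = FALSE
)

fetcher <- function(x) {
  y <- df2$vals[which(df2$ID == x)]
  # Guard: a lookup with no match has length 0, so return NA instead
  if (length(y) == 0) return(NA_real_)
  y
}

# vapply() insists on exactly one numeric per element, so the result is a vector
res <- vapply(df1$ID, fetcher, numeric(1), USE.NAMES = FALSE)

# Vectorised alternative: unmatched IDs give NA without any explicit check
res_match <- df2$vals[match(df1$ID, df2$ID)]
```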

daply: Correct results, but confusing structure

I have a data.frame mydf, that contains data from 27 subjects. There are two predictors, congruent (2 levels) and offset (5 levels), so overall there are 10 conditions. Each of the 27 subjects was tested 20 times under each condition, resulting in a total of 10*27*20 = 5400 observations. RT is the response variable. The structure looks like this:
> str(mydf)
'data.frame': 5400 obs. of 4 variables:
$ subject : Factor w/ 27 levels "1","2","3","5",..: 1 1 1 1 1 1 1 1 1 1 ...
$ congruent: logi TRUE FALSE FALSE TRUE FALSE TRUE ...
$ offset : Ord.factor w/ 5 levels "1"<"2"<"3"<"4"<..: 5 5 1 2 5 5 2 2 3 5 ...
$ RT : int 330 343 457 436 302 311 595 330 338 374 ...
I've used daply() to calculate the mean RT of each subject in each of the 10 conditions:
myarray <- daply(mydf, .(subject, congruent, offset), summarize, mean = mean(RT))
The result looks just the way I wanted, i.e. a 3d-array; so to speak 5 tables (one for each offset condition) that show the mean of each subject in the congruent=FALSE vs. the congruent=TRUE condition.
However if I check the structure of myarray, I get a confusing output:
List of 270
$ : num 417
$ : num 393
$ : num 364
$ : num 399
$ : num 374
...
# and so on
...
[list output truncated]
- attr(*, "dim")= int [1:3] 27 2 5
- attr(*, "dimnames")=List of 3
..$ subject : chr [1:27] "1" "2" "3" "5" ...
..$ congruent: chr [1:2] "FALSE" "TRUE"
..$ offset : chr [1:5] "1" "2" "3" "4" ...
This looks totally different from the structure of the prototypical ozone array from the plyr package, even though it's a very similar format (3 dimensions, only numerical values).
I want to compute some further summarizing information on this array, by means of aaply. Precisely, I want to calculate the difference between the congruent and the incongruent means for each subject and offset.
However, even the most basic application of aaply(), like aaply(myarray,2,mean), returns nonsense output:
FALSE TRUE
NA NA
Warning messages:
1: In mean.default(piece, ...) :
argument is not numeric or logical: returning NA
2: In mean.default(piece, ...) :
argument is not numeric or logical: returning NA
I have no idea why the daply() function returns such weirdly structured output and thereby prevents any further use of aaply. Any help is appreciated; I frankly admit that I have hardly any experience with the plyr package.
Since you haven't included your data it's hard to know for sure, but I tried to make a dummy set based on your str(). You can do what you want (I'm guessing) with two uses of ddply: first the means, then the difference of the means.
library(plyr)
#Make dummy data
mydf <- data.frame(subject = rep(1:5, each = 150),
                   congruent = rep(c(TRUE, FALSE), each = 75),
                   offset = rep(1:5, each = 15),
                   RT = sample(300:500, 750, replace = TRUE))
#Make means
mydf.mean <- ddply(mydf, .(subject, congruent, offset), summarise, mean.RT = mean(RT))
#Calculate difference between congruent and incongruent
mydf.diff <- ddply(mydf.mean, .(subject, offset), summarise, diff.mean = diff(mean.RT))
head(mydf.diff)
# subject offset diff.mean
# 1 1 1 39.133333
# 2 1 2 9.200000
# 3 1 3 20.933333
# 4 1 4 -1.533333
# 5 1 5 -34.266667
# 6 2 1 -2.800000
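A likely reason for the list-shaped array is that summarize returns a one-row data frame for every cell, so daply() has to store a list in each position. Passing a function that returns a single number, such as function(d) mean(d$RT), should give a plain numeric array instead. I have not run this against the original data. The sketch below swaps in base R's tapply() to build the same kind of 3-d numeric array, so it runs without plyr, and reuses the dummy data layout from the answer (the seed is arbitrary).

```r
set.seed(1)  # arbitrary seed, only to make the dummy data repeatable
mydf <- data.frame(
  subject   = rep(1:5, each = 150),
  congruent = rep(c(TRUE, FALSE), each = 75),
  offset    = rep(1:5, each = 15),
  RT        = sample(300:500, 750, replace = TRUE)
)

# Numeric array of mean RT: subject x congruent x offset
myarray <- with(mydf, tapply(
  RT,
  list(subject = subject, congruent = congruent, offset = offset),
  mean
))
dim(myarray)

# Congruent minus incongruent mean for each subject and offset
diff_means <- myarray[, "TRUE", ] - myarray[, "FALSE", ]
dim(diff_means)
```

Because the result is an ordinary numeric array, apply(myarray, 2, mean) and similar calls work on it directly.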

Replace NA's in R - works in a practice dataset but warning message when applied to actual data

I have a dataset in R which looks like, and has been reshaped in the same way as, the following example. The aim is to turn NA values into something else (e.g. "FALSE" or "0"), which can then be used to create a new column.
ortho.test<-data.frame(rep("a",10));colnames(ortho.test)=("ODB6")
ortho.test$FBGN=c("FBgn0132258","FBgn0131535","FBgn0138769","FBgn01561235","FBgn0316645","FBgn874916","FBgn5758641","FBgn5279946","FBgn67543154","FBgn2451645")
ortho.test$Species=c("DROME","DROSI","DROSE","DROAN","DROYA","DROPS","DROPE","DROVI","DROGR","DROWI")
ortho<-reshape(ortho.test,direction="wide",idvar="ODB6",timevar="Species")
ortho$FBGN.DROME<-NA
is.na(ortho)
Which returns a vector telling me all but the FBGN.DROME are FALSE
With the following str() output:
> str(ortho)
'data.frame': 1 obs. of 11 variables:
$ ODB6 : Factor w/ 1 level "a": 1
$ FBGN.DROME: logi NA
$ FBGN.DROSI: chr "FBgn0131535"
$ FBGN.DROSE: chr "FBgn0138769"
$ FBGN.DROAN: chr "FBgn01561235"
$ FBGN.DROYA: chr "FBgn0316645"
$ FBGN.DROPS: chr "FBgn874916"
$ FBGN.DROPE: chr "FBgn5758641"
$ FBGN.DROVI: chr "FBgn5279946"
$ FBGN.DROGR: chr "FBgn67543154"
$ FBGN.DROWI: chr "FBgn2451645"
- attr(*, "reshapeWide")=List of 5
..$ v.names: NULL
..$ timevar: chr "Species"
..$ idvar : chr "ODB6"
..$ times : chr "DROME" "DROSI" "DROSE" "DROAN" ...
..$ varying: chr [1, 1:10] "FBGN.DROME" "FBGN.DROSI" "FBGN.DROSE" "FBGN.DROAN" ...
I change my NA to 0
ortho[is.na(ortho)]<-0
is.na(ortho)
Which returns a vector telling me all are now FALSE - a success because now I can create a column using ifelse() to show which of the rows have no 0's or FALSE's (or whatever text label I use to replace the NA's) in any column...
However, when I apply this to the full blown dataframe the NA's do not convert and I get the following warnings
> ortho[is.na(ortho)]<-0
There were 12 warnings (use warnings() to see them)
> warnings()
Warning messages:
1: In `[<-.factor`(`*tmp*`, thisvar, value = structure(c(62938L, ... :
invalid factor level, NAs generated
2: In `[<-.factor`(`*tmp*`, thisvar, value = structure(c(67667L, ... :
invalid factor level, NAs generated
3: In `[<-.factor`(`*tmp*`, thisvar, value = structure(c(122384L, ... :
invalid factor level, NAs generated
4: In `[<-.factor`(`*tmp*`, thisvar, value = structure(c(136498L, ... :
invalid factor level, NAs generated
5: In `[<-.factor`(`*tmp*`, thisvar, value = structure(c(84764L, ... :
invalid factor level, NAs generated
6: In `[<-.factor`(`*tmp*`, thisvar, value = structure(c(162734L, ... :
invalid factor level, NAs generated
7: In `[<-.factor`(`*tmp*`, thisvar, value = structure(c(33586L, ... :
invalid factor level, NAs generated
8: In `[<-.factor`(`*tmp*`, thisvar, value = structure(c(38959L, ... :
invalid factor level, NAs generated
9: In `[<-.factor`(`*tmp*`, thisvar, value = structure(c(149363L, ... :
invalid factor level, NAs generated
10: In `[<-.factor`(`*tmp*`, thisvar, value = structure(c(846L, ... :
invalid factor level, NAs generated
11: In `[<-.factor`(`*tmp*`, thisvar, value = structure(c(98228L, ... :
invalid factor level, NAs generated
12: In `[<-.factor`(`*tmp*`, thisvar, value = structure(c(110267L, ... :
invalid factor level, NAs generated
and this is the str() output
> str(ortho)
'data.frame': 17217 obs. of 13 variables:
$ ODB6 : Factor w/ 17217 levels "EOG60023J","EOG60023K",..: 1 2 3 4 5 6 7 8 9 10 ...
$ FBGN.DROGR: Factor w/ 164289 levels "FBgn0000008",..: 62938 54687 54705 56261 52591 58895 52161 52477 59180 53404 ...
$ FBGN.DROMO: Factor w/ 164289 levels "FBgn0000008",..: 67667 65117 65951 66506 68291 71722 73134 68667 72523 76080 ...
$ FBGN.DROVI: Factor w/ 164289 levels "FBgn0000008",..: 122384 121133 120018 121674 NA 125620 123754 123969 127130 130755 ...
$ FBGN.DROWI: Factor w/ 164289 levels "FBgn0000008",..: 136498 136809 139642 137108 NA 141689 136363 137237 135869 132801 ...
$ FBGN.DROPE: Factor w/ 164289 levels "FBgn0000008",..: 84764 78121 81229 80829 85509 82276 79001 80267 77133 87679 ...
$ FBGN.DROPS: Factor w/ 164289 levels "FBgn0000008",..: 162734 158625 162203 158653 158028 22427 158179 13830 19898 160874 ...
$ FBGN.DROAN: Factor w/ 164289 levels "FBgn0000008",..: 33586 35261 35694 23649 33601 25796 33808 33861 25917 29992 ...
$ FBGN.DROER: Factor w/ 164289 levels "FBgn0000008",..: 38959 41203 40738 39865 38807 46087 38821 44982 47952 38091 ...
$ FBGN.DROYA: Factor w/ 164289 levels "FBgn0000008",..: 149363 153417 153106 152243 149654 147146 149664 149482 147635 144838 ...
$ FBGN.DROME: Factor w/ 164289 levels "FBgn0000008",..: 846 7219 6958 162946 525 1892 125 3510 163839 10111 ...
$ FBGN.DROSE: Factor w/ 164289 levels "FBgn0000008",..: 98228 94438 94153 102953 98068 95380 98082 92553 93497 95950 ...
$ FBGN.DROSI: Factor w/ 164289 levels "FBgn0000008",..: 110267 108223 107983 107246 110164 117494 116973 110504 106459 NA ...
- attr(*, "reshapeWide")=List of 5
..$ v.names: NULL
..$ timevar: chr "Species"
..$ idvar : chr "ODB6"
..$ times : Factor w/ 12 levels "DROAN","DROER",..: 3 5 10 11 6 7 1 2 12 4 ...
..$ varying: chr [1, 1:12] "FBGN.DROGR" "FBGN.DROMO" "FBGN.DROVI" "FBGN.DROWI" ...
>
Could you help me get the main dataframe to play along like the test one did? (PS - I know I'm going to get "this is a duplicate, read the help pages and search properly" response - but I have searched, which is how I found out how to replace NA's, and I haven't found any with this same issue.)
You have a factors problem. If you look at your real data set, you'll notice the
Factor w/ 164289 levels .....
For example,
R> x = factor(c("A", "B"))
R> x[x=="A"] = 0
Warning message:
In `[<-.factor`(`*tmp*`, x == "A", value = 0) :
invalid factor level, NAs generated
You need to add 0 as a level. So something like:
x = factor(x, levels=c(levels(x), 0))
x[is.na(x)] = 0
should do the trick. However, a better tactic would be to change how you read in the data. For example,
read.table(filename, stringsAsFactors=FALSE)
For those whose data does not come from reading a file, each column of the data.frame can be converted with this loop (apply wouldn't work because it converts a data.frame into a matrix):
for (k in seq_along(data)) {
  data[[k]] <- as.character(data[[k]])
}
Once the columns are character, the replacement from the question can be applied if adding a level doesn't work.
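Put together, the convert-then-replace route could look like the sketch below. The tiny ortho frame is invented for illustration: "FBgn0000008" appears in the question, while the other IDs and the ODB6 values are made up. Only factor columns are converted here, which is a small variation on converting every column.

```r
# Toy stand-in for the real data: factor columns with a missing value
ortho <- data.frame(
  ODB6       = factor(c("EOG1", "EOG2", "EOG3")),
  FBGN.DROVI = factor(c("FBgn0000008", NA, "FBgn0000014"))
)

# Convert factor columns to character so any replacement value is allowed
is_fac <- vapply(ortho, is.factor, logical(1))
ortho[is_fac] <- lapply(ortho[is_fac], as.character)

# The replacement from the question now succeeds without warnings
ortho[is.na(ortho)] <- "0"
str(ortho)
```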
