R Standardizing numeric variables in dataframe while retaining factor variables

I have a dataframe (dcc) loaded in R which I have narrowed down to complete cases.
str(dcc)
'data.frame': 41715 obs. of 9 variables:
$ XCoord : num 661382 661412 661442 661472 661502 ...
$ YCoord : num 648092 648092 648092 648092 648092 ...
$ OBJECTID : int 1 2 3 4 5 6 7 8 9 10 ...
$ POINTID : int 1 2 3 4 5 6 7 8 9 10 ...
$ GRID_CODE : int 0 0 0 0 0 0 0 0 0 0 ...
$ APPL_COST_DIST_RIV_COAST: num 21350 21674 22185 22748 23448 ...
$ APPL_DEM30 : int 785 793 792 769 765 777 784 789 781 751 ...
$ APPL_DEM30_SLOPE : num 19.7 13.3 18.6 23.2 21 ...
$ APPL_SITE_NONSITE : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
I want to standardize the numeric and integer variables by subtracting the mean and dividing by the standard deviation. When I apply the following code, I inadvertently drop the factor variable APPL_SITE_NONSITE from the dataframe:
ind <- sapply(dcc, is.numeric)
dcc.s<-sapply(dcc[,ind], function(x) (x-mean(x))/sd(x))
dcc.s<-data.frame(dcc.s)
If I'm not mistaken, that happens because ind=FALSE for that variable. It seems like I need some combination of a for loop and if/else statement to standardize the numeric variables and leave the factor variable alone. I have tried a number of permutations, but keep getting errors. For example, the following code:
dcc.s <- for (i in 1:ncol(dcc)){ sapply(dcc[,i],
if (is.numeric(dcc[,i])==TRUE) {
function(x) (x-mean(x))/sd(x) }
else {dcc[,i]})
}
returns the error:
Error in match.fun(FUN) :
c("'if (is.numeric(dcc[, i]) == TRUE) {' is not a function, character or symbol", "' function(x) (x - mean(x))/sd(x)' is not a function, character or symbol", "'} else {' is not a function, character or symbol", "' dcc[, i]' is not a function, character or symbol", "'}' is not a function, character or symbol")
Perhaps this is a simple formatting error or misplaced bracket, but I'm thoroughly stuck. I am open to other approaches if there is a more elegant way to do this. Any help would be much appreciated.

You need to use rapply instead of sapply
set.seed(1)
> df=data.frame(A=rnorm(10),b=1:10,C=as.factor(rep(1:2,5)))
> str(df)
'data.frame': 10 obs. of 3 variables:
$ A: num -0.626 0.184 -0.836 1.595 0.33 ...
$ b: int 1 2 3 4 5 6 7 8 9 10
$ C: Factor w/ 2 levels "1","2": 1 2 1 2 1 2 1 2 1 2
The code you need to use:
> D=rapply(df,scale,c("numeric","integer"),how="replace")
> D
A b C
1 -0.97190653 -1.4863011 1
2 0.06589991 -1.1560120 2
3 -1.23987805 -0.8257228 1
4 1.87433300 -0.4954337 2
5 0.25276523 -0.1651446 1
6 -1.22045645 0.1651446 2
7 0.45507643 0.4954337 1
8 0.77649606 0.8257228 2
9 0.56826358 1.1560120 1
10 -0.56059319 1.4863011 2
> str(D)
'data.frame': 10 obs. of 3 variables:
$ A: num [1:10, 1] -0.9719 0.0659 -1.2399 1.8743 0.2528 ...
..- attr(*, "scaled:center")= num 0.132
..- attr(*, "scaled:scale")= num 0.781
$ b: num [1:10, 1] -1.486 -1.156 -0.826 -0.495 -0.165 ...
..- attr(*, "scaled:center")= num 5.5
..- attr(*, "scaled:scale")= num 3.03
$ C: Factor w/ 2 levels "1","2": 1 2 1 2 1 2 1 2 1 2
>
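As the str(D) output above shows, scale returns a one-column matrix with attributes, so each standardized column ends up as num [1:10, 1]. If plain numeric vectors are preferred, a minimal sketch (the name D2 is only illustrative) is to strip the matrix structure inside the function passed to rapply:

```r
set.seed(1)
df <- data.frame(A = rnorm(10), b = 1:10, C = as.factor(rep(1:2, 5)))

# as.numeric() drops the dim and "scaled:*" attributes that scale() attaches
D2 <- rapply(df, function(x) as.numeric(scale(x)),
             classes = c("numeric", "integer"), how = "replace")

str(D2)  # A and b should now be plain num vectors; C is still a factor
```

Whether the attributes matter depends on what is done with the data afterwards; some modelling functions handle matrix columns fine, others do not.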

Here is a dplyr and scale solution.
For dplyr < 1.0.0
require(dplyr)
df %>% mutate_if(is.numeric, scale)
# a runif(20) rnorm(20)
#1 y 0.5783877 -0.004177104
#2 n -0.2344854 -0.866626472
#3 m 1.5629961 1.526857969
#4 h 0.9648646 -1.557975547
#5 u -0.7212756 0.533400304
#6 u 1.4753675 -0.072289864
#7 b 0.5346870 -0.464299111
#8 l -0.4287559 0.426600473
#9 m -1.2050841 -0.880135405
#10 h -0.6150410 -0.040636433
#11 r 1.3768249 -0.719785950
#12 a -1.3929511 0.083010969
#13 a -0.4422665 0.385574213
#14 l -0.7719473 -0.934716525
#15 m 1.4483803 0.131974911
#16 k 0.6291919 2.598581195
#17 k -1.0356817 -1.018890381
#18 s -1.0960083 1.560216350
#19 y -0.8826702 -0.367821579
#20 v 0.2554671 -0.318862011
For dplyr >= 1.0.0
df %>% mutate(across(where(is.numeric), scale))
Note that scale(x) will do the same as (x - mean(x)) / sd(x); if you want to scale based on different metrics (e.g. a robust/modified Z score based on the median and MAD) you can do that using sweep.
Sample data
set.seed(2017);
df <- cbind.data.frame(a = factor(sample(letters, 20, replace = T)), runif(20), rnorm(20));
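The sweep suggestion above is not spelled out, so here is a rough sketch of a robust Z score based on the median and MAD. The column names u and n and the helper variables are my own choices for illustration, not part of the original answer, and mad() is used with its default scaling constant.

```r
set.seed(2017)
df <- data.frame(a = factor(sample(letters, 20, replace = TRUE)),
                 u = runif(20),
                 n = rnorm(20))

num  <- vapply(df, is.numeric, logical(1))  # TRUE for the numeric columns
m    <- as.matrix(df[num])
med  <- apply(m, 2, median)                 # per-column centre
mads <- apply(m, 2, mad)                    # per-column spread

m <- sweep(m, 2, med, "-")                  # subtract each column's median
m <- sweep(m, 2, mads, "/")                 # divide by each column's MAD
df[num] <- as.data.frame(m)                 # factor column a is left untouched
```

The same pattern works for any centre/spread pair: compute one value per column, then sweep it out along margin 2.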

ind <- sapply(dcc, is.numeric)
dcc.s <- as.data.frame(lapply(dcc[,ind], function(x) (x-mean(x))/sd(x)))
dcc.s <- cbind(dcc, dcc.s)
If you don't need the "old" dataframe you can also do
ind <- sapply(dcc, is.numeric)
dcc[ind] <- lapply(dcc[ind], function(x) (x - mean(x)) / sd(x))
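For comparison, the for loop with if/else that the question was attempting can also be made to work. The attempt failed because the if/else block was passed to sapply as its function argument, and because a for loop returns NULL rather than a data frame. A minimal sketch on made-up data (the toy dcc below is an assumption, not the asker's data):

```r
# toy stand-in for the real data frame
dcc <- data.frame(x = c(1, 2, 3, 4),
                  n = c(10L, 20L, 30L, 40L),
                  f = factor(c("0", "1", "0", "1")))

dcc.s <- dcc                      # work on a copy, keep the original
for (i in seq_along(dcc.s)) {
  if (is.numeric(dcc.s[[i]])) {   # TRUE for both numeric and integer columns
    dcc.s[[i]] <- (dcc.s[[i]] - mean(dcc.s[[i]])) / sd(dcc.s[[i]])
  }                               # anything else (e.g. factors) is left alone
}
str(dcc.s)
```

The loop assigns back into the data frame column by column, which is why the factor column survives with its position and name intact.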

Related

caret::predict giving Error: $ operator is invalid for atomic vectors

This has been driving me crazy, and I've been looking through similar posts all day but can't seem to solve my problem. I have a naive Bayes model trained and stored as model. I'm attempting to predict with a newdata data frame, but I keep getting the error Error: $ operator is invalid for atomic vectors. Here is what I am running: stats::predict(model, newdata = newdata), where newdata is the first row of another data frame: newdata <- pbp[1, c("balls", "strikes", "outs_when_up", "stand", "pitcher", "p_throws", "inning")]
class(newdata) gives [1] "tbl_df" "tbl" "data.frame".
The issue is with the data used. It should match the levels used in training. For example, if we use one of the rows from trainingData to predict, it works:
predict(model, head(model$trainingData, 1))
#[1] Curveball
#Levels: Changeup Curveball Fastball Sinker Slider
Checking the str of both datasets shows that some columns stored as factors in the training data are character (stand, p_throws) or integer (pitcher) in newdata:
str(model$trainingData)
'data.frame': 1277525 obs. of 7 variables:
$ pitcher : Factor w/ 1390 levels "112526","115629",..: 277 277 277 277 277 277 277 277 277 277 ...
$ stand : Factor w/ 2 levels "L","R": 1 1 2 2 2 2 2 1 1 1 ...
$ p_throws : Factor w/ 2 levels "L","R": 2 2 2 2 2 2 2 2 2 2 ...
$ balls : num 0 1 0 1 2 2 2 0 0 0 ...
$ strikes : num 0 0 0 0 0 1 2 0 1 2 ...
$ outs_when_up: num 1 1 1 1 1 1 1 2 2 2 ...
$ .outcome : Factor w/ 5 levels "Changeup","Curveball",..: 3 4 1 4 1 5 5 1 1 5 ...
str(newdata)
tibble [1 × 6] (S3: tbl_df/tbl/data.frame)
$ balls : int 3
$ strikes : int 2
$ outs_when_up: int 1
$ stand : chr "R"
$ pitcher : int 605200
$ p_throws : chr "R"
One option is to convert those columns in newdata to factors with the same levels as in the training data:
nm1 <- intersect(names(model$trainingData), names(newdata))
nm2 <- names(which(sapply(model$trainingData[nm1], is.factor)))
newdata[nm2] <- Map(function(x, y) factor(x, levels = levels(y)), newdata[nm2], model$trainingData[nm2])
Now do the prediction
predict(model, newdata)
#[1] Sinker
#Levels: Changeup Curveball Fastball Sinker Slider
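Since the original data are not available, here is a small self-contained sketch of the same level-matching step on invented data. The train data frame stands in for model$trainingData and its values are made up; no caret model is fitted here.

```r
# hypothetical stand-in for model$trainingData
train <- data.frame(stand   = factor(c("L", "R", "R")),
                    pitcher = factor(c("112526", "605200", "115629")),
                    balls   = c(0, 1, 2))

# new row with character/integer columns, as in the question
newdata <- data.frame(balls = 3L, stand = "R", pitcher = 605200L,
                      stringsAsFactors = FALSE)

nm1 <- intersect(names(train), names(newdata))
nm2 <- names(which(sapply(train[nm1], is.factor)))  # columns that must be factors

# rebuild each column as a factor using the training levels
newdata[nm2] <- Map(function(x, y) factor(x, levels = levels(y)),
                    newdata[nm2], train[nm2])
str(newdata)
```

One thing to watch for: a value that never occurred in training has no matching level and becomes NA, so it is worth checking anyNA(newdata) before predicting.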

After inserting an apply instead of loop

I changed my dataset to a data.table and I'm using sapply (apply family), but so far that wasn't sufficient. Is this fully correct?
I already went from this:
library(data.table)
library(lubridate)
buying_volume_before_breakout <- list()
for (e in 1:length(df_1_30sec_5min$date_time)) {
interval <- dolar_tick_data_unified_dt[date_time <= df_1_30sec_5min$date_time[e] &
date_time >= df_1_30sec_5min$date_time[e] - time_to_collect_volume &
Type == "Buyer"]
buying_volume_before_breakout[[e]] <- sum(interval$Quantity)
}
To this (created a function and used sapply):
fun_buying_volume_before_breakout <- function(e) {
interval <- dolar_tick_data_unified_dt[date_time <= df_1_30sec_5min$date_time[e] &
date_time >= df_1_30sec_5min$date_time[e] - time_to_collect_volume &
Type == "Buyer"]
return(sum(interval$Quantity))
}
buying_volume_before_breakout <- sapply(1:length(df_1_30sec_5min$date_time), fun_buying_volume_before_breakout)
I couldn't make my data reproducible but here are some more insights about its structure.
> str(dolar_tick_data_unified_dt)
Classes ‘data.table’ and 'data.frame': 3120650 obs. of 6 variables:
$ date_time : POSIXct, format: "2017-06-02 09:00:35" "2017-06-02 09:00:35" "2017-06-02 09:00:35" ...
$ Buyer_from : Factor w/ 74 levels "- - ","- - BGC LIQUIDEZ DTVM",..: 29 44 19 44 44 44 44 17 17 17 ...
$ Price : num 3271 3271 3272 3271 3271 ...
$ Quantity : num 5 5 5 5 5 5 10 5 50 25 ...
$ Seller_from: Factor w/ 73 levels "- - ","- - BGC LIQUIDEZ DTVM",..: 34 34 42 28 28 28 28 34 45 28 ...
$ Type : Factor w/ 4 levels "Buyer","Direct",..: 1 3 1 1 1 1 1 3 3 3 ...
- attr(*, ".internal.selfref")=<externalptr>
> str(df_1_30sec_5min)
Classes ‘data.table’ and 'data.frame': 3001 obs. of 13 variables:
$ date_time : POSIXct, format: "2017-06-02 09:33:30" "2017-06-02 09:49:38" "2017-06-02 10:00:41" ...
$ Price : num 3251 3252 3256 3256 3260 ...
$ fast_small_mm : num 3250 3253 3254 3256 3259 ...
$ slow_small_mm : num 3254 3253 3254 3256 3259 ...
$ fast_big_mm : num 3255 3256 3256 3256 3258 ...
$ slow_big_mm : num 3258 3259 3260 3261 3262 ...
$ breakout_strength : num 6.5 2 0.5 2 2.5 0.5 1 2.5 1 0.5 ...
$ buying_volume_before_breakout: num 1285 485 680 985 820 ...
$ total_volume_before_breakout : num 1285 485 680 985 820 ...
$ average_buying_volume : num 1158 338 318 394 273 ...
$ average_total_volume : num 1158 338 318 394 273 ...
$ relative_strenght : num 1 1 1 1 1 1 1 1 1 1 ...
$ relative_strenght_last_6min : num 1 1 1 1 1 1 1 1 1 1 ...
- attr(*, ".internal.selfref")=<externalptr>
First, separate the 'buyer' data from the rest. Then add a column for the start of the time interval and do a non-equi join in data.table, which is what #chinsoon is suggesting. I've made a reproducible example below:
library(data.table)
set.seed(123)
N <- 1e5
# Filter buyer details first
buyer_dt <- data.table(
tm = Sys.time()+runif(N,-1e6,+1e6),
quantity=round(runif(N,1,20))
)
time_dt <- data.table(
t = seq(
min(buyer_dt$tm),
max(buyer_dt$tm),
by = 15*60
)
)
t_int <- 300
time_dt[,t1:=t-t_int]
library(rbenchmark)
benchmark(
a={ # Your sapply code
bv1 <- sapply(1:nrow(time_dt), function(i){
buyer_dt[between(tm,time_dt$t[i]-t_int,time_dt$t[i]),sum(quantity)]
})
},
b={ # data.table non-equi join
all_intervals <- buyer_dt[time_dt,.(t,quantity),on=.(tm>=t1,tm<=t)]
bv2 <- all_intervals[,sum(quantity),by=.(t)]
}
,replications = 9
)
#> test replications elapsed relative user.self sys.self user.child
#> 1 a 9 42.75 158.333 81.284 0.276 0
#> 2 b 9 0.27 1.000 0.475 0.000 0
#> sys.child
#> 1 0
#> 2 0
Edit: In general, any join of two tables A and B is a subset of the cross join (Cartesian product) [A x B]. The rows of [A x B] contain all possible combinations of the rows of A and the rows of B. An equi join subsets [A x B] by checking equality conditions, i.e. if x and y are the join columns in A and B, your join will be: rows from [A x B] where A.x = B.x and A.y = B.y.
In a non-equi join, the subset condition uses comparison operators other than =. For example, in your case you want rows such that A.x <= B.x <= A.x + delta.
I don't know much about how they are implemented, but data.table has a pretty fast one that has worked well for me with large data frames.
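To make the "subset of [A x B]" idea concrete, here is a tiny base R sketch that literally builds the cross product and then filters it with the interval condition. The tables and column names (buyers, windows, start, end) are invented for illustration. This is only the conceptual definition; it would be far too slow and memory-hungry on millions of rows, which is why the optimized data.table join above is preferable in practice.

```r
buyers  <- data.frame(tm = c(1, 4, 6, 11, 14),
                      quantity = c(5, 10, 20, 40, 80))
windows <- data.frame(start = c(0, 5, 10),
                      end   = c(5, 10, 15))

# by = NULL makes merge() return the Cartesian product: 3 x 5 = 15 rows
cross <- merge(windows, buyers, by = NULL)

# the non-equi condition: keep pairs where start <= tm <= end
hits <- cross[cross$tm >= cross$start & cross$tm <= cross$end, ]

# total quantity per window
totals <- aggregate(quantity ~ end, data = hits, FUN = sum)
totals
```

Windows with no matching rows simply drop out of totals here, whereas the data.table join with its default nomatch keeps them with an NA quantity, so the two results can differ on empty intervals.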

Calculate new column with growth rate based on two factors without splintering dataframe

Hej hej,
I would like to calculate growth rates, storing them in a new column of my data frame e.g. named growth.per.day. I am - as always - looking for a way that doesn't include hundreds and hundreds of lines of manually edited code.
I have six levels of algae and 25 levels of nutrients.
This means i have 150 "subgroups" for which i want to calculate the rates. Those subsets differ in length based on the individual algae.
So, basically:
Algae A ->
Nutrient (1) -> C.mikro.gr.L (Day 2) - C.mikro.gr.L (Day 1),C.mikro.gr.L (Day 3) - C.mikro.gr.L (Day 2) ... ;
Nutrient (2) -> C.mikro.gr.L (Day 2) - C.mikro.gr.L (Day 1),C.mikro.gr.L (Day 3) - C.mikro.gr.L (Day 2) ... etc.
I already split the data frame by algae
X <- split(data, data$ALGAE)
names(X) <- c("ANKI", "CHLAMY", "MIX_A", "MIX_B", "SCENE", "STAURA")
list2env(X, envir = .GlobalEnv)
and I have also split those again, creating the aforementioned lovely 150 subsets. Then I applied
ratio1$growth.per.day <- c(NA,ratio1[2:nrow(ratio1), 16] - ratio1[1:(nrow(ratio1)-1), 16])
which is perfect and does what I want, BUT I would really very much appreciate a shorter, more elegant way without butchering my dataframe.
'data.frame': 3550 obs. of 16 variables:
$ SAMPLE.ID : Factor w/ 150 levels "1","2","3","4",..: 1 2 3 4 5 6 7 8 9 10 ...
$ COMMUNITY : chr "com.1" "com.1" "com.1" "com.1" ...
$ NUTRIENT : Factor w/ 25 levels "1","2","3","4",..: 1 2 3 4 5 6 7 8 9 10 ...
$ RATIO : Factor w/ 23 levels "3.2","4","5.4",..: 11 9 6 4 1 14 10 8 5 2 ...
$ PHOS : Factor w/ 5 levels "0.09","0.195",..: 5 5 5 5 5 4 4 4 4 4 ...
$ NIT : Factor w/ 5 levels "1.5482","3.0964",..: 5 4 3 2 1 5 4 3 2 1 ...
$ DATUM : Factor w/ 35 levels "30.08.16","31.08.16",..: 1 1 1 1 1 1 1 1 1 1 ...
$ DAY : int 0 0 0 0 0 0 0 0 0 0 ...
$ TYPE : chr "mono" "mono" "mono" "mono" ...
$ ALGAE : Factor w/ 6 levels "ANK","CHLA","MIX A",..: 5 5 5 5 5 5 5 5 5 5 ...
$ MEAN : num 864 868 882 873 872 ...
$ GROW : num 0.00116 0.00115 0.00113 0.00115 0.00115 ...
$ FLUORO : num NA NA NA NA NA NA NA NA NA NA ...
$ MEAN.MQ : num 0.964 0.969 0.985 0.975 0.973 ...
$ GROW.MQ : num 1.04 1.03 1.02 1.03 1.03 ...
$ C.mikro.gr.L: num -764 -913 -1394 -1085 -1039 ...
I hope this sufficiently describes the problem,
Thanks so much!
Hope it is what you asked for:
df = data.frame(algae = sort(rep(LETTERS[1:6], 20)),
nutrient = rep(letters[22:26], 24),
day = rep(c(rep(1, 5),
rep(2, 5),
rep(3, 5),
rep(4, 5)), 6),
growth = runif(120, 30, 60))
library(dplyr)
df = df %>% group_by(algae, nutrient) %>% mutate(rate = c(NA, diff(growth, lag = 1)))
And here is the table for alga A and nutrient v:
algae nutrient day growth rate
<fctr> <fctr> <dbl> <dbl> <dbl>
1 A v 1 48.68547 NA
2 A v 2 55.63570 6.950232
3 A v 3 53.28569 -2.350013
4 A v 4 44.83022 -8.455465
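Note that diff works on rows in the order they appear, so the result is only a day-to-day difference if each group is sorted by day. A base R sketch of the same idea that sorts first and uses ave for the grouping (the simulated data and the column name rate are assumptions for illustration):

```r
set.seed(42)
df <- expand.grid(day = 1:4,
                  nutrient = letters[22:26],
                  algae = LETTERS[1:6])
df$growth <- runif(nrow(df), 30, 60)
df <- df[sample(nrow(df)), ]                      # shuffle rows on purpose

# sort so that days are consecutive within each algae/nutrient group
df <- df[order(df$algae, df$nutrient, df$day), ]

# per-group difference to the previous day; first day of each group is NA
df$rate <- ave(df$growth, df$algae, df$nutrient,
               FUN = function(g) c(NA, diff(g)))

head(df, 8)
```

With the real data the grouping variables would presumably be ALGAE and NUTRIENT, the ordering variable DAY, and the value column C.mikro.gr.L, so no splitting into 150 separate data frames is needed.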

R: Extracting lines from dataframes in list and splitting into new dataframes

I have a list with 3 data frames (DvE, DvS, EvS) in it:
str(Table.list2)
List of 3
$ DvE:'data.frame': 18482 obs. of 4 variables:
..$ gene : Factor w/ 18482 levels "c10000_g1_i3|m.32237",..: 1 2 3 4 5 6 7 8 9 10 ...
..$ FDR : num [1:18482] 0.502 0.982 0.936 0.411 0.461 ...
..$ log2FC : num [1:18482] 0.415 -0.245 0.728 -0.384 0.474 ...
..$ annotation: Factor w/ 4939 levels "","[Genbank](myosin heavy-chain) kinase [Calothrix sp. PCC 6303] ",..: 1 2204 2980 2204 1 2204 4622 2980 1 241 ...
$ DvS:'data.frame': 18482 obs. of 4 variables:
..$ gene : Factor w/ 18482 levels "c10000_g1_i3|m.32237",..: 1 2 3 4 5 6 7 8 9 10 ...
..$ FDR : num [1:18482] 1.25e-01 7.18e-01 2.02e-01 2.72e-13 6.02e-01 ...
..$ log2FC : num [1:18482] -0.417 0.583 2.148 1.689 -0.167 ...
..$ annotation: Factor w/ 4939 levels "","[Genbank](myosin heavy-chain) kinase [Calothrix sp. PCC 6303] ",..: 1 2204 2980 2204 1 2204 4622 2980 1 241 ...
$ EvS:'data.frame': 18482 obs. of 4 variables:
..$ gene : Factor w/ 18482 levels "c10000_g1_i3|m.32237",..: 1 2 3 4 5 6 7 8 9 10 ...
..$ FDR : num [1:18482] 1.78e-03 6.04e-01 4.09e-01 3.42e-19 3.20e-02 ...
..$ log2FC : num [1:18482] -0.832 0.828 1.42 2.073 -0.641 ...
..$ annotation: Factor w/ 4939 levels "","[Genbank](myosin heavy-chain) kinase [Calothrix sp. PCC 6303] ",..: 1 2204 2980 2204 1 2204 4622 2980 1 241 ...
all 3 dataframes have similar structure, e.g.:
> head(Table.list2$DvE)
gene FDR log2FC annotation
1 c10000_g1_i3|m.32237 0.5024600 0.4149066
2 c10000_g1_i4|m.32240 0.9818297 -0.2449509 [Pfam]Calcium-activated chloride channel
3 c10000_g1_i4|m.32242 0.9361868 0.7277203 [Pfam]LSM domain
4 c10000_g1_i5|m.32244 0.4114795 -0.3835745 [Pfam]Calcium-activated chloride channel
5 c10000_g1_i6|m.32245 0.4605157 0.4739777
6 c10000_g1_i6|m.32246 0.4965353 -0.4607749 [Pfam]Calcium-activated chloride channel
What I'd like to do is in each data frame, take out data that has FDR < 0.05 and log2FC > 0 and put in a new data frame, and then take out data that has FDR < 0.05 and log2FC < 0 and put in another data frame.
So that from a list of 3 data frames, I'd get 6 new data frames that are named:
DvE.+
DvE.-
DvS.+
DvS.-
EvS.+
EvS.-
Example output of DvE.+:
gene FDR log2FC annotation
47 c10010_g1_i4|m.32346 8.609296e-15 1.9188013 [Genbank]conserved unknown protein [Ectocarpus siliculosus]
48 c10010_g1_i4|m.32348 5.625766e-09 1.8240089 [Genbank]hypothetical protein THAOC_07134 [Thalassiosira oceanica]
155 c10037_g1_i4|m.32582 2.666894e-02 0.6669399 [Pfam]LETM1-like protein
211 c10050_g2_i2|m.32706 8.154555e-03 1.6900611 [Genbank]hypothetical protein SELMODRAFT_84252 [Selaginella moellendorffii]
243 c10057_g1_i1|m.32812 1.936893e-02 0.8141790 [Pfam]Fibrinogen alpha/beta chain family
265 c10061_g4_i2|m.32861 3.614401e-02 1.7059034 [Pfam]Maf1 regulator
I was wondering if there's a more elegant way/loop that I can do all this in rather than repeatedly writing out similar command lines?
Update:
I tried doing this:
DEG.list <- lapply(Table.list2, function(i){
pos <- i[(i$FDR < 0.05 & i$log2FC > 0),]
neg <- i[(i$FDR < 0.05 & i$log2FC < 0),]
assign(paste(i, ".+", sep=""), value=pos)
assign(paste(i, ".-", sep=""), value=neg)
})
But I got these warnings:
Warning messages:
1: In assign(paste(i, ".+", sep = ""), value = pos) :
only the first element is used as variable name
2: In assign(paste(i, ".-", sep = ""), value = neg) :
only the first element is used as variable name
3: In assign(paste(i, ".+", sep = ""), value = pos) :
only the first element is used as variable name
4: In assign(paste(i, ".-", sep = ""), value = neg) :
only the first element is used as variable name
5: In assign(paste(i, ".+", sep = ""), value = pos) :
only the first element is used as variable name
6: In assign(paste(i, ".-", sep = ""), value = neg) :
only the first element is used as variable name
Not tested:
listdf <- list(DvE = DvE, DvS = DvS, EvS = EvS) # named, so the results keep the comparison names
library(dplyr) # filtering the data
alldf <- lapply(listdf, function(i) { # each list element contains two filtered data frames
df1 <- filter(i, FDR < 0.05 & log2FC > 0) # significant, positive log2FC
df2 <- filter(i, FDR < 0.05 & log2FC < 0) # significant, negative log2FC
list(`+` = df1, `-` = df2)
})
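If the goal is six data frames named DvE.+, DvE.-, and so on, one possible sketch in base R is to return a named pair per comparison and flatten one level with unlist(recursive = FALSE), which joins the outer and inner names with a dot. The toy tables and the helper names make_tbl and split_one are invented here; the real Table.list2 would be used instead.

```r
# toy stand-in for one of the tables in Table.list2
make_tbl <- function() {
  data.frame(gene   = paste0("g", 1:8),
             FDR    = c(0.01, 0.2, 0.03, 0.5, 0.001, 0.04, 0.9, 0.02),
             log2FC = c(1.2, -0.5, -2, 0.3, 0.8, -1.1, 2, 0))
}
Table.list2 <- list(DvE = make_tbl(), DvS = make_tbl(), EvS = make_tbl())

# split one table into significant up ("+") and down ("-") subsets
split_one <- function(d) {
  list(`+` = d[d$FDR < 0.05 & d$log2FC > 0, ],
       `-` = d[d$FDR < 0.05 & d$log2FC < 0, ])
}

# flatten only one level so each element stays a data frame
DEG.list <- unlist(lapply(Table.list2, split_one), recursive = FALSE)
names(DEG.list)
```

Keeping the results in a list (DEG.list[["DvE.+"]]) is usually easier to work with than calling assign, and list2env(DEG.list, envir = .GlobalEnv) is still available if separate objects are really needed.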

Replacing just '0' (single zeros) in a column, without replacing the zeros in larger numbers (e.g. 10, 20, 30 etc.)

I want to make any 0 values in my data frame have a positive number so that my model will work.
However, when I try to replace all zero values, I also replace the zeros that are in strings belonging to much larger numbers such as 10, 20, 30, 40... 100, 1000 etc..
How do I specify that I only want to replace those values which are actually zero, and not just any string which contains the number zero?
Thanks!
Here's the code:
total<- read.csv("total.csv")
total.rm <- na.omit(total)
#removing NAs/NAN
total.rm$mediansp[which(is.nan(total.rm$mediansp))] = NA
total.rm$mediansp[which(total.rm$mediansp==Inf)] = NA
total.rm$connections[which(is.nan(total.rm$connections))] = NA
total.rm$connections[which(total.rm$connections==Inf)] = NA
#make all 0 values positive
total.rm$mediansp <- gsub("0", "0.00001", total.rm$mediansp)
total.rm$connections <- gsub("0", "0.00001", total.rm$connections)
#remove zeros variables
total.rm$mediansp <- gsub("NA", "0", total.rm$mediansp)
total.rm$connections <- gsub("NA", "0", total.rm$connections)
total.rm$mediansp <- gsub("0", "0.01", total.rm$mediansp)
total.rm$connections <- gsub("0", "0.01", total.rm$connections)
#convert character variables to numeric variables
total.rm$mediansp <- as.numeric(total.rm$mediansp)
total.rm$connections <- as.numeric(total.rm$connections)
#plot maps with fitted values and with residuals
sc.lm <- lm (log(mediansp) ~ log(connections), total.rm, na.action="na.exclude")
total.rm$fitted.s <- predict(sc.lm, total.rm) - mean(predict(sc.lm, total.rm))
total.rm$residuals <- residuals(sc.lm)
Here's the structure:
'data.frame': 133537 obs. of 19 variables:
$ pcd : Factor w/ 1736958 levels "AB101AA","AB101AB",..:
$ pcdstatus : Factor w/ 5 levels "Insufficient Data",..: 5 5 5 5 5 5 5 5 5 5 ...
$ mbps2 : num 0 0 0 0 1 0 1 1 0 0 ...
$ averagesp : chr "16" "19.3" "14.1" "14.9" ...
$ mediansp : chr "16.2" "20" "18.7" "16.8" ...
$ maxsp : chr "23.8" "24" "20.2" "19.7" ...
$ nga : num 0 0 0 1 0 1 1 1 1 1 ...
$ connections : chr "54" "14" "98" "43" ...
$ oslaua : Factor w/ 407 levels "","95A","95B",..: 326 326 326 326 326 326 326
$ x : int 540194 540194 540300 539958 540311 539894 540311 540379 540310
$ y : int 169201 169201 169607 169584 168997 169713 168997 168749 168879
$ ctry : Factor w/ 4 levels "E92000001","N92000002",..: 1 1 1 1 1 1 1 1 1 1
$ hro2 : Factor w/ 13 levels "","E12000001",..: 8 8 8 8 8 8 8 8 8 8 ...
$ soa2 : Factor w/ 7197 levels "","E02000001",..: 145 145 135 135 145 135 145
$ urindew : int 5 5 5 5 5 5 5 5 5 5 ...
$ averagesp.lt : num 2.77 2.96 2.65 2.7 2.05 ...
$ mediansp.lt : num 2.79 3 2.93 2.82 2.09 ...
$ maxsp.lt : num 3.17 3.18 3.01 2.98 2.68 ...
$ connections.lt: num 3.99 2.64 4.58 3.76 3.22 ...
gsub performs a regex substitution, so the pattern "0" matches every zero character, including those inside 10 or 100. To replace only strings that are exactly "0", set the pattern argument in gsub to pattern = "^0$". This should solve your problem.
As an added note, it's almost certainly bad form to simply replace 0's with very small numbers to make your models work. Pick a better model.
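A short illustration of the difference, using a made-up character vector. The last part shows an alternative that avoids regex altogether by converting to numeric first and comparing values; that is my own suggestion rather than part of the answer above.

```r
x <- c("0", "10", "20.5", "100", "0")

# unanchored pattern: every "0" character is replaced, mangling 10 and 100
unanchored <- gsub("0", "0.00001", x)

# anchored pattern: only strings that are exactly "0" are replaced
anchored <- gsub("^0$", "0.00001", x)

# alternative: convert to numeric and replace by value
num <- as.numeric(x)
num[num == 0] <- 0.00001

unanchored
anchored
num
```

The numeric route also catches values such as "0.0" or "00" that the anchored pattern would miss.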

Resources