Data extraction in R

I have a data set, data, with the following structure:
'data.frame': 153 obs. of 6 variables:
$ Ozone : int 41 36 12 18 NA 28 23 19 8 NA ...
$ Solar.R: int 190 118 149 313 NA NA 299 99 19 194 ...
$ Wind : num 7.4 8 12.6 11.5 14.3 14.9 8.6 13.8 20.1 8.6 ...
$ Temp : int 67 72 74 62 56 66 65 59 61 69 ...
$ Month : int 5 5 5 5 5 5 5 5 5 5 ...
$ Day : int 1 2 3 4 5 6 7 8 9 10 ...
Now, when I want to remove the NA values from the Ozone column, the following call returns NA with a warning:
mean(data["Ozone"], na.rm = TRUE)
[1] NA
Warning message:
In mean.default(data["Ozone"], na.rm = TRUE) :
argument is not numeric or logical: returning NA
How should I remove the NA values in the above problem?

You forgot a comma when subsetting. Without it, data["Ozone"] returns a one-column data frame, which mean() cannot handle. Include the missing comma and it'll work like a charm:
> mean(data[, "Ozone"], na.rm = TRUE)
[1] 42.12931
I'm assuming you are working with the airquality dataset.
Note that double brackets (without a comma) also work:
> mean(data[["Ozone"]], na.rm = TRUE)
[1] 42.12931
Take a look at ?Extract for further details on subsetting.
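To make the difference between the three forms concrete, here is a minimal sketch. It assumes, as the answer does, that data is the built-in airquality dataset. Note that na.rm = TRUE does not remove anything from data; it only tells mean() to skip the NA values.

```r
data <- datasets::airquality  # assumption: the question's data is airquality

# Single brackets without a comma keep the data frame structure.
class(data["Ozone"])     # a one-column data frame

# These three forms all return the column as a plain vector.
class(data[, "Ozone"])
class(data[["Ozone"]])
class(data$Ozone)

# mean() needs a numeric vector, so any of the vector forms works.
ozone_mean <- mean(data[["Ozone"]], na.rm = TRUE)
ozone_mean
```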

Related

Agricolae, tapply, error: arguments must have same length

I am new to R and I am having trouble moving forward with my data analysis. My Excel data has a lot of NAs, and I have tried troubleshooting this error. Here is my code, in case anyone can help, and a link to a sample of my data:
file:///C:/Users/steph/Documents/DLI%20ANOVA%20Sample.htm
Some of my variables have 4 reps instead of all 8 reps, so I have a lot of NAs in the Excel file. I keep getting this error when I run tapply:
Error in tapply(X = data1$gi..m3., INDEX = data1$cultivar, FUN = mean, :
arguments must have same length
library(agricolae)
data1=read.csv("DLI ANOVA Sample.csv", header=T, as.is=T)
#setting factors
block = as.factor(data1$block)
treatmentt = as.factor(data1$trt)
cultivar<-factor(data1$cv,c("CR", "LB","RF","RR","S","SNS","SNY","SSJ","YC"))
str(data1)
#Summary statistics
tapply(X = data1$growth.index, INDEX = data1$cultivar, FUN = mean, na.rm=T)
tapply(X = data1$growth.index, INDEX = data1$treatment, FUN = mean, na.rm=T)
'data.frame': 288 obs. of 24 variables:
$ block : int 1 1 2 2 3 3 4 4 1 1 ...
$ trt : chr "HL-L" "HL-L" "HL-L" "HL-L" ..
$ cv : chr "CR" "CR" "CR" "CR" ...
$ rep : int 1 2 3 4 5 6 7 8 1 2 ...
$ height : int 23 20 25 19 23 19 22 19 19 24
$ growth.index : num 0.0221 0.0258 0.0276 0.0227 0.0209
$ number.of.mature.fruit : int 34 30 35 34 28 25 40 24 12 16 ...
$ mature.fruit.fw : num 163 163 186 152 169 ...
$ number.of.immature.fruit : int 38 28 40 27 35 37 44 48 20 30 ...
$ immature.fruit.fw : num 77.4 66.6 87.6 43.4 81.3 ...
$ Total.number.of.fruit : num 72 58 75 61 63 62 84 72 32 46 ...
$ Total.fruit.fw : num 241 230 273 195 250 ...
$ Fruit.Water.Content..g. : num NA 209 NA 176 NA ...
$ Brix.. : num 4.9 NA 5.6 NA 4.7 NA 5.1 NA 5.6 NA ...
$ pH : num 4.17 NA 4.3 NA 4.1 ...
$ EC.uS.mL : num 4.46 NA 9.19 NA 8.24 ...
$ X..citric.Acid : num 0.704 NA 0.397 NA 0.653 ...
$ Sugar.Acid.Ratio : num 6.96 NA 14.11 NA 7.2 ...
$ oedema.injury.level..1.6. : int 3 3 1 2 1 1 1 2 2 1 ...
$ Stomatal.conductance : num NA 365 NA 422 NA ...
$ spad : num NA NA NA 64.3 NA 65.5 NA 68.7 NA 55.6 ...
$ Irrigation.Events : int NA 14 NA 12 NA 13 NA 16 NA 13 ...
$ WUE : num NA 0.00584 NA 0.00693 NA ...
$ transpiration..g.H2O.lost..g.dry.biomass.: num NA 117 NA 111 NA ...
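A possible explanation, assuming the script is exactly as posted: the factors are stored in standalone variables (block, treatmentt, cultivar) rather than as columns of data1, so data1$cultivar and data1$treatment are NULL. tapply() then sees an index of length 0 next to an X of length 288 and stops with "arguments must have same length". The NAs in the data are likely unrelated. The sketch below uses a small made-up data frame; the values and the reduced set of levels are hypothetical, and only the column names follow the str() output above.

```r
# Hypothetical miniature of data1 (made-up values, real column names).
data1 <- data.frame(
  cv = c("CR", "CR", "LB", "LB"),
  trt = c("HL-L", "HL-L", "HL-L", "HL-L"),
  growth.index = c(0.0221, NA, 0.0276, 0.0227)
)

# The column does not exist yet, so `$` returns NULL (length 0).
is.null(data1$cultivar)

# Store the factors as columns of data1 so tapply() can find them.
data1$cultivar  <- factor(data1$cv, levels = c("CR", "LB"))
data1$treatment <- factor(data1$trt)

# With an index of the right length, na.rm = TRUE handles the NAs.
means_by_cultivar <- tapply(X = data1$growth.index, INDEX = data1$cultivar,
                            FUN = mean, na.rm = TRUE)
means_by_cultivar
```

Alternatively, the standalone variables could be passed directly, for example INDEX = cultivar, as long as they were built from the same rows of data1.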

Retrieving corresponding column values based on row label [duplicate]

I have a data frame. Running str(data) on it gives the following:
> str(data)
'data.frame': 153 obs. of 6 variables:
$ Ozone : int 41 36 12 18 NA 28 23 19 8 NA ...
$ Solar.R: int 190 118 149 313 NA NA 299 99 19 194 ...
$ Wind : num 7.4 8 12.6 11.5 14.3 14.9 8.6 13.8 20.1 8.6 ...
$ Temp : int 67 72 74 62 56 66 65 59 61 69 ...
$ Month : int 5 5 5 5 5 5 5 5 5 5 ...
$ Day : int 1 2 3 4 5 6 7 8 9 10 ...
However, when I want to subset, for example, the rows with Ozone above 14, I use the following code, which gives me an error:
> data[data$Ozone > 14 ]
Error in `[.data.frame`(data, data$Ozone > 14) : undefined columns selected

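Without a comma, the logical vector is used to select columns, and the data frame has only 6 of them. You want the rows where that condition is true, so you need a comma: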
data[data$Ozone > 14, ]
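One caveat, sketched under the assumption that data is the built-in airquality dataset: Ozone contains NA values, and an NA in a logical row index produces a row filled with NA. Wrapping the condition in which(), or using subset(), drops those rows instead.

```r
data <- datasets::airquality  # assumption: same data as in the question

# Comparisons with NA give NA, and each NA index yields an all-NA row.
with_na_rows <- data[data$Ozone > 14, ]

# which() returns only the positions where the condition is TRUE.
no_na_rows <- data[which(data$Ozone > 14), ]

# subset() treats an NA condition as FALSE.
same_result <- subset(data, Ozone > 14)

nrow(with_na_rows)
nrow(no_na_rows)
```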

Unable to designate CSV column heads "as.factor" for R -Error

I am having an issue with assigning factors to my data CSV. Here is a summary of the data frame:
> 'data.frame': 303 obs. of 12 variables:
> PLOT : int 19 177 54 114 41 48 142 134 160 267 ...
> RANGE : int 2 12 4 8 3 4 10 9 11 18 ...
> ROW : int 4 12 9 9 11 3 7 14 10 12 ...
> REP : int 1 1 1 1 1 1 1 1 1 1 ...
> ENTRY : Factor w/ 184 levels "","17_YMG_0293",..: 40 40 77 82 87 88 102 103 103 6 ...
> PLOT_ID : Factor w/ 301 levels "","18_HZG_OvOv_001",..: 20 178 55 115 42 49 143 135 161 268 ...
> Shatter : num 9 9 9 9 9 9 9 9 9 8 ...
> Chaff.Color : Factor w/ 4 levels "","*Blank ones are segregating in color",..: 3 4 3 4 4 4 3 4 4 3 ...
> Heading_d.from.Jan.1: int 138 139 137 133 135 135 133 137 135 136 ...
> Height_cm : int 74 73 77 76 74 79 78 73 76 70 ...
> Plot.weight..kg. : num 0.26 0.18 0.19 0.14 0.33 0.19 0.13 0.11 0.24 0.18 ...
But I get this error:
HAYSData$Rep<-as.factor(HAYSData$Rep)
Error in `$<-.data.frame`(`*tmp*`, Rep, value = integer(0)) :
replacement has 0 rows, data has 303
I get the same type of error for Entry, Range, and Rows. I am not sure why; when I look at length(Entry), for example, I get 300. I even tested changing factor to numeric, but it does not help.
I don't have any NAs in my data, and each category is its own column.
I don't know if something is wrong with my CSV. I have run this same script with another CSV and had no issues in this part of the script.
Can someone please help me?

R is case-sensitive: the column is named REP, not Rep, so HAYSData$Rep is NULL and the replacement has 0 rows. Try:
HAYSData$REP <- as.factor(HAYSData$REP)
HAYSData$ENTRY <- as.factor(HAYSData$ENTRY)
HAYSData$RANGE <- as.factor(HAYSData$RANGE)
HAYSData$ROW <- as.factor(HAYSData$ROW)
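A quick way to catch this kind of mismatch is to inspect names() before assigning. The sketch below uses a tiny made-up stand-in for HAYSData (the values are hypothetical) and converts several columns in one step with lapply().

```r
# Hypothetical stand-in for HAYSData; only the column names matter here.
HAYSData <- data.frame(
  REP   = c(1L, 1L, 2L),
  RANGE = c(2L, 12L, 4L),
  ENTRY = c("a", "b", "a")
)

# Check the exact spelling and case of the column names.
names(HAYSData)

# A name with the wrong case does not match any column.
is.null(HAYSData$Rep)

# Convert several columns to factors at once, using the exact names.
factor_cols <- c("REP", "RANGE", "ENTRY")
HAYSData[factor_cols] <- lapply(HAYSData[factor_cols], as.factor)
str(HAYSData)
```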

Error in ncol(xj) : object 'xj' not found when using R matplot()

Using matplot, I'm trying to plot the 2nd, 3rd and 4th columns of airquality data.frame after dividing these 3 columns by the first column of airquality.
However I'm getting an error
Error in ncol(xj) : object 'xj' not found
Why are we getting this error? The code below will reproduce this problem.
attach(airquality)
airquality[2:4] <- apply(airquality[2:4], 2, function(x) x /airquality[1])
matplot(x= airquality[,1], y= as.matrix(airquality[-1]))

You have managed to mangle your data in an interesting way. Let's start with airquality before you mess with it. (And please don't attach() - it's unnecessary and sometimes dangerous/confusing.)
str(airquality)
'data.frame': 153 obs. of 6 variables:
$ Ozone : int 41 36 12 18 NA 28 23 19 8 NA ...
$ Solar.R: int 190 118 149 313 NA NA 299 99 19 194 ...
$ Wind : num 7.4 8 12.6 11.5 14.3 14.9 8.6 13.8 20.1 8.6 ...
$ Temp : int 67 72 74 62 56 66 65 59 61 69 ...
$ Month : int 5 5 5 5 5 5 5 5 5 5 ...
$ Day : int 1 2 3 4 5 6 7 8 9 10 ...
After you do
airquality[2:4] <- apply(airquality[2:4], 2,
function(x) x /airquality[1])
you get
'data.frame': 153 obs. of 6 variables:
$ Ozone : int 41 36 12 18 NA 28 23 19 8 NA ...
$ Solar.R:'data.frame': 153 obs. of 1 variable:
..$ Ozone: num 4.63 3.28 12.42 17.39 NA ...
$ Wind :'data.frame': 153 obs. of 1 variable:
..$ Ozone: num 0.18 0.222 1.05 0.639 NA ...
$ Temp :'data.frame': 153 obs. of 1 variable:
..$ Ozone: num 1.63 2 6.17 3.44 NA ...
$ Month : int 5 5 5 5 5 5 5 5 5 5 ...
$ Day : int 1 2 3 4 5 6 7 8 9 10 ...
or
sapply(airquality,class)
## Ozone Solar.R Wind Temp Month Day
## "integer" "data.frame" "data.frame" "data.frame" "integer" "integer"
that is, you have data frames embedded within your data frame!
rm(airquality) ## clean up
Now change one character and divide by the column airquality[,1] rather than airquality[1] (divide by a vector, not a list of length one ...)
airquality[,2:4] <- apply(airquality[,2:4], 2,
function(x) x/airquality[,1])
matplot(x= airquality[,1], y= as.matrix(airquality[,-1]))
In general it's safer to use [, ...] indexing rather than [] indexing to refer to columns of a data frame unless you really know what you're doing ...
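As an alternative sketch that avoids apply() altogether: dividing a data frame by a vector whose length equals the number of rows divides each column element by element. The copy named aq is an assumption introduced here so that the built-in airquality stays untouched.

```r
aq <- datasets::airquality  # work on a copy; `aq` is a name chosen for this sketch

# Divide columns 2 to 4 by the Ozone vector (not by the one-column data frame).
aq[, 2:4] <- aq[, 2:4] / aq[, 1]

# Every column is still a plain vector; no nested data frames.
sapply(aq, class)
```

The result can then be passed to matplot() in the same way as in the corrected code above.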
