NAs introduced by coercion - mixed vector - r

I get the warning "NAs introduced by coercion". How can I get around this? Thank you for your help.
water <- 785.5
volume_water <- as.numeric(as.character(c("water", water)))
volume_water
[1] NA 785.5
This is a data frame called data:
Substance v1
1 abc 12.5
2 defg 100.0
3 hijk 100.0
4 abfg 2.0
I want to achieve:
rbind(data, volume_water)
Substance v1
1 abc 12.5
2 defg 100.0
3 hijk 100.0
4 abfg 2.0
5 water 785.5

I would create the object as a data frame, i.e.:
volume_water = data.frame(Substance="water", v1=785.5)
Then you can rbind it with data.
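The NA appears because c("water", water) builds a character vector, and as.numeric() cannot convert the string "water" to a number. A minimal sketch of the data frame approach, assuming data has a character Substance column and a numeric v1 column as shown in the question (the name mixed is only for illustration):

```r
# Rebuild the example data frame from the question
data <- data.frame(
  Substance = c("abc", "defg", "hijk", "abfg"),
  v1 = c(12.5, 100, 100, 2),
  stringsAsFactors = FALSE
)
water <- 785.5

# c() coerces everything to one type, here character
mixed <- c("water", water)

# A one-row data frame keeps each column's own type
volume_water <- data.frame(Substance = "water", v1 = water,
                           stringsAsFactors = FALSE)

# Column names match, so rbind appends the row
result <- rbind(data, volume_water)
```

Because each column keeps its own type, v1 stays numeric after the rbind.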

Related

R: Creating an index vector

I need some help with R coding here.
The data set Glass consists of 214 rows of data in which each row corresponds to a glass sample. Each row consists of 10 columns. When viewed as a classification problem, column 10 (Type) specifies the class of each observation/instance. The remaining columns are attributes that might be used to infer column 10. Here is an example of the first row:
RI Na Mg Al Si K Ca Ba Fe Type
1 1.52101 13.64 4.49 1.10 71.78 0.06 8.75 0.0 0.0 1
First, I cast column 10 so that it is interpreted by R as a factor instead of an integer value.
Now I need to create a vector with indices for all observations (it must have the values 1-214). This is needed to create training data for Naive Bayes. I know how to create a vector with 214 values, but not one that has specific indices for observations from a data frame.
If it helps, this is being done to set up training data for Naive Bayes. Thanks.
I'm not totally sure that I get what you're trying to do, so please forgive me if my solution isn't helpful. If your data frame is named 'df', you can add an index column and then use the dplyr package to move it to the front:
library(dplyr)
df['index'] <- 1:214
df <- df %>% select(index,everything())
Here's an example. So that I can post full dataframes, my dataframes will only have 10 rows...
Let's say my dataframe is:
df <- data.frame(col1 = c(2.3,6.3,9.2,1.7,5.0,8.5,7.9,3.5,2.2,11.5),
col2 = c(1.5,2.8,1.7,3.5,6.0,9.0,12.0,18.0,20.0,25.0))
So it looks like
col1 col2
1 2.3 1.5
2 6.3 2.8
3 9.2 1.7
4 1.7 3.5
5 5.0 6.0
6 8.5 9.0
7 7.9 12.0
8 3.5 18.0
9 2.2 20.0
10 11.5 25.0
If I want to add another column that is just 1,2,3,4,5,6,7,8,9,10 and call it 'index', I could do this:
library(dplyr)
df['index'] <- 1:10
df <- df %>% select(index, everything())
That will give me
index col1 col2
1 1 2.3 1.5
2 2 6.3 2.8
3 3 9.2 1.7
4 4 1.7 3.5
5 5 5.0 6.0
6 6 8.5 9.0
7 7 7.9 12.0
8 8 3.5 18.0
9 9 2.2 20.0
10 10 11.5 25.0
Hope this helps.
Another option, in base R, that does not hard-code the number of rows:
df$ind <- seq.int(nrow(df))
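For the training-data use case mentioned in the question, the index vector does not have to be stored as a column. The sketch below uses a made-up stand-in for the Glass data, and the 70/30 split and the names glass, idx, train_idx and test_idx are assumptions for illustration, not something the question specifies:

```r
set.seed(1)  # make the random split reproducible

# Hypothetical stand-in with 214 rows (the real Glass data has 10 columns)
glass <- data.frame(
  RI = rnorm(214),
  Type = factor(sample(1:7, 214, replace = TRUE))
)

# Index vector 1..214, one entry per observation
idx <- seq_len(nrow(glass))

# Draw 70% of the indices for training; the rest are for testing
train_idx <- sample(idx, size = floor(0.7 * length(idx)))
test_idx <- setdiff(idx, train_idx)

train <- glass[train_idx, ]
test <- glass[test_idx, ]
```

Subsetting by an index vector leaves the original data frame untouched, which is convenient if several different splits are needed.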

How to use sapply - switch logic

I have a data frame that I am using for a small educational project.
EVTYPE PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP
192527 URBAN/SMALL STREAM FLOODING 0.0 5 0
192938 HEAVY SNOW 1.7 5 0
193995 HAIL 30.0 5 25 M
194223 THUNDERSTORM WINDS 0.1 5 0
195672 THUNDERSTORM WINDS 0.0 5 0
198497 THUNDERSTORM WINDS 10.0 5 0
My objective is to create a new column named PropAmtDmg that takes the following form:
If PROPDMGEXP = "5" then 5 * PROPDMG
t1$PropAmtDmg <- ifelse(t1$PROPDMGEXP == "7", t1$PROPDMG * 7,
ifelse(t1$PROPDMGEXP == "5", t1$PROPDMG * 5,
0))
I might have more cases than the two that I mentioned.
I would like to do this with sapply.
I would like to suggest the data.table package for this task. data.table enhances R's built-in data frames and is very fast. It avoids repeatedly copying the data, so it is memory efficient when your data is large. The example below uses a small sample data.table called dtb:
require(data.table)
set.seed(123) #set the seed so this can be replicated
dtb = data.table(PROPDMGEXP = sample(1:10, 10), PROPDMG = sample(1:10,10)) #sample data.table
dtb[(PROPDMGEXP %in% c(5,7)),rslt:=PROPDMG*PROPDMGEXP]
You are done. Here is the result:
PROPDMGEXP PROPDMG rslt
1: 3 10 NA
2: 8 5 NA
3: 4 6 NA
4: 7 9 63
5: 6 1 NA
6: 1 7 NA
7: 10 8 NA
8: 9 4 NA
9: 2 3 NA
10: 5 2 10
Note: if you want to make all the other entries 0 you can do this instead:
dtb[,rslt:=0][(PROPDMGEXP %in% c(5,7)),rslt:=PROPDMG*PROPDMGEXP]
You can combine all the conditions into a single one like this:
transform(t1,PropAmtDmg=ifelse(PROPDMGEXP %in% c(5,7),PROPDMG*PROPDMGEXP,0))
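Since the question asks specifically about sapply, here is a minimal sketch that combines it with switch. It assumes PROPDMGEXP holds codes such as "5" or "7" as text (or as a factor), that any other code should give 0, and the sample values and the name multiplier are made up for illustration:

```r
# Hypothetical sample rows; PROPDMGEXP is stored as character
t1 <- data.frame(
  PROPDMG = c(0, 1.7, 30, 0.1),
  PROPDMGEXP = c("5", "7", "5", "M"),
  stringsAsFactors = FALSE
)

# switch() picks a multiplier per code; the unnamed last value is the default
multiplier <- sapply(
  as.character(t1$PROPDMGEXP),   # as.character also covers factor columns
  function(code) switch(code, "5" = 5, "7" = 7, 0),
  USE.NAMES = FALSE
)

t1$PropAmtDmg <- t1$PROPDMG * multiplier
```

Adding another case only needs one more named argument in the switch call, which scales better than nesting further ifelse calls.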

R Programming Calculate Rows Average

How can I use R to calculate row means?
Sample data:
f<- data.frame(
name=c("apple","orange","banana"),
day1sales=c(2,5,4),
day1sales=c(2,8,6),
day1sales=c(2,15,24),
day1sales=c(22,51,13),
day1sales=c(5,8,7)
)
Expected results:
More columns will be added to the table later. For example, the expected results currently only go up to day1sales.4, but after more data is loaded the table will extend to day1sales.6 and so on. How can I compute the average of each row across all of these columns?
With rowMeans:
> rowMeans(f[-1])
## [1] 6.6 17.4 10.8
You can also add a column of means to the data set:
> f$AvgSales <- rowMeans(f[-1])
> f
## name day1sales day1sales.1 day1sales.2 day1sales.3 day1sales.4 AvgSales
## 1 apple 2 2 2 22 5 6.6
## 2 orange 5 8 15 51 8 17.4
## 3 banana 4 6 24 13 7 10.8
rowMeans is the simplest way. The apply function can also apply a function along the rows or columns of a data frame. In this case you want to apply the mean function to the rows:
f$AverageSales <- apply(f[, 2:length(f)], 1, mean)
(I changed 6 to length(f) since you say you may add more columns.)
This adds an AverageSales column to the data frame f with the values that you want:
> f
## name day1sales day1sales.1 day1sales.2 day1sales.3 day1sales.4 AverageSales
##1 apple 2 2 2 22 5 6.6
##2 orange 5 8 15 51 8 17.4
##3 banana 4 6 24 13 7 10.8
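Because more sales columns may be added later, one option is to select them by name rather than by position, so that the average column itself is never included by accident. A minimal sketch, assuming every sales column name starts with "day1sales" (the name sales_cols is only for illustration):

```r
# Sample data from the question; data.frame() renames the repeated
# column names to day1sales, day1sales.1, ..., day1sales.4
f <- data.frame(
  name = c("apple", "orange", "banana"),
  day1sales = c(2, 5, 4),
  day1sales = c(2, 8, 6),
  day1sales = c(2, 15, 24),
  day1sales = c(22, 51, 13),
  day1sales = c(5, 8, 7)
)

# Pick every column whose name starts with "day1sales"
sales_cols <- grep("^day1sales", names(f), value = TRUE)

# Row-wise mean over just those columns
f$AverageSales <- rowMeans(f[sales_cols])
```

Re-running the last two lines after new day1sales columns arrive recomputes the average over all of them.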

Function to store data.frames and calculate mean?

I'm trying to come up with a function and got stuck. I need to run a function (ses.mpd) 1000 times with randomized matrices. The outputs (data.frames) should be stored and then a data.frame with means of those 1000 output data.frames should be calculated.
Example:
output data.frames
ntaxa mpd.obs mpd.rand.mean mpd.rand.sd
sample1 3 10 9 0.2
sample2 6 15 12 0.6
sample3 4 9 10 0.1
ntaxa mpd.obs mpd.rand.mean mpd.rand.sd
sample1 6 12 10 0.5
sample2 4 12 15 0.3
sample3 7 4 7 0.3
result data.frame should look like this
ntaxa mpd.obs mpd.rand.mean mpd.rand.sd
sample1 4.5 11 9.5 0.35
sample2 5 13.5 13.5 0.45
sample3 5.5 6.5 8.5 0.2
I think I have to save the 1000 data.frames in a list and then maybe use the ddply function in plyr, but I don't really have an idea of how to do this.
If all the matrices are the same (e.g. same dimensions and same variable locations), then I would store them in a 3d array and use apply or rowMeans, etc. The latter will be faster.
Using a built-in dataset:
> dim(UCBAdmissions)
[1] 2 2 6
> rowMeans( UCBAdmissions, dims=c(2) )
Gender
Admit Male Female
Admitted 199.6667 92.83333
Rejected 248.8333 213.00000
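Applied to the question, the same idea could look like the sketch below. The two data frames are the example outputs from the question, standing in for real ses.mpd results, and the names results, arr and means are chosen for illustration. It assumes every output has the same rows and columns in the same order:

```r
# Two example outputs from the question, standing in for ses.mpd results
out1 <- data.frame(
  ntaxa = c(3, 6, 4), mpd.obs = c(10, 15, 9),
  mpd.rand.mean = c(9, 12, 10), mpd.rand.sd = c(0.2, 0.6, 0.1),
  row.names = c("sample1", "sample2", "sample3")
)
out2 <- data.frame(
  ntaxa = c(6, 4, 7), mpd.obs = c(12, 12, 4),
  mpd.rand.mean = c(10, 15, 7), mpd.rand.sd = c(0.5, 0.3, 0.3),
  row.names = c("sample1", "sample2", "sample3")
)

# Collect the outputs in a list (1000 elements in the real case)
results <- list(out1, out2)

# Stack the numeric matrices into a samples x columns x runs array
arr <- simplify2array(lapply(results, as.matrix))

# Average over the third dimension, keeping the first two
means <- as.data.frame(rowMeans(arr, dims = 2))
```

In the real case the list could be filled by calling the function inside lapply or replicate with simplify = FALSE; the averaging step stays the same.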

R data.table reshape chunks of columns at once

Let's say I have a data.table with these columns:
nodeID
hour1aaa
hour1bbb
hour1ccc
hour2aaa
hour2bbb
hour2ccc
...
hour24aaa
hour24bbb
hour24ccc
for a total of 72 columns. Let's call it rawtable
I want to reshape it so I have
nodeID
hour
aaa
bbb
ccc
for a total of just these 5 columns
where the hour column will contain whichever hour from the original 72 that it should be.
Let's call it newshape
The way I'm doing it now is to use rbindlist with 24 items where each item is the proper subset of the bigger data.table. Like this (except I'm leaving out most of the hours in my example)
newshape<-rbindlist(list(
rawtable[,list(nodeID, Hour=1, aaa=hour1aaa, bbb=hour1bbb, ccc=hour1ccc)],
rawtable[,list(nodeID, Hour=2, aaa=hour2aaa, bbb=hour2bbb, ccc=hour2ccc)],
rawtable[,list(nodeID, Hour=24, aaa=hour24aaa, bbb=hour24bbb, ccc=hour24ccc)]))
Here is some sample data to play with
rawtable<-data.table(nodeID=c(1,2),hour1aaa=c(12.4,32),hour1bbb=c(61.1,65.33),hour1ccc=c(-4.2,54),hour2aaa=c(12.2,1.2),hour2bbb=c(12.2,5.7),hour2ccc=c(5.6,101.9),hour24aaa=c(45.2,8.5),hour24bbb=c(23,7.9),hour24ccc=c(98,32.3))
Using my rbindlist approach gives the desired result but, as with most things I do with R, there is probably a better way. By better I mean more memory efficient, faster, and/or using fewer lines of code. Does anyone have a better way to achieve this?
This is a classic reshape problem if you get your names in the standard convention it expects, though I'm not sure this really harnesses the efficiency of the data.table structure:
reshape(
setNames(rawtable, gsub("(\\D+)(\\d+)(\\D+)", "\\3.\\2", names(rawtable))),
idvar="nodeID", direction="long", varying=-1
)
Result:
nodeID hour aaa bbb ccc
1: 1 1 12.4 61.10 -4.2
2: 2 1 32.0 65.33 54.0
3: 1 2 12.2 12.20 5.6
4: 2 2 1.2 5.70 101.9
5: 1 24 45.2 23.00 98.0
6: 2 24 8.5 7.90 32.3
@Arun's answer over here: https://stackoverflow.com/a/15510828/496803 may also be useful if you can adapt it to your current data.
One option is to use merged.stack from my package "splitstackshape". This function stacks groups of columns and then merges the output together. Because of how the function creates the "time" variable, you can specify whatever you want to strip out from the column names. In this case, we want to strip out "hour", "aaa", "bbb", and "ccc" and have just the numbers remaining.
library(splitstackshape)
## Make sure you're using at least 1.2.0
packageVersion("splitstackshape")
# [1] ‘1.2.0’
merged.stack(rawtable, id.vars="nodeID",
var.stubs=c("aaa", "bbb", "ccc"),
sep="hour|aaa|bbb|ccc")
# nodeID .time_1 aaa bbb ccc
# 1: 1 1 12.4 61.10 -4.2
# 2: 1 2 12.2 12.20 5.6
# 3: 1 24 45.2 23.00 98.0
# 4: 2 1 32.0 65.33 54.0
# 5: 2 2 1.2 5.70 101.9
# 6: 2 24 8.5 7.90 32.3
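Another possibility is data.table's own melt with patterns(), which melts several groups of columns at once. This is a sketch that assumes a data.table version where melt accepts patterns() (1.9.6 or later, as far as I know) and that the aaa columns appear in the same hour order as the bbb and ccc columns. The names long and hours are only for illustration:

```r
library(data.table)

# Sample data from the question
rawtable <- data.table(
  nodeID = c(1, 2),
  hour1aaa = c(12.4, 32), hour1bbb = c(61.1, 65.33), hour1ccc = c(-4.2, 54),
  hour2aaa = c(12.2, 1.2), hour2bbb = c(12.2, 5.7), hour2ccc = c(5.6, 101.9),
  hour24aaa = c(45.2, 8.5), hour24bbb = c(23, 7.9), hour24ccc = c(98, 32.3)
)

# Melt the three column groups in one pass; "hour" first holds 1, 2, 3, ...
# meaning "first match, second match, ..." rather than the real hour
long <- melt(
  rawtable,
  id.vars = "nodeID",
  measure.vars = patterns("aaa$", "bbb$", "ccc$"),
  value.name = c("aaa", "bbb", "ccc"),
  variable.name = "hour"
)

# Recover the real hour numbers from the original column names
hours <- as.integer(gsub("\\D", "", grep("aaa$", names(rawtable), value = TRUE)))
long[, hour := hours[as.integer(hour)]]
```

The extra lookup step is needed because melt only records the position of each matched column, not the number embedded in its name.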
