R: Creating an index vector - r

I need some help with R coding here.
The data set Glass consists of 214 rows of data in which each row corresponds to a glass sample. Each row consists of 10 columns. When viewed as a classification problem, column 10 (Type) specifies the class of each observation/instance. The remaining columns are attributes that might be used to infer column 10. Here is an example of the first row:
RI Na Mg Al Si K Ca Ba Fe Type
1 1.52101 13.64 4.49 1.10 71.78 0.06 8.75 0.0 0.0 1
First, I cast column 10 so that R interprets it as a factor instead of an integer value.
Now I need to create a vector of indices for all observations (it must hold the values 1-214). This is needed to create training data for Naive Bayes. I know how to create a vector with 214 values, but not one that holds the specific indices of the observations in a data frame.
Thanks.

I'm not totally sure that I get what you're trying to do... So please forgive me if my solution isn't helpful. If your df's name is 'df', just use the dplyr package for reordering your columns and write
library(dplyr)
df['index'] <- 1:214
df <- df %>% select(index,everything())
Here's an example. So that I can post full dataframes, my dataframes will only have 10 rows...
Let's say my dataframe is:
df <- data.frame(col1 = c(2.3,6.3,9.2,1.7,5.0,8.5,7.9,3.5,2.2,11.5),
col2 = c(1.5,2.8,1.7,3.5,6.0,9.0,12.0,18.0,20.0,25.0))
So it looks like
col1 col2
1 2.3 1.5
2 6.3 2.8
3 9.2 1.7
4 1.7 3.5
5 5.0 6.0
6 8.5 9.0
7 7.9 12.0
8 3.5 18.0
9 2.2 20.0
10 11.5 25.0
If I want to add another column that just is 1,2,3,4,5,6,7,8,9,10... and I'll call it 'index' ...I could do this:
library(dplyr)
df['index'] <- 1:10
df <- df %>% select(index, everything())
That will give me
index col1 col2
1 1 2.3 1.5
2 2 6.3 2.8
3 3 9.2 1.7
4 4 1.7 3.5
5 5 5.0 6.0
6 6 8.5 9.0
7 7 7.9 12.0
8 8 3.5 18.0
9 9 2.2 20.0
10 10 11.5 25.0

Hope this will help
df$ind <- seq.int(nrow(df))
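If the index vector is only needed to pick training rows, it does not have to be stored as a column at all. The following is a minimal base R sketch under stated assumptions: the built-in iris data stands in for Glass (hypothetical substitute), and the 70/30 split and the seed are illustrative choices, not something the question specifies.

```r
# Hypothetical stand-in for the Glass data; any data frame works the same way.
glass_like <- iris

# Index vector with one entry per observation: 1, 2, ..., nrow(glass_like).
idx <- seq_len(nrow(glass_like))

# Assumed 70/30 split; both the proportion and the seed are illustrative.
set.seed(42)
train_idx <- sample(idx, size = round(0.7 * length(idx)))
test_idx  <- setdiff(idx, train_idx)

# Subset rows by index to get training and test sets.
train <- glass_like[train_idx, ]
test  <- glass_like[test_idx, ]
```

For the Glass data the same code would give an index vector running from 1 to 214.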


Adding a header to columns based on the values of rows

I have the following different dataframes:
df1:
Scribe Reduced A 5 2.5 3 10
Reader Reduced A 9.2 4 12 10
Optimise Reduced A 5 5.8 3 12
df2:
Convert Reduced A 14 25
Configure Reduced A 14.7 6.8
Race Reduced A 2 6.3
df3:
Abstract Reduced A 8 7.5 9 8 4.5 11
Follower Reduced A 5.5 6 14 19 6 13.5
I would like to add a header for each of the dataframes where the column names are:
Class Technique Algorithm 1 2 3 ....
My issue is not with the first three columns but with the rest of the columns (integer values). As you can see in the example, the number of columns holding these values differs, which makes it difficult for me to name them (i.e., starting from 1 and running up to the last column, for example, 4 in df1).
Can someone help me please in solving this issue?
Here is a function for you. The first argument, dat, is your data frame. The second argument, chr, is the vector of names for your first few columns.
header_fun <- function(dat, chr = c("Class", "Technique", "Algorithm")){
  # keep the given names, then number the remaining columns 1, 2, 3, ...
  dat2 <- setNames(dat, c(chr, 1:(ncol(dat) - length(chr))))
  return(dat2)
}
The function will return a new data frame with the updated header.
header_fun(df1)
# Class Technique Algorithm 1 2 3 4
# 1 Scribe Reduced A 5.0 2.5 3 10
# 2 Reader Reduced A 9.2 4.0 12 10
# 3 Optimise Reduced A 5.0 5.8 3 12
header_fun(df2)
# Class Technique Algorithm 1 2
# 1 Convert Reduced A 14.0 25.0
# 2 Configure Reduced A 14.7 6.8
# 3 Race Reduced A 2.0 6.3
header_fun(df3)
# Class Technique Algorithm 1 2 3 4 5 6
# 1 Abstract Reduced A 8.0 7.5 9 8 4.5 11.0
# 2 Follower Reduced A 5.5 6.0 14 19 6.0 13.5
DATA
df1 <- read.table(text = "Scribe Reduced A 5 2.5 3 10
Reader Reduced A 9.2 4 12 10
Optimise Reduced A 5 5.8 3 12",
header = FALSE, stringsAsFactors = FALSE)
df2 <- read.table(text = "Convert Reduced A 14 25
Configure Reduced A 14.7 6.8
Race Reduced A 2 6.3",
header = FALSE, stringsAsFactors = FALSE)
df3 <- read.table(text = "Abstract Reduced A 8 7.5 9 8 4.5 11
Follower Reduced A 5.5 6 14 19 6 13.5",
header = FALSE, stringsAsFactors = FALSE)
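With several data frames to rename, the function can be applied to all of them at once by keeping them in a list. This is a small sketch, not part of the original answer: the list dfs and its two one-row data frames are hypothetical, and header_fun is restated (using seq_len in place of 1:n) so the block runs on its own.

```r
# Same idea as header_fun above, restated so this block is self-contained.
header_fun <- function(dat, chr = c("Class", "Technique", "Algorithm")) {
  setNames(dat, c(chr, seq_len(ncol(dat) - length(chr))))
}

# Hypothetical miniature data frames with a differing number of value columns.
dfs <- list(
  a = data.frame(V1 = "Scribe", V2 = "Reduced", V3 = "A", V4 = 5, V5 = 2.5),
  b = data.frame(V1 = "Race",   V2 = "Reduced", V3 = "A", V4 = 2)
)

# Apply the renaming to every data frame in the list.
renamed <- lapply(dfs, header_fun)
names(renamed$a)
names(renamed$b)
```

Using seq_len here is a design choice for safety: it yields an empty sequence when a data frame has no extra columns, whereas 1:0 would count down.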

Calculation via factor results in a by-list - how to circumvent?

I have a data.frame as following:
Lot Wafer Voltage Slope Voltage_irradiated Slope_irradiated m_dist_lot
1 8 810 356.119 6.08423 356.427 6.13945 NA
2 8 818 355.249 6.01046 354.124 6.20855 NA
3 9 917 346.921 6.21474 346.847 6.33904 NA
4 (...)
120 9 914 353.335 6.15060 352.540 6.19277 NA
121 7 721 358.647 6.10592 357.797 6.17244 NA
122 (...)
My goal is simple to state but a bit tricky to implement, and there are surely several ways to solve it:
I want to apply a function "func" to each row according to a factor, e.g. the factor "Lot". This is done via
m_dist_lot<- by(data.frame, data.frame$Lot,func)
This actually works but the result is a by-list:
data.frame$Lot: 7
354 355 363 367 378 419 426 427 428 431 460 477 836
3.5231249 9.4229589 1.4996504 7.2984485 7.6883170 1.2354754 1.8547674 3.1129814 4.4303001 1.9634573 3.7281868 3.6182559 6.4718306
data.frame$Lot: 8
1 2 11 15 17 18 19 20 21 22 24 25
2.1415352 4.6459868 1.3485551 38.8218984 3.9988686 2.2473563 6.7186047 2.6433790 0.5869746 0.5832567 4.5321623 1.8567318
The first row seems to be the row number of the initial data.frame the data is taken from. The second row holds the calculated values.
My problem now is: How can I store these values properly in the original data.frame, in the correct rows?
For example in case of one certain calculation/row of the data frame:
m_dist_lot<- by(data.frame, data.frame$Lot,func)
results for the second row of the data.frame in
data.frame$Lot: 8
2
4.6459868
I want to store the value 4.6459868 in data.frame$m_dist_lot according to the correct row "2":
Lot Wafer Voltage Slope Voltage_irradiated Slope_irradiated m_dist_lot
1 8 810 356.119 6.08423 356.427 6.13945 NA
2 8 818 355.249 6.01046 354.124 6.20855 4.6459868
3 9 917 346.921 6.21474 346.847 6.33904 NA
4 (...)
120 9 914 353.335 6.15060 352.540 6.19277 NA
121 7 721 358.647 6.10592 357.797 6.17244 NA
122 (...)
but I don't know how. My best try actually is to use "unlist".
un<- unlist(m_dist_lot) results in
un[1]
6.354
3.523125
un[2]
6.355
9.422959
un[3]
(..)
But I still don't know how I can "separate" the information of "factor.row" and the calculated value in such a way that the information is stored correctly in the data frame.
At least when using un<- unlist(m_dist_lot, use.names = FALSE) the factors are not present:
un[1]
3.523125
un[2]
9.422959
un[3]
1.49965
(..)
But now I lack the information of how to assign these values properly into the data.frame.
Using un<- do.call(rbind, lapply(m_dist_lot, data.frame, stringsAsFactors=FALSE)) results in
(...)
7.922 0.94130936
7.976 4.89560441
8.1 2.14153516
8.2 4.64598677
8.11 1.34855514
(...)
Here I still lack a proper mapping between the calculated values and the rows of the data.frame.
I'm sure there must be a doable way. Do you know a good method?
Without reproducible data or an example of what you want func to do, I am guessing a bit here. However, I think that dplyr is going to be the answer for you.
First, I am going to use the pipe (%>%) from dplyr (exported from magrittr) to pass the builtin iris data through a series of functions. If what you are trying to calculate requires the full data.frame (and not just a column or two), you could modify this approach to do what you want (just write your function to take a data.frame, add the column(s) of interest, then return the full data.frame).
Here, I first split the iris data by Species (this creates a list, with a separate data.frame for each species). Next, I use lapply to run the function head on each element of the list. This returns a list of data.frames that now each only have three rows. (You could replace head with your function of interest here, as long as it returns a full data.frame.) Finally, I stitch each element of the list back together with bind_rows.
topIris <-
iris %>%
split(.$Species) %>%
lapply(head, n = 3) %>%
bind_rows()
This returns:
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 7.0 3.2 4.7 1.4 versicolor
5 6.4 3.2 4.5 1.5 versicolor
6 6.9 3.1 4.9 1.5 versicolor
7 6.3 3.3 6.0 2.5 virginica
8 5.8 2.7 5.1 1.9 virginica
9 7.1 3.0 5.9 2.1 virginica
Which I am going to use to illustrate the approach that I think will actually address your underlying problem.
The group_by function from dplyr allows a similar approach, but without having to split the data.frame. When a data.frame is grouped, any functions applied to it are applied separately by group. Here is an example in action, which ranks the sepal lengths within each species. This is obviously not terribly useful directly, but you could write a custom function that takes any number of columns as arguments (which are then passed in as vectors) and returns a vector of the same length (to create a new column or update an existing one). The select function at the end is only there to make it easier to see what I did:
topIris %>%
group_by(Species) %>%
mutate(rank_Sepal_Length = rank(Sepal.Length)) %>%
select(Species, rank_Sepal_Length, Sepal.Length)
Returns:
Species rank_Sepal_Length Sepal.Length
<fctr> <dbl> <dbl>
1 setosa 3 5.1
2 setosa 2 4.9
3 setosa 1 4.7
4 versicolor 3 7.0
5 versicolor 1 6.4
6 versicolor 2 6.9
7 virginica 2 6.3
8 virginica 1 5.8
9 virginica 3 7.1
I found a workaround with the help of the question "Force gsub to keep trailing zeros":
un<- do.call(rbind, lapply(list, data.frame, stringsAsFactors=FALSE))
un<- gsub(".*.","", un)
un<- regmatches(un, gregexpr("(?<=.).*", un, perl=TRUE))
rows<- data.frame(matrix(ncol = 1, nrow = lengths(un)))
colnames(rows)<- c("row_number")
rows["row_number"]<- sprintf("%s", rownames(un))
rows["row_number"]<- as.numeric(un[,1])
rows["row_number"]<- sub("^[^.]*[.]", "", format(rows[,1], width = max(nchar(rows[,1]))))
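For the by() question in this thread, one base R route that avoids parsing the names of the unlisted result is split() followed by unsplit(), which puts per-group results back in the original row order. The sketch below is only illustrative: dat, its values, and func (absolute deviation from the lot mean) are hypothetical stand-ins, and it assumes func returns exactly one value per row of the group it receives.

```r
# Hypothetical miniature of the data; Lot is the grouping factor.
dat <- data.frame(Lot     = c(8, 8, 9, 9, 7, 8),
                  Voltage = c(10, 20, 1, 3, 5, 30))

# Hypothetical per-group function: takes the rows of one lot and returns
# one value per row (here the absolute deviation from the lot mean).
func <- function(d) abs(d$Voltage - mean(d$Voltage))

# split() cuts the data frame by lot, lapply() runs func on each piece,
# and unsplit() reassembles the results in the original row order.
pieces <- lapply(split(dat, dat$Lot), func)
dat$m_dist_lot <- unsplit(pieces, dat$Lot)
dat
```

Because unsplit() uses the same grouping vector as split(), each value lands in the row it was computed from, so no row names need to be recovered afterwards.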

Cumulative summing between groups using dplyr

I have a tibble structured as follows:
day theta
1 1 2.1
2 1 2.1
3 2 3.2
4 2 3.2
5 5 9.5
6 5 9.5
7 5 9.5
Note that the tibble contains multiple rows for each day, and for each day the same value for theta is repeated an arbitrary number of times. (The tibble contains other arbitrary columns necessitating this repeating structure.)
I'd like to use dplyr to cumulatively sum values for theta across days such that, in the example above, 2.1 is added only a single time to 3.2, etc. The tibble would be mutated so as to append the new cumulative sum (c.theta) as follows:
day theta c.theta
1 1 2.1 2.1
2 1 2.1 2.1
3 2 3.2 5.3
4 2 3.2 5.3
5 5 9.5 14.8
6 5 9.5 14.8
7 5 9.5 14.8
...
My initial efforts to group_by day and then cumsum over theta resulted only in cumulative summing over the full set of data (e.g., 2.1 + 2.1 + 3.2 ...) which is undesirable. In my Stack Overflow searches, I can find many examples of cumulative summing within groups, but never between groups, as I describe above. Nudges in the right direction would be much appreciated.
Doing this in dplyr I came up with a very similar solution to PoGibas - use distinct to just get one row per day, find the sum and merge back in:
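library(dplyr)
df = read.table(text="day theta
1 1 2.1
2 1 2.1
3 2 3.2
4 2 3.2
5 5 9.5
6 5 9.5
7 5 9.5", header = TRUE)
# one row per day, then the cumulative sum across days
cumsums = df %>%
  distinct(day, theta) %>%
  mutate(ctheta = cumsum(theta))
# merge the per-day cumulative sums back onto every row
df %>%
  left_join(cumsums %>% select(day, ctheta), by = 'day')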
Not dplyr, but an alternative data.table solution:
library(data.table)
# Original table is called d
setDT(d)
merge(d, unique(d)[, .(c.theta = cumsum(theta), day)])
day theta c.theta
1: 1 2.1 2.1
2: 1 2.1 2.1
3: 2 3.2 5.3
4: 2 3.2 5.3
5: 5 9.5 14.8
6: 5 9.5 14.8
7: 5 9.5 14.8
PS: If you want to preserve other columns you have to use unique(d[, .(day, theta)])
In base R you could use split<- and tapply to return the desired result.
# construct 0 vector to fill in
dat$temp <- 0
# fill in with cumulative sum for each day
split(dat$temp, dat$day) <- cumsum(tapply(dat$theta, dat$day, head, 1))
Here, tapply returns the first element of theta for each day, which is fed to cumsum. The elements of the cumulative sum are then assigned to the rows of each day using split<-.
This returns
dat
day theta temp
1 1 2.1 2.1
2 1 2.1 2.1
3 2 3.2 5.3
4 2 3.2 5.3
5 5 9.5 14.8
6 5 9.5 14.8
7 5 9.5 14.8
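Another base R variant of the same idea (take one theta per day, cumulate, map back to the rows) can be sketched with duplicated() and match(). This is an illustrative alternative rather than one of the posted answers; the names dat, first and c.theta are chosen here for the example.

```r
dat <- data.frame(day   = c(1, 1, 2, 2, 5, 5, 5),
                  theta = c(2.1, 2.1, 3.2, 3.2, 9.5, 9.5, 9.5))

# TRUE for the first row of each day, FALSE for the repeats.
first <- !duplicated(dat$day)

# Cumulative sum over one theta value per day ...
per_day <- cumsum(dat$theta[first])

# ... then look up each row's day among the distinct days.
dat$c.theta <- per_day[match(dat$day, dat$day[first])]
dat
```

Because match() looks each day up explicitly, this does not rely on the rows of a day being adjacent, though the cumulative order follows the first appearance of each day.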

Extracting complete paired values (non-NA) from a matrix in R [duplicate]

This question already has answers here:
Remove rows with all or some NAs (missing values) in data.frame
(18 answers)
Closed 7 years ago.
I apologize if this is elementary or has been answered before, but I haven't found an answer to my question despite extensive searching. I'm also very new to programming so please bear with me here.
I have a bunch of 25 by 2 matrices of data, however some of the cells have NA values. I'm looking to extract a subset of the matrix consisting of only the complete paired values (so no NA values).
So say I have:
3.6 4.2
9.2 8.4
4.8 NA
1.1 8.2
NA 11.6
NA NA
2.7 3.5
I want:
3.6 4.2
9.2 8.4
1.1 8.2
2.7 3.5
Is there some function that would do this easily?
Thanks!
Try this
df <- read.table(text = "3.6 4.2
9.2 8.4
4.8 NA
1.1 8.2
NA 11.6
NA NA
2.7 3.5")
df[complete.cases(df), ]
# V1 V2
# 1 3.6 4.2
# 2 9.2 8.4
# 4 1.1 8.2
# 7 2.7 3.5
df <- data.frame(V1 = c(3.6,9.2,4.8,1.1,NA,NA,2.7),
                 V2 = c(4.2,8.4,NA,8.2,11.6,NA,3.5))
df[apply(!is.na(df), 1, all), ]
EDIT: I forgot about na.omit and complete.cases. Doh.
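Since the question is about matrices rather than data frames, here is a short sketch of the same approach applied to a matrix. The name m is just for illustration; drop = FALSE is added so the result stays a matrix even if only one complete row remains.

```r
# The example values from the question as a 7 x 2 matrix.
m <- matrix(c(3.6,  4.2,
              9.2,  8.4,
              4.8,  NA,
              1.1,  8.2,
              NA,  11.6,
              NA,   NA,
              2.7,  3.5),
            ncol = 2, byrow = TRUE)

# Keep only the rows in which no cell is NA.
kept <- m[complete.cases(m), , drop = FALSE]
kept

# na.omit() keeps the same rows, but also attaches an "na.action" attribute.
kept2 <- na.omit(m)
```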

Computing a "rightmost" moving average?

I would like to compute a moving average (ma) over some time series data but I would like the ma to consider the order n starting from the rightmost of my series so my last ma value corresponds to the ma of the last n values of my series. The desired function rightmost_ma would produce this output:
data <- seq(1,10)
> data
[1] 1 2 3 4 5 6 7 8 9 10
rightmost_ma(data, n=2)
NA 1.5 2.5 3.5 4.5 5.5 6.5 7.5 8.5 9.5
I was reviewing the different ma possibilities e.g. package forecast and could not find how to cover this use case. Note that the critical requirement for me is to have valid non NA ma values for the last elements of the series or in other words I want my ma to produce valid results without "looking into the future".
Take a look at the rollmean function from the zoo package:
> library(zoo)
> rollmean(zoo(1:10), 2, align ="right", fill=NA)
1 2 3 4 5 6 7 8 9 10
NA 1.5 2.5 3.5 4.5 5.5 6.5 7.5 8.5 9.5
You can also use rollapply:
> rollapply(zoo(1:10), width=2, FUN=mean, align = "right", fill=NA)
1 2 3 4 5 6 7 8 9 10
NA 1.5 2.5 3.5 4.5 5.5 6.5 7.5 8.5 9.5
I think using stats::filter is less complicated, and might have better performance (though zoo is well written).
This:
stats::filter(1:10, c(1,1)/2, sides=1)
gives:
Time Series:
Start = 1
End = 10
Frequency = 1
[1] NA 1.5 2.5 3.5 4.5 5.5 6.5 7.5 8.5 9.5
If you don't want the result to be a ts object, use as.vector on the result.
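For completeness, the rightmost_ma function named in the question can also be sketched in base R without any package, using differences of cumulative sums. This is one possible implementation under the assumption that x is a numeric vector without NAs and that 1 <= n <= length(x); it is not taken from the answers above.

```r
# Trailing ("rightmost") moving average of order n.
# Assumes x is numeric without NAs and 1 <= n <= length(x).
rightmost_ma <- function(x, n) {
  stopifnot(n >= 1, n <= length(x))
  cs  <- cumsum(c(0, x))            # cs[i + 1] is the sum of x[1..i]
  out <- rep(NA_real_, length(x))   # the first n - 1 positions stay NA
  idx <- n:length(x)
  # sum of the window x[(i - n + 1)..i], divided by the window length
  out[idx] <- (cs[idx + 1] - cs[idx + 1 - n]) / n
  out
}

rightmost_ma(1:10, n = 2)
```

Each value only uses the current and the previous n - 1 observations, so nothing "looks into the future".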
