Extracting complete paired values (non-NA) from a matrix in R [duplicate] - r

I apologize if this is elementary or has been answered before, but I haven't found an answer to my question despite extensive searching. I'm also very new to programming so please bear with me here.
I have a bunch of 25 by 2 matrices of data, but some of the cells have NA values. I'm looking to extract the subset of each matrix consisting of only the complete pairs (rows with no NA values).
So say I have:
3.6 4.2
9.2 8.4
4.8 NA
1.1 8.2
NA 11.6
NA NA
2.7 3.5
I want:
3.6 4.2
9.2 8.4
1.1 8.2
2.7 3.5
Is there some function that would do this easily?
Thanks!

Try this
df <- read.table(text = "3.6 4.2
9.2 8.4
4.8 NA
1.1 8.2
NA 11.6
NA NA
2.7 3.5")
df[complete.cases(df), ]
# V1 V2
# 1 3.6 4.2
# 2 9.2 8.4
# 4 1.1 8.2
# 7 2.7 3.5

df <- data.frame(V1 = c(3.6,9.2,4.8,1.1,NA,NA,2.7),
V2 = c(4.2,8.4,NA,8.2,11.6,NA,3.5))
# keep only the rows in which every value is non-NA
df[ apply(!is.na(df), 1, all) , ]
EDIT: I forgot about na.omit and complete.cases. Doh.
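
Since the question is about matrices rather than data frames, here is a minimal sketch showing that the same idiom can be applied to a numeric matrix directly. The names m, clean and clean2 are illustrative; the values are the seven rows from the question.

```r
# The 7 x 2 example from the question, stored as a matrix (filled row by row)
m <- matrix(c(3.6, 4.2,
              9.2, 8.4,
              4.8, NA,
              1.1, 8.2,
              NA, 11.6,
              NA, NA,
              2.7, 3.5),
            ncol = 2, byrow = TRUE)

# complete.cases() returns one logical per row: TRUE when the row has no NA.
# drop = FALSE keeps the result a matrix even if only one row survives.
clean <- m[complete.cases(m), , drop = FALSE]
clean

# na.omit() keeps the same rows, and additionally attaches an "na.action"
# attribute recording which rows were dropped
clean2 <- na.omit(m)
```

With a list of such matrices, lapply(list_of_matrices, na.omit) should apply the same cleaning to each one (list_of_matrices being a hypothetical name).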


Is there a way to collapse numeric rows by two variables while preserving non-numerical variables? [duplicate]

Need some help in R. Currently stuck solving the following task. I have the following sample table named macula
Initials ExamDate Eye Layer GCLR GCLL INLR INLL
ON 01/01/2020 R GCL 1.1 NA NA NA
ON 01/01/2020 L GCL NA 1.2 NA NA
ON 01/01/2020 R INL NA NA 1.3 NA
ON 01/01/2020 L INL NA NA NA 1.4
ON 11/11/2020 R GCL 3.1 NA NA NA
ON 11/11/2020 L GCL NA 3.2 NA NA
ON 11/11/2020 R INL NA NA 3.3 NA
ON 11/11/2020 L INL NA NA NA 3.4
TH 02/01/2020 R GCL 2.1 NA NA NA
TH 02/01/2020 L GCL NA 2.2 NA NA
TH 02/01/2020 R INL NA NA 2.3 NA
TH 02/01/2020 L INL NA NA NA 2.4
How do I get the following output, where I collapse the rows by Initials and ExamDate (since some people have multiple exam dates)? I essentially need the following table:
Initials ExamDate GCLR GCLL INLR INLL
ON 01/01/2020 1.1 1.2 1.3 1.4
ON 11/11/2020 3.1 3.2 3.3 3.4
TH 02/01/2020 2.1 2.2 2.3 2.4
I have tried the following code, but I keep getting an error saying that I cannot sum character variables, which makes sense.
try <- macula %>% select(.,
Lastname,
ExamDate,
GCLR)%>%
group_by(Lastname,ExamDate) %>%summarise_all(funs(sum))
Any help is appreciated!
macula %>%
select(-Eye, -Layer) %>%
group_by(Initials, ExamDate) %>%
summarise_all(~ sum(.x, na.rm = TRUE)) %>%
ungroup()
# # A tibble: 3 x 6
# Initials ExamDate GCLR GCLL INLR INLL
# <chr> <chr> <dbl> <dbl> <dbl> <dbl>
# 1 ON 01/01/2020 1.1 1.2 1.3 1.4
# 2 ON 11/11/2020 3.1 3.2 3.3 3.4
# 3 TH 02/01/2020 2.1 2.2 2.3 2.4
Your comment about "cannot sum character variables" suggests that the last four columns are not all numeric. If you're confident that they should all be numbers (that there are no non-numeric values in there), or are willing to assume so, then you can add as.numeric to the call. Any value that cannot be parsed as a number becomes NA with a warning, and na.rm = TRUE then drops it from the sum.
macula %>%
select(-Eye, -Layer) %>%
group_by(Initials, ExamDate) %>%
summarise_all(~ sum(as.numeric(.x), na.rm = TRUE)) %>%
ungroup()
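
For comparison, here is a base R sketch of the same collapse using aggregate, assuming the four measurement columns are already numeric. The data frame is rebuilt from the sample table in the question; measure_cols and collapsed are illustrative names.

```r
# Sample table from the question, rebuilt column by column
macula <- data.frame(
  Initials = rep(c("ON", "ON", "TH"), each = 4),
  ExamDate = rep(c("01/01/2020", "11/11/2020", "02/01/2020"), each = 4),
  Eye      = rep(c("R", "L"), times = 6),
  Layer    = rep(rep(c("GCL", "INL"), each = 2), times = 3),
  GCLR = c(1.1, NA, NA, NA, 3.1, NA, NA, NA, 2.1, NA, NA, NA),
  GCLL = c(NA, 1.2, NA, NA, NA, 3.2, NA, NA, NA, 2.2, NA, NA),
  INLR = c(NA, NA, 1.3, NA, NA, NA, 3.3, NA, NA, NA, 2.3, NA),
  INLL = c(NA, NA, NA, 1.4, NA, NA, NA, 3.4, NA, NA, NA, 2.4)
)

measure_cols <- c("GCLR", "GCLL", "INLR", "INLL")

# Sum each measurement within every Initials/ExamDate pair, ignoring NAs.
# Eye and Layer are simply left out, as in the dplyr answer.
collapsed <- aggregate(macula[measure_cols],
                       by = macula[c("Initials", "ExamDate")],
                       FUN = sum, na.rm = TRUE)
collapsed
```

One caveat that applies to the dplyr version as well: sum(..., na.rm = TRUE) returns 0 for a group whose values are all NA, so a measurement that is missing entirely for a person and date shows up as 0 rather than NA.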

R: Creating an index vector

I need some help with R coding here.
The data set Glass consists of 214 rows of data in which each row corresponds to a glass sample. Each row consists of 10 columns. When viewed as a classification problem, column 10 (Type) specifies the class of each observation/instance. The remaining columns are attributes that might be used to infer column 10. Here is an example of the first row
RI Na Mg Al Si K Ca Ba Fe Type
1 1.52101 13.64 4.49 1.10 71.78 0.06 8.75 0.0 0.0 1
First, I cast column 10 so that it is interpreted by R as a factor instead of an integer value.
Now I need to create a vector with indices for all observations (it must have the values 1-214). I know how to create a vector with 214 values, but not one that holds the specific indices of the observations in a data frame.
If it helps, this is being done to set up training data for Naive Bayes. Thanks!
I'm not totally sure that I get what you're trying to do... So please forgive me if my solution isn't helpful. If your df's name is 'df', just use the dplyr package for reordering your columns and write
library(dplyr)
df['index'] <- 1:214
df <- df %>% select(index,everything())
Here's an example. So that I can post full dataframes, my dataframes will only have 10 rows...
Let's say my dataframe is:
df <- data.frame(col1 = c(2.3,6.3,9.2,1.7,5.0,8.5,7.9,3.5,2.2,11.5),
col2 = c(1.5,2.8,1.7,3.5,6.0,9.0,12.0,18.0,20.0,25.0))
So it looks like
col1 col2
1 2.3 1.5
2 6.3 2.8
3 9.2 1.7
4 1.7 3.5
5 5.0 6.0
6 8.5 9.0
7 7.9 12.0
8 3.5 18.0
9 2.2 20.0
10 11.5 25.0
If I want to add another column that just is 1,2,3,4,5,6,7,8,9,10... and I'll call it 'index' ...I could do this:
library(dplyr)
df['index'] <- 1:10
df <- df %>% select(index, everything())
That will give me
index col1 col2
1 1 2.3 1.5
2 2 6.3 2.8
3 3 9.2 1.7
4 4 1.7 3.5
5 5 5.0 6.0
6 6 8.5 9.0
7 7 7.9 12.0
8 8 3.5 18.0
9 9 2.2 20.0
10 10 11.5 25.0
Hope this will help
df$ind <- seq.int(nrow(df))
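
If the index vector is wanted on its own, for example to split the rows into training and test sets, a sketch along the following lines may be closer to what the question describes. The stand-in data frame, the seed, and the 150/64 split are all assumptions made for illustration and are not part of the original question.

```r
set.seed(42)  # arbitrary seed, only so the illustration is reproducible

# Stand-in for the 214-row Glass data; only the number of rows matters here
df <- data.frame(RI = rnorm(214),
                 Type = factor(sample(1:6, 214, replace = TRUE)))

# Index vector with one entry per observation: 1, 2, ..., 214
idx <- seq_len(nrow(df))

# Assumed split: 150 randomly chosen rows for training, the rest for testing
train_idx <- sample(idx, size = 150)
test_idx  <- setdiff(idx, train_idx)

train <- df[train_idx, ]
test  <- df[test_idx, ]
```

seq_len(nrow(df)) is used here instead of 1:nrow(df) because it returns an empty vector for a data frame with zero rows, whereas 1:0 counts down and yields c(1, 0).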

Cumulative summing between groups using dplyr

I have a tibble structured as follows:
day theta
1 1 2.1
2 1 2.1
3 2 3.2
4 2 3.2
5 5 9.5
6 5 9.5
7 5 9.5
Note that the tibble contains multiple rows for each day, and for each day the same value for theta is repeated an arbitrary number of times. (The tibble contains other arbitrary columns necessitating this repeating structure.)
I'd like to use dplyr to cumulatively sum values for theta across days such that, in the example above, 2.1 is added only a single time to 3.2, etc. The tibble would be mutated so as to append the new cumulative sum (c.theta) as follows:
day theta c.theta
1 1 2.1 2.1
2 1 2.1 2.1
3 2 3.2 5.3
4 2 3.2 5.3
5 5 9.5 14.8
6 5 9.5 14.8
7 5 9.5 14.8
...
My initial efforts to group_by day and then cumsum over theta resulted only in cumulative summing over the full set of data (e.g., 2.1 + 2.1 + 3.2 ...) which is undesirable. In my Stack Overflow searches, I can find many examples of cumulative summing within groups, but never between groups, as I describe above. Nudges in the right direction would be much appreciated.
Doing this in dplyr I came up with a very similar solution to PoGibas - use distinct to just get one row per day, find the sum and merge back in:
library(dplyr)
df = read.table(text="day theta
1 1 2.1
2 1 2.1
3 2 3.2
4 2 3.2
5 5 9.5
6 5 9.5
7 5 9.5", header = TRUE)
cumsums = df %>%
distinct(day, theta) %>%
mutate(ctheta = cumsum(theta))
df %>%
left_join(cumsums %>% select(day, ctheta), by = 'day')
Not dplyr, but an alternative data.table solution:
library(data.table)
# Original table is called d
setDT(d)
merge(d, unique(d)[, .(c.theta = cumsum(theta), day)])
day theta c.theta
1: 1 2.1 2.1
2: 1 2.1 2.1
3: 2 3.2 5.3
4: 2 3.2 5.3
5: 5 9.5 14.8
6: 5 9.5 14.8
7: 5 9.5 14.8
PS: If you want to preserve other columns you have to use unique(d[, .(day, theta)])
In base R you could use split<- and tapply to return the desired result.
# construct 0 vector to fill in
dat$temp <- 0
# fill in with cumulative sum for each day
split(dat$temp, dat$day) <- cumsum(tapply(dat$theta, dat$day, head, 1))
Here, tapply returns the first element of theta for each day, which is fed to cumsum. The elements of the cumulative sum are then assigned to the rows of each day using split<-.
This returns
dat
day theta temp
1 1 2.1 2.1
2 1 2.1 2.1
3 2 3.2 5.3
4 2 3.2 5.3
5 5 9.5 14.8
6 5 9.5 14.8
7 5 9.5 14.8
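
Another way to get the same column without a join is to count theta only on the first row of each day. This is a sketch under the assumption that the rows are already sorted by day; first_of_day is an illustrative name.

```r
dat <- data.frame(day   = c(1, 1, 2, 2, 5, 5, 5),
                  theta = c(2.1, 2.1, 3.2, 3.2, 9.5, 9.5, 9.5))

# TRUE on the first row of each day, FALSE on the repeats
first_of_day <- !duplicated(dat$day)

# Repeats contribute 0 to the running total, so each day's theta is added once
dat$c.theta <- cumsum(dat$theta * first_of_day)
dat
```

The same expression should also work inside dplyr's mutate, e.g. mutate(c.theta = cumsum(theta * !duplicated(day))), again assuming the rows are ordered by day.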

Computing a "rightmost" moving average?

I would like to compute a moving average (ma) over some time series data, but with the window of order n aligned to the right of the series, so that my last ma value is the mean of the last n values of the series. The desired function rightmost_ma would produce this output:
data <- seq(1,10)
> data
[1] 1 2 3 4 5 6 7 8 9 10
rightmost_ma(data, n=2)
NA 1.5 2.5 3.5 4.5 5.5 6.5 7.5 8.5 9.5
I reviewed the different ma possibilities, e.g. the forecast package, and could not find how to cover this use case. The critical requirement for me is to have valid non-NA ma values for the last elements of the series; in other words, I want my ma to produce valid results without "looking into the future".
Take a look at the rollmean function from the zoo package
> library(zoo)
> rollmean(zoo(1:10), 2, align ="right", fill=NA)
1 2 3 4 5 6 7 8 9 10
NA 1.5 2.5 3.5 4.5 5.5 6.5 7.5 8.5 9.5
You can also use rollapply
> rollapply(zoo(1:10), width=2, FUN=mean, align = "right", fill=NA)
1 2 3 4 5 6 7 8 9 10
NA 1.5 2.5 3.5 4.5 5.5 6.5 7.5 8.5 9.5
I think using stats::filter is less complicated, and might have better performance (though zoo is well written).
This:
stats::filter(1:10, c(1,1)/2, sides=1)  # stats:: prefix avoids masking by dplyr::filter
gives:
Time Series:
Start = 1
End = 10
Frequency = 1
[1] NA 1.5 2.5 3.5 4.5 5.5 6.5 7.5 8.5 9.5
If you don't want the result to be a ts object, use as.vector on the result.
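
Putting the stats::filter approach into the shape the question asks for, a minimal sketch of rightmost_ma could look like this. The function name comes from the question; the argument names x and n and the equal-weights choice are assumptions.

```r
# Trailing (right-aligned) moving average of order n.
# sides = 1 makes the filter use only the current and previous n - 1 values,
# so the first n - 1 results are NA and nothing "looks into the future".
rightmost_ma <- function(x, n) {
  as.vector(stats::filter(x, rep(1 / n, n), sides = 1))
}

rightmost_ma(1:10, n = 2)
# [1]  NA 1.5 2.5 3.5 4.5 5.5 6.5 7.5 8.5 9.5
```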

plot command isn't recognizing column names

Recently, whenever I try to plot in R, I keep getting this error. Can anyone tell me why I can't seem to do a scatter plot? I've pasted the terminal screen below.
tcmg2o4 <-read.table("~/Documents/research/metal.oxides/TcMg2O4.inverse/energydata.txt")
tcmg2o4
V1 V2
1 Lattice_constant Total_energy
2 8.0 -371.63306746
3 8.1 -375.035492
4 8.2 -378.8669067
5 8.3 -380.34136459
6 8.4 -382.3921237
7 8.5 -383.60394736
8 8.6 -384.09517631
9 8.7 -383.77668067
10 8.8 -382.43806866
11 8.9 -381.42213458
12 9.0 -379.63327976
attach(tcmg2o4)
plot(Lattice_constant, Total_energy)
Error in plot(Lattice_constant, Total_energy) :
object 'Lattice_constant' not found
plot(V1,V2)
Your problem is that read.table is not treating the first line of the file as column names. Without a header, the call creates the default column names V1 and V2, and because the first row contains text, both columns are read as factors (as character vectors in R 4.0.0 and later) rather than numbers. To read the first line as column names, use header = TRUE:
tcmg2o4 <-read.table("~/Documents/research/metal.oxides/TcMg2O4.inverse/energydata.txt", header = TRUE)
You can check the structure of the object you read in with
str(tcmg2o4)
## 'data.frame': 11 obs. of 2 variables:
## $ Lattice_constant: num 8 8.1 8.2 8.3 8.4 8.5 8.6 8.7 8.8 8.9 ...
## $ Total_energy : num -372 -375 -379 -380 -382 ...
I would also avoid using attach. Instead, use with:
with(tcmg2o4, plot(Lattice_constant, Total_energy))
or rely on the fact that it is a two-column data.frame:
plot(tcmg2o4)
or use a formula to specify the y and x axes (y ~ x):
plot(Total_energy ~ Lattice_constant, data = tcmg2o4)
All of these give the same result and make it much clearer where the data is stored.
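
To see the effect of header without needing the original file, here is a small self-contained sketch that reads the first few rows of the question's data from a string via the text argument. The names raw, no_header and with_header are illustrative.

```r
# First rows of the file from the question, inlined as text
raw <- "Lattice_constant Total_energy
8.0 -371.63306746
8.1 -375.035492
8.2 -378.8669067
8.3 -380.34136459"

# Without header: the first line is treated as data, so neither column is numeric
no_header <- read.table(text = raw)
names(no_header)                 # "V1" "V2"
sapply(no_header, is.numeric)    # both FALSE

# With header = TRUE: the first line becomes the column names
with_header <- read.table(text = raw, header = TRUE)
names(with_header)               # "Lattice_constant" "Total_energy"
sapply(with_header, is.numeric)  # both TRUE
```

Once the columns are numeric and correctly named, the with() and formula calls above can find Lattice_constant and Total_energy.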
