Remove doubles with no decimal places - r

I have a vector:
x <- c(0.0, 0.5, 1.000, 1.5, 1.6, 1.7, 1.75, 2.0, 2.4, 2.5, 3.0, 74.0)
How can I extract only the values of x which contain nonzero values after the decimal point? For example, the resultant vector would look like this:
c(0.5, 1.5, 1.6, 1.7, 1.75, 2.4, 2.5)
This removes 0.0, 1.000, 2.0, 3.0, and 74.0.

Alternatively
x[x %% 1 != 0]
#[1] 0.50 1.50 1.60 1.70 1.75 2.40 2.50
or
x[trunc(x) != x]
#[1] 0.50 1.50 1.60 1.70 1.75 2.40 2.50
or
x[as.integer(x) != x]
#[1] 0.50 1.50 1.60 1.70 1.75 2.40 2.50
or (now I stop!)
x[grepl("\\.[^0]+$",x)]
#[1] 0.50 1.50 1.60 1.70 1.75 2.40 2.50
:D

We can construct a logical index with round:
x[round(x) != x]
#[1] 0.50 1.50 1.60 1.70 1.75 2.40 2.50
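All of these comparisons are exact, which is fine for values typed in literally. If the values come from floating-point arithmetic (where e.g. 0.1 + 0.2 is not exactly 0.3), a tolerance-based variant of the round approach may be safer. A sketch, with the tolerance choice being my own assumption:

```r
x <- c(0.0, 0.5, 1.000, 1.5, 1.6, 1.7, 1.75, 2.0, 2.4, 2.5, 3.0, 74.0)
tol <- sqrt(.Machine$double.eps)  # a common, somewhat arbitrary tolerance
x[abs(x - round(x)) > tol]
#[1] 0.50 1.50 1.60 1.70 1.75 2.40 2.50
```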

How to find in a column when the row at time t is different from the one at t-1? [duplicate]

This question already has an answer here:
How to filter rows based on the previous row and keep previous row using dplyr?
I have a dataframe with 3 columns. For each column I would like to know when the jth row of column i is different from its previous value. Ideally I would like to get the dates when this change happens.
Let me take an example to clarify what I mean. This is my dataframe:
df = structure(list(Date = c("2005-11-30", "2005-12-31", "2006-01-31",
"2006-02-28", "2006-03-31", "2006-04-30", "2006-05-31", "2006-06-30",
"2006-07-31", "2006-08-31", "2006-09-30", "2006-10-31", "2006-11-30",
"2006-12-31", "2007-01-31", "2007-02-28", "2007-03-31", "2007-04-30",
"2007-05-31", "2007-06-30"), MLF = c(3, 3.25, 3.25, 3.25, 3.5,
3.5, 3.5, 3.75, 3.75, 4, 4, 4.25, 4.25, 4.5, 4.5, 4.5, 4.75,
4.75, 4.75, 5), MRO = c(2, 2.25, 2.25, 2.25, 2.5, 2.5, 2.5, 2.75,
2.75, 3, 3, 3.25, 3.25, 3.5, 3.5, 3.5, 3.75, 3.75, 3.75, 4),
DFR = c(1, 1.25, 1.25, 1.25, 1.5, 1.5, 1.5, 1.75, 1.75, 2,
2, 2.25, 2.25, 2.5, 2.5, 2.5, 2.75, 2.75, 2.75, 3)), row.names = 83:102, class = "data.frame")
Date MLF MRO DFR
83 2005-11-30 3.00 2.00 1.00
84 2005-12-31 3.25 2.25 1.25
85 2006-01-31 3.25 2.25 1.25
86 2006-02-28 3.25 2.25 1.25
87 2006-03-31 3.50 2.50 1.50
88 2006-04-30 3.50 2.50 1.50
89 2006-05-31 3.50 2.50 1.50
90 2006-06-30 3.75 2.75 1.75
91 2006-07-31 3.75 2.75 1.75
92 2006-08-31 4.00 3.00 2.00
93 2006-09-30 4.00 3.00 2.00
94 2006-10-31 4.25 3.25 2.25
95 2006-11-30 4.25 3.25 2.25
96 2006-12-31 4.50 3.50 2.50
97 2007-01-31 4.50 3.50 2.50
98 2007-02-28 4.50 3.50 2.50
99 2007-03-31 4.75 3.75 2.75
100 2007-04-30 4.75 3.75 2.75
101 2007-05-31 4.75 3.75 2.75
102 2007-06-30 5.00 4.00 3.00
For each column, I would now like code that tells me, for example, that the row 2005-12-31 3.25 2.25 1.25 differs from its previous row. I would like to do this for the entire dataframe and for each column.
Can anyone help me with this?
Thanks!
You can use lag to compare each value with its previous value.
library(dplyr)
df %>% filter(MLF != lag(MLF) & MRO != lag(MRO) & DFR != lag(DFR))
# Date MLF MRO DFR
#1 2005-12-31 3.25 2.25 1.25
#2 2006-03-31 3.50 2.50 1.50
#3 2006-06-30 3.75 2.75 1.75
#4 2006-08-31 4.00 3.00 2.00
#5 2006-10-31 4.25 3.25 2.25
#6 2006-12-31 4.50 3.50 2.50
#7 2007-03-31 4.75 3.75 2.75
#8 2007-06-30 5.00 4.00 3.00
In data.table, we can use shift:
library(data.table)
setDT(df)[MLF != shift(MLF) & MRO != shift(MRO) & DFR != shift(DFR)]
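If you prefer to stay in base R, the same comparison can be sketched with head() to build the lagged vectors; wrapping the index in which() drops the NA produced for the first row:

```r
# base R equivalent: compare each row with the previous one
keep <- df$MLF != c(NA, head(df$MLF, -1)) &
  df$MRO != c(NA, head(df$MRO, -1)) &
  df$DFR != c(NA, head(df$DFR, -1))
df[which(keep), ]  # same 8 rows as the dplyr output above
```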

Sort 22x2 array by the first column

I have the following:
one = [0.3, 0.3, 0.3, 0.3, 0.3, 0.17, 0.255, 0.1, 0.145, 0.275, 0.17, 0.225, 0.25, 0.25, 0.28, 0.29, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3]
two = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0, 5.5, 6.0, 6.5, 7.0, 7.5, 8.0, 8.5, 9.0, 9.5, 10.0, 10.5, 11.0]
data_needed = [one two] # build 22×2 Array{Float64,2}
Example output (truncated):
22×2 Array{Float64,2}:
0.3 0.5
0.3 1.0
0.3 1.5
0.3 2.0
0.3 2.5
0.17 3.0
0.255 3.5
0.1 4.0
0.145 4.5
0.275 5.0
So I wish to sort the full 22×2 array by the first column. The first column, data_needed[:, 1], is a 22-element Vector{Float64}:
0.300
0.300
0.300
0.300
0.300
0.170
0.255
0.100
0.145
0.275
0.170
0.225
0.250
0.250
0.280
0.290
0.300
0.300
0.300
0.300
0.300
0.300
Sorted in ascending order (output truncated):
0.100
0.145
0.170
0.170
0.225
0.250
0.250
0.255
0.275
0.280
0.290
0.300
0.300
If I sort by this first column in ascending order, the corresponding values in the second column should stay associated with the same rows.
If I sorted a full data frame by a specific column, the other data on the same rows would follow the sorted order. Does this happen for Arrays? I tried sort() to no avail.
Answer:
sortslices(data_needed,dims=1)
22×2 Array{Float64,2}:
0.1 4.0
0.145 4.5
0.17 3.0
0.17 5.5
0.225 6.0
0.25 6.5
0.25 7.0
0.255 3.5
0.275 5.0
0.28 7.5

Remove rows containing NA from the column with the least number of NAs

I have a dataframe that will eventually be converted into an xts object. The first column contains date data while all other columns contain numeric data. However, not all numeric columns have the same number of values/same length. Some columns have more rows containing NAs than others.
I want to filter my dataframe by removing the rows containing NAs in the column that has the least number of NAs, while still retaining the NAs in all other columns that I have selected. For example, the column grpA below has the least number of NAs. I would want to remove the first 2 rows of the dataframe, which contain NAs, but retain the values within grpB regardless of what they are.
What I have:
Date grpA grpB
2007-11-06 NA NA
2007-11-07 NA NA
2007-11-09 1.66 NA
2007-11-12 1.64 NA
2007-11-13 1.61 1.28
2007-11-14 1.60 1.30
2007-11-15 1.57 1.27
2007-11-16 1.56 1.25
2007-11-19 1.55 1.25
2007-11-20 1.55 1.25
2007-11-21 1.52 1.22
2007-11-22 1.50 1.21
2007-11-23 1.51 1.21
2007-11-26 1.52 1.25
2007-11-27 1.50 1.25
2007-11-28 1.50 1.23
2007-11-29 1.52 1.24
2007-11-30 1.56 1.25
2007-12-03 1.56 1.22
2007-12-04 1.56 1.23
What I want:
Date grpA grpB
2007-11-09 1.66 NA
2007-11-12 1.64 NA
2007-11-13 1.61 1.28
2007-11-14 1.60 1.30
2007-11-15 1.57 1.27
2007-11-16 1.56 1.25
2007-11-19 1.55 1.25
2007-11-20 1.55 1.25
2007-11-21 1.52 1.22
2007-11-22 1.50 1.21
2007-11-23 1.51 1.21
2007-11-26 1.52 1.25
2007-11-27 1.50 1.25
2007-11-28 1.50 1.23
2007-11-29 1.52 1.24
2007-11-30 1.56 1.25
2007-12-03 1.56 1.22
2007-12-04 1.56 1.23
A reproducible sample of the dataframe is as follows:
df <- data.frame(Date = structure(c(1194307200, 1194393600, 1194566400,
1194825600, 1194912000, 1194998400, 1195084800, 1195171200, 1195430400,
1195516800, 1195603200, 1195689600, 1195776000, 1196035200, 1196121600,
1196208000, 1196294400, 1196380800, 1196640000, 1196726400), class = c("POSIXct",
"POSIXt"), tzone = "UTC"),
grpA = c(NA, NA, 1.66, 1.64, 1.61, 1.6, 1.57, 1.56, 1.55, 1.55, 1.52, 1.5, 1.51, 1.52, 1.5, 1.5, 1.52, 1.56, 1.56, 1.56),
grpB = c(NA, NA, NA, NA, 1.28, 1.3, 1.27, 1.25, 1.25, 1.25, 1.22, 1.21, 1.21, 1.25, 1.25, 1.23, 1.24, 1.25, 1.22, 1.23))
I have tried the drop_na function from the tidyr package and it works:
df2 <- drop_na(df, grpA)
However, I am going to use the above filtering in a Shiny app, and I would not know in advance which columns users would select, and therefore which of the selected columns has the fewest rows containing NAs.
I have tried the following to identify the column with the fewest rows containing NAs, but it gave me the number of non-NA rows instead of the column name:
max(colSums(!is.na(df[-1])))
I have tried to extract out the name of the column using the following, but have encountered an error:
colnames(df)[which(colSums(!is.na(df[-1]))) == max(colSums(!is.na(df[-1])))]
I assumed that this was a straightforward task but it has become quite complicated. I would need the answer to be able to be used in a reactive expression in shiny.
Thanks and much appreciated!
We could first find the name of the column with the minimum number of NAs and then remove the NA rows from that column.
col <- names(which.min(colSums(is.na(df[-1]))))
df[!is.na(df[col]), ]
# Date grpA grpB
#3 2007-11-09 1.66 NA
#4 2007-11-12 1.64 NA
#5 2007-11-13 1.61 1.28
#6 2007-11-14 1.60 1.30
#7 2007-11-15 1.57 1.27
#8 2007-11-16 1.56 1.25
#9 2007-11-19 1.55 1.25
#10 2007-11-20 1.55 1.25
#11 2007-11-21 1.52 1.22
#12 2007-11-22 1.50 1.21
#13 2007-11-23 1.51 1.21
#14 2007-11-26 1.52 1.25
#15 2007-11-27 1.50 1.25
#16 2007-11-28 1.50 1.23
#17 2007-11-29 1.52 1.24
#18 2007-11-30 1.56 1.25
#19 2007-12-03 1.56 1.22
#20 2007-12-04 1.56 1.23
which can be done as a one-liner as well, without creating an additional variable:
df[!is.na(df[names(which.min(colSums(is.na(df[-1]))))]), ]
Using the same logic, a dplyr approach could use filter_at:
library(dplyr)
df %>%
filter_at(df %>%
summarise_at(-1, ~sum(is.na(.))) %>%
which.min %>% names, ~!is.na(.))
Or using it with tidyr::drop_na
tidyr::drop_na(df, df %>%
summarise_at(-1, ~sum(is.na(.))) %>%
which.min %>% names)
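filter_at() is superseded in recent dplyr (>= 1.0), so a more current sketch of the same idea uses the .data pronoun, which also works naturally when the column name comes from a Shiny input; the variable name col is my own:

```r
library(dplyr)
# name of the non-Date column with the fewest NAs, computed as above
col <- names(which.min(colSums(is.na(df[-1]))))
df %>% filter(!is.na(.data[[col]]))
```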

Find all the sums of all combinations of 3 rows from 5 columns

I've loaded a table from a .csv file using
mydata = read.csv("CS2Data.csv") # read csv file
which gave me:
mydata
Date DCM TMUS SKM RCI SPOK
1 11/2/2015 -0.88 -2.16 -1.04 1.12 0.67
2 12/1/2015 1.03 3.26 -2.25 -5.51 -0.23
3 1/4/2016 1.94 1.29 0.13 -1.16 0.11
4 2/1/2016 -0.41 -2.94 0.99 3.93 -0.19
5 3/1/2016 -0.68 1.27 -0.79 -2.06 -0.33
6 4/1/2016 1.82 1.22 -0.05 -1.27 -0.46
7 5/2/2016 -0.36 3.40 0.63 -2.77 0.46
8 6/1/2016 1.94 0.77 0.51 -0.26 1.66
9 7/1/2016 0.12 3.18 1.84 -1.34 -0.67
10 8/1/2016 -1.83 -0.20 -1.10 -0.90 -1.91
11 9/1/2016 0.05 0.31 1.11 0.80 1.17
12 10/3/2016 -0.02 3.19 -0.81 -4.00 0.29
I'd like to find, for each month (row), the sums of all combinations of any 3 of the 5 numbers.
I tried using the combn function based on an answer I found here:
combin <- combn(mydata, 3, rowSums, simplify = TRUE)
but that gave me the error-
"Error in FUN(x[a], ...) : 'x' must be numeric"
Next I tried naming each column separately
DCM=mydata[2]
TMUS=mydata[3]
SKM=mydata[4]
RCI=mydata[5]
SPOK=mydata[6]
and then using:
stock_ret <- data.table(DCM, TMUS,SKM,RCI,SPOK)
combin <- combn(stock_ret, 3, rowSums, simplify = TRUE)
I suspect there's an easier way to do this using the column headers directly from the .csv file, but I'm stuck.
Drop the first column, which contains the dates (the origin of the error in the question):
mydata <- mydata[,-1]
Use combn to calculate selecting 3 columns at a time:
combn(mydata, m = 3, FUN = rowSums, simplify = TRUE)
Example:
> mydata <- iris[1:10,1:4]
> combn(mydata, m = 3, FUN = rowSums, simplify = TRUE)
[,1] [,2] [,3] [,4]
[1,] 10.0 8.8 6.7 5.1
[2,] 9.3 8.1 6.5 4.6
[3,] 9.2 8.1 6.2 4.7
[4,] 9.2 7.9 6.3 4.8
[5,] 10.0 8.8 6.6 5.2
[6,] 11.0 9.7 7.5 6.0
[7,] 9.4 8.3 6.3 5.1
[8,] 9.9 8.6 6.7 5.1
[9,] 8.7 7.5 6.0 4.5
[10,] 9.5 8.1 6.5 4.7
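As a small convenience, the result columns can be labeled with the names of the columns that were summed; combn() on the column names visits the combinations in the same order (res is an assumed name):

```r
res <- combn(mydata, m = 3, FUN = rowSums, simplify = TRUE)
colnames(res) <- combn(names(mydata), m = 3,
                       FUN = paste, collapse = "+")
head(res)
```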
The general logic to apply for any dataframe:
set.seed(1) # for reproducibility
# create a data frame
df <- as.data.frame(matrix(c(rnorm(10), rnorm(10), rnorm(10),rnorm(10),rnorm(10)), nrow=10))
df # show it
# V1 V2 V3 V4 V5
# 1 -0.6264538 1.51178117 0.91897737 1.35867955 -0.1645236
# 2 0.1836433 0.38984324 0.78213630 -0.10278773 -0.2533617
# ...
# 10 -0.3053884 0.59390132 0.41794156 0.76317575 0.8811077
combinations <- combn(5,3) #123 124 125 ...345
# all combination of any 3 of the 5 columns
lapply(1:dim(combinations)[[2]], function(x) {df[combinations[,x]]})
# sums of all combination of any 3 of the 5 columns
lapply(1:dim(combinations)[[2]], function(x) {rowSums(df[combinations[,x]])})
# use "matrix(unlist(...), nrow)" for better presentation and easier later handlings
matrix(unlist(lapply(1:dim(combinations)[[2]], function(x) {rowSums(df[combinations[,x]])})),nrow=nrow(df))
The solution for the specific data of the questioner (note that the unquoted dates are evaluated as divisions, e.g. 11/2/2015 gives roughly 0.0027, which is why the Date column below looks odd; that column is not used in the sums, so no harm is done):
mydata <- as.data.frame(matrix(c(
11/2/2015, -0.88, -2.16, -1.04, 1.12, 0.67,
12/1/2015, 1.03, 3.26, -2.25, -5.51, -0.23,
1/4/2016, 1.94, 1.29, 0.13, -1.16, 0.11,
2/1/2016, -0.41, -2.94, 0.99, 3.93, -0.19,
3/1/2016, -0.68, 1.27, -0.79, -2.06, -0.33,
4/1/2016, 1.82, 1.22, -0.05, -1.27, -0.46,
5/2/2016, -0.36, 3.40, 0.63, -2.77, 0.46,
6/1/2016, 1.94, 0.77, 0.51, -0.26, 1.66,
7/1/2016, 0.12, 3.18, 1.84, -1.34, -0.67,
8/1/2016, -1.83, -0.20, -1.10, -0.90, -1.91,
9/1/2016, 0.05, 0.31, 1.11, 0.80, 1.17,
10/3/2016, -0.02, 3.19, -0.81, -4.00, 0.29), nrow=12, byrow=TRUE))
names(mydata) <- c("Date", "DCM", "TMUS", "SKM", "RCI", "SPOK") # name the columns
mydata # show the dataframe
# Date DCM TMUS SKM RCI SPOK
# 1 0.0027295285 -0.88 -2.16 -1.04 1.12 0.67
# 2 0.0059553350 1.03 3.26 -2.25 -5.51 -0.23
# ............................................
# 12 0.0016534392 -0.02 3.19 -0.81 -4.00 0.29
combinations <- combn(5,3) #123 124 125 ...345
# all combination of any 3 of the 5 columns
lapply(1:dim(combinations)[[2]], function(x) {mydata[,2:6][combinations[,x]]})
# sums of all combination of any 3 of the 5 columns
lapply(1:dim(combinations)[[2]], function(x) {rowSums(mydata[,2:6][combinations[,x]])})
# use "matrix(unlist(...), nrow)" for better presentation and easier later handlings
matrix(unlist(lapply(1:dim(combinations)[[2]], function(x) {rowSums(mydata[,2:6][combinations[,x]])})),nrow=nrow(mydata))
# [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
# [1,] -4.08 -1.92 -2.37 -0.80 -1.25 0.91 -2.08 -2.53 -0.37 0.75
# [2,] 2.04 -1.22 4.06 -6.73 -1.45 -4.71 -4.50 0.78 -2.48 -7.99
# [3,] 3.36 2.07 3.34 0.91 2.18 0.89 0.26 1.53 0.24 -0.92
# ...............................................................
# [12,] 2.36 -0.83 3.46 -4.83 -0.54 -3.73 -1.62 2.67 -0.52 -4.52
That performs correctly.
Check, e.g., the 10th combination (SKM + RCI + SPOK): 0.75 = sum(-1.04, 1.12, 0.67), -7.99 = sum(-2.25, -5.51, -0.23), and so on.

Converting an n by m dataframe into a 1 by n*m dataframe in R

I have a dataframe that I call 'd' which is in the format as follows:
Date Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
1 1895 12.63 2.47 2.69 2.43 3.10 1.65 0.13 0.24 1.78 0.18 3.32 7.78
2 1896 13.08 3.86 5.14 5.91 1.61 0.10 0.00 0.05 0.44 3.76 9.51 8.71
3 1897 4.10 7.16 6.38 0.85 0.47 0.87 0.00 0.00 0.46 2.51 5.27 3.40
4 1898 1.97 6.14 0.29 0.30 2.40 0.49 0.00 0.00 1.10 1.32 2.40 2.11
5 1899 7.61 2.69 8.12 1.56 1.66 0.75 0.00 0.18 0.31 7.87 10.79 5.20
6 1900 8.68 2.44 3.53 1.75 2.95 0.33 0.00 0.25 0.60 5.69 9.38 5.00
I would like to rewrite it in the form as follows
Date Precip
1 1895-01-01 12.63
2 1895-02-01 2.47
3 1895-03-01 2.69
4 1895-04-01 2.43
...
70 1900-10-01 5.69
71 1900-11-01 9.38
72 1900-12-01 5.00
The only way I can think of is creating a new dataframe with a monthly sequence of dates from start to end and then using rbind to concatenate the rows of dataframe 'd' onto it. Is there a cleaner way of doing this without using for loops? Thank you for your help!
I did it using stack and as.Date:
dat.l <- data.frame(dat[,1],stack(dat[,-1]))
dat <- data.frame(Date =as.Date(paste(dat.l[,1],dat.l[,3],'01',sep='/'),format='%Y/%b/%d'),
Precip= dat.l[,2])
Date Precip
1 1895-01-01 12.63
2 1896-01-01 13.08
3 1897-01-01 4.10
.....
70 1898-12-01 2.11
71 1899-12-01 5.20
72 1900-12-01 5.00
I notice that my data is not in the right order, so I order by Date:
dat[order(dat$Date),]
Date Precip
1 1895-01-01 12.63
7 1895-02-01 2.47
13 1895-03-01 2.69
19 1895-04-01 2.43
25 1895-05-01 3.10
31 1895-06-01 1.65
I did it without for loops starting with the Date strategy you suggested.
df2 <- data.frame(Date = seq.Date(as.Date("1895-01-01"),
                                  as.Date("1900-12-01"),
                                  by = "month"),
                  Precip = c(t(data.matrix(dfrm[-1]))))
head(df2)
Date Precip
1 1895-01-01 12.63
2 1895-02-01 2.47
3 1895-03-01 2.69
4 1895-04-01 2.43
5 1895-05-01 3.10
6 1895-06-01 1.65
You might want to deal with time series objects here:
x <- c(12.63, 2.47, 2.69, 2.43, 3.1, 1.65, 0.13, 0.24, 1.78, 0.18,
3.32, 7.78, 13.08, 3.86, 5.14, 5.91, 1.61, 0.1, 0, 0.05, 0.44,
3.76, 9.51, 8.71, 4.1, 7.16, 6.38, 0.85, 0.47, 0.87, 0, 0, 0.46,
2.51, 5.27, 3.4, 1.97, 6.14, 0.29, 0.3, 2.4, 0.49, 0, 0, 1.1,
1.32, 2.4, 2.11, 7.61, 2.69, 8.12, 1.56, 1.66, 0.75, 0, 0.18,
0.31, 7.87, 10.79, 5.2, 8.68, 2.44, 3.53, 1.75, 2.95, 0.33, 0,
0.25, 0.6, 5.69, 9.38, 5)
x.ts <- ts(x, frequency=12, start=(1895))
x.ts
## Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
## 1895 12.63 2.47 2.69 2.43 3.10 1.65 0.13 0.24 1.78 0.18 3.32 7.78
## 1896 13.08 3.86 5.14 5.91 1.61 0.10 0.00 0.05 0.44 3.76 9.51 8.71
## 1897 4.10 7.16 6.38 0.85 0.47 0.87 0.00 0.00 0.46 2.51 5.27 3.40
## 1898 1.97 6.14 0.29 0.30 2.40 0.49 0.00 0.00 1.10 1.32 2.40 2.11
## 1899 7.61 2.69 8.12 1.56 1.66 0.75 0.00 0.18 0.31 7.87 10.79 5.20
## 1900 8.68 2.44 3.53 1.75 2.95 0.33 0.00 0.25 0.60 5.69 9.38 5.00
Then xts will format this how you want (among other things that can be done with time series):
library(xts)
head(as.xts(x.ts))
## [,1]
## Jan 1895 12.63
## Feb 1895 2.47
## Mar 1895 2.69
## Apr 1895 2.43
## May 1895 3.10
## Jun 1895 1.65
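If the goal is the two-column Date/Precip dataframe from the question, the xts object can be converted back; this sketch assumes zoo's yearmon index handling, and df2 is an assumed name:

```r
library(xts)  # also attaches zoo
x.xts <- as.xts(x.ts)
# the index is a yearmon; as.Date() maps it to the first day of each month
df2 <- data.frame(Date = as.Date(index(x.xts)),
                  Precip = as.numeric(x.xts))
head(df2)
```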
A solution using data.table:
require(data.table)
dt <- data.table(dat, key="Date")
dt.out <- dt[, list(Date_01 = as.Date(paste(Date, 1:12, "01", sep="/")),
Precip = unlist(.SD)), by=Date][, Date := NULL]
setnames(dt.out, "Date_01", "Date")
Yes, check out the melt function in the reshape2 package. It's actually pretty straightforward. Read more here: http://www.jstatsoft.org/v21/i12/paper
You would do:
install.packages("reshape2")
library(reshape2)
melt(DF, id.vars = "Date", measure.vars = 2:13)
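Putting the melt answer together with date construction (here DF stands for the questioner's dataframe d, and parsing the month abbreviations with %b assumes an English locale):

```r
library(reshape2)
long <- melt(DF, id.vars = "Date",
             variable.name = "Month", value.name = "Precip")
# build real dates from year + month abbreviation, then sort
long$Date <- as.Date(paste(long$Date, long$Month, "01"),
                     format = "%Y %b %d")
long <- long[order(long$Date), c("Date", "Precip")]
```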
