Create multiple columns from a single column and clean up results - R

I have a data frame like this:
foo = data.frame(
  Point.Type = c("Zero Start", "Zero Start", "Zero Start",
                 "3000rpm_10%_13barG_Sdsdsa_1.0_ss_Pww",
                 "3000rpm_10%_13barG_Sdsdsa_1.0_ss_Pww",
                 "3000rpm_10%_13barG_Sdsdsa_1.0_ss_Pww",
                 "Zero Stop", "Zero Start"),
  Point.Value = c(NA, NA, NA, rnorm(3), NA, NA)
)
I want to add three columns by splitting the first column on the separator _, retaining only the numeric values obtained after the split. For rows where the first column doesn't contain any _, the three new columns should be NA. I got somewhat close using separate, but not close enough:
> library(tidyr)
> bar = separate(foo, Point.Type, c("rpm_nom", "GVF_nom", "p0in_nom"),
+                sep = "_", remove = FALSE, extra = "drop", fill = "right")
> bar
Point.Type rpm_nom GVF_nom p0in_nom Point.Value
1 Zero Start Zero Start <NA> <NA> NA
2 Zero Start Zero Start <NA> <NA> NA
3 Zero Start Zero Start <NA> <NA> NA
4 3000rpm_10%_13barG_Sdsdsa_1.0_ss_Pww 3000rpm 10% 13barG -1.468033
5 3000rpm_10%_13barG_Sdsdsa_1.0_ss_Pww 3000rpm 10% 13barG 1.280868
6 3000rpm_10%_13barG_Sdsdsa_1.0_ss_Pww 3000rpm 10% 13barG 0.270126
7 Zero Stop Zero Stop <NA> <NA> NA
8 Zero Start Zero Start <NA> <NA> NA
I'm not sure why my data frame now contains two apparently different kinds of NA, but is.na seems to accept both, so I can live with that. However, I have two kinds of problems:
1. The new columns should be at least numeric, and possibly integer. Instead they're character, because of the trailing rpm, %, and barG. How do I get rid of those?
2. When Point.Type can't be split, rpm_nom should be NA; instead it becomes Zero Start or Zero Stop. Changing the fill= option only changes which of the new columns gets the Zero Start/Zero Stop. Instead I want all three of them to be NA. How can I do that?
NOTE: I'm using tidyr, but of course you don't need to, if you think there's a better way to do this.

You can post-process the columns with dplyr:
library(dplyr)
foo <- foo %>%
  separate(Point.Type, c("rpm_nom", "GVF_nom", "p0in_nom"),
           sep = "_", remove = FALSE, extra = "drop", fill = "right") %>%
  mutate_each(funs(as.numeric(gsub("[^0-9]", "", .))), rpm_nom, GVF_nom, p0in_nom)
The gsub("[^0-9]", "", .) part removes all non-numeric characters. If you want to preserve decimal points, use [^0-9.] instead of [^0-9] (as @PierreLafortune did in his answer), but be aware that this also keeps periods that are not decimal points. Wrapping the result in as.numeric converts the strings to numeric values and at the same time turns the empty strings into NA. This gives the following result:
> foo
Point.Type rpm_nom GVF_nom p0in_nom Point.Value
1 Zero Start NA NA NA NA
2 Zero Start NA NA NA NA
3 Zero Start NA NA NA NA
4 3000rpm_10%_13barG_Sdsdsa_1.0_ss_Pww 3000 10 13 -1.2361145
5 3000rpm_10%_13barG_Sdsdsa_1.0_ss_Pww 3000 10 13 -0.8727960
6 3000rpm_10%_13barG_Sdsdsa_1.0_ss_Pww 3000 10 13 0.9685555
7 Zero Stop NA NA NA NA
8 Zero Start NA NA NA NA
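Note that mutate_each() and funs() have since been deprecated in dplyr; a sketch of the same idea on current dplyr (>= 1.0) would use across():
library(dplyr)
library(tidyr)
foo <- foo %>%
  separate(Point.Type, c("rpm_nom", "GVF_nom", "p0in_nom"),
           sep = "_", remove = FALSE, extra = "drop", fill = "right") %>%
  mutate(across(c(rpm_nom, GVF_nom, p0in_nom),
                ~ as.numeric(gsub("[^0-9]", "", .x))))  # strip units, then coerce; "" becomes NA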
Or using data.table (as contributed by @DavidArenburg in the comments):
library(data.table)
setDT(foo)[, c("rpm_nom", "GVF_nom", "p0in_nom") :=
             lapply(tstrsplit(Point.Type, "_", fixed = TRUE)[1:3],
                    function(x) as.numeric(gsub("[^0-9]", "", x)))]
will give a similar result:
> foo
Point.Type Point.Value rpm_nom GVF_nom p0in_nom
1: Zero Start NA NA NA NA
2: Zero Start NA NA NA NA
3: Zero Start NA NA NA NA
4: 3000rpm_10%_13barG_Sdsdsa_1.0_ss_Pww -0.09255445 3000 10 13
5: 3000rpm_10%_13barG_Sdsdsa_1.0_ss_Pww 1.18581340 3000 10 13
6: 3000rpm_10%_13barG_Sdsdsa_1.0_ss_Pww 2.14475950 3000 10 13
7: Zero Stop NA NA NA NA
8: Zero Start NA NA NA NA
The advantage of this approach is that foo is updated by reference. Because that is faster and more memory-efficient, it is especially valuable for large datasets.
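Keep in mind that setDT() converts foo in place; if you also need the untouched original, take a snapshot first with data.table::copy():
foo_backup <- copy(foo)  # deep copy; subsequent := updates on foo won't affect it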

With base R we can first set the values that can't be split to NA and then coerce the columns to numeric:
bar[-1] <- lapply(bar[-1], function(x) {
  is.na(x) <- grepl("Zero", x)
  as.numeric(gsub("[^0-9.]", "", x))
})
# Point.Type rpm_nom GVF_nom p0in_nom Point.Value
# 1 Zero Start NA NA NA NA
# 2 Zero Start NA NA NA NA
# 3 Zero Start NA NA NA NA
# 4 3000rpm_10%_13barG_Sdsdsa_1.0_ss_Pww 3000 10 13 0.3558397
# 5 3000rpm_10%_13barG_Sdsdsa_1.0_ss_Pww 3000 10 13 1.1454829
# 6 3000rpm_10%_13barG_Sdsdsa_1.0_ss_Pww 3000 10 13 0.2958815
# 7 Zero Stop NA NA NA NA
# 8 Zero Start NA NA NA NA
To reduce it to one line (@Jaap):
bar[-1] <- lapply(bar[-1], function(x) as.numeric(gsub("[^0-9.]", "", x)))

Related

Create column from data on dynamic number of columns depending on availabity in R

Given an uncertain number of columns containing source values for the same variable, I would like to create a column that holds the final value to be selected, depending on source importance and availability.
Reproducible data:
set.seed(123)
actuals <- runif(10, 500, 1000)
get_rand_vector <- function() runif(10, 0.95, 1.05)
get_na_rand_ixs <- function() round(runif(5, 0, 10), 0)
df <- data.frame(source_1 = actuals * get_rand_vector(),
                 source_2 = actuals * get_rand_vector(),
                 source_n = actuals * get_rand_vector())
df[["source_1"]][get_na_rand_ixs()] <- NA
df[["source_2"]][get_na_rand_ixs()] <- NA
df[["source_n"]][get_na_rand_ixs()] <- NA
My manual solution is as follows:
df$available <- ifelse(!is.na(df$source_1),
                       df$source_1,
                       ifelse(!is.na(df$source_2),
                              df$source_2,
                              df$source_n))
Given the desired result of:
source_1 source_2 source_n available
1 NA NA NA NA
2 NA NA 930.1242 930.1242
3 716.9981 NA 717.9234 716.9981
4 NA 988.0446 NA 988.0446
5 931.7081 NA 924.1101 931.7081
6 543.6802 533.6798 NA 543.6802
7 744.6525 767.4196 783.8004 744.6525
8 902.8788 955.1173 NA 902.8788
9 762.3690 NA 761.6135 762.3690
10 761.4092 702.6064 708.7615 761.4092
How could I automatically iterate over the available sources to pick the value to use? The number of sources may be 1, 2, 3, ..., 7, and priority follows the natural order (1 > 2 > ...).
Once you have all of the candidate vectors in order and in an appropriate data structure (e.g., data.frame or matrix), you can use apply to apply a function over the rows. In this case, we just look for the first non-NA value. Thus, after the first block of code above, you only need the following line:
df$available <- apply(df, 1, FUN = function(x) x[which(!is.na(x))[1]])
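A base-R alternative sketch that avoids row-wise apply() (which coerces the data frame to a matrix) is to fold ifelse() over the source columns in priority order:
# keep a where it is non-NA, otherwise fall back to b, column by column
df$available <- Reduce(function(a, b) ifelse(is.na(a), b, a),
                       df[c("source_1", "source_2", "source_n")])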
coalesce() from dplyr is designed for this:
library(dplyr)
df %>%
  mutate(available = coalesce(!!!.))
source_1 source_2 source_n available
1 NA NA NA NA
2 NA NA 930.1242 930.1242
3 716.9981 NA 717.9234 716.9981
4 NA 988.0446 NA 988.0446
5 931.7081 NA 924.1101 931.7081
6 543.6802 533.6798 NA 543.6802
7 744.6525 767.4196 783.8004 744.6525
8 902.8788 955.1173 NA 902.8788
9 762.3690 NA 761.6135 762.3690
10 761.4092 702.6064 708.7615 761.4092
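The !!! operator splices all columns of the piped data frame into coalesce() as separate arguments, so for this data the call is equivalent to the explicit form:
df %>% mutate(available = coalesce(source_1, source_2, source_n))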

Impute NA values with previous value in R

We have closing prices for 101 variables (companies). We have a lot of NA values (the stock market is closed on Saturdays and Sundays, which gives NA values in our data) and we need to impute those NA values with the previous value where one exists, but we don't succeed. This is our data example.
There are also companies that don't have data in the first years, since they were not yet on the stock market, so they have NA values for that period. And there are companies that go bankrupt and start having NA values, so these should both become 0.
How should we do this, since we have several conditions for filling our NAs?
Thanks in advance.
My understanding of the rules are:
columns that are all NA are to be left as all NA
leading NA values are left as NA
interior NA values are replaced with the most recent non-NA values
trailing NA values are replaced with 0
To try this out we use the built-in data frame BOD, replacing the 1st, 3rd and last rows with NA and adding a column of NA values -- see Note at end.
We define a logical vector ok with one element per column, which is TRUE for columns having at least one non-NA element and FALSE for the others. Then, operating only on the columns for which ok is TRUE, we fill in the trailing NA values with 0 using na.fill, and then use na.locf to fill in the interior NA values.
library(zoo)
ok <- !apply(is.na(BOD), 2, all)
BOD[, ok] <- na.locf(na.fill(BOD[, ok], c(NA, NA, 0)), na.rm = FALSE)
giving:
Time demand X
1 NA NA NA <-- leading NA values are left intact
2 2 10.3 NA
3 2 10.3 NA <-- interior NA values are filled in with last non-NA value
4 4 16.0 NA
5 5 15.6 NA
6 0 0.0 NA <- trailing NA values are filled in with 0
Note
We used the following input above:
BOD[c(1, 3, 6), ] <- NA
BOD <- cbind(BOD, X = NA)
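For reference, the three-component fill spec of na.fill() is (leading, interior, trailing), so c(NA, NA, 0) zero-fills only the trailing run of NAs. A minimal standalone illustration on a plain vector:
library(zoo)
x <- c(NA, 1, NA, 2, NA, NA)
na.fill(x, c(NA, NA, 0))                          # NA  1 NA  2  0  0
na.locf(na.fill(x, c(NA, NA, 0)), na.rm = FALSE)  # NA  1  1  2  0  0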

Rolling mean with zoo when columns have NA values

I have calculated the rolling mean using data.table and zoo.
Code below: I am calculating this on the artprice column, which contains NA rows.
library(data.table)
library(zoo)
rollmean1 <- data.table(newdf)
rollmean2 <- rollmean1[, paste0('MA', 126) :=
                         lapply(126, function(x) rollmeanr(artprice, x, fill = NA))][]
Output:
> head(rollmean2)
spdate SP500close artprice MA2 MA3 MA126
1: 19870330 289.20 83.6 NA NA NA
2: 19870331 291.70 NA 290.450 NA NA
3: 19870401 292.39 NA 292.045 291.0967 NA
4: 19870402 293.63 NA 293.010 292.5733 NA
5: 19870403 300.41 NA 297.020 295.4767 NA
6: 19870406 301.95 NA 301.180 298.6633 NA
Notice that the artprice column has mostly NA values; I would like to ignore them and still run the rolling average. However, I don't want to remove all NAs from artprice, because then it would no longer match the date column.
Any ideas on how to achieve this?
I was leaning towards something like this, hoping it would recognize the NA values but continue with the calculation: (x[!is.na(x)])
rollmean2 <- rollmean1[, paste0('MA', 126) :=
                         lapply(126, function(x) rollmeanr(artprice, (x[!is.na(x)]), fill = NA))][]
Any insight appreciated.
EDIT: Changed code to use rollapply.
> rollmean2 <- (rollmean1)[, paste0('MA',126) := lapply(126, rollapplyr, data=artprice, mean, rm.na = TRUE, Fill = NA)][]
Warning message:
In `[.data.table`((rollmean1), , `:=`(paste0("MA", 126), lapply(126, :
Supplied 7491 items to be assigned to 7616 items of column 'MA126' (recycled leaving remainder of 125 items).
> tail(rollmean2)
spdate SP500close artprice MA126
1: 20170524 2404.39 NA NA
2: 20170525 2415.07 NA NA
3: 20170526 2415.82 NA NA
4: 20170530 2412.91 NA NA
5: 20170531 2411.80 NA NA
6: 20170601 2430.06 NA NA
The data frame rollmean1 has 7616 rows:
> nrow(rollmean1)
[1] 7616
And nrow of output table:
> nrow(rollmean2)
[1] 7616
So the rolling calculation is performed with a 126-row window (MA126), which means it cannot begin until row 126, leaving the first 125 rows unfilled at the start of the data set. The code recognizes this fact but still does not produce my desired outcome.
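For what it's worth, a possible fix (a sketch, not from the original thread): the argument names are na.rm and fill, in lower case, and rollapplyr() passes extra arguments on to FUN, so a 126-row NA-skipping mean that keeps the full column length could be written as:
library(zoo)
library(data.table)
# result has the same length as artprice thanks to fill = NA, so no
# recycling warning; windows that are entirely NA yield NaN
rollmean1[, MA126 := rollapplyr(artprice, 126, mean, na.rm = TRUE, fill = NA)]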

Finding the maximum values associated with a given string length in a data frame

I'm exploring the acss package.
I want to know which strings of a given length in the acss_data data frame have been assigned the maximum K.i value.
tail(acss_data)
K.2 K.4 K.5 K.6 K.9
012345678883 NA NA NA NA 50.28906
012345678884 NA NA NA NA 50.31291
012345678885 NA NA NA NA 49.71200
012345678886 NA NA NA NA 49.81041
012345678887 NA NA NA NA 49.51936
012345678888 NA NA NA NA 48.61247
The acss_data data frame contains K.2, K.4, K.5, K.6, and K.9 values associated with strings of lengths 1 to 12, and I want to know the maximum K.i for each string length, i.e., the max K.2 for strings of length 1, length 2, ..., length 12; then the max K.4 for strings of length 1, length 2, ..., length 12; and so on.
How can I query this in R?
You can use aggregate to summarize the data:
library(acss.data)
d <- acss_data
d$len <- nchar(rownames(d))  # string length of each row name
d[is.na(d)] <- -1            # replace NAs so max() has something to compare
s <- aggregate(d[, 1:5], list(d$len), max)
The result is a data frame:
Group.1 K.2 K.4 K.5 K.6 K.9
1 1 2.514277 3.547388 3.947032 4.268200 4.964344
2 2 3.327439 5.414104 6.108780 6.675197 7.927055
3 3 5.505383 8.520908 9.432003 10.189697 11.905392
4 4 8.406714 12.231447 13.284113 14.182866 16.280365
5 5 11.834019 16.230760 17.340010 18.329451 20.735158
6 6 15.366332 19.993828 21.291613 22.410022 25.170522
7 7 18.989162 23.816377 25.389206 26.615356 29.685526
8 8 22.679752 27.556472 29.379371 30.880603 34.243156
9 9 26.343527 31.187297 33.264487 35.097073 38.851463
10 10 29.427574 34.891807 37.282071 39.258235 43.506412
11 11 32.778797 39.506517 42.000889 43.657406 48.208571
12 12 37.064199 40.506517 42.263923 43.657406 52.897870
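An alternative sketch that skips the -1 sentinel is to pass na.rm = TRUE through aggregate() to max(); note that groups where a column is entirely NA then come back as -Inf (with warnings), which is why the sentinel trick above can be more convenient:
d <- acss_data
d$len <- nchar(rownames(d))
s2 <- aggregate(d[, 1:5], list(d$len), max, na.rm = TRUE)  # all-NA groups -> -Inf plus a warning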

Subsetting error in R

I have a large dataframe called dualbeta which contains 2 rows and 6080 columns. Here is a sample:
row.names A.Close AA.Close AADR.Close AAIT.Close AAL.Close
1 upside 1.253929 0.9869027 0.6169613 0.6353903 0.1782124
2 downside 1.027412 1.1936236 0.5915299 0.5697878 0.1702382
I am trying to extract only those with an upside >= 1.00 and a downside <= 1.00. I used combinations <- subset(dualbeta, upside >= 1.00 & downside <= 1.00), but I get the following:
row.names A.Close AA.Close AADR.Close AAIT.Close
1 NA NA NA NA NA
2 NA.1 NA NA NA NA
3 NA.2 NA NA NA NA
4 NA.3 NA NA NA NA
5 NA.4 NA NA NA NA
...
It should just return a 2-by-x table, where x is the number of combinations found; I do not know why I am getting a bunch of rows. Additionally, I thought I had NA values in dualbeta, so I used na.omit(dualbeta) -> dualbeta, but it deleted everything and turned dualbeta into a 0 by 6080. I also used which(is.na(dualbeta)), which returned 3307 and 3308, but when I checked those columns, they did not contain NAs.
You might work on the transpose of the data in order to select rows with the proper characteristics (which are columns in the transpose):
# Fix up the data, use proper row names
rownames(x) <- x$row.names
# Remove old row name column
x <- x[-1]
# transpose and subset
subset(data.frame(t(x)), upside > 1 & downside < 1)
This expression returns a zero-length result with your example data. Changing the parameters shows what is returned:
subset(data.frame(t(x)), upside > .6 & downside < .6)
## upside downside
## AADR.Close 0.6169613 0.5915299
## AAIT.Close 0.6353903 0.5697878
You can also subset the data with simple indexing.
Let's say this is your data
dualbeta <- data.frame(matrix(runif(24, 0, 2), nrow = 2,
                              dimnames = list(c("upside", "downside"),
                                              letters[1:12])))
then you can extract with
dualbeta[, dualbeta[1,]>=1.00 & dualbeta[2,]<=1.00]
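One caveat with this indexing approach: if only a single column satisfies the condition, data frame indexing drops the result to a vector, so add drop = FALSE to keep a data frame:
dualbeta[, dualbeta[1, ] >= 1.00 & dualbeta[2, ] <= 1.00, drop = FALSE]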
