Vectorized ifelse conundrum - r

I have two arrays "begin" and "end_a" which contain some integer indices, except that some of the entries in "end_a" are NA.
And panelDataset is a matrix which contains the data. I want to take the means of the rows of panelDataset corresponding to non-NA entries of begin and end_a.
I have this working in a serial fashion, but when I tried to vectorize it as follows:
switch_mu=ifelse(!is.na(end_a),mean(panelDataset[begin: end_a,4]),NA)
It gives an error: Error in begin:end_a : NA/NaN argument.
When I check the entries of end_a separately for NAs using is.na(end_a), it does show the correct entries of the array as NA. So, that is not an issue.
I know I am missing something trivial. Any thoughts?

Try this:
means <- apply(na.omit(cbind(begin, end_a)), 1,
function(x) mean(panelDataset[x[1]:x[2], 4]))
replace(end_a, !is.na(end_a), means)
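For what it's worth, the original attempt fails because ifelse only vectorizes over the test; the expression mean(panelDataset[begin:end_a, 4]) is evaluated as a whole, and the : operator does not work element by element on two vectors. A minimal sketch of the same idea with mapply, which walks both index vectors in parallel, might look like this. The data here is made up purely for illustration, not taken from the question:

```r
# Hypothetical data: a 10 x 4 matrix whose 4th column holds 31..40
panelDataset <- matrix(seq_len(40), nrow = 10, ncol = 4)
begin <- c(1, 4, 7)
end_a <- c(3, NA, 9)

# Walk begin and end_a in parallel; return NA when either index is missing
switch_mu <- mapply(function(b, e) {
  if (is.na(b) || is.na(e)) NA_real_ else mean(panelDataset[b:e, 4])
}, begin, end_a)

switch_mu
```

Unlike the apply/replace version, this keeps the result aligned with begin and end_a without a separate replace step.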

Related

Trying to find a better way to sorting the data in R

In my data frame I am trying to sort the data in descending order. I am using the below line of code for sorting my data and it works as intended.
CNS25VOL <- CNS25VOL[order(-CNS25VOL$MATVOL22), ]
However, if I refer to the same column by its index number, the code throws an error
CNS25VOL <- CNS25VOL[order(-CNS25VOL[, 2]), ]
Error thrown is
Error in CNS25VOL[, 2] : incorrect number of dimensions
While I do have a working solution, the problem is that if the name of my column suddenly changes, the code will break. I do know that the column positions in the data frame will stay the same.
How can we handle this?
In order(-CNS25VOL[, 2]), order expects a vector, which you try to construct via the [] in CNS25VOL[, 2]. A normal data.frame will return a vector consisting only of the 2nd column. A tibble, however, will return a tibble with only one column.
You can reproduce the behaviour of normal data.frames with the drop = TRUE argument to [], as in
CNS25VOL[, 2, drop = TRUE]
Try to always be aware of whether you are using a standard data.frame, a tibble, or a data.table, because they look very similar but differ in the details. Also see https://tibble.tidyverse.org/reference/subsetting.html
dplyr functions tend to give you a tibble back even if you fed them a classical data.frame.
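Another option that should behave the same way on a plain data.frame and on a tibble is [[ ]], which extracts a single column as a vector by position. A small sketch with an invented data frame (the ID column and the values are hypothetical; only the names CNS25VOL and MATVOL22 come from the question):

```r
# Hypothetical stand-in for the asker's data; MATVOL22 is column 2
CNS25VOL <- data.frame(
  ID = c("a", "b", "c"),
  MATVOL22 = c(10, 30, 20),
  stringsAsFactors = FALSE
)

# [[2]] pulls column 2 out as a plain vector, regardless of its name
sorted <- CNS25VOL[order(-CNS25VOL[[2]]), ]

sorted$MATVOL22
```

Because the column is addressed by position, renaming MATVOL22 would not break the sort.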

Trouble with setting conditionals to parse through data.frame

Seems like we have some extra artifacts that appear as the dataset changes firms. Write a piece of code that checks to see where the tickers change, and delete all artifacts from those points.
for (y in 1:nrow(longitudinal)){
if (longitudinal[y,2] != longitudinal[y-1,2])
{longitudinal[y,] = NA }}
Hey guys, I am trying to remove values from a dataset according to a change in column 2, the name value. Unfortunately I am getting the error
Error in Ops.data.frame(longitudinal[y, 2], longitudinal[y - 1, 2]) :
‘!=’ only defined for equally-sized data frames
I cannot think of a different way to compare the elements in the name column in order to set the condition for the NA's to correspond to the change in the name. Would appreciate any help thinking through this.
The for loop has to start one row down, i.e.
y in 2:nrow(longitudinal)
because at y = 1 the condition compares row 1 against row 0, which does not exist, so the two sides of != have different sizes.
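If you'd rather avoid the loop, here is a rough vectorized sketch of the same check: compare the name column with a copy of itself shifted by one row, then blank out the rows where the name changes. The data frame below is invented for illustration. Note that this also sidesteps a second problem a loop can hit: once row y has been set to NA, the comparison at row y + 1 evaluates to NA, and if (NA) is an error.

```r
# Hypothetical data; column 2 holds the ticker / firm name
longitudinal <- data.frame(
  value = c(1.0, 1.1, 9.9, 2.0, 2.1),
  ticker = c("AAA", "AAA", "BBB", "BBB", "BBB"),
  stringsAsFactors = FALSE
)

# Pull the names out first so later NA assignments don't affect the comparison
tickers <- longitudinal[[2]]

# TRUE where a row's ticker differs from the previous row's; row 1 has no predecessor
changed <- c(FALSE, tickers[-1] != tickers[-length(tickers)])

# Blank out every row where the firm changes
longitudinal[changed, ] <- NA
```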

Error in using grep in SparkR

I am having an issue with subsetting my Spark DataFrame.
I have a DataFrame called nfe, which contains a column called ITEM_PRODUTO that is formatted as a string. I would like to subset this DataFrame based on whether the item column contains the word "AREIA". I can easily subset the data based on an exact phrase:
nfe.subset1 <- subset(nfe, nfe$ITEM_PRODUTO == "AREIA LAVADA FINA")
nfe.subset2 <- subset(nfe, nfe$ITEM_PRODUTO %in% "AREIA")
However, what I would like is a subset of all rows that contain the word "AREIA" in the ITEM_PRODUTO column. When I try to use grep, though, I receive an error message:
nfe.subset3 <- subset(nfe, grep("AREIA", nfe$ITEM_PRODUTO))
# Error in as.character.default(x) :
# no method for coercing this S4 class to a vector
I've tried multiple iterations of syntax, and tried grepl as well, but nothing seems to work. It's probably a syntax error, but could anyone help me out?
Thanks!
Standard R functions cannot be applied to a SparkDataFrame. Use either like:
where(nfe, like(nfe$ITEM_PRODUTO, "%AREIA%"))
or rlike:
where(nfe, rlike(nfe$ITEM_PRODUTO, ".*AREIA.*"))

Create new column in dataframe using if {} else {} in R

I'm trying to add a conditional column to a dataframe, but not getting the results I'm expecting.
I have a dataframe with values recorded for the column "steps" across 5-minute intervals over various days. I'm trying to impute missing values in the 'steps' column by using the mean number of steps for a given 5-minute interval on the days that do have measurements. n.b. I tried using the MICE package for this but it just crashed my computer so I opted for a more manual workaround.
As an intermediate stage, I have bound an additional column to the existing dataframe with the mean number of steps for that interval. What I want to do next is create a column that returns that mean if the raw number of steps is missing (NA), and just uses the raw value if it is not. Here's my code for that part:
activityTimeAvgs$stepsImp <- if(is.na(activityTimeAvgs$steps)){
activityTimeAvgs$avgsteps
} else {
activityTimeAvgs$steps
}
What I expected to happen is that the if statement would evaluate as TRUE if 'steps' is NA and consequently give 'avgsteps'; in cases where 'steps' is not NA I would expect it to just use the raw value for 'steps'. However, the output just gives the value for 'avgsteps' in every row, which is not much use. I also get the following warning:
Warning message:
In if (is.na(activityTimeAvgs$steps)) { :
the condition has length > 1 and only the first element will be used
Any ideas where I'm going wrong?
Thanks in advance.
The if statement is not suitable for this. You need to use ifelse:
activityTimeAvgs$stepsImp <- ifelse(is.na(activityTimeAvgs$steps), activityTimeAvgs$avgsteps, activityTimeAvgs$steps)
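To see why, note that if () takes a single TRUE/FALSE and picks one whole branch, whereas ifelse() chooses row by row. As far as I know, since R 4.2.0 a condition of length greater than one in if () is an error rather than a warning. A small sketch with invented numbers (only the column names come from the question):

```r
# Hypothetical data: two intervals have missing step counts
activityTimeAvgs <- data.frame(
  steps = c(NA, 5, NA, 12),
  avgsteps = c(1.5, 4.0, 2.5, 10.0)
)

# ifelse() tests every row and takes avgsteps only where steps is NA
activityTimeAvgs$stepsImp <- ifelse(
  is.na(activityTimeAvgs$steps),
  activityTimeAvgs$avgsteps,
  activityTimeAvgs$steps
)

activityTimeAvgs$stepsImp
```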

R returns list instead of filling in dataframe column

I am trying to use apply() to fill in an additional column in a dataframe by calling a function I created with each row of the data frame.
The dataframe is called Hit.Data and has 2 columns, Zip.Code and Hits. Here are a few rows:
Zip.Code , Hits
97222 , 20
10100 , 35
87700 , 23
The apply code is the following:
Hit.Data$Zone = apply(Hit.Data, 1, function(x) lookupZone("89000", x["Zip.Code"]))
The lookupZone() function is the following:
lookupZone <- function(sourceZip, destZip){
sourceKey = substr(sourceZip, 1, 3)
destKey = substr(destZip, 1, 3)
return(zipToZipZoneMap[[sourceKey]][[destKey]])
}
All the lookupZone() function does is take the 2 strings, truncates to the required characters and looks up the values. What happens when I run this code though is that R assigns a list to Hit.Data$Zone instead of filling in data row by row.
> typeof(Hit.Data$Zone)
[1] "list"
What baffles me is that when I use apply and just tell it to put a number in it works correctly:
> Hit.Data$Zone = apply(Hit.Data, 1, function(x) 2)
> typeof(Hit.Data$Zone)
[1] "double"
I know R has a lot of strange behavior around dropping dimensions of matrices and doing odd things with lists but this looks like it should be pretty straightforward. What am I missing? I feel like there is something fundamental about R I am fighting, and so far it is winning.
Your problem is that you are occasionally looking up non-existing entries in your hashmap, which causes hash to silently return NULL. Consider:
> hash("890", hash("972"=3, "101"=3, "877"=3))[["890"]][["101"]]
[1] 3
> hash("890", hash("972"=3, "101"=3, "877"=3))[["890"]][["100"]]
NULL
If apply encounters any NULL values, then it can't coerce the result to a vector, so it will return a list. Same will happen with sapply.
You have to ensure that all possible combinations of the first three zip code digits in your data are present in your hash, or you need logic in your code to return NA instead of NULL for missing entries.
As others have said, it's hard to diagnose without knowing what zipToZipZoneMap contains, but you could try this:
Hit.Data$Zone <- sapply(Hit.Data$Zip.Code, function(x) lookupZone("89000", x))
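For the "return NA instead of NULL" route, something along these lines might work. This is only a sketch: it assumes the map can be represented as a plain nested list (the question appears to use the hash package, which is not used here), lookupZoneSafe is a hypothetical name, and the fourth zip code is invented to force a missing entry. vapply is used instead of sapply so that the result is guaranteed to be a numeric vector, or an error, but never a list.

```r
# Assumption: a nested base-R list standing in for the asker's zipToZipZoneMap
zipToZipZoneMap <- list("890" = list("972" = 3, "101" = 3, "877" = 3))

# Hypothetical wrapper: same lookup, but a missing entry gives NA, not NULL
lookupZoneSafe <- function(sourceZip, destZip) {
  sourceKey <- substr(sourceZip, 1, 3)
  destKey <- substr(destZip, 1, 3)
  zone <- zipToZipZoneMap[[sourceKey]][[destKey]]
  if (is.null(zone)) NA_real_ else zone
}

# The last row ("55555") is made up and has no entry in the map
Hit.Data <- data.frame(
  Zip.Code = c("97222", "10100", "87700", "55555"),
  Hits = c(20, 35, 23, 7),
  stringsAsFactors = FALSE
)

# numeric(1) declares that every lookup must return exactly one number
Hit.Data$Zone <- vapply(
  Hit.Data$Zip.Code, lookupZoneSafe, numeric(1),
  sourceZip = "89000", USE.NAMES = FALSE
)

typeof(Hit.Data$Zone)
```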
