Trouble with setting conditionals to parse through data.frame - r

# Seems like we have some extra artifacts that appear as the dataset changes firms. Write a piece of code that checks where the tickers change, and delete all artifacts from those points.
for (y in 1:nrow(longitudinal)) {
  if (longitudinal[y, 2] != longitudinal[y - 1, 2]) {
    longitudinal[y, ] = NA
  }
}
Hey guys, I am trying to remove values from a dataset wherever column 2, the name value, changes from one row to the next. Unfortunately I am getting the error
Error in Ops.data.frame(longitudinal[y, 2], longitudinal[y - 1, 2]) :
‘!=’ only defined for equally-sized data frames
I cannot think of a different way to compare the elements in the name column in order to set the condition for the NA's to correspond to the change in the name. Would appreciate any help thinking through this.

The for loop has to start one row down, i.e.
y in 2:nrow(longitudinal)
because with y = 1 the condition compares row 1 against row 0, which does not exist.
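A vectorised comparison avoids the row-0 problem altogether. The sketch below runs on made-up data: the column names date, ticker and value and the plain data.frame are assumptions, not taken from the question. It compares every ticker with the one directly above it in a single step. It also sidesteps a second pitfall of the loop: once a row has been set to NA, the next comparison yields NA, which if() cannot evaluate.

```r
# Hypothetical stand-in for the real data; only "ticker in column 2" matters.
longitudinal <- data.frame(
  date   = 1:6,
  ticker = c("AAA", "AAA", "BBB", "BBB", "BBB", "CCC"),
  value  = c(1.0, 1.1, 9.9, 2.0, 2.1, 8.8),
  stringsAsFactors = FALSE
)

# Pull column 2 out as a plain vector ([[ ]] works for data.frames and tibbles).
tickers <- longitudinal[[2]]

# TRUE where a row's ticker differs from the previous row's ticker.
# The first row has no predecessor, so it is never flagged.
changed <- c(FALSE, tickers[-1] != tickers[-length(tickers)])

# Blank out the flagged rows in one assignment.
longitudinal[changed, ] <- NA

which(changed)  # rows 3 and 6 in this toy data
```

Working from a copy of the ticker column means the comparison never sees the NA values written by the assignment.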

Related

Trying to find a better way to sorting the data in R

In my data frame I am trying to sort the data in descending order. I am using the below line of code for sorting my data and it works as intended.
CNS25VOL <- CNS25VOL[order(-CNS25VOL$MATVOL22), ]
However, if I refer to the same column by its index number, the code throws an error:
CNS25VOL <- CNS25VOL[order(-CNS25VOL[, 2]), ]
Error thrown is
Error in CNS25VOL[, 2] : incorrect number of dimensions
While I do have a working solution, the issue I see is that if the name of my column suddenly changes, the code won't work. I know that the column positions will stay the same in the data frame.
How can I handle this?
In order(-CNS25VOL[, 2]), order() expects a vector, which you try to construct with the [] in CNS25VOL[, 2]. A normal data.frame returns a vector consisting only of the 2nd column. A tibble, however, returns a tibble with only one column.
You can reproduce the behaviour of normal data.frames with the drop = TRUE argument to [], as in
CNS25VOL[, 2, drop = TRUE]
Always be aware of whether you are using a standard data.frame, a tibble or a data.table, because they look very similar but differ in the details. Also see https://tibble.tidyverse.org/reference/subsetting.html
dplyr functions tend to give you a tibble back even if you fed them a classical data.frame.
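Another way to refer to a column by position is [[ ]], which returns a plain vector for data.frames and tibbles alike. The sketch below is only an illustration on made-up data; the ITEM column and the values are assumptions, and a base data.frame is used so the example runs without extra packages.

```r
# Hypothetical stand-in for the real CNS25VOL data.
CNS25VOL <- data.frame(
  ITEM     = c("a", "b", "c"),
  MATVOL22 = c(10, 30, 20),
  stringsAsFactors = FALSE
)

# [[2]] extracts the second column as a vector, whatever its name is.
sorted <- CNS25VOL[order(-CNS25VOL[[2]]), ]

# Equivalent, and also usable for non-numeric columns:
sorted_alt <- CNS25VOL[order(CNS25VOL[[2]], decreasing = TRUE), ]

sorted$MATVOL22  # 30 20 10
```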

Use multiple columns as identifiers while comparing two data frames in R using setdiff

I have two data frames to compare. Screenshots of the data frames are shown below
There are three things I am trying to check:
1st Check: Items that existed in Data 1, but do not exist in Data 2 [Item4; SubItem4; SubsubItem1]
2nd Check: Items that did not exist in Data 1, but do exist in Data 2 [Item6; SubItem1; SubsubItem1]
3rd Check: Items that exist in both lists, but have changed in value [Item2; SubItem5; SubsubItem1]
I got the first and the second check easily with anti_join()
MissingfromData2 <- anti_join(Data1,Data2, by = c("Property.1","Property.2","Property.3"))
MissingfromData1 <- anti_join(Data2,Data1, by = c("Property.1","Property.2","Property.3"))
For the 3rd check, however, I cannot seem to lock in the identifiers with by = c("Property.1","Property.2","Property.3").
When I do the following
changedValue1 <- setdiff(Data1,Data2, by = c("Property.1","Property.2","Property.3"))
changedValue2 <- setdiff(Data2,Data1, by = c("Property.1","Property.2","Property.3"))
I get the additional row (from check 1 and check 2), which I do not need.
How do I obtain the result for only changed values?
I found the solution to the problem. All I needed to add to the code above was the following bit of code
Result <- setdiff(changedValue2,MissingfromData1, by = c("Property.1","Property.2","Property.3"))
which returned the only row in changedValue2 that was not in MissingfromData1, i.e. the row with the desired difference in the value columns.
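A different route to the 3rd check is to join the two data frames on the identifier columns and keep the rows whose values differ. The sketch below uses base R merge() in place of the dplyr calls above, purely so that it runs without extra packages. The data and the Value column name are assumptions, since the original screenshots are not available.

```r
# Hypothetical data mirroring the three cases described in the question.
Data1 <- data.frame(
  Property.1 = c("Item1", "Item2", "Item4"),
  Property.2 = c("SubItem1", "SubItem5", "SubItem4"),
  Property.3 = "SubsubItem1",
  Value      = c(10, 20, 40),
  stringsAsFactors = FALSE
)
Data2 <- data.frame(
  Property.1 = c("Item1", "Item2", "Item6"),
  Property.2 = c("SubItem1", "SubItem5", "SubItem1"),
  Property.3 = "SubsubItem1",
  Value      = c(10, 25, 60),
  stringsAsFactors = FALSE
)

keys <- c("Property.1", "Property.2", "Property.3")

# Inner join: only rows whose identifiers exist in both data frames survive.
both <- merge(Data1, Data2, by = keys, suffixes = c(".1", ".2"))

# Keep the rows where the value changed between the two data frames.
changedValue <- both[both$Value.1 != both$Value.2, ]
changedValue
```

Because the join already drops rows missing from either side, no second set difference is needed to remove the results of checks 1 and 2.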

Create new column in dataframe using if {} else {} in R

I'm trying to add a conditional column to a dataframe, but not getting the results I'm expecting.
I have a dataframe with values recorded for the column "steps" across 5-minute intervals over various days. I'm trying to impute missing values in the 'steps' column by using the mean number of steps for a given 5-minute interval on the days that do have measurements. n.b. I tried using the MICE package for this but it just crashed my computer so I opted for a more manual workaround.
As an intermediate stage, I have bound an additional column to the existing dataframe with the mean number of steps for that interval. What I want to do next is create a column that returns that mean if the raw number of steps is missing (NA), and just uses the raw value otherwise. Here's my code for that part:
activityTimeAvgs$stepsImp <- if(is.na(activityTimeAvgs$steps)){
activityTimeAvgs$avgsteps
} else {
activityTimeAvgs$steps
}
What I expected to happen is that the if statement would evaluate as TRUE if 'steps' is NA and consequently give 'avgsteps'; in cases where 'steps' is not NA I would expect it to just use the raw value for 'steps'. However, the output just gives the value for 'avgsteps' in every row, which is not much use. I also get the following warning:
Warning message:
In if (is.na(activityTimeAvgs$steps)) { :
the condition has length > 1 and only the first element will be used
Any ideas where I'm going wrong?
Thanks in advance.
The if statement is not suitable for this, because it evaluates a single condition rather than one condition per row. You need to use the vectorised ifelse:
activityTimeAvgs$stepsImp <- ifelse(is.na(activityTimeAvgs$steps), activityTimeAvgs$avgsteps, activityTimeAvgs$steps)
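To see the difference on a small scale, here is a minimal sketch with made-up numbers; the four rows are assumptions, and only the column names come from the question. ifelse() tests every row and picks the matching element from one of the two vectors. Note that from R 4.2.0 onwards, passing a condition of length greater than one to if() is an error rather than a warning.

```r
# Toy version of the data: two missing step counts out of four.
activityTimeAvgs <- data.frame(
  steps    = c(NA, 5, NA, 12),
  avgsteps = c(1.5, 2.5, 3.5, 4.5)
)

# Row by row: take avgsteps where steps is NA, otherwise keep steps.
activityTimeAvgs$stepsImp <- ifelse(
  is.na(activityTimeAvgs$steps),
  activityTimeAvgs$avgsteps,
  activityTimeAvgs$steps
)

activityTimeAvgs$stepsImp  # 1.5 5.0 3.5 12.0
```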

Vectorized ifelse conundrum

I have two arrays "begin" and "end_a" which contain some integer indices, except that some of the entries in "end_a" are NA.
And panelDataset is a matrix which contains the data. I want to take the means of the rows of panelDataset corresponding to non-NA entries of begin and end_a.
I have this working in serial fashion and it works fine, but when I tried to vectorize it as follows
switch_mu=ifelse(!is.na(end_a),mean(panelDataset[begin: end_a,4]),NA)
It gives an error: Error in begin:end_a : NA/NaN argument.
When I check the entries of end_a separately for NAs using is.na(end_a), it does show the correct entries of the array as NA. So, that is not an issue.
I know I am missing something trivial. Any thoughts?
Try this:
means <- apply(na.omit(cbind(begin, end_a)), 1,
function(x) mean(panelDataset[x[1]:x[2], 4]))
replace(end_a, !is.na(end_a), means)
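The reason the ifelse attempt fails is that begin:end_a is not vectorised: the : operator builds one sequence from the first element of each argument, and ifelse evaluates its whole yes expression up front instead of once per element. The answer's code can be tried on toy data as below; the matrix contents and the index vectors are assumptions made for illustration.

```r
# Toy data: a 10 x 4 matrix whose 4th column holds 31..40.
panelDataset <- matrix(1:40, nrow = 10, ncol = 4)
begin <- c(1, 4, 7)
end_a <- c(3, NA, 10)

# Drop the pairs with an NA, then average column 4 over each begin:end range.
means <- apply(na.omit(cbind(begin, end_a)), 1,
               function(x) mean(panelDataset[x[1]:x[2], 4]))

# Put the means back in the positions where end_a was not NA.
switch_mu <- replace(end_a, !is.na(end_a), means)
switch_mu  # 32.0 NA 38.5
```

This relies on begin having no NA entries of its own; if it did, the rows dropped by na.omit would no longer line up with !is.na(end_a).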

Create a new data frame of the means of randomly selected rows - looped

Question:
I have a data.frame (hlth) that consists of 49 vectors - a mix of numeric(25:49) and factor(1:24). I am trying to randomly select 50 rows, then calculate column means only for the numeric columns (dropping the other values), and then place the random row mean(s) into a new data.frame (beta). I would then like to iterate this process 1000 times.
I have attempted this process, but the values that get returned are identical and the new means will not enter the new data.frame.
Here are a few rows and columns of the data.frame (hlth):
DateIn     adgadj    Sex     VetMedCharges  pwtfc
1/01/2006  3.033310  STEER   0.00           675.1151
1/10/1992  3.388245  STEER   2540.33        640.2261
1/10/1995  3.550847  STEER   572.78         607.6200
1/10/1996  2.893707  HEIFER  549.42         425.5217
1/10/1996  3.647233  STEER   669.18         403.8238
The code I have used thus far:
set.seed[25]
beta<-data.frame()
net.row<-function(n=50){
netcol=sample(1:nrow(hlth),size=n ,replace=TRUE)
rNames <- row.names(hlth)
subset(hlth,rNames%in%netrow,select=c(25:49))
colMeans(s1,na.rm=TRUE,dims=1)
}
beta$net.row=replicate(1000,net.row()); net.row
The two issues, that I have detected, are:
1) Returns the same value(s) each iteration
2) "Error during wrap-up: object of type 'closure' is not subsettable" when the beta$netrow
Any suggestions would be appreciated!!!
Just adding to my comment (and firstly pasting it):
netcol=sample(1:nrow(hlth),size=n ,replace=TRUE) should presumably be netrow = ... and the error is a scoping problem: R is trying to subset the function beta, presumably again because it can't find netRowMeans in the data.frame you've defined, moves on to the global environment and throws an error there.
There are also a couple of other things. You don't assign subset(hlth,rNames%in%netrow,select=c(25:49)) to a variable, which I think you mean to assign to s1, so colMeans is probably running on something you've set in the global environment.
If you want to pass a variable directly into the data frame beta in that manner, you'll have to initialise beta with the right number of columns and rows. The column means you've passed out will be a vector of (1 x 25), so they won't fit in a single column. You would probably be better off initialising a matrix called mat or something similar (to avoid confusion with scoping errors masking the actual error messages) with 25 columns and 1000 rows.
EDIT: Question has been edited slightly since I posted this, but most points still stand.
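The matrix suggestion can be sketched as follows. This is an illustration under assumptions, not the asker's real data: hlth is simulated with one factor and two numeric columns, and net_row_means is a hypothetical helper name. Two details differ from the question's code. set.seed is called with parentheses, since set.seed[25] tries to subset a function and by itself produces the "object of type 'closure' is not subsettable" error. Rows are also indexed directly instead of through subset() with %in%, so rows drawn more than once are counted each time.

```r
set.seed(25)

# Simulated stand-in for hlth: one factor column, two numeric columns.
hlth <- data.frame(
  Sex    = factor(sample(c("STEER", "HEIFER"), 200, replace = TRUE)),
  adgadj = rnorm(200, mean = 3.3, sd = 0.3),
  pwtfc  = rnorm(200, mean = 550, sd = 100)
)

# Select numeric columns by type rather than by hard-coded position.
num_cols <- sapply(hlth, is.numeric)

# Draw n rows with replacement and return the means of the numeric columns.
net_row_means <- function(n = 50) {
  rows <- sample(nrow(hlth), size = n, replace = TRUE)
  colMeans(hlth[rows, num_cols, drop = FALSE], na.rm = TRUE)
}

# replicate() returns one column per iteration; transpose to one row each.
mat  <- t(replicate(1000, net_row_means()))
beta <- as.data.frame(mat)

dim(mat)  # 1000 rows, 2 columns
```

With the real data, the same function would return 25 means per draw, giving the 1000 x 25 matrix described above.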
