Subsetting rows, changing values, and placing them back into matrix? - r

I hope this has not been answered, but when I search for a solution to my problem I am not getting any results.
I have a data.frame of 2000+ observations and 20+ columns. Each row represents a different observation and each column a different facet of data for that observation. My objective is to iterate through the data.frame and select observations that match criteria (e.g., observations that are in certain states). After this, I need to subtract or add time to convert each timestamp to its appropriate time zone (all of the times are in CST). What I have so far is an exorbitant number of subsetting commands that pick out the rows for the state being checked against. When I try to write a for loop, I only get one value returned, not the whole row.
I was wondering if anyone had suggestions or knew of any functions that could help. I've tried just about everything, but I really don't want to go through each state's observations and modify the times by hand. I would prefer a loop that could go through the data, select rows based on their state, subtract or add time, and then place each row back into its original data.frame (replacing the old value).
I appreciate any help.
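A loop isn't needed here: R's vectorized indexing can shift every row's time in one step. Below is a minimal sketch, assuming columns named `state` and `time` and a lookup vector of hour offsets from CST; all of those names and offsets are placeholders you would replace with your actual data.

```r
# Sketch: shift CST timestamps by a per-state hour offset.
# The column names (state, time) and the offset table are
# illustrative assumptions, not your real data.
df <- data.frame(
  state = c("TX", "NY", "CA"),
  time  = as.POSIXct(c("2023-01-01 12:00:00", "2023-01-01 12:00:00",
                       "2023-01-01 12:00:00"), tz = "UTC")
)

# Hours to add to CST to reach each state's local time
offset <- c(TX = 0, NY = 1, CA = -2)

# Vectorized: indexing the named vector by state aligns the
# offsets with the rows, so every row is updated in place at once
df$time <- df$time + offset[df$state] * 3600
```

Because the assignment writes straight back into `df$time`, the rows never leave the data.frame, which sidesteps the "put the row back" step entirely.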

Related

Removing NA values from rows, without removing the rows in R

I have a dataset in which the data would look something like this:
a fragment dataframe with the data
So lots of NAs per row, but also regular answers that I want in the final version.
Is it possible to remove the NAs, but without removing the rows as a whole?
I thought about pivoting and removing rows with NA, but then it would also remove the occurrences that have actual answers.
The data is coming from a decision-making procedure in Qualtrics, in which not every option is displayed to the participants (hence the NAs), but we do not want to exclude people in any step. I also thought about recoding the values and subsetting them somehow, but that doesn't seem to work out right in my mind when it comes to the actual analysis.
I tried removing the NAs, as well as pivoting the table and removing them later.
I do not yet have the full dataset, but I want to experiment with data-analysis strategies before the data are collected, so I don't get lost once I have them.
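One common approach is to reshape to long format and drop only the NA cells, which keeps every participant while discarding the unseen options. A minimal sketch with `tidyr::pivot_longer()`, assuming an `id` column plus answer columns (all names here are illustrative):

```r
library(tidyr)

# Sketch: drop NA cells without dropping participants.
# The column names (id, q1, q2) are made-up placeholders.
df <- data.frame(
  id = 1:3,
  q1 = c("a", NA, "c"),
  q2 = c(NA, "b", NA)
)

# values_drop_na removes only the NA cells; each participant keeps
# exactly as many rows as they have real answers
long <- pivot_longer(df, cols = -id, names_to = "question",
                     values_to = "answer", values_drop_na = TRUE)
```

Every `id` survives the reshape, so no one is excluded; participants simply end up with different numbers of rows.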

Determining whether values between rows match for certain columns

I am working with some observation data and have run into a bit of an issue beyond my current capabilities. I surveyed different polygons (the column "PolygonID" in the screenshot) for lizards two times during a survey season. I want to determine the total search effort (shown in the column "Effort") for each individual polygon within each survey round. Problem is, the software I was using to collect the data sometimes creates unnecessary repeats for polygons within a survey round. There is an example of this in the screenshot for the rows with PolygonID P3.
Most of the time this does not affect the effort calculations because the start and end times for the repeated rows (the fields used to calculate effort) are the same, and I know how to filter the dataset so it shows only one line per polygon per survey. But I have reason to be concerned that there might be lines where the software glitched and assigned incorrect start and end times to one of the repeats. Is there a way to test in R whether the start and end times match for any such repeats, rather than manually going through all the data?
Thank you!
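This check can be done by grouping on polygon and survey round and flagging groups whose repeats disagree. A sketch with `dplyr`, where the `Round`, `Start`, and `End` column names are guesses from the description (only `PolygonID` and `Effort` are named in the question):

```r
library(dplyr)

# Sketch: flag polygon/round combinations whose repeated rows
# disagree on start or end time. Column names other than
# PolygonID are assumptions.
surveys <- data.frame(
  PolygonID = c("P1", "P3", "P3"),
  Round     = c(1, 1, 1),
  Start     = c("09:00", "10:00", "10:05"),  # P3's repeats disagree
  End       = c("10:00", "11:00", "11:00")
)

suspect <- surveys %>%
  group_by(PolygonID, Round) %>%
  filter(n() > 1,
         n_distinct(Start) > 1 | n_distinct(End) > 1) %>%
  ungroup()
# 'suspect' now holds only the repeats that need a manual look
```

If `suspect` comes back empty, the repeats are all consistent and the simple one-line-per-polygon filter is safe.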

Why can't I do anything with my columns in RStudio

So far I have imported a data set with columns labeled with things people are afraid of (snakes, heights, spiders, etc.), and each row is a number (1, 2, 3) representing a different person. Each person rated how much each thing scared them on a 1-5 scale.
I cleaned the data so the NAs are gone. Now every time I try to do calculations on the data set, it tells me
Error: object 'fear.of.public.speaking' not found
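That error usually means the column name was typed on its own: columns live inside the data frame, so R looks for a free-standing object called `fear.of.public.speaking` and fails. A small sketch, where the data-frame name `fears` and the values are assumptions for illustration:

```r
# Columns must be qualified with the data frame they live in.
# 'fears' and its values are made up for this example.
fears <- data.frame(fear.of.public.speaking = c(3, 5, 2))

# mean(fear.of.public.speaking)  # -> "object not found" error
mean(fears$fear.of.public.speaking)         # $ notation works
with(fears, mean(fear.of.public.speaking))  # so does with()
```

Functions like `dplyr::summarise()` also evaluate column names inside the data frame, which is another common way around this.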

Name of columns depends on data

I have a question about the financial data of a stock (open price, close price, high, low). Since the data we download do not always look the same, it is a problem to automate the code where the data are used.
For example, sometimes I download data that have the following columns:
open close high low
Sometimes these columns may be named:
open_ask close_bid high low
Is there a function in R that allows working with data where the columns may be named similarly but not exactly the same? For example, I want to plot a candle chart, and R needs to find the columns holding the open and close prices.
You could try identifying columns in your data frame with a regex, since grepl() returns a logical vector you can subset with. For example, to match the open or open_ask column, you could use:
open_col <- df[, grepl("open", names(df))]
If the names cannot be correlated in any meaningful way, you might be able to go by position. But that runs the risk of error should columns shift position, whereas a regex works regardless of where a matching column sits.
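Wrapping the idea in a small helper makes it reusable for every price column, and anchoring the pattern with `^` avoids accidental matches (e.g. a hypothetical `reopen` column). The column names in this sketch are just the two variants from the question:

```r
# Sketch: resolve whichever naming variant is present in the
# downloaded data. The example columns mirror the question.
df <- data.frame(open_ask = 1:3, close_bid = 4:6,
                 high = 7:9, low = 0:2)

pick <- function(data, pattern) {
  hits <- grep(pattern, names(data), value = TRUE)
  if (length(hits) != 1) stop("expected one match for: ", pattern)
  data[[hits]]
}

open_col  <- pick(df, "^open")   # matches "open" or "open_ask"
close_col <- pick(df, "^close")  # matches "close" or "close_bid"
```

The error on zero or multiple matches is deliberate: a silent wrong pick would put the wrong prices on the candle chart.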

Column means over a finite range of rows

I am working with climate data in New Mexico and I am an R novice. I am trying to replace NAs with means, but there are 37 different sites in my df. I want the column means for each unique DF$STATION.NAME (in column 1). I can't use data from one location to find the mean of another, obviously, so really I should have a mean for each month, for each station.
My data are organized with station.name vertically in column 1 and readings for months Jan-Dec in the following columns, including a total column at the end (right). Readings are for each station for each month, over several years (the station name is listed in a new row for each new year).
I need to replace the NAs with the means of the CLDD for the given month within the given station.name. How do I do this?
Try asking that question on https://stats.stackexchange.com/ (as suggested by the statistics tag); there are probably more R users there than on the general programming site. I also added the r tag to your question.
There is nothing wrong with splitting your data into station-month subsets, filling the missing values there, then reassembling them into one big matrix!
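Base R's `ave()` does exactly that split-fill-reassemble in one call: it applies a function within each group and returns the results in the original row order. A minimal sketch with one grouping column; with your layout you would group on the station column and apply this per month column (the sample values are made up):

```r
# Sketch: fill each NA with the mean of its own station's values,
# per column. STATION.NAME and CLDD follow the question; the
# numbers are invented for illustration.
climate <- data.frame(
  STATION.NAME = c("A", "A", "B", "B"),
  CLDD         = c(10, NA, 30, 50)
)

fill_mean <- function(x) replace(x, is.na(x), mean(x, na.rm = TRUE))

# ave() splits by station, fills within each group, and puts the
# values back in the original positions
climate$CLDD <- ave(climate$CLDD, climate$STATION.NAME, FUN = fill_mean)
```

Looping `fill_mean` over the twelve month columns (e.g. with `lapply()`) then gives a per-station, per-month imputation without ever mixing stations.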
See also:
Replace mean or mode for missing values in R
Note that filling missing values with means, medians, or modes is popular, but it may dilute your results, since it obviously reduces variance. Unless you have a strong physical argument for why and how the missing values can be interpolated, it would be more elegant to find a method that deals with missing values directly.