Column names depend on the data - r

I have a question about financial stock data (open price, close price, high, low). Since the data we download do not always come in the same format, it is hard to automate code that uses them.
For example, sometimes I download data with the following columns:
open close high low
Sometimes these columns may be named:
open_ask close_bid high low
Is there a function in R that allows working with data where the columns have similar but not identical names? For example, I want to plot a candlestick chart, which requires R to pick the columns that hold the open and close prices.

You could try identifying columns in your data frame using a regex which provides a logical match. For example, to match the open or open_ask columns, you could use:
open_col <- df[, grepl("open", names(df))]
If the names cannot be correlated in any meaningful way, then you might be able to go by position. But this runs the risk of error should columns shift position, whereas a regex works regardless of where a potentially matching column is positioned.
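A minimal sketch of the regex approach, extended to all four price columns. The sample values and the helper `find_col` are hypothetical, and the anchored patterns assume each column name starts with the standard word, as in the two naming schemes from the question.

```r
# Hypothetical sample data using the second naming scheme from the question
df <- data.frame(
  open_ask  = c(10.0, 10.5, 10.2),
  close_bid = c(10.4, 10.1, 10.8),
  high      = c(10.6, 10.7, 10.9),
  low       = c(9.8, 10.0, 10.1)
)

# Hypothetical helper: return the single column name matching a pattern,
# and fail loudly if there is no match or more than one
find_col <- function(data, pattern) {
  hits <- grep(pattern, names(data), ignore.case = TRUE, value = TRUE)
  if (length(hits) != 1) {
    stop("Expected exactly one column matching '", pattern,
         "', found ", length(hits))
  }
  hits
}

# Map each standard name to whatever the file happens to call it
patterns <- c(open = "^open", close = "^close", high = "^high", low = "^low")
matched  <- vapply(patterns, function(p) find_col(df, p), character(1))

# Build a data frame with standardised names for downstream code
ohlc <- setNames(df[, matched], names(patterns))
```

Downstream plotting code can then always refer to `ohlc$open` and `ohlc$close`, whichever naming scheme the download used.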

Related

Grouping and transposing data in R

It is hard to explain this without just showing what I have, where I am, and what I need in terms of data structure:
What structure I had:
Where I have got to with my transformation efforts:
What I need to end up with:
Notes:
I've not given actual names for anything as the data is classed as sensitive, but:
Metrics are things that can be measured, for example the number of permanent or full-time jobs. The number of metrics is larger than presented in the test data (and the example structure above).
Each metric has many years of data (whilst writing the code I have restricted myself to just 3 years; the illustration of the structure is based on this test). The number of years captured will change over time; generally it will increase.
The number of policies will fluctuate, I've just labelled them policy 1, 2 etc for sensitivity reasons and limited the number whilst testing the code. Again, I have limited the number to make it easier to check the outputs.
The source data comes from a workbook of surveys with a tab for each policy. The initial import creates a list of tibbles consisting of a row for each metric, and 4 columns (the metric names, the values for 2024, the values for 2030, and the values for 2035). I converted this to a dataframe, created a vector to be a column header and used cbind() to put this on top to get the "What structure I had" data.
To get to the "Where I have got to with my transformation efforts" version of the table, I removed all the metric columns, created another vector of metrics and used rbind() to put this as the first column.
The idea in my head was to group the data by policy to get a vector for each metric, then transpose this so that the metric became the column, and the grouped data would become the row. Then expand the data to get the metrics repeated for each year. A friend of mine who does coding (but has never used R) has suggested using loops might be a better way forward. Again, I am not sure of the best approach so welcome advice. On Reddit someone suggested using pivot_wider/pivot_longer but this appears to be a summarise tool and I am not trying to summarise the data rather transform its structure.
Any suggestions on approaches or possible tools/functions to use would be gratefully received. I am learning R whilst trying to pull this data together to create a database that can be used for analysis, so, if my approach sounds weird, feel free to suggest alternatives. Thanks

Sorting a column of values based on index location

I am currently working with a large amount of data. For testing purposes I am using a smaller batch, but the main point of concern is the sorting of all the data based off of values in one particular column. I have posted a picture below that shows a small portion of my unsorted data. I want to sort the values in row 2 in ascending order along with all other data in those corresponding columns. In other words I don't want to just order row 2, I want to order row 2 and shift all other data with those re-ordered values.
Currently what I do is read in that csv to a data frame (tmpDF).
After that I transpose the data using tmpDF <- t(tmpDF)
Now I take that data and order the second column into ascending order (or at least that is what I think I am doing): tmpDF <- tmpDF[order(tmpDF[,1]),]
Then I re-transpose the data to get it back to how it was originally, but sorted. The result is shown in the picture below, "Ordered data result". Keep in mind that the unsorted and sorted pictures show different numbers because I have not posted my entire data set.
I have a few questions about this.
1) Am I going about this the correct way? I am not a very experienced programmer, just trying to teach myself R to help out my research efforts.
2) Why are the values such as "102" being represented as "1.01E+02" in my final sorted csv file? I don't believe I am changing type and in the original file they were represented as "102"
3) Why does the value 116 get ordered before "1.01E+02"?
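A sketch of the same reordering without the double transpose, using hypothetical data in which row 2 holds the sort keys. One thing worth checking in the original workflow: t() on a data frame goes through as.matrix(), which produces a character matrix if any column is non-numeric, and order() then compares strings rather than numbers. Converting the key row with as.numeric() avoids that, though whether this is what happened here is only an assumption.

```r
# Hypothetical data: each column is one record, row 2 holds the sort key
tmpDF <- data.frame(
  V1 = c(5, 116, 0.3),
  V2 = c(7, 102, 0.9),
  V3 = c(2, 98, 0.1)
)

# Extract row 2 as a plain numeric vector
keys <- as.numeric(unlist(tmpDF[2, ]))

# Reorder whole columns by ascending key; every row moves with its column
sorted <- tmpDF[, order(keys)]
```

Indexing columns directly keeps the data frame's column types intact, so nothing is converted to text along the way.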

Subsetting rows, changing values, and placing them back into matrix?

I hope this has not been answered, but when I search for a solution to my problem I am not getting any results.
I have a data.frame of 2000+ observations and 20+ columns. Each row represents a different observation and each column represents a different facet of data for that observation. My objective is to iterate through the data.frame and select observations which match criteria (e.g., I am trying to pick out observations that are in certain states). After this, I need to subtract or add time to convert each one to its appropriate time zone (all of the times are in CST). What I have so far is an exorbitant number of subsetting commands that pick out the rows belonging to the state being checked. When I try to write a for loop I can only get one value returned, not the whole row.
I was wondering if anyone had any suggestions or knew of any functions that could help me. I've tried just about everything, but I really don't want to have to go through each state of observations and modify the time. I would prefer a loop that could easily go through the data, select rows based on their state, subtract or add time, and then place the row back into its original data.frame (replacing the old value).
I appreciate any help.
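One way to avoid both the loop and the per-state subsetting is a lookup vector indexed by state, so all rows are shifted in one vectorised step. This is only a sketch: the column names `state` and `time`, the sample rows, and the hour offsets are assumptions for illustration, and states that span several time zones would need finer handling.

```r
# Hypothetical data: times are recorded in CST; they are stored here with
# tz = "UTC" purely so that the arithmetic is unambiguous
obs <- data.frame(
  state = c("NY", "TX", "CA", "CO"),
  time  = as.POSIXct(rep("2020-01-01 12:00:00", 4), tz = "UTC"),
  stringsAsFactors = FALSE
)

# Assumed offsets in hours relative to Central time (illustrative only)
offset_hours <- c(NY = 1, TX = 0, CA = -2, CO = -1)

# Look up each row's offset by state and shift every row at once;
# POSIXct arithmetic is in seconds, hence the factor of 3600
shift <- unname(offset_hours[as.character(obs$state)])
obs$local_time <- obs$time + shift * 3600
```

Because the result is written to a column of the same data frame, there is no need to extract rows and place them back afterwards.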

Column means over a finite range of rows

I am working with climate data in New Mexico and I am an R novice. I am trying to replace NA with means, but there are 37 different sites in my df. I want the column means computed only over rows that share the same DF$STATION.NAME (in column 1). I can't use data from one location to find the mean of another, obviously. So really I should have a mean for each month, for each station.
My data is organized by station.name vertically in column 1, with readings for the months Jan-Dec in the following columns and a total column at the far right. Readings are for each station for each month, over several years (the station name is listed in a new row for each new year).
I need to replace the NAs with the sums of the CLDD for the given month within the given station.name. How do I do this?
Try asking that question on https://stats.stackexchange.com/ (as suggested by the statistics tag), there are probably more R users there than on the general programming site. I also added the r tag to your question.
There is nothing wrong with splitting your data into station-month subsets, filling the missing values there, then reassembling them into one big matrix!
See also:
Replace mean or mode for missing values in R
Note that the common practice of filling missing values with means, medians or modes is popular, but may dilute your results since this will obviously reduce variance. Unless you have a strong physical argument why and how the missing values can be interpolated, it would be more elegant if you could find a way that can deal with missing values directly.
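The split, fill, and reassemble idea above can be sketched in base R with ave(), which computes a per-group statistic and returns it aligned with the original rows. The data frame, its values, and the month column names here are hypothetical; only STATION.NAME comes from the question.

```r
# Hypothetical climate data: one row per station and year, one column per month
clim <- data.frame(
  STATION.NAME = c("A", "A", "A", "B", "B", "B"),
  JAN = c(10, NA, 14, 100, 120, NA),
  FEB = c(NA, 20, 22, 200, NA, 240)
)

month_cols <- c("JAN", "FEB")

# For each month column, replace NA with the mean of that column
# computed within the same station only
for (m in month_cols) {
  station_mean <- ave(clim[[m]], clim$STATION.NAME,
                      FUN = function(x) mean(x, na.rm = TRUE))
  missing <- is.na(clim[[m]])
  clim[[m]][missing] <- station_mean[missing]
}
```

If a station has no observed value at all for a month, the mean is NaN and the gap stays unfilled, which is arguably the honest outcome given the caveat above.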

Trouble getting my data into wide form with the reshape package

I am currently analysing a rather large dataset (22k+ records) and am having some trouble getting the data into a wide format (with one row corresponding to each observation, and columns representing variables).
The data came in two CSV files, one giving demographics and the other giving participants' probability ratings for a number of questions. Both of these CSV files were in long format.
I have used the reshape (and reshape2 for speed) packages to attempt to solve my problem. The specific issue I am having is the following.
I have the participants' probability ratings in the following form (after one successful reshape).
dtf <- read.csv("http://dl.dropbox.com/u/8566396/foobar.csv")
Now, the format I would like my data to be in is as follows:
User ID, Qid1, ..., Qid255, Time, with the probabilities for each question in that question's corresponding column.
I have tried a loop and apply to put the values into a new data frame, and many variations of melt and cast. I have also tried the base reshape function, but all to no avail.
In the past, I've always edited my CSV files directly, but this is not an option with the size of this file (my laziness when it comes to data manipulation within R has come back to haunt me).
Any advice or solution you can give to avoid me having to do this by hand would be greatly appreciated.
Your dataset has 6 rows, 3 of which have the column "variable" equal to "probability" and 3 of which have that column equal to "time". You want to have probability be the value of each, and time be added onto the right.
I think there's a difficulty in making this work for you because what you want to do isn't clear. You have values for each UID-Time-X### cell, and values for each UID-Prob-X### cell. Therefore, you have to discard information to get it into your preferred format (UID-Time-X### with probabilities as the values). It seems to me like you're treating time as an ID variable, but it's storing values like a content variable.
To avoid discarding any data, your output would have to look something like:
UID Time1 Time2 Time3 Prob1 Prob2 Prob3
Which is simply reshaped wide.
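A minimal sketch of that wide layout using base reshape(). The data frame `long` and its columns `uid`, `qid`, `time`, and `prob` are hypothetical stand-ins, since the real file's layout is not reproduced here.

```r
# Hypothetical long data: one row per user and question
long <- data.frame(
  uid  = c(1, 1, 1, 2, 2, 2),
  qid  = c(1, 2, 3, 1, 2, 3),
  time = c(30, 45, 20, 25, 50, 35),
  prob = c(0.2, 0.7, 0.5, 0.9, 0.1, 0.4)
)

# One row per user; time and prob are each spread into one column per question
wide <- reshape(long, idvar = "uid", timevar = "qid",
                v.names = c("time", "prob"), direction = "wide")

# Reorder so all time columns come first, then all probability columns
wide <- wide[, c("uid", paste0("time.", 1:3), paste0("prob.", 1:3))]
```

Both value variables survive the reshape, so no information is discarded.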

Resources