Sorting a column of values based on index location - r

I am currently working with a large amount of data. For testing purposes I am using a smaller batch, but the main point of concern is the sorting of all the data based off of values in one particular column. I have posted a picture below that shows a small portion of my unsorted data. I want to sort the values in row 2 in ascending order along with all other data in those corresponding columns. In other words I don't want to just order row 2, I want to order row 2 and shift all other data with those re-ordered values.
Currently I read that CSV into a data frame (tmpDF).
After that I transpose the data using tmpDF <- t(tmpDF)
Then I take that data and order the second column in ascending order (or at least that is what I think I am doing): tmpDF <- tmpDF[order(tmpDF[,1]),]
Finally I transpose the data again to get it back to its original layout, but sorted. The result is shown in the picture below, "Ordered data result". Keep in mind that the unsorted and sorted pictures show different numbers because I am not posting my entire data set.
I have a few questions about this.
1) Am I going about this the correct way? I am not a very experienced programmer, just trying to teach myself R to help out my research efforts.
2) Why are the values such as "102" being represented as "1.01E+02" in my final sorted csv file? I don't believe I am changing type and in the original file they were represented as "102"
3) Why does the value 116 get ordered before "1.01E+02"?
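A minimal sketch of reordering whole columns by the values in row 2, without transposing. The small data frame is a made-up stand-in for tmpDF, and converting row 2 with as.numeric assumes those values are numbers. One thing worth checking in the original approach: t() returns a matrix, and if the data frame has any non-numeric column that matrix is character, so order() would then compare text rather than numbers.

```r
# Hypothetical stand-in for tmpDF: each column is one record,
# and row 2 holds the values to sort by.
tmpDF <- data.frame(V1 = c(5, 116, 0.3),
                    V2 = c(7, 102, 0.1),
                    V3 = c(6, 109, 0.2))

# Pull row 2 out as a plain numeric vector so values compare as numbers.
key <- as.numeric(unlist(tmpDF[2, ]))

# Reorder whole columns by that key; every row moves together.
sortedDF <- tmpDF[, order(key)]
print(sortedDF)
```

Because the columns are indexed directly, no transpose is needed and the column types are left alone.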

Related

Paste name of column to other columns in R?

I have recently received the output of an online survey (ESRI Survey123), which stores each recorded attribute as a new column of the table. The survey reports characteristics of single trees located on a study site: e.g. beech1, beech2, etc. For each beech, several attributes are recorded, such as height, shape, etc.
This is what the output table looks like in Excel. ID simply represents the site number:
Now I wonder, how can I read those data into R to make sure that columns 1:3 belong to beech1, columns 4:6 represent beech2, etc.? I am looking for something that would paste the beech1 into names of the following columns: beech1.height, beech1.shape. But I am not sure how to do it?
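A rough sketch of one way to build names such as beech1.height, assuming the table has an ID column followed by a fixed block of attribute columns per tree, always in the same order. The data frame, the placeholder column names a to f, and the third attribute name "crown" are all made up for illustration.

```r
# Hypothetical survey export: ID first, then three attribute columns per tree.
survey <- data.frame(ID = 1:2,
                     a = c(20, 25), b = c("round", "oval"), c = c(1, 2),
                     d = c(18, 22), e = c("oval", "round"), f = c(3, 4))

trees <- c("beech1", "beech2")            # one entry per block of columns
attrs <- c("height", "shape", "crown")    # assumed attribute order within a block

# Repeat each tree name once per attribute and paste the attribute name on.
names(survey)[-1] <- paste(rep(trees, each = length(attrs)), attrs, sep = ".")
print(names(survey))
```

This only works if every tree has the same attributes in the same order; otherwise the names would have to be built block by block.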

Referencing last used row in a data frame

I couldn't find the answer in any previously asked questions, but I believe this is an easy one.
I have the below two lines of code, which take in data from excel in a specific range (using readxl for this). The range itself only goes through row 2589 in the excel document, but it will update dynamically (it's a time series) and to ensure I capture the different observations (rows) as they're added, I've included rows to 10000 in the read_excel range argument.
In the end, I'd like to run charts on this data, but a key part of this is identifying the last used row, without manually updating the code row for the latest date. I've tried using nrow but to no avail.
Raw_Index_History <- read_excel("RData.xlsx", range = "Returns!A6:P10000", col_names = TRUE)
Raw_Index_History <- Raw_Index_History[nrow(Raw_Index_History),]
Does anybody have any thoughts or advice? Thanks very much.
It would be easier to answer your question if you included an example.
Without knowing what your data looks like, answers are likely to be a bit vague.
Does your data contain NAs? If not, it should be straightforward to remove the empty rows with
na.omit(Raw_Index_History)
It appears you also have control over the Excel spreadsheet. So if your data does contain NAs, you could put some default value in your empty rows that gets overwritten as soon as a new data point is recorded. This will allow you to filter your data frame accordingly.
Raw_Index_History[!grepl("place_holder", Raw_Index_History$column_with_placeholder),]
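If the real data can contain some NAs, another option is to drop only the rows that are empty in every column and then take the last remaining row. This is a sketch on a made-up stand-in for Raw_Index_History; the column names Date and Index are assumptions.

```r
# Hypothetical stand-in for Raw_Index_History: two real rows followed by
# empty padding rows, as an over-long range would produce.
Raw_Index_History <- data.frame(
  Date  = as.Date(c("2020-01-01", "2020-01-02", NA, NA)),
  Index = c(100.0, 101.5, NA, NA)
)

# Keep rows that have at least one non-missing value.
used <- Raw_Index_History[rowSums(!is.na(Raw_Index_History)) > 0, ]

# The last used row is then simply the final row of the trimmed data.
last_row <- used[nrow(used), ]
print(last_row)
```

Unlike na.omit, this keeps rows that are only partly filled in.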
If you expect data in the spreadsheet to grow, you can specify only the columns to include, instead of a defined boundary.
Something like this ...
Raw_Index_History <- read_excel("RData.xlsx",
                                sheet = 1,
                                range = cell_cols("A:P"), # Only cols, no rows
                                col_names = TRUE)
Every time you run the code, R will pull in the data from columns between A:P up until the last populated row.
This will be a more elegant approach for your use case. (Consider what you would do when your data grows past 10000 rows in the future.)

R empty data frame after subsetting by factor

I need to subset my data depending on the content of one factor variable.
I tried to do it with subset:
new <- subset(data, original$Group1=="SALAD")
data is already a subset from a bigger data frame, in original I have the factor variable which should identify the wanted rows.
This works perfectly for one level of the factor variable, but (and I really don't understand why!) when I do it with the other factor level "BREAD" it creates the data frame but says "no data available", so it is empty. I imported the data from SPSS, if this matters. I've already checked the factor levels, and the naming should be right!
I would be really grateful for help; I spent 3 hours on this problem and wasn't able to find a solution.
I've also tried other ways to subset my data (e.g. split), but I want a data frame as output.
Do you have advice in general on the best way to subset a data frame if I want, e.g., 3 columns of this data frame, extracted depending on the level of a factor? (Most code examples are only for one or all columns.)
The entire point of the subset function (as I understand it) is to look inside the data frame for the right variable - so you can type
subset(data, var1 == "value")
instead of
data[data$var1 == "value", ]
Please correct me, anyone, if that is incorrect.
Now, in your case, you are explicitly taking Group1 from the data frame original and using that to subset data - which you say is a subset of original. Based on this, I see no reason to believe (and every reason not to believe) that the elements of original$Group1 will align with the rows of data. If Group1 is defined within data, why not just use the copy defined there - which is aligned correctly? If not, you need to be very explicit about what you are trying to accomplish, so that you can ensure that things are aligned correctly.
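As a small illustration of the point above, here is a sketch where Group1 lives inside the data frame being subset, and select picks three columns at the same time. The data frame and the column names x, y and z are made up.

```r
# Hypothetical data: the factor used for filtering is a column of the
# same data frame, so its values line up with the rows by construction.
data <- data.frame(Group1 = factor(c("SALAD", "BREAD", "SALAD", "BREAD")),
                   x = 1:4,
                   y = c(2.5, 3.5, 4.5, 5.5),
                   z = letters[1:4])

# Rows where Group1 is "BREAD", keeping only the three columns of interest.
bread <- subset(data, Group1 == "BREAD", select = c(x, y, z))
print(bread)
```

The equivalent bracket form would be data[data$Group1 == "BREAD", c("x", "y", "z")].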

Subsetting rows, changing values, and placing them back into matrix?

I hope this has not been answered, but when I search for a solution to my problem I am not getting any results.
I have a data.frame of 2000+ observations and 20+ columns. Each row represents a different observation and each column represents a different facet of data for that observation. My objective is to iterate through the data.frame and select observations which match certain criteria (e.g. I am trying to pick out observations that are in certain states). After this, I need to subtract or add time to convert each observation to its appropriate time zone (all of the times are in CST). What I have so far is an exorbitant number of subsetting commands that pick out the rows for the state being checked against. When I try to write a for loop I can only get one value returned, not the whole row.
I was wondering if anyone had any suggestions or knew of any functions that could help me. I've tried just about everything, but I really don't want to have to go through each state of observations and modify the time. I would prefer a loop that could easily go through the data, select rows based on their state, subtract or add time, and then place the row back into its original data.frame (replacing the old value).
I appreciate any help.
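One way to avoid both the loop and the per-state subsetting is a named lookup vector of offsets indexed by the state column. This is only a sketch: the data frame, the column names state and time, and the offsets themselves are made up for illustration.

```r
# Hypothetical observations; all times are recorded as CST wall-clock times.
# tz = "UTC" is used only so the arithmetic is not affected by local
# daylight-saving rules.
obs <- data.frame(
  state = c("NY", "IL", "CA", "NY"),
  time  = as.POSIXct(c("2020-01-01 12:00:00", "2020-01-01 12:00:00",
                       "2020-01-01 12:00:00", "2020-01-01 08:30:00"),
                     tz = "UTC"),
  stringsAsFactors = FALSE
)

# Assumed offsets from CST in hours, one entry per state (illustrative only).
offset_hours <- c(NY = 1, IL = 0, CA = -2)

# Look up each row's offset by state and shift the whole column at once.
obs$time <- obs$time + unname(offset_hours[obs$state]) * 3600
print(obs)
```

The rows never leave the data frame, so nothing has to be placed back afterwards.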

Trouble getting my data into wide form with the reshape package

I am currently analysing a rather large dataset (22k+ records) and am having some trouble getting the data into a wide format (with one row corresponding to each observation, and columns representing variables).
The data came in two CSV files, one giving demographics and the other giving participants' probability ratings for a number of questions. Both of these CSV files were in long format.
I have used the reshape (and reshape2 for speed) packages to attempt to solve my problem. The specific issue I am having is the following.
I have the participants' probability ratings in the following form (after one successful reshape).
dtf <- read.csv("http://dl.dropbox.com/u/8566396/foobar.csv")
Now, the format I would like my data to be in is as follows:
User ID, Qid1, ..., Qid255, Time, with the probabilities for each question in that question's corresponding column.
I have tried a loop and apply to put the values into a new data frame, and many variations of melt and cast. I have also tried the base reshape function, but all to no avail.
In the past, I've always edited my CSV files directly, but this is not an option with the size of this file (my laziness when it comes to data manipulation within R has come back to haunt me).
Any advice or solution you can give to avoid me having to do this by hand would be greatly appreciated.
Your dataset has 6 rows, 3 of which have the column "variable" equal to "probability" and 3 of which have that column equal to "time". You want to have probability be the value of each, and time be added onto the right.
I think there's a difficulty in making this work for you because what you want to do isn't clear. You have values for each UID-Time-X### cell, and values for each UID-Prob-X### cell. Therefore, you have to discard information to get it into your preferred format (UID-Time-X### with probabilities as the values). It seems to me like you're treating time as an ID variable, but it's storing values like a content variable.
To avoid discarding any data, your output would have to look something like:
UID Time1 Time2 Time3 Prob1 Prob2 Prob3
Which is simply reshaped wide.
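A minimal sketch of that kind of wide reshape using base R's reshape function, so that no values are discarded. The long data frame below is a made-up stand-in, not the contents of foobar.csv; the column names question, variable and value and the helper column key are assumptions.

```r
# Hypothetical long data: one row per user, question and measured variable.
long <- data.frame(
  UID      = rep(c("u1", "u2"), each = 4),
  variable = rep(c("probability", "probability", "time", "time"), times = 2),
  question = rep(c("X1", "X2"), times = 4),
  value    = c(0.2, 0.8, 30, 45, 0.5, 0.1, 12, 20)
)

# Combine variable and question into a single key, so each combination
# becomes its own column in the wide result.
long$key <- paste(long$variable, long$question, sep = "_")

# One row per UID, one column per key.
wide <- reshape(long[, c("UID", "key", "value")],
                idvar = "UID", timevar = "key", direction = "wide")
print(wide)
```

With reshape2, a dcast call on the same two identifying columns should give a similar layout.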
