I'm working through a crash course for R at https://bioinformatics-core-shared-training.github.io/r-crash-course/crash-course.nb.html
The problem I'm facing is extracting the rows that hold the min or max of a certain value.
For example, when running
df[df$tmp == min(df$tmp), ]
I get the correct row with the expected value.
However, when running the following code
df[min(df$tmp),]
I get something else completely.
I'm wondering what is causing this discrepancy?
Assuming df$tmp is numeric with no NAs, min(df$tmp) returns a single number. If that number is a positive integer i, df[min(df$tmp), ] returns the ith row of your data frame, provided the data frame has an ith row.
On the other hand, df[df$tmp == min(df$tmp), ] returns the row(s) of df where df$tmp equals the minimum value in that column.
df[df$tmp == min(df$tmp), ] is the correct approach to get what you are looking for.
df[min(df$tmp), ] uses min(df$tmp) as a row number, not as a value to match. It gives unexpected results in certain cases: a non-integer minimum is truncated toward zero, a negative one drops that row instead of selecting it, and one greater than the number of rows in df returns a row of NAs. Hope this makes sense.
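To make the difference concrete, here is a minimal sketch. The data frame and its id column are made up for illustration; only the column name tmp comes from the question. which.min() is shown as an alternative when a single row is enough.

```r
# Hypothetical data: the minimum of tmp is 1 and it occurs in rows 3 and 4.
df <- data.frame(id = c("a", "b", "c", "d"),
                 tmp = c(3, 4, 1, 1),
                 stringsAsFactors = FALSE)

# min(df$tmp) is 1, so this is just df[1, ]: the first row, whose tmp is 3.
by_position <- df[min(df$tmp), ]

# Logical indexing keeps every row whose tmp equals the minimum (rows 3 and 4).
by_value <- df[df$tmp == min(df$tmp), ]

# which.min() gives the position of the first minimum only (row 3).
first_min <- df[which.min(df$tmp), ]
```

Positional indexing only appears to work when the minimum value happens to coincide with the row number of a matching row.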
Related
I'm trying to write a function where I find the number of times the value in a data frame is above a certain number x (in this case, 3). Basically, the data start from 1.0, increase, then go below 1.0 (in a span of about 150 data points). I want the function to return to me the number of times the values are above this threshold. I'm fairly new to R and am just confused on how to go about this. Any help is appreciated. Thank you!
If your data frame is called df then sum(df$x>3) will return the number of rows of df where x is greater than 3.
If there are missing values in x and you want to ignore them then use sum(df$x>3, na.rm=TRUE).
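A small sketch of why this works, using made-up values: the comparison produces a logical vector, and sum() counts each TRUE as 1.

```r
# Hypothetical data: values rise above 3 and fall back, with one missing value.
df <- data.frame(x = c(1.0, 2.5, 3.2, 4.1, NA, 3.5, 2.0, 0.9))

above <- df$x > 3                 # logical vector; NA where x is missing

n_with_na <- sum(above)           # NA, because one comparison is NA
n_above <- sum(above, na.rm = TRUE)   # counts 3.2, 4.1 and 3.5
```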
I am new to R. Many thanks in advance for your help.
I am trying to find the maximum value of a single (the first) column in a matrix, "bin.matrix". Of course, I have been able to find the max value for all columns using:
apply(bin.matrix, 2, max)
But I can't seem to figure out how to get the value for just the first column. It's a homework question, so just reading the first value won't do unfortunately. The next question asks for the max value in all but the first column.
Thanks again for your help.
We can select the first column by subsetting with a numeric index and then take the max:
max(bin.matrix[,1])
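For the follow-up about all columns except the first, a negative index drops a column. The sketch below uses a made-up 3 x 3 matrix in place of bin.matrix, and shows both readings of the question (one maximum per remaining column, or a single overall maximum).

```r
# Hypothetical stand-in for bin.matrix.
bin.matrix <- matrix(c(4, 9, 2,
                       7, 1, 8,
                       3, 6, 5), nrow = 3, byrow = TRUE)

first_max <- max(bin.matrix[, 1])        # maximum of the first column

# -1 drops column 1; drop = FALSE keeps a matrix even if one column remains.
rest <- bin.matrix[, -1, drop = FALSE]
rest_col_max <- apply(rest, 2, max)      # one maximum per remaining column
rest_max <- max(rest)                    # single maximum over those columns
```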
I have a fairly large data set in csv format that I'd like to read into R. The data is annoyingly structured (my own fault) as follows:
,US912828LJ77,,US912810ED64,,US912828D804,...
17/08/2009,101.328125,15/08/1989,99.6171875,02/09/2014,99.7265625,...
The second line's format is repeated a few thousand times. Each pair of columns represents a time series, and the series have differing lengths (so the data is not rectangular).
If I use something like
rawdata <- read.csv("filename.csv")
I get a dataframe with all the blank entries padded with NA, and the odd columns forced to a factor datatype.
What I'd like to ultimately get to is either a set of timeseries objects (for each pair of columns) named after every even entry in the first row (the "US912828LJ77" fields) or a single dataframe with row labels as dates running from the minimum of (min of each odd column) to max of (max of each odd column).
I can't imagine I'm the only mook to put together a dataset in such an unhelpful structure but I can't see any suggestions out there for how to deal with this. Any help would be greatly appreciated!
First you need to parse every odd column to date
# Odd-numbered columns hold the dates
odd.cols <- names(rawdata)[seq(1, ncol(rawdata) - 1, by = 2)]
for (dateCol in odd.cols) {
  rawdata[[dateCol]] <- as.Date(rawdata[[dateCol]], format = "%d/%m/%Y")
}
Now the problem is straightforward: find the min and max date in each date column, create a vector running from the overall min date to the overall max date, join it with rawdata, and handle the missing values in your US* columns.
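One possible way to carry out those steps is sketched below. It is not the only approach, and it rests on assumptions: the sample rows after the first data line are invented, each series is assumed to have unique dates, and a daily calendar is assumed to be what is wanted. The names csv_lines, series, calendar and wide are illustrative.

```r
# Hypothetical stand-in for filename.csv: two series of different lengths.
csv_lines <- c(",US912828LJ77,,US912810ED64",
               "17/08/2009,101.328125,15/08/1989,99.6171875",
               "18/08/2009,101.5,16/08/1989,99.75",
               "19/08/2009,101.25,,")
rawdata <- read.csv(text = csv_lines, stringsAsFactors = FALSE)

# Split each (date, value) column pair into its own two-column data frame.
date_cols <- seq(1, ncol(rawdata) - 1, by = 2)
series <- lapply(date_cols, function(j) {
  dates <- as.Date(rawdata[[j]], format = "%d/%m/%Y")
  keep <- !is.na(dates)                      # drop the padding rows
  out <- data.frame(date = dates[keep], value = rawdata[[j + 1]][keep])
  names(out)[2] <- names(rawdata)[j + 1]     # name the column after the series
  out
})

# Daily calendar from the earliest to the latest date across all series.
all_dates <- do.call(c, lapply(series, function(s) s$date))
calendar <- data.frame(date = seq(min(all_dates), max(all_dates), by = "day"))

# Left-join each series onto the calendar; dates without a value stay NA.
wide <- Reduce(function(a, b) merge(a, b, by = "date", all.x = TRUE),
               series, calendar)
```

The result has one row per calendar day and one column per series. How to treat the NA gaps (leave them, carry the last value forward, interpolate) depends on the analysis.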
I have a data frame of about 10,000,000 entries. There are only two columns: 'value' and 'deleted'. The values usually range from 1 to 1800, but there are also some odd strings. 'deleted' is a boolean indicating whether the value was deleted. If I copy this data frame with the condition
deletedFrame <- df[df$deleted!=0, ]
the resulting data frame reduces to 283 entries. However, it doesn't copy over any of the corresponding values. That column is there but is left blank. Any ideas on what I'm doing wrong?
It could be that deleted contains NA alongside the boolean values. One way to handle that would be to use
df[df$deleted!=0 & !is.na(df$deleted), ]
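The sketch below shows the effect on made-up data, assuming missing values really are the cause: an NA in a logical index produces a row filled with NA, which can look like blank values. which() is a common alternative because it silently drops NA positions.

```r
# Hypothetical data: deleted contains one missing value.
df <- data.frame(value = c("10", "abc", "1800", "7"),
                 deleted = c(0, 1, NA, 1),
                 stringsAsFactors = FALSE)

# The index is FALSE, TRUE, NA, TRUE: the NA yields an extra all-NA row.
with_na <- df[df$deleted != 0, ]

# Excluding NA explicitly keeps only the genuinely deleted rows.
clean <- df[df$deleted != 0 & !is.na(df$deleted), ]

# which() returns only the positions that are TRUE, so NA is skipped.
same <- df[which(df$deleted != 0), ]
```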
I'm looking for a way to find the maximum value of a column, but only in rows where a different column equals a given value.
Suppose all your data is stored in a data frame called dat
max(dat$columnYouWantMaxOf[dat$columnYouWantToHaveSpecificValue==ValueYouWantThisColumnToHave])
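Here is the same pattern on a small made-up data frame; the column names group and score are placeholders. tapply() is included as one way to get the maximum for every group at once.

```r
# Hypothetical data with one missing score.
dat <- data.frame(group = c("a", "b", "a", "b", "a"),
                  score = c(4, 9, 7, 3, NA),
                  stringsAsFactors = FALSE)

# Subset score to the rows where group is "a", then take the max.
# na.rm = TRUE ignores the missing score; without it the result would be NA.
max_a <- max(dat$score[dat$group == "a"], na.rm = TRUE)

# Maximum of score within each value of group.
max_by_group <- tapply(dat$score, dat$group, max, na.rm = TRUE)
```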