I have a data frame of about 10,000,000 entries. There are only two columns: 'value' and 'deleted'. The values usually range from 1 to 1800, but there are also some odd strings. 'deleted' is a boolean indicating whether the value was deleted. If I copy this data frame with the condition
deletedFrame <- df[df$deleted!=0, ]
the resulting data frame reduces to 283 entries. However, it doesn't copy over any of the corresponding values. That column is there but is left blank. Any ideas on what I'm doing wrong?
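The 'deleted' column may contain NA alongside the boolean values. Comparing NA with 0 yields NA, and an NA in the row index produces a row filled with NA. One way to exclude those rows is to use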
df[df$deleted!=0 & !is.na(df$deleted), ]
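To make the effect of NA in the index visible, here is a minimal sketch on a made-up four-row data frame (the values are invented for illustration, and whether NA is really the cause in the original data is an assumption):

```r
# Toy data: 'deleted' contains an NA alongside the 0/1 flags
df <- data.frame(value = c("10", "20", "abc", "40"),
                 deleted = c(0, 1, NA, 1),
                 stringsAsFactors = FALSE)

# An NA in the logical index produces a row made up entirely of NA
withNA <- df[df$deleted != 0, ]

# Excluding NA explicitly keeps only the rows that are really flagged
deletedFrame <- df[df$deleted != 0 & !is.na(df$deleted), ]

# which() drops NA on its own, so this selects the same rows
deletedFrame2 <- df[which(df$deleted != 0), ]
```

Here withNA has three rows, one of them blank, while deletedFrame and deletedFrame2 contain only the two flagged rows.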
I'm working through a crash course for R at https://bioinformatics-core-shared-training.github.io/r-crash-course/crash-course.nb.html
The problem I'm facing is how to extract the rows that hold the min or max of a certain column.
For example, when running
df[df$tmp ==min(df$tmp),]
I get the correct row with the expected value.
However, when running the following code
df[min(df$tmp),]
I get something else completely.
I'm wondering what is causing this discrepancy?
Assuming df$tmp is numeric with no NAs, min(df$tmp) returns a single number. If that number is an integer i, df[min(df$tmp),] returns the ith row of your data frame, provided the data frame has an ith row.
On the other hand, df[df$tmp == min(df$tmp),] returns the row(s) of df where df$tmp is equal to the minimum value in that column.
df[df$tmp == min(df$tmp),] is the correct approach to get what you are looking for.
df[min(df$tmp),] uses the minimum value as a row number, so it returns the row at that position rather than the row containing the minimum. It can also give unexpected results in certain cases: a non-integer value is truncated, a negative value drops that row instead of selecting it, and a value greater than the number of rows returns a row of NAs. Hope this makes sense.
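The difference can be seen on a small made-up data frame (the id column and the values are assumptions for illustration only):

```r
# Toy data: the minimum of tmp is 2, and it sits in row 3
df <- data.frame(id = c("a", "b", "c", "d"),
                 tmp = c(3, 4, 2, 5))

# Uses the value 2 as a row number, so this is simply row 2
byPosition <- df[min(df$tmp), ]

# Compares every element with the minimum and keeps the matching row(s)
byValue <- df[df$tmp == min(df$tmp), ]

# which.min() gives the position of the first minimum directly
firstMin <- df[which.min(df$tmp), ]
```

byPosition is the row with id "b", while byValue and firstMin are both the row with id "c". Note that which.min() returns only the first match if the minimum occurs more than once.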
I started programming in R yesterday (literally), and I am having the following issue:
-I have a data frame containing R rows, and each row contains N values.
Rows are identified by the first and second field, while the other N-2 are just numerical values or NA.
-Some rows have identical first field and identical second field, something like:
row 1: a,b, third_field, .. ,last_field
row 2: a,b, third_field, .. ,last_field
Usually the first row has some fields containing numbers and some containing NA, while the second row also contains NA and numbers, but distributed differently.
What I am trying to do is to merge the two rows (or records) according to these two rules:
1) if both rows have a NA on a given field, I keep NA
2) if one of the two has a number, I use that value; if both of the rows contain the same value, I keep it also.
How do you do this without looping over each field of each row? (With 1M rows and tens of fields, it will finish maybe tomorrow.)
I do not know how to better explain my problem. I am sorry for the lengthy explanation, thanks a lot.
EDIT: it is better if I add an example. The following two lines
a,b,NA,NA,NA,1,2 ,NA
a,b,NA,3 ,NA,1,NA,NA
should become
a,b,NA,3 ,NA,1,2 ,NA
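One possible way to apply these two rules without an explicit loop is base R's aggregate(), grouping on the two key fields and keeping the first non-NA value per field. This is only a sketch: the column names k1, k2 and v1 to v6 and the helper firstNonNA are hypothetical, and it assumes that two non-NA values in the same field never conflict (if they do, the first one wins).

```r
# Toy data reproducing the example above, plus one row with a unique key
df <- data.frame(k1 = c("a", "a", "c"),
                 k2 = c("b", "b", "d"),
                 v1 = c(NA, NA, 7),
                 v2 = c(NA, 3, NA),
                 v3 = c(NA_real_, NA_real_, NA_real_),
                 v4 = c(1, 1, NA),
                 v5 = c(2, NA, NA),
                 v6 = c(NA_real_, NA_real_, NA_real_),
                 stringsAsFactors = FALSE)

# Hypothetical helper: first non-NA value of x, or NA if every value is NA
firstNonNA <- function(x) x[!is.na(x)][1]

# Apply the helper to every value column within each (k1, k2) group
merged <- aggregate(df[-(1:2)], by = df[1:2], FUN = firstNonNA)
```

The row for a,b in merged reads NA, 3, NA, 1, 2, NA, matching the expected result, and the c,d row passes through unchanged. On a million rows this may still take a while; a grouped operation from a package such as data.table may be faster, though that has not been measured here.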
I'm looking for a way to find the maximum value of a column, but only in rows where a different column equals a given value.
Suppose all your data is stored in a data frame called dat:
max(dat$columnYouWantMaxOf[dat$columnYouWantToHaveSpecificValue==ValueYouWantThisColumnToHave])
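As a concrete sketch of the same pattern, with made-up column names group and score standing in for the placeholders above:

```r
# Toy data: two groups, one missing score
dat <- data.frame(group = c("x", "y", "x", "y"),
                  score = c(4, 9, 7, NA),
                  stringsAsFactors = FALSE)

# Maximum score among rows where group is "x"
maxX <- max(dat$score[dat$group == "x"])

# With missing values in the subset, na.rm = TRUE is needed to get a number
maxY <- max(dat$score[dat$group == "y"], na.rm = TRUE)
```

Here maxX is 7 and maxY is 9. Without na.rm = TRUE the second call would return NA.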
I've done a fair amount of research and am still struggling to find a function that returns the number of the row that contains a certain value (in my data frame the rows are not numbered). In this case the value is a number.
e.g. Call the data frame df.
I don't know how to show a little image of the data frame, but say that in row 5, column 4 the value was '162'. Is there a function I could use that returns '5' or 'row 5'?
I have used rowSums(df=="162")
which gives a long line with one entry per row: a '1' if the row contains the value and a '0' if not. But I need a function that simply states the row.
I couldn't figure out how to use the 'which' function correctly either.
which(df$col4=='162')
I am assuming that col4 is the name of column number 4.
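If the column holding the value is not known in advance, which() can also search the whole data frame. A minimal sketch on invented data (the column names and values are assumptions):

```r
# Toy data: the value 162 sits in row 5, column 4
df <- data.frame(col1 = 1:6,
                 col2 = letters[1:6],
                 col3 = c(10, 20, 30, 40, 50, 60),
                 col4 = c(7, 15, 99, 3, 162, 48),
                 stringsAsFactors = FALSE)

# Row number(s) where one known column matches
rowIdx <- which(df$col4 == 162)

# Search every column at once; arr.ind = TRUE reports row and column positions
hits <- which(df == "162", arr.ind = TRUE)
```

rowIdx is 5, and hits is a one-row matrix whose "row" entry is 5 and whose "col" entry is 4.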
I need to extract the columns from a dataset without header names.
I have a ~10000 x 3 data set and I need to plot the first column against the second two.
I know how to do it when the columns have names, e.g. plot(data$V1, data$V2), but in this case they do not. How do I access each column individually when they do not have names?
Thanks
Why not give them sensible names?
names(data) <- c("This", "That", "Other")
plot(data$This, data$That)
That's a better solution than using the column number, since names are meaningful. If you index by position and your data later changes to have a different number of columns, your code may break in several places. Give your data the correct names, and as long as you always refer to data$This your code will keep working.
I usually select columns by their position in the matrix/data frame.
e.g.
dataset[,4] to select the 4th column.
The 1st number in brackets refers to rows, the second to columns. Here, I didn't use a "1st number" so all rows of column 4 are selected, i.e., the whole column.
This is easy to remember since it stems from matrix notation. E.g., a 4x3 matrix has 4 rows and 3 columns. Thus when I want to select the element in the 1st row and the third column, I could do something like matrix[1,3]
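Positional indexing can be sketched as follows on a made-up 10 x 3 data set (the object names and values are assumptions for illustration):

```r
# Hypothetical 10 x 3 data set with no column names
m <- cbind(1:10, (1:10)^2, sqrt(1:10))

x  <- m[, 1]     # first column, all rows
ys <- m[, 2:3]   # second and third columns as a 10 x 2 matrix

# The same positional indexing works on a data frame
data <- as.data.frame(m)   # R assigns the default names V1, V2, V3
first   <- data[[1]]       # first column as a plain vector
second  <- data[, 2]       # second column, also a vector
asFrame <- data[1]         # single brackets without a comma keep a data frame
```

From there, plot(x, ys[, 1]) draws the first column against the second, and matplot(x, ys) draws it against both remaining columns in one call.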