Press here for Dataset relating to question
How do you extract only the Male rows into another dataset whilst keeping the rownames & column names intact?
I think
mydf[mydf["sex"] == "Male", ]
has the effect you want (where mydf is the name of your dataframe). Note that mydf["sex"] is the column of values labeled "sex" and mydf["sex"] == "Male" is a column of Boolean values (TRUE where the value is "Male").
The comma with nothing between it and the right square bracket mydf[...,] is important. It means, select all columns.
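As a minimal sketch of the same idea, assuming a small made-up data frame in place of the linked dataset (the `age` column and the row names `p1` to `p4` are hypothetical), the subset keeps the original row names and column names:

```r
# Hypothetical stand-in for the asker's dataset; row names are set explicitly
mydf <- data.frame(
  sex = c("Male", "Female", "Male", "Female"),
  age = c(34, 29, 41, 52),
  row.names = c("p1", "p2", "p3", "p4"),
  stringsAsFactors = FALSE
)

# Logical vector: TRUE for the rows to keep
is_male <- mydf$sex == "Male"

# Row index before the comma, empty column index after it (all columns)
males <- mydf[is_male, ]

rownames(males)  # row names of the retained rows
colnames(males)  # same column names as mydf
```

If the sex column can contain NA, wrapping the condition in which(), as in mydf[which(mydf$sex == "Male"), ], avoids the all-NA rows that a logical index with NA would otherwise produce.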
Related
I have a dataframe where the rows are the names of different genes, with 2 columns called: Control_mean and Patient_mean.
I want to create a third column that stores the value of "Patient_mean - Control_mean" for each row, but I can't figure out how!
I tried to do so using this:
for(i in 1:nrow(newdf8)){
newdf8$log2FC[i] <- (newdf8[,2] - newdf8[,1])
}
but it didn't work: all the values in the new column became the same number instead of the actual difference for each row.
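One likely explanation is that the right-hand side of the loop, newdf8[,2] - newdf8[,1], is the whole column of differences rather than the difference for row i, so only its first element ends up being used. Arithmetic on data frame columns is vectorised in R, so the loop can be dropped entirely. A minimal sketch, assuming made-up values and gene names in place of the real newdf8:

```r
# Hypothetical data standing in for newdf8; gene names are the row names
newdf8 <- data.frame(
  Control_mean = c(5.0, 7.5, 3.2),
  Patient_mean = c(6.5, 7.0, 4.2),
  row.names = c("geneA", "geneB", "geneC")
)

# Vectorised subtraction: one difference per row, no loop needed
newdf8$log2FC <- newdf8$Patient_mean - newdf8$Control_mean

print(newdf8)
```

If the loop is kept, indexing the row on the right-hand side as well, newdf8[i, 2] - newdf8[i, 1], should give the per-row value.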
I have two columns in a dataframe that contain date information after a left outer join. Because of the style of join, one of the date columns now contains NAs. I want to check if all non-NA values are identical between these columns. An example is below:
date 1    date 2
1/1/21    NA
1/2/21    1/2/21
1/3/21    NA
1/4/21    1/4/21
I don't need the second column if all non-NA values match.
Before I did the left outer join, I did an outer join, and this statement:
identical(df[['date 1']], df[['date 2']])
returned TRUE, as every row was indeed identical in both columns.
Is there a way to use this or a similar statement while ignoring all rows that contain an "NA" in "date 2"?
In pandas, you can filter your df for rows where date 2 is not null and the two values differ, then check whether any such rows exist.
df_mismatch = df[(df['date 2'].notnull()) & (df['date 1'] != df['date 2'])]
if len(df_mismatch) > 0:
    print('found this many mismatches:', len(df_mismatch))
I found a workaround:
First, create a new dataframe that stores just these two columns. The reason for making a new dataframe is that the next step uses na.omit, which removes any row that contains one or more NA values in any column.
df2 <- df[, c("date 1", "date 2")]
Then remove all rows that contain an NA in any column:
df2 <- na.omit(df2)
Finally, run identical to check whether the remaining columns are indeed identical:
identical(df2[['date 1']], df2[['date 2']])
I'm sure there is a more elegant way, but this worked for me in the meantime
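One possibly shorter variant is to build a logical mask with is.na and compare the two columns on the masked rows only, without creating a second dataframe. This is a sketch under the assumption that both columns have the same type; the data below is made up to mirror the table in the question:

```r
# Hypothetical data mirroring the example table in the question
df <- data.frame(
  `date 1` = c("1/1/21", "1/2/21", "1/3/21", "1/4/21"),
  `date 2` = c(NA, "1/2/21", NA, "1/4/21"),
  check.names = FALSE,       # keep the space in the column names
  stringsAsFactors = FALSE   # plain character columns, not factors
)

# TRUE for rows where the second column actually has a value
has_date2 <- !is.na(df[["date 2"]])

# Compare the two columns on those rows only
all_match <- identical(df[["date 1"]][has_date2], df[["date 2"]][has_date2])

# Alternative: element-wise comparison, ignoring the NA results
all_match_alt <- all(df[["date 1"]] == df[["date 2"]], na.rm = TRUE)
```

Note that identical also compares types and attributes, so it is stricter than the element-wise check.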
How would I count the number of occurrences after filtering a dataset on one column (e.g. the "Variant" column for "A/T") and then filtering on another column for values containing a particular string (e.g. the "SEQ" column for "G[A]C")?
I've tried the following but received an error:
length(which(mydata$VARIANT =="A/T") & grep(length("G[A]A", mydata$SEQ)))
Checking in Excel, filtering for just 'A/T' gives 9 rows, of which 2 contain 'G[A]C'.
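When using &, we need logical vectors on both sides. grep returns the indices of the matches, so it should be grepl, which returns a logical vector; sum then counts the TRUE values: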
sum(mydata$VARIANT == "A/T" & grepl("G[A]A", mydata$SEQ, fixed = TRUE))
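The fixed = TRUE argument matters here because square brackets are special in regular expressions. The sketch below uses made-up data (the real mydata is not shown) and the pattern "G[A]C" from the question text to show the difference between literal and regex matching:

```r
# Hypothetical data; the real mydata is not shown in the question
mydata <- data.frame(
  VARIANT = c("A/T", "A/T", "A/T", "C/G"),
  SEQ     = c("TTG[A]CAA", "CCG[A]CTT", "GACGAC", "AAG[A]CGG"),
  stringsAsFactors = FALSE
)

# fixed = TRUE: match the literal text "G[A]C", brackets included
literal_hits <- grepl("G[A]C", mydata$SEQ, fixed = TRUE)

# Default regex matching: [A] is a character class, so this matches "GAC"
regex_hits <- grepl("G[A]C", mydata$SEQ)

# Rows that satisfy both conditions; sum() counts the TRUE values
n_both <- sum(mydata$VARIANT == "A/T" & literal_hits)
n_both
```

Without fixed = TRUE, the brackets would have to be escaped, for example "G\\[A\\]C".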
We have a data frame from a tab delimited file. The data frame NCNT has columns 2 and 3 with observed values as A,G,T,C and missing data represented as '.' instead of NA.
We would like to use the subset command to define a new data frame newNCNT that contains only the rows where columns 2 and 3 hold the missing value '.'.
This should deliver the desired subset using ordinary logical indexing and logical operators:
newNCNT <- NCNT[ NCNT[[2]] == "." & NCNT[[3]] == ".", ]
In order to use the subset function one would ordinarily need to know the column names for those two columns. If one knew the names to be name1 and name2 then it might be:
newNCNT <- subset( NCNT, name1 == "." & name2 == ".")
This will deliver rows where both values in those columns are ".". Many people have difficulty expressing their desired logical operations correctly, so if you wanted rows where either column 2 or column 3 has a missing value, you would need the | (OR) operator instead. @docendodiscimus apparently thought you wanted the latter.
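To make the difference between the two operators concrete, here is a small sketch with a made-up NCNT (the column names id, name1 and name2 and all values are assumptions for illustration):

```r
# Hypothetical stand-in for NCNT; names and values are made up
NCNT <- data.frame(
  id    = c("s1", "s2", "s3", "s4"),
  name1 = c(".", "A", ".", "G"),
  name2 = c(".", ".", "T", "C"),
  stringsAsFactors = FALSE
)

# AND: both column 2 and column 3 are "."
both_missing <- NCNT[NCNT[[2]] == "." & NCNT[[3]] == ".", ]

# OR: at least one of column 2 and column 3 is "."
either_missing <- NCNT[NCNT[[2]] == "." | NCNT[[3]] == ".", ]

both_missing$id
either_missing$id
```

Here the AND version keeps only the row where both columns are missing, while the OR version also keeps the rows with a single missing value.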
I need to extract the columns from a dataset without header names.
I have a ~10000 x 3 data set and I need to plot the first column against the second two.
I know how to do it when the columns have names, e.g. plot(data$V1, data$V2), but in this case they do not. How do I access each column individually when they do not have names?
Thanks
Why not give them sensible names?
names(data) <- c("This", "That", "Other")
plot(data$This, data$That)
That's a better solution than using the column number: names are meaningful, whereas code that relies on positions may break in several places if your data changes to have a different number of columns. Give your data the correct names, and as long as you always refer to data$This your code will keep working.
I usually select columns by their position in the matrix/data frame.
For example, dataset[,4] selects the 4th column.
The 1st number in brackets refers to rows, the second to columns. Here, I didn't use a "1st number" so all rows of column 4 are selected, i.e., the whole column.
This is easy to remember since it follows matrix notation. E.g., a 4x3 matrix has 4 rows and 3 columns. Thus, when I want to select the element in the 1st row of the 3rd column, I could do something like matrix[1,3]
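As a short sketch of positional access, assuming the data is read with read.table and no header line (the values below are made up), each column can be pulled out by its number, and the same [row, column] indexing works on a matrix without any names:

```r
# Hypothetical three-column data read without a header line
data <- read.table(text = "1 10 100
2 20 200
3 30 300", header = FALSE)

names(data)            # read.table supplies default names V1, V2, V3

first  <- data[[1]]    # column 1 as a vector
second <- data[, 2]    # same idea with [row, column] indexing
third  <- data[[3]]

# A matrix with no dimnames is indexed by position in the same way
m <- unname(as.matrix(data))
m_first <- m[, 1]
```

The extracted vectors can then be passed to plot, for example plot(first, second) followed by points(first, third) to add the third column to the same axes.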