Fill columns with data via 2 defined parameters - r

I have a sample working data set (called df) which I have added columns to in R, and I would like to fill these columns with data according to very specific conditions.
I ran samples in a lab with 8 different variables, and always ran each sample with each variable twice (sample column). From this, I calculated an average result, called Cq_mean.
The columns I have added in R below refer to each variable name.
I would like to fill these columns with "positive" or "negative" based on 2 conditions:
Variable
Cq_mean
As you can see from my code below, I am able to create positive or negative results based on Cq_mean; however, this runs over the entire dataset without also taking variable into account, and it fills in cells that I would like to remain empty. I am not sure how to ask R to take these two conditions into account at the same time.
positive: Cq_mean < 37.1
negative: Cq_mean >= 37.1
Helpful information:
Under sample, the data is always separated by a dash (-), with the sample number in front and the variable name after. Somehow I need to isolate what comes after the dash.
Please refer to my desired results table to visualize what I am aiming for.
df <- read.table("https://pastebin.com/raw/ZPJS9Vjg", header=T,sep="")
# add column names respective to variables
df$TypA <- ""
df$TypB <- ""
df$TypC <- ""
df$RP49 <- ""
df$RPS5 <- ""
df$H20 <- ""
df$F1409B <- ""
df$F1430A <- ""
# fill columns with data
df$TypA <- ifelse(df$Cq_mean >= 37.1, "negative", "positive")
df$TypB <- ifelse(df$Cq_mean >= 37.1, "negative", "positive")
# ...and continue through with each variable
desired results (a subset of the entire dataset, done by hand in Excel):
desired_outcome <- read.table("https://pastebin.com/raw/P3PPbiwr", header = T, sep="\t")

Something like this will do the trick:
df$TypA[grepl('TypA', df$sample1)] <- ifelse(df$Cq_mean[grepl('TypA', df$sample1)] >= 37.1,
                                             'neg', 'pos')
You'll need to do this once per new column you want.
The grepl() keeps only the rows where your string of choice (here 'TypA') is present in the sample variable.
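If repeating that line per column gets tedious, here is a minimal sketch that isolates the variable name after the dash with sub() and loops over the columns. It assumes the sample ids live in df$sample with exactly one dash, as described in the question (the answer above used df$sample1), and reuses the 37.1 cutoff from the code:
# extract what comes after the dash, e.g. "12-TypA" -> "TypA"
variable <- sub("^[^-]*-", "", df$sample)
for (v in c("TypA", "TypB", "TypC", "RP49", "RPS5", "H20", "F1409B", "F1430A")) {
  rows <- variable == v
  df[[v]][rows] <- ifelse(df$Cq_mean[rows] >= 37.1, "negative", "positive")
}
Rows whose variable is not v are never touched, so they stay "", which matches the requirement that unrelated cells remain empty.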

Related

Loop through dataframe rows and add value in a new column (R)

I have a dataframe (df) with a column of Latitude (Lat), and I need to match up the corresponding Longitude value (based off relationships in another dataset). New column name is 'Long_matched'.
Here, I am trying to write a new value in the column 'Long_matched' at the corresponding row for latitudes between -33.9238 and -33.9236. The data in 'Lat' has many more decimal places (e.g. -33.9238026666667, -33.9236026666667, etc.). As I will be applying this code to multiple datasets over the same geographical location (so the long decimals will vary slightly), I want to write longitude values which fall within a 0.0002 degree range.
Some attempts of code I have tried include:
df$Long_matched <- ifelse(df$Lat< -33.9236 & df$Lat> -33.9238, 151.2279 , "N/A")
or
df$Long_matched[df$Lat< -33.9236 & df$Lat> -33.9238] <- 151.2279
I think I need to use a for loop to loop through the rows and an if statement, but I am struggling to figure this out - any help would be appreciated!
Resulting output should look something like this:
Lat Long_matched
-33.9238026666667 151.2279
-33.9236026666667 (new long value will go here)
Everything said in the comments applies, but here is a trick you can try:
In the following code, you will need to replace text with numbers.
Latitude_breaks <- seq(min_latitude, max_latitude, 0.0002) # you need to replace `min_latitude` and `max_latitude` with numbers
Longitude_values <- seq(first, last, increment) # you need to replace `first`, `last` and `increment` with numbers
df <- within(df, {
  # make a categorical version of `Lat`
  Lat_cat <- cut(Lat, Latitude_breaks)
  Long_matched <- Longitude_values[Lat_cat]
})
A few notes:
the values of Lat between min_latitude and min_latitude + 0.0002 will be assigned the value of Longitude_values marked first.
The length of Latitude_breaks should be one more than the length of Longitude_values.
Values of Lat outside of Latitude_breaks will become NA.
This works by exploiting a nice feature of factors - they are stored as integers. So we can use them to index another vector - in this case, Longitude_values.
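To make that concrete, here is a small sketch with made-up break and longitude numbers (purely illustrative; substitute your own survey values):
Latitude_breaks  <- seq(-33.9240, -33.9234, 0.0002)  # 4 breaks -> 3 bins
Longitude_values <- c(151.2279, 151.2281, 151.2283)  # one longitude per bin
df <- data.frame(Lat = c(-33.9238026666667, -33.9236026666667))
df <- within(df, {
  Lat_cat <- cut(Lat, Latitude_breaks)
  Long_matched <- Longitude_values[Lat_cat]
})
# df$Long_matched is now 151.2279 and 151.2281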

R - Removing rows in data frame by list of column values

I have two data frames, one containing the predictors and one containing the different categories I want to predict. Both of the data frames contain a column named geoid. Some of the rows of my predictors contains NA values, and I need to remove these.
After extracting the geoid value of the rows containing NA values, and removing them from the predictors data frame I need to remove the corresponding rows from the categories data frame as well.
It seems like a rather basic operation but the code won't work.
categories <- as.data.frame(read.csv("files/cat_df.csv"))
predictors <- as.data.frame(read.csv("files/radius_100.csv"))
NA_rows <- predictors[!complete.cases(predictors),]
geoids <- NA_rows['geoid']
clean_categories <- categories[!(categories$geoid %in% geoids),]
None of the rows in categories/clean_categories are removed.
A typical geoid value is US06140231. typeof(categories$geoid) returns integer.
I can't say this is it, but a very basic slip won't be doing what you want; try this correction:
clean_categories <- categories[!(categories$geoid %in% geoids$geoid),]
The culprit is geoids <- NA_rows['geoid']: single-bracket indexing returns a one-column data frame, not a vector, so %in% never finds a match and nothing is removed. Compare against the actual values with geoids$geoid (or extract a vector up front with geoids <- NA_rows$geoid). You don't include a reproducible example, so I can't say whether the whole thing will do as you want.
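For illustration, a minimal reproducible sketch of the pattern with made-up geoids (column names follow the question; the data itself is invented):
categories <- data.frame(geoid = c("US06140231", "US06140232", "US06140233"),
                         cat = c("a", "b", "c"))
predictors <- data.frame(geoid = c("US06140231", "US06140232", "US06140233"),
                         x = c(1, NA, 3))
geoids <- predictors$geoid[!complete.cases(predictors)]           # a plain vector
clean_categories <- categories[!(categories$geoid %in% geoids), ] # drops the NA row's geoid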

Trying to predict in R

I created a data set using a random row generator:
training_data <- fulldata[sample(nrow(fulldata),100),]
I am under the impression that I can create a second data set of the rest of the data ... rest_data <- fulldata[-training_data] is the code I jotted down in my notes but I am getting
"Error in '[.default'(fulldata, -training_data) :
What part of my code is incorrect?
Assuming that fulldata is a dataframe, you need a comma in the subscript to indicate that you want the rows of the data frame (i.e. fulldata[rows, columns]). But the indices of the new dataframe training_data will be numbered 1:100, so you need a different sort of indicator that corresponds between training_data and fulldata to show which rows of fulldata should not be included. What you might do is use the rownames, something like:
rest_data <- fulldata[-which(rownames(fulldata) %in% rownames(training_data)),]
which should tell R to remove the rows of fulldata whose rownames occur in training_data. If you have something like an ID variable that is unique to each row, you could also use this:
rest_data <- fulldata[-which(fulldata$ID %in% training_data$ID),]
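A common alternative, sketched below, is to draw the row indices once and reuse them for both subsets, which sidesteps the rowname matching entirely (assuming fulldata is a data frame):
idx <- sample(nrow(fulldata), 100)  # 100 random row numbers
training_data <- fulldata[idx, ]    # the sampled rows
rest_data <- fulldata[-idx, ]       # everything else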

Omitting NA in specific rows when analyzing data from 2 columns of a very large dataframe

I am very new to R and I am struggling to understand how to omit NA values in a specific way.
I have a large dataframe with several columns (up to 40) and rows (up to 200ish). I want to use data from one of the columns to do simple stats (wilcox.test, boxplot, etc.): one column holds a continuous variable (V1), while the other holds a binary variable (V2; 0 or 1) that divides the 2 groups. I want to do this for the continuous variable using different, unrelated V2 binary variables. I organized this data in Excel, saved it as CSV, and am using RStudio.
All these columns have interspersed NA values, and when I use na.omit it simply drops every row where an NA value is present, which throws away an awful lot of data. Is there any simple solution to this? I have seen answers to similar topics, but none seems to be quite what I need.
Many thanks for any answer. Again, I am a baby-level newbie to R and may have overlooked something in other topics!
If I understand, you want to apply the function to a pair of columns each time:
wilcox.test(V1,V2)
wilcox.test(V1,V3)...
where the Vi have no missing values. I would do something like this:
## use complete.cases to assert that you have no missing values
## for the selected pair
apply_clean <- function(x, y) {
  ok <- complete.cases(x, y)
  wilcox.test(x[ok], y[ok])
}
## apply this function to all columns after removing the continuous column
lapply(subset(dat,select=-V1),apply_clean,y=dat$V1)
You can manipulate the data.frame to omit based on any rules you like. For example:
dirty.frame <- data.frame(col1 = c(1,2,3,4,5,6,7,NA,9,10), col2 = c(10, 9, 8, 7,6,5,4,3,2,1))
cleaned.frame <- dirty.frame[!is.na(dirty.frame$col1),]
This code uses is.na() to test whether the value of col1 in each row is NA. The ! means "not", so rows where col1 is NA are omitted.
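Tying this back to the question's setup (continuous V1, binary V2), a sketch of pairwise NA handling for one test -- the column names and values here are assumptions for illustration:
dat <- data.frame(V1 = c(2.1, NA, 3.5, 4.2, 1.8, 2.9),
                  V2 = c(0, 1, 0, 1, NA, 1))
ok <- complete.cases(dat$V1, dat$V2)  # drop rows invalid for this pair only
wilcox.test(dat$V1[ok] ~ dat$V2[ok])  # compare the two V2 groups
boxplot(dat$V1[ok] ~ dat$V2[ok])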

Basic R - how to exclude rows with blank columns, how to show data for specific column values

Two questions about R:
1) If I have a data set with multiple columns and one of them is 'test_score', how can I exclude the rows with blank (and/or non-numeric) values for that column when using pie(), hist(), or cor()?
2) If the dataset has a column named 'Teachers', how might I graph the column 'testscores' only for the rows where Teacher = Jones?
Creating separate vectors without the missing data:
dat.nomissing <- tenthgrade[!is.nan(Score),]
seems problematic as the two columns must remain paired.
I was thinking something such as:
hist(!is.nan(tenthgrade$Score)[tenthgrade$Teacher=='Jones'])
However, is.nan creates a vector of TRUE/FALSE values (as it should).
Use subscripting. For example:
dat[!is.na(dat$test_score),]
hist(dat$test_score[dat$Teachers=='Jones'])
And a more complete example with artificial data:
# Create artificial dataset
dat <- data.frame('test_score'=rnorm(500), 'Teachers'=sample(c('Jones', 'Smith', 'Clark'), 500, replace=TRUE))
# Introduce some random missingness
dat$test_score[sample(1:500, 50)] <- NA
# Keep if test_score is valid
dat.nomissing <- dat[!is.na(dat$test_score),]
# Plot subset of data
hist(dat$test_score[dat$Teachers=='Jones'])
