Adding specific column value according to row value - r

I have a dataframe h3
Genotype Preference
Rice 1
Rice 2
Lr 3
Lr 3
th 4
th 7
I want the dataframe to look like
Genotype Preference Haplotype
Rice 1 1
Rice 2 1
Lr 3 2
Lr 3 2
th 4 0.5
th 7 0.5
That is, I want to add a numerical value for each type of genotype. I have around 100 observations for each genotype. I want to add the numerical variable as a new column in a single line of code, ensuring that 1 is assigned to Rice, 2 to Lr and 0.5 to th.
I tried constructing the code with the mutate/ifelse:
h3 %>% select(Genotype) %>% mutate(Type = ifelse (Genotype = c("Rice"), 1, Genotype))
Other results I looked up provide solutions for adding a column with a value calculated from the previous columns, but not for adding specific values.
I have found "dplyr mutate with conditional values" and "Apply R-function to rows depending on value in other column", but I don't know how to adapt them to my code.
Any help with this will be greatly appreciated.

Using dplyr, you can do:
library(dplyr)
df %>% mutate(Haplotype = ifelse(Genotype == "Rice",1,ifelse(Genotype == "Lr",2,0.5)))
Using base, you can do the same thing:
df$Haplotype = ifelse(df$Genotype == "Rice",1,ifelse(df$Genotype == "Lr",2,0.5))
Data
df = data.frame("Genotype" = c("Rice","Rice","Lr","Lr","Th","Th"),
"Preference" = c(1,2,3,3,4,7))
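If the number of genotypes grows, nested ifelse calls get hard to read. One possible alternative, sketched here in base R, is a named lookup vector. The name haplotype_by_genotype is made up for this illustration, and the data is the same toy df as above:

```r
# Same toy data as in the answer above
df <- data.frame(Genotype = c("Rice", "Rice", "Lr", "Lr", "Th", "Th"),
                 Preference = c(1, 2, 3, 3, 4, 7))

# Hypothetical lookup table: one value per genotype
haplotype_by_genotype <- c(Rice = 1, Lr = 2, Th = 0.5)

# Index the lookup by name; as.character() guards against factor columns.
# Genotypes missing from the lookup would come back as NA.
df$Haplotype <- unname(haplotype_by_genotype[as.character(df$Genotype)])
df
```

Adding a new genotype then only means adding one entry to the lookup vector.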

Related

How to find a total of row values in R

I am trying to find the total of rows that have a column value of 3 or 4. That being said, the first row has only one value of 3 so if I create a new column
currentdx_count1$TotalDiagnoses
That new column called TotalDiagnoses should only have a value of 1 under it for the first row. I have tried
currentdx_count1$TotalDiagnoses <- rowSums(currentdx_count1[2:32])
This doesn't give me what I need as expected because it literally sums up the whole row. That being said, is there an existing function that does what I want to do or will I have to make one? Could I specify more in rowSums for it to work as I need it to?
Thanks for any and all help.
Edit: I'm trying to adapt a method I use earlier in my script that works for a similar purpose
findtotal <- endsWith(names(currentdx_count1), 'Current')
findtotal <- lapply(findtotal, `>`, 2)
findtotal <- unlist(findtotal)
currentdx_count1$TotalDiagnoses <- currentdx_count1[c(findtotal)]
I get an error which I have never seen before (an error in view?!)
So I tried just this
findtotal <- endsWith(names(currentdx_count1), 'Current')
currentdx_count1$TotalDiagnoses <- currentdx_count1[c(findtotal)]
Gets me closer but it is finding the total count for each column separately which is not what I need. I want a single column to encompass counts for each SID.
You can compare the dataframe (minus its first column) with the values 3 and 4, which gives a logical matrix, and then use rowSums to count the TRUE entries in each row:
currentdx_count1$TotalDiagnoses <- rowSums(currentdx_count1[-1] == 3 |
                                           currentdx_count1[-1] == 4)
currentdx_count1$TotalDiagnoses
#[1] 1 2 2 2 1 1 1 1 1 1 1 1 1 2
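The original data was not posted, so here is a minimal sketch of the same idea on made-up data. The column names and values are assumptions for illustration only. It uses %in% instead of ==, which treats NA as "no match" rather than propagating NA into the row sums:

```r
# Hypothetical stand-in for currentdx_count1 (the real data was not shown)
currentdx_count1 <- data.frame(
  SID         = c("a", "b", "c"),
  Dx1_Current = c(3, 1, 4),
  Dx2_Current = c(0, 4, 3),
  Dx3_Current = c(2, NA, 3)
)

# One logical column per diagnosis column: TRUE where the value is 3 or 4
is_3_or_4 <- sapply(currentdx_count1[-1], function(col) col %in% c(3, 4))

# Count the TRUE values in each row
currentdx_count1$TotalDiagnoses <- rowSums(is_3_or_4)
currentdx_count1$TotalDiagnoses
```

With the == version, a row containing NA would need rowSums(..., na.rm = TRUE) to get a count instead of NA.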

Complex data calculation for consecutive zeros at row level in R (lag v/s lead)

I have a complex calculation that needs to be done. It is basically at row level, and I am not sure how to tackle it.
If you can help me with the approach or any functions, that would be really great.
I will break my problem into two sub-problems for simplicity.
Below is what my data looks like:
Group,Date,Month,Sales,lag7,lag6,lag5,lag4,lag3,lag2,lag1,lag0(reference),lead1,lead2,lead3,lead4,lead5,lead6,lead7
Group1,42005,1,2503,1,1,0,0,0,0,0,0,0,0,0,0,1,0,1
Group1,42036,2,3734,1,1,1,1,1,0,0,0,0,1,1,0,0,0,0
Group1,42064,3,6631,1,0,0,1,0,0,0,0,0,0,1,1,1,1,0
Group1,42095,4,8606,0,1,0,1,1,0,1,0,1,1,1,0,0,0,0
Group1,42125,5,1889,0,1,1,0,1,0,0,0,0,0,0,0,1,1,0
Group1,42156,6,4819,0,1,0,0,0,1,0,0,1,0,1,1,1,1,0
Group1,42186,7,5120,0,0,1,1,1,1,1,0,0,1,1,0,1,1,0
I have data for each Group at Monthly Level.
I would like to capture the below two things.
1. The count of consecutive zeros in each row, going outward in both directions from lag0(reference)
The cases highlighted in yellow are the zeros that run consecutively from lag0(reference) until the first 1 is reached. I want to capture the count of zeros at row level, along with the corresponding Sales value.
Below is the output I am looking for in part 1.
Output:
Month,Sales,Count
1,2503,9
2,3734,3
3,6631,5
4,8606,0
5,1889,6
6,4819,1
7,5120,1
2. Identify the consecutive rows (rows 1, 2 and 3, and similarly rows 5 and 6) where any lag or lead overlaps for a 0 within the lag0(reference) range, and capture their Sales and Month values.
For example, for rows 1, 2 and 3, the overlap happens at least at lag 3, 2, 1 and lead 1, 2; this needs to be captured and tagged as case 1 (or 1). Similarly, for rows 5 and 6 at least lag1 overlaps, so this needs to be captured and tagged as case 2 (or 2), along with the Sales and Month values.
Row 7 does not overlap with the previous or next consecutive row, so it will not be captured.
Below is the result I am looking for in part 2.
Month,Sales,Case
1,2503,1
2,3734,1
3,6631,1
5,1889,2
6,4819,2
I want to run this for multiple groups, so I will use either dplyr or a loop to get the result. For now, I am simply looking for an approach.
I am not sure how to solve this problem; this is the first time I am trying to capture things at row level in R. I am not looking for a complete solution, just a first step. I would appreciate any leads.
An option using rle for the 1st part of the calculation can be as:
df$count <- apply(df[,-c(1:4)],1,function(x){
  first <- rle(x[1:7])
  second <- rle(x[9:15])
  count <- 0
  if(first$values[length(first$values)] == 0){
    count = first$lengths[length(first$values)]
  }
  if(second$values[1] == 0){
    count = count+second$lengths[1]
  }
  count
})
df[,c("Month", "Sales", "count")]
# Month Sales count
# 1 1 2503 9
# 2 2 3734 3
# 3 3 6631 5
# 4 4 8606 0
# 5 5 1889 6
# 6 6 4819 1
# 7 7 5120 1
Data:
df <- read.table(text =
"Group,Date,Month,Sales,lag7,lag6,lag5,lag4,lag3,lag2,lag1,lag0(reference),lead1,lead2,lead3,lead4,lead5,lead6,lead7
Group1,42005,1,2503,1,1,0,0,0,0,0,0,0,0,0,0,1,0,1
Group1,42036,2,3734,1,1,1,1,1,0,0,0,0,1,1,0,0,0,0
Group1,42064,3,6631,1,0,0,1,0,0,0,0,0,0,1,1,1,1,0
Group1,42095,4,8606,0,1,0,1,1,0,1,0,1,1,1,0,0,0,0
Group1,42125,5,1889,0,1,1,0,1,0,0,0,0,0,0,0,1,1,0
Group1,42156,6,4819,0,1,0,0,0,1,0,0,1,0,1,1,1,1,0
Group1,42186,7,5120,0,0,1,1,1,1,1,0,0,1,1,0,1,1,0",
header = TRUE, stringsAsFactors = FALSE, sep = ",")
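As a possible alternative to rle for part 1, the length of a leading run of zeros can be computed with sum(cumprod(x == 0)): the cumulative product stays 1 while the values are 0 and drops to 0 for good at the first 1. The sketch below assumes the column layout of the data above (lag7 to lag1 in columns 5 to 11, lead1 to lead7 in columns 13 to 19). The names count_zero_run and count2 are made up for this example:

```r
df <- read.table(text =
"Group,Date,Month,Sales,lag7,lag6,lag5,lag4,lag3,lag2,lag1,lag0(reference),lead1,lead2,lead3,lead4,lead5,lead6,lead7
Group1,42005,1,2503,1,1,0,0,0,0,0,0,0,0,0,0,1,0,1
Group1,42036,2,3734,1,1,1,1,1,0,0,0,0,1,1,0,0,0,0
Group1,42064,3,6631,1,0,0,1,0,0,0,0,0,0,1,1,1,1,0
Group1,42095,4,8606,0,1,0,1,1,0,1,0,1,1,1,0,0,0,0
Group1,42125,5,1889,0,1,1,0,1,0,0,0,0,0,0,0,1,1,0
Group1,42156,6,4819,0,1,0,0,0,1,0,0,1,0,1,1,1,1,0
Group1,42186,7,5120,0,0,1,1,1,1,1,0,0,1,1,0,1,1,0",
header = TRUE, stringsAsFactors = FALSE, sep = ",")

# Hypothetical helper: length of the run of zeros at the start of x
count_zero_run <- function(x) sum(cumprod(x == 0))

lags  <- as.matrix(df[, 5:11])   # lag7 ... lag1
leads <- as.matrix(df[, 13:19])  # lead1 ... lead7

# Lags are reversed so the run is measured outward from lag0(reference)
df$count2 <- apply(lags, 1, function(x) count_zero_run(rev(x))) +
             apply(leads, 1, count_zero_run)
df[, c("Month", "Sales", "count2")]
```

Like the rle version, this does not count lag0(reference) itself, which is what the expected output in the question implies.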

R removes more observations than it should with dplyr or base subset

I've got a question regarding the filter() function of dplyr, and/or base subset() function within R. Basically, when I use filter() or subset() I can extract observations based on two conditions, which is what I need.
As an example, this is what I've been using so far:
df %>% filter(Axis_1_1 == "Diagnostic of function on axis1 postponed") %>% filter(is.na(diagnostic_code9))
This gives me the right amount of observations that satisfy these two conditions at the same time, i.e. 92 out of the 23992 in total.
However, when I use the negation sign to not include these observations in my current dataframe, R is deleting roughly 8000 extra observations. Thus, the end result is 15992 observations left after filtering with the negation "!" sign used. Example:
df %>% filter(Axis_1_1 != "Diagnostic of function on axis1 postponed") %>% filter(!is.na(diagnostic_code9))
Using simple subsetting from base R gives me the same wrong end result, while it manages to find the correct 92 observations that satisfy the condition, as stated in the first example.
subset(df, df$Axis1_1 == "Diagnostic of function on axis1 postponed" & is.na(diagnostic_code9))
My dataframe consists of 112 variables and 23900+ observations in the current setting.
Thus, my questions are:
Could there be something curious going on with my dataframe I'm using (Unfortunately I cannot give you a subset out of it)
Second, is there something wrong here with my coding?
Lastly, what is R exactly doing in the background? Since it is able to filter out these observations based on the exact conditioning where they match the string and is.na() function, while doing completely something else when using the negation sign.
Your logic doesn't quite work in this case. Two consecutive filter statements act like an AND operation. Consider the following example:
df <- data.frame(a=c(1,1,1,1,2,2,2, 2),
b=c(NA,NA,5,5,5,5,5,NA))
df %>% filter(a==1) %>% filter(is.na(b))
# a b
# 1 1 NA
# 2 1 NA
df %>% filter(a!=1) %>% filter(!is.na(b))
# a b
# 1 2 5
# 2 2 5
# 3 2 5
Note that the rows with a=1, b=5 are not returned, even though they are not in the first output, because your first filter (filter(a != 1)) eliminates them.
So if you consider your two filters as A and B, in the first case you are doing A and B. It would be the same as
df %>% filter(a==1 & is.na(b))
# a b
# 1 1 NA
# 2 1 NA
But in the second you are doing NOT A AND NOT B. These are not equivalent. According to De Morgan's laws, NOT (A AND B) is the same as NOT A OR NOT B. So try
df %>% filter(a!=1 | !is.na(b))
# a b
# 1 1 5
# 2 1 5
# 3 2 5
# 4 2 5
# 5 2 5
# 6 2 NA
or equivalently (note the parentheses applying the NOT (!) to the whole expression)
df %>% filter(!(a==1 & is.na(b)))
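A quick way to convince yourself that the two forms agree is to check them against each other. This is a minimal sketch in base R on the same toy data (so it runs without dplyr); the variable names drop, kept_not and kept_or are just for illustration:

```r
df <- data.frame(a = c(1, 1, 1, 1, 2, 2, 2, 2),
                 b = c(NA, NA, 5, 5, 5, 5, 5, NA))

# Rows matching both conditions (A AND B)
drop <- df$a == 1 & is.na(df$b)

kept_not <- df[!drop, ]                        # NOT (A AND B)
kept_or  <- df[df$a != 1 | !is.na(df$b), ]     # NOT A OR NOT B

identical(kept_not, kept_or)                   # both forms keep the same rows
nrow(df[drop, ]) + nrow(kept_not) == nrow(df)  # removed + kept = total
```

One caveat, stated as an assumption about your real data: if Axis_1_1 itself contains NA, comparisons such as != return NA for those rows, and filter() drops rows where the condition is NA, which can also make rows disappear.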

create new dataframe based on 2 columns

I have a large dataset "totaldata" containing multiple rows relating to each animal. Some of them are LactationNo 1 readings, and others are LactationNo 2 readings. I want to extract all animals that have readings from both LactationNo 1 and LactationNo 2 and store them in another dataframe "lactboth"
There are 16 other columns of variables of varying types in each row that I need to preserve in the new dataframe.
I have tried merge, aggregate and %in%, but perhaps I'm using them incorrectly eg.
(lactboth <- totaldata[totaldata$LactationNo %in% c(1,2), ])
Animal Id is column 1, and lactationno is column 2. I can't figure out how to select only those AnimalId with LactationNo=1&2
Have also tried
lactboth <- totaldata[ which(totaldata$LactationNo==1 & totaldata$LactationNo ==2), ]
I feel like this should be simple, but couldn't find an example to follow quite the same. Help appreciated!!
If I understand your question correctly, then your dataset looks something like this:
AnimalId LactationNo
1 A 1
2 B 2
3 E 2
4 A 2
5 E 2
and you'd like to select animals that happen to have both lactation numbers 1 & 2 (like A in this particular example). If that's the case, then you can simply use merge:
lactboth <- merge(totaldata[totaldata$LactationNo == 1,],
                  totaldata[totaldata$LactationNo == 2,],
                  by.x="AnimalId",
                  by.y="AnimalId")[,"AnimalId"]
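Note that the merge above returns only the matching AnimalId values. If you also need every original row with its other columns, one possible approach is to find the IDs present in both lactations and then subset the original data with %in%. A minimal sketch on the toy data, where the Yield column is a made-up stand-in for your 16 other variables:

```r
# Toy data from above; Yield is a hypothetical extra column
totaldata <- data.frame(AnimalId    = c("A", "B", "E", "A", "E"),
                        LactationNo = c(1, 2, 2, 2, 2),
                        Yield       = c(10, 12, 9, 11, 8))

# IDs that appear with each lactation number
ids_lact1 <- totaldata$AnimalId[totaldata$LactationNo == 1]
ids_lact2 <- totaldata$AnimalId[totaldata$LactationNo == 2]

# Animals that have both, then all of their rows with every column kept
both_ids <- intersect(ids_lact1, ids_lact2)
lactboth <- totaldata[totaldata$AnimalId %in% both_ids, ]
lactboth
```

This keeps one row per reading rather than one wide row per animal, so no columns get renamed with .x/.y suffixes.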

Function to work out an average number of unique occurrences

I have the following code, which does what I want. But I would like to know if there is a simpler/nicer way of getting there?
The overall aim of me doing this is that I am building a separate summary table for the overall data, so the average which comes out of this will go into that summary.
Test <- data.frame(
ID = c(1,1,1,2,2,2,3,3,3),
Thing = c("Apple","Apple","Pear","Pear","Apple","Apple","Kiwi","Apple","Pear"),
Day = c("Mon","Tue","Wed")
)
library(reshape2)  # provides dcast()
countfruit <- function(data){
  df <- as.data.frame(table(data$ID,data$Thing))
  df <- dcast(df, Var1 ~ Var2, value.var = "Freq")
  colnames(df) = c("ID", "Apple","Kiwi", "Pear")
  #fixing the counts to apply a 1 for if there is any count there:
  df$Apple[df$Apple>0] = 1
  df$Kiwi[df$Kiwi>0] = 1
  df$Pear[df$Pear>0] = 1
  #making a new column in the summary table of how many for each person
  df$number <- rowSums(df[2:4])
  return(mean(df$number))
}
result <- countfruit(Test)
I think you are overcomplicating the problem. Here is a shorter version that keeps the same rationale:
df <- table(data$ID,data$Thing)
mean(rowSums(df>0)) ## mean number of distinct Things per ID (nonzero cells per row)
EDIT: a one-liner solution:
with(Test , mean(rowSums(table(ID,Thing)>0)))
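Another way to express the same calculation, shown here only as a sketch, is to count the distinct Thing values per ID directly with tapply, without building the contingency table. The name unique_per_id is made up for this example:

```r
Test <- data.frame(
  ID    = c(1, 1, 1, 2, 2, 2, 3, 3, 3),
  Thing = c("Apple", "Apple", "Pear", "Pear", "Apple", "Apple",
            "Kiwi", "Apple", "Pear"),
  Day   = c("Mon", "Tue", "Wed")
)

# Number of distinct Things for each ID: 2, 2 and 3 for this data
unique_per_id <- tapply(Test$Thing, Test$ID, function(x) length(unique(x)))

# Average number of distinct Things per ID
result <- mean(unique_per_id)
result
```

For this data the result is 7/3, the same value the table-based version gives.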
It looks like you are trying to count the nonzero entries in each row. If so, either use as.logical, which converts any nonzero number to TRUE (aka 1), or count the number of zeros in a row and subtract that from the number of pertinent columns.
For example, if I followed your code correctly, your dataframe is
Var1 Apple Kiwi Pear
1 1 2 0 1
2 2 2 0 1
3 3 1 1 1
So (ncol(df) - 1) - sum(df[1, -1] == 0) gives you the count for the first row (here 3 - 1 = 2).
Alternatively, use as.logical to convert all nonzero values to TRUE (aka 1) and calculate the rowSums over the columns of interest.
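Both ideas can be sketched on the small table above. as.logical does not work on a whole data frame, so this example compares the count columns with 0 instead, which gives the same TRUE/FALSE result. The names counts, per_id_nonzero and per_id_by_zeros are made up for illustration:

```r
# The table from above, typed in by hand
df <- data.frame(Var1  = 1:3,
                 Apple = c(2, 2, 1),
                 Kiwi  = c(0, 0, 1),
                 Pear  = c(1, 1, 1))

counts <- as.matrix(df[-1])   # drop the ID column

# Option 1: nonzero cells become TRUE, and TRUE counts as 1 in rowSums
per_id_nonzero <- rowSums(counts != 0)

# Option 2: count the zeros in each row and subtract from the column count
per_id_by_zeros <- (ncol(df) - 1) - rowSums(counts == 0)

per_id_nonzero
mean(per_id_nonzero)
```

Both options give 2, 2 and 3 fruit types for IDs 1, 2 and 3.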

Resources