How to compare two variable columns with each other in R? - r

I'm new to R and need help! I have many variables including Response and RightResponse.
I need to compare those two columns, and create a new column that can show whether there is a match or miss between each of the value pairs.
Thanks.

Perhaps something like this?
library(magrittr)
library(dplyr)
> res <- data.frame(Response=c(1,4,4,3,3,6,3),RightResponse=c(1,2,4,3,3,6,5))
> res <- res %>% mutate("CorrectOrNot" = ifelse(Response == RightResponse, "Correct","Incorrect"))
> res
  Response RightResponse CorrectOrNot
1        1             1      Correct
2        4             2    Incorrect
3        4             4      Correct
4        3             3      Correct
5        3             3      Correct
6        6             6      Correct
7        3             5    Incorrect
Basically the mutate function has created a new column containing the results of a comparison between Response and RightResponse.
Hope this helps!
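If you would rather not load any packages, the same comparison can be sketched in base R. This is only an illustration on the toy data above; the extra logical column `Match` and the name `n_correct` are my own additions, not something the question asked for.

```r
# Toy data copied from the answer above
res <- data.frame(Response      = c(1, 4, 4, 3, 3, 6, 3),
                  RightResponse = c(1, 2, 4, 3, 3, 6, 5))

# Element-wise comparison of the two columns, stored as a label
res$CorrectOrNot <- ifelse(res$Response == res$RightResponse,
                           "Correct", "Incorrect")

# A logical column (hypothetical name) is often easier to count or filter on
res$Match <- res$Response == res$RightResponse

# Number of matching pairs
n_correct <- sum(res$Match)
print(res)
print(n_correct)
```

Note that `==` returns NA when either value is missing, so rows with NA in either column would end up as NA rather than "Incorrect".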

Related

Combine rows of data frame in R using colMeans?

I'm impressed by the number of "how to combine rows/columns" threads, but even more by the fact that none of these was particularly helpful or at least not applicable to my issue.
My data look like this:
MyData <- data.frame("id" = c("a","a","b"),
                     "value1_1990" = c(5,NA,1),
                     "value2_1990" = c(5,NA,2),
                     "value1_2000" = c(2,1,1),
                     "value2_2000" = c(2,1,2),
                     "value1_2010" = c(NA,9,1),
                     "value2_2010" = c(NA,9,2))
What I want to do is to combine the two rows where id=="a" for columns MyData[,(2:7)] using base R's colMeans.
What it looks like:
  id value1_1990 value2_1990 value1_2000 value2_2000 value1_2010 value2_2010
1  a           5           5           2           2          NA          NA
2  a          NA          NA           1           1           9           9
3  b           1           2           1           2           1           2
What I need:
  id value1_1990 value2_1990 value1_2000 value2_2000 value1_2010 value2_2010
1  a           5           5         1.5         1.5           9           9
2  b           1           2           1           2           1           2
What I tried (among numerous other things):
MyData[nrow(MyData)+1, 2:7] = colMeans(MyData[which(MyData$id=="a"),(2:7)],na.rm=T) # to combine values from rows where id=="a"
MyData$id<-ifelse(is.na(MyData$id),"NewRow",MyData$id) # to replace "<NA>" in the id-column of the newly created row by "NewRow".
This works, except that...
...it turns all the existing ids into numeric values (I don't want the second line of code, the ifelse statement, to touch any of the existing ids, which is why its else branch is MyData$id).
...this is not particularly fancy code. Is there a one-line solution that does the trick? I saw other approaches using aggregate(), but they didn't work for me.
You can try using dplyr:
library(dplyr)
Possible solution:
MyData %>% group_by(id) %>% summarise(across(everything(), ~ mean(.x, na.rm = TRUE)))
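For a base R route, aggregate() can do the same grouping. The sketch below rests on a guess about why aggregate() did not work earlier: its formula interface drops rows containing NA by default, whereas the data-frame method used here keeps them so that `na.rm = TRUE` can deal with the missing values. The name `combined` is just for illustration.

```r
# Data from the question
MyData <- data.frame(id = c("a", "a", "b"),
                     value1_1990 = c(5, NA, 1),
                     value2_1990 = c(5, NA, 2),
                     value1_2000 = c(2, 1, 1),
                     value2_2000 = c(2, 1, 2),
                     value1_2010 = c(NA, 9, 1),
                     value2_2010 = c(NA, 9, 2))

# Average every value column within each id; na.rm = TRUE is passed on to mean()
combined <- aggregate(MyData[, 2:7],
                      by  = list(id = MyData$id),
                      FUN = mean, na.rm = TRUE)
print(combined)
```

If a group has only NA in some column, mean() with `na.rm = TRUE` returns NaN for that cell.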

How to label consecutive periods with identical statuses

I have a long vector of patient statuses in R that is chronologically sorted, along with a vector of the associated patient IDs. The status vector is a column of a data frame. I would like to label consecutive rows of data for which the patient status is the same. If the status changes and then reverts to its original value, that counts as three separate events. This differs from most situations I have found, where duplicated or match would suffice.
An example would be along the lines of:
s <- c(0,0,0,1,1,1,0,0,2,1,1,0,0)
id <- c(1,1,1,1,1,1,1,2,2,2,2,2,2)
and the desired output would be
flag <- c(1,1,1,2,2,2,3,1,2,3,3,4,4)
or
flag <- c(1,1,1,2,2,2,3,4,5,6,6,7,7)
One inelegant approach would be to generate the sequence:
unlist(tapply(s, id, function(x) cumsum(c(T, x[-1] != rev(rev(x)[-1])))))
Is there a better way?
I think you could use rleid from data.table for this:
library(data.table)
rleid(s,id)
Output:
1 1 1 2 2 2 3 4 5 6 6 7 7
Or for the first sequence:
data.table(s,id)[,rleid(s),id]$V1
Output:
1 1 1 2 2 2 3 1 2 3 3 4 4
Run Length Encoding - rle()
tapply(s, id, function(x) {
  v <- rle(x)$lengths
  rep(seq_along(v), v)
})
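Both labelings can also be sketched in base R without rle(), by marking the positions where a new run starts and taking a cumulative sum. The helper name `run_id` and the result names below are hypothetical, and the sketch assumes the vectors contain no NA.

```r
s  <- c(0, 0, 0, 1, 1, 1, 0, 0, 2, 1, 1, 0, 0)
id <- c(1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2)

# Hypothetical helper: a run starts at the first element and wherever
# the value differs from the previous one
run_id <- function(x) cumsum(c(TRUE, diff(x) != 0))

# Counter restarts for each patient
flag_by_id <- ave(s, id, FUN = run_id)

# One global counter: a run also starts whenever the patient changes
flag_global <- cumsum(c(TRUE, diff(s) != 0 | diff(id) != 0))

print(flag_by_id)
print(flag_global)
```

ave() returns its result in the original row order, so it can be assigned straight back to a data frame column.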

R creating new column based on split column name

I ran into a problem while trying to rearrange my data frame into long format.
My table looks like this:
x <- data.frame("Accession"=c("AGI1","AGI2","AGI3","AGI4","AGI5","AGI6"),"wt_rep_1"=c(1,2,3,4,4,5), "wt_rep_2" = c(1,2,3,4,8,9), "mutant1_rep_1"=c(1,1,0,0,5,3), "mutant2_rep_1" = c(1,7,0,0,1,5), "mutant2_rep_2" = c(1,1,4,0,1,8) )
> x
  Accession wt_rep_1 wt_rep_2 mutant1_rep_1 mutant2_rep_1 mutant2_rep_2
1      AGI1        1        1             1             1             1
2      AGI2        2        2             1             7             1
3      AGI3        3        3             0             0             4
4      AGI4        4        4             0             0             0
5      AGI5        4        8             5             1             1
6      AGI6        5        9             3             5             8
I need to create a column named "genotype" that contains the first part of the column name, before the first "_".
How to use
strsplit(names(x), "_")
for that?
and preferably loop...
please, anyone, help.
I'll extract the part of the column names of x before the first _ in two instructions. Note that it can be done in just one line, but I'm posting like this for clarity.
sp <- strsplit(names(x), "_")
sapply(sp[-1], `[`, 1)
Now, how can this be a new column in data.frame x? There are only five elements in the resulting vector and x has six rows.
I agree with Ruy Barradas: I don't get how this vector could be a part of your original dataframe. Could you please clarify?
William Doane's response to this question suggests that using regular expressions might do the trick. I like this approach because I find it elegant and fast:
> gsub("(_.*)$", "", names(x))[-1]
[1] "wt" "wt" "mutant1" "mutant2" "mutant2"
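If the goal is the long format mentioned in the question, the genotype vector does fit once each measurement has its own row. The sketch below assumes that is what was wanted: one row per Accession and sample column, with the genotype taken from the sample name. The column names `sample` and `value` are made up for illustration.

```r
x <- data.frame(Accession     = c("AGI1", "AGI2", "AGI3", "AGI4", "AGI5", "AGI6"),
                wt_rep_1      = c(1, 2, 3, 4, 4, 5),
                wt_rep_2      = c(1, 2, 3, 4, 8, 9),
                mutant1_rep_1 = c(1, 1, 0, 0, 5, 3),
                mutant2_rep_1 = c(1, 7, 0, 0, 1, 5),
                mutant2_rep_2 = c(1, 1, 4, 0, 1, 8))

sample_cols <- names(x)[-1]  # every column except Accession

# Stack the measurement columns on top of each other, column by column
long <- data.frame(
  Accession = rep(x$Accession, times = length(sample_cols)),
  sample    = rep(sample_cols, each = nrow(x)),
  value     = unlist(x[sample_cols], use.names = FALSE),
  stringsAsFactors = FALSE
)

# Genotype is everything before the first underscore of the sample name
long$genotype <- sub("_.*$", "", long$sample)
print(head(long))
```

No explicit loop is needed, because rep(), unlist() and sub() are all vectorised.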

R: Subset data frame based on multiple values for multiple variables

I need to pull records from a first data set (called df1 here) based on a combination of specific dates, ID#s, event start time, and event end time that match with a second data set (df2). Everything works fine when there is just 1 date, ID, and event start and end time, but some of the matching records between the data sets contain multiple IDs, dates, or times, and I can't get the records from df1 to subset properly in those cases. I ultimately want to put this in a FOR loop or independent function since I have a rather large dataset. Here's what I've got so far:
I started just by matching the dates between the two data sets as follows:
match_dates <- as.character(intersect(df1$Date, df2$Date))
Then I selected the records in df2 based on the first matching date, also keeping the other columns so I have the other ID and time information I need:
records <- df2[which(df2$Date == match_dates[1]), ]
The date, ID, start, and end time from records are then:
[1] "01-04-2009" "599091" "12:00" "17:21"
Finally I subset df1 for before and after the event based on the date, ID, and times in records and combined them into a new data frame called final to get at the data contained in df1 that I ultimately need.
before <- subset(df1, NUM==records$ID & Date==records$Date & Time<records$Start)
after <- subset(df1, NUM==records$ID & Date==records$Date & Time>records$End)
final <- rbind(before, after)
Here's the real problem - some of the matching dates have more than 1 corresponding row in df2, and return multiple IDs or times. Here is what an example of multiple records looks like:
records <- df2[which(df2$Date == match_dates[25]), ]
> records$ID
[1] 507646 680845 680845
> records$Date
[1] "04-02-2009" "04-02-2009" "04-02-2009"
> records$Start
[1] "09:43" "05:37" "11:59"
> records$End
[1] "05:19" "11:29" "16:47"
When I try to subset df1 based on this I get warnings, and the result is not what I need:
before <- subset(df1, NUM==records$ID & Date==records$Date & Time<records$Start)
Warning messages:
1: In NUM == records$ID :
longer object length is not a multiple of shorter object length
2: In Date == records$Date :
longer object length is not a multiple of shorter object length
3: In Time < records$Start :
longer object length is not a multiple of shorter object length
Trying to do it manually for each ID-date-time combination would be way too tedious. I have 9 years' worth of data, all with multiple matching dates for a given year between the data sets, so ideally I would like to set this up as a FOR loop, or a function with a FOR loop in it, but I can't get past this. Thanks in advance for any tips!
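The warnings come from comparing whole columns of df1 against the several values in records, which R recycles position by position. One way to avoid both the recycling and the loop is to join the two tables on ID and date first and then filter on the times. The sketch below uses small made-up stand-ins for df1 and df2 (only the column names come from the question) and assumes the times are zero-padded "HH:MM" strings.

```r
# Hypothetical stand-ins for the real data; column names follow the question
df1 <- data.frame(
  NUM  = c(507646, 507646, 680845, 680845, 680845),
  Date = "04-02-2009",
  Time = c("04:00", "08:00", "05:00", "06:00", "12:00"),
  stringsAsFactors = FALSE
)
df2 <- data.frame(
  ID    = c(507646, 680845),
  Date  = "04-02-2009",
  Start = c("05:19", "05:37"),
  End   = c("09:43", "11:29"),
  stringsAsFactors = FALSE
)

# Pair every df1 row with the events that share its ID and date
joined <- merge(df1, df2, by.x = c("NUM", "Date"), by.y = c("ID", "Date"))

# Keep observations before the event start or after the event end.
# Zero-padded "HH:MM" strings sort in clock order, so plain comparison works.
final <- joined[joined$Time < joined$Start | joined$Time > joined$End, ]
print(final)
```

If an ID has several events on the same date, each df1 row is paired with every one of them, so a row can appear more than once in `final`; whether that is desirable depends on how overlapping events should be treated.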
If you're asking what I think you are, the filter() function from the dplyr package combined with the %in% operator does what you're looking for.
> library(dplyr)
> df1 <- data.frame(A = c(rep(1,4),rep(2,4),rep(3,4)), B = c(rep(1:4,3)))
> df1
   A B
1  1 1
2  1 2
3  1 3
4  1 4
5  2 1
6  2 2
7  2 3
8  2 4
9  3 1
10 3 2
11 3 3
12 3 4
> df2 <- data.frame(A = c(1,2), B = c(3,4))
> df2
  A B
1 1 3
2 2 4
> filter(df1, A %in% df2$A, B %in% df2$B)
  A B
1 1 3
2 1 4
3 2 3
4 2 4
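One caveat about the output above: the two %in% tests are independent, so rows such as A = 1, B = 4 are kept even though that pair never occurs in df2. If the combinations have to match row by row, a join on both columns may be closer to what is needed. A minimal base R sketch on the same toy data (the names `loose` and `strict` are only for illustration):

```r
# Toy data from the answer above
df1 <- data.frame(A = rep(1:3, each = 4), B = rep(1:4, times = 3))
df2 <- data.frame(A = c(1, 2), B = c(3, 4))

# Independent %in% tests keep every combination of matching A and B values
loose <- df1[df1$A %in% df2$A & df1$B %in% df2$B, ]

# merge() on both keys keeps only the (A, B) pairs that occur together in df2
strict <- merge(df1, df2, by = c("A", "B"))

print(loose)
print(strict)
```

With dplyr loaded, semi_join(df1, df2, by = c("A", "B")) should give the same rows as `strict`.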

Subsetting data frames in R

I'm new to R and learning about subsetting. I have a table and I'm trying to get the size of a subset of the table. My issue is that when I try two different ways I get two different answers. For a table "dat" where I'm trying to select all rows where RMS is 5 and BDS is 2:
dim(dat[(dat$RMS==5) & (dat$BDS==2),])
gives me a different answer than
dim(subset(dat,(dat$RMS==5) & (dat$BDS==2)))
The second one is correct, could someone explain why these are different and why the first one is giving me the wrong answer?
Thanks
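The difference most likely comes from how the two methods treat NA values: indexing with [ returns a row of NAs for every NA in the logical condition, while subset() drops those rows. If you remove rows with NA from the data frame you should get the same results: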
dat_clean = na.omit(dat)
Works for me.....
> x = c(1,1,2,2,3,3)
> y = c(4,4,5,5,6,6)
>
> X = data.frame(x,y)
>
> dim(X[X$x==1 & X$y==4,])
[1] 2 2
>
> (X[X$x==1 & X$y==4,])
  x y
1 1 4
2 1 4
> dim(subset(X,(X$x==1) & (X$y==4)))
[1] 2 2
> subset(X,(X$x==1) & (X$y==4))
  x y
1 1 4
2 1 4
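The example above contains no missing values, which is why both methods agree. A small made-up data frame with NAs (the values are invented; only the column names RMS and BDS come from the question) shows where the two approaches diverge:

```r
# Hypothetical data with missing values in both columns
dat <- data.frame(RMS = c(5, 5, NA, 3),
                  BDS = c(2, NA, 2, 2))

# The condition is TRUE, NA, NA, FALSE
keep <- dat$RMS == 5 & dat$BDS == 2

# `[` returns an all-NA row for every NA in the index
n_bracket <- nrow(dat[keep, ])

# subset() treats NA as FALSE and drops those rows
n_subset <- nrow(subset(dat, RMS == 5 & BDS == 2))

# which() discards NA, so bracket indexing then agrees with subset()
n_which <- nrow(dat[which(keep), ])

print(c(bracket = n_bracket, subset = n_subset, which = n_which))
```

Wrapping the condition in which() keeps the rows that have NA in other, unrelated columns, which na.omit() on the whole data frame would throw away.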
