R creating new column based on split column name - r

I ran into a problem while trying to rearrange my data frame into long format.
My table looks like this:
x <- data.frame("Accession"=c("AGI1","AGI2","AGI3","AGI4","AGI5","AGI6"),"wt_rep_1"=c(1,2,3,4,4,5), "wt_rep_2" = c(1,2,3,4,8,9), "mutant1_rep_1"=c(1,1,0,0,5,3), "mutant2_rep_1" = c(1,7,0,0,1,5), "mutant2_rep_2" = c(1,1,4,0,1,8) )
> x
Accession wt_rep_1 wt_rep_2 mutant1_rep_1 mutant2_rep_1 mutant2_rep_2
1 AGI1 1 1 1 1 1
2 AGI2 2 2 1 7 1
3 AGI3 3 3 0 0 4
4 AGI4 4 4 0 0 0
5 AGI5 4 8 5 1 1
6 AGI6 5 9 3 5 8
I need to create a column named "genotype" that contains the first part of each column name, i.e. everything before the first "_".
How can I use
strsplit(names(x), "_")
for that?
Preferably in a loop...
Please, anyone, help.

I'll extract the part of the column names of x before the first _ in two instructions. Note that it can be done in just one line, but I'm posting like this for clarity.
sp <- strsplit(names(x), "_")
sapply(sp[-1], `[`, 1)
Now, how can this be a new column in data.frame x? There are only five elements in the resulting vector and x has six rows.

I agree with Ruy Barradas: I don't get how this vector could be a part of your original dataframe. Could you please clarify?
William Doane's response to this question suggests that using regular expressions might do the trick. I like this approach because I find it elegant and fast:
> gsub("(_.*)$", "", names(x))[-1]
[1] "wt" "wt" "mutant1" "mutant2" "mutant2"
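One possible reading of the question is that every measurement should get its own row, with the genotype taken from the name of the column it came from. The following is a minimal base R sketch under that assumption; the names value_cols, long, sample and value are illustrative and do not come from the original post.

```r
x <- data.frame(
  Accession     = c("AGI1", "AGI2", "AGI3", "AGI4", "AGI5", "AGI6"),
  wt_rep_1      = c(1, 2, 3, 4, 4, 5),
  wt_rep_2      = c(1, 2, 3, 4, 8, 9),
  mutant1_rep_1 = c(1, 1, 0, 0, 5, 3),
  mutant2_rep_1 = c(1, 7, 0, 0, 1, 5),
  mutant2_rep_2 = c(1, 1, 4, 0, 1, 8),
  stringsAsFactors = FALSE
)

# Assumption: every column except Accession holds measurements.
value_cols <- names(x)[-1]

# Stack the measurement columns: one row per accession and sample column.
long <- data.frame(
  Accession = rep(x$Accession, times = length(value_cols)),
  sample    = rep(value_cols, each = nrow(x)),
  value     = unlist(x[value_cols], use.names = FALSE),
  stringsAsFactors = FALSE
)

# Genotype is the part of the original column name before the first "_".
long$genotype <- sub("_.*$", "", long$sample)

head(long)
```

With this layout the five-element vector of prefixes no longer has to fit the six rows of x, because each of the 30 measurements carries its own genotype label.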

Related

How to compare two variable columns with each other in R?

I'm new to R and need help! I have many variables including Response and RightResponse.
I need to compare those two columns, and create a new column that can show whether there is a match or miss between each of the value pairs.
Thanks.
Perhaps something like this?
library(magrittr)
library(dplyr)
> res <- data.frame(Response=c(1,4,4,3,3,6,3),RightResponse=c(1,2,4,3,3,6,5))
> res <- res %>% mutate("CorrectOrNot" = ifelse(Response == RightResponse, "Correct","Incorrect"))
> res
Response RightResponse CorrectOrNot
1 1 1 Correct
2 4 2 Incorrect
3 4 4 Correct
4 3 3 Correct
5 3 3 Correct
6 6 6 Correct
7 3 5 Incorrect
Basically the mutate function has created a new column containing the results of a comparison between Response and RightResponse.
Hope this helps!
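The same comparison can also be sketched in base R without loading any packages. This assumes the same example data as above; the Match column is an illustrative addition for cases where a plain TRUE/FALSE flag is more convenient than a label.

```r
res <- data.frame(Response      = c(1, 4, 4, 3, 3, 6, 3),
                  RightResponse = c(1, 2, 4, 3, 3, 6, 5))

# == compares the two columns element by element.
res$Match <- res$Response == res$RightResponse

# ifelse() turns the logical vector into labels.
res$CorrectOrNot <- ifelse(res$Match, "Correct", "Incorrect")

res
```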

R find all indexes of character matches in a list of strings

I have a df like this (with ~800,000 lines)
# str
# 1 .||.
# 2 .
# 3 .|..
# 4 ..
and I want a new data frame like this, recording the positions of every . in each string (sorry about the column formatting):
# str loc
# 1 .||. 1 4
# 2 . 1
# 3 .|.. 1 3 4
# 4 .. 1 2
I can get the locations with gregexpr(".", str, fixed = TRUE), but I don’t know how to get the first part of the gregexpr output, without the three attribute parts. I will later use the location vectors in other calculations. As gregexpr is vectorized, I do not want to use a loop to do this, as this would take too long. I think this problem must have been addressed in previous questions, but I can’t find a solution. Also, if there is a completely different way to handle this, please tell me.
Here's an example. Is this what you mean?
S = c("appleap", "tapppapp")
P = "ap"
lapply(gregexpr(P, S), function(x) as.vector(x))
#[[1]]
#[1] 1 6
#[[2]]
#[1] 2 6
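Applied to the data from the question, the same idea might look like the sketch below. It assumes the strings live in a character column called str, and it keeps fixed = TRUE so that . is matched literally rather than as a regular-expression wildcard. Note that gregexpr returns -1 for a string with no match, so such entries may need separate handling.

```r
df <- data.frame(str = c(".||.", ".", ".|..", ".."), stringsAsFactors = FALSE)

# gregexpr is vectorized over the strings; fixed = TRUE treats "." literally.
hits <- gregexpr(".", df$str, fixed = TRUE)

# as.vector() drops the attributes and leaves only the positions.
# Storing the result as a list column keeps one position vector per row.
df$loc <- lapply(hits, as.vector)

df$loc[[3]]
# 1 3 4
```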

R rename adjacent column

When an Excel sheet has merged cells, importing the data gives generic column names to the subsequent columns, as shown in the picture below.
(Image: R data frame from Excel sheet with merged cells)
So is it possible to copy the name of a column to the column to its right?
In this example it would mean copying "Sulfur dioxide Results" to overwrite X_6 and X_7, "Ethanol Results" to overwrite X_8 and X_9, etc.
All the column names of interest end with "Results", so I'm wondering whether I can select the columns based on "Results" in the name and copy the name to the 2 columns to its right.
There are many more columns, but they follow the same pattern. The number of columns and their names are likely to change, but "Results" will still be in the names.
This solution works by using sapply against the names of a data frame. Then, for each column name, it checks if the name of the column which came either one or two positions prior ends in results. If so, then it copies over that previous name, from one or two positions prior.
df <- data.frame(one_results=c(1:3), blah=c(4:6), star=c(7:9), col=c(1:3))
df
names(df) <- sapply(seq_along(names(df)), function(x) {
  if (x > 1 && grepl("results$", names(df)[x-1])) {
    return(names(df)[x-1])
  }
  else if (x > 2 && grepl("results$", names(df)[x-2])) {
    return(names(df)[x-2])
  }
  else {
    return(names(df)[x]) # do not alter the column name in this case
  }
})
df
Output:
one_results blah star col
1 1 4 7 1
2 2 5 8 2
3 3 6 9 3
one_results one_results one_results col
1 1 4 7 1
2 2 5 8 2
3 3 6 9 3
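If the number of generic columns after each "Results" column is not always two, a fill-forward over the names is one alternative. The sketch below is an assumption-laden illustration: it works on a plain vector of names, assumes generic names look like X_6, and uses a made-up "Sample" column. make.unique() is added at the end because duplicated column names are awkward to work with in a data frame.

```r
# Hypothetical column names modelled on the question.
nm <- c("Sample", "Sulfur dioxide Results", "X_6", "X_7",
        "Ethanol Results", "X_8", "X_9")

is_results <- grepl("Results$", nm)     # columns that carry a real name
is_generic <- grepl("^X_[0-9]+$", nm)   # assumed pattern of generic names

# For each position, the index of the most recent "Results" column (0 if none).
last_results <- cummax(ifelse(is_results, seq_along(nm), 0L))

# Overwrite generic names that have a "Results" column somewhere to their left.
fill <- is_generic & last_results > 0
nm[fill] <- nm[last_results[fill]]

nm
unique_nm <- make.unique(nm)  # e.g. "Ethanol Results", "Ethanol Results.1", ...
```

For a data frame, the result could then be assigned with names(df) <- unique_nm.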

Combine rows of data frame in R using colMeans?

I'm impressed by the number of "how to combine rows/columns" threads, but even more by the fact that none of these was particularly helpful or at least not applicable to my issue.
My data look like this:
MyData<-data.frame("id" = c("a","a","b"),
"value1_1990" = c(5,NA,1),
"value2_1990" = c(5,NA,2),
"value1_2000" = c(2,1,1),
"value2_2000" = c(2,1,2),
"value1_2010" = c(NA,9,1),
"value2_2010" = c(NA,9,2))
What I want to do is to combine the two rows where id=="a" for columns MyData[,(2:7)] using base R's colMeans.
What it looks like:
id value1_1990 value2_1990 value1_2000 value2_2000 value1_2010 value2_2010
1 a 5 5 2 2 NA NA
2 a NA NA 1 1 9 9
3 b 1 2 1 2 1 2
What I need:
id value1_1990 value2_1990 value1_2000 value2_2000 value1_2010 value2_2010
1 a 5 5 1.5 1.5 9 9
2 b 1 2 1 2 1 2
What I tried (among numerous other things):
MyData[nrow(MyData)+1, 2:7] = colMeans(MyData[which(MyData$id=="a"),(2:7)],na.rm=T) # to combine values from rows where id=="a"
MyData$id<-ifelse(is.na(MyData$id),"NewRow",MyData$id) # to replace "<NA>" in the id-column of the newly created row by "NewRow".
This works, except for the fact that...
...it turns all other existing id's into numeric values (and I don't want to let the second line of code -- the ifelse-statement -- touch any of the existing id's, which is why I wrote else==MyData$id).
...this is not particularly fancy code. Is there a one-line-of-code solution that does the trick? I saw other approaches using aggregate() but this didn't work for me.
You can try using dplyr:
library(dplyr)
Possible solution:
MyData %>% group_by(id) %>% summarise_all(funs(mean(., na.rm = TRUE)))
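A base R alternative with aggregate() is sketched below. One likely reason aggregate() appeared not to work is that its formula method uses na.action = na.omit by default, which drops every row containing an NA before the means are computed; passing na.action = na.pass keeps those rows and lets na.rm = TRUE handle the missing values instead. Note also that funs() has been deprecated since dplyr 0.8.0, so the dplyr call above may emit a warning in newer versions. The name combined is illustrative.

```r
MyData <- data.frame(id = c("a", "a", "b"),
                     value1_1990 = c(5, NA, 1),
                     value2_1990 = c(5, NA, 2),
                     value1_2000 = c(2, 1, 1),
                     value2_2000 = c(2, 1, 2),
                     value1_2010 = c(NA, 9, 1),
                     value2_2010 = c(NA, 9, 2))

# Mean of every value column within each id.
# na.action = na.pass keeps rows with NAs; na.rm = TRUE is passed on to mean().
combined <- aggregate(. ~ id, data = MyData, FUN = mean,
                      na.rm = TRUE, na.action = na.pass)

combined
```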

Comparing two columns: logical- is value from column 1 also in column 2?

I'm pretty confused about how to go about this. Say I have two columns in a data frame. One column is a numerical series in order (x), the other specifies some value from the first, or -1 (y). These are results from a matching experiment, where the goal is to see if multiple photos were taken of the same individual. In the example below, there are 10 photos, but only 6 unique individuals. In the y column, the corresponding x is reported if there is a match. y is -1 for no match (these might as well be NAs). If there are more than 2 photos per individual, the match # will be the most recent record (photos 1, 5 and 7 are the same individual below). The group is the time period in which the photo was taken (no matches within a group!). Hopefully I've got this example right:
x <- c(1,2,3,4,5,6,7,8,9,10)
y <- c(-1,-1,-1,-1,1,-1,1,-1,2,4)
group <- c(1,1,1,2,2,2,3,3,3,3)
DF <- data.frame(x,y,group)
I would like to create a new variable to name the unique individuals, and have a final dataset with a single row per individual (i.e. only have 6 rows instead of 10), that also includes the group information. I.e. if an individual is in all three groups, there could be a value of "111" or if just in the first and last group it would be "101". Any tips?
Thanks for asking about the resulting dataset. I realized my group explanation was bad based on the actual numbers I gave, so I changed the results slightly. Bonus would also be nice to have, but not critical.
name <- c(1,2,3,4,6,8)
group_history <- as.character(c('111','101','100','011','010','001'))
bonus <- as.character(c('1,5,7','2,9','3','4,10','6','8'))
results_I_want <- data.frame(name,group_history,bonus)
My word, more mistakes fixed above...
Using the (updated) example you gave
x <- c(1,2,3,4,5,6,7,8,9,10)
y <- c(-1,-1,-1,-1,1,-1,1,-1,3,4)
group <- c(1,1,1,2,2,2,3,3,3,3)
DF <- data.frame(x,y,group)
Use x and y to create a mapping from higher numbers to lower numbers that belong to the same person. Note that the names of the mapping are strings, even though they consist only of digits.
bottom.df <- DF[DF$y==-1,]
mapdown.df <- DF[DF$y!=-1,]
mapdown <- c(mapdown.df$y, bottom.df$x)
names(mapdown) <- c(mapdown.df$x, bottom.df$x)
We don't know how many iterations it will take to get everything down to the lowest number, so we have to use a while loop.
oldx <- DF$x
newx <- mapdown[as.character(oldx)]
while(any(oldx != newx)) {
  oldx = newx
  newx = mapdown[as.character(oldx)]
}
The result is the set each photo belongs to, named by the lowest number in that set.
DF$id <- unname(newx)
Getting the group membership is harder. Use reshape2 to convert the data into wide format (one column per group), where a column is "1" if the individual appears in that group and "0" if not.
library("reshape2")
wide <- dcast(DF, id~group, value.var="id",
              fun.aggregate=function(x){if(length(x)>0){"1"}else{"0"}})
Finally, paste these "0"/"1" memberships together to get the grouping variable you described.
wide$grouping = apply(wide[,-1], 1, paste, collapse="")
The result:
> wide
id 1 2 3 grouping
1 1 1 1 1 111
2 2 1 0 0 100
3 3 1 0 1 101
4 4 0 1 1 011
5 6 0 1 0 010
6 8 0 0 1 001
No "bonus" yet.
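For readers without reshape2, the same membership strings can be sketched in base R with table(). The block below is self-contained and uses the y values from this answer (photo 9 matched to photo 3); the names roots, matched, present and group_history are illustrative.

```r
DF <- data.frame(x = 1:10,
                 y = c(-1, -1, -1, -1, 1, -1, 1, -1, 3, 4),
                 group = c(1, 1, 1, 2, 2, 2, 3, 3, 3, 3))

# Map every photo to the photo it was matched to (unmatched photos map to themselves).
roots   <- DF[DF$y == -1, ]
matched <- DF[DF$y != -1, ]
mapdown <- c(matched$y, roots$x)
names(mapdown) <- c(matched$x, roots$x)

# Follow the mapping until nothing changes any more.
id <- DF$x
repeat {
  nxt <- unname(mapdown[as.character(id)])
  if (all(nxt == id)) break
  id <- nxt
}
DF$id <- id

# Cross-tabulate individuals against groups; TRUE where at least one photo exists.
present <- unclass(table(DF$id, DF$group)) > 0

# Collapse each row of TRUE/FALSE into a string such as "101".
group_history <- apply(present, 1, function(r) paste(as.integer(r), collapse = ""))
group_history
```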
EDIT:
To get the bonus information, it helps to redo the mapping to keep everything. If you have a lot of cases, this could be slow.
Replace the oldx/newx part with:
iterx <- matrix(DF$x, ncol=1)
iterx <- cbind(iterx, mapdown[as.character(iterx[,1])])
while(any(iterx[,ncol(iterx)]!=iterx[,ncol(iterx)-1])) {
  iterx <- cbind(iterx, mapdown[as.character(iterx[,ncol(iterx)])])
}
DF$id <- iterx[,ncol(iterx)]
To generate the bonus data, you can then use
bonus <- tapply(iterx[,1], iterx[,ncol(iterx)], paste, collapse=",")
wide$bonus <- bonus[as.character(wide$id)]
Which gives:
> wide
id 1 2 3 grouping bonus
1 1 1 1 1 111 1,5,7
2 2 1 0 0 100 2
3 3 1 0 1 101 3,9
4 4 0 1 1 011 4,10
5 6 0 1 0 010 6
6 8 0 0 1 001 8
Note this isn't the same as your example output, but I don't think your example output is right (how can you have a grouping_history of "000"?)
EDIT:
Now it agrees.
Another solution for bonus variable
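f_bonus <- function(data=DF){
  # photos without a match start a new individual
  data_a <- subset(data,y== -1,select=x)
  data_a$pos <- seq(nrow(data_a))
  # matched photos take the position of the photo they point to
  data_b <- subset(data,y!= -1,select=c(x,y))
  data_b$pos <- match(data_b$y, data_a$x)
  data_t <- rbind(data_a,data_b[-2])
  # paste together all photo numbers that share a position
  data_t <- with(data_t,tapply(x,pos,paste,sep="",collapse=","))
  return(data_t)
}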
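A usage sketch for this approach follows. It restates the function with the data argument used throughout so that the block runs on its own, and it uses the y values from the answer above. One caveat: match() only looks up photos that have y == -1, so a photo whose y points at another matched photo (a chain such as 7 -> 5 -> 1) would get an NA position and be dropped. With the example data every match points directly at an unmatched photo, so this does not occur.

```r
DF <- data.frame(x = 1:10,
                 y = c(-1, -1, -1, -1, 1, -1, 1, -1, 3, 4),
                 group = c(1, 1, 1, 2, 2, 2, 3, 3, 3, 3))

f_bonus <- function(data) {
  # Unmatched photos define the individuals, numbered 1, 2, ... in order.
  data_a <- subset(data, y == -1, select = x)
  data_a$pos <- seq(nrow(data_a))
  # Matched photos inherit the position of the photo they point to.
  data_b <- subset(data, y != -1, select = c(x, y))
  data_b$pos <- match(data_b$y, data_a$x)
  # Combine both sets and paste the photo numbers per individual.
  data_t <- rbind(data_a, data_b[-2])
  with(data_t, tapply(x, pos, paste, collapse = ","))
}

bonus <- f_bonus(DF)
bonus
```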
