I've got two strings of variable names that look like this:
> names_a = paste(paste0('a', seq(0,6,1)), collapse = ", ")
> names_a
[1] "a0, a1, a2, a3, a4, a5, a6"
> names_b = paste(paste0('b', seq(0,6,1)), collapse = ", ")
> names_b
[1] "b0, b1, b2, b3, b4, b5, b6"
Each a and b variable contains a vector of ids, for example:
> head(a3)
[1] "1234" "56567" "457659"...
I aim to get all possible pairs of a and b ids. For this purpose I try to paste the variables' names right into rbind and then into expand.grid:
pairs = expand.grid(rbind(parse(text = names_a)), rbind(parse(text = names_b)))
What I mean is: I try to collapse the a0 to a6 vectors into a single vector using rbind (call it a), do the same for all the b vectors, and then find all pairs of values in a and b.
Surprisingly, nothing works. Can it be fixed?
Something like this?
a1 = 1:2
a2 = 3:4
b1 = 5:6
b2 = 7:8
expand.grid(do.call(rbind, mget(paste("a", 1:2, sep = ""))),
do.call(rbind, mget(paste("b", 1:2, sep = ""))))
# Var1 Var2
#1 1 5
#2 3 5
#3 2 5
#4 4 5
#5 1 7
#6 3 7
#7 2 7
#8 4 7
#9 1 6
#10 3 6
#11 2 6
#12 4 6
#13 1 8
#14 3 8
#15 2 8
#16 4 8
Collapse all of a0 through a6 into one vector:
a <- as.vector(sapply(strsplit(gsub(" ","",names_a),",")[[1]],function(x) get(x)))
(or, if you don't have the names stored in a single string that needs parsing):
a <- as.vector(sapply(paste0("a",0:6),function(x) get(x)))
Do the same with b and then
merge(a,b) #all pairs
This will generate duplicates if any of the a or b variables contains duplicates, so you may want to wrap the collapsing of a and b in unique.
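To tie the two answers together, here is a minimal sketch (assuming the a0 to a6 and b0 to b6 vectors already exist in your workspace) that collapses each group with mget() and unlist(), de-duplicates, and then builds all pairs:
# Sketch: collapse a0..a6 and b0..b6, drop duplicates, then form all pairs.
# Assumes the a*/b* vectors exist in the calling environment.
a <- unique(unlist(mget(paste0("a", 0:6))))
b <- unique(unlist(mget(paste0("b", 0:6))))
pairs <- expand.grid(a = a, b = b, stringsAsFactors = FALSE)
head(pairs)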
I have a dataframe:
x y
A1 ''
A2 '123,0'
A3 '4557777'
A4 '8756784321675'
A5 ''
A6 ''
A7
A8
A9 '1533,10'
A10
A11 '51'
I want to add a column "type" to it, which has three values: 1, 2, 3. Type 1 is for a value in y that is a number without a comma, 2 is for a number with a comma, and 3 is for the empty value '' (two apostrophes). So the desired output is:
x y type
A1 '' 3
A2 '123,0' 2
A3 '4557777' 1
A4 '8756784321675' 1
A5 '' 3
A6 '' 3
A7
A8
A9 '1533,10' 2
A10
A11 '51' 1
How could I do it? The most unclear part for me is assigning the right type to each value in column y.
Here's a solution via ifelse and regex:
Data:
df <- data.frame(
y = c("", "", "1,234", "5678", "001,2", "", "455"), stringsAsFactors = F)
Solution:
df$type <- ifelse(grepl(",", df$y), 2,
                  ifelse(grepl("[^,]", df$y), 1, 3))
Result:
df
y type
1 3
2 3
3 1,234 2
4 5678 1
5 001,2 2
6 3
7 455 1
Update:
df <- data.frame(
y = c("''", "", "1,234", "5678", "001,2", "", "''", 455), stringsAsFactors = F)
df$type <- ifelse(grepl(",", df$y), 2,
                  ifelse(grepl("[^,']", df$y), 1,
                         ifelse(df$y == "", "", 3)))
df
y type
1 '' 3
2
3 1,234 2
4 5678 1
5 001,2 2
6
7 '' 3
8 455 1
Is this what you had in mind?
Assuming the empty rows have NULL values in them, I thought of dividing the rows into three groups:
Those which are empty strings (type 3)
Those which are convertible to numerics without producing NA (type 1)
Those which are NULL (no value)
The only ones outside of this set are the ones that belong to type 2, so:
THREE <- which(df$y == "")                                  # empty strings -> type 3
ONE   <- which(!is.na(suppressWarnings(as.numeric(df$y))))  # parses as a number -> type 1
EMPTY <- which(is.null(df$y))                               # NULL entries (no value)
type <- c()
type[THREE] <- 3
type[ONE]   <- 1
type[EMPTY] <- NA
type[-c(ONE, THREE, EMPTY)] <- 2                            # everything else -> type 2
Finally, you have a vector which you can join onto your data frame as a column with:
df2 = cbind(df,type)
I have a bunch of columns and I need to paste the first column onto every other column. It looks like this, except with actual words instead of letters, and there are a few hundred columns.
TEST0 TEST1 TEST2 TEST3 TEST4
1 Q1: AA AA AA AA AA AA BB BB BB
2 Q2:
3 Q3: BB BB BB CC CC CC CC CC CC CC CC CC
4 Q4: DD DD DD DD DD DD DD DD DD
I'm able to paste the first column into another column one at a time doing this:
paste(test[,2],test[,3])
[1] "Q1: AA AA AA" "Q2: " "Q3: BB BB BB" "Q4: DD DD DD "
paste(test[,2],test[,4])
[1] "Q1: AA AA AA " "Q2: " "Q3: CC CC CC " "Q4: "
But is there a way to do multiple columns at once? Thanks.
Here is a way of doing it with dplyr. Create your own pasting function first:
library(dplyr)
df <- data.frame(A = LETTERS, B = 1:26, C = 1:26)
head(df)
A B C
1 A 1 1
2 B 2 2
3 C 3 3
4 D 4 4
5 E 5 5
pasteA <- function(x) paste0(df$A, x)  # prepend column A to each value
df %>%
mutate_if(.predicate = c(F, rep(T, ncol(df)-1)), .funs = list(pasteA))
A B C
1 A A1 A1
2 B B2 B2
3 C C3 C3
4 D D4 D4
5 E E5 E5
We use mutate_if to select all columns except the first one using a logical vector.
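On recent dplyr versions (1.0 or later, which is an assumption about your setup), across() expresses the same idea a little more directly:
library(dplyr)
# Paste column A onto every other column; -A selects all columns except A.
df %>% mutate(across(-A, ~ paste0(A, .x)))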
This is a base solution with a for loop. For every target column, paste the
first column to it.
df <- data.frame(a = letters[1:5], b = 1:5, c = 5:1)
for (i in 2:length(df)) {
df[[i]] <- paste(df[[1]], df[[i]], sep = ": ")
}
Where length gives the number of columns of a data.frame.
Result:
a b c
1 a a: 1 a: 5
2 b b: 2 b: 4
3 c c: 3 c: 3
4 d d: 4 d: 2
5 e e: 5 e: 1
{dplyr} is surprisingly convoluted for this case. A much easier solution is to use lapply (which works since data.frames are lists of columns):
as.data.frame(lapply(test[-1], function (x) paste(test[[1]], x)))
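For instance, here is a quick sketch with a small made-up test data frame, just to show the shape of the result; cbind() puts the untouched first column back if you want to keep it:
test <- data.frame(a = letters[1:5], b = 1:5, c = 5:1)
pasted <- as.data.frame(lapply(test[-1], function(x) paste(test[[1]], x)))
cbind(test[1], pasted)  # keep the original first column alongside the pasted ones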
I have a dataframe, say
df = data.frame(x = c("a","a","b","b","b","c","d","t","c","b","t","c","t","a","a","b","d","t","t","c"),
y = c(2,4,5,2,6,2,4,5,2,6,2,4,5,2,6,2,4,5,2,6))
I want to remove only those rows in which one or multiple t's are directly in between a d and a c; in all other cases I want to retain the rows. So for this example, I would like to remove the t's on rows 8, 18 and 19, but keep the others. I have many thousands of cases, so doing this manually would be a true horror. Any help is very much appreciated.
One option would be to use rle to get runs of the same string and then you can use an sapply to check forward/backward and return all the positions you want to drop:
rle_vals <- rle(as.character(df$x))
drop <- unlist(sapply(2:(length(rle_vals$values) - 1),  # loop over the interior runs
                      function(i, vals, lengths) {
                        # check if this run is "t", the previous is "d" and the next is "c"
                        if (vals[i] == "t" & vals[i - 1] == "d" & vals[i + 1] == "c") {
                          (sum(lengths[1:(i - 1)]) + 1):sum(lengths[1:i])  # get row numbers
                        }
                      }, vals = rle_vals$values, lengths = rle_vals$lengths))
drop
#[1] 8 18 19
df[-drop,]
# x y
#1 a 2
#2 a 4
#3 b 5
#4 b 2
#5 b 6
#6 c 2
#7 d 4
#9 c 2
#10 b 6
#11 t 2
#12 c 4
#13 t 5
#14 a 2
#15 a 6
#16 b 2
#17 d 4
#20 c 6
This also works, by collapsing to a string, identifying groups of t's between d and c (or c and d - not sure whether you wanted this option as well), then working out where they are and removing the rows as appropriate.
df = data.frame(x=c("a","a","b","b","b","c","d","t","c","b","t","c","t","a","a","b","d","t","t","c"),
y=c(2,4,5,2,6,2,4,5,2,6,2,4,5,2,6,2,4,5,2,6),stringsAsFactors = FALSE)
dfs <- paste0(df$x,collapse="") #collapse to a string
dfs2 <- do.call(rbind,lapply(list(gregexpr("dt+c",dfs),gregexpr("ct+d",dfs)),
function(L) data.frame(x=L[[1]],y=attr(L[[1]],"match.length"))))
dfs2 <- dfs2[dfs2$x>0,] #remove any -1 values (if string not found)
drop <- unlist(mapply(function(a,b) (a+1):(a+b-2),dfs2$x,dfs2$y))
df2 <- df[-drop,]
Here is another solution with base R:
df = data.frame(x = c("a","a","b","b","b","c","d","t","c","b","t","c","t","a","a","b","d","t","t","c"),
y = c(2,4,5,2,6,2,4,5,2,6,2,4,5,2,6,2,4,5,2,6))
#
s <- paste0(df$x, collapse="")
L <- c(NA, NA)
while (TRUE) {
r <- regexec("dt+c", s)[[1]]
if (r[1]==-1) break
L <- rbind(L, c(pos=r[1]+1, length=attr(r, "match.length")-2))
s <- sub("d(t+)c", "x\\1x", s)
}
L <- L[-1,]
drop <- unlist(apply(L,1, function(x) seq(from=x[1], len=x[2])))
df[-drop, ]
# > drop
# 8 18 19
# > df[-drop, ]
# x y
# 1 a 2
# 2 a 4
# 3 b 5
# 4 b 2
# 5 b 6
# 6 c 2
# 7 d 4
# 9 c 2
# 10 b 6
# 11 t 2
# 12 c 4
# 13 t 5
# 14 a 2
# 15 a 6
# 16 b 2
# 17 d 4
# 20 c 6
With gregexpr() it is shorter:
s <- paste0(df$x, collapse="")
g <- gregexpr("dt+c", s)[[1]]
L <- data.frame(pos=g+1, length=attr(g, "match.length")-2)
drop <- unlist(apply(L,1, function(x) seq(from=x[1], len=x[2])))
df[-drop, ]
I'm new to this but I'm pretty sure this question hasn't been answered, or I'm just not good at searching....
I would like to subtract the values in a particular row from multiple rows, based on matching columns and values. My actual data will be a large matrix with >5000 columns, each needing to have a blank value subtracted that matches a value in a factor column.
Here is an example data table:
c1 c2 c3 c4 c5
r1 A 1 2 3 aa
r2 B 2 3 4 bb
r3 C 3 4 5 aa
r4 D 4 1 6 bb
r5 Blank 2 3 4 aa
r6 Blank 3 4 5 bb
I would like to subtract the c2, c3, and c4 values of the c1 = "Blank" rows from A, B, and C, using the c5 factor to define which Blank values are used (aa or bb). I would like the "Blank" values to be subtracted from all rows sharing the same c5 value.
(I know this is confusing to describe.)
So the results would look like this:
c1 c2 c3 c4 c5
r1 A -1 -1 -1 aa
r2 B -1 -1 -1 bb
r3 C 1 1 1 aa
r4 D 1 -3 1 bb
I've seen the ddply function work for doing something like this with a single column, but I wasn't able to expand that to perform this task for multiple columns. I'm a noob though...
Thank you for your help!
This is not tested for all possible cases, but should give you an idea:
df <- read.table(text =
"c1 c2 c3 c4 c5
r1 A 1 2 3 aa
r2 B 2 3 4 bb
r3 C 3 4 5 aa
r4 D 4 1 6 bb
r5 Blank 2 3 4 aa
r6 Blank 3 4 5 bb", header = T)
library(data.table)
# separate dataset into two
dt <- data.table(df, key = "c5")
dt.blank <- dt[c1 == "Blank"]
dt <- dt[c1 != "Blank"]
# merge into resulting dataset
dt.res <- dt[dt.blank]
# update each column
columns.count <- ncol(dt)
for(i in 2:(columns.count-1)) {
dt.res[[i]] <- dt.res[[i]] - dt.res[[i + columns.count]]
}
# > dt.res
# c1 c2 c3 c4 c5 i.c1 i.c2 i.c3 i.c4
# 1: A -1 -1 -1 aa Blank 2 3 4
# 2: C 1 1 1 aa Blank 2 3 4
# 3: B -1 -1 -1 bb Blank 3 4 5
# 4: D 1 -3 1 bb Blank 3 4 5
First split your data, since there's no reason to have them in a single data structure. Then apply the subtraction:
# recreate your data
df <- data.frame(rbind(c(1:3, "aa"), c(2:4, "bb"), c(3:5, "aa"), c(4,1,6, "bb"), c(2:4, "aa"), c(3:5, "bb")))
df[,1:3] <- apply(df[,1:3], 2, as.integer)
# split it
blank1 <- df[5,]
blank2 <- df[6,]
df <- df[1:4,]
for (i in 1:nrow(df)) {
if (df[i,4] == "aa") {df[i,1:3] <- df[i,1:3] - blank1[1:3]}
else {df[i,1:3] <- df[i,1:3] - blank2[1:3]}
}
There are a few different ways to run the loop, including vectorizing it. But this suffices. I'd also argue that there's no reason to keep the labels "aa" vs "bb" in the initial data structure either, which would make this simpler; but it's your choice.
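A vectorized sketch of the same idea (using the df, blank1 and blank2 objects created above, and run in place of the loop): match() looks up which blank row applies to each data row, and the subtraction then happens for all rows and columns at once.
# Run instead of the for loop above (sketch).
blanks <- rbind(blank1, blank2)            # one blank row per group ("aa", "bb")
idx <- match(df[, 4], blanks[, 4])         # which blank row applies to each data row
df[, 1:3] <- df[, 1:3] - blanks[idx, 1:3]  # subtract the matching blank, all at once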
I have a matrix. The entries of the matrix are counts for the combination of the dimension levels. For example:
(m0 <- matrix(1:4, nrow=2, dimnames=list(c("A","B"),c("A","B"))))
A B
A 1 3
B 2 4
I can change it to a long format:
library("reshape")
(m1 <- melt(m0))
X1 X2 value
1 A A 1
2 B A 2
3 A B 3
4 B B 4
But I would like to have multiple entries according to value:
m2 <- m1
for (i in 1:nrow(m1)) {
  j <- m1[i, "value"]
  k <- 2
  while (k <= j) {
    m2 <- rbind(m2, m1[i, ])
    k <- k + 1
  }
}
> m2 <- subset(m2,select = - value)
> m2[order(m2$X1),]
X1 X2
1 A A
3 A B
31 A B
32 A B
2 B A
4 B B
21 B A
41 B B
42 B B
43 B B
Is there a parameter in melt which multiplies the entries according to value? Or any other library which can perform this task?
We could do this with base R. We convert the dimnames of 'm0' to a 'data.frame' with two columns using expand.grid, then replicate the rows of the dataset with the values in 'm0', order the rows and change the row names to NULL (if necessary).
d1 <- expand.grid(dimnames(m0))
d2 <- d1[rep(1:nrow(d1), c(m0)),]
res <- d2[order(d2$Var1),]
row.names(res) <- NULL
res
# Var1 Var2
#1 A A
#2 A B
#3 A B
#4 A B
#5 B A
#6 B A
#7 B B
#8 B B
#9 B B
#10 B B
Or with melt, we convert the 'm0' to 'long' format and then replicate the rows as before.
library(reshape2)
dM <- melt(m0)
dM[rep(1:nrow(dM), dM$value),1:2]
As #Frank mentioned, we can also use as.table with as.data.frame to create 'dM':
dM <- as.data.frame(as.table(m0))
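One thing to note (small sketch below): as.data.frame(as.table(m0)) names the count column Freq rather than value, so the replication step uses that name instead:
dM <- as.data.frame(as.table(m0))   # columns Var1, Var2, Freq
dM[rep(1:nrow(dM), dM$Freq), 1:2]   # replicate each row 'Freq' times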