I'm new to this, but I'm pretty sure this question hasn't been answered, or I'm just not good at searching.
I would like to subtract the values in one particular row from the values in multiple other rows, based on matching columns and values. My actual data will be a large matrix with >5000 columns, each needing a blank value subtracted from it, where the blank used is the one that matches a value in a factor column.
Here is an example data table:
c1 c2 c3 c4 c5
r1 A 1 2 3 aa
r2 B 2 3 4 bb
r3 C 3 4 5 aa
r4 D 4 1 6 bb
r5 Blank 2 3 4 aa
r6 Blank 3 4 5 bb
I would like to subtract the c2, c3, and c4 values of the c1 = "Blank" row from rows A, B, C, and D, using the c5 factor to define which Blank values are used (aa or bb). The "Blank" values should be subtracted from all rows sharing the same c5 value.
(I know this is confusing to describe.)
So the results would look like this:
c1 c2 c3 c4 c5
r1 A -1 -1 -1 aa
r2 B -1 -1 -1 bb
r3 C 1 1 1 aa
r4 D 1 -3 1 bb
I've seen the ddply function work for doing something like this with a single column, but I wasn't able to expand that to perform this task for multiple columns. I'm a noob though...
Thank you for your help!
This is not tested for all possible cases, but should give you an idea:
df <- read.table(text =
"c1 c2 c3 c4 c5
r1 A 1 2 3 aa
r2 B 2 3 4 bb
r3 C 3 4 5 aa
r4 D 4 1 6 bb
r5 Blank 2 3 4 aa
r6 Blank 3 4 5 bb", header = T)
library(data.table)
# separate dataset into two
dt <- data.table(df, key = "c5")
dt.blank <- dt[c1 == "Blank"]
dt <- dt[c1 != "Blank"]
# merge into resulting dataset
dt.res <- dt[dt.blank]
# update each column
columns.count <- ncol(dt)
for(i in 2:(columns.count-1)) {
dt.res[[i]] <- dt.res[[i]] - dt.res[[i + columns.count]]
}
# > dt.res
# c1 c2 c3 c4 c5 i.c1 i.c2 i.c3 i.c4
# 1: A -1 -1 -1 aa Blank 2 3 4
# 2: C 1 1 1 aa Blank 2 3 4
# 3: B -1 -1 -1 bb Blank 3 4 5
# 4: D 1 -3 1 bb Blank 3 4 5
First split your data, since there's no reason to keep the samples and the blanks in a single data structure. Then apply the function:
# recreate your data
df <- data.frame(rbind(c(1:3, "aa"), c(2:4, "bb"), c(3:5, "aa"), c(4,1,6, "bb"), c(2:4, "aa"), c(3:5, "bb")))
df[,1:3] <- apply(df[,1:3], 2, as.integer)
# split it
blank1 <- df[5,]
blank2 <- df[6,]
df <- df[1:4,]
for (i in 1:nrow(df)) {
if (df[i,4] == "aa") {df[i,1:3] <- df[i,1:3] - blank1[1:3]}
else {df[i,1:3] <- df[i,1:3] - blank2[1:3]}
}
There are a few different ways to run the loop, including vectorizing it, but this suffices. I'd also argue that there's no reason to keep the labels "aa" vs. "bb" in the initial data structure either, which would make this simpler; but it's your choice.
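For what it's worth, here is a rough sketch of what a vectorized version could look like in base R. It assumes there is exactly one "Blank" row per c5 value and that the columns to correct are c2 to c4; the names `num_cols`, `blanks`, `samples` and `idx` are just made up for the illustration.

```r
# Recreate the example table from the question.
df <- read.table(text = "
c1    c2 c3 c4 c5
A      1  2  3 aa
B      2  3  4 bb
C      3  4  5 aa
D      4  1  6 bb
Blank  2  3  4 aa
Blank  3  4  5 bb
", header = TRUE, stringsAsFactors = FALSE)

num_cols <- c("c2", "c3", "c4")   # assumption: the columns to be corrected

blanks  <- df[df$c1 == "Blank", ]
samples <- df[df$c1 != "Blank", ]

# For every sample row, look up the blank row that shares its c5 value
# (assumes one blank per c5 level; match() returns the first hit).
idx <- match(samples$c5, blanks$c5)

# Subtract as matrices so the differing row names cannot interfere.
samples[, num_cols] <- as.matrix(samples[, num_cols]) -
  as.matrix(blanks[idx, num_cols])
samples
```

Since the lookup is done once with `match()`, this should not care how many columns are listed in `num_cols`, which seems to be the main worry with >5000 columns.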
My data looks like this:
df <- data.frame(user_id=c('13','15'),
answer_id = c('{"row[0][0]":"A","row[0][1]":"B","row[0][2]":"C","row[0][3]":"D","row[1][0]":"A1","row[1][1]":"B1","row[1][2]":"C1","row[1][3]":"D1"}', '{"row[0][0]":"W","row[0][1]":"X","row[0][2]":"Y","row[0][3]":"Z","row[1][0]":"W1","row[1][1]":"X1","row[1][2]":"Y1","row[1][3]":"Z1"}'))
Desired data view
user_id answer_id1 answer_id2 answer_id3 answer_id4
13 A B C D
13 A1 B1 C1 D1
15 W X Y Z
15 W1 X1 Y1 Z1
I'm new to R and hope to get a solution soon, as I always do.
This may not be the best solution, but it gets you from your sample input to your desired output using stringr, purrr, and tidyr. See regex101 for an explanation of the regex used in the stringr::str_match_all() call.
df <- data.frame(user_id=c('13','15'),
answer_id = c('{"row[0][0]":"A","row[0][1]":"B","row[0][2]":"C","row[0][3]":"D","row[1][0]":"A1","row[1][1]":"B1","row[1][2]":"C1","row[1][3]":"D1"}', '{"row[0][0]":"W","row[0][1]":"X","row[0][2]":"Y","row[0][3]":"Z","row[1][0]":"W1","row[1][1]":"X1","row[1][2]":"Y1","row[1][3]":"Z1"}'),
stringsAsFactors=F)
#use regex to extract row ids and answers
regex_matches <- stringr::str_match_all(df$answer_id, '\\"row\\[(\\d+)\\]\\[(\\d+)\\]\\":\\"([^\\"]*)\\"')
#add user id to each result
answers_by_user <- purrr::map2(df$user_id, regex_matches, ~cbind(.x, .y[,-1]))
#combine list of matrices and convert to df
answers_df <- data.frame(do.call(rbind, answers_by_user))
#add meaningful names
names(answers_df) <- c("user_id", "row_1", "row_2", "value")
#convert to wide
final_df <- tidyr::spread(answers_df, row_2, value)
#remove row column
final_df$row_1 <- NULL
#clean up names
names(final_df) <- c("user_id", "answer_id1", "answer_id2", "answer_id3", "answer_id4")
final_df
#output
user_id answer_id1 answer_id2 answer_id3 answer_id4
1 13 A B C D
2 13 A1 B1 C1 D1
3 15 W X Y Z
4 15 W1 X1 Y1 Z1
Column 2 looks like JSON, so you could do something like this to get it into a form that you can do something with...
library(rjson)
df2 <- lapply(1:nrow(df),function(i)
data.frame(user=df[i,1],
answer=unlist(fromJSON(as.character(df[i,2]))),stringsAsFactors = FALSE))
df2 <- do.call(rbind,df2)
df2[,"r1"] <- gsub(".+\\[(\\d)]\\[(\\d)].*","\\1",rownames(df2))
df2[,"r2"] <- gsub(".+\\[(\\d)]\\[(\\d)].*","\\2",rownames(df2))
df2
user answer r1 r2
row[0][0] 13 A 0 0
row[0][1] 13 B 0 1
row[0][2] 13 C 0 2
row[0][3] 13 D 0 3
row[1][0] 13 A1 1 0
row[1][1] 13 B1 1 1
row[1][2] 13 C1 1 2
row[1][3] 13 D1 1 3
row[0][0]1 15 W 0 0
row[0][1]1 15 X 0 1
row[0][2]1 15 Y 0 2
row[0][3]1 15 Z 0 3
row[1][0]1 15 W1 1 0
row[1][1]1 15 X1 1 1
row[1][2]1 15 Y1 1 2
row[1][3]1 15 Z1 1 3
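To get from that long form to the wide layout asked for in the question, something along these lines might work. This is only a sketch: the long table is retyped by hand here (so it runs without rjson), and the use of base `reshape()` plus the output column names are my own choice, not part of the answer above.

```r
# Long-form table as produced above, retyped by hand for a self-contained example.
df2 <- data.frame(
  user   = rep(c("13", "15"), each = 8),
  answer = c("A", "B", "C", "D", "A1", "B1", "C1", "D1",
             "W", "X", "Y", "Z", "W1", "X1", "Y1", "Z1"),
  r1     = rep(rep(c("0", "1"), each = 4), times = 2),
  r2     = rep(c("0", "1", "2", "3"), times = 4),
  stringsAsFactors = FALSE
)

# One output row per (user, r1); one output column per value of r2.
wide <- reshape(df2, idvar = c("user", "r1"), timevar = "r2",
                direction = "wide")

# Keep the user id and the four answer columns, then rename them.
wide <- wide[c("user", "answer.0", "answer.1", "answer.2", "answer.3")]
names(wide) <- c("user_id", paste0("answer_id", 1:4))
rownames(wide) <- NULL
wide
```

The row index r1 is only needed as an id while reshaping, so it is dropped at the end.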
I am currently having a problem utilizing R to compare each column within a specific matrix. I have attempted to compare each of the entire columns at once, and generate a true and false output via the table command, and then convert the number of trues that can be found to a numeric value and input such values in their respective places within the incidence matrix.
For example, I have data in this type of format:
//Example state matrix - I am attempting to compare c1 with c2, then c1 with c3, then c1 with c4 and so on and so forth
c1 c2 c3 c4
r1 2 6 3 2
r2 1 1 6 5
r3 3 1 3 6
And I am trying to instead put it into this format
//Example incidence matrix - Which is how many times c1 equaled c2 in the above matrix
c1 c2 c3 c4
c1 3 1 1 1
c2 1 3 0 0
c3 1 0 3 0
c4 1 0 0 3
Here is the code I have come up with so far, however, I keep getting this particular error --
Warning message:
In IncidenceMat[rat][r] = IncidenceMat[rat][r] + as.numeric(instances) :number of items to replace is not a multiple of replacement length
rawData = read.table("5-14-2014streamW636PPstate.txt")
colnames = names(rawData) #the column names in R
df <- data.frame(rawData)
rats = ncol(rawData)
instances = nrow(rawData)
IncidenceMat = matrix(rep(0, rats), nrow = rats, ncol = rats)
for(rat in rats)
for(r in rats)
if(rat == r){rawData[instance][rat] == rawData[instance][r] something like this would work in C++ if I attempted,
IncidenceMat[rat][r] = IncidenceMat[rat][r] + as.numeric(instances)
} else{
count = df[colnames[rat]] == df[colnames[r]]
c = table(count)
TotTrue = as.numeric(c[2][1])
IncidenceMat[rat][r] = IncidenceMat[rat][r] + TotTrue #count would go here #this should work like a charm as well
}
Any help would be greatly appreciated; I have also looked at some of these resources, however, I am still stumped
I tried this and this along with some other resources I recently closed.
How about this (note the incidence matrix is symmetric)?
df
c1 c2 c3 c4
r1 2 6 3 2
r2 1 1 6 5
r3 3 1 3 6
incidence <- matrix(rep(0, ncol(df)*ncol(df)), nrow=ncol(df))
diag(incidence) <- nrow(df)
for (i in 1:(ncol(df)-1)) {
for (j in (i+1):ncol(df)) {
incidence[i,j] = incidence[j,i] = sum(df[,i] == df[,j])
}
}
incidence
[,1] [,2] [,3] [,4]
[1,] 3 1 1 1
[2,] 1 3 0 0
[3,] 1 0 3 0
[4,] 1 0 0 3
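If you'd rather avoid the explicit double loop, a sketch like the following should give the same matrix. It assumes every column has the same type so the data frame converts cleanly to a matrix; `m` is just a helper name for the illustration.

```r
# Example state matrix from the question.
df <- data.frame(c1 = c(2, 1, 3), c2 = c(6, 1, 1),
                 c3 = c(3, 6, 3), c4 = c(2, 5, 6))
m <- as.matrix(df)

# For each column j, compare it against every column at once:
# m == m[, j] recycles column j down each column of m,
# and colSums() counts the rows where the values agree.
incidence <- sapply(seq_len(ncol(m)), function(j) colSums(m == m[, j]))
dimnames(incidence) <- list(colnames(m), colnames(m))
incidence
```

This does twice the work of the loop above because it does not exploit the symmetry, but for a few hundred columns that is unlikely to matter.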
What I want to do is multiply all the values in column 1 of a data.frame by the first element in a vector, then multiply all the values in column 2 by the 2nd element in the vector, etc...
c1 <- c(1,2,3)
c2 <- c(4,5,6)
c3 <- c(7,8,9)
d1 <- data.frame(c1,c2,c3)
c1 c2 c3
1 1 4 7
2 2 5 8
3 3 6 9
v1 <- c(1,2,3)
So the result is this:
c1 c2 c3
1 1 8 21
2 2 10 24
3 3 12 27
I can do this one column at a time but what if I have 100 columns? I want to be able to do this programmatically.
Or simply turn the vector into a diagonal matrix, so that each column of d1 is multiplied by the corresponding element of v1:
c1 <- c(1,2,3)
c2 <- c(4,5,6)
c3 <- c(7,8,9)
d1 <- as.matrix(cbind(c1,c2,c3))
v1 <- c(1,2,3)
d1%*%diag(v1)
[,1] [,2] [,3]
[1,] 1 8 21
[2,] 2 10 24
[3,] 3 12 27
Transposing the dataframe works.
c1 <- c(1,2,3)
c2 <- c(4,5,6)
c3 <- c(7,8,9)
d1 <- data.frame(c1,c2,c3)
v1 <- c(1,2,3)
t(t(d1)*v1)
# c1 c2 c3
#[1,] 1 8 21
#[2,] 2 10 24
#[3,] 3 12 27
EDIT: If not all columns are numeric, you can do the following
c1 <- c(1,2,3)
c2 <- c(4,5,6)
c3 <- c(7,8,9)
d1 <- data.frame(c1,c2,c3)
# Adding a column of characters for demonstration
d1$c4 <- c("rr", "t", "s")
v1 <- c(1,2,3)
#Choosing only numeric columns
index <- which(sapply(d1, is.numeric) == TRUE)
d1_mat <- as.matrix(d1[,index])
d1[,index] <- t(t(d1_mat)*v1)
d1
# c1 c2 c3 c4
#1 1 8 21 rr
#2 2 10 24 t
#3 3 12 27 s
We can also replicate the vector to make the lengths equal and then multiply
d1*v1[col(d1)]
# c1 c2 c3
#1 1 8 21
#2 2 10 24
#3 3 12 27
Or use sweep
sweep(d1, 2, v1, FUN="*")
Or with mapply to multiply the corresponding columns of 'data.frame' and elements of 'vector'
mapply(`*`, d1, v1)
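As a quick sanity check, the three variants can be run side by side on the example data. The object names `by_col`, `by_sweep` and `by_mapply` are only for this sketch. One difference to keep in mind: `mapply` simplifies its result to a matrix, while the other two keep the data.frame.

```r
d1 <- data.frame(c1 = c(1, 2, 3), c2 = c(4, 5, 6), c3 = c(7, 8, 9))
v1 <- c(1, 2, 3)

by_col    <- d1 * v1[col(d1)]            # data.frame
by_sweep  <- sweep(d1, 2, v1, FUN = "*") # data.frame
by_mapply <- mapply(`*`, d1, v1)         # matrix with columns c1, c2, c3

# All three should hold the same numbers.
all(as.matrix(by_col) == by_mapply)
all(as.matrix(by_sweep) == by_mapply)
```

If you need a data.frame back from the `mapply` version, wrapping it in `as.data.frame()` should do.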
I'd like to be able to compare two tables and have R return a list of records and variables that don't match.
For example, with the following two tables
> df1
id let num
1 1a a 1
2 2b b 2
3 3c c 3
4 4d d 4
5 5e e 5
> df2
id let num
1 1a a 1
2 2b b 2
3 3c c 3
4 4d e 4
5 5e d 5
I would want a compare() function to return something like "id=4d, let" to let me know that the let variable in the record with id = 4d doesn't match.
I have seen the compare library in CRAN but it only returns TRUE or FALSE for the entire variable if there is a mismatch. Is there a library with a different compare function, or a way to do this manually?
df1 <- read.table(text="
id let1 num1
1a a 1
2b b 2
3c c 3
4d d 4
5e e 5", head=T, as.is=T)
df2 <- read.table(text="
id let2 num2
1a a 1
2b b 2
3c c 3
4d e 4
5e d 5", head=T, as.is=T)
df <- merge(df1, df2, by="id")
df$let <- ifelse(df$let1 == df$let2, "equal", "not equal")
df$num <- ifelse(df$num1 == df$num2, "equal", "not equal")
df
# id let1 num1 let2 num2 let num
# 1 1a a 1 a 1 equal equal
# 2 2b b 2 b 2 equal equal
# 3 3c c 3 c 3 equal equal
# 4 4d d 4 e 4 not equal equal
# 5 5e e 5 d 5 not equal equal
You mean something like which? Quick reproducible example:
> m1 <- m2 <- matrix(1:9, 3)
> diag(m1) <- 0
> which(m1 != m2, arr.ind = TRUE)
row col
[1,] 1 1
[2,] 2 2
[3,] 3 3
Something like:
df_diff <- list()
for (i in 1:ncol(df1))
{
df_diff[[i]] <- df1$id[df2[i] != df1[i]]
names(df_diff)[i] <- names(df1)[i]
}
This should produce (hopefully :)) a list of character vectors (one for each variable). Each vector contains the IDs of df1 where the records of the two df don't match.
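Combining the two ideas above (`which` with `arr.ind = TRUE` and a per-variable comparison), a sketch that reports each mismatching cell as an id/variable pair could look like this. It assumes both tables have the same columns and their rows are already in the same id order; `value_cols`, `neq`, `hits` and `mismatches` are names invented for the example.

```r
df1 <- data.frame(id  = c("1a", "2b", "3c", "4d", "5e"),
                  let = c("a", "b", "c", "d", "e"),
                  num = 1:5, stringsAsFactors = FALSE)
df2 <- data.frame(id  = c("1a", "2b", "3c", "4d", "5e"),
                  let = c("a", "b", "c", "e", "d"),
                  num = 1:5, stringsAsFactors = FALSE)

value_cols <- setdiff(names(df1), "id")

# Comparing two data frames cell by cell yields a logical matrix.
neq <- df1[value_cols] != df2[value_cols]

# Row/column positions of every TRUE cell.
hits <- which(neq, arr.ind = TRUE)

mismatches <- data.frame(id       = df1$id[hits[, "row"]],
                         variable = value_cols[hits[, "col"]],
                         stringsAsFactors = FALSE)
mismatches
```

If the rows are not aligned, merging on id first (as in the `merge` answer above) would be needed before comparing.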
I have a matrix in R. Each entry i,j is a score and the rownames and colnames are ids.
Instead of the matrix I just want a 3 column matrix that has: i,j,score
Right now I'm using nested for loops. Like:
for(i in rownames(g))
{
print(which(rownames(g)==i))
for(j in colnames(g))
{
cur.vector<-c(cur.ref, i, j, g[rownames(g) %in% i,colnames(g) %in% j])
rbind(new.file,cur.vector)->new.file
}
}
But that's very inefficient, I think. I'm sure there's a better way; I'm just not good enough with R yet.
Thoughts?
If I understand you correctly, you need to flatten the matrix.
You can use as.vector and rep to add the id columns e.g. :
m = cbind(c(1,2,3),c(4,5,6),c(7,8,9))
row.names(m) = c('R1','R2','R3')
colnames(m) = c('C1','C2','C3')
d <- data.frame(i=rep(row.names(m),ncol(m)),
j=rep(colnames(m),each=nrow(m)),
score=as.vector(m))
Result:
> m
C1 C2 C3
R1 1 4 7
R2 2 5 8
R3 3 6 9
> d
i j score
1 R1 C1 1
2 R2 C1 2
3 R3 C1 3
4 R1 C2 4
5 R2 C2 5
6 R3 C2 6
7 R1 C3 7
8 R2 C3 8
9 R3 C3 9
Please note that this code converts the matrix into a data.frame, since the row and column names can be strings and a matrix cannot hold columns of different types.
If you are sure that all row and column names are numbers, you can coerce the result to a matrix.
If you convert your matrix first to a table (with as.table) then to a data frame (as.data.frame) then it will accomplish what you are asking for. A simple example:
> tmp <- matrix( 1:12, 3 )
> dimnames(tmp) <- list( letters[1:3], LETTERS[4:7] )
> as.data.frame( as.table( tmp ) )
Var1 Var2 Freq
1 a D 1
2 b D 2
3 c D 3
4 a E 4
5 b E 5
6 c E 6
7 a F 7
8 b F 8
9 c F 9
10 a G 10
11 b G 11
12 c G 12
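If the default Var1/Var2/Freq headers bother you, a small variation might help. The sketch below uses the `responseName` argument of the table method of `as.data.frame` for the value column and renames the two id columns by hand; the names i, j and score are taken from the question, and `long` is just an illustrative name.

```r
tmp <- matrix(1:12, 3)
dimnames(tmp) <- list(letters[1:3], LETTERS[4:7])

# responseName sets the name of the value column (default "Freq").
long <- as.data.frame(as.table(tmp), responseName = "score")

# The first two columns hold the row and column ids of the matrix.
names(long)[1:2] <- c("i", "j")
head(long)
```

Note that the id columns come back as factors by default; passing `stringsAsFactors = FALSE` should keep them as character if that matters.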