R: Data Frame Manipulations

I have two data frames:
>df1
type id1 id2 id3 count1 count2 count3
a x1 y1 z1 10 20 0
b x2 y2 z2 20 0 30
c x3 y3 z3 10 10 10
>df2
id prop
x1 10
x2 5
x3 100
y1 0
y2 50
y3 80
z1 10
z2 20
z3 30
The count* columns act as weights. I want to join the tables so that TotalProp is the weighted sum of the prop values, with the counts as weights.
For example, for the first row in df1: TotalProp = 10 (prop for x1) * 10 (count1) + 0 (prop for y1) * 20 (count2) + 10 (prop for z1) * 0 (count3) = 100
Hence my final table looks like this:
>result
type id1 id2 id3 TotalProp
a x1 y1 z1 100
b x2 y2 z2 700
c x3 y3 z3 2100
Any idea how I can do this?
Thanks.

A one-line solution first, followed by a step-by-step explanation.
df1
## type id1 id2 id3 count1 count2 count3
## 1 a x1 y1 z1 10 20 0
## 2 b x2 y2 z2 20 0 30
## 3 c x3 y3 z3 10 10 10
df2
## id prop
## x1 x1 10
## x2 x2 5
## x3 x3 100
## y1 y1 0
## y2 y2 50
## y3 y3 80
## z1 z1 10
## z2 z2 20
## z3 z3 30
rownames(df2) <- df2$id
result <- data.frame(type = df1$type, TotalProp = rowSums(matrix(df2[unlist(df1[, c("id1", "id2", "id3")]), "prop"], nrow = nrow(df1)) * as.matrix(df1[,
c("count1", "count2", "count3")])))
result
## type TotalProp
## 1 a 100
## 2 b 700
## 3 c 2100
Stepwise explanation
First we get all the id values in a vector for which we want to fetch corresponding prop values from df2
Step 1
unlist(df1[, c("id1", "id2", "id3")])
## id11 id12 id13 id21 id22 id23 id31 id32 id33
## "x1" "x2" "x3" "y1" "y2" "y3" "z1" "z2" "z3"
Step 2
We name the rows of df2 with df2$id.
rownames(df2) <- df2$id
Step 3
Then using result from step 1, we get corresponding prop values
df2[unlist(df1[, c("id1", "id2", "id3")]), "prop"]
## [1] 10 5 100 0 50 80 10 20 30
Step 4
Convert the vector from step 3 back to 2 dimensional form
matrix(df2[unlist(df1[, c("id1", "id2", "id3")]), "prop"], nrow = nrow(df1))
## [,1] [,2] [,3]
## [1,] 10 0 10
## [2,] 5 50 20
## [3,] 100 80 30
Step 5
Multiply result of Step 4 with counts from df1
as.matrix(df1[, c("count1", "count2", "count3")])
## count1 count2 count3
## [1,] 10 20 0
## [2,] 20 0 30
## [3,] 10 10 10
matrix(df2[unlist(df1[, c("id1", "id2", "id3")]), "prop"], nrow = nrow(df1)) *
as.matrix(df1[, c("count1", "count2", "count3")])
## count1 count2 count3
## [1,] 100 0 0
## [2,] 100 0 600
## [3,] 1000 800 300
Step 6
Apply rowSums to result from step 5 to get desired TotalProp values
rowSums(matrix(df2[unlist(df1[,c('id1','id2','id3')]),'prop'], nrow=nrow(df1)) * as.matrix(df1[,c('count1', 'count2', 'count3')]))
## [1] 100 700 2100
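As a variation on the steps above, here is a minimal sketch that looks up prop with match() instead of row names, so df2 does not have to be modified. The data frames are rebuilt from the question; the names id_cols, count_cols, ids and props are only illustrative, and the id columns are assumed to be character rather than factor.

```r
# Data rebuilt from the question (ids kept as character, not factor)
df1 <- data.frame(
  type = c("a", "b", "c"),
  id1 = c("x1", "x2", "x3"),
  id2 = c("y1", "y2", "y3"),
  id3 = c("z1", "z2", "z3"),
  count1 = c(10, 20, 10),
  count2 = c(20, 0, 10),
  count3 = c(0, 30, 10),
  stringsAsFactors = FALSE
)
df2 <- data.frame(
  id = c("x1", "x2", "x3", "y1", "y2", "y3", "z1", "z2", "z3"),
  prop = c(10, 5, 100, 0, 50, 80, 10, 20, 30),
  stringsAsFactors = FALSE
)

id_cols <- c("id1", "id2", "id3")
count_cols <- c("count1", "count2", "count3")

# match() returns, for every id in df1, the row of df2 holding that id
ids <- as.matrix(df1[, id_cols])
props <- matrix(df2$prop[match(ids, df2$id)], nrow = nrow(df1))

# Weighted sum per row: element-wise product, then row sums
df1$TotalProp <- rowSums(props * as.matrix(df1[, count_cols]))
result <- df1[, c("type", id_cols, "TotalProp")]
result
```

An id that is missing from df2 would give NA in props, and therefore NA in TotalProp for that row, which makes lookup failures visible.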

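My solution relies on the data structure, so it is not universal, but it is short.
m1 <- as.matrix(df1[, tail(names(df1), 3)])
m2 <- matrix(df2$prop, 3)
rowSums(m1 * m2)
[1] 100 700 2100
It does not use the ids at all: it assumes the rows of df2 are ordered as all id1 values, then all id2 values, then all id3 values, each in the same order as the rows of df1. So be careful!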

And another way...
TotalProp <- apply(df1,1,function(x) {
sapply(x[2:4],function(x)df2[df2$id==x,]$prop) %*% as.numeric(x[5:7])
})
result <- cbind(df1[1:4],TotalProp)
%*% is the inner product operator, which sums the element-wise products much as rowSums does, so this is somewhat like @ChinmayPatil's answer. The steps are:
1. For each row in df1, extract the prop values of df2 whose id matches columns 2:4 of df1.
2. Form the inner product of the vector from step 1 with the vector formed from columns 5:7 of df1.
3. Repeat for each row of df1 [apply(df1, 1, ...)].
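To make the inner product step concrete, here is a small sketch using the numbers from the first row of df1 in the question. The variable names are illustrative only.

```r
# prop values for x1, y1, z1 and the matching counts from row "a"
props  <- c(10, 0, 10)
counts <- c(10, 20, 0)

# %*% on two plain vectors returns a 1x1 matrix; drop() turns it into a scalar
inner_product <- drop(props %*% counts)

# The same number as the sum of element-wise products
sum_of_products <- sum(props * counts)

inner_product
sum_of_products
```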

Related

How to fill the matrix with values from other matrix matching rows and columns?

I'm trying to build a new matrix in R using values from another matrix, matching the row and column names while importing the values. This is what I am trying to do:
I have two matrices:
X1 X2 X3 X4
X1 0 9 8 0
X2 1 2 3 5
X4 6 1 2 4
X1 X2 X3 X4
X1 NA NA NA NA
X2 NA NA NA NA
X3 NA NA NA NA
X4 NA NA NA NA
I want to get:
X1 X2 X3 X4
X1 0 9 8 0
X2 1 2 3 5
X3 NA NA NA NA
X4 6 1 2 4
These matrices are just simple examples of my dataset, my real data is more complicated.
Many thanks,
Checking for matching rownames and colnames in both matrices prevents a subscript out of bounds error. See below.
mat2[rownames(mat2) %in% rownames(mat1),
colnames(mat2) %in% colnames(mat1)] <- mat1[rownames(mat1) %in% rownames(mat2),
colnames(mat1) %in% colnames(mat2)]
mat2
# X1 X2 X3 X4
# X1 0 9 8 0
# X2 1 2 3 5
# X3 NA NA NA NA
# X4 6 1 2 4
Data:
mat1 <- read.table(text = ' X1 X2 X3 X4
X1 0 9 8 0
X2 1 2 3 5
X4 6 1 2 4', header = TRUE)
mat1 <- as.matrix(mat1)
mat2 <- matrix(NA, nrow = 4, ncol = 4, dimnames = list(paste0("X", 1:4),
paste0("X", 1:4)))
If I understood your question you can do this:
# Building your matrices
mat1 <- matrix(runif(12), nrow = 3, ncol = 4)
mat2 <- matrix(NA, nrow = 4, ncol = 4)
labs <- paste0("x", 1:4)
colnames(mat1) <- colnames(mat2) <- labs
rownames(mat2) <- labs
rownames(mat1) <- labs[c(1:2, 4)]
#
rows <- sort(unique(c(rownames(mat1), rownames(mat2))))
result <- matrix(NA, nrow = length(rows), ncol = ncol(mat1))
result[match(rownames(mat1), rows), ] <- mat1
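Both answers above index by position. As a sketch of a name-based variant, assuming both matrices carry row and column names, one can index with the shared names directly, so the order of rows and columns in the two matrices no longer matters. The names shared_rows and shared_cols are made up for this example.

```r
# Matrices rebuilt from the question
mat1 <- matrix(c(0, 1, 6, 9, 2, 1, 8, 3, 2, 0, 5, 4), nrow = 3,
               dimnames = list(c("X1", "X2", "X4"), paste0("X", 1:4)))
mat2 <- matrix(NA_real_, nrow = 4, ncol = 4,
               dimnames = list(paste0("X", 1:4), paste0("X", 1:4)))

# Only names present in both matrices are used, which avoids
# "subscript out of bounds" errors
shared_rows <- intersect(rownames(mat1), rownames(mat2))
shared_cols <- intersect(colnames(mat1), colnames(mat2))

mat2[shared_rows, shared_cols] <- mat1[shared_rows, shared_cols]
mat2
```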

Call col name with min value (NA included)

I have a df that includes NA values.
df <- data.frame( X1= c(NA, 1, 4, NA),
X2 = c(34, 75, 1, 4),
X3= c(2,9,3,5))
My ideal outcome looks like:
X1 X2 X3 Min
1 NA 34 2 X3
2 1 75 9 X1
3 4 1 3 X2
4 NA 4 5 X2
I have tried
df$Min <- colnames(df)[apply(df,1,which.min, na.rm=TRUE)]
but this one didn't work
You don't need the na.rm=TRUE when using which.min() – try this instead:
df$Min <- colnames(df)[apply(df,1,which.min)]
Output:
X1 X2 X3 Min
1 NA 34 2 X3
2 1 75 9 X1
3 4 1 3 X2
4 NA 4 5 X2
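A short sketch of why the original attempt failed: which.min() takes a single argument and already skips NA values, so passing na.rm is an error rather than a no-op. The vector x is the first row of the question's data; idx and outcome are illustrative names.

```r
x <- c(NA, 34, 2)

# NA is ignored; the smallest non-missing value sits at position 3
idx <- which.min(x)

# which.min() has no na.rm parameter, so this call raises an error,
# which tryCatch() turns into a plain string here
outcome <- tryCatch(
  which.min(x, na.rm = TRUE),
  error = function(e) "error: unused argument"
)

idx
outcome
```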
Code:
foo <- names(df)
df$Min <- apply(df, 1, function(x) foo[which.min(x)])
df
Output:
X1 X2 X3 Min
1 NA 34 2 X3
2 1 75 9 X1
3 4 1 3 X2
4 NA 4 5 X2
Here's an idea that will likely be faster and does not require any explicit looping. You could replace NA with Inf, take the negative of the data, then find the position of the maximum in each row via max.col().
names(df)[max.col(-replace(df, is.na(df), Inf))]
# [1] "X3" "X1" "X2" "X2"
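One caveat worth checking in ?max.col: ties are broken at random by default, so when two columns share the row minimum the answer can change between runs. A small sketch, assuming the df from the question, that makes the choice deterministic with ties.method = "first" (the names m and min_col are illustrative):

```r
df <- data.frame(X1 = c(NA, 1, 4, NA),
                 X2 = c(34, 75, 1, 4),
                 X3 = c(2, 9, 3, 5))

# Work on a numeric matrix and push NA out of contention
m <- as.matrix(df)
m[is.na(m)] <- Inf

# Column of the row-wise minimum; the first column wins on ties
min_col <- names(df)[max.col(-m, ties.method = "first")]
min_col
```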
Also, not to forget, a data.table solution, given that dt <- as.data.table(df)
dt[ , Min:=names(dt)[match(min(.SD, na.rm=T), .SD)], by=1:nrow(dt)][]
# X1 X2 X3 Min
#1: NA 34 2 X3
#2: 1 75 9 X1
#3: 4 1 3 X2
#4: NA 4 5 X2
Not much simpler than the solutions above, just extending the choices here.

Column Split into columns and rows in R

My Data looks like
df <- data.frame(user_id=c('13','15'),
answer_id = c('{"row[0][0]":"A","row[0][1]":"B","row[0][2]":"C","row[0][3]":"D","row[1][0]":"A1","row[1][1]":"B1","row[1][2]":"C1","row[1][3]":"D1"}', '{"row[0][0]":"W","row[0][1]":"X","row[0][2]":"Y","row[0][3]":"Z","row[1][0]":"W1","row[1][1]":"X1","row[1][2]":"Y1","row[1][3]":"Z1"}'))
Desired data view
user_id answer_id1 answer_id2 answer_id3 answer_id4
13 A B C D
13 A1 B1 C1 D1
15 W X Y Z
15 W1 X1 Y1 Z1
I'm new to R and hope to get a solution soon, as I always do.
This may not be the best solution, but it gets you from your sample input to your desired output using stringr, purrr, and tidyr. See regex101 for an explanation of the regex used in the stringr::str_match_all() call.
df <- data.frame(user_id=c('13','15'),
answer_id = c('{"row[0][0]":"A","row[0][1]":"B","row[0][2]":"C","row[0][3]":"D","row[1][0]":"A1","row[1][1]":"B1","row[1][2]":"C1","row[1][3]":"D1"}', '{"row[0][0]":"W","row[0][1]":"X","row[0][2]":"Y","row[0][3]":"Z","row[1][0]":"W1","row[1][1]":"X1","row[1][2]":"Y1","row[1][3]":"Z1"}'),
stringsAsFactors=F)
#use regex to extract row ids and answers
regex_matches <- stringr::str_match_all(df$answer_id, '\\"row\\[(\\d+)\\]\\[(\\d+)\\]\\":\\"([^\\"]*)\\"')
#add user id to each result
answers_by_user <- purrr::map2(df$user_id, regex_matches, ~cbind(.x, .y[,-1]))
#combine list of matrices and convert to df
answers_df <- data.frame(do.call(rbind, answers_by_user))
#add meaningful names
names(answers_df) <- c("user_id", "row_1", "row_2", "value")
#convert to wide (row_1 stays as a key, so each user keeps one line per row index)
final_df <- tidyr::spread(answers_df, row_2, value)
#remove row column
final_df$row_1 <- NULL
#clean up names
names(final_df) <- c("user_id", "answer_id1", "answer_id2", "answer_id3", "answer_id4")
final_df
#output
user_id answer_id1 answer_id2 answer_id3 answer_id4
1 13 A B C D
2 13 A1 B1 C1 D1
3 15 W X Y Z
4 15 W1 X1 Y1 Z1
Column 2 looks like JSON, so you could do something like this to get it into a form that you can do something with...
library(rjson)
df2 <- lapply(1:nrow(df),function(i)
data.frame(user=df[i,1],
answer=unlist(fromJSON(as.character(df[i,2]))),stringsAsFactors = FALSE))
df2 <- do.call(rbind,df2)
df2[,"r1"] <- gsub(".+\\[(\\d)]\\[(\\d)].*","\\1",rownames(df2))
df2[,"r2"] <- gsub(".+\\[(\\d)]\\[(\\d)].*","\\2",rownames(df2))
df2
user answer r1 r2
row[0][0] 13 A 0 0
row[0][1] 13 B 0 1
row[0][2] 13 C 0 2
row[0][3] 13 D 0 3
row[1][0] 13 A1 1 0
row[1][1] 13 B1 1 1
row[1][2] 13 C1 1 2
row[1][3] 13 D1 1 3
row[0][0]1 15 W 0 0
row[0][1]1 15 X 0 1
row[0][2]1 15 Y 0 2
row[0][3]1 15 Z 0 3
row[1][0]1 15 W1 1 0
row[1][1]1 15 X1 1 1
row[1][2]1 15 Y1 1 2
row[1][3]1 15 Z1 1 3
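To get from a long layout like this to the requested wide one, one option is to place each value into a matrix by its two indices. The sketch below is an alternative under stated assumptions, not the method of either answer: it parses the keys with base R regmatches() and regexec() instead of rjson, uses shortened two-column input strings, and the names pattern, wide_list and final_df are made up for illustration.

```r
# Shortened version of the question's data (two answers per row)
df <- data.frame(
  user_id = c("13", "15"),
  answer_id = c(
    '{"row[0][0]":"A","row[0][1]":"B","row[1][0]":"A1","row[1][1]":"B1"}',
    '{"row[0][0]":"W","row[0][1]":"X","row[1][0]":"W1","row[1][1]":"X1"}'
  ),
  stringsAsFactors = FALSE
)

# Captures: row index, column index, value
pattern <- '"row\\[([0-9]+)\\]\\[([0-9]+)\\]":"([^"]*)"'

wide_list <- lapply(seq_len(nrow(df)), function(i) {
  s <- df$answer_id[i]
  # All key/value pairs in this string
  pairs <- regmatches(s, gregexpr(pattern, s))[[1]]
  # One line per pair: full match plus the three captured groups
  parts <- do.call(rbind, regmatches(pairs, regexec(pattern, pairs)))
  r <- as.integer(parts[, 2]) + 1  # indices in the data are 0-based
  k <- as.integer(parts[, 3]) + 1
  out <- matrix(NA_character_, nrow = max(r), ncol = max(k))
  out[cbind(r, k)] <- parts[, 4]   # place each value at [row, column]
  colnames(out) <- paste0("answer_id", seq_len(ncol(out)))
  data.frame(user_id = df$user_id[i], out, stringsAsFactors = FALSE)
})

final_df <- do.call(rbind, wide_list)
final_df
```

Because every value is placed by its own indices, a missing key simply leaves an NA in that cell instead of shifting the remaining values.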

Adding data by row into an empty matrix and handling missing data

I have an empty matrix with a certain number of columns that I'm trying to fill row by row with the output vectors of a for-loop. However, some of the output vectors are not the same length as the number of columns in my matrix, and I just want to fill up those "empty spaces" with NAs.
For example:
matrix.names <- c("x1", "x2", "x3", "x4", "y1", "y2", "y3", "y4", "z1", "z2", "z3", "z4")
my.matrix <- matrix(ncol = length(matrix.names))
colnames(my.matrix) <- matrix.names
This would be the output from one iteration:
x <- c(1,2)
y <- c(4,2,1,5)
z <- c(1)
Where I would want it in the matrix like this:
x1 x2 x3 x4 y1 y2 y3 y4 z1 z2 z3 z4
[1,] 1 2 NA NA 4 2 1 5 1 NA NA NA
The output from the next iteration would be, for example:
x <- c(1,1,1,1)
y <- c(0,4)
z <- c(4,1,3)
And added as a new row in the matrix:
x1 x2 x3 x4 y1 y2 y3 y4 z1 z2 z3 z4
[1,] 1 2 NA NA 4 2 1 5 1 NA NA NA
[2,] 1 1 1 1 0 4 NA NA 4 1 3 NA
It's not really a concern if I have a 0, it's just where there is no data. Also, the data is saved in such a way that whatever is there is listed in the row first, followed by NAs in empty slots. In other words, I'm not worried if an NA may pop up first.
Also, is such a thing better handled in data frames rather than matrices?
Not the most efficient answer, just an attempt.
Logic: append four NAs to each vector so that it has at least 4 elements (a vector that already has length 4 ends up with 8, see the note below), then keep only the first 4 elements of each when rbinding.
x[length(x)+1:4] <- NA
y[length(y)+1:4] <- NA
z[length(z)+1:4] <- NA
my.matrix <- rbind(my.matrix,c(x[1:4],y[1:4],z[1:4]))
Note: the exception mentioned above looks like this:
> x <- c(1,1,1,1)
> x
[1] 1 1 1 1
> x[length(x)+1:4] <- NA
> x
[1] 1 1 1 1 NA NA NA NA # therefore I extract only the first four
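A related sketch of the same padding idea: assigning to length() pads a shorter vector with NA, so no trimming is needed afterwards. The helper pad_to and the list rows are hypothetical names; the rows are collected first and bound once, which also avoids the all-NA first row that matrix(ncol = ...) starts with.

```r
matrix.names <- c(paste0("x", 1:4), paste0("y", 1:4), paste0("z", 1:4))

# Setting length() to a larger value fills the new slots with NA
pad_to <- function(v, n = 4) {
  length(v) <- n
  v
}

# One padded line per loop iteration, using the question's two examples
rows <- list(
  c(pad_to(c(1, 2)), pad_to(c(4, 2, 1, 5)), pad_to(1)),
  c(pad_to(c(1, 1, 1, 1)), pad_to(c(0, 4)), pad_to(c(4, 1, 3)))
)

# Bind once at the end instead of growing the matrix inside the loop
my.matrix <- do.call(rbind, rows)
colnames(my.matrix) <- matrix.names
my.matrix
```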
Here is an option to do this programmatically
d1 <- stack(mget(c("x", "y", "z")))[2:1]
nm <- with(d1, paste0(ind, ave(seq_along(ind),ind, FUN = seq_along)))
my.matrix[,match(nm,colnames(my.matrix), nomatch = 0)] <- d1$values
my.matrix
# x1 x2 x3 x4 y1 y2 y3 y4 z1 z2 z3 z4
#[1,] 1 2 NA NA 4 2 1 5 1 NA NA NA
Or another option is stri_list2matrix from stringi
library(stringi)
m1 <- as.numeric(stri_list2matrix(list(x,y, z)))
Change the 'x', 'y', 'z' values
m2 <- as.numeric(stri_list2matrix(list(x,y, z)))
rbind(m1, m2)

How do I convert table formats in R

Specifically,
I used the following set up:
newdata <- tapply(mydata(#), list(mydata(X), mydata(Y)), sum)
I currently have a table that is laid out as follows:
X= State, Y= County within State, #= a numerical total of something
__ Y1 Y2 Y3 Yn
X1 ## ## ## ##
X2 ## ## ## ##
X3 ## ## ## ##
Xn ## ## ## ##
What I need is a table listed as follows:
X1 Y1 ##
X1 Y2 ##
X1 Y3 ##
X1 Yn ##
X2 Y1 ##
X2 Y2 ##
X2 Y3 ##
X2 Yn ##
Xn Y1 ##
Xn Y2 ##
Xn Y3 ##
Xn Yn ##
library(reshape2)
new_data <- melt(old_data, id.vars=1)
Look into ?melt for more details on syntax.
example:
> df <- data.frame(x=1:5, y1=rnorm(5), y2=rnorm(5))
> df
x y1 y2
1 1 -1.3417817 -1.1777317
2 2 -0.4014688 1.4653270
3 3 0.4050132 1.5547598
4 4 0.1622901 -1.2976084
5 5 -0.7207541 -0.1203277
> melt(df, id.vars=1)
x variable value
1 1 y1 -1.3417817
2 2 y1 -0.4014688
3 3 y1 0.4050132
4 4 y1 0.1622901
5 5 y1 -0.7207541
6 1 y2 -1.1777317
7 2 y2 1.4653270
8 3 y2 1.5547598
9 4 y2 -1.2976084
10 5 y2 -0.1203277
Some example data
mydata <- data.frame(num=rnorm(40),
gp1=rep(LETTERS[1:2],2),
gp2=rep(letters[1:2],each=2))
And applying tapply to it:
tmp <- tapply(mydata$num, list(mydata$gp1, mydata$gp2), sum)
The result of tapply is a matrix, but you can treat it like a table and use as.data.frame.table to convert it. This does not rely on any additional packages.
as.data.frame.table(tmp)
The two different data structures look like:
> tmp
a b
A 8.381483 6.373657
B 2.379303 -1.189488
> as.data.frame.table(tmp)
Var1 Var2 Freq
1 A a 8.381483
2 B a 2.379303
3 A b 6.373657
4 B b -1.189488
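For output closer to the layout asked for in the question, here is a sketch with fixed (non-random) numbers. The column names state, county, n and total are assumptions chosen for the example; naming the list passed to tapply() carries those names through to the long table, and responseName replaces the default Freq.

```r
mydata <- data.frame(
  state  = c("X1", "X1", "X1", "X2", "X2"),
  county = c("Y1", "Y1", "Y2", "Y1", "Y2"),
  n      = c(1, 2, 3, 4, 5)
)

# Wide table: states in rows, counties in columns
wide <- tapply(mydata$n, list(state = mydata$state, county = mydata$county), sum)

# Long table: one line per state/county pair
long <- as.data.frame.table(wide, responseName = "total")

# Sort so that all counties of a state are listed together
long <- long[order(long$state, long$county), ]
long
```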
