I have a matrix and want to calculate the product of the values in each column, by group. First, each group corresponds to a different user number. Second, each column of the matrix should be computed separately.
For example, for user 1 and column y, the result should be 1*1 = 1; for user 2 and column y, it should be 2*3*1 = 6.
I used apply with group_by but the output is incorrect.
#data
user<-c(1,1,2,2,2)
y<-c(1,1,2,3,1)
z<-c(2,2,3,3,3)
dt<-data.frame(user,y,z)
user y z
1 1 1 2
2 1 1 2
3 2 2 3
4 2 3 3
5 2 1 3
The output below is not by user: group_by() only attaches grouping metadata to the data frame, and apply() ignores it, so the products are taken over all rows at once.
dt %>% group_by(user) %>% (function(x){apply(x[,2:3], 2, prod)})
  y   z
  6 108
Desired output (the NA rows can be dropped; it does not matter whether the result is added as a new column or kept separate):
[,1] [,2]
[1,] NA NA
[2,] 1 4
[3,] NA NA
[4,] NA NA
[5,] 6 27
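One way to get the grouped products is a sketch using dplyr (already loaded by the attempt above): summarise() honours the grouping that apply() ignores, collapsing each user to a single row of column products. This produces the desired values without the NA padding.

```r
library(dplyr)

user <- c(1, 1, 2, 2, 2)
y <- c(1, 1, 2, 3, 1)
z <- c(2, 2, 3, 3, 3)
dt <- data.frame(user, y, z)

# One row per user, each column collapsed to its product
out <- dt %>%
  group_by(user) %>%
  summarise(y = prod(y), z = prod(z))
out
# user 1: y = 1, z = 4; user 2: y = 6, z = 27
```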
I have a lot of (inefficient) ideas about how to process this data, but it was suggested on another question that I ask this directly as its own question.
Basically, I have a lot of data that was taken by multiple users, with an ID number for each sample and two weight variables (pre- and post-processing). Because the data was not processed sequentially by ID, and the pre- and post-processing measurements were collected at very different times, it would be difficult (and probably more error-prone) for the user to locate the ID and pre column when entering the post value.
So instead the dataframe looks something like this:
#example data creation
id = c(rep(1:4,2),5:8)
pre = c(rep(10,4),rep(NA,4),rep(100,4))
post = c(rep(NA,4),rep(10,4),rep(100,4))
df = cbind(id,pre,post)
print(df)
id pre post
[1,] 1 10 NA
[2,] 2 10 NA
[3,] 3 10 NA
[4,] 4 10 NA
[5,] 1 NA 10
[6,] 2 NA 10
[7,] 3 NA 10
[8,] 4 NA 10
[9,] 5 100 100
[10,] 6 100 100
[11,] 7 100 100
[12,] 8 100 100
I asked another question on how to merge the data, so I feel alright about that. What I want to know is the best way to sweep the dataframe for user error before I merge the columns.
Specifically, I am looking for cases where one ID has a pre but not a post (or vice versa), or where there is a double entry for any of the values. Ideally I would like to dump all of the strange IDs (duplicates, missing entries) into a new dataframe, so I can then go and investigate what the problem is.
For example, if my data frame has this:
id pre post
1 1 10 NA
2 1 10 NA
3 2 10 NA
4 3 10 NA
6 2 NA 10
7 3 NA 10
8 4 NA 10
9 5 100 100
10 6 100 100
How do I get it to recognize that id #1 has been entered twice, and that ids 1 and 4 are missing a post and pre entry? All I need it to do is detect those anomalies and spit them out into a dataframe!
Thanks!
I am not sure the following is all the question asks for.
na <- rowSums(is.na(df[, -1])) > 0
df[duplicated(df[,1]) | na, ]
# id pre post
#[1,] 1 10 NA
#[2,] 2 10 NA
#[3,] 3 10 NA
#[4,] 4 10 NA
#[5,] 1 NA 10
#[6,] 2 NA 10
#[7,] 3 NA 10
#[8,] 4 NA 10
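Because every un-merged row in this layout has an NA in either pre or post, the rowSums test above flags all of them. A narrower sketch (under the assumption that a clean ID has exactly one non-NA pre and one non-NA post) counts entries per ID and reports only the IDs that break that rule, using the example table from the question:

```r
# Example data from the question: id 1 is entered twice (pre), id 4 has no pre
df <- data.frame(id   = c(1, 1, 2, 3, 2, 3, 4, 5, 6),
                 pre  = c(10, 10, 10, 10, NA, NA, NA, 100, 100),
                 post = c(NA, NA, NA, NA, 10, 10, 10, 100, 100))

# Count non-NA pre and post entries per id (na.pass keeps the NA rows)
counts <- aggregate(cbind(pre, post) ~ id, data = df,
                    FUN = function(x) sum(!is.na(x)), na.action = na.pass)

# An id is anomalous unless it has exactly one pre and one post
bad_ids <- counts$id[counts$pre != 1 | counts$post != 1]
df[df$id %in% bad_ids, ]   # rows for ids 1 and 4
```

After investigating, the remaining clean IDs can be merged as planned.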
I need to do a 2-step analysis of the following data:
1 5 1 -2
2 6 3 4
1 5 4 -3
NA NA NA NA
2 5 4 -4
Step 1. Remove all NA rows (these are always whole rows, not cells)
Step 2. Sort rows by the values of 4th column in descending order
The result should be the following:
2 6 3 4
1 5 1 -2
1 5 4 -3
2 5 4 -4
How can I do this processing efficiently, considering that the data set might be large (e.g. 100,000 rows)?
Another way to do this would be to first remove all NA values and then order the matrix.
# make a matrix
my_mat <- matrix(c(1,2,1,1,2,5,6,5,2,5,1,3,4,2,4,-2,4,-3,2,-4),
nrow = 5, ncol = 4)
# add some NA values
my_mat[4,] <- NA
[,1] [,2] [,3] [,4]
[1,] 1 5 1 -2
[2,] 2 6 3 4
[3,] 1 5 4 -3
[4,] NA NA NA NA
[5,] 2 5 4 -4
# remove incomplete rows (per the question, NAs always occupy an entire row)
my_mat <- my_mat[complete.cases(my_mat),]
# order by the 4th column
my_mat[order(my_mat[,4], decreasing = TRUE),]
[,1] [,2] [,3] [,4]
[1,] 2 6 3 4
[2,] 1 5 1 -2
[3,] 1 5 4 -3
[4,] 2 5 4 -4
I have 2 data frames, each consisting of rows of coordinates (x, y, z). These data frames are of different lengths.
I would like to use one data frame as a reference and search the other for any coordinates that match in all 3 positions.
I would then like these matching coordinates to be written to another data frame.
i.e.
data frame one:
[1,] 1 2 3
[2,] 2 3 3
[3,] 1 2 4
[4,] 4 2 5
data frame two:
[1,] 3 2 3
[2,] 1 1 2
[3,] 2 3 3
[4,] 1 2 3
and I would like this to return
[1,] 2 3 3
[2,] 1 2 3
i.e. the rows that match. I want it to check not just rows at the same position, but all rows in the data frame.
You can use intersect from dplyr
library(dplyr)
intersect(as.data.frame(m1) , as.data.frame(m2))
# V1 V2 V3
#1 2 3 3
#2 1 2 3
Or you can use
mNew <- rbind(m1, m2)
mNew[duplicated(mNew),]
#     [,1] [,2] [,3]
#[1,]    2    3    3
#[2,]    1    2    3
Note that the duplicated() approach assumes neither matrix contains duplicate rows of its own; any internal duplicate would also be flagged.
data
m1 <- matrix(c(1,2,1,4, 2,3,2,4, 3,3,4,5), ncol=3)
m2 <- matrix(c(3,1,2,1,2,1,3,2,3,2,3,3), ncol=3)
I have a sampling problem where I'm looking to search through columns of data for the highest ranked matches (which correspond to another set of data). Some matches are already ruled out:
- 9, because it was the first selection (col 1 values)
- 2, because it was the highest ranked in set 9 (col 2 values)
- 7, because it was the highest ranked between cols 1 & 2 (col 3 values)
For example, the next highest ranked match across columns 1, 2, 3 should be 1, because it has rank [2,7,2] across the three columns, which is the lowest sum. This new result would be appended to the 10x3 matrix to make it 10x4, and a fifth match would then be sought.
[,1] [,2] [,3]
[1,] 2 9 9
[2,] 1 10 1
[3,] 4 6 3
[4,] 5 8 2
[5,] 7 7 4
[6,] 8 3 6
[7,] 10 1 10
[8,] 3 5 5
[9,] 6 4 8
[10,] 9 2 7
The goal is to do this for much larger data sets (say 500 rows) and creating a subset of about 20 values, so something besides visual inspection is needed! The list of numbers in each column is the same length and includes the same values, just in a unique order for each column.
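The selection rule described above can be sketched directly. This assumes the intended reading of the matrix: each entry m[r, j] is the item ranked r-th in column j, so an item's rank in a column is its row position there, and `excluded` holds the three matches already ruled out.

```r
# Columns are rank-ordered lists: m[r, j] is the item ranked r-th in column j
m <- cbind(c(2, 1, 4, 5, 7, 8, 10, 3, 6, 9),
           c(9, 10, 6, 8, 7, 3, 1, 5, 4, 2),
           c(9, 1, 3, 2, 4, 6, 10, 5, 8, 7))
excluded <- c(9, 2, 7)   # matches already selected / ruled out

# ranks[v, j] = rank of item v in column j (its row position in that column)
ranks <- apply(m, 2, function(col) match(seq_len(nrow(m)), col))
total <- rowSums(ranks)
total[excluded] <- Inf   # never pick an excluded item

which.min(total)   # 1 -- item 1 has ranks c(2, 7, 2), the lowest sum
```

For the larger problem, the chosen item can be appended to `excluded` and the search repeated until the subset reaches the desired size (e.g. 20 items from 500 rows).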
I have a data set like this
head(data)
V1 V2
[1,] NA NA
[2,] NA NA
[3,] NA NA
[4,] 5 2
[5,] 5 2
[6,] 5 2
where
unique(data$V1)
[1] NA 5 4 3 2 1 0 6 7 9 8
unique(data$V2)
[1] NA 2 6 1 5 3 7 4 0 8 9
What I would like to do is a plot similar to
plot(data$V1, data$V2)
but as a grid instead of points, with a colour indicating how many matches there are for each (V1, V2) combination.
Can anyone help me?
It looks like this may be what you want - first you tabulate using table(), then plot a heatmap of the table using heatmap():
set.seed(1)
data <- data.frame(V1=sample(1:10,100,replace=TRUE),V2=sample(1:10,100,replace=TRUE))
foo <- table(data)
heatmap(foo,Rowv=NA,Colv=NA)