I have data as below. I want to get a list of distinct values and their counts for the entire matrix. What is an efficient way to do that?
I thought about putting one column below another (concatenating the columns) to create a single column of 9 elements and then running the table command. But I feel there must be a better way to do this. What are my options?
sm <- matrix(c(51,43,22,"a",51,21,".",22,9),ncol=3,byrow=TRUE)
expected output
distinct value: count
51:2
43:1
22:2
a:1
21:1
.:1
9:1
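For reference, the stack-the-columns approach I described would just flatten the matrix first, something like:
table(c(sm))  # c() flattens the matrix into one vector of 9 elements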
The table() command works just fine across a matrix
t<-table(sm)
t
# sm
# . 21 22 43 51 9 a
# 1 1 2 1 2 1 1
If you want to reshape the results, you can do
cat(paste0(names(t), ":", t, collapse="\n"), "\n")
# .:1
# 21:1
# 22:2
# 43:1
# 51:2
# 9:1
# a:1
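If you'd rather work with the counts as a two-column data frame than a printed table, as.data.frame() on a table object gives a value column and a Freq column:
as.data.frame(t)  # columns: sm (the distinct value) and Freq (its count)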
I have a dataframe
names2 <- c('AdagioBarber','AdagioBarber', 'Beethovan','Beethovan')
Value <- c(33,55,21,54)
song.data <- data.frame(names2,Value)
I would like to arrange it according to this character vector
names <- c('Beethovan','Beethovan','AdagioBarber','AdagioBarber')
I am using match() to achieve this
data.frame(song.data[match((names), (song.data$names2)),])
The problem is that match returns only first occurrences:
names2 Value
3 Beethovan 21
3.1 Beethovan 21
1 AdagioBarber 33
1.1 AdagioBarber 33
You can use order, as @zx8754 and @Evan Friedland have pointed out.
> name.order <- c('Beethovan','AdagioBarber')
> song.data$names2 <- factor(song.data$names2, levels= name.order)
> song.data[order(song.data$names2), ]
names2 Value
3 Beethovan 21
4 Beethovan 54
1 AdagioBarber 33
2 AdagioBarber 55
Basically, factor turns the strings into integers and creates a lookup table of what integers correspond to what strings. The levels argument specifies what you want that lookup table to be. Without that argument, it would just go by order of appearance.
So for example:
> as.numeric(factor(letters[1:5]))
[1] 1 2 3 4 5
> as.numeric(factor(letters[1:5], levels=c("d","b","e","a","c")))
[1] 4 2 5 1 3
Note: You'll need to be absolutely sure you get all your (correctly spelled) levels into that name.order vector, otherwise you'll end up with NAs in the output from order.
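For example, a single misspelled level silently becomes NA:
> as.numeric(factor(c("a","b"), levels=c("A","b")))
[1] NA  2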
(sort() would sort the factor itself by its levels, but to reorder the whole data frame by that column you need order.)
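An equivalent approach that skips the factor step entirely is to order by match run in the other direction (the data matched against the ordering vector, so repeats are kept):
> song.data[order(match(song.data$names2, name.order)), ]
This gives the same output as above.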
I have a dataframe containing a set of parts and test results. The parts are tested at three sites (North, Centre and South). Sometimes those parts are re-tested. I eventually want to create some charts that compare the results from the first time a part was tested with the second (or third, etc.) time it was tested, e.g. to look at tester repeatability.
As an example, I've come up with the code below. I've explicitly removed the "Expt" column from the morley data set, as this is the column I'm effectively trying to recreate. The code works; however, it seems that there must be a more elegant way to approach this problem. Any thoughts?
Edit - I realise that the example given was overly simplistic for my actual needs (I was trying to generate a reproducible example as easily as possible).
New example:
part<-as.factor(c("A","A","A","B","B","B","A","A","A","C","C","C"))
site<-as.factor(c("N","C","S","C","N","S","N","C","S","N","S","C"))
result<-c(17,20,25,51,50,49,43,45,47,52,51,56)
data<-data.frame(part,site,result)
data$index<-1
repeat {
  if (!anyDuplicated(data[, c("part", "site", "index")])) break
  data$index <- ifelse(duplicated(data[, 1:2]), data$index + 1, data$index)
}
data
part site result index
1 A N 17 1
2 A C 20 1
3 A S 25 1
4 B C 51 1
5 B N 50 1
6 B S 49 1
7 A N 43 2
8 A C 45 2
9 A S 47 2
10 C N 52 1
11 C S 51 1
12 C C 56 1
Old example:
#Generate a trial data frame from the morley dataset
df<-morley[,c(2,3)]
#Set up an iterative variable
#Create the index column and initialise to 1
df$index<-1
# Loop through the dataframe looking for duplicate pairs of
# Runs and Indices and increment the index if it's a duplicate
repeat {
  if (!anyDuplicated(df[, c(1, 3)])) break
  df$index <- ifelse(duplicated(df[, c(1, 3)]), df$index + 1, df$index)
}
# Check - The below vector should all be true
df$index==morley$Expt
We may use diff and cumsum on the 'Run' column to get the expected output. This method avoids creating a column of 1s (the 'index'), and it assumes that the sequence in 'Run' is ordered as shown in the OP's example.
indx <- cumsum(c(TRUE,diff(df$Run)<0))
identical(indx, morley$Expt)
#[1] TRUE
Or we can use ave
indx2 <- with(df, ave(Run, Run, FUN=seq_along))
identical(indx2, morley$Expt)
#[1] TRUE
Update
Using the new example
with(data, ave(seq_along(part), part, site, FUN=seq_along))
#[1] 1 1 1 1 1 1 2 2 2 1 1 1
Or we can use getanID from library(splitstackshape)
library(splitstackshape)
getanID(data, c('part', 'site'))[]
I think this is a job for make.unique, with some manipulation.
# make.unique() appends ".1", ".2", ... to repeated values; stripping the
# leading number (and dot) leaves the suffix, which counts the repeats so far
index <- 1L + as.integer(sub("\\d+(\\.)?", "", make.unique(as.character(morley$Run))))
index <- ifelse(is.na(index), 1L, index)  # first occurrences have no suffix, so they come back NA
identical(index, morley$Expt)
[1] TRUE
Details of your actual data.frame may matter. However, a couple of options working with your example:
#this works if each group starts with 1:
df$index<-cumsum(df$Run==1)
#this is maybe more general, with data.table
require(data.table)
dt<-as.data.table(df)
dt[,index:=seq_along(Speed),by=Run]
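For completeness, the same per-group counter is a one-liner in dplyr (assuming that package is an option; row_number() numbers the rows within each group):
library(dplyr)
df %>% group_by(Run) %>% mutate(index = row_number())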
I have a table like:
a
n_msi2010 n_msi2011
1 -0.122876 1.818750
2 1.328930 0.931426
3 -0.111653 4.400060
4 1.222900 4.500450
5 3.604160 6.110930
I would like to merge these two columns into one column to obtain (I don't want to keep column names):
a
n_msi2010
1 -0.122876
2 1.328930
3 -0.111653
4 1.222900
5 3.604160
6 1.818750
7 0.931426
8 4.400060
9 4.500450
10 6.110930
When I use simple made-up data like
x <- cbind(c(1, 2, 3), c(4, 5, 6))
colnames(x)<-c("a","b")
c(t(x))
# 1 4 2 5 3 6
c((x))
# 1 2 3 4 5 6
the column merging works fine. Only with the "a" example does it not work: it creates 2 separate vectors. I don't really understand why. Any help? Thanks
It seems like your question is about column versus row order vector creation from a data.frame.
Using t() on a data.frame converts the data.frame to a matrix, and using c() on the matrix removes its dimensions.
With that knowledge, you can try:
# create a vector of values, column by column
c(as.matrix(a)) # you are missing the `as.matrix` in your current approach
# create a vector of values, row by row
c(t(a)) # you already know this works
Other approaches to get the "column by column" result would be:
unlist(a, use.names = FALSE)
stack(a)[, "values"] # add `drop = FALSE` if you want to retain a data.frame
Not an elegant way, but it can combine two or more columns into one.
n_msi2010 <- 1:5
n_msi2011 <- 6:10
a <- data.frame(n_msi2010, n_msi2011)
out <- vector()  # named `out` to avoid masking the vector() function
for (i in 1:dim(a)[2]) {
  out <- append(out, as.vector(a[, i]))
}
out
You may then do as.matrix(out) or data.frame(out).
I have a dataframe. For simplicity, I am leaving out many columns and rows:
Distance Type
1 162 A
2 27182 A
3 212 C
4 89 B
5 11 C
I need to find 6 consecutive rows in the dataframe such that the average distance is 1000, and such that the only types considered are A or B. Just for clarification: one might think to filter out all Type C rows and then proceed, but then rows that were not originally consecutive would become consecutive after filtering, and that's no good.
For example, if I filtered out rows 3 and 5 above, I would be left with 3 rows. And if I had provided more rows, that might produce a faulty result.
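Put differently, the check I'm after is something like this sliding-window sketch (base R; assuming the data frame is called df with columns Distance and Type, at least 6 rows, and reading "average distance is 1000" as a threshold):
w <- 6
hits <- sapply(seq_len(nrow(df) - w + 1), function(i) {
  win <- df[i:(i + w - 1), ]            # the next 6 original (unfiltered) rows
  all(win$Type %in% c("A", "B")) &&     # no Type C inside the window
    mean(win$Distance) >= 1000          # window mean meets the target
})
which(hits)  # starting row numbers of qualifying windows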
Maybe a solution with the data.table library?
For reproducibility, here is a data sample, based on what you wrote.
library(data.table)
# data orig (with row numbers...)
DO<-"Distance Type
1 162 A
2 27182 A
3 212 C
4 89 B
5 11 C
6 1234 A"
# data: blanks replaced with semicolons
DS<-gsub('[[:blank:]]+',';',DO)
# data.frame
DF<-read.table(textConnection(DS),header=T,sep=';',stringsAsFactors = F)
#data.table
DT<-as.data.table(DF)
Then, make a function that increments a counter each time a new run of identical values starts:
# function to set sequential group number
mkGroupRep <- function(x) {
  cnt <- 1L
  grp <- 1L
  lx <- length(x)
  ne <- x[-lx] != x[-1L]  # TRUE where the next element differs
  for (i in seq_along(ne)) {
    if (ne[i]) cnt <- cnt + 1
    grp[i + 1] <- cnt
  }
  grp
}
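As an aside, data.table also ships a run-length id helper, rleid(), which should produce the same grouping in one call:
# same grp column using data.table's built-in rleid()
DT[, grp := rleid(Type)]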
And use the function with data.table's 'multiple assignment by reference':
# update dat : set group number based on sequential type
DT[,grp:=mkGroupRep(Type)]
# calc mean distance and number of items per group
DT[,`:=`(
distMean=mean(Distance),
grpLength=.N
),by=grp]
# filter what you want:
DT[(Type != 'C' & distMean > 100 & grpLength == 2) | grpLength == 3]
Output:
Distance Type grp distMean grpLength
1: 162 A 1 13672 2
2: 27182 A 1 13672 2
I have a large data frame consisting of two columns. I want to calculate the average of the second column's values for each subset of the first column, where the subsets are based on a specified granularity. For example, for the following data frame df, I want to calculate the average of the df$B values for each subset of df$A, with an increment (granularity) of 1 per subset. The results should go in two new columns.
      A  B  |  expected results: newA  newB
0.22096  1  |                       0  1.142857
0.33489  1  |                       1  2.000000
0.33655  1  |                       2  4.000000
0.43953  1
0.64933  2
0.86668  1
0.96932  1
1.09342  2
1.58314  2
1.88481  2
2.07654  4
2.34652  3
2.79777  5
This is a simple example; I'm not sure how to loop over the whole data frame and perform the calculation, i.e. the average of df$B for each subset.
I tried something like the below to subset, but couldn't figure out how to append the results and build the final output:
increment <- 1
mx <- max(df$A)
i <- 0
newDF <- data.frame()
while (i < mx) {
  tmp <- subset(df, A > i & A < (i + increment))
  # stuck here: how to take mean(tmp$B) and append it to newDF?
  i <- i + increment
}
I'm not sure about the logic, but I'm sure there is a short way to do the required calculation. Any thoughts?
I would use findInterval for the subset selection and tapply to calculate the mean. (In your example a simple ceiling of each A value would be sufficient too, but if your increment is different from 1 you need findInterval.)
df <- read.table(textConnection("
A B
0.22096 1
0.33489 1
0.33655 1
0.43953 1
0.64933 2
0.86668 1
0.96932 1
1.09342 2
1.58314 2
1.88481 2
2.07654 4
2.34652 3
2.79777 5"), header=TRUE)
## sort data.frame by column A (needed for findInterval)
df <- df[order(df$A), ]
## define granuality
subsets <- seq(1, max(ceiling(df$A)), by=1) # change the "by" argument for different increments
df$subset <- findInterval(df$A, subsets)
tapply(df$B, df$subset, mean)
# 0 1 2
#1.142857 2.000000 4.000000
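If you prefer the result as a data frame rather than a named vector, aggregate() over the same grouping column gives the equivalent summary:
aggregate(B ~ subset, data=df, FUN=mean)
#  subset        B
#1      0 1.142857
#2      1 2.000000
#3      2 4.000000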