R: split multiple value/key pairs in data.frame field

I've got a data.frame that contains a field like this:
:6:Description_C
:3:Description_A:2:Description_B:1:Description_C
:2:Description_C:1:Description_B:1:Description_A:1:Description_D:1:Description_E
:3:Description_B:3:Description_A
The number in front, surrounded by colons, is the number of times, out of a total of 6, that the Description is seen in that entry of the data.frame. An entry like :6:Description_X means that all 6 counts go to that description; otherwise the total is split into several counts, listed one after another.
I would like to turn this field into a key/value hash of counts per description, so that I can do a barplot of the total proportions across all counts, but also in a way that lets me plot these proportions in combination with the other factors in the data.frame.
EDIT: after a look at the docs for colsplit, I suspect the answer will be that I need a new column for each description, since I only have about 8 descriptions in total. Still, I haven't figured out how to do it.
How can I do that in R?

I'm not sure what structure you wanted for the "key:value hash", but this will extract the strings and their associated numeric reps:
inp <- readLines(textConnection(
":6:Description_C
:3:Description_A:2:Description_B:1:Description_C
:2:Description_C:1:Description_B:1:Description_A:1:Description_D:1:Description_E
:3:Description_B:3:Description_A")
)
inp2 <- lapply( strsplit(inp, ":"), "[", -1) # drop the leading empty strings; lapply keeps a list even if all rows have the same length
reps <- lapply(inp2, function(x) as.numeric(x[ seq( 1, length(x) , by=2)]))
values <- lapply(inp2, function(x) x[ seq( 2, length(x) , by=2)])
lapply(reps, barplot) # probably needs more work, but this demonstrates feasibility
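If you do want an actual key/value structure, one sketch (the aggregation step is my guess at what the barplot of total proportions needs) is to name each count vector by its descriptions and then sum across rows:
# Name each row's counts by its descriptions
counts <- Map(setNames, reps, values)
# Total counts per description across all rows
flat <- unlist(counts)
totals <- tapply(flat, names(flat), sum)
# Overall proportions (each row contributes 6 counts in total)
barplot(totals / sum(totals), las = 2)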

Related

Count the number of occurrences of a specific ordered sequence in R

I have this vector
data<-c(3,1,1,3,1,1,1,1,2,1,1,3,3,3,1,3,1,1,3,2,1,3,3,3,3)
I need to find the number of times the subsequence 1, then 2, then 3 occurs (in that particular order).
So the expected answer for the above vector is 98 (all possible ways).
Is there an efficient way to do this? My actual problem is a vector with many unique values (not simply 1, 2, 3).
Here is my code that gives the answer:
data<-c(3,1,1,3,1,1,1,1,2,1,1,3,3,3,1,3,1,1,3,2,1,3,3,3,3)
yind <- which(data == 2)
y1 <- yind[1]
y2 <- yind[2]
sum(data[1:y1] < data[y1]) * sum(data[y1:length(data)] > data[y1]) +
  sum(data[1:y2] < data[y2]) * sum(data[y2:length(data)] > data[y2])
But it is not suitable for a vector with many unique values. For example:
set.seed(3)
data2 <- sample(1:5,100,replace = TRUE)
and then count how many times 1, then 2, then 3, then 4, then 5 occurs (all possible ways).
Thank you.
Here is an option using non-equi joins from data.table:
library(data.table)
v <- data2
tofind <- 1L:5L
dat <- data.table(rn=seq_along(v), v)
paths <- dat[v==tofind[1L]][, npaths := as.double(1)]
for (k in tofind[-1L]) {
  paths <- paths[dat[v == k], on = .(rn < rn), allow.cartesian = TRUE,
                 nomatch = 0L, by = .EACHI, .(npaths = sum(npaths))]
}
paths[, sum(npaths)]
Output for your data is 98.
Output for your data2 is 20873.
Explanation:
Picture an n-nomial tree where each layer corresponds to one number in the sequence you are looking for, and each vertex is a position of that number in the data vector. For example, for data = c(1,2,1,2,3), the tree connects each 1 to every later 2, and each 2 to every later 3.
So the code goes through each layer and finds the number of paths going into each vertex on that layer, using a non-equi inner join.
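If it helps to see the layer idea without data.table, here is a minimal base R sketch of the same computation (quadratic per layer, for illustration only):
# npaths[i] = number of subsequences matching the current prefix
# of the target sequence that end at position i
x <- c(1, 2, 1, 2, 3)
want <- 1:3
npaths <- as.numeric(x == want[1])
for (k in want[-1]) {
  npaths <- sapply(seq_along(x), function(i)
    if (x[i] == k) sum(npaths[seq_len(i - 1)]) else 0)
}
sum(npaths)   # 3 for this toy vector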
Here's an approach with expand.grid.
FindComb <- function(vector, variables){
  grid <- do.call(expand.grid, lapply(variables, function(x) which(vector == x)))
  sum(Reduce(`&`, lapply(seq(2, ncol(grid)), function(x) grid[, x - 1] < grid[, x])))
}
FindComb(data,c(1,2,3))
#[1] 98
I expect it will not scale well with longer vectors or more numbers, but it works OK for smaller scales:
set.seed(3)
data2 <- sample(1:9,1000,replace = TRUE)
FindComb(data2,c(8,2,3))
#[1] 220139

How to sum the values of different columns in a dataframe by looping over the variable names

I'm relatively new to R (I used to work in Stata), so sorry if the question is too trivial.
I have a dataframe with variables named sequentially, following this logic:
q12.X.Y
where X assumes the values from 1 to 9, and Y from 1 to 5
I need to add together the values of all the q12.X.Y variables with Y from 1 to 3 (but NOT those ending in 4 or 5).
Ideally I would have written a loop based on the sequential numbers of the variables, namely something like:
df$test <- 0
for(i in 1:9){
  for(j in 1:3){
    df$test <- df$test + df$q12.i.j
  }
}
That obviously does not work.
I also tried the commands rowSums and subset:
df$test <- rowSums(subset(df, select = ...))
However, I find it a bit cumbersome, as the column numbers are not sequential and I do not want to type the names of all the variables.
Any suggestion how to do that?
We can use grep to match the column names:
rowSums(df[grep("q12\\.[1-9]\\.[1-3]", names(df))])
Or, if all 27 column names are present, use an exact match by creating the column names with paste:
rowSums(df[paste0(rep(paste0("q12.", 1:9, "."), each = 3), 1:3)])
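As a quick sanity check on a made-up toy frame (the column names here are hypothetical):
df <- data.frame(q12.1.1 = 1:2, q12.1.4 = 10, q12.2.3 = 3:4, other = 99)
rowSums(df[grep("q12\\.[1-9]\\.[1-3]", names(df))])
# rows sum to 4 and 6; q12.1.4 and other are correctly excluded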

Remove Duplicates, but Keep the Most Complete Iteration

I'm trying to figure out how to remove duplicates based on three variables (id, key, and num). I would like to remove the duplicate with the fewest columns filled. If an equal number are filled, either can be removed.
For example,
Original <- data.frame(id = c(1,2,2,3,3,4,5,5),
                       key = c(1,2,2,3,3,4,5,5),
                       num = c(1,1,1,1,1,1,1,1),
                       v4 = c(1,NA,5,5,NA,5,NA,7),
                       v5 = c(1,NA,5,5,NA,5,NA,7))
The output would be the following:
Finished <- data.frame(id = c(1,2,3,4,5),
                       key = c(1,2,3,4,5),
                       num = c(1,1,1,1,1),
                       v4 = c(1,5,5,5,7),
                       v5 = c(1,5,5,5,7))
My real dataset is bigger, a mix of mostly numerical but also some character variables, and I couldn't determine the best way to go about this. I've previously used a program whose duplicates command had a similar option called check.all.
So far, my thought has been to use grepl to determine where "anything" is present:
Present <- apply(Original, 2, function(x) grepl("[[:alnum:]]", x))
Then, using the resulting data frame, I take rowSums and cbind the result to the original:
CompleteNess <- rowSums(Present)
cbind(Original, CompleteNess)
This is the point where I'm unsure of my next steps... I have a variable which tells me how many columns are filled in each row (CompleteNess); however, I'm unsure of how to handle the duplicates.
Simply put: when id, key, and num are duplicated, keep the row with the highest value of CompleteNess.
If anybody can think of a better way to do this or get me through the last little bit I would greatly appreciate it. Thanks All!
Here is a solution. It is not very pretty but it should work for your application:
# Order by the degree of completeness (least complete first)
Original <- Original[order(CompleteNess), ]
# Starting from the bottom, select the non-duplicated rows
# based on the first 3 columns
Original[!duplicated(Original[, 1:3], fromLast = TRUE), ]
This does rearrange your original data frame, so beware if there is additional processing later on.
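Putting it together with the grepl step from the question, a self-contained run on the example data would look like this (my assembly of the pieces above, working on a copy to avoid the rearranging):
Present <- apply(Original, 2, function(x) grepl("[[:alnum:]]", x))
CompleteNess <- rowSums(Present)
Deduped <- Original[order(CompleteNess), ]
Deduped <- Deduped[!duplicated(Deduped[, 1:3], fromLast = TRUE), ]
Deduped[order(Deduped$id), ]   # matches Finished, up to row names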
You can aggregate your data and select the row with max score:
Original <- data.frame(id = c(1,2,2,3,3,4,5,5),
                       key = c(1,2,2,3,3,4,5,5),
                       num = c(1,1,1,1,1,1,1,1),
                       v4 = c(1,NA,5,5,NA,5,NA,7),
                       v5 = c(1,NA,5,5,NA,5,NA,7))
Present <- apply(Original, 2, function(x) grepl("[[:alnum:]]", x))
#get the score
Original$present <- rowSums(Present)
#create a column to aggregate on
Original$id.key.num <- paste(Original$id, Original$key, Original$num, sep = "-")
library("plyr")
#aggregate here
Final <- ddply(Original, .(id.key.num), summarize,
               Max = max(present))
And if you want to keep the other columns, just do this:
Final <- ddply(Original, .(id.key.num), summarize,
               Max = max(present),
               v4 = v4[which.max(present)],
               v5 = v5[which.max(present)])
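If there are many columns, listing each one gets tedious; a shorter variant of the same idea (my rewording, same plyr machinery) is to return the whole best row per group:
# Keep the entire most-complete row within each group
Final <- ddply(Original, .(id.key.num), function(d) d[which.max(d$present), ])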

How to best reshape a data set in R that has a two-row header?

The data set I'm working with is in Excel. It shows sales of products in both unit and revenue terms for the first 26 weeks of availability.
Each row of data represents a product. Let's say there are 50 of them.
The 2nd header row could basically be reconstructed with rep(c("Units", "Revenue"), 26).
Above each of those ("Units", "Revenue") pairs in the 1st header row is a merged pair of cells taking the sequence "Week 1", "Week 2", ..., "Week 26".
I basically want to convert the dataset from 50 rows to 50*26 = 1300 rows with 4 columns (Product, Week, Units, Sales).
I've seen how to handle two-row headers and how to reshape data with the melt function, but I'm not sure I've seen anything that indicates a best practice for combining the two, especially in cases like this where both header rows contain key information needed to reshape the data.
It is somewhat ambiguous what sort of csv file might result from merged cells, but assuming there are twice as many such cells, you would first read in the first two lines with readLines and split them on ",", then:
gsub( " ", "", paste( rep( row1[row1 > ""], each=2), c("Units","Revenue"), sep="_") )
To any red-hot moderator: yes, I know code-only answers are deprecated, but I think they should be acceptable for answering code- and data-deficient questions.
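To make that concrete, here is a fuller sketch under the same assumption; the file name and the CSV export are hypothetical:
# The first two lines hold the "Week N" and Units/Revenue headers
hdr <- readLines("sales.csv", n = 2)
row1 <- strsplit(hdr[1], ",")[[1]][-1]   # drop the Product cell
# Merged "Week N" cells export as empty strings, so keep the
# non-empty ones and repeat each twice, once per Units/Revenue
cols <- gsub(" ", "", paste(rep(row1[row1 > ""], each = 2),
                            c("Units", "Revenue"), sep = "_"))
dat <- read.csv("sales.csv", skip = 2, header = FALSE,
                col.names = c("Product", cols))
# From here, melt/reshape can produce the 1300-row long format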
I have run into the same problem many times and have used melt from reshape2 in the past. But here is a function that handles multiple header rows as well as multiple label columns:
PivReady <- function(data, label_rows, label_columns){
  c <- nrow(data)
  d <- ncol(data)
  pivRdata <- data.frame(matrix(ncol = (label_columns + label_rows + 1),
                                nrow = ((c - label_rows) * (d - label_columns))))
  # Repeat each row label once per data column
  for(i in 1:label_columns){
    pivRdata[, i] <- rep(data[(label_rows + 1):c, i], each = (d - label_columns))
  }
  # Transpose the header rows and stack one copy per data row
  trowlabels <- t(data[1:label_rows, (label_columns + 1):d])
  pivRdata[, (label_columns + 1):(label_columns + label_rows)] <-
    do.call(rbind, replicate(((c - label_rows) * (d - label_columns)) / (d - label_columns),
                             trowlabels, simplify = FALSE))
  # Flatten the data body row by row to line up with the labels
  datatrans <- t(data[(label_rows + 1):c, (label_columns + 1):d])
  datatrans <- as.vector(datatrans)
  pivRdata[, (label_columns + label_rows + 1)] <- as.data.frame(datatrans)
  # Build the output column names
  names <- data.frame(matrix(ncol = (label_columns + label_rows + 1), nrow = 1))
  names[1, 1:label_columns] <- as.matrix(data[label_rows, 1:label_columns])
  names[1, (label_columns + 1):(label_columns + label_rows)] <- paste("Category", 1:label_rows, sep = "")
  names[1, (label_columns + label_rows + 1)] <- "Value"
  names(pivRdata) <- names
  return(pivRdata)
}
Yes, I know this code is not very beautiful, but if you import your data with header = FALSE and then specify in the above function that the data has, e.g., 2 columns of labels (the left-most columns) and 3 rows of headers, then this works quite nicely.
e.g.
long_data <- PivReady(wide_data, 3, 2)
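For instance, on a toy version of such a sheet imported with header = FALSE (the values are made up; 2 header rows, 1 label column):
toy <- data.frame(V1 = c("", "Product", "A", "B"),
                  V2 = c("Week 1", "Units", 10, 20),
                  V3 = c("Week 1", "Revenue", 100, 200),
                  V4 = c("Week 2", "Units", 12, 24),
                  V5 = c("Week 2", "Revenue", 120, 240),
                  stringsAsFactors = FALSE)
long <- PivReady(toy, 2, 1)
# 8 rows: Product, Category1 (week), Category2 (Units/Revenue), Value
# (Value comes back as character here; convert with as.numeric)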

R array indexing for multi-dimensional arrays

I have a simple array-indexing question for multi-dimensional arrays in R. I am doing a lot of simulations that each give a result as a matrix, where the entries are classified into categories. So, for example, a result looks like:
aresult <- array(sample(1:3, 10, replace = TRUE), dim = c(2, 5),
                 dimnames = list(
                   c("prey1", "prey2"),
                   c("predator1", "predator2", "predator3", "predator4", "predator5")))
Now I want to store the results of my experiments in a 3D array, where the first two dimensions are the same as in aresult and the third dimension holds the number of experiments that fell into each category. So my array of counts should look like:
Counts <- array(0, dim = c(2, 5, 3),
                dimnames = list(
                  c("prey1", "prey2"),
                  c("predator1", "predator2", "predator3", "predator4", "predator5"),
                  c("n1", "n2", "n3")))
After each experiment I want to increment the numbers in the third dimension by 1, using the values in aresult as indexes.
How can I do that without using loops?
This sounds like a typical job for matrix indexing. By subsetting Counts with a three-column matrix, where each row specifies the indices of one element, we are free to extract and increment any elements we like.
# Create a map of all combinations of indices in the first two dimensions
i <- expand.grid(prey=1:2, predator=1:5)
# Add the indices of the third dimension
i <- as.matrix( cbind(i, as.vector(aresult)) )
# Extract and increment
Counts[i] <- Counts[i] + 1
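As a sanity check (assuming Counts started at zero), the matrix-indexed update agrees with an explicit double loop:
Counts2 <- array(0, dim = dim(Counts), dimnames = dimnames(Counts))
for (p in 1:2) for (q in 1:5)
  Counts2[p, q, aresult[p, q]] <- Counts2[p, q, aresult[p, q]] + 1
identical(Counts, Counts2)   # TRUE after the first experiment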
