Although there are a lot of questions concerning this topic, I cannot seem to find one that answers mine. Therefore I am directing this question to you guys.
The context:
I've got a data set with a lot of rows (150K+) and 32 columns. The second column is a document number. The document number is not a unique ID, so the data contains multiple rows with the same document number. I'd like to create a list of the document numbers, where each document number holds another list with the rows that share that document number.
For example:
Here is an example of the data (I included a dput output of the example below).
Document Number Col.A Col.B
A random_56681 random_24984
A random_78738 random_23098
A random_48640 random_32375
B random_96243 random_96927
B random_72045 random_52583
C random_19367 random_20441
C random_96778 random_22161
C random_48038 random_95644
C random_62999 random_44561
Now here is what I am looking for. I need a list that contains the 3 documents (A, B, C). Each of these lists needs to contain another list containing the corresponding rows. For example, the main list (let's say my_list) should have 3 lists A, B and C, which should contain 3, 2 and 4 lists respectively.
Hope I was clear enough in asking the question (if not please let me know).
Here you can find the example data:
structure(list(Document_Number = structure(c(1L, 1L, 1L, 2L,
2L, 3L, 3L, 3L, 3L), .Label = c("A", "B", "C"), class = "factor"),
Col.A = structure(c(4L, 7L, 3L, 8L, 6L, 1L, 9L, 2L, 5L), .Label = c("random_19367",
"random_48038", "random_48640", "random_56681", "random_62999",
"random_72045", "random_78738", "random_96243", "random_96778"
), class = "factor"), Col.B = structure(c(4L, 3L, 5L, 9L,
7L, 1L, 2L, 8L, 6L), .Label = c("random_20441", "random_22161",
"random_23098", "random_24984", "random_32375", "random_44561",
"random_52583", "random_95644", "random_96927"), class = "factor")), class = "data.frame", row.names = c(NA,
-9L))
You can use split like:
split(x, x$Document_Number)
#$A
# Document_Number Col.A Col.B
#1 A random_56681 random_24984
#2 A random_78738 random_23098
#3 A random_48640 random_32375
#
#$B
# Document_Number Col.A Col.B
#4 B random_96243 random_96927
#5 B random_72045 random_52583
#
#$C
# Document_Number Col.A Col.B
#6 C random_19367 random_20441
#7 C random_96778 random_22161
#8 C random_48038 random_95644
#9 C random_62999 random_44561
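If each document should itself hold a list with one element per row, as the question describes, the result of split can be split once more by row. This is only a minimal base R sketch: the small data frame is a stand-in for the example data, and by_doc is a name made up for illustration (my_list is borrowed from the question).

```r
# Small stand-in for the example data (values shortened for illustration)
x <- data.frame(
  Document_Number = c("A", "A", "A", "B", "B", "C", "C", "C", "C"),
  Col.A = paste0("a", 1:9),
  Col.B = paste0("b", 1:9)
)

# Outer list: one data frame per document number
by_doc <- split(x, x$Document_Number)

# Inner lists: split each document's data frame into one-row data frames
my_list <- lapply(by_doc, function(d) split(d, seq_len(nrow(d))))

# Number of rows stored under each document
lengths(my_list)
```

With 150K+ rows the nested version creates a very large number of tiny data frames, so the plain split result is probably the more practical structure unless the per-row lists are really needed.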
An option is group_split
library(dplyr)
df1 %>%
group_split(Document_Number)
CustomerID MarkrtungChannel OrderID
1 A 1
2 B 2
3 A 3
4 B 4
5 C 5
1 C 6
1 A 7
2 C 8
3 B 9
3 B 10
Hi, I want to know which combinations of marketing channels are used by how many customers.
How can I calculate this with R?
E.g. The combination of Marketing channels A and C is used by 1 customer (ID 1)
the combination of Marketing channels C and B is also used by 1 customer (ID 2)
And so on...
And here's a tidyverse way.
library(tidyverse)
data.df%>%
group_by(CustomerID)%>%
summarize(combo=paste0(sort(unique(MarkrtungChannel)),collapse=""))%>%
ungroup()%>%
group_by(combo)%>%
summarize(n.users=n())
The last step counts the number of customers using each combo.
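For comparison, the same idea (one sorted combo string per customer, then a count per combo) can be sketched in base R. The data frame below re-types the example from the question, and combo is just an illustrative name.

```r
df <- data.frame(
  CustomerID = c(1, 2, 3, 4, 5, 1, 1, 2, 3, 3),
  MarkrtungChannel = c("A", "B", "A", "B", "C", "C", "A", "C", "B", "B")
)

# One combo string per customer: unique channels, sorted so "AC" and "CA" match
combo <- vapply(
  split(as.character(df$MarkrtungChannel), df$CustomerID),
  function(ch) paste(sort(unique(ch)), collapse = ""),
  character(1)
)

# Number of customers per combination
table(combo)
```

Sorting before pasting is what makes the order of the orders irrelevant, so a customer who used C first and A later still lands in the "AC" group.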
You can do it in multiple ways. Here is a data.table way:
# Here is your data
df<-structure(list(CustomerID = c(1L, 2L, 3L, 4L, 5L, 1L, 1L, 2L,
3L, 3L), MarkrtungChannel = structure(c(1L, 2L, 1L, 2L, 3L, 3L,
1L, 3L, 2L, 2L), .Label = c("A", "B", "C"), class = "factor"),
OrderID = 1:10), .Names = c("CustomerID", "MarkrtungChannel",
"OrderID"), class = "data.frame", row.names = c(NA, -10L))
df[]<-lapply(df[],as.character)
# Here is the combination field
library(data.table)
setDT(df)
df[,Combo:=.(list(unique(MarkrtungChannel))), by=CustomerID]
# Or (to get the combination counts)
df[,list(combo=(list(unique(MarkrtungChannel)))), by=CustomerID][,uniqueN(CustomerID),by=combo]
I'm trying to get the data from column one that matches with column 2 but only on the "B" values. Need to somehow make the true values a list.
Need this to repeat for 50,000 rows. Around 37,000 of them are true.
I'm incredibly new to this so any help would be nice.
Data <- data.frame(
X = sample(1:10),
Y = sample(c("B", "W"), 10, replace = TRUE)
)
count <- 1
if (Data[count, 2] == "B") {
List <- list(Data[count, 1])
count <- count + 1
# I'm not sure what to use to repeat, I just put
# repeat
} else {
count <- count + 1
# repeat
}
End result should be a list() of only column one data.
In this if rows 1-5 had "B" I want the column one numbers from that.
Not sure if I understood correctly what you're looking for, but from the comments I would assume that this might help:
setNames(data.frame(Data[1][Data[2]=="B"]), "selected")
# selected
#1 2
#2 5
#3 7
#4 6
No loop needed.
data
Data <- structure(list(X = c(10L, 4L, 9L, 8L, 3L, 2L, 5L, 1L, 7L, 6L),
Y = structure(c(2L, 2L, 2L, 2L, 2L, 1L, 1L, 2L, 1L, 1L),
.Label = c("B", "W"), class = "factor")),
.Names = c("X", "Y"), row.names = c(NA, -10L),
class = "data.frame")
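Since the question asks for a list() rather than a data frame, the same logical indexing can be wrapped in as.list(). A minimal sketch, assuming the columns are named X and Y as in the data above; selected is an illustrative name.

```r
Data <- data.frame(
  X = c(10L, 4L, 9L, 8L, 3L, 2L, 5L, 1L, 7L, 6L),
  Y = c("W", "W", "W", "W", "W", "B", "B", "W", "B", "B")
)

# Logical indexing keeps the X values whose Y is "B";
# as.list() turns the resulting vector into a list
selected <- as.list(Data$X[Data$Y == "B"])
```

The comparison Data$Y == "B" is evaluated for all rows at once, so this should scale to 50,000 rows without an explicit loop.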
For a sample dataframe:
df <- structure(list(animal.1 = structure(c(1L, 1L, 2L, 2L, 2L, 4L,
4L, 3L, 1L, 1L), .Label = c("cat", "dog", "horse", "rabbit"), class = "factor"),
animal.2 = structure(c(1L, 2L, 2L, 2L, 4L, 4L, 1L, 1L, 3L,
1L), .Label = c("cat", "dog", "hamster", "rabbit"), class = "factor"),
number = c(5L, 3L, 2L, 5L, 1L, 4L, 6L, 7L, 1L, 11L)), .Names = c("animal.1",
"animal.2","number"), class = "data.frame", row.names = c(NA,
-10L))
... I wish to make a new df with 'animal' duplicates all added together. For example multiple rows with the same animal in columns 1 and 2 will be put together. So for example the dataframe above would read:
cat cat 16
dog dog 7
cat dog 3
etc. (those with different animals would be left as they are). Importantly, the sum of 'number' in both dataframes would be the same.
My real df is >400K observations, so anything anyone could recommend that can cope with a large dataset would be great!
Thanks in advance.
One option would be to use data.table. Convert the "data.frame" to a "data.table" (setDT()); if the "animal.1" rows are equal to "animal.2", replace the "number" with the sum of "number" after grouping by the two columns, and finally get the unique rows.
library(data.table)
setDT(df)[as.character(animal.1)==as.character(animal.2),
number:=sum(number) ,.(animal.1, animal.2)]
unique(df)
# animal.1 animal.2 number
#1: cat cat 16
#2: cat dog 3
#3: dog dog 7
#4: dog rabbit 1
#5: rabbit rabbit 4
#6: rabbit cat 6
#7: horse cat 7
#8: cat hamster 1
Or an option with dplyr. The approach is similar to data.table. We group by "animal.1", "animal.2", then replace the "number" with the sum only when "animal.1" is equal to "animal.2", and get the unique rows.
library(dplyr)
df %>%
group_by(animal.1, animal.2) %>%
mutate(number=replace(number,as.character(animal.1)==
as.character(animal.2),
sum(number))) %>%
unique()
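A base R variant of the grouping idea can be sketched with aggregate(). Note that this is not the same technique as the two answers above: it sums 'number' for every identical (animal.1, animal.2) pair, not only for rows where both animals match. On the sample data the rows come out the same (in a different order), because no mixed pair is repeated. The data frame below re-types the sample, and res is an illustrative name.

```r
df <- data.frame(
  animal.1 = c("cat", "cat", "dog", "dog", "dog", "rabbit", "rabbit",
               "horse", "cat", "cat"),
  animal.2 = c("cat", "dog", "dog", "dog", "rabbit", "rabbit", "cat",
               "cat", "hamster", "cat"),
  number = c(5L, 3L, 2L, 5L, 1L, 4L, 6L, 7L, 1L, 11L)
)

# Sum 'number' within each (animal.1, animal.2) combination that occurs
res <- aggregate(number ~ animal.1 + animal.2, data = df, FUN = sum)

# The total is preserved, as the question requires
sum(res$number) == sum(df$number)
```

Because every group is summed rather than deduplicated, the total of 'number' stays the same even if a mixed pair such as cat/dog appears twice with the same value, a case where calling unique() on the rows would drop one of them.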
I need to make a migration (or conversion) path map in R.
Here is an example of my data.frame:
ID order state
1 1 a
1 2 b
1 3 b
2 1 b
2 2 b
2 3 c
3 1 b
3 2 c
4 1 a
4 2 b
5 1 c
In this data.frame, ID 1 moved a -> b -> b according to the order.
Likewise, ID 2 moved b -> b -> c, ID 3 moved b -> c, and ID 4 moved a -> b. ID 5 did not move.
At the aggregate level, we can make a migration (or conversion) path map like the one below.
In this map, the arrows carry the frequency of each path, and the circles carry the frequency of each state.
How can I make this path map in R? Are there any packages for this?
Here's a possibility using the diagram package. Most of the work here is just reshaping the data into a nice format. There may be more efficient ways, but this at least seems to work. First, your data. I also want to make sure that we treat the order column as a factor rather than a numeric value.
#sample input data
dd<-structure(list(ID = c(1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L, 4L, 4L,
5L), order = c(1L, 2L, 3L, 1L, 2L, 3L, 1L, 2L, 1L, 2L, 1L),
state = structure(c(1L, 2L, 2L, 2L, 2L, 3L, 2L, 3L, 1L, 2L, 3L),
.Label = c("a", "b", "c"), class = "factor")),
.Names = c("ID", "order", "state"),
class = "data.frame", row.names = c(NA, -11L))
dd$order<-factor(dd$order)
Now we begin the transformation. We need to create an adjacency matrix between all the state/order positions:
ss <- interaction(dd$state, dd$order)
Embed <- function(x) if(length(x)>1) embed(x,2) else numeric(0)
adj <- do.call(rbind, lapply(split(as.numeric(ss), dd$ID), Embed))
tf <- function(x) factor(levels(ss)[x], levels=levels(ss))
tt <- table(tf(adj[,1]), tf(adj[,2]))
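To see what the counting step amounts to, here is a simplified sketch that tallies transitions between consecutive states per ID. It is not the code above: it ignores the order position (so a -> b at step 1 and at step 2 would be pooled), it assumes the rows are already sorted by order within each ID, and the names pairs_per_id, trans and counts are made up for illustration.

```r
dd <- data.frame(
  ID = c(1, 1, 1, 2, 2, 2, 3, 3, 4, 4, 5),
  state = c("a", "b", "b", "b", "b", "c", "b", "c", "a", "b", "c")
)

# For each ID, pair every state with the one that follows it
pairs_per_id <- lapply(split(as.character(dd$state), dd$ID), function(s) {
  if (length(s) < 2) return(NULL)  # a single observation has no transition
  data.frame(from = head(s, -1), to = tail(s, -1))
})

# Stack the pairs and count how often each from -> to move occurs
trans <- do.call(rbind, Filter(Negate(is.null), pairs_per_id))
counts <- table(trans$from, trans$to)
counts
```

embed(x, 2) in the answer produces the same consecutive pairs, only with the later element in the first column, which is why the table there is built with adj[,1] as rows and adj[,2] as columns.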
Then we re-name the rows of the matrix (because that's what is used as labels on the plot)
rownames(tt) <- paste(levels(dd$state), table(dd$state, dd$order), sep="/")
And now we focus on the layout. We assign positions to each circle, then plot the diagram with the transitions, and finally add the text at the top.
xpos<-cbind(rep(1:nlevels(dd$order), each=nlevels(dd$state)),
rev(rep(1:nlevels(dd$state), nlevels(dd$order))))
xpos<-(xpos-1)/2*.7+.15
plotmat(tt, pos=xpos)
text(paste("order", levels(dd$order)), x=unique(xpos[,1]), y=1, xpd=NA)
The final result is
I tried to make it as robust as possible to different numbers of states/orders but I didn't fully test it. So be sure to double check the results with your real data.
Either it's late, or I've found a bug, or cast doesn't like colnames with "." in them. This all happens inside a function, but it "doesn't work" outside of a function as much as it doesn't work inside of it.
x <- structure(list(df.q6 = structure(c(1L, 1L, 1L, 11L, 11L, 9L,
4L, 11L, 1L, 1L, 2L, 2L, 11L, 5L, 4L, 9L, 4L, 4L, 1L, 9L, 4L,
10L, 1L, 11L, 9L), .Label = c("a", "b", "c", "d", "e", "f", "g",
"h", "i", "j", "k"), class = "factor"), df.s5 = structure(c(4L,
4L, 1L, 2L, 4L, 4L, 4L, 3L, 4L, 1L, 2L, 1L, 2L, 4L, 1L, 3L, 4L,
2L, 2L, 4L, 4L, 4L, 2L, 2L, 1L), .Label = c("a", "b", "c", "d",
"e"), class = "factor")), .Names = c("df.q6", "df.s5"), row.names = c(NA,
25L), class = "data.frame")
cast(x, df.q6 + df.s5 ~., length)
No worky.
However, if:
colnames(x) <- c("variable", "value")
cast(x, variable + value ~., length)
Works like a charm.
For me I use a similar solution to what Spacedman points out.
#take your data.frame x with its two columns
#add a column
x$value <- 1
#apply your cast verbatim
cast(x, df.q6 + df.s5 ~., length)
df.q6 df.s5 (all)
1 a a 2
2 a b 2
3 a d 3
4 b a 1
5 b b 1
6 d a 1
7 d b 1
8 d d 3
9 e d 1
10 i a 1
11 i c 1
12 i d 2
13 j d 1
14 k b 3
15 k c 1
16 k d 1
Hopefully that helps!
Jay
Nothing to do with the dots in the colnames (easily shown!).
If your dataframe doesn't have a column called 'value' then cast() guesses what column is the value - in this case it guesses 'df.s5' as it is the last column. This is what you get when you melt() data. It then renames that column to 'value' before calling reshape1. Now the column 'df.s5' is no more, yet it's there on the left of your formula. Uh oh.
You are using the value in the formula, which is an odd thing to do. None of the cast examples do that. What are you trying to do here?
You could add an ad-hoc column as a dummy value:
> cast(cbind(x,1), df.q6+s5~., length)
Using 1 as value column. Use the value argument to cast to override this choice
df.q6 s5 (all)
1 a a 2
2 a b 2
3 a d 3
4 b a 1
5 b b 1
[etc]
But I suspect there's a better way to get the number of repeated observations (rows) in a data frame - which is your real question!
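One way to count repeated rows without reshape at all is base R's aggregate() with a helper column of ones. This is only a sketch on a small made-up data frame with the same column names as the question; the helper column n and the name counts are assumptions for illustration.

```r
# Small made-up data with the same column names as in the question
x <- data.frame(
  df.q6 = c("a", "a", "b", "a", "b"),
  df.s5 = c("d", "d", "a", "b", "a")
)

# Add a column of ones and sum it within each (df.q6, df.s5) combination
counts <- aggregate(n ~ df.q6 + df.s5, data = transform(x, n = 1L), FUN = sum)
counts
```

Only combinations that actually occur show up in the result, and the counts add up to the number of rows in x.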
If you are looking for an easy solution, dcast in the reshape2 package can help you:
library(reshape2)
dcast(x, df.q6 + df.s5 ~., length)