I'm moving to R from Mathematica, where I don't need to anticipate data structures at import time. In particular, I don't need to know in advance whether my data is rectangular.
I have many .csv files formatted as follows:
tasty,chicken,cinnamon
not_tasty,butter,pepper,onion,cardamom,cayenne
tasty,olive_oil,pepper
okay,olive_oil,onion,potato,black_pepper
not_tasty,tomato,fenugreek,pepper,onion,potato
tasty,butter,cheese,wheat,ham
Rows have differing lengths and will only contain strings.
In R, how should I approach this problem?
What Have You Tried?
I've tried with read.table:
dataImport <- read.table("data.csv", header = FALSE)
class(dataImport)
##[1] "data.frame"
dim(dataImport)
##[1] 6 1
dataImport[1]
##[1] tasty,chicken,cinnamon
##6 Levels: ...
From the documentation, I interpret this as a single column with each list of ingredients as a distinct row. I can extract the first three rows as follows; each row is of class factor but appears to contain more data than I expect:
dataImport[c(1,2,3),1]
## my rows
rowOne <- dataImport[c(1),1];
class(rowOne)
## "factor"
rowOne
## [1] tasty,chicken,cinnamon
## 6 Levels: not_tasty,butter,cheese [...]
This is as far as I've pursued this problem for now. I would appreciate advice on the suitability of read.table for this data structure.
My goal is to group the data by the first element of each row, and analyse the difference between each type of recipe. In case it helps influence data structure advice, in Mathematica I would do the following:
dataImport=Import["data.csv"];
tasty = Cases[dataImport, {"tasty", ingr__} :> {ingr}]
Answer Discussion
@G.Grothendieck has provided a solution using read.table with subsequent processing via the reshape2 package. This seems tremendously useful and I'll investigate it later. The general advice here solved my issue, hence the accept.
@MrFlick's suggestion of using the tm package was useful for later analysis using DataframeSource.
read.table
Try read.table with fill = TRUE:
d1 <- read.table("data.csv", sep = ",", as.is = TRUE, fill = TRUE)
giving:
> d1
V1 V2 V3 V4 V5 V6
1 tasty chicken cinnamon
2 not_tasty butter pepper onion cardamom cayenne
3 tasty olive_oil pepper
4 okay olive_oil onion potato black_pepper
5 not_tasty tomato fenugreek pepper onion potato
6 tasty butter cheese wheat ham
read.table with NAs
or to fill the empty cells with NA values add na.strings = "" :
d2 <- read.table("data.csv", sep = ",", as.is = TRUE, fill = TRUE, na.strings = "")
giving:
> d2
V1 V2 V3 V4 V5 V6
1 tasty chicken cinnamon <NA> <NA> <NA>
2 not_tasty butter pepper onion cardamom cayenne
3 tasty olive_oil pepper <NA> <NA> <NA>
4 okay olive_oil onion potato black_pepper <NA>
5 not_tasty tomato fenugreek pepper onion potato
6 tasty butter cheese wheat ham <NA>
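One caveat with fill = TRUE: as far as I know, read.table decides how many columns to create from the first five lines of input unless col.names says otherwise, so a longer row further down the file may not fit. The following is a minimal sketch, assuming made-up sample lines held in a character vector, that counts the fields first with count.fields and passes explicit column names:

```r
# Made-up sample: the widest row is the sixth one.
txt <- c("tasty,chicken",
         "tasty,olive_oil",
         "okay,onion",
         "tasty,ham",
         "okay,potato",
         "not_tasty,butter,pepper,onion")

# Count the fields on every line to find the widest row.
tc <- textConnection(txt)
n_cols <- max(count.fields(tc, sep = ","))
close(tc)

# Supplying col.names makes read.table allocate enough columns up front.
d <- read.table(text = txt, sep = ",", as.is = TRUE, fill = TRUE,
                na.strings = "", col.names = paste0("V", seq_len(n_cols)))
d
```

For a real file, count.fields("data.csv", sep = ",") would take the place of the text connection.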
long form
If you want it in long form:
library(reshape2)
# d2 has no id column yet, so add a row id before melting
long <- na.omit(melt(transform(d2, id = seq_len(nrow(d2))), id.vars = c("id", "V1"))[-3])
long <- long[order(long$id), ]
giving:
> long
id V1 value
1 1 tasty chicken
7 1 tasty cinnamon
2 2 not_tasty butter
8 2 not_tasty pepper
14 2 not_tasty onion
20 2 not_tasty cardamom
26 2 not_tasty cayenne
3 3 tasty olive_oil
9 3 tasty pepper
4 4 okay olive_oil
10 4 okay onion
16 4 okay potato
22 4 okay black_pepper
5 5 not_tasty tomato
11 5 not_tasty fenugreek
17 5 not_tasty pepper
23 5 not_tasty onion
29 5 not_tasty potato
6 6 tasty butter
12 6 tasty cheese
18 6 tasty wheat
24 6 tasty ham
wide form 0/1 binary variables
To represent the variable portion as 0/1 binary variables try this:
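# reshape2 provides dcast (cast belongs to the older reshape package)
wide <- dcast(long, id + V1 ~ value, value.var = "value")
wide[-(1:2)] <- 0 + !is.na(wide[-(1:2)])
giving one row per recipe, with a 0/1 column for each distinct ingredient.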
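A similar 0/1 indicator layout can also be sketched in base R with a cross-tabulation. The small long data frame below is made up for illustration and only mimics the id/V1/value layout used above:

```r
# Hypothetical long-form data in the same id / V1 / value layout.
long <- data.frame(
  id    = c(1, 1, 2, 2, 2),
  V1    = c("tasty", "tasty", "okay", "okay", "okay"),
  value = c("chicken", "cinnamon", "onion", "potato", "chicken"),
  stringsAsFactors = FALSE
)

# Cross-tabulate recipe id against ingredient, then reduce counts to 0/1.
counts <- unclass(table(long$id, long$value))
incidence <- (counts > 0) + 0L
incidence
```

Unlike the dcast version, this produces a matrix with the ids as row names rather than a data frame that also carries V1.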
list in a data frame
A different representation would be the following list in a data frame so that ag$value is a list of character vectors:
ag <- aggregate(value ~., transform(long, value = as.character(value)), c)
ag <- ag[order(ag$id), ]
giving:
> ag
id V1 value
4 1 tasty chicken, cinnamon
1 2 not_tasty butter, pepper, onion, cardamom, cayenne
5 3 tasty olive_oil, pepper
3 4 okay olive_oil, onion, potato, black_pepper
2 5 not_tasty tomato, fenugreek, pepper, onion, potato
6 6 tasty butter, cheese, wheat, ham
> str(ag)
'data.frame': 6 obs. of 3 variables:
$ id : int 1 2 3 4 5 6
$ V1 : chr "tasty" "not_tasty" "tasty" "okay" ...
$ value:List of 6
..$ 15: chr "chicken" "cinnamon"
..$ 1 : chr "butter" "pepper" "onion" "cardamom" ...
..$ 17: chr "olive_oil" "pepper"
..$ 11: chr "olive_oil" "onion" "potato" "black_pepper"
..$ 6 : chr "tomato" "fenugreek" "pepper" "onion" ...
..$ 19: chr "butter" "cheese" "wheat" "ham"
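Once the ingredients sit in a list column, ordinary list tools apply to it. Here is a small sketch on a hand-built data frame (the three rows are made up, and the list column is attached directly rather than via aggregate):

```r
# Hand-built stand-in for ag: one row per recipe, ingredients in a list column.
ag <- data.frame(id = 1:3,
                 V1 = c("tasty", "not_tasty", "tasty"),
                 stringsAsFactors = FALSE)
ag$value <- list(c("chicken", "cinnamon"),
                 c("butter", "pepper", "onion"),
                 c("olive_oil", "pepper"))

# Number of ingredients per recipe.
n_ingredients <- lengths(ag$value)

# Which recipes contain pepper?
has_pepper <- vapply(ag$value, function(x) "pepper" %in% x, logical(1))

# Ingredient frequencies within the "tasty" recipes.
tasty_counts <- table(unlist(ag$value[ag$V1 == "tasty"]))
```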
I don't think shoving your data into a data.frame or data.table is going to help you much, since both of those forms generally assume rectangular data. If you just want a list of character vectors, you can read them in with:
strsplit(readLines("data.csv"), ",")
It all depends on exactly what you are going to do with the data after you read it in. If you plan to use existing functions, what input do they expect?
Sounds like you may be tracking terms in each of these recipes. Maybe the appropriate data structure would be a "corpus" from the tm package for text mining.
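To connect the list-of-character-vectors idea to the grouping goal in the question, here is a minimal base R sketch. The lines are held in a character vector here in place of readLines("data.csv"), and the names labels, ingredients and by_label are just illustrative:

```r
# Stand-in for readLines("data.csv").
lines <- c("tasty,chicken,cinnamon",
           "not_tasty,butter,pepper,onion,cardamom,cayenne",
           "tasty,olive_oil,pepper",
           "okay,olive_oil,onion,potato,black_pepper",
           "not_tasty,tomato,fenugreek,pepper,onion,potato",
           "tasty,butter,cheese,wheat,ham")

recipes <- strsplit(lines, ",", fixed = TRUE)

# The first element of each row is the label, the rest are ingredients.
labels <- vapply(recipes, `[`, character(1), 1)
ingredients <- lapply(recipes, `[`, -1)

# Group the ingredient vectors by label, roughly like Cases in Mathematica.
by_label <- split(ingredients, labels)
by_label$tasty
```

by_label$tasty is then a list of three character vectors, one per tasty recipe.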
Related
I am trying to impute missing values in my dataset by matching against values in another dataset.
This is my data:
df1 %>% head()
<V1> <V2>
1 apple NA
2 cheese NA
3 butter NA
df2 %>% head()
<V1> <V2>
1 apple jacks
2 cheese whiz
3 butter scotch
4 apple turnover
5 cheese sliders
6 butter chicken
7 apple sauce
8 cheese doodles
9 butter milk
This is what I want df1 to look like:
<V1> <V2>
1 apple jacks, turnover, sauce
2 cheese whiz, sliders, doodles
3 butter scotch, chicken, milk
This is my code:
df1$V2[is.na(df1$V2)] <- df2$V2[match(df1$V1,df2$V1)][which(is.na(df1$V2))]
This code runs without error; however, it only pulls the first matching value for each key and ignores the rest.
Another solution, using just base R:
aggregate(df2$V2, list(df2$V1), c, simplify = FALSE)
Group.1 x
1 apple jacks, turnover, sauce
2 butter scotch, chicken, milk
3 cheese whiz, sliders, doodles
I don't think you even need to import df1 in this case; you can do it all based on df2:
df1 <- df2 %>% group_by(`<V1>`) %>% summarise(`<V2>`=paste0(`<V2>`, collapse = ", "))
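If df1 does need to be kept and only its missing entries filled, the same collapse-then-look-up idea can be sketched in base R. This assumes the columns are literally named V1 and V2, and the intermediate name collapsed is made up for the example:

```r
df1 <- data.frame(V1 = c("apple", "cheese", "butter"),
                  V2 = NA_character_,
                  stringsAsFactors = FALSE)
df2 <- data.frame(V1 = rep(c("apple", "cheese", "butter"), 3),
                  V2 = c("jacks", "whiz", "scotch",
                         "turnover", "sliders", "chicken",
                         "sauce", "doodles", "milk"),
                  stringsAsFactors = FALSE)

# Collapse every V2 value per key into one comma-separated string.
collapsed <- tapply(df2$V2, df2$V1, paste, collapse = ", ")

# Fill only the missing entries of df1, looking each key up by name.
missing <- is.na(df1$V2)
df1$V2[missing] <- as.vector(collapsed[df1$V1[missing]])
df1
```

Looking values up by name avoids the first-match-only behaviour of match() on the uncollapsed df2.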
Let's say I have the following dataframe:
my_basket = data.frame(ITEM_GROUP = c("Fruit","Fruit","Fruit","Fruit","Fruit","Vegetable","Vegetable","Vegetable","Vegetable","Dairy","Dairy","Dairy","Dairy","Dairy"),
ITEM_NAME = c("Apple","Banana","Orange","Mango","Papaya","Carrot","Potato","Brinjal","Raddish","Milk","Curd","Cheese","Milk","Paneer"),
Price = c(100,80,80,90,65,70,60,70,25,60,40,35,50,NA),
Tax = c(2,4,5,6,2,3,5,1,3,4,5,6,4,NA))
This then yields:
> my_basket
ITEM_GROUP ITEM_NAME Price Tax
1 Fruit Apple 100 2
2 Fruit Banana 80 4
3 Fruit Orange 80 5
4 Fruit Mango 90 6
5 Fruit Papaya 65 2
6 Vegetable Carrot 70 3
7 Vegetable Potato 60 5
8 Vegetable Brinjal 70 1
9 Vegetable Raddish 25 3
10 Dairy Milk 60 4
11 Dairy Curd 40 5
12 Dairy Cheese 35 6
13 Dairy Milk 50 4
14 Dairy Paneer NA NA
What I now would like to do, is make a list of fruits I want to keep and then filter those, so:
fruitlist = c("Apple", "Banana")
How would I go about using tidyverse to filter the data in my data.frame to only keep the fruits in my fruitlist, but also all my Vegetables and Dairy? Normally I'd do:
my_basket %<>% filter(ITEM_NAME %in% fruitlist)
But then I'd also lose all the vegetables and dairy, which is not what I want. I've been trying to make something work with case_when but can't seem to make it work. There must be something obvious I'm missing here.
EDIT: Seconds after posting my question I finally realised:
my_basket %<>% filter(ITEM_NAME %in% fruitlist | ITEM_GROUP != "Fruit")
That solves it. If I had to filter multiple groups like this, I think piping the filter command repeatedly would work too.
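For several restricted groups at once, the same condition can be written against a lookup list. This is a base R sketch rather than tidyverse code, and keep_lists (including the Dairy entry) is a hypothetical structure introduced only for illustration:

```r
my_basket <- data.frame(
  ITEM_GROUP = c(rep("Fruit", 5), rep("Vegetable", 4), rep("Dairy", 5)),
  ITEM_NAME  = c("Apple", "Banana", "Orange", "Mango", "Papaya",
                 "Carrot", "Potato", "Brinjal", "Raddish",
                 "Milk", "Curd", "Cheese", "Milk", "Paneer"),
  stringsAsFactors = FALSE
)

# Hypothetical: groups listed here are restricted to the named items;
# groups that are not listed are kept in full.
keep_lists <- list(Fruit = c("Apple", "Banana"), Dairy = "Milk")

restricted <- my_basket$ITEM_GROUP %in% names(keep_lists)
allowed <- unname(mapply(function(group, name) name %in% keep_lists[[group]],
                         my_basket$ITEM_GROUP, my_basket$ITEM_NAME))

filtered <- my_basket[!restricted | allowed, ]
filtered
```

The logical vector !restricted | allowed could equally be used inside dplyr's filter().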
You could use grepl with a regex alternation:
fruitlist <- c("Apple", "Banana")
regex <- paste0("^(?:", paste0(fruitlist, collapse="|"), ")$")
my_basket %<>% filter(grepl(regex, ITEM_NAME))
I have a data frame like below.
New_ment1_1 New_ment1_2 New_ment1_3 New_ment1_4
1 application android ios NA
2 donald trump agreement climate united states
3 donald trump agreement paris united states
4 donald trump agreement united states NA
5 donald trump climate emission united states
6 donald trump entertainer host president
7 hen chicken mustard wimp
8 husband pamela private lives NA
9 pan chicken hen wimp
10 sex associate pamela partner
11 united kingdom chicken hen wimp
12 united states agreement paris NA
And I want the result to be a data frame in which similar rows are combined.
For example, row 1 should stay as it is, since it doesn't have any similar rows.
Rows 2, 3, 4, 5 and 12 should be combined into a single row like
united states donald trump paris climate agreement emission
And rows 7, 9 and 11 should be combined as
united kingdom chicken hen wimp mustard
The elements can be in any order.
Assume the data frame DF shown reproducibly in the Note at the end.
Convert that to a character matrix m. Let us say that two rows are similar if they have more than one element in common, and define is_similar to take two row indexes and return TRUE or FALSE accordingly. Then apply that to every pair of rows using outer. Interpret the result as the adjacency matrix of a graph and calculate its connected components, splitting DF into a list spl, each of whose elements is a data frame of the rows from DF that constitute one connected component. Finally, reduce each component to its unique elements (the list L) and rework L into a character matrix.
library(igraph)
m <- as.matrix(DF)
n <- nrow(m)
is_similar <- function(i, j) length(intersect(na.omit(m[i, ]), na.omit(m[j, ]))) > 1
smat <- outer(1:n, 1:n, Vectorize(is_similar))
adj <- graph.adjacency(smat)
cl <- components(adj)$membership
str(split(1:n, cl))
## List of 6
## $ 1: int 1
## $ 2: int [1:5] 2 3 4 5 12
## $ 3: int 6
## $ 4: int [1:3] 7 9 11
## $ 5: int 8
## $ 6: int 10
spl <- split(DF, cl)
L <- lapply(spl, function(x) na.omit(unique(unlist(x))))
t(do.call("cbind", lapply(L, ts)))
giving:
[,1] [,2] [,3] [,4] [,5] [,6]
1 "application" "android" "ios" NA NA NA
2 "donald_trump" "united_states" "agreement" "climate" "paris" "emission"
3 "donald_trump" "entertainer" "host" "president" NA NA
4 "hen" "pan" "united_kingdom" "chicken" "mustard" "wimp"
5 "husband" "pamela" "private_lives" NA NA NA
6 "sex" "associate" "pamela" "partner" NA NA
Note: The input in reproducible form is:
Lines <- "
New_ment1_1 New_ment1_2 New_ment1_3 New_ment1_4
1 application android ios NA
2 donald_trump agreement climate united_states
3 donald_trump agreement paris united_states
4 donald_trump agreement united_states NA
5 donald_trump climate emission united_states
6 donald_trump entertainer host president
7 hen chicken mustard wimp
8 husband pamela private_lives NA
9 pan chicken hen wimp
10 sex associate pamela partner
11 united_kingdom chicken hen wimp
12 united_states agreement paris NA"
DF <- read.table(text = Lines, header = TRUE, as.is = TRUE)
Update: Fixed similarity definition.
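To make the connected-components step less of a black box, here is a minimal sketch without igraph. It swaps in a simple label-propagation loop, which is not what components() does internally, and runs on four made-up rows:

```r
rows <- list(c("hen", "chicken", "mustard", "wimp"),
             c("pan", "chicken", "hen", "wimp"),
             c("sex", "associate", "pamela", "partner"),
             c("united_kingdom", "chicken", "hen", "wimp"))
n <- length(rows)

# Two rows are similar if they share more than one element.
is_similar <- function(i, j) length(intersect(rows[[i]], rows[[j]])) > 1
similar <- outer(seq_len(n), seq_len(n), Vectorize(is_similar))
diag(similar) <- TRUE

# Label propagation: each row repeatedly takes the smallest label among
# the rows it is similar to, until no label changes.
label <- seq_len(n)
repeat {
  new_label <- vapply(seq_len(n), function(i) min(label[similar[i, ]]), integer(1))
  if (identical(new_label, label)) break
  label <- new_label
}

# Merge the rows of each component into one vector of unique elements.
groups <- lapply(split(rows, label), function(g) unique(unlist(g)))
groups
```

Rows that end up with the same label form one component, which plays the role of cl in the igraph version.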
I created the data set shown below in order to apply market basket analysis (apriori()):
id name
1 mango
1 apple
1 grapes
2 apple
2 carrot
3 mango
3 apple
4 apple
4 carrot
4 grapes
5 strawberry
6 guava
6 strawberry
6 bananas
7 bananas
8 guava
8 strawberry
8 pineapple
9 mango
9 apple
9 blueberries
10 black grapes
11 pomogranate
12 black grapes
12 pomogranate
12 carrot
12 custard apple
I applied the following logic to convert it into data that market basket analysis can process:
library(arules)
fact <- data.frame(lapply(frt,as.factor))
trans <- as(fact, 'transactions')
I also tried this and got an error:
trans1 = read.transactions(file = frt, format = "single", sep = ",",cols=c("id","name"))
Error in scan(file = file, what = "", sep = sep, quiet = TRUE, nlines = 1) :
'file' must be a character string or connection
The output from the first attempt is not what I expected. This is what I got:
items transactionID
1 {name=mango} 1
2 {name=apple} 2
3 {name=grapes} 3
4 {name=apple} 4
5 {name=carrot} 5
6 {name=mango} 6
7 {name=apple} 7
8 {name=apple} 8
9 {name=carrot} 9
10 {name=grapes} 10
11 {name=strawberry} 11
12 {name=guava} 12
13 {name=strawberry} 13
14 {name=bananas} 14
My expected output is:
id item
1 {mango,apple,grapes}
2 {apple,carrot}
3 {mango,apple}
and so on.
Can anyone please suggest a way to get my expected output (if it's possible), so that I can apply the apriori() algorithm?
Thank you in advance.
If you are doing market basket analysis in arules, you need to construct a transactions object. You can do this from your text file like this:
write.csv(frt,file="temp.csv", row.names=FALSE) # say "temp.csv" is your text file
tranx <- read.transactions(file="temp.csv",format="single", sep=",", cols=c("id","name"))
inspect(tranx)
# items transactionID
# 1 {apple,
# grapes,
# mango} 1
# 2 {black-grapes} 10
# 3 {pomogranate} 11
# 4 {black-grapes,
# carrot,
# custard-apple,
# pomogranate} 12
...or, if you have already read your text file into a data.frame, you can coerce it into a transactions object through a list like this:
tranx2 <- list()
for(i in unique(frt$id)){
tranx2[[i]] <- unlist(frt$name[frt$id==i])
}
inspect(as(tranx2,'transactions'))
# items
# 1 {apple,
# grapes,
# mango}
# 2 {apple,
# carrot}
# 3 {apple,
# mango}
# 4 {apple,
# carrot,
# grapes}
I am trying to create a large empty data.frame and insert groups of rows into it. I have seen a few similar questions on numerous forums; however, I have been unable to apply any of them successfully to the specific formatting issue I am having.
I started with rbind(df, allic), where allic is the data frame I would like to insert into df. However, given the size of my dataset, the operation takes 5 1/2 minutes to complete. I understand that creating the data frame at the beginning and replacing rows improves efficiency, but I have been unable to make it work for my problem. Code is as follows:
Initial data:
Order.ID Product
1 193505 Onion Rings
2 193505 Pineapple Cheddar Burger
3 193623 Fountain Soda
4 193623 French Fries
5 193623 Hamburger
6 193623 Hot Dog
7 193631 French Fries
8 193631 Hamburger
9 193631 Milkshake
The products below won't match the data above; however, since this is a formatting issue, I figured it best to show the formatting that brought me to where I am now.
nb$Order.ID <- as.factor(nb$Order.ID)
plist <- aggregate(nb$Product,list(nb$Order.ID),list)
allp <- unique(unlist(plist$x))
allic <- expand.grid(plist$x[[1]], Var2=plist$x[[1]], Var3=1)
Var1 Var2 Var3
1 Onion Rings Onion Rings 1
2 Pineapple Cheddar Burger Onion Rings 1
3 Onion Rings Pineapple Cheddar Burger 1
4 Pineapple Cheddar Burger Pineapple Cheddar Burger 1
Now I create an empty dataframe (df) using:
df <- data.frame(factor=rep(NA, rcnt), factor=rep(NA,rcnt), stringsAsFactors=FALSE)
rcnt being a large, arbitrary number which I plan to trim once the operation is complete. My issue comes when I try to insert these lines using:
df[1:4,] <- allic
head(df, n=10)
factor factor.1
1 47 47
2 51 47
3 47 51
4 51 51
5 NA NA
6 NA NA
7 NA NA
8 NA NA
How can I insert rows in a dataframe without losing the format of my values? I would greatly appreciate any help I can get at this point.
EDIT Per comment below:
>df[i] <- for(i in 1:nrow(plist)) {
> allic <- expand.grid(plist$x[[i]], Var2=plist$x[[i]], Var3=1)
> df[i:nrow(allic),] <- sapply(allic, as.character)
I'm still very new to R; however, this was working when I was using df <- rbind(df, allic). nrow(df) is 4096.
Try converting each column of allic with as.character, as follows:
df[1:4,] <- sapply(allic, as.character)
> df
factor factor.1
1 Onion Rings Onion Rings
2 Pineapple Cheddar Burger Onion Rings
3 Onion Rings Pineapple Cheddar Burger
4 Pineapple Cheddar Burger Pineapple Cheddar Burger
5 <NA> <NA>
6 <NA> <NA>
7 <NA> <NA>
8 <NA> <NA>
9 <NA> <NA>
10 <NA> <NA>
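An alternative to preallocating and filling by index is to build every per-order grid first and bind them once at the end, which avoids both the repeated rbind and the factor-to-integer conversion. The following is a sketch under the assumption that the products are available as one character vector per order (the orders list is made up):

```r
# Hypothetical per-order product vectors.
orders <- list(c("Onion Rings", "Pineapple Cheddar Burger"),
               c("French Fries", "Hamburger", "Milkshake"))

# One grid of product pairs per order; stringsAsFactors = FALSE keeps the
# product names as character instead of factor codes.
pairs <- lapply(orders, function(p) {
  expand.grid(Var1 = p, Var2 = p, Var3 = 1, stringsAsFactors = FALSE)
})

# Bind everything in a single call instead of growing df inside a loop.
result <- do.call(rbind, pairs)
head(result, 4)
```

Because all grids share the same column names and types, the single rbind keeps the product names intact.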