Inserting new rows into a data frame without losing format - R

I am trying to create a large empty data.frame and insert groups of rows into it. I have seen a few similar questions on numerous forums; however, I have been unable to apply any of their solutions to the specific formatting issue I am having.
I started with rbind(df, allic), where allic is the data frame I would like to insert into df; however, given the size of my dataset, the operation takes 5 1/2 minutes to complete. I understand that preallocating the data frame up front and replacing rows improves efficiency, but I have been unable to make it work for my problem. Code is as follows:
Initial data:
Order.ID Product
1 193505 Onion Rings
2 193505 Pineapple Cheddar Burger
3 193623 Fountain Soda
4 193623 French Fries
5 193623 Hamburger
6 193623 Hot Dog
7 193631 French Fries
8 193631 Hamburger
9 193631 Milkshake
The products above won't match those below, but since this is a formatting issue I figured it best to show the steps that brought me to where I am now.
nb$Order.ID <- as.factor(nb$Order.ID)
plist <- aggregate(nb$Product,list(nb$Order.ID),list)
allp <- unique(unlist(plist$x))
allic <- expand.grid(plist$x[[1]], Var2=plist$x[[1]], Var3=1)
Var1 Var2 Var3
1 Onion Rings Onion Rings 1
2 Pineapple Cheddar Burger Onion Rings 1
3 Onion Rings Pineapple Cheddar Burger 1
4 Pineapple Cheddar Burger Pineapple Cheddar Burger 1
Now I create an empty dataframe (df) using:
df <- data.frame(factor=rep(NA, rcnt), factor=rep(NA,rcnt), stringsAsFactors=FALSE)
rcnt being a large, arbitrary number which I plan to trim once the operation is complete. My issue comes when I try to insert these lines using:
df[1:4,] <- allic
head(df, n=10)
factor factor.1
1 47 47
2 51 47
3 47 51
4 51 51
5 NA NA
6 NA NA
7 NA NA
8 NA NA
How can I insert rows in a dataframe without losing the format of my values? I would greatly appreciate any help I can get at this point.
EDIT Per comment below:
for(i in 1:nrow(plist)) {
  allic <- expand.grid(plist$x[[i]], Var2=plist$x[[i]], Var3=1)
  df[i:nrow(allic),] <- sapply(allic, as.character)
}
I'm still very new to R; however, this was working when I was using df <- rbind(df, allic). nrow(df) is 4096.

The values 47 and 51 are the underlying integer codes of the factor columns produced by expand.grid. Try converting each column of allic to character before assigning, for example with sapply:
df[1:4,] <- sapply(allic, as.character)
> df
factor factor.1
1 Onion Rings Onion Rings
2 Pineapple Cheddar Burger Onion Rings
3 Onion Rings Pineapple Cheddar Burger
4 Pineapple Cheddar Burger Pineapple Cheddar Burger
5 <NA> <NA>
6 <NA> <NA>
7 <NA> <NA>
8 <NA> <NA>
9 <NA> <NA>
10 <NA> <NA>
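Alternatively (a sketch, not from the original answer): the preallocated columns in df are logical NA vectors, so the factor values from expand.grid end up stored as their integer level codes. Preallocating character columns sidesteps the coercion entirely; rcnt and allic are reused from the question, and only the two ingredient columns of allic are assigned:
rcnt <- 10  # placeholder; use your own large row count
df <- data.frame(Var1 = rep(NA_character_, rcnt),
                 Var2 = rep(NA_character_, rcnt),
                 stringsAsFactors = FALSE)
df[1:4, ] <- sapply(allic[1:2], as.character)  # drop the constant Var3 column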

Related

imputing missing values in R dataframe

I am trying to impute missing values in my dataset by matching against values in another dataset.
This is my data:
df1 %>% head()
<V1> <V2>
1 apple NA
2 cheese NA
3 butter NA
df2 %>% head()
<V1> <V2>
1 apple jacks
2 cheese whiz
3 butter scotch
4 apple turnover
5 cheese sliders
6 butter chicken
7 apple sauce
8 cheese doodles
9 butter milk
This is what I want df1 to look like:
<V1> <V2>
1 apple jacks, turnover, sauce
2 cheese whiz, sliders, doodles
3 butter scotch, chicken, milk
This is my code:
df1$V2[is.na(df1$V2)] <- df2$V2[match(df1$V1,df2$V1)][which(is.na(df1$V2))]
This code runs without error; however, it only pulls the first matching value for each key and ignores the rest.
Another solution using just base R:
aggregate(df2$V2, list(df2$V1), c, simplify = FALSE)
Group.1 x
1 apple jacks, turnover, sauce
2 butter scotch, chicken, milk
3 cheese whiz, sliders, doodles
I don't think you even need df1 in this case; you can build it entirely from df2:
library(dplyr)
df1 <- df2 %>% group_by(V1) %>% summarise(V2 = paste0(V2, collapse = ", "))
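If df1 really does need to be filled in place, a base-R sketch (same column names as the question): collapse df2 per key first, then reuse the match pattern from the question.
agg <- aggregate(V2 ~ V1, df2, paste, collapse = ", ")  # one collapsed string per key
df1$V2[is.na(df1$V2)] <- agg$V2[match(df1$V1, agg$V1)][is.na(df1$V2)]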

R: how to aggregate rows by count

This is my data frame
ID=c(1,2,3,4,5,6,7,8,9,10,11,12)
favFruit=c('apple','lemon','pear',
'apple','apple','pear',
'apple','lemon','pear',
'pear','pear','pear')
surveyDate = c('1/1/2005','1/1/2005','1/1/2005',
'2/1/2005','2/1/2005','2/1/2005',
'3/1/2005','3/1/2005','3/1/2005',
'4/1/2005','4/1/2005','4/1/2005')
df<-data.frame(ID,favFruit, surveyDate)
I need to aggregate it so I can plot a line graph of the count of favFruit by date, split by favFruit, but I am unable to create the aggregate table. My data has 45000 rows, so a manual solution is not possible. Desired output:
surveyDate favFruit count
1/1/2005 apple 1
1/1/2005 lemon 1
1/1/2005 pear 1
2/1/2005 apple 2
2/1/2005 lemon 0
2/1/2005 pear 1
... etc
I tried this but R printed an error
df2 <- aggregate(df, favFruit, FUN = sum)
and I tried this, another error
df2 <- aggregate(df, date ~ favFruit, sum)
I checked for solutions online, but their data generally included a column of quantities, which I don't have, and the solutions were overly complex. Is there an easy way to do this? Thanks in advance. Thank you to whoever suggested the link as a possible duplicate, but that question only counts rows by date; my question needs the number of rows by date and favFruit (one more column).
Update:
Ronak Shah's solution worked. Thanks!
The solution provided by Ronak is very good, but in case you prefer to keep the zero counts in your data frame, you could use the table function:
data.frame(with(df, table(favFruit, surveyDate)))
Output:
favFruit surveyDate Freq
1 apple 1/1/2005 1
2 lemon 1/1/2005 1
3 pear 1/1/2005 1
4 apple 2/1/2005 2
5 lemon 2/1/2005 0
6 pear 2/1/2005 1
7 apple 3/1/2005 1
8 lemon 3/1/2005 1
9 pear 3/1/2005 1
10 apple 4/1/2005 0
11 lemon 4/1/2005 0
12 pear 4/1/2005 3
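For reference, a count-by-group can also be computed with aggregate (a sketch, since the accepted answer itself is not reproduced above; note this drops the zero-count combinations that table keeps):
df2 <- aggregate(ID ~ surveyDate + favFruit, data = df, FUN = length)  # rows per date/fruit pair
names(df2)[3] <- "count"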

Only filter values in a column based on a condition

Let's say I have the following dataframe:
my_basket = data.frame(ITEM_GROUP = c("Fruit","Fruit","Fruit","Fruit","Fruit","Vegetable","Vegetable","Vegetable","Vegetable","Dairy","Dairy","Dairy","Dairy","Dairy"),
ITEM_NAME = c("Apple","Banana","Orange","Mango","Papaya","Carrot","Potato","Brinjal","Raddish","Milk","Curd","Cheese","Milk","Paneer"),
Price = c(100,80,80,90,65,70,60,70,25,60,40,35,50,NA),
Tax = c(2,4,5,6,2,3,5,1,3,4,5,6,4,NA))
This then yields:
> my_basket
ITEM_GROUP ITEM_NAME Price Tax
1 Fruit Apple 100 2
2 Fruit Banana 80 4
3 Fruit Orange 80 5
4 Fruit Mango 90 6
5 Fruit Papaya 65 2
6 Vegetable Carrot 70 3
7 Vegetable Potato 60 5
8 Vegetable Brinjal 70 1
9 Vegetable Raddish 25 3
10 Dairy Milk 60 4
11 Dairy Curd 40 5
12 Dairy Cheese 35 6
13 Dairy Milk 50 4
14 Dairy Paneer NA NA
What I now would like to do is make a list of fruits I want to keep and then filter on those, so:
fruitlist = c("Apple", "Banana")
How would I go about using tidyverse to filter the data in my data.frame to only keep the fruits in my fruitlist, but also all my Vegetables and Dairy? Normally I'd do:
my_basket %<>% filter(ITEM_NAME %in% fruitlist)
But then I'd also lose all the vegetables and dairy, which is not what I want. I've been trying to make something work with case_when but can't seem to make it work. There must be something obvious I'm missing here.
EDIT: Seconds after posting my question I finally realised:
my_basket %<>% filter(ITEM_NAME %in% fruitlist | ITEM_GROUP != "Fruit")
That solves it. I think if I'd have to filter multiple groups like this, piping the filter command repeatedly would work too.
You could use grepl with a regex alternation, keeping the non-fruit rows with the same ITEM_GROUP condition as above (perl = TRUE is needed for the non-capturing group):
fruitlist <- c("Apple", "Banana")
regex <- paste0("^(?:", paste0(fruitlist, collapse = "|"), ")$")
my_basket %<>% filter(grepl(regex, ITEM_NAME, perl = TRUE) | ITEM_GROUP != "Fruit")
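If several groups each needed their own whitelist, one possible generalization (a sketch; the keep list here is hypothetical, not from the question):
library(dplyr)
library(purrr)
keep <- list(Fruit = c("Apple", "Banana"), Dairy = c("Milk", "Curd"))  # hypothetical per-group whitelists
my_basket_filtered <- my_basket %>%
  filter(map2_lgl(ITEM_GROUP, ITEM_NAME,
                  ~ !.x %in% names(keep) || .y %in% keep[[.x]]))  # groups without a whitelist are kept in full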

Importing and analysing non-rectangular .csv files in R

I'm moving to R from Mathematica, where I don't need to anticipate data structures during import; in particular, I do not need to anticipate the rectangularity of my data.
I have many .csv files formatted as follows:
tasty,chicken,cinnamon
not_tasty,butter,pepper,onion,cardamom,cayenne
tasty,olive_oil,pepper
okay,olive_oil,onion,potato,black_pepper
not_tasty,tomato,fenugreek,pepper,onion,potato
tasty,butter,cheese,wheat,ham
Rows have differing lengths and will only contain strings.
In R, how should I approach this problem?
What Have You Tried?
I've tried with read.table:
dataImport <- read.table("data.csv", header = FALSE)
class(dataImport)
##[1] "data.frame"
dim(dataImport)
##[1] 6 1
dataImport[1]
##[1] tasty,chicken,cinnamon
##6 Levels: ...
I interpret this from the documentation to be a single column with each list of ingredients as a distinct row. I can extract the first three rows as follows; each row is of class factor but appears to contain more data than I expect:
dataImport[c(1,2,3),1]
## my rows
rowOne <- dataImport[c(1),1];
class(rowOne)
## "factor"
rowOne
## [1] tasty,chicken,cinnamon
## 6 Levels: not_tasty,butter,cheese [...]
This is as far as I've pursued the problem for now; I would appreciate advice on the suitability of read.table for this data structure.
My goal is to group the data by the first element of each row, and analyse the difference between each type of recipe. In case it helps influence data structure advice, in Mathematica I would do the following:
dataImport=Import["data.csv"];
tasty = Cases[dataImport, {"tasty", ingr__} :> {ingr}]
Answer Discussion
@G.Grothendieck has provided a solution using read.table with subsequent processing via the reshape2 package; this seems tremendously useful and I'll investigate it later. The general advice here solved my issue, hence the accept.
@MrFlick's suggestion of the tm package was useful for later analysis using DataframeSource.
read.table
Try read.table with fill = TRUE:
d1 <- read.table("data.csv", sep = ",", as.is = TRUE, fill = TRUE)
giving:
> d1
V1 V2 V3 V4 V5 V6
1 tasty chicken cinnamon
2 not_tasty butter pepper onion cardamom cayenne
3 tasty olive_oil pepper
4 okay olive_oil onion potato black_pepper
5 not_tasty tomato fenugreek pepper onion potato
6 tasty butter cheese wheat ham
read.table with NAs
or, to fill the empty cells with NA values, add na.strings = "":
d2 <- read.table("data.csv", sep = ",", as.is = TRUE, fill = TRUE, na.strings = "")
giving:
> d2
V1 V2 V3 V4 V5 V6
1 tasty chicken cinnamon <NA> <NA> <NA>
2 not_tasty butter pepper onion cardamom cayenne
3 tasty olive_oil pepper <NA> <NA> <NA>
4 okay olive_oil onion potato black_pepper <NA>
5 not_tasty tomato fenugreek pepper onion potato
6 tasty butter cheese wheat ham <NA>
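As an aside (an assumption on my part, not part of the original answer): data.table's fread has a similar fill argument and may be faster on large files:
library(data.table)
d3 <- fread("data.csv", header = FALSE, fill = TRUE)  # pads short rows, analogous to read.table's fill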
long form
If you want it in long form:
library(reshape2)
d2$id <- 1:nrow(d2)  # row id, needed as an id variable for melt below
long <- na.omit(melt(d2, id.var = c("id", "V1"))[-3])
long <- long[order(long$id), ]
giving:
> long
id V1 value
1 1 tasty chicken
7 1 tasty cinnamon
2 2 not_tasty butter
8 2 not_tasty pepper
14 2 not_tasty onion
20 2 not_tasty cardamom
26 2 not_tasty cayenne
3 3 tasty olive_oil
9 3 tasty pepper
4 4 okay olive_oil
10 4 okay onion
16 4 okay potato
22 4 okay black_pepper
5 5 not_tasty tomato
11 5 not_tasty fenugreek
17 5 not_tasty pepper
23 5 not_tasty onion
29 5 not_tasty potato
6 6 tasty butter
12 6 tasty cheese
18 6 tasty wheat
24 6 tasty ham
wide form 0/1 binary variables
To represent the variable portion as 0/1 binary variables try this:
wide <- dcast(long, id + V1 ~ value, value.var = "value")
wide[-(1:2)] <- 0 + !is.na(wide[-(1:2)])
giving a data frame with columns id, V1 and one 0/1 indicator column per ingredient.
list in a data frame
A different representation would be the following list in a data frame so that ag$value is a list of character vectors:
ag <- aggregate(value ~., transform(long, value = as.character(value)), c)
ag <- ag[order(ag$id), ]
giving:
> ag
id V1 value
4 1 tasty chicken, cinnamon
1 2 not_tasty butter, pepper, onion, cardamom, cayenne
5 3 tasty olive_oil, pepper
3 4 okay olive_oil, onion, potato, black_pepper
2 5 not_tasty tomato, fenugreek, pepper, onion, potato
6 6 tasty butter, cheese, wheat, ham
> str(ag)
'data.frame': 6 obs. of 3 variables:
$ id : int 1 2 3 4 5 6
$ V1 : chr "tasty" "not_tasty" "tasty" "okay" ...
$ value:List of 6
..$ 15: chr "chicken" "cinnamon"
..$ 1 : chr "butter" "pepper" "onion" "cardamom" ...
..$ 17: chr "olive_oil" "pepper"
..$ 11: chr "olive_oil" "onion" "potato" "black_pepper"
..$ 6 : chr "tomato" "fenugreek" "pepper" "onion" ...
..$ 19: chr "butter" "cheese" "wheat" "ham"
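Since ag$value is a list column, the ingredient vector for any row can be pulled out with ordinary list indexing:
ag$value[[1]]
## [1] "chicken"  "cinnamon"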
I don't think shoving your data into a data.frame or data.table is going to help you much, since both of those forms generally assume rectangular data. If you just want a list of character vectors, you can read them in with:
strsplit(readLines("data.csv"), ",")
It all depends on exactly what you are going to do with the data after you read it in. If you plan to use existing functions, what input do they expect?
Sounds like you may be tracking terms in each of these recipes. Maybe the appropriate data structure would be a "corpus" from the tm package for text mining.
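Following up on the strsplit approach, a short sketch of the grouping the question asked for, mirroring the Mathematica Cases call (the variable names are illustrative):
recs <- strsplit(readLines("data.csv"), ",")
ratings <- vapply(recs, `[`, character(1), 1)  # first field of each record
ingredients <- lapply(recs, `[`, -1)           # everything after the rating
tasty <- split(ingredients, ratings)$tasty     # ingredient vectors of the tasty recipes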

Clustering / Matching Over Many Dimensions in R

I have a very large and complex data set with many observations of companies. Some of the observations of the companies are redundant and I need to make a key to map the redundant observations to a single one. However the only way to tell if they are actually representing the same company is through the similarity of a variety of variables. I think the appropriate approach is a kind of clustering based on a variety of conditions or perhaps even some kind of propensity score matching. Perhaps I just need flexible tools for making a complex kind of similarity matrix.
Unfortunately, I am not quite sure how to go about that in R. Most of the tools I've seen for clustering and categorizing work with either numerical distance or categorical data, but don't seem to allow multiple or user-specified conditions.
Below I've tried to create a smaller, public example of the kind of data I am working with and the result I am trying to produce. There are some conditions that must hold; for example, the location must be the same. There are some features that may associate one observation with another, for example var1 and var2. And there are some features that may associate observations but must not conflict, such as var3.
An additional layer of complexity is that the kind of association I am trying to use to map the redundant observations varies. For example, id 1 and id 2 are the same company redundantly entered into the data twice: in one place its name is "apples" and in another "red apples", and they share the same location, var1 value and var3 (after adjusting for formatting). Similarly, ids 3, 5 and 6 are really just one company, though much of the input for each differs. Some clusters would contain multiple observations, others only one. Ideally I would like to find a way to categorize or associate the observations based on several conditions, for example:
1. Test that the location is the same
2. Test whether var3 is different
3. Test whether one name is a substring of another
4. Test the edit distance of names
5. Test the similarity of var1 and var2 between observations
Anyway, hopefully there are better, more flexible tools for this than what I am finding, or someone has experience with this kind of data work in R. Any and all suggestions and advice are much appreciated!
Data
id name location var1 var2 var3
1 apples US 1 abc 12345
2 red apples US 1 NA 12-345
3 green apples Mexico 2 def 235-92
4 bananas Brazil 2 abc NA
5 oranges Mexico 2 NA 23592
6 green apple Mexico NA def NA
7 tangerines Honduras NA abc 3498
8 mango Honduras 1 NA NA
9 strawberries Honduras NA abcd 3498
10 strawberry Honduras NA abc 3498
11 blueberry Brazil 1 abcd 2348
12 blueberry Brazil 3 abc NA
13 blueberry Mexico NA def 1859
14 bananas Brazil 1 def 2348
15 blackberries Honduras NA abc NA
16 grapes Mexico 6 qrs NA
17 grapefruits Brazil 1 NA 1379
18 grapefruit Brazil 2 bcd 1379
19 mango Brazil 3 efaq NA
20 fuji apples US 4 NA 189-35
Result
id name location var1 var2 var3 Result
1 apples US 1 abc 12345 1
2 red apples US 1 NA 12-345 1
3 green apples Mexico 2 def 235-92 3
4 bananas Brazil 2 abc NA 4
5 oranges Mexico 2 NA 23592 3
6 green apple Mexico NA def NA 3
7 tangerines Honduras NA abc 3498 7
8 mango Honduras 1 NA NA 8
9 strawberries Honduras NA abcd 3498 7
10 strawberry Honduras NA abc 3498 7
11 blueberry Brazil 1 abcd 2348 11
12 blueberry Brazil 3 abc NA 11
13 blueberry Mexico NA def 1859 13
14 bananas Brazil 1 def 2348 11
15 blackberries Honduras NA abc NA 15
16 grapes Mexico 6 qrs NA 16
17 grapefruits Brazil 1 NA 1379 17
18 grapefruit Brazil 2 bcd 1379 17
19 mango Brazil 3 efaq NA 19
20 fuji apples US 4 NA 189-35 20
Thanks in advance for your time and help!
library(stringdist)
getMatches <- function(df, tolerance = 6){
  out <- integer(nrow(df))
  for(row in 1:nrow(df)){
    dists <- numeric(nrow(df))
    for(col in 1:ncol(df)){
      tempDist <- stringdist(df[row, col], df[ , col], method = "lv")
      # WARNING: NA is treated as distance 0, i.e. it matches anything perfectly
      tempDist[is.na(tempDist)] <- 0
      dists <- dists + tempDist
    }
    dists[row] <- Inf  # never match a row to itself
    min_dist <- min(dists)
    if(min_dist < tolerance){
      out[row] <- which.min(dists)
    } else {
      out[row] <- row  # no match within tolerance: keep own index
    }
  }
  return(out)
}
test$Result <- getMatches(test[, -1])
Here test is your data. This almost certainly needs some refining and some postprocessing. It creates a column with the index of the closest match; if no match is found within the given tolerance, the row gets its own index.
EDIT: I will attempt some more later.
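One possible refinement (an assumption on my part, not from the original answer): replace the inner row loop with stringdist::stringdistmatrix, which computes all pairwise distances for a column at once:
library(stringdist)
getMatchesFast <- function(df, tolerance = 6) {
  n <- nrow(df)
  total <- matrix(0, n, n)
  for(col in seq_along(df)){
    v <- as.character(df[[col]])
    d <- stringdistmatrix(v, v, method = "lv")
    d[is.na(d)] <- 0  # same caveat as above: NA matches anything perfectly
    total <- total + d
  }
  diag(total) <- Inf                 # never match a row to itself
  idx <- apply(total, 1, which.min)  # closest other row
  best <- total[cbind(1:n, idx)]
  ifelse(best < tolerance, idx, 1:n)
}
test$Result <- getMatchesFast(test[, -1])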
