I have a historical file for a project. Call this data frame A and say it is 5,000 rows by 20 columns; I need to constantly append new records to A as they come in.
If I want to test that I'm removing duplicates, my test is:
A <- rbind(A, A) #These are the exact same file. This is now 10,000 x 20
A <- dplyr::filter(A, !duplicated(A)) #using the dplyr package, this is now 5,000 x 20
Keep in mind, the above works as it should. It removes all duplicates. However, when I want to test that this works I save the file, then read it in again and rbind once more:
readr::write_csv(A, "path/A_saved") #Saving the historical file A
A_import <- readr::read_csv("path/A_saved") #Loading in the historical file I just saved to my computer
A <- rbind(A, A_import) #Again, these are still the exact same records with the same dimensions, so every row now has a duplicate. This is now 10,000 x 20
A <- dplyr::filter(A, !duplicated(A)) #Same as above, BUT this is now 6,000 x 20
It removes the majority of the duplicates, but not all of them. Upon inspection, the 1,000 rows that should have been removed still appear to be exact duplicates of other rows.
What is happening when I use read_csv() and the duplicated() function in this instance? I searched far and wide for a similar issue and could not find a solution.
I have used the unique() function as opposed to duplicated() and the problem persists. When I read in the exact same data frame, rbind() the data frame object with its read_csv() version, then try to filter() duplicates, not all duplicates are filtered.
duplicated() is not recognizing all rows as proper duplicates.
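One way to narrow this down is to compare A with its round-tripped copy directly, before the second rbind(). A minimal diagnostic sketch, assuming A and A_import as defined above:
sapply(A, class)                                      #Column classes of the in-memory frame
sapply(A_import, class)                               #Column classes after the write/read round trip
all.equal(as.data.frame(A), as.data.frame(A_import))  #Reports any type or precision mismatches
If this comparison reports differences, the affected rows are no longer byte-for-byte equal after the round trip, which would explain why duplicated() keeps them.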
Any ideas?
Thank you in advance.
Related
I have a large data set, namely Sachs, which is freely available in the gss package. The data set is fairly large, with 7466 observations and 12 variables. I am trying to remove all rows that contain at least one zero: if any variable in a row has the value zero, that entire row should be removed across all variables. I have tried every method I could find and I am failing. Here is one of my attempts. I know that many similar questions already exist on this website, but none of the answers I tried work for me.
library(gss)
data <- data.frame(Sachs[,-12])
dat <- data[apply(data,1, function(x) all(data!= 0.0000000)),]
View(dat)
To remove rows that contain at least one zero, you can use the following code:
library(gss)
data("Sachs")
Sachs[!apply(Sachs == 0, 1, any), ]
Or using dplyr:
library(tidyverse)
library(gss)
data("Sachs")
Sachs |> filter(!if_any(everything(), ~ . == 0))
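Either way, a quick sanity check on the result (a sketch; dat is just a hypothetical name for the filtered data, and all columns are assumed numeric):
dat <- Sachs[!apply(Sachs == 0, 1, any), ]   #same as the base R version above
any(dat == 0)                                #should now be FALSE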
I need to check whether a data frame is "empty" or not ("empty" in the sense that the data frame contains zero finite values; if there is a mix of finite and non-finite values, it should NOT be considered "empty").
Referring to How to check a data.frame for any non-finite, I came up with a one-line piece of code that almost achieves this objective:
nrow(tmp[rowSums(sapply(tmp, function(x) is.finite(x))) > 0,]) == 0
where tmp is some data frame.
This code works fine for most cases, but it fails if the data frame contains a single row.
For example, the above code would work fine for,
tmp <- data.frame(a=c(NA,NA), b=c(NA,NA)) OR tmp <- data.frame(a=c(3,NA), b=c(4,NA))
But not for,
tmp <- data.frame(a=NA, b=NA)
because, I think, with a single row sapply() returns a plain vector rather than a matrix, and rowSums() expects something with at least two dimensions
I looked at some other posts such as https://stats.stackexchange.com/questions/6142/how-to-calculate-the-rowmeans-with-some-single-rows-in-data, but I still couldn't come up with a solution to my problem.
My question is: are there any clean ways (i.e. avoiding loops, ideally a one-liner) to check whether any data frame is "empty" in this sense?
Thanks
If you are checking all columns, then you can just do
!any(sapply(tmp, is.finite))
which is TRUE exactly when the data frame contains no finite values at all. Here we are using any() on the sapply() result rather than the rowSums trick, so we don't have to worry about sapply() dropping the matrix structure for a single row.
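For example, wrapped in a (hypothetical) helper and checked against the cases from the question:
is_empty <- function(df) !any(sapply(df, is.finite))
is_empty(data.frame(a = c(NA, NA), b = c(NA, NA)))  #TRUE: no finite values at all
is_empty(data.frame(a = c(3, NA), b = c(4, NA)))    #FALSE: mix of finite and non-finite
is_empty(data.frame(a = NA, b = NA))                #TRUE: the single-row case now works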
I encountered a problem using the ftable() function in R.
I basically have large data frames from which I want to delete all duplicated rows, which is simply done with:
distinct(my_df)
I also want to count how many times each row appears in the data frame, which can be done with:
my_ftab <- as.data.frame(ftable(my_df))
my_ftab <- arrange(my_ftab[my_ftab$Freq > 0, ], desc(Freq))
This will return a data frame showing me the distinct rows and how many times each of them occurs.
When the size of my_df exceeds approximately 1000 x 30 it stops working, because R can't produce tables with more than 2^31 elements, which apparently would be necessary for some intermediate calculation step.
So my question is whether there is a function that produces similar output to ftable() but does not have this limitation.
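For reference, one possible alternative (a sketch only, not tested at that size) would be dplyr's count(), which returns the distinct rows together with their frequencies without building a full contingency table; my_df is assumed to be as above:
library(dplyr)
my_counts <- my_df |>
  count(across(everything()), name = "Freq") |>
  arrange(desc(Freq))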
I want to merge 2 data frames (data1 and data2). Both initially contain around 35 million observations (around 2GB each).
I removed the duplicates from data2. I would need to keep the duplicates in data1, as I wish to use them for further calculations per observation in data1.
I initially get the well-documented error:
Check for duplicate key values in i, each of which join to the same group in x over and over again. If that's ok, try including j and dropping by (by-without-by) so that j runs for each group to avoid the large allocation. If you are sure you wish to proceed, rerun with allow.cartesian=TRUE. Otherwise, please search for this error message in the FAQ, Wiki, Stack Overflow and datatable-help for advice.
As a solution (I looked at several topics, such as here, here, and here), I included allow.cartesian=TRUE, but now I run into memory issues. Also, for a subset it works, but it gives me more observations than I wish (data1 now has 50 million observations, although I specify all.x=TRUE).
My code is:
#Remove duplicates before merge
data2 <- unique(data2)
#Merge
require(data.table)
data1 <- merge(data1, data2, by="ID", all.x=TRUE, allow.cartesian=TRUE)
Any advice on how to merge this is very welcome.
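One thing worth checking first: unique(data2) only drops rows that are duplicated across every column, so the ID key itself can still repeat, and each repeated ID multiplies the matching rows from data1. A quick sketch (assuming data2 is a data.table with an ID column):
sum(duplicated(data2$ID))      #how many ID values still occur more than once
data2[, .N, by = ID][N > 1]    #which IDs repeat, and how often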
In order to do a left join, the merge statement needs to understand which column you are using as the "key" or "index" for the join. If you have duplicate column names that are used as the key/index, it doesn't know what to do and gives that error. Further, it needs to know what to do if columns are being joined that have the same name as existing columns.
The solution is to temporarily rename the key/index column in your left (data1) dataset. As a general rule, having duplicate column names is "bad" in R because it will confuse a lot of functions. Many functions silently call make.unique() to de-duplicate column names to avoid confusion.
If you have duplicate ID columns in data1, rename them with colnames(data1) <- make.unique(colnames(data1)); the first occurrence keeps the name ID and the later ones become ID.1, ID.2, etc. Then do your merge, making sure to adjust by.x for the rename (e.g. by.x="ID.1", by.y="ID" if ID.1 is the intended key). By default, non-key columns with the same name in both tables get the suffixes .x and .y, although you can change these with the suffixes= option (see the merge help file for details).
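A minimal sketch of that rename-then-merge flow (the column names follow the ID example above and are illustrative only):
colnames(data1) <- make.unique(colnames(data1))   #a second "ID" column becomes "ID.1", a third "ID.2", ...
data1 <- merge(data1, data2,
               by.x = "ID.1", by.y = "ID",        #point by.x at whichever renamed column is the real key
               all.x = TRUE, allow.cartesian = TRUE)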
Lastly, it's worth noting that the merge() function in the data.table package tends to be a lot faster than the base merge() function with similar syntax. See page 47 of the data.table manual.
I'm trying to merge two data frames based on a common field called "lookup" that I created. I created the data frames after subsetting the original data frame.
Each of the two newly created data frames has fewer than 10,000 rows. When trying to execute merge, after much thinking, both R and RStudio shut down, with R sometimes producing an error message stating:
Error in make.unique(as.character(rows)) :
promise already under evaluation: recursive default argument reference or earlier problems?
Below is my code...is there any other way to pull down the data from the other data frame based on the common field besides using the merge function? Any help is appreciated.
Also, do you have any thoughts as to why it may be shutting down, using up all the memory, when, in fact, the data size is so small?
wmtdata <- datastep2[datastep2$Market.Type == "WMT", c("Item", "Brand.Family", "Brand", "Unit.Size", "Pack.Size",
                     "Container", "UPC..int.", "X..Vol", "Unit.Vol", "Distribution", "Market.Type",
                     "Week.Ending", "MLK.Day", "Easter", "Independence.Day", "Labor.Day", "Veterans.Day", "Thanksgiving",
                     "Christmas", "New.Years", "Year", "Month", "Week.Sequence", "Price", "lookup")]
compdata <- datastep2[datastep2$Market.Type=="Rem Mkt", c("Week.Ending", "UPC..int.","X..Vol", "Unit.Vol", "Price","lookup")]
colnames(compdata)[colnames(compdata) == "X..Vol"] <- "Comp Vol"
colnames(compdata)[colnames(compdata) == "Unit.Vol"] <- "Comp Unit Vol"
colnames(compdata)[colnames(compdata) == "Price"] <- "Comp Price"
combineddata <- merge(wmtdata, compdata, by = "lookup")
Try join from the plyr package:
library(plyr)
combineddata <- join(wmtdata, compdata, by = "lookup")
With only 10,000 rows, the problem is unlikely to be the use of merge(...) instead of something else. Are the elements of the lookup column unique? Otherwise you get a cross-join.
Consider this trivial case:
df.1 <- data.frame(id=rep(1,10), x=rnorm(10))
df.2 <- data.frame(id=rep(1,10), y=rnorm(10))
z <- merge(df.1,df.2,by="id")
nrow(z)
# [1] 100
So two df with 10 rows each produce a merge with 100 rows because the id is not unique.
Now consider:
df.1 <- data.frame(id=rep(1:10, each=40), x=rnorm(400))
df.2 <- data.frame(id=rep(1:10, each=50), y=rnorm(500))
z <- merge(df.1,df.2,by="id")
nrow(z)
# [1] 20000
In this example, df.1 has each id replicated 40 times, and in df.2 each id is replicated 50 times. merge() will produce one row for every pairing of an id in df.1 with the same id in df.2, so 40 x 50 = 2,000 rows per id. Since there are 10 ids in this example, you get 20,000 rows. So your merge results can get very big very quickly if the id field (lookup in your case) is not unique.
Instead of using data frames, use the data.table package (see here for an intro). A data.table is like an indexed data frame. It has its own merge method that would probably work in this case.
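A minimal sketch of that approach, assuming wmtdata and compdata as built above (both carrying the lookup column):
library(data.table)
wmtdt  <- as.data.table(wmtdata)
compdt <- as.data.table(compdata)
setkey(wmtdt, lookup)                                 #index both tables on the common field
setkey(compdt, lookup)
combineddata <- merge(wmtdt, compdt, by = "lookup")   #data.table's own merge method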
Thank you all for all the great help. Data tables are the way to go for me, as I think this was a memory issue ("lookup" values were common between the data frames). While 8 GB of memory (~6 GB free) should be plenty, it was all used up during this process. Nevertheless, data tables worked just fine. Learning a lot from these boards.