Find columns with different values in duplicate rows - r

I have a data set that has some duplicate records. For those records, most of the column values are the same, but a few are different.
I need to identify the columns where the values are different, and then subset those columns.
This would be a sample of my dataset:
library(data.table)
dat <- "ID location date status observationID observationRep observationVal latitude longitude setSource
FJX8KL loc1 2018-11-17 open 445 1 17.6 -52.7 -48.2 XF47
FJX8KL loc2 2018-11-17 open 445 2 1.9 -52.7 -48.2 LT12"
dat <- setDT(read.table(textConnection(dat), header=T))
And this is the output I would expect:
observationRep observationVal setSource
1: 1 17.6 XF47
2: 2 1.9 LT12
One detail is: my original dataset has 189 columns, so I need to check all of them.
How to achieve this?

Two issues: first, use the text= argument rather than textConnection; second, use as.data.table, since setDT modifies an object in place, and at that point the object doesn't exist yet.
dat1 <- data.table::as.data.table(read.table(text=dat, header=TRUE))
dat1[, c('observationRep', 'observationVal', 'setSource')]
# observationRep observationVal setSource
# 1: 1 17.6 XF47
# 2: 2 1.9 LT12
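The answer above selects the three columns by name. With 189 columns, the differing columns could instead be detected programmatically. The following is a minimal base R sketch, assuming the whole table is a single group of duplicates (with several IDs, the same check would have to run per ID). The names df, varies and diff_cols are illustrative. Note that in the sample data location also differs (loc1 vs loc2), so it is picked up as well, although the expected output in the question omits it.

```r
dat <- "ID location date status observationID observationRep observationVal latitude longitude setSource
FJX8KL loc1 2018-11-17 open 445 1 17.6 -52.7 -48.2 XF47
FJX8KL loc2 2018-11-17 open 445 2 1.9 -52.7 -48.2 LT12"
df <- read.table(text = dat, header = TRUE)

# TRUE for every column that holds more than one distinct value
varies <- vapply(df, function(x) length(unique(x)) > 1, logical(1))

# Names of the differing columns, then the subset of those columns
diff_cols <- names(df)[varies]
diff_cols
df[, diff_cols]
```

If location should be excluded, setdiff(diff_cols, "location") would drop it before subsetting.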


Replacing expression within data.table

I'm running the following code below to retrieve a data set, which unfortunately uses "." instead of NA to represent missing data. After much wrangling and searching SO and other fora, I still cannot make the code replace all instances of "." with NA so I can convert the columns to numeric and go on with my life. I'm pretty sure the problem is between the screen and the chair, so I don't see a need to post sessionInfo, but please let me know otherwise. Help in solving this would be greatly appreciated. The first four columns are integers setting out the date and the unique ID, so I would only need to correct the other columns. Thanks in advance you all!
library(data.table)
google_mobility_data <- data.table(read.csv("https://github.com/OpportunityInsights/EconomicTracker/raw/main/data/Google Mobility - State - Daily.csv",stringsAsFactors = FALSE))
# The following line is the one where I can't make it work properly.
google_mobility_data[, .SD := as.numeric(sub("^\\.$", NA, .SD)), .SDcols = -c(1:4)]
I downloaded your data and changed the last entry on the first row to "." to test NA in the final column.
Use readLines to read a character vector.
Use gsub to change . to NA.
Use fread to read as a data.table.
library(data.table)
gmd <- readLines("Google Mobility - State - Daily.csv")
gmd[c(2,3120)]
# [1] "2020,2,24,1,.00286,-.00714,.0557,.06,.0129,.00857,."
# [2] "2020,4,25,10,-.384,-.191,.,-.479,-.441,.179,-.213"
gmd <- gsub(",\\.,",",NA,",gmd)
gmd <- gsub(",\\.$",",NA",gmd)
gmd[c(2,3120)]
# [1] "2020,2,24,1,.00286,-.00714,.0557,.06,.0129,.00857,NA"
# [2] "2020,4,25,10,-.384,-.191,NA,-.479,-.441,.179,-.213"
google_mobility_data <- fread(text=gmd)
google_mobility_data[c(1,3119)]
# year month day statefips gps_retail_and_recreation gps_grocery_and_pharmacy gps_parks gps_transit_stations gps_workplaces gps_residential gps_away_from_home
#1: 2020 2 24 1 0.00286 -0.00714 0.0557 0.060 0.0129 0.00857 NA
#2: 2020 4 25 10 -0.38400 -0.19100 NA -0.479 -0.4410 0.17900 -0.213
summary(google_mobility_data)
EDIT: You mentioned using na.strings with fread didn't work for you, so I suggested the above approach.
However, at least with the data file downloaded as I did, this worked in one line, as suggested by @MichaelChirico:
google_mobility_data <- fread("Google Mobility - State - Daily.csv",na.strings=".")
google_mobility_data[c(1,3119)]
# year month day statefips gps_retail_and_recreation gps_grocery_and_pharmacy gps_parks gps_transit_stations gps_workplaces gps_residential gps_away_from_home
#1: 2020 2 24 1 0.00286 -0.00714 0.0557 0.060 0.0129 0.00857 NA
#2: 2020 4 25 10 -0.38400 -0.19100 NA -0.479 -0.4410 0.17900 -0.213
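If the data has already been read in with "." left as text, the in-place replacement that the question was attempting could look roughly like the sketch below. It uses a small made-up table (dt, with hypothetical value columns a and b) rather than the real file, and assumes the affected columns are character vectors.

```r
library(data.table)

# Toy stand-in for the real data: four integer ID columns plus two
# character columns that use "." for missing values (made-up values)
dt <- data.table(year = 2020L, month = 2L, day = 24:25, statefips = 1L,
                 a = c(".00286", "."), b = c(".", "-.213"))

# Every column except the first four
cols <- names(dt)[-(1:4)]

# Replace "." with NA and convert to numeric, updating the columns by reference
dt[, (cols) := lapply(.SD, function(x) as.numeric(replace(x, x == ".", NA))),
   .SDcols = cols]
```

The left-hand side of := has to be the column names, here (cols), not .SD itself, which is why the line in the question fails.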

R: Is it possible to combine rows of non-equal length into a single data frame using a for-loop?

I have been working with a dataset (called CWNA_clim_vars) structured so that the variables associated with each datapoint within the set are arranged in columns, like this:
dbsid elevation Tmax04 Tmax10 Tmin04 Tmin10 PPT04 PPT10
0001 1197 8.1 8.9 -5.2 -3.5 34 95
0002 1110 7.7 8 -2.9 -0.6 114 375
0003 1466 5.4 6.4 -4.7 -1.5 199 453
0004 1267 6.1 7.1 -3.6 -0.7 166 376
... ... ... ... ... ... ... ...
1000 926 7.2 10.1 -0.8 2.7 245 351
I've been attempting to run boxplot stats on each column, retrieve the values of the outliers within each column, and write them to a new data frame called summary_stats. The code I set up in an attempt to achieve this is as follows:
summary_stats <- data.frame()
for (i in names(CWNA_clim_vars)){
temp <- boxplot.stats(CWNA_clim_vars[,i])
out <- as.list(temp$out)
for (j in out) {
summary_stats[i,j] <- out[j]
}
}
Unfortunately, in running this, the following error message is thrown:
Error in `[<-.data.frame`(`*tmp*`, i, j, value = list(6.65)) :
new columns would leave holes after existing columns
I am guessing that this error is thrown because the number of outliers varies between columns: if I replace temp$out with temp$n, which contains only one number per column, I get a data frame with these numbers arranged in a single column.
Is there an easy way to remedy this so that I end up with a data frame whose rows are not necessarily of the same length? Thanks for considering my question; any help would be greatly appreciated.
You'd better use a "list".
out_lst <- lapply(CWNA_clim_vars, function (x) boxplot.stats(x)$out)
If for some reason you have to present it in a "data frame", you need padding.
N <- max(lengths(out_lst))
out_df <- data.frame(lapply(out_lst, function (x) c(x, rep(NA, N - length(x)))))
Try with a tiny example:
CWNA_clim_vars <- data.frame(a = c(rep(1,9), 10), b = c(10,11,rep(1,8)))
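Putting the pieces of this answer together on the tiny example gives the following sketch. It only uses base R and the object names from the answer. In this toy data every non-1 value counts as an outlier because the interquartile range is zero.

```r
# Tiny example: column a has one outlier, column b has two
CWNA_clim_vars <- data.frame(a = c(rep(1, 9), 10), b = c(10, 11, rep(1, 8)))

# One vector of outliers per column, kept as a list
out_lst <- lapply(CWNA_clim_vars, function(x) boxplot.stats(x)$out)

# Pad the shorter vectors with NA so they fit into a data frame
N <- max(lengths(out_lst))
out_df <- data.frame(lapply(out_lst, function(x) c(x, rep(NA, N - length(x)))))
out_df
```

The result has one column per variable, with NA filling the positions where a variable has fewer outliers than the longest one.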

Double entries in dataframe after merge - r

My data
Hello, I have a problem with merging two dataframes with each other.
The goal is to merge them so that each date has the corresponding values. If there is no corresponding value, I want to replace NA with 0.
names(FiresNearLA.ab.03)[1] <- "Date.Local"
U.NO2.ab.03 <- unique(NO2.ab.03) # No2.ab.03 has all values multiplied
ind <- merge(FiresNearLA.ab.03,U.NO2.ab.03, all = TRUE, all.x=TRUE)
ind[is.na(ind)] <- 0
So far so good, and the first lines look the way they are supposed to. But beginning from 2004-04-24, all dates are doubled and weird values appear in the second NO2.Mean column.
U.NO2.Mean table:
Date.Local NO2.Mean
361 2004-03-31 30.217391
365 2004-04-24 50.000000
366 2004-04-25 47.304348
370 2004-04-26 50.913043
374 2004-04-27 41.157895
ind table:
Date.Local FIRE_SIZE F.number.n_fires NO2.Mean
113 2004-04-22 34.30 10 13.681818
114 2004-04-23 45.00 13 17.222222
115 2004-04-24 55.40 22 28.818182
116 2004-04-24 55.40 22 50.000000
117 2004-04-25 2306.85 15 47.304348
118 2004-04-25 2306.85 15 21.090909
Why are there values in NO2.Mean for 2004-04-22 and 2004-04-23 if they should be 0? Why are the rows doubled from the 24th onward, and where do the second values come from?
Thank you
So I managed to merge your data:
FiresNearLA.ab.03 <- dget("FiresNearLA.ab.03.txt", keep.source = FALSE)
U.NO2.ab.03 <- dget("NO2.ab.03.txt", keep.source = FALSE)
ind <- merge(FiresNearLA.ab.03,
U.NO2.ab.03,
all = TRUE,
by.x = "DISCOVERY_DATEymd",
by.y = "Date.Local")
As a side note: Usually, you share a small sample of your data on stackoverflow, not the whole thing. In your case, dput(FiresNearLA.ab.03[1:50, ]) and then copy and paste from the console to the question would have been sufficient.
Back to your problem: the duplication already happens in NO2.ab.03, where a number of date and value combinations occur twice or more. The easiest way to solve this (in my experience) is to use the data.table package, whose duplicated method is more straightforward and also faster:
library(data.table)
# Test duplicated occurrences in U.NO2.ab.03
> table(duplicated(U.NO2.ab.03, by = c("DISCOVERY_DATEymd", "NO2.Mean")))
FALSE TRUE
7767 27308
>
> nrow(ind)
[1] 35229
# Remove duplicated rows from data frame
> ind <- ind[!duplicated(ind, by = c("DISCOVERY_DATEymd", "NO2.Mean")), ]
> nrow(ind)
[1] 7921
After these steps, you should be fine :)
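An alternative is to reduce the NO2 table to one row per date before merging, so the merge cannot multiply rows in the first place. The sketch below uses a few made-up rows shaped like the tables in the question (fires, no2 and no2_daily are illustrative names). Averaging the conflicting values per date is an assumption for the sake of the example; which value is correct depends on the data source.

```r
# Made-up rows shaped like the tables in the question
fires <- data.frame(
  Date.Local = as.Date(c("2004-04-23", "2004-04-24", "2004-04-25")),
  FIRE_SIZE  = c(45, 55.4, 2306.85)
)
no2 <- data.frame(
  Date.Local = as.Date(c("2004-04-24", "2004-04-24", "2004-04-25", "2004-04-25")),
  NO2.Mean   = c(28.818182, 50, 47.304348, 21.090909)
)

# Assumption: collapse conflicting values to one row per date by averaging
no2_daily <- aggregate(NO2.Mean ~ Date.Local, data = no2, FUN = mean)

# Keep every fire date; dates without an NO2 value get 0
ind <- merge(fires, no2_daily, by = "Date.Local", all.x = TRUE)
ind$NO2.Mean[is.na(ind$NO2.Mean)] <- 0
ind
```

Naming the key explicitly with by = "Date.Local" also avoids merging on any other column the two tables happen to share.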
I got the answer. The original data source of NO2.ab.03 was faulty.
As JonGrub said, the problem was within NO2.ab.03. For some days it had two different NO2.Mean values corresponding to the same date. I deleted these rows and now it works well. Thank you again for the help and the great advice.

R For loop fails applying max function

I should say up front that I'm new to R and am still trying to get the fundamentals.
Currently I'm working on a large dataframe (called "ppl") which I have to edit in order to filter some rows. Each row belongs to a group and is characterized by an intensity (into) value and a sample value.
mz rt into sample tracker sn grp
100.0153 126 2.762664 3 11908 7.522655 0
100.0171 127 2.972048 2 5308 7.718521 0
100.0788 272 30.217969 2 5309 19.024807 1
100.0796 272 17.277916 3 11910 7.297716 1
101.0042 128 37.557324 3 11916 27.991320 2
101.0043 128 39.676014 2 5316 28.234918 2
Well, the first question is: "How can I select from each group the sample with the highest intensity?"
I tried a for loop:
for (i in ppl$grp) {
temp<-ppl[ppl$grp == i,]
sel<-rbind(sel,temp[max(temp$into),])
}
The problem is that it works for ppl$grp == 0, but the following iterations return NA rows.
The filtered dataframe (called "sel") should also store the sample values of the removed rows. It should look as follows:
mz rt into sample tracker sn grp
100.0171 127 2.972048 c(2,3) 5308 7.718521 0
100.0788 272 30.217969 c(2,3) 5309 19.024807 1
101.0043 128 39.676014 c(2,3) 5316 28.234918 2
In order to get this I would use this approach:
lev<-factor(ppl$grp)
samp<-ppl$sample
samp2<-split(samp,lev)
sel$sample<-samp2
Any hints? I cannot test it because I still haven't solved the previous problem.
Thanks a lot.
Not sure if I follow your question. But maybe this will get you started.
library(dplyr)
ppl %>% group_by(grp) %>% filter(into == max(into))
A base R option using ave is
ppl[with(ppl, ave(into, grp, FUN = max)==into),]
If the 'sample' column in the expected output should hold the unique elements of each 'grp', then group by 'grp', update 'sample' to the pasted unique elements of 'sample', arrange by 'into' in descending order, and slice the first row.
library(dplyr)
ppl %>%
group_by(grp) %>%
mutate(sample = toString(sort(unique(sample)))) %>%
arrange(desc(into)) %>%
slice(1L)
# mz rt into sample tracker sn grp
# <dbl> <int> <dbl> <chr> <int> <dbl> <int>
#1 100.0171 127 2.972048 2, 3 5308 7.718521 0
#2 100.0788 272 30.217969 2, 3 5309 19.024807 1
#3 101.0043 128 39.676014 2, 3 5316 28.234918 2
A data.table alternative:
library(data.table)
setkey(setDT(ppl),grp)
ppl <- ppl[ppl[,into==max(into),by=grp]$V1,]
## mz rt into sample tracker sn grp
##1: 100.0171 127 2.972048 2 5308 7.718521 0
##2: 100.0788 272 30.217969 2 5309 19.024807 1
##3: 101.0043 128 39.676014 2 5316 28.234918 2
I have no idea why this code would work
for (i in ppl$grp) {
temp<-ppl[ppl$grp == i,]
sel<-rbind(sel,temp[max(temp$into),])
}
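max(temp$into) returns the maximum value, not its position, and that value is not an integer in most cases. For grp 0 the maximum is 2.972048, which R truncates to the index 2, and row 2 happens to be the right one. For the other groups the value (30.217969, for example) points past the end of temp, so you get NA rows. which.max(temp$into) returns the position instead.
Also, building a data.frame with rbind in every iteration of a for loop is not good practice (in any language). It requires quite a bit of type checking and array growing that can get very expensive.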
Also, max will return NA when there are any NAs for that group.
There is also a question about what you want to do about ties? Do you just want one result or all of them? The code Akrun gives will give you all of them.
This code will write a new column that has the group max
ppl$grpmax <- ave(ppl$into, ppl$grp, FUN=function(x) { max(x, na.rm=TRUE ) } )
You can then select all values in a group that are equal to the max with
pplmax <- subset(ppl, into == grpmax)
If you want just one per group then you can remove duplicates
pplmax[!duplicated(pplmax$grp),]
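For reference, the loop from the question could be repaired along the lines below. This is a base R sketch on the six sample rows from the question. It takes the row position from which.max and binds the per-group results once instead of growing sel inside a loop. In case of ties, which.max keeps only the first maximum.

```r
# The six sample rows from the question
ppl <- data.frame(
  mz      = c(100.0153, 100.0171, 100.0788, 100.0796, 101.0042, 101.0043),
  rt      = c(126, 127, 272, 272, 128, 128),
  into    = c(2.762664, 2.972048, 30.217969, 17.277916, 37.557324, 39.676014),
  sample  = c(3, 2, 2, 3, 3, 2),
  tracker = c(11908, 5308, 5309, 11910, 11916, 5316),
  sn      = c(7.522655, 7.718521, 19.024807, 7.297716, 27.991320, 28.234918),
  grp     = c(0, 0, 1, 1, 2, 2)
)

# Split into one data frame per group, keep the row with the largest
# intensity in each, and bind the results together in a single step
sel <- do.call(rbind, lapply(split(ppl, ppl$grp), function(g) {
  g[which.max(g$into), ]
}))
sel
```

Here sel holds the rows with tracker 5308, 5309 and 5316, matching the expected output in the question apart from the combined sample column.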

How to join data.tables when one is a lookup table?

I'm having trouble applying a simple data.table join example to a larger (10GB) data set. merge() works just fine on data.frames with the larger dataset, although I'd love to take advantage of the speed in data.table. Could anyone point out what I'm misunderstanding about data.table (and the error message in particular)?
Here is the simple example (derived from this thread: Join of two data.tables fails).
# The data of interest.
(DT <- data.table(id = c(rep(1154:1155, 2), 1160),
price = c(1.99, 2.50, 15.63, 15.00, 0.75),
key = "id"))
id price
1: 1154 1.99
2: 1154 15.63
3: 1155 2.50
4: 1155 15.00
5: 1160 0.75
# Lookup table.
(lookup <- data.table(id = 1153:1160,
version = c(1,1,3,4,2,1,1,2),
yr = rep(2006, 8),
key = "id"))
id version yr
1: 1153 1 2006
2: 1154 1 2006
3: 1155 3 2006
4: 1156 4 2006
5: 1157 2 2006
6: 1158 1 2006
7: 1159 1 2006
8: 1160 2 2006
# The desired table. Note: lookup[DT] works as well.
DT[lookup, allow.cartesian = T, nomatch=0]
id price version yr
1: 1154 1.99 1 2006
2: 1154 15.63 1 2006
3: 1155 2.50 3 2006
4: 1155 15.00 3 2006
5: 1160 0.75 2 2006
The larger data set consists of two data.frames: temp.3561 (the dataset of interest) and temp.versions (the lookup dataset). They have the same structure as DT and lookup (above), respectively. Using merge() works well, however my application of data.table is clearly flawed:
# Merge data.frames: works just fine
long.merged <- merge(temp.versions, temp.3561, by = "id")
# Convert the data.frames to data.tables
DTtemp.3561 <- as.data.table(temp.3561)
DTtemp.versions <- as.data.table(temp.versions)
# Merge the data.tables: doesn't work
setkey(DTtemp.3561, id)
setkey(DTtemp.versions, id)
DTlong.merged <- merge(DTtemp.versions, DTtemp.3561, by = "id")
Error in vecseq(f__, len__, if (allow.cartesian) NULL else as.integer(max(nrow(x), :
Join results in 11277332 rows; more than 7946667 = max(nrow(x),nrow(i)). Check for duplicate
key values in i, each of which join to the same group in x over and over again. If that's ok,
try including `j` and dropping `by` (by-without-by) so that j runs for each group to avoid the
large allocation. If you are sure you wish to proceed, rerun with allow.cartesian=TRUE.
Otherwise, please search for this error message in the FAQ, Wiki, Stack Overflow and datatable-
help for advice.
DTtemp.versions has the same structure as lookup (in the simple example), and the key "id" consists of 779,473 unique values (no duplicates).
DTtemp.3561 has the same structure as DT (in the simple example) plus a few other variables, but its key "id" only has 829 unique values despite the 7,946,667 observations (lots of duplicates).
Since I'm just trying to add version numbers and years from DTtemp.versions to each observation in DTtemp.3561, the merged data.table should have the same number of observations as DTtemp.3561 (7,946,667). Specifically, I don't understand why merge() generates "excess" observations when using data.table but not when using data.frame.
Likewise
# Same error message, but with 12,055,777 observations
altDTlong.merged <- DTtemp.3561[DTtemp.versions]
# Same error message, but with 11,277,332 observations
alt2DTlong.merged <- DTtemp.versions[DTtemp.3561]
Including allow.cartesian=T and nomatch=0 doesn't drop the "excess" observations.
Oddly, if I truncate the dataset of interest to have 10 observations, merge() works fine on both data.frames and data.tables.
# Merge short DF: works just fine
short.3561 <- temp.3561[-(11:7946667),]
short.merged <- merge(temp.versions, short.3561, by = "id")
# Merge short DT
DTshort.3561 <- data.table(short.3561, key = "id")
DTshort.merged <- merge(DTtemp.versions, DTshort.3561, by = "id")
I've been through the FAQ (http://datatable.r-forge.r-project.org/datatable-faq.pdf, and 1.12 in particular). How would you suggest thinking about this?
Could anyone point out what I'm misunderstanding about data.table (and the error message in particular)?
Answering your question directly: the error message
Join results in 11277332 rows; more than 7946667 = max(nrow(x),nrow(i)). Check for duplicate key values in i...
states that the result of your join has more rows than the usual cases would produce. This means the lookup table key has duplicates, which results in multiple matches in the join.
If this doesn't answer your question, you should restate it.
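One way to check this diagnosis is to count the key values in the lookup table before joining. The sketch below uses a small made-up lookup in which id 1155 appears twice. The tables and the names lookup_u, joined and joined_u are illustrative, not the asker's real data.

```r
library(data.table)

# Data of interest: repeated ids are expected here
DT <- data.table(id = c(1154, 1154, 1155, 1155, 1160),
                 price = c(1.99, 15.63, 2.50, 15.00, 0.75))

# Made-up lookup table in which id 1155 is (wrongly) listed twice
lookup <- data.table(id = c(1154, 1155, 1155, 1160),
                     version = c(1, 3, 4, 2))

# Key values that occur more than once in the lookup table
lookup[, .N, by = id][N > 1]

# Each DT row with id 1155 matches two lookup rows, so the join grows
joined <- merge(DT, lookup, by = "id", allow.cartesian = TRUE)
nrow(joined)

# With one lookup row per id, the result has as many rows as DT again
lookup_u <- unique(lookup, by = "id")
joined_u <- merge(DT, lookup_u, by = "id")
nrow(joined_u) == nrow(DT)
```

unique(lookup, by = "id") simply keeps the first row per id. Whether that is the right row to keep is a separate question about the data.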
