Double entries in dataframe after merge - r

Hello, I have a problem with merging two dataframes with each other.
The goal is to merge them so that each date has the corresponding values; if there is no corresponding value, I want to replace NA with 0.

My code:

names(FiresNearLA.ab.03)[1] <- "Date.Local"
U.NO2.ab.03 <- unique(NO2.ab.03)  # NO2.ab.03 contains all values multiple times
ind <- merge(FiresNearLA.ab.03, U.NO2.ab.03, all = TRUE, all.x = TRUE)
ind[is.na(ind)] <- 0
So far, so good, and the first lines look the way they are supposed to. But beginning with 2004-04-24, all dates are doubled, and weird values appear in the second NO2.Mean column.
U.NO2.ab.03 table:

    Date.Local  NO2.Mean
361 2004-03-31 30.217391
365 2004-04-24 50.000000
366 2004-04-25 47.304348
370 2004-04-26 50.913043
374 2004-04-27 41.157895
ind table:

    Date.Local FIRE_SIZE F.number.n_fires  NO2.Mean
113 2004-04-22     34.30               10 13.681818
114 2004-04-23     45.00               13 17.222222
115 2004-04-24     55.40               22 28.818182
116 2004-04-24     55.40               22 50.000000
117 2004-04-25   2306.85               15 47.304348
118 2004-04-25   2306.85               15 21.090909
Why are there values in NO2.Mean for 2004-04-22 and 2004-04-23 if they should be 0? And why are all dates doubled after the 24th, and where do the second values come from?
Thank you

So I managed to merge your data:
FiresNearLA.ab.03 <- dget("FiresNearLA.ab.03.txt", keep.source = FALSE)
U.NO2.ab.03 <- dget("NO2.ab.03.txt", keep.source = FALSE)
ind <- merge(FiresNearLA.ab.03,
             U.NO2.ab.03,
             all = TRUE,
             by.x = "DISCOVERY_DATEymd",
             by.y = "Date.Local")
As a side note: usually you share a small sample of your data on Stack Overflow, not the whole thing. In your case, dput(FiresNearLA.ab.03[1:50, ]) and then copying and pasting from the console into the question would have been sufficient.
Back to your problem: the duplication already happens in NO2.ab.03, where a number of dates and values occur twice or more. The easiest way to solve this (in my experience) is the package data.table, which has a duplicated() method that is more straightforward and also faster:
library(data.table)

# Test duplicated occurrences in U.NO2.ab.03
> table(duplicated(U.NO2.ab.03, by = c("DISCOVERY_DATEymd", "NO2.Mean")))

FALSE  TRUE
 7767 27308

> nrow(ind)
[1] 35229

# Remove duplicated rows from the data frame
> ind <- ind[!duplicated(ind, by = c("DISCOVERY_DATEymd", "NO2.Mean")), ]
> nrow(ind)
[1] 7921
After these steps, you should be fine :)
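For reference, the mechanism behind the doubling: when the same date occurs more than once in one of the tables, merge() returns one output row per matching pair. A minimal sketch with values lifted from your ind table:

# Toy example: 2004-04-24 and 2004-04-25 each appear twice in y,
# so every matching row of x comes back twice after the merge
x <- data.frame(Date.Local = c("2004-04-24", "2004-04-25"),
                FIRE_SIZE  = c(55.40, 2306.85))
y <- data.frame(Date.Local = c("2004-04-24", "2004-04-24",
                               "2004-04-25", "2004-04-25"),
                NO2.Mean   = c(28.818182, 50.000000,
                               47.304348, 21.090909))
merge(x, y, all = TRUE)  # 4 rows, not 2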

I got the answer: the original data source of NO2.ab.03 was faulty.
As JonGrub suggested, the problem was within NO2.ab.03: for some days it had two different NO2.Mean values corresponding to the same date. I deleted these rows and now it is working well. Thank you again for the help and the great advice.

Related

R function or loop for repeatedly selecting rows that meet a condition, saving as separate object, and renaming column headers

I have 16 large datasets of landcover variables around routes. Example dataset "Trial1":
 RtNo   TYPE      CA  PLAND  NP     PD    LPI     TE
 2001 cls_11  996.57 6.4297  22 0.1419 6.3055  31080
 2010 cls_11   56.34 0.3654  23 0.1492 0.1669  15480
18003 cls_11  141.12 0.9899  37 0.2596 0.1503  38700
18014 cls_11  797.58 5.3499  47 0.3153 1.3969  98310
 2001 cls_21 1514.97 9.7744 592 3.8195 0.8443 761670
 2010 cls_21  638.55 4.1414  95 0.6161 0.7489 463260
18003 cls_21  904.68 6.3463 612 4.2931 0.8769 549780
18014 cls_21 1189.89 7.9814 759 5.0911 0.4123 769650
 2001 cls_22  732.33 4.7249 653 4.2131 0.7212 377430
 2010 cls_22   32.31 0.2096 168 1.0896 0.0198  31470
18003 cls_22  275.85 1.9351 781 5.4787 0.0423 237390
18014 cls_22  469.44 3.1488 104 6.7345 0.1014 377580
I want to first select rows that meet a condition, for example, all rows where column "TYPE" is cls_21. I know the following code does this:

Trial21 <- subset(Trial1, TYPE == " cls_21 ")

(Yes, the invisible space before and after the categorical variable caused me considerable headache.)
There are several other ways of doing this, as shown in https://stackoverflow.com/questions/5391124/select-rows-of-a-matrix-that-meet-a-condition
I get the following output (sorry, this one has extra columns, but that shouldn't affect my question):

    RtNo   TYPE      CA  PLAND  NP     PD    LPI     TE      ED     LSI
2  18003 cls_21  904.68 6.3463 612 4.2931 0.8769 549780 38.5668 46.1194
18 18014 cls_21 1189.89 7.9814 759 5.0911 0.4123 769650 51.6255 56.2522
34  2001 cls_21 1514.97 9.7744 592 3.8195 0.8443 761670 49.1418 49.3462
50  2010 cls_21  638.55 4.1414  95 0.6161 0.7489 463260 30.0457 46.0118
62  2020 cls_21  625.50 4.1165 180 1.1846 0.5064 384840 25.3268 38.6407
85  2021 cls_21  503.55 2.7926 214 1.1868 0.1178 348330 19.3175 38.9267
I want to rename the columns in this subset so they uniquely identify the class, by appending "L21" to the existing column names, and I can do this using
library(data.table)
setnames(Trial21, old = c('CA', 'PLAND', 'NP', 'PD', 'LPI', 'TE', 'ED', 'LSI'),
         new = c('CAL21', 'PLANDL21', 'NPL21', 'PDL21', 'LPIL21', 'TEL21', 'EDL21', 'LSIL21'))
I want help developing a function or a loop that automates this process, so I don't have to spend days repeating the same code for 15 different classes and 16 datasets (240 times), and to decrease the risk of errors. I may have to do the same for additional datasets. Any help to speed up the process will be greatly appreciated.
You could do:
a <- split(df, df$TYPE)
b <- sapply(names(a),
            function(x) setNames(a[[x]], paste0(names(a[[x]]), sub(".*_", 'L', x))),
            simplify = FALSE)
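b is then a named list with one renamed data frame per TYPE. A hedged usage sketch (the list keys keep the stray spaces from TYPE; the cleaned-up object names below are an assumption):

head(b[[" cls_21 "]])                              # access one subset; note the spaces in the key
names(b) <- sub(".*_", "Trial", trimws(names(b)))  # e.g. " cls_21 " -> "Trial21"
list2env(b, envir = .GlobalEnv)                    # materialize Trial21, Trial22, ... as objects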
You can use ls() to get the variable names of the datasets, manipulate them as you wish inside a loop with the get() function, and then create the new datasets with assign():
library(dplyr)  # for %>% and rename_with()

sets = grep("Trial", ls(), value = TRUE)  # assuming every dataset has "Trial" in the name
for (i in sets) {
  classes = unique(get(i)$TYPE)
  for (j in classes) {
    # extract just the number, e.g. " cls_21 " -> "21"; this might be an overly
    # complicated way of doing it, you can look for better options if you want
    number = gsub("(.+)([0-9]{2})( )", "\\2", j)
    assign(paste0("Trial", number),
           subset(get(i), TYPE == j) %>% rename_with(function(x) paste0(x, number)))
  }
}
Here is a start that should work for your example:
library(dplyr)
myfilter <- function(data, number) {
  data %>%
    filter(TYPE == sprintf(" cls_%s ", number)) %>%
    rename_with(\(x) sprintf("%s%s", x, number), !1:2)
}
myfilter(example_data, 21)
Given a list of numbers (here: 21 to 31) you could then automatically use them to filter a single dataframe:
multifilter <- function(data) {
  purrr::map(21:31, \(i) myfilter(data, i))
}
multifilter(example_data)
Finally, given a list of dataframes, you can automatically apply the filters to them:
purrr::map(list_of_dataframes, multifilter)
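A hedged sketch for building list_of_dataframes, assuming all 16 datasets sit in the global environment and have "Trial" in their names:

list_of_dataframes <- mget(grep("^Trial", ls(), value = TRUE))  # collect the datasets into a named list
results <- purrr::map(list_of_dataframes, multifilter)          # one list of filtered subsets per dataset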

Merging specific rows by summing certain columns on grouping variables

The following dataframe is a subset of a bigger df, which contains duplicated information
df <- data.frame(Caught    = c(92, 134, 92, 134),
                 Discarded = c(49, 47, 49, 47),
                 Units     = c(170, 170, 220, 220),
                 Hours     = c(72, 72, 72, 72),
                 Colour    = c("red", "red", "red", "red"))
In Base R, I would like to get the following:
df_result <- data.frame(Caught    = 226,
                        Discarded = 96,
                        Units     = 390,
                        Hours     = 72,
                        Colour    = "red")

So basically the result is the sum of the unique values of the columns Caught, Discarded and Units, keeping the single value for Hours and Colour (Caught = 92 + 134, Discarded = 49 + 47, Units = 170 + 220, Hours = 72, Colour = "red").
However, I intend to do this in a much bigger data.frame with many more columns. My idea was to apply a function based on the column names, like this:
l <- lapply(df, function(x) {
  if (names(x) %in% c("Caught", "Discarded", "Units"))
    sum(unique(x))
  else
    unique(x)
})
as.data.frame(l)
However, this does not work, as I am not entirely sure how to extract the vector names when using lapply() and similar functions. I have also tried, without success, to implement by() and apply().
Thanks
Since you're asking for base R:
l <- lapply(df, function(n) {
  if (is.numeric(n))
    sum(unique(n))
  else
    unique(n)
})
as.data.frame(l)
This solution takes advantage of the fact that data.frames are really just lists of vectors.
It produces this:
#   Caught Discarded Units Hours Colour
#      226        96   390    72    red
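If you do want to branch on the column names, as in the question, a minimal sketch is to iterate over names(df) instead of the columns themselves, so the name is available inside the function:

sum_cols <- c("Caught", "Discarded", "Units")
l <- lapply(names(df), function(nm) {
  if (nm %in% sum_cols)
    sum(unique(df[[nm]]))  # sum the distinct values
  else
    unique(df[[nm]])       # keep the single distinct value
})
names(l) <- names(df)
as.data.frame(l)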
A proposition:
df <- data.frame(Caught    = c(92, 134, 92, 134),
                 Discarded = c(49, 47, 49, 47),
                 Units     = c(170, 170, 220, 220),
                 Hours     = c(72, 72, 72, 72),
                 Colour    = c("red", "red", "red", "red"))
df
#>   Caught Discarded Units Hours Colour
#> 1     92        49   170    72    red
#> 2    134        47   170    72    red
#> 3     92        49   220    72    red
#> 4    134        47   220    72    red
df_results <- data.frame(Caught    = sum(unique(df$Caught)),
                         Discarded = sum(unique(df$Discarded)),
                         Units     = sum(unique(df$Units)),
                         Hours     = unique(df$Hours),
                         Colour    = unique(df$Colour))
df_results
#>   Caught Discarded Units Hours Colour
#> 1    226        96   390    72    red
# Created on 2021-02-23 by the reprex package (v0.3.0.9001)
Regards,

R - for loop - two columns whose values depend on each other

Apologies - I'm used to working with Excel/Minitab/SQL, where this kind of thing works differently.
I have an Excel dataset with columns "Date" (column A), "Pallets" (column B), "Lt" (column K) and "Tt" (column L), where the values in "Lt" and "Tt" depend on each other, plus fixed parameters alpha and beta, which are stored in cells T3 and T4 respectively.
In Excel, I simply enter the formulae for Lt and Tt in cells "K6" and "L6" respectively and drag these down. Excel then updates both columns simultaneously to reach the correct value. The formulae are =$T$3*K5+(1-$T$3)*(K5+L5) and =$T$4*(C6-C5)+(1-$T$4)*L5 respectively.
However, in R, I have tried updating the values of both columns using a for loop:
for (i in 3:368)
  df[i, "Lt"] <- alpha*df[i-1, "Lt"] + (1-alpha)*(df[i-1, "Lt"] + df[i-1, "Tt"])

for (i in 3:368)
  df[i, "Tt"] <- beta*(df[i, "Pallets"] - df[i-1, "Pallets"]) + (1-beta)*df[i-1, "Tt"]
The problem is that the two loops update the columns separately, so the columns don't interact with each other as they update, and I end up with two not-quite-correct columns.
The values of alpha and beta are 269 and 0.787890411 respectively. In Excel, I get:
Date        Pallets        Lt        Tt
01/01/2011      491
02/01/2011      385       269      0.79
03/01/2011      662  269.7879  0.843133
04/01/2011       28  270.6298  0.843133
05/01/2011       46  271.4718  0.843132
06/01/2011      403  272.3156  0.843132
07/01/2011      282  273.1588  0.843133
08/01/2011      315  274.0021  0.843133
Whereas in R, because the two columns don't update simultaneously, I get different values for Lt and Tt each time I re-run either loop. Currently I have:
         Date Pallets       Lt        Tt
1  28/12/2011     491       NA        NA
2  29/12/2011     385 269.0000 0.7878904
3  30/12/2011     662 269.7879 0.8431328
4  31/12/2011      28 270.6310 0.7161642
5  01/01/2012      46 271.3472 0.7196210
6  02/01/2012     403 272.0668 0.7908770
7  03/01/2012     282 272.8577 0.7665189
8  04/01/2012     315 273.6242 0.7729656
9  05/01/2012     327 274.3971 0.7752110
10 06/01/2012     458 275.1724 0.8012559
How can I get both columns to update and reflect each other, as happens automatically in Minitab or Excel?
Combine the for loops:
for (i in seq(3, nrow(df), by = 1)) {
  df[i, "Lt"] <- alpha*df[i-1, "Lt"] + (1-alpha)*(df[i-1, "Lt"] + df[i-1, "Tt"])
  df[i, "Tt"] <- beta*(df[i, "Pallets"] - df[i-1, "Pallets"]) + (1-beta)*df[i-1, "Tt"]
}
Also, you shouldn't use hard-coded numbers for things like row counts. Those can change, and it'd be a pain to constantly update the code when they do.
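A minimal self-contained version of the combined loop, with Pallets values taken from the question's table; the seed values in row 2 come from the question, while alpha and beta here are made-up stand-ins:

df <- data.frame(Pallets = c(491, 385, 662, 28, 46, 403, 282, 315),
                 Lt = NA_real_, Tt = NA_real_)
df$Lt[2] <- 269           # seed values from the question's second row
df$Tt[2] <- 0.787890411
alpha <- 0.8              # hypothetical smoothing parameters
beta  <- 0.1

for (i in seq(3, nrow(df))) {
  # both updates happen in the same iteration, so row i is always
  # computed from the fully updated row i-1, as in Excel
  df[i, "Lt"] <- alpha*df[i-1, "Lt"] + (1-alpha)*(df[i-1, "Lt"] + df[i-1, "Tt"])
  df[i, "Tt"] <- beta*(df[i, "Pallets"] - df[i-1, "Pallets"]) + (1-beta)*df[i-1, "Tt"]
}
df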

Store rows of a data.frame whose factor column matches the top 20 names by frequency

The data.frame (d1.csv) looks like:
Age Height Weight     Sport
 23    170     60      Judo
 33    193    125 Athletics
I have to make a new data.frame, d2, with only the top 20 sports, and shall use the character vector stored in names(top.20.sports):

names(top.20.sports)
[1] "Athletics" "Swimming"  "Football"  "Rowing"

... and I have to use match() or %in% with subset(), like subset(d1, subset = Sport %in% names(top.20.sports)).
I tried several things, but I'm new at this and am missing something...

d2 <- subset(d1, Sport %in% names(top.20.sports))

gives the whole list, same as

d2 <- d1[d1$Sport %in% names(top.20.sports), ]

match() gives me a bunch (42) of NAs, and

d2 <- d1[, tolower(names(top.20.sports)) %in% d1[, 4]]

gives a data frame with 0 columns and 9038 rows (9038 rows are correct, but where is the data?).
There was no error; as BondedDust told me: "If subset(d1, (Sport %in% names(top.20.sports))) gives the whole list then .... it is what it is. All of the Sport entries are in the top-20."

But it never was the whole list. I was thinking I had 10384 rows because the last printed row was

10384 24 221 110 Basketball

with Basketball as the last entry. But the row name is not the number of rows:

nrow(d2)
[1] 8009
dim(d2)
[1] 8009 4
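For completeness, a hedged sketch of the whole pipeline, under the assumption that top.20.sports is built from the frequency table of d1$Sport:

top.20.sports <- sort(table(d1$Sport), decreasing = TRUE)[1:20]  # 20 most frequent sports
d2 <- subset(d1, Sport %in% names(top.20.sports))
nrow(d2)  # the real row count; the printed row names still come from d1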

How to column bind and row bind a large number of data frames in R?

I have a large data set of vehicles. They were recorded every 0.1 seconds, so their IDs repeat in the 'Vehicle ID' column. In total there are 2169 vehicles. I filtered (smoothed) the 'Vehicle velocity' column for every vehicle (using a for loop), which resulted in a new column with the first and last 30 values removed (per vehicle). In order to bind it with the original data frame, I removed the first and last 30 rows of that too and then combined them using cbind(). This works for the last vehicle only. I want this smoothing and column binding for all vehicles, and finally I want to combine all the per-vehicle data frames into one single table; that means row-binding in sequence of vehicle IDs. This is what I wrote so far:
traj1 <- read.csv('trajectories-0750am-0805am.txt', sep=' ', header=F)
head(traj1)
names(traj1) <- c('Vehicle ID', 'Frame ID', 'Total Frames', 'Global Time', 'Local X',
                  'Local Y', 'Global X', 'Global Y', 'Vehicle Length', 'Vehicle width',
                  'Vehicle class', 'Vehicle velocity', 'Vehicle acceleration', 'Lane',
                  'Preceding Vehicle ID', 'Following Vehicle ID', 'Spacing', 'Headway')

# TIME COLUMN
Time <- sapply(traj1$'Frame ID', function(x) x/10)
traj1$'Time' <- Time

# SMOOTHING VELOCITY
smooth <- function(x, D, delta) {
  z <- exp(-abs(-D:D/delta))
  r <- convolve(x, z, type = 'filter') / convolve(rep(1, length(x)), z, type = 'filter')
  r
}

for (i in unique(traj1$'Vehicle ID')) {
  veh  <- subset(traj1, traj1$'Vehicle ID' == i)
  svel <- smooth(veh$'Vehicle velocity', 30, 10)
  svel <- data.frame(svel)
  veh  <- head(tail(veh, -30), -30)
  fta  <- cbind(veh, svel)
}
'fta' now only holds the data frame for the last vehicle, but I want all the per-vehicle data frames combined by row. Maybe a for loop is not the right way to do it, but I don't know how I could use tapply() (or any other apply function) to do so many things at the same time.

EDIT

I can't reproduce my dataset here, but the 'Orange' dataset in R provides a good analogy. Using the same smoothing function, the for loop would look like this (if the 'age' column is smoothed and the 'Tree' column is equivalent to my 'Vehicle ID' column):
for (i in unique(Orange$Tree)) {
  tre  <- subset(Orange, Orange$'Tree' == i)
  age2 <- round(smooth(tre$age, 2, 0.67), digits = 2)
  age2 <- data.frame(age2)
  tre  <- head(tail(tre, -2), -2)
  comb <- cbind(tre, age2)
}
Umair, I am not sure I understood what you want. If I understood right, you want to combine all the results by row. To do that, you could save all the results in a list and then do.call() an rbind:
comb <- list()  # create a list to save the results
length(comb) <- length(unique(Orange$Tree))

## Your loop for smoothing:
for (i in 1:length(unique(Orange$Tree))) {
  tre  <- subset(Orange, Tree == unique(Orange$Tree)[i])
  age2 <- round(smooth(tre$age, 2, 0.67), digits = 2)
  age2 <- data.frame(age2)
  tre  <- head(tail(tre, -2), -2)
  comb[[i]] <- cbind(tre, age2)  # save the result in the list
}

final.data <- do.call("rbind", comb)  # combine all results by row
This will give you:
   Tree  age circumference    age2
3     1  664            87  687.88
4     1 1004           115  982.66
5     1 1231           120 1211.49
10    2  664           111  687.88
11    2 1004           156  982.66
12    2 1231           172 1211.49
17    3  664            75  687.88
18    3 1004           108  982.66
19    3 1231           115 1211.49
24    4  664           112  687.88
25    4 1004           167  982.66
26    4 1231           179 1211.49
31    5  664            81  687.88
32    5 1004           125  982.66
33    5 1231           142 1211.49
Just for fun, a different way to do it using plyr::ddply and sapply with split:
library(plyr)
data <- ddply(Orange, .(Tree), tail, n = -2)
data <- ddply(data, .(Tree), head, n = -2)
data <- cbind(data,
              age2 = matrix(sapply(split(Orange$age, Orange$Tree),
                                   smooth, D = 2, delta = 0.67),
                            ncol = 1, byrow = FALSE))
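Transferred back to the vehicle data, the split/lapply/do.call pattern would look roughly like this (a sketch, assuming traj1 and smooth() as defined in the question; the filter drops 30 values at each end, so each piece is trimmed to match):

fta_list <- lapply(split(traj1, traj1$'Vehicle ID'), function(veh) {
  svel <- smooth(veh$'Vehicle velocity', 30, 10)   # smoothed velocity, 60 values shorter
  cbind(head(tail(veh, -30), -30), svel = svel)    # trim veh to the same length, then bind
})
fta <- do.call(rbind, fta_list)                    # row-bind in Vehicle ID order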
