Why is dplyr collapsing my whole data frame and not grouping it - r

I have been researching this for a while and I can't seem to find the issue. I use dplyr regularly, but all of a sudden I am getting odd output from the group_by/summarise combination.
I have a large dataset and I am trying to summarise it using the following:
dataAgg <- dataRed %>% group_by(ClmNbr, SnapshotDay, Pre2016) %>%
filter(SnapshotDay == '30'| SnapshotDay == '90') %>%
summarise(
NumFeat = sum(FeatureNbr),
TotInc = sum(IncSnapshotDay),
TotDelta = sum(InctoFinal),
TotPaid = sum(FinalPaid)
)
The setup of the data frame is below:
'data.frame': 123819 obs. of 8 variables:
$ ClmNbr : Factor w/ 33617 levels "14-00765132",..: 2162 2163 2163 2164 1842 2287 27 27 27 28 ...
$ SnapshotDay : Factor w/ 3 levels "7","30","90": 1 1 1 1 1 1 1 1 1 1 ...
$ Pre2016 : Factor w/ 2 levels "Post2016","Pre2016": 2 2 2 2 2 2 2 2 2 2 ...
$ FeatureNbr : int 6 2 3 3 6 2 4 5 6 5 ...
$ IncSnapshotDay: num 5000 77 5000 4500 77 2200 1800 1100 1800 25000 ...
$ FinalPaid : num 442 0 15000 5000 0 ...
$ InctoFinal : num -4558 -77 10000 500 -77 ...
$ TimeDelta : num 25.833 2.833 2.833 0.833 1.833 ...
When I execute the code, I get 1 obs. of 4 variables; there is no grouping applied.
'data.frame': 1 obs. of 4 variables:
$ NumFeat : int 287071
$ TotInc : num NA
$ TotDelta: num NA
$ TotPaid : num 924636433
I used to do this all the time without problems.
I could use aggregate, but sometimes, I am mixing and matching functions based on the column so it does not always work.
What am I doing wrong?

So, after a bit of research and some experimentation, the order of the library load matters. The original order was the following:
library(RODBC)
library(dplyr)
library(DT)
library(reshape2)
library(ggplot2)
library(scales)
library(caret)
library(markovchain)
library(knitr)
library(Metrics)
library(RColorBrewer)
However, ggplot2 loads in plyr as a dependency, and once plyr is attached after dplyr, plyr's summarise() masks dplyr's and ignores the grouping. To avoid this, the order should be revised to load dplyr last, which is what I used to do.
library(RODBC)
library(DT)
library(reshape2)
library(ggplot2)
library(scales)
library(caret)
library(markovchain)
library(knitr)
library(Metrics)
library(RColorBrewer)
library(dplyr)
Alternatively, as in Python, you can name the package explicitly when calling a function. In Python, we import libraries with the following syntax:
import numpy as np
Any numpy command is then referenced through the np. prefix, as in np.array(). The R equivalent is the package::function syntax.
Prefixing the commands with dplyr:: fixes the problem, as shown below.
dataAgg <- dataRed %>% dplyr::group_by(ClmNbr, SnapshotDay, Pre2016) %>%
dplyr::filter(SnapshotDay == '30'| SnapshotDay == '90') %>%
dplyr::summarise(
NumFeat = sum(FeatureNbr),
TotInc = sum(IncSnapshotDay),
TotDelta = sum(InctoFinal),
TotPaid = sum(FinalPaid)
)
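To check which package a bare summarise() resolves to in the current session, and to confirm that the namespaced calls group as expected, here is a minimal sketch. The toy data frame and its values are made up for illustration; only the column names echo the question.

```r
library(dplyr)

# Hypothetical toy data, not the real claims data set
toy <- data.frame(
  ClmNbr      = c("a", "a", "b", "b"),
  SnapshotDay = c("30", "30", "90", "7"),
  FeatureNbr  = 1:4
)

# Which package does a bare summarise() come from right now?
# If plyr was attached after dplyr, this reports "plyr" instead of "dplyr".
summarise_source <- environmentName(environment(summarise))

# Fully qualified calls are immune to masking by later library() calls
agg <- toy %>%
  dplyr::filter(SnapshotDay %in% c("30", "90")) %>%
  dplyr::group_by(ClmNbr, SnapshotDay) %>%
  dplyr::summarise(NumFeat = sum(FeatureNbr), .groups = "drop")

print(summarise_source)
print(agg)
```

Calling conflicts() after loading all packages lists every masked name, which is a quick way to spot this kind of clash before it bites.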

Related

R round correlate function from corrr package

I'm creating a correlation table using the correlate function in the corrr package. Here is my code and a screenshot of the output.
correlation_table <- corrr::correlate(salary_professor_dataset_cor_table,
method = "pearson")
correlation_table
I think this would look better and be easier to read if I could round off the values in the correlation table. I tried this code:
correlation_table <- round(corrr::correlate(salary_professor_dataset_cor_table,
method = "pearson"),2)
But I get this error:
Error in Math.data.frame(list(term = c("prof_rank_factor", "yrs.since.phd", : non-numeric variable(s) in data frame: term
The non-numeric variables part of this error message doesn't make sense to me. When I look at the structure I only see integer or numeric variable types.
'data.frame': 397 obs. of 6 variables:
$ prof_rank_factor : num 3 3 1 3 3 2 3 3 3 3 ...
$ yrs.since.phd : int 19 20 4 45 40 6 30 45 21 18 ...
$ yrs.service : int 18 16 3 39 41 6 23 45 20 18 ...
$ salary : num 139750 173200 79750 115000 141500 ...
$ sex_factor : num 1 1 1 1 1 1 1 1 1 2 ...
$ discipline_factor: num 2 2 2 2 2 2 2 2 2 2 ...
How can I clean up this correlation table with rounded values?
After returning the tibble output with correlate, apply round across the numeric columns only (the term column holds the variable names and is not numeric):
library(dplyr)
corrr::correlate(salary_professor_dataset_cor_table,
method = "pearson") %>%
mutate(across(where(is.numeric), ~ round(.x, digits = 2)))
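The error in the question comes from calling round() on a data frame that contains a character column. A minimal sketch on a made-up table (the data frame below is a hypothetical stand-in for the correlate() output, not real results) shows both the failure and the column-wise fix:

```r
library(dplyr)

# Hypothetical stand-in for a correlation table: one name column plus numeric columns
tab <- data.frame(
  term = c("x", "y"),
  x    = c(NA, 0.123456),
  y    = c(0.123456, NA)
)

# round() on the whole data frame fails because 'term' is not numeric
whole_frame_fails <- inherits(try(round(tab, 2), silent = TRUE), "try-error")

# Rounding only the numeric columns leaves 'term' untouched
rounded <- tab %>%
  mutate(across(where(is.numeric), ~ round(.x, digits = 2)))

print(whole_frame_fails)
print(rounded)
```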
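Alternatively, we can set the number of significant digits R uses for printing. Note that this is a session-wide display option and does not change the stored values: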
options(digits=2)
correlation_table <- corrr::correlate(salary_professor_dataset_cor_table,
method = "pearson")
correlation_table

unable to write to the csv file [duplicate]

I am trying to write a data frame in R to a text file; however, it returns the following error:
Error in if (inherits(X[[j]], "data.frame") && ncol(xj) > 1L)
X[[j]] <- as.matrix(X[[j]]) :
missing value where TRUE/FALSE needed
I used the following command for the export:
write.table(df, file ='dfname.txt', sep='\t' )
I have no idea what the problem could stem from. As far as "missing data where TRUE/FALSE is needed", I have only one column which contains TRUE/FALSE values, and none of these values are missing.
Contents of the dataframe:
> str(df)
'data.frame': 776 obs. of 15 variables:
$ Age : Factor w/ 4 levels "","A","J","SA": 2 2 2 2 2 2 2 2 2 2 ...
$ Sex : Factor w/ 2 levels "F","M": 1 1 1 1 2 2 2 2 2 2 ...
$ Rep : Factor w/ 11 levels "L","NR","NRF",..: 1 1 4 4 2 2 2 2 2 2 ...
$ FA : num 61.5 62.5 60.5 61 59.5 59.5 59.1 59.2 59.8 59.9 ...
$ Mass : num 20 19 16.5 17.5 NA 14 NA 23 19 18.5 ...
$ Vir1 : num 999 999 999 999 999 999 999 999 999 999 ...
$ Vir2 : num 999 999 999 999 999 999 999 999 999 999 ...
$ Vir3 : num 40 999 999 999 999 999 999 999 999 999 ...
$ Location : Factor w/ 4 levels "Loc1",..: 4 4 4 4 4 4 2 2 2 2 ...
$ Site : Factor w/ 6 levels "A","B","C",..: 5 5 5 5 5 5 3 3 3 3 ...
$ Date : Date, format: "2010-08-30" "2010-08-30" ...
$ Record : int 35 34 39 49 69 38 145 112 125 140 ...
$ SampleID : Factor w/ 776 levels "AT1-A-F1","AT1-A-F10",..: 525 524 527 528 529 526 111 78 88 110 ...
$ Vir1Inc : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
$ Month :'data.frame': 776 obs. of 2 variables:
..$ Dates: Date, format: "2010-08-30" "2010-08-30" ...
..$ Month: Factor w/ 19 levels "Apr-2011","Aug-2010",..: 2 2 2 2 2 2 18 18 18 18 ...
I hope I've given enough/the right information ...
Many thanks,
Heather
An example to reproduce the error. I create a nested data.frame:
Month=data.frame(Dates= as.Date("2003-02-01") + 1:15,
Month=gl(12,2,15))
dd <- data.frame(Age=1:15)
dd$Month <- Month
str(dd)
'data.frame': 15 obs. of 2 variables:
$ Age : int 1 2 3 4 5 6 7 8 9 10 ...
$ Month:'data.frame': 15 obs. of 2 variables:
..$ Dates: Date, format: "2003-02-02" "2003-02-03" "2003-02-04" ...
..$ Month: Factor w/ 12 levels "1","2","3","4",..: 1 1 2 2 3 3 4 4 5 5 ...
Now I try to save it, and I reproduce the error:
write.table(dd)
Error in if (inherits(X[[j]], "data.frame") && ncol(xj) > 1L)
X[[j]] <- as.matrix(X[[j]]) : missing value where TRUE/FALSE needed
Without investigating further, one option is to remove the nested data.frame:
write.table(data.frame(subset(dd,select=-c(Month)),unclass(dd$Month)))
The solution by agstudy provides a great quick fix, but there is a simple, more general alternative for which you do not have to specify which element(s) of your data.frame are nested.
The following bit is just copied from agstudy's solution to obtain the nested data.frame dd:
Month=data.frame(Dates= as.Date("2003-02-01") + 1:15,
Month=gl(12,2,15))
dd <- data.frame(Age=1:15)
dd$Month <- Month
You can use akhilsbehl's LinearizeNestedList() function (which mrdwab made available as a gist) to flatten (or linearize) the nested levels:
library(devtools)
source_gist(4205477) #loads the function
ddf <- LinearizeNestedList(dd, LinearizeDataFrames = TRUE)
# ddf is now a list with two elements (Age and Month)
ddf <- LinearizeNestedList(ddf, LinearizeDataFrames = TRUE)
# ddf is now a list with 3 elements (Age, `Month/Dates` and `Month/Month`)
ddf <- as.data.frame.list(ddf)
# transforms the flattened/linearized list into a data.frame
ddf is now a data.frame without nesting. However, its column names still reflect the nested structure:
names(ddf)
[1] "Age" "Month.Dates" "Month.Month"
If you want to change this (in this case it seems redundant to have Month. written before Dates, for example), you can use gsub with a regular expression that I copied from Sacha Epskamp to remove everything in each column name up to and including the last dot.
names(ddf) <- gsub(".*\\.","",names(ddf))
names(ddf)
[1] "Age" "Dates" "Month"
The only thing left now is exporting the data.frame as usual:
write.table(ddf, file="test.txt")
Alternatively, you could use the flatten function from the jsonlite package to flatten the data frame before export. It achieves the same result as the other functions mentioned and is much easier to implement.
jsonlite::flatten
https://rdrr.io/cran/jsonlite/man/flatten.html
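As a rough sketch of the jsonlite route, reusing the nested dd example from above (the temporary output file is an assumption made here so the example is self-contained):

```r
library(jsonlite)

# Rebuild the nested data.frame from the earlier example
Month <- data.frame(Dates = as.Date("2003-02-01") + 1:15,
                    Month = gl(12, 2, 15))
dd <- data.frame(Age = 1:15)
dd$Month <- Month

# flatten() turns the nested data.frame column into ordinary top-level columns
flat <- jsonlite::flatten(dd)
str(flat)

# The flattened data frame can now be exported as usual
out_file <- tempfile(fileext = ".txt")  # hypothetical output location
write.table(flat, file = out_file, sep = "\t")
```

If dplyr or purrr is also attached, keep the jsonlite:: prefix, since other packages export a function named flatten as well.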

GGPLOT: Printing Stacked Bar Chart & Line to File

I know that it might not look like it from this question, but I've been programming for over 20 years; I'm just new to R. I'm trying to move away from Excel and automate the creation of about 100 charts that I currently make in Excel by hand. I've asked two previous questions about this. The solutions work for those toy examples, but when I try the exact same code in my own full program, it behaves very differently and I'm completely befuddled as to why. When I run the program below, the testplot.png file shows only the line, without the stacked bar chart.
So here is my (full) code as cut down as I can make it. If anyone wants to critique my programming, go ahead. I know that the comments are light, but that's to try to shorten it for this post. Also, this does actually download the USDA PSD database which is about 20MB compressed and is 170MB uncompressed...sorry but I would love someone's help on this!
Edit, here are str() outputs of both 'full' data and 'toy' data. The toy data works, the full data doesn't.
> str(melteddata)
Classes ‘data.table’ and 'data.frame': 18 obs. of 3 variables:
$ Year : int 1 2 3 4 5 6 1 2 3 4 ...
$ variable: Factor w/ 3 levels "stocks","exports",..: 1 1 1 1 1 1 2 2 2 2 ...
$ Qty : num 2 4 3 2 4 3 4 8 6 4 ...
- attr(*, ".internal.selfref")=<externalptr>
> str(SoySUHist)
Classes ‘data.table’ and 'data.frame': 159 obs. of 3 variables:
$ Year : int 1964 1965 1966 1967 1968 1969 1970 1971 1972 1973 ...
$ variable: Factor w/ 3 levels "Stocks","DomCons",..: 1 1 1 1 1 1 1 1 1 1 ...
$ Qty : num 0.0297 0.0356 0.0901 0.1663 0.3268 ...
- attr(*, ".internal.selfref")=<externalptr>
> str(linedata)
Classes ‘data.table’ and 'data.frame': 6 obs. of 2 variables:
$ Year: int 1 2 3 4 5 6
$ Qty : num 15 16 15 16 15 16
- attr(*, ".internal.selfref")=<externalptr>
> str(SoyProd)
Classes ‘data.table’ and 'data.frame': 53 obs. of 2 variables:
$ Year: int 1964 1965 1966 1967 1968 1969 1970 1971 1972 1973 ...
$ Qty : num 701 846 928 976 1107 ...
- attr(*, ".internal.selfref")=<externalptr>
>
library(data.table)
library(ggplot2)
library(ggthemes)
library(plyr)
toyplot <- function(plotdata,linedata){
plotCExp <- ggplot(plotdata) +
geom_bar(aes(x=Year,y=Qty,factor=variable,fill=variable), stat="identity") +
geom_line(data=linedata, aes(x=Year,y=Qty)) # <---- comment out this line & the stack plot works
ggsave(plotCExp,filename = "ggsavetest.png", width=7, height=5, units="in")
}
convertto <- function(value,crop,unit='BU'){
if (unit=='BU' & ( crop=='WHEAT' | crop=='SOYBEANS')){
value = value * 36.7437
}
return(value)
}
# =====================================
# Download Data (Warning...large download!)
# =====================================
system("curl https://apps.fas.usda.gov/psdonline/download/psd_alldata_csv.zip | funzip > DATA/psd.csv")
tmp <- fread("DATA/psd.csv")
PSD = data.table(tmp)
rm(tmp)
setkey(PSD,Country_Code,Commodity_Code,Attribute_ID)
tmp=unique(PSD[,.(Commodity_Description,Attribute_Description,Commodity_Code,Attribute_ID)])
tmp[order(Commodity_Description)]
names(PSD)[names(PSD) == "Market_Year"] = "Year"
names(PSD)[names(PSD) == "Value"] = "Qty"
PSDCmdtyAtt = unique(PSD[,.(Commodity_Code,Attribute_ID)])
# Soybean Production, Consumption, Stocks/Use
SoyStocks = PSD[list("US",2222000,176),.(Year,Qty)] # Ending Stocks
SoyExp = PSD[list("US",2222000,88),.(Year,Qty)] # Exports
SoyProd = PSD[list("US",2222000,28),.(Year,Qty)] # Total Production
SoyDmCons = PSD[list("US",2222000,125),.(Year,Qty)] # Total Dom Consumption
SoyStocks$Qty = convertto(SoyStocks$Qty,"SOYBEANS","BU")/1000
SoyExp$Qty = convertto(SoyExp$Qty,"SOYBEANS","BU")/1000
SoyProd$Qty = convertto(SoyProd$Qty,"SOYBEANS","BU")/1000
SoyDmCons$Qty = convertto(SoyDmCons$Qty,"SOYBEANS","BU")/1000
# Stocks/Use
SoySUPlot <- SoyExp
names(SoySUPlot)[names(SoySUPlot) == "Qty"] = "Exports"
SoySUPlot$DomCons = SoyDmCons$Qty
SoySUPlot$Stocks = SoyStocks$Qty
SoySUHist <- melt(SoySUPlot,id.vars="Year")
SoySUHist$Qty = SoySUHist$value/1000
SoySUHist$value <- NULL
SoySUPlot$StocksUse = 100*SoySUPlot$Stocks/(SoySUPlot$DomCons+SoySUPlot$Exports)
SoySUPlot$Production = SoyProd$Qty/1000
SoySUHist$variable <- factor(SoySUHist$variable, levels = rev(levels(SoySUHist$variable)))
SoySUHist = arrange(SoySUHist,variable)
toyplot(SoySUHist,SoyProd)
All right, I'm feeling generous. Your example code contains a lot of fluff that should not be in a minimal reproducible example and your system call is not portable, but I had a look anyway.
The good news: Your code works as expected.
Let's plot only the bars:
ggplot(SoySUHist) +
geom_bar(aes(x=Year,y=Qty,factor=variable,fill=variable), stat="identity")
Now only the lines:
ggplot(SoySUHist) +
geom_line(data=SoyProd, aes(x=Year,y=Qty))
Now compare the scales of the y-axes. If you plot both together, the bars get plotted, but they are so small that you can't see them. You need to rescale:
ggplot(SoySUHist) +
geom_bar(aes(x=Year,y=Qty,factor=variable,fill=variable), stat="identity") +
geom_line(data=SoyProd, aes(x=Year,y=Qty/1000))
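A quick way to spot this kind of mismatch is to compare the value ranges of the two data sets before plotting. The sketch below uses small made-up data frames (hypothetical values, only roughly shaped like SoySUHist and SoyProd) and adds a secondary axis so the rescaled line can still be read in its original units; the secondary axis is an optional extra, not part of the fix above.

```r
library(ggplot2)

# Hypothetical toy data: bars are small numbers, the line is about 1000 times larger
bars <- data.frame(
  Year     = rep(1964:1966, times = 3),
  variable = rep(c("Stocks", "DomCons", "Exports"), each = 3),
  Qty      = c(0.03, 0.04, 0.09, 0.50, 0.60, 0.60, 0.20, 0.25, 0.26)
)
line <- data.frame(Year = 1964:1966, Qty = c(701, 846, 928))

# Compare the scales first: a large ratio means the bars will be invisible
print(range(bars$Qty))
print(range(line$Qty))

scale_factor <- 1000  # assumed conversion between the two units

p <- ggplot(bars) +
  geom_col(aes(x = Year, y = Qty, fill = variable)) +             # stacked bars
  geom_line(data = line, aes(x = Year, y = Qty / scale_factor)) + # rescaled line
  scale_y_continuous(
    sec.axis = sec_axis(~ . * scale_factor, name = "Production (original units)")
  )
```

Printing p (or passing it to ggsave) then shows both layers on a comparable scale.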

merge data frames "not a slot in class data.frame"

I use the book "A practical guide to geostatistical mapping" from T. Hengl, which also offers the code to reproduce the results. Unfortunately, loads of the code contained is deprecated or even defunct. I was able to restore most of the code, but now I'm stuck with something seemingly simple: merging two data frames. My error:
Error in (function (cl, name, valueClass) : ‘data’ is not a slot in class “data.frame”
Here the code to reproduce that error:
library(gstat)
library(rgdal)
library(sp)
# load the data:
data(meuse)
coordinates(meuse) <- ~x+y
proj4string(meuse) <- CRS("+init=epsg:28992")
download.file("http://spatial-analyst.net/book/system/files/meuse.zip", destfile=paste(getwd(), "meuse.zip", sep="/"))
grid.list <- c("ahn.asc", "dist.asc", "ffreq.asc", "soil.asc")
# unzip the maps in a loop:
for(j in grid.list){
fname <- unzip("meuse.zip", file=j)
print(fname)
file.copy(fname, paste("./", j, sep=""), overwrite=FALSE)
}
# load grids to R:
meuse.grid <- readGDAL(grid.list[1])
# fix the layer name:
names(meuse.grid)[1] <- sub(".asc", "", grid.list[1])
for(i in grid.list[-1]) {
meuse.grid@data[sub(".asc", "", i[1])] <- readGDAL(paste(i))$band1
}
names(meuse.grid)
proj4string(meuse.grid) <- CRS("+init=epsg:28992")
meuse.ov <- over(meuse, meuse.grid)
str(meuse.ov)
meuse.data <- meuse[c("zinc", "lime")]@data
str(meuse.data)
meuse.ov@data <- merge(meuse.ov, meuse.data)
This is really confusing, as both data frames (meuse.ov and meuse.data) seem identical in their structure:
> str(meuse.ov)
'data.frame': 155 obs. of 4 variables:
$ ahn : int 3214 3402 3277 3563 3406 3355 3428 3476 3522 3525 ...
$ dist : num 0.00136 0.01222 0.10303 0.19009 0.27709 ...
$ ffreq: int 1 1 1 1 1 1 1 1 1 1 ...
$ soil : int 1 1 1 2 2 2 2 1 1 2 ...
and
> str(meuse.data)
'data.frame': 155 obs. of 2 variables:
$ zinc: num 1022 1141 640 257 269 ...
$ lime: Factor w/ 2 levels "0","1": 2 2 2 1 1 1 1 1 1 1 ...
I tried resolving this by looking things up on Stack Overflow, but nothing worked. The (non-working) legacy code in the book suggested this (for context):
meuse.ov <- overlay(meuse.grid, meuse)
meuse.ov@data <- cbind(meuse.ov@data, meuse[c("zinc", "lime")]@data)
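Judging only from the str() output above, meuse.ov appears to be a plain data.frame already, which would explain why slot access fails. The sketch below uses made-up stand-ins for meuse.ov and meuse.data (hypothetical values, same shapes) to illustrate that assumption: a plain data.frame is not an S4 object, cbind() combines the two row by row, and merge() without a shared key column returns a Cartesian product instead.

```r
set.seed(1)

# Hypothetical stand-ins for meuse.ov and meuse.data (made-up values)
ov  <- data.frame(ahn = 1:155, dist = runif(155))
dat <- data.frame(zinc = runif(155),
                  lime = factor(sample(0:1, 155, replace = TRUE)))

# A plain data.frame is not an S4 object, so there is no data slot to assign to
print(isS4(ov))

# Row-wise combination keeps the 155 observations aligned
combined <- cbind(ov, dat)
print(dim(combined))

# merge() with no common column names crosses every row with every row
crossed <- merge(ov, dat)
print(nrow(crossed))
```

Whether the rows of the real meuse.ov and meuse.data are in the same order is something to verify before relying on cbind().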

