Pivottabler tallying number of cases in dataset, not their values - r

I am working through creating pivot tables with the pivottabler package to summarise frequencies of rock art classes by location. The data I am summarising come from published papers; I have them stored in an RDS file created in R, and they look like this:
> head(cyp_art_freq)
            Class                      Location value
1:     Figurative        Princess Charlotte Bay   347
2:          Track        Princess Charlotte Bay    35
3: Non-Figurative        Princess Charlotte Bay    18
4:     Figurative Mitchell-Palmer and Chillagoe   320
5:          Track Mitchell-Palmer and Chillagoe    79
6: Non-Figurative Mitchell-Palmer and Chillagoe  1002
> str(cyp_art_freq)
Classes ‘data.table’ and 'data.frame': 12 obs. of 3 variables:
 $ Class   : chr  "Figurative" "Track" "Non-Figurative" "Figurative" ...
 $ Location: chr  "Princess Charlotte Bay" "Princess Charlotte Bay" "Mitchell-Palmer and Chillagoe" ...
 $ value   : num  347 35 18 320 79 ...
 - attr(*, ".internal.selfref")=<externalptr>
The problem is that pivottabler does not sum the contents of the 'value' column. Instead, it counts the number of rows/cases, so, as the screenshot below shows, the resulting table reports a total of 12 cases when the result should run into the thousands. I think this relates to the 'value' column being a pre-computed count from a larger dataset. I've tried pivot_longer and pivot_wider, changed data types, and used CSVs instead of RDS for import (and more).
The code block I'm using for this data works with the built-in BHMtrains dataset and with my other datasets, but I suspect I can either tell pivottabler to tally the contents of the 'value' column, or expand the underlying dataset.
How can I ensure that the 'Count' columns actually sum the contents of the input 'value' column? I hope that is clear, and thanks for any suggestions on how to address this issue.
table01 <- PivotTable$new()
table01$addData(cyp_art_freq)
table01$addColumnDataGroups("Class", totalCaption = "Total")
table01$defineCalculation(calculationName = "Count", summariseExpression = "n()",
                          caption = "Count", visible = TRUE)
filterOverrides <- PivotFilterOverrides$new(table01, keepOnlyFiltersFor = "Count")
table01$defineCalculation(calculationName = "TOCTotal", filters = filterOverrides,
                          summariseExpression = "n()", caption = "TOC Total", visible = FALSE)
table01$defineCalculation(calculationName = "PercentageAllMotifs", type = "calculation",
                          basedOn = c("Count", "TOCTotal"),
                          calculationExpression = "values$Count/values$TOCTotal*100",
                          format = "%.1f %%", caption = "Percent")
table01$addRowDataGroups("Location")
table01$theme <- "compact"
table01$renderPivot()
table01$evaluatePivot()
(Screenshot: the pivot table returned from this code.)
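A likely fix, as a sketch (assuming the goal is to total the published counts rather than count rows): pivottabler's summariseExpression accepts dplyr::summarise() code, so replacing n() with sum(value) makes each cell sum the 'value' column:
table01 <- PivotTable$new()
table01$addData(cyp_art_freq)
table01$addColumnDataGroups("Class", totalCaption = "Total")
# sum(value) totals the pre-aggregated counts; n() was counting the 12 input rows
table01$defineCalculation(calculationName = "Count",
                          summariseExpression = "sum(value)", caption = "Count")
table01$addRowDataGroups("Location")
table01$renderPivot()
The same substitution applies to the TOCTotal calculation if the percentage column is kept.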

Related

R function or loop for repeatedly selecting rows that meet a condition, saving as separate object, and renaming column headers

I have 16 large datasets of landcover variables around routes. Example dataset "Trial1":
RtNo TYPE CA PLAND NP PD LPI TE
2001 cls_11 996.57 6.4297 22 0.1419 6.3055 31080
2010 cls_11 56.34 0.3654 23 0.1492 0.1669 15480
18003 cls_11 141.12 0.9899 37 0.2596 0.1503 38700
18014 cls_11 797.58 5.3499 47 0.3153 1.3969 98310
2001 cls_21 1514.97 9.7744 592 3.8195 0.8443 761670
2010 cls_21 638.55 4.1414 95 0.6161 0.7489 463260
18003 cls_21 904.68 6.3463 612 4.2931 0.8769 549780
18014 cls_21 1189.89 7.9814 759 5.0911 0.4123 769650
2001 cls_22 732.33 4.7249 653 4.2131 0.7212 377430
2010 cls_22 32.31 0.2096 168 1.0896 0.0198 31470
18003 cls_22 275.85 1.9351 781 5.4787 0.0423 237390
18014 cls_22 469.44 3.1488 104 6.7345 0.1014 377580
I want to first select rows that meet a condition, for example all rows where column "TYPE" is cls_21. I know the following code does this:
Trial21 <-subset(Trial1, TYPE==" cls_21 ")
(Yes, the invisible space before and after the categorical variable caused me considerable headache.)
There are several other ways of doing this, as shown in https://stackoverflow.com/questions/5391124/select-rows-of-a-matrix-that-meet-a-condition
I get the following output (sorry, this one has extra columns, but that shouldn't affect my question):
RtNo TYPE CA PLAND NP PD LPI TE ED LSI
2 18003 cls_21 904.68 6.3463 612 4.2931 0.8769 549780 38.5668 46.1194
18 18014 cls_21 1189.89 7.9814 759 5.0911 0.4123 769650 51.6255 56.2522
34 2001 cls_21 1514.97 9.7744 592 3.8195 0.8443 761670 49.1418 49.3462
50 2010 cls_21 638.55 4.1414 95 0.6161 0.7489 463260 30.0457 46.0118
62 2020 cls_21 625.5 4.1165 180 1.1846 0.5064 384840 25.3268 38.6407
85 2021 cls_21 503.55 2.7926 214 1.1868 0.1178 348330 19.3175 38.9267
I want to rename the columns in this subset so they uniquely identify the class, by appending "L21" to the existing column names, and I can do this using
library(data.table)
setnames(Trial21, old = c('CA', 'PLAND', 'NP', 'PD', 'LPI', 'TE', 'ED', 'LSI'),
         new = c('CAL21', 'PLANDL21', 'NPL21', 'PDL21', 'LPIL21', 'TEL21', 'EDL21', 'LSIL21'))
I want help developing a function or loop that automates this process, so I don't have to spend days repeating the same code for 15 classes across 16 datasets (240 times), and to decrease the risk of errors. I may have to do the same for additional datasets. Any help to speed up the process will be greatly appreciated.
You could do (df here is your dataset, e.g. Trial1):
# split the data by TYPE, then append an "L<number>" suffix derived from the
# class label (e.g. cls_21 -> L21) to every column name of each subset
a <- split(df, df$TYPE)
b <- sapply(names(a), function(x) setNames(a[[x]],
            paste0(names(a[[x]]), sub(".*_", "L", x))), simplify = FALSE)
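If you also want each renamed subset as its own object in the workspace (an assumption about the intended workflow), the named list can be materialised with list2env; the "Trial" prefix mirrors the question's naming and trimws guards against the padded TYPE labels:
# turn b (a named list of data frames) into Trial21-style objects
list2env(setNames(b, paste0("Trial", sub(".*_", "", trimws(names(b))))), envir = globalenv())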
You can use ls to get the variable names of the datasets, manipulate them inside a loop with the get function, and then create new datasets with assign.
library(dplyr) # for %>% and rename_with
sets = grep("Trial", ls(), value = TRUE) # assuming every dataset has "Trial" in the name
for (i in sets) {
  classes = unique(get(i)$TYPE)
  for (j in classes) {
    # extract just the two-digit class number; this might be an overly
    # complicated way of getting it, you can look for better options if you want
    number = gsub("(.+)([0-9]{2})( )", "\\2", j)
    # note: subset get(i), not Trial1, so each dataset filters itself; if the
    # same class appears in several datasets, include i in the new name too
    assign(paste0("Trial", number),
           subset(get(i), TYPE == j) %>% rename_with(function(x) paste0(x, number)))
  }
}
Here is a start that should work for your example:
library(dplyr)
myfilter <- function(data, number) {
  data %>%
    filter(TYPE == sprintf(" cls_%s ", number)) %>% # note the padding spaces
    rename_with(\(x) sprintf("%sL%s", x, number), !1:2) # skip RtNo and TYPE
}
myfilter(example_data, 21)
Given a list of numbers (here: 21 to 31) you could then automatically use them to filter a single dataframe:
multifilter <- function(data) {
  purrr::map(21:31, \(i) myfilter(data, i))
}
multifilter(example_data)
Finally, given a list of dataframes, you can automatically apply the filters to them:
purrr::map(list_of_dataframes, multifilter)
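To build that list from objects already in the workspace (assuming, as above, that every dataset has "Trial" in its name), mget can collect them by name:
# gather all Trial* data frames into a named list, then filter each one
list_of_dataframes <- mget(grep("^Trial", ls(), value = TRUE))
results <- purrr::map(list_of_dataframes, multifilter)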

which() function in R - after sorting in descending order, issues matching with duplicate values

I'm trying to find the next closest store from a matrix of store IDs, zip codes, and longitude/latitude coordinates for each of the zip codes. Trouble happens when there is more than one store per zip code: the script doesn't know how to order two values that are identical (store x is 10 miles away, store y is 10 miles away, and it returns c(x, y) instead of x, y or y, x). I need a way for my code to list both of them (in arbitrary order, since they are the same distance away based on zip code).
I'm thinking the which() call likely needs modifying, but I'm not having any luck.
Note that all the stores run; it's just the 100 or so stores that share a zip code with another store that get tripped up. I'd love not to have to go through and edit the CSV manually.
library(data.table)
library(zipcode)
library(geosphere)
source <- read.csv("C:\\Users\\mcan\\Desktop\\Projects\\Closest Store\\Site and Zip.csv", header = TRUE, sep = ",") # open the source file
zip<-source[,2] #break apart the source zip codes
ID<-source[,1] #break apart the IDs
zip<-clean.zipcodes(zip) #clean up the zipcodes
CleanedData<-data.frame(ID,zip) #combine the IDs and cleaned Zip codes
CleanedData<-merge(x=CleanedData,y=zipcode,by="zip",all.x=TRUE) #dataset of store IDs, zipcodes, and their long/lat positions
setDT(CleanedData) #set data frame to data table
storeDistances <- distm(CleanedData[,.(longitude,latitude)],CleanedData[,.(longitude,latitude)]) #matrix between long/lat points of all stores in list
colnames(storeDistances) <- rownames(storeDistances) <- CleanedData[,ID]
whatsClosest <- function(number = 1){
  # sorts each row's distances in ascending order and picks the (number+1)th
  # value (the store itself sits at distance 0), then matches it to a store ID
  apply(storeDistances, 1, function(x) colnames(storeDistances)[which(x == sort(x)[number + 1])])
}
CleanedData[,firstClosestSite:=whatsClosest(1)] #looks for 1st closest store
CleanedData[,secondClosestSite:=whatsClosest(2)] #looks for 2nd closest store
CleanedData[,thirdClosestSite:=whatsClosest(3)] #looks for 3rd closest store
Data set format:
Classes ‘data.table’ and 'data.frame': 1206 obs. of 9 variables:
$ zip : Factor w/ 1182 levels "01234","02345",..: 1 2 3 4 5 6 7 8 9 10 ...
$ ID : int 11111 12222 13333 10528 ...
$ city : chr "Boston" "Somerville" "Cambridge" "Weston" ...
$ state : chr "MA" "MA" "MA" "MA" ...
$ latitude : num 40.0 41.0 42.0 43.0 ...
$ longitude : num -70.0 -70.1 -70.2 -70.3 -70.4 ...
$ firstClosestSite :List of 1206
..$ : chr "12345"
$ secondClosestSite :List of 1206
..$ : chr "12344"
$ thirdClosestSite :List of 1206
..$ : chr "12343"
The issue comes with firstClosestSite and secondClosestSite: they are sorted by distance, but if the distance is the same because two stores exist in the same zip code, the which() call (I think) doesn't account for this, so you get this awkward concatenation in the CSV:
StoreID  Zip    City    State  Longitude  Latitude  FirstClosestSite     SecondClosestSite    ThirdClosestSite
11222    11000  Boston  MA     40.0       -70.0     c("11111", "12222")  c("11111", "12222")  13333
Example of how the distance matrix is formed (store IDs in first row and column, with the matrix values being the distance between store IDs):
11111 22222 33333 44444 55555 66666
11111 0 6000 32000 36000 28000 28000
22222 6000 0 37500 40500 32000 32000
33333 32000 37500 0 11000 6900 6900
44444 36000 40500 11000 0 8900 8900
55555 28000 32000 6900 8900 0 0
66666 28000 32000 6900 8900 0 0
The issue is the duplicates in each row... which() doesn't know how to order the stores tied at the same distance from 11111 (55555 and 66666, both at 28000).
Here is my attempt at a solution. Everything up until the line with colnames(storeDistances) <- ... stays the same. After that, you should replace the code with the following:
whatsClosestList <- sapply(as.data.frame(storeDistances),
                           function(x) list(data.frame(distance = x,
                                                       store = rownames(storeDistances),
                                                       stringsAsFactors = FALSE)))
# Get the names of the stores
# this step is necessary because lapply doesn't allow us
# to access the list names
storeNames = names(whatsClosestList)
# Iterate through each store's data frame using storeNames
# and delete the distance to itself
whatsClosestListRemoveSelf <- lapply(storeNames, function(name) {
  df <- whatsClosestList[[name]]
  df[df$store != name, ]
})
# The previous step got rid of the store names in the list,
# so we add them again here
names(whatsClosestListRemoveSelf) <- storeNames
whatsClosestOrderedList <- lapply(whatsClosestListRemoveSelf, function(df) { df[order(df$distance),] })
whatsClosestTopThree <- lapply(whatsClosestOrderedList, function(df) { df$store[1:3] })
firstClosestSite <- lapply(whatsClosestTopThree, function(x) { x[1]} )
secondClosestSite <- lapply(whatsClosestTopThree, function(x) { x[2]} )
thirdClosestSite <- lapply(whatsClosestTopThree, function(x) { x[3]} )
CleanedData[,firstClosestSite:=firstClosestSite] #looks for 1st closest store in list
CleanedData[,secondClosestSite:=secondClosestSite] #looks for 2nd closest store in list
CleanedData[,thirdClosestSite:=thirdClosestSite] #looks for 3rd closest store in list
Basically, instead of searching only for the (first, second, third) closest site, I create a list of data frames, one per store, holding the distances to all other stores. I then order these data frames and extract the three closest stores, which sometimes include ties (if tied, they're ordered by store name). Then you only need to extract lists with the firstClosestSite, secondClosestSite, etc., for each store, which is what gets assigned back into CleanedData. Hope it works!
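For comparison, a more compact sketch of the same idea (an alternative, not the answerer's code): drop each store's own entry from its row, then let order() rank the remaining distances, breaking ties by column position:
# hypothetical helper: the k nearest store IDs per store, ties broken by order
nearestIDs <- function(k = 3) {
  t(sapply(rownames(storeDistances), function(id) {
    x <- storeDistances[id, ]
    x <- x[names(x) != id] # remove the self-distance of 0 before ranking
    names(x)[order(x)][1:k]
  }))
}
nearestIDs(3)[1:2, ] # inspect the first two stores' three nearest neighbours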

How to conditionally call a function for each row in R?

I am trying to call a function that provides a value for specific data elements in a table.
A data table (gameData) might be:
Date TeamA TeamB TeamAScore TeamBScore
1 2016-03-06 NYC HOU 67 76
2 2016-02-14 BOS SEA NaN NaN
3 2016-01-30 LAS DAL 63 74
I would like to populate the TeamAScore with the return of a function if it is NaN. I tried a function like the following:
gameData$TeamAScore <- ifelse(
is.nan(gameData$TeamAScore),
getTeamAScore(gameData$TeamA,gameData$TeamB,gameDate=gameData$Date),
gameData$TeamAScore
)
When I run this, I get an error like the following:
Error in Ops.factor(teamdata$Team, TeamA) :
level sets of factors are different
It seems to be passing the entire TeamA column into the function call instead of only the value for that row.
The problem here is that the TeamA and TeamB columns do not have the data you think they have. Factors are tricky in R...
Let's create two factors here to see what is happening:
> TeamA <- factor(c("NYC", "BOS", "LAS", "SEA"))
> TeamB <- factor(c("HOU", "LAS", "NYC", "SEA"))
> TeamA
[1] NYC BOS LAS SEA
Levels: BOS LAS NYC SEA
OK, so TeamA has four slots: NYC, BOS, LAS and SEA. So we can compare this to TeamB to see whether any slot in the two vectors is the same. Right? Wrong:
> TeamA == TeamB
Error in Ops.factor(TeamA, TeamB) : level sets of factors are different
That is the same error you are receiving! It happens because what is really stored in these vectors is a number representing each "factor level".
> str(TeamA)
Factor w/ 4 levels "BOS","LAS","NYC",..: 3 1 2 4
> levels(TeamA)
[1] "BOS" "LAS" "NYC" "SEA"
> levels(TeamB)
[1] "HOU" "LAS" "NYC" "SEA"
So, 1 represents BOS in the TeamA vector, but it represents HOU in the TeamB vector. Of course they can't be compared!
How to avoid using factors when they are getting in your way? Use the argument stringsAsFactors=FALSE when you create the data.frame (either with data.frame(x, y, z, stringsAsFactors=FALSE) or read.csv("filename.csv", ..., stringsAsFactors=FALSE)).
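If the factor columns are already in place, a minimal sketch of the workaround is to compare the underlying labels instead of the integer codes:
> as.character(TeamA) == as.character(TeamB)
[1] FALSE FALSE FALSE  TRUE
as.character() recovers the labels, so vectors with different level sets compare element by element as ordinary strings.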

Subset data.frame based on a list

Sample Data : https://www.dropbox.com/s/f3l2uub1cttwmf2/test.csv?dl=0
I need to subset this data.frame to only those county codes (fips) that are available in another dataset. I have a list of all the fips codes from the other dataset and am trying to remove all those not in the list, but am not having much luck.
This small sample dataset has three fips (8009, 8011, 8013); how would I remove all except 8009 and 8011, given that the codes to keep come as a list?
Here's what I've tried :
prism.dd <- prism.d[(prism.d$fips %in% fips) ,]
Where fips is a list of 779 fips to keep:
fips <- unique(DustBowlData_Pre$fips)
But it's only returning the same number of rows. A solution with data.table would be preferred, but whatever works best is also fine.
Thanks!
Edit: update for akrun's request:
Output of dput(head(fips))
c(8009L, 8011L, 8013L, 8017L, 8035L, 8039L)
Update : str(prism.d)
Classes ‘data.table’ and 'data.frame': 52802 obs. of 3 variables:
$ fips: int 30061 30063 30077 30049 30013 30059 30045 30027 30069 30033 ...
$ Year: int 1910 1910 1910 1910 1910 1910 1910 1910 1910 1910 ...
$ ppt : num 87 64.2 52.4 46.6 34.9 ...
- attr(*, ".internal.selfref")=<externalptr>
Solution :
setkey(setDT(prism.d), fips)
fips <- unique(DustBowlData_Pre$fips)
fips <- data.table(fips)
Subpr <- prism.d[fips]
Thanks #akrun! This worked perfectly. I really need to learn data.table.
You could try using data.table
library(data.table)
setkey(setDT(prism.d), fips)
fips <- c(8009, 8011)
fips1 <- data.table(fips)
Subpr <- prism.d[fips1]
Update
I think the previous code didn't work because I assumed the dataset was a data.frame and not a data.table. Try
fips2 <- fips # renaming because prism.d has a column with the same name, fips
prism.d[fips %in% fips2]
data
prism.d <- read.csv('test-1.csv')
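One caveat worth noting (an aside, not part of the accepted answer): the keyed join prism.d[fips1] returns NA rows for any fips in fips1 that are absent from prism.d; adding nomatch = 0L keeps only matching rows:
Subpr <- prism.d[fips1, nomatch = 0L] # drop unmatched fips instead of returning NA rows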

Store rows of a data.frame equal to the top 20 by frequency from a factor column, i.e. names

The data.frame (d1.csv) looks like:
Age Height Weight Sport
23 170 60 Judo
33 193 125 Athletics
I have to make a new data.frame d2 with the top 20, using the character names below, stored in
names(top.20.sports)
[1] "Athletics" "Swimming" "Football" "Rowing"
... and I have to use match() or %in%, e.g. subset() on d1 with subset = Sport %in% names(top.20.sports).
I tried several things, but I'm new at this and am missing something...
d2 <- subset(d1, Sport %in% names(top.20.sports))
gives the whole list, same as with
d2 <- d1[d1$Sport %in% names(top.20.sports), ]
match() gives me a bunch (42) of NAs, and
d2 <- d1[, tolower(names(top.20.sports)) %in% d1[, 4]]
returns a data frame with 0 columns and 9038 rows
(the 9038 rows are correct, but where is the data?).
There was no error; as BondedDust told me: "If subset(d1, (Sport %in% names(top.20.sports))) gives the whole list then .... it is what it is. All of the Sport entries are in the top-20."
But it never was the whole list. I thought I had 10384 rows because the last row printed as
10384 24 221 110 Basketball
with Basketball as the last entry. But the row label is not the number of rows:
nrow(d2)
[1] 8009
dim(d2)
[1] 8009 4
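For completeness, a minimal sketch of the whole pipeline (assuming d1 has a Sport column as in the sample): build top.20.sports as a frequency table, then subset exactly as suggested:
top.20.sports <- sort(table(d1$Sport), decreasing = TRUE)[1:20] # named counts
d2 <- subset(d1, Sport %in% names(top.20.sports))
nrow(d2) # fewer than nrow(d1) whenever some sports fall outside the top 20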
