Short version: when executing the following command qtm(countries, "freq") I get the following error message:
Error in `$<-.data.frame`(`*tmp*`, "SHAPE_AREAS", value =
c(652270.070308042, : replacement has 177 rows, data has 210
Disclaimer: I have already checked other answers like this one or this one as well as this explanation that states that usually this error comes from misspelling objects, but could not find an answer to my problem.
Reproducible code:
library(rgdal)
library(dplyr)
library(tmap)
# Load JSON file with countries.
countries = readOGR(dsn = "https://gist.githubusercontent.com/ccamara/fc26d8bb7e777488b446fbaad1e6ea63/raw/a6f69b6c3b4a75b02858e966b9d36c85982cbd32/countries.geojson")
# Load dataframe.
df = read.csv("https://gist.githubusercontent.com/ccamara/fc26d8bb7e777488b446fbaad1e6ea63/raw/754ea37e4aba1b7ed88eaebd2c75fd4afcc54c51/sample-dataframe.csv")
countries@data = left_join(countries@data, df, by = c("iso_a2" = "country_code"))
qtm(countries, "freq")
Your error is in the data - the code works fine.
What happens right now is:
1) you attempt a 1:1 match;
2) your .csv data contains several rows per country code;
3) the left join therefore repeats each left-hand row once for every matching row on the right-hand side, so the joined data ends up with more rows (210) than the map has polygons (177).
To avoid this issue, aggregate your data to one row per country before joining, like:
library(dplyr)
df_unique = df %>%
  group_by(country_code, country_name) %>%
  summarize(total = sum(total), freq = sum(freq))
# After that you should be fine, as long as just adding up the data is okay.
# Note that the join uses the aggregated df_unique, not the original df.
countries@data = left_join(countries@data, df_unique, by = c("iso_a2" = "country_code"))
qtm(countries, "freq")
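To see why the row count grows, here is a minimal sketch with made-up data (the `left` and `right` tables and their values are hypothetical, not the contents of the gist). A key that appears twice on the right-hand side produces two output rows for a single left-hand row, and aggregating first restores the 1:1 match:

```r
library(dplyr)

# Hypothetical toy data: "ES" appears twice on the right-hand side.
left  <- data.frame(iso_a2 = c("ES", "FR"))
right <- data.frame(country_code = c("ES", "ES", "FR"),
                    freq = c(1, 2, 5))

# The duplicated key yields one extra row in the join result.
joined <- left_join(left, right, by = c("iso_a2" = "country_code"))
nrow(joined)

# Aggregating to one row per key first keeps the row count unchanged.
right_unique <- right %>%
  group_by(country_code) %>%
  summarise(freq = sum(freq))
joined_unique <- left_join(left, right_unique, by = c("iso_a2" = "country_code"))
nrow(joined_unique)
```

Comparing `nrow()` before and after a join is a cheap check to run before assigning the result back to the `@data` slot of a spatial object.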
I tried to run a panel VAR on a dataset I got from Statistics Sweden and here is what I get:
df<- read_excel("Inkfördelning per kommun.xlsx")
nujavlar <- pvarfeols(dependent_vars = c("Kvintil-1", "Kvintil-4", "Kvintil-5"),
lags = 1,
transformation = "demean",
data = df,
panel_identifier = c("Kommun", "Year")
)
Error: Can't subset columns that don't exist.
x Column `Kvintil-1` doesn't exist.
I often get this message too:
Warning in xtfrm.data.frame(x) : cannot xtfrm data frames
Error: Can't subset columns that don't exist.
x Location 2 doesn't exist.
ℹ There are only 1 column.
I have made sure that all data is numeric. I have also tried cleaning my workspace and restarting the program. I also tried to convert it into a panel data frame with the plm package, and I tried converting my entity variable "Kommun" (Municipality) into a factor, and it still doesn't work.
Here's the data if someone wants to give it a go.
https://docs.google.com/spreadsheets/d/16Ak_Z2n6my-5wEw69G29_NLryQKcrYZC/edit?usp=sharing&ouid=113164216369677216623&rtpof=true&sd=true
The column names in your data frame are Kvintil 1, not Kvintil-1, so the variable you are referring to really does not exist. Be aware that in R a hyphen is not allowed in an unquoted variable name (it is parsed as subtraction), and it is good practice to avoid spaces as well, because names with spaces have to be wrapped in backticks every time you refer to them. I have included a reproducible example below.
library(tidyverse)
library(gsheet)
library(panelvar)
url <- 'docs.google.com/spreadsheets/d/16Ak_Z2n6my-5wEw69G29_NLryQKcrYZC'
df <- gsheet2tbl(url) %>%
rename(Kvintil1 = `Kvintil 1`) %>%
rename(Kvintil2 = `Kvintil 2`) %>%
rename(Kvintil3 = `Kvintil 3`) %>%
rename(Kvintil4 = `Kvintil 4`) %>%
rename(Kvintil5 = `Kvintil 5`) %>%
as.data.frame()
nujavlar <- pvarfeols(
dependent_vars = c("Kvintil1", "Kvintil4", "Kvintil5"),
lags = 1,
transformation = "demean",
data = df,
panel_identifier = c("Kommun", "Year"))
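As a small illustration of the naming point, here is a sketch on a hypothetical two-column data frame (`toy` is made up and is not the Statistics Sweden file). A column whose name contains a space can only be reached with backticks or string indexing, and stripping the spaces gives a name that can be passed around as a plain string:

```r
# Hypothetical data; check.names = FALSE keeps the space in the column name.
toy <- data.frame("Kvintil 1" = c(0.1, 0.2), Year = c(2019, 2020),
                  check.names = FALSE)

names(toy)        # the name contains a space, not a hyphen
toy$`Kvintil 1`   # backticks are needed for non-syntactic names

# Remove the spaces so that every name is syntactic.
names(toy) <- gsub(" ", "", names(toy), fixed = TRUE)
names(toy)
```

This does the same job as the chain of `rename()` calls above, assuming that removing every space is acceptable for all columns.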
I am trying to use the pipe function in dplyr and left_join to clean some meta data up. Setting up variables....
library(openxlsx)
library(tidyverse)
mdat <- read.xlsx("https://journals.plos.org/plospathogens/article/file?type=supplementary&id=info:doi/10.1371/journal.ppat.1005511.s011",
startRow = 3, fillMergedCells = TRUE) %>%
mutate(sample=Accession.Number)
dge$samples$sample=
[1] "SRR1346026" "SRR1346027" "SRR1346028" "SRR1346029" "SRR1346030" "SRR1346031" "SRR1346032" "SRR1346033" "SRR1346034"
[10] "SRR1346035" "SRR1346036" "SRR1346037" "SRR1346038" "SRR1346039" "SRR1346040" "SRR1346041" "SRR1346042" "SRR1346043"
[19] "SRR1346044" "SRR1346045" "SRR1346046" "SRR1346047" "SRR1346049" "SRR1346048" "SRR1346050" "SRR1346051" "SRR1346052"
I am trying to pipe in dge$samples$sample, which is a character vector. It needs to become a data frame with one column named sample so that I can merge mdat with it by left join and drop all the metadata I don't have a sample for. If you run dim(mdat) you will find it is 35 by 15; I want to reduce it to the 19 samples I actually have data for, which are given in dge$samples$sample. The code below is my progress so far: it first tries to convert dge$samples$sample into a one-column data frame titled sample and then joins the two, but I think I am failing to understand how the pipe works.
test = data.frame(dge$samples$sample) %>%
colnames(.) = c("sample") %>%
left_join(
.,
mdat,
by = sample,
copy = FALSE,
suffix = c(".x", ".y"),
keep = FALSE,
na_matches = c("na", "never")
)
Why not just check if they're in there and filter them:
mdat %>% filter( sample %in% dge$samples$sample )
It's easier to understand and control than a join, and performance shouldn't be an issue.
I think your code can be reduced to
library(dplyr)
test <- data.frame(sample = dge$samples$sample) %>%
left_join(mdat, by = 'sample')
Or an inner join should work as well, using base R (merge does an inner join by default):
test <- merge(data.frame(sample = dge$samples$sample), mdat, by = 'sample')
Using collapse
library(collapse)
sbt(mdat, sample %in% dge$samples$sample)
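The filter and join approaches can be compared on a toy example. This is only a sketch: `mdat_toy` and `have` are hypothetical stand-ins for `mdat` and `dge$samples$sample`. Note that `by` takes a quoted column name, whereas an unquoted `by = sample` would pass the base function `sample` instead:

```r
library(dplyr)

# Hypothetical stand-ins for mdat and dge$samples$sample.
mdat_toy <- data.frame(sample = c("S1", "S2", "S3", "S4"),
                       tissue = c("a", "b", "c", "d"))
have <- c("S2", "S4")

# Keep the metadata rows whose sample id is in the vector.
by_filter <- mdat_toy %>% filter(sample %in% have)

# Same result via a join: name the column when building the data frame.
by_join <- data.frame(sample = have) %>% inner_join(mdat_toy, by = "sample")

by_filter$sample
by_join$sample
```

With a left join instead of an inner join, sample ids that are missing from the metadata would be kept with `NA` in the metadata columns.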
Short version: when executing the following command qtm(World, "amount") I get the following error message:
Error in `$<-.data.frame`(`*tmp*`, "SHAPE_AREAS", value =
c(653989.801201595, : replacement has 177 rows, data has 175
Disclaimer: this is the same problem I used to have in this question, but if I'm not wrong, in that one the problem was that I had one variable on the left dataframe that matched to several variables on the right one, and hence, I needed to group variables on right dataframe. In this case, I am pretty sure that I do not have the same problem, as can be seen from the code below:
library(tmap)
library(tidyr)
# Read tmap's world map.
data("World")
# Load my dataframe.
df = read.csv("https://gist.githubusercontent.com/ccamara/ad106eda807f710a6f331084ea091513/raw/dc9b51bfc73f09610f199a5a3267621874606aec/tmap.sample.dataframe.csv",
na = "")
# Compare the countries in df that do not match with World's
# SpatialPolygons.
df$iso_a3 %in% World$iso_a3
# Return rows which do not match
selected.countries = df$iso_a3[!df$iso_a3 %in% World$iso_a3]
df.f = filter(df, !(iso_a3 %in% selected.countries))
# Verification.
df.f$iso_a3[!df.f$iso_a3 %in% World$iso_a3]
World@data = World@data %>%
left_join(df.f, by = "iso_a3") %>%
mutate(iso_a3 = as.factor(iso_a3)) %>%
filter(complete.cases(iso_a3))
qtm(World, "amount")
My guess is that the clue may be the fact that the column I am using when joining both dataframes has different levels (hence it is converted to string), but I'm ashamed to admit that I still don't understand the error that I am having here. I'm assuming I have something wrong with my dataframe, although I have to admit that it didn't work even with a smaller dataframe:
selected.countries2 = c("USA", "FRA", "ITA", "ESP")
df.f2 = filter(df, iso_a3 %in% selected.countries2)
df.f2$iso_a3 = droplevels(df.f2$iso_a3)
World@data = World@data %>%
left_join(df.f2, by = "iso_a3") %>%
mutate(iso_a3 = as.factor(iso_a3)) %>%
filter(complete.cases(iso_a3))
World$iso_a3 = droplevels(World$iso_a3)
qtm(World, "amount")
Can anyone help me by pointing out what's causing this error? (Providing a solution would also be much appreciated.)
Edited: It is again your data. Look at how often each code occurs:
table(df$iso_a3)
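One way to read that `table()` output is to keep only the keys that occur more than once. A minimal sketch on a hypothetical vector (not the gist's data):

```r
# Hypothetical key column with one repeated code.
iso_a3 <- c("USA", "FRA", "ITA", "FRA", "ESP")

counts <- table(iso_a3)
dup_keys <- names(counts[counts > 1])   # keys that appear more than once
dup_keys

# anyDuplicated() returns 0 only when every key is unique.
anyDuplicated(iso_a3)
```

Any key listed in `dup_keys` will add rows in a left join, which is enough to make the joined table disagree with the number of polygons.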
I'm converting a local R script to use the RevoScaleR functions in the Revolution-R (aka Microsoft R Client/Server) package, so that it scales better with large amounts of data.
The goal is to create a new column that numbers the rows per group. Using data.table this would be achieved using the following code:
library(data.table)
eventlog[,ActivityNumber := seq(from=1, to=.N, by=1), by=Case.ID]
For illustration purposes, the output is something like this:
  Case.ID ActivityNumber
1       A              1
2       A              2
3       B              1
4       C              1
5       C              2
6       C              3
After some research into doing this with the rx-functions, I found the package dplyrXdf, which is basically a wrapper that lets you use dplyr functions on Xdf-stored data while still benefitting from the optimized functions of RevoScaleR (see http://blog.revolutionanalytics.com/2015/10/using-the-dplyrxdf-package.html)
In my case, this would lead to the following:
result <- eventlog %>%
group_by(Case.ID) %>%
mutate(ActivityNumber = seq_len(n()))
However, this leads to the following error:
ERROR: Attempting to add a variable without a name to an analysis.
Caught exception in file: CxAnalysis.cpp, line: 3756. ThreadID: 1248 Rethrowing.
Caught exception in file: CxAnalysis.cpp, line: 5249. ThreadID: 1248 Rethrowing.
Error in doTryCatch(return(expr), name, parentenv, handler) :
Error in executing R code: ERROR: Attempting to add a variable without a name to an analysis.
Any ideas how to solve this error? Or other (better?) approaches to get the requested result?
Thanks to @Matt-parker for pointing me to this question.
Note that n() is not a regular R function, although it looks like one. It needs to be implemented specially for each data source, and maybe also separately for each of mutate, summarise and filter.
Right now, the only usage of n that is supported for xdf files is within summarise, to count the number of rows. Implementing it for the other verbs is actually nontrivial.
In particular, there is a problem with Matt's use of seq_along to implement n's functionality. Remember that xdf files are block-structured: each chunk of rows is read in and processed independently of other chunks. This means that the sequence generated is for that chunk of rows only, and not for all the rows in a group. If a group spans more than one chunk, the sequence numbers will restart in the middle.
The way to get correct sequence numbers is to keep a running count of how many rows you've read in for that group, and update it each time a chunk is processed. You can do this with a transformFunc, which you pass to transmute via the .rxArgs argument:
ev <- eventlog %>% group_by(Case.ID) %>% transmute(.rxArgs = list(
    transformFunc = function(varList) {
        # continue numbering from the running count carried in .n
        n <- .n + seq_along(varList[[1]])
        if(!.rxIsTestChunk)  # need this b/c rxDataStep does a test run on the 1st 10 rows
            .n <<- n[length(n)]
        list(n=n)
    },
    transformObjects = list(.n = 0)))
This should work with the local, localpar and foreach compute contexts. It may not work (or at least won't give a reproducible result) with any context where you can't guarantee that rxDataStep will process the rows in a deterministic order -- so Mapreduce, Spark, Teradata or similar.
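The running-count idea can be tried without RevoScaleR. The sketch below is plain base R and only imitates chunked processing: `make_counter` and `chunks` are hypothetical names, and a list of vectors stands in for the blocks of an xdf file. It shows how carrying state between chunks keeps the numbering continuous, while a stateless `seq_along` restarts in every chunk:

```r
# Build a transform function that remembers how many rows it has seen.
make_counter <- function() {
  seen <- 0
  function(chunk) {
    n <- seen + seq_along(chunk)
    seen <<- n[length(n)]   # carry the last number over to the next chunk
    n
  }
}

# Hypothetical group of 5 rows, delivered in two chunks (3 rows, then 2).
chunks <- list(c("A", "A", "A"), c("A", "A"))

counter <- make_counter()
with_state <- unlist(lapply(chunks, counter))

# Without carried state the numbering restarts in each chunk.
without_state <- unlist(lapply(chunks, seq_along))

with_state
without_state
```

As in the transformFunc above, the result is only correct if the chunks arrive in a fixed order, which is why the approach is not reliable on distributed compute contexts.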
I'm not sure why this works, but try using seq_along(Case.ID) instead of seq_len(n()):
result <- eventlog %>%
group_by(Case.ID) %>%
mutate(ActivityNumber = seq_along(Case.ID))
It seems to be some problem with n(). Here's my exploratory code, in case anyone else wants to experiment:
options(stringsAsFactors = FALSE)
library(dplyrXdf)
# Set up some test data
eventlog_df <- data.frame(Case.ID = c("A", "A", "A", "A", "A", "B", "C", "C", "C"))
# Add a variable for artificially splitting the XDF into small chunks
eventlog_df$Chunk.ID <- factor((seq_len(nrow(eventlog_df)) + 2) %/% 3)
# Check the results
eventlog_df
# Now read it into an XDF file. I'm going to read just three rows in at a time
# so that the XDF file has several chunks, so we can be confident this works
# across chunks
eventlog <- tempfile(fileext = ".xdf")
for(i in 1:3) {
  rxImport(inData = eventlog_df[eventlog_df$Chunk.ID %in% i, ],
           outFile = eventlog,
           colInfo = list(Case.ID = list(type = "factor",
                                         levels = c("A", "B", "C"))),
           append = file.exists(eventlog))
}
# Convert to a proper data source
eventlog <- RxXdfData(eventlog)
rxGetInfo(eventlog, getVarInfo = TRUE, numRows = 10)
# Now to dplyr. First, let's make sure it can count up the records
# in each group without any trouble.
result <- eventlog %>%
group_by(Case.ID) %>%
summarise(ActivityNumber = n())
# It can:
rxDataStep(result)
# Now if we switch to mutate, does n() still work?
result <- eventlog %>%
group_by(Case.ID) %>%
mutate(ActivityNumber = n())
# No - and it seems to be complaining about missing variables. So what if
# we try to refer to a variable we *know* exists?
result <- eventlog %>%
group_by(Case.ID) %>%
mutate(ActivityNumber = seq_along(Case.ID))
# It works
rxDataStep(result)
dplyr and dplyrXdf have a tally method that counts items per group:
result <- eventlog %>%
group_by(Case.ID) %>%
tally()
If you want to do more than just tabulate the records per group, you can use summarize (since you didn't show your data, I'm using a hypothetical column called delay, which I'm assuming is numeric for illustrative purposes):
result <- eventlog %>%
group_by(Case.ID) %>%
summarize(counts = n(),
ave_delay = mean(delay))
You could also do the first example with regular RevoScaleR functions:
rxCrossTabs(~ Case.ID, data = eventlog)
and for the second example:
rxCube(delay ~ Case.ID, data = eventlog)
data_different_tech_count <- data_different_tech %>%
group_by(tech) %>%
summarise(count(tech))
Now this gives me a data.frame as output, but I am unable to save the file. When I try to change the colnames, it shows me:
colnames(data1)[c(1,2)]<- c("tech","count")
Error in `colnames<-`(`*tmp*`, value = c("tech", "count")) :
'names' attribute [2] must be the same length as the vector [1]
When I am using
colnames(data_different_count_tech)
It says that I have only one column.
When I am using the
summary(data_different_count_tech)
it shows two columns.
When I am trying to write this file to my directory it returns the following error.
write.csv(file=data_different_tech_count,"tech.csv")
Error in matrix(unlist(value, recursive = FALSE, use.names = FALSE), nrow = nr, :
length of 'dimnames' [2] not equal to array extent
Are you trying to get a count of the number of times each value of tech appears? I can't get your example to work without having a reproducible example.
If so, here are a few alternatives that will give you what you want:
Using dplyr
data_different_tech_count <- data_different_tech %>% group_by(tech) %>% summarise(count = n())
Using Base R
data_different_tech_count <- as.data.frame(table(data_different_tech$tech))
colnames(data_different_tech_count) <- c("tech","count")
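Both alternatives can be checked end to end on made-up data. In this sketch `toy` is a hypothetical stand-in for `data_different_tech`; it also writes the result to a temporary file, passing the data frame first and the path as `file`:

```r
library(dplyr)

# Hypothetical data standing in for data_different_tech.
toy <- data.frame(tech = c("solar", "wind", "solar", "hydro", "solar"))

# dplyr: one row per tech with its number of occurrences.
counts_dplyr <- toy %>% group_by(tech) %>% summarise(count = n())

# Base R equivalent.
counts_base <- as.data.frame(table(toy$tech), stringsAsFactors = FALSE)
colnames(counts_base) <- c("tech", "count")

# Both results have two plain columns, so they can be renamed and saved.
out <- tempfile(fileext = ".csv")
write.csv(counts_dplyr, file = out, row.names = FALSE)
```

In the question's `write.csv()` call the data frame was passed as `file`, which is likely why saving failed even after the counting step ran.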