Issue with duplicate last names / cannot find object - R

Recent Excel graduate trying to transition to R, so I am very new to this.
I am trying to create a player-based sports model. However, when running the code I have already written, R is conflating players with the same last name (using dplyr). Essentially it has created two columns (player_last_name.x and player_last_name.y) and has merged these players' stats. My first thought was to merge the first and last name columns into one, but I'm not sure how R handles merging categorical data.
Also, R seems to not be able to find my third variable in season_TOG.
Any help would be appreciated.
Thanks.
disp <- playerdata %>%
  group_by(player_first_name, player_last_name) %>%
  summarise(season_disposals = sum(disposals)) %>%
games <- playerdata %>%
  group_by(player_first_name, player_last_name) %>%
  summarise(season_game_count = n_distinct(match_round)) %>%
TOG <- playerdata %>%
  group_by(player_first_name, player_last_name) %>%
  summarise(season_TOG = sum(time_on_ground_percentage)) %>%
PropModel_df <- merge(disp, games, TOG, by="player_first_name", "player_last_name") %>%
PropModel_df <- transform(PropModel_df, avg_disp = season_disposals/season_game_count) %>%
PropModel_df <- transform(PropModel_df, avg_TOG = season_TOG/season_game_count) %>%
print(PropModel_df)

Error in eval(substitute(list(...)), `_data`, parent.frame()) :
  object 'season_TOG' not found

There are at least three clear issues here.
1. Your code is not parse-able: you have extra %>% at several points in your code. It might be that this is just an artifact of your question, and you trimmed some otherwise unnecessary portions of your code but didn't clean up your pipes ... in which case thank you for reducing your code, but please run the reduced code before posting it in the question.
2. merge accepts exactly two frames to join, so your
PropModel_df <- merge(disp, games, TOG, by="player_first_name", "player_last_name")
will fail for that reason. You'll need to merge the first two (merge(disp, games, by = ...)) and then merge that result with TOG.
3. When you join on multiple fields, you need to supply them as a single vector. Your code (adjusted for #2):
PropModel_df <- merge(disp, games, by="player_first_name", "player_last_name")
should be
PropModel_df <- merge(disp, games, by = c("player_first_name", "player_last_name"))
Further detail: when arguments are provided without names, they are assigned by position. Because merge arguments are
merge(x, y, by = intersect(names(x), names(y)),
by.x = by, by.y = by, all = FALSE, all.x = all, all.y = all,
sort = TRUE, suffixes = c(".x",".y"), no.dups = TRUE,
incomparables = NULL, ...)
these are the apparent argument names for your call:
merge(x = disp, y = games, by = "player_first_name", by.x = "player_last_name")
which is (I believe) not what you intend.
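Putting the three fixes together, here is a minimal sketch of what the corrected pipeline might look like (column names taken from the question; the .groups = "drop" argument assumes dplyr >= 1.0.0, and this is untested without your playerdata):

library(dplyr)

disp <- playerdata %>%
  group_by(player_first_name, player_last_name) %>%
  summarise(season_disposals = sum(disposals), .groups = "drop")

games <- playerdata %>%
  group_by(player_first_name, player_last_name) %>%
  summarise(season_game_count = n_distinct(match_round), .groups = "drop")

TOG <- playerdata %>%
  group_by(player_first_name, player_last_name) %>%
  summarise(season_TOG = sum(time_on_ground_percentage), .groups = "drop")

PropModel_df <- merge(disp, games, by = c("player_first_name", "player_last_name"))
PropModel_df <- merge(PropModel_df, TOG, by = c("player_first_name", "player_last_name"))
PropModel_df <- transform(PropModel_df,
                          avg_disp = season_disposals / season_game_count,
                          avg_TOG  = season_TOG / season_game_count)
print(PropModel_df)

Because the grouping and the merge keys always include both player_first_name and player_last_name, two players who share a last name stay separate, and merge never needs to create the player_last_name.x / player_last_name.y suffix columns.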

Related

Create a graph from columns whose names begin with a number in R

I imported information from an Excel file into RStudio. When I tried to create a bar chart out of two of the set's columns it didn't work.
ggplot(data = Encuesta_BIDVOX)+geom_bar(mapping = aes(x=Cargo, fill=1_PC_Estandarizado))
The Console's output showed:
Error: unexpected input in "ggplot(data = Encuesta_BIDVOX)+geom_bar(mapping = aes(x=Cargo, fill=1_"
I realized that the issue is that the variable (column) name I'm using to fill the bars starts with a number (I tried other columns whose names don't start with a number and it worked fine).
So the question is:
Is there a way to tell R to ignore that the column's name starts with a number, so I can use it in the ggplot() call without getting an error?
This would be of immense help since most of the columns (over 48) start with a number...
Please keep answers as simple as possible since as you already noticed I'm new at R. Thanks!
Just wrapping the name in backticks (``) may help (note that these are backticks, not single quotes ').
library(ggplot2)
library(dplyr)  # for %>%

iris3 <- iris
names(iris3) <- c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width", "1_")
iris3 %>%
  ggplot(aes(Sepal.Length, Sepal.Width, color = `1_`)) +
  geom_point()
I suggest fixing this as far upstream as possible, potentially in the source file, or upon import, or if not then before you go to ggplot. janitor::clean_names() offers a nice interface for doing that and controlling what the new, syntactical column names look like.
uhoh <- data.frame(`1_` = 1,
                   `2_` = 2,
                   `10_` = 3,
                   check.names = FALSE)
uhoh
#   1_ 2_ 10_
# 1  1  2   3

uhoh %>%
  janitor::clean_names() %>%
  ggplot() +
  geom_bar(aes(x1, fill = x10))
You can use make.names, which is used by e.g. data.frame() to make sure names are valid.
# make a dataframe with invalid names
x <- data.frame('1a' = 1, '2b' = 1, '_c' = 1, check.names = FALSE)
# make any invalid names valid
names(x) <- make.names(names(x), unique = TRUE)
But, it is almost always better to fix the names where you are creating them (e.g. most functions to read in data have arguments to make sure the names are checked and valid).
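For illustration, a small sketch of what fixing the names at import time could look like (the file name here is made up; readxl::read_excel() has a similar .name_repair argument if the data come straight from Excel):

# base R: read.csv() checks names by default (check.names = TRUE), so a column
# called "1_PC_Estandarizado" is imported as "X1_PC_Estandarizado"
Encuesta_BIDVOX <- read.csv("encuesta_bidvox.csv")

# or keep the original headers and clean them immediately with janitor
library(janitor)
Encuesta_BIDVOX <- clean_names(read.csv("encuesta_bidvox.csv", check.names = FALSE))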

kwic() function returns fewer rows than it should

I'm currently trying to perform a sentiment analysis on a kwic object, but I'm afraid that the kwic() function does not return all the rows it should. I'm not quite sure what exactly the issue is, which makes it hard to post a reproducible example, so I hope a detailed explanation of what I'm trying to do will suffice.
I subsetted the original dataset containing speeches I want to analyze to a new data frame that only includes speeches mentioning certain keywords. I used the following code to create this subset:
ostalgie_cluster <- full_data %>%
  filter(grepl('Schwester Agnes|Intershop|Interflug|Trabant|Trabi|Ostalgie',
               speechContent,
               ignore.case = TRUE))
The resulting data frame consists of 201 observations. When I perform kwic() on the same initial dataset using the following code, however, it returns a data frame with only 82 observations. Does anyone know what might cause this? Again, I'm sorry I can't provide a reproducible example, but when I try to create a reprex from scratch it just.. works...
#create quanteda corpus object
qtd_speeches_corp <- corpus(full_data,
                            docid_field = "id",
                            text_field = "speechContent")

#tokenize speeches
qtd_tokens <- tokens(qtd_speeches_corp,
                     remove_punct = TRUE,
                     remove_numbers = TRUE,
                     remove_symbols = TRUE,
                     padding = FALSE) %>%
  tokens_remove(stopwords("de"), padding = FALSE) %>%
  tokens_compound(pattern = phrase(c("Schwester Agnes")), concatenator = " ")

ostalgie_words <- c("Schwester Agnes", "Intershop", "Interflug", "Trabant", "Trabi", "Ostalgie")

test_kwic <- kwic(qtd_tokens,
                  pattern = ostalgie_words,
                  window = 5)
It's something of a guess without a reproducible example (your input full_data, namely), but here's my best bet: your kwic() call is using the default "glob" pattern matching, and what you want is a regular-expression match instead.
Fix it this way:
kwic(qtd_tokens, pattern = ostalgie_words, valuetype = "regex",
     window = 5)
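If you want to see where the counts diverge, a rough diagnostic sketch (this assumes the id column used as docid_field is still present in ostalgie_cluster):

# documents that grepl() matched but in which kwic() found no keyword hits
matched_ids <- as.character(ostalgie_cluster$id)
kwic_ids <- unique(test_kwic$docname)
setdiff(matched_ids, kwic_ids)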

Merge a data frame output with a column

I have a data.frame with two variables: ID and Text
I am using the following text analysis command that gives a data.frame output of 48 columns.
analysis <- textstat_readability(mydata$text, measure = c("all"), remove_hyphens = TRUE)
How can I add those 48 columns of results as separate columns in mydata?
Currently I am using the following:
analysis <- cbind(mydata$ID[1:100000], textstat_readability(mydata$text[1:100000], measure = c("all"), remove_hyphens = TRUE))
But it takes forever to finish.
You have 100,000 records with text. Depending on your system and how big each text record is, that might take a while. You could try speeding up the process by using more cores. Most of quanteda's processes run in parallel, so it is worth a shot.
Try to do the following to see if that speeds it up:
library(quanteda)
# use all available cores - 1
quanteda_options(threads = parallel::detectCores() - 1)
analyses <- textstat_readability(mydata$text[1:100000], measure = c("all"), remove_hyphens = TRUE)
analyses <- cbind(mydata$text[1:100000], analyses)
Testing this with a data.frame filled with 2000 copies of data_char_sampletext didn't show much difference when doing it in one cbind() action, but that depends on how big your mydata data.frame already is. It might be better to do it in two steps.
Not sure why your approach takes forever to finish to be honest, but the correct way to do it would be the following, I think:
# (0.) Load the package and make a random sample dataset (usually this should be
#      provided in the question, just saying):
library(quanteda)
mydata <- data.frame(ID = 1:100,
                     text = stringi::stri_rand_strings(
                       n = 100,
                       length = runif(100, min = 1, max = 100),
                       pattern = "[A-Za-z0-9]"),
                     stringsAsFactors = FALSE)

# 1. Make a quanteda corpus, where the ID is stored alongside the text column:
mydata_corpus <- corpus(mydata, docid_field = "ID", text_field = "text")

# 2. Then run the readability command:
analysis <- textstat_readability(mydata_corpus, measure = c("all"), remove_hyphens = TRUE)
# 3. Now you can either keep this, or merge it with your original set based on
# IDs:
mydata_analysis <- merge(mydata, analysis, by.x = "ID", by.y = "document")
This should work without you having to use cbind() at all.
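Optionally, a quick sanity check after the merge (worth doing, since merge() silently drops unmatched rows by default):

# every ID should have found exactly one readability row
stopifnot(nrow(mydata_analysis) == nrow(mydata))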

Merge is duplicating rows in R

I have two data sets with country names in common.
[screenshot: first data frame]
As you can see, both data sets have a two-letter country code formatted the same way.
After running this code:
merged<- merge(aggdata, Trade, by="Group.1" , all.y = TRUE, all.x=TRUE)
I get the following result:
[screenshot: merged result with two rows per country code]
Rather than having two rows with the same country code, I'd like them to be combined.
Thanks!
I strongly suspect that the Group.1 strings in one or other of your data frames have one or more trailing spaces, so they appear identical when viewed but are not. An easy way of visually checking whether they are the same:
levels(as.factor(Trade$Group.1))
levels(as.factor(aggdata$Group.1))
If the problem does turn out to be trailing spaces, then if you are using R 3.2.0 or higher, try:
Trade$Group.1 <- trimws(Trade$Group.1)
aggdata$Group.1 <- trimws(aggdata$Group.1)
Even better, if you are using read.table etc. to input your data, then use the parameter strip.white=TRUE
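To see the failure mode in miniature, here is a made-up two-row example (not your data) showing how a single trailing space splits a country across rows and how trimws() repairs it:

a <- data.frame(Group.1 = c("AT", "BE "), x = 1:2)  # note the trailing space in "BE "
b <- data.frame(Group.1 = c("AT", "BE"), y = 3:4)
merge(a, b, by = "Group.1", all = TRUE)  # "BE " and "BE" land on separate rows
a$Group.1 <- trimws(a$Group.1)
merge(a, b, by = "Group.1", all = TRUE)  # now one row per country code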
For future reference, it would be better to post at least a sample of your data rather than a screenshot.
The following works for me:
aggdata <- data.frame(Group.1 = c('AT', 'BE'),
                      CASEID = c(1587.6551, 506.5),
                      ISOCNTRY = c(NA, NA),
                      QC17_2 = c(2.0, 1.972332),
                      D70 = c(1.787440, 1.800395))
Trade <- data.frame(Group.1 = c('AT', 'BE'),
                    trade = c(99.77201, 100.10685))
merged <- merge(aggdata, Trade, by = "Group.1", all.y = TRUE, all.x = TRUE)
I had to transcribe your data by hand from your screenshots, so I only did the first two rows. If you could paste in a full sample of your data, that would be helpful. See here for some guidelines on producing a reproducible example: https://stackoverflow.com/a/5963610/236541

How do I save individual species data downloaded via rgbif?

I have a list of species and I want to download occurrence data for them using rgbif. I'm trying out the code with just two species, on the assumption that once I get it working for two, getting it to work for the actual (and much longer) list won't be a problem. Here's the code I'm using:
#Start
library(rgbif)
splist <- c('Acer platanoides','Acer pseudoplatanus')
keys <- sapply(splist, function(x) name_suggest(x)$key[1], USE.NAMES = FALSE)
OS1 <- occ_search(taxonKey = keys,
                  fields = c('name', 'key', 'decimalLatitude', 'decimalLongitude',
                             'country', 'basisOfRecord', 'coordinateAccuracy',
                             'elevation', 'elevationAccuracy', 'year', 'month', 'day'),
                  minimal = FALSE, limit = 10, return = 'data')
OS1
#End
This bit works almost perfectly. I get data for both species divided by species. One species is missing some columns, but I'm assuming for now that's an issue with the data, not the code. The next line I tried -
write.csv(OS1, "os1.csv")
works fine when saving a single species but not for more than one. Can someone please help? How do I save data for each species as separate files, bearing in mind I also want the method to work for data for more than 2 species?
Thanks!
The result is a list, which means you can use R's functions to loop over each list element and save it. The following code extracts the species names (you might have these lying around somewhere already) and uses mapply() to pair each species' data with a file name and write it to a .txt file.
# build one file name per list element, based on the (unique) species name inside it
filenames <- paste(sapply(sapply(OS1, FUN = "[[", "name", simplify = FALSE), unique),
                   ".txt", sep = "")
# write each species' data to its own file
mapply(OS1, filenames, FUN = function(x, y) write.table(x, file = y, row.names = FALSE))
This is akin to a for loop solution, but some might argue a more concise one.
for (i in 1:length(filenames)) {
  write.table(OS1[[i]], file = filenames[i], row.names = FALSE)
}
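And since your original attempt used write.csv(), the same loop works for CSV output; a small sketch reusing the filenames built above:

csv_names <- sub("\\.txt$", ".csv", filenames)
for (i in seq_along(csv_names)) {
  write.csv(OS1[[i]], file = csv_names[i], row.names = FALSE)
}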
