Reading names with special characters using R

I've an excel (xlsx) table and in the column "PLAYERS" European players have an asterisk in their names and South Americans don't. Something like this
PLAYERS
Neymar
*Bale*
Messi
*Ronaldo*
*Benzema*
*Iniesta*
DiMaria
Is there any way I can use R (or excel itself) to split this dataset into one with Europeans (with asterisk) and another one with South Americans? Of course, the data set contains other columns like "SALARY", "SCORED GOALS", "OFFSITE", "AGE" etc. etc. etc.
Thanks,
Diego.

You could check whether each player's name contains an "*" and record "European" or "South American" in a new column. If you want, you can then split the data frame into a list of two data.frames, one with Europeans and the other with South Americans:
df <- data.frame(PLAYERS = c("Neymar", "*Ronaldo*", "Messi"), SALARY = 5:7)
df
# PLAYERS SALARY
#1 Neymar 5
#2 *Ronaldo* 6
#3 Messi 7
# check if there's a * in the PLAYERS column
df$Location <- ifelse(grepl("\\*", df$PLAYERS), "European", "South American")
df
# PLAYERS SALARY Location
#1 Neymar 5 South American
#2 *Ronaldo* 6 European
#3 Messi 7 South American
#split the data based on location:
dflist <- split(df, df$Location)
dflist
#$European
# PLAYERS SALARY Location
#2 *Ronaldo* 6 European
#
#$`South American`
# PLAYERS SALARY Location
#1 Neymar 5 South American
#3 Messi 7 South American
Now you can access each list element (which is a data.frame) by typing
dflist[["European"]] # or "South American" instead
# PLAYERS SALARY Location
#2 *Ronaldo* 6 European
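As a variation on the answer above, here is a minimal sketch that matches the asterisk literally with fixed = TRUE (so no regex escaping is needed) and strips it from the names before subsetting. The variable names is_european, europeans and south_americans are only illustrative.

```r
# Toy data mirroring the example above
df <- data.frame(PLAYERS = c("Neymar", "*Ronaldo*", "Messi"), SALARY = 5:7)

# fixed = TRUE treats "*" as a literal character rather than a regex operator
is_european <- grepl("*", df$PLAYERS, fixed = TRUE)

# Remove the asterisks so the names are clean in both subsets
df$PLAYERS <- gsub("*", "", df$PLAYERS, fixed = TRUE)

# Logical indexing gives the two data frames directly
europeans <- df[is_european, ]
south_americans <- df[!is_european, ]
```

The flag is computed before the asterisks are removed, because the cleaned names no longer carry the marker.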

You can split the data on this specific column and name the resulting list with split and setNames. Note that split on a logical vector returns the FALSE group (no asterisk) first and the TRUE group (asterisk) second, so the names must be given in that order:
dat <- structure(list(PLAYERS = structure(c(6L, 1L, 5L, 7L, 2L, 4L, 3L),
.Label = c("*Bale*", "*Benzema*", "DiMaria", "*Iniesta*",
"Messi", "Neymar", "*Ronaldo*"), class = "factor")),
.Names = "PLAYERS", class = "data.frame", row.names = c(NA,-7L))
# FALSE (no asterisk) -> "SoAm", TRUE (asterisk) -> "Euro"
setNames(split(dat, grepl("[*]", dat$PLAYERS)), nm = c("SoAm", "Euro"))
#$SoAm
# PLAYERS
# 1 Neymar
# 3 Messi
# 7 DiMaria
#
# $Euro
# PLAYERS
# 2 *Bale*
# 4 *Ronaldo*
# 5 *Benzema*
# 6 *Iniesta*
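Because split orders the groups of a logical vector as FALSE then TRUE, it is easy to attach the names the wrong way round. A minimal sketch that avoids relying on that order is to build a labelled factor first. The names has_star, region and groups are only illustrative.

```r
dat <- data.frame(PLAYERS = c("Neymar", "*Bale*", "Messi", "*Ronaldo*",
                              "*Benzema*", "*Iniesta*", "DiMaria"))

# TRUE where the name contains a literal asterisk
has_star <- grepl("[*]", dat$PLAYERS)

# Tie each logical value to its label explicitly instead of relying on
# the FALSE-then-TRUE ordering that split() uses for logical vectors
region <- factor(has_star, levels = c(TRUE, FALSE), labels = c("Euro", "SoAm"))

groups <- split(dat, region)
```

groups$Euro then holds the starred names and groups$SoAm the rest, regardless of the order in which the levels are listed.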

In Excel itself: create a PivotTable from your source data with PLAYERS for ROWS. Filter with Label Filters, Contains... ~* and click on Grand Total. Return to the PivotTable, select Does Not Contain... and click on Grand Total again.

Related

How to remove values in a column based on other column values equaling the column values above it?

I am currently coding in R and merged two dataframes so that all the information is together, but I don't want the "Cost" column to be duplicated multiple times (the duplication comes from the unique values in the last 3 columns). I want the cost of 100 to appear only in the first row, and NA in every later row where the columns "State", "Market", "Date", and "Cost" are the same as in the row above. I attached what the dataframe looks like and what I want it to be changed to. Thank you!
What it currently looks like
What it should look like
Please use index like in this example:
name_of_your_dataset[nrow_init:nrow_fin, ncol] <- NA
In your case, assuming the name of your dataset as 'data'
data[2:4,4]<- NA
Here is a solution using duplicated with your dataframe (df):
  State Market  Date       Cost Word    format Type
1 AZ    Phoenix 10-20-2020 100  HELLO   AM     Sports related
2 AZ    Phoenix 10-21-2020 100  GOODBYE PM     Non Sports related
3 AZ    Phoenix 10-22-2020 100  YES     FM     Country
4 AZ    Phoenix 10-23-2020 100  NONE    CM     Rock
Set duplicates to NA:
df$Cost[duplicated(df$Cost)] <- NA
Output:
  State Market  Date       Cost Word    format Type
1 AZ    Phoenix 10-20-2020 100  HELLO   AM     Sports related
2 AZ    Phoenix 10-21-2020 NA   GOODBYE PM     Non Sports related
3 AZ    Phoenix 10-22-2020 NA   YES     FM     Country
4 AZ    Phoenix 10-23-2020 NA   NONE    CM     Rock
The Date column differs from row to row, so I think you want to replace duplicated Cost values within each State and Market combination.
library(dplyr)
df <- df %>%
  group_by(State, Market) %>%
  # keep the first Cost in each group, blank out the repeats
  mutate(Cost = replace(Cost, duplicated(Cost), NA)) %>%
  ungroup()
df
# State Market Date Cost Word format Type
# <chr> <chr> <chr> <dbl> <chr> <chr> <chr>
#1 AZ Phoenix 10-20-2020 100 HELLO AM Sports related
#2 AZ Phoenix 10-21-2020 NA GOODBYE PM Non Sports related
#3 AZ Phoenix 10-22-2020 NA YES FM Country
#4 AZ Phoenix 10-23-2020 NA NONE CM Rock
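The same per-group idea can be sketched in base R by calling duplicated on the key columns together with Cost, rather than on Cost alone. The fourth row (CA, Fresno) is a made-up addition, included only to show that a repeated cost in a different State and Market is kept.

```r
# Toy data: three rows for one market plus one hypothetical row for another
df <- data.frame(
  State  = c("AZ", "AZ", "AZ", "CA"),
  Market = c("Phoenix", "Phoenix", "Phoenix", "Fresno"),
  Cost   = c(100, 100, 100, 100)
)

# A row counts as a duplicate only if State, Market and Cost all repeat
key <- df[c("State", "Market", "Cost")]
df$Cost[duplicated(key)] <- NA
```

With duplicated(df$Cost) alone, the Fresno row would also be set to NA, because its cost matches a value seen earlier in the column.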
data
It is easier to help if you provide data in a reproducible format
df <- structure(list(State = c("AZ", "AZ", "AZ", "AZ"), Market = c("Phoenix",
"Phoenix", "Phoenix", "Phoenix"), Date = c("10-20-2020", "10-21-2020",
"10-22-2020", "10-23-2020"), Cost = c(100, 100, 100, 100), Word = c("HELLO",
"GOODBYE", "YES", "NONE"), format = c("AM", "PM", "FM", "CM"),
Type = c("Sports related", "Non Sports related", "Country",
"Rock")), row.names = c(NA, -4L), class = "data.frame")

Create a ggplot with grouped factor levels

This is a variation on a question asked here: Group factor levels in ggplot.
I have a dataframe:
df <- data.frame(respondent = factor(c(1, 2, 3, 4, 5, 6, 7)),
location = factor(c("California", "Oregon", "Mexico",
"Texas", "Canada", "Mexico", "Canada")))
There are three separate levels related to the US. I don't want to collapse them as the distinction between states is useful for data analysis. I would like to have, however, a basic barplot that combines the three US states and stacks them on top of one another, so that there are three bars in the barplot--Canada, Mexico, and US--with the US bar divided into three states like so:
If the state factor levels had the "US" in their names, e.g. "US: California", I could use
library(tidyverse)
with_states <- df %>%
separate(location, into = c("Country", "State"), sep = ": ") %>%
replace_na(list(State = "Other")) %>%
mutate(State = as.factor(State)
%>% fct_relevel("Other", after = Inf))
to achieve the desired outcome. But how can this be done when R doesn't know that the three states are in the US?
If you look at the previous example, all the separate and replace_na functions do is separate the location variable into a country and state variable:
df
respondent location
1 1 US: California
2 2 US: Oregon
3 3 Mexico
...
df %>%
separate(location, into = c("Country", "State"), sep = ": ") %>%
replace_na(list(State = "Other"))
respondent Country State
1 1 US California
2 2 US Oregon
3 3 Mexico Other
...
So really all you need to do is get your data into this format: a column for country and a column for state/province.
There are many ways to do this yourself, and often your data will already be in this format. If it isn't, the easiest fix is to join to a table that maps location to country:
df
respondent location
1 1 California
2 2 Oregon
3 3 Mexico
4 4 Texas
5 5 Canada
6 6 Mexico
7 7 Canada
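state_mapping <- data.frame(state = c("California", "Oregon", "Texas"),
                            country = c('US', 'US', 'US'),
                            stringsAsFactors = FALSE)
df %>%
  # convert the factor to character so it matches the character key in state_mapping
  mutate(location = as.character(location)) %>%
  left_join(state_mapping, by = c('location' = 'state')) %>%
  # a location without a mapping is already a country
  mutate(country = if_else(is.na(country), location, country))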
respondent location country
1 1 California US
2 2 Oregon US
3 3 Mexico Mexico
4 4 Texas US
5 5 Canada Canada
6 6 Mexico Mexico
7 7 Canada Canada
Once you've got it in this format, you can just do what the other question suggested.
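For completeness, here is a minimal base R sketch of the whole path from the raw locations to the stacked bar chart. It swaps the join for a named-vector lookup, and the names state_to_country, mapped and counts are only illustrative. The plot is built only if ggplot2 is installed.

```r
df <- data.frame(
  respondent = factor(1:7),
  location = c("California", "Oregon", "Mexico", "Texas",
               "Canada", "Mexico", "Canada"),
  stringsAsFactors = FALSE
)

# Named vector acting as the lookup table: state -> country
state_to_country <- c(California = "US", Oregon = "US", Texas = "US")

# NA wherever the location is not one of the mapped states
mapped <- unname(state_to_country[df$location])

# Unmapped locations are countries already; they get the state label "Other"
df$country <- ifelse(is.na(mapped), df$location, mapped)
df$state   <- ifelse(is.na(mapped), "Other", df$location)

# The counts that the stacked bars will display
counts <- table(df$state, df$country)

# One bar per country, filled (stacked) by state
if (requireNamespace("ggplot2", quietly = TRUE)) {
  p <- ggplot2::ggplot(df, ggplot2::aes(x = country, fill = state)) +
    ggplot2::geom_bar()
  # print(p) draws the chart
}
```

The original location column is left untouched, so the per-state distinction stays available for analysis.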

How to loop through a list of cities and get temperature for given date with 'weatherData' in R

I have the following df
city <- data.frame(City = c("London", "Liverpool", "Manchester","London", "Liverpool", "Manchester"),
Date = c("2016-08-05","2016-08-09","2016-08-10", "2016-09-05","2016-09-09","2016-09-10"))
I want to loop through it and get weather data for each city$City on the date in city$Date, ending up with something like this:
city <- data.frame(City = c("London", "Liverpool", "Manchester","London", "Liverpool", "Manchester"),
Date = c("2016-08-05","2016-08-09","2016-08-10", "2016-09-05","2016-09-09","2016-09-10"),
Mean_TemperatureC = c("15","14","13","14","11","14"))
Currently I am using weatherData to get weather data with the following function:
library(weatherData)
df <- getWeatherForDate("BOS", "2016-08-01")
Can someone help?
Here is a possibility:
temp <- as.matrix(city)
codes <- sapply(1:nrow(city), function(x) getStationCode(city[x,1], "GB")[[1]])
station <- sub("^[[:print:]]+\\s([A-Z]{4})\\s[[:print:]]+", "\\1", codes)
temp[, 1] <- station
temperature <- sapply(1:nrow(temp), function(x) {getWeatherForDate(temp[x, 1], temp[x, 2])$Mean_TemperatureC})
city2 <- setNames(cbind(city, temperature), c(colnames(city), "Mean_TemperatureC"))
city2
# City Date Mean_TemperatureC
# 1 London 2016-08-05 14
# 2 Liverpool 2016-08-09 14
# 3 Manchester 2016-08-10 13
# 4 London 2016-09-05 20
# 5 Liverpool 2016-09-09 18
# 6 Manchester 2016-09-10 13
The first step is to get the station codes of the different cities with the getStationCode and sub functions. We then build the vector of mean temperatures, and finally we create the data.frame city2 with the correct column names.
Looking up the station code is necessary because some city names (like Liverpool) exist in more than one country (Canada and the UK in this case). I checked the results for Liverpool on the Weather Underground website, and they are correct.
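The row-by-row pattern used above can be sketched without a network call. In this sketch fetch_mean_temp is a hypothetical stand-in for the station lookup plus getWeatherForDate; it returns a made-up constant and fails on purpose for one city, so that the error handling is visible. safe_fetch is likewise only illustrative.

```r
city <- data.frame(
  City = c("London", "Liverpool", "Manchester"),
  Date = c("2016-08-05", "2016-08-09", "2016-08-10"),
  stringsAsFactors = FALSE
)

# Hypothetical stand-in for the real lookup; the value 10 is made up
fetch_mean_temp <- function(city_name, date) {
  if (city_name == "Liverpool") stop("no station found")  # simulated failure
  10
}

# Wrap the lookup so one failing request yields NA instead of aborting the loop
safe_fetch <- function(city_name, date) {
  tryCatch(fetch_mean_temp(city_name, date), error = function(e) NA_real_)
}

# mapply walks City and Date in parallel, one call per row
city$Mean_TemperatureC <- mapply(safe_fetch, city$City, city$Date,
                                 USE.NAMES = FALSE)
```

Replacing the body of fetch_mean_temp with the real lookup keeps the rest of the loop unchanged.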

Compare values in data.frame from different rows

I have an R data.frame of college football data, with two entries for each game (one for each team, with stats and whatnot). I would like to compare points from these to create a binary Win/Loss variable, but I have no idea how (I'm not very experienced with R).
Is there a way I can iterate through the columns and try to match them up against another column (I have a game ID variable, so I'd match on that) and create aforementioned binary Win/Loss variable by comparing points values?
Excerpt of dataframe (many variables left out):
Team Code  Name     Game Code        Date        Site     Points
5          Akron    5050320051201    12/1/2005   NEUTRAL  32
5          Akron    404000520051226  12/26/2005  NEUTRAL  23
8          Alabama  419000820050903  9/3/2005    TEAM     37
8          Alabama  664000820050910  9/10/2005   TEAM     43
What I want is to append a new column, a binary variable that's assigned 1 or 0 based on if the team won or lost. To figure this out, I need to take the game code, say 5050320051201, find the other row with that same game code (there's only one other row with that same game code, for the other team in that game), and compare the points value for the two, and use that to assign the 1 or 0 for the Win/Loss variable.
Assuming that your data has exactly two teams for each unique Game Code and there are no tie games as given by the following example:
df <- structure(list(`Team Code` = c(5L, 6L, 5L, 5L, 8L, 9L, 9L, 8L
), Name = c("Akron", "St. Joseph", "Akron", "Miami(Ohio)", "Alabama",
"Florida", "Tennessee", "Alabama"), `Game Code` = structure(c(1L,
1L, 2L, 2L, 3L, 3L, 4L, 4L), .Label = c("5050320051201", "404000520051226",
"419000820050903", "664000820050910"), class = "factor"), Date = structure(c(13118,
13118, 13143, 13143, 13029, 13029, 13036, 13036), class = "Date"),
Site = c("NEUTRAL", "NEUTRAL", "NEUTRAL", "NEUTRAL", "TEAM",
"AWAY", "AWAY", "TEAM"), Points = c(32L, 25L, 23L, 42L, 37L,
45L, 42L, 43L)), .Names = c("Team Code", "Name", "Game Code",
"Date", "Site", "Points"), row.names = c(NA, -8L), class = "data.frame")
print(df)
## Team Code Name Game Code Date Site Points
##1 5 Akron 5050320051201 2005-12-01 NEUTRAL 32
##2 6 St. Joseph 5050320051201 2005-12-01 NEUTRAL 25
##3 5 Akron 404000520051226 2005-12-26 NEUTRAL 23
##4 5 Miami(Ohio) 404000520051226 2005-12-26 NEUTRAL 42
##5 8 Alabama 419000820050903 2005-09-03 TEAM 37
##6 9 Florida 419000820050903 2005-09-03 AWAY 45
##7 9 Tennessee 664000820050910 2005-09-10 AWAY 42
##8 8 Alabama 664000820050910 2005-09-10 TEAM 43
You can use dplyr to generate what you want:
library(dplyr)
result <- df %>% group_by(`Game Code`) %>%
mutate(`Win/Loss`=if(first(Points) > last(Points)) as.integer(c(1,0)) else as.integer(c(0,1)))
print(result)
##Source: local data frame [8 x 7]
##Groups: Game Code [4]
##
## Team Code Name Game Code Date Site Points Win/Loss
## <int> <chr> <fctr> <date> <chr> <int> <int>
##1 5 Akron 5050320051201 2005-12-01 NEUTRAL 32 1
##2 6 St. Joseph 5050320051201 2005-12-01 NEUTRAL 25 0
##3 5 Akron 404000520051226 2005-12-26 NEUTRAL 23 0
##4 5 Miami(Ohio) 404000520051226 2005-12-26 NEUTRAL 42 1
##5 8 Alabama 419000820050903 2005-09-03 TEAM 37 0
##6 9 Florida 419000820050903 2005-09-03 AWAY 45 1
##7 9 Tennessee 664000820050910 2005-09-10 AWAY 42 0
##8 8 Alabama 664000820050910 2005-09-10 TEAM 43 1
Here, we first group_by the Game Code and then use mutate to create the Win/Loss column for each group. The logic is simply that if the first Points value is greater than the last (there are only two per group by assumption), we set the column to c(1,0); otherwise, we set it to c(0,1). Note that this logic does not handle ties, but it can easily be extended to do so. Note also that we surround the column names with back-quotes because they contain special characters such as a space and /.
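One possible extension for ties, sketched here in base R with ave instead of dplyr: compare each team's points with the best and worst score in its game, and record NA when the two are equal. The games data frame and its column names are made up for illustration.

```r
# Toy data: two rows per game; game g3 is a tie
games <- data.frame(
  game   = c("g1", "g1", "g2", "g2", "g3", "g3"),
  team   = c("A", "B", "A", "C", "B", "C"),
  points = c(32, 25, 23, 42, 17, 17)
)

# Highest and lowest score within each game, repeated on every row of that game
best  <- ave(games$points, games$game, FUN = max)
worst <- ave(games$points, games$game, FUN = min)

# 1 for the winner, 0 for the loser, NA for both teams when the game is tied
games$win <- ifelse(best == worst, NA_integer_,
                    as.integer(games$points == best))
```

This version does not depend on the order of the two rows within a game.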
Alternatively, once you have a Wins column coded as 1 or 0 (a binary variable), you can assign to a subset of rows with logical indexing:
footballdata$SomeVariable[footballdata$Wins == 1] <- stuff # `stuff` is a placeholder for your value
R's data frames make it easy to pick out only the rows you want, such as the rows where Wins is 1, and assign to them as above. If you fill those rows from another data frame, make sure the replacement has the same number of rows as the subset being filled. To combine conditions, join them with & inside a single pair of brackets rather than chaining two sets of brackets:
footballdata$SomeVariable[footballdata$Wins == 1 & footballdata$Team == "Browns"] <- "Hopeful"

R make new data frame from current one

I'm trying to calculate the best goal differentials in the group stage of the 2014 world cup.
football <- read.csv(
file="http://pastebin.com/raw.php?i=iTXdPvGf",
header = TRUE,
strip.white = TRUE
)
football <- head(football,n=48L)
football[which(max(abs(football$home_score - football$away_score)) == abs(football$home_score - football$away_score)),]
Results in
home home_continent home_score away away_continent away_score result
4 Cameroon Africa 0 Croatia Europe 4 l
7 Spain Europe 1 Netherlands Europe 5 l
37 Germany
So those are the games with the highest goal differential, but now I need to make a new data frame that has a team name and abs(football$home_score-football$away_score).
football$score_diff <- abs(football$home_score - football$away_score)
football$winner <- ifelse(football$home_score > football$away_score, as.character(football$home),
ifelse(football$result == "d", NA, as.character(football$away)))
You could save some typing this way. Compute the score difference and the winner first. When result is w, the home team is the winner, so you do not need to compare the scores at all. Once both columns are added, filter the rows whose score difference equals max().
mydf <- read.csv(file="http://pastebin.com/raw.php?i=iTXdPvGf",
header = TRUE, strip.white = TRUE)
mydf <- head(mydf,n = 48L)
library(dplyr)
mutate(mydf, scorediff = abs(home_score - away_score),
winner = ifelse(result == "w", as.character(home),
ifelse(result == "l", as.character(away), "draw"))) %>%
filter(scorediff == max(scorediff))
# home home_continent home_score away away_continent away_score result scorediff winner
#1 Cameroon Africa 0 Croatia Europe 4 l 4 Croatia
#2 Spain Europe 1 Netherlands Europe 5 l 4 Netherlands
#3 Germany Europe 4 Portugal Europe 0 w 4 Germany
Here is another option without using ifelse for creating the "winner" column. This is based on row/column indexes. The numeric column index is created by matching the result column with its unique elements (match(football$result,..), and the row index is just 1:nrow(football). Subset the "football" dataset with columns 'home', 'away' and cbind it with an additional column 'draw' with NAs so that the 'd' elements in "result" change to NA.
football$score_diff <- abs(football$home_score - football$away_score)
football$winner <- cbind(football[c('home', 'away')],draw=NA)[
cbind(1:nrow(football), match(football$result, c('w', 'l', 'd')))]
football[with(football, score_diff==max(score_diff)),]
# home home_continent home_score away away_continent away_score result
#60 Brazil South America 1 Germany Europe 7 l
# score_diff winner
#60 6 Germany
If the dataset is very big, you could speed up the match by using chmatch from library(data.table)
library(data.table)
chmatch(as.character(football$result), c('w', 'l', 'd'))
NOTE: I used the full dataset in the link
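The row/column indexing trick described above is easier to follow on a tiny example. This is a minimal sketch on three toy rows; the names matches, candidates and col_idx are only illustrative.

```r
# Toy data: one home loss, one home win, one draw
matches <- data.frame(
  home   = c("Spain", "Germany", "Iran"),
  away   = c("Netherlands", "Portugal", "Nigeria"),
  result = c("l", "w", "d"),
  stringsAsFactors = FALSE
)

# Three candidate columns per row: home (picked for "w"), away ("l"), NA ("d")
candidates <- as.matrix(cbind(matches[c("home", "away")], draw = NA))

# Column to pick in each row: 1 for "w", 2 for "l", 3 for "d"
col_idx <- match(matches$result, c("w", "l", "d"))

# A two-column index matrix selects one (row, column) cell per row
matches$winner <- candidates[cbind(seq_len(nrow(matches)), col_idx)]
```

Each row of the index matrix names exactly one cell, so the result is a plain vector with one winner (or NA) per match.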
