How to remove values in a column based on other column values equaling the column values above it? - r

I am currently coding in R and merged two dataframes together so I could include all the information together but I don't want the one column "Cost" to be duplicated multiple times (it was due to the unique values of the last 3 columns). I want it to include the cost 100 only in the first column and then for every other instance where the columns "State", "Market", "Date", and "Cost" are the same as above. I attached what the dataframe looks like and what I want it to be changed to. Thank you!
What it currently looks like
What it should look like

Please use index like in this example:
name_of_your_dataset[nrow_init:nrow_fin, ncol] <- NA
In your case, assuming the name of your dataset as 'data'
data[2:4,4]<- NA
Just leave a positive feedback and if I was useful, just vote this answer up.

Here is a solution using duplicated with your dataframe (df)
State Market Date Cost Word format Type
1 AZ Phoenix 10-20-2020 100 HELLO AM Sports related
2 AZ Phoenix 10-21-2020 NA GOODBYE PM Non Sports related
3 AZ Phoenix 10-22-2020 NA YES FM Country
4 AZ Phoenix 10-23-2020 NA NONE CM Rock
Set duplicates to NA
df$Cost[duplicated(df$Cost)] <- NA
Output:
State Market Date Cost Word format Type
1 AZ Phoenix 10-20-2020 100 HELLO AM Sports related
2 AZ Phoenix 10-21-2020 NA GOODBYE PM Non Sports related
3 AZ Phoenix 10-22-2020 NA YES FM Country
4 AZ Phoenix 10-23-2020 NA NONE CM Rock

The column Date is different so I think you want to do replace duplicated Cost for every value of State and Market combination.
library(dplyr)
df <- df %>%
group_by(State, Market) %>%
mutate(Cost = replace(Cost, duplicated(Cost), NA)) %>%
ungroup
df
# State Market Date Cost Word format Type
# <chr> <chr> <chr> <dbl> <chr> <chr> <chr>
#1 AZ Phoenix 10-20-2020 100 HELLO AM Sports related
#2 AZ Phoenix 10-21-2020 NA GOODBYE PM Non Sports related
#3 AZ Phoenix 10-22-2020 NA YES FM Country
#4 AZ Phoenix 10-23-2020 NA NONE CM Rock
data
It is easier to help if you provide data in a reproducible format
df <- structure(list(State = c("AZ", "AZ", "AZ", "AZ"), Market = c("Phoenix",
"Phoenix", "Phoenix", "Phoenix"), Date = c("10-20-2020", "10-21-2020",
"10-22-2020", "10-23-2020"), Cost = c(100, 100, 100, 100), Word = c("HELLO",
"GOODBYE", "YES", "NONE"), format = c("AM", "PM", "FM", "CM"),
Type = c("Sports related", "Non Sports related", "Country",
"Rock")), row.names = c(NA, -4L), class = "data.frame")

Related

Categorizing Data Based on Text Value in New Column [duplicate]

This question already has answers here:
How to join (merge) data frames (inner, outer, left, right)
(13 answers)
Closed 2 years ago.
I'm trying to take an existing data frame that has a column for state and add a new column called Region depending on what the row's state is. So for example any row that has "CA" should be categorized "West" and any row that has "IL" should be Midwest. There are 4 regions: West, South, Midwest, and Northeast.
I had tried doing this separately in 4 code chunks like this:
south <- c("FL", "KY", "GA", "TX", "MS", "SC", "NC", "AL", "LA", "AR", "TN", "VA", "DC", "MD", "DE", "WV") #16 states
south.mdata <- mdata %>% filter(state %in% south) #1832 locations
south.byyear <- south.mdata %>% group_by(Year) %>% summarize(s.total = n())
south.total <- data %>% filter(state %in% south) %>% group_by(Year) %>% summarize(yearly.total = n())
But this seems repetitive and not the most efficient way to do this. Plus I'd like to be able to group_by both Year and Region so I can compare across regions.
I'm having trouble implementing this and the first thing that comes to mind is to do some sort of if/else loop using filter but I know loops aren't really R's style.
The original data looks like this:
Field.1 ID title description streetaddress city state
1 74 DE074 Cork 'n' Bottle Route 14, 1 mile south of town Rehoboth Beach DE
2 75 DE075 Cork 'n' Bottle Route 14, 1 mile south of town Rehoboth Beach DE
3 23 DE023 Dog House 1200 DuPont Hwy. Wilmington DE
4 19 DE019 Dog House 1200 DuPont Hwy Wilmington DE
5 26 DE026 Dog House 1200 Dupont Wilmington DE
6 65 DE065 Henlopen Hotel Bar Boardwalk & Surf Rehoboth Beach DE
amenityfeatures type Year notes lon lat
1 (M),(R) Restaurant 1977 <NA> -75.07601 38.72095
2 (M),(R) Restaurant 1976 <NA> -75.07601 38.72095
3 (M),(R) Restaurant 1975 <NA> -75.58243 39.68839
4 (M),(R) Restaurant 1976 <NA> -75.58243 39.68839
5 (M),(R) Restaurant 1974 <NA> -75.58723 39.76705
6 (M) Bars/Clubs,Hotel 1972 <NA> -75.07712 38.72280
status
1 Location could not be verified. General city or location coordinates used.
2 Location could not be verified. General city or location coordinates used.
3 Google Verified Location
4 Google Verified Location
5 Google Verified Location
6 Verified Location
I want to add a new column called "Region" that would loop through each row, look at the state, and then add a value to Region.
Any suggestions on the right syntax to do something like this would be so appreciated! Thanks so much!
This is a snippet of what the solution suggested by Gregor’s comment could look like.
library(tidyverse)
orig_data <-
tribble(~ID, ~state,
1, "CA",
2, "FL",
3, "DE")
region_lookup <-
tribble(~state, ~region,
"CA", "west",
"FL", "south",
"DE", "south")
left_join(orig_data, region_lookup)
#> Joining, by = "state"
#> # A tibble: 3 x 3
#> ID state region
#> <dbl> <chr> <chr>
#> 1 1 CA west
#> 2 2 FL south
#> 3 3 DE south
Created on 2020-11-02 by the reprex package (v0.3.0)
the simplest solution would be a join. Therefore you need a data.frame/tibble that has all the states an regions. Fortunately the data is already in base R:
library(dplyr)
# build the tibble of state abbrevitation and region from base R data
state_region <- dplyr::tibble(state.abb, state.region)
# join it on your data.frame/tibble
ORIGINAL_DATA %>%
dplyr::left_join(state_region, by = c("state" = "state.abb"))
Now you should have a new column called "state.region" that you can group by. Be aware that the states have to be in upper case.

If/Else statement in R

I have two dataframes in R:
city price bedroom
San Jose 2000 1
Barstow 1000 1
NA 1500 1
Code to recreate:
data = data.frame(city = c('San Jose', 'Barstow'), price = c(2000,1000, 1500), bedroom = c(1,1,1))
and:
Name Density
San Jose 5358
Barstow 547
Code to recreate:
population_density = data.frame(Name=c('San Jose', 'Barstow'), Density=c(5358, 547));
I want to create an additional column named city_type in the data dataset based on condition, so if the city population density is above 1000, it's an urban, lower than 1000 is a suburb, and NA is NA.
city price bedroom city_type
San Jose 2000 1 Urban
Barstow 1000 1 Suburb
NA 1500 1 NA
I am using a for loop for conditional flow:
for (row in 1:length(data)) {
if (is.na(data[row,'city'])) {
data[row, 'city_type'] = NA
} else if (population[population$Name == data[row,'city'],]$Density>=1000) {
data[row, 'city_type'] = 'Urban'
} else {
data[row, 'city_type'] = 'Suburb'
}
}
The for loop runs with no error in my original dataset with over 20000 observations; however, it yields a lot of wrong results (it yields NA for the most part).
What has gone wrong here and how can I do better to achieve my desired result?
I have become quite a fan of dplyr pipelines for this type of join/filter/mutate workflow. So here is my suggestion:
library(dplyr)
# I had to add that extra "NA" there, did you not? Hm...
data <- data.frame(city = c('San Jose', 'Barstow', NA), price = c(2000,1000, 500), bedroom = c(1,1,1))
population <- data.frame(Name=c('San Jose', 'Barstow'), Density=c(5358, 547));
data %>%
# join the two dataframes by matching up the city name columns
left_join(population, by = c("city" = "Name")) %>%
# add your new column based on the desired condition
mutate(
city_type = ifelse(Density >= 1000, "Urban", "Suburb")
)
Output:
city price bedroom Density city_type
1 San Jose 2000 1 5358 Urban
2 Barstow 1000 1 547 Suburb
3 <NA> 500 1 NA <NA>
Using ifelse create the city_type in population_density, then we using match
population_density$city_type=ifelse(population_density$Density>1000,'Urban','Suburb')
data$city_type=population_density$city_type[match(data$city,population_density$Name)]
data
city price bedroom city_type
1 San Jose 2000 1 Urban
2 Barstow 1000 1 Suburb
3 <NA> 1500 1 <NA>

How to loop through a list of cities and get temparature for given date with 'weatherData' in R

I have the following df
city <- data.frame(City = c("London", "Liverpool", "Manchester","London", "Liverpool", "Manchester"),
Date = c("2016-08-05","2016-08-09","2016-08-10", "2016-09-05","2016-09-09","2016-09-10"))
I want to loop through it and get weather data by city$City for the data in city$Date
city <- data.frame(City = c("London", "Liverpool", "Manchester","London", "Liverpool", "Manchester"),
Date = c("2016-08-05","2016-08-09","2016-08-10", "2016-09-05","2016-09-09","2016-09-10"),
Mean_TemperatureC = c("15","14","13","14","11","14"))
Currently I am using weatherData to get weather data with the following funtion:
library(weatherData)
df <- getWeatherForDate("BOS", "2016-08-01")
Can someone help?
Here is a possibility:
temp <- as.matrix(city)
codes <- sapply(1:nrow(city), function(x) getStationCode(city[x,1], "GB")[[1]])
station <- sub("^[[:print:]]+\\s([A-Z]{4})\\s[[:print:]]+", "\\1", codes)
temp[, 1] <- station
temperature <- sapply(1:nrow(temp), function(x) {getWeatherForDate(temp[x, 1], temp[x, 2])$Mean_TemperatureC})
city2 <- setNames(cbind(city, temperature), c(colnames(city), "Mean_TemperatureC"))
city2
# City Date Mean_TemperatureC
# 1 London 2016-08-05 14
# 2 Liverpool 2016-08-09 14
# 3 Manchester 2016-08-10 13
# 4 London 2016-09-05 20
# 5 Liverpool 2016-09-09 18
# 6 Manchester 2016-09-10 13
The first step is to get the codes of the different cities with the sub and the getStationCode functions. We then get the vector with the mean of the temperatures, and finally, we create the data.frame city2, with the correct column names.
It is necessary to look for the stations code, as some cities (like Liverpool) could be on different countries (Canada and UK in this case). I checked the results on Weather Underground website for Liverpool, the results are correct.

R. How to add sum row in data frame

I know this question is very elementary, but I'm having a trouble adding an extra row to show summary of the row.
Let's say I'm creating a data.frame using the code below:
name <- c("James","Kyle","Chris","Mike")
nationality <- c("American","British","American","Japanese")
income <- c(5000,4000,4500,3000)
x <- data.frame(name,nationality,income)
The code above creates the data.frame below:
name nationality income
1 James American 5000
2 Kyle British 4000
3 Chris American 4500
4 Mike Japanese 3000
What I'm trying to do is to add a 5th row and contains: name = "total", nationality = "NA", age = total of all rows. My desired output looks like this:
name nationality income
1 James American 5000
2 Kyle British 4000
3 Chris American 4500
4 Mike Japanese 3000
5 Total NA 16500
In a real case, my data.frame has more than a thousand rows, and I need efficient way to add the total row.
Can some one please advice? Thank you very much!
We can use rbind
rbind(x, data.frame(name='Total', nationality=NA, income = sum(x$income)))
# name nationality income
#1 James American 5000
#2 Kyle British 4000
#3 Chris American 4500
#4 Mike Japanese 3000
#5 Total <NA> 16500
using index.
name <- c("James","Kyle","Chris","Mike")
nationality <- c("American","British","American","Japanese")
income <- c(5000,4000,4500,3000)
x <- data.frame(name,nationality,income, stringsAsFactors=FALSE)
x[nrow(x)+1, ] <- c('Total', NA, sum(x$income))
UPDATE: using list
x[nrow(x)+1, ] <- list('Total', NA, sum(x$income))
x
# name nationality income
# 1 James American 5000
# 2 Kyle British 4000
# 3 Chris American 4500
# 4 Mike Japanese 3000
# 5 Total <NA> 16500
sapply(x, class)
# name nationality income
# "character" "character" "numeric"
If you want the exact row as you put in your post, then the following should work:
newdata = rbind(x, data.frame(name='Total', nationality='NA', income = sum(x$income)))
I though agree with Jaap that you may not want this row to add to the end. In case you need to load the data and use it for other analysis, this will add to unnecessary trouble. However, you may also use the following code to remove the added row before other analysis:
newdata = newdata[-newdata$name=='Total',]

Reading names with special characters using R

I've an excel (xlsx) table and in the column "PLAYERS" European players have an asterisk in their names and South Americans don't. Something like this
PLAYERS
Neymar
*Bale*
Messi
*Ronaldo*
*Benzema*
*Iniesta*
DiMaria
Is there any way I can use R (or excel itself) to split this dataset into one with Europeans (with asterisk) and another one with South Americans? Of course, the data set contains other columns like "SALARY", "SCORED GOALS", "OFFSITE", "AGE" etc. etc. etc.
Thanks,
Diego.
You could check if there's an "*" in the players name and in a new column write "European" or "South American" and, if you want, you could then split the data frame into a list with two data.frames, one with Europeans and the other with South Americans:
df <- data.frame(PLAYERS = c("Neymar", "*Ronaldo*", "Messi"), SALARY = 5:7)
df
# PLAYERS SALARY
#1 Neymar 5
#2 *Ronaldo* 6
#3 Messi 7
# check if there's a * in the PLAYERS column
df$Location <- ifelse(grepl("\\*", df$PLAYERS), "European", "South American")
df
# PLAYERS SALARY Location
#1 Neymar 5 South American
#2 *Ronaldo* 6 European
#3 Messi 7 South American
#split the data based on location:
dflist <- split(df, df$Location)
dflist
#$European
# PLAYERS SALARY Location
#2 *Ronaldo* 6 European
#
#$`South American`
# PLAYERS SALARY Location
#1 Neymar 5 South American
#3 Messi 7 South American
Now you can access each list element (which is a data.frame) by typing
dflist[["European"]] # or "South American" instead
# PLAYERS SALARY Location
#2 *Ronaldo* 6 European
You can split this specific column and name the resulting list with split and setNames
> dat <- structure(list(PLAYERS = structure(c(6L, 1L, 5L, 7L, 2L, 4L, 3L),
.Label = c("*Bale*", "*Benzema*", "DiMaria", "*Iniesta*",
"Messi", "Neymar", "*Ronaldo*"), class = "factor")),
.Names = "PLAYERS", class = "data.frame", row.names = c(NA,-7L))
> setNames(split(dat, grepl("[*]", dat$PLAYERS)), nm = c("Euro", "SoAm"))
#$Euro
# PLAYERS
# 1 Neymar
# 3 Messi
# 7 DiMaria
#
# $SoAm
# PLAYERS
# 2 *Bale*
# 4 *Ronaldo*
# 5 *Benzema*
# 6 *Iniesta*
Create a PivotTable from your source data with PLAYERS for ROWS. Filter with Label Filters, Contains... ~* and click on Grand Total. Return to PT, select Does Not Contain... and click on Grand Total again.

Resources