I've been trying on my own and searching the net and Stack Overflow for a while now with no success. I have a data frame that I subset by applying conditions and selecting columns, but I can't retrieve aggregated output from it.
Dataframe mydf:
mydf = list()
mydf = cbind(mydf,
c("New York", "New York", "San Francisco"),
c(4000, 7600, 2500),
c("Bartosz", "Damian", "Maciej"))
mydf = as.data.frame(mydf)
colnames(mydf) = c("city","salary","name")
Let's assume the relevant part of the data frame is returned with:
subset(mydf, city == "New York", select = c(salary, name))
which returns a data frame such as:
salary name
9 4000 Bartosz
10 7600 Damian
Now I need to calculate the sum and average of the salaries and find the employee with the lowest salary from the data frame above, preferably as a one-liner modifying the code above (I'm guessing that's possible), so that it returns:
for sum: 11600
for avg: 5800
for least: 4000 Bartosz
I've tried things such as (1)
subset(mydf, city == "New York", select = sum(salary))
or (2)
x = subset(mydf, city == "New York", select = salary)
min(x)
and many more combinations, which only yield an error saying that the summary function is only defined on a data frame with all variables being numbers (2), or the same output as the first snippet without the sum (1).
The problem might be that your dataframe object actually contains a bunch of lists. So if you take
ny.df = subset(mydf, city == "New York", select = c(salary, name))
then any of the subsequent work needs to be peppered with as.numeric calls to translate your lists into vectors. These will give you your answers:
sum(as.numeric(ny.df$salary)) # sum
mean(as.numeric(ny.df$salary)) # avg
ny.df[which(as.numeric(ny.df$salary) == min(as.numeric(ny.df$salary))),] # row with min salary
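For context, a quick check (a sketch assuming the cbind-built mydf from the question) confirms that the columns really are lists rather than atomic vectors, which is why the as.numeric() calls are needed:
# Each column of the cbind-built mydf has class "list"
sapply(mydf, class)
str(mydf$salary)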
Alternatively, you can define mydf as a dataframe of vectors instead of a dataframe of lists:
mydf = data.frame(c("New York", "New York", "San Francisco"),
c(4000, 7600, 2500),
c("Bartosz", "Damian", "Maciej"))
colnames(mydf) = c("city","salary","name")
ny.df = subset(mydf, city == "New York", select = c(salary, name))
sum(ny.df$salary)
mean(ny.df$salary)
ny.df[which(ny.df$salary == min(ny.df$salary)),]
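If you want something closer to the one-liner asked for in the question, a sketch along these lines (using the vector-based mydf defined just above) works in base R:
# sum, average, and the lowest-paid employee for New York in one expression
with(subset(mydf, city == "New York"),
     list(sum = sum(salary),
          avg = mean(salary),
          least = data.frame(salary = min(salary),
                             name = name[which.min(salary)])))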
Your mydf was constructed oddly, so I made my own. I split mydf by city and then obtained the required values by running the relevant operations (mean, sum, etc.) on each subgroup.
#DATA
mydf = structure(list(city = structure(c(1L, 1L, 2L), .Label = c("New York",
"San Francisco"), class = "factor"), salary = c(4000, 7600, 2500
), name = structure(1:3, .Label = c("Bartosz", "Damian", "Maciej"
), class = "factor")), .Names = c("city", "salary", "name"), row.names = c(NA,
-3L), class = "data.frame")
do.call(rbind, lapply(split(mydf, mydf$city), function(a)
data.frame(employee = a$name[which.min(a$salary)], #employee with least salary
mean = mean(a$salary), #mean salary
sum = sum(a$salary)))) #sum of salary
# employee mean sum
#New York Bartosz 5800 11600
#San Francisco Maciej 2500 2500
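For just the per-city sum and mean, a shorter base-R sketch with aggregate() also works (picking the lowest-paid employee is easier with the split/lapply approach above):
# one row per city; the salary column holds a sum/avg matrix
aggregate(salary ~ city, mydf, function(s) c(sum = sum(s), avg = mean(s)))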
There is a simple and fast solution using data.table:
library(data.table)
setDT(mydf)[, .( salary_sum = sum(salary),
salary_avg = mean(salary),
name = name[which.min(salary)]), by= city]
#             city salary_sum salary_avg    name
# 1:      New York      11600       5800 Bartosz
# 2: San Francisco       2500       2500  Maciej
your dataset:
mydf = data.frame(city=c("New York", "New York", "San Francisco"),
salary=c(4000, 7600, 2500),
name=c("Bartosz", "Damian", "Maciej"))
Your data frame is structured unusually, with lists inside the data frame, which may be causing your issues. Here is a dplyr solution (now edited to find the lowest salary):
library(dplyr)
mydf <- data.frame(
city = c("New York", "New York", "San Francisco"),
salary = c(4000, 7600, 2500),
name = c("Bartosz", "Damian", "Maciej"))
mydf %>%
group_by(city) %>%
mutate(avg = mean(salary),
sum = sum(salary)) %>%
top_n(-1, wt = salary)
# city salary name avg sum
# <fctr> <dbl> <fctr> <dbl> <dbl>
# 1 New York 4000 Bartosz 5800 11600
# 2 San Francisco 2500 Maciej 2500 2500
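On current dplyr versions top_n() is superseded; a hedged sketch with slice_min() returns the same rows:
mydf %>%
  group_by(city) %>%
  mutate(avg = mean(salary),
         sum = sum(salary)) %>%
  slice_min(salary, n = 1)  # lowest salary per city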
I think dplyr might be what you are looking for:
library(dplyr)
mydf %>%
group_by(city) %>%
filter (city =="New York") %>%
summarise(mean(salary), sum(salary))
# A tibble: 1 x 3
# city mean(salary) sum(salary)
# <fctr> <dbl> <dbl>
#1 New York 5800 11600
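If you also want the lowest-paid employee from the same summarise() call, a sketch with named output columns (assuming the vector-based mydf from earlier):
mydf %>%
  filter(city == "New York") %>%
  group_by(city) %>%
  summarise(avg   = mean(salary),
            sum   = sum(salary),
            least = name[which.min(salary)])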
There is a good tutorial at this link: https://rpubs.com/justmarkham/dplyr-tutorial
I'm receiving data daily that's in the same format, and I need to keep track of some summary stats for each day.
This is an example of how I may receive the data
Sales_11.23.2020 <- data.frame(State = c("New York", "New Jersey", "Texas","New Mexico","California",
"Kansas","Florida","Alaska","Montana", "Maine"),
Units = c(455,453,125,135,135,568,451,125,215,314),
Sales = c("20000","12530","51110","54110","65000",
"58220","54612","45102","45896","12510"),
Target_Sales = c("20000","20000","55000","50000","65000",
"58000","55000","45000","45000","13000"))
Sales_11.24.2020 <- data.frame(State = c("New York", "New Jersey", "Texas","New Mexico","California",
"Kansas","Florida","Alaska","Montana", "Maine"),
Units = c(460,463,165,139,165,668,421,125,205,316),
Sales = c("21000","13530","51010","54410","63000",
"56220","57612","42602","43696","12160"),
Target_Sales = c("25000","15000","55000","55000","65000",
"58000","55000","47000","45000","13000"))
Sales_11.25.2020 <- data.frame(State = c("New York", "New Jersey", "Texas","New Mexico","California",
"Kansas","Florida","Alaska","Montana", "Maine"),
Units = c(405,353,325,155,235,560,401,125,215,314),
Sales = c("20200","16210","51310","56110","65500",
"58225","54602","45602","45806","12410"),
Target_Sales = c("25000","22000","55000","50000","65000",
"60000","55000","35000","40000","10000"))
My desired output would be something like a single table with a Date column, so I can keep track of these stats on a daily basis.
Try base R: manipulate the data using lists and a processing function. It looks like some variables are stored as factors, so you must convert them to numbers and then compute the aggregated values. Here is the code:
#Code
List <- mget(ls(pattern = 'Sales_'))
#Function
process <- function(x)
{
  x$Units <- as.numeric(as.character(x$Units))
  x$Sales <- as.numeric(as.character(x$Sales))
  x$Target_Sales <- as.numeric(as.character(x$Target_Sales))
  y <- aggregate(cbind(Units, Sales, Target_Sales) ~ 1, x, sum, na.rm = TRUE)
  return(y)
}
#Apply
LL <- lapply(List,process)
#Bind
df <- do.call(rbind,LL)
df$Date <- format(as.Date(gsub('Sales_','',rownames(df)),'%m.%d.%Y'),'%m/%d/%Y')
rownames(df) <- NULL
df<-df[,c('Date','Units','Sales','Target_Sales')]
Output:
df
Date Units Sales Target_Sales
1 11/23/2020 2976 419090 426000
2 11/24/2020 3127 415240 433000
3 11/25/2020 3088 425975 417000
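Because mget() matches object names by pattern, a newly received day is picked up automatically; a hedged convenience wrapper around the same steps (daily_summary is a name introduced here, not part of the question):
# Rebuild the summary table from whatever Sales_* objects currently exist
# in the global environment, reusing process() from above.
daily_summary <- function() {
  List <- mget(ls(pattern = 'Sales_', envir = globalenv()), envir = globalenv())
  df <- do.call(rbind, lapply(List, process))
  df$Date <- format(as.Date(gsub('Sales_', '', rownames(df)), '%m.%d.%Y'), '%m/%d/%Y')
  rownames(df) <- NULL
  df[, c('Date', 'Units', 'Sales', 'Target_Sales')]
}
daily_summary()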
Using tidyverse:
library(tidyverse)
mget(ls(pattern="Sales_")) %>%
bind_rows(.id="Date") %>%
mutate(across(Units:Target_Sales, ~ as.numeric(as.character(.)))) %>%
group_by(Date) %>%
summarise(across(Units:Target_Sales, sum, .names="Total {.col}")) %>%
mutate(Date=str_replace(Date, "^.+?_(\\d+).(\\d+).(\\d+)", "\\1/\\2/\\3"))
# A tibble: 3 x 4
Date `Total Units` `Total Sales` `Total Target_Sales`
<chr> <dbl> <dbl> <dbl>
1 11/23/2020 2976 419090 426000
2 11/24/2020 3127 415240 433000
3 11/25/2020 3088 425975 417000
I have the following data frame in R and would like to get the FIPS codes from it. I tried to use the fips function in usmap (https://rdrr.io/cran/usmap/man/fips.html), but I could not get the codes because I thought I needed to enclose the values in double quotes. I then tried paste0('"', df1$state, '"'), but that did not work. Is there an efficient way to get the FIPS codes?
> df1
       state     county
1 california       napa
2    florida palm beach
3    florida    collier
4    florida      duval
UPDATE
I can get "\"california\"" by using dQuote. Thanks. After converting each column, I tried the following. How do I deal with this issue?
> df1$state <- dQuote(df1$state, FALSE)
> df1$county <- dQuote(df1$county, FALSE)
> fips(state = df1$state, county = df1$county)
Error in fips(state = df1$state, county = df1$county) :
`county` parameter cannot be used with multiple states.
> fips(state = df1$state[1], county = df1$county[1])
Error in fips(state = df1$state[1], county = df1$county[1]) :
"napa" is not a valid county in "california".
> fips(state = "california", county = "napa")
[1] "06055"
We can split the dataset by state and apply fips to each group:
library(usmap)
lapply(split(df1, df1$state), function(x)
fips(state = x$state[1], county = x$county))
#$california
#[1] "06055"
#$florida
#[1] "12099" "12021" "12031"
Or with Map
lst1 <- split(df1$county, df1$state)
Map(fips, lst1, state = names(lst1))
#$california
#[1] "06055"
#$florida
#[1] "12099" "12021" "12031"
Or with tidyverse
library(dplyr)
library(tidyr)
df1 %>%
group_by(state) %>%
summarise(new = list(fips(state = first(state), county = county))) %>%
unnest(c(new))
# A tibble: 4 x 2
# state new
# <chr> <chr>
#1 california 06055
#2 florida 12099
#3 florida 12021
#4 florida 12031
data
df1 <- structure(list(state = c("california", "florida", "florida",
"florida"), county = c("napa", "palm beach", "collier", "duval"
)), class = "data.frame", row.names = c("1", "2", "3", "4"))
I would like to match the strings from my first dataset with all of their closest common matches.
Data looks like:
dataset1:
California
Texas
Florida
New York
dataset2:
Californiia
callifoornia
T3xas
Te xas
texas
Fl0 rida
folrida
New york
new york
desired result is:
col_1      col_2       col_3        col_4
California Californiia callifoornia
Texas      T3xas       texas        Te xas
Florida    folrida     Fl0 rida
New York   New york    new york
The question is:
How do I search for common strings between the first dataset and the
second dataset, and generate a list of terms in the second dataset
that align with each term in the first?
Thanks in advance.
library(fuzzyjoin); library(tidyverse)
dataset1 %>%
stringdist_left_join(dataset2,
max_dist = 3) %>%
rename(col_1 = "states.x") %>%
group_by(col_1) %>%
mutate(col = paste0("col_", row_number() + 1)) %>%
spread(col, states.y)
#Joining by: "states"
## A tibble: 4 x 4
## Groups: col_1 [4]
# col_1 col_2 col_3 col_4
# <chr> <chr> <chr> <chr>
#1 California Californiia callifoornia NA
#2 Florida Fl0 rida folrida NA
#3 New York New york new york NA
#4 Texas T3xas Te xas texas
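As a side note, spread() is superseded in current tidyr; a hedged sketch of the same pipeline ending in pivot_wider() instead:
library(fuzzyjoin); library(tidyverse)
dataset1 %>%
  stringdist_left_join(dataset2, max_dist = 3) %>%
  rename(col_1 = "states.x") %>%
  group_by(col_1) %>%
  mutate(col = paste0("col_", row_number() + 1)) %>%
  pivot_wider(names_from = col, values_from = states.y)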
data:
dataset1 <- data.frame(states = c("California",
"Texas",
"Florida",
"New York"),
stringsAsFactors = F)
dataset2 <- data.frame(stringsAsFactors = F,
states = c(
"Californiia",
"callifoornia",
"T3xas",
"Te xas",
"texas",
"Fl0 rida",
"folrida",
"New york",
"new york"
)
)
I read a bit about stringdist and came up with this. It's a workaround, but I like it. Can definitely be improved:
library(stringdist)
library(janitor)
ds1a <- read.csv('dataset1')
ds2a <- read.csv('dataset2')
distancematrix <- stringdistmatrix(ds2a$name, ds1a$name, useNames = T)
df <- as.data.frame(distancematrix) # one column per name in ds1a, one row per name in ds2a
# go thru this df, and every cell that's < 4, replace with the column name, otherwise replace with empty string
for (j in 1:ncol(df)) {
  trigger <- df[[j]] < 4   # distances below the threshold in column j
  df[trigger, j] <- names(df)[j]
  df[!trigger, j] <- ""
}
df <- remove_constant(df)
write.csv(df, file="~/Desktop/df.csv")
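A hedged alternative sketch with stringdist::amatch(), which assigns each raw string to its nearest canonical name within a maximum distance (reusing dataset1/dataset2 from the fuzzyjoin answer above; maxDist = 4 mirrors the < 4 threshold used in the loop):
library(stringdist)
# Index of the closest dataset1 state for each dataset2 string
# (NA if nothing is within maxDist); lowercasing makes it case-insensitive.
idx <- amatch(tolower(dataset2$states), tolower(dataset1$states),
              maxDist = 4, method = "osa")
split(dataset2$states, dataset1$states[idx])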
The first data frame I have includes a column for states called state, but some of the entries are shown as abbreviations (LA, CA, OH), while others have the full name of the state (Louisiana, California, Ohio).
The second data frame I have includes four columns with the following titles:
allCaps (example: ALABAMA)
full (example: Alabama)
twoLetter (example: AL)
threeLetter (example: Ala.)
Is there a way to join the two data frames so that the first data frame only shows the state abbreviations in the state column, replacing the full names with their abbreviations?
EDIT:
I'm going to include pictures, despite having been shot down for doing so before.
This is table one. Each row is a separate tweet that was sent from the respective states. I created it with this code (drawing data from a separate table called tweets):
tweets_per_state <- tweets %>%
filter(country_code == "US" & place_type == "city" | place_type == "admin") %>%
select(place_type, full_name) %>%
mutate(state = ifelse(place_type == "admin", str_sub(full_name, start = 1, end = -6), str_sub(full_name, -2)))
This is table two, which I am trying to join with table one so that where table one shows "Virginia", instead it shows "VA".
One dplyr-based solution involves using a dummy column to join the two tables and then using grepl to replace the state column with the twoLetter value.
I have created data frames with a few rows to demonstrate the solution.
tweets <- data.frame(place_type = rep("city",4),
full_name = c("Los Angeles, CA", "Maitland, FL", "Indianapolis, IN", "Virginia, USA" ),
state = c("CA", "FL", "IN", "Virginia"), stringsAsFactors = F)
# place_type full_name state
#1 city Los Angeles, CA CA
#2 city Maitland, FL FL
#3 city Indianapolis, IN IN
#4 city Virginia, USA Virginia
state <- data.frame(allCaps = c("CALIFORNIA", "FLORIDA", "INDIANA", "VIRGINIA"),
full = c("California", "Florida", "Indiana", "Virginia"),
twoLetter = c("CA", "FL", "IN", "VR"),
threeLetter = c("Calif.", "Fla.", "Ind.", "Vir." ),stringsAsFactors = F)
library(dplyr)
state <- state %>% mutate(dummy = 1)
tweets%>%
mutate(dummy = 1) %>%
filter(place_type == "city" | place_type == "admin") %>%
inner_join(state, by = "dummy") %>%
rowwise() %>%
mutate(state = ifelse(state == twoLetter , state,
ifelse(grepl(full, full_name),twoLetter, NA))) %>%
filter(!is.na(state)) %>%
select(place_type,full_name,state)
# Result
# place_type full_name state
# <chr> <chr> <chr>
# 1 city Los Angeles, CA CA
# 2 city Maitland, FL FL
# 3 city Indianapolis, IN IN
# 4 city Virginia, USA VR
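A hedged alternative sketch that avoids the cross join: build a named lookup vector from the state table and translate only the rows holding a full name (reusing the tweets and state example frames above):
# Map full state names to their two-letter codes
lookup <- setNames(state$twoLetter, state$full)
tweets$state <- ifelse(tweets$state %in% state$twoLetter,
                       tweets$state,                  # already an abbreviation
                       unname(lookup[tweets$state]))  # translate the full name
tweets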
I have this data.table:
CITY        CITY2
Phoenix     NA
NASHVILLE   Nashville
Los Angeles Los Angeles
NEWYORK     New York
CHICAGO     NA
This is the result I want:
CITY
Phoenix
Nashville
Los Angeles
New York
CHICAGO
I tried many approaches and nothing worked. Any ideas?
Out of desperation I kept researching and found a solution:
myDataTable[is.na(CITY2) & !is.na(CITY), CITY2 := CITY]
This data frame is a bit of a mess, as the desired results appear in both columns and there seems to be no predictable pattern. Are you sure that city2 has the correct formatting for all values that are not NA?
Either way, there are a couple of methods to get to your final desired answer with the correct capitalization of the city names, using dplyr and the tools package.
library(dplyr)
library(tools)
city_df <- data.frame(
city = c("Phoenix", "NASHVILLE", "Los Angeles", "NEWYORK", "CHICAGO"),
city2 = c(NA, "Nashville", "Los Angeles", "New York", NA),
stringsAsFactors = FALSE)
The first method assumes city_df$city contains all of the cities but is formatted incorrectly.
city_df %>%
mutate(city =
replace(x = city, city == "NEWYORK", values = "New York")) %>%
select(city) %>%
mutate(city = tools::toTitleCase(tolower(city)))
which returns
city
1 Phoenix
2 Nashville
3 Los Angeles
4 New York
5 Chicago
If you need the values of city_df$city replaced with the non-NA values of city_df$city2, you can do the following:
city_df %>%
mutate(city = case_when(
!(is.na(city2)) ~ city2,
is.na(city2) ~ city)) %>%
select(city) %>%
mutate(city = tools::toTitleCase(tolower(city)))
This returns the same column as above.
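Since the question's object is a data.table, a hedged equivalent of the second approach using data.table::fcoalesce():
library(data.table)
dt <- as.data.table(city_df)
# take city2 where it is not NA, otherwise fall back to city, then title-case
dt[, city := tools::toTitleCase(tolower(fcoalesce(city2, city)))]
dt[, .(city)]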