I have a file with the following values:
column1 column2
quiebre Madrid
no_quiebre Barcelona
quiebre Madrid
no_quiebre Barcelona
quiebre Barcelona
no_quiebre Madrid
no_quiebre Madrid
and I need to calculate three variables in Tableau (column3, column4 and column5). The first one (column3) should count the number of times each city appears in the entire file and give the following result:
column1 column2 column3
quiebre Madrid 4
no_quiebre Barcelona 3
quiebre Madrid 4
no_quiebre Barcelona 3
quiebre Barcelona 3
no_quiebre Madrid 4
no_quiebre Madrid 4
The second variable (column4) should count the number of quiebre rows per city:
column1 column2 column3 column4
quiebre Madrid 4 2
no_quiebre Barcelona 3 1
quiebre Madrid 4 2
no_quiebre Barcelona 3 1
quiebre Barcelona 3 1
no_quiebre Madrid 4 2
no_quiebre Madrid 4 2
Finally, the third variable (column5) is the ratio between column4 and column3 as a percentage, i.e. column4 / column3 * 100.
Any clue about how to do this?
Column 3 should look like this:
column3 = Count(column2)
Now join column1 and column2 into a new field and call it joined column.
Then create column4 from it, like this:
column4 = Count(column1+column2)
Arrange them on the Tableau worksheet so that you get the desired result.
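To make the arithmetic behind the three columns concrete, here is a minimal sketch in R rather than Tableau. It only illustrates the expected numbers from the question, it is not a Tableau calculated field, and it assumes the file has been loaded into a data frame called df (a name chosen here for illustration).

```r
# Sample data from the question (df is a hypothetical name for the loaded file)
df <- data.frame(
  column1 = c("quiebre", "no_quiebre", "quiebre", "no_quiebre",
              "quiebre", "no_quiebre", "no_quiebre"),
  column2 = c("Madrid", "Barcelona", "Madrid", "Barcelona",
              "Barcelona", "Madrid", "Madrid"),
  stringsAsFactors = FALSE
)

# column3: number of rows per city, repeated on every row of that city
df$column3 <- ave(rep(1, nrow(df)), df$column2, FUN = sum)

# column4: number of "quiebre" rows per city, repeated on every row of that city
df$column4 <- ave(as.numeric(df$column1 == "quiebre"), df$column2, FUN = sum)

# column5: share of quiebre rows per city, as a percentage
df$column5 <- df$column4 / df$column3 * 100

print(df)
```

Note that column4 counts only the quiebre rows of each city and repeats that count on every row of the city, including the no_quiebre rows, which is what the expected table in the question shows.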
My data is similar to the table given below. I have data on the province where an individual is living and on the province where they were living the previous year. I want to construct the outflow rate and inflow rate of migrants in Stata or R.
Province  Previous Province
Delhi     Mumbai
Mumbai    Kolkata
Kolkata   Mumbai
Delhi     Mumbai
Kolkata   Delhi
Mumbai    Mumbai
I want a matrix which is as follows
         Delhi  Mumbai  Kolkata
Delhi    0      2       0
Mumbai   0      1       1
Kolkata  1      1       0
table(df)
Previous_Province
Province Delhi Kolkata Mumbai
Delhi 0 0 2
Kolkata 1 0 1
Mumbai 0 1 1
In dataframe format:
as.data.frame.matrix(table(df))
Delhi Kolkata Mumbai
Delhi 0 0 2
Kolkata 1 0 1
Mumbai 0 1 1
Or, with tidyr:
library(tidyr)
pivot_wider(df, names_from = Previous_Province,
            values_from = Previous_Province, values_fn = length,
            values_fill = 0)
# A tibble: 3 × 4
Province Mumbai Kolkata Delhi
<chr> <int> <int> <int>
1 Delhi 2 0 0
2 Mumbai 1 1 0
3 Kolkata 1 0 1
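The question also asks for inflow and outflow rates, which the matrix alone does not give. Below is a minimal sketch of one way to derive them from the same cross-tabulation. The rate definitions are assumptions made for illustration, not something stated in the question: inflow rate is taken as arrivals divided by current residents, and outflow rate as departures divided by last year's residents. The names provinces, flows and stayers are hypothetical.

```r
# Sample data from the question
df <- data.frame(
  Province          = c("Delhi", "Mumbai", "Kolkata", "Delhi", "Kolkata", "Mumbai"),
  Previous_Province = c("Mumbai", "Kolkata", "Mumbai", "Mumbai", "Delhi", "Mumbai"),
  stringsAsFactors = FALSE
)

# Use one shared set of levels so the table is always square,
# even if a province appears in only one of the two columns
provinces <- sort(unique(c(df$Province, df$Previous_Province)))
flows <- table(
  Province          = factor(df$Province, levels = provinces),
  Previous_Province = factor(df$Previous_Province, levels = provinces)
)

# People who did not move sit on the diagonal
stayers <- diag(flows)

# ASSUMED definitions (not from the question):
# inflow rate  = arrivals   / current residents of the province
# outflow rate = departures / last year's residents of the province
inflow_rate  <- (rowSums(flows) - stayers) / rowSums(flows)
outflow_rate <- (colSums(flows) - stayers) / colSums(flows)

print(round(cbind(inflow_rate, outflow_rate), 2))
```

If a different denominator is wanted (for example mid-year population), only the last two assignments need to change.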
I have these two dataframes for example:
dates = c('2020-11-19', '2020-11-20', '2020-11-21')
df1 <- data.frame(dates, area = c('paris', 'london', 'newyork'),
rating = c(10, 5, 6),
rating2 = c(5, 6, 7))
df2 <- data.frame(dates, area = c('budapest', 'moscow', 'valencia'),
rating = c(1, 2, 1))
> df1
dates area rating rating2
1 2020-11-19 paris 10 5
2 2020-11-20 london 5 6
3 2020-11-21 newyork 6 7
> df2
dates area rating
1 2020-11-19 budapest 1
2 2020-11-20 moscow 2
3 2020-11-21 valencia 1
When performing an outer join using dplyr:
df <- df1 %>%
full_join(df2, by = c('dates', 'area'))
the result is like this:
dates area rating.x rating2 rating.y
1 2020-11-19 paris 10 5 NA
2 2020-11-20 london 5 6 NA
3 2020-11-21 newyork 6 7 NA
4 2020-11-19 budapest NA NA 1
5 2020-11-20 moscow NA NA 2
6 2020-11-21 valencia NA NA 1
i.e. the rating columns from the two dataframes are not blended together but two separate columns are created.
How do I get a result like this?
dates area rating rating2
1 2020-11-19 paris 10 5
2 2020-11-20 london 5 6
3 2020-11-21 newyork 6 7
4 2020-11-19 budapest 1 NA
5 2020-11-20 moscow 2 NA
6 2020-11-21 valencia 1 NA
Thanks to the solution provided by @kybazzi, the desired result was obtained.
df <- df1 %>%
bind_rows(df2)
FOLLOW-UP
As a follow-up question, I would like to join the following to the joined dataframe:
df3 <- data.frame(dates, area = c('budapest', 'moscow', 'valencia'),
rating2 = c(3, 2, 5))
Using the same method, the result is this:
> df_final <- df %>%
+ bind_rows(df3)
> df_final
dates area rating rating2
1 2020-11-19 paris 10 5
2 2020-11-20 london 5 6
3 2020-11-21 newyork 6 7
4 2020-11-19 budapest 1 NA
5 2020-11-20 moscow 2 NA
6 2020-11-21 valencia 1 NA
7 2020-11-19 budapest NA 3
8 2020-11-20 moscow NA 2
9 2020-11-21 valencia NA 5
How do I get a result like this:
dates area rating rating2
1 2020-11-19 paris 10 5
2 2020-11-20 london 5 6
3 2020-11-21 newyork 6 7
4 2020-11-19 budapest 1 3
5 2020-11-20 moscow 2 2
6 2020-11-21 valencia 1 5
What you're looking for is dplyr::bind_rows(), which will preserve common columns and fill NA for columns that only exist in one of the data frames:
> bind_rows(df1, df2)
dates area rating rating2
1 2020-11-19 paris 10 5
2 2020-11-20 london 5 6
3 2020-11-21 newyork 6 7
4 2020-11-19 budapest 1 NA
5 2020-11-20 moscow 2 NA
6 2020-11-21 valencia 1 NA
Note that you could also continue using full_join() - but you must ensure that all common columns between the data frames are included as keys if you don't want columns to be split:
> full_join(
+ df1, df2,
+ by = c("dates", "area", "rating")
+ )
dates area rating rating2
1 2020-11-19 paris 10 5
2 2020-11-20 london 5 6
3 2020-11-21 newyork 6 7
4 2020-11-19 budapest 1 NA
5 2020-11-20 moscow 2 NA
6 2020-11-21 valencia 1 NA
The documentation for dplyr joins mentions:
Output columns include all x columns and all y columns. If columns in x and y have the same name (and aren't included in by), suffixes are added to disambiguate.
You could also avoid this issue by not specifying by, in which case dplyr will use all common columns.
> full_join(df1, df2)
Joining, by = c("dates", "area", "rating")
dates area rating rating2
1 2020-11-19 paris 10 5
2 2020-11-20 london 5 6
3 2020-11-21 newyork 6 7
4 2020-11-19 budapest 1 NA
5 2020-11-20 moscow 2 NA
6 2020-11-21 valencia 1 NA
As far as I know, both methods are good for your use case. In fact, I believe that the practical advantage full_join() has over bind_rows() is precisely this behaviour you wanted to avoid here, i.e. splitting columns that aren't keys.
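For the follow-up with df3, bind_rows() appends new rows, so it cannot fill the NA cells of rows that already exist. One possible approach, sketched here as a suggestion rather than as part of the answer above, is to join df3 on the key columns and then merge the two rating2 columns with coalesce(). The ".new" suffix and the temporary column name rating2.new are choices made for this example.

```r
library(dplyr)

dates <- c("2020-11-19", "2020-11-20", "2020-11-21")
df1 <- data.frame(dates, area = c("paris", "london", "newyork"),
                  rating = c(10, 5, 6), rating2 = c(5, 6, 7))
df2 <- data.frame(dates, area = c("budapest", "moscow", "valencia"),
                  rating = c(1, 2, 1))
df3 <- data.frame(dates, area = c("budapest", "moscow", "valencia"),
                  rating2 = c(3, 2, 5))

# Stack df1 and df2 as in the original answer
df <- bind_rows(df1, df2)

df_final <- df %>%
  # rating2 is deliberately NOT a key, so df3's copy arrives as rating2.new
  full_join(df3, by = c("dates", "area"), suffix = c("", ".new")) %>%
  # keep the existing value where present, otherwise take the one from df3
  mutate(rating2 = coalesce(rating2, rating2.new)) %>%
  select(-rating2.new)

print(df_final)
```

Because coalesce() takes the first non-missing value, existing rating2 values in df win over those in df3. Swap the arguments if df3 should take precedence.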
I have the following dataframe:
FOOD ID DATE PRICE DES
1 1/1/2020 100 Tuna
1 1/1/2020 NA Tuna
1 1/1/2020 100 Tuna
1 1/1/2020 NA Tuna
3 1/25/2020 4 Tomato
3 1/25/2020 NA Tomato
3 1/1/2019 NA Tomato
3 1/1/2019 5 Tomato
I would need to replace (where/when possible) the NA values when a price for the same FOOD ID and same DATE is available. Expected output:
FOOD ID DATE PRICE DES
1 1/1/2020 100 Tuna
1 1/1/2020 100 Tuna
1 1/1/2020 100 Tuna
1 1/1/2020 100 Tuna
3 1/25/2020 4 Tomato
3 1/25/2020 4 Tomato
3 1/1/2019 5 Tomato
3 1/1/2019 5 Tomato
Without using a for loop, is there a way I could easily perform such a task?
I guess one way could be to use dplyr: group the data by FOOD ID and DATE and get an "average" PRICE, delete the PRICE column from the original dataframe, and finally merge the grouped data with the original dataframe. But this seems an odd way to do it.
Thanks for the help.
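library(dplyr)
library(tidyr)
# fill NA prices within each FOOD_ID/DATE group, looking both up and down
df %>%
  group_by(FOOD_ID, DATE) %>%
  fill(PRICE, .direction = 'updown')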
# A tibble: 8 x 4
# Groups: FOOD_ID, DATE [3]
FOOD_ID DATE PRICE DES
<int> <chr> <int> <chr>
1 1 1/1/2020 100 Tuna
2 1 1/1/2020 100 Tuna
3 1 1/1/2020 100 Tuna
4 1 1/1/2020 100 Tuna
5 3 1/25/2020 4 Tomato
6 3 1/25/2020 4 Tomato
7 3 1/1/2019 5 Tomato
8 3 1/1/2019 5 Tomato
We can use the data itself to feed prices back in.
Data:
df <- read.table(header = TRUE, text= "FOOD_ID DATE PRICE DES
1 1/1/2020 100 Tuna
1 1/1/2020 NA Tuna
1 1/1/2020 100 Tuna
1 1/1/2020 NA Tuna
3 1/25/2020 4 Tomato
3 1/25/2020 NA Tomato
3 1/1/2019 NA Tomato
3 1/1/2019 5 Tomato")
Find distinct prices for each product on each date.
prices <- df %>%
filter(!is.na(PRICE)) %>%
group_by(FOOD_ID, DATE, DES) %>%
distinct(FOOD_ID, .keep_all = TRUE)
Join these prices back into the original dataframe, which assigns a price to every row for each product and date. (I have removed the original PRICE column because it is fed back in from the prices df.)
new_df <- df %>%
select(-PRICE) %>%
left_join(prices, by = c('FOOD_ID', 'DATE', 'DES'))
Output of new_df:
FOOD_ID DATE DES PRICE
1 1 1/1/2020 Tuna 100
2 1 1/1/2020 Tuna 100
3 1 1/1/2020 Tuna 100
4 1 1/1/2020 Tuna 100
5 3 1/25/2020 Tomato 4
6 3 1/25/2020 Tomato 4
7 3 1/1/2019 Tomato 5
8 3 1/1/2019 Tomato 5
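If adding packages is not an option, the same group-wise fill can be sketched in base R with ave(). This is an alternative to the two answers above, not a restatement of them. The helper name fill_first is made up for this example, and it assumes each FOOD_ID/DATE group has at most one distinct known price (it simply takes the first one it finds).

```r
df <- read.table(header = TRUE, text = "FOOD_ID DATE PRICE DES
1 1/1/2020 100 Tuna
1 1/1/2020 NA Tuna
1 1/1/2020 100 Tuna
1 1/1/2020 NA Tuna
3 1/25/2020 4 Tomato
3 1/25/2020 NA Tomato
3 1/1/2019 NA Tomato
3 1/1/2019 5 Tomato")

# Hypothetical helper: replace NAs in a group with the group's first known value.
# Groups with no known price at all are left as NA.
fill_first <- function(p) {
  known <- p[!is.na(p)]
  if (length(known) > 0) p[is.na(p)] <- known[1]
  p
}

# ave() applies the helper within each FOOD_ID/DATE combination
# and returns the values in the original row order
df$PRICE <- ave(df$PRICE, df$FOOD_ID, df$DATE, FUN = fill_first)

print(df)
```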
I want to merge one column from df2 with df1 by matching df1$District_name and df2$Districts.
But the character values in df1$District_name and df2$Districts are not in the same order and df1 and df2 are not of same length.
The values do not match exactly. df1 has more rows than df2, so the corresponding values for those extra district names should be zero.
df1=data.frame(State_name=c("Maharashtra","Andhra Pradesh","Bihar","Bihar","West Bengal","Gujarat","Gujarat","Assam"),
District_name=c("Nashik","Chittoor","Madhepura","Kishanganj","Howrah","Gandhinagar","Ahmadabad","Sivasagar"),
Value1=c(5,3,6,4,4,3,2,4))
df2=data.frame(Districts=c("Nashik","Chitoor","Kishanganj","Madhepur","Sibhasagar","Ahmadabad"),
FinanceIndex=c(0.20975,0.12187,0.37155,0.66128,0.10918,0.54730))
# df1
State_name District_name Value1
1 Maharashtra Nashik 5
2 Andhra Pradesh Chittoor 3
3 Bihar Madhepura 6
4 Bihar Kishanganj 4
5 West Bengal Howrah 4
6 Gujarat Gandhinagar 3
7 Gujarat Ahmadabad 2
8 Assam Sivasagar 4
# df2
Districts FinanceIndex
1 Nashik 0.20975
2 Chitoor 0.12187
3 Kishanganj 0.37155
4 Madhepur 0.66128
5 Sibhasagar 0.10918
6 Ahmadabad 0.54730
I used match function but due to the spelling differences, I am getting most of them as zero values.
index<-match(df1$District_name, df2$Districts)
df1$finindex=df2$FinanceIndex[index]
df1$finindex[is.na(df1$finindex)]=0
For string matching, I found this function, which matches phonetically similar words:
library(RecordLinkage)
soundex('Nellore')==soundex('Vellore')
#FALSE
The output should be :
# df1
State_name District_name Value1 finindex
1 Maharashtra Nashik 5 0.20975
2 Andhra Pradesh Chittoor 3 0.12187
3 Bihar Madhepura 6 0.66128
4 Bihar Kishanganj 4 0.37155
5 West Bengal Howrah 4 0.00000
6 Gujarat Gandhinagar 3 0.00000
7 Gujarat Ahmadabad 2 0.54730
8 Assam Sivasagar 4 0.10918
Is there any way these two functions can be used together to solve the problem? Or any other way to solve the problem?
An option is to do a partial match based on string distance, using stringdist_left_join from fuzzyjoin:
library(dplyr)
library(fuzzyjoin)
stringdist_left_join(df1, df2, by = c("District_name" = "Districts")) %>%
  select(-Districts)
# State_name District_name Value1 FinanceIndex
#1 Maharashtra Nashik 5 0.20975
#2 Andhra Pradesh Chittoor 3 0.12187
#3 Bihar Madhepura 6 0.66128
#4 Bihar Kishanganj 4 0.37155
#5 West Bengal Howrah 4 NA
#6 Gujarat Gandhinagar 3 NA
#7 Gujarat Ahmadabad 2 0.54730
#8 Assam Sivasagar 4 0.10918
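The output above leaves NA for unmatched districts, while the question asks for zeros. Here is a minimal sketch that produces the zero-filled column using only base R: it swaps in utils::adist() (generalized edit distance) for the fuzzyjoin call, so it is a different technique from the answer above. The cutoff of 2 edits is an assumption chosen to suit these sample names, and max_dist, nearest and nearest_dist are names invented for the example.

```r
df1 <- data.frame(
  State_name = c("Maharashtra", "Andhra Pradesh", "Bihar", "Bihar",
                 "West Bengal", "Gujarat", "Gujarat", "Assam"),
  District_name = c("Nashik", "Chittoor", "Madhepura", "Kishanganj",
                    "Howrah", "Gandhinagar", "Ahmadabad", "Sivasagar"),
  Value1 = c(5, 3, 6, 4, 4, 3, 2, 4),
  stringsAsFactors = FALSE
)
df2 <- data.frame(
  Districts = c("Nashik", "Chitoor", "Kishanganj", "Madhepur",
                "Sibhasagar", "Ahmadabad"),
  FinanceIndex = c(0.20975, 0.12187, 0.37155, 0.66128, 0.10918, 0.54730),
  stringsAsFactors = FALSE
)

# Edit distance between every df1 district (rows) and every df2 district (columns)
d <- adist(df1$District_name, df2$Districts)

# For each df1 district: index of the closest df2 spelling and its distance
nearest <- apply(d, 1, which.min)
nearest_dist <- d[cbind(seq_len(nrow(d)), nearest)]

# ASSUMPTION: spellings more than 2 edits apart are treated as "no match"
max_dist <- 2
df1$finindex <- ifelse(nearest_dist <= max_dist, df2$FinanceIndex[nearest], 0)

print(df1)
```

With real data it is worth inspecting the matched pairs before trusting them, since a short cutoff can still pair two genuinely different districts with similar names.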
I have two data frames. The original, df1,
Country Ccode Year Happiness Power
1 France FR 2000 1872 1213
2 France FR 2001 2345 1234
3 UK UK 2000 2234 1726
4 UK UK 2001 9082 6433
and df1vars which contains only a vector of a few column names:
1 Country
2 Year
3 Happiness
I would like to select from df1 the columns listed in df1vars. Against my better judgment, I tried the following:
library(dplyr)
df2 <- select(df1, df1vars)
The output I wanted:
Country Year Happiness
1 France 2000 1872
2 France 2001 2345
3 UK 2000 2234
4 UK 2001 9082
Instead, I got the message:
Error: ``ES1varselect`` must evaluate to column positions or names, not a list
Is there an efficient workaround to this aspect of the select statement?
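If both are data frames, then
df1[, as.character(df1vars$Colname)]
where df1 is the data.frame mentioned in the question and df1vars is the other data.frame, with the following content:
Colname
1 Country
2 Year
3 Happiness
The as.character() call matters when Colname is stored as a factor (the default for data.frame() before R 4.0.0): indexing with a factor uses its integer level codes, which would pick Country, Year and Ccode instead of the requested columns.
Final output:
Country Year Happiness
1 France 2000 1872
2 France 2001 2345
3 UK 2000 2234
4 UK 2001 9082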
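To stay within dplyr, the error in the question can be avoided by passing select() a character vector instead of a whole data frame (a data frame is a list, which is what the error message complains about). A minimal sketch, assuming the column of names in df1vars is called Colname as in the answer above:

```r
library(dplyr)

df1 <- data.frame(
  Country   = c("France", "France", "UK", "UK"),
  Ccode     = c("FR", "FR", "UK", "UK"),
  Year      = c(2000, 2001, 2000, 2001),
  Happiness = c(1872, 2345, 2234, 9082),
  Power     = c(1213, 1234, 1726, 6433)
)

# Colname is the assumed name of the column holding the variable names
df1vars <- data.frame(Colname = c("Country", "Year", "Happiness"))

# Pull the names out as a character vector; all_of() tells select()
# to treat them as column names and to error if any is missing
df2 <- select(df1, all_of(as.character(df1vars$Colname)))

print(df2)
```

If some of the listed names might not exist in df1, any_of() can be used in place of all_of() to skip them silently.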