I'm currently working in R on a survey of schools, and I would like to add a variable with the population of the city each school is in.
In the first data set I have all the survey respondents, which includes a variable "city_name". I have managed to find online a list of cities with their populations, which I have imported into R.
What I would now like to do is add a variable to data_set_1 called city_pop, equal to the city's population whenever city_name appears in both data sets. It might be relevant to know that the first dataset has around 1,200 rows while the second one has around 36,000 rows.
I've tried several things including the following:
data_set_1$Pop_city = ifelse(data_set_1$city_name == data_set_2$city_name, data_set_2$Pop_city, 0)
Any clues?
Thanks!!
You need to merge the two datasets:
new_df <- merge(data_set_1, data_set_2, by="city_name")
The result will be a data frame containing only the matching rows (in your case 1,200 rows, assuming every city in data_set_1 also appears in data_set_2) and all columns of both data frames. If you also want to keep the non-matching rows of data_set_1, use the all.x option:
new_df <- merge(data_set_1, data_set_2, by="city_name", all.x=TRUE)
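To get exactly the asker's desired behavior (0 instead of NA for cities that are not found), a small follow-up sketch, assuming the population column in data_set_2 is called Pop_city:
# keep only the two needed columns from data_set_2 to avoid dragging extra ones in
new_df <- merge(data_set_1, data_set_2[, c("city_name", "Pop_city")],
                by = "city_name", all.x = TRUE)
new_df$Pop_city[is.na(new_df$Pop_city)] <- 0   # 0 for unmatched cities, as in the ifelse attempt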
Two ways you could try using dplyr:
library(dplyr)
data_set_1 %>%
  mutate(Pop_city = ifelse(city_name %in% data_set_2$city_name,
                           # match() finds each city's row in data_set_2
                           data_set_2$Pop_city[match(city_name, data_set_2$city_name)],
                           0))
or using a left_join
data_set_1 %>%
  left_join(data_set_2, by = "city_name")
perhaps followed by a select(all_of(names(data_set_1)), Pop_city) to drop the extra columns.
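Putting that together as a sketch (coalesce() fills the non-matches with 0, mirroring the asker's ifelse; this assumes city_name is unique in data_set_2):
library(dplyr)

data_set_1 %>%
  left_join(select(data_set_2, city_name, Pop_city), by = "city_name") %>%
  mutate(Pop_city = coalesce(Pop_city, 0)) %>%
  select(all_of(names(data_set_1)), Pop_city)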
I usually use dplyr to filter data. I now have a huge dataset (62,176 entries) of banks operating in different countries. I'd like to subset/filter that dataset for Eurozone banks only.
I haven't found any approach other than pasting all the names of the Eurozone countries and then creating a new dataset with filter.
Is there a workaround for this problem?
Thank you!
Without the data we can't give you precise answers; however, given my understanding of the problem, below are some methods.
Assuming your dataset already has a column with each bank's operating country, you could create a vector of the countries you are interested in and then filter the dataset for the rows that match:
# manually assign countries to a vector (this must match how the countries are
# spelled in your data; swap in the actual list of Eurozone members)
euro_countries <- c("Germany", "England", "France", "Poland")

# then filter the dataset for matching rows; I made up the column name
# op_country as I don't know your data
dataframe %>% filter(op_country %in% euro_countries)
Alternatively, depending on your data, you can use the very helpful countrycode package, which ships with an existing table, countrycode::codelist. You can join your dataset's country column against the matching column in that table and then filter on countrycode::codelist$continent for countries in "Europe".
# join your dataset with the codelist table; the "by" columns depend on how
# the country column is named in your data
dataframe <- left_join(x = dataframe, y = countrycode::codelist,
                       by = c("op_country" = "country.name.en"))

# filter your dataset with the new column
dataframe %>% filter(continent == "Europe")
I want to transform a CSV or Excel table (with a header row and, say, 20 columns) into another table with the same rows but the columns in another, pre-established order.
Thank you very much
Suppose your table looks a bit like this, once you've loaded it into R:
# Packages
library(tidyverse)
# Sample Table
df <- tibble(names = c("Jane", "Boris", "Michael", "Tina"),
             height = c(167, 175, 182, 171),
             age = c(26, 45, 32, 51),
             occupation = c("Teacher", "Construction Worker", "Salesman", "Banker"))
If all you want to do is reorder the columns, you can do the following:
df <- df %>%
  select(occupation, height, age, names)
There are other ways to do this, especially if you only want to move one or two columns out of your 20. But suppose you want to rearrange all of them, this will do it.
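For instance, to move just one column to the front without typing the rest, dplyr's relocate(), or select() with everything(), will do; a quick sketch on the sample table above:
# move occupation to the front, keeping the other columns in their current order
df %>% relocate(occupation)

# the same idea with select(); everything() fills in the remaining columns
df %>% select(occupation, everything())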
I have two data sets (A and B), one with 1600 observations/rows and 1002 variables/columns and one with 860 observations/rows and 1040 variables/columns. I want to quickly check which variables are in dataset B but not in dataset A, and vice versa. I am only interested in the column names, not in the observations contained within those columns.
I found this great function here: https://cran.r-project.org/web/packages/arsenal/vignettes/comparedf.html and essentially I would want output similar to the comparison summary shown there.
The code I am trying is summary(comparedf(dataA, dataB)). However, the table is not printed because R does a row-by-row comparison of both data sets and then runs out of space when printing the results in the console. Is there a quick way of achieving what I need here?
I think you can use the anti_join() function from the dplyr package to find the unmatched records. It will give you the rows that data sets A and B do not share in common. Here is an example:
library(dplyr)

table1 <- data.frame(id = 1:5,
                     animal = c("cat", "dog", "parakeet", "lion", "duck"))
table2 <- table1[c(1, 3, 5), ]

anti_join(table1, table2, by = "id")
  id animal
1  2    dog
2  4   lion
This will return the unshared rows by ID.
Edit
If you want to find which column names/variables appear in one data frame but not the other, then you could use this solution:
df1 <- data.frame(a = rnorm(100), b = rnorm(100), not = rnorm(100))
df2 <- data.frame(a = rnorm(100), b = rnorm(100))

# returns the column(s) that appear in df1 but not in df2; drop = FALSE keeps
# the result a data frame even when only one column matches
df1[, !names(df1) %in% names(df2), drop = FALSE]
I hope this answers your question. It will return the actual values beneath each unshared column/variable, but you can save the output to an object and run colnames() on it to print just the unshared column/variable names.
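For example:
only_in_df1 <- df1[, !names(df1) %in% names(df2), drop = FALSE]
colnames(only_in_df1)  # prints "not"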
It may be a bit clunky, but combining setdiff() with colnames() may work.
Doing both setdiff(colnames(DataA),colnames(DataB)) and setdiff(colnames(DataB),colnames(DataA)) will give you 2 vectors, each with the names of the columns present in one of the datasets but not in the other one.
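In code:
setdiff(colnames(DataA), colnames(DataB))  # columns in DataA but not in DataB
setdiff(colnames(DataB), colnames(DataA))  # columns in DataB but not in DataA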
I am working with the following data: http://people.stern.nyu.edu/wgreene/Econometrics/healthcare.csv
What I want to do is train my algorithm to predict correctly if a person will drop out in the subsequent period.
data1 <- subset(data, YEAR==1984)
data2 <- subset(data, YEAR==1985)
didtheydrop <- as.integer(data1$id)
didtheydrop <- lapply(didtheydrop, function(x) as.integer(ifelse(x==data2$id, 0, 1)))
This created a large list with the values that I think I wanted, but I'm not sure. In the end, I would like to append this variable to the 1984 data and then use that to create my model.
What can I do to ensure that the appropriate values are compared? The list lengths aren't the same, and it's also not the case that they appear in the correct order (i.e. respondents 3 - 7 do not respond in 1984 but they appear in 1985)
Assuming data1 and data2 are two dataframes (unclear, because it appears you extracted them from an original larger single dataframe called data), I think it is better to merge them and work with a single dataframe. That is, if there is a single larger dataframe, do not subset it; just delete the columns you do not need. If data1 and data2 are two dataframes, merge them and work with only one dataframe.
There are multiple ways to do this in R.
You should review the merge function by calling ?merge in your console and reading the function description.
Essentially, to merge two dataframes, you should do something like:
merge(data1, data2, by = "columnID")  # where "columnID" is the name of the ID variable; if it differs between data1 and data2, use by.x and by.y
Then you have to decide which rows to keep, with the parameters all.x, all.y, and all: all rows from data1 even if no match is found in data2, all rows from data2 even if no match is found in data1, or all rows regardless of whether there is a matching ID in the other table.
merge() is in the base package of any R installation.
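For this particular question, a minimal base-R sketch of the dropout flag itself, assuming id uniquely identifies a respondent within each year:
# flag 1984 respondents whose id does not reappear in 1985
data1$didtheydrop <- as.integer(!data1$id %in% data2$id)
This avoids the lapply() loop and does not depend on the two years having the same length or row order.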
You can also use dplyr package, which makes the type of join even more explicit:
inner_join(data1, data2, by = "ID")
left_join(data1, data2, by = "ID")
right_join(data1, data2, by = "ID")
full_join(data1, data2, by = "ID")
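The same flag via a join, as a sketch (in_1985 is a helper column I made up; again assuming id is unique within each year):
library(dplyr)

flagged <- data1 %>%
  left_join(data2 %>% select(id) %>% mutate(in_1985 = TRUE), by = "id") %>%
  mutate(didtheydrop = as.integer(is.na(in_1985))) %>%
  select(-in_1985)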
This is a good link for dplyr join https://rpubs.com/williamsurles/293454
Hope it helps
I have a dataset of a few columns with duplicate rows (duplication based on one column, named ProjectID).
I want to remove the duplicate rows and keep just one of them.
However, each of these rows has a separate amount value against it, which needs to be summed and stored in the final consolidated row.
I have used the aggregate function; however, it removes all other columns (with the usage I know).
Can somebody please tell me an easier way? An example dataset is attached.
This could be solved using dplyr, as @PLapointe pointed out. If your dataset is called df, then this would go as:
df %>%
  group_by(`Project ID`, `Project No.`, `Account Head`, `Function`, `Functionary`) %>%
  summarise(cost.total = sum(Amount))
This should do it. You can also adjust the variables you want to keep.
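If you would rather not list every other column in group_by(), a hedged alternative (assuming you are happy keeping the first row's values for the remaining columns) is to compute the sum first and then drop the duplicates:
df %>%
  group_by(`Project ID`) %>%
  mutate(cost.total = sum(Amount)) %>%
  ungroup() %>%
  distinct(`Project ID`, .keep_all = TRUE)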
It's a more complicated method, but it worked for me.
I aggregated the amounts over the ProjectIDs using the aggregate function, storing them in a new tibble.
Then I appended this column to the original tibble as a new column.
It didn't work exactly as I wanted, but I was able to end up with a new column Final_Amount, with the earlier Amount column left in place but irrelevant.
library(dplyr)

Duplicate_remove2 <- function(dataGP_cleaned) {
  # aggregate the amounts per ProjectID (aggregate() returns one row per
  # ProjectID, sorted by ProjectID)
  aggregated_amount <- aggregate(dataGP_cleaned['Amount'],
                                 by = dataGP_cleaned['ProjectID'], sum)

  # keep one row per ProjectID, sorted the same way so that the rows
  # line up for bind_cols() below
  dataGP_unique <- distinct(dataGP_cleaned, ProjectID, .keep_all = TRUE) %>%
    arrange(ProjectID)

  # rename the aggregated column for easy identification
  aggregated_amount$Final_Amount <- aggregated_amount$Amount

  # append the summed column to the deduplicated data
  aggregate_dataGP <- bind_cols(dataGP_unique, aggregated_amount['Final_Amount'])
  return(aggregate_dataGP)
}
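Hypothetical usage, assuming your cleaned data frame has ProjectID and Amount columns:
dataGP_final <- Duplicate_remove2(dataGP_cleaned)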