I have one dataframe df_EU that is composed of scientists working in the EU in the following format:
Author  ID     Country  Year
A       12345  UK       2011
B       13254  Germany  2018
C       54952  Belgium  2005
D       58774  UK       2009
E       88569  Italy    2015
...
Then, I have another dataframe that contains scientists from the US df_US in the same format. Now, what I am trying to do is to add a new column for the US dataframe in which I compare each ID in the US dataframe with all the IDs in the EU dataframe. Each time there is a match, I want a 1 to appear in the new column, for each ID that is not in the EU set, a 0.
So far, I am fairly certain that my solution should contain mapply, and I deduced from this question that I can "load" the values for the ID numbers using:
mapply(function(i, j) length(grep(i, j)), df_EU$ID, df_US$ID)
I am, however, quite lost on how to proceed from here. I have never really worked with functions, and would therefore greatly appreciate your help! Thank you very much.
Another problem is that the scientists might appear multiple times per dataframe, as they are not listed by their unique names but by publications that have appeared in the respective region.
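Here, we can use regex_left_join from the fuzzyjoin package:
library(dplyr)
library(fuzzyjoin)
# Keep each EU ID once so repeated authors do not duplicate US rows
df_US <- regex_left_join(df_US, df_EU %>%
    select(ID) %>% distinct(), by = 'ID') %>%
  # ID.y is NA when no EU ID matched; convert the flag to 1/0
  mutate(EU_migration = as.integer(!is.na(ID.y)))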
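If the IDs only need to match exactly (an assumption; the regex join above also accepts partial matches), a base R sketch with %in% avoids the join altogether and is unaffected by authors who appear more than once. The sample rows below are made up for illustration.

```r
# Made-up sample data in the format described in the question
df_EU <- data.frame(Author  = c("A", "B", "C"),
                    ID      = c(12345, 13254, 54952),
                    Country = c("UK", "Germany", "Belgium"),
                    Year    = c(2011, 2018, 2005))
df_US <- data.frame(Author  = c("A", "X", "A"),
                    ID      = c(12345, 99999, 12345),
                    Country = c("US", "US", "US"),
                    Year    = c(2013, 2010, 2016))

# TRUE where the US ID also occurs in the EU data, converted to 1/0
df_US$EU_migration <- as.integer(df_US$ID %in% df_EU$ID)
df_US$EU_migration
```

Because %in% tests membership row by row, the number of rows in df_US stays the same no matter how often an ID is repeated in either data frame.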
I have a dataframe whose structure looks like this
I would like to reshuffle the way the data is presented by creating a new data frame in which I summarise the data above, so that it looks like this:
Therefore, for each European country, I will be creating 4 variables, each of which is a sum of the capital expenditure variable under a different condition. Let's take the first one as an example:
This is the sum of total capital expenditure that is directed to Austria (so Destination country= 'Austria') from EU countries (Source country continent=EU).
Can someone indicate the code to create a new df with this structure and create the variable explained above?
Thanks a lot!
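One possible way to build the first variable is sketched below with base R. The original table is not shown, so the column names (Destination_country, Source_continent, Capex) and all values are hypothetical and would need to be replaced with the real ones.

```r
# Hypothetical column names and made-up values; the original table was not shown
fdi <- data.frame(
  Destination_country = c("Austria", "Austria", "Austria", "Belgium"),
  Source_continent    = c("EU", "EU", "Asia", "EU"),
  Capex               = c(10, 5, 7, 3)
)

# Total capital expenditure per destination, restricted to EU sources
capex_from_EU <- aggregate(Capex ~ Destination_country,
                           data = subset(fdi, Source_continent == "EU"),
                           FUN = sum)
names(capex_from_EU)[2] <- "Capex_from_EU"
capex_from_EU
```

The other three variables could be built the same way with a different subset condition and then combined with merge on Destination_country.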
I have a dataset with multiple records, each of which is assigned a country, and I want to create a worldmap using rworldmap, coloured according to the frequency with which each country occurs in the dataset. Not all countries appear in the dataset - either because they have no corresponding records or because they are not eligible (e.g. middle/low income countries).
To build the map, I have created a dataframe (dfmap) based on a table of countries, where one column is the country code and the second column is the frequency with which it appears in the dataset.
In order to identify on the map countries which are eligible, but have no records, I have tried to use add_row to add these to my dataframe e.g. for Andorra:
add_row(dfmap, Var1="AND", Freq=0)
When I run add_row for each country, it appears to work (no error message and that new row appears in the table below the command) - but previously added rows where the Freq=0 do not appear.
When I then look at the dataframe using "dfmap" or "summary(dfmap)", none of the rows where Freq=0 appear, and when I build the map, they are coloured as for missing countries.
I'm not sure where I'm going wrong and would welcome any suggestions.
Many thanks
Using the method suggested in the comment above, one can join the counts to a complete list of countries and then use replace_na to give the countries without records a count of zero.
As there was no sample data in the question, I created two data frames below based on what I thought the question implied.
library(dplyr)
library(tidyr)
dfrm_counts = tibble(Country = c('England', 'Germany'),
                     Count = c(1, 4))
dfrm_all = tibble(Country = c('England', 'Germany', 'France'))
# Keep every country in dfrm_all, then fill the missing counts with 0
dfrm_final = dfrm_counts %>%
  right_join(dfrm_all, by = "Country") %>%
  replace_na(list(Count = 0))
dfrm_final
# A tibble: 3 x 2
  Country Count
  <chr>   <dbl>
1 England     1
2 Germany     4
3 France      0
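A likely cause of the symptom in the question (an assumption, since the full script was not shown) is that the result of add_row was printed but never assigned back to dfmap. add_row returns a modified copy and leaves its input unchanged. The sketch below uses a made-up frequency table with the column names Var1 and Freq from the question.

```r
library(tibble)

# Made-up frequency table; Var1 and Freq follow the names used in the question
dfmap <- data.frame(Var1 = c("GBR", "FRA"), Freq = c(12, 3),
                    stringsAsFactors = FALSE)

# add_row() returns a new data frame, so the result must be assigned back
dfmap <- add_row(dfmap, Var1 = "AND", Freq = 0)
dfmap <- add_row(dfmap, Var1 = "MCO", Freq = 0)
dfmap
```

Var1 is kept as a character column here; if it is a factor in the real data, converting it with as.character first keeps the new country codes from clashing with the existing levels.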
I have GDP values listed by country (rows) and year (column headings) in one dataset. I'm trying to combine it with another dataset where the values represent GINI. How do I merge these two massive datasets by country and year, when "year" is not a variable? (How do I manipulate each dataset so that I introduce "year" as a column and have repeating countries to represent each year?)
i.e. from the top dataframe to the bottom dataframe in the image?
Reshape the top dataset from wide to long and then merge it with your other dataset. There are many examples of reshaping data on this site, using different approaches. A common one is to use the tidyr package, which has a function called gather that does just what you need. Note that column names that start with a digit must be wrapped in backticks; a bare 1960:1962 would be read as column positions.
long_table <- tidyr::gather(wide_table, key = year, value = GDP, `1960`:`1962`)
or whatever the last year in your dataset is. You can install the tidyr package with install.packages('tidyr') if you don't have it yet.
Next time, please avoid putting pictures of your data and provide reproducible data so this is easier for others to answer exactly. You can use dput(..) to do so.
Hope this helps!
#sample data (added 'X' before numeric year columns as R doesn't allow column name to start with digit)
df <- data.frame(Country_Name = c('Belgium', 'Benin'),
                 X1960 = c(123, 234),
                 X1961 = c(567, 890))
library(dplyr)
library(tidyr)
df_new <- df %>%
  gather(Year, GDP, -Country_Name)
df_new$Year <- gsub('X', '', df_new$Year)
df_new
Output is:
  Country_Name Year GDP
1      Belgium 1960 123
2        Benin 1960 234
3      Belgium 1961 567
4        Benin 1961 890
(PS: As already suggested by others you should always share sample data using dput(df))
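The answers above cover the reshaping step; the merge with the GINI table could then look like the sketch below. It uses pivot_longer, which tidyr offers as the newer alternative to gather. The GDP numbers are copied from the sample above, while the GINI values and the helper name to_long are made up for illustration.

```r
library(tidyr)

# GDP values copied from the sample above; the GINI values are made up
gdp_wide  <- data.frame(Country_Name = c("Belgium", "Benin"),
                        X1960 = c(123, 234), X1961 = c(567, 890))
gini_wide <- data.frame(Country_Name = c("Belgium", "Benin"),
                        X1960 = c(30, 40), X1961 = c(31, 41))

# Helper (hypothetical name) that reshapes one wide table to long format
to_long <- function(wide, value_name) {
  long <- pivot_longer(wide, cols = -Country_Name, names_to = "Year",
                       names_prefix = "X", values_to = value_name)
  long$Year <- as.integer(long$Year)
  long
}

# One row per country and year, with GDP and GINI side by side
merged <- merge(to_long(gdp_wide, "GDP"), to_long(gini_wide, "GINI"),
                by = c("Country_Name", "Year"))
merged
```

merge keeps only country-year pairs present in both tables by default; passing all = TRUE would keep unmatched rows and fill the gaps with NA.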
With the data in Excel, if you have Excel 2010 or later, you can use Power Query or Get & Transform to unpivot the "year" columns.
This is the code but you can do this through the GUI
And this is the result, although I had to format the GDP column to get your mix of Scientific and Number formatting, and I had a typo on Belgium 1962
I am new to R and trying to learn on my own. I have data in csv format with 1,048,575 rows and 73 columns. I am looking at three columns: year, country, aid_amount. I want to get the sum of aid_amount by country for i) all years, and ii) the years 1991-2010. I tried the following for all years, but the result I get is different from the one I get when I sort and sum in Excel. What is wrong here? Also, what change should I make for ii), the years 1991-2010? Thanks.
aiddata <- read.csv("aiddata_research.csv")
sum_by_country <- tapply(aiddata$aid_amount, aiddata$country, sum, na.rm=TRUE) # There are missing data on aid_amount
write.csv(sum_by_country, "sum_by_country.csv")
I have also tried the following instead of tapply:
sum_by_country <- aggregate(aid_amount ~ country, data = aiddata, sum)
The first few rows for a few columns look like this:
aiddata_id  year  country                 aid_amount
23229017    2004  Bangladesh              685899.2666
14582630    2000  Bilateral, unspecified  15772.77174
28085216    2006  Bilateral, unspecified  38926.82898
28702455    2006  Bilateral, unspecified  12633.85659
29928104    2006  Cambodia                955412.9884
27783934    2006  Cambodia                11773.77268
37418683    2008  Guatemala               40150.7331
94726192    2010  Guatemala               151206.3096
You could use data.table for the big dataset. To get the sum of aid_amount for each country by year:
library(data.table)
# na.rm = TRUE skips the missing aid_amount values
setkey(setDT(aiddata), country, year)[,
  list(aid_amount = sum(aid_amount, na.rm = TRUE)), by = list(country, year)]
To get the sum of aid_amount for each country:
setkey(setDT(aiddata), country)[,
  list(aid_amount = sum(aid_amount, na.rm = TRUE)), by = list(country)]
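For part ii) of the question, the years 1991-2010, one option is to filter the rows before grouping. The sketch below uses a few rows copied from the question plus one made-up row (1985, missing amount) to show both the filter and the NA handling.

```r
library(data.table)

# A few rows copied from the question, plus one made-up row with a missing amount
aiddata <- data.table(
  year       = c(2004, 2006, 2006, 2008, 2010, 1985),
  country    = c("Bangladesh", "Cambodia", "Cambodia",
                 "Guatemala", "Guatemala", "Guatemala"),
  aid_amount = c(685899.2666, 955412.9884, 11773.77268,
                 40150.7331, 151206.3096, NA)
)

# Keep 1991-2010 only, then sum per country, ignoring missing amounts
sum_1991_2010 <- aiddata[year >= 1991 & year <= 2010,
                         .(aid_amount = sum(aid_amount, na.rm = TRUE)),
                         by = country]
sum_1991_2010
```

With data read through read.csv, calling setDT(aiddata) first converts the data frame to a data.table so that the same expression applies.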
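Alternatively, base R's aggregate computes the sum per category:
yy = aggregate(df$Column1, by = list(df$Column2), FUN = sum)
Column1 holds the values to sum; Column2 holds the categories to group by.
To find which category has the largest sum, use the code below. which.max returns the row index, so it is used here to subset yy:
yy[which.max(yy$x), ]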
I am using a list of variables to download and create dataframes in R. I'd like to be able to use this list to make changes to different columns in each dataframe, but I am having trouble calling particular columns using the list of variables.
countries = c("USA", "CHN")
for (i in 1:length(countries)) {
  download.file(url[i], savedata[i])
  assign(countries[i], xmlToDataFrame(savedata[i]))
}
Now I have dataframes that look like this:
head(USA)
indicator country date value decimal
1 GDP (current US$) United States 2012 15684800000000 0
2 GDP (current US$) United States 2011 14991300000000 0
3 GDP (current US$) United States 2010 14419400000000 0
4 GDP (current US$) United States 2009 13898300000000 0
5 GDP (current US$) United States 2008 14219300000000 0
6 GDP (current US$) United States 2007 13961800000000 0
And I would like to go through and make several changes, such as formatting the date column with the as.Date() function, or changing the units of the value column, but I want to be able to do the same to both dataframes (or to an arbitrary number, in case I increase the length of countries).
However, whenever I try to do this I can't seem to use the list of countries in the countries variable to get 'inside' each data frame. My initial guess was putting something like this in a loop:
assign(paste(countries[i],"date",sep="$"),
as.date(get(paste(countries[i],"date",sep="$")))
In particular, I get confused about how the get(paste(countries[i])) works if I am not trying to get the particular column date, and how the paste(countries[i],"date",sep="$") prints the correct name, but I can't seem to get just the one column I'd like to manipulate.
Additionally, I realize loops are not the ideal way of doing this, but I've been having the same problem with the apply functions, though I am likely having trouble with them due to my lack of experience. Suggestions for how to do it either in a loop or without one would be much appreciated. Super R novice here, just trying to learn. Also, if you've come across a clear explanation/answer for this somewhere else, I'd appreciate you pointing me towards it.
It's much easier if you use lists. Start with an empty one:
mylist = list()
Then change this:
assign(countries[i],xmlToDataFrame(savedata[i]))
to this:
mylist[[i]] <- xmlToDataFrame(savedata[i])
Then make a function that does your formatting, for instance:
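f <- function(df){
  # The function is as.Date (capital D). The date column holds a year
  # such as "2012", so build a full date string before converting.
  within(df, date <- as.Date(paste0(date, "-01-01")))
}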
And use lapply to apply it to all dataframes:
mylist2 <- lapply(mylist, f)
If you want to access dataframes by name, use this:
names(mylist2) <- countries
And test:
mylist2[["USA"]]
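Put together, the list-based approach could look like the sketch below. Since the download step cannot be reproduced here, two small made-up data frames stand in for the output of xmlToDataFrame (the USA values are copied from the question, the CHN values are placeholders), and the function name clean is hypothetical.

```r
countries <- c("USA", "CHN")

# Made-up stand-ins for the downloaded data frames (all columns as text)
mylist <- list(
  data.frame(date = c("2012", "2011"),
             value = c("15684800000000", "14991300000000")),
  data.frame(date = c("2012", "2011"),
             value = c("2000000000", "1000000000"))
)
names(mylist) <- countries

# Hypothetical cleaning step: numeric year, value expressed in billions
clean <- function(df) {
  df$date  <- as.integer(df$date)
  df$value <- as.numeric(df$value) / 1e9
  df
}

# Apply the same change to every data frame in the list
cleaned <- lapply(mylist, clean)
cleaned[["USA"]]

# Optionally stack everything, keeping the list name as a column
combined <- do.call(rbind, Map(function(df, nm) cbind(country = nm, df),
                               cleaned, names(cleaned)))
```

Adding another country then only means extending countries and the list; neither the function nor the lapply call needs to change.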