Add columns to other columns - r

I would like to take two columns and append them to two other columns. For example, I have the data below:
EU.Member.States X. Other.countries..continued. X..1
Austria 122 Cameroon 203
Belgium 150 Canada 156
Denmark 179 Canary Islands 132
Finland 156 Cape Verde 147
France 130 Cayman Islands 213
How can I take the rows under "Other.countries..continued." and "X..1" and add them directly under "EU.Member.States" and "X." respectively?
I have tried using unite() from tidyr, with no success.

Your question is almost identical to this one. Using the pipe from the dplyr package, I can suggest a solution: first duplicate your column names, then apply a classic rbind. I used only the first two rows of your example:
df %>% setNames(names(df)[c(1,2,1,2)]) %>% {rbind(.[,1:2], .[,3:4])}
#### EU.Member.States X.
#### 1 Austria 122
#### 2 Belgium 150
#### 3 Cameroon 203
#### 4 Canada 156
Note: the curly braces are there to tell the pipe not to pass . as an implicit first argument to rbind.
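For readers who do not have dplyr loaded, the same stacking can be sketched in base R. The data frame below is a hand-typed reconstruction of the first two rows of the question's data, and the names left, right, and stacked are illustrative only.

```r
# Hypothetical reconstruction of the first two rows of the question's data
df <- data.frame(
  EU.Member.States = c("Austria", "Belgium"),
  X. = c(122, 150),
  Other.countries..continued. = c("Cameroon", "Canada"),
  X..1 = c(203, 156),
  stringsAsFactors = FALSE
)

# Split into the two column pairs and give the second pair the first pair's names
left  <- df[, 1:2]
right <- setNames(df[, 3:4], names(left))

# rbind() matches data frame columns by name, so the pairs stack cleanly
stacked <- rbind(left, right)
print(stacked)
```

Renaming before rbind() matters because rbind() on data frames refuses to combine columns whose names differ.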

Related

Summarise rows in dataframe by two columns

I have this data frame called World, which shows the following:
City Year Income Tourist
London 2008 50 100
NY 2009 75 250
Paris 2010 45 340
Dubai 2008 32 240
London 2011 50 140
Abu Dhabi 2009 60 120
Paris 2009 70 140
NY 2007 50 150
Tokyo 2008 45 150
Dubai 2010 40 480
#With 207 more rows
I want to summarise the rows so that every city shows its total income and tourists across all the years. So I am looking for code where rows are matched by City across the years and then summed, so that every city has just one row.
Something like this:
City Income Tourist
London 1051 5040
NY 1547 5432
Paris 2600 4321
Dubai 3222 5312
Abu Dhabi 3100 7654
Tokyo 2404 4321
#With 40 more rows
From the research I've done, it seems that n_distinct and group_by should be used.
Base R solution:
You can use the sapply() function to iterate over cities.
The first argument is a vector of unique cities.
The function we write then selects all the rows (years) of each city and returns the "Income" and "Tourist" columns.
Sum the column values with the colSums() function.
Transpose the output using the t() function.
t( sapply( unique( World$City ),function(CITY) colSums(World[World$City==CITY,c("Income","Tourist")] ) ) )
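As a cross-check on the approach above, here is a minimal sketch using base R's aggregate(), which is a swapped-in alternative and not the answer's own method. It uses only the ten rows shown in the question, retyped by hand, so the totals will not match the full data set.

```r
# First ten rows of the question's data, retyped by hand
World <- data.frame(
  City = c("London", "NY", "Paris", "Dubai", "London",
           "Abu Dhabi", "Paris", "NY", "Tokyo", "Dubai"),
  Year = c(2008, 2009, 2010, 2008, 2011, 2009, 2009, 2007, 2008, 2010),
  Income = c(50, 75, 45, 32, 50, 60, 70, 50, 45, 40),
  Tourist = c(100, 250, 340, 240, 140, 120, 140, 150, 150, 480)
)

# Sum Income and Tourist within each city; Year is simply left out
totals <- aggregate(cbind(Income, Tourist) ~ City, data = World, FUN = sum)
print(totals)
```

The result has one row per city, which is the shape the question asks for.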
Solution with R's data.table package:
Make sure your object is a data.table.
In the j part of the brackets (the "do" part):
you can name the output columns ("Income="),
and specify the computation for each ("sum(Income)").
To group by city, add a by argument inside the brackets.
World[,.(Income=sum(Income),Tourist=sum(Tourist)),by=City]
Yes, you can use the group_by and summarise functions.
World %>% group_by(City) %>% summarise(across(c(Income,Tourist), sum))
You can also add Year to the group_by call.
World %>% group_by(City,Year) %>% summarise(across(c(Income,Tourist), sum))

R t-test of mean vs observations for multiple factor levels

I have a dataset of some 39k rows of data, an excerpt is below:
'Country', 'Group', 'Item', 'Year' are categorical
'Production' and 'Waste' are numerical
'LF' is also numerical, but is the result of 'Waste'/'Production'
Region Country Group Item Year Production Waste LF
Europe Bulgaria Cereals Wheat 1961 2040 274 0.134313725
Europe Bulgaria Cereals Wheat 1962 2090 262 0.125358852
Europe Bulgaria Cereals Wheat 1963 1894 277 0.14625132
Europe Bulgaria Cereals Wheat 1964 2121 286 0.134842056
Europe Bulgaria Cereals Wheat 1965 2923 341 0.116660965
Europe Bulgaria Cereals Wheat 1966 3193 385 0.120576261
Europe Bulgaria Cereals Barley 1961 612 15 0.024509804
Europe Bulgaria Cereals Barley 1962 599 16 0.026711185
Europe Bulgaria Cereals Barley 1963 618 16 0.025889968
Europe Bulgaria Cereals Barley 1964 764 21 0.027486911
Europe Bulgaria Cereals Barley 1965 876 22 0.025114155
Europe Bulgaria Cereals Barley 1966 1064 24 0.022556391
I have used the following code to generate 991 different means by Country and Item
df2 <- aggregate(LF ~ Country + Item, data=df1, FUN='mean')
The results of this function look ok.
I would like to test whether the respective means of LF in df2 differ from the underlying annual observations in df1 for each Country-Item combination (i.e. if FALSE, then LF is really just a static ratio; if TRUE, then 'Waste' is independent of 'Production').
How might this best be done? There seem to be 991 tests to conduct for this dataset alone and I don't know how to mix the apply and t.test functions in this manner.
Thanks!
t.test requires two groups to compare on a numeric/scale dependent output variable. Here, it seems to me that for each combination of country and item you want to compare all different year averages/means. In other words, you are trying to investigate if year is influencing the LF averages, for each combination of country and item.
The easiest way to do this is to create a linear model (LF ~ Year) for each combination of country and item and interpret the coefficient and p value of the variable year.
library(dplyr)
library(broom)
set.seed(115)
# example dataset
dt = data.frame(Country = rep("country1",12),
                Item = c(rep("item1",6), rep("item2",6)),
                Year = rep(1961:1966,2),
                LF = runif(12,0,1))
# general means by country and item
dt %>% group_by(Country,Item) %>% summarise(Mean_LF = mean(LF))
# each year's means by country and item
dt %>% group_by(Country,Item,Year) %>% summarise(Mean_LF = mean(LF))
# does year influence the means for each country and item?
dt %>% group_by(Country,Item) %>% do(tidy(lm(LF~Year, data=.)))
Hope this helps. Let me know if I'm missing something and I'll update my code.
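The per-group regression idea above can also be sketched in base R, without dplyr or broom. The data below are the twelve Bulgaria rows from the question, retyped by hand; the names groups and year_effect are illustrative only, and this is one possible way to organise the fits, not the answer's own code.

```r
# The Bulgaria rows from the question, retyped by hand
df1 <- data.frame(
  Country = "Bulgaria",
  Item = rep(c("Wheat", "Barley"), each = 6),
  Year = rep(1961:1966, 2),
  LF = c(0.134313725, 0.125358852, 0.14625132, 0.134842056,
         0.116660965, 0.120576261,
         0.024509804, 0.026711185, 0.025889968, 0.027486911,
         0.025114155, 0.022556391)
)

# One data frame per Country-Item combination (empty combinations dropped)
groups <- split(df1, list(df1$Country, df1$Item), drop = TRUE)

# Fit LF ~ Year in each group and keep the Year slope and its p-value
year_effect <- t(sapply(groups, function(g) {
  coefs <- summary(lm(LF ~ Year, data = g))$coefficients
  coefs["Year", c("Estimate", "Pr(>|t|)")]
}))
print(year_effect)
```

With 991 combinations this yields 991 p-values, so some correction for multiple testing (for example p.adjust()) is probably worth considering before interpreting them.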

How to Convert Numeric Data into Currency in R?

Searched Google and SO and couldn't find a good answer. I have the following table:
Country Value
23 Bolivia 2575.684
71 Guyana 3584.693
125 Paraguay 3878.150
49 Ecuador 5647.638
126 Peru 6825.461
38 Colombia 7752.168
151 Suriname 9376.495
25 Brazil 11346.796
7 Argentina 11610.220
171 Venezuela 12766.725
168 Uruguay 14702.505
37 Chile 15363.098
All values are in US dollars - I'd like to add in the dollar signs and the commas. Bolivia's value should therefore read $2,575.684. Also, is there any real need to change row names to 1 through 12? If so, an easy way to do so?
Thanks in advance.
paste0('$', formatC(df$Value, big.mark=',', format='f', digits=3))
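A minimal sketch of the formatting on three of the question's rows, which also covers the side question about row names. The column name Formatted is an assumption made for illustration. Resetting row names is optional; assigning NULL simply renumbers them from 1.

```r
# Three rows from the question, retyped by hand, keeping the original row names
df <- data.frame(
  Country = c("Bolivia", "Guyana", "Brazil"),
  Value = c(2575.684, 3584.693, 11346.796),
  row.names = c("23", "71", "25")
)

# paste0() avoids the space that paste() would put after the dollar sign;
# digits = 3 keeps three decimals, as in the question's expected output
df$Formatted <- paste0("$", formatC(df$Value, format = "f", digits = 3, big.mark = ","))

# Optional: renumber the rows 1, 2, 3, ...
rownames(df) <- NULL
print(df)
```

Note that the formatted column is character, so keep the numeric Value column for any further calculation.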

Wrong histogram from data

I have the data frame new1 with 20 columns of variables one of which is new1$year. This includes 25 years with the following count:
> table(new1$year)
1988 1989 1990 1991 1992 1993 1994 1995 1996 1997 1998 1999 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 2011 2012
2770 3171 3392 2955 2906 2801 2930 2985 3181 3059 2977 2884 3039 2428 2653 2522 2558 2370 2666 3046 3155 3047 2941 2591 1580
I tried to prepare a histogram of this with
hist(new1$year, breaks=25)
but I obtain a histogram where the height of the columns differs from the numbers in table(new1$year). For example, the first column is >4000 in the histogram while it should be <2770; another example is 1995, where the bar should be lower relative to the years around it, but it is actually a little higher.
What am I doing wrong? I have tried to define numeric(new1$year) (error says 'invalid length argument') but with no different result.
Many thanks
Marco
Per my comment, try:
barplot(table(new1$year))
The reason hist does not work exactly as you intend has to do with specification of the breaks argument. See ?hist:
one of:
a vector giving the breakpoints between histogram cells,
a function to compute the vector of breakpoints,
a single number giving the number of cells for the histogram,
a character string naming an algorithm to compute the number of cells (see ‘Details’),
a function to compute the number of cells.
In the last three cases the number is a suggestion only.
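If a histogram is still wanted, one option is to pass explicit breakpoints centred on each year, so that every cell holds exactly one year. The sketch below uses a small made-up vector, not the question's data, and plot = FALSE so that only the counts are computed; drop that argument to draw the plot.

```r
# Hypothetical toy data: 5 records for 1988, 3 for 1989, 4 for 1990
years <- rep(1988:1990, times = c(5, 3, 4))

# Integer breakpoints: by default the first cell is closed on both sides,
# so it covers [1988, 1989] and absorbs two years
merged <- hist(years, breaks = 1988:1990, plot = FALSE)
merged$counts

# Half-year breakpoints: one cell per year, matching table(years)
per_year <- hist(years,
                 breaks = seq(min(years) - 0.5, max(years) + 0.5, by = 1),
                 plot = FALSE)
per_year$counts
table(years)
```

The merged first cell is a plausible explanation for the first bar in the question exceeding 4000, since the 1988 and 1989 counts together would do so.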

copy result of unique() string vector in a dataframe R

I am puzzled by something that I thought would easily work.
I have a dataframe with year, city, and species columns.
species City Year
80 Landpattedyr Sisimiut 2007
83 Landpattedyr Sisimiut 2008
87 Landpattedyr Sisimiut 2009
721733 Havpattedyr Upernavik 2010
721734 Havpattedyr Upernavik 2011
721735 Havpattedyr Upernavik 2007
I have used the function unique as follows
years<-unique(df$Year)
city<-unique(df$City)
species<-unique(df$species)
Now I need to assign a value from each of those vectors to a dataframe row based on an index, for example
hunting[1,]$year<-years[i]
hunting[1,]$group<-species[j]
hunting[1,]$city<-city[k]
The problem is that only year is copied properly while city and species in the hunting df show up as numbers. I can't figure out why this is happening. Can anybody help please?
year group city lat long total
1 2007 6 19 66.93 -53.66 4563
NA 2007 6 20 72.78 -56.15 91
3 2007 6 8 67.01 -50.72 388
4 2007 6 21 70.66 -52.12 280
5 2007 6 14 77.47 -69.23 469
6 2007 6 5 69.22 -51.10 1114
To find out whether a column is a factor or a character vector, you can use is.factor(df$City) or is.character(df$City).
In the case of a factor column, the (unique) levels are stored in the levels attribute, which can be accessed with
levels(df$City)
Note: this may include levels that are not present in the vector, for instance, if some rows have been removed or if some levels have been added.
To retrieve the unique elements of a factor or character vector, you can use this:
as.character(unique(df$City))
Unlike levels(), this does not return levels that are absent from a factor column.
Note: the last command is slightly more efficient than unique(as.character(df$City)), since the conversion is evaluated on a possibly shorter vector.
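The points above can be seen on a tiny example. The vector below is made up from the city names in the question. A factor stores integer codes plus a levels attribute, and those codes are probably the numbers that ended up in the hunting data frame.

```r
# Hypothetical factor built from the question's city names
city <- factor(c("Sisimiut", "Upernavik", "Sisimiut"))

u <- unique(city)
is.factor(u)       # still a factor after unique()
as.integer(u)      # the underlying integer codes
as.character(u)    # the labels, which is what the question wants

# After subsetting, levels() still lists cities that no longer occur
sub <- city[city == "Sisimiut"]
levels(sub)
as.character(unique(sub))
```

Converting with as.character() before assigning into another data frame keeps the labels; droplevels() is an alternative when the column should stay a factor but lose its unused levels.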
