How do I reorder a factor - r

I want to reorder a factor based on one of its rows. For example I want to reorder the "country" factor based on the value corresponding to the 2014 entries below. UK would be ranked first and USA second.
dat <- data.frame(
country=c("USA","USA","UK","UK"),
year=c(2014,2013,2014,2013),
value=c(2,NA,1,NA)
)
country year value
1 USA 2014 2
2 USA 2013 NA
3 UK 2014 1
4 UK 2013 NA
I don't quite understand how factors are reordered. It seems the reorder command replaces an entire column in a data.frame, but I would think that I should only need to specify a new order for the factor labels. "levels" seems to do the opposite, giving labels to the ordering.

Maybe this:
factor(dat$country, levels=with(dat[dat$year==2014,], country[order(value)] ))
#[1] USA USA UK UK
#Levels: UK USA

country <- factor(c("USA","USA","UK","UK"), levels = c("UK","USA"))
sort(country)
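To see what is going on with the levels, here is a minimal base R sketch using the dat from the question. It assumes the goal is only to change the order of the levels, and that every country has exactly one non-missing 2014 value; the names d2014, new_levels and by_reorder are just for illustration. Setting levels= changes the order of the labels and leaves the data alone, while reorder() returns a whole new factor, which is probably why it looks like it replaces the column.

```r
dat <- data.frame(
  country = c("USA", "USA", "UK", "UK"),
  year    = c(2014, 2013, 2014, 2013),
  value   = c(2, NA, 1, NA)
)

# Countries of the 2014 rows, sorted by value, give the new level order.
d2014 <- dat[dat$year == 2014, ]
new_levels <- as.character(d2014$country[order(d2014$value)])

dat$country <- factor(dat$country, levels = new_levels)
levels(dat$country)        # "UK" "USA"
as.character(dat$country)  # "USA" "USA" "UK" "UK" (row values untouched)

# reorder() computes one score per level and sorts the levels by it.
# na.rm = TRUE is passed on to mean(), so the 2013 NAs are ignored.
# Note this averages over all years; it matches the 2014 order here
# only because the 2013 values are all missing.
by_reorder <- reorder(dat$country, dat$value, FUN = mean, na.rm = TRUE)
levels(by_reorder)         # "UK" "USA"
```

Either way the column has to be assigned back (dat$country <- ...), because a factor carries its level order with it.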

Related

replace missing value by grouping with mean

I have a table of countries and GDP with missing values. I want to replace each missing value with a mean, not the mean of the whole column but the mean of the group it belongs to.
I have 27 countries and 11 years, like this:
countries year GDP
1 2001 125
1 2002 ...
1 2003 525
2 2001 222
2 2002 ...
So I would like to take the mean of the first country over all years and use it to replace that country's missing GDP values.
I know how to do this for the whole column:
data$gdp[which(is.na(data$gdp))]<- mean(data$gdp, na.rm=TRUE)
but this uses the mean of the whole column. I don't want to take a subset for each country and calculate each mean separately; I was wondering whether I could do it in one go.
One option is na.aggregate from zoo (by default it replaces the NA elements with the mean), applied per group of 'countries':
library(dplyr)
library(zoo)
df1 %>%
group_by(countries) %>%
mutate(GDP = na.aggregate(GDP))
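If you would rather not load extra packages, the same idea can be sketched in base R with ave(). The data below is made up to mirror the layout in the question (the elided GDP entries are assumed to be NA), and group_mean is just an illustrative name.

```r
df1 <- data.frame(
  countries = c(1, 1, 1, 2, 2),
  year      = c(2001, 2002, 2003, 2001, 2002),
  GDP       = c(125, NA, 525, 222, NA)   # NA stands in for the missing values
)

# Mean of the non-missing GDP values within each country, repeated per row.
group_mean <- ave(df1$GDP, df1$countries,
                  FUN = function(x) mean(x, na.rm = TRUE))

# Fill only the missing entries with their own country's mean.
df1$GDP <- ifelse(is.na(df1$GDP), group_mean, df1$GDP)
df1$GDP   # 125 325 525 222 222
```

A country whose GDP values are all missing would end up with NaN here, since there is nothing to average.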

Removing "outer rows" to allow for interpolation (and prevent extrapolation)

I have (left)joined two data frames by country-year.
df<- left_join(df, df2, by="country-year")
leading to the following example output:
country country-year a b
1 France France2000 NA NA
2 France France2001 1000 1000
3 France France2002 NA NA
4 France France2003 1600 2200
5 France France2004 NA NA
6 UK UK2000 1000 1000
7 UK UK2001 NA NA
8 UK UK2002 1000 1000
9 UK UK2003 NA NA
10 UK UK2004 NA NA
I initially wanted to remove all rows for which both of the added columns (a, b) were NA.
df<-df[!is.na( df$a | df$b ),]
However, on second thought, I decided I wanted to interpolate the data I had (but not extrapolate). So instead I would like to remove all the rows for which I cannot interpolate; in the example:
1 France France2000 NA NA
5 France France2004 NA NA
9 UK UK2003 NA NA
10 UK UK2004 NA NA
I believe there are 2 options. First I somehow adapt this function:
library(tidyerse)
TRcomplete<-TRcomplete%>%
group_by(country) %>%
mutate_at(a:b,~na.fill(.x,"extend"))
to interpolate only, and then apply df<-df[!is.na( df$a | df$b ),]
or I write code to remove the "outer" rows first and then use extend as normal. Desired output:
country country-year a b
2 France France2001 1000 1000
3 France France2002 1300 1600
4 France France2003 1600 2200
6 UK UK2000 1000 1000
7 UK UK2001 0 0
8 UK UK2002 1000 1000
Any suggestions?
There are options in na.fill that control what gets filled. If you look at ?na.fill, you see that fill can specify the left, interior and right parts separately, so if you set the left and right to NA and the interior to "extend", only the interior gaps are filled. You can then filter out the rows that are still NA.
library(tidyverse)
library(zoo)
df %>%
group_by(country) %>%
mutate_at(vars(a:b),~na.fill(.x,c(NA, "extend", NA))) %>%
filter(!is.na(a) | !is.na(b))
By the way, you have a typo in your library(tidyverse) statement; you are missing the v.
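For comparison, here is a base R sketch of the same "interpolate inside, never extrapolate" idea, using approx() with rule = 1, which returns NA outside the range of the known points. It assumes a numeric year column (the example only shows the pasted country-year key) and only handles column a; interp_inside is a hypothetical helper, not part of any package.

```r
df <- data.frame(
  country = rep(c("France", "UK"), each = 5),
  year    = rep(2000:2004, times = 2),   # assumed numeric year column
  a       = c(NA, 1000, NA, 1600, NA, 1000, NA, 1000, NA, NA)
)

# Linear interpolation between known points; NA outside their range.
interp_inside <- function(x, y) {
  ok <- !is.na(y)
  if (sum(ok) < 2) return(y)   # nothing to interpolate between
  approx(x[ok], y[ok], xout = x, rule = 1)$y
}

# Interpolate separately within each country.
for (ctry in unique(df$country)) {
  rows <- df$country == ctry
  df$a[rows] <- interp_inside(df$year[rows], df$a[rows])
}

# Drop the leading and trailing rows that could not be interpolated.
df <- df[!is.na(df$a), ]
df$a   # 1000 1300 1600 1000 1000 1000
```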

Subsetting rows of a dataframe when respondent number is duplicated in column

I have a huge dataset which is partly pooled cross section and partly panel data:
Year Country Respnr Power Nr
1 2000 France 1 1213 1
2 2001 France 2 1234 2
3 2000 UK 3 1726 3
4 2001 UK 3 6433 4
I would like to filter the panel data from the combined data and tried the following:
> anyDuplicated(df$Respnr)
[1] 45047 # Out of 340.000
dfpanel<- subset(df, duplicated(df$Respnr) == TRUE)
The new df is however reduced to zero observations. The following led to the expected amount of observations:
dfpanel<- subset(df, Nr < 3)
Any idea what could be the issue?
Although I have not figured out why the previous attempt did not work, the following is a working solution. I have simply split the previous approach into two steps. The solution adds a column panel, which in my case is actually a welcome addition:
df$panel <- duplicated(df$Respnr)
dfpanel <- subset(df, df$panel == TRUE)
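One thing to keep in mind with either version: duplicated() flags only the second and later occurrences, so the first observation of every panel respondent is left out. This does not explain the zero-row result above, but if the aim is to keep every row of respondents who appear more than once, something along these lines may be closer. The data is the four-row example from the question and in_panel is an illustrative name.

```r
df <- data.frame(
  Year    = c(2000, 2001, 2000, 2001),
  Country = c("France", "France", "UK", "UK"),
  Respnr  = c(1, 2, 3, 3),
  Power   = c(1213, 1234, 1726, 6433)
)

# Scanning from the front marks later repeats, scanning from the back
# marks earlier ones; together they mark every repeated respondent.
in_panel <- duplicated(df$Respnr) | duplicated(df$Respnr, fromLast = TRUE)

dfpanel <- df[in_panel, ]
dfpanel$Respnr   # 3 3
```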

How can I count the number of instances a value occurs within a subgroup in R?

I have a data frame that I'm working with in R, and am trying to check how many times a value occurs within its larger, associated group. Specifically, I'm trying to count the number of cities that are listed for each particular country.
My data look something like this:
City Country
=========================
New York US
San Francisco US
Los Angeles US
Paris France
Nantes France
Berlin Germany
It seems that table() is the way to go, but I can't quite figure it out — how can I find out how many cities are listed for each country? That is to say, how can I find out how many fields in one column are associated with a particular value in another column?
EDIT:
I'm hoping for something along the lines of
3 US
2 France
1 Germany
I guess you can try table.
table(df$Country)
# France Germany US
# 2 1 3
Or using data.table
library(data.table)
setDT(df)[, .N, by=Country]
# Country N
#1: US 3
#2: France 2
#3: Germany 1
Or
library(plyr)
count(df$Country)
# x freq
#1 France 2
#2 Germany 1
#3 US 3
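If the counts should come out largest first, as in the edit to the question, a base R sketch is to sort the table. The df below just retypes the example data, and counts is an illustrative name.

```r
df <- data.frame(
  City    = c("New York", "San Francisco", "Los Angeles",
              "Paris", "Nantes", "Berlin"),
  Country = c("US", "US", "US", "France", "France", "Germany")
)

# table() counts rows per country; sort() puts the largest count first.
counts <- sort(table(df$Country), decreasing = TRUE)
names(counts)       # "US" "France" "Germany"
as.vector(counts)   # 3 2 1
```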

copy result of unique() string vector in a dataframe R

I am puzzled by something that I thought would easily work.
I have a dataframe with year, city, and species columns.
species City Year
80 Landpattedyr Sisimiut 2007
83 Landpattedyr Sisimiut 2008
87 Landpattedyr Sisimiut 2009
721733 Havpattedyr Upernavik 2010
721734 Havpattedyr Upernavik 2011
721735 Havpattedyr Upernavik 2007
I have used the function unique as follows
years<-unique(df$Year)
city<-unique(df$City)
species<-unique(df$species)
now I need to assign a value in each of those vectors to a dataframe row based on an index, for example
hunting[1,]$year<-years[i]
hunting[1,]$group<-species[j]
hunting[1,]$city<-city[k]
The problem is that only year is copied properly while city and species in the hunting df show up as numbers. I can't figure out why this is happening. Can anybody help please?
year group city lat long total
1 2007 6 19 66.93 -53.66 4563
NA 2007 6 20 72.78 -56.15 91
3 2007 6 8 67.01 -50.72 388
4 2007 6 21 70.66 -52.12 280
5 2007 6 14 77.47 -69.23 469
6 2007 6 5 69.22 -51.10 1114
To find out whether a column is a factor or a character vector, you can use is.factor(df$City) or is.character(df$City).
In the case of a factor column, the (unique) levels are stored in the levels attribute, which can be accessed with
levels(df$City)
Note: this may include levels that are not present in the vector, for instance, if some rows have been removed or if some levels have been added.
To retrieve the unique elements of a factor or character vector, you can use this:
as.character(unique(df$City))
This will not return levels that are absent from a factor column.
Note: the last command is slightly more efficient than unique(as.character(df$City)), since the conversion is evaluated on a possibly shorter vector.
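The numbers in the hunting data frame are most likely the integer codes behind a factor. A small sketch of that effect, with made-up values in the style of the question (city, u, code_cell, label_cell and kept are illustrative names):

```r
city <- factor(c("Sisimiut", "Upernavik", "Sisimiut"))
u <- unique(city)

# Putting a factor element into a non-factor vector keeps only its integer code.
code_cell <- c(NA, NA)
code_cell[1] <- u[2]
code_cell[1]                 # 2, not "Upernavik"

# Converting to character first keeps the label.
label_cell <- c(NA_character_, NA_character_)
label_cell[1] <- as.character(u[2])
label_cell[1]                # "Upernavik"

# Levels survive subsetting; unique values do not include absent levels.
kept <- city[city != "Upernavik"]
levels(kept)                 # "Sisimiut" "Upernavik"
as.character(unique(kept))   # "Sisimiut"
```

So wrapping the right-hand side in as.character(), for example as.character(city[k]), is one way to get the labels into the target column, assuming that column is meant to hold text.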
