Subsetting rows of a dataframe when respondent number is duplicated in column - r

I have a huge dataset which is partly pooled cross section and partly panel data:
Year Country Respnr Power Nr
1 2000 France 1 1213 1
2 2001 France 2 1234 2
3 2000 UK 3 1726 3
4 2001 UK 3 6433 4
I would like to filter the panel data from the combined data and tried the following:
> anyDuplicated(df$Respnr)
[1] 45047 # Out of 340.000
dfpanel<- subset(df, duplicated(df$Respnr) == TRUE)
The new df is however reduced to zero observations. The following led to the expected amount of observations:
dfpanel<- subset(df, Nr < 3)
Any idea what could be the issue?

Although I have not figured out why the previous attempt did not work, the following is a working solution: I simply split the previous approach into two steps. This adds a column panel, which in my case is actually a welcome addition:
df$panel <- duplicated(df$Respnr)
dfpanel <- subset(df, df$panel == TRUE)
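Two details are worth checking when reusing this. anyDuplicated() returns the index of the first duplicated element, not the number of duplicates, and duplicated() marks only the second and later occurrences of a value, so the first observation of each panel respondent is not flagged. A minimal sketch, assuming the goal is to keep every row of a respondent who appears more than once (the toy data and the names is_panel, dfcross and n_repeats are illustrative, not from the original post):

```r
# Toy data shaped like the question's example (values are illustrative).
df <- data.frame(
  Year    = c(2000, 2001, 2000, 2001),
  Country = c("France", "France", "UK", "UK"),
  Respnr  = c(1, 2, 3, 3),
  Power   = c(1213, 1234, 1726, 6433)
)

# duplicated() is FALSE for the first occurrence of each value, so scanning
# from both ends flags every row that belongs to a repeated respondent.
is_panel <- duplicated(df$Respnr) | duplicated(df$Respnr, fromLast = TRUE)

dfpanel <- df[is_panel, ]   # all rows of respondents seen more than once
dfcross <- df[!is_panel, ]  # respondents observed only once

# Count of repeated rows; anyDuplicated() would return an index instead.
n_repeats <- sum(duplicated(df$Respnr))
```

Indexing with a logical vector built outside subset() also avoids any name clash between columns of the data frame and objects in the workspace.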

Related

efficiently creating a panel data.frame from cross sections with unharmonized column names

I need to create a panel data set (long format) from multiple yearly (cross-sectional) data sets. The variables of interest have different names in the individual data sets and I need to harmonize them.
I loaded the data frames into a list and now want to manipulate the names using lapply or a chunk of code that allows binding the data frames. I can see several ways of doing this, but would like to use one which works with little code on a large list of data.frames, so that I can do this for several variables and easily change specifics later on.
So what I am looking for is either a way to rename the columns, so that I am able to simply use bind_rows() from dplyr or an equivalent method, or a way to rename and bind the data sets in one step. Since I need to do this for several variables, it might be safer to keep the two steps apart.
To illustrate, here is an example:
a <- data.frame(id=c("Marc", "Julia", "Rico"), year=2000:2002, laborincome=1:3)
b <- data.frame(id=c("Marc", "Julia", "Rico"), earningsfromlabor=2:4, year=2003:2005)
dflist <- list(a, b)
equivalent_vars <- c("laborincome", "earningsfromlabor")
newnanme <- "income"
Desired result:
data.frame(id=c("Marc", "Julia", "Rico"), income=c(1,2,3,2,3,4), year=2000:2005)
id income year
1 Marc 1 2000
2 Julia 2 2001
3 Rico 3 2002
4 Marc 2 2003
5 Julia 3 2004
6 Rico 4 2005
We could use setnames from data.table (note that setnames renames by reference, so the original data frames are modified in place):
library(data.table)
do.call(rbind, Map(setnames, dflist, old = equivalent_vars, new = newnanme))
# id year income
#1 Marc 2000 1
#2 Julia 2001 2
#3 Rico 2002 3
#4 Marc 2003 2
#5 Julia 2004 3
#6 Rico 2005 4
Or we can use rename from dplyr with the := operator:
library(dplyr)
library(purrr)
map2_df(dflist, equivalent_vars, ~ .x %>%
    rename(!!newnanme := !!.y)) %>%
  select(id, income, year)
# id income year
#1 Marc 1 2000
#2 Julia 2 2001
#3 Rico 3 2002
#4 Marc 2 2003
#5 Julia 3 2004
#6 Rico 4 2005
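If extra packages are not wanted, the same two steps can be kept apart in base R. This is only a sketch: harmonize_names() is a hypothetical helper, and it assumes each data frame contains at most one of the equivalent names. Unlike the positional pairing above, it does not require the list and the vector of names to be in the same order.

```r
a <- data.frame(id = c("Marc", "Julia", "Rico"), year = 2000:2002, laborincome = 1:3)
b <- data.frame(id = c("Marc", "Julia", "Rico"), earningsfromlabor = 2:4, year = 2003:2005)
dflist <- list(a, b)
equivalent_vars <- c("laborincome", "earningsfromlabor")
new_name <- "income"

# Hypothetical helper: rename whichever of the equivalent names is present.
harmonize_names <- function(d, old, new) {
  names(d)[names(d) %in% old] <- new
  d
}

# Step 1: rename. Step 2: bind. rbind() on data frames matches columns by
# name, so the differing column order in a and b does not matter.
renamed <- lapply(dflist, harmonize_names, old = equivalent_vars, new = new_name)
panel <- do.call(rbind, renamed)
```

Repeating the lapply() call with another pair of old and new names harmonizes further variables before binding.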

R. How to add sum row in data frame

I know this question is very elementary, but I'm having trouble adding an extra row that shows a summary of the rows above it.
Let's say I'm creating a data.frame using the code below:
name <- c("James","Kyle","Chris","Mike")
nationality <- c("American","British","American","Japanese")
income <- c(5000,4000,4500,3000)
x <- data.frame(name,nationality,income)
The code above creates the data.frame below:
name nationality income
1 James American 5000
2 Kyle British 4000
3 Chris American 4500
4 Mike Japanese 3000
What I'm trying to do is to add a 5th row that contains: name = "Total", nationality = "NA", income = total of all rows. My desired output looks like this:
name nationality income
1 James American 5000
2 Kyle British 4000
3 Chris American 4500
4 Mike Japanese 3000
5 Total NA 16500
In the real case, my data.frame has more than a thousand rows, and I need an efficient way to add the total row.
Can someone please advise? Thank you very much!
We can use rbind
rbind(x, data.frame(name='Total', nationality=NA, income = sum(x$income)))
# name nationality income
#1 James American 5000
#2 Kyle British 4000
#3 Chris American 4500
#4 Mike Japanese 3000
#5 Total <NA> 16500
Using an index (note that c() coerces all three values to character, so the income column becomes character):
name <- c("James","Kyle","Chris","Mike")
nationality <- c("American","British","American","Japanese")
income <- c(5000,4000,4500,3000)
x <- data.frame(name,nationality,income, stringsAsFactors=FALSE)
x[nrow(x)+1, ] <- c('Total', NA, sum(x$income))
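UPDATE: use list() instead of c() so that each column keeps its type (run this in place of the c() assignment, not after it):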
x[nrow(x)+1, ] <- list('Total', NA, sum(x$income))
x
# name nationality income
# 1 James American 5000
# 2 Kyle British 4000
# 3 Chris American 4500
# 4 Mike Japanese 3000
# 5 Total <NA> 16500
sapply(x, class)
# name nationality income
# "character" "character" "numeric"
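The difference between the two assignments comes down to coercion, which a short sketch can make visible (as_vector and as_list are illustrative names, not from the answer):

```r
income <- c(5000, 4000, 4500, 3000)

# c() builds an atomic vector, so every element is coerced to one common
# type; with a string present, that type is character.
as_vector <- c("Total", NA, sum(income))

# list() keeps each element's own type, so the sum stays numeric.
as_list <- list("Total", NA, sum(income))

class(as_vector)        # "character"
sapply(as_list, class)  # "character" "logical" "numeric"
```

Assigning the character vector into a row therefore turns the income column into character, while the list version leaves it numeric.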
If you want the exact row as you put in your post, then the following should work:
newdata = rbind(x, data.frame(name='Total', nationality='NA', income = sum(x$income)))
I agree with Jaap, though, that you may not want to add this row to the end. If you need to load the data and use it for other analysis, the extra row will cause unnecessary trouble. However, you can use the following code to remove the added row before other analysis:
newdata = newdata[newdata$name != 'Total', ]

How to select specific rows from a split list in R based on column condition

I am new to R and to programming in general and am looking for feedback on how to approach what is probably a fairly simple problem in R.
I have the following dataset:
df <- data.frame(county = rep(c("QU", "AN", "GY"), 3),
                 park = c("Downtown", "Queens", "Oakville", "Squirreltown",
                          "Pinhurst", "GarbagePile", "LottaTrees", "BigHill",
                          "Jaynestown"),
                 hectares = c(12, 42, 6, 18, 92, 6, 4, 52, 12))
df <- transform(df, parkrank = ave(hectares, county,
                FUN = function(x) rank(x, ties.method = "first")))
Which returns a dataframe looking like this:
county park hectares parkrank
1 QU Downtown 12 2
2 AN Queens 42 1
3 GY Oakville 6 1
4 QU Squirreltown 18 3
5 AN Pinhurst 92 3
6 GY GarbagePile 6 2
7 QU LottaTrees 4 1
8 AN BigHill 52 2
9 GY Jaynestown 12 3
I want to use this to create a two-column data frame that lists each county and the name of the park with a given rank (e.g. if I pass 2 when calling my function, it shows the second biggest park in each county).
I am very new to R and programming and have spent hours looking over the built-in R help files and similar questions here on Stack Overflow, but I am clearly missing something. Can anyone give a simple example of where to begin? It seems like I should be using split then lapply or maybe tapply, but everything I try leaves me very confused :(
Thanks.
Try:
df2 <- function(A, x) {
  # A is the data frame and x is the rank to select
  A[A$parkrank == x, ]
}
> df2(df,2)
county park hectares parkrank
1 QU Downtown 12 2
6 GY GarbagePile 6 2
8 AN BigHill 52 2
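The function above returns all four columns and relies on parkrank, which ranks from smallest to largest. A minimal sketch of a variant that returns only county and park and ranks from the biggest park down, as the question describes; park_by_rank() is a hypothetical name, and breaking ties by first appearance is an assumption:

```r
df <- data.frame(
  county   = rep(c("QU", "AN", "GY"), 3),
  park     = c("Downtown", "Queens", "Oakville", "Squirreltown", "Pinhurst",
               "GarbagePile", "LottaTrees", "BigHill", "Jaynestown"),
  hectares = c(12, 42, 6, 18, 92, 6, 4, 52, 12)
)

# Hypothetical helper: the k-th biggest park in each county.
park_by_rank <- function(data, k) {
  # rank(-x) gives rank 1 to the largest value within each county.
  size_rank <- ave(data$hectares, data$county,
                   FUN = function(x) rank(-x, ties.method = "first"))
  out <- data[size_rank == k, c("county", "park")]
  rownames(out) <- NULL
  out
}

park_by_rank(df, 2)
```

Selecting columns by name rather than by position keeps the function working if columns are added or reordered later.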

How do I reorder a factor

I want to reorder a factor based on one of its rows. For example I want to reorder the "country" factor based on the value corresponding to the 2014 entries below. UK would be ranked first and USA second.
dat <- data.frame(
country=c("USA","USA","UK","UK"),
year=c(2014,2013,2014,2013),
value=c(2,NA,1,NA)
)
country year value
1 USA 2014 2
2 USA 2013 NA
3 UK 2014 1
4 UK 2013 NA
I don't quite understand how factors are reordered. It seems the reorder command replaces an entire column in a data.frame, but I would think that I should only need to specify a new order for the factor labels. "level" seems to do the opposite, giving labels to the ordering.
Maybe this:
factor(dat$country, levels=with(dat[dat$year==2014,], country[order(value)] ))
#[1] USA USA UK UK
#Levels: UK USA
Alternatively, set the level order explicitly:
country <- factor(c("USA","USA","UK","UK"), levels = c("UK","USA"))
sort(country)
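On the question about reorder(): it returns a new factor whose values are unchanged and whose levels are sorted by a summary of a second variable, which is why its result is assigned back to the whole column. A minimal sketch, assuming the mean of the non-missing values per country is an acceptable score (in this data only the 2014 rows have values, and the resulting order happens to coincide with alphabetical order):

```r
dat <- data.frame(
  country = c("USA", "USA", "UK", "UK"),
  year    = c(2014, 2013, 2014, 2013),
  value   = c(2, NA, 1, NA)
)

# Levels are sorted by the per-country mean of value; na.rm = TRUE is
# passed through to mean() so the 2013 NAs are ignored.
dat$country <- reorder(dat$country, dat$value, FUN = mean, na.rm = TRUE)

levels(dat$country)        # "UK" "USA"
as.character(dat$country)  # "USA" "USA" "UK" "UK"
```

Only the level order changes; the rows of the data frame stay where they are.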

R - Bootstrap by several column criteria

So what I have is data of cod weights at different ages. This data is taken at several locations over time.
What I would like to create is "weight at age", basically a mean value of weights at a certain age. I want to do this for each location in each year.
However, the ages are not sampled the same way (all old fish caught are measured, while younger fish are subsampled), so I can't just compute a normal average. I would like to bootstrap the samples instead.
The bootstrap should draw 5 random weight values at a given age, compute their mean, repeat this 1000 times, and then average the means. Values may be drawn more than once (sampling with replacement). This should be done for each age in every AreaCode for every year. Grouping factors: Year, location, Age.
So here's an example of what my data could look like.
df <- data.frame( Year= rep(c(2000:2008),2), AreaCode = c("39G4", "38G5","40G5"), Age = c(0:8), IndWgt = c(rnorm(18, mean=5, sd=3)))
> df
Year AreaCode Age IndWgt
1 2000 39G4 0 7.317489899
2 2001 38G5 1 7.846606144
3 2002 40G5 2 0.009212455
4 2003 39G4 3 6.498688035
5 2004 38G5 4 3.121134937
6 2005 40G5 5 11.283096043
7 2006 39G4 6 0.258404136
8 2007 38G5 7 6.689780137
9 2008 40G5 8 10.180511929
10 2000 39G4 0 5.972879108
11 2001 38G5 1 1.872273650
12 2002 40G5 2 5.552962065
13 2003 39G4 3 4.897882549
14 2004 38G5 4 5.649438631
15 2005 40G5 5 4.525012587
16 2006 39G4 6 2.985615831
17 2007 38G5 7 8.042884181
18 2008 40G5 8 5.847629941
AreaCode contains the different locations, in reality I have 85 different levels. The time series stretches 1991-2013, the ages 0-15. IndWgt contain the weight. My whole data frame has a row length of 185726.
Also, not every age exists for every location and every year. I don't know if this would be a problem; I mention it just so the script isn't based on references to certain row numbers. There are some NA values in the weight column, but I could just remove them beforehand.
I was thinking that I maybe should use replicate, and apply or another plyr function. I've tried to understand the boot function but I don't really know if I would write my arguments under statistics, and in that case how. So yeah, basically I have no idea.
I would be thankful for any help I can get!
How about this with plyr. I think from the question you wanted to bootstrap only the "young" fish weights and use actual means for the older ones. If not, just replace the ifelse() statement with its last argument.
library(plyr)
#cod <- read.csv("cod.csv", header = TRUE)  # I loaded your data from csv
bootstrap <- function(Age, IndWgt) {
  # Age is constant within each Year/AreaCode/Age group, so test its first value
  if (Age[1] > 2) {
    mean(IndWgt)  # old fish: plain mean
  } else {
    # young fish: mean of 1000 bootstrap means of 5 draws each
    mean(replicate(1000, mean(sample(IndWgt, 5, replace = TRUE))))
  }
}
ddply(cod,.(Year,AreaCode,Age),summarize,boot_mean=bootstrap(Age,IndWgt))
Year AreaCode Age boot_mean
1 2000 39G4 0 6.650294
2 2001 38G5 1 4.863024
3 2002 40G5 2 2.724541
4 2003 39G4 3 5.698285
5 2004 38G5 4 4.385287
6 2005 40G5 5 7.904054
7 2006 39G4 6 1.622010
8 2007 38G5 7 7.366332
9 2008 40G5 8 8.014071
PS: If you want to sample all ages in the same way, no need for the function, just:
ddply(cod,.(Year,AreaCode,Age),
summarize,
boot_mean=mean(replicate(1000,mean(sample(IndWgt,5,replace = TRUE)))))
Since you don't provide enough code, it's too hard (lazy) for me to test it properly. You should get your first step using the following code. If you wrap this into replicate, you should get your end result that you can average.
part.result <- aggregate(IndWgt ~ Year + AreaCode + Age, data = df, FUN = function(x) {
  get.em <- sample(x, size = 5, replace = TRUE)
  mean(get.em)
})
To handle any missing combination of year/age/location, you could probably add an if statement checking for NULL/NA and producing a warning and/or skipping the iteration.
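Putting the pieces together, a self-contained base R sketch might look like the following. boot_mean() is a hypothetical helper and the seed is arbitrary. It drops NA weights, returns NA for an empty group, and samples by index, because sample(x, ...) draws from 1:x when x is a single number of at least 1, which would silently give wrong results for groups containing only one fish.

```r
set.seed(1)  # arbitrary seed so the sketch is reproducible
df <- data.frame(
  Year     = rep(2000:2008, 2),
  AreaCode = c("39G4", "38G5", "40G5"),
  Age      = 0:8,
  IndWgt   = rnorm(18, mean = 5, sd = 3)
)

# Hypothetical helper: mean of n_rep bootstrap means of n_draw weights each.
boot_mean <- function(w, n_draw = 5, n_rep = 1000) {
  w <- w[!is.na(w)]
  if (length(w) == 0) return(NA_real_)
  # Sample positions rather than values to stay safe for length-1 groups.
  draws <- replicate(
    n_rep,
    mean(w[sample.int(length(w), n_draw, replace = TRUE)])
  )
  mean(draws)
}

# One row per Year/AreaCode/Age combination that actually occurs in the data.
result <- aggregate(IndWgt ~ Year + AreaCode + Age, data = df, FUN = boot_mean)
```

Because aggregate() only produces rows for combinations present in the data, missing year/age/location combinations need no special handling here.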
