R table conversion

Hello, I am working with a table that looks like this:
2000 0.051568
2000 0.04805
2002 0.029792
2002 0.056141
2008 0.047285
2008 0.038989
And I need to convert it to something like this:
2000 2002 2008
0.051568 0.029792 0.047285
0.04805 0.056141 0.038989
I would be grateful if somebody could give me a solution.

Here's a relatively simple solution. Note that it relies on each year having the same number of values, so that split() returns pieces of equal length:
# CREATE ORIGINAL DATA.FRAME
df <- read.table(text="2000 0.051568
2000 0.04805
2002 0.029792
2002 0.056141
2008 0.047285
2008 0.038989", header=FALSE)
names(df) <- c("year", "value")
# MODIFY ITS LAYOUT
df2 <- as.data.frame(split(df$value, df$year))
df2
# X2000 X2002 X2008
# 1 0.051568 0.029792 0.047285
# 2 0.048050 0.056141 0.038989
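Base R's unstack() wraps the same split-and-convert into one call; a quick sketch on the df built above (the X prefixes come from data.frame's name checking, just as in the output shown):
unstack(df, value ~ year)
#      X2000    X2002    X2008
# 1 0.051568 0.029792 0.047285
# 2 0.048050 0.056141 0.038989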

I'm guessing you are new to R, so I'm going to guess what you mean and give you some more correct terminology. If I guess wrong, then at least this may help you to clarify the question.
In R, a table is a special case of a matrix that arises from cross-tabulation. What I think you have (or want) to start with is a data.frame. A data.frame is a set of columns with potentially different types, but all the same length; it is "rectangular" in that sense. Generally, elements in the same positions in the columns (that is, each row) of a data.frame are related to each other. The columns of a data.frame have names, as can the rows.
long <- data.frame(year=c(2000,2000,2002,2002,2008,2008),
                   val=c(0.051568, 0.04805, 0.029792,
                         0.056141, 0.047285, 0.038989))
which, when printed, looks like this:
> long
year val
1 2000 0.051568
2 2000 0.048050
3 2002 0.029792
4 2002 0.056141
5 2008 0.047285
6 2008 0.038989
By itself, this isn't enough, because for your desired output, you need to specify which value for, say, 2000 is in the first row and which is in the second (etc., if there were more). In your example, it is just the order they are in.
long$targetrow <- 1:2  # recycled to 1,2 within each year
Now long looks like this:
> long
year val targetrow
1 2000 0.051568 1
2 2000 0.048050 2
3 2002 0.029792 1
4 2002 0.056141 2
5 2008 0.047285 1
6 2008 0.038989 2
Now you can use reshape on it.
reshape(long, idvar="targetrow", timevar="year", direction="wide")
which gives
> reshape(long, idvar="targetrow", timevar="year", direction="wide")
targetrow val.2000 val.2002 val.2008
1 1 0.051568 0.029792 0.047285
2 2 0.048050 0.056141 0.038989
More complicated transformations are possible using the reshape2 package, but this should get you started.
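For example, reshape2's dcast produces the same wide layout from long in one call (a minimal sketch; the year values become the column names):
library(reshape2)
dcast(long, targetrow ~ year, value.var = "val")
#   targetrow     2000     2002     2008
# 1         1 0.051568 0.029792 0.047285
# 2         2 0.048050 0.056141 0.038989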

Probably I am understanding this wrong, but is ?reshape what you are looking for?
From the examples:
summary(Indometh)
wide <- reshape(Indometh, v.names="conc", idvar="Subject", timevar="time", direction="wide")
wide

Related

SQL `lead()` equivalent in R

I want to make something like LEAD(mes) OVER(PARTITION BY CODIGO_CLIENTE ORDER BY mes) mes_2 in R, but I don't know of a similar function.
I have no clue how to work it out.
Since you shared no data or desired output, here is an example with lead() from the dplyr package, taken from the help page of lead(). It should give you a good idea of what you can do with this function.
library(dplyr)  # provides lead(), mutate(), and arrange()
df <- data.frame(year = 2000:2005, value = (0:5) ^ 2)
scrambled <- df[sample(nrow(df)), ]
scrambled
year value
1 2000 0
5 2004 16
3 2002 4
4 2003 9
2 2001 1
6 2005 25
right <- mutate(scrambled, `next` = lead(value, order_by = year))
arrange(right, year)
year value next
1 2000 0 1
2 2001 1 4
3 2002 4 9
4 2003 9 16
5 2004 16 25
6 2005 25 NA
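To also mimic the PARTITION BY part of your SQL, group before calling lead(). A minimal sketch, assuming your data sits in a data frame (here called your_table, a placeholder name) with the columns CODIGO_CLIENTE and mes from your snippet:
result <- your_table %>%                          # your_table is a placeholder
  group_by(CODIGO_CLIENTE) %>%                    # PARTITION BY CODIGO_CLIENTE
  mutate(mes_2 = lead(mes, order_by = mes)) %>%   # LEAD(mes) ... ORDER BY mes
  ungroup()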
Since you're new to R, I suggest you read up a bit on the dplyr package. Also, to make it easier for the people trying to help you, please provide more details next time!

efficiently creating a panel data.frame from cross sections with unharmonized column names

I need to create a panel data set (long format) from multiple yearly (cross-sectional) data sets. The variables of interest have different names in the individual data sets, and I need to harmonize them.
I loaded the data frames into a list and now want to manipulate the names using lapply or a chunk of code that allows binding the data frames. I can see several ways of doing this, but would like one that works with little code on a large list of data.frames, so that I can do this for several variables and easily change specifics later on.
So what I am looking for is either a way to rename the columns, so that I am able to simply use bind_rows() from dplyr or an equivalent method, or a way to rename and bind the data sets in one step. Since I need to do this for several variables, it might be safer to keep the two steps apart.
To illustrate, here is an example:
a <- data.frame(id=c("Marc", "Julia", "Rico"), year=2000:2002, laborincome=1:3)
b <- data.frame(id=c("Marc", "Julia", "Rico"), earningsfromlabor=2:4, year=2003:2005)
dflist <- list(a, b)
equivalent_vars <- c("laborincome", "earningsfromlabor")
newname <- "income"
Desired result:
data.frame(id=c("Marc", "Julia", "Rico"), income=c(1,2,3,2,3,4), year=2000:2005)
id income year
1 Marc 1 2000
2 Julia 2 2001
3 Rico 3 2002
4 Marc 2 2003
5 Julia 3 2004
6 Rico 4 2005
We could use setnames from data.table:
library(data.table)
# note: setnames() renames by reference, so the data frames inside dflist are modified too
do.call(rbind, Map(setnames, dflist, old = equivalent_vars, new = newname))
# id year income
#1 Marc 2000 1
#2 Julia 2001 2
#3 Rico 2002 3
#4 Marc 2003 2
#5 Julia 2004 3
#6 Rico 2005 4
Or we can use rename with := from dplyr and purrr:
library(dplyr)
library(purrr)
map2_df(dflist, equivalent_vars, ~ .x %>%
          rename(!! newname := !! .y)) %>%
  select(id, income, year)
# id income year
#1 Marc 1 2000
#2 Julia 2 2001
#3 Rico 3 2002
#4 Marc 2 2003
#5 Julia 3 2004
#6 Rico 4 2005
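If you would rather keep renaming and binding as two separate steps, as you mentioned, here is a minimal sketch with dplyr's bind_rows() (using dflist, equivalent_vars, and newname from the question):
library(dplyr)
renamed <- lapply(dflist, function(d) {
  names(d)[names(d) %in% equivalent_vars] <- newname  # harmonize the income column
  d
})
bind_rows(renamed) %>% select(id, income, year)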

Adding data points in a column by factors in R

The data.frame my_data consists of two columns ("PM2.5" and "year") and around 6,400,000 rows. The data.frame has various data points for pollutant levels of PM2.5 for the years 1999, 2002, 2005 and 2008.
This is what I have done to the data.frame:
my_data <- arrange(my_data, year)
my_data$year <- as.factor(my_data$year)
my_data$PM2.5 <- as.numeric(my_data$PM2.5)
I want to find the sum of all PM2.5 levels (i.e. the sum of all data points under PM2.5) for each year. How can I do it?
[Image: the first 20 rows of the data.frame.]
Since the column "year" is sorted, it is showing only 1999.
Say this is your data:
library(plyr) # <- don't forget to tell us what libraries you are using
# give us an easy sample set
my_data <- data.frame(year = sample(c("1999", "2002", "2005", "2008"), 10, replace = TRUE),
                      PM2.5 = rnorm(10, mean = 5))
my_data <- arrange(my_data,year)
my_data$year <- as.factor(my_data$year)
my_data$PM2.5 <- as.numeric(my_data$PM2.5)
> my_data
year PM2.5
1 1999 5.556852
2 2002 5.508820
3 2002 4.836500
4 2002 3.766266
5 2005 6.688936
6 2005 5.025600
7 2005 4.041670
8 2005 4.614784
9 2005 4.352046
10 2008 6.378134
One way to do it (out of many, many ways already shown by a simple Google search):
> with(my_data, (aggregate(PM2.5, by=list(year), FUN="sum")))
Group.1 x
1 1999 5.556852
2 2002 14.111586
3 2005 24.723037
4 2008 6.378134
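If you prefer dplyr, the same per-year sum looks like this (a sketch on the sample my_data above):
library(dplyr)
my_data %>%
  group_by(year) %>%
  summarise(total = sum(PM2.5))  # per-year sum of PM2.5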

R - Bootstrap by several column criteria

So what I have is data of cod weights at different ages. This data is taken at several locations over time.
What I would like to create is "weight at age", basically a mean value of weights at a certain age. I want to do this for each location in each year.
However, the ages are not sampled the same way (all old fish caught are measured, while younger fish are subsampled), so I can't just take a plain average; I would like to bootstrap the samples.
The bootstrap should draw 5 random weight values at an age, compute their mean, repeat this 1000 times, and then average those means. The sampling should be with replacement. This should be done for each age at every AreaCode in every year. Dependent factors: Year, location, Age.
So here's an example of what my data could look like.
df <- data.frame(Year = rep(2000:2008, 2), AreaCode = c("39G4", "38G5", "40G5"),
                 Age = 0:8, IndWgt = rnorm(18, mean = 5, sd = 3))
> df
Year AreaCode Age IndWgt
1 2000 39G4 0 7.317489899
2 2001 38G5 1 7.846606144
3 2002 40G5 2 0.009212455
4 2003 39G4 3 6.498688035
5 2004 38G5 4 3.121134937
6 2005 40G5 5 11.283096043
7 2006 39G4 6 0.258404136
8 2007 38G5 7 6.689780137
9 2008 40G5 8 10.180511929
10 2000 39G4 0 5.972879108
11 2001 38G5 1 1.872273650
12 2002 40G5 2 5.552962065
13 2003 39G4 3 4.897882549
14 2004 38G5 4 5.649438631
15 2005 40G5 5 4.525012587
16 2006 39G4 6 2.985615831
17 2007 38G5 7 8.042884181
18 2008 40G5 8 5.847629941
AreaCode contains the different locations; in reality I have 85 different levels. The time series stretches from 1991 to 2013, and the ages from 0 to 15. IndWgt contains the weight. My whole data frame has 185,726 rows.
Also, not every age exists for every location and every year. I don't know if this would be a problem; the script just shouldn't rely on references to specific row numbers. There are some NA values in the weight column, but I could just remove them beforehand.
I was thinking that maybe I should use replicate, together with apply or another plyr function. I've tried to understand the boot function, but I don't really know what I would pass as its statistic argument, or how. So yeah, basically I have no idea.
I would be thankful for any help I can get!
How about this with plyr. I think from the question you wanted to bootstrap only the "young" fish weights and use actual means for the older ones. If not, just replace the whole if/else with the bootstrap branch.
require(plyr)
#cod<-read.csv("cod.csv",header=T) #I loaded your data from csv
bootstrap <- function(Age, IndWgt){
  if (Age[1] > 2) {  # treat old/young fish differently; Age is constant within each group
    res <- mean(IndWgt)  # old fish: plain mean
  } else {
    res <- mean(replicate(1000, sample(IndWgt, 5, replace = TRUE)))  # young fish: bootstrap
  }
  return(res)
}
ddply(cod, .(Year, AreaCode, Age), summarize, boot_mean = bootstrap(Age, IndWgt))
Year AreaCode Age boot_mean
1 2000 39G4 0 6.650294
2 2001 38G5 1 4.863024
3 2002 40G5 2 2.724541
4 2003 39G4 3 5.698285
5 2004 38G5 4 4.385287
6 2005 40G5 5 7.904054
7 2006 39G4 6 1.622010
8 2007 38G5 7 7.366332
9 2008 40G5 8 8.014071
PS: If you want to sample all ages in the same way, no need for the function, just:
ddply(cod, .(Year, AreaCode, Age),
      summarize,
      boot_mean = mean(replicate(1000, mean(sample(IndWgt, 5, replace = TRUE)))))
Since you don't provide enough code, it's too hard (lazy) for me to test it properly. You should get your first step using the following code. If you wrap this into replicate, you should get your end result that you can average.
part.result <- aggregate(IndWgt ~ Year + AreaCode + Age, data = df, FUN = function(x) {
  # one bootstrap draw: 5 weights sampled with replacement, then averaged
  mean(sample(x, size = 5, replace = TRUE))
})
To handle any missing combination of year/age/location, you could probably add an if statement checking for NULL/NA and producing a warning and/or skipping the iteration.
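As a minimal sketch of that wrapping step (reusing the example df from the question; draws and boot.result are made-up names):
draws <- replicate(1000, aggregate(IndWgt ~ Year + AreaCode + Age, data = df,
                                   FUN = function(x) mean(sample(x, size = 5, replace = TRUE)))$IndWgt)
boot.result <- aggregate(IndWgt ~ Year + AreaCode + Age, data = df, FUN = mean)
boot.result$IndWgt <- rowMeans(draws)  # average of the 1000 bootstrap means per group
names(boot.result)[names(boot.result) == "IndWgt"] <- "boot_mean"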

calculate differences in dataframe

I have a dataframe that looks like this:
set.seed(50)
data.frame(distance=c(rep("long", 5), rep("short", 5)),
           year=rep(2002:2006, 2),
           mean.length=rnorm(10))
distance year mean.length
1 long 2002 0.54966989
2 long 2003 -0.84160374
3 long 2004 0.03299794
4 long 2005 0.52414971
5 long 2006 -1.72760411
6 short 2002 -0.27786453
7 short 2003 0.36082844
8 short 2004 -0.59091244
9 short 2005 0.97559055
10 short 2006 -1.44574995
I need to calculate the difference in mean.length between long and short in each year. What's the fastest way of doing this?
Here's one way using plyr:
set.seed(50)
df <- data.frame(distance=c(rep("long", 5), rep("short", 5)),
                 year=rep(2002:2006, 2),
                 mean.length=rnorm(10))
library(plyr)
aggregation.fn <- function(df) {
  data.frame(year=df$year[1],
             diff=(df$mean.length[df$distance == "long"] -
                   df$mean.length[df$distance == "short"]))
}
new.df <- ddply(df, "year", aggregation.fn)
Gives you
> new.df
year diff
1 2002 0.8275344
2 2003 -1.2024322
3 2004 0.6239104
4 2005 -0.4514408
5 2006 -0.2818542
A second way
df <- df[order(df$year, df$distance), ]
n <- dim(df)[1]
df$new.year <- c(1, df$year[2:n] != df$year[1:(n-1)])
df$diff <- c(-diff(df$mean.length), NA)
df$diff[!df$new.year] <- NA
new.df.2 <- df[!is.na(df$diff), c("year", "diff")]
all(new.df.2 == new.df) # True
Use tapply() and apply() like this (using the df defined above):
apply(
  with(df, tapply(mean.length, list(year, distance), FUN=mean)),
  1,
  diff
)
2002 2003 2004 2005 2006
-0.8275344 1.2024322 -0.6239104 0.4514408 0.2818542
This works because tapply creates a tabular summary by year and distance. Note that diff() computes short minus long here, which is why the signs are flipped relative to the answers above:
with(df, tapply(mean.length, list(year, distance), FUN=mean))
long short
2002 0.54966989 -0.2778645
2003 -0.84160374 0.3608284
2004 0.03299794 -0.5909124
2005 0.52414971 0.9755906
2006 -1.72760411 -1.4457499
Since you seem to have paired values and the data.frame is ordered, you can do this:
res <- with(df, mean.length[distance=="long"] - mean.length[distance=="short"])
names(res) <- unique(df$year)
# 2002 2003 2004 2005 2006
#0.8275344 -1.2024322 0.6239104 -0.4514408 -0.2818542
This should be quite fast, but it is not as safe as the other answers, since it relies on those assumptions.
You've received some good answers for computing the specific question at hand. It may make sense for you to consider reshaping your data into a wide format. Here are two options:
reshape(df, direction = "wide", idvar = "year", timevar = "distance")
#---
year mean.length.long mean.length.short
1 2002 0.54966989 -0.2778645
2 2003 -0.84160374 0.3608284
3 2004 0.03299794 -0.5909124
4 2005 0.52414971 0.9755906
5 2006 -1.72760411 -1.4457499
#package reshape2 is probably easier to use.
library(reshape2)
dcast(df, year ~ distance, value.var = "mean.length")
#---
year long short
1 2002 0.54966989 -0.2778645
2 2003 -0.84160374 0.3608284
3 2004 0.03299794 -0.5909124
4 2005 0.52414971 0.9755906
5 2006 -1.72760411 -1.4457499
You can easily compute your new statistics now.
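For instance, once the data is wide, the difference you asked about reduces to a single vector subtraction (column names as printed by the reshape() call above):
wide <- reshape(df, direction = "wide", idvar = "year", timevar = "distance")
wide$diff <- wide$mean.length.long - wide$mean.length.short
wide[, c("year", "diff")]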
