Splitting a data frame to create new columns

I have a data frame with columns for "Count","Transect Number","Data", and "Year". My goal is to split up the data frame by Transect, then again by Year, and create a new data frame with a column for "Transect", and then the appropriate data per Year in the following columns.
To build a dummy data frame:
Count1<-1:27
Count2<-1:30
Count3<-1:25
T1<-c(1,1,1,2,2,2,3,3,3,1,1,1,2,2,2,3,3,3,1,1,1,2,2,2,3,3,3)
T2<-c(1,1,1,2,2,2,3,3,3,1,1,1,2,2,2,3,3,3,1,1,1,1,2,2,2,2,3,3,3,3)
T3<-c(1,1,1,2,2,2,3,3,3,1,1,1,2,2,2,3,3,3,1,1,1,2,2,2,3)
Data1<-c(1,2,3,2,1,2,3,4,3,2,1,2,3,4,3,2,1,2,3,4,5,4,3,2,3,3,2)
Data2<-c(1,2,3,2,1,4,3,2,1,2,4,3,2,3,4,3,2,3,4,5,6,4,3,2,1,4,5,4,3,2)
Data3<-c(1,2,3,4,5,4,3,3,3,4,5,4,3,3,2,3,4,5,4,3,4,3,2,3,4)
Year1<-c(2014,2014,2014,2014,2014,2014,2014,2014,2014,2015,2015,2015,2015,2015,2015,2015,2015,2015,2016,2016,2016,2016,2016,2016,2016,2016,2016)
Year2<-c(2014,2014,2014,2014,2014,2014,2014,2014,2014,2015,2015,2015,2015,2015,2015,2015,2015,2015,2016,2016,2016,2016,2016,2016,2016,2016,2016,2016,2016,2016)
Year3<-c(2014,2014,2014,2014,2014,2014,2014,2014,2014,2015,2015,2015,2015,2015,2015,2015,2015,2015,2016,2016,2016,2016,2016,2016,2016)
DF1<-data.frame(Count1,T1,Data1,Year1)
colnames(DF1)<-c("Count","Transect","Data","Year")
DF2<-data.frame(Count2,T2,Data2,Year2)
colnames(DF2)<-c("Count","Transect","Data","Year")
DF3<-data.frame(Count3,T3,Data3,Year3)
colnames(DF3)<-c("Count","Transect","Data","Year")
All<-rbind(DF1,DF2,DF3)
Once I have the data frame, my thought was to split up the data by transect since this will be a permanent aspect of my ongoing data set.
#Step 1-Break down by T
Trans1<-All[All$Transect==1,]
Trans2<-All[All$Transect==2,]
Trans3<-All[All$Transect==3,]
Trans4<-All[All$Transect==4,]
Trans5<-All[All$Transect==5,]
But I'm a little less clear on the next step. I need to pull out data from the "Data" column organized by year. Something along the lines of further breaking down the data like so:
Trans1_Year1<-Trans1[Trans1$Year==2014,]
Trans2_Year1<-Trans2[Trans2$Year==2014,]
Trans3_Year1<-Trans3[Trans3$Year==2014,]
Trans4_Year1<-Trans4[Trans4$Year==2014,]
Trans5_Year1<-Trans5[Trans5$Year==2014,]
or even using split
ByYear1<-split(Trans1,Trans1$Year)
But I would prefer to avoid writing out the code as above as I hope to add new data every year as this data set progresses. And I'd like the code to be able to accommodate new "Year" data as it is added, as opposed to writing out new lines of code every year.
Once I have the data set up like so, I'd like to create a second data frame with columns for each year. One problem is that each year contains a different number of rows, which has been an issue for me. But my final result would have columns:
"Transect", "Data 2014", "Data 2015", "Data 2016"
Since each year can have a different number of rows within a transect, I'd like to leave NAs at the end of each Transect section when the number of rows per individual transect differs between years.

It sounds like you are basically trying to convert your data into a semi-wide format, with columns for years, rather than keeping it in the "long" format.
If this is the case, you're better off adding a secondary index column that shows the repeated combination of "Transect" and "Year".
This can easily be done with getanID from my "splitstackshape" package. "splitstackshape" also loads "data.table", from which you could then use dcast.data.table to get a wide format.
library(splitstackshape)
dcast.data.table(getanID(All, c("Transect", "Year")),
Transect + .id ~ Year, value.var = "Data")
# Transect .id 2014 2015 2016
# 1: 1 1 1 2 3
# 2: 1 2 2 1 4
# 3: 1 3 3 2 5
# 4: 1 4 1 2 4
# 5: 1 5 2 4 5
# 6: 1 6 3 3 6
# 7: 1 7 1 4 4
# 8: 1 8 2 5 4
# 9: 1 9 3 4 3
# 10: 1 10 NA NA 4
# 11: 2 1 2 3 4
# 12: 2 2 1 4 3
# 13: 2 3 2 3 2
# 14: 2 4 2 2 3
# 15: 2 5 1 3 2
# 16: 2 6 4 4 1
# 17: 2 7 4 3 4
# 18: 2 8 5 3 3
# 19: 2 9 4 2 2
# 20: 2 10 NA NA 3
# 21: 3 1 3 2 3
# 22: 3 2 4 1 3
# 23: 3 3 3 2 2
# 24: 3 4 3 3 5
# 25: 3 5 2 2 4
# 26: 3 6 1 3 3
# 27: 3 7 3 3 2
# 28: 3 8 3 4 4
# 29: 3 9 3 5 NA
# Transect .id 2014 2015 2016
Then, if you really want to split on the "Transect" column you can go ahead and use split, but since you now have a "data.table" it would be better to stick with that and take advantage of its many convenient features, including those related to subsetting and aggregation.
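For comparison, the same reshaping can be sketched with "data.table" alone, building the secondary index by hand instead of with getanID. The table `long`, its values, and the column name `obs` are made up for illustration; they stand in for `All` and `.id` above.

```r
library(data.table)

# Small long-format stand-in for `All` (hypothetical values).
long <- data.table(
  Transect = c(1, 1, 1, 1, 1, 2, 2, 2),
  Year     = c(2014, 2014, 2015, 2015, 2015, 2014, 2015, 2015),
  Data     = c(1, 2, 3, 4, 5, 6, 7, 8)
)

# Running row number within each Transect/Year combination.
# This plays the role of the ".id" column created by getanID().
long[, obs := seq_len(.N), by = .(Transect, Year)]

# One column per year; combinations with fewer rows are padded with NA.
wide <- dcast(long, Transect + obs ~ Year, value.var = "Data")
print(wide)
```

Because the year columns come from the values in "Year", rows for a new year should show up as a new column without any change to the code.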

I think you are forcing your data into a format it does not have naturally. There are a lot of processing advantages to leaving it in "long" format. Have a look at this article if you have not seen it yet, it is a classic.
http://www.jstatsoft.org/v21/i12

Related

R Studio: Time Syncing Data Sets

I have a simple problem, and a bit more complicated twist at the end.
I have 2 datasets A & B (Separate when imported into R):
Dataset A is pulled from a DAQ that is sampling at 2000 times a second, while dataset B is pulled from a scope at 500 times a second. I have a test that records data from the DAQ and Scope for 5 seconds.
In R Studio I want to time synchronize this data and, for the sake of learning, how can I do it in both of the following ways?
1) Without duplicating values so filtering doesn't stair step:
A B
1 1 1
2 2 NA
3 3 NA
4 4 NA
5 5 2
6 6 NA
7 7 NA
8 8 NA
9 9 3
10 10 NA
11 11 NA
12 12 NA
2) With duplicating numbers if I don't want NA's in the functions I apply to the frame:
A B
1 1 1
2 2 1
3 3 1
4 4 1
5 5 2
6 6 2
7 7 2
8 8 2
9 9 3
10 10 3
11 11 3
12 12 3
Now here is the twist that makes this an unusual problem. Let's say Dataset A records a bit before & after the 5 second test. Dataset A also has an extra column for "Trigger" which is either a 0 or a 1. A 1 is a high that represents recording and is basically where Dataset B starts. When it switches back to 0, Dataset B has finished recording.
Is there a way I can do the above time sync relative to the trigger in Dataset A? The reason I want to keep the data before & after the "true" recording section is to make sure a filter or a filtfilt sweep will level out before the data truly starts.
Thanks for any help!
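A minimal base R sketch of both layouts, including the trigger offset. It assumes that B's first sample coincides exactly with the first row where the trigger is high and that the two clocks do not drift; the vectors `a`, `b`, and `trigger` are toy stand-ins, not real recordings.

```r
# Sample rates from the question: A at 2000 Hz, B at 500 Hz.
ratio <- 2000 / 500   # 4 rows of A per sample of B

# Toy stand-ins for the two recordings (values are made up).
a       <- 1:12
trigger <- c(0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0)
b       <- c(10, 20)

# Rows of A at which each B sample is assumed to have been taken.
start <- which(trigger == 1)[1]
idx   <- start + (seq_along(b) - 1) * ratio

# 1) No duplication: B appears only on its own sample rows, NA elsewhere.
sparse <- data.frame(A = a, Trigger = trigger, B = NA_real_)
sparse$B[idx] <- b

# 2) Duplication: each B sample is held for `ratio` rows while the trigger is high.
filled <- sparse
filled$B[trigger == 1] <- rep(b, each = ratio)

print(sparse)
print(filled)
```

Rows before and after the trigger window keep their A values and stay NA in B, so the lead-in and lead-out data remain available for filtering A.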

Arrange a data set in a repeating manner from a reshaped data

I have reshaped the data to long. It has been sorted in ascending order based on one column (as x2 in the below reproducible example) and I want to keep the data in a repeating manner rather than factored. Here is a sample:
set.seed(234)
data<-data.frame(x1=c(1:12),x2=rep(1:3,each=4),x3=runif(12,min=0,max=12))
And I want the format something like this:
x1 x2 x3
1 1 1 6.115445
2 2 2 5.157014
3 3 3 4.793458
4 4 1 9.998710
5 5 2 2.620250
6 6 3 1.825839
7 7 1 5.842854
8 8 2 5.616670
9 9 3 6.511315
10 10 1 9.164444
11 11 2 8.401418
Can you please help me with either what to include in the melt function while converting the data to long format or any other function I should use in rearranging that data.
note:
The above result is to show the desired format, not the exact solution for my data.
EDIT:
Here is head() of my real data:
Date stn Elev Amount
1 2010-01-01 11 0 268.945
2 2010-01-01 11 0 268.396
3 2010-01-01 11 0 267.512
4 2010-01-01 11 0 266.488
5 2010-01-01 11 0 265.558
6 2010-01-01 11 0 265.178
In the actual data, the column Elev contains values like c("0","100","250","500"...). So you can assume that 0 is equivalent to 1 in x2 of the above sample, and so forth for 100, 250....

One method is to use ave as follows:
data[order(ave(data$x3, data$x2, FUN = seq_along), data$x2), ]
x1 x2 x3
1 1 1 8.9474400
5 5 2 0.8029211
9 9 3 11.1328381
2 2 1 9.3805491
6 6 2 7.7375415
10 10 3 3.4107614
3 3 1 0.2404454
7 7 2 11.1526315
11 11 3 6.6686992
4 4 1 9.3130246
8 8 2 8.6117063
12 12 3 6.5724198
In this instance, ave calculates a running count by data$x2, which is then used to sort the data with the order function.
You can also renumber x1 if desired: after assigning the sorted result back to data, data$x1 <- 1:nrow(data) would return your desired result.
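Put together on the sample data from the question, the steps might look like the sketch below. The names `within_group` and `interleaved` are introduced here for illustration only.

```r
set.seed(234)
data <- data.frame(x1 = 1:12,
                   x2 = rep(1:3, each = 4),
                   x3 = runif(12, min = 0, max = 12))

# Position of each row within its x2 group: 1, 2, 3, 4 for every group.
within_group <- ave(seq_along(data$x2), data$x2, FUN = seq_along)

# Sort by that position first, then by x2, so the groups interleave.
interleaved <- data[order(within_group, data$x2), ]

# Optional: reset the row names and renumber x1.
rownames(interleaved) <- NULL
interleaved$x1 <- seq_len(nrow(interleaved))

print(interleaved)
```

For the real data, the same pattern should apply with Elev in place of x2, since only the grouping column changes.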

Generating recursive ID by multi-variate group using data.table in R

I've found several options on how to generate IDs by groups using the data.table package in R, but none of them fit my problem exactly. Hopefully someone can help.
In my problem, I have 160 markets that fall within 21 regions in a country. These markets are numbered 1:160 and there may be multiple observations documented within each market. I would like to restructure my market ID variable so that it represents unique markets within each region, and starts counting over again with each new region.
Here's some code to represent my problem:
require(data.table)
dt <- data.table(region = c(1,1,1,1,2,2,2,2,3,3,3,3),
market = c(1,1,2,2,3,3,4,4,5,6,7,7))
> dt
region market
1: 1 1
2: 1 1
3: 1 2
4: 1 2
5: 2 3
6: 2 3
7: 2 4
8: 2 4
9: 3 5
10: 3 6
11: 3 7
12: 3 7
Currently, my data is set up to represent the result of
dt[, market_new := .GRP, by = .(region, market)]
But what I'd like to get is
region market market_new
1: 1 1 1
2: 1 1 1
3: 1 2 2
4: 1 2 2
5: 2 3 1
6: 2 3 1
7: 2 4 2
8: 2 4 2
9: 3 5 1
10: 3 6 2
11: 3 7 3
12: 3 7 3

This seems to return what you want:
dt[, market_new:=as.numeric(factor(market)), by=region]
Here we divide the data up by region, give a unique ID to each market within each region via the factor() function, and extract the underlying numeric index.

From data.table 1.9.5+, you can use frank() (or frankv()) with ties.method = "dense" as follows:
dt[, market_new := frankv(market, ties.method = "dense"), by = region]
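As a quick check, both suggestions can be run side by side on the sample data. This is only a sketch; the column names `id_factor` and `id_frank` are made up here so the two results can be compared.

```r
library(data.table)

dt <- data.table(region = c(1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3),
                 market = c(1, 1, 2, 2, 3, 3, 4, 4, 5, 6, 7, 7))

# factor() approach: the integer codes of the factor levels restart in each region.
dt[, id_factor := as.integer(factor(market)), by = region]

# Dense ranking approach: tied markets share a rank and ranks have no gaps.
dt[, id_frank := frank(market, ties.method = "dense"), by = region]

print(dt)
```

Both columns number the markets by their sorted value within a region, so they should agree whenever market is numeric.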

How to merge data correctly

I'm trying to merge 7 complete data frames into one large, wide data frame. I figured I have to do this stepwise: merge 2 frames into 1, then merge that frame with another, and so forth until all 7 original frames become one.
fil2005: "ID" "abr_2005" "lop_2005" "ins_2005"
fil2006: "ID" "abr_2006" "lop_2006" "ins_2006"
The variables "abr_2006", "lop_2006", "ins_2006" and their 2005 counterparts are all either 0 or 1.
Now the thing is, I want to either merge or do a dcast of some sort (I think) to make these two long data frames into one wide data frame where both "abr_2005" "lop_2005" "ins_2005" and "abr_2006" "lop_2006" "ins_2006" are in that final file.
When I try
$fil_2006.1 <- merge(x=fil_2005, y=fil_2006, by="ID__", all.y=T)
all the variables ending in _2005 are saved to fil_2006.1, but the variables ending in _2006 are not.
I'm apparently doing something wrong. Any idea?

Is there a reason you put those underscores after ID (by="ID__")? Without them, the code you provided should work.
An example:
# create data frames of differing lengths (the 10 IDs in dat2 are recycled to fill 20 rows)
dat1 <- data.frame("ID"=seq(1,20,by=2),"varx2005"=1:10, "vary2005"=2:11)
dat2 <- data.frame("ID"=5:14,"varx2006"=1:20, "vary2006"=21:40)
head(dat1)
ID varx2005 vary2005
1 1 1 2
2 3 2 3
3 5 3 4
4 7 4 5
5 9 5 6
6 11 6 7
head(dat2)
ID varx2006 vary2006
1 5 1 21
2 6 2 22
3 7 3 23
4 8 4 24
5 9 5 25
6 10 6 26
merged <- merge(dat1, dat2, by="ID", all=TRUE)
head(merged)
ID varx2006 vary2006 varx2005 vary2005
1 1 NA NA 1 2
2 3 NA NA 2 3
3 5 1 21 3 4
4 5 11 31 3 4
5 7 13 33 4 5
6 7 3 23 4 5
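For the stepwise merge of several yearly frames, one common idiom is to put them in a list and fold merge() over it with Reduce(). The sketch below uses three tiny made-up frames (the values and the 2007 frame are hypothetical) and assumes the key column is called "ID" in all of them.

```r
# Hypothetical yearly frames; every indicator is 0 or 1 as in the question.
fil2005 <- data.frame(ID = 1:3, abr_2005 = c(0, 1, 0),
                      lop_2005 = c(1, 1, 0), ins_2005 = c(0, 0, 1))
fil2006 <- data.frame(ID = 2:4, abr_2006 = c(1, 0, 0),
                      lop_2006 = c(0, 1, 1), ins_2006 = c(1, 0, 1))
fil2007 <- data.frame(ID = 3:5, abr_2007 = c(0, 0, 1),
                      lop_2007 = c(1, 0, 0), ins_2007 = c(0, 1, 1))

frames <- list(fil2005, fil2006, fil2007)

# Merge the running result with the next frame; all = TRUE keeps every ID
# and fills years in which an ID is missing with NA.
wide <- Reduce(function(x, y) merge(x, y, by = "ID", all = TRUE), frames)

print(wide)
```

With seven frames, only the list grows; the Reduce() call stays the same.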

Remove rows from a dataframe based on a value in one column

I have a dataframe (imported from a csv file) as follows
moose loose hoose
2 3 8
1 3 4
5 4 2
10 1 4
The R code should generate a mean column and then I would like to remove all rows where the value of the mean is <4 so that I end up with:
moose loose hoose mean
2 3 8 4.3
1 3 4 2.6
5 4 2 3.6
10 1 4 5
which should then end up as:
moose loose hoose mean
2 3 8 4.3
10 1 4 5
How can I do this in R?

dat2 <- subset(transform(dat1, Mean=round(rowMeans(dat1),1)), Mean >=4)
dat2
# moose loose hoose Mean
#1 2 3 8 4.3
#4 10 1 4 5.0
Using data.table
setDT(dat1)[, Mean:=rowMeans(.SD)][Mean>=4]
# moose loose hoose Mean
#1: 2 3 8 4.333333
#2: 10 1 4 5.000000

I will assume your data is called d. Then you run:
d$mean <- rowMeans(d) ## create a new column with the mean of each row
d <- d[d$mean >= 4, ] ## filter the data using this column in the condition
I suggest you read about creating variables in a data.frame, and filtering data. These are very common operations that you can use in many many contexts.
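A self-contained version of those two steps on the data from the question, as a minimal sketch. The name `kept` is introduced here for illustration.

```r
d <- data.frame(moose = c(2, 1, 5, 10),
                loose = c(3, 3, 4, 1),
                hoose = c(8, 4, 2, 4))

# Row means over the three measurement columns only.
d$mean <- rowMeans(d[, c("moose", "loose", "hoose")])

# Keep the rows whose mean is at least 4 (drops rows with mean < 4).
kept <- d[d$mean >= 4, ]

print(kept)
```

Naming the columns explicitly means the mean column is not averaged into itself if the line is run a second time.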

You could also use within, which allows you to assign/remove columns and then returns the transformed data. Start with df,
> df
# moose loose hoose
#1 2 3 8
#2 1 3 4
#3 5 4 2
#4 10 1 4
> within(d <- df[rowMeans(df) >= 4, ], { means <- round(rowMeans(d), 1) })
# moose loose hoose means
#1 2 3 8 4.3
#4 10 1 4 5.0
