Listing my split data in R

My data looks like this:
colnames(dati) <- c("grupa", "regions5", "regions6", "novads.rep", "pilseta.lt", "specialists", "limenis.1", "limenis.2", "cipari.3", "ratio", "gads", "KV", "DS")
and I have manually applied split() to it in order to get 24 splits (12 that split by year and 12 that do not). I did it the following way:
k1 <- split(dati$ratio, list(dati$gads, dati$grupa), drop=TRUE)
k2 <- split(dati$ratio, list(dati$gads, dati$grupa, dati$regions5), drop=TRUE)
...
k13 <- split(dati$ratio, list(dati$grupa), drop=TRUE)
k14 <- split(dati$ratio, list(dati$grupa, dati$regions5), drop=TRUE)
...etc
and what I want to do is apply these splits to my function (call it myfunction here, since function itself is a reserved word) as follows:
myfunction(k1, k13)
but instead of inserting the values manually, I would like to restructure things so that I can call my function like this:
for(i in 1:12){myfunction(k[i], k[i+12])}
I just can't seem to find the right way to do it.
dati, after I split them, look like this:
grupa regions5 regions6 novads.rep pilseta.lt specialists
1 1* Zemgales Zemgales Novads lauki Silva
2 1* Kurzemes Kurzemes Novads lauki Sniedze
3 3* Kurzemes Kurzemes REP pilsēta AnitaE
4 1* Vidzemes Vidzemes Novads pilsēta Dainis
limenis.1 limenis.2 cipari.3 ratio gads KV
1 Jelgavas nov. Svētes pag. 1 0.8682626 2011 2162
2 Ventspils nov. Vārves pag. 1 0.3923857 2011 27467
3 _Liepāja _Liepāja 4 0.4069100 2011 30107
4 Alūksnes nov. Alūksne 2 0.5641127 2011 8147
DS
1 2490.03
2 70000.00
3 73989.33
4 14442.15
...
and here is the output I'm looking for:
count mean lowermean uppermean median ...
2011.1*.Kurzemes 119 0.83322820 7.719323e-01 0.8945241 0.79888324
2012.1*.Kurzemes 171 0.82800498 7.836221e-01 0.8723879 0.84424821
2013.1*.Kurzemes 144 0.77551814 7.347631e-01 0.8162731 0.80745150
2014.1*.Kurzemes 180 0.78134649 7.396007e-01 0.8230923 0.81635065
2015.1*.Kurzemes 80 0.78146588 7.135070e-01 0.8494248 0.73659659
2011.10*.Kurzemes 16 1.09552970 6.930780e-01 1.4979814 1.02127841
2012.10*.Kurzemes 22 0.87442906 5.721409e-01 1.1767172 0.74787482
2013.10*.Kurzemes 25 0.84406131 6.947097e-01 0.9934129 0.91786319
2014.10*.Kurzemes 22 0.79385199 5.880507e-01 0.9996533 0.71708060
2015.10*.Kurzemes 12 1.19059850 8.213604e-01 1.5598365 1.25322750
2012.11*.Kurzemes 1 0.09461065 NA NA 0.09461065
2013.11*.Kurzemes 2 0.18134522 -1.823437e+00 2.1861274 0.18134522
2014.11*.Kurzemes 1 0.11097174 NA NA 0.11097174
2013.12*.Kurzemes 1 0.44620780 NA NA 0.44620780
...

You could use a list:
k <- list()
k[[1]] <- split(dati$ratio, list(dati$gads, dati$grupa), drop=TRUE)
k[[2]] <- split(dati$ratio, list(dati$gads, dati$grupa, dati$regions5), drop=TRUE)
# etc
Then the following works:
for(i in 1:12){
  myfunction(k[[i]], k[[i+12]])
}
Note that k3 is the name of a variable, which could just as well be x, myvar32, or anything else. When you type k[3], you state that you want to access the third cell of the vector k; k and k3 are totally distinct variables. If you want to be able to access your variables using k[i], you must first create the vector k and store what you need in k[i].
The double-bracket notation k[[i]] is used to access elements of lists, which are handy containers that can store anything -- exactly what you need in your case.
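Going one step further, a sketch of how the 24 splits could be built programmatically rather than one by one. The combos below are hypothetical (only the first two grouping sets are shown in the question) and myfunction again stands in for the unnamed function:
# Each element names the grouping columns for one split;
# fill in the remaining ten combinations the same way
combos <- list(
  c("grupa"),
  c("grupa", "regions5")
  # ... the other 10 combinations ...
)
# The first half of k also splits by year (gads), the second half does not
k <- c(
  lapply(combos, function(v) split(dati$ratio, c(list(dati$gads), dati[v]), drop = TRUE)),
  lapply(combos, function(v) split(dati$ratio, dati[v], drop = TRUE))
)
for (i in seq_along(combos)) myfunction(k[[i]], k[[i + length(combos)]])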

Related

test data time series - how can I merge 2 data sets?

I have 2 datasets: one with test results (nedl_1) and another with more test results (subset_cal1) plus a time stamp for those tests (dc_time). I'd like to merge the data by the ID shown in the File column (for example 180061). I first transposed subset_cal1, and now the column names in both datasets are roughly the same (except for the "."). Then I tried to join them. However, the join is not possible because the columns are numeric in one dataset and factor in the other (due to the transpose).
I coerced the transposed subset_cal1 into numeric, but the dc_time column got coerced into a number. I think I'm forcing something here, and I'd rather learn how to do it right because it will come up again.
nedl_1
Wavelength 18005.1 18006.1 18009.1 18010.1 18012.1
1 350 7.920042e-10 8.118013e-10 1.002651e-09 7.379407e-10 9.285596e-10
2 351 7.990535e-10 6.535653e-10 1.275650e-09 5.742704e-10 9.042697e-10
subset_cal1
File dc_time Channels it calibration instrument_num
1 180061 Fri Jan 20 15:37:40 2012 2151 136 1 18006
2 180091 Fri Jan 27 13:30:23 2012 2151 136 1 18009
3 180101 Fri Jan 27 09:41:38 2012 2151 136 1 18010
4 180121 Tue Feb 28 12:15:02 2012 2151 136 1 18012
Here is the code that I used to transpose subset_cal1 and then join it with nedl_1:
n <- subset_cal1$File # remember the characters in $File
sh_raw <- as.data.frame(t(subset_cal1[,-1])) # transpose all but $File
colnames(sh_raw) <- n # change colnames to those stored in n
dups <- unique(as.list(sh_raw)) # list the duplicate cols
sh_raw_2 <- sh_raw[!duplicated(dups)] # remove duplicate cols
j_raw_nedl <- left_join(sh_raw, nedl_1) #join matching cols
Error: Can't join on '18051.1' x '18051.1' because of incompatible types (numeric / character)
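No answer is attached to this question in the source, but the error follows from how t() works: transposing a data frame produces a matrix, a matrix can hold only one type, and the character column dc_time therefore drags every value to character (then factor via as.data.frame). A minimal sketch of the usual workaround, assuming the column names shown in the printout above, is to set the non-numeric column aside before transposing:
# Keep File (the future column names) and dc_time (character) out of the
# transpose so the resulting matrix stays numeric and join keys keep their type
ids    <- subset_cal1$File
times  <- subset_cal1$dc_time
sh_raw <- as.data.frame(t(subset_cal1[, !(names(subset_cal1) %in% c("File", "dc_time"))]))
colnames(sh_raw) <- ids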

Using variables from two different size datasets (and logic relations) to create a new variable

I have two dataframes. The numbers of observations are very different, and I would like to use some information from one dataframe in the other, conditional on some logical relations, but I can't seem to manage it. A down-scaled example would look something like this:
year <- as.vector(c(rep(1949,5), rep(1950,5), rep(1951,5), rep(1952,5)))
moneyband <- as.vector(c(rep(c(10,20,30,40,50),4)))
rate <-as.vector(c(rep(c(0.1,0.2,0.3,0.4,0.5),2),rep(c(0.15,0.25,0.35,0.45,0.55),2)))
datasmall <- as.data.frame(cbind(year,moneyband,rate))
yearbig <- as.vector(c(rep(1949,10), rep(1950,10), rep(1951,10), rep(1952,11)))
earnings <- as.vector(c(rep(c(9,19,30,39,50),8),60))
databig <- as.data.frame(cbind(yearbig,earnings))
Now I want to create a new variable in the big dataframe (let's call it ratebig) that holds the rate associated with that amount of earnings, whenever earnings (in the big dataframe) equal moneyband (in the small dataframe) for a given year. As you can see, in this example that happens for the values 30 and 50. The rest should be NA.
I tried this:
databig$ratebig <- NA
for (i in 1949:1952) {
  databig$ratebig[datasmall$year == i & (databig$earnings[databig$yearbig==i]==datasmall$moneyband[datasmall$year == i])] <- datasmall$rate[datasmall$year == i & (databig$earnings[databig$yearbig==i]==datasmall$moneyband[datasmall$year == i])]
}
But the different sizes of the dataframes (or something else) are giving me trouble (I get errors and the results are wrong). The result does not respect the conditions the way I intended, and it is influenced by the relative positions and structure of the two datasets.
In principle, I wouldn't want to merge the datasets (we are talking about a high number of observations in the real data) and was hoping for a way to do this.
Thanks!!
For your case, merge works fine:
merge(databig, datasmall, by.x = c("yearbig", "earnings"),
by.y = c("year", "moneyband"), all.x = TRUE)
# yearbig earnings rate
#1 1949 9 NA
#2 1949 9 NA
#3 1949 19 NA
#4 1949 19 NA
#5 1949 30 0.30
#6 1949 30 0.30
#7 1949 39 NA
#8 1949 39 NA
#9 1949 50 0.50
#10 1949 50 0.50
#.....
Regarding why your for loop doesn't work as expected: you need to do it for every row of databig:
databig$ratebig <- NA
for (i in 1:nrow(databig)) {
  inds <- databig$yearbig[i] == datasmall$year &
          databig$earnings[i] == datasmall$moneyband
  if (any(inds))
    databig$ratebig[i] <- datasmall$rate[inds]
}
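If merge() is off the table because of the data size, a vectorized lookup is a possible middle ground. This sketch (editorial, not from the original answers) builds a composite year+amount key on both sides and uses match(), which returns NA where there is no match:
# Composite keys: one per row, combining year and amount
key_big   <- paste(databig$yearbig, databig$earnings)
key_small <- paste(datasmall$year, datasmall$moneyband)
# match() gives the first matching row in datasmall, or NA
databig$ratebig <- datasmall$rate[match(key_big, key_small)]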

How to column bind and row bind a large number of data frames in R?

I have a large data set of vehicles. They were recorded every 0.1 seconds, so their IDs repeat in the Vehicle ID column. In total there are 2169 vehicles. I filtered the 'Vehicle velocity' column for every vehicle (using a for loop), which resulted in a new column with the first and last 30 values removed (per vehicle). In order to bind it with the original data frame, I removed the first and last 30 rows of the table too and then combined them using cbind(). This works, but only for the last vehicle. I want this smoothing and column binding for all vehicles, and finally I want to combine all the per-vehicle data frames into one single table, i.e. row-binding in sequence of vehicle IDs. This is what I wrote so far:
traj1 <- read.csv('trajectories-0750am-0805am.txt', sep=' ', header=F)
head(traj1)
names(traj1) <- c('Vehicle ID', 'Frame ID','Total Frames', 'Global Time','Local X', 'Local Y', 'Global X','Global Y','Vehicle Length','Vehicle width','Vehicle class','Vehicle velocity','Vehicle acceleration','Lane','Preceding Vehicle ID','Following Vehicle ID','Spacing','Headway')
# TIME COLUMN
Time <- sapply(traj1$'Frame ID', function(x) x/10)
traj1$'Time' <- Time
# SMOOTHING VELOCITY
smooth <- function(x, D, delta){
  z <- exp(-abs(-D:D/delta))
  r <- convolve(x, z, type='filter')/convolve(rep(1, length(x)), z, type='filter')
  r
}
for (i in unique(traj1$'Vehicle ID')){
  veh <- subset(traj1, traj1$'Vehicle ID'==i)
  svel <- smooth(veh$'Vehicle velocity', 30, 10)
  svel <- data.frame(svel)
  veh <- head(tail(veh, -30), -30)
  fta <- cbind(veh, svel)
}
'fta' now only holds the data frame for the last vehicle, but I want all the data frames (for all vehicles i) combined by row. Maybe a for loop is not the right way to do it, but I don't know how to use tapply (or any other apply function) to do so many things at the same time.
EDIT
I can't reproduce my dataset here, but the 'Orange' dataset in R provides a good analogy. Using the same smoothing function, the for loop would look like this (if the 'age' column is smoothed and the 'Tree' column is treated as the equivalent of my 'Vehicle ID' column):
for (i in unique(Orange$Tree)){
  tre <- subset(Orange, Orange$'Tree'==i)
  age2 <- round(smooth(tre$age, 2, 0.67), digits=2)
  age2 <- data.frame(age2)
  tre <- head(tail(tre, -2), -2)
  comb <- cbind(tre, age2)
}
Umair, I am not sure I understood what you want.
If I understood right, you want to combine all the results by row. To do that, you could save all the results in a list and then rbind them with do.call:
comb <- list()                            ### create a list to save the results
length(comb) <- length(unique(Orange$Tree))
## Your loop for smoothing:
for (i in 1:length(unique(Orange$Tree))){
  tre <- subset(Orange, Tree==unique(Orange$Tree)[i])
  age2 <- round(smooth(tre$age, 2, 0.67), digits=2)
  age2 <- data.frame(age2)
  tre <- head(tail(tre, -2), -2)
  comb[[i]] <- cbind(tre, age2)           ### save results in the list
}
final.data <- do.call("rbind", comb)      ### combine all results by row
This will give you:
Tree age circumference age2
3 1 664 87 687.88
4 1 1004 115 982.66
5 1 1231 120 1211.49
10 2 664 111 687.88
11 2 1004 156 982.66
12 2 1231 172 1211.49
17 3 664 75 687.88
18 3 1004 108 982.66
19 3 1231 115 1211.49
24 4 664 112 687.88
25 4 1004 167 982.66
26 4 1231 179 1211.49
31 5 664 81 687.88
32 5 1004 125 982.66
33 5 1231 142 1211.49
Just for fun, a different way to do it using plyr::ddply and sapply with split:
library(plyr)
data <- ddply(Orange, .(Tree), tail, n=-2)
data <- ddply(data, .(Tree), head, n=-2)
data <- cbind(data,
              age2=matrix(sapply(split(Orange$age, Orange$Tree), smooth, D=2, delta=0.67),
                          ncol=1, byrow=FALSE))
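An equivalent, loop-free variant (an editorial sketch, reusing the smooth() defined in the question) is to split the data frame itself, process each piece, and re-bind once:
# split() gives one data frame per Tree; lapply processes each piece
comb <- lapply(split(Orange, Orange$Tree), function(tre) {
  age2 <- round(smooth(tre$age, 2, 0.67), digits = 2)
  cbind(head(tail(tre, -2), -2), age2)
})
final.data <- do.call(rbind, comb)   # one row-bind at the end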

How to find the difference between two values of the last two dates in R

DF2
Date EMMI ACT NO2
2011/02/12 12345 21 11
2011/02/14 43211 22 12
2011/02/19 12345 21 13
2011/02/23 43211 13 12
2011/02/23 56341 13 12
2011/03/03 56431 18 20
I need to find the difference between values on consecutive dates within a column -- for example, in the ACT column. For EMMI 12345, the dates 2011/02/19 and 2011/02/12 give 21-21 = 0. I want to do that for the entire ACT column, add a new column diff, and put the values there. Can anybody please let me know how to do it?
This is the output I want:
DF3
Date EMMI ACT NO2 DifACT
2011/02/12 12345 21 11 NA
2011/02/14 43211 22 12 NA
2011/02/19 12345 21 13 0
2011/02/23 43211 13 12 -9
2011/02/23 56341 13 12 5
Try this:
DF3 <- DF2
DF3$difACT <- ave( DF3$ACT, DF3$EMMI, FUN= function(x) c(NA, diff(x)) )
As long as the dates are sorted (within EMMI) this will work; if they are not sorted, we need to sort first. I would probably sort the entire data frame on date (saving the result of order), then run the above. If you then need the original order back, you can apply order to the saved ordering to "unsort" the data frame.
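A short sketch of that sort-then-restore idea (assuming the Date column uses the yyyy/mm/dd format shown above):
ord <- order(as.Date(DF2$Date, format = "%Y/%m/%d"))
DF3 <- DF2[ord, ]                     # sort by date
DF3$difACT <- ave(DF3$ACT, DF3$EMMI, FUN = function(x) c(NA, diff(x)))
DF3 <- DF3[order(ord), ]              # restore the original row order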
This alternative is based on the plyr package; note that diff() returns one value fewer than the group length, so an NA has to be prepended for mutate to work:
library(plyr)
DF3 <- ddply(DF2, .(EMMI), mutate, difACT=c(NA, diff(ACT)))

R and appending to data frames

I have a cross-correlation function crosscor, and I would like to loop the function over each of the columns in my data matrix. Each time it is run, the function outputs a cross-correlation table that looks something like this:
Lags Cross.Correlation P.value
1 0 -0.0006844958 0.993233547
2 1 0.1021006478 0.204691627
3 2 0.0976746274 0.226628526
4 3 0.1150337867 0.155426784
5 4 0.1943150900 0.016092041
6 5 0.2360415470 0.003416147
7 6 0.1855274375 0.022566685
8 7 0.0800646242 0.330081900
9 8 0.1111071269 0.177338885
10 9 0.0689602574 0.404948252
11 10 -0.0097332533 0.906856279
12 11 0.0146241719 0.860926388
13 12 0.0862549791 0.302268025
14 13 0.1283308019 0.125302070
15 14 0.0909537922 0.279988895
16 15 0.0628012627 0.457795228
17 16 0.1669241304 0.047886605
18 17 0.2019811994 0.016703619
19 18 0.1440124960 0.090764520
20 19 0.1104842808 0.197035340
21 20 0.1247428178 0.146396407
I would like to put all of these tables together in one data frame, and ultimately export it to a csv file with the columns ordered as follows: lags.3, cross-correlation.3, p-value.3, lags.4, cross-correlation.4, ... and so on up to p.value.50.
I have tried to use do.call as follows, but have not been successful:
for(i in 3:50)
{
  l1 <- crosscor(data[,2], data[,i], lagmax=20)
  ccdata <- do.call(rbind, l1)
  cat("Data row", i)
}
I've also tried creating the data frame directly, but I just get the lag column names:
ccdata <- data.frame()
for(i in 3:50)
{
  ccdata[i-2:i+1] <- crosscor(data[,2], data[,i], lagmax=20)
  cat("Data row", i)
}
What am I doing wrong? Or is there an online resource I could consult to figure out how to do this? Best,
There is a transpose method for data.frames. If "crosscor" is the name of the object, just try this:
tcrosscor <- t(crosscor)
write.csv(tcrosscor, file="my_crosscor_1.csv")
The first row would be the Lags; the second row, the Cross.Correlations; the third row, the P.values. I suppose you could "flatten" it further so it would be entirely "horizontal" or "wide". That seems painful, but it might go something like:
single_line <- as.data.frame(t(c(tcrosscor)))
names(single_line) <- paste(c("Lag", "Cross.Correlation", "P.value"), rep(1:21, each=3), sep=".")
write.csv(single_line, file="my_single_1.csv")
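For the full layout the asker describes (lags.3 through p.value.50), an editorial sketch that collects each result in a list, suffixes the column names, and binds columns once at the end; crosscor() and data are the asker's own objects, and the output file name is made up:
res <- vector("list", 48)
for (i in 3:50) {
  cc <- crosscor(data[, 2], data[, i], lagmax = 20)
  names(cc) <- paste(names(cc), i, sep = ".")   # e.g. Lags.3, Cross.Correlation.3, P.value.3
  res[[i - 2]] <- cc
}
ccdata <- do.call(cbind, res)                   # one wide data frame
write.csv(ccdata, file = "all_crosscor.csv", row.names = FALSE)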
