Replacing base values in R with mean values from multiple rows

I am new to R and am having trouble solving a problem with this dataset.
df
ID Time Value
1001 -34 3.3
1001 14 4.2
1002 -34 3.8
1002 14 6.5
1004 -18 4.1
1004 -11 3.4
1004 37 3.8
1005 -16 5.8
1005 -10 6.0
1005 14 8.1
1006 -20 16.1
1006 -10 14.1
1006 158 14.1
1007 -35 7.1
1007 -20 4.6
1007 -20 5.1
1007 10 5.0
For each ID, collapse the negative-time readings into a single row with Time set to 0; if there is more than one negative-time reading, use the mean of their values. The resulting dataset should be:
df1
ID Time Value
1001 0 3.3
1001 14 4.2
1002 0 3.8
1002 14 6.5
1004 0 3.75
1004 37 3.8
1005 0 5.9
1005 14 8.1
1006 0 15.1
1006 158 14.1
1007 0 5.6
1007 10 5.0
Thanks for the help!
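(For anyone who wants to run the answers below: the sample data can be rebuilt with read.table; a minimal sketch, using the object name df from the question.)
df <- read.table(text = "ID Time Value
1001 -34 3.3
1001 14 4.2
1002 -34 3.8
1002 14 6.5
1004 -18 4.1
1004 -11 3.4
1004 37 3.8
1005 -16 5.8
1005 -10 6.0
1005 14 8.1
1006 -20 16.1
1006 -10 14.1
1006 158 14.1
1007 -35 7.1
1007 -20 4.6
1007 -20 5.1
1007 10 5.0", header = TRUE)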

This will be pretty fast if you have lots of data.
# Convert to a data.table object
library(data.table)
dt <- data.table(df)
# Flag negative times
dt[, Neg := (Time < 0) * 1]
# Split into positive and negative subsets
dt1 <- dt[Neg == 0]
dt2 <- dt[Neg == 1, list(Time = 0, Value = mean(Value, na.rm = TRUE), Neg = 1), by = "ID"]
# Recombine them
df.final <- rbindlist(list(dt1, dt2))[order(ID, Time)]
Here is the result:
# ID Time Value Neg
# 1: 1001 0 3.30 1
# 2: 1001 14 4.20 0
# 3: 1002 0 3.80 1
# 4: 1002 14 6.50 0
# 5: 1004 0 3.75 1
# 6: 1004 37 3.80 0
# 7: 1005 0 5.90 1
# 8: 1005 14 8.10 0
# 9: 1006 0 15.10 1
# 10: 1006 158 14.10 0
# 11: 1007 0 5.60 1
# 12: 1007 10 5.00 0
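If the helper column is no longer needed, it can be dropped in place with data.table's := operator:
df.final[, Neg := NULL]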
You can also put it all together in a one-liner to get a similar answer as follows:
dt[, list(Time = Time[1] * tt,
          Value = if (tt) Value else mean(Value)),
   by = list(ID, tt = Time > 0)]
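One caveat: this groups each ID by the sign of Time, and Time[1] assumes at most one positive reading per ID (true of the sample data). With several positive readings per ID, the negative rows would still collapse correctly, but every positive row would inherit the first positive time.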

Here's yet another solution.
# Copy the raw data
dx <- df
# Find rows with Time < 0
lz <- dx$Time < 0
# Set those times to 0
dx$Time[lz] <- 0
# Replace Value with the per-ID mean where Time < 0
dx$Value[lz] <- ave(dx$Value, dx$ID, lz, FUN = mean)[lz]
# Remove the now-duplicated Time < 0 rows (duplicated() needs the ID/flag pair)
dx <- dx[!(duplicated(data.frame(dx$ID, lz)) & lz), ]
And the results...
ID Time Value
1 1001 0 3.30
2 1001 14 4.20
3 1002 0 3.80
4 1002 14 6.50
5 1004 0 3.75
7 1004 37 3.80
8 1005 0 5.90
10 1005 14 8.10
11 1006 0 15.10
13 1006 158 14.10
14 1007 0 5.60
17 1007 10 5.00


Use cases with higher value on one variable for each case of another variable in R

I am doing a meta-analysis in R. For each study (variable StudyID) I have multiple effect sizes. For some studies I have the same effect size multiple times depending on the level of acquaintance (variable Familiarity) between the subjects.
head(dat)
studyID A.C.Extent Visibility Familiarity p_t_cov group.size same.sex N published
1 1 3.0 5.0 1 0.0462 4 0 44 1
2 1 5.0 2.5 1 0.1335 4 0 44 1
3 1 2.5 3.0 1 -0.1239 4 0 44 1
4 1 2.5 3.5 1 0.2062 4 0 44 1
5 1 2.5 3.0 1 -0.0370 4 0 44 1
6 1 3.0 5.0 1 -0.3850 4 0 44 1
Those are the first rows of the data set. In total there are over 50 studies. Most studies look like study 1 with the same value in "Familiarity" for all effect sizes. In some studies, there are effect sizes with multiple levels of familiarity. For example study 36 as seen below.
head(dat)
studyID A.C.Extent Visibility Familiarity p_t_cov group.size same.sex N published
142 36 1.0 4.5 0 0.1233 5.00 0 311 1
143 36 3.5 3.0 0 0.0428 5.00 0 311 1
144 36 1.0 4.5 0 0.0986 5.00 0 311 1
145 36 1.0 4.5 1 -0.0520 5.00 0 311 1
146 36 1.5 2.5 1 -0.0258 5.00 0 311 1
147 36 3.5 3.0 1 0.1104 5.00 0 311 1
148 36 1.0 4.5 1 0.0282 5.00 0 311 1
149 36 1.0 4.5 2 -0.1724 5.00 0 311 1
150 36 3.5 3.0 2 0.2646 5.00 0 311 1
151 36 1.0 4.5 2 -0.1426 5.00 0 311 1
152 37 3.0 4.0 1 0.0118 5.35 0 123 0
153 37 1.0 4.5 1 -0.3205 5.35 0 123 0
154 37 2.5 3.0 1 -0.2356 5.35 0 123 0
155 37 3.0 2.0 1 0.1372 5.35 0 123 0
156 37 2.5 2.5 1 -0.1401 5.35 0 123 0
157 37 3.0 3.5 1 -0.3334 5.35 0 123 0
158 37 2.5 2.5 1 0.0317 5.35 0 123 0
159 37 1.0 3.0 1 -0.3025 5.35 0 123 0
160 37 1.0 3.5 1 -0.3248 5.35 0 123 0
Now, for those studies that include multiple levels of familiarity, I want to take only the rows with a single level of familiarity (in two separate versions: one with the lower and one with the higher familiarity).
I think this should be possible with the dplyr package, but I have no working code so far.
In a second step I would like to give those rows unique studyIDs for each level of familiarity (so that study 36 becomes three "different" studies).
Thank you in advance!
If you want to use dplyr, you could create an alternate ID or casenum by using group_indices:
library(dplyr)
df <- df %>%
  mutate(case_num = group_indices(.dots = c("studyID", "Familiarity")))
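group_indices() with the .dots argument has since been deprecated; a sketch of an equivalent under dplyr >= 1.0, using cur_group_id() (same assumed df as above):
library(dplyr)
df <- df %>%
  group_by(studyID, Familiarity) %>%
  mutate(case_num = cur_group_id()) %>%
  ungroup()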
You could do:
library(dplyr)
df %>%
  group_by(studyID) %>%
  mutate(nDist = n_distinct(Familiarity) > 1) %>%
  ungroup() %>%
  mutate(
    studyID = case_when(nDist ~ paste(studyID, Familiarity, sep = "_"),
                        TRUE ~ as.character(studyID)),
    nDist = NULL
  )
Output:
# A tibble: 19 x 9
studyID A.C.Extent Visibility Familiarity p_t_cov group.size same.sex N published
<chr> <dbl> <dbl> <int> <dbl> <dbl> <int> <int> <int>
1 36_0 1 4.5 0 0.123 5 0 311 1
2 36_0 3.5 3 0 0.0428 5 0 311 1
3 36_0 1 4.5 0 0.0986 5 0 311 1
4 36_1 1 4.5 1 -0.052 5 0 311 1
5 36_1 1.5 2.5 1 -0.0258 5 0 311 1
6 36_1 3.5 3 1 0.110 5 0 311 1
7 36_1 1 4.5 1 0.0282 5 0 311 1
8 36_2 1 4.5 2 -0.172 5 0 311 1
9 36_2 3.5 3 2 0.265 5 0 311 1
10 36_2 1 4.5 2 -0.143 5 0 311 1
11 37 3 4 1 0.0118 5.35 0 123 0
12 37 1 4.5 1 -0.320 5.35 0 123 0
13 37 2.5 3 1 -0.236 5.35 0 123 0
14 37 3 2 1 0.137 5.35 0 123 0
15 37 2.5 2.5 1 -0.140 5.35 0 123 0
16 37 3 3.5 1 -0.333 5.35 0 123 0
17 37 2.5 2.5 1 0.0317 5.35 0 123 0
18 37 1 3 1 -0.302 5.35 0 123 0
19 37 1 3.5 1 -0.325 5.35 0 123 0

Filter a dataframe by keeping row dates of three days in a row preferably with dplyr

I would like to filter a dataframe based on its date column, keeping only the rows that form runs of at least 3 consecutive days. I would like to do this as efficiently and quickly as possible, so a vectorized approach would be ideal.
I tried to adapt the approach from the following link, but it didn't go well, as it addresses a different problem:
How to filter rows based on difference in dates between rows in R?
I tried a for loop: I managed to flag the dates that are not consecutive, but it didn't give the desired result, because it keeps all runs of consecutive dates even when they are shorter than 3 days.
tf is my dataframe:
library(lubridate)  # for %m+%
for (i in 2:(nrow(tf) - 1)) {
  if (tf$Date[i] != tf$Date[i + 1] %m+% days(-1)) {
    if (tf$Date[i] != tf$Date[i - 1] %m+% days(1)) {
      tf$Date[i] <- as.Date(0, origin = "1970-01-01")
    }
  }
}
The first 22 rows of my dataframe look something like this:
Date RR.x RR.y Y
1 1984-10-20 1 10.8 1984
2 1984-11-04 1 12.5 1984
3 1984-11-05 1 7.0 1984
4 1984-11-09 1 22.9 1984
5 1984-11-10 1 24.4 1984
6 1984-11-11 1 19.0 1984
7 1984-11-13 1 5.9 1984
8 1986-10-15 1 10.3 1986
9 1986-10-16 1 18.1 1986
10 1986-10-17 1 11.3 1986
11 1986-11-17 1 14.1 1986
12 2003-10-17 1 7.8 2003
13 2003-10-25 1 7.6 2003
14 2003-10-26 1 5.0 2003
15 2003-10-27 1 6.6 2003
16 2003-11-15 1 26.4 2003
17 2003-11-20 1 10.0 2003
18 2011-10-29 1 10.0 2011
19 2011-11-04 1 11.4 2011
20 2011-11-21 1 9.8 2011
21 2011-11-22 1 5.6 2011
22 2011-11-23 1 20.4 2011
The result should be:
Date RR.x RR.y Y
4 1984-11-09 1 22.9 1984
5 1984-11-10 1 24.4 1984
6 1984-11-11 1 19.0 1984
8 1986-10-15 1 10.3 1986
9 1986-10-16 1 18.1 1986
10 1986-10-17 1 11.3 1986
13 2003-10-25 1 7.6 2003
14 2003-10-26 1 5.0 2003
15 2003-10-27 1 6.6 2003
20 2011-11-21 1 9.8 2011
21 2011-11-22 1 5.6 2011
22 2011-11-23 1 20.4 2011
One possibility could be:
library(dplyr)
df %>%
  mutate(Date = as.Date(Date, format = "%Y-%m-%d"),
         diff = c(0, diff(Date))) %>%
  group_by(grp = cumsum(diff > 1 & lead(diff, default = last(diff)) == 1)) %>%
  filter(if_else(diff > 1 & lead(diff, default = last(diff)) == 1, 1, diff) == 1) %>%
  filter(n() >= 3) %>%
  ungroup() %>%
  select(-diff, -grp)
Date RR.x RR.y Y
<date> <int> <dbl> <int>
1 1984-11-09 1 22.9 1984
2 1984-11-10 1 24.4 1984
3 1984-11-11 1 19 1984
4 1986-10-15 1 10.3 1986
5 1986-10-16 1 18.1 1986
6 1986-10-17 1 11.3 1986
7 2003-10-25 1 7.6 2003
8 2003-10-26 1 5 2003
9 2003-10-27 1 6.6 2003
10 2011-11-21 1 9.8 2011
11 2011-11-22 1 5.6 2011
12 2011-11-23 1 20.4 2011
Here's a base solution:
DF$Date <- as.Date(DF$Date)
# Group consecutive dates: the counter increments whenever the gap is not 1 day
rles <- rle(cumsum(c(1, diff(DF$Date) != 1)))
# Keep only runs of 3 or more consecutive days
rles$values <- rles$lengths >= 3
DF[inverse.rle(rles), ]
Date RR.x RR.y Y
4 1984-11-09 1 22.9 1984
5 1984-11-10 1 24.4 1984
6 1984-11-11 1 19.0 1984
8 1986-10-15 1 10.3 1986
9 1986-10-16 1 18.1 1986
10 1986-10-17 1 11.3 1986
13 2003-10-25 1 7.6 2003
14 2003-10-26 1 5.0 2003
15 2003-10-27 1 6.6 2003
20 2011-11-21 1 9.8 2011
21 2011-11-22 1 5.6 2011
22 2011-11-23 1 20.4 2011
A similar approach in dplyr:
library(dplyr)
DF %>%
  mutate(Date = as.Date(Date)) %>%
  add_count(IDs = cumsum(c(1, diff(Date) != 1))) %>%
  filter(n >= 3)
# A tibble: 12 x 6
Date RR.x RR.y Y IDs n
<date> <int> <dbl> <int> <dbl> <int>
1 1984-11-09 1 22.9 1984 3 3
2 1984-11-10 1 24.4 1984 3 3
3 1984-11-11 1 19 1984 3 3
4 1986-10-15 1 10.3 1986 5 3
5 1986-10-16 1 18.1 1986 5 3
6 1986-10-17 1 11.3 1986 5 3
7 2003-10-25 1 7.6 2003 8 3
8 2003-10-26 1 5 2003 8 3
9 2003-10-27 1 6.6 2003 8 3
10 2011-11-21 1 9.8 2011 13 3
11 2011-11-22 1 5.6 2011 13 3
12 2011-11-23 1 20.4 2011 13 3

How to dynamically select columns

Let's assume I ran a random forest model and got the variable importance info as below:
set.seed(121)
ImpMeasure <- data.frame(mod.varImp$importance)
ImpMeasure$Vars <- row.names(ImpMeasure)
ImpMeasure.df <- ImpMeasure[order(-ImpMeasure$Overall), ]
row.names(ImpMeasure.df) <- NULL
class(ImpMeasure.df)
ImpMeasure.df <- ImpMeasure.df[, c(2, 1)]  # now the importance info is in a data frame
ImpMeasure.df
Vars Overall
1 num_voted_users 100.000000
2 num_critic_for_reviews 58.961441
3 num_user_for_reviews 56.500707
4 movie_facebook_likes 50.680318
5 cast_total_facebook_likes 30.012205
6 gross 27.652559
7 actor_3_facebook_likes 24.094213
8 actor_2_facebook_likes 19.633290
9 imdb_score 16.063007
10 actor_1_facebook_likes 15.848972
11 duration 11.886036
12 budget 11.853066
13 title_year 7.804387
14 director_facebook_likes 7.318787
15 facenumber_in_poster 1.868376
16 aspect_ratio 0.000000
Now, if I decide that I want only the top 5 variables for further analysis, I do this:
library(dplyr)
top.var <- ImpMeasure.df[1:5, ] %>% select(Vars)
top.var
Vars
1 num_voted_users
2 num_critic_for_reviews
3 num_user_for_reviews
4 movie_facebook_likes
5 cast_total_facebook_likes
How can I use this info to select only these variables from the original dataset (given below), without spelling out the actual variable names, but instead using the output of top.var? How can I do this with dplyr's select function?
My original dataset is like this:
num_critic_for_reviews duration director_facebook_likes actor_3_facebook_likes
1 723 178 0 855
2 302 169 563 1000
3 602 148 0 161
4 813 164 22000 23000
5 255 95 131 782
6 462 132 475 530
actor_1_facebook_likes gross num_voted_users cast_total_facebook_likes
1 1000 760505847 886204 4834
2 40000 309404152 471220 48350
3 11000 200074175 275868 11700
4 27000 448130642 1144337 106759
5 131 228830 8 143
6 640 73058679 212204 1873
facenumber_in_poster num_user_for_reviews budget title_year
1 0 3054 237000000 2009
2 0 1238 300000000 2007
3 1 994 245000000 2015
4 0 2701 250000000 2012
5 0 97 26000000 2002
6 1 738 263700000 2012
actor_2_facebook_likes imdb_score aspect_ratio movie_facebook_likes cluster
1 936 7.9 1.78 33000 2
2 5000 7.1 2.35 0 3
3 393 6.8 2.35 85000 2
4 23000 8.5 2.35 164000 3
5 12 7.1 1.85 0 1
6 632 6.6 2.35 24000 2
movies.imp <- moviesdf.cluster %>% select(one_of(top.var$Vars), cluster)
head(movies.imp)
## num_voted_users num_user_for_reviews num_critic_for_reviews
## 1 886204 3054 723
## 2 471220 1238 302
## 3 275868 994 602
## 4 1144337 2701 813
## 5 8 127 37
## 6 212204 738 462
## movie_facebook_likes cast_total_facebook_likes cluster
## 1 33000 4834 1
## 2 0 48350 1
## 3 85000 11700 1
## 4 164000 106759 1
## 5 0 143 2
## 6 24000 1873 1
That's done!
Hadley provided the answer to that here:
select_(df, .dots = top.var)
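select_() has since been deprecated; a sketch of the modern equivalent, assuming top.var is the one-column data frame built above:
library(dplyr)
movies.imp <- moviesdf.cluster %>%
  select(all_of(top.var$Vars), cluster)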

R: How to find spatial points within a certain radius of a trajectory and to ensure that they are consecutive

I'm struggling with a space-time problem. Help is very much appreciated! First, here's what I'm looking for:
I've got a dataframe with GPS fixes of roe deer (x,y coordinates) taken at irregular, approximately five-minute intervals.
I want to find all fixes within a certain radius (let's say 15 meters) of the actual position of the roe deer, with the condition that they are consecutive for about 30 minutes (so six fixes). Obviously the number of fixes can vary along the trajectory. I want to write the number of fixes into a new column and mark in another column (0 = no, 1 = yes) whether the condition is met.
If the conditions are met, I would like to calculate the center of gravity (in x,y coordinates) of said point cloud and write it to this or another dataframe.
Following another user's question (Listing number of observations by location), I was able to find the fixes within the radius, but I couldn't figure out how to ensure that they are consecutive.
Here are some rows of my dataframe (in reality there are more than 10,000 rows; because I subsetted this data.frame, the IDs do not start at one):
FID No CollarID Date Time Latitude__ Longitude_ Height__m_ DOP FixType
1 0 667 7024 2013-10-22 06:01:49 47.26859 8.570701 609.94 10.6 GPS-3D
2 1 668 7024 2013-10-22 06:06:04 47.26861 8.570634 612.31 10.4 GPS-3D
3 2 669 7024 2013-10-22 06:11:07 47.26871 8.570402 609.43 9.8 GPS-3D
4 3 670 7024 2013-10-22 06:16:14 47.26857 8.570796 665.40 4.4 val. GPS-3D
5 4 671 7024 2013-10-22 06:20:36 47.26855 8.570582 653.65 4.6 val. GPS-3D
6 5 672 7024 2013-10-22 06:25:50 47.26850 8.570834 659.03 4.8 val. GPS-3D
7 6 673 7024 2013-10-23 06:00:53 47.27017 8.569882 654.86 3.6 val. GPS-3D
8 7 700 7024 2013-10-26 12:00:18 47.26904 8.569596 651.88 3.8 val. GPS-3D
9 8 701 7024 2013-10-26 12:05:41 47.26899 8.569640 652.76 3.8 val. GPS-3D
10 9 702 7024 2013-10-26 12:10:40 47.26898 8.569534 650.42 4.6 val. GPS-3D
11 10 703 7024 2013-10-26 12:16:17 47.26896 8.569606 653.77 11.4 GPS-3D
12 11 704 7024 2013-10-26 12:20:18 47.26903 8.569792 702.49 9.8 val. GPS-3D
13 12 705 7024 2013-10-26 12:25:47 47.26901 8.569579 670.12 2.4 val. GPS-3D
14 13 706 7024 2013-10-26 12:30:18 47.26900 8.569477 685.65 2.0 val. GPS-3D
15 14 707 7024 2013-10-26 12:35:23 47.26885 8.569400 685.15 6.2 val. GPS-3D
Temp___C_ X Y ID Trajectory distance speed timelag timevalid
1 19 685667.7 235916.0 RE01 RE01 5.420858 0.021258268 4.250000 1
2 20 685662.6 235917.8 RE01 RE01 21.276251 0.070218649 5.050000 1
3 20 685644.9 235929.5 RE01 RE01 34.070730 0.110979577 5.116667 1
4 20 685675.0 235913.5 RE01 RE01 16.335573 0.062349516 4.366667 1
5 20 685658.8 235911.3 RE01 RE01 19.896906 0.063365941 5.233333 1
6 20 685677.9 235905.7 RE01 RE01 199.248728 0.002346781 1415.050000 0
7 22 685603.2 236090.4 RE01 RE01 126.831124 0.000451734 4679.416667 0
8 22 685583.4 235965.1 RE01 RE01 6.330467 0.019598970 5.383333 1
9 22 685586.8 235959.8 RE01 RE01 8.270701 0.027661208 4.983333 1
10 23 685578.8 235957.8 RE01 RE01 5.888147 0.017472246 5.616667 1
11 22 685584.3 235955.7 RE01 RE01 16.040998 0.066560158 4.016667 1
12 23 685598.3 235963.6 RE01 RE01 16.205330 0.049256322 5.483333 1
13 23 685582.2 235961.6 RE01 RE01 7.742184 0.028568946 4.516667 1
14 23 685574.5 235960.9 RE01 RE01 18.129019 0.059439406 5.083333 1
15 23 685568.8 235943.7 RE01 RE01 15.760165 0.051672673 5.083333 1
Date_text Time_text DateTime Flucht FluchtALL
1 22.10.2013 06:01:49 22.10.2013 06:01:49 0 0
2 22.10.2013 06:06:04 22.10.2013 06:06:04 0 0
3 22.10.2013 06:11:07 22.10.2013 06:11:07 0 0
4 22.10.2013 06:16:14 22.10.2013 06:16:14 0 0
5 22.10.2013 06:20:36 22.10.2013 06:20:36 0 0
6 22.10.2013 06:25:50 22.10.2013 06:25:50 0 0
7 23.10.2013 06:00:53 23.10.2013 06:00:53 0 0
8 26.10.2013 12:00:18 26.10.2013 12:00:18 0 0
9 26.10.2013 12:05:41 26.10.2013 12:05:41 0 0
10 26.10.2013 12:10:40 26.10.2013 12:10:40 0 0
11 26.10.2013 12:16:17 26.10.2013 12:16:17 0 0
12 26.10.2013 12:20:18 26.10.2013 12:20:18 0 0
13 26.10.2013 12:25:47 26.10.2013 12:25:47 0 0
14 26.10.2013 12:30:18 26.10.2013 12:30:18 0 0
15 26.10.2013 12:35:23 26.10.2013 12:35:23 0 0
And here's the code I've got so far:
for (i in seq(nrow(df))) {
  # circle's centre
  xcentre <- df[i, 'X']
  ycentre <- df[i, 'Y']
  # count how many fixes lie within 15 m of the centre;
  # the noofclosepoints column will hold this value
  df[i, 'noofclosepoints'] <- sum(
    (df[, 'X'] - xcentre)^2 +
    (df[, 'Y'] - ycentre)^2
    <= 15^2
  ) - 1
  cat(i, ': ')
  # print the TRUE/FALSE vector of which rows fall within the radius
  cat((df[, 'X'] - xcentre)^2 +
      (df[, 'Y'] - ycentre)^2
      <= 15^2)
  cat('\n')
}
I tried to test the consecutive condition on the TRUE/FALSE vector from cat(), but I couldn't access the results to process them any further.
I know this is a mile-long question; I would be very glad if someone could help me with this problem or part of it. I will be thankful till the end of my life :). By the way, you may have noticed that I'm an unfortunate R beginner. Many thanks!
Here's how: you can apply a rolling function to your time series with a window of 7 (the current fix plus the six before it).
library(xts)
## Read your time series using the handy `read.zoo`
## Note the use of the index here
dx <-
  read.zoo(text = "Date Time DOP X Y noofclosepoints
4705 09.07.2014 11:05:33 3.4 686926.8 231039.3 14
4706 09.07.2014 11:10:53 3.2 686930.5 231042.5 14
4707 09.07.2014 11:16:29 15.8 686935.2 231035.5 14
4708 09.07.2014 11:20:08 5.2 686932.9 231035.6 14
4709 09.07.2014 11:25:17 4.8 686933.8 231038.6 14
4710 09.07.2014 11:30:16 2.2 686938.0 231037.0 15
4711 09.07.2014 11:35:13 2.0 686930.9 231035.8 14
4712 09.07.2014 11:40:09 2.0 686930.6 231035.7 14
4713 09.07.2014 11:45:25 3.4 686907.2 231046.8 0
4714 09.07.2014 11:50:25 3.2 686936.1 231037.1 14",
  index = 1:2, format = "%d.%m.%Y %H:%M:%S", tz = "")
## Apply the rolling function to all columns:
## for each window, compute the distance between the current
## point and the others, then return the number of points within the radius
as.xts(rollapplyr(dx, 7, function(x) {
  curr <- tail(x, 1)
  others <- head(x, -1)
  dist <- sqrt((others[, "X"] - curr[, "X"])^2 + (others[, "Y"] - curr[, "Y"])^2)
  sum(dist < 15)
}, by.column = FALSE))
# [,1]
# 2014-07-09 11:35:13 6
# 2014-07-09 11:40:09 6
# 2014-07-09 11:45:25 0
# 2014-07-09 11:50:25 5
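To also answer the centre-of-gravity part of the question, the same rolling call can return the mean coordinates whenever all six preceding fixes fall within the radius; a sketch under the same assumptions (window of 7, 15 m radius, X/Y columns as above):
as.xts(rollapplyr(dx, 7, function(x) {
  curr <- tail(x, 1)
  others <- head(x, -1)
  dist <- sqrt((others[, "X"] - curr[, "X"])^2 + (others[, "Y"] - curr[, "Y"])^2)
  ok <- sum(dist < 15) == 6  # are all six preceding fixes within 15 m?
  # flag whether the condition is met, plus the centre of gravity (NA otherwise)
  c(met = as.numeric(ok),
    cogX = if (ok) mean(others[, "X"]) else NA_real_,
    cogY = if (ok) mean(others[, "Y"]) else NA_real_)
}, by.column = FALSE))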

R Creating new data.table with specified rows of a single column from an old data.table

I have the following data.table:
Month Day Lat Long Temperature
1: 10 01 80.0 180 -6.383330333333309
2: 10 01 77.5 180 -6.193327999999976
3: 10 01 75.0 180 -6.263328333333312
4: 10 01 72.5 180 -5.759997333333306
5: 10 01 70.0 180 -4.838330999999976
---
117020: 12 31 32.5 310 11.840003833333355
117021: 12 31 30.0 310 13.065001833333357
117022: 12 31 27.5 310 14.685003333333356
117023: 12 31 25.0 310 15.946669666666690
117024: 12 31 22.5 310 16.578336333333358
For every location (given by Lat and Long), I have a temperature for each day from 1 October to 31 December.
There are 1,272 locations consisting of each pairwise combination of Lat:
Lat
1 80.0
2 77.5
3 75.0
4 72.5
5 70.0
--------
21 30.0
22 27.5
23 25.0
24 22.5
and Long:
Long
1 180.0
2 182.5
3 185.0
4 187.5
5 190.0
---------
49 300.0
50 302.5
51 305.0
52 307.5
53 310.0
I'm trying to create a data.table that consists of 1,272 rows (one per location) and 92 columns (one per day). Each element of that data.table will then contain the temperature at that location on that day.
Any advice about how to accomplish that goal without using a for loop?
Here we use ChickWeight as the data, where we use "Chick-Diet" as the equivalent of your "Lat-Long", and "Time" as your "Date":
dcast.data.table(data.table(ChickWeight), Chick + Diet ~ Time)
Produces:
Chick Diet 0 2 4 6 8 10 12 14 16 18 20 21
1: 18 1 1 1 NA NA NA NA NA NA NA NA NA NA
2: 16 1 1 1 1 1 1 1 1 NA NA NA NA NA
3: 15 1 1 1 1 1 1 1 1 1 NA NA NA NA
4: 13 1 1 1 1 1 1 1 1 1 1 1 1 1
5: ... 46 rows omitted
You will likely need `Lat + Long ~ Month + Day` or some such for your formula.
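Applied to the question's data, that would look something like the sketch below (assuming the data.table is named dt, since the object name is not shown in the question):
wide <- dcast.data.table(dt, Lat + Long ~ Month + Day, value.var = "Temperature")
dim(wide)  # should be 1272 rows and 2 + 92 columns (Lat, Long, one per day)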
In the future, please make your question reproducible as I did here by using a built-in data set.
First create a date value using the lubridate package (I assumed year = 2014, adjust as necessary):
library(lubridate)
df$datetext <- paste(df$Month, df$Day, "2014", sep = "-")
df$date <- mdy(df$datetext)
Then one option is to use the tidyr package to spread the columns:
library(tidyr)
spread(df[, -c(1:2, 6)], date, Temperature)
Lat Long 2014-10-01 2014-12-31
1 22.5 310 NA 16.57834
2 25.0 310 NA 15.94667
3 27.5 310 NA 14.68500
4 30.0 310 NA 13.06500
5 32.5 310 NA 11.84000
6 70.0 180 -4.838331 NA
7 72.5 180 -5.759997 NA
8 75.0 180 -6.263328 NA
9 77.5 180 -6.193328 NA
10 80.0 180 -6.383330 NA
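spread() has since been retired in favour of pivot_wider(); a sketch of the equivalent call, assuming tidyr >= 1.0:
library(tidyr)
pivot_wider(df[, -c(1:2, 6)], names_from = date, values_from = Temperature)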
