How to merge data correctly - r

I'm trying to merge 7 complete data frames into one large wide data frame. I figured I have to do this stepwise: merge 2 frames into 1, then merge that frame with the next one, and so forth until all 7 original frames become one.
fil2005: "ID" "abr_2005" "lop_2005" "ins_2005"
fil2006: "ID" "abr_2006" "lop_2006" "ins_2006"
The variables "abr_2006" "lop_2006" "ins_2006" and their 2005 counterparts all take the value 0 or 1.
Now the thing is, I want to either merge or do a dcast of some sort (I think) to make these two long data frames into one wide data frame where both "abr_2005" "lop_2005" "ins_2005" and "abr_2006" "lop_2006" "ins_2006" are in the final file.
When I try
fil_2006.1 <- merge(x=fil_2005, y=fil_2006, by="ID__", all.y=T)
all the variables ending in _2005 are saved to fil_2006.1, but the variables ending in _2006 are not.
I'm apparently doing something wrong. Any idea?

Is there a reason you put those underscores after ID__? Otherwise, the code you provided will work.
An example:
dat1 <- data.frame("ID"=seq(1,20,by=2),"varx2005"=1:10, "vary2005"=2:11)
dat2 <- data.frame("ID"=5:14,"varx2006"=1:20, "vary2006"=21:40)
# create data frames of differing lengths
head(dat1)
ID varx2005 vary2005
1 1 1 2
2 3 2 3
3 5 3 4
4 7 4 5
5 9 5 6
6 11 6 7
head(dat2)
ID varx2006 vary2006
1 5 1 21
2 6 2 22
3 7 3 23
4 8 4 24
5 9 5 25
6 10 6 26
merged <- merge(dat1,dat2,by="ID",all=T)
head(merged)
ID varx2006 vary2006 varx2005 vary2005
1 1 NA NA 1 2
2 3 NA NA 2 3
3 5 1 21 3 4
4 5 11 31 3 4
5 7 13 33 4 5
6 7 3 23 4 5
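For the seven-frame case in the question, the stepwise merging can be folded into one call with Reduce. This is only a sketch: the three small frames below are made-up stand-ins for fil2005, fil2006 and so on, and it assumes every frame shares a key column named ID.

```r
# Hypothetical stand-ins for the yearly frames; each has an ID column
frames <- list(
  data.frame(ID = 1:3,     abr_2005 = c(0, 1, 0)),
  data.frame(ID = 2:4,     abr_2006 = c(1, 1, 0)),
  data.frame(ID = c(1, 4), abr_2007 = c(0, 1))
)

# Merge the first two frames, then merge the result with the third, and so on.
# all = TRUE keeps IDs that are missing from some of the frames (filled with NA).
wide <- Reduce(function(x, y) merge(x, y, by = "ID", all = TRUE), frames)
print(wide)
```

With the real data you would put all seven frames in the list. Use all = FALSE instead if you only want IDs that are present in every frame.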

Related

R: Sum column from table 2 based on value in table 1, and store result in table 1

I am an R noob, and hope some of you can help me.
I have two data sets:
- store (containing store data, including location coordinates (x,y). The locations are integer values, corresponding to GridIds)
- grid (containing all gridIDs (x,y) as well as a population variable TOT_P for each grid point)
What I want to achieve is this:
For each store I want to loop over the grid data and sum the population of the grid IDs close to the store's grid ID.
I.e., basically SUMIF the grid population variable, with the condition that
grid(x) < store(x) + 1 &
grid(x) > store(x) - 1 &
grid(y) < store(y) + 1 &
grid(y) > store(y) - 1
How can I accomplish that? My own take has been trying to use different things like merge, sapply, etc, but my R inexperience stops me from getting it right.
Thanks in advance!
Edit:
Sample data:
StoreName StoreX StoreY
Store1 3 6
Store2 5 2
TOT_P GridX GridY
8 1 1
7 2 1
3 3 1
3 4 1
22 5 1
20 6 1
9 7 1
28 1 2
8 2 2
3 3 2
12 4 2
12 5 2
15 6 2
7 7 2
3 1 3
3 2 3
3 3 3
4 4 3
13 5 3
18 6 3
3 7 3
61 1 4
25 2 4
5 3 4
20 4 4
23 5 4
72 6 4
14 7 4
178 1 5
407 2 5
26 3 5
167 4 5
58 5 5
113 6 5
73 7 5
76 1 6
3 2 6
3 3 6
3 4 6
4 5 6
13 6 6
18 7 6
3 1 7
61 2 7
25 3 7
26 4 7
167 5 7
58 6 7
113 7 7
The output I am looking for is
StoreName StoreX StoreY SUM_P
Store1 3 6 479
Store2 5 2 119
I.e., for Store1 it is the sum of TOT_P for grid fields X=[2-4] and Y=[5-7]
One approach would be to use dplyr to calculate the difference between each store and all grid points and then group and sum based on these new columns.
#import library
library(dplyr)
#create example store table
StoreName<-paste0("Store",1:2)
StoreX<-c(3,5)
StoreY<-c(6,2)
df.store<-data.frame(StoreName,StoreX,StoreY)
#example population data: df.pop is assumed to already hold the table
#copied from the OP, with columns TOT_P, GridX and GridY
#add dummy column to each table to enable cross join
df.store$k=1
df.pop$k=1
#dplyr to join, calculate absolute distance, filter and sum
df.store %>%
inner_join(df.pop, by='k') %>%
mutate(x.diff = abs(StoreX-GridX), y.diff=abs(StoreY-GridY)) %>%
filter(x.diff<=1, y.diff<=1) %>%
group_by(StoreName) %>%
summarise(StoreX=max(StoreX), StoreY=max(StoreY), tot.pop = sum(TOT_P) )
#output:
StoreName StoreX StoreY tot.pop
<fctr> <dbl> <dbl> <int>
1 Store1 3 6 721
2 Store2 5 2 119
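If you would rather avoid the dummy column and the cross join, the same sum can be sketched in base R with mapply. This is an alternative, not the dplyr method above. It assumes the 7x7 grid from the question, rebuilt here with rep(), and treats "close" as at most 1 away in both x and y.

```r
# Population values from the question's sample, row by row (GridY = 1..7)
TOT_P <- c(8, 7, 3, 3, 22, 20, 9,
           28, 8, 3, 12, 12, 15, 7,
           3, 3, 3, 4, 13, 18, 3,
           61, 25, 5, 20, 23, 72, 14,
           178, 407, 26, 167, 58, 113, 73,
           76, 3, 3, 3, 4, 13, 18,
           3, 61, 25, 26, 167, 58, 113)
df.pop <- data.frame(TOT_P,
                     GridX = rep(1:7, times = 7),
                     GridY = rep(1:7, each = 7))

df.store <- data.frame(StoreName = c("Store1", "Store2"),
                       StoreX = c(3, 5),
                       StoreY = c(6, 2))

# For each store, flag grid cells within 1 step in x and y, then sum them
df.store$SUM_P <- mapply(function(sx, sy) {
  near <- abs(df.pop$GridX - sx) <= 1 & abs(df.pop$GridY - sy) <= 1
  sum(df.pop$TOT_P[near])
}, df.store$StoreX, df.store$StoreY)

print(df.store)
```

On this sample it gives 721 for Store1 and 119 for Store2, the same totals as the dplyr output.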

Arrange a data set in a repeating manner from a reshaped data

I have reshaped the data to long format. It has been sorted in ascending order on one column (x2 in the reproducible example below), and I want the values of that column to cycle in a repeating pattern rather than stay grouped. Here is a sample:
set.seed(234)
data<-data.frame(x1=c(1:12),x2=rep(1:3,each=4),x3=runif(12,min=0,max=12))
And I want the format something like this:
x1 x2 x3
1 1 1 6.115445
2 2 2 5.157014
3 3 3 4.793458
4 4 1 9.998710
5 5 2 2.620250
6 6 3 1.825839
7 7 1 5.842854
8 8 2 5.616670
9 9 3 6.511315
10 10 1 9.164444
11 11 2 8.401418
Can you please help me with either what to include in the melt function when converting the data to long format, or any other function I should use to rearrange the data?
note:
The above result is to show the desired format, not the exact solution for my data.
EDIT:
Here is head() of my real data:
Date stn Elev Amount
1 2010-01-01 11 0 268.945
2 2010-01-01 11 0 268.396
3 2010-01-01 11 0 267.512
4 2010-01-01 11 0 266.488
5 2010-01-01 11 0 265.558
6 2010-01-01 11 0 265.178
In the actual data, the column Elev contains values like c("0","100","250","500"...). So you can assume that 0 is equivalent to 1 in x2 of the above sample, and so forth for 100, 250....
One method is to use ave as follows:
data[order(ave(data$x3, data$x2, FUN=function(i) 1:length(i)), data$x2),]
x1 x2 x3
1 1 1 8.9474400
5 5 2 0.8029211
9 9 3 11.1328381
2 2 1 9.3805491
6 6 2 7.7375415
10 10 3 3.4107614
3 3 1 0.2404454
7 7 2 11.1526315
11 11 3 6.6686992
4 4 1 9.3130246
8 8 2 8.6117063
12 12 3 6.5724198
In this instance, ave calculates a running count by data$x2, which is then used to sort the data with the order function.
You can also renumber x1 if desired: data$x1 <- 1:nrow(data), which would return your desired result.
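The same idea carries over to the real data if Elev is used in place of x2. A minimal sketch, assuming a made-up frame with the column names from the question (stn, Elev, Amount) and three elevation levels:

```r
# Hypothetical stand-in for the real data, sorted by Elev
df <- data.frame(stn = 11,
                 Elev = rep(c(0, 100, 250), each = 3),
                 Amount = 1:9)

# Running count within each Elev group: 1, 2, 3, 1, 2, 3, ...
idx <- ave(seq_along(df$Elev), df$Elev, FUN = seq_along)

# Sort by the running count first, then by Elev, so Elev cycles 0, 100, 250
out <- df[order(idx, df$Elev), ]
print(out)
```

Using seq_along on both sides keeps the counter an integer and avoids relying on a numeric column such as x3 just to carry the count.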

Subsetting a data frame by a condition in R

I have the following data with the ID of subjects.
V1
1 2
2 2
3 2
4 2
5 2
6 2
7 2
8 2
9 2
10 2
11 2
12 2
13 2
14 2
15 2
16 4
17 4
18 4
19 4
20 4
21 4
22 4
23 4
24 4
I want to subset all the rows of the data where V1 == 4. This way I can see which observations relate to subject 4.
For example, the correct output would be
16 4
17 4
18 4
19 4
20 4
21 4
22 4
23 4
24 4
However, the output I get after subsetting does not show the correct rows. It simply gives me:
V1
1 4
2 4
3 4
4 4
5 4
6 4
7 4
8 4
I'm unable to tell which observations relate to subject 4, as observations 1:8 are for subject 2.
I've tried the usual methods, such as
condition<- df == 4
df[condition]
How can I subset the data so that I get back a dataset that shows the correct row numbers for subject 4?
You can also use the subset function:
subset(df,df$V1==4)
I've managed to find a solution since posting:
newdf <- subset(df, V1 == 4)
However, I'm still very interested in other solutions to this problem, so please post if you're aware of another method.
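Another option is plain bracket indexing with a row condition, which keeps the original row names. A small sketch, assuming a one-column data frame built to match the question (fifteen 2s followed by nine 4s):

```r
# Rebuild the question's data: rows 1-15 are subject 2, rows 16-24 are subject 4
df <- data.frame(V1 = rep(c(2, 4), times = c(15, 9)))

# Index rows by the condition; drop = FALSE keeps the result a data frame
# even though there is only one column
sub <- df[df$V1 == 4, , drop = FALSE]
print(sub)

# If only the row positions are needed, which() returns them directly
rows <- which(df$V1 == 4)
print(rows)
```

Note the comma in df[df$V1 == 4, , drop = FALSE]. Without it, as in df[condition] from the question, a logical matrix picks out individual values and the row names are lost.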

R: How to use intervals as input data for histograms?

I would like to import the data into R as intervals, then count all the numbers falling within these intervals and draw a histogram from these counts.
Example:
start end freq
1 8 3
5 10 2
7 11 5
.
.
.
Result:
number freq
1 3
2 3
3 3
4 3
5 5
6 5
7 10
8 10
9 7
10 7
11 5
Some suggestions?
Thank you very much!
Assuming your data is in df, you can create a data set that has each number in the range repeated by freq. Once you have that it's trivial to use the summarizing functions in R. This is a little roundabout, but a lot easier than explicitly computing the sum of the overlaps (though that isn't that hard either).
dat <- unlist(apply(df, 1, function(x) rep(x[[1]]:x[[2]], x[[3]])))
hist(dat, breaks=0:max(df$end))
You can also do table(dat)
dat
1 2 3 4 5 6 7 8 9 10 11
3 3 3 3 5 5 10 10 7 7 5
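The explicit version mentioned above, summing the overlaps directly, could look like the sketch below. It assumes df holds the three example rows from the question, with columns start, end and freq.

```r
# The three example intervals from the question
df <- data.frame(start = c(1, 5, 7),
                 end   = c(8, 10, 11),
                 freq  = c(3, 2, 5))

# Every integer covered by at least the outermost interval bounds
number <- seq(min(df$start), max(df$end))

# For each number, add up freq over all intervals that contain it
freq <- sapply(number, function(n) sum(df$freq[df$start <= n & n <= df$end]))

result <- data.frame(number, freq)
print(result)

# Bar heights are the counts, so barplot() is enough for the picture
barplot(result$freq, names.arg = result$number)
```

This avoids building the long repeated vector, which may matter if the freq values are large.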

How can I produce a table into a data.frame?

I printed out the summary of a column variable in R.
Please see the summary table below:
I would like to turn it into a data.frame. However, there are so many subject names that it is very difficult to list them all. Also, the term "OTHER" with number 31 means that there are 319 subjects which appear only 1 time in the original data.frame.
So, the new data.frame I hope to produce would look like below:
Here is one possible solution.
Table<-table(rpois(100,5))
as.data.frame(Table)
Var1 Freq
1 1 2
2 2 11
3 3 9
4 4 18
5 5 13
6 6 20
7 7 14
8 8 8
9 9 3
10 10 1
11 11 1
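If you also want the rare subjects collapsed into a single "OTHER" row before converting, one way is to relabel them first. This is only a sketch: the subjects vector is made up, and it assumes "OTHER" should cover every name that appears exactly once.

```r
# Hypothetical subject column; "art" and "chem" each appear only once
subjects <- c("math", "math", "art", "bio", "bio", "bio", "chem")

# Find the names that occur exactly once
counts <- table(subjects)
rare <- names(counts)[counts == 1]

# Relabel the rare names, then tabulate again and convert
lumped <- ifelse(subjects %in% rare, "OTHER", subjects)
res <- as.data.frame(table(lumped), stringsAsFactors = FALSE)
print(res)
```

Here the Freq for "OTHER" is the number of rows that were relabelled. The first column is named after the vector passed to table(), so rename it with names(res) if needed.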

Resources