Join spatial features with dataframe by ID with inconsistent format - R

Hello everyone, I was hoping I could get some help with this issue:
I have a shapefile with 2347 features that correspond to 3172 units. Perhaps when the original file was created there were some duplicated geometries, and they decided to arrange them like this:
Feature  gis_id
1        "1"
2        "2"
3        "3,4,5"
4        "6,8"
5        "7"
6        "9,10,13"
... and so on, up to the 3172 units across the 2347 features.
On the other hand, my data table has 72956 observations (about 16 columns) with data corresponding to the gis_id from the shapefile. However, this table has a unique gis_id per observation:
head(hru_ls)
  jday mon day   yr unit gis_id    name sedyld_tha sedorgn_kgha sedorgp_kgha surqno3_kgha lat3no3_kgha
1  365  12  31 1993    1      1 hru0001      0.065        0.861        0.171        0.095            0
2  365  12  31 1993    2      2 hru0002      0.111        1.423        0.122        0.233            0
3  365  12  31 1993    3      3 hru0003      0.024        0.186        0.016        0.071            0
4  365  12  31 1993    4      4 hru0004      6.686       16.298        1.040        0.012            0
5  365  12  31 1993    5      5 hru0005     37.220      114.683        6.740        0.191            0
6  365  12  31 1993    6      6 hru0006      6.597       30.949        1.856        0.021            0
  surqsolp_kgha usle_tons sedmin tileno3
1         0.137         0  0.010       0
2         0.041         0  0.009       0
3         0.014         0  0.001       0
4         0.000         0  0.175       0
5         0.000         0  0.700       0
6         0.000         0  0.227       0
There are multiple records for each unit (20 years of data).
I would like to merge the geometry data of my shapefile into my data table. I've done this before with sp::merge, I think, but with a shapefile that did not have multiple IDs per geometry/feature.
Is there a way to condition the merging so that each observation in the data table gets the corresponding geometry whenever its id appears among the values in the gis_id field of the shapefile?

This is a very intriguing question, so I gave it a shot. My answer is probably not the quickest or most concise way of going about this, but it works (at least for your sample data). Note that this approach is fairly sensitive to the formatting of the data in shapefile$gis_id (see the regex).
# your spatial data
shapefile <- data.frame(feature = 1:6,
                        gis_id = c("1", "2", "3,4,5", "6,8", "7", "9,10,13"))
# your tabular data
hru_ls <- data.frame(unit = 1:6, gis_id = as.character(1:6))
# loop over all gis_ids in your tabular data
gis_ids <- unique(hru_ls$gis_id)
for (id in gis_ids) {
  # regex that matches the id only as a whole comma-separated token
  id_regex <- paste0("(,|^)", id, "(,|$)")
  # row(s) in shapefile whose gis_id contains this id (grepl is vectorized)
  rowmatch <- which(grepl(pattern = id_regex, x = shapefile$gis_id))
  # return the shapefile feature id that matches the tabular gis_id
  hru_ls[hru_ls$gis_id == id, "gis_feature_id"] <- shapefile[rowmatch, "feature"]
}
Since you didn't provide the geometry fields in your question, I just matched on Feature in your spatial data. You could either add an additional step that merges based on Feature, or replace "feature" in shapefile[rowmatch, "feature"] with your geometry fields.
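An alternative, assuming the tidyr package is available (not something the question requires): expand the comma-separated gis_id strings into one row per id with separate_rows(), after which an ordinary merge on the now-unique key does the rest. A minimal sketch on the sample data:

```r
library(tidyr)

# sample data mirroring the question
shapefile <- data.frame(feature = 1:6,
                        gis_id = c("1", "2", "3,4,5", "6,8", "7", "9,10,13"))
hru_ls <- data.frame(unit = 1:6, gis_id = as.character(1:6))

# one row per individual id: feature 3 becomes rows "3", "4" and "5", etc.
lookup <- separate_rows(shapefile, gis_id, sep = ",")

# gis_id is now unique in lookup, so a plain left join attaches the feature
merged <- merge(hru_ls, lookup, by = "gis_id", all.x = TRUE)
```

With a real sf object the same idea should work on the attribute table, with the geometry column carried along by the join.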

Related

Mutation of non-conformable arrays

library(boot)
install.packages("AMORE")
library(AMORE)
l.data <- nrow(melanoma)
set.seed(5)
idxTrain <- sample(1:l.data, 100)
idxTest <- setdiff(1:l.data, idxTrain)
set.seed(3)
net <- newff(n.neurons = c(6, 6, 3),
             learning.rate.global = 0.02,
             momentum.global = 0.5,
             hidden.layer = "sigmoid",
             output.layer = "purelin",
             method = "ADAPTgdwm",
             error.criterium = "LMS")
result <- train(net,
                melanoma[idxTrain, -2],
                melanoma$status,
                error.criterium = "LMS",
                report = TRUE,
                show.step = 10,
                n.shows = 800)
The problem I have is that train() stops with the error: "target - non-conformable arrays".
I know that the problem is with melanoma$status, but I have no idea how to alter the data accordingly. Any ideas? Here are a few sample rows of the data (in case you don't use the boot package in RStudio).
melanoma:
time status sex age year thickness ulcer
1 10 3 1 76 1972 6.76 1
2 30 3 1 56 1968 0.65 0
3 35 2 1 41 1977 1.34 0
4 99 3 0 71 1968 2.90 0
5 185 1 1 52 1965 12.08 1
Your target variable should first be restricted to the training indices. Moreover, the target should have a number of columns equal to the number of classes, one-hot encoded. Something like this:
net <- newff(n.neurons = c(6, 6, 3),
             learning.rate.global = 0.02,
             momentum.global = 0.5,
             hidden.layer = "sigmoid",
             output.layer = "purelin",
             method = "ADAPTgdwm",
             error.criterium = "LMS")
# one-hot encode status: one column per class, 1 in the column of the observed class
Target <- matrix(data = 0, nrow = length(idxTrain), ncol = 3)
status_mat <- matrix(nrow = length(idxTrain), ncol = 2)
status_mat[, 1] <- 1:length(idxTrain)          # row index
status_mat[, 2] <- melanoma$status[idxTrain]   # class label (1, 2 or 3)
# linear (column-major) indexing: element (row i, column class) of Target is set to 1
Target[(status_mat[, 2] - 1) * length(idxTrain) + status_mat[, 1]] <- 1
result <- train(net,
                melanoma[idxTrain, -2],
                Target,
                error.criterium = "LMS",
                report = TRUE,
                show.step = 10,
                n.shows = 800)
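The linear-indexing trick above can also be written more directly with outer(). A minimal sketch of the same one-hot encoding, assuming status takes the values 1, 2 and 3 (status_train below is a made-up stand-in for melanoma$status[idxTrain]):

```r
status_train <- c(3, 1, 2, 3, 2)  # stand-in for melanoma$status[idxTrain]

# compare each status against each class label; TRUE/FALSE becomes 1/0
Target <- 1 * outer(status_train, 1:3, `==`)
```

Each row of Target then has a single 1 in the column of its class, which is the shape a 3-neuron output layer expects.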

R - Subtracting the mean of a group from each element of that group in a dataframe

I am trying to merge a vector of means into a dataframe.
My dataframe looks like this: Data = growth
I first calculated the means for the different groups (1 group = Population + Temperature + Size + Replicat) using this command:
means <- aggregate(TL ~ Population + Temperature + Replicat + Size + Measurement, data = growth, mean)
Then I selected the means for Measurement 1 as follows, as I am only interested in these means:
meansT0 <- means[which(means$Measurement == "1"), ]
Now, I would like to merge this vector of mean values into my dataframe (= growth) so that the right mean of each group corresponds to the right part of the dataframe.
The goal is then to subtract the mean of each group (at Measurement 1) from each element of the dataframe based on its group (and for all other Measurements except Measurement 1). Maybe there is no need to add the means column to the dataframe? Do you know any command to do that?
Update [27.06.18]:
I made up this simplified dataframe; I hope it helps understanding.
So, what I want is to subtract, for each individual in the dataframe and for each measurement (here only Measurements 1, 2 and 3; normally I have more), the mean of its belonging group at MEASUREMENT 1.
So, if I get the means by group (1 group = Population + Temperature + Measurement):
means <- aggregate(TL ~ Population + Temperature + Measurement, data = growth, mean)
means
I get these mean values (in this example):
Population Temperature Measurement       TL
JUB        15          1           12.00000
JUB        20          1           15.66667
JUB        15          2           17.66667
JUB        20          2           18.66667
JUB        15          3           23.66667
JUB        20          3           24.33333
We are only interested in the means at MEASUREMENT 1. For each individual in the dataframe, I want to subtract the mean of its belonging group at Measurement 1. In this example (see the dataframe built with the R command below):
- for the group JUB + 15 + Measurement 1, mean = 12
- for the group JUB + 20 + Measurement 1, mean = 15.66
growth <- data.frame(
  Population = c("JUB", "JUB", "JUB", "JUB", "JUB", "JUB", "JUB", "JUB", "JUB",
                 "JUB", "JUB", "JUB", "JUB", "JUB", "JUB", "JUB", "JUB", "JUB"),
  Measurement = c("1", "1", "1", "1", "1", "1", "2", "2", "2", "2", "2", "2",
                  "3", "3", "3", "3", "3", "3"),
  Temperature = c("15", "15", "15", "20", "20", "20", "15", "15", "15", "20",
                  "20", "20", "15", "15", "15", "20", "20", "20"),
  TL = c(11, 12, 13, 15, 18, 14, 16, 17, 20, 21, 19, 16, 25, 22, 24, 26, 24, 23),
  New_TL = c("11-12", "12-12", "13-12", "15-15.66", "18-15.66", "14-15.66",
             "16-12", "17-12", "20-12", "21-15.66", "19-15.66", "16-15.66",
             "25-12", "22-12", "24-12", "26-15.66", "24-15.66", "23-15.66"))
print(growth)
I hope that with this you can understand better what I am trying to do. I have a lot of data, and if I have to do this manually it will take a lot of time and increase the risk of mistakes.
Here is an option with tidyverse. After grouping by the group columns, use mutate_at, specifying the columns of interest, and take the difference between each such column (.) and the mean of its Measurement-1 values.
library(tidyverse)
growth %>%
  group_by(Population, Temperature, Replicat, Size) %>%
  mutate_at(vars(HL, TL),
            funs(MeanGroupDiff = . - mean(.[Measurement == 1])))
Using a reproducible example with the mtcars dataset:
data(mtcars)
mtcars %>%
  group_by(cyl, vs) %>%
  mutate_at(vars(mpg, disp),
            funs(MeanGroupDiff = . - mean(.[am == 1])))
Have you considered using the data.table package? It is very well suited for the kind of grouping, filtering, joining, and aggregation operations you describe, and might save you a great deal of time in the long run.
The code below shows how a workflow similar to the one you described, but based on the built-in mtcars data set, might look using data.table.
To be clear, there are also ways to do what you describe using base R as well as other packages like dplyr; I'm just throwing out a suggestion based on what I have found most useful for my personal work.
library(data.table)
## Convert mtcars to a data.table
## only include columns `mpg`, `cyl`, `am` and `gear` for brevity
DT <- as.data.table(mtcars)[, .(mpg, cyl, am, gear)]
## Take a subset where `cyl` is equal to 6
DT <- DT[cyl == 6]
## Calculate grouped mean based on `gear` and `am` as grouping variables
DT[, group_mpg_avg := mean(mpg), keyby = .(gear, am)]
## Calculate each row's difference from the group mean
DT[, mpg_diff_from_group := mpg - group_mpg_avg]
print(DT)
# mpg cyl am gear group_mpg_avg mpg_diff_from_group
# 1: 21.4 6 0 3 19.75 1.65
# 2: 18.1 6 0 3 19.75 -1.65
# 3: 19.2 6 0 4 18.50 0.70
# 4: 17.8 6 0 4 18.50 -0.70
# 5: 21.0 6 1 4 21.00 0.00
# 6: 21.0 6 1 4 21.00 0.00
# 7: 19.7 6 1 5 19.70 0.00
Consider by to subset your data frame by the grouping factors (but leave out Measurement in order to compare group 1 with all the other groups). Then run an ifelse conditional calculation for the needed columns. Since by returns a list of data frames, bind them all together outside with do.call():
df_list <- by(growth, growth[, c("Population", "Temperature")], function(sub) {
  # TL CORRECTION
  sub$Correct_TL <- ifelse(sub$Measurement != 1,
                           sub$TL - mean(subset(sub, Measurement == 1)$TL),
                           sub$TL)
  # ADD OTHER CORRECTIONS
  return(sub)
})
final_df <- do.call(rbind, df_list)
Output (using posted data)
final_df
# Population Measurement Temperature TL New_TL Correct_TL
# 1 JUB 1 15 11 11-12 11.0000000
# 2 JUB 1 15 12 12-12 12.0000000
# 3 JUB 1 15 13 13-12 13.0000000
# 7 JUB 2 15 16 16-12 4.0000000
# 8 JUB 2 15 17 17-12 5.0000000
# 9 JUB 2 15 20 20-12 8.0000000
# 13 JUB 3 15 25 25-12 13.0000000
# 14 JUB 3 15 22 22-12 10.0000000
# 15 JUB 3 15 24 24-12 12.0000000
# 4 JUB 1 20 15 15-15.66 15.0000000
# 5 JUB 1 20 18 18-15.66 18.0000000
# 6 JUB 1 20 14 14-15.66 14.0000000
# 10 JUB 2 20 21 21-15.66 5.3333333
# 11 JUB 2 20 19 19-15.66 3.3333333
# 12 JUB 2 20 16 16-15.66 0.3333333
# 16 JUB 3 20 26 26-15.66 10.3333333
# 17 JUB 3 20 24 24-15.66 8.3333333
# 18 JUB 3 20 23 23-15.66 7.3333333
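For completeness, the same Measurement-1 centering can be done in base R without by(), using a named lookup of the group means. A sketch against the growth example above (Correct_TL is the same column the by() answer builds; m1 is a helper name introduced here):

```r
growth <- data.frame(
  Population = "JUB",
  Measurement = rep(c("1", "2", "3"), each = 6),
  Temperature = rep(c("15", "15", "15", "20", "20", "20"), times = 3),
  TL = c(11, 12, 13, 15, 18, 14, 16, 17, 20, 21, 19, 16, 25, 22, 24, 26, 24, 23))

# mean TL at Measurement 1 for every Population/Temperature group, as a named vector
m1 <- with(subset(growth, Measurement == "1"),
           tapply(TL, paste(Population, Temperature), mean))

# subtract the group's Measurement-1 mean everywhere except at Measurement 1 itself
growth$Correct_TL <- ifelse(growth$Measurement == "1",
                            growth$TL,
                            growth$TL - m1[paste(growth$Population, growth$Temperature)])
```

This avoids re-ordering the rows, which the do.call(rbind, ...) step does.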

Big dataframe: "repeated" t-tests between groups for thousands of factors

I have read a lot of posts related to data wrangling and "repeated" t-tests, but I can't figure out how to achieve it in my case.
You can get my example dataset for StackOverflow here: https://www.dropbox.com/s/0b618fs1jjnuzbg/dataset.example.stckovflw.txt?dl=0
I have a big dataframe of gene expression like:
> b<-read.delim("dataset.example.stckovflw.txt")
> head(b)
animal gen condition tissue LogFC
1 animalcontrol1 kjhss1 control brain 7.129283
2 animalcontrol1 sdth2 control brain 7.179909
3 animalcontrol1 sgdhstjh20 control brain 9.353147
4 animalcontrol1 jdygfjgdkydg21 control brain 6.459432
5 animalcontrol1 shfjdfyjydg22 control brain 9.372865
6 animalcontrol1 jdyjkdg23 control brain 9.541097
> str(b)
'data.frame': 21507 obs. of 5 variables:
$ animal : Factor w/ 25 levels "animalcontrol1",..: 1 1 1 1 1 1 1 1 1 1 ...
$ gen : Factor w/ 1131 levels "dghwg1041","dghwg1086",..: 480 761 787 360 863 385 133 888 563 738 ...
$ condition: Factor w/ 5 levels "control","treatmentA",..: 1 1 1 1 1 1 1 1 1 1 ...
$ tissue : Factor w/ 2 levels "brain","heart": 1 1 1 1 1 1 1 1 1 1 ...
$ LogFC : num 7.13 7.18 9.35 6.46 9.37 ...
Each group has 5 animals, and each animal has many gens quantified. (Each animal may have a different set of quantified gens, but many of the gens are shared between animals and groups.)
I would like to perform a t-test for each gen between my treated groups (A, B, C or D) and the controls. The result should be presented as a table containing the p-value for each gen in each group.
Because I have so many gens (thousands), I cannot subset each gen by hand.
Do you know how I could automate the procedure?
I was thinking about a loop, but I am absolutely not sure it could achieve what I want or how to proceed.
Also, I was looking more at these posts using the apply function : Apply t-test on many columns in a dataframe split by factor and Looping through t.tests for data frame subsets in r
################ additional information after reading the first comments and answers:
@andrew_reece: Thank you very much for this. It is almost exactly what I was looking for. However, I can't find a way to do it with a t-test. The ANOVA is interesting information, but then I need to know which of the treated groups is/are significantly different from my controls. I would also need to know which treated groups are significantly different from each other, two by two.
I have been trying to use your code, changing the aov(..) to t.test(...). For that, I first ran subset(b, condition == "control" | condition == "treatmentA") in order to compare only two groups. However, when exporting the result table to a csv file, the table is unreadable (no gen names, no p-values, only numbers). I will keep searching for a way to do it properly, but for now I'm stuck.
@42-:
Thank you very much for these tips. This is just an example dataset; let's assume we do have to use individual t-tests.
This is a very useful start for exploring my data. For example, I have been trying to represent my data with Venn diagrams. I can write my code, but it is somewhat off the initial topic. Also, I don't know how to summarize, in a less tedious way, the shared "genes" detected in each combination of conditions, so I have simplified to only 3 conditions.
# Visualisation of shared genes by VennDiagrams :
# let's simplify and consider only 3 conditions :
b<-read.delim("dataset.example.stckovflw.txt")
b<- subset(b, condition == "control" | condition == "treatmentA" | condition == "treatmentB")
b1<-table(b$gen, b$condition)
b1
b2 <- data.frame(b1)
b3 <- subset(b2, Freq > 2) # keep only genes quantified in more than 2 animals per group
b3
b4 <- within(b3, {
  Freq <- ifelse(Freq > 1, 1, 0)
}) # for those observations the gene has been detected, so recode to 1 regardless of the frequency of occurrence (> 2)
b4
b5 <- table(b4$Var1, b4$Var2)
write.csv(b5, file = "b5.csv")
# make an intermediate .txt file (just add the name of the first column title manually)
# so now we have info
bb5<-read.delim("bb5.txt")
nrow(subset(bb5, control == 1))
nrow(subset(bb5, treatmentA == 1))
nrow(subset(bb5, treatmentB == 1))
nrow(subset(bb5, control == 1 & treatmentA == 1))
nrow(subset(bb5, control == 1 & treatmentB == 1))
nrow(subset(bb5, treatmentA == 1 & treatmentB == 1))
nrow(subset(bb5, control == 1 & treatmentA == 1 & treatmentB == 1))
library(grid)
library(futile.logger)
library(VennDiagram)
venn.plot <- draw.triple.venn(area1 = 1005,
                              area2 = 927,
                              area3 = 943,
                              n12 = 843,
                              n23 = 861,
                              n13 = 866,
                              n123 = 794,
                              category = c("controls", "treatmentA", "treatmentB"),
                              fill = c("red", "yellow", "blue"),
                              cex = 2,
                              cat.cex = 2,
                              lwd = 6,
                              lty = 'dashed',
                              fontface = "bold",
                              fontfamily = "sans",
                              cat.fontface = "bold",
                              cat.default.pos = "outer",
                              cat.pos = c(-27, 27, 135),
                              cat.dist = c(0.055, 0.055, 0.085),
                              cat.fontfamily = "sans",
                              rotation = 1)
Update (per OP comments):
Pairwise comparisons across condition can be managed with an ANOVA post-hoc test, such as Tukey's Honest Significant Difference (stats::TukeyHSD()). (There are others; this is just one way to demonstrate the approach.)
results <- b %>%
  mutate(condition = factor(condition)) %>%
  group_by(gen) %>%
  filter(length(unique(condition)) >= 2) %>%
  nest() %>%
  mutate(
    model = map(data, ~ TukeyHSD(aov(LogFC ~ condition, data = .x))),
    coef = map(model, ~ broom::tidy(.x))
  ) %>%
  unnest(coef) %>%
  select(-term)
results
# A tibble: 7,118 x 6
gen comparison estimate conf.low conf.high adj.p.value
<chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 kjhss1 treatmentA-control 1.58 -20.3 23.5 0.997
2 kjhss1 treatmentC-control -3.71 -25.6 18.2 0.962
3 kjhss1 treatmentD-control 0.240 -21.7 22.2 1.000
4 kjhss1 treatmentC-treatmentA -5.29 -27.2 16.6 0.899
5 kjhss1 treatmentD-treatmentA -1.34 -23.3 20.6 0.998
6 kjhss1 treatmentD-treatmentC 3.95 -18.0 25.9 0.954
7 sdth2 treatmentC-control -1.02 -21.7 19.7 0.991
8 sdth2 treatmentD-control 3.25 -17.5 24.0 0.909
9 sdth2 treatmentD-treatmentC 4.27 -16.5 25.0 0.849
10 sgdhstjh20 treatmentC-control -7.48 -30.4 15.5 0.669
# ... with 7,108 more rows
Original answer
You can use tidyr::nest() and purrr::map() to accomplish the technical task of grouping by gen, and then conducting statistical tests comparing the effects of condition (presumably with LogFC as your DV).
But I agree with the other comments that there are issues with your statistical approach here that bear careful consideration - stats.stackexchange.com is a better forum for those questions.
For the purpose of demonstration, I've used an ANOVA instead of a t-test, since there are frequently more than two conditions per gen grouping. This shouldn't really change the intuition behind the implementation, however.
require(tidyverse)
results <- b %>%
  mutate(condition = factor(condition)) %>%
  group_by(gen) %>%
  filter(length(unique(condition)) >= 2) %>%
  nest() %>%
  mutate(
    model = map(data, ~ aov(LogFC ~ condition, data = .x)),
    coef = map(model, ~ broom::tidy(.x))
  ) %>%
  unnest(coef)
A few cosmetic trimmings to get closer to your original vision (of just a table with gen and p-values), although note that this leaves a lot of important information out, and I'm not advising you actually limit your results in this way.
results %>%
  filter(term != "Residuals") %>%
  select(gen, df, statistic, p.value)
# A tibble: 1,111 x 4
gen df statistic p.value
<chr> <dbl> <dbl> <dbl>
1 kjhss1 3. 0.175 0.912
2 sdth2 2. 0.165 0.850
3 sgdhstjh20 2. 0.440 0.654
4 jdygfjgdkydg21 2. 0.267 0.770
5 shfjdfyjydg22 2. 0.632 0.548
6 jdyjkdg23 2. 0.792 0.477
7 fckjhghw24 2. 0.790 0.478
8 shsnv25 2. 1.15 0.354
9 qeifyvj26 2. 0.588 0.573
10 qsiubx27 2. 1.14 0.359
# ... with 1,101 more rows
Note: I can't take much credit for this approach - it's taken almost verbatim from an example I saw Hadley give at a talk last night on purrr. Here's a link to the public repo of the demo code he used, which covers a similar use case.
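If you do want per-gen t-tests of one treatment against control (rather than the ANOVA above), a minimal dplyr sketch follows. The toy b here is made up to mimic the OP's columns; with the real data you would drop the toy block and repeat the pipe per treatment level:

```r
library(dplyr)

set.seed(1)
# toy stand-in for b, with the same columns the question shows
b <- data.frame(
  gen = rep(c("kjhss1", "sdth2"), each = 10),
  condition = rep(rep(c("control", "treatmentA"), each = 5), times = 2),
  LogFC = c(rnorm(5, 7), rnorm(5, 9), rnorm(5, 6), rnorm(5, 6)))

# one Welch t-test per gen: treatmentA vs control
pvals <- b %>%
  group_by(gen) %>%
  summarise(p.value = t.test(LogFC[condition == "treatmentA"],
                             LogFC[condition == "control"])$p.value)
```

For gens where one group has fewer than two observations, t.test() will error, so in practice you would filter those out first, much as the nest()-based answer filters on the number of conditions.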
You have 25 animals in 5 different treatment groups with a varying number of gen-values (presumably activities of genetic probes) in two different tissues:
table(b$animal, b$condition)
control treatmentA treatmentB treatmentC treatmentD
animalcontrol1 1005 0 0 0 0
animalcontrol2 857 0 0 0 0
animalcontrol3 959 0 0 0 0
animalcontrol4 928 0 0 0 0
animalcontrol5 1005 0 0 0 0
animaltreatmentA1 0 927 0 0 0
animaltreatmentA2 0 883 0 0 0
animaltreatmentA3 0 908 0 0 0
animaltreatmentA4 0 861 0 0 0
animaltreatmentA5 0 927 0 0 0
animaltreatmentB1 0 0 943 0 0
animaltreatmentB2 0 0 841 0 0
animaltreatmentB3 0 0 943 0 0
animaltreatmentB4 0 0 910 0 0
animaltreatmentB5 0 0 943 0 0
animaltreatmentC1 0 0 0 742 0
animaltreatmentC2 0 0 0 724 0
animaltreatmentC3 0 0 0 702 0
animaltreatmentC4 0 0 0 698 0
animaltreatmentC5 0 0 0 742 0
animaltreatmentD1 0 0 0 0 844
animaltreatmentD2 0 0 0 0 776
animaltreatmentD3 0 0 0 0 812
animaltreatmentD4 0 0 0 0 783
animaltreatmentD5 0 0 0 0 844
Agree you need to "automate" this in some fashion, but I think you are in need of a more general strategy for statistical inference rather than trying to pick out relationships by applying individual t-tests. You might consider either mixed models or one of the random forest variants. I think you should be discussing this with a statistician. As an example of where your hopes are not going to be met, take a look at the information you have about the first "gen" among the 1131 values:
str( b[ b$gen == "dghwg1041", ])
'data.frame': 13 obs. of 5 variables:
$ animal : Factor w/ 25 levels "animalcontrol1",..: 1 6 11 2 7 12 3 8 13 14 ...
$ gen : Factor w/ 1131 levels "dghwg1041","dghwg1086",..: 1 1 1 1 1 1 1 1 1 1 ...
$ condition: Factor w/ 5 levels "control","treatmentA",..: 1 2 3 1 2 3 1 2 3 3 ...
$ tissue : Factor w/ 2 levels "brain","heart": 1 1 1 1 1 1 1 1 1 1 ...
$ LogFC : num 4.34 2.98 4.44 3.87 2.65 ...
You do have a fair number with "complete representation":
gen_length <- ave(b$LogFC, b$gen, FUN=length)
Hmisc::describe(gen_length)
#--------------
gen_length
n missing distinct Info Mean Gmd .05 .10
21507 0 18 0.976 20.32 4.802 13 14
.25 .50 .75 .90 .95
18 20 24 25 25
Value 5 8 9 10 12 13 14 15 16 17
Frequency 100 48 288 270 84 624 924 2220 64 527
Proportion 0.005 0.002 0.013 0.013 0.004 0.029 0.043 0.103 0.003 0.025
Value 18 19 20 21 22 23 24 25
Frequency 666 2223 3840 42 220 1058 3384 4925
Proportion 0.031 0.103 0.179 0.002 0.010 0.049 0.157 0.229
You might start by looking at all the "gen"s that have complete data:
gen_tbl <- table(b$gen)  # counts per gen (needed below)
head( gen_tbl[ gen_tbl == 25 ], 25)
#------------------
dghwg1131 dghwg546 dghwg591 dghwg636 dghwg681
25 25 25 25 25
dghwg726 dgkuck196 dgkuck286 dgkuck421 dgkuck691
25 25 25 25 25
dgkuck736 dgkukdgse197 dgkukdgse287 dgkukdgse422 dgkukdgse692
25 25 25 25 25
dgkukdgse737 djh592 djh637 djh682 djh727
25 25 25 25 25
dkgkjd327 dkgkjd642 dkgkjd687 dkgkjd732 fckjhghw204
25 25 25 25 25

Saving a loop's output in a vector in R

I have a .csv file which consists of 4 columns and 120 rows.
I'm attempting to go through every row, and wherever I see a "1" in the third column (which here is called "Dam"), I want to save that row in a matrix called "Dam.one".
Here's my code so far:
DamType = c("Dam.one", "Dam.two", "Dam.three", "NoDam.one", "NoDam.two", "NoDam.three")
for (i in 1:120) {
  if (mercury.raw[i, ]["Dam"] == 1) {
    if (mercury.raw[i, ]["Type"] == 1) {
      DamType["Dam.one"] <- mercury.raw[i, ]
    }
  }
}
These are the first 6 entries of the data set:
> mercury.raw
Lake Mercury Dam Type
1 ALLEN.P 1.080 1 1
2 ALLIGATOR.P 0.025 1 1
3 A.SAGUNTICOOK.L 0.570 0 2
4 BALCH&STUMP.PONDS 0.770 0 2
5 BASKAHEGAN.L 0.790 1 2
6 BAUNEAG.BEG.L 0.750 0 2
I want DamType["Dam.one"] to be equal to:
Lake Mercury Dam Type
1 ALLEN.P 1.080 1 1
2 ALLIGATOR.P 0.025 1 1
I don't know what is wrong with it.
Any help is appreciated.
A matrix cannot be stored in a vector. A list can store multiple matrices (or data frames in this case). This looks like you are trying to subset your data and a for loop is not required.
DamType <- list()
DamType[["Dam.one"]] <- mercury.raw[mercury.raw$Dam == 1 & mercury.raw$Type == 1, ]
> DamType$Dam.one
Lake Mercury Dam Type
1 ALLEN.P 1.080 1 1
2 ALLIGATOR.P 0.025 1 1
I suggest getting more familiar with R and common packages, such as dplyr.
library(dplyr)
dam_grouped <- mercury.raw %>% group_by(Dam)
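Along the same lines, base R's split() builds the whole list of subsets in one call. A sketch using the six sample rows from the question (groups is a helper name introduced here):

```r
mercury.raw <- data.frame(
  Lake = c("ALLEN.P", "ALLIGATOR.P", "A.SAGUNTICOOK.L",
           "BALCH&STUMP.PONDS", "BASKAHEGAN.L", "BAUNEAG.BEG.L"),
  Mercury = c(1.080, 0.025, 0.570, 0.770, 0.790, 0.750),
  Dam = c(1, 1, 0, 0, 1, 0),
  Type = c(1, 1, 2, 2, 2, 2))

# one data frame per observed Dam/Type combination, named "Dam.Type"
groups <- split(mercury.raw, interaction(mercury.raw$Dam, mercury.raw$Type, drop = TRUE))
```

groups[["1.1"]] then holds the two dammed Type-1 lakes, the equivalent of DamType$Dam.one above.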

Adding row in R with next day and 0 in each column

I have a data.frame with 4 columns. The first column is the_day, running from 11/1/15 to 11/30/15. The next 3 have values corresponding to each day based on amount_raised. However, some dates are missing because there were no values in the next 3 columns (no money was raised).
For example, 11/3/15 is missing. What I want to do is add a row between 11/2/15 and 11/4/15 with that date and zeros in the next 3 columns, so it would read like this:
11/3/2015 0 0 0
Do I have to create a vector and then add it into the existing data.frame? I feel like there has to be a quicker way.
This should work:
date_seq <- seq(min(df$the_day), max(df$the_day), by = 1)
rbind(df,
      cbind(the_day = as.character(date_seq[!date_seq %in% df$the_day]),
            inf = "0", specified = "0", both = "0"))
# the_day inf specified both
# 1 2015-11-02 1.32 156 157.32
# 2 2015-11-04 4.25 40 44.25
# 3 2015-11-05 3.25 25 28.25
# 4 2015-11-06 1 15 16
# 5 2015-11-07 4.75 10 14.75
# 6 2015-11-08 32 0 32
# 7 2015-11-03 0 0 0
If you want to sort it according to the_day, assign the data frame to a variable and use the order function:
ans <- rbind(df,
             cbind(the_day = as.character(date_seq[!date_seq %in% df$the_day]),
                   inf = "0", specified = "0", both = "0"))
ans[order(ans$the_day), ]
# the_day inf specified both
# 1 2015-11-02 1.32 156 157.32
# 7 2015-11-03 0 0 0
# 2 2015-11-04 4.25 40 44.25
# 3 2015-11-05 3.25 25 28.25
# 4 2015-11-06 1 15 16
# 5 2015-11-07 4.75 10 14.75
# 6 2015-11-08 32 0 32
data.frames are not efficient to work with row-wise internally. I would suggest something along the following lines:
create an empty (zero) 30x3 matrix. This will hold your amount_raised.
create a complete sequence of dates from 11/1 till 11/30
for each existing date, find its match in the complete sequence (use the match() function)
copy the corresponding line from your data frame to the matched line in the matrix.
Finally, make a new data frame out of the new sequence and the matrix.
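If the_day is stored as a Date, the whole fill-in can also be done in one step with tidyr's complete(). A sketch with a small made-up df using the column names from the first answer:

```r
library(tidyr)

# made-up data with a gap on 2015-11-03
df <- data.frame(
  the_day = as.Date(c("2015-11-02", "2015-11-04", "2015-11-05")),
  inf = c(1.32, 4.25, 3.25),
  specified = c(156, 40, 25),
  both = c(157.32, 44.25, 28.25))

# insert every missing day and fill the three value columns with 0
full <- complete(df,
                 the_day = seq(min(the_day), max(the_day), by = "day"),
                 fill = list(inf = 0, specified = 0, both = 0))
```

Unlike the rbind()/cbind() approach, this keeps the value columns numeric and returns the rows already in date order.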
