I track various information about California water on a daily basis. The people before me did this by manually entering data sourced from websites. I have begun to automate the process in R. It has gone well so far using SelectorGadget for pages like https://cdec.water.ca.gov/reportapp/javareports?name=RES
However, I am having trouble with this report since it is all text:
https://water.ca.gov/-/media/DWR-Website/Web-Pages/Programs/State-Water-Project/Operations-And-Maintenance/Files/Operations-Control-Office/Project-Wide-Operations/Dispatchers-Monday-Water-Report.txt?la=en&hash=B8C874426999D484F7CF1E9821EE9D8C6896CF1E
I have tried following different text-mining tutorials step by step but am still really confused by this task.
I have also tried converting it to a PDF and using PDF tools, but have not been able to achieve my goal.
Any help would be appreciated.
Thanks,
Ethan James W
library(httr)
library(stringi)
res <- httr::GET("https://water.ca.gov/-/media/DWR-Website/Web-Pages/Programs/State-Water-Project/Operations-And-Maintenance/Files/Operations-Control-Office/Project-Wide-Operations/Dispatchers-Monday-Water-Report.txt?la=en&hash=B8C874426999D484F7CF1E9821EE9D8C6896CF1E")
l <- stri_split_lines(content(res))[[1]]
page_breaks <- which(stri_detect_fixed(l, "SUMMARY OF SWP"))
# target page 1
page_one <- l[1:(page_breaks[2]-1)]
# find all the records on the page
recs <- paste0(page_one[stri_detect_regex(page_one, "^[[:alpha:]].*[[:digit:]]\\.")], collapse="\n")
# read it in as a fixed-width text file (b/c it really kinda is)
read.fwf(
  textConnection(recs),
  widths = c(10, 7, 8, 7, 7, 8, 8, 5, 7, 6, 7),
  stringsAsFactors = FALSE
) -> xdf
# clean up the columns
xdf[] <- lapply(xdf, stri_trim_both)
# replace runs of "."s and the "DCTOT" string with "NA" so the type conversion works
xdf[] <- lapply(xdf, function(x) ifelse(grepl("\\.\\.|DCTOT", x), "NA", x))
xdf <- type.convert(xdf)
colnames(xdf) <- c("reservoir", "abs_max_elev", "abs_max_stor", "norm_min_elev", "norm_min_stor", "elev", "stor", "evap", "chng", "net_rel", "inflow")
xdf$reservoir <- as.character(xdf$reservoir)
Which gives us:
xdf
## reservoir abs_max_elev abs_max_stor norm_min_elev norm_min_stor elev stor evap chng net_rel inflow
## 1 FRENCHMN 5588.0 55475 5560.00 21472 5578.67 41922 NA -53 NA NA
## 2 ANTELOPE 5002.0 22564 4990.00 12971 4994.64 16306 NA -46 NA NA
## 3 DAVIS 5775.0 84371 5760.00 35675 5770.22 66299 NA -106 NA NA
## 4 OROVILLE 901.0 3553405 640.00 852196 702.69 1275280 249 -4792 6018 1475
## 5 F/B 225.0 11768 221.00 9350 224.52 11467 NA -106 NA NA
## 6 DIV 225.0 13353 221.00 12091 224.58 13217 NA -48 NA NA
## 7 F/B+DIV 225.0 25120 221.00 21441 NA 24684 NA -154 NA NA
## 8 AFTERBAY 136.0 54906 124.00 15156 132.73 41822 NA -263 5372 NA
## 9 CLIF CT 5.0 29082 -2.00 13965 -0.72 16714 NA 194 NA 5943
## 10 BETHANY 243.5 4894 241.50 4545 243.00 4806 NA 0 NA NA
## 11 DYER 806.0 545 785.00 90 795.40 299 NA -21 NA NA
## 12 DEL VALLE 703.0 39914 678.00 24777 690.22 31514 NA -122 97 0
## 13 TEHACHAPI 3101.0 545 3097.00 388 3098.22 434 NA -25 NA NA
## 14 TEHAC EAB 3101.0 1232 3085.00 254 3096.64 941 NA -39 NA NA
## 15 QUAIL+LQC 3324.5 8612 3306.50 3564 3318.18 6551 NA -10 0 NA
## 16 PYRAMID 2578.0 169901 2560.00 147680 2574.72 165701 25 -1056 881 0
## 17 ELDRBERRY 1530.0 27681 1490.00 12228 1510.74 19470 NA 805 0 0
## 18 CASTAIC 1513.0 319247 1310.00 33482 1491.48 273616 36 -1520 1432 0
## 19 SILVRWOOD 3355.0 74970 3312.00 39211 3351.41 71511 10 276 1582 107
## 20 DC AFBY 1 1933.0 50 1922.00 18 1932.64 49 NA 0 NA NA
## 21 DC AFBY 2 1930.0 967 1904.50 198 1922.01 696 NA 37 1690 NA
## 22 CRAFTON H 2925.0 292 2905.00 70 2923.60 274 NA -2 NA NA
## 23 PERRIS 1588.0 126841 1555.30 60633 1577.96 104620 21 85 8 NA
## 24 SAN LUIS 543.0 2027835 326.00 79231 470.16 1178789 238 3273 -4099 0
## 25 O'NEILL 224.5 55076 217.50 36843 222.50 49713 NA 2325 NA NA
## 26 LOS BANOS 353.5 34562 296.00 8315 322.87 18331 NA -5 0 0
## 27 L.PANOCHE 670.4 13233 590.00 308 599.60 664 NA 0 0 0
## 28 TRINITY 2370.0 2447656 2145.00 312631 2301.44 1479281 NA -1192 NA NA
## 29 SHASTA 1067.0 4552095 828.00 502004 974.01 2300953 NA -6238 NA NA
## 30 FOLSOM 466.0 976952 327.80 84649 408.50 438744 NA -2053 NA NA
## 31 MELONES 1088.0 2420000 808.00 300000 1031.66 1779744 NA -2370 NA NA
## 32 PINE FLT 951.5 1000000 712.58 100002 771.51 231361 NA 543 508 NA
## 33 MATHEWS 1390.0 182569 1253.80 3546 1352.17 94266 NA 522 NA NA
## 34 SKINNER 1479.0 44405 1393.00 0 1476.02 38485 NA 242 NA NA
## 35 BULLARDS 1956.0 966103 1730.00 230118 1869.01 604827 NA -1310 NA NA
That was the easy one :-)
Most of page 2 is doable in a pretty straightforward manner:
page_two <- l[page_breaks[2]:length(l)]
do.call(
  rbind.data.frame,
  lapply(
    stri_split_fixed(
      stri_replace_all_regex(
        stri_trim_both(page_two[stri_detect_regex(
          stri_trim_both(page_two), # trim blanks
          "^([^[:digit:]]+)([[:digit:]\\.]+)[[:space:]]+([^[:digit:]]+)([[:digit:]\\.]+)$" # find the release rows
        )]),
        "[[:space:]]{2,}", "\t" # make tab-separated fields wherever there are 2+ space breaks
      ), "\t"),
    function(x) {
      if (length(x) > 2) { # one of the lines will only have one record but most have 2
        data.frame(
          facility = c(x[1], x[3]),
          amt = as.numeric(c(x[2], x[4])),
          stringsAsFactors = FALSE
        )
      } else {
        data.frame(
          facility = x[1],
          amt = as.numeric(x[2]),
          stringsAsFactors = FALSE
        )
      }
    })
) -> ydf
Which gives us (sans the nigh useless TOTAL rows):
ydf[!grepl("TOTAL", ydf$facility),]
## facility amt
## 1 KESWICK RELEASE TO RIVER 15386.0
## 2 SHASTA STORAGE WITHDRAWAL 8067.0
## 3 SPRING CREEK RELEASE 0.0
## 4 WHISKYTOWN STORAGE WITHDRAWAL 46.0
## 6 OROVILLE STORAGE WITHDRAWL 5237.0
## 7 CDWR YUBA RIVER # MARYSVILLE 0.0
## 8 FOLSOM STORAGE WITHDRAWAL 1386.0
## 9 LAKE OROVILLE 20.2
## 10 BYRON BETHANY I.D. 32.0
## 11 POWER CANAL 0.0
## 12 SAN LUIS TO SAN FELIPE 465.0
## 13 SUTTER BUTTE 922.0
## 14 O'NEILL FOREBAY 2.0
## 15 LATERAL 0.0
## 16 CASTAIC LAKE 1432.0
## 17 RICHVALE 589.0
## 18 SILVERWOOD LAKE TO CLAWA 7.0
## 19 WESTERN 787.0
## 20 LAKE PERRIS 0.0
## 23 D/S FEATHER R. DIVERSIONS 0.0
## 24 FISH REQUIREMENT 1230.0
## 25 FLOOD CONTROL RELEASE 0.0
## 26 DELTA REQUIREMENT 3629.0
## 27 FEATHER R. RELEASE # RIVER OUTLET 3074.0
## 28 OTHER RELEASE 0.0
But, if you need the deltas or the plant operations data you're on your own.
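Since this needs to run daily, it may help to wrap the fetch-and-split step in a function you can schedule. A minimal sketch, assuming the same URL as above; the function name fetch_report_pages and the stop_for_status guard are my additions:
library(httr)
library(stringi)
# hypothetical helper: fetch the report and return its pages split on the
# "SUMMARY OF SWP" header, ready for the parsing shown above
fetch_report_pages <- function(url) {
  res <- GET(url)
  stop_for_status(res)  # fail loudly if DWR moves or renames the file
  l <- stri_split_lines(content(res, as = "text"))[[1]]
  page_breaks <- which(stri_detect_fixed(l, "SUMMARY OF SWP"))
  list(page_one = l[1:(page_breaks[2] - 1)],
       page_two = l[page_breaks[2]:length(l)])
}
The pieces it returns slot straight into the page-one and page-two code above.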
I am a beginner in R, learning the basics to analyse some biological data. I have 415 .csv files; each is a fungal species. Each file has 5 columns (YEAR, FFD, LFD, RAN, MEAN):
YEAR FFD LFD RAN MEAN
1 1950 NA NA NA NA
2 1951 NA NA NA NA
3 1952 NA NA NA NA
4 1953 NA NA NA NA
5 1954 NA NA NA NA
6 1955 NA NA NA NA
7 1956 NA NA NA NA
8 1957 NA NA NA NA
9 1958 NA NA NA NA
10 1959 140 141 1 140
11 1960 NA NA NA NA
12 1961 NA NA NA NA
13 1962 NA NA NA NA
14 1963 NA NA NA NA
15 1964 NA NA NA NA
16 1965 155 156 1 155
17 1966 NA NA NA NA
18 1967 NA NA NA NA
19 1968 152 153 1 152
20 1969 NA NA NA NA
21 1970 NA NA NA NA
22 1971 161 162 1 161
23 1972 NA NA NA NA
24 1973 143 144 1 143
25 1974 NA NA NA NA
26 1975 NA NA NA NA
27 1976 NA NA NA NA
28 1977 NA NA NA NA
29 1978 NA NA NA NA
30 1979 NA NA NA NA
31 1980 NA NA NA NA
32 1981 NA NA NA NA
33 1982 155 156 1 155
34 1983 NA NA NA NA
35 1984 NA NA NA NA
36 1985 157 158 1 157
37 1986 170 310 140 240
38 1987 173 274 101 232
39 1988 192 236 44 214
40 1989 234 320 86 277
41 1990 172 287 115 213
42 1991 148 287 139 205
43 1992 140 278 138 206
44 1993 152 273 121 216
45 1994 142 319 177 228
46 1995 261 318 57 287
47 1996 247 315 68 285
48 1997 164 270 106 230
49 1998 186 187 1 186
50 1999 235 236 1 235
51 2000 NA NA NA NA
52 2001 309 310 1 309
53 2002 203 308 105 256
54 2003 140 238 98 189
55 2004 204 313 109 267
56 2005 253 313 60 287
57 2006 247 300 53 279
58 2007 185 295 110 225
59 2008 259 260 1 259
60 2009 296 315 19 309
61 2010 230 303 73 275
62 2011 247 248 1 247
63 2012 206 207 1 206
64 2013 NA NA NA NA
65 2014 250 317 67 271
First I would like to get the regression coefficient (the slope of the line) and its significance (p-value) for each of the files.
I can do it individually with:
fruit<-read.csv(file.choose(),header=TRUE)
yr<-fruit[,1]
ffd<-fruit[,2]
res<-lm(ffd~yr)
summary(res)
when I do this for the data, I get:
Call:
lm(formula = ffd ~ yr)
Residuals:
Min 1Q Median 3Q Max
-77.358 -20.858 -5.714 22.494 96.015
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -4162.0710 950.1439 -4.380 0.000119 ***
yr 2.1864 0.4765 4.588 6.55e-05 ***
---
Signif. codes:
0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 38.75 on 32 degrees of freedom
(31 observations deleted due to missingness)
Multiple R-squared: 0.3968, Adjusted R-squared: 0.378
F-statistic: 21.05 on 1 and 32 DF, p-value: 6.549e-05
The only information I need from this at the moment is the regression coefficient (2.1864) and the p-value (6.549e-05)
The perfect output would be if I could get R to cycle through the 415 files, and give an output in the form of a table with 3 columns: filename, regression coefficient, and significance. There would be 415 rows, one for each file.
I would then like to do the same for LFD ~ YEAR, RAN ~ YEAR, and MEAN ~ YEAR. I am hoping that I can easily edit the FFD ~ YEAR code and run it for the other 3 regressions.
The following code will probably work.
I have tested it with your data in two files. The functions that do all the work are these ones:
regrFun <- function(DF){
  # regress the trait (column 2) on YEAR (column 1), matching your lm(ffd ~ yr)
  fit <- lm(DF[[2]] ~ DF[[1]])
  coef(summary(fit))[2, c(1, 4)]  # slope and its p-value
}
regrList <- function(iv, L){
  res <- lapply(seq_along(L), function(i){
    dftmp <- L[[i]]
    cfs <- regrFun(dftmp[c(1, iv)])
    data.frame(file = names(L)[i], Estimate = cfs[1], p.value = cfs[2])
  })
  res <- do.call(rbind, res)
  row.names(res) <- NULL
  res
}
Now read in the data files. In the following code line, substitute a common filename part for "pattern" in the obvious place.
filenames <- list.files(pattern = "pattern")
df_list <- lapply(filenames, read.csv)
names(df_list) <- filenames
And compute the values you want.
results_list <- lapply(2:ncol(df_list[[1]]), regrList, df_list)
names(results_list) <- names(df_list[[1]][-1])
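A quick usage note, assuming your columns are named YEAR, FFD, LFD, RAN, MEAN as in your example: results_list is then a named list with one table per response, which you can inspect or stack.
results_list[["FFD"]]                    # filename, Estimate, p.value for the FFD regressions
combined <- do.call(rbind, results_list) # or bind all four responses into one data.frame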
First I simulate 5 csv files with columns that look like yours:
for(i in 1:5){
  tab = data.frame(
    YEAR = 1950:2014,
    FFD  = rpois(65, 100),
    LFD  = rnorm(65, 100, 10),
    RAN  = rnbinom(65, mu = 100, size = 1),
    MEAN = runif(65, min = 50, max = 150)
  )
  write.csv(tab, paste0("data", i, ".csv"))
}
Now we need a vector of all the csv files in your directory. This will be different for you, but try to create it using the pattern argument:
csvfiles = dir(pattern="data[0-9]*.csv$")
We use three libraries from the tidyverse. Assuming each csv file is not huge, the code below reads in all the files, groups them by their source file, and performs the regression. Note that you can refer to the columns of the data frame directly, without having to rename them:
library(dplyr)
library(purrr)
library(broom)
csvfiles %>%
  map_df(function(i){ df = read.csv(i); df$data = i; df }) %>%
  group_by(data) %>%
  do(tidy(lm(FFD ~ YEAR, data = .))) %>%
  filter(term != "(Intercept)")
# A tibble: 5 x 6
# Groups: data [5]
data term estimate std.error statistic p.value
<chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 data1.csv YEAR -0.0228 0.0731 -0.311 0.756
2 data2.csv YEAR -0.139 0.0573 -2.42 0.0182
3 data3.csv YEAR -0.175 0.0650 -2.70 0.00901
4 data4.csv YEAR -0.0478 0.0628 -0.762 0.449
5 data5.csv YEAR 0.0204 0.0648 0.315 0.754
You can just change the formula inside lm(FFD ~ YEAR, data = .) to get the other regressions.
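If you'd rather not edit the formula by hand four times, here is a sketch (my own extension, using base R's reformulate to build each formula) that reads the files once and runs all four regressions:
library(dplyr)
library(purrr)
library(broom)
# read everything once, tagging each row with its source file
dat <- map_df(csvfiles, function(i){ df <- read.csv(i); df$data <- i; df })
map_df(c("FFD", "LFD", "RAN", "MEAN"), function(resp) {
  dat %>%
    group_by(data) %>%
    do(tidy(lm(reformulate("YEAR", resp), data = .))) %>%  # builds e.g. FFD ~ YEAR
    filter(term != "(Intercept)") %>%
    mutate(response = resp)
})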
data.table version, using StupidWolf's csv files layout and names, featuring the requested fields:
library(data.table)
input.dir = "/home/user/Desktop/My Folder/" # adjust to your needs
csvfiles <- list.files(path=input.dir, full.names=TRUE, pattern=".*data(.*)\\.csv") # adjust pattern
Above, I used a more specific regex pattern, but you could just use pattern = "\\.csv$" if you want to process all csv files in that folder.
# order the files
csvfiles <- csvfiles[order(as.numeric(gsub(".*data(.*)\\.csv", "\\1", csvfiles)))]
# function to read a file and return the requested columns
regrFun <- function(x){
  DT <- fread(x)
  fit <- lm(FFD ~ YEAR, data = DT)
  return(as.list(c(filename = basename(x), coef(summary(fit))[2, c(1, 4)])))
}
# apply function and rename columns
DT <- rbindlist(lapply(csvfiles, regrFun))
setnames(DT, c("filename", "regression coefficient", "significance"))
DT
Result:
filename regression coefficient significance
1: data1.csv -0.113286713286712 0.0874762832713643
2: data2.csv -0.044449300699302 0.457096760642717
3: data3.csv 0.0464597902097902 0.499618510612891
4: data4.csv -0.032473776223776 0.638494798460044
5: data5.csv 0.0562062937062939 0.452955919860998
---
411: data411.csv 0.0381555944055959 0.544185411150829
412: data412.csv -0.0672202797202807 0.314346452751388
413: data413.csv 0.116564685314687 0.0694785724198052
414: data414.csv -0.0908216783216786 0.110811677724832
415: data415.csv -0.0282779720279721 0.638766712090455
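If you want this table on disk afterwards, data.table's fwrite writes it directly (the output filename here is arbitrary):
fwrite(DT, file.path(input.dir, "regression_results.csv"))  # filename is just an example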
You could write a R script that runs on a single file, then run it on every file via the terminal.
The script is simply a .R file with code inside.
To run it on every file you would execute in your terminal something along the lines of (using bash):
for file in $(ls yourDataDirectory); do
Rscript yourScriptFile.R $file >> finalOutput
done
This would run the script in yourScriptFile.R on every file in yourDataDirectory and append the output to finalOutput.
The script code itself would be very similar to the one you already wrote, but instead of file.choose() you would use the argument passed from the command line, as described here, and you would print only the information you're interested in, instead of the whole output of summary.
finalOutput could even be a csv file, if you format the script output correctly.
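For concreteness, a minimal sketch of what yourScriptFile.R might contain; commandArgs() is base R's way to read the command-line argument, and the column names assume the YEAR/FFD layout above:
# yourScriptFile.R -- sketch; reads one csv path from the command line
args <- commandArgs(trailingOnly = TRUE)
fruit <- read.csv(args[1], header = TRUE)
fit <- summary(lm(FFD ~ YEAR, data = fruit))
# emit one csv-formatted row: filename, slope, p-value
cat(args[1], coef(fit)[2, 1], coef(fit)[2, 4], sep = ",")
cat("\n")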
I have a data frame called source that looks something like this
185 2002-07-04 NA NA 20
186 2002-07-05 NA NA 20
187 2002-07-06 NA NA 20
188 2002-07-07 14.400 0.243 20
189 2002-07-08 NA NA 20
190 2002-07-09 NA NA 20
191 2002-07-10 NA NA 20
192 2002-07-11 NA NA 20
193 2002-07-12 NA NA 20
194 2002-07-13 4.550 0.296 20
195 2002-07-14 NA NA 20
196 2002-07-15 NA NA 20
197 2002-07-16 NA NA 20
198 2002-07-17 NA NA 20
199 2002-07-18 NA NA 20
200 2002-07-19 NA 0.237 20
and when I try
> nrow(complete.cases(source))
I only get NULL
Can someone explain why this is the case, and how I can count how many rows there are without NA or NaN values?
Instead use sum. Though the safest option would be NROW (because it can handle both data.frames and vectors):
sum(complete.cases(source))
#[1] 2
Or alternatively if you insist on using nrow
nrow(source[complete.cases(source), ])
#[1] 2
Explanation: complete.cases returns a logical vector indicating which cases (in your case rows) are complete.
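For the sample data below, the logical vector looks like this, with TRUE only at rows 4 and 10, the two complete ones:
complete.cases(source)
# [1] FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE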
Sample data
source <- read.table(text =
"185 2002-07-04 NA NA 20
186 2002-07-05 NA NA 20
187 2002-07-06 NA NA 20
188 2002-07-07 14.400 0.243 20
189 2002-07-08 NA NA 20
190 2002-07-09 NA NA 20
191 2002-07-10 NA NA 20
192 2002-07-11 NA NA 20
193 2002-07-12 NA NA 20
194 2002-07-13 4.550 0.296 20
195 2002-07-14 NA NA 20
196 2002-07-15 NA NA 20
197 2002-07-16 NA NA 20
198 2002-07-17 NA NA 20
199 2002-07-18 NA NA 20
200 2002-07-19 NA 0.237 20")
complete.cases returns a logical vector that indicates which rows are complete. As a vector doesn't have a row attribute, you cannot use nrow here, but, as suggested by others, sum works: TRUE and FALSE are converted to 1 and 0 internally, so summing counts the TRUE values of your vector.
sum(complete.cases(source))
# [1] 2
If you however are more interested in the data.frame, which is left after you exclude all non-complete rows, you can use na.exclude. This returns a data.frame and you can use nrow.
nrow(na.exclude(source))
# [1] 2
na.exclude(source)
# V2 V3 V4 V5
# 188 2002-07-07 14.40 0.243 20
# 194 2002-07-13 4.55 0.296 20
You can even try:
source[rowSums(is.na(source))==0,]
# V1 V2 V3 V4 V5
# 4 188 2002-07-07 14.40 0.243 20
# 10 194 2002-07-13 4.55 0.296 20
nrow(source[rowSums(is.na(source))==0,])
#[1] 2
I am new to the world of statistics, so simple suggestions would be appreciated.
I have a data frame in R called Ganeeshan:
Year General OBC SC ST VI VacancySC VacancyGen VacancyOBC Banks Participated VacancyST VacancyHI
1 2016 52.5 52.5 41.75 31.50 37.5 1338 4500 2319 20 665 154
2 2015 76.0 76.0 50.00 47.75 36.0 1965 6146 3454 23 1050 270
3 2014 82.0 80.0 70.00 56.00 38.0 2496 8212 4482 23 1531 458
4 2013 61.0 60.0 50.00 26.00 27.0 3208 10846 5799 21 1827 458
5 2012 135.0 135.0 127.00 106.00 127.0 3409 11058 6062 21 1886 436
VacancyOC VacancyVI
1 113 102
2 358 242
3 323 321
4 208 390
5 257 345
and want to build a linear model with "General" as the dependent variable. I used the following command:
GaneeshanModel1 <- lm(General ~ ., data = Ganeeshan)
I get " NA " instead of values in summary of model
Call:
lm(formula = General ~ ., data = Ganeeshan)
Residuals:
ALL 5 residuals are 0: no residual degrees of freedom!
Coefficients: (9 not defined because of singularities)
Estimate Std. Error t value Pr(>|t|)
(Intercept) 6566.6562 NA NA NA
Year -3.2497 NA NA NA
OBC 0.5175 NA NA NA
SC -0.2167 NA NA NA
ST 0.6078 NA NA NA
VI NA NA NA NA
VacancySC NA NA NA NA
VacancyGen NA NA NA NA
VacancyOBC NA NA NA NA
`Banks Participated` NA NA NA NA
VacancyST NA NA NA NA
VacancyHI NA NA NA NA
VacancyOC NA NA NA NA
VacancyVI NA NA NA NA
Why am I not getting any values here?
This can happen if you don't do data preprocessing correctly first. Note also the message "Coefficients: (9 not defined because of singularities)": with only 5 observations and 13 predictors, lm cannot estimate all the coefficients, so the surplus ones come back NA. Beyond that, check your columns for empty (NA) values and decide what to do with them before fitting; a common choice is to replace them with the column's mean or median. In R, for a column like 'Banks', you could do it like this:
dataset$Banks = ifelse(is.na(dataset$Banks),
                       ave(dataset$Banks, FUN = function(x) mean(x, na.rm = TRUE)),
                       dataset$Banks)
Otherwise, depending on your data set, if some of your values are represented by a period (or any other non-number placeholder), you can import your csv as
dataset = read.csv("data.csv", header = TRUE, na.strings = c(" ", ".", "NA"))
(note that na.strings must be named, since read.csv's third positional argument is sep) to turn 'period' and 'empty' values into NA, and after that use the line above to replace the NAs with the mean/median/something else.
Hi Stackoverflow community,
Background:
I'm using a population-modelling program to try to predict genetic outcomes of threatened species populations under a range of management scenarios. At the end of each scenario I have a .csv file containing information on all the final living individuals over all 1,000 iterations of the modelled population, including every surviving individual's genotype.
What I want:
From this .csv output file I'd like to determine the frequency of the allele "6" in the columns "Allele2a" and "Allele2b" in each of the 1,000 iterations of the model contained in the file.
The Problem:
The .csv file I'm trying to determine allele 6's frequency from does not contain information that can be used to easily subset the data (from what I can see) into the separate iterations. I have no idea how to split this dataset into its respective iterations, given that the number of individuals surviving to the end of the model (and hence the number of rows in each iteration) is not the same, and there are no clear subsettable break points.
Any guidance on how to separate this data into iteration units which can be analysed, or how to determine the frequency of the allele without complex subsetting would be very greatly appreciated. If any further information is required please don't hesitate to ask.
Thanks!
EDIT: When input into R the data looks like this:
Living<-read.csv("Living Ind.csv", header=F)
colnames(Living) <- c("Iteration","ID","Pop","Sex","alive","Age","DamID","SireID","F","FInd","MtDNA","Alle1a","Alle1b","Alle2a","Alle2b")
attach(Living)
Living
Iteration ID Pop Sex alive Age DamID SireID F FInd MtDNA Alle1a Alle1b Alle2a Alle2b
1 Iteration 1 NA NA NA NA NA NA NA NA NA NA NA NA
2 NA NA NA NA NA NA NA NA NA NA NA NA NA
3 2511 2 M TRUE 19 545 1376 0.000 0.000 545 1089 2751 6 6
4 2515 2 F TRUE 18 590 1783 0.000 0.000 590 1180 3566 5 5
5 2519 2 F TRUE 18 717 1681 0.000 0.000 717 1434 3362 4 6
6 2526 2 M TRUE 17 412 1780 0.000 0.000 412 823 3559 4 6
7 2529 2 F TRUE 17 324 1473 0.000 0.000 324 647 2945 5 6
107 2676 2 F TRUE 1 2576 2526 0.000 0.000 621 3876 3559 6 4
108 NA NA NA NA NA NA NA NA NA NA NA NA NA
109 Iteration 2 NA NA NA NA NA NA NA NA NA NA NA NA
110 NA NA NA NA NA NA NA NA NA NA NA NA NA
111 2560 2 M TRUE 18 703 1799 0.000 0.000 703 1406 3598 6 6
112 2564 2 M TRUE 18 420 1778 0.000 0.000 420 840 3555 4 6
113 2578 2 F TRUE 17 347 1778 0.000 0.000 347 693 3555 3 5
114 2581 2 M TRUE 16 330 1454 0.000 0.000 330 659 2907 6 6
115 2584 2 F TRUE 16 568 1593 0.000 0.000 568 1135 3185 6 5
116 2591 2 F TRUE 13 318 1423 0.000 0.000 318 635 2846 3 6
117 2593 2 M TRUE 13 341 1454 0.000 0.000 341 682 2907 6 6
118 2610 2 M TRUE 8 2578 2582 0.000 0.000 347 693 2908 5 6
119 2612 2 M TRUE 8 2578 2582 0.000 0.000 347 3555 660 3 6
Just a total mess I'm afraid.
Here's a link to a copy of the .csv file.
https://www.dropbox.com/s/pl6ncy5i0152uv1/Living%20Ind.csv?dl=0
Thank you for providing your data. It makes this much easier. In future you should always do this with questions on SO.
The basic issue is transforming your original data into something easier to manipulate in R. Since your data set is fairly large, I'm using the data.table package, but you could do basically the same thing using data.frames in base R.
library(data.table)
url <- "https://www.dropbox.com/s/pl6ncy5i0152uv1/Living%20Ind.csv?dl=1"
DT <- fread(url,header=FALSE, showProgress = FALSE) # import data
DT <- DT[!is.na(V2)] # remove blank lines (rows)
brks <- which(DT$V1=="Iteration") # identify iteration header rows
iter <- DT[brks,]$V2 # extract iteration numbers
DT <- DT[-brks,Iter:=rep(iter,diff(c(brks,nrow(DT)+1))-1)] # assign iteration number to each row
DT <- DT[-brks] # remove iteration header rows
DT[,V1:=NULL] # remove first column
setnames(DT, c("ID","Pop","Sex","alive","Age","DamID","SireID","F","FInd","MtDNA","Alle1a","Alle1b","Alle2a","Alle2b","Iteration"))
# now can count fraction of allele 6 easily.
DT[,list(frac=sum(Alle2a==6 | Alle2b==6)/.N), by=Iteration]
# Iteration frac
# 1: 1 0.7619048
# 2: 2 0.9130435
# 3: 3 0.6091954
# 4: 4 0.8620690
# 5: 5 0.8850575
# ---
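One caveat: the expression above counts the fraction of individuals carrying at least one copy of allele 6. If you instead want the allele frequency in the population-genetics sense (the proportion of all allele copies that are 6), a small variation on the same DT does it:
# each individual carries two copies, so divide by 2 * .N
DT[, list(freq = (sum(Alle2a == 6) + sum(Alle2b == 6)) / (2 * .N)), by = Iteration]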
If you are going to be analyzing large datasets like this a lot, it would probably be worth your while to learn how to use data.table.
I have the following data.table:
Month Day Lat Long Temperature
1: 10 01 80.0 180 -6.383330333333309
2: 10 01 77.5 180 -6.193327999999976
3: 10 01 75.0 180 -6.263328333333312
4: 10 01 72.5 180 -5.759997333333306
5: 10 01 70.0 180 -4.838330999999976
---
117020: 12 31 32.5 310 11.840003833333355
117021: 12 31 30.0 310 13.065001833333357
117022: 12 31 27.5 310 14.685003333333356
117023: 12 31 25.0 310 15.946669666666690
117024: 12 31 22.5 310 16.578336333333358
For every location (given by Lat and Long), I have a temperature for each day from 1 October to 31 December.
There are 1,272 locations consisting of each pairwise combination of Lat:
Lat
1 80.0
2 77.5
3 75.0
4 72.5
5 70.0
--------
21 30.0
22 27.5
23 25.0
24 22.5
and Long:
Long
1 180.0
2 182.5
3 185.0
4 187.5
5 190.0
---------
49 300.0
50 302.5
51 305.0
52 307.5
53 310.0
I'm trying to create a data.table that consists of 1,272 rows (one per location) and 92 columns (one per day). Each element of that data.table will then contain the temperature at that location on that day.
Any advice about how to accomplish that goal without using a for loop?
Here we use the built-in ChickWeight data, where we use "Chick-Diet" as the equivalent of your "Lat-Long", and "Time" as your "Date":
dcast.data.table(data.table(ChickWeight), Chick + Diet ~ Time)
Produces:
Chick Diet 0 2 4 6 8 10 12 14 16 18 20 21
1: 18 1 1 1 NA NA NA NA NA NA NA NA NA NA
2: 16 1 1 1 1 1 1 1 1 NA NA NA NA NA
3: 15 1 1 1 1 1 1 1 1 1 NA NA NA NA
4: 13 1 1 1 1 1 1 1 1 1 1 1 1 1
5: ... 46 rows omitted
You will likely need Lat + Long ~ Month + Day or some such for your formula.
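Untested against your actual table, but with your column names that would look roughly like this (assuming your data.table is named DT):
# one row per Lat/Long pair, one column per Month/Day combination
dcast.data.table(DT, Lat + Long ~ Month + Day, value.var = "Temperature")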
In the future, please make your question reproducible as I did here by using a built-in data set.
First create a date value using the lubridate package (I assumed year = 2014, adjust as necessary):
library(lubridate)
df$datetext <- paste(df$Month,df$Day,"2014",sep="-")
df$date <- mdy(df$datetext)
Then one option is to use the tidyr package to spread the columns:
library(tidyr)
spread(df[,-c(1:2,6)],date,Temperature)
Lat Long 2014-10-01 2014-12-31
1 22.5 310 NA 16.57834
2 25.0 310 NA 15.94667
3 27.5 310 NA 14.68500
4 30.0 310 NA 13.06500
5 32.5 310 NA 11.84000
6 70.0 180 -4.838331 NA
7 72.5 180 -5.759997 NA
8 75.0 180 -6.263328 NA
9 77.5 180 -6.193328 NA
10 80.0 180 -6.383330 NA
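As an aside, in current tidyr (1.0+) spread has been superseded by pivot_wider; an equivalent call, assuming the same df with the date column added above, would be roughly:
library(tidyr)
df2 <- df[, -c(1:2, 6)]  # drop Month, Day, datetext, as before
pivot_wider(df2, names_from = date, values_from = Temperature)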