Read delimited .txt file with multiple, interspersed headers in R

I am trying to open and clean a massive oceanographic dataset in R, where station information is interspersed as headers in between the chunks of observations:
$
2008 1 774 8 17 5 11 2 78.4952 6.0375 30 7 1.2 -999.0 -9 -9 -9 -9 4868.8 2017 0 7114
2.0 6.0297 35.0199 34.4101 2.0 11111
3.0 6.0279 35.0201 34.4091 3.0 11111
4.0 6.0272 35.0203 34.4091 4.0 11111
5.0 6.0273 35.0204 34.4097 4.9 11111
6.0 6.0274 35.0205 34.4104 5.9 11111
$
2008 1 777 8 17 12 7 25 78.4738 8.3510 27 6 4.1 -999.0 3 7 2 0 4903.8 1570 0 7114
3.0 6.4129 34.5637 34.3541 3.0 11111
4.0 6.4349 34.5748 34.3844 4.0 11111
5.0 6.4803 34.5932 34.4426 4.9 11111
6.0 6.4139 34.5624 34.3552 5.9 11111
7.0 6.5079 34.6097 34.4834 6.9 11111
Each $ is followed by a row containing station data (e.g. year, ..., lat, lon, date, time), followed by several rows containing the observations sampled at that station (e.g. depth, temperature, salinity etc.).
I would like to add the station data to each observation, so that each variable is a column and each observation is a row, like this:
2008 1 774 8 17 5 11 2 78.4952 6.0375 30 7 1.2 -999 2 6.0297 35.0199 34.4101 2 11111
2008 1 774 8 17 5 11 2 78.4952 6.0375 30 7 1.2 -999 3 6.0279 35.0201 34.4091 3 11111
2008 1 774 8 17 5 11 2 78.4952 6.0375 30 7 1.2 -999 4 6.0272 35.0203 34.4091 4 11111
2008 1 774 8 17 5 11 2 78.4952 6.0375 30 7 1.2 -999 5 6.0273 35.0204 34.4097 4.9 11111
2008 1 774 8 17 5 11 2 78.4952 6.0375 30 7 1.2 -999 6 6.0274 35.0205 34.4104 5.9 11111
2008 1 777 8 17 12 7 25 78.4738 8.351 27 6 4.1 -999 3 6.4129 34.5637 34.3541 3 11111
2008 1 777 8 17 12 7 25 78.4738 8.351 27 6 4.1 -999 4 6.4349 34.5748 34.3844 4 11111
2008 1 777 8 17 12 7 25 78.4738 8.351 27 6 4.1 -999 5 6.4803 34.5932 34.4426 4.9 11111
2008 1 777 8 17 12 7 25 78.4738 8.351 27 6 4.1 -999 6 6.4139 34.5624 34.3552 5.9 11111
2008 1 777 8 17 12 7 25 78.4738 8.351 27 6 4.1 -999 7 6.5079 34.6097 34.4834 6.9 11111

This is simpler and only depends on base R. I assume that you have read the text file with x <- readLines(....) first:
start <- which(x == "$") + 1 # Index of each header (station) line
rows <- diff(c(start, length(x) + 2)) - 2 # Number of observation lines per group
# Read one header line and its observation lines, then cbind them
getdata <- function(begin, end) {
  cbind(read.table(text = x[begin]), read.table(text = x[(begin + 1):(begin + end)]))
}
dta.list <- lapply(seq_along(start), function(i) getdata(start[i], rows[i]))
dta.df <- do.call(rbind, dta.list)
This works with the two groups you included in your post. You will need to fix the column names since V1 - V6 are repeated at the beginning and end.
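As a rough sketch of the renaming step mentioned above, the snippet below repeats the same base R approach on a shortened copy of the sample and then assigns unique column names. The names s1..s22 and o1..o6 are hypothetical placeholders, not names from the dataset's documentation.

```r
# Shortened copy of the sample from the question, inlined so the sketch runs on its own
x <- c(
  "$",
  "2008 1 774 8 17 5 11 2 78.4952 6.0375 30 7 1.2 -999.0 -9 -9 -9 -9 4868.8 2017 0 7114",
  "2.0 6.0297 35.0199 34.4101 2.0 11111",
  "3.0 6.0279 35.0201 34.4091 3.0 11111",
  "$",
  "2008 1 777 8 17 12 7 25 78.4738 8.3510 27 6 4.1 -999.0 3 7 2 0 4903.8 1570 0 7114",
  "3.0 6.4129 34.5637 34.3541 3.0 11111",
  "4.0 6.4349 34.5748 34.3844 4.0 11111",
  "5.0 6.4803 34.5932 34.4426 4.9 11111"
)

start <- which(x == "$") + 1                 # header line of each group
rows  <- diff(c(start, length(x) + 2)) - 2   # observation lines per group

# Bind one header row (recycled) to its observation rows
getdata <- function(begin, n) {
  cbind(read.table(text = x[begin]),
        read.table(text = x[(begin + 1):(begin + n)]))
}
dta.df <- do.call(rbind, Map(getdata, start, rows))

# Hypothetical names: 22 station columns, then 6 observation columns
names(dta.df) <- c(paste0("s", 1:22), paste0("o", 1:6))
```

Replace the placeholder names with the real variable names once you know what each field means.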

This solution is fairly involved and relies on several tidyverse packages and features. I'm not sure how robust it is for your needs, but it handles the sample you posted. I think the general approach will serve you well: fold the lines into blocks, write functions to parse the smaller blocks, and then unfold the results.
The first step finds the '$' markers, groups the lines that follow each marker, and "nests" each block of data. That leaves a data frame with only a few rows, one per section.
library(tidyverse)
txt_lns <- readLines("ocean-sample.txt")
txt <- tibble(txt = txt_lns)
# Start by finding new sections, and nesting the data
nested_txt <- txt %>%
  mutate(row_number = row_number()) %>%
  mutate(new_section = str_detect(txt, "\\$")) %>% # Mark new sections
  mutate(starting = ifelse(new_section, row_number, NA)) %>% # Index with row num
  tidyr::fill(starting) %>% # Fill index down where missing
  select(-new_section) %>% # Clean up
  filter(!str_detect(txt, "\\$")) %>%
  nest(data = c(txt, row_number)) # "Nest" the data
# Take a quick look
nested_txt
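The "mark new sections, then fill the index down" idea can also be sketched in base R, which may help when reasoning about what the pipeline above does. This is only an illustration: the strings below are placeholders standing in for real data lines.

```r
# Placeholder lines: "$" marks a new section, as in the ocean data
x <- c("$", "hdr A", "obs 1", "obs 2", "$", "hdr B", "obs 3")

# cumsum() over the marker test gives every line its section number
section <- cumsum(x == "$")

# Drop the marker lines and split the rest into one block per section
keep   <- x != "$"
blocks <- split(x[keep], section[keep])
```

Each element of blocks then holds one header line followed by its observation lines, which is the same shape as one nested row in nested_txt.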
Next, we need to handle those nested blocks. The routines below parse each block by identifying its header row and then separating the fields into data frames of their own. The logic differs for the header row (22 fields) and the shorter observation ("lesser") rows (6 fields).
# Deal with the records within a section
parse_inner_block <- function(x, header_ind) {
  if (header_ind) {
    df <- x %>%
      mutate(txt = str_trim(txt)) %>%
      # Separate the header row into 22 variables
      separate(txt, into = LETTERS[1:22], sep = "\\s+")
  } else {
    df <- x %>%
      mutate(txt = str_trim(txt)) %>%
      # Separate the lesser rows into 6 variables
      separate(txt, into = letters[1:6], sep = "\\s+")
  }
  return(df)
}
parse_outer_block <- function(x) {
  df <- x %>%
    # Determine if it's a header row with 22 variables or lesser row with 6
    mutate(leading_row = (row_number == min(row_number))) %>%
    # Fold by header row vs. not
    nest(data = c(txt, row_number)) %>%
    # Create data frames for both header and lesser rows
    mutate(processed = purrr::map2(data, leading_row, parse_inner_block)) %>%
    unnest(processed) %>%
    # Copy header row values to lesser rows
    tidyr::fill(A:V) %>%
    # Drop header row
    filter(!leading_row)
  return(df)
}
And then we can put it all together -- starting with our nested data, processing each block, unnesting the fields that came back, and prepping the full output.
# Actually put all this together and generate an output dataframe
output <- nested_txt %>%
  mutate(proc_out = purrr::map(data, parse_outer_block)) %>%
  select(-data) %>%
  unnest(proc_out) %>%
  select(-starting, -leading_row, -data, -row_number)
output
Hope it helps. I'd recommend looking at some purrr tutorials as well for some similar problems.

Related

Subset and group a data frame by multiple conditions and criteria in R

I have a dataset df with multiple variables and unique IDs
ID A B C D
1 20 5 5.4 120.5
1 30 10 6.8 110.6
2 50 40 7.5 117.8
3 10 50 3.4 119
3 80 30 2.8 117.5
2 5 20 9.5 325.4
I can subset it with the code below:
new.df <- df[df$A < 56 & is.na(df$A) == FALSE,]
I also want to compute conditional columns and subset the data frame by ID.
I want a data frame with conditional columns such as:
ID =1 A=20 B=10 C=5.4 D=110.6
ID =2 A=5 B=40 C=9.5 D=325.4
ID =3 A=10 B=30 C=3.4 D=119
and the output data frame should be:
ID A B C D
1 20 10 5.4 110.6
2 5 40 9.5 325.4
3 10 30 3.4 119
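Can you help me with how this can be done?
This gives the output as a data frame, taking the minimum of A, C and D and the maximum of B within each ID (requires dplyr). On the sample data it reproduces the ID 1 row of the desired output exactly; the desired rows for IDs 2 and 3 do not follow the same rule in every column:
library(dplyr)
df %>% group_by(ID) %>% summarise(minA = min(A), maxB = max(B), minC = min(C), minD = min(D))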
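For comparison, here is a minimal base R sketch of the same per-ID summary, built on the sample data from the question. It swaps in tapply for the dplyr pipeline, and the choice of min or max per column is an assumption taken from the ID 1 row of the desired output.

```r
# Sample data from the question
df <- data.frame(
  ID = c(1, 1, 2, 3, 3, 2),
  A  = c(20, 30, 50, 10, 80, 5),
  B  = c(5, 10, 40, 50, 30, 20),
  C  = c(5.4, 6.8, 7.5, 3.4, 2.8, 9.5),
  D  = c(120.5, 110.6, 117.8, 119, 117.5, 325.4)
)

# tapply() returns one value per ID, ordered by ID
res <- data.frame(
  ID   = sort(unique(df$ID)),
  minA = as.vector(tapply(df$A, df$ID, min)),
  maxB = as.vector(tapply(df$B, df$ID, max)),
  minC = as.vector(tapply(df$C, df$ID, min)),
  minD = as.vector(tapply(df$D, df$ID, min))
)
res
```

If a different rule is wanted for some column, swap min or max in the corresponding tapply call.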

Adding a header to columns based on the values of rows

I have the following different dataframes:
df1:
Scribe Reduced A 5 2.5 3 10
Reader Reduced A 9.2 4 12 10
Optimise Reduced A 5 5.8 3 12
df2:
Convert Reduced A 14 25
Configure Reduced A 14.7 6.8
Race Reduced A 2 6.3
df3:
Abstract Reduced A 8 7.5 9 8 4.5 11
Follower Reduced A 5.5 6 14 19 6 13.5
I would like to add a header for each of the dataframes where the column names are:
Class Technique Algorithm 1 2 3 ....
My issue is not with the first three columns but with the rest of the columns (the numeric values). As you can see in the example, the number of these columns differs between data frames, which makes it difficult for me to name them (i.e., starting from 1 and running to the last column, for example, 4 in df1).
Can someone help me please in solving this issue?
Here is a function for you. The first argument, dat, is your data frame. The second argument, chr, is the vector of names for your first few columns.
header_fun <- function(dat, chr = c("Class", "Technique", "Algorithm")) {
  # Name the leading columns from chr and number the remaining ones 1, 2, 3, ...
  dat2 <- setNames(dat, c(chr, seq_len(ncol(dat) - length(chr))))
  return(dat2)
}
The function will return a new data frame with the updated header.
header_fun(df1)
# Class Technique Algorithm 1 2 3 4
# 1 Scribe Reduced A 5.0 2.5 3 10
# 2 Reader Reduced A 9.2 4.0 12 10
# 3 Optimise Reduced A 5.0 5.8 3 12
header_fun(df2)
# Class Technique Algorithm 1 2
# 1 Convert Reduced A 14.0 25.0
# 2 Configure Reduced A 14.7 6.8
# 3 Race Reduced A 2.0 6.3
header_fun(df3)
# Class Technique Algorithm 1 2 3 4 5 6
# 1 Abstract Reduced A 8.0 7.5 9 8 4.5 11.0
# 2 Follower Reduced A 5.5 6.0 14 19 6.0 13.5
DATA
df1 <- read.table(text = "Scribe Reduced A 5 2.5 3 10
Reader Reduced A 9.2 4 12 10
Optimise Reduced A 5 5.8 3 12",
header = FALSE, stringsAsFactors = FALSE)
df2 <- read.table(text = "Convert Reduced A 14 25
Configure Reduced A 14.7 6.8
Race Reduced A 2 6.3",
header = FALSE, stringsAsFactors = FALSE)
df3 <- read.table(text = "Abstract Reduced A 8 7.5 9 8 4.5 11
Follower Reduced A 5.5 6 14 19 6 13.5",
header = FALSE, stringsAsFactors = FALSE)
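With several data frames to rename, one option is to keep them in a list and apply the function once. The sketch below redefines header_fun so it runs on its own and uses shortened copies of df1 and df2. The list name named is just an illustrative choice.

```r
header_fun <- function(dat, chr = c("Class", "Technique", "Algorithm")) {
  # Leading columns take the names in chr; the rest are numbered from 1
  setNames(dat, c(chr, seq_len(ncol(dat) - length(chr))))
}

df1 <- read.table(text = "Scribe Reduced A 5 2.5 3 10
Reader Reduced A 9.2 4 12 10",
  header = FALSE, stringsAsFactors = FALSE)
df2 <- read.table(text = "Convert Reduced A 14 25
Race Reduced A 2 6.3",
  header = FALSE, stringsAsFactors = FALSE)

# Apply the same renaming to every data frame in the list
named <- lapply(list(df1 = df1, df2 = df2), header_fun)
names(named$df1)
names(named$df2)
```

Each element of named keeps its own number of numeric columns, so the numbering stops at 4 for df1 and at 2 for df2.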

R - Create random subsamples of size n for multiple sample groups

I have a large data set of samples that belong to different groups and differ in the area covered. The structure of the data set is simplified below. I now would like to create pooled samples (Subgroups) for each Group where the area covered by each Subgroup equates to a specified area (e.g. 20). Samples should be allocated randomly and without replacement to each Subgroup and the number of the Subgroup should be listed in a new column at the end of the data frame.
SampleID Group Area Subgroup
1 A 1.5 1
2 A 3.8 2
3 A 6 4
4 A 1.9 1
5 A 1.5 3
6 A 4.1 1
7 A 3.7 1
8 A 4.5 3
...
300 B 1.2 1
301 B 3.8 1
302 B 4.1 4
303 B 2.6 3
304 B 3.1 5
305 B 3.5 3
306 B 2.1 2
...
2000 S 2.7 5
...
I am currently using the ‘cumsum’ command to create the Subgroups, using the code below.
dat <- read.table("Pooling_Test.txt", header = TRUE, sep = "\t")
dat$CumArea <- cumsum(dat$Area)
dat$Diff_CumArea <- c(0, head(cumsum(dat$Area), -1))
dat$Sample_Int_1 <- "0"
dat$Sample_End <- "0"
current.sum <- 0
for (c in 1:nrow(dat)) {
  current.sum <- current.sum + dat[c, "Area"]
  dat[c, "Diff_CumArea"] <- current.sum
  if (current.sum >= 20) {
    dat[c, "Sample_Int_1"] <- "1"
    dat[c, "Sample_End"] <- "End"
    current.sum <- 0
    dat$Sample_Int_2 <- cumsum(dat$Sample_Int_1)+1
    dat$Sample_Final <- dat$Sample_Int_2
    for (d in 1:nrow(dat)) {
      if (dat$Sample_End[d] == 'End')
        dat$Subgroup[d] <- dat$Sample_Int_2[d]-1
      else 0 }
  }}
write.csv(dat, file = 'Pooling_Test_Output.csv', row.names = FALSE)
The resultant data frame shows what I want (see below). However, there are a couple of steps I would like to improve. First, I have problems including a command for choosing samples randomly from each Group, so I currently randomise the order of samples before loading the data frame into R. Secondly, in the output table the Subgroups are numbered consecutively, but I would like to start the Subgroup numbering with 1 for each new Group. Has anybody any advice on how to achieve this?
SampleID Group CumArea Subgroups
1 A 1.5 1
77 A 4.6 1
6 A 9.3 1
43 A 16.4 1
17 A 19.5 1
67 A 2.1 2
4 A 4.3 2
32 A 8.9 2
...
300 B 4.5 10
257 B 6.8 10
397 B 10.6 10
344 B 14.5 10
367 B 16.7 10
303 B 20.1 10
306 B 1.5 11
...
A few functions in the dplyr package make this fairly straightforward. You can use slice to randomize the data, group_by to perform computations at the group level, and mutate to create new variables. If you chain the functions together with the %>% operator, I believe the solution would look something like this, assuming that you want groups that add up to 20.
install.packages("dplyr") #If you haven't used dplyr before
library(dplyr)
dat %>%
  group_by(Group) %>%
  slice(sample(1:n())) %>%
  mutate(CumArea = cumsum(Area), SubGroup = ceiling(CumArea / 20))
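The same idea can be sketched without dplyr, using ave for the per-group cumulative sum. The data below are made up for illustration, and the target of 8 is a hypothetical value chosen so the small example produces more than one subgroup (the question uses 20).

```r
set.seed(42)  # arbitrary seed, only to make the shuffle reproducible

# Made-up sample data with the same columns as the question
dat <- data.frame(
  SampleID = 1:12,
  Group    = rep(c("A", "B"), each = 6),
  Area     = c(1.5, 3.8, 6, 1.9, 1.5, 4.1, 1.2, 3.8, 4.1, 2.6, 3.1, 3.5)
)
target <- 8  # hypothetical target area per subgroup

# Shuffle all rows, then regroup; the order within each group stays random
dat <- dat[sample(nrow(dat)), ]
dat <- dat[order(dat$Group), ]

# Cumulative area within each group, then a subgroup number starting at 1
dat$CumArea  <- ave(dat$Area, dat$Group, FUN = cumsum)
dat$Subgroup <- ceiling(dat$CumArea / target)
dat
```

Note that the cumulative sum is not reset when a subgroup fills up, so samples near a boundary can land in a different subgroup than they would with the reset-at-20 loop in the question.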

How to approach loop with increasing variable name in R

My dataset is currently a set of answers to twenty questions with 300 observations.
Each of the questions is labeled q1, q2, q3, etc.
Each observation gives a 1 to 10 response.
The code below is what I have. What I want is for the q1 to change when the counter changes in R.
totaltenq1 <- sum(UpdatedQualtrix$tenq1)
totalnineq1 <- sum(UpdatedQualtrix$nineq1)
totaleightq1 <- sum(UpdatedQualtrix$eightq1)
totalsevenq1 <- sum(UpdatedQualtrix$sevenq1)
totalsixq1 <- sum(UpdatedQualtrix$sixq1)
totalfiveq1 <- sum(UpdatedQualtrix$fiveq1)
totalfourq1 <- sum(UpdatedQualtrix$fourq1)
totalthreeq1 <- sum(UpdatedQualtrix$threeq1)
totaltwoq1 <- sum(UpdatedQualtrix$twoq1)
totaloneq1 <- sum(UpdatedQualtrix$oneq1)
totaltenq2 <- sum(UpdatedQualtrix$tenq2)
totalnineq2 <- sum(UpdatedQualtrix$nineq2)
totaleightq2 <- sum(UpdatedQualtrix$eightq2)
totalsevenq2 <- sum(UpdatedQualtrix$sevenq2)
totalsixq2 <- sum(UpdatedQualtrix$sixq2)
totalfiveq2 <- sum(UpdatedQualtrix$fiveq2)
totalfourq2 <- sum(UpdatedQualtrix$fourq2)
totalthreeq2 <- sum(UpdatedQualtrix$threeq2)
totaltwoq2 <- sum(UpdatedQualtrix$twoq2)
totaloneq2 <- sum(UpdatedQualtrix$oneq2)
I would like to have code like this (pseudocode):
count = 20
for (i in 1:count){
  totaltenq(i) <- sum(UpdatedQualtrix$tenq(i))
  totalnineq(i) <- sum(UpdatedQualtrix$nineq(i))
  etc
}
That way, when I run it again in the future, I can tell R how many questions there are and it will adjust, and I won't have 10,000 lines of code from copying and pasting my code 20 times.
I don't think you need any loops at all. It all depends on how you want to store those values. I'm a big fan of not having more variables than necessary.
Here's some sample data. I'll just make 10 rows (observations) with values 1-5.
set.seed(15)
Q <- 3
numbs <- c("one", "two", "three", "four", "five", "six", "seven", "eight", "nine", "ten")
qs <- paste0("q", 1:Q)
qnumbs <- outer(numbs, qs, paste0)
UpdatedQualtrix <- data.frame(ID = 1:10,
  matrix(sample(1:5, 10 * length(numbs) * Q, replace = TRUE), nrow = 10))
colnames(UpdatedQualtrix) <- c("ID", qnumbs)
Now I can sum up each of the columns with
( Qsums<-colSums(UpdatedQualtrix[, qnumbs]) )
# oneq1 twoq1 threeq1 fourq1 fiveq1 sixq1 sevenq1 eightq1 nineq1 tenq1
# 37 35 29 26 32 39 40 33 40 26
# oneq2 twoq2 threeq2 fourq2 fiveq2 sixq2 sevenq2 eightq2 nineq2 tenq2
# 37 31 19 29 25 38 36 35 28 27
# oneq3 twoq3 threeq3 fourq3 fiveq3 sixq3 sevenq3 eightq3 nineq3 tenq3
# 37 30 31 31 24 31 29 31 25 41
And if we want the totals per question we can do
sapply(qs, function(a, b) sum(Qsums[paste0(b,a)]), b=numbs)
# q1 q2 q3
# 337 305 310
Or if we want the counts per response we can do
sapply(numbs, function(a, b) sum(Qsums[paste0(a,b)]), b=qs)
# one two three four five six seven eight nine ten
# 111 96 79 86 81 108 105 99 93 94
You might also want to consider melting your data, since it's so structured. The reshape2 package can help:
library(reshape2)
mm <- melt(UpdatedQualtrix, id.vars="ID")
mm <- cbind(mm[,-2], colsplit(mm$variable, "q", c("resp","q")))
mm$resp <- factor(mm$resp, levels=numbs)
to turn your data into a "tall" format so each value has its own row with a column for ID, value, response and question.
str(mm)
# 'data.frame': 300 obs. of 4 variables:
# $ ID : int 1 2 3 4 5 6 7 8 9 10 ...
# $ value: int 4 1 5 4 2 5 5 2 4 5 ...
# $ resp : Factor w/ 10 levels "one","two","three",..: 1 1 1 1 1 1 1 1 1 1 ...
# $ q : int 1 1 1 1 1 1 1 1 1 1 ...
And then we can more easily do other calculations. If you want the total scores by question, you could do
aggregate(value~q, mm, sum)
# q value
# 1 1 337
# 2 2 305
# 3 3 310
If you wanted the average value for each question/response you could do
with(mm, tapply(value, list(q,resp), mean))
# one two three four five six seven eight nine ten
# 1 3.7 3.5 2.9 2.6 3.2 3.9 4.0 3.3 4.0 2.6
# 2 3.7 3.1 1.9 2.9 2.5 3.8 3.6 3.5 2.8 2.7
# 3 3.7 3.0 3.1 3.1 2.4 3.1 2.9 3.1 2.5 4.1
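If you do want something close to the loop in the question, column names can be built as strings with paste0 and used to index the data frame. The sketch below is only an illustration: it assumes the columns are 0/1 indicators, and it uses a made-up three-response, two-question data frame. The name totals is hypothetical.

```r
numbs <- c("one", "two", "three")  # shortened response labels for the sketch
Q <- 2                             # number of questions

# Made-up indicator data: one column per response/question pair
UpdatedQualtrix <- data.frame(
  oneq1 = c(1, 0, 1), twoq1 = c(0, 1, 0), threeq1 = c(0, 0, 0),
  oneq2 = c(0, 0, 1), twoq2 = c(1, 1, 0), threeq2 = c(0, 0, 0)
)

# For each question i, build the column names and sum those columns
totals <- sapply(seq_len(Q), function(i) {
  cols <- paste0(numbs, "q", i)    # e.g. "oneq1", "twoq1", "threeq1"
  setNames(colSums(UpdatedQualtrix[, cols]), numbs)
})
colnames(totals) <- paste0("q", seq_len(Q))

totals["one", "q1"]  # total for response "one" on question 1
```

Changing Q is then the only edit needed when the number of questions changes, and all totals live in one matrix instead of separate variables.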

R Programming Calculate Rows Average

How can I use R to calculate row means?
Sample data:
f<- data.frame(
name=c("apple","orange","banana"),
day1sales=c(2,5,4),
day1sales=c(2,8,6),
day1sales=c(2,15,24),
day1sales=c(22,51,13),
day1sales=c(5,8,7)
)
Expected Results :
More columns will be added to the table later. For example, the expected result currently averages only up to day1sales.4; after more data is loaded, the columns will extend to day1sales.6 and so on. How can I compute the average for every row?
Use rowMeans:
> rowMeans(f[-1])
## [1] 6.6 17.4 10.8
You can also add a column of means to the data set:
> f$AvgSales <- rowMeans(f[-1])
> f
## name day1sales day1sales.1 day1sales.2 day1sales.3 day1sales.4 AvgSales
## 1 apple 2 2 2 22 5 6.6
## 2 orange 5 8 15 51 8 17.4
## 3 banana 4 6 24 13 7 10.8
rowMeans is the simplest way. Alternatively, the function apply will apply a function along the rows or columns of a data frame. In this case you want to apply the mean function to the rows:
f$AverageSales <- apply(f[, 2:length(f)], 1, mean)
This adds an AverageSales column to the data frame f with the values that you want. (I changed 6 to length(f) since you say you may add more columns.)
> f
## name day1sales day1sales.1 day1sales.2 day1sales.3 day1sales.4 AverageSales
##1 apple 2 2 2 22 5 6.6
##2 orange 5 8 15 51 8 17.4
##3 banana 4 6 24 13 7 10.8
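One caveat with f[-1] or 2:length(f): once an average column has been added, re-running the code would include that column in the mean. A possible way around this, sketched below, is to select the sales columns by name. The pattern "^day1sales" is an assumption based on the column names in the sample.

```r
# Sample data from the question; data.frame() makes the repeated names unique
f <- data.frame(
  name = c("apple", "orange", "banana"),
  day1sales = c(2, 5, 4),
  day1sales = c(2, 8, 6),
  day1sales = c(2, 15, 24),
  day1sales = c(22, 51, 13),
  day1sales = c(5, 8, 7)
)

# Pick only the sales columns, however many there are
sales_cols <- grep("^day1sales", names(f))
f$AverageSales <- rowMeans(f[sales_cols])

# Safe to re-run: the AverageSales column is not matched by the pattern
f$AverageSales <- rowMeans(f[grep("^day1sales", names(f))])
f
```

New columns such as day1sales.5 are picked up automatically as long as they follow the same naming pattern.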

Resources