How to approach loop with increasing variable name in R

My dataset is currently a set of answers to twenty questions with 300 observations.
Each question is labeled q1, q2, q3, etc.
Each observation gives a response from 1 to 10.
The code below is what I have. What I want is for the q1 to change when the counter changes.
totaltenq1 <- sum(UpdatedQualtrix$tenq1)
totalnineq1 <- sum(UpdatedQualtrix$nineq1)
totaleightq1 <- sum(UpdatedQualtrix$eightq1)
totalsevenq1 <- sum(UpdatedQualtrix$sevenq1)
totalsixq1 <- sum(UpdatedQualtrix$sixq1)
totalfiveq1 <- sum(UpdatedQualtrix$fiveq1)
totalfourq1 <- sum(UpdatedQualtrix$fourq1)
totalthreeq1 <- sum(UpdatedQualtrix$threeq1)
totaltwoq1 <- sum(UpdatedQualtrix$twoq1)
totaloneq1 <- sum(UpdatedQualtrix$oneq1)
totaltenq2 <- sum(UpdatedQualtrix$tenq2)
totalnineq2 <- sum(UpdatedQualtrix$nineq2)
totaleightq2 <- sum(UpdatedQualtrix$eightq2)
totalsevenq2 <- sum(UpdatedQualtrix$sevenq2)
totalsixq2 <- sum(UpdatedQualtrix$sixq2)
totalfiveq2 <- sum(UpdatedQualtrix$fiveq2)
totalfourq2 <- sum(UpdatedQualtrix$fourq2)
totalthreeq2 <- sum(UpdatedQualtrix$threeq2)
totaltwoq2 <- sum(UpdatedQualtrix$twoq2)
totaloneq2 <- sum(UpdatedQualtrix$oneq2)
I would like to have code like this:
count <- 20
for (i in 1:count) {
  totaltenq(i) <- sum(UpdatedQualtrix$tenq(i))
  totalnineq(i) <- sum(UpdatedQualtrix$nineq(i))
  etc
}
That way, when I do it again in the future, I can tell R how many questions there are and it will adjust, instead of leaving me with 10,000 lines of code from copying and pasting the same block 20 times.
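
The literal loop is possible by building each variable name with paste0() and creating it with assign(); a minimal sketch, assuming the columns are named as in the question:
# Build names like "totaltenq1" and assign the column sums to them
numbs <- c("one","two","three","four","five","six","seven","eight","nine","ten")
count <- 20
for (i in 1:count) {
  for (n in numbs) {
    assign(paste0("total", n, "q", i),
           sum(UpdatedQualtrix[[paste0(n, "q", i)]]))
  }
}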

That said, I don't think you need any loops at all. It all depends on how you want to store those values. I'm a big fan of not having more variables than necessary.
Here's some sample data. I'll just make 10 rows (observations) with values 1-5.
set.seed(15)
Q <- 3
numbs <- c("one","two","three","four","five","six","seven","eight","nine","ten")
qs <- paste0("q", 1:Q)
qnumbs <- outer(numbs, qs, paste0)
UpdatedQualtrix <- data.frame(ID = 1:10,
  matrix(sample(1:5, 10 * length(numbs) * Q, replace = TRUE), nrow = 10))
colnames(UpdatedQualtrix) <- c("ID", qnumbs)
Now I can sum up each of the columns with
(Qsums <- colSums(UpdatedQualtrix[, qnumbs]))
# oneq1 twoq1 threeq1 fourq1 fiveq1 sixq1 sevenq1 eightq1 nineq1 tenq1
# 37 35 29 26 32 39 40 33 40 26
# oneq2 twoq2 threeq2 fourq2 fiveq2 sixq2 sevenq2 eightq2 nineq2 tenq2
# 37 31 19 29 25 38 36 35 28 27
# oneq3 twoq3 threeq3 fourq3 fiveq3 sixq3 sevenq3 eightq3 nineq3 tenq3
# 37 30 31 31 24 31 29 31 25 41
And if we want the totals per question we can do
sapply(qs, function(a, b) sum(Qsums[paste0(b,a)]), b=numbs)
# q1 q2 q3
# 337 305 310
Or if we want the counts per response we can do
sapply(numbs, function(a, b) sum(Qsums[paste0(a,b)]), b=qs)
# one two three four five six seven eight nine ten
# 111 96 79 86 81 108 105 99 93 94
You might also want to consider melting your data, since it's so structured. The reshape2 library can help:
require(reshape2)
mm <- melt(UpdatedQualtrix, id.vars="ID")
mm <- cbind(mm[,-2], colsplit(mm$variable, "q", c("resp","q")))
mm$resp <- factor(mm$resp, levels=numbs)
to turn your data into a "tall" format, so each value has its own row with a column for ID, value, response, and question.
str(mm)
# 'data.frame': 300 obs. of 4 variables:
# $ ID : int 1 2 3 4 5 6 7 8 9 10 ...
# $ value: int 4 1 5 4 2 5 5 2 4 5 ...
# $ resp : Factor w/ 10 levels "one","two","three",..: 1 1 1 1 1 1 1 1 1 1 ...
# $ q : int 1 1 1 1 1 1 1 1 1 1 ...
And then we can more easily do other calculations. If you want the total scores by question, you could do
aggregate(value~q, mm, sum)
# q value
# 1 1 337
# 2 2 305
# 3 3 310
If you wanted the average value for each question/response you could do
with(mm, tapply(value, list(q,resp), mean))
# one two three four five six seven eight nine ten
# 1 3.7 3.5 2.9 2.6 3.2 3.9 4.0 3.3 4.0 2.6
# 2 3.7 3.1 1.9 2.9 2.5 3.8 3.6 3.5 2.8 2.7
# 3 3.7 3.0 3.1 3.1 2.4 3.1 2.9 3.1 2.5 4.1
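If you prefer that cross-tabulation as a data frame rather than a matrix, dcast() from the already-loaded reshape2 package can produce it; a minimal sketch:
# Same numbers as the tapply() call, returned as a data frame with q as a column
dcast(mm, q ~ resp, value.var = "value", fun.aggregate = mean)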

Related

Subset and group by a data frame with multiple conditions and multiple criteria in R

I have a dataset df with multiple variables and unique IDs
ID A B C D
1 20 5 5.4 120.5
1 30 10 6.8 110.6
2 50 40 7.5 117.8
3 10 50 3.4 119
3 80 30 2.8 117.5
2 5 20 9.5 325.4
I can subset them with the code below:
new.df <- df[df$A < 56 & is.na(df$A) == FALSE,]
Now I want to apply a condition to each column and subset the data frame by ID, so that I get one row per ID, such as:
ID =1 A=20 B=10 C=5.4 D=110.6
ID =2 A=5 B=40 C=9.5 D=325.4
ID =3 A=10 B=30 C=3.4 D=119
and output data frame should be
ID A B C D
1 20 10 5.4 110.6
2 5 40 9.5 325.4
3 10 30 3.4 119
Can you help me figure out how this can be done?
This will give the output as a data frame:
library(dplyr)
df %>%
  group_by(ID) %>%
  summarise(minA = min(A), maxB = max(B), minC = min(C), minD = min(D))
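To test the pipeline, the example data can be transcribed from the question first (a minimal sketch):
# The example rows from the question, as a data frame
df <- data.frame(ID = c(1, 1, 2, 3, 3, 2),
                 A = c(20, 30, 50, 10, 80, 5),
                 B = c(5, 10, 40, 50, 30, 20),
                 C = c(5.4, 6.8, 7.5, 3.4, 2.8, 9.5),
                 D = c(120.5, 110.6, 117.8, 119, 117.5, 325.4))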

read delimited .txt file with multiple, interspersed headers in R

I am trying to open and clean a massive oceanographic dataset in R, where station information is interspersed as headers in between the chunks of observations:
$
2008 1 774 8 17 5 11 2 78.4952 6.0375 30 7 1.2 -999.0 -9 -9 -9 -9 4868.8 2017 0 7114
2.0 6.0297 35.0199 34.4101 2.0 11111
3.0 6.0279 35.0201 34.4091 3.0 11111
4.0 6.0272 35.0203 34.4091 4.0 11111
5.0 6.0273 35.0204 34.4097 4.9 11111
6.0 6.0274 35.0205 34.4104 5.9 11111
$
2008 1 777 8 17 12 7 25 78.4738 8.3510 27 6 4.1 -999.0 3 7 2 0 4903.8 1570 0 7114
3.0 6.4129 34.5637 34.3541 3.0 11111
4.0 6.4349 34.5748 34.3844 4.0 11111
5.0 6.4803 34.5932 34.4426 4.9 11111
6.0 6.4139 34.5624 34.3552 5.9 11111
7.0 6.5079 34.6097 34.4834 6.9 11111
Each $ is followed by a row containing station data (e.g. year, ..., lat, lon, date, time); then follow several rows containing the observations sampled at that station (e.g. depth, temperature, salinity, etc.).
I would like to add the station data to the observations, so that each variable is a column and each observation is a row, like this:
2008 1 774 8 17 5 11 2 78.4952 6.0375 30 7 1.2 -999 2 6.0297 35.0199 34.4101 2 11111
2008 1 774 8 17 5 11 2 78.4952 6.0375 30 7 1.2 -999 3 6.0279 35.0201 34.4091 3 11111
2008 1 774 8 17 5 11 2 78.4952 6.0375 30 7 1.2 -999 4 6.0272 35.0203 34.4091 4 11111
2008 1 774 8 17 5 11 2 78.4952 6.0375 30 7 1.2 -999 5 6.0273 35.0204 34.4097 4.9 11111
2008 1 774 8 17 5 11 2 78.4952 6.0375 30 7 1.2 -999 6 6.0274 35.0205 34.4104 5.9 11111
2008 1 777 8 17 12 7 25 78.4738 8.351 27 6 4.1 -999 3 6.4129 34.5637 34.3541 3 11111
2008 1 777 8 17 12 7 25 78.4738 8.351 27 6 4.1 -999 4 6.4349 34.5748 34.3844 4 11111
2008 1 777 8 17 12 7 25 78.4738 8.351 27 6 4.1 -999 5 6.4803 34.5932 34.4426 4.9 11111
2008 1 777 8 17 12 7 25 78.4738 8.351 27 6 4.1 -999 6 6.4139 34.5624 34.3552 5.9 11111
2008 1 777 8 17 12 7 25 78.4738 8.351 27 6 4.1 -999 7 6.5079 34.6097 34.4834 6.9 11111
This is simpler and only depends on base R. I assume that you have read the text file with x <- readLines(....) first:
start <- which(x == "$") + 1              # Find header indices
rows <- diff(c(start, length(x) + 2)) - 2 # Find number of lines per group
# Function to read a header and its observation rows, then cbind them
getdata <- function(begin, end) {
  cbind(read.table(text = x[begin]),
        read.table(text = x[(begin + 1):(begin + end)]))
}
dta.list <- lapply(seq_along(start), function(i) getdata(start[i], rows[i]))
dta.df <- do.call(rbind, dta.list)
This works with the two groups you included in your post. You will need to fix the column names since V1 - V6 are repeated at the beginning and end.
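One quick way to deduplicate those names (a sketch; substitute whatever labels actually fit the data):
# make.unique() appends .1, .2, ... to repeated names, e.g. the second V1 becomes V1.1
names(dta.df) <- make.unique(names(dta.df))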
This solution is pretty involved, and rests on knowledge of several tidyverse libraries and features. I'm not sure how robust it is for your needs, but it does do okay with the sample you posted, and I think the approach of folding blocks, creating functions to parse the smaller blocks, and then unfolding the results will serve you well.
The first piece involves finding the '$' markers, grouping following lines together, and then "nesting" the block of data together. Then we have a data frame that has only a few rows - one for each section.
library(tidyverse)
txt_lns <- readLines("ocean-sample.txt")
txt <- tibble(txt = txt_lns)
# Start by finding new sections, and nesting the data
nested_txt <- txt %>%
  mutate(row_number = row_number()) %>%
  mutate(new_section = str_detect(txt, "\\$")) %>%            # Mark new sections
  mutate(starting = ifelse(new_section, row_number, NA)) %>%  # Index with row num
  tidyr::fill(starting) %>%                                   # Fill index down where missing
  select(-new_section) %>%                                    # Clean up
  filter(!str_detect(txt, "\\$")) %>%
  nest(data = c(txt, row_number))                             # "Nest" the data
# Take a quick look
nested_txt
Then, we need to be able to deal with those nested blocks. The routines here parse those blocks by identifying header rows, and then separating the fields into dataframes of their own. Here, we have different logic for header rows vs. the shorter lesser rows.
# Deal with the records within a section
parse_inner_block <- function(x, header_ind) {
  if (header_ind) {
    df <- x %>%
      mutate(txt = str_trim(txt)) %>%
      # Separate the header row into 22 variables
      separate(txt, into = LETTERS[1:22], sep = "\\s+")
  } else {
    df <- x %>%
      mutate(txt = str_trim(txt)) %>%
      # Separate the lesser rows into 6 variables
      separate(txt, into = letters[1:6], sep = "\\s+")
  }
  return(df)
}
parse_outer_block <- function(x) {
  df <- x %>%
    # Determine if it's a header row with 22 variables or lesser row with 6
    mutate(leading_row = (row_number == min(row_number))) %>%
    # Fold by header row vs. not
    nest(data = c(txt, row_number)) %>%
    # Create data frames for both header and lesser rows
    mutate(processed = purrr::map2(data, leading_row, parse_inner_block)) %>%
    unnest(processed) %>%
    # Copy header row values to lesser rows
    tidyr::fill(A:V) %>%
    # Drop header row
    filter(!leading_row)
  return(df)
}
And then we can put it all together -- starting with our nested data, processing each block, unnesting the fields that came back, and prepping the full output.
# Actually put all this together and generate an output dataframe
output <- nested_txt %>%
  mutate(proc_out = purrr::map(data, parse_outer_block)) %>%
  select(-data) %>%
  unnest(proc_out) %>%
  select(-starting, -leading_row, -data, -row_number)
output
Hope it helps. I'd recommend looking at some purrr tutorials as well for some similar problems.

R - Create random subsamples of size n for multiple sample groups

I have a large data set of samples that belong to different groups and differ in the area covered. The structure of the data set is simplified below. I now would like to create pooled samples (Subgroups) for each Group where the area covered by each Subgroup equates to a specified area (e.g. 20). Samples should be allocated randomly and without replacement to each Subgroup and the number of the Subgroup should be listed in a new column at the end of the data frame.
SampleID Group Area Subgroup
1 A 1.5 1
2 A 3.8 2
3 A 6 4
4 A 1.9 1
5 A 1.5 3
6 A 4.1 1
7 A 3.7 1
8 A 4.5 3
...
300 B 1.2 1
301 B 3.8 1
302 B 4.1 4
303 B 2.6 3
304 B 3.1 5
305 B 3.5 3
306 B 2.1 2
...
2000 S 2.7 5
...
I am currently using the cumsum() function to create the Subgroups, with the code below.
dat <- read.table("Pooling_Test.txt", header = TRUE, sep = "\t")
dat$CumArea <- cumsum(dat$Area)
dat$Diff_CumArea <- c(0, head(cumsum(dat$Area), -1))
dat$Sample_Int_1 <- 0  # numeric, so cumsum() below works
dat$Sample_End <- "0"
current.sum <- 0
for (c in 1:nrow(dat)) {
  current.sum <- current.sum + dat[c, "Area"]
  dat[c, "Diff_CumArea"] <- current.sum
  if (current.sum >= 20) {
    dat[c, "Sample_Int_1"] <- 1
    dat[c, "Sample_End"] <- "End"
    current.sum <- 0
    dat$Sample_Int_2 <- cumsum(dat$Sample_Int_1) + 1
    dat$Sample_Final <- dat$Sample_Int_2
    for (d in 1:nrow(dat)) {
      if (dat$Sample_End[d] == 'End')
        dat$Subgroup[d] <- dat$Sample_Int_2[d] - 1
      else 0
    }
  }
}
write.csv(dat, file = 'Pooling_Test_Output.csv', row.names = FALSE)
The resultant data frame shows what I want (see below). However, there are a couple of steps I would like to improve. First, I have trouble writing a command for choosing samples randomly from each Group, so I currently randomise the order of the samples before loading the data frame into R. Secondly, in the output table the Subgroups are numbered consecutively, but I would like the Subgroup numbering to restart at 1 for each new Group. Does anybody have advice on how to achieve this?
SampleID Group CumArea Subgroups
1 A 1.5 1
77 A 4.6 1
6 A 9.3 1
43 A 16.4 1
17 A 19.5 1
67 A 2.1 2
4 A 4.3 2
32 A 8.9 2
...
300 B 4.5 10
257 B 6.8 10
397 B 10.6 10
344 B 14.5 10
367 B 16.7 10
303 B 20.1 10
306 B 1.5 11
...
A few functions in the dplyr package make this fairly straightforward. You can use slice to randomize the data, group_by to perform computations at the group level, and mutate to create new variables. If you chain the functions together with the %>% operator, I believe the solution would look something like this, assuming that you want groups that add up to 20.
install.packages("dplyr") #If you haven't used dplyr before
library(dplyr)
dat %>%
  group_by(Group) %>%
  slice(sample(1:n())) %>%
  mutate(CumArea = cumsum(Area), SubGroup = ceiling(CumArea / 20))
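Roughly the same idea in base R, for comparison (a sketch using sample() to shuffle and ave() for the per-Group cumulative sum):
# Shuffle all rows, then number subgroups within each Group by cumulative area
dat <- dat[sample(nrow(dat)), ]
dat$CumArea <- ave(dat$Area, dat$Group, FUN = cumsum)
dat$Subgroup <- ceiling(dat$CumArea / 20)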

R: Calculating percentage values across a matrix based on the values in another matrix

I have two matrices. One is a 10x1 double matrix that can be expanded to any user-preset size, e.g. 100.
View(min_matrx)
V1
1 27
2 46
3 30
4 59
5 46
6 45
7 34
8 31
9 52
10 46
The other matrix looks like this (more rows not shown):
View(main_matrx)
row.names sum_value
s17 45
s7469 213
s20984 24
s17309 214
s7432369 43
s221320984 12
s17556 34
s741269 11
s20132984 35
For each row name in main_matrx, I want to count the number of times a value greater than that row's sum_value appears in min_matrx, then divide that count by the number of rows in min_matrx and add the result as a new column in main_matrx.
For example, in row 1 of main_matrx (s17), a value greater than 45 appears in min_matrx 5 times.
Dividing that 5 by the 10 rows of min_matrx gives 5/10 = 0.5, which is the value I'd like as a new column in main_matrx for s17, and then the same formula for all the s_ids in the row names.
So far I have fiddled with:
for (s in 1:length(main_matrx)) {
  new <- sum(main_matrx[s, ] > min_CPRS_set) / length(min_matrx)
}
and I tried using apply() but I'm still not getting results.
apply(main_matrx, 1:length(main_matrx), function(x) sum(main_matrx > min_CPRS_set) / length(min_matrx))
Now, I'm just stuck because it's not working. I'm still new to R so my code isn't particularly efficient. Any suggestions?
Lots of ways to approach this. Here's one that came to mind. (I think I understand what you're after; an example is much easier to understand than words alone, so in the future I'd suggest including one with the question.)
Here, x is a single sum_value and y is the vector of values from min_matrx:
FUN <- function(x, y = min_matrx[, 1]) {
  sum(y > x) / length(y)
}
main_matrx$new <- sapply(main_matrx[, 2], FUN)
## > main_matrx
## row.names sum_value new
## 1 s17 45 0.5
## 2 s7469 213 0.0
## 3 s20984 24 1.0
## 4 s17309 214 0.0
## 5 s7432369 43 0.6
## 6 s221320984 12 1.0
## 7 s17556 34 0.6
## 8 s741269 11 1.0
## 9 s20132984 35 0.6
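Since this is an all-pairs comparison, the same column can also be computed in one shot with outer(); a sketch, assuming min_matrx and main_matrx as above:
# Each column of the logical matrix compares all of min_matrx against one sum_value;
# colMeans() then gives the proportion of values that exceed it
main_matrx$new <- colMeans(outer(min_matrx[, 1], main_matrx$sum_value, ">"))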

Selecting top finite number of rows for each unique value of a column in a data frame in R

I have a data frame with 3 columns: a, b, c. There are multiple rows corresponding to each unique value of column a, and I want to select the top 5 rows for each unique value of column a. Column c is some value, and the data frame is already sorted by it in descending order, so that would not be a problem. Can anyone please suggest how I can do this in R?
Stealing @ptocquin's example, here's how you can use the base function by. You can flatten the result using do.call (see below).
> by(data = data, INDICES = data$a, FUN = function(x) head(x, 5))
# or by(data = data, INDICES = data$a, FUN = head, 5)
data$a: 1
a b c
21 1 0.1188552 1.6389895
41 1 1.0182033 1.4811359
61 1 -0.8795879 0.7784072
81 1 0.6485745 0.7734652
31 1 1.5102255 0.7107957
------------------------------------------------------------
data$a: 2
a b c
15 2 -1.09704040 1.1710693
85 2 0.42914795 0.8826820
65 2 -1.01480957 0.6736782
45 2 -0.07982711 0.3693384
35 2 -0.67643885 -0.2170767
------------------------------------------------------------
A similar thing could be achieved by splitting your data.frame based on a and then using lapply to step through each element, subsetting the first n rows.
split.data <- split(data, data$a)
subsetted.data <- lapply(split.data, FUN = function(x) head(x, 5)) # or ..., FUN = head, 5) like above
flatten.data <- do.call("rbind", subsetted.data)
head(flatten.data)
a b c
1.21 1 0.11885516 1.63898947
1.41 1 1.01820329 1.48113594
1.61 1 -0.87958790 0.77840718
1.81 1 0.64857445 0.77346517
1.31 1 1.51022545 0.71079568
2.15 2 -1.09704040 1.17106930
2.85 2 0.42914795 0.88268205
2.65 2 -1.01480957 0.67367823
2.45 2 -0.07982711 0.36933837
2.35 2 -0.67643885 -0.21707668
Here is my try:
library(plyr)
data <- data.frame(a=rep(sample(1:20,10),10),b=rnorm(100),c=rnorm(100))
data <- data[rev(order(data$c)),]
head(data, 15)
a b c
28 6 1.69611039 1.720081
91 11 1.62656460 1.651574
70 9 -1.17808386 1.641954
6 15 1.23420550 1.603140
23 7 0.70854914 1.588352
51 11 -1.41234359 1.540738
19 10 2.83730734 1.522825
49 10 0.39313579 1.370831
80 9 -0.59445323 1.327825
59 10 -0.55538404 1.214901
18 6 0.08445888 1.152266
86 15 0.53027267 1.066034
69 10 -1.89077464 1.037447
62 1 -0.43599566 1.026505
3 7 0.78544009 1.014770
result <- ddply(data, .(a), "head", 5)
head(result, 15)
a b c
1 1 -0.43599566 1.02650544
2 1 -1.55113486 0.36380251
3 1 0.68608364 0.30911430
4 1 -0.85406406 0.05555500
5 1 -1.83894595 -0.11850847
6 5 -1.79715809 0.77760033
7 5 0.82814909 0.22401278
8 5 -1.52726859 0.06745849
9 5 0.51655092 -0.02737905
10 5 -0.44004646 -0.28106808
11 6 1.69611039 1.72008079
12 6 0.08445888 1.15226601
13 6 -1.99465060 0.82214319
14 6 0.43855489 0.76221979
15 6 -2.15251353 0.64417757
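For completeness, the same "top 5 per group" is a one-liner in more recent versions of dplyr (a sketch; slice_head() needs dplyr 1.0 or later and assumes the frame is already sorted by c):
library(dplyr)
data %>%
  group_by(a) %>%
  slice_head(n = 5)  # first 5 rows per group; smaller groups keep all their rows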
