How to change column names for mrset in R? - r

I am trying to create crosstabs from a dataframe that contains multiple-select questions. I am importing the data frame from an SPSS file using the foreign and expss packages, and I am creating the multiple-select questions using the mrset function. Here's demo code to make this clear.
Banner1 = w %>%
tab_cells(mrset(as.category( temp1,counted_value = "Checked"))) %>%
tab_cols(total(),mrset(as.category( temp2, counted_value = "Checked"))) %>%
tab_stat_cases(total_row_position = "none",label = "")
tab_pivot(Banner1)
The datatable imported looks like this
             Total  Q12_1  Q12_2  Q12_3  Q12_4  Q12_5
             A      B      C      D      E      F
Total Cases  803    34     18     14     38     37
Q13_1        64     11     7      8      9      7
Q13_2        12     54     54     43     13     12
Q13_3        67     54     23     21     6      4
So this is the imported dataset.
Coming to the problem: as you can see, this dataset has question numbers as column labels instead of variable labels. For single-select questions everything works fine. Is there a function I can use to change the column names for mrset output dynamically?
The desired output should be something like this, for example:
             Total  Apple  Mango  Banana  Orange  Grapes
             A      B      C      D       E       F
Total Cases  803    34     18     14      38      37
Apple        64     11     7      8       9       7
Mango        12     54     54     43      13      12
Banana       67     54     23     21      6       4
Any help would be greatly appreciated.

Related

How do I get rid of commas and periods, etc in R? [duplicate]

This question already has answers here:
How to load comma separated data into R?
(2 answers)
Closed 6 years ago.
This is my data set:
Depth.Fe
1 0,14.21
2 3,19.35
3 10,17.22
4 14,15.87
5 23,13.62
6 30,16.31
7 36,14.13
8 48,13.95
9 59,15
10 66,14.23
11 68,16.81
12 81,15.93
13 94,16.02
14 96,17.85
15 102,17.02
16 115,15.87
17 121,19.84
18 130,16.94
19 163,16.72
20 168,19.2
21 205,20.41
22 239,16.88
23 251,18.74
24 283,16.67
25 297,18.56
26 322,18.87
27 335,20.81
28 351,24.52
29 370,25.03
30 408,25.11
31 416,23.28
32 419,22.56
33 425,19
34 429,20.53
35 443,19.08
36 447,22.83
37 465,21.06
38 474,24.96
39 493,19.12
40 502,22.24
41 522,26.88
42 550,21.15
43 558,28.92
44 571,27.96
45 586,25.03
46 596,26.27
I want depth and Fe to be separated as individual columns, but nothing I try is working.
Please help.
First of all, @akrun is definitely right in his comment on your post. If this dataset was imported from somewhere, follow his advice.
Assuming that you were somehow handed this weird dataset, I would try this:
# split each "depth,Fe" string on the comma, then fill a two-column matrix row by row
df <- data.frame(matrix(as.numeric(unlist(strsplit(as.character(df$Depth.Fe), split = ","))), ncol = 2, byrow = TRUE))
colnames(df) <- c("Depth", "Fe")
This would take a dataset that looks like this:
Depth.Fe
1 0,14.21
2 3,19.35
to this:
Depth Fe
1 0 14.21
2 3 19.35
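As an alternative sketch, base R's read.table can parse the combined strings directly through its text argument. The three-row data frame below is a cut-down copy of the data in the question, and the name split_df is only illustrative.

```r
# Cut-down copy of the single-column data from the question
df <- data.frame(Depth.Fe = c("0,14.21", "3,19.35", "10,17.22"),
                 stringsAsFactors = FALSE)

# Treat each string as one line of comma-separated input
split_df <- read.table(text = df$Depth.Fe, sep = ",",
                       col.names = c("Depth", "Fe"))
split_df
```

This avoids building the matrix by hand, and read.table picks the column types itself.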

Looping through rows, creating and reusing multiple variables

I am building a streambed hydrology calculator in R using multiple tables from an Access database. I am having trouble automating and calculating the same set of indices for multiple sites. The following sample dataset describes my data structure:
> Thalweg
StationID AB0 AB1 AB2 AB3 AB4 AB5 BC1 BC2 BC3 BC4 Xdep_Vdep
1 1AAUA017.60 47 45 44 55 54 6 15 39 15 11 18.29
2 1AXKR000.77 30 27 24 19 20 18 9 12 21 13 6.46
3 2-BGU005.95 52 67 62 42 28 25 23 26 11 19 20.18
4 2-BLG011.41 66 85 77 83 63 35 10 70 95 90 67.64
5 2-CSR003.94 29 35 46 14 19 14 13 13 21 48 6.74
where each column represents certain field-measured parameters (i.e. depth of a reach section) and each row represents a different site.
I have successfully used the apply functions to simultaneously calculate simple functions on multiple rows:
> Xdepth <- apply(Thalweg[, 2:11], 1, mean) # Mean Depth
> Xdepth
1 2 3 4 5
33.1 19.3 35.5 67.4 25.2
and appending the results back to the proper station in a dataframe.
However, I am struggling when I want to calculate and save variables that are subsequently used for further calculations. I cannot seem to loop or apply the same function to multiple columns on a single row and complete the same calculations over the next row without mixing variables and data.
I want to do:
Residual_AB0 <- min(Xdep_Vdep, Thalweg$AB0)
Residual_AB1 <- min((Residual_AB0 + other_variables), Thalweg$AB1)
Residual_AB2 <- min((Residual_AB1 + other_variables), Thalweg$AB2)
Residual_AB3 <- min((Residual_AB2 + other_variables), Thalweg$AB3)
# etc.
Depth_AB0 <- (Thalweg$AB0 - Residual_AB0)
Depth_AB1 <- (Thalweg$AB1 - Residual_AB1)
Depth_AB2 <- (Thalweg$AB2 - Residual_AB2)
# etc.
I have tried and subsequently failed at for loops such as:
for (i in nrow(Thalweg)){
Residual_AB0 <- min(Xdep_Vdep, Thalweg$AB0)
Residual_AB1 <- min((Residual_AB0 + Stacks_Equation), Thalweg$AB1)
Residual_AB2 <- min((Residual_AB1 + Stacks_Equation), Thalweg$AB2)
Residual_AB3 <- min((Residual_AB2 + Stacks_Equation), Thalweg$AB3)
Residuals <- data.frame(Thalweg$StationID, Residual_AB0, Residual_AB1, Residual_AB2, Residual_AB3)
}
Is there a better way to approach looping through multiple lines of data when I need unique variables saved for each specific row that I am currently calculating? Thank you for any suggestions.
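Your exact problem is still a mystery to me, but it looks like you want a double for loop:
for (i in seq_len(nrow(Thalweg))) {
  # start each site from its own Xdep_Vdep value
  residual <- Thalweg[i, "Xdep_Vdep"]
  for (j in 2:11) {
    # columns 2 to 11 hold the AB0..BC4 measurements
    residual <- min(residual, Thalweg[i, j])
  }
}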
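To keep every intermediate value per site, one option is to write each step into a pre-allocated matrix. The following is only a sketch of that idea: it uses two rows and four columns of the sample data, and step_add is a hypothetical constant standing in for the unspecified "other_variables" in the question.

```r
# Two sites and four depth columns taken from the sample data
Thalweg <- data.frame(
  StationID = c("1AAUA017.60", "1AXKR000.77"),
  AB0 = c(47, 30), AB1 = c(45, 27), AB2 = c(44, 24), AB3 = c(55, 19),
  Xdep_Vdep = c(18.29, 6.46),
  stringsAsFactors = FALSE
)
depth_cols <- c("AB0", "AB1", "AB2", "AB3")
step_add <- 5  # hypothetical stand-in for "other_variables"

# One row per site, one column per residual
residuals <- matrix(NA_real_, nrow = nrow(Thalweg), ncol = length(depth_cols),
                    dimnames = list(Thalweg$StationID,
                                    paste0("Residual_", depth_cols)))

for (i in seq_len(nrow(Thalweg))) {
  prev <- Thalweg$Xdep_Vdep[i]
  for (j in seq_along(depth_cols)) {
    # the first residual compares Xdep_Vdep itself; later ones add step_add
    add <- if (j == 1) 0 else step_add
    prev <- min(prev + add, Thalweg[i, depth_cols[j]])
    residuals[i, j] <- prev
  }
}

# Depth_ABx = ABx - Residual_ABx, for all sites at once
depths <- as.matrix(Thalweg[, depth_cols]) - residuals
```

Because each row is handled inside its own pass of the outer loop, values from one site never leak into the next.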

Barchart help in R

I am trying to set up a bar chart to compare control and experimental samples taken of specific compounds. The data set is known as 'hydrocarbon3' and contains the following information:
Exp. Contr.
c12 89 49
c17 79 30
c26 78 35
c42 63 3
pris 0.5 0.8
phy 0.5 0.9
nap 87 48
nap1 83 44
nap2 78 44
nap3 73 20
acen1 81 50
acen2 86 46
fluor 83 11
fluor1 68 13
fluor2 79 17
dibe 65 7
dibe1 67 6
dibe2 56 10
phen 82 13
phen1 70 12
phen2 65 15
phen3 53 14
fluro 62 9
pyren 48 11
pyren1 34 10
pyren2 19 8
chrys 22 3
chrys1 21 3
chrys2 21 3
When I create a bar chart with the formula:
barplot(as.matrix(hydrocarbon3),
main=c("Fig 1. Change in concentrations of different hydrocarbon compounds\nin sediments with and without the presence of bacteria after 21 days"),
beside=TRUE,
xlab="Oiled sediment samples collected at 21 days",
space=c(0,2),
ylab="% loss in concentration relative to day 0")
I get this diagram. However, I need the control and experimental samples of each chemical to be next to each other to allow a more accurate comparison, rather than the experimental samples bunched on the left and the control samples bunched on the right. Is there a way to correct this in R?
Try transposing your matrix:
barplot(t(as.matrix(hydrocarbon3)), beside=T)
Basically, barplot will plot things in the order they show up in the matrix, which, since a matrix is just a vector wrapped colwise, means barplot will plot all the values of the first column, then all those of the second column, etc.
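A small sketch of the transposition, using only the first three compounds from the question's data (the legend and axis label are additions for readability, not part of the original call):

```r
# First three compounds from the question's data set
hydrocarbon3 <- data.frame(Exp. = c(89, 79, 78),
                           Contr. = c(49, 30, 35),
                           row.names = c("c12", "c17", "c26"))

# After t(), each column is one compound and each row is one treatment,
# so beside = TRUE draws Exp. and Contr. as a pair for every compound
m <- t(as.matrix(hydrocarbon3))
bar_mids <- barplot(m, beside = TRUE,
                    legend.text = rownames(m),
                    ylab = "% loss in concentration relative to day 0")
```

barplot returns the bar midpoints, here one column per compound, which is handy if labels need to be added later.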
Check this question out: Barplot with 2 variables side by side
It uses ggplot2, so you'll have to run the following code first:
install.packages("ggplot2")
library(ggplot2)
Hopefully this works for you. Plus it looks a little nicer with ggplot2!
> df
row exp con
1 a 1 2
2 b 2 3
3 c 3 4
> barplot(rbind(df$exp,df$con),
+ beside = TRUE,names.arg=df$row)
produces:

R efficiently add up tables in different order

At some point in my code, I get a list of tables that looks much like this:
[[1]]
cluster_size start end number p_value
13 2 12 13 131 4.209645e-233
12 1 12 12 100 6.166824e-185
22 11 12 22 132 6.916323e-143
23 12 12 23 133 1.176194e-139
13 1 13 13 31 3.464284e-38
13 68 13 117 34 3.275941e-37
23 78 23 117 2 4.503111e-32
....
[[2]]
cluster_size start end number p_value
13 2 12 13 131 4.209645e-233
12 1 12 12 100 6.166824e-185
22 11 12 22 132 6.916323e-143
23 12 12 23 133 1.176194e-139
13 1 13 13 31 3.464284e-38
....
While I don't show the full tables here, I know they are all the same size. What I want to do is make one table where I add up the p-values. The problem is that the $cluster_size, $start, $end and $number columns don't necessarily correspond to the same row when I look at the table in different list elements, so I can't just do a simple sum.
The brute force way to do this is to: 1) make a blank table 2) copy in the appropriate $cluster_size, $start, $end, $number columns from the first table and pull the correct p-values using a which() statement from all the tables. Is there a more clever way of doing this? Or is this pretty much it?
Edit: I was asked for a dput file of the data. It's located here:
http://alrig.com/code/
In the sample case, the order of the rows happens to match. That will not always be the case.
Seems like you can do this in two steps:
1. Convert your list to a data.frame.
2. Use any of the split-apply-combine approaches to summarize.
Assuming your data was named X, here's what you could do:
library(plyr)
#need to convert to data.frame since all of your list objects are of class matrix
XDF <- as.data.frame(do.call("rbind", X))
ddply(XDF, .(cluster_size, start, end, number), summarize, sump = sum(p_value))
#-----
cluster_size start end number sump
1 1 12 12 100 5.550142e-184
2 1 13 13 31 3.117856e-37
3 1 22 22 1 9.000000e+00
...
29 105 23 117 2 6.271469e-16
30 106 22 146 13 7.266746e-25
31 107 23 146 12 1.382328e-25
Lots of other aggregation techniques are covered here. I'd look at data.table package if your data is large.
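For comparison, the same two steps can be sketched in base R with aggregate, without plyr. The list X below is made-up toy data with the same column names as in the question, with the rows deliberately in a different order in each element.

```r
cols <- c("cluster_size", "start", "end", "number", "p_value")

# Toy stand-in for the list of matrices; row order differs between elements
X <- list(
  matrix(c(2, 12, 13, 131, 0.001,
           1, 12, 12, 100, 0.002),
         ncol = 5, byrow = TRUE, dimnames = list(NULL, cols)),
  matrix(c(1, 12, 12, 100, 0.003,
           2, 12, 13, 131, 0.010),
         ncol = 5, byrow = TRUE, dimnames = list(NULL, cols))
)

# Step 1: stack everything into one data.frame
XDF <- as.data.frame(do.call(rbind, X))

# Step 2: sum p_value within each combination of the key columns
summed <- aggregate(p_value ~ cluster_size + start + end + number,
                    data = XDF, FUN = sum)
summed
```

Grouping on the key columns means row order inside each list element no longer matters.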

Transpose with multiple variables and more than one metrics in R

I'm previously a SAS user - since I don't have SAS anymore I need to learn to use R for work.
The dataset has the following column:
market date sitename impression clicks
I want to transpose it into:
market date sitename-impression sitename-clicks
I think in SAS I used to do:
Proc Transpose
by market date;
id sitename;
var impression clicks;
run;
I do have a book on R and googled a lot, but couldn't find the solution that works...
Would really appreciate if anyone can help.
Thanks in advance!!!
Let me start by saying welcome to Stack Overflow. Glad to have a new user. When you ask a question, it's helpful and encouraged to provide the code you're using and a reproducible data set that looks like the original. This is called a minimal reproducible example. To get a data set into a question you have several options; here are two: use dput() around the object name and copy and paste what is displayed in the console, or just post the dataframe directly. For the code, provide everything necessary to replicate your problem. I hope you find this helpful for future questions you'll ask.
I may not fully understand but I think you want to transform, not transpose, the data.
dat <- data.frame(market=rnorm(10), date=rnorm(10), #let's create a data set
sitename=rnorm(10), impression=rnorm(10), clicks=rnorm(10))
dat #look at it (I pasted it below)
# > dat
# market date sitename impression clicks
# 1 -0.9593797 -0.08411994 1.6079129 -0.5204772 -0.31633966
# 2 -0.5088689 1.78799500 -0.2469315 1.3476964 -0.04344779
# 3 -0.1527465 0.81673996 1.7824969 -1.5531260 -1.28304384
# 4 -0.7026194 0.52072913 -0.1174356 0.5722210 -1.20474443
# 5 -0.4537490 -0.69139062 1.1124277 -0.2452974 -0.33025320
# 6 0.7466588 0.36318337 -0.4623319 -0.9036768 -0.65754302
# 7 0.8007612 2.59588554 0.1820732 0.4318629 -0.36308748
# 8 1.0781715 -1.01512734 0.2297475 0.9219439 -1.15687902
# 9 0.3731450 -0.19004572 0.5190749 -1.4020371 -0.97370295
# 10 0.7724259 1.76528303 0.5781786 -0.5490849 -0.83819036
#now to create the new columns (I think this is what you want)
#the easiest way is to use transform. ?transform for more
dat.new <- transform(dat, sitename.clicks=sitename-clicks,
impression.clicks=impression-clicks)
dat.new #here's the new data set. Notice it has the new and old columns.
#To get rid of the old columns you can use indexing and specify the columns you want.
dat.new[, c(1:2, 6:7)]
#We could have also done:
dat.new[, c(1,2,6,7)]
#or said the columns not wanted with negative indexing:
dat.new[, -c(3:5)]
EDIT In looking at Brian's comments and the variables I would think that a long to wide transformation is what the poster desires. I would likely approach it using Wickham's reshape2 package as well, as this method is easier for me to work with and I imagine it would be easier for an R beginner as well. However, here is a base way to do the long to wide format using the same data set Brian provided:
wide <- reshape(DF, v.names=c("impression", "clicks"), idvar=c("market", "date"),
timevar="sitename", direction="wide")
reshape(wide)
The reshape function is very flexible but takes some getting used to. I'm leaving my previous response up as well to keep the history of this post, though I now believe it is not the poster's intent. It serves as a reminder that a reproducible example is very helpful in providing clarity to your query.
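For readers who want to try the base approach in isolation, here is a self-contained sketch on a four-row toy data set. The values are made up for illustration; only the column names come from the question.

```r
# Toy long-format data: two markets, two sites, one date
DF <- data.frame(
  market = c("A", "A", "B", "B"),
  date = as.Date("2012-02-28"),
  sitename = c("a", "b", "a", "b"),
  impression = c(74, 11, 4, 74),
  clicks = c(97, 71, 53, 9),
  stringsAsFactors = FALSE
)

# One row per market/date; impression and clicks get one column per sitename
wide <- reshape(DF, v.names = c("impression", "clicks"),
                idvar = c("market", "date"),
                timevar = "sitename", direction = "wide")
names(wide)
```

reshape names the new columns by joining the measure and the sitename with a dot, for example impression.a and clicks.b.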
Example data, as Tyler said, is important. I interpreted your question differently because I thought your data was different. I didn't take the - as a literal subtraction of numerics, but a combination of variables.
DF <- expand.grid(market = LETTERS[1:5],
date = Sys.Date()+(0:5),
sitename = letters[1:2])
n <- nrow(DF)
DF$impression <- sample(100, n, replace=TRUE)
DF$clicks <- sample(100, n, replace=TRUE)
I find the reshape2 package useful for these sorts of transpositions/transformations/rearrangements.
library("reshape2")
dcast(melt(DF, id.vars=c("market","date","sitename")),
market+date~sitename+variable)
gives
market date a_impression a_clicks b_impression b_clicks
1 A 2012-02-28 74 97 11 71
2 A 2012-02-29 34 30 88 35
3 A 2012-03-01 40 85 40 49
4 A 2012-03-02 46 12 99 20
5 A 2012-03-03 6 95 85 56
6 A 2012-03-04 61 61 42 64
7 B 2012-02-28 4 53 74 9
8 B 2012-02-29 43 27 92 59
9 B 2012-03-01 34 26 86 43
10 B 2012-03-02 81 47 84 35
11 B 2012-03-03 3 5 91 48
12 B 2012-03-04 19 26 99 21
13 C 2012-02-28 22 31 100 53
14 C 2012-02-29 40 83 95 27
15 C 2012-03-01 78 89 81 29
16 C 2012-03-02 57 55 79 87
17 C 2012-03-03 37 61 3 97
18 C 2012-03-04 83 61 41 77
19 D 2012-02-28 81 18 47 3
20 D 2012-02-29 90 100 17 83
21 D 2012-03-01 12 40 35 93
22 D 2012-03-02 85 14 63 67
23 D 2012-03-03 63 53 29 58
24 D 2012-03-04 40 79 56 70
25 E 2012-02-28 97 62 68 31
26 E 2012-02-29 24 84 17 63
27 E 2012-03-01 94 93 32 2
28 E 2012-03-02 6 26 86 26
29 E 2012-03-03 100 34 37 80
30 E 2012-03-04 89 87 72 11
The column names have a _ between them rather than a -, but you can change that if you want. I wouldn't recommend it, though, because then you will have problems later referencing the column since the - will be taken as subtraction (you would need to quote the name).
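A short sketch of what that quoting looks like in practice, on a made-up one-row data frame:

```r
# Made-up one-row result with underscore names
wide <- data.frame(market = "A", a_impression = 74, a_clicks = 97)

# Swap the first underscore in each name for a hyphen
names(wide) <- sub("_", "-", names(wide), fixed = TRUE)

# Hyphenated names must now be quoted: backticks with $, or a string with [[ ]]
wide$`a-impression`
wide[["a-clicks"]]
```

Without the backticks, wide$a-impression would be parsed as wide$a minus impression.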
