R: combine two 2-dimensional crosstabs - r

> t <- read.csv("data.csv", sep=';')
> t
sex pacemaker smoker
1 female no never
2 female no never
3 male no never
4 male no former
5 male yes former
6 male yes former
7 female yes current
8 female yes former
9 female no current
> xtabs(~smoker+sex, data=t)
sex
smoker female male
current 2 0
former 1 3
never 2 1
> xtabs(~smoker+pacemaker, data=t)
pacemaker
smoker no yes
current 1 1
former 1 3
never 3 0
How can I combine two 2-dimensional crosstabs in R ?
Desired output:
| sex | pacemaker
smoker | female male | no yes
current | 2 0 | 1 1
former | 1 3 | 1 3
never | 2 1 | 3 0

I have renamed your data.frame to be df. This code should work for you.
cbind(xtabs(~smoker+sex, data=df), xtabs(~smoker+pacemaker, data=df))
female male no yes
current 2 0 1 1
former 1 3 1 3
never 2 1 3 0
You might want to rename the pacemaker column headers.
colnames(XTab)[3:4] = c("Pacemaker_no", "Pacemaker_yes")
XTab
female male Pacemaker_no Pacemaker_yes
current 2 0 1 1
former 1 3 1 3
never 2 1 3 0

Related

Removing certain values from a data frame

I know there are already some threads like this, but I could not find any solutions.
I have a dataframe that looks like this:
Name Age Sex Survived
1 Allison 0.17 female 1
2 Leah 0.33 female 0
3 David 0.8 male 1
4 Daniel 0.83 male 1
5 Alex 0.83 male 1
6 Jay 0.92 male 1
7 Sara 16 female 1
8 Jade 15 female 1
9 Connor 17 male 1
10 Jon 18 male 1
11 Mary 8 female 1
I want to remove ages that are below 1. I want the data to look like this:
Name Age Sex Survived
1 Allison NA female 1
2 Leah NA female 0
3 David NA male 1
4 Daniel NA male 1
5 Alex NA male 1
6 Jay NA male 1
7 Sara 16 female 1
8 Jade 15 female 1
9 Connor 17 male 1
10 Jon 18 male 1
11 Mary 8 female 1
Or to just remove the rows with ages < 1 altogether.
Following other solutions I tried this but it didn't work
mydata[mydata$Age<"1"&&mydata$Age>"0"] <- NA
Here are three ways to remove the rows:
mydata[mydata$Age > 1, ]
subset(mydata, Age > 1)
filter(mydata, Age > 1)
Here is how to make them NA:
mydata$Age[mydata$Age < 1] <- NA
Your issue is that you are using 1 as a character (in quotes). Character less/greater than work a little differently to numbers so be careful. Also make sure your Age column is numeric. The best way to do that is
mydata$Age <- as.numeric(as.character(mydata$Age))
so you don't accidentally mess up factor variables.
edit
put the wrong signs. fixed now
> mydata[mydata$Age<1, "Age"] <- NA
> mydata
Name Age Sex Survived
1 Allison NA female 1
2 Leah NA female 0
3 David NA male 1
4 Daniel NA male 1
5 Alex NA male 1
6 Jay NA male 1
7 Sara 16 female 1
8 Jade 15 female 1
9 Connor 17 male 1
10 Jon 18 male 1
11 Mary 8 female 1
Update
Maybe you can use if Age is factor
mydata[as.numeric(as.character(mydata$Age))<1, "Age"] <- NA

R: apply/ lapply: How to Create a bar chart if all entries in on column are 1's?

imagine, you have the following data set:
df<-data.frame(read.table(header = TRUE, text = "
ID Wine Beer Water Age Gender
1 0 1 0 20 Male
2 1 0 1 38 Female
3 0 0 1 32 Female
4 1 0 1 30 Male
5 1 1 1 30 Male
6 1 1 1 26 Female
7 0 1 1 36 Female
8 0 1 1 29 Male
9 0 1 1 33 Female
10 0 1 1 20 Female"))
Further, imagine you want to compile summary tables that print out the frequencies of those that drink wine, beer, water.
I solved it that way.
con<-apply(df[,c(2:4)], 2, table)
con_P<-prop.table(con,2)
This allows me to complete my ultimate goal of compiling a bar chart in the way I want it:
barplot(con_P)
It works perfectly. No problem. Now, let us tweak the data set as follows: We set all entries for water to 1.
df<-data.frame(read.table(header = TRUE, text = "
ID Wine Beer Water Age Gender
1 0 1 1 20 Male
2 1 0 1 38 Female
3 0 0 1 32 Female
4 1 0 1 30 Male
5 1 1 1 30 Male
6 1 1 1 26 Female
7 0 1 1 36 Female
8 0 1 1 29 Male
9 0 1 1 33 Female
10 0 1 1 20 Female"))
If I now run the following commands:
con<-apply(df[,c(2:4)], 2, table)
con_P<-prop.table(con,2)
it gives me the following error message after the second line: Error in margin.table(x, margin) : 'x' is not an array!
Through another question here on this forum, I learned that the following will help me to overcome this issue:
con_P <- lapply(con, function(x) x/sum(x))
However, if I now run
barplot(con_P)
R does not create a barplot: Error in -0.01 * height : non-numeric argument to binary operator. I assume it is because it is no array!
My question is what to do now (how would I transform con_P in th second example into an array?). Secondly, how can I make the entire step of creating prop.tables and then a bar chart more efficient? Any help is much appreciated.
We can by converting the columns to factor with levels specified. In the second example, as the columns have 0 and 1 values in the 2nd and 3rd, we use the levels as 0:1, then get the table and convert to proportion with prop.table. and do the barplot
barplot(prop.table(sapply(df[2:4],
function(x) table(factor(x, levels=0:1))),2))
Reproducing your data:
df<-data.frame(read.table(header = TRUE, text = "
ID Wine Beer Water Age Gender
1 0 1 1 20 Male
2 1 0 1 38 Female
3 0 0 1 32 Female
4 1 0 1 30 Male
5 1 1 1 30 Male
6 1 1 1 26 Female
7 0 1 1 36 Female
8 0 1 1 29 Male
9 0 1 1 33 Female
10 0 1 1 20 Female"))
con <-lapply(df[,c(2:4)], table)
con_P <- lapply(con, function(x) x/sum(x))
You can use reshape2 to melt the data:
library(reshape2)
df <- melt(con_P)
Now, if you want to use gpplot2 you can use df to plot the bar plot:
ggplot(df, aes(x = L1, y = value, fill = factor(Var1) )) +
geom_bar(stat= "identity") +
theme_bw()
If you want to use barplot you can reshape the data.frame into an array:
array <- acast( df, Var1~L1)
array[is.na(array)] <- 0
barplot(array)

R Frequency Tables: prop.table does not work if all data points within variable share the outcome?

imagine, you have the following data set:
df<-data.frame(read.table(header = TRUE, text = "
ID Wine Beer Water Age Gender
1 0 1 0 20 Male
2 1 0 1 38 Female
3 0 0 1 32 Female
4 1 0 1 30 Male
5 1 1 1 30 Male
6 1 1 1 26 Female
7 0 1 1 36 Female
8 0 1 1 29 Male
9 0 1 1 33 Female
10 0 1 1 20 Female"))
Further, imagine you want to compile summary tables that print out the frequencies of those that drink wine, beer, water.
I solved it that way.
con<-apply(df[,c(2:4)], 2, table)
con_P<-prop.table(con,2)
It works perfectly. No problem. Now, let us tweak the data set as follows: We set all entries for water to 1.
df<-data.frame(read.table(header = TRUE, text = "
df<-data.frame(read.table(header = TRUE, text = "
ID Wine Beer Water Age Gender
1 0 1 1 20 Male
2 1 0 1 38 Female
3 0 0 1 32 Female
4 1 0 1 30 Male
5 1 1 1 30 Male
6 1 1 1 26 Female
7 0 1 1 36 Female
8 0 1 1 29 Male
9 0 1 1 33 Female
10 0 1 1 20 Female"))
If I now run the following commands:
con<-apply(df[,c(2:4)], 2, table)
con_P<-prop.table(con,2)
it gives me the following error message after the second line: Error in margin.table(x, margin) : 'x' is not an array! Why?
Why does it make a difference if all data points within a variable have all the same outcome? Also, what can I do to circumvent this problem? Thanks guys!
The function prop.table uses the function sweep which takes an array as first argument. Since your second con is a list and not an array, prop.table will fail.
Why is your second con a list? Because the column Water has just one element and all the other columns have 2 elements. When the number of elements is different apply can't simplify the result to an array and gives you a list.
In the example you gave us, a safer way is to work with lapply instead, it will always give a list with the results:
con <- lapply(df, table)
con_P <- lapply(con, function(x) x/sum(x))

Formatting long-form survey data in R

We asked 3 people two or three yes-no questions. Let me denote these 3 people by 101,102,103 the questions by "A", "B","C" and the responses by 0, 1. The result is
q<-data.frame(response=c(0,0,1,0,0,1,1),
qstn=c("A","B","A","B","A","B","C"),
person=c(101,101,102,102,103,103,103))
We need to convert this table to the following format
person|qustionA|questionB|questionC
101 | 0 | 0 | NA
102 | 1 | 0 | NA
103 | 0 | 1 | 1
You can use reshape from base-r:
reshape(q, v.names="response", idvar="person",
timevar="qstn", direction="wide")
person response.A response.B response.C
1 101 0 0 NA
3 102 1 0 NA
5 103 0 1 1

How do you plot/analyze variables in R based on their common value in a data frame?

I have a data frame that I'm working with that contains experimental data. For the purposes of this post we can limit the discussion to 3 columns: ExperimentID, ROI, isContrast, isTreated, and, Value. ROI is a text-based factor that indicates where a region-of-interest is drawn, e.g. 'ROI_1', 'ROI_2',...etc. isTreated and isContrast are binary fields indicating whether or not some treatment was applied. I want to make a scatter plot comparing the values of, e.g., 'ROI_1' vs. 'ROI_2 ', which means I need the data paired in such a way that when I plot it the first X value is from Experiment_1 and ROI_1, the first Y value is from Experiment_1 and ROI_2, the next X value is from Experiment_2 and ROI_1, the next Y value is from Experiment_2 and ROI_2, etc. I only want to make this comparison for common values of isContrast and isTreated (i.e. 1 plot for each combination of these variables, so 4 plots altogether.
Subsetting doesn't solve my problem because data from different experiments/ROIs was sometimes entered out of numerical order.
The following code produces a mock data set to demonstrate the problem
expID = c('Bob','Bob','Bob','Bob','Lisa','Lisa','Lisa','Lisa','Alice','Alice','Alice','Alice','Joe','Joe','Joe','Joe','Bob','Bob','Alice','Alice','Lisa','Lisa')
treated = c(0,0,1,1,0,0,1,1,0,0,1,1,0,0,1,1,0,0,0,0,0,0)
contrast = c(0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1)
val = c(1,2,3,4,1,2,3,4,1,2,3,4,1,2,3,4,6,7,8,9,10,11)
roi = c(rep('A',16),'B','B','B','B','B','B')
myFrame = data.frame(ExperimentID=expID,isTreated = treated, isContrast= contrast,Value = val, ROI=roi)
ExperimentID isTreated isContrast Value ROI
1 Bob 0 0 1 A
2 Bob 0 1 2 A
3 Bob 1 0 3 A
4 Bob 1 1 4 A
5 Lisa 0 0 1 A
6 Lisa 0 1 2 A
7 Lisa 1 0 3 A
8 Lisa 1 1 4 A
9 Alice 0 0 1 A
10 Alice 0 1 2 A
11 Alice 1 0 3 A
12 Alice 1 1 4 A
13 Joe 0 0 1 A
14 Joe 0 1 2 A
15 Joe 1 0 3 A
16 Joe 1 1 4 A
17 Bob 0 0 6 B
18 Bob 0 1 7 B
19 Alice 0 0 8 B
20 Alice 0 1 9 B
21 Lisa 0 0 10 B
22 Lisa 0 1 11 B
Now let's say I want to scatter plot values for A vs. B. That is to say, I want to plot x vs. y where {(x,y)} = {(Bob's Value from ROI A, Bob's Value from ROI B), (Alice's Value from ROI A, Alices Value from ROI B)},...} etc. and these all must have the same values for isTreated and isContrast for the comparison to make sense. Now, if I just go an subset I'll get something like:
> x= myFrame$Value[(myFrame$ROI == 'A') & (myFrame$isTreated == 0) & (myFrame$isContrast == 0)]
> x
[1] 1 1 1 1
> y= myFrame$Value[(myFrame$ROI == 'B') & (myFrame$isTreated == 0) & (myFrame$isContrast == 0)]
> y
[1] 6 8 10
Now as you can see the values in y correspond to the first rows of Bob, Lisa, Alice and Joe, respectively but the values of y Bob, Alice and Lisa respectively, and there is no value for Joe.
So say I ignored the value for Joe because that data is missing for B and just decided to plot the first 3 values of x vs. the first 3 values of y. The data are still out of order because x = (Bob, Lisa, Alice) but y = (Bob, Alice, Lisa) in terms of where the values are coming from. So I would like to now how to make vectors such that the order is correct and the plot makes sense.
Similar to #Matthew, with ggplot:
The idea is to reshape your data so the the values from ROI=A and RIO=B are in different columns. This can be done (with your sample data) as follows:
library(reshape2)
zz <- dcast(myFrame,
value.var="Value",
formula=ExperimentID+isTreated+isContrast~ROI)
zz
ExperimentID isTreated isContrast A B
1 Alice 0 0 1 8
2 Alice 0 1 2 9
3 Alice 1 0 3 NA
4 Alice 1 1 4 NA
5 Bob 0 0 1 6
6 Bob 0 1 2 7
7 Bob 1 0 3 NA
8 Bob 1 1 4 NA
9 Joe 0 0 1 NA
10 Joe 0 1 2 NA
11 Joe 1 0 3 NA
12 Joe 1 1 4 NA
13 Lisa 0 0 1 10
14 Lisa 0 1 2 11
15 Lisa 1 0 3 NA
16 Lisa 1 1 4 NA
Notiice that your sample data is rather sparse (lots of NA's).
To plot:
library(ggplot2)
ggplot(zz,aes(x=A,y=B,color=factor(isTreated))) +
geom_point(size=4)+facet_wrap(~isContrast)
Produces this:
The reason there are no blue points is that, in your sample data, there are no occurrences of isTreated=1 and ROI=B.
Something like this, perhaps:
myFrameReshaped <- reshape(myFrame, timevar='ROI', direction='wide', idvar=c('ExperimentID','isTreated','isContrast'))
plot(Value.B ~ Value.A, data=myFrameReshaped)
To condition by the isTreated and isContrast variables, lattice comes in handy:
library(lattice)
xyplot(Value.B~Value.A | isTreated + isContrast, data=myFrameReshaped)
Values that are not present for one of the conditions give NA, and are not plotted.
head(myFrameReshaped)
## ExperimentID isTreated isContrast Value.A Value.B
## 1 Bob 0 0 1 6
## 2 Bob 0 1 2 7
## 3 Bob 1 0 3 NA
## 4 Bob 1 1 4 NA
## 5 Lisa 0 0 1 10
## 6 Lisa 0 1 2 11

Resources