Formatting long-form survey data in R

We asked 3 people two or three yes-no questions. Let me denote these 3 people by 101, 102 and 103, the questions by "A", "B" and "C", and the responses by 0 and 1. The result is:
q<-data.frame(response=c(0,0,1,0,0,1,1),
qstn=c("A","B","A","B","A","B","C"),
person=c(101,101,102,102,103,103,103))
We need to convert this table to the following format:
person|questionA|questionB|questionC
101 | 0 | 0 | NA
102 | 1 | 0 | NA
103 | 0 | 1 | 1

You can use reshape from base R:
reshape(q, v.names="response", idvar="person",
timevar="qstn", direction="wide")
person response.A response.B response.C
1 101 0 0 NA
3 102 1 0 NA
5 103 0 1 1
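
To get the column names shown in the question, the reshaped result can be renamed afterwards. The following is a minimal sketch built on the same reshape call; the object name `wide` and the `question` prefix are chosen here for illustration and are not part of the original answer.

```r
# Survey data in long form, as given in the question
q <- data.frame(response = c(0, 0, 1, 0, 0, 1, 1),
                qstn = c("A", "B", "A", "B", "A", "B", "C"),
                person = c(101, 101, 102, 102, 103, 103, 103))

# One row per person, one column per question
wide <- reshape(q, v.names = "response", idvar = "person",
                timevar = "qstn", direction = "wide")

# Turn "response.A" into "questionA", and so on (illustrative naming)
names(wide) <- sub("^response\\.", "question", names(wide))

# Drop the leftover row names (1, 3, 5) from the long table
rownames(wide) <- NULL
wide
```

Questions a person did not answer stay NA, so person 101 and 102 have NA in the questionC column.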

Related

Conditionally remove rows from H2O frame object in R

I have an H2O frame R object like this
h2odf
A | B | C | D
--|---|---|---
1 | NA| 2 | 0
2 | 1 | 2 | 0
3 | NA| 2 | 0
4 | 3 | 2 | 0
I want to remove all those rows where B is NA (1st and 3rd row). I have tried
na <- is.na(h2odf[,"b"])
h2odf <- h2odf[!na,]
and
h2odf <- h2odf[!is.na(h2odf$B),]
and
h2odf <- subset(h2odf, B!=NA)
This works for an R data frame but not for an H2O frame. It gives this error:
Error in .h2o.doSafeREST(h2oRestApiVersion = h2oRestApiVersion, urlSuffix = page, :
ERROR MESSAGE:
DistributedException from localhost/127.0.0.1:54321: 'Cannot set illegal UUID value'
Desired output is
h2odf
A | B | C | D
--|---|---|---
2 | 1 | 2 | 0
4 | 3 | 2 | 0
One option I have is to convert it into an R data frame, remove the rows and convert it back to an H2O frame. But that takes a long time because the input file is close to 4.5 GB. Is it possible to do this on the H2O frame (hex object) itself?
I am running RStudio on an AWS cluster.

> class(h2odf)
[1] "H2OFrame"
> h2odf
A B C D
1 1 NA 2 0
2 2 1 2 0
3 3 NA 2 0
4 4 3 2 0
[4 rows x 4 columns]
> h2odf[!is.na(as.numeric(as.character(h2odf$B))),]
A B C D
1 2 1 2 0
2 4 3 2 0
[2 rows x 4 columns]

R: combine two 2-dimensional crosstabs

> t <- read.csv("data.csv", sep=';')
> t
sex pacemaker smoker
1 female no never
2 female no never
3 male no never
4 male no former
5 male yes former
6 male yes former
7 female yes current
8 female yes former
9 female no current
> xtabs(~smoker+sex, data=t)
sex
smoker female male
current 2 0
former 1 3
never 2 1
> xtabs(~smoker+pacemaker, data=t)
pacemaker
smoker no yes
current 1 1
former 1 3
never 3 0
How can I combine two 2-dimensional crosstabs in R?
Desired output:
| sex | pacemaker
smoker | female male | no yes
current | 2 0 | 1 1
former | 1 3 | 1 3
never | 2 1 | 3 0

I have renamed your data.frame to df. This code should work for you.
XTab <- cbind(xtabs(~smoker+sex, data=df), xtabs(~smoker+pacemaker, data=df))
XTab
female male no yes
current 2 0 1 1
former 1 3 1 3
never 2 1 3 0
You might want to rename the pacemaker column headers.
colnames(XTab)[3:4] = c("Pacemaker_no", "Pacemaker_yes")
XTab
female male Pacemaker_no Pacemaker_yes
current 2 0 1 1
former 1 3 1 3
never 2 1 3 0
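
Since data.csv is not available, here is a self-contained sketch of the same idea. It rebuilds the data frame from the rows printed in the question and prefixes every column with its variable name before binding. The names `by_sex`, `by_pacemaker` and `combined` are made up for this example. Note that cbind aligns rows by position, so this assumes both tables list the smoker levels in the same order, which holds here because both come from the same column.

```r
# Data retyped from the rows printed in the question
df <- data.frame(
  sex = c("female", "female", "male", "male", "male",
          "male", "female", "female", "female"),
  pacemaker = c("no", "no", "no", "no", "yes", "yes", "yes", "yes", "no"),
  smoker = c("never", "never", "never", "former", "former",
             "former", "current", "former", "current"))

by_sex <- xtabs(~ smoker + sex, data = df)
by_pacemaker <- xtabs(~ smoker + pacemaker, data = df)

# Prefix the column names so their origin stays visible after binding
colnames(by_sex) <- paste("sex", colnames(by_sex), sep = "_")
colnames(by_pacemaker) <- paste("pacemaker", colnames(by_pacemaker), sep = "_")

# Bind side by side; rows are matched by position
combined <- cbind(by_sex, by_pacemaker)
combined
```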

Data transformation for machine learning

I have a dataset with SKU IDs and their counts. I need to feed this data into a machine learning algorithm in such a way that the SKU IDs become columns and the COUNTs sit at the intersection of transaction ID and SKU ID. Can anyone suggest how to achieve this transformation?
CURRENT DATA
TransID SKUID COUNT
1 31 1
1 32 2
1 33 1
2 31 2
2 34 -1
DESIRED DATA
TransID 31 32 33 34
1 1 2 1 0
2 2 0 0 -1

In R, we can use either xtabs
xtabs(COUNT~., df1)
# SKUID
#TransID 31 32 33 34
# 1 1 2 1 0
# 2 2 0 0 -1
Or dcast
library(reshape2)
dcast(df1, TransID~SKUID, value.var="COUNT", fill=0)
# TransID 31 32 33 34
#1 1 1 2 1 0
#2 2 2 0 0 -1
Or spread
library(tidyr)
spread(df1, SKUID, COUNT, fill=0)
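
The R snippets above assume a data frame called df1 that is never shown. A minimal base R sketch, with df1 retyped from the CURRENT DATA table, could look like this. The conversion to a plain data frame through as.data.frame.matrix is an addition made here, on the assumption that the downstream learning code expects a data frame rather than an xtabs table.

```r
# Long-form data retyped from the question
df1 <- data.frame(TransID = c(1, 1, 1, 2, 2),
                  SKUID = c(31, 32, 33, 31, 34),
                  COUNT = c(1, 2, 1, 2, -1))

# Sum COUNT for every TransID/SKUID pair; absent pairs become 0
wide <- xtabs(COUNT ~ TransID + SKUID, data = df1)

# Keep the 2-D layout while converting to a data frame
# (row names hold TransID, column names hold SKUID)
wide_df <- as.data.frame.matrix(wide)
wide_df
```

Because xtabs sums the values, repeated TransID/SKUID pairs would be added together rather than raising an error.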

In Pandas, you can use pivot:
>>> df.pivot('TransID', 'SKUID').fillna(0)
COUNT
SKUID 31 32 33 34
TransID
1 1 2 1 0
2 2 0 0 -1
To avoid ambiguity, it is best to explicitly label your variables (pandas 2.0 and later accept only keyword arguments here):
df.pivot(index='TransID', columns='SKUID').fillna(0)
You can also perform a groupby and then unstack SKUID:
>>> df.groupby(['TransID', 'SKUID']).COUNT.sum().unstack('SKUID').fillna(0)
SKUID 31 32 33 34
TransID
1 1 2 1 0
2 2 0 0 -1

In GraphLab/SFrame, the relevant commands are unstack and unpack.
import sframe #or import graphlab
sf = sframe.SFrame({'TransID':[1, 1, 1, 2, 2],
'SKUID':[31, 32, 33, 31, 34],
'COUNT': [1, 2, 1, 2, -1]})
sf2 = sf.unstack(['SKUID', 'COUNT'], new_column_name='dict_counts')
out = sf2.unpack('dict_counts', column_name_prefix='')
The missing values can be filled by column:
for c in out.column_names():
    out[c] = out[c].fillna(0)
out.print_rows()
+---------+----+----+----+----+
| TransID | 31 | 32 | 33 | 34 |
+---------+----+----+----+----+
| 1 | 1 | 2 | 1 | 0 |
| 2 | 2 | 0 | 0 | -1 |
+---------+----+----+----+----+

How do you plot/analyze variables in R based on their common value in a data frame?

I have a data frame that I'm working with that contains experimental data. For the purposes of this post we can limit the discussion to 5 columns: ExperimentID, ROI, isContrast, isTreated, and Value. ROI is a text-based factor that indicates where a region of interest is drawn, e.g. 'ROI_1', 'ROI_2', etc. isTreated and isContrast are binary fields indicating whether or not some treatment was applied. I want to make a scatter plot comparing the values of, e.g., 'ROI_1' vs. 'ROI_2', which means I need the data paired in such a way that when I plot it the first X value is from Experiment_1 and ROI_1, the first Y value is from Experiment_1 and ROI_2, the next X value is from Experiment_2 and ROI_1, the next Y value is from Experiment_2 and ROI_2, etc. I only want to make this comparison for common values of isContrast and isTreated (i.e. 1 plot for each combination of these variables, so 4 plots altogether).
Subsetting doesn't solve my problem because data from different experiments/ROIs was sometimes entered out of numerical order.
The following code produces a mock data set to demonstrate the problem
expID = c('Bob','Bob','Bob','Bob','Lisa','Lisa','Lisa','Lisa','Alice','Alice','Alice','Alice','Joe','Joe','Joe','Joe','Bob','Bob','Alice','Alice','Lisa','Lisa')
treated = c(0,0,1,1,0,0,1,1,0,0,1,1,0,0,1,1,0,0,0,0,0,0)
contrast = c(0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1)
val = c(1,2,3,4,1,2,3,4,1,2,3,4,1,2,3,4,6,7,8,9,10,11)
roi = c(rep('A',16),'B','B','B','B','B','B')
myFrame = data.frame(ExperimentID=expID,isTreated = treated, isContrast= contrast,Value = val, ROI=roi)
ExperimentID isTreated isContrast Value ROI
1 Bob 0 0 1 A
2 Bob 0 1 2 A
3 Bob 1 0 3 A
4 Bob 1 1 4 A
5 Lisa 0 0 1 A
6 Lisa 0 1 2 A
7 Lisa 1 0 3 A
8 Lisa 1 1 4 A
9 Alice 0 0 1 A
10 Alice 0 1 2 A
11 Alice 1 0 3 A
12 Alice 1 1 4 A
13 Joe 0 0 1 A
14 Joe 0 1 2 A
15 Joe 1 0 3 A
16 Joe 1 1 4 A
17 Bob 0 0 6 B
18 Bob 0 1 7 B
19 Alice 0 0 8 B
20 Alice 0 1 9 B
21 Lisa 0 0 10 B
22 Lisa 0 1 11 B
Now let's say I want to scatter plot values for A vs. B. That is to say, I want to plot x vs. y where {(x,y)} = {(Bob's Value from ROI A, Bob's Value from ROI B), (Alice's Value from ROI A, Alice's Value from ROI B), ...} etc., and these must all have the same values for isTreated and isContrast for the comparison to make sense. Now, if I just go and subset, I'll get something like:
> x= myFrame$Value[(myFrame$ROI == 'A') & (myFrame$isTreated == 0) & (myFrame$isContrast == 0)]
> x
[1] 1 1 1 1
> y= myFrame$Value[(myFrame$ROI == 'B') & (myFrame$isTreated == 0) & (myFrame$isContrast == 0)]
> y
[1] 6 8 10
Now, as you can see, the values in x correspond to the first rows of Bob, Lisa, Alice and Joe, respectively, but the values in y correspond to Bob, Alice and Lisa, respectively, and there is no value for Joe.
So say I ignored the value for Joe because that data is missing for B and just decided to plot the first 3 values of x vs. the first 3 values of y. The data are still out of order because x = (Bob, Lisa, Alice) but y = (Bob, Alice, Lisa) in terms of where the values are coming from. So I would like to know how to make vectors such that the order is correct and the plot makes sense.

Similar to @Matthew, with ggplot:
The idea is to reshape your data so that the values from ROI=A and ROI=B are in different columns. This can be done (with your sample data) as follows:
library(reshape2)
zz <- dcast(myFrame,
value.var="Value",
formula=ExperimentID+isTreated+isContrast~ROI)
zz
ExperimentID isTreated isContrast A B
1 Alice 0 0 1 8
2 Alice 0 1 2 9
3 Alice 1 0 3 NA
4 Alice 1 1 4 NA
5 Bob 0 0 1 6
6 Bob 0 1 2 7
7 Bob 1 0 3 NA
8 Bob 1 1 4 NA
9 Joe 0 0 1 NA
10 Joe 0 1 2 NA
11 Joe 1 0 3 NA
12 Joe 1 1 4 NA
13 Lisa 0 0 1 10
14 Lisa 0 1 2 11
15 Lisa 1 0 3 NA
16 Lisa 1 1 4 NA
Notice that your sample data is rather sparse (lots of NAs).
To plot:
library(ggplot2)
ggplot(zz,aes(x=A,y=B,color=factor(isTreated))) +
geom_point(size=4)+facet_wrap(~isContrast)
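This produces a scatter plot of B against A, with one panel per value of isContrast and points colored by isTreated.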
The reason there are no blue points is that, in your sample data, there are no occurrences of isTreated=1 and ROI=B.

Something like this, perhaps:
myFrameReshaped <- reshape(myFrame, timevar='ROI', direction='wide', idvar=c('ExperimentID','isTreated','isContrast'))
plot(Value.B ~ Value.A, data=myFrameReshaped)
To condition by the isTreated and isContrast variables, lattice comes in handy:
library(lattice)
xyplot(Value.B~Value.A | isTreated + isContrast, data=myFrameReshaped)
Values that are not present for one of the conditions give NA, and are not plotted.
head(myFrameReshaped)
## ExperimentID isTreated isContrast Value.A Value.B
## 1 Bob 0 0 1 6
## 2 Bob 0 1 2 7
## 3 Bob 1 0 3 NA
## 4 Bob 1 1 4 NA
## 5 Lisa 0 0 1 10
## 6 Lisa 0 1 2 11
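
If only the complete A/B pairs are of interest, the same pairing can also be sketched with merge, which matches rows by key rather than by position. This is an alternative to the reshape calls above, not what either answer proposed; the names `keys` and `paired` are made up for the example, and the data frame is a condensed retyping of the mock data from the question.

```r
# Mock data from the question, written more compactly
myFrame <- data.frame(
  ExperimentID = c(rep(c("Bob", "Lisa", "Alice", "Joe"), each = 4),
                   "Bob", "Bob", "Alice", "Alice", "Lisa", "Lisa"),
  isTreated = c(rep(c(0, 0, 1, 1), 4), rep(0, 6)),
  isContrast = rep(c(0, 1), 11),
  Value = c(rep(1:4, 4), 6:11),
  ROI = c(rep("A", 16), rep("B", 6)))

# Columns that must agree for an A value and a B value to form a pair
keys <- c("ExperimentID", "isTreated", "isContrast")

# Inner join of the ROI A rows with the ROI B rows on the key columns;
# combinations missing from either ROI (e.g. all of Joe) are dropped
paired <- merge(myFrame[myFrame$ROI == "A", c(keys, "Value")],
                myFrame[myFrame$ROI == "B", c(keys, "Value")],
                by = keys, suffixes = c(".A", ".B"))
paired
```

Each row of paired then holds Value.A and Value.B for the same experiment and condition, regardless of the order in which the rows were entered, so plot(Value.B ~ Value.A, data = paired) compares like with like.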

R - Change row values based on the contents of neighbouring rows

I have a series of numbers in two columns, with the titles "a" and "b".
I want R to change the values in column "b" wherever a value in column "a" differs from a neighboring value by more than 10.
For example:
a | b
-----------
1 | 1
2 | 1
3 | 1
4 | 1
21 | 1
22 | 1
23 | 1
24 | 1
... | ...
Then I would like R to change the values in column "b" to
a | b
-----------
1 | 1
2 | 1
3 | 1
4 | 0
21 | 0
22 | 1
23 | 1
24 | 1
... | ...
This is because the values 4 and 21 in column "a" differ from each other by more than 10.
Any help would be greatly appreciated.

df <- data.frame(a = c(1:4, 21:24), b = 1)
# check whether differences are greater than 10
diffs <- diff(df$a) > 10
# create `b`
df$b <- as.integer(!(c(FALSE, diffs) | c(diffs, FALSE)))
The result:
a b
1 1 1
2 2 1
3 3 1
4 4 0
5 21 0
6 22 1
7 23 1
8 24 1
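
The same logic can be wrapped in a small helper with an adjustable threshold. This is only a sketch: the function name `flag_gaps` and its `threshold` argument are hypothetical, and unlike the code above it uses abs() so that drops of more than the threshold are flagged as well as jumps, which is an assumption about what is wanted.

```r
# Return 0 for values that sit next to a gap larger than `threshold`, else 1.
# Hypothetical helper; abs() also flags decreases, unlike diff(a) > 10 alone.
flag_gaps <- function(a, threshold = 10) {
  # With fewer than two values there are no neighbours to compare
  if (length(a) < 2) return(rep(1L, length(a)))
  gaps <- abs(diff(a)) > threshold
  # A value is flagged if the gap before it or the gap after it is large
  as.integer(!(c(FALSE, gaps) | c(gaps, FALSE)))
}

df <- data.frame(a = c(1:4, 21:24))
df$b <- flag_gaps(df$a)
df
```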

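An alternative:
df <- data.frame(a = c(1:4, 21:24), b = 1L)
local({
  # positions where the next value of a is more than 10 larger
  w10 <- with(df, which(diff(a) > 10))
  # zero out both neighbours; <<- modifies df outside local()
  df$b[c(w10, w10+1)] <<- 0L
})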
