I have a data frame with three variables and I want the first variable to be the row names, the second variable to be the column names, and the third variable to be the values associated with those two parameters, with NA or blank where data may be missing. Is this easy/possible to do in R?
example input
structure(list(
Player = c("1","1","2","2","3","3","4","4","5","5","6"),
Type = structure(c(2L, 1L, 2L, 1L, 2L, 1L,2L, 1L, 2L, 1L, 1L),
.Label = c("Long", "Short"), class = "factor"),
Yards = c("23","41","50","29","11","41","48","12","35","27","25")),
.Names = c("Player", "Type", "Yards"),
row.names = c(NA, 11L),
class = "data.frame")
Using the sample data you gave:
df <- structure(list(Player = c("1", "1", "2", "2", "3", "3", "4", "4", "5",
"5", "6"), Type = structure(c(2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 1L),
.Label = c("Long", "Short"), class = "factor"),
Yards = c("23", "41", "50", "29", "11", "41", "48", "12", "35", "27", "25")),
.Names = c("Player", "Type", "Yards"), row.names = c(NA, 11L),
class = "data.frame")
Player Type Yards
1 1 Short 23
2 1 Long 41
3 2 Short 50
4 2 Long 29
5 3 Short 11
6 3 Long 41
7 4 Short 48
8 4 Long 12
9 5 Short 35
10 5 Long 27
11 6 Long 25
dcast will be able to tabulate the two variables.
library(reshape2)
df.cast <- dcast(df, Player~Type, value.var="Yards")
dcast keeps Player as an ordinary column, so one extra step is needed to turn it into the row names of the data.frame:
rownames(df.cast) <- df.cast$Player
df.cast$Player <- NULL
Long Short
1 41 23
2 29 50
3 41 11
4 12 48
5 27 35
6 25 <NA>
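For comparison, a similar table can be built in base R without reshape2. The sketch below is not part of the original answer; it assumes each Player/Type pair occurs at most once and converts Yards to numeric first. The object name wide is only illustrative.

```r
# Same sample data as above, rebuilt compactly
df <- data.frame(
  Player = c("1", "1", "2", "2", "3", "3", "4", "4", "5", "5", "6"),
  Type   = factor(c("Short", "Long", "Short", "Long", "Short", "Long",
                    "Short", "Long", "Short", "Long", "Long")),
  Yards  = c("23", "41", "50", "29", "11", "41", "48", "12", "35", "27", "25"),
  stringsAsFactors = FALSE
)

# tapply() cross-tabulates by Player (rows) and Type (columns);
# combinations with no data are left as NA
wide <- with(df, tapply(as.numeric(Yards), list(Player, Type),
                        function(v) v[1]))
wide
```

The result is a numeric matrix whose row names are the players, so no extra step is needed to move Player out of a column.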
I get an error when I try to run Dunn's test on my data and I can't figure out what's causing it.
I have 4 groups with ordinal discrete data. The Kruskal-Wallis test suggests a significant difference between groups, but I can't run dunnTest afterwards.
Any help is appreciated.
> mast_cells
# A tibble: 20 × 2
group score
<ord> <dbl>
1 1 1
2 1 1
3 1 1
4 1 1
5 1 1
6 2 3
7 2 4
8 2 2
9 2 1
10 2 3
11 3 2
12 3 1
13 3 2
14 3 3
15 3 3
16 4 3
17 4 2
18 4 3
19 4 2
20 4 2
> mast_cells$group <- ordered(mast_cells$group ,
+ levels = c("1", "2", "3", "4"))
> kruskal.test( score ~ group, data = mast_cells)
Kruskal-Wallis rank sum test
data: score by group
Kruskal-Wallis chi-squared = 9.1875, df = 3, p-value = 0.0269
> library(FSA)
> dunnTest(score ~ group,
+ data = mast_cells,
+ method="Benjamini-Yekuteili")
Error in if (tmp$Eclass != "factor") { : the condition has length > 1
>
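The error appears to come from the formula interface of dunnTest: group is an ordered factor, whose class is c("ordered", "factor"), so the check tmp$Eclass != "factor" has length greater than 1. Passing the data vector as the first argument and the grouping factor as the second avoids that check. In addition, the Benjamini-Yekutieli adjustment for multiple comparisons is selected with method = "by", not with its full name.
See the code below: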
library(FSA)
mast_cells <- structure(
list(group = structure(c(1L, 1L, 1L, 1L, 1L, 2L, 2L,
2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 4L, 4L, 4L, 4L, 4L),
levels = c("1", "2", "3", "4"), class = c("ordered", "factor")),
score = c(1L, 1L, 1L, 1L, 1L, 3L, 4L, 2L, 1L, 3L,
2L, 1L, 2L, 3L, 3L, 3L, 2L, 3L, 2L, 2L)),
row.names = c("1", "2", "3", "4", "5", "6", "7", "8", "9", "10",
"11", "12", "13", "14", "15", "16", "17", "18", "19", "20"),
class = "data.frame")
dunnTest(mast_cells$score, mast_cells$group, method = "by")
Output:
Dunn (1964) Kruskal-Wallis multiple comparison
p-values adjusted with the Benjamini-Yekuteili method.
Comparison Z P.unadj P.adj
1 1 - 2 -2.6685305 0.007618387 0.11199029
2 1 - 3 -2.1348244 0.032775359 0.16059926
3 2 - 3 0.5337061 0.593544894 1.00000000
4 1 - 4 -2.4999917 0.012419622 0.09128422
5 2 - 4 0.1685388 0.866159449 1.00000000
6 3 - 4 -0.3651673 0.714986507 1.00000000
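The length of the failing condition can be checked in base R. The sketch below only illustrates how the class of an ordered factor differs from that of a plain factor; whether the formula interface of dunnTest then runs on the unordered copy depends on the installed FSA version, so treat that part as an assumption. The names g_ord and g_plain are illustrative.

```r
# An ordered factor carries two class names, a plain factor only one
g_ord <- ordered(rep(c("1", "2", "3", "4"), each = 5),
                 levels = c("1", "2", "3", "4"))
class(g_ord)
length(class(g_ord) != "factor")   # the comparison yields one value per class name

# Dropping the ordering keeps the levels but leaves a single class name
g_plain <- factor(g_ord, ordered = FALSE)
class(g_plain)
```

Since the Kruskal-Wallis and Dunn procedures do not use the ordering of the group labels, working with the unordered copy should not change the test itself.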
I have a list like the one below, where each element name (e.g. 1081786081) is a user id. I want to plot day_count against time for each user.
That is easy to do for a single data frame:
plot(list4$day_count)
But I don't know how to do it for each element of the list. Should I use lapply?
$`1081786081`
time day_count
1 2016-01-13 2
2 2016-01-20 2
3 2016-02-06 2
4 2016-02-23 2
5 2016-03-14 2
6 2016-03-24 2
7 2016-04-06 2
8 2016-04-11 2
9 2016-05-04 2
10 2016-06-06 2
11 2016-06-26 2
12 2016-07-01 2
$`1087949661`
time day_count
1 2016-01-02 4
2 2016-01-11 2
3 2016-01-20 2
4 2016-01-21 6
5 2016-01-22 2
6 2016-01-27 4
7 2016-01-30 4
8 2016-02-02 2
9 2016-02-05 2
To put every data.frame in the list on its own page of a single PDF, open the pdf device, loop through list4 plotting each element, and close the device.
pdf("yourplot.pdf")
invisible(lapply(list4, function(x) with(x, plot(time, day_count))))
dev.off()
We can also create some identifier for each plot by looping through the names of the list elements
pdf("yourplot.pdf")
invisible(lapply(names(list4), function(nm) with(list4[[nm]],
plot(time, day_count, main = paste("plot of", nm)))))
dev.off()
If we need a single plot with one line per user, we can bind the list elements row-wise and then do the plotting.
library(dplyr)
library(ggplot2)
bind_rows(list4, .id = "grp") %>%
ggplot(., aes(x=time, y = day_count, colour = grp)) +
geom_line() +
geom_point()
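If dplyr and ggplot2 are not available, the list can be stacked and plotted in base R as well. This is a sketch on a small made-up list (list_demo and its values are hypothetical, not the data from the question); it adds the list names as a grp column before binding the rows.

```r
# Hypothetical two-user list with the same column layout as list4
list_demo <- list(
  `1081786081` = data.frame(time = as.Date(c("2016-01-13", "2016-01-20")),
                            day_count = c(2L, 2L)),
  `1087949661` = data.frame(time = as.Date(c("2016-01-02", "2016-01-11",
                                             "2016-01-20")),
                            day_count = c(4L, 2L, 2L))
)

# Tag each data.frame with its list name, then stack them
tagged <- Map(function(d, nm) cbind(grp = nm, d), list_demo, names(list_demo))
combined <- do.call(rbind, tagged)
rownames(combined) <- NULL

# Empty plot spanning all the data, then one line per user
plot(combined$time, combined$day_count, type = "n",
     xlab = "time", ylab = "day_count")
for (g in unique(as.character(combined$grp))) {
  with(combined[combined$grp == g, ], lines(time, day_count, type = "b"))
}
```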
data
list4 <- structure(list(`1081786081` = structure(list(time = structure(c(16813,
16820, 16837, 16854, 16874, 16884, 16897, 16902, 16925, 16958,
16978, 16983), class = "Date"), day_count = c(2L, 2L, 2L, 2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L)), .Names = c("time", "day_count"
), row.names = c("1", "2", "3", "4", "5", "6", "7", "8", "9",
"10", "11", "12"), class = "data.frame"), `1087949661` = structure(list(
time = structure(c(16802, 16811, 16820, 16821, 16822, 16827,
16830, 16833, 16836), class = "Date"), day_count = c(4L,
2L, 2L, 6L, 2L, 4L, 4L, 2L, 2L)), .Names = c("time", "day_count"
), row.names = c("1", "2", "3", "4", "5", "6", "7", "8", "9"),
class = "data.frame")), .Names = c("1081786081",
"1087949661"))
In SparkR I have a DataFrame called data.
When I type head(data) I get this output:
C0 C1 C2 C3
1 id user_id foreign_model_id machine_id
2 1 3145 4 12
3 2 4079 1 8
4 3 1174 7 1
5 4 2386 9 9
6 5 5524 1 7
I want to remove C0, C1, C2, C3 because they cause problems later on. For example, when I use the filter function:
filter(data,data$machine_id==1)
it fails because of this.
I read the data like this:
data <- read.df(sqlContext, "/home/ole/.../data", "com.databricks.spark.csv")
SparkR read the header as the first data row and generated a new header, because the header option defaults to "false". Set header="true" and the problem goes away.
data <- read.df(sqlContext, "/home/ole/.../data", "com.databricks.spark.csv", header="true")
Try
colnames(data) <- unlist(data[1,])
data <- data[-1,]
> data
# id user_id foreign_model_id machine_id
#2 1 3145 4 12
#3 2 4079 1 8
#4 3 1174 7 1
#5 4 2386 9 9
#6 5 5524 1 7
If you wish, you can add rownames(data) <- NULL to correct for the row numbers after the deletion of the first row.
After this manipulation, you can select rows that correspond to certain criteria, like
subset(data, data$machine_id==1)
# id user_id foreign_model_id machine_id
#4 3 1174 7 1
In base R, the function filter() suggested in the OP is part of the stats namespace and is usually reserved for the analysis of time series.
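In plain R (not SparkR), promoting the first row to a header can also be done without relying on how unlist() treats factor columns. The sketch below is only an illustration: raw is a hypothetical stand-in for a table read without a header, and the columns are re-typed afterwards so that numeric comparisons behave as expected.

```r
# Hypothetical table whose real header ended up in row 1
raw <- data.frame(
  C0 = c("id", "1", "2", "3"),
  C1 = c("user_id", "3145", "4079", "1174"),
  C2 = c("foreign_model_id", "4", "1", "7"),
  C3 = c("machine_id", "12", "8", "1"),
  stringsAsFactors = FALSE
)

# Take the header from row 1 as character, whatever the column type
names(raw) <- vapply(raw[1, ], as.character, character(1))
raw <- raw[-1, ]
rownames(raw) <- NULL

# Re-detect column types now that the header text is gone
raw[] <- lapply(raw, function(col) type.convert(as.character(col), as.is = TRUE))

subset(raw, machine_id == 1)
```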
data
data <- structure(list(C0 = structure(c(6L, 1L, 2L, 3L, 4L, 5L),
.Label = c("1", "2", "3", "4", "5", "id"), class = "factor"),
C1 = structure(c(6L, 3L, 4L, 1L, 2L, 5L), .Label = c("1174", "2386",
"3145", "4079", "5524", "user_id"), class = "factor"),
C2 = structure(c(5L, 2L, 1L, 3L, 4L, 1L),
.Label = c("1", "4", "7", "9", "foreign_model_id"), class = "factor"),
C3 = structure(c(6L, 2L, 4L, 1L, 5L, 3L),
.Label = c("1", "12", "7", "8", "9", "machine_id"), class = "factor")),
.Names = c("C0", "C1", "C2", "C3"), class = "data.frame",
row.names = c("1", "2", "3", "4", "5", "6"))
Try this:
names <- c()
for (i in seq(along = names(data))) {
names <- c(names, toString(data[1,i]))
}
names(data) <- names
data <- data[-1,]
I can't use the answers above because they don't run in SparkR: object of type 'S4' is not subsettable. I solved the problem this way, although I think there is probably a better way.
data <- withColumnRenamed(data, "C0","id")
data <- withColumnRenamed(data, "C1","user_id")
data <- withColumnRenamed(data, "C2","foreign_model_id")
data <- withColumnRenamed(data, "C3","machine_id")
And now I can successfully use the filter function as I want to.
I have 2 files with say 3 columns and a few rows.
1 2 10
2 3 20
3 4 30
4 5 40
5 1 50
6 1 60
and
1 8 10
2 3 100
3 4 45
4 5 78
5 2 99
6 80 60
Now I want to create a third file containing all the rows of the first two files. If the first and second columns match in both files, the third column of the new file should hold the value from the third column of the first file, and the fourth column should hold the value from the third column of the second file.
According to the above example, the answer should be
1 2 10 0
2 3 20 100
3 4 30 45
4 5 40 78
1 8 10 0
5 1 50 0
6 1 60 0
5 2 99 0
6 80 60 0
res <- merge(dat1,dat2, by=c("V1", "V2"),all=TRUE)
indx <- is.na(res[,3])
res[indx,3] <- res[indx,4]
res[indx,4] <- NA
res[is.na(res)] <- 0
# V1 V2 V3.x V3.y
#1 1 2 10 0
#2 1 8 10 0
#3 2 3 20 100
#4 3 4 30 45
#5 4 5 40 78
#6 5 1 50 0
#7 5 2 99 0
#8 6 1 60 0
#9 6 80 60 0
data
dat1 <- structure(list(V1 = structure(1:6, .Label = c("1", "2", "3",
"4", "5", "6"), class = "factor"), V2 = structure(c(2L, 3L, 4L,
5L, 1L, 1L), .Label = c("1", "2", "3", "4", "5"), class = "factor"),
V3 = structure(1:6, .Label = c("10", "20", "30", "40", "50",
"60"), class = "factor")), .Names = c("V1", "V2", "V3"), class = "data.frame", row.names = c(NA,
-6L))
dat2 <- structure(list(V1 = structure(1:6, .Label = c("1", "2", "3",
"4", "5", "6"), class = "factor"), V2 = structure(c(5L, 2L, 3L,
4L, 1L, 6L), .Label = c("2", "3", "4", "5", "8", "80"), class = "factor"),
V3 = structure(c(1L, 2L, 3L, 5L, 6L, 4L), .Label = c("10",
"100", "45", "60", "78", "99"), class = "factor")), .Names = c("V1",
"V2", "V3"), class = "data.frame", row.names = c(NA, -6L))
Convert the data columns to numeric class before you try the above code
dat1[] <- lapply(dat1, function(x) as.numeric(as.character(x)))
dat2[] <- lapply(dat2, function(x) as.numeric(as.character(x)))
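Putting the conversion and the merge together, the whole procedure can be run end to end as below. This sketch enters the two example tables directly as numeric columns, so no factor conversion is needed; the names dat1, dat2 and res follow the answer, while from_dat2_only is an illustrative helper name.

```r
dat1 <- data.frame(V1 = 1:6, V2 = c(2, 3, 4, 5, 1, 1),
                   V3 = c(10, 20, 30, 40, 50, 60))
dat2 <- data.frame(V1 = 1:6, V2 = c(8, 3, 4, 5, 2, 80),
                   V3 = c(10, 100, 45, 78, 99, 60))

# Full outer join on the first two columns; unmatched cells become NA
res <- merge(dat1, dat2, by = c("V1", "V2"), all = TRUE)

# Rows found only in dat2: move their value into the third column
from_dat2_only <- is.na(res$V3.x)
res$V3.x[from_dat2_only] <- res$V3.y[from_dat2_only]
res$V3.y[from_dat2_only] <- NA

# Remaining gaps are filled with 0
res[is.na(res)] <- 0
res
```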
It would be easier if you post an example with dput(). I would check if ?merge helps or rbind.fill (package plyr).
Hope this helps
Hermann
DATA AND REQUIREMENTS
The first table (myMatrix1) is from an old geological survey that used different region boundaries (the begin and finish columns) from the newer survey (myMatrix2).
What I wish to do is to match the begin and finish boundaries and then create two tables: one for the new data on sedimentation, and one for the new data on bore width characterised as a boolean.
myMatrix1 <- read.table("/path/to/file")
myMatrix2 <- read.table("/path/to/file")
> head(myMatrix1) # this is the old data
sampleIDs begin finish
1 19990224 4 5
2 20000224 5 6
3 20010203 6 8
4 20019024 29 30
5 20020201 51 52
> head(myMatrix2) # this is the new data
begin finish sedimentation boreWidth
1 0 10 1.002455 0.014354
2 11 367 2.094351 0.056431
3 368 920 0.450275 0.154105
4 921 1414 2.250820 1.004353
5 1415 5278 0.114109 NA
Desired output:
> head(myMatrix6)
sampleIDs begin finish sedimentation #myMatrix4
1 19990224 4 5 1.002455
2 20000224 5 6 1.002455
3 20010203 6 8 2.094351
4 20019024 29 30 2.094351
5 20020201 51 52 2.094351
> head(myMatrix7)
sampleIDs begin finish boreWidthThresh #myMatrix5
1 19990224 4 5 FALSE
2 20000224 5 6 FALSE
3 20010203 6 8 FALSE
4 20019024 29 30 FALSE
5 20020201 51 52 FALSE
CODE
The following code has taken me several hours to run on my dataset (about 5 million data points). Is there any way to change the code to make it run any faster?
# create empty matrix for sedimentation
myMatrix6 <- data.frame(NA,NA,NA,NA)[0,]
names(myMatrix6) <- letters[1:4]
# create empty matrix for bore
myMatrix7 <- data.frame(NA,NA,NA,NA)[0,]
names(myMatrix7) <- letters[1:4]
for (i in 1:nrow(myMatrix2))
{
# create matrix that has the value of myMatrix1$begin being
# situated between the values of myMatrix2begin[i] and myMatrix2finish[i]
myMatrix3 <- myMatrix1[which((myMatrix1$begin > myMatrix2$begin[i]) & (myMatrix1$begin < myMatrix2$finish[i])),]
myMatrix4 <- rep(myMatrix2$sedimentation[i], nrow(myMatrix3))
if (is.na(myMatrix2$boreWidth[i])) {
myMatrix5 <- rep(NA, nrow(myMatrix3))
}
else if (myMatrix2$boreWidth[i] == 0) {
myMatrix5 <- rep(TRUE, nrow(myMatrix3))
}
else if (myMatrix2$boreWidth[i] > 0) {
myMatrix5 <- rep(FALSE, nrow(myMatrix3))
}
myMatrix6 <- rbind(myMatrix6, cbind(myMatrix3, myMatrix4))
myMatrix7 <- rbind(myMatrix7, cbind(myMatrix3, myMatrix5))
}
EDIT:
> dput(head(myMatrix2))
structure(list(V1 = structure(c(6L, 1L, 2L, 4L, 5L, 3L), .Label = c("0",
"11", "1415", "368", "921", "begin"), class = "factor"), V2 = structure(c(6L,
1L, 3L, 5L, 2L, 4L), .Label = c("10", "1414", "367", "5278",
"920", "finish"), class = "factor"), V3 = structure(c(6L, 3L,
4L, 2L, 5L, 1L), .Label = c("0.114109", "0.450275", "1.002455",
"2.094351", "2.250820", "sedimentation"), class = "factor"),
V4 = structure(c(5L, 1L, 2L, 3L, 4L, 6L), .Label = c("0.014354",
"0.056431", "0.154105", "1.004353", "boreWidth", "NA"), class = "factor")), .Names = c("V1",
"V2", "V3", "V4"), row.names = c(NA, 6L), class = "data.frame")
> dput(head(myMatrix1))
structure(list(V1 = structure(c(6L, 1L, 2L, 3L, 4L, 5L), .Label = c("19990224",
"20000224", "20010203", "20019024", "20020201", "sampleIDs"), class = "factor"),
V2 = structure(c(6L, 2L, 3L, 5L, 1L, 4L), .Label = c("29",
"4", "5", "51", "6", "begin"), class = "factor"), V3 = structure(c(6L,
2L, 4L, 5L, 1L, 3L), .Label = c("30", "5", "52", "6", "8",
"finish"), class = "factor")), .Names = c("V1", "V2", "V3"
), row.names = c(NA, 6L), class = "data.frame")
First look at these general suggestions on speeding up code: https://stackoverflow.com/a/8474941/636656
The first thing that jumps out at me is that I'd create only one results matrix. That way you're not duplicating the sampleIDs begin finish columns, and you can avoid any overhead that comes with running the matching algorithm twice.
Doing that, you can avoid selecting more than once (although it's trivial in terms of speed as long as you store your selection vector rather than re-calculate).
Here's a solution using apply:
myMatrix1 <- data.frame(sampleIDs=c(19990224,20000224),begin=c(4,5),finish=c(5,6))
myMatrix2 <- data.frame(begin=c(0,11),finish=c(10,367),sed=c(1.002,2.01),boreWidth=c(.014,.056))
glommer <- function(x,myMatrix2) {
x[4:5] <- as.numeric(myMatrix2[ myMatrix2$begin <= x["begin"] & myMatrix2$finish >= x["finish"], c("sed","boreWidth") ])
names(x)[4:5] <- c("sed","boreWidth")
return( x )
}
> t(apply( myMatrix1, 1, glommer, myMatrix2=myMatrix2))
sampleIDs begin finish sed boreWidth
[1,] 19990224 4 5 1.002 0.014
[2,] 20000224 5 6 1.002 0.014
I used apply and stored everything as numeric. Other approaches would be to return a data.frame and have the sampleIDs and begin, finish be ints. That might avoid some problems with floating point error.
This solution assumes there are no boundary cases (e.g. the begin, finish times of myMatrix1 are entirely contained within the begin, finish times of the other). If your data is more complicated, just change the glommer() function. How you want to handle that is a substantive question.
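When the intervals in myMatrix2 are sorted by begin and do not overlap, the per-row search can be replaced by a single vectorised lookup. The following is a sketch under those assumptions, not the method of the answer above: it uses base R's findInterval(), and the object names old_survey, new_survey, idx and out are illustrative. Samples that straddle two regions are left as NA rather than guessed.

```r
old_survey <- data.frame(
  sampleIDs = c(19990224, 20000224, 20010203, 20019024, 20020201),
  begin  = c(4, 5, 6, 29, 51),
  finish = c(5, 6, 8, 30, 52)
)
new_survey <- data.frame(
  begin  = c(0, 11, 368, 921, 1415),
  finish = c(10, 367, 920, 1414, 5278),
  sedimentation = c(1.002455, 2.094351, 0.450275, 2.250820, 0.114109),
  boreWidth = c(0.014354, 0.056431, 0.154105, 1.004353, NA)
)

# Row of new_survey whose begin is the largest one not exceeding each old begin
idx <- findInterval(old_survey$begin, new_survey$begin)
idx[idx == 0] <- NA                    # sample starts before the first region

# Keep a match only if the sample also ends inside that region
inside <- old_survey$finish <= new_survey$finish[idx]
idx[is.na(inside) | !inside] <- NA

# One result table holding both derived columns
out <- cbind(old_survey,
             sedimentation   = new_survey$sedimentation[idx],
             boreWidthThresh = new_survey$boreWidth[idx] == 0)
out
```

Because the lookup is done once for all rows, there is no repeated rbind inside a loop, which is usually where most of the time goes with millions of rows.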