Remove column names in a DataFrame - r

In sparkR I have a DataFrame data.
When I type head(data) we get this output
C0 C1 C2 C3
1 id user_id foreign_model_id machine_id
2 1 3145 4 12
3 2 4079 1 8
4 3 1174 7 1
5 4 2386 9 9
6 5 5524 1 7
I want to remove C0,C1,C2,C3 because they give me problems later one. For example when I use the filter function:
filter(data,data$machine_id==1)
can't run because of this.
I have read the data like this
data <- read.df(sqlContext, "/home/ole/.../data", "com.databricks.spark.csv")

SparkR made the header into the first row and gave the DataFrame a new header because the default for the header option is "false". Set the header option to header="true" and then you won't have to handle with this problem.
data <- read.df(sqlContext, "/home/ole/.../data", "com.databricks.spark.csv", header="true")

Try
colnames(data) <- unlist(data[1,])
data <- data[-1,]
> data
# id user_id foreign_model_id machine_id
#2 1 3145 4 12
#3 2 4079 1 8
#4 3 1174 7 1
#5 4 2386 9 9
#6 5 5524 1 7
If you wish, you can add rownames(data) <- NULL to correct for the row numbers after the deletion of the first row.
After this manipulation, you can select rows that correspond to certain criteria, like
subset(data, data$machine_id==1)
# id user_id foreign_model_id machine_id
#4 3 1174 7 1
In base R, the function filter() suggested in the OP is part of the stats namespace and is usually reserved for the analysis of time series.
data
data <- structure(list(C0 = structure(c(6L, 1L, 2L, 3L, 4L, 5L),
.Label = c("1", "2", "3", "4", "5", "id"), class = "factor"),
C1 = structure(c(6L, 3L, 4L, 1L, 2L, 5L), .Label = c("1174", "2386",
"3145", "4079", "5524", "user_id"), class = "factor"),
C2 = structure(c(5L, 2L, 1L, 3L, 4L, 1L),
.Label = c("1", "4", "7", "9", "foreign_model_id"), class = "factor"),
C3 = structure(c(6L, 2L, 4L, 1L, 5L, 3L),
.Label = c("1", "12", "7", "8", "9", "machine_id"), class = "factor")),
.Names = c("C0", "C1", "C2", "C3"), class = "data.frame",
row.names = c("1", "2", "3", "4", "5", "6"))

try this
names <- c()
for (i in seq(along = names(data))) {
names <- c(names, toString(data[1,i]))
}
names(data) <- names
data <- data[-1,]

I simply can't use the answers because in sparkR it can't run: object of type 'S4' is not subsettable. I solved the problem this way, however, I think there is a better way to solve it.
data <- withColumnRenamed(data, "C0","id")
data <- withColumnRenamed(data, "C1","user_id")
data <- withColumnRenamed(data, "C2","foreign_model_id")
data <- withColumnRenamed(data, "C3","machine_id")
And now I can successfully use the filter function as I want to.

Related

Using lapply to group list of data frames by column

I have a list that contains multiple data frames. I would like to sort the data by Category (A) and sum the Frequencies (B) using the lapply-command.
The data is df_list
df_list
$`df.1`
A B
1 Apples 2
2 Pears 5
3 Apples 6
4 Pears 1
5 Apples 3
$`df.2`
A B
1 Oranges 2
2 Pineapples 5
3 Oranges 6
4 Pineapples 1
5 Oranges 3
The desired outcome df_list_2 looks like this:
df_list_2
$`df.1`
A B
1 Apples 11
2 Pears 6
$`df.2`
A B
1 Oranges 11
2 Pineapples 6
I have tried the following code based on lapply:
df_list_2<-df_list[, lapply(B, sum), by = A]
However, I get an error code, saying that A was not found.
Either I mistake how the lapply command works in this case or my understating of how it should work is flawed.
Any help much appreciated.
You need to aggregate in lapply
lapply(df_list, function(x) aggregate(B~A, x, sum))
#[[1]]
# A B
#1 Apples 11
#2 Pears 6
#[[2]]
# A B
#1 Oranges 11
#2 Pineapples 6
Using map from purrr and dplyr it would be
library(dplyr)
purrr::map(df_list, ~.x %>% group_by(A) %>% summarise(sum = sum(B)))
data
df_list <- list(structure(list(A = structure(c(1L, 2L, 1L, 2L, 1L),
.Label = c("Apples", "Pears"), class = "factor"), B = c(2L, 5L, 6L, 1L, 3L)),
class = "data.frame", row.names = c("1", "2", "3", "4", "5")),
structure(list(A = structure(c(1L, 2L, 1L, 2L, 1L), .Label = c("Oranges",
"Pineapples"), class = "factor"), B = c(2L, 5L, 6L, 1L, 3L)), class = "data.frame",
row.names = c("1", "2", "3", "4", "5")))
I fear you might not have a clear idea of lapply nor the extract operator ([). Remember lapply(list, function) applies the specified function you give it to each element of the list you give it. Extract gives you the element you specify:
x <- c('a', 'b', 'c')
x[2]
## "b"
I would imagine that somewhere in your R workspace you have an object names B which is why you didn't get an error along the lines of
## Error in lapply(B, sum) : object 'B' not found
Conversely if you had (accidentally or intentionally) defined both A and B you would see the error
## Error in df_list[, lapply(B, sum), by = A] : incorrect number of dimensions
because that's not at all how to use [; remember, you just pass indexes or booleans to [ along with the occasional optional argument, but by is not one of those.
So without further adieu, here's how I would do this (in base R):
# make some data
a <- c(1, 2, 1, 2, 1)
b <- c(2, 5, 6, 1, 3)
df_list <- list(df.1 = data.frame(A = c('Apples', 'Pears')[a], B = b),
df.2 = data.frame(A = c('Oranges', 'Pineapples')[a], B = b))
# simplify it
df_list_2 <- lapply(df_list, function(x) {
aggregate(list(B = x$B), list(A = x$A), sum)
})
# the desired result
df_list_2
## $df.1
## A B
## 1 Apples 11
## 2 Pears 6
##
## $df.2
## A B
## 1 Oranges 11
## 2 Pineapples 6
You can take advantage of the fact that a data.frame is just a list and shorten up your code like this:
df_list_2 <- lapply(df_list, function(x) {
aggregate(x['B'], x['A'], sum)
})
but the first way of writing it should help make more clear what we're doing
The data.table syntax in OP's post can changed to
library(data.table)
lapply(df_list, function(x) as.data.table(x)[, .(B = sum(B)), by = A])
#$df.1
# A B
#1: Apples 11
#2: Pears 6
#$df.2
# A B
#1: Oranges 11
#2: Pineapples 6
data
df_list <- list(df.1 = structure(list(A = structure(c(1L, 2L, 1L, 2L, 1L
), .Label = c("Apples", "Pears"), class = "factor"), B = c(2L,
5L, 6L, 1L, 3L)), class = "data.frame", row.names = c("1", "2",
"3", "4", "5")), df.2 = structure(list(A = structure(c(1L, 2L,
1L, 2L, 1L), .Label = c("Oranges", "Pineapples"), class = "factor"),
B = c(2L, 5L, 6L, 1L, 3L)), class = "data.frame", row.names = c("1",
"2", "3", "4", "5")))

replace column values in data.frame with character string depending on column value, in R

I have a data.frame 'data', where one column contains integer values between 1:100, which are coded values for the Isolate they represent.
Here's my example data, 'data':
Size Isolate spin
1 primary 3 up
2 primary 4 down
3 sec 6 strange
4 ter 1 charm
5 sec 3 bottom
6 quart 2 top
I have another data.frame that contains the key between the integers and the name of the Isolate
1 alpha
2 bravo
3 charlie
4 delta
5 echo
6 foxtrot
7 golf
This list is 100 Isolates in length, too much to type in by hand with if/else.
I'd like to know an easy solution to replacing the integers in my first data.frame, whic aren't in ascending order as you can see, with the corresponding Isolate names in the second data.frame.
I tried, after researching:
data$Isolate <- as.numeric(factor(data$Isolate,
levels =c("alpha","bravo","charlie","delta","echo","foxtrot","golf")
)
)
but this just replaced the Isolate column with N/A.
As Hubert said in the comments, this is a simple use-case for merge.
Let's say the column names of your second "key" data frame are "Isolate" and "Isolate_Name", then it's as easy as
merge(data, key_data, by = "Isolate")
The default is for an "inner join" which will only keep records that have matches. If you're worried about losing records that don't have matches you can add the argument all.x = TRUE.
If you prefer non-base packages, this is easy in data.table or dplyr as well.
Using factor, you could try:
data$Isolate <- factor(data$Isolate,
levels=1:7,
labels =c("alpha","bravo","charlie","delta","echo","foxtrot","golf"))
If you have many levels that are already in their own data.frame, you could automate this.
data$Isolate <- factor(data$Isolate,levels=code$No,labels=code$Value)
With your second data.frame, code:
code <- read.table(text="1 alpha
2 bravo
3 charlie
4 delta
5 echo
6 foxtrot
7 golf",stringsAsFactor=FALSE)
names(code) <- c("No","Value")
df$Isolate <- df2[,1][df$Isolate]
# Size Isolate spin
# 1 primary charlie up
# 2 primary delta down
# 3 sec foxtrot strange
# 4 ter alpha charm
# 5 sec charlie bottom
# 6 quart bravo top
You can subset the lookup data frame by the target data frame.
Data
df <- structure(list(Size = structure(c(1L, 1L, 3L, 4L, 3L, 2L), .Label = c("primary",
"quart", "sec", "ter"), class = "factor"), Isolate = c(3L, 4L,
6L, 1L, 3L, 2L), spin = structure(c(6L, 3L, 4L, 2L, 1L, 5L), .Label = c("bottom",
"charm", "down", "strange", "top", "up"), class = "factor")), .Names = c("Size",
"Isolate", "spin"), class = "data.frame", row.names = c("1",
"2", "3", "4", "5", "6"))
df2 <- structure(list(V2 = structure(1:7, .Label = c("alpha", "bravo",
"charlie", "delta", "echo", "foxtrot", "golf"), class = "factor")), .Names = "V2", class = "data.frame", row.names = c(NA,
-7L))

Create crosstab from three values

I have a data frame with three variables and I want the first variable to be the row names, the second variable to be the column names, and the third variable to be the values associated with those two parameters, with NA or blank where data may be missing. Is this easy/possible to do in R?
example input
structure(list(
Player = c("1","1","2","2","3","3","4","4","5","5","6"),
Type = structure(c(2L, 1L, 2L, 1L, 2L, 1L,2L, 1L, 2L, 1L, 1L),
.Label = c("Long", "Short"), class = "factor"),
Yards = c("23","41","50","29","11","41","48","12","35","27","25")),
.Names = c("Player", "Type", "Yards"),
row.names = c(NA, 11L),
class = "data.frame")
Using the sample data you gave:
df <- structure(list(Player = c("1", "1", "2", "2", "3", "3", "4", "4", "5",
"5", "6"), Type = structure(c(2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 1L),
.Label = c("Long", "Short"), class = "factor"),
Yards = c("23", "41", "50", "29", "11", "41", "48", "12", "35", "27", "25")),
.Names = c("Player", "Type", "Yards"), row.names = c(NA, 11L),
class = "data.frame")
Player Type Yards
1 1 Short 23
2 1 Long 41
3 2 Short 50
4 2 Long 29
5 3 Short 11
6 3 Long 41
7 4 Short 48
8 4 Long 12
9 5 Short 35
10 5 Long 27
11 6 Long 25
dcast will be able to tabulate the two variables.
library(reshape2)
df.cast <- dcast(df, Player~Type, value.var="Yards")
The Player column will be a column, so you need to do a bit extra to make it the row names of the data.frame
rownames(df.cast) <- df.cast$Player
df.cast$Player <- NULL
Long Short
1 41 23
2 29 50
3 41 11
4 12 48
5 27 35
6 25 <NA>

Splitting data based on row values based on conditions

this is how my data looks like:
df <- structure(list(`1` = c(1 , 2 , 3 , 4 , 5 ,6 , 7 , 8 , 9, 10 ,11 ,12, 13, 14 ,15 ,16, 17, 18), `2` = structure(c(4L,5L, 2L, 5L, 2L, 3L, 1L, 6L,4L,5L, 2L, 5L, 2L, 3L, 1L, 6L,4L,5L), .Label=c("a","a","b","b","b","c","c","b","b","b","e","e","f","g","g","g","f","f"),
class="factor"),`3`=c(1,0,1,1,0,1,0,1,1,0,1,0,1,1,0,1,0,0),`4`=c(0,1,0,0,1,0,0,0,0,1,0,1,0,0,0,0,1,1), `5` =c(10,5,20,20,5,0,0,10,10,5,10,5,15,5,5,5,2,2)),
.Names = c("N", "Condition", "AOI_hit_b", "AOI_hit_f", "Time"), class = "data.frame",
row.names = c("1", "2", "3", "4", "5", "6", "7", "8", "9","10","11","12","13","14","15","16","17","18"))
I now want to make calculations on Time, depending on whether Condition b is preceded by a or c - depending on wheter AOI_hit_b is a 0 or 1. To that comes that only the first hit of 1 in a Condition is relevant for time, as it always writes the same time for multiple 1's in a condition.
I tried it with the ddply package, but didn't get the output I wanted.
My output should look something like this:
sum of
Condition Time
b(a) 25
b(c) 10
This is the code I have so far:
hit_cb = list()
for (i in 1:nrow(df)){
if (df[i,2] == "b") & (df[i-1,2] == "c") {
hit_cb[i] = ddply(df,.(AOI_hit_b), summarize, mysum=sum(unique(Time)))
}
}

Make Dataframe loop code run faster

DATA AND REQUIREMENTS
The first table (myMatrix1) is from an old geological survey that used different region boundaries (begin and finish) columns to the newer survey.
What I wish to do is to match the begin and finish boundaries and then create two tables one for the new data on sedimentation and one for the new data on bore width characterised as a boolean.
myMatrix1 <- read.table("/path/to/file")
myMatrix2 <- read.table("/path/to/file")
> head(myMatrix1) # this is the old data
sampleIDs begin finish
1 19990224 4 5
2 20000224 5 6
3 20010203 6 8
4 20019024 29 30
5 20020201 51 52
> head(myMatrix2) # this is the new data
begin finish sedimentation boreWidth
1 0 10 1.002455 0.014354
2 11 367 2.094351 0.056431
3 368 920 0.450275 0.154105
4 921 1414 2.250820 1.004353
5 1415 5278 0.114109 NA`
Desired output:
> head(myMatrix6)
sampleIDs begin finish sedimentation #myMatrix4
1 19990224 4 5 1.002455
2 20000224 5 6 1.002455
3 20010203 6 8 2.094351
4 20019024 29 30 2.094351
5 20020201 51 52 2.094351
> head(myMatrix7)
sampleIDs begin finish boreWidthThresh #myMatrix5
1 19990224 4 5 FALSE
2 20000224 5 6 FALSE
3 20010203 6 8 FALSE
4 20019024 29 30 FALSE
5 20020201 51 52 FALSE`
CODE
The following code has taken me several hours to run on my dataset (about 5 million data points). Is there any way to change the code to make it run any faster?
# create empty matrix for sedimentation
myMatrix6 <- data.frame(NA,NA,NA,NA)[0,]
names(myMatrix6) <- letters[1:4]
# create empty matrix for bore
myMatrix7 <- data.frame(NA,NA,NA,NA)[0,]
names(myMatrix7) <- letters[1:4]
for (i in 1:nrow(myMatrix2))
{
# create matrix that has the value of myMatrix1$begin being
# situated between the values of myMatrix2begin[i] and myMatrix2finish[i]
myMatrix3 <- myMatrix1[which((myMatrix1$begin > myMatrix2$begin[i]) & (myMatrix1$begin < myMatrix2$finish[i])),]
myMatrix4 <- rep(myMatrix2$sedimentation, nrow(myMatrix3))
if (is.na(myMatrix2$boreWidth[i])) {
myMatrix5 <- rep(NA, nrow(myMatrix3))
}
else if (myMatrix2$boreWidth[i] == 0) {
myMatrix5 <- rep(TRUE, nrow(myMatrix3))
}
else if (myMatrix2$boreWidth[i] > 0) {
myMatrix5 <- rep(FALSE, nrow(myMatrix3))
}
myMatrix6 <- rbind(myMatrix6, cbind(myMatrix3, myMatrix4))
myMatrix7 <- rbind(myMatrix7, cbind(myMatrix3, myMatrix5))
}
EDIT:
> dput(head(myMatrix2)
structure(list(V1 = structure(c(6L, 1L, 2L, 4L, 5L, 3L), .Label = c("0",
"11", "1415", "368", "921", "begin"), class = "factor"), V2 = structure(c(6L,
1L, 3L, 5L, 2L, 4L), .Label = c("10", "1414", "367", "5278",
"920", "finish"), class = "factor"), V3 = structure(c(6L, 3L,
4L, 2L, 5L, 1L), .Label = c("0.114109", "0.450275", "1.002455",
"2.094351", "2.250820", "sedimentation"), class = "factor"),
V4 = structure(c(5L, 1L, 2L, 3L, 4L, 6L), .Label = c("0.014354",
"0.056431", "0.154105", "1.004353", "boreWidth", "NA"), class = "factor")), .Names = c("V1",
"V2", "V3", "V4"), row.names = c(NA, 6L), class = "data.frame")
> dput(head(myMatrix1)
structure(list(V1 = structure(c(6L, 1L, 2L, 3L, 4L, 5L), .Label = c("19990224",
"20000224", "20010203", "20019024", "20020201", "sampleIDs"), class = "factor"),
V2 = structure(c(6L, 2L, 3L, 5L, 1L, 4L), .Label = c("29",
"4", "5", "51", "6", "begin"), class = "factor"), V3 = structure(c(6L,
2L, 4L, 5L, 1L, 3L), .Label = c("30", "5", "52", "6", "8",
"finish"), class = "factor")), .Names = c("V1", "V2", "V3"
), row.names = c(NA, 6L), class = "data.frame")
First look at these general suggestions on speeding up code: https://stackoverflow.com/a/8474941/636656
The first thing that jumps out at me is that I'd create only one results matrix. That way you're not duplicating the sampleIDs begin finish columns, and you can avoid any overhead that comes with running the matching algorithm twice.
Doing that, you can avoid selecting more than once (although it's trivial in terms of speed as long as you store your selection vector rather than re-calculate).
Here's a solution using apply:
myMatrix1 <- data.frame(sampleIDs=c(19990224,20000224),begin=c(4,5),finish=c(5,6))
myMatrix2 <- data.frame(begin=c(0,11),finish=c(10,367),sed=c(1.002,2.01),boreWidth=c(.014,.056))
glommer <- function(x,myMatrix2) {
x[4:5] <- as.numeric(myMatrix2[ myMatrix2$begin <= x["begin"] & myMatrix2$finish >= x["finish"], c("sed","boreWidth") ])
names(x)[4:5] <- c("sed","boreWidth")
return( x )
}
> t(apply( myMatrix1, 1, glommer, myMatrix2=myMatrix2))
sampleIDs begin finish sed boreWidth
[1,] 19990224 4 5 1.002 0.014
[2,] 20000224 5 6 1.002 0.014
I used apply and stored everything as numeric. Other approaches would be to return a data.frame and have the sampleIDs and begin, finish be ints. That might avoid some problems with floating point error.
This solution assumes there are no boundary cases (e.g. the begin, finish times of myMatrix1 are entirely contained within the begin, finish times of the other). If your data is more complicated, just change the glommer() function. How you want to handle that is a substantive question.

Resources