deleting first row based on column variable - r

How do I delete the first row of each new variable? For example, here is some data:
m <- c("a","a","a","a","a","b","b","b","b","b")
n <- c('x','y','x','y','x','y',"x","y",'x',"y")
o <- c(1:10)
z <- data.frame(m,n,o)
I want to delete the first entry for a and for b in column m. I have a very large data frame, so I want to do this based on the change from a to b and so on.
Here is what I want the data frame to look like.
m n o
1 a y 2
2 a x 3
3 a y 4
4 a x 5
5 b x 7
6 b y 8
7 b x 9
8 b y 10
Thanks.

Just use duplicated:
z[duplicated(z$m),]
# m n o
#2 a y 2
#3 a x 3
#4 a y 4
#5 a x 5
#7 b x 7
#8 b y 8
#9 b x 9
#10 b y 10
Why does this work? Consider:
duplicated("a")
#[1] FALSE
duplicated(c("a","a"))
#[1] FALSE TRUE
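As a further check, duplicated marks every occurrence after the first, even when the groups are not contiguous, so z does not need to be sorted by m for this to work:
duplicated(c("a","b","a"))
#[1] FALSE FALSE TRUE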

data.table is well suited to large datasets in R. setDT converts the data frame z to a data.table by reference; then group by m and drop the first row of each group.
library(data.table)
setDT(z)[, .SD[-1], by = "m"]
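On very large tables, selecting row numbers with .I instead of materialising .SD for each group can be faster; a sketch of that idiom, reusing the setDT(z) above:
z[z[, .I[-1], by = m]$V1] # .I holds the row numbers; drop the first within each group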

Using group_by and row_number from the dplyr package:
z %>%
  group_by(m) %>%
  filter(row_number(o) != 1)
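An equivalent dplyr formulation, as a sketch, uses slice(-1) to drop the first row of each group:
z %>%
  group_by(m) %>%
  slice(-1) %>%
  ungroup()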


Selecting rows from a data frame from combinations of lists [duplicate]

This question already has answers here: Removing one table from another in R [closed] (3 answers). Closed 5 years ago.
I have a dataframe, dat:
dat <- data.frame(col1 = rep(1:4, 3),
                  col2 = rep(letters[24:26], 4),
                  col3 = letters[1:12])
I want to filter dat on two different columns using ONLY the combinations given by the rows in the data frame filter:
filter <- data.frame(col1 = 1:3, col2 = NA)
lists <- list(list("x","y"), list("y","z"), list("x","z"))
filter$col2 <- lists
So for example, rows containing (1,x) and (1,y), would be selected, but not (1,z),(2,x), or (3,y).
I know how I would do it using a for loop:
# create an empty frame to drop results in
results <- dat[0,]
for(f in 1:nrow(filter)){
  temp_filter <- filter[f,]
  temp_dat <- dat[dat$col1 == temp_filter[1,1] &
                  dat$col2 %in% unlist(temp_filter[1,2]),]
  results <- rbind(results, temp_dat)
}
Or if you prefer dplyr style:
require(dplyr)
results <- dat[0,]
for(f in 1:nrow(filter)){
  temp_filter <- filter[f,]
  temp_dat <- filter(dat, col1 == temp_filter[1,1] &
                          col2 %in% unlist(temp_filter[1,2]))
  results <- rbind(results, temp_dat)
}
results should return
col1 col2 col3
1 1 x a
5 1 y e
2 2 y b
6 2 z f
3 3 z c
7 3 x g
I would normally do the filtering with a merge, but I can't here since I have to check col2 against a list rather than a single value. The for loop works, but I figured there would be a more efficient way to do this, probably using some variation of apply or do.call.
We could use dplyr::anti_join() to do the row exclusion filtering for us, if we had two dataframes:
index <- data.frame(col1 = as.character(filter[,1]),
                    col2 = filter[,2])
anti_join(dat, index)
Joining, by = c("col1", "col2")
col1 col2 col3
1 4 x d
2 1 y e
3 2 z f
4 3 x g
5 4 y h
6 1 z i
7 2 x j
8 3 y k
9 4 z l
Mostly base R, with a little help from dplyr:
dplyr::setdiff(dat, merge(dat, setNames(as.data.frame(filter), names(dat)[1:2])))
col1 col2 col3
1 4 x d
2 1 y e
3 2 z f
4 3 x g
5 4 y h
6 1 z i
7 2 x j
8 3 y k
9 4 z l
A pure base R solution, though it's not so pretty and you lose the row order:
subset(merge(dat, `[[<-`(setNames(as.data.frame(filter), names(dat)[1:2]), "x", value = 1), all.x = TRUE), is.na(x), -4)
col1 col2 col3
2 1 y e
3 1 z i
4 2 x j
6 2 z f
7 3 x g
8 3 y k
10 4 x d
11 4 y h
12 4 z l
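For completeness, the inclusion filtering the question originally asked for (keeping only the listed combinations) can also be done in base R by first expanding filter into one row per (col1, col2) pair. A sketch, assuming the pasted keys are unambiguous; the rows come back in dat's original order:
# expand the list column into one row per (col1, col2) pair
pairs <- do.call(rbind, lapply(seq_len(nrow(filter)), function(f) {
  data.frame(col1 = filter$col1[f],
             col2 = unlist(filter$col2[[f]]),
             stringsAsFactors = FALSE)
}))
# keep rows of dat whose (col1, col2) combination appears in pairs
keep <- paste(dat$col1, dat$col2) %in% paste(pairs$col1, pairs$col2)
results <- dat[keep, ]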

Generating unique ids and group ids using dplyr and concatenation

I have a problem that I suspect has arisen from a dplyr update combined with my hacky code. Given a data frame in which every row is duplicated, I want to assign each row a unique id by combining the entries of two columns with either "_" or "a_" in the middle. I also want to assign a group id by combining the entries of one column with either "" or "a". Because these formats are important for lining up with another data frame, I can't use solutions based on interaction() and factor() that I've seen in other posts.
So I want to go from this:
Generation Identity
1 1 X
2 1 Y
3 1 Z
4 2 X
5 2 Y
6 2 Z
7 3 X
8 3 Y
9 3 Z
10 1 X
11 1 Y
12 1 Z
13 2 X
14 2 Y
15 2 Z
16 3 X
17 3 Y
18 3 Z
to this:
Generation Identity Unique_id Group_id
1 1 X 1_X X
2 1 Y 1_Y Y
3 1 Z 1_Z Z
4 2 X 2_X X
5 2 Y 2_Y Y
6 2 Z 2_Z Z
7 3 X 3_X X
8 3 Y 3_Y Y
9 3 Z 3_Z Z
10 1 X 1a_X Xa
11 1 Y 1a_Y Ya
12 1 Z 1a_Z Za
13 2 X 2a_X Xa
14 2 Y 2a_Y Ya
15 2 Z 2a_Z Za
16 3 X 3a_X Xa
17 3 Y 3a_Y Ya
18 3 Z 3a_Z Za
The minimal example below is based on code that previously worked for me and others in setting the unique id, but it now causes RStudio to crash with a segfault (Exception Type: EXC_BAD_ACCESS (SIGSEGV)). When I call a function containing this code, it generates the message
Error in match(vector, df$Unique_id) : 'translateCharUTF8' must be
called on a CHARSXP
which I've read can be symptomatic of memory issues.
library(dplyr)
dff <- data.frame(Generation = rep(1:3, each = 3),
                  Identity = rep(LETTERS[24:26], times = 3))
dff <- rbind(dff, dff) # duplicate rows
dff <- group_by_(dff, ~Generation, ~Identity) %>%
  mutate(Unique_id = c(paste0(Identity[1], "_", Generation[1]),
                       paste0(Identity[1], "a", "_", Generation[1]))) %>%
  ungroup()
I think the problem is related to an update of dplyr (I'm using the latest release versions of RStudio and all packages, on OSX Sierra). In any case, my solution above is something of a hack. I'd very much appreciate suggestions for improved code, preferably using either base R or dplyr (since the code is part of a package that currently depends on dplyr).
Here is how you can approach the problem:
First, find the duplicates in your data (I called my data A):
dup <- duplicated(A)
Then add a counter column:
A$count <- 1:nrow(A)
n <- ncol(A) # index of the added count column
Now compute the two new columns and cbind them to the original data frame:
B <- data.frame(t(apply(A, 1, function(x)
  if(dup[as.numeric(x[n])]) c(paste0(x["Identity"], "a"), paste(x[-n], collapse = "a_"))
  else c(x["Identity"], paste(x[-n], collapse = "_")))))
`names<-`(cbind(A[-n], B), c(names(A[-n]), "Group_ID", "Unique_ID"))
Generation Identity Group_ID Unique_ID
1 1 X X 1_X
2 1 Y Y 1_Y
3 1 Z Z 1_Z
4 2 X X 2_X
5 2 Y Y 2_Y
6 2 Z Z 2_Z
7 3 X X 3_X
8 3 Y Y 3_Y
9 3 Z Z 3_Z
10 1 X Xa 1a_X
11 1 Y Ya 1a_Y
12 1 Z Za 1a_Z
13 2 X Xa 2a_X
14 2 Y Ya 2a_Y
15 2 Z Za 2a_Z
16 3 X Xa 3a_X
17 3 Y Ya 3a_Y
18 3 Z Za 3a_Z
Here's my amended version of Onyambu's solution, which refers to columns by name rather than number (and so can handle data frames that have additional columns):
dup <- duplicated(dff) # identify duplicates
dff$count <- 1:nrow(dff) # add a count column to the data frame
# create a new data frame containing the unique and group ids:
B <- data.frame(t(apply(dff, 1, function(x)
  if(dup[as.numeric(x["count"])]) c(paste0(x["Identity"], "a"),
                                    paste(x["Identity"], x["Generation"], sep = "a_"))
  else c(x["Identity"], paste(x["Identity"], x["Generation"], sep = "_")))))
# combine the data frames:
colnames(B) <- c("Group_id", "Unique_id")
dff <- cbind(dff[-ncol(dff)], B) # drop the count column and attach the new ids
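For the record, starting from the original two-column dff, both ids can also be built fully vectorised in base R, with no apply at all. A sketch that follows the Generation-first format of the desired output above, assuming duplicated() flags the second copy of each row:
dup <- duplicated(dff[c("Generation", "Identity")])
suffix <- ifelse(dup, "a", "") # "" for originals, "a" for duplicates
dff$Unique_id <- paste0(dff$Generation, suffix, "_", dff$Identity) # e.g. "1_X", "1a_X"
dff$Group_id <- paste0(dff$Identity, suffix) # e.g. "X", "Xa"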

semi_join in R but pull back duplicates

I'm having issues with semi_join from dplyr. Ideally I would like to do a semi join of dfA against dfB. dfA has duplicate values, and so does dfB. I want to pull back all rows from dfA that have any match in dfB, even duplicates in dfA.
dfA:
x y z
1 r 5
1 b 4
2 4 e
3 5 r
3 9 g
4 3 0

dfB:
x g
1 lkm
1 pok
2 jij
2 pop
3 hhg
5 trt

Desired result dfC:
x y z
1 r 5
1 b 4
2 4 e
3 5 r
3 9 g
What I would like to get is the dfC output above: because there is at least one match on x, it pulls back all rows in dfA with that x.
semi_join(dfA, dfB, by = "x")
dfC
x y z
1 r 5
2 4 e
3 5 r
inner_join(dfA, dfB, by = "x")
x y z g
1 r 5 lkm
1 r 5 pok
1 b 4 lkm
1 b 4 pok
2 4 e jij
2 4 e pop
3 5 r hhg
3 9 g hhg
Neither of these gives me the right result. Any help would be great! Thanks in advance.
Not sure why you need a join: just use %in%.
library(data.table)
setDT(dfA)[x %in% dfB$x,]
# simple base R approach:
dfA[dfA$x %in% dfB$x,]
If you're using dplyr and going to keep passing the result down the pipe:
library(dplyr)
dfA %>% filter(x %in% dfB$x)
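As a quick check with the example data from the question (a sketch; the column types are assumptions):
dfA <- data.frame(x = c(1, 1, 2, 3, 3, 4),
                  y = c("r", "b", "4", "5", "9", "3"),
                  z = c("5", "4", "e", "r", "g", "0"))
dfB <- data.frame(x = c(1, 1, 2, 2, 3, 5),
                  g = c("lkm", "pok", "jij", "pop", "hhg", "trt"))
dfA[dfA$x %in% dfB$x, ]
#  x y z
#1 1 r 5
#2 1 b 4
#3 2 4 e
#4 3 5 r
#5 3 9 g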

Picking up only specific columns based on conditions on multiple columns in R [duplicate]

This question already has answers here: How to select the rows with maximum values in each group with dplyr? [duplicate] (6 answers). Closed 6 years ago.
I have a data frame, say
df <- data.frame(x = c(1,2,5,6,3,3,3,6,8,8,8,8),
                 y = c(1,1,1,1,1,2,3,1,1,2,3,4),
                 z = c("a","b","c","d","e","f","g","h","i","j","k","l"))
It looks like this:
x y z
1 1 1 a
2 2 1 b
3 5 1 c
4 6 1 d
5 3 1 e
6 3 2 f
7 3 3 g
8 6 1 h
9 8 1 i
10 8 2 j
11 8 3 k
12 8 4 l
I would like to pick unique elements from column x based on column y, such that y is the maximum. In this case, rows 5 to 7 all have x = 3, and I would like to pick the x = 3 row corresponding to y = 3 (the maximum value); similarly, for x = 8 I'd like to pick the y = 4 row.
The output should look like this:
x y z
1 1 1 a
2 2 1 b
3 5 1 c
4 6 1 d
5 3 3 g
6 6 1 h
7 8 4 l
I have a solution, which I am posting as an answer below, but is there any better method to achieve this? My solution only works in this specific case (picking the largest); what is the general-case solution?
One solution using dplyr
library(dplyr)
df %>%
  group_by(x) %>%
  slice(max(y)) # max(y) happens to equal the row number of the maximum here, since y runs 1, 2, ... within each group
# x y z
# (dbl) (dbl) (chr)
#1 1 1 a
#2 2 1 b
#3 3 3 g
#4 5 1 c
#5 6 1 d
#6 8 4 l
The base R alternative is aggregate:
aggregate(y~x, df, max)
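Note that aggregate() returns only the x and y columns; to recover z, one option (a sketch) is to merge the aggregated result back onto df:
agg <- aggregate(y ~ x, df, max)
merge(agg, df) # joins on the common columns x and y, restoring z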
You can achieve the same result using a dplyr chain and dplyr's group_by function. Once you use group_by, the rest of the functions in the chain are applied within each group as opposed to the whole data.frame. So here I filter to keep only the rows where y is max(y) for each grouping value of x. This can be extended to the min of y or a particular value.
I think it's generally good practice to ungroup the data at the end of a chain that uses group_by, to avoid any unexpected behavior.
library(dplyr)
df <- data.frame(x = c(1,2,5,6,3,3,3,6,8,8,8,8),
                 y = c(1,1,1,1,1,2,3,1,1,2,3,4),
                 z = c("a","b","c","d","e","f","g","h","i","j","k","l"))
df %>%
  group_by(x) %>%
  filter(y == max(y)) %>%
  ungroup()
To make it more general, say instead you wanted the mean of y for a given x as opposed to the max. You could then use the summarise function instead of the filter, as shown below.
df %>%
  group_by(x) %>%
  summarise(y = mean(y)) %>%
  ungroup()
Using data.table, we can use df[order(z), .I[which.max(y)], by = x] to get the row numbers of interest, e.g.:
library(data.table)
setDT(df)
df[df[order(z), .I[which.max(y)], by = x][, V1]]
x y z
1: 1 1 a
2: 2 1 b
3: 5 1 c
4: 6 1 d
5: 3 3 g
6: 8 4 l
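A simpler data.table spelling of the same idea, as a sketch, if you don't need the order(z) tie-break:
library(data.table)
setDT(df)[, .SD[which.max(y)], by = x] # keep the max-y row within each group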
Here is my solution using the dplyr package:
library(dplyr)
df <- data.frame(x = c(1,2,5,6,3,3,3,6,8,8,8,8),
                 y = c(1,1,1,1,1,2,3,1,1,2,3,4),
                 z = c("a","b","c","d","e","f","g","h","i","j","k","l"))
df <- arrange(df, desc(y))
df_out <- df[!duplicated(df$x),]
df_out
Printing df_out
x y z
1 8 4 l
2 3 3 g
6 1 1 a
7 2 1 b
8 5 1 c
9 6 1 d
Assuming the rows are ordered by y within each value of x, as they are in the example, you can use the base R functions split, lapply, and do.call/rbind to extract the desired rows using the "split / apply / combine" methodology.
do.call(rbind, lapply(split(df, df$x), function(i) i[nrow(i),]))
x y z
1 1 1 a
2 2 1 b
3 3 3 g
5 5 1 c
6 6 1 h
8 8 4 l
split breaks up the data.frame into a list based on x. This list is fed to lapply, which selects the last row of each data.frame and returns these one-row data.frames as a list. This list is then rbind-ed into a single data frame using do.call.
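If you would rather not rely on the rows being pre-sorted, the same split/apply/combine pattern works with which.max() instead of taking the last row (a sketch):
do.call(rbind, lapply(split(df, df$x), function(i) i[which.max(i$y), ]))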

How to count how many values per level in a given factor?

I have a data.frame mydf with about 2500 rows. These rows correspond to 69 classes of objects in column 1, mydf$V1, and I want to count how many rows per object class I have.
I can get a factor of these classes with:
objectclasses = unique(factor(mydf$V1, exclude="1"));
What's the terse R way to count the rows per object class? If this were any other language, I'd be traversing an array with a loop and keeping count, but I'm new to R programming and am trying to take advantage of R's vectorised operations.
Or using the dplyr library:
library(dplyr)
set.seed(1)
dat <- data.frame(ID = sample(letters,100,rep=TRUE))
dat %>%
  group_by(ID) %>%
  summarise(no_rows = length(ID))
Note the use of %>%, which is similar to the use of pipes in bash. Effectively, the code above pipes dat into group_by, and the result of that operation is piped into summarise.
The result is:
Source: local data frame [26 x 2]
ID no_rows
1 a 2
2 b 3
3 c 3
4 d 3
5 e 2
6 f 4
7 g 6
8 h 1
9 i 6
10 j 5
11 k 6
12 l 4
13 m 7
14 n 2
15 o 2
16 p 2
17 q 5
18 r 4
19 s 5
20 t 3
21 u 8
22 v 4
23 w 5
24 x 4
25 y 3
26 z 1
See the dplyr introduction for some more context, and the documentation for details regarding the individual functions.
Here are 2 ways to do it:
set.seed(1)
tt <- sample(letters,100,rep=TRUE)
## using table
table(tt)
tt
a b c d e f g h i j k l m n o p q r s t u v w x y z
2 3 3 3 2 4 6 1 6 5 6 4 7 2 2 2 5 4 5 3 8 4 5 4 3 1
## using tapply
tapply(tt,tt,length)
a b c d e f g h i j k l m n o p q r s t u v w x y z
2 3 3 3 2 4 6 1 6 5 6 4 7 2 2 2 5 4 5 3 8 4 5 4 3 1
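If you want the counts as a data frame rather than a named table, wrapping the table() call works (a sketch):
as.data.frame(table(tt)) # columns: tt (the level) and Freq (the count)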
Using the plyr package:
library(plyr)
count(mydf$V1)
It returns the frequency of each value.
Using data.table
library(data.table)
setDT(dat)[, .N, keyby=ID] #(Using #Paul Hiemstra's `dat`)
Or using dplyr 0.3
res <- count(dat, ID)
head(res)
#Source: local data frame [6 x 2]
# ID n
#1 a 2
#2 b 3
#3 c 3
#4 d 3
#5 e 2
#6 f 4
Or
dat %>%
  group_by(ID) %>%
  tally()
Or
dat %>%
  group_by(ID) %>%
  summarise(n = n())
We can use summary on a factor column:
summary(myDF$factorColumn)
One more approach is to apply the n() function, which counts the number of observations:
library(dplyr)
library(magrittr)
data %>%
  group_by(columnName) %>%
  summarise(Count = n())
In case I just want to know how many unique factor levels exist in the data, I use:
length(unique(df$factorcolumn))
Use the package plyr with lapply to get frequencies for every value (level) and every variable (factor) in your data frame.
library(plyr)
lapply(df, count)
This is an old post, but you can do this with base R and no data frames/data tables:
sapply(levels(yTrain), function(sLevel) sum(yTrain == sLevel))
