How to relabel column values - r

This seems like a simple task but for the life of me I can't figure it out. I have a dataframe column with the following structure:
df = as.data.frame(c(1,1,2,2,3,3,4,4))
I also have the following vectors:
index = seq(1,2)
labels = c('Control','Treatment')
The index is updated by a for loop, and all I want to do is replace all of the values in the df column that match the index with the appropriate label (e.g., all values of 1 and 2 in the df will be replaced with 'Control'). So far the closest I've gotten is:
df$col[df$col == index[1]] = labels[1]
If index[1] is replaced with index, only the first value of the vector is matched. How can I do this such that all values are matched and replaced?
Thank you!

It is not completely clear what you want to do, but from your description I suspect what you need is a factor:
df <- data.frame(col=c(1,1,2,2,3,3,4,4))
labs <- c("first", "second", "third", "fourth")
df$col2 <- factor(df$col, labels=labs)
df
# col col2
# 1 1 first
# 2 1 first
# 3 2 second
# 4 2 second
# 5 3 third
# 6 3 third
# 7 4 fourth
# 8 4 fourth
You can change these labels afterwards via levels(), for example:
> levels(df$col2)[3:4] <- c("tre", "fyra")
> df
col col2
1 1 first
2 1 first
3 2 second
4 2 second
5 3 tre
6 3 tre
7 4 fyra
8 4 fyra
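Tying this back to the question's actual goal (mapping both 1 and 2 to 'Control'): the labels= argument may repeat, which collapses several underlying values into one level. A minimal sketch, assuming (as the question implies) that 3 and 4 should likewise become 'Treatment':

```r
df <- data.frame(col = c(1, 1, 2, 2, 3, 3, 4, 4))
# Repeated labels collapse several underlying values into a single level
df$group <- factor(df$col, levels = 1:4,
                   labels = c("Control", "Control", "Treatment", "Treatment"))
df$group
# [1] Control   Control   Control   Control   Treatment Treatment Treatment Treatment
# Levels: Control Treatment
```

This also removes the need for the for loop over the index vector entirely.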

Related

Remove duplicate rows with certain value in specific column

I have a data frame and I want to remove rows that are duplicated in all columns except one, keeping only the rows whose value in that column is not "excluded".
In the example below, the 3rd and 4th rows are duplicated in every column except col3, so I want to keep just one of them. The complicated part is that I want to keep the 4th row rather than the 3rd, because the 3rd row's col3 is "excluded". In general, among the duplicated rows I only want to keep those that are not "excluded".
My real data frame has lots of duplicated rows, and of each pair of duplicates, one of them is "excluded" for sure.
Below is re-producible ex:
a <- c(1,2,3,3,7)
b <- c(4,5,6,6,8)
c <- c("red","green","excluded","orange","excluded")
d <- data.frame(a,b,c)
Thank you so much!
Update: Or, when removing duplicate, only keep the second observation (4th row).
dplyr with some base R should work for this:
library(dplyr)
a <- c(1,2,3,3,3,7)
b <- c(4,5,6,6,6,8)
c <- c("red","green","brown","excluded","orange","excluded")
d <- data.frame(a,b,c)
d <- filter(d, !duplicated(d[,1:2]) | c!="excluded")
Result:
a b c
1 1 4 red
2 2 5 green
3 3 6 brown
4 3 6 orange
5 7 8 excluded
The filter drops only the rows that are both duplicated and "excluded". I added a non-"excluded" duplicate ('brown') to your example to test that case as well.
Here is an example with a loop:
a <- c(1,2,3,3,7)
b <- c(4,5,6,6,8)
c <- c("red","green","excluded","orange","excluded")
d<- data.frame(a,b,c)
# Give row indices of duplicated rows (only the second and later occurrences are flagged)
duplicated_rows <- which(duplicated(d[c("a", "b")]))
to_remove <- c()
# Loop over the duplicated rows
for (i in duplicated_rows) {
  # Find similar rows
  selection <- which(d$a == d$a[i] & d$b == d$b[i])
  # Store the indices of rows in the set of duplicates which are "excluded"
  to_remove <- c(to_remove, selection[which(d$c[selection] == "excluded")])
}
# Remove rows
d <- d[-to_remove, ]
print(d)
  a b c
1 1 4 red
2 2 5 green
4 3 6 orange
5 7 8 excluded
Here is a possibility ... I hope it can help :)
nquit <- (d %>%
  mutate(code = 1:nrow(d)) %>%
  group_by(a, b) %>%
  mutate(nDuplicate = n()) %>%
  filter(nDuplicate > 1) %>%
  filter(c == "excluded"))$code
e <- d[-nquit, ]
Shortening the approach by @Klone a bit, another dplyr solution:
d %>% mutate(c = factor(c, ordered = TRUE,
levels = c("red", "green", "orange", "excluded"))) %>% # Order the factor variable
arrange(c) %>% # Sort the data frame so that excluded comes first
group_by(a, b) %>% # Group by the two columns that determine duplicates
mutate(id = 1:n()) %>% # Assign IDs in each group
filter(id == 1) # Only keep one row in each group
Result:
# A tibble: 4 x 4
# Groups: a, b [4]
a b c id
<dbl> <dbl> <ord> <int>
1 1 4 red 1
2 2 5 green 1
3 3 6 orange 1
4 7 8 excluded 1
Regarding your edit at the end of the question:
Update: Or, when removing duplicate, only keep the second observation (4th row).
note that, if the ordering of the rows means the record to keep is always the last one within each duplicate group, you can simply set fromLast=TRUE in the duplicated() function so that rows are flagged as duplicates counting backwards from the last occurrence in each group.
Using a slightly modified version of your data (where I added more duplicate groups to better show that the process works in a more general case):
a <- c(1,1,2,3,3,3,7)
b <- c(4,4,5,6,6,6,8)
c <- c("excluded", "red","green","excluded", "excluded","orange","excluded")
d <- data.frame(a,b,c)
a b c
1 1 4 excluded
2 1 4 red
3 2 5 green
4 3 6 excluded
5 3 6 excluded
6 3 6 orange
7 7 8 excluded
using:
ind2remove = duplicated(d[,c("a", "b")], fromLast=TRUE)
(d_noduplicates = d[!ind2remove,])
we get:
a b c
2 1 4 red
3 2 5 green
6 3 6 orange
7 7 8 excluded
Note that this doesn't require the rows in each duplicate group to be all together in the original data. The only important thing is that you want to keep the record showing up last in the data from each duplicate group.

R randomly allocate different values from vector to dataframe column based on condition

I've got a dataset called df of answers to a question Q1
df = data.frame(ID = c(1:5), Q1 = c(1,1,3,4,2))
I also have a vector where each element is a word
words = c("good","bad","better","improved","fascinating","improvise")
My Objective
IF Q1 = 1 or Q1 = 2, then randomly assign a value from vector words to a newly created column called followup
My Attempt
#If answer to Q1 is 1 or 2, then randomly allocate a word to newly created column "followup"
#Else leave blank
df$followup=ifelse(df$Q1==1 | df$Q1==2,sample(words,1),"")
However doing this leads to repetition of the same randomly selected word for each row that contains a 1 or 2.
  ID Q1    followup
1  1  1 fascinating
2  2  1 fascinating
3  3  3
4  4  4
5  5  2 fascinating
I'd like every word to be randomized and different.
Any inputs would be highly appreciated.
For that we may use
df$followup[df$Q1 %in% 1:2] <- sample(words, sum(df$Q1 %in% 1:2))
df
# ID Q1 followup
# 1 1 1 better
# 2 2 1 improvise
# 3 3 3 <NA>
# 4 4 4 <NA>
# 5 5 2 bad
Since we are generating those values in a single call, replace = FALSE (the default value) in sample gives the desired result of all the values being different.
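If reproducibility matters, or if there could be more matching rows than words, a hedged variant of the same idea (the seed value is arbitrary, chosen just for illustration):

```r
set.seed(42)  # arbitrary seed, only to make the random draw reproducible
df <- data.frame(ID = 1:5, Q1 = c(1, 1, 3, 4, 2))
words <- c("good", "bad", "better", "improved", "fascinating", "improvise")
idx <- df$Q1 %in% 1:2
# Draw exactly one word per matching row; fall back to replace = TRUE
# only when there are more matching rows than available words
df$followup[idx] <- sample(words, sum(idx), replace = sum(idx) > length(words))
```

Rows that don't match keep NA, exactly as in the answer above.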

concatenating only vector values from a row

I have a problem with my R code. At first I have a dataframe (df) with one column which consists of numerical values as well as vectors. These vectors also contain numerical values. This is an example of some rows of the dataframe:
1. 60011000
2. 60523000
4. 60490000
5. 60599000
6. c("60741000", "60740000", "60742000")
7. 60647000
8. c("60766000", "60767000")
9. c("60563000", "60652000")
In the list you can see there are some rows (6, 8 & 9) containing vector elements. I want to concatenate the elements in the vectors to only one element.
For example the result from the vector of line 6 should look like this:
607410006074000060742000
And the result of line 8 should look like this
6076600060767000
My dataframe has more than 30,000 rows so it is impossible for me to do it manually.
Can you help me to solve my problem? It is important that the number of rows does not change.
Thank you very much, and please excuse any mistakes I made. I am not a native speaker.
The data:
dat <- read.table(text='60011000
60523000
60490000
60599000
c("60741000", "60740000", "60742000")
60647000
c("60766000", "60767000")
c("60563000", "60652000")', sep = "\t")
dat
# V1
# 1 60011000
# 2 60523000
# 3 60490000
# 4 60599000
# 5 c(60741000, 60740000, 60742000)
# 6 60647000
# 7 c(60766000, 60767000)
# 8 c(60563000, 60652000)
You can use gsub to replace all non-digit characters with the empty string.
dat$V1 <- gsub("[^0-9]+", "", dat$V1)
dat
# V1
# 1 60011000
# 2 60523000
# 3 60490000
# 4 60599000
# 5 607410006074000060742000
# 6 60647000
# 7 6076600060767000
# 8 6056300060652000
You could do:
df <- data.frame(a = c(1, 2, 3, 4, 'c("60741000", "60740000", "60742000")'),
                 b = c(1, 2, 3, 4, 5),
                 stringsAsFactors = FALSE)
> df
a b
1 1 1
2 2 2
3 3 3
4 4 4
5 c("60741000", "60740000", "60742000") 5
df[,"a"]=sapply(df[,"a"],function(x) paste(eval(parse(text=x)),collapse = ""))
> df
a b
1 1 1
2 2 2
3 3 3
4 4 4
5 607410006074000060742000 5
Here you go (looks like someone beat me to the punch):
df <- read.table("df.txt", header = FALSE)
df
# V1
# 1 123
# 2 12
# 3 c("1","55","6")
# 4 356
# 5 c("99","55","3")
df[,1] <- as.numeric(as.character(gsub("[^0-9]","",df[,1])))
df
# V1
# 1 123
# 2 12
# 3 1556
# 4 356
# 5 99553
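One caveat about the as.numeric() step: the concatenated IDs from the original question are up to 24 digits long, which exceeds what a double can represent exactly (roughly 15-16 significant digits), so for values that size it is safer to keep the result as character:

```r
x <- "607410006074000060742000"  # 24 digits, taken from the question
# A double keeps only ~15-16 significant digits, so the round trip fails:
identical(as.character(as.numeric(x)), x)
# [1] FALSE
```

The gsub() answers above already return character, so simply skipping the numeric conversion avoids the problem.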

remove duplicate row based only of previous row

I'm trying to remove duplicate rows from a data frame based only on the previous row. The duplicated() and unique() functions remove all duplicates, leaving you with only unique rows, which is not what I want.
I've illustrated the problem here with a loop. I need to vectorize this because my actual data set is much too large to use a loop on.
x <- c(1,1,1,1,3,3,3,4)
y <- c(1,1,1,1,3,3,3,4)
z <- c(1,2,1,1,3,2,2,4)
xy <- data.frame(x,y,z)
xy
x y z
1 1 1 1
2 1 1 2
3 1 1 1
4 1 1 1 #this should be removed
5 3 3 3
6 3 3 2
7 3 3 2 #this should be removed
8 4 4 4
# loop that produces desired output
toRemove <- NULL
for (i in 2:nrow(xy)) {
  test <- as.vector(xy[i, ] == xy[i - 1, ])
  if (!(FALSE %in% test)) {
    toRemove <- c(toRemove, i)  # build a vector of rows to remove
  }
}
xy[-toRemove, ]  # exclude rows
x y z
1 1 1 1
2 1 1 2
3 1 1 1
5 3 3 3
6 3 3 2
8 4 4 4
I've tried using dplyr's lag function, but it only works on single columns, when I try to run it over all 3 columns it doesn't work.
ifelse(xy[,1:3] == lag(xy[,1:3],1), NA, xy[,1:3])
Any advice on how to accomplish this?
Looks like we want to remove if the row is same as above:
# make an index, if cols not same as above
ix <- c(TRUE, rowSums(tail(xy, -1) == head(xy, -1)) != ncol(xy))
# filter
xy[ix, ]
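The lag idea from the question can also be made to work across all columns in base R. A sketch: compare each column to a shifted copy of itself, then AND the per-column results together with Reduce():

```r
xy <- data.frame(x = c(1, 1, 1, 1, 3, 3, 3, 4),
                 y = c(1, 1, 1, 1, 3, 3, 3, 4),
                 z = c(1, 2, 1, 1, 3, 2, 2, 4))
# TRUE where a row equals the row directly above it in every column
same_as_prev <- Reduce(`&`, lapply(xy, function(col)
  c(FALSE, col[-1] == col[-length(col)])))
xy[!same_as_prev, ]
```

This flags rows 4 and 7, matching the output of the loop in the question, and stays fully vectorized.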
Why don't you just iterate over the rows, keeping track of the previous row so you can compare it to the current one?
When the two rows match, remember that row's position, remove it, and then restart the iteration from the beginning of the list.
Don't delete rows while you are still iterating, or you will run into concurrent-modification errors.

Finding the top values in data frame using r

How can I find the 5 highest values of a column in a data frame?
I tried the order() function, but it gives me only the indices of the rows, whereas I need the actual data from the column. Here's what I have so far:
tail(order(DF$column, decreasing=TRUE),5)
You need to pass the result of order back to DF:
DF <- data.frame(column = 1:10,
                 names = letters[1:10])
order(DF$column)
# 1 2 3 4 5 6 7 8 9 10
head(DF[order(DF$column),],5)
# column names
# 1 1 a
# 2 2 b
# 3 3 c
# 4 4 d
# 5 5 e
You're correct that order just gives the indices. You then need to pass those indices to the data frame, to pick out the rows at those indices.
Also, as mentioned in the comments, you can use head instead of tail with decreasing = TRUE if you'd like, but that's a matter of taste.
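For completeness, the five highest values (the question as asked) come from combining decreasing = TRUE with head():

```r
DF <- data.frame(column = 1:10, names = letters[1:10])
# decreasing = TRUE puts the largest values first; head() then keeps five rows
top5 <- head(DF[order(DF$column, decreasing = TRUE), ], 5)
top5$column
# [1] 10  9  8  7  6
```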
