Just a quick question: how can I replace some values with others if these values are present in all the dataframe's column? Functions like mapvalues and recode work only if the column is specified, but in my case the dataframe has 89 columns so that would be time-consuming.
For the sake of clarity, take in consideration the following example. I want to replace [NULL] with another value.
Example:
a <- c("NULL",2,"NULL")
b <- c(3, "NULL", 1)
df <- data.frame(a, b)
df
a b
0 NULL 3
1 2 NULL
2 NULL 1
The difference between the example and my case is that the dataset is [35383 x 89], and the values I want to replace are more than one.
Thank you in advance for your time.
An extension to the comment by Ronak Shah. You can add 0 if you want like that. Or you can replace it with desired values, if you like that.
For example, replace the NULLs with mean of the respective columns:
#Run a loop to convert the characters into numbers because for your case it is all characters
#This will change the NULL to NAs.
for (i in colnames(df)){
df[,i] <- as.numeric(df[,i])
}
#Now replace the NAs with the mean of the column
for (i in colnames(df)){
df[,i][is.na(df[,i])] <- mean(df[,i], na.rm=TRUE)
}
You can similarly do this for median also. Let me know in the comment if you have any doubts.
For starters, I have added a few more rows to your example to better show how the code works
df
# a b
#1 NULL 3
#2 2 NULL
#3 NULL 1
#4 a 14
#5 1 a
#6 14 5
First, create two vectors: one with whe values you want to replace (pattern) and one with replacements in the same order. To make sure you have done it right, put them together in a data frame and take a look at the rows (this will also help in next step)
In this case, I want NULL to be 0, "a" to be "alpha", and so on, as shown below
pattern <- c("NULL", "a", 14, 1)
replacement <- c(0, "alpha", "fourteen", "one")
subs <- data.frame(pattern, replacement)
subs
# pattern replacement
#1 NULL 0
#2 a alpha
#3 14 fourteen
#4 1 one
To finish it, we will make a for tthat each time we will pick a pattern and its replacement from the subs data frame we created, and with these values execute a map_df(). This function iterates over the columns from our original data frame (df) and apply the gsub() function with the pattern and replacement
for (i in 1:nrow(subs)) {
df <- map_df(df, gsub, pattern = subs$pattern[i], replacement = subs$replacement[i])
}
df
# a b
#1 0 3
#2 2 0
#3 0 one
#4 alpha fourteen
#5 one alpha
#6 fourteen 5
I hope this was clear. Let me know if you have any doubts
Related
I need to standardize how subgroups are referred to in a data set. To do this I need to identify when a variable matches one of several strings and then set a new variable with the standardized name. I am trying to do that with the following:
df <- data.frame(a = c(1,2,3,4), b = c(depression_male, depression_female, depression_hsgrad, depression_collgrad))
TestVector <- "male"
for (i in TestVector) {
df$grpl <- grepl(paste0(i), df$b)
df[ which(df$grpl == TRUE),]$standard <- "male"
}
The test vector will frequently have multiple elements. The grepl works (I was going to deal with the male/female match confusion later but I'll take suggestions on that) but the subsetting and setting a new variable doesn't. It would be better (and work) if I could transform the grepl output directly into the standard name variable.
Your only real issue is that you need to initialize the standard column. But we can simplify your code a bit:
df <- data.frame(a = c(1,2,3,4), b = c("depression_male", "depression_female", "depression_hsgrad", "depression_collgrad"))
TestVector <- "male"
df$standard <- NA
for (i in TestVector) {
df[ grepl(i, df$b), "standard"] <- "male"
}
df
# a b standard
# 1 1 depression_male male
# 2 2 depression_female male
# 3 3 depression_hsgrad <NA>
# 4 4 depression_collgrad <NA>
Then you've got the issue that the "male" pattern matches "female" as well.
Perhaps you're looking for sub instead? It works like find/replace:
df$standard = sub(pattern = "depression_", replacement = "", df$b)
df
# a b standard
# 1 1 depression_male male
# 2 2 depression_female female
# 3 3 depression_hsgrad hsgrad
# 4 4 depression_collgrad collgrad
It's hard to generalize what will be best in your case without more example input/output pairs. If all your data is of the form "depression_" this will work well. Or maybe the standard name is always after an underscore, so you could use pattern = ".*_" to replace everything before the last underscore. Or maybe something else... Hopefully these ideas give you a good start.
Another question for me as a beginner. Consider this example here:
n = c(2, 3, 5)
s = c("ABBA", "ABA", "STING")
b = c(TRUE, "STING", "STRING")
df = data.frame(n,s,b)
n s b
1 2 ABBA TRUE
2 3 ABA STING
3 5 STING STRING
How can I search within this dataframe for similar strings, i.e. ABBA and ABA as well as STING and STRING and make them the same (doesn't matter whether ABBA or ABA, either fine) that would not require me knowing any variations? My actual data.frame is very big so that it would not be possible to know all the different variations.
I would want something like this returned:
> n = c(2, 3, 5)
> s = c("ABBA", "ABBA", "STING")
> b = c(TRUE, "STING", "STING")
> df = data.frame(n,s,b)
> print(df)
n s b
1 2 ABBA TRUE
2 3 ABBA STING
3 5 STING STING
I have looked around for agrep, or stringdist, but those refer to two data.frames or are able to name the column which I can't since I have many of those.
Anyone an idea? Many thanks!
Best regards,
Steffi
This worked for me but there might be a better solution
The idea is to use a recursive function, special, that uses agrepl, which is the logical version of approximate grep, https://www.rdocumentation.org/packages/base/versions/3.4.1/topics/agrep. Note that you can specify the 'error tolerance' to group similar strings with agrep. Using agrepl, I split off rows with similar strings into x, mutate the s column to the first-occurring string, and then add a grouping variable grp. The remaining rows that were not included in the ith group are stored in y and recursively passed through the function until y is empty.
You need the dplyr package, install.packages("dplyr")
library(dplyr)
desired <- NULL
grp <- 1
special <- function(x, y, grp) {
if (nrow(y) < 1) { # if y is empty return data
return(x)
} else {
similar <- agrepl(y$s[1], y$s) # find similar occurring strings
x <- rbind(x, y[similar,] %>% mutate(s=head(s,1)) %>% mutate(grp=grp))
y <- setdiff(y, y[similar,])
special(x, y, grp+1)
}
}
desired <- special(desired,df,grp)
To change the stringency of string similarity, change max.distance like agrepl(x,y,max.distance=0.5)
Output
n s b grp
1 2 ABBA TRUE 1
2 3 ABBA STING 1
3 5 STING STRING 2
To remove the grouping variable
withoutgrp <- desired %>% select(-grp)
I want to update a dataframe with values from a table of new values where there is a one-to-many relationship between the dataframe and table of new values. This code illustrates the intent:
df = data.frame(x=rep(letters[1:4],5,rep=T), y=1:20)
and new values..
eds = data.frame(x=c('c','d'), val=c(101, 102))
For a one-to-one relationship the following should work:
df$x[match(eds$x, df$x)] = eds$x[match(df$x, eds$x)]
But match only works with first match, so this throws the error number of items to replace is not a multiple of replacement length. Grateful for any tips on the most efficient way to approach this. I'm guessing some sapply wrapper but I can't think of the method.
Thanks in advance.
tmp <- eds$val[match(df$x, eds$x)] # Matching indices (with NAs for no match)
df$y <- ifelse(is.na(tmp), df$y, tmp) # Values at matches (leaving alone for NAs)
head(df, 5)
# x y
# 1 a 1
# 2 b 2
# 3 c 101
# 4 d 102
# 5 a 5
Not that this not a very robust solution. It depends on your exact data structure here (repeating 'c', 'd' pattern) but it works for this case:
df[df[["x"]] %in% eds[["x"]], "y"] = eds[[2]]
I have a list of stocks in an index sorted by date, and I'm trying to remove all rows in which the previous row has the same stock code. This will give a dataframe of the initial index and all dates that there was a change to the index
In my working example, I'll use names instead of the date column, and some numbers.
At first, I thought I could remove the rows by using subset() and !duplicated
name <- c("Joe","Mary","Sue","Frank","Carol","Bob","Kate","Jay")
num <- c(1,2,2,1,2,2,2,3)
num2 <- c(1,1,1,1,1,1,1,1)
df <- data.frame(name,num,num2)
dfnew <- subset(df, !duplicated(df[,2]))
However, this might not work in the case where a stock is removed from the list and then later replaced. So, in my working example, the desired output are the rows of Joe, Mary, Frank, Carol and Jay.
Next I created a function to tell if the index changes. The input of the function is row number:
#------ function to tell if there is a change in the row subset-----#
df2 <- as.matrix(df)
ChangeDay <- function(x){
Current <- df2[x,2:3]
Prev <- df2[x-1,2:3]
if (length(Current) != length(Prev))
NewList <- true
else
NewList <- length(which(Current==Prev))!=length(Current)
return(NewList)
}
Finally, I attempt to create a loop to remove the desired rows. I'm new to programming, and I struggle with loops. I'm not sure what the best way is to pre-allocate memory when the dimensions of my final output is unknown. All the books I've looked at only give trivial loop examples. Here is my latest attempt:
result <- matrix(data=NA,nrow=nrow(df2),ncol=3) #pre allocate memory
tmp <- as.numeric(df2) #store the original data
changes <- 1
for (i in 2:nrow(df2)){ #always keep row 1, thus the loop starts at row 2
if(ChangeDay(i)==TRUE){
result[i,] <-tmp[i] #store the row in result if ChangeDay(i)==TRUE
changes <- changes + 1 #increment counter
}
}
result <- result[1:changes,]
Thansk for your help, and any additional general advice on loops is appreciated!
It is not clear what you want to do. But I guess :
df[c(1,diff(df$num)) !=0,]
name num num2
1 Joe 1 1
2 Mary 2 1
4 Frank 1 1
5 Carol 2 1
8 Jay 3 1
In my code, I am filling the columns of a dataframe with vectors, as so:
df1[columnNum] <- barWidth
This works fine, except for one thing: I want the name of the vector variable (barWidth above) to be retained as the column header, one column at a time. Furthermore, I do not wish to use cbind. This slows the execution of my code down considerably. Consequently, I am using a pre-allocated dataframe.
Can this be done in the vector-to-column assignment? If not, then how do I change it after the fact? I can't find the right syntax to do this with colNames().
TIA
It's being done by the [<-.data.frame function. It could conceivably be replaced by one that looked at the name of the argument but it's such a fundamental function I would be hesitant. Furthermore there appears to be an aversion to that practice signaled by this code at the top of the function definition:
> `[<-.data.frame`
function (x, i, j, value)
{
if (!all(names(sys.call()) %in% c("", "value")))
warning("named arguments are discouraged")
nA <- nargs()
if (nA == 4L) {
<snipped rest of rather long definition>
I don't know why that is there, but it is. Maybe you should either be thinking about using names<- after the column assignment, or using this method:
> dfrm["barWidth"] <- barWidth
> dfrm
a V2 barWidth
1 a 1 1
2 b 2 2
3 c 3 3
4 d 4 4
This can be generalized to a list of new columns:
dfrm <- data.frame(a=letters[1:4])
barWidth <- 1:4
newcols <- list(barWidth=barWidth, bw2 =barWidth)
dfrm[names(newcol)] <- newcol
dfrm
#
a barWidth bw2
1 a 1 1
2 b 2 2
3 c 3 3
4 d 4 4
If you have the list of names of vectors you want to apply you could do:
namevec <- c(...,"barWidth"...,)
columnNums <- c(...,10,...)
df1[columnNums[i]] <- get(namevec[i])
names(df1)[columnNums[i]] <- namevec[i]
or even
columnNums <- c(barWidth=4,...)
for (i in seq_along(columnNums)) {
df1[columnNums[i]] <- get(names(columnNums)[i])
}
names(df1)[columnNums] <- names(columnNums)
but the deeper question would be where this set of vectors is coming from in the first place: could you have them in a list all along?
I'd simply use cbind():
df1 <- cbind( df1, barWidth )
which retains the name. It will, however, end up as the last column in df1