I hope someone can help me with this one; so far I have only found a lousy solution on my own. I would like to paste the labels of four columns (A to D) into a fifth column (dream), but conditionally: a column's label should be included only if its numeric value is 2.
Here is my data frame df:
id= c(1:12)
A = c(2,NA,NA,2,NA,1,1,1,1,1,NA,2)
B = c(2,1,1,1,2,NA,1,1,1,1,2,1)
C = c(2,1,1,1,2,NA,1,1,1,1,NA,1)
D = c(2,1,1,1,1,1,2,1,1,NA,2,1)
df = data.frame(id,A,B,C,D) ; df
df$A=factor(df$A, labels=c("no", "i saw"))
df$B=factor(df$B, labels=c("no", "someone"))
df$C=factor(df$C, labels=c("no", "sitting"))
df$D=factor(df$D, labels=c("no", "on a cloud")) ; df
Below is the solution I found, but it is not very satisfying:
df$dream = ifelse(as.numeric(df$A)!=2, NA, as.character(df$A)) ; df
df$dream = ifelse(as.numeric(df$B)!=2, df$dream, paste(df$dream, as.character(df$B))) ; df
df$dream = ifelse(as.numeric(df$C)!=2, df$dream, paste(df$dream, as.character(df$C))) ; df
df$dream = ifelse(as.numeric(df$D)!=2, df$dream, paste(df$dream, as.character(df$D))) ; df
I am sure there is a more straightforward way to do this; besides, my code doesn't even seem to work this way. Could someone help me? Thanks
This solution will work, but you have to declare the vector of labels you want to paste from the factors.
# init empty result vector
dream <- character(nrow(df))
# values from each column (A-D) you want to paste
values <- c("i saw","someone","sitting", "on a cloud")
# iterate over each row
for(i in seq_len(nrow(df))){
#paste values from each row
dream[i] <- paste(values[which(as.numeric(df[i,-1]) == 2)], collapse=" ")
}
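The loop above can also be written without iterating over row indices by hand. The following is only a sketch, assuming the columns are still numeric (before the factor() calls); the names labels and hit are made up for the illustration. It also shows that rows containing NA need explicit handling, which is what trips up the ifelse() chain in the question.

```r
# Data from the question, kept numeric
id <- 1:12
A <- c(2, NA, NA, 2, NA, 1, 1, 1, 1, 1, NA, 2)
B <- c(2, 1, 1, 1, 2, NA, 1, 1, 1, 1, 2, 1)
C <- c(2, 1, 1, 1, 2, NA, 1, 1, 1, 1, NA, 1)
D <- c(2, 1, 1, 1, 1, 1, 2, 1, 1, NA, 2, 1)
df <- data.frame(id, A, B, C, D)

# Label to paste for each column when its value is 2 (hypothetical helper)
labels <- c(A = "i saw", B = "someone", C = "sitting", D = "on a cloud")

# Logical matrix: TRUE where the value is 2, FALSE for 1 and for NA
hit <- !is.na(df[names(labels)]) & df[names(labels)] == 2

# For each row, paste the labels of the columns that are TRUE
df$dream <- apply(hit, 1, function(r) paste(labels[r], collapse = " "))
df
```

Rows with no value equal to 2 end up as an empty string here; replace "" with NA afterwards if that suits the data better.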
I think it would be easier if you converted your data.frame into a data.table.
For column B (while it is still numeric, i.e. before the factor() calls) you can use
library(data.table)
dt <- as.data.table(df)
dt[,dream:=ifelse(B==2,"someone",ifelse(B==1,"no",NA))]
And then replicate the same for the other three columns. I hope this helps.
UPDATE
Or maybe you could try this, using mapvalues() from the plyr package:
dt$dream.A <- mapvalues(dt$A,c(1,2),c("no", "i saw"))
I'm a newbie in R programming. I have a requirement in mind and am trying to work it out with a for loop. I have a data frame with 14 variables, which has empty values in some rows and columns. My requirement is to list the number of empty values in each variable (column).
My code below to achieve it:
for (x in names(df)){
cat(paste("No of rows with empty value for", x, " variable:",
nrow(df[df$x == '', ])))
}
nrow(df[df$x=='',])
In the nrow command above, the value of x is not substituted into df$x == ''.
I need some expert help to fix it.
Thanks in advance,
Regards,
Vin
You can use sapply though to make your code cleaner.
sapply(df, FUN=function(x) sum(x == ''))
I slightly altered your for loop, and added a line break in the end. It is easier if you sum over the booleans created than counting the rows.
##Create some fake data
df <- data.frame(
first_var = c(rep("",10),1:10),
second_var = c(rep("",9), 1:11),
third_var = c(rep("", 8), 1:12),
fourth_Var = c(rep("", 7), 1:13)
)
for(i in names(df)){
cat(paste0("No of rows with empty value for ",i, " variable:",sum(df[,i] == ""),"\n"))
}
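The reason the original loop fails is that df$x looks for a column literally named "x"; it never evaluates the loop variable. A minimal sketch of the fix, using df[[x]] instead, is below. The small data frame is made up for the example, and na.rm = TRUE is an assumption about how missing values should be treated (without it, a column containing NA would give a count of NA).

```r
# Made-up data: empty strings plus one NA
df <- data.frame(
  first_var  = c("", "a", NA, ""),
  second_var = c("b", "", "c", "d"),
  stringsAsFactors = FALSE
)

# df[[x]] uses the value of x as the column name, unlike df$x
for (x in names(df)) {
  n_empty <- sum(df[[x]] == "", na.rm = TRUE)
  cat("No of rows with empty value for", x, "variable:", n_empty, "\n")
}

# The same counts in one call, as a named vector
empty_counts <- colSums(df == "", na.rm = TRUE)
empty_counts
```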
I've got some poorly structured data I am trying to clean. I have a list of keywords I can use to extract data frames from a CSV file. My raw data is structured roughly as follows:
There are 7 columns with values, the first columns are all string identifiers, like a credit rating or a country symbol (for FX data), while the other 6 columns are either a header like a percentage change string (e.g. +10%) or just a numerical value. Since I have all this data lumped together, I want to be able to extract data for each category. So for instance, I'd like to extract all the rows between my "credit" keyword and my "FX" keyword in my first column. Is there a way to do this in either base R or dplyr easily?
eg.
df %>%
filter(column1 = in_between("credit", "FX"))
Sample dataframe:
row 1: c('random', '-1%', '0%', '1%', '2%')
row 2: c('credit', NA, NA, NA, NA)
row 3: c('AAA', 1,2,3,4)
...
row n: c('FX', '-1%', '0%', '1%', '2%')
And I would want the following output:
row 1: c('credit', '-1%', '0%', '1%', '2%')
row 2: c('AAA', 1,2,3,4)
...
row n-1: ...
If I understand correctly you could do something like
start <- which(df$column1 == "credit")
end <- which(df$column1 == "FX")
df[start:(end-1), ]
Of course this won't work if "credit" or "FX" is in the column more than once.
Using what Brian suggested:
in_between <- function(df, start, end){
return(df[start:(end-1),])
}
Then loop over the indices in
dividers = which(df$column1 %in% keywords)
And save the function outputs however one would like.
lapply(seq_len(length(dividers)-1), function(x) in_between(df, start = dividers[x], end = dividers[x+1]))
This works. Messy data so I still have the annoying case where I need to keep the offset rows.
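The divider idea can also be expressed in base R without an explicit loop over index pairs. This is a sketch on made-up data (the column value and the rows are invented for illustration): each keyword row starts a new group, and split() cuts the data frame at those points. Unlike the lapply() version, the last section (everything after the final keyword) is kept as well.

```r
# Made-up data in the rough shape described in the question
df <- data.frame(
  column1 = c("random", "credit", "AAA", "BBB", "FX", "EUR", "USD"),
  value   = c(NA, NA, 1, 2, NA, 3, 4),
  stringsAsFactors = FALSE
)
keywords <- c("credit", "FX")

is_key <- df$column1 %in% keywords   # TRUE on each divider row
group  <- cumsum(is_key)             # 0 before the first keyword, then 1, 2, ...
keep   <- group > 0                  # drop rows before the first keyword

# One data frame per section, named after the keyword that opens it
sections <- split(df[keep, ], group[keep])
names(sections) <- df$column1[is_key]
sections$credit
```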
I'm still not 100% sure what you are trying to accomplish but does this do what you need it to?
library(dplyr)
set.seed(1)
df <- data.frame(
x = sample(LETTERS[1:10]),
y = rnorm(10),
z = runif(10)
)
start <- c("C", "E", "F")
df2 <- df %>%
mutate(start = x %in% start,
group = cumsum(start))
split(df2, df2$group)
I have a list similar to this one:
set.seed(1602)
l <- list(data.frame(subst_name = sample(LETTERS[1:10]), perc = runif(10), crop = rep("type1", 10)),
data.frame(subst_name = sample(LETTERS[1:7]), perc = runif(7), crop = rep("type2", 7)),
data.frame(subst_name = sample(LETTERS[1:4]), perc = runif(4), crop = rep("type3", 4)),
NULL,
data.frame(subst_name = sample(LETTERS[1:9]), perc = runif(9), crop = rep("type5", 9)))
Question: How can I extract the subst_name-column of each data.frame and combine them with cbind() (or similar functions) to a new data.frame without messing up the order of each column? Additionally the columns should be named after the corresponding crop type (this is possible 'cause the crop types are unique for each data.frame)
EDIT: The output should look as follows:
Having read the comments, I'm aware that within R it doesn't make much sense, but for the sake of having a look at the output the data.frame's View option is quite handy.
With the help of this SO question I came up with the following solution. (There's probably room for improvement.)
a <- lapply(l, '[[', 1) # extract the first element of the dfs in the list
a <- Filter(function(x) !is.null(unlist(x)), a) # remove NULLs
a <- lapply(a, as.character)
max.length <- max(sapply(a, length))
## Add NA values to list elements
b <- lapply(a, function(v) { c(v, rep(NA, max.length-length(v)))})
e <- as.data.frame(do.call(cbind, b))
names(e) <- unlist(lapply(lapply(lapply(l, '[[', "crop"), '[[', 2), as.character))
It is not really correct to do this with the given example, because the number of rows is not the same in each of the list's data frames. But if you don't care, you can do:
nullElements = unlist(sapply(l,is.null))
l = l[!nullElements] #delete useless null elements in list
columns=lapply(l,function(x) return(as.character(x$subst_name)))
newDf = as.data.frame(Reduce(cbind,columns))
If you don't want recycled elements in the columns you can do
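for(i in seq_len(ncol(newDf))){
colLength = nrow(l[[i]])
# only pad columns shorter than the longest one; otherwise (colLength+1):nrow(newDf) counts backwards
if(colLength < nrow(newDf)) newDf[(colLength+1):nrow(newDf),i] = NA
}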
newDf = newDf[1:max(unlist(sapply(l,nrow))),] #remove possible extra NA rows
Note that I edited my previous code to remove NULL entries from l to simplify things
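Padding with NA can also be done before the columns are combined, which avoids recycling altogether. The sketch below uses a tiny made-up list in the same shape as l (three short data frames' worth of structure, one NULL entry) and relies on the base replacement function `length<-`, which extends a vector with NA. The names cols, padded and res are only illustrative.

```r
# Small stand-in for the list in the question
l <- list(
  data.frame(subst_name = c("A", "B", "C"), crop = "type1"),
  NULL,
  data.frame(subst_name = c("D", "E"), crop = "type3")
)

l <- Filter(Negate(is.null), l)                           # drop NULL entries
cols <- lapply(l, function(d) as.character(d$subst_name)) # one vector per data frame
n <- max(lengths(cols))                                   # length of the longest column

# `length<-` pads each vector with NA up to length n
padded <- lapply(cols, `length<-`, n)

# Name each column after the (unique) crop type of its data frame
names(padded) <- sapply(l, function(d) as.character(d$crop[1]))
res <- as.data.frame(padded, stringsAsFactors = FALSE)
res
```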
I have a data frame. One column contains the following values:
df$current_column=(A,B,C,D,E)
A vector contains a look up value:
v <- c(A=X, B=Y)
I want to replace this column to come up with a list of (X, Y, C,D,E)
I am thinking to create a new column like
df$new_column <- v[df$current_column]
It does the replacement of A and B but it also makes C,D,E as NA (X,Y, NA, NA, NA).
How to keep C,D and E or is there any other way?
Looks like ifelse() could help (assuming current_column is a character column):
df$current_column <- ifelse( df$current_column == "A", "X",
ifelse( df$current_column == "B", "Y", df$current_column ))
We can create a logical index with %in% and then do the conversion
i1 <- df$current_column %in% names(v)
df$new_column <- df$current_column
df$new_column[i1] <- v[df$new_column[i1]]
df$new_column
#[1] "X" "Y" "C" "D" "E"
Or use a single ifelse
with(df, ifelse(current_column %in% names(v),
v[current_column], current_column))
Update
If the 'current_column' is factor class, convert to character class and it should work.
df$new_column <- as.character(df$current_column)
df$new_column[i1] <- v[df$new_column[i1]]
data
df <- data.frame(current_column = LETTERS[1:5],
stringsAsFactors=FALSE)
v <- setNames(c('X', 'Y'), LETTERS[1:2])
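Another way to look at the same lookup is through match(), which makes the "no entry in v" case explicit. This is just a sketch on plain vectors; the names x, m and new_column are chosen for the example.

```r
x <- c("A", "B", "C", "D", "E")   # values of current_column
v <- c(A = "X", B = "Y")          # lookup vector

# Position of each value in the lookup names, NA when there is no entry
m <- match(x, names(v))

# Keep the original value where the lookup has no entry
new_column <- ifelse(is.na(m), x, v[m])
new_column
```

If current_column is a factor, convert it with as.character() first, as noted in the update above.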
user2029709,
-- was working off of your little example; for a more generic approach it would be nice to see a snippet of the real data or close simulation. In any case, here is something that may work for you better, without coding manually all ifelse() options, and is still a relatively straightforward solution:
vd <- data.frame(current_column = names(v), new_column = v, stringsAsFactors = FALSE)
df <- merge(df, vd, by = 'current_column', all.x = TRUE)
df$new_column <- ifelse(is.na(df$new_column), df$current_column, df$new_column)
You may have to modify data types when creating vd data.frame to assure proper merge.
Best,
oleg
I have two data frames. First one looks like
dat <- data.frame(matrix(nrow=2,ncol=3))
names(dat) <- c("Locus", "Pos", "NVAR")
dat[1,] <- c("ACTC1-001_1", "chr15:35087734..35087734", "1" )
dat[2,] <- c("ACTC1-001_2 ", "chr15:35086890..35086919", "2")
where chr15:35086890..35086919 indicates all the numbers within this range.
The second looks like:
dat2 <- data.frame(matrix(nrow=2,ncol=3))
names(dat2) <- c("VAR","REF.ALT","FUNC")
dat2[1,] <- c("chr1:116242719", "T/A", "intergenic" )
dat2[2,] <- c("chr1:116242855", "A/G", "intergenic")
I want to merge these by the values in dat$Pos and dat2$VAR. If the single number in a cell in dat2$VAR is contained within the range of a cell in dat$Pos, I want to merge those rows. If this occurs more than once (a dat2$VAR value falls in more than one range in dat$Pos), I want it merged each time. What's the easiest way to do this?
Here is a solution, quite short but not particularly efficient so I would not recommend it for large data. However, you seemed to indicate your data was not that large so give it a try and let me know:
library(plyr)
exploded.dat <- adply(dat, 1, function(x){
parts <- strsplit(x$Pos, ":")[[1]]
chr <- parts[1]
range <- strsplit(parts[2], "..", fixed = TRUE)[[1]]
start <- as.numeric(range[1])
end <- as.numeric(range[2])
data.frame(VAR = paste(chr, seq(from = start, to = end), sep = ":"), x)
})
merge(dat2, exploded.dat, by = "VAR")
If it is too slow or uses too much memory for your needs, you'll have to implement something a bit more complex and this other question looks like a good starting point: Merge by Range in R - Applying Loops.
Please try this out and let us know how it works. Without a larger data set it is a bit hard to troubleshoot. If for whatever reason it does not work, please share a few more rows from your data tables (specifically ones that would match).
SPLICE THE DATA
range.strings <- do.call(rbind, strsplit(dat$Pos, ":"))[, 2]
range.strings <- do.call(rbind, strsplit(range.strings, "\\.\\."))
mins <- as.numeric(range.strings[,1])
maxs <- as.numeric(range.strings[,2])
d2.vars <- as.numeric(do.call(rbind, strsplit(dat2$VAR, ":"))[,2])
names(d2.vars) <- seq(d2.vars)
FIND THE MATCHES
# row number is the row in dat
# col number is the row in dat2
# both ends of the range are inclusive, so a single-position range can match
matches <- sapply(d2.vars, function(v) mins <= v & v <= maxs)
MERGE
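# list every (row in dat, row in dat2) pair that matched
idx <- which(matches, arr.ind = TRUE)
# create a column in dat to merge-by, using the VAR in dat2 as the merge id
# (assignments made inside sapply() would not change dat, so build it directly)
dat <- cbind(dat[idx[, "row"], ], VAR = dat2[idx[, "col"], "VAR"])
merge(dat, dat2)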
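Neither answer above compares the chromosome part of the position, so a variant on chr1 could match a range on chr15 if the numbers happened to overlap. The sketch below is one possible way to include it: split both columns into chromosome and numeric position, merge on chromosome, then filter by range. The rows in dat2 are made up so that some of them actually match (the sample rows in the question are on a different chromosome), and the helper column names chr, start, end and pos are assumptions for the example.

```r
dat <- data.frame(
  Locus = c("ACTC1-001_1", "ACTC1-001_2"),
  Pos   = c("chr15:35087734..35087734", "chr15:35086890..35086919"),
  stringsAsFactors = FALSE
)
# Hypothetical variants: two fall inside the ranges above, one is on another chromosome
dat2 <- data.frame(
  VAR     = c("chr15:35086900", "chr15:35087734", "chr1:116242719"),
  REF.ALT = c("T/A", "A/G", "C/T"),
  stringsAsFactors = FALSE
)

# "chr15:35086890..35086919" -> "chr15", "35086890", "35086919"
pos <- do.call(rbind, strsplit(dat$Pos, "[:.]+"))
dat$chr   <- pos[, 1]
dat$start <- as.numeric(pos[, 2])
dat$end   <- as.numeric(pos[, 3])

# "chr15:35086900" -> "chr15", "35086900"
var <- do.call(rbind, strsplit(dat2$VAR, ":", fixed = TRUE))
dat2$chr <- var[, 1]
dat2$pos <- as.numeric(var[, 2])

# All pairs on the same chromosome, then keep those inside the range (inclusive)
both <- merge(dat, dat2, by = "chr")
hits <- both[both$pos >= both$start & both$pos <= both$end, ]
hits
```

Merging on chromosome first still builds every same-chromosome pair, so for large data a dedicated range join would be a better fit.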