replacing blank not NA - r

I have two variables a and b
a b
vessel hot
parts
nest NA
best true
neat smooth
I want to replace blank in b with a
la$b[i1] <- ifelse(la$b[i1] == "",la$a[i1],la$b[i1])
But it is not working

We can use data.table. Convert the 'data.frame' to a 'data.table' (setDT(df1)), specify the condition in 'i' (b == ''), and assign the values of 'a' that correspond to the TRUE values in 'i' to 'b'. It should be fast because the assignment happens in place.
library(data.table)
setDT(df1)[b=='', b:= a]
df1
# a b
#1: vessel hot
#2: parts parts
#3: nest NA
#4: best true
#5: neat smooth
Or we can just use base R
i1 <- df1$b=='' & !is.na(df1$b)
df1$b[i1] <- df1$a[i1]
data
df1 <- structure(list(a = c("vessel", "parts", "nest", "best", "neat"
), b = c("hot", "", NA, "true", "smooth")), .Names = c("a", "b"
), class = "data.frame", row.names = c(NA, -5L))
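As a further base R variant, here is a minimal sketch that uses %in% instead of ==. Because %in% returns FALSE (never NA) for missing values, no separate !is.na() check is needed. The name blank is only illustrative.

```r
df1 <- data.frame(
  a = c("vessel", "parts", "nest", "best", "neat"),
  b = c("hot", "", NA, "true", "smooth"),
  stringsAsFactors = FALSE
)

# %in% yields FALSE for NA, so the index contains no NA
blank <- df1$b %in% ""
df1$b[blank] <- df1$a[blank]

df1$b
# "hot" "parts" NA "true" "smooth"
```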

instead of
# la$b[i1] <- ifelse(la$b[i1] == "",la$a[i1],la$b[i1])
# what is i1? it doesn't seem to have any obvious function here
... it should be:
la$b <- ifelse(la$b == "", la$a, la$b)
assuming that you want to replace blank in b with a and that applies to all blanks
it works:
df <- structure(list(a = c("vessel", "parts", "nest", "best", "neat"
), b = c("hot", "", NA, "true", "smooth")), .Names = c("a",
"b"), row.names = c(NA, -5L), class = "data.frame")
df$b <- ifelse(df$b=="", df$a, df$b)
# or, with `with`: df$b <- with(df, ifelse(b=="",a,b))
# > df
# a b
# 1 vessel hot
# 2 parts parts
# 3 nest <NA>
# 4 best true
# 5 neat smooth
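The NA in b survives because the comparison NA == "" is itself NA, and ifelse() returns NA wherever its test is NA. A small sketch with made-up vectors a and b shows this:

```r
a <- c("vessel", "parts", "nest")
b <- c("hot", "", NA)

# The test is FALSE, TRUE, NA: only the blank is replaced, the NA stays NA
b == ""
filled <- ifelse(b == "", a, b)
filled
# "hot" "parts" NA
```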

Related

Grabs rows where second column is equal to value

I have a dataset which looks something like this:
print(animals_in_zoo)
// I only know the name of the first column, the second one is dynamic/based on a previously calculated variable
animals | dynamic_column_name
// What the data looks like
elefant x
turtle
monkey
giraffe x
swan
tiger x
What I want is to collect the rows in which the second column's value is equal to "x".
What I want to do is something like:
SELECT * FROM data WHERE col2 = 'x';
After that, I want to grab only the first column and create a string object like "elefant giraffe tiger", but that is the easy part.
You can reference that column by its index and use that to get the animals you want:
df1 <- structure(list(animal = c("elefant", "turtle", "monkey", "giraffe",
"swan", "tiger"), dynamic_column = c("x", NA, NA, "x", NA, "x"
)), row.names = c(NA, -6L), class = "data.frame")
df1[, 1][df1[, 2] == "x" & !is.na(df1[, 2])]
#> [1] "elefant" "giraffe" "tiger"
We could use filter with grepl which searches for a pattern 'x' in the string:
# the data frame
df <- read.table(header = TRUE, text =
'my_col
"elefant x"
turtle
monkey
"giraffe x"
swan
"tiger x"'
)
library(dplyr)
df %>%
filter(grepl('x', my_col))
my_col
1 elefant x
2 giraffe x
3 tiger x
Use [: the first argument refers to the rows. You want the rows where the second column is "x". The second argument is the column you need in the end, and you want the column named "animals":
dat[dat[2] == "x", "animals"]
#[1] "elefant" "giraffe" "tiger"
data
dat <- structure(list(animals = c("elefant", "turtle", "monkey", "giraffe",
"swan", "tiger"), V2 = c("x", "", "", "x", "", "x")), row.names = c(NA,
-6L), class = "data.frame")
# animals V2
# 1 elefant x
# 2 turtle
# 3 monkey
# 4 giraffe x
# 5 swan
# 6 tiger x
I guess you have a dataframe?
If so, something like df[df$col2 == 'x',] should work.
With base functions, you can do it like this:
# Option 1
your_dataframe[your_dataframe$col2 == "x", ]
# Option 2
your_dataframe[your_dataframe[,2] == "x", ]
With dplyr functions, you can do it like this:
library(dplyr)
your_dataframe %>%
filter(col2 == "x")
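Since the question says the second column's name is only known at run time, a sketch that looks the name up first and then builds the final string may help. The column name flag_2024 and the variable names are placeholders, and the non-matching cells are assumed to be NA.

```r
animals_in_zoo <- data.frame(
  animals = c("elefant", "turtle", "monkey", "giraffe", "swan", "tiger"),
  flag_2024 = c("x", NA, NA, "x", NA, "x"),  # hypothetical dynamic column
  stringsAsFactors = FALSE
)

# Fetch the dynamic column by position, then use [[ ]] with its name
second_col <- names(animals_in_zoo)[2]
keep <- animals_in_zoo[[second_col]] %in% "x"  # %in% treats NA as FALSE

animal_string <- paste(animals_in_zoo$animals[keep], collapse = " ")
animal_string
# "elefant giraffe tiger"
```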

more dynamic melting with data.table

I am looking for the most efficient form to transform
ARTNR FILGRP
1 1 9827
2 2 9348
3 3 9335, 9827, 9339
into this
ARTNR FILGRP
1 1 9827
2 2 9348
3 3 9335
4 3 9827
5 3 9339
I tried the following code and it works, but it is not elegant and has some shortcomings:
setDT(artnrs)
artnrs[, c("P1", "P2", "P3") := tstrsplit(FILGRP, ",", fixed=TRUE)] # 1)
artnrs <- melt(artnrs, c("ARTNR"), measure = patterns("^P")) # 2)
artnrs[,variable:=NULL] # 3)
artnrs <- na.omit(artnrs, cols="value") # 4)
names(artnrs)[2] <- "FILGRP" # 5)
ad 1) splits the last column in three new ones. How can I make this dynamic and make it fit for five or ten?
ad 2-5) rather clumsy operations, could I chain this better?
It is based on data.table but performance is not that critical so an easy to understand tidyverse solution would be ok. But the fewer packages, the better.
Thanks!
dput output:
structure(list(ARTNR = c(1, 2, 3), FILGRP = c("9827", "9348", "9335, 9827, 9339")),
row.names = c(NA, -3L), class = "data.frame")
df <- structure(list(ARTNR = c(1, 2, 3), FILGRP = c("9827", "9348", "9335, 9827, 9339")),
row.names = c(NA, -3L), class = "data.frame")
# split on ", " so the pieces carry no leading space
df2 <- strsplit(df$FILGRP, split = ", ", fixed = TRUE)
df2 <- data.frame(ARTNR = rep(df$ARTNR, lengths(df2)), FILGRP = unlist(df2))
Here is a data.table approach:
library( data.table )
setDT(DT)
melt( DT[, paste0( "v", 1:length(tstrsplit( DT$FILGRP, ", ") ) ) := tstrsplit( FILGRP, ", ") ],
id.vars = "ARTNR",
measure.vars = patterns( "^v" ),
value.name = "FILGRP" )[!is.na(FILGRP), .SD, .SDcols = c(1,3) ]
# ARTNR FILGRP
# 1: 1 9827
# 2: 2 9348
# 3: 3 9335
# 4: 3 9827
# 5: 3 9339
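A possibly simpler data.table variant skips the helper columns and melt altogether by splitting inside a grouped j expression. This is only a sketch under the assumption that the separator is always ", "; the name long is illustrative. It handles any number of values per row.

```r
library(data.table)

artnrs <- data.table(
  ARTNR = c(1, 2, 3),
  FILGRP = c("9827", "9348", "9335, 9827, 9339")
)

# For each ARTNR, split the string and return one row per piece
long <- artnrs[, .(FILGRP = unlist(strsplit(FILGRP, ", ", fixed = TRUE))), by = ARTNR]
long
```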

Loop over strings in r

I'd like to know what is wrong with my code rather than a solution. I wish to loop over some strings. My data is as follows:
id source transaction
1 a > b 6 > 0
2 J > k 5
3 b > c 4 > 0
I have a list and wish to go over it, find the rows that contain each element, and compute the average.
mylist <- c ("a", "b")
So my desired output for the elements in the list is
source avg
a 6
b 2
I do not know how to loop over the list and send the results to a csv file. I tried this
mylist <- c( "a", "b" )
for(i in mylist)
{
KeepData <- df [grepl(i, df$source), ]
KeepData <- cSplit(KeepData, "transaction", ">", "long")
avg<- mean(KeepData$transactions)
result <- list(i,avg )
write.table(result ,file="C:/Users.csv", append=TRUE,sep=",",col.names=FALSE,row.names=FALSE)
}
But it gives me an "NA" result with the following warning
Warning messages: 1: In mean.default(KeepData$transactions) :
argument is not numeric or logical: returning NA 2: In
mean.default(KeepData$transactions) : argument is not numeric or
logical: returning NA
We can use cSplit to split the 'source' and convert the dataset to 'long' format, then specify the 'i', grouped by 'source', get the mean of 'transaction' (using data.table methods)
library(splitstackshape)
cSplit(df1, "source", " > ", "long")[source %in% mylist, .(avg = mean(transaction)), source]
# source avg
#1: a 6
#2: b 5
Or another option is separate_rows from tidyr to convert to 'long' format, then use the dplyr methods to summarise after grouping by 'source'
library(tidyr)
library(dplyr)
separate_rows(df1, source) %>%
filter(source %in% mylist) %>%
group_by(source) %>%
summarise(avg = mean(transaction))
Update
For the new dataset ('df2'), we need to split both the columns to 'long' format, and then get the mean of 'transaction' grouped by 'source'
cSplit(df2, 2:3, " > ", "long")[source %in% mylist, .(avg = mean(transaction)), source]
# source avg
#1: a 6
#2: b 2
The for loop can be modified to
for(i in mylist) {
KeepData <- cSplit(df2, 2:3, ">", "long")
KeepData <- KeepData[grepl(i, source)]
avg<- mean(KeepData$transaction)
result <- list(i,avg )
print(result)
write.table(result ,file="C:/Users.csv",
append=TRUE,sep=",",col.names=FALSE,row.names=FALSE)
}
#[[1]]
#[1] "a"
#[[2]]
#[1] 6
#[[1]]
#[1] "b"
#[[2]]
#[1] 2
data
df1 <- structure(list(id = 1:3, source = c("a > b", "J > k", "b > c"
), transaction = c(6L, 5L, 4L)), .Names = c("id", "source", "transaction"
), class = "data.frame", row.names = c(NA, -3L))
df2 <- structure(list(id = 1:3, source = c("a > b", "J > k", "b > c"
), transaction = c("6 > 0", "5", "4 > 0")), .Names = c("id",
"source", "transaction"), class = "data.frame", row.names = c(NA,
-3L))
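On the original "what is wrong" part of the question, one likely cause of the warning is that the loop reads KeepData$transactions while the column is named transaction, so mean() receives NULL. A second issue is that values such as "6 > 0" are strings until they are split and converted. The base R sketch below illustrates both points on df2; the name values is illustrative.

```r
df2 <- data.frame(
  id = 1:3,
  source = c("a > b", "J > k", "b > c"),
  transaction = c("6 > 0", "5", "4 > 0"),
  stringsAsFactors = FALSE
)

# `transactions` (trailing "s") is not a column, so `$` returns NULL
is.null(df2$transactions)                  # TRUE
suppressWarnings(mean(df2$transactions))   # NA, as in the question

# With the right name the values are still character strings
is.character(df2$transaction)              # TRUE
values <- as.numeric(unlist(strsplit(df2$transaction, " > ", fixed = TRUE)))
values
# 6 0 5 4 0
```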

How to match values in one column and parse the matching value from different columns in R

I am really stuck on this problem in R. I need to find the rows where column C contains allgood (for example with grepl) and, for each of them, copy the values of column A and column B from the row with the nearest preceding allbad in column C.
A B C
apple ball allbad-cat
allgood-car
dog bark allbad-pet
bull
dull
allgood-pet
result
A B C
apple ball allbad-cat
apple ball allgood-car
dog bark allbad-pet
bull
dull
dog bark allgood-pet
# find the index of column "C" starts with "allgood"
good.idx <- which(grepl("^allgood", df$C))
# find the index of column "C" starts with "allbad"
bad.idx <- which(grepl("^allbad", df$C))
# for each "good" index, find the maximum "bad" index smaller than the "good" index
good.bad.near <- sapply(good.idx, function(x){
return(max(bad.idx[bad.idx<x]))
})
df$A[good.idx] <- df$A[good.bad.near]
df$B[good.idx] <- df$B[good.bad.near]
Works for your data.
If you have more columns to substitute, you can use the for loop.
for (i in 1:2) {
df[, i][good.idx] <- df[, i][good.bad.near]
}
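The same lookup can also be sketched without sapply by using findInterval(), which for each "good" index counts how many "bad" indices lie at or before it. This assumes every allgood row has at least one allbad row above it; the data frame below simply restates the question's data.

```r
df <- data.frame(
  A = c("apple", "", "dog", "", "", ""),
  B = c("ball", "", "bark", "", "", ""),
  C = c("allbad-cat", "allgood-car", "allbad-pet", "bull", "dull", "allgood-pet"),
  stringsAsFactors = FALSE
)

good.idx <- grep("^allgood", df$C)
bad.idx <- grep("^allbad", df$C)

# Position of the last "bad" row before each "good" row
good.bad.near <- bad.idx[findInterval(good.idx, bad.idx)]

# Copy A and B in one step
df[good.idx, c("A", "B")] <- df[good.bad.near, c("A", "B")]
df
```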
We can try
library(zoo)
i1 <- !grepl('^(allbad|allgood)', df1$C)
df1[1:2] <- lapply(df1[1:2], function(x) ifelse(i1, '',
na.locf(replace(x, x=='', NA))))
df1
# A B C
#1 apple ball allbad-cat
#2 apple ball allgood-car
#3 dog bark allbad-pet
#4 bull
#5 dull
#6 dog bark allgood-pet
Or using data.table
library(data.table)
setDT(df1)[A=='', c('A', 'B') := NA][,
lapply(.SD, na.locf)][i1, c('A', 'B') := ''][]
data
df1 <- structure(list(A = c("apple", "", "dog", "", "", ""),
B = c("ball",
"", "bark", "", "", ""), C = c("allbad-cat", "allgood-car",
"allbad-pet",
"bull", "dull", "allgood-pet")), .Names = c("A", "B", "C"),
class = "data.frame", row.names = c(NA, -6L))

Converting given list into dataframe

I have the following list:
$id1
$id1[[1]]
A B
"A" "B"
$id1[[2]]
A B
"A" "A1"
$id2
$id2[[1]]
A B
"A2" "B2"
In R-pastable form:
dat = structure(list(SampleTable = structure(list(id2 = list(structure(c("90", "7"), .Names = c("T", "G")), structure(c("90", "8"), .Names = c("T", "G"))), id1 = structure(c("1", "1"), .Names = c("T", "G"))), .Names = c("id2", "id1"))), .Names = "SampleTable")
I want this given list to be converted into following dataframe:
id1 A B
id1 A A1
id2 A2 B2
Your data structure (apparently a named list of unnamed lists of 1-row data.frames) is a bit complicated: the easiest may be to use a loop to build the data.frame.
It can be done directly with do.call, lapply and rbind, but it is not very readable, even if you are familiar with those functions.
# Sample data
d <- list(
id1 = list(
data.frame( x=1, y=1 ),
data.frame( x=2, y=2 )
),
id2 = list(
data.frame( x=3, y=3 ),
data.frame( x=4, y=4 )
),
id3 = list(
data.frame( x=5, y=5 ),
data.frame( x=6, y=6 )
)
)
# Convert
d <- data.frame(
id=rep(names(d), unlist(lapply(d,length))),
do.call( rbind, lapply(d, function(u) do.call(rbind, u)) )
)
Another solution, using a loop, for a ragged data structure containing vectors (not data.frames), as explained in the comments:
d <- structure(list(SampleTable = structure(list(id2 = list(structure(c("90", "7"), .Names = c("T", "G")), structure(c("90", "8"), .Names = c("T", "G"))), id1 = structure(c("1", "1"), .Names = c("T", "G"))), .Names = c("id2", "id1"))), .Names = "SampleTable")
result <- list()
for(i in seq_along(d$SampleTable)) {
id <- names(d$SampleTable)[i]
block <- d$SampleTable[[i]]
if(is.atomic(block)) {
block <- list(block)
}
for(row in block) {
result <- c(result, list(data.frame(id, as.data.frame(t(row)))))
}
}
result <- do.call(rbind, result)
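The loop above can also be sketched in a vectorised form: wrap any bare vector in a list so every id holds a list of rows, then bind all rows at once. This is only an illustration on the same ragged structure; the names sample_table, blocks, ids and mat are made up here.

```r
sample_table <- list(
  id2 = list(c(T = "90", G = "7"), c(T = "90", G = "8")),
  id1 = c(T = "1", G = "1")
)

# Make every entry a list of named vectors
blocks <- lapply(sample_table, function(b) if (is.atomic(b)) list(b) else b)

# One id per row, repeated by the number of rows in each block
ids <- rep(names(blocks), lengths(blocks))

# Flatten one level and stack the named vectors into a matrix
mat <- do.call(rbind, unlist(unname(blocks), recursive = FALSE))

result <- data.frame(id = ids, mat, stringsAsFactors = FALSE)
result
```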
NOTE! I could not get melt and cast working on this kind of ragged data (I tried for over an hour...). I am going to leave this answer here to show that, for this kind of operation, the reshape package could also be used.
Using the example data of vincent, you can use melt and cast from the reshape package:
library(reshape)
res = cast(melt(d))[-1]
names(res) = c("id","x","y")
res
id x y
1 id1 1 1
2 id2 3 3
3 id3 5 5
4 id1 2 2
5 id2 4 4
6 id3 6 6
The order in the resulting data.frame is not the same, but the result is identical. And the code is a bit shorter. I use the [-1] to delete the first column which is also returned by melt. This additional variable is the column index of the individual data.frames in the list of lists. Just have a look at the result of melt(d), that will hopefully make it more clear.
This is a bit messier than you let on. That dat object has an extra "layer" above it, so it is easier to work with dat[[1]]:
dfrm <- data.frame(dat[[1]], stringsAsFactors=FALSE)
names(dfrm) <- sub("\\..+$", "", names(dfrm))
> dfrm
id2 id2 id1
T 90 90 1
G 7 8 1
> t(dfrm)
T G
id2 "90" "7"
id2 "90" "8"
id1 "1" "1"

Resources