Julia: how to remove rows by condition with a search across the entire dataframe

Sometimes there is a need to delete rows by some condition in a large dataframe, where it is impractical to specify all columns by name manually. Such a situation may occur, for example, when "no data" is encoded as some specific numeric value such as -1. It would be useful to be able to delete all the rows where this value occurs in one step, similar to how dropmissing() removes rows with missing. I have not found a convenient way to replace an arbitrary value with missing, or to directly delete all rows where the specified value occurs.
Simple example:
df = DataFrame(A=[2, -1, 3, 3], B=[2, 5, 7, -1], C=3)
I want to delete all rows where -1 values occur, or at least extract all rows where this value does not occur. Something like this:
df_clear = df[df .>= 0, :]
or:
df_clear = df[findall(x -> x>0, df), :]
A similar syntax works for replacing values, but it does not seem to be applicable to deleting rows:
df .= ifelse.(df .< 0, 0, df)
What is the most elegant way to solve this problem? Is there a beautiful solution to perform such a check on a range of columns or on all columns except for the specified one?

You could do it e.g. like this, reusing the ifelse function you already tried (this approach assumes you do not have missing values in your data frame that you want to keep):
julia> dropmissing!(ifelse.(df .== -1, missing, df))
2×3 DataFrame
 Row │ A      B      C
     │ Int64  Int64  Int64
─────┼─────────────────────
   1 │     2      2      3
   2 │     3      7      3
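If you would rather mutate df itself than build a new data frame, a closely related variant (a sketch, again assuming the sentinel is exactly -1 and that there are no pre-existing missing values you want to keep) is to allow missings, replace the sentinel column by column, and then drop:

julia> df2 = allowmissing(df);                               # columns now accept missing

julia> foreach(col -> replace!(col, -1 => missing), eachcol(df2));

julia> dropmissing!(df2)                                     # drops every row that now contains missing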
A more direct approach is to use the subset function, keeping only the rows in which -1 is not present:
julia> subset(df, All() .=> ByRow(!=(-1)))
2×3 DataFrame
 Row │ A      B      C
     │ Int64  Int64  Int64
─────┼─────────────────────
   1 │     2      2      3
   2 │     3      7      3
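Regarding the last part of your question, the same subset pattern also works if you restrict the check to a range of columns or to all columns except a specified one, because names accepts any column selector. A sketch, assuming for illustration that :C is the column you want to leave out:

julia> subset(df, names(df, Not(:C)) .=> ByRow(!=(-1)))          # all columns except :C

julia> subset(df, names(df, Between(:A, :B)) .=> ByRow(!=(-1)))  # only columns :A through :B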
Finally, you could also do the following using the indexing syntax:
julia> df[all.(!=(-1), eachrow(df)), :]
2×3 DataFrame
 Row │ A      B      C
     │ Int64  Int64  Int64
─────┼─────────────────────
   1 │     2      2      3
   2 │     3      7      3
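If you want to drop the offending rows from df in place instead of creating a new data frame, recent versions of DataFrames.jl also let you do something like the following (a sketch built on the same eachrow pattern as above):

julia> deleteat!(df, findall(any.(==(-1), eachrow(df))))     # remove every row that contains -1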

Related

Function to recode multiple variables conditional on other variables

I have a dataset with multiple variables. Each question has the actual survey answer plus three other characteristics, so there are four variables for each question. I want to specify that if Q135_L == 1, Q135_RT is left as it is; otherwise it is coded as NA. I can do that with an ifelse statement:
df$Q135_RT <- ifelse(df$Q135_L == 1, df$Q135_RT, NA)
However, I have hundreds of variables and the names are not related; for example, there are question groups like Q135, SG1_1 and so on. How can I specify, for the whole dataset, that if a variable ends in _L, then the matching variable ending in _RT should remain as it is, and otherwise that _RT variable should be coded as NA?
I tried this but it only returns NAs
ifelse(grepl("//b_L" ==1, df), "//b_RT" , NA)
If I understand your problem correctly, you have a data frame whose columns represent survey question variables. Each column name contains two identifiers: a survey question number (134, 135, etc.) and a variable suffix (_L, _RT, etc.). Because you provide no reproducible example, I made a simplified version of your data frame:
set.seed(5)
DF <- data.frame(array(sample(1:4, 24, replace = TRUE), c(4, 6)))
colnames(DF) <- c("Q134_L", "Q135_L", "Q134_RT", "Q135_RT", "Q_L1", "Q134_S")
DF
#   Q134_L Q135_L Q134_RT Q135_RT Q_L1 Q134_S
# 1      2      3       2       3    1      1
# 2      3      1       3       2    4      4
# 3      1      1       3       2    4      3
# 4      3      1       3       3    2      1
What you want is: if Q135_L == 1, leave Q135_RT as it is; otherwise code it as NA. Here is a function that implements this recoding logic:
recode <- function(yourdf, questnums) {
  for (k in 1:length(questnums)) {
    charnum <- as.character(questnums)
    col_end_L_k <- yourdf[grepl("_L\\b", colnames(yourdf)) &
                            grepl(charnum[k], colnames(yourdf))]
    col_end_R_k <- yourdf[grepl("_RT\\b", colnames(yourdf)) &
                            grepl(charnum[k], colnames(yourdf))]
    row_is_1 <- which(col_end_L_k == 1)
    col_end_R_k[-row_is_1, ] <- NA
    yourdf[, colnames(col_end_R_k)] <- col_end_R_k
  }
  return(yourdf)
}
This function takes a data frame and a vector of question numbers, and returns the recoded data frame.
What this function does:
Loop over each question number with for.
Use grepl to identify any column whose name contains the selected number and ends in _L.
Do the same for columns ending in _RT.
Use which to find the rows of the _L column that contain 1.
In the _RT column with the same question number, keep the values in those rows and set all other rows to NA.
The result:
recode(DF, 134:135)
#   Q134_L Q135_L Q134_RT Q135_RT Q_L1 Q134_S
# 1      2      3      NA      NA    1      1
# 2      3      1      NA       2    4      4
# 3      1      1       3       2    4      3
# 4      3      1      NA       3    2      1
Note that the Q_L1 column is not affected because _L is not at the end of that column name.
As for how to define questnums, the question numbers, you just need to create a numeric vector. Examples:
If your question numbers are 1 to 200, use 1:200 or seq(200), i.e. recode(DF, 1:200).
If your question numbers are 1, 3, 134, 135, use recode(DF, c(1, 3, 134, 135)).
You can also assign the question numbers to an object first, such as n <- c(25, 135, 145), and then use it: recode(DF, n).

Can CategoricalArrays be used with Julia DataFrames to convert multiple columns from string to categories?

I have a fairly large survey dataset (100+ columns and 5000 rows) that has a bunch of string variables.
I can use the following function to convert individual columns one by one,
function fix_df_column(df)
    levels = ["x"]
    Colname = categorical(df[!, :Colname]; levels, ordered = false)
    df[!, :Colname] = Colname
    #df
end
but I would like to be able to iterate across the whole dataframe and convert everything automatically.
The only documentation I can find relates to arrays (https://dataframes.juliadata.org/stable/man/categorical/), and the only examples I can find change single columns, not multiple.
Does anyone know a simpler way to achieve this?
Thanks
Yes, you can do it. Assuming that you want to convert all columns that contain strings (without missing) and that you want automatic assignment of levels, you can do:
transform!(df, names(df, AbstractString) .=> categorical, renamecols=false)
For example:
julia> df = DataFrame(x1=["a", "b"], x2=[1,2], x3=[missing, "x"], x4=["c", "d"])
2×4 DataFrame
 Row │ x1      x2     x3       x4
     │ String  Int64  String?  String
─────┼────────────────────────────────
   1 │ a           1  missing  c
   2 │ b           2  x        d
julia> transform!(df, names(df, AbstractString) .=> categorical, renamecols=false)
2×4 DataFrame
 Row │ x1    x2     x3       x4
     │ Cat…  Int64  String?  Cat…
─────┼────────────────────────────
   1 │ a         1  missing  c
   2 │ b         2  x        d
As you can see, only :x1 and :x4 were converted; :x3 is skipped because its element type allows missing, so it is not selected by AbstractString.
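If you also want to cover string columns that allow missing values (like :x3 above), you can widen the column selector; a sketch, assuming you simply want categorical to keep the missing entries as they are:

julia> transform!(df, names(df, Union{Missing, AbstractString}) .=> categorical, renamecols=false)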

data.frame with 2 columns with the same name: how to select the second one?

d1 <- data.frame(a=c(1,2,3))
d2 <- data.frame(a=c(3,4,5))
d3 <- cbind(d1,d2)
doesn't return an error, and an inspection of the environment in RStudio displays two columns with the same name.
If I type:
d3$a
The first column is selected. How can I select the second one by name?
It is not advised to have duplicate column names, and this is one of the main reasons why.
You can select column by position.
d3[[2]]
Or if you want to select by name, here is another way:
d3[names(d3) == 'a'][[2]]
It is a little surprising that R does not throw an error or warning when you cbind two data frames that have the same column names, because usually R data frames do not allow exactly the same names (when you create them using data.frame()).
For example, if you try to create a data frame from scratch (without binding two data frames), R automatically makes the duplicated names unique by appending .1, .2, and so on, starting from the second duplicate:
data.frame(a = c(1, 2, 3), a = c(3, 4, 5), a = c(5, 6, 7))
  a a.1 a.2
1 1   3   5
2 2   4   6
3 3   5   7
This means data frames are not supposed to have identical column names. In your example, when you do d3$a, R can only show you the first column that has the name a. Similarly, if you do d3[, "a"] you get the first column with the name a.
In general, it would not be a good idea to have a dataframe with multiple columns with the exact same names. However, if you really need to have such a dataframe and you have to get the second column, you would need to do something like:
d3[, 2]
[1] 3 4 5
You may make use of the janitor package here:
d1 <- data.frame(a=c(1,2,3))
d2 <- data.frame(a=c(3,4,5))
d3 <- cbind(d1,d2)
janitor::clean_names(d3)
#>   a a_2
#> 1 1   3
#> 2 2   4
#> 3 3   5
Created on 2021-05-22 by the reprex package (v2.0.0)
We can also use which to get all positions of 'a' and take the second one:
d3[[which(names(d3) == 'a')[2]]]

Plotting multiple data files with Gadfly

I'm trying to plot different data files using Gadfly. I tried using a DataFrame, but with no result.
I tried using the code below, but it only plots the last data file. I don't know how to plot all the data files in just one plot using Gadfly.
data1 = CSV.read("DOC.CSV")
data2 = CSV.read("DODC.CSV")
data3 = CSV.read("DOTC.CSV")
data4 = CSV.read("DTC.CSV")
data5 = CSV.read("DTDC.CSV")
data6 = CSV.read("DTTC.CSV")
data = [data1, data2, data3, data4, data5, data6]
ddframe = DataFrame()
for i in 1:6
    ddframe[Symbol("Data"*string(i))] = DataFrame(x=data[i][!,1], y=data[i][!,2], label="Data "*string(i))
end
p = plot(ddframe, x="x", y="y", color="label", Geom.point, Geom.line,
         Guide.xlabel("Wavelength(nm)"), Guide.ylabel("Absorbance(UA)"),
         Guide.title("Absorbance Spectrum for Cianine dyes"))
I assume your data is like this:
│ x │ y │
├───┼───┤
│ 1 │ 1 │
│ 2 │ 3 │
│ 3 │ 5 │
│ 4 │ 7 │
│ 5 │ 9 │
so data is an array of dataframes, and I assume each dataframe can be represented as one line (wavelength vs. absorbance).
Based on your code, you want to combine every dataframe into a single plot, so what you need is the layer() function:
plot(
    layer(data[1],
          x=:x,
          y=:y,
          Geom.point,
          Geom.line,
          Theme(default_color="red")),
    layer(data[2],
          x=:x,
          y=:y,
          Geom.point,
          Geom.line,
          Theme(default_color="blue"))
)
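If you would rather keep your original idea of one long dataframe with a label column (so Gadfly assigns the colors for you and you do not need one layer per file), here is a sketch along those lines; it assumes a reasonably recent CSV.jl, where CSV.read takes a sink type, and that the first two columns of every file are wavelength and absorbance:

using CSV, DataFrames, Gadfly

files = ["DOC.CSV", "DODC.CSV", "DOTC.CSV", "DTC.CSV", "DTDC.CSV", "DTTC.CSV"]
data = [CSV.read(f, DataFrame) for f in files]

# stack everything into one long dataframe, labelling the rows by source file
ddframe = vcat([DataFrame(x=d[!, 1], y=d[!, 2], label="Data "*string(i))
                for (i, d) in enumerate(data)]...)

plot(ddframe, x=:x, y=:y, color=:label, Geom.point, Geom.line,
     Guide.xlabel("Wavelength (nm)"), Guide.ylabel("Absorbance (UA)"),
     Guide.title("Absorbance Spectrum for Cyanine dyes"))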

Subset a dataframe using a logical vector with $

I'm having trouble understanding both the reason for use and behavior of the $ symbol in subsetting a data.frame in R. The following example was presented in a beginner's class I'm taking (not with a live professor so can't ask there):
temp_mat <- matrix(1:9, nrow=3)
colnames(temp_mat) <- c('a', 'b', 'c')
temp_df <- data.frame(temp_mat)
Calling temp_df obviously outputs:
  a b c
1 1 4 7
2 2 5 8
3 3 6 9
The example given in the course is then:
temp_df[temp_df$c < 10]
Which outputs:
  a b c
1 1 4 7
2 2 5 8
3 3 6 9
Reason for use question: The course indicates that $ is used for partial matching, and that x$y is an exact substitute for x[["y", exact=FALSE]]. Why would we want to use a partial matching operator here? Do we use it because we know for sure that in our temp_df there is no other column similar to "c" that could be mistakenly picked up? Additionally, how is partial match measured? A minimum % of characters matching or something? It appears there is a getElement function that would be much more appropriate if working with datasets with unknown or similar column names (e.g. Home Phone versus Cell Phone, would these be seen as a valid partial match?)
Behavior question: it appears the above example temp_df[temp_df$c < 10] is saying "return the subset of elements from temp_df where column c is less than 10" and because all column c elements meet the criteria, the entire dataframe is returned. My interpretation is obviously wrong because temp_df[temp_df$c < 9] returns:
  a b
1 1 4
2 2 5
3 3 6
Although the row 1 and 2 elements in column c do meet the criteria of being less than 9, the entire column is omitted. My question then becomes twofold: what is that logical vector actually saying/doing? And how would I write my interpretation of "return the subset of elements from temp_df where column c is less than 9" and have it return:
  a b c
1 1 4 7
2 2 5 8
Because in my mind, elements 1 and 2 (rows 1 and 2) met that criteria as their column c values are less than 9 and thus should be returned.
Try breaking down the operation in steps.
temp_df$c < 9
gives a vector as follows:
[1] TRUE TRUE FALSE
When you pass this vector in the manner you have shown, temp_df[c(TRUE, TRUE, FALSE)] has the effect of operating on columns.
Think of a data.frame as a list, with the column names as keys and the column contents as vector values. The operation keeps the TRUE keys (i.e. columns) and drops the FALSE ones.
The comma marks the vector as a row index: the first two rows are retained and the last one is dropped. Thus, temp_df[c(TRUE, TRUE, FALSE), ] (which is what temp_df[temp_df$c < 9, ] evaluates to) gives:
  a b c
1 1 4 7
2 2 5 8
Both $ and [[ are extraction operators that allow you to extract elements by name.
The OP has raised a query about the behavior of the exact argument. The exact argument of the [[ operator is documented (see ?Extract) as:
Controls possible partial matching of [[ when extracting by a character vector (for most objects, but see under 'Environments'). The default is no partial matching. Value NA allows partial matching but issues a warning when it occurs. Value FALSE allows partial matching without any warning.
What does it mean? To understand its behavior, let's change the column names of the data.frame used by the OP:
names(temp_df) <- c("aa","bb","cc")
#partial name of column will work with exact = FALSE
temp_df[["a", exact = FALSE]]
#[1] 1 2 3
#partial name of column will not work with exact = TRUE
temp_df[["a", exact = TRUE]]
#NULL
temp_df[["a", exact = NA]]
#[1] 1 2 3
#Warning message:
#In .subset2(x, i, exact = exact) : partial match of 'a' to 'aa'
