Subsetting from same-named columns in a data.frame in R

I have a data.frame called c41 (loaded below). Some column names (e.g., type) in this data frame are repeated once or twice, so read.csv() appends a ".number" suffix (via make.names()) to distinguish between them.
Suppose I want to subset on type == 3 across all columns that have a "type" root in their names. Currently, I drop the ".number" suffixes and then subset, but that incorrectly returns nothing.
Question: In BASE R, how can I subset on a variable's value (type == 3) without needing to include the ".number" suffixes (e.g., type == 3 instead of type.1 == 3)?
In other words, how can I find any "type" whose value is 3, regardless of its numeric suffix?
c41 <- read.csv("https://raw.githubusercontent.com/izeh/l/master/c4.csv")
c42 <- setNames(c41, sub("\\.\\d+$", "", names(c41))) # Take off the `".number"` suffixes
subset(c42, type == 3) # Now subset! But it returns nothing!

Renaming the columns to make them non-unique is a recipe for a headache and is not advisable. Without renaming the columns, in base R you could do something like this instead:
c41[rowSums(c41[grep("^type", names(c41))] == 3, na.rm = TRUE) > 0,]
I don't think subset() can be used here if column names are duplicated.
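A hedged sketch of the same idea that also restricts the display to the matching columns (type_cols and hits are just illustrative names):
# columns whose names start with "type" (type, type.1, type.2, ...)
type_cols <- grep("^type", names(c41))
# rows where any of those columns equals 3
hits <- rowSums(c41[type_cols] == 3, na.rm = TRUE) > 0
c41[hits, type_cols]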

EDIT: I see that you edited your question to specify base R. Can't help you there! But perhaps a dplyr solution is of interest.
You can use dplyr::filter_at and the starts_with helper.
library(dplyr)
library(readr)
c4 <- read_csv("https://raw.githubusercontent.com/izeh/l/master/c4.csv")
c4 %>%
  filter_at(vars(starts_with("type")), any_vars(. == 3))
Adding a select_at to display just the relevant columns:
c4 %>%
  filter_at(vars(starts_with("type")), any_vars(. == 3)) %>%
  select_at(vars(starts_with("type")))
Result:
# A tibble: 2 x 2
   type type_1
  <dbl>  <dbl>
1     1      3
2     2      3
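filter_at() is superseded in current dplyr; on dplyr 1.0 or later, the same filter can be sketched with if_any():
c4 %>%
  filter(if_any(starts_with("type"), ~ . == 3))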

Related

Creating several new columns in a data frame using the same function

I'm sorry for the basic question. I'm just struggling with something that should be simple. Say I have a data frame called Test that originally has three fields: Col1, Col2, Col3.
I want to create new columns based on each of the original columns. The values in each row of a new column would specify whether the corresponding value in the matching row of the original column is above or below that column's median. So, for example, in the image attached, Col4 is based on Col1, Col5 on Col2, and Col6 on Col3.
[image: test data frame example]
It's quite easy to perform this function on a single column and output a single column:
# derivedFactor() comes from the mosaic package
Test <- Test %>% mutate(Col4 = derivedFactor(
  "below" = Col1 < median(Test$Col1),
  "at" = Col1 == median(Test$Col1),
  "above" = Col1 > median(Test$Col1),
  .default = NA)
)
But if I'm performing this same operation over 50 columns, writing out or copy-pasting and editing the code is tedious and inefficient. I should mention that I am hoping to add the new columns to the same data frame, not create another one. Additionally, there are about 200 other fields in the data frame that this function will not be applied to (so I can't just use mutate_all). And the columns are not uniformly named (my examples above are just examples, not the actual dataset), so I can't find a pattern for mutate_at. Maybe there is a way to manually pass a list of column names to the mutate command?
There must be an easy and elegant way to do this. If anyone could help, that would be amazing.
You can do the following using data.table.
Firstly, I define a function that takes a numeric vector and outputs each element's position relative to the vector's median:
med_fn = function(x){
  med = median(x)
  unlist(sapply(x, function(x){
    if (x > med) {'Above'}
    else if (x < med) {'Below'}
    else {'At'}
  }))
}
> med_fn(c(1,2,3))
[1] "Below" "At" "Above"
Let us examine some sample data:
dt = data.table(
  C1 = c(1, 2, 3),
  C2 = c(2, 1, 3),
  C3 = c(3, 2, 1)
)
old = c('C1', 'C2', 'C3')   # names of the columns I want to operate on
new = paste0(old, '_medfn') # names of the new columns following the operation
Using the .SD and .SDcols arguments from data.table, I apply med_fn across the columns old, in my case columns C1, C2 and C3. I call the new columns C#_medfn:
dt[, (new) := lapply(.SD, med_fn), .SDcols = old]
Result:
> dt
   C1 C2 C3 C1_medfn C2_medfn C3_medfn
1:  1  2  3    Below       At    Above
2:  2  1  2       At    Below       At
3:  3  3  1    Above    Above    Below
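If you prefer dplyr, the question's wish to "manually pass a list of column names to the mutate command" can be sketched with across() (dplyr >= 1.0); df and cols here are illustrative names, reusing the med_fn defined above:
library(dplyr)
df <- data.frame(C1 = c(1, 2, 3), C2 = c(2, 1, 3), C3 = c(3, 2, 1))
cols <- c("C1", "C2", "C3")  # any character vector of column names
# apply med_fn to each listed column, writing results to new *_medfn columns
df <- df %>% mutate(across(all_of(cols), med_fn, .names = "{.col}_medfn"))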

How to ensure that a row does not have any "" in a dataframe?

I am pretty new to R, coming from a Python background. I have loaded a dataframe as such:
df = read.csv('data.csv', stringsAsFactors = FALSE,
              colClasses = colClass, na.strings = c("NA", ""))
My objective is to ensure that there are no missing values in my dataframe. I was thinking of writing code as such:
df = na.omit(df)
It wasn't removing the missing values. I then realized that it could be because of how the dataframe was imported: some missing values came through as "" rather than NA.
My question is: is there a function, similar to the NA handling, with which I can explicitly remove rows that contain "" values?
Any help would be great!
Edit 1: Here is the first row: [screenshot omitted]
Edit 2: Here is the structure of the dataframe: [screenshot omitted]
To do what you actually asked, an anonymous function and an apply function will do the job.
df <- df[!apply(df, 1, function(x){all(x=="")}),]
The apply function applies a function either row- or column-wise; the second argument chooses which, so 1 means by row and 2 means by column. The last bit is our custom function, which returns TRUE if all the data in that row is "". If you wanted to check for NAs you could replace x == "" with is.na(x). Finally, once apply returns its vector of TRUEs and FALSEs, we use it (negated) as the index into our dataframe to get back just the rows we want.
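For example, to drop rows where any field (rather than every field) is empty, a sketch of the same pattern:
# any() instead of all(): TRUE when at least one field in the row is ""
# (na.rm = TRUE ignores NA entries in the comparison)
df <- df[!apply(df, 1, function(x) any(x == "", na.rm = TRUE)), ]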
EDIT 2: Turns out I understood it the first time around; the below is what you want!
EDIT: I misunderstood your question. The below is the original answer I gave, and it will remove any row with at least one NA in it!
If you're happy to leave the NAs in there, the complete.cases function returns TRUE for rows with no NA at all, so indexing with it keeps only the complete rows, i.e.
df <- df[complete.cases(df),]
If you instead want to replace the remaining NAs with empty strings after filtering:
df[is.na(df)] <- ""
If you want to convert empty strings to NAs, another option is dplyr::na_if():
library(dplyr)  # for %>% and na_if()
library(tibble) # for tribble()
# example data
dat <- tribble(~col1, ~col2,
               1, "",
               2, "some string",
               3, "another string")
dat
# A tibble: 3 x 2
   col1 col2
  <dbl> <chr>
1  1.00 ""
2  2.00 some string
3  3.00 another string
dat %>%
  na_if("") %>%
  na.omit()
# A tibble: 2 x 2
   col1 col2
  <dbl> <chr>
1  2.00 some string
2  3.00 another string
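For completeness, a rough base R equivalent of the same two steps (dat2 is a hypothetical plain data.frame with the same contents):
dat2 <- data.frame(col1 = c(1, 2, 3),
                   col2 = c("", "some string", "another string"))
dat2[dat2 == ""] <- NA  # recode empty strings as NA
na.omit(dat2)           # then drop rows containing NA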

Conditional sum in R – multiple columns

I'm trying to figure out how to extract some specific information from very big tables (e.g., 30,000 rows and 50 columns).
Imagine I have this data frame:
S1 <- c(1,2,1,1,3,1)
S2 <- c(2,1,3,2,1,1)
S3 <- c(1,2,2,1,3,1)
S4 <- c(3,3,4,2,3,1)
S5 <- c(3,2,5,3,2,2)
count <- c(10,5,3,1,1,1)
df <- data.frame(count,S1,S2,S3,S4,S5)
What I need is to sum the column "count" when, for instance, S1 and S3 share the same value (it doesn't matter which value), but no other column has that value.
In this example, it should return the value 11, because I should only take into consideration the values of the column "count" from rows 1 and 4.
In rows 2, 5 and 6, S1 and S3 have the same value, but I don't want to consider them because other columns also share that value. Finally, row 3 is not considered simply because S1 and S3 have different values.
I know how to do this easily in Excel, but I was wondering how to do it in R. I've tried some commands from dplyr, but I failed.
If any of you could help, I'd be very grateful.
A little more complex, but it works, using only base R. The form for comparing multiple columns in a simple way is taken from this question.
sum(df[df$S1==df$S3 & rowSums(sapply(df[,c(3,5,6)],`==`,e2=df$S1)) == 0,1])
[1] 11
The most complex part is checking multiple columns at once. Here we use sapply to compare the columns c(3,5,6) for equality (==) with S1 (e2 is the second argument of the == function).
As ycw mentions, it can be a little cumbersome to list all the columns in a vector, so this form lets you check all the columns except the ones we want to leave out:
sum(df[df$S1==df$S3 & rowSums(sapply(df[,!(colnames(df) %in% c("count", "S1", "S3"))],`==`,e2=df$S1)) == 0,1])
Applying the same procedure to the two comparisons and defining only the vector of the same values:
equals <- c("S1", "S3")
not_equals <- !(colnames(df) %in% c("count", equals))
sum(df[rowSums(sapply(df[, equals, drop = FALSE], `==`, e2 = df[equals[1]])) == length(equals) &
       rowSums(sapply(df[, not_equals, drop = FALSE], `==`, e2 = df[equals[1]])) == 0, 1])
Note: use drop = FALSE when selecting a single column of a dataframe, to avoid it being simplified to a vector; alternatively, omit the comma entirely, this way:
sum(df[rowSums(sapply(df[equals], `==`, e2 = df[equals[1]])) == length(equals) &
       rowSums(sapply(df[not_equals], `==`, e2 = df[equals[1]])) == 0, 1])
A solution using dplyr, in two steps. The first filter call finds rows with S1 == S3. The second, filter_at, checks that the columns other than S1, S3, and count are all not equal to S1 (which is the same as S3 after the first filter).
library(dplyr)
df2 <- df %>%
  filter(S1 == S3) %>%
  filter_at(vars(-S1, -S3, -count), all_vars(. != S1))
df2
  count S1 S2 S3 S4 S5
1    10  1  2  1  3  3
2     1  1  2  1  2  3
Then the total count is as follows.
sum(df2$count)
[1] 11
Using dplyr with rowwise and filter:
library(dplyr)
df %>%
  rowwise() %>%
  filter(S1 == S3 & !S1 %in% c(S2, S4, S5)) %>%
  pull(count) %>%
  sum()
# [1] 11

Using apply in R to extract rows from a dataframe

Using R, I have to extract specific rows from a data frame depending on certain conditions. The data frame is large (5.5 million rows by 251 columns) but I have given the code below to create a sample data frame.
df <- data.frame("Name" = c("Name1", "Name1", "Name1", "Name1","Name1" ), "Value"=c("X", "X", "Y", "Y", "X"))
I need to skip through the entire data frame row by row starting at the top, and while skipping, when the value of the 'Value' column changes from X to Y or Y to X, I need to extract that row and next row and append them to another data frame. For example, in the data frame above, the Value column of row 2 is X and that of row 3 is Y, and since the value has changed from X to Y, I need to extract the entire row 2 and row 3 and add them to another data frame.
The result of the operations can be seen by running the code below
dfextract <- data.frame("Name" = c("Name1", "Name1"), "Value"=c("X", "Y"))
Currently I use a 'for' loop to step from row to row and extract the rows when the values don't match, but it is very slow and inefficient. The code snippet is below.
dfextract <- df[0, ]  # empty data frame with the same columns
count <- nrow(df) - 1 # number of adjacent row pairs to compare
for (i in 1:count) {
  if (df[[i + 1, 2]] != df[i, 2]) {
    dfextract <- rbind(dfextract, df[i, ])
    dfextract <- rbind(dfextract, df[i + 1, ])
  }
}
I am looking for a better and faster solution to the above situation. Perhaps using the functions belonging to the family of 'apply()' or using 'by()'. Any help would be greatly appreciated.
Thanks in advance.
Maybe the following does it. Note that there are two lapply-based loops, in order to allow for changes in the values of the column Name.
diffstr <- function(x) x[-1] == x[-length(x)]

res <- lapply(split(df, df$Name), function(x) {
  inx <- which(c(FALSE, !diffstr(x$Value)))
  do.call(rbind, lapply(inx, function(i) x[(i - 1):i, ]))
})
res <- do.call(rbind, res)
row.names(res) <- NULL
res
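With the sample df above, this should print:
   Name Value
1 Name1     X
2 Name1     Y
3 Name1     Y
4 Name1     X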
How it works.
First, I define a helper function, diffstr. It compares all values of x but the first with all values of x but the last. Note that x[-1] is the vector x[2], x[3], ..., x[length(x)]; a negative index removes that element from the vector. The same goes for x[-length(x)], where the negative index removes the last element of x.
split(df, df$Name) splits the data frame into subsets, one for each Name.
I then lapply an unnamed function to these subsets. This function's argument x will be each of the sub-data frames mentioned above.
That function starts by determining where in x$Value the changes are. This is done with the call to the helper function diffstr. I have to prepend a FALSE to the return value because the first row can never be a change point.
The next line is a tricky one. Use lapply on the index of change points, inx, and for each one get a two-row segment of the data frame x. Then use do.call to rbind those two-row data frames back together.
Now res is a list, with one sub-data frame for each Name (produced by the split). So it needs to be put back together with another call to do.call(rbind, ...).
Final tidy-up: the whole process messes up the data frame's row names. Setting them to NULL is a well-known trick that forces R to renumber the rows.
That's it. If you need more explanations, just say so.
We can use dplyr. lag shifts a vector by one position, so Value != lag(Value) checks whether each value differs from the previous one, and which(Value != lag(Value)) converts the result to row numbers. After that, sort(unique(unlist(lapply(which(Value != lag(Value)), function(x) c(x, x - 1))))) also collects the row numbers of the preceding rows. Finally, slice subsets the data frame by the row numbers provided.
library(dplyr)
df2 <- df %>%
  slice(sort(unique(unlist(lapply(which(Value != lag(Value)), function(x) c(x, x - 1))))))
df2
# A tibble: 4 x 2
    Name  Value
  <fctr> <fctr>
1  Name1      X
2  Name1      Y
3  Name1      Y
4  Name1      X
If the code is too long to read, you can also calculate the index before using the slice function as follows.
library(dplyr)
ind <- which(df$Value != lag(df$Value))
ind2 <- sort(unique(c(ind, ind - 1)))
df2 <- df %>% slice(ind2)
df2
# A tibble: 4 x 2
    Name  Value
  <fctr> <fctr>
1  Name1      X
2  Name1      Y
3  Name1      Y
4  Name1      X
Using base R, I would probably use diff on the grouping column:
df <- data.frame(colA = c(1, 1, 1, 2, 1, 1, 1, 3, 3, 3, 1, 1),
                 colB = 1:12)
keep <- which(diff(df$colA) != 0)
df[unique(c(keep, keep+1)), ]
   colA colB
3     1    3
4     2    4
7     1    7
10    3   10
5     1    5
8     3    8
11    1   11
There is probably a faster option though.
When you have a large dataset, speed might be the bottleneck. In this case data.table might be the best option for you.
Using the data.table-library, I would solve it like so:
library(data.table)
dt <- data.table(Name = c("Name1", "Name1", "Name1", "Name1", "Name1"),
                 Value = c("X", "X", "Y", "Y", "X"))
# look if Value changed relative to the previous row
dt[, idx := Value != shift(Value, 1, fill = dt$Value[1])]
# filter the change rows plus the row just before each change,
# and deselect the helper variable idx
dt[idx | shift(idx, type = "lead", fill = FALSE)][, .(Name, Value)]
#>     Name Value
#> 1: Name1     X
#> 2: Name1     Y
#> 3: Name1     Y
#> 4: Name1     X
Note the shift(idx, type = "lead") term: idx alone marks only the rows where the value has just changed, so without it the row before each change (row 2 in this example) would be missing from the result.

Determine the number of NA values in a column

I want to count the number of NA values in a data frame column. Say my data frame is called df, and the name of the column I am considering is col. The way I have come up with is following:
sapply(df$col, function(x) sum(length(which(is.na(x)))))
Is this a good/most efficient way to do this?
You're over-thinking the problem:
sum(is.na(df$col))
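This works because is.na returns a logical vector and sum counts each TRUE as 1. A quick sketch with a throwaway df:
df <- data.frame(col = c(1, NA, 3, NA))
sum(is.na(df$col))
#> [1] 2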
If you are looking for NA counts for each column in a dataframe then:
na_count <- sapply(x, function(y) sum(length(which(is.na(y)))))
should give you a named vector with the counts for each column (x being your data frame).
na_count <- data.frame(na_count)
Should output the data nicely in a dataframe like:
         na_count
column_1    count
Try the colSums function
df <- data.frame(x = c(1,2,NA), y = rep(NA, 3))
colSums(is.na(df))
#x y
#1 3
A quick and easy tidyverse solution to get an NA count for all columns is summarise_all(), which I think makes for a much easier-to-read solution than using purrr or sapply.
library(tidyverse)
# Example data
df <- tibble(col1 = c(1, 2, 3, NA),
             col2 = c(NA, NA, "a", "b"))
df %>% summarise_all(~ sum(is.na(.)))
#> # A tibble: 1 x 2
#>    col1  col2
#>   <int> <int>
#> 1     1     2
Or using the more modern across() function:
df %>% summarise(across(everything(), ~ sum(is.na(.))))
If you are looking to count the number of NAs in the entire dataframe you could also use
sum(is.na(df))
summary() also reports the NA count for each variable, so you can use it if you want the NA totals for several variables at once.
A tidyverse way to count the number of NAs in every column of a dataframe:
library(tidyverse)
library(purrr)
df %>%
  map_df(function(x) sum(is.na(x))) %>%
  gather(feature, num_nulls) %>%
  print(n = 100)
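gather() is superseded in current tidyr; on tidyr 1.0 or later the reshaping step can be sketched with pivot_longer() instead:
df %>%
  map_df(function(x) sum(is.na(x))) %>%
  pivot_longer(everything(), names_to = "feature", values_to = "num_nulls")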
This form, slightly changed from Kevin Ogoros's one:
na_count <- function(x) sapply(x, function(y) sum(is.na(y)))
returns the NA counts as a named integer vector. Called directly on your data, that is:
sapply(df, function(x) sum(is.na(x)))  # df being your data frame
Try this:
length(df$col[is.na(df$col)])
User rrs's answer is right, but that only tells you the number of NA values in the particular column of the data frame that you are passing. To get the number of NA values for the whole data frame, try this:
apply(df, 2, function(x) sum(is.na(x)))  # MARGIN = 2 gives column-wise stats
This does the trick
I read a csv file from a local directory. The following code works for me.
# number of rows where the column is NA
sum(is.na(df[, columnName]))
# number of rows where the column is not NA
sum(!is.na(df[, columnName]))
# here columnName is your desired column name
Similar to hute37's answer but using the purrr package. I think this tidyverse approach is simpler than the answer proposed by AbiK.
library(purrr)
map_dbl(df, ~sum(is.na(.)))
Note: the tilde (~) creates an anonymous function. And the '.' refers to the input for the anonymous function, in this case the data.frame df.
If you want the NA counts for each column printed one after the other, you can use this simple solution:
lapply(df, function(x) { length(which(is.na(x)))})
Another option using complete.cases like this:
df <- data.frame(col = c(1,2,NA))
df
#>   col
#> 1   1
#> 2   2
#> 3  NA
sum(!complete.cases(df$col))
#> [1] 1
Created on 2022-08-27 with reprex v2.0.2
You can use this to count number of NA or blanks in every column
colSums(is.na(data_set_name)|data_set_name == '')
In the interests of completeness, you can also use the useNA argument in table(). For example, table(df$col, useNA = "always") counts all of the non-NA values as well as the NAs.
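A small sketch with a throwaway vector:
x <- c(1, 2, 2, NA)
table(x, useNA = "always")
#> x
#>    1    2 <NA>
#>    1    2    1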
