How to select variables with star symbols in R

I want to select some variables from my csv file in R. I used select(gender*, age*) but got the error "object not found". I tried select(`gender*`, `age*`) and select(starts_with(gender), starts_with(age)), but neither works. Does anyone know how to select variables whose names contain star symbols? Thanks a lot!

It is possible that dplyr's select() is masked by select() from another package, as the following works fine. Either specify the package name with ::, or try this in a fresh R session with only dplyr loaded:
library(dplyr)
data(iris)
iris$`gender*` <- 'M'
iris %>%
  head %>%
  dplyr::select(`gender*`)
#   gender*
# 1       M
# 2       M
# 3       M
# 4       M
# 5       M
# 6       M
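For instance, MASS is a common culprit: it also exports a select() (used for ridge regression), so loading it after dplyr masks dplyr's version. A minimal sketch of the failure and the :: fix:
library(dplyr)
library(MASS)  # MASS::select() now masks dplyr::select()
iris$`gender*` <- 'M'
# select(iris, `gender*`)                  # fails: MASS::select() is a different function
dplyr::select(iris, `gender*`) %>% head()  # works: package specified explicitly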

To select a list of column names starting with a specific string, one can use the starts_with() function in dplyr. Note that the prefix must be passed as a quoted string, which is why starts_with(gender) in the question failed. To illustrate, we'll select the two columns that start with the string "Sepal", as in Sepal.Length and Sepal.Width.
library(dplyr)
select(iris, starts_with("Sepal")) %>% head()
...and the output:
> select(iris, starts_with("Sepal")) %>% head()
  Sepal.Length Sepal.Width
1          5.1         3.5
2          4.9         3.0
3          4.7         3.2
4          4.6         3.1
5          5.0         3.6
6          5.4         3.9
>
We can do the same thing in base R with grepl() and a regular expression.
# base R version
head(iris[, grepl("^Sepal", names(iris))])
...and the output:
> head(iris[, grepl("^Sepal", names(iris))])
  Sepal.Length Sepal.Width
1          5.1         3.5
2          4.9         3.0
3          4.7         3.2
4          4.6         3.1
5          5.0         3.6
6          5.4         3.9
>
Also note that if one uses read.csv() with its default check.names = TRUE to create a data frame, any occurrence of * in a column heading is converted to a period (.).
# confirm that * is converted to . in read.csv()
textFile <- 'v*1,v*2
1,2
3,4
5,6'
data <- read.csv(text = textFile, header = TRUE)
# see how the illegal name character * is converted to .
names(data)
...and the output:
> names(data)
[1] "v.1" "v.2"
>
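If you need to keep the literal * in the column names, read.csv() accepts check.names = FALSE to skip the conversion; the resulting non-syntactic names then require back-quoting when used:
# keep the original headers by disabling name checking
data2 <- read.csv(text = textFile, header = TRUE, check.names = FALSE)
names(data2)
# [1] "v*1" "v*2"
data2$`v*1`
# [1] 1 3 5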

Related

Using anti_join and saving "anti_joined" data to top of a csv file

I am trying to write something which will continuously update a csv file.
There can be some overlaps since my script collects a little more data than is required so I want to use anti_join to find the data points which are not in the current saved .csv file. I want to append the new data points to the top of the .csv file.
Currently what I have is something like:
iris = iris %>%
  mutate(rowId = row_number())
df1 = iris[1:15, ]
df2 = iris[1:10, ]
df1 %>%
  anti_join(df2, by = "rowId")
# write.csv(., append = TRUE)
which correctly gives me the missing rows (by rowId):
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species rowId
1          5.4         3.7          1.5         0.2  setosa    11
2          4.8         3.4          1.6         0.2  setosa    12
3          4.8         3.0          1.4         0.1  setosa    13
4          4.3         3.0          1.1         0.1  setosa    14
5          5.8         4.0          1.2         0.2  setosa    15
I want to append this to the top of the original data df1 using write_csv. However, my current method adds the data to the bottom of the .csv file. I could sort the data, but when the data set gets really large, sorting everything each time just to place 5 new observations at the top is not computationally feasible (e.g. with 1,000,000 observations, sorting all 1,000,005 rows to get 5 at the top).
How can I use dplyr / tidyverse to append the rows to the top of the data?
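A plain-text CSV cannot be prepended to in place, so some rewriting of the file is unavoidable. A minimal sketch of one workaround, assuming the file (here the hypothetical saved_data.csv) can be re-read once per update:
library(dplyr)
library(readr)
old <- read_csv("saved_data.csv")               # previously saved data
new_rows <- anti_join(df1, old, by = "rowId")   # rows not yet saved
bind_rows(new_rows, old) %>%                    # new rows go on top
  write_csv("saved_data.csv")                   # rewrite the whole file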

How do I delete all rows based on a loop in R

I am writing a for loop to delete rows in which all of the values in columns 5 through 8 are NA. However, it only deletes SOME of the rows. When I do a while loop, it deletes all of the rows, but I have to manually end it (i.e. it is an infinite loop... I also have no idea why).
The for/if loop:
for (i in 1:nrow(df)) {
  if (is.na(df[i, 5]) && is.na(df[i, 6]) &&
      is.na(df[i, 7]) && is.na(df[i, 8])) {
    df <- df[-i, ]
  }
}
The while loop (but it is infinite):
for (i in 1:nrow(df)) {
  while (is.na(df[i, 5]) && is.na(df[i, 6]) &&
         is.na(df[i, 7]) && is.na(df[i, 8])) {
    df <- df[-i, ]
  }
}
Can someone help? Thanks!
What's happening here is that when you remove a row this way, all the rows below it "move up" to fill the space left behind. When two consecutive rows should both be deleted, the second one gets skipped over. Imagine this table:
1 keep
2 delete
3 delete
4 keep
Now, you loop through a sequence from 1 to 4 (the number of rows) deleting rows that say delete:
i = 1, keep that row ...
i = 2, delete that row. Now, the data frame looks like this:
1 keep
2 delete
3 keep
i = 3, the 3rd row says keep, so keep it ... The final table is:
1 keep
2 delete
3 keep
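A minimal runnable demonstration of the skipping (the is.na() guard only protects against i running past the end of the shrinking data frame):
df <- data.frame(id = 1:4,
                 x = c("keep", "delete", "delete", "keep"),
                 stringsAsFactors = FALSE)
for (i in 1:nrow(df)) {
  if (!is.na(df$x[i]) && df$x[i] == "delete") df <- df[-i, ]
}
df$x
# [1] "keep"   "delete" "keep"    (the second "delete" was skipped)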
In your example with while, however, the deletion step keeps running on row 2 until that row doesn't meet the conditions instead of moving on to i = 3 right away. So the process goes:
i = 1, keep that row ...
i = 2, delete that row. Now, the data frame looks like this:
1 keep
2 delete
3 keep
i = 2 (again), delete that row (again). Now, the data frame looks like this:
1 keep
2 keep
i = 2 (again), this row says keep, so keep it and move on to i = 3
I'd be remiss to answer this question without mentioning that there are much better ways to do this in R, such as square-bracket indexing (enter ?`[` in the R console), the filter() function in the dplyr package, or the data.table package.
This question has many options: Filter data.frame rows by a logical condition
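For instance, a vectorized sketch of the same deletion in base R and dplyr (assuming columns 5 through 8 are the ones to test):
# base R: keep rows where not every one of columns 5:8 is NA
df <- df[rowSums(is.na(df[, 5:8])) < 4, ]
# dplyr (>= 1.0.4) equivalent using if_all()
library(dplyr)
df <- filter(df, !if_all(5:8, is.na))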
Store the row numbers in a vector and remove them outside the loop:
test <- iris
test[1:5, 2:4] <- NA
> head(test)
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1          NA           NA          NA  setosa
2          4.9          NA           NA          NA  setosa
3          4.7          NA           NA          NA  setosa
4          4.6          NA           NA          NA  setosa
5          5.0          NA           NA          NA  setosa
6          5.4         3.9          1.7         0.4  setosa
x <- integer(0)
for (i in 1:nrow(test)) {
  if (is.na(test[i, 2]) && is.na(test[i, 3]) &&
      is.na(test[i, 4])) {
    x <- c(x, i)
  }
}
x
# guard against x being empty: test[-integer(0), ] would return zero rows
if (length(x) > 0) test <- test[-x, ]
head(test)
> head(test)
   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
6           5.4         3.9          1.7         0.4  setosa
7           4.6         3.4          1.4         0.3  setosa
8           5.0         3.4          1.5         0.2  setosa
9           4.4         2.9          1.4         0.2  setosa
10          4.9         3.1          1.5         0.1  setosa
11          5.4         3.7          1.5         0.2  setosa

Calculation via factor results in a by-list - how to circumvent?

I have a data.frame like the following:
    Lot Wafer Voltage   Slope Voltage_irradiated Slope_irradiated m_dist_lot
1     8   810 356.119 6.08423            356.427          6.13945         NA
2     8   818 355.249 6.01046            354.124          6.20855         NA
3     9   917 346.921 6.21474            346.847          6.33904         NA
4   (...)
120   9   914 353.335 6.15060            352.540          6.19277         NA
121   7   721 358.647 6.10592            357.797          6.17244         NA
122 (...)
My goal is simple to state but has proven difficult to implement; no doubt it can be solved in several ways:
I want to apply a function func to each row according to a factor, e.g. the factor Lot. This is done via
m_dist_lot <- by(data.frame, data.frame$Lot, func)
This actually works, but the result is a by-list:
data.frame$Lot: 7
      354       355       363       367       378       419       426       427       428       431       460       477       836
3.5231249 9.4229589 1.4996504 7.2984485 7.6883170 1.2354754 1.8547674 3.1129814 4.4303001 1.9634573 3.7281868 3.6182559 6.4718306
data.frame$Lot: 8
         1          2         11         15         17         18         19         20         21         22         24         25
 2.1415352  4.6459868  1.3485551 38.8218984  3.9988686  2.2473563  6.7186047  2.6433790  0.5869746  0.5832567  4.5321623  1.8567318
The first line of each block shows the row names of the original data.frame the values were taken from; the second line shows the calculated values.
My problem now is: How can I store these values properly into the origin data.frame according to the correct rows?
For example in case of one certain calculation/row of the data frame:
m_dist_lot <- by(data.frame, data.frame$Lot, func)
results for the second row of the data.frame in
data.frame$Lot: 8
2
4.6459868
I want to store the value 4.6459868 in data.frame$m_dist_lot according to the correct row "2":
    Lot Wafer Voltage   Slope Voltage_irradiated Slope_irradiated m_dist_lot
1     8   810 356.119 6.08423            356.427          6.13945         NA
2     8   818 355.249 6.01046            354.124          6.20855  4.6459868
3     9   917 346.921 6.21474            346.847          6.33904         NA
4   (...)
120   9   914 353.335 6.15060            352.540          6.19277         NA
121   7   721 358.647 6.10592            357.797          6.17244         NA
122 (...)
but I don't know how. My best attempt so far uses unlist():
un <- unlist(m_dist_lot)
which results in
un[1]
   7.354
3.523125
un[2]
   7.355
9.422959
un[3]
(...)
But I still don't know how to separate the "factor.row" information from the calculated values in such a way that they are stored correctly in the data frame.
At least when using un <- unlist(m_dist_lot, use.names = FALSE) the factors are not present:
un[1]
[1] 3.523125
un[2]
[1] 9.422959
un[3]
[1] 1.49965
(...)
But now I lack the information needed to assign these values to the proper rows of the data.frame.
Using un <- do.call(rbind, lapply(m_dist_lot, data.frame, stringsAsFactors = FALSE)) results in
(...)
7.922  0.94130936
7.976  4.89560441
8.1    2.14153516
8.2    4.64598677
8.11   1.34855514
(...)
Here I still lack a proper mapping between the calculated values and the rows of the data.frame.
I'm sure there must be a clean way to do this. Do you know a good method?
Without reproducible data or an example of what you want func to do, I am guessing a bit here. However, I think that dplyr is going to be the answer for you.
First, I am going to use the pipe (%>%) from dplyr (exported from magrittr) to pass the builtin iris data through a series of functions. If what you are trying to calculate requires the full data.frame (and not just a column or two), you could modify this approach to do what you want (just write your function to take a data.frame, add the column(s) of interest, then return the full data.frame).
Here, I first split the iris data by Species (this creates a list, with a separate data.frame for each species). Next, I use lapply to run the function head on each element of the list. This returns a list of data.frames that now each only have three rows. (You could replace head with your function of interest here, as long as it returns a full data.frame.) Finally, I stitch each element of the list back together with bind_rows.
topIris <- iris %>%
  split(.$Species) %>%
  lapply(head, n = 3) %>%
  bind_rows()
This returns:
  Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
1          5.1         3.5          1.4         0.2     setosa
2          4.9         3.0          1.4         0.2     setosa
3          4.7         3.2          1.3         0.2     setosa
4          7.0         3.2          4.7         1.4 versicolor
5          6.4         3.2          4.5         1.5 versicolor
6          6.9         3.1          4.9         1.5 versicolor
7          6.3         3.3          6.0         2.5  virginica
8          5.8         2.7          5.1         1.9  virginica
9          7.1         3.0          5.9         2.1  virginica
Which I am going to use to illustrate the approach that I think will actually address your underlying problem.
The group_by function from dplyr allows a similar approach, but without having to split the data.frame. When a data.frame is grouped, any functions applied to it are applied separately by group. Here is an example in action, which ranks the sepal lengths within each species. This is obviously not terribly useful directly, but you could write a custom function that takes any number of columns as arguments (which are passed in as vectors) and returns a vector of the same length (to create a new column or update an existing one); see the sketch after the example below. The select() function at the end is only there to make it easier to see what I did.
topIris %>%
  group_by(Species) %>%
  mutate(rank_Sepal_Length = rank(Sepal.Length)) %>%
  select(Species, rank_Sepal_Length, Sepal.Length)
Returns:
     Species rank_Sepal_Length Sepal.Length
      <fctr>             <dbl>        <dbl>
1     setosa                 3          5.1
2     setosa                 2          4.9
3     setosa                 1          4.7
4 versicolor                 3          7.0
5 versicolor                 1          6.4
6 versicolor                 2          6.9
7  virginica                 2          6.3
8  virginica                 1          5.8
9  virginica                 3          7.1
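A sketch of that custom-function idea applied to the question's data (df stands in for the original data frame, and the deviation measure is hypothetical, standing in for whatever func computes):
library(dplyr)
# hypothetical per-group statistic: absolute deviation from the group mean
m_dist <- function(v) abs(v - mean(v))
df %>%
  group_by(Lot) %>%
  mutate(m_dist_lot = m_dist(Voltage)) %>%
  ungroup()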
I got a workaround with the help of Force gsub to keep trailing zeros:
un <- do.call(rbind, lapply(m_dist_lot, data.frame, stringsAsFactors = FALSE))
# rownames(un) now have the form "Lot.row", e.g. "8.2" for Lot 8, row 2;
# strip everything up to the first dot to recover the original row numbers
rows <- data.frame(row_number = sub("^[^.]*[.]", "", rownames(un)))

How to select specific column and type with readxl?

I am trying to solve a problem importing xls data into R with the readxl package. The specific xls file has 18 columns and 472 rows, and the first 7 rows contain descriptive text that needs to be skipped. I only want to select columns 1, 3 and 6:9 of the 18 for EDA. They have mixed types, including date, numeric and text.
readxl does not seem able to import non-contiguous columns directly. My plan is to use skip = 7 to read the entire sheet and select the columns in the next step. The problem, however, is that readxl guesses the date columns as numeric by default. Is there a way in readxl to specify col_types by column name?
Here is reproducible code with an example xlsx file to demonstrate a workaround.
library(readxl)
xlsx_example <- readxl_example("datasets.xlsx")
# read the entire table
read_excel(xlsx_example)
# specify a type for a specific column by name - the following code does not work
read_excel(xlsx_example, col_types = col(Sepal.Length = "numeric"))
As far as I'm aware, you are not able to specify col_types by column name. It is possible to read in only specific columns, though. For example,
read_excel(xlsx_example, col_types = c("numeric", "skip", "numeric", "numeric", "skip"))
will import columns 1, 3 and 4 and skip columns 2 and 5. You could do this for all 18 columns, but I think it gets a bit hard to keep track of which column is being imported as which type.
An alternative is to read in all columns as text using col_types = "text" then select and convert variables by name. For example:
library(tidyverse)
library(readxl)
xlsx_example <- readxl_example("datasets.xlsx")
df <- read_excel(xlsx_example, col_types = "text")
df %>%
  select(Sepal.Length, Petal.Length) %>%
  mutate(Sepal.Length = as.numeric(Sepal.Length))
#> # A tibble: 150 x 2
#>    Sepal.Length Petal.Length
#>           <dbl>        <chr>
#>  1          5.1          1.4
#>  2          4.9          1.4
#>  3          4.7          1.3
#>  4          4.6          1.5
#>  5          5.0          1.4
#>  6          5.4          1.7
#>  7          4.6          1.4
#>  8          5.0          1.5
#>  9          4.4          1.4
#> 10          4.9          1.5
#> # ... with 140 more rows
Note that read_excel() does not accept a readr-style column specification such as col(Sepal.Length = col_numeric()); the col() and col_numeric() helpers belong to readr, and readxl's col_types argument only takes a character vector. So the two approaches above are the practical options.
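One way to get close to specifying types by name (a sketch, assuming the header row can be read on its own): read zero data rows to learn the column names, build the col_types vector from them, then read the full sheet.
library(readxl)
xlsx_example <- readxl_example("datasets.xlsx")
# read just the header row to learn the column names
hdr <- names(read_excel(xlsx_example, n_max = 0))
# build a col_types vector keyed off the names
types <- ifelse(hdr == "Sepal.Length", "numeric", "text")
read_excel(xlsx_example, col_types = types)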

applying strptime to local data frame

I think I have a problem related to escaped quotes (\") that I fail to handle.
Here is an excerpt from a DateTime column of a data.frame I have read with read_csv:
earthquakes[1:20, 1]
Source: local data frame [20 x 1]
                 DateTime
                    (chr)
1  1964/01/01 12:21:55.40
2  1964/01/01 14:16:27.60
3  1964/01/01 14:18:53.90
4  1964/01/01 15:49:47.90
5  1964/01/01 17:26:43.50
My goal is to extract the years here. Manually doing
> format(strptime(c("1964/01/01 12:21:55.40","1964/01/01 12:21:55.40","1964/01/01 14:16:27.60"), "%Y/%m/%d %H:%M:%OS"), "%Y")
[1] "1964" "1964" "1964"
works as intended. However,
> strptime(earthquakes[1:5,1], "%Y/%m/%d %H:%M:%OS")
DateTime
NA
My hunch is that the problem is related to
as.character(earthquakes[1:5,1])
[1] "c(\"1964/01/01 12:21:55.40\", \"1964/01/01 14:16:27.60\", \"1964/01/01 14:18:53.90\", \"1964/01/01 15:49:47.90\", \"1964/01/01 17:26:43.50\")"
So it seems the column, when coerced, also contains the quote characters via the escape \". But I do not know how to handle this from here.
Given that the years are the first four characters, it would also seem OK (but less elegant, imho) to do
substr(earthquakes[1:5,1],1,4)
but that then accordingly just gives
[1] "c(\"1"
Clearly, I could do
substr(earthquakes[1:5,1],4,7)
but that would only work for the first row.
Apparently you have a dplyr::tbl_df, and by default in those, [ never simplifies a single column to an atomic vector (in contrast to [ applied to a base R data.frame). Hence, you can use either [[ or $ to extract the column, which will then be simplified to an atomic vector.
Some examples:
data(iris)
library(dplyr)
x <- tbl_df(iris)
x[1:5, 1]
# Source: local data frame [5 x 1]
#
#   Sepal.Length
#          (dbl)
# 1          5.1
# 2          4.9
# 3          4.7
# 4          4.6
# 5          5.0
iris[1:5, 1]
# [1] 5.1 4.9 4.7 4.6 5.0
x[[1]][1:5]
# [1] 5.1 4.9 4.7 4.6 5.0
x$Sepal.Length[1:5]
# [1] 5.1 4.9 4.7 4.6 5.0
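Applied to the original problem, extracting the column with $ (or [[) makes strptime() behave like the manual example:
years <- format(strptime(earthquakes$DateTime, "%Y/%m/%d %H:%M:%OS"), "%Y")
head(years)
# [1] "1964" "1964" "1964" "1964" "1964" ...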
