subsetting with Relational Operator != - r

I have a data frame df with several columns.
Column df$xyz contains about 20 distinct character values.
I want to keep three of them ("HL%", "HH$", "LL$") and replace all the others ("truncated", "kk$", "hhb", ...) with "other".
This is my data frame:
xz xyz
2.5 HL%
4.4 HH$
9.3 kk$
2.4 kk$
4.5 LL$
5.6 truncated
I need:
xz xyz
2.5 HL%
4.4 HH$
9.3 other
2.4 other
4.5 LL$
5.6 other
I tried:
df$xyz[df$xyz!="HL%"|
df$xyz!="HH$"|
df$xyz!="LL$"] <- "other"
That doesn't seem to do the trick.

As @nya already stated in the comments, your df$xyz is probably a factor variable; check with str(df).
str(df)
# 'data.frame': 6 obs. of 2 variables:
# $ xz : num 2.5 4.4 9.3 2.4 4.5 5.6
# $ xyz: Factor w/ 6 levels "HH$","HL%","kk$",..: 2 1 6 6 4 6
In this case, first add the new level "other" to the factor levels. If xyz is a character vector, skip this step.
levels(df$xyz) <- c(levels(df$xyz), "other")
After that, replace every value that is not one of the three you want to keep:
df$xyz[!(df$xyz %in% c("HL%", "HH$", "LL$"))] <- "other"
Your approach will also work, but you need to replace the | with &.
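To see why | fails: every value differs from at least one of the three strings, so the condition joined with | is TRUE for every row and everything gets replaced. The following is a minimal sketch on a plain character vector built from the question's sample data; the names xyz, keep, cond_or, cond_and and cond_in are only for illustration.

```r
# Sample values from the question, held as a plain character vector
xyz <- c("HL%", "HH$", "kk$", "kk$", "LL$", "truncated")
keep <- c("HL%", "HH$", "LL$")

# With |, each element is unequal to at least one of the three strings,
# so the condition is TRUE everywhere and every value would be replaced
cond_or <- xyz != "HL%" | xyz != "HH$" | xyz != "LL$"

# With &, the condition is TRUE only where the value matches none of them
cond_and <- xyz != "HL%" & xyz != "HH$" & xyz != "LL$"

# Equivalent and shorter: negate %in%
cond_in <- !(xyz %in% keep)

xyz[cond_and] <- "other"
print(xyz)
```

If xyz were a factor, the level "other" would have to be added first, as described above.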

Related

Calculation via factor results in a by-list - how to circumvent?

I have a data.frame as following:
Lot Wafer Voltage Slope Voltage_irradiated Slope_irradiated m_dist_lot
1 8 810 356.119 6.08423 356.427 6.13945 NA
2 8 818 355.249 6.01046 354.124 6.20855 NA
3 9 917 346.921 6.21474 346.847 6.33904 NA
4 (...)
120 9 914 353.335 6.15060 352.540 6.19277 NA
121 7 721 358.647 6.10592 357.797 6.17244 NA
122 (...)
My goal sounds simple but has turned out to be a bit difficult, although there are surely several ways to solve it:
I want to apply a function "func" to the rows of each group defined by a factor, e.g. the factor "Lot". This is done via
m_dist_lot<- by(data.frame, data.frame$Lot,func)
This actually works but the result is a by-list:
data.frame$Lot: 7
354 355 363 367 378 419 426 427 428 431 460 477 836
3.5231249 9.4229589 1.4996504 7.2984485 7.6883170 1.2354754 1.8547674 3.1129814 4.4303001 1.9634573 3.7281868 3.6182559 6.4718306
data.frame$Lot: 8
1 2 11 15 17 18 19 20 21 22 24 25
2.1415352 4.6459868 1.3485551 38.8218984 3.9988686 2.2473563 6.7186047 2.6433790 0.5869746 0.5832567 4.5321623 1.8567318
In each group, the first line seems to hold the row numbers of the original data.frame from which the data were taken, and the second line holds the calculated values.
My problem now is: how can I store these values in the original data.frame, in the correct rows?
For example in case of one certain calculation/row of the data frame:
m_dist_lot<- by(data.frame, data.frame$Lot,func)
results for the second row of the data.frame in
data.frame$Lot: 8
2
4.6459868
I want to store the value 4.6459868 in data.frame$m_dist_lot according to the correct row "2":
Lot Wafer Voltage Slope Voltage_irradiated Slope_irradiated m_dist_lot
1 8 810 356.119 6.08423 356.427 6.13945 NA
2 8 818 355.249 6.01046 354.124 6.20855 4.6459868
3 9 917 346.921 6.21474 346.847 6.33904 NA
4 (...)
120 9 914 353.335 6.15060 352.540 6.19277 NA
121 7 721 358.647 6.10592 357.797 6.17244 NA
122 (...)
but I don't know how. My best try actually is to use "unlist".
un<- unlist(m_dist_lot) results in
un[1]
6.354
3.523125
un[2]
6.355
9.422959
un[3]
(..)
But I still don't know how to separate the "factor.row" information from the calculated value so that each value is stored in the correct row of the data frame.
At least when using un<- unlist(m_dist_lot, use.names = FALSE) the factors are not present:
un[1]
3.523125
un[2]
9.422959
un[3]
1.49965
(..)
But now I lack the information of how to assign these values properly into the data.frame.
Using un<- do.call(rbind, lapply(m_dist_lot, data.frame, stringsAsFactors=FALSE)) results in
(...)
7.922 0.94130936
7.976 4.89560441
8.1 2.14153516
8.2 4.64598677
8.11 1.34855514
(...)
Here I still lack a proper assignment of calculated values <> data.frame.
I'm sure there must be a doable way. Do you know a good method?
Without reproducible data or an example of what you want func to do, I am guessing a bit here. However, I think that dplyr is going to be the answer for you.
First, I am going to use the pipe (%>%) from dplyr (exported from magrittr) to pass the builtin iris data through a series of functions. If what you are trying to calculate requires the full data.frame (and not just a column or two), you could modify this approach to do what you want (just write your function to take a data.frame, add the column(s) of interest, then return the full data.frame).
Here, I first split the iris data by Species (this creates a list, with a separate data.frame for each species). Next, I use lapply to run the function head on each element of the list. This returns a list of data.frames that now each only have three rows. (You could replace head with your function of interest here, as long as it returns a full data.frame.) Finally, I stitch each element of the list back together with bind_rows.
topIris <-
iris %>%
split(.$Species) %>%
lapply(head, n = 3) %>%
bind_rows()
This returns:
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 7.0 3.2 4.7 1.4 versicolor
5 6.4 3.2 4.5 1.5 versicolor
6 6.9 3.1 4.9 1.5 versicolor
7 6.3 3.3 6.0 2.5 virginica
8 5.8 2.7 5.1 1.9 virginica
9 7.1 3.0 5.9 2.1 virginica
I am going to use this result to illustrate the approach that I think will actually address your underlying problem.
The group_by function from dplyr allows a similar approach, but without having to split the data.frame. When a data.frame is grouped, any functions applied to it are applied separately by group. Here is an example in action, which ranks the sepal lengths within each species. This is obviously not terribly useful directly, but you could write a custom function that takes any number of columns as arguments (which are then passed in as vectors) and returns a vector of the same length (to create a new column or update an existing one). The select function at the end is only there to make it easier to see what I did.
topIris %>%
group_by(Species) %>%
mutate(rank_Sepal_Length = rank(Sepal.Length)) %>%
select(Species, rank_Sepal_Length, Sepal.Length)
Returns:
Species rank_Sepal_Length Sepal.Length
<fctr> <dbl> <dbl>
1 setosa 3 5.1
2 setosa 2 4.9
3 setosa 1 4.7
4 versicolor 3 7.0
5 versicolor 1 6.4
6 versicolor 2 6.9
7 virginica 2 6.3
8 virginica 1 5.8
9 virginica 3 7.1
I found a workaround with the help of the question "Force gsub to keep trailing zeros":
un<- do.call(rbind, lapply(list, data.frame, stringsAsFactors=FALSE))
un<- gsub(".*.","", un)
un<- regmatches(un, gregexpr("(?<=.).*", un, perl=TRUE))
rows<- data.frame(matrix(ncol = 1, nrow = lengths(un)))
colnames(rows)<- c("row_number")
rows["row_number"]<- sprintf("%s", rownames(un))
rows["row_number"]<- as.numeric(un[,1])
rows["row_number"]<- sub("^[^.]*[.]", "", format(rows[,1], width = max(nchar(rows[,1]))))
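A base R alternative that avoids parsing the names of the by() result is to split the data by the factor, apply the function to each piece, and let unsplit() put the values back in the original row order. This is only a sketch: the data values are made up, and func here is a hypothetical stand-in that returns one value per row of its group, which is an assumption about what the real func does.

```r
# Toy data; the column names follow the question, the values are made up
dat <- data.frame(Lot = c(8, 8, 9, 9, 7),
                  Voltage = c(356, 354, 346, 350, 358))

# Hypothetical stand-in for func: takes the rows of one lot and
# returns one value per row (distance of Voltage from the lot mean)
func <- function(d) abs(d$Voltage - mean(d$Voltage))

# Compute per lot, then put the values back in the original row order
per_lot <- lapply(split(dat, dat$Lot), func)
dat$m_dist_lot <- unsplit(per_lot, dat$Lot)
print(dat)
```

unsplit() reverses split(), so each calculated value lands in the row it was computed from, as long as func returns exactly one value per input row.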

R: Creating an index vector

I need some help with R coding here.
The data set Glass consists of 214 rows of data in which each row corresponds to a glass sample. Each row consists of 10 columns. When viewed as a classification problem, column 10 (Type) specifies the class of each observation/instance. The remaining columns are attributes that might be used to infer column 10. Here is an example of the first row:
RI Na Mg Al Si K Ca Ba Fe Type
1 1.52101 13.64 4.49 1.10 71.78 0.06 8.75 0.0 0.0 1
First, I cast column 10 so that R interprets it as a factor instead of an integer value.
Now I need to create a vector with indices for all observations (it must have the values 1-214). I need it to create training data for Naive Bayes. I know how to create a vector with 214 values, but not one that holds the specific indices of the observations in a data frame.
Thanks.
I'm not totally sure that I get what you're trying to do, so please forgive me if my solution isn't helpful. If your data frame is named 'df', add the index column and then use the dplyr package to move it to the front:
library(dplyr)
df['index'] <- 1:214
df <- df %>% select(index,everything())
Here's an example. So that I can post full dataframes, my dataframes will only have 10 rows...
Let's say my dataframe is:
df <- data.frame(col1 = c(2.3,6.3,9.2,1.7,5.0,8.5,7.9,3.5,2.2,11.5),
col2 = c(1.5,2.8,1.7,3.5,6.0,9.0,12.0,18.0,20.0,25.0))
So it looks like
col1 col2
1 2.3 1.5
2 6.3 2.8
3 9.2 1.7
4 1.7 3.5
5 5.0 6.0
6 8.5 9.0
7 7.9 12.0
8 3.5 18.0
9 2.2 20.0
10 11.5 25.0
If I want to add another column that just is 1,2,3,4,5,6,7,8,9,10... and I'll call it 'index' ...I could do this:
library(dplyr)
df['index'] <- 1:10
df <- df %>% select(index, everything())
That will give me
index col1 col2
1 1 2.3 1.5
2 2 6.3 2.8
3 3 9.2 1.7
4 4 1.7 3.5
5 5 5.0 6.0
6 6 8.5 9.0
7 7 7.9 12.0
8 8 3.5 18.0
9 9 2.2 20.0
10 10 11.5 25.0
Hope this will help
df$ind <- seq.int(nrow(df))
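If the index vector is meant for picking training rows, it does not have to be stored as a column. The sketch below assumes a stand-in data frame with 214 rows (the real Glass data is not loaded here) and an arbitrary training size of 150; the names idx, train_idx and test_idx are illustrative only.

```r
set.seed(1)

# Stand-in data frame with 214 rows; the values are random placeholders
df <- data.frame(RI = rnorm(214),
                 Type = factor(sample(1:7, 214, replace = TRUE)))

# Index vector with one entry per observation: 1, 2, ..., 214
idx <- seq_len(nrow(df))

# Draw the training indices at random; the rest become the test set
train_idx <- sample(idx, size = 150)   # 150 is an arbitrary choice
test_idx <- setdiff(idx, train_idx)

train <- df[train_idx, ]
test <- df[test_idx, ]
```

seq_len(nrow(df)) is used here instead of 1:214 so that the code keeps working if the number of rows changes.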

how to sort a column in a table in r

I tried to merge two tables, but the result is like this,
subj gamble_gamble n_gambles expected_value
1 19 32 1.7
10 3 4 1.5
100 3 4 1.5
101 6 32 1.4
102 3 4 1.5
103 19 32 1.7
The subj column isn't ordered in the usual way (e.g. 1, 2, 3, 4, 5, 6). I tried to order the subj column with this command:
newdata <- table3[order(subj),]
but it doesn't work. Can somebody help me?
Use this:
newdata <- table3[order(as.numeric(as.character(table3$subj))),]
This works even if subj is a factor (not just character).
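The reason for going through as.character is that as.numeric applied directly to a factor returns the internal level codes, not the numbers shown in the labels. A small sketch with a made-up factor (the values only mimic the subj column from the question):

```r
# A factor whose labels are numbers stored as text
subj <- factor(c("1", "10", "100", "101", "2", "3"))

# as.numeric() on a factor returns the internal level codes, not the labels
codes <- as.numeric(subj)

# Converting via character recovers the numbers printed in the labels
values <- as.numeric(as.character(subj))

sorted <- subj[order(values)]
print(as.character(sorted))
```

Here the level codes happen to follow the text sort order of the labels, which is exactly the 1, 10, 100, 101 ordering seen in the merged table.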

appending new data to specific elements in lists in r

Please correct me if my terminology is wrong, because on this question I'm not quite sure what I'm dealing with regarding elements, objects, lists... I just know it's not a data frame.
Using the example from prepksel {adehabitatHS} I am trying to modify my own data to fit into their package. Running this command on their example data creates an object? called x which is a list with 3 sections? elements? to it.
The example data code:
library(adehabitatHS)
data(puechabonsp)
locs <- puechabonsp$relocs
map <- puechabonsp$map
pc <- mcp(locs[,"Name"])
hr <- hr.rast(pc, map)
cp <- count.points(locs[,"Name"], map)
x <- prepksel(map, hr, cp)
Looking at the structure of x, it is a list of 3 elements called tab, weight, and factor:
str(x)
List of 3
$ tab :'data.frame': 191 obs. of 4 variables:
..$ Elevation : num [1:191] 141 140 170 160 152 121 104 102 106 103 ...
..$ Aspect : num [1:191] 4 4 4 1 1 1 1 1 4 4 ...
..$ Slope : num [1:191] 20.9 18 17 24 23.9 ...
..$ Herbaceous: num [1:191] 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 ...
$ weight: num [1:191] 1 1 1 1 1 2 2 4 0 1 ...
$ factor: Factor w/ 4 levels "Brock","Calou",..: 1 1 1 1 1 1 1 1 1 1 ...
For my data, I will create multiple "x" lists and want to merge the data within each element. I have created an "x" for each of the years 2007, 2008 and 2009. Now I want to append the "tab" element of 08 to 07, then 09 to 07/08, and do the same for the "weight" and "factor" elements of this list "x". How do you bind that data? I thought about using unlist on each element of the list, appending the yearly data for each element, and then rejoining the three elements into one list, but this was cumbersome and seemed rather inefficient.
I know this is not how it will work, but in my head this is what I should be doing:
newlist<-append(x07$tab, x08$tab, x09$tab)
newlist<-append(x07$weight, x08$weight, x09$weight)
newlist<-append(x07$factor, x08$factor, x09$factor)
maybe rbind? do.call("rbind", lapply(....uh...stuck
append works for vectors and lists, but it won't give the output you want for data frames, and the elements in your lists (and they are lists) are of different types. Something like
tocomb <- list(x07,x08,x09)
newlist <- list(
tab = do.call("rbind",lapply(tocomb,function(x) x$tab)),
weight = unlist(lapply(tocomb,function(x) x$weight)),
factor = unlist(lapply(tocomb,function(x) x$factor))
)
You may need to be careful with factors if they have different levels. unlist combines factors into a single factor whose levels are the union of the level sets; alternatively, use as.character on the factors before converting them back with as.factor.
This isn't tested, so some assembly may be required. I'm not an R wizard, and this may not be the best answer.
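The idea can be tried out on toy data without the adehabitatHS package. In the sketch below the two lists only imitate the shape of the prepksel result (tab, weight, factor); all values are made up, and whether the combined list is still accepted by the package's functions is not checked here.

```r
# Two toy lists shaped like the prepksel() result (values are made up)
x07 <- list(tab = data.frame(Elevation = c(141, 140)),
            weight = c(1, 2),
            factor = factor(c("Brock", "Brock")))
x08 <- list(tab = data.frame(Elevation = c(170, 160)),
            weight = c(0, 4),
            factor = factor(c("Calou", "Brock")))

tocomb <- list(x07, x08)
combined <- list(
  # Stack the data frames row-wise
  tab = do.call(rbind, lapply(tocomb, function(x) x$tab)),
  # Concatenate the numeric weights
  weight = unlist(lapply(tocomb, function(x) x$weight)),
  # Go through character so differing level sets cannot mix up the codes
  factor = factor(unlist(lapply(tocomb, function(x) as.character(x$factor))))
)
str(combined)
```

A third year would simply be added to tocomb; the rest of the code stays the same.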

plot command isn't recognizing column names

Recently when I tried to plot in R I keep getting this error. Can anyone tell me why I can't seem to do a scatter plot? I've pasted the terminal screen below.
tcmg2o4 <-read.table("~/Documents/research/metal.oxides/TcMg2O4.inverse/energydata.txt")
tcmg2o4
V1 V2
1 Lattice_constant Total_energy
2 8.0 -371.63306746
3 8.1 -375.035492
4 8.2 -378.8669067
5 8.3 -380.34136459
6 8.4 -382.3921237
7 8.5 -383.60394736
8 8.6 -384.09517631
9 8.7 -383.77668067
10 8.8 -382.43806866
11 8.9 -381.42213458
12 9.0 -379.63327976
attach(tcmg2o4)
plot(Lattice_constant, Total_energy)
Error in plot(Lattice_constant, Total_energy) :
object 'Lattice_constant' not found
plot(V1,V2)
Your problem is that you are not reading the column names as column names. To do this, use header = TRUE:
tcmg2o4 <-read.table("~/Documents/research/metal.oxides/TcMg2O4.inverse/energydata.txt", header = TRUE)
In your case, the read.table call without a header created the column names V1 and V2, and both columns are factor variables (character variables from R 4.0.0 on, where stringsAsFactors defaults to FALSE).
After re-reading the file with header = TRUE, you can check the structure of the object with
str(tcmg2o4)
## 'data.frame': 11 obs. of 2 variables:
## $ Lattice_constant: num 8 8.1 8.2 8.3 8.4 8.5 8.6 8.7 8.8 8.9 ...
## $ Total_energy : num -372 -375 -379 -380 -382 ...
I would also avoid using attach. Instead, use with:
with(tcmg2o4, plot(Lattice_constant, Total_energy))
or use the fact that it is a two-column data.frame:
plot(tcmg2o4)
or use a formula to specify your x and y axes (y ~ x):
plot(Total_energy ~ Lattice_constant, data = tcmg2o4)
All of these give the same result and make it much clearer where the data is stored.
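The effect of the header argument can be reproduced without the original file by passing a few of the posted lines through read.table's text argument. This is a sketch only; txt, no_header and with_header are illustrative names.

```r
# First lines of the posted file, supplied inline instead of from disk
txt <- "Lattice_constant Total_energy
8.0 -371.63306746
8.1 -375.035492
8.2 -378.8669067"

# Without header = TRUE the names row is read as data
no_header <- read.table(text = txt)

# With header = TRUE the first row becomes the column names
with_header <- read.table(text = txt, header = TRUE)

names(no_header)    # "V1" "V2"
names(with_header)  # "Lattice_constant" "Total_energy"
```

Because the names row is text, reading it as data also stops both columns from being numeric, which is why plotting V1 against V2 does not produce the expected scatter plot.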

Resources