extract data from only columns matching character strings - r

I have a dataset that looks something like this (but much larger)
Jul_08 <- c(1,0,2,0,3)
Aug_08 <- c(0,0,1,0,1)
Sep_08 <- c(0,1,0,0,1)
month<-c("Jul_08","Aug_08","Jul_08","Sep_08","Jul_08")
dataset <- data.frame(Jul_08 = Jul_08, Aug_08 = Aug_08, Sep_08=Sep_08,month=month)
For each row, I would like to isolate the value for one selected month only, as indicated by the "month" field. In other words, for a given row, if the column "month" = Jul_08, then a new "value" column should hold the datum from the "Jul_08" column of that row.
In essence, the output would add this value column to the dataset
value<-c(1,0,2,0,3)
Creating this final dataset
dataset.value<-cbind(dataset,value)

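You can use matrix indexing:
w <- match(dataset$month, names(dataset))
dataset$value <- as.numeric(dataset[ cbind(seq_len(nrow(dataset)), w) ])
Here the w vector tells R which column to take the value from, and seq_len supplies the row numbers so that each value comes from its own row. The value column is therefore built from the 1st column of the 1st row, the 2nd column of the 2nd row, the 1st column of the 3rd row, etc. Because the data frame also holds the non-numeric month column, indexing it with a matrix first converts everything to character, hence the as.numeric() call.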
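Matrix indexing on a whole data frame goes through as.matrix(), which yields character values when a non-numeric column such as month is present. A variant of the same idea, sketched under the assumption that every column other than month is a numeric month column, restricts the lookup to those columns so the result stays numeric. The names month_cols and m are illustrative and not part of the original answer.

```r
# Example data from the question
dataset <- data.frame(
  Jul_08 = c(1, 0, 2, 0, 3),
  Aug_08 = c(0, 0, 1, 0, 1),
  Sep_08 = c(0, 1, 0, 0, 1),
  month  = c("Jul_08", "Aug_08", "Jul_08", "Sep_08", "Jul_08")
)

# Assumption: all columns except "month" are numeric month columns
month_cols <- setdiff(names(dataset), "month")
m <- as.matrix(dataset[month_cols])        # numeric matrix

# Column position for each row, then one (row, column) pair per row
w <- match(dataset$month, month_cols)
dataset$value <- m[cbind(seq_len(nrow(dataset)), w)]
```

A row whose month does not match any column name gets NA in w and therefore NA in value.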

You can use lapply:
value <- unlist(lapply(1:nrow(dataset),
                       function(r){
                         dataset[r,as.character(dataset[r,'month'])]
                       }))
> value
[1] 1 0 2 0 3
Or, alternatively :
value <- diag(as.matrix(dataset[,as.character(dataset$month)]))
> value
[1] 1 0 2 0 3
Then you can cbind the new column as you did in your example.
Some notes:
I prefer unlist(lapply(...)) over sapply because the automatic simplification in sapply sometimes surprises me, although here sapply should work without any problem.
as.character is necessary only if the month column is a factor (as in the example); otherwise it is redundant, but I would leave it in just to be safe.
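If the simplification done by sapply is the concern, vapply is another base R option: it takes the expected type and length of each result as its third argument and stops with an error when a result does not match. A minimal sketch on the question's example data:

```r
dataset <- data.frame(
  Jul_08 = c(1, 0, 2, 0, 3),
  Aug_08 = c(0, 0, 1, 0, 1),
  Sep_08 = c(0, 1, 0, 0, 1),
  month  = c("Jul_08", "Aug_08", "Jul_08", "Sep_08", "Jul_08")
)

# For each row r, pick the column named in dataset$month[r];
# numeric(1) declares that every result must be a single number
value <- vapply(
  seq_len(nrow(dataset)),
  function(r) dataset[r, as.character(dataset$month[r])],
  numeric(1)
)
```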

Related

Count number of rows in each column in a dataframe that satisfy a specific condition

New to R btw so I am sorry if it seems like a stupid question.
So basically I have a dataframe with 100 rows and 3 different columns of data. I also have a vector with 3 thresholds, one for each column. I was wondering how you would filter out the values of each column that are greater than that column's threshold.
Edit: Sorry for the incomplete question.
So essentially what I would like to create is a function (that takes a dataframe and a vector of thresholds as parameters) that applies every threshold to its respective column of the dataframe (so there is one threshold for every column of the dataframe). The number of elements of each column that “respect” their threshold should then be put in a vector. So for example:
Column 1: values = 1,2,3. Threshold: only values lower than 3
Column 2: values = 4,5,6. Threshold: only values lower than 6
Output: A vector (2,2), since there are two elements in each column that are under their respective thresholds.
Thank you everyone for the help!!
Your example data:
df <- data.frame(a = 1:3, b = 4:6)
threshold <- c(3, 6)
One option to resolve your question is to use sapply(), which applies a function over a list or vector. In this case, I create a vector for the columns in df with 1:ncol(df). Inside the function, you can count the number of values less than a given threshold by summing the number of TRUE cases:
col_num <- 1:ncol(df)
sapply(col_num, function(x) {sum(df[, x] < threshold[x])})
Or, in a single line:
sapply(1:ncol(df), function(x) {sum(df[, x] < threshold[x])})
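Since the question asks for a function that takes a data frame and a vector of thresholds, the same counting logic could be wrapped as below. This is only a sketch: the name count_below is hypothetical, and it uses mapply to walk over the columns and thresholds in parallel instead of indexing by position.

```r
# Hypothetical helper: count, per column, the values strictly below
# that column's threshold
count_below <- function(data, thresholds) {
  # one threshold per column is assumed
  stopifnot(length(thresholds) == ncol(data))
  mapply(
    function(column, limit) sum(column < limit, na.rm = TRUE),
    data, thresholds
  )
}

df <- data.frame(a = 1:3, b = 4:6)
threshold <- c(3, 6)
counts <- count_below(df, threshold)
```

The result is a vector with one count per column, named after the columns of the data frame. The na.rm = TRUE argument is a design choice that ignores missing values; drop it if an NA in a column should make that column's count NA.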

How do I use one column to gate select rows of another column in r?

I have a dataset with a column of 1's and 0's and another column with double values. I want to make a third column that contains the data in each of the rows in the second column that corresponds to a 1 in the first column. I have no idea how to do this and googling for this has been a nightmare. How do I do this?
You can do this in a one-liner with ifelse. Assuming your data frame is called df, 1 and 0 values in col1, doubles in col2, values corresponding to the zeros are NA:
df$col3 <- ifelse(df$col1, df$col2, NA)
We can also do this in base R with logical indexing. First create a logical vector from 'col1', then use it as an index when creating the new column. Rows where 'i1' is FALSE are left as NA by default.
i1 <- as.logical(df1$col1)
# or
# i1 <- df1$col1 == 1
df1$col3[i1] <- df1$col2[i1]
Or as a single line
df1$col3[as.logical(df1$col1)] <- df1$col2[as.logical(df1$col1)]
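To see both approaches side by side, here is a small sketch on made-up data (the values and the extra column name col4 are assumptions for illustration only). The indexing variant initialises the new column with NA first, which makes the intended result explicit.

```r
# Made-up example: a 0/1 gate column and a column of doubles
df <- data.frame(col1 = c(1, 0, 1, 0),
                 col2 = c(2.5, 3.1, 4.7, 5.9))

# ifelse: keep col2 where col1 is 1, NA elsewhere
df$col3 <- ifelse(df$col1 == 1, df$col2, NA)

# logical indexing: start from all NA, then fill the gated rows
df$col4 <- NA_real_
keep <- df$col1 == 1
df$col4[keep] <- df$col2[keep]
```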

R selecting rows from dataframe using logical indexing: accessing columns by `$` vs `[]`

I have a simple R data.frame object df. I am trying to select rows from this dataframe based on logical indexing from a column col in df.
I am coming from the Python world, where during similar operations I can select using either df[df[col] == 1] or df[df.col == 1] with the same end result.
However, in R, df[df$col == 1] gives an incorrect result compared to df[df[,col] == 1] (confirmed with the summary command). I do not understand this difference, since from links like http://adv-r.had.co.nz/Subsetting.html it seems that either way is OK. Also, the str command on df$col and df[, col] shows the same output.
Are there any guidelines about when to use the $ vs [] operator?
Edit:
Digging a little deeper and using this question as reference, it seems that the following code works correctly:
df[which(df$col == 1), ]
However, it is not clear how to guard against NA and when to use which.
You confused many things.
In
df[,col]
col is a variable that holds either a column number or a column name. For example,
col = 2
x = df[,col]
would select the second column and store it in x.
In
df$col
col is taken literally as the column name, not as a variable. For example,
df=data.frame(aa=1:5,bb=10:14)
x = df$bb
would select the second column and store it to x. But you cannot write df$2.
Finally,
df[[col]]
returns the same column as df[,col] when col is a single number or a single name. If col is a character ("character" in R means the same as string in other languages), then it selects the column with this name. Example:
df=data.frame(aa=1:5,bb=10:14)
foo = "bb"
x = df[[foo]]
y = df[[2]]
z = df[["bb"]]
Now x, y, and z all contain a copy of the second column of df.
The notation foo[[bar]] comes from lists. The notation foo[,bar] comes from matrices. Since a data frame has features of both a matrix and a list, it can use both.
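The difference between a literal name after $ and a variable inside brackets can be checked directly. A short sketch, reusing the example data frame from above with a variable that happens to be called col:

```r
df <- data.frame(aa = 1:5, bb = 10:14)
col <- "bb"

by_double_bracket <- df[[col]]   # uses the value of col, i.e. "bb"
by_matrix_style   <- df[, col]   # also uses the value of col
by_dollar         <- df$col      # looks for a column literally named "col"

same_column <- identical(by_double_bracket, df$bb) &&
  identical(by_matrix_style, df$bb)
dollar_is_null <- is.null(by_dollar)
```

Here same_column and dollar_is_null are both TRUE: df$col returns NULL because df has no column named "col".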
Use $ when you want to select one specific column by name: df$col_name.
Use [] when you want to select one or more columns by number or by name:
df[,1] # select column with index 1
df[,1:3] # select columns with indexes 1 to 3
df[,c(1,3:5,7)] # select columns with indexes 1, 3 to 5 and 7.
[[]] is mostly for lists.
EDIT: df[which(df$col == 1), ] works because which() returns the integer positions at which the condition df$col == 1 is TRUE, dropping both FALSE and NA. Those positions are passed to df[ , ] as row indices, so only the matching rows are returned. The comma matters when filtering rows: a single index without a comma, as in df[i], selects columns, whereas df[i, ] selects rows.
See "Remove rows with NAs (missing values) in data.frame" to find out more about how to deal with missing values. It is good practice to handle missing values in a dataset explicitly.
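On guarding against NA: a plain logical index keeps a row of NAs for every NA in the condition, while which() or an explicit is.na() check drops those rows. A small sketch on made-up data (the id column and the values are assumptions for illustration):

```r
df <- data.frame(id = 1:4, col = c(1, NA, 0, 1))

# Plain logical index: the NA in the condition produces an all-NA row
by_logical <- df[df$col == 1, ]

# which() keeps only positions where the condition is TRUE
by_which <- df[which(df$col == 1), ]

# Equivalent guard without which()
by_guard <- df[!is.na(df$col) & df$col == 1, ]
```

by_logical has three rows (one of them all NA), whereas by_which and by_guard both contain only the rows with id 1 and 4.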

Find elements in one vector and replace by the equivalents in another vector in R

I have 3 columns of names (corresponding to different diatom species). The first column is the current name of my species, the second column is the "old" (i.e. not used any more) name of the species and the third one is the "new" (i.e. after taxonomic update) name of the species.
For each value in the first column I need to find it in the second column and, if found, I need to replace it by the updated name (stored in the third column). So for example, given this matrix:
Column 1                   Column 2                Column 3
Achnanthes.atomus          Amphora.coffeaeformis   Halamphora.coffeaeformis
Achnanthes.biasolettiana   Achnanthes.atomus       Achnanthidium.atomus
Achnanthes.atomus, found in column 1 (first row), should be identified in column 2 (second row) and replaced by its "new name" Achnanthidium.atomus (column 3, second row).
My matrix is called Diatosdef. If I do this, it works:
colnames(Diatosdef) <- gsub("Achnanthes.atomus","Achnanthidium.atomus",colnames(Diatosdef))
But I need to do it species by species, and I have almost 100 species
Can anybody please help me?
Thanks!
P.S: I found that I can do it in Excel with the vlookup function, but I am still looking for a way of doing it in R
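# Return the updated name (column 3) if x appears among the old names
# (column 2); otherwise return x unchanged.
fun <- function(x, mat){
  if(x %in% mat[,2]) mat[which(mat[,2]==x), 3]
  else x
}
# Apply the lookup to every current name in column 1.
Diatosdef[,1] <- sapply(Diatosdef[,1], fun, mat = Diatosdef)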
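The same lookup can also be written without a per-element function by using match(), which behaves much like the vlookup mentioned in the question. This is a sketch on the two example rows only; the column names current, old and new are assumptions added for readability.

```r
# Two example rows from the question, as a character matrix
Diatosdef <- cbind(
  current = c("Achnanthes.atomus", "Achnanthes.biasolettiana"),
  old     = c("Amphora.coffeaeformis", "Achnanthes.atomus"),
  new     = c("Halamphora.coffeaeformis", "Achnanthidium.atomus")
)

# Row in which each current name occurs as an old name (NA if it does not)
idx <- match(Diatosdef[, "current"], Diatosdef[, "old"])
found <- !is.na(idx)

# Replace only the names that were found; the rest stay as they are
Diatosdef[found, "current"] <- Diatosdef[idx[found], "new"]
```

match() returns the first matching position, so this assumes each old name appears only once in the second column.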
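You can also try a loop:
for (i in seq_len(nrow(Diatosdef))) {
  # position of the current name among the old names (NA if absent)
  j <- match(Diatosdef[i, 1], Diatosdef[, 2])
  if (!is.na(j)) Diatosdef[i, 1] <- Diatosdef[j, 3]
}
This looks up each current name in the column of old names and, when it is found, overwrites it with the updated name from the matching row.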

Getting a subset in R

I have a dataframe with 14 columns, and I want to subset it into a dataframe with the same columns, keeping only the rows whose ID is repeated (for example, if ID = 2 appears more than once, I want to keep those rows).
To begin, I applied table to my dataframe to see the frequencies of each ID:
head(sort(table(call.dat$IMSI), decreasing = TRUE), 100)
In my case, 20801170106338 is repeated twice, so I want to see the two observations for this ID.
Afterward, I tried x <- subset(call.dat, IMSI == "20801170106338") and hsb6 <- call.dat[call.dat$IMSI == "20801170106338", ], but the results are wrong (x has 0 observations of 14 variables, and hsb6 contains only NA).
Can you help me? Thanks.
PS: IMSI is a numeric value.
And x <- subset(call.dat, Handset.Manufacturer == "LG") is another example, which works perfectly...
You can use duplicated, a function that returns a logical vector that is TRUE for every record whose value has already appeared earlier in the vector.
isDuplicated <- duplicated(call.dat$IMSI)
Then, you can extract the rows flagged as duplicates (the first occurrence of each value is not flagged).
call.dat.duplicated <- call.dat[isDuplicated, ]
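Because duplicated() does not flag the first occurrence of a value, getting every observation of a repeated ID needs a second pass from the other end of the vector. A sketch on made-up data (the IMSI values and the name call.dat.repeated are invented for illustration; only the column names come from the question):

```r
# Made-up stand-in for call.dat
call.dat <- data.frame(
  IMSI = c(101, 202, 101, 303, 202, 404),
  Handset.Manufacturer = c("LG", "Nokia", "LG", "Sony", "Nokia", "LG")
)

# TRUE for every row whose IMSI occurs more than once:
# flagged either scanning forwards or scanning backwards
repeated <- duplicated(call.dat$IMSI) |
  duplicated(call.dat$IMSI, fromLast = TRUE)

call.dat.repeated <- call.dat[repeated, ]
```

With this data, call.dat.repeated holds the four rows for IMSI 101 and 202, including their first occurrences.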

Resources