Editing a column inside a dataframe


I am trying to edit a column in my data frame. I tried tstrsplit, but I did not get the desired result. I want to remove the ';' separators from OID and end up with a single value in every row of the OID column.
This is the code I ran:
library(data.table);
setDT(df)[, paste0("OID", 1:3) := tstrsplit(OID, ";", fixed = TRUE)]
This created three new columns (OID1, OID2, OID3), but I only want to edit the OID column itself so that it holds single values, as shown in my desired output below.
Here is my data:
QID OID
189 204;202;201;203;
189 202;203;201;204;
189 na
189 204;202;201;203;
189 na
189 204;202;201;203;
189 na
My desired output:
QID OID
189 202
189 201
189 204
189 203

If we need a single element from each row, we can split 'OID' on ";", loop through the resulting list with sapply, pick one element per row (with sample, since the selection rule is not clear), and update 'OID' with that output.
transform(df, OID = sapply(strsplit(OID, ";"), sample, 1))
# QID OID
#1 189 202
#2 189 204
#3 189 203
#4 189 202
If we need a different value in each row (this assumes the number of unique values matches the number of rows)
transform(df, OID = sample(unique(unlist(strsplit(OID, ";")))))
# QID OID
#1 189 202
#2 189 201
#3 189 203
#4 189 204
NOTE: If the "OID" column class is factor, convert to character class before splitting i.e. strsplit(as.character(OID), ";")
data
df <- structure(list(QID = c(189L, 189L, 189L, 189L),
OID = c("204;202;201;203;",
"202;203;201;204;", "204;202;201;203;", "204;202;201;203;")),
.Names = c("QID", "OID"), class = "data.frame", row.names = c(NA, -4L))
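The question's data also contains rows where OID is "na", which the sample data above leaves out. As a minimal sketch, assuming those entries are the literal string "na" and that any one value per row is acceptable, the split-and-pick step could be combined with a filter like this (the toy data frame is hypothetical):

```r
# Toy data mirroring the question, including "na" rows (assumed to be strings).
df <- data.frame(
  QID = c(189L, 189L, 189L, 189L),
  OID = c("204;202;201;203;", "na", "202;203;201;204;", "na"),
  stringsAsFactors = FALSE
)

# Drop the rows that carry no OID values.
df <- df[df$OID != "na", ]

# Split each remaining string on ";" and keep one randomly chosen piece per row.
set.seed(1)  # only to make the random pick repeatable
pieces <- strsplit(df$OID, ";", fixed = TRUE)
df$OID <- vapply(pieces, function(x) sample(x, 1), character(1))

df
```

vapply is used instead of sapply here so that the result is guaranteed to be a character vector with one element per row.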

I think another option is str_split_fixed from the stringr package. It is vectorised over the string, so it should be more efficient than sapply.
str_split_fixed(string, pattern, n)
Please see here: http://www.inside-r.org/packages/cran/stringr/docs/str_split_fixed
df <- data.frame(QID=c(189,189,189,189),
OID=c("204;202;201;203","202;203;201;204",
"204;202;201;203","204;202;201;203"))
df
# QID OID
# 1 189 204;202;201;203
# 2 189 202;203;201;204
# 3 189 204;202;201;203
# 4 189 204;202;201;203
library(stringr)
df$OID <- str_split_fixed(df$OID, ";", 4)[, 1]  # keep the first separated column
df
# QID OID
#1 189 204
#2 189 202
#3 189 204
#4 189 204
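If keeping the first value is all that is needed, the same result can probably be reached without an extra package. This is a base R sketch, not part of the answer above: it assumes OID is a character column and simply deletes everything from the first ";" onwards.

```r
df <- data.frame(
  QID = c(189, 189, 189, 189),
  OID = c("204;202;201;203", "202;203;201;204",
          "204;202;201;203", "204;202;201;203"),
  stringsAsFactors = FALSE
)

# Remove the first ";" and everything after it, leaving the first value.
df$OID <- sub(";.*$", "", df$OID)

df$OID
# [1] "204" "202" "204" "204"
```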

Related

Remove row with specific value

I have the following data:
library(data.table)
sales <- data.table(Customer = c(192,964,929,345,898,477,705,804,188,231,780,611,420,816,171,212,504,526,471,979,524,410,557,152,417,359,435,820,305,268,763,194,757,475,351,933,805,687,813,880,798,327,602,710,785,840,446,891,165,662),
Producttype = c(1,2,3,2,3,3,2,1,3,3,1,1,2,2,1,3,1,3,3,1,1,1,1,3,3,3,3,2,1,1,3,3,3,3,1,1,3,3,3,2,3,2,3,3,3,2,1,2,3,1),
Price = c(469,721,856,956,554,188,429,502,507,669,427,582,574,992,418,835,652,983,149,917,370,617,876,337,663,252,599,949,915,556,313,842,892,724,415,307,900,114,439,456,541,261,881,757,199,308,958,374,409,738),
Quarter = c(2,3,3,4,4,1,4,4,3,3,1,1,1,1,1,1,4,1,2,1,3,1,2,3,3,4,4,1,1,4,1,1,3,2,1,3,3,2,2,2,1,4,3,3,1,1,1,3,1,1))
How can I remove (let's say) the row in which Customer = 891?
And then I have another question:
If I want to manipulate the data, I use data[row, column]. But when I want only the rows in which Quarter equals (for example) 4, I use data[Quarter == 4, ]. Why is it not data[, Quarter == 4], since Quarter is a column and not a row?
I did not find an appropriate answer on the internet which really explains the why.
Thank you.

You created your data with the data.table function, so you could write:
sales[Customer != 891,]
In data[Quarter == 4, ], the condition sits in the row position because it selects rows: it keeps the rows where Quarter is equal to 4. The comma (,) followed by nothing means that all columns are returned for those rows. Note the double equals sign: == tests for equality, whereas a single = does not.

When you use indexing, i.e. data[row, column], you are telling R to look for specific rows, specific columns, or both.
Rows: sales[sales$Customer %in% c(192,964),] translates to "search the column Customer in the data frame (or table) for any rows whose value is 192 or 964 and isolate them". Note that data.table also allows sales[Customer %in% c(192, 964),], but data frames do not (use sales[sales$Customer %in% c(192,964),]).
Customer Producttype Price Quarter
1: 192 1 469 2
2: 964 2 721 3
Columns: sales[, "Customer"] translates to "search the data frame (or table) for the column named "Customer" and isolate all its rows":
Customer
1: 192
2: 964
3: 929
4: 345
5: 898
...
Note this returns a data table with one column. If you use sales[,Customer] (data table) or sales$Customer (data frame), it will return a vector:
# [1] 192 964 929 345 898 477 705 804 188 231 780 611 420 816 171 212 504 526 471 979 524
# [22] 410 557 152 417 359 435 820 305 268 763 194 757 475 351 933 805 687 813 880 798 327
# [43] 602 710 785 840 446 891 165 662
You can of course combine both. For example, sales[sales$Quarter %in% 1:2, c("Customer", "Producttype")] isolates all values of Customer and Producttype that fall in quarters 1 and 2:
Customer Producttype
1: 192 1
2: 477 3
3: 780 1
4: 611 1
5: 420 2
...
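To make the row position versus column position distinction concrete, here is a small base R sketch on a plain data frame. The table sales_df and its values are made up for illustration; only the column names are borrowed from the question.

```r
# Hypothetical toy table; the names mirror some of the question's columns.
sales_df <- data.frame(
  Customer = c(192, 964, 891, 345),
  Quarter  = c(2, 3, 3, 4),
  Price    = c(469, 721, 374, 956)
)

# Row position: a logical condition keeps the rows where it is TRUE.
q4_rows <- sales_df[sales_df$Quarter == 4, ]

# Negating the condition drops the matching rows instead.
without_891 <- sales_df[sales_df$Customer != 891, ]

# Column position: names (or numbers) pick whole columns.
prices   <- sales_df[, "Price"]                  # a plain vector
two_cols <- sales_df[, c("Customer", "Price")]   # still a data frame
```

The condition Quarter == 4 produces one TRUE/FALSE per row, which is why it belongs before the comma even though Quarter itself is a column.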

Using dplyr to compute calculated fields depending on multiple columns without explicitly writing column names

Consider the following code.
set.seed(56)
library(dplyr)
df <- data.frame(
NUM_1 = sample.int(500, replace = TRUE),
DENOM_1 = sample.int(500, replace = TRUE),
NUM_2 = sample.int(500, replace = TRUE),
DENOM_2 = sample.int(500, replace = TRUE)
)
head(df)
NUM_1 DENOM_1 NUM_2 DENOM_2
1 417 379 154 173
2 160 437 239 154
3 243 315 106 361
4 291 169 393 340
5 170 450 429 421
6 422 131 75 64
Without having to manually specify each of the column names (the actual problem has about 40 of these I need to create), I would like to create columns FRAC_1 and FRAC_2 for which FRAC_X = NUM_X/DENOM_X.
So, this would be what I'm looking for with regard to output, but since I'm dealing with about 40 of these, I don't want to have to manually type out each column:
df_frac <- df %>%
mutate(FRAC_1 = NUM_1 / DENOM_1,
FRAC_2 = NUM_2 / DENOM_2)
head(df_frac)
NUM_1 DENOM_1 NUM_2 DENOM_2 FRAC_1 FRAC_2
1 417 379 154 173 1.1002639 0.8901734
2 160 437 239 154 0.3661327 1.5519481
3 243 315 106 361 0.7714286 0.2936288
4 291 169 393 340 1.7218935 1.1558824
5 170 450 429 421 0.3777778 1.0190024
6 422 131 75 64 3.2213740 1.1718750
I would strongly prefer a dplyr solution to this. I thought maybe I could use mutate() with across(), but it isn't clear to me how to tell across() to pair the NUM_x with the corresponding DENOM_x columns.

Here is one option in the tidyverse:
Loop across the columns whose names start with 'NUM' (starts_with)
Get the current column name with cur_column() and replace the substring 'NUM' with 'DENOM' using str_replace
Get the value of that 'DENOM' column, divide the 'NUM' column by it, and rename the result via .names to create the 'FRAC' columns
library(dplyr)
library(stringr)
df <- df %>%
mutate(across(starts_with("NUM"), ~
./get(str_replace(cur_column(), 'NUM', 'DENOM')),
.names = "{str_replace(.col, 'NUM', 'FRAC')}"))
Output:
head(df)
NUM_1 DENOM_1 NUM_2 DENOM_2 FRAC_1 FRAC_2
1 417 379 154 173 1.1002639 0.8901734
2 160 437 239 154 0.3661327 1.5519481
3 243 315 106 361 0.7714286 0.2936288
4 291 169 393 340 1.7218935 1.1558824
5 170 450 429 421 0.3777778 1.0190024
6 422 131 75 64 3.2213740 1.1718750
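For comparison, the same pairing by name can be sketched in base R. This is not the dplyr solution the question asks for, only an illustration of the idea of deriving the DENOM and FRAC names from the NUM names; it assumes every NUM_x column has a matching DENOM_x column. The small data frame is made up.

```r
df <- data.frame(
  NUM_1 = c(10, 20), DENOM_1 = c(2, 5),
  NUM_2 = c(9, 8),   DENOM_2 = c(3, 4)
)

# Derive the partner and result names from the NUM column names.
num_cols   <- grep("^NUM_", names(df), value = TRUE)
denom_cols <- sub("^NUM_", "DENOM_", num_cols)
frac_cols  <- sub("^NUM_", "FRAC_", num_cols)

# Data frames divide element by element, column by position.
df[frac_cols] <- df[num_cols] / df[denom_cols]

df$FRAC_1
# [1] 5 4
df$FRAC_2
# [1] 3 2
```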

How to sort smaller values between two columns in R?

I have a dataframe called test. I want to sort the dataframe and move the smaller values in the left column (sstart) and keep the bigger values on the right column (send). I can do this by using if else condition and creating two new columns with sorted values. How can we do this more efficiently in R?
test<- structure(list(sstart = c(425L, 387L, 436L, 219L,
232L), send = c(125L, 487L, 136L, 3191L, 132L
)), .Names = c("sstart", "send"), row.names = c(4L, 14L, 17L,
23L, 27L), class = "data.frame")
Result I want:
sstart send
125 425
387 487
136 436
219 3191
132 232

REVISED
Sorry, upon re-reading your question, I see I misunderstood you: You just want to sort within each row the first two columns. That's not what my original code (preserved below) does. What you want is this:
data.frame(t(apply(test[,1:2],1,sort))) %>%
rename(sstart=X1, send=X2) %>% dplyr::bind_cols(test[,-1:-2])
I use apply rowwise (that's the "1" there) on the first two columns of test, with the function applied being sort. This gives us a sideways matrix, so I transpose it and turn it into a data.frame, and then bind it back to the rest of the original test. Result:
sstart send
1 125 425
2 387 487
3 136 436
4 219 3191
5 132 232
Sorry about the mix-up.
WRONG CODE:
matrix(sort(unlist(test)),ncol=2) %>% data.frame() %>% dplyr::rename(sstart=X1,send=X2)
Unlisting test turns it into a vector, which we sort and place into a matrix with two columns. Matrix defaults to filling by column, so the smaller ones will go into the first column and the larger ones into the second. We move this matrix into a data.frame and rename the columns sstart and send. Voila.
sstart send
1 125 387
2 132 425
3 136 436
4 219 487
5 232 3191
If there are other columns in test that need to be preserved:
matrix(sort(unlist(test[,1:2])),ncol=2) %>% data.frame() %>%
dplyr::rename(sstart=X1,send=X2) %>%
dplyr::bind_cols(test[,-1:-2])

You can use pmax and pmin, but it is impossible to swap two values without temporarily storing at least one value:
# temp vectors of the columns to "swap" the values as required
low <- pmin(test$sstart, test$send)
high <- pmax(test$sstart, test$send)
# exchange the columns
test$sstart <- low
test$send <- high
# result
test
# sstart send
# 4 125 425
# 14 387 487
# 17 136 436
# 23 219 3191
# 27 132 232
Warning: If NAs occur in your data you may lose information, since pmin and pmax return NA whenever either value is NA. You could set NAs to a decent default value as a workaround.
E.g. if you add another row containing an NA value
test[6,]$sstart <- NA
test[6,]$send <- 100
then after applying pmin and pmax as above you will get two NAs instead of one NA plus the second value:
sstart send
4 125 425
14 387 487
17 136 436
23 219 3191
27 132 232
NA NA NA
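One possible way around the NA problem, sketched here as an alternative rather than as part of the answer above, is to swap only the rows where both values are present and the left one is larger. Rows with an NA are then left exactly as they were. The three-row data frame is a made-up example.

```r
test <- data.frame(
  sstart = c(425L, 387L, NA),
  send   = c(125L, 487L, 100L)
)

# TRUE only for complete rows whose values are in the wrong order.
swap <- !is.na(test$sstart) & !is.na(test$send) & test$sstart > test$send

# Assigning the two columns in reversed order swaps them for those rows.
test[swap, c("sstart", "send")] <- test[swap, c("send", "sstart")]

test
```

Whether leaving incomplete rows untouched is the right policy depends on the data; it is only one reasonable choice.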

I would do it this way:
split(test,row(test)) %>%
purrr::map_dfr(~{
sort(.x) %>%
setNames(c("sstart","send"))
})

Text processing in R

I have a text file with many lines (first two are shown below)
1: 146 189 229
2: 191 229
I need to convert to the output
1 146
1 189
1 229
2 191
2 229
I have read the lines in a loop, removed the ":" and split by " ".
fbnet <- readLines("0.egonet")
for (line in fbnet){
line <- gsub(":","",line)
line <- unlist(strsplit(line, " ", fixed = TRUE),use.names=FALSE)
friend = line[1]
}
How should I proceed from here?

We can read the lines with read.csv (or read.table), specifying : as the delimiter, to get a data.frame with 2 columns. With header = FALSE the columns are named automatically with the letter V followed by a sequence number, so the second column is 'V2'. We then use separate_rows to split 'V2' on the space delimiter into separate rows, and use filter to remove the NA elements (which can appear when there are multiple spaces).
library(tidyverse)
read.csv(text=fbnet, sep=":", header = FALSE) %>%
separate_rows(V2, convert = TRUE) %>%
filter(!is.na(V2))
V1 V2
1 1 146
2 1 189
3 1 229
4 2 191
5 2 229
Or using read_delim from readr with separate_rows and filter
read_delim(paste(trimws(fbnet), collapse="\n"), delim=":", col_names = FALSE) %>%
separate_rows(X2, convert = TRUE) %>%
filter(!is.na(X2))
data
fbnet <- readLines(textConnection("1: 146 189 229
2: 191 229"))
#if we are reading from file, then
fbnet <- readLines("file.txt")
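The same reshaping can also be sketched in base R, continuing from the lines the question already reads in. This assumes every line has the form "id: n1 n2 ..." with at least one number after the colon; the names parts, ids, friends and result are just illustrative.

```r
fbnet <- c("1: 146 189 229", "2: 191 229")

# Split each line into the id (before ":") and the rest (after ":").
parts <- strsplit(fbnet, ":", fixed = TRUE)
ids   <- vapply(parts, function(p) p[1], character(1))

# Parse the whitespace-separated numbers that follow the colon.
friends <- lapply(parts, function(p) scan(text = p[2], quiet = TRUE))

# Repeat each id once per number and stack everything into two columns.
result <- data.frame(
  id     = rep(as.integer(ids), lengths(friends)),
  friend = unlist(friends)
)

result
#   id friend
# 1  1    146
# 2  1    189
# 3  1    229
# 4  2    191
# 5  2    229
```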

Sum paired files over a list of files

I have multiple files that belong together in pairs, and each pair should be summed on the values in column 2 to create one file. All files have the same rows. The files to be summed share the same ID before the L* part of the file name.
I would like to write a loop that identifies the paired files and sums them on column 2.
I have written code that reads the files, but I am not sure how to proceed:
file_list <- list.files(pattern = "\\.csv$")  # regular expression: names ending in .csv
library(data.table)
lst <- lapply(file_list, function(x)
fread(x, select=c("V1", "V2"))[,
list(ID=paste(V1), freq=V2)])
Two of the pairs are shown below:
Pair one:
01_001_F08_S80_L009
16S_rRNA_copy_A-1 75
16S_rRNA_copy_B-1 86
16S_rRNA_copy_C-1 102
01_001_F08_S80_L002
16S_rRNA_copy_A-1 98
16S_rRNA_copy_B-1 96
16S_rRNA_copy_C-1 101
Pair two:
01_001_F09_S81_L006
16S_rRNA_copy_A-1 242
16S_rRNA_copy_B-1 244
16S_rRNA_copy_C-1 302
01_001_F09_S81_L003
16S_rRNA_copy_A-1 252
16S_rRNA_copy_B-1 253
16S_rRNA_copy_C-1 322

We can split 'lst' by a substring of its names (created with sub, which strips the trailing digits), loop through the resulting list, rbind the nested list elements, and get the sum of 'freq' grouped by 'ID'. This relies on 'lst' being named after the files.
lapply(split(lst, sub("\\d+$", "", names(lst))),
function(x) rbindlist(x)[, .(freq = sum(freq)), ID])
#$`01_001_F08_S80_L`
# ID freq
#1: 16S_rRNA_copy_A-1 173
#2: 16S_rRNA_copy_B-1 182
#3: 16S_rRNA_copy_C-1 203
#$`01_001_F09_S81_L`
# ID freq
#1: 16S_rRNA_copy_A-1 494
#2: 16S_rRNA_copy_B-1 497
#3: 16S_rRNA_copy_C-1 624
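The grouping step depends on the list carrying the file names, which the reading code in the question does not set. Below is a base R sketch of the whole idea on in-memory data, using the first pair from the question. It assumes each list element is named after its file; with real files one could set that with something like names(lst) <- tools::file_path_sans_ext(basename(file_list)).

```r
# Stand-ins for two files already read in; the names play the role of file names.
lst <- list(
  "01_001_F08_S80_L009" = data.frame(
    ID   = c("16S_rRNA_copy_A-1", "16S_rRNA_copy_B-1", "16S_rRNA_copy_C-1"),
    freq = c(75, 86, 102)
  ),
  "01_001_F08_S80_L002" = data.frame(
    ID   = c("16S_rRNA_copy_A-1", "16S_rRNA_copy_B-1", "16S_rRNA_copy_C-1"),
    freq = c(98, 96, 101)
  )
)

# Files of one pair share the name once the trailing lane digits are removed.
groups <- split(lst, sub("\\d+$", "", names(lst)))

# Stack the tables of each pair and sum freq per ID.
summed <- lapply(groups, function(x) {
  aggregate(freq ~ ID, data = do.call(rbind, x), FUN = sum)
})

summed[["01_001_F08_S80_L"]]$freq
# [1] 173 182 203
```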
