How can I remove a certain part of row names in a data frame in R?

I have a data set with the following format:
ID                       Value
----------------------   -------
AAA1|404744              1.7554
ANKHD1-EIF4EBP3|404734   0.5174
HLA-B|3106               11.7659
HLA-A|3105               18.0851
What I want is to remove a certain part of the row names, like this:
ID                Value
---------------   -------
AAA1              1.7554
ANKHD1-EIF4EBP3   0.5174
HLA-B             11.7659
HLA-A             18.0851
Thanks a lot!

We can do this with sub. Match the | (| is a regex metacharacter meaning "or", so either escape it as \\| or place it in brackets, [|], to match the literal character) followed by any characters (.*), and replace the match with an empty string ("").
df$ID <- sub("[|].*", "", df$ID)
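For example, a quick check on the sample data (a minimal sketch, with the data frame rebuilt from the question):
df <- data.frame(
  ID = c("AAA1|404744", "ANKHD1-EIF4EBP3|404734", "HLA-B|3106", "HLA-A|3105"),
  Value = c(1.7554, 0.5174, 11.7659, 18.0851)
)
df$ID <- sub("[|].*", "", df$ID)
df$ID
#> [1] "AAA1"            "ANKHD1-EIF4EBP3" "HLA-B"           "HLA-A"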

Related

How to match two columns in one dataframe using values in another dataframe in R

I have two dataframes. One is a set of ≈4000 entries that looks similar to this:
| grade_col1 | grade_col2 |
| --- | --- |
| A-| A-|
| B | 86|
| C+| C+|
| B-| D |
| A | A |
| C-| 72|
| F | 96|
| B+| B+|
| B | B |
| A-| A-|
The other is a set of ≈700 entries that look similar to this:
| grade | scale |
| --- | --- |
| A+|100|
| A+| 99|
| A+| 98|
| A+| 97|
| A | 96|
| A | 95|
| A | 94|
| A | 93|
| A-| 92|
| A-| 91|
| A-| 90|
| B+| 89|
| B+| 88|
...and so on.
What I'm trying to do is create a new column that shows whether grade_col2 matches grade_col1, with a binary 0-1 output (0 = no match, 1 = match). Most of grade_col2 is recorded as a letter grade, but every once in a while an entry in grade_col2 was accidentally entered as a numeric grade instead. I want this match column to give me a "1" even when grade_col2 is a numeric grade instead of a letter grade. In other words, if grade_col1 is B and grade_col2 is 86, I want this to still be read as a match. Only when grade_col1 is F and grade_col2 is 96 would this not be a match (similar to when grade_col1 is B- and grade_col2 is D: not a match).
The second data frame gives me the information I need to translate between one and the other (entries between 97-100 are A+, between 93-96 are A, and so on). I just don't know how to run a script that uses this information to find matches through all ≈4000 entries. Theoretically, I could do this manually, but the real dataset is so lengthy that this isn't realistic.
I had been thinking of using nested if_else statements with dplyr, but once I got past the first "if" statement, I got stuck. I'd appreciate any help people can offer.
You can do this using a join.
Let your first dataframe be grades_df and your second dataframe be lookup_df, then you want something like the following:
library(dplyr)

output = grades_df %>%
  # join on the lookup, keeping everything in the grades table
  left_join(lookup_df, by = c(grade_col2 = "scale")) %>%
  # combine grade_col2 from grades_df and grade from lookup_df
  mutate(grade_col2b = ifelse(is.na(grade), grade_col2, grade)) %>%
  # indicator column
  mutate(indicator = ifelse(grade_col1 == grade_col2b, 1, 0))
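A minimal reproducible sketch on a slice of the question's data (the values are taken from the sample tables; note that the join columns must share a type, so scale is built as character here to match grade_col2):
library(dplyr)

grades_df <- data.frame(
  grade_col1 = c("A-", "B", "F"),
  grade_col2 = c("A-", "86", "96")
)
lookup_df <- data.frame(
  grade = c("B", "A"),
  scale = c("86", "96")  # character so it can join against grade_col2
)

grades_df %>%
  left_join(lookup_df, by = c(grade_col2 = "scale")) %>%
  mutate(grade_col2b = ifelse(is.na(grade), grade_col2, grade)) %>%
  mutate(indicator = ifelse(grade_col1 == grade_col2b, 1, 0))
#>   grade_col1 grade_col2 grade grade_col2b indicator
#> 1         A-         A-  <NA>          A-         1
#> 2          B         86     B           B         1
#> 3          F         96     A           A         0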

Problem with replacing a comma with a period

I am replacing commas with periods in a data.frame column:
data[,22] <- as.numeric(sub(",", ".", sub(".", "", data[,22], fixed=TRUE), fixed=TRUE))
But I have values that look like this: 110.00, 120.00, 130.00...
When replacing, I get the values 11000.0, 12000.0, 13000.0.
And I would like to get: 110.0, 120.0, 130.0...
Column 22 of my data.frame:
| n |
|--------|
| 92,5 |
| 94,5 |
| 96,5 |
| 110.00|
| 120.00|
| 130.00|
What I want to get:
| n |
|--------|
| 92.5 |
| 94.5 |
| 96.5 |
| 110.0|
| 120.0|
| 130.0|
or
| n |
|--------|
| 92.5 |
| 94.5 |
| 96.5 |
| 110.00|
| 120.00|
| 130.00|
Don't replace the periods, since those values are already in the format that you want. Replace only the commas with periods and convert the data to numeric.
data[[22]] <- as.numeric(sub(',', '.', fixed = TRUE, data[[22]]))
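For example, on the sample column (a quick check; x stands in for data[[22]]):
x <- c("92,5", "94,5", "96,5", "110.00", "120.00", "130.00")
as.numeric(sub(",", ".", x, fixed = TRUE))
#> [1]  92.5  94.5  96.5 110.0 120.0 130.0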
Using str_replace from stringr:
library(stringr)
data[[22]] <- as.numeric(str_replace(data[[22]], fixed(","), "."))
You can use gsub as below:
transform(
  df,
  n = as.numeric(gsub("\\D", ".", n))
)
where every non-digit character (i.e., "," or ".") is replaced by ".".
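A quick check on two of the sample values (note that \\D replaces every non-digit character, so this assumes the column holds only digits plus a single "," or "." separator):
gsub("\\D", ".", c("92,5", "110.00"))
#> [1] "92.5"   "110.00"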

List all strings appearing more than once in a file

I have a very large file (around 70GB), and I want to list all strings that appear more than once in the whole file.
I can list all the matches when I specify which string to search for in a file, but I want to list all strings that occur more than once.
For example, assuming my file looks like this:
+------+------------------------------------------------------------------+----------------------------------+--+
| HHID | VAL_CD64 | VAL_CD32 | |
+------+------------------------------------------------------------------+----------------------------------+--+
| 203 | 8c5bfd9b6755ffcdb85dc52a701120e0876640b69b2df0a314dc9e7c2f8f58a5 | 373aeda34c0b4ab91a02ecf55af58e15 | |
| 7AB | f6c581becbac4ec1291dc4b9ce566334b1cb2c85e234e489e7fd5e1393bd8751 | 2c4f97a04f02db5a36a85f48dab39b5b | |
| 7AB | abad845107a699f5f99575f8ed43e0440d87a8fc7229c1a1db67793561f0f1c3 | 2111293e946703652070968b224875c9 | |
| 348 | 25c7cf022e6651394fa5876814a05b8e593d8c7f29846117b8718c3dd951e496 | 5c80a555fcda02d028fc60afa29c4a40 | |
| 348 | 67d9c0a4bb98900809bcfab1f50bef72b30886a7b48ff0e9eccf951ef06542f9 | 6c10cd11b805fa57d2ca36df91654576 | |
| 348 | 05f1e412e7765c4b54a9acfd70741af545564f6fdfe48b073bfd3114640f5e37 | 6040b29107adf1a41c4f5964e0ff6dcb | |
| 4D3 | 3e8da3d63c51434bcd368d6829c7cee490170afc32b5137be8e93e7d02315636 | 71a91c4768bd314f3c9dc74e9c7937e8 | |
+------+------------------------------------------------------------------+----------------------------------+--+
And I want to list only the records whose HHID occurs more than once, i.e., 7AB and 348.
Any idea how I can implement this?
awk to the rescue:
awk -F'[ |]+' '
  $2 ~ /^[[:alnum:]]+$/ { count[$2]++ }
  END {
    for (hhid in count) {
      if (count[hhid] >= 2) {
        print hhid
      }
    }
  }
' file
-F'[ |]+' sets the field separator to any run of spaces and pipes, so $2 holds the HHID value.
$2 ~ /^[[:alnum:]]+$/ filters out the horizontal border lines (their $2 is empty). The header field HHID matches too, but since it occurs only once it is never printed.
count[$2]++ increases the value at $2, the string we’re counting. On the first occurrence this initialises the value to 1. On the second occurrence it increases it to 2, and so on.
END is run after all lines have been processed.
for (hhid in count) iterates over the strings in count.
if (count[hhid] >= 2) skips any <2 counts.
print hhid prints the string.
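Run against the sample file, this prints 7AB and 348, the two HHIDs that occur more than once. The order of the output lines is unspecified, since for (hhid in count) iterates in arbitrary order.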

Split a column string with delimiters into separate columns in Azure Kusto

I have a column 'Apples' in an Azure table that contains this string: "Colour:red,Size:small".
Current situation:
|-----------------------|
| Apples |
|-----------------------|
| Colour:red,Size:small |
|-----------------------|
Desired Situation:
|----------------|
| Colour | Size |
|----------------|
| Red | small |
|----------------|
Please help
I'll answer the question in the title, as I noticed many people searching for a solution.
The key here is the mv-expand operator (it expands multi-value dynamic arrays or property bags into multiple records):
datatable (str:string)["aaa,bbb,ccc", "ddd,eee,fff"]
| project splitted=split(str, ',')
| mv-expand col1=splitted[0], col2=splitted[1], col3=splitted[2]
| project-away splitted
The project-away operator lets us choose which columns from the input to exclude from the output.
Result:
+--------------------+
| col1 | col2 | col3 |
+--------------------+
| aaa | bbb | ccc |
| ddd | eee | fff |
+--------------------+
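The same pattern applies to the question's Apples column: split on "," first, then split each part on ":" to separate the names (Colour, Size) from their values.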
This query gave me the desired results:
| parse Apples with "Colour:" AppColour ", Size:" AppSize
Remember to include all the different delimiters preceding each word you want to extract, e.g. ", Size:". Mind the space in between.
This helped me; I then used it to customize the query to my needs:
https://learn.microsoft.com/en-us/azure/data-explorer/kusto/query/parseoperator

How can we transpose a table in Spotfire while keeping column names as row names?

My sample data is shown below. However, the original data is very large, so I cannot hardcode it.
+-----+-------+---------+
| IDN | NAME | VALUE |
+-----+-------+---------+
| 121 | test | 1254.25 |
| 152 | testa | 1585.25 |
| 587 | testb | 5878.69 |
+-----+-------+---------+
After the transpose function:
+---------+---------+---------+
| 121 | 152 | 587 |
+---------+---------+---------+
| test | testa | testb |
| 1254.25 | 1585.25 | 5878.69 |
+---------+---------+---------+
Expected:
+-------+---------+---------+---------+
| IDN | 121 | 152 | 587 |
+-------+---------+---------+---------+
| NAME | test | testa | testb |
| VALUE | 1254.25 | 1585.25 | 5878.69 |
+-------+---------+---------+---------+
I was using the t() function in Spotfire, but in the resulting data table I am missing the column names as row names. Is there any way to keep them?
You can do this with Unpivot and Pivot.
Insert > Transformation
Add the Unpivot with the settings below and hit OK.
Add the Pivot with the settings below.
Change the column name for column 1 with Edit > Column Properties.
Here are the data table settings:
Add the transformations:
a. Unpivot
   Add columns to pass through: IDN
   Add columns to transform: NAME, VALUE
   Category column name: Column
   Select category column data type: String
   Value column name: Value
   Select value column data type: String
   Select 'Include null values'
b. Pivot
   Choose row identifiers: Column
   Choose value columns and aggregation methods: Concatenate(Value)
   Choose column titles: IDN
   Column naming pattern: %M(%V) for %C
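With these settings, the Unpivot turns NAME and VALUE into values of the Column field, and the Pivot then spreads each IDN into its own column, producing the expected layout.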
