Recoding variables in vector based on values in different vector - r

Complete R novice here.
I have wide form data frame which includes a vector/variable for participant_number, with each participant providing two responses (score), with a within-subjects manipulation (code).
enter image description here
However, I have three separate sets of values which corresponded to the participant numbers in three different (between subjects) experimental groups (e.g. control, active_1, active_2).
enter image description here
How can I use these sets of values to create a variable in my main data frame which indicates what experimental group the participant belongs to?
Any help, much appreciated.

The package "dplyr" is quite useful for these kind of things. Let's consider a small working example
df <- data.frame(ID=c(1:7))
ListActive1 <- c(1,3)
ListActive2 <- c(2,5)
ListControl <- c(4,7,6)
df is the main data frame containing the ID of the participant (and of course it may have further columns, e.g. the score etc.) The three vectors contain for each group the IDs of the participants belonging to this particular group, e.g. the participants with ID 2 and 5 belong to the group "Active2".
Now we create a new column in the main data frame using the command mutate which comes with the dplyr package (make sure to install and load it).
df <- mutate(df,group=case_when(
ID %in% ListActive1 ~ "Active1",
ID %in% ListActive2 ~ "Active2",
ID %in% ListControl ~ "Control"))
The command case_when checks for each participant in which of the lists the ID appears and then puts the corresponding label in the new column group.
ID group
1 1 Active1
2 2 Active2
3 3 Active1
4 4 Control
5 5 Active2
6 6 Control
7 7 Control

Related

Creating horizontal dataframe from vertical table (with repeated variables)

Need to create usable dataframe using R or Excel
Variable1
ID
Variable2
Name of A person 1
002157
NULL
Drugs used
NULL
3.0
Days in hospital
NULL
2
Name of a surgeon
NULL
JOHN T.
Name of A person 2
002158
NULL
Drugs used
NULL
4.0
Days in hospital
NULL
5
Name of a surgeon
NULL
ADAM S.
I have a table exported from 1C (accounting software). It contains more than 20 thousand observations. A task is to analyze: How many drugs were used and how many days the patient stayed in the hospital.
For that reason, I need to transform the one dataframe into a second dataframe, which will be suitable for doing analysis (from horizontal to vertical). Basically, I have to create a dataframe consisting of 4 columns: ID, drugs used, Hospital stay, and Name of a surgeon. I am guessing that it requires two functions:
for ID it must read the first dataframe and extract filled rows
for Name of a surgeon, Drugs used and Days in hospital the function have to check that the row corresponds to one of that variables and extracts date from the third column, adding it to the second dataframe.
Shortly, I have no idea how to do that. Could you guys help me to write functions for R or tips for excel?
for R, I guess you want something like this:
load the table, make sure to substitute the "," with the separator that is used in your file (could be ";" or "\t" for tab etc.).
df = read.table("path/to/file", sep=",")
create subset tables that contain only one row for the patient
id = subset(df, !is.null(ID))
drugs = subset(df, Variable1 %in% "Drugs used")
days = subset(df, Variable1 %in% "Days in hospital")
#...etc...
make a new data frame that contains these information
new_df = data.frame(
id = id$ID,
drugs = drugs$Variable2,
days = days$Variable2,
#...etc...no comma after the last!
)
EDIT:
Note that this approach only works if your table is basically perfect! Otherwise there might be shifts in the data.
#=====================================================
EDIT 2:
If you have an imperfect table, you might wanna do something like this:
Step 1.5) , change all NA-values (which in you table is labeled as NULL, but I assume R will change that to NA) to the patient ID. Note that the is.na() function in the code below is specifically for that, and will not work with NULL or "NULL" or other stuff:
for(i in seq_along(df$ID)){
if(is.na(df$ID[i])) df$ID[i] <- df$ID[i-1]
}
Then go again to step 2) above (you dont need the id subset though) and then you have to change each data frame a little. As an example for the drugs and days data frames:
drugs = drugs[, -1] #removes the first column
colnames(drugs) = c("ID","drugs") #renames the columns
days = days[, -1]
colnames(days) = c("ID", "days")
Then instead of doing step 3 as above, use merge and choose the ID column to be the merging column.
new_df = merge(drugs, days, by="ID")
Repeat this for other subsetted data frames:
new_df = merge(new_df, surgeon, by="ID")
# etc...
That is much more robust and even if some patients have a line that others dont have (e.g. days), their respective column in this new data frame will just contain an NA for this patient.

How do I make a variable that represents the row number, when rows are not sorted in order?

Pic shows the row number order
I am trying to add a variable to my data set that represents the row number; however every code I've found adds them in order as the rows are currently (1,2,3,4,5), rather than in the order the View option shows (129, 98, 21, 09). I need the order shown in the View option, as I am trying to merge with a another data set, and need the correct ("original row number").
I cannot add row numbers before making changes to the data set as the function doesn't work when I add the ID number.
Alternatively, being able to sort the data by row number would also help, but I don't know how to do that either (clicking on the arrow above the row number does nothing).
A bit of context
I am classifying network nodes in R. I made a matrix from the networks nodes and edges (using nodes2vec), and have to merge this matrix with nodes labels data set (this data set contains one variable which shows if nodes are positive or negative). The picture above shows the created matrix, and the original row numbers from the network data set are no longer in the original order. I need to add a variable to the matrix, that I converted to a data frame using:
netdf1 <- as.data.frame(network.node2vec)
that represents the original row number
what I tried
netdf1 <- netdf1 %>% mutate(id = row_number())
This just adds the row number as the rows are currently ordered so 1,2,3,4...
WHAT WORKED IN THE END == CORRECT ANSWER
db$ID <- rownames(db)
If I do understand your question right you have some kind of dataframe with row names that are not continuus? And now you want to have these row names in an extra column as numeric values?
You can use the row.names()-function and can convert them to numeric if you like:
# just creating a DF that might show what you mean:
testDF <- data.frame(x = 1:10, y = sample((1:1000), 10))
testDF <- testDF[testDF$y < 500,]
View(testDF)
# one possible way to get the row names
testDF$rowNum <- as.numeric(row.names(testDF))
And try to type ?sort to the console if you like to learn something about sorting vectors.
Let's say you have a data frame with row names that are out of order:
my_data <- data.frame(row.names = 5:1,
V1 = 1:5)
#> my_data
# V1
#5 1
#4 2
#3 3
#2 4
#1 5
dplyr::row_number() will add row numbers based on the current sorting, not based on the row names. (A general practice in the tidyverse is to eschew keeping useful data in the row names and to instead incorporate any sorts of row ID info into a variable.)
So you could use #user2554330's advice and add my_data$ID <- row.names(my_data) or the tidyverse equivalent of my_data %>% tibble::rownames_to_column(var = "ID"), then sort by that column.
my_data %>%
tibble::rownames_to_column(var = "ID") %>%
arrange(ID)
ID V1
1 1 5
2 2 4
3 3 3
4 4 2
5 5 1

Is there an R function where I can get the names within a specific column in my dataset

Edit: using the aid from one of the users, I was able to use "table(ArrestData$CHARGE)", yet, since there are over 2400 entries, many of the entries are being omitted. I am looking for the top 5 charges, is there code for this? Additionally, I am looking at a particular council district (which is another variable titled "CITY_COUNCIL_DIST"). I want to see which are the top 5 charges given out within a specific council district. Is there code for this?
Thanks for the help!
Original post follows
Just like how I can use "names(MyData)" to see the names of my variables, I am wondering if I can use a code to see the names/responses/data points of a specific column.
In other words, I am attempting to see the names in my rows for a specific column of data. I would like to see what names are cumulatively being used.
After I find this, I would like to know how many times each name within the rows is being used, whether thats numeric or percentage. After this, I would like to see how many times each name within the rows is being used with the condition that it meets a numeric value of another column/variable.
Apologies if this, in any way, is confusing.
To go further in depth, I am playing around with the Los Angeles Police Data that I got via the Office of the Mayor's website. From 2017-2018, I am attempting to see what charges and the amount of each specific charge were given out in Council District 5. CHARGE and CITY_COUNCIL_DIST are the two variables I am looking at.
Any and all help will be appreciated.
To get all the distinct variables, you can use the unique function, as in:
> x <- c(1,1,2,3,3,4,5,5,5,6)
> unique(x)
[1] 1 2 3 4 5 6
To count the number of distinct values you can use table, as in:
> x <- c(1,1,2,3,3,4,5,5,5,6)
> table(x)
x
1 2 3 4 5 6
2 1 2 1 3 1
The first row gives you the distinct values and the second row the counts for each of them.
EDIT
This edit is aimed to answer your second question following with my previous example.
In order to look for the top five most repeated values of a variable we can use base R. To do so, I would first create a dataframe from your table of frequencies:
df <- as.data.frame(table(x))
Having this, now you just have to order the column Freq in descending order:
df[order(-df$Freq),]
In order to look for the top five most repeated values of a variable within a group, however, we need to go beyond base R. I would use dplyr to create an augmented dataframe with frequencies for each value of the variable of interest, let it be count_variable:
library(dplyr)
x_or <- x %>%
group_by(group_variable, count_variable) %>%
summarise(freq=n())
where x is your original dataframe, group_variable is the variable for your groups and count_variable is the variable you want to count. Now, you just have to order the object in a way you get the frequencies of your count_variable ordered by group_variables:
x_or %>%
arrange(group_variable, count_variable, freq)

Copy Unique values from one data frame to another in R

I have two data frames with totally different column names and values.
Example :
Data Frame 1 ->
company value
A 10
B 11
A 9
Data Frame 2 ->
id value2
Q 7
W 8
E 9
This question has several parts that I want to achieve:
Extract the unique values of COMPANY column from
data frame 1 based on the COMPANY column(Unique companies)
Copy the unique values obtained above into a NEW
COLUMN in Data Frame 2 RANDOMLY (only company field)
Merge the two data frames based on the unique value
column.(This is only for testing, hence why I need this step)
All help is appreciated!!
Thank you in advance.
You could try something like this:
company <- unique(df1$company)
df2$new_column <- sample(company, nrow(df2), replace = TRUE)

dplyr to reference two data frame (summarize function) in R

I created a data frame from a data set with unique marketing sources. Let's say I have 20 unique marketing sources in this new data frame D1. I want to add another column that has the count of times this marketing source was in my original data frame. I'm trying to use the dplyr package but not sure how to reference more than one data frame.
original data has 16000 observations
new data frame has 20 observations as there are only 20 unique marketing sources.
How to use summarize in dplyr to reference two data frames?
My objective is to find the percentage of marketing sources.
My original data frame has two columns: NAME, MARKETING_SOURCE
This data frame has 16,000 observations and 20 distinct marketing sources (email, event, sales call, etc)
I created a new data frame with only the unique MARKETING_SOURCES and called that data frame D1
In my new data frame, I want to add another column that has the number of times each marketing source appeared in the original data frame.
My new Data frame should have two columns: MARKETING_SOURCE, COUNT
I don't know if you need to use dplyr for something like this...
First let's create some data.frames:
df1 <- data.frame(source = letters[sample(1:26, 400, replace = T)])
df2 <- data.frame(source = letters, count = NA)
Then we can use table() to get the frequencies:
counts <- table(df1$source)
df2$count <- counts
head(df2)
source count
1 a 10
2 b 22
3 c 12
4 d 17
5 e 18
6 f 18
UPDATE:
In response to #MrFlick's wise comment below, you can use take the names() of the output from table() to ensure order is preserved:
df2$source <- names(counts)
Certainly not quite as elegant and would be even less elegant if df2 had other columns. But sufficient for the simple case presented above.

Resources