mutate function in R

I am trying to add a column from one data frame to another. The data is long repeated-measures data, with each ID having two rows. Both my main dataset (d) and my secondary dataset (d2) use the same column (ID) to link cases to participants. However, when I use mutate like this: d <- mutate(d, x = d2$x), the column is added to the data frame but the values are assigned by row position rather than matched by ID. This means that data gets mixed up between participants.
Is there a way to make sure that the values are referenced by ID when I add the column?
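One way to match values by key rather than by row position is a join. The sketch below uses base R's merge() on toy data. It assumes a second key column, here hypothetically named time, that tells the two rows of each ID apart; the question does not name such a column, and without one a join on ID alone would pair every row of an ID in d with every row of that ID in d2.

```r
# Toy data: two rows per ID. `time` is a hypothetical column (not named in the
# question) that distinguishes the two repeated measures of each ID.
d  <- data.frame(ID = c(1, 1, 2, 2), time = c(1, 2, 1, 2), y = c(10, 11, 20, 21))
d2 <- data.frame(ID = c(2, 2, 1, 1), time = c(1, 2, 1, 2), x = c("c", "d", "a", "b"))

# Join on both keys so each value of x follows its own ID and time point,
# regardless of the row order in d2. all.x = TRUE keeps every row of d.
d_joined <- merge(d, d2, by = c("ID", "time"), all.x = TRUE)
```

With dplyr loaded, left_join(d, d2, by = c("ID", "time")) should perform the same kind of keyed match.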

Related

Dividing columns in R

I have the following dataset and I want to divide the "Dose" and "Count" columns by 100. I want my new dataset to include the "Visit" column as well.
However, using the code below, my new dataset does not have the "Visit" column.
What if I want to divide only the "Dose" column by 100?
I uploaded my dataset into R and called it data.
data_new=data[,2:ncol(data)]/100
The untouched columns are missing from your new data frame because you assign only the selected columns (the ones you divided) to data_new, so nothing else is carried over. To divide only the "Dose" column, assign the result back to that column in place:
data$Dose <- data$Dose / 100
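To divide both columns while keeping "Visit", one option is to copy the data frame and overwrite only the selected columns. This is a minimal sketch on made-up values; the column names come from the question, but the data itself is an assumption.

```r
# Toy stand-in for the asker's data (values are invented for illustration)
data <- data.frame(Visit = c("V1", "V2"), Dose = c(100, 250), Count = c(300, 50))

# Copy first so the original stays untouched, then replace only the chosen columns
data_new <- data
cols <- c("Dose", "Count")
data_new[cols] <- data_new[cols] / 100
```

Because only the columns listed in cols are reassigned, every other column (here Visit) is left as it was.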

Is there an R function for isolating a set of data from a larger dataset by filtering for one attribute and working independently with the data?

I am working with a large dataset and trying to isolate certain groups of data points by their column 1 values. I want to be able to work independently with each of the groups without knowing in advance how many groups there are or what they are called. The image is a sample of the dataset. Is there a function that creates an object for each unique group, where the groups are defined by the unique values in the clade column?
df <- read.csv(file.choose(), header=TRUE)
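One possible approach, sketched here on toy data, is base R's split(), which returns a named list with one data frame per unique value of the grouping column. The clade column name is taken from the question; the value column and the data are hypothetical.

```r
# Toy stand-in for the CSV (the `value` column is invented for illustration)
df <- data.frame(clade = c("A", "B", "A", "C"), value = c(1, 2, 3, 4))

# One data frame per unique clade, without knowing the group names in advance
groups <- split(df, df$clade)

# Work with each group independently, e.g. compute a per-group mean
group_means <- sapply(groups, function(g) mean(g$value))
```

Individual groups can then be accessed by name, such as groups$A or groups[["A"]], or processed in a loop over names(groups).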

Question related to table() function in r

I have a very large dataset in which the state column repeats each state name for every latitude and longitude that the state covers. I want to find the order in which the states appear (the data frame is too big to display all the names) so that I can add another column of values in the order corresponding to the state names. inner_join doesn't work and says that it cannot assign a variable of size 122.3Gb. I wanted to use the table() function, but it returns the values sorted alphabetically rather than in the order in which the state names appear in the data frame. What can I do?
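A possible approach, sketched on toy data: unique() returns values in order of first appearance, and passing that order as factor levels makes table() follow it instead of sorting alphabetically. The lookup table and its value column are hypothetical; match() is shown as one way to attach a value per state without a join.

```r
# Toy stand-in for the large data frame
df <- data.frame(state = c("Texas", "Texas", "Ohio", "Alaska", "Ohio"))

# Order of first appearance, not alphabetical
state_order <- unique(as.character(df$state))

# table() follows the factor levels, so set them to the appearance order
counts <- table(factor(df$state, levels = state_order))

# Hypothetical lookup table with one row per state
lookup <- data.frame(state = c("Alaska", "Ohio", "Texas"), value = c(10, 20, 30))

# match() finds each row's position in the lookup, so values line up by name
df$value <- lookup$value[match(df$state, lookup$state)]
```

This relies on the lookup having exactly one row per state; duplicated keys in a lookup are a common cause of joins that grow far beyond the size of the inputs.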

How do I merge 2 data frames on R based on 2 columns?

I am looking to merge 2 data frames based on 2 columns in R. The two data frames are called popr and droppedcol, and they share the same 2 variables, USUBJID and TRTAG2N, which are the variables that I want to combine the 2 data frames by.
The merge function works when I am only trying to do it based off of one column:
merged <- merge(popr,droppedcol,by="USUBJID")
When I attempt to merge by using 2 columns and view the data frame "Duration", the table is empty and there are no values, only column headers. It says "no data available in table".
I am tasked with replicating the SAS code for this in R:
data duration;
set pop combined1 ;
by usubjid trtag2n;
run;
In R, I have tried the following:
duration<- merge(popr,droppedcol,by.x="USUBJID","TRTAG2N",by.y="USUBJID","TRTAG2N")
duration <- full_join(popr,droppedcol,by = c("USUBJID","TRTAG2N"))
duration <- merge(popr,droppedcol,by = c("USUBJID","TRTAG2N"))
I would like to see a data frame with the columns USUBJID, TRTAG2N, TRTAG2, and FUDURAG2, sorted by first FUDURAG2 and then USUBJID.
Per the SAS documentation, Combining SAS Data Sets, and as confirmed by the SAS guru @Tom in the comments above, set with by simply means you are interleaving the datasets. No merge (which, by the way, is also a SAS statement that you do not use) is taking place:
Interleaving uses a SET statement and a BY statement to combine
multiple data sets into one new data set. The number of observations
in the new data set is the sum of the number of observations from the
original data sets. However, the observations in the new data set are
arranged by the values of the BY variable or variables and, within
each BY group, by the order of the data sets in which they occur. You
can interleave data sets either by using a BY variable or by using an
index.
Therefore, the best translation of set without by in R is rbind(), and set with by is rbind + order (on the rows):
duration <- rbind(pop, combined1) # STACK DFs
duration <- with(duration, duration[order(usubjid, trtag2n),]) # ORDER ROWS
Note, however, that base rbind does not allow unmatched columns between the concatenated data sets. Third-party packages do allow unmatched columns, including plyr::rbind.fill, dplyr::bind_rows, and data.table::rbindlist (with fill = TRUE).
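As a self-contained illustration of the interleave, here is a small sketch on toy data. The src column is hypothetical and only marks which data set each row came from; the real pop and combined1 are not shown in the question.

```r
# Toy stand-ins for the two data sets (values invented for illustration)
pop       <- data.frame(usubjid = c("S1", "S3"), trtag2n = c(1, 1), src = "pop")
combined1 <- data.frame(usubjid = c("S2", "S1"), trtag2n = c(2, 1), src = "combined1")

# Stack the rows: the result has nrow(pop) + nrow(combined1) rows
duration <- rbind(pop, combined1)

# Order by the BY variables. order() leaves ties in their original order,
# so within a BY group the rows from pop stay ahead of those from combined1
duration <- duration[order(duration$usubjid, duration$trtag2n), ]
```

Unlike a merge, no rows are matched or dropped here: every observation from both inputs appears exactly once in the result.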

Extracted data frame selection still contains entries from full data frame set

I have a data frame (originally from a CSV file) with the columns NAME and YEAR. I have extracted a sample from this data frame of the first ten entries like so:
sample <- df[1:10, ]
I want to know the frequency of the values in the NAME column so I input the following:
as.data.frame(table(sample$NAME))
This counts the frequency in the sample correctly but also includes every name from the original data frame in the 'Var1' column (all with a Freq of 0).
The same thing happens if I use unique(sample$NAME) as well: it lists the names from the sample along with all of the names from the original data frame as well.
What am I doing wrong?
This could be a case of unused levels in the 'NAME' factor column: subsetting a factor keeps all of its original levels, even those that no longer occur. We can use droplevels, or call factor again, to remove the unused levels.
as.data.frame(table(droplevels(sample$NAME)))
Or
as.data.frame(table(factor(sample$NAME)))
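The effect can be reproduced on a small toy data frame. This sketch assumes NAME is stored as a factor, which is what read.csv produced by default before R 4.0; the names and years are invented.

```r
# Toy data with NAME as a factor (three levels: Ann, Bob, Cy)
df <- data.frame(NAME = factor(c("Ann", "Bob", "Cy", "Ann")),
                 YEAR = c(2001, 2002, 2003, 2004))

# Taking the first two rows keeps all three levels on the factor
first_rows <- df[1:2, ]
n_levels_before <- nlevels(first_rows$NAME)

# Dropping unused levels leaves only the names present in the subset
freq <- as.data.frame(table(droplevels(first_rows$NAME)))
```

Calling droplevels(first_rows) on the whole data frame would clean every factor column at once.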
