Lab 8 Imaging Informatics
This lab is graded out of 20 points instead of 10 points.
Please follow these instructions to acquire the necessary dataset and metadata for your work:
Visit the International Skin Imaging Collaboration (ISIC) at the following URL: https://www.isic-archive.com/.
Navigate to the “Benchmark Datasets” section.
Locate the “Challenge 2016: Test” dataset and download it.
Be sure to download both the image dataset and the corresponding metadata. The metadata may be available as a separate file or included within the zip file containing the images. For our purposes, obtain the metadata as a separate file, even if it is also included in the zip file.
We will utilize both the zip file of images and the separate metadata file in our upcoming tasks.
Make sure you have sufficient storage space and a stable internet connection before downloading the files, as they can be quite large.

Save the downloaded dataset and metadata files to a known location on your computer.
Choose a name for the directory where you will store the dataset. For example, let’s name the directory “imaging informatics”.
Move the zip file containing the images and the separate metadata file into the directory you named in the previous step.
Use the provided R script to set up your working directory, install the necessary libraries, unzip the images, and perform the analysis. Then answer the following questions:
Q1: What observations can you make about the sample images displayed? Take a screenshot.

Q2: What is metadata, and what is the significance of having metadata in imaging informatics?

Q3: Why do we often resize and convert images to grayscale before analysis?
Q4: Display the grayscale versions of the third and fifth images and take a screenshot.
Q5: What features are we extracting here and why are they important?
Q6: How does ‘lapply’ help in applying a function to a list of items?
Q7: After merging the feature data with the metadata using the image ID as a key, examine the output of head(combined_data). Can you confirm if the merge was successful by checking for consistency in the image IDs between the metadata and image_features dataframes? How would a discrepancy in these identifiers affect the validity of our subsequent data analysis?
Q8: What does a histogram tell us about the distribution of a single variable? (Take a screenshot as well.)
Q9: Identify the diagnosis category with the highest variability in mean intensity. Discuss how this variability might affect diagnostic accuracy or the development of automated diagnostic algorithms.
Q10: Analyze the boxplot comparing benign and malignant cases in terms of mean intensity. How could the observed differences assist in distinguishing between benign and malignant lesions using image analysis techniques?
Q11: What does the boxplot suggest about the variability of pixel intensity in benign versus malignant skin lesion images, and how might this information be valuable in the context of medical image analysis?
Q12: Analyze the scatter plot displaying the relationship between age and mean intensity of the skin lesions. Do you observe any trend or pattern that indicates a correlation between the two variables? How might this analysis be significant in the field of dermatological research or diagnostics?
Q13: Report the correlation coefficient and the p-value obtained from the Pearson correlation test between age and mean intensity. What inference can you make about the linear relationship between the age of an individual and the mean intensity of their skin lesions based on these results? Discuss the potential clinical implications of the correlation findings. Consider the relevance of age as a predictive factor for changes in lesion intensity. How might these results influence screening or diagnostic approaches in a clinical setting?
Q14: Interpret the F value and p-value from the ANOVA test. What do these values indicate about the differences in mean intensities across the defined age groups? Considering that some observations were deleted due to missingness, discuss the potential impact of this missing data on the ANOVA results. How might this affect the validity of the study conclusions regarding age-related differences in lesion intensity?

Q15: Which anatomical site appears to have the highest variety of diagnoses? And which one has the highest count of a single diagnosis? Discuss how the distribution of diagnoses across anatomical sites could inform targeted screening or preventive strategies in dermatological practice.
Q16: What does the range of intensities within each gender category indicate about the variability of mean intensity across different genders? Identify any notable features in the data distribution for each gender, such as outliers or extreme values. What might be the implications of these observations for clinical analysis?
Q17: Why is it important to convert the ‘benign_malignant’ variable into a factor before analysis?
Q18: What is the impact of removing rows with NA values on our dataset and subsequent analysis?
Q19: Explain the significance of setting a seed before sampling. How does this affect the reproducibility of our results?
Q20: What is the purpose of using a logistic regression model in this context? Describe the outcome we are trying to predict and the predictors we are using.
Q21: From the model summary, identify which features are significant predictors of the outcome and explain what the coefficients represent.
Q22: Interpret the confusion matrix. What does it tell you about the model’s performance in terms of correctly classifying benign and malignant cases?
Q23: Compare the performance of the Random Forest model to the logistic regression model. Which one performs better and why do you think this is the case?
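
Note on setup: the preparation steps described before the questions (a named directory holding the image zip file and the separate metadata file, followed by unzipping the images) could look roughly like the sketch below. This is only an illustration and is not the provided lab script. The file names "ISIC_2016_Test.zip" and "metadata.csv", the "images" subfolder, and the helper unzip_images() are hypothetical; replace them with the names of the files you actually downloaded.

```r
# Minimal sketch of the setup steps, using base R only.
# All names below are assumptions for illustration, not part of the lab script.

# Unzip the image archive stored in lab_dir and return the image file paths.
# Returns an empty character vector (with a warning) if the zip file is missing.
unzip_images <- function(lab_dir, zip_file, out_dir = "images") {
  zip_path <- file.path(lab_dir, zip_file)
  if (!file.exists(zip_path)) {
    warning("Zip file not found: ", zip_path)
    return(character(0))
  }
  target <- file.path(lab_dir, out_dir)
  dir.create(target, showWarnings = FALSE, recursive = TRUE)
  unzip(zip_path, exdir = target)
  list.files(target, pattern = "\\.(jpg|jpeg|png)$",
             recursive = TRUE, full.names = TRUE, ignore.case = TRUE)
}

# Example usage (uncomment and adjust the hypothetical names):
# lab_dir <- file.path(path.expand("~"), "imaging informatics")
# setwd(lab_dir)
# image_files <- unzip_images(lab_dir, "ISIC_2016_Test.zip")
# metadata <- read.csv(file.path(lab_dir, "metadata.csv"))
# length(image_files)  # number of images found after unzipping
```

Checking that the zip file exists before unzipping gives a clear warning when the file was saved somewhere other than the lab directory.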